XML

From Elvanör's Technical Wiki

XML is a general-purpose markup format. However, mastering all aspects of XML manipulation is not as easy as it seems: there are many libraries, tools and concepts to understand.

General Concepts of XML

Tree of nodes

  • If a parent node contains text with an inline element inside it, the actual DOM tree will contain 3 child nodes: two text nodes and the inline element node. The order of the nodes in the tree corresponds to the document order (i.e., the one you would expect).
 <div>This is <span>some text</span> that I like.</div>

The 3 nodes will be: a text node containing "This is " (with its trailing space), the span element, and a text node containing " that I like.". The span element itself contains a text node.
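The node structure described above can be checked with the W3C DOM API that ships with the JDK. This is only a minimal sketch: the class name MixedContentDemo and the helper describeChildren are made up for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class MixedContentDemo {

    // Returns one description per direct child of the root element.
    static List<String> describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.TEXT_NODE) {
                    result.add("text[" + child.getNodeValue() + "]");
                } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                    result.add("element[" + child.getNodeName() + "]");
                }
            }
            return result;
        } catch (Exception e) {
            // Parsing problems are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>This is <span>some text</span> that I like.</div>";
        describeChildren(xml).forEach(System.out::println);
    }
}
```

Running main lists the two text nodes and the span element in document order. The text node inside span is not listed, because only the direct children of div are visited.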

Validation

  • Five characters may need to be escaped in XML: < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos;). Most of the time, however, the apostrophe and the double quote are legal as they are, and a validating parser will accept the document. They only cause problems inside an attribute value that is delimited by the same character. In practice, & and < always need to be escaped in text and attribute values.
  • Eclipse can automatically validate an XML document if its schema or DTD is declared, even when it is hosted on a remote HTTP server: Eclipse will fetch it automatically. This is very powerful.
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 
	http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">
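Outside of Eclipse, the same kind of check can be run from Java with the javax.xml.validation package. The sketch below is only an illustration: it uses a tiny inline schema invented for the example (a single note element holding a string) rather than the XLIFF schema, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
class SchemaValidationDemo {

    // Toy schema made up for this example: <note> contains only a string.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/>"
        + "</xs:schema>";

    // Returns true if the document is well-formed and valid against XSD.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Thrown for malformed XML as well as for schema violations.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unescaped quotes and apostrophes in text are accepted.
        System.out.println(isValid("<note>Tom's \"quoted\" text</note>"));
        // A bare ampersand makes the document malformed.
        System.out.println(isValid("<note>a & b</note>"));
    }
}
```

The two calls in main also illustrate the escaping rules: quotes and apostrophes in element text pass validation, while an unescaped & is rejected.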

XSLT

  • XSLT allows you to perform generic XML transformations. The transformation itself is written in a markup language, since an XSLT stylesheet is itself an XML document.
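As a rough illustration, the JDK can run an XSLT 1.0 stylesheet through the javax.xml.transform API. The stylesheet, the list/item input format and the class name below are all invented for this sketch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of any library.
class XsltDemo {

    // Made-up stylesheet: writes the text of each <item>, separated by ", ".
    static final String XSLT =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/list'>"
        + "<xsl:for-each select='item'>"
        + "<xsl:value-of select='.'/>"
        + "<xsl:if test='position() != last()'>, </xsl:if>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet above to the given XML and returns the result.
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<list><item>a</item><item>b</item></list>"));
    }
}
```

Note that the stylesheet is plain XML: it could be parsed, validated or generated with the same tools as any other XML document.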

XPath

  • XPath is a general query language for XML.
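A short sketch of an XPath query from Java, using the javax.xml.xpath API included in the JDK. The class and method names are made up for the example, and the input is the div snippet from the "Tree of nodes" section.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class XPathDemo {

    // Evaluates an XPath expression against an XML string and
    // returns the result converted to a string.
    static String evaluate(String expression, String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>This is <span>some text</span> that I like.</div>";
        // Text content of the span element.
        System.out.println(evaluate("/div/span", xml));
        // Second text node that is a direct child of div.
        System.out.println(evaluate("/div/text()[2]", xml));
    }
}
```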

Namespaces

XML Processing in Groovy

  • In Groovy you have access to a wide range of XML processing options, since you have the Groovy-specific libraries plus all the Java libraries underneath.

XmlParser

  • XmlParser parses an XML document and returns a tree of nodes. These nodes are not standard W3C DOM objects but generic Groovy node objects (groovy.util.Node), so you cannot call advanced DOM methods on them. For simple purposes this may be enough. XmlParser also lets you update the tree.
  • Beware that XmlParser is namespace aware by default. If your XML document declares a namespace, you cannot access the children of a node in the usual Groovy way:
myNode.child // this will not find the child element of myNode
myNode['child'] // this won't work either

def ns = new groovy.xml.Namespace('urn:oasis:names:tc:xliff:document:1.2', "ns")

myNode[ns.child] // this will work: you need to use the namespace or declare it default for the document

You can disable namespace processing if you want, though, for example via the XmlParser(boolean validating, boolean namespaceAware) constructor.

XmlSlurper

  • XmlSlurper should also be reserved for simple cases, since it is not fully compliant. If an inline element sits inside text, for example, the resulting tree does not seem to be the standard one (i.e., text nodes and element nodes cannot be mixed).
  • XmlSlurper is not namespace aware by default.

DOMCategory

  • DOMCategory allows you to manipulate W3C DOM node objects, but with the usual Groovy syntax enhancements. This can be very useful: you have the W3C standard DOM API at your disposal but still benefit from Groovy enhancements.

XML processing in Java

  • First of all, the JDK supports the standard W3C DOM interfaces. This does not mean that every part of that standard is implemented (included) in the JDK, and which parts are implemented is not entirely clear to me. Note, however, that the Load & Save interfaces (package org.w3c.dom.ls) have been part of the JDK API since Java 5.
  • Interfaces not implemented may be added via additional libraries (jars), for example Apache Xerces.

W3C DOM

  • This is the standard API. It is very powerful but quite complex and its design implies that you work at a low level (dealing with nodes in the tree directly).
  • To obtain an object implementing a W3C interface, use the following code:
DOMImplementationLS underlyingImplementation = (DOMImplementationLS) document.getImplementation().getFeature("LS", "3.0");
LSSerializer serializer = underlyingImplementation.createLSSerializer();

For example, given a W3C DOM document named document, this gives you a Load & Save serializer object.
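To show the serializer in context, here is a self-contained sketch that builds a small document and writes it to a string. The class name, the helper methods and the greeting element are invented for the example; the "xml-declaration" parameter is a standard DOM Level 3 configuration parameter, switched off here only to keep the output short.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical demo class, not part of any library.
class LoadSaveDemo {

    // Serializes a W3C DOM document to a string with a Load & Save serializer.
    static String serialize(Document document) {
        DOMImplementationLS ls = (DOMImplementationLS)
                document.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        // Leave out the <?xml ...?> declaration.
        serializer.getDomConfig().setParameter("xml-declaration", false);
        return serializer.writeToString(document);
    }

    // Builds a one-element sample document containing an ampersand.
    static Document buildSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("greeting");
            root.setAttribute("lang", "en");
            root.appendChild(doc.createTextNode("Tom & Jerry"));
            doc.appendChild(root);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildSample()));
    }
}
```

The text node is created with a raw ampersand, and the serializer takes care of writing it as &amp; in the output.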

DOM4J

  • This library can process W3C DOM objects. However, it has its own API for dealing with XML, which is higher level than the W3C one.
  • To use XPath you must provide the Jaxen library jar file.
  • This library is quite old and outdated (it does not support generics in collections). Its two main advantages are its ability to deal with W3C objects and its speed.

JDOM

  • Not tried yet.

XOM

  • Not tried yet. According to the website, the focus of this library is on conformance and correct implementation of the XML specifications.

Escaping strings for XML

  • Apache commons-lang contains an escapeXml() method in the org.apache.commons.lang.StringEscapeUtils class. However, I don't like it, since it currently escapes non-ASCII Unicode characters too. The encoding is simple enough that you can easily write the method yourself.
  • In any case, you should rarely need to escape an XML string manually. It is best to leave all the XML writing / processing work to your XML library.
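For the rare cases where manual escaping is needed, a hand-written method could look like the sketch below. It only replaces the five special characters and leaves every other character, including non-ASCII ones, untouched. The class name is hypothetical, and this is not the commons-lang implementation.

```java
// Hypothetical demo class, not part of any library.
class XmlEscapeDemo {

    // Replaces the five XML special characters with their predefined
    // entities; all other characters are copied unchanged.
    static String escapeXml(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is an accented e; it is expected to pass through as is.
        System.out.println(escapeXml("caf\u00e9 & <tea>"));
    }
}
```

This sketch does not deal with characters that are illegal in XML altogether (most control characters, for example), which is one more reason to prefer the XML library's own writer when possible.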