XML

From Elvanör's Technical Wiki
Revision as of 11:47, 27 August 2008 by Elvanor

XML is a general-purpose markup format. However, mastering all aspects of XML manipulation is harder than it seems: there are many libraries, tools and concepts to understand.

General Concepts of XML

Tree of nodes

  • If a parent node contains text with an inline element in the middle of it, the actual DOM tree will contain 3 child nodes: two text nodes and the inline element node. The tree order corresponds to the document order (i.e., the one you would expect).
 <div>This is <span>some text</span> that I like.</div>

The 3 nodes will be: a text node with a content of "This is " (including the trailing space), the span element, and a text node with a content of " that I like.". The span element itself contains a text node.
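
The tree described above can be inspected directly. The following is a minimal sketch using Python's standard minidom module (chosen here only for illustration; the describe() helper is a hypothetical name, not part of any library):

```python
from xml.dom import minidom

doc = minidom.parseString("<div>This is <span>some text</span> that I like.</div>")
div = doc.documentElement

def describe(node):
    """Return a short (kind, content) pair for a DOM node (illustrative helper)."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.tagName)

# The div has three children: text, element, text, in document order.
children = [describe(child) for child in div.childNodes]
for kind, content in children:
    print(kind, repr(content))

# The span element in turn holds its own text node.
span = div.childNodes[1]
span_children = [describe(child) for child in span.childNodes]
```

Note that the whitespace around the inline element belongs to the surrounding text nodes, which is easy to overlook when concatenating node values.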

Validation

  • Note that up to 5 characters may need to be encoded in XML: < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos;). However, most of the time the apostrophe and the double quote are legal as-is in XML content, and a validating parser will accept the document. Problems only arise when one of them appears inside an attribute value delimited by that same character. In practice, only & and < always need to be encoded.
  • Eclipse can automatically validate an XML document if the schema or DTD is provided (even if it is hosted on a remote HTTP server: Eclipse will fetch it automatically). This is very powerful. For example, with a root element such as:
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 
	http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">
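
The encoding rules from the first bullet of this section can be checked with any XML parser. Below is a small sketch in Python using the standard ElementTree parser; it only tests well-formedness (not validity against a schema), and is_well_formed() is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the XML parser accepts `text` (illustrative helper)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Quotes, apostrophes and > are accepted unescaped in element content.
content_ok = is_well_formed("<msg>It's a \"test\" > done</msg>")

# A raw & or < in content is rejected.
ampersand_ok = is_well_formed("<msg>a & b</msg>")
less_than_ok = is_well_formed("<msg>a < b</msg>")

# A double quote is fine inside an apostrophe-delimited attribute...
other_delimiter_ok = is_well_formed("<msg note='say \"hi\"'/>")
# ...but must be encoded when the attribute itself is delimited by double quotes.
same_delimiter_ok = is_well_formed('<msg note="say "hi""/>')
encoded_ok = is_well_formed('<msg note="say &quot;hi&quot;"/>')
```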

XSLT

  • XSLT allows you to perform generic XML transformations using a markup language: an XSLT stylesheet is itself an XML document.

XPath

  • XPath is a general query language for XML. See the official W3C documentation.
  • To define XPath expressions, you have a full syntax and an abbreviated syntax, which is much more concise and thus usually preferable. For example:
//element[@type="warning"]

would select all elements named "element", anywhere in the document, whose type attribute is equal to "warning".
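
To see that expression at work, here is a minimal sketch in Python using the limited XPath subset of the standard ElementTree module. The sample document is invented for illustration, and ElementTree requires the relative form .//element rather than the absolute //element:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this example.
root = ET.fromstring("""
<log>
  <element type="warning">disk almost full</element>
  <group>
    <element type="error">disk full</element>
    <element type="warning">fan speed low</element>
  </group>
</log>
""")

# ".//" searches all descendants of root; the predicate filters on the attribute value.
warnings = root.findall('.//element[@type="warning"]')
texts = [e.text for e in warnings]
print(texts)
```

Matches are returned in document order, regardless of how deeply each element is nested.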

  • Don't forget that XPath expressions can select not only element nodes, but also attributes (attributes can actually be seen as nodes too).
  • There is a W3C XPath interface specification for the DOM, but it is not implemented in JDK 1.6 or in any Java / Groovy library that I know of. So in practice you have to use another API to evaluate XPath expressions.

Namespaces

  • The default namespace in an XML document is the one declared without an associated prefix, for example:
 <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 
	http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">

would define "urn:oasis:names:tc:xliff:document:1.2" as the default namespace, while "http://www.w3.org/2001/XMLSchema-instance" is bound to the "xsi" prefix.
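
A namespace-aware parser resolves these declarations for you. As a rough sketch, Python's standard ElementTree module exposes the result by writing each name as {namespace-uri}local-name. The empty file child element is added here only so that the default namespace can be seen applying to children as well:

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

root = ET.fromstring(
    '<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 '
    'http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">'
    '<file/></xliff>'
)

root_tag = root.tag        # unprefixed element: default namespace applies
child_tag = root[0].tag    # children inherit the default namespace
# Unprefixed attributes get no namespace; prefixed ones get the bound URI.
attribute_names = sorted(root.attrib)
print(root_tag, child_tag, attribute_names)
```

The unprefixed version attribute stays outside any namespace: the default namespace applies to elements only, not to attributes.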

XML Processing in Groovy

  • In Groovy you have access to a wide range of XML processing options, since you have the Groovy-specific libraries plus all the Java libraries underneath.

Whitespace treatment

  • Whitespace handling is surprising in Groovy. The text() method called on an element keeps the whitespace, but reading the value of its child text node returns it trimmed. See the following code:
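def doc = groovy.xml.DOMBuilder.parse(new StringReader("<body><x>   x   </x></body>"))
// The opening brace must be on the same line as use(...) so that the closure is passed to it
use(groovy.xml.dom.DOMCategory) {
	println ">" + doc.x.text() + "<" // text() called on the element
	println ">" + doc.x[0].children()[0].getNodeValue() + "<" // value of the child text node
}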

Output:

>   x   <
>x<


XmlParser

  • XmlParser parses an XML document and returns a tree of nodes. These nodes are not standard W3C DOM objects but generic Groovy node objects (groovy.util.Node), so you cannot call advanced DOM methods on them. For simple purposes this may be enough. XmlParser also lets you update the resulting tree.
  • Beware that by default, XmlParser is namespace aware. So if your XML document declares a namespace, you cannot access the child of a node in the usual Groovy way:
myNode.child // you won't access the child node of the myNode element
myNode['child'] // this won't work either

def ns = new groovy.xml.Namespace('urn:oasis:names:tc:xliff:document:1.2', "ns")

myNode[ns.child] // this will work: the lookup uses the namespace-qualified name

You can disable namespace processing if you want, though (for example via the namespaceAware constructor argument).

XmlSlurper

  • XmlSlurper should also be reserved for simple cases, since it is not fully compliant. If you have an inline node inside a text for example, it seems that the resulting tree won't be the standard one (i.e., text nodes and element nodes cannot be mixed).
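  • XmlSlurper, in contrast, does not force you to deal with namespaces by default: lookups such as myNode.child seem to match on the local name even when the document declares a namespace.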

DOMCategory

  • DOMCategory allows you to manipulate W3C DOM node objects, but with the usual Groovy syntax enhancements. This can be very useful: you have the W3C standard DOM API at your disposal but still benefit from Groovy enhancements.
  • I am not sure of the namespace processing support in DOMCategory: in the constructor a boolean switches it on or off, however I cannot see a way of associating a namespace with the document once parsed.

XML MarkupBuilder

  • The markup builder allows you to create XML documents. The main idea of this builder is to transform an existing data structure into an XML document. In particular, it is not intended to build the document in several steps; it should be built all at once in the main builder closure. This means that the correct way to use it is to first build a data structure, then transform it into an XML tree using the builder.

Important Warnings

  • Be very careful when referring to classes such as Node. The constant Node.TEXT_NODE, for example, can easily turn out to be null because Groovy does not resolve Node to the org.w3c.dom.Node class, but to another one (Groovy's own). Due to auto imports this is not very clear. It is best to use the fully qualified class name explicitly, e.g.:
assert node.getNodeType() == org.w3c.dom.Node.TEXT_NODE

XML Processing in Java

  • First of all, the JDK supports the standard W3C DOM interfaces. But this does not mean that every part of that standard is included in the JDK, and which parts are present is not entirely clear to me. The Load & Save interfaces (the org.w3c.dom.ls package) have been part of the standard API since Java 5, but whether a serializer is actually available still depends on the underlying DOM implementation.
  • Interfaces not implemented may be added via additional libraries (jars), for example Apache Xerces.
  • Some W3C DOM specifications have no corresponding interfaces at all in the JDK (for example the XPath specification).

Native Java APIs (javax.xml.*)

  • Currently the native Java XML APIs seem limited and awkward to use. Better to use third party libraries! For example, NamespaceContext in javax.xml.namespace is an interface, but the JDK offers no ready-made way (that I know of, at least) to obtain an instance: you have to implement it yourself. Out of the box, namespace support in native Java XPath is therefore poor.
  • If you disable namespace processing when parsing (for example in Groovy's parser), native XPath can still be used, though.

W3C DOM

  • This is the standard API. It is very powerful but quite complex and its design implies that you work at a low level (dealing with nodes in the tree directly).
  • To obtain an object implementing a W3C interface, use the following code:
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
DOMImplementationLS underlyingImplementation = (DOMImplementationLS) document.getImplementation().getFeature("LS", "3.0");
LSSerializer serializer = underlyingImplementation.createLSSerializer();

This would for example allow you, given a W3C XML DOM document named document, to obtain a Load & Save serializer object.

  • When using LSSerializer, the encoding used for output seems to vary according to the first operation you perform on the serializer instance (which is quite strange). For example, calling writeToString() first sets the encoding to UTF-16, and a later writeToURI() call will then use UTF-16 as well.
  • You cannot add a node coming from another document into the current document. You must use the importNode() method of the Document interface (which will basically clone the node).
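
The importNode() method comes from the W3C DOM Document interface itself, so it can be tried outside Java as well. The following is only a sketch using Python's standard minidom module (a swapped-in DOM implementation, used here for brevity; the two sample documents are invented):

```python
from xml.dom import minidom

# Two independent documents (hypothetical sample content).
source = minidom.parseString("<a><item id='1'>hello</item></a>")
target = minidom.parseString("<b/>")

original = source.documentElement.firstChild

# importNode(node, deep) returns a copy owned by `target`;
# deep=True copies the whole subtree, including the text node.
imported = target.importNode(original, True)
target.documentElement.appendChild(imported)

result = target.documentElement.toxml()
print(result)
```

The original node stays attached to its source document; only the clone ends up in the target tree.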

DOM4J

  • This API allows you to process W3C DOM objects. However, it has its own API for dealing with XML, which is at a higher level than the W3C one.
  • To use XPath you must provide the Jaxen library jar file.
  • This library is quite old and outdated (it does not support generics in collections). Its two main advantages are the ability to deal with W3C objects, and its speed.
  • Final conclusion: do not use. The API does not feel very natural and the lack of support for generics makes it outdated. Development is not active on this library. The only advantage that I see is support for W3C objects.

JDOM

  • Just started to use it. It looks nice. One thing I don't like is that if you use namespaces, you must specify the namespace for every item you create.

XOM

  • Not tried yet. According to the website, the focus of this library is on conformance and correct implementation of the XML specifications.

Escaping strings for XML

  • Apache commons-lang contains an escapeXml() method in the org.apache.commons.lang.StringEscapeUtils class. However, I don't like it since Unicode characters are currently escaped too. So you can very easily code that simple encoding method yourself.
  • Note that anyway, you should rarely need to escape an XML string manually. It is best to leave all the XML writing / processing work to your XML library.
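
As an illustration of how small such a hand-written encoding method can be, here is a sketch in Python (the same replacements port directly to Java or Groovy). The escape_xml() name is hypothetical, and unlike the commons-lang method described above, this version deliberately leaves non-ASCII characters untouched:

```python
def escape_xml(text):
    """Escape the five XML special characters, leaving every other
    character (including non-ASCII ones) untouched."""
    replacements = (
        ("&", "&amp;"),   # must come first, or the entities added below get re-escaped
        ("<", "&lt;"),
        (">", "&gt;"),
        ('"', "&quot;"),
        ("'", "&apos;"),
    )
    for char, entity in replacements:
        text = text.replace(char, entity)
    return text

escaped = escape_xml('Elvanör says "a < b" & \'c > d\'')
print(escaped)
```

The only subtle point is the ordering: the ampersand has to be replaced before the other characters.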

XML Processing in Python

  • XML support is poor in Python. Using standard DOM APIs seems possible (one should then look for the minidom implementation). A more complete DOM implementation (that hopefully allows for XPath support, but I did not check thoroughly) is available via the external PyXML library / package.
  • Another choice is to use the ElementTree library, which allows for pseudo DOM / XML tree support, as noted in the next section.

ElementTree

  • Official Site.
  • This library is bundled with Python 2.5, but the bundled version is 1.2. It does not allow for XPath-like search expressions. The version allowing for (simple) XPath expressions is 1.3, but it is currently in alpha and does not seem to be actively developed.