XML
XML is a general purpose format. However, mastering all aspects of XML manipulation is not as easy as it seems. There are lots of libraries, tools and concepts to understand.
General Concepts of XML
Tree of nodes
- If a parent node contains text with an inline element inside it, the actual DOM tree will contain 3 child nodes: two text nodes and the inline element node. The tree order corresponds to the document order (i.e., the one you would expect).
<div>This is <span>some text</span> that I like.</div>
The 3 nodes will be: a text node with a content of "This is " (including the trailing space), the span element, and a text node with a content of " that I like.". The span element itself contains a text node.
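The structure described above can be checked with the DOM parser bundled in the JDK. This is a minimal sketch, not a reference implementation: the class name MixedContentDemo and the helper describeChildren are made up for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: lists the direct children of a root element.
final class MixedContentDemo {

    /** Parses the XML string and describes each direct child of the root element. */
    static String[] describeChildren(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            String[] result = new String[children.getLength()];
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Text nodes and element nodes are siblings in the same child list.
                String kind = child.getNodeType() == Node.TEXT_NODE ? "text" : "element";
                result[i] = kind + ":" + child.getTextContent();
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div>This is <span>some text</span> that I like.</div>";
        for (String line : describeChildren(xml)) {
            // Brackets make the leading and trailing spaces visible.
            System.out.println("[" + line + "]");
        }
    }
}
```

The brackets in the printed output make it easy to see that whitespace next to the inline element belongs to the neighbouring text nodes.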
Validation
- Note that potentially, 5 characters may need to be encoded in XML: < (&lt;), > (&gt;), & (&amp;), " (&quot;), ' (&apos;). However, most of the time the apostrophe and double quote are not illegal in XML content, and a parser will report the document as well-formed. Problems only arise when one of these characters appears inside an attribute value delimited by the same symbol. In practice, & and < always need to be encoded.
- Eclipse can perform automatic validation on an XML document if the schema or DTD is provided (even if it is hosted on a remote HTTP server: Eclipse will automatically fetch it). This is very powerful.
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">
XSLT
- XSLT allows you to perform generic XML transformations using a markup language: an XSLT stylesheet is itself an XML document.
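As a rough illustration of a stylesheet being plain XML, the sketch below runs a tiny transformation through the javax.xml.transform API of the JDK. The stylesheet, the list/item input format and the class name XsltDemo are all invented for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class: applies an inline XSLT 1.0 stylesheet to a string.
final class XsltDemo {

    // Assumed stylesheet: emits the text of every <item>, each followed by ';'.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/list'>"
        + "<xsl:for-each select='item'>"
        + "<xsl:value-of select='.'/><xsl:text>;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms the given XML with STYLESHEET and returns the text output. */
    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<list><item>a</item><item>b</item></list>"));
    }
}
```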
XPath
- XPath is a general query language for XML. Official W3C documentation.
- To define XPath expressions, you have a full syntax and an abbreviated syntax, which is much more concise and thus usually preferable. For example:
//element[@type="warning"]
would select all elements named element whose type attribute is equal to "warning".
- The default axis is child:: (this means if you don't specify anything, child:: is assumed). However, there are many other useful axes like descendant::, following-sibling::, etc.
- Don't forget that XPath expressions can select not only nodes, but also attributes (attributes can be seen as nodes actually).
- There is a W3C XPath interface specification, but it's not implemented in the JDK 1.6 or in any Java / Groovy library that I know of. So usually you have to use another API for XPath use.
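The points above (abbreviated versus full syntax, other axes, selecting attributes) can be tried out with the javax.xml.xpath API once namespaces are out of the picture. This is only a sketch: the sample log document and the class name XPathBasicsDemo are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: evaluates XPath expressions on a small sample document.
final class XPathBasicsDemo {

    // Assumed sample data, invented for this example.
    static final String XML =
          "<log>"
        + "<element type='warning' id='w1'>disk</element>"
        + "<group><element type='warning' id='w2'>cpu</element></group>"
        + "<element type='info' id='i1'>boot</element>"
        + "</log>";

    /** Returns the text content of every node selected by the expression. */
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Works for elements and attributes alike.
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated syntax.
        System.out.println(select("//element[@type='warning']"));
        // The same query written with the full syntax.
        System.out.println(select(
            "/descendant-or-self::node()/child::element[attribute::type='warning']"));
        // Selecting attributes instead of elements.
        System.out.println(select("//element/@id"));
        // A non-default axis.
        System.out.println(select("/log/element[1]/following-sibling::element"));
    }
}
```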
Predicates
- The [] notation is a predicate. It is often used to get the nth child of an element (as in a[3]), but it can contain much more powerful expressions ([@id="myId"] for example). What's important to note is that a positional predicate is evaluated relative to the axis of the step it is attached to. Sometimes you will have to use parentheses to apply it to the whole result instead. Example:
//div[@class="Important"][2] -> selects every div.Important element that is the second div.Important child of its parent
(//div[@class="Important"])[2] -> selects the second div.Important element in the document
- This difference is subtle but very important to understand.
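To make the difference concrete, the following sketch evaluates both forms against a small document. The sample markup and the class name PredicateDemo are hypothetical and only serve the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: positional predicates with and without parentheses.
final class PredicateDemo {

    // Assumed sample: the first section holds one important div, the second holds two.
    static final String XML =
          "<body>"
        + "<section><div class='Important'>a</div></section>"
        + "<section><div>plain</div>"
        + "<div class='Important'>b</div><div class='Important'>c</div></section>"
        + "</body>";

    /** Returns the text content of every node selected by the expression. */
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Position counted per parent, among the divs that passed the class test.
        System.out.println(select("//div[@class='Important'][2]"));
        // Position counted over the whole result, in document order.
        System.out.println(select("(//div[@class='Important'])[2]"));
    }
}
```

In this sample the first expression selects the div containing "c" and the second selects the div containing "b", even though both are written with the same predicates.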
Useful XPath Functions
- contains is very useful. Examples:
span[contains(text(), "Hello World")]
a[contains(@href, "google.com")]
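A small sketch of both expressions, run through javax.xml.xpath. The sample paragraph and the class name ContainsDemo are assumptions. One detail worth knowing in XPath 1.0: when text() selects several text nodes, contains() only looks at the first one, so contains(., "...") can be the safer choice for mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class: the contains() function on text and attributes.
final class ContainsDemo {

    // Assumed sample data; the second span has mixed content.
    static final String XML =
          "<p>"
        + "<span>Say Hello World now</span>"
        + "<span>Hello <b>big</b> World</span>"
        + "<a href='http://www.google.com/search'>search</a>"
        + "<a href='http://example.org/'>other</a>"
        + "</p>";

    /** Returns the text content of every node selected by the expression. */
    static List<String> select(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(select("//span[contains(text(), 'Hello World')]"));
        System.out.println(select("//a[contains(@href, 'google.com')]"));
        // Only the first text node of each span is tested here...
        System.out.println(select("//span[contains(text(), 'World')]"));
        // ...whereas '.' tests the full string value of the span.
        System.out.println(select("//span[contains(., 'World')]"));
    }
}
```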
Namespaces
- The default namespace in an XML document is the one that does not have a prefix associated, for example:
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/cs02/xliff-core-1.2-strict.xsd">
would define "urn:oasis:names:tc:xliff:document:1.2" as the default namespace. "http://www.w3.org/2001/XMLSchema-instance" is associated to the "xsi" prefix.
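The effect of the default namespace and of a prefixed one can be observed with a namespace-aware DOM parser. The sketch below is an illustration under assumptions: the document is a shortened variant of the xliff header above (the schema location is replaced by a made-up local file name, xliff.xsd), and NamespaceAwareDemo is a hypothetical class name. Note that DocumentBuilderFactory is not namespace aware unless asked.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class: inspects namespaces on a parsed root element.
final class NamespaceAwareDemo {

    static final String XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2";
    static final String XSI_NS = "http://www.w3.org/2001/XMLSchema-instance";

    // Shortened sample; "xliff.xsd" is a placeholder, not the real schema URL.
    static final String XML =
          "<xliff version='1.2' xmlns='" + XLIFF_NS + "'"
        + " xmlns:xsi='" + XSI_NS + "'"
        + " xsi:schemaLocation='" + XLIFF_NS + " xliff.xsd'>"
        + "<file/>"
        + "</xliff>";

    /** Parses the sample and returns its root element. */
    static Element parseRoot(boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware); // off by default
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot(true);
        // The unprefixed root element lives in the default namespace.
        System.out.println(root.getLocalName() + " in " + root.getNamespaceURI());
        // Unprefixed children inherit the default namespace.
        System.out.println("child in " + root.getFirstChild().getNamespaceURI());
        // Prefixed attributes are looked up by namespace URI and local name.
        System.out.println(root.getAttributeNS(XSI_NS, "schemaLocation"));
        // Without namespace awareness, no namespace URI is reported.
        System.out.println(parseRoot(false).getNamespaceURI());
    }
}
```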
XSD
- An XSD file is a schema definition file: it allows you to define an XML format and perform validation against it. Some IDEs such as Eclipse will work much better with XML files when an XSD file is provided. An XSD file is an XML file itself.
- XSD W3C Tutorial. Useful to get a basic understanding of concepts before reading the full specification.
- In XSD you define a lot of types. The types can be explicitly created with the complexType tag for example, or you can use "anonymous inner types", within an element body, to define the type of the element. This is better when you don't use the type more than once, as it allows you to not explicitly reference a type declared elsewhere.
- Within a complexType tag, you must use a structure tag to hold your child elements. The structure can be a sequence, all, choice or group (and maybe others). A sequence imposes an order on the elements; all allows you to have elements in arbitrary order (but imposes some restrictions). Choice allows you to choose between various options - and group references a group of elements defined elsewhere.
- You can leave the complexType content empty, which means the element will be like <hr /> in XHTML (i.e., no children).
- The attributes defined must go after the sequence (structure) on a complex type. For example:
<element name="component-data" maxOccurs="unbounded" minOccurs="1">
  <complexType>
    <sequence>
      <element name="x-path" type="string"></element>
      <element name="required-string" type="string" minOccurs="0"></element>
    </sequence>
    <attribute name="type" type="string"></attribute>
  </complexType>
</element>
- To define a simple string type that can only take its value from a list of possible values, use:
<simpleType name="ExtractMethodType">
  <restriction base="string">
    <enumeration value="standard" />
    <enumeration value="advanced" />
  </restriction>
</simpleType>
- There is apparently no way in XSD 1.0 to impose constraints on child elements based on the value of an attribute in the parent element, i.e., the contained elements cannot differ depending on the value of the attribute (XSD 1.1 adds assertions and conditional type assignment for this, but processor support varies). For this reason, it's better to have a design where attribute values do not influence the contained elements.
- A type corresponds to the contents of an element and its attributes, so a type is linked with a (named) element. Sometimes, however, you want to define a group, which corresponds to a saved structure (choice, sequence or all). A group definition carries the name attribute; you refer to a defined group using the same tag (group) but with the "ref" attribute.
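The sequence, attribute and enumeration rules above can be exercised with the javax.xml.validation API of the JDK. The sketch below is an illustration under assumptions: the schema is a cut-down variant of the fragments above, written with an xs: prefix and with the type attribute bound to ExtractMethodType, and XsdValidationDemo is a made-up class name.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical demo class: validates small documents against an inline XSD.
final class XsdValidationDemo {

    // Assumed schema: a sequence of children, then an enumerated attribute.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:simpleType name='ExtractMethodType'>"
        + "<xs:restriction base='xs:string'>"
        + "<xs:enumeration value='standard'/><xs:enumeration value='advanced'/>"
        + "</xs:restriction>"
        + "</xs:simpleType>"
        + "<xs:element name='component-data'>"
        + "<xs:complexType>"
        + "<xs:sequence>"
        + "<xs:element name='x-path' type='xs:string'/>"
        + "<xs:element name='required-string' type='xs:string' minOccurs='0'/>"
        + "</xs:sequence>"
        + "<xs:attribute name='type' type='ExtractMethodType'/>"
        + "</xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Returns true if the document is valid according to XSD. */
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator();
            try {
                validator.validate(new StreamSource(new StringReader(xml)));
                return true;
            } catch (SAXException invalid) {
                // The default error handler throws on the first validation error.
                return false;
            }
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Schema could not be loaded or input read", e);
        }
    }

    public static void main(String[] args) {
        // Valid: allowed attribute value, children in sequence order.
        System.out.println(isValid(
            "<component-data type='standard'><x-path>//a</x-path></component-data>"));
        // Invalid: value outside the enumeration.
        System.out.println(isValid(
            "<component-data type='other'><x-path>//a</x-path></component-data>"));
        // Invalid: children in the wrong order for a sequence.
        System.out.println(isValid(
            "<component-data><required-string>s</required-string>"
          + "<x-path>//a</x-path></component-data>"));
    }
}
```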
XML Processing in Groovy
- In Groovy you have access to a wide range of XML processing options, since you have the Groovy-specific libraries plus all the Java libraries underneath.
Whitespace treatment
- Treatment of whitespace is strange in Groovy. The text() method called on an element keeps the whitespace, but called on a text node removes it. See the following code:
def doc = groovy.xml.DOMBuilder.parse(new StringReader("<body><x> x </x></body>"))
use(groovy.xml.dom.DOMCategory) {
    println ">" + doc.x.text() + "<"
    println ">" + doc.x[0].children()[0].getNodeValue() + "<"
}
Output:
> x <
>x<
XmlParser
- XmlParser will parse an XML document and return a tree of nodes. These nodes are not W3C standard DOM objects; they are generic Groovy Node objects. Thus you cannot call advanced DOM methods on them, but for simple purposes it may be enough. XmlParser allows updating of the tree.
- Beware that by default, XmlParser is namespace aware. So if your XML document contains a namespace, you cannot access the child of a node in the usual Groovy way:
myNode.child // you won't access the child node of the myNode element
myNode[child] // this won't work either
def ns = new groovy.xml.Namespace('urn:oasis:names:tc:xliff:document:1.2', "ns")
myNode[ns.child] // this will work: you need to use the namespace or declare it default for the document
You can disable namespace processing if you want, though (via a constructor argument, for example).
XmlSlurper
- XmlSlurper should likewise be reserved for simple cases, since it is not fully compliant. If you have an inline node inside a text, for example, it seems that the resulting tree won't be the standard one (i.e., text nodes and element nodes cannot be mixed).
- XmlSlurper does not seem to be namespace aware by default.
DOMCategory
- DOMCategory allows you to manipulate W3C DOM node objects, but with the usual Groovy syntax enhancements. This can be very useful: you have the W3C standard DOM API at your disposal but still benefit from Groovy enhancements.
- I am not sure of the namespace processing support in DOMCategory: in the constructor a boolean switches it on or off, however I cannot see a way of associating a namespace with the document once parsed.
XML MarkupBuilder
- The markup builder allows you to create XML documents. The main idea of this builder is to transform an existing data structure into an XML document. In particular, it is not intended to build the document in several steps; it should be built all at once in the main builder closure. This means that the correct way to use it is to first build a data structure, then transform it into an XML tree using the builder.
Important Warnings
- Be very careful when accessing classes such as Node. The constant Node.TEXT_NODE, for example, can easily evaluate to null because Groovy does not resolve Node to org.w3c.dom.Node but to another class (groovy.util.Node is imported by default). Due to auto imports this is not very clear. The best is to explicitly use the fully qualified class name, e.g.:
assert node.getNodeType() == org.w3c.dom.Node.TEXT_NODE
XML Processing in Java
General Information
- XML processing in Java is complex at first, mainly because a lot of options exist.
- Some of the standard W3C DOM interfaces are supported in the JDK (at least in Java 1.6). Not all of the interfaces are present in the JDK (the XPath specification for example is absent). And an implementation is not guaranteed for every interface. Which parts are implemented and which parts are not is not entirely clear to me, although it seems that as of JDK 1.6, the Load & Save specification is not present.
- Apart from the W3C interfaces, Java has its own JAXP (Java API for XML Processing) interfaces. Version 1.4 is bundled with the 1.6 JDK: this API mainly allows you to obtain W3C DOM objects, and to write them to XML files. It contains parser classes, etc.
- Interfaces not implemented (and absent interfaces) may be added via additional libraries (jars), for example Apache Xerces. Apache Xerces, version 2, implements the DOM Level 3 Load & Save Specification. It also contains the W3C org.w3c.dom.xpath interface (although apparently the implementation is absent).
Native Java APIs (javax.xml.*)
- Currently the native Java XML APIs seem limited and awkward to use. Better to use third party libraries!
- XPath support in particular is problematic. There is native XPath support in Java 5+ via JAXP, but it lacks good namespace support. This is because javax.xml.namespace.NamespaceContext is an interface without a bundled implementation. Writing your own is not too hard, but it's a bit stupid not to have included a default implementation. If you don't provide an implementation of this interface, you cannot use XPath with DOM documents that are namespace aware.
- If you disable namespace processing, javax.xml.xpath support is good enough though. Use it like this:
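// Requires: import javax.xml.xpath.*; import org.w3c.dom.NodeList;
XPath xpath = XPathFactory.newInstance().newXPath();
// evaluate() returns Object; with XPathConstants.NODESET the result is a NodeList, not a List<Node>
NodeList nodes = (NodeList) xpath.evaluate("//car", records, XPathConstants.NODESET);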
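For namespace-aware documents, a hand-written NamespaceContext can be quite small. The following is one possible sketch, not the only way to do it: the class names NamespaceXPathDemo and SinglePrefixContext, the prefix "x" and the sample document are all assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: XPath on a namespace-aware DOM document.
final class NamespaceXPathDemo {

    static final String XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2";

    // Assumed sample document using a default namespace.
    static final String XML =
          "<xliff version='1.2' xmlns='" + XLIFF_NS + "'>"
        + "<file><trans-unit id='1'/><trans-unit id='2'/></file>"
        + "</xliff>";

    /** Minimal NamespaceContext that knows a single prefix/URI pair. */
    static final class SinglePrefixContext implements NamespaceContext {
        private final String prefix;
        private final String uri;

        SinglePrefixContext(String prefix, String uri) {
            this.prefix = prefix;
            this.uri = uri;
        }

        @Override
        public String getNamespaceURI(String p) {
            return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String u) {
            return uri.equals(u) ? prefix : null;
        }

        @Override
        public Iterator<String> getPrefixes(String u) {
            return uri.equals(u)
                    ? Collections.singletonList(prefix).iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    /** Counts the nodes matched by the expression, optionally with the "x" prefix bound. */
    static int count(String expression, boolean withContext) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (withContext) {
                xpath.setNamespaceContext(new SinglePrefixContext("x", XLIFF_NS));
            }
            Double n = (Double) xpath.evaluate(
                    "count(" + expression + ")", doc, XPathConstants.NUMBER);
            return n.intValue();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names only match elements in no namespace: nothing is found.
        System.out.println(count("//trans-unit", false));
        // With the prefix bound to the document's namespace, both units are found.
        System.out.println(count("//x:trans-unit", true));
    }
}
```

The prefix used in the expression does not have to match anything in the document; only the namespace URI it is bound to matters.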
W3C DOM
- This is the standard API. It is very powerful but quite complex and its design implies that you work at a low level (dealing with nodes in the tree directly).
- To obtain an object implementing a W3C interface, use the following code:
DOMImplementationLS underlyingImplementation =
    (DOMImplementationLS) document.getImplementation().getFeature("LS", "3.0");
LSSerializer serializer = underlyingImplementation.createLSSerializer();
This would for example allow you, given a W3C XML DOM document named document, to obtain a Load & Save serializer object.
- When using LSSerializer, the output encoding seems to depend on the first operation you perform on the serializer instance (which is quite strange). For example, calling writeToString() will set the encoding to UTF-16, and a later call to writeToURI() will then use UTF-16 as well.
- When using LSSerializer, if the document has no namespace support, the xmlns attribute on the document element won't be written to the output XML file. This may be a bug or expected behavior; the solution is to activate namespace support if you want that attribute.
- You cannot add a node coming from another document into the current document. You must use the importNode() method of the Document interface (which will basically clone the node).
- In Java, I did not find any library implementing the W3C DOM Level 3 XPath specification. The best option to use XPath in Java seems to be to use Jaxen directly or via the DOM4J / JDOM APIs.
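The importNode() rule and the Load & Save serializer can be tried together in a few lines. This is a sketch under assumptions: it relies on the DOM implementation bundled with the JDK exposing the "LS" feature, and the garage/stock documents and the class name ImportAndSerializeDemo are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical demo class: moves a node between documents, then serializes.
final class ImportAndSerializeDemo {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns true if appending a node owned by another document is rejected. */
    static boolean directAppendIsRejected() {
        Document target = parse("<garage/>");
        Node foreign = parse("<stock><car id='1'/></stock>")
                .getDocumentElement().getFirstChild();
        try {
            target.getDocumentElement().appendChild(foreign);
            return false;
        } catch (DOMException e) {
            return e.code == DOMException.WRONG_DOCUMENT_ERR;
        }
    }

    /** Imports the car element into the garage document and serializes the result. */
    static String importAndSerialize(boolean xmlDeclaration) {
        Document target = parse("<garage/>");
        Node foreign = parse("<stock><car id='1'/></stock>")
                .getDocumentElement().getFirstChild();
        // importNode() creates a deep copy that is owned by the target document.
        Node copy = target.importNode(foreign, true);
        target.getDocumentElement().appendChild(copy);

        DOMImplementationLS ls = (DOMImplementationLS)
                target.getImplementation().getFeature("LS", "3.0");
        LSSerializer serializer = ls.createLSSerializer();
        // Standard Load & Save parameter controlling the <?xml ...?> header.
        serializer.getDomConfig().setParameter("xml-declaration", xmlDeclaration);
        return serializer.writeToString(target);
    }

    public static void main(String[] args) {
        System.out.println(directAppendIsRejected());
        System.out.println(importAndSerialize(false));
        // With the declaration enabled, the reported encoding is UTF-16.
        System.out.println(importAndSerialize(true));
    }
}
```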
Jaxen, Xalan
- Jaxen is a library for XPath processing. It is the underlying library used by DOM4J and JDOM. You will need to add it to obtain XPath support with these libraries, although you can also use Jaxen on its own.
- Xalan is an XSLT / XPath library.
DOM4J
- This API allows you to process W3C DOM objects. However, it has its own API for dealing with XML, which is at a higher level than the W3C one.
- To use XPath you must provide the Jaxen library jar file.
- This library is quite old and outdated (it does not support generics in collections). Its two main advantages are the ability to deal with W3C objects, and its speed.
- Final conclusion: do not use. The API does not feel very natural and the lack of support for generics makes it outdated. Development is not active on this library. The only advantage that I see is support for W3C objects.
Validation with DOM4J
- This is done by the underlying SAX parsers. You should have a (recent) version of the Apache Xerces library in your classpath, and use the following code (assuming you want to validate with an XSD):
SAXReader saxReader = new SAXReader(true); // true = validating reader
// This feature is necessary when validating with an XSD
saxReader.setFeature("http://apache.org/xml/features/validation/schema", true);
JDOM
- Just started to use it. It looks nice. One thing I don't like is that if you use namespaces, you must specify the namespace for every item you create.
XOM
- Not tried yet. According to the website, the focus of this library is on conformance and correct implementation of the XML specifications.
Escaping strings for XML
- Apache commons-lang contains an escapeXml() method in the org.apache.commons.lang.StringEscapeUtils class. However, I don't like it since non-ASCII Unicode characters are currently escaped too. You can very easily code that simple encoding method yourself.
- Note that anyway, you should rarely need to escape an XML string manually. It is best to leave all the XML writing / processing work to your XML library.
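A hand-written version of such an encoding method could look like the sketch below. It only replaces the five characters discussed earlier and leaves everything else, including non-ASCII characters, untouched. The class and method names are made up, and the sketch deliberately does not deal with control characters that are illegal in XML.

```java
// Hypothetical helper class: escapes the five XML special characters.
final class XmlEscapeDemo {

    /** Escapes &, <, >, " and ' so the text can be used in content or attribute values. */
    static String escapeXml(String text) {
        StringBuilder sb = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c); // everything else is copied as-is
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("Tom & Jerry <say> \"hi\""));
    }
}
```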
XML Processing in Python
- XML support is poor in Python. Using standard DOM APIs seems possible (one should then look for the minidom implementation). A more complete DOM implementation (that hopefully allows for XPath support, but I did not check thoroughly) is available via the external PyXML library / package.
- Another choice is to use the ElementTree library, which allows for pseudo DOM / XML tree support, as noted in the next section.
ElementTree
- Official Site.
- This library is bundled with Python 2.5, but the bundled version is 1.2. It does not allow for XPath like search expressions. The version allowing for (simple) XPath expressions is 1.3, but it is currently in alpha and does not seem actively developed.