XML for Java Developers
G22.3033-002
Dr. Jean-Claude Franchitti
New York University
Computer Science Department
Courant Institute of Mathematical Sciences
Session 3: Processing XML
Documents in Java Using XPath and XSLT
The Java API for XML Parsing (JAXP) defined through
the Java Community Process provides a common interface for accessing XML
documents. The W3C has defined the Document Object Model (DOM), which provides
a standard interface for working with an XML document in a tree hierarchy,
whereas the Simple API for XML (SAX) lets a program parse an XML document
sequentially, based on an event handling model. Both of these standards (SAX
being a de facto standard) complement the JAXP.
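As a rough illustration of the difference between the two models, the sketch below counts the <address> elements of a small document twice: once by building a DOM tree and once by listening to SAX events. It uses the standard JAXP factories; the class name ParseDemo and its method names are made up for this handout and are not part of any API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the javax.xml / org.w3c / org.xml types are real APIs.
public class ParseDemo {

    // DOM: the parser builds the whole tree in memory, then we query it.
    public static int countWithDom(String xml, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName(tag).getLength();
    }

    // SAX: the parser streams through the input and calls us back for each start tag.
    public static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<addressbook><address/><address/></addressbook>";
        System.out.println("DOM: " + countWithDom(xml, "address"));
        System.out.println("SAX: " + countWithSax(xml, "address"));
    }
}
```

The DOM variant is convenient when the document has to be navigated or modified afterwards; the SAX variant never holds more than the current event, which matters for large inputs.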
In many cases XPath and XSLT provide simpler, more elegant ways than the standard Java APIs for solving application problems. In some simple samples, we will compare a pure Java/XML solution with one that utilizes XPath and/or XSLT. Both XSLT and XPath are part of the Extensible Stylesheet Language (XSL) specification. XSL consists of three parts: the XSL language specification itself, XSL Transformations (XSLT), and XML Path Language (XPath). XSL is a language for transforming XML documents; it includes a definition -- Formatting Objects -- of how XML documents can be formatted for presentation. XSLT specifies a vocabulary for transforming one XML document into another. You can consider XSLT to be XSL minus Formatting Objects. The XPath language addresses specific parts of XML documents and is intended to be used from within an XSLT stylesheet.
You should have some familiarity with XML, XSLT, the DOM APIs, and Apache Xerces/Xalan.
The problem
XML provides a vehicle
to accomplish a good design practice in Web programming: the
Model-View-Controller pattern (MVC), or, in simpler terms, the separation of
application data from presentation data. If the application data is formatted
in XML, it can easily be bound -- typically in a servlet or JavaServer Page (JSP) --
to, say, HTML templates by using an XSL stylesheet.
But XML can do much more than merely help with
model-view separation for an application's frontend. We currently observe more
and more widespread use of components (for example, components developed using
the EJB standard) that can be used to assemble applications, thus enhancing
developer productivity. Component reusability can be improved by formatting the
data that components deal with in a standard way. Indeed, we can expect to see
more and more published components that use XML to describe their interfaces.
Because XML-formatted data is language-neutral, it becomes usable in cases where the client of a given application service is not known, or when it must not have any dependencies on the server. For example, in B2B environments, it may not be acceptable for two parties to have dependencies on concrete Java object interfaces for their data exchange. New technologies like the Simple Object Access Protocol (SOAP) address these requirements.
All of these cases have one thing in common: data is stored in XML documents and needs to be manipulated by an application. For example, an application that uses various components from different vendors will most likely have to change the structure of the (XML) data to make it fit the needs of the application or adhere to a given standard.
Code written using the Java APIs mentioned above
would certainly do this. Moreover, there are more and more tools available with
which you can turn an XML document into a JavaBean and vice versa, which makes
it easier to handle the data from within a Java program. However, in many
cases, the application, or at least a part of it, merely processes one or more
XML documents as input and converts them into a different XML format as output.
Using stylesheets in those cases is a viable alternative.
Use XPath to locate nodes in an XML document
As stated above, the XPath language is used to
locate certain parts of an XML document. As such, it's meant to be used by an
XSLT stylesheet, but nothing keeps us from using it in our Java program in
order to avoid lengthy iteration over a DOM element hierarchy. Indeed, we can
let the XSLT/XPath processor do the work for us. Let's take a look at how this
works.
Let us assume that we have an application scenario
in which a source XML document is presented to the user (possibly after being
processed by a stylesheet). The user makes updates to the data and, to save
network bandwidth, sends only the updated records back to the application. The
application looks for the XML fragment in the source document that needs to be
updated and replaces it with the new data.
We will create a little sample that will help you
understand the various options. For this example, we assume that the
application deals with address records in an addressbook. A sample addressbook document looks like this:
<addressbook>
<address>
<addressee>John Smith</addressee>
<streetaddress>250 18th Ave SE</streetaddress>
<city>Rochester</city>
<state>MN</state>
<postalCode>55902</postalCode>
</address>
<address>
<addressee>Bill Morris</addressee>
<streetaddress>1234 Center Lane NW</streetaddress>
<city>St. Paul</city>
<state>MN</state>
<postalCode>55123</postalCode>
</address>
</addressbook>
The application (possibly, though not necessarily, a
servlet) keeps an instance of the addressbook in memory as a DOM Document object. When the user
changes an address, the application's frontend sends it only the updated <address> element.
The <addressee> element is used to uniquely identify an address; it
serves as the primary key. This would not make a lot of sense for a real
application, but we do it here to keep things simple.
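Once the outdated element is known, the replacement step itself is short. The sketch below shows one way it might be done with plain DOM calls: the updated element arrives in its own document, so it is first imported into the cached document and then swapped in. The class AddressReplacer and its methods are hypothetical names chosen for this handout; locating the outdated node is deliberately left to the caller.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper; error handling is omitted to keep the sketch short.
public class AddressReplacer {

    // Parse an XML string into a DOM Document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Replace 'outdated' (a node of the cached document) with a copy of 'updated'.
    public static void replace(Node outdated, Element updated) {
        Document owner = outdated.getOwnerDocument();
        // A node from another document must be imported before it can be inserted.
        Node imported = owner.importNode(updated, true);
        outdated.getParentNode().replaceChild(imported, outdated);
    }

    public static void main(String[] args) throws Exception {
        Document cached = parse("<addressbook><address><addressee>John Smith</addressee>"
                + "<city>Rochester</city></address></addressbook>");
        Document update = parse("<address><addressee>John Smith</addressee>"
                + "<city>Austin</city></address>");
        replace(cached.getElementsByTagName("address").item(0), update.getDocumentElement());
        System.out.println(cached.getElementsByTagName("city").item(0).getTextContent());
    }
}
```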
We now need to write some Java code that will help
us identify the <address> element in the source tree
that needs to be replaced with the updated element. The findAddress() method below shows how that
can be accomplished. Please note that, to keep the sample short, we've left out
the appropriate error handling.
public Node findAddress(String name, Document source) {
    Element root = source.getDocumentElement();
    NodeList nl = root.getChildNodes();
    // iterate over all address nodes and find the one that has the correct addressee
    for (int i = 0; i < nl.getLength(); i++) {
        Node n = nl.item(i);
        if ((n.getNodeType() == Node.ELEMENT_NODE) &&
                (((Element) n).getTagName().equals("address"))) {
            // we have an address node, now we need to find the
            // 'addressee' child
            Node addressee = ((Element) n).getElementsByTagName("addressee").item(0);
            if (addressee == null) {
                continue; // no addressee child; skip this address
            }
            // there is the addressee, now walk its children and compare the text;
            // starting from getFirstChild() also copes with an empty <addressee/>
            for (Node child = addressee.getFirstChild(); child != null;
                    child = child.getNextSibling()) {
                if ((child.getNodeType() == Node.TEXT_NODE) &&
                        (((Text) child).getData().equals(name))) {
                    return n;
                }
            }
        }
    }
    return null;
}
The code above could most likely be optimized, but it is obvious that iterating over the DOM tree can be tedious and error prone. Now let's look at how the target node can be located by using a simple XPath statement. The statement could look like this:
//address[child::addressee[text() = 'John Smith']]
We can now rewrite our previous method. This time, we use the XPath statement to find the desired node:
public Node findAddress(String name, Document source) throws Exception {
    // need to create a few helper objects
    XMLParserLiaison xpathSupport = new XMLParserLiaisonDefault();
    XPathProcessor xpathParser = new XPathProcessorImpl(xpathSupport);
    PrefixResolver prefixResolver =
            new PrefixResolverDefault(source.getDocumentElement());
    // create the XPath and initialize it
    XPath xp = new XPath();
    String xpString = "//address[child::addressee[text() = '" + name + "']]";
    xpathParser.initXPath(xp, xpString, prefixResolver);
    // now execute the XPath select statement
    XObject list = xp.execute(xpathSupport, source.getDocumentElement(),
            prefixResolver);
    // return the first node of the resulting node set
    return list.nodeset().item(0);
}
The above code may not look a lot better than the previous try, but most of this method's contents could be encapsulated in a helper class. The only part that changes over and over is the actual XPath expression and the target node.
This lets us create an XPathHelper class, which looks like this:
import org.w3c.dom.*;
import org.xml.sax.*;
import org.apache.xalan.xpath.*;
import org.apache.xalan.xpath.xml.*;

public class XPathHelper {
    XMLParserLiaison xpathSupport = null;
    XPathProcessor xpathParser = null;
    PrefixResolver prefixResolver = null;

    public XPathHelper() {
        xpathSupport = new XMLParserLiaisonDefault();
        xpathParser = new XPathProcessorImpl(xpathSupport);
    }

    public NodeList processXPath(String xpath, Node target) throws SAXException {
        prefixResolver = new PrefixResolverDefault(target);
        // create the XPath and initialize it
        XPath xp = new XPath();
        xpathParser.initXPath(xp, xpath, prefixResolver);
        // now execute the XPath select statement
        XObject list = xp.execute(xpathSupport, target, prefixResolver);
        // return the resulting node list
        return list.nodeset();
    }
}
After creating the helper class, we can rewrite our finder method again, which is now very short:
public Node findAddress(String name, Document source) throws Exception {
    XPathHelper xpathHelper = new XPathHelper();
    NodeList nl = xpathHelper.processXPath(
            "//address[child::addressee[text() = '" + name + "']]",
            source.getDocumentElement());
    return nl.item(0);
}
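The listings above rely on the XPath classes that ship with the Xalan release this handout was written against. On a JDK that includes the standard javax.xml.xpath package (part of later JAXP versions), the same finder might be sketched as follows. This is an alternative under that assumption, not the handout's original method; the class name JaxpAddressFinder is made up. The sketch also binds the name as an XPath variable instead of concatenating it into the expression, so a name containing an apostrophe does not break the query.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class; assumes a JDK with the javax.xml.xpath package.
public class JaxpAddressFinder {

    // Parse an XML string into a DOM Document.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Return the first <address> whose <addressee> text equals 'name', or null.
    public static Node findAddress(final String name, Document source) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Resolve $name from the method argument rather than splicing it into the string.
        xpath.setXPathVariableResolver(
                variable -> "name".equals(variable.getLocalPart()) ? name : null);
        return (Node) xpath.evaluate(
                "//address[child::addressee[text() = $name]]",
                source, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document book = parse("<addressbook>"
                + "<address><addressee>John Smith</addressee><city>Rochester</city></address>"
                + "<address><addressee>Bill Morris</addressee><city>St. Paul</city></address>"
                + "</addressbook>");
        Node hit = findAddress("Bill Morris", book);
        System.out.println(hit == null ? "not found" : hit.getTextContent());
    }
}
```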
The helper class can now be used whenever a node or a set of nodes needs to be located in a given XML document. The actual XPath statement could even be loaded from an external source, so that changes could be made on the fly if the source document structure changes. In this case, no recompile is necessary.
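One way to keep the expression outside the compiled code is a properties file that maps a logical query name to an XPath string. The sketch below assumes the standard javax.xml.xpath package rather than the Xalan classes used above; the class name ExternalXPath and the key addresses.in.mn are invented for the example, and a StringReader stands in for the external file.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class; the property key and file layout are assumptions.
public class ExternalXPath {

    // Parse an XML string into a DOM Document.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Look up the expression stored under 'key' and evaluate it against 'doc'.
    public static NodeList select(Properties queries, String key, Document doc)
            throws Exception {
        String expression = queries.getProperty(key);
        if (expression == null) {
            throw new IllegalArgumentException("No XPath configured for " + key);
        }
        return (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
    }

    public static void main(String[] args) throws Exception {
        // In a real application this would be loaded from a file or URL.
        Properties queries = new Properties();
        queries.load(new StringReader("addresses.in.mn = //address[state='MN']\n"));

        Document book = parse("<addressbook>"
                + "<address><addressee>John Smith</addressee><state>MN</state></address>"
                + "<address><addressee>Ann Lee</addressee><state>WI</state></address>"
                + "</addressbook>");
        System.out.println(select(queries, "addresses.in.mn", book).getLength());
    }
}
```

If the document structure changes, only the properties entry has to be edited; the Java code stays as it is.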
Process XML documents with XSL stylesheets
In some cases, it makes sense to outsource the
entire handling of an XML document to an external XSL stylesheet, a process in
some respects similar to the use of XPath as described in the previous section.
With XSL stylesheets, you can create an output document by selecting nodes from
the input document and merging their content with stylesheet content, based on
pattern rules.
If an application changes the structure and content
of an XML document and produces a new document, it may be better and easier to
use a stylesheet to handle the work rather than writing a Java program that
does the same job. The stylesheet is most likely stored in an external file,
allowing you to change it on the fly, without the need to recompile.
For example, we could accomplish the processing for
the addressbook sample by creating a
stylesheet that merges the cached version of the addressbook with the updated one, thus
creating a new document with the updates in it.
Here is a sample of such a stylesheet:
<?xml version='1.0'?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml"/>
  <xsl:variable name="doc-file">http://mymachine.com/changed.xml</xsl:variable>
  <!-- copy everything that has no other pattern defined -->
  <xsl:template match="* | @*">
    <xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
  </xsl:template>
  <!-- check for every <address> element if an updated one exists -->
  <xsl:template match="//address">
    <xsl:param name="addresseeName">
      <xsl:value-of select="addressee"/>
    </xsl:param>
    <xsl:choose>
      <xsl:when test="document($doc-file)//addressee[text()=$addresseeName]">
        <xsl:copy-of
            select="document($doc-file)//address[child::addressee[text()=$addresseeName]]"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- no update available: keep the cached <address> element unchanged.
             (Applying templates to the children only would lose the <address> wrapper.) -->
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
Note that the above stylesheet takes the updated
data out of a file called changed.xml. A real application would
obviously not want to store the changed data in a file before processing it.
One solution is to add a special attribute to the <address> element, indicating whether
or not it has been updated. Then the application could simply append the
updated data to the source document and define a different stylesheet that
detects updated records and replaces the outdated ones.
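A minimal sketch of that alternative is shown below. It assumes a hypothetical attribute named updated on the appended <address> elements and runs an embedded stylesheet through the standard javax.xml.transform API rather than the Xalan classes used in this handout. The stylesheet drops every unflagged address for which a flagged one with the same addressee exists and strips the flag on output; note that the updated record stays at the position where it was appended.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class; the 'updated' attribute is an assumption made for this sketch.
public class FlaggedMerge {

    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        // identity rule: copy everything that no other rule handles
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        // drop an unflagged address when a flagged one with the same addressee exists
        + "<xsl:template match="
        + "'address[not(@updated)][addressee = ../address[@updated]/addressee]'/>"
        // strip the flag itself from the output
        + "<xsl:template match='address/@updated'/>"
        + "</xsl:stylesheet>";

    // Apply the stylesheet to an addressbook that has updated records appended.
    public static String merge(String addressbookXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(addressbookXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(merge("<addressbook>"
                + "<address><addressee>John Smith</addressee><city>Rochester</city></address>"
                + "<address><addressee>Bill Morris</addressee><city>St. Paul</city></address>"
                + "<address updated='true'><addressee>John Smith</addressee>"
                + "<city>Austin</city></address>"
                + "</addressbook>"));
    }
}
```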
All the application has to do now is create an XSLTProcessor object and let it do the
work:
import org.apache.xalan.xslt.*;
...
XSLTProcessor processor = XSLTProcessorFactory.getProcessor();
processor.process(new XSLTInputSource(sourceDoc.getDocumentElement()),
        new XSLTInputSource("http://mymachine.com/updateAddress.xsl"),
        new XSLTResultTarget(newDoc.getDocumentElement()));
sourceDoc = newDoc;
...
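The XSLTProcessor classes above belong to the Xalan release this handout was written against. With the standard javax.xml.transform API found in later JAXP versions, a DOM-to-DOM transformation might be sketched as follows. The class name TraxUpdate and the small demo stylesheet (which merely drops <state> elements) are assumptions made for the example and stand in for updateAddress.xsl.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class; assumes a JDK with the javax.xml.transform package.
public class TraxUpdate {

    // Demo stylesheet standing in for updateAddress.xsl: copy all, drop <state>.
    public static final String DEMO_STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match='state'/>"
        + "</xsl:stylesheet>";

    // Parse an XML string into a namespace-aware DOM Document.
    public static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Transform 'sourceDoc' with 'stylesheet' and return the result as a new Document.
    public static Document transform(Document sourceDoc, Source stylesheet)
            throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer(stylesheet);
        DOMResult result = new DOMResult(); // the transformer creates the result document
        transformer.transform(new DOMSource(sourceDoc), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document sourceDoc = parse("<addressbook><address>"
                + "<addressee>John Smith</addressee><state>MN</state>"
                + "</address></addressbook>");
        Document newDoc = transform(sourceDoc,
                new StreamSource(new StringReader(DEMO_STYLESHEET)));
        System.out.println(newDoc.getElementsByTagName("state").getLength());
    }
}
```

A stylesheet stored externally can be passed in the same way, for example as a StreamSource built from its URL.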
Conclusion
As shown in this handout, the manual parsing and processing of an XML document is only one option. XPath expressions and XSL stylesheets can take over much of the parsing and iterating, thus reducing the amount of code that we need to write. Moreover, with this approach the information about how the data is processed is stored externally and can be changed without recompiling the application. The mechanisms described here can be used for the creation of presentation data for a Web application, but can also be applied in all cases in which XML data needs to be processed.
Valuable XML-related resources
· Sun's Java API for XML Parsing (JAXP) page: http://java.sun.com/xml/docs/api/
· The DOM API from the W3C: http://www.w3.org/DOM/
· For good XML tutorials, visit the XML Zone of IBM's developerWorks: http://www.ibm.com/developer/xml/
· More information on the Apache Xerces XML parser and the Apache Xalan XSL processor: http://xml.apache.org
· More on the Simple Object Access Protocol (SOAP): http://www.ibm.com/software/developer/library/soap/soapv11.html
· For more on XSLT, see "What is XSLT" by G. Ken Holman from XML.com: http://www.xml.com/pub/2000/08/holman/