XML for Java Developers
G22.3033-002
Dr. Jean-Claude Franchitti
New York University
Computer Science Department
Courant Institute of Mathematical Sciences
Session 2: IE5’s Implementation of the XSL Specification
The handout on "XSL Transformations" presents an XSL style sheet that uses
patterns to locate objects within the document tree, and shows how you can
specify template rules to format these objects.
The example is interesting in that it demonstrates the process of
transforming XML into HTML and shows how you can combine CSS style rules to
format the HTML output.
The example can be tested using IBM's LotusXSL style
sheet engine. In that arrangement, the style sheet is to be served along with
its accompanying XML document from a Java servlet. The example can also be
tested from the command line.
The style sheet should run in most XSL processors,
including Internet Explorer 5's. However, when it is loaded in Explorer 5, the
entire document, save one word ("by"), is missing. Tracking down the
problem reveals a few differences between the XSL draft specification
and Internet Explorer's implementation.
The differences reflect the quickly changing
specification. Nevertheless, an understanding of how IE processes style sheets
will save you countless hours of head scratching in other cases.
The Problem
The XML document, news.xml, presents a typical
online news story. The document contains elements that describe the various
parts of the article including its title, dek (subtitle), and byline, as well
as formatting information within paragraphs. For example, the first character in
the first paragraph is a drop-cap letter that would be larger than the rest of
the characters in the document.
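To make the structure concrete, here is a minimal sketch of what such a
document might look like. Only the element names that appear in Listing One
(Story, SectionTitle, Headline, BodyText, DropCap, bold) and the News&Views
section title come from this handout. The nesting and all other text are
assumptions made for illustration, and the dek and byline elements are left
out because their names are not shown.

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch of news.xml. The element names are taken from
     Listing One; the nesting and the text content are invented. -->
<Story>
  <SectionTitle>News&amp;Views</SectionTitle>
  <Headline>Placeholder headline</Headline>
  <BodyText><DropCap>W</DropCap>hen placeholder text runs on, <bold>some words</bold> stand out.</BodyText>
</Story>
```

Note that the ampersand in the section title has to be written as &amp;amp; for
the document to be well formed.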
The style sheet used to transform the XML document
goes something like this: A root template is used to create a boilerplate of
the HTML output. Wherever we want to place text from the XML document, a
pattern is used to select that portion of the document. Then <xsl:apply-templates> is called to process child
nodes. For example, we call <xsl:apply-templates select="Story/SectionTitle"/> to include the title of the
document at the top of the story.
When <xsl:apply-templates> is called, two things
happen. First, the processor selects the nodes specified in the pattern: in
this case, the Story/SectionTitle element, whose content is News&Views. Next,
<xsl:apply-templates> looks in the style sheet to
see if there are any templates that apply to this node. In the original
example, the story title is processed directly in the root template and no
other template rules are specified for it. Thus, it is simply included with the
appropriate formatting information.
However, later in the style sheet when <xsl:apply-templates select="Story//BodyText"/> is called, a chain of
events occurs. The style sheet is as follows:
Listing One:
<xsl:stylesheet
...
    <TITLE><xsl:apply-templates select="Story/SectionTitle"/></TITLE>
...
    <H2><xsl:apply-templates select="Story/Headline"/></H2>
...
    <DIV Class="copy">
      <xsl:apply-templates select="Story//BodyText"/>
    </DIV>
...
  </xsl:template>
  <xsl:template match="BodyText">
    <P><xsl:apply-templates/></P>
  </xsl:template>
  <xsl:template match="DropCap">
    <DIV Class="DropCap"><xsl:apply-templates/></DIV>
  </xsl:template>
  <xsl:template match="bold">
    <B><xsl:apply-templates/></B>
  </xsl:template>
...
</xsl:stylesheet>
First, the content from the BodyText element is retrieved. This
time, a template for BodyText exists, so it becomes instantiated and is
processed. The only thing the BodyText template does is call
<xsl:apply-templates> to process its children.
And herein lies the problem.
At the time the style sheet was written, it was not
clear whether <apply-templates> should process all
descendants of the current node, or just its immediate children. LotusXSL
assumes that all descendants will be processed. From an author's perspective,
this is better, because it means that you don't have to write a separate
template rule for every element type in your document. Microsoft, on the other
hand, assumed that only immediate child nodes should be processed. That means
you must write a separate template for each and every element type in your
document.
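For comparison, the XSLT 1.0 Recommendation that was published later settles
this question with built-in template rules. Under that Recommendation,
<xsl:apply-templates> without a select attribute selects only the children of
the current node, but a processor behaves as if rules equivalent to the
following were present in every style sheet, so unmatched elements are
descended into and text is copied through. This sketch reflects the later
Recommendation, not the draft that IE5 implements, and it assumes the xsl
prefix is bound to the processor's XSL namespace.

```xml
<!-- Built-in rules as described in the XSLT 1.0 Recommendation. -->

<!-- Elements and the root: emit nothing, just process the children. -->
<xsl:template match="*|/">
  <xsl:apply-templates/>
</xsl:template>

<!-- Text and attribute nodes: copy their string value to the output. -->
<xsl:template match="text()|@*">
  <xsl:value-of select="."/>
</xsl:template>

<!-- Processing instructions and comments: produce nothing. -->
<xsl:template match="processing-instruction()|comment()"/>
```

A processor that supplies these rules gives the "all descendants" behavior the
original style sheet relies on. A processor that omits them requires you to
write the equivalent rules yourself.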
Solving the Mystery
As you might guess, to get the style sheet in
Listing One to work properly in Internet Explorer 5, you must write template
rules for all of your document's element types. Essentially, what you must do
is get your templates to cascade down through the tree in order to touch all of
the elements. To do this, most rules will simply call <apply-templates> to traverse to the next
level of child nodes. If any element type requires special formatting, you can
simply add it to that template.
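The two kinds of rule described above can be sketched as follows. The first
is a pass-through rule that only descends one level. The second adds
formatting before descending. Using the * wildcard to cover every element
type with a single pass-through rule is an assumption here: this handout
writes one rule per element name, so check that your processor accepts the
wildcard, and how it chooses between two rules that match the same element,
before relying on it.

```xml
<!-- Pass-through rule (assumed wildcard form): emits nothing itself
     and only moves on to the next level of child nodes. -->
<xsl:template match="*">
  <xsl:apply-templates/>
</xsl:template>

<!-- Element-specific rule: wraps the element's content in formatting,
     then continues down the tree. -->
<xsl:template match="bold">
  <B><xsl:apply-templates/></B>
</xsl:template>
```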
Listing Two contains an excerpt from the revised
style sheet. The first thing to note in the new example is that the elements
in the XML document have been renamed for readability. In particular, the
BodyText element has been renamed to aBody, and the Story element has been
renamed to article.
Listing Two:
<xsl:stylesheet
...
  <xsl:template match="/">
    <TITLE><xsl:value-of select="article/headline"/></TITLE>
...
    <Span ID="BoxCopy">
      <xsl:value-of select="article/headline"/>
    </Span><BR></BR>
    <DIV Class="aBody">
      <P><xsl:apply-templates select="article//aBody"/></P>
    </DIV>
...
  </xsl:template>
  <xsl:template match="aBody">
    <P><xsl:apply-templates /></P>
  </xsl:template>
  <xsl:template match="para1">
    <P><xsl:apply-templates /></P>
  </xsl:template>
  <xsl:template match="para">
    <P><xsl:apply-templates /></P>
  </xsl:template>
  <xsl:template match="para2">
    <P><xsl:apply-templates /></P>
  </xsl:template>
  <xsl:template match="dropCap">
    <DIV Class="dropCap"><xsl:apply-templates /></DIV>
  </xsl:template>
  <xsl:template match="bold">
    <B><xsl:apply-templates /></B>
  </xsl:template>
  <xsl:template match="italic">
    <I><xsl:apply-templates /></I>
  </xsl:template>
  <xsl:template match="byline[@Email]">
    <A HREF="mailto:mfloyd@BeyondHTML.com"><xsl:apply-templates/></A>
  </xsl:template>
</xsl:stylesheet>
Browsing through Listing Two, you'll notice that the
root template includes a rule to process the aBody element. The pattern article//aBody says "start at the
document element article and select any descendants
that are aBody elements." This allows
aBody elements that are nested
within other elements to be processed. When this <apply-templates> is instantiated, the
processor looks for any templates that match aBody. Since there is an aBody template, it becomes
instantiated. The only statement in this template is an <apply-templates>, which says "process
all immediate child nodes."
Within the tree structure, child nodes of aBody include para1, para, and para2. Once again, the processor
sets out to find templates for these element types and locates a template for
each. The template for para1 simply inserts an HTML
paragraph element (<P>) and calls <apply-templates> to process its child nodes.
Children of the para1 element include the dropCap, bold, and italic elements. Note that there's
also a text child node that represents para1's content. Again templates
exist for each of these element types. In the case of the dropCap template, an HTML <DIV CLASS="dropCap"> element is inserted. Don't
mistake this reference as pointing to an XML element. The CLASS attribute for this <DIV> actually references a CSS <STYLE> rule of the same name,
which is located in the root template (not shown).
Next, the dropCap template also calls <apply-templates> to process its child nodes.
This time, the only child node is a text node representing the element's
content -- in this case, the "W" character. This character is inserted
into the <DIV> element and the processor
moves on to process the other templates.
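To make the walk-through concrete, here is a hypothetical para1 fragment
together with the HTML that the para1, dropCap, and bold rules in Listing Two
would be expected to produce for it. The sentence text is invented; only the
"W" drop cap comes from the example. The result shown assumes that text nodes
are copied to the output, which is the expectation at this point in the
discussion.

```xml
<!-- Source fragment (hypothetical text; "W" is the drop cap) -->
<para1><dropCap>W</dropCap>hen a story is <bold>important</bold>, it leads.</para1>

<!-- Expected result fragment, assuming text nodes are copied through -->
<P><DIV Class="dropCap">W</DIV>hen a story is <B>important</B>, it leads.</P>
```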
You would expect this approach to solve the
mystery: the title, dek, byline, and document text should all appear in
the browser. But the same problem remains: only the solitary word
"by" appears in the window.
It turns out that Microsoft requires that you create
a template rule to process all node types that are not specified as an element
type. This means you must create templates to process attributes, comments,
processing instructions, and yes, text nodes. That seems peculiar since text is
so common that a template for handling it is built in to XSL. Nevertheless, IE
requires that you include the template rule found in Example 1 to display an
element's content. With this tiny bit of code, the mystery is solved.
Example 1:
<xsl:template match="text()">
<xsl:value-of />
</xsl:template>
Presumably, these implementation details were not
made clear when Microsoft wrote its processor. In any case, this template rule
should be included in any style sheet you design for use in IE.
More Q & A
You may want to dynamically build a hypertext link
using XSL, where the target filename is an attribute of an XML element. In
other words, you may need to create something like
<A HREF="target.xml">
where target comes from an XML element
<GOHERE ref="target">
Assuming that you are outputting HTML from XSL (that
is, transforming the XML to HTML), you could simply use a pattern to access the
attribute, then generate an HTML anchor tag using the attribute for the HREF.
The earlier example uses a simpler form of this approach to create a link
in the author's byline: it matches only byline elements that carry an Email
attribute, although the address itself is written out in the template.
The code is shown below:
<xsl:template match="byline[@Email]">
  <A HREF="mailto:jcf@cs.nyu.edu"><xsl:apply-templates/></A>
</xsl:template>
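To take the HREF from the source document instead of writing it out, one
option is the <xsl:attribute> element, which adds an attribute to the element
being generated. The sketch below is a hedged illustration rather than part
of the original style sheet: it assumes the GOHERE element and its ref
attribute from the question above, and that the link text is the content of
GOHERE.

```xml
<!-- Sketch: build <A HREF="target.xml"> from <GOHERE ref="target">. -->
<xsl:template match="GOHERE">
  <A>
    <!-- The attribute value is the ref attribute followed by ".xml". -->
    <xsl:attribute name="HREF"><xsl:value-of select="@ref"/>.xml</xsl:attribute>
    <!-- The element's own content becomes the link text. -->
    <xsl:apply-templates/>
  </A>
</xsl:template>
```

The <xsl:attribute> element has to come before any other content of the A
element. Processors that implement the later XSLT 1.0 Recommendation also
accept the shorthand <A HREF="{@ref}.xml">, but do not assume that form works
in IE5.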
The next question is how to determine whether you are
outputting HTML or XML. After trying the above, the result could be a
stunningly perfect text string
<A HREF="target.xml">xxx</A>
displayed on the screen when launching the XML file,
rather than a link that IE5 has interpreted.
You might conclude that the output tree makes no
distinction between XML and HTML. So is there something to declare,
such as a processing instruction (PI)?
As it turns out, a complete answer involves a
lengthy explanation. First, the interpretation process depends on how you load
the XML stream. For example, you could simply launch the XML file in the
browser and rely on XSL to process the document, or you could load it and
process via the DOM. The important point to keep in mind is that conceptually
there are two trees -- the source tree and the result tree. The source tree is
constructed by parsing the original XML document and placing all elements,
attributes, comments, processing instructions, and so on into your tree
structure.
The result tree is constructed from what is
specified in the XSL style sheet (in this case, transformed HTML). At this
point, the tree nodes represent well-formed XML. If the output is to go to a
file, then the output will look like HTML and can be processed by any HTML
browser. Presumably, IE shortcuts this process. That is, when you launch the
XML file directly in IE, it parses the document into the source tree,
constructs the result tree, reads it, and processes the output as HTML.
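For the launch-in-the-browser case, the declaration that ties the two trees
together is the xml-stylesheet processing instruction at the top of the XML
document. The sketch below assumes the style sheet is saved as news.xsl next
to the document. That file name and the headline text are hypothetical; the
article and headline element names come from Listing Two.

```xml
<?xml version="1.0"?>
<!-- Tells the browser which style sheet to apply.
     news.xsl is a hypothetical file name. -->
<?xml-stylesheet type="text/xsl" href="news.xsl"?>
<article>
  <headline>Placeholder headline</headline>
</article>
```

On the style sheet side, IE5 expects the xsl prefix to be bound to its draft
namespace, that is, <xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">.
Whether markup in the result is rendered as HTML or shown as text also depends
on how the template emits it, so treat this as a starting point for
experimentation.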
Conclusion
One thing you may have learned from all of this is
that despite vendors' best efforts to comply with existing standards, various
XSL engines still exhibit peculiar differences. Part of the problem is that
implementation depends on which version of the standard was used, and what
state it was in at the time. Web developers must still grapple with these
differences.