Category: Humanities Computing

Publishing XML Files on the Web


The Humanities Computing Group within ITS' Academic Computing Services is working with the FAS Department of History on a web-based publication project, The Public Writings of Margaret Sanger. This online edition is part of a much larger endeavor, The Margaret Sanger Papers Project, which includes an already completed microfilm edition of over 9,000 of Sanger's documents, plus a four-volume book edition of Sanger's papers, the first volume of which has been published with the title The Woman Rebel, 1900-1928.

Although, when completed, the online edition will be much smaller than either the microfilm or print edition (600 documents), it represents an important part of the project, as it will create a new level of accessibility to the writings of Sanger, a noted social reformer and birth control advocate in the first half of the 20th century.

Project Director and Editor Dr. Esther Katz, Assistant Director and Associate Editor Cathy Moran Hajo, and Associate Editor Peter C. Engelman have chosen XML as the format in which to store the transcribed Sanger documents. XML was selected because it is the standard storage method for transcribed documents in the field of digital libraries and archives. One reason XML is the standard is its application- and platform-independence, which allows for interoperable storage, exchange, and usage. Since XML files are text files, they can be stored in the file system of any operating system (Windows, UNIX, Linux, Macintosh) and be read and manipulated by any program (word processor, web browser, Java program, Perl script, PHP, etc.).

XML and HTML

Another reason XML is the standard is its allowance for rich, "descriptive" markup. Although HTML can be used to describe the structure of a document, it is typically used to indicate how a document should be displayed, and is limited to demarcation of headers, paragraphs, line breaks, divisions, and tables, plus some "presentational" information such as font style and size. With XML, on the other hand, it is possible to indicate that strings of text are names, dates, places, titles, footnotes, and so on, or to include whatever categorization is desired.

The code below illustrates the differences between HTML and XML.* The first section shows how a document would typically be marked up in HTML:

I think the City of <b>Philadelphia</b> owes a vote of thanks to this Committee for bringing together this brilliant group of speakers to discuss the subject as it has been discussed today. After listening to some of the papers that I heard, I was reminded of something <i>H. G. Wells</i> said last year in talking about the advance of culture and intelligence. He said ["]I have one test of intelligence and that is the attitude of the man or woman on birth control, as that is the real test of intelligence,["] and I think if you heard, as I did, the statements made today and if you compared those statements and the attitude of mind to the same thoughts that were expressed at this sort of a conference years ago, if we would put these thoughts into the hands of the legislators it would be a great help and the opposition would be forced to come to some sort of a conclusion and let us hear what they have to say.

And this is the same text marked up in XML:

I think the City of <place reg="Philadelphia, PA" rend="bold"> Philadelphia</place> owes a vote of thanks to this Committee for bringing together this brilliant group of speakers to discuss the subject as it has been discussed today. After listening to some of the papers that I heard, I was reminded of something <person reg="Herbert George Wells" rend="italics">H. G. Wells</person> said last year in talking about the advance of culture and intelligence. He said "<quote who="WELHE">I have one test of intelligence and that is the attitude of the man or woman on birth control, as that is the real test of intelligence</quote>," and I think if you heard, as I did, the statements made today and if you compared those statements and the attitude of mind to the same thoughts that were expressed at this sort of a conference years ago, if we would put these thoughts into the hands of the legislators it would be a great help and the opposition would be forced to come to some sort of a conclusion and let us hear what they have to say.

Although this is a rudimentary example, it illustrates how XML turns markup from a tool for describing how a document should be displayed into a system for storing data. Given the XML snippet, a program could search the document for every person mentioned in the text by looking for the <person> element, and an index of people mentioned in the text could then be created. This would not be possible with the HTML snippet, in which the name is marked only as italic text.
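As a minimal sketch of such a program, the following PHP uses the DOM extension to collect every <person> element and group the spellings found in the text under the regularized name stored in the reg attribute. The function name buildPersonIndex, the <doc> wrapper element, and the second mention of Wells are assumptions made for illustration; they do not come from the Sanger project.

```php
<?php
// Hypothetical helper: build an index of people mentioned in an XML string.
// Keys are the regularized names (the "reg" attribute); values are the
// distinct spellings that appear in the running text.
function buildPersonIndex(string $xml): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $index = [];
    foreach ($doc->getElementsByTagName('person') as $person) {
        $reg = $person->getAttribute('reg');
        $asWritten = trim($person->textContent);
        // Fall back to the text itself when no regularized form is given.
        $key = $reg !== '' ? $reg : $asWritten;
        if (!in_array($asWritten, $index[$key] ?? [], true)) {
            $index[$key][] = $asWritten;
        }
    }
    ksort($index);
    return $index;
}

// Sample fragment; the <doc> wrapper is an assumption so the string is well formed.
$sample = <<<'XML'
<doc>I was reminded of something
<person reg="Herbert George Wells" rend="italics">H. G. Wells</person>
said last year. Later, <person reg="Herbert George Wells">Wells</person>
was quoted again.</doc>
XML;

foreach (buildPersonIndex($sample) as $name => $forms) {
    echo $name, ': ', implode('; ', $forms), "\n";
}
```

Because the index is keyed on the reg attribute, different spellings of the same person fall under one entry, which is the kind of lookup the descriptive markup makes possible.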

The irony of XML is that one of its greatest strengths, platform- and application-independence, is also one of its weaknesses. Since it is a relatively new technology and since it is not tied to any one program or system, XML can be difficult to work with. The publication of XML files on the Web illustrates this difficulty. Unlike HTML files, which display very nicely across a range of web browsers, documents marked up in XML alone can produce variable and often unsatisfactory results when viewed in a browser.

Figure 1. A sample XML document when viewed directly in Mozilla on Macintosh OS X.

Figure 1 shows one of our Sanger XML documents as it appears when viewed directly in Mozilla on Macintosh OS X. The same document viewed in Internet Explorer on Mac OS X is displayed in Figure 2. While the Internet Explorer view is more informative—displaying, as it does, the XML elements in a hierarchical view—this is rarely what a programmer would want the reader to see.

Figure 2. The same document viewed in Internet Explorer on Mac OS X.

Making XML More Reader-Friendly

W3C (The World Wide Web Consortium), the international Web standards body, has developed and is continuing to develop standards for the publication of XML files on the Web. One method is Cascading Style Sheets (CSS). If you are familiar with HTML, you most likely are aware of Cascading Style Sheets.

A Cascading Style Sheet is a file that allows you to define particular styles for HTML elements. For instance, with CSS you can indicate that you want all <h1> elements to be a certain size, color, and style by specifying it in the style sheet. A typical style definition would look something like:

h1{text-align: center; font-size: 18px; font-weight: bold}

The same holds true for XML. You can use a CSS to indicate how certain tags should be displayed (or if they should be displayed at all). Figure 3 shows what our XML document looks like when it has been associated with a CSS. The result is more pleasing to the eye and in line with what a reader would expect.

Figure 3. The sample document after a Cascading Style Sheet has been applied.

For many people, the use of CSS with their XML documents may suffice for their web publication needs. There are limitations, however. Your XML document can only display its information in the order in which it occurs in the document and, more important, you are relying on your reader to have a browser that supports CSS for XML. As of now, browser support of CSS for XML is a bit sketchy; therefore, it is risky to rely on client-side technology for your site to work.

There are other limitations to CSS. There are times when you may want to rearrange the order in which the information in your XML document appears. For instance, the Sanger documents contain a header with metadata (information about the data) concerning author name and date of publication.

<mepHeader>
<prepDate>2002-08-05 KH</prepDate>
<prepDate>2002-08-15 KH</prepDate>
<docAuthor>Margaret Sanger</docAuthor>
<docDate value="1929-02-27">27 Feb 1929</docDate>
<docTitle>
<titlePart>Birth Control</titlePart>
</docTitle>
<idno>msp101861</idno>
<sourceDesc>
<bibl>
<title>Margaret Sanger Papers, Library of Congress</title>
</bibl>
<bibl><title>LCM</title> 128 :0446-047</bibl>
</sourceDesc>
<respStmt>
<resp></resp>
<name id="WELHE">Herbert George Wells</name>
<name id="ASTNA">Nancy Witcher Langhorne Astor</name>
</respStmt>
</mepHeader>

This information occurs at the beginning of the XML document but needs to appear in different places when displayed. There are also situations in which you want particular conditions to control which data in the XML document is displayed. For instance, you could have an XML document containing a list of names and addresses. What if you want to display only those names and addresses from a certain city or zip code?

This is where XSLT comes in. XSLT stands for Extensible Stylesheet Language Transformations, a powerful style sheet language developed by W3C to work with XML. XSLT is too powerful and too complex to describe in great detail here, but it works like CSS, in that a style sheet is applied to an XML file.

It is also similar to CSS in that it operates on XML elements specified in the style sheet. However, it does much more than apply styles to these elements. It can apply actions to these elements, such as transforming the element name. It can use programming logic, such as "if/then" conditionals. Most important, it can transform an XML file and output it as another XML file or as an HTML or text file (or, by way of the companion XSL Formatting Objects vocabulary and a formatter, as a PDF file), or even stream the output to another program, such as a web browser. Today, XSLT is typically used to transform an XML document into an HTML file that can be easily viewed in a web browser, eliminating the need for any special client-side technology to view the XML documents.
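To make the idea of conditional logic concrete, here is a hedged sketch of the address-list case mentioned earlier: a style sheet that outputs only the names from one city. PHP is used simply as a convenient way to run the style sheet, and the sketch assumes a PHP installation with the DOM and XSL extensions, which provide the XSLTProcessor class. The element names <addresses>, <entry>, <name>, and <city>, the sample data, and the function namesInCity are all invented for the example.

```php
<?php
// Assumption: PHP's DOM and XSL extensions are installed (XSLTProcessor).
// The address-list vocabulary and data below are invented for this sketch.
$xml = <<<'XML'
<addresses>
  <entry><name>A. Reader</name><city>New York</city></entry>
  <entry><name>B. Writer</name><city>Boston</city></entry>
  <entry><name>C. Editor</name><city>New York</city></entry>
</addresses>
XML;

$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:param name="city"/>
  <xsl:template match="/addresses">
    <xsl:for-each select="entry">
      <!-- The "if/then" test: keep only entries from the requested city -->
      <xsl:if test="city = $city">
        <xsl:value-of select="name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:if>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical helper: run the style sheet with the city passed as a parameter.
function namesInCity(string $xml, string $xsl, string $city): string
{
    $source = new DOMDocument();
    $source->loadXML($xml);
    $sheet = new DOMDocument();
    $sheet->loadXML($xsl);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($sheet);
    $proc->setParameter('', 'city', $city);
    return (string) $proc->transformToXml($source);
}

echo namesInCity($xml, $xsl, 'New York');
```

With the sample data, only the two New York entries are listed. The same XML file could be run again with a different parameter value, without editing the data or the style sheet.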

XSLT is based on templates. A template is created for a specified XML element. When the style sheet finds that element in the XML document, the template is applied. For instance, there is an element, <unclear>, used in the Sanger documents to indicate when a word is unclear in the original document. Perhaps the handwriting is illegible or the print has been smeared. In the Sanger XML file it looks like this:

<unclear>some text here</unclear>

In the web browser it should display like this: [some text here?].

The HTML code needed for this output is:

[<em>some text here</em>?]

To complete this transformation, XSLT uses code that looks like this:

<xsl:template match="unclear">
[<em>
<xsl:apply-templates/>
</em>?]
</xsl:template>

This translates to: "Whenever you come across the <unclear> element, print '[<em>' and then apply templates." Applying templates means searching within the <unclear> element to see whether it contains any other elements that need templates applied; if not, the text within the <unclear> element is simply printed out. Finally, print '</em>?]'. This transforms <unclear>some text here</unclear> into [<em>some text here</em>?], which, when viewed in a web browser, looks like: "[some text here?]".
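The template can be tried end to end with a few lines of PHP. This is a sketch under assumptions: it relies on the DOM and XSL extensions of a current PHP (the XSLTProcessor class), the surrounding <p> element is invented sample data, the function name renderWithStylesheet is hypothetical, and the template is written on one line so that no line breaks from the style sheet end up inside the brackets.

```php
<?php
// Assumption: PHP's DOM and XSL extensions (XSLTProcessor) are installed.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <!-- The <unclear> template from the text, written on a single line -->
  <xsl:template match="unclear">[<em><xsl:apply-templates/></em>?]</xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical helper: apply a style sheet (as a string) to an XML string.
function renderWithStylesheet(string $xml, string $xsl): string
{
    $source = new DOMDocument();
    $source->loadXML($xml);
    $sheet = new DOMDocument();
    $sheet->loadXML($xsl);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($sheet);
    return (string) $proc->transformToXml($source);
}

// Invented sample: text outside <unclear> is copied by XSLT's built-in rules.
$sample = '<p>The word <unclear>some text here</unclear> was smeared.</p>';
echo renderWithStylesheet($sample, $xsl), "\n";
```

Elements that have no template of their own, such as the <p> wrapper here, fall through to XSLT's built-in rules, which copy their text and keep looking for matching elements inside them.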

That was a very long explanation for such a simple thing, but once you learn XSLT and discover that you can do very powerful transformations, the promise of XML becomes clear. Your XML document truly becomes a data source from which you can produce any sort of document.

How do you perform these transformations? The style sheet does not just magically transform the XML file into another XML or HTML file. Theoretically, you should be able to link to any XSLT file from your XML document and allow the web browser to perform the transformation in real time, similar to the way CSS works. This, however, is not the case in reality. Only certain web browsers—and, typically, only the most recent versions of these—successfully perform XSLT transformations. The hope is that someday all web browsers will have the built-in functionality to run XSLT transformations, but at present, you can't rely on this.

In the meantime, you need to use an intermediate piece of software called an XSLT processor that can read the XML file and the XSLT file and output a third file, in our case, HTML. When we were initially developing the style sheet for the Sanger project, we used an XSLT processor named Saxon. Saxon is written in Java and in its simplest form runs as a command-line program. If you are ambitious, you can write your own GUI, but for our purposes the command line worked well.

We used Saxon to test our style sheets by transforming some sample XML documents into HTML files and viewing the HTML files in a web browser. When we were satisfied with the style sheet, we began to look for a more robust solution, preferably one that would do the transformations on the fly. That is, we wanted a person to be able to visit the site and request a certain Sanger document, and have the XML file transformed on the fly and sent to the web browser as a stream of HTML.

That way, as more XML files were added to the collection or changes were made to the files, they would not have to be transformed manually into HTML, and the reader would always get the most recent versions of Sanger documents. Of course, if your documents will be static, you can do the XSLT transformations ahead of time (with Saxon or another processor) and simply store the HTML files on your server for viewing by the reader.
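For the static case, the ahead-of-time step could be scripted rather than run by hand. The following is only a sketch: it uses PHP's XSLTProcessor class (from the DOM and XSL extensions) instead of Saxon, and the function name transformDirectory and the directory layout are assumptions, not a description of the Sanger site.

```php
<?php
// Assumption: PHP's DOM and XSL extensions (XSLTProcessor) are installed.
// transformDirectory is a hypothetical helper: it transforms every .xml file
// in $xmlDir with one style sheet and writes .html files into $htmlDir.
function transformDirectory(string $xmlDir, string $xslFile, string $htmlDir): array
{
    $sheet = new DOMDocument();
    $sheet->load($xslFile);
    $proc = new XSLTProcessor();
    $proc->importStylesheet($sheet);

    $written = [];
    foreach (glob($xmlDir . '/*.xml') ?: [] as $xmlFile) {
        $source = new DOMDocument();
        if (!$source->load($xmlFile)) {
            continue; // skip files that are not well-formed XML
        }
        $html = $proc->transformToXml($source);
        if ($html === false || $html === null) {
            continue; // the transformation itself failed
        }
        $target = $htmlDir . '/' . basename($xmlFile, '.xml') . '.html';
        if (file_put_contents($target, $html) !== false) {
            $written[] = $target;
        }
    }
    return $written; // paths of the HTML files that were written
}

// Example call (all three names are placeholders):
// transformDirectory('xml', 'sanger.xsl', 'html');
```

The trade-off is the one described above: the script has to be rerun whenever a document or the style sheet changes, whereas an on-the-fly transformation always reflects the current files.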

Publishing HTML from XML on the Fly

One possibility we considered was to write a Java applet that would be embedded in a web page and use Saxon to perform the transformations. After some research, however, we decided there was a solution that would require less application programming on our side. That solution was to use PHP with its XSLT extension.

What is PHP? To quote the PHP website (http://www.php.net), "PHP is a widely-used general-purpose scripting language that is especially suited to web development and can be embedded into HTML." It is typically used in conjunction with the MySQL database management system and Apache web server to create dynamic websites. It is similar to JSP and the commercial web applications Lasso, Active Server Pages, and Cold Fusion. It is also similar in some respects to Perl and Python.

We have been using PHP in the ITS Humanities Computing Group for over a year for other projects with great success, and it was natural for us to investigate its XSLT capabilities. Lucky for us, PHP can be configured to perform XSLT transformations.

The installation of PHP that comes with most Linux distributions does not have the XSLT extension enabled, so some configuration is needed. First, the Sablotron XSLT processor must be installed. The developers of the XSLT extension for PHP hope to support other processors in the future, but for now Sablotron must be used. Then, PHP must be recompiled and installed to take advantage of the XSLT processor. For detailed instructions on this installation process, see http://www.nyu.edu/its/humanities/docs/php_xslt.html. Although PHP is typically run on Linux and Unix servers, it can also be installed on Windows as needed.

Once PHP is configured for XSLT transformations, the coding is quite easy. The lines of code that perform the process are simply:

$xh = xslt_create();
xslt_process($xh, "sample.xml", "sample.xsl", "sample.html");

The first line creates a new XSLT processor and returns a handle to it. The handle is passed to later calls, including those that return errors and other feedback. The next line processes an XML document called "sample.xml", using an XSLT file called "sample.xsl", and outputs an HTML file called "sample.html". What makes PHP easy to use is that its code exists in a file with a .php extension but can be embedded in HTML and accessed through a web browser. This code can easily be added to other PHP and HTML code to produce a page that accepts variables and generates HTML on the fly.

The following is the full PHP code needed to publish an XML document on the fly, with explanatory comments:

<?php
/* Allocate a new XSLT processor. */
$xh = xslt_create();

/* Process an XML document using the XSLT file named "test.xsl". The
   argument for the XML document to be processed is not a static file name,
   but instead a variable. The variable value is passed via a URL to this
   PHP page (http://www.foo.edu/transform.php?var=filename.xml). There is no
   HTML file output. Instead, the HTML is returned in the variable $result. */
$result = xslt_process($xh, $_GET['var'], "test.xsl");

/* If the transformation is successful ($result has a value), print out
   $result. When PHP prints $result, it is simply sending HTML to the
   browser as a stream. */
if ($result) {
    print $result;
}
/* If the transformation did not work, print out an error message. */
else {
    print "Sorry, " . htmlspecialchars($_GET['var'])
        . " could not be transformed by test.xsl. The reason is "
        . xslt_error($xh) . " and the error code is " . xslt_errno($xh);
}

/* Release the processor. */
xslt_free($xh);
?>

These few lines of code allow you to do on-the-fly, server-side XSLT transformations that create nicely formatted HTML documents suitable for the Web. With a little more code, our final product looks very nice indeed. Those awkward XML files we saw in the beginning are now transformed and displayed to the reader as shown in Figure 4.
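One caution about the script above: it passes the var value from the URL straight to the XSLT processor as a file name, so a visitor can request any path the web server is able to read. A minimal sketch of one way to guard against that is to accept only plain file names and to resolve them inside a single directory. The function name resolveDocument, the naming pattern, and the documents directory are assumptions for illustration, not part of the Sanger site.

```php
<?php
// Hypothetical helper: map a requested name to a file inside $baseDir,
// or return null if the request should be refused.
function resolveDocument(string $requested, string $baseDir): ?string
{
    // Accept only simple names such as "msp101861.xml": letters, digits,
    // underscores, and hyphens, followed by ".xml". Slashes and ".." never match.
    if (!preg_match('/^[A-Za-z0-9_-]+\.xml\z/', $requested)) {
        return null;
    }
    $path = rtrim($baseDir, '/') . '/' . $requested;
    return is_file($path) ? $path : null;
}

// Usage inside the page (the directory name is a placeholder):
// $file = resolveDocument($_GET['var'] ?? '', __DIR__ . '/documents');
// if ($file === null) {
//     http_response_code(404);
//     exit('No such document.');
// }
// ...then pass $file, rather than $_GET['var'], to the XSLT processor.
```

Checking the name against a fixed pattern is a deliberately strict choice; a site with a different file-naming scheme would need to adjust the pattern accordingly.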

Figure 4. The sample XML document after on-the-fly, server-side XSLT.

There are other solutions for publishing XML documents on the Web. You can work with the application program interfaces (APIs) provided with Saxon and develop your own server-side transformation system, but this involves more application development time and experience programming in C, C++, or Java. You can also use the Apache suite of products, including Cocoon, the XML application server, but these products are still in the early stages of development and are not 100% reliable, and they also require working with an API to develop your own interface.

PHP is a nice solution because it is free, is already installed on most Linux machines, also runs on Windows, is relatively easy to configure for use with XSLT, is very easy to code, and already works in conjunction with an Apache web server. With very little coding and configuration, you can create a fast and robust XML publishing system.

* Text source: Margaret Sanger Papers, Library of Congress, LCM 128:0446-047.

Author Biography

Matthew Zimmerman is a Humanities Computing Specialist in ITS' Academic Computing Services. He can be reached at matthew.zimmerman@nyu.edu.

Page last reviewed: November 5, 2003. All content © New York University.