
 

 

V. Digitization and Encoding of Text

 

Introduction

Digitized text is an important component of many cultural heritage projects, but the question of how to digitize it—what format, how much detail, what user activities to support—is a complex one. As with the digitization of other kinds of materials, you must consider a number of factors including the nature of the original material, the purpose of the digitized version, and the availability of relevant expertise, technical support, and funding.

It is also important to be aware of the purposes and limitations of the various digital formats available, to be sure not only that they suit your current goals and requirements, but also that they do not restrict your options in the future. Digital text formats vary considerably in the ease with which they can be converted to other formats, and in the variety of output methods they support. Proprietary systems such as word processing or page description formats (e.g. Microsoft Word, PDF) may be powerful and convenient tools for creating printed output, and may also allow for web publication of the results (via a “save as HTML” function), but if you need to move your data to another software platform you risk losing formatting and other information. Because such systems depend on the existence of proprietary software—whose licensing terms and very existence cannot be counted on over the long term—they are unsuited for archival purposes or for the creation of durable cultural resources.

In the past ten years, there has been a rapid growth in standards-based methods of text digitization using Standard Generalized Markup Language (SGML) and more recently its derivative, Extensible Markup Language (XML). These approaches avoid the problems of proprietary software, offering data longevity and the flexibility to move from platform to platform freely. There are now increasing numbers of tools for creating, editing, publishing, and manipulating textual materials encoded in this way, and this trend is likely to continue. Our treatment of digital text will therefore focus on standards-based methods of text digitization, which offer the best long-term solution to the needs of projects creating digital cultural heritage collections.

It may be useful to establish some basic terminology at the start. By “digitization” we mean any process by which information is captured in digital form, whether as an image, as textual data, as a sound file, or any other format. When speaking of the digitization of documents, the term may refer either to the capture of page images—merely a picture of the document—or to the capture of a full-text version, in which the document is stored as textual characters. In its most minimal “plain-text” form, a full-text version of a document may be simply that: the text of the document expressed as ASCII or Unicode characters and nothing more. Unlike a page image, such a document can be searched for particular words or phrases, but does not convey any information about the original appearance or structure of the document. An “encoded” version of the same document will include additional information or “markup” of various kinds, expressing the document’s structure, its formatting, or other information its creators wish to capture. Although strictly speaking, we can use the terms “markup” and “encoding” to refer to a wide range of added information—including word processor formatting codes or encryption—these words are now most frequently used to refer to SGML or XML markup. And although this kind of markup is usually applied to full-text documents, it is also possible to embed page images in an SGML or XML-encoded document structure, or to pair images with encoded information such as subject keywording, publication data, or administrative metadata.

 

Definition Box:

Definitions

Page Image:     A digital image of a page of text, captured by a scanner or digital camera, and expressed as a set of pixels in a format such as JPEG or TIFF.

Encoding, markup:     In this context, the process of adding information to a digital text by including markup (usually SGML or XML) which explicitly identifies structural and other features of the text. The term “markup” refers to the added information. In a broader sense, encoding may refer to any kind of added information or algorithmic transformation which, when applied to a data file, enables it to perform some special function.

Keying: A process by which a person manually types, or ‘keys,’ text from source page images, original printed materials, photocopies, or microforms.

OCR:     Optical Character Recognition, a process by which software reads a page image and translates it into a text file by recognizing the shapes of the letters. The accuracy of the result varies and is difficult to predict. OCR-generated text tends to be described as either “uncorrected” (or “raw”) or “corrected.”

Metadata:     Strictly speaking, any data that is about other data; in this context, more specifically, the term usually refers to information describing a data file or document (for instance, publication information, revision history, data format, rights status).

SGML:     Standard Generalized Markup Language, an international standard (ISO 8879) since 1986, is a metalanguage which can be used to define markup languages.

XML:     Extensible Markup Language, a subset of SGML which was published as a W3C recommendation in 1998.

TEI:     The Text Encoding Initiative, an international consortium that publishes the TEI Guidelines for Electronic Text Encoding and Interchange, an SGML- and XML- compliant encoding language for the capture of literary and linguistic texts, widely used in the scholarly and cultural heritage communities.

EAD:     Encoded Archival Description, an SGML- and XML-compliant encoding language used for the capture of archival finding aids, widely used in the library and archival communities.

METS: Metadata Encoding and Transmission Standard, an XML-compliant standard for encoding a variety of metadata about digital library objects.

 

Once you have selected the materials you wish to digitize (and decided whether your text digitization strategy will include producing page images), there are several methods to choose from.

Page images are produced by scanning the text. Optical character recognition software reads such an image and creates a full-text version of the document by identifying individual character shapes and translating them into actual letters. Note that there are many factors affecting the accuracy of the OCR process, including the contrast of the original document, the fonts used, etc. Alternatively, if you have decided that page images are not needed, you can key text directly from a variety of sources. The formats available, and the handling policies pertaining to them, will dictate whether keying will be done from original materials, photocopies, or microforms. You can have the document keyed by project personnel or by a data capture service. Markup can be added during the keying process or subsequently. Increasingly, data capture services are willing to cater for specialized encoding needs, even using complex markup languages like those in the guidelines of the Text Encoding Initiative (TEI) and Encoded Archival Description (EAD); see the more detailed descriptions below and the URLs at the end of the chapter. It may also be possible to automate the encoding to some extent; how much will depend on several factors, including the consistency and predictability of your data, the level of encoding you require, and the amount of time and ingenuity you are willing to spend developing automated encoding tools. Some tools for this purpose exist already, and it is worth checking with existing encoding projects to see whether they may have something which will accomplish what you need. Finally, you may be able to obtain a digitized version of the materials you need from another project, such as the Oxford Text Archive or the University of Virginia Etext Center, either in plain text or with basic text encoding and metadata.

A few concrete scenarios may help to illustrate the differences between these various approaches, and the kinds of project goals each one is best suited to support. First, consider a fairly straightforward example: a project working with a collection of Victorian novels and poetry for which it owns copies of the materials to be digitized (the Victorian Women Writers Project, http://www.indiana.edu/~letrs/vwwp/, is an example of such a project). Since the materials are fragile and of historical interest, capturing page images makes sense both as a form of digital preservation, and in order to give readers access to an image of the original materials for research purposes. (However, given the uniformity of these texts and the general unremarkability of the pages, one can also imagine deciding against this step, particularly if the project did not own copies of the materials.) In addition, since these are literary documents, researchers will want to work closely with the text, so capturing a full-text version also makes sense. Adding TEI markup will enable readers to perform complex searches which take into account the genres and local textual structures which may be of interest. Since the page image has already been digitized, and since the materials in question are typographically regular, OCR is a good option for capturing the text, but since the project's audience is a scholarly one, careful proofreading will also be necessary to ensure that requirements for accuracy have been met. The question of how deeply to mark up the documents requires careful balancing of costs and desired function. Basic TEI markup can be applied nearly automatically using scripts, but more detailed encoding requires additional staff, training time, and additional review for encoding consistency. The additional benefit to researchers would have to be substantial to justify this cost.
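As a rough illustration of what applying basic markup "nearly automatically using scripts" can look like, the sketch below wraps each blank-line-separated block of plain text in <p> tags. It is not taken from any project's actual tooling: the function name paragraphs_to_markup and the sample text are invented for this example, and a real script would need to handle headings, verse, page breaks, and similar features.

```python
from xml.sax.saxutils import escape


def paragraphs_to_markup(plain_text):
    """Wrap each blank-line-separated block of text in <p> tags."""
    blocks = [block.strip() for block in plain_text.split("\n\n")]
    # Collapse internal line breaks and escape &, < and > so that the
    # output remains well-formed markup.
    return "\n".join(
        "<p>" + escape(" ".join(block.split())) + "</p>"
        for block in blocks
        if block
    )


# Invented sample text, standing in for proofread OCR output.
sample = "It is now upwards of\ntwelve Months\n\nTom & Polly are well"
print(paragraphs_to_markup(sample))
```

Even a script this simple shows why consistency in the source matters: it only works if every paragraph break in the captured text is marked by a blank line.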

Consider another project in which a large quantity of heterogeneous materials must be digitized quickly, and made accessible for political and historical research (for instance, the Forced Migration Online collection at the Refugee Studies Project Centre at Oxford, http://www.rsc.ox.ac.uk/TextWeb/fmo.html). Retrieval is acknowledged as an important goal, but at the same time the nature of the individual documents does not seem to warrant complex markup; most users want to find documents relevant to their research, but are not interested in the texts' internal structure. Furthermore, given the volume of documents and the urgency of the project, costs and speed are important factors. The project's designers therefore decide to digitize all the documents as page images; since the documents vary so much and are likely to be unfamiliar, it is important for readers to see an image of the original. In addition, to support detailed retrieval at the topic level, they decide to OCR the text, but because of the volume and cost issues, they choose not to proofread each document. Since users are searching through such large volumes of material, the error rate of uncorrected OCR still provides acceptable accuracy when used only for indexing and query purposes. Users will see only the page images, not the full-text version, so the errors will not be visible. In addition, the project creates basic metadata for each document, to allow for accurate tracking and high-level retrieval.

These two kinds of projects have arrived at what look like fairly similar approaches (page images, text captured with OCR, and metadata), though for very different reasons. Consider now a third project, whose materials are medieval manuscripts to be digitized for literary and linguistic study. (See, for example, The Electronic Beowulf project, http://www.uky.edu/~kiernan/eBeowulf/guide.htm.) As before, there are likely to be important reasons to include images of manuscripts, but their uniqueness and fragility magnify the legal and logistical challenges. Because of the nature of the research to be supported, a full-text version is essential, but OCR is probably out of the question. Instead, the project needs to choose an appropriate method for keying in the text. For medieval manuscripts, this may require a scholar familiar with the manuscript to perform or oversee the transcription. Some materials may be readable enough by a non-expert to allow for keyboarding by a data capture service, or by locally trained encoders who are not subject specialists. Finally, to support literary and linguistic analysis, the project needs to use a detailed markup scheme such as the TEI, which will allow for the capture of features such as regularized or lemmatized readings of words, morphological and syntactic information, prosodic structures, textual variants, ambiguous readings, and so forth. With page images linked to this level of encoded full text, scholars can compare versions of the manuscript, check difficult readings against the image, search for particular linguistic features and compare their occurrence from poem to poem or from manuscript to manuscript. The collection is a full-fledged scholarly tool of extraordinary power, but it also requires considerable resources and expertise to create.

Projects planning to capture a full-text version of the document should also refer to the sections on OCR (Optical Character Recognition) and Keyboarding in Section VI, Images, in addition to the sections below.

 

Character encoding

Even a plain-text document without markup contains some very basic encoding of the character information in which the document is expressed. There are two principal character encoding systems which deserve brief discussion here: ASCII and Unicode.

ASCII—the American Standard Code for Information Interchange—was proposed by ANSI (the American National Standards Institute) in 1963, and finalized in 1968. It was designed to facilitate compatibility between various types of data processing equipment. ASCII assigns the 128 decimal numbers from 0 to 127 to letters, numbers, punctuation marks and common special characters. This is commonly referred to as the ‘low ASCII character set’. Various extensions to the ASCII character set have been created over the years to assign the decimal numbers between 128 and 255 to special, mathematical, graphical and non-Roman characters. If you wish to provide texts in a ‘plain text format’, i.e. with the file extension “txt”, then you must use the low ASCII character set. The extended ASCII character set has limitations that do not apply to the lower set and must be used with more caution. There is also more than one extended character set (IBM and Microsoft each have their own) and this diminishes interoperability.
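The interoperability problem with extended character sets can be seen in a short sketch. Here the same byte value (decimal 233) is read under two code pages, "cp437" (the original IBM PC set) and "cp1252" (a Microsoft Windows set), which are chosen only as representative examples; the low ASCII text, by contrast, uses no value above 127.

```python
# The same byte value read under two different "extended ASCII" code pages.
byte = bytes([0xE9])                 # decimal 233, outside the low ASCII range

as_ibm = byte.decode("cp437")        # original IBM PC code page
as_windows = byte.decode("cp1252")   # Microsoft Windows Western European code page

low_ascii = "Plain text".encode("ascii")   # every byte falls in the range 0-127

print(repr(as_ibm), repr(as_windows))
print(max(low_ascii))
```

The two decoded characters differ, so a file that relies on values above 127 cannot be read reliably unless the reader knows which extension was used.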

There is some argument to be made for always providing a plain text format without markup, whatever other encoding scheme is used, especially in the creation of electronic corpora. There is not much extra work involved in making the texts available in this simplified form, especially as plain ASCII texts are often the most common starting point for creating other encoded text. Certainly encoded texts (for instance, SGML/XML, COCOA, or other formats) should be created and stored in plain ASCII rather than in a proprietary system such as a word-processing format, to guarantee their longevity and platform-independence. Projects such as the Thesaurus Musicarum Latinarum (TML), where Latin musical notation has been encoded using the low ASCII character set, demonstrate that encoding complex features, while maintaining a high degree of interoperability and longevity, can be achieved with ASCII characters. This high level of interoperability is particularly important for the TML because its small but highly dispersed user base uses a wide variety of hardware and software. Furthermore, the small file size of ASCII text files means the output of the project can be distributed quickly and easily.

Unicode is an international standard (implementing ISO 10646) which was developed to overcome problems inherent in earlier character encoding schemes, where the same number can be used for two different characters or different numbers used for the same character. Unicode assigns a unique number to each character; in addition, it is multi-platform, multi-language and can encode mathematical and technical symbols. It is especially useful for multilingual texts and scripts which read right to left. Originally there were 65,000 characters available; now there are three encoding forms which can be used to represent over 1,000,000 characters.
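A brief sketch may make the idea of a unique number per character, and of the three encoding forms (UTF-8, UTF-16, and UTF-32), more concrete. The three sample characters are chosen arbitrarily for illustration; the byte-order-specific codec names ("utf-16-le", "utf-32-le") are used only so that no byte-order mark is added to the counts.

```python
# Each character has exactly one Unicode number (code point), but the
# number of bytes used to store it depends on the encoding form.
samples = {
    "Latin small letter e with acute": "\u00e9",
    "Hebrew letter alef": "\u05d0",
    "Ugaritic letter alpa": "\U00010380",   # beyond the original 16-bit range
}

for name, char in samples.items():
    sizes = {
        form: len(char.encode(form))
        for form in ("utf-8", "utf-16-le", "utf-32-le")
    }
    print(f"{name}: U+{ord(char):04X} {sizes}")
```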

 

Link Box for character encoding:

Unicode: http://www.unicode.org/

ASCII: http://www.asciitable.com/

Other formats which were used in the past are worth remembering:

 

Text markup

Text markup is a fundamentally interpretive activity, in ways that are both powerful and challenging. With a well-designed encoding language, you can express the facts and insights which are most useful for your purposes in working with the text. However, the interpretive quality of text encoding should not be taken to mean that encoded texts are purely, or merely, subjective creations. The most widely-used humanities text encoding languages such as TEI or EAD have been developed by particular communities whose needs and methodological assumptions they reflect.[1] These languages are designed to allow the representation of the significant information these communities want to capture. This information includes not only basic facts—or assumptions so deeply shared that there is no disagreement about them—but also significant interpretive statements, such as those of critical editing or archival description. One important role of the encoding language is to enable such statements to be made in a rigorous and consistent way, according to the practices of the community within which they are to be used. Text encoding thus makes it possible to bridge the gap between local research and insight and the discourse of the larger community, and to articulate interpretative statements in a way that is broadly intelligible.

The example given here shows a historical letter encoded using the guidelines of the Model Editions Partnership (MEP, http://mep.cla.sc.edu). As with all SGML-encoded documents, each textual element is enclosed within tags (set off by angle brackets) which mark its beginning and ending. The encoded text begins with a header (here abbreviated for simplicity) which contains the document metadata. Following this, the body of the document (encoded as <docBody>) contains first a heading, dateline, and salutation, and then the main text of the letter encoded as a series of paragraphs. At the end is a postscript with its own dateline. The elements and their grouping reflect the interests and research needs of documentary historians, for whom the MEP encoding scheme is designed.

 

Example Box:

A Sample Fragment of an SGML-Encoded Text (some encoding omitted for clarity)

<doc>

<mepHeader> ... </mepHeader>

<docBody>

<head>To <addressee>Martha Laurens</addressee></head>

<dateline>

<place>Charles Town</place>, <date>August 17, 1776</date>

</dateline>

<salute>My Dear Daughter</salute>

<p>It is now upwards of twelve Months ...</p>

...

<p>You will take care of my Polly too ...</p>

<signed>your affectionate Father</signed>

<ps>

<dateline><date>19th</date></dateline>

<p>Casting my Eye over ...</p>

</ps>

</docBody>

</doc>
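To suggest how software can make use of this kind of markup, the following sketch reads a condensed copy of the fragment above and pulls out the addressee, the place, and the dates. It assumes the fragment is also well-formed XML (which this abbreviated version happens to be) and uses a general-purpose XML parser rather than any MEP-specific tool.

```python
import xml.etree.ElementTree as ET

# A condensed copy of the sample letter; the "..." elisions are kept as text.
letter = """<doc>
<mepHeader> ... </mepHeader>
<docBody>
<head>To <addressee>Martha Laurens</addressee></head>
<dateline><place>Charles Town</place>, <date>August 17, 1776</date></dateline>
<salute>My Dear Daughter</salute>
<p>It is now upwards of twelve Months ...</p>
<signed>your affectionate Father</signed>
<ps><dateline><date>19th</date></dateline><p>Casting my Eye over ...</p></ps>
</docBody>
</doc>"""

root = ET.fromstring(letter)

# Because the features are named explicitly, they can be retrieved by name.
addressee = root.findtext("docBody/head/addressee")
place = root.findtext("docBody/dateline/place")
dates = [date.text for date in root.iter("date")]   # main dateline, then postscript

print(addressee, place, dates)
```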

 

The balance between disciplinary constraint and local expression is managed in SGML and XML through the document type definition (DTD), which is a formal statement of the tags permitted in the encoding system (i.e. what textual structures it is capable of naming) and how they may be nested. Different DTDs handle this balance very differently, depending on their intended purposes. Structurally, DTDs may be quite strict, specifying with great precision the order and nesting of tags, or they may be quite lenient, allowing greater latitude to the encoder. Similarly, in their definition of a tag vocabulary, DTDs may provide for very nuanced distinctions between features, or they may use fewer, more generally defined elements. Finally, DTDs may be constructed in an attempt to anticipate and codify all the possible encoding situations in the domain they cover, or they may instead provide methods for the encoder to handle unforeseen circumstances. All of these approaches have their potential uses, and in choosing or designing an encoding system it is essential to understand the nature of your own material and goals, so that you can choose appropriately. Although a structurally strict encoding system may seem to limit one's options unnecessarily, in fact such a system can be valuable in constraining data entry and in ensuring that identically structured documents are all encoded alike. A more lenient DTD is superficially easier to work with, but through its flexibility it also opens up the likelihood of inconsistency and of time-consuming debate about which option to choose in a given circumstance. A DTD with a rich lexicon of tags may be essential for describing certain kinds of textual features in detail, but is an encumbrance when only a simple encoding is required.

DTDs and the encoding systems they represent can often be adapted by the individual project to suit local needs, and this is particularly true (and may be particularly useful) in the case of large, multi-genre DTDs like the TEI. However, one important function of the DTD is to allow for comparison and interoperation between collections of encoded data. By changing it, you in effect secede from the encoding community of which that DTD is the expression, and you diminish the possibility of using common tools for analysis, display, and retrieval. Some encoding systems are designed to be extensible; for instance, the TEI provides an explicit mechanism by which individual projects may define TEI-conformant tag sets which are adaptations of the TEI encoding scheme. When done with care—and preferably in concert with other projects with similar needs—such adaptations may improve an existing encoding system while avoiding the disadvantages described above. For projects dealing with highly idiosyncratic data, or projects attempting to capture features for which no encoding system exists, adaptation may be simply unavoidable. In such cases, you should be prepared to think through your changes carefully and document them thoroughly.

 

SGML and XML

Although there are a number of encoding systems which have been developed for humanities computing use over the past few decades—many of them still in use—Standard Generalized Markup Language (SGML) and its derivative, Extensible Markup Language (XML) deserve particular attention here both because they are so widely used and because they should be. As international standards,[2] they receive a level of attention from software developers and from the standards community which guarantees their comparative longevity, and because they are non-proprietary, they can be used to create archival collections which are free from software or hardware dependencies, and hence less prone to obsolescence.

Strictly speaking, SGML and XML are metalanguages: systems for defining encoding languages. Because they provide a standardized method for specifying things like how a tag is delimited or how its structural definition is written, software written to this standard can be used with documents encoded with any SGML or XML-conformant encoding language, regardless of the particular tag set or the kinds of data they contain. The most significant text encoding systems for cultural resources—TEI, EAD, CIMI, METS, and others—are all written in SGML and XML, as is the ubiquitous HTML. There also exist SGML and XML versions of data standards like MARC.

The advantages of SGML and XML, as suggested above, stem partly from their status as international standards. In addition, because this kind of encoding allows for the complete separation of structure and presentation, SGML/XML-encoded documents can be repurposed or used as a base format from which to derive specific versions for different purposes: word-processing files for printing, HTML for delivery on the web, Braille output for the visually disabled, and so forth. SGML/XML encoding is particularly valuable for the kinds of cultural heritage work covered in this report, because it permits the description of the text's constituent parts in terms which are meaningful for retrieval and intellectual analysis. We might express this whimsically by saying that an encoded document “knows” what its own parts are in the same way that a scholar or reader does: concepts like “heading” and “quotation” and “poem” and “author” are accessible as primary terms of analysis (assuming they are part of the encoding language used). Furthermore, unlike some of the earlier encoding languages that were designed to avoid verbosity at all costs, SGML/XML encoding is actually fairly easy to understand once the eye becomes accustomed to seeing the tags. Encoding languages like TEI and EAD use tag names which are expressive of their function—<note>, <author>, <name>, <list>, <quote>, and the like—and because they represent the ideas people actually have about documents, they quickly become intelligible even to the untrained reader.
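The separation of structure and presentation can be sketched in a few lines. In the example below, one encoded source is rendered twice, once as HTML and once as plain text, without altering the source itself. The rule table HTML_RULES and both function names are hypothetical, invented for this illustration; production systems typically use style sheets rather than hand-written code of this kind.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

source = ET.fromstring(
    "<doc><head>To Martha Laurens</head>"
    "<p>It is now upwards of twelve Months ...</p></doc>"
)

# Hypothetical presentation rules: which HTML tag renders each structural element.
HTML_RULES = {"head": "h1", "p": "p"}


def to_html(doc):
    """Render each child element with the HTML tag chosen for its role."""
    return "".join(
        f"<{HTML_RULES[el.tag]}>{escape(el.text or '')}</{HTML_RULES[el.tag]}>"
        for el in doc
    )


def to_plain_text(doc):
    """Render the same structure as plain text, with headings in capitals."""
    return "\n\n".join(
        (el.text or "").upper() if el.tag == "head" else (el.text or "")
        for el in doc
    )


print(to_html(source))
print(to_plain_text(source))
```

Changing how headings look in either output means editing a rule, not re-encoding the document.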

The disadvantages of SGML and XML have largely to do with their formalism as data structures. Because of their requirement that all documents be expressed in the form of a hierarchy or tree of nested elements, they are not ideal for representing truly non-hierarchical materials (for instance, sketchbooks). Although XML cannot simultaneously represent multiple hierarchies in the same document (and SGML can do so only with great difficulty), these occur so frequently that systems have been developed to handle most ordinary cases, and in practice this is not usually an obstacle to the use of SGML or XML, only a design consideration.

While SGML in its many applications—HTML, TEI, EAD, and others—is widely used by cultural heritage projects, we are entering a period of transition where XML is becoming more widespread. XML is in effect a streamlined version of SGML, in which some features of SGML which make it unnecessarily complex to implement have been eliminated.[3] Like SGML, XML is a metalanguage which can be used to define specific encoding languages, and all of the encoding languages discussed here are now available in an XML version. However, rather than abandoning SGML in favor of XML, projects seem to be using both. With the growing availability of XML software for publication and browsing of online documents, XML has become a central component in the delivery of cultural heritage materials, and will only become more so in the future. Several factors are particularly significant in this shift:

Unlike SGML documents, XML documents do not require a DTD. For publication purposes, the lack of a DTD is not much of a concern as long as the document is well-formed (that is, as long as its elements all nest within one another and as long as they all have start-tags and end-tags). An XML style sheet (XSL) can still format the document and browsers can still display it without checking against a DTD to see that the document is valid. However, for production purposes, working without a DTD is not advisable, since it makes it impossible to check documents for consistency of encoding. An alternative to DTDs is XML Schemas, which are another way to specify and validate the structure of your documents. XML Schemas are currently under review by the W3C, and offer some advantages over DTDs, such as typing of elements, grouping of declarations, and inheritance of properties. For more details, see http://www.w3.org.
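The distinction between a well-formed document and a valid one can be illustrated with a minimal sketch. The function below (its name, is_well_formed, is chosen for this example) checks only well-formedness, using a standard XML parser that does not consult a DTD; it says nothing about whether the elements used are the right ones.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the parser accepts the markup; no DTD is consulted."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed("<p>A <quote>nested</quote> element</p>"))   # proper nesting
print(is_well_formed("<p>A <quote>crossed</p></quote>"))          # overlapping tags
print(is_well_formed("<p>No end-tag"))                            # missing end-tag
```

Note that a document using an undeclared or misspelled element name would still pass this check, which is why a DTD or schema remains advisable during production.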

 

Developing a Document Type Definition

One of the first steps for new text encoding projects is to identify the encoding system and specific DTD to be used. As suggested above, this process also involves articulating the project's methodological commitments and audience, as well as its more general area of focus. There are a number of encoding languages that fall within the domain of the cultural heritage community, but their range of usefulness does not overlap by much. The TEI DTD is primarily intended for full-text encoding of humanities and scholarly texts; the EAD DTD addresses the needs of the archival community for encoding finding aids and guides to collections. At this point, projects that deal primarily with artifacts, works of art or audio-visual material are far less well served.

It is probably clear by now that if you can use an existing DTD for your materials, you probably should. However, if your material falls outside the realm of existing encoding systems, you may need to develop one yourself. Projects such as the Oriental Institute of Chicago, whose text material does not fall within the broad western text canon around which TEI and the majority of DTDs have been designed, must either develop their own or await developments from others. While simple DTDs are easy to create, developing an encoding system that will fully represent your materials may require considerable research and time. The complexity of the task increases with the heterogeneity of the collection and the level of detail you wish to represent. For a project with a large number of texts with variable structures and features, it can take many years of development, application and refinement to produce a DTD that meets all of its requirements. This is by no means an impossible task, and important projects like the William Blake Archive ( http://www.blakearchive.org/public/about/tech/index.html), The Orlando Project ( http://www.ualberta.ca/ORLANDO/), the Electronic Text Corpus of Sumerian Literature ( http://www-etcsl.orient.ox.ac.uk/project/sgml-xml.htm), and others have taken this approach. However, the implications for a project's funding, staffing, and training as well as the time-scale for deliverables must be taken into account.

 

Definition Box:

Definitions

DTD:     Document Type Definition, the formal set of rules that define the elements that may occur within an encoded document and their structural relationships (their relative order and nesting)

Content model:     A component of a DTD, giving the structural definition of a particular element

Occurrence indicators:     Within a content model, the occurrence indicators show how often a given element may appear (once only, at least once, or any number of times) and whether it is required or optional.

#PCDATA:     Parsed Character Data, i.e. words and spaces but not tags.

Tag:     An individual piece of encoding that marks the start or end of a textual feature, set off from the document's content by special characters (in practice, usually angle brackets: <tag>).

Element:     A textual feature within an encoded document, including the start-tag, the end-tag, and the encoded content: <name>John Smith</name>

Attribute:     A modifier to an element, almost as an adjective or an adverb modifies a noun. Attributes come in many varieties and may be used to indicate the type of element (for instance, <name type="person">), its location in a series, a link to a related element, the language of the element's content, an alternate reading, and a wide variety of other kinds of information.

Entity reference:     A special character sequence which is used as a placeholder for some other text. Entity references are often used to encode characters which cannot be typed directly in ASCII, such as accented characters, ornaments, or non-roman alphabets. They may also be used as a placeholder for boilerplate text. In SGML and XML, entity references typically begin with an ampersand and end with a semicolon, e.g. &eacute;. Entity references can also be used to point to external files, such as page images, that can be referenced in the markup and displayed as if they were embedded in the text.
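Several of the terms defined above (element, attribute, entity reference) can be seen at work in the short sketch below. The one-line document is invented for illustration, and the declaration of &eacute; inside the DOCTYPE is added here only so that the parser can resolve the entity reference without an external DTD.

```python
import xml.etree.ElementTree as ET

# The internal declaration of &eacute; is an assumption of this sketch; in
# practice such declarations are usually supplied by the DTD.
document = (
    '<!DOCTYPE p [<!ENTITY eacute "&#233;">]>'
    '<p><name type="person">John Smith</name> visited the caf&eacute;</p>'
)

p = ET.fromstring(document)
name = p.find("name")

print(name.tag)            # the element's name
print(name.get("type"))    # the value of its type attribute
print(name.text)           # the element's content
print(name.tail)           # following text, with the entity reference expanded
```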

 

The example below shows a simple DTD fragment that describes a very basic encoding for poetry. Each permitted element is declared together with a specification of what it may contain, and in what order. The occurrence indicators (question mark and plus sign) indicate whether the element in question is optional, and how many times it may occur. The commas separating the elements indicate that the elements must occur in this order. Thus the first line of this DTD specifies that a poem may start with an optional heading, followed by one or more <lg> elements, and ending with an optional closer. The second line indicates that the <poem> element also has a required type attribute, which provides a way of identifying the kind of poem more specifically. In this case, the DTD defines a list (unrealistically brief) of possible values, although it is also possible to leave the values unspecified.

Most of the other elements in this DTD are defined so as to contain simply #PCDATA, or Parsed Character Data (in other words, any valid character). However, the <lg> (line group) element has a slightly more complex content model. It may contain <l> or <lg> elements; the vertical bar indicates that either one may occur. The plus sign means that one or more of the group of elements it modifies (in this case, <l> and <lg>) must occur. The net result, therefore, is that an <lg> element may contain one or more verse lines, or one or more nested <lg> elements, or a mixture of the two. It may not contain naked characters (without an enclosing element), nor may it be empty.

 

Example Box:

A Simple XML DTD Fragment

 

<!ELEMENT poem (head?, lg+, closer?) >

<!ATTLIST poem type (sonnet | stanzaic | irregular ) #REQUIRED >

<!ELEMENT head (#PCDATA) >

<!ELEMENT lg (l | lg)+ >

<!ELEMENT l (#PCDATA) >

<!ELEMENT closer (#PCDATA) >

 

The encoded example text that follows represents one of many possible valid documents conforming to this DTD. Equally valid would be a poem consisting of a single line group containing a single line, without a heading or a closer. When designing a DTD, it's equally important to consider the elements you wish to be able to omit, and the elements you wish to require. In other words, you need to decide not only what constitutes the minimum valid document, but also what constitutes the greatest variation you may need to accommodate.

 

Example Box:

A Simple XML Encoded Document:

 

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

<poem type="stanzaic">

<head>The Clod and the Pebble</head>

<lg>

<l>Love seeketh not itself to please, </l>

<l>Nor for itself hath any care, </l>

<l>But for another gives it ease, </l>

<l>And builds a heaven in hell's despair. </l>

</lg>

<lg>

<l>So sung a little clod of clay, </l>

<l>Trodden with the cattle's feet, </l>

<l>But a pebble of the brook</l>

<l>Warbled out these metres meet: </l>

</lg>

<lg>

<l>Love seeketh only Self to please, </l>

<l>To bind another to its delight, </l>

<l>Joys in another's loss of ease, </l>

<l>And builds a hell in heaven's despite. </l>

</lg>

<closer>William Blake, Songs of Experience</closer>

</poem>
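
For comparison, the minimum valid document mentioned above (a single line group containing a single line, with no heading or closer) might look like the sketch below. To keep the example self-contained, the DTD is repeated here as an internal subset inside the DOCTYPE declaration; this is an assumption made for illustration, and in practice the DTD would more usually live in a separate file referenced from the document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE poem [
  <!ELEMENT poem (head?, lg+, closer?)>
  <!ATTLIST poem type (sonnet | stanzaic | irregular) #REQUIRED>
  <!ELEMENT head (#PCDATA)>
  <!ELEMENT lg (l | lg)+>
  <!ELEMENT l (#PCDATA)>
  <!ELEMENT closer (#PCDATA)>
]>
<!-- Minimum valid document: no head, no closer, one line group, one line -->
<poem type="irregular">
  <lg>
    <l>So sung a little clod of clay,</l>
  </lg>
</poem>
```

Removing the type attribute, emptying the <lg> element, or placing text directly inside <lg> without an enclosing <l> would each make the document invalid against this DTD.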

 

The remainder of this chapter will discuss in more detail some text encoding languages of particular relevance for cultural heritage materials.

 

HTML

A brief discussion of HTML is warranted here if only because it is so widely used and so familiar. HTML is essentially a formatting and display language for the web, designed as a small set of tags for simple hyperlinked documents, and as such its value as a form of descriptive markup is extremely limited. It lacks the vocabulary necessary to describe many of the basic features of cultural heritage materials—most significantly their metadata, their genres, and their textual structure. In cases where HTML is adequate for describing such materials (because the materials themselves or the representations desired are extremely simple) a simple TEI-based DTD would be nearly as easy to use and much more upwardly mobile. XHTML offers some improvements over HTML. Although it offers no greater descriptive power (since it provides the same tag set as HTML 4.0), it does allow for the enforcement of XML compliance such as being well-formed, allowing validation against DTDs or schemas, and extensibility through the formal definition of new modules. For projects that choose to use some form of HTML, XHTML will at least offer better interoperation and increased delivery options (for instance, to the increasing variety of web-enabled devices such as mobile phones and hand-held computers).
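
To illustrate the limitation, here is a rough sketch of how the opening of a poem might be marked up in XHTML. The markup is hypothetical, but it shows that the available tags record only a heading, a paragraph, and line breaks: nothing identifies the text as a poem, marks the stanza as a line group, or distinguishes a verse line from any other broken line.

```xml
<!-- XHTML: the tags describe layout, not poetic structure -->
<div>
  <h3>The Clod and the Pebble</h3>
  <p>
    Love seeketh not itself to please,<br />
    Nor for itself hath any care,<br />
    But for another gives it ease,<br />
    And builds a heaven in hell's despair.
  </p>
</div>
```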

Projects that create SGML-encoded texts still rely heavily on HTML because SGML encoded texts cannot be viewed online by most web browsers, and there is a grave shortage of SGML-aware software at this time. Projects such as the William Blake Archive or the Victorian Women Writers Project have developed tools to convert SGML documents into HTML for viewing on the web. Others, such as the Women Writers Project, use commercial software that performs the translation to HTML dynamically. With the advent of XML, web publication is likely to become much more straightforward, and conversion to HTML as an intermediate will become unnecessary.

 

Definition Box:

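HTML:     HyperText Markup Language, the tag set used to format and link documents for display on the web.

XHTML:     Extensible HyperText Markup Language, a reformulation of HTML as an XML application, with the same tag set as HTML 4.0.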

 

TEI (Text Encoding Initiative)

For projects creating full-text resources, the TEI Guidelines[4] are the predominant choice. The Guidelines have been adopted by a large number of projects representing a range of different kinds of text, and have proved highly adaptable to local requirements. Among the projects surveyed, the use of TEI DTDs in encoding texts is one of the clearest cases of the adoption of standards for a particular type of material. Their use indicates the close match between the TEI's goals in creating the guidelines and the goals that text encoding projects had in mind when creating their texts:

At the same time, most projects have found either that the full implementation of TEI is unnecessary, or that the benefit did not justify the extra time and intellectual effort required. Many have turned to the TEI Lite DTD, a simplified view of the full TEI DTD. [5] The purpose of TEI Lite—meeting 90% of the needs of 90% of users—seems to be borne out in practice, and TEI Lite has become the common starting point for a large number of text encoding centers and projects, including the Virginia Etext Center and the Michigan Humanities Text Initiative. While an understanding of the full TEI Guidelines is still desirable, not least for deciding what elements can be ignored, the use of TEI Lite is recommended as a starting point for good practice in text encoding. It is always possible to add further layers of detail at a later stage, if your needs change.

The basic structure of a TEI-encoded document is very simple. Every TEI document must begin with a <teiHeader> element, which contains the document metadata. The header may be very simple, but can also accommodate detailed information about the electronic text's publication, source, subject matter, linguistic characteristics, and revision history. Following the header is a <text> element which in turn contains <front>, <body>, and <back>, which in turn contain <div> elements. In addition to accommodating all of the typical features of texts—paragraphs, lists, headings, the various components of poetry and drama, names, dates, quotations, bibliographic citations, and so forth—the TEI Guidelines also provide for more specialized encoding of features such as prosodic structures, morphological analysis, subject keywording, and similar features which are useful for various kinds of scholarly textual research.
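
The structure just described can be sketched as follows, assuming the P4 XML version of the Guidelines, in which the outermost element is <TEI.2>. The title, publication statement, and source description are placeholder text invented for illustration. The comments mark where front and back matter would go; a real document would supply actual content there or omit these optional elements, so this skeleton should not be expected to validate as it stands.

```xml
<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>The Clod and the Pebble: an electronic edition</title>
      </titleStmt>
      <publicationStmt>
        <p>Unpublished sample prepared for illustration only.</p>
      </publicationStmt>
      <sourceDesc>
        <p>Transcribed from William Blake, Songs of Experience.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <!-- title page, preface, and other preliminary matter -->
    </front>
    <body>
      <div type="poem">
        <head>The Clod and the Pebble</head>
        <lg type="stanza">
          <l>Love seeketh not itself to please,</l>
          <l>Nor for itself hath any care,</l>
        </lg>
      </div>
    </body>
    <back>
      <!-- appendices, notes, indexes -->
    </back>
  </text>
</TEI.2>
```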

 

EAD (Encoded Archival Description)

Although the main thrust of DTD development has been in the direction of humanities and scholarly texts, several other DTDs have been developed to cater for heritage institutions with different text encoding requirements. The most significant of these for the archival community has been the Encoded Archival Description (EAD) DTD.

The EAD DTD began as a cooperative venture in 1993 at the University of California, Berkeley. It aimed to develop a non-proprietary encoding standard for archival finding aids that would include information beyond what is provided by traditional machine-readable finding aids, such as MARC.

The project chose SGML as the most appropriate encoding language, as its document type definition (DTD) concept makes it ideal for the consistent encoding of similarly-structured documents, the key to successful electronic processing. An analysis of the structural similarities of finding aids helped construct the initial FINDAID DTD. This simplified, improved and expanded access to archival collections by linking catalog records to finding aids, enabling the searching of multiple networked finding aids and keyword access. The release of version 1.0 of the EAD DTD was delayed until 1998 in order to make it compatible with the emerging XML.

EAD documents consist of two major parts. The first part is the <eadheader> element, which contains metadata about the finding aid and its encoded representation. The second part is the <archdesc> element, which contains the information about the archival materials described.

The EAD header was modeled on that of the Text Encoding Initiative (TEI). It consists of four elements (some of which are further sub-divided):

The uniformly ordered elements in the <eadheader> make searches more predictable. Such searches can filter large numbers of machine-readable finding aids by specific categories such as title, date and repository. The <eadheader> is obligatory, so archivists are required to include essential information about their finding aids that was not recorded in paper form. The optional <frontmatter> element can be used to create title pages that follow local preferences.

Because finding aids generally describe material at several different, but related levels of detail, these unfolding, hierarchical levels are represented within the <archdesc> element. The <archdesc> provides for a descriptive overview of the whole unit followed by more detailed views of the parts. The data elements that describe the whole unit are gathered together under a parent element called <did> (descriptive identification). These <did> elements are the key to good description as they facilitate retrieval of a cohesive body for discovery. Once the high (or unit) level of description is complete, the component parts can be described using the Description of Subordinate Components or <dsc> tag, at whatever level of detail is appropriate for the collection and the resources available.
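
A skeletal finding aid along these lines might look like the sketch below. The collection, repository, identifier, and dates are all invented for illustration, and only a handful of the available elements are shown. The numbered component elements (<c01>, <c02>) are one way of expressing the nested levels within the <dsc>.

```xml
<ead>
  <eadheader>
    <eadid>example-0001</eadid>
    <filedesc>
      <titlestmt>
        <titleproper>Guide to the Example Family Papers</titleproper>
      </titlestmt>
    </filedesc>
  </eadheader>
  <archdesc level="collection">
    <!-- Description of the whole unit -->
    <did>
      <unittitle>Example Family Papers</unittitle>
      <unitdate>1850-1900</unitdate>
      <repository>Example University Library</repository>
    </did>
    <!-- Description of Subordinate Components -->
    <dsc type="combined">
      <c01 level="series">
        <did>
          <unittitle>Correspondence</unittitle>
        </did>
        <c02 level="file">
          <did>
            <unittitle>Letters received</unittitle>
            <unitdate>1850-1860</unitdate>
          </did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
```

Each level repeats the same pattern: a <did> that identifies the unit, followed by any more detailed components nested inside it.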

 

Dublin Core

While not an encoding system in its own right, the Dublin Core deserves a reference here as part of good practice in creating encoded metadata. The Dublin Core Metadata Element Set defines a set of 15 essential metadata components (for instance, author, title, format) which are broadly useful across disciplines and projects for resource discovery and retrieval. These components can be used to add metadata to HTML files (using the <meta> tag) but can also be used in other contexts to create basic metadata for a wide range of digital resources. Dublin Core does not provide for detailed administrative or technical metadata, and as such is largely suited for exposing resources for search and retrieval, rather than for internal resource management and tracking. In addition, since its goal is to be simple and broadly applicable to a wide variety of resources, it does not provide for the kind of highly structured metadata about specific document types that TEI and EAD offer. Although projects using these encoding systems will probably not need to use the Dublin Core, they may find it useful to be aware of it as a possible output format for distributing metadata about their resources.
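
As a brief sketch of the <meta> tag usage mentioned above, the XHTML header below records a few Dublin Core elements for a web page. The values are placeholders chosen for illustration. Note that the element set names the author component "creator", and that the "DC." prefix follows the common convention of declaring the element set with a <link> tag.

```xml
<head>
  <title>The Clod and the Pebble</title>
  <!-- Declares the "DC." prefix as the Dublin Core element set -->
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <meta name="DC.title" content="The Clod and the Pebble" />
  <meta name="DC.creator" content="Blake, William" />
  <meta name="DC.type" content="Text" />
  <meta name="DC.format" content="text/html" />
  <meta name="DC.language" content="en" />
</head>
```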

 

METS

The Metadata Encoding and Transmission Standard (METS) is an XML-based encoding standard for digital library metadata. It is both powerful and inclusive, and makes provision for encoding structural, descriptive, and administrative metadata. It is designed not to supersede existing metadata systems such as Dublin Core or the TEI Header, but rather to provide a way of referencing them and including them in the METS document. As a result, it is an extremely versatile way of bringing together a wide range of metadata about a given digital object. Through its structural metadata section, it allows you to express the relationships between multiple representations of the digital object (for instance, encoded TEI files, scanned page images, and audio recordings), as well as relationships between multiple parts of a single digital representation (for instance, the sections of an encoded book). Its administrative metadata section supports the encoding of the kinds of information projects require to manage and track digital objects and their delivery: technical information such as file format and creation; rights metadata such as copyright and licensing information; information about the analog source; and information on the provenance and revision history of the digital objects, including any data migration or transformations which have been performed. METS is a very recently developed standard but is well worth watching and using.
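
The overall shape of a METS document can be sketched as follows. The URLs, identifiers, and file names are hypothetical, and the sketch has not been validated against the METS schema; it is meant only to show how the descriptive, administrative, file, and structural sections relate to one another.

```xml
<mets xmlns="http://www.loc.gov/METS/"
      xmlns:xlink="http://www.w3.org/1999/xlink">
  <!-- Descriptive metadata: points to an existing TEI header instead of replacing it -->
  <dmdSec ID="DMD1">
    <mdRef LOCTYPE="URL" MDTYPE="TEIHDR"
           xlink:href="http://example.org/blake/clod.xml" />
  </dmdSec>
  <!-- Administrative metadata: here, a pointer to a rights statement -->
  <amdSec>
    <rightsMD ID="RIGHTS1">
      <mdRef LOCTYPE="URL" MDTYPE="OTHER"
             xlink:href="http://example.org/blake/rights.xml" />
    </rightsMD>
  </amdSec>
  <!-- File inventory: two representations of the same object -->
  <fileSec>
    <fileGrp USE="encoded text">
      <file ID="TEXT1" MIMETYPE="text/xml">
        <FLocat LOCTYPE="URL" xlink:href="http://example.org/blake/clod.xml" />
      </file>
    </fileGrp>
    <fileGrp USE="page image">
      <file ID="IMG1" MIMETYPE="image/tiff">
        <FLocat LOCTYPE="URL" xlink:href="http://example.org/blake/clod.tif" />
      </file>
    </fileGrp>
  </fileSec>
  <!-- Structural map: ties both files to the same logical unit -->
  <structMap TYPE="logical">
    <div TYPE="poem" LABEL="The Clod and the Pebble" DMDID="DMD1">
      <fptr FILEID="TEXT1" />
      <fptr FILEID="IMG1" />
    </div>
  </structMap>
</mets>
```

The structural map is what expresses the relationship between the representations: both the encoded text and the page image are attached to the same <div>.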

 


 


[1] Arguably, in other domains such as industrial or computer documentation, where text encoding languages are intended to govern the creation of new digital documents rather than the representation of existing ones, encoding cannot by its nature be interpretive, since the author and encoder operate with the same agency (even if they are not the same person).

[2] Strictly speaking, while SGML is an international standard (ISO 8879), XML is only a recommendation from an international consortium (the World Wide Web Consortium). In practice, this is a distinction that makes little difference.

[3] It is important to note that the greatest impact of these differences is on software design; to the encoder and the end user, the change from SGML to XML is not difficult to make.

[4] The TEI Guidelines for Electronic Text Encoding and Interchange (P3, the third release but the first official publication) were published in 1994. The latest version, P4, adds XML compatibility and was published in March 2002.

[5] The TEI DTD is designed as a set of modules which can be combined in various ways to accommodate many different types of texts. Thus there is no single TEI DTD, but rather a set of DTDs that represent the various combinations of the modules. The TEI Lite DTD is a simple combination of the most widely required TEI modules.

 

abp~03/03