In English texts printed before 1630, the letters v, u, j, and i did not have the values that they have today. The word that we spell 'jury' was written 'iury', and the word that we spell 'ivory' was written 'iuory'. Another common typographical convention, characteristic of somewhat later texts, was to represent 'W' as 'VV' or 'Vv'. These archaic conventions make early texts difficult to read and compromise the matching of forms in information retrieval tasks. The Women Writers Project (WWP) uses SGML tagging to encode a regularized spelling for such typographical variants, thereby allowing the option to display and search on either the original form or the regularized form.
Of the first 300 or so texts encoded by the WWP from 1989 to 1999, nearly half contained some typographical difference from modern English that required regularization. Encoding this information by hand was time-consuming and inefficient; even so, encoders manually tagged about 90 texts with regularized forms. These provided a substantial body of useful data for understanding the nature, extent, and frequency of distribution of what we have termed 'vuji' and 'VV' phenomena.
The Scholarly Technology Group undertook to develop a system to automatically identify words subject to this typographic convention and tag them with the regularized form. This system has two major components: an SGML-aware wordlist-based program, and a set of pattern matching rules derived from linguistic principles for English consonants and vowels. Both components have been designed to work with WWP markup conventions for such things as word division across a line break, errors or abbreviations within a word, and structural elements to be excluded from regularization.
The wordlist-based program matches whole words against a dictionary list of known forms requiring regularization and replaces each matched word in the text with a form containing the appropriate markup. The pattern-matching component uses a set of regular expressions to identify probable candidates for regularization that were not found on the wordlist. Each such match is presented to the encoder, who can accept or reject the regularization.
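The two-stage process described above can be sketched roughly as follows. This is a minimal illustration in Python, not the Scholarly Technology Group's program: the wordlist entries, the three regular expressions, and the function name `classify` are assumptions made for the example, and the sketch ignores SGML markup entirely.

```python
import re

# Hypothetical wordlist: known original form -> regularized form.
WORDLIST = {
    "vnto": "unto",
    "iury": "jury",
    "iuory": "ivory",
}

# Hypothetical candidate patterns, loosely based on where vowels and
# consonants can occur in English words. They only flag words for review.
CANDIDATE_PATTERNS = [
    re.compile(r"^v[^aeiouy]"),        # initial v before a consonant (possible u)
    re.compile(r"[aeiouy]u[aeiouy]"),  # u between vowels (possible v)
    re.compile(r"^i[aeiou]"),          # initial i before a vowel (possible j)
]


def classify(word):
    """Return (status, regularized_form) for a single word.

    status is "wordlist" when the word is a known form, "candidate" when a
    pattern flags it for the encoder to accept or reject, and "unchanged"
    otherwise. regularized_form is None unless the wordlist supplied one.
    """
    lower = word.lower()
    if lower in WORDLIST:
        return ("wordlist", WORDLIST[lower])
    for pattern in CANDIDATE_PATTERNS:
        if pattern.search(lower):
            return ("candidate", None)
    return ("unchanged", None)


if __name__ == "__main__":
    for w in ["vnto", "moue", "iudge", "letter"]:
        print(w, classify(w))
```

In this sketch the wordlist takes priority, so the patterns are consulted only for words the list does not cover, and a pattern match never changes the text by itself; it merely marks the word for human review.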
Anne A∫kewes an∫were vnto Iohan La∫∫els letter. Oh frynde mo∫t derelye belo ued in God. I maruele not a lyt tle, vvhat ∫huld moue yow, to iud ge in me ∫o ∫ledre a faythe, as to feare deathe, vvhych is the ende of all my∫erye.
Askew, Anne. The lattre examinacyon of Anne Askewe, 1547 Marpurg, 1547. Women Writers Online. Women Writers Project, Brown University. Unpublished.
WWP encoding practice documents structural features of the
text, such as paragraphs (p element),
chapters, stanzas, etc. (div element
with a type attribute); typographical
features such as line breaks (lb
element), page breaks (pb), and
catchwords & signatures (mw with
type attribute); renditional
characteristics such as italicization and superscription; and
links to textual annotations. Intra-word entity references are
used for characters that are not on the modern computer
keyboard, including soft hyphens (shy),
long s (s), accented letters (e.g.,
eacute), and macrons (e.g.,
emacr) -- macrons are further
documented with an abbr element whose
expan attribute indicates the specific
nasal consonant elided in the original. In addition to the
expansion of macrons, intra-word elements often occur for a
variety of reasons, including editorial insertion of omitted
characters or soft hyphens (in the
corr attribute of
sic), editorial correction of other
obvious errors (also in corr of
sic), expansion of some abbreviations
(in the expan attribute of
abbr), and, of course, the tagging of
'vuji' and 'vv' characters with their modern form (in the
reg attribute of
orig).
<speaker rend="align(center)slant(upright)">Anne A&s;kewes an&s;were <orig reg="u">v</orig>nto
<lb><orig reg="J">I</orig>ohan La&s;&s;els letter.</speaker>
<p>Oh frynde mo&s;t derelye belo<sic corr="­"></sic>
<lb><orig reg="v">u</orig>ed in God. I mar<orig reg="v">u</orig>ele not a lyt<sic corr="­"></sic>
<lb>tle, what &s;huld mo<orig reg="v">u</orig>e yow, to <orig reg="j">i</orig>ud<sic corr="­"></sic>
<lb>ge in me &s;o &s;l<abbr expan="en">ē</abbr>dre a faythe, as to<anchor id="ter306" corresp="lat306">
<lb>feare deathe, whych is the ende
<lb>of all my&s;erye.
Critical editions of early texts usually retain old spelling and typographical features of the copy text. Editions for general reading usually normalize typographical conventions, even when they retain the "old-spelling texture" [1]. The SGML version makes it possible to display either the original (as above) or a regularized version (as below).
Anne Askewes answere unto Johan Lassels letter. Oh frynde most derelye belo- ved in God. I marvele not a lyt- tle, what shuld move yow, to jud- ge in me so slendre a faythe, as to feare deathe, whych is the ende of all myserye.
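Deriving both displays from one encoded source can be illustrated with a short Python sketch. It is a simplification under stated assumptions rather than the WWP's display software: it handles only the orig element and the s entity, uses a regular expression instead of an SGML parser, and the function name `render` is hypothetical. The sample string is the first line of the encoded passage with the line break replaced by a space.

```python
import re

# Matches <orig reg="X">Y</orig>; group 1 is the regularized form,
# group 2 is the original reading.
ORIG = re.compile(r'<orig reg="([^"]*)">(.*?)</orig>')


def render(sgml, regularize):
    """Return a plain-text view of a fragment containing orig/reg markup.

    With regularize=True the reg attribute value is shown and the long s
    becomes a modern s; otherwise the element content is shown and the
    long s is kept (as U+017F, an assumption of this sketch).
    """
    text = ORIG.sub(lambda m: m.group(1) if regularize else m.group(2), sgml)
    return text.replace("&s;", "s" if regularize else "\u017f")


sample = (
    'Anne A&s;kewes an&s;were <orig reg="u">v</orig>nto '
    '<orig reg="J">I</orig>ohan La&s;&s;els letter.'
)

if __name__ == "__main__":
    print(render(sample, regularize=False))
    print(render(sample, regularize=True))
```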
In addition to providing considerable improvement in readability, the WWP's encoding practice makes it possible to retrieve early attestations of forms that would otherwise not qualify as 'matches' to a search query. For example, a search for beloved within two words of God would fail to match against the appropriate line in the unregularized Askew text.[2]
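The retrieval point can be made concrete with a small Python sketch of a proximity search. The function `within` and its tokenization are assumptions for illustration, not the WWP's search software, and the two test strings simplify the Askew line by rejoining the word broken across the line break and modernizing the long s.

```python
import re


def within(text, first, second, distance):
    """Return True if `first` and `second` occur within `distance` words
    of each other in `text` (case-insensitive, whole-word comparison)."""
    words = re.findall(r"[a-z]+", text.lower())
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )


# Simplified versions of the same line, before and after regularization.
original = "Oh frynde most derelye beloued in God."
regularized = "Oh frynde most derelye beloved in God."

if __name__ == "__main__":
    print(within(original, "beloved", "god", 2))
    print(within(regularized, "beloved", "god", 2))
```

The query fails on the original string only because 'beloued' is not the same token as 'beloved'; the surrounding words are identical in both strings.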
This paper describes orig reg markup.
The project described in this paper tests the extent to which TEI encoding of 'vuji' phenomena can be automated, and characterizes the human intervention required to evaluate and process this feature of early modern English texts.
Copyright © 2000 by Syd Bauman, Jacque Russom, and Brown University