TYPE OF PROPOSAL: poster
TITLE: Markup vs. Character Encoding: The quandary of handling the
epigraphical/papyrological “underdot” in
computer representation
KEYWORDS: markup, Unicode, underdot
AUTHOR: Deborah Anderson 
AFFILIATION: Vis. Scholar, Dept. of Linguistics, UC Berkeley 
E-MAIL: dwanders@socrates.berkeley.edu
CONTACT ADDRESS: 1348 Burkette Drive, San Jose, CA 95129
FAX NUMBER: c/o Dept. of Linguistics, UCB, (510) 643-5688
PHONE NUMBER: (408) 255-4842
Equipment needed: wallboard and a plug for a laptop computer.
Note: Book titles below are surrounded by underscores.

Problem

When dealing with ancient texts written on various media for detailed
scholarly publication, it is critical to convey information on the
specifics of the writing. After a photo, scanned image, or line
drawing is made from the original text, texts are commonly transferred
to paper or an electronic medium. To record the information
from the inscription, transliteration and transcription schemes in
Roman letters (or Greek, for materials using Greek script) are often
used to capture all the characters--whether clearly legible, faint,
or damaged--and the empty spaces.

Ancient texts, especially those in damaged condition, present
difficulties, for their transcription must rely upon the subjective
judgment of the transcriber (and editor) in deciding what characters
are present, whether an erasure is identifiable, the amount of empty
space(s), etc. Certain conventions on how to represent these details
have been created and are followed in various fields (e.g., double
square brackets [[ ]] enclose erasures, angle brackets < > indicate
letters made in error by the scribe). A common method in
transliteration and transcription for denoting a damaged
character--or one whose identity is uncertain--is a dot placed below a
letter. This underdot is common in editions of ancient Greek and Latin
texts, for example, at least in those editions intended for the
scholar interested in paleography, philology, etc.

A problem arises in how to handle the underdot in computer
representation. This question surfaced in a project done at UC
Berkeley in conjunction with the Berkeley Library, wherein the
_Indo-European Studies Bulletin_, a publication affiliated with the
UCLA Indo-European Studies Program, was being put online, using XML,
Unicode, and a TEI-Lite DTD. The underdot appeared in a Sabellian
inscription. Since our project intended to test out the use of
Unicode, we reviewed the options available in Unicode and employed the
combining underdot (U+0323, COMBINING DOT BELOW) after the character. On the surface level,
this reflected the character represented in the Sabellian
article. However, the combining underdot raised a potential problem:
by interrupting the plain text string with the diacritic, searching
for the entire word could be impeded, unless the underdot was taken
into account when searching. More importantly, should the underdot
actually be encoded as a separate character? Is it on the same level
as an “a acute”, for example, where the
diacritic is an essential part of the character? Or should markup be
used, such as denoting the sign with a <damage> and/or
<unclear> tag? Markup then could be visually rendered according
to one’s own convention or taste.
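
The searching problem described above can be illustrated with a short
Python sketch. It is only an illustration under stated assumptions: the
word "meddis" and the placement of the underdot are made up for the
example, and strip_underdots is a hypothetical helper name, not part of
any project mentioned here.

```python
import unicodedata

COMBINING_DOT_BELOW = "\u0323"  # the combining underdot, U+0323


def strip_underdots(text: str) -> str:
    """Return text with every combining dot below (U+0323) removed."""
    # Decompose first, so that precomposed letters such as U+1ECB
    # (i with dot below) are split into base letter + U+0323.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = decomposed.replace(COMBINING_DOT_BELOW, "")
    # Recompose whatever other diacritics remain.
    return unicodedata.normalize("NFC", stripped)


# Hypothetical transcription: the "i" is marked as unclear with an underdot.
transcribed = "meddi" + COMBINING_DOT_BELOW + "s"

# A naive plain-text search fails, because U+0323 interrupts the string.
found_naive = "meddis" in transcribed

# The search succeeds once the underdot is taken into account.
found_stripped = "meddis" in strip_underdots(transcribed)

print(found_naive, found_stripped)
```

Note that this simple approach cannot tell an underdot meaning "unclear"
from an underdot with phonetic value; both are removed alike. That
limitation is one argument for recording damage and uncertainty in
markup instead.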

Since Unicode support is only just now becoming more prevalent in new
software and hardware, most computer projects have adopted ASCII
representations of the underdot and other epigraphic and papyrological
symbols. For Greek, the Beta Code of the Thesaurus Linguae Graecae has
been widely adopted by projects (such as Perseus). Eventually, a
changeover to Unicode will occur, and the need to decide how to handle
the underdot is becoming more pressing: should one use a character
encoding or markup?

The underdot is but one member of a long list of epigraphic and
papyrological symbols used in transcription and transliteration. An
agreement amongst scholars ought to be made if there is to be
consistency in handling these symbols within the same discipline and
across disciplines, since similar problems are faced in other
fields. Currently, Unicode proposals on cuneiform, Coptic, and Iranian
are waiting to see how the problem is resolved in the Greek and Latin
sphere, since it will influence their projects. Or are the variations
between fields (and the lack of communication) so great that a
discipline-specific approach will prevail?

Issues

A number of important issues arise when reviewing the problem more deeply:

--While the underdot is frequently used to indicate damage or
  uncertainty, it is not necessarily consistently used with this broad
  definition, even in ancient Greek materials. In a standard book used
  for Greek dialects, _The Greek Dialects_ (Chicago and London, 1955),
  Carl Darling Buck states: “The
  occasional use of a dot under a letter indicates that it is
  mutilated. But this is commonly disregarded if the proper reading is
  reasonably certain” (p. 184). In Mycenaean materials
  (Emmett Bennett, Jr., and Jean-Pierre Olivier, _The Pylos Tablets
  Transcribed, Part I: Texts and Notes_, Rome, 1973), however, an
  underdot under a digit can indicate that there is a question whether
  the number is in the text at all, a problem regarding the identity
  of the number, or it may merely indicate that the number is almost
  illegible (p. 10). Indeed, if fine granularity of a
  text is intended, “damage” and
  “uncertainty” can and probably should be
  separated as two distinct elements, and this is so done in a new
  proposal, “Epidoc”, being worked on at the
  University of North Carolina by Tom Elliott, Hugh Cayless, and Helen
  Hawkins (http://asgle.classics.unc.edu/review/epidoc.htm,
  http://asgle.classics.unc.edu/review/epidoc.pdf).

--In some languages, an underdot has a specific phonetic meaning. In
  the transliteration of Sanskrit, for example, it is used for a
  retroflex s. This phonetic sense is at variance with the "unclear
  sign" meaning, and the two uses can be confused in searching.

--One potential problem of character-encoding with Unicode is an
  apparent ambiguity of certain signs for the naive user. A scholar
  looking for a combining underdot when skimming through the Unicode
  Standard (or scrolling down the choices under MS Office
  2000’s Arial Unicode MS font) may choose, quite
  incorrectly, U+093C, the Devanagari sign nukta, which has a very
  specific use. This error would cause problems for searching and
  rendering.

--A number of characters for damaged signs are proposed in a Unicode
  proposal for Egyptian hieroglyphs. While such signs would appear
  with the hieroglyphic characters and not in a Roman-type
  transliteration/transcription scheme, the proposal is significant in
  offering a character-encoded model for conveying a damaged sign, and
  not one based on markup. Could this option peacefully co-exist with a
  markup-only approach used in other projects, and is this advisable?

--Unicode will allow using the ancient scripts more fully, since the
  character encoding standard should allow for easier writing,
  rendering, and printing of the original scripts, beyond what printed
  publications have been able to offer in the past. (However, this is
  only possible when the necessary Unicode-enabled operating system,
  software, and font support for the characters are present.) Hence, a
  fuller representation of a text with the ancient scripts could be
  added between the layers of photo/drawing and Romanized
  transliteration/transcription. Instead of using an underdot to
  indicate a faint letter, for example, markup could be used with the
  original script (as well as the transliterated/transcribed version)
  to make the sign (or letter) appear fainter or in a slightly
  different color.

--A consistent markup scheme could be used with a style sheet to
  render the faint/damaged letter in a variety of ways, as suggested
  above, offering wide extensibility. If a text is intended for
  beginners, the markup indicating the traces of letters or erasures
  could be disregarded. If Hittite scholars regularly used a special
  symbol for a mutilated sign, for example, this could also be
  accommodated by changes to the style sheet.

--Since new Unicode character encoding proposals can take from two to
  five years from the first proposal until approval into ISO 10646,
  markup offers a much quicker solution. A “best
  practices” guide to markup, similar to the Epidoc
  proposal, would be needed.
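
The style-sheet idea in the list above can be sketched in a few lines of
Python. This is a minimal illustration, not a real style sheet: the
sample word, the three style names ("underdot", "plain", "html"), the
CSS class name, and the render function are all assumptions made for the
example, and only one level of nesting is handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample: one word whose "i" is unclear.
SAMPLE = "<w>medd<unclear>i</unclear>s</w>"


def render(xml_text: str, style: str) -> str:
    """Render a word, showing <unclear> letters according to style."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        letters = child.text or ""
        if child.tag == "unclear":
            if style == "underdot":
                # Scholarly convention: a combining dot below each letter.
                letters = "".join(ch + "\u0323" for ch in letters)
            elif style == "html":
                # Leave the look (fainter, other color) to a CSS rule.
                letters = '<span class="unclear">' + letters + "</span>"
            elif style != "plain":
                raise ValueError("unknown style: " + style)
            # "plain" disregards the markup, e.g. for a beginners' edition.
        parts.append(letters)
        parts.append(child.tail or "")
    return "".join(parts)


for style in ("underdot", "plain", "html"):
    print(style, render(SAMPLE, style))
```

The same encoded text thus yields a scholarly rendering, a simplified
one, or one whose appearance is decided elsewhere, without changing the
underlying character string.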

A Possible Solution?

Markup seems to present the best option, for it allows flexibility and
provides a speedier means to put epigraphic/papyrological text
information on the Web. Also, the use of the underdot reflects
information regarding damage/uncertainty/etc. of a character, and
hence is probably best not encoded as a separate character. However,
if markup is to be advocated as the best approach here, can
user-friendly software be created in the foreseeable future for typing
and rendering of such markup schemes? This poster is intended to
encourage further open discussion on the “underdot
quandary” and its ramifications, and to seek input from
others, particularly those with projects on ancient texts or whose
expertise is in markup and relevant technology.