Connect Spring 1999  Humanities Computing


Optical Character Recognition

Steven Killings

The human capacity for visual perception and cognition has been the subject of philosophical and mathematical inquiry for centuries, if not millennia. The parable of Plato's cave comes readily to mind and has often been used as a metaphor, if not an outright authority, among sophists seeking to explain the phenomenon in human terms.

In the parable, several men are chained up in a cave, facing a wall on which they see shadows. Light from a fire behind the men casts the shadows on the wall. These shadows are of the men themselves and of objects located, unseen by them, between them and the fire. Because the prisoners can see nothing but the shadows, they take the shadows for reality.

One of the men shakes off his chains and, turning round, makes his way to the mouth of the cave. When he gets there and looks out he sees the sun shining on the objects of the world outside, but when he returns to the cave to tell his fellows what he has seen, they don't believe him. The escape of the prisoner into the light represents the process of philosophical enlightenment. Plato's point is that if we don't understand philosophy then we will see only shadows, the appearances of things, rather than their true form.

I find it suitable to invoke Plato not for what he can tell us about the importance of philosophy but for what he can tell us about Optical Character Recognition (OCR). If you find this strange, read on.

To Plato, the true object of knowledge, the reason for its pursuit, if you will, is the Form of something. This is necessary because his philosophy held that knowledge of the physical world was unattainable: things were too much in flux, and their appearance and properties too variable. Thus, he reasoned, there must be a higher Form, something fixed and unchangeable which our minds can grasp. One could know a cat, for instance, by understanding its "cat-ness," not by the fact that it had a tail, green eyes and whiskers.

Optical Character Recognition software operates in a manner similar to Plato's Philosophy of Forms. Characters are recognized and discriminated according to how much they participate in the "letter-ness" of any particular Form of letter. Programmers call this process "feature extraction." Thus, the minuscule letter "e," which can appear in many different fonts and sizes, can still be distinguished by its essential shape, its "e-ness," which in illustrative terms could be described as a closed half loop with a tail descending below.

Particular characteristics of the letter as it appears in the document are discarded. For instance, a serifed majuscule E is the same as a non-serifed E.
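To make the idea concrete, here is a minimal sketch in Python. It is a toy illustration under stated assumptions, not a description of how any commercial package works: the 5x5 bitmaps, the names PROTOTYPES, extract_features and classify, and the choice of features (the share of ink falling in each row and each column) are all invented for the example. The glyph is reduced to those features, and the letter whose stored prototype lies closest is chosen.

```python
# Hypothetical 5x5 bitmaps: '#' is ink, '.' is white space.
PROTOTYPES = {
    "E": ["#####",
          "#....",
          "####.",
          "#....",
          "#####"],
    "C": [".####",
          "#....",
          "#....",
          "#....",
          ".####"],
}


def extract_features(bitmap):
    """Describe a glyph by the share of its ink in each row and each column."""
    rows = [row.count("#") for row in bitmap]
    cols = [sum(row[x] == "#" for row in bitmap) for x in range(len(bitmap[0]))]
    total = sum(rows) or 1  # avoid dividing by zero on a blank image
    return [n / total for n in rows + cols]


def classify(bitmap):
    """Return the prototype letter whose features lie closest to the glyph's."""
    features = extract_features(bitmap)

    def distance(letter):
        proto = extract_features(PROTOTYPES[letter])
        return sum((a - b) ** 2 for a, b in zip(features, proto))

    return min(PROTOTYPES, key=distance)


# An E whose middle bar has lost a pixel of ink (an assumed sample).
WORN_E = ["#####",
          "#....",
          "###..",
          "#....",
          "#####"]
print(classify(WORN_E))
```

Because the features are proportions rather than raw pixels, the particulars of the glyph matter less than its overall distribution of ink, which is the sense in which the worn letter remains more "E-like" than "C-like."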

Errors, of course, occur in this process of recognition. The software could decide that a certain letter is much more "c-like" than "e-like" for a variety of reasons, chief among them the clarity of the contrast between the document's black text and white space.

However, many OCR errors arise from the software's inability to decide between the "letter-ness" of characters with similar shapes, such as "I" and "1" in non-serifed fonts. Even serifed fonts suffer from some consistent errors. Most OCR programs will still have a problem recognizing all the characters in the word "minimum" in a serifed font like Times New Roman, where the run of nearly identical vertical strokes is hard to divide into letters. It is a poor piece of OCR software, however, that reads the letter "w" as the letter "f" in a clear image. In short, OCR software can decipher the "shadow" of a character and decide what its true Form is.

How, you ask, did my computer get a degree in philosophy? Of course it didn't, but it may have a degree in history.

Much of commercial OCR software's analysis of documents rests on the way printed text has commonly appeared over the last 500 years. (We must remember, however, that the Italian Humanist scribes and printers of the fifteenth century, whom we normally credit with the invention of the most common Latin typeface, Roman, were themselves slavishly copying the fine minuscules and mises-en-page they found in Carolingian manuscripts of the ninth and tenth centuries, mistakenly believing them to be genuine products of the Roman Empire rather than French.)

Characteristics such as kerning (the amount of white space between characters), hyphenation, leading (the space between lines of text) and indentation, which have traditionally been the concern of moveable type compositors since Gutenberg, play a significant role in OCR's ability to vectorize or zone the document, that is to say, to break it down into its component parts. This process is common to OCR software's "pre-processing" phase. Indeed, it is OCR software's ability to interpret the white space in documents that makes common functions like zoning work. This is why, when performing OCR, it is wise to use a document with the highest and clearest contrast possible.
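The role of white space in zoning can be sketched in a few lines of Python. This is a simplified illustration, assuming a page already reduced to pure black and white; the PAGE grid and the function name find_text_lines are hypothetical. Any run of rows containing ink is treated as a line of text, and the blank rows between runs as leading.

```python
# A hypothetical black-and-white page: '#' is ink, '.' is white space.
PAGE = [
    "..........",
    ".###.##...",
    ".#.#.#....",
    "..........",
    "..........",
    ".##..###..",
    ".##..#.#..",
    "..........",
]


def find_text_lines(page):
    """Return (first_row, last_row) pairs for each run of rows containing ink."""
    lines = []
    start = None
    for y, row in enumerate(page):
        has_ink = "#" in row
        if has_ink and start is None:
            start = y                      # a new line of text begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # white space closes the line
            start = None
    if start is not None:                  # text ran to the bottom edge
        lines.append((start, len(page) - 1))
    return lines


print(find_text_lines(PAGE))
```

The same scan run over columns within a line would separate words and characters. On a low-contrast scan, stray specks in the blank rows would count as ink and merge neighboring lines, which is one way to see why contrast matters so much at this stage.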

Even with this ability, OCR still has problems with uncommon document formats. For instance, line numbers in modern critical editions of poetry are often interpreted as separate words belonging to the line beside which they appear, and not as discrete entities apart from the text they number.

Columns and text flow are another problem, since most OCR software will simply interpret from left to right, top to bottom, regardless of how the document was meant to be read. Often, it is necessary to zone documents manually, especially if the document contains a critical apparatus, marginal notation or the other textual additions typical of academic publications. The ability to zone manually, I would argue, is essential for any serious OCR operation.
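The reading-order problem is easy to demonstrate. The sketch below, in Python, assumes a two-column page whose zones have already been found; the Zone type, the coordinates and the column_gap threshold are all hypothetical. It contrasts a naive top-to-bottom, left-to-right ordering with one that first groups zones into columns.

```python
from typing import NamedTuple


class Zone(NamedTuple):
    name: str
    left: int  # x coordinate of the zone's left edge
    top: int   # y coordinate of the zone's top edge


def naive_order(zones):
    """Read strictly top to bottom, left to right, ignoring columns."""
    return [z.name for z in sorted(zones, key=lambda z: (z.top, z.left))]


def column_order(zones, column_gap=50):
    """Group zones whose left edges nearly align, then read column by column."""
    columns = []
    for z in sorted(zones, key=lambda z: z.left):
        if columns and z.left - columns[-1][0].left <= column_gap:
            columns[-1].append(z)   # same column as the previous zone
        else:
            columns.append([z])     # left edge jumped: a new column starts
    return [z.name for col in columns for z in sorted(col, key=lambda z: z.top)]


# Two columns of two paragraphs each; the intended order is A, B, C, D.
PAGE_ZONES = [
    Zone("A", left=10, top=0),
    Zone("C", left=300, top=0),
    Zone("B", left=10, top=100),
    Zone("D", left=300, top=100),
]
print(naive_order(PAGE_ZONES))
print(column_order(PAGE_ZONES))
```

A fixed threshold like this would fail on footnotes, marginalia or an apparatus set across both columns, which is where drawing the zones by hand earns its keep.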

Of all the features that characterize a typical modern document, it is our tradition of word shape and separation that OCR uses to its greatest benefit. In visual terms, a word is distinguished by its characters' relation to the white space surrounding it and by the nature of its letter face (for instance, small thin strokes are common to handwriting, and thick short strokes are common to non-serifed print fonts).

In cognition studies, psychologists describe this as a word's "Bouma-shape" (after the Dutch psychologist Herman Bouma). It is a useful phrase here. Bouma-shapes in Western writing are very different from those in Chinese, Korean or Japanese. Words in Eastern languages are, in general, square-like and of similar size, while words in Western languages are generally elongated series of Latin characters separated by a consistent white space.

Commercial OCR software, in large part, depends on this uniformity in text to discriminate words accurately.[1] It is certainly the case that the most common Latin typeface, Roman, will be interpreted by currently available OCR packages more accurately than other typefaces.[2]

Uncommon shapes of words, so-called "word art" being a good example, are likely to fool OCR software. Even in ordinary imprints, however, the shapes of characters vary a great deal. Generally, the more primitive the printing process, the more likely errors are to occur in its character shapes. During the era of hand-press printing, ink was often unevenly impressed on the page and occasionally formed blotches that distorted letters. Pieces of type were routinely damaged during printing and then re-sorted into compositors' trays. During the machine-press period, these errors occurred less frequently and did less harm to the character's integral shape. Offset printing has largely removed these errors in the modern era, while laser and inkjet printing can render character shapes measured in pixels.

Word orientation, as well, is a key ingredient in word recognition. Most OCR software can deskew documents, though usually only those rotated by no more than one or two degrees during scanning. Vertical or diagonal text and other uncommon orientations are unlikely to be interpreted accurately.

In the final analysis, when you hear 98 percent accuracy rates quoted for OCR software packages, consider that these were most likely achieved using laser-printed business documents, where the degree of variation among characters is very small and where the orientation of characters is fixed and regular. An OCR operation on an average nineteenth-century imprint will almost certainly be less accurate.
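It helps to put a number on such claims. At 98 percent character accuracy, a page of 2,000 characters (an assumed figure for illustration) still contains about 40 errors. One common way to measure accuracy, sketched below in Python, is to count the single-character insertions, deletions and substitutions needed to turn the OCR output into a hand-checked transcription. The function names are hypothetical, and vendors may well measure their quoted rates differently.

```python
def edit_distance(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr_output):
    """Share of the true text recovered, judged by edit distance."""
    if not truth:
        return 1.0
    errors = edit_distance(truth, ocr_output)
    return max(0.0, 1 - errors / len(truth))


# A classic misreading: the "m" of "minimum" read as "rn".
print(edit_distance("minimum", "rninimum"))
print(character_accuracy("minimum", "rninimum"))
```

A single misread letter here costs two edits out of seven characters, a reminder that accuracy figures quoted per character look far kinder than the same figures counted per word.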

In addition to these caveats, OCR's inability to distinguish uncommon or unrecognized Bouma-shapes has led many manufacturers of OCR software to build in training features so users can teach the software to assign letter values to the unrecognized character shapes and words it finds.
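In outline, training amounts to letting the user extend the software's table of known shapes. The Python sketch below is a deliberately crude illustration: the class TrainableRecognizer, its methods and the sample bitmap are hypothetical, and it matches shapes exactly, where real software would presumably compare features more tolerantly.

```python
class TrainableRecognizer:
    """A toy recognizer that knows only the shapes a user has taught it."""

    def __init__(self):
        self.known = {}  # maps a glyph's shape to its letter value

    @staticmethod
    def shape_key(bitmap):
        """Turn a list of bitmap rows into a hashable dictionary key."""
        return tuple(bitmap)

    def recognize(self, bitmap):
        """Return the letter for this shape, or None if it is unrecognized."""
        return self.known.get(self.shape_key(bitmap))

    def train(self, bitmap, character):
        """Record the user's letter value for a shape."""
        self.known[self.shape_key(bitmap)] = character


# A hypothetical bitmap standing in for the long s of older imprints.
LONG_S = [".##",
          "#..",
          "#..",
          "#..",
          "#.."]

ocr = TrainableRecognizer()
print(ocr.recognize(LONG_S))   # unrecognized before training
ocr.train(LONG_S, "s")
print(ocr.recognize(LONG_S))   # the taught value afterwards
```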

A typical OCR software package for European or Western languages uses the standard ASCII character set (128 characters: the standard upper and lower case letters, the digits and some common typographical symbols) for its values. Typically, the use of Extended ASCII (those 128 characters plus up to 128 more, among them some mathematical symbols and, more importantly, the diacritical and other characters common to languages other than English) is tied to the foreign language dictionary feature of the software, if it exists.[3]
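The practical consequence can be checked directly. The Python sketch below assumes ISO 8859-1 (Latin-1) as the extended set, which is only one of several 8-bit extensions of ASCII in use; the function names are hypothetical. It lists the characters of a word that standard ASCII cannot represent and tests whether the extended set can.

```python
def outside_ascii(text):
    """Return the characters of text that standard 7-bit ASCII cannot hold."""
    return [ch for ch in text if ord(ch) > 127]


def fits_latin1(text):
    """Report whether every character exists in ISO 8859-1 (Latin-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for word in ["minimum", "Größe", "œuvre"]:
    print(word, outside_ascii(word), fits_latin1(word))
```

German "Größe" needs the extended set but fits within Latin-1, whereas the ligature in French "œuvre" falls outside Latin-1 altogether, so even an extended set can leave gaps for some languages.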

Standing as we are at the beginning of the Digital Age, computer Optical Character Recognition would seem to be one of those key processes that make the important leap from the world of print, which we have known for centuries, to the newly charted realm of digital documents. For those pursuing the digitization of our print heritage, a general knowledge of how OCR software works will, one hopes, lead to a better appreciation and awareness of its limitations and capabilities.


Footnotes


1. There have been recent studies of computer-based character recognition of multilingual and macaronic texts that have tried to surmount this problem. For a survey of recent scholarship on character recognition in computer and mathematical studies, see the University of Maryland's Document and Video Processing Group's Document Image Understanding bibliography, published annually at documents.cfar.umd.edu/biblio/.

2. A recent test of OmniPage Pro and Xerox Textbridge on nineteenth-century German Fraktur, the most common German print font of the last century and the beginning of this one, revealed that word shapes can be discriminated somewhat accurately (accuracies ranging from 60 to 80 percent before training).

3. For a trade review of the most common commercial OCR software, see David Haskin, "Optical Character Recognition," PC Magazine, January 20, 1998, www.zdnet.com/pcmag/features/ocr/_intro.htm


Steven Killings is a Humanities Computing Specialist with ACF.
steve.killings@nyu.edu

Posted February 12, 1999