Connect Fall 1996:  COMPUTING AND THE HUMANITIES


Unicode: Writing in the Global Village

Joseph Hargitai

With the proliferation of computers throughout the world, the need for applications that can process more than one language at a time has grown. To address this need, manufacturers and software developers have been exploring ways to overcome the barrier presented by the thousands of written characters, scripts, and symbols used every day across the globe.

The first wave of internationalization was the creation of localized systems. Apple, Microsoft, and IBM delivered language-specific operating systems as early as the mid-eighties. The level of localization varied from product to product: some used the native language in documentation only, others in menu systems, filenames, applications, system calls, compilers, and error logs. All of them, however, stopped short of handling multiple languages. In addition, different language systems often required language-specific applications (Chinese Windows did not run applications developed for Japanese Windows, and vice versa), which seriously increased the cost of software development.

Second came language-enhanced systems, localized computers that could switch between languages and use a wide range of software. One such system was WorldScript, introduced by Apple in 1992. WorldScript technology allowed users to install language kits on an existing native system and switch at will between languages. To make use of this feature, one needed applications that understood WorldScript technology. (Apple supplies appropriate versions of TeachText with each language kit, and there are additional applications, such as WordPerfect 3.5, Nisus Writer, and Photoshop 3.0.5.)

WorldScript technology was as close an approximation of a multiple-language computing system as possible without a fundamentally redesigned operating system. An ideal internationalized system, however, would have to go further. It should provide a uniform platform for all software, regardless of the language or languages used, and it should also offer portability across applications, platforms, and networks.

The central issue of such an undertaking, internationalization, was character encoding. By the late 1980s it was clear that existing international encoding standards and quickly emerging national sub-standards, while serving their local needs well, could not provide the necessary foundation for a unified global system. ASCII, the venerable 7-bit system, could code only 128 characters. (To see why, read "Bits, Bytes, and Character Sets: Mathematics Is Density.") ISO 8859, an 8-bit step up (its first part is widely known as Latin-1), could code 256 characters in each of its parts. While sufficient for most European alphabets and some non-roman alphabets, such as Cyrillic, Hebrew, and Arabic, these standards lacked the capacity to represent writing systems that are not alphabetic and rely on large sets of symbols and ideographs. To accommodate the goals of internationalization, the lowest common denominator, the number of bits assigned to each character, had to be increased.
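The capacity limits described above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `fits` and the sample words are assumptions chosen for illustration, and the codec names `iso8859_1` and `iso8859_5` are Python's labels for the Latin-1 and Cyrillic parts of ISO 8859.

```python
# An n-bit code has 2**n distinct values, hence 128 and 256 slots.
ASCII_SLOTS = 2 ** 7
EIGHT_BIT_SLOTS = 2 ** 8

def fits(text, encoding):
    """Return True if every character of text has a code in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Sample words (illustrative only).
samples = {
    "English": "village",
    "French": "café",
    "Russian": "деревня",
    "Chinese": "村",
}

print(ASCII_SLOTS, EIGHT_BIT_SLOTS)
for language, word in samples.items():
    # Columns: 7-bit ASCII, ISO 8859 part 1 (Latin-1), ISO 8859 part 5 (Cyrillic)
    print(language,
          fits(word, "ascii"),
          fits(word, "iso8859_1"),
          fits(word, "iso8859_5"))
```

Each 8-bit part covers its own alphabet, but no single part covers the French and the Russian word together, and none has room for the Chinese character.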

In 1989, an informal group originated by Xerox and Apple, and later joined by AT&T, IBM, Lotus, Microsoft, NeXT, and Novell, founded the Unicode Consortium. The group published its first standard for character encoding in 1990 (Unicode Standard version 1.0). At the same time, the ISO (International Organization for Standardization) was working on a similar encoding scheme, ISO 10646; this was a 32-bit system, with potential code space for 4 billion characters. To avoid confusion, the two organizations agreed to combine their standards. Unicode Standard version 1.1 and ISO 10646, finally published in 1993, were identical in the 16-bit range. The result was a comprehensive character set organized as a table of 16-bit values that allowed for 65,536 possible characters (alphabetic, syllabic, and ideographic alike), with ample space for standard scientific and mathematical notations as well.
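The arithmetic behind the 16-bit table can be sketched in Python. The function name `to_16bit` and the sample string are assumptions made for this example; the fixed two-bytes-per-character layout follows the standard as this article describes it, not any particular product.

```python
import struct

CODE_SPACE_16 = 2 ** 16   # 65,536 slots in a 16-bit table
CODE_SPACE_32 = 2 ** 32   # 4,294,967,296 slots in a 32-bit space

def to_16bit(text):
    """Pack each character as one big-endian 16-bit value."""
    codes = [ord(ch) for ch in text]
    if any(code >= CODE_SPACE_16 for code in codes):
        raise ValueError("character outside the 16-bit table")
    return struct.pack(">%dH" % len(codes), *codes)

# Latin, Cyrillic, Hebrew, Chinese, Japanese kana, and a math symbol
sample = "A\u0436\u05D0\u4E2D\u3042\u2211"
packed = to_16bit(sample)

for ch in sample:
    print("U+%04X" % ord(ch))
print(len(packed), "bytes for", len(sample), "characters")
```

Characters from six different scripts and notations sit in one table, and each occupies exactly two bytes.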

In addition, Unicode resolved the problem of character ordering and contextual form in scripts like Arabic (also used to write Farsi) by adding semantic information, such as character ordering and multidirectional algorithms, at the code level. For example, in Arabic each letter can take different shapes depending on whether it stands alone or falls at the beginning, middle, or end of a word. Unicode encodes each letter once, in reading order, and leaves the choice of contextual shape to the software that displays the text; separate codes for the individual positional forms are included only for compatibility with older systems.
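The positional forms and the directional information mentioned above can be inspected with Python's standard `unicodedata` module. The sketch below is only an illustration: the choice of the letter beh and the variable names are assumptions made for the example. It prints the four positional codes of one Arabic letter together with the mapping each carries back to the base letter, then the directional class recorded for a few characters.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the base letter
# Isolated, final, initial, and medial codes for the same letter
POSITIONAL_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for form in POSITIONAL_FORMS:
    name = unicodedata.name(form)
    mapping = unicodedata.decomposition(form)       # e.g. "<initial> 0628"
    base = unicodedata.normalize("NFKC", form)      # folds back to the base letter
    print(name, "|", mapping, "|", base == BEH)

# Directional class stored with each character:
# L = left-to-right, R = right-to-left, AL = Arabic letter, EN = European number
for ch in ["a", "\u05D0", BEH, "1"]:
    print("U+%04X" % ord(ch), unicodedata.bidirectional(ch))
```

The directional classes are the kind of per-character property a multidirectional display algorithm relies on when it lays out mixed left-to-right and right-to-left text.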

How would software and hardware based on such an encoding system help the end-user?

A student could present a paper using special characters, multiple alphabets, ideographs, and symbols. A historian could use hieroglyphs in the same sentence with descriptive text. A linguist could use multidirectional text to illustrate her point in Arabic, Chinese, and Thai within the same file. A business owner could search multinational phone books online, in which names are represented with correct and consistent spelling. Information managers could scan databases containing data in different languages. They could sort, compile, and display the result with a variety of software. And, of course, one could use extensive computer networks to publish, e-mail, and browse in multiple languages.

What hurdles must such an encoding system overcome?

Despite these hurdles, Unicode may soon become the most common multilingual character-coding system. Support for multiple-language use is growing quickly. Several recent operating systems and software platforms, including AT&T's Plan 9, Windows NT, Novell's NetWare 4.01 Directory Services, Sybase's Gain Momentum, and Apple's Newton, already support Unicode. The projected Mac OS 8 (code-named Copland) and updates of Windows 95 will also be Unicode-compliant. In addition to operating systems, emerging 32-bit development kits and compilers will support Unicode characters.

Promising large-scale implementations of Unicode are also under way. The National Library of Australia has deployed MASS (Multilingual Application Support Service), a set of software tools developed by the Institute of Systems Science at the National University of Singapore, to provide a multilingual environment for clients and developers. The library hopes to catalogue its multinational archive for continental and transcontinental use. Using an X-terminal emulator program called UXTERM, clients can input, display, and search data in over 150 languages.

Another example comes from the language-development team of Duke University. WinCALIS (Computer-Assisted Language Instruction System for Windows) is an extensive Unicode-based program designed by language teachers for language teachers. Currently used by the computing labs of Duke, Vanderbilt, Tufts, Rice, and James Madison Universities, WinCALIS is proving itself to be a strong model for language instruction.

The engineering of Unicode is not a purely technical issue. The phenomenal growth of personal computer usage around the world, along with the dramatic shift of hardware and software development away from North America, has changed the way we look at computing. New questions have been raised: Will there be a continuation of parallel standards or will we have a new all-encompassing standard? One question has already been answered: old standards will have to change.


Joseph Hargitai worked at the ACF Innovation Center at the time of this article's publication.
joseph.hargitai@nyu.edu

Posted 26 September 1996. Last reviewed 30 November 2005.