Connect Fall 1996:  COMPUTING AND THE HUMANITIES


Bits, Bytes, and Character Sets:
Mathematics Is Destiny

David Frederickson

Character Sets

When we type something on a computer or transmit it over a network, we're limited by the character set available. An old standard American typewriter couldn't readily type accented letters or British pound signs, because they weren't part of the available character set; conversely, a French typewriter probably didn't include a dollar sign.

On a typewriter, the character set was limited by the number of typing keys: about forty-five keys, with the number of characters doubled by using the shift key. On a computer, with more keys on the board and more combining keys added to the lowly shift, the number of characters in a set is limited not by the keys but by mathematics. The ASCII set that's most commonly used has 128 characters (actually, several of the "characters" are invisible, for things like tabs and line breaks and spaces); the common ANSI set adds another 128 characters, mostly letters with diacriticals used in European languages, for a total of 256.
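
For readers who like to see such things for themselves, here is a minimal Python sketch that tallies the 128 ASCII codes. It assumes Python's own notion of a "printable" character, which treats the space as printable and the control codes (tab, line break, and so on) as not; the variable names are ours, purely for illustration.

```python
# Tally the 128 ASCII codes (0 through 127).
# Assumption: "invisible" here means Python's str.isprintable() is False,
# which covers the control codes but not the space character.
ascii_codes = range(128)

control_codes = [c for c in ascii_codes if not chr(c).isprintable()]
visible_codes = [c for c in ascii_codes if chr(c).isprintable() and chr(c) != " "]

print(len(ascii_codes))    # 128 codes in all
print(len(control_codes))  # 33 control codes (tab, line feed, and so on)
print(len(visible_codes))  # 94 visible characters; the space accounts for the last one

# Every character is just a number underneath:
print(ord("A"))            # 65
print(repr(chr(9)))        # '\t', the tab
```

The space is counted separately here only because Python calls it printable; the article's point, that a fair number of the 128 slots go to things you can't see, holds either way.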

Why such strange numbers, and not a round number like 100 or 200?

Bits and Bytes

The reason for the peculiar numbers lies in binary mathematics. If you think back to junior high, you may recall a frustrating period wherein your teacher tried to convince you that non-decimal number systems, with more or fewer digits than ten, were possible. A binary system, with only two digits, 0 and 1, is possible -- though inefficient in terms of space, since decimal 2 is represented in binary as 10, and decimal 3 as 11, decimal 4 as 100, 5 as 101, etc.: numbers get longer faster in binary. But binary code is basic to computers, which are essentially millions of minute switches that can be either on or off -- a pair of states easy to represent with 1 and 0.

Now with a one-digit binary number, we can have either 0 or 1: two choices. If we add another digit, we get four choices: decimal 0, 1, 2, and 3. Another digit doubles the number of choices again, to eight (0 = 0; 1 = 1; 10 = 2; 11 = 3; 100 = 4; 101 = 5; 110 = 6; 111 = 7).

So a principle emerges: each binary digit you add doubles the number of choices, or the size of the decimal number you can represent -- 4 digits yield 16 choices, 5 yield 32, 6 yield 64, and 7 yield 128. (You may recognize the series as powers of 2: 2^1, 2^2, 2^3, 2^4, ... 2^7.)
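
The doubling principle is easy to check by machine. The following is a small Python sketch, not anything official; the function names choices and all_patterns are invented for this illustration.

```python
def choices(n_bits):
    """Number of distinct values an n-digit binary number can hold."""
    return 2 ** n_bits

def all_patterns(n_bits):
    """Every n-digit binary pattern, in counting order, padded with zeros."""
    return [format(value, "0{}b".format(n_bits)) for value in range(choices(n_bits))]

# Each added digit doubles the count: 2, 4, 8, 16, 32, 64, 128, 256.
for n in range(1, 9):
    print(n, "digits:", choices(n), "choices")

# The eight patterns available with three digits:
print(all_patterns(3))
# ['000', '001', '010', '011', '100', '101', '110', '111']
```

Padding with leading zeros makes every pattern the same width, which is how a computer actually stores them.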

In computer terminology, the choice between a 1 and a 0 is a bit of information, and all computer codes are made up of strings of these bits: 100111010100101, ad infinitum. But a long string like that is very hard to read, to check, and to transmit; it helps to break the string into smaller bytes to keep things straight. By now, the 8-bit byte is well-nigh universal; and eight binary digits, not coincidentally, yield 256 choices.
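
To see why chopping a long string into bytes helps, here is a hedged Python sketch that splits a string of bits into 8-bit groups and reads each group as a number. The sample string and the function names are our own; we assume the string's length is an exact multiple of eight.

```python
def split_into_bytes(bits):
    """Break a string of 0s and 1s into 8-digit groups."""
    if len(bits) % 8 != 0:
        raise ValueError("bit string length must be a multiple of 8")
    return [bits[i:i + 8] for i in range(0, len(bits), 8)]

def byte_values(bits):
    """Read each 8-digit group as a decimal number from 0 to 255."""
    return [int(group, 2) for group in split_into_bytes(bits)]

stream = "0100100001101001"          # sixteen bits, hard to read as one lump
print(split_into_bytes(stream))      # ['01001000', '01101001']
print(byte_values(stream))           # [72, 105]
print("".join(chr(v) for v in byte_values(stream)))  # Hi
```

Read as ASCII, the two bytes 72 and 105 spell "Hi".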

E-mail and 7-bit ASCII

Like classic Teletype machines (which were based on telegraphy and the very limited Morse code), early computers contented themselves with numbers and capital letters, along with a few punctuation marks; these fit comfortably within a 64-character set. But if you need lowercase letters as well, you've passed that limit and need the 128 choices provided by 7 bits. Thus the ASCII set.
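
The arithmetic behind that jump from 6 bits to 7 can be sketched in a few lines of Python. The figure of 20 punctuation marks is an assumption made up for the example, not a count from any real machine, and bits_needed is a hypothetical helper.

```python
def bits_needed(n_symbols):
    """Smallest number of binary digits that gives at least n_symbols choices."""
    return max(1, (n_symbols - 1).bit_length())

capitals = 26
digits = 10
punctuation = 20      # illustrative guess: "a few punctuation marks"
lowercase = 26

early_set = capitals + digits + punctuation
print(early_set, bits_needed(early_set))        # 56 symbols fit in 6 bits

with_lowercase = early_set + lowercase
print(with_lowercase, bits_needed(with_lowercase))  # 82 symbols need 7 bits
```

Anything up to 64 symbols squeezes into 6 bits; the 65th forces a seventh.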

When you send a string of electrical pulses over a wire, it's a good idea to have a way of checking to see whether the transmission is clean; the electrical pulses of static could well garble your transmission. For that reason, modem transmissions often include a check bit (also called a parity bit), which gives a good indication of whether the bits got through ungarbled: if the sum of the other digits in the byte is odd, the check bit is a 0; if even, it's 1, so that the digits of a clean byte always add up to an odd number. Not foolproof, since two garbled digits can cancel each other out, but useful. Thus one possible protocol is to transmit 8-bit bytes, with the eighth bit being a check bit. And that leaves us with either ASCII or -- if sender and receiver agree on it -- another 128-character set.
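
Here is a minimal Python sketch of the check-bit rule just described: 0 if the seven data digits add up to an odd number, 1 if even. Putting the check bit last in the byte is our assumption (real protocols differ on the position), and the function names are invented for the example.

```python
def check_bit(data_bits):
    """The article's rule: 0 if the digits sum to an odd number, 1 if even."""
    ones = data_bits.count("1")
    return 0 if ones % 2 == 1 else 1

def add_check_bit(data_bits):
    """Append the check bit to seven data bits (position is an assumption)."""
    return data_bits + str(check_bit(data_bits))

def looks_clean(byte_bits):
    """Recompute the check bit on arrival and compare it with the one sent."""
    return check_bit(byte_bits[:-1]) == int(byte_bits[-1])

sent = add_check_bit("1000001")      # 7-bit ASCII for "A"
print(sent)                          # 10000011
print(looks_clean(sent))             # True

print(looks_clean("10000111"))       # False: one garbled digit is caught
print(looks_clean("10001111"))       # True: two garbled digits slip through
```

The last line shows why the scheme is "not foolproof": an even number of flipped digits leaves the sum's oddness unchanged.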

As its full name (American Standard Code for Information Interchange) implies, ASCII is American, which made sense when most computing was American. To represent languages other than English, though, more letters are needed; those needs were met with the ANSI set and with other code pages, which made it possible to assign code strings to other letters. Fairly early on, good word-processing programs allowed one to take characters from a dozen different sets, for different national scripts, and for various intellectual fields such as mathematics, chemistry, or engineering. But to do so, the user, or at least the program and computer, must constantly shift from one code page to another, essentially saying, "Look on code page 14 and give me character number 37; now go back to page 1," and so on. It works; it's relatively cheap in terms of storage and computer time. But it's messy, and it still doesn't include all the characters one might need. Russian, anyone? Georgian? Thai? Japanese?
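
The code-page idea can be seen in miniature with Python's built-in codecs. In this sketch we assume two real code pages that Python ships with, cp1252 (Western European) and cp1251 (Cyrillic); the helper read_as is our own name. The same byte, number 233, comes out as a different letter depending on which page the program consults.

```python
import unicodedata

def read_as(raw, code_page):
    """Interpret raw bytes according to the named code page."""
    return raw.decode(code_page)

raw = bytes([0xE9])                   # one byte, decimal 233

western = read_as(raw, "cp1252")
cyrillic = read_as(raw, "cp1251")

print(unicodedata.name(western))      # LATIN SMALL LETTER E WITH ACUTE
print(unicodedata.name(cyrillic))     # CYRILLIC SMALL LETTER SHORT I

# Plain 7-bit ASCII has no character number 233 at all:
try:
    read_as(raw, "ascii")
except UnicodeDecodeError:
    print("not an ASCII character")
```

Nothing in the byte itself says which page was meant; sender and receiver simply have to agree, which is the messiness the paragraph above complains of.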

The Dream of a Universal Character Set

As computers became more powerful, storage cheaper, and transmission faster, considerations of economy became less pressing. And though economy is still important in many situations and in most developing countries, the rapid changes we've seen in computing and networking make it seem pointless to hamper ourselves by sticking to 7-bit or 8-bit character sets. Why not have a 10-bit set whose 1024 characters could represent all European languages? Or a 16-bit set that could include large swaths for Chinese and Japanese ideographs? And thus the idea of Unicode was born. [ C ]
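
As a rough illustration of why wider character sets are tempting, this Python sketch asks how many bits it takes to number a few sample characters. It leans on Python's built-in ord(), which returns a character's Unicode code point; the sample characters and the helper names bits_for and fits are our own choices.

```python
import unicodedata

def bits_for(ch):
    """Binary digits needed to write this character's code number."""
    return max(1, ord(ch).bit_length())

def fits(ch, n_bits):
    """True if the character's code number fits in an n-bit set."""
    return ord(ch) < 2 ** n_bits

# A Latin capital, an accented letter, a Cyrillic letter, a CJK ideograph
samples = ["A", "\u00e9", "\u042f", "\u65e5"]
for ch in samples:
    print(unicodedata.name(ch), ord(ch), bits_for(ch))

print(2 ** 10)                 # 1024 slots in a 10-bit set
print(2 ** 16)                 # 65536 slots in a 16-bit set
print(fits("\u65e5", 8))       # False: no room in an 8-bit set
print(fits("\u65e5", 16))      # True
```

The capital A needs only 7 bits and the accented e needs 8, but the Cyrillic letter needs 11 and the ideograph 15.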


David Frederickson edits Connect.
frederickson@nyu.edu

Posted 26 September 1996