On a typewriter, the character set was limited by the number of typing keys: about forty-five keys, with the number of characters doubled by using the shift key. On a computer, with more keys on the board and more combining keys added to the lowly shift, the number of characters in a set is limited not by the keys but by mathematics. The ASCII set that's most commonly used has 128 characters (actually, several of the "characters" are invisible, for things like tabs and line breaks and spaces); the common ANSI set adds another 128 characters, mostly letters with diacriticals used in European languages, for a total of 256.
Why such strange numbers, and not a round number like 100 or 200?
Now, with a one-digit binary number, we can have either 0 or 1: two choices. If we add another digit, we get four choices: decimal 0, 1, 2, and 3. Another digit doubles the number of choices again, to eight (0 = 0; 1 = 1; 10 = 2; 11 = 3; 100 = 4; 101 = 5; 110 = 6; 111 = 7), and so on.
So a principle emerges: each binary digit you add doubles the number of choices, or the size of the decimal number you can represent -- 4 digits yield 16 choices, 5 yield 32, 6 yield 64, and 7 yield 128. (You may recognize the series as powers of 2: 2^1, 2^2, 2^3, 2^4, ... 2^7.)
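To watch the doubling happen, here is a minimal Python sketch. The function name `choices` is made up for illustration; the loop simply prints, for each width from one to eight digits, how many values are available and what the largest one looks like in binary.

```python
def choices(digits: int) -> int:
    """Number of distinct values that `digits` binary digits can represent."""
    return 2 ** digits


# Print a small table: digit count, number of choices, largest value in binary.
for digits in range(1, 9):
    largest = choices(digits) - 1
    print(digits, choices(digits), format(largest, "b"))
```

Seven digits give the 128 of ASCII, and eight give 256.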
In computer terminology, the choice between a 1 and a 0 is a bit of information, and all computer codes are made up of strings of these bits: 100111010100101, ad infinitum. But a long string like that is very hard to read, to check, and to transmit; it helps to break the string into smaller bytes to keep things straight. By now, the 8-bit byte is well-nigh universal; and eight binary digits, not coincidentally, yield 256 choices.
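As a rough illustration of breaking a long string into bytes, the sketch below chops the sample string from the paragraph above into 8-bit pieces. The helper name `split_into_bytes` is hypothetical, and padding on the left with zeros is just one assumed convention for a string whose length isn't a multiple of eight.

```python
def split_into_bytes(bits: str) -> list:
    """Split a string of 0s and 1s into 8-bit chunks.

    Assumption: a string that is not a multiple of 8 digits long is
    padded with leading zeros, which leaves its numeric value unchanged.
    """
    padded_length = -(-len(bits) // 8) * 8  # round up to a multiple of 8
    padded = bits.zfill(padded_length)
    return [padded[i:i + 8] for i in range(0, len(padded), 8)]


chunks = split_into_bytes("100111010100101")
print(chunks)                          # the two 8-bit pieces
print([int(c, 2) for c in chunks])     # the same bytes as decimal numbers
```

Each piece is a number between 0 and 255, one of the 256 choices a byte allows.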
When you send a string of electrical pulses over a wire, it's a good idea to have a way of checking to see whether the transmission is clean; the electrical pulses of static could well garble your transmission. For that reason, modem transmissions often include a check bit (also called a parity bit), which gives a good indication of whether the bits got through ungarbled: if the sum of the digits in a string (that is, the number of 1s) is odd, the check bit is a 0; if even, it's 1. Not foolproof, since two flipped bits cancel each other out, but useful. Thus one possible protocol is to transmit 8-bit bytes, with the eighth bit being a check bit. That leaves seven bits for the character itself, or 128 choices, and so we are left with either ASCII or -- if sender and receiver agree on it -- another 128-character set.
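The check-bit rule just described can be sketched in a few lines of Python. The function names are invented for illustration, and putting the check bit at the end of the byte is an assumption; real protocols differ on where it goes and on whether odd or even sums are the target.

```python
def check_bit(data_bits: str) -> str:
    """Return '0' if the count of 1s is odd, '1' if it is even (the rule above)."""
    ones = data_bits.count("1")
    return "0" if ones % 2 == 1 else "1"


def add_check_bit(seven_bits: str) -> str:
    """Build an 8-bit byte: seven data bits plus the check bit (assumed last)."""
    return seven_bits + check_bit(seven_bits)


def looks_clean(byte: str) -> bool:
    """Under this rule, an ungarbled byte always holds an odd number of 1s."""
    return byte.count("1") % 2 == 1


sent = add_check_bit("1000001")       # 1000001 is the ASCII code for "A"
print(sent, looks_clean(sent))
print(looks_clean("10000001"))        # one bit flipped in transit: caught
print(looks_clean("10000101"))        # two bits flipped: slips through
```

The last line shows why the check is "not foolproof": a second garbled bit restores the odd count, and the damaged byte passes.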
As its full name (American Standard Code for Information Interchange) implies, ASCII is American, which made sense when most computing was American. To represent languages other than English, though, more letters are needed; those needs were met with the ANSI set and with other code pages, which made it possible to assign code strings to other letters. Fairly early on, good word-processing programs allowed one to take characters from a dozen different sets, for different national scripts, and for various intellectual fields such as mathematics, chemistry, or engineering. But to do so, the user, or at least the program and computer, must constantly shift from one code page to another, essentially saying, "Look on code page 14 and give me character number 37; now go back to page 1," and so on. It works; it's relatively cheap in terms of storage and computer time. But it's messy, and it still doesn't include all the characters one might need. Russian, anyone? Georgian? Thai? Japanese?
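The messiness is easy to reproduce. In this sketch, Python's built-in `cp1252` (a Western European page, often loosely called "ANSI" on Windows) and `cp1251` (a Cyrillic page) stand in for whatever code pages a particular program might have used; that choice is an assumption made only for illustration. The same 8-bit byte comes out as two different letters, depending on which page you look it up on.

```python
raw = bytes([0xE9])  # a single 8-bit byte, decimal value 233

western = raw.decode("cp1252")   # looked up on a Western European code page
cyrillic = raw.decode("cp1251")  # looked up on a Cyrillic code page

print(western, cyrillic)

# Plain 7-bit ASCII stops at 127, so it has no character 233 at all.
try:
    raw.decode("ascii")
    in_ascii = True
except UnicodeDecodeError:
    in_ascii = False

print(in_ascii)
```

Unless sender and receiver agree on the page, the byte alone does not say which letter was meant.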
Posted 26 September 1996