Copyright © 1999 G. Adam Stanislav. All rights reserved.
What is i18n

I18n is short for internationalization (an i, followed by 18 other characters, followed by an n). The creators of the original personal computer used a 7-bit character code, commonly known as ASCII, which is capable of encoding only the very basic characters of the Roman alphabet. That makes it useless for just about any language other than English and Latin.

To solve this problem, the International Organization for Standardization (ISO) developed a family of standards called ISO-8859. Each of these standards either combines the accented characters of several languages into an 8-bit code, or allows the use of a different alphabet (such as Cyrillic, Greek, and others). Each of the standards starts with the same encoding as ASCII and then defines additional characters. For example, ISO-8859-2 encodes the character sets of various Central European languages into an 8-bit set.

Alas, while this solves the problem of using a personal computer in any one language, it is not possible to encode all the characters used by the various languages of the world into eight bits.
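Worse, an 8-bit text file does not record which of these standards it was written with. The very same byte means one thing in ISO-8859-1 and quite another in ISO-8859-2, as this small C sketch illustrates (the byte values come straight from the two standards' tables; the program itself is only an illustration):

#include <stdio.h>

/*
 * One byte, two meanings: nothing in an 8-bit text file records
 * which ISO-8859 table it was written with, so the reader can
 * only guess.
 */
int main(void)
{
    unsigned char c = 0xA9;

    printf("0x%02X read as ISO-8859-1: copyright sign (U+00A9)\n", c);
    printf("0x%02X read as ISO-8859-2: S with caron   (U+0160)\n", c);
    return 0;
}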
Is this a problem?

That, of course, depends on your needs. If, for example, you want to design a web page in a language other than English, you can probably use one of the ISO-8859 encodings. In that case, it is not a problem.

But suppose you were asked to help a Buddhist scholar design a web site to publish the results of a study of various Buddhist texts. You would need to display the text of the study itself in English (or whatever language it is written in). You would also need to display some of the original texts. Suppose they are in Sanskrit, Tibetan, and Chinese. Each of these languages uses a different character set. Add to that the fact that Chinese alone cannot fit into a pure 8-bit representation. What options do you have?

None of these options offers a truly satisfactory solution.

Multi-byte encoding

The obvious solution to i18n problems is the use of multi-byte encoding, i.e., using more than one 8-bit byte (or octet, in Internet parlance) per character. Several multi-byte encodings have been developed over the years. Some of them are suitable only for Chinese or Japanese. But one system has emerged as clearly superior: Unicode.

The Unicode standard uses a 16-bit mapping. In other words, it assigns a 16-bit integer to the various characters of all alphabets currently in existence. Additionally, it maps some other pictographs, such as mathematical symbols, dingbats, and others. It also reserves part of the map for private use. Anyone can use the private section to map any glyphs or pictographs they want. A popular effort exists to extend the private section into a four-byte encoding using 31 bits (referred to as UCS-4). This allows enough space for fictional alphabets (e.g., Klingon) and extinct alphabets (e.g., Egyptian hieroglyphics), while remaining backwards compatible with the Unicode standard.

Note that the Unicode standard only maps glyphs to 16-bit integers. It does not specify how a 16-bit value is to be represented on the computer. The main reason for this is that there are two principal ways of doing it in the world of computers. One places the least significant byte (LSB) before the most significant byte (MSB). The other does the exact opposite: it places the LSB after the MSB. This is quite irrelevant as long as a computer is not connected to a network: it simply uses its own native representation of 16-bit integers. But the moment the computer needs to exchange Unicode data with another computer without knowing its native format (which happens all the time on the Internet), some kind of encoding protocol is required.

RFC 2279 defines UTF-8, an encoding of any 31-bit value into a unique combination of one to six octets (8-bit bytes). This is the preferred Unicode encoding protocol on the WWW. It is also fully compatible with UCS-4. RFC 2277 states: “Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.”

Because of that, recent versions of all popular web browsers support UTF-8 encoding. Depending on the underlying operating system and the fonts used, they will either display the right glyph as mapped by the Unicode standard, or they will show the closest glyph they can find. Take, for example, the letter S with a caron (Š). If UTF-8 encoded, it will show up either as an S with a caron or as a plain S. Without UTF-8, all bets are off: with ISO-8859-2 encoding, for example, it may show up as an S with a caron, as a copyright symbol, or as something else, depending on the font used.
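To make the scheme concrete, here is a small C sketch of RFC 2279 encoding (an illustration only; none of the tools described below necessarily uses this exact code, and the function name utf8_encode is simply what I call it here). It encodes the S with a caron, Unicode value 0x0160, and also shows why the raw 16-bit integer is a problem: its two bytes come out as 60 01 on one kind of machine and 01 60 on the other, while the UTF-8 octets C5 A0 are the same everywhere.

#include <stdio.h>
#include <string.h>

/*
 * Encode a 31-bit character value into UTF-8 as defined by
 * RFC 2279: one to six octets.  The first octet announces the
 * total length; each further octet has the form 10xxxxxx and
 * carries six bits of the value.  Returns the number of octets
 * written to buf (which must hold at least six bytes).
 */
static int utf8_encode(unsigned long code, unsigned char *buf)
{
    int len, i;

    if (code < 0x80UL) {             /* plain ASCII: one octet */
        buf[0] = (unsigned char)code;
        return 1;
    }
    else if (code < 0x800UL)     len = 2;
    else if (code < 0x10000UL)   len = 3;
    else if (code < 0x200000UL)  len = 4;
    else if (code < 0x4000000UL) len = 5;
    else                         len = 6;   /* up to 0x7FFFFFFF */

    /* Trailing octets carry six bits each, least significant last. */
    for (i = len - 1; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* Leading octet: len one-bits, a zero, then the remaining bits. */
    buf[0] = (unsigned char)((0xFF << (8 - len)) | code);
    return len;
}

int main(void)
{
    unsigned short s = 0x0160;   /* S with caron as a native 16-bit integer */
    unsigned char mem[sizeof s], buf[6];
    int n, i;

    memcpy(mem, &s, sizeof s);   /* how this machine happens to store it */
    printf("native bytes: %02X %02X (depends on the machine)\n",
           mem[0], mem[1]);

    n = utf8_encode(0x160UL, buf);
    printf("UTF-8 octets:");
    for (i = 0; i < n; i++)
        printf(" %02X", buf[i]);
    printf(" (the same everywhere)\n");   /* prints: C5 A0 */
    return 0;
}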
Tools

Tools to convert text files into the UTF-8 format are necessary. Some have been developed, others still need to be made.

These tools allow you to create and edit UTF-8 files containing the entire span of the Unicode standard. They can also be called transparently by other programs, such as various X11 editors or CGI scripts. I used all these tools to convert this web page into UTF-8.

Putting it all together

The tools included here were designed to work together. They allow you to convert files from any encoding to any encoding, and to and from UTF-8. Suppose you have an HTML file created under Windows 95 using its non-standard encoding. Simply pipe everything in the proper order, like this (placing it all on one line, which I have split up here for legibility):

tuc -i input.html |
utrans -p CP1252 |
uhtrans |
hutrans |
ptrans -p ISO-8859-1 -o output.html
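Roughly, stage by stage: tuc converts the file to Unix text conventions; utrans -p CP1252 translates the Windows code page into UTF-8; uhtrans and hutrans take the text into HTML character entities and back, so that any entities already present in the source end up as UTF-8 as well; and ptrans -p ISO-8859-1 translates the result into ISO-8859-1, writing output.html. The point to notice is that every intermediate stage speaks UTF-8, which is what allows the tools to be combined so freely.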