UTF-8: Is that in the Klingon language?


GLOSSARY:
ANSI - American National Standards Institute
ASCII - American Standard Code for Information Interchange
UTF - Unicode Transformation Format


This is not something that the average person bothers with, but my curiosity made me look into it. It explains why some webpages (and emails) are displayed as illegible gibberish! It's actually quite interesting.

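Computers use numeric codes to display the letters of the alphabet. ASCII itself is a 7-bit code that defines only the numbers 0 to 127, with the printable characters running from 32 to 126. The 8-bit "extended ASCII" sets added the numbers from 128 to 255, but around the world these upper codes generated different characters, to accommodate different languages. Extended ASCII was not conducive to worldwide communications!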

Then came the ANSI standards, under which only the ASCII codes up to 127 were the same everywhere. "Code pages" were needed for each language (e.g., 862 for Hebrew, 737 for Greek), each specifying its own characters for 128 to 255. But there were still issues for Asian languages, because 8 bits allow only 256 values and their writing systems use thousands of characters.
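
To see why code pages caused trouble, here is a minimal Python sketch (Python is my choice for illustration, and the helper name `describe` is made up). It reads the same two bytes under code pages 862 and 737, using the `cp862` and `cp737` codecs that ship with Python.

```python
# The same two bytes, interpreted under two different code pages.
raw = bytes([0x41, 0x80])

hebrew = raw.decode("cp862")  # code page 862 (Hebrew)
greek = raw.decode("cp737")   # code page 737 (Greek)


def describe(text):
    """List the code point of each character (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(describe(hebrew))
print(describe(greek))
```

The first byte (below 128) comes out as "A" both times, while the second byte (above 127) turns into a Hebrew letter on one machine and a Greek letter on the other.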

Unicode was originally designed around 16 bits per character, and it has since grown past that limit (code points now run up to U+10FFFF), leaving room for every language, and then some. Every letter, in every alphabet, in every language, is assigned a magic number, or "code point" (e.g., U+0041 for capital A). Pretty incredible!
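
If you want to look up a code point yourself, Python's built-in `ord` and `chr` convert between a character and its number. This is only a small sketch, and the function name `code_point` is my own invention.

```python
def code_point(ch):
    """Format one character's code point in the U+XXXX style."""
    return f"U+{ord(ch):04X}"


capital_a = "A"
alef = "\N{HEBREW LETTER ALEF}"

print(code_point(capital_a))  # U+0041
print(code_point(alef))       # U+05D0
print(chr(0x41))              # A
```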

Note: The Character Map (on Windows computers), or the Character Palette (on Apple computers), shows code points.

Now, I do admit that Unicode confuses me, so I'm not going to get into it! I struggled with the "endian modes" and the Unicode Byte Order Marks... blah, blah, blah... However, understand that, in spite of Unicode being overkill for English and the Western European languages, it was very necessary for the Asian languages. Finally, we have achieved the ability to have comprehensive communications worldwide!

UTF-7, UTF-8, UTF-16, and UTF-32 are different ways to store code points as bytes, and agreeing on one of them is what allows people in various countries to read "foreigners'" emails or webpages. If the writer of a webpage (or email) does NOT specify how code points are stored, they are sadly defeated by a form of "WWW tunnel vision," especially when their work ends up looking like a mess of gobbledygook on some computer screens on the other side of the world.
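
Here is a rough Python sketch of what "storing code points" means in practice. It takes two code points, U+0041 (capital A) and U+20AC (the euro sign), and shows the bytes each UTF produces. The variable names are mine, and the "-le" and "-be" suffixes are the little-endian and big-endian flavors of those "endian modes."

```python
text = "A\N{EURO SIGN}"  # code points U+0041 and U+20AC

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-be"]
stored = {name: text.encode(name) for name in encodings}

# Print each encoding name next to the bytes it produced, in hex.
for name, data in stored.items():
    print(f"{name:10} {data.hex(' ')}")
```

Same two characters, four different piles of bytes. A computer that is told the wrong storage method will read those bytes as something else entirely.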

Computers need to know what storage method is being used. An email has to have a line in its header such as:

          Content-Type: text/plain; charset="UTF-8"
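
For the curious, here is a small sketch of how a program might build such an email, assuming Python's standard `email` package. The subject and body text are made up for the example.

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Greetings"

# set_content stores the text and records the charset in the header.
msg.set_content("caf\N{LATIN SMALL LETTER E WITH ACUTE}", charset="utf-8")

print(msg["Content-Type"])
print(msg.get_content_charset())
```

The program never writes the Content-Type line by hand. The library adds it, which is roughly what your email program does behind the scenes.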

Webpages need to supply storage information inside the HEAD tag, such as:

          <html>
          <head>
          <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Luckily for us, many email programs automatically specify the storage method in the header section (usually hidden from us) of each email. On webpages, if it isn't declared, most browsers will try to guess what method was used, and then display the webpage based on that assumption. There are no guarantees that browsers won't think it was written in Klingon. Poor Klingons, they will see nothing but gibberish!
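
A wrong guess is easy to reproduce. In this minimal Python sketch, the text is stored as UTF-8 and then read back as if it were Latin-1 (an assumption I picked for the demo; a real browser's guess could be any encoding).

```python
original = "caf\N{LATIN SMALL LETTER E WITH ACUTE}"
stored = original.encode("utf-8")  # what the writer saved

wrong_guess = stored.decode("latin-1")  # reader assumes the wrong method
right_guess = stored.decode("utf-8")    # reader assumes the right method

print(wrong_guess)
print(right_guess)
```

The wrong guess turns a four-letter word into five characters, with two strange symbols where the accented letter should be. That is the gibberish in question.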

Now some smart Klingon, assuming he wants to and knows how, will have to explore the browser's Encoding menu to find the right method for reading such webpages!

If all this makes sense to you, well, Smarty-pants, why don't you just go live at the Unicode Consortium!
