Critical Section

Archive: January 4, 2011

[way more than] everything you ever wanted to know about character encoding

Tuesday, 01/04/11 10:34 PM

<ράντ type="οπτιοναλ">

Have you ever seen this, and wondered what the heck it means?

<?xml version="1.0" encoding="UTF-8"?>

Read on! As you’ll see, this is a beautiful solution to a tough problem.

In the beginning we needed to represent characters as numbers, and so there was ASCII, and it was good. (I’m skipping FIELDATA and let’s not mention EBCDIC in polite company.) ASCII was a way of representing every character in 7 bits, such that they fit comfortably in a byte, leaving a high-order parity bit. Where “every” includes all letters, numbers, and all the punctuation you could ever want, including stuff used for programming like quotes, equal signs, and angle brackets.

And then C was created, along with the Unix runtime libraries, and they cleanly supported ASCII. Char was a native type, one byte; strings could be stored in arrays of char, and by convention the NUL character (00) was used to mark the end of a string. A metric ton of code was written which processed, managed, and manipulated such strings. Most of it ignores the content of strings, secure in the knowledge that every character resides comfortably in one byte, and that the whole is terminated by a zero.

Life was good.

Gradually however it became apparent that “every” did not actually include all characters. There were these people in Europe who used áccènts and ümlaûts. They even had some different punctuation like upside down exclamation points¡ Who knew? And so the high order bit formerly used for parity was reused and 128 more characters were defined. Such a luxury, there was even room for graphics characters that could be used for drawing lines and stuff; ▀▄ yippee ▄▀. These characters still fit snugly into one byte, and that metric ton of code worked perfectly.

But … turns out there are all these other people on Earth who don’t use the Roman alphabet, and they use computers too! We’re talking Greek, Cyrillic, Armenian, Hebrew, Arabic, and so on. (Armenians send email? Who knew?) So… now what? Well, it was determined that any one person or computer could live with an additional 128 characters, only different people needed different additional characters. So the concept of code pages was invented. Each code page was the definition of the 128 characters which were used when that high-order bit was set. On any one computer the code page was fixed, but different computers could use different code pages. And characters *still* fit snugly into a byte, and 00 still meant the end of a string, and all that code still worked.
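To see why code pages worked on one computer and caused grief between computers, here is a small Python sketch that decodes the very same byte under three code pages. The three picked here (Latin-1, ISO 8859-7 Greek, Windows-1251 Cyrillic) are just illustrative choices of mine, and `decode_under` is a made-up helper name.

```python
import unicodedata

# One byte with the high-order bit set, decoded under three code pages.
# The code pages are illustrative picks (an assumption of this sketch);
# all three are codecs in Python's standard library.
CODE_PAGES = ["latin-1", "iso8859_7", "cp1251"]  # Western, Greek, Cyrillic

def decode_under(raw, code_pages=CODE_PAGES):
    """Return {code page name: decoded text} for the same raw bytes."""
    return {name: raw.decode(name) for name in code_pages}

results = decode_under(bytes([0xE1]))
for name, text in results.items():
    # Print the character's Unicode name so the output stays plain ASCII.
    print(f"{name:10} E1 -> {unicodedata.name(text)}")

# Bytes without the high-order bit decode as plain ASCII in all three.
ascii_results = decode_under(b"<a=b>")
print(ascii_results)
```

Same byte, three different letters, depending on which code page the reader happens to assume. The ASCII half of the byte range is the only part everyone agrees on.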

Whew!

But … turns out there are all these other people on Earth, who don’t use alphabets, they use symbols! We’re talking Chinese, Japanese, Korean, and so on. (Korean XML? Who knew?) So… now what? We need literally thousands of characters to store all these symbols… Oh no, Mr. Bill!

Time to do a big reset. And so the concept of Unicode was invented. Unicode is one mapping that assigns a unique number to every character and symbol used by humans on Earth to communicate. They are called “code points”, and there are a lot of them (246,943 as I type this, probably more by the time you read it). That is way more than can fit inside a byte, snugly or otherwise. But now we have a way to map all of Kanji (漢字), yay!

So… we have Unicode, but how do we represent these code points inside a computer? This is where character encoding comes in... there are different ways of encoding a series of Unicode code points as computer data.

In this, as in just about everything else in computing, there is the crummy technique Microsoft uses (called UTF-16), and the elegant cool technique everyone else uses (called UTF-8). BTW you will win bar bets now that I've told you UTF stands for Unicode Transformation Format, please PayPal me 10% commission.

Let’s take UTF-16 first so you can really appreciate UTF-8.

UTF-16 is the idea that all Unicode characters are stored in two-byte “words”. Every code point is assigned a 16-bit value, characters are 16 bits wide, and strings are arrays of 16-bit words. The end of a string is indicated by a 16-bit zero. There are some problems with this: first, that metric ton of code written in the old days will no longer work; second, most strings are now twice as big in memory as they used to be; and third, there are more than 65,536 code points, so there are too many characters and symbols to fit into 16-bit words! Okay, so here’s what we’ll do: first, we’ll rewrite all the old code, create new subroutines for everything. No problem. Second, we don’t care about memory. Third, for code points too big to fit into one 16-bit word, we’ll use two 16-bit words. There will be a special range of values (D800-DBFF) which mean “I am the first word of a two-word (four-byte) sequence”, and another special range (DC00-DFFF) which mean “I am the second word”. Of course that means a 16-bit word is no longer the same thing as a character, and all the code which assumed it was is now quietly wrong, but that’s a detail. Oh, and yeah, there is a byte ordering problem: some computers represent 16-bit values with the high-order byte first (big endian) and some with the low-order byte first (little endian), so files and streams can start with the value FEFF (the “byte order mark”), so that everyone can tell.
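For the curious, here is roughly how the two-word trick works, as a Python sketch. The function names `utf16_units` and `unit_kind` are made up for illustration. One thing worth knowing: in the actual UTF-16 definition the second word of a pair comes from its own reserved range (DC00-DFFF), so a lone word can always be classified by its value.

```python
def utf16_units(code_point):
    """Return the list of 16-bit words UTF-16 uses for one code point."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit word
    offset = code_point - 0x10000        # 20 bits left to store
    high = 0xD800 + (offset >> 10)       # top 10 bits    -> D800-DBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> DC00-DFFF
    return [high, low]

def unit_kind(unit):
    """Classify a single 16-bit word by its value alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "first word of a pair"
    if 0xDC00 <= unit <= 0xDFFF:
        return "second word of a pair"
    return "stands alone"

for cp in (0x41, 0x6F22, 0x1F600):
    units = utf16_units(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))

# The byte ordering problem: the same word 0041 serialized both ways.
big_endian = (0x0041).to_bytes(2, "big")
little_endian = (0x0041).to_bytes(2, "little")
print(big_endian, little_endian)
```

Note that the plain letter "A" now contains a zero byte, which is exactly why all the old zero-terminated string code chokes on UTF-16.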

I am not making this up, that’s UTF-16, and that is the way all characters are stored and processed inside Windows. If you are reading this on a Windows PC, all these characters are coming to you via UTF-16. These are called “wide characters”, there are “wide” alternative versions of the string manipulation functions, and the whole thing is massively ugly.

Now let’s move on to UTF-8.

UTF-8 is the idea that all Unicode characters are stored in 8-bit bytes, just like before. Some code points fit in one byte, some in two, some in three, and some in four; as many bytes as are needed to represent the code point. The end of a string is indicated by a zero, just like before. The values 1-127 are standard ASCII, just like before. (In one fell swoop, perfect backward compatibility!) All the old code still works, and we don’t need new subroutines for everything. Some characters require more than one byte, but we only use the bytes we need, so no memory is wasted. When you see a byte, you can tell immediately whether it is the first byte of a multi-byte sequence. There are no endian issues. It is a beautiful solution to the problem.
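Here is a minimal sketch of the encoding itself in Python, assuming the modern rules where code points stop at 10FFFF and so never need more than four bytes. `utf8_encode` is a made-up name; in real code you would just call `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (1 to 4 bytes)."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:                       # 7 bits: plain ASCII
        return bytes([code_point])
    if code_point < 0x800:                      # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 16 bits: 1110xxxx + 2 more
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),    # 21 bits: 11110xxx + 3 more
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# An ASCII letter, an accented letter, a kanji, and an emoji.
for cp in (0x41, 0xE9, 0x6F22, 0x1F600):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{b:02X}" for b in encoded))
```

The ASCII letter comes out as the same single byte it always was, and none of the longer sequences contains a zero byte or any byte below 80.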

When you see a byte, you can tell what kind of byte it is by the value:

00 is NUL, the end of a string

01-7F stand for themselves, and do not appear elsewhere ("ASCII")

80-BF are never the first byte of a character; they are the 2nd, 3rd, etc. bytes of a multi-byte character

C0-C1 are invalid ("overlong" start of a 2-byte character)

C2-DF are the start of a 2-byte character

E0-EF are the start of a 3-byte character

F0-F4 are the start of a 4-byte character

F5-F7 are invalid (they would start a 4-byte character beyond 10FFFF, the highest code point)

F8-FB were the start of a 5-byte character in the original design, but have been invalid since RFC 3629 capped UTF-8 at four bytes (so not even when we colonize Mars :)

FC-FD were the start of a 6-byte character in the original design, and are likewise invalid now

FE-FF are invalid (mostly to protect against UTF-16!)
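The byte ranges translate almost directly into code. Here is a Python sketch, with one loudly stated assumption: it follows the current rules (RFC 3629), under which sequences stop at four bytes, so everything from F5 up is treated as invalid. The function name and the labels it returns are made up for illustration.

```python
def classify_utf8_byte(b):
    """Say what role a single byte value (0-255) plays in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte")
    if b == 0x00:
        return "NUL"            # end of a C string
    if b <= 0x7F:
        return "ascii"          # stands for itself
    if b <= 0xBF:
        return "continuation"   # 2nd, 3rd, or 4th byte of a character
    if b <= 0xC1:
        return "invalid"        # could only start an overlong sequence
    if b <= 0xDF:
        return "lead2"          # start of a 2-byte character
    if b <= 0xEF:
        return "lead3"          # start of a 3-byte character
    if b <= 0xF4:
        return "lead4"          # start of a 4-byte character
    return "invalid"            # F5-FF never appear (RFC 3629 assumption)

# "A", then e-acute (2 bytes), then a kanji (3 bytes).
sample = "A\u00e9\u6f22".encode("utf-8")
for b in sample:
    print(f"{b:02X} {classify_utf8_byte(b)}")
```

A single comparison chain, no lookahead and no state: that is what "you can tell by the value" buys you.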

And you may ask yourself, ~~how did I get here?~~ what do I have to do to support UTF-8? Well if you don’t care about content, nothing! Your strings will still work even if they contain UTF-8 encoded characters, and you may take the rest of this post off. You have strings of 8-bit bytes, terminated by zeros, and you're happy.

If you do care about content, and the content is ASCII, not much has changed. You can scan for common parse characters like “<” or “=” inside a string, just like before.
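This works because the bytes 01-7F never show up inside a multi-byte character, so a byte-by-byte scan cannot produce a false match. A small Python sketch (the helper name `find_ascii` is made up, and the sample string is just for illustration):

```python
def find_ascii(data, ch):
    """Return the byte offsets of one ASCII character in UTF-8 data."""
    target = ord(ch)
    if target > 0x7F:
        raise ValueError("this sketch only handles ASCII parse characters")
    # A plain byte comparison is enough: no decoding, no state.
    return [i for i, b in enumerate(data) if b == target]

# Markup mixing ASCII punctuation with Greek letters (two bytes each).
text = '<ράντ type="οπτιοναλ">'
data = text.encode("utf-8")

for ch in '<=">':
    print(repr(ch), find_ascii(data, ch))
```

The offsets are byte offsets, not character offsets, which is fine as long as you only use them to slice the same byte string.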

Finally, if you care about content, and the content might not be ASCII, then you have to be aware that the byte length of a string is not necessarily the same as the character length. To count bytes, you, um, just count bytes. To count characters, you count bytes which are in the range 01-7F and C0-FF, and skip bytes in the range 80-BF. Pretty simple. Copying and moving character strings is exactly like before. Mostly stuff just works. To find a multi-byte value, you search for the multi-byte value in the string; the encoding ensures that a given sequence of bytes only ever means one thing.
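Both tricks fit in a few lines of Python. This is a sketch rather than production code: `count_characters` is a made-up name, it assumes the input is valid UTF-8, and "character" here means code point.

```python
def count_characters(data):
    """Count code points in UTF-8 bytes by skipping bytes in 80-BF."""
    return sum(1 for b in data if not 0x80 <= b <= 0xBF)

text = "漢字 means kanji"
data = text.encode("utf-8")
print("bytes:", len(data), "characters:", count_characters(data))

# Finding a multi-byte character is an ordinary byte-string search.
needle = "字".encode("utf-8")
byte_offset = data.find(needle)

# Turn the byte offset into a character offset by counting what precedes it.
char_offset = count_characters(data[:byte_offset])
print("byte offset:", byte_offset, "character offset:", char_offset)
```

The search cannot land in the middle of some other character, because the first byte of the needle is a lead byte and lead bytes never appear in the 2nd, 3rd, or 4th position.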

</ράντ>

:)
