by Kevin Goldsmith

Created

November 22, 2005

I thought I was pretty up on localization. I understand locales and code pages. I know the difference between Unicode and ASCII. My strings are stored as resources. My UIs are dynamic enough to resize themselves to fit various languages. I have released several products for multiple world markets… These were all on the Windows Platform, however.

When I came to Adobe and started doing cross platform development, it really woke me up. This kick to the head came when I started having to deal with string encodings.

If you know windows development, you will know MBCS and Unicode (UCS-2). Most windows developers I’ve met are unaware of any other string encodings. They have the rule in their heads that you use CHAR * for English-only apps and WCHAR * for international ones. However, there are several standard string encodings as well (MBCS is not a standard, it is a Microsoft-only thing that can really cause problems if you are doing xPlatform dev work). UTF-8 is the best known since XML and HTML documents are encoded as UTF-8 by default. UTF-8, like MBCS, can represent a single character in multiple bytes (which is why you need to be very careful if you think a CHAR represents a character and can be treated like such). We tend to use UTF-8 encodings for our strings since it is the closest thing to a cross-platform string encoding OS support-wise.

Microsoft Unicode (UCS-2) represents each character as two bytes and only two bytes, therefore you can do all the nice character comparison tricks you want to. Java uses UCS-2 as its standard encoding.

Unfortunately, UCS-2 cannot represent all the characters that are used in the world. There are some characters in Chinese and other languages that cannot be encoded with the fixed-byte system of UCS-2. UCS-2 is a variant and subset of UTF-16. In UTF-16 character is encoded by a minimum of two bytes, but a single characters encoding can stretch to 4 bytes or more (although not really more since Unicode is currently capped out at 21 bits).

Some developers I work with are now using UTF-32 encodings for simplicity, but someday we may end up with UTF-64 or even UTF-128 (maybe after the aliens conquer us or every atom in the universe will have its own character).

There is a pretty good Unicode Tutorial over at jBrowse which I wished I’d found when I was first learning about this stuff.

[Updated this blog to reflect a discussion with my co-worker Bob, thanks Bob!]