Archive for November, 2005

November 22, 2005

String encodings

I thought I was pretty up on localization. I understand locales and code pages. I know the difference between Unicode and ASCII. My strings are stored as resources. My UIs are dynamic enough to resize themselves to fit various languages. I have released several products for multiple world markets… These were all on the Windows Platform, however.

When I came to Adobe and started doing cross-platform development, it really woke me up. The kick to the head came when I started having to deal with string encodings.

If you know Windows development, you will know MBCS and Unicode (UCS-2). Most Windows developers I’ve met are unaware of any other string encodings. They have the rule in their heads that you use CHAR * for English-only apps and WCHAR * for international ones. However, there are several standard string encodings as well. (MBCS is not a standard; it is a Microsoft-only thing that can really cause problems if you are doing cross-platform dev work.) UTF-8 is the best known, since XML documents default to UTF-8 when no other encoding is declared and it is very widely used for HTML. UTF-8, like MBCS, can use several bytes to represent a single character (which is why you need to be very careful if you assume a CHAR holds a whole character and can be treated as one). We tend to use UTF-8 for our strings, since in terms of OS support it is the closest thing to a cross-platform string encoding.
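To make the "a CHAR is not a character" point concrete, here is a small sketch in Python (chosen only for brevity; the helper name utf8_lengths is made up for this illustration). It compares the number of characters in a string with the number of bytes its UTF-8 encoding takes.

```python
def utf8_lengths(text):
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


# Sample strings are illustrative; escapes are used so the code points are unambiguous.
samples = [
    "hello",               # plain ASCII: one byte per character
    "h\u00e9llo",          # U+00E9 (e with acute) takes two bytes in UTF-8
    "\u65e5\u672c\u8a9e",  # three CJK characters, three bytes each
]

for s in samples:
    chars, nbytes = utf8_lengths(s)
    print(f"{s!r}: {chars} code points, {nbytes} UTF-8 bytes")
```

Only the pure ASCII string has matching counts, which is why byte-wise indexing or truncation is safe for English-only text and can split a character in half for anything else.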

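Microsoft Unicode (UCS-2) represents each character as two bytes and only two bytes, so you can do all the nice character comparison tricks you want to. (Strictly speaking, Windows has treated WCHAR strings as UTF-16 since Windows 2000, but plenty of code still assumes UCS-2.) Java also uses 16-bit characters, although since J2SE 5.0 its strings are formally UTF-16 rather than UCS-2.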

Unfortunately, UCS-2 cannot represent all the characters that are used in the world. There are some characters in Chinese and other languages that cannot be encoded with the fixed two-byte system of UCS-2. UCS-2 is essentially the fixed-width subset of UTF-16. In UTF-16, a character is encoded in a minimum of two bytes, but a single character’s encoding can stretch to four bytes (and no further, since Unicode is currently capped at 21 bits).
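A quick way to see the two-versus-four-byte split is to look at the 16-bit code units UTF-16 produces. The sketch below is in Python, and the helper name utf16_units is hypothetical. It encodes one common Chinese character and one from CJK Extension B, which lies outside the range UCS-2 can reach.

```python
def utf16_units(ch):
    """Return the list of 16-bit code units UTF-16 uses for the given text."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# U+4E2D fits in a single 16-bit unit, so UCS-2 can represent it.
print([hex(u) for u in utf16_units("\u4e2d")])

# U+20000 does not fit; UTF-16 encodes it as a surrogate pair (two units, four bytes).
print([hex(u) for u in utf16_units("\U00020000")])
```

Code that assumes one WCHAR per character will treat the second case as two separate "characters", neither of which is valid on its own.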

Some developers I work with are now using UTF-32 encodings for simplicity, but someday we may end up with UTF-64 or even UTF-128 (maybe after the aliens conquer us or every atom in the universe will have its own character).
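The simplicity comes at a cost in space, which a short Python sketch can show (the helper name encoded_sizes is invented for this example). UTF-32 spends exactly four bytes on every code point, so indexing is trivial, while UTF-8 and UTF-16 vary with the text.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16, and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
        "utf-32": len(text.encode("utf-32-be")),
    }


# One ASCII letter followed by one character outside the 16-bit range.
text = "a\U00020000"
print(len(text), "code points:", encoded_sizes(text))
```

For this two-character string UTF-32 always needs eight bytes, four per code point, whatever the characters are.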

There is a pretty good Unicode Tutorial over at jBrowse which I wish I’d found when I was first learning about this stuff.

[Updated this blog to reflect a discussion with my co-worker Bob, thanks Bob!]

3:49 PM

November 11, 2005

Hey all, I’m still here

Sorry for the no-posting. I’ve been pretty busy working on something I can’t quite talk about right now… I should be able to talk about it very soon though, so watch this space!

It is always difficult for me to know what to post to an official “corporate” blog. While you wouldn’t know it from my postings on adobe.com, I have over-blog-itis. I have three personal blogs, this blog and an internal Adobe blog as well. It gets really hard to know which blog post goes where. For the moment, I’m trying to restrict myself to talking about Help Center and tips on other Adobe products here, but that is getting a bit restrictive, so I’m going to maybe incorporate some other topics here as well. That way I can get at least a post a week up.

I’m also curious to find out what people want to know about. I can talk about working at Adobe, software development, C++, Help Center (of course), whatever ya want to know. This blog, for me, is all about establishing a dialog with our customers, so let me know what you want to know… I can’t talk about everything I’m working on, because some of it definitely falls into the “if I tell you, I’ll have to kill you” territory, but I’ll do my best to answer what I can.

4:59 PM