by Kevin Goldsmith

Created

November 22, 2005

I thought I was pretty up on localization. I understand locales and code pages. I know the difference between Unicode and ASCII. My strings are stored as resources. My UIs are dynamic enough to resize themselves to fit various languages. I have released several products for multiple world markets… These were all on the Windows Platform, however.

When I came to Adobe and started doing cross-platform development, I got a real wake-up call. The kick to the head came when I had to start dealing with string encodings.

If you know Windows development, you will know MBCS and Unicode (UCS-2). Most Windows developers I’ve met are unaware of any other string encodings. They have a rule in their heads that you use CHAR * for English-only apps and WCHAR * for international ones. However, there are several standard string encodings as well (MBCS is not a standard; it is a Microsoft-only thing that can cause real problems if you are doing cross-platform dev work). UTF-8 is probably the best known, since XML documents are encoded as UTF-8 by default and it is increasingly the encoding of choice on the web. UTF-8, like MBCS, can represent a single character with multiple bytes (which is why you need to be very careful if you assume that a CHAR represents a character and can be treated as such). We tend to use UTF-8 for our strings since it is the closest thing to a cross-platform string encoding in terms of OS support.
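
To make that pitfall concrete, here is a minimal sketch in standard C++ (not the Windows CHAR/WCHAR types above, and just my illustration, not production code): strlen() counts bytes, and counting the characters a user actually sees takes a little more work.

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // "héllo" -- the 'é' (U+00E9) takes two bytes in UTF-8.
    const char* s = "h\xC3\xA9llo";

    // strlen() reports the number of bytes, not the number of characters.
    std::printf("bytes: %zu\n", std::strlen(s));    // prints 6

    // Counting characters means skipping UTF-8 continuation bytes (10xxxxxx).
    std::size_t chars = 0;
    for (const char* p = s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++chars;
        }
    }
    std::printf("characters: %zu\n", chars);        // prints 5
    return 0;
}
```

The same trap applies to indexing, substrings, and truncation: cutting a UTF-8 string at an arbitrary byte offset can split a character in half.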

Microsoft Unicode (UCS-2) represents each character as two bytes and only two bytes, so you can do all the nice fixed-width character comparison and indexing tricks you want. Java also standardised on UCS-2 for its strings (more recent versions have moved to UTF-16).
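
As a rough sketch of why that fixed-width property is so convenient (using C++11’s char16_t here rather than the Windows WCHAR type, purely for illustration): as long as every character fits in one 16-bit code unit, index i really is character i.

```cpp
#include <cassert>
#include <string>

int main() {
    std::u16string s = u"h\u00E9llo";   // "héllo" -- every character here fits in one code unit

    assert(s.size() == 5);              // length in code units == length in characters
    assert(s[1] == u'\u00E9');          // direct indexing finds the 'é'
    assert(s == u"h\u00E9llo");         // comparison is a simple element-by-element test
    return 0;
}
```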

Unfortunately, UCS-2 cannot represent all the characters used in the world. There are some characters in Chinese and other languages that cannot be encoded with the fixed two-byte system of UCS-2. UCS-2 is essentially a subset (and precursor) of UTF-16: in UTF-16, a character is encoded in a minimum of two bytes, but a single character’s encoding can stretch to four bytes (and no further, since Unicode is currently capped at 21 bits of code space).
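
Here is a small sketch of where the fixed-width assumption breaks down (again standard C++, just for illustration): a character outside the Basic Multilingual Plane has no UCS-2 representation at all, and in UTF-16 it takes a surrogate pair of two 16-bit code units.

```cpp
#include <cstdio>
#include <string>

int main() {
    // U+1D11E MUSICAL SYMBOL G CLEF cannot be represented in UCS-2 at all.
    std::u16string clef = u"\U0001D11E";

    // One user-visible character, but two 16-bit code units (a surrogate pair).
    std::printf("code units: %zu\n", clef.size());      // prints 2
    std::printf("%04X %04X\n",
                (unsigned)clef[0], (unsigned)clef[1]);  // prints D834 DD1E
    return 0;
}
```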

Some developers I work with are now using UTF-32 encodings for simplicity, but someday we may end up with UTF-64 or even UTF-128 (maybe after the aliens conquer us, or when every atom in the universe has its own character).
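
For what it’s worth, the appeal of UTF-32 is easy to show in a short sketch (C++11 char32_t, again just an illustration): every code point is exactly one array element, at the cost of four bytes each.

```cpp
#include <cstdio>
#include <string>

int main() {
    // Mixes BMP characters with the non-BMP G clef from the example above.
    std::u32string s = U"h\u00E9llo \U0001D11E";

    std::printf("code points: %zu\n", s.size());       // prints 7 -- one element per code point
    std::printf("last: U+%X\n", (unsigned)s.back());   // prints U+1D11E, no surrogates needed
    return 0;
}
```

Even then, one code point is not always one user-visible character (combining marks, for example), so the simplicity only goes so far.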

There is a pretty good Unicode Tutorial over at jBrowse which I wish I’d found when I was first learning about this stuff.

[Updated this blog to reflect a discussion with my co-worker Bob, thanks Bob!]

COMMENTS

  • By James - 4:04 AM on November 23, 2005  

    Interesting run-down on character encodings. We’re looking to make our product work for Asian markets. Can you give any simple pointers on how UTF-8 (or any other character encoding) works in PDFs?

  • By Mihai Nita - 10:36 AM on November 23, 2005  

    There are many things to consider. The rule of thumb is: move data around with UTF-8, do complex processing with UTF-16 (or UTF-32 if the native API is UTF-32) (http://www.unicode.org/faq/programming.html). Many libraries and programming languages selected UTF-16: Java (Sun), ICU (IBM), Xerces/Xalan (Apache), the native Mac OS X API (Apple), Qt (Trolltech). See here: http://www.unicode.org/notes/tn12/ And about “UTF-64 or even UTF-128”: this is not going to happen. “640KB^H^H^H^H^H UTF-16 should be enough for everybody” :-)

  • By Kevin Goldsmith - 10:39 AM on November 23, 2005  

    I don’t know too much about the innards of PDF, sorry. Did you try this question in the forums?

  • By Jim DeLaHunt - 11:25 PM on November 24, 2005  

    Kevin: what I think you are bringing up is that there needs to be a clear body of knowledge that every developer internalises that lets them do their part to make internationalised code. Sure, the experts will have lots of arcane rules (perhaps as many as 0x10FFFF arcane rules) which they know to follow, but the developers who make their living specialising in something else need to have a core set they can live with. You’ve pointed out several members of that core set: separating strings, resizeable dialogs, etc., but knowing when to use UTF-8, UTF-16, etc. is also in that set.

    Now I don’t know where this core set is documented in an accessible manner. I like the jBrowse tutorial’s “bluffer’s guide” approach, though I disagree with some of their details. I’ll look for a reference to pass to you. I could use such a reference for my work in consulting on world-ready software engineering.

    I also think the domain of cross-platform code is similar: there are arcane details which the platform-independence experts will master, but there needs to be a core set of knowledge which every developer internalises. Use of UTF-8 and UTF-16, or factoring out encoding from the code, would probably be in that subset too.

    –Jim DeLaHunt

  • By Jim DeLaHunt - 11:58 PM on November 24, 2005  

    James of momento.com.au: You ask about how UTF-8 works in PDFs. I’ll guess that you want your users to be able to type text in Asian languages (Japanese, Chinese perhaps) into your software, and have your software generate PDFs with that text intact.

    The simple answer is that if you can print to Windows or Mac OS printing system calls and get the right printed results, then you can print to the Adobe PDF printer (part of Acrobat) and Adobe PDF will take care of all the manipulations to generate a PDF file with the correct text.

    At another level, I can suggest that you take a look at the PDF Reference, http://partners.adobe.com/public/developer/pdf/index_reference.html , chapter 5 “Text”. See especially the reference to the CMaps like UniGB-UTF16-H (simplified Chinese), UniCNS-UTF16-H (traditional Chinese for Hong Kong), and UniJIS-UTF16-H (Japanese). Waving my hands a bit about font choice and use of alternate glyphs, you might be able to write UTF-16 encoded text to the Tj operator and get correct results. This advice is for you if you are generating your PDF content directly from your application.

    A good place to turn to is the Acrobat Developer Support area at Adobe. Start at http://partners.adobe.com/public/developer/acrobat/devcenter.html .

    Now none of these answers may be at the appropriate level for your application. I might be completely missing the point of your question. Sorry if this sounds like an answer that says Adobe, Adobe, Adobe, but they did write the book and a lot of the infrastructure for PDF. Also, I worked there for a long time, and those habits die hard. :-)

    Best wishes, –Jim DeLaHunt