Okay. It is time to put some “K” into CJK…
Seriously, much of the content of this blog has been focused on Chinese and Japanese issues. This article will provide some much-deserved Korean content.
I spent the last few days coming to grips with Old Hangul (옛한글 yethangeul), specifically how to implement proper shaping using the three registered OpenType GSUB features, ‘ljmo’ (Leading Jamo Forms), ‘vjmo’ (Vowel Jamo Forms), and ‘tjmo’ (Trailing Jamo Forms).
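As a rough illustration of which feature is relevant to which jamo, the sketch below classifies conjoining jamo by code point. The ranges are taken from the Unicode Hangul Jamo, Hangul Jamo Extended-A, and Hangul Jamo Extended-B blocks; the function names are hypothetical, and this is only a classification aid, not the GSUB shaping logic itself.

```python
# Conjoining jamo ranges (assumption: block assignments as of Unicode 5.2+).
JAMO_RANGES = {
    "ljmo": [(0x1100, 0x115F), (0xA960, 0xA97C)],  # leading consonants
    "vjmo": [(0x1160, 0x11A7), (0xD7B0, 0xD7C6)],  # vowels
    "tjmo": [(0x11A8, 0x11FF), (0xD7CB, 0xD7FB)],  # trailing consonants
}

def jamo_feature(code_point):
    """Return the feature tag whose jamo class contains code_point, else None."""
    for feature, ranges in JAMO_RANGES.items():
        if any(first <= code_point <= last for first, last in ranges):
            return feature
    return None

def tag_sequence(text):
    """Pair each character, in U+XXXX notation, with its jamo class."""
    return [(f"U+{ord(ch):04X}", jamo_feature(ord(ch))) for ch in text]

# An Old Hangul syllable written as three conjoining jamo.
print(tag_sequence("\u1112\u119E\u11AB"))
```

A shaping engine would apply each feature only to the glyphs in the matching class, which is why the classification comes first.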
I like ASCII. Do I like ASCII because of all the wonderful things one can do with its extraordinarily large repertoire of 94 printable characters? Actually, yes. Before I defend that answer, I’d like to point out that ASCII has three important strengths: simplicity, robustness, and ubiquity. In other words, ASCII has a relatively small number of characters, it forms a subset of virtually every encoding, Unicode or otherwise, and it is supported everywhere. In fact, ASCII can be used to represent Unicode through the use of notations. Richard Ishida’s Unicode Code Converter is an excellent way to explore the various notations that are currently in use.
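To make the idea of ASCII notations concrete, here is a minimal sketch that renders a single character in a few common ASCII-only forms. The function name and the choice of notations are my own for illustration; they are not a list taken from the converter mentioned above.

```python
from urllib.parse import quote

def ascii_notations(ch):
    """Return several ASCII-only notations for a single character."""
    cp = ord(ch)
    return {
        "U+ notation": f"U+{cp:04X}",
        "hex NCR": f"&#x{cp:X};",
        "Python escape": ch.encode("unicode_escape").decode("ascii"),
        "percent-encoded UTF-8": quote(ch),
    }

for name, value in ascii_notations("加").items():
    print(f"{name}: {value}")
```

Every value in the result uses only printable ASCII characters, yet each one identifies the same non-ASCII character unambiguously.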
The first set of ideographs to be encoded in Unicode (Version 1.1), the CJK Unified Ideographs block, is also referred to as the URO, which is an abbreviation for Unified Repertoire and Ordering. None of the extensions are given this label. Extensions A through D have been standardized, and Extension E will soon be standardized. Only Extension A is in the BMP (Basic Multilingual Plane). Extension B and beyond are in Plane 2, which is called the SIP (Supplementary Ideographic Plane). What makes the URO special or unique?
As described in the August 24, 2012 article, I am currently updating most of our OpenType Japanese fonts. One aspect of the update is to include the 32 additional IVSes, based on the March 2, 2012 version of the IVD (Ideographic Variation Database), which means that all of the kanji in Adobe-Japan1-6 now have a “plain text” representation. Another aspect of this particular update is to use the latest UTF-32 CMap resources, which include three additional mappings, one of which is for U+9FCC, which was appended to the URO (Unified Repertoire & Ordering) in Unicode Version 6.1. The topic of this article, however, is fixing a small number of glyphs, and the techniques that I used to do so.
Founded in 1987 by Yuan Ho and Seth Schneider, IMUG (International Multilingual User Group) has become Silicon Valley’s best user group for all matters related to internationalization, localization, and globalization. Meeting once a month, usually the third Thursday, IMUG meetings include a presentation by a distinguished member of the internationalization, localization, or globalization community. Adobe began hosting IMUG meetings in 2010 for odd-numbered months, and its Globalization Engineering Council serves as the official host. Even-numbered months are hosted by Google.
Regardless of whether you reside in the San Francisco Bay Area, I encourage you to attend IMUG meetings. Those that are hosted by Adobe are broadcast via Adobe Connect, meaning that attendees need not be on site.
It seems that I have presented for IMUG six times, in 1995 (Adobe Systems’ CID-Keyed Fonts For Large Character Sets), 1999 (Adventures in Multilingual Publishing), 2005 (The Adobe-Japan1-6 Character Collection), 2008 (Ideographic Variation Sequences: Implementation Details), 2010 (Kazuraki: Adobe Systems’ Groundbreaking New Japanese Typeface), and 2011 (The Power of “Plain Text” & the Importance of Meaningful Content).
The next IMUG meeting, which will be hosted at Adobe, will include an intriguing presentation by Microsoft’s Michael Kaplan about new internationalization features in Windows 8. Windows 8 is scheduled to be released on October 26, 2012, which is, by the way, approximately two months before our world ends. Please plan to attend, either in person or via Adobe Connect.
Many thanks to Nozomu KATŌ (加藤望) for bringing to my attention that the Adobe-Japan1-6 Unicode CMap resources were missing the following mapping:
U+207F → CID+15908
I decided to add this mapping to the following eight Adobe-Japan1-6 Unicode CMap resources this evening:
The eight updated CMap resources were just posted to the CMap Resources open source project that is hosted at Open @ Adobe, and the details are in the associated forum post.
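For readers curious about what such a mapping looks like across encodings, the sketch below derives the hexadecimal source code for U+207F in UTF-8, UTF-16, and UTF-32 and pairs it with the CID. The `<hex> CID` layout mirrors how single-character entries are generally written in CMap resources, but the function name is hypothetical and the exact lines in the posted files may differ.

```python
def cidchar_lines(code_point, cid):
    """Build '<hex> CID' entries for one code point in three encoding forms."""
    ch = chr(code_point)
    codecs = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}
    return {
        name: f"<{ch.encode(codec).hex()}> {cid}"
        for name, codec in codecs.items()
    }

# U+207F mapped to CID+15908, as described above.
for name, line in cidchar_lines(0x207F, 15908).items():
    print(f"{name}: {line}")
```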
On July 25, 2012, Apple released to the world Mac OS X Version 10.8 (aka Mountain Lion). Among the many new features in this latest iteration of Mac OS X is support for CFR objects. For those who are not aware, CFR objects are based on ISO/IEC 14496-28:2012 (Composite Font Representation), and are used to define both composite fonts and fallback fonts. CFR objects effectively break the 64K glyph barrier. Mac OS X Version 10.8 is thus the first implementation that has broken the 64K glyph barrier.
ISO/IEC 10646:2012 (Third Edition) was just published. This is the first version of the standard that includes multiple-column Code Charts for Extension B, and for CJK Compatibility Ideographs. Another significant aspect of ISO/IEC 10646:2012 is that it is equivalent to Unicode Version 6.1.
For Adobe, the publishing of this new version of the standard represents a significant milestone, because it means that every Adobe-Japan1-6 kanji is either directly encoded, or is directly associated with a registered IVS in the IVD (Ideographic Variation Database).
Speaking of Unicode Version 6.1, the printed version of the Core Specification is available via POD from Lulu, and at a very attractive price.
Years ago, I wrote a Perl script, called unicode-rows.pl, that takes a fully-qualified PostScript name—composed of a CIDFont resource name, two hyphens, and a UTF-32 CMap resource name—then generates a PostScript file that can be distilled into a PDF. The resulting PDF file is a Unicode table, arranged in groups of 256 code points. If the UTF-32 CMap resource includes even a single mapping for a particular group of 256 code points, a page is created.
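The page-selection rule described above can be sketched in a few lines of Python. This is not the Perl script itself, only an illustration of the grouping logic under the assumption that the mapped code points are already available as integers; the function names are hypothetical.

```python
def rows_with_pages(mapped_code_points):
    """Return the 256-code-point rows that contain at least one mapping."""
    return sorted({cp >> 8 for cp in mapped_code_points})

def row_range(row):
    """Return the first and last code point covered by a row."""
    first = row << 8
    return first, first + 0xFF

# Hypothetical mappings: two in the same row, one elsewhere in the BMP, one in Plane 2.
mapped = [0x207F, 0x2080, 0x4E00, 0x20000]
for row in rows_with_pages(mapped):
    first, last = row_range(row)
    print(f"page for U+{first:04X}..U+{last:04X}")
```

Rows with no mappings at all never appear in the result, which corresponds to no page being generated for them.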
I have prepared examples that are based on the UniJIS2004-UTF32-H and UniJIS-UTF32-H CMap resources.
As alluded to at the end of the May 9, 2012 CJK Type Blog article, I had plans to build additional CFR objects for testing purposes. That particular article supplied two 64K-glyph OpenType/CFF fonts, which provided BMP and Plane 1 coverage, and served as component fonts for the supplied CFR object, UnicodeGetaCFR.cfr. In today’s article, I will supply a CFR object that encompasses all of Unicode, meaning the BMP and the 16 Supplementary Planes, along with the component fonts that it references. In other words, it provides coverage for 1,112,030 code points, each of which has a unique glyph. These represent valuable testing resources for developers who plan to support CFR objects in their products as a way to break the 64K glyph barrier.
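One way to arrive at the figure of 1,112,030 is shown below. The breakdown is an assumption on my part rather than something stated in the article: it starts from all 17 planes and excludes the surrogate code points along with the last two code points of each plane.

```python
PLANES = 17                      # the BMP plus 16 Supplementary Planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

# Assumed exclusions (not confirmed by the article):
SURROGATES = 0xDFFF - 0xD800 + 1        # 2,048 surrogate code points
PLANE_FINAL_NONCHARACTERS = 2 * PLANES  # U+xxFFFE and U+xxFFFF in each plane

total = PLANES * CODE_POINTS_PER_PLANE
covered = total - SURROGATES - PLANE_FINAL_NONCHARACTERS

print(f"{total:,} total, {covered:,} covered")
```

Either way, the count is far beyond what a single font limited to 64K glyphs can hold, which is the point of using component fonts.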
Unicode Version 6.1 includes a total of 1,002 CJK Compatibility Ideographs. The February 22, 2012 CJK Type Blog article includes a table that provides the details in terms of when they were added to Unicode, version-wise.
Of the 1,002 CJK Compatibility Ideographs that are in Unicode, 89 have Japanese sources. The Japanese sources are JIS X 0213:2004, Jinmei-yō Kanji (人名用漢字), IBM, and ARIB STD-B24. In addition, some of them have multiple Japanese sources, and while most of them are intended to use the same glyph regardless of the source, a very small number of them—three to be precise—do not.
As the IVD Registrar, I am very pleased to announce that a new version of the IVD (Ideographic Variation Database) was released on March 2nd, 2012. It incorporates the results of PRI 183 and PRI 187.
I am pleased to announce that Adobe once again has the privilege and honor of being a Gold Sponsor of the Internationalization & Unicode Conference, the 36th iteration of which will take place in October of this year.
For those who have had the opportunity to attend this conference in the past, I am preaching to the choir when I state that much of the benefit of attending is not from listening to the scheduled sessions—though they have incredible value—but rather that there is an opportunity to have face-to-face discussions with others in the industry.
If you plan to attend IUC36, I hope to see you there!
Unicode Version 6.1 was released on January 31, 2012, and now includes 74,617 CJK Unified Ideographs, along with 1,002 CJK Compatibility Ideographs. This version added 732 characters, and there are now a staggering 110,116 characters in the standard.
Speaking of staggering, as Unicode grows, it becomes more important to keep track of what character is encoded where, and sometimes it is useful to know when a character was encoded. For this purpose, the DerivedAge.txt datafile is an incredibly useful resource.
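As a minimal sketch of how the datafile can be used, the code below parses lines in the DerivedAge.txt layout and looks up the version in which a code point was first encoded. The embedded sample is an illustrative excerpt written in that layout, not the full file, and the function names are hypothetical.

```python
import re

SAMPLE = """\
# Illustrative excerpt in the DerivedAge.txt layout (not the full file)
4E00..9FA5    ; 1.1 # [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
9FCC          ; 6.1 #       CJK UNIFIED IDEOGRAPH-9FCC
"""

# A single code point or a first..last range, then ';' and the version.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\d+\.\d+)")

def parse_derived_age(text):
    """Return a list of (first, last, version) tuples; comments are skipped."""
    entries = []
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16) if m.group(2) else first
            entries.append((first, last, m.group(3)))
    return entries

def age_of(code_point, entries):
    """Return the version string for code_point, or None if not listed."""
    for first, last, version in entries:
        if first <= code_point <= last:
            return version
    return None

entries = parse_derived_age(SAMPLE)
print(age_of(0x4E00, entries), age_of(0x9FCC, entries))
```

With the complete file loaded in place of the sample, the same lookup works for any encoded character.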
In terms of CJK Unified Ideographs and CJK Compatibility Ideographs, I spent part of the morning assembling a single-page PDF file that encapsulates many important details of their history. I hope that readers of this blog find it to be useful.
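The total of 74,617 CJK Unified Ideographs quoted above for Version 6.1 can be cross-checked with a little arithmetic. The code point ranges below are not from the PDF; they are the assigned ranges as I understand them for Version 6.1, including the twelve unified ideographs that sit in the CJK Compatibility Ideographs block.

```python
# Assigned ranges as of Unicode 6.1 (assumption: stated from the code charts).
BLOCKS = {
    "URO": (0x4E00, 0x9FCC),
    "Extension A": (0x3400, 0x4DB5),
    "Extension B": (0x20000, 0x2A6D6),
    "Extension C": (0x2A700, 0x2B734),
    "Extension D": (0x2B740, 0x2B81D),
}

# Unified ideographs encoded in the CJK Compatibility Ideographs block.
COMPAT_BLOCK_UNIFIED = [
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
]

counts = {name: last - first + 1 for name, (first, last) in BLOCKS.items()}
unified_total = sum(counts.values()) + len(COMPAT_BLOCK_UNIFIED)

for name, count in counts.items():
    print(f"{name}: {count:,}")
print(f"Total CJK Unified Ideographs: {unified_total:,}")
```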