CJK Compatibility Ideographs

Unicode Version 6.1 includes a total of 1,002 CJK Compatibility Ideographs. The February 22, 2012 CJK Type Blog article includes a table that shows, version by version, when they were added to Unicode.

Of the 1,002 CJK Compatibility Ideographs that are in Unicode, 89 have Japanese sources. The Japanese sources are JIS X 0213:2004, Jinmei-yō Kanji (人名用漢字), IBM, and ARIB STD-B24. In addition, some of them have multiple Japanese sources, and while most of them are intended to use the same glyph regardless of the source, a very small number of them—three to be precise—do not.

One issue that affects CJK Compatibility Ideographs is that they are subject to normalization: regardless of which of the four normalization forms is used (NFC, NFD, NFKC, or NFKD), they are converted into their canonical equivalent, which is a CJK Unified Ideograph. For example, the two CJK Compatibility Ideographs U+FA5D (艹) and U+FA5E (艹) normalize to the same CJK Unified Ideograph, U+8279 (艹). What is currently at issue is that 57 of the 861 Jinmei-yō Kanji are CJK Compatibility Ideographs, and when normalized, they become a different, but related, kanji. This is a problem because it’s not possible to guarantee that normalization will not be applied, except in completely closed environments, so these 57 CJK Compatibility Ideographs cannot be reliably preserved. Any interchange of them is subject to data loss.
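To make the effect concrete, here is a minimal Python sketch that runs the two CJK Compatibility Ideographs mentioned above through all four normalization forms using the standard `unicodedata` module. The helper name `normalize_all` is my own, chosen purely for illustration, and the results depend on the Unicode data bundled with the Python version in use.

```python
import unicodedata

# The four normalization forms defined by Unicode.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_all(text):
    """Return a dict mapping each normalization form to the normalized text.

    Hypothetical helper for illustration only.
    """
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# U+FA5D and U+FA5E are the two CJK Compatibility Ideographs from the text.
for compat in ("\uFA5D", "\uFA5E"):
    for form, result in normalize_all(compat).items():
        # Each result is a single code point, so ord() is safe here.
        print(f"U+{ord(compat):04X} -> {form} -> U+{ord(result):04X}")
```

Every line of output should end in U+8279, which means that the distinction between the two original code points is gone after any one of the four forms has been applied.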

Registered IVSes (Ideographic Variation Sequences) and Standardized Variants, both of which are variation sequences, represent mechanisms that are immune to the effects of normalization. Thus, a CJK Compatibility Ideograph that is represented by either a registered IVS or Standardized Variant becomes immune to the effects of normalization.
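As a rough illustration of that immunity, the Python sketch below compares U+FA6D, one of the CJK Compatibility Ideographs discussed later in this article, with its registered “Adobe-Japan1” IVS, <8218,E0101>. The function name `survives_normalization` is hypothetical, and the check only covers the four normalization forms; it says nothing about whether a given font or application actually supports the IVS.

```python
import unicodedata

# The four normalization forms defined by Unicode.
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def survives_normalization(text):
    """Return True if text is unchanged under all four normalization forms.

    Hypothetical helper for illustration only.
    """
    return all(unicodedata.normalize(form, text) == text for form in FORMS)

# The CJK Compatibility Ideograph itself.
compat = "\uFA6D"

# Its canonical equivalent, U+8218, followed by variation selector U+E0101,
# which together form the registered IVS <8218,E0101>.
ivs = "\u8218\U000E0101"

print("U+FA6D survives:", survives_normalization(compat))
print("<8218,E0101> survives:", survives_normalization(ivs))
```

The first check should report False and the second True, because the base character of the IVS is already a CJK Unified Ideograph and the variation selector has no decomposition.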

While I do not wish to use this article to debate the pros and cons of various solutions, because doing so is the responsibility of interested/affected national bodies and organizations, I do wish to use this opportunity to provide a two-page table that lists these 89 CJK Compatibility Ideographs, and which provides the registered “Adobe-Japan1” and “Hanyo-Denshi” IVSes that correspond to them. The last four columns represent the Japanese sources that correspond to these 89 CJK Compatibility Ideographs.

Also included in the table is a column, shaded in gray, that includes a subset of the 1,002 Standardized Variants that have been proposed in WG2 N4246R (the 1,002 proposed variation sequences are attached to the document, but also available separately here). Three pairs in this column have been shaded darker to indicate that the same Standardized Variant would be used regardless of the Japanese source. This is because they are intended to preserve the distinction of CJK Compatibility Ideographs, in terms of their relationship with their canonical equivalents, not their glyphs. Note that these Standardized Variants have merely been proposed, and are currently under discussion, and thus should not be used.

The cells that have been shaded in cyan are the three cases for which the same CJK Compatibility Ideograph has more than one Japanese source, and for which different glyphs are required. For each pair, one cell has been shaded darker to indicate a higher priority, specifically that it is the JIS X 0213:2004 glyph. The cells that have been shaded in yellow simply represent different CJK Compatibility Ideographs that share the same canonical equivalent.

In terms of history, the 2007-12-14 version of the IVD (Ideographic Variation Database) includes 92 registered “Adobe-Japan1” IVSes that correspond to these 89 CJK Compatibility Ideographs. The 2010-11-14 version of the IVD adds one registered “Hanyo-Denshi” IVS, <8218,E0103> (IB1067), that corresponds to one of these 89 CJK Compatibility Ideographs, U+FA6D, and represents a duplicate of the equivalent registered “Adobe-Japan1” IVS, <8218,E0101> (CID+13695). The just-released 2012-03-02 version of the IVD adds 89 registered “Hanyo-Denshi” IVSes that correspond to these 89 CJK Compatibility Ideographs, all of which represent duplicates of registered “Adobe-Japan1” IVSes. This is okay, because each IVD collection is considered private unless its registrants agree to share IVSes.

Edited To Add: Professor Koichi Yasuoka (安岡孝一) quickly and kindly pointed out, in his reply to my comment on his March 4, 2012 blog post, that the “Adobe-Japan1” IVS for U+FA40 was incorrect in the table. It was originally <61F2,E0101> (CID+13369), but should have been <61F2,E0103> (CID+21072). The two-page table was corrected accordingly.

