Unicode has become the de facto way in which to represent text in digital form, and for good reason: its character set covers the vast majority of the world’s scripts. Other benefits of Unicode include the following:
- That it is under active and continuous development, meaning that with each new version, more scripts are being supported, and additional characters for existing scripts are being standardized.
- That it is aligned and kept in sync with ISO/IEC 10646 (available at no charge), which is quite a feat.
With regard to font development, Unicode is considered the default encoding for OpenType fonts, and it is specified in the ‘cmap’ table. The most common ‘cmap’ subtables are Formats 4 (BMP-only UTF-16) and 12 (UTF-32). The latter is used only when the font includes mappings outside of the BMP (Basic Multilingual Plane), meaning from one or more of the 16 Supplementary Planes.
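The Format 4 versus Format 12 decision described above can be sketched in a few lines of Python. This is only an illustration: the function name cmap_formats_needed is hypothetical, and it assumes that a font always carries a Format 4 subtable and adds Format 12 only when at least one mapping lies beyond the BMP.

```python
BMP_MAX = 0xFFFF  # last code point of the Basic Multilingual Plane


def cmap_formats_needed(code_points):
    """Return the 'cmap' subtable formats a font would typically carry.

    Assumption: Format 4 is always present, and Format 12 is added only
    when a mapping falls in one of the Supplementary Planes.
    """
    formats = [4]
    if any(cp > BMP_MAX for cp in code_points):
        formats.append(12)
    return formats


print(cmap_formats_needed([0x3042, 0x4E00]))   # BMP-only mappings
print(cmap_formats_needed([0x4E00, 0x20000]))  # includes a Plane 2 mapping
```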
In the spirit of encouraging developers, especially those in Japan, to provide better or broader support for Unicode, which usually entails abandoning Shift-JIS encoding, I became inspired this evening to put together a Top Ten List that provides various Reasons To Abandon Shift-JIS Encoding, similar to the Unicode Beyond-BMP one that I prepared a couple of years ago.
While humor is intended in such Top Ten Lists, there is also a serious side to this issue: Given that today’s systems work together, by clinging to Shift-JIS, developers can adversely affect other systems that do support Unicode.
[The Japanese version of the May 31, 2013 article entitled CSS Orientation Test OpenType Fonts, rendered here in English, is courtesy of Hitomi Kudo (工藤仁美).]
We are pleased to announce that on May 31 a new Adobe open source project, CSS Orientation Test OpenType Fonts, was released. This open source project includes two OpenType/CFF fonts that were developed at the request of Koji Ishii (石井宏治), the editor of Unicode’s forthcoming UTR #50 (Unicode Vertical Text Layout). The purpose of these fonts is to make it easier for font developers to test glyph orientation.
I am pleased to announce that the new CSS Orientation Test OpenType Fonts open source project was launched on Adobe’s open-source portal, Open@Adobe, today. This open source project consists of two OpenType/CFF fonts that were developed at the request of Koji Ishii (石井宏治), the editor of Unicode’s forthcoming UTR #50 (Unicode Vertical Text Layout). The purpose of these fonts is for developers to be able to more easily test whether glyph orientation in their implementation is correct or not.
OpenType fonts are ‘sfnt’ (scalable font) resources that are composed of several well-defined tables. One of these tables, which is the topic of today’s article, is the ‘cmap’ (character map) table. The ‘cmap’ table, put simply, maps character codes to Glyph IDs (GIDs) that refer to glyphs in the ‘glyf’ or ‘CFF’ (Compact Font Format) table, depending on the “flavor” of the OpenType font. What is important about the ‘cmap’ table is that it makes the glyphs usable. Without the ability to map from character codes, which are used by virtually all applications and OSes, the glyphs in a font are useless, and cannot be readily accessed or used.
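To make the character-code-to-GID idea concrete, here is a toy model in Python. It does not read a real font: the dictionary toy_cmap and its GID values are made up for illustration, and the only convention assumed is that GID 0 is the .notdef glyph used for unmapped characters.

```python
NOTDEF_GID = 0  # GID 0 is conventionally the .notdef glyph

# Hypothetical mappings; real GIDs depend entirely on the font.
toy_cmap = {
    0x0041: 34,    # LATIN CAPITAL LETTER A
    0x3042: 843,   # HIRAGANA LETTER A
    0x4E00: 1200,  # CJK UNIFIED IDEOGRAPH-4E00
}


def glyph_ids(text, cmap):
    """Map each character of text to a GID, falling back to .notdef."""
    return [cmap.get(ord(ch), NOTDEF_GID) for ch in text]


# "Z" has no entry in toy_cmap, so it falls back to GID 0.
print(glyph_ids("A\u3042\u4E00Z", toy_cmap))
```

The fallback in the last position is the point of the paragraph above: a glyph that no character code maps to cannot be reached through ordinary text.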
As I wrote nearly a year ago, the Adobe-Identity-0 ROS is useful for building special-purpose fonts, especially CJK ones whose glyph coverage does not match one of our public ROSes. Our latest Adobe-Identity-0 ROS font is the open-source Adobe Blank, whose purposes and implementation details are described on our sister blog, Typblography.
Sequences are important in the context of Unicode, and UAX #34 (Unicode Named Character Sequences) is a good reference for Unicode sequences. The first type of sequence that typically comes to mind in the context of Japanese is the Ideographic Variation Sequence (IVS). IVSes are registered and maintained by The Unicode Consortium via the Ideographic Variation Database (IVD). There are also Standardized Variation Sequences that are much more closely bound to the standard.
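As a rough illustration of what these sequences look like in plain text, the following Python sketch scans a string for a base character followed by a variation selector. It relies on two facts about the code point ranges: IVSes use VS17 through VS256 (U+E0100 through U+E01EF), and VS1 through VS16 (U+FE00 through U+FE0F) are the selectors used by standardized variation sequences. The function name and the labels it returns are hypothetical, and the sketch does not check whether a sequence is actually registered or standardized.

```python
def variation_sequences(text):
    """Return (base, selector, kind) for each base + variation selector pair."""
    found = []
    for base, selector in zip(text, text[1:]):
        cp = ord(selector)
        if 0xE0100 <= cp <= 0xE01EF:
            kind = "IVS range (VS17-VS256)"
        elif 0xFE00 <= cp <= 0xFE0F:
            kind = "standardized range (VS1-VS16)"
        else:
            continue
        found.append((f"U+{ord(base):04X}", f"U+{cp:04X}", kind))
    return found


# U+845B followed by VS17 (U+E0100), then an unrelated character.
sample = "\u845B\U000E0100\u3042"
for entry in variation_sequences(sample):
    print(entry)
```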
I will close this particular topic by detailing how to support these proposed standardized variants in OpenType/CFF fonts.
For fonts that are currently IVS-enabled, such as those that include Format 14 ‘cmap’ subtables with Adobe-Japan1 or Hanyo-Denshi IVSes, it is important to note that the proposed standardized variants can co-exist with them, at least in terms of being specified in the font. For the former, I created an Adobe-Japan1_sequences.txt file that includes all registered Adobe-Japan1 IVSes, along with 89 of the 1,002 proposed standardized variants. The 89 standardized variants are at the end of the file. AFDKO tools, such as makeotf and spot, already support these standardized variants. When building OpenType/CFF fonts using the makeotf tool, this file is specified as the argument of the “-ci” command-line option.
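For readers who want to inspect such a sequences file before handing it to makeotf, here is a minimal Python parsing sketch. It assumes that each line follows the layout "base selector; collection; CID+number", which is an assumption about the file rather than something stated above. The sample line uses placeholder values, not an actual registered mapping, and the function name parse_sequences is hypothetical.

```python
import re

# Assumed line layout: "<base> <selector>; <collection>; CID+<number>"
LINE_RE = re.compile(
    r"^([0-9A-Fa-f]{4,6}) ([0-9A-Fa-f]{4,6});\s*([^;]+?);\s*CID\+(\d+)\s*$"
)


def parse_sequences(lines):
    """Return (base, selector, collection, cid) tuples for matching lines."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines that do not fit the assumed layout
        base, selector, collection, cid = match.groups()
        entries.append((int(base, 16), int(selector, 16), collection, int(cid)))
    return entries


# Placeholder data for illustration only.
sample = [
    "# comment line",
    "4E00 E0100; Adobe-Japan1; CID+12345",
]
print(parse_sequences(sample))
```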
To continue from the December 26, 2012 article, I should first point out that there is a relationship between these 1,002 proposed standardized variants and IVSes (Ideographic Variation Sequences). Standardized variants are standardized, hence their name. IVSes, on the other hand, are registered via a process that is described in UTS #37 and administered by the IVD Registrar (which happens to be me at the moment).
One problem that has been plaguing CJK Compatibility Ideographs is the fact that they are adversely affected by normalization. Regardless of which of the four normalization forms is applied (NFC, NFD, NFKC, or NFKD), they are converted to their canonical equivalents, which are CJK Unified Ideographs. This is a problem, particularly for Japan, because 75 kanji in JIS X 0213:2004 map to CJK Compatibility Ideograph code points. Furthermore, 57 of these 75 kanji correspond to Jinmei-yō Kanji (人名用漢字), meaning that they are used for personal names. The bottom-line problem with CJK Compatibility Ideographs is that any application of normalization, by any process, will permanently remove any distinctions between a CJK Compatibility Ideograph and its canonical equivalent. Not all processes are under one’s direct control, meaning that it is impossible to guarantee that normalization will not be applied. My opinion is that it is prudent to assume that normalization will be applied, and that preemption is the best solution.
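The effect is easy to reproduce with Python’s standard unicodedata module. The sketch below uses U+FA10, a CJK Compatibility Ideograph whose canonical equivalent is U+585A, as a single example; the choice of this character is mine, for illustration.

```python
import unicodedata

compat = "\uFA10"  # CJK COMPATIBILITY IDEOGRAPH-FA10

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, compat)
    # Every form replaces the compatibility ideograph with U+585A.
    print(form, f"U+{ord(compat):04X} -> U+{ord(normalized):04X}")
```

Once any of these calls has run, nothing in the resulting string records that the original character was U+FA10, which is why the distinction cannot be recovered afterward.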
In the December 4, 2012 Old Hangul article I mentioned that the ‘ccmp’ GSUB feature that is referenced in Microsoft’s Developing OpenType Fonts for Korean Hangul Script document is not necessary. Jaemin Chung kindly pointed out to me that environments that do not yet support Unicode Version 5.2 still require the ‘ccmp’ (Glyph Composition/Decomposition) GSUB feature to be present, otherwise proper shaping will not happen.
The main purpose of this short article is to provide a revised Perl script, named mkoldhangul-ccmp.pl, that adds a complete ‘ccmp’ GSUB feature definition for environments that do not yet support Unicode Version 5.2 (or greater). The sample glyph-map.txt datafile that maps the Unicode-based glyph names to CIDs is unchanged.
Okay. It is time to put some “K” into CJK…
Seriously, much of the content of this blog has been focused on Chinese and Japanese issues. This article will provide some much-deserved Korean content.
I spent the last few days coming to grips with Old Hangul (옛한글 yethangeul), specifically how to implement proper shaping using the three registered OpenType GSUB features, ‘ljmo’ (Leading Jamo Forms), ‘vjmo’ (Vowel Jamo Forms), and ‘tjmo’ (Trailing Jamo Forms).
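As a small aid to thinking about the three features, the Python sketch below sorts conjoining jamo into leading, vowel, and trailing classes by code point range (Hangul Jamo, Hangul Jamo Extended-A, and Hangul Jamo Extended-B). It shows only the classification step; real shaping also has to choose among alternate forms based on neighboring jamo, which is not attempted here. The table and function names are hypothetical.

```python
# (first, last, feature) for the conjoining jamo ranges.
JAMO_RANGES = [
    (0x1100, 0x115F, "ljmo"),  # leading consonants, Hangul Jamo
    (0xA960, 0xA97C, "ljmo"),  # leading consonants, Extended-A
    (0x1160, 0x11A7, "vjmo"),  # vowels, Hangul Jamo
    (0xD7B0, 0xD7C6, "vjmo"),  # vowels, Extended-B
    (0x11A8, 0x11FF, "tjmo"),  # trailing consonants, Hangul Jamo
    (0xD7CB, 0xD7FB, "tjmo"),  # trailing consonants, Extended-B
]


def jamo_feature(ch):
    """Return 'ljmo', 'vjmo', or 'tjmo' for a conjoining jamo, else None."""
    cp = ord(ch)
    for first, last, feature in JAMO_RANGES:
        if first <= cp <= last:
            return feature
    return None


# Leading hieuh, vowel araea, trailing nieun.
syllable = "\u1112\u119E\u11AB"
print([jamo_feature(ch) for ch in syllable])
```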
I like ASCII. Do I like ASCII because of all the wonderful things one can do with its extraordinarily large repertoire of 94 printable characters? Actually, yes. Before I defend that answer, I’d like to point out that ASCII has three important strengths: simplicity, robustness, and ubiquity. In other words, ASCII is simple in that it has a relatively small number of characters; robust in that it forms a subset of virtually every encoding, Unicode or otherwise; and ubiquitous in that it is supported everywhere. In fact, ASCII can be used to represent Unicode through the use of notations. Richard Ishida’s Unicode Code Converter is an excellent way to explore the various notations that are currently in use.
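A few of these ASCII-only notations can be generated with a short Python sketch. The selection of notations and the function name ascii_notations are my own choices for illustration; many other notations exist.

```python
def ascii_notations(ch):
    """Return several ASCII-only notations for a single character."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "html_hex": f"&#x{cp:X};",
        "html_decimal": f"&#{cp};",
        "python_escape": ch.encode("unicode_escape").decode("ascii"),
        "utf8_percent": "".join(f"%{b:02X}" for b in ch.encode("utf-8")),
    }


for name, notation in ascii_notations("\u6F22").items():
    print(f"{name}: {notation}")
```

Every value in the returned dictionary consists solely of ASCII characters, yet each one identifies U+6F22 unambiguously.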
The first set of ideographs to be encoded in Unicode (Version 1.1), referred to as CJK Unified Ideographs, is also known as the URO, which is an abbreviation for Unified Repertoire and Ordering. None of the extensions is given this label. Extensions A through D have been standardized, and Extension E will soon be standardized. Only Extension A is in the BMP (Basic Multilingual Plane). Extension B and beyond are in Plane 2, which is called the SIP (Supplementary Ideographic Plane). What makes the URO special or unique?
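The layout described above can be expressed as a small lookup in Python. The ranges are the block boundaries for the URO and Extensions A through D; Extension E is omitted because it was not yet standardized at the time of writing. The names CJK_BLOCKS and classify_ideograph are hypothetical.

```python
# (first, last, label) block boundaries for CJK Unified Ideographs.
CJK_BLOCKS = [
    (0x4E00, 0x9FFF, "URO"),
    (0x3400, 0x4DBF, "Extension A"),
    (0x20000, 0x2A6DF, "Extension B"),
    (0x2A700, 0x2B73F, "Extension C"),
    (0x2B740, 0x2B81F, "Extension D"),
]


def classify_ideograph(cp):
    """Return (label, plane) for a code point; label is None if not listed."""
    plane = cp >> 16  # Plane 0 is the BMP, Plane 2 is the SIP
    for first, last, label in CJK_BLOCKS:
        if first <= cp <= last:
            return label, plane
    return None, plane


for cp in (0x4E00, 0x3400, 0x20000, 0x2B740):
    print(f"U+{cp:04X}", classify_ideograph(cp))
```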
As described in the August 24, 2012 article, I am currently updating most of our OpenType Japanese fonts. One aspect of the update is to include the 32 additional IVSes, based on the March 2, 2012 version of the IVD (Ideographic Variation Database), which means that all of the kanji in Adobe-Japan1-6 now have a “plain text” representation. Another aspect of this particular update is to use the latest UTF-32 CMap resources, which include three additional mappings, one of which is U+9FCC that was appended to the URO (Unified Repertoire & Ordering) in Unicode Version 6.1. The topic of this article, however, is fixing a small number of glyphs, and the techniques that I used to do so.
Founded in 1987 by Yuan Ho and Seth Schneider, IMUG (International Multilingual User Group) has become Silicon Valley’s best user group for all matters related to internationalization, localization, and globalization. Meeting once a month, usually the third Thursday, IMUG meetings include a presentation by a distinguished member of the internationalization, localization, or globalization community. Adobe began hosting IMUG meetings in 2010 for odd-numbered months, and its Globalization Engineering Council serves as the official host. Even-numbered months are hosted by Google.
Regardless of whether you reside in the San Francisco Bay Area, I encourage you to attend IMUG meetings. Those that are hosted by Adobe are broadcast via Adobe Connect, meaning that attendees need not be on site.
It seems that I have presented for IMUG six times, in 1995 (Adobe Systems’ CID-Keyed Fonts For Large Character Sets), 1999 (Adventures in Multilingual Publishing), 2005 (The Adobe-Japan1-6 Character Collection), 2008 (Ideographic Variation Sequences: Implementation Details), 2010 (Kazuraki: Adobe Systems’ Groundbreaking New Japanese Typeface), and 2011 (The Power of “Plain Text” & the Importance of Meaningful Content).
The next IMUG meeting, which will be hosted at Adobe, will include an intriguing presentation by Microsoft’s Michael Kaplan about new internationalization features in Windows 8. Windows 8 is scheduled to be released on October 26, 2012, which is, by the way, approximately two months before our world ends. Please plan to attend, either in person or via Adobe Connect.
Many thanks to Nozomu KATŌ (加藤望) for bringing to my attention that the Adobe-Japan1-6 Unicode CMap resources were missing the following mapping:
U+207F → CID+15908
I decided to add this mapping to the following eight Adobe-Japan1-6 Unicode CMap resources this evening:
The eight updated CMap resources were just posted to the CMap Resources open source project that is hosted at Open @ Adobe, and the details are in the associated forum post.
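To show how the single mapping U+207F → CID+15908 would be spelled under different Unicode encoding forms, here is a short Python sketch. It prints only the character code in angle brackets followed by the CID. In an actual CMap resource such a line sits inside a begincidchar/endcidchar block (or is folded into a range), which the sketch does not attempt to produce. The function name cidchar_line is hypothetical.

```python
def cidchar_line(code_point, cid, encoding):
    """Format one '<hex bytes> CID' line for the given encoding form."""
    raw = chr(code_point).encode(encoding)
    return f"<{raw.hex()}> {cid}"


# Big-endian encoding forms are used so that no byte order mark is emitted.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(encoding, cidchar_line(0x207F, 15908, encoding))
```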
On July 25, 2012, Apple released to the world Mac OS X Version 10.8 (aka Mountain Lion). Among the many new features in this latest iteration of Mac OS X is support for CFR objects. For those who are not aware, CFR objects are based on ISO/IEC 14496-28:2012 (Composite Font Representation), and are used to define both composite fonts and fallback fonts. CFR objects effectively break the 64K glyph barrier. Mac OS X Version 10.8 is thus the first implementation that has broken the 64K glyph barrier.
ISO/IEC 10646:2012 (Third Edition) was just published. This is the first version of the standard that includes multiple-column Code Charts for Extension B, and for CJK Compatibility Ideographs. Another significant aspect of ISO/IEC 10646:2012 is that it is equivalent to Unicode Version 6.1.
For Adobe, the publishing of this new version of the standard represents a significant milestone, because it means that every Adobe-Japan1-6 kanji is either directly encoded, or is directly associated with a registered IVS in the IVD (Ideographic Variation Database).
Speaking of Unicode Version 6.1, the printed version of the Core Specification is available via POD from Lulu, and at a very attractive price.
Years ago, I wrote a Perl script, called unicode-rows.pl, that takes a fully-qualified PostScript name—composed of a CIDFont resource name, two hyphens, and a UTF-32 CMap resource name—then generates a PostScript file that can be distilled into a PDF. The resulting PDF file is a Unicode table, arranged in groups of 256 code points. If the UTF-32 CMap resource includes even a single mapping for a particular group of 256 code points, a page is created.
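The two ideas in that description, the fully-qualified name and the grouping into rows of 256 code points, can be sketched in Python as follows. This is not a port of unicode-rows.pl: the function names are hypothetical, the CIDFont resource name ExampleFont-Regular is a placeholder, and the list of code points stands in for the mappings that would be read from a UTF-32 CMap resource.

```python
def qualified_name(cidfont, cmap):
    """Join a CIDFont resource name and a CMap resource name with two hyphens."""
    return f"{cidfont}--{cmap}"


def populated_rows(code_points):
    """Return the sorted row numbers (code point // 256) that have a mapping."""
    return sorted({cp >> 8 for cp in code_points})


# Placeholder font name; UniJIS2004-UTF32-H is one of the CMap resources
# mentioned in the text.
print(qualified_name("ExampleFont-Regular", "UniJIS2004-UTF32-H"))

# A row gets a page as soon as it contains even one mapped code point.
mapped = [0x0041, 0x00E9, 0x3042, 0x20000]
for row in populated_rows(mapped):
    print(f"page for U+{row << 8:04X} through U+{(row << 8) + 0xFF:04X}")
```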
I have prepared examples that are based on the UniJIS2004-UTF32-H and UniJIS-UTF32-H CMap resources.