One of the questions one may ask about the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK open source Pan-CJK typeface families is whether they are GB 18030–compliant. Compliant? Sort of. Certified? Not yet.
Let me explain…
Let it be known that the “OpenType Collection” (OTC) format was born on 09/21/2011 at Pho Minh Restaurant in Cupertino, California. Present from Adobe were the following: David Lemon, Ken Lunde, Sairus Patel, and Read Roberts. Present from Apple were Antonio Cavedoni, Julio Gonzalez, Yasuo Kida, Peter Lofting, and Tony Tseung. — Adobe & Apple
The above declaration paved the way for supporting (CFF-based) OpenType Collections in Apple’s OS X (beginning from Version 10.8) and in Adobe’s applications (beginning from CS6).
This is an update to the short article that I wrote last August.
Now that it is officially 2015, and given that Unicode Version 8.0 is scheduled to be released mid-year, exactly what is expected to be included—at least in terms of CJK Unified Ideographs—is becoming clearer. Perusing the still-in-process UCD (Unicode Character Database) sheds much light on this. (I also recommend checking the pipeline from time to time.)
I recently updated the single-page PDF that tallies the CJK Unified Ideographs and CJK Compatibility Ideographs that are in Unicode, to include the latest information for Version 8.0, along with what can be expected to be included in Version 9.0 (mid-2016).
The highlights for Unicode Version 8.0 include nine of the 29 Urgently Needed Characters being appended to the URO (Unified Repertoire & Ordering) and Extension E (5,762 characters). The remaining 20 Urgently Needed Characters, along with Extension F, are expected to be included in Unicode Version 9.0.
Source Han Sans (and Noto Sans CJK) supports the first four of the nine Urgently Needed Characters that are being fast-tracked for Version 8.0, along with 108 Extension E code points for supporting China’s 通用规范汉字表 (Tōngyòng Guīfàn Hànzìbiǎo).
While it was not uncommon for early (pre-Unicode) CJK character set standards to include characters that correspond to scripts of other languages or that are used in other countries, such as the extent to which Japanese kana were included in standards from China and Korea, it was not common for one of these countries to produce a standard for a seemingly different language. Enter GB 12052-89 (entitled Korean Character Coded Character Set for Information Interchange, or 信息交换用朝鲜文字编码字符集 in Chinese), which is a GB (PRC) standard that sort of broke this mold.
One of the reasons why Source Han Sans—and obviously the Google-branded Noto Sans CJK—can be considered the world’s first Pan-CJK typeface family is due to its support for Korean hangul. While it is common to support modern hangul in Korean fonts, supporting archaic hangul is relatively uncommon. One of the more challenging aspects of developing Source Han Sans was implementing support for archaic hangul, which also included handling 500 high-frequency archaic hangul syllables. This article will thus detail what went into supporting archaic hangul in Source Han Sans. I’d like to once again thank our talented friends at Sandoll Communication for designing the glyphs for these characters.
As I described in an article earlier this year, GB 18030 artificially imposes a visual difference between Radicals #74 (⽉) and #130 (⾁) for character pairs that differ only in this component, though conventions for Simplified Chinese use a unified form that looks like Radical #74. In that article I pinpointed a case for which the character that uses Radical #130 is in error, because its left-side radical uses the Radical #74 form, and the corresponding character that uses Radical #74 is outside the scope of GB 18030 (at least for now).
Thanks to Jaemin Chung, I was able to find three errors within the scope of GB 18030, as shown below:
According to the principles imposed by GB 18030, the characters on the left are in error, and should be visually distinct from those on the right in terms of their left-side radical.
(Uni-chan image designed by Mary Jenkins)
In addition to attending IRG43 (ISO/IEC JTC1/SC2/WG2/IRG Meeting #43) in November as a US/Unicode delegate, I will also be serving as the Adobe host for this meeting, which will take place at Adobe’s headquarters in downtown San José, California. It will be a busy week for me, because while I will need to stay focused on the meeting itself, I also need to be mindful of matters related to logistics, before and during the meeting. Extension F (called Extension F1 by the IRG) is in the process of being handed off to WG2, and work on Extension G (called Extension F2 by the IRG) is expected to begin in earnest before and during IRG43.
By the way, the last time that an IRG meeting was held in the US was IRG37, which was hosted by Google in Mountain View, California in November of 2011. Before that, it was IRG29, which was hosted by Adobe in November of 2007.
I am very much looking forward to the meeting, to seeing the delegates, and to being part of important CJK Unified Ideograph work.
A commemorative T-shirt may be necessary… ☺
Unless you have been living in a cave or under a rock, you’ve no doubt heard of Source Han Sans or Noto Sans CJK through the initial announcements from Adobe or Google, who jointly developed them, or elsewhere. These two Pan-CJK typeface families, which are joined at the hip because they differ only in name, were released to the world at large, as open source fonts, on the afternoon of July 15, 2014 in the US, which was the morning of July 16, 2014 in East Asia, their target audience. Click on the preview below to view a single-page PDF that shows all 65,535 glyphs from one of these fonts:
Over the next several months I plan to publish a series of articles on this blog that will detail various aspects of the development process that I employed for building these two typeface families. Although the subsequent articles will mention only Source Han Sans by name, they also pertain to its twin, Noto Sans CJK.
Given that Unicode has declared mid-year annual major releases, we can expect Unicode Version 8.0 to be released in about a year, in mid-2015. In terms of ideographs, we can expect some additions, specifically a small number of UNC (Urgently Needed Character) additions to the URO (Unified Repertoire & Ordering) that were discussed in the June article, along with Extension E. This single-page PDF provides a tentative look at the CJK Unified Ideographs, along with CJK Compatibility Ideographs for good measure.
Unicode Version 7.0 was released on June 16, 2014.
One of the accomplishments at IRG #42 last month was the addition of 29 new CJK Unified Ideographs to the URO (Unified Repertoire & Ordering), specifically from U+9FCD through U+9FE9. The first four are shown above.
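As a quick sanity check on that range, here is a minimal Python sketch (the variable names are mine, purely for illustration) that enumerates U+9FCD through U+9FE9 and counts the code points:

```python
# Inclusive code point range of the new URO additions described above.
first, last = 0x9FCD, 0x9FE9

# range() excludes its upper bound, hence the + 1.
unc_code_points = list(range(first, last + 1))
unc_count = len(unc_code_points)

# The first four code points, in the usual U+XXXX notation.
first_four = [f"U+{cp:04X}" for cp in unc_code_points[:4]]

print(unc_count)
print(first_four)
```

The count comes out to 29, which agrees with the number of characters stated above.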
The Moji_Joho IVD Collection was first introduced via PRI 259 last December, which initiated a mandatory—according to UTS #37—90-day Public Review Period. The submitter received three sets of comments, and after making minor changes, submitted the materials for registering the new IVD collection, along with its initial set of IVSes. The Moji_Joho IVD Collection and its initial set of IVSes were officially registered on May 16, 2014, which represents the fourth version of the IVD (Ideographic Variation Database).
The 2014-05-16 version of the IVD thus registers a new IVD collection, Moji_Joho, along with its initial set of 10,710 IVSes, 9,685 of which are shared—through mutual agreement—with the registered Hanyo-Denshi IVD Collection. Some enhancements were also made to the IVD_Stats.txt file, specifically that the shared IVSes are explicitly listed at the end of the file.
One additional statistic is that the highest VS (Variation Selector) used is currently VS48 (U+E011F), meaning that 32 of the 240 VSes allocated for IVS use are now being used. Of course, it is relatively easy to figure out with which BC (Base Character) VS48 is used, and an educated guess would be that it is either U+9089 (邉) or U+908A (邊). It is the former:
9089 E011F; Moji_Joho; MJ026193
The highest VS used with the latter BC is currently VS37 (U+E0114).
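The relationship between a VS number and its code point can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the parser assumes that every entry follows the same three-field, semicolon-separated shape as the single line quoted above.

```python
def vs_number(code_point: int) -> int:
    """Return n for the variation selector VSn at the given code point."""
    if 0xFE00 <= code_point <= 0xFE0F:    # VS1 through VS16
        return code_point - 0xFE00 + 1
    if 0xE0100 <= code_point <= 0xE01EF:  # VS17 through VS256 (used for IVSes)
        return code_point - 0xE0100 + 17
    raise ValueError(f"U+{code_point:04X} is not a variation selector")


def parse_ivd_line(line: str):
    """Split 'BASE VS; Collection; Identifier' into its four parts."""
    sequence, collection, identifier = (field.strip() for field in line.split(";"))
    base_hex, vs_hex = sequence.split()
    return int(base_hex, 16), int(vs_hex, 16), collection, identifier


base, vs, collection, identifier = parse_ivd_line("9089 E011F; Moji_Joho; MJ026193")

highest_vs = vs_number(vs)        # VS48
ivs_vses_used = highest_vs - 16   # VS17 through VS48 inclusive
print(f"U+{base:04X} VS{highest_vs} {collection} {identifier}")
print(ivs_vses_used)
```

Because the 240 VSes allocated for IVS use start at VS17 (U+E0100), subtracting 16 from the highest VS number gives the count of 32 mentioned above, and the same function maps U+E0114 to VS37.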
As the IVD Registrar, I’d like to use this opportunity to thank everyone who made the effort to review PRI 259. I’d also like to congratulate those who prepared the Moji_Joho IVD Collection for both review and registration.
Shown above is the top portion of the printed version of Taiwan’s MOE 國字標準字體方體母稿 (Fangti) standard. (For those who are interested, its ISBN is 957-00-8392-1.) What is provided online are effectively scans of the 常用字 and 次常用字 sections, which contain 4,808 and 6,343 hanzi, respectively. Although included in the printed version of the standard, the 罕用字 section, which contains 1,907 additional hanzi, is not provided online. In terms of sheer numbers, these 1,907 additional hanzi appear to completely cover Big Five (both levels) and CNS 11643 Planes 1 and 2.
The purpose of today’s article is to describe two additional issues in this glyph standard that my new friend, Kuang-che Wu (吳光哲) of Google, recently found.
What is shown above is a trivial difference in a two-component structure that is present in many ideographs, such as 滕 (U+6ED5), 縢 (U+7E22), 螣 (U+87A3), 謄 (U+8B04), and 騰 (U+9A30). This difference is, of course, unifiable. What this article is about is consistency within a standard, mainly referring to the source standards from each region. The focus of this article is on the forms used in ROC (Republic of China; 中華民國 Zhōnghuá Mínguó), which is more commonly referred to as Taiwan (臺灣 or 台灣 Táiwān).
Recent work has led me to more closely explore U+4548 (☞䕈☜), which is in CJK Unified Ideographs Extension A. (What is shown in parentheses in the previous sentence is likely to be different than what is shown in the excerpt above.)
The image above is an excerpt from the latest Extension A Code Charts. At first glance, everything seems normal. The differences between the G (China) and T (Taiwan) glyphs are expected, and perhaps more importantly, unifiable.
In the previous article I mentioned that 85 kanji that correspond to JIS X 0213:2004 currently have kIRG_JSource JA source references, but I made no mention of possible glyph differences between what is shown in the Code Charts and JIS X 0213:2004. I found at least seven kanji, among these 85, that have significant glyph differences between these two Japanese sources. I prepared this table that shows these glyph differences, by using excerpts from the Extension A code charts for the kIRG_JSource JA glyphs and Heisei Mincho W3 (平成明朝W3) for the JIS X 0213:2004 glyphs.
To continue yesterday’s article about different prototypical glyphs for Unicode code points that are common between JIS X 0212-1990 and JIS X 0213:2004, today’s article will focus on the normative references that correspond to JIS X 0213:2004, or rather the lack thereof.
Most Japanese font developers are—perhaps painfully—aware of the 168 kanji whose prototypical glyphs changed in 2004 via the JIS X 0213:2004 standard. What is not broadly known are those kanji whose prototypical glyphs are different between JIS X 0212-1990 and JIS X 0213 (both versions).
JIS X 0212-1990 was established in 1990, and included 5,801 kanji in a single block. JIS X 0213:2000 was established a full ten years later, and included 3,685 kanji in two levels (1,249 kanji in Level 3, and 2,436 in Level 4). Ten additional kanji were added in JIS X 0213:2004, bringing the total to 3,695. When the Unicode code points that correspond to these two JIS standards are compared, 2,743 of them are common, 3,058 are specific to JIS X 0212-1990, and 952 are specific to JIS X 0213:2004.
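The figures above are internally consistent, which a short Python sketch can confirm. It uses only the numbers quoted in the preceding paragraph, and the variable names are mine:

```python
# Counts quoted above for the two JIS standards.
common = 2743       # Unicode code points shared by both standards
only_0212 = 3058    # specific to JIS X 0212-1990
only_0213 = 952     # specific to JIS X 0213:2004

level_3, level_4 = 1249, 2436  # JIS X 0213:2000 kanji per level
added_2004 = 10                # kanji added in JIS X 0213:2004

total_0212 = common + only_0212
total_0213_2000 = level_3 + level_4
total_0213_2004 = total_0213_2000 + added_2004

print(total_0212)                   # kanji in JIS X 0212-1990
print(total_0213_2000)              # kanji in JIS X 0213:2000
print(total_0213_2004)              # kanji in JIS X 0213:2004
print(common + only_0213)           # should match the 2004 total
```

Both ways of arriving at the JIS X 0213:2004 total (levels plus additions, or common plus specific) agree on 3,695.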
Interestingly, when the prototypical glyphs of the 2,743 kanji that are in common—in terms of having a shared Unicode code point—are compared, 30 of them are different. I prepared a single-page table that shows the differences using genuine Heisei Mincho W3 (平成明朝W3) glyphs, which also provides Adobe-Japan1-6 CIDs for all but three of the JIS X 0212-1990 prototypical glyphs (these three glyphs are thus candidates for Adobe-Japan1-7). Also, all of the JIS X 0213 kanji are from the original 2000 version, except for the one that corresponds to U+7626 that was introduced in 2004. This character’s entry is shaded in the PDF.
It’s hard to imagine that it has been nearly three years since I posted the always-enjoyable Unicode Beyond-BMP Top Ten List, so I figured that an updated version, which takes into account developments that have transpired since then, was in order for the current year of 2014.
As the title makes blatantly obvious, today we will cover a topic about China (中华人民共和国 Zhōnghuá Rénmín Gònghéguó).
For those who are not aware, there are twelve IDCs (Ideographic Description Characters) in Unicode, from U+2FF0 through U+2FFB, that are used in IDSes (Ideographic Description Sequences), which are intended to visually describe the structure of ideographs by enumerating their components and arrangement in a hierarchical fashion. Any Unicode character can serve as an IDS component, and the IDCs describe their arrangement. The IRG uses IDSes as a way to detect potentially duplicate characters in new submissions. All existing CJK Unified Ideographs have an IDS, and new submissions require an IDS.
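To make the hierarchical nature of an IDS concrete, here is a minimal Python sketch of a prefix-notation parser. It assumes the twelve IDCs described above, of which U+2FF2 and U+2FF3 take three operands and the rest take two. The function name and the tuple-based tree are my own choices, and the example sequences are illustrative only (the first is one plausible description of 海, U+6D77).

```python
# Map each of the twelve IDCs (U+2FF0..U+2FFB) to its operand count.
IDC_ARITY = {
    chr(cp): (3 if cp in (0x2FF2, 0x2FF3) else 2)
    for cp in range(0x2FF0, 0x2FFC)
}


def parse_ids(ids: str):
    """Parse an IDS into a nested (idc, [children]) tree; leaves are plain characters."""

    def parse_at(pos: int):
        if pos >= len(ids):
            raise ValueError("IDS ended before all operands were supplied")
        ch = ids[pos]
        arity = IDC_ARITY.get(ch)
        if arity is None:
            return ch, pos + 1          # an ordinary component
        children = []
        pos += 1
        for _ in range(arity):          # each operand may itself be an IDS
            child, pos = parse_at(pos)
            children.append(child)
        return (ch, children), pos

    tree, end = parse_at(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


flat = parse_ids("\u2FF0\u6C35\u6BCF")               # left-right: 氵 + 每
nested = parse_ids("\u2FF1\u8279\u2FF0\u6C35\u6BCF")  # 艹 above the structure just parsed
print(flat)
print(nested)
```

The nested result shows how one IDS can appear as an operand inside another, which is what lets a short sequence describe an arbitrarily deep component structure.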
This article describes a technique that uses IDSes combined with OpenType functionality to pseudo-encode glyphs that are unencoded or not yet encoded. If memory serves, it was Taichi KAWABATA (川幡太一) who originally suggested this technique.
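The excerpt above does not spell out how the technique is implemented, so the following Python sketch is only one guess at its general shape: it assumes that an IDS is mapped to a single glyph through a ligature-style substitution rule written in feature-file syntax. The target glyph name `pseudo.0001` is hypothetical, as are the function names, and the glyph names simply follow the common `uniXXXX` naming convention.

```python
def glyph_name(ch: str) -> str:
    """Return a uniXXXX (BMP) or uXXXXX (beyond-BMP) style glyph name."""
    cp = ord(ch)
    return f"uni{cp:04X}" if cp <= 0xFFFF else f"u{cp:04X}"


def ids_to_sub_rule(ids: str, target_glyph: str) -> str:
    """Build a many-to-one substitution rule from an IDS (assumed approach)."""
    sources = " ".join(glyph_name(ch) for ch in ids)
    return f"sub {sources} by {target_glyph};"


# The target glyph name is a placeholder, not taken from any real font.
rule = ids_to_sub_rule("\u2FF0\u6C35\u6BCF", "pseudo.0001")
print(rule)
```

Under that assumption, typing the IDS in an application that applies the substitution would display the single target glyph, while the underlying text remains the IDS itself.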