Unless you have been living in a cave or under a rock, you’ve no doubt heard of Source Han Sans or Noto Sans CJK through the initial announcements from Adobe or Google who jointly developed them, or elsewhere. These two Pan-CJK typeface families, which are joined at the hip because they differ only in name, were released to the world at large, as open source fonts, on the afternoon of July 15, 2014 in the US, which was the morning of July 16, 2014 in East Asia, their target audience. Click on the preview below to view a single-page PDF that shows all 65,535 glyphs from one of these fonts:
Over the next several months I plan to publish a series of articles on this blog that will detail various aspects of the development process that I employed for building these two typeface families. Although the subsequent articles will mention only Source Han Sans by name, they also pertain to its twin, Noto Sans CJK.
Given that Unicode has declared mid-year annual major releases, we can expect Unicode Version 8.0 to be released in about a year, in mid-2015. In terms of ideographs, we can expect some additions, specifically a small number of UNC (Urgently Needed Character) additions to the URO (Unified Repertoire & Ordering) that were discussed in the June article, along with Extension E. This single-page PDF provides a tentative look at the CJK Unified Ideographs, along with CJK Compatibility Ideographs for good measure.
Unicode Version 7.0 was released on June 16, 2014.
One of the accomplishments at IRG #42 last month was the addition of 29 new CJK Unified Ideographs to the URO (Unified Repertoire & Ordering), specifically from U+9FCD through U+9FE9. The first four are shown above.
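The count can be checked directly from the two endpoints. This is a minimal sketch; the helper name `code_points` is purely illustrative, and only the range itself comes from the text above.

```python
# Range endpoints are taken from the text above (inclusive on both ends).
FIRST, LAST = 0x9FCD, 0x9FE9


def code_points(first, last):
    """Return 'U+XXXX' labels for every code point in the inclusive range."""
    return [f"U+{cp:04X}" for cp in range(first, last + 1)]


additions = code_points(FIRST, LAST)
print(len(additions))   # number of code points in the range
print(additions[:4])    # the first four additions
```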
The Moji_Joho IVD Collection was first introduced via PRI 259 last December, which initiated a mandatory—according to UTS #37—90-day Public Review Period. The submitter received three sets of comments, and after making minor changes, submitted the materials for registering the new IVD collection, along with its initial set of IVSes. The Moji_Joho IVD Collection and its initial set of IVSes were officially registered on May 16, 2014, which represents the fourth version of the IVD (Ideographic Variation Database).
The 2014-05-16 version of the IVD thus registers a new IVD collection, Moji_Joho, along with its initial set of 10,710 IVSes, 9,685 of which are shared—through mutual agreement—with the registered Hanyo-Denshi IVD Collection. Some enhancements were also made to the IVD_Stats.txt file, specifically that the shared IVSes are explicitly listed at the end of the file.
One additional statistic is that the highest VS (Variation Selector) used is currently VS48 (U+E011F), meaning that 32 of the 240 VSes allocated for IVS use are now being used. Of course, it is relatively easy to figure out with which BC (Base Character) VS48 is used, and an educated guess would be that it is either U+9089 (邉) or U+908A (邊). It is the former:
9089 E011F; Moji_Joho; MJ026193
The highest VS used with the latter BC is currently VS37 (U+E0114).
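The relationship between a VS code point and its VS number can be sketched in a few lines. This assumes the "base VS; collection; identifier" layout of the entry quoted above, and that the 240 VSes for IVS use run from VS17 (U+E0100) through VS256 (U+E01EF). The function name and the dictionary keys are hypothetical, chosen only for illustration.

```python
VS17 = 0xE0100   # first variation selector allocated for IVS use
VS256 = 0xE01EF  # last one; VS17..VS256 gives 240 selectors


def parse_ivd_line(line):
    """Split one 'BASE VS; Collection; Identifier' entry into its fields."""
    sequence, collection, identifier = (f.strip() for f in line.split(";"))
    base_hex, vs_hex = sequence.split()
    base, vs = int(base_hex, 16), int(vs_hex, 16)
    if not VS17 <= vs <= VS256:
        raise ValueError(f"U+{vs:04X} is not a variation selector for IVS use")
    return {
        "base": chr(base),
        "vs_number": vs - VS17 + 17,         # U+E0100 is VS17
        "selectors_through": vs - VS17 + 1,  # how many VSes from VS17 up to this one
        "collection": collection,
        "identifier": identifier,
    }


entry = parse_ivd_line("9089 E011F; Moji_Joho; MJ026193")
print(entry)
```

Run on the entry above, the sketch reports VS48 and a count of 32, which matches the statistic quoted earlier.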
As the IVD Registrar, I’d like to use this opportunity to thank everyone who made the effort to review PRI 259. I’d also like to congratulate those who prepared the Moji_Joho IVD Collection for both review and registration.
Shown above is the top portion of the printed version of Taiwan’s MOE 國字標準字體方體母稿 (Fangti) standard. (For those who are interested, its ISBN is 957-00-8392-1.) What is provided online are effectively scans of the 常用字 and 次常用字 sections, which contain 4,808 and 6,343 hanzi, respectively. Although included in the printed version of the standard, the 罕用字 section, which contains 1,907 additional hanzi, is not provided online. In terms of sheer numbers, these 1,907 additional hanzi appear to completely cover Big Five (both levels) and CNS 11643 Planes 1 and 2.
The purpose of today’s article is to describe two additional issues in this glyph standard that my new friend, Kuang-che Wu (吳光哲) of Google, recently found.
What is shown above is a trivial difference in a two-component structure that is present in many ideographs, such as 滕 (U+6ED5), 縢 (U+7E22), 螣 (U+87A3), 謄 (U+8B04), and 騰 (U+9A30). This difference is, of course, unifiable. This article is about consistency within a standard, mainly referring to the source standards from each region. The focus of this article is on the forms used in the ROC (Republic of China; 中華民國 Zhōnghuá Mínguó), which is more commonly referred to as Taiwan (臺灣 or 台灣 Táiwān).
Recent work has led me to more closely explore U+4548 (☞䕈☜), which is in CJK Unified Ideographs Extension A. (What is shown in parentheses in the previous sentence is likely to be different than what is shown in the excerpt above.)
The image above is an excerpt from the latest Extension A Code Charts. At first glance, everything seems normal. The differences between the G (China) and T (Taiwan) glyphs are expected, and perhaps more importantly, unifiable.
In the previous article I mentioned that 85 kanji that correspond to JIS X 0213:2004 currently have kIRG_JSource JA source references, but I made no mention about possible glyph differences between what is shown in the Code Charts and JIS X 0213:2004. I found at least seven kanji, among these 85, that have significant glyph differences between these two Japanese sources. I prepared this table that shows these glyph differences, by using excerpts from the Extension A code charts for the kIRG_JSource JA glyphs and Heisei Mincho W3 (平成明朝W3) for the JIS X 0213:2004 glyphs.
To continue yesterday’s article about different prototypical glyphs for Unicode code points that are common between JIS X 0212-1990 and JIS X 0213:2004, today’s article will focus on the normative references that correspond to JIS X 0213:2004, or rather the lack thereof.
Most Japanese font developers are—perhaps painfully—aware of the 168 kanji whose prototypical glyphs changed in 2004 via the JIS X 0213:2004 standard. What is not broadly known are those kanji whose prototypical glyphs are different between JIS X 0212-1990 and JIS X 0213 (both versions).
JIS X 0212-1990 was established in 1990, and included 5,801 kanji in a single block. JIS X 0213:2000 was established a full ten years later, and included 3,685 kanji in two levels (1,249 kanji in Level 3, and 2,436 in Level 4). Ten additional kanji were added in JIS X 0213:2004, bringing the total to 3,695. When the Unicode code points that correspond to these two JIS standards are compared, 2,743 of them are common, 3,058 are specific to JIS X 0212-1990, and 952 are specific to JIS X 0213:2004.
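The figures above are mutually consistent, as a quick check shows. This is only a sketch of the arithmetic: it assumes every kanji in each standard corresponds to exactly one distinct Unicode code point, and all variable names are illustrative.

```python
# All inputs are the counts quoted in the paragraph above.
jis_x_0212 = 5801               # kanji in JIS X 0212-1990
level_3, level_4 = 1249, 2436   # JIS X 0213:2000, Levels 3 and 4
added_in_2004 = 10              # kanji added by JIS X 0213:2004
common = 2743                   # code points shared by both standards

jis_x_0213_2000 = level_3 + level_4
jis_x_0213_2004 = jis_x_0213_2000 + added_in_2004
only_0212 = jis_x_0212 - common        # specific to JIS X 0212-1990
only_0213 = jis_x_0213_2004 - common   # specific to JIS X 0213:2004

print(jis_x_0213_2000, jis_x_0213_2004, only_0212, only_0213)
```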
Interestingly, when the prototypical glyphs of the 2,743 kanji that are in common—in terms of having a shared Unicode code point—are compared, 31 of them are different. I prepared a single-page table that shows the differences using genuine Heisei Mincho W3 (平成明朝W3) glyphs, which also provides Adobe-Japan1-6 CIDs for all but three of the JIS X 0212-1990 prototypical glyphs (these three glyphs are thus candidates for Adobe-Japan1-7). Also, all of the JIS X 0213 kanji are from the original 2000 version, except for the one that corresponds to U+7626 that was introduced in 2004. This character’s entry is shaded in the PDF.
It’s hard to imagine that it has been nearly three years since I posted the always-enjoyable Unicode Beyond-BMP Top Ten List, so I figured that an updated version, which takes into account developments that have transpired since then, was in order for the current year of 2014.
As the title makes blatantly obvious, today we will cover a topic about China (中华人民共和国 zhōnghuá rénmín gònghéguó).
For those who are not aware, there are twelve IDCs (Ideographic Description Characters) in Unicode, from U+2FF0 through U+2FFB, that are used in IDSes (Ideographic Description Sequences), which are intended to visually describe the structure of ideographs by enumerating their components and arrangement in a hierarchical fashion. Any Unicode character can serve as an IDS component, and the IDCs describe their arrangement. The IRG uses IDSes as a way to detect potentially duplicate characters in new submissions. All existing CJK Unified Ideographs have an IDS, and new submissions require an IDS.
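The hierarchical structure of an IDS can be made concrete with a small recursive parser. This is a minimal sketch covering only the twelve IDCs named above, in which U+2FF2 and U+2FF3 take three operands and the other ten take two. The function names and the nested-tuple output are illustrative choices, not part of any standard tool.

```python
# U+2FF2 and U+2FF3 take three operands; the other ten IDCs take two.
TERNARY = {"\u2FF2", "\u2FF3"}


def is_idc(ch):
    """True for the twelve IDCs in the range U+2FF0..U+2FFB."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def parse_ids(ids):
    """Parse an IDS into a nested (idc, operand, ...) tuple.

    Raises ValueError if the sequence is incomplete or has trailing characters.
    """
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("IDS ends before all operands are supplied")
        ch = ids[pos]
        if not is_idc(ch):
            return ch, pos + 1           # a component: any non-IDC character
        arity = 3 if ch in TERNARY else 2
        operands = []
        pos += 1
        for _ in range(arity):
            operand, pos = parse(pos)    # operands may themselves be IDSes
            operands.append(operand)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


print(parse_ids("⿰日月"))       # 明 described as 日 beside 月
print(parse_ids("⿱艹⿰日月"))   # 萌 described as 艹 above (日 beside 月)
```

Comparing the resulting trees, rather than the raw strings, is one simple way to flag two submissions that describe the same structure.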
This article describes a technique that uses IDSes combined with OpenType functionality to pseudo-encode glyphs that are unencoded or not yet encoded. If memory serves, it was Taichi KAWABATA (川幡太一) who originally suggested this technique.
I was recently asked, indirectly via Twitter, about changes and additions that were made to our JIS2004-savvy CMap resources, specifically UniJIS2004-UTF32-H and UniJISX02132004-UTF32-H. The former also includes UTF-8 (UniJIS2004-UTF8-H) and UTF-16 (UniJIS2004-UTF16-H) versions that are kept in sync with the master UTF-32 version by being automagically generated by the CMap resource compiler (and decompiler), cmap-tool.pl, which I developed years ago.
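The UTF-8 and UTF-16 versions can be derived mechanically because all three encoding forms represent the same Unicode scalar values. The sketch below illustrates that relationship for a single code point using Python's built-in codecs. It is not how cmap-tool.pl works internally, and the function name is hypothetical.

```python
def encoding_forms(code_point):
    """Return the big-endian hex representation of one scalar value in each form."""
    ch = chr(code_point)
    return {
        "UTF-32": ch.encode("utf-32-be").hex().upper(),
        "UTF-16": ch.encode("utf-16-be").hex().upper(),  # surrogate pair beyond the BMP
        "UTF-8": ch.encode("utf-8").hex().upper(),
    }


print(encoding_forms(0x4E00))    # a BMP code point
print(encoding_forms(0x20B9F))   # a code point beyond the BMP
```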
Of course, all of these CMap resources also have vertical versions that use a “V” at the end of their names in lieu of the “H,” but in the context of OpenType font development, the vertical CMap resources are virtually unused and worthless because it is considered much better practice to explicitly define a ‘vert’ GSUB feature for handling vertical substitution. In the absence of an explicit definition, the AFDKO makeotf tool will synthesize a ‘vert’ GSUB feature by using the corresponding vertical CMap resources.
With all that being said, what follows in this article is a complete history of these two CMap resources, which also assign dates, and sometimes notes, to each version.
I spent a couple of days curling up with GB 18030 (both versions: 2000 and 2005), which is the PRC’s latest and greatest national character set standard, and came across an oddity that my gut tells me is a design flaw. At the very least, it is an issue of which font developers need to be aware.
What I found were eight instances of CJK Unified Ideographs with a left-side Radical #130 that uses the Traditional Chinese or Taiwan-style form, instead of the expected Simplified Chinese or PRC-style form that looks the same as Radical #74. Screen captures from the latest Unicode Code Charts, whose glyphs agree with both versions of GB 18030, are shown below:
As the IVD Registrar, I am very pleased to announce PRI 259 (Public Review Issue #259), which is the combined registration of the new Moji_Joho IVD Collection and sequences for that IVD collection. According to procedures set forth in UTS #37 (Unicode Technical Standard #37, Unicode Ideographic Variation Database), the 90-day public review, which commences today, allows interested parties to submit comments, suggestions, and errors to the registrant via Unicode’s reporting form.
Not all PDF authoring applications are the same, in terms of the extent to which they preserve the text content of the original document. Of course, this is not necessarily the fault of the PDF authoring application, but rather it is due to a disconnect between the PDF authoring process and access to the text content of the original document.
The best example for demonstrating this is to create a document that includes the two kanji 一 (U+4E00) and ⼀ (U+2F00). These two characters make a good example because in mainstream Japanese fonts, mainly those that are based on the Adobe-Japan1-x ROS, both map to the same glyph, specifically CID+1200.
If you download and unpack the 4E00vs2F00.zip file, you will find two PDF files, an Adobe InDesign file, and an MS Word file. If you open the original documents and search for 一 (U+4E00), you will find only a single instance, which is the one that is marked by the Unicode scalar value. However, if you open the respective PDF files, you will notice a difference. The one that is based on the MS Word file now includes two instances of 一 (U+4E00), and ⼀ (U+2F00) is no longer included in its content. You can search a PDF file by Unicode scalar value by using the “\uXXXX” notation, such as \u4E00 for U+4E00 (一). (Note: Depending on the version of MS Word that is being used, the PDF file may instead include two instances of ⼀ (U+2F00). I am using Microsoft Word for Mac 2011 Version 14.3.8.)
Adobe InDesign has a built-in PDF library that has direct access to the text content, and is thus able to inject it into the text layer of the PDF file that it produces. MS Word uses a different pathway for producing a PDF file, one that does not have access to the text content of the original document.
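Once the text has been extracted from a PDF file by whatever means, checking which of the two characters survived is straightforward. The sketch below assumes the extracted text is already available as a Python string. The sample strings are hypothetical stand-ins for the two outcomes described above, not actual extractions from the PDF files.

```python
import unicodedata
from collections import Counter


def count_code_points(text, targets=(0x4E00, 0x2F00)):
    """Count how often each target code point occurs in the given text."""
    counts = Counter(text)
    return {
        f"U+{cp:04X} {unicodedata.name(chr(cp))}": counts[chr(cp)]
        for cp in targets
    }


# Hypothetical text content that preserves both characters.
preserved = "\u4E00 (U+4E00) and \u2F00 (U+2F00)"
# Hypothetical text content in which U+2F00 was replaced by U+4E00.
flattened = preserved.replace("\u2F00", "\u4E00")

print(count_code_points(preserved))
print(count_code_points(flattened))
```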
For those who have been interested in ISO/IEC 14496-28:2012 (Composite Font Representation), which standardizes an XML format for defining font objects (aka CFR objects) that can reference more than one font resource and thus break the 64K glyph barrier, I am pleased to let this blog’s readership know that it is now among ISO’s Freely Available Standards. I am particularly pleased about this news, mainly because some developers have indicated that purchasing the standard effectively served as a barrier to supporting it. Well, the barrier has been removed!
Note that this change makes a whole lot of sense, because two ISO standards that are closely tied to CFR, ISO/IEC 10646 (Universal Coded Character Set, aka Unicode) and ISO/IEC 14496-22 (Open Font Format), are already among these freely available standards.
Also note that there is no direct download URL for this or other freely available ISO standards, because one must first agree to the no-cost licensing terms by clicking a button.
Some people naïvely think that ISO/IEC 10646 and Unicode, which are joined at the hip, make the development of national standards an obsolete practice. As my IRG41 contribution, IRG N1964 (Continued National Standards Development & Horizontal Extensions), makes clear, nothing is further from the truth, especially when it involves CJK Unified Ideographs.
The content of this paper had been brewing in my head since IRG38, and only recently has congealed into a concise one-page paper that should be daunting to no one. If you are interested in such issues, please read the paper and provide feedback.
While the finishing touches are being put on Unicode Version 6.3, which will include the 1,002 Standardized Variants that I already mentioned, everything appears to be on track for Unicode Version 7.0, which will be in sync with ISO/IEC 10646:2014 (4th Edition).
Extension E, which adds 5,762 new CJK Unified Ideographs, is on track to be included in Version 7.0. This will bring the total number of CJK Unified Ideographs to a staggering 80,379 characters. I spent part of this morning preparing an updated version of my CJK Unified/Compatibility Ideographs table that provides a glimpse at Unicode Version 7.0.
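The total can be reproduced by adding the Extension E figure to the counts for the existing blocks. Note that the per-block counts in this sketch are not taken from this article. They are assumed values for the repertoire as of Unicode Version 6.3 and should be checked against the Code Charts before being relied upon.

```python
# ASSUMPTION: per-block counts as of Unicode Version 6.3 (not from this article).
blocks_6_3 = {
    "URO (U+4E00..U+9FCC)": 0x9FCC - 0x4E00 + 1,
    "Extension A": 6582,
    "Extension B": 42711,
    "Extension C": 4149,
    "Extension D": 222,
    "Unified ideographs in the CJK Compatibility Ideographs block": 12,
}
extension_e = 5762  # from the paragraph above

total = sum(blocks_6_3.values()) + extension_e
print(f"{total:,}")
```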
(Note that neither Unicode Version 7.0 nor ISO/IEC 10646:2014 has been released or published, meaning that implementers should keep this caveat in mind, hence the use of “glimpse” in the title of this article.)