Unless you have been living in a cave or under a rock, you’ve no doubt heard of Source Han Sans or Noto Sans CJK through the initial announcements from Adobe or Google, who jointly developed them, or elsewhere. These two Pan-CJK typeface families, which are joined at the hip because they differ only in name, were released to the world at large, as open source fonts, on the afternoon of July 15, 2014 in the US, which was the morning of July 16, 2014 in East Asia, their target audience. Click on the preview below to view a single-page PDF that shows all 65,535 glyphs from one of these fonts:
Over the next several months I plan to publish a series of articles on this blog that will detail various aspects of the development process that I employed for building these two typeface families. Although the subsequent articles will mention only Source Han Sans by name, they also pertain to its twin, Noto Sans CJK.
Although this article is not about CJK, its purpose is to describe how I was put onto the CJK path. I studied French in high school, but it was really my studies of Russian, courtesy of the United States Army, that eventually led me there. Immediately after graduating from high school in 1983, I entered US Army Basic Training at Fort Leonard Wood, Missouri, which was followed by Interrogator School at Fort Huachuca, Arizona. The third part of my training, which was associated with my MOS (Military Occupational Specialty), Interrogator, was to learn Russian.
Given that Unicode has declared mid-year annual major releases, we can expect Unicode Version 8.0 to be released in about a year, in mid-2015. In terms of ideographs, we can expect some additions, specifically a small number of UNC (Urgently Needed Character) additions to the URO (Unified Repertoire & Ordering) that were discussed in the June article, along with Extension E. This single-page PDF provides a tentative look at the CJK Unified Ideographs, along with CJK Compatibility Ideographs for good measure.
Unicode Version 7.0 was released on June 16, 2014.
One of the accomplishments at IRG #42 last month was the addition of 29 new CJK Unified Ideographs to the URO (Unified Repertoire & Ordering), specifically from U+9FCD through U+9FE9. The first four are shown above.
Twenty years ago this month, in May of 1994, I successfully defended my PhD dissertation, entitled Prescriptive Kanji Simplification, which concluded my graduate studies at The University of Wisconsin-Madison’s Department of Linguistics. Madison is located approximately 20 miles from where I grew up (Mount Horeb, Wisconsin).
The Moji_Joho IVD Collection was first introduced via PRI 259 last December, which initiated a mandatory—according to UTS #37—90-day Public Review Period. The submitter received three sets of comments, and after making minor changes, submitted the materials for registering the new IVD collection, along with its initial set of IVSes. The Moji_Joho IVD Collection and its initial set of IVSes were officially registered on May 16, 2014, which represents the fourth version of the IVD (Ideographic Variation Database).
The 2014-05-16 version of the IVD thus registers a new IVD collection, Moji_Joho, along with its initial set of 10,710 IVSes, 9,685 of which are shared—through mutual agreement—with the registered Hanyo-Denshi IVD Collection. Some enhancements were also made to the IVD_Stats.txt file, specifically that the shared IVSes are explicitly listed at the end of the file.
One additional statistic is that the highest VS (Variation Selector) used is currently VS48 (U+E011F), meaning that 32 of the 240 VSes allocated for IVS use are now being used. Of course, it is relatively easy to figure out with which BC (Base Character) VS48 is used, and an educated guess would be that it is either U+9089 (邉) or U+908A (邊). It is the former:
9089 E011F; Moji_Joho; MJ026193
The highest VS used with the latter BC is currently VS37 (U+E0114).
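The relationship between a VS number and its code point is simple arithmetic, because the 240 VSes allocated for IVS use, VS17 through VS256, occupy U+E0100 through U+E01EF. What follows is a minimal Python sketch, not part of the IVD itself, that splits a record like the one quoted above and derives the VS number. The function names are hypothetical, and the semicolon-separated layout is assumed from that single quoted record.

```python
# Sketch: parse one IVD-style record and derive the VS number.
# Assumption: fields are separated by semicolons, as in the record quoted above.

VS17 = 0xE0100   # first variation selector allocated for IVS use
VS256 = 0xE01EF  # last one (VS17 through VS256 = 240 selectors)


def parse_ivs_record(line):
    """Return (base, selector, collection, identifier) for one record."""
    sequence, collection, identifier = [field.strip() for field in line.split(";")]
    base_hex, selector_hex = sequence.split()
    return int(base_hex, 16), int(selector_hex, 16), collection, identifier


def vs_number(selector):
    """Map a selector in U+E0100..U+E01EF to its VS number (17..256)."""
    if not VS17 <= selector <= VS256:
        raise ValueError("not an IVS variation selector: U+%04X" % selector)
    return selector - VS17 + 17


base, selector, collection, identifier = parse_ivs_record(
    "9089 E011F; Moji_Joho; MJ026193"
)
print("U+%04X" % base, "VS%d" % vs_number(selector), collection, identifier)
```

Counting from VS17 also explains the figure of 32: VS17 through VS48 inclusive is 48 - 17 + 1 = 32 selectors.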
As the IVD Registrar, I’d like to use this opportunity to thank everyone who made the effort to review PRI 259. I’d also like to congratulate those who prepared the Moji_Joho IVD Collection for both review and registration.
Shown above is the top portion of the printed version of Taiwan’s MOE 國字標準字體方體母稿 (Fangti) standard. (For those who are interested, its ISBN is 957-00-8392-1.) What is provided online are effectively scans of the 常用字 and 次常用字 sections, which contain 4,808 and 6,343 hanzi, respectively. Although included in the printed version of the standard, the 罕用字 section, which contains 1,907 additional hanzi, is not provided online. In terms of sheer numbers, these 1,907 additional hanzi appear to completely cover Big Five (both levels) and CNS 11643 Planes 1 and 2.
The purpose of today’s article is to describe two additional issues in this glyph standard that my new friend, Kuang-che Wu (吳光哲) of Google, recently found.
What is shown above is a trivial difference in a two-component structure that is present in many ideographs, such as 滕 (U+6ED5), 縢 (U+7E22), 螣 (U+87A3), 謄 (U+8B04), and 騰 (U+9A30). This difference is, of course, unifiable. What this article is about is consistency within a standard, mainly referring to the source standards from each region. The focus of this article is on the forms used in ROC (Republic of China; 中華民國 Zhōnghuá Mínguó), which is more commonly referred to as Taiwan (臺灣 or 台灣 Táiwān).
Recent work has led me to more closely explore U+4548 (☞䕈☜), which is in CJK Unified Ideographs Extension A. (What is shown in parentheses in the previous sentence is likely to be different than what is shown in the excerpt above.)
The image above is an excerpt from the latest Extension A Code Charts. At first glance, everything seems normal. The differences between the G (China) and T (Taiwan) glyphs are expected, and perhaps more importantly, unifiable.
In the previous article I mentioned that 85 kanji that correspond to JIS X 0213:2004 currently have kIRG_JSource JA source references, but I made no mention of possible glyph differences between what is shown in the Code Charts and JIS X 0213:2004. I found at least seven kanji, among these 85, that have significant glyph differences between these two Japanese sources. I prepared this table that shows these glyph differences, by using excerpts from the Extension A code charts for the kIRG_JSource JA glyphs and Heisei Mincho W3 (平成明朝W3) for the JIS X 0213:2004 glyphs.
To continue yesterday’s article about different prototypical glyphs for Unicode code points that are common between JIS X 0212-1990 and JIS X 0213:2004, today’s article will focus on the normative references that correspond to JIS X 0213:2004, or rather the lack thereof.
Most Japanese font developers are—perhaps painfully—aware of the 168 kanji whose prototypical glyphs changed in 2004 via the JIS X 0213:2004 standard. What is not broadly known are those kanji whose prototypical glyphs are different between JIS X 0212-1990 and JIS X 0213 (both versions).
JIS X 0212-1990 was established in 1990, and included 5,801 kanji in a single block. JIS X 0213:2000 was established a full ten years later, and included 3,685 kanji in two levels (1,249 kanji in Level 3, and 2,436 in Level 4). Ten additional kanji were added in JIS X 0213:2004, bringing the total to 3,695. When the Unicode code points that correspond to these two JIS standards are compared, 2,743 of them are common, 3,058 are specific to JIS X 0212-1990, and 952 are specific to JIS X 0213:2004.
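The figures above are mutually consistent, which a few lines of Python can confirm. This is only a sketch of the comparison: the counts come straight from the paragraph above, while the compare_repertoires function and the two tiny code point sets are hypothetical stand-ins for the real mapping data, which is not reproduced here.

```python
def compare_repertoires(a, b):
    """Return (common, only_in_a, only_in_b) for two sets of code points."""
    return a & b, a - b, b - a


# Toy data only; the real comparison would use the full Unicode mappings.
toy_0212 = {0x4E02, 0x4E04, 0x4E05}
toy_0213 = {0x4E04, 0x4E05, 0x4E28}
common, only_0212, only_0213 = compare_repertoires(toy_0212, toy_0213)
print(len(common), len(only_0212), len(only_0213))

# Cross-check of the counts stated in the text.
JIS_X_0212_TOTAL = 5801
JIS_X_0213_2000_TOTAL = 1249 + 2436            # Level 3 + Level 4
JIS_X_0213_2004_TOTAL = JIS_X_0213_2000_TOTAL + 10
COMMON = 2743

print(JIS_X_0213_2000_TOTAL)                   # total for the 2000 version
print(JIS_X_0212_TOTAL - COMMON)               # specific to JIS X 0212-1990
print(JIS_X_0213_2004_TOTAL - COMMON)          # specific to JIS X 0213:2004
```

The three values printed by the cross-check should match the 3,685, 3,058, and 952 given above.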
Interestingly, when the prototypical glyphs of the 2,743 kanji that are in common—in terms of having a shared Unicode code point—are compared, 30 of them are different. I prepared a single-page table that shows the differences using genuine Heisei Mincho W3 (平成明朝W3) glyphs, which also provides Adobe-Japan1-6 CIDs for all but three of the JIS X 0212-1990 prototypical glyphs (these three glyphs are thus candidates for Adobe-Japan1-7). Also, all of the JIS X 0213 kanji are from the original 2000 version, except for the one that corresponds to U+7626 that was introduced in 2004. This character’s entry is shaded in the PDF.
It’s hard to imagine that it has been nearly three years since I posted the always-enjoyable Unicode Beyond-BMP Top Ten List, so I figured that an updated version, which takes into account developments that have transpired since then, was in order for the current year of 2014.
Although today is April 1st, this is actually a brief non-joke article. Honestly and truly. (However, I cannot say the same about Toshiya SUZUKI’s WG2 N4572. ☺)
The background is that during my last visit to Japan, which was mainly to attend IRG #41 in Tokyo during the latter half of November of 2013, Kunihiko OKANO (岡野邦彦) requested an Adobe-Japan1-6 version of Adobe Blank during a dinner at a restaurant called かつ吉. The purpose of such a font is to serve as a template for font development purposes, meaning that its structure—in terms of ‘sfnt’ tables, FDArray elements, and number of glyphs (CIDs 0 through 23057)—is identical to a genuine Adobe-Japan1-6 font, but that all of its functional glyphs are non-spacing and blank, like Adobe Blank.
I am pleased to announce that the Adobe-Japan1-6 version of Adobe Blank, called Adobe Blank AJ16, is now available in the Downloads section of the open source project, specifically in the AJ16 subdirectory. Of course, this font is not intended to be installed and used in applications, but rather to be opened or inspected by font development tools.
Okano-san also requested Adobe-Japan1-3, Adobe-Japan1-4, and kana subset versions, which will soon be added to the “Adobe Blank OpenType Font” open source project.
As the title makes blatantly obvious, today we will cover a topic about China (中华人民共和国 Zhōnghuá Rénmín Gònghéguó).
For those who are not aware, there are twelve IDCs (Ideographic Description Characters) in Unicode, from U+2FF0 through U+2FFB, that are used in IDSes (Ideographic Description Sequences), which are intended to visually describe the structure of ideographs by enumerating their components and arrangement in a hierarchical fashion. Any Unicode character can serve as an IDS component, and the IDCs describe their arrangement. The IRG uses IDSes as a way to detect potentially duplicate characters in new submissions. All existing CJK Unified Ideographs have an IDS, and new submissions require an IDS.
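To make the hierarchical structure concrete, here is a minimal Python sketch that parses an IDS into a nested tuple. It assumes prefix notation, in which each IDC is followed by its operands: U+2FF2 and U+2FF3 take three operands, and the other ten IDCs in the U+2FF0 through U+2FFB range take two. The parse_ids function is a hypothetical helper written for illustration, not something the IRG uses.

```python
# The twelve IDCs from U+2FF0 through U+2FFB, grouped by operand count.
BINARY_IDCS = set("\u2FF0\u2FF1\u2FF4\u2FF5\u2FF6\u2FF7\u2FF8\u2FF9\u2FFA\u2FFB")
TERNARY_IDCS = set("\u2FF2\u2FF3")


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple.

    Raises ValueError if operands are missing or characters are left over.
    """
    def parse(pos):
        if pos >= len(ids):
            raise ValueError("IDS ends before all operands are supplied")
        ch = ids[pos]
        if ch in BINARY_IDCS:
            arity = 2
        elif ch in TERNARY_IDCS:
            arity = 3
        else:
            return ch, pos + 1          # any other character is a component
        operands = []
        pos += 1
        for _ in range(arity):
            node, pos = parse(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree


# 海 (U+6D77) can be described as 氵 to the left of 每.
print(parse_ids("⿰氵每"))
```

Because an operand can itself begin with an IDC, the same function handles nested sequences, and two submissions whose sequences parse to the same tree are natural candidates for a closer duplicate check.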
This article describes a technique that uses IDSes combined with OpenType functionality to pseudo-encode glyphs that are unencoded or not yet encoded. If memory serves, it was Taichi KAWABATA (川幡太一) who originally suggested this technique.
I was recently asked, indirectly via Twitter, about changes and additions that were made to our JIS2004-savvy CMap resources, specifically UniJIS2004-UTF32-H and UniJISX02132004-UTF32-H. The former also includes UTF-8 (UniJIS2004-UTF8-H) and UTF-16 (UniJIS2004-UTF16-H) versions that are kept in sync with the master UTF-32 version by being automagically generated by the CMap resource compiler (and decompiler), cmap-tool.pl, which I developed years ago.
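The reason the UTF-8 and UTF-16 versions can be generated mechanically is that all three encoding forms are deterministic functions of the same code point. The following Python sketch illustrates that relationship only; it is not cmap-tool.pl and says nothing about how that tool is implemented, and the encoded_forms name is hypothetical.

```python
def encoded_forms(code_point):
    """Return the big-endian UTF-32, UTF-16, and UTF-8 forms as hex strings."""
    ch = chr(code_point)
    return {
        "UTF-32": ch.encode("utf-32-be").hex().upper(),
        "UTF-16": ch.encode("utf-16-be").hex().upper(),
        "UTF-8": ch.encode("utf-8").hex().upper(),
    }


# One BMP code point and one beyond-BMP code point.
for cp in (0x4E00, 0x20BB7):
    print("U+%04X" % cp, encoded_forms(cp))
```

For U+20BB7 the UTF-16 form is the surrogate pair D842 DFB7 and the UTF-8 form is the four bytes F0 A0 AE B7, which shows why only the UTF-32 master needs to be maintained by hand.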
Of course, all of these CMap resources also have vertical versions that use a “V” at the end of their names in lieu of the “H,” but in the context of OpenType font development, the vertical CMap resources are virtually unused and worthless because it is considered much better practice to explicitly define a ‘vert’ GSUB feature for handling vertical substitution. In the absence of an explicit definition, the AFDKO makeotf tool will synthesize a ‘vert’ GSUB feature by using the corresponding vertical CMap resources.
With all that being said, what follows in this article is a complete history of these two CMap resources, which also assigns dates, and sometimes notes, to each version.
For readers of this blog who engage in social media, please consider following @CJKType on Twitter. Of course, one of the purposes of @CJKType will be to announce new CJK Type Blog articles, but it is also meant to engage the user and developer community with regard to CJK- and font-related issues, and will also announce or retweet other items of interest to these groups.
As described in last month’s article, our tools engineer developed two Python scripts for assembling and disassembling ‘sfnt’ collections, both of which operate on TrueType-based source fonts to produce a traditional TrueType Collection (TTC) font or to break apart one, but also operate on CFF-based source fonts to produce a new font species known as an OpenType Collection (OTC).
The purpose of this follow-up article is to convey the news that these scripts have been tweaked slightly, and have been included in a new version of AFDKO that was released on 2014-02-18 as Build 61250. One of the benefits of the integration with AFDKO is that the tools are now easier to run, as a simple command.
I would like to use this opportunity to introduce two new things.
First, OpenType Collections. TrueType Collections have been around for many years, and are commonplace for OS-bundled fonts. What I am speaking of are ‘sfnt’ Collections that include a ‘CFF ’ (PostScript charstrings) table rather than a ‘glyf’ (TrueType charstrings) one. The advantage of an ‘sfnt’ Collection is that fonts that differ in minor ways can be combined into a single resource, which can provide substantial size savings.
Second, brand new AFDKO tools, in the form of two Python scripts, for building, breaking apart, and displaying a synopsis of an OTC’s tables. These scripts were developed by our incredibly talented font tools engineer, Read Roberts, so all thanks should go to him for preparing them.
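For readers who want to see what distinguishes the two kinds of collection at the byte level, here is a small Python sketch that reads a collection header and reports the flavor of each member font. It is not one of the AFDKO scripts mentioned above. It assumes the header layout from the OpenType specification: a ‘ttcf’ tag, major and minor version numbers, a font count, and then one 32-bit offset per font, with each offset pointing at an sfnt version of ‘OTTO’ for CFF-based fonts or 0x00010000 for TrueType-based ones. The collection_flavors name is hypothetical.

```python
import struct


def collection_flavors(data):
    """Return 'CFF', 'TrueType', or 'unknown' for each font in a collection."""
    tag, _major, _minor, num_fonts = struct.unpack_from(">4sHHI", data, 0)
    if tag != b"ttcf":
        raise ValueError("not an 'sfnt' collection (missing 'ttcf' tag)")
    # The per-font offsets follow the 12-byte fixed part of the header.
    offsets = struct.unpack_from(">%dI" % num_fonts, data, 12)
    flavors = []
    for offset in offsets:
        sfnt_version = data[offset:offset + 4]
        if sfnt_version == b"OTTO":
            flavors.append("CFF")
        elif sfnt_version in (b"\x00\x01\x00\x00", b"true"):
            flavors.append("TrueType")
        else:
            flavors.append("unknown")
    return flavors


# Synthetic two-font collection: just enough bytes to exercise the parser.
header = struct.pack(">4sHHI", b"ttcf", 1, 0, 2) + struct.pack(">2I", 20, 24)
sample = header + b"OTTO" + b"\x00\x01\x00\x00"
print(collection_flavors(sample))
```

With a real file, the same function could be called on the bytes read from disk, for example collection_flavors(open(path, "rb").read()), where path is whatever collection is at hand.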