For those who missed the memo, Unicode Version 9.0 was released on June 21, 2016, and added exactly 7,500 characters to the standard. Unicode now includes a total of 128,172 characters, which is 2,900 characters short of filling two full 256×256 planes.
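For readers who want to check the arithmetic, here is a minimal sketch. It uses only the total quoted above; the variable names are mine and purely illustrative.

```python
# Back-of-the-envelope check of the character total quoted above.
PLANE_SIZE = 256 * 256      # code points in one Unicode plane (65,536)
TOTAL_IN_9_0 = 128_172      # total characters as of Version 9.0, per the text

two_planes = 2 * PLANE_SIZE
shortfall = two_planes - TOTAL_IN_9_0

print(f"two full planes: {two_planes:,} code points")
print(f"shortfall:       {shortfall:,} characters")
```

The shortfall works out to 2,900, which is where the "just shy of 3,000" characterization comes from.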
While Version 9.0 does not add any new CJK Unified Ideographs, I used this opportunity to enhance my single-page CJK Unified/Compatibility Ideographs document to better track unassigned code points for the relevant blocks and planes. The image at the top of this article shows the first half of the document, and if you click on it, you’ll access the original PDF file that can be squirreled away for reference purposes.
I also used this opportunity to update my tentative Unicode Version 10.0 document in the same way.
As usual, enjoy!
This will be a short, sweet, and to-the-point article. Sorry, no graphics nor photos.
When developing name-keyed fonts, glyph names matter. They matter a lot. When developing new fonts, the glyph names should either be explicitly listed in AGLFN (Adobe Glyph List For New Fonts) or derivable via the AGL Specification. Glyph names that adhere to AGLFN or the AGL Specification result in fonts with well-formed 'cmap' tables, which means that their glyphs will behave better in a broader range of environments. I cannot stress the importance of this enough.
CIDs (Character IDs), on the other hand, represent a completely different beast. If a font is genuinely CID-keyed, it means that there are absolutely no glyph names, regardless of whether the source font or fonts that were used to build the CID-keyed font were name-keyed. Once a font resource becomes CID-keyed, the original glyph names are literally jettisoned, and the only way in which to map Unicode values to glyphs is via the 'cmap' table, which is usually done using a UTF-32 CMap resource. In other words, when developing fonts that are intended to be deployed in a CID-keyed fashion, the source glyph names play absolutely no role in how such fonts are processed.
One of my more popular open source fonts is Adobe Blank, and to a lesser extent the related Adobe Blank 2, which is less widely used because it relies on a 'cmap' table format, Format 13, that is not broadly supported. Actually, Adobe Blank provides absolutely nothing, because it maps all 1,111,998 Unicode code points to a range of 2,048 non-spacing and non-marking glyphs, yet such a font is useful for particular scenarios, such as addressing the FOUT (Flash Of Unstyled Text) problem.
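The 1,111,998 figure can be reproduced with a short sketch. It rests on an assumption that is consistent with the number but not spelled out above: that the count is the full 17-plane code space minus the 2,048 surrogates and the 66 noncharacters. The helper names are illustrative only.

```python
# Count the code points that remain after excluding surrogates and
# noncharacters (assumed here to be what the 1,111,998 figure covers).

def is_surrogate(cp: int) -> bool:
    """U+D800 through U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    """U+FDD0 through U+FDEF, plus the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

CODE_SPACE = 17 * 0x10000   # 17 planes of 65,536 code points each

mappable = sum(
    1
    for cp in range(CODE_SPACE)
    if not is_surrogate(cp) and not is_noncharacter(cp)
)

print(f"{mappable:,}")
```

Equivalently, 1,114,112 − 2,048 − 66 = 1,111,998.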
Allow me to introduce Adobe NotDef, which is modeled after Adobe Blank in that it covers all of Unicode and maps to a range of 2,048 glyphs, but differs in that the functional glyphs are spacing and marking. The original suggestion for Adobe NotDef came from Dave Crossland. The glyphs match the shape and advance width of the standard Adobe .notdef glyph that is invoked in environments that do not support font fallback when the selected font does not include a glyph for a particular character, and as Dave wrote, Adobe NotDef is useful for font fallback purposes in that it can be used to prevent the display of non-standard .notdef glyphs that may be present in some fonts in the font fallback chain.
The Unicode Consortium celebrated its 25th anniversary in January of this year. The photo above is the celebratory (U+1F955 CARROT; a Unicode Version 9.0 candidate) cake that was enjoyed during the UTC (Unicode Technical Committee) #146 meeting that was hosted by IBM in San José from January 25th through 28th, 2016.
Plane 2, the SIP (Supplementary Ideographic Plane), is almost full.
Right off the bat, in Unicode Version 3.1 (March of 2001), Extension B filled it nearly two-thirds of the way with its 42,711 characters, along with 542 CJK Compatibility Ideographs. Extension C with 4,149 characters was added in Version 5.2 (October of 2009), Extension D with a mere 222 characters was added in Version 6.0 (October of 2010), and Extension E with 5,762 characters was added in Version 8.0 (June of 2015). On tap for Unicode Version 10, scheduled for a June of 2017 release, is Extension F that currently includes 7,473 characters (U+2CEB0 through U+2EBE0).
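To make "almost full" concrete, the following sketch tallies the figures quoted above against the raw size of a plane. It is only a rough illustration: it counts characters rather than allocated blocks, and it ignores block padding and the two noncharacters at the end of the plane.

```python
# Running tally of Plane 2 characters, using the counts quoted in the text.
PLANE_SIZE = 0x10000  # 65,536 code points, ignoring noncharacters

additions = [
    ("Extension B (3.1)", 42_711),
    ("CJK Compatibility Ideographs (3.1)", 542),
    ("Extension C (5.2)", 4_149),
    ("Extension D (6.0)", 222),
    ("Extension E (8.0)", 5_762),
    ("Extension F (10.0, tentative)", 7_473),
]

# Extension F's size also follows from its quoted code point range.
ext_f_size = 0x2EBE0 - 0x2CEB0 + 1

total = 0
for name, count in additions:
    total += count
    print(f"{name:36} {count:>6,}  running total {total:>6,} "
          f"({total / PLANE_SIZE:.1%} of the plane)")
```

With Extension F included, the running total reaches 60,859 characters, or roughly 93% of the plane.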
Actually, we do.
As pointed out in Matthew Rechs’ recent and excellent Typekit Blog article about Unicode’s Adopt a Character campaign, these badges were designed by the very talented Jake Giltsoff of the Typekit team at Adobe. Mine for U+1F421 🐡 BLOWFISH is shown above.
It seems that I am on a roll, having released two new open source fonts on GitHub within the past week. The previous (and brief) article about the LOCL Test OpenType/CFF font simply pointed to the repository. This article will be longer. I promise.
Inspired by the font that I prepared for and referenced in the previous article, I decided to launch a dedicated open source project for this useful test font, LOCL Test.
Although this article shares its title with an article from four years ago that was about the excitement associated with attending ATypI Hong Kong 2012, this particular one will focus on efforts to properly support Hong Kong SAR (aka HK or Hong Kong) in the Adobe-branded Source Han Sans and Google-branded Noto Sans CJK typeface families, but also in infrastructure, such as OSes and apps.
In other words, this article is not about traveling to Hong Kong, but rather about properly supporting Hong Kong in OSes, apps, and fonts.
A peculiar series of events that took place on April 1st (no joke) and 2nd of this year led to the discovery of what can only be described as somewhat of a revelation: A small number of CJK Compatibility Ideographs are necessary for China. This is important, because I made the following statement on page 168 of CJKV Information Processing, Second Edition:
—Humans make mistakes—
—Anything made by humans has the potential to include mistakes—
The most important things about mistakes are that 1) we recognize them, lest they propagate; 2) we learn from them; 3) we make an effort not to repeat them; and 4) we try to fix them, if possible.
Some mistakes are more easily fixed than others. Mistakes that cannot be fixed must be worked around.
With that said, an interesting event of historical significance occurred in June of 2000:
The first version of the IVD (Ideographic Variation Database) was issued on 2007-12-14, meaning over eight years ago, and there have been three subsequent revisions, the latest being issued on 2014-05-16. There are currently three registered IVD collections: Adobe-Japan1, Hanyo-Denshi, and Moji_Joho. A significant number of IVSes are shared between the latter two IVD collections, 9,685 to be exact. While I cannot speak to the latter two IVD collections, the Adobe-Japan1 one is supported by hundreds of OpenType fonts via the Format 14 (Unicode Variation Sequences) 'cmap' subtable. Furthermore, the number of apps and OSes that support UVSes has reached critical mass.
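For readers unfamiliar with what an IVS looks like in plain text, here is a minimal sketch. An IVS is a base ideograph followed by one of the variation selectors in the range U+E0100 through U+E01EF (VS17 through VS256). The base character and selector used below are illustrative only; which glyph a given sequence selects depends on the registered collection and on the font, and the function names are mine.

```python
# Build and inspect an Ideographic Variation Sequence (IVS) as plain text.
VS17 = 0xE0100    # first variation selector used for IVSes
VS256 = 0xE01EF   # last variation selector used for IVSes

def is_ivs_selector(cp: int) -> bool:
    """True if cp is one of the selectors VS17 through VS256."""
    return VS17 <= cp <= VS256

def make_ivs(base: int, selector: int) -> str:
    """Return the two-code-point sequence <base, selector>."""
    if not is_ivs_selector(selector):
        raise ValueError(f"U+{selector:04X} is not in the VS17..VS256 range")
    return chr(base) + chr(selector)

def describe(text: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Illustrative pair: the ideograph U+845B followed by VS17.
sample = make_ivs(0x845B, 0xE0100)
print(describe(sample))
```

An environment that does not support UVSes simply ignores the selector and shows the default glyph for the base character, which is why the sequence remains safe in plain text.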
With all that said, there is a rather substantial missing link in terms of IVD support infrastructure: the all-important input method.
The next UTC (Unicode Technical Committee) meeting, the 147th one, takes place during the week of May 9th, and will be hosted at the Adobe headquarters in San José, California. All members of the Unicode Consortium, especially voting members, are encouraged to attend.
Much of the thinking that I did with regard to this unregistered—but hopefully soon-to-be-registered—IVD (Ideographic Variation Database) collection was done while visiting my parents in South Dakota, with one of the highlights of that trip being a scenic drive through Badlands National Park.
First and foremost, please forget, or at least ignore, most everything that was written in the 2016-02-13 and 2016-02-20 articles (which makes one wonder why I am linking to them, but I digress). Far too many things have changed, and what I present in this article represents the IVD collection that I hope will be registered later this year.
Continuing where I left off with the first article about this subject, I’d like to point out some of the implementation details and their ramifications in this article.
One of my longer term goals for the open source Source Han Sans project has been to eventually register a Pan-CJK IVD (Ideographic Variation Database) collection that would allow the regional variants to display and be preserved in “plain text” environments, and I think that I may have achieved a breakthrough the other day.
CJK Unified Ideographs is a very deep and fascinating subject, and there are people who sometimes, and anecdotally, claim that it is a bottomless pit (or infinite tunnel, hence the quote in this article’s title). The latter may appear to be accurate, especially when one considers what is happening on that front, such as Extension F with its 7,473 new characters slated for Unicode Version 10.0 in June of 2017, and with work on Extension G commencing.
Although I am a couple of days late, I’d like to use this opportunity to welcome the year of the monkey, and to wish a happy Chinese New Year to all of my Chinese friends, colleagues, and blog readers. May this year be safe, prosperous, and enjoyable.
One of the fringe benefits of moving offices (especially when one has accumulated nearly 25 years of font-related material, which makes it far from a pain-free exercise) is discovering historical documents, some of which turn out to be true gems. Our team is preparing to move from the Adobe East Tower to the West one, and part of the process is figuring out which material to keep, and which to put into File 13. Anyway, I had been recently looking for a particular presentation that I prepared many years ago, and was fortunate enough to come across it while sifting through my accumulated materials.
It is with great sadness that I write that Unicode Version 9.0, whose beta was authorized yesterday, on the last day of UTC #146, will include no additional CJK Unified Ideographs. The next opportunity for additional CJK Unified Ideographs is therefore Unicode Version 10.0, which is slated for a June 2017 release, and is expected to include 21 Urgently Needed Characters (UNCs) that are appended to the URO (Unified Repertoire & Ordering), along with Extension F (see IRG N2156 for the latest version) that currently includes 7,473 characters.
Interestingly, and as long as Extension F’s block remains stable, there are only 3,088 code points remaining in Plane 2 (SIP), specifically U+2EBF0 through U+2F7FF, along with 1,502 code points at the end of Plane 2, immediately following CJK Compatibility Ideographs Supplement, specifically U+2FA20 through U+2FFFD.
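The two counts follow directly from the quoted ranges, as this small sketch shows. The `span` helper is a name of my own choosing.

```python
# Size of the two unallocated ranges in Plane 2 quoted above.

def span(first: int, last: int) -> int:
    """Number of code points in the inclusive range first..last."""
    return last - first + 1

gap_after_ext_f = span(0x2EBF0, 0x2F7FF)   # between Extension F and U+2F800
gap_at_plane_end = span(0x2FA20, 0x2FFFD)  # after the Compatibility Supplement

print(f"U+2EBF0..U+2F7FF: {gap_after_ext_f:,}")
print(f"U+2FA20..U+2FFFD: {gap_at_plane_end:,}")
print(f"combined:         {gap_after_ext_f + gap_at_plane_end:,}")
```

Together the two ranges hold 4,590 code points, which is less than any single extension since Extension C, other than the tiny Extension D.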
The image above is an excerpt of a PDF that shows what Unicode Version 10.0 is likely to include in terms of ideographs. If you click on the image, you will get the actual PDF. Of course, the yellow stuff is tentative and subject to change.
Updated on 2016-06-26 to reflect the additional UNC appended to the URO at U+9FEA, along with a net decrease of 12 characters in Extension F, reducing it to 7,473 characters.