PDF authoring applications differ in how well they preserve the text content of the original document. This is not necessarily the fault of the application itself; rather, it stems from a disconnect between the PDF authoring process and access to the text content of the original document.
The best way to demonstrate this is to create a document that includes the two kanji 一 (U+4E00) and ⼀ (U+2F00). These two characters make a good example because in mainstream Japanese fonts, mainly those that are based on the Adobe-Japan1-x ROS, both map to the same glyph, specifically CID+1200.
If you download and unpack the 4E00vs2F00.zip file, you will find two PDF files, an Adobe InDesign file, and an MS Word file. If you open the original documents and search for 一 (U+4E00), you will find only a single instance, which is the one that is marked by the Unicode scalar value. However, if you open the respective PDF files, you will notice a difference. The one that is based on the MS Word file now includes two instances of 一 (U+4E00), and ⼀ (U+2F00) is no longer included in its content. You can search a PDF file by Unicode scalar value by using the “\uXXXX” notation, such as \u4E00 for U+4E00 (一). (Note: Depending on the version of MS Word that is being used, the PDF file may instead include two instances of ⼀ (U+2F00). I am using Microsoft Word for Mac 2011 Version 14.3.8.)
Adobe InDesign has a built-in PDF library that has direct access to the text content, and is thus able to inject it into the text layer of the PDF file that it produces. MS Word uses a different pathway for producing a PDF file, one that does not have access to the text content of the original document.
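The loss described above can be illustrated with a small sketch. It assumes, purely for illustration, that a PDF producer without access to the source text recovers Unicode values by inverting a character-to-glyph mapping. The table and function names are hypothetical, and this is not a description of how any particular application actually works; only the code points and CID 1200 come from the text above.

```python
# Hypothetical character-to-CID table for an Adobe-Japan1-based font.
# Both code points share a single glyph, as described above.
CHAR_TO_CID = {
    0x4E00: 1200,  # 一 (CJK Unified Ideograph)
    0x2F00: 1200,  # ⼀ (Kangxi Radical One)
}


def invert(char_to_cid):
    """Build a CID-to-character table; later entries overwrite earlier ones."""
    cid_to_char = {}
    for code_point, cid in char_to_cid.items():
        cid_to_char[cid] = code_point
    return cid_to_char


def round_trip(text, char_to_cid):
    """Map text to CIDs, then back to text using only the inverted table."""
    cid_to_char = invert(char_to_cid)
    cids = [char_to_cid[ord(ch)] for ch in text]
    return "".join(chr(cid_to_char[cid]) for cid in cids)


original = "\u4E00\u2F00"
recovered = round_trip(original, CHAR_TO_CID)
print([f"U+{ord(ch):04X}" for ch in recovered])
```

Because the inverted table can hold only one code point per CID, the two distinct characters collapse into one after the round trip. Which of the two survives depends on the order of the table, which is at least consistent with the observation that different versions of MS Word may yield either U+4E00 or U+2F00.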
For those who have been interested in ISO/IEC 14496-28:2012 (Composite Font Representation), which standardizes an XML format for defining font objects (aka CFR objects) that can reference more than one font resource and thus break the 64K glyph barrier, I am pleased to let this blog’s readership know that it is now among ISO’s Freely Available Standards. I am particularly pleased about this news, mainly because some developers have indicated that purchasing the standard effectively served as a barrier to supporting it. Well, the barrier has been removed!
Note that this change makes a whole lot of sense, because two ISO standards that are closely tied to CFR, ISO/IEC 10646 (Universal Coded Character Set, aka Unicode) and ISO/IEC 14496-22 (Open Font Format), are already among these freely available standards.
Also note that there is no direct download URL for this or other freely available ISO standards, because one must first agree to the no-cost licensing terms by clicking a button.
Some people naïvely think that ISO/IEC 10646 and Unicode, which are joined at the hip, make the development of national standards an obsolete practice. As my IRG41 contribution, IRG N1964 (Continued National Standards Development & Horizontal Extensions), makes clear, nothing is further from the truth, especially when it involves CJK Unified Ideographs.
The content of this paper had been brewing in my head since IRG38, and only recently has congealed into a concise one-page paper that should be daunting to no one. If you are interested in such issues, please read the paper and provide feedback.
For those who have read all or parts of my latest book, CJKV Information Processing (Second Edition), published by O’Reilly Media at the end of 2008, or who reference it when necessary, this article should be of interest. I did post all of these to my Twitter account on October 11, 2013, but I figured that it was also good to include them in a single article that is easily referenced.
In addition to writing the entire book, I also typeset the entire book, except for its cover. This included compiling the Index. This allowed me to inject a bit of humor into the book in the form of hidden messages. While this article is not a complete account of the easter eggs (イースター・エッグ in Japanese), some of the more memorable ones are chronicled here.
While the finishing touches are being put on Unicode Version 6.3, which will include the 1,002 Standardized Variants that I already mentioned, everything appears to be on track for Unicode Version 7.0, which will be in sync with ISO/IEC 10646:2014 (4th Edition).
Extension E, which adds 5,762 new CJK Unified Ideographs, is on track to be included in Version 7.0. This will bring the total number of CJK Unified Ideographs to a staggering 80,379 characters. I spent part of this morning preparing an updated version of my CJK Unified/Compatibility Ideographs table that provides a glimpse at Unicode Version 7.0.
(Note that neither Unicode Version 7.0 nor ISO/IEC 10646:2014 has been released or published, and implementers should keep this caveat in mind, hence the use of “glimpse” in the title of this article.)
As I described in Part 1, Part 2, and Part 3 of this series, Standardized Variants offer a Normalization-proof representation for the 1,002 CJK Compatibility Ideographs, which are encoded in the BMP, and at the end of Plane 2. These 1,002 Standardized Variants have been approved, and will be included in Unicode Version 6.3. They will, of course, also be included in ISO/IEC 10646.
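What “Normalization-proof” means in practice can be checked with Python’s standard unicodedata module. The sketch below uses U+FA10 as the example CJK Compatibility Ideograph. The pairing of U+585A with VARIATION SELECTOR-1 (U+FE00) is my assumption about the corresponding Standardized Variant and should be confirmed against StandardizedVariants.txt; the helper name is hypothetical.

```python
import unicodedata

# A CJK Compatibility Ideograph, which canonically decomposes to U+585A.
compat = "\uFA10"

# Assumed Standardized Variant: unified ideograph + VARIATION SELECTOR-1.
variant_seq = "\u585A\uFE00"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives_normalization(text):
    """Return True if text is unchanged under all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


print(survives_normalization(compat))       # compatibility ideograph
print(survives_normalization(variant_seq))  # variation sequence
```

The compatibility ideograph is replaced by its unified counterpart under every normalization form, so the distinction is lost. The variation sequence passes through unchanged, because the variation selector has no decomposition.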
In an effort to provide to font developers advance support for the Standardized Variants that correspond to glyphs in Adobe’s public ROSes, the next version of AFDKO will include a new version of the Adobe-Japan1_sequences.txt file that appends entries that correspond to 89 of these Standardized Variants, along with Adobe-CNS1_sequences.txt and Adobe-Korea1_sequences.txt files that specify 14 and 270 entries, respectively, that correspond to these Standardized Variants. If you click on the file names, you can download the files and use them immediately. These are used with the AFDKO makeotf tool, and specified as the argument of the “-ci” command-line option.
The Unicode Consortium announced the release of UTR #50, Unicode Vertical Text Layout, today, via Twitter and their blog. Although I was involved in this Unicode Technical Report to some extent, any congratulatory comments should be directed toward its original and current editors, Eric Muller and Koji ISHII (石井宏治), respectively.
Twenty years ago this month, in September of 1993, something remarkable happened in my life. My first book, entitled Understanding Japanese Information Processing (日本語情報処理), was published by O’Reilly Media (called O’Reilly & Associates back then). It had a very distinctive cover, which is shown with the two subsequent books:
In an effort to make sure that the infrastructure to support UTR #50 (Unicode Vertical Text Layout) will be in place, sooner rather than later, I spent a significant part of last week working with key people within Adobe, and at Microsoft and W3C, to put together a proposal for a new OpenType feature, to be tagged ‘vrtr’, for supporting this soon-to-be-published standard. Below is the full description that we came up with, and which was submitted for inclusion in the OpenType Specification and in OFF (ISO/IEC 14496-22 or Open Font Format):
Friendly name: Vertical Alternates For Rotation
Registered by: Adobe/Microsoft/W3C
Function: Transforms default glyphs into glyphs that are appropriate for sideways presentation in vertical writing mode. While the glyphs for most characters in East Asian writing systems remain upright when set in vertical writing mode, glyphs for other characters—such as those of other scripts or for particular Western-style punctuation—are expected to be presented sideways in vertical writing.
Example: As a first example, the glyphs for FULLWIDTH LESS-THAN SIGN (U+FF1C; “＜”) and FULLWIDTH GREATER-THAN SIGN (U+FF1E; “＞”) in a font with a non-square em-box are transformed into glyphs whose aspect ratio differs from the default glyphs, which are properly sized for sideways presentation in vertical writing mode. As a second example, the glyph for LEFT SQUARE BRACKET (U+005B; “[”) in a brush-script font that exhibits slightly rising horizontal strokes may use an obtuse angle for its upper-left corner when in horizontal writing mode, but an alternate glyph with an acute angle for that corner is supplied for vertical writing mode.
Recommended implementation: The font includes versions of the glyphs covered by this feature that, when rotated 90 degrees clockwise by the layout engine for sideways presentation in vertical writing, differ in some visual way from rotated versions of the default glyphs, such as by shifting or shape. The vrtr feature maps the default glyphs to the corresponding to-be-rotated glyphs (GSUB lookup type 1).
Application interface: For GIDs found in the vrtr coverage table, the layout engine passes GIDs to the feature, then gets back new GIDs.
UI suggestion: This feature should be active by default for sideways runs in vertical writing mode.
Script/language sensitivity: Applies to any script when set in vertical writing mode.
Feature interaction: The vrtr and vert features are intended to be used in conjunction: vrtr for glyphs intended to be presented sideways in vertical writing, and vert for glyphs to be presented upright. Since they must never be activated simultaneously for a given glyph, there should be no interaction between the two features. These features are intended for layout engines that graphically rotate glyphs for sideways runs in vertical writing mode, such as those conforming to UTR#50. (Layout engines that instead depend on the font to supply pre-rotated glyphs for all sideways glyphs should use the vrt2 feature in lieu of vrtr and vert.) Because vrt2 supplies pre-rotated glyphs, the vrtr feature should never be used with vrt2, but may be used in addition to any other feature.
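The application interface and feature interaction described above can be sketched as follows. This is a minimal illustration, assuming that a single substitution (GSUB lookup type 1) can be modeled as a dictionary from input GIDs to output GIDs. All GID values, table names, and function names are hypothetical, and a real layout engine would read this data from the font’s GSUB table.

```python
# Hypothetical single-substitution tables: default GID -> alternate GID.
VRTR_SUBSTITUTIONS = {120: 901, 121: 902}  # alternates meant to be rotated
VERT_SUBSTITUTIONS = {300: 950}            # alternates meant to stay upright


def apply_single_substitution(gids, substitutions):
    """Replace each covered GID; GIDs outside the coverage pass through."""
    return [substitutions.get(gid, gid) for gid in gids]


def shape_vertical_run(gids, sideways, vrtr, vert):
    """Apply vrtr to sideways runs and vert to upright runs, never both."""
    table = vrtr if sideways else vert
    return apply_single_substitution(gids, table)


sideways_run = shape_vertical_run(
    [10, 120, 121], True, VRTR_SUBSTITUTIONS, VERT_SUBSTITUTIONS
)
upright_run = shape_vertical_run(
    [10, 120, 300], False, VRTR_SUBSTITUTIONS, VERT_SUBSTITUTIONS
)
print(sideways_run, upright_run)
```

Choosing exactly one table per run mirrors the requirement that the two features never be activated simultaneously for a given glyph. The rotation itself is left to the layout engine and is not modeled here.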
I will be attending IRG41 in November, meaning that I will be in Tokyo for the latter half of November. Airline tickets have been purchased, and hotel reservations have been made. This is made possible because Adobe is a Full Member of The Unicode Consortium. I enjoy every visit to Japan, meaning that I am looking forward to spending almost two weeks there.
One of my goals during this trip is to enjoy at least one lunch at ガンジー (静岡県焼津市), and one dinner at とんき (東京都目黒区).
Seriously, the primary focus at IRG41 clearly will be on Extension F, which is now in full progress.
UTC (Unicode Technical Committee) Meeting #136 took place last week, and one of the significant outcomes was that UTR (Unicode Technical Report) #50 was advanced from Draft to Approved status. Congratulations to Koji ISHII (石井宏治), its editor, and also to Eric Muller, who took the initiative to start this project and served as its first editor.
I have advocated the use of the special-purpose and language-neutral Adobe-Identity-0 ROS over the past few years, and have developed several CID-keyed fonts that take advantage of this ROS, but keep in mind that its use can act like a double-edged sword.
On one hand, it provides font developers with great flexibility, in terms of the glyph complement of a font. In other words, font developers need not be restricted to one of our public CJK ROSes, such as Adobe-Japan1-6, or a subset thereof. Kazuraki is an example of a Japanese font whose glyph set requirements didn’t fit Adobe-Japan1-6, so the Adobe-Identity-0 ROS was used.
On the other hand, font developers need to develop all of the necessary resources, such as the UTF-32 CMap Resource that is used as the basis of the ‘cmap’ table, which maps Unicode code points to glyphs in the font, along with any GSUB features. In addition, and because the Adobe-Identity-0 ROS is language-neutral in that its designation does not specify or suggest a primary language, some applications may incorrectly assign a primary language to such fonts. This, of course, is due to heuristics (発見的教授法 in Japanese), or more specifically, their failure.
Unicode has become the de facto way in which to represent text in digital form, and for good reason: its character set covers the vast majority of the world’s scripts. Other benefits of Unicode include the following:
- That it is under active and continuous development, meaning that with each new version, more scripts are being supported, and additional characters for existing scripts are being standardized.
- That it is aligned and kept in sync with ISO/IEC 10646 (available at no charge), which is quite a feat.
With regard to font development, Unicode is considered the default encoding for OpenType, as expressed in the ‘cmap’ table. The most common ‘cmap’ subtables are Formats 4 (BMP-only UTF-16) and 12 (UTF-32). The latter is used only when mappings outside of the BMP (Basic Multilingual Plane), meaning from one or more of the 16 Supplementary Planes, are used.
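The rule for when a Format 12 subtable becomes necessary can be expressed as a simple check on the code points that a font maps. The function names below are hypothetical, and the sketch only classifies code points; it does not build an actual ‘cmap’ subtable.

```python
MAX_BMP = 0xFFFF
MAX_CODE_POINT = 0x10FFFF


def plane_of(code_point):
    """Return the plane number: 0 for the BMP, 1 through 16 otherwise."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    return code_point >> 16


def needs_format_12(code_points):
    """True if any mapping lies outside the BMP, so Format 4 cannot hold it."""
    return any(plane_of(cp) > 0 for cp in code_points)


bmp_only = [0x0041, 0x4E00]
with_plane_2 = [0x0041, 0x4E00, 0x20BB7]
print(needs_format_12(bmp_only), needs_format_12(with_plane_2))
```

Here U+20BB7 lies in Plane 2, so a font that maps it needs a Format 12 subtable, whereas a font that maps only the first list can rely on Format 4 alone.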
[I’d like to preface this article by stating that it was written and contributed by our esteemed colleague, Taro YAMAMOTO (山本太郎), who manages our Japanese typeface design efforts in our Tokyo office. — KL]
NHK World’s TV program, Design Talks, to be broadcast from 1:30 to 2AM on Thursday, June 13th (UTC, London Time), will feature Japanese typography and typeface design. Various kinds of characters, such as Chinese ideographs, Japanese hiragana and katakana syllables, as well as Latin alphabet characters are used in Japanese typography, and it has a deep relationship with the tradition of Japanese calligraphy and handwriting, which were artistically made, and represent a culmination from the past. This program tries to shed light on the unique characteristics of Japanese typography by interviewing talented type designers of today, one of whom is Adobe’s own Ryoko NISHIZUKA (西塚涼子).
For more information about the TV Program: Design Talks (please refer to the links on that page to find out the program schedule and how to watch the program).
[For those in the US, you can check the schedule to find out when this program will be broadcast. The easiest way to watch the program is by using the “NOW ON AIR” pod in the upper-right corner of the main page. For those in the PDT time zone, such as California, it will be broadcast at 6:30PM and 10:30PM on Wednesday, June 12th, and at 2:30AM, 6:30AM, 10:30AM, and 2:30PM on Thursday, June 13th. — KL]
In the spirit of encouraging developers, especially those in Japan, to provide better or broader support for Unicode, which usually entails abandoning Shift-JIS encoding, I became inspired this evening to put together a Top Ten List that provides various Reasons To Abandon Shift-JIS Encoding, similar to the Unicode Beyond-BMP one that I prepared a couple of years ago.
While humor is intended in such Top Ten Lists, there is also a serious side to this issue: Given that today’s systems work together, by clinging to Shift-JIS, developers can adversely affect other systems that do support Unicode.
[This Japanese version of the May 31, 2013 article entitled CSS Orientation Test OpenType Fonts is courtesy of Hitomi Kudo (工藤仁美).]
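I am pleased to announce that on May 31, the CSS Orientation Test OpenType Fonts were released as a new Adobe open source project. This open source project includes two OpenType/CFF fonts that were developed at the request of Koji Ishii (石井宏治), the editor of Unicode’s forthcoming UTR #50 (Unicode Vertical Text Layout). The purpose of these fonts is to make it easier for font developers to test glyph orientation.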
I am pleased to announce that the new CSS Orientation Test OpenType Fonts open source project was launched on Adobe’s open-source portal, Open@Adobe, today. This open source project consists of two OpenType/CFF fonts that were developed at the request of Koji Ishii (石井宏治), the editor of Unicode’s forthcoming UTR #50 (Unicode Vertical Text Layout). The purpose of these fonts is for developers to be able to more easily test whether glyph orientation in their implementation is correct or not.
OpenType fonts are ‘sfnt’ (scalable font) resources that are composed of several well-defined tables. One of these tables, which is the topic of today’s article, is the ‘cmap’ (character map) table. The ‘cmap’ table, put simply, maps character codes to Glyph IDs (GIDs) that refer to glyphs in the ‘glyf’ or ‘CFF’ (Compact Font Format) table, depending on the “flavor” of the OpenType font. What is important about the ‘cmap’ table is that it makes the glyphs usable. Without the ability to map from character codes, which are used by virtually all applications and OSes, the glyphs in a font are useless, and cannot be readily accessed or used.
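Conceptually, the mapping that the ‘cmap’ table provides behaves like a lookup table from character codes to GIDs. The sketch below models it as a plain dictionary with hypothetical GID values; it does not parse a real font. The one fixed convention it relies on is that GID 0 is the .notdef glyph, which is used when a character has no mapping.

```python
NOTDEF_GID = 0  # GID 0 is the .notdef glyph in OpenType fonts

# Hypothetical mappings; a real font stores these in 'cmap' subtables.
TOY_CMAP = {
    0x0041: 34,   # A
    0x0042: 35,   # B
    0x4E00: 712,  # 一
}


def glyphs_for(text, cmap):
    """Map each character to its GID, falling back to .notdef when unmapped."""
    return [cmap.get(ord(ch), NOTDEF_GID) for ch in text]


def unmapped_characters(text, cmap):
    """List the characters in text that the font cannot display."""
    return [ch for ch in text if ord(ch) not in cmap]


gids = glyphs_for("AB\u4E00\u4E8C", TOY_CMAP)
missing = unmapped_characters("AB\u4E00\u4E8C", TOY_CMAP)
print(gids, missing)
```

A glyph whose GID never appears as a value in the table cannot be reached from character codes alone, which is the sense in which unmapped glyphs cannot be readily accessed.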
The prototypical Serif and Sans Serif typeface style distinction in Korean has traditionally used the names Myeongjo (명조체/明朝體 myeongjoche) and Gothic (고딕체/고딕體 godikche), respectively. But, in 1993, the Republic of Korea (South Korea) Ministry of Culture, in an attempt to standardize typographic terms, recommended the use of Batang (바탕 batang) and Dotum (돋움 dotum) as the proper names for these two typeface styles.
At the time the Ministry of Culture recommendation was made, which was a period when printing was the most common use of fonts, Batang was meant for body text, and Dotum was for display or emphasis purposes. Mobile devices have provided a new use for Dotum, because its lack of serifs provides superior readability on smaller screens that necessitate smaller point sizes, and the original rationale for these new names no longer seems to apply.
From what I can tell, Korean type foundries have not embraced the Batang and Dotum names, and have actually resisted their use. What probably didn’t help was the fact that Microsoft released TrueType fonts with these exact names, with no additional qualifiers: Batang and Dotum. In other words, it seems that Microsoft’s use of these names polluted their chance at more widespread use, because they were treated as typeface names, not typeface style names.
In closing this brief article, I am curious about what our blog readership thinks about this particular issue. I welcome any and all comments.
[This (Simplified) Chinese version of the May 1, 2013 Typblography article entitled Adobe contributes font rasterizer technology to FreeType is courtesy of Gu Hua (顾华).]
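Modern fonts offer a choice of two glyph outline formats: TrueType or CFF. TrueType was developed by Apple in 1990, while CFF (Compact Font Format) is a second-generation format that Adobe derived from the Type 1 format (often called PostScript fonts), which was first released in 1984. Both TrueType and CFF can be used in OpenType fonts. They have much in common, but there are two main differences: they use different mathematics to describe glyph curves, and they use different hinting techniques (hinting provides rasterization hints to ensure that each glyph is displayed as accurately as possible with a limited number of pixels). TrueType focuses on building instructions into the font, whereas Type 1 and CFF rely more on the intelligence of the rasterizer. This makes the quality of the rasterizer especially important, and with this collaboration Adobe expects a significant improvement in how CFF fonts are displayed in environments that use FreeType.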