Advantages of Numeric Character References

Unicode has become the preferred way to represent text in digital form, and for good reason. Its broad coverage of our planet’s scripts and languages is the single greatest reason why this has happened. All of the major OSes have embraced Unicode. Put another way, if you develop a product that makes use of text data, and it doesn’t support Unicode, you’re doing something wrong.

Unicode comes in a variety of representations called encoding forms. The three most basic Unicode encoding forms are UTF-8, UTF-16, and UTF-32. The latter two are also available in explicit little- or big-endian flavors: UTF-16LE, UTF-16BE, UTF-32LE, and UTF-32BE. These are covered in Chapter 4 of CJK Information Processing, Second Edition. But there are times when a bomb-proof way of representing Unicode characters is needed, or when an otherwise ASCII-only web document requires the occasional Unicode character. For these purposes, and in the context of web documents, Numeric Character References (NCRs) have great advantages. One advantage is human readability: an NCR conveys an explicit Unicode code point. Another is that the notation uses only ASCII characters, which is what makes it bomb-proof.
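As a rough illustration of the encoding forms named above, the following Python sketch prints the bytes of U+4E00 in each one, then builds the all-ASCII NCR for comparison. The variable names are chosen here for illustration, and the codec names are Python's spellings of the encoding forms.

```python
# Serialize U+4E00 in each encoding form with explicit byte order.
ch = "\u4e00"

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: ch.encode(name).hex(" ").upper() for name in encodings}

for name, hex_bytes in encoded.items():
    print(f"{name:10} {hex_bytes}")

# The NCR spells out the code point using ASCII characters only.
ncr = f"&#x{ord(ch):X};"
print(ncr, ncr.isascii())
```

Every encoding form yields a different byte sequence for the same character, while the NCR is the same seven ASCII characters no matter how the document itself is encoded.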

To briefly demonstrate how NCRs work, consider the following two renditions of U+4E00, which is the first CJK Unified Ideograph of the URO (Unified Repertoire & Ordering), and which means “one”: 一 (NCR representation: “&#x4E00;”) versus 一 (binary representation in one of the Unicode encoding forms). Both forms look the same, and they should. If you examine the HTML source of this page by using the appropriate display option of your browser, you’ll see the difference. All modern browsers support NCRs.
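The equivalence of the two renditions can be checked outside a browser. This is a minimal sketch that uses Python's standard html module as a stand-in for a browser's NCR handling, and it picks cp1253 as an arbitrary example of a wrong decoder to show why the ASCII-only form is the more robust one.

```python
import html

# The NCR and the raw UTF-8 bytes resolve to the same character.
via_ncr = html.unescape("&#x4E00;")
via_utf8 = b"\xe4\xb8\x80".decode("utf-8")
print(via_ncr == via_utf8)

# Decoding the raw bytes with the wrong codec garbles the character...
misread = b"\xe4\xb8\x80".decode("cp1253")
print(misread == via_utf8)

# ...while the NCR is plain ASCII and survives the same mistake.
survived = html.unescape("&#x4E00;".encode("ascii").decode("cp1253"))
print(survived == via_utf8)
```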

An NCR is composed of three parts. The first part consists of the three characters &#x. What follows is the character designator, which is best described as a Unicode scalar value written in hexadecimal, such as U+4E00, but without the “U+” prefix, meaning 4E00. The last part is simply a trailing semicolon.
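The three parts described above can be sketched in Python as a pair of helpers. The names to_ncr, from_ncr, and NCR_PATTERN are hypothetical, introduced only for this example, and the sketch handles only the hexadecimal form discussed here.

```python
import re

# Prefix "&#x", one or more hexadecimal digits, trailing semicolon.
NCR_PATTERN = re.compile(r"&#x([0-9A-Fa-f]+);")


def to_ncr(ch: str) -> str:
    """Build a hexadecimal NCR from a single character."""
    return f"&#x{ord(ch):X};"


def from_ncr(ncr: str) -> str:
    """Parse a hexadecimal NCR back into the character it designates."""
    match = NCR_PATTERN.fullmatch(ncr)
    if match is None:
        raise ValueError(f"not a hexadecimal NCR: {ncr!r}")
    code_point = int(match.group(1), 16)
    # A Unicode scalar value excludes surrogates and stops at U+10FFFF.
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: U+{code_point:04X}")
    return chr(code_point)


print(to_ncr("\u4e00"))
print(from_ncr("&#x4E00;") == "\u4e00")
```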

Appropriate uses of NCRs include the occasional copyright or trademark symbol in what would otherwise be an ASCII-only web document, or the occasional use of any other non-ASCII character, such as 一.
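One way to keep a document ASCII-only is to convert each non-ASCII character to an NCR before publishing. The sketch below assumes a hypothetical helper named ascii_with_ncrs and a made-up sample string. It also shows Python's built-in xmlcharrefreplace error handler, which does the same job but emits decimal rather than hexadecimal references.

```python
import html


def ascii_with_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal NCR."""
    return "".join(
        ch if ch.isascii() else f"&#x{ord(ch):X};" for ch in text
    )


sample = "Copyright \u00a9 Example\u2122 \u4e00"
converted = ascii_with_ncrs(sample)
print(converted)

# Built-in alternative: decimal NCRs via the codec error handler.
decimal_form = sample.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(decimal_form)

# Both forms decode back to the original text.
print(html.unescape(converted) == sample)
print(html.unescape(decimal_form) == sample)
```

The hexadecimal form matches the U+ notation used for code points, so it is the easier one to read against a Unicode code chart.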
