Archive for July, 2012

Internationalized Resource Identifiers: Standards Progress

The idea of a Uniform Resource Locator (URL) is a key Web innovation: the “hyper” of hypertext. URLs function as a combined locator (how to find it) and identifier (how to name it) for referring to other Internet resources from within documents (hypertext, such as the HyperText Markup Language [HTML]), email, and a variety of Internet protocols (e.g., the HyperText Transfer Protocol [HTTP]).
URLs were designed to be portable and easily transcribed at a time when most computers had very limited support for character sets. As a result, the characters allowed in a URL are limited to a subset of safe characters that are always available, much like identifiers in most programming languages: the ASCII letters, digits, and a few punctuation characters. However, unlike identifiers in programming languages, URLs are frequently made visible to users. Web users see and type URLs, and it is common for people to use URLs in advertising, written communication, and spoken announcements.
Since most of the world uses languages that are written with characters not allowed in URLs, there has been considerable interest in developing a kind of URL that allows other (“non-ASCII”) characters drawn from Unicode, the standard for representing the characters of the world’s languages. This new identifier is called an Internationalized Resource Identifier (IRI); its syntax overlaps the existing URL syntax, based on the idea that some systems might still be URL-only while others might allow IRIs.
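To make the overlap concrete, here is a minimal sketch of how an IRI might be mapped to an ASCII-only URL for a system that is still URL-only. It uses only the Python standard library. The function name iri_to_uri and the example address are illustrative assumptions, not part of any specification, and the sketch ignores details such as user information in the authority. Python's built-in "idna" codec implements the older IDNA 2003 rules, so real software may treat some host names differently.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when escaping: the URL delimiters plus "%",
# so that escapes already present in the input are not escaped twice.
# (Hypothetical choice for this sketch, not taken from a specification.)
_SAFE = "/?:@!$&'()*+,;=%"


def iri_to_uri(iri: str) -> str:
    """Map an IRI to an ASCII-only URL (illustrative sketch only)."""
    parts = urlsplit(iri)

    # Host names are converted with the IDNA (Punycode) encoding.
    host = parts.hostname or ""
    if host:
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"

    # Everything else is encoded as UTF-8 and then percent-escaped.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE)
    fragment = quote(parts.fragment, safe=_SAFE)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/käse?q=ü"))
```

An address that is already plain ASCII passes through unchanged, which is the sense in which the two syntaxes overlap.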
This was pretty good in theory, but in practice there have been a number of problems. For example, having multiple ways of writing the same identifier can cause security and reliability problems if implementations aren’t uniform. Rather than converging, the standard has come under pressure and diverged because of the wide variety of implementations.
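One way the "multiple ways of writing the same identifier" problem can show up is through Unicode itself: the same visible text can be stored as different code point sequences. The sketch below, using only the Python standard library, shows two spellings of the same word producing different escaped forms unless both are normalized first. The example word and variable names are illustrative, and the sketch does not claim which normalization, if any, a given specification or implementation applies.

```python
import unicodedata
from urllib.parse import quote

# Two ways to write the same visible text "café":
composed = "caf\u00e9"      # "é" as one precomposed code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

# The two strings look identical on screen but are not equal...
print(composed == decomposed)

# ...so they percent-escape to different ASCII forms.
print(quote(composed))
print(quote(decomposed))

# Normalizing both to the same form (NFC here) makes them compare equal.
same = unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
print(same)
```

If one implementation normalizes before comparing or escaping and another does not, the two can disagree about whether these identifiers name the same resource.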
Work continues to bring the concerned implementors together to work out the details and ensure that there is a single standard for IRIs in browsers, email, HTML, plain text, and other contexts. Specifications are developed in the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF). Adobe’s Larry Masinter and Roy Fielding continue to work on the related standards as editors, specification authors, and reviewers.

As with most standards, the overall concept is simple; it’s the details that are difficult, given that any change to the core addressing standards for the Web has significant implications for security, reliability, and compatibility with existing deployed systems.