I was having a conversation with a couple guys the other day about data aggregation, and the topic of language came up. They wanted to know how you can programmatically determine what language a site is written in (language as in spoken and written language, not computer language). Off the top of my head, I guessed one could uncover clues in the site’s HTTP headers, character encoding, or by geocoding the site’s IP address.
It turns out to be a harder problem to solve than I initially thought. HTTP headers are really no more help than a site’s character encoding, which isn’t much help at all since UTF-8 can encode pretty much any language there is. And geocoding an IP address is really nothing more than a hint, both for all the usual reasons IP geocoding doesn’t always work, and because a server being in a particular country doesn’t tell you anything about the language the sites on it are written in (I used to live in Japan, but never posted a single thing in Japanese).
I did a little research, and it looks like folks like Google use very complex techniques for determining a site’s language like comparing characters and words against known sets of characters and words in a database. This seems like a reasonable approach, but not one that I could implement in a reasonable amount of time (like a couple of hours), so I did what I always do when faced with a very complex problem: I looked for an obvious and simple solution.
What I eventually decided was that the sites out there with the most content, and whose content is updated most frequently (and which are therefore the most interesting sites to index), almost always tell you what language they are written in through their RSS or Atom feeds. Of course, a research paper isn’t likely to have an RSS feed, but most news sites and just about all blogs certainly do.
I tested the theory by writing a Ruby script that crawls sites and their feeds looking for things like xml:lang attributes and other language-related tags. I ran the finished product against a sample of 50 non-English blogs from MXNA and determined that the technique is about 60% accurate. Not great, but not too bad, either. I also rediscovered a lesson I’d already learned many times over when writing aggregators, which is that you should never trust data you don’t control: all but one of the sites my script got wrong actually lied and claimed to be a language they weren’t (in every case, they claimed to be English rather than the language they were actually written in). How do I explain the fact that about 40% of blogs seem to lie about their language? I’m sure it’s an innocent mistake. Most people don’t know much about how RSS and Atom work, and just trust their blogging software to do the right thing. Even people who write their own software probably don’t know what all the RSS/Atom tags actually mean. RSS is sort of the new HTML: as long as it mostly works, it’s good enough for most people.
Like I said, the script works by crawling sites and their RSS or Atom feeds. If you’re interested in the source, you can grab it here. I also threw together a CGI wrapper for it, so you can test it out online yourself here. If you’re new to Ruby and don’t feel like decoding all the regular expressions, here is a brief description of how it actually works.
- You give it the URL of a blog or other news site which you assume has an RSS or Atom feed, and it grabs the source.
- It checks to see if the site’s source itself gives any clues about the language it’s written in through meta tags.
- If it doesn’t find any clues in the site itself, it looks for link tags referencing RSS or Atom feeds. If it doesn’t find any, it searches for any links in the page that seem like they might reference an RSS or Atom feed.
- Assuming it has found one or more links to probable feeds, it resolves the URLs, grabs the source, and looks for language-related attributes and tags in the feeds.
- Finally, once it finds what it thinks is a language code, it simply converts it into an actual language name.
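The steps above can be sketched in a few lines of Ruby. To be clear, this is a simplified reconstruction rather than the actual script: the method names, the regular expressions, and the tiny code-to-name table are my own, and a real implementation would need to handle many more tag variations and edge cases.

```ruby
require 'uri'

# Step 2: look for language clues in the page itself (meta tags, lang attributes).
def language_from_html(html)
  if html =~ /<meta[^>]+http-equiv=["']content-language["'][^>]+content=["']([a-zA-Z-]+)["']/i
    return $1.downcase
  end
  if html =~ /<html[^>]+(?:xml:)?lang=["']([a-zA-Z-]+)["']/i
    return $1.downcase
  end
  nil
end

# Step 3: find link tags referencing RSS or Atom feeds (auto-discovery),
# resolving relative hrefs against the page's URL.
def feed_urls(html, base_url)
  html.scan(/<link[^>]+type=["']application\/(?:rss|atom)\+xml["'][^>]*>/i).map do |tag|
    next unless tag =~ /href=["']([^"']+)["']/
    URI.join(base_url, $1).to_s
  end.compact
end

# Step 4: look for language-related tags and attributes inside a fetched feed.
def language_from_feed(xml)
  return $1.downcase if xml =~ /<language>\s*([a-zA-Z-]+)\s*<\/language>/i # RSS
  return $1.downcase if xml =~ /xml:lang=["']([a-zA-Z-]+)["']/i            # Atom
  nil
end

# Step 5: convert a language code into a name (a tiny excerpt of a full table).
LANGUAGE_NAMES = { 'en' => 'English', 'ja' => 'Japanese', 'fr' => 'French', 'de' => 'German' }

def language_name(code)
  LANGUAGE_NAMES[code.split('-').first] || code
end
```

The actual fetching (steps 1 and 4) would wrap these in a couple of `Net::HTTP` calls, but keeping the parsing functions string-in, string-out makes them easy to test against canned pages.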
Some lessons learned:
- Never trust data that you don’t own unless absolutely necessary. This is an important lesson in data aggregation. Don’t trust dates, languages, or even things like character encodings and categories. Determine as much as you can yourself, and be skeptical of the rest. As C-3PO says in The Empire Strikes Back, "R2-D2, you know better than to trust a strange computer."
- If you have a non-English blog, check your feed and make sure you’re not masquerading as English. I knew my script wouldn’t be 100% accurate, but I was really surprised to discover that almost 40% of non-English blogs claim to be written in English.
- Use link tags to point to your site’s feeds so people can "auto-discover" them. Adding a link tag to the head section of your site that references your feeds allows things like Firefox, Safari, future versions of IE, aggregators, and rogue Ruby scripts to find your feeds more easily.
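For reference, a typical auto-discovery link tag looks something like the following (the href and title here are placeholders for your own feed):

```html
<head>
  <link rel="alternate" type="application/rss+xml" title="RSS" href="http://example.com/feed.xml">
</head>
```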