January 31, 2006

Flex Beta 2 and Free Flex SDK

It's all true. We've decided to release a free version of the Flex SDK. We haven't decided exactly what's going to be in it yet, but at a minimum it will include a compiler and the framework -- everything you need to build Flex 2 applications. We're also going to release a free version of Flex Enterprise Services that will be limited in the number of connections it allows and will only run on a single server. Of course, Flex Builder 2 and the full version of Flex Enterprise Services 2 won't be free, but now anyone will be able to get started building Flex 2 applications at no cost whatsoever. This is all we really know at this point, but I'll post more details when I have them.

Also, Flex Builder 2 and Flex Enterprise Services 2 will go into public beta tomorrow, so get ready. Some of the new features include:

If you're still getting up to speed on Flex, I made a post the other day titled "Clarifying the Term Flex" which describes in detail all the different components of Flex. I'll let you know when the beta is available for download, or you can just keep an eye on Adobe Labs.

Posted by cantrell at 3:21 PM.

Programmatically Determining a Site's Language

I was having a conversation with a couple of guys the other day about data aggregation, and the topic of language came up. They wanted to know how you can programmatically determine what language a site is written in (language as in spoken and written language, not computer language). Off the top of my head, I guessed one could uncover clues in the site's HTTP headers, in its character encoding, or by geocoding the site's IP address.

It turns out to be a harder problem to solve than I initially thought. HTTP headers are really no more help than a site's character encoding, which really isn't much help at all since UTF-8 can encode pretty much any language there is. And geocoding an IP address gives you nothing more than a hint, for all the regular reasons geocoding IP addresses doesn't always work, and for the additional reason that a server being in a particular country doesn't really tell you anything about the language the sites on the server are written in (I used to live in Japan, but never posted a single thing in Japanese).

I did a little research, and it looks like folks like Google use very complex techniques for determining a site's language like comparing characters and words against known sets of characters and words in a database. This seems like a reasonable approach, but not one that I could implement in a reasonable amount of time (like a couple of hours), so I did what I always do when faced with a very complex problem: I looked for an obvious and simple solution.
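To make the word-comparison idea a bit more concrete, here is a toy Ruby sketch. It is not how Google or anyone else actually does it, and it is not the script I ended up writing; the STOPWORDS table and the guess_language method are made up for illustration, and the word lists are far too small for real use. It just counts how many words in a piece of text appear in each language's list of very common words.

```ruby
# Tiny illustrative word lists (an assumption for this sketch). A real
# system would compare against much larger sets of words and characters.
STOPWORDS = {
  "en" => %w[the and of to in is that it for with],
  "es" => %w[el la de que y en los las por con],
  "fr" => %w[le la les de et des est que pour dans],
  "de" => %w[der die das und ist nicht mit den von zu]
}.freeze

# Returns the code of the language whose common words show up most often
# in the text, or nil if none of the known words appear at all.
def guess_language(text)
  # Split into lowercase runs of letters (Unicode-aware).
  words = text.downcase.scan(/[[:alpha:]]+/)

  # Count hits per language.
  scores = STOPWORDS.transform_values do |list|
    words.count { |word| list.include?(word) }
  end

  best, hits = scores.max_by { |_, count| count }
  hits.zero? ? nil : best
end
```

Even this version shows why the approach gets complicated quickly: short texts barely register, closely related languages share many common words, and every new language needs its own data.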

What I eventually decided was that the sites out there with the most content, and with content that is updated most frequently (and which are therefore the most interesting sites to index), almost always tell you what language they are written in through their RSS or Atom feeds. Of course, a research paper isn't likely to have an RSS feed, but most news sites and just about all blogs certainly do.

I tested the theory by writing a Ruby script that crawls sites and their feeds looking for things like xml:lang attributes and other language-related tags. I ran the finished product against a sample of 50 non-English blogs from MXNA and determined that the technique is about 60% accurate. Not great, but not too bad, either. I also rediscovered a lesson I'd already learned many times over when writing aggregators, which is that you should never trust data you don't control: all but one of the sites that my script got wrong actually lied and claimed to be in a language they weren't (in every case, they claimed to be English rather than the language they were actually written in).

How do I explain the fact that about 40% of blogs seem to lie about their language? I'm sure it's an innocent mistake. Most people don't really know much about how RSS and Atom work, and just trust their blogging software to do the right thing. Even if they write the software themselves, they probably don't really know what all the RSS/Atom tags actually mean. RSS is sort of the new HTML: as long as it mostly works, it's good enough for most people.
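For a rough idea of what checking a feed for a declared language can look like, here is a minimal Ruby sketch. It is not the actual script; the names LANG_TAG, FEED_LANGUAGE_PATTERNS and declared_feed_language are hypothetical, and it assumes the feed has already been downloaded into a string. It looks for the RSS language element, the Dublin Core dc:language element, and xml:lang attributes, in that order.

```ruby
# Loose pattern for a language tag such as "en", "en-us" or "pt-BR".
LANG_TAG = /[A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*/

# Places a feed commonly declares its language, tried in order.
FEED_LANGUAGE_PATTERNS = [
  %r{<language>\s*(#{LANG_TAG})\s*</language>}i,        # RSS 2.0 channel element
  %r{<dc:language>\s*(#{LANG_TAG})\s*</dc:language>}i,  # Dublin Core, often seen in RSS 1.0
  /xml:lang\s*=\s*["'](#{LANG_TAG})["']/i               # Atom (or any XML element)
].freeze

# Returns the language the feed claims for itself (lowercased),
# or nil if it makes no claim in any of the places checked.
def declared_feed_language(xml)
  FEED_LANGUAGE_PATTERNS.each do |pattern|
    match = xml.match(pattern)
    return match[1].downcase if match
  end
  nil
end

rss = '<rss version="2.0"><channel><title>t</title><language>en-us</language></channel></rss>'
puts declared_feed_language(rss)
```

Keep in mind that whatever this returns is only what the feed says about itself, which, as the numbers above show, is often wrong.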

Like I said, the script works by crawling sites and their RSS or Atom feeds. If you're interested in the source, you can grab it here. I also threw together a CGI wrapper for it, so you can test it out online yourself here. If you're new to Ruby and don't feel like decoding all the regular expressions, here is a brief description of how it actually works.

Some lessons learned: