Holiday Greetings from Lexiteria

Lexiteria, a unique word products and services company, licensed our world peace design and customized it with their logo and message.

I’m biased, naturally, but I really like this card!

Multilingual design is a passion of mine and, though I don’t advertise it, I do provide custom design work (as well as the licensing of existing designs). If you have any projects in mind, please contact me.

Select Language(s): Facebook takes language settings to the next level

Facebook is updating its user profile page (profiled in-depth here) and one of the changes is to ask users to self-select one or more languages.

Facebook already looks at the user’s web browser for an indication of what language he/she prefers, but this is far from a perfect solution. Having the user self-select languages allows Facebook to do some fairly nifty things when it comes to aligning users (suggested friends, interests, etc.) — and, of course, improving ad targeting. It remains to be seen what Facebook does with these multilingual user profiles.
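To make the browser-based approach concrete, here is a minimal sketch of how a server might read a visitor's preferred languages from the Accept-Language request header. This is not Facebook's code; the function name `parse_accept_language` is hypothetical, and the sketch assumes a simple header with optional `q` weights.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip()
        if not tag:
            continue
        quality = 1.0  # a tag with no q parameter defaults to full weight
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by descending weight, then by original position in the header.
            prefs.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(prefs)]


print(parse_accept_language("en-US,en;q=0.8,es;q=0.9"))
```

The header only reflects how the browser happens to be configured, which is why a list the user picks by hand is a stronger signal.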

My friends must wonder how I’ve suddenly become so multilingual as I’ve been playing with this feature. Here’s my current list of languages (Spanglish included):

And here’s how these languages are presented on my profile:

Notice the globe icon!

Thankfully, Facebook is correctly using a globe icon to indicate languages. Now perhaps Facebook will kill the globe icon for notifications (something I’ve been critical of before). Facebook now uses the globe icon for three different features on its portal — something’s gotta give.

That said, I like that Facebook is making languages a high-level profile feature.

Multilingual web users are an important segment of the Internet. In my experience, these users tend to be key proselytizers of social networking services, and I’m sure Facebook is well aware of this.

Properly supporting multilingual users is an important step forward and one that I believe other companies will attempt to duplicate in the years ahead.

Translation in the next century

Jaap van der Meer of TAUS wrote a great article on translation in the next century. A few choice quotes:

In this new regime, translation is multidirectional, from any language into any language. Quality requirements are different for different users and different usages. Machine translation is good enough for the largest volumes of dynamic web content, whereas pre-sales texts require a step-up in quality from the current one-translation-fits-all policy.

And

It is still unsettling of course when you realize that 90% of the translated words will be generated by machine translation engines, probably at no charge to the end-user. But considering there is a non-stop stream of multimedia information, the translation market will certainly innovate and assert its value in different ways.

And speaking of this next wave of translation, have you seen Google’s new Global Advertiser service? Google has now officially gotten into the web globalization business.

Inside Google’s language detection tool

One of the highlights of attending the 2010 Unicode Conference was listening to Richard Sites explain how he developed Google’s statistical language detection algorithm.

You may have already used this algorithm as part of Google Translate:

The feature is also built into Google’s Chrome browser. If Chrome detects a web page in a language other than your browser’s default language, it will ask you (as shown below) if you’d like the page translated (assuming Google supports the translation).

The tool works by scanning a chunk of text and then segmenting and analyzing four-character “tokens.” These tokens are compared against a very large table of reference tokens that have language properties associated with them. If you’ve played with this feature on Google Translate, you’ll notice that the accuracy of the algorithm improves as you add text.
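For readers who want a feel for the approach, here is a toy sketch of four-character token matching. It is not Richard's algorithm or Google's code: the two one-sentence samples, the scoring rule (summing relative frequencies), and the names `quadgrams`, `build_table`, and `detect` are all assumptions made for illustration. A real reference table would be built from millions of words per language.

```python
from collections import Counter


def quadgrams(text):
    """Return the overlapping four-character tokens in lowercased text."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 4] for i in range(len(cleaned) - 3)]


def build_table(samples):
    """Map each token to the relative frequency it has in each language sample."""
    table = {}
    for lang, text in samples.items():
        counts = Counter(quadgrams(text))
        total = sum(counts.values())
        for token, n in counts.items():
            table.setdefault(token, {})[lang] = n / total
    return table


def detect(text, table):
    """Return the best-scoring language, or None if no token is recognized."""
    scores = Counter()
    for token in quadgrams(text):
        for lang, weight in table.get(token, {}).items():
            scores[lang] += weight
    if not scores:
        return None
    return scores.most_common(1)[0][0]


# Tiny made-up samples standing in for a real reference corpus.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme al sol",
}
TABLE = build_table(SAMPLES)

print(detect("the dog jumps", TABLE))
print(detect("el perro duerme", TABLE))
```

Each extra token adds evidence to the running scores, which matches the observation that accuracy improves as you add text.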

Richard set out to create a tool that would identify 180 languages across 57 scripts. The hard part was in creating this reference table of tokens. It required analyzing millions of words of source text of each language, which isn’t so easy when you consider that some of these languages just aren’t represented that well on the Internet. Even Wikipedia came up short for many languages (it supports more than 170 languages, but not all of these languages have much in the way of content).

Now you may think that the detection tool could also look at the HTML lang attribute or the character encoding of the web page to come up with its answer. But as Richard noted, roughly 5% of all web pages specify an incorrect text encoding. Even Wikipedia incorrectly labels languages on occasion. To further complicate matters, between 5% and 10% of all web pages include more than one language.
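As a rough illustration of why the page's own label is only a hint, the sketch below reads the declared language from the `<html>` element and checks it against whatever a statistical detector reports. The class and function names are hypothetical, and comparing only the primary subtag (the part before the first hyphen) is a simplifying assumption.

```python
from html.parser import HTMLParser


class LangAttributeReader(HTMLParser):
    """Record the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.declared = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.declared is None:
            self.declared = dict(attrs).get("lang")


def declared_language(html_text):
    """Return the language the page claims to be in, or None if unlabeled."""
    reader = LangAttributeReader()
    reader.feed(html_text)
    return reader.declared


def declaration_agrees(html_text, detected):
    """True only when the declared primary language matches the detected one."""
    declared = declared_language(html_text)
    if not declared or detected is None:
        return False
    return declared.lower().split("-")[0] == detected.lower().split("-")[0]


page = '<html lang="en-US"><body>Hola, ¿cómo estás?</body></html>'
print(declared_language(page))
print(declaration_agrees(page, "es"))
```

Here the page is labeled as English but contains Spanish, so a detector that reports "es" would disagree with the label.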

If you’re into learning more, you can take a look at the code itself — the Chrome branch is open sourced here.

A few notes from Richard’s talk:

  • Other sources of language data used to fine-tune the algorithm included the BBC and Watchtower.org — the web site of the Jehovah’s Witnesses.
  • This algorithm was developed as a 20% project. Which makes me wonder what the heck Richard is doing with the other 80% of his time.
  • The reason Chrome only covers 52 languages is that the application needs to be compact — for fast downloading. Each additional language requires additional reference tables.
  • There are some language pairs that still pose challenges to the algorithm (because the languages themselves are so closely related). Challenging pairs include, among others, Indonesian and Malay, Czech and Slovak, and Bosnian and Croatian.
  • In 2008, English made up 42% of all web content (I would bet it’s under 40% today).
  • Also as of 2008, 47 languages covered roughly 99% of all web content.