Previewing the 2011 Web Globalization Report Card

I’ve begun work on the 7th edition of the Report Card. To produce this report I individually review more than 200 global web sites across more than 20 industries. Needless to say, I’ve got a busy month ahead!

I’ve already done a first pass on a number of web sites and have some initial thoughts to share:

  • As regular readers know, Google and Facebook finished in a dead heat for first place last year, with Google having a slight advantage. Both companies made significant changes over the past twelve months, changes that promise to make this another photo finish.
  • I’ve noticed an increase in the number of sites using geolocation for navigation. Unfortunately, some of these sites are not using geolocation as well as they should. As I’ve noted in my book, geolocation should never be used without a visual global gateway in place. Geolocation is an excellent tool, but it presents a number of edge cases that only a global gateway can solve.
  • I’ve seen some amazing global gateways so far, some of which are vast improvements over their predecessors. I’ll be documenting a number of these gateways in the report.
  • Companies continue to add languages. Based on my initial analysis, Indonesian is hot, as are Russian and Turkish. Last year, the average number of languages was 20. I suspect we’ll see an increase again this year. Keep in mind that this is just the average. Companies like Cisco, Apple, and DHL are well above 20 languages.
  • For last year’s report, I began measuring “community localization” — the integration of social networking platforms into local web sites. I wasn’t just looking at Twitter and Facebook use around the world, but at how companies are fostering communities. I’ve noticed quite a lot of Facebook integration around the world. Below is a home page visual from Samsung Italy:
  • Samsung also promotes its Twitter feed on the home page of its Brazil site. And Samsung is far from alone.
  • Finally, I’m noticing lots and lots of web site surveys. They’re popping up everywhere and in many languages. Somebody please make them stop!

Here is the link to the 2010 Report Card. All companies included in this report will be included in 2011. We’ll have a page for the 2011 report up shortly.

Translation in the next century

Jaap van der Meer of TAUS wrote a great article on translation in the next century. A few choice quotes:

In this new regime, translation is multidirectional, from any language into any language. Quality requirements are different for different users and different usages. Machine translation is good enough for the largest volumes of dynamic web content, whereas pre-sales texts require a step-up in quality from the current one-translation-fits-all policy.

And

It is still unsettling of course when you realize that 90% of the translated words will be generated by machine translation engines, probably at no charge to the end-user. But considering there is a non-stop stream of multimedia information, the translation market will certainly innovate and assert its value in different ways.

And speaking of this next wave of translation, have you seen Google’s new Global Advertiser service? Google has now officially gotten into the web globalization business.

Inside Google’s language detection tool

One of the highlights of attending the 2010 Unicode Conference was listening to Richard Sites explain how he developed Google’s statistical language detection algorithm.

You may have already used this algorithm as part of Google Translate:

The feature is also embedded in Google’s Chrome browser. If Chrome detects a web page in a language other than your browser’s default language, it will ask you (as shown below) if you’d like the page translated (assuming Google supports the translation).

The tool works by scanning a chunk of text and then segmenting and analyzing four-character “tokens.” These tokens are compared against a very large table of reference tokens that have language properties associated with them. If you’ve played with this feature on Google Translate, you’ll notice that the accuracy of the algorithm improves as you add text.
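To make the idea concrete, here is a minimal sketch of token-based detection in Python. It is not Google’s code or Richard’s algorithm: the function names, the three tiny sample sentences, and the scoring rule (summing relative token frequencies) are all assumptions made for illustration. A real reference table would be built from millions of words per language.

```python
from collections import Counter


def quadgrams(text):
    """Return the overlapping four-character tokens of lowercased text."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 4] for i in range(len(cleaned) - 3)]


# Toy reference text (an assumption for this sketch, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso "
               "y luego el perro duerme al sol",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und dann schläft der hund in der sonne",
}


def build_table(samples):
    """Map each token to its relative frequency in every language that has it."""
    table = {}
    for lang, text in samples.items():
        counts = Counter(quadgrams(text))
        total = sum(counts.values())
        for token, n in counts.items():
            table.setdefault(token, {})[lang] = n / total
    return table


def detect(text, table):
    """Return the best-scoring language, or None if no token is recognized."""
    scores = Counter()
    for token in quadgrams(text):
        for lang, weight in table.get(token, {}).items():
            scores[lang] += weight
    if not scores:
        return None
    return scores.most_common(1)[0][0]


TABLE = build_table(SAMPLES)
print(detect("the dog sleeps", TABLE))
print(detect("el perro duerme", TABLE))
```

Even in this toy version, longer input produces more tokens to match against the table, which is one plausible reason accuracy improves as you add text.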

Richard set out to create a tool that would identify 180 languages across 57 scripts. The hard part was in creating this reference table of tokens. It required analyzing millions of words of source text of each language, which isn’t so easy when you consider that some of these languages just aren’t represented that well on the Internet. Even Wikipedia came up short for many languages (it supports more than 170 languages, but not all of these languages have much in the way of content).

Now you may think that the detection tool could also look at the HTML lang attribute or the character encoding of the web page to come up with its answer. But as Richard noted, roughly 5% of all web pages specify an incorrect text encoding. Even Wikipedia incorrectly labels languages on occasion. To further complicate matters, between 5% and 10% of all web pages include more than one language.

If you’re into learning more, you can take a look at the code itself — the Chrome branch is open sourced here.

A few notes from Richard’s talk:

  • Other sources of language data used to fine-tune the algorithm included the BBC and Watchtower.org — the web site of the Jehovah’s Witnesses.
  • This algorithm was developed as a 20% project. Which makes me wonder what the heck Richard is doing with the other 80% of his time.
  • The reason Chrome only covers 52 languages is that the application needs to be compact — for fast downloading. Each additional language requires additional reference tables.
  • There are some language pairs that still pose challenges to the algorithm (because the languages themselves are so closely related). Challenging pairs include, among others, Indonesian and Malay, Czech and Slovak, and Bosnian and Croatian.
  • In 2008, English made up 42% of all web content (I would bet it’s under 40% today).
  • Also as of 2008, 47 languages covered roughly 99% of all web content.

The next Internet revolution will not be in English

This visual depicts about half of the currently approved internationalized domain names (IDNs), positioned over their respective regions.

Notice the wide range of scripts over India and the wide range of Arabic domains. I left off the Latin country code equivalents (in, cn, th, sa, etc.) to illustrate what the Internet is going to look like (at a very high level) in the years ahead.

This next revolution is a linguistically local revolution. In terms of local content, it is already happening. Right now, more than half of the content on the Internet is not in English. Ten years from now, the percentage of English content could easily drop below 25%.

But there are a few technical obstacles that have so far made the Internet not as user friendly as it should be for people in the regions highlighted above. They’ve been forced to enter Latin-based URLs to get to where they want to go. Their email addresses are also Latin-based. This will all change over the next two decades.

For those of us who are fluent only in Latin-based languages, this next wave of growth is going to be interesting, if not a bit challenging. In a Latin-based URL environment, you can still easily navigate to and around non-Latin web sites and brands. For example, if I want to find Baidu in China, I can enter www.baidu.cn. For Yandex in Russia, it’s yandex.ru.

But flash forward a few years and these Latin URLs (though they’ll still exist) may no longer function as the front doors into these markets.

Try Яндекс.рф. It currently redirects to Yandex.ru.

In a few years, I doubt this redirection will exist.
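For the curious, here is a small Python sketch of what happens to a name like Яндекс.рф under the hood. The domain name system still carries ASCII, so each non-Latin label is converted to an "xn--" form before lookup. The helper names to_ascii and to_unicode are my own, and Python’s built-in idna codec follows the older IDNA 2003 rules, so treat this as an illustration rather than a production converter.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to the ASCII form used for DNS lookups."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII "xn--" domain name back to its Unicode form."""
    return domain.encode("ascii").decode("idna")


ascii_name = to_ascii("Яндекс.рф")
print(ascii_name)              # each label now starts with "xn--"
print(to_unicode(ascii_name))  # back to Cyrillic, lowercased
print(to_ascii("yandex.ru"))   # Latin names pass through unchanged
```

The point is that the non-Latin address is what the user sees and types, while the ASCII form stays hidden in the plumbing.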

We’re getting close to a linguistically local Internet — from URL to email address. There are still significant technical obstacles to overcome. It will be exciting to see which companies take the lead in overcoming them — as these companies will be well positioned to be leaders in these emerging markets.

UPDATE: I’ve expanded on this topic in a recent article on IP Watch.