We sell proxies for GeoIP testing, but GeoIP isn't the right solution for every problem. We saw a great post on Hacker News a few weeks ago suggesting that sites should use the browser's
Accept-Language header to determine language, not their region of the world, and we agree! This got us thinking: How many sites actually support that?
Coles Notes Version: Of the top 10,000 sites according to Alexa, we were able to query 9,515. Of those sites, only 687 gave us different languages when asked (687/9515 × 100 ≈ 7.2%). Of those 687, 131 had "google" in the domain (think google.com, google.ml, google.co.ug, etc.).
1. Get list of top sites.
This seemed like it should have been the easiest step, but it's pretty clear that Amazon bought Alexa, bolted an API onto the side of AWS, and forgot about it. It's best to click links randomly until the sample code just starts working, then stop. Once I had the list of the 10,000 top sites, I shoved the data into MySQL.
2. Find list of most common languages
Testing every possible language didn't seem feasible, but it seems likely that websites supporting multiple languages would support at least one common language, so I found a random site purporting to list the most common languages on the web and used that. Since a few resources listed both "Chinese" and "Mandarin Chinese", I decided to check both in my script. My final list is:
- English en
- Chinese zh
- Spanish es
- Arabic ar
- Portuguese pt
- Japanese ja
- Russian ru
- Malay ml (strictly, ml is the ISO 639-1 code for Malayalam; Malay is ms)
- French fr
- German de
- Mandarin Chinese cmn
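For reference, here's that list as the script sees it, as a Python sketch (the original script was PHP, and `accept_language_header` is just my name for the obvious helper):

```python
# The Accept-Language values the script cycles through, keyed by name.
LANGUAGES = {
    "English": "en",
    "Chinese": "zh",
    "Spanish": "es",
    "Arabic": "ar",
    "Portuguese": "pt",
    "Japanese": "ja",
    "Russian": "ru",
    "Malay": "ml",
    "French": "fr",
    "German": "de",
    "Mandarin Chinese": "cmn",
}

def accept_language_header(code):
    """Build the request header that asks a site for one language."""
    return {"Accept-Language": code}
```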
3. Strip HTML out of response
I wanted to run language detection on the responses to see what they'd sent me, but that seemed fraught with peril what with all the HTML tags, styling, etc. embedded. Luckily the HTML2Text library does a pretty good job. Hand it the HTML, get the plain text response back. :gold-star:
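The post used the HTML2Text library; for a sense of what the stripping step does, here's a rough stdlib-only Python equivalent (an illustrative sketch, not the library the post actually used):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def strip_html(html):
    """Return the plain text content of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```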
4. Determine Language
There are two options here. Some sites are nice enough to declare their language in the HTML, e.g.
<html lang="en">; if we don't get a language, we'll need to figure it out on our own. I've used the cld language detection library in Ruby before, but didn't see anything comparable for PHP. Luckily the kind folks at Detect Language have an API that will give me the language of a string really quickly. Handing them a few dollars to make this problem go away seemed like a great deal.
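That two-step check might look like this in Python (a sketch; `detect_via_api` stands in for whatever detection service you call, Detect Language in the post's case):

```python
import re

# Pull the lang attribute off the <html> tag, if the site declares one.
LANG_ATTR = re.compile(r'<html[^>]*\blang=["\']?([A-Za-z-]+)', re.IGNORECASE)

def declared_language(html):
    """Return the declared lang attribute, or None if absent."""
    match = LANG_ATTR.search(html)
    return match.group(1) if match else None

def detect_language(html, detect_via_api):
    """Prefer the declared language; fall back to a detection service."""
    lang = declared_language(html)
    if lang:
        return lang
    return detect_via_api(html)
```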
5. Run the script
The script is pretty simple: load up all the sites and languages and iterate, sleeping after every request. Sure, it would be slow, but who cares? I ran it inside
screen and it took a few days.
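In Python the core loop might look something like this (the original was PHP; `fetch` is a stand-in for the actual HTTP call so only the iterate-and-sleep shape is shown):

```python
import time

def crawl(sites, languages, fetch, delay=1.0):
    """Request each site once per language, pausing between requests,
    and collect (site, language, body) tuples."""
    results = []
    for site in sites:
        for lang in languages:
            body = fetch(site, {"Accept-Language": lang})
            results.append((site, lang, body))
            time.sleep(delay)  # be polite; slow is fine here
    return results
```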
6. Raw Results
A really ugly SQL query gave me some rough data to work with:
select count(*) as `lang_support`, `domain`, site_id,
       group_concat(language_detected) as `ld`
from results
inner join `languages` on `language_id` = `languages`.`id`
inner join `20170424` on `results`.`site_id` = `20170424`.`rank`
where `languages`.`header` = language_detected
group by `site_id`
Rather than brushing up on my SQL, I wrote a PHP script to iterate over it and give me some understandable output:
google.com supports 10: de-CA,fr-CA,es-419,ml-CA,en-CA,zh-CN,ja-CA,pt-PT,ar-CA,ru-CA
youtube.com supports 10: es,ml,en,zh,ja,pt,ar,ru,de,fr
facebook.com supports 10: ja,pt,ar,ru,de,fr,es,ml,en,zh-Hans
www.baidu.com supports 1: zh
yahoo.com supports 1: en-CA
reddit.com supports 2: en,zh
google.co.in supports 7: de-IN,fr-IN,ml,ja-IN,pt-PT,ar-IN,ru-IN
qq.com supports 1: zh-CN
twitter.com supports 8: fr,ja,es,pt,ar,ru,de,en
taobao.com supports 1: zh
amazon.com supports 1: en
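The tallying step amounts to grouping the matched responses by domain and keeping only the domains that answered in more than one language. A Python sketch (the `rows` shape here is an assumption, not the post's actual schema):

```python
from collections import defaultdict

def summarize(rows):
    """rows: (domain, language_detected) pairs where the response
    language matched the Accept-Language header we sent.
    Returns only domains that answered in more than one language."""
    by_domain = defaultdict(list)
    for domain, lang in rows:
        by_domain[domain].append(lang)
    return {d: langs for d, langs in by_domain.items() if len(set(langs)) > 1}
```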
In total, 687 sites supported multiple languages, and 131 of them were googley.
I'm surprised how few sites support determining language based on
Accept-Language. We know that users are more likely to spend money when websites are localized for their needs. Google has clearly embraced this globally, but Yahoo hasn't.
Looking at a breakdown of the number of languages supported (log scale), I expected to see a gradual trail-off, with the most popular sites supporting many languages and less popular sites supporting just a few. But with 119 sites supporting 2 languages and 94 supporting 6, the trend seems absent.
Graphing the number of sites that support multiple languages in buckets of 1,000 sites shows a strong peak on the left side, but no other real trends. Again, I might have expected a more gradual trend down and to the right, showing that the more popular a site is, the more likely it is to support multiple languages, but that's not what we're seeing.
Zooming in on just the top 1,000 sites, it's tempting to call it more of a down-and-to-the-right trend, but it's mostly just a big spike on the left.
Other Notes & Tips
- 194 of the top 1000 and 48 of the top 100 sites support multiple languages.
- A surprising number of sites require the
www prefix. For example, http://faa.gov will not even resolve, while http://www.faa.gov works fine. The Alexa data never contains
www, so you may need to add it yourself.
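A simple fallback handles this; here's a hedged Python sketch (`fetch` again stands in for the real HTTP call):

```python
def candidate_urls(domain):
    """Alexa gives bare domains; some sites only resolve with www."""
    urls = ["http://" + domain]
    if not domain.startswith("www."):
        urls.append("http://www." + domain)
    return urls

def fetch_with_www_fallback(domain, fetch):
    """Try the bare domain first, then the www-prefixed form."""
    last_error = None
    for url in candidate_urls(domain):
        try:
            return fetch(url)
        except OSError as error:
            last_error = error
    raise last_error
```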
- I was pretty happy with Detect Language, though it was occasionally confused by English footers on non-English translations of sites. For example, a site without a lot of text content might use a consistent legal disclaimer and copyright footer or cookie notice. That said, I can hardly blame them for giving me the language of text that appeared on the page.
- html2text was also pretty fantastic. "Does what it says on the tin" is great praise for a library, and it applies here.
- A large number of sites refuse to answer HTTP requests from known scripting user agents. I used Guzzle for PHP, and initially didn't change the default agent of
Guzzle/4.0 curl/7.21.4 PHP/5.5.7. Just setting my user agent to
WonderNetwork.com-LangChecker v0.1 allowed me to grab 500 more sites.
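Setting a descriptive user agent is a one-liner in most HTTP clients; a Python sketch with the stdlib (the exact UA string below is illustrative, not the one the post used):

```python
import urllib.request

# A descriptive, honest user agent; swap in your own project name.
USER_AGENT = "example.com-LangChecker/0.1"

def build_request(url, lang):
    """Build a request carrying our user agent and a language preference."""
    return urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,
        "Accept-Language": lang,
    })
```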
- If you're going to go spelunking in the Alexa list, use private browsing. There's a lot more porn than I was expecting.
Join the discussion on Hacker News