[JT] Jochen Topf's Blog
Sun 2012-06-24 19:59

Choosing a Language

One key issue of the multilingual map project is obviously how we choose which language labels to render into a map. There are two sides to consider:

  1. What name tags do we have in the OSM data and
  2. What is the preferred language (or the preferred languages) of the user

Side 1 seems to be simple. It is obvious that we can’t invent data that’s not available. If there is no Mongolian name for an object in OSM, we don’t have it. But it is a bit more complex, because not all name tags come with information about which language they are in. The “name” tag itself, as well as “int_name”, “alt_name” and some other tags, don’t tell us which language they are using. Depending on the location of the object we might be able to guess which language a name tag is probably in. But sometimes people use name tags in the form “LanguageA (LanguageB)” or “LanguageA – LanguageB”, so this might be difficult.

And we have to take scripts into account. There might be a Serbian name:sr tag using Cyrillic letters, which might not be useful to a user who can’t read Cyrillic, but it can be transliterated into Latin script.
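To make the transliteration idea concrete, here is a minimal sketch in Python. It assumes the usual one-to-one mapping between the Serbian Cyrillic and Latin alphabets; the table and the function name transliterate_sr are made up for this illustration, and other languages and scripts would need their own (often far less clear-cut) rules.

```python
# Serbian Cyrillic -> Latin, lowercase letters only (30 letters).
# Illustrative table; not taken from any particular library.
SR_CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def transliterate_sr(name: str) -> str:
    """Transliterate a Serbian Cyrillic name into Latin script.

    Characters that are not in the table (spaces, digits, Latin
    letters) are passed through unchanged.
    """
    out = []
    for ch in name:
        latin = SR_CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                   # not a Serbian Cyrillic letter
        elif ch.isupper():
            out.append(latin.capitalize())   # e.g. "Љ" becomes "Lj"
        else:
            out.append(latin)
    return "".join(out)


print(transliterate_sr("Београд"))  # Beograd
```

A renderer could apply something like this to a name:sr value when the user has said they cannot read Cyrillic.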

So we probably need some kind of “language detection” and “normalizing” step first, one that finds all name-related tags and decides which languages and scripts are involved. This is never going to be perfect in practice, but most cases should be reasonably straightforward to decide. And we always have a good “fallback”: If in some case our algorithm doesn’t properly detect a language, we tell OSM users to add a “name:language” tag, which will always override what’s in the generic name tags.
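A first rough cut of such a normalizing step might look like the following Python sketch. It only separates names whose language is stated in the tag key from names whose language still has to be guessed; the actual detection is left out. The function name, the regular expression and the exact set of generic tags are assumptions made for this example.

```python
import re

# Generic tags that do not say which language they are in.
GENERIC_NAME_TAGS = {"name", "int_name", "alt_name"}

# Matches keys like "name:fr" or "name:sr-Latn". The exact pattern is an
# assumption; real OSM data contains more variants than this.
LANG_TAG = re.compile(r"^name:([a-z]{2,3}(?:[-_][A-Za-z]+)?)$")


def normalize_names(tags):
    """Split the name tags of one OSM object into two dicts.

    Returns (by_language, undetermined):
      by_language  - language code -> name, from explicit "name:xx" tags
      undetermined - generic tag key -> name, language still unknown
    """
    by_language = {}
    undetermined = {}
    for key, value in tags.items():
        match = LANG_TAG.match(key)
        if match:
            by_language[match.group(1)] = value
        elif key in GENERIC_NAME_TAGS:
            undetermined[key] = value
    return by_language, undetermined


tags = {"name": "Wien", "name:en": "Vienna", "name:fr": "Vienne", "highway": "primary"}
by_language, undetermined = normalize_names(tags)
print(by_language)   # {'en': 'Vienna', 'fr': 'Vienne'}
print(undetermined)  # {'name': 'Wien'}
```

Because the explicit tags end up in their own dictionary, a later detection step can fill in gaps from the undetermined names without ever overriding a “name:language” tag.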

On the other side we have the “map owner” or the end user deciding which languages they would like. In some cases the person creating a map can decide which languages should be used, perhaps because they know the target audience well enough. But in our case we want to give that flexibility to the user. The user should be able to choose the languages he or she wants to see. This is not only one language but a list of languages. The user might prefer his native French, but can also read Spanish or English. I’ll not go into the details of how a user tells us technically which languages to use; this might be by HTTP header or cookies or some other means. Let’s postpone this question and just assume that the user gives us an ordered list of the languages he wants.

We now get into a few problems:

First, if we pre-render map tiles based on only one preferred language, we can’t always give the user the choice he wants. Say we create a French language map and decide to render French labels and, if they are not available, fall back to English and Spanish in that order. But the user might rather see the Spanish labels than the English ones. We have just given the user the wrong labels.
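The mismatch is easy to see in a few lines of Python. This is only a sketch: the function choose_label and the example names are made up for illustration, and the names are assumed to be already sorted by language code.

```python
def choose_label(names_by_language, languages):
    """Return the first available name in the given language order, or None."""
    for lang in languages:
        if lang in names_by_language:
            return names_by_language[lang]
    return None


# Hypothetical names for one object; there is no French name.
names = {"en": "Seville", "es": "Sevilla"}

# Fallback order baked into the pre-rendered "French" map.
prerendered_order = ["fr", "en", "es"]
# What this particular user actually asked for.
user_order = ["fr", "es", "en"]

print(choose_label(names, prerendered_order))  # Seville
print(choose_label(names, user_order))         # Sevilla
```

Both lookups start with French, but as soon as the French name is missing, the pre-rendered tile shows the English label to a user who would have preferred the Spanish one.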

This problem becomes more pronounced when we consider different scripts. Most people probably only know one script. I only understand Latin letters and am completely lost when I see, say, Chinese characters. I’d rather see the Latin transliteration of a Mongolian language name of a Chinese city if that happens to be available than the original Hanzi characters. I don’t speak the Mongolian language but the chances are still better that I recognize something this way.

Another problem is the question of what to do if none of the languages preferred by the user is available. We could always fall back to a default language or, better, have a list of default languages to fall back to. But again, those defaults might be better if we take the languages the user did choose into account. If a user chose Spanish as one of her languages, chances are a Portuguese or Italian name might be easier for her to understand than, say, a German label, because of how closely related those languages are.
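One way to take the user’s choices into account is to extend their list with related languages before the general defaults are added. The Python sketch below shows the mechanics only. The RELATED table is a made-up, three-entry placeholder, not linguistic data, and both it and DEFAULTS are assumptions for this example; filling such a table properly is exactly the hard part.

```python
# Hypothetical table of "closely related" languages. Illustrative only;
# a real table would need careful input from people who know the languages.
RELATED = {
    "es": ["pt", "it"],
    "pt": ["es", "it"],
    "it": ["es", "pt"],
}

# Hypothetical default languages, tried last.
DEFAULTS = ["en"]


def fallback_chain(preferred, related=RELATED, defaults=DEFAULTS):
    """Build the full lookup order for one user.

    Order: the user's own languages first, then languages related to
    them, then the defaults. Each language appears only once.
    """
    chain = []
    for lang in preferred:
        if lang not in chain:
            chain.append(lang)
    for lang in preferred:
        for rel in related.get(lang, []):
            if rel not in chain:
                chain.append(rel)
    for lang in defaults:
        if lang not in chain:
            chain.append(lang)
    return chain


print(fallback_chain(["es"]))        # ['es', 'pt', 'it', 'en']
print(fallback_chain(["fr", "es"]))  # ['fr', 'es', 'pt', 'it', 'en']
```

Keeping the table and the defaults as plain data, separate from the code, at least means the rules can be changed or replaced per user without touching the renderer.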

Now to be sure, there are a lot of special cases here, but for most people a few rules are probably enough. But it is not easy to define those few rules without a lot of knowledge about the languages written around the world and the people using them. Getting those rules wrong can seriously discriminate against some people. And it could lead to even worse edit wars in OSM than we already have. So we really have to keep the system as flexible as possible.

It is difficult enough to pre-render maps in several hundred different languages, but it is obviously not possible to pre-render maps in all the different language combinations a user might want. One obvious solution would be not to pre-render the maps but to create them on the fly based on the user’s language choices. Or at least render the labels on the fly on top of a pre-rendered base map. I think we should seriously consider this option. The question becomes, of course, whether it is technically feasible to do this quickly enough. But if it is at all possible to do this, I think it is the way to go.

Comments can be directed to the Multilingual maps wiki page.

Tags: maps · multilingual maps · openstreetmap · rendering