[JT] Jochen Topf's Blog
Mon 2013-09-23 17:47

Semicolons in OSM Tags

OpenStreetMap doesn’t allow multiple tags with the same tag key on one object. But there is a belief out there in OSM land that you can work around this by putting all the tag values you want into one tag, using semicolons as separator characters. So if there is a bank with an ATM, people tag [amenity=bank;atm]. Or if there is a road with several refs, they tag them as [ref=I 70;US 40] and hope that it will magically work. And it does work in some cases. But in many cases it doesn’t. Let’s look into this a bit.

First, there is no special case for semicolons anywhere in the basic OSM code. Semicolons are just characters like any other. But there are some editors and programs using OSM data that treat semicolons specially in some cases. Unfortunately there is no general agreement on what these special cases are and how the data should be interpreted. Contrary to what some people may think, semicolons are not treated the same way in all cases.

A Set of Values

Most people expect the semicolon to act as a separator between unordered values, so [ref=I 70;US 40] is the same as [ref=US 40;I 70]. You can’t add both tags [ref=I 70] and [ref=US 40] on the same way, so you put both values together into one tag. In this case the values separated by semicolons form a set of values, and you can take them apart and look at each one by itself to find its meaning. For ref tags this works quite well and there are maps that do just this:

MapQuest Open rendering of multiple US highway shields

Other maps will sometimes show the highway shield with the full text of the ref including the semicolon, or maybe no shield at all. And even though MapQuest maps have this special case, it only works for a small number of shields. So while in this case it is reasonably clear what the meaning should be, most software does not actually interpret it that way.
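The set interpretation of a ref value can be sketched in a few lines of Python. This is only an illustration: the function name split_values is made up for this example, and trimming whitespace and dropping empty parts are assumptions, not something OSM defines.

```python
from __future__ import annotations


def split_values(value: str) -> frozenset[str]:
    """Split a tag value on semicolons and return the non-empty parts as a set.

    Assumption: surrounding whitespace is not significant and empty parts
    (as in "I 70;;US 40") can be ignored.
    """
    return frozenset(part.strip() for part in value.split(";") if part.strip())


# Order does not matter once the value is treated as a set.
print(split_values("I 70;US 40") == split_values("US 40;I 70"))
```

Because the result is a set, a repeated value such as "US 40;I 70;US 40" silently collapses to two entries, which is only acceptable for keys where order and repetition carry no meaning.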

Not Always a Set

But values separated by semicolons do not always represent sets of values; sometimes they represent an ordered list. To mark sea buoys, for instance, an ordered list of colours is used. The tag [seamark:buoy_lateral:colour=red;white;red;white] appears more than 700 times in the database. It means that there are coloured stripes on the buoy in that order. The order is important, and, as you can see, a colour can appear several times. In contrast, a tag [ref=US 40;I 70;US 40] wouldn’t make any sense. So whenever we interpret values with semicolons we have to take the tag key into account and treat each case specially.
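A data consumer that takes the key into account could look roughly like the following sketch. The two lookup tables are hypothetical and contain only the keys discussed here; a real program would need its own, much longer lists, and there is no agreed-upon registry to take them from.

```python
# Hypothetical per-key rules; these tables are assumptions for illustration.
SET_KEYS = {"ref"}                            # unordered, duplicates meaningless
LIST_KEYS = {"seamark:buoy_lateral:colour"}   # ordered, duplicates allowed


def interpret(key, value):
    """Interpret a tag value according to its key.

    Returns a list for ordered keys, a frozenset for unordered keys,
    and the unchanged string for every key without a known rule.
    """
    if key in LIST_KEYS:
        return [part.strip() for part in value.split(";")]
    if key in SET_KEYS:
        return frozenset(part.strip() for part in value.split(";") if part.strip())
    return value  # no special meaning: the semicolon is just a character


print(interpret("seamark:buoy_lateral:colour", "red;white;red;white"))
print(interpret("oneway", "yes;yes;no"))
```

Leaving unknown keys untouched is a deliberately conservative choice: it avoids inventing a meaning for values where the semicolon was never intended as a separator.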

Editors


JOSM tag conflict dialog

One particular problem is the editors and how they use the semicolon when you merge ways. I believe JOSM started doing this, but iD has the same or similar behaviour: when you merge two ways in JOSM, the default setting for any tag keys that appear in both ways with different values is to concatenate those values using semicolons as separators. This is almost always the wrong thing to do, though. Yes, you can choose one or the other value or enter a completely new one, but looking at the scores of useless tags in the database, many people go with the default. Just look at the values of the oneway key in taginfo. People keep fixing the problems, but you’ll always find tags like [oneway=yes;yes;no] in there. There is just no way this can ever make sense. And neither does [highway=residential;service] for that matter. I think the editors should be fixed so that they do not let you make those errors so easily, especially because some of these cases are difficult to fix without local knowledge, so other people can’t easily fix them later.
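A simple check for such merge leftovers might look like this sketch. The set SINGLE_VALUED_KEYS is a hypothetical list containing just the two keys mentioned above, and the tags of an object are assumed to be available as a plain dictionary.

```python
# Hypothetical list of keys for which a semicolon-separated value never makes sense.
SINGLE_VALUED_KEYS = {"oneway", "highway"}


def suspicious_keys(tags):
    """Return, sorted, the keys that should hold one value but contain a semicolon."""
    return sorted(
        key
        for key, value in tags.items()
        if key in SINGLE_VALUED_KEYS and ";" in value
    )


way = {"highway": "residential;service", "oneway": "yes;yes;no", "ref": "I 70;US 40"}
print(suspicious_keys(way))  # ['highway', 'oneway']
```

The check can only flag such objects. Deciding which of the merged values is correct still needs a human, often one with local knowledge.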

Another problem arises when tags contain semicolons “naturally” and are later merged. Many objects contain source tags, for instance, and seeing them merged is not unusual: [source=bing;survey] and so on. But the French imports have source tags that already contain a semicolon: [source=cadastre-dgi-fr source : Direction Générale des Impôts – Cadastre ; mise à jour : 2009]. If you join this with [source=bing] it starts to get really confusing.
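To see why this gets confusing, here is a small sketch of what a naive split does to such a merged value. The merge is simulated by plain string concatenation with a semicolon, which is an assumption about how an editor would combine the two values.

```python
french_source = (
    "cadastre-dgi-fr source : Direction Générale des Impôts – Cadastre ; "
    "mise à jour : 2009"
)

# Simulated merge of two ways: values joined with a semicolon.
merged = french_source + ";" + "bing"

# Naive interpretation: every semicolon separates independent sources.
parts = [part.strip() for part in merged.split(";")]
for part in parts:
    print(part)
```

The split yields three "sources" although only two values were merged, and nothing in the data tells a program that the first two parts belong together.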

Imports

And then there are the dreaded imports. Some import gave us nearly 60,000 tags [water=lake;pond], more than [water=lake] and [water=pond] taken together. A naive program might decide that there are 60,000 lakes and 60,000 ponds in OSM that happen to be at the same location. That is what the “simple interpretation” of the semicolon as a divider between independent tag values would tell us. But of course in this case it only means that somebody couldn’t make up their mind whether to use lake or pond and chose the worst: both. I suspect that somebody will fix this in the data at some point but for the time being it means that everybody working with OSM data has to have a special case for this.

Another common tag key (more than 20,000 nodes) is [census:population], which contains values like [107;2006] and [217;2006]. This probably means a population number in a specific year. Unfortunately it is not documented, so I can only guess. And it definitely is an unusual use of the semicolon: this is no set of independent values.
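If you wanted to act on that guess, a parser could look like the sketch below. It rests entirely on the undocumented assumption that the value has the form population;year; the function name guess_population is made up for this example.

```python
import re

# Assumed format "<population>;<four-digit year>"; this is a guess, not a documented schema.
_POPULATION_YEAR = re.compile(r"(\d+);(\d{4})")


def guess_population(value):
    """Return (population, year) if the value matches the assumed format, else None."""
    match = _POPULATION_YEAR.fullmatch(value.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(guess_population("107;2006"))  # (107, 2006)
```

Returning None for anything that does not match keeps the guess from being applied to values that follow some other convention.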

Conclusion

Real men, goes the unwritten rule of American punctuation, don’t use semi-colons.

(Ben MacIntyre, columnist in The Times, London)

So what is the conclusion of all this? For mappers I suggest avoiding semicolons wherever possible. If you can, do not use semicolons as special separator characters and do not use them as normal characters either. Often they are not necessary. The more than one hundred ways tagged with [name=Route Transcanadienne;Trans Canada Highway] should have used separate name:fr and name:en tags.

Mappers should be aware that values with semicolons need to be handled specially by users of the data, and many users don’t do that (yet). And even if they do interpret them, they might understand them in a different way from what you expect. Instead of [amenity=bank;atm] you can use [amenity=bank] and [atm=yes]. Use special tags like int_ref to avoid “overloading” ref. But there are cases where semicolons work and, anyway, we can’t completely avoid them. Let’s work on defining our data model better and make it clearer where those semicolons can and should be used and how they are to be interpreted.

Tags: openstreetmap · osmdata · tagging