[JT] Jochen Topf's Blog
Mon 2012-11-26 18:04

An Area Datatype for OSM

One of the biggest problems with the OSM data model is that there are no real polygons. Polygons or multipolygons are “simulated” using closed ways, multipolygon relations, and several other methods. These multipolygon relations are often broken. The lack of a real polygon type causes problems that affect many people, for instance broken coastlines and broken country or other boundaries.

This is not a new problem. In March 2011 I created the wiki page The Future of Areas as a focal point for documentation and discussion of this problem, and others have built on it. There are some interesting ideas there, but so far no clear solution has been found. Every proposed solution has many drawbacks. At the Karlsruhe hack weekend in June 2012 a group of us discussed this problem during dinner and we made some good progress. Unfortunately nobody wrote down afterwards what we had talked about. In this blog post I try to build on that discussion and formulate a proposal.

Problems

Before we go into a solution we have to make clear what the problems are with current (multi-)polygons. First, there are several ways of handling what is essentially a single issue. You can have simple (smallish) polygons using closed ways. For larger polygons (like whole countries) or multipolygons you need multipolygon (or boundary) relations. And there are some cases where we don’t use either of those but have yet more special solutions: coastlines (which essentially describe land or water polygons) and river areas modelled by “riverbank” ways come to mind. Having several solutions for essentially the same problem is wasteful and complicated.

The second problem is that some polygons are really small (say a rectangular building) and some are huge, like a whole continent. Any solution must take this into account. The third problem is the large number of broken multipolygons. There have been tools around for years to help fix those broken multipolygons, but they get broken again and again. We (the OSM community) have failed to produce a solution that actually works.

For any solution to be viable we must break with some conventions in the way our data model and our API work. First, the way we edit OSM data has always been: download complete objects, edit them, and upload the new versions of those objects. There have been discussions about partial downloads and partial edits, but these have never materialized. To be able to edit huge objects like country-sized multipolygons, however, it will be essential that we can download and edit parts of an object.

Another time-honored tradition has been that the central OSM database and the API are rather dumb. They do only minimal checking on the data that is being uploaded, allowing broken geometries to be uploaded, for instance, ways with a single node or polygons with self-intersections. If we ever want to have a chance of having country-sized objects without geometry errors, we have to change this. The central database (or the API) has to make sure that we don’t create objects with those problems.
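To make the kind of check described above concrete, here is a minimal sketch in Python of what validating an uploaded ring could look like. It is only an illustration, not how the OSM API works: the function names (check_ring, segments_intersect) are made up for this example, coordinates are treated as planar (x, y) pairs, and the brute-force pairwise comparison would be far too slow for large rings.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), for example (lon, lat) treated as planar


def orientation(a: Point, b: Point, c: Point) -> float:
    """Cross product of (b - a) and (c - a); its sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def within_extent(a: Point, b: Point, p: Point) -> bool:
    """True if p lies inside the axis-aligned extent of segment a-b."""
    return (min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
            and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))


def segments_intersect(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 share at least one point."""
    d1 = orientation(q1, q2, p1)
    d2 = orientation(q1, q2, p2)
    d3 = orientation(p1, p2, q1)
    d4 = orientation(p1, p2, q2)
    if d1 * d2 < 0 and d3 * d4 < 0:
        return True  # proper crossing
    # An endpoint of one segment lies on the other segment.
    return ((d1 == 0 and within_extent(q1, q2, p1))
            or (d2 == 0 and within_extent(q1, q2, p2))
            or (d3 == 0 and within_extent(p1, p2, q1))
            or (d4 == 0 and within_extent(p1, p2, q2)))


def check_ring(ring: List[Point]) -> List[str]:
    """Return the problems found in a closed ring (empty list = acceptable)."""
    if len(ring) < 4:
        return ["too few nodes for a closed ring"]
    if ring[0] != ring[-1]:
        return ["ring is not closed"]
    problems = []
    n_seg = len(ring) - 1
    for i in range(n_seg):
        # Neighbouring segments share a node by construction, so start at i + 2.
        for j in range(i + 2, n_seg):
            if i == 0 and j == n_seg - 1:
                continue  # first and last segment share the closing node
            if segments_intersect(ring[i], ring[i + 1], ring[j], ring[j + 1]):
                problems.append(f"segments {i} and {j} intersect")
    return problems
```

A square such as [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)] passes, while a bow-tie shape or a ring with a single node is rejected. A real implementation would need a spatial index or a sweep-line algorithm instead of comparing every pair of segments.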

Area objects

Given those requirements, it becomes relatively straightforward to think about a solution for the polygon problem. Let’s define an “area” object very similar to a “way” object: a list of references to nodes and some tags. In addition we require that the first and last referenced nodes be identical.
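As a rough illustration of that definition, an area object could be modelled like this in Python. The class and field names (Area, node_refs, tags) are assumptions made for this sketch and not part of any existing OSM software. The minimum of four references (three distinct nodes plus the closing one) is an extra assumption that follows from wanting a ring that encloses something.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Area:
    """Hypothetical 'area' object: like a way, but the node list must be closed."""
    id: int
    node_refs: List[int]
    tags: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumption: a ring needs three distinct nodes plus the closing reference.
        if len(self.node_refs) < 4:
            raise ValueError("an area needs at least four node references")
        # Requirement from the text: first and last reference are identical.
        if self.node_refs[0] != self.node_refs[-1]:
            raise ValueError("first and last node reference must be identical")


# Usage: a small building outline made of three nodes.
building = Area(id=1, node_refs=[10, 11, 12, 10], tags={"building": "yes"})
```

Constructing an Area whose node list does not end where it started raises a ValueError, which mirrors the idea that the data model itself should refuse broken objects.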

How should the API look for editing such things? A request goes to the API to download a given bounding box that the user wants to edit. To be able to work with an area we need all of its nodes inside that bounding box plus, wherever the outline leaves the bounding box, at least the next node outside it. If we have the complete area, all is fine. If not, we need one more thing: we need to know which side is “inside” and which is “outside”. To settle this we simply require that the nodes in an area are ordered so that they always go around the polygon clockwise. (This is arbitrary. It could be counter-clockwise, but clockwise is a common convention.) The editor can now always draw the area properly inside the bounding box it has downloaded. Together with the tags it has all the information it needs.
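Whether a complete ring follows the clockwise rule can be checked with the shoelace formula. The following Python sketch assumes planar (x, y) coordinates with y pointing up (for example longitude and latitude treated as flat), and the function names are chosen just for this example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) with y pointing up, e.g. (lon, lat)


def signed_area(ring: List[Point]) -> float:
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_clockwise(ring: List[Point]) -> bool:
    """True if the closed ring winds clockwise (interior to the right of each edge)."""
    return signed_area(ring) < 0


# Usage: up, right, down, left is a clockwise walk around the unit square.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
print(is_clockwise(square))
```

With clockwise ordering the interior always lies to the right of an edge when walking from one node to the next. That is what would let an editor decide which side to fill even when it only has a fragment of the outline.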

And this representation has another interesting property: If the whole polygon is valid (no self-intersections of the boundary etc.) and if all edits are restricted to the bounding box and the result of those edits is valid, the whole polygon will still be valid. If the user uploads some changes we don’t have to check the validity of the whole polygon, only of the changes. This is very important because otherwise the check would be too expensive.

Why is the whole polygon still valid? I am quite sure that this is the case, but I don’t have a rigorous mathematical proof. Somebody should probably find one to make sure we are not forgetting anything here.

There are, of course, many questions left. For instance, I have described the “area” datatype only for simple polygons, and I have only described outer rings. I think the area datatype should allow multiple outer and inner rings, so that it can describe complete multipolygons as defined by the Simple Features specification.

We might also need some API functions to migrate data, say, to change a closed way into an area. This might be done client-side by just adding an area and removing the way, but we’ll see which approach is better.

Migration

Migration of existing OSM data to the new model is relatively easy, because there is nothing taken away from the current data or API. First we need to implement the new area object in the central database and all the needed API calls. We then need to change editors to allow working with the new areas including download and upload through the new API calls. All other OSM software such as osm2pgsql is changed to allow the new-style areas in addition to old-style areas.

Once all of this is in place and people can actually work with the new area datatype, we automatically convert old-style way-based polygons and multipolygon relations to the new style. We can start with simple polygons (such as all closed ways that are valid polygons and are tagged with building=yes and nothing else), and work our way to more and more complex cases. What’s left over are polygons that are broken in some way or other corner cases that have to be fixed manually. Mappers can just load those cases into an editor, add a new polygon and remove the old ways/relations. Once most of the data is migrated we remove support for old-style polygons from osm2pgsql and other such software. Data that has not been converted by then will no longer show up on the map, which should give the community the push it needs to finish converting those cases, too.
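The first migration step, picking out the simplest candidates, could be sketched as follows in Python. The function name and the idea of passing a way as a node list plus a tag dictionary are assumptions for this example. The geometric validity check (no self-intersections) is left out here and would have to run separately.

```python
from typing import Dict, List


def is_simple_building_candidate(node_refs: List[int], tags: Dict[str, str]) -> bool:
    """True for closed ways tagged with building=yes and nothing else.

    Only the cheap checks are done here; whether the outline is a
    geometrically valid polygon has to be verified in a separate step.
    """
    closed = len(node_refs) >= 4 and node_refs[0] == node_refs[-1]
    # Apart from the closing reference, no node may appear twice.
    no_repeats = len(set(node_refs[:-1])) == len(node_refs) - 1
    return closed and no_repeats and tags == {"building": "yes"}


# Usage: the first way qualifies, the second carries an extra tag.
print(is_simple_building_candidate([1, 2, 3, 4, 1], {"building": "yes"}))
print(is_simple_building_candidate([1, 2, 3, 4, 1], {"building": "yes", "name": "Shed"}))
```

Starting with such a narrow filter keeps the automatic conversion conservative. Anything that fails it simply stays in the old model until a later, more permissive pass or a manual fix.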

Where do we go from here?

We obviously need a lot more discussion in the community to make sure this is a valid proposal and that we can solve all the problems with it that need solving. Maybe I have overlooked something important and it doesn’t work at all? And once we are reasonably sure we have a valid way to go we need buy-in from all relevant parties.

I think it would probably make a good diploma thesis (or something like it) to work out all the details of this new data model, the API and the changes needed to the different kinds of software working with OSM data. (Feel free to approach me if you are interested.)

Update: Several people have pointed out that this could break if there are edges of the polygon going through the downloaded bounding box but without nodes inside the bounding box. Of course we have to download the nodes at the end of those edges, too. Then what I am describing above still works. And of course we have to figure out how to implement all of this efficiently.
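One way to picture the rule from this update: the download has to include both end nodes of every edge that touches the bounding box, whether or not either node lies inside it. The Python sketch below assumes planar coordinates and a bounding box given as (min_x, min_y, max_x, max_y). The function names are hypothetical, and it says nothing about how to do this efficiently on the server.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def segment_intersects_bbox(p: Point, q: Point, bbox: BBox) -> bool:
    """Slab test: does the segment p-q share at least one point with the box?"""
    min_x, min_y, max_x, max_y = bbox
    t0, t1 = 0.0, 1.0  # parameter range of the segment still inside all slabs
    for delta, start, lo, hi in ((q[0] - p[0], p[0], min_x, max_x),
                                 (q[1] - p[1], p[1], min_y, max_y)):
        if delta == 0:
            # Segment is parallel to this slab: it must start inside it.
            if start < lo or start > hi:
                return False
        else:
            ta, tb = (lo - start) / delta, (hi - start) / delta
            if ta > tb:
                ta, tb = tb, ta
            t0, t1 = max(t0, ta), min(t1, tb)
            if t0 > t1:
                return False
    return True


def nodes_to_download(node_refs: List[int], coords: Dict[int, Point],
                      bbox: BBox) -> Set[int]:
    """Collect both end nodes of every edge of the ring that touches the box."""
    needed: Set[int] = set()
    for a, b in zip(node_refs, node_refs[1:]):
        if segment_intersects_bbox(coords[a], coords[b], bbox):
            needed.update((a, b))
    return needed


# Usage: a large ring whose bottom edge crosses a small box without a node in it.
coords = {1: (-10.0, 0.5), 2: (10.0, 0.5), 3: (10.0, 20.0), 4: (-10.0, 20.0)}
print(nodes_to_download([1, 2, 3, 4, 1], coords, (0.0, 0.0, 1.0, 1.0)))
```

In the usage example none of the four nodes lies inside the box, yet nodes 1 and 2 are returned because the edge between them passes through it. A node inside the box is covered automatically, since every edge attached to it touches the box, and so is the next node just outside.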

Tags: osm