In a recent discussion on the OpenStreetMap talk mailing list about imports in France, a point came up again that has been raised a few times: if we had some notion of layers in OSM data, maybe some tasks such as imports would be easier or could be done in a better way. I want to look into this “layer” issue a little bit.
When I am talking about layers I am not talking about the “layer” tag that is used to model bridges, tunnels and the like. I am talking about layers more in the sense in which they are usually used in GIS software, where all objects of a certain type are in the same layer, but objects of different types are always in different layers. Or these different layers I am talking about could be more like different databases or datasets. Whatever the details, we’ll end up with some data in different places than other data. We’ll talk about these things later.
First I want to talk about what we want these extra “layers” for:
Imports could use “layers” in two ways. The first way is to record the data in its original state before any changes have been made to make it fit into the main OSM database. This is especially important if the original data is later lost, maybe because the original source organisation didn’t keep it or doesn’t make it available any more. Later this data could help when updated data becomes available at the original source, because it lets us identify what changed between the source revisions, and then those changes can be pushed into the main OSM database, too.
The second way is as some kind of staging area for imports. Changes to the original data can be done step by step with the help of the whole community, until everybody is satisfied with the state of the data and it can then be integrated into the main OSM database. This would improve on the current situation where such intermediate steps are usually only available to the one person doing the import, in the form of an .osm file saved from JOSM or so. Note that there are many complex issues to be solved here, such as how merging of the different databases can actually be done. I’ll not go into those. Another advantage of such “trial” databases is that our usual tools would work with them, including the rendering toolchain and quality assurance tools. This would help with getting the import into a usable state.
Another discussion we have had many times is how to store historical data in OSM. One of the biggest problems with our current database setup is that the historical data is only interesting to a relatively small number of people, but everybody editing current OSM data sees the old data, too, cluttering up the view and making editing harder. If we move historical data into its own “layer”, these problems would go away.
The OSM database already contains some data that arguably doesn’t belong in the main database. Some people have, for instance, created large ways showing the areas where we have high resolution aerial images. Strictly speaking that’s not data that describes the planet we live on and it should not be part of the main OSM database. Similarly data about our mapping efforts sometimes is in the main database and sometimes it is not. We have FIXME tags in the main database, the OpenStreetBugs reports live in their own database. We have other databases for systematic re-mapping efforts and so on, storing areas that can be flagged as “todo” or “done”. If we had some generic way of bringing those databases together and at the same time separate from the data about the actual planet we live on, it could make it much easier to develop new tools and easier to integrate them into editors.
Occasionally you need a database to test software or a “sandbox” to play in to learn how OSM works. Of course this can already be done today, just set up your own OSM database. But if we see those databases as “just another layer”, some things might be easier to use and understand. For instance if we have developed the functionality to copy data from the main database to a historical database, the same functionality can be used to populate a sandbox database from the main database.
Sometimes people want to store very specialized data in OSM that doesn’t really fit. One example would be biologists who want to map bird migration paths or the areas where specific species have their nests. Having extra “layers” would allow those people to benefit from all the tools such as editors and renderers that we have developed without cluttering up the main database.
Several “layers” could also help with the level-of-detail problem. OpenStreetMap maps tend to look good in zoom level 17 or 18 and it is reasonably easy to make nice maps for very small zoom levels from, for instance, Natural Earth Data. But in the middle levels OSM data often looks poor. The reason is that we collect the data with a specific amount of detail that’s good for detailed maps, but it is hard to aggregate this data automatically for lower zoom levels. If we had several “layers”, we could have several “OpenStreetMaps”, one for each zoom level (or at least a group of several zoom levels).
Another use case might arise in the future when there are editing wars. In such a case we could freeze the data in certain areas and allow editing only in a copy of the data. Or still allow editing, but use a “sane” copy to generate the map from. Of course there are many technical and social problems with this; maybe we do not want the “balkanization”. Splitting up the map into many small maps might not be a good idea. All I am saying at the moment is that having a “layering” facility might somehow be useful in this context.
Now that we have seen quite a lot of different examples of where such “layers” could be used, we have to think about what they would look like. I can basically see two different designs:
First we could give each object in the database an extra “Layer ID” field. Objects in different layers have different layer IDs and you are never allowed to make connections between objects in different layers. So you can not have a way referencing a node in a different layer. But you can move an object that’s not attached to anything (or a group of objects that only have connections to each other) from one layer to another. This would, for instance, be handy for the historical data use case. When a building is razed, you just move it wholesale into the other layer. Dumps could be done for the whole database and/or per layer. We could even move parts of the database into different layers. Say one layer for the road network and a different one for landcover information. This way you could edit part of the data and ignore the rest. Unfortunately it is not at all clear which types of objects should go into which layers and the interaction with the free tagging system is going to be tricky. And there are many advantages to the current system where a node that’s moved can move the highway and the landcover.
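To make the two rules of this first design concrete (no references across layers, and moving only self-contained groups of objects), here is a minimal sketch in Python. Everything in it is hypothetical: the `LayeredStore` class, the `layer` field and the layer numbers are made up for illustration and are not part of any existing OSM software.

```python
from dataclasses import dataclass, field

MAIN, HISTORICAL = 0, 1  # hypothetical layer IDs


@dataclass
class Node:
    id: int
    layer: int


@dataclass
class Way:
    id: int
    layer: int
    node_ids: list = field(default_factory=list)


class LayeredStore:
    """Toy in-memory store; all layers share one ID space per object type."""

    def __init__(self):
        self.nodes = {}
        self.ways = {}

    def add_node(self, node):
        self.nodes[node.id] = node

    def add_way(self, way):
        # Rule 1: a way may only reference nodes in its own layer.
        for nid in way.node_ids:
            if self.nodes[nid].layer != way.layer:
                raise ValueError(f"way {way.id} would reference node {nid} in another layer")
        self.ways[way.id] = way

    def move_way(self, way_id, target_layer):
        # Rule 2: only move a way if no other way uses any of its nodes.
        way = self.ways[way_id]
        for other in self.ways.values():
            if other.id != way_id and set(other.node_ids) & set(way.node_ids):
                raise ValueError(f"way {way_id} shares nodes with way {other.id}")
        for nid in way.node_ids:
            self.nodes[nid].layer = target_layer
        way.layer = target_layer


# A razed building is moved wholesale into the historical layer.
store = LayeredStore()
for i in (1, 2, 3):
    store.add_node(Node(i, MAIN))
store.add_way(Way(10, MAIN, [1, 2, 3, 1]))
store.move_way(10, HISTORICAL)
```

A real implementation would also have to deal with relations and with tags on the nodes themselves, which this sketch ignores.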
Note that in this setup all layers share the same ID space. This would make it easier to work with the data when you are interested in several layers at once. But it would make it slightly harder if you work on layers separately because you have to lug around those huge IDs even for layers that don’t contain much data. And it has one large drawback: The system can’t be federated. Because we have to give out unique IDs we can only have one central database.
So the second design (which I prefer) works differently. Instead of having different “layers”, we can think of it as having different databases or datasets. Each database works on its own and doesn’t really have much connection to the other databases. All we need now is some way of telling them apart by giving each a unique name or identifier. The standard way to do this on the Internet these days is giving them a URL as identifier. This way everybody can set up a new database without having to ask anybody. The OSMF could run the http://db.osm.org/planet, http://db.osm.org/historical, and http://db.osm.org/bugreports databases or whatever and the Foobar University can run the http://foobar.edu/biology/bird-migrations database. We can store a JSON document with some metadata behind those URLs. Metadata such as name, description, contact, and license. Storing the license could help with imports, for instance when the tools allow copying of data from PD licensed databases to ODbL licensed databases, but not the other way around. Different databases can have different users allowed to read and edit them. We can even throw in some kind of registry server where databases can register (if they want to) to make them easier to find.
All of this is pretty easy as you can already set up your own database using the RailsPort software. All that is needed is some support in the editors and other software that makes accessing multiple databases easier.
Of course I have glossed over many problems with this approach. How can you move or copy data from one database to the other? That will entail a renumbering (unless the target database is completely empty). And you lose the connection between the data in the old database and the new database. But where that’s needed some kind of conversion table could be used. This table stores tuples of something like (timestamp, source-db-url, target-db-url, source-object-id, target-object-id, source-object-version, target-object-version). Or you could write the source URL/ID/version into tags in the target database. There are plenty of things we need to figure out in detail. But once we have some generic tools to move or copy data between databases they will come in handy in many of the use cases described above.
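As a rough illustration of what such a metadata document and a license check in a tool might look like, here is a small Python sketch. The field names follow the list above (name, description, contact, license), but the exact JSON layout, the contact addresses, the license assigned to each database and the `ALLOWED_COPIES` table are all assumptions made up for this example.

```python
import json

# Hypothetical metadata documents, as they might be served at the database URLs.
PLANET_META = json.loads("""
{
  "url": "http://db.osm.org/planet",
  "name": "Planet",
  "description": "Main OSM database",
  "contact": "admin@example.org",
  "license": "ODbL"
}
""")

BIRDS_META = json.loads("""
{
  "url": "http://foobar.edu/biology/bird-migrations",
  "name": "Bird migrations",
  "description": "Bird migration paths and nesting areas",
  "contact": "biology@example.org",
  "license": "PD"
}
""")

# Assumed policy: (source license, target license) pairs a tool would accept.
ALLOWED_COPIES = {
    ("PD", "PD"),
    ("PD", "ODbL"),
    ("ODbL", "ODbL"),
}


def can_copy(source_meta, target_meta):
    """Return True if data may be copied from the source to the target database."""
    return (source_meta["license"], target_meta["license"]) in ALLOWED_COPIES


print(can_copy(BIRDS_META, PLANET_META))  # PD -> ODbL
print(can_copy(PLANET_META, BIRDS_META))  # ODbL -> PD
```

An editor could run a check like this before offering a “copy to other database” action, so the user never gets as far as an upload that the license does not allow.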
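The conversion table mentioned above could be sketched like this in Python. The tuple fields are the ones listed in the text; the function `copy_nodes`, the dictionary layout of the toy “databases” and the choice to start copied objects at version 1 are assumptions for illustration only. Only nodes are handled; for ways and relations the references would have to be rewritten with the same table.

```python
from collections import namedtuple
from datetime import datetime, timezone

# One row of the conversion table, with the fields named in the text.
Mapping = namedtuple("Mapping", [
    "timestamp", "source_db_url", "target_db_url",
    "source_object_id", "target_object_id",
    "source_object_version", "target_object_version",
])


def copy_nodes(source_nodes, source_url, target_nodes, target_url):
    """Copy nodes into target_nodes under fresh IDs and return the conversion table.

    Both node collections are dicts: id -> {"version": int, "tags": dict}.
    """
    next_id = max(target_nodes, default=0) + 1
    now = datetime.now(timezone.utc)
    table = []
    for src_id in sorted(source_nodes):
        obj = source_nodes[src_id]
        # The copy starts its own history in the target database (assumption).
        target_nodes[next_id] = {"version": 1, "tags": dict(obj["tags"])}
        table.append(Mapping(now, source_url, target_url,
                             src_id, next_id, obj["version"], 1))
        next_id += 1
    return table


source = {5: {"version": 3, "tags": {"building": "yes"}},
          9: {"version": 1, "tags": {}}}
target = {1: {"version": 2, "tags": {}},
          2: {"version": 1, "tags": {}}}

table = copy_nodes(source, "http://db.osm.org/planet",
                   target, "http://db.osm.org/historical")
for row in table:
    print(row.source_object_id, "->", row.target_object_id)
```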
And of course even if we have a way of copying data between databases there will be the problem of how we keep all this different data synchronized. What happens if you detect an error in one database and fix it but can’t access a different database or forget to fix related data there?
What about planet dumps? Planet dumps would look the same as they look now except that the source of the dump would be noted in the dump (and in the replication diffs). Not all of the databases need regular dumps of course, but each database “provider” can decide that on his own. Because the database is noted in the file we can make sure never to load data that came from one database into a different one by accident. (Of course if we actually want to do that and make sure we do the proper renumbering of IDs, that’s fine, too.)
Marking the database source in the files can also help in another way: Say I take the dump of the main planet database and filter it, for instance by removing all “created_by” tags which most people are not interested in, to make it easier and faster to handle. The problem is that the resulting OSM file is outwardly indistinguishable from the original and it seems to contain the same objects. But those objects are slightly different! Once we have the database marker in the files, we just mark it as belonging to a different database and there is no danger of getting them mixed up.
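Here is a small Python sketch of how a loader might use such a marker. Note that the `source` attribute on the `<osm>` element is purely hypothetical; the current OSM XML format has no such attribute, and the function and exception names are made up as well.

```python
import xml.etree.ElementTree as ET

# A tiny dump carrying a hypothetical "source" attribute naming its database.
DUMP = """<osm version="0.6" source="http://db.osm.org/planet">
  <node id="1" version="1" lat="48.85" lon="2.35"/>
  <node id="2" version="4" lat="48.86" lon="2.36"/>
</osm>"""


class WrongSourceError(Exception):
    """Raised when a dump does not belong to the database it is loaded into."""


def load_node_ids(xml_text, database_url):
    """Return the node IDs in the dump, refusing dumps from another database."""
    root = ET.fromstring(xml_text)
    source = root.get("source")
    if source != database_url:
        raise WrongSourceError(
            f"dump is from {source!r}, not from {database_url!r}")
    return [int(node.get("id")) for node in root.iter("node")]


print(load_node_ids(DUMP, "http://db.osm.org/planet"))
```

A filtered extract would simply carry a different URL in the marker, so the same check keeps it from being mixed up with the original planet data.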
Once we have made it this far, we could also think about abstracting from the database format used. For the import use case it might make sense, for instance, to store the original source data that might have come from a shapefile in a database that’s more suitable for data formatted in this way. And allowing different database formats could also help when migrating to a new data model in OSM or a new API version. In fact we need at least a way of marking different API versions, because it might not be possible to simply copy data from OSM API 0.6 to OSM API 0.7 databases and we can not expect all databases to switch at the same time.
So where does that leave us? I have presented an idea of how extra “layers” or “databases” could be integrated into the “OSM world”. This is not a proposal (yet) and many details need to be figured out, but it can be the start of a discussion. The most important question we have to decide is whether having such an open federated system is what we really want. On the one hand it allows great flexibility (and OSM is all about flexibility), but on the other hand it could lead to the project being split up into smaller geographic or thematic communities. Once we have such a system in place and the software to support it, it becomes relatively easy to fork OSM. Maybe the community in some country decides they want to do things their own way? Maybe we’ll regret that we have sent all the bird-migration-biologists away to their own database?