[JT] Jochen Topf's Blog
Mon 2020-12-21 13:07

Osm2pgsql Middle Improvements

The osm2pgsql code contains a part called the “middle”. The middle is responsible for keeping track of all OSM objects read from the input file(s), their attributes, tags, and, most importantly, the relations between those objects. This is needed for several reasons.

We need the object relations to assemble the geometries of objects. For ways, the node locations are needed; for relations, the member ways, and so on. So the middle stores all node locations and, later, when ways are read, can look up those node locations again when the way geometries are needed for the output. Similarly, it needs to store the list of nodes used in a way, so that later on, when osm2pgsql assembles, for instance, multipolygon geometries from relations, it can look up the member ways and their node locations.
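The core idea can be sketched in a few lines. This is not the actual osm2pgsql code, just a hedged illustration (all names are invented): node locations are stored as they are read, and looked up again when a way's node list has to be turned into a geometry.

```python
# Hypothetical sketch of the middle's geometry lookup, not osm2pgsql code.

node_locations = {}  # node id -> (lon, lat)

def store_node(node_id, lon, lat):
    """Remember a node location when the node is read from the input."""
    node_locations[node_id] = (lon, lat)

def assemble_way_geometry(way_node_ids):
    """Build a way geometry from previously stored node locations.

    A real implementation would have to handle nodes missing from the
    input, e.g. in clipped extracts.
    """
    return [node_locations[nid] for nid in way_node_ids]

# Nodes come first in an OSM file ...
store_node(1, 8.0, 50.0)
store_node(2, 8.1, 50.1)
store_node(3, 8.2, 50.0)

# ... so when a way referencing them is read, its geometry can be built.
linestring = assemble_way_geometry([1, 2, 3])
```

The same pattern repeats one level up: the stored way node lists play the role of the node locations when relation geometries are assembled from member ways.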

Usually objects are processed the moment they are read from the input file and then forgotten about. But when we are using the new two-stage processing feature of the flex backend, we re-process some objects later on. For that, we need to keep track of all objects that we might need later, including their tags.

Usually, object attributes such as the version of an object and the user who last changed it are not relevant for osm2pgsql users. When rendering a normal map, they just don’t matter. But osm2pgsql optionally allows you to store this information (when the --extra-attributes option is specified). For use cases like quality assurance, change detection, etc., this can be useful. In this case the middle has to keep track of these attributes, too.

So far we have only talked about importing data. But when you want to update an existing osm2pgsql database, you also need to keep all the original objects around, so that changes can propagate correctly. When a node changes, for instance, osm2pgsql has to find all ways using that node and make sure to update them, too.
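To find all ways using a changed node, the middle needs the stored data in reverse: an index from node ids to the ways referencing them. A minimal sketch of that idea (invented names, not the osm2pgsql implementation):

```python
# Hypothetical sketch: a reverse index for update propagation.
from collections import defaultdict

ways = {}                        # way id -> list of node ids
node_to_ways = defaultdict(set)  # node id -> ids of ways using it

def store_way(way_id, node_ids):
    """Remember a way's node list and index it by its nodes."""
    ways[way_id] = node_ids
    for nid in node_ids:
        node_to_ways[nid].add(way_id)

def ways_affected_by_node_change(node_id):
    """When a node moves, every way using it must be rebuilt."""
    return sorted(node_to_ways[node_id])

store_way(10, [1, 2, 3])
store_way(11, [3, 4])

affected = ways_affected_by_node_change(3)  # both ways use node 3
```

The same reverse lookup is needed from ways (and nodes) to the relations that reference them, so that relation geometries can also be updated.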

There are two distinct cases that the middle must support: In an import-only workflow you are just importing the database once and then never update it. (If you need an updated version, you throw it away and re-start from scratch.) In this case the middle has to store its data only during a single run of osm2pgsql. But when you have a workflow based on an initial import and later updates, the middle has to keep its data around and re-use it on update runs.

This leaves us with three basic options where the middle can store its data:

  1. Only store the data in memory (for the import-only case)
  2. Store the data in the database
  3. Store the data in the file system

In reality you probably want some combination of those. A typical configuration these days stores the node locations in a flat node file on the filesystem and the rest of the data in the database, possibly with an additional in-memory cache to speed things up.
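The flat node file works because node ids are dense integers: with one fixed-size record per node id, a location can be found with a single seek instead of a database query. The following is a rough sketch of that idea, not the real osm2pgsql file format (the record layout and class are invented):

```python
# Hypothetical sketch of the flat-node-file idea, not the real format.
import os
import struct
import tempfile

RECORD = struct.Struct("<ii")  # lon, lat as fixed-point 1e-7 degrees
SCALE = 10_000_000

class FlatNodeFile:
    """One fixed-size record per node id; the node id is the file offset."""

    def __init__(self, path):
        self.f = open(path, "r+b")

    def set(self, node_id, lon, lat):
        # Seek directly to this node's slot; the OS fills gaps sparsely.
        self.f.seek(node_id * RECORD.size)
        self.f.write(RECORD.pack(round(lon * SCALE), round(lat * SCALE)))

    def get(self, node_id):
        self.f.seek(node_id * RECORD.size)
        x, y = RECORD.unpack(self.f.read(RECORD.size))
        return (x / SCALE, y / SCALE)

fd, path = tempfile.mkstemp()
os.close(fd)
fnf = FlatNodeFile(path)
fnf.set(42, 8.1234567, 50.7654321)
location = fnf.get(42)
```

The trade-off is that the file size is governed by the highest node id, not by the number of nodes stored, which is why this layout pays off for full-planet imports but not for small extracts.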

So between the different use cases, different processing options, different input file sizes from small extracts to full planets, different hardware used by different users, etc. there are quite a lot of configurations a middle must support. And it has to do that efficiently. It doesn’t make sense to, for instance, keep the object attributes around when you never need them.

The existing middle code has grown organically over the years. But today’s PostgreSQL databases are different from what they were 10 years ago, and our needs have changed. It is time to re-evaluate all of this and see whether we can come up with better solutions. As an added difficulty, we have to think about backwards compatibility, so that users don’t have to re-import their databases unnecessarily.

As part of my work for the OSMF I have looked into this, cleaned up the code in and around the existing middle implementations, and cleaned up the internal APIs. This code is already in master. I also wrote some new proof-of-concept code for modernized middle implementations. This is now available in a draft pull request.

I have looked at two use cases at the ends of the spectrum and written two middle implementations for them: On one end, a middle that supports the import-only workflow for small and medium-sized input files. Node locations and way node lists are stored efficiently in memory. No tags or attributes are stored and updates are not possible, but imports are fast and memory-efficient. Typical use cases for this middle would be fire-and-forget databases set up to test something, or databases that store pre-filtered data such as lists of specific POIs.
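“Efficiently in memory” here means avoiding the per-entry overhead of a general hash map. One common technique (shown as an illustration only; this is not the actual implementation, and all names are invented) is to keep node ids and fixed-point locations in dense packed arrays and look them up with binary search, exploiting the fact that OSM files list nodes in increasing id order:

```python
# Hypothetical sketch of a packed in-memory node store.
from array import array
from bisect import bisect_left

SCALE = 10_000_000  # fixed-point 1e-7 degrees

class PackedNodeStore:
    """Dense parallel arrays instead of a hash map with per-entry overhead."""

    def __init__(self):
        self.ids = array("q")   # node ids, appended in sorted order
        self.lons = array("i")  # fixed-point longitudes
        self.lats = array("i")  # fixed-point latitudes

    def append(self, node_id, lon, lat):
        # OSM files list nodes in increasing id order, so simply
        # appending keeps the id array sorted.
        self.ids.append(node_id)
        self.lons.append(round(lon * SCALE))
        self.lats.append(round(lat * SCALE))

    def get(self, node_id):
        # Binary search over the sorted id array.
        i = bisect_left(self.ids, node_id)
        if i == len(self.ids) or self.ids[i] != node_id:
            raise KeyError(node_id)
        return (self.lons[i] / SCALE, self.lats[i] / SCALE)

store = PackedNodeStore()
store.append(100, 8.0, 50.0)
store.append(200, 8.5, 49.5)
loc = store.get(200)
```

With 8 bytes for the id and 4 bytes each for longitude and latitude, this costs 16 bytes per node, a fraction of what a dictionary of tuples would need.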

On the other end, I implemented a new middle using the database as a backing store. It stores everything and allows updates. It is slow, but it supports all use cases.
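The essential property of a database-backed middle is persistence: what the import run stores, a later update run can read back. A rough sketch of that shape, with SQLite standing in for PostgreSQL purely for illustration (the table and column names are invented, not the osm2pgsql schema):

```python
# Hypothetical sketch of a database-backed middle; SQLite stands in for
# PostgreSQL, and the schema is invented for illustration.
import sqlite3

db = sqlite3.connect(":memory:")  # a real middle would use a persistent DB
db.executescript("""
    CREATE TABLE nodes (id INTEGER PRIMARY KEY, lon REAL, lat REAL, tags TEXT);
    CREATE TABLE ways  (id INTEGER PRIMARY KEY, node_ids TEXT, tags TEXT);
""")

# Import run: store every object, including tags, so updates are possible.
db.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)", (1, 8.0, 50.0, "{}"))
db.execute("INSERT INTO ways VALUES (?, ?, ?)",
           (10, "1,2,3", '{"highway": "residential"}'))

# Update run: the stored way can be looked up again and rebuilt.
row = db.execute("SELECT node_ids FROM ways WHERE id = ?", (10,)).fetchone()
```

Every lookup is a database round trip, which is why this variant is slow, but nothing is ever thrown away, so every processing option and the update workflow keep working.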

These implementations are meant to start a discussion. They will not end up in the final code as they are now. They show options to users and help us think about the different use cases and how best to present the options to the users. And they allow us to evaluate whether the internal APIs between the middle code and other parts of osm2pgsql make sense.

With this I am concluding the work funded by the OSMF over the last months. But of course this is not the end of development. I’ll keep working on this in the future. Depending on whether I can find further funding development will be faster or slower.

As a next step we are planning an osm2pgsql developer and user virtual meetup in January to get user input and talk about anything related to osm2pgsql. We’ll announce a date soon.

Tags: OSMF · openstreetmap · osm2pgsql · software development