Tue 2023-07-25 16:46

Improving the middle of osm2pgsql

In osm2pgsql there is that code that stores the raw data it gets from the input files and makes it available to other parts of osm2pgsql for further processing. This code is called, in osm2pgsql-speak, the “middle”. There are two implementations of this “middle”, one stores the data in RAM and doesn’t support updates. The other stores the data in the database and supports updates. Back in 2020 I started looking into that code to see what could be improved. In 2021 we released version 1.5.0 which contained a basically rewritten RAM middle. And now, in 2023, I finally got around to improving the database middle.

There were several things in the old database middle which weren’t that great that we wanted to address.

Database table format

First, the format osm2pgsql uses in the database is rather strange. Tags are stored as an array with alternating keys and values, relation members are stored in an even stranger format. That’s done for historical reasons and it works well enough for osm2pgsql itself, but it made it rather awkward to access this data from SQL. With modern versions of PostgreSQL we have better options for storing this data, in both these cases we can use the JSONB type to efficiently store and access this data.

This change reduces the database size by nearly 10% and only slows down the import by a few minutes for the whole planet.

Storing attributes

The old format stores OSM object attributes (timestamp, version, changeset, user id, and user name) as pseudo-tags in the same place the other tags are stored. This is not only rather inefficient, it also means we have to disentangle real and pseudo-tags when we read the data again.

The new code simply stores the attributes in their own database columns. If an attribute isn’t there, a NULL is stored. The exception is the user name. Because the user id and user name duplicates information it is much more efficient to just store the user id with each OSM object and have an extra lookup table for the user names. This has the added benefit that when a user changes their user name, the lookup table will be updated the next time the user changes an object, this way the data is always consistent.

Because most users don’t need those attributes, the attribute storage is only enabled with a specific command line option. That did’t change.

Storing the attributes used to make the database more than 50% larger. With these changes the added storage needed is negligable. Processing time has improved somewhat but is still much larger compared to processing time without attributes. (This is not because of anything in the “middle”, but because with attributes enabled osm2pgsql will send all nodes through the output processing, not just nodes with tags.)

Node tags

The old format never stored node tags in the database, because historically this was not needed for osm2pgsql. Way and relation tags need to be stored so that when a change in a node or way comes in, all parent ways and relations can also be updated. But because nodes are never the parent of another object the node tags are never needed.

But we want to do more complex processing today. For a while now we have the so-called two-stage processing which allows special processing of relation members. But that only works for way members and not for node members, because we were missing those node tags.

The new code can now store the node tags, too. You have to ask for this using a command line option, because not everybody will need this and storage costs quite a bit of disk space (as I write this about 30GB plus 5GB index for the whole planet). With the new format you can now access this data from SQL. The code to tie this in with two-stage processing isn’t written yet though, more work to do…

Performance

There were several problems with the way osm2pgsql was processing the data and accessing the database which were fixed in this effort. Some of these we have known about for a long time, other issues only were diagnosed now that we were looking into all the details.

For each changed node or changed way, osm2pgsql asks the database for all ways and relations the changed objects are in. The new code asks only once for all nodes and once for all ways. This change alone can mean a 10x speed increase!
We used to process some ways and relations twice if a member node or way changed and also the object itself changed. That’s just wasteful. Saves about 10% processing time to not do this.
When deleting objects, the database would in some cases not realize that it should use an existing index to figure out what to delete. So it used a full table scan which turns an operation that should basically take no time at all into one that can take 10 seconds or more. And we have many of those queries. Simply forcing the database to use the index fixes this.
For each object read from a change file we have to delete a possibly existing earlier version of that object from the database. But many objects are brand new so they can’t possibly exist in the database. So we now get the maximum node, way, and relation ids that exist in the database and check all incoming objects against those. For every object with larger than the known max id, we omit the delete, which gets us a slight performance boost.

There are more places where performance can probably be improved, but it needs a lot of testing with different scenarios to figure out which change actually improves performance and which makes it worse. Quite often the database (or more often the complex interaction between database and osm2pgsql) behaves differently than what the user (or developer) would expect. It needs a lot of time to figure out exactly what’s happening and why and how the different pieces interact.

Our test server (graciously sponsored by FOSSGIS e.V.) spent many days and weeks doing planet imports and updates in different configurations to help figure out what works and what doesn’t.

The Properties table

There is another “piece of infrastructure” that we recently added. The osm2pgsql_properties table stores some information about how osm2pgsql was called and about the input data. This means that when you are running updates (with osm2pgsql --append) osm2pgsql already knows some of the settings it needs and you don’t have to specify them on the command line any more.

This feature comes in handy now that we have the old “legacy” database format and the shiny “new” database format. For some time osm2pgsql needs to support both formats so that users can move over when it is convenient for them to do a re-import. The format is stored in that properties table so that it is automatically used for updates.

So what have we gained?

For the first time we now have a documented database format so that users can “officially” use it instead of treating it as some internal thing that nobody is allowed to touch. This allows more specialized processing of OSM data involving several related objects, for instance barriers on roads or something like that.

For the first time it is now possible to easily store all OSM data with all tags and all attributes in a database using osm2pgsql. This opens up new use case for users who don’t just want a rendering database but do statistics or special analytics on raw OSM data. Before now you had to use Osmosis (or some other specialized tools) to do these things, now you can use osm2pgsql (and get all the other features of osm2pgsql at the same time).

While the new database format takes slightly more processing time than the old format this is more than made up with the performance improvements we have done. Users running minutely updates will probably only see slight improvements if at all, but if you import a larger update, processing time might only be a tenth of what is was before.

Status

All this great new stuff is available in the “master” branch at the moment, but we’ll do a release soon. The new database format is still marked as experimental, because we want to have the option to change the format if user testing should reveal some bigger flaw. You need to activate it using a command line option. See the Documentation for all the details. In further releases we’ll finalize the format and sooner or later make it the default.

Many thanks to Thunderforest and Geofabrik for supporting this work!

Tags: openstreetmap · osm2pgsql