The number of objects in OpenStreetMap is growing and with it the number of unique IDs. For node IDs we have gone over 4 billion now. Soon they will not fit into an unsigned 32 bit integer any more (232=4294967296).
Every program that works with OSM data has to take this into account. All of the important libraries and programs have been fixed long ago to store IDs in 64 bit integers. But this comes at a price. Whenever you work with OSM data, you’ll need twice the memory to store the IDs. Say you need to find all the ways any node is in. For this you need some kind of lookup table that maps node IDs to way IDs. Most nodes are in one or more ways, so you need roughly 4 billion nodes times 8 bytes for the node IDs plus 4 bytes for the way IDs (ways are only up to about 340 million so they are fine in 32 bit). You are already at 48 GByte! Instead of the 32 GByte you would need for 32 bit node IDs. (By the way: there are more tricks you can do to save space here but they make your program more complex. Still worth exploring though.)
We’d still have some ID space left if we’d reuse IDs of deleted objects. This would lead to all sorts of problems so we don’t want to do this. But the fact remains there are a lot of unused IDs out there. And when you are not working with the full OSM dataset, but with a smaller extract, there are a lot more IDs you never see. This influences the choice of data structures we can use to work with OSM data. The most efficient way to store something is in arrays, the ID of an object can be used as an index into the array to find this information again. Because we don’t have to store the ID itself in this case, this is a very efficient data structure. But only if the ID space is actually used well. If the ID space is only sparsly populated, this is wasteful, because most elements in the array are empty.
One way to work around all these issues is to renumber all the objects we have and give them smaller, consecutive IDs. If we need the original OSM IDs, for instance because we want to change something in the data and upload new versions of the objects, this doesn’t help us. But in many cases all we need are unique IDs, we don’t care what they are.
I have added a renumber command to the osmium command line tool a while ago, but it was very slow and memory inefficient and so didn’t work on larger extracts. I now rewrote the internals of this command so it will work on sizable extracts and even the whole planet. On my server it can renumber the Germany extract in less than three minutes using about 3 GBytes of RAM and Europe in about 21 minutes using 16 GBytes. The planet will need more than 32 GB RAM. The highest node ID after renumbering Europe is about 1.6 billion, so it will fit comfortably into even a 32 bit signed integer.
If you want to play around with it you need the master version of osmium-tool from Github. Tell me how it worked for you (and find any bugs) so we can release it in the next official version of osmium.
Tags: openstreetmap · osmium