Over the last few months I have been busy working on a project for Mapbox. As part of that project I have spent a lot of time improving Osmium. And not just with a few changes here and there, but with a more or less complete redesign based on nearly three years of experience developing and using Osmium.
One of the biggest drawbacks of Osmium has always been that it was single-threaded, and its design made that difficult to change. Efficient multithreading needs different data structures, and it is especially important not to use too much dynamic memory allocation, because allocating and deallocating memory from the shared heap typically requires synchronization between threads (unless you use your own allocator). So I redesigned the way OSM objects are stored internally. They are now stored in large buffers, each containing many objects, instead of in lots of little bits and pieces of memory. This step alone made creation of those objects about 10% faster. It is also an important building block for making things work better when running multithreaded. As an added benefit, these objects are trivial to serialize: just write the buffer memory to disk or to a socket. A different process (potentially on a different host) can read the data in again and use the objects directly.
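To make the buffer idea a bit more concrete, here is a minimal sketch of the general technique. It is not Osmium's actual implementation: `NodeRecord`, `RecordBuffer`, and all member names are hypothetical, and the record is deliberately a simple fixed-size struct, whereas real OSM objects have variable-length parts such as tags. It also assumes writer and reader agree on byte order and struct layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical fixed-layout record; NOT Osmium's real object layout.
struct NodeRecord {
    std::int64_t id;
    std::int32_t lon;  // coordinate stored as a scaled integer
    std::int32_t lat;
};

// Hypothetical buffer: many records packed into one contiguous allocation
// instead of one heap allocation per object.
class RecordBuffer {
    std::vector<unsigned char> m_data;

public:
    explicit RecordBuffer(std::size_t capacity_bytes) {
        m_data.reserve(capacity_bytes);  // one up-front allocation
    }

    // Append the raw bytes of a record to the end of the buffer.
    void add(const NodeRecord& record) {
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&record);
        m_data.insert(m_data.end(), p, p + sizeof(NodeRecord));
    }

    std::size_t count() const { return m_data.size() / sizeof(NodeRecord); }

    // Copy a record out of the buffer (memcpy avoids alignment problems).
    NodeRecord get(std::size_t index) const {
        assert(index < count());
        NodeRecord record;
        std::memcpy(&record, m_data.data() + index * sizeof(NodeRecord),
                    sizeof(NodeRecord));
        return record;
    }

    // "Serialization" is just exposing the raw bytes; they could be
    // written to a file or a socket as they are.
    const unsigned char* data() const { return m_data.data(); }
    std::size_t size_bytes() const { return m_data.size(); }

    // Rebuild a buffer from raw bytes, e.g. after reading them back in.
    static RecordBuffer from_bytes(const unsigned char* bytes, std::size_t size) {
        assert(size % sizeof(NodeRecord) == 0);
        RecordBuffer buffer(size);
        buffer.m_data.assign(bytes, bytes + size);
        return buffer;
    }
};
```

Adding a record never touches the allocator as long as the reserved capacity is not exceeded, and the bytes handed to `from_bytes` can be used again without any parsing step.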
I have started to bring multithreading into Osmium, but adding it to all the different parts will be a longer process. OSM XML files are now read in a separate thread from the main program, and when reading PBF files a configurable number of threads can be used. Preliminary benchmarks indicate that PBF reading is about twice as fast this way. But there are a lot of improvements that can and will be made later.
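The general pattern of reading in a separate thread can be sketched roughly like this. This is only an illustration under my own assumptions, not the code in Osmium: `ChunkQueue` and `read_in_background` are hypothetical names, and plain strings stand in for parsed chunks of OSM data.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue that hands chunks from a reader thread
// to the main thread; not an Osmium class.
template <typename T>
class ChunkQueue {
    std::queue<T> m_queue;
    std::mutex m_mutex;
    std::condition_variable m_ready;
    bool m_done = false;

public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push(std::move(value));
        }
        m_ready.notify_one();
    }

    // Signal that no more chunks will arrive.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_done = true;
        }
        m_ready.notify_all();
    }

    // Blocks until a chunk is available; returns false once the queue
    // has been closed and drained.
    bool pop(T& value) {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return m_done || !m_queue.empty(); });
        if (m_queue.empty()) {
            return false;
        }
        value = std::move(m_queue.front());
        m_queue.pop();
        return true;
    }
};

// The background thread plays the role of the parser, while the calling
// thread consumes chunks as they become available.
std::vector<std::string> read_in_background(const std::vector<std::string>& chunks) {
    ChunkQueue<std::string> queue;
    std::thread reader([&queue, &chunks] {
        for (const auto& chunk : chunks) {
            queue.push(chunk);
        }
        queue.close();
    });

    std::vector<std::string> received;
    std::string chunk;
    while (queue.pop(chunk)) {
        received.push_back(chunk);
    }
    reader.join();
    assert(received.size() == chunks.size());
    return received;
}
```

With a single producer and a single consumer the chunks arrive in the order they were pushed. With several decoding threads, as in the PBF case, some extra bookkeeping would be needed to keep the output in file order.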
In 2011 the long-awaited new C++ version was standardized. C++11 has numerous improvements: it not only makes programming in C++ much easier and more fun, it can also bring performance gains. C++11 is also the first C++ standard that includes multithreading as part of the language and its standard library. Before that, several different multithreading approaches were used on different operating systems.
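As a small illustration of what the standard now offers out of the box, the following sketch splits some work across a configurable number of tasks using only `std::async` and `std::future`, with no platform-specific threading API. The function `parallel_sum` is a made-up example and has nothing to do with Osmium itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Sums a vector by splitting it into up to num_tasks slices, each summed
// in its own asynchronous task.
long long parallel_sum(const std::vector<int>& values, std::size_t num_tasks) {
    assert(num_tasks > 0);
    std::vector<std::future<long long>> results;

    // Slice length, rounded up so that all elements are covered.
    const std::size_t slice = (values.size() + num_tasks - 1) / num_tasks;

    for (std::size_t begin = 0; begin < values.size(); begin += slice) {
        const std::size_t end = std::min(begin + slice, values.size());
        results.push_back(std::async(std::launch::async, [&values, begin, end] {
            return std::accumulate(values.begin() + begin,
                                   values.begin() + end, 0LL);
        }));
    }

    // get() waits for each task to finish and returns its partial sum.
    long long total = 0;
    for (auto& result : results) {
        total += result.get();
    }
    return total;
}
```

The same source should compile unchanged on any platform with a conforming C++11 compiler, although on some systems the thread library still has to be requested at link time (for instance with `-pthread` when using GCC).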
Because of the many benefits I decided to require C++11 for the redesigned Osmium. Unfortunately this means it will not work with older compilers and older systems. But the standard was a long time in the making and compiler writers have had a long time to add support. Both GCC and clang support all relevant features. I have tested both GCC 4.8 and clang 3.2 and they work great. And the error reporting in clang is now so much better than in GCC and in earlier versions of clang that it is often quite easy to figure out what went wrong.
There are many other changes in the new Osmium. Some as simple as renaming a class or function to make it easier to understand. Some as complex as the redesign of the way handlers work. And it is not complete yet. The biggest missing piece is the multipolygon support. There are also many parts that need tidying up and some interfaces that are not quite tied down yet. I would not recommend it yet for serious development, but I encourage you to take a look, try it out, and report problems.
One important step is to try it with as many use cases as possible to see whether it supports them properly and make sure the interfaces are okay, before we set them in stone. And I need your help with that. Tell me what you have been doing with Osmium (or not doing, because it didn’t work for some reason) and we can have a look at whether and how the new Osmium can do these things.
Until now Osmium has lived in my personal GitHub space and the documentation is on the OSM wiki. But now with the project growing up and becoming bigger it should get its own space. For this I have created the osmcode organization on GitHub and the osmcode.org domain. Currently these places contain only the new Osmium repository, but I will add more Osmium-related stuff soon.
Many thanks to Mapbox for supporting this work!
I’ll be giving a talk at State of the Map 2013 in September in Birmingham about “High Performance OSM Data Manipulation With Osmium”. I hope to see you there!