Some months ago I introduced Taginfo and mentioned in passing that its statistics collection is based on a C++ framework called Osmium. I promised another post, and I am delivering on that promise now. I have been working on Osmium on and off over the last six months or so and, while it is far from perfect, I think it could actually be useful for other people now. So let’s dive right in. What is Osmium and what is it good for?
Osmium is a C++ framework for working with OSM data files. Osmium will parse OSM files (XML or PBF) and call back into handlers for each object (node, way, relation) it encounters along the way. Osmium tries to be a very thin layer with as little overhead as possible. It doesn’t do anything with the OSM data, not even store it. It just gives you those objects one after another. The handlers it calls can then do interesting things with those objects.
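To make the callback idea more concrete, here is a minimal, self-contained sketch of the pattern. It does not use Osmium’s real classes: `Object`, `Handler`, `CountHandler` and `dispatch` are hypothetical names invented for this illustration, and the `dispatch` function merely stands in for the parser.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an OSM object; not Osmium's real class.
enum class ObjectType { node, way, relation };
struct Object { ObjectType type; std::int64_t id; };

// A handler gets one callback per object type. The defaults do nothing.
struct Handler {
    virtual ~Handler() = default;
    virtual void node(const Object&) {}
    virtual void way(const Object&) {}
    virtual void relation(const Object&) {}
};

// Example handler: counts the objects it is shown.
struct CountHandler : Handler {
    std::size_t nodes = 0, ways = 0, relations = 0;
    void node(const Object&) override { ++nodes; }
    void way(const Object&) override { ++ways; }
    void relation(const Object&) override { ++relations; }
};

// Stand-in for the parser: hands each object to the handler, one after
// another, and keeps nothing itself.
void dispatch(const std::vector<Object>& input, Handler& handler) {
    for (const Object& obj : input) {
        switch (obj.type) {
            case ObjectType::node:     handler.node(obj);     break;
            case ObjectType::way:      handler.way(obj);      break;
            case ObjectType::relation: handler.relation(obj); break;
        }
    }
}
```

The point of the design is that the dispatching side stays thin: whatever state is needed (here, three counters) lives in the handler, not in the framework.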
Of course you can write your own handlers, but Osmium already comes with a few handlers:
The Statistics handler counts the number of nodes, ways, relations, tags, etc. It’s pretty simple, really, and probably a good starting point if you want to write your own handler.
Even simpler is the Bbox handler that calculates the bounding box for the input data from the node locations.
The NodeLocationStore handler will store the location of each node in memory and then use this data to build the way geometries. There are two different ways of storing the data: one is better suited to smaller OSM files, the other to larger ones.
The TagStats handler is used for creating the statistics for Taginfo.
A more useful handler is the Multipolygon handler. It assembles proper multipolygons from relations tagged with type=multipolygon or type=boundary. It can even correct some common mistakes, such as rings that were not properly closed. If you add this handler to your application, you’ll get an additional callback for every (multi)polygon. This handler works only if you read the input file twice: on the first pass, the handler stores information about all multipolygon relations in memory; on the second pass, it assembles them from the node and way data.
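The two-pass idea behind the Multipolygon handler can be sketched roughly as follows. This is a heavily simplified illustration and not Osmium’s implementation: all type and member names are hypothetical, each member way is treated as a ring of its own (real assembly has to join several ways into one ring and sort out inner and outer rings), and the "repair" step only appends the first node to a ring that is not closed.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-ins; not Osmium's real classes.
using Id = std::int64_t;
struct Way      { Id id; std::vector<Id> node_refs; };
struct Relation { Id id; bool is_multipolygon; std::vector<Id> way_refs; };

struct TwoPassAssembler {
    std::vector<Relation> relations;  // kept from pass 1
    std::set<Id> wanted_ways;         // ways that some multipolygon needs
    std::map<Id, Way> ways;           // filled in pass 2

    // Pass 1: remember multipolygon relations and the ways they refer to.
    void relation(const Relation& r) {
        if (!r.is_multipolygon) return;
        relations.push_back(r);
        wanted_ways.insert(r.way_refs.begin(), r.way_refs.end());
    }

    // Pass 2: keep only the ways that pass 1 marked as needed.
    void way(const Way& w) {
        if (wanted_ways.count(w.id)) ways[w.id] = w;
    }

    // After pass 2: collect the rings of one relation, closing open ones.
    std::vector<std::vector<Id>> rings(const Relation& r) const {
        std::vector<std::vector<Id>> result;
        for (Id ref : r.way_refs) {
            auto it = ways.find(ref);
            // Skip ways that were missing from the input or have no nodes.
            if (it == ways.end() || it->second.node_refs.empty()) continue;
            std::vector<Id> ring = it->second.node_refs;
            // Simplistic repair: close the ring if first and last node differ.
            if (ring.front() != ring.back()) ring.push_back(ring.front());
            result.push_back(ring);
        }
        return result;
    }
};
```

The sketch shows why two passes are needed: the relations name the ways they want, and only with that list in memory can the second pass decide which ways are worth keeping.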
Not everybody can or wants to write C++. That’s why Osmium optionally embeds the Google V8 Javascript engine. The same callbacks you get in C++ you can also get in Javascript. Because V8 compiles the Javascript down to machine code, this is actually quite fast, and you still have the flexibility and easier coding of Javascript. You might have thought of Javascript as a language just for web browsers, but it is a general-purpose language like many others and actually quite well suited for this task because of all the effort Google has put into making it fast. Note that this is not some kind of browser integration: there is no DOM and no web page to show. All you get is the Javascript language (with some extensions built in).
So how does this work in practice? If you want to use Osmium with the embedded Javascript, you can just use the osmjs application that comes with Osmium. You call it from the command line with a Javascript file and an OSM file and it does its work. If you want to work in C++, you need the Osmium source code and have to write your own handlers and application. You can use the handlers provided and the existing tagstats and osmjs applications to see how all of this fits together.
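For the C++ route, a handler of your own can be very small. The following is a rough sketch in the spirit of the Bbox handler described above. It assumes a hypothetical `Node` type with plain `lon` and `lat` members and a hypothetical `node()` callback; the real Osmium classes and callback signatures will differ, so take the names from the Osmium source rather than from here.

```cpp
#include <algorithm>
#include <limits>

// Hypothetical node type; Osmium's real node class will look different.
struct Node { double lon; double lat; };

// Grows a bounding box as node callbacks arrive. It stores no nodes,
// only the four extreme coordinates seen so far.
struct BboxHandler {
    double min_lon = std::numeric_limits<double>::max();
    double min_lat = std::numeric_limits<double>::max();
    double max_lon = std::numeric_limits<double>::lowest();
    double max_lat = std::numeric_limits<double>::lowest();
    bool empty = true;  // true until the first node has been seen

    void node(const Node& n) {
        min_lon = std::min(min_lon, n.lon);
        min_lat = std::min(min_lat, n.lat);
        max_lon = std::max(max_lon, n.lon);
        max_lat = std::max(max_lat, n.lat);
        empty = false;
    }
};
```

Because the handler only updates four numbers per node, its memory use does not grow with the size of the input file.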
Osmium is available from GitHub. You’ll find more information in the README files. The documentation is currently pretty basic. Feel free to ask me questions if you want to use Osmium and don’t understand something.