Thu 2020-04-30 20:45

Paying off technical debt in Osm2pgsql

Osm2pgsql is one of the work horses of the OpenStreetMap world. It is used to import OSM data into a database which is then used for rendering maps (among other uses). Osm2pgsql is behind the map you see on openstreetmap.org and many others. The program has been around since 2006 and has accumulated many changes and additions over time. And the code has grown more and more complex. Part of that is inevitable because the program does a lot more than when it was started. But part of that is complexity that doesn’t have to be there. For a few months now I have been working on reducing that complexity and want to tell you about that work.

Accumulating technical debt is a totally normal thing for any piece of software. As the developer works on the code, their focus is, naturally, on those features they need or the bugs that need to be fixed. Everything else around the changes you do is often ignored. Only later you recognize that something is more complex that it needs to be, that functions have the wrong names, because their purpose changed, that some code has become superfluous, because the world around the software changed. Only with hindsight you realize that something could have been done better. Osm2pgsql has been around a long time and it had many different developers working on it, each with their own style, strengths and weaknesses. Many eyes make the code better, but they also lead to inconsistencies. From the early versions written in C, the code moved to C++03 and later to the newer C++11 version. But there is never enough time to clean up everything, so there are still parts in there that are more C than C++.

Technical debt is especially problematic for Open Source software. We want our code to be clean and easy to understand so that new developers can come in and help. New developers usually come because they have a specific problem in mind they want to fix or a feature they need added. If the code is reasonably easy to understand and well modularized, they can understand the part they need to change and do the change. But if the code is hard to understand and the pieces all depend on each other, they need more time than they are willing to spend on their problem to even understand what’s currently going on in the code. Therefore the cleaner the code is, the greater the chances are you’ll get more people to help.

As I write this there are 55 open issues on Github. Some of them are several years old. Often they don’t look that complicated to solve. But if you look closer, you can often see that they need a lot of structural changes under the hood which developers are reluctant to do. Untangling some of the more complex innards of Osm2pgsql can help us make those issues easier to solve.

For the last months I have been working on Osm2pgsql, paying off technical debt. Like in many other cases I started working on this because I wanted to do some specific changes that were more difficult than expected. I’ll talk about the flex backend work that started all this in a later blog post. Unlike in many other cases I could convince somebody to pay for this work. Companies are often willing to pay for new features or bug fixing, but not when they can’t see the immediate benefits for them. So I am really lucky that I could convince Andy Allan of Thunderforest and Frederik Ramm of Geofabrik, both heavy users of Osm2pgsql to each pay for part of this work.

So what is it that I am doing exactly? Part of it is simply cleaning up code, fixing formatting to make it more consistent, changing variable names where they don’t match the coding standard, etc. These are tiny things but collectively they do make a difference. Our brains have evolved to pick out differences, if something is always the same it fades into the background. If routine things all look alike, they disappear and the developer looking at the code can concentrate on the actual things that matter.

It’s a similar issue with warnings from compilers and other tools. If you get zillions of warning, you ignore them. Most of them aren’t a problem probably, they are only warnings after all. But there are a few which are important. Only which ones are important? Better to fix them all. If you get only an occasional warning, you see it and fix it, making the code a little bit better. Previous developers on Osm2pgsql have done a great job at fixing compiler warnings. With very reasonable warning level settings there are no warnings from GCC and Clang (there are still warnings on Windows). But clang-tidy, a tool that gets more and more popular, still finds a lot of issues. I am working through them to either fix them or disable those that don’t apply to us. But all this is only a tiny part, more interesting are the bigger changes.

One simple example are a few functions used in several places in the code with names like nodes_add() and ways_delete(). When I saw them for the first time I was confused because of the plural there. All these functions ever do is add a single node or delete a single way, not more than one. Maybe in earlier versions of the code they could add multiple nodes? Maybe the thinking when naming them was “I have many nodes here and add another one”? Anyway, after I had convinced myself that the singular would be better here (and that wouldn’t want to change the functions to accept several nodes or ways in the near future), I changed the names to node_add() and way_delete(). I tiny change for sure, but a step in the right direction. Many tiny steps make for big changes over time (look at the OSM map for an example :-). Maybe one day I (or somebody else) will change the name again to add_node() and delete_way() which sounds a bit better… Just like when adding data to OSM my goal is always to do improvements step by step. If you want to make it perfect from the start, you’ll never get there. Changes in one place let you see other places that can be slightly improved and so on. A virtuous cycle.

Here is another, more involved, example: Osm2pgsql has quite complex code to keep track of OSM objects and their changes to make sure changes in objects are reflected in dependent objects as well. So changes in nodes trigger a “rebuilding” of the ways they are in and so on. The code for this is somewhat spread out over the code base and it isn’t easy to follow. Before I can change it to add new features, I need to understand it and add some tests to make sure I am not breaking anything in the process.

So I am looking at some part of the code called “middle” and am trying to figure out exactly what it does and writing tests for it. It uses something called an id_tracker which keeps track of the ids of ways or relations that we still need to process. The id_tracker has a member function called size() which is called quite often. The size seems to be important somehow, so I write simple code to test it: Put one id in the tracker and make sure there is one id in there now. Only the test fails. There are two ids in there! Diving deeper I find out that the id_tracker does some magic internal things I don’t quite understand yet, but it seems it is on purpose that the size is wrong. Hm. Okay, start from a different perspective: What code is checking the size and does it matter if it is correct or not?

Turns out there are essentially two reasons why some piece of code is looking at the size: One is to print it out for the user. It doesn’t matter much if it says there are 17 ways still pending or 18. This display is just giving the user a rough idea about the amount of work still to do, are there 17 ways or are there 17,000 ways still pending?

The other uses always look something like this ... if (tracker.size() > 0) .... So they don’t look at the size itself at all, all they want to know is whether there is something in the id_tracker or not. So if the size is not quite right, it doesn’t matter. The important difference is between “zero” and “more than zero”. So I added a new function empty() which does this check and returns a boolean, the code above becomes ... if (!tracker.empty()) .... Above that in the call chain there is a function called pending_size() which can now be removed and replaced by a function called has_pending().

It all seems like a tiny change and it is. But to make that tiny change I had to understand how everything fit together and that the size didn’t matter at all here. This gives me several things now: a) the code is easier to understand because I am talking about what matters (that there is work pending) in that context, b) it might be faster, because generally speaking checking if something is there is easier then counting how many somethings are there (though in this specific instance it will probably not make a difference), and c) most importantly I learned that I can change the id_tracker code itself (which I want to do at some point because it is quite complex and probably overcomplicated) without having to care about the exact size reported because nobody is interested in that anyway.

This kind of work leads you deeper and deeper into the rabbit hole, before you can finally find out something about the code, fix it, test it, or document it. And then move on to the next thing. Every little piece doesn’t take that much time, but there are a lot of these little pieces. And some are much more complex than the examples above. But chipping away at the code piece by piece makes it better. It is like undoing a complex knot that seems impossible at first, but if you thread the yarn through loop after loop it becomes easier and easier.

This work will never end, while Osm2pgsql evolves we’ll have to do this again and again. I am not the first one to do this, other developers have already done a lot of cleaning up. But some cruft has accumulated and my current goal is to spend some time explicitly working on that to get it out of the way. Then I can get back to what triggered this whole thing, working on actual feature improvements. I’ll talk about those ideas and features in later blog posts.

Tags: openstreetmap · osm2pgsql · software development