[JT] Jochen Topf's Blog
Wed 2022-12-14 11:52

PostgreSQL raster experiments

Like any project, the generalization project I am working on has its share of experiments. Not everything you try out will work in the end, be fast enough, or actually be needed. These experiments often don’t end up in the progress reports or final reports, but without them we can’t create something new. I want to talk about three experiments related to raster data in PostgreSQL/PostGIS: one successful, one failed, and one where the outcome is still open.

(This post is part of a series about generalization of OSM data.)

A successful experiment

As part of the raster-based generalization described in this blog post I needed to look at the generated raster data. It was pretty easy to write out the raster as PNG, because the CImg library I am using can do that with one command (if the PNG library is installed). I wanted to see the image in QGIS though, in the geographically correct place. That’s still pretty easy to do if you know the “world file” trick. You can write out a world file with the same name as the image file (but a different suffix) describing where in the world the image is supposed to be. (This works for many image formats, not only PNG.) QGIS and other software can read that (thanks to the support for world files in the underlying GDAL library) and show the image in the right place.
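For example, a world file for a PNG has the same base name and the suffix .pgw, and contains just six numbers: the pixel width in map units, two rotation terms (usually zero), the pixel height (negative, because image rows count from the top down), and the x and y coordinates of the center of the top-left pixel. With made-up values for a Web Mercator image it could look like this:

    100.0
    0.0
    0.0
    -100.0
    1000050.0
    6000950.0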

But this is still a bit cumbersome, because you have to add each of those images to QGIS as a separate layer. PostGIS supports raster data, though, so it should be possible to add these images to the database and let QGIS read them from there. (I would probably not do that in production and with lots of large images, but adding a few smaller images to the database isn’t a problem.) The common way to get raster images into the database is through the GDAL library, but I didn’t want that overhead, so I looked around for a simpler library but couldn’t find anything. In the end I had to dig up the specs for the raster format and roll my own. The documentation isn’t great, but the format is simple enough that I could implement it without much effort. And from there it was only a few clicks to get the images into QGIS.
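To give an idea of what “rolling my own” means here, a sketch of such a serializer could look like the following. This is not the actual code, just a minimal version assuming a little-endian machine and a single 8BUI (8-bit unsigned) band; the field layout follows the raster WKB spec in the PostGIS source:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Append the raw bytes of a value to the output buffer. Assumes the
    // machine byte order matches the endianness flag written below.
    static void append(std::string& out, const void* data, std::size_t len) {
        out.append(static_cast<const char*>(data), len);
    }

    // Serialize a single-band 8BUI raster into the PostGIS raster
    // wire format ("WKB raster"): header, band header, pixel data.
    std::string raster_to_wkb(uint16_t width, uint16_t height,
                              double scale_x, double scale_y,
                              double upperleft_x, double upperleft_y,
                              int32_t srid,
                              const std::vector<uint8_t>& pixels) {
        std::string wkb;

        const uint8_t endian = 1;    // 1 = little-endian
        const uint16_t version = 0;  // format version
        const uint16_t num_bands = 1;
        const double skew_x = 0.0;
        const double skew_y = 0.0;

        append(wkb, &endian, 1);
        append(wkb, &version, 2);
        append(wkb, &num_bands, 2);
        append(wkb, &scale_x, 8);     // pixel size in x direction
        append(wkb, &scale_y, 8);     // pixel size in y direction (negative)
        append(wkb, &upperleft_x, 8); // upper-left corner of the raster
        append(wkb, &upperleft_y, 8);
        append(wkb, &skew_x, 8);
        append(wkb, &skew_y, 8);
        append(wkb, &srid, 4);
        append(wkb, &width, 2);
        append(wkb, &height, 2);

        // Band header byte: bits 0-3 are the pixel type (4 = 8BUI);
        // bit 6 would mark a valid nodata value, not set here.
        const uint8_t band_flags = 4;
        append(wkb, &band_flags, 1);
        const uint8_t nodata = 0;     // nodata field, same width as pixel type
        append(wkb, &nodata, 1);

        append(wkb, pixels.data(), pixels.size()); // row-major pixel values

        return wkb;
    }

To get the result into the database over a text-mode connection, the bytes are then hex-encoded, which is also where the last experiment in this post comes in.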

But there was a snag: Sometimes it took a long time (several minutes) to load a few images into QGIS. This took me a while to figure out. When QGIS loads a layer for the first time it asks the database for metadata about the layer; for vector layers these are things like “is there a unique id on that table?”. For raster layers it seems to ask for a lot of metadata, like what type of raster data this is, what the extent of the images is, etc. If that metadata is not easily available from the database, QGIS will instead load all the data to figure things out for itself, which needs huge amounts of memory and takes a long time. So you have to make sure that metadata is available, then things are quick. To do that you simply have to run the AddRasterConstraints() function in the database after importing your data. Not really intuitive and not well documented, but some googling found this solution for me. Oh, and you can’t have a filter on those layers (which gets translated to a WHERE clause for the database), because then QGIS again has to get all that metadata itself. So now I am not creating just one table with all the images, but one table per image type (image before/after processing, image for different types of objects).
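For reference, the call is a single SQL statement (the table and column names here are made up):

    -- Computes SRID, scale, extent, pixel type etc. from the loaded data
    -- and stores them as constraints, so clients like QGIS can read the
    -- metadata from the raster_columns view instead of scanning the table.
    SELECT AddRasterConstraints('polygon_raster'::name, 'rast'::name);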

So there was a bit of coding, a lot of experimenting and even more frustrating searching of the web involved to get to this point, but now it all works and I learned a lot about raster images, PostGIS and QGIS in the process.

A failed experiment

The polygon generalization works by rendering the polygons into rasters, doing some magic (described in another blog post) and creating vectors again. The first step, rendering into rasters, is done by reading the vector data from the database and using functions from the CImg library to do the rendering. But now that I know more about raster data in PostGIS, maybe that’s something which could be done in PostGIS?

PostGIS has a function called ST_AsRaster(), described as “Converts a PostGIS geometry to a PostGIS raster”. That sounds promising. So I played around with that a bit. And yes, it can turn vector geometries into raster images. The problem was, I could not get the results I needed: I don’t need one raster per geometry, I need one raster per tile filled with all the geometries in that tile. The documentation is rather terse and there are 10 different versions of that function. I tried lots of different approaches and couldn’t get anywhere. And the web wasn’t all that helpful either. Either nobody has ever done anything like this or I couldn’t find the right way to search for it.
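To illustrate the mismatch: a call like the following returns a separate little raster for every row, not one raster per tile (the table name polygons is made up):

    SELECT ST_AsRaster(geom, 256, 256, '8BUI'::text) AS rast
    FROM polygons;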

In my desperation I was reading through more of the PostGIS docs and found that I had been trying this with the wrong command. There is another command called ST_SetValue(), with the description “Returns modified raster resulting from setting the value of a given band in a given columnx, rowy pixel or the pixels that intersect a particular geometry”. That sentence is somewhat harder to read, but if you read it carefully you see that it basically also says that it renders geometries into pixels. After I found that command it became quite straightforward to create a raster image of the right size and in the right geographical place, add an empty band, and then use ST_SetValue() to render the geometries into it.
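Put together, and with made-up tile coordinates and table names, the whole pipeline fits into one statement:

    -- One raster per tile: create an empty 2048x2048 raster anchored at
    -- the tile's upper-left corner, add an 8BUI band initialized to 0,
    -- then burn all polygons intersecting the tile into it as 255.
    SELECT ST_SetValue(
               ST_AddBand(
                   ST_MakeEmptyRaster(2048, 2048,       -- width, height
                                      1000000, 6000000, -- upper-left corner
                                      10, -10,          -- pixel size
                                      0, 0, 3857),      -- skew, SRID
                   '8BUI'::text, 0),
               1,              -- band number
               ST_Union(geom), -- all geometries in the tile, as one
               255)            -- pixel value to burn in
    FROM polygons
    WHERE geom && ST_MakeEnvelope(1000000, 5979520, 1020480, 6000000, 3857);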

So, yes, I got it to work! But I still classify this as a, sort of, failed experiment, because after all that I decided not to use this functionality. There are two reasons for that: First, the resulting image is larger than the vector data. If I raster the data into, say, 2048×2048 pixel images, each image will be something like 4 MBytes (2048 × 2048 pixels at one byte per pixel is 4,194,304 bytes). I added up the sizes of all the vector data needed to render such an image in some example locations and found that it is almost always smaller. So we are not saving on network bandwidth when transferring that data between database and application. (This will not necessarily be true for all cases, and if we use smaller raster images this might be different.) The second reason is this: Rendering the vectors in the database means we are doing more work in the place where it is most difficult to scale the system. We can easily add more servers outside the database to do the rendering, but scaling the database with replication and all that is more difficult. So from a systems point of view it is better to do as much of the work outside the database as possible and keep the database itself lean and mean.

So again I learned a lot about PostGIS and rasters, but didn’t use what I experimented with. But it might well be that at some point later on I will revisit that decision. Rendering in the database has the great advantage that it can be done with a few SQL commands, which is easier than writing some C++ code. So it might make sense to allow this at least as an option for experimenting.

Another experiment, outcome still open

Raster data is binary data. But when it is read from and written to the database, the current osm2pgsql uses text mode. The database connection is handled through the libpq library, which can send and receive column data in text or binary mode, but the binary mode is not used. To get the binary data through, it is “hex encoded”: each byte is sent as two characters with the hex code of its value (the byte 0x8f, for example, becomes the two characters “8f”), doubling the amount of data.

Handling everything in text mode is a bit simpler, and the amount of data sent this way in osm2pgsql is usually small (it is only used when reading geometries while updating OSM data), but with more data being sent and received in the generalization steps this will change. Only having to send half as much data and saving the encoding and decoding at both ends could be a big win. So I wrote some code to use the binary mode, which works in principle, but it is not as clean and easy to use as I want it to be. And libpq has the problem that you can read all columns in text mode or all in binary mode, but not a mixture, which makes accessing the data a bit more complex.
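A minimal sketch of what using the binary mode looks like (database, table, and column names are made up; most error handling is omitted):

    #include <libpq-fe.h>
    #include <cstdio>

    int main() {
        PGconn* conn = PQconnectdb("dbname=osm");
        if (PQstatus(conn) != CONNECTION_OK) {
            std::fprintf(stderr, "%s", PQerrorMessage(conn));
            return 1;
        }

        // The last argument of PQexecParams() selects the result format
        // for ALL columns at once: 0 = text, 1 = binary. There is no
        // per-column choice, which is the limitation mentioned above.
        PGresult* res = PQexecParams(conn,
            "SELECT rast FROM polygon_raster WHERE id = 1",
            0, nullptr, nullptr, nullptr, nullptr,
            1); // binary results

        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1) {
            // In binary mode the raster column arrives as raw bytes,
            // not as a hex-encoded text string.
            const char* data = PQgetvalue(res, 0, 0);
            int len = PQgetlength(res, 0, 0);
            std::printf("got %d bytes of raster data\n", len);
            (void)data;
        }

        PQclear(res);
        PQfinish(conn);
        return 0;
    }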

So for the moment I kept the old code and will sit on this a bit. Maybe in a few days or weeks I’ll give it another look and see how I can best integrate this. There is also the question of the large data import, which uses the PostgreSQL COPY command. COPY also has a text and a binary format, so maybe that needs changing, too?
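At least on the SQL level that switch is only a format option; the real work would be in the client-side encoding (the table name is again made up):

    -- Text format, the default:
    COPY polygon_raster FROM STDIN;

    -- Binary format: raw bytes with a small header, no hex encoding:
    COPY polygon_raster FROM STDIN WITH (FORMAT binary);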

Tags: generalization · openstreetmap