Faster OpenStreetMap imports 10/10: summary and conclusion

As is my wont, the preceding blogs are rambly and incoherent. This final one will hopefully put a neat bow on it all.

Step 0: decompress (not recommended)

This is optional and not recommended: the initial .osm.pbf file can be decompressed, but it turns out the I/O saved by leaving it compressed outweighs any decompression cost, so it's better not to. It was useful early on when I needed to look at things in a hex editor, but it causes slowdowns now.

Step 1: nodeindex

The first step in the whole process is to build a mapping from node ID to location which can be efficiently searched. We'll call this a .nodeidx file.
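
To make this concrete, here's a minimal lookup in Go against the naive layout from attempt 1 in part 7 (a sorted array of fixed-width ID/location records). The real index is refined considerably later, and is memory mapped rather than read into a slice:

// nodeidx lookup sketch: binary search over sorted fixed-width records.
package main

import (
    "encoding/binary"
    "fmt"
    "sort"
)

const recSize = 16 // 8-byte node ID + 4-byte lat + 4-byte lon

// lookup binary-searches the raw index bytes for id and returns its location.
func lookup(idx []byte, id int64) (lat, lon int32, ok bool) {
    n := len(idx) / recSize
    i := sort.Search(n, func(i int) bool {
        return int64(binary.LittleEndian.Uint64(idx[i*recSize:])) >= id
    })
    if i == n || int64(binary.LittleEndian.Uint64(idx[i*recSize:])) != id {
        return 0, 0, false
    }
    lat = int32(binary.LittleEndian.Uint32(idx[i*recSize+8:]))
    lon = int32(binary.LittleEndian.Uint32(idx[i*recSize+12:]))
    return lat, lon, true
}

func main() {
    // Tiny hand-built index standing in for a .nodeidx file.
    ids := []int64{101, 205, 333}
    locs := [][2]int32{{450000000, 90000000}, {451000000, 91000000}, {452000000, 92000000}}
    idx := make([]byte, len(ids)*recSize)
    for i := range ids {
        binary.LittleEndian.PutUint64(idx[i*recSize:], uint64(ids[i]))
        binary.LittleEndian.PutUint32(idx[i*recSize+8:], uint32(locs[i][0]))
        binary.LittleEndian.PutUint32(idx[i*recSize+12:], uint32(locs[i][1]))
    }
    fmt.Println(lookup(idx, 205)) // 451000000 91000000 true
    fmt.Println(lookup(idx, 999)) // 0 0 false
}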

Step 2: tagfilter

Once we have all the locations saved, we want to minimize the size of the input file by discarding unnecessary nodes. This is done via tagfilter, and all it does is remove nodes which would never make it to Postgres, so that subsequent steps are faster. We get about a 20% saving, which is a non-trivial amount of work to skip.
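
The per-node test itself is trivial - a sketch of it in Go, where the tag list is just a stand-in for the carto style's actual delete rules:

// tagfilter sketch: the per-node keep/drop decision.
package main

import "fmt"

// Hypothetical subset; the real list comes from the carto style.
var uninteresting = map[string]bool{
    "created_by": true,
    "source":     true,
}

// dropNode reports whether a node would never make it to Postgres: once
// the tags nobody renders are stripped, an untagged node is pure geometry,
// and its location already lives in the .nodeidx.
func dropNode(tags map[string]string) bool {
    for k := range tags {
        if !uninteresting[k] {
            return false // at least one interesting tag survives, keep it
        }
    }
    return true
}

func main() {
    fmt.Println(dropNode(map[string]string{"created_by": "JOSM"})) // true
    fmt.Println(dropNode(map[string]string{"amenity": "pub"}))     // false
}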

Possible enhancement: re-compress the file on output.

Possible enhancement: remove unused ways and relations too.

Step 3: noderewrite

This step combines a .nodeidx and a .osm.pbf file, such that all the ways contain location data instead of node references. In order to manage memory usage, this process runs over the .osm.pbf multiple times, using a subset of the .nodeidx each time.
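
Conceptually, each pass does something like the following to every way, with an in-memory map standing in for the slice of the .nodeidx loaded for that pass (the real tool of course works on PBF blocks, not Go structs):

// noderewrite sketch: one pass of replacing node references with locations.
// Each pass only has the part of the .nodeidx covering IDs in [lo, hi);
// IDs outside that range are resolved by a later pass.
package main

import "fmt"

type loc struct{ lat, lon int32 }

type way struct {
    refs []int64 // node IDs; an entry is zeroed once it has been resolved
    locs []loc   // same length as refs, filled in position as IDs resolve
}

func rewritePass(w *way, idx map[int64]loc, lo, hi int64) {
    for i, id := range w.refs {
        if id == 0 || id < lo || id >= hi {
            continue // already resolved, or belongs to another pass
        }
        if l, ok := idx[id]; ok {
            w.locs[i] = l
            w.refs[i] = 0 // OSM IDs are positive, so 0 marks "done"
        }
    }
}

func main() {
    w := &way{refs: []int64{10, 30, 20}, locs: make([]loc, 3)}
    rewritePass(w, map[int64]loc{10: {1, 2}, 20: {3, 4}}, 0, 25) // first pass: IDs 0-24
    rewritePass(w, map[int64]loc{30: {5, 6}}, 25, 50)            // second pass: IDs 25-49
    fmt.Println(w.refs, w.locs) // [0 0 0] [{1 2} {5 6} {3 4}]
}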

Possible enhancement: re-compress the file on output.

Step 4: wayindex

Next up, we extract all the ways (and their locations) from our rewritten .osm.pbf into a binary-searchable .wayidx file.
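
A lookup against it could look roughly like this, assuming the two-part layout sketched in part 9: a sorted table of (way ID, offset) pairs followed by a flat stream of locations. For simplicity this sketch recovers each run's length from the next way's offset, whereas part 9's sizing budgets an explicit per-way count, and the real data is delta-encoded varints rather than fixed-width fields:

// wayidx lookup sketch: way table + flat location stream.
package main

import (
    "encoding/binary"
    "fmt"
    "sort"
)

type loc struct{ lat, lon int32 }

// wayLocations binary-searches the way table for id and returns its run of
// locations. table holds 16-byte (ID, offset) records; locs holds 8-byte
// (lat, lon) records.
func wayLocations(table, locs []byte, id int64) []loc {
    const rec = 16
    n := len(table) / rec
    i := sort.Search(n, func(i int) bool {
        return int64(binary.LittleEndian.Uint64(table[i*rec:])) >= id
    })
    if i == n || int64(binary.LittleEndian.Uint64(table[i*rec:])) != id {
        return nil
    }
    start := binary.LittleEndian.Uint64(table[i*rec+8:])
    end := uint64(len(locs) / 8)
    if i+1 < n {
        end = binary.LittleEndian.Uint64(table[(i+1)*rec+8:])
    }
    out := make([]loc, 0, int(end-start))
    for j := start; j < end; j++ {
        out = append(out, loc{
            lat: int32(binary.LittleEndian.Uint32(locs[j*8:])),
            lon: int32(binary.LittleEndian.Uint32(locs[j*8+4:])),
        })
    }
    return out
}

func main() {
    // Tiny hand-built example: way 7 has two locations, way 9 has one.
    table := make([]byte, 32)
    binary.LittleEndian.PutUint64(table[0:], 7)
    binary.LittleEndian.PutUint64(table[8:], 0)
    binary.LittleEndian.PutUint64(table[16:], 9)
    binary.LittleEndian.PutUint64(table[24:], 2)
    locs := make([]byte, 24)
    for i, v := range []uint32{11, 12, 21, 22, 31, 32} {
        binary.LittleEndian.PutUint32(locs[i*4:], v)
    }
    fmt.Println(wayLocations(table, locs, 7)) // [{11 12} {21 22}]
    fmt.Println(wayLocations(table, locs, 9)) // [{31 32}]
}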

Step 5: relindex

For the final step of pre-processing, combine the .osm.pbf with the .wayidx file to create a .relidx file. This file is not binary searchable; it is intended to be linearly scanned. It's a sequence of (relation id, (way id, (location, ...), ...)).

To restate: the .relidx contains, in order, all the data that osm2pgsql will need to process each relation without accessing the database.
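
Conceptually, consuming it during the import looks like this - the types and the in-memory slice are stand-ins for the on-disk encoding, and the key property is that we only ever read forward, because the .relidx is written in the same relation order the import will see:

// relidx sketch: a linear sequence of per-relation records.
package main

import "fmt"

type loc struct{ lat, lon int32 }

type relWay struct {
    wayID int64
    locs  []loc
}

type relRecord struct {
    relID int64
    ways  []relWay
}

// relScanner walks the (conceptual) .relidx front to back.
type relScanner struct {
    recs []relRecord // stands in for the on-disk stream
    pos  int
}

// seek advances to the record for relID, or reports that the file has no
// entry for it. It never rewinds: both the PBF and the .relidx present
// relations in the same order.
func (s *relScanner) seek(relID int64) (*relRecord, bool) {
    for s.pos < len(s.recs) && s.recs[s.pos].relID < relID {
        s.pos++
    }
    if s.pos < len(s.recs) && s.recs[s.pos].relID == relID {
        return &s.recs[s.pos], true
    }
    return nil, false
}

func main() {
    s := &relScanner{recs: []relRecord{
        {relID: 5, ways: []relWay{{wayID: 70, locs: []loc{{1, 2}, {3, 4}}}}},
        {relID: 9, ways: []relWay{{wayID: 71, locs: []loc{{5, 6}}}}},
    }}
    if r, ok := s.seek(9); ok {
        fmt.Println(r.relID, r.ways[0].wayID, r.ways[0].locs) // 9 71 [{5 6}]
    }
}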

Because this needs to binary search the .wayidx, which may not fit in memory, this step also splits the work into multiple passes; however, the amount of work is based on the size of the .relidx, not the .osm.pbf.

Possible enhancement: make this multi-threaded. This doesn't fit the existing framework; as a proof of concept it's currently straight-line, single-threaded code.

Step 6: osm2pgsql import

Finally, we are ready to perform the import. Let's look at the individual steps.

Step 6a: process a node

Original

  • (Possibly) store the node in the permanent database
  • Store the node in cache
  • Store the node in the temporary database

New

  • Store the node in the permanent database

Step 6b: process a way

Original

  • For each node in the way, find the corresponding node location in the cache or temporary database
  • (Probably) store the way in the permanent database
  • Store the way in the temporary database

New

  • (Probably) store the way in the permanent database

Step 6c: process a relation

Original

  • Store the relation in the temporary database
  • For each way in the relation, find the list of nodes in the temporary database
  • For each node in each way, find the list of node locations in the cache or temporary database
  • (Probably) store the relation in the permanent database

New

  • Read forward in the .relidx file to find the relation
  • Read the required relation data from .relidx
  • (Probably) store the relation in the permanent database

Step 6d: finalizing the database

Original

  • Create indexes for the permanent database

New

  • Create indexes for the permanent database

...

:(

Conclusion

By doing a bunch of pre-processing, we have two major differences:

  • All steps can control the amount of memory required. With a regular import, a small cache doesn't mean you require less memory; it means you're swapping when you access the database.
  • We've eliminated all database reads except for creating indexes. If your database doesn't fit entirely in memory, you can end up processing relations at 5-15/second (at least, those are the numbers I had attempting to import planet), which comes out to about 9-10 days.

We can't do anything about the indexing, except give Postgres access to more memory.

Result

Timings for planet:

  • nodeindex: 23:33
  • tagfilter: 24:35
  • rewritenodes: 6:17:57 (4 passes)
  • wayindex: 17:15
  • relindex: 2:13:55 (10 passes); this phase is currently single-threaded and could potentially have been as low as 1/12 of this time.
  • import (w/out index): 6:23:33
  • import (index only): 3d 13:17:40

total time: 16:00:48 + 3d 13:17:40 = 4d 5:18:28*

squirrel@emu:~/src/osm2pgsql/build$ make -j12 && ./osm2pgsql planet-200907-d-f-r.osm.pbf --relation-index planet-200907-d-f-r.relidx --host 192.168.16.2 --user postgres --database act --style ../../openstreetmap-carto/openstreetmap-carto.style --slim --hstore --drop --tag-transform-script ../../openstreetmap-carto/openstreetmap-carto.lua --number-processes 12
osm2pgsql version 1.3.0 (1.3.0-changed)

Using lua based tag processing pipeline with script /home/squirrel/src/openstreetmap-carto/openstreetmap-carto.lua
Using projection SRS 3857 (Spherical Mercator)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads

Reading in file: planet-200907-d-f-r.osm.pbf
Using PBF parser.
Processing: Node(140369k 47.0k/s) Way(695842k 38.96k/s) Relation(8115737 3743.4/s)  parse time: 23013s
Node stats: total(140369384), max(7882275697) in 2986s
Way stats: total(695842170), max(844877676) in 17859s
Relation stats: total(8115737), max(11590153) in 2168s
...
All indexes on planet_osm_point created in 23949s
All indexes on planet_osm_roads created in 39635s
All indexes on planet_osm_line created in 160748s
All indexes on planet_osm_polygon created in 307060s
squirrel@emu:~/src/tltools$ ./tile --lat1 -85 --lat2 85 --long1 -180 --long2 180 --zoom 5 --width 1200 --height 566 --output-image world.png --layer admin-low-zoom

Where to get it

* It's entirely possible this whole project could be substituted with an SSD, I dunno. I had fun.

Faster OpenStreetMap imports 9/n: v2

Looking at the time it takes to import the US (5h) vs the planet (10-14 days), it's clear there's a discrepancy somewhere. The planet is about 7.3x larger, yet takes 50-70 times longer to import.

It's fairly apparent that this is due to database access, as osm2pgsql needs to query Postgres for the ways that it finds in each relation (and then do a lookup for the nodes, but we've already removed that).

Dealing with this is a bit troublesome. One approach would be to extend the PBF format and figure out how to embed the locations directly into the relation, perform yet another rewrite of the .osm.pbf, modify libosmium to consume the modification, and modify osm2pgsql to consume the modification in libosmium. The other approach would be to write a new index, and modify osm2pgsql to look up from the index instead of the database.

As an estimate for what this would do, let's imagine an index that looks like:

  • Way data: Way ID + offset into the location data for the start of that way's locations
  • Location data: For each way, a series of locations

When we look at the planet, we get the following stats:

  • 695,842,170 ways
  • 7,482,480,390 locations (formerly nodes) referenced by ways

For argument's sake, let's say that each way and location takes 8 bytes.

  • Way data: 8 bytes * 695,842,170 ways = 5GB
  • Location data: 8 bytes * (7,482,480,390 locations + 695,842,170 ways*) = 65.4GB

This is approximately 70GB of data, and we don't get to do multiple passes. Using delta-encoded varints**, we can get it down to 41GB, which is still not viable to use without multiple passes.
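
Delta-encoded varints are nothing exotic: consecutive values are close together, so each one is stored as the (signed, zig-zag encoded) difference from the previous value, which usually fits in one to three bytes. A sketch in Go, with made-up numbers:

// Delta + varint sketch: binary.PutVarint zig-zag encodes signed values,
// so small deltas become one or two bytes instead of a fixed 4 or 8.
package main

import (
    "encoding/binary"
    "fmt"
)

// encodeDeltas writes each value as a varint of its delta from the previous one.
func encodeDeltas(vals []int64) []byte {
    buf := make([]byte, 0, len(vals)*2)
    tmp := make([]byte, binary.MaxVarintLen64)
    var prev int64
    for _, v := range vals {
        n := binary.PutVarint(tmp, v-prev)
        buf = append(buf, tmp[:n]...)
        prev = v
    }
    return buf
}

// decodeDeltas reverses encodeDeltas for a known number of values.
func decodeDeltas(buf []byte, count int) []int64 {
    out := make([]int64, 0, count)
    var prev int64
    for i := 0; i < count; i++ {
        d, n := binary.Varint(buf)
        buf = buf[n:]
        prev += d
        out = append(out, prev)
    }
    return out
}

func main() {
    lats := []int64{407128000, 407128350, 407127980} // fixed-point, close together
    enc := encodeDeltas(lats)
    fmt.Println(len(enc), "bytes instead of", len(lats)*4) // 9 bytes instead of 12
    fmt.Println(decodeDeltas(enc, len(lats)))              // original values back
}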

How can we do multiple passes then?

There's two options for this:

  1. Modify osm2pgsql to do multiple passes, which honestly seems like a huge pain.
  2. Take the way index, and use it to index the relations, then use the relation index in osm2pgsql. This is the approach taken.

Side note: while digging through the libosmium code, I noticed that ways can already contain lat/lon coordinates, although this isn't specified in the protobuf definition. It's also not used by osm2pgsql, even if it's present.

* Extra overhead to determine the number of locations in a way

** This stopped being theoretical a while ago

Faster OpenStreetMap imports 8/n: optimizing the pbf

After establishing that flat nodes may or may not be better, let's make sure it's not better by applying some additional optimizations. We're going to take a multi-pronged attack to optimize things here.

Prong 1: multi-threaded rewriting

We can process a PBF faster by multi-threading the handling. There are three parts to this:

  • Read the basic header information ("process", single thread)
  • Parse and mutate the blob ("worker", multi threaded)
  • Output something ("writer", single thread, optional)

This was basically the first bit of code written, because it was obvious that this whole saga would require multiple tools which all follow this same pattern.
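
A rough sketch of that shape in Go, with channels and a placeholder mutation standing in for the real blob handling:

// Pipeline sketch: one goroutine reads blobs in order, a pool of workers
// transforms them in parallel, and a single writer emits them in the
// original order.
package main

import (
    "fmt"
    "sync"
)

type blob struct {
    seq  int    // original position, so the writer can keep file order
    data string // stands in for a compressed PBF blob
}

func main() {
    in := make(chan blob, 16)
    out := make(chan blob, 16)

    // "process": single thread walking the file and handing out blobs.
    go func() {
        for i, d := range []string{"nodes", "ways", "relations"} {
            in <- blob{seq: i, data: d}
        }
        close(in)
    }()

    // "worker": several threads doing the expensive decode/mutate/encode.
    var wg sync.WaitGroup
    for w := 0; w < 4; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for b := range in {
                b.data = "rewritten " + b.data // placeholder mutation
                out <- b
            }
        }()
    }
    go func() { wg.Wait(); close(out) }()

    // "writer": single thread; buffers by seq so output order matches input.
    pending := map[int]blob{}
    next := 0
    for b := range out {
        pending[b.seq] = b
        for nb, ok := pending[next]; ok; nb, ok = pending[next] {
            fmt.Println(nb.data)
            delete(pending, next)
            next++
        }
    }
}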

Prong 2: removing dead entities

As we've established, the vast majority of nodes are purely location information for ways. This means that if we want to, we can just remove them. There's a list of tags which the OSM carto style drops; if there are no tags left afterwards, the node is ignored. We can apply the same logic, except we can do it multi-threaded.

Ways and relations can also be given the same filtering, although there is less to remove, so it's probably not worth it.

Prong 3: removing the node lookup

Much of the series has been about this, but it's important to re-iterate that even with flat nodes, the flat nodes memory is required during import to Postgres, which potentially takes memory away from Postgres. When we rewrite the node IDs as locations, Postgres doesn't need to be running, which gives us more memory.*

Did we do better?

Applying the entire process to the US takes the following timings:

  1. Build index: 2m18s
  2. Remove dead nodes: 2m55s
  3. Rewrite live nodes (3 passes): 13m43s
  4. Import: 4h42m

It turns out that decompressing the file before processing also increases the time to run the first two steps (the 2nd step outputs a decompressed file), which was surprising. This may not be the case when running on an SSD.

And planet:

  • Build node index: 23m10s
  • Remove dead nodes: 29m13s
  • Rewrite live nodes (3 passes): 8h29m27s total, 2h49m49s/pass
  • Rewrite live nodes (4 passes): 9h47m49s total, 2h26m57s/pass
  • Rewrite live nodes (5 passes): 4h35m56s total, 55m11s/pass
  • Rewrite live nodes (6 passes): 4h30m36s total, 45m6s/pass
  • Importing: the server had an unrelated failure after about a week, while still processing relations (at about 10/second); however, the estimate was 10-14 days, not including the final Postgres indexing.

Several runs of the rewrite give us some insight into the limitations, specifically that we're swapping. The active size of the index is inversely proportional to the number of passes; if that's too large to retain in memory, we hit disk. Nobody likes hitting disk (remember, my disk is rust.)

The remove dead nodes step could likely be merged into the rewrite live nodes step, since a non-trivial amount of the work is simply decoding and encoding the data. Further, there's a bunch of dead data which could be removed:

  1. Ways with no tags
  2. Relations with no tags
  3. Nodes and relations in a relation definition
  4. Unused tags in all entity types

* In practice I rarely bother to close Postgres because I am very lazy.

Faster OpenStreetMap imports 7/n: optimizing the index

Attempt 1

Attempt 1 of the index was simply a huge array (on disk) with pairs of ID/Location. The ID could be binary searched, and then the location read out from the data next to it.

This was a decent proof of concept; however, it also means that half the working set is wasted, since while searching we only ever compare IDs and ignore the locations sitting next to them until we find a match.

But can we do better?

Attempt 2

Attempt 2 of the index was to de-interleave the previous array. Two files were written out, one with the IDs, and one with the locations. At the end, the locations file was shoved onto the end of the IDs file.

This was successful, and improved lookup times generally. It was with this attempt that rewriting the planet took 60 hours.

But can we do better?

Attempt 3

The further we reduce the working set, the better our memory utilization. The better our memory utilization, the fewer passes we need over planet. The ID list is the primary data set to shrink.

Attempt 3 is a list of tuples of the form (first id, last id, offset into the location array). The location array is unchanged and occurs at the end of the file. Combining multiple IDs into a single block yields two primary advantages:

Fewer comparisons

There are approximately 113 million index elements, and 6.3 billion nodes in planet. Performing a binary search over the nodes requires ~33 comparisons, whereas searching the index elements takes only ~27. Since there are roughly as many node lookups as there are nodes in the file (it's not 1:1, but it's close enough), that's 6 comparisons * 6 billion lookups = ~36 billion comparisons skipped.

Less memory usage

Each index element is 24 bytes. With 113 million elements, that's approximately 2.5GB (with 50GB of location data).

But can we do better?

Attempt 4

An index element needs to know three things: the start of a range, the end of a range, and where to find the location data the element refers to. If we structure this as:

  • 8 bytes of data for the ID
  • 3 bytes of data for the size of the range (in practice the maximum range appears to be about 150k, 3 bytes lets us go up to 16m)
  • 5 bytes of data for the offset to the location (this lets us store a maximum of 1000 billion locations, which is plenty)

... then we only need 16 bytes instead of 24. It also fits into cache lines better, but that probably doesn't matter. 16 bytes with 113 million index elements is 1.7GB, which will happily fit in memory, even on a Raspberry Pi 4. Since the code is already written to run multiple passes, it could also be run on a lesser Pi.
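
A sketch of the packing - the field order and byte layout here are illustrative rather than the actual file format, but they show the 8 + 3 + 5 split fitting comfortably into 16 bytes:

// Packed index element sketch: 8 bytes of first ID, 3 bytes of range
// length, 5 bytes of location offset.
package main

import (
    "encoding/binary"
    "fmt"
)

type element struct {
    firstID int64  // first node ID in the range
    count   uint32 // number of IDs in the range (fits easily in 3 bytes)
    offset  uint64 // index into the location array (fits in 5 bytes)
}

func pack(e element) [16]byte {
    var b [16]byte
    binary.LittleEndian.PutUint64(b[0:8], uint64(e.firstID))
    // Pack count (24 bits) and offset (40 bits) into the remaining 8 bytes.
    binary.LittleEndian.PutUint64(b[8:16], (uint64(e.count)&0xFFFFFF)|(e.offset<<24))
    return b
}

func unpack(b [16]byte) element {
    lo := binary.LittleEndian.Uint64(b[8:16])
    return element{
        firstID: int64(binary.LittleEndian.Uint64(b[0:8])),
        count:   uint32(lo & 0xFFFFFF),
        offset:  lo >> 24,
    }
}

func main() {
    e := element{firstID: 7882275697, count: 150000, offset: 6_300_000_000}
    fmt.Println(unpack(pack(e)) == e) // true: nothing lost in the squeeze
}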

But can we do better?

What about flat nodes?

osm2pgsql supports a cache mode known as flat nodes. This is (if I understand correctly) an on-disk array with a length of the highest node ID, which stores every single node location. It's suggested for use only when importing planet, because the file size is a function of the highest node ID (currently ~50GB, and it only ever grows). It should be even faster than the index implemented above, because it's literally a 1:1 lookup - there is no searching the index: the location for the node with ID 34 is stored 34*8 bytes into the file.

So why not use that?

Well, the goal is to import the US; planet is simply being used as a stress test. It probably is better to use flat nodes for planet, and it may even be better to use it for the US.

But can we do better?

Faster OpenStreetMap imports 6/n: smashing dependencies for fun and profit

Our original diagram described a PBF file structure:

But consumption of it looks more like this:

The cache is indicated by the red line. If we can do better there, we win. Or to phrase it another way, if we can remove the dependency of ways on nodes, then we don't need a cache.

Removing the dependency

A way contains a list of node IDs. A way does not care about the tags on a node, only its location. Node IDs are 64 bits. A location is 2 * 32-bit values*.

What if we just replace all the node IDs in a way with the actual node locations?

To do this, we need to pre-process a file in two steps:

  1. Parse the file and build an index of node -> location on disk (6.3 billion nodes make it impractical to keep in memory)
  2. Parse the file again and replace all the node IDs in each way with the raw location, using the index.

At first blush, this seems to be very similar to the existing cache system, except the index is on disk. However, it provides a set of advantages:

  1. The pre-processing can be run without Postgres, which frees up more memory (removing the cache also allows more memory for Postgres later)
  2. The process can be run multi-threaded, although it is still probably going to fall back to disk speed (but see point 4)
  3. The index is memory mapped, allowing the OS to swap data in and out as it sees fit (but see point 4)
  4. We can perform multiple passes over the location data, which means only part of the index is required at a time, ensuring we can keep everything in memory. At large data sizes, this means performance becomes passes * RAM speed instead of 1 * disk speed

Doing some basic measurements, the average cache miss costs osm2pgsql about 100µs. A miss occurs when any node is not in the cache, and the real cost is the database access. On the flip side, if all 10 nodes in a 10-node way miss, the cost is still a single database access, so it's hard to judge the true cost of this. Lacking C++ experience, it's harder for me to drill into these costs.

With the on-disk index, however, lookups are typically between 8-16µs**. Without going into details, this does take advantage of things such as IDs typically being sequential.

The US

Running this on my system for the US, the index does generally stay entirely in memory, but it's pushing it.

The planet

Indexing planet takes about 35 minutes. Attempting to apply it to the planet fails - the required working set is just too large, and swapping ensues. This is fairly trivial to manage, however, by doing multiple passes over the file. The index is known to be in increasing order, so we can chop it up into segments and run the file through once per segment. This takes about 6 hours for a 10% pass, or 2.5 days in total, which is about as long as it took to import the US on the same system.
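
The segmenting itself is simple because the index is sorted: pick ID boundaries so that each segment's slice of the index fits the memory budget, then run the whole file through once per segment. A sketch with invented numbers:

// Multi-pass sketch: cut a sorted ID list into contiguous [lo, hi) ranges
// so that each pass keeps at most maxEntries index entries resident.
package main

import "fmt"

func passRanges(sortedIDs []int64, maxEntries int) [][2]int64 {
    var ranges [][2]int64
    for start := 0; start < len(sortedIDs); start += maxEntries {
        end := start + maxEntries
        hi := int64(1<<63 - 1) // last pass covers everything remaining
        if end < len(sortedIDs) {
            hi = sortedIDs[end]
        }
        ranges = append(ranges, [2]int64{sortedIDs[start], hi})
    }
    return ranges
}

func main() {
    ids := []int64{1, 2, 5, 9, 14, 20, 31, 40}
    for i, r := range passRanges(ids, 3) {
        fmt.Printf("pass %d: IDs [%d, %d)\n", i+1, r[0], r[1])
    }
    // pass 1: IDs [1, 9)
    // pass 2: IDs [9, 31)
    // pass 3: IDs [31, 9223372036854775807)
}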

But can we do better?

* Each part of a location (lat/long) is actually 0.000000001 * (perBlockOffset + (perBlockScale * location)). The default scale is 100, and I've only seen it higher, not lower. With the scale in place, the range is -180,000,000,000 to 180,000,000,000, which doesn't fit in a 32-bit value; with the scale factored out, however, the range is -1,800,000,000 to 1,800,000,000, which does fit in 32 bits. In theory we lose precision by factoring out the scale; in practice it's always 100 or 1000, and the offset is 0. In double practice, the precision we lose is less than 1cm.
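
To make that concrete, a small sketch of the conversion using the defaults described above (scale 100, offset 0):

// Coordinate sketch: nanodegrees = offset + scale*raw, so with the usual
// offset of 0 and scale of 100 the stored raw value spans only +/-1.8
// billion and fits a 32-bit int, while the full nanodegree value would not.
package main

import "fmt"

const (
    granularity = 100 // nanodegrees per raw unit (the "scale" above)
    offset      = 0
)

// toDegrees expands a stored 32-bit raw value back to degrees.
func toDegrees(raw int32) float64 {
    return float64(offset+int64(granularity)*int64(raw)) * 1e-9
}

func main() {
    fmt.Println(toDegrees(1_800_000_000)) // 180 - the edge of the range still fits in int32
    fmt.Println(toDegrees(-334_865_000))  // -33.4865, a plausible latitude
}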

** for now