Faster OpenStreetMap imports 8/n: optimizing the PBF
After establishing that flat nodes may or may not be better, let's make sure it's not better by applying some additional optimizations. We're going to mount a multi-pronged attack here.
Prong 1: multi-threaded rewriting
We can process a PBF faster by multi-threading the handling. There are three parts to this:
- Read the basic header information ("process", single thread)
- Parse and mutate the blob ("worker", multi-threaded)
- Output something ("writer", single thread, optional)
This was basically the first bit of code written, because it was obvious that this whole saga would require multiple tools which all follow this same pattern.
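For illustration, here's a minimal sketch of that pattern in Go, with channels standing in for whatever queueing the real tools use; `Blob`, `mutate`, and `write` are placeholder names, not the actual code:

```go
package pbf

import "sync"

// Blob pairs a raw PBF blob with its original position, so the
// writer can restore input order after parallel processing.
type Blob struct {
	seq  int
	data []byte
}

func pipeline(blobs []Blob, workers int) {
	in := make(chan Blob, workers)
	out := make(chan Blob, workers)

	// "process": single thread reading headers and handing blobs out.
	go func() {
		for _, b := range blobs {
			in <- b
		}
		close(in)
	}()

	// "worker": parse and mutate blobs in parallel.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range in {
				b.data = mutate(b.data) // decode, filter, re-encode
				out <- b
			}
		}()
	}
	go func() { wg.Wait(); close(out) }()

	// "writer": single thread, optional; reorders by seq before writing.
	pending := map[int][]byte{}
	next := 0
	for b := range out {
		pending[b.seq] = b.data
		for data, ok := pending[next]; ok; data, ok = pending[next] {
			write(data)
			delete(pending, next)
			next++
		}
	}
}

func mutate(data []byte) []byte { return data } // stand-in for the real work
func write(data []byte)         {}             // stand-in for the real output
```

The map-based reordering in the writer is what lets workers finish out of order without scrambling the output file.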
Prong 2: removing dead entities
As we've established, the vast majority of nodes are purely location information for ways. This means that, if we want to, we can simply remove them. There's a list of tags which the OSM carto style drops; if a node has no tags left after that, it is ignored. We can apply the same logic, except we can do it multi-threaded.
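As a sketch (the drop list here is illustrative; the real list lives in the carto style):

```go
// dropTags is an assumed stand-in for the tag list the openstreetmap-carto
// style deletes on import; created_by and source are examples, not the
// full list.
var dropTags = map[string]bool{
	"created_by": true,
	"source":     true,
	// ... remainder of the carto drop list
}

// filterNode deletes dropped tags and reports whether the node still
// carries information beyond its location. A dead node can be removed:
// its location only exists to give ways their geometry, and the node
// index already captures that.
func filterNode(tags map[string]string) (alive bool) {
	for k := range tags {
		if dropTags[k] {
			delete(tags, k)
		}
	}
	return len(tags) > 0
}
```

Running this inside the worker stage from prong 1 is what lets the filtering go multi-threaded.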
Ways and relations can also be given the same filtering, although there is less to remove, so it's probably not worth it.
Prong 3: removing the node lookup
Much of the series has been about this, but it's important to reiterate that even with flat nodes, the flat nodes memory is required during the import into Postgres, which potentially takes memory away from Postgres itself. When we rewrite the node IDs as locations instead, Postgres doesn't need to be running while we do it, which gives us more memory.*
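Something like the following, where `Way`, `Location`, and the shape of the index are stand-ins for the real structures:

```go
// Location and Way are illustrative types, not a real library's API.
// Coordinates are fixed-point integers, in the 100-nanodegree (1e-7
// degree) units PBF itself uses by default.
type Location struct{ Lat, Lon int64 }

type Way struct {
	Refs []int64    // node IDs, as stored in the source PBF
	Locs []Location // filled in by the rewrite
}

// rewriteWay replaces node-ID references with inline locations, so the
// importer never needs the node index (or its memory) again.
func rewriteWay(w *Way, index map[int64]Location) {
	w.Locs = make([]Location, len(w.Refs))
	for i, id := range w.Refs {
		w.Locs[i] = index[id]
	}
	w.Refs = nil
}
```

The file gets a little bigger per way, but everything downstream is spared the node lookup entirely.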
Did we do better?
Applying the entire process to the US gives the following timings:
- Build index: 2m18s
- Remove dead nodes: 2m55s
- Rewrite live nodes (3 passes): 13m43s
- Import: 4h42m
It turns out that decompressing the file before processing also increases the time to run the first two steps (the second step outputs a decompressed file), which was surprising. This may not be the case when running on an SSD.
And the planet:
- Build node index: 23m10s
- Remove dead nodes: 29m13s
- Rewrite live nodes (3 passes): 8h29m27s total, 2h49m49s/pass
- Rewrite live nodes (4 passes): 9h47m49s total, 2h26m57s/pass
- Rewrite live nodes (5 passes): 4h35m56s total, 55m11s/pass
- Rewrite live nodes (6 passes): 4h30m36s total, 45m6s/pass
- Import: the server had an unrelated failure after about a week, while still processing relations (at about 10/second); however, the estimate was 10-14 days, not including the final Postgres indexing.
Several runs of the rewrite give us some insight into the limitations, specifically that we're swapping. The active size of the index is inversely proportional to the number of passes; if that's too large to retain in memory, we hit disk. Nobody likes hitting disk (remember, my disk is spinning rust).
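To make that concrete, here's a sketch of the multi-pass structure, reusing the `Way`/`Location` types from earlier. The range-partitioning of node IDs is my assumption about how the passes split the work; any split that shrinks the active index behaves the same way:

```go
// loadIndexRange and allWays are stand-ins for the real I/O.
func loadIndexRange(lo, hi int64) map[int64]Location { return map[int64]Location{} }
func allWays() []*Way                                { return nil }

// rewriteInPasses resolves refs over several passes, each loading only
// ~1/passes of the node index, so the active slice can fit in RAM.
func rewriteInPasses(maxID, passes int64) {
	span := maxID/passes + 1
	for p := int64(0); p < passes; p++ {
		lo, hi := p*span, (p+1)*span
		index := loadIndexRange(lo, hi) // only this pass's slice in memory
		for _, w := range allWays() {
			if w.Locs == nil {
				w.Locs = make([]Location, len(w.Refs))
			}
			for i, id := range w.Refs {
				if id >= lo && id < hi {
					w.Locs[i] = index[id] // refs outside the range wait for a later pass
				}
			}
		}
	}
}
```

Once the slice fits in RAM we stop swapping, which is exactly the cliff visible between four and five passes above.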
The remove dead nodes step could likely be merged into rewrite live nodes, since a non-trivial amount of the work is simply decoding and re-encoding the data. Further, there's a bunch of dead data which could still be removed:
- ways with no tags
- relations with no tags
- nodes and relations in a relation definition
- unused tags in all entity types
* in practice I rarely bother to close Postgres because I am very lazy.