Faster OpenStreetMap imports 10/10: summary and conclusion
As is my wont, the preceding blogs are rambly and incoherent. This final one will hopefully put a neat bow on it all.
Step 0: decompress (not recommended)
This is optional and not recommended, but the initial .osm.pbf file can be decompressed. It turns out that the I/O savings from keeping the file compressed make it better not to do this. It was useful early on when I needed to look at things in a hex editor, but it causes slowdowns now.
Step 1: nodeindex
The first step in the whole process is to make a mapping of node to location, which can be efficiently searched. We'll call this a .nodeidx file.
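To make the idea concrete, here's a rough sketch of what a lookup against such a file could look like. The record layout (fixed-width records of node id plus lat/lon in 1e-7 degrees, sorted by id, native byte order) is an assumption for illustration, not necessarily the actual .nodeidx format:

```cpp
// Lookup of a single node location in a .nodeidx file.
// Assumed (illustrative) layout: fixed-width records of
// (uint64 node id, int32 lat, int32 lon), sorted by node id.
#include <cstdint>
#include <cstdio>
#include <optional>

struct NodeIdxRecord {
    uint64_t id;
    int32_t lat;  // 1e-7 degrees
    int32_t lon;  // 1e-7 degrees
};

std::optional<NodeIdxRecord> lookup(std::FILE* idx, uint64_t node_id) {
    std::fseek(idx, 0, SEEK_END);
    size_t count = std::ftell(idx) / sizeof(NodeIdxRecord);

    // Plain binary search over the on-disk array of records.
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        NodeIdxRecord rec;
        std::fseek(idx, mid * sizeof(NodeIdxRecord), SEEK_SET);
        if (std::fread(&rec, sizeof(rec), 1, idx) != 1) return std::nullopt;
        if (rec.id == node_id) return rec;
        if (rec.id < node_id) lo = mid + 1; else hi = mid;
    }
    return std::nullopt;  // node id not present in the index
}
```

Each lookup touches only about log2(N) records, which is what makes it practical to leave the index on disk instead of holding it in RAM.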
Step 2: tagfilter
Once we have all the locations saved, we want to minimize the size of the input file by discarding unnecessary nodes. This is done via tagfilter, and all it does is remove nodes which would never make it to Postgres, so that subsequent steps are faster. We get about a 20% saving, which is a non-trivial amount of work to skip.
Possible enhancement: re-compress the file on output.
Possible enhancement: remove unused ways and relations too.
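For illustration, the filtering decision might look something like the sketch below. The Tag struct, the node_reaches_postgres name and the kIgnoredKeys set are all hypothetical; the real tool operates on the PBF blocks directly and on whatever keys the import style actually ignores:

```cpp
// Sketch of the tagfilter decision: a node survives only if it carries at
// least one tag that could produce a Postgres row.
#include <string>
#include <unordered_set>
#include <vector>

struct Tag { std::string key, value; };

// Hypothetical keys that would never produce a Postgres row on their own.
const std::unordered_set<std::string> kIgnoredKeys = {
    "created_by", "source", "note",
};

bool node_reaches_postgres(const std::vector<Tag>& tags) {
    for (const auto& tag : tags) {
        if (kIgnoredKeys.count(tag.key) == 0) {
            return true;  // at least one interesting tag: keep the node
        }
    }
    return false;  // untagged, or nothing the style cares about: drop it
}
```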
Step 3: noderewrite
This step combines a .nodeidx and a .osm.pbf file, such that all the ways contain location data instead of node references. In order to manage memory usage, this process runs over the .osm.pbf multiple times, using a subset of the .nodeidx each time.
Possible enhancement: re-compress the file on output.
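A sketch of the multi-pass idea follows. To keep the example self-contained the whole index and all ways live in memory here; the real tool streams ways out of and back into the PBF, and only reads the id range it needs from the .nodeidx on each pass:

```cpp
// Multi-pass node-to-location rewrite, bounded memory per pass.
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Location { int32_t lat, lon; };              // 1e-7 degrees
struct IndexEntry { uint64_t id; Location loc; };
struct Way {
    std::vector<uint64_t> node_refs;                // input: node references
    std::vector<Location> locations;                // output: resolved locations
};

void rewrite_nodes(std::vector<Way>& ways,
                   const std::vector<IndexEntry>& index,
                   uint64_t max_node_id, int passes) {
    const uint64_t slice = max_node_id / passes + 1;
    for (int pass = 0; pass < passes; ++pass) {
        const uint64_t lo = static_cast<uint64_t>(pass) * slice;
        const uint64_t hi = lo + slice;

        // Only node ids in [lo, hi) are held in memory, which is what
        // bounds the RAM needed by a single pass.
        std::unordered_map<uint64_t, Location> in_memory;
        for (const auto& e : index)
            if (e.id >= lo && e.id < hi) in_memory.emplace(e.id, e.loc);

        // One scan over the ways per pass; references outside the current
        // range are left alone and picked up by a later pass.
        for (auto& way : ways) {
            way.locations.resize(way.node_refs.size());
            for (size_t i = 0; i < way.node_refs.size(); ++i) {
                auto it = in_memory.find(way.node_refs[i]);
                if (it != in_memory.end()) way.locations[i] = it->second;
            }
        }
    }
}
```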
Step 4: wayindex
Next up, we extract all the ways (and their locations) from our rewritten .osm.pbf into a binary-searchable .wayidx file.
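Because a way has a variable number of locations, a binary-searchable way index needs a bit more structure than the node index. One plausible layout (purely illustrative, not necessarily the actual .wayidx format) is a fixed-width (way id, offset, count) table that can be binary searched, followed by a flat array of locations:

```cpp
// Illustrative .wayidx writer: an id table up front for binary search,
// variable-length location lists packed into a flat region after it.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Location { int32_t lat, lon; };
struct WayEntry { uint64_t id; std::vector<Location> locs; };  // sorted by id

struct TableRow { uint64_t id; uint64_t offset; uint32_t count; };

void write_wayidx(const std::vector<WayEntry>& ways, std::FILE* out) {
    std::vector<TableRow> table;
    std::vector<Location> data;
    for (const auto& w : ways) {
        table.push_back({w.id, data.size(),
                         static_cast<uint32_t>(w.locs.size())});
        data.insert(data.end(), w.locs.begin(), w.locs.end());
    }
    uint64_t rows = table.size();
    std::fwrite(&rows, sizeof(rows), 1, out);                        // header
    std::fwrite(table.data(), sizeof(TableRow), table.size(), out);  // id table
    std::fwrite(data.data(), sizeof(Location), data.size(), out);    // locations
}
```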
Step 5: relindex
For the final step of pre-processing, combine the .osm.pbf with the .wayidx file to create a .relidx file. This file is not binary searchable; it is intended to be linearly scanned. It's a sequence of (relation id, (way id, (location, ...), ...)).
To restate: the .relidx contains, in order, all the data that osm2pgsql will need to process each relation without accessing the database.
Because this needs to binary search the .wayidx, which may not fit in memory, this step also splits up the work into multiple passes; however, the work is divided based on the size of the .relidx, not the .osm.pbf.
Possible enhancement: make this multi-threaded. This doesn't fit the existing framework; as a proof of concept it's currently straight-line, single-threaded code.
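To make the format concrete, here's an illustrative serialization of one .relidx record. The field widths and encoding are assumptions; the property that matters is simply that records are written in relation-id order, so the file can be consumed with a single forward scan:

```cpp
// One .relidx record: (relation id, (way id, (location, ...), ...)).
#include <cstdint>
#include <cstdio>
#include <vector>

struct Location { int32_t lat, lon; };
struct WayMember { uint64_t way_id; std::vector<Location> locs; };
struct RelRecord { uint64_t relation_id; std::vector<WayMember> ways; };

void write_record(const RelRecord& rec, std::FILE* out) {
    std::fwrite(&rec.relation_id, sizeof(rec.relation_id), 1, out);
    uint32_t nways = static_cast<uint32_t>(rec.ways.size());
    std::fwrite(&nways, sizeof(nways), 1, out);
    for (const auto& w : rec.ways) {
        std::fwrite(&w.way_id, sizeof(w.way_id), 1, out);
        uint32_t nlocs = static_cast<uint32_t>(w.locs.size());
        std::fwrite(&nlocs, sizeof(nlocs), 1, out);
        std::fwrite(w.locs.data(), sizeof(Location), w.locs.size(), out);
    }
}
```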
Step 6: osm2pgsql import
Finally, we are ready to perform the import. Let's look at the individual steps.
Step 6a: process a node
Original
- (Possibly) store the node in the permanent database
- Store the node in cache
- Store the node in the temporary database
New
- Store the node in the permanent database
Step 6b: process a way
Original
- For each node in the way, find the corresponding node location in the cache or temporary database
- (Probably) store the way in the permanent database
- Store the way in the temporary database
New
- (Probably) store the way in the permanent database
Step 6c: process a relation
Original
- Store the relation in the temporary database
- For each way in the relation, find the list of nodes in the temporary database
- For each node in each way, find the list of node locations in the cache or temporary database
- (Probably) store the relation in the permanent database
New
- Read forward in the .relidx file to find the relation (see the sketch after this list)
- Read the required relation data from .relidx
- (Probably) store the relation in the permanent database
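A sketch of that forward scan, using the same illustrative record layout as in step 5 (again, not necessarily the real encoding). Because relations are processed in increasing id order, the reader never has to seek backwards:

```cpp
// Forward scan over a .relidx file until we hit (or pass) the wanted id.
#include <cstdint>
#include <cstdio>
#include <optional>
#include <vector>

struct Location { int32_t lat, lon; };
struct WayMember { uint64_t way_id; std::vector<Location> locs; };
struct RelRecord { uint64_t relation_id; std::vector<WayMember> ways; };

std::optional<RelRecord> read_forward_to(std::FILE* in, uint64_t wanted_id) {
    for (;;) {
        RelRecord rec;
        if (std::fread(&rec.relation_id, sizeof(rec.relation_id), 1, in) != 1)
            return std::nullopt;                                // end of file
        uint32_t nways = 0;
        std::fread(&nways, sizeof(nways), 1, in);
        rec.ways.resize(nways);
        for (auto& w : rec.ways) {
            std::fread(&w.way_id, sizeof(w.way_id), 1, in);
            uint32_t nlocs = 0;
            std::fread(&nlocs, sizeof(nlocs), 1, in);
            w.locs.resize(nlocs);
            std::fread(w.locs.data(), sizeof(Location), nlocs, in);
        }
        if (rec.relation_id == wanted_id) return rec;           // this one
        if (rec.relation_id > wanted_id) return std::nullopt;   // not in file
    }
}
```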
Step 6d: finalizing the database
Original
- Create indexes for the permanent database
New
- Create indexes for the permanent database
...
:(
Conclusion
By doing a bunch of pre-processing, we have two major differences:
- All steps can control the amount of memory required. With a regular import, a small cache doesn't mean you require less memory; it means you're swapping when you access the database.
- We've eliminated all database reads except for creating indexes. If your database doesn't fit entirely in memory, you can end up processing relations at 5-15/second (at least, those are the numbers I had attempting to import planet), which comes out to about 9-10 days.
We can't do anything about the indexing, except give Postgres access to more memory.
Result
Timings for planet:
- nodeindex: 23:33
- tagfilter: 24:35
- rewritenodes: 6:17:57 (4 passes)
- wayindex: 17:15
- relindex: 2:13:55 (10 passes); this phase is currently single threaded and could potentially have been as low as 1/12 of this time.
- import (w/out index): 6:23:33
- import (index only): 3d 13:17:40
total time: 16:00:48 + 3d 13:17:40 = 4d 5:18:28*
squirrel@emu:~/src/osm2pgsql/build$ make -j12 && ./osm2pgsql planet-200907-d-f-r.osm.pbf --relation-index planet-200907-d-f-r.relidx --host 192.168.16.2 --user postgres --database act --style ../../openstreetmap-carto/openstreetmap-carto.style --slim --hstore --drop --tag-transform-script ../../openstreetmap-carto/openstreetmap-carto.lua --number-processes 12
osm2pgsql version 1.3.0 (1.3.0-changed)
Using lua based tag processing pipeline with script /home/squirrel/src/openstreetmap-carto/openstreetmap-carto.lua
Using projection SRS 3857 (Spherical Mercator)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads
Reading in file: planet-200907-d-f-r.osm.pbf
Using PBF parser.
Processing: Node(140369k 47.0k/s) Way(695842k 38.96k/s) Relation(8115737 3743.4/s) parse time: 23013s
Node stats: total(140369384), max(7882275697) in 2986s
Way stats: total(695842170), max(844877676) in 17859s
Relation stats: total(8115737), max(11590153) in 2168s
...
All indexes on planet_osm_point created in 23949s
All indexes on planet_osm_roads created in 39635s
All indexes on planet_osm_line created in 160748s
All indexes on planet_osm_polygon created in 307060s
squirrel@emu:~/src/tltools$ ./tile --lat1 -85 --lat2 85 --long1 -180 --long2 180 --zoom 5 --width 1200 --height 566 --output-image world.png --layer admin-low-zoom

Where to get it
* It's entirely possible this whole project could be substituted with an SSD, I dunno. I had fun.