As is my wont, the preceding blogs are rambly and incoherent. This final one will hopefully put a neat bow on it all.

Step 0: decompress (not recommended)

This step is optional and not recommended: the initial .osm.pbf file can be decompressed, but it turns out the extra I/O outweighs the decompression cost, so the file is better left compressed. Decompressing was useful early on, when I needed to look at things in a hex editor, but it only causes slowdowns now.

Step 1: nodeindex

The first step in the whole process is to build a mapping from node id to location that can be searched efficiently. We'll call this a .nodeidx file.
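
A rough sketch of what such an index might look like: fixed-width (id, lat, lon) records sorted by id, answered by a plain binary search over the file. The record layout below is an assumption for illustration, not necessarily the actual .nodeidx format.

    // Hypothetical .nodeidx layout: fixed-width records sorted by node id,
    // so a lookup is a binary search straight over the file.
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    #pragma pack(push, 1)
    struct NodeRecord {
        int64_t id;    // OSM node id, ascending
        int32_t lat;   // fixed-point degrees (assumed 1e-7 units)
        int32_t lon;
    };
    #pragma pack(pop)

    std::optional<NodeRecord> lookup(std::FILE* idx, int64_t node_id) {
        std::fseek(idx, 0, SEEK_END);
        long count = std::ftell(idx) / (long)sizeof(NodeRecord);
        long lo = 0, hi = count;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            NodeRecord rec;
            std::fseek(idx, mid * (long)sizeof(NodeRecord), SEEK_SET);
            if (std::fread(&rec, sizeof rec, 1, idx) != 1) break;
            if (rec.id == node_id) return rec;
            if (rec.id < node_id) lo = mid + 1; else hi = mid;
        }
        return std::nullopt;  // node id not present in the index
    }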

Step 2: tagfilter

Once all the locations are saved, we want to minimize the size of the input file by discarding unnecessary nodes. This is done via tagfilter, and all it does is remove nodes which would never make it to Postgres, so that subsequent steps are faster. We get about a 20% size saving, which is a non-trivial amount of work to skip.
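
A sketch of the kind of predicate involved, assuming the keep/drop decision comes down to whether any tag is mapped by the import style (the real filter also has to account for the hstore and Lua transform options, so this is a simplification):

    // Keep a node only if at least one of its tags would produce a row in
    // Postgres. `style_keys` is a hypothetical stand-in for the parsed
    // .style file.
    #include <map>
    #include <set>
    #include <string>

    bool node_survives(const std::map<std::string, std::string>& tags,
                       const std::set<std::string>& style_keys) {
        // Untagged nodes exist only to give ways their geometry, and those
        // locations are already captured in the .nodeidx -- drop them.
        if (tags.empty()) return false;
        for (const auto& kv : tags)
            if (style_keys.count(kv.first)) return true;  // becomes a point row
        return false;
    }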

Possible enhancement: re-compress the file on output.

Possible enhancement: remove unused ways and relations too.

Step 3: noderewrite

This step combines a .nodeidx and an .osm.pbf file, rewriting the ways so that they contain location data instead of node references. In order to manage memory usage, this process runs over the .osm.pbf multiple times, using a subset of the .nodeidx each time.
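
The per-pass resolution might look something like the sketch below, where each pass loads one slice of the .nodeidx into a hash map and fills in whatever references it can; the names and the slicing strategy are illustrative, not the actual tool.

    // One pass of a memory-bounded rewrite: `slice` holds one contiguous id
    // range of the .nodeidx, sized to the memory budget; references outside
    // the slice stay unresolved until a later pass.
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct WayRef {
        int64_t node_id;                    // original node reference
        std::pair<int32_t, int32_t> loc{};  // filled in once resolved
        bool resolved = false;
    };

    using NodeSlice =
        std::unordered_map<int64_t, std::pair<int32_t, int32_t>>;

    void run_pass(const NodeSlice& slice, std::vector<WayRef>& way) {
        for (auto& ref : way) {
            if (ref.resolved) continue;  // handled by an earlier pass
            auto it = slice.find(ref.node_id);
            if (it != slice.end()) {
                ref.loc = it->second;
                ref.resolved = true;
            }
        }
    }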

Possible enhancement: re-compress the file on output.

Step 4: wayindex

Next up, we extract all the ways (and their locations) from our rewritten .osm.pbf into a binary-searchable .wayidx file.
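
Since ways are variable-length, one plausible layout for a binary-searchable .wayidx is a sorted, fixed-width (way id, offset) table followed by the variable-length location payloads. The structures below are an assumed illustration, not the real format.

    // Hypothetical .wayidx layout: binary search the fixed-width table on
    // way id, then seek to the payload at the recorded offset.
    #include <cstdint>

    #pragma pack(push, 1)
    struct WayIndexEntry {
        int64_t way_id;    // sorted ascending, so the table binary-searches
        uint64_t offset;   // byte offset of this way's payload
    };

    struct WayPayloadHeader {
        uint32_t count;    // number of nodes in the way; `count` fixed-point
                           // (lat, lon) pairs follow immediately after
    };
    #pragma pack(pop)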

Step 5: relindex

For the final step of pre-processing, we combine the .osm.pbf with the .wayidx file to create a .relidx file. This file is not binary searchable; it is intended to be scanned linearly. It's a sequence of (relation id, (way id, (location, ...), ...)).

To restate: the .relidx contains, in order, all the data that osm2pgsql will need to process each relation without accessing the database.

Because this needs to binary search the .wayidx, which may not fit in memory, this step also splits the work into multiple passes; however, the passes are sized by the .relidx, not the .osm.pbf.
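
In memory, one record of that sequence might be modelled as below; the field widths and framing are assumptions, only the nesting follows the description above.

    // One .relidx record: everything osm2pgsql needs for one relation.
    #include <cstdint>
    #include <vector>

    struct Location {
        int32_t lat, lon;            // fixed-point, same units as .nodeidx
    };

    struct RelWay {
        int64_t way_id;
        std::vector<Location> locs;  // one location per node of the way
    };

    struct RelRecord {
        int64_t relation_id;         // records appear in ascending id order
        std::vector<RelWay> ways;
    };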

Possible enhancement: make this multi-threaded. This doesn't fit the existing framework; as a proof of concept, it's currently straight-line, single-threaded code.

Step 6: osm2pgsql import

Finally, we are ready to perform the import. Let's look at the individual steps.

Step 6a: process a node

Original

  • (Possibly) store the node in the permanent database
  • Store the node in cache
  • Store the node in the temporary database

New

  • Store the node in the permanent database

Step 6b: process a way

Original

  • For each node in the way, find the corresponding node location in the cache or temporary database
  • (Probably) store the way in the permanent database
  • Store the way in the temporary database

New

  • (Probably) store the way in the permanent database

Step 6c: process a relation

Original

  • Store the relation in the temporary database
  • For each way in the relation, find the list of nodes in the temporary database
  • For each node in each way, find the list of node locations in the cache or temporary database
  • (Probably) store the relation in the permanent database

New

  • Read forward in the .relidx file to find the relation (see the sketch after this list)
  • Read the required relation data from .relidx
  • (Probably) store the relation in the permanent database
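
Because both the relation stream from the .osm.pbf and the .relidx records are in ascending id order, a single forward-moving cursor suffices. A sketch of that scan, reusing the RelRecord idea from Step 5 with a hypothetical read_record decoder:

    // Forward-only scan of the .relidx: never seeks backwards, never
    // touches the database. read_record is an assumed decoder returning
    // the next record, or nullopt at end of file.
    #include <cstdint>
    #include <cstdio>
    #include <optional>

    struct RelRecord { int64_t relation_id; /* member ways elided */ };
    std::optional<RelRecord> read_record(std::FILE* relidx);

    std::optional<RelRecord> advance_to(std::FILE* relidx, int64_t wanted) {
        while (auto rec = read_record(relidx)) {
            if (rec->relation_id == wanted) return rec;  // found it
            if (rec->relation_id > wanted)
                return std::nullopt;  // absent (a production reader would
                                      // buffer this overshot record)
            // ids below `wanted` belong to skipped relations; scan onward
        }
        return std::nullopt;  // end of file
    }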

Step 6d: finalizing the database

Original

  • Create indexes for the permanent database

New

  • Create indexes for the permanent database

...

:(

Conclusion

By doing a bunch of pre-processing, we end up with two major differences:

  • All steps can control the amount of memory required. With a regular import, a small cache doesn't mean you require less memory; it means you're swapping whenever you access the database.
  • We've eliminated all database reads except for creating indexes. If your database doesn't fit entirely in memory, you can end up processing relations at 5-15/second (at least, those are the numbers I had attempting to import planet), which comes out to about 9-10 days for the relations alone.

We can't do anything about the indexing except give Postgres access to more memory.
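
For example, settings along these lines in postgresql.conf; the values are illustrative and should be tuned to the machine:

    # Illustrative postgresql.conf settings for a dedicated import box
    shared_buffers = 16GB                 # keep more of the tables cached
    maintenance_work_mem = 8GB            # memory available to CREATE INDEX
    max_parallel_maintenance_workers = 4  # parallel index builds (PG 11+)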

Result

Timings for planet:

  • nodeindex: 23:33
  • tagfilter: 24:35
  • noderewrite: 6:17:57 (4 passes)
  • wayindex: 17:15
  • relindex: 2:13:55 (10 passes); this phase is currently single-threaded and could potentially have been as low as 1/12 of this time.
  • import (w/out index): 6:23:33
  • import (index only): 3d 13:17:40

total time: 16:00:48 + 3d 13:17:40 = 4d 5:18:28*

squirrel@emu:~/src/osm2pgsql/build$ make -j12 && ./osm2pgsql planet-200907-d-f-r.osm.pbf --relation-index planet-200907-d-f-r.relidx --host 192.168.16.2 --user postgres --database act --style ../../openstreetmap-carto/openstreetmap-carto.style --slim --hstore --drop --tag-transform-script ../../openstreetmap-carto/openstreetmap-carto.lua --number-processes 12
osm2pgsql version 1.3.0 (1.3.0-changed)

Using lua based tag processing pipeline with script /home/squirrel/src/openstreetmap-carto/openstreetmap-carto.lua
Using projection SRS 3857 (Spherical Mercator)
Setting up table: planet_osm_point
Setting up table: planet_osm_line
Setting up table: planet_osm_polygon
Setting up table: planet_osm_roads

Reading in file: planet-200907-d-f-r.osm.pbf
Using PBF parser.
Processing: Node(140369k 47.0k/s) Way(695842k 38.96k/s) Relation(8115737 3743.4/s)  parse time: 23013s
Node stats: total(140369384), max(7882275697) in 2986s
Way stats: total(695842170), max(844877676) in 17859s
Relation stats: total(8115737), max(11590153) in 2168s
...
All indexes on planet_osm_point created in 23949s
All indexes on planet_osm_roads created in 39635s
All indexes on planet_osm_line created in 160748s
All indexes on planet_osm_polygon created in 307060s
squirrel@emu:~/src/tltools$ ./tile --lat1 -85 --lat2 85 --long1 -180 --long2 180 --zoom 5 --width 1200 --height 566 --output-image world.png --layer admin-low-zoom

Where to get it

* It's entirely possible this whole project could be substituted with an SSD, I dunno. I had fun.