Now that we have a rough understanding of the system, let's look at the problem:

US (8GB cache)

squirrel@emu:~/osm2pgsql/build$ ./import us-20200905.osm.pbf
osm2pgsql version 1.3.0 (1.3.0-changed)

...

Reading in file: us-20200905.osm.pbf
Using PBF parser.
Processing: Node(882792k 495.1k/s) Way(81909k 2.04k/s) Relation(271710 16.6/s)
node cache: stored: 638575926(72.34%), storage efficiency: 60.90% (dense blocks: 52425, sparse nodes: 309555201), hit rate: 73.02%
Osm2pgsql failed due to ERROR: std::bad_alloc

(This log is a summary of the first blog post in the series.)

Attempting to import the US fails after about 16 hours. Adding up the reported timings comes out to 12 hours, but it was 16 hours of wall-clock time. Don't ask me why.

I gave this import 8GB of cache, and it still ran out of memory, as there's a bunch of other things running on my system, including Postgres itself.

NSW (1GB cache)

squirrel@emu:~/osm2pgsql/build$ ./import new_south_wales-20200910.osm.pbf
osm2pgsql version 1.3.0 (1.3.0-changed)
Node-cache: cache=1000MB, maxblocks=16000*65536, allocation method=11
Mid: pgsql, cache=1000
Reading in file: new_south_wales-20200910.osm.pbf
Using PBF parser.

Processing: Node(17770k 740.4k/s) Way(1280k 31.22k/s) Relation(41956 2330.9/s)  parse time: 83s

Node stats: total(17770063), max(7888587896) in 24s
Way stats: total(1280580), max(845539027) in 41s
Relation stats: total(41956), max(11597658) in 18s

All indexes on planet_osm_roads created in 4s
All indexes on planet_osm_point created in 9s
All indexes on planet_osm_polygon created in 19s
All indexes on planet_osm_line created in 23s

Osm2pgsql took 107s overall
node cache: stored: 17770063(100.00%), storage efficiency: 50.53% (dense blocks: 154, sparse nodes: 16952522), hit rate: 99.97%

NSW (16MB cache)

squirrel@emu:~/osm2pgsql/build$ ./import new_south_wales-20200910.osm.pbf
osm2pgsql version 1.3.0 (1.3.0-changed)

Node-cache: cache=16MB, maxblocks=256*65536, allocation method=11
Mid: pgsql, cache=16
Reading in file: new_south_wales-20200910.osm.pbf
Using PBF parser.

Processing: Node(17770k 740.4k/s) Way(1280k 4.76k/s) Relation(41956 590.9/s)  parse time: 364s

Node stats: total(17770063), max(7888587896) in 24s
Way stats: total(1280580), max(845539027) in 269s
Relation stats: total(41956), max(11597658) in 71s

All indexes on planet_osm_roads created in 3s
All indexes on planet_osm_point created in 9s
All indexes on planet_osm_polygon created in 36s
All indexes on planet_osm_line created in 17s

Osm2pgsql took 401s overall
node cache: stored: 1044602(5.88%), storage efficiency: 49.81% (dense blocks: 1, sparse nodes: 1044481), hit rate: 7.07%

NSW (cache disabled)

squirrel@emu:~/osm2pgsql/build$ ./import new_south_wales-20200910.osm.pbf
osm2pgsql version 1.3.0 (1.3.0-changed)
Node-cache: cache=12MB, maxblocks=192*65536, allocation method=11
Mid: pgsql, cache=12

Reading in file: new_south_wales-20200910.osm.pbf
Using PBF parser.

Processing: Node(17770k 740.4k/s) Way(1280k 4.59k/s) Relation(41956 626.2/s)  parse time: 370s

Node stats: total(17770063), max(7888587896) in 24s
Way stats: total(1280580), max(845539027) in 279s
Relation stats: total(41956), max(11597658) in 67s

All indexes on planet_osm_roads created in 4s
All indexes on planet_osm_point created in 8s
All indexes on planet_osm_polygon created in 36s
All indexes on planet_osm_line created in 18s

Osm2pgsql took 407s overall
node cache: stored: 782458(4.40%), storage efficiency: 49.75% (dense blocks: 1, sparse nodes: 782337), hit rate: 0.00%

A few things to point out here:

  • With 1GB cache, the hit rate is effectively 100%: the misses are nodes which aren't present in the extract (i.e., they're over the border in Victoria), and are therefore impossible to hit.
  • With 16MB cache, the hit rate is 7%.
  • The time difference between no cache and a 16MB cache is effectively nil.
  • Processing nodes takes the same time
  • Processing ways takes 6.5x the time
  • Processing relations takes 3x the time
  • Indexing times are generally the same; polygons take twice as long, but I'm going to mumble something about disk cache and ignore it.
  • Overall it takes 4x the time

This tells us that the cache is important*. A low cache is effectively no cache. This is likely because the cost of a cache miss is paid whenever any node in a way is missing from the cache: if a way contains 10 nodes and 9 of them are cached, that way still needs a DB lookup. The per-node summary above doesn't count it like this, however.
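To make that all-or-nothing behaviour concrete, here's a minimal sketch. This is not osm2pgsql's actual code: `lookup_way_nodes`, `node_cache`, and `fetch_nodes_from_db` are hypothetical stand-ins for the real node store.

```python
# Sketch: one uncached node turns the whole way into a "DB hit".
# All names here are illustrative, not osm2pgsql's real API.

def lookup_way_nodes(way_node_ids, node_cache, fetch_nodes_from_db, stats):
    missing = [nid for nid in way_node_ids if nid not in node_cache]
    if missing:
        # Even a single uncached node forces a round trip to Postgres,
        # so at the way level this counts as a miss ("db hit").
        stats["db_hits"] += 1
        fetched = fetch_nodes_from_db(missing)
        coords = {nid: node_cache.get(nid) or fetched[nid]
                  for nid in way_node_ids}
    else:
        # All nodes cached: the DB is avoided entirely.
        stats["avoids"] += 1
        coords = {nid: node_cache[nid] for nid in way_node_ids}
    return [coords[nid] for nid in way_node_ids]

# A way with 10 nodes, 9 of them cached, still counts as a DB hit:
cache = {nid: (0.0, 0.0) for nid in range(9)}
stats = {"db_hits": 0, "avoids": 0}
lookup_way_nodes(list(range(10)), cache,
                 lambda ids: {i: (1.0, 1.0) for i in ids}, stats)
print(stats)
```

This is why a per-way hit rate degrades much faster than the per-node hit rate the summary line reports.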

Adding code to track this, we now see the following (for NSW):

  • 512MB: 105 seconds, 99.97% hit rate, 238 DB hits, 1492504 avoids (perfect caching)
  • 256MB: 131 seconds, 95.24% hit rate, 96358 DB hits, 1396384 avoids - big oof
  • 128MB: 327 seconds, 52.91% hit rate, 896566 DB hits, 596176 avoids - bigger oof
  • no cache: 399 seconds, 0% hit rate, 1492742 DB hits, 0 avoids - chungus oof

At just over 50% cache hit rate, we're already much closer to the worst case than the best. The DB access is absolutely killing us.
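A back-of-the-envelope estimate of what each DB lookup costs, taken from the numbers above. This assumes the perfect-cache and no-cache runs differ only in the extra DB round trips, which is a simplification:

```python
# Rough per-lookup cost, derived from the NSW measurements above.
perfect_time, perfect_hits = 105, 238        # 512MB run (perfect caching)
worst_time, worst_hits = 399, 1_492_742      # no-cache run

extra_seconds = worst_time - perfect_time
extra_lookups = worst_hits - perfect_hits
cost_ms = extra_seconds / extra_lookups * 1000
print(f"~{cost_ms:.2f} ms per way that needs a DB lookup")
```

Around 0.2 ms per way sounds cheap, but multiplied by ~1.5 million ways it's the difference between a 2-minute and a 7-minute import.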

We now have two conflicting goals:

  1. Minimize cache size, to leave maximum memory for the rest of the processing (including Postgres)
  2. Maximize cache utilization and hit rate, to minimize DB access
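One crude way to look at that tension is the marginal cost of each MB of cache we give up, computed from the four NSW runs above (this treats runtime as depending only on cache size, which ignores disk cache and other noise):

```python
# Marginal cost of shrinking the node cache, from the NSW runs above.
runs = [(512, 105), (256, 131), (128, 327), (0, 399)]  # (cache MB, runtime s)

marginal = []  # seconds of extra runtime per MB of cache given up
for (mb_hi, t_hi), (mb_lo, t_lo) in zip(runs, runs[1:]):
    marginal.append((t_lo - t_hi) / (mb_hi - mb_lo))
    print(f"{mb_hi}MB -> {mb_lo}MB: {marginal[-1]:.2f} s per MB freed")
```

The first 256MB we give back is cheap (~0.1 s/MB); the next 128MB costs over 1.5 s/MB. The returns are very non-linear, so there's a sweet spot to find rather than a smooth trade-off.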

* duh