Looking at the time it takes to import the US (5h) vs the planet (10-14 days), it's clear there's a discrepancy somewhere. The planet is about 7.3x larger, yet takes 50-70 times longer to import.

It's fairly apparent that this is due to database access, as osm2pgsql needs to query PostgreSQL for the ways that it finds in each relation (and then do a lookup for the nodes, but we've removed that).

Dealing with this is a bit troublesome. One approach would be to extend the PBF format and figure out how to embed the locations directly into the relation, perform yet another rewrite of the .osm.pbf, modify libosmium to consume the modification, and modify osm2pgsql to consume the modification in libosmium. The second approach would be to write a new index, and modify osm2pgsql to look ways up in the index instead of the database.

As an estimate for what this would do, let's imagine an index that looks like:

  • Way data: Way ID + Offset into location data for start of locations
  • Location data: For each way, a series of locations
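
To make the layout concrete, here is a minimal in-memory sketch of the two tables. It assumes ways arrive sorted by ID, that each location is already packed into a single 8-byte integer, and that each way's run of locations is prefixed with a count. The names build_index and lookup are made up for illustration; this is not osm2pgsql or libosmium code.

```python
import bisect
import struct


def build_index(ways):
    """Build the index from (way_id, [packed_location, ...]) pairs sorted by way_id."""
    way_ids, offsets = [], []
    location_data = bytearray()
    for way_id, locations in ways:
        # Way data: the way ID plus the offset where its locations start.
        way_ids.append(way_id)
        offsets.append(len(location_data))
        # Location data: a count, then the locations themselves (8 bytes each).
        location_data += struct.pack("<Q", len(locations))
        for location in locations:
            location_data += struct.pack("<Q", location)
    return way_ids, offsets, bytes(location_data)


def lookup(index, way_id):
    """Return the packed locations for way_id, or None if the way is not indexed."""
    way_ids, offsets, location_data = index
    i = bisect.bisect_left(way_ids, way_id)  # binary search over the sorted way IDs
    if i == len(way_ids) or way_ids[i] != way_id:
        return None
    offset = offsets[i]
    (count,) = struct.unpack_from("<Q", location_data, offset)
    return list(struct.unpack_from(f"<{count}Q", location_data, offset + 8))


index = build_index([(10, [1, 2, 3]), (20, [7])])
print(lookup(index, 10))
```

A lookup is one binary search over the way IDs and one read from the location data, with no database round trip.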

When we look at the planet, we get the following stats:

  • 695,842,170 ways
  • 7,482,480,390 locations (formerly nodes) referenced by ways

For argument's sake, let's say that each way and location takes 8 bytes.

  • Way data: 8 bytes * 695,842,170 ways = 5.6GB
  • Location data: 8 bytes * (7,482,480,390 locations + 695,842,170 ways*) = 65.4GB
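
The arithmetic is easy to reproduce. This snippet only restates the numbers above, assuming decimal gigabytes (10^9 bytes) and the flat 8 bytes per entry; the variable names are ours.

```python
WAYS = 695_842_170
LOCATIONS = 7_482_480_390
ENTRY_BYTES = 8   # assumed size of each way entry and each location
GB = 10**9        # decimal gigabytes

# One entry per way in the way table.
way_data = ENTRY_BYTES * WAYS
# One entry per location, plus one per way for the location count.
location_data = ENTRY_BYTES * (LOCATIONS + WAYS)
total = way_data + location_data

print(f"Way data:      {way_data / GB:.1f} GB")
print(f"Location data: {location_data / GB:.1f} GB")
print(f"Total:         {total / GB:.1f} GB")
```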

This is approximately 70GB of data, and we don't get to do multiple passes. Using delta encoded varints**, we can get it down to 41GB which is still not viable to use without multiple passes.
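
For anyone who hasn't met delta encoded varints: instead of storing each value in full, store the difference from the previous value, then write that difference in as few bytes as it needs. The sketch below assumes protobuf-style zigzag varints, which may not match the encoding used for the 41GB figure, and the function names are made up for illustration.

```python
def zigzag(n):
    """Map signed to unsigned so small negative deltas stay small: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def unzigzag(z):
    """Inverse of zigzag."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value):
    """Encode a non-negative integer, 7 bits per byte, high bit set on all but the last byte."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)


def encode_deltas(values):
    """Store each value as the varint-encoded difference from the previous one."""
    out = bytearray()
    previous = 0
    for value in values:
        out += encode_varint(zigzag(value - previous))
        previous = value
    return bytes(out)


def decode_deltas(data):
    """Reverse encode_deltas: read varints and accumulate the deltas."""
    values, previous = [], 0
    current, shift = 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:  # last byte of this varint
            previous += unzigzag(current)
            values.append(previous)
            current, shift = 0, 0
    return values


coordinates = [1_000_000, 1_000_005, 999_990, 1_000_100]
encoded = encode_deltas(coordinates)
print(len(encoded), "bytes instead of", 8 * len(coordinates))
```

Consecutive locations in a way tend to be close together, so most deltas fit in a byte or two rather than eight.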

How can we do multiple passes then?

There are two options for this:

  1. Modify osm2pgsql to do multiple passes, which honestly seems like a huge pain.
  2. Take the way index, and use it to index the relations, then use the relation index in osm2pgsql. This is the approach taken.
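
As a rough illustration of how option 2 lets us do multiple passes, imagine splitting the way index into chunks that each fit in memory and resolving relation members one chunk at a time. This is a hypothetical sketch with plain dicts standing in for the on-disk indexes, not the actual implementation; build_relation_index and its arguments are made-up names.

```python
def build_relation_index(relations, way_index_chunks):
    """Resolve relation member ways against the way index, one chunk per pass.

    relations: dict of relation ID -> list of member way IDs.
    way_index_chunks: iterable of dicts (way ID -> list of locations),
        each assumed small enough to hold in memory.
    """
    relation_index = {relation_id: {} for relation_id in relations}
    for chunk in way_index_chunks:  # one pass over the relations per chunk
        for relation_id, way_ids in relations.items():
            for way_id in way_ids:
                if way_id in chunk:
                    relation_index[relation_id][way_id] = chunk[way_id]
    return relation_index


relations = {1: [10, 30], 2: [20]}
chunks = [
    {10: [(51.5, -0.1)], 20: [(48.8, 2.3)]},  # first pass
    {30: [(40.7, -74.0)]},                    # second pass
]
print(build_relation_index(relations, chunks))
```

The relation index only holds ways that relations actually reference, so it should be much smaller than the full way index.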

Side note: while digging through the libosmium code, I noticed that ways can already contain lat/lon coordinates, although this isn't specified in the protobuf definition. It's also not used by osm2pgsql, even if it's present.

* Extra overhead to determine the number of locations in a way

** This stopped being theoretical a while ago