Faster OpenStreetMap imports 9/n: v2
Looking at the time it takes to import the US (5h) vs the planet (10-14 days), it's clear there's a discrepancy somewhere. The planet is about 7.3x larger, yet takes 50-70 times longer to import.
It's fairly apparent that this is due to database access, as osm2pgsql needs to query PostgreSQL for the ways that it finds in the relation (and then do a lookup for the nodes, but we've removed that).
Dealing with this is a bit troublesome. One approach would be to extend the PBF format and figure out how to embed the locations directly into the relation, perform yet another rewrite of the .osm.pbf, modify libosmium to consume the modification, and modify osm2pgsql to consume the modification in libosmium. The second approach would be to write a new index, and modify osm2pgsql to look up from the index instead of the database.
As an estimate for what this would do, let's imagine an index that looks like:
- Way data: way ID + offset into the location data for the start of that way's locations
- Location data: for each way, a series of locations
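To make the layout concrete, here's a minimal in-memory sketch of that two-part index. It's an illustration only, not the real on-disk format: the class and method names are made up, and storing a count in front of each way's locations is my assumption about how the per-way overhead is spent.

```python
from array import array
from bisect import bisect_left


class WayLocationIndex:
    """Toy version of the two-part index: way data plus packed location data."""

    def __init__(self):
        # Way data: sorted way IDs and, for each, an offset into location_data.
        # (The estimate in this post budgets 8 bytes per way entry; this sketch
        # spends 16 to keep the code simple.)
        self.way_ids = array("q")
        self.offsets = array("q")
        # Location data: for each way, a count followed by that many locations.
        self.location_data = array("Q")

    def add_way(self, way_id, locations):
        # Ways must be added in ascending ID order so lookups can binary-search.
        if len(self.way_ids) > 0 and way_id <= self.way_ids[-1]:
            raise ValueError("way IDs must be added in ascending order")
        self.way_ids.append(way_id)
        self.offsets.append(len(self.location_data))
        self.location_data.append(len(locations))
        self.location_data.extend(locations)

    def lookup(self, way_id):
        # Binary search the way data, then read the run of locations.
        i = bisect_left(self.way_ids, way_id)
        if i == len(self.way_ids) or self.way_ids[i] != way_id:
            raise KeyError(way_id)
        start = self.offsets[i]
        count = self.location_data[start]
        return list(self.location_data[start + 1 : start + 1 + count])


index = WayLocationIndex()
index.add_way(10, [100, 101, 102])  # locations are opaque 8-byte values here
index.add_way(42, [200, 201])
print(index.lookup(42))
```

A lookup is one binary search over the way data plus one contiguous read from the location data, with no database round trip.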
When we look at the planet, we get the following stats:
- 695,842,170 ways
- 7,482,480,390 locations (formerly nodes) referenced by ways
For argument's sake, let's say that each way and location takes 8 bytes.
- Way data: 8 bytes * 695,842,170 ways = 5.6GB
- Location data: 8 bytes * (7,482,480,390 locations + 695,842,170 ways*) = 65.4GB
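The arithmetic is easy to reproduce. A quick sketch using the planet counts above and decimal gigabytes (the variable names are only for illustration):

```python
WAYS = 695_842_170
LOCATIONS = 7_482_480_390
BYTES_PER_ENTRY = 8  # the "for argument's sake" size of a way or location

way_data_bytes = BYTES_PER_ENTRY * WAYS
# One extra entry per way accounts for the per-way overhead in the location data.
location_data_bytes = BYTES_PER_ENTRY * (LOCATIONS + WAYS)
total_bytes = way_data_bytes + location_data_bytes

GB = 10**9
print(f"way data:      {way_data_bytes / GB:.1f} GB")       # 5.6 GB
print(f"location data: {location_data_bytes / GB:.1f} GB")  # 65.4 GB
print(f"total:         {total_bytes / GB:.1f} GB")          # 71.0 GB
```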
Together, this is approximately 70GB of data, and we don't get to do multiple passes. Using delta-encoded varints**, we can get it down to 41GB, which is still not viable to use without multiple passes.
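For anyone who hasn't met delta-encoded varints: store each value as the difference from the previous one, then write that difference in as few bytes as it needs. The sketch below assumes protobuf-style zigzag varints; the exact encoding used for the index isn't spelled out here, so treat the function names and details as illustrative.

```python
def zigzag(n):
    """Map signed ints to unsigned so small negative deltas stay small."""
    return n * 2 if n >= 0 else -n * 2 - 1


def unzigzag(z):
    """Inverse of zigzag()."""
    return (z >> 1) ^ -(z & 1)


def encode_varint(value):
    """Encode a non-negative int, 7 bits per byte, high bit = 'more follows'."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_deltas(values):
    """Encode each value as a zigzag varint of its difference from the previous."""
    out = bytearray()
    previous = 0
    for v in values:
        out += encode_varint(zigzag(v - previous))
        previous = v
    return bytes(out)


def decode_deltas(data):
    """Reverse encode_deltas(): rebuild the running sum from the varints."""
    values, previous, acc, shift = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += unzigzag(acc)
            values.append(previous)
            acc, shift = 0, 0
    return values


# Neighbouring coordinates in a way are usually close together, so deltas are small.
coords = [1_000_000_000, 1_000_000_050, 1_000_000_020, 999_999_990]
encoded = encode_deltas(coords)
assert decode_deltas(encoded) == coords
print(len(encoded), "bytes instead of", 8 * len(coords))
```

Only the first value pays for its full magnitude; the rest cost a byte each in this toy example, which is where the savings come from.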
How can we do multiple passes then?
There are two options for this:
- Modify osm2pgsql to do multiple passes, which honestly seems like a huge pain.
- Take the way index, and use it to index the relations, then use the relation index in osm2pgsql. This is the approach taken.
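Roughly, the second option looks like the sketch below. The layout of the relation index is an assumption on my part (plain dicts, with hypothetical names), and skipping member ways that aren't in the way index is an assumed policy, not something osm2pgsql dictates.

```python
def build_relation_index(relations, way_index):
    """Resolve every relation's member ways to locations in one pass.

    relations: iterable of (relation_id, [member way IDs])
    way_index: mapping of way ID -> list of locations
    """
    relation_index = {}
    for relation_id, member_way_ids in relations:
        members = []
        for way_id in member_way_ids:
            locations = way_index.get(way_id)
            if locations is None:
                continue  # assumed policy: skip ways missing from the way index
            members.append((way_id, locations))
        relation_index[relation_id] = members
    return relation_index


# Hypothetical toy data: two ways, one relation that references both
# plus a way (99) that is not in the index.
way_index = {10: [100, 101, 102], 42: [200, 201]}
relations = [(7, [10, 42, 99])]
print(build_relation_index(relations, way_index))
```

The importer then only needs the relation index, which presumably ends up much smaller than the full way index because only ways that are members of relations get copied into it.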
Side note: while digging through the libosmium code, I noticed that ways can already contain lat/lon coordinates, although this isn't specified in the protobuf definition. It's also not used by osm2pgsql, even if it's present.
* Extra overhead to determine the number of locations in a way
** This stopped being theoretical a while ago