Faster OpenStreetMap imports 6/n: smashing dependencies for fun and profit
Our original diagram described a PBF file structure:

But consumption of it looks more like this:

The cache is indicated by the red line. If we can do better there, we win. Or, to phrase it another way, if we can remove the dependency of `way`s on `node`s, then we don't need a cache.

Removing the dependency
A `way` contains a list of `node` IDs. A `way` does not care about the tags on the `node`, only its location. `node` IDs are 64 bits. A location is 2 × 32-bit values*.
What if we just replace all the `node` IDs in a `way` with the actual `node` locations?
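As a rough Go sketch (the type and function names here are mine, not from any particular importer), a location with the per-block scale factored out packs into exactly the 8 bytes that a 64-bit `node` ID used to occupy:

```go
package osm

// Location is a node's position with the per-block scale factored out:
// two 32-bit fixed-point values (see the footnote at the end of the post).
type Location struct {
	Lat, Lon int32
}

// Pack squeezes a location into the same 8 bytes a 64-bit node ID took up,
// so swapping IDs for locations doesn't grow the way at all.
func Pack(l Location) uint64 {
	return uint64(uint32(l.Lat))<<32 | uint64(uint32(l.Lon))
}

// Unpack reverses Pack.
func Unpack(v uint64) Location {
	return Location{Lat: int32(uint32(v >> 32)), Lon: int32(uint32(v))}
}
```

Each `way` stays exactly the same size; it just stops referring to anything outside itself.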
To do this, we need to pre-process the file in two steps:
- Parse the file and build an on-disk index of `node` ID -> location (6.3 billion nodes makes it impractical to use memory)
- Parse the file again and replace all the `node` IDs in each `way` with the raw location, using the index.
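Here is a rough Go sketch of those two passes, building on the Location type above; the channel plumbing and the NodeIndex interface are placeholders for whatever the real parser and index look like, not my actual code:

```go
package osm

// Node and Way are minimal stand-ins for what a PBF parser hands back.
type Node struct {
	ID  int64
	Loc Location
}

type Way struct {
	ID      int64
	NodeIDs []int64    // 64-bit node refs, as stored in the PBF
	Locs    []Location // filled in during the second pass
}

// NodeIndex is the on-disk id -> location index built during the first pass.
type NodeIndex interface {
	Put(id int64, loc Location)
	Get(id int64) Location
}

// PassOne streams every node from the file and records its location.
func PassOne(nodes <-chan Node, idx NodeIndex) {
	for n := range nodes {
		idx.Put(n.ID, n.Loc)
	}
}

// PassTwo streams the file again and replaces each way's node IDs with raw
// locations, using the index; downstream code never needs a node cache.
func PassTwo(ways <-chan Way, idx NodeIndex, out chan<- Way) {
	for w := range ways {
		w.Locs = make([]Location, len(w.NodeIDs))
		for i, id := range w.NodeIDs {
			w.Locs[i] = idx.Get(id)
		}
		w.NodeIDs = nil // no longer needed
		out <- w
	}
	close(out)
}
```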
At first blush, this seems to be very similar to the existing cache system, except the index is on disk. However, it provides a set of advantages:
- The pre-processing can be run without Postgres, which frees up more memory (removing the cache also allows more memory for Postgres later)
- The process can be run multi-threaded, although it is still probably going to fall back to disk speed (but see point 4)
- The index is memory mapped, allowing the OS to swap data in and out as it sees fit (but see point 4); see the sketch after this list
- We can perform multiple passes over the location data, which means only part of the index is required at a time, ensuring we can keep everything in memory. At large data sizes, this means performance becomes `passes * RAM speed` instead of `1 * disk speed`
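To illustrate the memory-mapped point (this is the general approach, not necessarily my exact layout): if `node` IDs are treated as offsets into a flat file of 8-byte locations, the OS can page the index in and out on demand.

```go
package osm

import (
	"encoding/binary"

	"golang.org/x/exp/mmap"
)

// FlatIndex assumes the index file stores the 8-byte location of node id at
// byte offset id*8, relying on node IDs being dense. Illustrative only.
type FlatIndex struct {
	r *mmap.ReaderAt
}

func OpenFlatIndex(path string) (*FlatIndex, error) {
	r, err := mmap.Open(path)
	if err != nil {
		return nil, err
	}
	return &FlatIndex{r: r}, nil
}

// Get reads one 8-byte entry; the OS decides which pages of the file stay
// resident, so runs of nearly sequential IDs mostly hit memory, not disk.
func (ix *FlatIndex) Get(id int64) (lat, lon int32, err error) {
	var buf [8]byte
	if _, err = ix.r.ReadAt(buf[:], id*8); err != nil {
		return 0, 0, err
	}
	return int32(binary.BigEndian.Uint32(buf[0:4])), int32(binary.BigEndian.Uint32(buf[4:8])), nil
}
```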
Doing some basic measurements, the average cache miss costs osm2pgsql about 100µs. A miss occurs when any `node` is not in the cache, as the cost is really the database access. On the flip side, if all 10 `node`s in a 10-`node` `way` miss, the cost is still a single database access, so it's hard to judge the true cost of this. Lacking C++ experience, it's harder for me to drill into these costs.
With the on-disk index, however, lookups are typically between 8 and 16µs**. Without going into details, this does take advantage of things such as IDs typically being sequential.
The US
Running this for the US on my system generally holds the entire index in memory, but it's pushing it.
The planet
Indexing the planet takes about 35 minutes. Attempting to apply the index to the planet fails - the required working set is just too large, and swapping ensues. This is fairly trivial to manage, however, by doing multiple passes over the file. The index is known to be in increasing ID order, so we can chop it into segments and run the file through once for each segment. This takes about 6 hours for a 10% pass, or 2.5 days in total, which is about as long as it took to import the US on the same system.
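Roughly, each pass resolves only the `node` IDs whose segment of the index is currently in memory and leaves the rest for a later pass; again, the names and plumbing here are illustrative:

```go
package osm

// runSegment makes one pass over the ways, resolving only the node IDs that
// fall in [lo, hi), the ID range whose slice of the index is currently held
// in memory. OSM IDs start at 1, so 0 marks an already-resolved entry.
func runSegment(in <-chan Way, out chan<- Way, lo, hi int64, lookup func(id int64) Location) {
	for w := range in {
		if w.Locs == nil {
			w.Locs = make([]Location, len(w.NodeIDs))
		}
		for i, id := range w.NodeIDs {
			if id != 0 && id >= lo && id < hi {
				w.Locs[i] = lookup(id)
				w.NodeIDs[i] = 0
			}
		}
		out <- w
	}
	close(out)
}
```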
But can we do better?
* Each part of a location (lat/long) is actually `0.000000001 * (perBlockOffset + (perBlockScale * location))`. The default scale is 100, and I've only seen it higher, never lower. With the scale in place, the range is -180,000,000,000 to 180,000,000,000, which doesn't fit in a 32-bit value; with the scale factored out, the range is -1,800,000,000 to 1,800,000,000, which does fit in 32 bits. In theory we lose precision by factoring out the scale; in practice the scale is always 100 or 1000 and the offset is 0. In double practice, the precision we lose is less than 1cm.
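In code, the decode formula and the "factor the scale out" trick look roughly like this (the helper names are mine; the formula is the one above):

```go
package osm

// decodeDegrees is the per-block formula from the footnote:
// degrees = 0.000000001 * (perBlockOffset + perBlockScale*raw).
func decodeDegrees(perBlockOffset, perBlockScale, raw int64) float64 {
	return 0.000000001 * float64(perBlockOffset+perBlockScale*raw)
}

// toStored32 folds the block's offset and scale in, then stores the value in
// units of 100 nanodegrees (1e-7 degrees): ±180,000,000,000 nanodegrees
// becomes ±1,800,000,000, which fits in an int32. With a scale of 100 or
// 1000 and an offset of 0 the division is exact; any finer scale would
// truncate, costing the sub-centimetre precision mentioned in the footnote.
func toStored32(perBlockOffset, perBlockScale, raw int64) int32 {
	nanodegrees := perBlockOffset + perBlockScale*raw
	return int32(nanodegrees / 100)
}
```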
** for now