Faster OpenStreetMap imports 7/n: optimizing the index
Attempt 1
Attempt 1 of the index was simply a huge on-disk array of ID/location pairs. The IDs could be binary searched, and the location then read out from the data sitting next to the matching ID.
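A minimal sketch of what that lookup could look like, assuming 8-byte IDs and 8-byte packed locations stored back to back (the field sizes, byte order, and names here are assumptions, not the actual implementation):

```rust
/// Binary search an interleaved on-disk array of (ID, location) pairs that
/// has been mapped or read into `index`. Only the 8-byte ID is needed for
/// each comparison, yet the adjacent location shares the same cache line,
/// which is the wasted half of the working set.
fn find_location(index: &[u8], target_id: u64) -> Option<u64> {
    const ENTRY_SIZE: usize = 16; // 8 bytes ID + 8 bytes location, interleaved
    let (mut lo, mut hi) = (0, index.len() / ENTRY_SIZE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let off = mid * ENTRY_SIZE;
        let id = u64::from_le_bytes(index[off..off + 8].try_into().unwrap());
        if id == target_id {
            // The location sits directly after the ID we just matched.
            return Some(u64::from_le_bytes(index[off + 8..off + 16].try_into().unwrap()));
        } else if id < target_id {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    None
}
```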
This was a decent proof of concept; however, it also means that half the working set is wasted, since all we're doing is searching for the ID and ignoring the location next to it until we find a match.
But can we do better?
Attempt 2
Attempt 2 of the index was to de-interleave the previous array. Two files were written out: one with the IDs, and one with the locations. At the end, the locations file was shoved onto the end of the IDs file.
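A minimal sketch of the de-interleaved lookup, assuming the two halves are loaded as separate slices of 8-byte values (illustrative names, not the actual code):

```rust
/// `ids` is the sorted ID array, `locations` is the array appended after it.
/// The search only ever touches `ids`; the matching slot doubles as the
/// index into `locations`.
fn find_location(ids: &[u64], locations: &[u64], target_id: u64) -> Option<u64> {
    let slot = ids.binary_search(&target_id).ok()?;
    Some(locations[slot])
}
```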
This was successful and generally improved lookup times. It was with this attempt that the 60-hour rewrite of planet occurred.
But can we do better?
Attempt 3
The further we reduce the working set, the better our memory utilization. The better our memory utilization, the fewer passes we need over planet. The ID list is the primary data set to shrink.
Attempt 3 is a list of tuples of the form (first id, last id, offset into location array). The location array is unchanged and occurs at the end of the file. Combining multiple IDs into a single block yields two primary advantages:
Fewer elements to search
There are approximately 113 million index elements and 6.3 billion nodes in planet. Performing a binary search on the nodes requires ~33 comparisons, whereas searching the index elements requires only ~27. Since there are roughly as many nodes to look up as there are nodes in the file (it's not 1:1, but it's close enough), that's 6 comparisons * 6 billion nodes = ~36 billion comparisons skipped.
Less memory usage
Each index element is 24 bytes. With 113 million elements, that's approximately 2.5GB (with 50GB of location data).
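A minimal sketch of how the attempt 3 lookup could work, assuming the IDs covered by one tuple are contiguous, so that a node's slot in the location array is the block's offset plus the node's distance from the block's first ID (the names and that contiguity assumption are mine, not necessarily the actual implementation):

```rust
struct IndexEntry {
    first_id: u64,        // first node ID in the block
    last_id: u64,         // last node ID in the block
    location_offset: u64, // slot of `first_id` in the location array
}

/// Binary search ~113 million blocks (~27 comparisons) instead of
/// ~6.3 billion nodes (~33 comparisons), then compute the slot directly.
fn find_location_slot(index: &[IndexEntry], target_id: u64) -> Option<u64> {
    // First entry whose range does not end before the target.
    let pos = index.partition_point(|e| e.last_id < target_id);
    let entry = index.get(pos)?;
    if target_id < entry.first_id {
        return None; // the ID falls into a gap between blocks
    }
    Some(entry.location_offset + (target_id - entry.first_id))
}
```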
But can we do better?
Attempt 4
An index element needs to know three things: the start of a range, the end of a range, and where to find the location data the element refers to. If we structure this as:
- 8 bytes of data for the ID
- 3 bytes of data for the size of the range (in practice the maximum range appears to be about 150k, 3 bytes lets us go up to 16m)
- 5 bytes of data for the offset to the location (this lets us store a maximum of 1000 billion locations, which is plenty)
... then we only need 16 bytes instead of 24. It also fits into cache lines better, but that probably doesn't matter. 16 bytes across 113 million index elements is 1.7GB, which will happily fit in memory, even on a Raspberry Pi 4. Since the code is already written for running multiple passes, it could also be run on a lesser Pi.
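A minimal sketch of packing and unpacking such a 16-byte element (byte order and helper names are assumptions; only the 8/3/5 split comes from the text):

```rust
/// Pack (first ID, range size, location offset) into 8 + 3 + 5 = 16 bytes.
fn pack(first_id: u64, range_size: u32, location_offset: u64) -> [u8; 16] {
    debug_assert!(range_size < 1u32 << 24);        // 3 bytes caps the range at ~16M
    debug_assert!(location_offset < 1u64 << 40);   // 5 bytes caps offsets at ~1.1 trillion
    let mut out = [0u8; 16];
    out[0..8].copy_from_slice(&first_id.to_le_bytes());
    out[8..11].copy_from_slice(&range_size.to_le_bytes()[0..3]);
    out[11..16].copy_from_slice(&location_offset.to_le_bytes()[0..5]);
    out
}

/// Reverse of `pack`: widen the 3- and 5-byte fields back to native integers.
fn unpack(entry: &[u8; 16]) -> (u64, u32, u64) {
    let first_id = u64::from_le_bytes(entry[0..8].try_into().unwrap());
    let mut range = [0u8; 4];
    range[0..3].copy_from_slice(&entry[8..11]);
    let mut offset = [0u8; 8];
    offset[0..5].copy_from_slice(&entry[11..16]);
    (first_id, u32::from_le_bytes(range), u64::from_le_bytes(offset))
}
```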
But can we do better?
What about flat nodes?
osm2pgsql supports a cache mode known as flat nodes. This is (if I understand correctly) an on-disk array with a length of the highest node ID, which stores every single node location. It is suggested to be used only when importing planet, because the file size is a function of the highest node ID (currently ~50GB, and always growing). It should be even faster than the index implemented above, because it's literally a 1:1 lookup - there is no searching the index. The location for the node with ID 34 is stored 34*8 bytes into the file.
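A minimal sketch of that style of lookup, assuming 8-byte locations and no file header (layout details are an assumption, not osm2pgsql's actual format):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// No index and no search: the location of node `node_id` lives at byte
/// offset `node_id * 8`, whether or not that ID actually exists.
fn flat_node_location(file: &mut File, node_id: u64) -> std::io::Result<u64> {
    let mut buf = [0u8; 8];
    file.seek(SeekFrom::Start(node_id * 8))?;
    file.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}
```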
So why not use that?
Well, the goal is to import the US. planet is simply being used as a stress test. It probably is better to use flat nodes for planet, and it may even be better to use it for the US.
But can we do better?