Faster OpenStreetMap imports 5/n: failing to re-order entities
With what we've explored so far, this leads to my first thought: re-ordering entities (there's a sketch of the whole pass after the lists below) by collecting ...
- A small number of `relations`
- All the `ways` those `relations` depend on
- All the `nodes` those `ways` depend on
... repeated many times, until all the `relations` are processed, then collect ...
- A small number of unprocessed `ways`
- All the `nodes` those `ways` depend on
... repeated many times, until all the remaining `ways` are processed, then collect ...
- All the unprocessed `nodes`
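Sketched in code, the whole pass might look something like the following toy version. Everything about the shapes here is assumed for illustration - `relations` as a map of id to member way ids, `ways` as a map of id to node ids - and it handles one relation per "batch"; a real implementation would stream from the PBF and take a small number of relations at a time.

```rust
use std::collections::{BTreeMap, HashSet};

// Toy sketch of the re-ordering pass over in-memory maps (assumed shapes,
// not a real API). Returns the new entity order as (kind, id) pairs.
fn reorder(
    relations: &BTreeMap<i64, Vec<i64>>, // relation id -> way ids it references
    ways: &BTreeMap<i64, Vec<i64>>,      // way id -> node ids it references
    nodes: &[i64],                       // every node id, in order
) -> Vec<(&'static str, i64)> {
    let mut out = Vec::new();
    let mut done_ways = HashSet::new();
    let mut done_nodes = HashSet::new();

    // Phase 1: each relation, preceded by the ways and nodes it depends on.
    for (&rel, way_ids) in relations {
        for &w in way_ids {
            if done_ways.insert(w) {
                if let Some(node_ids) = ways.get(&w) {
                    for &n in node_ids {
                        if done_nodes.insert(n) {
                            out.push(("node", n));
                        }
                    }
                }
                out.push(("way", w));
            }
        }
        out.push(("relation", rel));
    }

    // Phase 2: ways no relation referenced, preceded by their nodes.
    for (&w, node_ids) in ways {
        if done_ways.insert(w) {
            for &n in node_ids {
                if done_nodes.insert(n) {
                    out.push(("node", n));
                }
            }
            out.push(("way", w));
        }
    }

    // Phase 3: all remaining nodes.
    for &n in nodes {
        if done_nodes.insert(n) {
            out.push(("node", n));
        }
    }
    out
}
```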
At this point, we've gone from ...

... to ...

In effect, we're re-ordering entities such that we maximize spatial/temporal locality.
Perhaps some additional smarts could be applied to put `nodes` which are only ever referenced once towards the start; that way, when they're evicted, it's with confidence that they won't be referred to again. This would ensure that most of the `nodes` that are needed will be fresh in the cache, and could probably also output the "ideal" cache size.
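That refinement only needs a counting pass up front. A minimal sketch, assuming ways arrive as plain lists of node ids (nodes whose count ends up at 1 would then be placed first):

```rust
use std::collections::HashMap;

// Sketch: count how many times each node is referenced by a way, so that
// single-use nodes can be placed early and evicted without regret.
fn node_reference_counts(ways: &[Vec<i64>]) -> HashMap<i64, u32> {
    let mut counts = HashMap::new();
    for node_ids in ways {
        for &n in node_ids {
            *counts.entry(n).or_insert(0u32) += 1;
        }
    }
    counts
}
```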
It does violate the convention about entity ordering, however, and osm2pgsql will show a warning about it. But from looking at the code, it's just a warning, and it's only there in case you reference an entity before it's been seen.
But this brings us back to where we started: given an arbitrary `relation`, we need to find the `nodes` it references. The `nodes` are in order, so we could binary search within them to find the information we need. Of course, because blocks are an arbitrary size, we'd need an index of where each block starts. Our process is then (sketched in code after the list):
- Binary search the blocks to find which block contains our `node` (we'll assume the index contains `node` range information)
- Decode the block
- Binary search the `nodes` in the block
- Repeat 6,301,138,735 times
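Here's a sketch of that lookup. Everything is an in-memory stand-in: the index is a slice of (first id, last id) ranges per block, and the "decoded" blocks are sorted (id, lat, lon) triples; in the real file the index would hold byte offsets, and step 2 would actually inflate the block.

```rust
use std::cmp::Ordering;

// Sketch of the three-step lookup over in-memory stand-ins.
fn lookup(
    index: &[(i64, i64)],            // (first node id, last node id) per block
    blocks: &[Vec<(i64, f64, f64)>], // each block: sorted (id, lat, lon)
    node_id: i64,
) -> Option<(f64, f64)> {
    // 1. Binary search the index for the block whose range contains our node.
    let b = index
        .binary_search_by(|&(first, last)| {
            if last < node_id {
                Ordering::Less
            } else if first > node_id {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .ok()?;

    // 2. "Decode" the block (already decoded in this toy version).
    let block = &blocks[b];

    // 3. Binary search the nodes within the block.
    let i = block.binary_search_by_key(&node_id, |&(id, _, _)| id).ok()?;
    let (_, lat, lon) = block[i];
    Some((lat, lon))
}
```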
Certain bits can also be skipped with some extra smarts - for example, `nodes` in a `way` are typically sequential, so once we have a `node`, the next `node` is likely to be next to it.
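A sketch of that fast path, reusing the stand-in block layout from above: keep a (block, position) hint from the last hit, and check the very next slot before paying for any binary searches.

```rust
// Sketch of the sequential fast path over the same stand-in blocks.
// `hint` is the (block index, position) of the previously found node.
fn lookup_with_hint(
    blocks: &[Vec<(i64, f64, f64)>],
    hint: &mut (usize, usize),
    node_id: i64,
) -> Option<(f64, f64)> {
    let (b, p) = *hint;
    // Fast path: is the wanted node immediately after the last one we found?
    if let Some(&(id, lat, lon)) = blocks.get(b).and_then(|blk| blk.get(p + 1)) {
        if id == node_id {
            *hint = (b, p + 1);
            return Some((lat, lon));
        }
    }
    // Slow path: fall back to searching (simplified to a scan over blocks
    // here; the real thing would binary search the index as above), then
    // remember where we landed for next time.
    for (bi, blk) in blocks.iter().enumerate() {
        if let Ok(pi) = blk.binary_search_by_key(&node_id, |&(id, _, _)| id) {
            *hint = (bi, pi);
            let (_, lat, lon) = blk[pi];
            return Some((lat, lon));
        }
    }
    None
}
```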
We also need to track which entities have been output. If we use a simple bit array for it, that's 835MB. There are other data structures that would likely be more space efficient, such as a range tree, but it's starting to sound like a bit of a pain, especially if we ever need to store more than a single bit of data per `node` (like how many times a `node` is referenced, so we can store those `nodes` later in the file).
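For reference, the bit array itself is simple enough - one bit per possible node id, sized by the maximum id, which is roughly where the 835MB comes from. A minimal sketch:

```rust
// Sketch of the bit-array tracker: one bit per possible node id.
struct SeenNodes {
    bits: Vec<u64>,
}

impl SeenNodes {
    fn new(max_node_id: usize) -> Self {
        SeenNodes { bits: vec![0u64; max_node_id / 64 + 1] }
    }

    // Mark a node as output, returning whether it had already been marked.
    fn mark(&mut self, id: usize) -> bool {
        let (word, bit) = (id / 64, id % 64);
        let seen = self.bits[word] & (1 << bit) != 0;
        self.bits[word] |= 1 << bit;
        seen
    }

    fn seen(&self, id: usize) -> bool {
        self.bits[id / 64] & (1 << (id % 64)) != 0
    }
}
```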
All of this sounds kinda annoying, and will probably blow up into some O(n^3) graph problem. Let's shelve it, and see what else we can do.