With what we've explored so far, this leads to my first thought: re-ordering entities by collecting ...

  • A small number of relations
  • All the ways those relations reference
  • All the nodes those ways reference

... repeated many times, until all the relations are processed, then collect ...

  • A small number of unprocessed ways
  • All the nodes those ways reference

... repeated many times, until all the remaining ways are processed, then collect ...

  • All the unprocessed nodes
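As a sketch of that batching process (hypothetical entity objects, assuming a relation knows the way IDs it references via `.way_refs` and a way knows its node IDs via `.node_refs`; relations referencing nodes or other relations directly are ignored for simplicity):

```python
def reorder(relations, ways, nodes, batch_size=1000):
    """Hypothetical sketch of the batched re-ordering described above.

    `relations`, `ways`, `nodes` map id -> entity.
    """
    out = []
    done_ways, done_nodes = set(), set()

    def take_nodes(way_ids):
        # Collect node ids these ways reference that we haven't output yet.
        batch = []
        for w in way_ids:
            for n in ways[w].node_refs:
                if n not in done_nodes:
                    done_nodes.add(n)
                    batch.append(n)
        return batch

    # Relations in small batches, each preceded by its dependencies.
    rel_ids = list(relations)
    for i in range(0, len(rel_ids), batch_size):
        batch = rel_ids[i:i + batch_size]
        way_ids = []
        for r in batch:
            for w in relations[r].way_refs:
                if w not in done_ways:
                    done_ways.add(w)
                    way_ids.append(w)
        node_ids = take_nodes(way_ids)
        out += [nodes[n] for n in node_ids]   # dependencies first, so
        out += [ways[w] for w in way_ids]     # nothing is referenced
        out += [relations[r] for r in batch]  # before it's been seen

    # Remaining ways in batches, each preceded by their unprocessed nodes.
    way_ids = [w for w in ways if w not in done_ways]
    for i in range(0, len(way_ids), batch_size):
        batch = way_ids[i:i + batch_size]
        out += [nodes[n] for n in take_nodes(batch)]
        out += [ways[w] for w in batch]

    # Finally, any nodes nothing referenced.
    out += [nodes[n] for n in nodes if n not in done_nodes]
    return out
```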

At this point, we've now gone from ...

... to ...

In effect, we're re-ordering entities to maximize spatial/temporal locality.

Perhaps some additional smarts could put nodes that are only ever referenced once towards the start; that way, when they're evicted, we can be confident they won't be referenced again.

This would ensure that most of the nodes that are needed will be fresh in the cache, and we could probably also compute the "ideal" cache size.
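One hedged sketch of that reference-counting idea (hypothetical way objects with a `.node_refs` list): count how often each node is referenced, then order single-use nodes first, since once they've been read they'll never be needed again.

```python
from collections import Counter

def node_ref_counts(ways):
    # Count how many ways reference each node id.
    counts = Counter()
    for way in ways:
        counts.update(way.node_refs)
    return counts

def order_nodes_for_eviction(node_ids, counts):
    # Single-use nodes first: safe to evict immediately after reading.
    # sorted() is stable, so relative order is otherwise preserved.
    return sorted(node_ids, key=lambda n: counts[n] > 1)
```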

It does violate the rule about entity ordering, however, and osm2pgsql will show a warning about it.  But from looking at the code it's just that: a warning, there in case you reference an entity before it's been seen.

But this brings us back to where we started: given an arbitrary relation, we need to find the nodes it references.  The nodes are in order, so we could binary search within them to find the information we need.  Of course because blocks are an arbitrary size, we'd need an index of where each block starts.  Our process is then:

  • Binary search the blocks to find which block contains our node (we'll assume the index contains node range information)
  • Decode the block
  • Binary search the nodes in the block
  • Repeat 6,301,138,735 times
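The lookup steps above can be sketched with Python's `bisect` (hypothetical names, assuming `block_index` is a sorted list of each block's first node ID, and `decode_block(i)` decodes block `i` into a sorted list of `(node_id, lon, lat)` tuples):

```python
import bisect

def find_node(node_id, block_index, decode_block):
    # 1. Binary search the index for the block that could hold node_id.
    i = bisect.bisect_right(block_index, node_id) - 1
    if i < 0:
        return None  # id precedes the first block
    # 2. Decode the block.
    nodes = decode_block(i)
    # 3. Binary search the nodes within the block.
    ids = [n[0] for n in nodes]
    j = bisect.bisect_left(ids, node_id)
    if j < len(ids) and ids[j] == node_id:
        return nodes[j]
    return None
```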

Certain bits can also be skipped with some extra smarts - for example, nodes in a way are typically sequential, so once we've found one node, the next node is likely to be right next to it.
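A sketch of exploiting that locality (hypothetical names, same assumed block index and `decode_block` as above): remember where the last lookup landed, and check the adjacent slot before falling back to the full binary search.

```python
import bisect

class NodeLocator:
    def __init__(self, block_index, decode_block):
        self.block_index = block_index
        self.decode_block = decode_block
        self.last = None  # (block index, slot, decoded nodes) of last hit

    def find(self, node_id):
        if self.last:
            i, j, nodes = self.last
            # Fast path: is it the node right after the previous hit?
            if j + 1 < len(nodes) and nodes[j + 1][0] == node_id:
                self.last = (i, j + 1, nodes)
                return nodes[j + 1]
        return self._search(node_id)

    def _search(self, node_id):
        # Slow path: the two-level binary search described above.
        i = bisect.bisect_right(self.block_index, node_id) - 1
        if i < 0:
            return None
        nodes = self.decode_block(i)
        ids = [n[0] for n in nodes]
        j = bisect.bisect_left(ids, node_id)
        if j < len(ids) and ids[j] == node_id:
            self.last = (i, j, nodes)
            return nodes[j]
        return None
```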

We also need to track which entities have been output.  If we use a simple bit array for it, that's 835MB.  There are other data structures that would likely be more space-efficient, such as a range tree, but it's starting to sound like a bit of a pain, especially if we ever need to store more than a single bit of data per node (like how many times a node is referenced, so we can store those nodes later in the file).
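For reference, the bit array itself is trivial - a minimal sketch, one bit per possible node ID (835MB works out to roughly 7 billion bits, i.e. sized by the maximum node ID rather than the node count):

```python
class BitSet:
    """One bit per id, backed by a bytearray."""

    def __init__(self, max_id):
        self.bits = bytearray((max_id >> 3) + 1)

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def test(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))
```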

All of this sounds kinda annoying, and will probably blow up into some O(n^3) graph problem.  Let's shelve it, and see what else we can do.