## Faster OpenStreetMap imports 5/n: failing to re-order entities

With what we've explored so far, this leads to my first thought: re-ordering entities by collecting ...

- A small number of `relations`
- All the `ways` those `relations` depend on
- All the `nodes` those `ways` depend on

... repeated many times, until all the `relations` are processed, then collect ...

- A small number of unprocessed `ways`
- All the `nodes` those `ways` depend on

... repeated many times, until all the remaining `ways` are processed, then collect ...

- All the unprocessed `nodes`
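The batched collection above can be sketched roughly as follows. Everything here is a hypothetical stand-in, not a real API: `relations`, `ways`, and `nodes` are in-memory maps playing the role of the decoded PBF data, and `BATCH` is the "small number" per pass.

```python
from collections import deque

# Hypothetical stand-ins for the decoded PBF data: each map goes from an
# entity's ID to the IDs it references.
relations = {10: [1, 2], 11: [2, 3]}             # relation -> way IDs
ways = {1: [100, 101], 2: [101, 102], 3: [103]}  # way -> node IDs
nodes = {100, 101, 102, 103, 104}                # all node IDs (104 unreferenced)

BATCH = 1  # the "small number" of relations to collect per pass

output = []            # re-ordered entity stream, as (kind, id) tuples
done_ways, done_nodes = set(), set()

pending = deque(sorted(relations))
while pending:
    batch = [pending.popleft() for _ in range(min(BATCH, len(pending)))]
    batch_ways, batch_nodes = [], []
    for r in batch:                      # ways these relations depend on
        for w in relations[r]:
            if w not in done_ways:
                done_ways.add(w)
                batch_ways.append(w)
    for w in batch_ways:                 # nodes those ways depend on
        for n in ways[w]:
            if n not in done_nodes:
                done_nodes.add(n)
                batch_nodes.append(n)
    # Emit dependencies first, so nothing references an entity
    # before it's been seen.
    output += [("node", n) for n in batch_nodes]
    output += [("way", w) for w in batch_ways]
    output += [("relation", r) for r in batch]

# Then the remaining ways with their nodes, then any leftover nodes.
for w in sorted(set(ways) - done_ways):
    output += [("node", n) for n in ways[w] if n not in done_nodes]
    done_nodes.update(ways[w])
    output.append(("way", w))
output += [("node", n) for n in sorted(nodes - done_nodes)]
```

Each batch lands as a contiguous run of the `nodes` and `ways` it needs, which is exactly the locality we're after.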

At this point, we've now gone from ...

... to ...

In effect, we're re-ordering entities to maximize spatial/temporal locality.

Perhaps some additional smarts could be applied to put `nodes` which are only ever referenced once towards the start; that way, when they're evicted, it can be with confidence that they won't be referred to again.

This would ensure that most of the `nodes` that are needed will be fresh in the cache, and could probably also let us output the "ideal" cache size.

It does violate the rule about entity ordering, however, and osm2pgsql will show a warning about it. From looking at the code, though, it's just a warning, and it's only there in case you reference an entity before it's been seen.

But this brings us back to where we started: given an arbitrary `relation`, we need to find the `nodes` it references. The `nodes` are in order, so we could binary search within them to find the information we need. Of course, because blocks are an arbitrary size, we'd need an index of where each block starts. Our process is then:

- Binary search the blocks to find which block contains our `node` (we'll assume the index contains `node` range information)
- Decode the block
- Binary search the `nodes` in the block
- Repeat 6,301,138,735 times
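As a rough sketch of that two-level lookup, assuming a made-up in-memory `blocks` structure standing in for decoded PBF blocks, with the index keyed on each block's first `node` ID:

```python
import bisect

# Hypothetical already-decoded blocks of (node_id, lon, lat) tuples; in
# reality each block would be decompressed and decoded from the PBF on demand.
blocks = [
    [(1, 0.0, 0.0), (5, 1.0, 1.0), (9, 2.0, 2.0)],
    [(12, 3.0, 3.0), (20, 4.0, 4.0)],
    [(31, 5.0, 5.0), (40, 6.0, 6.0), (41, 6.1, 6.1)],
]
# The block index: the first node ID in each block.
first_ids = [b[0][0] for b in blocks]

def find_node(node_id):
    # 1. Binary search the index for the only block that could hold node_id.
    i = bisect.bisect_right(first_ids, node_id) - 1
    if i < 0:
        return None
    # 2. "Decode" the block (already decoded in this sketch).
    block = blocks[i]
    # 3. Binary search the nodes within the block.
    ids = [n[0] for n in block]
    j = bisect.bisect_left(ids, node_id)
    if j < len(ids) and ids[j] == node_id:
        return block[j]
    return None  # ID falls in a gap: no such node
```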

Certain bits can also be skipped with some extra smarts - for example, `nodes` in a `way` are typically sequential, so once we have a `node`, the next `node` is likely to be next to it.
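That shortcut might look something like this - remember where the last lookup landed, and probe the next slot before falling back to a full binary search. (`node_ids` is again a made-up stand-in for one decoded block's sorted node IDs.)

```python
import bisect

# Sorted node IDs from one hypothetical decoded block.
node_ids = [100, 101, 102, 103, 200, 300]
last_pos = -1  # position of the previous hit, -1 if none yet

def lookup(node_id):
    global last_pos
    # Fast path: ways usually reference consecutive nodes, so the next
    # slot after the previous hit is the most likely place to look.
    nxt = last_pos + 1
    if nxt < len(node_ids) and node_ids[nxt] == node_id:
        last_pos = nxt
        return nxt
    # Slow path: ordinary binary search.
    j = bisect.bisect_left(node_ids, node_id)
    if j < len(node_ids) and node_ids[j] == node_id:
        last_pos = j
        return j
    return None
```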

We also need to track which entities have been output. If we use a simple bit array for it, that's 835MB. There are other data structures that would likely be more space-efficient, such as a range tree, but it's starting to sound like a bit of a pain, especially if we ever need to store more than a single bit of data per `node` (like how many times a node is referenced, so we can store those `nodes` later in the file).
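For reference, the simple bit array is about as basic as data structures get - one bit per possible `node` ID, packed eight to a byte:

```python
class BitArray:
    """Flat bit array: one bit per possible node ID, packed into bytes."""

    def __init__(self, max_id):
        # One byte covers eight IDs; round up.
        self.bits = bytearray((max_id // 8) + 1)

    def set(self, i):
        self.bits[i >> 3] |= 1 << (i & 7)

    def get(self, i):
        return bool(self.bits[i >> 3] & (1 << (i & 7)))

seen = BitArray(max_id=100)
seen.set(42)
```

Sized by the maximum node ID rather than the node count (IDs are sparse), this is where the hundreds-of-megabytes figure comes from.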

All of this sounds kinda annoying, and will probably blow up into some O(n^3) graph problem. Let's shelve it, and see what else we can do.