Faster OpenStreetMap imports 2/n: Understanding OSM PBF files
What are maps?
OSM maps typically come in a .osm.pbf file. In the general case, this is either going to be planet, which is the largest dataset (not including historical data), or a partial export. As planet is large, I'll be ignoring that*, and dealing with 3 exports:
- ACT for a dataset which is small enough to analyze by hand, and which can be imported without much drama and without any cache at all**
- NSW for a larger dataset, because it's where a bunch of my GPS data is. If something runs in O(n^2) memory it might fail, but O(n) is probably going to fit in RSS.
- US for a challenging dataset and because it's my ultimate goal. If something runs in O(n^2) memory, it will fail. If something runs in O(n) memory, it will probably fail.
What are in maps?
PBF files are a series of blocks, each block consisting of 3 parts:
- The length of the BlobHeader,
- The header, BlobHeader,
- The data, Blob.
This is a rather bothersome format, so I'll try to untangle it and explain the problems behind it. First off, the BlobHeader and Blob are protobuf encoded data. When decoding protobuf data, the entire buffer is processed - there is no end of message marker, and no built-in length, so without a length, you can't process it. Therefore, we start with the length of the first protobuf, the BlobHeader.
BlobHeader gives you two more pieces of information: the size and type of the Blob. The overall format is extensible, which allows for new types of blobs to be added by simply defining a new type in the BlobHeader. If a process doesn't know how to consume a particular block type, it can skip over it.
Blob then contains one of two things: compressed data, or non-compressed data. The data is decompressed if necessary, and then decoded into the type specified by the BlobHeader.
This is why the format has multiple wrappers: you might not care about a Blob, but you won't know unless you look at the BlobHeader first to get the type. You can't get the BlobHeader unless you know how big it is. So it goes.
The first block must be of type OSMHeader, and contains a bunch of things that aren't relevant but I'll talk about briefly anyway. Specifically there are the required_features and optional_features lists. Previously it was noted that correctness is not important, but if it were, my changes would define a new required_features entry, because I'm changing some pretty fundamental things.
All subsequent blocks will be of type OSMData. The exact format doesn't matter, except to say that it contains a list of nodes, ways, and relations, which collectively I am calling "entities". So what are they?
What are entities?
This diagram outlines the general structure and relationship between the 3 entity types:
A node is an ID, a location and a list of tags. This could be a single stop sign or tree, or part of a larger structure in a way.
A way is an ID, a list of ordered node IDs, and a list of tags. These typically define a line such as a road or row of trees, or a closed polygon, outlining a lake or building.
A relation is an ID, a list of entity IDs, a list of entity types (node, way, relation), a list of entity roles, and a list of tags. A role is additional data about what an entity is. For example, a lake with an island would be defined as two ways, one of which has a role of outer, and one of which has a role of inner. Of note is that the idea of inner comes from the relation, not the way. The same way could be used to define the island as a geographical feature in its own right.
Tags, common between all 3 types, are a list of any name=value data. Some examples:
- name:ko=오스트레일리아, for the Korean translation of Australia
- maxspeed=60, to define the speed limit on a street
- natural=water, to define that an entity is water.
Not every entity has tags, especially nodes, as they are frequently just a location in a way.
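To make the three entity types concrete, here is a minimal sketch of them as Python dataclasses. The class and field names are my own for illustration, not the names used in the PBF schema, and I've zipped the relation's parallel lists of IDs, types and roles into one list of members for readability. The IDs and the place=island tag in the lake example are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict = field(default_factory=dict)  # often empty for nodes


@dataclass
class Way:
    id: int
    node_ids: list                            # ordered list of node IDs
    tags: dict = field(default_factory=dict)


@dataclass
class Member:
    type: str                                 # "node", "way" or "relation"
    id: int
    role: str                                 # e.g. "outer" or "inner"


@dataclass
class Relation:
    id: int
    members: list                             # list of Member
    tags: dict = field(default_factory=dict)


# A lake with an island: two ways, tied together by one relation.
shore = Way(id=1, node_ids=[10, 11, 12, 10])
island = Way(id=2, node_ids=[20, 21, 22, 20], tags={"place": "island"})
lake = Relation(
    id=3,
    members=[Member("way", shore.id, "outer"), Member("way", island.id, "inner")],
    tags={"natural": "water"},
)
```

Note that the island way knows nothing about being "inner"; that lives only on the membership in the relation, so the same way can carry its own tags as a feature in its own right.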
In a well-formed PBF file, all the nodes will appear in ID order, followed by all the ways in ID order, followed by all the relations in ID order. Anecdotally the ratio between each type of data is 1000 nodes : 100 ways : 1 relation. The US has 882M nodes and 731k relations.
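The ordering rule is easy to check if you already have the entities as a stream of (kind, id) pairs in file order. A small sketch, assuming "ID order" means strictly increasing IDs within each type; the names KIND_RANK and is_well_ordered are hypothetical:

```python
KIND_RANK = {"node": 0, "way": 1, "relation": 2}


def is_well_ordered(entities):
    """entities: iterable of (kind, id) pairs in the order they appear in the file."""
    previous = None
    for kind, entity_id in entities:
        current = (KIND_RANK[kind], entity_id)
        # Tuples compare by rank first, then by ID, so a single comparison covers
        # both "nodes before ways before relations" and "IDs ascending per type".
        if previous is not None and current <= previous:
            return False
        previous = current
    return True
```

For example, is_well_ordered([("node", 1), ("node", 5), ("way", 2)]) holds, while a node appearing after a way does not.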
Additionally, exports can contain references to things not in the export. For example, NSW includes relation 80500, "Australia", which references relation 2316741, "Victoria". This relation is not in the NSW export***.
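Finding these dangling references is a set lookup once the IDs present in the export are known. A rough sketch, assuming the relations and the sets of present IDs have already been loaded into plain Python structures (the function name and data shapes are made up for this example):

```python
def dangling_members(relations, present):
    """
    relations: {relation_id: [(member_type, member_id), ...]}
    present:   {"node": set of IDs, "way": set of IDs, "relation": set of IDs}
    Returns {relation_id: [missing (type, id) pairs]} for relations with gaps.
    """
    missing = {}
    for rel_id, members in relations.items():
        gaps = [(kind, ref) for kind, ref in members if ref not in present[kind]]
        if gaps:
            missing[rel_id] = gaps
    return missing


# The example from the text: "Australia" references "Victoria",
# which is not part of the export.
relations = {80500: [("relation", 2316741)]}
present = {"node": set(), "way": set(), "relation": {80500}}
print(dangling_members(relations, present))
```

Anything that consumes a partial export has to tolerate a non-empty result here rather than treat it as corruption.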
** ACT is my Colorado, if you know what I mean.
*** NSW does however have a reference to 63172, "Verkehrsverbund Rhein-Ruhr".