Faster OpenStreetMap imports 2/n: Understanding OSM PBF files
What are maps?
OSM maps typically come in a .osm.pbf file. In the general case, this is either going to be planet, which is the largest dataset (not including historical data), or a partial export. As planet is large, I'll be ignoring that*, and dealing with 3 exports:
- ACT for a dataset which is small enough to analyze by hand, and which can be imported without much drama even with no cache at all**
- NSW for a larger dataset, because it's where a bunch of my GPS data is. If something runs in O(n^2) memory it might fail, but O(n) is probably going to fit in RSS.
- US for a challenging dataset and because it's my ultimate goal. If something runs in O(n^2) memory, it will fail. If something runs in O(n) memory, it will probably fail too.
What are in maps?
PBF files are a series of blocks, each block consisting of 3 parts:
- The length of the BlobHeader
- The header, BlobHeader
- The data, Blob
This is a rather bothersome format, so I'll try to untangle it and explain the problems behind it. First off, the BlobHeader and Blob are protobuf-encoded data. When decoding protobuf data, the entire buffer is processed: there is no end-of-message marker and no built-in length, so without knowing the length up front you can't tell where a message stops. Therefore, each block starts with the length of its first protobuf, the BlobHeader.
Decoding the BlobHeader gives you two more pieces of information: the size and type of the Blob. The overall format is extensible, which allows for new types of blobs to be added by simply defining a new type in the BlobHeader. If a process doesn't know how to consume a particular block type, it can skip over it.
A Blob then contains one of two things: compressed data, or non-compressed data. The data is decompressed if necessary, and then decoded into the type specified by the BlobHeader type.
This is why the format has multiple wrappers: you might not care about a Blob, but you won't know unless you look at the BlobHeader first to get the type. You can't get the BlobHeader unless you know how big it is. So it goes BlobHeader length, BlobHeader containing Blob length, Blob.
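The three-layer wrapping can be sketched in a few dozen lines of Python. This is a rough illustration rather than a real reader: it hand-rolls just enough protobuf decoding (varints and length-delimited fields) to avoid needing generated classes, and it assumes the layout from the OSM PBF format description as I understand it: a 4-byte big-endian length prefix, BlobHeader fields 1 (type) and 3 (datasize), and Blob fields 1 (raw) and 3 (zlib_data). The names `read_varint`, `parse_fields` and `iter_blocks` are my own.

```python
import io
import struct
import zlib


def read_varint(buf, pos):
    """Decode one protobuf varint starting at pos; return (value, new_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:  # high bit clear: this was the last byte of the varint
            return value, pos
        shift += 7


def parse_fields(buf):
    """Return {field_number: value} for varint and length-delimited fields."""
    fields, pos = {}, 0
    while pos < len(buf):  # no end marker: run until the buffer is used up
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:  # varint
            fields[number], pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (bytes, string, sub-message)
            size, pos = read_varint(buf, pos)
            fields[number] = bytes(buf[pos:pos + size])
            pos += size
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
    return fields


def iter_blocks(stream, wanted):
    """Yield (block_type, payload) for wanted types, seeking past the rest."""
    while True:
        prefix = stream.read(4)
        if not prefix:
            return  # clean end of file
        (header_len,) = struct.unpack(">I", prefix)     # 1. BlobHeader length
        header = parse_fields(stream.read(header_len))  # 2. BlobHeader
        block_type = header[1].decode("utf-8")          # field 1: type
        blob_len = header[3]                            # field 3: datasize
        if block_type not in wanted:
            stream.seek(blob_len, io.SEEK_CUR)  # skip the Blob without decoding it
            continue
        blob = parse_fields(stream.read(blob_len))      # 3. Blob
        if 1 in blob:                                   # field 1: raw
            yield block_type, blob[1]
        elif 3 in blob:                                 # field 3: zlib_data
            yield block_type, zlib.decompress(blob[3])
        else:
            raise ValueError("compression not handled in this sketch")
```

Calling `iter_blocks(open(path, "rb"), {"OSMHeader", "OSMData"})` would yield the decompressed bytes of each block, which still need decoding as the protobuf type named in the header. Blocks of any other type are never read into memory, which is the whole point of putting the type and size in a separate header.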
The first block must be of type OSMHeader, and contains a bunch of things that aren't relevant but I'll talk about briefly anyway. Specifically there are the required_features and optional_features lists. A reader that doesn't recognize an entry in required_features is expected to reject the file, while unknown optional_features can be ignored. Previously it was noted that correctness is not important, but if it was, my changes would define a new required_features entry, because I'm changing some pretty fundamental things.
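As a small sketch of how a reader might act on those two lists: refuse the file if it requires something the reader doesn't implement, and quietly ignore optional extras. `OsmSchema-V0.6` and `DenseNodes` are feature names I believe are in common use; `Hypothetical-NewLayout` is made up here to stand in for the kind of entry my changes would need, and the function names are mine.

```python
# Features this imaginary reader implements.
SUPPORTED = frozenset({"OsmSchema-V0.6", "DenseNodes"})


def missing_features(required_features, supported=SUPPORTED):
    """Return the required features the reader does not implement, sorted."""
    return sorted(set(required_features) - set(supported))


def check_header(required_features, optional_features=(), supported=SUPPORTED):
    """Raise if the file cannot be read correctly; otherwise return the
    optional features the reader can take advantage of."""
    missing = missing_features(required_features, supported)
    if missing:
        raise ValueError("unsupported required features: " + ", ".join(missing))
    # Unknown optional features are simply dropped.
    return [name for name in optional_features if name in supported]
```

A file written with a changed layout would list the new entry under required_features, so an unmodified reader fails loudly instead of producing garbage.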
All subsequent blocks will be of type OSMData. The exact format doesn't matter, except to say that each contains a list of nodes, ways, or relations, which collectively I am calling "entities". So what are they?
What are entities?
This diagram outlines the general structure and relationship between the 3 entity types:

A node is an ID, a location and a list of tags. This could be a single stop sign or tree, or part of a larger structure in a way or relation.
A way is an ID, an ordered list of node IDs, and a list of tags. These typically define a line such as a road or row of trees, or a closed polygon outlining a lake or building.
A relation is an ID, a list of entity IDs, a list of entity types (node, way, or relation), a list of entity roles, and a list of tags. A role is additional data about what an entity is within the relation. For example, a lake with an island would be defined as two ways, one of which has a role of outer, and one has a role of inner. Of note is that the idea of inner comes from the relation, not the way. The same way could be used to define the island as a geographical feature in its own right.
Tags, common to all 3 types, are a list of arbitrary name=value pairs. Some examples:
- name:ko=오스트레일리아, for the Korean translation of Australia
- maxspeed=60, to define the speed limit on a street
- natural=water, to define that an entity is water.
Not every entity has tags, especially nodes, as they are frequently just a location in a way.
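To make the three entity types concrete, here is one way they could be modelled in Python, using the lake-with-an-island example. This is only a sketch: the class and field names are mine, all the IDs are invented, and I've zipped the relation's three parallel lists (IDs, types, roles) into a single list of `Member` objects for readability. The `type=multipolygon` and `place=island` tags are what I'd expect to see on such data, not something taken from a real export.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lat: float
    lon: float
    tags: dict[str, str] = field(default_factory=dict)  # usually empty


@dataclass
class Way:
    id: int
    node_ids: list[int]  # ordered; a closed polygon repeats its first node last
    tags: dict[str, str] = field(default_factory=dict)

    def is_closed(self) -> bool:
        return len(self.node_ids) > 2 and self.node_ids[0] == self.node_ids[-1]


@dataclass
class Member:
    type: str  # "node", "way" or "relation"
    id: int
    role: str  # meaning of this entity *within* the relation


@dataclass
class Relation:
    id: int
    members: list[Member]
    tags: dict[str, str] = field(default_factory=dict)


# All IDs below are made up for illustration.
shore = Way(100, [1, 2, 3, 4, 1])
island = Way(101, [5, 6, 7, 5], {"place": "island"})
lake = Relation(
    200,
    [Member("way", 100, "outer"), Member("way", 101, "inner")],
    {"type": "multipolygon", "natural": "water"},
)
```

Note that `island` knows nothing about being "inner": the role lives on the relation's member entry, so the same way can carry its own tags and stand as a feature in its own right.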
In a well-formed PBF file, all the nodes will appear in ID order, followed by all the ways in ID order, followed by all the relations in ID order. Anecdotally the ratio between each type of data is 1000 nodes : 100 ways : 1 relation. The US is 882M nodes, 81M ways, and 731k relations.
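The ordering rule is easy to check in a single streaming pass with constant memory, which matters at these sizes. A minimal sketch, assuming the entities have already been decoded into `(kind, id)` pairs (the function name and that representation are my own):

```python
KIND_ORDER = {"node": 0, "way": 1, "relation": 2}


def is_well_ordered(entities):
    """True if entities arrive as nodes, then ways, then relations,
    with strictly increasing IDs inside each kind.

    entities is any iterable of (kind, id) pairs; only the previous
    pair is remembered, so memory use does not grow with the input.
    """
    previous = None
    for kind, entity_id in entities:
        key = (KIND_ORDER[kind], entity_id)
        if previous is not None and key <= previous:
            return False
        previous = key
    return True
```

Comparing `(kind rank, id)` tuples covers both rules at once: IDs restart when the kind changes, so way 2 may legitimately follow node 5, but a node appearing after any way is rejected.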
Additionally, exports can contain references to things not in the export. For example, NSW includes relation 80500, "Australia", which references relation 2316741, "Victoria". This relation is not in the NSW export***.
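Finding these dangling references is a set-membership problem. Here is a rough sketch, assuming relations have been reduced to a mapping of relation ID to `(member_type, member_id)` pairs and that the IDs present in the export are held in one set per type; both shapes and the function name are my own, and at US scale holding those sets in memory is exactly the kind of thing that gets expensive.

```python
def dangling_members(relations, present):
    """Return sorted (relation_id, member_type, member_id) triples whose
    member is not in the export.

    relations: {relation_id: [(member_type, member_id), ...]}
    present:   {"node": set_of_ids, "way": set_of_ids, "relation": set_of_ids}
    """
    missing = []
    for relation_id, members in relations.items():
        for member_type, member_id in members:
            if member_id not in present.get(member_type, set()):
                missing.append((relation_id, member_type, member_id))
    return sorted(missing)


# Relation 80500 ("Australia") references relation 2316741 ("Victoria"),
# which is absent from the NSW export. Way 1 is a made-up member that is present.
relations = {80500: [("relation", 2316741), ("way", 1)]}
present = {"way": {1}, "relation": {80500}}
print(dangling_members(relations, present))
```

An importer then has to decide what to do with each triple: drop the member, drop the whole relation, or keep the reference and tolerate the hole.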
* lol
** ACT is my Colorado, if you know what I mean.
*** NSW does however have a reference to 63172, "Verkehrsverbund Rhein-Ruhr".