Faster OpenStreetMap imports 3/n: Understanding the import process
The tool that imports OSM data into Postgres is called osm2pgsql. OSM parsing is handled by libosmium, a header-only library for reading OSM files. osm2pgsql has multiple output backends; all of them except Postgres are ignored here.
osm2pgsql passes each file to libosmium, which memory maps the file, decodes each block, and executes a callback for each node, way, and relation as it is decoded.
If data is compressed, it must be decompressed before being used. If data is not compressed, a lot of it can be used "in place".
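The callback model described above can be sketched in a few lines. This is only an illustration of the dispatch pattern, not libosmium's actual API: the `Node`, `Way`, and `Relation` classes, the `CountingHandler`, and the `apply_blocks` function are all hypothetical names, and a "block" is simply a list of already-decoded objects.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    lon: float
    lat: float
    tags: dict = field(default_factory=dict)


@dataclass
class Way:
    id: int
    node_ids: list
    tags: dict = field(default_factory=dict)


@dataclass
class Relation:
    id: int
    members: list  # (member_type, member_id) pairs
    tags: dict = field(default_factory=dict)


class CountingHandler:
    """Hypothetical handler: one callback per object type."""

    def __init__(self):
        self.counts = {"node": 0, "way": 0, "relation": 0}

    def node(self, n):
        self.counts["node"] += 1

    def way(self, w):
        self.counts["way"] += 1

    def relation(self, r):
        self.counts["relation"] += 1


def apply_blocks(blocks, handler):
    """Walk the blocks in order and fire the matching callback per object."""
    for block in blocks:
        for obj in block:
            if isinstance(obj, Node):
                handler.node(obj)
            elif isinstance(obj, Way):
                handler.way(obj)
            elif isinstance(obj, Relation):
                handler.relation(obj)


handler = CountingHandler()
apply_blocks(
    [
        [Node(1, 0.0, 0.0), Node(2, 1.0, 0.0)],
        [Way(10, [1, 2]), Relation(20, [("way", 10)])],
    ],
    handler,
)
print(handler.counts)
```

The point of the pattern is that the consumer never holds the whole file: it only sees one object at a time, in file order.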
As each node is processed, it is added to up to 3 places:
- An in-memory cache mapping ID -> Location
- A temporary database table, planet_osm_nodes, mapping ID -> Location. This is the fallback for the cache
- Potentially a permanent database table, planet_osm_point, mapping ID -> Location + Tags. Tags are filtered first; if certain tags are then present, the node is added to the database, otherwise it is ignored.
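As a rough sketch of the three destinations for a node, the snippet below uses plain dictionaries in place of the cache and the database tables. The `NodeStore` class, the tag sets `DROPPED_KEYS` and `POINT_KEYS`, and the "stop caching when full" policy are assumptions made for illustration; osm2pgsql's real cache and tag filtering are more involved.

```python
# Hypothetical tag rules, chosen only for this example.
DROPPED_KEYS = {"created_by", "source"}      # removed by the tag filter
POINT_KEYS = {"amenity", "shop", "place"}    # tags that make a node a point


class NodeStore:
    def __init__(self, cache_capacity):
        self.cache_capacity = cache_capacity
        self.cache = {}         # in-memory: id -> location
        self.nodes_table = {}   # stand-in for the temporary nodes table
        self.point_table = {}   # stand-in for the permanent point table

    def add_node(self, node_id, lon, lat, tags):
        location = (lon, lat)
        # 1. The cache, as long as there is room (assumed policy).
        if len(self.cache) < self.cache_capacity:
            self.cache[node_id] = location
        # 2. The temporary table always gets the node.
        self.nodes_table[node_id] = location
        # 3. The point table only if an interesting tag survives filtering.
        kept = {k: v for k, v in tags.items() if k not in DROPPED_KEYS}
        if any(k in POINT_KEYS for k in kept):
            self.point_table[node_id] = (location, kept)

    def get_location(self, node_id):
        """Cache first, temporary table as the fallback."""
        if node_id in self.cache:
            return self.cache[node_id]
        return self.nodes_table.get(node_id)


store = NodeStore(cache_capacity=1)
store.add_node(1, 13.4, 52.5, {"amenity": "cafe", "source": "survey"})
store.add_node(2, 13.5, 52.6, {"created_by": "JOSM"})
```

With a capacity of 1, node 2 misses the cache but is still found through the fallback table, and only node 1 ends up in the point table.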
As each way is processed, it is added to a number of potential places:
- A temporary database table, planet_osm_ways, mapping ID -> List of Node IDs + Tags
- After filtering of tags, if certain tags are present, osm2pgsql asks the cache for the locations of all the nodes in the way. If a location can't be found in the cache, it falls back to planet_osm_nodes. Remember that exports may not contain all the data, so it is expected that sometimes a location can never be looked up. In this case, osm2pgsql acts as though the node wasn't in the way at all.
- If at least 2 nodes are successfully found, the way is then added to either planet_osm_line or planet_osm_polygon depending on whether it is a line or an area. Some lines may also be added to planet_osm_roads, a special table used for rendering at low zoom levels.
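The way handling can be sketched along the same lines. Everything here is a simplification under stated assumptions: the tag sets, the `MAJOR_ROADS` values, the rule that a closed ring with an area tag counts as an area, and the threshold of two resolved locations are all choices made for this example, and the tables are plain dictionaries.

```python
# Hypothetical tag rules, chosen only for this example.
AREA_KEYS = {"building", "landuse", "natural"}
LINE_KEYS = {"highway", "railway", "waterway"}
MAJOR_ROADS = {"motorway", "trunk", "primary"}


def locate(node_id, cache, nodes_table):
    """Look in the cache first, then fall back to the temporary table."""
    location = cache.get(node_id)
    if location is None:
        location = nodes_table.get(node_id)
    return location


def process_way(way_id, node_ids, tags, cache, nodes_table, tables):
    # Every way goes into the temporary ways table, tags included.
    tables["ways"][way_id] = (list(node_ids), dict(tags))
    if not any(k in AREA_KEYS or k in LINE_KEYS for k in tags):
        return None
    # Nodes that cannot be located are treated as if they were not there.
    coords = []
    for node_id in node_ids:
        location = locate(node_id, cache, nodes_table)
        if location is not None:
            coords.append(location)
    if len(coords) < 2:  # assumed minimum for a usable geometry
        return None
    closed = len(coords) >= 4 and coords[0] == coords[-1]
    if closed and any(k in AREA_KEYS for k in tags):
        target = "polygon"
    else:
        target = "line"
    tables[target][way_id] = (coords, dict(tags))
    if target == "line" and tags.get("highway") in MAJOR_ROADS:
        tables["roads"][way_id] = (coords, dict(tags))
    return target


cache = {1: (0, 0), 2: (1, 0)}
nodes_table = {1: (0, 0), 2: (1, 0), 3: (1, 1)}  # node 4 is missing entirely
tables = {"ways": {}, "line": {}, "polygon": {}, "roads": {}}

road = process_way(100, [1, 2, 3, 4], {"highway": "motorway"}, cache, nodes_table, tables)
house = process_way(101, [1, 2, 3, 1], {"building": "yes"}, cache, nodes_table, tables)
stub = process_way(102, [1, 4], {"highway": "residential"}, cache, nodes_table, tables)
```

Way 100 loses its unknown fourth node and is still stored as a three-point line, while way 102 is left with a single location and is dropped from the output tables, though it remains in the temporary ways table.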
As each relation is processed, it goes through a similar process:
- Added to a temporary planet_osm_rels table, mapping ID -> Node IDs + Way IDs + Rel IDs + Tags
- If the relation passes the filter, all ways contained in it are pulled from the planet_osm_ways table and processed further, typically as a polygon with holes in it, or a polygon consisting of multiple ways.
After the PBF file is fully processed, osm2pgsql then potentially deletes the planet_osm_nodes, planet_osm_ways, and planet_osm_rels tables and kicks off an indexing job for the remaining 4 tables. (It can keep the temporary tables and index all 7, but this is not required for our usage.)
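The final step amounts to "drop the temporary tables unless asked to keep them, then index whatever is left". The sketch below shows that shape using an in-memory SQLite database as a stand-in for Postgres. The `finish_import` and `make_demo_db` helpers, the two-column table layout, and the plain index on `id` are assumptions for illustration only; they do not reproduce the indexes osm2pgsql actually builds.

```python
import sqlite3

TEMP_TABLES = ("planet_osm_nodes", "planet_osm_ways", "planet_osm_rels")
# osm2pgsql's default output table names.
OUTPUT_TABLES = (
    "planet_osm_point",
    "planet_osm_line",
    "planet_osm_polygon",
    "planet_osm_roads",
)


def make_demo_db():
    """Create all 7 tables with a toy schema (assumed, not the real one)."""
    conn = sqlite3.connect(":memory:")
    for name in TEMP_TABLES + OUTPUT_TABLES:
        conn.execute(f"CREATE TABLE {name} (id INTEGER, payload TEXT)")
    return conn


def finish_import(conn, keep_temp=False):
    """Optionally drop the temporary tables, then index every table left."""
    if not keep_temp:
        for name in TEMP_TABLES:
            conn.execute(f"DROP TABLE IF EXISTS {name}")
    remaining = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    for name in remaining:
        conn.execute(f"CREATE INDEX {name}_id_idx ON {name} (id)")
    return remaining


indexed = finish_import(make_demo_db())
print(indexed)
```

Calling `finish_import(make_demo_db(), keep_temp=True)` instead leaves all 7 tables in place and indexes each of them.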
At this point, it's done.