Using PostgreSQL as a graph database


Who needs a graph database when you already have PostgreSQL

by Krzysztof Dryś · 4 min read backends databases SQL

This is a short story of how we used PostgreSQL to store and query graphs. This is a story with a happy ending because we found a simple and cheap solution to what seemed a complex problem.

Why would you store graphs in an SQL database?

To begin with, we have data that looks like a graph. Additionally, we have a lot of experience with operating SQL databases and little-to-no experience with graph databases. I don't mean only using them. I mean stuff like high availability, backups and performance monitoring.

So we searched for possible solutions and found this paper. It is not exactly an easy read, but it makes one thing very clear: an SQL database can be efficiently used to store and query graph data. To quote the abstract:

We show that existing mature, relational optimizers can be exploited with a novel schema to give better performance for property graph storage and retrieval than popular noSQL graph stores.

Getting our hands dirty

Before we implement anything, let's look at the business case. After all, we want the schema to model "the real world".

What entities do we have at Ingrid? Our mission is to create a great delivery experience for everyone. Our entities are: order, shipment, package and tracking number. Actually, there are a few more, but let's keep it simple.

What are the possible relations between them? Well, it's complicated.

The online shop uses Ingrid to show shipping options to an end-customer. The end-customer places an order, which is then sent to them using one or many shipments. Each shipment can have many packages (in the case of a big cart that does not fit in one package). Each package can have many tracking numbers (think: multi-leg delivery). A merchant can also book a shipment in an external service and attach the tracking number to an order created in Ingrid.

Anything can be a part of anything (sort of).

The schema

We want to translate this graph of relations into a database schema. Entities (order, shipment, tracking number etc.) will be nodes. An edge will indicate that there is a relationship between two nodes (in our case "is part of").

CREATE TYPE entity_type AS ENUM (
    'order',
    'tracking_number',
    'shipment',
    'package'
);

-- entity is a node of our graph.
-- It represents a business entity in our system.
CREATE TABLE entity (
    node_id BIGSERIAL,
    entity_type entity_type NOT NULL,
    entity_id TEXT NOT NULL,
    PRIMARY KEY (node_id)
);

CREATE UNIQUE INDEX entity_entity_type_entity_id_unique_indx ON entity (entity_type, entity_id);

-- relation is an edge of a graph.
-- It represents "is part of" relationship.
CREATE TABLE relation (
    parent BIGINT NOT NULL,
    child BIGINT NOT NULL,
    PRIMARY KEY (parent, child)
);

CREATE INDEX relation_child_parent_indx ON relation(child, parent);

We distinguish between the primary key (node_id) and the business identifier (composed of entity_type and entity_id). This keeps indexes smaller and queries faster. Moreover, inserting is faster when your primary key is serial, because you always append to the end of the table. Though admittedly, we have no benchmarks to prove these claims.
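As an illustration, here is how the schema might be populated. The identifiers below (order-123, shipment-456) are made up for this sketch; in practice the application would likely keep the generated node_ids around instead of looking them up on every edge insert:

```sql
-- Create two nodes: an order and a shipment that belongs to it.
INSERT INTO entity (entity_type, entity_id) VALUES
    ('order', 'order-123'),
    ('shipment', 'shipment-456');

-- Connect them with an "is part of" edge, looking up the
-- serial node_ids by the business identifiers.
INSERT INTO relation (parent, child)
SELECT p.node_id, c.node_id
FROM entity p, entity c
WHERE (p.entity_type, p.entity_id) = ('order', 'order-123')
  AND (c.entity_type, c.entity_id) = ('shipment', 'shipment-456');
```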

Querying

How do you search data in such tables? Here is the query used to find all orders connected to a particular tracking number:

WITH RECURSIVE traverse(node_id, entity_type, entity_id) AS (
    SELECT
        node_id,
        entity_type,
        entity_id
    FROM entity
    WHERE
        entity.entity_id = $1 AND
        entity.entity_type = 'tracking_number'
    UNION ALL
    SELECT
        entity.node_id,
        entity.entity_type,
        entity.entity_id
    FROM traverse
    JOIN relation ON traverse.node_id = relation.child
    JOIN entity ON relation.parent = entity.node_id
)
SELECT
    traverse.entity_id AS order_id
FROM traverse
WHERE traverse.entity_type = 'order'
GROUP BY traverse.entity_id
ORDER BY traverse.entity_id ASC
LIMIT $2;

This was the first recursive SQL query used in Ingrid, so everyone needed a little reminder of how this works.

This recursive query is made of two parts. The first part is executed only once. It finds the entity with the specified tracking number:

SELECT
    node_id,
    entity_type,
    entity_id
FROM entity
WHERE
    entity.entity_id = $1 AND
    entity.entity_type = 'tracking_number'

The second part references traverse, the very table being defined, which is what makes the whole query recursive. Still, it is easier for me to analyse it as if it was executed iteratively:

  • in the first iteration, traverse contains results of the first query,
  • in the second iteration, traverse is the result of the first iteration,
  • in the third iteration, traverse is the result of the second iteration,
  • and so it goes until one of the iterations returns no rows.
SELECT
    entity.node_id,
    entity.entity_type,
    entity.entity_id
-- traverse will hold different rows on each "iteration"
FROM traverse
-- join with relation to get the parent ids
JOIN relation ON traverse.node_id = relation.child
-- join with entity to get more data about the parents
JOIN entity ON relation.parent = entity.node_id
Everything stops when the second query returns no rows. After that, the results from all iterations are "appended" (the regular UNION/UNION ALL rules apply) and "returned" to be used by the subsequent SELECT.

If you want to read more about recursive queries in PostgreSQL, read the official documentation.

Note that this query terminates only because we don't have cycles in our graph. We don't have any rule (like a trigger or constraint) to enforce that; we do it at the application level.
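If cycles could sneak in, a common PostgreSQL defence (a sketch, not part of our production query) is to carry the visited path along in an array and refuse to revisit a node:

```sql
-- Cycle-safe variant: track visited node_ids in a path array.
WITH RECURSIVE traverse(node_id, entity_type, entity_id, path) AS (
    SELECT node_id, entity_type, entity_id, ARRAY[node_id]
    FROM entity
    WHERE entity.entity_id = $1 AND
          entity.entity_type = 'tracking_number'
    UNION ALL
    SELECT entity.node_id, entity.entity_type, entity.entity_id,
           traverse.path || entity.node_id
    FROM traverse
    JOIN relation ON traverse.node_id = relation.child
    JOIN entity ON relation.parent = entity.node_id
    -- stop as soon as we would revisit an already-seen node
    WHERE NOT entity.node_id = ANY(traverse.path)
)
SELECT entity_id FROM traverse WHERE entity_type = 'order';
```

The path array makes each row a bit wider, so there is a cost; since our graph is acyclic by construction, we skipped it.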

The performance

It is all very nice and impressive that PostgreSQL can do this (actually, so can MySQL). Still, how does this perform on production?

Let's look at the numbers. At the time of writing, we have been running this service on production for two months. The service handles about 5 entity inserts and 5 relation inserts per second, plus 2 search queries per second. Executing the search query takes as little as 10 milliseconds even at the 99.9th percentile.

All these reads and writes together use up to 5% of one CPU core (the database occupies a dedicated instance). So, while this setup has yet to face really significant traffic, we think it should scale pretty well.

Wrapping it up

There are dedicated solutions for storing graph data. We decided that they are overkill for us and went for PostgreSQL. We are happy with the solution because we implemented it quickly, it is stable (at least so far), and it does not consume a lot of resources.


Cover photo by @scottwebb on Unsplash

Does Ingrid sound like an interesting place to work at? We are always looking for good people! Check out our open positions