In the Guts n’ Glory of Database Internals series (which I’ll probably continue if people suggest new topics), I talked about the very low-level things that are involved in actually building a database — from how to ensure consistency to the network protocols. But those are very low-level concerns. Important ones, but very low level. In this series, I want to start going up a bit in the stack and actually implement a toy database on top of a real production system, to show you what the database engine actually does.
In practice, we divide the layers of a database engine this way:
- Low-level storage (how we save the bits to disk), journaling, ACID.
- High-level storage (what kind of storage options do we have, B+Tree, nested trees, etc).
- Low-level data operations (working on a single item at a time).
- High-level data operations (large-scale operations, typically).
- Additional features (subscribing to changes, for example).
In order to do something interesting, we are going to be writing a toy graph database. I’m going to focus on levels 3 & 4 here, the kind of data operations that we need to provide the database we want, and we are going to build over pre-existing storage solution that handles 1 & 2.
Selecting the storage engine — sometimes it makes sense to go elsewhere for the storage engine. Typical examples includes using LMDB or LevelDB as embedded databases that handle the storage, and you build the data operations on top of that. This works, but it is limiting. You can’t do certain things, and sometimes you really want to. For example, LMDB supports the notion of multiple trees (and even recursive trees), while LevelDB has a flat key space. That has a big impact on how you design and build the database engine.
At any rate, I don’t think it will surprise anyone that I’m using Voron as the storage engine. It was developed to be a very flexible storage engine, and it works very well for that purpose.
We’ll get to the actual code in tomorrow’s post, but let’s lay out what we want to end up with:
- The ability to store nodes (for simplicity, a node is just an arbitrary property bag).
- The ability to connect nodes using edges.
- Edges belong to types, so KNOWS and WORKED_AT are two different connection types.
- An edge can be bare (no properties) or have data (again, for simplicity, just arbitrary property bag).
The purpose of the toy database we build is to allow the following low-level operations:
- Add a node.
- Add an edge between two nodes.
- Traverse from a node to all its edges (cheaply).
- Traverse from an edge to the other nodest (cheaply).
That is it, should be simple enough, right? With that in mind, stay tuned for the next part of this series, where we dive into building a flexible database.