I've been talking about the different aspects of building a database engine in detail for the past month or so, but I tried to cover each topic independently so that it would make sense on its own. The problem is that in the real world, quite a lot of these things affect one another. This series of posts is meant to tie everything together, so you’ll have a better understanding of how the design decisions in one place are affected by requirements somewhere that seems utterly unrelated.
Before we can talk about the implementation details, let's see what we are trying to achieve. Voron is:
- High performance.
- Single writer, multiple readers (MVCC)
- Fully ACID
In this post, I’m not going to talk about the data model or how we sort it... or anything like that. No, we are at a much lower level than that. We are going to go over how we access the raw data pages and manage them.
There are actually multiple players involved here. We have the journal for durability of writes, the data file to store the data, the scratch file to implement Multi-Version Concurrency Control (MVCC), and the Page Translation Tables to provide snapshot isolation for concurrent transactions.
The design of Voron is immensely simplified by the fact that we chose to go with a single writer model. We share this design decision with other database engines such as LMDB, LevelDB, RocksDB, etc. Concurrent write transactions are much more complex and require a lot more effort, and you still have the serialization at the journal level, although I explored multiple ways around it. With Voron, we decided to go with a single write transaction for simplicity, and then implemented transaction merging on top of that, which gives us a tremendous performance boost in high load scenarios.
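To make the idea of transaction merging concrete, here is a minimal sketch in Python (chosen for brevity; Voron itself is not written in Python). It assumes that merging means queuing independent write requests and letting the single writer apply a whole batch inside one transaction, so the cost of a commit is paid once per batch. The `TransactionMerger` class, its `enqueue` and `run_once` methods, and the `max_batch` parameter are all hypothetical names for illustration, not Voron's API.

```python
from collections import deque


class TransactionMerger:
    """Hypothetical merger: many queued operations, one commit per batch."""

    def __init__(self):
        self.pending = deque()  # operations waiting for the single writer
        self.state = {}         # stand-in for the committed database state
        self.commits = 0        # how many commits we actually paid for

    def enqueue(self, operation):
        # Callers hand over a function that takes the open transaction.
        self.pending.append(operation)

    def run_once(self, max_batch=128):
        """Apply up to max_batch queued operations in a single transaction."""
        if not self.pending:
            return 0
        tx = dict(self.state)   # the one and only write transaction
        applied = 0
        while self.pending and applied < max_batch:
            operation = self.pending.popleft()
            operation(tx)
            applied += 1
        self.state = tx         # a single commit covers the whole batch
        self.commits += 1
        return applied


merger = TransactionMerger()
for i in range(5):
    merger.enqueue(lambda tx, i=i: tx.__setitem__("key-%d" % i, i))
applied = merger.run_once()
```

Five requests went in, but only one commit happened. Under load, that is where the saving would come from in a design like this.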
But let us talk about MVCC. The idea is that we have concurrent versions of the data, so each transaction has a snapshot of the entire database and can operate on that without fear of write transactions modifying data while it is running. Let's explore how this works when the database starts.
The key to that is the notion of the page translation table, known from now on as the PTT. When the database starts, we have an empty PTT and the data file itself. We open a read transaction, which has the following data:
- PTT: [/* no entries */]
- Data file
Whenever the read transaction needs to read a page, it consults the PTT, finds that there is nothing there, and reads the page from the data file. We keep the read transaction open and open a new write transaction. It also gets a PTT and the data file, but it also needs to keep track of a few other things:
- PTT: [/* no entries */]
- Data file
- Dirty pages
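The state of the two transactions, and the lookup rule (PTT first, data file on a miss), can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not Voron's actual types: the class names, the `read_page` method, and the representation of the files as in-memory lists are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ReadTransaction:
    """Hypothetical read transaction: a PTT plus the data file."""
    ptt: dict        # page number -> slot index in the scratch file
    data_file: list  # data_file[n] holds the bytes of Page #n
    scratch: list    # scratch[slot] holds a copied page

    def read_page(self, page_number):
        # Consult the PTT first; fall back to the data file on a miss.
        slot = self.ptt.get(page_number)
        if slot is None:
            return self.data_file[page_number]
        return self.scratch[slot]


@dataclass
class WriteTransaction(ReadTransaction):
    """Same as a read transaction, plus the set of pages it has modified."""
    dirty_pages: set = field(default_factory=set)


data_file = [b"page-%d" % n for n in range(8)]
read_tx = ReadTransaction(ptt={}, data_file=data_file, scratch=[])
write_tx = WriteTransaction(ptt={}, data_file=data_file, scratch=[])
```

With an empty PTT, `read_tx.read_page(3)` simply returns whatever the data file holds for Page #3.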
Now, we want to make a change to the database, which happens to fall on Page #3. Here we have a problem: we can’t modify the data file directly, because ReadTx-1 is still running and it might want to read the data in Page #3 at some point. Instead of modifying the data directly, we copy the page into the scratch file.
The scratch file is just a temporary file that we use to store data copies. After we copy the data, we update the PTT. Now when we search for Page #3, we’ll find the location of the page in the scratch file. As far as the write transaction is concerned, this doesn’t matter. A page is a page is a page, and it doesn’t matter where it lives.
Committing the transaction means taking all of the dirty pages in the write transaction and writing them to the journal, after which we atomically set the PTT for the write transaction as the global PTT. Now, all future read transactions will have the new PTT, and when they ask for Page #3, they will get the page from the scratch file.
A new write transaction that needs to (again) modify Page #3 will create another copy of the page inside the scratch file. This ends up looking like this:
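A rough sketch of that copy-on-write step, again in Python and again with invented names (`WriteTransaction`, `modify_page`) and in-memory lists standing in for the real files:

```python
class WriteTransaction:
    """Hypothetical write transaction that never touches the data file."""

    def __init__(self, global_ptt, data_file, scratch):
        self.ptt = dict(global_ptt)  # private copy, safe to change
        self.data_file = data_file   # data_file[n] holds Page #n
        self.scratch = scratch       # copied pages are appended here
        self.dirty_pages = set()

    def read_page(self, page_number):
        slot = self.ptt.get(page_number)
        if slot is None:
            return self.data_file[page_number]
        return self.scratch[slot]

    def modify_page(self, page_number):
        """Return a writable copy of the page, creating it on first use."""
        if page_number not in self.dirty_pages:
            # Copy the current version into the scratch file...
            copy = bytearray(self.read_page(page_number))
            self.scratch.append(copy)
            # ...and point the PTT at the copy.
            self.ptt[page_number] = len(self.scratch) - 1
            self.dirty_pages.add(page_number)
        return self.scratch[self.ptt[page_number]]


data_file = [b"original-%d" % n for n in range(8)]
scratch = []
tx = WriteTransaction({}, data_file, scratch)

page = tx.modify_page(3)
page[:8] = b"modified"  # change the copy in place
```

The data file still holds the original Page #3, so a reader that started earlier is unaffected, while the write transaction sees its own modified copy through the PTT.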
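The commit sequence might look something like the sketch below. It is a simplified Python model with assumed names (`Environment`, `write_page`, `commit`): the journal is a plain list rather than a durable file, and "atomically" is modeled as a single reference assignment that swaps in the new PTT without ever mutating the old one.

```python
class Environment:
    """Hypothetical container for the shared state; not Voron's real API."""

    def __init__(self, data_file):
        self.data_file = data_file  # data_file[n] holds Page #n
        self.scratch = []           # copied pages live here
        self.journal = []           # stand-in for the durable journal
        self.global_ptt = {}        # replaced wholesale, never mutated

    def read_page(self, ptt, n):
        slot = ptt.get(n)
        return self.data_file[n] if slot is None else self.scratch[slot]

    def write_page(self, ptt, dirty, n, content):
        # Copy-on-write: the new version goes to a fresh scratch slot.
        self.scratch.append(content)
        ptt[n] = len(self.scratch) - 1
        dirty.add(n)

    def commit(self, ptt, dirty):
        # 1. Write every dirty page to the journal (a real engine would
        #    also flush to disk here before continuing).
        for n in sorted(dirty):
            self.journal.append((n, self.scratch[ptt[n]]))
        # 2. Publish: one reference assignment swaps the global PTT.
        self.global_ptt = ptt


env = Environment([b"v0-page-%d" % n for n in range(8)])
old_snapshot = env.global_ptt  # ReadTx-1 starts here

ptt, dirty = dict(env.global_ptt), set()  # write transaction state
env.write_page(ptt, dirty, 3, b"v1-page-3")
env.commit(ptt, dirty)

new_snapshot = env.global_ptt  # a read transaction opened after the commit
```

Because the old PTT object is never modified, `old_snapshot` keeps resolving Page #3 to the data file, while `new_snapshot` resolves it to the scratch file.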
We have three copies of Page #3. One for the original read transaction (in the data file), one for the current read transactions (yellow in the scratch file), and the current modified page (orange in the scratch file) that we are writing to.
When the write transaction completes, we again flush the dirty pages to the journal and then publish our PTT so all future transactions can see the changes.
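Here is a compact sketch of how the three versions of Page #3 can coexist. As before, this is Python with made-up helper names (`read_page`, `copy_on_write`), and each transaction is reduced to just the PTT it holds:

```python
data_file = {3: b"original"}  # the data file, holding only Page #3
scratch = []                  # the scratch file


def read_page(ptt, n):
    slot = ptt.get(n)
    return data_file[n] if slot is None else scratch[slot]


def copy_on_write(ptt, n, content):
    """Store a new version in the scratch file and return a new PTT."""
    scratch.append(content)
    new_ptt = dict(ptt)  # the PTT we started from is left untouched
    new_ptt[n] = len(scratch) - 1
    return new_ptt


read_tx_1 = {}                                             # opened before any write
read_tx_2 = copy_on_write(read_tx_1, 3, b"first change")   # PTT of the first commit
write_tx = copy_on_write(read_tx_2, 3, b"second change")   # current write transaction
```

Each PTT resolves Page #3 to a different place: `read_tx_1` to the data file, `read_tx_2` to the first scratch slot, and `write_tx` to the second.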
Of course, that is just one side of it. In my next post, I’ll discuss how we clear the scratch file and move data back to the data file.