I’ve spoken before about the idea behind Voron’s Journal. It allows us to do some pretty sweet things:
- Sequential writes.
- Incremental backups.
- Multi-Version Concurrency Control (MVCC).
- Mutable trees.
We haven’t really run perf testing yet (we have some early benchmarks, but they aren’t ready to be talked about yet), but things are looking up.
In particular, because we write to the log sequentially, we get a much better write speed. I’ll discuss the mechanics of flushing the journal in a bit, and I have already talked about incremental backups.
What I really want to talk about now is how we implement MVCC and a major optimization: Mutable trees.
In LMDB, the engine uses a copy-on-write method to ensure that you can modify the data while other transactions are reading from it. This leads to a lot of writes (and free space). We can avoid that entirely by using the journal file. This ends up reducing the amount of writes we have to do by a factor of 10 or so in our random writes scenario. You can read more about this here.
The downside of having a journal is that you can’t really just keep the data there. You need to move it to the data file at some point. (CouchDB’s engine uses what is effectively only a journal file, but needs to compact to save space.)
And as it turns out, there are a lot of issues that we need to consider here. To start with, we have a minor write amplification. We are (at worst) writing twice as much data as we get. The reason I am saying that we are writing at most twice as much data is that there are actually optimizations in place to stop us from writing the same page multiple times. So if we wrote five values in five transactions, and all of them landed on the same page, we would have to write the 5 transactions to the journal, but only update a single page in the data file.
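To make the "five transactions, one page" case concrete, here is a minimal sketch in Python. It is not Voron's code: the `Journal` class, the `flush` function, the dictionary standing in for the data file, and the 4096-byte page size are all assumptions made up for illustration. The only idea taken from the text is that the journal records every transaction, while the flush writes each page to the data file once, using its latest image.

```python
from typing import Dict, List, Tuple

PAGE_SIZE = 4096  # assumed page size, for illustration only


class Journal:
    """Toy journal: every transaction appends its modified pages sequentially."""

    def __init__(self) -> None:
        # Each entry is (transaction id, page number, full page image).
        self.entries: List[Tuple[int, int, bytes]] = []

    def append(self, txn_id: int, pages: Dict[int, bytes]) -> None:
        for page_no, image in pages.items():
            self.entries.append((txn_id, page_no, image))


def flush(journal: Journal, data_file: Dict[int, bytes]) -> int:
    """Apply the journal to the data file and return the number of pages written.

    Only the most recent image of each page is written, so a page touched
    by many transactions costs a single data file write.
    """
    latest: Dict[int, bytes] = {}
    for _txn_id, page_no, image in journal.entries:
        latest[page_no] = image  # later entries replace earlier ones
    data_file.update(latest)
    journal.entries.clear()
    return len(latest)


journal = Journal()
data_file: Dict[int, bytes] = {}

# Five transactions, each modifying the same page (page 7).
for txn_id in range(5):
    journal.append(txn_id, {7: bytes([txn_id]) * PAGE_SIZE})

journal_writes = len(journal.entries)
data_file_writes = flush(journal, data_file)
print(journal_writes, data_file_writes)
```

In this toy model the journal sees five page writes and the data file sees one, which is where the "at worst twice" figure comes from: the data file write count can only be equal to or lower than the journal's.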
More interesting is how we are going to actually handle flushing the journal. Or, to be more exact, when we are going to do that. We need to balance system speed vs. number of journal files vs. recovery time vs. performance spikes.
My current thinking is that we’ll start with a 1MB log file (see side note), and flush to disk every 1MB. When a log file is full, we create a new one, and if there is more than a single log file, we will double the size of the log file (up to somewhere between 64MB and 256MB, I guess). This allows us to very quickly upsize the log file automatically without needing to do explicit configuration. It also means that under load, we will get bigger log files, hence better write performance. Once we are beyond the 1MB range (a size that I don’t think requires any optimizations at all), we need to start thinking about how to do work in parallel, both flushing to the data file and writing to the current log file.
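The sizing rule can be sketched roughly like this. It is my reading of the plan rather than actual Voron code: `next_log_size` and `live_log_files` are hypothetical names, and the 64MB cap is just the low end of the range mentioned above.

```python
MB = 1024 * 1024
INITIAL_LOG_SIZE = 1 * MB
MAX_LOG_SIZE = 64 * MB  # assumed cap; the plan is somewhere between 64MB and 256MB


def next_log_size(current_size: int, live_log_files: int) -> int:
    """Pick the size of the log file created when the current one fills up.

    live_log_files is the number of log files that will exist once the new
    one is created. More than one means flushing has not caught up with
    writes, so we double the size, up to the cap.
    """
    if live_log_files > 1:
        return min(current_size * 2, MAX_LOG_SIZE)
    return current_size


# Under sustained load, older log files are still around every time we roll over.
size = INITIAL_LOG_SIZE
sizes = [size]
for _ in range(8):
    size = next_log_size(size, live_log_files=2)
    sizes.append(size)

print([s // MB for s in sizes])
```

A database that is only being tried out with a small data set never has more than one log file, so it stays at 1MB, while a busy one reaches the cap after a handful of rollovers.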
Side note: optimizing small and growing
One of the things that we noticed with our users is that we have a lot of users that are trying out the database with very small data sets. With RavenDB default configuration, we are set up to take about 64MB of disk space, and we got some complaints about that. It might be better to start small and increase the size, than start with a more reasonable default that hurts small dbs.
What I think we’ll do is allow only one concurrent data flush, which will be how we limit the amount of work that flushing can take. After every write, we will check how much data we have waiting to be flushed. If that amount is about 50% of the current log file size, we will initiate a flush. We will also initiate a flush if there have been no writes for 5 seconds. The goal of this logic is that, ideally, under load, we will always be both writing to the log file and flushing from it. The reason we want to do that is to ensure that we get consistent performance. We want to avoid spikes and stalls in performance. It is more important to be consistent than to allow high burst speeds.
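Here is a rough sketch of that decision logic, again in Python and again not the real implementation. `FlushPolicy` and its method names are hypothetical, and I am treating "about 50%" as "50% or more", which is an assumption. Timestamps are passed in explicitly so the logic can be exercised without waiting for real time to pass.

```python
FLUSH_THRESHOLD = 0.5     # fraction of the current log file size (assumed to mean "or more")
IDLE_FLUSH_SECONDS = 5.0  # flush if there were no writes for this long


class FlushPolicy:
    """Decides when to start a flush; allows only one flush at a time."""

    def __init__(self, log_file_size: int) -> None:
        self.log_file_size = log_file_size
        self.unflushed_bytes = 0
        self.flush_in_progress = False
        self.last_write = 0.0

    def on_write(self, size: int, now: float) -> bool:
        """Record a write and return True if a flush should start now."""
        self.unflushed_bytes += size
        self.last_write = now
        if self.flush_in_progress:
            return False  # only one concurrent flush
        return self.unflushed_bytes >= FLUSH_THRESHOLD * self.log_file_size

    def should_flush_when_idle(self, now: float) -> bool:
        """Called periodically: flush leftovers after a quiet period."""
        if self.flush_in_progress or self.unflushed_bytes == 0:
            return False
        return now - self.last_write >= IDLE_FLUSH_SECONDS

    def begin_flush(self) -> None:
        self.flush_in_progress = True

    def end_flush(self, flushed_bytes: int) -> None:
        self.unflushed_bytes = max(0, self.unflushed_bytes - flushed_bytes)
        self.flush_in_progress = False


policy = FlushPolicy(log_file_size=1024 * 1024)
if policy.on_write(600 * 1024, now=0.0):
    policy.begin_flush()
    # ... copy pages to the data file in the background ...
    policy.end_flush(flushed_bytes=600 * 1024)
```

A caller would pass something like `time.monotonic()` as `now`. Writes that arrive while a flush is running simply accumulate and are picked up by the next flush, which is what keeps both activities going side by side under load.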