RavenDB has been using the low-level Esent as our storage engine from day one. We toyed with building our own storage engine in Munin, but it was only in 2013 that we started to pay serious attention to that. A large part of that was the realization that Esent is great, but it isn’t without its own issues and bugs (it is relatively easy to get it into referencing invalid memory, for example, if you run multiple defrag operations) that we have to work around. But, two major reasons were behind our decision to invest a lot of time and effort into building Voron.
When a customer has an issue, they don’t care that it's some separate component that is causing it; we need to be able to provide them with an answer, fast. And escalating to Microsoft works, but it is cumbersome and, in many cases, result in a game of telephone. This can be amusing when you see kids do it, but is not so much fun when it is an irate customer with a serious problem on the phone.
The second reason is that Esent was not designed for our workload. It has done great by us, but it isn’t something that we can take and tweak and file away all the rough edges in our own usage. In order to provide the level of performance and flexibility we need, we have to own every critical piece in our stack, and Voron was the result of that. Voron was released as part of RavenDB 3.0, and we have had some customers running the biggest databases on it, battle testing it in production for the past two (+) years.
With RavenDB 4.0, we have made Voron the only storage engine we support, and have extended it further. In particular, we added low-level data structures that changed some operations from O(M * logN) to O(logN), push more functionality to the storage engine itself and simplified the concurrency model.
In Voron 3.0, the only data structure that we had was the B+Tree, we had multiple of those, and you could recurse them, but that was it. Data was stored in the following manner:
- Documents tree (key: document id, value: document JSON)
- Etag tree (key: etag, value: document key)
We had one B+Tree as the primary, which contained the actual data, and a number of additional indexes, which allow us to search on additional fields, then find the relevant data by looking up in the data tree.
This had two issues. The first was that our code needed to manually make sure that we were updating all the index trees whenever we updated/deleted/created any values. The second was that a scan over an index would result in the code first doing an O(logN) search over the index tree, then for each matching result it would do another lookup to the actual data tree.
In practice, this only showed up as a problem when you have over 200 million entries, in which case the performance cost was noticeable. But for that purpose, and for a bunch of other (which we’ll discuss in the next post) we made the following changes to Voron.
We now have 4 data structures supported.
- B+Tree – same as before, variable size key & value.
- Fixed Sized B+Tree – int64 key, and fixed size (can be configured at creation time) size of value. Same as the one above, but let's take advantage of various optimization when we know what the total size is.
- Raw data section – allow to store data and give an opaque id to access the data later.
- Table – combination of raw data sections with any number of indexes.
The raw data section is interesting, because it allows us to just ask it to store a bunch of data, and get an id back (int64), and it has an O(1) cost for accessing those values using the id.
We then use this id as the value in the B+Tree, which means that our structure now looks like this:
- Raw data sections – [document JSON, document JSON, etc]
- Documents tree (key: document id, value: raw data section id)
- Etags tree (key: etag, value: raw data section id)
So now, getting the results from the etags tree is an seek into the tree O(logN), and then just O(1) cost for each document that we need to get from there.
Another very important aspect is the fact that Voron is based on memory mapped files, which allows for some really interesting optimization opportunities. But, that will be the subject of the next post.