RavenDB Retrospective: BASE Indexes
In this post, we take a look at how RavenDB handles indices by making use of the BASE principle. Read on to find out more.
RavenDB was designed from the get-go with an ACID document store and BASE indexes. ACID stands for Atomic, Consistent, Isolated, Durable, and BASE stands for Basically Available, Soft state, Eventually consistent.
That design was driven by two competing needs. First, and most obviously, a database should never lose data. Second, we want to ensure that the system remains responsive even under load. It is quite common to have spikes in production traffic, and we wanted to be able to handle them with better aplomb.
In particular, the kind of promises that are made by RavenDB queries allow us to perform quite a few performance optimizations. In databases that require all indexes to be up to date on transaction commit, you’ll find that there is a very high cost to adding indexes to the system. Each additional index means additional work is needed at write time. It also makes things such as aggregating indexes (map/reduce, in RavenDB terms) a lot harder to build.
By having BASE indexes, we gain the ability to batch multiple writes into a single index update operation. It also allows us to defer writing the indexes to the disk, avoiding costly I/O operations. But most importantly, by changing the kind of promise that we give to users, we are able to avoid a lot of locks, complexity, and hardship inside RavenDB. This may seem like a small thing, but this is actually quite important. Take a look at this study:
In fact, there are a lot of studies on the overhead of locking in database systems, and that has been a hot research topic for many years. By choosing a different architecture, we can avoid a lot of those costs and complexities.
So far, that was the explanation from the point of view of the database creator. What about the users?
Here, the tradeoff is more nuanced. On the one hand, people have to deal with a certain level of complexity: queries on just-inserted data might not include it (stale queries). On the other hand, it means that queries are consistently faster, and we can handle spikes in traffic and load much more easily and consistently.
But it is a mental model that can be hard to follow, even when you are familiar with it. Probably, the most common issue with RavenDB’s BASE indexes is the case of Post/Redirect/Get. Let us look at how this may play out:
Here, we actually have two requests: one that adds a new order to the system, and another that fetches the details. If you have redirected to the new order page, everything is going to work as expected, and you won’t notice anything even if the indexes are stale at the time of the request. But a pretty common scenario is to add the new order and then go and look at the list of orders for this customer. If the index didn’t have the chance to update between those two requests (an update that typically happens very quickly), then the customer will not see the new order.
That particular scenario is responsible for the vast majority of the pain we have seen from our users around BASE indexes.
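The Post/Redirect/Get scenario above can be sketched with a small toy model in Python. This is not RavenDB code: `OrderStore`, `save`, `load`, `query_by_customer`, and `run_indexing` are hypothetical names, and the background indexer is modeled as an explicit method call so that the timing is easy to follow. The assumption is that loading a document by its id reads the document store directly, while the list of orders comes from an index that may lag behind.

```python
class OrderStore:
    """Toy model: documents are visible at once, the index catches up later."""

    def __init__(self):
        self.docs = {}          # document id -> document, updated on every write
        self.by_customer = {}   # index: customer id -> list of order ids
        self.unindexed = []     # ids of writes the index has not seen yet

    def save(self, order_id, order):
        self.docs[order_id] = order
        self.unindexed.append(order_id)

    def load(self, order_id):
        # Load by id goes to the documents, so it never depends on the index.
        return self.docs[order_id]

    def query_by_customer(self, customer):
        # A query goes to the index; report whether the index is behind.
        ids = list(self.by_customer.get(customer, []))
        return ids, bool(self.unindexed)

    def run_indexing(self):
        # Stand-in for the background indexing task.
        for order_id in self.unindexed:
            customer = self.docs[order_id]["customer"]
            self.by_customer.setdefault(customer, []).append(order_id)
        self.unindexed.clear()


store = OrderStore()

# POST: create the order, then redirect.
store.save("orders/1", {"customer": "customers/7", "total": 42})

# GET the order page: works even though the index is stale.
order_page = store.load("orders/1")

# GET the order list before indexing has run: the new order is missing.
stale_list, was_stale = store.query_by_customer("customers/7")

# Once the indexer has run, the same query includes the order.
store.run_indexing()
fresh_list, still_stale = store.query_by_customer("customers/7")
```

In this sketch the first list query comes back empty and flagged as stale, which is exactly the window in which a customer would fail to see the order they just placed.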
Now, one of the great things about BASE indexes is that the user gets to choose whether they want to wait for up-to-date results or whether they want whatever is there right now. And we have had mechanisms to control this at a very granular level (including options for personal consistency control, so different customers will have different waits depending on their own previous behavior). But we have found that this puts a lot of responsibility on the developer to control the flow of users through their applications.
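The query-side choice described above might look like the following minimal Python sketch. It is an illustration only, not the RavenDB client API: `LaggingIndex`, `catch_up`, and the `wait_for_non_stale` parameter are hypothetical names, and a `threading.Timer` stands in for the background indexer.

```python
import threading


class LaggingIndex:
    """Toy index that lets each query choose between speed and freshness."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []   # written but not yet indexed
        self._entries = []   # what queries can currently see

    def write(self, value):
        with self._cond:
            self._pending.append(value)

    def catch_up(self):
        # Called by the (simulated) background indexer.
        with self._cond:
            self._entries.extend(self._pending)
            self._pending.clear()
            self._cond.notify_all()

    def query(self, wait_for_non_stale=None):
        """Return (results, is_stale).

        wait_for_non_stale is a timeout in seconds, or None to return
        whatever the index holds right now.
        """
        with self._cond:
            if wait_for_non_stale is not None:
                self._cond.wait_for(lambda: not self._pending,
                                    timeout=wait_for_non_stale)
            return list(self._entries), bool(self._pending)


index = LaggingIndex()
index.write("orders/1")

# Take whatever is there: returns immediately, flagged as stale.
fast_results, fast_stale = index.query()

# Let the indexer catch up shortly afterwards, and wait for it this time.
threading.Timer(0.05, index.catch_up).start()
waited_results, waited_stale = index.query(wait_for_non_stale=5.0)
```

The caller decides per query how long it is willing to wait, which is flexible but also means every read path in the application has to make that decision.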
So in RavenDB, we have changed things a bit. Now, instead of having the write request complete as soon as possible, you can ask the server to wait until the relevant indexes have processed it.
In other words, when you call SaveChanges, it will wait until the indexes have been updated, so when you return from the call, you can be certain that the results of any future queries will include all the changes in that transaction. This moves the responsibility to the write side and makes such scenarios much easier to handle.
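To make the write-side wait concrete, here is a minimal Python sketch of the idea. It is a toy under stated assumptions rather than RavenDB's implementation: `Store`, `save_changes`, and the `wait_for_indexes` and `timeout` parameters are hypothetical, and a daemon thread plays the role of the background indexer.

```python
import queue
import threading


class Store:
    """Toy store whose writes can optionally wait for the index."""

    def __init__(self):
        self.docs = {}
        self.index = {}   # customer id -> set of order ids
        self._work = queue.Queue()
        threading.Thread(target=self._indexer, daemon=True).start()

    def _indexer(self):
        # Background task: index each write, then signal its waiter.
        while True:
            doc_id, customer, done = self._work.get()
            self.index.setdefault(customer, set()).add(doc_id)
            done.set()

    def save_changes(self, doc_id, doc, wait_for_indexes=False, timeout=5.0):
        self.docs[doc_id] = doc   # the document is stored first
        done = threading.Event()
        self._work.put((doc_id, doc["customer"], done))
        # Only block when the caller asked for it.
        if wait_for_indexes and not done.wait(timeout):
            raise TimeoutError("index did not catch up in time")

    def query_by_customer(self, customer):
        return sorted(self.index.get(customer, ()))


store = Store()
store.save_changes("orders/1", {"customer": "customers/7"},
                   wait_for_indexes=True)

# Because the save waited, this query is guaranteed to see the new order.
orders = store.query_by_customer("customers/7")
```

Only the writes that feed an immediate follow-up query need to pay for the wait; every other write can still return as soon as the document is stored.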
Given all of that, and our experience with RavenDB for the past 8 years or so, we spiked what it would look like with ACID indexes, at least for certain things. The problem is that this pretty much takes a lot of the power and flexibility that we get from Lucene out of the equation (more on why you can’t do that in Lucene in a bit) and forces us to offer what are essentially B+Tree indexes. Those are so limited that we would have to offer:
- B+Tree indexes: ACID (simple property / range queries), with different indexes needed for different queries and ordering options.
- Lucene indexes: BASE, full text, spatial, facets, etc. queries. Much more flexible and easy to use.
- Map/reduce indexes: BASE (because you aren’t going to run the full map/reduce during the original transaction).
The problem is that we would then have the continuous burden of explaining when to use which index type and how to deal with the different limitations. It would also make things much more complex if you have a query that can use multiple indexes, and there are problems associated with creating new ACID indexes on live systems. So it would generate a lot of confusion and complexity for users, for a fairly small benefit that we can already address with the “wait on save” option.
As for why we can’t do it all via Lucene anyway, the problem is that this wouldn’t be sustainable. Lucene isn’t really meant for individual operations; it shines when you push large amounts of data through it. It also doesn’t really have the facilities to be transactional. We have actually solved that particular problem in RavenDB 4.0, but it was neither pretty nor easy, and it doesn’t alleviate the issue of “we do best in large batches.” RavenDB’s BASE indexes are actually designed to take advantage of that particular aspect, because under load we’ll process bigger batches and reap the performance benefits that they bring.
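As a rough illustration of that batching effect, consider the following Python sketch. It is a toy model, not RavenDB or Lucene code: `ToyStore`, `put`, and `run_indexing` are hypothetical names, and the sketch assumes an insert-only workload in which the interesting cost is the number of separate index update operations.

```python
class ToyStore:
    """Toy model: writes queue up, the indexer drains them in one batch."""

    def __init__(self):
        self.docs = {}
        self.pending = []        # ids written since the last indexing run
        self.index = {}          # customer id -> set of order ids
        self.index_updates = 0   # number of separate index update operations

    def put(self, doc_id, doc):
        # The write itself never touches the index.
        self.docs[doc_id] = doc
        self.pending.append(doc_id)

    def run_indexing(self):
        # Process everything that is pending as a single index update.
        if not self.pending:
            return 0
        batch, self.pending = self.pending, []
        for doc_id in batch:
            customer = self.docs[doc_id]["customer"]
            self.index.setdefault(customer, set()).add(doc_id)
        self.index_updates += 1
        return len(batch)


# Light load: the indexer runs between writes, so every batch is tiny.
quiet = ToyStore()
for i in range(100):
    quiet.put(f"orders/{i}", {"customer": f"customers/{i % 10}"})
    quiet.run_indexing()

# Spike: 100 writes arrive before the indexer runs, so it sees one big batch.
busy = ToyStore()
for i in range(100):
    busy.put(f"orders/{i}", {"customer": f"customers/{i % 10}"})
spike_batch = busy.run_indexing()
```

Both stores end up with the same index contents, but the store that was hit by the spike got there with a single index update instead of one per write.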
BASE indexes also make for much simpler operations. You can define a new index without fearing that you'll lock the database, and it enables scenarios such as side-by-side indexing to update index definitions without impacting the running system.
Finally, a truly massive benefit of BASE indexes is that they allow us to change the following statement: more indexes means faster reads, slower writes. Fewer indexes means slower reads, faster writes. By moving the actual indexing work to a background task, we let the writes go through as fast as they possibly can.
Indexes still have a cost, and the more indexes you have, the higher the cost (we still have to do some work here). But in the vast majority of cases, we can squeeze this kind of work between writes, at times when the database would otherwise be idle.
What that means is that you can have more indexes at the same cost and that your queries are going to be using those indexes (and are going to be fast).
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.