On Scalable RavenDB Applications
RavenDB is a 2nd generation document database for the .NET platform. It has been used in production since 2010 and is probably the nicest database that you’ll get to meet, even if this is said by its author.
I have written articles before about the basics of using RavenDB; in this one, I want to focus on something quite different. I want to talk about the scalability characteristics of RavenDB and how we can take advantage of them to build highly scalable sites.
Before we can actually discuss scalability as it pertains to RavenDB, we need to define what it is. I think that Werner Vogels’ definition is probably the best I have found: A service is said to be scalable if when we increase the resources in a system, it results in increased performance in a manner proportional to resources added.
As he is the CTO at Amazon, I am pretty sure that he has at least some notion of what scalability is all about.
There are two sides to RavenDB scalability. The first is how it contrasts with using a relational database, and the second is the actual features that are there to make sure that you can scale your solution.
RavenDB is a document database, and documents are generally aggregates. Let us take the example of an online shop, as tired as it may be. If I was using a relational database, an order would usually require several tables to store all of the information about the order (Orders, OrderLines, Discounts, Payments, ShippingLogs, etc).
In RavenDB, it is all a single document, which can be loaded, manipulated, and saved using a single command. This fact alone makes it drastically different from a relational database. Instead of worrying about how many queries it is going to take to load the order, you can just pull it up. Instead of having to worry about coarse-grained locking, table locks, transaction isolation levels and all of the nastiness associated with that, you can just relax and let RavenDB handle all of those details.
Because you are working on an aggregate, you are changing the whole thing as a single unit, and you can choose on an individual operation basis what level of concurrency control you want to apply.
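As a sketch of what this looks like in code (the Order and Payment classes and their property names here are illustrative, and the API shown is the 1.x-era RavenDB .NET client), loading, modifying, and saving an aggregate is a single unit of work, and concurrency control is a per-session choice:

```csharp
using (var session = documentStore.OpenSession())
{
    // Opt in to optimistic concurrency for this unit of work only;
    // other sessions are free to make a different choice.
    session.Advanced.UseOptimisticConcurrency = true;

    // One request loads the entire aggregate: address, payments, the lot.
    var order = session.Load<Order>("orders/1");
    order.Payments.Add(new Payment
    {
        PaymentIdentifier = "E1561236",
        At = DateTime.UtcNow
    });

    // One command saves the whole document back, atomically.
    session.SaveChanges();
}
```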
Let us take a look at a real-world example; the following model is from our own internal order processing system:
As you can see, an order has a collection of payments. In RavenDB, this is stored as a single document, like so (orders/1):
{
    "Type": "Yearly",
    "Quantity": 10,
    "ProductId": "products/NHProf",
    "PhoneNumber": "052 52 548 6969",
    "Email": "ayende@ayende.com",
    "Address": {
        "Address1": "Hapikus 34",
        "Address2": null,
        "City": "Sede Izhak",
        "State": null,
        "ZipCode": "38840",
        "Country": "Israel"
    },
    "DeliveryDetails": {
        "Name": "Ayende Rahien",
        "Company": "Hiberanting Rhinos"
    },
    "EndsAt": "9999-12-31T23:59:59.9999999",
    "LicenseFor": "Hibernating Rhinos",
    "IpAddress": "89.138.170.144",
    "Payments": [
        {
            "PaymentIdentifier": "E1561235",
            "Total": { "Currency": "EUR", "Amount": 1300.0 },
            "VAT": { "Currency": "EUR", "Amount": 325.0 },
            "At": "2010-01-01T22:55:35.0000000",
            "Link": "https://euro.swreg.org/cgi-bin/r.cgi?o=E1561235"
        },
        {
            "PaymentIdentifier": "E1561234",
            "Total": { "Currency": "EUR", "Amount": 1300.0 },
            "VAT": { "Currency": "EUR", "Amount": 325.0 },
            "At": "2009-01-01T22:55:35.0000000",
            "Link": "https://euro.swreg.org/cgi-bin/r.cgi?o=E1561234"
        }
    ]
}
You can see that we have a few interesting things in this model. First, we are able to natively structure complex types, such as Address or DeliveryDetails. Second, we are able to store, directly inside the document, a collection.
This seems trivial, until you realize that in a relational system, you cannot do any such thing. You are limited to a flat (and predefined) set of columns, and if you want to store a collection, your only choice is to create another table and join to that.
This is a very important distinction, because it stands at the heart of RavenDB: the notion of documents as complex structures, rather than a flat list of predefined columns that you join to.
But how does this relate in any way to scalability?
The notion of a single aggregate that contains all of the information related to an entity is actually a very powerful idea when it comes to scalability. To start with, it drastically simplifies transaction management. The document is the transaction boundary, and because RavenDB’s document store is fully ACID, we can rely on that to manage our transactions.
Not only that, we can also rely on the ACID nature of RavenDB when we handle caching. In fact, RavenDB by default will handle all caching for you, only contacting the server to check if something has changed since we last saw the document.
When we model our documents, we usually try to take RavenDB behavior into account and model things based on the actual usage we expect. Most requests should be able to complete by operating on a single document, since that is the natural unit of work that we have with RavenDB. Even in the case when we use multiple documents (for example, we need to load the Customer document as well as the Order), we are still usually only modifying one document (in our example, the Order).
These are the modeling aspects that allow you to gain better performance from a RavenDB-based system. You load less data, you can get it in far fewer requests, and you have smart caching at the transport layer without having to do any work for it.
But what happens when you suddenly get a spike in traffic? RavenDB is actually smart enough to optimize itself for those scenarios. Internally, RavenDB is composed of several parts, one of which we’ve already discussed: the document store. This is a fully ACID component, which gives us transactional guarantees over our documents.
But there is also another part, the indexing component, and that one isn’t ACID. It is, in fact, BASE: Basically Available, Soft state, Eventually consistent. RavenDB applies indexing operations to updates on the database in an async manner. That means that there is a potential gap between the time we save a document to RavenDB and the time it shows up in queries. (Note that there is no such gap when you load a document by id, though. That goes to the document store, which is fully consistent at all times.)
That sounds like a problem, doesn’t it? But in practice, this is a very important feature. It gives us the ability to handle load in a much more graceful manner. In most situations, you would rather get a slightly out of date answer instead of a nasty error message.
Of course, that isn’t always true, and you can tell RavenDB that you want your information without any stale results whatsoever. The important thing is that this is a choice that you are making, based on your knowledge of the actual business scenario. It isn’t something that is forced on you by the database design.
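By way of illustration (using the 1.x-era client API; the exact customization method names may differ between versions), a query defaults to potentially stale results, and opting out is a one-line, per-query decision:

```csharp
using (var session = documentStore.OpenSession())
{
    // Default: fast, potentially slightly stale results.
    var orders = session.Query<Order>()
        .Where(o => o.Email == "ayende@ayende.com")
        .ToList();

    // Explicit opt-in: wait for the index to catch up to the present,
    // accepting that the query may take longer under heavy write load.
    var freshOrders = session.Query<Order>()
        .Customize(x => x.WaitForNonStaleResultsAsOfNow())
        .Where(o => o.Email == "ayende@ayende.com")
        .ToList();
}
```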
And how does this relate to scalability again? Well, let us look at the numbers. Using common update rates, we usually have 15 – 20 milliseconds between a document update and RavenDB completing that document’s indexing process. Under very heavy load (our test scenario included writing 3 million documents to RavenDB), the average latency between a document write and it showing up in the indexes (and thus becoming visible to queries) rose to a second and a half.
However, that was the only impact on the system. Writers and readers were not at all affected by the indexing process, and we were able to continue processing requests throughout. Compare that to the common row and table locks that you might have to face under heavy write scenarios using a relational database.
It usually takes some time for people to get used to this sort of thinking, but you have already applied similar strategies in your own applications. Any time that you use a cache, you are making the choice that it is better to give the user a response as quickly as possible than to make them wait until we can give them the most accurate answer possible.
Because this is such a common scenario, RavenDB already does it for us. We can ask for the accurate results, of course, but then we are explicitly accepting the price of having to potentially wait for those results.
All of the features that we have discussed so far are appropriate when running on a single node. They allow us to gain better performance from the system, but we haven’t really talked about scalability yet. The reason that I covered them, even as briefly as this, is that they are important to understanding how RavenDB achieves scalability.
We actually have two different modes of distributing RavenDB: High Availability Clusters and Sharding Clusters. High Availability Clusters allow us to replicate information between nodes (in either master/slave or multi-master fashion) and provide failover in case a node goes down.
They were originally designed specifically for hot standby / failover scenarios, but as it turned out, they have additional uses as well. MSNBC.COM is using RavenDB High Availability Clusters to create read slaves. They run multiple RavenDB nodes in multiple data centers, all of them replicating to one another in a master/master fashion. A write master is designated for each data center, and all the writes go to it, while reads are shared among all of the nodes in the data center.

This setup assumes that all of the data can reside on a single RavenDB server. The size limit on a RavenDB server is 16 terabytes, so that is quite appropriate for many scenarios. But what about when we want to use actual sharding, splitting the data across multiple nodes, rather than having all the data on every node?
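A read-slave setup along the lines of the MSNBC.COM deployment can be sketched roughly like this (the FailoverBehavior convention shown is from the RavenDB client of that era; treat the exact names and the URL as assumptions for illustration):

```csharp
// Point the client at the designated write master for this data center.
var store = new DocumentStore
{
    Url = "http://ravendb-master"
};

// Spread reads across the master and its replicas; writes still go
// only to the master.
store.Conventions.FailoverBehavior = FailoverBehavior.ReadFromAllServers;

store.Initialize();
```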
RavenDB comes with built-in support for sharding. Let us take a look at how we can implement sharding by region in our online store example.
We have Customers, Orders, Invoices and Products. We want to shard the Customers, Orders and Invoices, but we have no interest in sharding Products. In fact, for Products, we have a single write master and multiple read slaves. The reasoning behind this is that we have a small number of Products, and no real reason to shard them.
Customers and their orders, however, do shard well, and we have a lot more of them. So let us see how we can use RavenDB to shard them. The following is the setup for sharding Customers, Orders and Invoices:
var shards = new Dictionary<string, IDocumentStore>
{
    {"Asia", new DocumentStore {Url = "http://ravendb-asia"}},
    {"America", new DocumentStore {Url = "http://ravendb-america"}},
    {"Europe", new DocumentStore {Url = "http://ravendb-europe"}},
};

var shardStrategy = new ShardStrategy(shards)
    .ShardingOn<Customer>(x => x.Region)
    .ShardingOn<Order>(x => x.CustomerId)
    .ShardingOn<Invoice>(x => x.OrderId);

var documentStore = new ShardedDocumentStore(shardStrategy);
documentStore.Initialize();
There are several things that happen here. First, we define our shards, and give each of them a meaningful name. Next, we define the sharding strategy. The sharding strategy tells RavenDB how it should split the data among the shards. In this case, Customers who belong to the Europe region will go to the ravendb-europe shard.
More interestingly, because we are sharding Orders and Invoices based on their parent id, an Order belonging to a European customer will actually end up on the European shard. The same goes for an Invoice belonging to an Order of a European Customer.
RavenDB is smart enough to figure this out and put related information on the same shard. This means that you have strong locality of information. All of the related data about a Customer is in a single shard, and you can use single-node options such as Includes (the ability to fetch related information in a single query), Live Projections (running transformations during the query, including fetching additional data from other documents), and more.
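For example, Includes let you fetch a related document in the same round trip (a sketch; the CustomerId property on Order is assumed for the sake of the example):

```csharp
using (var session = documentStore.OpenSession())
{
    // One request fetches the order and the referenced customer together.
    var order = session.Include<Order>(o => o.CustomerId)
        .Load("orders/1");

    // Already in the session; this does not trigger another server request.
    var customer = session.Load<Customer>(order.CustomerId);
}
```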
What about querying that information? RavenDB will optimize itself so if you are querying for the recent Orders of a Customer from Asia, only the Asian shard will be hit. It is smart enough to recognize that there is absolutely no need to query the other shards.
But what about when you are querying something that is inherently cross-shard? For example, what if I wanted to query for the 50 most recent Customers, regardless of their region? RavenDB will be able to process this query against all shards, merge the results and give them back to you as if you were working against a single node.
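In code, a cross-shard query looks exactly like a single-node one (the CreatedAt property is an assumption for the sake of the example); the sharded document store fans the query out to each shard and merges the results:

```csharp
using (var session = documentStore.OpenSession())
{
    // Hits every shard; the client merges and sorts the combined results.
    var recentCustomers = session.Query<Customer>()
        .OrderByDescending(c => c.CreatedAt)
        .Take(50)
        .ToList();
}
```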
The same holds true for map/reduce operations. RavenDB can handle distributed map/reduce, by querying all the shards and re-reducing the results.
For the most part, however, you should structure your application so that even though you are using shards, most of your requests touch only a single shard. This helps overall system performance and stability.
Finally, you also have the option of mixing things up: you can combine sharding and replication, so if one of your shards goes down, a hot backup is available to respond to all of the clients.
In conclusion
It is hard to do justice to such a complex topic in a short article, but I do hope that I was able to give you a hint of RavenDB’s capabilities in this regard. We have spent a lot of time and effort trying to make sure that everything just works, and that things are as simple as they possibly can be. But not too simple: RavenDB doesn’t believe in being a straitjacket. You have all the knobs required to customize the behavior specifically for your own needs.
Give it a try; you won’t be disappointed.