How to Save a Million Dollars on Databases
Each database task needs a proper tool that doesn’t cost a ton of money. Cold data should be stored in a SQL database and hot data should be stored in a system specially engineered for this purpose.
Join the DZone community and get the full member experience.Join For Free
Let’s talk databases. Why do we need databases rather than good old files? How are files inferior to databases, or rather, how are databases superior to files?
The answer is obvious. Databases are a more advantageously structured storage system. They allow transactions, queries, etc. That’s why we love databases. But sometimes we hate them, as well, because databases can cause headaches.
Here’s a simple problem that you have probably experienced with your database at some point: you have a server with a database and a number of applications that are making requests to the server; the database is responding, performing queries, and making updates, and everything is basically going well — that is, until the workload grows too heavy for the database to handle.
Assuming that this is a read-only workload, the problem can be solved using replication. You can set up as many replicas as you need and can run all read-only operations on the replicas while all updates are performed on the master.
If this is a read/write workload, however, then replication — and even master-master replication — won’t help, since all updates will hit all the replicas or masters at the end of the day. Therefore, no matter how many replicas you set up, you won’t reduce the workload. In this scenario, you need sharding.
When you shard, you cut your database into pieces and put the pieces onto different machines. You can shard a database almost endlessly. Sharding is more complex than replication, however, because you need to somehow cut your dataset into pieces. You can do this by tables or within a table by hash, range, field, etc. — there are many options.
So, by adding replicas and sharding a database, you can handle almost any workload. Sounds like everything is taken care of and the issue is closed, right?
But There Is a Problem…
…that has very little to do with the technologies involved. Watching the rapidly growing number of servers, your boss gets frustrated because they cost money. The number of user requests goes up, and so does the number of queries and transactions — so you keep adding more servers. After all, you are just a techie, you don’t have to worry about money — you just leave it to Finance. So you tell your boss: “All is well. We have an infinitely scalable system; we are simply adding servers to cover the load.” But your boss replies: “Great, but we are losing money. And if we don’t solve the problem, we’ll have to shut everything down. Even though our business is growing, our costs for databases and servers are growing even faster.” So, suddenly this problem is for you to solve, not for Finance, because it concerns technology. So what’s next? Go to a cloud service? That would be way more expensive. Optimize? Let’s assume that you’ve already optimized all of your queries and created all of the necessary indexes.
Perhaps the solution is to cache frequently used data, which can be stored in the cache and thus accessed without numerous replicas or shards.
Problems With a Cache
Great, problem solved: one Memcached instance can replace a whole rack of replica servers. But this comes with a price.
Your application writes data to both the cache and the database, which are not replicated, and this causes data inconsistency. For example, you may change the cache first, and then the database. But if for some reason, the operation on the database fails, the application will crash, or there will be network errors. Then, the user will get an error message, but the cache changes will not be rolled back. So now, the data in the cache and the data in the database differ. But nobody knows this, and the application continues to work, assuming that the cache is the source of truth when it actually isn’t. And when the cache is restarted, the updated copy of data in the cache is lost because there is a different version in the database.
The main point is that the data is lost no matter the order in which you write. So, you could also first write data to the database, and the cache update could fail. The application would work with the old data in the cache, while the corresponding data in the database would be different, and again — nobody would know. If the cache is restarted, the data is still lost.
So, in either case, an update goes lost. And this means that you’re not adhering to a basic database property: a guarantee that the committed data will remain in the database. Basically, a commit is not a “real” commit.
You can deal with this data inconsistency problem, however, by using a “smart cache,” which means that the application works with the cache only, which in turn updates the database in a write-through manner. The smart cache first changes the database, and only then changes itself. So, if for some reason, the data isn’t changed in the database nor is it changed in the cache, the data will almost always be consistent. An internal cache update can’t fail because the cache is in-memory storage, and an in-memory update always succeeds, barring that the memory is damaged, but that problem is extremely rare and beyond the scope of this article. In any case, if the RAM were damaged, a smart cache would likely crash and bury its cached data. This would be bad, but at least it wouldn’t result in data inconsistency.
Still, there is another rare case regarding smart caches when data can turn out to be inconsistent. Let’s say that an application updates the cache and the cache then applies the changes to the database. Next, the database confirms to the cache that the operation was successful, but right at this moment, the network connection fails, so the cache doesn’t receive the confirmation. Basically, it thinks that the data wasn’t committed to the database, so it doesn’t change the data in its RAM. The application keeps working with the old data, and when the cache restarts — oops, the data is different again. This case is very rare, but it can happen.
You can read some details here about how Facebook minimizes the probability of inconsistency inside their caching layer, which is named TAO. So, you can use some special techniques to minimize the problem, but it is hard to eliminate it.
Worse still, a cache (smart or not) doesn’t solve the sharding problem because updates don’t become faster. Each commit has to be committed to a durable store (i.e. not the cache). And of course, your boss hates sharding because it requires ever more money to purchase more servers.
Next problem: a cache isn’t usually a database, but rather a regular key/value store. So forget about complex queries and transactions and say goodbye to stored procedures and secondary indexes. Basically, to use itm you will have to completely rewrite your application.
Problem number four is the “cold start.” A newly started cache is empty, so has no data. At this point, all
SELECTrequests will go straight to the database and bypass the cache because there is nothing in the cache yet. Therefore, you will need to add more replicas again to “warm up” the cache and process the big number of
SELECTrequests that will temporarily go straight to the database. Thus, we need a number of replicas just to warm up the cache (maybe fewer than we would need without it, but still). This is rather wasteful, isn’t it? But you can’t achieve a quick start without these replicas. In any case, let’s take a closer look at this problem.
To keep data “warm” as opposed to “cool,” a cache needs to be persistent, i.e. the data must be stored somewhere on disk. The cache starts up and uploads the data. But there’s a catch: a cache is in-memory storage, so it has to be fast; but when you pair it with a disk, doesn’t this make the cache as slow as a database? Actually, no, it doesn’t, if persistence is implemented properly.
The easiest way to persist your cache is to regularly dump it all to disk. This can be done asynchronously in the background. If done correctly, dumping doesn’t make any operation slower, nor does it load the processor too much. Instead, it makes the warm-up period considerably shorter by making sure that a newly started cache already has its own data snapshot, which is read linearly and quickly. This allows the cache to warm up far faster than using any number of database replicas.
Is the solution that easy? Well, assume that we make a dump every five minutes. If there is a crash in between, all of the changes made since the previous dump will be lost. For applications like statistics, this is okay, while for many other applications, it’s unacceptable.
Now, let’s consider another problem with dumping: a dump requires a lot of disk bandwidth, but there can be other operations (i.e. event logging) that need the disk at the same time. During the dump, other disk operations will run slowly, and this will keep happening. We can avoid this by maintaining a transaction log instead of regularly dumping the cache. The inevitable question is, “How is this possible? It’s a cache, it’s fast, and here, we are logging each transaction.” Actually, it’s not a problem. If we’re logging transactions into a file sequentially on a regular spinning hard drive, the write speed will be up to 100 MB/sec. Let’s say that an average transaction size is 100 bytes; that’s one million transactions per second. Keeping these numbers in mind, we’ll obviously never exceed the disk performance while logging the cache. By the way, here is a test showing that one million transactions per second is really possible. Transaction logging instead of dumping also solves the IOPS problem: we load the disk just as much as necessary to persist all the data, the data is always “fresh,” we don’t lose it, and the warm-up is fast.
But transaction logging has its cons. When maintaining the log, updates for the same element don’t get grouped into a single record. When you end up with multiple updates and the cache has to go through all of the log records at startup, it can take longer than starting from the dump. Besides, the log itself can take up plenty of space; maybe not even fit onto the disk.
To solve this problem, let’s combine both dumping and logging. Why not? We can make a dump, i.e. create a snapshot, just once a day, and keep logging all updates as they occur. In the snapshot, we save the latest modification ID. When we need to restart the cache, we can read the snapshot and upload it to memory, then read the log starting from the latest snapshot modification and apply the updates to RAM. That’s it: the cache is warmed up. It’s as fast as reading from the dump. So, we’re done with the cold start; now, let’s solve the other three problems in our list.
The Other Problems
Let’s consider the temperature of data. I mean “hot” vs. “cold” data. “Hot” is the most frequently accessed data, and all other data is “cold.” Since we’re talking about a cache as an efficient way of handling frequently accessed data, we definitely have the ability to divide our dataset into “cold” and“hot.”
As a rule, there is plenty of cold data and very little hot data. That’s how it is. But we are replicating and sharding the whole database to process queries and transactions with the hot data. So, it’s small, but hot data costs us a lot. We can ask ourselves, “Why copy everything? Let’s shard the hot data only.” But that won’t help: we’ll have to use the same number of servers since we replicate and shard not because the data doesn’t fit into memory or disk, but because we run out of CPU power. So, sharding and replicating hot data alone is not the solution. And your boss is still mad because he still has to pay for new servers.
So, what can be done? We have a cache, but the problem is hot data in the database. But wait: the cache also stores data just like the database. And we can replicate it to ensure fault tolerance. So, what’s stopping us from using the cache as a primary data source? Lack of transactions? Fine, we can add transactions. Lack of other database properties? We can also implement them in the cache.
In this manner, we solve the other three problems as long as we don’t have to store hot data in the database, only in the cache! We don’t need sharding, either, since we don’t have to distribute the database among multiple servers; the cache successfully handles the whole read/write workload. Yes, a persistent cache can handle a heavy write workload because, remember, transaction logging and snapshot dumping affect throughput only slightly.
So, all database features can be implemented in the cache.
That’s what we did and the name of the resulting product is Tarantool. In terms of reading and writing, it works as fast as a cache while also having all of the database features that we need. It’s a fast and durable single source of truth. Therefore, we don’t have to back Tarantool with another database. Problem solved!
Tarantool Capabilities and Specifics
We’ve been replicating and sharding lots of cold data just to process a small amount of hot data. Now, the rarely requested and modified cold data stays in a traditional database, and the hot data goes to Tarantool. In other words, Tarantool is a database for hot data. As a result, two Tarantool instances (master and replica) are enough for most tasks. Actually, we can get away with just one instance since its access pattern and throughput are the same as a regular cache, despite the fact that Tarantool is a true database.
But there is another problem solved by Tarantool, which is more of a psychological problem: how can you give up your traditional database with its dependable data storage and cozy ACID transactions and just use a cache? Indeed, as soon as you start using Memcached or any other cache on top of a database, you effectively ditch the benefits of a database (remember our discussion regarding inconsistency and lost updates above). From this perspective, Tarantool not only speeds up your work and help you save money but also brings you back to the world of databases with transactions, stored procedures, secondary indexes, etc.
Let me say a few words about parallel transactions. Tarantool can act as a Lua application server. It considers a Lua script as one transaction: it performs all reads from memory and sends all changes to a temporary buffer so that they eventually get written to disk as one piece. But while the data is being written, another transaction can read the old data version without any locks! Transactions will slow down only if you exceed the throughput capacity of sequential disk writes — which, remember, gives you the tremendous throughput of roughly one million transactions per second.
How We Move Data From Hot to Cold
So far, this process is not automatic in Tarantool. But we’re working on it. We analyze the logs of our applications and make the decision whether to consider some data “hot.” For example, users’ profiles in the Mail.Ru email service are mostly hot because they’re frequently requested; therefore, we transfer all of the profiles to Tarantool. We might grab the cold ones, too. But even with this overrun, it’s much more effective to use Tarantool here than a SQL database. For one thing, Tarantool has a highly optimized memory footprint. Here’s an interesting fact: a SQL database stores everything on disk, but 10-20% of its data must be cached in memory. However, the memory footprint of a traditional SQL database is 1.5-2x greater than Tarantool’s footprint, which turns that 20% into 30-40%.
One Million Dollars
This sum of money is not just made up for an eye-catching header. It’s the money we actually saved in one of our Mail.Ru Group projects. Basically, we had to store users’ profiles somewhere. A profile is a small portion of information (500b-1Kb) that contains a name, a number of sent emails, and the various flags and service data that are generally required for each page load (so, in other words, data that is frequently requested and updated).
First, we considered MySQL for our problem. We deployed 16 MySQL replicas and shards, and we gradually began to duplicate the read and write workloads from the profiles. However, the 16-server MySQL farm went down at 1/8 of our total workload. And that was after we performed optimization! So, we decided to give Tarantool a try. Tarantool needed just four servers to handle the workload that would have been ultimately distributed among more than 128 MySQL servers. Actually, one Tarantool server would have been enough: we added three for fault tolerance. So if you consider the reduced number of servers along with the reduced hosting expenses, the total savings comes out to around one million dollars.
And that’s just one case. We have found uses for Tarantool in many of our projects. For example, there are 120 hard-working instances in our email and cloud services. If we used SQL-based solutions, we’d need tens of thousands of servers with MySQL or other SQL database management systems — PostgreSQL, Oracle, or whichever. It is hard to conceive of just how large this cost would be.
So, the moral of this story is that each task needs a proper tool that doesn’t cost a ton of money. Cold data should be stored in a SQL database, and hot data — which is requested and updated frequently — should be stored in a system specially engineered for this purpose.
In any case, should you be interested in learning more about Tarantool, please contact us here to be connected with core developers!
Opinions expressed by DZone contributors are their own.