In my previous post, I described four applications (three implemented, one an example) that require, or at least strongly benefit from, strong ACID transactions. With that out of the way, we can now get to the fun bits.
Today's claim: Most databases that claim to be 100% ACID compliant actually fudge definitions, goof up, or deliberately mislead. This applies to 30-year-old giants as well as strapping upstarts; it's just a fact of life in databases.
I could make a giant chart with all of the products that claim to be ACID — and list where they fail. But that chart would be partial, soon-dated, and probably contain errors. It would certainly open up a discussion about specifics that might miss the forest for the trees.
So in lieu of that, here's general guidance about the ways many databases fail to meet that 100% ACID standard. This is how you can evaluate a new database's ACID promises. The content below will help you look past the hype and/or marketing to get a sense of how mature the development is, what engineering tradeoffs have been made, and how serious the database developers are about backing up their promises.
Let's start with the definition, which is described in a bit more detail on our product page on ACID.
- "Atomic" is well defined. Do, or do not. No halvsies.
- "Consistent" is a bit of a lightweight term, and people ask whether it was just added to make the acronym work, so I won't dwell on it.
- "Isolated," according to the definition, is black and white. But many systems relax the guarantee of isolation and still claim ACID compliance. This has become accepted because almost every system does it, and people in glass houses don't throw stones.
- People argue about when you can call a transaction "Durable," and whether it needs to be on disk or on multiple machines, etc. I don't think the implementation matters, so long as confirmed, committed transactions aren't undone later because of data loss or because of some other consistency problem.
It's worth examining all the ways in which a database product can screw up ACID. Keep these in mind when evaluating a system that claims ACID transactions. Sometimes, "mostly ACID" is still useful, but only if the user and the database implementers understand precisely what "mostly" means in a particular case.
Relaxed Isolation Levels
The "I" in ACID stands for Isolation and implies that concurrent transactions can't interfere with each other's execution. Ideally, transactions behave as if they are executed one after another, i.e. serially. In practice, concurrency helps with throughput as well as ensures fairness and priority across users and jobs, so complicated concurrency schemes are implemented to balance throughput with isolation.
However, many databases implement less than serializable isolation, or offer multiple isolation levels and ask operators to choose between performance and full ACID guarantees.
The tradeoff between isolation levels and performance can also sometimes be thought of as a tradeoff between isolation levels and features. For example, for single-operation key-value put-get workloads, isolation can be quite relaxed without corrupting data. As applications become more complex, more isolation becomes important. Even something as simple as counting records that match a predicate can be problematic under weaker isolation models. The simplest read-modify-write on a single value can corrupt data under anything weaker than serializable isolation: you will see failures under repeatable read, snapshot isolation, and especially the dreaded read-committed isolation. Forget about doing more complex math, like computing a standard deviation and storing it in the database.
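To make the read-modify-write hazard concrete, here's a minimal sketch of the classic "lost update" anomaly. The dictionaries and function names are illustrative, not any real database's API; the point is that under read-committed-style rules both transactions may read the same starting value, so one increment silently vanishes.

```python
# Two transactions each read a counter, add 1, and write it back.
# Serial (serializable-equivalent) execution yields 2; an interleaved,
# read-committed-style execution lets one write clobber the other.

def run_serial(counter):
    for _ in range(2):
        value = counter["n"]       # read
        counter["n"] = value + 1   # modify + write
    return counter["n"]

def run_interleaved(counter):
    v1 = counter["n"]        # txn A reads 0
    v2 = counter["n"]        # txn B reads 0 -- legal under read committed
    counter["n"] = v1 + 1    # txn A writes 1
    counter["n"] = v2 + 1    # txn B writes 1, clobbering A's update
    return counter["n"]

serial_result = run_serial({"n": 0})            # 2, as expected
interleaved_result = run_interleaved({"n": 0})  # 1 -- an update was lost
```

Nothing crashed and no error was raised, which is exactly why this class of corruption goes unnoticed.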
My last blog post covered four applications that need strong ACID, and none of them would be effective under the weak isolation models pushed by many vendors.
The worst part about weak isolation is that its effects often go unnoticed, silently corrupting data. If you'd like to learn more, Peter Bailis has written about weak isolation in more detail.
Failure to Understand Tradeoffs at Design Time
It cannot be overstated how important it is to have a plan when starting out to develop a distributed database. What consistency guarantees will be made? With respect to the CAP theorem, when push comes to shove, will you choose availability or consistency? Does the system support ACID transactions? If so, what isolation levels are supported? Is consistency going to be tunable? What will the system do at launch, versus five years in the future?
To answer these questions, you need to understand the use cases the system is targeting, but importantly, you need the distributed systems and database expertise to actually translate your goals into an architecture.
Two specific things to consider when designing a system are distribution and data/query/transactional models.
Some systems start as single-node databases and add clustering. The issue here is that many of the operations that are easy to support on single-node systems are difficult to support on multi-node systems. It may be that a JOIN operation that ran nearly instantly on one machine suddenly takes seconds to run on the cluster because now the data needs to be moved over the network. An application that ran fine under strong isolation models on one database node may refuse to scale without serious isolation relaxation on a cluster. Rarely is it possible to drop a clustered system into an application designed for a single-node system.
Some systems start out as distributed key-value stores and grow into more. These systems have thought about distribution from day one but have thrown out the baby with the bathwater when it comes to useful database features. As they try to add transactions and/or a strong query/data model, they often make compromises. Few systems have had success adding full ACID transactions if it wasn't part of the original design. Most of these systems only support consistent operations on a single key.
Still other systems just make assumptions that don't pan out. They either built a system that solves a problem people don't have, they built a system that asks too much of the user, or they just made a plan that didn't hold up in the face of real-world applications.
One clear example is systems that move data to the query instead of the other way around. Data is variable in size, but sometimes big. Queries are small, logical things. If you always need to move all the data to a "transaction node" or some kind of processing center in order to query it, then you've built a system that's not just inefficient; all that data movement, latency, and coordination will make strong ACID and good performance incompatible.
The most important part of having a plan is understanding the tradeoffs involved and what makes your system different than others. If your system makes a different choice than almost every other system, it's important to be able to explain why and to describe the benefits and downsides.
Now that you have a plan, how do you ensure that the end result matches the plan, that the plan matches the real world, and that you, generally speaking, haven't screwed up all over the place? Having helped build VoltDB, I know that building a robust distributed system is hard, and bugs are unavoidable. Having said that, some systems have a lot more bugs than others. Often that comes down to testing.
There are a million ways to verify code, from TLA+ Model Checking to simulation (a la FoundationDB) to more traditional fault injection. The challenge is not simply that fault-tolerance has so many corner cases (it does); it's that database workloads can be so varied. You have to verify every kind of workload in every kind of failure scenario. You have to try combinations of workloads with combinations of failures. Each time you add a simple feature with an on/off switch, you might be doubling the test load.
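The combinatorial explosion described above is easy to quantify. Here's a hypothetical sketch (the workload, failure, and flag names are made up, not a real harness) showing how crossing workloads with failure scenarios and boolean feature flags multiplies the test matrix:

```python
# Every workload must run under every failure scenario, and each
# on/off feature flag doubles the matrix. All names are illustrative.
from itertools import product

workloads = ["kv-put-get", "multi-row-txn", "cross-shard-join"]
failures = ["none", "node-crash", "network-partition", "disk-full"]
feature_flags = ["replication", "snapshots", "async-export"]

# 2^len(feature_flags) flag settings, crossed with workloads and failures
flag_combos = list(product([False, True], repeat=len(feature_flags)))
test_matrix = list(product(workloads, failures, flag_combos))

print(len(test_matrix))  # 3 workloads * 4 failures * 8 flag combos = 96
```

Three workloads and four failure modes already need 96 scenarios once three toggles are in play; a real database has orders of magnitude more of each.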
Testing requires not only engineering resources but also physical resources. When we first discussed testing VoltDB, a search for test harnesses shipping with other databases didn't turn up much. The number of unit tests we run for each commit is sizable, but the number of long-running and fault-injecting performance and cluster tests is on another scale. We make this commitment as an enterprise-class product. Not all databases can do the same, or choose to do the same.
It's natural for databases to have bugs, but databases developed with a clear plan and a comprehensive test strategy will almost certainly have fewer. I gave a talk at the StrangeLoop conference last year on one dimension of VoltDB testing, if you'd like to understand a sliver of the effort we expend to ensure VoltDB works: "All In With Determinism for Performance and Testing in Distributed Systems."
Taking Shortcuts and Crossing Fingers
Somewhere in your code, there's a corner case — say if two hosts fail at the same time. If you hit that corner case, you can't maintain ACID guarantees. You might get lucky, or the system might lose data. Other times — and this is arguably worse — the system just silently corrupts data.
Shortcuts are a choice made by database developers. Sometimes the choice is explicit: Someone assumes that two nodes will never fail at the same time. Sometimes it's a corner case no one paid attention to, and thus no tests were built.
One of the tricky things with corner cases is the difference between "never" and "OMG so rare!" From my experience with production users (and from common sense), if you have enough customers, and they run for long enough, even fantastically unlikely things happen. A one-in-a-billion chance is pretty likely when you're performing thousands of operations per second.
This is why Kyle Kingsbury has been so successful with his Jepsen posts. If you haven't read through them, I highly recommend it. There's a lot more detail about how specific systems go wrong, but also stories of how database developers respond to issues. Some welcome the testing and set about fixing problems. Others do less.
Keeping two or more copies of data is harder than you would expect in an ACID database.
Is the replication synchronous? The latency cost of synchronous replication, even when copies of your data are connected to the same switch, is sometimes substantial. Many systems default to asynchronous replication to achieve higher performance. Obviously, asynchronous replication can compromise the durability of committed transactions.
Is the replication deterministic? Are operations replayed on the replicas in the same order as on the primary in all cases? If not, then non-commutative operations could lead to divergent values between the primary and the replicas. Sometimes the operations themselves might not be deterministic. This is a bigger concern for logical replication schemes, or even hybrid schemes. An example might be: move the student with the highest grade to another class. If two students are tied, the outcome of this operation may not be deterministic, and you may end up with divergent data on different replicas. One replica believes student x has been moved to AP Algebra; the other replica believes student x is still in Algebra 101.
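The tie-breaking problem can be shown in a few lines. This is a toy model, not any replication engine's actual code: both replicas evaluate the same logical statement, but each breaks the tie by whatever row order its storage happens to scan in.

```python
# "Move the student with the highest grade": a logically identical
# operation diverges across replicas when grades are tied, because
# max() keeps the first maximal element in scan order.

def pick_top_student(students, scan_order):
    # students: {name: grade}; scan_order: this replica's row order
    return max(scan_order, key=lambda name: students[name])

students = {"alice": 95, "bob": 95}  # tied on grade

primary_pick = pick_top_student(students, ["alice", "bob"])  # "alice"
replica_pick = pick_top_student(students, ["bob", "alice"])  # "bob"
# Same data, same statement, different result on each replica.
```

Shipping the *result* of the operation (student x moved) instead of the operation itself is one common fix, which is exactly the physical-vs-logical replication tradeoff mentioned above.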
Sharding is the process of horizontally partitioning data and queries into multiple discrete, smaller databases, then managing data access in a management layer, or even in the client.
One of the classic sharding scenarios involves running multiple MySQL servers and putting smarts in the client to manage data locality. As common as this is, you run into several problems mentioned above.
- You're probably not running MySQL using serializable isolation unless you explicitly made that performance tradeoff.
- Transactions across nodes are supremely difficult when managed by the client. You can open transactions across nodes and commit them together, but that's not only slow, there are many ways in which it can fail. Simply managing what happens when a cross-shard transaction fails because the coordinating client code fails is often an elaborate process of compensating actions and repair procedures.
- The client code to manage consistency across shards is custom code. The odds that it's buggy are, um… higher than production-hardened code that's built into a well-tested database.
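The failure window in client-managed cross-shard commits is easy to see in a sketch. The `Shard` class and the crash flag below are purely illustrative; the point is that between the two commit calls there is a moment where a client crash leaves one shard committed and the other not:

```python
# Naive client-side cross-shard transfer: commit shard A, then shard B.
# If the client dies between the two commits, atomicity is broken.

class Shard:
    def __init__(self):
        self.committed = False
        self.staged = None

    def stage(self, op):
        self.staged = op

    def commit(self):
        self.committed = True

def transfer(shard_a, shard_b, crash_between_commits=False):
    shard_a.stage("debit 100")
    shard_b.stage("credit 100")
    shard_a.commit()
    if crash_between_commits:
        return  # client process dies here; shard_b never commits
    shard_b.commit()

a, b = Shard(), Shard()
transfer(a, b, crash_between_commits=True)
# a.committed is True, b.committed is False: the debit happened,
# the credit didn't, and only custom repair code can reconcile them.
```

Real two-phase commit narrows this window but doesn't eliminate the need for a recovery protocol, which is exactly the elaborate compensation machinery described above.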
Other systems move sharding management to the server side, or use a transaction coordinator process. This makes it the database developer's problem to find bugs, which is good, but there are still many systems that make ACID compromises while distributing data.
- For cross-shard operations, sometimes developers make the same mistakes that users sharding MySQL make. Sometimes locks aren't held across shards. Imagine a debit-credit transaction that moves money from an account on one shard to an account on another. If the debit fails on one shard, but the credit succeeds on the other, is the credit always rolled back or is money created from thin air?
- Maybe a transaction makes a decision across shards based on a small lookup table that's updated daily. Does that lookup table update transactionally across shards, or is it possible for one cross-shard transaction to see two versions of the same lookup table?
- If a shard and all of its replicas fail, is there a way to get the database back to a consistent state, or are you stuck recovering one shard whose data is older than all other shards when it rejoins?
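The lookup-table bullet above deserves a concrete picture. In this toy model (shard names and the table contents are made up), each shard keeps its own copy of a small reference table, and a non-transactional rollout means a single cross-shard transaction can observe two versions of the "same" table:

```python
# Each shard holds its own copy of a daily-updated lookup table.
# During a non-transactional rollout, the copies disagree.

shard_tables = {
    "shard1": {"tax_rate": 0.05},  # today's update has landed here
    "shard2": {"tax_rate": 0.04},  # ...but not here yet
}

def rates_seen_by_cross_shard_txn(shards):
    # One logical transaction reads the "same" table on every shard.
    return {table["tax_rate"] for table in shards.values()}

versions = rates_seen_by_cross_shard_txn(shard_tables)
# len(versions) == 2: one transaction saw two versions of one table.
```

A serializable system would force the transaction to see exactly one version; anything weaker leaves the application doing arithmetic with inconsistent inputs.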
Fuzzy Backup & Business Continuity
People think about backups in terms of asteroids smashing into their datacenter, and it's good to be prepared, but backups tend to be more about planned events and logistics. On its face, backup is the process of writing a complete record of state to disk or to another system in a format in which it can be reloaded if needed. The fundamental question: Can I separate my data from my database without losing information?
Being able to move data around lets developers test development code against production data. It lets developers move data between clusters cleanly or migrate their data to a new configuration or application. Backups are also just as likely to mitigate human problems as celestial ones. What happens when a bug in a client process starts corrupting data in the database? Can you restore to a point before the failure?
Backup can be surprisingly hard, even in a single-node system. There are a few different strategies to backing up a live system.
- You can stop all operations and write a copy of all data out. This can be a perfect copy, but because it affects write availability, it is not always a popular choice.
- You can make a backup using MVCC or copy-on-write mode, where any new writes to the system are kept alongside the value of the data when the backup started. Then the backup can write a perfect copy of the data from the time the backup started. At VoltDB, we call this a transactionally consistent snapshot. With this kind of backup, you have to be careful to manage writes to data after the backup begins. Done well, you can have write availability and ACID backups.
- You can also make a weakly isolated backup, using anything from completely dirty reads to "read committed" to "repeatable read." Restoring from this backup has a pretty high probability of corrupting ACID guarantees if any data is being changed/mutated in the system.
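The copy-on-write approach in the second bullet can be sketched in miniature. This is a toy single-node model (class and method names are invented), but it shows the essential trick: a write that lands after the snapshot begins preserves the old value, so the backup reads the database exactly as of snapshot time while live writes proceed:

```python
# Minimal copy-on-write snapshot: writes after the snapshot starts
# save the prior value, so the backup sees a consistent point in time.

class SnapshotStore:
    def __init__(self, data):
        self.data = dict(data)
        self.snapshot_old = None  # old values kept during a snapshot

    def begin_snapshot(self):
        self.snapshot_old = {}

    def write(self, key, value):
        if self.snapshot_old is not None and key not in self.snapshot_old:
            self.snapshot_old[key] = self.data.get(key)  # copy-on-write
        self.data[key] = value

    def snapshot_read(self, key):
        if self.snapshot_old is not None and key in self.snapshot_old:
            return self.snapshot_old[key]
        return self.data.get(key)

store = SnapshotStore({"balance": 100})
store.begin_snapshot()
store.write("balance", 250)                    # live write during backup
backup_value = store.snapshot_read("balance")  # backup still sees 100
live_value = store.data["balance"]             # live readers see 250
```

The hard part a real engine adds on top of this sketch is garbage-collecting the preserved values once the backup finishes, and (in a cluster) starting the snapshot at the same logical point on every node.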
Of course, when you move to a horizontally-scaled system, there is a new way to corrupt ACID with backups. Whether you are stopping the system to back up or using a transactionally consistent snapshot, you have to be extremely careful to ensure that the backup begins at the same transactionally logical point-in-time across all nodes.
As an aside, VoltDB makes strongly consistent global backups trivial. Read more here.
I'll be the first to say it: The system I've worked so hard on, namely VoltDB, isn't perfect. I'm not sure any system is. I will say that protecting your data, promising the strongest ACID guarantees we can, and delivering on those promises is taken very seriously here.
VoltDB promises serializable ACID consistency, the strongest guarantee available, even when distributed across a cluster. We've examined all the ways a system can be compromised discussed here and made sure our design and implementation are deliberate and fully tested. Our implementation of the Raft consensus protocol shipped in 2011, a full two years before the Raft paper (pdf) was published. Our automated testing matrix struggles mightily, night after night, to test every feature with every kind of failure. Our support team has the backing of the full company when any data consistency or data loss bug is found. This is what we hang our hats on: making the fastest, most transactionally powerful, and safest system available.
Discuss this post on Hacker News.