Curator's Note: The content of this article was written by Trek Palmer over at the NuoDB tech blog.
Greetings oh-so-persistent nuonians! Your faithful consistency nerd is here to continue the discussion of MVCC. For background, please examine part 1 (wherein our heroes are introduced to MVCC), part 2 (wherein our heroes witness examples) or part 3 (wherein our heroes are exposed to the subtleties of vanilla MVCC). This chapter of the story will introduce you to distributed MVCC, in particular the NuoDB flavor.
The Problem Distributed MVCC is Solving
NuoDB is a distributed ACID database. We recognize that applications are easier to write with transactions, and that not all transactions are created equal. Like any good database, we allow the user to tell us what level of consistency they want, and we need to act accordingly. NuoDB uses MVCC to handle concurrent reads and writes to shared data, and the previous posts have gone over the broad strokes of what that entails. However, in a distributed system we can have situations where multiple separate nodes each want to update the same record at the same time.
‘Classic’ databases were designed as single-server solutions, where concurrency came in the form of local threads. Much work was done in decades past to build systems that could resolve update conflicts between threads. In general, the approaches involved ‘managers’, a term I’m going to use for any centralized synchronized object that acts as the main clearing-house for requests that have to be handled consistently. For readers familiar with run-of-the-mill thread programming, a manager was usually some fine-grained lock-based object that would hand out ‘grants’ for record versions to requesting threads. What this means is that if multiple threads are trying to add a new record version ‘simultaneously’, the manager will determine who gets which versions (and whether anyone needs to be failed). This is perfectly fine for a single-node system, and optimizing local-memory concurrent data structures is a task that can keep armies of engineers occupied for untold eons.
Distributed systems are, at a high level of abstraction, just large-scale concurrent systems. Instead of function calls and atomic locks, a distributed system uses messages and message processing. The similarity starts to break down when you want to make things fast. The network is orders of magnitude slower than local memory, and a design that worked jim-dandy in local memory may (and often does) fail to scale to multiple machines even when retooled for distributed operation. One of the first lessons I learned when transitioning from a concurrent programmer to a distributed programmer was that centralized manager-type systems are a bad idea in distributed systems. The reason is simple enough: a centralized point of control may require blocking, which can basically stop all nodes for a time. There are also reliability concerns, namely: what do you do if the node running as the ‘manager’ dies? So, centralized points of control can cause issues that defeat the reasons for using a distributed system in the first place (e.g. performance and reliability). ‘Well, shoot,’ some of you may be saying, ‘if we can’t just distributify the old manager-based approaches, how are we going to swing a distributed MVCC database?’
There are some tools in the distributed engineer’s toolkit available to us. The first trick is that we build everything from atoms (fine-grained partitioning). A table itself is actually a galaxy of atoms. Each atom is a small, independent distributed object. As a table grows, it grows by adding more atoms rather than inflating atoms indefinitely. Because the atoms are independent, we can leverage another trick (asynchronous messaging) to overlap and pipeline processing with messaging, hiding as much message processing latency as possible. These are useful tricks and help keep atom processing lean and fast. However, the real speedup is in applying asynchrony more generally.
A time-honored trope in systems engineering is ‘make the common case fast’. In general, any marginally well-designed application won’t have every thread on every node banging away on the same record. Therefore, most databases assume that the common case for record update is no-conflict or low-conflict. NuoDB makes a similar assumption, so we’d prefer not to have to wait for messages from a ‘manager node’ in order to process an update. Instead, every node assumes that it has the right to install a new version on top of what it believes is the latest version, and then marches ahead. Of course, conflicts do happen and we need to be able to detect them and deal with them.
Of Chairmen and Record Metadata
Consider a database with 3 nodes (1, 2 and 3). All of them agree that the latest version for row 42 is 5. Nodes 2 and 3 have clients connected to them that are attempting to update row 42. The simplest approach would be to have node 2 confirm that it was allowed to update row 42 before proceeding. But if node 2 had to chat with all its peers before allowing that update, then updates would be incredibly slow, as each row would require all kinds of chit-chat. This slowdown is doubly bad in the assumed common case where most updates don’t conflict at all. So, obviously, doing distributed coordination for each row before update is slow. NuoDB is not slow, therefore NuoDB doesn’t do that. What NuoDB does instead is let node 2 and node 3 each assume that it’s ok to install a new version locally, and that someone, somewhere will call them on it if it isn’t actually ok.
In the example above, node 2 and 3 would both independently approve their respective client updates, provisionally give them version 6 and then march ahead. Of course, now we have an undetected and unresolved conflict. Node 2 thinks that row 42, version 6 is A and node 3 thinks that it is B. How can this work? And how will the conflict be detected, and then dealt with? The system works due to the magic of MVCC. Because the updates are part of an uncommitted transaction, they won’t be visible to anyone else. Therefore, MVCC restricts the scope of conflict detection and resolution for a single row to exactly the set of active transactions that are updating that row.
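To make the 'provisional version 6' idea concrete, here is a minimal sketch of one node's local view of a single row. Everything in it (`Version`, `NodeRowView`, `install_provisional`, `visible_value`) is a hypothetical name invented for illustration, not NuoDB's actual code, and the visibility rule is deliberately simplified to 'committed versions plus my own pending writes'.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    number: int
    value: str
    txn_id: str
    committed: bool = False


@dataclass
class NodeRowView:
    """One node's local version chain for a single row (hypothetical)."""
    versions: list = field(default_factory=list)

    def install_provisional(self, value, txn_id):
        # Optimistically claim the next version number without asking any peer.
        next_number = max(v.number for v in self.versions) + 1
        version = Version(next_number, value, txn_id)
        self.versions.append(version)
        return version

    def visible_value(self, reader_txn_id):
        # Simplified rule: a reader sees committed versions and its own
        # pending writes, never another transaction's uncommitted version.
        candidates = [v for v in self.versions
                      if v.committed or v.txn_id == reader_txn_id]
        return max(candidates, key=lambda v: v.number).value


# Nodes 2 and 3 both start out agreeing that row 42 is at version 5.
node2 = NodeRowView([Version(5, "old", "t0", committed=True)])
node3 = NodeRowView([Version(5, "old", "t0", committed=True)])

a = node2.install_provisional("A", "t2")  # node 2 thinks version 6 is A
b = node3.install_provisional("B", "t3")  # node 3 thinks version 6 is B
print(a.number, b.number)                 # 6 6
print(node2.visible_value("t9"))          # old (an unrelated reader on node 2)
```

Both nodes hand out version 6, yet no other transaction can observe either value, which is why the conflict can safely be sorted out later.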
The problem is not that node 2’s update is ‘bad’, or that node 3’s update is ‘bad’. The problem is that both changes can’t be in flight at the same time. Therefore, we need some way of detecting that such a conflict exists. In NuoDB, there’s no extra-special node that is responsible for database leadership. There are no leader/master/mongo nodes. However, when we have a situation like this, we need some kind of arbitration. Fortunately, NuoDB has a distributed, light-weight pseudo-leadership that we can piggy-back on. In NuoDB each atom has a ‘chairman’: a node that can handle these arbitration situations. Every peer knows who the chairman is, even during waves of node failure. A lot of clever logic has gone into making chairmanship and chairman changes message-free, so there’s no election process or similar on-demand consensus gathering to worry about. In the example, that means that one of nodes 1, 2 or 3 is chairman for the atom containing the record metadata for row 42. Because the chairman may not be executing a transaction in the set of conflicting transactions, we need to make sure that the chairman still finds out about record updates. Of course, we had to do this anyway to keep everybody consistent. Therefore, we just need to make sure that the update changes are broadcast before commits, so that the chairman has everything it needs to break any ties. This broadcast-before-commit process has an added benefit: in the common case of no conflicts, all the changes will already be replicating around the database before the commit is even invoked (which leads to faster replication and commits).
The chairman breaks ties in a simple way. When two nodes are attempting to update the same row at the same time, the chairman will grant the update to the node whose update notification arrived at the chairman first. In the example, assume that node 1 is the chairman. Node 2 and node 3 both broadcast their update changes. Assume that node 3’s message arrives first. Node 1 will therefore install node 3’s record version, and the canonical record metadata atom will now note that row 42, version 6 is ‘B’ and is owned by the transaction on node 3. When node 2’s message shows up, the conflict is detected and a failure response is sent to node 2. This is illustrated below. In the example below, we’re assuming that both node 2’s and node 3’s transactions are updating multiple rows, so that local row updates are proceeding even though some of the updates haven’t been officially endorsed by the chairman.
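The first-arrival rule is simple enough to sketch in a few lines. The `Chairman` class, its `on_update` method and the `("granted", ...)` / `("conflict", ...)` replies below are all assumptions made up for this sketch; the real chairman lives inside an atom and speaks NuoDB's own message protocol.

```python
class Chairman:
    """Hypothetical tie-breaker for one record-metadata atom."""

    def __init__(self):
        # (row, version) -> (node, txn_id, value) of the winning update
        self.owners = {}

    def on_update(self, node, txn_id, row, version, value):
        key = (row, version)
        owner = self.owners.get(key)
        if owner is None:
            # First notification to arrive for this row version wins.
            self.owners[key] = (node, txn_id, value)
            return ("granted", node)
        if owner[1] == txn_id:
            # A repeated message from the winning transaction is harmless.
            return ("granted", node)
        # Someone else already owns this version: report who won.
        return ("conflict", owner[0])


chairman = Chairman()  # plays the part of node 1
print(chairman.on_update("node3", "t3", 42, 6, "B"))  # ('granted', 'node3')
print(chairman.on_update("node2", "t2", 42, 6, "A"))  # ('conflict', 'node3')
```

Note that the chairman only compares arrival order at its own doorstep, so it never needs a global clock or a round of voting.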
This system now enables us to do distributed conflict detection lazily and asynchronously. The only time node 2 or node 3 would have to block awaiting a response is if a commit-point was reached and the system had to determine the final state of the update operation (e.g. the transaction is being committed). Because SQL’s updates are inherently bulk updates, this means that the latency for multiple updates can all be overlapped and effectively eliminated (or greatly reduced). Another thing to bear in mind is that NuoDB nodes are constantly garbage collecting disused atoms. Therefore, even though the database itself may consist of 100 nodes, only the exact subset of nodes executing against the rows in question will have to participate in any conflict detection decision.
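Here is a small sketch of the 'only block at commit' pattern, using a thread pool as a stand-in for asynchronous messaging. The `Arbiter` and `Transaction` classes are hypothetical, and the arbiter is simplified to 'first transaction to claim a row owns it forever', which ignores version numbers and releasing ownership on commit or abort.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Arbiter:
    """Hypothetical stand-in for the chairman: first claimant owns a row."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}

    def claim(self, txn_id, row):
        with self._lock:
            # setdefault records txn_id only if the row is unclaimed.
            return self._owner.setdefault(row, txn_id) == txn_id


class Transaction:
    def __init__(self, txn_id, arbiter, pool):
        self.txn_id = txn_id
        self.arbiter = arbiter
        self.pool = pool
        self.pending = []  # replies we have not looked at yet

    def update(self, row):
        # Send the notification and keep going; no per-row waiting.
        future = self.pool.submit(self.arbiter.claim, self.txn_id, row)
        self.pending.append(future)

    def commit(self):
        # The only blocking point: gather every outstanding reply.
        results = [f.result() for f in self.pending]
        return all(results)


with ThreadPoolExecutor(max_workers=4) as pool:
    arbiter = Arbiter()
    t3 = Transaction("t3", arbiter, pool)
    for row in (41, 42, 43):
        t3.update(row)
    print(t3.commit())  # True: nobody else had touched these rows

    t2 = Transaction("t2", arbiter, pool)
    t2.update(42)       # t3 already owns row 42
    t2.update(99)
    print(t2.commit())  # False: the conflict surfaces at commit time
```

The three updates in `t3` are all in flight at once, so the transaction pays for roughly one round trip rather than three.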
We’re now 2 out of 3 steps towards having distributed MVCC. We know that using normal MVCC semantics means that we don’t have to do any fancy coordinating of pending changes. We also know how to leverage the chairmanship concept to do some tie-breaking between conflicting updates from different nodes. We’re now left with a final problem: ‘how do we resolve a conflict, once detected?’ This is more complicated, not least because the answer to this question depends upon the semantics that the client is looking for. For example, SQL allows clients to set ‘isolation levels’. This is basically a knob for trading off consistency for performance. Each isolation level may have completely different semantics to support. Therefore, we actually need a meta-solution to this problem. We need a system that will allow us to handle update conflicts differently depending upon client-requested semantics. From the universe of possibilities there is an obvious set of 3:
1. Fail right away (e.g. on conflict, abort the transaction)
2. Block until the conflicting transaction’s final state is known, then react against that state
3. Something special
The choice between 1 and 2 is similar to the choice that needed to be made for non-distributed MVCC and currently is chosen based on isolation level in NuoDB. Choice number 3 is a bit of a cop-out, but it basically refers to strategies that will allow the system to detect a conflict that can be ‘worked around’ and then work around it. NuoDB does this for several situations where a conflict can be fixed post facto. For example, if node 3 wins the race to update row 42 to version 6, there are situations where it is ok to change node 2’s update to be updating row 42 to version 7 instead. In these cases, NuoDB has determined that transactional semantics will allow us to effectively enforce that node 3’s transaction is happening before node 2’s transaction. Doing this correctly and efficiently requires some fancy bookkeeping. For example, now that node 2’s transaction is happening after node 3’s, what happens if the transaction on node 3 aborts and is rolled back? Of course, this depends on the isolation level of node 2’s transaction and whether or not node 2’s transaction used/read any state that’s now being rolled back. By being even more optimistic about updates, NuoDB avoids aborting transactions that could otherwise commit cleanly.
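The three choices can be pictured as a tiny dispatch function for the losing transaction. This is a sketch under loud assumptions: the `Resolution` enum, the `resolve` function and its return values are invented here, and the 'react' step for choice 2 is assumed to be 'proceed if the winner aborted, abort if it committed', which is only one of the behaviors an isolation level might call for. How NuoDB maps isolation levels onto these policies, and the bookkeeping needed when a reordered winner later rolls back, are not shown.

```python
from enum import Enum


class Resolution(Enum):
    FAIL_FAST = 1  # choice 1: abort as soon as the conflict is reported
    WAIT = 2       # choice 2: block until the winner's fate is known
    REORDER = 3    # choice 3: slot in behind the winner


def resolve(policy, my_version, winner_version, winner_outcome=None):
    """Return (action, version) for the transaction that lost the race.

    winner_outcome is None while the winner is still running, otherwise
    "committed" or "aborted" (hypothetical labels).
    """
    if policy is Resolution.FAIL_FAST:
        return ("abort", None)
    if policy is Resolution.WAIT:
        if winner_outcome is None:
            return ("block", None)
        if winner_outcome == "aborted":
            # Assumption: the contested version is free again.
            return ("proceed", my_version)
        # Assumption: a committed winner means the loser must fail.
        return ("abort", None)
    # REORDER: pretend the winner ran first and take the next version.
    return ("proceed", winner_version + 1)


# Node 2 lost the race for row 42, version 6, to node 3.
print(resolve(Resolution.FAIL_FAST, 6, 6))        # ('abort', None)
print(resolve(Resolution.WAIT, 6, 6))             # ('block', None)
print(resolve(Resolution.WAIT, 6, 6, "aborted"))  # ('proceed', 6)
print(resolve(Resolution.REORDER, 6, 6))          # ('proceed', 7)
```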