Hello again, fearless readers. This post will be a continuation of the discussion started here. I will continue to pontificate upon isolation levels as a SQL construct and how to understand them in a distributed MVCC database like NuoDB. NB: this post, like the previous one will use graphical representations of side-by-side timelines. The first column is 'transaction1' which is running at the same isolation level as the three other columns. Each of the remaining columns are demonstrating the behaviors of the three supported isolation levels in NuoDB. The way to read these timelines is that time flows from the top to the bottom and from left to right (that is, in a given horizontal slice, the actions in transaction 1 happened-before transaction 2). All of these examples are being executed against a sample DB initialized via:
SQL> create table t1 (f1 integer); SQL> insert into t1 values (1), (3), (5), (7), (9); SQL> select * from t1; F1 --- 1 3 5 7 9
UpdateObscura and Indexed Updates
Without further ado, lets get into the examples.
Example 1, SELECT FOR UPDATE
Select for update is an interesting SQL construct. It looks as if it's a read (after all, it contains 'SELECT') and it returns a result set. However SELECT FOR UPDATE is in fact, a form of write. Basically, a SELECT FOR UPDATE writes every row matching the WHERE clause, but overwrites it with the exact value it had before. It then returns the overwritten rows as a result set. Because the standard requires that the effective 'locks' on the rows need to be held until transaction commit, SELECT FOR UPDATE is useful when users need to do their own two-phase locking. Let's look at an example of SELECT FOR UPDATE in action:
Here we see that all three isolation levels will block on a select for update, however they all have different behaviors.
READ_COMMITTED is blocking because it can't decide which record version to write until Txn 1 has finished
CONSISTENT_READ is blocking because it has detected a concurrent update and doesn't know whether to fail or update until txn 1 has finished
WRITE_COMMITTED is blocking because it's just waiting to install a new record version on top of Txn 1's record version
This is where the justification for WRITE_COMMITTED can be intuited. WRITE_COMMITTED allows for serial semantics of commutative writes to be executed more efficiently than the other isolation levels. The reasoning here is similar to that behind the existence of SEQUENCEs. If I, as an application developer, know that an update (or set of updates) commutes with the updates other transactions will be executing against the same table, then WRITE_COMMITTED is the only isolation level that will let all the transaction's updates pile up concurrently. A complicating factor is that strict SQL semantics mandate that an UPDATE (or DELETE) cannot return until its changes are consistent. Therefore, the performance improvement will be at the single statement level. This currently also applies to stored procedures, however there is more semantic flexibility there.
Example 2: Indexed updates
Some of you may have noticed that all the examples have been performed against a table with no indexes. What this means is that the record selection is inherently imprecise. Even though an UPDATE may need to change a single row, the transaction has no choice but to scan every row in the table to find all the matches. Of course, this is exactly what indexes are supposed to help with. Let's see what (if anything) changes if we stick an index on the table:
> CREATE INDEX idx1 ON t1(f1);
In this example, transaction 2 didn't have to block for transaction 1 to commit, because the index allowed both update queries to pick the exact matching records. Since the record sets for the updates of Txn 1 and Txn 2 don't overlap, there is no write conflict, so both transactions can proceed. An important thing to note is that both the non-blocking and blocking behaviors are consistent with the definition of READ_COMMITTED. The difference is that the non-blocking behavior is performance improvement enabled by indexes. If both transaction 1 and 2 had tried to update the same record version (regardless of the precision of the write set selection) there would still be an update conflict.
What does this mean? CONSISTENT_READ (snapshot isolation), enforces precision by its definition, so indexes don't speed up writes there. But if your application is using READ_COMMITTED or WRITE_COMMITTED, you could reduce contention by adding an index that increases the precision of write-set selection. This is a good rule of thumb to bear in mind when diagnosing performance issues.
Performance and Latency considerations:
NuoDB is a distributed database. This is cool because it means that NuoDB can scale and adapt in ways that a traditional single-node solution cannot. But it also means that system designers and application developers must now also factor in latency when thinking about the performance characteristics of their application.
CONSISTENT_READ and WRITE_COMMITTED both do SNAPSHOT versioning for normal SELECT statements. Therefore, there is no latency sensitivity for these isolation levels. That is, regardless of how slow the link is between two halves of a database, the readers on either side of the link won't suffer performance degradation due to the latency.
The intuition for this behavior is that it's a direct consequence of the relativistic logic of a distributed system. In NuoDB, transactional control is local to a single node. Therefore, when a transaction starts on a TE, the 'happened-before' set of transactions on that node are known exactly. It is precisely the set of transactions known to be committed at that node at that moment. Therefore, it doesn't matter what the 'global' ordering is (and hence, no need for atomic clocks), any transaction not already known to be committed is then considered to be executing concurrently, and visibility calculations are made appropriately.
READ_COMMITTED is a different story. Because the definition of the isolation level is to only read the most recently committed versions of a record, READ_COMMITTED's read behavior is dependent upon the timing of commit messages that arrive at the TE handling the READ_COMMITTED transaction. Of course, no one writing a READ_COMMITTED application can expect any two executions of the same SELECT to return the same value (as per the SQL standard). Depending on the application's semantics, READ_COMMITTED may tie the completion time of a local transaction with that of a remote transaction. Any given SELECT will always return the most recently committed versions, but if a local transaction is spinning waiting for a remote commit to be propagated, it will suffer from latency. This is a situation where data locality would definitely improve transaction latency.
To put this another way, READ_COMMITTED allows the application to escape from the strict isolation and see committed changes as they are published, but if the application puts the writing transaction on one side of the world and the reader on the other, the reader will have to wait for the writer's changes to propagate around the world. There's nothing radical in this statement, it's just more of a reminder to those application developers dipping their toes in distributed waters for the first time. Always respect the locality of your application.
Here it is appropriate to discuss (briefly) the manner in which NuoDB performs updates. Updates are done optimistically with asynchronous coordination. A given table will be broken up into a multitude of 'atoms', which are independent distributed objects. For each record, there is an atom that it resides in. For each atom, there is a distinguished node we refer to as the 'chairman'. The chairman exists to break ties when multiple nodes are attempting to update the exact same record version. When a node updates a record version:
- The transaction optimistically installs a new version with the updated value
- The transaction broadcasts the update to all the peers (asynchronously)
- The record atom chairman double checks this update for conflicts
- On conflict, a rejection is sent to the updating node
- On success, a confirmation is sent to the updating node
The updating node will continue to process the record set for the update even while waiting for confirmation from the record chairman. In this way, bulk updates are interleaved in the most efficient way possible. This optimistic approach means that in the common case (a successful update), no further work needs to be done and the updated versions are already flowing over the network to all the peers. The consequence is that on a failed update, the updating node must do a bit of extra work and send out backout messages to 'undo' the failed update.
READ_COMMITTED and WRITE_COMMITTED transactions will block on any update conflict occurring against the table (one or more conflicting row updates). This is to prevent phantoms and other ugly write anomalies. However, this means that if some records are being updated globally, READ_COMMITTED and WRITE_COMMITTED transactions will be forcing a kind of global 2PL (two phase locking) on everything. If the row is being updated by a transaction located on the other end of a high-latency link, that means that these updates will have execution times that will grow with that latency. As we saw above, if each update has no or very low probability of overlap with other transactions, using an index to make the write set as specific as possible can reduce or eliminate blocking due to spurious conflicts.
CONSISTENT_READ is snapshot isolation, which can be leveraged so that only if two transactions are updating the exact same record version will they have to wait for one another. Because of the way updates are coordinated via chairmen, it is possible for a single row update to have to wait for the round-trip of an update message. However, bulk updates take advantage of asynchrony and send off all the update requests asynchronously, so that an N-row update won't require N serial round trips and the messaging coordinating with the chairman will be part of the flow of update versions that have to propagate to all peers anyway.
Here, the performance recommendation is intuitive. It is perfectly fine to have a table that is accessed globally. However, avoid updating specific rows globally regularly. In general, globally mutating global state is a dodgy proposition (doubly-so in distributed systems). Most things end up having a kind of natural locality anyway that prevents this from being an issue. For example, one could imagine a global 'CUSTOMERS' table with billions of entries. However, it seems highly unlikely that any realistic application would be simultaneously updating the same customer row in Europe, Asia and N. America repeatedly. Even if there's some long-running analytics queries that are churning over the entire table, if they're running at CONSISTENT_READ, they'll never hold up UPDATEs at any isolation level. Every now and then, there might be cause to update a row globally, but an application that does so constantly and with high frequency will do poorly in any distributed context. To take the example further, because each TE is constantly dropping unused atom copies from its memory as long as the application demonstrates some locality, NuoDB will automatically start caching table data to reflect that locality. What this means is that the application developer doesn't have to agonize over creating a custom caching layer, the application's natural locality will be reflected in the caching behavior of NuoDB itself (for more details see: An introduction to NuoDB's cache, Memcached vs NuoDB).
Hopefully this has helped fill in some of the gaps left over from the previous post, and has given my faithful readers some intuition about how to reason about isolation levels in NuoDB. I hope you've walked away from this with some rules of thumb that will help you reason about your database application as it runs on NuoDB.