Let It Crash: On Distributed Transaction Design

DZone 's Guide to

Let It Crash: On Distributed Transaction Design

Using transactions with distributed systems is a great way to maximize the number of possible failure modes.

· Performance Zone ·
Free Resource

"Let it crash" is a watchword in the Erlang world that has a pretty specific meaning in that context, but can be seriously misapplied if taken out of context.

In the Erlang world, a "crash" is the termination of an actor in its specific context. In a well-designed actor system, the actors have very specific jobs and if they cannot complete that job they are free to fail immediately. This is a bit of a problem for folks working in the JVM world as "crash" can be overloaded to mean things that change the semantics of a transactional system.

Real world(ish) example: Suppose you have a distributed system that accepts a message, writes it to a data store, then hands a new message off to three other components. Suppose further that the transactional semantics of the system are such that the job isn't "done" until all four operations have completed and are either #1 permanently successful or #2 permanently failed.

The important detail here is that when performing a transfer, we want the balances of both accounts to be updated as a single transaction, and we cannot be in a state where the money has left one account but has not arrived in the other account. To do this requires the concept of a distributed transaction, but without using an "out of the box" distributed transaction coordinator. To clarify, we will assume that the components described are exposed via web services and don't have access to each other's underlying transaction management system.

So, to design this, the trivial implementation (let's call it the synchronous model) is as follows:

In this model, we need to handle a situation where if EITHER of the nested transactions fail, the requestor can rollback the OTHER transaction also and report back to the client that the entire transaction has failed. I won't dig into the details, as this is fairly complicated. The important detail here is that the entire transaction is synchronous and blocking. This means that the client and the requestor must hang around (typically in memory) waiting for the other components to report "success" or "failure" before reporting anything to the client. Moreover, it means that a failure of any component ultimately is a "permanent" failure (from the client's perspective), and it's up to the client to retry the transaction if it's important. While some failures might genuinely be permanent (one or the other of the accounts maybe don't or will never exist), while other failures (connectivity to one or other of the account updaters) may only be transient and/or short-lived.

In many ways, this simplifies things as it delegates the responsibility of management of success or failure to the leftmost component. That having been said, there is still potential for things to go wrong if, for example, the first updater succeeds but then the requestor dies and is unable to rollback the first transaction.

When put that way, it's obvious (I hope), that there needs to be some intermediate management that determines if there are any "partial" transactions, so if the request processor dies it can immediately rollback partial transactions should a failure occur. As an example, here is what this might look like.

We're still dodging some internal transaction management housekeeping, but the important detail is that between the point where the client lost track of the request (because it died), and the final "transaction failed" from the supervisor, the client has no idea what the state of the transaction is...it genuinely could be that the transaction succeeded, but the connectivity between the transfer requestor and the client simply failed.

So, the problems in this model are twofold: #1 it's "mostly" synchronous (though the Request supervisor -> client messaging clearly isn't), and #2 it assumes that it's "OK" for the transfer requestor to simply fail should an intermittent failure in part of the system cause a partial update to have happened. Obviously, this may or may not be acceptable depending on the business rules at play, but it is certainly a common model... i.e. you aren't sure if the transaction worked because, as the client, your network went down, so you get an out of band email from the Request supervisor at your bank confirming that it, in fact, did fail.

While this is a good approach, it does tend to tie up more resources in highly concurrent systems, doesn't deal with failure very well (you only have a few hard coded strategies you can use), and when you scale to dozens or hundreds of components, the chances of a single failure becomes so large that you are unlikely to ever succeed.

The next approach, which enables more flexibility, is to assume things will intermittently fail, and design the transaction semantics into the application protocol. This allows you to have "less robust" individual components, but adds the complexity of transaction management to the entire system. An example of how this might work:

The important details here are: #1: transaction details become persistent in the Transfer Store; #2: the Transfer Supervisor takes on the responsibility for the semantics of how the transaction strategy is managed; #3: the transaction gains the capabilities to become durable across transient failures with "most" components in the system; and #4: each independent component only needs to be available for smaller amounts of time. In general, these are all desirable qualities, but...

Some of the negative side effects of this approach are that: #1: as the designer of the system, you now are explicitly responsible for the details of the transactional behavior; #2: if the system is to be robust across component failures, the operations must be idempotent (not have side effects across invocations). As an example of how this might be more robust, let's look at how we might implement a behavior that is durable across transient failures:

In this model, the transfer supervisor implements a simple retry strategy when the account requestor is unavailable. While this potentially makes the system more robust and accounts for failure with much more flexibility, it's obvious that the Account Requestor (or the store) needs to be able to discern that sometimes it might receive duplicates and be able to handle that gracefully. Additionally, it becomes more important to know the difference between something that mutates state and something that simply acknowledges that it is in agreement with your perspective on the state of the system.

More importantly, the latter approach now means we must take into account the difference between a "permanent failure", and a "transient failure" and this is often not a trivial task. For example, is transferring between a nonexistent account and a real account a transient problem or not? If you think that's a trivial answer, think about a situation where there is yet another async process that creates and destroys accounts? Is it acceptable to retry in 1 minute (in case the account is in the process of being created when you initially try the transfer)?

In conclusion, while distributing transactions into smaller pieces adds great power, it also comes with great responsibility. This approach to designing systems is the genesis of the "let it crash" mantra bandied about by Scala and Erlang acolytes. "Let it crash" doesn't necessarily mean "relax your transaction semantics" or "don't handle errors"; it means you can delegate responsibility for recovery from failure out of a synchronous process and deal with it in more robust and novel ways.

erlang, software craftmanship

Published at DZone with permission of Michael Mainguy , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}