Consistency Through Compensation in Microservices
In this post, we look at one approach by which we can achieve resiliency in our application even under server crashes and any intermediate failures.
Join the DZone community and get the full member experience.
Join For FreeThis article addresses the eventual consistency aspect of transactions in a microservices environment where a transaction spans more than one microservice and where transaction failure midway through is imminent.
Existing Business Use Case
Suppose that we currently have a monolithic order management system that is backed by an RDBMS.
Business Requirements
Our business expects us to give them the following guarantees:
- Never accept an order if the inventory does not have enough stock.
- Every item in the inventory is either assigned to an order or available for further purchases by a customer within a reasonable amount of time.
As every seasoned developer knows, a monolith is great for systems that want consistency because of RDBMSs and the ACID guarantees they provide. But this comes at the cost of autonomy and scalability.
Because of ACID guarantees, when we update both order and inventory tables, either the transaction succeeds or fails atomically. Because of this, it is easy to ensure that:
Total inventory = amount of stock left in inventory + quantity used up by previous orders.
In essence, it's relatively easy to fulfill this requirement in our monolithic world since we will be working against a single database.
Please note that most businesses will be OK if we oversell items even when we don’t have corresponding stock in the inventory. This is because once the order is placed, the business can ask vendors to send more of the oversold item still fulfill their customer’s order. The point this article addresses is “can the technology live within business constraints?” Suppose our business does not permit overselling because we are selling rare items, say classic paintings, which means we can't oversell as no other vendor can supply additional stock of all items in our inventory.
Breaking the Monolith
Suppose we chose microservices for achieving autonomy in terms of scalability/development velocity/release schedule, etc. It follows that we can’t share a single database across all of these microservices. Sharing a database across microservices forces each team to coordinate with other teams as soon as the database schema needs to change, which defeats the whole purpose of microservices.
Each type of microservice (can have any number of instances) will have to have its own private datastore. In other words, multiple instances of the same service type share a single database, but this database is never shared across microservices.
After we identified our bounded contexts, we came up with the following breakup of our monolith into two microservices, namely:
order-service
whose responsibility it is to accept customer orders.inventory-service
whose responsibility it is to allow changes to inventory by both order service and also by our own internal inventory team.
At first glance, fulfilling these business requirements in microservices does not look that hard.
We can make an order-service
call to inventory-service
to check if sufficient stock exists to fulfill the order. If stock is available, reduce the given quantity of stock and permit the order to go through. If not, deny the order.
But any seasoned developer will tell you it is not that simple (and I would not have written this article if it was this simple).
Problems With Concurrent Transactions
Suppose two different users try purchasing the same item simultaneously when there isn't enough stock to fulfill both the orders. If proper care is not taken at the inventory-service
level, we might allow both orders to go through, even though we don’t have enough stock to fulfill both orders.
As any one who is familiar with concurrency control would ask,why doesn’t inventory-service
reject the second request by utilizing optimistic concurrency control or pessimistic concurrency control?
Once we apply this fix, if two users attempt a simultaneous purchase only one of them will succeed if fulfilling both orders results in stock becoming negative
Problem With Server Failures
Even with the above solution, we are still left with a problem.
When it comes to servers on which our application runs, no one can guarantee that the hardware will never fail. So server failure is a given. Every application must keep this in mind and should be resilient to these failures.
What if inventory-service
reduced the quantity in its database and order-service
dies before it commits to its database?
Now we are left with an interesting situation where we have missing inventory from the system because the server crashed. Welcome to data inconsistency.
Going Back to the Drawing Board
Now we know businesses will be really unhappy if, even though we have unsold physical inventory, its missing in the software because one of our servers crashed in the middle of the transaction.
There are multiple ways this problem can be addressed. In this article, I will present one way of addressing this.
In the previous approach, an order
will be in either a confirmed
/rejected
state and the client is stateless.
But in our new approach:
- An order will have three states, namely
initialized
,confirmed
, andrejected
- And our client would be stateful.
- In addition, the client will follow the redirect after POST pattern.
The reason for this redirect is that when a client encounters a failure like the one in the previous scenario, we don’t want the client to retry the operation and place multiple orders for the same cart because the user refreshed their browser. The previous pattern avoids this situation completely.
Given that, here is our new approach:
- As soon as a request is received,
order-service
saves an order with aninitialized
status to its own DB. - Immediately,
order-service
will make call toinventory-service
with the<order id>
and additional item information to reduce the stock in a background thread with an exclusive row level lock on the newly created order. order-service
responds to the client with HTTP202 Accepted
(if API-based client)/302 Found
(if browser-based) response with a link to check the status of the order through an HTTPLocation
header.- If the background order placement succeeds,
order-service
will update the status of the order toconfirmed
and release the row level lock on the order. - If not,
order-service
will update the status of the order torejected
and release the row level lock on the order. - Once the client sees the above
202
/302
response, it will redirect the user to the order status page (in the case of a browser-based client) so that browser refresh does not cause duplicate orders. It's a best practice to show a loading screen while the order is being processed. - Finally,
order-service
will respond to the order status request withinitialized
/confirmed
/rejected
status. - If the order is not yet concluded, the latest status can be obtained by the client using polling or the server can implement long polling to avoid the client putting on too much polling load.
From point (2)
of the mechanism above, in a normal flow, order-service
takes an exclusive row level lock
over the order row and releases only once when the transaction is concluded.
If server crashes in the middle, this lock on the order row will be be released immediately by the database.
We will use this aspect to make sure that we don't interfere with the transactions that are currently in progress, when we try to reconcile/compensate (which will be discussed later).
Now back to the crash issue in order-service
.
Since a crash leaves our data in inconsistent state, order-service
will have to explicitly reconcile/compensate for these failures at the application level by fetching the latest status of the order from inventory-service
for abandoned orders. This is why its called Compensating Transaction
Compensation guarantees the order status will be consistent with the business rules eventually, but not immediately.
Depending on how soon we want this reconciliation, we can implement reconciliation in the following three stages:
- Startup reconciliation: Orders abandoned before the instance is started are guaranteed to be consistent once the instance is up (optional).
- On demand reconciliation: Orders requested by the client are guaranteed to be consistent once the request is processed successfully (optional).
- Scheduled reconciliation: Orders that are abandoned before the scheduler ran are guaranteed to be consistent once the scheduler runs successfully (mandatory).
The guarantees each of the above three reconcilers provides are only valid for those orders for which the reconciler can acquire an exclusive lock. If some other instance’s reconciler is doing compensation for any given order, the current reconciler just skips over them.
Now let's look at each one of these options and how the reconciliation/compensation works.
Startup Reconciliation
This type of reconciliation guarantees that all the orders that were abandoned before the startup of the instance are guaranteed to be consistent once the application is up.
Suppose that order-service
crashed in the middle of a transaction (abandoned). The instance can be brought back online manually, by using Kubernetes, or via some other mechanism.
Here is what we should do as part of theorder-service
startup to meet the above guarantee:
order-service
has to check if there are any orders in theinitialized
state.- Attempt to acquire exclusive row level lock on each of the above orders. If lock cannot be acquired, just skip over it as this order is current being processed by some other instance.
- For each of the above order on which lock can be acquired,
order-service
will requestinventory-service
about the latest status of the stock consumed by the given<order id>
inventory-service
will return latest status, whichorder-service
will use to update in its own database and release the exclusive row level lock.
This way, order-service
guarantees all orders that were abandoned before the startup will be consistent after startup.
On Demand Reconciliation
Implementing this gives a guarantee that any abandoned order/transaction will be reconciled on the subsequent status request for that given order. Note that this only reconciles one specific order, not all abandoned orders.
The way this works is that, as soon as we receive a request for order status:
order-service
will get the status of the order from local the DB.- If status is
initialized
, attempt to acquire an exclusive row level lock. - If the row level lock cannot be acquired (some other instance is working on this), then wait until an exclusive row level lock can be acquired on the order.
- Once the lock is acquired, check if the status is still
initialized
. If it is, move to step (5), otherwise return to the latest status. order-service
will queryinventory-service
about the latest status of the stock consumed by the given<order id>
inventory-service
will return the latest status, whichorder-service
will use to update in its own database and release exclusive row level locks.
Suppose that we have two instances of order-service
and the first instance is handling the order request and the server crashed midway through execution, just like before. When the client resends the order status request (clients are expected to do this on a best effort basis), it will be received by the second instance, at which point, we can do compensation logic for the abandoned order as part of this request. This way the client does not have to wait for the scheduled reconciliation to get the latest status (next section).
Scheduled Reconciliation
This type of reconciliation guarantees that all the orders that were abandoned before the start of the current run of the scheduler are guaranteed to be consistent once the scheduler's run is complete.
Since relying on the client to reconcile all orders that failed midway through execution is not a good idea, the client might never be able to contact the server due to network issues or some other reason. Thus, we want a mechanism within the application itself that does auto reconciliation at regular intervals. This is where scheduled reconciliation comes in.
At regular intervals, we have to schedule a task to be run within our application that does the reconciliation.
For each run of the scheduled task, this is what the reconciler does:
order-service
will fetch all orders that are in theinitialized
state.- Attempt to acquire an exclusive row level lock on each of the above orders. If the lock cannot be acquired, just skip over it as this order is currently being processed by some other instance.
- For each of the above orders on which lock can be acquired,
order-service
will requestinventory-service
about the latest status of the stock consumed by the given<order id>
inventory-service
will return the latest status, whichorder-service
will use to update its own database and release an exclusive row level lock.
Conclusion
We looked at one approach by which we can achieve resiliency in our application even under server crashes and any intermediate failures.
In order to achieve that, we needed:
- a stateful client.
- compensating transactions/reconcilers at various levels.
We looked at three types of reconciliations:
- Startup reconciliation
- On demand reconciliation
- Scheduled reconciliation
It is important to note that each application must have a scheduled reconciler as this guarantees regular reconciliation and is not dependent on external actors. The other two types of reconciliations are optional but preferred.
In a typical application, you would have all three reconcilers to reduce the amount of time the system will be in inconsistent state in case of unexpected failures.
If you have any queries/suggestions let me know in the comments section.
Published at DZone with permission of Ashok Koyi. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments