
Consistency Through Compensation in Microservices


In this post, we look at one approach by which we can achieve resiliency in our application even under server crashes and any intermediate failures.


This article addresses the eventual consistency aspect of transactions in a microservices environment, where a transaction spans more than one microservice and a failure midway through the transaction is always a possibility.

Existing Business Use Case

Suppose that we currently have a monolithic order management system that is backed by an RDBMS.

Business Requirements

Our business expects us to give them the following guarantees:

  1. Never accept an order if the inventory does not have enough stock.
  2. Every item in the inventory is either assigned to an order or available for further purchases by a customer within a reasonable amount of time.

As every seasoned developer knows, a monolith is great for systems that need consistency, thanks to RDBMSs and the ACID guarantees they provide. But this comes at the cost of autonomy and scalability.

Because of ACID guarantees, when we update both order and inventory tables, either the transaction succeeds or fails atomically. Because of this, it is easy to ensure that:

Total inventory = amount of stock left in inventory + quantity used up by previous orders.

In essence, it's relatively easy to fulfill this requirement in our monolithic world since we will be working against a single database.
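
To make this concrete, here is a minimal sketch of the monolithic path, assuming plain JDBC and illustrative orders and inventory tables (all table, column, and class names are made up for this example). Because both writes live in one local ACID transaction, a crash or error anywhere in between rolls both of them back:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class MonolithOrderDao {

    // Check stock, decrement it, and record the order in ONE local transaction.
    // Either both tables change or neither does, so the invariant
    // "total inventory = stock left + stock consumed by orders" always holds.
    public boolean placeOrder(Connection conn, long itemId, int quantity, long customerId) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement reduceStock = conn.prepareStatement(
                 "UPDATE inventory SET quantity = quantity - ? WHERE item_id = ? AND quantity >= ?");
             PreparedStatement insertOrder = conn.prepareStatement(
                 "INSERT INTO orders (item_id, quantity, customer_id, status) VALUES (?, ?, ?, 'CONFIRMED')")) {

            reduceStock.setInt(1, quantity);
            reduceStock.setLong(2, itemId);
            reduceStock.setInt(3, quantity);
            if (reduceStock.executeUpdate() == 0) {  // not enough stock: business rule 1
                conn.rollback();
                return false;
            }

            insertOrder.setLong(1, itemId);
            insertOrder.setInt(2, quantity);
            insertOrder.setLong(3, customerId);
            insertOrder.executeUpdate();

            conn.commit();                           // both updates become visible atomically
            return true;
        } catch (SQLException e) {
            conn.rollback();                         // neither update survives
            throw e;
        }
    }
}
```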

Please note that most businesses are OK with overselling items even when we don’t have corresponding stock in the inventory: once the order is placed, the business can ask vendors to send more of the oversold item and still fulfill the customer’s order. The point this article addresses is “can the technology live within the business constraints?” So suppose our business does not permit overselling because we are selling rare items, say classic paintings, which means we can't oversell, as no other vendor can supply additional stock of the items in our inventory.

Breaking the Monolith

Suppose we choose microservices to achieve autonomy in terms of scalability, development velocity, release schedules, etc. It follows that we can’t share a single database across all of these microservices. Sharing a database across microservices forces each team to coordinate with other teams as soon as the database schema needs to change, which defeats the whole purpose of microservices.

Each microservice type (which may have any number of instances) must have its own private datastore. In other words, multiple instances of the same service type share a single database, but that database is never shared across different microservices.

After we identified our bounded contexts, we came up with the following breakup of our monolith into two microservices, namely:

  1. order-service, whose responsibility it is to accept customer orders.
  2. inventory-service, whose responsibility it is to allow changes to inventory by both order-service and our own internal inventory team.

Microservices architecture

At first glance, fulfilling these business requirements in microservices does not look that hard.

We can have order-service call inventory-service to check whether sufficient stock exists to fulfill the order. If stock is available, reduce it by the given quantity and let the order go through. If not, deny the order.

Microservices architecture

But any seasoned developer will tell you it is not that simple (and I would not have written this article if it was this simple).

Problems With Concurrent Transactions

Suppose two different users try purchasing the same item simultaneously when there isn't enough stock to fulfill both orders. If proper care is not taken at the inventory-service level, we might allow both orders to go through even though we can't fulfill them.

As anyone who is familiar with concurrency control would ask, why doesn’t inventory-service reject the second request by utilizing optimistic or pessimistic concurrency control?

Once we apply this fix, if two users attempt a simultaneous purchase, only one of them will succeed whenever fulfilling both orders would make the stock go negative.
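
As an illustration, assuming inventory-service also sits on a relational database, one simple way to get this guarantee is a conditional UPDATE that performs the check and the decrement in a single atomic statement. The recordReservation helper and the reservations table are hypothetical; they simply remember what each <order id> consumed so it can be looked up later during reconciliation:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class InventoryDao {

    // Atomically reserve stock for an order. The "quantity >= ?" guard makes the
    // check-and-decrement a single atomic step, so two concurrent purchases can
    // never both succeed when only one of them fits in the remaining stock.
    public boolean reserveStock(Connection conn, String orderId, long itemId, int quantity) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE inventory SET quantity = quantity - ? WHERE item_id = ? AND quantity >= ?")) {
            ps.setInt(1, quantity);
            ps.setLong(2, itemId);
            ps.setInt(3, quantity);
            boolean reserved = ps.executeUpdate() == 1;   // 0 rows updated -> not enough stock
            if (reserved) {
                recordReservation(conn, orderId, itemId, quantity);
            }
            return reserved;
        }
    }

    // Hypothetical helper: persists a reservation row keyed by order id so the
    // stock consumed by a given <order id> can be queried during reconciliation.
    private void recordReservation(Connection conn, String orderId, long itemId, int quantity) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO reservations (order_id, item_id, quantity) VALUES (?, ?, ?)")) {
            ps.setString(1, orderId);
            ps.setLong(2, itemId);
            ps.setInt(3, quantity);
            ps.executeUpdate();
        }
    }
}
```

An optimistic variant with a version column would work just as well; the essential property is that the stock check and the decrement cannot be interleaved by two concurrent requests.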

Problem With Server Failures

Even with the above solution, we are still left with a problem.

When it comes to servers on which our application runs, no one can guarantee that the hardware will never fail. So server failure is a given. Every application must keep this in mind and should be resilient to these failures.

What if inventory-service reduced the quantity in its database and order-service dies before it commits to its database?

Now we are left with an interesting situation where we have missing inventory from the system because the server crashed. Welcome to data inconsistency.

Going Back to the Drawing Board

Now we know the business will be really unhappy if, even though we have unsold physical inventory, it's missing in the software because one of our servers crashed in the middle of a transaction.

There are multiple ways this problem can be addressed. In this article, I will present one way of addressing this.

In the previous approach, an order ended up in either a confirmed or rejected state, and the client was stateless.

But in our new approach:

  1. An order will have three states, namely initialized, confirmed, and rejected.
  2. And our client would be stateful.
  3. In addition, the client will follow the redirect after POST pattern.

The reason for this redirect is that when a client encounters a failure like the one in the previous scenario, we don’t want it to retry the operation and place multiple orders for the same cart just because the user refreshed their browser. The redirect-after-POST pattern avoids this situation completely.

Given that, here is our new approach:

  1. As soon as a request is received, order-service saves an order with an initialized status to its own DB.
  2. Immediately, order-service makes a call to inventory-service with the <order id> and additional item information to reduce the stock. This call happens in a background thread while holding an exclusive row level lock on the newly created order (see the sketch after this list).
  3. order-service responds to the client with an HTTP 202 Accepted (for API-based clients) or 302 Found (for browser-based clients) response, with a link to check the status of the order in the HTTP Location header.
  4. If the background order placement succeeds, order-service will update the status of the order to confirmed and release the row level lock on the order.
  5. If not, order-service will update the status of the order to rejected and release the row level lock on the order.
  6. Once the client sees the above 202/302 response, it will redirect the user to the order status page (in the case of a browser-based client) so that browser refresh does not cause duplicate orders. It's a best practice to show a loading screen while the order is being processed.
  7. Finally, order-service will respond to the order status request with initialized/confirmed/rejected status.
  8. If the order is not yet concluded, the latest status can be obtained by the client using polling or the server can implement long polling to avoid the client putting on too much polling load.
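
Here is a minimal sketch of this flow in order-service, using plain JDBC and an ExecutorService for the background work. The InventoryClient interface, the table and column names, and the status URL are assumptions made for this example, not a prescribed API:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sql.DataSource;

public class OrderPlacementService {

    public interface InventoryClient {                        // hypothetical REST client for inventory-service
        boolean reserveStock(String orderId, long itemId, int quantity);
        String reservationStatus(String orderId);             // "confirmed" if stock was reserved for the order, else "rejected"
    }

    private final DataSource db;
    private final InventoryClient inventory;
    private final ExecutorService background = Executors.newFixedThreadPool(4);

    public OrderPlacementService(DataSource db, InventoryClient inventory) {
        this.db = db;
        this.inventory = inventory;
    }

    // Steps 1-3: persist the order as initialized, start the inventory call in a
    // background thread, and return the status URL that the HTTP layer sends back
    // as a 202/302 response with a Location header.
    public String placeOrder(long itemId, int quantity) throws Exception {
        String orderId = UUID.randomUUID().toString();
        try (Connection conn = db.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO orders (id, item_id, quantity, status) VALUES (?, ?, ?, 'initialized')")) {
            ps.setString(1, orderId);
            ps.setLong(2, itemId);
            ps.setInt(3, quantity);
            ps.executeUpdate();
        }
        background.submit(() -> confirmOrReject(orderId, itemId, quantity));
        return "/orders/" + orderId;                          // value of the Location header
    }

    // Steps 2, 4 and 5: hold an exclusive row level lock on the order for the whole
    // call to inventory-service, then flip the status to confirmed or rejected.
    private void confirmOrReject(String orderId, long itemId, int quantity) {
        try (Connection conn = db.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement lock = conn.prepareStatement(
                     "SELECT status FROM orders WHERE id = ? FOR UPDATE")) {
                lock.setString(1, orderId);
                try (ResultSet rs = lock.executeQuery()) {    // exclusive row level lock held until commit
                    rs.next();
                    if (!"initialized".equals(rs.getString("status"))) {
                        conn.rollback();                      // a reconciler already concluded this order
                        return;
                    }
                }
            }
            String status = inventory.reserveStock(orderId, itemId, quantity) ? "confirmed" : "rejected";
            try (PreparedStatement update = conn.prepareStatement(
                     "UPDATE orders SET status = ? WHERE id = ?")) {
                update.setString(1, status);
                update.setString(2, orderId);
                update.executeUpdate();
            }
            conn.commit();                                    // releases the row level lock
        } catch (Exception e) {
            // A crash or failure here can leave the order stuck in 'initialized';
            // the reconcilers described below compensate for exactly this case.
        }
    }
}
```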

From point (2) of the mechanism above, in the normal flow, order-service takes an exclusive row level lock on the order row and releases it only when the transaction is concluded.

If the server crashes in the middle, this lock on the order row will be released immediately by the database.

We will use this aspect to make sure that we don't interfere with transactions that are currently in progress when we try to reconcile/compensate (discussed later).

Now back to the crash issue in order-service.

Since a crash leaves our data in an inconsistent state, order-service will have to explicitly reconcile/compensate for these failures at the application level by fetching the latest status of abandoned orders from inventory-service. This is why it's called a compensating transaction.

Compensation guarantees the order status will be consistent with the business rules eventually, but not immediately.

Depending on how soon we want this reconciliation, we can implement reconciliation in the following three stages:

  1. Startup reconciliation: Orders abandoned before the instance is started are guaranteed to be consistent once the instance is up (optional).
  2. On demand reconciliation: Orders requested by the client are guaranteed to be consistent once the request is processed successfully (optional).
  3. Scheduled reconciliation: Orders that are abandoned before the scheduler ran are guaranteed to be consistent once the scheduler runs successfully (mandatory).

The guarantees each of the above three reconcilers provides are only valid for those orders for which the reconciler can acquire an exclusive lock. If some other instance’s reconciler is doing compensation for any given order, the current reconciler just skips over them.

Now let's look at each one of these options and how the reconciliation/compensation works.

Startup Reconciliation

This type of reconciliation guarantees that all orders abandoned before the startup of the instance will be consistent once the application is up.

Suppose that order-service crashed in the middle of a transaction (abandoned). The instance can be brought back online manually, by using Kubernetes, or via some other mechanism.

Here is what we should do as part of the order-service startup to meet the above guarantee:

  • order-service has to check if there are any orders in the initialized state.
  • Attempt to acquire an exclusive row level lock on each of the above orders. If the lock cannot be acquired, just skip over the order, as it is currently being processed by some other instance.
  • For each of the above orders on which a lock can be acquired, order-service asks inventory-service for the latest status of the stock consumed by the given <order id>.
  • inventory-service returns the latest status, which order-service uses to update its own database before releasing the exclusive row level lock.

This way, order-service guarantees all orders that were abandoned before the startup will be consistent after startup.
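
A sketch of what that startup pass could look like, reusing the hypothetical InventoryClient and table names from the earlier sketch. The FOR UPDATE SKIP LOCKED clause (available in PostgreSQL and MySQL 8, for example) gives exactly the "lock it or skip it" behavior described above:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StartupReconciler {

    // Run once while the order-service instance boots: every order still stuck in
    // the initialized state is either concluded here or skipped because another
    // instance currently holds its row level lock.
    public void reconcile(Connection conn, OrderPlacementService.InventoryClient inventory) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement abandoned = conn.prepareStatement(
                 // SKIP LOCKED silently skips rows locked by someone else,
                 // i.e. orders that another instance is currently handling.
                 "SELECT id FROM orders WHERE status = 'initialized' FOR UPDATE SKIP LOCKED");
             ResultSet rs = abandoned.executeQuery()) {
            while (rs.next()) {
                String orderId = rs.getString("id");
                String latest = inventory.reservationStatus(orderId);   // ask inventory-service what really happened
                try (PreparedStatement update = conn.prepareStatement(
                         "UPDATE orders SET status = ? WHERE id = ?")) {
                    update.setString(1, latest);
                    update.setString(2, orderId);
                    update.executeUpdate();
                }
            }
        }
        conn.commit();   // releases all row level locks taken above
    }
}
```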

On Demand Reconciliation

Implementing this gives a guarantee that any abandoned order/transaction will be reconciled on the subsequent status request for that given order. Note that this only reconciles one specific order, not all abandoned orders.

The way this works is that, as soon as we receive a request for order status:

  1. order-service will get the status of the order from the local DB.
  2. If the status is initialized, attempt to acquire an exclusive row level lock.
  3. If the row level lock cannot be acquired (some other instance is working on this), then wait until an exclusive row level lock can be acquired on the order.
  4. Once the lock is acquired, check if the status is still initialized. If it is, move to step (5); otherwise, return the latest status.
  5. order-service will query inventory-service about the latest status of the stock consumed by the given <order id>.
  6. inventory-service will return the latest status, which order-service will use to update its own database before releasing the exclusive row level lock.

Suppose that we have two instances of order-service, the first instance is handling the order request, and it crashes midway through execution, just like before. When the client resends the order status request (clients are expected to do this on a best-effort basis), it will be received by the second instance, at which point we can run the compensation logic for the abandoned order as part of this request. This way the client does not have to wait for the scheduled reconciliation (next section) to get the latest status.
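
A sketch of the status endpoint's handler under the same assumptions as before; a plain FOR UPDATE simply blocks until any other instance finishes with the row, after which we re-check the status it wrote:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class OrderStatusHandler {

    // Handles GET /orders/{id}: return the current status, but first compensate
    // for the order if a crash left it stuck in the initialized state.
    public String orderStatus(Connection conn, OrderPlacementService.InventoryClient inventory, String orderId)
            throws SQLException {
        conn.setAutoCommit(false);
        String status;
        try (PreparedStatement ps = conn.prepareStatement(
                 "SELECT status FROM orders WHERE id = ? FOR UPDATE")) {  // blocks while another instance holds the lock
            ps.setString(1, orderId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                status = rs.getString("status");                          // re-check after the lock is ours
            }
        }
        if ("initialized".equals(status)) {                               // still abandoned -> compensate now
            status = inventory.reservationStatus(orderId);
            try (PreparedStatement update = conn.prepareStatement(
                     "UPDATE orders SET status = ? WHERE id = ?")) {
                update.setString(1, status);
                update.setString(2, orderId);
                update.executeUpdate();
            }
        }
        conn.commit();   // releases the row level lock
        return status;
    }
}
```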

Scheduled Reconciliation

This type of reconciliation guarantees that all orders abandoned before the start of the current run of the scheduler will be consistent once that run is complete.

Relying on the client to reconcile all orders that failed midway through execution is not a good idea; the client might never be able to contact the server again due to network issues or some other reason. Thus, we want a mechanism within the application itself that does auto reconciliation at regular intervals. This is where scheduled reconciliation comes in.

At regular intervals, we have to schedule a task to be run within our application that does the reconciliation.

For each run of the scheduled task, this is what the reconciler does:

  • order-service will fetch all orders that are in the initialized state.
  • Attempt to acquire an exclusive row level lock on each of the above orders. If the lock cannot be acquired, just skip over it, as this order is currently being processed by some other instance.
  • For each of the above orders on which a lock can be acquired, order-service will request from inventory-service the latest status of the stock consumed by the given <order id>.
  • inventory-service will return the latest status, which order-service will use to update its own database before releasing the exclusive row level lock (a sketch follows below).
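
A minimal sketch of wiring this up with a plain ScheduledExecutorService. The reconcile pass itself can reuse the SKIP LOCKED query from the startup reconciler; the five-minute interval is an arbitrary example:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledReconciliation {

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    // Runs the same lock-and-compensate pass as the startup reconciler at a fixed
    // interval, so abandoned orders converge even if no client ever asks about them.
    public void start(Runnable reconcilePass) {
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                reconcilePass.run();       // e.g. StartupReconciler.reconcile(...) with a fresh connection
            } catch (Exception e) {
                // log and retry on the next run; the compensation is idempotent,
                // so a failed or partial run is safe to repeat
            }
        }, 1, 5, TimeUnit.MINUTES);
    }
}
```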

Conclusion

We looked at one approach by which we can achieve resiliency in our application even under server crashes and any intermediate failures.

In order to achieve that, we needed:

  1. a stateful client.
  2. compensating transactions/reconcilers at various levels.

We looked at three types of reconciliations:

  • Startup reconciliation
  • On demand reconciliation
  • Scheduled reconciliation

It is important to note that each application must have a scheduled reconciler as this guarantees regular reconciliation and is not dependent on external actors. The other two types of reconciliations are optional but preferred.

In a typical application, you would have all three reconcilers to reduce the amount of time the system stays in an inconsistent state after unexpected failures.

If you have any queries/suggestions let me know in the comments section.
