Distributed Saga and Resiliency of Microservices
Distributed Saga and Resiliency of Microservices
Eventual Consistency for Microservices using Distributed Saga.
Join the DZone community and get the full member experience.Join For Free
Sagas are typically used for modeling long-lived transactions like those involved in workflows. It is not advisable to use two-phase transaction protocols to control long-lived transactions since the locking of resources for prolonged durations across trust boundaries is not practical, rather is not at all advisable. Sagas are similar to nested transactions. In a nested transaction, atomic transactions are embedded in other transactions. In sagas, each of these transactions has a corresponding compensating transaction. While a Saga proceeds with its steps, if any of the transactions in a saga fails, the compensating actions for each transaction that was successfully run previously will be invoked so as to nullify the effect of the previously successful transactions.
Modeling a Saga and setting up a Saga Infrastructure is rather straight forward in Local and simple deployments, however when we want to scale out in public cloud environments, and that too with multiple instances of the same type of microservice, there comes a new list of challenges. We will look at few of them in this discussion.
Setting the Context for Saga
We will not attempt to explain what a Saga is in more details here since it's covered well in the DZone article titled “Distributed Sagas for Microservices.” The sample described there is an apt business case for our discussion, and we will use that in our explanation here. The scenario is that of doing a Travel booking. Many Travel agent booking systems provide this feature where they aggregate multiple kinds of travel inventories from different enterprises who actually own the inventory. It’s very unlikely that a single enterprise owns inventory for all these resources, but it’s highly likely that the end-user wants to book one or more of these resources in a single transaction because a confirmed hotel booking with a non-confirmed flight booking is not very useful for him!
In our case, The Travel agent enterprise refers to www.makeyourtrip.com. Travelers can come to this web site and reserve a complete travel package, which involves booking inventories of multiple types.
Let's assume that makeyourtrip has got a partnership with 3 other enterprises for retrieving inventories, as follows:
- www.yertz.com - Can Rental Services
- www.jilton.com - Hotel Room Rental Services
- www.selta.com - Flight Booking Services
The microservices for business operations for above enterprises can be portrayed as illustrated in Figure 01.
Figure 01 Orchestration based Saga using HTTP
We will use a variant of the Hexagonal Architecture representation to depict each microservice. Figure 01 shows such a representation, and we have 4 such microservices each corresponding to the 4 enterprises mentioned earlier. A booking request from a traveler will come through the internet and hits the travel agent application first. We will assume that all that is required as input parameters to carry out a complete travel booking is captured in a single request from the traveler’s user agent device by the travel agent’s microservice, which is Trip microservice.
Referring to Figure 01, let us look at a typical sequence of actions:
- Traveler sends a “Book Trip” request from the browser, which will hit the Trip microservice.
- The Trip microservice is responsible for starting Saga. It calls on what is called as a Saga coordinator endpoint, starting a Saga. The coordinator announces a Saga identifier in response. The Trip microservice enlists itself with the created Saga by calling the Saga coordinator providing the Saga identifier and handing over addresses of REST endpoints for compensation (optionally confirmation) callbacks. Those are endpoint the coordinator can call back in response to the outcome of the execute action on any participating microservices.
- The Trip microservice takes the Saga ID and adds it as an HTTP header to the REST call to the 3 other fulfilling microservices, Cab Microservice, Hotel Microservice and Flight Microservice.
- The called microservices distinguishes the Saga and they can enlist themselves (by announcing REST endpoints for compensation/confirmation callbacks) to Saga coordinator.
- The participant microservices, viz. Cab Microservice, Hotel Microservice and Flight Microservice executes the business process
- Any of the participant microservices could fail the Saga by calling “Execution Failure” on the Saga coordinator.
- Saga coordinator sends commands either to confirm or to compensate, to all participants
- On way back, the initiator microservice is responsible for finishing the Saga by calling complete (with a success or failure flag) on the Saga coordinator with the saga identifier.
Saga Transaction Semantics
A distributed saga is a collection of transactions, in our case, 3 other transactions each corresponding to transactions at 3 other mentioned microservices. As we have seen in the introductory section, for a Saga with n transactions:
T1, T2, ..., Tn
Each transaction has a compensating transaction:
C1, C2, ..., Cn
A compensating transaction (Cn) will semantically undo a previously completed corresponding transaction (Tn).
A distributed saga guarantees that either
T1, T2, ..., Tn
T1, T2, ..., Ti, Ci, ..., C2, C1
The Saga Execution Coordinator (SEC), aka Saga Manager orchestrates the entire logic of coordination and is responsible for the execution of the saga. All of the steps in a given Saga are recorded in what is called as a Saga Log, and the SEC writes to and reads from and interprets and triggers actions based on the records of the Saga Log.
Orchestration Based Saga
An Orchestration based Saga is based on a central coordinator who is responsible for instructing individual services to continue or rollback. Here this central coordinator service is responsible for centralising the Saga’s decision making and sequencing the constituent transactions. Figure 01 represents an Orchestration based Saga. The inherent weakness of an Orchestration based Saga is its central coordinator, which is it’s single point of failure. The Saga manager can be a separate service or a part of one of the coordinating microservices. If the Saga Manager goes down in the middle of a Saga, the entire participating microservices might remain inconsistent till the Saga Manager comes back to operational state.
Choreography Based Saga
In a Choreography based Saga, each microservice listens to commands or events from other microservices to determine if it should execute or compensate. Here there is no central coordination and each microservice produces and listen to other microservice’s commands or events events and decides if and what action should be taken or not.
Figure 02 Choreography based Saga using HTTP
Figure 02 represents a Choreography based Saga, which is an improvement over the Orchestration based Saga represented in Figure 01. In Choreography based Saga there is no central Saga Manager, but all of the participant microservices plays it’s part of the Saga Manager. In other words, each microservice takes care of itself, and puts a best effort to help the other participant microservices.
Choreography based Saga is much suited for microservices due to the decentralised nature of services. Let us look at this with an example scenario. Assume that, our Jilton Restaurant chain has expanded its business, and now allows ancillaries too to be booked along with hotel rooms. Knowing this, if a traveller is so particular that his entire travel plan is dependent on whether he can add a confirmed “Thai massage” too in his Trip booking, the Saga architecture would now look like as shown in Figure 03.
Figure 03 Saga Architecture accommodating expanding business scenarios
Different from our earlier visited microservices where all the microservices are connected through the internet, the new microservice called Ancilliary Microservice can be deployed in the same LAN or VPC where the Hotel Microservice too is deployed, hence they no more need to traverse through the internet to communicate, but can communicate through the local network.
Further, in this case Hotel Microservice can enlist Ancilliary Microservice too as a part of the same Saga, but this is not visible to the external world. The Hotel Microservice would reserve a room using a choreography-based saga that consists of the following steps:
- The Hotel Microservice receives the POST /rooms request and reserves a Room in a PENDING state
- It then sends a POST /ancilliary request and reserves a Thai Massage (or emits a Room Reserved event, which includes request parameters for Ancilliary booking)
- The Ancillary Microservice attempts to reserve a Thai Massage, and responds indicating the outcome
- The Hotel Microservice responds to the global Saga, and the rest continues as before.
HTTP as the Saga Transport Protocol
In Figure 01, Figure 02 and Figure 03, we have represented the participant Microservices to be separated by the internet. That is the reason we have shown HTTP as the Transport mechanism between the microservices. HTTP is inherently less reliable, especially over the internet. Hence the end to end reliability of your Saga Transaction is also dependent on the interconnectivity between the microservices. We would say, the reliability of the Saga in Figure 01, Figure 02 and Figure 03 is low.
The reason of less reliable alone won’t exclude the usage of HTTP for Saga executions. For every “Execute” and “Compensate” command, we can attach “Ack” or acknowledgment, which will improve overall reliability, however, that’s not straightforward. You need more number of Request-Response cycles, which will further degrade the overall reliability. If Ack is modeled not as a response to the original synchronous HTTP Request, instead, if it's modeled as a separate HTTP call back from the called microservice to the caller microservice, it violates the principle of good design practice which states that we should avoid cyclic dependency. Ideally, a Distributed Saga is always to be modeled as a DAG (Directed Acyclic Graph). This will avoid cyclic dependencies. A DAG is a finite directed graph with no directed cycles.
In Business Software Applications, this can be facilitated to certain extend by following the basic principles of "Good Design." One principle is to use the notion of Coarse-Grained Services and Fine-Grained Services.
A Coarse-Grained Service is an aggregate of Fine-Grained Services
However, A Coarse-Grained Service can be an aggregate of other Coarse-Grained Services, too.
We further need to avoid self dependencies and reverse dependencies.
Messaging for Saga Transport Protocol
The next best option is to use Message Brokers between Microservices. Message Brokers are more reliable, and Message Brokers with persistence are most reliable. This is represented in Figure 04.
Figure 04 Request cycle of a Choreography based Saga using Message Brokers
Using Message Brokers, we can use a Point-to-Point Channel which ensures that only one consumer consumes any given message. If the channel has multiple consumers, only one of them can successfully consume a particular message. If multiple consumers try to consume a single message, the channel ensures that only one of them succeeds, so the consumers do not have to coordinate with each other. The channel can still have multiple consumers to consume multiple messages concurrently, but only a single consumer consumes any one message. This is important for Microservices scalability, which we will touch base later.
When two microservices communicate via messaging, the communication is one-way. The reliability of Saga executions can be reexamined now. For every “Execute” and “Compensate” command, we can attach “Ack” or acknowledgement, which will improve overall reliability as in the case of HTTP transport. To do this, we want our microservices to have a two-way conversation, or Request-Reply pattern where a requestor Microservice sends a request, a replier Microservice receives the request and returns a reply, and the requestor receives the reply. This is shown in Figure 05.
Figure 05 Response cycle of a Choreography based Saga using Message Brokers
Figure 04 and Figure 05 also shows another aspect, we have replaced the online traveler with a B2B system. This is representative to show that Saga execution can happen cross-enterprise domains and trust boundaries in B2B scenarios, and message brokers provide channels reliable more than HTTP transport for Saga execution.
Hybrid Transport for Saga
Having seen how we can use both HTTP and Message Brokers as Saga Transport mechanism, there is nothing which prevents you from using a combination of these protocols in implementing a Saga, which is shown in Figure 06.
Figure 06 Saga using Heterogeneous Protocols
It is possible to abstract the intricacies of multiple protocols at an Enterprise level Saga framework so that responsibility of individual developers can be limited to implementing actual transaction logic and compensation logic and invoking required child transactions alone.
Microservice instances can come and go, so do Saga Managers. Since Saga are long running, you need to keep safe the state of Saga. You would require distributed and replicated logs to keep state, and keep the Saga Manager stateless to increase reliability of the Saga Manager. If so, Saga Managers may also go down without any disruption so that they can resurrect Saga when they come back and resume Saga from the current state. Saga Log contains various state-changing operations, such as begin saga, end saga, abort saga, begin Ti, end Ti, begin Ci, end Ci, etc. Figure 07 represents a Saga Log.
Figure 07 Saga Log implemented as a Distributed Log
The saga log is often implemented using a distributed log, and systems such as Kafka are commonly used for the implementation. The LogCloud mentioned in the references section is a cloud-enabled transaction logging and recovery service, so your cloud application nodes themselves do not need any persistent storage. All transaction logging and recovery is taken care of by the LogCloud, so Microservices can come and go without affecting transaction recovery.
The Saga Log mentioned in this section helps to build Orchestration based Saga Manager, of course nothing limits you from using it to build a Choreography based Saga Manager too. However, we will look at a slightly different approach in building Choreography based Saga Manager.
Design a Choreography Based Saga Manager
Having looked at various factors affecting the reliability of a Saga Transactions, we will now look at a feasible design for a Choreography based Saga Manager. My team in my current organization does pioneering work in Microservices and have a thought leadership on the topic, and they are currently working on an Enterprise Framework for a Saga Manager, but with variations in the design approaches described here.
As has been stated earlier, in a Choreography based Saga Manager there is no central Saga Manager, but all of the participant Microservices plays it’s part of the Saga Manager. In other words, each Microservice takes care of itself, and puts a best effort to help the other participant Microservices to come to an eventual consistent state. An Enterprise level Saga framework can abstract and separate out all Saga related complexities. Developers at the most play with few annotation as shown in the below pseudo code:
Listing 01 Pseudo Code for a Saga Task
While an Enterprise level Saga framework can abstract and separate out all Saga related complexities, the infrastructure required for Saga could work hand in hand with the business entities in the background. There is a reason for this, and we will look into that now. Figure 08 represents a design based on CQRS (Command Query Responsibility Segregation) pattern:
Figure 08 Microservices Scalability
In a CQRS pattern, the Read data is separated out from the Write data. Read data can then be scaled out infinitesimally, and whenever there is a change in the Write data, its corresponding Read data has to be synchronised. As can be seen from Figure 08, the Business Services (Cab Service, Cab Aggregate, etc.) and the Saga Services (MessageListener, BookCabTask, etc.) share the same resource manager, ie. the single Write Store. This makes Business operations and Saga operations to be executed atomically using local transactions to the same Write Store. This is very much permissible in a Microservices design. However, when you want to have more instances of the same Microservice, you can see that multiple instances of the same Microservice will share the single instance of the shared Write schema, which also doesn’t violate Microservices design principles. Careful observation would reveal one more aspect to you, even though its not much relevant for the Saga design here - The Cab Aggregate (Write Model) is replicated across instances. Data consistency can be maintained by optimistic concurrency here. Axon CQRS framework mentioned ion the references section helps in building this architecture.
Further, the Message Broker uses the same Write data store for its persistence purposes which is used by the Write Business entities. This means, (update) operations by the Message Listener (Saga Infrastructure operations) and Cab Service (Business Operations) are happening in the single resource (Write Store), hence can be executed atomically using local transactions. There is much more theory behind all aspects mentioned in this section which are well beyond the scope of this article, however all those theories with working samples are available in the book titled “Practical Microservices Architectural Patterns” mentioned in the references section below. Chapter 15 deals with many optimisations for Transactions in a Microservices Architecture.
Let's also touch base upon one another intricacy here - what if something fails while you read from the Message Listener and executes a Saga task as shown in Figure 08? This question is more relevant while using Message Brokers which doesn’t participate in transactions. You could use “Client Acknowledge” mode for the Message Listener, and you can wait to acknowledge till towards the end of the Saga Task. So far, so good - what if, the Saga task was successful, but the “Client Acknowledge” action alone fails? The event or message still remains in the Broker, and is susceptible to be read again, leading to duplicates.
Similarly, different instances of the same Microservice in Figure 08 is listening to different partitions of the same topic! What if they listen to the same partition?
In short, there is a hell lot of design aspects to be taken care, including but not limited to:
- Message Duplicates
- Message reaching out of order (A compensate reaching before an execute event)
We will leave those questions open for your thinking for the time being. You have answers to many of these queries from resources in the References section.
Saga is an excellent way of representing and executing real-world business transactions spanning more than one system or service, or spanning across domains and enterprises. Orchestration based Saga is what has been practiced more, however, Choreography based Saga is a perfect fit for decentralized kind of applications, especially microservices. You could use either off the shelf Saga coordinators, or if you have to develop in-house you could abstract it as a reusable library. Design of a fault-tolerant Saga coordinator is not a trivial task, however if you are clear on the boundaries within which your use cases has to perform with defined SLAs, you could also attempt to create it in-house, which could integrate and perform better along with your enterprise-grade microservices.
- Practical Microservices Architectural Patterns: Event-Based Java Microservices with Spring Boot and Spring Cloud
- Distributed Sagas for Microservices
- Narayana LRA: implementation of saga transactions
- Hexagonal Architecture
- Cloud-enabled transaction logging and recovery service
- Apache Kafka partitions
- Axon, Java CQRS Framework
Opinions expressed by DZone contributors are their own.