BankNext Event-Driven Case Study: SAGA Compensation
Take a look into this event-driven BankNext case study, which explores a SAGA pattern to enforce system integrity in a transactional system.
“BankNext” employed event-driven choreography to increase its processing capacity manifold without compromising system flexibility. Things were good until abnormal system behavior was noticed while reconciling customer and account information. The requirement was that every active customer in the system must have a valid account. However, many active customers had no account at all.
1. Dangling Customers: Active Customers without any valid Account
2. Linkage not created: valid Customers and Accounts are present, but the linkage between these entities is missing
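Both symptoms can be detected with a reconciliation query. The following is a minimal sketch, not BankNext's actual check: it uses an in-memory SQLite database, and the column names (`status`, `customer_id`, `account_id`, `type`) as well as the assumption that the “account” table records its owning customer are hypothetical. Only the three table names come from the case study.

```python
import sqlite3

def build_sample_db():
    """Create a tiny in-memory DB with one healthy, one dangling, one unlinked customer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE account (id INTEGER PRIMARY KEY, customer_id INTEGER, type TEXT);
        CREATE TABLE customer_account_map (customer_id INTEGER, account_id INTEGER);
        INSERT INTO customer VALUES (1, 'ACTIVE'), (2, 'ACTIVE'), (3, 'ACTIVE');
        INSERT INTO account VALUES (10, 1, 'SAVINGS'), (30, 3, 'CURRENT');
        INSERT INTO customer_account_map VALUES (1, 10);
    """)
    return conn

def dangling_customers(conn):
    """Active customers that have no account row at all."""
    rows = conn.execute("""
        SELECT c.id FROM customer c
        LEFT JOIN account a ON a.customer_id = c.id
        WHERE c.status = 'ACTIVE' AND a.id IS NULL
        ORDER BY c.id
    """).fetchall()
    return [r[0] for r in rows]

def unlinked_pairs(conn):
    """Customer/account pairs that exist but have no customer_account_map entry."""
    return conn.execute("""
        SELECT c.id, a.id FROM customer c
        JOIN account a ON a.customer_id = c.id
        LEFT JOIN customer_account_map m
          ON m.customer_id = c.id AND m.account_id = a.id
        WHERE m.customer_id IS NULL
        ORDER BY c.id
    """).fetchall()

conn = build_sample_db()
print(dangling_customers(conn))  # customer 2 has no account
print(unlinked_pairs(conn))      # customer 3 and account 30 are not linked
```

In the sample data, customer 1 is healthy, customer 2 is dangling, and customer 3 has an account that was never linked.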
How did that happen?
BankNext’s current event-driven choreography (w/o SAGA) can be found here.
Back to the Drawing Board
After careful analysis, the engineering team observed the following:
1. BankNext’s event-driven system is composed of multiple microservices (msvcs) that independently manage data persistence.
2. If any of the msvcs in this transactional flow fails, there is no provision to correct or roll back the data that was already persisted by an upstream msvc.
3. This was the root cause of severe data integrity problems in the system.
1. The “CustomerMgt” msvc successfully creates the customer entity in the “customer” table in the RDBMS.
2. Next, “AccountMgt” msvc creates the account entity in the “account” DB table, after receiving the event on the Kafka topic it subscribes to.
3. In the last step, “EntityAggregation” msvc links these 2 entities by creating an entry into the “customer_account_map” table in the RDBMS.
4. The process is considered successful if, and only if, the above 3 steps are atomic; i.e., either all 3 steps complete successfully or none of them does.
5. If/when the “AccountMgt” or “EntityAggregation” msvc fails for any reason, the data integrity problems occur.
Example Scenario 1
- Customer entity created successfully.
- Account Msvc fails the request per the business validation rule (e.g., only “SAVINGS” & “CURRENT” are allowed but “OVERDRAFTS” is received).
- Account entity creation aborts.
- Customer is left without an Account, i.e., a Dangling Customer.
Example Scenario 2
- Customer entity created successfully.
- Account entity created successfully.
- EntityAggregation Msvc fails the request per the business validation rule (e.g., for “USA” the only type allowed is “SAVINGS” but “CURRENT” is received).
- Customer and Account linking aborts.
- An active Customer and Account are created but are not linked.
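The two scenarios can be reproduced with a small in-memory sketch of the flow without compensation. This is an illustration only: the function name `run_without_saga`, the request fields, and the use of dictionaries in place of the RDBMS tables are assumptions, while the validation rules are the ones given in the examples above.

```python
ALLOWED_ACCOUNT_TYPES = {"SAVINGS", "CURRENT"}

def run_without_saga(request, db):
    """Run the three steps in order; a failure stops the flow but undoes nothing."""
    cid = request["customer_id"]

    # Step 1: CustomerMgt persists the customer.
    db["customer"][cid] = {"status": "ACTIVE"}

    # Step 2: AccountMgt validates the account type before persisting.
    if request["account_type"] not in ALLOWED_ACCOUNT_TYPES:
        return "account_failed"  # the customer row stays behind
    db["account"][cid] = {"type": request["account_type"]}

    # Step 3: EntityAggregation validates the country rule before linking.
    if request["country"] == "USA" and request["account_type"] != "SAVINGS":
        return "entitymap_failed"  # customer and account rows stay behind
    db["customer_account_map"][cid] = cid
    return "ok"

db = {"customer": {}, "account": {}, "customer_account_map": {}}
# Scenario 1: the customer is created, the account is rejected.
run_without_saga({"customer_id": 1, "account_type": "OVERDRAFTS", "country": "IND"}, db)
# Scenario 2: customer and account are created, the linkage is rejected.
run_without_saga({"customer_id": 2, "account_type": "CURRENT", "country": "USA"}, db)
print(db)
```

After both calls, customer 1 exists without an account, and customer 2 has an account but no entry in the map.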
Solution Approach: SAGA Compensation
The engineering team concluded that the architecture lacked a self-correcting/compensating mechanism.
1. In such a failure scenario, the solution requires that all the data persisted by upstream msvcs be rolled back.
2. Thus, the system data is brought back to the state it was in before this txn started.
3. The SAGA approach is recommended to accomplish this task.
Technical Implementation: SAGA Compensation (Github)
1. New Kafka topics are created to capture failure messages, for example “account_failure_topic” and “entitymap_failure_topic”.
2. When a msvc fails, a failure message is published to the corresponding failure topic.
3. When “AccountMgt” publishes to the “account_failure_topic”, “CustomerMgt”, which is subscribed to this topic, is notified and initiates the DB rollback activity.
4. When “EntityAggregation” publishes to the “entitymap_failure_topic”, “AccountMgt” and “CustomerMgt”, which are both subscribed to this topic, are notified and each initiates its own DB rollback activity.
5. SAGA compensation will accomplish txn atomicity and restore the original system state.
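The steps above can be sketched with a tiny in-memory publish/subscribe bus standing in for Kafka. This is a simplified illustration under stated assumptions, not the code from the linked repository: the `Bus` class, the success topic names (`customer_created_topic`, `account_created_topic`), the event fields, and modelling rollback as deleting the row are all hypothetical. Only the two failure topic names and the subscriber relationships come from the article.

```python
from collections import defaultdict

class Bus:
    """Synchronous in-memory stand-in for Kafka topics (assumption for illustration)."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, event):
        for handler in list(self.handlers[topic]):
            handler(event)

def wire_services(bus, db):
    """Register forward steps and compensations; return the entry point of the flow."""
    def create_customer(event):               # CustomerMgt
        db["customer"][event["id"]] = "ACTIVE"
        bus.publish("customer_created_topic", event)

    def create_account(event):                # AccountMgt
        if event["account_type"] not in {"SAVINGS", "CURRENT"}:
            bus.publish("account_failure_topic", event)
            return
        db["account"][event["id"]] = event["account_type"]
        bus.publish("account_created_topic", event)

    def link_entities(event):                 # EntityAggregation
        if event["country"] == "USA" and event["account_type"] != "SAVINGS":
            bus.publish("entitymap_failure_topic", event)
            return
        db["customer_account_map"][event["id"]] = event["id"]

    def rollback_customer(event):             # CustomerMgt compensation
        db["customer"].pop(event["id"], None)

    def rollback_account(event):              # AccountMgt compensation
        db["account"].pop(event["id"], None)

    bus.subscribe("customer_created_topic", create_account)
    bus.subscribe("account_created_topic", link_entities)
    bus.subscribe("account_failure_topic", rollback_customer)
    bus.subscribe("entitymap_failure_topic", rollback_account)
    bus.subscribe("entitymap_failure_topic", rollback_customer)
    return create_customer

db = {"customer": {}, "account": {}, "customer_account_map": {}}
start = wire_services(Bus(), db)
start({"id": 1, "account_type": "OVERDRAFTS", "country": "IND"})  # Scenario 1
start({"id": 2, "account_type": "CURRENT", "country": "USA"})     # Scenario 2
print(db)  # all three tables are empty again after compensation
```

A real Kafka deployment delivers these messages asynchronously, so the rollback would complete some time after the failure rather than within the same call as it does here.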
State of the DB
New Architecture Positives
1. SAGA enables robustness and graceful system integrity restoration.
2. Provides a sturdy & flexible mechanism to accomplish process atomicity.
3. Eliminates tight coupling via publish and subscribe, as needed.
New Architecture Negatives
1. SAGA adds to system complexity and maintenance.
2. Observability, system debugging, and tracing capabilities need to be significantly ramped up.
3. There is a danger that the rollback operation triggered by the SAGA may itself fail.
4. Retries may be needed in such scenarios, which further complicates matters.
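One common way to soften the last two points is to make each rollback idempotent and wrap it in a bounded retry. The sketch below is a generic illustration rather than part of the BankNext implementation: `compensate_with_retry`, its parameters, and the exponential backoff policy are assumptions.

```python
import time

def compensate_with_retry(rollback, event, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call rollback(event) until it succeeds or max_attempts is reached.

    Returns True on success and False if every attempt failed, so the caller
    can escalate (for example, raise an alert or park the event for manual repair).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            rollback(event)
            return True
        except Exception:
            if attempt < max_attempts:
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False

def make_idempotent_rollback(table):
    """Build a rollback that is safe to repeat: deleting a missing row is a no-op."""
    def rollback(event):
        table.pop(event["id"], None)
    return rollback

customers = {1: "ACTIVE"}
ok = compensate_with_retry(make_idempotent_rollback(customers), {"id": 1})
print(ok, customers)
```

Idempotency matters because a retried or redelivered failure message may trigger the same rollback more than once.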
Summary: Architecture, Tech Stack & Rationale
Published at DZone with permission of Vijay Redkar. See the original article here.