Designing Fault-Tolerant Messaging Workflows Using State Machine Architecture

State machine patterns, such as Stateful Workflows, Sagas, and Replicated State Machines, improve message reliability, sync consistency, and recovery.

Pankaj Taneja

May. 30, 25 · Analysis


Abstract

As a project lead for the backend of a global messaging platform that serves millions of users daily, I was responsible for several efforts to improve the stability and fault tolerance of our backend services. We rebuilt essential parts of our system using state machine patterns, notably Stateful Workflows. This model eliminated problems with message delivery, read-receipt visibility, and device sync, such as mismatched phone directories.

This article shares the practical lessons and difficulties of bringing these architectures into production, to show how a messaging infrastructure can be kept highly available and adaptable.

Introduction

When dealing with distributed systems, you should always assume that failure will happen. In our messaging platform, it quickly became clear that unpredictable behavior was not a rare occurrence but the normal state of affairs. Our infrastructure had to cope with network partitions, push notification delays, and user device crashes, and our engineers handled these problems well.

Rather than keep service-level retry logic scattered across the codebase, we chose a more systematic approach: state machines. Once we remodeled our business-critical workflows as stateful entities, failure recovery became not only automated but also predictable, observable, and consistent.

This article focuses on the three main designs we used (Stateful Workflows, Sagas, and Replicated State Machines) and how they helped us build a resilient system that responds gracefully to failure scenarios.

Using Stateful Workflows for Message Delivery

Message delivery is, without a doubt, the most crucial aspect of our system. In the beginning, we used a stateless, queue-based system to send messages to devices. Unfortunately, the process regularly stopped partway through in ways we had not anticipated, so the user either did not receive the message at all or received it after a significant delay.

We tackled this problem by introducing the Stateful Workflow Pattern with the help of Temporal:

Message Workflow States

Send Message Initiated
Message Stored
Push Notification Dispatched
Delivery Confirmed
Read Acknowledged

Every transition from one state to another was driven by events, with timers and retries attached. When a notification was not delivered (probably due to APNs/FCM complications), the system retried the request with exponential backoff. If the delivery confirmation failed to arrive in time, we logged the event and, if the customer wished, triggered fallback mechanisms such as sending notifications by email.
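
The states and retry behavior described above can be sketched in plain Python. This is only an illustration, not our production code and not the Temporal API: the event names, the `apply_event` and `send_with_retry` helpers, and the backoff parameters (1 second base, factor 2, 60 second cap) are all assumptions chosen for the example.

```python
from enum import Enum, auto


class State(Enum):
    SEND_INITIATED = auto()
    MESSAGE_STORED = auto()
    PUSH_DISPATCHED = auto()
    DELIVERY_CONFIRMED = auto()
    READ_ACKNOWLEDGED = auto()


# Hypothetical event names: event -> (required current state, next state).
TRANSITIONS = {
    "stored": (State.SEND_INITIATED, State.MESSAGE_STORED),
    "push_sent": (State.MESSAGE_STORED, State.PUSH_DISPATCHED),
    "delivered": (State.PUSH_DISPATCHED, State.DELIVERY_CONFIRMED),
    "read": (State.DELIVERY_CONFIRMED, State.READ_ACKNOWLEDGED),
}


def apply_event(state, event):
    """Return the next state, or raise if the event is not valid here."""
    expected, nxt = TRANSITIONS[event]
    if state is not expected:
        raise ValueError(f"event {event!r} not allowed in state {state.name}")
    return nxt


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def send_with_retry(send, delays, sleep=lambda seconds: None):
    """Try `send` once, then once more after each delay in `delays`.

    Returns the number of attempts used on success, or None if every
    attempt failed. `sleep` is injectable so the sketch runs instantly;
    pass time.sleep to wait for real.
    """
    if send():
        return 1
    for attempt, delay in enumerate(delays, start=2):
        sleep(delay)
        if send():
            return attempt
    return None


# Simulated push provider that fails twice before succeeding.
outcomes = iter([False, False, True])
state = State.MESSAGE_STORED
if send_with_retry(lambda: next(outcomes), backoff_delays()) is not None:
    state = apply_event(state, "push_sent")
```

Keeping the transition table separate from the retry helper means an out-of-order event is rejected explicitly instead of silently moving the message into an inconsistent state.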

Each step was persisted to the database, which allowed workflows to resume from the last recorded state even after a system crash or node restart. As a result, the number of lost messages dropped significantly, and error states became visible in our monitoring tools.
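
A minimal sketch of this checkpoint-and-resume idea follows. In our system the workflow engine handled persistence; here Python's built-in `sqlite3` module stands in for the durable store, and the `workflow` table, the helper names, and the `crash_before` parameter are hypothetical, introduced only to simulate a crash.

```python
import sqlite3

STEPS = [
    "SEND_INITIATED",
    "MESSAGE_STORED",
    "PUSH_DISPATCHED",
    "DELIVERY_CONFIRMED",
    "READ_ACKNOWLEDGED",
]


def open_store(path=":memory:"):
    """Open the checkpoint store; use a file path for real durability."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS workflow ("
        "message_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return db


def load_state(db, message_id):
    row = db.execute(
        "SELECT state FROM workflow WHERE message_id = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None


def save_state(db, message_id, state):
    db.execute(
        "INSERT OR REPLACE INTO workflow (message_id, state) VALUES (?, ?)",
        (message_id, state),
    )
    db.commit()


def run_workflow(db, message_id, crash_before=None):
    """Advance the workflow, checkpointing after every step.

    Execution starts after the last persisted state. `crash_before`
    simulates a process crash just before that step runs.
    """
    last = load_state(db, message_id)
    start = STEPS.index(last) + 1 if last else 0
    executed = []
    for step in STEPS[start:]:
        if step == crash_before:
            raise RuntimeError(f"simulated crash before {step}")
        executed.append(step)  # the real side effect would happen here
        save_state(db, message_id, step)
    return executed


db = open_store()
try:
    run_workflow(db, "msg-1", crash_before="PUSH_DISPATCHED")
except RuntimeError:
    pass  # the process "restarts" below
resumed = run_workflow(db, "msg-1")  # picks up at PUSH_DISPATCHED
```

Because a crash can occur after a side effect but before its checkpoint is written, a resumed workflow may repeat its last step, so each step should be idempotent.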

Implementing the Saga Pattern for Multi-Device Sync

Another critical requirement is keeping the read status of messages identical across all of a user's devices. If the user reads a message on one device, the change should be reflected on all the others right away.

We implemented this as a simple Saga:

Step 1: Mark the message as read on Device A.
Step 2: Sync to cloud state.
Step 3: Push read receipt to Devices B and C.

Each of the steps was a local transaction. If one of them failed, we ran the corresponding compensating actions, so consistency was preserved. For example, if the sync to the cloud failed, we rolled the state back and informed Device A of the problem, so that no partial changes remained.

This method lets us reach eventual consistency without the need for global locks or distributed transactions, both of which are complex and error-prone.
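
The three steps and their compensations can be sketched as follows. This is a simplified, in-memory illustration under stated assumptions: the `run_saga` helper, the step functions, and the `ctx` dictionary keys (including the `cloud_down` flag used to simulate a failure) are hypothetical names, not part of any library or of our actual code.

```python
class SagaFailed(Exception):
    """Raised after a step failed and earlier steps were compensated."""


def run_saga(steps, ctx):
    """Run (name, action, compensation) steps in order.

    If an action raises, the compensations of the steps that already
    completed run in reverse order, then SagaFailed is raised.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action(ctx)
        except Exception as exc:
            for _, undo in reversed(completed):
                undo(ctx)
            raise SagaFailed(f"step {name!r} failed: {exc}") from exc
        completed.append((name, compensate))


def mark_read_on_a(ctx):
    ctx["device_a_read"] = True


def unmark_read_on_a(ctx):
    ctx["device_a_read"] = False
    ctx["notified_a"] = True  # inform Device A that the change was undone


def sync_to_cloud(ctx):
    if ctx.get("cloud_down"):  # simulated failure switch for the example
        raise ConnectionError("cloud sync unavailable")
    ctx["cloud_read"] = True


def unsync_cloud(ctx):
    ctx["cloud_read"] = False


def push_receipts(ctx):
    ctx["receipts_sent_to"] = ["B", "C"]


def nothing_to_undo(ctx):
    pass


READ_RECEIPT_SAGA = [
    ("mark_read_on_device_a", mark_read_on_a, unmark_read_on_a),
    ("sync_to_cloud", sync_to_cloud, unsync_cloud),
    ("push_read_receipts", push_receipts, nothing_to_undo),
]

# Failure path: the cloud sync fails, so step 1 is compensated.
ctx = {"cloud_down": True}
try:
    run_saga(READ_RECEIPT_SAGA, ctx)
except SagaFailed:
    pass  # ctx["device_a_read"] is False again and Device A was notified
```

Running compensations in reverse order mirrors how the steps were applied, so each undo sees the state its matching action left behind.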

Using Replicated State Machines for Metadata Storage

To keep metadata such as conversation state and preferences consistent, we used Replicated State Machines based on the Raft consensus protocol. This design enabled us to:

Elect a leader to manage writes
Replicate changes to all followers
Restore state by replaying logs after a crash

This approach was especially beneficial for our persistent chat indexing service and for group membership management, where the view of the state always had to be correct.
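
The log-based recovery in the list above rests on one property: replicas that apply the same committed entries in the same order reach the same state. The sketch below shows only that property. It deliberately leaves out leader election, terms, and quorum commits, so it is not an implementation of Raft, and the `Entry` and `Replica` types and the sample keys are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Entry:
    index: int  # position in the replicated log, starting at 1
    op: str  # "set" or "delete"
    key: str
    value: Optional[str] = None


@dataclass
class Replica:
    state: dict = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, entry):
        """Apply one committed entry; entries must arrive in index order."""
        if entry.index != self.last_applied + 1:
            raise ValueError(
                f"gap in log: expected {self.last_applied + 1}, got {entry.index}"
            )
        if entry.op == "set":
            self.state[entry.key] = entry.value
        elif entry.op == "delete":
            self.state.pop(entry.key, None)
        else:
            raise ValueError(f"unknown op {entry.op!r}")
        self.last_applied = entry.index

    def replay(self, log):
        """Catch up by applying every entry that has not been applied yet."""
        for entry in log:
            if entry.index > self.last_applied:
                self.apply(entry)


log = [
    Entry(1, "set", "group:42:members", "alice,bob"),
    Entry(2, "set", "conv:7:muted", "true"),
    Entry(3, "delete", "conv:7:muted"),
    Entry(4, "set", "group:42:members", "alice,bob,carol"),
]

leader = Replica()
leader.replay(log)

# A follower that crashed and lost its in-memory state rebuilds it
# by replaying the log from the beginning.
follower = Replica()
follower.replay(log)
```

The index check in `apply` is what makes replay safe: a replica refuses to skip an entry, so it can never diverge silently from the leader.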

Comparative Analysis of Patterns

I compared the most common state machine-based fault tolerance patterns to arrive at a solution that worked well for us.

Aspect	Replicated State Machine	Stateful Workflow	Saga Pattern
Primary Goal	Strong consistency & availability	Long-running orchestration	Distributed transaction coordination
Consistency Model	Strong (linearizable)	Eventually consistent (recoverable)	Eventually consistent
Failure Recovery	Re-execution from logs	Resume from persisted state	Trigger compensations
Tooling Examples	Raft (etcd, Consul), Paxos	Temporal, AWS Step Functions	Temporal, Camunda, Netflix Conductor
Ideal For	Consensus, leader election, config stores	Multi-step business workflows	Business processes with rollback needs
Complexity	High (due to consensus)	Moderate	High (compensating logic needed)
Execution Style	Synchronous (log replication)	Asynchronous, event-driven	Asynchronous, loosely coupled

Results and Benefits

Implementing state machine patterns brought the following measurable improvements:

Message delivery retries fell by 60%.
Read receipt sync issues were cut down by 45%.
Recovery time after service crashes dropped to under 200 ms.
Incident resolution time decreased thanks to better observability.

We also built internal tools, such as dashboards that visualize the workflow state of each message during on-call incidents.

Conclusion

In a messaging system, reliability is not an add-on; it is a must. Users expect their messages to be delivered, read, and synchronized without delay. By using state machines to model essential workflows, we developed a fault-tolerant system that recovers gracefully from failures. The combination of Stateful Workflows, Sagas, and Replicated State Machines gave us the means to treat failures as first-class citizens in our architecture.

Although the implementation took considerable effort, the gains in robustness, clarity, and operational efficiency were significant. These patterns are now the foundation for how we build robust services across the organization.


Opinions expressed by DZone contributors are their own.
