DZone
Designing Fault-Tolerant Messaging Workflows Using State Machine Architecture

State machine patterns, such as Stateful Workflows, Sagas, and Replicated State Machines, improve message reliability, sync consistency, and recovery.

By Pankaj Taneja · May. 30, 25 · Analysis

Abstract 

As a project lead on the backend of a global messaging platform that serves millions of users daily, I was responsible for several efforts to improve the stability and fault tolerance of our backend services. We rebuilt essential parts of the system around state machine patterns, notably Stateful Workflows. This model eliminated problems in message delivery, read receipt visibility, and device sync, such as mismatched phone directories.

This article shares the practicalities and trials of bringing these architectures into production, to show how a messaging infrastructure can be kept highly available and adaptable.

Introduction 

When dealing with distributed systems, you should always assume that failure will happen. On our messaging platform, it quickly became clear that unpredictable behavior was not a once-in-a-blue-moon occurrence but the normal state of affairs. Our infrastructure had to deal not only with network partitions and push notification delays but also with user device crashes, and our engineers did a great job of coping with these problems.

Instead of keeping service-level retry logic scattered all over the codebase, we chose a more systematic approach: state machines. Once we reimagined our business-critical workflows as stateful entities, we found that we could not only automate failure recovery but also do it in a predictable, observable, and consistent manner.

This piece focuses on the three main designs we used (Stateful Workflows, Sagas, and Replicated State Machines) and on how they helped us build a resilient system that responds to failures gracefully.

Using Stateful Workflows for Message Delivery 

Message delivery is, without a doubt, the most crucial aspect of our system. In the beginning, we used a stateless, queue-based system to send messages to devices. Unfortunately, the process regularly stopped midway in ways we had not foreseen, so the user either did not receive the message at all or received it with a significant delay.

We tackled this problem by introducing the Stateful Workflow Pattern with the help of Temporal:

Message Workflow States

  1. Send Message Initiated
  2. Message Stored
  3. Push Notification Dispatched
  4. Delivery Confirmed
  5. Read Acknowledged

Every transition from one state to the next was driven by events, with timers and retries attached. When a notification was not delivered (typically because of APNs/FCM issues), the system retried the request with exponential backoff. If the delivery confirmation did not arrive in time, we logged the event and, if the customer wished, could also trigger fallback mechanisms such as sending a notification by email.
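The states and retry behavior described above can be sketched as a small transition table plus a backoff schedule. This is an illustrative sketch only, not the Temporal API and not our production code: the event names, `next_state`, `backoff_delays`, `dispatch_with_retry`, and the default delay values are all assumptions made for the example.

```python
from enum import Enum, auto


class MessageState(Enum):
    SEND_INITIATED = auto()
    STORED = auto()
    PUSH_DISPATCHED = auto()
    DELIVERY_CONFIRMED = auto()
    READ_ACKNOWLEDGED = auto()


# Hypothetical event names; each event moves the workflow one state forward.
TRANSITIONS = {
    (MessageState.SEND_INITIATED, "stored"): MessageState.STORED,
    (MessageState.STORED, "push_sent"): MessageState.PUSH_DISPATCHED,
    (MessageState.PUSH_DISPATCHED, "delivered"): MessageState.DELIVERY_CONFIRMED,
    (MessageState.DELIVERY_CONFIRMED, "read"): MessageState.READ_ACKNOWLEDGED,
}


def next_state(state, event):
    """Return the state reached by `event`, rejecting illegal transitions."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


def dispatch_with_retry(send, delays, sleep=lambda seconds: None):
    """Call `send()` until it succeeds, waiting `delay` seconds after each failure.

    `sleep` is a no-op by default so the sketch runs instantly; pass
    time.sleep to wait for real. Returns the number of attempts used.
    """
    last_error = None
    for attempt, delay in enumerate(delays, start=1):
        try:
            send()
            return attempt
        except ConnectionError as error:
            last_error = error
            sleep(delay)
    raise RuntimeError("push notification retries exhausted") from last_error
```

Keeping the legal transitions in one table means an out-of-order event (for example, a read acknowledgement before the message is stored) is rejected in a single place instead of being handled ad hoc in each service.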

Each step was persisted in the database, which enabled workflows to resume from the point where they had last stopped, even after a system crash or node restart. As a result, the number of lost messages decreased significantly, and error states became visible in our monitoring tools.
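To make the resume-after-crash idea concrete, here is a minimal sketch that checkpoints progress after every step. It is a simplified illustration and not how Temporal works internally; the `progress` table, the step names, `open_store`, and `run_workflow` are assumptions for this example, and SQLite stands in for whatever durable store is actually used.

```python
import sqlite3

# Hypothetical step names for one message workflow.
STEPS = ["store_message", "dispatch_push", "confirm_delivery", "acknowledge_read"]


def open_store(path=":memory:"):
    """Open (or create) the table that records each workflow's progress."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS progress ("
        "message_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    return db


def run_workflow(db, message_id, handlers):
    """Run the remaining steps for `message_id`, checkpointing after each one.

    Returns the names of the steps that were still pending when this run began.
    """
    row = db.execute(
        "SELECT next_step FROM progress WHERE message_id = ?", (message_id,)
    ).fetchone()
    start = row[0] if row else 0
    for index in range(start, len(STEPS)):
        # If the handler raises, the last committed checkpoint is left untouched.
        handlers[STEPS[index]](message_id)
        db.execute(
            "INSERT OR REPLACE INTO progress (message_id, next_step) VALUES (?, ?)",
            (message_id, index + 1),
        )
        db.commit()
    return STEPS[start:]
```

Calling `run_workflow` again with the same `message_id` after a failure skips the steps that were already checkpointed. Because a crash can happen between running a step and committing its checkpoint, each step in a design like this should be idempotent.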

Implementing the Saga Pattern for Multi-Device Sync 

Another vital requirement is keeping the read status of messages identical across all of a user's devices. If the user reads a message on one device, the change should show up on all the other devices right away.

We implemented this as a simple Saga:

  • Step 1: Mark the message as read on Device A.
  • Step 2: Sync to cloud state.
  • Step 3: Push read receipt to Devices B and C.

Each of the steps was a local transaction. If one of them failed, we ran the corresponding compensating actions, so consistency was not lost. For example, if the sync to the cloud failed, we rolled the state back and informed Device A of the problem, so that no partial changes remained.

This method lets us reach eventual consistency without global locks or distributed transactions, which are both intricate and error-prone.
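The three steps and their rollback behavior can be sketched as a list of (action, compensation) pairs. This is a minimal in-process sketch under stated assumptions: `run_saga`, `SagaFailed`, the step functions, and the dictionary-based `ctx` are hypothetical names chosen for illustration, and a real deployment would run each step as a separate local transaction on a different service or device.

```python
class SagaFailed(Exception):
    """Raised after a step failed and earlier steps were compensated."""


def run_saga(steps, context):
    """Run (name, action, compensation) steps in order.

    If an action raises, the compensations of the steps that already
    completed run in reverse order, then SagaFailed is raised.
    """
    completed = []
    for name, action, compensation in steps:
        try:
            action(context)
        except Exception as error:
            for _, undo in reversed(completed):
                undo(context)
            raise SagaFailed(f"step {name!r} failed: {error}") from error
        completed.append((name, compensation))


# Hypothetical local transactions for the read-receipt flow.
def mark_read_on_device_a(ctx):
    ctx["device_a_read"] = True


def unmark_read_on_device_a(ctx):
    ctx["device_a_read"] = False
    ctx["device_a_notified_of_failure"] = True


def sync_to_cloud(ctx):
    if ctx.get("cloud_down"):
        raise ConnectionError("cloud sync unavailable")
    ctx["cloud_read"] = True


def revert_cloud(ctx):
    ctx["cloud_read"] = False


def push_receipts(ctx):
    ctx["receipts_pushed_to"] = ["B", "C"]


def no_compensation(ctx):
    pass


READ_RECEIPT_SAGA = [
    ("mark_read", mark_read_on_device_a, unmark_read_on_device_a),
    ("sync_cloud", sync_to_cloud, revert_cloud),
    ("push_receipts", push_receipts, no_compensation),
]
```

With `{"cloud_down": True}` as the context, the second step fails, the first step's compensation resets the read flag on Device A and records that it was notified, and the caller sees `SagaFailed`.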

Using Replicated State Machines for Metadata Storage 

To keep metadata such as conversation state and preferences consistent, we employed Replicated State Machines based on the Raft consensus protocol. This design enabled us to:

  • Elect a leader to manage writes
  • Replicate the changes to all followers
  • Restore the state by replaying the log after a crash

This approach was especially useful for our persistent chat indexing service and for group membership management, where the view of the state had to be correct at all times.
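The core idea behind the three points above is that every replica applies the same ordered log to a deterministic state machine, so replaying the log reproduces the state. The sketch below illustrates only that idea. It is not Raft: leader election, terms, log consistency checks, and network failures are all omitted, and the `Replica` and `Leader` classes and the key names are assumptions for the example.

```python
class Replica:
    """A deterministic key-value state machine driven by an ordered log."""

    def __init__(self):
        self.log = []      # entries received, in order
        self.state = {}    # result of applying the committed entries
        self.applied = 0   # number of log entries applied so far

    def append(self, entry):
        self.log.append(entry)
        return True  # acknowledgement; always succeeds in this sketch

    def apply_up_to(self, commit_index):
        """Apply committed entries in log order."""
        while self.applied < commit_index:
            op, key, value = self.log[self.applied]
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
            self.applied += 1

    def recover(self, log, commit_index):
        """Rebuild state after a crash by replaying a copy of the log."""
        self.log, self.state, self.applied = list(log), {}, 0
        self.apply_up_to(commit_index)


class Leader(Replica):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers
        self.commit_index = 0

    def write(self, entry):
        """Append locally, replicate, and commit once a majority has the entry."""
        self.append(entry)
        acks = 1 + sum(1 for f in self.followers if f.append(entry))
        cluster_size = 1 + len(self.followers)
        if acks * 2 <= cluster_size:
            raise RuntimeError("no majority; entry stays uncommitted")
        self.commit_index = len(self.log)
        for replica in [self, *self.followers]:
            replica.apply_up_to(self.commit_index)
```

A replica that crashed and lost its in-memory state can call `recover(leader.log, leader.commit_index)` and end up with the same state as the leader, which is the property the group membership and chat indexing use cases depend on.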

Comparative Analysis of Patterns

I compared the most common state machine-based fault tolerance patterns to arrive at a solution that worked well for us.

| Aspect            | Replicated State Machine                  | Stateful Workflow                   | Saga Pattern                           |
|-------------------|-------------------------------------------|-------------------------------------|----------------------------------------|
| Primary Goal      | Strong consistency & availability         | Long-running orchestration          | Distributed transaction coordination   |
| Consistency Model | Strong (linearizable)                     | Eventually consistent (recoverable) | Eventually consistent                  |
| Failure Recovery  | Re-execution from logs                    | Resume from persisted state         | Trigger compensations                  |
| Tooling Examples  | Raft (etcd, Consul), Paxos                | Temporal, AWS Step Functions        | Temporal, Camunda, Netflix Conductor   |
| Ideal For         | Consensus, leader election, config stores | Multi-step business workflows       | Business processes with rollback needs |
| Complexity        | High (due to consensus)                   | Moderate                            | High (compensating logic needed)       |
| Execution Style   | Synchronous (log replication)             | Asynchronous, event-driven          | Asynchronous, loosely coupled          |


Results and Benefits

Implementing state machine patterns brought the following measurable improvements:

  • Message delivery retries fell by 60%.
  • Read receipt sync issues were cut by 45%.
  • Recovery time after service crashes dropped to under 200 ms.
  • Incident resolution time decreased thanks to better observability.

We also built internal tools, such as dashboards that visualize the workflow state of each message during on-call incidents.

Conclusion

In a messaging system, reliability is not an add-on; it's a must. Users expect their messages to be delivered, marked as read, and synchronized without noticeable delay. By using state machines to model essential workflows, we developed a fault-tolerant system that could recover gracefully from failures. The combination of Stateful Workflows, Sagas, and Replicated State Machines gave us the means to treat faults as first-class citizens in our architecture.

Although the implementation took considerable effort, the gains in robustness, clarity, and operational efficiency were significant. These patterns are now the foundation of how we build resilient services across the organization.

Opinions expressed by DZone contributors are their own.
