Master-Class: Understanding Database Replication (Single, Multi, and Leaderless)
This article walks through the different database replication strategies, provides practical insights, identifies common pitfalls, and explains how to overcome them.
Join the DZone community and get the full member experience.
Join For FreeReplication = same data, multiple nodes.
You do it for three reasons: to survive node failures, serve reads from nearby replicas, or spread load. Simple enough. The algorithm you pick, though, is where things get complicated. It decides how writes propagate, what breaks during failures, and how honest your consistency guarantees actually are.
Three main approaches exist. Here's how each one works, where it hurts, and what that pain actually looks like in production. (I have used single-leader and multi-leader strategies by myself.)
1. Single Leader
Used by MySQL, PostgreSQL, and MongoDB. Most teams start here, and most teams stay here.
One node takes all writes — that's the leader. Everyone else is a follower. Leader writes locally, ships a change log to followers, and followers apply it. Reads can hit the leader or any follower, which is why this model works so well for read-heavy apps.
The first real decision: sync or async.
Synchronous replication means the leader waits for at least one follower to confirm before ACK-ing the client. If the leader dies mid-flight, that follower still has the data. Sounds great until one follower gets slow — now every write blocks waiting on it. Not practical for most production systems.
Asynchronous flips this. Leader confirms immediately, replicates in the background. Write throughput goes up. But if the leader crashes before the follower syncs, those writes are gone — and the client already got a success response. That's the tradeoff you're accepting.
What most systems actually run: semi-synchronous. One follower is synchronous, the rest async. You get a durability guarantee without the full latency hit.
Follower failure is easy.
Follower goes down, comes back, checks where it left off in the replication log, and asks the leader for everything it missed. That's catch-up recovery. It works cleanly.
Leader failure is not.
Failover has three steps: detect the failure, elect a new leader, and reconfigure clients. Each one has hidden complexity.
Detection uses heartbeat timeouts — but a slow leader and a dead leader look identical. Too short a timeout and you trigger unnecessary failovers under load. Too long and you're sitting with write downtime while the system figures out what happened.
Election picks the follower with the most recent data. Fine in theory. In practice, if that follower wasn't fully caught up, writes the old leader had already ACK'd and would quietly drop. The client thinks their data is safe. It isn't.
Then there's split-brain. The old leader recovers, doesn't know it was replaced, and starts accepting writes again. Now two nodes are independently processing writes with no coordination. Most systems handle this through fencing, but it has to be implemented carefully — fencing bugs are the kind that hit you at 3 AM.
2. Multi-Leader
More than one node accepts writes. Common use case: you have data centers in multiple regions and don't want every write from Tokyo routing to Virginia.
With multi-leader, users write to their nearest data center. The sync between DCs happens in the background. Write latency from the client's perspective drops significantly.
The other use case worth knowing: offline-capable apps. A calendar or notes app where each device is its own local leader, syncing when it reconnects. CouchDB is built around exactly this.
The conflict problem.
Single leader sidesteps write conflicts entirely — only one node accepts writes. A multi-leader can't do that. Two users edit the same record in different data centers simultaneously, both writes get accepted, and now you have a conflict to resolve at sync time.
Ways to handle it:
- Conflict avoidance – route all writes for a specific record to the same leader. No conflict possible. This is the simplest solution and works well until a user changes location or a data center goes down.
- Last Write Wins – attach a timestamp, higher wins. Simple to implement, and it silently discards real concurrent writes because distributed clocks aren't reliable. Use with caution.
- CRDTs – data structures specifically designed so concurrent updates merge deterministically. A counter CRDT safely combines increments from multiple sources. Works cleanly for the data types it supports, which is a limited set.
- Application-level resolution – surface the conflict to your code and let it decide. Most control, most work. You're now writing conflict resolution logic alongside your business logic.
Multi-leader setups are powerful and genuinely hard to operate. The conflict resolution piece alone is enough to make most teams avoid it unless they really need multi-region writes.
3. Leaderless
No leader at all. Any node takes reads or writes. Popularized by Amazon's Dynamo paper, used by Cassandra and Riak.
How consistency works without a coordinator: quorums
Three numbers drive everything:
n— total replicasw— confirmations needed for a write to succeedr— nodes queried on a read
The rule is w + r > n.
That inequality guarantees the write set and read set overlap — at least one node in your read quorum saw the latest write. With n=3, w=2, r=2, every read hits at least one node that processed the most recent write.
You can tune these to trade off consistency against latency. w=1 makes writes fast but fragile. r=1 makes reads fast but potentially stale. Higher values tighten consistency and add latency. The numbers are simple; getting the right settings for your actual workload takes more thought.
Stale data doesn't fix itself.
Nodes go offline, miss writes, and come back out of date. Two things handle this:
- Read repair – client reads from multiple nodes, spots the stale version, pushes the newer value back. Works for data that gets read frequently. Data that rarely gets touched can stay stale for a long time.
- Anti-entropy – a background process that continuously diffs replicas and syncs them. Doesn't preserve write order, which makes consistency reasoning harder at the application layer.
Replication Lag
Every async system has it. The gap between "leader accepted the write" and "all replicas reflect it" is never zero — its size depends on your network, hardware, and current load.
Three guarantees you can implement on top of the replication layer to make lag less painful:
- Read-your-own-writes – after a user submits something, they always see their own update on the next read. Simple fix: route reads for a user's own data to the leader; reads for other users' data can hit replicas.
- Monotonic reads – a user reading the same data twice never sees it go backward. Happens when two reads hit replicas at different lag points. Fix: pin a session to the same replica.
- Consistent prefix reads – if write A causally precedes write B, any reader who sees B sees A too. Matters anywhere write order is meaningful — chat, audit logs, event feeds.
None of these is free. Each one adds routing logic or session state to your system.
| Single Leader | Multi-Leader | Leaderless | |
|---|---|---|---|
| Best fit | Read-heavy, want simplicity | Multi-region writes, offline sync | High write throughput, max availability |
| The catch | Failover is harder than it looks | Conflict resolution complexity | Eventual consistency is your problem now |
No universally correct answer. The right pick depends on where your users are, what consistency you actually need, and how much operational complexity your team can handle.
Opinions expressed by DZone contributors are their own.
Comments