The Dual Write Problem: What Looks Safe in Code but Breaks in Production
The dual write problem is one of the most common consistency issues in distributed systems. Four patterns address it.
A system that crashes is easier to fix than one that silently produces wrong results. The dual write problem is exactly that kind of bug.
It is surprisingly common and often misunderstood, even by teams that have encountered it in production. Understanding the dual write problem starts with seeing why the obvious solution fails, and ends with four patterns that address it correctly.
The Dual Write Problem
The dual write problem occurs when a service needs to write to two separate systems as part of a single logical operation. The most common example in modern microservices is writing to a database and publishing an event to a message broker like Kafka.
Consider this Spring service:
@Transactional
public void placeOrder(Order order) {
    orderRepository.save(order);
    kafkaTemplate.send("orders", order);
}
This looks safe. The @Transactional annotation is there, the order is saved, and the event is sent. But the transaction boundary is not what it appears to be.
The Transactional Risk
The @Transactional annotation wraps the database operation in a transaction. But Kafka is not part of that transaction. These are two entirely separate systems with no shared coordinator. The transaction commits or rolls back the database write. It has no knowledge of and no control over what Kafka does.
Here are the two failure scenarios that expose the problem:
Scenario 1: The event is sent, but the database rolls back.
@Transactional
public void placeOrder(Order order) {
    orderRepository.save(order);          // DB write succeeds (not yet committed)
    kafkaTemplate.send("orders", order);  // Kafka confirms the message
    // Let's say the JVM crashes here, before the DB transaction commits.
    // The DB rolls back, so the order does not exist.
    // The Kafka message is already out, and downstream consumers process the order.
}
Downstream consumers receive and process an event for an order that does not exist in the database. The system is now inconsistent, and neither side knows it.
Scenario 2: The database commits, but the event is lost.
@Transactional
public void placeOrder(Order order) {
    orderRepository.save(order);          // DB write succeeds
    kafkaTemplate.send("orders", order);
    // Let's say the Kafka broker crashes before the message is durably written.
    // The DB commits successfully.
    // The event is lost, and downstream consumers never process the order.
}
The order exists in the database, but no downstream service ever processes it. Inventory is never updated, the confirmation email is never sent, and the warehouse never picks up the item.
Both scenarios are real production failures. Both are silent. Neither produces an immediate error that would trigger an alert.
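Both scenarios can be replayed deterministically with in-memory stand-ins. The sketch below is only an illustration: `DualWriteDemo`, its two lists, and the `crashBeforeCommit` and `brokerLosesMessage` flags are hypothetical names, not Spring or Kafka APIs. A staged list plays the role of the uncommitted transaction.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-ins (hypothetical, not Spring or Kafka APIs) used to
// replay the two failure scenarios deterministically.
class DualWriteDemo {
    static final List<String> database = new ArrayList<>();
    static final List<String> broker = new ArrayList<>();

    // Mimics placeOrder: the DB write is staged, the event is sent, and the
    // staged write only becomes visible if the commit step is reached.
    static void placeOrder(String orderId, boolean crashBeforeCommit, boolean brokerLosesMessage) {
        List<String> staged = new ArrayList<>();
        staged.add(orderId);              // orderRepository.save(order)
        if (!brokerLosesMessage) {
            broker.add(orderId);          // kafkaTemplate.send(...)
        }
        if (crashBeforeCommit) {
            return;                       // staged write is discarded (rollback)
        }
        database.addAll(staged);          // commit
    }

    public static void main(String[] args) {
        placeOrder("order-1", true, false);   // Scenario 1: event out, DB rolled back
        placeOrder("order-2", false, true);   // Scenario 2: DB committed, event lost
        System.out.println("database = " + database); // database = [order-2]
        System.out.println("broker   = " + broker);   // broker   = [order-1]
    }
}
```

After both calls the two stores disagree about both orders, and no exception was raised along the way.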
XA Transactions and Their Shortcomings
Java supports distributed transactions through XA (two-phase commit) via JTA (Java Transaction API). In theory, XA coordinates a transaction across multiple resources, including databases and message brokers. In practice, it has three fundamental problems:
- Slow: Two-phase commit requires multiple round-trips between all participants before a transaction can complete, adding significant latency to every operation.
- Fragile: If the transaction coordinator crashes between the prepare and commit phases, resources are left in an uncertain state that requires manual intervention to resolve.
- Incompatible with Kafka: Kafka's transactional model is internal to Kafka itself and does not participate in a JTA-coordinated distributed transaction.
XA is not a viable solution for modern event-driven architectures.
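The fragility point is easier to see in a toy model of two-phase commit. This is a rough sketch, not the JTA API: `TwoPhaseCommitSketch`, `Participant`, and the `coordinatorCrashes` flag are made-up names, and real resource managers also hold locks while prepared.

```java
import java.util.Arrays;
import java.util.List;

class TwoPhaseCommitSketch {
    enum State { ACTIVE, PREPARED, COMMITTED, ROLLED_BACK }

    // Hypothetical resource (a database or a broker) taking part in the transaction
    static class Participant {
        State state = State.ACTIVE;
        boolean prepare() { state = State.PREPARED; return true; } // votes yes
        void commit() { state = State.COMMITTED; }
        void rollback() { state = State.ROLLED_BACK; }
    }

    // Runs both phases; coordinatorCrashes simulates a crash after all votes are in.
    static void run(List<Participant> participants, boolean coordinatorCrashes) {
        boolean allYes = true;
        for (Participant p : participants) {   // phase 1: one round-trip per participant
            allYes &= p.prepare();
        }
        if (coordinatorCrashes) {
            return;                            // nobody is told the outcome
        }
        for (Participant p : participants) {   // phase 2: another round-trip per participant
            if (allYes) p.commit(); else p.rollback();
        }
    }

    public static void main(String[] args) {
        List<Participant> participants = Arrays.asList(new Participant(), new Participant());
        run(participants, true);
        System.out.println(participants.get(0).state); // PREPARED
    }
}
```

A participant left in the prepared state can neither commit nor roll back on its own, which is the uncertain state described above.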
The Four Patterns That Address the Problem
1. The Transactional Outbox Pattern
Instead of writing directly to Kafka, write the event to an outbox table in the same database transaction as the primary data. A separate background publisher then reads the outbox table and publishes events to Kafka.
@Transactional
public void placeOrder(Order order) {
    orderRepository.save(order);
    outboxRepository.save(new OutboxEvent("orders", order));
    // Both writes are in the same DB transaction
}
A background publisher reads the outbox and sends it to Kafka:
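@Scheduled(fixedDelay = 100)
public void publishOutboxEvents() throws Exception {
    List<OutboxEvent> events = outboxRepository.findUnpublished();
    for (OutboxEvent event : events) {
        // send() is asynchronous; block until the broker acknowledges the
        // message so a failed send is never marked as published
        kafkaTemplate.send(event.getTopic(), event.getPayload()).get();
        outboxRepository.markPublished(event);
    }
}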
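This guarantees at-least-once delivery: if the publisher crashes after sending an event but before marking it as published, the event is sent again on the next run. Downstream consumers must therefore be idempotent to handle potential duplicates.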
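The source of those duplicates is the gap between sending an event and marking it as published. The following sketch reproduces it with in-memory stand-ins; `OutboxRelaySketch`, the `outbox` map, and the `crashBeforeMark` flag are hypothetical and only model the behavior of the publisher.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutboxRelaySketch {
    // eventId -> published flag; stands in for the outbox table
    static final Map<String, Boolean> outbox = new LinkedHashMap<>();
    // stands in for the Kafka topic
    static final List<String> topic = new ArrayList<>();

    // One run of the background publisher. crashBeforeMark simulates a crash
    // after the send succeeded but before the row was marked as published.
    static void relayOnce(boolean crashBeforeMark) {
        for (Map.Entry<String, Boolean> row : outbox.entrySet()) {
            if (row.getValue()) continue;   // already published, skip
            topic.add(row.getKey());        // send to the broker
            if (crashBeforeMark) return;    // crash: row stays unpublished
            row.setValue(true);             // markPublished
        }
    }

    public static void main(String[] args) {
        outbox.put("evt-1", false);
        relayOnce(true);    // sends evt-1, crashes before marking it
        relayOnce(false);   // after restart: sends evt-1 again, then marks it
        System.out.println(topic); // [evt-1, evt-1]
    }
}
```

The event is never lost, but it can arrive twice, which is exactly what at-least-once means.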
2. Change Data Capture (CDC)
Rather than writing to an outbox table manually, CDC monitors the database transaction log directly and forwards changes to Kafka automatically.
The database write is the only operation the application performs. CDC captures the change from the transaction log and publishes it to Kafka asynchronously. The application does not need to know about Kafka at all.
This is the cleanest separation of concerns but introduces an operational dependency on a CDC pipeline that must be maintained and monitored.
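Conceptually, a CDC connector tails the log from a saved position. The toy model below assumes a list as the transaction log and an integer as the checkpoint; `LogTailingSketch`, `commit`, and `poll` are hypothetical names. Real tools such as Debezium read the database's actual log and are configured rather than hand-written.

```java
import java.util.ArrayList;
import java.util.List;

class LogTailingSketch {
    // Committed changes in commit order; stands in for the transaction log
    static final List<String> transactionLog = new ArrayList<>();
    // Stands in for the Kafka topic
    static final List<String> topic = new ArrayList<>();
    // Log position the connector has already forwarded
    static int checkpoint = 0;

    // The application only writes to the database; a commit appends to the log.
    static void commit(String change) {
        transactionLog.add(change);
    }

    // The connector forwards everything after its checkpoint, then advances it.
    // Returns the number of changes forwarded in this poll.
    static int poll() {
        int forwarded = 0;
        while (checkpoint < transactionLog.size()) {
            topic.add(transactionLog.get(checkpoint));
            checkpoint++;
            forwarded++;
        }
        return forwarded;
    }

    public static void main(String[] args) {
        commit("order-1 inserted");
        commit("order-2 inserted");
        poll();
        System.out.println(topic); // [order-1 inserted, order-2 inserted]
    }
}
```

Because only committed changes reach the log, a rolled-back order can never produce an event.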
3. Event Sourcing
Event sourcing eliminates the dual write problem at its root by making the event the source of truth. Instead of writing the state to a database and publishing an event, only the event is stored. The current state of an entity is derived by replaying its event history.
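Deriving state by replay can be sketched in a few lines. The event types beyond an order being placed ("OrderShipped", "OrderCancelled") and the status strings are assumptions made for the example, as are the `ReplaySketch` and `Event` classes.

```java
import java.util.Arrays;
import java.util.List;

class ReplaySketch {
    // Minimal hypothetical event: a type name plus the order it belongs to
    static class Event {
        final String type;
        final String orderId;
        Event(String type, String orderId) {
            this.type = type;
            this.orderId = orderId;
        }
    }

    // Folds the event history of one order into its current status.
    static String currentStatus(List<Event> history, String orderId) {
        String status = "UNKNOWN";
        for (Event e : history) {
            if (!e.orderId.equals(orderId)) continue;
            switch (e.type) {
                case "OrderPlaced":    status = "PLACED"; break;
                case "OrderShipped":   status = "SHIPPED"; break;
                case "OrderCancelled": status = "CANCELLED"; break;
                default: break; // unrecognized event types are ignored
            }
        }
        return status;
    }

    public static void main(String[] args) {
        List<Event> history = Arrays.asList(
            new Event("OrderPlaced", "order-1"),
            new Event("OrderPlaced", "order-2"),
            new Event("OrderShipped", "order-1"));
        System.out.println(currentStatus(history, "order-1")); // SHIPPED
    }
}
```

No separate status column is ever written, so there is nothing for the events to fall out of sync with.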
A common mistake is to call eventBus.publish directly after appending to the event store:
public void placeOrder(Order order) {
    OrderPlacedEvent event = new OrderPlacedEvent(order);
    eventStore.append(event);   // succeeds
    eventBus.publish(event);    // fails, and the dual write problem is reintroduced
}
If eventBus.publish fails after the event store write succeeds, the dual write problem is reintroduced. The event is stored, but downstream consumers never receive it.
The correct approach is to never publish directly from the application. A separate background publisher reads from the event store and publishes to the message broker:
public void placeOrder(Order order) {
    OrderPlacedEvent event = new OrderPlacedEvent(order);
    eventStore.append(event);   // single write, single system
}

// Background publisher reads from the event store and publishes
@Scheduled(fixedDelay = 100)
public void publishEvents() {
    List<Event> unpublished = eventStore.findUnpublished();
    unpublished.forEach(event -> {
        eventBus.publish(event);
        eventStore.markPublished(event);
    });
}
There is only one write target in the application layer. The trade-off is a significant increase in architectural complexity and a learning curve for teams unfamiliar with event-sourced systems.
4. The Listen to Yourself Pattern
This pattern inverts the usual flow. Instead of writing to the database first and then publishing an event, the service publishes the event to Kafka first and listens to its own event to update its own state.
// Publish the event first and wait for the broker to acknowledge it
public void placeOrder(Order order) throws Exception {
    kafkaTemplate.send("orders", order).get();
}

// Listen to own event and update database state
@KafkaListener(topics = "orders")
public void onOrderPlaced(Order order) {
    orderRepository.save(order);
}
Once the event is confirmed by Kafka, it will not be lost even if the service crashes immediately after. When the service restarts, it will consume the event from Kafka and update the database. The trade-off is that reads immediately after a write may not reflect the latest state, and the listener must be idempotent to handle duplicate deliveries.
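One way to make such a listener idempotent is to key the write on the order's identifier, so a redelivered event becomes a no-op. This is a minimal sketch under that assumption; `IdempotentListenerSketch`, the in-memory `orders` map, and the `sideEffects` counter are hypothetical stand-ins for the real table and any follow-up work.

```java
import java.util.HashMap;
import java.util.Map;

class IdempotentListenerSketch {
    // orderId -> payload; stands in for the orders table with a unique key
    static final Map<String, String> orders = new HashMap<>();
    // Counts work that must happen once per order (for example, an email)
    static int sideEffects = 0;

    // Returns true if the event was applied, false if it was a duplicate.
    static boolean onOrderPlaced(String orderId, String payload) {
        if (orders.putIfAbsent(orderId, payload) != null) {
            return false;   // already stored: duplicate delivery, skip
        }
        sideEffects++;
        return true;
    }

    public static void main(String[] args) {
        onOrderPlaced("order-1", "2 x widget");
        onOrderPlaced("order-1", "2 x widget"); // redelivery after a restart
        System.out.println(orders.size()); // 1
    }
}
```

With a real database, the same effect usually comes from a unique constraint or an upsert on the order identifier.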
Choosing the Right Pattern
Each pattern solves the dual write problem but with different trade-offs. Choosing the right one depends on the complexity the team can sustain and the consistency guarantees the system requires.
| Pattern | Complexity | Delivery Guarantee | Best For |
|---|---|---|---|
| Transactional Outbox | Low | At least once | Most microservices |
| CDC | Medium | At least once | High-throughput systems |
| Event Sourcing | High | At least once | Compliance-driven systems |
| Listen to Yourself | Low | At least once | Simple event-driven flows |
For most use cases, the Transactional Outbox Pattern is the right starting point. It is straightforward to implement, works with any database that supports transactions, and requires no additional infrastructure beyond a background publisher.
Conclusion
The dual write problem is one of those issues that sits quietly in a codebase until a precisely timed failure exposes it. The @Transactional annotation provides a false sense of safety when a second external system is involved. Understanding why transactions do not cross system boundaries is fundamental knowledge for anyone building event-driven distributed systems. Knowing the patterns that address this is equally important.
The code that looks safe often is not. The fix is not complex. But it requires knowing the problem exists in the first place.