Combining Temporal and Kafka for Resilient Distributed Systems

Kafka handles durable event streaming while Temporal manages long-running workflow state, retries, and recovery to build resilient distributed systems.

Akhil Madineni

Jun. 09, 26 · Analysis

Likes (1)

Comment

Save

2.0K Views

Kafka and Temporal address different failure boundaries, and resilient distributed systems often need both rather than one as a substitute for the other. Kafka is built to move ordered, replayable event streams across many consumers and machines, while Temporal is built to keep long-running application logic alive as durable Workflow Executions that recover from crashes, outages, and worker restarts by replaying persisted Event History. The combination becomes compelling when Kafka is used to carry facts and Temporal is used to remember intent, timers, retries, and compensations across the lifetime of a business process.

Kafka as the Event Backbone and Temporal as the Control Plane

Kafka’s model is centered on totally ordered partitions, consumer groups, and offsets. A partition is consumed by exactly one consumer in a subscribing consumer group at a time, and Kafka keeps consumer state compact by treating progress as an offset that can be checkpointed, committed manually, or even rewound for reprocessing. That model is excellent for integration boundaries, stream processing, and decoupling producers from downstream services. What it does not provide by itself is durable orchestration for business logic that must wait for hours, react to multiple messages over time, and recover mid-process without rebuilding state externally. Temporal fills that gap by treating a Workflow Execution as a durable, reliable, scalable function that owns local state, receives messages through Signals or Updates, and advances by replaying persisted history instead of starting over from scratch after failure.

Keep Kafka at the Boundary of Workflow Replay

The most important design rule is simple: Kafka client calls do not belong inside Workflow code. Temporal requires deterministic workflow logic on replay, and its documentation explicitly places non-deterministic work, such as API calls and database queries, inside Activities.

A Workflow should behave like a compact state machine that decides what should happen next, while Activities perform the side effects that may fail or need retries. That separation is what allows Kafka to remain an external event fabric without corrupting Temporal replay semantics.

    Java
   
 

   private boolean paymentReceived;

private final OrderActivities activities =
    Workflow.newActivityStub(
        OrderActivities.class,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofSeconds(30))
            .setRetryOptions(
                RetryOptions.newBuilder()
                    .setInitialInterval(Duration.ofSeconds(1))
                    .setMaximumInterval(Duration.ofSeconds(30))
                    .build())
            .build());

@WorkflowMethod
public void process(String orderId) {
    activities.reserveInventory(orderId);
    boolean paid = Workflow.await(Duration.ofHours(2), () -> paymentReceived);
    if (!paid) {
        activities.releaseInventory(orderId);
        activities.publishTimedOut(orderId);
        return;
    }
    activities.publishConfirmed(orderId);
}

@SignalMethod
public void paymentCaptured(String paymentId) {
    paymentReceived = true;
}
  

This workflow is intentionally boring, which is precisely why it is robust. Inventory reservation and event publication are pushed into Activities, while the workflow itself only keeps state and waits. The two-hour wait is not a sleeping thread in application memory; Temporal persists timers so the execution resumes even after worker or service interruptions. Kafka, in this pattern, supplies the external payment event, but Temporal owns the long-lived timeout and the recovery semantics.

A thin Kafka bridge can then translate an incoming record into a Temporal message instead of embedding orchestration logic in the consumer loop. Signal-With-Start is especially useful because it either signals an existing workflow or starts a new one with the same Workflow ID and immediately applies the signal, which removes a large class of race conditions between creation and update.

    Java
   
 

   public void onMessage(ConsumerRecord<String, PaymentEvent> record) {
    WorkflowStub workflow =
        client.newUntypedWorkflowStub(
            "OrderWorkflow",
            WorkflowOptions.newBuilder()
                .setWorkflowId("order-" + record.key())
                .setTaskQueue("order-workflows")
                .build());

    workflow.signalWithStart(
        "paymentCaptured",
        new Object[] { record.value().paymentId() },
        new Object[] { record.key() });

    consumer.commitSync();
}
  

That handoff should be designed as duplicate-tolerant rather than duplicate-impossible. Kafka allows manual control over when a record is considered consumed, but a crash after Temporal accepts the signal and before the offset is committed can still trigger redelivery. A practical way to make that safe is to keep the Workflow ID stable for the business entity and to make Activities idempotent, because Temporal may retry Activity executions as part of normal failure handling.

Failure Semantics Matter More Than Labels

The most common architectural mistake in Kafka and Temporal systems is to over-claim exactly-once semantics. Kafka’s idempotent producer ensures that retries do not create duplicate writes in the stream, and Kafka transactions allow atomic writes across partitions and topics. Kafka Streams goes further by defining end-to-end exactly-once around a very specific boundary: input topic offsets, state stores, and output topics are committed atomically because they are all inside Kafka’s storage model.

Temporal, meanwhile, gives an effectively once-scheduled experience for Activities, but still expects Activity implementations to be idempotent because retries can occur after partial execution or worker failure. The combined system, therefore, does not become end-to-end exactly-once by default; that only happens when idempotency keys or transactional guarantees explicitly cover every external side effect that matters.

    Java
   
 

   public void publishConfirmed(String orderId) {
    producer.beginTransaction();
    try {
        producer.send(new ProducerRecord<>("order-confirmed", orderId, orderId)).get();
        producer.commitTransaction();
    } catch (Exception ex) {
        producer.abortTransaction();
        throw ex;
    }
}
  

This kind of publishing Activity is useful when Workflow progress must result in one or more Kafka records that either all appear or all fail together. The producer should be configured for idempotence, durable acknowledgments, and a transactional.id, but the design should still assume that non-Kafka side effects may need compensation.

Temporal’s error-handling guidance recommends rollback logic with the Saga pattern for multi-step processes, which maps naturally to workflows that can reserve inventory, attempt payment, publish status, and then compensate in reverse order if one boundary fails after another has already succeeded.

Long-Running Streams Need Long-Running Discipline

Once Kafka is feeding entity-centric workflows for days or weeks, operational details start to matter as much as API design. Reusing the same business key as the Kafka record key and the Temporal Workflow ID creates a clean ownership model: Kafka uses keys to select partitions, partitions remain totally ordered, and Temporal guarantees that only one Workflow Execution with a given ID is open at a time. That alignment naturally serializes updates for a customer, order, or account across both systems. At the same time, the Kafka side of the bridge should stay thin enough to keep polling regularly, because consumers that stop polling can be considered dead and rebalanced out of the group.

Temporal workflows that receive large numbers of Signals or perform many Activity calls also need history management. Event History is the mechanism that makes recovery possible, but it has performance limits and hard ceilings; Temporal warns as history grows and recommends Continue-As-New for long-running executions or workloads that process thousands of events. That becomes especially important in Kafka-driven entity workflows, where a single logical process can become a permanent mailbox unless it periodically rolls forward into a fresh run. Code evolution must also be handled deliberately because workflow logic is replayed; Temporal’s versioning guidance requires patching or worker versioning when changes would otherwise introduce non-determinism for in-flight executions.

Conclusion

Temporal and Kafka work best together when each is allowed to solve the problem it was built for. Kafka should distribute ordered, replayable events across the system boundary, and Temporal should hold the durable state machine that decides what those events mean over time. With that separation, retries stop leaking into application code, timers stop depending on process uptime, and compensations stop turning into chains of callbacks and ad hoc status flags.

The result is not merely a system that survives failures, but a system whose failure semantics remain understandable under load, redelivery, redeployments, and long-running business latency.

kafka systems microservices

Opinions expressed by DZone contributors are their own.

Related

Trending