DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Observability for the Invisible: Tracing Message Drops in Kafka Pipelines
  • Evolving Spring Boot APIs to an Event-Driven Mesh
  • End-to-End Event Streaming With Kafka, Spring Boot and AWS SQS/SNS (Production-Ready Code Guide)
  • From APIs to Event-Driven Systems: Modern Java Backend Design

Trending

  • No More Cheap Claude: 4 First Principles of Token Economics in 2026
  • Data Contracts as the "Circuit Breaker" for Model Reliability
  • How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
  • Optimizing High-Volume REST APIs Using Redis Caching and Spring Boot (With Load Testing Code)
  1. DZone
  2. Testing, Deployment, and Maintenance
  3. Monitoring and Observability
  4. Designing Retry-Resilient Fare Pipelines With Idempotent Event Handling

Designing Retry-Resilient Fare Pipelines With Idempotent Event Handling

Design retry-safe fare pipelines with idempotent event handling to ensure consistency across failures, retries, and duplications.

By 
Ravi Teja Thutari user avatar
Ravi Teja Thutari
DZone Core CORE ·
Jul. 23, 25 · Analysis
Likes (2)
Comment
Save
Tweet
Share
3.1K Views

Join the DZone community and get the full member experience.

Join For Free

In modern flight booking systems, streaming fare updates and reservations through distributed microservices is common. These pipelines must be retry-resilient, ensuring that transient failures or replays don’t cause duplicate bookings or stale pricing. A core strategy is idempotency: each event (e.g., a fare-update or booking command) carries a unique identifier so processing it more than once has no adverse effect. 

In practice, this means assigning a unique event ID or idempotency key to each fare request. For example, an airline booking API might attach a unique request token so that if a user’s retry comes through, the service recognizes and ignores it. This prevents duplicate bookings if, say, a payment call times out and is retried. Similarly, pricing events should include a unique price-update ID. Downstream consumers (cache updaters, booking services, analytics) record these IDs (in a database or distributed cache) and skip any event with an already-seen ID.

For example, consider a booking service that listens on a Kafka topic, fare-reservations. Its consumer method might check an EventStore before processing:

Java
 
@KafkaListener(topics="fare-reservations")
public void handleFareEvent(String payload) {
    FareEvent event = objectMapper.readValue(payload, FareEvent.class);
    UUID eventId = event.getEventId();    
    if (processedCache.contains(eventId) || database.hasProcessed(eventId)) {
        // Skip duplicate event
        return;
    }
    try {
        // Process the fare reservation (e.g., charge credit card, reserve seat)
        reservationService.reserveFare(event);
        // Mark as processed (persist ID and possibly result)
        processedCache.add(eventId);
        database.saveProcessed(eventId);
    } catch (TransientFailureException e) {
        // Rethrow to trigger Kafka retry (leave offset uncommitted)
        throw e;
    } catch (PermanentFailureException e) {
        // Send to DLQ since retry likely won't help
        deadLetterPublisher.publish(event, e);
    }
}


This snippet shows an idempotent consumer pattern: it looks up the event ID in a cache/DB and only processes if it has not been seen before. If processing succeeds, it records the ID. It distinguishes error types: transient errors (e.g., a temporary database timeout) are rethrown so Kafka will redeliver, while permanent errors (e.g., data validation failure) go immediately to a dead-letter queue (DLQ) for later inspection.

Key Best Practices in This Pipeline

Unique Event IDs / Idempotency Keys

Every fare update or booking request carries a unique ID. These IDs become idempotency keys checked by consumers to deduplicate processing. On the producer side, ensure IDs are truly unique (often using UUIDs or composite keys of orderID+timestamp). On the consumer side, persist seen IDs (in-memory caches or a transactional “processed_events” table) so the handler can quickly recognize duplicates. For example, one blog notes that idempotency is commonly implemented by storing processed message IDs and ignoring repeats.

Kafka Consumer Retry Safety

Kafka consumers should treat each message as potentially delivered multiple times. By default, if a consumer throws an exception before committing the offset, Kafka will redeliver the same message (at-least-once semantics). Treat the commit as a “transaction boundary”: do not commit until you have fully processed the message. If processing fails (transient exception), letting the exception propagate causes a retry. 

After retrying a configurable number of times, you should route the message to a DLQ to avoid endless loops. For example, Spring-Kafka supports an error handler that, after N failed delivery attempts, automatically sends the record to a DLQ topic. A snippet from Spring’s docs shows sending to DLQ after 10 retries:

Java
 if (deliveryAttempt > 9) {
    recoverer.accept(msg.getHeaders().get(KafkaHeaders.RAW_DATA, ConsumerRecord.class), ex);
    return "FAILED";
}
throw ex;


This ensures repeat failures don’t clog the pipeline. Throughout, designing consumers to be retry-safe means any action (db write, external call) can safely be repeated or rolled back. For example, database writes should use upserts or transactions that can be retried, and downstream HTTP calls should themselves be idempotent or guarded by idempotency keys.

Transient vs. Permanent Errors

Clearly distinguish errors that warrant a retry from those that should stop. Transient errors include network glitches, temporary unavailability of an external pricing service, or momentary database issues. These should trigger automatic retries with exponential backoff. For instance, if a fare price lookup times out, retrying a few times may succeed. 

By contrast, permanent errors — such as a message failing schema validation or a missing lookup key — mean the event is bad. These should not be retried indefinitely; instead, fail fast and send the event to a DLQ. In code, this often means catching known fatal exceptions and handling them specially (as in the snippet above). Spring-Kafka’s documentation emphasizes using a DeadLetterPublishingRecoverer in an error handler: after a set number of failed attempts, the recoverer publishes the raw record to a dead-letter topic for investigation. These DLQ messages can be monitored and eventually corrected or reprocessed manually.

Fallback and Reprocessing Strategies

Sometimes the pipeline can attempt fallback logic before giving up. For example, if a live pricing service is down, the consumer might use a cached price or a default rate. Alternatively, design a periodic reprocessing job: messages in DLQ or a retry topic can be replayed when issues are fixed. For example, if a fare update was skipped due to missing data, after fixing the source data, you can republish the event or replay it from the DLQ. Maintain an “outbox” or change-log (as in the Transactional Outbox pattern) so that events can be replayed safely. 

In one scenario, an airline reused a replayable event log to refresh caches after noticing stale prices – essentially re-broadcasting missed updates. It’s critical that any reprocessing also respects idempotency keys so that already-applied updates are not duplicated.

Dead-Letter Queue (DLQ) Routing and Observability

All components should emit clear logs and metrics for retries and duplicates. Route poison-pill messages to a DLQ topic with meaningful partition keys (e.g., based on airline or fare class) so teams can process them separately. Use application logs to record duplicate detection and DLQ moves; these logs can be aggregated. 

For observability, capture metrics on retry and DLQ counts. For instance, increment a counter each time a consumer skips a duplicate or sends a message to the DLQ. Tools like Datadog can ingest these logs and metrics: for example, Datadog’s Kafka integration can track broker throughput and consumer lag. Monitoring key metrics helps spot retry storms or unexpected message surges. Datadog also supports log analytics and alerts: you can create a log-based metric to count lines indicating duplicate-event skips or DLQ publishes. Alerts on rising consumer lag or sudden jumps in DLQ volume can quickly signal that the pipeline is struggling (e.g., a downstream service is failing, or duplicate replays are occurring). 

Datadog’s Data Streams Monitoring even maps flows across Kafka pipelines end-to-end, so a large backlog or flood of events is visible as a spike on the dashboard. In practice, airlines often correlate event IDs or user IDs in their logs to trace a booking request across services — this correlates well with observability tools and helps debug where an idempotency break might occur.

Implementing these practices yields a robust fare pipeline. For instance, one anonymized case involved a customer submitting a fare booking twice due to a frontend timeout. Because the booking service used idempotency keys, the second event was recognized as a duplicate and silently ignored, preventing a double-charge (this matches the common recommendation to use unique keys on booking APIs). 

In another incident, a legacy Kafka cluster’s short offset retention caused consumers to reset to the earliest offset and reprocess old fare-change events. That replay unintentionally updated some customers with old, stale prices. The team fixed it by extending retention to avoid unexpected replays and by marking events so duplicates were filtered. Later, they added Datadog alerts on consumer restarts and offset resets to catch such cases early.

Conclusion

Overall, a retry-resilient fare pipeline uses idempotent event consumers, careful error classification, and thorough observability. Unique event IDs and idempotency checks make retries safe. Kafka’s offset semantics and error handlers are leveraged to retry transient failures and route irrecoverable messages to DLQs. 

By distinguishing between transient and permanent failures, implementing fallback logic, and monitoring retries and duplicates, teams ensure that fare updates and bookings remain accurate and timely. This holistic approach — combining idempotency keys, safe consumer design, DLQ handling, and observability — makes distributed fare pipelines robust against failures and retries, preserving both system integrity and customer trust.

Event kafka Pipeline (software) Observability

Opinions expressed by DZone contributors are their own.

Related

  • Observability for the Invisible: Tracing Message Drops in Kafka Pipelines
  • Evolving Spring Boot APIs to an Event-Driven Mesh
  • End-to-End Event Streaming With Kafka, Spring Boot and AWS SQS/SNS (Production-Ready Code Guide)
  • From APIs to Event-Driven Systems: Modern Java Backend Design

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook