DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Deployment

In the SDLC, deployment is the final lever that must be pulled to make an application or system ready for use. Whether it's a bug fix or new release, the deployment phase is the culminating event to see how something works in production. This Zone covers resources on all developers’ deployment necessities, including configuration management, pull requests, version control, package managers, and more.

icon
Latest Premium Content
Trend Report
Developer Experience
Developer Experience
Refcard #233
Getting Started With Kubernetes
Getting Started With Kubernetes
Refcard #379
Getting Started With Serverless Application Architecture
Getting Started With Serverless Application Architecture

DZone's Featured Deployment Resources

When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps

When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps

By Kajol Shah
Teams often say they are building one app. A lot of the time, that is not true. I saw this while reviewing a telemedicine MVP. At first, the plan sounded simple enough: video visits, messaging, scheduling, and basic records. Then the version-one list kept growing: Patient appprovider dashboardAdmin panelMessagingVideoBillingEHR connectionDevice support later At that point, this was no longer one app. It was several systems being planned as one MVP. A patient-facing productA provider-facing productAn admin productA set of outside-service connections When a team treats all of that like one first release, things get messy before development even starts. The Moment It Stopped Being One App The problem was not the number of screens. The problem was the number of users, roles, and data rules hiding behind those screens. A patient needed intake, booking, reminders, and follow-up. A provider needed schedules, patient context, notes, and quick actions during the day. An admin needed visibility, support tools, and role controls. The outside-services side added video vendors, messaging vendors, EHR work, and, later, device data. That is not one product. That is a group of different systems with different jobs. Once that became obvious, the planning changed. Split the Product by User First Before estimating anything, it helps to split the product by who it is for. For this telemedicine project, the first useful split looked like this: 1. Patient Side This part handled: IntakeBookingRemindersFollow-up messagingJoining a visit The patient's side had to stay simple. It also had to be clear about what the patient could and could not see. 2. Provider Side This part handled: Schedule viewPatient detailsVisit notesQuick responsesRole-based access This was not just a different set of screens. It had different speed needs, different daily habits, and different data access rules. 3. Admin Side This part handled: Role setupSupport actionsVisibility into operationsReportingNon-clinical controls Admin work often looks small during planning. In real projects, it adds a lot of rules and a lot of testing. 4. Outside-Service Work This part handled: Video vendor setupMessaging vendor setupEHR-related workFuture device dataLogging and audit-related movement of data This is where many teams get surprised. Video, messaging, and EHR are not tiny add-ons. Each one brings its own work. Start With Access Rules Before the Feature List In multi-role products, one of the quickest ways to find hidden work is to define access rules early. Before locking the feature list, ask: Who can create this dataWho can read itWho can change itWho can delete itWho can export it For the telemedicine project, this made a big difference. A few features looked simple in the scope doc. Once the team asked who could view or change the related data, the work got much larger. A basic example: Admins can help fix booking problems. That sounds harmless. But then the real questions start: Can admins see messages?Can they see visit notes?Can they see call history?Can they open uploaded files? That one sentence can change a big part of the system. Access rules often show hidden work much faster than a feature list does. Treat Outside Services as Separate Work Another mistake teams make is treating outside services like small items on a checklist. On paper, it can look like this: VideoMessagingEHR later In practice, each one adds its own work: Vendor setupRequest and response formatsError handlingRetry rulesLoggingReplacement cost if the vendor needs to change later That is why these items should be planned separately. For the telemedicine case, once video, messaging, and EHR work were split out from the main product list, the first release became easier to define. Some items that seemed close to launch were clearly not ready for version one. Ship One Complete Path First Once the team stopped calling everything an MVP, the first release got smaller. The version-one path that stayed in looked like this: Patient intakeAppointment bookingSecure video through the chosen vendorFollow-up messagingBasic provider access controls That was enough to test whether the product solved a real problem for a clinic. What moved out of the first release: Deeper EHR workMore reportingDetailed billing flowsDevice supportBroader admin tooling Those things were not bad ideas. They just did not belong in the first build. 4 Simple Documents to Create Before Sprint Planning When a team starts to suspect that one MVP is several systems, four short documents can help a lot. 1. User-to-System Map List each part of the product and the main user for it. 2. Permission Matrix Write down who can create, view, change, delete, and export each type of data. 3. Outside-Service List Separate core product work from vendor work and data that moves in or out of the system. 4. First-Release Path Write the one end-to-end path that version one has to get right. These are short documents, but they make planning much better. Why This Matters Outside Healthcare, Too This lesson is not only for telemedicine. It applies to any multi-role product where the team is building for more than one type of user. That includes: Customer apps with admin panelsSaaS products with back-office toolsPlatforms with provider and client sidesProducts that depend on outside vendors from day one The moment a team has different users with different goals, the work stops being “just one app.” Final Point A lot of MVPs get too big because teams keep calling them one product long after that stops being true. The fix is not always better estimates. Sometimes the fix is much simpler: Split the product by user.Write down the access rules.Separate outside-service work.Ship one complete path first. That makes the first release easier to plan, easier to build, and easier to test. More
Event-Driven Pipelines With Apache Pulsar and Go

Event-Driven Pipelines With Apache Pulsar and Go

By Shivi Kashyap
A Practical Walkthrough Most distributed systems eventually hit a wall with their messaging layer, whether it's Kafka's tight coupling between compute and storage, RabbitMQ's limited replay capabilities, or the operational overhead of managing multiple tools for queuing and streaming. Apache Pulsar was engineered to address these gaps from the ground up. In this article, we'll dissect a working Go-based demo that wires together a Pulsar producer, consumer, and Prometheus monitoring layer into a cohesive, observable messaging pipeline. The full source is on GitHub. Why Pulsar Deserves a Closer Look Pulsar's architecture makes a deliberate trade-off that most messaging systems avoid: it physically separates the broker tier (which handles routing, subscriptions, and protocol) from the storage tier (Apache BookKeeper, which handles persistence). This isn't just an implementation detail. It means you can independently autoscale message routing capacity without touching your storage cluster, and vice versa. Beyond the architecture, a few capabilities stand out for engineering teams: Multi-tenancy at the protocol level – tenants, namespaces, and topics form a three-level hierarchy, making it practical to run a single Pulsar cluster for multiple teams or services without namespace collisions.Four distinct subscription semantics – Exclusive, Shared, Failover, and Key_Shared give you precise control over how messages are distributed across consumer instances, something Kafka's consumer group model doesn't natively offer.Cursor-based message retention – Pulsar retains messages based on subscription cursors, not time-based log compaction. A consumer that falls behind doesn't lose messages; it simply catches up from its last acknowledged position.Native schema enforcement – The built-in schema registry validates message payloads at the broker level before they reach consumers, catching contract violations at the boundary rather than deep inside application logic. What the Demo Builds The project is structured as three independent Go binaries, each with a single responsibility: reStructuredText ├── producer/ │ ├── main.go # HTTP server → Pulsar publisher │ ├── go.mod │ └── go.sum ├── consumer/ │ └── main.go # Pulsar subscriber → message processor ├── monitor/ │ └── main.go # Pulsar producer + Prometheus metrics server ├── prometheus.yml # Scrape configuration └── README.md All three use `github.com/apache/pulsar-client-go/pulsar` — the official, Apache-maintained Go client. The client is not a thin wrapper; it implements the full Pulsar binary protocol, connection pooling, producer batching, and automatic Prometheus metric registration. Component 1: The Producer The producer exposes a single HTTP endpoint. An incoming HTTP request triggers a Pulsar publish operation, decoupling the caller from any direct knowledge of the messaging infrastructure. Go package main import ( "context" "fmt" "log" "net/http" "github.com/apache/pulsar-client-go/pulsar" ) var client pulsar.Client func main() { var err error client, err = pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6650", }) if err != nil { log.Fatal(err) } defer client.Close() http.HandleFunc("/publish", func(w http.ResponseWriter, r *http.Request) { msg := r.URL.Query().Get("msg") err := publishMessage(msg) if err != nil { w.Write([]byte("msg failed to published")) } else { w.Write([]byte("msg successfully published")) } }) if err := http.ListenAndServe(":8080", nil); err != nil { log.Fatal(err) } } func publishMessage(msg string) error { producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "my-topic", }) if err != nil { log.Fatal(err) } _, err = producer.Send(context.Background(), &pulsar.ProducerMessage{ Payload: []byte("Hello"), }) return err } A few architectural observations worth unpacking: The HTTP-to-Pulsar bridge pattern is deliberately pragmatic. Rather than requiring every upstream service to embed a Pulsar client, you expose a thin HTTP adapter. This is particularly valuable when integrating with systems that speak HTTP natively — CI/CD pipelines, third-party webhooks, or legacy services that can't easily adopt a new client library. `pulsar.NewClient` establishes a connection pool, not a single TCP connection. The client maintains persistent connections to the broker and handles reconnection, load balancing across broker nodes, and TLS negotiation transparently. Calling `client.Close()` via `defer` ensures all in-flight messages are flushed before the process exits. `producer.Send` with `context.Background()` submits the message to the producer's internal send queue. The Pulsar client batches outgoing messages by default (configurable via `BatchingMaxMessages` and `BatchingMaxPublishDelay`), which significantly improves throughput under load without any changes to application code. For production use, the producer instance should be created once at startup and reused across requests. Creating a new producer per request incurs connection overhead and bypasses the batching optimization entirely. Component 2: The Consumer The consumer subscribes to a topic and processes messages with explicit acknowledgment. The subscription type chosen here — `pulsar.Shared` — has meaningful implications for how the system scales. Go consumer, err := client.Subscribe(pulsar.ConsumerOptions{ Topic: "topic-1", SubscriptionName: "my-sub", Type: pulsar.Shared, }) if err != nil { log.Fatal(err) } defer consumer.Close() for i := 0; i < 10; i++ { msg, err := consumer.Receive(context.Background()) if err != nil { log.Fatal(err) } fmt.Printf("Received message with Id: %#v -- content: '%s'\n", msg.ID(), string(msg.Payload())) consumer.Ack(msg) } if err := consumer.Unsubscribe(); err != nil { log.Fatal(err) } Subscription Semantics in Depth Pulsar's subscription model is one of its most differentiating features. Here's how the four types behave at the broker level: Exclusive – The broker enforces that only one consumer holds the subscription at any time. A second consumer attempting to subscribe with the same name will receive an error. This guarantees strict message ordering but eliminates horizontal scaling.Shared – The broker distributes messages across all active consumers in round-robin order. Any number of consumers can join or leave the subscription dynamically. This is the right choice for stateless workloads where processing order doesn't matter and throughput is the priority.Failover – The broker designates one consumer as the active receiver. Others remain connected but idle, ready to take over if the active consumer disconnects. This preserves ordering while providing high availability — a pattern common in financial transaction processing.Key_Shared – The broker routes messages with the same key consistently to the same consumer instance. This enables stateful processing (e.g., per-user session aggregation) without external coordination, as long as the consumer count remains stable. Why Explicit Acknowledgment Matters `consumer.Ack(msg)` signals to the broker that the message has been durably processed and can be removed from the subscription's cursor. If the consumer process crashes between `Receive` and `Ack`, the broker will redeliver the message to another consumer in the subscription. This is the mechanism behind at-least-once delivery. For workloads that require exactly-once semantics, Pulsar supports transactional acknowledgment, where the `Ack` and any downstream writes are committed atomically. That's a more advanced topic, but the foundation is the same `Ack` call shown here. Component 3: The Monitor The monitor is architecturally the most interesting component. It runs two HTTP servers concurrently — one for the application endpoint, one for Prometheus metrics — and uses a Pulsar producer to generate observable traffic. Go package main import ( "context" "fmt" "log" "net/http" "strconv" "github.com/apache/pulsar-client-go/pulsar" ) func main() { client, err := pulsar.NewClient(pulsar.ClientOptions{ URL: "pulsar://localhost:6605", }) if err != nil { log.Fatal(err) } defer client.Close() prometheusPort := 2112 go func() { if err := http.ListenAndServe(":"+strconv.Itoa(prometheusPort), nil); err != nil { log.Fatal(err) } }() producer, err := client.CreateProducer(pulsar.ProducerOptions{ Topic: "topic-1", }) if err != nil { log.Fatal(err) } defer producer.Close() ctx := context.Background() webPort := 8082 http.HandleFunc("/produce", func(w http.ResponseWriter, r *http.Request) { msgId, err := producer.Send(ctx, &pulsar.ProducerMessage{ Payload: []byte(fmt.Sprintf("hello-world")), }) if err != nil { log.Fatal(err) } else { log.Printf("Published message: %v", msgId) fmt.Fprintf(w, "Message Published: %v", msgId) } }) if err := http.ListenAndServe(":"+strconv.Itoa(webPort), nil); err != nil { log.Fatal(err) } } How the Pulsar Client Registers Prometheus Metrics When `pulsar.NewClient` is called, the Go client automatically registers a set of Prometheus collectors with the default `prometheus.DefaultRegisterer`. No additional instrumentation code is required. The metrics are served at `/metrics` on whatever port you bind `http.DefaultServeMux` to port — in this case, port `2112`. The metrics exposed include: `pulsar_client_producers_opened` / `pulsar_client_producers_closed` – producer lifecycle counters`pulsar_client_consumers_opened` / `pulsar_client_consumers_closed` – consumer lifecycle counters`pulsar_client_messages_published_total` – cumulative publish count per topic`pulsar_client_publish_latency_seconds` – histogram of end-to-end publish latency`pulsar_client_bytes_published_total` – total bytes written to the broker Running the Prometheus metrics server in a goroutine while the main goroutine handles the application HTTP server is idiomatic Go concurrency. Both servers share the same `http.DefaultServeMux`, which is why the Prometheus `/metrics` handler (registered automatically by the client library) is accessible on the metrics port without any explicit route registration. Prometheus Scrape Configuration YAML scrape_configs: - job_name: pulsar-client-go-metrics scrape_interval: 10s static_configs: - targets: - localhost:2112 The `scrape_interval: 10s` is a reasonable starting point for development. In production, you would typically align this with your alerting resolution requirements — a 30-second interval is common for dashboards, while 10 seconds or less is appropriate for latency-sensitive alerting rules. With these metrics flowing into Prometheus, you can build Grafana panels that surface producer throughput, consumer lag, and publish latency percentiles with three signals that matter most when diagnosing messaging pipeline issues. Running the Full Pipeline Prerequisites Apache Pulsar standalone Go 1.18+Prometheus (optional) Startup Sequence 1. Launch Pulsar in standalone mode: Shell bin/pulsar standalone 2. Start the producer service: Shell cd producer && go run main.go 3. Start the consumer service: Shell cd consumer && go run main.go 4. Start the monitor service: Shell cd monitor && go run main.go 5. Start Prometheus: Shell prometheus --config.file=prometheus.yml 6. Publish a message via the producer endpoint: Shell curl "http://localhost:8080/publish?msg=test_pulsar_message_publish_event" # msg successfully published 7. Trigger the monitor producer: Shell curl http://localhost:8082/produce # Message Published: (messageId) 8. Inspect raw Prometheus metrics: Shell curl http://localhost:2112/metrics | grep pulsar Engineering Takeaways Decouple publish triggers from client library dependencies. The HTTP-to-Pulsar adapter pattern used in the producer is not just a demo convenience. It is a legitimate architectural boundary. Services that need to emit events don't need to know anything about Pulsar's protocol, topic naming, or client configuration. They make an HTTP call; the adapter handles the rest.Match subscription type to processing semantics, not just throughput. A common mistake is defaulting to `Shared` for everything because it scales horizontally. If your processing logic is stateful, for example, aggregating events per user session. `Key_Shared` gives you the same horizontal scalability while preserving per-key ordering without any application-level coordination.Treat the acknowledgment boundary as your consistency boundary. Everything between `Receive` and `Ack` is your processing window. Any side effects (database writes, downstream API calls, cache updates) that happen in this window must be idempotent, because Pulsar will redeliver the message if the consumer fails before acknowledging. Design your processing logic around this constraint from the start, not as an afterthought.Zero-cost observability is a genuine advantage. The fact that `pulsar-client-go` registers Prometheus metrics automatically means you get throughput, latency, and connection health data from the moment your application starts, without writing a single line of instrumentation code. This is a meaningful operational advantage over client libraries that require manual metric registration. Extending the Demo The current implementation is intentionally minimal. Here are technically meaningful extensions worth exploring: Schema enforcement – Replace raw `[]byte` payloads with Pulsar's schema-aware producer/consumer API. Using `pulsar.NewAvroSchema` or `pulsar.NewJSONSchema` moves payload validation to the broker, preventing malformed messages from ever reaching consumers.Dead-letter topic routing – Configure `DeadLetterPolicy` on the consumer to automatically route messages that exceed a maximum redelivery count to a separate topic. This prevents poison-pill messages from blocking the subscription indefinitely.Producer batching tuning – Set `BatchingMaxMessages`, `BatchingMaxSize`, and `BatchingMaxPublishDelay` on `ProducerOptions` to optimize the throughput/latency trade-off for your specific workload profile.Graceful shutdown – Add `os/signal` handling to flush in-flight messages and close the producer cleanly before the process exits. The current `defer client.Close()` handles the happy path but won't fire on `SIGKILL`.Kubernetes-native deployment – Package each component as a container and deploy using the official Pulsar Helm chart. The producer and monitor can be exposed as Kubernetes Services; the consumer can run as a Deployment with HPA scaling based on the `pulsar_client_messages_published_total` metric exported to a custom metrics adapter. Conclusion The `apache-pulsar` project demonstrates that building a production-grade messaging pipeline with Apache Pulsar and Go doesn't require much code. It requires understanding the right abstractions. The producer-consumer-monitor triad covers the three concerns that matter in any event-driven system: getting data in, getting data out, and knowing what's happening in between. Pulsar's architecture, decoupled storage, flexible subscription semantics, and built-in observability make it a strong candidate for teams that have outgrown simpler messaging systems and need more precise control over delivery guarantees, scaling behavior, and operational visibility. Source Code https://github.com/shivik/apache-pulsar-demo More
Zero-Downtime Deployments for Java Apps on Kubernetes
Zero-Downtime Deployments for Java Apps on Kubernetes
By Ramya vani Rayala
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
By Wallace Espindola
Pragmatica Aether: Let Java Be Java
Pragmatica Aether: Let Java Be Java
By Sergiy Yevtushenko
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions

When AWS announced Lambda Durable Functions at re: Invent 2025, my first reaction was, "Okay, but how is this different from Step Functions?" I have been building serverless workflows on AWS for a while now, and Step Functions has always been my go-to service for orchestrating multi-step pipelines. So naturally, I wanted to put this new capability to the test. I decided to build a simple document processing workflow, an ETL pipeline with human-in-the-loop approval using both Durable Functions and Step Functions, then run 1,000 actual document processing workflows through each system. What I found surprised me. Not just the cost difference (79% cheaper with Durable Functions), but the trade-offs that nobody is really talking about yet. In this tutorial, I will walk you through building a zero-cost approval workflow using Lambda Durable Functions with Python. Along the way, I will share the actual cost numbers and the lessons that would have saved me a few hours of debugging. The Problem: Approval Workflows Are Expensive If you have ever built a document processing system that requires human approval, you know the pain. Someone uploads a file, your system processes it, and then... it sits there. Waiting for a human to review and approve it. That wait can be 5 minutes, 20 minutes, or even hours. Traditional approaches to handling this waiting are: Polling: Your code keeps checking every 30 seconds — "Is it approved yet? How about now?" making those calls the entire time.Always-on server: An EC2 instance or ECS container sits idle, costing you money 24/7, just to catch that one approval event.External state management: You build a custom solution with DynamoDB, SQS, and Lambda triggers — works fine, but it requires you to maintain a state machine you built yourself. What if your workflow could just... pause? No compute charges. No polling. Just pause, wait for the human to do their thing, and resume exactly where it left off. That is exactly what Lambda Durable Functions enables with the wait_for_callback pattern. What We Are Building Here is the workflow we will implement: Extract data → Transform data → Load data → Wait for approval (≈20 min) → Finalize & archive A CSV file gets uploaded to an S3 bucket under the uploads/ prefix. Our durable function picks it up, runs it through three ETL steps (extract, transform, load), then pauses execution and waits for a human to approve the processed data through a shared approval API. Once approved, the function resumes, finalizes the job, and archives the file. The key part? During that 20-minute (or 2-hour, or 2-day) approval wait, you pay absolutely nothing for compute. Architecture Overview The project uses three separate SAM stacks: Markdown shared-resources/ # Approval API, DynamoDB, SNS (shared by both systems) durable-functions/ # Lambda Durable Functions ETL pipeline step-functions/ # Step Functions ETL pipeline (for comparison) The shared approval handler serves for both workflow types using a single API. When a job comes in for approval, it checks the workflowType field, and if it is durable-functions, it calls send_durable_execution_callback_success.If step-functions, it calls send_task_success. Same API endpoint, different callback mechanisms under the hood. Prerequisites Before we begin, make sure you have the following: AWS SAM CLI (latest version recommended)Python 3.14 runtime AWS account with Lambda, DynamoDB, S3, SNS, and API Gateway accessDocker for local Lambda testing Check your SAM CLI version: Markdown sam --version Step 1: Deploy Shared Resources First Before the ETL pipeline, we need the shared infrastructure — the approval API, DynamoDB table for pending approvals, and SNS topic for notifications. Here is the shared-resources/ SAM template: YAML # shared-resources/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Shared resources for ETL approval workflow Parameters: ApproverEmail: Type: String Description: Email address to receive approval notifications Default: [email protected] Resources: PendingApprovalsTable: Type: AWS::DynamoDB::Table Properties: TableName: etl-pending-approvals BillingMode: PAY_PER_REQUEST AttributeDefinitions: - AttributeName: jobId AttributeType: S KeySchema: - AttributeName: jobId KeyType: HASH TimeToLiveSpecification: AttributeName: ttl Enabled: true ApprovalNotificationTopic: Type: AWS::SNS::Topic Properties: TopicName: etl-approval-notifications Subscription: - Endpoint: !Ref ApproverEmail Protocol: email ApprovalApi: Type: AWS::Serverless::Api Properties: Name: ETL-Approval-API StageName: prod ApprovalHandlerFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETL-Approval-Handler CodeUri: ./src Handler: approval_handler.handler Runtime: python3.14 MemorySize: 256 Timeout: 30 Environment: Variables: APPROVALS_TABLE: !Ref PendingApprovalsTable Policies: - DynamoDBCrudPolicy: TableName: !Ref PendingApprovalsTable - Version: '2012-10-17' Statement: - Effect: Allow Action: - states:SendTaskSuccess - states:SendTaskFailure - lambda:SendDurableExecutionCallbackSuccess - lambda:SendDurableExecutionCallbackFailure Resource: '*' Events: ApproveJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /approve/{jobId} Method: POST RejectJob: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /reject/{jobId} Method: POST GetJobStatus: Type: Api Properties: RestApiId: !Ref ApprovalApi Path: /status/{jobId} Method: GET Notice the approval handler has permissions for both states:SendTaskSuccess (Step Functions) and lambda:SendDurableExecutionCallbackSuccess (Durable Functions). This is the shared handler approach, one API that works with both workflow types. Deploy it: Markdown cd shared-resources sam build sam deploy --guided Step 2: The Durable Functions SAM Template Now the ETL pipeline itself for the Duration Functions. The key addition is the DurableConfig property. The DurableConfig property tells Lambda to enable durable execution for your function. YAML # durable-functions/template.yaml AWSTemplateFormatVersion: '2010-09-09' Transform: AWS::Serverless-2016-10-31 Description: Lambda Durable Functions ETL Pipeline Globals: Function: Runtime: python3.14 Architectures: - arm64 MemorySize: 512 Timeout: 900 Resources: ETLOrchestratorFunction: Type: AWS::Serverless::Function Properties: FunctionName: ETLDurableOrchestrator CodeUri: ./src Handler: handlers/etl_handler.lambda_handler MemorySize: 1024 Timeout: 900 DurableConfig: ExecutionTimeout: 86400 # 24 hours for human approval RetentionPeriodInDays: 14 # Keep execution history for debugging AutoPublishAlias: live Policies: - AWSLambdaBasicExecutionRole - S3CrudPolicy: BucketName: !Sub "${RawBucketName}-${AWS::AccountId}" - S3CrudPolicy: BucketName: !Sub "${ProcessedBucketName}-${AWS::AccountId}" - DynamoDBCrudPolicy: TableName: !Ref ETLMetadataTable - DynamoDBCrudPolicy: TableName: etl-pending-approvals - SNSPublishMessagePolicy: TopicName: etl-approval-notifications Events: S3Upload: Type: S3 Properties: Bucket: !Ref RawDataBucket Events: s3:ObjectCreated:* Filter: S3Key: Rules: - Name: prefix Value: uploads/ - Name: suffix Value: .csv Environment: Variables: PROCESSED_BUCKET: !Sub "${ProcessedBucketName}-${AWS::AccountId}" METADATA_TABLE: !Ref ETLMetadataTable APPROVALS_TABLE: etl-pending-approvals APPROVAL_TOPIC_ARN: !ImportValue ETL-ApprovalTopicArn APPROVAL_API_URL: !ImportValue ETL-ApprovalApiUrl A few things to notice here: MemorySize: 1024 on the orchestrator (overrides the 512 MB global default). Since this single function does all the work, it needs more memory.ExecutionTimeout: 86400 – This is the total workflow duration across all invocations (24 hours). The standard Timeout: 900 is the per-invocation limit (15 minutes). Each checkpoint/resume is a fresh invocation.AutoPublishAlias: live – AWS recommends using Lambda versions with durable functions. If you update code while an execution is suspended, replay will use the version that started the execution.S3 filter with prefix: uploads/ and suffix: .csv – Only CSV files under the uploads/ directory trigger the workflow.The stack imports shared resources via !ImportValue the approval table, SNS topic, and API URL from the shared stack. Step 3: Writing the Durable Function This is where it gets interesting. The entire ETL pipeline, including the approval wait, lives in a single Lambda function. No state machine definition. No JSON DSL. Just Python code. First, the individual ETL steps. Each one is a regular Python function in a separate file: Extract Python import csv import io import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def extract_data(source_bucket, source_key, step_context=None): logger.info(f"Extracting from s3://{source_bucket}/{source_key}") response = s3_client.get_object(Bucket=source_bucket, Key=source_key) content = response["Body"].read().decode("utf-8") reader = csv.DictReader(io.StringIO(content)) records = list(reader) schema = { "columns": reader.fieldnames, "source_file": source_key, "file_size_bytes": response["ContentLength"] } logger.info(f"Extracted {len(records)} records with {len(schema['columns'])} columns") return {"data": records, "record_count": len(records), "schema": schema} Transform Python import logging from datetime import datetime logger = logging.getLogger() def transform_data(raw_data, schema_config, step_context=None): logger.info(f"Transforming {len(raw_data)} records") valid_records, rejected_records = [], [] for i, record in enumerate(raw_data): try: cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()} if not cleaned.get("id") or not cleaned.get("name"): rejected_records.append({"index": i, "reason": "Missing required field"}) continue if "date" in cleaned: cleaned["date"] = normalize_date(cleaned["date"]) cleaned["_processed_at"] = datetime.utcnow().isoformat() for key in ["amount", "quantity", "price"]: if key in cleaned and cleaned[key]: try: cleaned[key] = float(cleaned[key]) except ValueError: cleaned[key] = None valid_records.append(cleaned) except Exception as e: rejected_records.append({"index": i, "reason": str(e)}) return { "data": valid_records, "valid_records": len(valid_records), "rejected_records": len(rejected_records), "rejection_details": rejected_records[:100] } def normalize_date(date_str): for fmt in ["%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y", "%Y/%m/%d"]: try: return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d") except ValueError: continue return date_str Load Python import json import boto3 import logging logger = logging.getLogger() s3_client = boto3.client("s3") def load_data(transformed_data, target_bucket, target_key, step_context=None): logger.info(f"Loading {len(transformed_data)} records to s3://{target_bucket}/{target_key}") output_lines = "\n".join(json.dumps(r) for r in transformed_data) s3_client.put_object( Bucket=target_bucket, Key=target_key, Body=output_lines.encode("utf-8"), ContentType="application/jsonl", Metadata={"record_count": str(len(transformed_data))} ) summary = { "record_count": len(transformed_data), "columns": list(transformed_data[0].keys()) if transformed_data else [], "sample_records": transformed_data[:3] } return {"target_path": f"s3://{target_bucket}/{target_key}", "record_count": len(transformed_data), "summary": summary} Notice the steps are plain Python functions — no special decorator, no SDK import. They take step_context=None as an optional last parameter, which keeps them testable outside the durable execution context. Now the main ETL orchestrator that ties it all together: Python import json import os import logging from datetime import datetime from aws_durable_execution_sdk_python import durable_execution, DurableContext from steps.extract import extract_data from steps.transform import transform_data from steps.load import load_data from steps.finalize import finalize_job logger = logging.getLogger() logger.setLevel(logging.INFO) PROCESSED_BUCKET = os.environ.get("PROCESSED_BUCKET") METADATA_TABLE = os.environ.get("METADATA_TABLE") @durable_execution def lambda_handler(event, context: DurableContext): # Handle both S3 event format and direct invocation if "Records" in event: s3_event = event["Records"][0]["s3"] source_bucket = s3_event["bucket"]["name"] source_key = s3_event["object"]["key"] else: source_bucket = event.get("bucket") source_key = event.get("key") # Generate job_id deterministically using context.step() job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) context.logger.info(f"Starting ETL job: {job_id}") # Step 1: Extract extracted = context.step( lambda _: extract_data(source_bucket, source_key, None), name="extract-data" ) context.logger.info(f"Extracted {extracted['record_count']} records") # Step 2: Transform transformed = context.step( lambda _: transform_data(extracted["data"], extracted.get("schema", {}), None), name="transform-data" ) # Step 3: Load loaded = context.step( lambda _: load_data(transformed["data"], PROCESSED_BUCKET, f"processed/{job_id}/output.jsonl", None), name="load-data" ) # --- EXECUTION PAUSES HERE --- # The submitter function stores the callback_id in DynamoDB # and sends an SNS notification to the reviewer. # No compute charges while waiting for approval. def submit_for_approval(callback_id: str, ctx): return notify_reviewer(job_id, callback_id, loaded["summary"]) approval = context.wait_for_callback( submitter=submit_for_approval, name="quality-check-approval" ) # Parse approval result if isinstance(approval, str): approval = json.loads(approval) if not approval or not approval.get("approved"): return {"status": "REJECTED", "job_id": job_id, "reason": approval.get("reason", "No reason")} # Step 4: Finalize (only runs after approval) final = context.step( lambda _: finalize_job(job_id, source_bucket, source_key, loaded, approval, METADATA_TABLE, None), name="finalize-job" ) return { "status": "COMPLETED", "job_id": job_id, "records_processed": transformed["valid_records"], "output_path": loaded["target_path"], "approved_by": approval.get("reviewer"), "completed_at": final["completed_at"] } Let me break down the important parts: @durable_execution – This decorator (imported from aws_durable_execution_sdk_python) enables the checkpoint/replay mechanism on the handler.context.step(lambda _: ..., name="...") – Each step call creates a checkpoint. On replay, completed steps return their cached results instantly instead of re-executing.context.wait_for_callback(submitter=..., name="...") – This is the zero-cost waiting magic. The submitter function receives a callback_id which gets stored in DynamoDB. Execution then pauses completely — Lambda saves the state, shuts down, and you stop paying.Determinism matters – Notice job_id is generated inside a context.step(). That is intentional. Since Lambda replays your function from the beginning on resume, datetime.utcnow() would produce a different value on each replay. Wrapping it in a step ensures the timestamp gets checkpointed and replayed consistently. The notify_reviewer function (in the same file) stores the callback details in DynamoDB and sends an SNS notification: Python def notify_reviewer(job_id, callback_id, summary): import boto3 from datetime import timedelta dynamodb = boto3.resource('dynamodb') sns_client = boto3.client('sns') approvals_table = os.environ.get('APPROVALS_TABLE', 'etl-pending-approvals') approval_topic_arn = os.environ.get('APPROVAL_TOPIC_ARN') approval_api_url = os.environ.get('APPROVAL_API_URL') table = dynamodb.Table(approvals_table) ttl = int((datetime.utcnow() + timedelta(hours=24)).timestamp()) table.put_item(Item={ 'jobId': job_id, 'callbackId': callback_id, 'functionArn': os.environ.get('AWS_LAMBDA_FUNCTION_NAME'), 'workflowType': 'durable-functions', 'summary': json.dumps(summary), 'status': 'pending', 'requestedAt': datetime.utcnow().isoformat(), 'ttl': ttl }) if approval_topic_arn: sns_client.publish( TopicArn=approval_topic_arn, Subject=f'ETL Job Approval Required: {job_id}', Message=f"Job ID: {job_id}\n" f"Approve: POST {approval_api_url}/approve/{job_id}\n" f"Reject: POST {approval_api_url}/reject/{job_id}" ) return {"job_id": job_id, "callback_id": callback_id, "status": "pending"} The workflowType: 'durable-functions' field is important — it tells the shared approval handler which callback mechanism to use when the reviewer responds. Step 4: The Shared Approval Handler When the reviewer clicks approve, the shared handler looks up the callbackId from DynamoDB and sends the callback to the paused durable execution: Python # shared-resources/src/approval_handler.py (key excerpt) if workflow_type == 'durable-functions': callback_id = approval_record.get('callbackId') if approved: lambda_client.send_durable_execution_callback_success( CallbackId=callback_id, Result=json.dumps(approval_response) ) else: lambda_client.send_durable_execution_callback_failure( CallbackId=callback_id, Error='JobRejected', Cause=reason or 'Job rejected by reviewer' ) elif workflow_type == 'step-functions': task_token = approval_record.get('taskToken') if approved: stepfunctions.send_task_success( taskToken=task_token, output=json.dumps(approval_response) ) Same API, same reviewer experience — the underlying callback mechanism is the only thing that differs. Step 5: Deploy and Test Deploy in order (shared resources first, since the other stacks import from it): Markdown # 1. Deploy shared resources cd shared-resources sam build && sam deploy --guided # 2. Deploy Durable Functions cd ../durable-functions sam build && sam deploy --guided Generate test data: Markdown python scripts/generate_test_data.py --count 10 --output test-data/ Upload files to trigger the workflow (note the uploads/ prefix — the S3 filter requires it): Markdown aws s3 cp test-data/ s3://etl-raw-data-bucket-YOUR_ACCOUNT_ID/uploads/ --recursive Check approval status and approve: Markdown # Check status curl https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/status/<job-id> # Approve curl -X POST https://<api-id>.execute-api.us-east-1.amazonaws.com/prod/approve/<job-id> \ -H "Content-Type: application/json" \ -d '{"reviewer": "harpreet", "reason": "Data looks good"}' For bulk approvals during testing, the repo includes a handy script: Markdown ./scripts/approve_all_jobs.sh For local testing, the testing SDK supports pytest: Markdown pip install aws-lambda-durable-execution-sdk-testing pytest durable-functions/tests/ Step 6 (Optional): Deploy Step Functions for Comparison If you want to reproduce my full comparison, deploy the Step Functions stack too: Markdown cd step-functions sam build && sam deploy --guided Here is what the same workflow looks like in Amazon States Language: JSON { "StartAt": "ExtractData", "States": { "ExtractData": { "Type": "Task", "Resource": "${ExtractFunctionArn}", "ResultPath": "$.extractResult", "Next": "TransformData" }, "TransformData": { "Type": "Task", "Resource": "${TransformFunctionArn}", "ResultPath": "$.transformResult", "Next": "LoadData" }, "LoadData": { "Type": "Task", "Resource": "${LoadFunctionArn}", "ResultPath": "$.loadResult", "Next": "WaitForApproval" }, "WaitForApproval": { "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken", "Parameters": { "FunctionName": "${ApprovalFunctionArn}", "Payload": { "taskToken.$": "$$.Task.Token", "jobId.$": "$.loadResult.job_id", "summary.$": "$.loadResult.summary" } }, "TimeoutSeconds": 86400, "ResultPath": "$.approvalResult", "Next": "CheckApproval" }, "CheckApproval": { "Type": "Choice", "Choices": [{ "Variable": "$.approvalResult.approved", "BooleanEquals": true, "Next": "FinalizeJob" }], "Default": "JobRejected" }, "JobRejected": { "Type": "Pass", "Result": { "status": "REJECTED" }, "End": true }, "FinalizeJob": { "Type": "Task", "Resource": "${FinalizeFunctionArn}", "End": true } } } Compare the two approaches. Durable Functions: one Python file, one Lambda, familiar programming constructs. Step Functions: a JSON state machine definition, five separate Lambda functions, plus the ASL learning curve. Both do the same thing. The Real Cost Numbers Now, here is the part that made me rebuild a mental model I had about serverless orchestration costs. I ran 1,000 CSV files through this exact workflow — both with Durable Functions and with the Step Functions implementation. The approval wait averaged about 20 minutes per document. Cost ComponentDurable FunctionsStep FunctionsDifferenceLambda invocations$0.000358$0.001-64%Lambda duration$0.0308$0.0179+72%State transitions$0.000$0.175-100%DynamoDB$0.003$0.0030%S3 operations$0.010$0.0100%TOTAL$0.044$0.207-79% Source: AWS CloudWatch Metrics The total cost, which is 79% cheaper, is mainly driven almost entirely by one thing: state transitions. Step Functions charges $0.025 per 1,000 state transitions. ASL workflow has 7 states (ExtractData, TransformData, LoadData, WaitForApproval, CheckApproval, JobRejected/FinalizeJob). For 1,000 workflows, that is 7,000 transitions, which costs $0.175. That single line (state transition) item is 84% of the total Step Functions cost. Durable Functions eliminates state transition costs. The trade-off? Higher Lambda duration costs ($0.031 vs. $0.018) because the durable function runs with 1,024 MB memory (single function handling all work) compared to Step Functions using 512 MB per function across five smaller functions. At scale, the difference adds up quickly: Daily VolumeDurable Functions/yearStep Functions/yearAnnual Savings1,000/day$16.06$75.56$59.5010,000/day$160.60$755.60$595100,000/day$1,606$7,556$5,950 And the most important validation: both systems achieved $0 compute cost during the 20-minute approval wait. That is the real game-changer compared to polling or always-on servers. Understanding the Replay Model One thing that confused me initially was the invocation count. I expected 1,000 invocations for 1,000 workflows. Instead, I got 1,788. Here is why. The checkpoint/replay model means each workflow requires a minimum of 2 invocations: Initial invocation — S3 trigger fires, function runs generate-job-id → extract → transform → load → submit-for-approval → pauseResume invocation — Callback received, function replays from the beginning (all completed steps return cached results instantly), then executes the finalize step So the theoretical minimum is 2,000 invocations for 1,000 workflows. The actual number was 1,788 because some workflows were still pending approval when I collected the metrics over the 24-hour measurement window. The important thing to remember: your code must be deterministic. Since Lambda replays your function from the beginning on resume, any non-deterministic operations (random numbers, timestamps, external API calls) must happen inside context.step() blocks so their results get checkpointed. Python job_id = context.step( lambda _: f"etl-durable-{datetime.utcnow().strftime('%Y%m%d%H%M%S')}-" f"{source_key.split('/')[-1]}", name="generate-job-id" ) That is exactly why the job_id generation in our code uses context.step().Without it, the timestamp would change on every replay. Here are some other examples where your code must be deterministic and how to avoid that: Deterministic IsssuesWhy It BreaksSolutionMath.random()Different value on every replayWrap in context.step()Date.now()Time keeps moving forwardUse context.timestamp or wrap in a stepGlobal variablesMight change between replaysPass state through function argumentsExternal API callsNetwork is a lieAlways wrap in context.step()Iterating over Map or SetIteration order can vary by runtimeUse arrays or ensure stable ordering When Not to Use Durable Functions I want to be honest about the trade-offs, because this is not a "Durable Functions is better than everything" story. Choose Step Functions when: Visual debugging matters. The step function state machine execution graph is genuinely superior. You can see exactly which step failed, inspect the input/output of each state, and non-technical stakeholders can actually understand what the workflow is doing. With Durable functions, AWS did provide visual analysis, monitoring, and debugging as well, but its little more developer-friendly. Multi-service orchestration. Step Functions has 220+ native AWS service integrations. DynamoDB, SQS, SNS, ECS, and Glue without writing Lambda glue code. In our Step Functions implementation, the ASL connects directly to Lambda function ARNs with built-in retry policies. With Durable Functions, all integrations go through your Lambda code.Express Workflows apply. For short-duration (under 5 minutes), high-throughput workflows, Step Functions Express Workflows use a different pricing model that can be very competitive. Choose Durable Functions when: Cost optimization is the priority (79% savings at scale)Workflows are Lambda-centric (your logic lives in Lambda code anyway)You prefer writing orchestration in Python/TypeScript/over Amazon States Language. AWS just now released Lambda Duration functions with Java in developer preview.Your logic is complex, and dynamic programming language is preferred by developers over the declarative ASL. AWS recommends a hybrid approach: use Durable Functions for application-level logic within Lambda, and Step Functions for high-level orchestration across multiple AWS services. Concurrency Planning — A Quick Note One thing worth mentioning: Durable Functions consolidates your entire workflow into a single Lambda function (ETLDurableOrchestrator in our case). This means your Lambda concurrency quota directly limits how many workflows can run simultaneously. Step Functions distributes execution across five separate Lambda functions (Extract, Transform, Load, Approval, Finalize), spreading the concurrency demand. In practice, this means you should plan your Lambda concurrency quotas carefully when using Durable Functions. If you expect burst uploads of hundreds or thousands of files at once, set reserved concurrency appropriately for your workload. This applies to both services — the difference is just where the concurrency demand concentrates. Wrapping Up Lambda Durable Functions is a genuinely useful addition to the serverless toolkit. For a simple ETL pipeline with human-in-the-loop approval, it delivered 79% cost savings over Step Functions while achieving the same 100% success rate and zero-cost waiting. The code-first approach feels natural if you are already comfortable writing Lambda functions in Python, TypeScript, or Java. The wait_for_callback pattern for human approvals is clean and straightforward. And the cost savings are real, which is driven entirely by the elimination of state transition charges. That said, Step Functions remains the better choice when visual workflow representation, multi-service orchestration, or operational simplicity are your priorities. There is no universal winner here, and it depends on what your team values more. The complete implementation — both SAM stacks, shared approval infrastructure, test data generation scripts, bulk approval scripts, and detailed cost analysis — is available here: github.com/hsiddhu2/aws-lambda-durable-vs-stepfunctions. Clone it, deploy both implementations, run your own 1,000-file comparison, and see the numbers for yourself. The ~79% cost advantage held consistent for this workflow, but your number will vary based on workflow complexity and state count.

By Harpreet Siddhu
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel

In this article, I will discuss a highly available solution developed using Spring Boot 3 and Spring Security 6 to address the "centralized authentication method" problem frequently seen in modern microservice ecosystems. We are not simply moving to an "authorization service"; we are examining the cache-first pattern, which minimizes DB usage, and the Redis Sentinel enhancement, which guarantees system persistence. Why a Separate Authentication Service? While embedding security into each service is an option in microservices, I have always found it more logical to proceed with a centralized Auth service and API Gateway combination. DRY (Don't Repeat Yourself): Using token authentication logic in many services increases extra maintenance costs.Isolation: Business services focus only on business logic; they don't deal with "is this token valid?" questions.Performance: Thanks to the Redis connection, instead of going to the database with every request, we can resolve the validation via the cache in milliseconds. Plain Text [Client] ──► [API Gateway] ──► [Auth Service: validate token] │ (valid) ▼ [Backend Microservices] Cache-Focused Approach: Reducing Database Load In the classic workflow, every login request puts a load on the DB. With the cache-first approach, the process proceeds like this with a POST /auth/signin request: First, Redis is checked. If there is a valid and unexpired token for the user, it is replicated directly. In case of cache deficiency, AuthManager.authenticate() is activated, a DB query is sent, and a BCrypt check is performed. After a successful login, a token is generated with JJWT (HS256). This token is given to Redis with our changes and TTL (e.g., 24 minutes), and personal responses are converted. In this way, it protects our main database, especially in brute-force or high-intensity login password attacks. Plain Text POST /auth/signin │ ▼ ┌──────────────────────────────┐ │ Token exists in Redis? │──── YES ──► Return token (0 DB queries) └──────────────────────────────┘ │ NO ▼ ┌──────────────────────────────┐ │ AuthManager.authenticate() │ (DB query + BCrypt verification) └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Generate JWT (JJWT HS256) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Write to Redis (TTL: 24 min)│ └──────────────────────────────┘ │ ▼ Return token Implementation Details User Entity and UserDetails Integration In most projects, unnecessary mappings are performed between the User asset and the UserDetails objects expected by Spring Security. To reduce complexity, the User Entity is directly derived from the UserDetails interface. This makes the code cleaner and makes it "native," as outlined by Spring Security. Java @Data @Builder @NoArgsConstructor @AllArgsConstructor @Entity @Table(name = "T_APP_USER") public class User implements UserDetails { @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seq_user_gen") @SequenceGenerator(name = "seq_user_gen", sequenceName = "SEQ_APP_USER", allocationSize = 1) @Column(name = "idx") private Long idx; @Column(name = "firstname") private String firstName; @Column(name = "lastname") private String lastName; @Column(unique = true, name = "email") private String email; @Column(name = "accesskey") private String accessKey; // BCrypt-hashed @Column(name = "role") @Enumerated(EnumType.STRING) private Role role; @Override public Collection<? extends GrantedAuthority> getAuthorities() { return List.of(new SimpleGrantedAuthority(role.name())); } @Override public String getUsername() { return email; } @Override public String getPassword() { return accessKey; } @Override public boolean isAccountNonExpired() { return true; } @Override public boolean isAccountNonLocked() { return true; } @Override public boolean isCredentialsNonExpired() { return true; } @Override public boolean isEnabled() { return true; } } JWT Filter: The Gateway to Security The request to the system passes through the OncePerRequestFilter. Here, using JwtAuthenticationFilter, we parse the token in each request and populate the SecurityContext. By using the new SecurityFilterChain bean introduced with Spring Security 6, we have disabled CSRF and made session management completely stateless. Token Generation and Validation Java public interface JwtService { String extractUserName(String token); String generateToken(UserDetails userDetails); boolean isTokenValid(String token, UserDetails userDetails); } @Service public class JwtServiceImpl implements JwtService { @Value("${token.signing.key}") private String jwtSigningKey; // Base64-encoded secret key @Override public String extractUserName(String token) { return extractClaim(token, Claims::getSubject); } @Override public String generateToken(UserDetails userDetails) { return Jwts.builder() .setClaims(new HashMap<>()) .setSubject(userDetails.getUsername()) .setIssuedAt(new Date(System.currentTimeMillis())) .setExpiration(new Date(System.currentTimeMillis() + 1000 * 60 * 24)) .signWith(getSigningKey(), SignatureAlgorithm.HS256) .compact(); } @Override public boolean isTokenValid(String token, UserDetails userDetails) { final String userName = extractUserName(token); return userName.equals(userDetails.getUsername()) && !isTokenExpired(token); } private <T> T extractClaim(String token, Function<Claims, T> claimsResolver) { return claimsResolver.apply( Jwts.parserBuilder() .setSigningKey(getSigningKey()) .build() .parseClaimsJws(token) .getBody() ); } private boolean isTokenExpired(String token) { return extractClaim(token, Claims::getExpiration).before(new Date()); } private Key getSigningKey() { return Keys.hmacShaKeyFor(Decoders.BASE64.decode(jwtSigningKey)); } } High Availability: Redis Sentinel Using a single Redis instance means that the Auth service has a "Single Point of Failure." If Redis crashes, no one can access the system. This risk mitigation was achieved using Redis Sentinel. Thanks to the Sentinel structure: If the master node crashes, the dependent node is automatically promoted to master via failover. On the application side, we continuously manage these transitions using the Lettuce driver. Technical Stack and Requirements Redis Sentinel configuration: Java @Configuration public class RedisConfig { @Value("${spring.redis.sentinel.master}") private String master; @Value("${spring.redis.sentinel.nodes}") private String sentinelNodes; @Value("${spring.redis.password}") private String password; @Bean public RedisConnectionFactory redisConnectionFactory() { RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration() .master(master); for (String node : sentinelNodes.split(",")) { String[] hostPort = node.split(":"); sentinelConfig.sentinel(hostPort[0], Integer.parseInt(hostPort[1])); } sentinelConfig.setPassword(RedisPassword.of(password)); return new LettuceConnectionFactory(sentinelConfig); } } Plain Text yaml env: - name: spring.redis.sentinel.master valueFrom: secretKeyRef: name: redis-user-secret key: username - name: spring.redis.password valueFrom: secretKeyRef: name: redis-user-secret key: password Token cache service: Java @Service public class TokenCacheServiceImpl { private final RedisTemplate<String, String> redisTemplate; public TokenCacheServiceImpl(RedisTemplate<String, String> redisTemplate) { this.redisTemplate = redisTemplate; } public void cacheToken(String username, String token, long duration, TimeUnit unit) { redisTemplate.opsForValue().set(username, token, duration, unit); } @Cacheable(value = "tokens", key = "#username") public String getToken(String username) { return redisTemplate.opsForValue().get(username); } } Authentication service: signup and signin: Java @Service @RequiredArgsConstructor public class AuthenticationServiceImpl implements AuthenticationService { private final UserRepository userRepository; private final PasswordEncoder passwordEncoder; private final JwtService jwtService; private final AuthenticationManager authenticationManager; private final TokenCacheServiceImpl tokenCacheService; @Override public JwtAuthenticationResponse signup(SignUpRequest request) { var user = User.builder() .firstName(request.getFirstName()) .lastName(request.getLastName()) .email(request.getEmail()) .accessKey(passwordEncoder.encode(request.getAccessKey())) // BCrypt .role(Role.USER) .build(); userRepository.save(user); var jwt = jwtService.generateToken(user); return JwtAuthenticationResponse.builder().token(jwt).build(); } @Override public JwtAuthenticationResponse signin(SigninRequest request) { // 1. Check Redis cache first String cachedToken = tokenCacheService.getToken(request.getEmail()); if (cachedToken != null) { return JwtAuthenticationResponse.builder().token(cachedToken).build(); } // 2. If not cached, authenticate (DB + BCrypt) authenticationManager.authenticate( new UsernamePasswordAuthenticationToken(request.getEmail(), request.getAccessKey()) ); var user = userRepository.findByEmail(request.getEmail()) .orElseThrow(() -> new IllegalArgumentException("Invalid credentials.")); // 3. Generate token and write to Redis (24 min TTL) var jwt = jwtService.generateToken(user); tokenCacheService.cacheToken(request.getEmail(), jwt, 24, TimeUnit.MINUTES); return JwtAuthenticationResponse.builder().token(jwt).build(); } } JWT authentication filter: Java @Component @RequiredArgsConstructor public class JwtAuthenticationFilter extends OncePerRequestFilter { private final JwtService jwtService; private final UserService userService; @Override protected void doFilterInternal( @NonNull HttpServletRequest request, @NonNull HttpServletResponse response, @NonNull FilterChain filterChain ) throws ServletException, IOException { final String authHeader = request.getHeader("Authorization"); // Pass through if no Authorization header or doesn't start with Bearer if (StringUtils.isEmpty(authHeader) || !StringUtils.startsWith(authHeader, "Bearer ")) { filterChain.doFilter(request, response); return; } final String jwt = authHeader.substring(7); final String userEmail = jwtService.extractUserName(jwt); // Process only if SecurityContext has no authentication yet if (StringUtils.isNotEmpty(userEmail) && SecurityContextHolder.getContext().getAuthentication() == null) { UserDetails userDetails = userService.userDetailsService() .loadUserByUsername(userEmail); if (jwtService.isTokenValid(jwt, userDetails)) { SecurityContext context = SecurityContextHolder.createEmptyContext(); UsernamePasswordAuthenticationToken authToken = new UsernamePasswordAuthenticationToken( userDetails, null, userDetails.getAuthorities() ); authToken.setDetails(new WebAuthenticationDetailsSource().buildDetails(request)); context.setAuthentication(authToken); SecurityContextHolder.setContext(context); } } filterChain.doFilter(request, response); } } Spring Security 6 configuration: Java @Configuration @EnableWebSecurity @RequiredArgsConstructor public class SecurityConfiguration { private final JwtAuthenticationFilter jwtAuthenticationFilter; private final UserService userService; @Bean public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { http .csrf(AbstractHttpConfigurer::disable) // Stateless → no CSRF needed .authorizeHttpRequests(request -> request .requestMatchers("/auth/**").permitAll() // Auth endpoints open to all .anyRequest().authenticated() ) .sessionManagement(manager -> manager.sessionCreationPolicy(STATELESS) // No server-side session ) .authenticationProvider(authenticationProvider()) .addFilterBefore(jwtAuthenticationFilter, // JWT filter runs first UsernamePasswordAuthenticationFilter.class); return http.build(); } @Bean public PasswordEncoder passwordEncoder() { return new BCryptPasswordEncoder(); } @Bean public AuthenticationProvider authenticationProvider() { DaoAuthenticationProvider authProvider = new DaoAuthenticationProvider(); authProvider.setUserDetailsService(userService.userDetailsService()); authProvider.setPasswordEncoder(passwordEncoder()); return authProvider; } @Bean public AuthenticationManager authenticationManager(AuthenticationConfiguration config) throws Exception { return config.getAuthenticationManager(); } } Unit tests: Java @Test @DisplayName("Signin: if token is cached, should not query the DB") void testSignInWithCachedToken() { when(tokenCacheService.getToken(TEST_EMAIL)).thenReturn(TEST_TOKEN); JwtAuthenticationResponse response = authenticationService.signin( SigninRequest.builder().email(TEST_EMAIL).accessKey(TEST_PASSWORD).build() ); assertEquals(TEST_TOKEN, response.getToken()); verifyNoInteractions(authenticationManager); // No DB + BCrypt call should happen verifyNoInteractions(userRepository); } // Invalid token test — SecurityContext should remain empty @Test @DisplayName("With an invalid token, SecurityContext should remain empty") void testDoFilterInternalInvalidToken() throws Exception { when(request.getHeader("Authorization")).thenReturn("Bearer " + INVALID_TOKEN); when(jwtService.extractUserName(INVALID_TOKEN)).thenReturn(TEST_EMAIL); when(userService.userDetailsService()).thenReturn(userDetailsService); when(userDetailsService.loadUserByUsername(TEST_EMAIL)).thenReturn(userDetails); when(jwtService.isTokenValid(INVALID_TOKEN, userDetails)).thenReturn(false); jwtAuthenticationFilter.doFilterInternal(request, response, filterChain); verify(filterChain).doFilter(request, response); assertNull(SecurityContextHolder.getContext().getAuthentication()); } Summary and Conclusion With the purchasing architecture, not only a secure login screen; It has built an architecture that is extremely scalable, overcomes database bottlenecks with caching, and meets high availability (HA) standards. In particular, the modern architecture offered by Spring Boot 3 has made the security layer much more flexible. If you are starting a large-scale microservice project, you can design token management from the outset in this "stateless" and "cached" manner.

By Erkin Karanlık
Feature Flag Debt: Performance Impact in Enterprise Applications
Feature Flag Debt: Performance Impact in Enterprise Applications

Feature flags have become standard practice in enterprise applications, enabling teams to release code into production environments without exposing new features to users. As teams leverage feature flags to increase delivery velocity, technical debt accumulates. Left unchecked, this debt will slowly and silently impact application performance, maintainability, and developer productivity. What Is Feature Flag Debt? Feature flag debt occurs when feature flags are left in the codebase after they’ve served their purpose. The most common symptoms of feature flag debt include: Dead code Context switching for developers Feature flag debt can go unnoticed because it typically doesn’t cause broken features. As a result, developers are often reluctant to clean up flags so they can focus on developing new features. Impact on Performance Feature flag debt can have serious consequences for application performance. In front-end applications, this is often overlooked. Once a feature flag has been introduced into a codebase, it incurs a long-term cost every time the application is loaded in the browser. Larger JS bundles: Each feature flag adds logic to the application. When feature flags are not cleaned up, the associated code is typically not removed from the final bundled app. This means more code for users to download and more memory used on the client.Reduced execution speed in client-side rendering: The browser must download, parse, and evaluate the entire bundle, even if certain code paths are never executed. This leads to slower parsing, longer load times, and slower interaction time. Impact on Developer Productivity Feature flag debt also negatively impacts developer productivity. Imagine having to read through an if/else statement that checks a feature flag that will never be true. Developers frequently encounter this scenario when working with feature flags. New engineers, in particular, often struggle to know which feature flags are safe to ignore. Should they be commenting out this code? What if they need it later? Why Aren’t Feature Flags Cleaned Up? It should be standard practice to remove feature flags from the codebase once they’re no longer needed. However, they often become a long-term liability for the application for several reasons: Nobody takes responsibility for cleaning up flags.People are afraid to remove code.There are no tools to help automate the process.There’s always something more pressing to work on. We often don’t see a defined feature flag lifecycle, which leads to indefinite accumulation. Example of Feature Flag Debt For example, let’s take a look at how a feature would typically look when wrapped in a feature flag: JavaScript const isAIAgentsFeatureFlagEnabled = isFeatureEnabled('ai-agents'); if (isAIAgentsFeatureFlagEnabled) { // lines of code // Code to run when the feature flag is enabled } else { // lines of code // Code to run when the feature flag is disabled } When first implemented, this doesn’t look too bad. When this feature is rolled out to production, there’s still the safety net of keeping the original functionality should something go wrong. However, after the feature flag is turned on for everyone and the feature reaches general availability (GA), there is no reason to keep both pathways in the application. The application still ships both pieces of code in the bundle, but only one will ever execute at runtime. The else block now represents dead code that will not get executed, but still takes up space in the bundle and adds to code complexity. Manage and Eliminate Feature Flag Debt Organizations need to take measures to prevent feature flag debt from slowing down their applications. Defining a feature flag life cycle is a great place to start. By enforcing that each feature flag has a description, owner, status, and expiration date, the team can ensure flags aren’t left to become debt. Treat feature flags as temporary and not part of the application's core architecture. When the feature is in GA, remove the flag and delete any code paths that are no longer needed. This results in a cleaner, more maintainable, and performant codebase. JSON [ { "feature_flag_name": "ai-agents", "description": "Feature flag that will allow AI agents to assist users with workflows and provide suggestions", "owner": "architecture crew", "status": "GA", "expiration_date": "2026-12-31" }, { "feature_flag_name": "smart-checkout", "description": "Feature flag that will allow smart checkout features, including dynamic pricing, custom offers", "owner": "architecture crew", "status": "Dev", "expiration_date": "2026-12-31" }, { "feature_flag_name": "ai-agents-eval", "description": "Feature flag to allow the evaluation framework to execute tests against AI agents to determine how accurate they are", "owner": "agent evaluation crew", "status": "QA", "expiration_date": "2026-10-12" }, { "feature_flag_name": "experiment-recommendation-v2", "description": "Feature flag for experimenting v2 recommendation version", "owner": "agent evaluation crew", "status": "GA", "expiration_date": "2026-12-31" } ] Having the feature flags stored in a format similar to the above can help identify who to contact to clean up old flags. Performance Gains From Cleanup Removing unused feature flags reduces bundle size and eliminates unnecessary code execution, resulting in faster load times, improved rendering performance, and a cleaner codebase. Conclusion For most enterprise applications, feature flags aren’t the problem; it’s forgetting to take them down. As the application grows over time, old feature flags accumulate, which will silently bloat the bundle size, degrade performance, and clutter the code.

By Poornakumar Rasiraju
Docker Hardened Images Are Free Now — Here's What You Still Need to Build
Docker Hardened Images Are Free Now — Here's What You Still Need to Build

The Problem Isn't the Image Hardened container images are no longer niche. Docker open-sourced major portions of the tooling behind Docker Hardened Images under Apache 2.0 in late 2025. Chainguard and Google's distroless variants sit in the same space. The pitch across all three: fewer packages, smaller attack surface, dramatically lower CVE counts. The pitch is accurate. It is also incomplete. Most container security failures are not image failures. They are governance failures: A team pushes a debug build to production. Admission control doesn't block it because the policy is in Audit mode, not Enforce.A six-month-old deployment keeps running an ancient image digest while the team patches newer builds. Nobody detects the drift.The platform team rotates signing keys. Old pipelines keep producing images signed with the revoked key. Admission still accepts them. Nobody notices for ninety days.A vendor pushes an updated base image under the same tag. CI rebuilds against the new digest. The new digest is unsigned. Production takes it. No alert fires. None of these are CVE failures. They are governance failures — gaps in how images are produced, attested, verified, and monitored. Swapping the base image to a hardened variant changes none of them. A signed-and-attested hardened image in a cluster that doesn't verify signatures is operationally equivalent to a signed Ubuntu image in that cluster: the signature is decorative. I recently worked on migrating a regulated production workload onto a hardened-image baseline. Lab 12 of my docker-security-practical-guide repository is a sanitized, reproducible distillation of what that work taught me. The short version: the value is in the control plane around the image, not the image itself. The Trust Control Plane in 60 Seconds In practice, the hardest part is not enabling hardened images. It is operating trustworthy deployments at scale without slowing engineers down. The operating model has three layers, joined by a feedback loop: Supply Chain layer – images are signed (cosign keyless against Fulcio), attested with an SBOM (syft + CycloneDX), and scanned for vulnerabilities (grype). The output: an image whose origin and contents are independently verifiable by anyone.Trust layer – an admission controller (Kyverno) verifies signatures and attestations before any pod is scheduled. The admission policy is the unit of governance: it encodes which signers, which attestations, and which constraints are required for a workload to start.Enforcement layer – continuous drift detection answers the question: admission can't: has the digest drifted since we admitted it? Has the signing key been revoked? Has a new unsigned workload landed via a controller that bypasses admission?Feedback loop – drift findings feed back into the supply chain: a drift event produces a rebuild; an admission rejection produces a ticket. Without the loop, the enforcement layer becomes an alerting backwater that engineers mute. FIGURE 1 — Trust control plane for cloud-native software supply chain security.The architecture separates supply chain generation, admission-time trust verification, and continuous runtime enforcement into independent layers connected through a feedback loop. The pattern is vendor-agnostic: any compatible signing, admission, and drift-detection components can fulfill these roles. The bottom line: a hardened image is one input to the supply chain layer. Without trust verification, it's indistinguishable from a regular image at deploy time. Without enforcement, untrusted images coexist with hardened images in the same cluster. Without the feedback loop, trust state drifts silently. Admission Control: Where Governance Gets Teeth The trust layer is where the control plane becomes operationally real. In the lab, Kyverno's verifyImages rule asserts that every image carries a cosign signature from an approved identity. Here's the core of the policy: YAML apiVersion: kyverno.io/v1 kind: ClusterPolicy metadata: name: require-signed-images spec: validationFailureAction: Enforce rules: - name: verify-cosign-keyless match: any: - resources: kinds: [Pod] verifyImages: - imageReferences: ["ghcr.io/opscart/*"] attestors: - entries: - keyless: subject: "https://github.com/opscart/*" issuer: "https://token.actions.githubusercontent.com" required: true The subject and issuer together define who is trusted. For DHI images, these values point to Docker's signing identity. For Chainguard, Chainguard's. The shape of the policy is identical in all cases — only the identity matcher changes. When someone deploys an unsigned image, the rejection is immediate and actionable: Shell $ kubectl run test --image=nginx:latest --restart=Never Error from server: admission webhook "validate.kyverno.svc-fail" denied the request: resource Pod/default/test was blocked due to the following policies require-trusted-registry: trusted-registries-only: 'validation error: Image must come from a trusted registry. Allowed: dhi.io/*.' FIGURE 2 — Kyverno admission webhook rejecting an nginx pod from an untrusted registry. Capture from terminal: kubectl run rejected-test --image=nginx:latest --restart=Never (with cluster up and policies applied). Catching an unsigned image at admission costs one re-run of kubectl apply. Catching the same workload running in production a week later costs a security ticket, an incident response, and possibly a regulatory disclosure conversation. Moving rejection earlier is the highest-leverage decision in the entire model. Phased Rollout: Audit Before Enforce In production, you don't flip everything to Enforce on day one. The lab uses a phased approach: the trusted-registry policy runs in Enforce mode (hard gate on image origin), while signature and SBOM verification policies run in Audit mode (log violations, don't block). This gives teams a migration runway: they can see which workloads would fail and fix them before the policies graduate to Enforce. The shift from Audit to Enforce is a single-field YAML change. Signing Your Supply Chain: Keyless Cosign The supply chain layer produces the artifacts that admission verifies. A common modern approach uses cosign with GitHub Actions OIDC for keyless signing — no private keys to manage, rotate, or leak. The mechanism: GitHub Actions mints a short-lived OIDC token at workflow time. Cosign exchanges it for an ephemeral certificate from Sigstore Fulcio, signs the image, and destroys the key immediately. The certificate records which workflow, on which repository, at which commit, produced the signature. The signature is logged in Sigstore Rekor's public transparency log. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The complete workflow and run history is public. The important property is that anyone can independently verify the signed artifact. Shell cosign verify \ --certificate-identity-regexp \ "^https://github\.com/opscart/docker-security-practical-guide/ \.github/workflows/supply-chain-gate\.yml@.+$" \ --certificate-oidc-issuer \ "https://token.actions.githubusercontent.com" \ ghcr.io/opscart/docker-security-practical-guide/dhi-sample-app:latest Verification for ghcr.io/opscart/.../dhi-sample-app:latest -- The following checks were performed on each of these signatures: - The cosign claims were validated - Existence of the claims in the transparency log was verified offline FIGURE 3 — cosign verify succeeds for any reader, without shared secrets. Capture from terminal: run the cosign verify command above against the published image at ghcr.io. This is what "supply chain security" means in practice: not "we sign our images," but "our trust assertions are independently verifiable by anyone, against neutral infrastructure, without prior trust setup." The published image can be verified directly against the public artifact. Fleet Drift: The Problem Nobody Watches Admission is point-in-time. Production is continuous. The enforcement layer's job is to answer the questions that admission can't: has the digest drifted since we admitted it? Has a new unsigned workload landed via a controller that bypasses admission? The lab's E1 experiment runs a drift audit against a synthetic 12-service fleet mixing DHI, Docker Hub, internally-built, and abandoned images. The fleet is intentionally constructed with an explicit variation matrix — the numbers below describe the synthetic fleet's structure, not measurements from a deployed environment. In this synthetic fleet, unsigned services averaged 13.0 critical CVEs while signed-and-verified services averaged 0.0. The exact ratio will vary by environment, but the audit makes the trust gap continuously visible. FIGURE 4a — Fleet drift audit: signing state vs CVE correlation across the synthetic fleet. Capture from terminal: run ./experiments/E1-drift-observation/analyze-drift.py. Screenshot Sections 1–3 (Fleet Summary + Origin×Signing Correlation + Signing State → CVE Accumulation) FIGURE 4b — Remediation order: compliance-scope risk concentration and prioritized action queue. Same script output, Sections 4 + 7 (Compliance Scope Risk Concentration + Recommended Remediation Order) The ratio isn't the point — your fleet will produce different numbers. What the control plane provides is the continuous, attributable surfacing of whatever the ratio actually is, including cases where the supposed benefit of hardening is harder to defend. That honest feedback loop is what turns the audit from a compliance checkbox into a supply chain prioritization tool. The Substitution Test A useful test for whether you've found an architectural pattern or a vendor recipe: can you swap a major component and have everything else continue to work? For this architecture, the test is straightforward. The lab demonstrates three configurations: Docker Hardened Images (dhi.io), Chainguard Images (cgr.dev/chainguard), and a self-built Alpine base signed against a project-owned GitHub Actions OIDC identity. In all three, the Kyverno policy structure is identical. The drift audit runs unchanged. The SBOM verification runs unchanged. Edits are confined to the identity matcher and the image references. The implication: "Should we standardize on DHI or Chainguard?" is a commercial decision (pricing, catalog coverage, support), not an architectural one. The architectural decision is whether to operate the trust control plane at all. A team that has invested in the control plane has built portable institutional capability. A team that has invested in "we use DHI" has bought a product, and a future migration off DHI is a structural rewrite rather than a configuration update. Production Friction: What Actually Goes Wrong The model works. It is also not free. Here are the operational costs my team hit, documented in detail in the companion repo's TROUBLESHOOTING.md: No shell. Distroless hardened images don't include /bin/sh, curl, wget, cat, or ls. When an engineer pages at 2 AM and runs kubectl exec -it pod -- /bin/sh, the command fails. The remediation is kubectl debug with an ephemeral debug container attached to the pod's process namespace. Train your on-call rotation on kubectl debug before migration, not after. The lab's E5 experiment documents three debug patterns (ephemeral containers, dev-variant images in dev namespaces only, pre-built debug sidecars) with runbook scenarios for unreachable services, crashloops, and OOM kills. Migration is not a FROM line change. The default user is nonroot (UID 65532), not root. Library paths differ. pip install --user installs to /home/nonroot/.local, not /root/.local. Required system packages (ca-certificates, timezone data) that come for free in stock bases must be explicitly carried over. The lab's Dockerfile required three iterations before the build succeeded locally: shell-form RUN failed (no /bin/sh), then pip --user installed to the wrong path, then requirements.txt pinned package versions that didn't exist on PyPI. Each of these is a 30-second local fix — and a 5-minute GitHub Actions round-trip if you don't test locally first. Signature paths vary by vendor. DHI signatures resolve via registry.scout.docker.com, not at the image's own registry path. Kyverno handles this through the policy's repository field, but any custom verification tooling needs to know. Plan to audit verification code before migration. Kyverno has schema gotchas. rekor and ctlog blocks must be inside keys, not siblings. webhookTimeoutSeconds is capped at 30. mutateDigest: true is incompatible with validationFailureAction: Audit. PolicyException requires an explicit feature flag. Each of these cost me 30–60 minutes of debugging — they're in TROUBLESHOOTING.md, so they don't cost you the same. None of these are deal-breakers individually. All of them together are why migrations slip from "next quarter" to "abandoned after two months." Budget for friction. When This Is Overkill The investment's value scales with three factors: regulatory pressure (HIPAA, PCI-DSS, SOC 2 Type II, FDA 21 CFR Part 11), fleet size and heterogeneity (8+ clusters, dozens of teams pushing images), and blast radius (pharmaceutical patient data vs. internal dashboard). Concretely: pre-production tools, side projects, prototypes, and developer sandboxes do not need this. They benefit from a hardened base image (free) and should not be put behind the full trust control plane. The overhead of policy maintenance, key rotation, and drift remediation outstrips the risk reduction. For most workloads outside regulated production, the supply chain layer alone — sign and SBOM your builds — captures most of the available value at a fraction of the cost. Conclusion: Architecture Over Image Choice Hardened images are useful. The point of this article is that they are one component of a broader architectural pattern, and the security outcomes regulated teams want are properties of the pattern, not the component. A team that adopts hardened images without the surrounding pattern has made a real but limited improvement. A team that adopts the pattern with any reasonable image vendor — DHI, Chainguard, or a self-built base — has built portable institutional capability. The substitution test is the diagnostic: ask whether a future migration away from your current image vendor is a configuration edit or a structural rewrite. If it's the former, you have the pattern. If it's the latter, you have a product dependency. The companion repository at github.com/opscart/docker-security-practical-guide (tag v1.12.0) contains everything in this article: working Kyverno policies, a keyless-signed sample image you can pull and verify right now, fleet drift audits, and five hypothesis-driven experiments. The cosign verify command above works against the published artifact today. Spend the design effort on the pattern. The image will be replaceable. The governance is what survives vendor replacement. This article is adapted from a longer write-up on OpsCart, which includes the complete threat model, substitution-test configurations, and an extended troubleshooting log.

By Shamsher Khan DZone Core CORE
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me

My data catalog project was the third time in my career that I had led a catalog implementation. My first was a custom-built solution in 2015 that worked but required three engineers to maintain. Number two was an off-the-shelf tool that nobody used because it was too cumbersome to keep current. For this third attempt, I wanted to get it right. We implemented Azure Purview for automated discovery and technical metadata, and Collibra for business glossary, data ownership, and governance workflows. They serve different functions and are connected through a custom integration. Here is how we set it up and what surprised us. Why Two Tools? Azure Purview is excellent at automated technical metadata collection. Purview scans your data sources on a schedule, discovers tables and columns, infers data types, and builds an automatically-maintained lineage graph. Automated discovery is its primary value. Doing this manually doesn't scale, and any manually-maintained catalog falls behind the actual state of the data within months. Purview isn't good at business governance workflows: data stewardship, business term assignment, data quality certification, access request approvals. These require human processes with approvals and audit trails that Purview's workflow capabilities do not cover adequately. Collibra handles the governance workflow side. Business data stewards maintain the business glossary in Collibra. Ownership assignments and data quality certifications go through Collibra's workflow engine. When a data consumer wants to know what a dataset means in business terms, they look in Collibra. When they want to know where the data physically lives and what its schema is, they look in Purview. The Purview Setup Purview scans are configured per data source. We set up scans for our three ADLS Gen2 storage accounts, our Azure SQL databases, our Databricks Unity Catalog, and our Azure Data Factory pipelines. Scans run daily for production data sources and weekly for development. Purview builds a lineage graph from ADF pipelines, which is genuinely useful. We can see, for any given table, which pipelines write to it and which tables it reads from. Lineage tracking has been valuable three times in incident investigations where we needed to understand the upstream sources of a corrupted dataset. Custom classifications are worth the setup time. Purview comes with built-in classifiers for common PII patterns: email addresses, phone numbers, credit card numbers, and national ID formats for several countries. We added custom classifiers for our internal account number formats and insurance policy number patterns. Automated classification isn't perfect, about 85% accurate in our testing, but it surfaces PII-candidate columns that manual review would miss. Python # Purview scan configuration (REST API) import requests def create_purview_scan(account_name, collection, data_source): url = (f"https://{account_name}.purview.azure.com/scan/datasources/" f"{data_source}/scans/daily-production-scan") body = { "kind": "AzureStorageMsi", "properties": { "scanRulesetName": "custom-pii-ruleset", "scanRulesetType": "Custom", "collection": {"referenceName": collection}, "credential": { "referenceName": "managed-identity", "credentialType": "ManagedIdentity" } }, "trigger": { "recurrence": { "frequency": "Day", "interval": 1, "startTime": "2024-01-01T02:00:00Z", "timezone": "UTC" } } } resp = requests.put(url, json=body, headers=get_auth_headers()) return resp.json() # Custom classifier for internal account numbers custom_classifier = { "kind": "Custom", "properties": { "classificationName": "INTERNAL_ACCOUNT_NUMBER", "description": "Internal 12-digit account number format", "classificationRule": { "kind": "Regex", "pattern": "^ACC[0-9]{9}$", "minimumPercentageMatch": 75 } } } The Collibra Integration We built a nightly sync that reads technical metadata from Purview via its REST API and creates or updates corresponding assets in Collibra. Our sync maps Purview datasets to Collibra data assets, adds technical metadata (schema, classification, lineage summary) as attributes on the Collibra asset, and creates a link between the Collibra and Purview assets so users can navigate between the business and technical views. Building this sync took about six weeks of engineering time. It's the part of the implementation I considered most for an off-the-shelf connector, but the available connectors didn't handle our specific Purview classification tagging approach correctly. Our custom sync has been running for 14 months with minimal maintenance. Python # Nightly Purview-to-Collibra metadata sync (Python) import requests from datetime import datetime def sync_purview_to_collibra(purview_client, collibra_client): """Sync technical metadata from Purview to Collibra assets.""" # Fetch all cataloged assets from Purview purview_assets = purview_client.discovery.query( keywords="*", filter={"and": [ {"entityType": "azure_datalake_gen2_path"}, {"classification": ["confidential", "restricted"]} ]}, limit=1000 ) for asset in purview_assets['value']: collibra_asset = collibra_client.find_or_create_asset( name=asset['name'], domain="Data Lake Assets", type="Data Set" ) # Sync technical metadata as attributes collibra_client.update_attributes(collibra_asset['id'], { "Technical Schema": asset.get('schema', ''), "Data Classification": asset.get('classification', []), "Purview Asset Link": asset['id'], "Last Scanned": asset.get('lastScanTime', ''), "Lineage Summary": get_lineage_summary( purview_client, asset['id']), "Sync Timestamp": datetime.utcnow().isoformat() }) return {"synced": len(purview_assets['value']), "timestamp": datetime.utcnow().isoformat()} What Adoption Looked Like Adoption was slow. We launched the catalog with a communication campaign, internal documentation, and a live demo. After three months, we'd had about 30% of our target user base actively using it, primarily data engineers who were looking up lineage information. Analysts and business stakeholders, the people Collibra was primarily designed to support, were largely not using it. Adoption really broke through when we integrated the catalog with our data access request process. Previously, access requests went to a Jira form. We changed the process: to request access to a dataset, you start from the Collibra data asset page. Each access request is contextualized with the asset's classification, ownership, and purpose, which both the requester and the approver see during the approval workflow. Usage of Collibra for data assets grew by 300% in the month after we made this change. Python # Collibra asset mapping schema for access request workflow { "asset_type": "Data Set", "domain": "Data Lake Assets", "attributes": { "Technical Name": {"type": "text", "source": "purview"}, "Business Name": {"type": "text", "source": "steward"}, "Data Classification": { "type": "single_select", "values": ["public", "internal", "confidential", "restricted"], "source": "purview" }, "Owner Team": {"type": "text", "source": "steward"}, "PII Columns": {"type": "multi_select", "source": "purview"}, "Quality Certification": { "type": "single_select", "values": ["certified", "provisional", "uncertified"], "source": "steward" }, "Access Request URL": { "type": "url", "template": "https://collibra.internal/access/{asset_id}" } }, "workflow": { "access_request": { "approvers": ["asset_owner", "data_governance_lead"], "sla_hours": 48, "auto_revoke_days": 365 } } } The Honest Caveat A data catalog requires ongoing investment that is easy to underestimate. Automated parts, Purview's scanning and discovery, take care of themselves. Business governance parts, glossary maintenance, stewardship assignments, and quality certifications require human effort that must be budgeted and owned. Our Collibra business glossary currently covers about 60% of our production datasets. The remaining 40% have technical metadata from Purview but no business context. That 40% is smaller than it was six months ago, which means we are making progress. But it's a real gap that we manage explicitly rather than pretending the catalog is complete.

By Kuladeep Sandra
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering introduced with Delta Lake 3.0 goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management. Recap: Partitioning and Z-Order Limitations Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables: Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult and usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning. Data engineers must monitor query patterns and manually decide how to partition or when to re-Zorder. This ongoing maintenance is time-consuming and error-prone. In summary, traditional partitioning/Z-ordering yields performance benefits but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution. What Is Liquid Clustering? Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name liquid signifies flexibility data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include: Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys or file size tuning the platform handles optimal file sizing and clustering internally. Even high-cardinality columns can be used effectively, which would be impractical as partition keys.Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys. There’s no massive upfront cost of re-partitioning the entire dataset past data doesn’t need an immediate rewrite.Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks. Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by set it and forget it clustering. In fact, thousands of customers have already adopted it as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables. Liquid Clustering vs Traditional Methods Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways: No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data. Maintenance operations like OPTIMIZE still play a role but they can operate more efficiently since the incoming data is already sorted/clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE+ZORDER it only rearranges data that isn’t well clustered yet rather than always rewriting everything.Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset clustered with Liquid Clustering ran 2.5× faster to optimize (cluster) than using Z-Ordering, and yielded significantly better query performance than both partitioning or Z-Order. In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly. How Liquid Clustering Works (Under the Hood) At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms: Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.Data Skipping with Statistics: Delta Lake maintains statistics that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters. Enabling Automatic Clustering To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations). Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward: SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering: SQL -- Creating a Unity Catalog managed table with Automatic Liquid Clustering CREATE TABLE main.analytics.user_events ( user_id STRING, event_type STRING, event_date DATE, details STRING ) CLUSTER BY AUTO; -- enables automatic liquid clustering on this table SQL ALTER TABLE main.analytics.user_events CLUSTER BY AUTO; This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.) PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark: Python # df is a DataFrame we want to save as a Delta table with auto clustering df.write.format("delta") \ .option("clusterByAuto", "true") \ .mode("overwrite") \ .saveAsTable("main.analytics.user_events_auto") The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required – the system will figure it out if you leave it open.) Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on a serverless compute in the background. The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times. Using Manual Liquid Clustering (Custom Clustering Keys) In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it: Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example: SQL -- Create a Delta table clustered by specific columns (manual clustering) CREATE OR REPLACE TABLE main.analytics.sales_data ( sale_id BIGINT, region STRING, product STRING, sale_date DATE, amount DECIMAL(10,2) ) CLUSTER BY (region, sale_date); In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date. Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance: SQL ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date); This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize. Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example: Python OPTIMIZE main.analytics.sales_data; The above will compact small files and cluster data incrementally. If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data **FULL**. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time. PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example: Python # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys df.write.format("delta") \ .mode("append") \ .clusterBy("region", "sale_date") \ .saveAsTable("main.analytics.sales_data"); Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys. Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version which older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries. Conclusion Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidness of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For Data Engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further clustering becomes a self-driving process, leveraging query insights to continuously improve performance. In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start it can simplify your architecture and ensure your tables automatically stay optimized as your data (and its use cases) grow.

By Seshendranath Balla Venkata
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine. The Problem We Kept Hitting We’ve been building Ingero — an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well. But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening. We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works? What We Shipped in v0.9.1 Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports. 1. Node Identity Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback: Shell sudo ingero trace --node gpu-node-01 Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) — no extra configuration needed. 2. Fleet Fan-Out Queries Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS — --ca-cert, --client-cert, --client-key — so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only. The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically: YAML fleet: nodes: - gpu-node-01:8080 - gpu-node-02:8080 - gpu-node-03:8080 - gpu-node-04:8080 A full example config is in configs/ingero.yaml. Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving: Shell $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \ "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source" node source cnt avg_us ---------------- ------ ----- ------ gpu-node-01 4 11009 5.2 gpu-node-01 3 847 18400 # ← 9x higher than peers gpu-node-02 4 10892 5.1 gpu-node-02 3 412 2100 gpu-node-03 4 10847 5.3 gpu-node-03 3 398 1900 gpu-node-04 4 10901 5.0 gpu-node-04 3 421 2200 8 rows from 4 node(s) Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains: Shell $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s) [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O Root cause: 847 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O Root cause: 855 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O — checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.” 3. Offline Merge and Perfetto Export Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints — there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file: Shell # 1. Collect traces from each node scp gpu-node-01:~/.ingero/ingero.db node-01.db scp gpu-node-02:~/.ingero/ingero.db node-02.db # 2. Merge and analyze ingero merge node-01.db node-02.db -o cluster.db ingero explain -d cluster.db Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node. For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline. Why We Built It This Way The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path. We deliberately avoided that. No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet. Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity. Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself. Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query. Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. The investigations/ directory has ready-to-query databases for trying this without a GPU cluster: sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node clustersample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks) GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design. If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together. Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead. Related Reading GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this postGPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story

By Ingero Team
How Retry Storms Crash API-Led Systems: Bounded Reliability Patterns for Distributed Architectures
How Retry Storms Crash API-Led Systems: Bounded Reliability Patterns for Distributed Architectures

Modern API-led architectures are built for resilience. We add: Retries for transient failuresReplication for durabilityAutoscaling for elasticityCircuit breakers for isolation Each mechanism improves availability. Under stress, their interaction can bring the system down. Most enterprise outages aren’t caused by missing fault tolerance. They’re caused by unbounded fault-tolerance mechanisms reacting simultaneously. Let’s break down how this happens — and how to design bounded reliability instead. 1. Retry Storms: When Resilience Multiplies Traffic Retries are meant to protect against temporary failures. But retries multiply load. This is a simplified version of what we often see in service-to-service retry logic: Plain Text import time import random def downstream_service(): latency = random.choice([0.1, 0.2, 0.8]) time.sleep(latency) if latency > 0.7: raise TimeoutError("Slow response") return "OK" def call_with_retries(max_attempts=3): for attempt in range(max_attempts): try: return downstream_service() except TimeoutError: print(f"Retry {attempt+1}") raise Exception("Failed after retries") Under normal conditions: Works fine. Under load: Latency increases.Timeouts trigger.Each request retries 3 times.Traffic triples.Backend slows further.More retries fire. That’s a retry storm. Now imagine this inside an API-led architecture: Gateway → Experience API → Process API → System APIs → ERP/DB If each layer retries independently, load amplification becomes multiplicative. In one system I worked on, we saw a single downstream slowdown take out three upstream APIs within minutes because each layer had its own retry logic. Bounded Retry Pattern (Production-Safe) Retries must be: LimitedBacked off exponentiallyJitteredDisabled under system stress Safer version: Plain Text def call_with_bounded_retries(max_attempts=2, system_load=0.5): if system_load > 0.75: return None # fail fast when under stress for attempt in range(max_attempts): try: return downstream_service() except TimeoutError: backoff = 0.2 * (2 ** attempt) time.sleep(backoff + random.uniform(0, 0.1)) return None Key differences: Retry ceiling reducedExponential backoffJitter prevents synchronized wavesLoad-aware short-circuit Retries should dampen instability — not amplify it. 2. Replication Fan-Out and Coordination Collapse Replication improves durability. But synchronous replication increases coordination cost. Example: Plain Text import time def simulate_write(): time.sleep(0.2) def write_to_replicas(data, replicas=3): for _ in range(replicas): simulate_write() Under surge traffic: Write volume increases.Each write fans out to 3 replicas.Replica lag grows.Clients retry writes.Effective write load doubles. Durability turned into a bottleneck. In enterprise integration systems (order processing, billing, reconciliation), this pattern causes throughput collapse — not because data was lost, but because coordination overwhelmed the system. Tiered Durability Strategy Not all writes need identical guarantees. Plain Text def write(data, critical=True): if critical: write_to_replicas(data, replicas=3) else: write_to_replicas(data, replicas=1) Separate: Critical transactions → strong durabilityNon-critical logs/events → reduced coordination Reliability must be scoped — not maximized blindly. 3. Autoscaling Feedback Loops Autoscaling reacts to traffic metrics. But traffic metrics may be artificial. If retries inflate request counts: Plain Text def autoscale(request_rate): if request_rate > 100: print("Scaling up") Scaling triggers: New instances initialize.Initialization hits shared DB/cache.Backend latency increases.More timeouts occur.Retry rate rises. Autoscaling accelerated instability. Safer Scaling Signals Scale on: Sustained demand (not spikes)Latency distribution trendsOrganic RPS (excluding retries)Queue growth rate Example: Plain Text def autoscale_safe(request_rate, sustained_load): if sustained_load and request_rate > 120: print("Scaling safely") Autoscaling should respond to organic demand — not retry amplification. 4. The Real Problem: Correlated Reactions Retries respond to latency.Replication responds to writes.Autoscaling responds to traffic.Circuit breakers respond to error rates.Under stress, they react to the same signal.That correlation creates cascading failure.Distributed systems behave like feedback systems.Unbounded feedback loops destabilize them. Real-World Scenario: Payment Reconciliation API Consider a payment reconciliation service: Gateway → Process API → Billing → ERP → Database What happens during a minor ERP slowdown? ERP latency increases to 700ms.Billing times out at 500ms.Billing retries 3 times.Process API retries orchestration.Gateway retries client request.Autoscaling reacts to spike.DB replication lag increases.DLQ starts growing. Within minutes, a small slowdown becomes a platform-wide incident. Root cause: unbounded reaction. 5. Guardrails for Bounded Reliability in API Systems 1. Retry Budgets Effective Load = Incoming RPS × Retry Count If RPS = 1,000 and retries = 3 Effective load = 3,000 Cap retries per request and per service. 2. Failure Classification Not all errors are retriable. Error Type Retry? Action CONNECTIVITY Yes Bounded retry TIMEOUT Yes Backoff VALIDATION No Fail fast AUTH No Alert Blind retries are architectural debt. 3. Idempotency Enforcement Retries without idempotency cause corruption. Unsafe: Plain Text transaction_id = uuid() Safe: Plain Text transaction_id = payload.get("transaction_id") or request.headers["correlation-id"] Every retry must produce the same logical result. 4. DLQ With Observability Track: Retry percentageTimeout frequencyDLQ growth velocityP95 latency shifts These are early warning signals. None of these controls are free. Reducing retries can increase error rates in some scenarios, and limiting replication can affect durability guarantees. The goal isn’t to eliminate these mechanisms, but to apply them intentionally based on system behavior. 5. Design for Stability, Not Perfection The goal of distributed reliability isn’t maximum redundancy. It’s controlled degradation under stress. Bound retries. Scope replication. Dampen scaling reactions. Enforce idempotency. Monitor feedback loops. Minor latency should not become a cascading outage. Reliability is not about adding mechanisms. It’s about controlling how they interact. Final Thoughts Retry storms don’t start with catastrophic failure. They start with: A small latency increaseA few timeoutsA handful of retries Then fault-tolerance mechanisms react — together. Retries multiply traffic.Replication increases coordination pressure.Autoscaling amplifies backend load. Within minutes, a minor slowdown becomes a cascading outage. Reliability in API-led distributed systems is not about adding more safety nets. It’s about bounding how those safety nets behave under stress. Limit retries.Classify failures.Enforce idempotency.Scale on sustained demand — not noise.Monitor feedback loops before they spiral. The difference between a resilient platform and a cascading failure often comes down to one thing: Whether your reliability mechanisms are controlled — or uncontrolled. Design for stability under stress. Not perfection under ideal conditions.

By Manjeera Chanda
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch

A DynamoDB throttle alarm fires at 2 am. You confirm the spike in CloudWatch, then check ElastiCache in a second dashboard, then Redshift in a third. Cache hit rate dropped, which hammered DynamoDB, which stalled the zero-ETL export. Three services, three dashboards, one cascade you can only trace by hand. This guide maps the specific metrics, alarm thresholds, and configuration steps for each service, and then addresses the observability delta that CloudWatch leaves unresolved: cross-service correlation, root-cause traceability, and the capacity-planning intelligence that prevents cascades in the first place. What CloudWatch Gives You Across DynamoDB, ElastiCache, and Redshift Prerequisites: The CLI examples and alarm configurations in this guide assume AWS CLI v2, an IAM principal with cloudwatch:GetMetricData, cloudwatch:PutMetricAlarm, and dynamodb:UpdateContributorInsights permissions, and active DynamoDB tables, ElastiCache clusters, or Redshift clusters in your account. CloudWatch publishes metrics for all three services under service-specific namespaces. Per the AWS CloudWatch documentation, metric retention runs in three tiers: 1-minute data points retained for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days. NamespaceCategoryKey MetricsAWS/DynamoDBCapacityConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequestsAWS/DynamoDBLatencySuccessfulRequestLatency (p50, p99)AWS/DynamoDBHealthSystemErrorsAWS/ElastiCacheEfficiencyCacheHitRate, EvictionsAWS/ElastiCacheMemoryDatabaseMemoryUsagePercentageAWS/ElastiCacheConnectionsCurrConnections, ReplicationLagAWS/RedshiftPerformanceQueryDuration, QueryQueueTimeAWS/RedshiftWorkloadWLMQueueLength (per queue)AWS/RedshiftResourcesCPUUtilization, ReadIOPS, WriteIOPS For most post-incident investigations, you’ll hit the granularity boundary within two weeks. A throttle spike that lasted 4 minutes on day 17 shows up as a single 5-minute average data point, frequently indistinguishable from normal traffic variation. The per-custom-metric cost also compounds at scale: an account running 40 DynamoDB tables, 6 ElastiCache clusters, and 3 Redshift clusters with per-resource custom alarms can accumulate hundreds of CloudWatch metrics across namespaces, each costing $0.30/month to store and $0.10/alarm/month to evaluate. Each namespace provides enough signal to diagnose its own service, but CloudWatch publishes no native cross-service correlation mechanism. A ThrottledRequests spike in AWS/DynamoDB and a CacheHitRate collapse in AWS/ElastiCache at the same timestamp are both visible, but connecting them as cause and effect requires a human to match timestamps across dashboards. DynamoDB: Throttling Detection, Partition Health, and Capacity Mode Decisions DynamoDB throttling is rarely a single-metric problem. A throttle alarm tells you capacity was exceeded, but not whether the cause is a hot partition, an undersized provisioned table, or a traffic pattern that outgrew your capacity mode. The subsections below work through that diagnostic sequence: the metrics that surface the symptom, the tooling that pinpoints the partition, and the capacity decision that prevents recurrence. Core Metrics and Alarm Thresholds The DynamoDB CloudWatch metric namespace publishes table-level aggregates. For provisioned-capacity tables, these five metrics drive operational decisions: MetricUnitRecommended Alarm ThresholdNotesThrottledRequestsCount> 0 (provisioned mode)Any throttling on a provisioned table means capacity is misconfigured or a hot partition is concentrating loadSuccessfulRequestLatency p99Milliseconds> 10ms (read-heavy workloads); > 20ms (mixed)p99 > 10ms on reads is a practitioner-recommended leading indicator of partition pressure before throttles appearConsumedReadCapacityUnitsCount/second> 80% of provisioned RCUsSignals you’re approaching throttle territoryConsumedWriteCapacityUnitsCount/second> 80% of provisioned WCUsSame logic for write-heavy workloadsSystemErrorsCount> 0Indicates DynamoDB service-side failures, distinct from capacity limits Practitioner-recommended starting points. Tune to your workload characteristics. ThrottledRequests at table level confirms that throttling happened, but tells you nothing about which partition caused it. On a table with millions of items, a single access pattern (a user ID acting as a partition key hot spot, for instance) can drive 95% of throttles while aggregate consumed capacity looks healthy. DynamoDB Contributor Insights resolves this. Contributor Insights for Hot Partition Detection DynamoDB Contributor Insights surfaces the top-N most-accessed partition keys and sort keys in real time. It identifies the specific items driving throttling or high latency that pure CloudWatch metric aggregation can’t surface. Enabling it on a production table with significant traffic incurs cost (priced per request evaluated), but during a throttle incident, Contributor Insights gives you the specific key value generating excess load rather than an aggregate curve. Enable it from the DynamoDB console under the table’s “Monitor” tab, or via CLI (requires AWS CLI v2+): Plain Text aws dynamodb update-contributor-insights \ --table-name YOUR_TABLE_NAME \ --contributor-insights-action ENABLE Once active, CloudWatch Logs Insights receives partition-level data within minutes. Query the top-10 most-accessed partition keys over the past hour to confirm whether a hot key is generating the throttle alarm: Plain Text filter @message like /ContributorInsights/ | stats count(*) as accessCount by partitionKey | sort accessCount desc | limit 10 Capacity Mode Decision Logic The decision between provisioned and on-demand capacity modes depends on traffic predictability. Use a 7-day ConsumedCapacityUnits trend as your input signal: If consumed capacity stays below 80% of provisioned capacity and follows a consistent daily pattern, stay on provisioned. Set auto-scaling target utilization at 70% of provisioned capacity to leave headroom for traffic spikes before throttling begins.If consumed capacity regularly exceeds 80% of provisioned, or if usage patterns show irregular spikes with no predictable shape, on-demand mode eliminates throttling risk at a higher per-request cost. Teams running the DynamoDB zero-ETL integration with Redshift (GA October 2024) face a different monitoring angle from streaming replication. The integration operates via periodic incremental exports every 15 to 30 minutes, so source table latency doesn’t affect export timing. The primary constraint on analytics data freshness is export completion status, visible in the Redshift console under the integration view. Export failures are the leading indicator of stale analytics data. ElastiCache: Cache Efficiency, Memory Pressure, and the Valkey 8.0 Observability Upgrade When cache hit rate drops, the blast radius extends beyond ElastiCache. Every cache miss becomes a direct read against your origin datastore, and if that origin is a DynamoDB table already running near provisioned capacity, you get the throttle cascade from the introduction. The metrics below separate cache-level symptoms from the memory and replication signals that predict them, followed by the observability improvements Valkey 8.0 brings. Redis and Valkey Metrics Per the ElastiCache CloudWatch documentation, the metrics that drive operational decisions for Redis and Valkey deployments are: MetricTargetAlert ThresholdActionCacheHitRate>= 0.95< 0.90Investigate at < 0.90; below 0.80 indicates a significant access pattern change or deployment that altered cache key patternsEvictions~0 (steady state)> 100/min sustainedSustained evictions mean maxmemory-policy is evicting live data under memory pressureDatabaseMemoryUsagePercentage< 70%Alert at > 75%; scale-out at > 85%Alert at 75% gives runway to analyze dataset growth; above 85% triggers automatic evictions under most policiesReplicationLag< 100ms> 500msReplica lag at this level affects read scaling reliabilityCurrConnectionsWorkload-specific> 80% of max allowedPersistent near-limit connections indicate a connection pool misconfiguration or application-side leak Practitioner-recommended starting points based on operational experience. Memcached deployments within ElastiCache expose a different metric set through the same AWS/ElastiCache namespace: get_hits and get_misses (from which you derive hit rate), evictions, and bytes_used vs. limit_maxbytes. Valkey and Redis are cluster-based architectures with native replication, while Memcached is a horizontally partitioned cache with no native replication. Applying Redis/Valkey thresholds to Memcached deployments produces misleading alarms. Valkey 8.0 Observability Additions The open-source Valkey 8.0 release shipped from the Linux Foundation on September 16, 2024. Amazon ElastiCache 8.0 for Valkey launched on November 21, 2024, bringing four observability primitives that prior Redis OSS metrics on ElastiCache didn’t expose. Per-slot metrics let you identify which hash slots carry disproportionate traffic across a cluster. Before Valkey 8.0, CloudWatch surfaced per-node and per-cluster aggregates only. A slot-level throughput imbalance (common after a key pattern change in the application layer) was invisible until it produced node-level CPU or memory pressure. With per-slot metrics, you detect the asymmetry before it cascades to node-level saturation. Per-client event loop latency tracks how long each client connection waits in the event loop queue. This directly diagnoses client-specific throughput asymmetries. If one application service has a misconfigured connection pool producing tail latency that appears as a CacheHitRate degradation from another service’s perspective, per-client event loop latency identifies the offending client specifically rather than surfacing a cluster-level aggregate that implicates everything. Rehash memory tracking quantifies the temporary memory overhead during cluster rescaling. When you add nodes to an ElastiCache Valkey cluster, the rehashing process requires holding two copies of some hash-slot data in memory simultaneously. Before this metric, a DatabaseMemoryUsagePercentage spike during a scale-out event was ambiguous. With rehash memory tracking, you can confirm the spike is transient rehash overhead and dismiss the alarm as expected behavior rather than a capacity problem. Traffic breakdowns split read, write, and key expiry operations at the slot and node level. This replaces the single-dimensional throughput view that prior ElastiCache Redis metrics provided and enables you to identify whether a throughput increase is driven by reads, writes, or expiry churn without writing custom instrumentation. Valkey 8.1, released April 2, 2025, adds further observability improvements. Verify ElastiCache 8.1 availability in your region at the time of deployment, as managed service version availability can trail the open-source release by several weeks. Redshift: Query Performance, WLM Configuration, and Enhanced Monitoring Redshift performance problems tend to look identical from the outside: queries slow down. Whether the cause is CPU saturation, WLM slot exhaustion, or a bad query plan requires different metrics and different responses. The thresholds below separate those conditions, followed by the Enhanced Query Monitoring tooling that replaced the manual system-table workflow for root-cause diagnosis. Key CloudWatch Metrics and WLM Thresholds MetricRecommended ThresholdActionCPUUtilizationAlert at > 80%Investigate active query plans if sustained; evaluate concurrency scaling if combined with queue depthWLMQueueLength (per queue)Alert at > 3; escalate at > 5 sustained for 60 secondsWLMQueueLength > 5 sustained over 60 seconds combined with CPUUtilization > 85% is a practitioner-recommended trigger for enabling a Redshift concurrency scaling clusterQueryQueueTime> 30 secondsQueries waiting over 30 seconds indicate WLM queue saturation or slot misconfigurationQueryDuration2x the 7-day p95 baseline for that WLM queueBaseline drift detection for workload-specific thresholdsReadIOPSCluster baselineSharp ReadIOPS spikes without a corresponding query load increase can indicate full-table scans or missing sort key filters The WLMQueueLength threshold requires context to interpret correctly. A WLMQueueLength of 5 on a queue allocated 5 concurrency slots means every slot is occupied and the queue is at capacity. Combined with CPUUtilization above 85%, adding concurrency scaling capacity is the right response. WLMQueueLength of 5 with CPUUtilization at 40% points to a slot allocation problem: queries are queuing behind slot limits rather than behind compute saturation, and the fix is WLM reconfiguration, not additional nodes. Historically, diagnosing slow Redshift queries required direct access to system tables. A typical workflow queried STL_QUERY for execution times, joined to SVL_QUERY_METRICS for resource usage per execution step, and cross-referenced SVL_QUERY_SUMMARY for operator-level plan details. This three-step workflow required SQL client access, familiarity with the Redshift internal catalog schema, and significant manual correlation work. Redshift Enhanced Query Monitoring Redshift Enhanced Query Monitoring went GA on January 29, 2025, available for both Serverless and provisioned deployments. It surfaces query bottlenecks, execution plan anomalies, and resource contention at the query level through the Redshift console, removing the need for SQL-level diagnostic work against system tables. When WLMQueueLength spikes, you can go directly to a ranked list of the queries causing saturation, see their execution plan highlights, and identify whether the bottleneck is a sort key miss, a cross-join, or a network shuffle between nodes, all without writing a single STL_QUERY lookup. Redshift troubleshooting previously required a senior engineer with DBA-level knowledge of the system catalog. This change shifts basic performance diagnosis to any SRE comfortable with the console. AI-Driven Scaling and Its Monitoring Implications AWS previewed Redshift Serverless AI-driven scaling at re:Invent 2023, and it went GA in October 2024. Verify current GA status in the AWS documentation for your region before production adoption, as the preview-to-GA timeline varies by feature and region. AI-driven scaling automates RPU (Redshift Processing Unit) allocation by observing query patterns over time and adjusting base and max RPU settings to balance cost against performance. WLM queue priority, query monitoring rule configuration, and workload classification for mixed BI and ETL environments require manual configuration even on Serverless clusters running AI-driven scaling. A Redshift Serverless cluster with AI-driven scaling still requires you to define how ETL jobs and ad hoc analyst queries share resources, and which queue takes priority when both arrive simultaneously. Those decisions drive WLMQueueLength behavior regardless of how accurately the scaler provisions RPUs. Capacity Planning: Using Monitoring Data to Drive Scaling and Cost Decisions The cross-service capacity heuristic worth building into your runbooks: simultaneous DynamoDB p99 latency increase combined with ElastiCache CacheHitRate dropping below 0.90 can indicate several different conditions. Potential causes include a fan-out query change at the application layer, a cache node failure, a network event between services, or a deployment that altered cache key patterns. This symptom combination warrants application-layer investigation to confirm the root cause before deciding which service to scale. Scaling either service without confirming the shared trigger wastes capacity and can mask the actual issue. DynamoDB Build a 7-day ConsumedCapacityUnits average as your baseline, then set auto-scaling target utilization at 70% of provisioned capacity. This gives your table headroom to absorb a 30% traffic increase before auto-scaling triggers, with a further buffer before you hit throttles at 100% consumed capacity. When evaluating reserved capacity, AWS Cost Explorer surfaces DynamoDB reserved capacity recommendations with projected savings. At a 3-year term commitment, reserved capacity can save up to 77% versus provisioned capacity hourly rates. Reserved capacity makes financial sense for tables that have run in provisioned mode for at least 90 days with predictable consumption patterns. For tables with volatile or seasonal traffic, on-demand mode avoids the risk of underutilization that makes reserved capacity economically counterproductive. ElastiCache Trend DatabaseMemoryUsagePercentage over a 72-hour window. If it trends upward at a rate disconnected from traffic growth (the cache dataset is growing while the request rate stays flat), that signals cache dataset expansion rather than increased load. The operational response is node scaling before you cross the 75% alert threshold, as memory pressure at that level narrows your runway to eviction-level problems. For ElastiCache Serverless using Valkey, monitor ElastiCacheProcessingUnits (ECPUs) as the scaling proxy. ECPU consumption scales with operation complexity and data volume, making it the primary cost and capacity signal for Serverless deployments where node count decisions don’t apply. Redshift Correlate CPUUtilization with QueryQueueTime over a 1-week window. The CPU-vs-queue diagnostic from the Redshift metrics section applies here as your scaling decision input: high CPU points to node scaling, while high queue time with moderate CPU points to WLM slot reconfiguration. Where CloudWatch’s Coverage Falls Short The per-service metrics and tooling above give you solid visibility within each namespace. The gaps show up when you need to work across them: correlating alarms from different services, connecting logs to metrics, and suppressing the noise when a single event triggers alerts everywhere at once. No Native Cross-Service Correlation You can build a CloudWatch dashboard that co-locates DynamoDB ThrottledRequests, ElastiCache Evictions, and Redshift WLMQueueLength on a shared timeline, but it’s manual widget assembly with no causal linking between the graphs. The assembly is also fragile: every new table, cluster, or queue requires manual dashboard updates to keep the view current. Log-to-Metric Correlation Is Manual Connecting a slow Redshift query logged in STL_QUERY to a spike in DynamoDB SuccessfulRequestLatency at the same timestamp requires opening CloudWatch Logs Insights for Redshift audit logs, querying by timestamp range, then manually comparing results against the DynamoDB metric timeline. The Enhanced Query Monitoring GA from January 2025 reduces this friction for Redshift-internal diagnosis, but the cross-service correlation step remains a human task. Cross-Account Visibility CloudWatch Database Insights added cross-account and cross-region support for database fleet monitoring on November 21, 2025. Verify the current scope of service coverage at the time of your deployment, as the announcement references database fleet monitoring broadly, and the specific inclusion of ElastiCache and Redshift alongside RDS and Aurora should be confirmed against current documentation. Alert Fatigue Across Three Namespaces Each service generates its own alarm stream with no dependency-aware suppression between services. When a single network event causes DynamoDB latency to rise, ElastiCache hit rate to drop, and Redshift WLM queue depth to increase, CloudWatch fires alarms across three separate notification channels simultaneously. The on-call engineer receives three alerts for a single root cause event, with no automated path from any alarm to the triggering condition. ManageEngine OpManager Nexus addresses these gaps directly: it auto-discovers DynamoDB tables, ElastiCache clusters, and Redshift clusters within your AWS account, builds correlated dashboards that connect metrics across all three services on a shared timeline without manual widget assembly, and applies dependency-aware alarm suppression that treats downstream symptoms of a single event as a grouped incident. For teams running two or more of these managed database services, the operational delta between nine isolated CloudWatch alarms and a correlated, root-cause-linked view determines where monitoring hours get spent or recovered. Your Monitoring Baseline: Nine Alarms and a Unified View The minimum viable monitoring baseline for all three services is nine CloudWatch alarms routed to a single SNS topic. These are practitioner-recommended starting points. Tune each threshold to your observed workload behavior. DynamoDB Alarms Alarm NameMetricThresholdEvaluation PeriodDynamoDB-ThrottlesThrottledRequests> 01 minuteDynamoDB-LatencyP99SuccessfulRequestLatency (p99)> 20ms5 minutesDynamoDB-RCUHighConsumedReadCapacityUnits> 80% of provisioned5 minutes Metric definitions: DynamoDB CloudWatch metrics reference. ElastiCache Alarms Alarm NameMetricThresholdEvaluation PeriodCache-HitRateLowCacheHitRate< 0.905 minutesCache-EvictionsHighEvictions> 100 per minute1 minuteCache-MemoryHighDatabaseMemoryUsagePercentage> 75%5 minutes Metric definitions: ElastiCache CloudWatch metrics reference. Redshift Alarms Alarm NameMetricThresholdEvaluation PeriodRedshift-CPUHighCPUUtilization> 80%5 minutesRedshift-QueueDepthWLMQueueLength> 35 minutesRedshift-QueueWaitQueryQueueTime> 30 seconds5 minutes Metric definitions: Redshift CloudWatch metrics reference. Route all nine alarms to a single SNS topic. Tag each alarm with a Service dimension (values: DynamoDB, ElastiCache, Redshift) so your incident management tooling can filter and group by service. This configuration puts all three alarm streams in one place and makes it detectable when multiple service alarms fire within a short time window, which is the observable signature of a cross-service cascade. Run these nine alarms for a week or two. You’ll see the pattern: multiple alarms firing within the same minute window for what turns out to be a single root cause, with no automated way to connect them. That delta is what a correlated observability layer closes. ManageEngine OpManager Nexus provides that layer for AWS database services, with auto-discovery, cross-service dashboards, and dependency-aware alarm suppression out of the box. What’s your current setup for correlating alarms across managed AWS services? If you’re running DynamoDB, ElastiCache, or Redshift and have found thresholds or approaches that work well for your team, share them in the comments.

By Damaso Sanoja
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS

The Data Challenge Every industry has its version of the same data engineering problem: massive, complex payloads generated at the edge — far from the cloud, often on unreliable networks — that need to become queryable, structured datasets as fast as possible. In genomics, it is multi-gigabyte sequencing files produced by instruments in labs. In autonomous vehicles, it is LiDAR and camera telemetry streaming off test fleets. The underlying architectural challenge is the same in every case: ingest heavy data at burst scale, store it cost-effectively for years, and transform it into something an analyst or ML model can actually use without touching the raw files. This article uses hyperspectral imaging in digital agriculture as the concrete use case, but the architecture is designed to be general-purpose and replicable. Hyperspectral sensors capture light across hundreds of spectral bands, making it possible to detect water stress, nutrient deficiencies, and early disease in crops well before anything is visible to the human eye. A single sensor pass over a 160-acre field generates 40–80 GB of raw data. These are not images in any conventional sense — they are three-dimensional tensors, often called “hypercubes,” where every spatial pixel carries reflectance measurements across 200 or more contiguous spectral bands. The files arrive in scientific formats like HDF5, NetCDF, or ENVI, which do not support partial reads over a network without specialized tooling. Loading an entire 4 GB cube into memory just to extract a vegetation index from three bands is wasteful at the small scale and operationally unaffordable once a mid-size operation is producing 5–10 TB of raw cubes per growing season. The architecture described here solves that problem end to end: from raw sensor capture to queryable, structured tables in the cloud with cost-efficient storage and minimal dependency on network bandwidth. The patterns — event-driven ingestion, aggressive storage tiering, medallion lakehouse design, and containerized edge processing — are all portable. Swap the hyperspectral cube in this architecture pattern for a FASTQ file or a LiDAR point cloud, and the same blueprint applies with very minimal modifications. Ingestion: Handling Seasonal Burst Traffic Agricultural data arrives in extreme seasonal bursts. During harvest, hundreds of edge nodes may be uploading simultaneously; in winter, the pipeline sits nearly idle. Any architecture that provisions fixed compute for this pattern is going to be very inefficient, so the ingestion layer needs to scale to near-zero in both directions. The pipeline uses an S3 → SQS → Lambda → Batch pattern, and the SQS queue in the middle is what makes the rest of it work. When files land in S3, event notifications route into the queue, which acts as a buffer between the unpredictable arrival rate and the compute layer downstream. Lightweight Lambda functions essentially like an air traffic controller poll the queue, bundle incoming file references into manifest batches of 50–200 cubes, and submit those manifests to AWS Batch. Batch spins up Spot Instances to do the actual heavy processing. Triggering Lambda directly from S3 events was the first approach, but it breaks down at scale for two reasons: Lambda’s concurrency limits create a hard ceiling during burst ingest, causing silent throttling and dropped events, and the 1:1 mapping between files and Lambda invocations is inefficient when the processing works much better against batches of files. Putting SQS in the middle solves both problems at once. When selecting the compute environment, AWS Batch ultimately won out over the alternatives after some evaluation. The main limitation of Fargate was its hard memory ceiling of around 30 GB. This was simply too tight for processing a 4 GB data cube with intermediate arrays in memory that can easily require 32–64 GB of RAM. Batch also provides native handling for job queuing, retries, and Spot interruption recovery. Since the workload is highly parallel and interruption-tolerant, this capability allowed us to safely leverage Spot pricing, delivering a significant 60–90% cost reduction that would have been difficult to justify passing up. One early lesson involved S3 prefix design. A flat raw/ prefix structure ran into per-prefix request rate limits (3,500 PUTs/second) during burst ingest, which caused throttling that was initially difficult to diagnose. Restructuring to region/farm_id/year/month/day/ spread the writes across thousands of unique prefixes and also aligned neatly with the partition scheme used by Athena and Trino downstream, so the same naming convention solved both the throughput problem and the query performance problem. Storage: Managing Petabyte-Scale Costs At this scale, storage costs will quietly become the largest line item in the project if the tiering strategy is not aggressive from day one. Petabytes of data at $0.023/GB/month in S3 Standard add up fast, but deleting raw scientific data is not an option due to regulatory reasons and for future model improvements. The lifecycle strategy moves successfully processed cubes to Glacier Instant Retrieval within 24 hours. The initial instinct was to go straight to Deep Archive, but in practice, about 5–8% of cubes get retrieved within the first year—sensor calibrations get updated, new vegetation index algorithms need validation against historical data, and so on. Deep Archive’s 12-hour restoration time makes that retrieval workflow painful enough to slow down the R&D cycle. Glacier IR runs at roughly $0.004/GB/month, about 6x cheaper than Standard, with millisecond retrieval. After a year, once retrieval rates drop below 1%, a second lifecycle rule transitions everything to Deep Archive. The important detail in the lifecycle configuration is a tag-based filter that gates the transition on processing_status = complete. Without this check, cubes that failed processing end up in Glacier, and restoring them for a retry becomes an unnecessary expense that multiplies quickly during periods of high ingest. SQL # Terraform: Tiered lifecycle for raw HSI cubes resource "aws_s3_bucket_lifecycle_configuration" "hsi_raw" { bucket = aws_s3_bucket.raw_hsi_data.id rule { id = "raw_cubes_to_cold_storage" status = "Enabled" filter { and { prefix = "raw_cubes/" tags = { processing_status = "complete" } } } transition { days = 1 storage_class = "GLACIER_IR" } transition { days = 365 storage_class = "DEEP_ARCHIVE" } } The Lakehouse: From Cubes to Queryable Tables Everything upstream exists to feed this layer. The goal is to get the R&D team off the cycle of downloading, unzipping, and parsing multi-gigabyte cubes every time they need to calculate a vegetation index or train a model. The lakehouse is built on a medallion pattern using Apache Iceberg, organized around an extract-once, query-many principle. Iceberg was chosen over plain Parquet files on S3 with a Glue Catalog because three problems kept recurring during development. First, schema evolution: Flexibility for new sensors with different band configurations, and Iceberg handles column additions without rewriting historical data. Second, time travel: when a calibration error is discovered, rolling the Silver table back to a previous snapshot is a straightforward operation rather than a data recovery project. Third, hidden partitioning: Iceberg derives partition values from column data at write time, which means queries on acquisition_date get automatic partition pruning. Medallion Layers Bronze (Standardized Cubes) Calibrated for sensor noise and atmospheric interference, stored in cloud-optimized format (Zarr or COG), retaining the full 3D spectral structure. This layer serves as the reproducible starting point for all downstream processing — if an algorithm changes six months later, reprocessing starts from Bronze rather than from the raw archive sitting in Glacier. Silver (Structured Reflectance) The 3D tensors are flattened into Iceberg tables where each row represents a spatial coordinate, and each column holds a band’s reflectance value, partitioned by farm_id and acquisition_date. The Bronze-to-Silver transformation is the most compute-intensive step in the pipeline. Gold (Business-Ready Metrics) Pre-computed agricultural indices — NDVI, NDWI, chlorophyll estimates — aggregated by crop, field row, and time period. These are the tables that dashboards query, that yield prediction models train on, and that agronomists use to make irrigation and fertilization decisions. With data in this shape, Trino handles federated SQL across the Silver and Gold tables for ad-hoc analysis, and ML training pipelines read directly from Silver without any file wrangling. The most valuable analytical work comes from joining Gold-layer crop health metrics with non-spectral datasets across the organization, and those cross-domain joins are where insights about field-level yield variation actually emerge, which is something no single dataset can surface on its own. From Pixels to Decisions: Automating the Breeding Pipeline To make this pipeline actually valuable to the business, this has to go beyond just calculating a vegetation index. The Gold layer is where pixels turn into decisions. For example, in crop breeding programs, teams test thousands of seed varieties across different microclimates to see which ones survive drought or resist disease. Agronomists do not have time to look at thousands of heatmaps; they need automated, binary outcomes. By joining the structured hyperspectral data in the Gold tables with field boundaries and historical yield databases, the system applies predefined business logic to automatically flag which genetic lines are failing. This generates concrete "Advance" or "Discard" recommendations for the breeding pipeline. At this stage, the data stops being a scientific image and becomes a direct, automated trigger for the next planting cycle. Edge Deployment: Processing at the Source The bandwidth at some of these remote locations makes a cloud-only approach unrealistic. A 4 GB cube over a 50 Mbps rural LTE connection takes over 10 minutes under ideal conditions, and rural LTE rarely delivers ideal conditions. Multiply that by dozens of passes per day during peak season, and the uplink becomes the dominant bottleneck in the entire system. The first round of processing has to happen on the equipment itself. One Container, Two Targets For managing the single OCI-compliant processing container at the edge, both AWS IoT Greengrass and K3s were considered. While Greengrass provides tight, convenience-focused AWS integration for features like device shadows, OTA updates, and managed MQTT bridging, the long-term architectural goal heavily prioritizes operational independence and portability. K3s was the pick here — it runs fully offline after bootstrap, uses standard Kubernetes manifests, and avoids locking the edge layer into a single vendor. This commitment to a lightweight, standard Kubernetes runtime avoids vendor lock-in at the crucial edge layer and provides the essential flexibility needed should a multi-cloud strategy become necessary. The edge container performs radiometric calibration and spectral flattening, producing a Parquet file that is typically 50–100x smaller than the raw cube. That compression ratio is what makes the entire edge strategy viable — the processed output is small enough to upload over cellular, while the raw cube would take orders of magnitude longer. Hardware and Sync Hyperspectral processing is dominated by dense matrix multiplications across hundreds of bands, which requires GPU hardware. The setup uses ruggedized NVIDIA Jetson AGX Orin modules mounted directly on field equipment, providing the CUDA cores needed to run CuPy-based calibration and flattening in near real-time. The sync strategy splits on payload size and urgency. Processed Parquet files stream back to the cloud in near real-time via Amazon MSK (Kafka) over an MQTT bridge, giving the lakehouse immediate telemetry. Kafka was chosen over SQS for this link because the downstream Spark Structured Streaming jobs benefit from offset-based replay semantics — if a job fails mid-batch, it resumes from the last committed offset without data loss or duplication, which is harder to guarantee cleanly with SQS visibility timeouts. The raw cubes stay on local storage and are only backhauled when the equipment returns to a facility with a high-speed connection, keeping bandwidth costs under control. Summary The core ideas behind this pipeline are straightforward: decouple storage from compute using SQS as a buffer, push the first round of processing to the edge so bandwidth stops being the bottleneck, tier storage aggressively so petabyte-scale retention stays economical, and structure everything into a medallion lakehouse so end users get SQL tables instead of binary blobs. Each piece is well-understood on its own; the value is in how they compose into an end-to-end system that stays reliable and cost-effective at scale. As noted at the outset, none of this is specific to agriculture. The hyperspectral cube is just one instance of a pattern that shows up across industries — genomics, satellite imagery, LiDAR, manufacturing inspection — wherever heavy payloads are born at the edge and need to become queryable data in the cloud. The crop science forced this architecture into existence, but the blueprint is portable. Swap the payload and the domain-specific transforms, and the rest of the system carries over.

By Anil Bodepudi

Top Deployment Experts

expert thumbnail

John Vester

Senior Staff Engineer,
Marqeta

IT professional with 30+ years expertise in app design and architecture, feature development, and project and team management. Currently focusing on establishing resilient cloud-based services running across multiple regions and zones. Additional expertise architecting (Spring Boot) Java and .NET APIs against leading client frameworks, CRM design, and Salesforce integration.
expert thumbnail

Raghava Dittakavi

Manager , Release Engineering & DevOps,
TraceLink

The Latest Deployment Topics

article thumbnail
How to Build a Local LLM Agent to Automate Work List Generation from Monthly Reports (With Jira Integration)
Learn how a local LLM agent automates work list generation from reports, enriches tasks from Jira, detects duplicates, and keeps enterprise data secure.
June 11, 2026
by Sergey Laptick
· 954 Views · 1 Like
article thumbnail
The Repo Tracker: Automating My Daily GitHub Catch-Up
Automate GitHub repo tracking with a local agent using Python, SQLite, and cron. Learn how to build a lightweight monitoring system for open-source projects.
June 11, 2026
by Alain Airom
· 512 Views
article thumbnail
Deployment Lessons You Only Learn the Hard Way
Learn how to reduce deployment risk with effective monitoring, rollback strategies, incident response playbooks, and recovery practices.
June 10, 2026
by Sandesh Basrur
· 515 Views
article thumbnail
Building a RAG-Powered Bug Triage Agent With AWS Bedrock and OpenSearch k-NN
Learn how a RAG-powered bug triage agent uses AWS Bedrock, OpenSearch, and dynamic scoring to automate crash analysis and routing.
June 9, 2026
by Rajasekhar sunkara
· 678 Views
article thumbnail
Amazon Quick: AWS's Agentic Workspace, Explained for Engineers
A technical deep dive into Amazon Quick — how it works, how it connects to your tools via MCP, and where it sits in the AWS agent stack.
June 9, 2026
by Jubin Abhishek Soni DZone Core CORE
· 764 Views
article thumbnail
The Big Data Architecture Blueprint: Core Storage, Integration, and Governance Patterns
This comprehensive technical guide breaks down the essential architectural, storage, and integration patterns required to scale enterprise big data platforms.
June 8, 2026
by Ram Ghadiyaram DZone Core CORE
· 1,210 Views
article thumbnail
How to Interpret the Number of Spring ApplicationContexts in Integration Tests
When optimizing Spring Boot integration tests, developers often focus on obvious metrics, but they do not always explain why an integration test suite is slow.
June 8, 2026
by Constantin Kwiatkowski
· 976 Views
article thumbnail
Mastering Fluent Bit: Beginners' Guide for Contributing to our CNCF Project Docs
This intro to mastering Fluent Bit covers the entry point for developers that want to contribute to a CNCF documentation project but are not sure how.
June 8, 2026
by Eric D. Schabell DZone Core CORE
· 815 Views · 1 Like
article thumbnail
Mastering Fluent Bit: Beginners' Guide for Contributing to Our CNCF Project Website
This intro to mastering Fluent Bit covers the entry point for developers that want to contribute to a CNCF project website but are not sure how.
June 5, 2026
by Eric D. Schabell DZone Core CORE
· 2,183 Views · 1 Like
article thumbnail
Advanced Error Handling and Retry Patterns in Enterprise REST Integrations
Blind retries amplify outages fast. Classify failures first, jitter your backoff, and circuit-break early before cascading.
June 4, 2026
by Anil guntupalli
· 2,319 Views · 1 Like
article thumbnail
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2
Build a Slack bot using AWS Bedrock and MCP to answer GitHub questions. Learn setup, architecture, and how to extend it with new tools and data sources.
June 4, 2026
by Sangharsh Agarwal
· 1,807 Views
article thumbnail
Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
How AI-native tooling is finally closing the loop between compliance personas and OSCAL artifacts with an MCP-standardized, AI-agent-ready interface.
June 4, 2026
by Yuji Watanabe
· 1,919 Views
article thumbnail
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
Building a Slack bot with traditional APIs led to 400 lines of code. Using MCP and AWS Bedrock reduced complexity, enabling scalable, tool-driven automation.
June 3, 2026
by Sangharsh Agarwal
· 2,077 Views · 1 Like
article thumbnail
When One MVP Is Really Four Systems: A Better Way to Plan Multi-Role Apps
Many MVPs get too big because teams treat several user-facing systems and vendor-dependent workflows as one app instead of planning one complete path first.
June 2, 2026
by Kajol Shah
· 1,333 Views
article thumbnail
Event-Driven Pipelines With Apache Pulsar and Go
Build scalable, real-time pipelines with Apache Pulsar and Go using event-driven producers and consumers that communicate via Pulsar topics.
May 29, 2026
by Shivi Kashyap
· 2,687 Views
article thumbnail
Zero-Downtime Deployments for Java Apps on Kubernetes
Achieve zero-downtime deployments for Java applications on Kubernetes using rolling updates, readiness/liveness probes, and graceful shutdown strategies.
May 29, 2026
by Ramya vani Rayala
· 3,578 Views
article thumbnail
Pragmatica Aether: Let Java Be Java
A modern, distributed, fault-tolerant runtime environment for the language that was intentionally designed for managed environments.
May 29, 2026
by Sergiy Yevtushenko
· 3,750 Views · 1 Like
article thumbnail
Contract-First Integration: Building Scalable Systems With Flyway, OpenAPI, and Kafka
Define API, event, and DB contracts upfront to enable parallel development, catch breaking changes in CI, and maintain consistent, reliable integrations.
May 29, 2026
by Wallace Espindola
· 2,216 Views
article thumbnail
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Learn how to build an ETL pipeline with human-in-the-loop approval that costs nothing while waiting — and see real cost data from processing 1,000 documents.
May 28, 2026
by Harpreet Siddhu
· 3,852 Views · 1 Like
article thumbnail
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
Design a stateless JWT auth service with Spring Boot 3, Redis caching, and Sentinel for high availability, faster token validation, and reduced DB load.
May 27, 2026
by Erkin Karanlık
· 3,240 Views · 1 Like
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • ...
  • Next
  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook
×