DZone

Cloud Architecture

Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!

Latest Premium Content
Trend Report
Cloud Native
Refcard #370
Data Orchestration on Cloud Essentials
Refcard #379
Getting Started With Serverless Application Architecture

DZone's Featured Cloud Architecture Resources

Zero-Downtime Deployments for Java Apps on Kubernetes


By Ramya vani Rayala
This article provides a comprehensive guide to achieving zero-downtime deployments for Java-based applications on Kubernetes. We cover deployment strategies, Kubernetes primitives, Java-specific considerations, session state handling, database migrations, traffic shifting techniques, CI/CD pipelines, GitHub Actions, Jenkins with automated rollbacks, observability (Prometheus, Grafana, Jaeger), Helm/ArgoCD examples, testing strategies (canary analysis, chaos, smoke tests), and troubleshooting.

Deployment Strategies

Kubernetes offers several strategies for deploying new versions without downtime:

Rolling Update

Incrementally replace old pods with new ones, maintaining availability. The Kubernetes Deployment object uses rolling updates by default. You can control maxUnavailable and maxSurge to tune the rollout.

Blue-Green Deployment

Run two separate environments: blue is the current version, green is the new one. Only one serves live traffic at a time. Once the green version is verified, switch the Service or Ingress to point at green, then scale down blue. This allows instant rollback by redirecting traffic back to blue. Argo Rollouts defines a blue/green strategy with an active and a preview Service. Traffic flows only to the active version until promotion.

Canary Deployment

Gradually shift a small percentage of traffic to the new version. Start with a few pods of v2, monitor, then incrementally increase. Tools like Istio or Argo Rollouts can control traffic weights. Without such a tool, roughly 10% of traffic can be sent to v2 by running 9 v1 pods and 1 v2 pod behind the same Service. Argo Rollouts defines a canary rollout with setWeight steps and pauses for analysis.

Shadow/Mirroring

The new version receives a copy of live requests for testing under real load, but its responses are not returned to users. This is low risk but does not assist in rollback decisions since users don't see the new behavior.

Kubernetes Primitives for Zero Downtime

Deployment

A Deployment naturally performs rolling updates.
By default, it creates a new ReplicaSet and scales it up while scaling down the old one, at a pace controlled by maxUnavailable and maxSurge. This ensures some pods always serve traffic. To use blue/green, you would deploy two separate Deployments (e.g., app-blue, app-green) and switch Services.

Service and Ingress

A Service fronts pods. For blue/green, you can point a single Service at either the blue or green pods. Ingress can also switch between backend services. For example, label selectors can be adjusted to redirect traffic from the blue pods to the green pods.

PodDisruptionBudget

Ensures a minimum number of pods stay running during voluntary disruptions. For instance, setting minAvailable: 1 ensures at least one pod remains while a node is drained, which avoids complete downtime during maintenance.

Horizontal Pod Autoscaler (HPA)

Scales pods based on CPU/memory or custom metrics. It automatically updates a workload to match demand. An HPA can be attached to the Deployment so that if traffic spikes during a rollout, new pods will be created to handle the load. Example:

YAML
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: myapp-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: myapp
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 50

Liveness and Readiness Probes

Critical for zero downtime. A liveness probe checks if the app is alive; if it fails, Kubernetes restarts the container. A readiness probe tells whether the app is ready to serve traffic. During startup or shutdown, the readiness probe should fail, causing the pod to be removed from the Service endpoints. Spring Boot Actuator provides /actuator/health for this.
In Kubernetes YAML:

YAML
    livenessProbe:
      httpGet:
        path: /actuator/health/liveness
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /actuator/health/readiness
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

Spring Boot exposes the health/liveness and health/readiness groups automatically when it detects a Kubernetes environment; elsewhere they can be enabled with management.endpoint.health.probes.enabled=true. Quarkus and Micronaut have similar health endpoints.

Spring Boot supports graceful shutdown by setting server.shutdown=graceful and tuning spring.lifecycle.timeout-per-shutdown-phase. This causes the embedded server (Tomcat, Jetty, or Undertow) to stop accepting new requests and wait up to the timeout for active requests to complete. A SmartLifecycle bean can take part in the same shutdown phases:

Java
    import org.springframework.context.SmartLifecycle;
    import org.springframework.stereotype.Component;

    @Component
    public class ShutdownListener implements SmartLifecycle {
        private volatile boolean running = false;

        @Override
        public void start() {
            running = true;
        }

        // Called by Spring during context shutdown; release resources here.
        @Override
        public void stop() {
            running = false;
        }

        @Override
        public boolean isRunning() {
            return running;
        }
    }

Quarkus provides graceful shutdown configuration. By setting quarkus.shutdown.timeout=10s, Quarkus will wait up to 10 seconds for current requests to finish before exiting. You can annotate a bean method with @Shutdown to run cleanup code. Micronaut has @EventListener for ShutdownEvent:

Java
    import io.micronaut.context.event.ShutdownEvent;
    import io.micronaut.runtime.event.annotation.EventListener;
    import jakarta.inject.Singleton;

    @Singleton
    public class ShutdownBean {
        @EventListener
        void onShutdown(ShutdownEvent event) {
            // Cleanup code runs here when the application context shuts down.
        }
    }

Kubernetes Hooks

You can use a preStop hook in the Deployment's pod spec to run a command before the container receives SIGTERM.

YAML
    spec:
      terminationGracePeriodSeconds: 30   # pod-level setting
      containers:
        - name: myapp
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 5"]

The grace period (default 30s) should be tuned to let the app finish. The Kubernetes documentation describes the sequence: the pod enters Terminating, the preStop hook runs, the container receives SIGTERM, and once terminationGracePeriodSeconds has elapsed any remaining processes receive SIGKILL. The grace period includes the time spent in the preStop hook.

JVM Tuning

Set -XX:+ExitOnOutOfMemoryError so the JVM exits instead of hanging after an OutOfMemoryError. Tune thread pools so they drain quickly. Monitor GC pause times, and consider using a low-latency GC to minimize pauses before shutdown.
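The drain-on-shutdown advice above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not what Spring Boot or Quarkus do internally: `DrainingWorkerPool` is a hypothetical class, and the 20-second budget is an arbitrary value chosen to stay below a 30-second termination grace period minus a 5-second preStop sleep. The JVM runs shutdown hooks when it receives SIGTERM; here `System.exit` stands in for that signal.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: a worker pool that lets in-flight tasks finish on shutdown.
final class DrainingWorkerPool {
    private final ExecutorService pool;

    DrainingWorkerPool(int threads) {
        this.pool = Executors.newFixedThreadPool(threads);
    }

    void submit(Runnable task) {
        pool.execute(task);
    }

    /**
     * Stops accepting new tasks and waits up to the timeout for queued and
     * running tasks. Returns true if everything finished in time; otherwise
     * interrupts running tasks and discards the ones that never started.
     */
    boolean drain(long timeout, TimeUnit unit) {
        pool.shutdown(); // reject new tasks, keep processing accepted ones
        try {
            if (pool.awaitTermination(timeout, unit)) {
                return true;
            }
            List<Runnable> neverStarted = pool.shutdownNow();
            System.err.println("Drain timed out; discarded " + neverStarted.size() + " queued tasks");
            return false;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DrainingWorkerPool workers = new DrainingWorkerPool(4);
        // Assumed budget: 20 seconds, shorter than the pod's grace period.
        Runtime.getRuntime().addShutdownHook(
            new Thread(() -> workers.drain(20, TimeUnit.SECONDS)));
        workers.submit(() -> System.out.println("handling request"));
        System.exit(0); // stands in for SIGTERM; triggers the shutdown hook
    }
}
```

The design point is the ordering: refuse new work first, then wait with a bounded timeout, so the process exits on its own before Kubernetes has to send SIGKILL.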
Session and State Handling

To maintain zero downtime when pods switch:

  • Stateless services: Best practice is to keep services stateless. Store session state or user data in an external store, such as Redis or a database. This way, any pod can handle any request, and pods can be replaced without losing the user session.
  • Sticky sessions: If an app uses in-memory sessions, you can enforce sticky sessions.
      • Service affinity: Set sessionAffinity: ClientIP on the Service. Kubernetes routes requests from the same client IP to the same pod.
      • Ingress affinity: Use Ingress annotations to bind a user's requests to one pod. However, sticky sessions introduce risk (the session is lost when its pod is replaced) and work poorly with autoscaling.
  • StatefulSets: For true stateful workloads, use a StatefulSet with stable identities. StatefulSets pair pods with PersistentVolumes, but they do not provide zero downtime by themselves.

GitHub Actions CI/CD Pipeline

An example workflow for zero-downtime deployment:

YAML
    name: Deploy
    on:
      push:
        branches: [ main ]
    jobs:
      build:
        runs-on: ubuntu-latest
        permissions:
          contents: read
          packages: write            # needed to push to ghcr.io with GITHUB_TOKEN
        outputs:
          image: ${{ steps.image.outputs.image }}
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-java@v4
            with: { distribution: 'temurin', java-version: '17' }
          - name: Build
            run: mvn clean package -DskipTests
          - name: Docker Build & Push
            run: |
              docker build -t ghcr.io/myorg/myapp:${{ github.sha }} .
              echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin
              docker push ghcr.io/myorg/myapp:${{ github.sha }}
          - name: Set image tag
            id: image
            run: echo "image=ghcr.io/myorg/myapp:${{ github.sha }}" >> "$GITHUB_OUTPUT"
      deploy:
        needs: build
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with: { path: manifests }
          - name: Install kubectl
            uses: azure/setup-kubectl@v3
          # Cluster credentials (kubeconfig) are assumed to be configured separately.
          - name: Deploy to Kubernetes
            run: |
              kubectl set image deployment/myapp-deployment myapp=${{ needs.build.outputs.image }}
              kubectl rollout status deployment/myapp-deployment

This workflow builds the image, pushes it, and updates the deployment. The rollout status command waits for all new pods to become ready.
If the new pods never pass their readiness checks, the command exits with an error once the Deployment's progress deadline is exceeded. The old pods keep serving traffic in the meantime, and kubectl rollout undo restores the previous revision.

Conclusion

Zero-downtime deployment on Kubernetes combines careful architecture with automation: rolling updates and progressive strategies, graceful shutdown and health checks in your Java apps, externalized state, managed database changes, and orchestration through CI/CD pipelines. Kubernetes primitives like Deployments, Services, Probes, and HPA, along with tools like Istio or Argo Rollouts, provide the building blocks.
Pragmatica Aether: Let Java Be Java


By Sergiy Yevtushenko
The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. 
Update the Java version: rebuild and test every service. Change your message broker from RabbitMQ to Kafka: modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool: update dependencies in every microservice. Switch cloud providers: rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level.

This is the coupling trap. Your application's pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them.

Framework lock-in worsens this. It isn't a vendor problem; it's an architecture problem. Spring's dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework's configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK's retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production, not during development.

This is an architecture problem. Architectural problems have architectural solutions.

Aether: The Core Idea

What you write: an interface annotated with @Slice, plus a business logic implementation.

Java
    @Slice
    public interface OrderService {
        Promise<OrderResult> placeOrder(PlaceOrderRequest request);

        static OrderService orderService(InventoryService inventory, PricingEngine pricing) {
            return request -> inventory.check(request.items())
                                       .flatMap(available -> pricing.calculate(available))
                                       .map(priced -> OrderResult.placed(priced));
        }
    }

What you don't write: everything else. No HTTP clients: inter-slice calls are direct method invocations via generated proxies.
No service discovery: the runtime tracks where every slice instance lives. No retry logic: built-in retry with exponential backoff and node failover. No circuit breakers: the reliability fabric handles failure automatically. No serialization code: request/response types are serialized transparently. A method call via an imported interface is the only visible contract.

The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn't a limitation; it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly.

Everything else is the environment's job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level.

The JBCT Leaf pattern serves two purposes here: it documents the design ("what we expect from an external implementation") and encourages exactly one interface per dependency. Different implementations may have different technical properties (performance, latency, memory consumption), but as long as they're compatible with the interface, business logic works unchanged.

You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently.

Under The Hood: What Makes It Work

Five architectural decisions make this possible.

Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. It is based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm published in 2021.
Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide.

Built-in Artifact Repository. DHT-based storage with configurable replication: 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained.

ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices.

Declarative Deployment. Blueprints (TOML files) describe the desired state: which slices, how many instances.

TOML
    id = "org.example:commerce:1.0.0"

    [[slices]]
    artifact = "org.example:inventory-service:1.0.0"
    instances = 3

    [[slices]]
    artifact = "org.example:order-processor:1.0.0"
    instances = 5

Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically.

Infrastructure Independence. Aether nodes are identical, so there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java: roll it out across nodes without touching applications. Update the Aether runtime: same.
Update business logic: deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don't share a deployment unit, they don't share a deployment schedule.

Fault Tolerance: The 50% Rule

The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact: actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result.

Quorum requires floor(N/2) + 1 nodes, so as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically: the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence, from failure detection through state restoration to serving traffic, completes without human intervention.

When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required.

Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing:

Shell
    aether update start org.example:order-processor 2.0.0 -n 3
    aether update routing <id> -r 1:3    # 25% to v2, 75% to v1
    aether update routing <id> -r 1:1    # 50/50
    aether update complete <id>          # 100% to v2, drain v1

Deploy during business hours. Shift traffic gradually: 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step.
If health degrades (error rate exceeds thresholds, latency spikes), roll back instantly with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry.

For Every Project: Legacy, Greenfield, And Everything Between

Legacy Migration

Your legacy Java system doesn't need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system: something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation:

Java
    private Promise<Report> generateReport(ReportRequest request) {
        return Promise.lift(() -> legacyReportService.generate(request));
    }

One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk: the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change, and that's when the 50% rule starts to apply.

From there, it's the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious.

No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail, from initial wrapping through gradual peeling to clean JBCT code.
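The idea behind the wrapping step can be illustrated with a stand-alone sketch. It deliberately does not use Pragmatica's real `Promise` or `Result` types: `LiftSketch`, `Outcome`, and this `lift` method are hypothetical names, and the JDK's `CompletableFuture` stands in for a promise. What it shows is only the shape of the technique: a legacy call that may throw is converted into an asynchronous value carrying either a result or the failure, so exceptions never escape to the caller.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

final class LiftSketch {
    // Hypothetical stand-in for a result type: error is null on success.
    record Outcome<T>(T value, Throwable error) {
        boolean isSuccess() {
            return error == null;
        }
    }

    // Runs the legacy call asynchronously and captures any exception as data.
    static <T> CompletableFuture<Outcome<T>> lift(Callable<T> legacyCall) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                return new Outcome<>(legacyCall.call(), null);
            } catch (Exception e) {
                return new Outcome<T>(null, e);
            }
        });
    }

    public static void main(String[] args) {
        CompletableFuture<Outcome<String>> ok = lift(() -> "quarterly report");
        CompletableFuture<Outcome<String>> failed =
            lift(() -> { throw new IllegalStateException("legacy failure"); });

        System.out.println("ok succeeded: " + ok.join().isSuccess());
        System.out.println("failed succeeded: " + failed.join().isSuccess());
    }
}
```

Because the failure travels as a value, call sites can be migrated one at a time without changing how the legacy code itself reports errors.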
Greenfield Development

For new projects, slices enable a granularity that's impossible with traditional microservices. Each slice can be as lean as a single method, and that's the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service.

You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices, each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it's the default.

JBCT patterns (Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects) compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them.

The Spectrum

Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. A slice is the executable unit. It can be as big or as small as necessary and convenient.

The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility.

Scaling: Two Levels, Three Tiers of Intelligence

Two-Level Horizontal Scaling

Aether scales in two dimensions independently:

  • Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded, so scaling takes milliseconds, not seconds.
  • Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work.

Independent controls, combined effect.
Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. 
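The Tier 1 rules described above can be sketched as a small decision function. This is a minimal illustration, not Aether's controller: `ReactiveScaler`, `Metrics`, and the constructor parameters are hypothetical names. The 70% and 30% CPU thresholds, the 1% error rate, and the 30-second and 5-minute cooldowns come from the text; measuring both cooldowns from the most recent scaling action of either kind, and ignoring the "sustained" condition for scale-down, are simplifying assumptions.

```java
import java.time.Duration;
import java.time.Instant;

final class ReactiveScaler {
    enum Decision { SCALE_UP, SCALE_DOWN, HOLD }

    // Hypothetical metrics snapshot; cpu and errorRate are fractions (0.0 to 1.0).
    record Metrics(double cpu, double p95LatencyMs, double errorRate, int instances) {}

    private static final Duration UP_COOLDOWN = Duration.ofSeconds(30);
    private static final Duration DOWN_COOLDOWN = Duration.ofMinutes(5);
    private static final double MAX_ERROR_RATE = 0.01;

    private final double scaleUpCpu;
    private final double scaleDownCpu;
    private final double latencyLimitMs; // assumed P95 latency threshold
    private final int minInstances;
    private Instant lastChange; // null until the first scaling action

    ReactiveScaler(double scaleUpCpu, double scaleDownCpu, double latencyLimitMs, int minInstances) {
        this.scaleUpCpu = scaleUpCpu;
        this.scaleDownCpu = scaleDownCpu;
        this.latencyLimitMs = latencyLimitMs;
        this.minInstances = minInstances;
    }

    Decision decide(Metrics m, Instant now) {
        boolean overloaded = m.cpu() > scaleUpCpu
            || m.p95LatencyMs() > latencyLimitMs
            || m.errorRate() > MAX_ERROR_RATE;
        if (overloaded) {
            return applyCooldown(Decision.SCALE_UP, UP_COOLDOWN, now);
        }
        if (m.cpu() < scaleDownCpu && m.instances() > minInstances) {
            return applyCooldown(Decision.SCALE_DOWN, DOWN_COOLDOWN, now);
        }
        return Decision.HOLD;
    }

    // Suppresses the action if the previous scaling action was too recent.
    private Decision applyCooldown(Decision wanted, Duration cooldown, Instant now) {
        if (lastChange != null && Duration.between(lastChange, now).compareTo(cooldown) < 0) {
            return Decision.HOLD;
        }
        lastChange = now;
        return wanted;
    }

    public static void main(String[] args) {
        ReactiveScaler scaler = new ReactiveScaler(0.70, 0.30, 200.0, 1);
        Instant start = Instant.now();
        System.out.println(scaler.decide(new Metrics(0.85, 120.0, 0.0, 2), start));
        System.out.println(scaler.decide(new Metrics(0.85, 120.0, 0.0, 3), start.plusSeconds(10)));
    }
}
```

Passing the thresholds in through the constructor leaves room for a predictive tier to construct the scaler with a lower scale-up threshold without touching the decision logic itself.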
You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. 
Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. 
Or start fresh with maximum clarity: lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist, with legacy service slices and new lean slices running side by side.

Fault tolerance is not an afterthought; it's the foundation. Scaling is not your problem; it's the environment's. Infrastructure is not your code; it's the runtime's.

The heavy winter coat comes off. The application breathes.

Resources

  • Pragmatica Aether: project site
  • GitHub Repository: source code
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
By Harpreet Siddhu
Docker Hardened Images Are Free Now — Here's What You Still Need to Build
By Shamsher Khan, DZone Core
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me
By Kuladeep Sandra
Catching Data Perimeter Drift Before It Reaches Production

Cloud providers provide tools for customers to prevent data exfiltration attempts by creating a data perimeter — a set of permission guardrails that ensure that only trusted identities from expected networks can access trusted resources [1]. For example, a company can set up controls so that users within its organization can access only their company-specific S3 buckets from their corporate networks. Any other access patterns will be denied. These are important for organizations that are generally sensitive to data exfiltration, such as finance, healthcare, and government. Setting up a data perimeter in AWS involves creating an organization-wide policy and network policy. Service control policies (SCP) [10] and resource control policies (RCP) [11] define the maximum allowable permissions for a given identity or a resource, while VPC endpoint policies [12] define the maximum allowable permissions for a given service through a private network. Together, these controls establish a boundary around the organization’s network and resources to enforce a data perimeter. In this article, we focus on establishing and maintaining data perimeter controls for a specific access pattern: users managing their resources through the AWS Management console — a web interface for resource management. This is a unique scenario involving complex setup steps, multiple service dependencies, and a high probability of data perimeter drifting over time. Then, we introduce a development-time validation pattern, demonstrated using Kiro's Powers feature as one implementation. We explain how to encode a team's knowledge into the development process and how to catch possible data perimeter drift during development. This article does not intend to replace critical controls such as infrastructure monitoring, integration tests, and notification systems. 
Rather, it aims to work in tandem with them, helping teams avoid accidentally breaking their data perimeter setups by catching issues early, during development. This way, teams do not have to wait until code is deployed to a non-production environment to catch data perimeter breaches. This matters because some organizations have separate testing and CloudOps/DevOps teams that take a long time to deploy and iterate, so detecting and fixing a data perimeter setup issue can cost days. The focus is on catching breaking changes before a pull request is made and on encoding historical context into code. The same pattern can be replicated in any IDE via a simple team-built IDE extension or a managed service.

Background: AWS Management Console Private Access

The AWS Management Console (console) provides network perimeter controls via VPC endpoints. This feature works in tandem with AWS Sign-In to prevent unauthorized access to the AWS Management Console. With this feature, customers can “limit access to the AWS Management Console only to a specified set of known AWS accounts when the traffic originates from within their network. Console Private Access is also useful when customers want to ensure that all calls from the AWS Management Console to AWS services originate from within their network and from allowed accounts.” [2].

Setting up AWS Management Console Private Access is unusual in that the console itself makes API requests to AWS service endpoints. If we load the S3 console web page, the console makes API requests to the S3 endpoint to load the list of buckets. This means that if a company wants a truly isolated network, it must set up not just a console VPC endpoint but also an S3 VPC endpoint. Additionally, static assets must be routed through the public internet because assets and console-supporting API calls do not have a VPC endpoint.

Problems

DNS Setup

Setting up console private access requires creating two VPC endpoints: console and signin [3].
Typically, VPC endpoints for AWS services come with private DNS support, which lets requests from within a private subnet resolve the service's DNS names. For example, enabling private DNS on the S3 endpoint makes S3 requests resolve within the VPC. However, the console and signin endpoints do not provide private DNS names. Instead, customers are asked to set up Route53 private hosted zones for the console and signin domains and attach them to their VPCs so that the console endpoints resolve correctly [4]. This adds friction to an otherwise standard process of creating VPC endpoints and toggling private DNS support for them.

Service Endpoints

Service-specific management consoles, such as the S3 console, depend not only on the S3 API endpoint but also on the CloudWatch monitoring API endpoint (e.g., monitoring.us-east-1.amazonaws.com). This knowledge is encoded in a JSON file [4], from which customers are expected to pick up all service endpoints for a given region to make all service-specific consoles work. Missing endpoints in that list can result in a broken web experience.

Endpoint Policies

Endpoint policies control which users can access the AWS Management Console from a trusted network, while unauthorized users are denied login to the console. The endpoint policy format for AWS Management Console Private Access is slightly different from that of other services: it does not support the full set of context keys, and it requires every Principal and Resource to be set to * and the Action to be either * or signin:*. If this is not documented or tested properly, someone can accidentally put a more specific action on the endpoint policy, breaking AWS Management Console access over PrivateLink.

Infrastructure Management

AWS provides a CloudFormation stack example to set up Private Access. While it works, it does not scale to real-world environments. It becomes unmaintainable for teams to keep updating and deploying CFN stacks.
The better alternative is AWS CDK, which helps manage infrastructure as code, but there are no examples online for this topic.

Operational Best Practices

Setting up PrivateLink for a service endpoint is generally part of a bigger project: setting up data perimeter controls for a large organization. Today, the example CloudFormation template [5] does not include important components such as monitoring for potential data exfiltration attempts, validating that endpoint policies in the stack are set correctly, and enabling CloudTrail network activity and data events. These steps are necessary for a production-ready data perimeter in any organization. In particular, if customers do not enable CloudTrail data and network events, they lose visibility into who accessed their resources and cannot trace data exfiltration attempts. This is important because VPC endpoints are an integral part of enforcing network perimeter protection [6].

Data Perimeter Drift

As teams evolve and new team members start contributing to and maintaining the data perimeter code, historical context (the whys) behind the existing setup may be lost. This is common, especially when software has been maintained for several years, with documents scattered across multiple sources and code comments not kept up to date. For example, if a team member attempts to “optimize” or “unify” VPC endpoint creation and removes the Route53 special handling, the Private Access setup breaks because DNS no longer resolves from within the VPC. In the best case, this change breaks non-production systems. In the worst case, it breaks large production systems and leads to business outages. Therefore, there must be some way of encoding and capturing historical context and of validating changes against it.
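To make the two failure modes above concrete (a removed Route53 hosted zone and an over-specific endpoint policy action), here is a minimal sketch of the kind of check a development-time validator could run. It is an illustration only, not code from the author's repository: the function names and sample data are hypothetical, the signin zone name `signin.aws.amazon.com` is an assumption, and the template is assumed to contain literal strings rather than CloudFormation intrinsic functions.

```python
# Hypothetical drift check; names and sample data are illustrative only.
REQUIRED_ZONES = {"console.aws.amazon.com", "signin.aws.amazon.com"}  # signin name is an assumption
ALLOWED_ACTIONS = {"*", "signin:*"}  # constraint stated in the Endpoint Policies section


def _as_list(value):
    """IAM policy fields may hold one value or a list; normalise to a list."""
    return value if isinstance(value, list) else [value]


def missing_hosted_zones(template):
    """Return required hosted zone names that are absent from a template dict."""
    zones = {
        res.get("Properties", {}).get("Name", "").rstrip(".")  # tolerate a trailing dot
        for res in template.get("Resources", {}).values()
        if res.get("Type") == "AWS::Route53::HostedZone"
    }
    return sorted(REQUIRED_ZONES - zones)


def endpoint_policy_problems(policy):
    """Return readable problems for a console/signin endpoint policy dict."""
    problems = []
    for i, stmt in enumerate(_as_list(policy.get("Statement", []))):
        if stmt.get("Principal") != "*":
            problems.append(f"statement {i}: Principal must be '*'")
        if _as_list(stmt.get("Resource")) != ["*"]:
            problems.append(f"statement {i}: Resource must be '*'")
        if any(a not in ALLOWED_ACTIONS for a in _as_list(stmt.get("Action"))):
            problems.append(f"statement {i}: Action must be '*' or 'signin:*'")
    return problems


# A template where someone "unified" endpoint creation and dropped the signin zone.
template = {
    "Resources": {
        "ConsoleHostedZone": {
            "Type": "AWS::Route53::HostedZone",
            "Properties": {"Name": "console.aws.amazon.com."},
        }
    }
}
# A policy where someone narrowed the action.
narrowed_policy = {
    "Statement": [
        {"Effect": "Allow", "Principal": "*", "Action": "s3:GetObject", "Resource": "*"}
    ]
}

print(missing_hosted_zones(template))
print(endpoint_policy_problems(narrowed_policy))
```

A check like this says nothing about why the rule exists, so it works best next to a short, version-controlled note explaining the constraint.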
Proposal

To begin, we group our problems into four categories:

1. Infrastructure setup: setup complexity such as Route53 DNS entries, AWS service API endpoints, and maintaining proper VPC endpoint policies.
2. Operational best practices: necessary components such as monitoring, alarming, and detection.
3. Software evolution: loss of context as team members rotate over time.
4. Data perimeter drift.

To address the infrastructure setup issue (problem #1), we start by encoding the setup instructions in code. AWS provides the Cloud Development Kit (CDK) for maintaining infrastructure as code (IaC). Alternatively, one could use Terraform or shell scripting. In this example, the CDK code compiles into a CloudFormation template, which is then used to provision infrastructure in our AWS account. With CDK, the setup step becomes simpler: instead of maintaining CloudFormation stacks by hand, we use CDK to make our code more readable, maintainable, and testable in a pipeline. I created one such example in my publicly available code repository [7]. A sample snippet is included below:

TypeScript
    import * as route53 from 'aws-cdk-lib/aws-route53';
    import { InterfaceVpcEndpointTarget } from 'aws-cdk-lib/aws-route53-targets';

    // Excerpt from inside the stack; `vpc` and `consoleEndpoint` are defined earlier in the stack.
    // ============ ROUTE53 HOSTED ZONES ===========
    // Console Hosted Zone
    const consoleHostedZone = new route53.PrivateHostedZone(this, 'ConsoleHostedZone', {
      vpc,
      zoneName: 'console.aws.amazon.com',
    });

    // Console records - use alias records to VPC endpoint
    new route53.ARecord(this, 'ConsoleRecordGlobal', {
      zone: consoleHostedZone,
      target: route53.RecordTarget.fromAlias(
        new InterfaceVpcEndpointTarget(consoleEndpoint)
      ),
      recordName: 'console.aws.amazon.com',
    });

Similarly, operational best practices (problem #2) such as monitoring and detection can be encoded in the CDK stack as well. Because CDK integrates deeply with AWS services, we can use it to set up CloudTrail network and data events and to enable monitoring with CloudWatch in the same CDK repository.
My code repository [7] contains one such example, like the snippet below:

TypeScript
    // Helper to create VpceAccessDenied event selectors
    const createVpceAccessDeniedSelector = (serviceName: string, eventSource: string) => ({
      name: `${serviceName} VPC Endpoint Denied Events`,
      fieldSelectors: [
        { field: 'eventCategory', equalTo: ['NetworkActivity'] },
        { field: 'eventSource', equalTo: [eventSource] },
        { field: 'errorCode', equalTo: ['VpceAccessDenied'] },
      ],
    });

While these two pieces of code already simplify setup, someone who lacks the historical background of the setup can still change what the code does (problem #3). This is where development-time validation can help.

To demonstrate this pattern, I built a Kiro Power [9]: a bundle of steering files, MCP tool configuration, and event-driven hooks that encodes project-level domain knowledge and validates changes automatically. The steering file serves as an onboarding manual for the AI agent, describing what tools are available and when to use them. Hooks trigger validation on specific events such as file saves, and the MCP server runs the actual checks against the project's best practices. Unlike a traditional MCP server, a Kiro Power is not loaded into every conversation. It is loaded dynamically, only when the conversation matches keywords we define, which keeps the agent's context window small. The same pattern can be replicated using any team-built IDE extension or managed service.

I created a Kiro Power to set up AWS Management Console Private Access [8]. This power contains (1) knowledge related to Private Access, (2) an MCP validator that checks whether a given CloudFormation template follows best practices, and (3) hooks to review VPC endpoint policies and validate CloudFormation stacks after changes are made.
With this Kiro Power, all project-level information and best practices are encoded in the Power.md file [9]. Historical context can be written to this file and is version-controlled in Git. A team member can now use the Kiro IDE and install the Kiro Power. Any time they want to make a code change, they can ask the Kiro agent to validate it. When the Kiro Power activates, the LLM receives the project's context and responds accordingly. As part of its response, the LLM runs the validator MCP to check that the CDK changes adhere to the company’s best practices.

With this solution, we have found a way to:

• Preserve historical context for the project
• Ensure that best practices are being followed
• Set up a standard development pattern that can scale to multiple developers within the team

If a team member makes a breaking change to the setup, the Kiro agent can catch it right away. This potentially prevents data perimeter drift (problem #4) because the goal and historical context of the project are encoded for the AI agent. The team can also use git hooks to trigger an agent that audits local code changes, alerting the user to potential drift, catching issues, and potentially blocking pull request creation entirely.

To sum up, a development-time data perimeter validator must have three properties:

• Version-controlled context encoding
• Pre-commit validation hooks
• IDE-agnostic tools

Limitations

Kiro Powers is not a one-stop shop for this issue. First, the pattern does not work if some team members do not use the specific IDE or the team's standard development practice. It requires the team to adopt a shared validation step in its development workflow. For example, the team can require every pull request to contain the output of the LLM-validated CFN stack.
Second, it does not mean teams can skip critical steps such as monitoring whether (1) CloudTrail network events are enabled, (2) VPC endpoint policies are set properly, and (3) Route53 resolution still happens within the VPC. These checks are essential for catching breaking changes introduced by CDK changes early on.

Conclusion

We explored a development-time validation pattern to catch data perimeter drift before it reaches production. We applied it to a specific use case: setting up AWS Management Console Private Access, a setup that is easy to break and hard to debug without historical context. The core idea is straightforward: encode the whys behind your infrastructure in version-controlled files and validate changes against that context before a pull request is made. Kiro Powers is one way to implement this pattern, but the same approach works with any team-built IDE extension or validation hook. This does not replace monitoring, CloudTrail, or integration tests. It works alongside them by catching issues earlier, when they are cheapest to fix.
References

[1] Data perimeters on AWS: https://aws.amazon.com/identity/data-perimeters-on-aws/
[2] AWS Management Console private access: https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/console-private-access.html
[3] Required VPC endpoints and DNS configuration: https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/required-endpoints-dns-configuration.html
[4] DNS configuration for AWS Management Console and AWS Sign-In: https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/dns-configuration-console-signin.html
[5] Test setup with Amazon EC2: https://docs.aws.amazon.com/awsconsolehelpdocs/latest/gsg/test-console-private-access-EC2.html
[6] A Subtle Audit Log Consideration in AWS: https://systemweakness.com/a-subtle-audit-log-consideration-in-aws-063752150b20
[7] AWS Console Private Access Setup: https://github.com/sureshgururajan/aws-console-private-access-setup/blob/main/lib/aws-console-private-access-setup-stack.ts
[8] Kiro power for AWS Management Console Private Access: https://github.com/sureshgururajan/aws-console-private-access-setup/tree/main/powers
[9] Kiro Powers: https://kiro.dev/powers/
[10] Service control policies: https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps.html
[11] Resource control policies: https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_rcps.html
[12] Control access to VPC endpoints using endpoint policies: https://docs.aws.amazon.com/vpc/latest/privatelink/vpc-endpoints-access.html

By Suresh Gururajan
Scaling Cloud Data Automation: A Practical Guide to Open Table Formats

When it comes to data analytics, the way we set up our tables matters: it affects how well our systems work and how easily they can grow. Open Table Formats (OTFs) are a core part of how data systems are built today. They make it easier to work across systems and to get more out of our data. In this blog post, we explain what Open Table Formats are, look at some popular examples, and help you figure out which one best fits your data analytics needs.

What Are Open Table Formats?

Open Table Formats are specifications for keeping data organized in tables. No single vendor owns them, so they are designed to work with many tools and systems. The goal is to make data easy to share and use, so that teams can work together smoothly no matter which engine or platform they use.

Popular Open Table Formats

Here are some Open Table Formats that are widely used:

Apache Iceberg

Apache Iceberg is a table format for working with large datasets in a controlled way. It provides ACID transactions, which guarantee that changes to the data are applied correctly, and isolation, so we can read data without worrying about other people changing it at the same time.
Apache Iceberg also allows for schema evolution, which means we can change how the data is organized without starting over. It is especially useful for people who work with large datasets in data lakes.

Advantages: It ensures data consistency, supports queries, and lets the schema evolve over time without losing data.

Use Cases: Ideal for data lakes requiring transactional guarantees and schema flexibility.

Delta Lake

Delta Lake is an open source way to store data that helps keep the data reliable. When many people use the data at the same time, Delta Lake makes sure they do not run into problems. It keeps track of a lot of metadata about the data, and it handles both data that arrives continuously and data that arrives in large batches, using ACID transactions to keep everything correct. Delta Lake works well with large amounts of data.

Advantages: It keeps data reliable, lets you go back and look at older versions of the data, and works well with the tools that process the data.

Use Cases: Suitable for data lakes requiring reliability, data versioning, and unified data processing.
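The properties described above (all-or-nothing commits, readers that are not disturbed by concurrent writers, and the ability to look at older versions) can be pictured with one simple mechanism: every successful write publishes a new immutable snapshot, and a reader works against the snapshot it asked for. The toy model below illustrates only that idea in plain Python. It is not the API or storage layout of Delta Lake or Apache Iceberg, and the class `ToyTable` and its methods are hypothetical names invented for this illustration.

```python
class ToyTable:
    """Toy append-only table: each commit publishes a new immutable snapshot."""

    def __init__(self):
        self._snapshots = [()]  # version 0 is the empty table

    def commit(self, new_rows):
        """Validate all rows first, then publish them in one step (all or nothing)."""
        rows = tuple(dict(r) for r in new_rows)  # copy so callers cannot mutate history
        for row in rows:
            if "id" not in row:
                raise ValueError("every row needs an 'id'")
        self._snapshots.append(self._snapshots[-1] + rows)
        return len(self._snapshots) - 1  # the new version number

    def read(self, version=None):
        """Read the latest snapshot, or an older one by version number."""
        if version is None:
            version = len(self._snapshots) - 1
        return [dict(r) for r in self._snapshots[version]]


table = ToyTable()
v1 = table.commit([{"id": 1, "city": "Oslo"}])
# A later commit adds a row with an extra column; old rows are left untouched.
v2 = table.commit([{"id": 2, "city": "Lima", "country": "PE"}])

try:
    table.commit([{"id": 3}, {"city": "no id here"}])  # invalid batch
except ValueError:
    pass  # nothing from the failed batch was published

print(table.read(v1))   # the table as of version 1
print(len(table.read()))  # row count in the latest version
```

Real table formats persist snapshots as metadata and data files rather than in memory, but the reader-side behavior shown here (ask for a version, get a stable view) is the part the article is describing.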
Apache Hudi

Apache Hudi is an open source tool for working with data in data lakes. It helps you add new data to the data you already have, and it makes it easier to build pipelines that move data around. Its main strength is that it simplifies incremental data processing and building data pipelines on data lakes, which is very useful when a lot of data has to be processed.

Advantages: It supports processing data a little at a time, keeps track of versions of the data, and makes it easy to ingest and query the data.

Use Cases: Ideal for data lakes requiring incremental data processing and data pipeline management.

Choosing the Right Open Table Format

Picking an Open Table Format for your data analytics needs means thinking about several things: what you will use it for, what your data looks like, whether it works with the systems you already use, and how well it needs to perform. Here are the important factors to consider:

Use Cases and Workloads

Transactional guarantees: When you need safe transactions and consistent data, consider Apache Iceberg or Delta Lake. Both provide ACID transactions, which guarantee that changes are applied correctly and the data stays consistent.
If guaranteed consistency is your priority, Apache Iceberg and Delta Lake are the way to go.

Incremental data processing: For people who process data incrementally and manage data pipelines, Apache Hudi is an option to consider.

Data Characteristics

Data volume: Think about how much data you will have to store and process. Some formats handle large datasets better than others, and data volume becomes a problem if you are not using the right format for your data.

Data complexity: Find out how complicated your data is. Look at the types of data you have, and consider whether you will need to change how the data is organized. Formats such as Apache Iceberg and Delta Lake are helpful here because they let you change the schema without a lot of trouble.

Ecosystem Compatibility

Frameworks and tools: When you choose an Open Table Format, make sure it works well with the data processing frameworks and tools you already use. For example, Delta Lake works with Apache Spark.
You want the Open Table Format and your existing data processing tools to work together smoothly.

Cloud platforms: Check whether the OTF is compatible with the cloud platform you want to use, and whether it works with any on-premises infrastructure you have.

Performance Requirements

Query performance: Test how well the OTF handles your query patterns and analytical workloads, and whether it returns the results you need fast enough. This will help you decide whether the format is a good fit for the kind of work you do.
Data ingestion: Check how efficiently the format ingests data, especially when analytics must run in real time or near real time. Find out how it behaves with large volumes of incoming data and how fast it can process them.

Open Table Formats are an important part of working with data today. They make it easier to work across systems and support a wide range of workloads. If you understand what makes Apache Iceberg, Delta Lake, and Apache Hudi different, you can pick the one that is best for your company. Think about what your data is like, what you want to do with it, which tools you are using, and how you expect the data to change. Then choose the Open Table Format that best matches those needs.

By Sandeep Batchu
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF: no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only: trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell
    sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended.

For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML
    fleet:
      nodes:
        - gpu-node-01:8080
        - gpu-node-02:8080
        - gpu-node-03:8080
        - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell
    $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
        "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"

    node              source  cnt    avg_us
    ----------------  ------  -----  ------
    gpu-node-01       4       11009  5.2
    gpu-node-01       3       847    18400   # ← 9x higher than peers
    gpu-node-02       4       10892  5.1
    gpu-node-02       3       412    2100
    gpu-node-03       4       10847  5.3
    gpu-node-03       3       398    1900
    gpu-node-04       4       10901  5.0
    gpu-node-04       3       421    2200

    8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
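The fan-out step and the way the outlier is read off the table can be sketched in a few lines. This is an illustration in Python rather than Ingero's own implementation: `fetch_rows` is a hypothetical stand-in that returns canned rows instead of calling an agent's HTTPS API, `gpu-node-05` is an invented unreachable node, and the 5x threshold is an arbitrary value chosen for the example. The numbers are the source = 3 rows from the output above.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

# Canned per-node results: (source, cnt, avg_us), copied from the table above.
CANNED = {
    "gpu-node-01:8080": [(3, 847, 18400.0)],
    "gpu-node-02:8080": [(3, 412, 2100.0)],
    "gpu-node-03:8080": [(3, 398, 1900.0)],
    "gpu-node-04:8080": [(3, 421, 2200.0)],
}


def fetch_rows(node):
    """Hypothetical stand-in for sending one query to one agent."""
    if node not in CANNED:
        raise ConnectionError(f"{node} unreachable")
    return CANNED[node]


def fan_out(nodes):
    """Query all nodes in parallel; prepend a node column; keep going on failures."""
    rows, warnings = [], []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = {node: pool.submit(fetch_rows, node) for node in nodes}
    # Leaving the `with` block waits for every request to finish.
    for node, future in futures.items():
        try:
            for row in future.result():
                rows.append((node.split(":")[0],) + row)
        except ConnectionError as exc:
            warnings.append(str(exc))
    return rows, warnings


def stragglers(rows, threshold=5.0):
    """Flag nodes whose avg_us exceeds `threshold` times the median of their peers."""
    flagged = {}
    for node, _source, _cnt, avg_us in rows:
        peers = [r[3] for r in rows if r[0] != node]
        if peers and avg_us > threshold * median(peers):
            flagged[node] = round(avg_us / median(peers), 1)
    return flagged


rows, warnings = fan_out(list(CANNED) + ["gpu-node-05:8080"])
print(stragglers(rows))  # ratio of the outlier to the median of its peers
print(warnings)          # the unreachable node is reported, not fatal
```

In this sketch, an unreachable node produces a warning instead of failing the whole query, so the rows from the healthy nodes are still usable.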
One more command to see the causal chains:

Shell
    $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

    FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

    [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
      Root cause: 847 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

    [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
      Root cause: 855 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O, with checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints: there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell
    # 1. Collect traces from each node
    scp gpu-node-01:~/.ingero/ingero.db node-01.db
    scp gpu-node-02:~/.ingero/ingero.db node-02.db

    # 2. Merge and analyze
    ingero merge node-01.db node-02.db -o cluster.db
    ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
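Why node-namespaced IDs matter for the merge step can be shown with Python's built-in sqlite3 module. The sketch below is not ingero merge and does not use Ingero's real schema: the events(id, node, source, duration) table is a simplified, hypothetical layout, and the durations are made-up nanosecond values. Two in-memory databases each contain a local event numbered 4821. Because the stored ID is prefixed with the node name, both rows fit into one table with a primary key on id.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = (
    "CREATE TABLE events ("
    "id TEXT PRIMARY KEY, node TEXT, source INTEGER, duration INTEGER)"
)


def make_node_db(node, local_events):
    """Build an in-memory per-node database with node-namespaced event IDs."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [
            (f"{node}:{local_id}", node, source, duration)
            for local_id, source, duration in local_events
        ],
    )
    db.commit()
    return db


def merge(dbs):
    """Copy every event into one database; the primary key rejects duplicate IDs."""
    merged = sqlite3.connect(":memory:")
    merged.execute(SCHEMA)
    for db in dbs:
        rows = db.execute("SELECT id, node, source, duration FROM events").fetchall()
        merged.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    merged.commit()
    return merged


# Both nodes reuse the same local event numbers (4821, 4822).
node01 = make_node_db("gpu-node-01", [(4821, 3, 18_400_000), (4822, 4, 5_200)])
node02 = make_node_db("gpu-node-02", [(4821, 3, 2_100_000), (4822, 4, 5_100)])

cluster = merge([node01, node02])
per_node = cluster.execute(
    "SELECT node, count(*) FROM events GROUP BY node ORDER BY node"
).fetchall()
print(per_node)
```

Had the IDs been the bare local numbers, the second insert of 4821 would have raised sqlite3.IntegrityError on the primary key.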
Why We Built It This Way

The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb: the well-trodden path. We deliberately avoided that.

No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure: the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet.

Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity.

Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble, and knowing which nodes failed is diagnostic information in itself.

Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query: 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query.

Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate.
The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. 
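The NTP-style offset estimation described under “Clock skew is measured, not ignored” can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Ingero's actual implementation (which the article says is written in Go): the `probe` callable, the function name, and the nanosecond tuple layout are all hypothetical.

```python
from statistics import median
from typing import Callable, List, Tuple

# One round trip: (local send time, remote clock reading, local receive time),
# all in nanoseconds. The shape of this tuple is an assumption for illustration.
Probe = Callable[[], Tuple[int, int, int]]


def estimate_offset_ns(probe: Probe, samples: int = 3) -> Tuple[float, int]:
    """Return (median clock offset, smallest RTT) in nanoseconds."""
    offsets: List[float] = []
    rtts: List[int] = []
    for _ in range(samples):
        t_send, t_remote, t_recv = probe()
        # Assume the remote clock was read halfway through the round trip.
        offsets.append(t_remote - (t_send + t_recv) / 2)
        rtts.append(t_recv - t_send)
    # The median discards a single outlier caused by a delayed response.
    return median(offsets), min(rtts)


# Example with canned round trips; the third sample is an outlier.
canned = iter([
    (0, 5_000_500, 1_000),
    (2_000, 5_002_400, 2_800),
    (4_000, 9_000_000, 5_000),
])
offset_ns, best_rtt_ns = estimate_offset_ns(lambda: next(canned))
```

With three samples, the median tolerates one bad measurement, which is presumably why a small odd sample count is enough on a low-latency LAN.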
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:
  • sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node cluster
  • sample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading
  • GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends
  • 11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
  • GPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story

By Ingero Team
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch

A DynamoDB throttle alarm fires at 2 am. You confirm the spike in CloudWatch, then check ElastiCache in a second dashboard, then Redshift in a third. Cache hit rate dropped, which hammered DynamoDB, which stalled the zero-ETL export. Three services, three dashboards, one cascade you can only trace by hand.

This guide maps the specific metrics, alarm thresholds, and configuration steps for each service, and then addresses the observability delta that CloudWatch leaves unresolved: cross-service correlation, root-cause traceability, and the capacity-planning intelligence that prevents cascades in the first place.

What CloudWatch Gives You Across DynamoDB, ElastiCache, and Redshift

Prerequisites: The CLI examples and alarm configurations in this guide assume AWS CLI v2, an IAM principal with cloudwatch:GetMetricData, cloudwatch:PutMetricAlarm, and dynamodb:UpdateContributorInsights permissions, and active DynamoDB tables, ElastiCache clusters, or Redshift clusters in your account.

CloudWatch publishes metrics for all three services under service-specific namespaces. Per the AWS CloudWatch documentation, metric retention runs in three tiers: 1-minute data points retained for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days.

Namespace | Category | Key Metrics
AWS/DynamoDB | Capacity | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, ThrottledRequests
AWS/DynamoDB | Latency | SuccessfulRequestLatency (p50, p99)
AWS/DynamoDB | Health | SystemErrors
AWS/ElastiCache | Efficiency | CacheHitRate, Evictions
AWS/ElastiCache | Memory | DatabaseMemoryUsagePercentage
AWS/ElastiCache | Connections | CurrConnections, ReplicationLag
AWS/Redshift | Performance | QueryDuration, QueryQueueTime
AWS/Redshift | Workload | WLMQueueLength (per queue)
AWS/Redshift | Resources | CPUUtilization, ReadIOPS, WriteIOPS

For most post-incident investigations, you’ll hit the granularity boundary within two weeks.
A throttle spike that lasted 4 minutes on day 17 shows up as a single 5-minute average data point, frequently indistinguishable from normal traffic variation. The per-custom-metric cost also compounds at scale: an account running 40 DynamoDB tables, 6 ElastiCache clusters, and 3 Redshift clusters with per-resource custom alarms can accumulate hundreds of CloudWatch metrics across namespaces, each costing $0.30/month to store and $0.10/alarm/month to evaluate. Each namespace provides enough signal to diagnose its own service, but CloudWatch publishes no native cross-service correlation mechanism. A ThrottledRequests spike in AWS/DynamoDB and a CacheHitRate collapse in AWS/ElastiCache at the same timestamp are both visible, but connecting them as cause and effect requires a human to match timestamps across dashboards. DynamoDB: Throttling Detection, Partition Health, and Capacity Mode Decisions DynamoDB throttling is rarely a single-metric problem. A throttle alarm tells you capacity was exceeded, but not whether the cause is a hot partition, an undersized provisioned table, or a traffic pattern that outgrew your capacity mode. The subsections below work through that diagnostic sequence: the metrics that surface the symptom, the tooling that pinpoints the partition, and the capacity decision that prevents recurrence. Core Metrics and Alarm Thresholds The DynamoDB CloudWatch metric namespace publishes table-level aggregates. 
For provisioned-capacity tables, these five metrics drive operational decisions:

Metric | Unit | Recommended Alarm Threshold | Notes
ThrottledRequests | Count | > 0 (provisioned mode) | Any throttling on a provisioned table means capacity is misconfigured or a hot partition is concentrating load
SuccessfulRequestLatency p99 | Milliseconds | > 10ms (read-heavy workloads); > 20ms (mixed) | p99 > 10ms on reads is a practitioner-recommended leading indicator of partition pressure before throttles appear
ConsumedReadCapacityUnits | Count/second | > 80% of provisioned RCUs | Signals you’re approaching throttle territory
ConsumedWriteCapacityUnits | Count/second | > 80% of provisioned WCUs | Same logic for write-heavy workloads
SystemErrors | Count | > 0 | Indicates DynamoDB service-side failures, distinct from capacity limits

Practitioner-recommended starting points. Tune to your workload characteristics.

ThrottledRequests at table level confirms that throttling happened, but tells you nothing about which partition caused it. On a table with millions of items, a single access pattern (a user ID acting as a partition key hot spot, for instance) can drive 95% of throttles while aggregate consumed capacity looks healthy. DynamoDB Contributor Insights resolves this.

Contributor Insights for Hot Partition Detection

DynamoDB Contributor Insights surfaces the top-N most-accessed partition keys and sort keys in real time. It identifies the specific items driving throttling or high latency that pure CloudWatch metric aggregation can’t surface. Enabling it on a production table with significant traffic incurs cost (priced per request evaluated), but during a throttle incident, Contributor Insights gives you the specific key value generating excess load rather than an aggregate curve.
Enable it from the DynamoDB console under the table’s “Monitor” tab, or via CLI (requires AWS CLI v2+):

Plain Text
aws dynamodb update-contributor-insights \
  --table-name YOUR_TABLE_NAME \
  --contributor-insights-action ENABLE

Once active, CloudWatch Logs Insights receives partition-level data within minutes. Query the top-10 most-accessed partition keys over the past hour to confirm whether a hot key is generating the throttle alarm:

Plain Text
filter @message like /ContributorInsights/
| stats count(*) as accessCount by partitionKey
| sort accessCount desc
| limit 10

Capacity Mode Decision Logic

The decision between provisioned and on-demand capacity modes depends on traffic predictability. Use a 7-day trend of ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits as your input signal:
  • If consumed capacity stays below 80% of provisioned capacity and follows a consistent daily pattern, stay on provisioned. Set auto-scaling target utilization at 70% of provisioned capacity to leave headroom for traffic spikes before throttling begins.
  • If consumed capacity regularly exceeds 80% of provisioned, or if usage patterns show irregular spikes with no predictable shape, on-demand mode eliminates throttling risk at a higher per-request cost.

Teams running the DynamoDB zero-ETL integration with Redshift (GA October 2024) face a different monitoring angle from streaming replication. The integration operates via periodic incremental exports every 15 to 30 minutes, so source table latency doesn’t affect export timing. The primary constraint on analytics data freshness is export completion status, visible in the Redshift console under the integration view. Export failures are the leading indicator of stale analytics data.

ElastiCache: Cache Efficiency, Memory Pressure, and the Valkey 8.0 Observability Upgrade

When cache hit rate drops, the blast radius extends beyond ElastiCache.
Every cache miss becomes a direct read against your origin datastore, and if that origin is a DynamoDB table already running near provisioned capacity, you get the throttle cascade from the introduction. The metrics below separate cache-level symptoms from the memory and replication signals that predict them, followed by the observability improvements Valkey 8.0 brings.

Redis and Valkey Metrics

Per the ElastiCache CloudWatch documentation, the metrics that drive operational decisions for Redis and Valkey deployments are:

Metric | Target | Alert Threshold | Action
CacheHitRate | >= 0.95 | < 0.90 | Investigate at < 0.90; below 0.80 indicates a significant access pattern change or deployment that altered cache key patterns
Evictions | ~0 (steady state) | > 100/min sustained | Sustained evictions mean maxmemory-policy is evicting live data under memory pressure
DatabaseMemoryUsagePercentage | < 70% | Alert at > 75%; scale-out at > 85% | Alert at 75% gives runway to analyze dataset growth; above 85% triggers automatic evictions under most policies
ReplicationLag | < 100ms | > 500ms | Replica lag at this level affects read scaling reliability
CurrConnections | Workload-specific | > 80% of max allowed | Persistent near-limit connections indicate a connection pool misconfiguration or application-side leak

Practitioner-recommended starting points based on operational experience.

Memcached deployments within ElastiCache expose a different metric set through the same AWS/ElastiCache namespace: get_hits and get_misses (from which you derive hit rate), evictions, and bytes_used vs. limit_maxbytes. Valkey and Redis are cluster-based architectures with native replication, while Memcached is a horizontally partitioned cache with no native replication. Applying Redis/Valkey thresholds to Memcached deployments produces misleading alarms.

Valkey 8.0 Observability Additions

The open-source Valkey 8.0 release shipped from the Linux Foundation on September 16, 2024.
Amazon ElastiCache 8.0 for Valkey launched on November 21, 2024, bringing four observability primitives that prior Redis OSS metrics on ElastiCache didn’t expose. Per-slot metrics let you identify which hash slots carry disproportionate traffic across a cluster. Before Valkey 8.0, CloudWatch surfaced per-node and per-cluster aggregates only. A slot-level throughput imbalance (common after a key pattern change in the application layer) was invisible until it produced node-level CPU or memory pressure. With per-slot metrics, you detect the asymmetry before it cascades to node-level saturation. Per-client event loop latency tracks how long each client connection waits in the event loop queue. This directly diagnoses client-specific throughput asymmetries. If one application service has a misconfigured connection pool producing tail latency that appears as a CacheHitRate degradation from another service’s perspective, per-client event loop latency identifies the offending client specifically rather than surfacing a cluster-level aggregate that implicates everything. Rehash memory tracking quantifies the temporary memory overhead during cluster rescaling. When you add nodes to an ElastiCache Valkey cluster, the rehashing process requires holding two copies of some hash-slot data in memory simultaneously. Before this metric, a DatabaseMemoryUsagePercentage spike during a scale-out event was ambiguous. With rehash memory tracking, you can confirm the spike is transient rehash overhead and dismiss the alarm as expected behavior rather than a capacity problem. Traffic breakdowns split read, write, and key expiry operations at the slot and node level. This replaces the single-dimensional throughput view that prior ElastiCache Redis metrics provided and enables you to identify whether a throughput increase is driven by reads, writes, or expiry churn without writing custom instrumentation. Valkey 8.1, released April 2, 2025, adds further observability improvements. 
Verify ElastiCache 8.1 availability in your region at the time of deployment, as managed service version availability can trail the open-source release by several weeks.

Redshift: Query Performance, WLM Configuration, and Enhanced Monitoring

Redshift performance problems tend to look identical from the outside: queries slow down. Whether the cause is CPU saturation, WLM slot exhaustion, or a bad query plan requires different metrics and different responses. The thresholds below separate those conditions, followed by the Enhanced Query Monitoring tooling that replaced the manual system-table workflow for root-cause diagnosis.

Key CloudWatch Metrics and WLM Thresholds

Metric | Recommended Threshold | Action
CPUUtilization | Alert at > 80% | Investigate active query plans if sustained; evaluate concurrency scaling if combined with queue depth
WLMQueueLength (per queue) | Alert at > 3; escalate at > 5 sustained for 60 seconds | WLMQueueLength > 5 sustained over 60 seconds combined with CPUUtilization > 85% is a practitioner-recommended trigger for enabling a Redshift concurrency scaling cluster
QueryQueueTime | > 30 seconds | Queries waiting over 30 seconds indicate WLM queue saturation or slot misconfiguration
QueryDuration | 2x the 7-day p95 baseline for that WLM queue | Baseline drift detection for workload-specific thresholds
ReadIOPS | Cluster baseline | Sharp ReadIOPS spikes without a corresponding query load increase can indicate full-table scans or missing sort key filters

The WLMQueueLength threshold requires context to interpret correctly. A WLMQueueLength of 5 on a queue allocated 5 concurrency slots means every slot is occupied and the queue is at capacity. Combined with CPUUtilization above 85%, adding concurrency scaling capacity is the right response. WLMQueueLength of 5 with CPUUtilization at 40% points to a slot allocation problem: queries are queuing behind slot limits rather than behind compute saturation, and the fix is WLM reconfiguration, not additional nodes.
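The CPU-versus-queue reading above can be written down as a small decision helper. This is a rough sketch, not an AWS API: the function name and return labels are hypothetical, and treating every CPU value at or below 85% as a slot-allocation problem is a simplifying assumption (the text only gives the 85% and 40% cases).

```python
def triage_wlm(queue_length: float, cpu_percent: float, slots: int = 5) -> str:
    """Classify a Redshift WLM queue reading.

    queue_length: WLMQueueLength for one queue
    cpu_percent:  CPUUtilization for the cluster
    slots:        concurrency slots allocated to that queue
    """
    if queue_length < slots:
        # Fewer queued queries than allocated slots: not treated as saturated.
        return "no_queue_pressure"
    if cpu_percent > 85:
        # Queue at capacity and compute saturated: add capacity.
        return "add_concurrency_scaling"
    # Queue at capacity but compute has headroom: slots are the limit.
    return "reconfigure_wlm"


# The two cases from the text, plus a healthy reading.
saturated = triage_wlm(queue_length=5, cpu_percent=90)
slot_bound = triage_wlm(queue_length=5, cpu_percent=40)
healthy = triage_wlm(queue_length=2, cpu_percent=90)
```

Keeping the rule in code makes the runbook decision reviewable, and the 85% cut-off is easy to tune once you have your own baseline.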
Historically, diagnosing slow Redshift queries required direct access to system tables. A typical workflow queried STL_QUERY for execution times, joined to SVL_QUERY_METRICS for resource usage per execution step, and cross-referenced SVL_QUERY_SUMMARY for operator-level plan details. This three-step workflow required SQL client access, familiarity with the Redshift internal catalog schema, and significant manual correlation work. Redshift Enhanced Query Monitoring Redshift Enhanced Query Monitoring went GA on January 29, 2025, available for both Serverless and provisioned deployments. It surfaces query bottlenecks, execution plan anomalies, and resource contention at the query level through the Redshift console, removing the need for SQL-level diagnostic work against system tables. When WLMQueueLength spikes, you can go directly to a ranked list of the queries causing saturation, see their execution plan highlights, and identify whether the bottleneck is a sort key miss, a cross-join, or a network shuffle between nodes, all without writing a single STL_QUERY lookup. Redshift troubleshooting previously required a senior engineer with DBA-level knowledge of the system catalog. This change shifts basic performance diagnosis to any SRE comfortable with the console. AI-Driven Scaling and Its Monitoring Implications AWS previewed Redshift Serverless AI-driven scaling at re:Invent 2023, and it went GA in October 2024. Verify current GA status in the AWS documentation for your region before production adoption, as the preview-to-GA timeline varies by feature and region. AI-driven scaling automates RPU (Redshift Processing Unit) allocation by observing query patterns over time and adjusting base and max RPU settings to balance cost against performance. WLM queue priority, query monitoring rule configuration, and workload classification for mixed BI and ETL environments require manual configuration even on Serverless clusters running AI-driven scaling. 
A Redshift Serverless cluster with AI-driven scaling still requires you to define how ETL jobs and ad hoc analyst queries share resources, and which queue takes priority when both arrive simultaneously. Those decisions drive WLMQueueLength behavior regardless of how accurately the scaler provisions RPUs.

Capacity Planning: Using Monitoring Data to Drive Scaling and Cost Decisions

The cross-service capacity heuristic worth building into your runbooks: simultaneous DynamoDB p99 latency increase combined with ElastiCache CacheHitRate dropping below 0.90 can indicate several different conditions. Potential causes include a fan-out query change at the application layer, a cache node failure, a network event between services, or a deployment that altered cache key patterns. This symptom combination warrants application-layer investigation to confirm the root cause before deciding which service to scale. Scaling either service without confirming the shared trigger wastes capacity and can mask the actual issue.

DynamoDB

Build a 7-day average of ConsumedReadCapacityUnits and ConsumedWriteCapacityUnits as your baseline, then set auto-scaling target utilization at 70% of provisioned capacity. Auto-scaling starts adding capacity once consumption rises above that 70% target, and the remaining 30% of provisioned capacity (roughly a 43% increase over the steady-state load) is the buffer that absorbs a spike before consumption reaches 100% and throttling begins.

When evaluating reserved capacity, AWS Cost Explorer surfaces DynamoDB reserved capacity recommendations with projected savings. At a 3-year term commitment, reserved capacity can save up to 77% versus provisioned capacity hourly rates. Reserved capacity makes financial sense for tables that have run in provisioned mode for at least 90 days with predictable consumption patterns. For tables with volatile or seasonal traffic, on-demand mode avoids the risk of underutilization that makes reserved capacity economically counterproductive.

ElastiCache

Trend DatabaseMemoryUsagePercentage over a 72-hour window.
If it trends upward at a rate disconnected from traffic growth (the cache dataset is growing while the request rate stays flat), that signals cache dataset expansion rather than increased load. The operational response is node scaling before you cross the 75% alert threshold, as memory pressure at that level narrows your runway to eviction-level problems. For ElastiCache Serverless using Valkey, monitor ElastiCacheProcessingUnits (ECPUs) as the scaling proxy. ECPU consumption scales with operation complexity and data volume, making it the primary cost and capacity signal for Serverless deployments where node count decisions don’t apply. Redshift Correlate CPUUtilization with QueryQueueTime over a 1-week window. The CPU-vs-queue diagnostic from the Redshift metrics section applies here as your scaling decision input: high CPU points to node scaling, while high queue time with moderate CPU points to WLM slot reconfiguration. Where CloudWatch’s Coverage Falls Short The per-service metrics and tooling above give you solid visibility within each namespace. The gaps show up when you need to work across them: correlating alarms from different services, connecting logs to metrics, and suppressing the noise when a single event triggers alerts everywhere at once. No Native Cross-Service Correlation You can build a CloudWatch dashboard that co-locates DynamoDB ThrottledRequests, ElastiCache Evictions, and Redshift WLMQueueLength on a shared timeline, but it’s manual widget assembly with no causal linking between the graphs. The assembly is also fragile: every new table, cluster, or queue requires manual dashboard updates to keep the view current. Log-to-Metric Correlation Is Manual Connecting a slow Redshift query logged in STL_QUERY to a spike in DynamoDB SuccessfulRequestLatency at the same timestamp requires opening CloudWatch Logs Insights for Redshift audit logs, querying by timestamp range, then manually comparing results against the DynamoDB metric timeline. 
The Enhanced Query Monitoring GA from January 2025 reduces this friction for Redshift-internal diagnosis, but the cross-service correlation step remains a human task. Cross-Account Visibility CloudWatch Database Insights added cross-account and cross-region support for database fleet monitoring on November 21, 2025. Verify the current scope of service coverage at the time of your deployment, as the announcement references database fleet monitoring broadly, and the specific inclusion of ElastiCache and Redshift alongside RDS and Aurora should be confirmed against current documentation. Alert Fatigue Across Three Namespaces Each service generates its own alarm stream with no dependency-aware suppression between services. When a single network event causes DynamoDB latency to rise, ElastiCache hit rate to drop, and Redshift WLM queue depth to increase, CloudWatch fires alarms across three separate notification channels simultaneously. The on-call engineer receives three alerts for a single root cause event, with no automated path from any alarm to the triggering condition. ManageEngine OpManager Nexus addresses these gaps directly: it auto-discovers DynamoDB tables, ElastiCache clusters, and Redshift clusters within your AWS account, builds correlated dashboards that connect metrics across all three services on a shared timeline without manual widget assembly, and applies dependency-aware alarm suppression that treats downstream symptoms of a single event as a grouped incident. For teams running two or more of these managed database services, the operational delta between nine isolated CloudWatch alarms and a correlated, root-cause-linked view determines where monitoring hours get spent or recovered. Your Monitoring Baseline: Nine Alarms and a Unified View The minimum viable monitoring baseline for all three services is nine CloudWatch alarms routed to a single SNS topic. These are practitioner-recommended starting points. 
Tune each threshold to your observed workload behavior.

DynamoDB Alarms

Alarm Name | Metric | Threshold | Evaluation Period
DynamoDB-Throttles | ThrottledRequests | > 0 | 1 minute
DynamoDB-LatencyP99 | SuccessfulRequestLatency (p99) | > 20ms | 5 minutes
DynamoDB-RCUHigh | ConsumedReadCapacityUnits | > 80% of provisioned | 5 minutes

Metric definitions: DynamoDB CloudWatch metrics reference.

ElastiCache Alarms

Alarm Name | Metric | Threshold | Evaluation Period
Cache-HitRateLow | CacheHitRate | < 0.90 | 5 minutes
Cache-EvictionsHigh | Evictions | > 100 per minute | 1 minute
Cache-MemoryHigh | DatabaseMemoryUsagePercentage | > 75% | 5 minutes

Metric definitions: ElastiCache CloudWatch metrics reference.

Redshift Alarms

Alarm Name | Metric | Threshold | Evaluation Period
Redshift-CPUHigh | CPUUtilization | > 80% | 5 minutes
Redshift-QueueDepth | WLMQueueLength | > 3 | 5 minutes
Redshift-QueueWait | QueryQueueTime | > 30 seconds | 5 minutes

Metric definitions: Redshift CloudWatch metrics reference.

Route all nine alarms to a single SNS topic. Tag each alarm with a Service dimension (values: DynamoDB, ElastiCache, Redshift) so your incident management tooling can filter and group by service. This configuration puts all three alarm streams in one place and makes it detectable when multiple service alarms fire within a short time window, which is the observable signature of a cross-service cascade.

Run these nine alarms for a week or two. You’ll see the pattern: multiple alarms firing within the same minute window for what turns out to be a single root cause, with no automated way to connect them. That delta is what a correlated observability layer closes. ManageEngine OpManager Nexus provides that layer for AWS database services, with auto-discovery, cross-service dashboards, and dependency-aware alarm suppression out of the box.

What’s your current setup for correlating alarms across managed AWS services? If you’re running DynamoDB, ElastiCache, or Redshift and have found thresholds or approaches that work well for your team, share them in the comments.
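The cascade signature described above (alarms from several services firing within a short window) can be detected with a few lines of post-processing on whatever consumes the SNS topic. The sketch below is illustrative only: the AlarmEvent shape, the one-minute window, and the two-service minimum are assumptions, not part of any AWS or vendor API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class AlarmEvent:
    name: str          # e.g. "DynamoDB-Throttles"
    service: str       # value of the Service tag on the alarm
    fired_at: datetime


def find_cascades(
    events: Iterable[AlarmEvent],
    window: timedelta = timedelta(minutes=1),
    min_services: int = 2,
) -> List[List[AlarmEvent]]:
    """Group alarms that fire within `window` of the first alarm in a group,
    and keep only groups that span at least `min_services` services."""
    cascades: List[List[AlarmEvent]] = []
    group: List[AlarmEvent] = []

    def flush() -> None:
        if len({e.service for e in group}) >= min_services:
            cascades.append(list(group))

    for event in sorted(events, key=lambda e: e.fired_at):
        if group and event.fired_at - group[0].fired_at > window:
            flush()
            group.clear()
        group.append(event)
    flush()
    return cascades


night = datetime(2025, 1, 1, 2, 0, 0)
events = [
    AlarmEvent("Cache-HitRateLow", "ElastiCache", night + timedelta(seconds=5)),
    AlarmEvent("DynamoDB-Throttles", "DynamoDB", night + timedelta(seconds=20)),
    AlarmEvent("Redshift-QueueWait", "Redshift", night + timedelta(seconds=50)),
    AlarmEvent("DynamoDB-RCUHigh", "DynamoDB", night + timedelta(hours=1)),
]
cascades = find_cascades(events)
```

This only flags coincidence in time; it does not establish which alarm is the cause, which is the gap the article attributes to CloudWatch.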

By Damaso Sanoja
Edge Computing in Utility IoT: Two Architecture Patterns That Actually Work

When centralized control architectures were designed, power flowed from large generation plants down to passive consumers, utilities managed hundreds of large assets, data volumes were modest, and connectivity was reliable at substations. Few of these assumptions hold today.

Power flows in both directions as rooftop solar and battery storage inject back into the distribution network. Utilities now coordinate millions of small, variable, distributed assets instead of hundreds of large ones. Data volumes have multiplied by orders of magnitude as smart meters, sensors, and distributed energy resource (DER) controllers generate continuous streams.

According to SCE's "Countdown to 2045" analysis, overall electricity demand will nearly double over the next two decades, driven largely by EV adoption, building electrification, and distributed solar. That growth will come from millions of small, distributed resources that centralized systems weren't designed to coordinate. Going forward, utilities need a control architecture that matches the grid they actually operate, not the one they planned for decades ago. This gap is exactly what edge computing addresses.

Why Edge Architecture Fits Utility Environments

Before looking at specific deployment patterns, it helps to understand why edge computing suits utility infrastructure in particular. Three structural characteristics make it the right fit.

Utility data has a locality problem. Sensors, meters, and controllers generate data where decisions need to happen: at the substation, the inverter, the distribution feeder. IEEE 2800-2022 specifies that inverter-based resources must achieve step response times within 2.5 grid cycles (OSTI); at 60 Hz that's roughly 42 milliseconds. A cloud round-trip often takes longer than that. Edge processing keeps logic where the data originates.

Utility infrastructure has a connectivity problem.
The system needs to keep functioning whether or not it can reach the cloud — despite storms, distance, or unreliable networks. For these environments, autonomous edge operation is a baseline requirement. Utility scale has a bandwidth problem. Modern grid sensors generate continuous, high-resolution streams. Transmitting everything to a central system is economically unfeasible at the deployment scale. Edge filtering means only verified anomalies and events travel upstream, not raw sensor streams. These three characteristics show up in every serious utility IoT deployment. The architecture pattern you choose determines how you address them. Two Patterns, Different Constraints Utility edge deployments fall into two fundamentally different patterns. Which one you're deploying determines your hardware, protocol, and AI strategy. Pattern A: High-Frequency Control Loops This pattern applies when the system needs to detect grid conditions and respond within milliseconds — DER coordination, voltage regulation, frequency response, fault detection. The key difference is that the control decisions are made by the device itself, not a gateway or central system. The Utilidata deployment with Southern California Edison and NVIDIA embedded computing directly into smart meters using NVIDIA's Jetson platform, running Real-Time Optimal Power Flow (RT-OPF) algorithms at the meter. Solar inverters, EV chargers, and battery systems responded to actual grid conditions measured locally, not static dispatch schedules. The published results: 27% reduction in peak demand and 12.5% reduction in electricity costs for the simulated household. This is a meaningful architectural difference that only works with grid-connected hardware with sufficient compute power and no battery constraints. Pattern B: Distributed Sensor Networks This pattern applies when deploying hundreds or thousands of battery-powered sensors across a wide geographic area. 
Data is captured periodically, processed locally, and transmitted only when something meaningful is detected.

EPCOR's acoustic leak detection deployment across 160 square miles of desert water infrastructure in Arizona demonstrates this. Battery-powered sensors attach directly to water pipes, wake periodically to capture acoustic samples, run local AI inference to detect leak signatures, and transmit only when an anomaly matches trained patterns. The system identified over 250 leaks and helped recover 115 million gallons of water.

These results would be economically impossible if sensors were streaming continuous audio to the cloud. Every computation and transmission drains limited energy, yet sensors must run on batteries for years. AI models must be lightweight enough to run within milliwatt budgets while still being accurate enough to distinguish a genuine leak signature from pipe noise.

Protocol and Hardware Decisions Follow the Pattern

Once you know which pattern applies, hardware and protocol choices follow directly.

For control loop deployments (Pattern A), hardware is typically grid-connected: gateways at substations, computing embedded in meters or inverters. Protocol selection centers on what your existing field devices speak: Modbus for legacy equipment, IEC 61850 for modern substations, DNP3 for SCADA-connected devices, MQTT for newer IoT sensors. A single edge gateway must collect from all of these simultaneously.

For sensor network deployments (Pattern B), hardware is battery-constrained, and the protocol choice is driven by range and power requirements. LoRaWAN can reach ranges of up to about 15 km with years of battery life, at the cost of low data throughput, and is the common choice for large geographic areas. NB-IoT provides better penetration in dense or underground environments using cellular infrastructure. LoRaWAN requires a gateway deployment. NB-IoT runs on existing cellular coverage but introduces carrier dependency and ongoing SIM costs.
Neither protocol is universally better. The choice depends on service area geography, existing cellular coverage, deployment density, and battery budget.

ThingsBoard Edge supports protocol diversity through the IoT Gateway component. It connects to Modbus and OPC-UA out of the box, along with MQTT for modern IoT sensors. For low-power wide-area protocols like LoRaWAN and NB-IoT, ThingsBoard integrations enable connectivity without custom middleware. This allows utilities to deploy either pattern, or both, on unified infrastructure.

AI at the Edge: Two Execution Strategies

The two patterns call for different AI execution strategies because their hardware constraints differ.

Pattern A gateways have enough compute to query external AI services when connected (OpenAI, Azure OpenAI, or a self-hosted model via API) and switch to cached local models when the connection drops. This approach keeps models updated centrally without hardware constraints limiting model complexity.

For Pattern B deployments, models must run locally within tight power and memory budgets. The EPCOR deployment uses deep learning models trained on extensive plastic pipe acoustic datasets, optimized to run directly on the sensor hardware. Every percentage point of detection accuracy improvement must be weighed against battery life: a more complex model might cut operational lifespan in half.

Either way, managing AI models across a distributed fleet requires automation. Utilities can't manually update inference logic on hundreds of field-deployed devices. Modern edge platforms solve this with central model repositories and OTA update pipelines. Engineers train on historical data, check results, and push updates to the whole device fleet during scheduled maintenance, with no truck rolls and no downtime.

Integration With Existing Infrastructure

The most common engineering concern about edge computing isn't the technology. It's disruption to working systems.
Replacing functional SCADA infrastructure mid-operation isn't realistic for most utilities, and it isn't necessary. The practical integration approach is additive rather than replacement. Deploy an edge gateway alongside your existing PLC or RTU. It connects to the network, collects data from field devices via the protocols they already speak, and runs additional intelligence — anomaly detection, pattern recognition, predictive alerts — without touching the control loop. Your existing PLC continues executing hard-coded protection logic. The edge layer watches for conditions the PLC wasn't designed to detect. This matters for DER management specifically. Existing SCADA systems weren't architected for thousands of bidirectional resources that both consume and inject power based on real-time conditions. Rather than rebuilding that SCADA layer, an edge platform can sit alongside it, handling the DER coordination layer while SCADA continues managing the assets it was built for. Energenix, a renewable energy SCADA provider operating across South Asia, takes exactly this approach using ThingsBoard Edge. The platform delivers local monitoring and control at solar plant sites, enabling operational staff to respond to events without cloud dependency, while the central ThingsBoard instance handles fleet-wide analytics and long-term storage across their 120+ MW portfolio spanning multiple countries. Managing hundreds of sites requires visibility into the fleet itself, not just the grid assets it monitors: battery levels, communication timestamps, firmware versions, health status. Edge platforms provide centralized dashboards for this fleet-wide visibility while edge nodes maintain autonomous operation. Over-the-air update pipelines push firmware and model updates to device groups from a central interface — no SSH sessions, no truck rolls. 
Choosing Where to Start For utilities evaluating edge computing deployments, the clearest starting point is identifying which pattern your highest-priority use case falls into, then running a contained pilot before fleet deployment. If your primary concern is DER coordination and real-time grid response, Pattern A applies — start with a single substation or feeder and measure latency improvements against your current SCADA response baseline. If your primary concern is infrastructure monitoring across a wide service area with connectivity constraints, Pattern B applies — start with a single sensor deployment zone, validate detection accuracy, then scale the fleet. The platform infrastructure — protocol integration, rule engine, OTA management, centralized dashboards — should be the same regardless of which pattern you start with. Utilities that deploy both patterns eventually need them to coexist under unified management. Building on a platform that handles both from day one avoids painful integration work later.
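The Pattern B duty cycle described earlier (wake, capture a sample, score it locally, transmit only on a match) can be sketched in a few lines. This is a minimal illustration, not EPCOR's or ThingsBoard's implementation: the `Detection` payload, the `run_wake_cycle` function, the 0.8 threshold, and the amplitude-based scoring function are all hypothetical stand-ins for a trained on-device model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Detection:
    """Compact payload sent upstream instead of raw audio (hypothetical shape)."""
    sensor_id: str
    score: float


def run_wake_cycle(
    sensor_id: str,
    samples: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    threshold: float = 0.8,  # assumed value, tuned per deployment in practice
) -> Optional[Detection]:
    """One wake cycle: score the captured samples locally and return a
    Detection only when the score reaches the threshold. Returning None
    means the radio stays off for this cycle."""
    score = score_fn(samples)
    if score >= threshold:
        return Detection(sensor_id=sensor_id, score=score)
    return None


def mean_abs_amplitude(samples: Sequence[float]) -> float:
    """Stand-in for the on-device model: mean absolute amplitude, capped at 1.0."""
    if not samples:
        return 0.0
    return min(1.0, sum(abs(s) for s in samples) / len(samples))


quiet = run_wake_cycle("pipe-17", [0.01, -0.02, 0.015], mean_abs_amplitude)
noisy = run_wake_cycle("pipe-17", [0.9, -0.95, 0.88], mean_abs_amplitude)
print(quiet)
print(noisy)
```

The point of the structure is that the transmit decision sits behind the local score, so the common case (nothing detected) costs no radio time at all.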

By Yevheniia Mala
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS

The Data Challenge Every industry has its version of the same data engineering problem: massive, complex payloads generated at the edge — far from the cloud, often on unreliable networks — that need to become queryable, structured datasets as fast as possible. In genomics, it is multi-gigabyte sequencing files produced by instruments in labs. In autonomous vehicles, it is LiDAR and camera telemetry streaming off test fleets. The underlying architectural challenge is the same in every case: ingest heavy data at burst scale, store it cost-effectively for years, and transform it into something an analyst or ML model can actually use without touching the raw files. This article uses hyperspectral imaging in digital agriculture as the concrete use case, but the architecture is designed to be general-purpose and replicable. Hyperspectral sensors capture light across hundreds of spectral bands, making it possible to detect water stress, nutrient deficiencies, and early disease in crops well before anything is visible to the human eye. A single sensor pass over a 160-acre field generates 40–80 GB of raw data. These are not images in any conventional sense — they are three-dimensional tensors, often called “hypercubes,” where every spatial pixel carries reflectance measurements across 200 or more contiguous spectral bands. The files arrive in scientific formats like HDF5, NetCDF, or ENVI, which do not support partial reads over a network without specialized tooling. Loading an entire 4 GB cube into memory just to extract a vegetation index from three bands is wasteful at the small scale and operationally unaffordable once a mid-size operation is producing 5–10 TB of raw cubes per growing season. The architecture described here solves that problem end to end: from raw sensor capture to queryable, structured tables in the cloud with cost-efficient storage and minimal dependency on network bandwidth. 
The patterns (event-driven ingestion, aggressive storage tiering, medallion lakehouse design, and containerized edge processing) are all portable. Swap the hyperspectral cube for a FASTQ file or a LiDAR point cloud, and the same blueprint applies with minimal modifications.

Ingestion: Handling Seasonal Burst Traffic

Agricultural data arrives in extreme seasonal bursts. During harvest, hundreds of edge nodes may be uploading simultaneously; in winter, the pipeline sits nearly idle. Any architecture that provisions fixed compute for this pattern will be very inefficient, so the ingestion layer needs to scale in both directions: up for harvest bursts and down to near zero when idle.

The pipeline uses an S3 → SQS → Lambda → Batch pattern, and the SQS queue in the middle is what makes the rest of it work. When files land in S3, event notifications route into the queue, which acts as a buffer between the unpredictable arrival rate and the compute layer downstream. Lightweight Lambda functions, acting much like air traffic controllers, poll the queue, bundle incoming file references into manifest batches of 50–200 cubes, and submit those manifests to AWS Batch. Batch spins up Spot Instances to do the actual heavy processing.

Triggering Lambda directly from S3 events was the first approach, but it breaks down at scale for two reasons: Lambda’s concurrency limits create a hard ceiling during burst ingest, causing silent throttling and dropped events, and the 1:1 mapping between files and Lambda invocations is inefficient when the processing works much better against batches of files. Putting SQS in the middle solves both problems at once.

When selecting the compute environment, AWS Batch ultimately won out over the alternatives after some evaluation. The main limitation of Fargate was its hard memory ceiling of around 30 GB. That was simply too tight for processing a 4 GB data cube, whose intermediate arrays can easily require 32–64 GB of RAM.
Batch also provides native handling for job queuing, retries, and Spot interruption recovery. Because the workload is highly parallel and interruption-tolerant, it could safely run on Spot pricing, a 60–90% cost reduction that would have been difficult to justify passing up.

One early lesson involved S3 prefix design. A flat raw/ prefix structure ran into per-prefix request rate limits (3,500 PUTs/second) during burst ingest, which caused throttling that was initially difficult to diagnose. Restructuring to region/farm_id/year/month/day/ spread the writes across thousands of unique prefixes and also aligned neatly with the partition scheme used by Athena and Trino downstream, so the same naming convention solved both the throughput problem and the query performance problem.

Storage: Managing Petabyte-Scale Costs

At this scale, storage costs will quietly become the largest line item in the project if the tiering strategy is not aggressive from day one. Petabytes of data at $0.023/GB/month in S3 Standard add up fast, but deleting raw scientific data is not an option, both for regulatory reasons and because future model improvements depend on it.

The lifecycle strategy moves successfully processed cubes to Glacier Instant Retrieval within 24 hours. The initial instinct was to go straight to Deep Archive, but in practice, about 5–8% of cubes get retrieved within the first year: sensor calibrations get updated, new vegetation index algorithms need validation against historical data, and so on. Deep Archive’s 12-hour restoration time makes that retrieval workflow painful enough to slow down the R&D cycle. Glacier IR runs at roughly $0.004/GB/month, about 6x cheaper than Standard, with millisecond retrieval. After a year, once retrieval rates drop below 1%, a second lifecycle rule transitions everything to Deep Archive.
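To make the tiering argument concrete, here is a rough back-of-envelope comparison per petabyte-month. It is a sketch under stated assumptions: the Standard and Glacier IR prices are the ones quoted above, the Deep Archive price is an assumed list price that should be checked for your region, and request, retrieval, and minimum-storage-duration charges are ignored.

```python
# Per-GB monthly prices. The first two come from the text; the Deep Archive
# figure is an assumed list price and should be checked for your region.
PRICE_PER_GB_MONTH = {
    "STANDARD": 0.023,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,  # assumption, not from the article
}

GB_PER_PB = 1_000_000  # decimal petabyte


def monthly_cost(size_gb: float, storage_class: str) -> float:
    """Storage-only monthly cost; ignores request, retrieval, and minimum-duration fees."""
    return size_gb * PRICE_PER_GB_MONTH[storage_class]


def price_ratio(expensive: str, cheap: str) -> float:
    """How many times more one class costs per GB than another."""
    return PRICE_PER_GB_MONTH[expensive] / PRICE_PER_GB_MONTH[cheap]


for storage_class in PRICE_PER_GB_MONTH:
    print(f"{storage_class:>12}: ${monthly_cost(GB_PER_PB, storage_class):,.0f} per PB-month")

print(f"Standard vs Glacier IR: {price_ratio('STANDARD', 'GLACIER_IR'):.2f}x")
```

The ratio works out to 5.75, which is where the "about 6x" figure comes from.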
The important detail in the lifecycle configuration is a tag-based filter that gates the transition on processing_status = complete. Without this check, cubes that failed processing end up in Glacier, and restoring them for a retry becomes an unnecessary expense that multiplies quickly during periods of high ingest.

Terraform

    # Terraform: Tiered lifecycle for raw HSI cubes
    resource "aws_s3_bucket_lifecycle_configuration" "hsi_raw" {
      bucket = aws_s3_bucket.raw_hsi_data.id

      rule {
        id     = "raw_cubes_to_cold_storage"
        status = "Enabled"

        # Only transition cubes that were processed successfully
        filter {
          and {
            prefix = "raw_cubes/"
            tags = {
              processing_status = "complete"
            }
          }
        }

        # Glacier Instant Retrieval after one day
        transition {
          days          = 1
          storage_class = "GLACIER_IR"
        }

        # Deep Archive after one year
        transition {
          days          = 365
          storage_class = "DEEP_ARCHIVE"
        }
      }
    }

The Lakehouse: From Cubes to Queryable Tables

Everything upstream exists to feed this layer. The goal is to get the R&D team off the cycle of downloading, unzipping, and parsing multi-gigabyte cubes every time they need to calculate a vegetation index or train a model. The lakehouse is built on a medallion pattern using Apache Iceberg, organized around an extract-once, query-many principle.

Iceberg was chosen over plain Parquet files on S3 with a Glue Catalog because three problems kept recurring during development. First, schema evolution: new sensors arrive with different band configurations, and Iceberg handles column additions without rewriting historical data. Second, time travel: when a calibration error is discovered, rolling the Silver table back to a previous snapshot is a straightforward operation rather than a data recovery project. Third, hidden partitioning: Iceberg derives partition values from column data at write time, which means queries on acquisition_date get automatic partition pruning.

Medallion Layers

Bronze (Standardized Cubes)

Calibrated for sensor noise and atmospheric interference, stored in a cloud-optimized format (Zarr or COG), retaining the full 3D spectral structure.
This layer serves as the reproducible starting point for all downstream processing: if an algorithm changes six months later, reprocessing starts from Bronze rather than from the raw archive sitting in Glacier.

Silver (Structured Reflectance)

The 3D tensors are flattened into Iceberg tables where each row represents a spatial coordinate and each column holds a band’s reflectance value, partitioned by farm_id and acquisition_date. The Bronze-to-Silver transformation is the most compute-intensive step in the pipeline.

Gold (Business-Ready Metrics)

Pre-computed agricultural indices (NDVI, NDWI, chlorophyll estimates) aggregated by crop, field row, and time period. These are the tables that dashboards query, that yield prediction models train on, and that agronomists use to make irrigation and fertilization decisions.

With data in this shape, Trino handles federated SQL across the Silver and Gold tables for ad-hoc analysis, and ML training pipelines read directly from Silver without any file wrangling. The most valuable analytical work comes from joining Gold-layer crop health metrics with non-spectral datasets across the organization. Those cross-domain joins are where insights about field-level yield variation actually emerge, which is something no single dataset can surface on its own.

From Pixels to Decisions: Automating the Breeding Pipeline

To be valuable to the business, the pipeline has to go beyond just calculating a vegetation index. The Gold layer is where pixels turn into decisions. For example, in crop breeding programs, teams test thousands of seed varieties across different microclimates to see which ones survive drought or resist disease. Agronomists do not have time to look at thousands of heatmaps; they need automated, binary outcomes.
By joining the structured hyperspectral data in the Gold tables with field boundaries and historical yield databases, the system applies predefined business logic to automatically flag which genetic lines are failing. This generates concrete "Advance" or "Discard" recommendations for the breeding pipeline. At this stage, the data stops being a scientific image and becomes a direct, automated trigger for the next planting cycle. Edge Deployment: Processing at the Source The bandwidth at some of these remote locations makes a cloud-only approach unrealistic. A 4 GB cube over a 50 Mbps rural LTE connection takes over 10 minutes under ideal conditions, and rural LTE rarely delivers ideal conditions. Multiply that by dozens of passes per day during peak season, and the uplink becomes the dominant bottleneck in the entire system. The first round of processing has to happen on the equipment itself. One Container, Two Targets For managing the single OCI-compliant processing container at the edge, both AWS IoT Greengrass and K3s were considered. While Greengrass provides tight, convenience-focused AWS integration for features like device shadows, OTA updates, and managed MQTT bridging, the long-term architectural goal heavily prioritizes operational independence and portability. K3s was the pick here — it runs fully offline after bootstrap, uses standard Kubernetes manifests, and avoids locking the edge layer into a single vendor. This commitment to a lightweight, standard Kubernetes runtime avoids vendor lock-in at the crucial edge layer and provides the essential flexibility needed should a multi-cloud strategy become necessary. The edge container performs radiometric calibration and spectral flattening, producing a Parquet file that is typically 50–100x smaller than the raw cube. That compression ratio is what makes the entire edge strategy viable — the processed output is small enough to upload over cellular, while the raw cube would take orders of magnitude longer. 
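The bandwidth arithmetic behind this section is easy to check. The sketch below assumes decimal units (1 GB = 8,000 megabits) and an ideal link with no protocol overhead; the function name is illustrative only.

```python
def transfer_seconds(size_gb: float, link_mbps: float) -> float:
    """Ideal transfer time, using decimal units (1 GB = 8,000 megabits)
    and no protocol overhead."""
    return size_gb * 8_000 / link_mbps


RAW_CUBE_GB = 4.0
LINK_MBPS = 50.0

raw_s = transfer_seconds(RAW_CUBE_GB, LINK_MBPS)
print(f"raw cube: {raw_s / 60:.1f} min")

# The edge container's Parquet output is described as 50-100x smaller.
for ratio in (50, 100):
    parquet_s = transfer_seconds(RAW_CUBE_GB / ratio, LINK_MBPS)
    print(f"{ratio}x smaller Parquet: {parquet_s:.1f} s")
```

Under these assumptions the raw cube needs 640 seconds (a little over 10 minutes), while the processed Parquet file needs roughly 6 to 13 seconds on the same link.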
Hardware and Sync Hyperspectral processing is dominated by dense matrix multiplications across hundreds of bands, which requires GPU hardware. The setup uses ruggedized NVIDIA Jetson AGX Orin modules mounted directly on field equipment, providing the CUDA cores needed to run CuPy-based calibration and flattening in near real-time. The sync strategy splits on payload size and urgency. Processed Parquet files stream back to the cloud in near real-time via Amazon MSK (Kafka) over an MQTT bridge, giving the lakehouse immediate telemetry. Kafka was chosen over SQS for this link because the downstream Spark Structured Streaming jobs benefit from offset-based replay semantics — if a job fails mid-batch, it resumes from the last committed offset without data loss or duplication, which is harder to guarantee cleanly with SQS visibility timeouts. The raw cubes stay on local storage and are only backhauled when the equipment returns to a facility with a high-speed connection, keeping bandwidth costs under control. Summary The core ideas behind this pipeline are straightforward: decouple storage from compute using SQS as a buffer, push the first round of processing to the edge so bandwidth stops being the bottleneck, tier storage aggressively so petabyte-scale retention stays economical, and structure everything into a medallion lakehouse so end users get SQL tables instead of binary blobs. Each piece is well-understood on its own; the value is in how they compose into an end-to-end system that stays reliable and cost-effective at scale. As noted at the outset, none of this is specific to agriculture. The hyperspectral cube is just one instance of a pattern that shows up across industries — genomics, satellite imagery, LiDAR, manufacturing inspection — wherever heavy payloads are born at the edge and need to become queryable data in the cloud. The crop science forced this architecture into existence, but the blueprint is portable. 
Swap the payload and the domain-specific transforms, and the rest of the system carries over.
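As a concrete illustration of two ingestion details from earlier, the date-partitioned key prefix and the bundling of file references into manifests, here is a minimal sketch. It makes no AWS calls: a real Lambda would read the keys from SQS messages and submit each manifest to AWS Batch. The function names, the example region and farm identifiers, and the .h5 file names are hypothetical.

```python
from datetime import date
from typing import Iterator, List, Sequence


def partitioned_prefix(region: str, farm_id: str, day: date) -> str:
    """Build the region/farm_id/year/month/day/ prefix described in the text."""
    return f"{region}/{farm_id}/{day.year:04d}/{day.month:02d}/{day.day:02d}/"


def build_manifests(keys: Sequence[str], max_per_manifest: int = 200) -> Iterator[List[str]]:
    """Group object keys into manifests of at most max_per_manifest entries."""
    if max_per_manifest < 1:
        raise ValueError("max_per_manifest must be positive")
    for start in range(0, len(keys), max_per_manifest):
        yield list(keys[start:start + max_per_manifest])


# Hypothetical identifiers, for illustration only.
prefix = partitioned_prefix("us-midwest", "farm-042", date(2024, 9, 3))
keys = [f"{prefix}cube_{i:04d}.h5" for i in range(450)]
manifests = list(build_manifests(keys, max_per_manifest=200))

print(prefix)
print([len(m) for m in manifests])
```

With 450 queued cubes and a cap of 200, this yields three manifests, so the batch layer sees three jobs instead of 450 invocations.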

By Anil Bodepudi
Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack

The Problem Nobody Warned You About You bought the GPUs. Maybe you've got a couple of NVIDIA A100s in a rack, some RTX 4090s under desks, or a Kubernetes cluster with mixed hardware. You've got the compute. Congratulations! Now what? Here's the part that catches most teams off guard: having GPUs is the easy part. Managing them is where things go sideways. You need to figure out which models fit on which cards, how to balance load across machines, how to handle a node going down at 2 AM, and how to expose all of this as a clean API your application team can actually call. Most teams end up building a brittle collection of Python scripts and crontab entries that haven't been updated since 2022. It works until it doesn't, and then someone's paging you on a Saturday. This is the problem GPUStack was built to solve. What Is GPUStack, Exactly? GPUStack is an open-source tool for managing GPU clusters. Think of it as Kubernetes for your inference workloads, except you don't need to spend three days debugging a whitespace error in a Helm chart. At its core, GPUStack does three things well: It aggregates your GPUs. Whether your hardware is spread across bare-metal servers, Kubernetes pods, or cloud instances, GPUStack sees them all as a single pool of compute. One dashboard, full visibility. It orchestrates inference engines. GPUStack doesn't try to reinvent the inference wheel. It plugs into engines like vLLM, SGLang, and TensorRT-LLM, picks the right one for the job, configures it, and manages the lifecycle so you don't have to. It serves models through an OpenAI-compatible API. Once a model is deployed, your application team gets a familiar REST endpoint. No custom client libraries. No new protocols to learn. Swap out the base URL, and you're talking to your own infrastructure. Getting Started in Under 5 Minutes I'm not exaggerating on the timeline. Here's how you go from zero to a running GPUStack server. 
Step 1: Fire Up the Server

You need one machine to act as your control plane. It doesn't even need a GPU. A basic CPU-only box works fine for the server role.

Shell

    sudo docker run -d --name gpustack \
      --restart unless-stopped \
      -p 80:80 \
      --volume gpustack-data:/var/lib/gpustack \
      gpustack/gpustack

That's it. Open your browser, navigate to http://<your-server-ip>, and you'll see the GPUStack dashboard. The first time you log in, you'll set up your admin credentials.

Step 2: Add Your GPU Workers

Now for the fun part. On each worker node, make sure you have the NVIDIA driver and NVIDIA Container Toolkit installed, then run:

Shell

    sudo docker run -d --name gpustack-worker \
      --restart unless-stopped \
      --gpus all \
      -e GPUSTACK_SERVER_URL=http://<your-server-ip> \
      -e GPUSTACK_TOKEN=<your-token> \
      gpustack/gpustack

Replace the server URL and token (grab the token from the GPUStack dashboard). Within seconds, your worker appears in the cluster view with GPU model info, VRAM capacity, and health status.

Rinse and repeat for every GPU machine you want to add. Got 3 machines? Three commands. Got 30? Thirty commands, or one Ansible playbook if you're smart about it. Running the worker command is actually the easiest part. The real final boss of GPU clusters is usually getting the drivers and toolkit installed correctly on the host.

Step 3: Deploy a Model

Head over to the model catalog in the web UI. GPUStack supports pulling models from Hugging Face and the Ollama Library. Pick a model and click deploy.

Here's where the scheduler really excels. It reads the model's metadata, computes the resource requirements for VRAM, compute, and memory, then figures out which workers can handle it. If the model is too big for a single GPU, it can shard it across multiple cards. You don't have to manually calculate whether a 70B parameter model fits on your hardware. GPUStack does the math for you.

Step 4: Call the API

Once the model is running, you get an OpenAI-compatible endpoint.
Grab an API key from the dashboard and test it:

Shell

    curl http://<your-server-ip>/v1/chat/completions \
      -H "Authorization: Bearer <your-api-key>" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "llama3",
        "messages": [
          {"role": "user", "content": "Explain GPU cluster management in one paragraph."}
        ]
      }'

If you're already using the OpenAI Python SDK, switching to your GPUStack endpoint is a one-line change:

Python

    from openai import OpenAI

    client = OpenAI(
        base_url="http://<your-server-ip>/v1",
        api_key="<your-api-key>"
    )

    response = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Hello from my own GPU cluster!"}]
    )
    print(response.choices[0].message.content)

Your application code stays the same. Your infrastructure is now fully under your control.

Why This Actually Matters

Let me break down the features that make GPUStack more than a nice-looking dashboard.

Multi-Backend Flexibility

GPUStack supports vLLM, SGLang, and TensorRT-LLM out of the box. This matters because no single engine is best for every workload. vLLM is great at high-throughput batch processing. TensorRT-LLM squeezes out every last drop of performance on NVIDIA hardware. SGLang shines with structured generation. GPUStack lets you pick the right tool for each deployment, or lets the scheduler pick for you.

Built-In Monitoring

GPUStack integrates with Grafana and Prometheus, giving you real-time dashboards for GPU utilization, VRAM usage, token throughput, and API request rates. No need to bolt on a separate monitoring stack (which usually ends up being three half-finished Grafana dashboards anyway). When something breaks at 2 AM, you'll know exactly which GPU on which machine is the problem.

Automated Failure Recovery

We’ve all been there: a node drops off the map because of a weird PCIe bus error or a driver mismatch that only appears under heavy load. Normally, that means your inference API just returns 500s until you manually intervene.
GPUStack handles the panic phase for you.

When Should You Use GPUStack?

GPUStack isn't the right fit for every scenario. Here's a quick way to think about it:

Use GPUStack if:

  • You have 2+ GPU machines and want to serve LLMs or other AI models behind a unified API, especially if your team doesn't want to become full-time infrastructure engineers just to keep models running.
  • You want to run inference on your own hardware instead of paying per-token to a cloud provider. The cost savings at scale are real, and GPUStack removes the operational overhead that usually makes self-hosting painful.

Maybe skip GPUStack if:

  • You have a single GPU and just want to run a model locally for personal use. Tools like Ollama are simpler for that use case.
  • You're already deep into a custom Kubernetes-based ML platform with Kubeflow or similar. GPUStack can work alongside Kubernetes, but if you've already invested heavily in that ecosystem, the overlap might not be worth it.

The Bigger Picture

The AI infrastructure landscape is shifting. A year ago, most teams defaulted to API providers for inference. Today, with open-weight models getting better every month and GPU costs coming down, self-hosted inference is becoming a real option. Not just for Big Tech, but for startups and mid-size companies too.

The bottleneck isn't hardware anymore. It's operations. It's the glue code between "we have GPUs" and "our application can reliably call a model." GPUStack is a serious attempt at solving that gap, and it's open source under the Apache 2.0 license, so you can inspect, modify, and deploy it without vendor lock-in.

If you’re sitting on a pile of hardware that’s currently just acting as expensive space heaters, or if you’re tired of seeing cloud inference bills that look like mortgage payments, give this a shot. You might find that self-hosting is actually viable again!
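For a feel of the "does a 70B model fit?" math mentioned in Step 3, here is a rough rule-of-thumb sketch. It is not GPUStack's scheduler logic: it counts weight memory only (parameters times bytes per parameter), and the 20% headroom reserved for KV cache and activations is an assumed figure. All names are illustrative.

```python
import math

# Bytes needed per parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str = "fp16") -> float:
    """Memory for the weights alone, in decimal GB
    (billions of parameters times bytes per parameter)."""
    return params_billions * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(params_billions: float, vram_gb: float,
                         precision: str = "fp16", headroom: float = 0.2) -> int:
    """Smallest GPU count whose combined usable VRAM holds the weights.

    headroom is an assumed fraction reserved for KV cache and activations.
    """
    usable = vram_gb * (1.0 - headroom)
    return math.ceil(weight_memory_gb(params_billions, precision) / usable)


print(weight_memory_gb(70))
print(min_gpus_for_weights(70, vram_gb=80))
print(min_gpus_for_weights(70, vram_gb=24, precision="int4"))
```

Under these assumptions, a 70B model at FP16 needs about 140 GB for weights alone, which is why it has to be sharded across several cards unless it is quantized.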

By Sandeep Sadarangani
Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs

Introduction: Beyond Compute Prices

When migrating or running SAP S/4HANA on AWS, many organizations fixate on EC2 instance prices and assume that choosing the cheapest instance types will yield the biggest savings. In reality, cloud TCO is heavily impacted by landscape design choices: how many environments you run, how they’re sized, how data is managed, and what auxiliary services you use. Cutting cloud costs isn’t just about shrinking VM sizes; it’s about architecting an efficient SAP landscape. As one SAP FinOps guide notes, focusing only on instance sizing addresses symptoms, not causes. True cost optimization asks: Is the SAP landscape design efficient? Are you running unnecessary SAP instances, and can workloads be consolidated onto fewer systems? In other words, a thoughtful landscape architecture often yields larger savings than a simple per-server cost reduction.

Understanding an SAP S/4HANA Landscape on AWS

A typical S/4HANA landscape consists of multiple tiers and environments. You might have separate DEV, QA, Staging, and Production systems, each a full SAP stack with its own HANA database and application servers. On AWS, that could translate to dozens of EC2 instances, along with associated storage and network infrastructure. Each additional environment or system copy multiplies costs for compute, Amazon EBS storage, Amazon EFS shared file systems, backup retention, and so on. Landscape design decisions, such as how many parallel systems to run or whether every environment needs high availability, can quickly outweigh the cost of an individual EC2 instance.

Right-Sizing Compute Resources

Right-sizing is the practice of matching instance types and sizes to actual workload needs. SAP S/4HANA is resource-intensive, so it’s critical to choose the appropriate EC2 instance families and sizes for each component. AWS offers SAP-certified instance families.
Avoid the temptation to oversize “just in case”; use monitoring tools like Amazon CloudWatch and SAP’s EarlyWatch reports to gauge real utilization. If a QA system never exceeds 30% CPU and 50% memory, you might run it on a half-sized instance compared to production. Many companies set policies such as “development instances must not exceed 50% of production capacity, and QA 70%.” This ensures non-production systems are proportionally smaller and cheaper.

In Terraform, you can parameterize instance sizes by environment to enforce right-sizing. A production vs. dev HANA server might be expressed as:

Plain Text

    # Example Terraform: Use smaller instance type for non-production
    variable "env" {
      default = "prod"
    }

    resource "aws_instance" "sap_hana" {
      ami           = "ami-0abcdef12345..." # SAP HANA Linux AMI
      instance_type = var.env == "prod" ? "r6i.8xlarge" : "r6i.2xlarge"
      # ... (other configuration like VPC, subnet, security groups)

      tags = {
        Name        = "${var.env}-hana"
        Environment = var.env
      }
    }

In this snippet, a development environment could be launched with -var env=dev to automatically use a smaller instance, whereas production uses r6i.8xlarge. Right-sizing combined with flexible IaC lets you avoid paying for capacity you don’t need while still meeting SAP performance requirements.

Beyond instance selection, leverage cost-saving options for compute:

  • Savings Plans or Reserved Instances: If your SAP workloads run 24/7 in prod, commit to a one- or three-year Savings Plan to get discounts up to 72%.
  • Auto-stop Non-Prod Instances: Schedule stops for dev, QA, and training servers during off-hours. AWS Systems Manager Automation or AWS Instance Scheduler can start and stop instances on a cron schedule. By only running non-prod when needed, you save significantly on compute.
  • Auto Scaling for SAP App Servers: SAP application servers can often scale horizontally. In AWS, you might use an Auto Scaling Group with a schedule or target utilization policy for app servers. This way, you run minimal servers during light load and scale out for peak times.

Consolidation and Landscape Efficiency

An inefficient SAP landscape, one with too many duplicate systems or low-utilization servers, will rack up cloud costs regardless of instance pricing. The cloud gives us the flexibility to consolidate and optimize:

  • Eliminate Unnecessary Systems: Audit your SAP instances. Are there old project systems or unused sandboxes running? It’s not uncommon to find forgotten test systems left on. Retire or shut down what isn’t truly needed.
  • Consolidate Workloads: Where possible, consolidate multiple workloads on a single instance or platform. If you have separate SAP S/4HANA instances for different business units that are lightly used, consider consolidating them into one S/4HANA tenant or system. Fewer HANA databases means fewer high-memory instances to pay for. SAP HANA supports multitenant database containers, so multiple tenant databases can reside in one HANA system; this can be a way to run dev and QA on one HANA VM as separate tenants, rather than on two separate VMs.
  • Shared Services: Some landscape components can be shared across environments. For instance, a single SAP Solution Manager or central SAProuter can serve the entire landscape rather than one per environment. Fewer supporting servers equals lower cost.
  • Right-Size Every Environment: Even within a consolidated landscape, differentiate the sizing. We mentioned limiting dev/QA to a fraction of prod. Also consider whether every environment needs the same number of app servers: maybe prod has 4 app nodes for high throughput, but QA can do with 2 and dev with 1. This scaling down translates directly to cost savings in EC2 hours and licenses.

Keep in mind that consolidation should not compromise testing realism or performance SLAs for production. It’s a balance: consolidate and downsize where you safely can, and use cloud tooling to isolate or simulate full scale only when necessary.
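Two of the rules of thumb above lend themselves to a quick check: the non-production sizing caps and the effect of an off-hours stop schedule. The sketch below is illustrative only. The function names and the 12-hours-by-5-days schedule are assumptions, and the savings figure covers instance-hours only, since EBS volumes attached to stopped instances are still billed.

```python
# Policy caps from the text: non-production capacity as a fraction of production.
MAX_FRACTION_OF_PROD = {"dev": 0.5, "qa": 0.7}


def within_policy(env: str, env_memory_gib: float, prod_memory_gib: float) -> bool:
    """True if a non-prod system's memory stays within its share of production."""
    return env_memory_gib <= MAX_FRACTION_OF_PROD[env] * prod_memory_gib


def scheduled_savings(hours_per_day: float, days_per_week: int) -> float:
    """Fraction of weekly instance-hours avoided by running only on a schedule."""
    running = hours_per_day * days_per_week
    return 1.0 - running / (24 * 7)


# r6i.8xlarge has 256 GiB of memory and r6i.2xlarge has 64 GiB.
print(within_policy("dev", 64, 256))
print(within_policy("qa", 256, 256))
print(f"{scheduled_savings(12, 5):.0%}")
```

With the assumed schedule, a non-prod system runs 60 of 168 weekly hours, which avoids roughly 64% of its instance-hours.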
Storage and Data Management Costs

For SAP workloads, storage costs are often as significant as compute. A single S/4HANA instance may have terabytes of data on EBS volumes. Now multiply that by multiple environments, plus backups: storage can eclipse compute costs if not managed. AWS provides multiple storage options, and using the right one for the right purpose is key:

  • Use EBS Efficiently: Provision EBS volumes that meet performance needs without over-provisioning IOPS or size. AWS now recommends gp3 SSD volumes for SAP HANA over older gp2, as gp3 offers better price/performance. Only use expensive io2 volumes if you truly need ultra-high IOPS and durability for critical workloads; otherwise, gp3 suffices in most cases. Always enable the Delete on Termination flag for temporary volumes and clean up unattached EBS volumes so you’re not paying for leftover storage.
  • Offload Backups to S3: Don’t keep backup files on EBS or EFS longer than necessary. AWS offers the AWS Backint Agent for SAP HANA, which lets HANA back up directly to Amazon S3. This bypasses the need for large intermediate disk space and leverages cheaper object storage. S3 is significantly cheaper per GB than EBS for data at rest. Design a backup strategy for each environment and send those backups to an S3 bucket. From there, apply lifecycle policies to move older backups to colder storage classes like Glacier for further savings. For example, you might keep the most recent 7 days of backups in S3 Standard, then transition older ones to S3 Glacier or Deep Archive.
    # Example Terraform: S3 bucket for SAP HANA backups with lifecycle policy
    # Note: AWS provider 4.x and later deprecate the inline versioning and
    # lifecycle_rule blocks in favor of the separate aws_s3_bucket_versioning
    # and aws_s3_bucket_lifecycle_configuration resources.
    resource "aws_s3_bucket" "sap_hana_backup" {
      bucket        = "my-sap-hana-backups"
      force_destroy = true # allow auto-cleanup if destroying infra

      versioning {
        enabled = false # disable versioning for backup objects to save space
      }

      lifecycle_rule {
        id      = "MoveOldBackupsToGlacier"
        enabled = true

        transition {
          days          = 7
          storage_class = "GLACIER" # move backups to Glacier after 7 days
        }

        expiration {
          days = 180 # delete backups after 6 months
        }
      }

      tags = {
        Purpose = "SAP HANA Backups"
      }
    }

Terraform snippet: The above S3 bucket is configured to automatically transition objects older than 7 days to Glacier and delete anything older than 180 days. This kind of policy keeps your S3 storage costs low by archiving cold data. In practice, set the timing according to your retention requirements. Also consider protections such as MFA Delete (which requires versioning to be enabled on the bucket) or Vault Lock for critical backups.

  • Use EFS for Shared Files, but Lifecycle Manage It: SAP applications often use shared file systems for transports (/usr/sap/trans), global SAP mounts (/sapmnt), and archives. Amazon EFS is ideal for this shared storage: it’s managed NFS and can be mounted by multiple EC2 instances. However, treat EFS space as premium (especially the default Standard storage class). Enable EFS Lifecycle Management (Intelligent-Tiering) so that files not accessed for 30 days move to the lower-cost Infrequent Access tier automatically. For example, old transport files or archived data can sit in EFS IA at a much lower cost per GB. Also, clean up EFS after major projects: deleting leftover project files or moving them to S3 frees up costly EFS space.
    # Example Terraform: EFS file system with lifecycle policy for infrequent access
    resource "aws_efs_file_system" "sap_shared_fs" {
      creation_token   = "sap-shared-fs"
      performance_mode = "generalPurpose"
      throughput_mode  = "bursting"

      lifecycle_policy {
        transition_to_ia = "AFTER_30_DAYS" # move files to Infrequent Access after 30 days
      }

      tags = {
        Name = "sap-shared"
      }
    }

The above EFS definition will automatically tier off files not touched for 30 days. Mount this EFS on your SAP application EC2 instances to use for common directories. This way, you get the convenience of shared storage without continuously paying full price for cold data. Always review and delete any unused EFS file systems as well.

  • Archive and Purge Data: A broader data strategy can greatly reduce TCO. If your S/4HANA database is bloated with years of transactional data, consider using SAP data archiving to move old data to cheaper storage. Storing infrequently accessed data in S3 is far cheaper than keeping it in memory on HANA. Also, use Amazon S3 for storing large interface files or logs rather than keeping them on EBS/EFS, and enable lifecycle policies for those as well. Every GB you offload from expensive storage to S3/Glacier or delete entirely is money saved.

Network and Infrastructure Considerations

Often overlooked in cost planning are networking and auxiliary infrastructure costs:

  • Networking: Within a VPC, data transfer is free between instances in the same AZ, but charges apply across AZs or out to the internet. If your SAP landscape replicates data across AZs, you’ll pay for cross-AZ data transfer. This is usually worth the HA benefit, but be aware of it. NAT Gateway costs also catch people by surprise: if each environment VPC has its own NAT gateway and heavy internet egress, costs add up.
    Mitigation: use VPC endpoints for S3 and other services so traffic stays internal and avoids NAT usage.
  • Backups and DR Infrastructure: If you maintain a warm standby environment or Disaster Recovery site, treat it as another environment in your cost planning. To save costs, you can keep DR systems mostly powered off, or use lower-performance instance types there, and only scale up if a failover is needed. AWS Backup can help here by storing snapshots that you can restore in a DR region on demand. Using lower-tier storage in the DR region for backups is a cost-effective strategy.
  • AWS Managed Services: Consider using services like AWS Backup to automate backup retention policies across your SAP instances. This can ensure snapshots or EBS backups follow a schedule and transition to cold storage after a set time, reducing manual oversight and accidental cost bloat. Also leverage tagging and AWS Cost Explorer to allocate and track costs by environment or system. This transparency can help identify which landscape components are most expensive and need optimization.

Environment Strategy and Automation

Your environment strategy should align with actual business usage patterns. Not every SAP environment needs to run 24/7 at full scale:

  • For development, testing, and training systems, apply on-demand principles. If developers work 8am-6pm, there’s no reason to run dev systems all night. By shutting down servers during off hours, companies can save 50-65% on those environments’ costs without any impact on users.
  • Use Infrastructure-as-Code to spin up temporary environments. Create a Terraform module for a full S/4HANA stack and instantiate it for a short-term project or testing, then destroy it when done. This ensures you pay only for the time actually needed. Automating system copies/refreshes from production backups can populate these ephemeral environments with realistic data when needed.
  • Plan fewer, well-utilized environments rather than many underutilized ones.
    Each additional landscape brings overhead in compute, storage, and management. Wherever possible, combine roles.
  • Enforce governance around provisioning new SAP systems. Implement approval processes that consider cost impact. Some organizations formalize this with policies so that cloud spend doesn’t sprawl uncontrolled.

Conclusion

The bottom line: optimizing your SAP S/4HANA landscape design is often the biggest lever for reducing cloud TCO, even more than shaving a few percent off instance prices. AWS provides a rich toolkit (various EC2 instance types, EBS/EFS storage classes, S3 tiers, and management services) that enables a high degree of cost control if used wisely for your SAP architecture. By right-sizing servers, turning off or consolidating what you don’t need, and leveraging services like S3, EFS lifecycle policies, and AWS Backup, you tackle the true cost drivers in an SAP environment. In practice, companies that take this holistic approach have seen significant savings in their AWS bills for SAP, all while maintaining performance and reliability. The cloud’s promise is agility and efficiency. With a practical engineering mindset and Infrastructure-as-Code automation, you can achieve an efficient SAP landscape that delivers on that promise, ensuring your cloud spend is as optimized as your SAP operations.
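As a rough check on the off-hours savings mentioned under Environment Strategy and Automation, the arithmetic can be sketched as follows. The function name and the `computeShare` parameter (the fraction of an environment's bill that actually stops when instances are shut down, since storage keeps billing) are assumptions for illustration, not figures from AWS.

```typescript
// Hypothetical estimate: fraction of an environment's cost saved by running
// instances only during working hours. All inputs are illustrative assumptions.
const HOURS_PER_WEEK = 24 * 7; // 168

const offHoursSavings = (
  hoursPerDay: number, // hours the system runs on a working day
  daysPerWeek: number, // working days per week
  computeShare: number // share of the bill that stops when instances stop (0..1)
): number => {
  const runningFraction = (hoursPerDay * daysPerWeek) / HOURS_PER_WEEK;
  return computeShare * (1 - runningFraction);
};

// Dev system used 8am-6pm on weekdays, with compute assumed to be 80% of its bill.
const devSavings = offHoursSavings(10, 5, 0.8);
```

With a 10-hour weekday schedule the instances are off for about 70% of the week. If compute is assumed to be 80% of the bill, that works out to roughly 56% saved, which is consistent with the 50-65% range quoted above. Storage such as EBS and S3 continues to bill while instances are stopped.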

By Deepika Paturu
Lambda-Driven API Design: Building Composable Node.js Endpoints With Functional Primitives

“Lambda-driven API design” fits naturally with Node.js because a Lambda handler can be treated as a small, explicit function boundary: an event arrives, a response is returned, and everything else becomes an implementation detail that can be composed. The core challenge is not producing a response object, but scaling many endpoints without turning each handler into a copy-pasted blob of parsing, validation, authorization, logging, and error mapping. AWS has increasingly nudged Lambda Node.js workloads toward modern asynchronous patterns, including guidance that async/await handlers are recommended and that callback-based handler signatures are only supported up to Node.js 22, with Node.js 24 requiring asynchronous work to use async handlers. This constraint is a design opportunity: once handler execution is centered on a returned value and on predictable, composable functions, cross-cutting behavior can be expressed as functional wrappers and pipelines rather than as framework-specific magic.

The HTTP Contract Is the Stable Boundary

A Node.js handler in Lambda is formally defined as the method that processes an invocation event and runs until the handler returns, exits, or times out. AWS documents the valid asynchronous signatures as export const handler = async (event) and export const handler = async (event, context). For HTTP-facing endpoints, that event is commonly produced by an integration such as API Gateway HTTP APIs or by Lambda function URLs, each shaping requests into structured event objects and mapping handler output back to HTTP. Lambda function URLs explicitly follow the same request/response schema as the Amazon API Gateway payload format version 2.0, including fields such as version, rawPath, headers, cookies, and an HTTP method under requestContext.http.method.
API Gateway’s own documentation for HTTP API Lambda proxy integration explains that payload format version 2.0 removes multiValueHeaders and multiValueQueryStringParameters, combines duplicates with commas into the single-value maps, introduces rawPath, and aggregates cookies into a cookies array, with response cookies emitted as set-cookie headers.

Response construction is where design clarity often breaks down, especially when HTTP behavior is scattered across many handlers. For payload format version 2.0, API Gateway can infer defaults when the handler returns valid JSON without an explicit statusCode, assuming statusCode 200, isBase64Encoded false, and content-type application/json, with the body treated as the function response. That inference is convenient for prototypes but becomes brittle in production because status codes, content types, cache headers, correlation IDs, and cookies all need deliberate control. API Gateway documents the explicit response shape for format 2.0 as an object containing statusCode, headers, body, optional cookies, and isBase64Encoded. Treating that response shape as a wire format and wrapping it with a minimal set of pure helper functions keeps endpoint code focused on business decisions rather than serialization rules.

TypeScript

    const json = (statusCode, payload, headers = {}) => ({
      statusCode,
      headers: { "content-type": "application/json", ...headers },
      body: JSON.stringify(payload),
    });

    const text = (statusCode, body, headers = {}) => ({
      statusCode,
      headers: { "content-type": "text/plain; charset=utf-8", ...headers },
      body,
    });

    const withCookies = (response, cookies) => ({ ...response, cookies });

    const noContent = (headers = {}) => ({ statusCode: 204, headers, body: "" });

These helpers align with the documented proxy integration expectation that Lambda returns an object shaped around statusCode, headers, and a string body.
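To illustrate how such helpers keep endpoint code focused on decisions, here is a small typed sketch. The `HttpResponse` type name and the `getWidget` endpoint are hypothetical, introduced only for this example; the response fields (statusCode, headers, body, cookies) are the ones described above.

```typescript
// Response shape for payload format 2.0 as described above; the type name is ours.
type HttpResponse = {
  statusCode: number;
  headers: Record<string, string>;
  body: string;
  cookies?: string[];
  isBase64Encoded?: boolean;
};

const json = (
  statusCode: number,
  payload: unknown,
  headers: Record<string, string> = {}
): HttpResponse => ({
  statusCode,
  headers: { "content-type": "application/json", ...headers },
  body: JSON.stringify(payload),
});

const withCookies = (response: HttpResponse, cookies: string[]): HttpResponse => ({
  ...response,
  cookies,
});

// Hypothetical endpoint logic: it picks a status and a payload, nothing else.
const getWidget = (id: string): HttpResponse => {
  if (id !== "42") return json(404, { error: "not_found" });
  const ok = json(200, { id, name: "demo" }, { "cache-control": "no-store" });
  return withCookies(ok, ["seen=1"]);
};

const found = getWidget("42");
const missing = getWidget("7");
```

Because the helpers are pure functions, the endpoint can be exercised without any Lambda runtime: `found` carries a 200 status, a JSON body, and one cookie, while `missing` is a plain 404.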
Functional Primitives Match the Node.js Execution Model

Composable endpoint behavior depends on the ability to pass functions around, return them from other functions, and assign them like any other value. MDN describes JavaScript functions as first-class objects, enabling functions to be passed as arguments, returned from other functions, and assigned to variables and properties. This property makes middleware-style design possible without a heavyweight framework: a cross-cutting concern becomes a higher-order function that accepts a handler and returns a new handler with additional behavior.

A second primitive is predictable composition. A pipeline is often easiest to express as a reducer over a list of transformations, using a stable accumulator pattern: MDN documents Array.prototype.reduce() as running a reducer callback over all elements and accumulating them into a single value. When endpoint building blocks are functions that return Promises, a reducer can sequence them deterministically by chaining. MDN’s Promise reference explains that then(), catch(), and finally() associate further actions with a Promise that becomes settled, enabling structured chaining.

TypeScript

    const pipeAsync = (...steps) => (input) =>
      steps.reduce((p, step) => p.then(step), Promise.resolve(input));

    const Ok = (value) => ({ ok: true, value });
    const Err = (error) => ({ ok: false, error });

    const map = (f) => (r) => (r.ok ? Ok(f(r.value)) : r);
    const chain = (f) => (r) => (r.ok ? f(r.value) : r);
    const mapErr = (f) => (r) => (r.ok ? r : Err(f(r.error)));

A small Result shape like this prevents expected failures from becoming exceptions, keeping error handling explicit and composable. Exceptions remain appropriate for faults that are truly exceptional, such as invariant violations or library bugs, but HTTP endpoints frequently need to represent expected “no such resource” and “invalid input” conditions as typed outcomes, not stack traces.
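A short usage sketch shows how the pipeline and Result helpers fit together. The `Result` type, the `parseJson` and `requireName` steps, and the sample inputs are hypothetical, introduced only to exercise the pattern; they do not come from any library.

```typescript
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

const Ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const Err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// chain: run the next step only when the previous one succeeded.
const chain =
  <A, B, E>(f: (a: A) => Result<B, E>) =>
  (r: Result<A, E>): Result<B, E> =>
    r.ok ? f(r.value) : r;

// pipeAsync: feed each step's (possibly async) output into the next step.
const pipeAsync =
  (...steps: Array<(x: any) => any>) =>
  (input: unknown): Promise<any> =>
    steps.reduce<Promise<any>>((p, step) => p.then(step), Promise.resolve(input));

type Payload = { name: string };

// Expected failure (malformed JSON) becomes an Err value, not an exception.
const parseJson = (raw: string): Result<unknown, string> => {
  try {
    return Ok(JSON.parse(raw));
  } catch {
    return Err("invalid_json");
  }
};

// Hypothetical contract: the payload must carry a non-empty string "name".
const requireName = (data: unknown): Result<Payload, string> => {
  const candidate = (data as { name?: unknown } | null)?.name;
  return typeof candidate === "string" && candidate.length > 0
    ? Ok({ name: candidate })
    : Err("name_required");
};

const validate = pipeAsync(parseJson, chain(requireName));
```

Calling `validate` with a well-formed body resolves to an Ok result, while malformed JSON or a missing name resolves to an Err value instead of throwing, so a final step can map each error code to a 4xx response.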
Normalizing Payload v2 Events into an Internal Request

API Gateway HTTP APIs and Lambda function URLs share payload format v2.0, but the event is still an AWS-centric structure designed to represent many integration features. A composable endpoint benefits from a small internal request model that captures what business logic actually needs: method, path, headers, query, caller identity hints, raw body, decoded body, and stable request identifiers. API Gateway’s documentation notes that headers in the payload format examples are lowercase, that duplicate headers are comma-separated, and that cookies are surfaced as an array, suggesting that parsing and normalization should happen once, near the boundary.

TypeScript

    const toHttpRequest = (event) => {
      const headers = event.headers ?? {};
      const method = event.requestContext?.http?.method ?? event.httpMethod ?? "GET";
      const path = event.rawPath ?? event.path ?? "/";
      const query = event.queryStringParameters ?? {};
      const cookies =
        event.cookies ??
        (headers.cookie ? headers.cookie.split(";").map((c) => c.trim()) : []);
      const rawBody = event.body ?? "";
      const body = event.isBase64Encoded
        ? Buffer.from(rawBody, "base64").toString("utf8")
        : rawBody;
      return {
        method,
        path,
        headers,
        query,
        cookies,
        body,
        requestId: event.requestContext?.requestId,
        sourceIp: event.requestContext?.http?.sourceIp,
      };
    };

This mapping follows the documented v2.0 shape where rawPath, headers, queryStringParameters, cookies, and isBase64Encoded appear directly on the event, and where HTTP details are available under requestContext.http. It also creates a natural place to hide integration quirks, such as the payload v2.0 detail that rawPath will not include an API mapping value when API mapping is used with a custom domain, which can matter for routing rules that depend on the stage mapping prefix.

Once a normalized request exists, JSON parsing and validation become pure steps.
Even without showing a specific schema library, the shape of the transformation can remain stable: parse the body based on content-type, validate against a contract, and either return a typed error or pass a typed payload onward. This approach keeps the handler itself small and keeps failures consistently represented.

Handler Composition Without Framework Lock-In

A Lambda handler can be treated as async (event, context) => response, and AWS explicitly recommends the async signature while documenting callback-based handlers as unsupported for asynchronous operations starting from Node.js 24. That makes the entire endpoint surface a function that returns a value, which is ideal for higher-order wrapping. Middy formalizes this idea as a lightweight Node.js middleware engine specifically for AWS Lambda, explicitly positioning itself as a way to simplify Lambda code by applying a middleware pattern similar to traditional web frameworks. Implementing the same concept with functional primitives can be even smaller when only a narrow set of behaviors is needed.

TypeScript

    const withHttpRequest = (handler) => async (event, context) =>
      handler({ req: toHttpRequest(event), context });

    const withJsonBody = (handler) => async (args) => {
      const ct = (args.req.headers["content-type"] ?? "").toLowerCase();
      if (!ct.includes("application/json") || args.req.body === "") return handler(args);
      // Parse outside the handler call so only JSON errors map to 400.
      let parsed;
      try {
        parsed = JSON.parse(args.req.body);
      } catch {
        return json(400, { error: "invalid_json" });
      }
      return handler({ ...args, json: parsed });
    };

    const withErrorMapping = (handler) => async (args) => {
      try {
        return await handler(args);
      } catch (err) {
        return json(500, { error: "internal_error" }, { "x-error-type": err?.name ?? "Error" });
      }
    };

The error mapping wrapper is grounded in the reality that API Gateway expects Lambda proxy integrations to return a statusCode, headers, and a string body, and that error semantics become HTTP semantics when statusCode is controlled.
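Wrappers of this kind stack by plain function nesting. The sketch below is self-contained and uses simplified, hypothetical stand-ins (`withRequest`, `withErrors`, `hello`, and the `Args` and `HttpResponse` types) rather than the exact wrappers above; the ordering shown is one reasonable choice, not a requirement.

```typescript
type HttpResponse = { statusCode: number; headers: Record<string, string>; body: string };
type Args = { req: { path: string; body: string } };
type Handler<I> = (input: I) => Promise<HttpResponse>;

const json = (statusCode: number, payload: unknown): HttpResponse => ({
  statusCode,
  headers: { "content-type": "application/json" },
  body: JSON.stringify(payload),
});

// Simplified stand-in for event normalization (hypothetical event shape).
const withRequest =
  (handler: Handler<Args>): Handler<{ rawPath?: string; body?: string }> =>
  async (event) =>
    handler({ req: { path: event.rawPath ?? "/", body: event.body ?? "" } });

// Converts any thrown error into a 500 response.
const withErrors =
  (handler: Handler<Args>): Handler<Args> =>
  async (args) => {
    try {
      return await handler(args);
    } catch {
      return json(500, { error: "internal_error" });
    }
  };

// Business logic only; throws for one path to exercise the error wrapper.
const hello: Handler<Args> = async ({ req }) => {
  if (req.path === "/boom") throw new Error("unexpected");
  return json(200, { path: req.path });
};

// The outermost wrapper runs first: normalize the event, then guard the endpoint.
const handler = withRequest(withErrors(hello));
```

Reading the composition from the outside in gives the execution order, and the business function `hello` never sees a raw event or an uncaught exception path.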
A richer version can map domain errors to 4xx status codes and attach diagnostic headers when appropriate. API Gateway documentation describes passing an error type via a header, such as X-Amzn-ErrorType, when propagating error details.

Conclusion

Lambda-driven API design becomes sustainable when the HTTP boundary is treated as a stable wire contract, and everything above it is expressed as composable functions. AWS documentation clarifies that payload format v2.0 consolidates headers and query parameters, introduces rawPath and cookies, and standardizes the v2.0 event structure across API Gateway HTTP APIs and Lambda function URLs, while the proxy response contract remains an explicit object with statusCode, headers, and a string body. The Node.js runtime direction in Lambda further reinforces functional composition by requiring modern async handler signatures in Node.js 24 for asynchronous operations, eliminating callback-based patterns that obscure control flow and response ownership. With first-class functions and reducer-based composition available in the language, endpoint behavior can be assembled from parsing, validation, authorization, error mapping, and observability primitives that remain small, testable, and reusable across routes.

By Bhanu Sekhar Guttikonda
Smart Deployment Strategies for Modern Applications

Modern application development has moved toward distributed, cloud-based, and often microservices-based applications that require scalability, reliability, and performance under varying conditions. Deployment has therefore become a part of application development, not merely a final activity. Intelligent deployment practices are about building applications that are not just easy to deploy, but also reliable, scalable, and efficient in production. This means moving away from traditional, manual deployment and toward automated, container-based deployment.

Docker and Kubernetes are two prominent technologies in this shift. Docker helps developers package applications together with their dependencies in lightweight, portable containers, which addresses environment consistency problems, while Kubernetes deploys, scales, and self-heals these containers. However, without an appropriate strategy, it is possible to introduce unnecessary complexity and even performance issues. Not every application needs Kubernetes, nor does every deployment issue call for a distributed solution. Knowing when to use Docker on its own, when to use Kubernetes, and how to balance performance, cost, and complexity is vital to delivering effective modern applications.

This article presents smart deployment strategies using Docker and Kubernetes, and highlights the advantages, disadvantages, and performance characteristics of each.

What Docker Does

Docker packages your application, all of its dependencies, and the runtime into a small container.
Issues Before Docker

  • “It works on my machine”: inconsistent behavior across environments such as development, test, staging, and production
  • Dependency conflicts: language version differences, missing libraries, configuration mismatches

Docker Benefits

  • Same behavior everywhere: local development, staging, production, and so on
  • Isolation between apps: each app runs in its own container
  • Fast startup: lightweight compared with a virtual machine
  • Easy deployment: just run the container

    docker start <containername>

(docker start restarts an existing container; docker run <image> creates and starts a new one from an image.)

How Docker Works

    Application Code → Dockerfile → Docker Image → Docker Container → Run application

A container image can run on a developer laptop, on virtual machines, in a data center, or in cloud environments with the same packaged runtime and dependencies. So Docker resolves our packaging issues. But what if the machine has 100 containers? What if one crashes? How do we scale during high traffic? How do we manage deployments? Docker itself does not solve these problems. This is where we need a deployment strategy, and where Kubernetes comes in.

What Kubernetes Does

Kubernetes addresses the operational problem of managing containers once the image has been created. It automates the deployment, scaling, and management of containerized applications, and it maintains the desired state of the application by replacing failed containers and rescheduling workloads as needed.

Kubernetes Benefits

  • Auto scaling: More containers (pods) if traffic increases, and fewer if traffic decreases.
  • Self-healing: Starts the container again if it crashes.
  • Load balancing: Spreads the load across the containers.
  • Zero downtime deployment: Updates the system without stopping it.
  • Service management: Manages multiple microservices easily.

Docker builds and runs the container. Kubernetes runs the container reliably at scale.
For example, in a real-world scenario:

  • Docker = packing lunch boxes
  • Kubernetes = managing a large cafeteria serving thousands

    Build app → Docker container
          ↓
    Deploy many containers → Kubernetes manages them

What a Kubernetes Deployment Actually Does

A Kubernetes Deployment is a resource in a cluster that manages a group of Pods and ReplicaSets for a workload, typically a stateless application. You define the desired state, and Kubernetes moves the actual state in the cluster toward it. Kubernetes also supports rolling updates, where new Pods are created and marked as ready before the old ones are terminated.

The typical process for deploying a Spring Boot application to a Kubernetes cluster:

1. Develop a Spring Boot application.
2. Build and package the Spring Boot application as a Docker image.
3. Push the Docker image to a repository.
4. Reference the image in a Kubernetes Deployment.
5. Kubernetes creates Pods and exposes them via a Service.

Advantages

  • Consistent deployments: Docker provides a standard unit for bundling the application and its runtime dependencies. This minimizes environment drift between development, testing, and production, and is one of the biggest advantages of using containers for Java-based Spring Boot applications.
  • Declarative operations: Kubernetes uses a declarative model to manage its deployments, which makes it straightforward for organizations to automate application deployment.
  • Self-healing: Kubernetes can automatically replace failing containers and reschedule the application when a node becomes unavailable, so recovery does not depend on manual intervention.
  • Inbuilt scaling options: Kubernetes provides built-in autoscaling features for the application.
    This makes elastic, efficient scaling easier to implement.
  • Improved service abstraction and traffic routing: A Kubernetes Service is an API object that exposes a set of Pods behind a single, consistent endpoint and distributes traffic across the matching Pods. If access to the service from outside the cluster is required, Ingress or Gateway-based routing is an option.
  • Safer upgrades: New versions can be rolled out gradually using rolling updates, which reduces deployment risk.

Disadvantages

1. More Operational Complexity

While Docker on its own is simple for small applications, Kubernetes introduces additional concepts such as Pods, Deployments, Services, Ingress, ConfigMaps, Secrets, autoscaling, and network policies. These features can be justified for production environments, but each one is something a team has to learn and operate. The Kubernetes documentation is divided into so many sections because the platform is multi-functional by design, covering orchestration, networking, scaling, storage, and more.

2. Higher Resource Overhead

Kubernetes adds control-plane and cluster components that a plain Docker host does not need. For very small applications, this overhead may outweigh the advantages. This is an assumption based on the complexity of the Kubernetes model compared to the Docker model.

3. Harder Debugging

Debugging a Docker application is relatively simple because the application runs on a single host. Debugging a distributed application is far more complex because multiple hosts, Pods, and Services are involved. This is an assumption based on the complexity of the Kubernetes model compared to the Docker model.

4. Misconfiguration Risk

Kubernetes is a powerful technology, but misconfiguration can lead to application failures.
Network Policies, for example, are complex by design and need careful, production-grade configuration.

Performance Considerations

Kubernetes doesn’t make your application run faster on its own. Performance still relies on many factors such as application design, JVM tuning, container image quality, database performance, network latency, and resource allocation. However, Kubernetes provides operational tools, including autoscaling and rollout features, that help maintain performance under varying loads. In general terms, performance considerations can be divided into four categories:

  • Startup performance. A Spring Boot container can be slow to start, depending on factors such as application size. Because a rollout depends on new Pods becoming ready, startup performance directly affects rollout performance.
  • Runtime efficiency. Containers are much more efficient than traditional deployment models that use many virtual machines. This is why Docker is so popular for container deployment. However, inefficient Docker images or large JVMs can still cause inefficiencies. Docker documentation notes factors such as the choice between glibc-based and musl-based base images.
  • Scaling behavior. Horizontal pod autoscaling responds to increased load by adding more Pods rather than scaling up resources for existing Pods. This only helps if the application can scale horizontally and has no bottleneck at the single-node level.
  • Networking overhead. Kubernetes Services add an abstraction layer to the network. This helps manageability and load balancing, but latency-sensitive applications need careful design at every layer. The abstraction provided by Services is useful operationally, but it is not free.

Limitations

One limitation to be aware of is that Kubernetes Deployments are designed for stateless workloads.
This means that if the application has state tightly coupled to the identity of an instance, or needs ordered storage, it may not be the best candidate for a Kubernetes Deployment. The Kubernetes documentation itself describes Deployments as typically being used for workloads that “do not maintain state.”

Other practical limitations are:

  • Small teams may find Kubernetes too heavy for a simple internal app.
  • Stateful systems still require careful storage, backup, and failover planning.
  • Local development experience can become more complex than plain Docker Compose.
  • Security and networking require active design, not default trust.

When/What to use

| Scenario | Need Docker | Need Kubernetes |
| Run single app | Yes | No |
| Microservices | Yes | Yes |
| Production scale | Yes | Yes (Mandatory) |
| Auto scaling needed | No | Yes |
| High Availability | No | Yes |

Conclusion

The modern deployment model is not just about shipping code; it’s about shipping it reliably and at scale. Docker provides consistency across environments, while Kubernetes provides scale, resilience, and automation. A smart deployment strategy is about selecting the appropriate tool for the job. Docker might be enough for a simple application, but for a complex application with high availability requirements, Kubernetes becomes a must-have. By understanding the strengths and weaknesses of both tools, we can develop efficient, scalable, and sustainable deployment strategies.
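The horizontal scaling behavior discussed under Performance Considerations follows a simple ratio that the Kubernetes documentation gives for the Horizontal Pod Autoscaler: desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below only reproduces that arithmetic. The function name, the default min/max bounds, and the sample numbers are assumptions for illustration, and the real controller also applies tolerances and stabilization windows that are omitted here.

```typescript
// Illustrative reproduction of the HPA replica calculation; not the controller itself.
const desiredReplicas = (
  currentReplicas: number,
  currentMetric: number, // e.g. observed average CPU utilization in percent
  targetMetric: number, // e.g. target average CPU utilization in percent
  minReplicas = 1, // assumed lower bound
  maxReplicas = 10 // assumed upper bound
): number => {
  const raw = Math.ceil(currentReplicas * (currentMetric / targetMetric));
  // Clamp to the configured replica range.
  return Math.min(maxReplicas, Math.max(minReplicas, raw));
};

// 4 Pods at 90% CPU against a 60% target: scale out.
const scaleOut = desiredReplicas(4, 90, 60);

// 4 Pods at 20% CPU against a 60% target: scale in.
const scaleIn = desiredReplicas(4, 20, 60);

// 8 Pods at 200% against a 50% target would need 32, but the upper bound caps it.
const capped = desiredReplicas(8, 200, 50);
```

With these sample inputs the calculation asks for 6 Pods when scaling out, 2 when scaling in, and stops at the assumed maximum of 10 in the capped case. Adding Pods only helps if the application itself scales horizontally, as noted above.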

By Manju George

Top Cloud Architecture Experts


Abhishek Gupta

Principal PM, Azure Cosmos DB,
Microsoft

I mostly work on open-source technologies including distributed data systems, Kubernetes and Go

Srinivas Chippagiri

Sr. Member of Technical Staff

Srinivas Chippagiri is a highly skilled software engineering leader with over a decade of experience in cloud computing, distributed systems, virtualization, and AI/ML applications across multiple industries, including telecommunications, healthcare, energy, and CRM software. He is currently involved in the development of core features for analytics products at a Fortune 500 CRM company, where he collaborates with cross-functional teams to deliver innovative, scalable solutions. Srinivas has a proven track record of success, demonstrated by multiple awards recognizing his commitment to excellence and innovation. With a strong background in systems and cloud engineering at GE Healthcare, Siemens, and RackWare Inc, Srinivas also possesses expertise in designing and developing complex software systems in regulated environments. He holds a Master's degree from the University of Utah, where he was honored for his academic achievements and leadership contributions.

Vidyasagar (Sarath Chandra) Machupalli FBCS

Software Developer Operations Manager | Executive IT Architect,
IBM

Executive IT Architect, IBM Cloud | BCS Fellow, Distinguished Architect (The Open Group Certified)

Pratik Prakash

Principal Solution Architect,
Capital One

Pratik, an experienced solution architect and passionate open-source advocate, combines hands-on engineering expertise with extensive experience in multi-cloud and data science. Leading transformative initiatives across current and previous roles, he specializes in large-scale multi-cloud technology modernization. Pratik's leadership is highlighted by his proficiency in developing scalable serverless application ecosystems, implementing event-driven architecture, deploying AI-ML & NLP models, and crafting hybrid mobile apps. Notably, his strategic focus on an API-first approach drives digital transformation while embracing SaaS adoption to reshape technological landscapes.

The Latest Cloud Architecture Topics

Real-Time AI Inference at Scale Using Cloud Run, GPUs, and Vertex AI
Build scalable threat intel pipelines with Python, STIX/TAXII APIs, and Elasticsearch. Normalize data, preserve context, and enable fast, reliable detection.
June 4, 2026
by khadarvali shaik
· 1,165 Views
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 2
Build a Slack bot using AWS Bedrock and MCP to answer GitHub questions. Learn setup, architecture, and how to extend it with new tools and data sources.
June 4, 2026
by Sangharsh Agarwal
· 1,212 Views
Compliance Automated Standard Solution (COMPASS), Part 11: Compliance as Code, the OSCAL MCP Server Way
How AI-native tooling is finally closing the loop between compliance personas and OSCAL artifacts with an MCP-standardized, AI-agent-ready interface.
June 4, 2026
by Yuji Watanabe
· 1,360 Views
Build a GitHub Slack Bot With AWS Bedrock and MCP, Part 1
Building a Slack bot with traditional APIs led to 400 lines of code. Using MCP and AWS Bedrock reduced complexity, enabling scalable, tool-driven automation.
June 3, 2026
by Sangharsh Agarwal
· 1,554 Views · 1 Like
How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
A practical guide to SaaS architecture decisions that determine whether platforms scale cleanly or collapse under technical debt, security, and growth pressure.
June 1, 2026
by Igboanugo David Ugochukwu (DZone Core)
· 1,122 Views
Zero-Downtime Deployments for Java Apps on Kubernetes
Achieve zero-downtime deployments for Java applications on Kubernetes using rolling updates, readiness/liveness probes, and graceful shutdown strategies.
May 29, 2026
by Ramya vani Rayala
· 3,302 Views
Pragmatica Aether: Let Java Be Java
A modern, distributed, fault-tolerant runtime environment for the language that was intentionally designed for managed environments.
May 29, 2026
by Sergiy Yevtushenko
· 3,534 Views · 1 Like
Building a Zero-Cost Approval Workflow With AWS Lambda Durable Functions
Learn how to build an ETL pipeline with human-in-the-loop approval that costs nothing while waiting — and see real cost data from processing 1,000 documents.
May 28, 2026
by Harpreet Siddhu
· 3,586 Views
Docker Hardened Images Are Free Now — Here's What You Still Need to Build
Docker Hardened Images solve the CVE problem. But CVEs aren't why containers fail in production — governance gaps are. Here's the trust architecture that closes them.
May 27, 2026
by Shamsher Khan (DZone Core)
· 3,577 Views
Setting Up a Data Catalog With Azure Purview and Collibra: What Three Attempts Taught Me
Setting up a data catalog isn’t just a tool problem. My work with Azure Purview and Collibra showed success depends on governance, metadata, and adoption.
May 27, 2026
by Kuladeep Sandra
· 3,313 Views
Catching Data Perimeter Drift Before It Reaches Production
A development-time validation pattern to catch data perimeter setup issues and preserve the historical context of a data perimeter project.
May 26, 2026
by Suresh Gururajan
· 3,276 Views
Scaling Cloud Data Automation: A Practical Guide to Open Table Formats
Leverage open table formats with cloud automation and scalable analytics to build reliable, high-performance data platforms.
May 25, 2026
by Sandeep Batchu
· 3,142 Views
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.
May 25, 2026
by Ingero Team
· 3,370 Views
AWS Managed Database Observability: Monitoring DynamoDB, ElastiCache, and Redshift Beyond CloudWatch
Three AWS managed databases, three dashboards, and one cascade you can only trace by hand. This guide fills the gap CloudWatch leaves open.
May 22, 2026
by Damaso Sanoja
· 3,581 Views · 1 Like
Edge Computing in Utility IoT: Two Architecture Patterns That Actually Work
In this article, we break down edge architecture patterns that fit modern utility infrastructure when power flows both ways.
May 22, 2026
by Yevheniia Mala
· 3,747 Views
Architecting Petabyte-Scale Hyperspectral Pipelines on AWS
Learn how to overcome serverless bottlenecks to process and route petabyte-scale hyperspectral agricultural data on AWS.
May 21, 2026
by Anil Bodepudi
· 3,162 Views
Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack
GPUStack is an open-source tool that turns a bunch of scattered GPU machines into one managed cluster for deploying AI models behind an OpenAI-compatible API.
May 21, 2026
by Sandeep Sadarangani
· 3,474 Views
Why SAP S/4HANA Landscape Design Impacts Cloud TCO More Than Compute Costs
SAP cloud TCO is driven more by landscape sprawl than by EC2 costs; optimize environments and use Terraform, S3, and EFS lifecycle policies to reduce costs.
May 20, 2026
by Deepika Paturu
· 2,322 Views
Lambda-Driven API Design: Building Composable Node.js Endpoints With Functional Primitives
Lambda handlers are just functions that normalize once, wrap cross-cutting concerns as higher-order functions, and keep business logic clean.
May 19, 2026
by Bhanu Sekhar Guttikonda
· 2,768 Views · 4 Likes
Smart Deployment Strategies for Modern Applications
Docker packages applications to ensure consistent and portable deployments. Kubernetes manages them with scaling, reliability, and automation in production.
May 18, 2026
by Manju George
· 3,443 Views
