Containers

Containers allow applications to run quicker across many different development environments, and a single container encapsulates everything needed to run an application. Container technologies have exploded in popularity in recent years, leading to diverse use cases as well as new and unexpected challenges. This Zone offers insights into how teams can solve these challenges through its coverage of container performance, Kubernetes, testing, container orchestration, microservices usage to build and deploy containers, and more.

DZone's Featured Containers Resources

Implementing Asynchronous Communication Between Microservices Using Kafka and Spring Boot

By Mallikharjuna Manepalli
In a microservices system, that tight coupling turns a small hiccup into a cascading slowdown. Thread pools fill, retries amplify traffic, and suddenly your simple request is blocked on half the fleet. My executive summary: asynchronous messaging with Kafka helps systems keep moving when individual components inevitably slow down or fail. It does this by decoupling producers from consumers, absorbing traffic spikes, and allowing services to evolve without tying their availability directly to one another.

Code Patterns in Spring Boot With Kafka

Spring for Apache Kafka gives me two primitives that feel pleasantly old Spring: KafkaTemplate for sending and @KafkaListener for receiving. That template/listener model is intentionally similar to other Spring integration tech, which keeps application code focused on domain logic instead of raw client plumbing.

Below is a compact (but production-shaped) pattern: externalized config via @ConfigurationProperties, a service port for publishing, a REST command endpoint, a consumer with a real error strategy (DLT), and a REST error advice.
Java
// Imports for the snippets below (in a real project, each top-level type lives in its own file).
import java.math.BigDecimal;
import java.time.Instant;
import java.util.Map;
import java.util.UUID;

import jakarta.validation.Valid;
import jakarta.validation.constraints.NotBlank;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Positive;
import org.apache.kafka.common.TopicPartition;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpStatus;
import org.springframework.http.ProblemDetail;
import org.springframework.http.ResponseEntity;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.stereotype.Component;
import org.springframework.util.backoff.FixedBackOff;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseStatus;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// === Messaging config (externalized, type-safe) ===
// Register it with @EnableConfigurationProperties or @ConfigurationPropertiesScan.
@ConfigurationProperties(prefix = "messaging.orders")
@Validated
record OrdersMessagingProps(
    @NotBlank String topic,
    @NotBlank String dltTopic
) {}

// === DTO (event contract) ===
public record OrderCreatedEvent(UUID orderId, UUID userId, BigDecimal total, Instant createdAt) {}

// === Service port (keeps domain testable, Kafka swappable) ===
public interface OrderEventPublisher {
    void publishOrderCreated(OrderCreatedEvent event);
}

// === Adapter: Kafka producer ===
@Component
class KafkaOrderEventPublisher implements OrderEventPublisher {
    private final KafkaTemplate<String, OrderCreatedEvent> template;
    private final OrdersMessagingProps props;

    KafkaOrderEventPublisher(KafkaTemplate<String, OrderCreatedEvent> template, OrdersMessagingProps props) {
        this.template = template;
        this.props = props;
    }

    @Override
    public void publishOrderCreated(OrderCreatedEvent event) {
        // Keying by orderId keeps per-order ordering and drives partitioning decisions.
        template.send(props.topic(), event.orderId().toString(), event);
    }
}

// === REST command API (synchronous edge, async core) ===
@RestController
@RequestMapping("/v1/orders")
class OrdersController {
    private final OrderService orderService; // domain port

    OrdersController(OrderService orderService) {
        this.orderService = orderService;
    }

    @PostMapping
    public ResponseEntity<Map<String, Object>> create(@Valid @RequestBody CreateOrderRequest req) {
        UUID orderId = orderService.create(req.userId(), req.total()); // persists + publishes event
        return ResponseEntity.accepted().body(Map.of("orderId", orderId, "status", "ACCEPTED"));
    }

    record CreateOrderRequest(@NotNull UUID userId, @NotNull @Positive BigDecimal total) {}
}

// === Domain service port (implementation can use outbox, transactions, etc.) ===
public interface OrderService {
    UUID create(UUID userId, BigDecimal total);
}

// === Consumer: downstream service reacts to events ===
@Component
class BillingListener {
    @KafkaListener(topics = "${messaging.orders.topic}", groupId = "${spring.kafka.consumer.group-id}")
    void onOrderCreated(OrderCreatedEvent event) {
        // Idempotency belongs here: process-by-key + store processed eventId/orderId to avoid duplicates.
        // Do work (charge card, create invoice, etc.)
    }
}

// === Kafka consumer error handling: retries + DLT ===
@Configuration
class KafkaErrorHandlingConfig {
    @Bean
    DefaultErrorHandler defaultErrorHandler(KafkaTemplate<Object, Object> template, OrdersMessagingProps props) {
        // The failed record keeps its partition number, so the DLT needs at least as many
        // partitions as the source topic.
        var recoverer = new DeadLetterPublishingRecoverer(
            template,
            (rec, ex) -> new TopicPartition(props.dltTopic(), rec.partition()));
        // Backoff and retry policy are configurable; keep it finite to avoid poison-pill loops.
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3));
    }
}

// === REST error handling (ProblemDetail) ===
@RestControllerAdvice
class ApiErrors {
    @ExceptionHandler(IllegalArgumentException.class)
    @ResponseStatus(HttpStatus.BAD_REQUEST)
    ProblemDetail badRequest(IllegalArgumentException ex) {
        var pd = ProblemDetail.forStatusAndDetail(HttpStatus.BAD_REQUEST, ex.getMessage());
        pd.setTitle("Invalid request");
        return pd;
    }
}

A few been-burned-before notes on the code above. Spring Kafka’s reference docs are explicit that KafkaTemplate is the convenience wrapper for producing, and that DefaultErrorHandler plus DeadLetterPublishingRecoverer is a first-class way to route failed records to dead-letter topics after retries. If we want non-blocking retries, Spring Kafka also provides @RetryableTopic, which orchestrates retry topics and a DLT automatically. That is useful when transient failures are common and you want predictable retry delay semantics.
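To make the "finite retries, then dead-letter" behavior concrete, here is a minimal plain-Java sketch of the control flow. It is not Spring Kafka's implementation: the class RetryThenDeadLetterSketch, its in-memory deadLetters list (a stand-in for a DLT), and the handle method are all hypothetical names introduced for illustration. It assumes, as I read the reference docs, that the retry count excludes the first delivery, so 3 retries means up to 4 delivery attempts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration of "finite retries, then dead-letter".
// NOT Spring Kafka's implementation; every name here is hypothetical.
class RetryThenDeadLetterSketch {

    /** Records that exhausted their retries land here (stand-in for a dead-letter topic). */
    final List<String> deadLetters = new ArrayList<>();

    private final int maxRetries;     // retries allowed after the first delivery attempt
    private final long backOffMillis; // fixed pause between attempts

    RetryThenDeadLetterSketch(int maxRetries, long backOffMillis) {
        this.maxRetries = maxRetries;
        this.backOffMillis = backOffMillis;
    }

    /** Delivers one record and returns how many attempts were made. */
    int handle(String record, Consumer<String> listener) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                listener.accept(record);
                return attempts; // processed successfully
            } catch (RuntimeException ex) {
                if (attempts > maxRetries) {
                    deadLetters.add(record); // retries exhausted: park the record and move on
                    return attempts;
                }
                pause(backOffMillis);
            }
        }
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        var handler = new RetryThenDeadLetterSketch(3, 10L);
        int attempts = handler.handle("order-42", r -> {
            throw new IllegalStateException("poison pill");
        });
        System.out.println(attempts + " attempts, dead letters: " + handler.deadLetters);
    }
}
```

The point of the finite bound is that a poison-pill record costs a known number of attempts and then stops blocking the partition; anything that lands in the dead-letter store still needs an owner and a replay procedure.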
Containers and Local Dev With Docker Compose

When I’m chasing down event flow bugs, I like local environments that feel like the old days: one command, deterministic startup order, and no mystery dependencies. Docker Compose is still the quickest way to stand up Kafka alongside your services, and Confluent publishes straightforward Docker-based tutorials and compose examples for running Kafka locally.

For the service image itself, multi-stage builds are the modern classic: compile in a builder stage, and copy the artifact into a slimmer runtime stage. Docker documents multi-stage builds as a way to reduce the final image contents and keep build dependencies out of production.

Dockerfile
# Multi-stage Dockerfile for a Spring Boot service (orders-service)
FROM eclipse-temurin:21-jdk AS build
WORKDIR /workspace
COPY mvnw pom.xml ./
COPY .mvn .mvn
RUN ./mvnw -q -DskipTests dependency:go-offline
COPY src src
RUN ./mvnw -q -DskipTests package

FROM eclipse-temurin:21-jre
WORKDIR /app
COPY --from=build /workspace/target/*.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java","-jar","/app/app.jar"]

And here’s a Compose file that wires up Kafka and Schema Registry, plus an example Spring Boot service. The exact image choices are illustrative; production choices should reflect your own standards and security posture.
YAML
# compose.yaml (local/dev)
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.6.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.6.0
    depends_on: [zookeeper]
    ports: ["9092:9092"]
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      # Each listener needs its own port: 29092 for containers on the Compose network, 9092 for the host.
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on: [kafka]
    ports: ["8081:8081"]
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:29092

  orders:
    build: ./orders-service
    depends_on: [kafka]
    ports: ["8080:8080"]
    environment:
      SPRING_KAFKA_BOOTSTRAP_SERVERS: kafka:29092
      MESSAGING_ORDERS_TOPIC: orders.events
      MESSAGING_ORDERS_DLTTOPIC: orders.events.dlt
      SCHEMA_REGISTRY_URL: http://schema-registry:8081

Deploying on Kubernetes or AWS

On AWS, the Kafka decision is usually managed or self-managed. If you choose Amazon MSK, the cluster lives in your VPC: you pick subnets across distinct Availability Zones and connect clients using the cluster’s bootstrap brokers. That’s the networking baseline, and it’s not optional. MSK is VPC-first by design.

For authentication/authorization, MSK supports IAM access control, and AWS documents the client configuration for IAM mechanisms. In EKS, I typically pair MSK IAM with IRSA so pods can obtain AWS credentials the AWS way, while ECS services would use task roles instead. Both patterns are documented by AWS; which one applies depends on where your services run.

Kubernetes service discovery is usually the easy part. Services and Pods get DNS names so workloads can call each other by name rather than IP. Kafka itself is reached via bootstrap broker endpoints or via internal Services, but either way, you want the strings in externalized config, not hardcoded.
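As a small illustration of keeping connection strings in externalized config, the sketch below resolves the bootstrap servers from an environment map with an explicit local fallback. In a Spring Boot application this is unnecessary, because relaxed binding already maps SPRING_KAFKA_BOOTSTRAP_SERVERS to spring.kafka.bootstrap-servers; the class BootstrapServersConfigSketch and its resolve method are hypothetical names for a plain-Java client.

```java
import java.util.Map;

// Hypothetical helper for a non-Spring Kafka client: the broker list comes from the
// environment, never from a literal buried in business code.
class BootstrapServersConfigSketch {

    static final String ENV_KEY = "SPRING_KAFKA_BOOTSTRAP_SERVERS";

    /** Returns the configured broker list, or the fallback when the variable is unset or blank. */
    static String resolve(Map<String, String> env, String fallback) {
        String value = env.get(ENV_KEY);
        if (value != null && !value.isBlank()) {
            return value.trim();
        }
        if (fallback == null || fallback.isBlank()) {
            throw new IllegalStateException(ENV_KEY + " is not set and no fallback was provided");
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Passing the environment in as a map keeps the lookup easy to test.
        System.out.println(resolve(System.getenv(), "localhost:9092"));
    }
}
```

The same image can then run under Compose, on EKS, or against MSK with only the environment changing.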
Here’s a minimal Kubernetes Deployment/Service for a Kafka client service. Values like region, account IDs, and MSK endpoints are unspecified placeholders.

YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders
  namespace: apps
spec:
  replicas: 2
  selector:
    matchLabels: { app: orders }
  template:
    metadata:
      labels: { app: orders }
    spec:
      serviceAccountName: orders-sa # IRSA-bound (role ARN unspecified)
      containers:
        - name: orders
          image: <UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com/orders:<TAG>
          ports: [{ containerPort: 8080 }]
          env:
            - name: SPRING_KAFKA_BOOTSTRAP_SERVERS
              value: "<UNSPECIFIED_MSK_BOOTSTRAP_BROKERS>"
            - name: MESSAGING_ORDERS_TOPIC
              value: "orders.events"
            - name: MESSAGING_ORDERS_DLTTOPIC
              value: "orders.events.dlt"
          readinessProbe:
            httpGet: { path: /actuator/health/readiness, port: 8080 }
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: orders
  namespace: apps
spec:
  selector: { app: orders }
  ports:
    - port: 80
      targetPort: 8080

Operationally, MSK exposes metrics into CloudWatch (AWS/Kafka), and broker logs can be delivered to CloudWatch Logs (or S3/Firehose). That combination gives you the classic visibility loop: throughput, lag, under-replicated partitions, and error logs without running your own monitoring plane.

For distributed tracing in async flows, OpenTelemetry is my default vocabulary now. Spring Boot supports OpenTelemetry export via OTLP, and OpenTelemetry defines Kafka semantic conventions so your producer/consumer spans and attributes stay consistent across tools.

CI/CD and the Hard-Earned Field Notes

For CI/CD, I keep it boring: build once, push an immutable image, deploy via a declarative mechanism. AWS Prescriptive Guidance provides a clear GitHub Actions pattern for building Docker images and pushing to Amazon ECR, which is a solid baseline when your region/account is unspecified until configured.
YAML
# .github/workflows/orders.yml
name: orders
on:
  push:
    branches: ["main"]
jobs:
  build_push_deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: "21"
      - name: Build & test
        run: ./mvnw -q test package
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::<UNSPECIFIED_AWS_ACCOUNT_ID>:role/<UNSPECIFIED_GHA_ROLE>
          aws-region: <UNSPECIFIED_REGION>
      - name: Login to ECR
        run: |
          aws ecr get-login-password --region <UNSPECIFIED_REGION> \
            | docker login --username AWS --password-stdin <UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com
      - name: Build & push image
        run: |
          IMAGE=<UNSPECIFIED_AWS_ACCOUNT_ID>.dkr.ecr.<UNSPECIFIED_REGION>.amazonaws.com/orders:${{ github.sha }}
          docker build -t "$IMAGE" ./orders-service
          docker push "$IMAGE"
          # Shell variables do not survive across steps; export IMAGE for the deploy step.
          echo "IMAGE=$IMAGE" >> "$GITHUB_ENV"
      - name: Deploy to EKS (example)
        run: |
          aws eks update-kubeconfig --name <UNSPECIFIED_EKS_CLUSTER> --region <UNSPECIFIED_REGION>
          kubectl -n apps set image deploy/orders orders="$IMAGE"

Now, the part I wish someone had handed me in 2016: Kafka gives you strong tools, but it does not remove distributed-systems truths. You still need safeguards on the consumer side: idempotent processing, disciplined schema management, and clearly defined retry and dead-letter topic behavior.

Kafka’s documentation is careful about the limits of “exactly once” guarantees. Idempotent producers and transactions can strengthen delivery semantics, but achieving true end-to-end exactly-once behavior, especially when external side effects are involved, still depends on deliberate system design.

For schema governance, Kafka itself doesn’t ship a schema registry, but acknowledges third-party registries; in practice, Confluent Schema Registry and Apicurio Registry are common choices.
Both store schemas out-of-band, so messages carry only a schema identifier, and both support evolvable contracts across Avro/JSON Schema/Protobuf depending on your ecosystem.

Conclusion and Best Practices

If you take one lesson from my legacy brain into modern event-driven systems, let it be this: asynchrony is a reliability feature, not a performance trick. Kafka’s durable log and consumer group model decouples uptime and absorbs spikes, but you only get the real benefit when you treat schemas as contracts, consumers as idempotent processors, and failure handling as first-class application behavior.

On AWS, the operational baseline is non-negotiable. MSK lives in your VPC across AZ subnets, clients connect via bootstrap brokers, IAM auth is configured explicitly, and observability lives in CloudWatch. Do those fundamentals early, and Kafka stops feeling like a mysterious black box and starts feeling like the dependable workhorse it was built to be.
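The "consumers as idempotent processors" advice can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not the article's implementation: IdempotentConsumerSketch and handleOnce are hypothetical names, and the in-memory set stands in for what would normally be a durable store (for example a table with a unique constraint on the order ID, written in the same transaction as the side effect).

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical idempotent-consumer guard: a redelivered event must not repeat its side effect.
class IdempotentConsumerSketch {

    // In-memory stand-in for a durable "processed events" store.
    private final Set<UUID> processed = ConcurrentHashMap.newKeySet();

    /** Runs the side effect at most once per orderId; returns false for duplicates. */
    boolean handleOnce(UUID orderId, Consumer<UUID> sideEffect) {
        if (!processed.add(orderId)) {
            return false; // already handled: skip silently
        }
        try {
            sideEffect.accept(orderId);
            return true;
        } catch (RuntimeException ex) {
            processed.remove(orderId); // let a later retry attempt the work again
            throw ex;
        }
    }

    public static void main(String[] args) {
        var guard = new IdempotentConsumerSketch();
        UUID orderId = UUID.randomUUID();
        System.out.println(guard.handleOnce(orderId, id -> System.out.println("charging " + id)));
        System.out.println(guard.handleOnce(orderId, id -> System.out.println("charging " + id)));
    }
}
```

The second call returns false without charging again, which is the behavior an at-least-once delivery model relies on.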
Your AI Coding Agent Can't Steal What It Never Had: The Docker Sandbox Isolation Story

By Shamsher Khan, DZone Core
I ran an AI coding agent against a broken Kubernetes deployment for five minutes. The agent called Anthropic's API dozens of times: reasoning about manifests, running kubectl commands, redeploying workloads. It made fully authenticated requests throughout the entire session. The API key was never in its environment.

Shell
env | grep -iE "anthropic|api_key|secret|token|password"
# (empty)

That is Docker Sandbox's credential isolation model in action. This article is about what that actually means, and what else the isolation holds, breaks, and surprises you with when you probe it properly.

Key Takeaways

  • Docker Sandbox uses a host-side proxy to inject API credentials without the agent ever seeing them; the agent makes authenticated calls without possessing the key
  • Seven live isolation probes confirmed the boundary held throughout real AI agent activity, not just at rest
  • Network policy is hostname-scoped HTTP filtering, not a full network control plane, with three specific behaviors the documentation doesn't make clear
  • DevOps agents can run docker build and kubectl inside the sandbox without any path to the host Docker daemon or cluster credentials
  • The --branch parallel agent mode is Git-level isolation, not VM-level, an important distinction for threat models requiring separate credentials per agent

The Setup

I manage eight AKS clusters for Fortune 500 clients. My laptop has Azure service principals, SSH keys, kubeconfig files with a dozen cluster contexts, and twenty-plus repos, some with .env files containing real API keys. Running an AI agent from this machine without guardrails means the agent inherits all of it.

Docker Sandbox changes that. Each sandbox is a microVM with its own Linux kernel, its own Docker daemon, and its own network stack. You mount one project directory. The agent sees one project directory. Everything else on the machine does not exist inside the sandbox.

I spent two weeks testing this claim. Here is what I found.
Test environment:

What                       Detail
sbx version                v0.31.1 · commit e658be1
Host                       macOS Apple Silicon
Network endpoints probed   13
Isolation probes           7 targeted commands
Kubernetes scenario        Real agent task, two bugs, timed

All findings backed by real terminal output. Full repo: github.com/opscart/docker-sandbox-devops.

How the Credential Isolation Actually Works

The sandbox environment has no API keys. But the agent made authenticated API calls. Here is the mechanism:

Shell
env | grep proxy
# https_proxy=http://gateway.docker.internal:3128
# http_proxy=http://gateway.docker.internal:3128
# JAVA_TOOL_OPTIONS=-Dhttp.proxyHost=gateway.docker.internal -Dhttp.proxyPort=3128 ...

Every outbound request (HTTP, HTTPS, even Java tools) routes through a proxy at gateway.docker.internal:3128. That proxy runs on the Mac host, completely outside the microVM boundary.

  1. When the agent sends a POST to api.anthropic.com, there is no Authorization header, because the agent does not have the key.
  2. The request reaches the host-side proxy.
  3. The proxy checks the allowlist. api.anthropic.com is in the default AI services group under the Balanced policy.
  4. Authentication is performed by the host-side proxy using credentials stored outside the sandbox boundary.
  5. The authenticated request is forwarded to Anthropic.

The agent receives the response. It has no idea what key was used, where it came from, or how to find it again.

Think of it like an OAuth gateway. The proxy holds the credential and vouches for the agent's requests. The agent gets access without ever possessing the key. You cannot steal what you never had.

This is architecturally different from the standard setup where ANTHROPIC_API_KEY sits in the shell environment, one echo $ANTHROPIC_API_KEY away from being exfiltrated.

What the Four Isolation Layers Actually Do

Docker Sandbox stacks four layers:

Hypervisor isolation. Separate Linux kernel per sandbox. Host processes invisible. Other sandboxes invisible. A compromised sandbox cannot escalate to the host kernel. This is the fundamental difference from a Docker container: a container shares the host kernel. The microVM does not.

Network isolation. All outbound HTTP/HTTPS routes through the host-side proxy. Raw TCP, UDP, and ICMP are blocked at the network layer. Three policy tiers: allow-all, balanced (curated dev allowlist), deny-all. Set before starting your first sandbox:

Shell
sbx policy set-default balanced

Docker Engine isolation. Each sandbox runs a private Docker daemon with its own socket. No path to the host Docker daemon. An agent can run docker build and docker run without socket mounting, which is the tradeoff that breaks isolation in plain container-based approaches.

Credential isolation. Proxy-based injection as described above. The raw key never enters the microVM.

[Figure: macOS host with sensitive assets and proxy on the left, Docker Sandbox microVM in the center, network policy zones on the right.]

Seven Isolation Proofs, Run Live After a Real Agent Task

The agent exited after completing the debugging task. The sandbox remained alive, and I executed the following commands from the same shell session the agent had used, to show exactly what was accessible throughout the entire run.

1. Filesystem Boundary

Shell
ls /Users/opscart/
# Source
ls /Users/opscart/.ssh/ 2>&1

One directory. The workspace mount. SSH keys, other repos, credential directories: none of them exist inside the sandbox. Parent directories above the workspace are read-only stubs with no siblings.

One critical implication: if your workspace is your home directory, your entire home is visible and writable. Always mount a project subdirectory, not your home.

2. No Credentials in Environment

Shell
env | grep -iE "anthropic|api_key|aws|secret|token|password"
# (empty)

Confirmed. The agent that just made dozens of API calls had no raw credentials anywhere in its environment.

3. Proxy Confirms the Injection Mechanism

Shell
env | grep proxy
# https_proxy=http://gateway.docker.internal:3128
# no_proxy=localhost,127.0.0.1,::1,[::1],gateway.docker.internal

Proxy address visible. Credentials it carries: not visible. The mechanism described above confirmed live inside the running sandbox.

4. Process Namespace

Shell
ps aux | wc -l
# 13

A macOS host runs hundreds of processes. The sandbox shows 13, all internal. The stack includes dockerd, containerd, socat bridging SSH agent forwarding, and the coding agent. Host processes are completely invisible. There is no way to inspect or interact with anything running on the host.

5. Private Docker Engine

Shell
docker info | grep -E "Server Version|Operating System|ID"
# Server Version: 29.4.3
# Operating System: Ubuntu 25.10 (containerized)
# ID: e6934b23-368c-4259-a873-96f879f587e5

Ubuntu 25.10. A unique daemon ID that differs from docker info on the host, confirming the sandbox runs a fully isolated daemon. The agent deployed a full Kubernetes cluster using this daemon. No path to the host Docker socket existed.

6. Host Services Unreachable

Shell
curl -s --max-time 3 https://localhost:6443 2>&1 || echo "blocked"
# curl: (7) Failed to connect to localhost port 6443: Connection refused

Port 6443 is my minikube cluster on the Mac host. From inside the sandbox, localhost is the sandbox's own loopback. Host clusters, host SSH, host services: unreachable by default. Eight AKS contexts on this machine. None are reachable from inside the sandbox without an explicit policy rule.

7. What the Agent Had vs. What It Didn't

During the entire debugging task, the agent had full access to one project directory, kubectl to the sandbox-internal Kubernetes cluster, and full Docker capabilities against the private daemon. It could not reach any other directory, cloud credentials, other kubeconfig contexts, the host Docker daemon, or any cluster not running inside the sandbox.
All seven proofs held throughout the session without exception.

Three Network Policy Findings That Change How You Think About It

Network policy is not a full network control plane. It is hostname-scoped HTTP filtering. Three findings define the actual scope:

Finding 1: Blocking returns HTTP 403, not TCP rejection.

Plain Text
probe "example.com" "https://example.com"
# example.com | exit=0 | http=403

Exit code 0. The curl command succeeded. The proxy returned 403 directly. An agent that retries on 403 will retry blocked requests indefinitely. It cannot distinguish a blocked domain from a legitimate server-side error by exit code. For DevOps workflows, an agent hitting a blocked container registry will keep retrying silently rather than failing fast.

Finding 2: HTTP CONNECT established a tunnel to port 22 on an allowed host.

Plain Text
# Port 22 (SSH port)
curl -s --max-time 5 telnet://github.com:22
# Connected to github.com port 22

# Port 9999 (non-standard port)
curl -s --max-time 5 telnet://github.com:9999
# Connected to github.com port 9999

github.com is on the Balanced allowlist. HTTP CONNECT established TCP tunnels to github.com on both port 22 and the non-standard port 9999, and both succeeded. Port-based restrictions are not enforced at the proxy layer. The Balanced policy is hostname-scoped only. Any port to an allowed host is reachable via HTTP CONNECT.

Finding 3: DNS is not filtered.

A common assumption is that all outbound traffic routes through the HTTP proxy, including DNS. Lab results show DNS resolution occurs independently:

Plain Text
dig example.com +short
# 172.66.147.243

A blocked domain resolved. The microVM has an internal stub resolver that forwards DNS independently of the HTTP proxy. An agent can resolve any hostname regardless of the active policy. DNS cannot serve as a secondary enforcement layer.

These findings do not break the isolation model. They define its actual boundary. Network policy controls HTTP/HTTPS access by hostname.
It does not control DNS, TCP tunnels to allowed hosts on arbitrary ports, or how agents interpret 403 responses.

The Agent Scenario: Isolation Under Real Load

The real test of isolation is not seven probe commands. It is whether the boundary holds while an agent is actively working: making API calls, running kubectl, deploying containers.

I gave an AI agent a broken Kubernetes deployment: a payments-service with memory limits set to 64Mi on a service that needs ~150Mi at peak. The agent received a task file and a set of manifests. No other context.

The agent completed the task in under five minutes. It found two bugs: one planted, one discovered independently by reading the manifest and noticing health check probes targeting port 8080 on an nginx container that only serves on port 80. The task said nothing about probes.

Result: both pods 1/1 Running, 0 restarts. The seven isolation proofs above were verified immediately after. Throughout the entire debugging session, the boundary held without exception.

Full article and complete repo at opscart.com/docker-sandbox-devops.

What This Means for DevOps Engineers Specifically

Most Docker Sandbox articles target software developers running Claude Code on a single codebase. The DevOps case is different and more demanding.

A DevOps engineer running an AI agent faces a broader attack surface: multiple cluster contexts, infrastructure credentials, IAM roles, service accounts, kubeconfigs that grant production access. The blast radius of a compromised or manipulated agent is not one repo. It is potentially every system those credentials touch.

Docker Sandbox addresses this at the architecture level rather than the prompt level. You are not relying on the agent being well-behaved. You are relying on the microVM boundary, the proxy, and the private Docker daemon. The agent can be fully autonomous inside the sandbox because the guardrail is the environment, not the agent's behavior.
The private Docker Engine is particularly significant. DevOps agents need to build and test containers. Every other local isolation approach that allows container operations requires socket mounting, which gives the agent direct access to the host Docker daemon and every image and volume on the host. Docker Sandbox eliminates this tradeoff.

What Is Still Rough

The image iteration cycle is the primary friction point. Adding a tool requires editing a Dockerfile, rebuilding, pushing to a registry, and recreating the sandbox. For a stable toolchain, this is acceptable. For rapid experimentation, it is not.

The --branch parallel agent mode is Git isolation, not VM isolation. Both agents run in one microVM with shared Docker and network. For separate credentials or separate network policies per agent, you need separate workspace directories.

The network policy CLI has non-obvious syntax in several places: sbx policy deny does not remove an allow rule, and external cluster access requires two policy rules, not one. Neither behavior is documented.

The CLI changes between minor versions. v0.31.1 changed login flow, renamed policy tiers, and introduced --clone mode. Pin your version.

When Not to Use Docker Sandbox

Docker Sandbox is the right tool for a specific set of problems. It is not the right tool when:

  • You need raw UDP or ICMP. Network tracing tools (traceroute, mtr), some mTLS configurations, and anything relying on ICMP will not work, because the sandbox proxy only handles HTTP/HTTPS.
  • Your toolchain requires host-device access. USB devices, GPU passthrough beyond basic forwarding, and hardware security keys are not accessible from inside the microVM.
  • You are on a memory-constrained machine. Each sandbox runs a full microVM plus its own Docker daemon. On a machine with 8GB RAM, running multiple sandboxes simultaneously alongside Docker Desktop and a browser will cause memory pressure.
  • You need production-grade audit logging. Docker Sandbox is Experimental. Audit trails, compliance logging, and enterprise controls are not mature yet. For regulated environments, evaluate accordingly.
  • Your agent needs to coordinate across multiple repositories simultaneously. The one-sandbox-per-workspace model means cross-repo agent work requires careful orchestration. The --clone mode helps but adds git workflow overhead.

Conclusion

The credential isolation model is the headline: the agent made authenticated API calls throughout the session without the API key ever entering the sandbox. Authentication was performed by the host-side proxy using credentials stored outside the sandbox boundary. The agent could use the credential, but it could never see, copy, or exfiltrate it.

Seven isolation proofs confirmed the boundary held under real active load. One directory visible. No credentials. No host processes. No host clusters. No host Docker daemon.

The network policy findings add important nuance. The --branch mode reality is different from what the documentation implies. Docker Sandbox is Experimental, and the CLI is moving. Use it knowing what it is, and what it is not.
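Finding 1 above has a practical consequence for anyone writing agent tooling: a zero curl exit code is not enough, and the HTTP status has to be classified before retrying. The sketch below is one hypothetical way to encode that. ProxyAwareRetryPolicySketch, its Decision enum, and the set of statuses treated as transient are all assumptions made for illustration, not behavior defined by Docker Sandbox.

```java
import java.util.Set;

// Hypothetical retry policy for an agent running behind a filtering proxy.
// Key idea from Finding 1: a blocked domain comes back as HTTP 403 with exit code 0,
// so 403 must fail fast instead of being retried.
class ProxyAwareRetryPolicySketch {

    enum Decision { SUCCESS, RETRY, FAIL_FAST }

    // Assumption: these statuses are worth retrying; tune the set for your own tooling.
    private static final Set<Integer> TRANSIENT_STATUSES = Set.of(429, 502, 503, 504);

    static Decision classify(int exitCode, int httpStatus) {
        if (exitCode != 0) {
            return Decision.RETRY; // transport-level failure, possibly transient
        }
        if (httpStatus >= 200 && httpStatus < 400) {
            return Decision.SUCCESS;
        }
        if (TRANSIENT_STATUSES.contains(httpStatus)) {
            return Decision.RETRY;
        }
        return Decision.FAIL_FAST; // includes 403 returned by the policy proxy
    }

    public static void main(String[] args) {
        System.out.println(classify(0, 403));
        System.out.println(classify(0, 503));
        System.out.println(classify(0, 200));
    }
}
```

Any RETRY decision should still be bounded by a maximum attempt count so that a misclassified status cannot loop forever.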
Pragmatica Aether: Let Java Be Java
By Sergiy Yevtushenko
Zero-Downtime Deployments for Java Apps on Kubernetes
By Ramya vani Rayala
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
By Seshendranath Balla Venkata
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF: no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell
sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended.

For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML
fleet:
  nodes:
    - gpu-node-01:8080
    - gpu-node-02:8080
    - gpu-node-03:8080
    - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell
$ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
    "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"
node              source  cnt    avg_us
----------------  ------  -----  ------
gpu-node-01       4       11009  5.2
gpu-node-01       3       847    18400   # ← 9x higher than peers
gpu-node-02       4       10892  5.1
gpu-node-02       3       412    2100
gpu-node-03       4       10847  5.3
gpu-node-03       3       398    1900
gpu-node-04       4       10901  5.0
gpu-node-04       3       421    2200
8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
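The "jumps out immediately" judgment can also be made mechanically. The following is a minimal sketch, not part of Ingero: StragglerDetectionSketch, the stragglers method, and the 3x threshold are assumptions chosen for illustration. It flags any node whose average host-event latency exceeds a multiple of the median of its peers, using the avg_us values from the table above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical post-processing of the fleet query result (not an Ingero feature).
class StragglerDetectionSketch {

    /** Returns nodes whose average latency exceeds factor x the median of the other nodes. */
    static List<String> stragglers(Map<String, Double> avgMicrosByNode, double factor) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, Double> candidate : avgMicrosByNode.entrySet()) {
            List<Double> peers = new ArrayList<>();
            for (Map.Entry<String, Double> other : avgMicrosByNode.entrySet()) {
                if (!other.getKey().equals(candidate.getKey())) {
                    peers.add(other.getValue());
                }
            }
            if (peers.isEmpty()) {
                continue; // a single node has nothing to be compared against
            }
            Collections.sort(peers);
            double peerMedian = peers.get(peers.size() / 2); // upper median for even counts
            if (candidate.getValue() > factor * peerMedian) {
                flagged.add(candidate.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        // avg_us of host events (source = 3) per node, copied from the query output above.
        Map<String, Double> hostAvgMicros = new LinkedHashMap<>();
        hostAvgMicros.put("gpu-node-01", 18400.0);
        hostAvgMicros.put("gpu-node-02", 2100.0);
        hostAvgMicros.put("gpu-node-03", 1900.0);
        hostAvgMicros.put("gpu-node-04", 2200.0);
        System.out.println(stragglers(hostAvgMicros, 3.0));
    }
}
```

Comparing against the peer median rather than the fleet mean keeps a single slow node from dragging the baseline toward itself.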
One more command to see the causal chains: Shell $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s) [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O Root cause: 847 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O Root cause: 855 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O — checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.” 3. Offline Merge and Perfetto Export Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints — there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file: Shell # 1. Collect traces from each node scp gpu-node-01:~/.ingero/ingero.db node-01.db scp gpu-node-02:~/.ingero/ingero.db node-02.db # 2. Merge and analyze ingero merge node-01.db node-02.db -o cluster.db ingero explain -d cluster.db Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node. For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline. 
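The reason node-namespaced IDs make an offline merge safe can be shown with a small sketch. It uses Python's built-in sqlite3 module; the `events` table layout, the column names, and the helpers `make_node_db` and `merge_node_dbs` are assumptions made for illustration, not Ingero's actual schema or merge code.

```python
import sqlite3

# Hypothetical schema for illustration only; Ingero's real tables may differ.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS events "
    "(id TEXT PRIMARY KEY, node TEXT, source INTEGER, duration INTEGER)"
)

def make_node_db(node, rows):
    """Build an in-memory per-node database with node-namespaced event IDs."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany(
        "INSERT INTO events VALUES (?, ?, ?, ?)",
        [(f"{node}:{local_id}", node, source, duration)
         for local_id, source, duration in rows],
    )
    db.commit()
    return db

def merge_node_dbs(node_dbs):
    """Copy every node's events into one database.

    The primary key would reject duplicates, so this only works because
    each ID carries its node name as a prefix.
    """
    merged = sqlite3.connect(":memory:")
    merged.execute(SCHEMA)
    for db in node_dbs:
        rows = db.execute("SELECT id, node, source, duration FROM events").fetchall()
        merged.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    merged.commit()
    return merged

# Both nodes reuse local ID 4821, yet the merge succeeds.
node1 = make_node_db("gpu-node-01", [(4821, 3, 18_400_000), (4822, 4, 5_200)])
node2 = make_node_db("gpu-node-02", [(4821, 3, 2_100_000), (4822, 4, 5_100)])
cluster = merge_node_dbs([node1, node2])
per_node = cluster.execute(
    "SELECT node, count(*) FROM events GROUP BY node ORDER BY node"
).fetchall()
print(per_node)
```

Without the node prefix, the second insert of local ID 4821 would violate the primary key, which is the collision the namespacing avoids.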
Why We Built It This Way The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path. We deliberately avoided that. No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet. Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity. Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself. Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query. Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. 
The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. 
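To make the client-side fan-out, partial-failure handling, and clock-offset estimation described under "Why We Built It This Way" more concrete, here is a rough sketch in Python. The real fleet client is written in Go and its internals are not shown in this post, so `fan_out`, `estimate_offset`, and the `fetch` callback are hypothetical names, and the offset formula is the standard NTP one assumed from the phrase "NTP-style."

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median

def fan_out(nodes, fetch):
    """Run fetch(node) on every node in parallel.

    Returns (rows, failed). Each row gets the node name prepended; a node
    whose fetch raises is recorded in `failed` instead of aborting the query.
    """
    rows, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {node: pool.submit(fetch, node) for node in nodes}
        for node, future in futures.items():
            try:
                rows.extend([node, *row] for row in future.result())
            except Exception as exc:  # partial failure is reported, not fatal
                failed.append((node, str(exc)))
    return rows, failed

def estimate_offset(samples):
    """Median of the NTP offset ((t1 - t0) + (t2 - t3)) / 2 over all samples.

    Each sample is (t0, t1, t2, t3): client send, server receive,
    server send, client receive.
    """
    return median(((t1 - t0) + (t2 - t3)) / 2 for t0, t1, t2, t3 in samples)

# Stand-in for an HTTPS call to one agent; gpu-node-03 is "down" here.
def fake_fetch(node):
    if node == "gpu-node-03":
        raise ConnectionError("unreachable")
    return [(3, 847)] if node == "gpu-node-01" else [(3, 412)]

rows, failed = fan_out(["gpu-node-01", "gpu-node-02", "gpu-node-03"], fake_fetch)
offset = estimate_offset([(0, 105, 106, 11), (20, 126, 127, 33), (40, 150, 151, 52)])
print(rows, failed, offset)
```

The median over three samples discards one outlier round trip, which is presumably why a small odd sample count is enough on a quiet LAN.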
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:
sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db: individual node traces from a 3-node cluster
sample-cluster.db: all three merged into one (600 events, 6 chains, 9 stacks)
GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design. If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together. Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.
Related Reading
GPU incident response in 60 seconds with eBPF: single-node investigation workflow that the fleet feature extends
11-second time to first token on a healthy vLLM server: kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
GPU showing 97% utilization while training runs 3x slower: why nvidia-smi metrics alone miss the real story

By Ingero Team
Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack

The Problem Nobody Warned You About You bought the GPUs. Maybe you've got a couple of NVIDIA A100s in a rack, some RTX 4090s under desks, or a Kubernetes cluster with mixed hardware. You've got the compute. Congratulations! Now what? Here's the part that catches most teams off guard: having GPUs is the easy part. Managing them is where things go sideways. You need to figure out which models fit on which cards, how to balance load across machines, how to handle a node going down at 2 AM, and how to expose all of this as a clean API your application team can actually call. Most teams end up building a brittle collection of Python scripts and crontab entries that haven't been updated since 2022. It works until it doesn't, and then someone's paging you on a Saturday. This is the problem GPUStack was built to solve. What Is GPUStack, Exactly? GPUStack is an open-source tool for managing GPU clusters. Think of it as Kubernetes for your inference workloads, except you don't need to spend three days debugging a whitespace error in a Helm chart. At its core, GPUStack does three things well: It aggregates your GPUs. Whether your hardware is spread across bare-metal servers, Kubernetes pods, or cloud instances, GPUStack sees them all as a single pool of compute. One dashboard, full visibility. It orchestrates inference engines. GPUStack doesn't try to reinvent the inference wheel. It plugs into engines like vLLM, SGLang, and TensorRT-LLM, picks the right one for the job, configures it, and manages the lifecycle so you don't have to. It serves models through an OpenAI-compatible API. Once a model is deployed, your application team gets a familiar REST endpoint. No custom client libraries. No new protocols to learn. Swap out the base URL, and you're talking to your own infrastructure. Getting Started in Under 5 Minutes I'm not exaggerating on the timeline. Here's how you go from zero to a running GPUStack server. 
Step 1: Fire Up the Server You need one machine to act as your control plane. It doesn't even need a GPU. A basic CPU-only box works fine for the server role. Shell sudo docker run -d --name gpustack \ --restart unless-stopped \ -p 80:80 \ --volume gpustack-data:/var/lib/gpustack \ gpustack/gpustack That's it. Open your browser, navigate to http://<your-server-ip>, and you'll see the GPUStack dashboard. The first time you log in, you'll set up your admin credentials. Step 2: Add Your GPU Workers Now for the fun part. On each worker node, make sure you have the NVIDIA driver and NVIDIA Container Toolkit installed, then run: Shell sudo docker run -d --name gpustack-worker \ --restart unless-stopped \ --gpus all \ -e GPUSTACK_SERVER_URL=http://<your-server-ip> \ -e GPUSTACK_TOKEN=<your-token> \ gpustack/gpustack Replace the server URL and token (grab the token from the GPUStack dashboard). Within seconds, your worker appears in the cluster view with GPU model info, VRAM capacity, and health status. Rinse and repeat for every GPU machine you want to add. Got 3 machines? Three commands. Got 30? Thirty commands, or one Ansible playbook if you're smart about it. Running the worker command is actually the easiest part. The real final boss of GPU clusters is usually getting the drivers and toolkit installed correctly on the host. Step 3: Deploy a Model Head over to the model catalog in the web UI. GPUStack supports pulling models from Hugging Face and the Ollama Library. Pick a model and click deploy. Here's where the scheduler really excels. It reads the model's metadata, computes the resource requirements for VRAM, compute, and memory, then figures out which workers can handle it. If the model is too big for a single GPU, it can shard it across multiple cards. You don't have to manually calculate whether a 70B parameter model fits on your hardware. GPUStack does the math for you. Step 4: Call the API Once the model is running, you get an OpenAI-compatible endpoint. 
Grab an API key from the dashboard and test it: Shell curl http://<your-server-ip>/v1/chat/completions \ -H "Authorization: Bearer <your-api-key>" \ -H "Content-Type: application/json" \ -d '{ "model": "llama3", "messages": [ {"role": "user", "content": "Explain GPU cluster management in one paragraph."} ] }' If you're already using the OpenAI Python SDK, switching to your GPUStack endpoint is a one-line change: Python from openai import OpenAI client = OpenAI( base_url="http://<your-server-ip>/v1", api_key="<your-api-key>" ) response = client.chat.completions.create( model="llama3", messages=[{"role": "user", "content": "Hello from my own GPU cluster!"}] ) print(response.choices[0].message.content) Your application code stays the same. Your infrastructure is now fully under your control. Why This Actually Matters Let me break down the features that make GPUStack more than a nice-looking dashboard. Multi-Backend Flexibility GPUStack supports vLLM, SGLang, and TensorRT-LLM out of the box. This matters because no single engine is best for every workload. vLLM is great at high-throughput batch processing. TensorRT-LLM squeezes out every last drop of performance on NVIDIA hardware. SGLang shines with structured generation. GPUStack lets you pick the right tool for each deployment, or lets the scheduler pick for you. Built-In Monitoring GPUStack integrates with Grafana and Prometheus, giving you real-time dashboards for GPU utilization, VRAM usage, token throughput, and API request rates. No need to bolt on a separate monitoring stack (which usually ends up being three half-finished Grafana dashboards anyway). When something breaks at 2 AM, you'll know exactly which GPU on which machine is the problem. Automated Failure Recovery We’ve all been there - a node drops off the map because of a weird PCIe bus error or a driver mismatch that only appears under heavy load. Normally, that means your inference API just returns 500s until you manually intervene. 
GPUStack handles the panic phase for you. When Should You Use GPUStack? GPUStack isn't the right fit for every scenario. Here's a quick way to think about it: Use GPUStack if: You have 2+ GPU machines and want to serve LLMs or other AI models behind a unified API. Especially if your team doesn't want to become full-time infrastructure engineers just to keep models running. You want to run inference on your own hardware instead of paying per-token to a cloud provider. The cost savings at scale are real, and GPUStack removes the operational overhead that usually makes self-hosting painful. Maybe skip GPUStack if: You have a single GPU and just want to run a model locally for personal use. Tools like Ollama are simpler for that use case. You're already deep into a custom Kubernetes-based ML platform with KubeFlow or similar. GPUStack can work alongside Kubernetes, but if you've already invested heavily in that ecosystem, the overlap might not be worth it. The Bigger Picture The AI infrastructure landscape is shifting. A year ago, most teams defaulted to API providers for inference. Today, with open-weight models getting better every month and GPU costs coming down, self-hosted inference is becoming a real option. Not just for Big Tech, but for startups and mid-size companies too. The bottleneck isn't hardware anymore. It's operations. It's the glue code between "we have GPUs" and "our application can reliably call a model." GPUStack is a serious attempt at solving that gap, and it's open source under the Apache 2.0 license, so you can inspect, modify, and deploy it without vendor lock-in. If you’re sitting on a pile of hardware that’s currently just acting as expensive space heaters, or if you’re tired of seeing cloud inference bills that look like mortgage payments, give this a shot. You might find that self-hosting is actually viable again!
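As a back-of-the-envelope companion to Step 3, the kind of fit calculation a scheduler has to make can be approximated by hand. The sketch below is not GPUStack's algorithm: it only counts weight memory (parameters times bytes per parameter) and applies a guessed 20% allowance for KV cache and activations. The function names and the `overhead` factor are assumptions for illustration.

```python
import math

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billions, dtype="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]

def min_gpus_for_weights(params_billions, gpu_vram_gb, dtype="fp16", overhead=1.2):
    """Rough lower bound on GPU count.

    `overhead` is an assumed allowance for KV cache and activations,
    not a measured value.
    """
    needed = weight_memory_gb(params_billions, dtype) * overhead
    return math.ceil(needed / gpu_vram_gb)

fp16_weights = weight_memory_gb(70)                   # 70B parameters at FP16
a100_count = min_gpus_for_weights(70, 80)             # 80 GB cards
rtx_count = min_gpus_for_weights(70, 24, "int4")      # 24 GB cards, 4-bit weights
print(fp16_weights, a100_count, rtx_count)
```

Real requirements also depend on context length, batch size, and the inference engine, so treat numbers like these as a sanity check rather than a sizing guide.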

By Sandeep Sadarangani
Smart Deployment Strategies for Modern Applications

Modern application development has moved toward distributed, cloud-based, and often microservices-based architectures that must stay scalable, reliable, and performant under varying conditions. As a result, deployment has become part of application development rather than a final activity. Intelligent deployment patterns and practices are about building applications that are not just easy to deploy, but also reliable, scalable, and efficient in production. This means moving away from traditional, manual deployment and toward automated, container-based deployment. Docker and Kubernetes are two prominent technologies in this shift. Docker lets developers package applications together with their dependencies in lightweight, portable containers, which addresses environment consistency problems, while Kubernetes deploys, scales, and self-heals those containers. However, without an appropriate strategy, it is possible to introduce unnecessary complexity and even performance issues. Not every application needs Kubernetes, nor does every deployment issue call for a distributed solution. Knowing when to use Docker on its own, when to use Kubernetes, and how to balance performance, cost, and complexity is vital to delivering effective modern applications. This article presents smart deployment strategies using Docker and Kubernetes and covers the advantages, disadvantages, and performance considerations of each.
What Docker Does
Docker packages your application, all its dependencies, and the runtime into a small container.
Issues Before Docker
"It works on my machine": the application behaves inconsistently across environments such as development, test, staging, and production.
Dependency conflicts: language version mismatches, missing library versions, and configuration mismatches.
Docker Benefits
Same behavior everywhere: local development, staging, and production environments.
Isolation between apps: each app runs in its own container.
Fast startup: containers are lightweight compared with virtual machines.
Easy deployment: just run the container.
Plain Text
docker start <containername>
How Docker Works
Plain Text
Application Code → Dockerfile → Docker Image → Docker Container → Run application
A container image can run on a developer laptop, on virtual machines, in a data center, or in cloud environments with the same packaged runtime and dependencies. Docker therefore resolves the packaging problem. But what if the machine has 100 containers? What if one crashes? How do you scale during high traffic? How do you manage deployments? Docker by itself does not solve these problems. This is where a deployment strategy is needed, and where Kubernetes comes in.
What Kubernetes Does
Kubernetes addresses the operational problem of managing the image once it has been created. It automates the deployment, scaling, and management of containerized applications, and it maintains the desired state of the application by replacing failed containers and rescheduling workloads as needed.
Kubernetes Benefits
Auto scaling: More containers (Pods) if traffic increases, and fewer containers if traffic decreases.
Self-healing: Starts the container again if it crashes.
Load balancing: Spreads the load across the containers.
Zero downtime deployment: Updates the system without stopping it.
Service management: Manages multiple microservices easily.
Docker builds and runs the container. Kubernetes runs the container reliably at scale.
As an analogy:
Docker = packing lunch boxes
Kubernetes = managing a large cafeteria serving thousands
Plain Text
build app → Docker container
↓
Deploy many containers → Kubernetes manages them
What a Kubernetes Deployment Actually Does
A Kubernetes Deployment is a cluster resource that manages a group of Pods and ReplicaSets for a workload, typically a stateless application. You define the desired state, and the actual state in the cluster moves toward it. Kubernetes also supports rolling updates, where new Pods are created and marked as ready before the old ones are terminated.
The typical process for deploying a Spring Boot application to a Kubernetes cluster:
1. Develop a Spring Boot application.
2. Build the application and package it as a Docker image.
3. Push the Docker image to a repository.
4. Define the image in a Kubernetes Deployment.
5. Kubernetes creates Pods and exposes them via a Service.
Advantages
Consistent deployments: Docker provides a standard unit for bundling the application and its runtime dependencies. This minimizes environment drift between development, testing, and production, and it is one of the biggest advantages of using containers for Java-based Spring Boot applications.
Declarative operations: Kubernetes uses a declarative model to manage deployments, which makes it easier for organizations to automate application deployment.
Self-healing: Kubernetes can automatically replace failing containers and reschedule the application when a node becomes unavailable, so recovery does not depend on manual intervention.
Inbuilt scaling options: Kubernetes provides built-in autoscaling features for the application.
This makes it easy for organizations to implement elastic and efficient scaling for the application.
Improved service abstraction and traffic routing: A Kubernetes Service is an API object that exposes a set of Pods behind a single, consistent endpoint and distributes traffic to the matching Pods. If access from outside the cluster is required, Ingress or Gateway-based routing is an option.
Safer upgrades: Rolling updates make it possible to roll out new versions gradually, which reduces deployment risk.
Disadvantages
1. More Operational Complexity
While Docker is simple for small applications, Kubernetes introduces additional concepts such as Pods, Deployments, Services, Ingress, ConfigMaps, Secrets, autoscaling, and network policies. These features can be justified for production environments, but they add a learning curve that teams should plan for. The Kubernetes documentation is divided into so many sections because the platform is multi-functional by design, covering orchestration, networking, scaling, storage, and more.
2. Higher Resource Overhead
Kubernetes adds an operational layer (a control plane and per-node agents) that plain Docker does not have. For very small applications, that overhead may outweigh the advantages. This is an assumption based on the complexity of the Kubernetes model compared to the Docker model.
3. Harder Debugging
Debugging a Docker application is relatively simple because the application runs on a single host. Debugging a distributed application is far more complex because multiple hosts, Pods, and Services are involved. This is an assumption based on the complexity of the Kubernetes model compared to the Docker model.
4. Misconfiguration Risk
Kubernetes is a powerful technology, but misconfiguration can lead to application failures.
Network Policies, for example, are complex by design and need careful, production-grade configuration.
Performance Considerations
Kubernetes doesn’t make your application run faster on its own. Performance still depends on application design, JVM tuning, container image quality, database performance, network latency, and resource allocation. However, Kubernetes provides operational tools, such as autoscaling and controlled rollouts, that help maintain performance under varying loads. In general terms, performance considerations fall into four categories:
Startup performance. A Spring Boot container can be slow to start, depending on factors such as application size. Because a rollout waits for new Pods to become ready, slow startup also slows rollouts.
Runtime efficiency. Containers are much more efficient than traditional deployment models that use many virtual machines, which is one reason Docker is so popular. However, inefficient Docker images or large JVMs can still cause inefficiencies. Docker documentation notes factors such as the choice between glibc-based and musl-based base images.
Scaling behavior. Horizontal Pod autoscaling responds to increased load by adding more Pods rather than giving existing Pods more resources. This only helps if the application scales horizontally and has no bottleneck at the single-node level.
Networking overhead. Kubernetes Services add an abstraction layer to the network. This helps manageability and load balancing, but latency-sensitive applications need careful design at every layer. The abstraction provided by Services is useful operationally, but it is not free.
Limitations
One limitation to be aware of is the fact that Kubernetes Deployments are designed for stateless workloads.
This means if the application has state tightly coupled with the identity of the instance or has ordered storage, the application may not be the best candidate for a Kubernetes Deployment. The Kubernetes documentation itself describes Deployments as typically being used for workloads that “do not maintain state.”
Other practical limitations are:
Small teams may find Kubernetes too heavy for a simple internal app.
Stateful systems still require careful storage, backup, and failover planning.
Local development experience can become more complex than plain Docker Compose.
Security and networking require active design, not default trust.
When/What to Use
Scenario | Need Docker | Need Kubernetes
Run single app | Yes | No
Microservices | Yes | Yes
Production scale | Yes | Yes (Mandatory)
Auto scaling needed | No | Yes
High Availability | No | Yes
Conclusion
The modern deployment model is not just about shipping code; it’s about shipping it reliably and at scale. Docker provides consistency across environments, while Kubernetes provides scale, resilience, and automation. A smart deployment strategy means selecting the appropriate tool for the job. Docker might be enough for a simple application, but for a complex application with high availability requirements, Kubernetes becomes a must-have. By understanding the strengths and weaknesses of both tools, we can develop efficient, scalable, and sustainable deployment strategies.
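The scaling behavior mentioned under Performance Considerations follows a simple rule documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule and clamps the result to a minimum and maximum. It is a simplified illustration that ignores the autoscaler's tolerance band and stabilization windows, and the function name and default bounds are made up for the example.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Basic HPA rule, clamped to [min_replicas, max_replicas].

    Simplified: the real autoscaler also applies a tolerance and
    stabilization behavior that this sketch leaves out.
    """
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))

scale_up = desired_replicas(4, current_metric=90, target_metric=60)    # CPU at 90% vs 60% target
scale_down = desired_replicas(4, current_metric=30, target_metric=60)  # load halves
capped = desired_replicas(5, current_metric=200, target_metric=50)     # hits max_replicas
print(scale_up, scale_down, capped)
```

The third case shows why horizontal scaling alone cannot fix a single-node bottleneck: once the replica cap is reached, the metric stays above target.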

By Manju George
Solving the Mystery: Why Java RSS Grows in Docker on M1 Macs

The Problem
You're running a Java application in a Docker container on your M1 Mac. Everything works fine, but you notice something strange: The resident set size (RSS) keeps growing, even though your heap usage is stable. After hours of investigation, you find mysterious rwxp memory regions, each exactly 128 MB, accumulating in your process memory map. What's causing this? Is it a memory leak? A JVM bug? Something else entirely?
The Investigation
Our journey began with monitoring RSS growth in a Java 17 application deployed on Docker-backed Minikube. Despite stable heap usage and no obvious memory leaks, RSS continued to grow by hundreds of megabytes over time.
Initial Observations
RSS growth: ~500-700 MB over 11 hours
Heap usage: Stable and within limits
Thread count: Stable
Native memory tracking: No obvious leaks
Deep Dive Into Memory Maps
Using /proc/PID/maps and /proc/PID/smaps, we discovered the growth was coming from anonymous executable memory regions:
Shell
$ cat /proc/1/maps | grep rwxp
efffd1d7c000-efffd9d7c000 rwxp 00000000 00:00 0
efffdb185000-efffe3185000 rwxp 00000000 00:00 0
efffe3d85000-efffebd85000 rwxp 00000000 00:00 0
...
Each region was exactly 128 MB, in the 0xefff* address range, with read-write-execute permissions. But what was in them?
The Discovery
Reading the memory content revealed something unexpected: ARM64 machine code instructions. But wait, the Java binary was x86-64, and the process reported x86_64 architecture. What was ARM64 code doing there?
The "Aha!" Moment
The answer: Rosetta 2 translation cache. When running x86-64 containers on ARM64 M1 Macs via Docker Desktop, Rosetta 2 translates x86-64 instructions to ARM64. The translated code is cached in executable memory regions: those mysterious RWXP regions we were seeing!
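The "exactly 128 MB" observation can be checked mechanically rather than by eye. The sketch below parses text in the /proc/PID/maps format and reports the size of each anonymous rwxp mapping. The three rwxp lines are the ones shown above; the fourth, file-backed line is a made-up example added so the filter has something to skip, and `rwxp_regions` is a hypothetical helper name.

```python
def rwxp_regions(maps_text):
    """Return (start, end, size_bytes) for each anonymous rwxp mapping."""
    regions = []
    for line in maps_text.splitlines():
        fields = line.split()
        # /proc/PID/maps columns: address perms offset dev inode [pathname]
        if len(fields) < 5 or fields[1] != "rwxp":
            continue
        if len(fields) > 5:
            continue  # a pathname is present, so the mapping is not anonymous
        start, end = (int(part, 16) for part in fields[0].split("-"))
        regions.append((start, end, end - start))
    return regions

SAMPLE = """\
efffd1d7c000-efffd9d7c000 rwxp 00000000 00:00 0
efffdb185000-efffe3185000 rwxp 00000000 00:00 0
efffe3d85000-efffebd85000 rwxp 00000000 00:00 0
7f0000000000-7f0000001000 r-xp 00000000 08:01 1234 /usr/lib/libc.so.6
"""

regions = rwxp_regions(SAMPLE)
sizes_mib = [size // (1024 * 1024) for _, _, size in regions]
print(len(regions), sizes_mib)
```

Inside a container you could feed it `open("/proc/1/maps").read()` on a schedule and log the region count to track growth over time.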
The Root Cause
Here's what was happening:
JIT compilation: Java's JIT compiler generates x86-64 native code for hot methods
Rosetta 2 intercepts: When x86-64 code executes, Rosetta 2 translates it to ARM64
Translation cache: Translated ARM64 code is stored in 128 MB RWXP memory regions
Growth: More JIT-compiled methods = more translations = more RWXP regions
Evidence
Observation | Explanation
RWXP regions contain ARM64 code | Rosetta 2's translated code
Exactly 128 MB per region | Rosetta 2 allocation granularity
Anonymous (no file backing) | Runtime translation cache
Growth correlates with JIT activity | More compiled methods = more translations
The Proof
To definitively prove JIT was the trigger, we disabled JIT compilation using the -Xint flag:
Java
-Xint # Run in interpreter-only mode
Results
Metric | Before (JIT Enabled) | After (JIT Disabled)
RWXP Regions | 5 -> 12 -> 15 (growing) | 1 (stable, no growth)
RWXP Memory | ~1.9 GB | ~128 MB
Growth Rate | Multiple regions/hour | 0 regions/hour
Compiled Methods | 25,606 nmethods | 0 nmethods
Result: With JIT disabled, RWXP growth completely stopped. Monitoring over 1+ hour confirmed zero growth.
Why This Happens
The Perfect Storm
ARM64 host: M1 Mac (Apple Silicon)
x86-64 container: Docker image built for AMD64
Rosetta 2 enabled: Docker Desktop uses Rosetta 2 for emulation
Dynamic code generation: Java JIT compiler
When all four conditions are met, Rosetta 2 must translate every JIT-compiled method from x86-64 to ARM64, storing the translations in executable memory regions that count toward process RSS.
The Solution
Option 1: Use Native ARM64 Images (Recommended)
The best solution is to use ARM64-native Docker images:
Shell
# Build for ARM64
docker build --platform linux/arm64 ...
# Or use multi-arch images
docker pull --platform linux/arm64 your-image:tag
Benefits:
No Rosetta 2 translation needed
No RWXP growth
Better performance (native execution)
Lower memory usage
Option 2: Deploy to x86-64 Infrastructure
If ARM64 images aren't available, deploy to x86-64 servers or cloud instances where Rosetta 2 isn't needed.
Option 3: Accept and Monitor
If you must use x86-64 containers on M1 Macs:
Increase container memory limits
Monitor RWXP growth
Plan for periodic restarts if needed
Not Recommended
Don't disable JIT in production (-Xint). While it stops RWXP growth, it dramatically reduces performance. Use it only for testing/debugging.
Key Takeaways
Rosetta 2 translation cache causes RWXP memory growth in x86-64 containers on ARM64 Macs
JIT compilation is the primary trigger; each compiled method needs translation
Native ARM64 images eliminate the problem entirely
This is expected behavior, not a bug: it's the cost of emulation
Conclusion
What started as mysterious RSS growth turned out to be Rosetta 2's translation cache storing ARM64 translations of JIT-compiled Java code. By understanding the mechanism and testing with JIT disabled, we proved the root cause and identified the best solution: use native ARM64 images. If you're experiencing similar RSS growth in Java applications on M1 Macs, check for RWXP regions in your process memory map. If you see them, Rosetta 2 translation is likely the culprit.
How to Check
Shell
# Check for RWXP regions
cat /proc/PID/maps | grep rwxp
# Count RWXP regions
cat /proc/PID/maps | grep rwxp | wc -l
# Check if Rosetta 2 is active
cat /proc/PID/maps | grep rosetta
Have you encountered similar issues? Share your experience in the comments below!

By Sumeet Sharma
How We Diagnosed a Hidden Scheduler Failure in a Docker Swarm Cluster Serving 2 Million Users

Context: 120 Nodes, Strict SLAs, and Legacy Infrastructure Our team is responsible for the mobile backend infrastructure serving over 2 million registered users. The Docker Swarm cluster consists of 120 nodes: 5 manager nodes, 40 worker nodes, and the rest are infrastructure servers. The cluster runs about 50 services, totaling hundreds of replicas. We inherited Swarm from the previous contractor. The client is not yet ready to migrate to Kubernetes, and Swarm is currently sufficient for the current scale. Services are distributed across nodes in groups and bound by labels: up to 4 worker nodes are allocated to heavier services, 2 to less loaded ones, and 1 to non-critical services. Nodes can host replicas of multiple services. Our SLAs are strict: If any part of the mobile app is completely unavailable, we have 30 minutes to resolve the issue, after which penalties begin to accrue. What Happened The issue was detected thanks to a monitoring alert regarding the unavailability of service replicas. While investigating the incident in the manager-node logs, we found the following warning: Plain Text Mar 03 07:46:32 swarm3 dockerd[875]: time="2025-03-03T07:46:32.123554337Z" level=warning msg="underweighting node nt98wn9he8my6tsuasgkhrrjp for service 86jgkc35ctasmu8ubpnilsrqo because it experienced 5 failures or rejections within 5m0s" module=scheduler node.id=gaip86ri06jyrdwxcogl9j2p5 This message indicates that Swarm's internal scheduler is lowering the priority (weight) of a specific worker node when scheduling service tasks. The reason is 5 failures or rejections in the last 5 minutes. Swarm effectively excludes this node from the pool of candidates for running replicas. There was no critical downtime: Several replicas of the problematic services were running, and traffic was routed to the live instances. However, some replicas could not start — meaning the cluster was operating with reduced fault tolerance. With this SLA, that's a ticking time bomb. 
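Because this warning is easy to miss among other daemon output, it can help to pull the node ID, service ID, and failure count out of the logs automatically. The sketch below is one way to do that with a regular expression built from the exact message shown above; the pattern and the `underweighted` helper are our own illustration, and the message wording may differ between Docker versions.

```python
import re
from collections import Counter

# Pattern derived from the single warning quoted above; treat it as an
# assumption about the log format, not a documented contract.
PATTERN = re.compile(
    r'underweighting node (?P<node>\w+) for service (?P<service>\w+) '
    r'because it experienced (?P<failures>\d+) failures or rejections '
    r'within (?P<window>[^"\s]+)'
)

def underweighted(log_lines):
    """Count warnings per (node_id, service_id) pair."""
    hits = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            hits[(match["node"], match["service"])] += 1
    return hits

LOG = [
    'Mar 03 07:46:32 swarm3 dockerd[875]: time="2025-03-03T07:46:32.123554337Z" '
    'level=warning msg="underweighting node nt98wn9he8my6tsuasgkhrrjp for service '
    '86jgkc35ctasmu8ubpnilsrqo because it experienced 5 failures or rejections '
    'within 5m0s" module=scheduler node.id=gaip86ri06jyrdwxcogl9j2p5',
    'Mar 03 07:46:33 swarm3 dockerd[875]: level=info msg="unrelated line"',
]

hits = underweighted(LOG)
first = PATTERN.search(LOG[0])
print(hits, first["failures"], first["window"])
```

Grouping by node and service makes it clear whether one node is failing for many services or one service is failing on many nodes, which points to different root causes.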
Why Swarm Lowers a Node's Weight

Before describing our diagnosis, it's worth understanding the mechanics. Swarm lowers a node's weight for several reasons:

- Resource constraints. A container requires more CPU, memory, or disk space than is available on the node. Swarm cannot place the task and records a failure.
- Network issues. The node is unresponsive, or the connection is unstable. The manager loses contact with the worker and marks it as unreliable.
- Previous failed launches. If a container fails to start on a specific node several times in a row, Swarm temporarily excludes it from the list of candidates.
- Docker Daemon or hardware issues. Unstable Docker daemon operation or hardware failures lead to a cascade of failures when launching tasks.
- Mismatch between the number of replicas and the number of nodes with the required labels. This turned out to be our case. The service is bound to specific nodes via placement constraints with labels. If the number of replicas in the service configuration exceeds the number of nodes with the required label, the scheduler enters a cycle of failed placement attempts, even if there are enough free worker nodes in the cluster without that label.
- Service errors. The container starts but immediately terminates with an error or fails the health check. Swarm attempts to restart it, incrementing the failure count.

What We Tried First

The initial response to such errors is the standard set of steps:

- Rebuilding the service. We recreated the service using docker service update --force. The replicas restarted, but the problem returned after a few minutes.
- Changing the number of replicas. We reduced and then increased the number of replicas again. It didn't help.
- Reading container logs. The container logs themselves didn't show anything meaningful: the service was fine when it managed to start.

None of this yielded a consistent result.
It became clear that the problem wasn't with the service, but at the infrastructure level: specifically, in how the scheduler makes placement decisions.

Troubleshooting: Identifying the Root Cause

Step 1: Checking Node Status

    docker node ls

If any node has a status of Down or Unreachable, it is the first candidate. We look for the specific node mentioned in the error message:

    docker node ls | grep nt98wn9he8my6tsuasgkhrrjp

In our case, all nodes were in the Ready state, so the issue wasn't related to availability.

Step 2: Identify the Problematic Service

Using the first 12 characters of the service ID from the log, we find its name:

    docker service ls | grep 86jgkc35ctas

Next, check the status of the tasks:

    docker service ps 86jgkc35ctasmu8ubpnilsrqo

Here you can see on which node the task failed to start and why: Rejected, Shutdown, No suitable node.

Step 3: Checking Placement Constraints

This is where we found the cause. Let's see what placement constraints are configured for the service:

    docker service inspect 86jgkc35ctasmu8ubpnilsrqo \
      --format '{{json .Spec.TaskTemplate.Placement}}' | jq .

The service was bound to nodes with a specific label. Let's check how many nodes have this label:

    docker node ls --filter "label=cli=1"

And then it became clear: the number of replicas in the service configuration exceeded the number of nodes with the required label. Most likely, the mismatch was introduced during a routine service update, when the replica count was set higher than the number of labeled nodes. Replicas for which suitable nodes were found started normally, while for the rest, the scheduler repeatedly attempted to find a suitable node, received a rejection, and logged a failure.
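The comparison in Step 3 (replica count versus number of nodes carrying the required label) can be automated across all services. The following is a minimal sketch that works on plain dictionaries; the input shape (`name`, `replicas`, `constraints`, `labels`, `availability`) is a simplification I am assuming here, which you would fill from `docker service inspect` and `docker node inspect` output. It only understands `node.labels.<key>==<value>` and `!=` constraints and follows the article's rule of thumb that replicas should not exceed matching nodes.

```python
def parse_label_constraints(constraints):
    """Extract (key, op, value) from constraints such as 'node.labels.cli==1'.

    Constraints that do not target node.labels are ignored in this sketch.
    """
    parsed = []
    for raw in constraints:
        for op in ("==", "!="):
            if op in raw:
                left, right = (part.strip() for part in raw.split(op, 1))
                if left.startswith("node.labels."):
                    parsed.append((left[len("node.labels."):], op, right))
                break
    return parsed


def eligible_nodes(nodes, constraints):
    """Return names of active nodes whose labels satisfy every label constraint."""
    rules = parse_label_constraints(constraints)
    names = []
    for node in nodes:
        labels = node.get("labels", {})
        matches = all(
            (labels.get(key) == value) if op == "==" else (labels.get(key) != value)
            for key, op, value in rules
        )
        if matches and node.get("availability", "active") == "active":
            names.append(node["name"])
    return names


def find_mismatches(services, nodes):
    """List (service, replicas, matching_nodes) where replicas exceed matching nodes."""
    problems = []
    for service in services:
        matched = eligible_nodes(nodes, service.get("constraints", []))
        if service["replicas"] > len(matched):
            problems.append((service["name"], service["replicas"], len(matched)))
    return problems
```

Running a check like this before applying a stack update would catch the same mismatch the team found by hand, assuming label values are compared as strings.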
Step 4: Checking Resources (for a Complete Picture)

Even after identifying the root cause, we checked the resources on the problematic nodes to rule out a combined issue:

    docker node inspect nt98wn9he8my6tsuasgkhrrjp \
      --format '{{json .Description.Resources}}' | jq .

And also the load directly on the node:

    top -o %CPU
    free -m
    df -h

The resources were fine, which confirmed that the issue was indeed a configuration mismatch.

Solution

Main action: we adjusted the number of service replicas to match the number of available nodes with the required label by reducing the replica count in the .yml configuration file:

    deploy:
      replicas: 2  # Match the number of nodes with the label

After applying the updated configuration, the error disappeared: the scheduler no longer attempted to place replicas on non-existent nodes.

Additionally, we reviewed the configuration of the remaining services, verifying that each replica count matched the number of nodes carrying the required labels. We found several more services with a similar potential issue and fixed them proactively.

If the Cause Is Different, Additional Solutions

Our specific case was related to a configuration error, but there are other scenarios that can cause the same error:

Resource shortage. Free up space and clean up unused images:

    docker system prune -a

Or lower the limits for the service:

    docker service update --limit-cpu 0.5 --limit-memory 512M <SERVICE_ID>

Issues with the Docker Daemon on the node. Restart the daemon:

    systemctl restart docker

Temporarily excluding a problematic node. Switch it to drain mode so that all tasks migrate to other nodes:

    docker node update --availability drain <NODE_ID>

Reconnecting the node to the cluster. If nothing else works, remove the node and add it again:

    docker swarm leave --force
    docker swarm join --token <TOKEN> <MANAGER_IP>:2377

Conclusion

This situation taught us a few things: The underweighting node error is a symptom, not a diagnosis.
The same warning in the logs can stem from a wide variety of causes, ranging from a lack of resources to a configuration error. Configuration errors are the most insidious cause. In a cluster with dozens of services and labels, it's easy to introduce a mismatch between the number of replicas and available nodes during a routine update. The absence of downtime does not mean there is no problem. The cluster continued to operate thanks to live replicas, but it was running with reduced fault tolerance. One more failure, and the SLA would have been violated.

By Denis Tiumentsev
Mastering Kubernetes to Maximize Your Cloud Potential

Kubernetes is often introduced as a container orchestrator. That's like calling a modern city "a collection of buildings." Technically correct, but wildly incomplete. In reality, Kubernetes is a layered ecosystem where storage, compute, networking, security, and developer workflows interlock like gears in a precision machine. If one gear slips, everything grinds. If all align, you unlock a platform that scales, heals, and evolves with your applications.

After working through complex deployments, production outages, and cost optimization journeys, one truth stands out: Kubernetes mastery is not about knowing objects. It's about understanding layers. Let's break down the seven critical layers of Kubernetes and the tools that make them powerful.

1. Storage Layer: Where State Meets Reality

Stateless is easy. Real-world systems aren't. The storage layer ensures your applications don't forget who they are every time a pod restarts.

Key Components

- Persistent Volumes (PV) & Persistent Volume Claims (PVC): Abstract storage from workloads. Your app asks, Kubernetes provides.
- StorageClass & CSI (Container Storage Interface): Enable dynamic provisioning and seamless integration with cloud providers like AWS EBS, GCP PD, or Azure Disk.

Why It Matters

Without a well-designed storage strategy:

- Databases become fragile
- Stateful apps become unreliable
- Recovery becomes painful

This layer is the difference between ephemeral experiments and production-grade systems.

2. Compute / Runtime Layer: The Engine Room

This is the layer most engineers start with, but ironically, it's not where mastery ends.
Core Primitives

- Pods: The smallest deployable unit
- Deployments: Declarative app management
- ReplicaSets: Ensure desired state
- DaemonSets: One pod per node (great for agents)

What It Solves

- Auto-healing (failed pods restart automatically)
- Horizontal scaling
- Declarative infrastructure

Hidden Complexity

Misconfigured probes, resource limits, or rollout strategies can silently degrade performance or cause cascading failures. Compute is powerful, but blind compute is dangerous.

3. Observability Layer: Seeing the Invisible

If Kubernetes is a living organism, observability is its nervous system. Without it, you're operating blind.

Essential Stack

- Prometheus + Grafana: Metrics collection and visualization
- Loki: Log aggregation without heavy indexing
- OpenTelemetry: Standardized tracing across distributed systems

Why It Matters

- Detect anomalies before users do
- Debug distributed failures
- Understand system behavior under load

A cluster without observability is like flying a plane without instruments. You may stay airborne... until you don't.

4. Networking Layer: The Silent Enabler

Kubernetes networking "just works"... until it doesn't.

Core Components

- Services: Stable internal communication (ClusterIP, NodePort, LoadBalancer)
- CNI (Container Network Interface): Handles pod-to-pod communication
- Ingress: Manages external access to services

Real Challenges

- Debugging network policies
- Handling cross-cluster communication
- Managing latency and service mesh complexity

Networking is often underestimated because it's invisible when functioning and painfully obvious when broken.

5. Security Layer: Guardrails, Not Afterthoughts

Security in Kubernetes is not a feature. It's a discipline.
Key Tools

- RBAC (Role-Based Access Control): Define who can do what
- OPA (Open Policy Agent): Enforce admission policies
- Kyverno: Kubernetes-native policy management
- Pod Security Standards (PSS): Baseline security enforcement

Why It Matters

Without strong policies:

- Privilege escalation becomes trivial
- Misconfigurations slip into production
- Compliance becomes reactive instead of proactive

Modern Kubernetes security is about policy-as-code, not manual reviews.

6. Developer & DevOps Tooling: Speed Without Chaos

Kubernetes can either accelerate developers... or slow them down dramatically. The difference lies in tooling.

Key Tools

- Skaffold & Tilt: Rapid local development and feedback loops
- Helm: Package management for Kubernetes
- Kustomize: Environment-specific customization without templating

What This Layer Enables

- Faster iteration cycles
- Standardized deployments
- Reduced cognitive load for developers

Without this layer, Kubernetes becomes an operational burden rather than a developer platform.

7. CI/CD & GitOps: The Control Plane for Change

This is where Kubernetes evolves from infrastructure to platform.

Core Tools

- ArgoCD & Flux: GitOps-driven continuous delivery
- Tekton: Kubernetes-native CI pipelines
- Jenkins X: Cloud-native CI/CD automation

Why GitOps Wins

- Git becomes the single source of truth
- Changes are auditable and reversible
- Drift detection is automatic

Instead of pushing changes to the cluster, the cluster pulls desired state from Git. That subtle shift changes everything.

The Bigger Picture: Kubernetes as a System of Systems

Each layer solves a specific problem. Individually, they're powerful. Together, they form a self-healing, scalable, policy-driven platform.

Final Thought

Most teams struggle with Kubernetes not because it's complex, but because they approach it as a tool instead of a system. You don't "use Kubernetes." You operate an ecosystem. And the moment you start thinking in layers instead of YAML files, everything begins to click.
Which Kubernetes layer challenges you the most today?

- Observability gaps?
- Security policy chaos?
- GitOps adoption struggles?

If you're facing these, it might be time for a Kubernetes maturity or reliability audit. The bottleneck is rarely where you think it is.

By Jaswinder Kumar
AI Agents for DevOps on Kubernetes Need Real Engineering, Not Magic

In a real Kubernetes cluster, incidents rarely appear as a single, clean alert. They arrive as waves of Kubernetes events, latency spikes, pod restarts, rollout failures, and unpredictable autoscaling behavior all at once. The hard part is usually not "Can we fix it?" but "Can we understand what's happening fast enough to make a safe decision?"

AI agents for DevOps can help here, but only when they sit on solid engineering foundations. They should compress the early correlation and triage phase, not take opaque, unsafe control of production.

Google's 2024 DORA report underlines why this matters: more than 75 percent of respondents now rely on AI for at least one professional task each day, and over one-third report moderate to extreme productivity gains, yet 39 percent still have little to no trust in AI-generated code. That gap between use and trust is exactly where our architecture and guardrails matter.

Why Incident Triage Needs Help Now

Traditional AIOps pitches often promise full automation, but most SREs do not want a black-box system taking unilateral action in production. What they need is help with triage:

- Grouping noisy alerts into a single incident view
- Correlating Kubernetes events, metrics, and recent rollouts
- Proposing safe, reversible next steps, not silently applying risky changes

The DORA research still centers on the same four key metrics: lead time, deployment frequency, change failure rate, and time to restore service. AI can absolutely improve developer productivity and documentation, but it can also undermine delivery stability when used on top of weak fundamentals such as oversized batch changes and poor test coverage.

For a broader perspective on integrating DevOps services, see "Incorporating DevOps Services into Software Development."
- Traceable: every recommendation is explainable from telemetry and cluster state
- Auditable: logs and decisions reviewable after the fact
- Reversible: actions easy to roll back
- Least-privilege: permissions constrained by Kubernetes RBAC

Architecture Overview

Layer | Responsibility | Key Technologies
Telemetry capture | Collect traces, metrics, logs, and Kubernetes events | OpenTelemetry Collector
Event bus | Buffer and fan-out telemetry | Kafka
Lightweight consumer | Normalize/enrich data, build incident context | Custom service
AI agent layer | Triage, correlate, draft next actions | CrewAI, Llama via Ollama
Controlled execution | Safe, reversible scaling under RBAC | Kubernetes RBAC, scale subresource

Related: DZone's "10 Best Practices for Managing Kubernetes at Scale."

The pattern that consistently holds up under load uses simple, composable layers:

- OpenTelemetry Collector: capture traces, metrics, logs, and Kubernetes events
- Kafka event bus: buffer, fan-out, and replay telemetry
- Lightweight consumer: normalize signals into "incident contexts"
- AI agent layer: CrewAI agents backed by Llama 3.1 via Ollama
- Slack approval: humans approve or reject remediation steps
- RBAC-limited scaling: Kubernetes permissions restricted to the scale subresource

Each layer can be tested, inspected, and replaced without rewriting the entire system.

Why OpenTelemetry Fits Kubernetes

OpenTelemetry Collector gives you one place to capture multi-signal telemetry (traces, metrics, logs, and Kubernetes events) with pluggable receivers and exporters.
Key points for Kubernetes:

- The k8sevents receiver (in contrib distributions) captures events from the Kubernetes API server and converts them into logs.
- Kubernetes events are short-lived in the cluster (often an hour or less) and are not persisted long term; exporting them via OpenTelemetry preserves them for incident analysis.
- Events complement, but do not replace, application logs and traces; they describe what Kubernetes is doing to your workloads (e.g., scheduling failures, image pull errors, autoscaling decisions).

Why Kafka Belongs in the Middle

Dropping all telemetry straight into an AI model couples your reasoning to whatever the cluster happens to emit at that moment. Kafka gives you a much sturdier backbone:

- Replayable telemetry: reproduce incident contexts for testing and post-mortems
- Multiple consumers: feed different tools (dashboards, anomaly detectors, AI agents) from the same topics
- Decoupled ingestion and analysis: collectors push at their own pace, consumers pull at theirs

Kafka does not fix bad metric names or broken alert rules, but it does give you a consistent, durable pipe to reason about.
A typical OpenTelemetry Collector configuration for this pattern looks like this (simplified):

    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      k8sevents:
        namespaces: [production, staging]

    processors:
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
        spike_limit_mib: 128
      batch:
        timeout: 10s
        send_batch_size: 1000
        send_batch_max_size: 1500

    exporters:
      kafka:
        brokers:
          - kafka-1.example.com:9092
          - kafka-2.example.com:9092
          - kafka-3.example.com:9092
        retry_on_failure:
          enabled: true
        sending_queue:
          enabled: true
        traces:
          topic: otel-traces
          encoding: otlp_proto
        metrics:
          topic: otel-metrics
          encoding: otlp_proto
        logs:
          topic: otel-logs
          encoding: otlp_proto

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [kafka]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [kafka]
        logs:
          receivers: [otlp, k8sevents]
          processors: [memory_limiter, batch]
          exporters: [kafka]

This keeps the collector focused on one job: getting signals in and pushing them reliably to Kafka.

Why a Separate Consumer Layer Matters

It is tempting to point your AI agents directly at Kafka topics, but that couples fragile prompt engineering with noisy raw data. A thin consumer service in the middle gives you a deterministic place to:

- De-duplicate repeated events and alerts
- Join pod-level signals to the Deployment and Service metadata
- Attach rollout information (who changed what, when, and via which pipeline)
- Apply simple rules ("ignore known-benign events," "group alerts by owner team") before AI sees them

This consumer produces a single "incident context" document per active incident. AI agents then reason over this structured context instead of a firehose of raw logs.
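To make the consumer's role more concrete, here is a minimal sketch of the de-duplication and grouping step, written as a pure function over already-decoded events. The event fields (`namespace`, `deployment`, `pod`, `reason`, `message`, `owner_team`), the benign-reason list, and the function name are all assumptions for illustration; the article does not specify the consumer's data model, and Kafka I/O is deliberately left out.

```python
# Hypothetical ignore list; a real deployment would tune this per cluster.
DEFAULT_BENIGN_REASONS = frozenset({"Pulled", "Created", "Started"})


def build_incident_contexts(events, benign_reasons=DEFAULT_BENIGN_REASONS):
    """Group events into one context per (namespace, deployment).

    Known-benign events are skipped, and repeats of the same
    (pod, reason) pair are counted instead of stored again.
    """
    contexts = {}
    seen = set()
    for event in events:
        if event["reason"] in benign_reasons:
            continue
        key = (event["namespace"], event["deployment"])
        context = contexts.setdefault(key, {
            "namespace": event["namespace"],
            "deployment": event["deployment"],
            "owner_team": event.get("owner_team", "unknown"),
            "events": [],
            "duplicates_dropped": 0,
        })
        fingerprint = (key, event["pod"], event["reason"])
        if fingerprint in seen:
            context["duplicates_dropped"] += 1
            continue
        seen.add(fingerprint)
        context["events"].append({
            "pod": event["pod"],
            "reason": event["reason"],
            "message": event.get("message", ""),
        })
    return list(contexts.values())
```

Because the function is deterministic and has no network dependencies, replayed Kafka records should produce the same contexts every time, which is what makes the layer testable independently of the agents.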
A straightforward Kubernetes Deployment for the consumer might look like this:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: incident-context-consumer
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: incident-context-consumer
      template:
        metadata:
          labels:
            app: incident-context-consumer
        spec:
          serviceAccountName: agent-runner
          containers:
            - name: consumer
              image: your-registry/incident-consumer:v1.0.0
              env:
                - name: KAFKA_BROKERS
                  value: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
                - name: INCIDENT_TOPIC
                  value: "otel-logs"
                - name: OUTPUT_TOPIC
                  value: "incident-contexts"

AI Agent Layer With CrewAI and Llama 3.1

On top of incident contexts, we can deploy a small CrewAI-based agent layer. Meta's Llama 3.1 models are available in 8B, 70B, and 405B parameter sizes, and the llama3.1:8b variant runs comfortably on a single modern GPU or even a beefy workstation via Ollama.

We split responsibilities into three agents:

- Triage Agent: groups related alerts, assigns severity, and identifies the likely owning team
- Diagnosis Agent: correlates Kubernetes events, metrics, and rollout changes to propose the most likely root cause
- Executor Agent: drafts safe, reversible next steps and requests human approval

A minimal CrewAI definition might look like this (illustrative):

    from crewai import Agent, Task, Crew
    # llmclient and tools are placeholder modules in this illustrative example
    from llmclient import Llama31Client
    from tools import K8sTool, SlackTool, PrometheusTool

    llm = Llama31Client(
        endpoint="http://ollama-gateway:11434",
        model="llama3.1:8b",
    )

    triage_agent = Agent(
        role="Incident Triage Engineer",
        goal="Group related alerts and identify likely impact and owning team.",
        tools=[K8sTool, SlackTool],
        llm=llm,
    )

    diagnosis_agent = Agent(
        role="Correlation Analyst",
        goal="Correlate Kubernetes events with metrics and recent rollout data.",
        tools=[PrometheusTool, K8sTool],
        llm=llm,
    )

    executor_agent = Agent(
        role="Runbook Automator",
        goal="Draft safe, reversible next steps and send them for approval.",
        tools=[K8sTool, SlackTool],
        llm=llm,
    )

    crew = Crew(
        agents=[triage_agent, diagnosis_agent, executor_agent],
        tasks=[
            Task(description="Triage incident context and assign severity.", agent=triage_agent),
            Task(description="Diagnose probable causes.", agent=diagnosis_agent),
            Task(description="Draft a safe remediation step and request approval.", agent=executor_agent),
        ],
    )

The key is that only the Executor Agent proposes actions, and even then, those actions are routed through Slack for explicit human approval.

RBAC: Safe, Scale-Only Permissions

Kubernetes RBAC lets you grant fine-grained permissions to specific subresources, including deployments/scale. This is exactly what we want for an AI-assisted incident system: the ability to scale workloads up or down, without the power to change container images, environment variables, or security settings. Scaling is reversible and far safer than mutating Deployment specs. See the official Kubernetes RBAC docs for full details on subresource permissions.

A typical "scaling-only" role for agents looks like this:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: agent-runner
      namespace: default
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: deployment-scaler
    rules:
      # Read deployments and replica sets to understand current state
      - apiGroups: ["apps"]
        resources: ["deployments", "replicasets"]
        verbs: ["get", "list", "watch"]
      # Scale deployments via the scale subresource
      - apiGroups: ["apps"]
        resources: ["deployments/scale"]
        verbs: ["get", "update", "patch"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: agent-runner-deployment-scaler
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: deployment-scaler
    subjects:
      - kind: ServiceAccount
        name: agent-runner
        namespace: default

By operating only on the `/scale` subresource, you give the agent layer exactly enough power to adjust replica counts and nothing else.
See DZone's Implementing RBAC Configuration for Kubernetes Applications for more RBAC patterns.

How a Real Incident Flows

When a rollout goes wrong, or a dependency starts failing, a typical incident flows like this through the system:

- Telemetry capture: The OpenTelemetry Collector gathers metrics, traces, logs, and Kubernetes events, and exports them to Kafka.
- Context building: The consumer service reads relevant records from Kafka and builds an "incident context" (involving namespaces, Deployments, pods, events, SLOs, and recent changes).
- AI-assisted triage: The Triage Agent classifies severity (e.g., SEV-1 vs SEV-3), identifies impacted services, and tags likely owner teams.
- Correlation and diagnosis: The Diagnosis Agent matches restart reasons (ImagePullBackOff, OOMKilled, CrashLoopBackOff, etc.) with rollout timelines and metric anomalies to propose plausible root-cause hypotheses.
- Drafting a reversible action: The Executor Agent proposes a small, clearly reversible change: for example, temporarily scaling a canary deployment from 10 replicas back to 2, or scaling a known-stable previous version up to absorb traffic.
- Human approval: The proposed command and rationale are posted to a Slack incident channel. An on-call SRE or incident commander explicitly approves or rejects the action.
- Execution under RBAC: If approved, the agent uses its deployments/scale permissions to apply the change. Every call is logged and auditable.

For deeper context on incident response, see DZone's Incident Response Guide.
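The last three steps of the flow above (draft, approve, execute) can be sketched as a small gate that refuses to act without a named approver and records everything it does. This is an illustrative sketch only: `ScaleProposal`, the replica bounds, and the injected `apply_scale` callable are hypothetical names standing in for whatever Slack and Kubernetes client integration a team actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleProposal:
    namespace: str
    deployment: str
    current_replicas: int
    target_replicas: int
    rationale: str


def validate(proposal, min_replicas=1, max_replicas=20):
    """Return reasons the proposal is unsafe; an empty list means acceptable."""
    problems = []
    if not min_replicas <= proposal.target_replicas <= max_replicas:
        problems.append(
            f"target {proposal.target_replicas} outside [{min_replicas}, {max_replicas}]"
        )
    if proposal.target_replicas == proposal.current_replicas:
        problems.append("target equals current replica count")
    if not proposal.rationale.strip():
        problems.append("missing rationale")
    return problems


def execute_if_approved(proposal, approved_by, apply_scale, audit_log):
    """Apply a scale change only if it validates and a human approved it.

    apply_scale is an injected callable (namespace, deployment, replicas);
    audit_log is any list-like sink. Returns True if the change was applied.
    """
    problems = validate(proposal)
    if problems:
        audit_log.append({"action": "rejected", "reasons": problems, "proposal": proposal})
        return False
    if approved_by is None:
        audit_log.append({"action": "pending_approval", "proposal": proposal})
        return False
    apply_scale(proposal.namespace, proposal.deployment, proposal.target_replicas)
    audit_log.append({
        "action": "scaled",
        "approved_by": approved_by,
        "proposal": proposal,
        "rollback_to": proposal.current_replicas,  # what to restore if reverted
    })
    return True
```

Keeping the previous replica count in the audit entry is one simple way to make the action reversible in practice, since the rollback target is recorded at the moment of the change.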
Where This Pattern Works Best (and Where It Doesn't)

This architecture is strongest when:

- Telemetry is clean and labeled (good metric names, consistent labels, sane alerts)
- Triage, not remediation, is the bottleneck
- Runbooks already exist with reversible actions
- Platform teams are comfortable owning Kafka and the consumer service

It is less effective when:

- Every incident is truly novel and unstructured
- Data is sparse or heavily delayed
- Organizational trust in automation is low, and there is no appetite for experimental changes
- The AI endpoint itself has no SLOs, rate limits, or clear failure modes

Final Thoughts

This pattern fits squarely within the 2024-2026 shift toward platform engineering and AI-augmented DevOps workflows, but it succeeds only when built on strict operational guardrails. The goal isn't to replace humans in the incident response loop; it's to dramatically compress the time between "something broke" and "we understand the blast radius and have safe, reversible recovery options on the table."

AI agents excel at grouping noisy Kubernetes signals into coherent incident contexts and proposing next steps grounded in telemetry and recent changes. Humans remain the final decision-makers for production actions, retaining full control through Slack approval gates and Kubernetes RBAC constrained to the safe scale subresource.

When telemetry is clean, runbooks exist, and platform teams can own the Kafka/consumer layers, this architecture delivers measurable wins in mean time to understanding. When incidents remain truly novel or organizational trust in automation is low, it gracefully falls back to human-led triage. Either way, the system stays transparent, auditable, and reversible, never expanding blast radius through opaque "magic."

By Abdul Majid Qureshi
Java Backend Development in the Era of Kubernetes and Docker

We moved our monolithic Java application to Kubernetes last year. The promise was scalability and resilience. The reality was a series of silent failures during deployments. Users reported dropped connections every time we pushed a new version. Our monitoring showed zero downtime, but the customer experience told a different story. Requests vanished into the void during rolling updates. We spent weeks chasing network ghosts before finding the root cause. The issue was not the network. It was how our Java application handled termination signals. In this article, I will share how we adapted our Java backend for container orchestration. I will explain the specific lifecycle issues we encountered. I will detail the configuration changes that solved the dropout problem. This is not a guide on writing Dockerfiles. It is a record of the operational friction we faced when Java met Kubernetes. Building cloud-native Java apps requires more than just packaging a JAR. It requires understanding how the orchestration layer interacts with the JVM. The Silent Dropout Problem Our deployment strategy used standard Kubernetes rolling updates. The controller would start a new pod before killing the old one. This should ensure zero downtime. Our users still reported errors during these windows. We checked the service logs. The old pods stopped accepting traffic instantly upon receiving the kill signal. The Kubernetes service endpoint removed the pod IP immediately. There was a gap between traffic cessation and process termination. In-flight requests died mid-stream. Java applications do not shut down instantly. They need time to finish processing current requests. They need to close database connections gracefully. Our Spring Boot app ignored the termination signal initially. It kept running until the kernel killed it. This hard kill interrupted active transactions. Data consistency was at risk. We needed to implement a graceful shutdown sequence. 
Implementing Graceful Shutdowns We started by configuring Spring Boot to handle shutdown signals. The framework provides a property for this. We enabled it in our application configuration. This told Spring to stop accepting new requests upon shutdown. It allowed existing requests to complete within thirty seconds. This was a good start, but it was not enough. Kubernetes sends a SIGTERM signal to the container. The JVM catches this signal. The application starts shutting down. Kubernetes waits for a preStop hook or the termination grace period. If the app takes too long, Kubernetes sends SIGKILL. We added a preStop hook to our deployment manifest. This script sleeps for a few seconds before allowing the container to stop. This delay ensures the Kubernetes service removes the pod IP from the load balancer before traffic stops flowing. This five-second sleep bridged the gap. The service mesh updated its endpoints. Traffic stopped routing to the terminating pod. Then the application began its graceful shutdown. No in-flight requests were dropped. The error rate during deployments dropped to zero. Configuration Management Challenges Configuration management was another pain point. We used ConfigMaps to store environment settings. Kubernetes mounted these as files inside the container. Our Java app reads these files at startup. Changing a ConfigMap triggered a rollout. Every config change restarted all pods. This was disruptive for minor tweaks. We wanted hot reloading for certain properties. Spring Cloud Kubernetes supports this feature. It watches for ConfigMap changes and refreshes the context. We enabled the reload strategy. This allowed us to update logging levels without restarting pods. It reduced deployment frequency for operational changes. However, we learned to be careful. Reloading the entire context can be heavy. We restricted hot reload to specific beans. Critical infrastructure settings still required a restart. 
This balance reduced risk while improving agility. Logging in a Distributed Environment Legacy Java apps often write logs to local files. This pattern fails in Kubernetes. Containers are ephemeral. When a pod dies, the local disk disappears. Logs vanish with it. We needed to stream logs to stdout. Kubernetes captures stdout and sends it to the logging driver. We reconfigured our Logback setup. We removed file appenders. We added a console appender with JSON formatting. Structured logs are easier for aggregation tools to parse. This change integrated us with our ELK stack seamlessly. We could trace requests across multiple pods. We could search logs without accessing individual containers. This visibility was crucial for debugging production issues. It also reduced disk IO within the container. The application ran lighter without file writes. Security and User Context Running Java as root in a container is a security risk. If an attacker escapes the JVM, they gain root access to the node. We audited our Docker images. The base images ran as root by default. We created a non-root user in our Dockerfile. This simple change reduced our attack surface. However, it introduced permission issues. The application could not write to certain directories. We had to adjust volume mounts. We ensured the tmp directory was writable by the new user. This step is often overlooked during migration. Testing security contexts in staging is essential. Resource Limits and JVM Awareness We faced memory issues early in the migration. The JVM did not know about container limits. It allocated a heap based on host memory. The container got OOMKilled repeatedly. We fixed this by using percentage-based flags. This ensured the JVM respected the cgroup limits. It left room for non-heap memory. We also set requests and limits in Kubernetes. Requests guaranteed resources for scheduling. Limits prevented runaway processes from starving neighbors. 
This alignment between JVM and Kubernetes was critical for stability.

Health Checks and Startup Probes

Java applications can be slow to start. Loading classes and connecting to databases takes time. Kubernetes liveness probes might kill the pod before it is ready. We used startup probes to handle this. The startup probe disables liveness checks until it succeeds. This gave our app up to five minutes to start. Once ready, the liveness probe took over. This prevented premature restarts during cold starts. It also protected us during heavy garbage collection pauses. The app remained healthy even if response times spiked temporarily.

Lessons Learned and Best Practices

Our journey taught us several key lessons. We incorporated these into our development standards.

- Handle SIGTERM. Always configure graceful shutdown. Do not rely on default behavior.
- Use preStop hooks. Bridge the gap between service discovery and process termination.
- Log to stdout. Never write to local files in containers. Use structured logging.
- Run as non-root. Reduce security risks by dropping privileges.
- Tune JVM for containers. Use percentage-based memory flags. Respect cgroup limits.
- Configure probes. Use startup probes for slow-starting applications. Tune liveness thresholds.
- Test failure modes. Simulate pod kills in staging. Verify no data loss occurs.

Conclusion

Moving Java to Kubernetes is more than just an infrastructure change; it is a fundamental shift in how we design, build, and operate software. Over time, we learned that the orchestration layer introduces new requirements. Graceful shutdowns, proper logging, and resource management are now fundamental for reliability. As a result, our application is resilient to both deployments and runtime failures. We can trust the platform to manage our workloads efficiently while we focus on delivering features. We continue to refine our patterns as the ecosystem evolves and best practices emerge.
Java remains a powerful tool for backend development — it just requires a new mindset for the cloud-native era. Happy coding, and always keep your containers healthy.

By Ramya vani Rayala
The Pod Prometheus Never Saw: Kubernetes' Sampling Blind Spot

The Fix That Doesn't Fix It

Reducing your Prometheus scrape interval from 15 seconds to 5 seconds does not fix the sampling blind spot. It moves it. Any pod whose entire lifetime falls within one 5-second scrape gap is still structurally invisible, not because of misconfiguration or missing rules, but because poll-based collection has an irreducible sampling gap that no interval setting eliminates. This article explains exactly why that is, what it costs in production, and what actually fixes it.

What Is the H5 Evidence Horizon?

Kubernetes evidence horizons are deterministic points after which specific diagnostic context becomes permanently unrecoverable. H5, the scrape-interval sampling blind spot, is the only horizon that prevents observability data from being created in the first place. Unlike H1 (LastTerminationState rotation at ~90 seconds) or H2 (scheduler event pruning at 1 hour), H5 has no timer and no API call. It fires silently for every pod whose entire lifetime falls within one Prometheus scrape gap. The full evidence horizon taxonomy is documented at opscart.com/kubernetes-evidence-horizons-h2-h3-h4-h5/.

Why Poll-Based Observability Has an Irreducible Blind Spot

Prometheus collects metrics by sending HTTP requests to targets at a fixed interval. The default scrape interval in kube-prometheus-stack is 15 seconds. Every 15 seconds, Prometheus asks the world: "What is your current state?"

This model works exceptionally well for persistent, long-running workloads. A deployment that has been running for hours will be scraped hundreds of times. Its CPU trends, memory patterns, and request rates are captured with high fidelity.

The model fails completely for ephemeral workloads that live and die between two scrapes, and Kubernetes generates ephemeral workloads by design. The math is straightforward.
Given a scrape interval S and a pod lifetime L:

  • If L ≥ S: at least one scrape falls within the pod's lifetime, so the pod can generate at least one data point.
  • If L < S: the pod may generate zero data points, not because of any failure in Prometheus, but because its entire lifetime can fall between two consecutive scrapes.

For a pod that lands in such a gap, this is not a probability statement. It is deterministic. A pod with a 6-second lifetime and a 15-second scrape interval will generate exactly zero Prometheus data points if its entire lifetime falls within one scrape gap. There is no configuration change that fixes this for that specific pod in that specific gap.

The only way to eliminate the blind spot entirely is to move from a poll-based model to an event-driven model. And this is precisely the architectural distinction that most observability discussions miss.

The Ghost Pod Experiment

To validate this claim empirically, I ran a controlled experiment on a 3-node Minikube cluster (Kubernetes 1.31, Apple M-series hardware).

Setup:

  • Pod memory limit: 64Mi
  • Pod memory allocation: 128Mi (guaranteed OOMKill)
  • Prometheus scrape interval: 15s (kube-prometheus-stack default)
  • Pod name: ghost-pod, namespace: oma-sampling

What happened: The pod started, allocated memory beyond its limit, and was OOMKilled by the kernel at T+5s. Total observed pod lifetime: 6 seconds.

Prometheus result:

    # Query executed the morning after the experiment
    $ promql: container_cpu_usage_seconds_total{pod="ghost-pod"}
    {}  # empty: 0 data points
    $ promql: kube_pod_container_status_last_terminated_reason{pod="ghost-pod"}
    {}  # empty: 0 data points
    $ kubectl get pod ghost-pod -n oma-sampling
    Error from server (NotFound): pods "ghost-pod" not found

Zero data points. No alert. No record. From Prometheus's perspective, ghost-pod never existed.
Event-driven result: An OMA (Operational Memory Architecture) collector subscribed to the Kubernetes watch API captured the following at the moment of occurrence:

    OOMKill P001 captured at T+5s
    pod: ghost-pod
    namespace: oma-sampling
    exit_code: 137
    memory_limit: 64Mi
    node: opscart-m03
    timestamp: 2026-04-18T23:38:06Z

The causal evidence (exit code, resource limits, node placement) was captured at occurrence. No scrape gap. No sampling window. The watch API delivers every pod state transition at the moment it fires, regardless of timing.

Figure: Poll-based vs. event-driven architecture. A pod with a 6-second lifetime falls entirely within one 15-second Prometheus scrape gap, generating zero data points. An event-driven collector subscribed to the Kubernetes watch API captures the OOMKill at occurrence; no sampling gap exists by architecture.

"Just Reduce the Scrape Interval"

This is the most common response when engineers first encounter the H5 blind spot. It deserves a direct answer.

Reducing the scrape interval from 15s to 5s does not eliminate the blind spot. It shifts the threshold from 15 seconds to 5 seconds. Any pod whose lifetime falls within one 5-second scrape gap is still structurally invisible.

Consider the real-world distributions:

  • CrashLoopBackOff with OOMKill on startup: A pod that allocates memory before its first checkpoint can OOMKill in under 1 second. No scrape interval short of continuous polling catches this.
  • Init container failures: Init containers that fail immediately may have lifetimes measured in milliseconds. These are architecturally invisible to any poll-based system, regardless of scrape interval.
  • Batch job bursts: Short-lived Job pods in a batch processing cluster can complete their entire lifecycle (start, run, succeed, or fail) within a single scrape gap at any reasonable interval.

Reducing the scrape interval also has real costs:

  • Storage: Prometheus metric storage grows proportionally with scrape frequency.
Moving from 15s to 5s roughly triples the number of samples you store.
  • Cardinality: More frequent scrapes do not add series, but they multiply the samples kept for every series, so high-cardinality metrics (per-pod, per-container) become more expensive to store and slower to query.
  • Target load: Every scrape is an HTTP request to your metrics endpoints. High scrape frequencies create measurable load on instrumented services.

You are paying a real cost to shift the threshold, not to eliminate it. For workloads with sub-second or sub-5-second lifetimes, no scrape interval is fast enough.

Why the Watch API Is Structurally Different

The Kubernetes watch API is not a faster poll. It is a fundamentally different delivery mechanism.

When you run kubectl get pods --watch, you are not asking Kubernetes "what is the current pod state every N seconds." You are opening a long-lived HTTP connection to the API server and subscribing to a stream of state change events. Every time a pod transitions (from Pending to Running, from Running to Terminated, from any state to OOMKilled), the API server pushes that transition to every active watcher.

The delivery is at-occurrence. There is no polling interval. There is no sampling gap. If a pod OOMKills at T=17.3 seconds, the watch API delivers that event at T=17.3 seconds, not at the next scrape boundary.

This means the H5 blind spot does not exist for event-driven collectors by architecture. A pod with a 6-second lifetime generates exactly one OOMKill transition event. That event is delivered to every watcher at the moment it fires. The watcher captures it. Done.

The practical implication: event-driven collection provides complete coverage of pod lifecycle events regardless of pod lifetime, without any configuration tuning.

What the Sampling Blind Spot Costs in Production

The blind spot has three concrete operational consequences.

Undetected crash loops. A pod in CrashLoopBackOff with a very short failure cycle can OOMKill dozens of times per hour without generating a single Prometheus alert.
The restart counter increments in kubectl get pods output, but if nobody is looking at that specific pod, the pattern goes undetected. By the time an engineer investigates, the pod may have crashed hundreds of times with no metric record of any individual failure.

Incomplete capacity planning. Short-lived batch pods that OOMKill during processing spikes are invisible to Prometheus-based capacity analysis. Your memory utilization reports show only long-running pods. The actual peak memory demand, which caused the batch pod OOMKills, never appears in your capacity data.

Silent compliance gaps. In pharmaceutical and financial production environments with audit requirements, unrecorded container failures are a compliance problem. An auditor asking "what failed in this namespace between 2 AM and 4 AM on this date" deserves a complete answer. A Prometheus query that returns empty results for pods that actually OOMKilled is not a complete answer.

The Structural Fix

The H5 blind spot cannot be patched within a poll-based architecture. The fix is additive: complement Prometheus with an event-driven collector that subscribes to the Kubernetes watch API.

This does not mean replacing Prometheus. Prometheus remains the right tool for what it does: metric aggregation, trend analysis, and alerting on long-running workloads. The event-driven collector handles what Prometheus cannot: discrete lifecycle events for pods of any duration.

The implementation I've validated uses a Go-based collector subscribing to CoreV1().Pods(namespace).Watch(). On each Modified event, the collector inspects ContainerStatus for OOMKill signals and captures the full forensic context synchronously, before the pod restarts and overwrites LastTerminationState.
    // Simplified watch loop, wrapped in a function so errors can be returned.
    // Imports assumed: context, corev1 "k8s.io/api/core/v1",
    // metav1 "k8s.io/apimachinery/pkg/apis/meta/v1", "k8s.io/client-go/kubernetes".
    func watchOOMKills(ctx context.Context, clientset kubernetes.Interface, namespace string) error {
        watcher, err := clientset.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{})
        if err != nil {
            return err
        }
        defer watcher.Stop()
        for event := range watcher.ResultChan() {
            pod, ok := event.Object.(*corev1.Pod)
            if !ok {
                continue // not a Pod, e.g. a Status object sent on a watch error
            }
            for _, cs := range pod.Status.ContainerStatuses {
                term := cs.LastTerminationState.Terminated
                if term != nil && term.Reason == "OOMKilled" {
                    captureOOMKillEvidence(pod, cs)
                }
            }
        }
        return nil
    }

The watch API delivers the event at occurrence. The capture is synchronous. No polling gap. No sampling threshold. Ghost pods are no longer invisible.

Full implementation with reproducible Minikube scenarios is at github.com/opscart/k8s-causal-memory.

H5 in Context: The Evidence Horizon Taxonomy

H5 is one of five evidence destruction mechanisms I've identified and formalized as an evidence horizon taxonomy. The full taxonomy:

    Horizon  Trigger               What's lost
    H1       Pod restart (~90s)    OOMKill forensics, limits, ConfigMaps
    H2       Event TTL (1hr/1000)  Scheduler placement rationale
    H3       Debug session exit    kubectl debug exit code, duration
    H4       Kubelet restart       In-memory operational state
    H5       Scrape interval       Sub-interval pod lifetimes

H5 is unique in the taxonomy: H1 through H4 destroy Kubernetes API state that previously existed. The scrape-interval blind spot prevents observability data from being created in the first place. It is the only horizon that requires no destruction event; the evidence simply never reaches any persistent store.

The full taxonomy with empirical validation across Minikube and AKS 1.32.10 is documented in the canonical OpsCart article, Beyond the 90-Second Gap, and in the research preprint at Zenodo DOI: 10.5281/zenodo.19685352.

Conclusion

The H5 blind spot is not a Prometheus bug. It is not a configuration problem. It is an irreducible consequence of poll-based collection applied to a platform that generates arbitrarily short-lived workloads.

Kubernetes is designed to self-heal faster than humans can observe.
A pod that OOMKills in 6 seconds and restarts in 2 is working exactly as designed. Prometheus, also working exactly as designed, sees nothing.

The architectural answer is equally straightforward: subscribe to the Kubernetes watch API. Receive events at occurrence. No scrape interval. No sampling gap. No ghost pods.

Every pod that crashes in your cluster deserves a record. The watch API ensures it gets one.

Resources:

  • github.com/opscart/k8s-causal-memory: open-source implementation with reproducible H5 scenario
  • Beyond the 90-Second Gap: full evidence horizon taxonomy (OpsCart canonical)
  • Research preprint: 30-run statistical analysis, AKS 1.32.10 validation

By Shamsher Khan, DZone Core
The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes

Imagine deploying a robust Spring Boot microservice that passes every integration test in your local Docker environment, only to watch it crash loop endlessly shortly after launching to your Kubernetes production cluster. Everything ran fine on your laptop, but in the live environment, your pods start terminating en masse. Requests to your critical endpoints begin failing with 503 errors. Panic sets in as your service, the backbone of your transaction pipeline, is effectively brought down by an invisible foe.

In our recent migration to a cloud-native architecture, the culprit was a hidden memory configuration issue involving how the Java Virtual Machine interacts with Kubernetes container limits. A tiny mismatch in resource allocation, something that went unnoticed during development, led to a chain reaction of OOMKilled events in production.

In this article, we will walk through the scenario step by step, including how the problem manifested and how we diagnosed the root cause. We will discuss the configuration that was to blame and the fixes and best practices that emerged from the post-mortem. Along the way, we will highlight common Kubernetes pitfalls for Java developers that can similarly wreak havoc if left unchecked.

Symptoms: When Pods Turn Against You

The first sign of trouble was our monitoring dashboard lighting up with red alerts. Shortly after deploying our new payment service, we noticed mass restarts: pods that had started successfully were suddenly restarting every few minutes. This was not a one-off fluke, since it was happening across all replicas simultaneously. Our ingress controller started returning 503 Service Unavailable responses. Essentially, Kubernetes was killing the pods before they could serve traffic.

Digging into application logs revealed nothing unusual. There were no stack traces or Java exceptions. The logs simply stopped abruptly.
However, checking the Kubernetes pod status revealed the cryptic message Reason: OOMKilled. This error means the container exceeded its memory limit and was terminated by the Linux kernel.

At first glance, we were not sure why this would happen. We had set the JVM heap size to 512 MB, and our Kubernetes memory limit was set to 1 GB. Surely there was enough headroom. Why would the kernel kill the process when the heap was only half the limit?

The impact of this issue was severe. Since our app relied on steady uptime for processing transactions, widespread pod instability meant no requests could be completed. In effect, our service was down for all users until the issue was resolved.

Reproducing and Observing the Failure

In our staging environment, we tried to reproduce the sequence of events. We deployed the same Docker image and applied the same Kubernetes manifests. We watched the memory usage via kubectl top pods. Sure enough, as the load increased, the container memory usage climbed steadily until it hit the limit and the pod vanished.

Interestingly, the application worked fine under low load. The issue only surfaced during peak traffic, when non-heap memory usage spiked. This was a crucial clue. It hinted that the JVM heap was not the only consumer of memory within the container. We realized that focusing solely on heap size was a mistake.

Understanding JVM vs. Container Memory

At this point, it is helpful to explain how the JVM accounts for memory within a container. Many Java developers assume that the max heap flag controls the total memory usage of the process. However, the JVM requires memory for more than just the heap:

  • Metaspace holds class metadata.
  • Thread stacks require memory for each thread.
  • The code cache holds JIT-compiled code.
  • The garbage collector needs its own internal data structures.
  • Direct buffers hold NIO direct memory.

In older Java versions, the JVM was not container-aware.
It would calculate memory limits based on the host machine RAM, not the container limit. While modern Java versions have improved container awareness, they still require explicit configuration to ensure the non-heap memory fits within the Kubernetes cgroup limit.

In our case, the JVM heap was set to 512 MB, but the non-heap memory usage under load grew to approximately 600 MB. Total usage was about 1.1 GB against a Kubernetes limit of 1 GB. The result was an OOMKill.

The Misconfigured Manifest and How It Failed

Let us look at a simplified version of the Kubernetes deployment manifest that led to this issue. We set the Kubernetes memory limit to 1Gi. We set the JVM max heap to 512m. On paper, this looks safe. However, we failed to account for the JVM off-heap memory footprint. When the application loaded large libraries or processed high volumes of concurrent requests, the non-heap memory expanded, pushing the total process size over the 1Gi cgroup limit.

Unlike an application-level failure, where a server rejects the request and returns an error, here the Linux kernel simply killed the process without warning the application. There was no chance to log an error or gracefully shut down. This silent failure made debugging incredibly difficult, since the application never got a chance to speak.

How We Fixed It: Correct Memory Alignment

The fix for this issue was twofold. We needed to adjust the Kubernetes limits and tune the JVM flags to respect those limits dynamically.

First, we increased the container limits. We raised the memory limit to provide sufficient headroom for non-heap usage.

Second, we decided to use a percentage-based heap. Instead of a fixed max heap value, we configured the JVM to use a percentage of the container's available memory.

Here is the corrected configuration we applied. We used the MaxRAMPercentage flag so the JVM automatically calculates the heap size based on the cgroup limit detected at runtime. This prevents the configuration from becoming stale if we change the Kubernetes limits later.
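The memory arithmetic behind this incident can be sketched in a few lines. Go is used here purely as a calculator; this is not JVM code, and the real JVM applies further ergonomics when it sizes the heap. The helper names are hypothetical, MB and GB from the text are treated as MiB and GiB, and the 75 percent heap share is an inference from the "remaining 25 percent" mentioned in the article rather than a documented setting.

```go
package main

import (
	"fmt"
	"math"
)

const mib int64 = 1 << 20

// fitsLimit reports whether heap plus non-heap memory stays within the
// container memory limit (all values in bytes).
func fitsLimit(heap, nonHeap, limit int64) bool {
	return heap+nonHeap <= limit
}

// heapFromPercentage approximates a percentage-based max heap: pct percent of
// the container limit. Simplified model of what MaxRAMPercentage aims for.
func heapFromPercentage(limit int64, pct float64) int64 {
	return int64(float64(limit) * pct / 100)
}

// minLimitFor returns the smallest container limit at which the share left
// over after the heap, (100-pct) percent, still covers nonHeap bytes.
// It returns 0 for a pct outside (0, 100).
func minLimitFor(nonHeap int64, pct float64) int64 {
	if pct <= 0 || pct >= 100 {
		return 0
	}
	return int64(math.Ceil(float64(nonHeap) / (1 - pct/100)))
}

func main() {
	// Numbers from the incident: 512 MB heap, ~600 MB non-heap, 1 GB limit.
	heap, nonHeap, limit := 512*mib, 600*mib, 1024*mib
	fmt.Println("fixed heap fits the original limit:", fitsLimit(heap, nonHeap, limit))

	// Assumed heap share of 75 percent, leaving 25 percent for non-heap.
	const pct = 75.0
	newLimit := minLimitFor(nonHeap, pct)
	newHeap := heapFromPercentage(newLimit, pct)
	fmt.Printf("limit=%dMi heap=%dMi fits=%v\n",
		newLimit/mib, newHeap/mib, fitsLimit(newHeap, nonHeap, newLimit))
}
```

The design point this illustrates: with a percentage-based heap, the non-heap budget is whatever share of the limit remains, so the limit has to be sized from the observed non-heap peak rather than from the heap alone.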
We also increased the total limit to ensure the remaining 25 percent was sufficient for metaspace and threads. This change allowed the JVM to adapt to the environment automatically. It removed the hard-coded assumption about available memory. This is critical in cloud environments where resource limits might change based on scaling policies.

Preventing Similar Issues: Best Practices for Java on Kubernetes

We learned several valuable lessons during this incident. We incorporated these into our development standards to prevent recurrence.

  • Always account for non-heap memory: Never set the max heap equal to the container memory limit. Always leave at least 20-25 percent of the container memory for off-heap usage. This buffer is essential for stability.
  • Use modern base images: Ensure you are using JDK versions that support container awareness. Java 8 update 191 or later is required. Java 11 or 17 is better. Consider using distroless images or Jib to reduce the attack surface and image size.
  • Configure liveness probes carefully: A common pitfall is setting liveness probes too aggressively. If your Java app pauses for garbage collection, it might miss a probe timeout and get killed unnecessarily. Add an initial delay and failure thresholds to accommodate GC pauses.
  • Monitor memory trends: Implement monitoring using Prometheus and Grafana. Track both container-level metrics such as container_memory_usage_bytes and JVM-specific metrics such as jvm_memory_used_bytes. Alert when usage approaches 80 percent of the limit. This gives you time to react before the kernel steps in.
  • Simulate load in staging: One reason this bug slipped by is that in development, we rarely simulated production-level concurrency. To prevent such surprises, we now use tools like k6 or JMeter in our staging cluster to validate memory stability under load.
  • Secure your secrets: Ensure you store sensitive configuration securely. In Kubernetes, use Secrets mounted as environment variables or files rather than hardcoding them in Docker images.
This prevents accidental exposure during debugging.
  • Handle graceful shutdowns: Configure your Spring Boot app to handle SIGTERM signals properly. Kubernetes sends this signal before killing a pod. Ensure your application stops accepting new requests and finishes processing in-flight requests before shutting down.

The Human Element in Incident Response

Beyond the technical fixes, we also improved our response process. We established a blameless post-mortem culture, which encouraged team members to share mistakes without fear. We documented the incident in our internal knowledge base so that new team members learn from our experience. We also added a checklist for production deployments that includes verifying JVM flags and memory limits. These process changes are just as important as the code changes.

Conclusion

Kubernetes is powerful, but with power comes complexity. Our Java service went down due to a tiny memory alignment bug, something easy to overlook but with catastrophic consequences in production. The hidden issue was simply that we were not accounting for the JVM's total memory footprint versus the container cgroup limit. Once identified, the fix was a configuration change, yet it brought to light the importance of thoroughly understanding how your runtime interacts with the orchestration layer.

In the aftermath, we reinforced our processes. We now simulate real-world load in testing, we added robust monitoring around memory usage, and we keep an eye on JVM flags for containerized environments. By sharing this story, we hope to spare others that moment of dread when you realize that the service at the front door of your business logic has locked out your users because of a silent kernel kill.

In the end, our system is now stable and more resilient. We treat container resources with greater care. We always align JVM flags with Kubernetes limits, and we guard those settings as carefully as any other critical piece of infrastructure.
We never assume something as critical as resource management will just work without thorough validation. Kubernetes got the best of us once, but with these lessons learned, we are determined not to let a sneaky configuration issue slip by again. Happy and safe coding.

By Ramya vani Rayala

Top Containers Experts

Yitaek Hwang

Software Engineer,
NYDIG

Marija Naumovska

Co-founder & Head of Growth,
Microtica

Naga Santhosh Reddy Vootukuri

Principal Software Engineer,
Microsoft

Naga Santhosh Reddy Vootukuri, a seasoned professional with more than 16 years at Microsoft, reflects on his journey from India to the USA. After graduating from Sreenidhi Institute of Science and Technology in 2008, he now serves as a Principal Software Engineer for Azure SQL. His role involves leading his team through software development cycles and ensuring successful product launches. Currently, Naga focuses on a significant initiative in Azure SQL Deployment, emphasizing high availability for SQL customers during feature rollouts. Previously, he managed Master Data Services (MDS) within SQL Server, building community connections and contributing actively to Microsoft forums. His current focus is mainly on AI and LLMs, and he shares his knowledge through detailed articles. Aside from technical responsibilities, Naga engages in Microsoft hackathons and mentors junior engineers, finding fulfillment in guiding their career paths. He also champions diversity and inclusion, advocating for equality within the tech industry. Naga sees himself not only as a technical leader but also as a catalyst for positive change at Microsoft. He is also a Docker Captain.

The Latest Containers Topics

Implementing Asynchronous Communication Between Microservices Using Kafka and Spring Boot
Kafka decouples services, buffers spikes, and routes failures to a DLT. Schemas are contracts; consumers must be idempotent.
June 24, 2026
by Mallikharjuna Manepalli
· 1,276 Views
Your AI Coding Agent Can't Steal What It Never Had: The Docker Sandbox Isolation Story
Docker Sandbox runs AI agents in microVMs. The API key never enters the sandbox — the host proxy authenticates on the agent's behalf.
June 19, 2026
by Shamsher Khan, DZone Core
· 1,342 Views
Zero-Downtime Deployments for Java Apps on Kubernetes
Achieve zero-downtime deployments for Java applications on Kubernetes using rolling updates, readiness/liveness probes, and graceful shutdown strategies.
May 29, 2026
by Ramya vani Rayala
· 3,967 Views
Pragmatica Aether: Let Java Be Java
A modern, distributed, fault-tolerant runtime environment for the language that was intentionally designed for managed environments.
May 29, 2026
by Sergiy Yevtushenko
· 4,296 Views · 1 Like
Docker Hardened Images Are Free Now — Here's What You Still Need to Build
Docker Hardened Images solve the CVE problem. But CVEs aren't why containers fail in production — governance gaps are. Here's the trust architecture that closes them.
May 27, 2026
by Shamsher Khan, DZone Core
· 4,233 Views
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Liquid Clustering replaces rigid partitioning and Z-Order with adaptive clustering in Unity Catalog, improving performance with less maintenance.
May 26, 2026
by Seshendranath Balla Venkata
· 2,623 Views · 1 Like
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.
May 25, 2026
by Ingero Team
· 3,607 Views
Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack
GPUStack is an open-source tool that turns a bunch of scattered GPU machines into one managed cluster for deploying AI models behind an OpenAI-compatible API.
May 21, 2026
by Sandeep Sadarangani
· 3,866 Views · 1 Like
Smart Deployment Strategies for Modern Applications
Docker packages applications to ensure consistent and portable deployments. Kubernetes manages them with scaling, reliability, and automation in production.
May 18, 2026
by Manju George
· 3,640 Views
Solving the Mystery: Why Java RSS Grows in Docker on M1 Macs
Java apps running in x86-64 Docker containers on ARM64 M1 Macs experience mysterious RSS memory growth due to Rosetta 2 translation cache. The culprit? JIT compilation.
May 12, 2026
by Sumeet Sharma
· 3,755 Views · 1 Like
How We Diagnosed a Hidden Scheduler Failure in a Docker Swarm Cluster Serving 2 Million Users
A real production incident in a Docker Swarm cluster — how a routine service update triggered a silent scheduler failure, and how we uncovered it.
May 5, 2026
by Denis Tiumentsev
· 1,838 Views · 1 Like
Mastering Kubernetes to Maximize Your Cloud Potential
Understanding Kubernetes architecture through seven critical layers: storage, compute, networking, observability, security, dev tools, and CI/CD.
May 4, 2026
by Jaswinder Kumar
· 1,867 Views · 2 Likes
AI Agents for DevOps on Kubernetes Need Real Engineering, Not Magic
Kubernetes incident triage: OpenTelemetry → Kafka → CrewAI → RBAC scale. DORA 2024: 75% AI use, 39% low trust. AI correlates, humans approve changes.
April 30, 2026
by Abdul Majid Qureshi
· 2,326 Views
Java Backend Development in the Era of Kubernetes and Docker
Containerization with Docker and orchestration through Kubernetes enable Java backends to be deployed, scaled, and managed efficiently in modern cloud-native environments.
April 28, 2026
by Ramya vani Rayala
· 4,348 Views · 5 Likes
Java in a Container: Efficient Development and Deployment With Docker
Docker containers make Java apps portable and consistent across development and deployment environments, improve scalability, and streamline CI/CD.
April 28, 2026
by Ramya vani Rayala
· 2,679 Views · 2 Likes
The Pod Prometheus Never Saw: Kubernetes' Sampling Blind Spot
Prometheus sampling gaps are irreducible — reducing the scrape interval just moves the threshold. The Kubernetes watch API eliminates it entirely.
April 23, 2026
by Shamsher Khan, DZone Core
· 2,245 Views · 1 Like
The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes
A Kubernetes pod may restart due to an OOMKill when the Java process exceeds the container’s memory limit. JVM memory tuning and correct resource limits prevent crashes.
April 22, 2026
by Ramya vani Rayala
· 5,994 Views · 6 Likes
When Kubernetes Breaks Session Consistency: Using Cosmos DB and Redis Together
Cosmos DB stores durable state; Redis acts as a coordination layer, enabling predictable, stateless scaling without sticky sessions, strong consistency, or high costs.
April 15, 2026
by Vikas Mittal
· 2,585 Views
NeMo Agent Toolkit With Docker Model Runner
Agent observability is often missing in the rush to build AI agents. NeMo adds observability to AI agents, helping trace, evaluate, and debug multi-agent workflows.
April 15, 2026
by Siri Varma Vegiraju, DZone Core
· 2,671 Views
Run AI Agents Safely With Docker Sandboxes: A Complete Walkthrough
A full walkthrough of how to set up Docker sandboxes on a local machine and how to run AI agents safely in YOLO mode without corrupting the host environment.
April 7, 2026
by Naga Santhosh Reddy Vootukuri, DZone Core
· 6,441 Views · 3 Likes