Microservices

A microservices architecture is a development method for designing applications as modular services that seamlessly adapt to a highly scalable and dynamic environment. Microservices help solve complex issues such as speed and scalability, while also supporting continuous testing and delivery. This Zone will take you through breaking down the monolith step by step and designing a microservices architecture from scratch. Stay up to date on the industry's changes with topics such as container deployment, architectural design patterns, event-driven architecture, service meshes, and more.


DZone's Featured Microservices Resources

Data Contracts as the "Circuit Breaker" for Model Reliability

By SRIRAMPRABHU RAJENDRAN
Intro: When Good Models Go Wrong

A few years ago, I spent months working on a microservices-based customer intake processing system for our application. The code was good, the tests were passing, and we had load-tested it at very high TPS. Yet, on one particular Tuesday afternoon, a small change to the response schema of an upstream service, in which a date field changed from ISO 8601 to epoch milliseconds, cascaded through four downstream services and corrupted a day's transactions without anyone realizing it until it was too late. We fixed it in a few hours, but the lesson has stayed with me, and it has affected every integration I've worked on since.

Crashes are easy to see. Silent data corruption is not.

I see the exact same thing happening with AI and machine learning pipelines today. Except now the consequences are larger and the feedback cycles are slower. A model will not throw an exception if the input schema changes slightly. It will, however, make worse predictions. Quietly. Confidently. For weeks.

In this article, I'd like to propose a solution that brings together two worlds in which I've spent my entire professional life: software engineering's resilience patterns and data governance. My argument is that combining data contracts with the Circuit Breaker pattern builds a proactive defense against the silent data quality failures that undermine AI reliability.

The Real Reason AI Models Fail in Production

There's a general assumption that if a model is not performing well, the problem lies with the model itself: its architecture, its hyperparameters, the training process, and so on. Sometimes this is true, but far more often the cause is more mundane and more solvable: the upstream data changed, and nobody told the model.

"Poor data quality is a silent killer of any AI projects."

This is consistent with many production environments. The data engineers did their job, the modelers did their job, and nobody owned the contract between the two.
This might manifest in a number of ways:

  • Schema drift: A column that has always been a float now starts arriving as a string. The feature engineering process quietly attempts to convert it and introduces errors along the way.
  • Semantic drift: An attribute named "account_status" that previously only had the values "ACTIVE", "CLOSED", and "DELINQUENT" now starts carrying a new value, "UNDER_REVIEW", that the model has never seen. The model maps it to the category that is closest in the embedding space, which could be completely incorrect.
  • Distribution shift: The data source changes how it samples data or alters some other aspect of the data. The model now sees a different distribution than the one it was trained on, yet the schema looks exactly the same, so nothing appears to have changed.
  • Cadence changes: A batch source that has historically refreshed every day now starts refreshing every hour, or vice versa.

A 2023 Gartner study found that the primary cause of AI project failure is poor data quality, and more than 60% of organizations reported that data issues, rather than model issues, were the primary cause of most of their production incidents.

[Figure: silent data changes propagating through the machine learning pipeline.] Data changes upstream flow through the pipeline without error, producing confidently wrong model outputs that go undetected for weeks.

The fundamental issue here is that there is no contract between data producers and data consumers. In the microservices world, we solved this problem a decade ago through API contracts and schema registries. In the data world (and this might sting a little), we're still operating on trust and hope.

This is a data governance and data quality issue. And the good news is that the data management community has a conceptual toolkit to solve this problem.
We just need to integrate it into the AI pipeline.

What Are Data Contracts?

Most teams believe they have data contracts. In reality, they have documentation with good intentions: a wiki page that says "this field is supposed to be a float," or a Slack channel where someone asks, "Hey, did the schema change here?"

A data contract is not documentation. A data contract is enforceable; documentation is not.

A real data contract is an enforceable agreement between two parties: the producer (the system or team that produces the data) and the consumer (the system or team that consumes the data). It includes:

  • Schema: The exact structure, data types, and allowed values for every single field.
  • Semantics: What each field means, including business definitions and edge cases.
  • Quality: Minimum quality thresholds for completeness, freshness, accuracy, and uniqueness.
  • SLAs: Service level agreements for delivery cadence, latency, and availability.
  • Versioning: A policy for schema changes, including deprecation schedules for backward-incompatible changes.

Think of it as an API specification for data. Just as OpenAPI (Swagger) standardized how we specify a REST API, data contracts standardize how we specify a data interface.

The concept has been getting a lot of traction in the DataOps community. Andrew Jones has been a prominent voice in formalizing data contract specifications, and tools like Soda and Great Expectations provide frameworks for data quality expectations, which are one part of a data contract.

This matters especially for AI, because every ML model relies on a set of data assumptions that are usually neither specified nor enforced. When those assumptions are violated, the model starts to deteriorate. A data contract spells out those assumptions and makes them testable and enforceable, bringing the rigor that data stewardship teams have long advocated into the ML pipeline.
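
To make the idea of an enforceable contract concrete, here is a minimal sketch in Python of a contract expressed as code rather than as a wiki page. The names `FieldRule`, `DataContract`, `validate_batch`, and the `accounts_feed` example are hypothetical and invented for illustration; they are not the API of Soda, Great Expectations, or any other tool. The sketch covers only data types, allowed values, and a null-rate threshold; SLAs and versioning policy are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    """One field's clause in the contract (hypothetical structure)."""
    name: str
    dtype: type
    allowed_values: Optional[frozenset] = None  # semantic rule, if any
    max_null_rate: float = 0.0                  # quality threshold


@dataclass(frozen=True)
class DataContract:
    source: str
    version: str
    fields: tuple


def validate_batch(contract, rows):
    """Return a list of violations; an empty list means the batch passes."""
    if not rows:
        return ["batch is empty"]
    violations = []
    for rule in contract.fields:
        values = [row.get(rule.name) for row in rows]
        null_rate = sum(v is None for v in values) / len(rows)
        if null_rate > rule.max_null_rate:
            violations.append(
                f"{rule.name}: null rate {null_rate:.0%} "
                f"exceeds {rule.max_null_rate:.0%}"
            )
        for value in values:
            if value is None:
                continue
            if not isinstance(value, rule.dtype):
                violations.append(
                    f"{rule.name}: expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
                break  # report each kind of problem once per field
            if rule.allowed_values is not None and value not in rule.allowed_values:
                violations.append(f"{rule.name}: unexpected value {value!r}")
                break
    return violations


# Example contract for an assumed "accounts_feed" source.
ACCOUNTS_CONTRACT = DataContract(
    source="accounts_feed",
    version="1.0.0",
    fields=(
        FieldRule("account_status", str,
                  allowed_values=frozenset({"ACTIVE", "CLOSED", "DELINQUENT"})),
        FieldRule("balance", float, max_null_rate=0.05),
    ),
)
```

With this contract, a row carrying `"UNDER_REVIEW"` in `account_status`, or a string in `balance`, produces a violation instead of flowing silently into feature engineering. In a real system the rules would more likely live in a versioned registry than in application code.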
The Circuit Breaker Pattern: A Primer

You already know what a circuit breaker is; there is one in your house. It trips and shuts off the electricity if the load gets too high. You simply flip it back on to restore service. Simple, elegant, and it has saved many houses from burning to the ground.

Circuit breakers have been around for a long time in software development, popularized by Michael Nygard in his book "Release It!", and have become a standard pattern for building resilient distributed systems. I have been using the pattern for a long time. We use Spring Cloud Circuit Breaker, based on Resilience4j, in our microservices-based application to prevent cascading failures in downstream services that are critical to the business.

The circuit breaker works as follows:

  • Closed state: This is the normal operating state. All requests go through to the downstream service while the circuit breaker monitors the failure rate.
  • Open state: The circuit breaker has detected a failure rate above a certain threshold. It has "tripped" and stops sending requests to the downstream service. Instead, it immediately returns a fallback response or an error.
  • Half-open state (recovery probe): After a cooldown period, the breaker allows a limited number of test requests to pass. If they succeed, the breaker closes; otherwise, it returns to the open state.

[Figure: circuit breaker state machine. The breaker changes state based on failure rates and recovery probes.]

Frameworks such as Spring Cloud Circuit Breaker and Netflix Hystrix have made this pattern accessible to every Java developer. The pattern is simple but very useful. It's all about failing fast.

We have been using this pattern for service-to-service communication for more than a decade. Hundreds of services on our platform implement it.
If our XXX critical service goes down, we simply trip the circuit breaker and fail gracefully. But if our upstream data source changes schema silently and starts corrupting our ML features? Nothing. No circuit breaker. No fallback. Just a degradation of our features for weeks.

The failure mode is the same: a degraded upstream service silently corrupts a downstream service. But we didn't have a similar pattern implemented for our data pipelines until we did.

Applying the Circuit Breaker to Data Pipelines

The basic idea is not as complex as it sounds: we propose that every data input to an AI model is a dependency that can cause a circuit breaker to trip. If we do this with HTTP calls to other microservices, we can do it with data going into a model.

While a traditional microservice circuit breaker monitors HTTP request error rate and latency, a data circuit breaker monitors the data quality metrics defined in the data contract:

| Circuit Breaker State | Trigger Condition | Action |
|---|---|---|
| Closed (healthy) | All contract quality thresholds met | Data flows normally into the model pipeline |
| Open (tripped) | Quality metrics breach contract thresholds (e.g., null rate > 5%, data more than 2 hours stale, schema mismatch detected) | Data flow is halted; model receives no new input; fallback strategy activates |
| Half-Open (probing) | After cooldown, a sample batch is validated against the contract | If the sample passes, the breaker closes; if it fails, the breaker stays open |

The fallback options when the breaker trips can be:

  • Stale but safe: Use the last known good data snapshot. The model continues to run, just on slightly outdated but still good data.
  • Graceful degradation: The model continues to run, but flags its output as "low confidence" and sends it to a human for review.
  • Full halt: For high-stakes applications like fraud detection or compliance, the model simply stops running until the data quality issue is resolved.
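
The three states and the "stale but safe" fallback can be sketched in a few lines of Python. This is an illustrative sketch under simplifying assumptions, not the author's implementation and not the Resilience4j API: the class `DataCircuitBreaker` and its `submit` method are hypothetical names, the breaker trips on any violation in a single batch instead of aggregating metrics over a window, and the first batch that arrives after the cooldown serves as the half-open probe.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class DataCircuitBreaker:
    """Gate between a quality check and the model pipeline (hypothetical)."""

    def __init__(self, cooldown_seconds=300.0, clock=time.monotonic):
        self.state = State.CLOSED
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock            # injectable so tests can control time
        self._opened_at = None
        self.last_good_batch = None    # "stale but safe" snapshot

    def submit(self, batch, violations):
        """Return the data the model should consume for this cycle.

        `violations` is the list of contract violations found for `batch`;
        an empty list means the batch passed validation.
        """
        if self.state is State.OPEN:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                return self.last_good_batch   # still cooling down
            self.state = State.HALF_OPEN      # this batch is the probe

        if violations:
            # Trip (or re-trip after a failed probe) and serve the fallback.
            self.state = State.OPEN
            self._opened_at = self._clock()
            return self.last_good_batch

        # Healthy batch: close the breaker and refresh the snapshot.
        self.state = State.CLOSED
        self.last_good_batch = batch
        return batch
```

If the very first batch fails, `last_good_batch` is still `None`, which a caller can treat as the "full halt" case. A production version would also publish state changes and aggregate quality metrics over time before tripping.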
This is a fundamental shift from "we'll detect the problem when it happens and send an alert" to "we'll prevent the problem from happening in the first place."

Architecture: Data Contracts + Circuit Breakers in Practice

Let me walk through a concrete data architecture that ties these patterns together. This is heavily inspired by how we operate on our lending platform, but adapted for the data-to-model case.

The Data Contract Registry

A centralized service responsible for storing all active data contracts. Each data contract is versioned and associated with a data source and a consumer. The service provides APIs for:

  • Registering a data contract
  • Validating data against a data contract
  • Publishing a data contract violation event

The Quality Gate

A lightweight service (or a "sidecar," if you will) that sits between the data source and the model pipeline. For every data batch or stream event received, the quality gate:

  • Fetches the relevant data contract from the registry
  • Validates data against schema, semantics, and quality rules
  • Reports metrics to the circuit breaker

The Circuit Breaker Controller

A stateful component that:

  • Aggregates quality metrics from the quality gate over a specified window size
  • Manages the breaker state (closed, open, half-open)
  • Publishes state change events to a Kafka topic for downstream consumption
  • Executes fallback strategies when the breaker is opened

The Flow

[Figure: end-to-end architecture combining data contracts, quality gates, and circuit breakers.] The circuit breaker sits between the quality gates and the model pipeline, automatically routing to fallbacks if the quality of the data worsens.

If you are using AWS, which we are, then this architecture fits nicely with existing AWS services.
For example, the quality gate can run as a Lambda function or an ECS task, the contract registry can live in DynamoDB or another AWS-native datastore, the circuit breaker state can be kept in ElastiCache (Redis), and the event bus can be Kafka (or Amazon MSK, the AWS managed Kafka service). We already make significant use of all these tools for our financial platform microservices, so the marginal cost of using them for the data pipeline is negligible. If you are using Kubernetes, the quality gate can also work nicely as a sidecar container in your model serving pods.

The key architectural concept is the separation of concerns. The data producer is responsible for the data contract, the quality gate is responsible for checking quality, and the circuit breaker is responsible for failing fast. There is no need for a single team to "own" the entire process.

From Chaos Engineering to Data Resilience

When was the last time you intentionally broke your data pipeline to see what happened?

On our system, we run disaster recovery drills regularly: an orchestrated set of exercises on 100+ components, including APIs, batch jobs, and streaming apps. The team is very good at infrastructure chaos engineering. However, when I asked, "What happens if the credit bureau feed starts sending garbage schema for two hours?" nobody answered, because nobody had ever really tested that scenario.

Most organizations practice chaos engineering on infrastructure, but very few practice data chaos engineering: intentionally introducing data quality errors to see whether their systems correctly detect and respond to them.

Data Chaos Engineering in Practice

  • Schema injection: Apply a schema modification temporarily, for example, by adding a column or changing a data type. Validate that the quality gate detects the modification and that the circuit breaker is triggered.
  • Null injection: Increase the proportion of null values for a critical feature beyond the contract threshold. Validate that the breaker is triggered.
  • Staleness simulation: Delay data delivery beyond the SLA. Validate that the staleness check is triggered.
  • Distribution poisoning: Apply a small perturbation to the distribution of a critical feature. Validate that it is detected.

[Figure: the data chaos engineering cycle.] Faults are injected to verify that the contracts and breakers function correctly, and the gaps that are found feed back into contract and breaker development.

I have seen that running these experiments every month, with the same discipline we already apply to the DR drills for our services, instills enormous confidence in the system's ability to look after itself. It also reveals gaps in your data contracts that you might never find by reviewing documentation. If you introduce a fault and nothing catches it, your contract is incomplete. We learned that we had three missing contract clauses just by running data chaos experiments for the first month.

The principles of chaos engineering apply here. You are not testing whether your system works under perfect conditions; you are testing whether it fails safely under realistic, degraded conditions.

Real-World Scenario: Stopping a Bad Prediction Before It Ships

For example, a financial services company might use ML models to predict customer behavior for risk analysis. The model might use various data sources as features, such as an external third-party data provider for customer risk indicators.

The scenario: A third-party vendor changes their API and doesn't notify anyone. A critical field in the data set now returns numeric data instead of categories. The field previously returned the categories HIGH_RISK, MEDIUM_RISK, LOW_RISK, and MINIMAL_RISK, but now it returns a number between 1 and 100.
The ETL process doesn't fail but falls back to a default mapping, which essentially flattens all the risk into a single category across all customers.

Without a data contract and circuit breaker: The model runs for weeks with corrupted features. Predictions are no longer accurate, but the gradual change is mistaken for market conditions or seasonality. By the time the actual cause is determined, thousands of decisions have been made based on incorrect predictions. Addressing the problem involves several teams working in war rooms over the course of days, analyzing logs and assessing the damage, at considerable engineering cost and possibly business cost as well.

With a data contract and circuit breaker: The data contract is very specific in that it requires the risk indicator field to contain one of four string values. When the vendor changes the format of the API, the quality gate immediately recognizes that the data is failing semantic validation. The circuit breaker is triggered within minutes. The system falls back to the last verified snapshot of the data and flags all predictions as "Degraded Confidence." An alert is sent to the data engineering team. The schema is fixed within hours, and zero corrupted predictions are ever made.

The speed is a secondary benefit. The actual value is in the prevention of damage: this is a preventive control rather than a detective one. The circuit breaker kept the bad data out of the model, so no corrupted prediction was ever made.

FAQs

What is the difference between a data contract and a schema registry, e.g., Confluent Schema Registry?

A schema registry verifies structure (field names, data types, and nesting). A data contract extends that with semantic rules (allowed values, definitions), quality rules (nulls, freshness), and SLAs (delivery cadence, availability). In other words, the schema registry is just one part of the data contract.
Won't triggering circuit breakers cause the model to stop working too often?

This is not a fundamental flaw; it's a calibration issue. People often underestimate the amount of variation that is normal in their data. We did. Start with generous thresholds, then tighten them once you know your data's normal behavior. The half-open state helps with recovery. In practice, well-calibrated circuit breakers do not trip often, and when they do, it's likely due to real issues.

Does this apply to real-time streaming data, or is it limited to batch data?

Both. For streaming, the quality gate checks every event or micro-batch, and the circuit breaker aggregates metrics over a time window. For batch, the quality gate checks at the batch level, prior to writing to the feature store. The pattern is agnostic to the delivery mechanism.

What about unstructured data, like text and images?

For unstructured data, data contracts are concerned with other quality aspects, like encoding, language, document size, and metadata. The Circuit Breaker still applies, just to other metrics. For example, in an image processing pipeline, if 90% of the images received are 90% smaller than the average, it could be a sign of corrupted images or of thumbnails being sent in place of full images.

How do I get data producers to adopt contracts?

Start with the highest-value, highest-risk data sources. Present it in the context of reducing their support load. The producer team is interrupted every time a consumer reports a bug caused by a change in the data. I have been in enough cross-team incident reviews to know that these interruptions are not popular. Contracts remove the need for them. Once one producing team has adopted contracts and seen the reduction in downstream incidents, adoption tends to spread to the rest naturally. We began with one data feed and now have contracts in place for our most critical internal data sources.
Conclusion

The data engineering community has spent years developing ever-more sophisticated monitoring, alerting, and observability tools. That's all been good work. But let's be honest: monitoring is fundamentally reactive. Monitoring just lets you know something's gone wrong after the damage is done. You want both monitoring and prevention, but only prevention stops the damage before it happens.

Data contracts and circuit breakers are a fundamental shift in data resiliency. Contracts make the expectations explicit. Circuit breakers enforce those expectations in real time, before the bad data ever gets to the models and agents that rely on it.

When building AI systems that make critical decisions (and increasingly, all of us are doing this), you simply cannot operate on implicit trust between data producers and data consumers. The chasm between "the data exists" and "the data is fit for purpose" is where model reliability goes to die.

The data governance and data quality practices that this community has advocated for over the years are precisely what you need. Taking them to the AI layer is what's next.

Bridge the gap. Write the contract. Wire the breaker. Start with one data source, the one that has burned you before. You know the one. Your models will thank you.

Key Takeaways

  • The cause of AI system failure is data, not code. The most common cause of production AI system failure is a change in data schema or semantics, which degrades model predictions silently.
  • Data contracts make data producer and consumer expectations around schema, semantics, data quality thresholds, and SLAs explicit, turning implicit assumptions into testable ones.
  • The Circuit Breaker pattern stops bad data from being fed to a model by automatically halting data flow when data quality thresholds are violated, allowing fallbacks to take over.
  • Data chaos engineering intentionally induces data quality failures, so you can be confident that your data contracts and circuit breakers will work when data quality actually fails.
  • Target high-value, high-risk data sources first. Success in one area can generate enough organizational momentum for wider application.
How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It

By Igboanugo David Ugochukwu, DZone Core
There's a specific kind of failure that never makes the post-mortem blog post. It's not a dramatic outage. There's no war room, no all-hands, no apology email sent to a hundred thousand users. It's quieter than that. It looks like a product that worked beautifully for thirty clients, suddenly becoming unreliable at sixty. It looks like an engineering team that can no longer ship without breaking something else. It looks like a sales pipeline that stalls because the platform can't pass a security questionnaire.

This is where most SaaS products actually fail: not at launch, but somewhere around the eighteen-month mark, when the architectural decisions made during the sprint-first MVP phase start extracting their tax.

I've been watching this pattern long enough to recognize it early. The symptoms vary; the underlying causes rarely do. This article is an attempt to lay out the structural decisions that determine whether a SaaS platform scales cleanly or degrades under its own weight, and to be specific enough about why things go wrong that the analysis is actually useful.

The Multi-Tenancy Decision Is Made Once

Every SaaS platform is a multi-tenant system. One application codebase, one infrastructure stack, multiple clients operating inside it simultaneously. That sentence sounds simple. The architectural reality it describes is not.

The core question, how you isolate one tenant's data from another's, has a small number of answers, each with a distinct set of long-term consequences. AWS's SaaS Architecture Fundamentals whitepaper offers one of the cleaner frameworks for thinking about this: a spectrum from fully siloed tenancy (dedicated infrastructure per client) to fully pooled tenancy (shared everything, separated by tenant ID in the data layer), with hybrid models in between.

The AWS multi-tenant architectures guidance is direct about the fundamental trade-off: "The Silo Model provides the strongest tenant isolation but incurs the most cost and complexity. Inversely, the Pool Model offers the least tenant isolation but costs the least."

What this framing leaves implicit is worth stating explicitly: whichever model you choose, the choice shapes almost every subsequent technical decision your team will make.

Siloed tenancy gives each client a dedicated database instance. Data isolation is structural: a bug affecting one tenant's environment cannot, by definition, reach another's. Compliance requirements from healthcare or financial services clients become dramatically simpler to satisfy because the isolation boundary is physical, not logical. The cost is proportional: you're provisioning, patching, and scaling N database instances, where N grows with your client count.

Pooled tenancy places all tenants in a shared schema, differentiated by a tenant ID column embedded in every relevant table. Infrastructure costs are substantially lower, and horizontal scaling benefits all tenants simultaneously. The risk is what practitioners call the noisy neighbor problem: a single tenant running expensive aggregate queries can degrade performance for everyone sharing the same database. More critically, a bug in the tenant-filtering logic (a missing WHERE tenant_id = ?, a misconfigured ORM, a caching layer that doesn't scope keys by tenant) can expose one client's data to another. This failure mode isn't theoretical. It happens. The incidents don't always become public, but they reliably end enterprise contracts and occasionally end companies.

Hybrid tenancy, with dedicated infrastructure for high-value or compliance-sensitive clients and pooled resources for the long tail, is where most mature platforms land. The operational complexity of managing both models is real, but the economics usually justify it.

What's not recoverable is discovering which model you've accidentally built after three years of feature development.
Retrofitting siloed tenancy onto a codebase that has pooled assumptions baked into a hundred query paths is not a refactor. It's a rewrite. The teams that avoid it are the ones who treat the tenancy decision as an architectural constraint from day one: defined, documented, and intentionally chosen.

Start With a Monolith; Plan to Leave It

There is a category of architectural advice that circulates with great confidence among engineers who've read extensively about microservices but haven't operated them at scale under incident conditions. The advice is: "Build microservices from the start; it scales better."

Martin Fowler's documented observation on this is worth citing directly: almost every successful microservices story started with a monolith that got too large and was split apart, while almost every system built as microservices from the beginning has encountered serious trouble.

The trouble is operational. Running twelve services means twelve deployment pipelines, twelve sets of logs, twelve independent failure domains, and a distributed tracing requirement that doesn't exist when you have one process. A team of four engineers who are also building features, writing tests, and responding to client requests does not have the operational bandwidth for this. The cognitive overhead alone slows delivery.

The alternative, a modular monolith, is not a compromise. It's a deliberate choice that preserves the ability to move to microservices later, without paying the full operational cost now.

A well-structured modular monolith has clean module boundaries, explicit interfaces between modules, and no cross-module data access except through those interfaces. The billing logic doesn't reach into the notification module's tables. The reporting engine doesn't call internal functions of the core domain layer.
When the time comes to extract the notification service because it needs to scale independently, or because it needs to deploy on a different cadence, there's a clean seam to cut along. You're lifting a well-defined box out of a larger structure, not untangling five years of implicit dependencies.

The trigger for that extraction should always be evidence, not intuition. Real performance data. A concrete scaling bottleneck. A deployment coupling that's slowing down a specific team. Not hypothetical future requirements or architectural preference.

Statelessness is the constraint that applies regardless of which model you choose. Individual application instances need to be replaceable without ceremony. Session state belongs in a distributed cache (Redis for most teams, though the technology matters less than the principle). File uploads go to object storage. Background jobs are queued and processed independently of the request/response cycle. If you can terminate any running instance without losing data or breaking user sessions, you have horizontal scalability. If you can't, no amount of autoscaling configuration will save you.

The CI/CD Pipeline Is a Promise to Your Clients

Here's a framing that changes how teams invest in deployment infrastructure: the CI/CD pipeline is not tooling. It is the mechanism by which your engineering organization makes and keeps reliability commitments.

Every commit that flows through automated testing and staged deployment is an implicit promise that you are not shipping surprises. Every deployment that uses blue/green or canary strategies is a commitment that you can recover from problems without taking clients offline. The pipeline is the operational expression of your engineering standards. When it's not enforced, those standards become suggestions.

A properly constructed pipeline enforces several stages without exception:

  • Source control discipline. Protected main branches. Required pull request reviews. Automated checks that block merge on failing tests. This seems obvious. It isn't universal.
  • Automated testing at multiple levels. Unit tests catch logic errors in isolation. Integration tests verify that components interact correctly at boundaries. End-to-end tests validate that user-facing flows behave correctly under production-like conditions. Coverage numbers are a proxy metric, and they get gamed. What matters is whether the test suite catches regressions before they reach clients.
  • Security scanning in the pipeline. Static analysis for common vulnerability patterns. Dependency scanning for known CVEs. Container image scanning before any artifact reaches a deployment stage. None of this replaces a professional security review, but it raises the baseline that your security review starts from, and it catches low-hanging fruit on every commit rather than periodically.
  • Staged deployment with canary releases. A canary release routes a controlled percentage of traffic (five or ten percent) to the new version before full rollout. Error rates and latency are monitored during the canary window. If metrics degrade beyond defined thresholds, the release rolls back automatically. Blue/green deployment maintains two production environments, with the router switching between them on successful validation. Rollbacks take seconds because the previous version is still running.
  • Automated rollback triggers. Post-deployment error rate exceeds a defined threshold? The pipeline reverts without waiting for human acknowledgment. This requires defining what "good" looks like before the deployment goes out, which forces teams to think about observability requirements proactively.

The DORA research on software delivery performance is consistent with practitioner experience: teams with mature CI/CD pipelines ship more frequently, experience fewer high-severity incidents, and recover faster when incidents do occur. The correlation isn't coincidental.
Frequent small deployments are inherently lower-risk than infrequent large ones. The pipeline creates the conditions where frequent deployment is safe. One practical note on pipeline architecture: the staging environment needs to mirror production in configuration, even if not in scale. Misconfigured environment variables, incorrect secrets injection, and infrastructure assumptions that don't hold in the target environment — these all generate bugs that only appear at deployment and can't be caught by any amount of unit testing. Observability: What You Cannot See, You Cannot Fix Observability is the property of a system that allows you to understand its internal state from the signals it produces. Logs, metrics, and distributed traces are the three pillars. Most teams have logs. Fewer have metrics instrumented at meaningful granularity. Fewer still have distributed tracing that lets an engineer follow a single user request through every service it touches. The Google SRE team's framework — the four golden signals of latency, traffic, errors, and saturation — remains the clearest starting point for deciding what to measure. If you instrument nothing else, instrument these four things. They answer the question "Is the system working correctly right now?" without requiring an engineer to synthesize information from a dozen different dashboards. The gap matters most during incidents. When a client reports slow dashboards and the on-call engineer has only raw application logs to work with — logs that say "request processed in 4.3 seconds" without any breakdown of where that time went — the mean time to resolution depends entirely on how quickly the engineer's intuition gets lucky. When the same engineer has distributed traces showing the request blocking for 3.9 seconds waiting on a single database query in the reporting service, the resolution path is immediate. 
For multi-tenant SaaS specifically, per-tenant observability is a non-optional requirement that general monitoring guidance doesn't address. The ability to filter every metric, log line, and trace by tenant ID enables two things that matter:

  • When a specific client reports a problem, you can immediately determine whether it's a platform-wide issue or specific to their tenant.
  • You can detect the noisy neighbor problem in metrics before the affected client experiences it in their user interface.

A single tenant whose analytics jobs are consuming disproportionate database CPU will appear in per-tenant metrics as an anomaly before their query patterns start affecting neighboring tenants' response times. That's the kind of early signal that separates reactive operations from proactive ones.

Service Level Objectives translate quality commitments into measurable engineering targets. An SLO is not an SLA: SLAs are contractual commitments to clients; SLOs are internal targets that the engineering team holds itself to, set stricter than the SLA threshold to provide a buffer. Alerting on SLO burn rate ("we're consuming our weekly error budget at three times the sustainable rate") is meaningfully different from alerting on static thresholds like "error rate above 1%." The former fires on conditions that threaten the actual reliability commitment. The latter fires on every routine blip until engineers learn to ignore it.

The SRE workbook's case studies on SLO implementation are worth reading carefully for teams setting up SLOs for the first time. The recurring insight is that getting SLOs slightly wrong is better than having no SLOs, and that they improve through iteration as the team develops better intuitions about what clients actually care about.
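To make the burn-rate idea concrete, here is a minimal sketch in Java. The class name, method names, and the numbers in the usage note are illustrative assumptions, not part of any monitoring product; a real alert would normally evaluate burn rate over more than one time window.

```java
/** Minimal sketch of SLO burn-rate alerting. All names here are hypothetical. */
final class BurnRateAlert {
    private BurnRateAlert() {}

    /**
     * Burn rate = observed error ratio / error ratio the SLO allows.
     * A value of 1.0 spends the error budget exactly over the SLO window;
     * a value near 3.0 spends it roughly three times too fast.
     */
    static double burnRate(long failedRequests, long totalRequests, double sloTarget) {
        if (totalRequests == 0) {
            return 0.0; // no traffic, so no budget was consumed
        }
        double observedErrorRatio = (double) failedRequests / totalRequests;
        double allowedErrorRatio = 1.0 - sloTarget; // e.g. 0.01 for a 99% SLO
        return observedErrorRatio / allowedErrorRatio;
    }

    /** Fires when the budget is being consumed at or above the given multiple. */
    static boolean shouldAlert(double burnRate, double threshold) {
        return burnRate >= threshold;
    }
}
```

With a 99% SLO, 30 failures in 1,000 requests gives a burn rate of about 3, which would trip a threshold of 2, while 5 failures in 1,000 (a 0.5% error rate) would not. A static "error rate above 1%" rule cannot express that difference in terms of the remaining budget.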
Caching Is Architecture, Not Optimization There's a point in the growth curve of most SaaS platforms — somewhere between one hundred and five hundred active users — where the engineering team discovers that their application has been making an implicit performance bet. Every page load triggers database queries that should have been answered from a cache. Every API call recomputes values that could have been stored. The system that felt responsive at twenty clients is visibly straining at two hundred. The teams that handle this gracefully anticipated it. They designed caching into the architecture rather than retrofitting it as an emergency optimization. In a multi-tenant SaaS context, caching is more complex than "put Redis in front of your database." Every cached object must be scoped to a specific tenant. Cached data for Tenant A cannot, under any circumstances, be served to Tenant B. Cache key design must include tenant ID as a required component — not an optional one, not something checked at read time, but structurally embedded in every key. Cache invalidation — famously one of the two hard problems in computer science — becomes harder in multi-tenant environments because you're managing invalidation across tenant boundaries, and harder still when multiple application instances each maintain their own local in-process cache. An update to Tenant A's configuration needs to invalidate the right cache entries across every instance. Getting this wrong produces subtle, intermittent bugs that are difficult to reproduce and unpleasant to debug. A layered caching strategy handles different data categories appropriately. In-process cache for hot, rarely-changing data (feature flags, tenant configuration, static reference data). Distributed cache (Redis or equivalent) for session data, frequently-accessed query results, and computed aggregates that are expensive to regenerate. 
CDN for static assets, public-facing content, and anything that can be served without touching the application layer. Queue-based async processing is the complementary pattern for handling workload spikes without translating them into latency spikes. Long-running operations — report generation, bulk exports, email campaigns, file processing — do not belong in the synchronous request/response cycle. They belong in a job queue. The user receives an acknowledgment that the job has been accepted. The job runs in the background. The result is delivered when it's complete. This keeps p99 response times stable even under unusual load conditions, which is what enterprise SLAs actually measure. Security Is an Architecture Constraint, Not a Feature The framing problem with enterprise SaaS security is that most development teams treat it as a compliance checklist — a set of features to implement before a security audit — rather than a design constraint that shapes the system from the beginning. The OWASP Top 10 Proactive Controls are explicit about this for access control specifically: "Once you have chosen a specific access control design pattern, it is often difficult and time-consuming to re-engineer access control in your application with a new pattern. Access Control is one of the main areas of application security design that must be thoroughly designed up front, especially when addressing requirements like multi-tenancy and horizontal (data dependent) access control." The architectural implication: your access control model should be able to answer a three-variable question before every data access — does user X have permission Y in tenant Z? Note all three variables. A user with full administrative permissions in their own tenant has zero permissions in any other tenant. A service account with cross-tenant reporting access should be an explicit, audited exception, not an assumed default. 
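The three-variable question can be sketched as a single check that takes all three inputs explicitly. This is an illustrative in-memory model, assuming string identifiers and a hypothetical `TenantAccessControl` class; a real system would back it with a policy store and call it from framework-level middleware.

```java
import java.util.Map;
import java.util.Set;

/** Minimal sketch: permissions are granted per user AND per tenant. Names are hypothetical. */
final class TenantAccessControl {
    // userId -> tenantId -> permissions granted to that user inside that tenant
    private final Map<String, Map<String, Set<String>>> grants;

    TenantAccessControl(Map<String, Map<String, Set<String>>> grants) {
        this.grants = grants;
    }

    /** Answers: does user X have permission Y in tenant Z? Denies by default. */
    boolean isAllowed(String userId, String permission, String tenantId) {
        return grants.getOrDefault(userId, Map.of())
                     .getOrDefault(tenantId, Set.of())
                     .contains(permission);
    }
}
```

Because the tenant is a required lookup key rather than an optional filter, an administrator of one tenant has no grants at all in another tenant unless someone adds them explicitly.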
Role-Based Access Control implemented at the framework level — where permission checks happen automatically on every request — is fundamentally more secure than RBAC implemented at the individual endpoint level, where checks can be forgotten or inconsistently applied. Audit logging is the forensic record that makes security audits tractable and incident investigations answerable. Every action that creates, modifies, or deletes sensitive data — and ideally, every access to sensitive data — should generate an immutable log entry recording: who took the action, which tenant they were acting within, what data was affected, and when. This is not only a compliance requirement. It's the record that lets you answer "what happened to this client's data between Tuesday evening and Wednesday morning" when that question needs answering under time pressure. Broken Access Control has held the top position on the OWASP Top 10 since 2021. In multi-tenant SaaS, it's not just the most common vulnerability — it's the one that carries the most severe consequences, because a broken access control bug doesn't affect one user, it potentially affects one tenant's entire dataset being visible to another. SSO federation and enforced MFA address the credential attack surface. The majority of cloud environment security incidents involve compromised credentials, not novel exploits. Allowing enterprise clients to authenticate through their existing identity provider reduces credential surface area and eliminates the parallel set of credentials that would otherwise need to be managed, rotated, and secured. Dependency and container image scanning in the CI/CD pipeline handles the supply chain attack surface. Known CVEs in third-party packages are a growing attack vector. Automated scanning on every build — blocking deployments when critical vulnerabilities are detected — keeps the baseline clean without requiring manual security reviews for every dependency update. 
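The audit log entry described earlier in this section (who acted, in which tenant, on what data, and when) could be modeled as an immutable value. The sketch below assumes Java 16 or later for records; the type and field names are hypothetical, and storage-level immutability (append-only tables, write-once object storage) is a separate concern that this code does not address.

```java
import java.time.Instant;
import java.util.Objects;

/** Minimal sketch of an immutable audit entry. Field names are hypothetical. */
record AuditEntry(String actorId, String tenantId, String action, String resource, Instant occurredAt) {

    // Compact constructor: refuse entries that omit any of the required facts.
    AuditEntry {
        Objects.requireNonNull(actorId, "actorId");
        Objects.requireNonNull(tenantId, "tenantId");
        Objects.requireNonNull(action, "action");
        Objects.requireNonNull(resource, "resource");
        Objects.requireNonNull(occurredAt, "occurredAt");
    }

    /** Convenience factory that stamps the entry with the current time. */
    static AuditEntry now(String actorId, String tenantId, String action, String resource) {
        return new AuditEntry(actorId, tenantId, action, resource, Instant.now());
    }
}
```

Making the tenant a mandatory field means every entry can later be filtered by tenant when investigating an incident.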
Why So Many Platforms Stumble Quietly The failures rarely announce themselves dramatically. There's rarely a single decision you can point to. The pattern is a series of small optimizations for short-term velocity that individually make sense and collectively produce an architecture that resists change, punishes growth, and generates incidents faster than the team can resolve them. Treating SaaS like a desktop application. Session state held in process memory. File writes to local disk. Synchronous operations for everything. No consideration for multiple concurrent instances. This architecture has a hard ceiling on horizontal scalability that isn't visible until you're past the point where addressing it is easy. Neglecting tenant isolation until after the first incident. "We'll add proper tenant isolation once we have more clients" is a statement that makes practical sense and architectural nonsense. The isolation boundary is cheapest to implement correctly before there's existing code to refactor and existing clients whose data is stored in ways that need to be migrated. Skipping automated testing because there's no time. The codebase gradually becomes too risky to refactor. The parts that aren't understood don't get touched. Tests that were never written don't get written retroactively because the cost of retrofitting tests is higher than writing them alongside the code. Features slow down. Good engineers leave. Building observability as an afterthought. When incidents occur — and they will occur — the engineering team is debugging production systems with inadequate information, under client pressure, without the data they need to isolate the root cause quickly. Mean time to recovery extends. Trust erodes. The SLA that seemed achievable suddenly isn't. Designing for the first twenty clients, not the first two hundred. This one is subtle because the decisions feel responsible at the time. A shared database works fine for twenty clients. 
A monolith with no queue-based async works fine at low volume. A single deployment environment is fine for a small team. None of these are wrong in isolation. They become wrong when they're treated as permanent rather than temporary, when the plan to address them "when we need to" never gets made concrete. The honest summary is this: the decisions that are expensive to change later are cheapest to make correctly at the beginning. Not because teams should over-engineer early systems, but because the specific set of decisions that require early attention — tenant isolation model, stateless service design, CI/CD infrastructure, access control architecture — are structural, not incidental. Getting them right doesn't add months to the timeline. It adds a few weeks of design discipline that prevents a year of unplanned remediation. Applying This in Practice: An Engineering Lifecycle None of the above is useful as abstract principle. Here's what it looks like as a working process. Discovery and architecture design — Before writing code, define the problem space, the target client profiles, the compliance requirements, and the expected scale envelope. These inputs determine the tenant isolation model. They determine the access control design. They determine what "encrypted at rest" means for this specific platform. The output is a set of documented architecture decision records, not a market analysis. Infrastructure before features — The CI/CD pipeline, observability stack, secrets management system, and staging environment should exist before the first feature is developed. This is the investment that pays dividends across every subsequent sprint. A pipeline that's been running for six months has established a baseline of normal behavior; deviations from that baseline during deployments are immediately visible. Test-driven feature development — Code doesn't merge without tests. 
Not because 100% coverage is the goal, but because a test written for a new behavior is the cheapest possible insurance against that behavior regressing in a future sprint. Per-tenant metrics from the start — Instrumenting tenant ID into your metrics and logging schema from the beginning costs almost nothing. Retrofitting it into a mature observability stack after you have fifty tenants costs considerably more, and the retrofitted version is never as clean. Scheduled security and performance reviews — Not one-time events before launch. Recurring checkpoints. Load testing that simulates realistic tenant distributions. Security reviews that look for new attack surface introduced by recent features. Evidence-driven architectural evolution — As the platform grows, observability data guides structural changes. A service that needs to scale independently gets extracted when the data shows it's a bottleneck — not when someone has an architectural preference for microservices. Conclusion Architectural foresight isn't caution. It isn't the enemy of velocity. It's the precondition for sustained velocity — the kind that lets teams ship confidently at month twenty-four rather than spending month twenty-four unwinding the debt from month six. The SaaS platforms that degrade quietly at scale don't fail because they ran out of good ideas. They fail because the structural decisions made when speed was the only metric start exacting costs that compound faster than the team can pay them down. Multi-tenant isolation decisions made incorrectly become security incidents. CI/CD pipelines that were never built become deployment bottlenecks. Access control implemented as a checklist item becomes a failed enterprise security review. The specific decisions that prevent this aren't exotic. They're established. They're documented. They're the kind of decisions that experienced teams have been making and refining for a decade. 
The value in understanding them clearly is that you can make them deliberately, before the consequences of the wrong choice are already in production.

References and Further Reading

AWS SaaS Architecture Fundamentals Whitepaper – AWS's foundational framework for tenancy models and SaaS architecture
AWS Guidance for Multi-Tenant Architectures – Silo, bridge, and pool model implementation patterns
Martin Fowler: Breaking a Monolith into Microservices – Practical patterns for architectural evolution
Google SRE Book: Monitoring Distributed Systems – Four golden signals and SLO methodology
Google SRE Workbook: SLO Case Studies – Real-world SLO implementation at Evernote and Home Depot
OWASP Top 10 Proactive Controls: Access Control – Access control design for multi-tenant environments
OWASP Top 10 – Current web application security risk rankings
SapientPro SaaS Development – Architecture, multi-tenant platform design, and CI/CD delivery for SaaS products
Offline-First Patch Management for 10,000 Edge Nodes: A Practical Architecture That Scales
By srinivas thotakura
Implementing Secure API Gateways for Microservices Architecture
By Mugunth Chandran
Zero-Downtime Deployments for Java Apps on Kubernetes
By Ramya vani Rayala
Pragmatica Aether: Let Java Be Java

The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. 
Update the Java version: rebuild and test every service. Change your message broker from RabbitMQ to Kafka: modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool: update dependencies in every microservice. Switch cloud providers: rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level.

This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them.

Framework lock-in worsens this. It isn’t a vendor problem; it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production, not during development.

Architectural problems have architectural solutions.

Aether: The Core Idea

What you write: an interface annotated with @Slice, plus business logic implementation.

    @Slice
    public interface OrderService {
        Promise<OrderResult> placeOrder(PlaceOrderRequest request);

        static OrderService orderService(InventoryService inventory, PricingEngine pricing) {
            return request -> inventory.check(request.items())
                                       .flatMap(available -> pricing.calculate(available))
                                       .map(priced -> OrderResult.placed(priced));
        }
    }

What you don’t write: everything else. No HTTP clients: inter-slice calls are direct method invocations via generated proxies.
No service discovery: the runtime tracks where every slice instance lives. No retry logic: built-in retry with exponential backoff and node failover. No circuit breakers: the reliability fabric handles failure automatically. No serialization code: request/response types are serialized transparently. A method call via an imported interface is the only visible contract.

The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation; it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly.

Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level.

The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties (performance, latency, memory consumption), but as long as they’re compatible with the interface, business logic works unchanged.

You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently.

Under The Hood: What Makes It Work

Five architectural decisions make this possible.

Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. It is based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm published in 2021.
Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide.

Built-in Artifact Repository. DHT-based storage with configurable replication: 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained.

ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices.

Declarative Deployment. Blueprints (TOML files) describe the desired state: which slices, how many instances.

    id = "org.example:commerce:1.0.0"

    [[slices]]
    artifact = "org.example:inventory-service:1.0.0"
    instances = 3

    [[slices]]
    artifact = "org.example:order-processor:1.0.0"
    instances = 5

Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically.

Infrastructure Independence. Aether nodes are identical, so there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java: roll it out across nodes without touching applications. Update the Aether runtime: same.
Update business logic: deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule.

Fault Tolerance: The 50% Rule

The system survives the failure of fewer than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact: actual redundancy, not just graceful degradation.

A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires floor(N/2) + 1 nodes (3 of 5, 4 of 7), so as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant.

Node replacement happens automatically: the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence, from failure detection through state restoration to serving traffic, completes without human intervention.

When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state.

Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing:

    aether update start org.example:order-processor 2.0.0 -n 3
    aether update routing <id> -r 1:3    # 25% to v2, 75% to v1
    aether update routing <id> -r 1:1    # 50/50
    aether update complete <id>          # 100% to v2, drain v1

Deploy during business hours. Shift traffic gradually: 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step.
If health degrades (error rate exceeds thresholds, latency spikes), roll back instantly with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry.

For Every Project: Legacy, Greenfield, And Everything Between

Legacy Migration

Your legacy Java system doesn’t need a complete rewrite. It needs a path forward.

Pick a relatively independent part of your system: something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation:

    private Promise<Report> generateReport(ReportRequest request) {
        return Promise.lift(() -> legacyReportService.generate(request));
    }

One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change.

You haven't added risk: the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change, and that's when the 50% rule starts to apply.

From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious.

No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail, from initial wrapping through gradual peeling to clean JBCT code.
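For readers unfamiliar with the "lift" idea, the wrapping step in this section can be illustrated with plain JDK types. The sketch below is not Pragmatica's Promise.lift(); it swaps in java.util.concurrent.CompletableFuture as a stand-in, runs the legacy call synchronously, and uses a hypothetical `LegacyLift` helper purely to show how a throwing call becomes a value that carries either a result or a failure.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;

/** Stand-in illustration of "lifting" a legacy call; not the Pragmatica API. */
final class LegacyLift {
    private LegacyLift() {}

    /**
     * Runs the legacy call and captures its outcome:
     * a normal return completes the future, any exception fails it.
     */
    static <T> CompletableFuture<T> lift(Callable<T> legacyCall) {
        try {
            return CompletableFuture.completedFuture(legacyCall.call());
        } catch (Exception e) {
            return CompletableFuture.failedFuture(e); // available since Java 9
        }
    }
}
```

The point of the pattern is that callers never see a thrown exception from legacy code; they see a uniform return type and handle failure as data, which is what lets the wrapped code sit behind an interface unchanged.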
Greenfield Development

For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method, and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service.

You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices, each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default.

JBCT patterns (Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects) compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them.

The Spectrum

Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient.

The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility.

Scaling: Two Levels, Three Tiers of Intelligence

Two-Level Horizontal Scaling

Aether scales in two dimensions independently:

  • Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded, so scaling takes milliseconds, not seconds.
  • Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work.

Independent controls, combined effect.
Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. 
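The Tier 1 rules described above can be sketched as a small pure function. This is an illustration only, not Aether's implementation: the class, record, and parameter names are hypothetical, it requires Java 16 or later for records, it omits queue depth, the "sustained" condition, and cooldown periods, and it assumes that "lowers the threshold by 20%" means a relative reduction (0.70 becomes 0.56).

```java
/** Illustrative sketch of a reactive scaling rule set; not Aether's actual code. */
final class ReactiveScaler {
    private ReactiveScaler() {}

    enum Decision { SCALE_UP, SCALE_DOWN, HOLD }

    /** cpu and errorRate are fractions in [0, 1]; latency is in milliseconds. */
    record Metrics(double cpu, double p95LatencyMs, double errorRate) {}

    static final double DEFAULT_SCALE_UP_CPU = 0.70;
    static final double SCALE_DOWN_CPU = 0.30;
    static final double MAX_ERROR_RATE = 0.01;

    /** Threshold a predictive tier might pass in when it expects more load (assumed relative cut). */
    static double loweredScaleUpCpu() {
        return DEFAULT_SCALE_UP_CPU * 0.8;
    }

    static Decision decide(Metrics m, int instances, int minInstances,
                           double p95ThresholdMs, double scaleUpCpu) {
        // Any single pressure signal is enough to add capacity.
        if (m.cpu() > scaleUpCpu || m.p95LatencyMs() > p95ThresholdMs || m.errorRate() > MAX_ERROR_RATE) {
            return Decision.SCALE_UP;
        }
        // Only shrink when idle and still above the configured floor.
        if (m.cpu() < SCALE_DOWN_CPU && instances > minInstances) {
            return Decision.SCALE_DOWN;
        }
        return Decision.HOLD;
    }
}
```

Passing the CPU threshold in as a parameter mirrors the division of labor in the text: the deterministic rules stay the same, and a predictive tier only nudges the threshold. If the predictor is unavailable, the caller falls back to the default value.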
You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. 
Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. 
Or start fresh with maximum clarity: lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist, with legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought; it's the foundation. Scaling is not your problem; it's the environment’s. Infrastructure is not your code; it's the runtime’s. The heavy winter coat comes off. The application breathes.
Resources
Pragmatica Aether: project site
GitHub Repository: source code

By Sergiy Yevtushenko
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel

In this article, I will discuss a highly available solution built with Spring Boot 3 and Spring Security 6 to address the centralized authentication problem frequently seen in modern microservice ecosystems. We are not simply building an "authorization service"; we are examining the cache-first pattern, which minimizes database usage, and Redis Sentinel, which keeps the token cache available when a Redis node fails.
Why a Separate Authentication Service?
While embedding security into each service is an option in microservices, I have always found it more logical to use a centralized Auth service combined with an API Gateway.
DRY (Don't Repeat Yourself): Duplicating token authentication logic across many services increases maintenance costs.
Isolation: Business services focus only on business logic; they don't deal with "is this token valid?" questions.
Performance: Thanks to Redis, instead of going to the database on every request, we can resolve validation from the cache in milliseconds.
Plain Text [Client] ──► [API Gateway] ──► [Auth Service: validate token] │ (valid) ▼ [Backend Microservices]
Cache-Focused Approach: Reducing Database Load
In the classic workflow, every login request puts load on the DB. With the cache-first approach, a POST /auth/signin request is handled as follows: First, Redis is checked. If there is a valid, unexpired token for the user, it is returned directly. On a cache miss, AuthManager.authenticate() is invoked, a DB query is sent, and a BCrypt check is performed. After a successful login, a token is generated with JJWT (HS256). This token is written to Redis with a TTL (e.g., 24 minutes) and returned to the client in the response. In this way, the cache shields the main database, especially during brute-force attacks or bursts of login requests.
Plain Text POST /auth/signin │ ▼ ┌──────────────────────────────┐ │ Token exists in Redis? 
│──── YES ──► Return token (0 DB queries) └──────────────────────────────┘ │ NO ▼ ┌──────────────────────────────┐ │ AuthManager.authenticate() │ (DB query + BCrypt verification) └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Generate JWT (JJWT HS256) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Write to Redis (TTL: 24 min)│ └──────────────────────────────┘ │ ▼ Return token Implementation Details User Entity and UserDetails Integration In most projects, unnecessary mappings are performed between the User asset and the UserDetails objects expected by Spring Security. To reduce complexity, the User Entity is directly derived from the UserDetails interface. This makes the code cleaner and makes it "native," as outlined by Spring Security. Java @Data @Builder @NoArgsConstructor @AllArgsConstructor @Entity @Table(name = "T_APP_USER") public class User implements UserDetails { @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seq_user_gen") @SequenceGenerator(name = "seq_user_gen", sequenceName = "SEQ_APP_USER", allocationSize = 1) @Column(name = "idx") private Long idx; @Column(name = "firstname") private String firstName; @Column(name = "lastname") private String lastName; @Column(unique = true, name = "email") private String email; @Column(name = "accesskey") private String accessKey; // BCrypt-hashed @Column(name = "role") @Enumerated(EnumType.STRING) private Role role; @Override public Collection<? 
extends GrantedAuthority> getAuthorities() { return List.of(new SimpleGrantedAuthority(role.name())); } @Override public String getUsername() { return email; } @Override public String getPassword() { return accessKey; } @Override public boolean isAccountNonExpired() { return true; } @Override public boolean isAccountNonLocked() { return true; } @Override public boolean isCredentialsNonExpired() { return true; } @Override public boolean isEnabled() { return true; } } JWT Filter: The Gateway to Security The request to the system passes through the OncePerRequestFilter. Here, using JwtAuthenticationFilter, we parse the token in each request and populate the SecurityContext. By using the new SecurityFilterChain bean introduced with Spring Security 6, we have disabled CSRF and made session management completely stateless. Token Generation and Validation Java public interface JwtService { String extractUserName(String token); String generateToken(UserDetails userDetails); boolean isTokenValid(String token, UserDetails userDetails); } @Service public class JwtServiceImpl implements JwtService { @Value("${token.signing.key}") private String jwtSigningKey; // Base64-encoded secret key @Override public String extractUserName(String token) { return extractClaim(token, Claims::getSubject); } @Override public String generateToken(UserDetails userDetails) { return Jwts.builder() .setClaims(new HashMap<>()) .setSubject(userDetails.getUsername()) .setIssuedAt(new Date(System.currentTimeMillis())) .setExpiration(new Date(System.currentTimeMillis() + 1000 * 60 * 24)) .signWith(getSigningKey(), SignatureAlgorithm.HS256) .compact(); } @Override public boolean isTokenValid(String token, UserDetails userDetails) { final String userName = extractUserName(token); return userName.equals(userDetails.getUsername()) && !isTokenExpired(token); } private <T> T extractClaim(String token, Function<Claims, T> claimsResolver) { return claimsResolver.apply( Jwts.parserBuilder() 
.setSigningKey(getSigningKey()) .build() .parseClaimsJws(token) .getBody() ); } private boolean isTokenExpired(String token) { return extractClaim(token, Claims::getExpiration).before(new Date()); } private Key getSigningKey() { return Keys.hmacShaKeyFor(Decoders.BASE64.decode(jwtSigningKey)); } } High Availability: Redis Sentinel Using a single Redis instance means that the Auth service has a "Single Point of Failure." If Redis crashes, no one can access the system. This risk mitigation was achieved using Redis Sentinel. Thanks to the Sentinel structure: If the master node crashes, the dependent node is automatically promoted to master via failover. On the application side, we continuously manage these transitions using the Lettuce driver. Technical Stack and Requirements Redis Sentinel configuration: Java @Configuration public class RedisConfig { @Value("${spring.redis.sentinel.master}") private String master; @Value("${spring.redis.sentinel.nodes}") private String sentinelNodes; @Value("${spring.redis.password}") private String password; @Bean public RedisConnectionFactory redisConnectionFactory() { RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration() .master(master); for (String node : sentinelNodes.split(",")) { String[] hostPort = node.split(":"); sentinelConfig.sentinel(hostPort[0], Integer.parseInt(hostPort[1])); } sentinelConfig.setPassword(RedisPassword.of(password)); return new LettuceConnectionFactory(sentinelConfig); } } Plain Text yaml env: - name: spring.redis.sentinel.master valueFrom: secretKeyRef: name: redis-user-secret key: username - name: spring.redis.password valueFrom: secretKeyRef: name: redis-user-secret key: password Token cache service: Java @Service public class TokenCacheServiceImpl { private final RedisTemplate<String, String> redisTemplate; public TokenCacheServiceImpl(RedisTemplate<String, String> redisTemplate) { this.redisTemplate = redisTemplate; } public void cacheToken(String username, String token, long 
duration, TimeUnit unit) { redisTemplate.opsForValue().set(username, token, duration, unit); } @Cacheable(value = "tokens", key = "#username") public String getToken(String username) { return redisTemplate.opsForValue().get(username); } } Authentication service: signup and signin: Java @Service @RequiredArgsConstructor public class AuthenticationServiceImpl implements AuthenticationService { private final UserRepository userRepository; private final PasswordEncoder passwordEncoder; private final JwtService jwtService; private final AuthenticationManager authenticationManager; private final TokenCacheServiceImpl tokenCacheService; @Override public JwtAuthenticationResponse signup(SignUpRequest request) { var user = User.builder() .firstName(request.getFirstName()) .lastName(request.getLastName()) .email(request.getEmail()) .accessKey(passwordEncoder.encode(request.getAccessKey())) // BCrypt .role(Role.USER) .build(); userRepository.save(user); var jwt = jwtService.generateToken(user); return JwtAuthenticationResponse.builder().token(jwt).build(); } @Override public JwtAuthenticationResponse signin(SigninRequest request) { // 1. Check Redis cache first String cachedToken = tokenCacheService.getToken(request.getEmail()); if (cachedToken != null) { return JwtAuthenticationResponse.builder().token(cachedToken).build(); } // 2. If not cached, authenticate (DB + BCrypt) authenticationManager.authenticate( new UsernamePasswordAuthenticationToken(request.getEmail(), request.getAccessKey()) ); var user = userRepository.findByEmail(request.getEmail()) .orElseThrow(() -> new IllegalArgumentException("Invalid credentials.")); // 3. 
Generate token and write to Redis (24 min TTL) var jwt = jwtService.generateToken(user); tokenCacheService.cacheToken(request.getEmail(), jwt, 24, TimeUnit.MINUTES); return JwtAuthenticationResponse.builder().token(jwt).build(); } } JWT authentication filter: Java @Component @RequiredArgsConstructor public class JwtAuthenticationFilter extends OncePerRequestFilter { private final JwtService jwtService; private final UserService userService; @Override protected void doFilterInternal( @NonNull HttpServletRequest request, @NonNull HttpServletResponse response, @NonNull FilterChain filterChain ) throws ServletException, IOException { final String authHeader = request.getHeader("Authorization"); // Pass through if no Authorization header or doesn't start with Bearer if (StringUtils.isEmpty(authHeader) || !StringUtils.startsWith(authHeader, "Bearer ")) { filterChain.doFilter(request, response); return; } final String jwt = authHeader.substring(7); final String userEmail = jwtService.extractUserName(jwt); // Process only if SecurityContext has no authentication yet if (StringUtils.isNotEmpty(userEmail) && SecurityContextHolder.getContext().getAuthentication() == null) { UserDetails userDetails = userService.userDetailsService() .loadUserByUsername(userEmail); if (jwtService.isTokenValid(jwt, userDetails)) { SecurityContext context = SecurityContextHolder.createEmptyContext(); UsernamePasswordAuthenticationToken authToken = new UsernamePasswordAuthenticationToken( userDetails, null, userDetails.getAuthorities() ); authToken.setDetails(new WebAuthenticationDetailsSource().buildDetails(request)); context.setAuthentication(authToken); SecurityContextHolder.setContext(context); } } filterChain.doFilter(request, response); } } Spring Security 6 configuration: Java @Configuration @EnableWebSecurity @RequiredArgsConstructor public class SecurityConfiguration { private final JwtAuthenticationFilter jwtAuthenticationFilter; private final UserService userService; @Bean public 
SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { http .csrf(AbstractHttpConfigurer::disable) // Stateless → no CSRF needed .authorizeHttpRequests(request -> request .requestMatchers("/auth/**").permitAll() // Auth endpoints open to all .anyRequest().authenticated() ) .sessionManagement(manager -> manager.sessionCreationPolicy(STATELESS) // No server-side session ) .authenticationProvider(authenticationProvider()) .addFilterBefore(jwtAuthenticationFilter, // JWT filter runs first UsernamePasswordAuthenticationFilter.class); return http.build(); } @Bean public PasswordEncoder passwordEncoder() { return new BCryptPasswordEncoder(); } @Bean public AuthenticationProvider authenticationProvider() { DaoAuthenticationProvider authProvider = new DaoAuthenticationProvider(); authProvider.setUserDetailsService(userService.userDetailsService()); authProvider.setPasswordEncoder(passwordEncoder()); return authProvider; } @Bean public AuthenticationManager authenticationManager(AuthenticationConfiguration config) throws Exception { return config.getAuthenticationManager(); } } Unit tests: Java @Test @DisplayName("Signin: if token is cached, should not query the DB") void testSignInWithCachedToken() { when(tokenCacheService.getToken(TEST_EMAIL)).thenReturn(TEST_TOKEN); JwtAuthenticationResponse response = authenticationService.signin( SigninRequest.builder().email(TEST_EMAIL).accessKey(TEST_PASSWORD).build() ); assertEquals(TEST_TOKEN, response.getToken()); verifyNoInteractions(authenticationManager); // No DB + BCrypt call should happen verifyNoInteractions(userRepository); } // Invalid token test — SecurityContext should remain empty @Test @DisplayName("With an invalid token, SecurityContext should remain empty") void testDoFilterInternalInvalidToken() throws Exception { when(request.getHeader("Authorization")).thenReturn("Bearer " + INVALID_TOKEN); when(jwtService.extractUserName(INVALID_TOKEN)).thenReturn(TEST_EMAIL); 
when(userService.userDetailsService()).thenReturn(userDetailsService); when(userDetailsService.loadUserByUsername(TEST_EMAIL)).thenReturn(userDetails); when(jwtService.isTokenValid(INVALID_TOKEN, userDetails)).thenReturn(false); jwtAuthenticationFilter.doFilterInternal(request, response, filterChain); verify(filterChain).doFilter(request, response); assertNull(SecurityContextHolder.getContext().getAuthentication()); }
Summary and Conclusion
The result is more than a secure login endpoint: it is an architecture that scales well, relieves database bottlenecks with caching, and meets high availability (HA) requirements. In particular, the modern configuration model offered by Spring Boot 3 makes the security layer much more flexible. If you are starting a large-scale microservice project, you can design token management from the outset in this "stateless" and "cached" manner.
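The cache-first sign-in flow described in this article can be sketched without Spring or Redis as follows. This is a minimal illustration under stated assumptions: an in-memory map stands in for Redis, `authenticateAndIssue` is a hypothetical stand-in for the DB query, BCrypt check, and JWT generation, and the clock is injected so expiry can be exercised. As in the listing above, a cache hit returns the token without checking the password; a production service would likely need to verify credentials on that path as well, since otherwise knowing an email address is enough to obtain a live token.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;
import java.util.function.Supplier;

// Hypothetical sketch: an in-memory stand-in for the Redis-backed token cache.
final class CacheFirstSignin {
    // Same lifetime as the article's 1000 * 60 * 24 milliseconds.
    static final Duration TTL = Duration.ofMinutes(24);

    private static final class Entry {
        final String token;
        final Instant expiresAt;

        Entry(String token, Instant expiresAt) {
            this.token = token;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final BiFunction<String, String, String> authenticateAndIssue;
    private final Supplier<Instant> clock;
    int backendCalls = 0; // counts trips to the "DB + BCrypt" path

    CacheFirstSignin(BiFunction<String, String, String> authenticateAndIssue,
                     Supplier<Instant> clock) {
        this.authenticateAndIssue = authenticateAndIssue;
        this.clock = clock;
    }

    String signin(String email, String password) {
        Instant now = clock.get();
        Entry hit = cache.get(email);
        if (hit != null && now.isBefore(hit.expiresAt)) {
            // Cache hit: no DB query. Caveat: the password is NOT checked here,
            // which mirrors the listing and is unsafe as-is.
            return hit.token;
        }
        // Cache miss or expired entry: authenticate, issue a token, store it with a TTL.
        backendCalls++;
        String token = authenticateAndIssue.apply(email, password);
        cache.put(email, new Entry(token, now.plus(TTL)));
        return token;
    }
}
```

Keeping the token lifetime and the cache TTL in one constant, as here, avoids the two values drifting apart when one of them is changed.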

By Erkin Karanlık
Docker Hardened Images Are Free Now — Here's What You Still Need to Build

The Problem Isn't the Image
Hardened container images are no longer niche. Docker open-sourced major portions of the tooling behind Docker Hardened Images under Apache 2.0 in late 2025. Chainguard and Google's distroless variants sit in the same space. The pitch across all three: fewer packages, smaller attack surface, dramatically lower CVE counts. The pitch is accurate. It is also incomplete. Most container security failures are not image failures. They are governance failures:
A team pushes a debug build to production. Admission control doesn't block it because the policy is in Audit mode, not Enforce.
A six-month-old deployment keeps running an ancient image digest while the team patches newer builds. Nobody detects the drift.
The platform team rotates signing keys. Old pipelines keep producing images signed with the revoked key. Admission still accepts them. Nobody notices for ninety days.
A vendor pushes an updated base image under the same tag. CI rebuilds against the new digest. The new digest is unsigned. Production takes it. No alert fires.
None of these are CVE failures. They are governance failures: gaps in how images are produced, attested, verified, and monitored. Swapping the base image to a hardened variant changes none of them. A signed-and-attested hardened image in a cluster that doesn't verify signatures is operationally equivalent to a signed Ubuntu image in that cluster: the signature is decorative.
I recently worked on migrating a regulated production workload onto a hardened-image baseline. Lab 12 of my docker-security-practical-guide repository is a sanitized, reproducible distillation of what that work taught me. The short version: the value is in the control plane around the image, not the image itself.
The Trust Control Plane in 60 Seconds
In practice, the hardest part is not enabling hardened images. It is operating trustworthy deployments at scale without slowing engineers down.
The operating model has three layers, joined by a feedback loop:
Supply Chain layer: images are signed (cosign keyless against Fulcio), attested with an SBOM (syft + CycloneDX), and scanned for vulnerabilities (grype). The output: an image whose origin and contents are independently verifiable by anyone.
Trust layer: an admission controller (Kyverno) verifies signatures and attestations before any pod is scheduled. The admission policy is the unit of governance: it encodes which signers, which attestations, and which constraints are required for a workload to start.
Enforcement layer: continuous drift detection answers the questions admission can't: has the digest drifted since we admitted it? Has the signing key been revoked? Has a new unsigned workload landed via a controller that bypasses admission?
Feedback loop: drift findings feed back into the supply chain: a drift event produces a rebuild; an admission rejection produces a ticket. Without the loop, the enforcement layer becomes an alerting backwater that engineers mute.
FIGURE 1: Trust control plane for cloud-native software supply chain security. The architecture separates supply chain generation, admission-time trust verification, and continuous runtime enforcement into independent layers connected through a feedback loop. The pattern is vendor-agnostic: any compatible signing, admission, and drift-detection components can fulfill these roles.
The bottom line: a hardened image is one input to the supply chain layer. Without trust verification, it's indistinguishable from a regular image at deploy time. Without enforcement, untrusted images coexist with hardened images in the same cluster. Without the feedback loop, trust state drifts silently.
Admission Control: Where Governance Gets Teeth
The trust layer is where the control plane becomes operationally real. In the lab, Kyverno's verifyImages rule asserts that every image carries a cosign signature from an approved identity.
Here's the core of the policy:
YAML
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce
  rules:
    - name: verify-cosign-keyless
      match:
        any:
          - resources:
              kinds: [Pod]
      verifyImages:
        - imageReferences: ["ghcr.io/opscart/*"]
          attestors:
            - entries:
                - keyless:
                    subject: "https://github.com/opscart/*"
                    issuer: "https://token.actions.githubusercontent.com"
          required: true
The subject and issuer together define who is trusted. For DHI images, these values point to Docker's signing identity. For Chainguard, Chainguard's. The shape of the policy is identical in all cases; only the identity matcher changes. When someone deploys an unsigned image, the rejection is immediate and actionable:
Shell
$ kubectl run test --image=nginx:latest --restart=Never
Error from server: admission webhook "validate.kyverno.svc-fail" denied the request:
resource Pod/default/test was blocked due to the following policies
require-trusted-registry:
  trusted-registries-only: 'validation error: Image must come from a trusted registry. Allowed: dhi.io/*.'
FIGURE 2: Kyverno admission webhook rejecting an nginx pod from an untrusted registry. Capture from terminal: kubectl run rejected-test --image=nginx:latest --restart=Never (with cluster up and policies applied).
Catching an unsigned image at admission costs one re-run of kubectl apply. Catching the same workload running in production a week later costs a security ticket, an incident response, and possibly a regulatory disclosure conversation. Moving rejection earlier is the highest-leverage decision in the entire model.
Phased Rollout: Audit Before Enforce
In production, you don't flip everything to Enforce on day one. The lab uses a phased approach: the trusted-registry policy runs in Enforce mode (hard gate on image origin), while signature and SBOM verification policies run in Audit mode (log violations, don't block).
This gives teams a migration runway: they can see which workloads would fail and fix them before the policies graduate to Enforce. The shift from Audit to Enforce is a single-field YAML change.
Signing Your Supply Chain: Keyless Cosign
The supply chain layer produces the artifacts that admission verifies. A common modern approach uses cosign with GitHub Actions OIDC for keyless signing: no private keys to manage, rotate, or leak. The mechanism: GitHub Actions mints a short-lived OIDC token at workflow time. Cosign generates an ephemeral key pair, exchanges the token for a short-lived certificate from Sigstore Fulcio, signs the image, and discards the private key immediately. The certificate records which workflow, on which repository, at which commit, produced the signature. The signature is logged in Sigstore Rekor's public transparency log. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The complete workflow and run history are public. The important property is that anyone can independently verify the signed artifact.
Shell
cosign verify \
  --certificate-identity-regexp \
  "^https://github\.com/opscart/docker-security-practical-guide/\.github/workflows/supply-chain-gate\.yml@.+$" \
  --certificate-oidc-issuer \
  "https://token.actions.githubusercontent.com" \
  ghcr.io/opscart/docker-security-practical-guide/dhi-sample-app:latest

Verification for ghcr.io/opscart/.../dhi-sample-app:latest --
The following checks were performed on each of these signatures:
  - The cosign claims were validated
  - Existence of the claims in the transparency log was verified offline
FIGURE 3: cosign verify succeeds for any reader, without shared secrets. Capture from terminal: run the cosign verify command above against the published image at ghcr.io.
This is what "supply chain security" means in practice: not "we sign our images," but "our trust assertions are independently verifiable by anyone, against neutral infrastructure, without prior trust setup." The published image can be verified directly against the public artifact. Fleet Drift: The Problem Nobody Watches Admission is point-in-time. Production is continuous. The enforcement layer's job is to answer the questions that admission can't: has the digest drifted since we admitted it? Has a new unsigned workload landed via a controller that bypasses admission? The lab's E1 experiment runs a drift audit against a synthetic 12-service fleet mixing DHI, Docker Hub, internally-built, and abandoned images. The fleet is intentionally constructed with an explicit variation matrix — the numbers below describe the synthetic fleet's structure, not measurements from a deployed environment. In this synthetic fleet, unsigned services averaged 13.0 critical CVEs while signed-and-verified services averaged 0.0. The exact ratio will vary by environment, but the audit makes the trust gap continuously visible. FIGURE 4a — Fleet drift audit: signing state vs CVE correlation across the synthetic fleet. Capture from terminal: run ./experiments/E1-drift-observation/analyze-drift.py. Screenshot Sections 1–3 (Fleet Summary + Origin×Signing Correlation + Signing State → CVE Accumulation) FIGURE 4b — Remediation order: compliance-scope risk concentration and prioritized action queue. Same script output, Sections 4 + 7 (Compliance Scope Risk Concentration + Recommended Remediation Order) The ratio isn't the point — your fleet will produce different numbers. What the control plane provides is the continuous, attributable surfacing of whatever the ratio actually is, including cases where the supposed benefit of hardening is harder to defend. That honest feedback loop is what turns the audit from a compliance checkbox into a supply chain prioritization tool. 
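The questions the enforcement layer asks (has the digest drifted since admission, did a workload arrive without passing admission, is the image unsigned) can be sketched as a small audit routine. This is a hypothetical illustration, not the lab's analyze-drift.py script: the class, field names, and finding strings are assumptions, and a real audit would read admitted digests and signature state from the cluster and registry rather than from in-memory collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a drift audit; names and finding texts are illustrative.
final class DriftAudit {
    static final class Workload {
        final String name;
        final String runningDigest;
        final boolean signatureVerified;

        Workload(String name, String runningDigest, boolean signatureVerified) {
            this.name = name;
            this.runningDigest = runningDigest;
            this.signatureVerified = signatureVerified;
        }
    }

    // admittedDigests maps workload name to the digest recorded at admission time.
    static List<String> audit(Map<String, String> admittedDigests, List<Workload> running) {
        List<String> findings = new ArrayList<>();
        for (Workload w : running) {
            String admitted = admittedDigests.get(w.name);
            if (admitted == null) {
                // Running, but no admission record: possibly created by a path that bypasses admission.
                findings.add(w.name + ": never admitted");
            } else if (!admitted.equals(w.runningDigest)) {
                findings.add(w.name + ": digest drift");
            }
            if (!w.signatureVerified) {
                findings.add(w.name + ": unsigned or unverifiable");
            }
        }
        return findings;
    }
}
```

Each finding would then feed the loop described earlier: drift triggers a rebuild, while an unadmitted or unsigned workload becomes a ticket.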
The Substitution Test A useful test for whether you've found an architectural pattern or a vendor recipe: can you swap a major component and have everything else continue to work? For this architecture, the test is straightforward. The lab demonstrates three configurations: Docker Hardened Images (dhi.io), Chainguard Images (cgr.dev/chainguard), and a self-built Alpine base signed against a project-owned GitHub Actions OIDC identity. In all three, the Kyverno policy structure is identical. The drift audit runs unchanged. The SBOM verification runs unchanged. Edits are confined to the identity matcher and the image references. The implication: "Should we standardize on DHI or Chainguard?" is a commercial decision (pricing, catalog coverage, support), not an architectural one. The architectural decision is whether to operate the trust control plane at all. A team that has invested in the control plane has built portable institutional capability. A team that has invested in "we use DHI" has bought a product, and a future migration off DHI is a structural rewrite rather than a configuration update. Production Friction: What Actually Goes Wrong The model works. It is also not free. Here are the operational costs my team hit, documented in detail in the companion repo's TROUBLESHOOTING.md: No shell. Distroless hardened images don't include /bin/sh, curl, wget, cat, or ls. When an engineer pages at 2 AM and runs kubectl exec -it pod -- /bin/sh, the command fails. The remediation is kubectl debug with an ephemeral debug container attached to the pod's process namespace. Train your on-call rotation on kubectl debug before migration, not after. The lab's E5 experiment documents three debug patterns (ephemeral containers, dev-variant images in dev namespaces only, pre-built debug sidecars) with runbook scenarios for unreachable services, crashloops, and OOM kills. Migration is not a FROM line change. The default user is nonroot (UID 65532), not root. Library paths differ. 
pip install --user installs to /home/nonroot/.local, not /root/.local. Required system packages (ca-certificates, timezone data) that come for free in stock bases must be explicitly carried over. The lab's Dockerfile required three iterations before the build succeeded locally: shell-form RUN failed (no /bin/sh), then pip --user installed to the wrong path, then requirements.txt pinned package versions that didn't exist on PyPI. Each of these is a 30-second local fix — and a 5-minute GitHub Actions round-trip if you don't test locally first. Signature paths vary by vendor. DHI signatures resolve via registry.scout.docker.com, not at the image's own registry path. Kyverno handles this through the policy's repository field, but any custom verification tooling needs to know. Plan to audit verification code before migration. Kyverno has schema gotchas. rekor and ctlog blocks must be inside keys, not siblings. webhookTimeoutSeconds is capped at 30. mutateDigest: true is incompatible with validationFailureAction: Audit. PolicyException requires an explicit feature flag. Each of these cost me 30–60 minutes of debugging — they're in TROUBLESHOOTING.md, so they don't cost you the same. None of these are deal-breakers individually. All of them together are why migrations slip from "next quarter" to "abandoned after two months." Budget for friction. When This Is Overkill The investment's value scales with three factors: regulatory pressure (HIPAA, PCI-DSS, SOC 2 Type II, FDA 21 CFR Part 11), fleet size and heterogeneity (8+ clusters, dozens of teams pushing images), and blast radius (pharmaceutical patient data vs. internal dashboard). Concretely: pre-production tools, side projects, prototypes, and developer sandboxes do not need this. They benefit from a hardened base image (free) and should not be put behind the full trust control plane. The overhead of policy maintenance, key rotation, and drift remediation outstrips the risk reduction. 
For most workloads outside regulated production, the supply chain layer alone — sign and SBOM your builds — captures most of the available value at a fraction of the cost. Conclusion: Architecture Over Image Choice Hardened images are useful. The point of this article is that they are one component of a broader architectural pattern, and the security outcomes regulated teams want are properties of the pattern, not the component. A team that adopts hardened images without the surrounding pattern has made a real but limited improvement. A team that adopts the pattern with any reasonable image vendor — DHI, Chainguard, or a self-built base — has built portable institutional capability. The substitution test is the diagnostic: ask whether a future migration away from your current image vendor is a configuration edit or a structural rewrite. If it's the former, you have the pattern. If it's the latter, you have a product dependency. The companion repository at github.com/opscart/docker-security-practical-guide (tag v1.12.0) contains everything in this article: working Kyverno policies, a keyless-signed sample image you can pull and verify right now, fleet drift audits, and five hypothesis-driven experiments. The cosign verify command above works against the published artifact today. Spend the design effort on the pattern. The image will be replaceable. The governance is what survives vendor replacement. This article is adapted from a longer write-up on OpsCart, which includes the complete threat model, substitution-test configurations, and an extended troubleshooting log.

By Shamsher Khan, DZone Core
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables

Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance, and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering, introduced with Delta Lake 3.0, goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management.

Recap: Partitioning and Z-Order Limitations

Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables:

Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult, and the choice is usually fixed. A static Hive-style partition strategy demands careful upfront planning to avoid data skew and concurrency conflicts, and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.

Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.

Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain: it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. Z-Ordering jobs can be lengthy and costly, and they must be re-run as new data arrives to maintain clustering.

Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning.
Data engineers must monitor query patterns and manually decide how to partition or when to re-run Z-Ordering. This ongoing maintenance is time-consuming and error-prone.

In summary, traditional partitioning and Z-Ordering yield performance benefits, but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution.

What Is Liquid Clustering?

Liquid Clustering is a data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name "liquid" signifies flexibility: data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include:

Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.

Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys, or file size tuning; the platform handles optimal file sizing and clustering internally. Even high-cardinality columns, which would be impractical as partition keys, can be used effectively.

Flexibility to Change Keys (No Rewrites): Perhaps the most significant aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys.
There’s no massive upfront cost of re-partitioning the entire dataset, because past data doesn’t need an immediate rewrite.

Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.

Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks.

Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by offering "set it and forget it" clustering. In fact, thousands of customers have already adopted it: as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables.

Liquid Clustering vs Traditional Methods

Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways:

No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.

Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data.
Maintenance operations like OPTIMIZE still play a role, but they can operate more efficiently since the incoming data is already clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE with ZORDER: it only rearranges data that isn’t well clustered yet rather than always rewriting everything.

Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.

Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.

Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less I/O and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset was 2.5× faster to optimize (cluster) with Liquid Clustering than with Z-Ordering, and that it yielded significantly better query performance than either partitioning or Z-Order.
In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly.

How Liquid Clustering Works (Under the Hood)

At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms:

Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.

Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files; it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.

Data Skipping with Statistics: Delta Lake maintains per-file statistics, such as minimum and maximum column values, that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters.

Enabling Automatic Clustering

To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations).
Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward:

SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering:

SQL

    -- Creating a Unity Catalog managed table with Automatic Liquid Clustering
    CREATE TABLE main.analytics.user_events (
      user_id STRING,
      event_type STRING,
      event_date DATE,
      details STRING
    ) CLUSTER BY AUTO; -- enables automatic liquid clustering on this table

To enable it on an existing table instead:

SQL

    ALTER TABLE main.analytics.user_events CLUSTER BY AUTO;

This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.)

PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark:

Python

    # df is a DataFrame we want to save as a Delta table with auto clustering
    df.write.format("delta") \
        .option("clusterByAuto", "true") \
        .mode("overwrite") \
        .saveAsTable("main.analytics.user_events_auto")

The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required; the system will choose keys on its own if you leave it open.)

Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on serverless compute in the background.
The benefit is that you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times.

Using Manual Liquid Clustering (Custom Clustering Keys)

In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it:

Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example:

SQL

    -- Create a Delta table clustered by specific columns (manual clustering)
    CREATE OR REPLACE TABLE main.analytics.sales_data (
      sale_id BIGINT,
      region STRING,
      product STRING,
      sale_date DATE,
      amount DECIMAL(10,2)
    ) CLUSTER BY (region, sale_date);

In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date.

Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance:

SQL

    ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date);

This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize.

Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example:

SQL

    OPTIMIZE main.analytics.sales_data;

The above will compact small files and cluster data incrementally.
If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data FULL. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time.

PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example:

Python

    # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys
    df.write.format("delta") \
        .mode("append") \
        .clusterBy("region", "sale_date") \
        .saveAsTable("main.analytics.sales_data")

Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys.

Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it upgrades your table’s Delta protocol version, and older clients may then be unable to read the table. In a Databricks environment this is usually not an issue, but be cautious if you have external readers or writers that might be using older Delta Lake libraries.

Conclusion

Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidity of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For data engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further: clustering becomes a self-driving process, leveraging query insights to continuously improve performance.
In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider using Liquid Clustering from the start; it can simplify your architecture and help your tables stay optimized as your data (and its use cases) grow.
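The data-skipping argument made in this article can be illustrated with a small, self-contained simulation. The sketch below is plain Python, not Databricks or Delta Lake code: the `FileStats` type and the `write_files` and `files_to_scan` helpers are hypothetical names introduced only for this example. It compares how many files a point filter has to read when rows are written in arrival order versus sorted (clustered) on the filter column, using per-file min/max statistics as a stand-in for the statistics Delta Lake keeps.

```python
from dataclasses import dataclass
import random


@dataclass
class FileStats:
    """Per-file min/max for one column (a stand-in for Delta file statistics)."""
    min_value: int
    max_value: int


def write_files(values, rows_per_file):
    """Split values into fixed-size 'files' and record min/max for each one."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(FileStats(min(chunk), max(chunk)))
    return files


def files_to_scan(files, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for f in files if f.min_value <= target <= f.max_value)


random.seed(42)  # fixed seed so repeated runs behave the same way
user_ids = [random.randrange(10_000) for _ in range(100_000)]

# Same rows, two layouts: arrival order vs. sorted on the filter column
unclustered = write_files(user_ids, rows_per_file=1_000)
clustered = write_files(sorted(user_ids), rows_per_file=1_000)

target = 4_242  # hypothetical filter: WHERE user_id = 4242
print("files total:        ", len(unclustered))
print("scanned unclustered:", files_to_scan(unclustered, target))
print("scanned clustered:  ", files_to_scan(clustered, target))
```

In the arrival-order layout nearly every file's range spans the whole key space, so almost nothing can be skipped. In the sorted layout the ranges barely overlap, so only the one or two files around the target value need to be read. Real Liquid Clustering uses a more sophisticated multi-column layout than a simple sort; the sketch only shows why tighter min/max ranges help.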

By Seshendranath Balla Venkata
Designing API-First EMR Architectures in .NET: Enabling Modular Growth in Compliance-Driven Systems

EMR platforms are unique software beasts. They must live longer than most online apps due to regulatory constraints. A startup may reinvent its primary product every three years, but an EMR system must retain data integrity and workflow consistency for decades. This lifespan is difficult. How do you change a healthcare system without violating strict compliance rules? API-first thinking is the answer. This method goes beyond data endpoint exposure. The issue is architectural survival. In a business where "move fast and break things" is unacceptable, architects may offer modular development, safer changes, and long-term stability by prioritizing the API. The Unique Constraints of EMR Architecture EMRs are not typical CRUD applications. In a standard business app, updating a record might just mean overwriting a row in a database. In healthcare, that simple update triggers a cascade of regulatory realities. Every change requires an audit trail. Data retention policies dictate that information cannot simply vanish. Clinical decisions are based on the history of that data, meaning immutability is often more important than mutability. Furthermore, healthcare workflows are long-lived. A patient's treatment plan might span months or years. An architecture built around short-lived features will crumble under the weight of these persistent workflows. You cannot refactor a database schema overnight if it breaks the continuity of a patient's care record. This is why stability is the paramount quality attribute of any EMR. What “API-First” Really Means in Regulated Systems In the context of regulated systems, API-first means designing contracts before writing a single line of implementation code. It requires treating your APIs as long-term public interfaces, even if the only consumer initially is your own frontend team. They are not internal shortcuts; they are binding agreements. This approach forces you to separate clinical workflows from user interface concerns. 
A button click on a screen is transient; the clinical action it represents is permanent. By defining the API first, you establish a boundary that encapsulates compliance logic. The API becomes the gatekeeper. It enforces regulatory compliance regardless of whether data is accessed through a mobile app, a web portal, or a third-party integration.

Contract Stability as a Core Architectural Principle

Breaking an API contract in an EMR is far costlier than breaking a UI component. If a button breaks, a user complains. If an API contract breaks, integrations fail, data synchronization stops, and patient care can be impacted. Therefore, request and response models must be designed to survive years of change.

Architects must avoid overfitting contracts to current UI needs. Just because a specific screen needs a patient's name and their last three blood pressure readings doesn't mean you should create an endpoint specifically for that view. Instead, design resources that represent the domain accurately. This decoupling protects the backend from the volatility of frontend trends.

Backward Compatibility Without Freezing Innovation

The fear of breaking existing clients often paralyzes development teams. However, API-first design provides a path to evolve without stagnation. The key is distinguishing between additive changes and destructive changes. Adding a new field to a response is generally safe; removing or renaming one is not.

In .NET Web APIs, versioning strategies are critical. You can support legacy consumers while enabling new features for modern clients. This transforms deprecation from a sudden emergency into a managed process. You provide a sunset period for old versions, giving consumers time to migrate without disruption.

In regulated systems, versioning is not a technical afterthought. Explicit versioned routes allow EMR platforms to evolve safely, giving downstream systems time to migrate without disrupting clinical workflows.
```csharp
[ApiController]
[Route("api/v1/encounters")]
public class EncountersV1Controller : ControllerBase
{
    // IEncounterService is an assumed application service, supplied by dependency injection
    private readonly IEncounterService _encounterService;

    public EncountersV1Controller(IEncounterService encounterService)
    {
        _encounterService = encounterService;
    }

    [HttpPost("{id}/sign")]
    public IActionResult SignEncounter(Guid id)
    {
        // Business rule: encounter must be complete before signing
        _encounterService.Sign(id);
        return Ok();
    }
}
```

Modeling Regulated Workflows Through APIs

Your API should encode business rules and compliance constraints directly. It is dangerous to rely on the UI to validate clinical workflows. If a doctor must sign a note before billing can occur, that rule belongs in the API layer, not in the JavaScript of the frontend.

Consistency: Business rules enforced at the API level apply to every consumer, preventing "workflow drift" between the web portal and mobile apps.

Security: Bypassing the UI via a direct API call (e.g., using Postman) should not allow a user to bypass compliance checks.

Clarity: The API endpoints should reflect real-world clinical states (e.g., POST /encounters/{id}/sign) rather than generic database operations.

API-First and Modular EMR Growth

Monolithic EMRs eventually become unmaintainable. API-first design makes it possible to decouple large domains like scheduling, assessments, reporting, and case management. Well-defined interfaces allow you to upgrade the scheduling engine without affecting the billing module.

This modularity supports parallel development. Different teams can work on different modules simultaneously without constant merge conflicts or integration friction. It also lays the foundation for extensibility. If a client needs a custom integration for a specific device, your public-facing API is already robust enough to handle it because it’s the same API you use internally.

.NET-Specific Considerations for API-First EMRs

ASP.NET Core is an excellent framework for building long-lived API platforms. Its middleware pipeline allows you to handle cross-cutting concerns like logging and validation globally. However, structuring your solution requires discipline.
Controllers should be thin, delegating logic to service layers that handle the heavy lifting.

Using Data Transfer Objects (DTOs) is non-negotiable. Never give API consumers access to internal domain entities or Entity Framework models. DTOs act as a buffer that lets you refactor the database schema without breaking the public contract. Your architecture should treat validation, authorization, and detailed auditing as first-class concerns, not afterthoughts.

DTO boundaries are a compliance safeguard. They allow internal schema evolution while preserving external contracts, which is critical for EMR platforms that must retain compatibility over decades.

```csharp
// Entity (internal, mutable, persistence-focused)
public class EncounterEntity
{
    public Guid Id { get; set; }
    // Nullable: there is no signing time until the encounter is signed
    public DateTime? SignedAt { get; set; }
    public string InternalNotes { get; set; }
}

// DTO (public, stable, contract-focused)
public class EncounterDto
{
    public Guid Id { get; set; }
    public bool IsSigned { get; set; }
}
```

Security, Authorization, and Role-Based Access

Authorization in healthcare is complex. It is rarely a simple binary of "admin" vs. "user." You have doctors, nurses, auditors, billing specialists, and patients, all with overlapping permissions. This complexity cannot be delegated to the UI.

Scope: Design APIs around granular scopes and responsibilities, ensuring a nurse can view a chart but only a doctor can sign an order.

Context: Authorization logic must understand the context. For example, a doctor may be allowed to see only the patients in their own ward.

Enforcement: Use ASP.NET Core authorization policies to enforce these restrictions at the controller or action level so that every request is checked.

Lessons Learned From Long-Term EMR Ownership

Looking back at years of EMR development, the cost of early shortcuts is evident. Every time we bypassed the API to hack a feature directly into the database or coupled the UI too tightly to the backend, we paid for it with interest later. The API-first approach drastically reduced risk during major platform changes.
When we needed to rewrite our entire frontend framework, the backend remained stable. We didn't have to reinvent our compliance logic because it was safely encapsulated behind our API contracts.

If I were starting over, I would make contract design reviews stricter. Taking time to design the interface right is more important than coding speed.

Final Thoughts: Building EMRs That Outlast Trends

Technology trends fade. JavaScript frameworks rise and fall. But medical records must persist. An EMR system must survive multiple generations of UI rewrites and shifting regulatory landscapes.

API-first design is the strategy for this longevity. It separates your system's volatile portions from its compliance-heavy core. Architects in this field must deliver features while maintaining system integrity over time. By investing in solid, well-designed APIs today, you help ensure your platform's longevity.
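The distinction this article draws between additive and destructive contract changes can be checked mechanically. The sketch below is a language-neutral illustration written in Python rather than .NET. The schema dictionaries and the `breaking_changes` helper are hypothetical; they stand in for whatever contract description (for example, an OpenAPI document) a team actually maintains.

```python
def breaking_changes(old_schema, new_schema):
    """Return the changes that could break existing consumers.

    Schemas are plain dicts mapping response field names to type names.
    Added fields are treated as safe; removed or retyped fields are not.
    """
    problems = []
    for field, old_type in old_schema.items():
        if field not in new_schema:
            problems.append(f"removed field: {field}")
        elif new_schema[field] != old_type:
            problems.append(
                f"type changed: {field} ({old_type} -> {new_schema[field]})"
            )
    return problems


# Hypothetical versions of an encounter response contract
v1 = {"id": "guid", "isSigned": "bool"}
v1_additive = {"id": "guid", "isSigned": "bool", "signedBy": "string"}
v1_renamed = {"id": "guid", "signed": "bool"}

print(breaking_changes(v1, v1_additive))  # additive change: nothing reported
print(breaking_changes(v1, v1_renamed))   # rename shows up as a removed field
```

A rename is reported as a removal because, from an existing consumer's point of view, the field it depends on has disappeared. A check like this could run in a build pipeline so that destructive changes force a new API version instead of slipping into an existing one.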

By Ronak Pavasiya
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes

TL;DR

A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF: no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine.

The Problem We Kept Hitting

We’ve been building Ingero, an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well.

But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening.

We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works?

What We Shipped in v0.9.1

Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports.

1. Node Identity

Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback:

Shell

    sudo ingero trace --node gpu-node-01

Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE), so no extra configuration is needed.

2. Fleet Fan-Out Queries

Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided).
The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS (--ca-cert, --client-cert, --client-key) so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only.

The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically:

YAML

    fleet:
      nodes:
        - gpu-node-01:8080
        - gpu-node-02:8080
        - gpu-node-03:8080
        - gpu-node-04:8080

A full example config is in configs/ingero.yaml.

Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving:

Shell

    $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \
        "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source"

    node              source  cnt    avg_us
    ----------------  ------  -----  ------
    gpu-node-01       4       11009  5.2
    gpu-node-01       3       847    18400   # ← 9x higher than peers
    gpu-node-02       4       10892  5.1
    gpu-node-02       3       412    2100
    gpu-node-03       4       10847  5.3
    gpu-node-03       3       398    1900
    gpu-node-04       4       10901  5.0
    gpu-node-04       3       421    2200

    8 rows from 4 node(s)

Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms.
One more command to see the causal chains:

Shell

    $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080

    FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s)

    [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O
      Root cause: 847 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs

    [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O
      Root cause: 855 sched_switch events + heavy block I/O
      Fix: Pin training process to dedicated cores with taskset

Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O, with checkpoint writes preempting the training process.

Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.”

3. Offline Merge and Perfetto Export

Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints: there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file:

Shell

    # 1. Collect traces from each node
    scp gpu-node-01:~/.ingero/ingero.db node-01.db
    scp gpu-node-02:~/.ingero/ingero.db node-02.db

    # 2. Merge and analyze
    ingero merge node-01.db node-02.db -o cluster.db
    ingero explain -d cluster.db

Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node.

For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline.
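As a rough illustration of why node-namespaced IDs make the offline merge straightforward, here is a minimal Python sketch using the standard-library sqlite3 module. It is not Ingero's implementation or its real schema: the `events` table layout, the column names, and the `make_node_db` and `merge_events` helpers are all assumptions made for this example.

```python
import sqlite3

# Hypothetical schema; durations are stored in nanoseconds
SCHEMA = (
    "CREATE TABLE events ("
    "id TEXT PRIMARY KEY, node TEXT, source INTEGER, duration INTEGER)"
)


def make_node_db(node, durations):
    """Build an in-memory per-node database with IDs namespaced as 'node:seq'."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    rows = [(f"{node}:{i}", node, 3, d) for i, d in enumerate(durations, start=1)]
    db.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    db.commit()
    return db


def merge_events(node_dbs):
    """Copy every node's events into one database.

    Both nodes use sequence numbers 1, 2, ... but the node prefix keeps the
    primary keys distinct, so the inserts never collide.
    """
    merged = sqlite3.connect(":memory:")
    merged.execute(SCHEMA)
    for db in node_dbs:
        rows = db.execute("SELECT id, node, source, duration FROM events").fetchall()
        merged.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    merged.commit()
    return merged


node1 = make_node_db("gpu-node-01", [18_000_000, 19_000_000])
node2 = make_node_db("gpu-node-02", [2_000_000, 2_200_000])
cluster = merge_events([node1, node2])

query = (
    "SELECT node, count(*), avg(duration) / 1000 "
    "FROM events GROUP BY node ORDER BY node"
)
for node, cnt, avg_us in cluster.execute(query):
    print(node, cnt, avg_us)
```

Once merged, the same per-node aggregation used in the live fan-out query runs against a single file. Without the node prefix, the two databases would both contain IDs 1 and 2, and the merge would need to renumber events or fail on the primary key.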
Why We Built It This Way The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path. We deliberately avoided that. No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet. Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity. Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself. Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query. Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. 
The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. 
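The clock-skew measurement described under "Why We Built It This Way" (NTP-style offset estimation, 3 samples per node, median filter) can be sketched as follows. This is a minimal illustration under stated assumptions, not the fleet client's Go code: `read_remote_clock` stands in for whatever request returns a node's current time, and the symmetric-delay assumption is the usual NTP simplification.

```python
import time
from statistics import median

def estimate_offset(read_remote_clock, samples=3, local_clock=time.monotonic):
    """Estimate (remote clock - local clock) in seconds.

    Each sample brackets one remote read with two local reads and
    assumes the request and the response take equally long, so the
    remote timestamp corresponds to the midpoint of the round trip.
    """
    offsets = []
    for _ in range(samples):
        t0 = local_clock()
        remote = read_remote_clock()
        t1 = local_clock()
        offsets.append(remote - (t0 + t1) / 2.0)
    # The median discards a single sample distorted by a slow round trip.
    return median(offsets)

if __name__ == "__main__":
    # Simulated node whose clock runs 0.25 s ahead of ours.
    fake_remote = lambda: time.monotonic() + 0.25
    print(f"estimated offset: {estimate_offset(fake_remote):.3f} s")
```

The error of each sample is bounded by half the round-trip time, which is why a sub-millisecond LAN gives a tight estimate.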
The investigations/ directory has ready-to-query databases for trying this without a GPU cluster:

  • sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db: individual node traces from a 3-node cluster
  • sample-cluster.db: all three merged into one (600 events, 6 chains, 9 stacks)

GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design.

If you are facing distributed training issues in your own workloads, we'd love to take a look. Drop an issue on GitHub, and we will gladly dive into it together.

Ingero is free and open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead.

Related Reading

  • GPU incident response in 60 seconds with eBPF: the single-node investigation workflow that the fleet feature extends
  • 11-second time to first token on a healthy vLLM server: kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this post
  • GPU showing 97% utilization while training runs 3x slower: why nvidia-smi metrics alone miss the real story

By Ingero Team
Building Production-Grade GenAI on GCP with Vertex AI Agent Builder

Proofs of concept for generative AI are not hard to build, but the gap between experimentation and production raises a different set of concerns: repeatability, predictable workflows, safety, monitoring, and scalability. Model quality is often not the bottleneck; many teams struggle to integrate GenAI into real systems with enterprise-grade guarantees. Google Cloud's Vertex AI Agent Builder addresses this gap with managed infrastructure for deploying intelligent agents built on Gemini models, retrieval-augmented generation (RAG), and tool orchestration. Instead of manually wiring together a collection of services, teams get a unified runtime in which application development, data grounding, deployment, and monitoring are handled in one place.

Architecture Foundations for Production GenAI

A production-grade GenAI system on GCP usually follows a layered architecture. Client applications communicate with Cloud Run or API Gateway, which forward requests to agents hosted by Vertex AI Agent Builder. These agents assemble prompts, retrieve contextual information from indexed enterprise datastores such as BigQuery or Cloud Storage, reason using Gemini models, and call external or internal tools (including Cloud Functions and internal APIs) when necessary. This separation of concerns lets frontend services, agent logic, and knowledge systems scale independently, and it keeps business workflows out of prompt templates.

The fundamental building block of this architecture is retrieval-augmented generation. Without RAG, the model relies only on its pretrained knowledge, so it tends to hallucinate or give generic answers. Agent Builder supports native indexing over both structured and unstructured data, so applications can ground their outputs in actual organizational content.
Documents are chunked, embedded, and tagged with metadata so that retrieval can be filtered by access level, department, or domain. In practice this forms a pipeline: a user query triggers retrieval, the relevant context is assembled dynamically, and Gemini produces a response grounded in authoritative data. This approach is more accurate and also more flexible, because enterprise knowledge keeps changing.

Figure: Production GenAI Architecture Using Vertex AI Agent Builder on GCP

Orchestration, Security, and Operational Readiness

Modern GenAI applications rarely stop at text generation. Production agents need to work with databases, ticketing systems, and business services. Vertex AI Agent Builder supports tool calling, so models can invoke external actions such as checking order status, creating support tickets, or running workflows. Teams do not have to embed this logic in prompts; they can define structured flows with Agent Builder, Cloud Workflows, or event-driven Cloud Functions. This makes orchestration testable and auditable while letting the model focus on reasoning and language generation.

Security is equally important. Vertex AI integrates directly with GCP IAM, which allows access to be scoped by role per agent and per dataset and supports service-to-service authentication. Sensitive fields can be masked during retrieval, audit logs record agent interactions, and VPC Service Controls enforce a boundary around data. These capabilities are required in regulated environments where GenAI must comply with existing governance frameworks. Treating agents like any other production service, subject to identity management, network controls, and logging, means GenAI is no longer an architectural exception.

Observability, Deployment, and Continuous Improvement

A major operational risk of deploying GenAI is a lack of observability.
Vertex AI provides request logging, latency metrics, and token-usage tracking, although production teams often go further and export interaction data to BigQuery for offline analysis. Collecting user feedback, evaluating response quality, and versioning prompts and agents allow continuous improvement without destabilizing production systems. Another common practice is to A/B test prompt or agent changes in staging before promoting them to production, as in a traditional software release process.

For deployment, teams tend to expose agents through secured endpoints on Cloud Run, manage infrastructure with Terraform, and build CI/CD pipelines to roll out changes to agent configuration. This keeps deployments reproducible and reduces manual effort. Like traditional microservice ecosystems, successful GenAI platforms are monitored, versioned, and continuously optimized over the long term. Vertex AI Agent Builder speeds this up by bringing models, retrieval, orchestration, and governance together on a single platform, so engineering teams can build reliable products instead of gluing infrastructure together.

Ultimately, production GenAI is less about access to powerful models than about building robust systems to run them. Vertex AI Agent Builder lets organizations deploy agents that are grounded in enterprise data, protected by cloud-native controls, and improved through measurable feedback loops, which is what turns them into dependable applications.

Conclusion

Moving GenAI from prototype to production takes much more than model integration: it requires reliable retrieval, deterministic orchestration, hard security boundaries, and continuous observability.
Google Cloud's Vertex AI Agent Builder brings these capabilities together on one platform, so teams can build agents that are grounded in enterprise data, connected to real business processes, and governed by cloud-native controls. Combining Gemini models with retrieval-augmented generation, tool calling, and GCP's operational ecosystem lets organizations build scalable GenAI systems that behave like any other production service.

As enterprises rely more heavily on AI-driven applications, success will depend on treating GenAI as infrastructure rather than as an experiment. Vertex AI Agent Builder can speed up this shift by reducing architectural complexity and letting engineering teams focus on delivering measurable business value through reliable, production-ready intelligent systems.
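The retrieval flow described under "Architecture Foundations" (split documents into chunks, attach metadata, filter by access level or department, then assemble context for the model) can be illustrated with a small, self-contained sketch. It does not use the Vertex AI SDK: the `Chunk` type, the keyword-overlap scoring, and the `build_prompt` helper are hypothetical stand-ins for the managed indexing and Gemini calls that Agent Builder provides.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    department: str
    access_level: int  # higher = more restricted

def retrieve(chunks, query, department, user_level, top_k=2):
    """Filter chunks by metadata, then rank by naive keyword overlap."""
    terms = set(query.lower().split())
    allowed = [c for c in chunks
               if c.department == department and c.access_level <= user_level]
    scored = [(len(terms & set(c.text.lower().split())), c) for c in allowed]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

def build_prompt(query, context_chunks):
    """Assemble the grounded prompt that would be sent to the model."""
    context = "\n".join(f"- {c.text}" for c in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    corpus = [
        Chunk("refund policy allows returns within 30 days", "support", 1),
        Chunk("refund escalation requires director approval", "support", 3),
        Chunk("quarterly revenue forecast", "finance", 2),
    ]
    hits = retrieve(corpus, "what is the refund policy", "support", user_level=1)
    print(build_prompt("what is the refund policy", hits))
```

The point of the sketch is the ordering: metadata filtering happens before ranking, so a restricted chunk never reaches the prompt regardless of how well it matches the query.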

By Sairamakrishna BuchiReddy Karri
How Retry Storms Crash API-Led Systems: Bounded Reliability Patterns for Distributed Architectures

Modern API-led architectures are built for resilience. We add:

  • Retries for transient failures
  • Replication for durability
  • Autoscaling for elasticity
  • Circuit breakers for isolation

Each mechanism improves availability. Under stress, their interaction can bring the system down. Most enterprise outages aren't caused by missing fault tolerance. They're caused by unbounded fault-tolerance mechanisms reacting simultaneously. Let's break down how this happens, and how to design bounded reliability instead.

1. Retry Storms: When Resilience Multiplies Traffic

Retries are meant to protect against temporary failures. But retries multiply load. This is a simplified version of what we often see in service-to-service retry logic:

Python
    import time
    import random

    def downstream_service():
        latency = random.choice([0.1, 0.2, 0.8])
        time.sleep(latency)
        if latency > 0.7:
            raise TimeoutError("Slow response")
        return "OK"

    def call_with_retries(max_attempts=3):
        for attempt in range(max_attempts):
            try:
                return downstream_service()
            except TimeoutError:
                print(f"Retry {attempt+1}")
        raise Exception("Failed after retries")

Under normal conditions: works fine.

Under load:

  1. Latency increases.
  2. Timeouts trigger.
  3. Each request retries 3 times.
  4. Traffic triples.
  5. Backend slows further.
  6. More retries fire.

That's a retry storm.

Now imagine this inside an API-led architecture:

    Gateway → Experience API → Process API → System APIs → ERP/DB

If each layer retries independently, load amplification becomes multiplicative. In one system I worked on, we saw a single downstream slowdown take out three upstream APIs within minutes because each layer had its own retry logic.
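To make "multiplicative" concrete, here is a small sketch (an illustration, not code from any real gateway) in which the backend always times out and every layer above it retries independently. With 3 attempts per layer, one client request becomes 3, 9, 27, and then 81 backend calls as the number of retrying layers grows from one to four: attempts raised to the power of layers in the worst case. The function name `count_backend_calls` is made up for this example.

```python
def count_backend_calls(layers, attempts_per_layer):
    """Count calls that reach the backend when it always times out and
    each of `layers` callers makes up to `attempts_per_layer` attempts."""
    calls = 0

    def call(depth):
        nonlocal calls
        if depth == layers:
            calls += 1          # the request reached the failing backend
            raise TimeoutError
        for _ in range(attempts_per_layer):
            try:
                return call(depth + 1)
            except TimeoutError:
                continue        # retry immediately, like the naive loop above
        raise TimeoutError      # this layer gives up; its caller retries

    try:
        call(0)
    except TimeoutError:
        pass
    return calls

if __name__ == "__main__":
    for layers in (1, 2, 3, 4):
        print(layers, "retrying layer(s) ->",
              count_backend_calls(layers, 3), "backend calls")
```

In the chain above, four callers sit in front of the ERP/DB, so a single client request can reach the slow backend up to 81 times if every layer uses three attempts.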
Bounded Retry Pattern (Production-Safe)

Retries must be:

  • Limited
  • Backed off exponentially
  • Jittered
  • Disabled under system stress

Safer version:

Python
    def call_with_bounded_retries(max_attempts=2, system_load=0.5):
        if system_load > 0.75:
            return None  # fail fast when under stress
        for attempt in range(max_attempts):
            try:
                return downstream_service()
            except TimeoutError:
                backoff = 0.2 * (2 ** attempt)
                time.sleep(backoff + random.uniform(0, 0.1))
        return None

Key differences:

  • Retry ceiling reduced
  • Exponential backoff
  • Jitter prevents synchronized waves
  • Load-aware short-circuit

Retries should dampen instability, not amplify it.

2. Replication Fan-Out and Coordination Collapse

Replication improves durability. But synchronous replication increases coordination cost. Example:

Python
    import time

    def simulate_write():
        time.sleep(0.2)

    def write_to_replicas(data, replicas=3):
        for _ in range(replicas):
            simulate_write()

Under surge traffic:

  1. Write volume increases.
  2. Each write fans out to 3 replicas.
  3. Replica lag grows.
  4. Clients retry writes.
  5. Effective write load doubles.

Durability turned into a bottleneck. In enterprise integration systems (order processing, billing, reconciliation), this pattern causes throughput collapse, not because data was lost, but because coordination overwhelmed the system.

Tiered Durability Strategy

Not all writes need identical guarantees.

Python
    def write(data, critical=True):
        if critical:
            write_to_replicas(data, replicas=3)
        else:
            write_to_replicas(data, replicas=1)

Separate:

  • Critical transactions → strong durability
  • Non-critical logs/events → reduced coordination

Reliability must be scoped, not maximized blindly.

3. Autoscaling Feedback Loops

Autoscaling reacts to traffic metrics. But traffic metrics may be artificial.
If retries inflate request counts:

Python
    def autoscale(request_rate):
        if request_rate > 100:
            print("Scaling up")

Scaling triggers:

  1. New instances initialize.
  2. Initialization hits shared DB/cache.
  3. Backend latency increases.
  4. More timeouts occur.
  5. Retry rate rises.

Autoscaling accelerated instability.

Safer Scaling Signals

Scale on:

  • Sustained demand (not spikes)
  • Latency distribution trends
  • Organic RPS (excluding retries)
  • Queue growth rate

Example:

Python
    def autoscale_safe(request_rate, sustained_load):
        if sustained_load and request_rate > 120:
            print("Scaling safely")

Autoscaling should respond to organic demand, not retry amplification.

4. The Real Problem: Correlated Reactions

  • Retries respond to latency.
  • Replication responds to writes.
  • Autoscaling responds to traffic.
  • Circuit breakers respond to error rates.

Under stress, they react to the same signal. That correlation creates cascading failure. Distributed systems behave like feedback systems. Unbounded feedback loops destabilize them.

Real-World Scenario: Payment Reconciliation API

Consider a payment reconciliation service:

    Gateway → Process API → Billing → ERP → Database

What happens during a minor ERP slowdown?

  1. ERP latency increases to 700ms.
  2. Billing times out at 500ms.
  3. Billing retries 3 times.
  4. Process API retries orchestration.
  5. Gateway retries client request.
  6. Autoscaling reacts to spike.
  7. DB replication lag increases.
  8. DLQ starts growing.

Within minutes, a small slowdown becomes a platform-wide incident. Root cause: unbounded reaction.

5. Guardrails for Bounded Reliability in API Systems

1. Retry Budgets

    Effective Load = Incoming RPS × Retry Count

If RPS = 1,000 and retries = 3, effective load = 3,000. Cap retries per request and per service.

2. Failure Classification

Not all errors are retriable.

    Error Type      Retry?   Action
    CONNECTIVITY    Yes      Bounded retry
    TIMEOUT         Yes      Backoff
    VALIDATION      No       Fail fast
    AUTH            No       Alert

Blind retries are architectural debt.

3. Idempotency Enforcement

Retries without idempotency cause corruption.
Unsafe:

Python
    import uuid

    # A fresh ID on every attempt makes each retry look like a new transaction.
    transaction_id = str(uuid.uuid4())

Safe:

Python
    # Reuse the caller-supplied ID so every retry maps to the same transaction.
    transaction_id = payload.get("transaction_id") or request.headers["correlation-id"]

Every retry must produce the same logical result.

4. DLQ With Observability

Track:

  • Retry percentage
  • Timeout frequency
  • DLQ growth velocity
  • P95 latency shifts

These are early warning signals.

None of these controls are free. Reducing retries can increase error rates in some scenarios, and limiting replication can affect durability guarantees. The goal isn't to eliminate these mechanisms, but to apply them intentionally based on system behavior.

5. Design for Stability, Not Perfection

The goal of distributed reliability isn't maximum redundancy. It's controlled degradation under stress. Bound retries. Scope replication. Dampen scaling reactions. Enforce idempotency. Monitor feedback loops. Minor latency should not become a cascading outage. Reliability is not about adding mechanisms. It's about controlling how they interact.

Final Thoughts

Retry storms don't start with catastrophic failure. They start with:

  • A small latency increase
  • A few timeouts
  • A handful of retries

Then fault-tolerance mechanisms react together:

  • Retries multiply traffic.
  • Replication increases coordination pressure.
  • Autoscaling amplifies backend load.

Within minutes, a minor slowdown becomes a cascading outage. Reliability in API-led distributed systems is not about adding more safety nets. It's about bounding how those safety nets behave under stress.

  • Limit retries.
  • Classify failures.
  • Enforce idempotency.
  • Scale on sustained demand, not noise.
  • Monitor feedback loops before they spiral.

The difference between a resilient platform and a cascading failure often comes down to one thing: whether your reliability mechanisms are controlled or uncontrolled. Design for stability under stress. Not perfection under ideal conditions.
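The "Retry Budgets" and "Failure Classification" guardrails can be combined in a few lines. The sketch below is one possible shape under stated assumptions, not a reference implementation: the `RetryBudget` class, its `ratio` parameter, and the 10% default are hypothetical, while the retriable error types are taken from the classification table in this article.

```python
RETRIABLE = {"CONNECTIVITY", "TIMEOUT"}  # from the classification table
# VALIDATION and AUTH errors are never retried.

class RetryBudget:
    """Allow retries only while they stay under a fraction of requests."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio      # assumed policy: retries <= ratio * requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self, error_type):
        if error_type not in RETRIABLE:
            return False        # fail fast on non-retriable errors
        if self.retries + 1 > self.ratio * self.requests:
            return False        # budget exhausted for this service
        self.retries += 1
        return True

if __name__ == "__main__":
    budget = RetryBudget(ratio=0.25)
    for _ in range(100):
        budget.record_request()
    granted = sum(budget.can_retry("TIMEOUT") for _ in range(50))
    print("retries granted:", granted)
```

With a budget like this, effective load is bounded by roughly incoming RPS × (1 + ratio) for the service, instead of growing with the number of attempts configured on every request.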

By Manjeera Chanda
Self-Hosted Inference Doesn’t Have to Be a Nightmare: How to Use GPUStack

The Problem Nobody Warned You About You bought the GPUs. Maybe you've got a couple of NVIDIA A100s in a rack, some RTX 4090s under desks, or a Kubernetes cluster with mixed hardware. You've got the compute. Congratulations! Now what? Here's the part that catches most teams off guard: having GPUs is the easy part. Managing them is where things go sideways. You need to figure out which models fit on which cards, how to balance load across machines, how to handle a node going down at 2 AM, and how to expose all of this as a clean API your application team can actually call. Most teams end up building a brittle collection of Python scripts and crontab entries that haven't been updated since 2022. It works until it doesn't, and then someone's paging you on a Saturday. This is the problem GPUStack was built to solve. What Is GPUStack, Exactly? GPUStack is an open-source tool for managing GPU clusters. Think of it as Kubernetes for your inference workloads, except you don't need to spend three days debugging a whitespace error in a Helm chart. At its core, GPUStack does three things well: It aggregates your GPUs. Whether your hardware is spread across bare-metal servers, Kubernetes pods, or cloud instances, GPUStack sees them all as a single pool of compute. One dashboard, full visibility. It orchestrates inference engines. GPUStack doesn't try to reinvent the inference wheel. It plugs into engines like vLLM, SGLang, and TensorRT-LLM, picks the right one for the job, configures it, and manages the lifecycle so you don't have to. It serves models through an OpenAI-compatible API. Once a model is deployed, your application team gets a familiar REST endpoint. No custom client libraries. No new protocols to learn. Swap out the base URL, and you're talking to your own infrastructure. Getting Started in Under 5 Minutes I'm not exaggerating on the timeline. Here's how you go from zero to a running GPUStack server. 
Step 1: Fire Up the Server You need one machine to act as your control plane. It doesn't even need a GPU. A basic CPU-only box works fine for the server role. Shell sudo docker run -d --name gpustack \ --restart unless-stopped \ -p 80:80 \ --volume gpustack-data:/var/lib/gpustack \ gpustack/gpustack That's it. Open your browser, navigate to http://<your-server-ip>, and you'll see the GPUStack dashboard. The first time you log in, you'll set up your admin credentials. Step 2: Add Your GPU Workers Now for the fun part. On each worker node, make sure you have the NVIDIA driver and NVIDIA Container Toolkit installed, then run: Shell sudo docker run -d --name gpustack-worker \ --restart unless-stopped \ --gpus all \ -e GPUSTACK_SERVER_URL=http://<your-server-ip> \ -e GPUSTACK_TOKEN=<your-token> \ gpustack/gpustack Replace the server URL and token (grab the token from the GPUStack dashboard). Within seconds, your worker appears in the cluster view with GPU model info, VRAM capacity, and health status. Rinse and repeat for every GPU machine you want to add. Got 3 machines? Three commands. Got 30? Thirty commands, or one Ansible playbook if you're smart about it. Running the worker command is actually the easiest part. The real final boss of GPU clusters is usually getting the drivers and toolkit installed correctly on the host. Step 3: Deploy a Model Head over to the model catalog in the web UI. GPUStack supports pulling models from Hugging Face and the Ollama Library. Pick a model and click deploy. Here's where the scheduler really excels. It reads the model's metadata, computes the resource requirements for VRAM, compute, and memory, then figures out which workers can handle it. If the model is too big for a single GPU, it can shard it across multiple cards. You don't have to manually calculate whether a 70B parameter model fits on your hardware. GPUStack does the math for you. Step 4: Call the API Once the model is running, you get an OpenAI-compatible endpoint. 
Grab an API key from the dashboard and test it: Shell curl http://<your-server-ip>/v1/chat/completions \ -H "Authorization: Bearer <your-api-key>" \ -H "Content-Type: application/json" \ -d '{ "model": "llama3", "messages": [ {"role": "user", "content": "Explain GPU cluster management in one paragraph."} ] }' If you're already using the OpenAI Python SDK, switching to your GPUStack endpoint is a one-line change: Python from openai import OpenAI client = OpenAI( base_url="http://<your-server-ip>/v1", api_key="<your-api-key>" ) response = client.chat.completions.create( model="llama3", messages=[{"role": "user", "content": "Hello from my own GPU cluster!"}] ) print(response.choices[0].message.content) Your application code stays the same. Your infrastructure is now fully under your control. Why This Actually Matters Let me break down the features that make GPUStack more than a nice-looking dashboard. Multi-Backend Flexibility GPUStack supports vLLM, SGLang, and TensorRT-LLM out of the box. This matters because no single engine is best for every workload. vLLM is great at high-throughput batch processing. TensorRT-LLM squeezes out every last drop of performance on NVIDIA hardware. SGLang shines with structured generation. GPUStack lets you pick the right tool for each deployment, or lets the scheduler pick for you. Built-In Monitoring GPUStack integrates with Grafana and Prometheus, giving you real-time dashboards for GPU utilization, VRAM usage, token throughput, and API request rates. No need to bolt on a separate monitoring stack (which usually ends up being three half-finished Grafana dashboards anyway). When something breaks at 2 AM, you'll know exactly which GPU on which machine is the problem. Automated Failure Recovery We’ve all been there - a node drops off the map because of a weird PCIe bus error or a driver mismatch that only appears under heavy load. Normally, that means your inference API just returns 500s until you manually intervene. 
GPUStack handles the panic phase for you. When Should You Use GPUStack? GPUStack isn't the right fit for every scenario. Here's a quick way to think about it: Use GPUStack if: You have 2+ GPU machines and want to serve LLMs or other AI models behind a unified API. Especially if your team doesn't want to become full-time infrastructure engineers just to keep models running. You want to run inference on your own hardware instead of paying per-token to a cloud provider. The cost savings at scale are real, and GPUStack removes the operational overhead that usually makes self-hosting painful. Maybe skip GPUStack if: You have a single GPU and just want to run a model locally for personal use. Tools like Ollama are simpler for that use case. You're already deep into a custom Kubernetes-based ML platform with KubeFlow or similar. GPUStack can work alongside Kubernetes, but if you've already invested heavily in that ecosystem, the overlap might not be worth it. The Bigger Picture The AI infrastructure landscape is shifting. A year ago, most teams defaulted to API providers for inference. Today, with open-weight models getting better every month and GPU costs coming down, self-hosted inference is becoming a real option. Not just for Big Tech, but for startups and mid-size companies too. The bottleneck isn't hardware anymore. It's operations. It's the glue code between "we have GPUs" and "our application can reliably call a model." GPUStack is a serious attempt at solving that gap, and it's open source under the Apache 2.0 license, so you can inspect, modify, and deploy it without vendor lock-in. If you’re sitting on a pile of hardware that’s currently just acting as expensive space heaters, or if you’re tired of seeing cloud inference bills that look like mortgage payments, give this a shot. You might find that self-hosting is actually viable again!
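As a footnote to Step 3, this is the kind of back-of-the-envelope arithmetic the scheduler spares you. The sketch is not GPUStack's algorithm: it only counts memory for the model weights, and the 20% headroom reserved for KV cache and activations is an assumption picked for illustration. The helper names are hypothetical.

```python
import math

# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billion, dtype="fp16"):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[dtype]

def min_gpus(params_billion, vram_gb_per_gpu, dtype="fp16", headroom=0.2):
    """Smallest GPU count whose combined VRAM holds the weights while
    keeping `headroom` of each card free for KV cache and activations."""
    usable = vram_gb_per_gpu * (1 - headroom)
    return math.ceil(weight_memory_gb(params_billion, dtype) / usable)

if __name__ == "__main__":
    print("70B fp16 weights:", weight_memory_gb(70), "GB")
    print("80 GB cards needed:", min_gpus(70, 80))
    print("24 GB cards needed:", min_gpus(70, 24))
```

A 70B-parameter model at 16-bit precision needs about 140 GB for weights alone, which is why it cannot sit on a single 80 GB A100 or 24 GB RTX 4090 and has to be sharded.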

By Sandeep Sadarangani
OpenAPI From Code With Spring and Java: A Recipe for Your CI

This is not "just another article about Springdoc," I promise. This is a ready-to-use recipe I was struggling to find one day, and had to build it from scratch. Have you ever needed to generate OpenAPI documentation directly from your code and, more importantly, do it in a way that fits cleanly into a CI pipeline? Swagger UI is commonly used in Spring Boot applications to visualize and test APIs from the browser. It can also expose the generated OpenAPI definition through a configurable endpoint, and that endpoint is exactly what we will use in this article. Why OpenAPI Documentation Matters Frontend Client Generation One of the most practical uses of OpenAPI documentation is automatic client generation. Tools such as OpenAPI Generator or Swagger Codegen can take an OpenAPI definition and produce TypeScript, JavaScript, or Java clients with very little manual effort. Mocking a Service Before It Is Ready In early development stages, a team may want to spin up a mock server before the real endpoints are fully implemented. Tools such as Mockoon or WireMock can use an OpenAPI specification to simulate the service. This is especially useful for frontend teams that need to move forward while backend work is still in progress. Verifying Contracts Between Services When multiple services depend on one another, compatibility becomes critical. OpenAPI documentation can be used together with tools such as Spring Cloud Contract to verify that both providers and consumers still conform to the agreed contract. The Manual Approach to Generating OpenAPI Documentation Let us start with a simple Spring Boot project. 
Add the following dependencies to pom.xml:

XML
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-security</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springdoc</groupId>
        <artifactId>springdoc-openapi-starter-webmvc-ui</artifactId>
        <version>2.6.0</version>
    </dependency>

Then add Springdoc configuration to application.yml:

YAML
    springdoc:
      api-docs:
        path: /api-docs
        enabled: true
      swagger-ui:
        url: /api-docs
        enabled: true

Now create a simple REST controller:

Java
    @RestController
    @Tag(name = "default", description = "General API")
    @RequestMapping("/api/v1/default")
    public class WebRestController {

        private static final Logger log = LoggerFactory.getLogger(WebRestController.class);

        @GetMapping(produces = MediaType.TEXT_PLAIN_VALUE)
        @ResponseStatus(HttpStatus.OK)
        public String get() {
            log.info("GET method called");
            return "Hello!";
        }

        @PostMapping(
            consumes = MediaType.TEXT_PLAIN_VALUE,
            produces = MediaType.APPLICATION_JSON_VALUE
        )
        @ResponseStatus(HttpStatus.OK)
        public Set<String> post(@RequestBody String body) {
            log.info("POST method called");
            return Set.of(body);
        }
    }

Finally, add a security configuration that allows access to both the REST API and Swagger UI:

Java
    @Configuration
    @EnableWebSecurity
    @EnableMethodSecurity
    public class WebSecurityConfig {

        // Default chain: docs, Swagger UI, and the demo endpoint are public.
        @Profile("!openapi")
        @Bean
        public SecurityFilterChain filterChain(HttpSecurity httpSecurity) throws Exception {
            return httpSecurity.authorizeHttpRequests(
                    request -> request
                        .requestMatchers("/api-docs", "/api-docs/**").permitAll()
                        .requestMatchers("/swagger-ui/*").permitAll()
                        .requestMatchers("/api/v1/default").permitAll()
                        .requestMatchers("/**").authenticated()
                )
                .csrf(CsrfConfigurer::disable)
                .build();
        }

        // Relaxed chain used only while generating the OpenAPI file.
        @Profile("openapi")
        @Bean
        public SecurityFilterChain filterChainOpenApi(HttpSecurity httpSecurity) throws Exception {
            return httpSecurity.authorizeHttpRequests(
                    request -> request.anyRequest().permitAll()
                )
                .csrf(CsrfConfigurer::disable)
                .build();
        }
    }

Notice the separate openapi profile. We will use it later during automated generation.

At this point, you can run the application and open Swagger UI at http://localhost:8080/swagger-ui/index.html. From there, the generated OpenAPI document is available at http://localhost:8080/api-docs. You can save that response manually and use it as your specification file.

This works, but it is repetitive and not very practical for build automation. So let us move to the more useful approach: generating the spec during the Maven build.

Automatic Generation

To generate an OpenAPI file automatically, it helps to understand what actually happens during the build. The springdoc-openapi-maven-plugin does not generate the specification out of thin air. It calls the application endpoint that exposes the OpenAPI definition. In other words, your Spring Boot application must be running while the plugin executes. That is why the spring-boot-maven-plugin and springdoc-openapi-maven-plugin are typically used together.

Because the application has to be started during the build, the security configuration must also allow the documentation endpoint to be accessed in that scenario. This is exactly why the separate openapi Spring profile is useful.
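Conceptually, the generation step is just "HTTP GET against the running application, then write the body to a file". The sketch below illustrates that idea only; it is not the plugin's source code. The parameter names mirror the plugin settings used in this article (`apiDocsUrl`, `outputDir`, `outputFileName`), and the injectable `opener` argument is an assumption added so the example can run without a Spring Boot application listening.

```python
import io
import pathlib
import tempfile
import urllib.request

def fetch_spec(api_docs_url, output_dir, output_file_name,
               opener=urllib.request.urlopen):
    """GET the OpenAPI document from a running app and write it to disk."""
    with opener(api_docs_url) as response:
        body = response.read()
    target = pathlib.Path(output_dir) / output_file_name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(body)
    return target

if __name__ == "__main__":
    # Fake opener standing in for the running application.
    fake = lambda url: io.BytesIO(b"openapi: 3.0.1\n")
    out_dir = tempfile.mkdtemp()
    path = fetch_spec("http://localhost:8080/api-docs.yaml",
                      out_dir, "openapi.yml", opener=fake)
    print("wrote", path)
```

Seen this way, it is clear why the endpoint has to be reachable without authentication during the build: a request that is rejected or redirected to a login page would produce no usable specification.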
Add a Dedicated Maven Profile

Add the following Maven profile to pom.xml:

XML
<profile>
    <id>openapi</id>
    <properties>
        <maven.test.skip>true</maven.test.skip>
    </properties>
    <build>
        <plugins>
            <!-- When the Maven profile is openapi, run Spring with the openapi profile -->
            <plugin>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <groupId>org.springframework.boot</groupId>
                <configuration>
                    <jvmArguments>
                        -Dspring.application.admin.enabled=true
                        -Dspring.profiles.active=openapi
                    </jvmArguments>
                </configuration>
                <executions>
                    <execution>
                        <id>pre-integration-test</id>
                        <goals>
                            <goal>start</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>post-integration-test</id>
                        <goals>
                            <goal>stop</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <!-- Generate the OpenAPI file during the build -->
            <plugin>
                <artifactId>springdoc-openapi-maven-plugin</artifactId>
                <groupId>org.springdoc</groupId>
                <version>1.4</version>
                <configuration>
                    <skip>false</skip>
                    <apiDocsUrl>http://localhost:8080/api-docs.yaml</apiDocsUrl>
                    <outputDir>${project.build.directory}</outputDir>
                    <outputFileName>openapi.yml</outputFileName>
                </configuration>
                <executions>
                    <execution>
                        <id>integration-test</id>
                        <goals>
                            <goal>generate</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</profile>

The important parts here are:

  • The openapi Maven profile and the openapi Spring profile are two separate things. They do not need these exact names, and they do not need to share a name.
  • When the openapi Maven profile is active, the Spring application is started with the openapi Spring profile (see jvmArguments).
  • -Dspring.profiles.active=openapi enables the relaxed security profile created specifically for documentation generation.
  • apiDocsUrl points to the endpoint that returns the OpenAPI document.
  • outputDir and outputFileName control where the generated file is written.

These are the exact parts I struggled to find in one place, hence the "recipe" article.
Run the Generation Step

Once the profile is in place, generating the spec takes a single command:

Shell
./mvnw verify -Popenapi

After the build completes, the generated OpenAPI spec should be at ./target/openapi.yml.

Using It in a CI Pipeline

This setup is CI-friendly because the same command can run locally and in your pipeline:

Shell
./mvnw verify -Popenapi

From there, you can archive target/openapi.yml as a build artifact, publish it to an artifact repository, or pass it to frontend code generators, mock servers, and contract verification jobs.

Conclusion

Generating OpenAPI documentation manually from Swagger UI is fine for quick inspection, but it does not scale when you need repeatability. By wiring Spring Boot and Springdoc into a dedicated Maven profile, you can generate the specification automatically during the build, both locally and in CI.

That gives you a reproducible OpenAPI artifact that can support client generation, service mocking, and contract verification without adding a separate manual step to the development workflow.

Bonus: Represent Set as an Array

In some cases, you may want a Set to be represented as a regular array in the generated OpenAPI specification instead of an array with uniqueItems: true. This can be useful when downstream tools expect a plain array schema (this is the exact request I once got from the frontend team).
You can customize Springdoc behavior with a small configuration class. The class has to be picked up by Spring (for example, through @Configuration) so that its constructor actually runs at startup:

Java
import io.swagger.v3.oas.models.media.Schema;
import java.util.Collections;
import java.util.Set;
import org.springdoc.core.utils.SpringDocUtils;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SwaggerConfig {

    // Make springdoc generate an array schema for Set.class
    // and remove uniqueItems: true
    public SwaggerConfig() {
        var schema = new Schema<Set<?>>();
        schema.type("array").example(Collections.emptyList());
        SpringDocUtils.getConfig().replaceWithSchema(Set.class, schema);
    }
}

With this adjustment in place, Set types are emitted as plain arrays in the generated schemas, which can simplify integration with some client generators and consumers.
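If the generated file feeds other CI jobs, it can be worth failing fast when it does not look as expected. Below is a minimal sketch of such a check, assuming the YAML layout Springdoc typically produces (path keys on their own lines, uniqueItems: true on its own line). It is a plain text scan, not a real YAML parser, and the class OpenApiSpecCheck and its method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a text-based sanity check for a generated OpenAPI YAML file.
public class OpenApiSpecCheck {

    /** Returns human-readable problems; an empty list means the check passed. */
    public static List<String> findProblems(String yaml, String expectedPath,
                                            boolean forbidUniqueItems) {
        List<String> problems = new ArrayList<>();

        // The version field is a top-level key, so it starts at column 0
        boolean hasVersion = yaml.lines().anyMatch(line -> line.startsWith("openapi:"));
        if (!hasVersion) {
            problems.add("missing top-level 'openapi:' field");
        }

        // Path keys look like "  /api/v1/default:" and may be quoted by the YAML writer
        boolean hasPath = yaml.lines()
                .map(String::strip)
                .filter(line -> line.endsWith(":"))
                .map(line -> line.substring(0, line.length() - 1))
                .map(OpenApiSpecCheck::unquote)
                .anyMatch(expectedPath::equals);
        if (!hasPath) {
            problems.add("expected path not found: " + expectedPath);
        }

        if (forbidUniqueItems
                && yaml.lines().map(String::strip).anyMatch("uniqueItems: true"::equals)) {
            problems.add("found 'uniqueItems: true' although plain arrays were expected");
        }
        return problems;
    }

    // Removes one pair of matching single or double quotes around a key, if present
    private static String unquote(String key) {
        if (key.length() >= 2) {
            char first = key.charAt(0);
            char last = key.charAt(key.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return key.substring(1, key.length() - 1);
            }
        }
        return key;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java OpenApiSpecCheck <spec-file> <expected-path>");
            return;
        }
        String yaml = Files.readString(Path.of(args[0]));
        List<String> problems = findProblems(yaml, args[1], true);
        problems.forEach(System.err::println);
        if (!problems.isEmpty()) {
            System.exit(1); // non-zero exit code fails the CI step
        }
        System.out.println("Spec looks OK");
    }
}
```

In a pipeline, this could run right after ./mvnw verify -Popenapi with target/openapi.yml and /api/v1/default as arguments. A dedicated OpenAPI validator would be more thorough; the point here is only to catch an empty or unexpected file early.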

By Roman Dubinin

Top Microservices Experts


Jubin Abhishek Soni

Senior Software Engineer,
Yahoo

Jubin Soni is a Senior Software Engineer with 14+ years of experience building scalable systems, real-time data pipelines, and AI-driven platforms for industry leaders in technology and media. With deep expertise spanning cloud-native architectures, distributed systems, and applied machine learning, Jubin brings a rare combination of engineering depth and research breadth to every problem he tackles. He is a published researcher with work appearing in IEEE and other peer-reviewed venues, and a Manning Publications author. Jubin holds IEEE Senior Member status and has spoken at technical conferences including P99 CONF, ACM and APIdays, sharing his expertise in distributed systems, serverless architectures, and AI with engineering communities globally. He is passionate about pushing the boundaries of what scalable software can do — and sharing those insights with fellow engineers through writing, research, and open source.

Satrajit Basu

Chief Architect,
TCG Digital

Satrajit, a visionary Chief Architect and an AWS Ambassador, brings unparalleled expertise in architecting and directing mission-critical projects for industry leaders across various sectors. From banking to aviation, Global Distribution Systems (GDS) to restaurant and travel e-commerce, Satrajit has mastered the art of migrating and modernizing workloads on AWS. With an unwavering passion for technology, Satrajit ensures that applications on AWS are not just well-architected but also leverage the latest cutting-edge technologies. An architect par excellence, Satrajit's dedication extends beyond project delivery. He generously shares his vast knowledge through insightful technical blogs, enlightening aspiring architects and developers worldwide.

The Latest Microservices Topics

Combining Temporal and Kafka for Resilient Distributed Systems
Kafka handles durable event streaming while Temporal manages long-running workflow state, retries, and recovery to build resilient distributed systems.
June 9, 2026
by Akhil Madineni
· 211 Views
Frame Buffer Hashing for Visual Regression on Embedded Devices
Learn how frame buffer hashing reduced visual regression storage from 18GB to 19KB while speeding up CI and eliminating flaky image diffs.
June 9, 2026
by Rajasekhar sunkara
· 191 Views
How to Interpret the Number of Spring ApplicationContexts in Integration Tests
When optimizing Spring Boot integration tests, developers often focus on obvious metrics, but they do not always explain why an integration test suite is slow.
June 8, 2026
by Constantin Kwiatkowski
· 763 Views
The Middleware Gap in AI Agent Frameworks
Most agent frameworks observe model calls and allow rewriting them only after they reach the model, making an understanding of callbacks and middleware essential.
June 8, 2026
by Ninaad Rao
· 832 Views
Is the Data Warehouse Dead? 3 Patterns From Enterprise Architecture That Answer This Question
No, but its role has fundamentally changed. Here is what I have seen work, after building data platforms at enterprise scale across multiple industries.
June 5, 2026
by Nabarun Bandyopadhyay
· 2,475 Views · 1 Like
Why Your Test Automation Is Always Behind the Code And the Architecture That Fixes It
Most QA teams are stuck in a manual scripting loop. Here's the requirement-driven architecture that eliminates the coverage gap permanently.
June 5, 2026
by Waqar Hashmi
· 1,692 Views
Multi-Scale Feature Learning in CNN and U-Net Architectures
Multi-scale feature learning helps CNNs and U-Net models combine global context with fine details, improving accuracy in tasks like image segmentation.
June 3, 2026
by Akhil Madineni
· 945 Views
Data Contracts as the "Circuit Breaker" for Model Reliability
AI models do not fail due to bad coding; they fail due to an upstream change in the input. Combine contracts with circuit breakers to stop bad data from entering models.
June 1, 2026
by SRIRAMPRABHU RAJENDRAN
· 1,304 Views
How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
A practical guide to SaaS architecture decisions that determine whether platforms scale cleanly or collapse under technical debt, security, and growth pressure.
June 1, 2026
by Igboanugo David Ugochukwu (DZone Core)
· 1,195 Views
Offline-First Patch Management for 10,000 Edge Nodes: A Practical Architecture That Scales
How we stopped fighting the network and started treating bandwidth as a scarce resource — and what happened to our patch success rate when we did.
June 1, 2026
by srinivas thotakura
· 1,331 Views
Implementing Secure API Gateways for Microservices Architecture
Use Kong as an API gateway to centralize JWT auth, rate limiting, and access control across all microservices, keeping individual services focused on business logic.
May 29, 2026
by Mugunth Chandran
· 3,625 Views · 3 Likes
Zero-Downtime Deployments for Java Apps on Kubernetes
Achieve zero-downtime deployments for Java applications on Kubernetes using rolling updates, readiness/liveness probes, and graceful shutdown strategies.
May 29, 2026
by Ramya vani Rayala
· 3,466 Views
Pragmatica Aether: Let Java Be Java
A modern, distributed, fault-tolerant runtime environment for the language that was intentionally designed for managed environments.
May 29, 2026
by Sergiy Yevtushenko
· 3,660 Views · 1 Like
Stateless JWT Auth Microservice Architecture With Spring Boot 3 and Redis Sentinel
Design a stateless JWT auth service with Spring Boot 3, Redis caching, and Sentinel for high availability, faster token validation, and reduced DB load.
May 27, 2026
by Erkin Karanlık
· 3,136 Views · 1 Like
Docker Hardened Images Are Free Now — Here's What You Still Need to Build
Docker Hardened Images solve the CVE problem. But CVEs aren't why containers fail in production — governance gaps are. Here's the trust architecture that closes them.
May 27, 2026
by Shamsher Khan (DZone Core)
· 3,717 Views
Beyond Partitioning and Z-Order: A Deep Dive into Liquid Clustering for Unity Catalog Managed Tables
Liquid Clustering replaces rigid partitioning and Z-Order with adaptive clustering in Unity Catalog, improving performance with less maintenance.
May 26, 2026
by Seshendranath Balla Venkata
· 2,404 Views · 1 Like
Designing API-First EMR Architectures in .NET: Enabling Modular Growth in Compliance-Driven Systems
API-first .NET architecture lets EMR platforms evolve safely — enforcing compliance, stabilizing contracts, and isolating UI changes from critical business logic.
May 26, 2026
by Ronak Pavasiya
· 1,355 Views
One Query, Four GPUs: Tracing a Distributed Training Stall Across Nodes
One SQL query across 4 GPU nodes found a straggler in under a second using eBPF fleet fan-out, no central collector needed.
May 25, 2026
by Ingero Team
· 3,421 Views
Building Production-Grade GenAI on GCP with Vertex AI Agent Builder
GenAI is easy to prototype but hard to productionize. Vertex AI Agent Builder provides a unified platform for RAG, orchestration, security, and scalable deployment.
May 25, 2026
by Sairamakrishna BuchiReddy Karri
· 1,751 Views
How Retry Storms Crash API-Led Systems: Bounded Reliability Patterns for Distributed Architectures
Unbounded retries and autoscaling can turn minor latency into cascading outages. API reliability must be bounded and load-aware to prevent retry storms.
May 22, 2026
by Manjeera Chanda
· 2,105 Views