A microservices architecture is a development method for designing applications as modular services that seamlessly adapt to a highly scalable and dynamic environment. Microservices help solve complex issues such as speed and scalability, while also supporting continuous testing and delivery. This Zone will take you through breaking down the monolith step by step and designing a microservices architecture from scratch. Stay up to date on the industry's changes with topics such as container deployment, architectural design patterns, event-driven architecture, service meshes, and more.
Data Contracts as the "Circuit Breaker" for Model Reliability
How SaaS Architectures Break at Scale — and the Engineering Decisions That Prevent It
The Aberration We build Java applications like Go or Rust programs. Fat JARs. Docker images. Kubernetes deployments. Everyone does it, so it looks normal. It contradicts Java’s design DNA. Java has always been a language for managed environments. Applets ran inside browsers. Servlets ran inside application servers. EJBs ran inside containers like JBoss and WebLogic. OSGi bundles ran inside runtime containers like Eclipse Equinox. In every generation, the pattern was the same: a managed runtime hosts the application. The application handles business logic. The runtime handles infrastructure. The fat-jar era threw that away. We stopped letting Java be Java. We started bundling web servers, serialization frameworks, service discovery clients, configuration management, health checks, metrics libraries, and logging frameworks into every application. Then we wrapped the result in a Docker container and deployed it to an orchestration platform that reimplements — poorly — the infrastructure management that Java runtimes used to provide natively. This article introduces Pragmatica Aether: a distributed runtime that returns Java to its natural habitat. The application handles business logic. Runtime handles infrastructure. This isn’t radical — it's returning to what Java was designed for. The Problem: Infrastructure Wearing a Business Logic Mask Think of what a typical Java microservice carries. A web server (Tomcat, Netty, Undertow). A serialization framework (Jackson, Gson). A dependency injection container (Spring, Guice). A service discovery client (Eureka, Consul). Health check endpoints. Configuration management (Spring Cloud Config, Consul KV). A metrics library (Micrometer, Dropwizard). A logging framework (Logback, Log4j2). Retry logic (Resilience4j). Circuit breakers. HTTP client configuration. The application is wearing a heavy winter coat of infrastructure, armed to the teeth to survive in a hostile environment. Now consider the coupling this creates. Update the Java version — rebuild and test every service. Change your message broker from RabbitMQ to Kafka — modify, rebuild, and redeploy every application that touches messaging. Add a new observability tool and update dependencies in every microservice. Switch cloud providers — rewrite configuration, SDK calls, and deployment manifests across the entire fleet. Each change ripples through dozens or hundreds of services because infrastructure is entangled with business logic at the dependency level. This is the coupling trap. Your application’s pom.xml doesn't distinguish between business dependencies and infrastructure dependencies. They compile together, deploy together, and break together. A security patch in Netty requires a new build of every service that embeds a web server, which is all of them. Framework lock-in worsens this. It isn’t a vendor problem — it's an architecture problem. Spring’s dependency injection fights with Kubernetes service mesh for control over service routing and circuit breaking. The framework’s configuration system overlaps with Consul KV and Kubernetes ConfigMaps. Your cloud SDK’s retry logic conflicts with Resilience4j. Every layer claims authority over the same cross-cutting concerns, and the conflicts surface as subtle bugs in production — not during development. This is an architecture problem. Architectural problems have architectural solutions. Aether: The Core Idea What you write: an interface annotated with @Slice, plus business logic implementation. Java @Slice public interface OrderService { Promise<OrderResult> placeOrder(PlaceOrderRequest request); static OrderService orderService(InventoryService inventory, PricingEngine pricing) { return request -> inventory.check(request.items()) .flatMap(available -> pricing.calculate(available)) .map(priced -> OrderResult.placed(priced)); } } What you don’t write: everything else. No HTTP clients — inter-slice calls are direct method invocations via generated proxies. No service discovery — the runtime tracks where every slice instance lives. No retry logic — built-in retry with exponential backoff and node failover. No circuit breakers — the reliability fabric handles failure automatically. No serialization code — request/response types are serialized transparently. A method call via an imported interface is the only visible contract. The only hint that the actual call might be remote is a design requirement: slice methods should be idempotent. This isn’t a limitation — it's what enables retry, scaling, and fault tolerance to work transparently. The same request, processed by any available instance, produces the same result. Most read operations are naturally idempotent. For writes, standard patterns like idempotency keys and conditional writes handle it cleanly. Everything else is the environment’s job: resource provisioning, scaling, transport, discovery, retries, circuit breakers, configuration, observability, logging, tracing, monitoring, and security. None of these are application concerns, and none should be handled at the business logic level. The JBCT Leaf pattern serves two purposes here: it documents the design (“what we expect from an external implementation”) and encourages exactly one interface per dependency. Different implementations may have different technical properties — performance, latency, memory consumption — but as long as they’re compatible with the interface, business logic works unchanged. You write basically pure business logic that scales from your local computer to a global multi-zone distributed deployment, transparently. Under The Hood: What Makes It Work Five architectural decisions make this possible. Consensus KV Store. A single source of truth for all configuration, deployment state, and service discovery. Based on the Rabia protocol, a crash-fault-tolerant, leaderless consensus algorithm was published in 2021. Any node can propose; agreement is reached through a two-round voting protocol with a fast path when a supermajority agrees in round one. No external config servers. No etcd. No Consul. Configuration changes propagate through consensus and take effect cluster-wide. Built-in Artifact Repository. DHT-based storage with configurable replication — 3 replicas with quorum reads/writes in production, full replication in development. Artifacts are chunked into 64KB pieces, distributed across nodes via consistent hashing, and integrity-verified with MD5 and SHA-1 on every resolve. No external Nexus or Artifactory is needed. During development, slices resolve from your local Maven repository. In production, the cluster is self-contained. ClassLoader Isolation. Each slice runs inside its own SliceClassLoader with child-first delegation. Two slices can use different versions of the same library without conflict. Shared dependencies like Pragmatica Lite core are loaded once in a parent classloader. No dependency conflicts. No classpath hell between slices. Declarative Deployment. Blueprints — TOML files — describe the desired state: which slices, how many instances. TOML id = "org.example:commerce:1.0.0" [[slices]] artifact = "org.example:inventory-service:1.0.0" instances = 3 [[slices]] artifact = "org.example:order-processor:1.0.0" instances = 5 Apply with one command: aether blueprint apply commerce.toml. The cluster resolves artifacts, loads slices, distributes instances across nodes, registers routes, and starts serving traffic. The cluster converges to the desired state automatically. Infrastructure Independence. Aether nodes are identical — there's only one deployment artifact to manage at the infrastructure level. Node updates and application deployments run on completely independent schedules. Update Java — roll it out across nodes without touching applications. Update the Aether runtime — same. Update business logic — deploy new slice versions without touching infrastructure. Each independently, each without downtime. This is the fundamental benefit of proper separation: when layers don’t share a deployment unit, they don’t share a deployment schedule. Fault Tolerance: The 50% Rule The system survives the failure of less than half the nodes. Performance may degrade until replacements spin up, but functionality remains intact — actual redundancy, not just graceful degradation. A 5-node cluster tolerates 2 simultaneous failures. A 7-node cluster tolerates 3. The same request, processed by any available node, produces the same result. Quorum requires (N/2) + 1 nodes — as long as a majority is alive, the cluster operates normally. Leader failover is consensus-based and near-instant. Node replacement happens automatically — the Cluster Deployment Manager detects the deficit and provisions a replacement through the NodeProvider interface. The entire recovery sequence — from failure detection through state restoration to serving traffic — completes without human intervention. When a node fails, the recovery is automatic. Requests to slices on the failed node are immediately retried on healthy nodes. A replacement node is provisioned. It connects to peers, restores consensus state from a cluster snapshot, re-resolves artifacts from the DHT, and reactivates assigned slices. Dead nodes are automatically removed from routing tables. The new leader reconciles the stale state. No human intervention required. Rolling updates leverage this fault tolerance for zero-downtime deployments with weighted traffic routing: SQL aether update start org.example:order-processor 2.0.0 -n 3 aether update routing <id> -r 1:3 # 25% to v2, 75% to v1 aether update routing <id> -r 1:1 # 50/50 aether update complete <id> # 100% to v2, drain v1 Deploy during business hours. Shift traffic gradually — 10% canary, then 25%, 50%, 75%, 100%. Monitor health metrics at each step. If health degrades — error rate exceeds thresholds, latency spikes — instant rollback with one command: aether update rollback <id>. Traffic immediately shifts back to the old version. The 3 AM pager alert becomes an audit log entry. For Every Project: Legacy, Greenfield, And Everything Between Legacy Migration Your legacy Java system doesn’t need a complete rewrite. It needs a path forward. Pick a relatively independent part of your system — something hitting limits, something with clear boundaries. Extract an interface. Annotate it with @Slice. Wrap the legacy implementation: Java private Promise<Report> generateReport(ReportRequest request) { return Promise.lift(() -> legacyReportService.generate(request)); } One line to enter the Aether world. Promise.lift() wraps the legacy call, catches exceptions, and returns a proper Result inside a Promise. Your legacy code keeps running. Call sites don't change. You haven't added risk — the initial deployment in Ember runs in the same JVM as your existing application, which means it's no worse than what you have today. You've laid the foundation for removing risk, not adding it. Moving from Ember to a full Aether cluster is a configuration change, not a code change — and that's when the 50% rule starts to apply. From there, it’s the strangler fig pattern. Extract a hot path, deploy it as a slice, route traffic, repeat. Each extracted slice can be gradually refactored using the peeling pattern: first wrap everything in Promise.lift(), then decompose into a Sequencer with each step still wrapped, then peel individual steps into clean JBCT patterns. Tests pass at every step. The lift() calls mark exactly where legacy code remains, making progress visible and remaining work obvious. No rewrite is required. No big bang migration. One sprint to the first slice in production. The migration article covers the full path in detail — from initial wrapping through gradual peeling to clean JBCT code. Greenfield Development For new projects, slices enable a granularity that’s impossible with traditional microservices. Each slice can be as lean as a single method — and that’s the recommended approach. There are no operational or complexity tradeoffs for small slices because Aether handles all the infrastructure overhead. No container to configure, no load balancer to provision, no monitoring to set up per service. You get per-use-case scaling: one slice serving 50 instances during peak load while another idles at minimum. That kind of granularity would be operationally insane with traditional microservices — each needing its own container, load balancer, monitoring, and deployment pipeline. With Aether, it’s the default. JBCT patterns — Leaf, Sequencer, Fork-Join, Condition, Iteration, and Aspects — compose naturally within slices. Each slice method is a data transformation pipeline: parse input, gather data, process, respond. The patterns provide consistent structure within slices. Slices provide consistent boundaries between them. The Spectrum Same slice model, different granularity. A service slice wraps an entire legacy component. A lean slice implements a single method. Both coexist in the same cluster, deployed and scaled independently. Slice is the executable unit. It can be big or small as necessary and convenient. The architecture accommodates both monolith migration and greenfield development simultaneously. Your legacy system gains fault tolerance while new features get maximum deployment flexibility. Scaling: Two Levels, Three Tiers of Intelligence Two-Level Horizontal Scaling Aether scales in two dimensions independently: Slice scaling: Spin up more instances of a specific slice on existing nodes. Classes are already loaded—scaling takes milliseconds, not seconds.Node scaling: Add more machines to the cluster. The node connects, restores state, and begins accepting work. Independent controls, combined effect. Each node hosts at most one instance of a given slice, so scaling a slice beyond the current node count requires adding nodes first. Add 2 more nodes to a 3-node cluster, then scale a hot slice to 5 instances—one per node. No coordination between the two dimensions is required. Three-Tier Decision System Tier 1—Decision Tree (1-second intervals) Instant reactive decisions based on CPU utilization, request latency, queue depth, and error rate. CPU above 70%? Add an instance. Below 30% sustained? Remove one (if above minimum). Latency exceeding the P95 threshold? Scale up. Error rate above 1% due to timeouts? Scale up. Deterministic, predictable, fast. Handles routine load changes with configurable cooldown periods — 30 seconds for scale-up, 5 minutes for scale-down — to prevent oscillation. Tier 2—TTM Predictor (60-second intervals) An ONNX-based machine learning model (Tiny Time Mixers) analyzes a 60-minute sliding window of metrics — CPU usage, request rate, P95 latency, and active instances. Forecasts load and adjusts the Decision Tree’s thresholds preemptively. If TTM predicts a load increase, it lowers the scale-up CPU threshold by 20% so the reactive tier responds earlier. The cluster scales before the spike arrives, not after. The key design principle: the cluster always survives on Tier 1 alone. TTM enhances; it doesn’t replace. If TTM fails — model load error, insufficient data, inference failure — the Decision Tree continues with default thresholds. The error is logged and recorded in metrics. No scaling disruption. Tier 3—LLM-based (planned) Long-term capacity planning and cluster health monitoring. Seasonal pattern prediction, maintenance window planning, anomaly investigation. This tier is not yet implemented — the current system operates with Tiers 1 and 2. Fault tolerance makes preemptible instances viable for burst scaling. If a spot instance gets reclaimed, the cluster survives — it was designed for nodes to disappear. You don’t need a PhD in distributed systems or a dedicated platform team. The scaling system manages itself. Development Experience: From Laptop To Production Three Environments, Zero Code Changes Ember Single-process runtime with multiple cluster nodes running in the same JVM. Fast startup, simple debugging. Deploy your slices alongside your existing application — slices call each other directly in-process. No network overhead. Standard debugger breakpoints work as expected. Perfect for local development and unit testing. Forge A 5-node cluster simulator running on your laptop. Real consensus. Real routing. Real failure scenarios. Kill nodes, crash the leader, trigger rolling restarts — and watch the cluster recover in real time through a web dashboard with D3.js topology visualization, per-node metrics (CPU, heap, leader status), and event timeline. Configurable load generation with TOML-based multi-target configuration lets you stress-test realistic scenarios — set request rates, define body templates, and run duration-limited load tests. Chaos operations include node kill, leader kill, and rolling restart. Forge validates the entire dependency graph before starting anything. Aether Production cluster. Same slices, same code, different scale. Your code doesn’t know which environment it’s running in. Whether inter-slice calls are in-process or cross-network is transparent. Tooling 37 CLI commands cover deployment, scaling, updates, artifacts, observability, controller configuration, and alerts — in both single-command and interactive REPL modes. A web dashboard streams real-time metrics via WebSocket — no polling. 30+ REST management endpoints enable full programmatic control of everything the CLI can do. Prometheus-compatible metrics export (/metrics/prometheus) integrates with existing monitoring stacks. Metrics are push-based at 1-second intervals, with zero consensus overhead — they bypass the consensus protocol entirely. Per-method invocation tracking with P50/P95/P99 latency and configurable slow-invocation detection strategies (fixed threshold, adaptive, per-method, composite) surfaces performance issues before users notice. Dynamic aspects let you toggle LOG/METRICS/LOG_AND_METRICS modes per method at runtime via REST API, without redeployment. Test realistic failure scenarios on your laptop. Deploy to production with a config change, not a code change. Maturity Aether is a working system, not a concept paper. 81 end-to-end tests are run against real 5-node clusters in Podman containers, validating cluster formation, quorum establishment, slice deployment and scaling, blueprint application with topological ordering, multi-instance distribution, artifact upload, and cross-node resolution with integrity verification, leader failure and recovery, node restart with state restoration, and orphaned state cleanup after leader changes. The recovery and fault tolerance claims come from automated tests against real clusters, not marketing slides. Let Java Be Java Java’s lineage leads here. From applets managed by browsers, through servlets managed by application servers, through EJBs managed by enterprise containers, through OSGi managed by runtime frameworks, to Aether, managed by a distributed runtime. The fat-jar era was a detour. An understandable one — when Docker emerged, it offered a universal packaging format, and the industry standardized on it regardless of language. Java adopted the patterns of languages that were designed to produce standalone binaries. We started treating Java applications like Go programs with a heavier runtime. But it was never the destination. Java was designed for managed environments. The JVM makes it possible. The runtime manages the application. That’s the lineage. Aether continues it. Two entry points exist today. Wrap your legacy monolith behind a @Slice interface in one sprint and gain fault tolerance without rewriting anything. Or start fresh with maximum clarity — lean slices, explicit contracts, per-use-case scaling. Both paths converge on the same runtime, the same cluster, the same operational model. Both paths can coexist — legacy service slices and new lean slices running side by side. Fault tolerance is not an afterthought — it's the foundation. Scaling is not your problem — it's the environment’s. Infrastructure is not your code — it's the runtime’s. The heavy winter coat comes off. The application breathes. Resources Pragmatica Aether—project siteGitHub Repository—source code
In this article, I will discuss a highly available solution developed using Spring Boot 3 and Spring Security 6 to address the "centralized authentication method" problem frequently seen in modern microservice ecosystems. We are not simply moving to an "authorization service"; we are examining the cache-first pattern, which minimizes DB usage, and the Redis Sentinel enhancement, which guarantees system persistence. Why a Separate Authentication Service? While embedding security into each service is an option in microservices, I have always found it more logical to proceed with a centralized Auth service and API Gateway combination. DRY (Don't Repeat Yourself): Using token authentication logic in many services increases extra maintenance costs.Isolation: Business services focus only on business logic; they don't deal with "is this token valid?" questions.Performance: Thanks to the Redis connection, instead of going to the database with every request, we can resolve the validation via the cache in milliseconds. Plain Text [Client] ──► [API Gateway] ──► [Auth Service: validate token] │ (valid) ▼ [Backend Microservices] Cache-Focused Approach: Reducing Database Load In the classic workflow, every login request puts a load on the DB. With the cache-first approach, the process proceeds like this with a POST /auth/signin request: First, Redis is checked. If there is a valid and unexpired token for the user, it is replicated directly. In case of cache deficiency, AuthManager.authenticate() is activated, a DB query is sent, and a BCrypt check is performed. After a successful login, a token is generated with JJWT (HS256). This token is given to Redis with our changes and TTL (e.g., 24 minutes), and personal responses are converted. In this way, it protects our main database, especially in brute-force or high-intensity login password attacks. Plain Text POST /auth/signin │ ▼ ┌──────────────────────────────┐ │ Token exists in Redis? │──── YES ──► Return token (0 DB queries) └──────────────────────────────┘ │ NO ▼ ┌──────────────────────────────┐ │ AuthManager.authenticate() │ (DB query + BCrypt verification) └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Generate JWT (JJWT HS256) │ └──────────────────────────────┘ │ ▼ ┌──────────────────────────────┐ │ Write to Redis (TTL: 24 min)│ └──────────────────────────────┘ │ ▼ Return token Implementation Details User Entity and UserDetails Integration In most projects, unnecessary mappings are performed between the User asset and the UserDetails objects expected by Spring Security. To reduce complexity, the User Entity is directly derived from the UserDetails interface. This makes the code cleaner and makes it "native," as outlined by Spring Security. Java @Data @Builder @NoArgsConstructor @AllArgsConstructor @Entity @Table(name = "T_APP_USER") public class User implements UserDetails { @Id @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "seq_user_gen") @SequenceGenerator(name = "seq_user_gen", sequenceName = "SEQ_APP_USER", allocationSize = 1) @Column(name = "idx") private Long idx; @Column(name = "firstname") private String firstName; @Column(name = "lastname") private String lastName; @Column(unique = true, name = "email") private String email; @Column(name = "accesskey") private String accessKey; // BCrypt-hashed @Column(name = "role") @Enumerated(EnumType.STRING) private Role role; @Override public Collection<? extends GrantedAuthority> getAuthorities() { return List.of(new SimpleGrantedAuthority(role.name())); } @Override public String getUsername() { return email; } @Override public String getPassword() { return accessKey; } @Override public boolean isAccountNonExpired() { return true; } @Override public boolean isAccountNonLocked() { return true; } @Override public boolean isCredentialsNonExpired() { return true; } @Override public boolean isEnabled() { return true; } } JWT Filter: The Gateway to Security The request to the system passes through the OncePerRequestFilter. Here, using JwtAuthenticationFilter, we parse the token in each request and populate the SecurityContext. By using the new SecurityFilterChain bean introduced with Spring Security 6, we have disabled CSRF and made session management completely stateless. Token Generation and Validation Java public interface JwtService { String extractUserName(String token); String generateToken(UserDetails userDetails); boolean isTokenValid(String token, UserDetails userDetails); } @Service public class JwtServiceImpl implements JwtService { @Value("${token.signing.key}") private String jwtSigningKey; // Base64-encoded secret key @Override public String extractUserName(String token) { return extractClaim(token, Claims::getSubject); } @Override public String generateToken(UserDetails userDetails) { return Jwts.builder() .setClaims(new HashMap<>()) .setSubject(userDetails.getUsername()) .setIssuedAt(new Date(System.currentTimeMillis())) .setExpiration(new Date(System.currentTimeMillis() + 1000 * 60 * 24)) .signWith(getSigningKey(), SignatureAlgorithm.HS256) .compact(); } @Override public boolean isTokenValid(String token, UserDetails userDetails) { final String userName = extractUserName(token); return userName.equals(userDetails.getUsername()) && !isTokenExpired(token); } private <T> T extractClaim(String token, Function<Claims, T> claimsResolver) { return claimsResolver.apply( Jwts.parserBuilder() .setSigningKey(getSigningKey()) .build() .parseClaimsJws(token) .getBody() ); } private boolean isTokenExpired(String token) { return extractClaim(token, Claims::getExpiration).before(new Date()); } private Key getSigningKey() { return Keys.hmacShaKeyFor(Decoders.BASE64.decode(jwtSigningKey)); } } High Availability: Redis Sentinel Using a single Redis instance means that the Auth service has a "Single Point of Failure." If Redis crashes, no one can access the system. This risk mitigation was achieved using Redis Sentinel. Thanks to the Sentinel structure: If the master node crashes, the dependent node is automatically promoted to master via failover. On the application side, we continuously manage these transitions using the Lettuce driver. Technical Stack and Requirements Redis Sentinel configuration: Java @Configuration public class RedisConfig { @Value("${spring.redis.sentinel.master}") private String master; @Value("${spring.redis.sentinel.nodes}") private String sentinelNodes; @Value("${spring.redis.password}") private String password; @Bean public RedisConnectionFactory redisConnectionFactory() { RedisSentinelConfiguration sentinelConfig = new RedisSentinelConfiguration() .master(master); for (String node : sentinelNodes.split(",")) { String[] hostPort = node.split(":"); sentinelConfig.sentinel(hostPort[0], Integer.parseInt(hostPort[1])); } sentinelConfig.setPassword(RedisPassword.of(password)); return new LettuceConnectionFactory(sentinelConfig); } } Plain Text yaml env: - name: spring.redis.sentinel.master valueFrom: secretKeyRef: name: redis-user-secret key: username - name: spring.redis.password valueFrom: secretKeyRef: name: redis-user-secret key: password Token cache service: Java @Service public class TokenCacheServiceImpl { private final RedisTemplate<String, String> redisTemplate; public TokenCacheServiceImpl(RedisTemplate<String, String> redisTemplate) { this.redisTemplate = redisTemplate; } public void cacheToken(String username, String token, long duration, TimeUnit unit) { redisTemplate.opsForValue().set(username, token, duration, unit); } @Cacheable(value = "tokens", key = "#username") public String getToken(String username) { return redisTemplate.opsForValue().get(username); } } Authentication service: signup and signin: Java @Service @RequiredArgsConstructor public class AuthenticationServiceImpl implements AuthenticationService { private final UserRepository userRepository; private final PasswordEncoder passwordEncoder; private final JwtService jwtService; private final AuthenticationManager authenticationManager; private final TokenCacheServiceImpl tokenCacheService; @Override public JwtAuthenticationResponse signup(SignUpRequest request) { var user = User.builder() .firstName(request.getFirstName()) .lastName(request.getLastName()) .email(request.getEmail()) .accessKey(passwordEncoder.encode(request.getAccessKey())) // BCrypt .role(Role.USER) .build(); userRepository.save(user); var jwt = jwtService.generateToken(user); return JwtAuthenticationResponse.builder().token(jwt).build(); } @Override public JwtAuthenticationResponse signin(SigninRequest request) { // 1. Check Redis cache first String cachedToken = tokenCacheService.getToken(request.getEmail()); if (cachedToken != null) { return JwtAuthenticationResponse.builder().token(cachedToken).build(); } // 2. If not cached, authenticate (DB + BCrypt) authenticationManager.authenticate( new UsernamePasswordAuthenticationToken(request.getEmail(), request.getAccessKey()) ); var user = userRepository.findByEmail(request.getEmail()) .orElseThrow(() -> new IllegalArgumentException("Invalid credentials.")); // 3. Generate token and write to Redis (24 min TTL) var jwt = jwtService.generateToken(user); tokenCacheService.cacheToken(request.getEmail(), jwt, 24, TimeUnit.MINUTES); return JwtAuthenticationResponse.builder().token(jwt).build(); } } JWT authentication filter: Java @Component @RequiredArgsConstructor public class JwtAuthenticationFilter extends OncePerRequestFilter { private final JwtService jwtService; private final UserService userService; @Override protected void doFilterInternal( @NonNull HttpServletRequest request, @NonNull HttpServletResponse response, @NonNull FilterChain filterChain ) throws ServletException, IOException { final String authHeader = request.getHeader("Authorization"); // Pass through if no Authorization header or doesn't start with Bearer if (StringUtils.isEmpty(authHeader) || !StringUtils.startsWith(authHeader, "Bearer ")) { filterChain.doFilter(request, response); return; } final String jwt = authHeader.substring(7); final String userEmail = jwtService.extractUserName(jwt); // Process only if SecurityContext has no authentication yet if (StringUtils.isNotEmpty(userEmail) && SecurityContextHolder.getContext().getAuthentication() == null) { UserDetails userDetails = userService.userDetailsService() .loadUserByUsername(userEmail); if (jwtService.isTokenValid(jwt, userDetails)) { SecurityContext context = SecurityContextHolder.createEmptyContext(); UsernamePasswordAuthenticationToken authToken = new UsernamePasswordAuthenticationToken( userDetails, null, userDetails.getAuthorities() ); authToken.setDetails(new WebAuthenticationDetailsSource().buildDetails(request)); context.setAuthentication(authToken); SecurityContextHolder.setContext(context); } } filterChain.doFilter(request, response); } } Spring Security 6 configuration: Java @Configuration @EnableWebSecurity @RequiredArgsConstructor public class SecurityConfiguration { private final JwtAuthenticationFilter jwtAuthenticationFilter; private final UserService userService; @Bean public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { http .csrf(AbstractHttpConfigurer::disable) // Stateless → no CSRF needed .authorizeHttpRequests(request -> request .requestMatchers("/auth/**").permitAll() // Auth endpoints open to all .anyRequest().authenticated() ) .sessionManagement(manager -> manager.sessionCreationPolicy(STATELESS) // No server-side session ) .authenticationProvider(authenticationProvider()) .addFilterBefore(jwtAuthenticationFilter, // JWT filter runs first UsernamePasswordAuthenticationFilter.class); return http.build(); } @Bean public PasswordEncoder passwordEncoder() { return new BCryptPasswordEncoder(); } @Bean public AuthenticationProvider authenticationProvider() { DaoAuthenticationProvider authProvider = new DaoAuthenticationProvider(); authProvider.setUserDetailsService(userService.userDetailsService()); authProvider.setPasswordEncoder(passwordEncoder()); return authProvider; } @Bean public AuthenticationManager authenticationManager(AuthenticationConfiguration config) throws Exception { return config.getAuthenticationManager(); } } Unit tests: Java @Test @DisplayName("Signin: if token is cached, should not query the DB") void testSignInWithCachedToken() { when(tokenCacheService.getToken(TEST_EMAIL)).thenReturn(TEST_TOKEN); JwtAuthenticationResponse response = authenticationService.signin( SigninRequest.builder().email(TEST_EMAIL).accessKey(TEST_PASSWORD).build() ); assertEquals(TEST_TOKEN, response.getToken()); verifyNoInteractions(authenticationManager); // No DB + BCrypt call should happen verifyNoInteractions(userRepository); } // Invalid token test — SecurityContext should remain empty @Test @DisplayName("With an invalid token, SecurityContext should remain empty") void testDoFilterInternalInvalidToken() throws Exception { when(request.getHeader("Authorization")).thenReturn("Bearer " + INVALID_TOKEN); when(jwtService.extractUserName(INVALID_TOKEN)).thenReturn(TEST_EMAIL); when(userService.userDetailsService()).thenReturn(userDetailsService); when(userDetailsService.loadUserByUsername(TEST_EMAIL)).thenReturn(userDetails); when(jwtService.isTokenValid(INVALID_TOKEN, userDetails)).thenReturn(false); jwtAuthenticationFilter.doFilterInternal(request, response, filterChain); verify(filterChain).doFilter(request, response); assertNull(SecurityContextHolder.getContext().getAuthentication()); } Summary and Conclusion With the purchasing architecture, not only a secure login screen; It has built an architecture that is extremely scalable, overcomes database bottlenecks with caching, and meets high availability (HA) standards. In particular, the modern architecture offered by Spring Boot 3 has made the security layer much more flexible. If you are starting a large-scale microservice project, you can design token management from the outset in this "stateless" and "cached" manner.
The Problem Isn't the Image Hardened container images are no longer niche. Docker open-sourced major portions of the tooling behind Docker Hardened Images under Apache 2.0 in late 2025. Chainguard and Google's distroless variants sit in the same space. The pitch across all three: fewer packages, smaller attack surface, dramatically lower CVE counts. The pitch is accurate. It is also incomplete. Most container security failures are not image failures. They are governance failures: A team pushes a debug build to production. Admission control doesn't block it because the policy is in Audit mode, not Enforce.A six-month-old deployment keeps running an ancient image digest while the team patches newer builds. Nobody detects the drift.The platform team rotates signing keys. Old pipelines keep producing images signed with the revoked key. Admission still accepts them. Nobody notices for ninety days.A vendor pushes an updated base image under the same tag. CI rebuilds against the new digest. The new digest is unsigned. Production takes it. No alert fires. None of these are CVE failures. They are governance failures — gaps in how images are produced, attested, verified, and monitored. Swapping the base image to a hardened variant changes none of them. A signed-and-attested hardened image in a cluster that doesn't verify signatures is operationally equivalent to a signed Ubuntu image in that cluster: the signature is decorative. I recently worked on migrating a regulated production workload onto a hardened-image baseline. Lab 12 of my docker-security-practical-guide repository is a sanitized, reproducible distillation of what that work taught me. The short version: the value is in the control plane around the image, not the image itself. The Trust Control Plane in 60 Seconds In practice, the hardest part is not enabling hardened images. It is operating trustworthy deployments at scale without slowing engineers down. The operating model has three layers, joined by a feedback loop: Supply Chain layer – images are signed (cosign keyless against Fulcio), attested with an SBOM (syft + CycloneDX), and scanned for vulnerabilities (grype). The output: an image whose origin and contents are independently verifiable by anyone.Trust layer – an admission controller (Kyverno) verifies signatures and attestations before any pod is scheduled. The admission policy is the unit of governance: it encodes which signers, which attestations, and which constraints are required for a workload to start.Enforcement layer – continuous drift detection answers the question: admission can't: has the digest drifted since we admitted it? Has the signing key been revoked? Has a new unsigned workload landed via a controller that bypasses admission?Feedback loop – drift findings feed back into the supply chain: a drift event produces a rebuild; an admission rejection produces a ticket. Without the loop, the enforcement layer becomes an alerting backwater that engineers mute. FIGURE 1 — Trust control plane for cloud-native software supply chain security.The architecture separates supply chain generation, admission-time trust verification, and continuous runtime enforcement into independent layers connected through a feedback loop. The pattern is vendor-agnostic: any compatible signing, admission, and drift-detection components can fulfill these roles. The bottom line: a hardened image is one input to the supply chain layer. Without trust verification, it's indistinguishable from a regular image at deploy time. Without enforcement, untrusted images coexist with hardened images in the same cluster. Without the feedback loop, trust state drifts silently. Admission Control: Where Governance Gets Teeth The trust layer is where the control plane becomes operationally real. In the lab, Kyverno's verifyImages rule asserts that every image carries a cosign signature from an approved identity. Here's the core of the policy: YAML apiVersion: kyverno.io/v1 kind: ClusterPolicy metadata: name: require-signed-images spec: validationFailureAction: Enforce rules: - name: verify-cosign-keyless match: any: - resources: kinds: [Pod] verifyImages: - imageReferences: ["ghcr.io/opscart/*"] attestors: - entries: - keyless: subject: "https://github.com/opscart/*" issuer: "https://token.actions.githubusercontent.com" required: true The subject and issuer together define who is trusted. For DHI images, these values point to Docker's signing identity. For Chainguard, Chainguard's. The shape of the policy is identical in all cases — only the identity matcher changes. When someone deploys an unsigned image, the rejection is immediate and actionable: Shell $ kubectl run test --image=nginx:latest --restart=Never Error from server: admission webhook "validate.kyverno.svc-fail" denied the request: resource Pod/default/test was blocked due to the following policies require-trusted-registry: trusted-registries-only: 'validation error: Image must come from a trusted registry. Allowed: dhi.io/*.' FIGURE 2 — Kyverno admission webhook rejecting an nginx pod from an untrusted registry. Capture from terminal: kubectl run rejected-test --image=nginx:latest --restart=Never (with cluster up and policies applied). Catching an unsigned image at admission costs one re-run of kubectl apply. Catching the same workload running in production a week later costs a security ticket, an incident response, and possibly a regulatory disclosure conversation. Moving rejection earlier is the highest-leverage decision in the entire model. Phased Rollout: Audit Before Enforce In production, you don't flip everything to Enforce on day one. The lab uses a phased approach: the trusted-registry policy runs in Enforce mode (hard gate on image origin), while signature and SBOM verification policies run in Audit mode (log violations, don't block). This gives teams a migration runway: they can see which workloads would fail and fix them before the policies graduate to Enforce. The shift from Audit to Enforce is a single-field YAML change. Signing Your Supply Chain: Keyless Cosign The supply chain layer produces the artifacts that admission verifies. A common modern approach uses cosign with GitHub Actions OIDC for keyless signing — no private keys to manage, rotate, or leak. The mechanism: GitHub Actions mints a short-lived OIDC token at workflow time. Cosign exchanges it for an ephemeral certificate from Sigstore Fulcio, signs the image, and destroys the key immediately. The certificate records which workflow, on which repository, at which commit, produced the signature. The signature is logged in Sigstore Rekor's public transparency log. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The lab's pipeline implements a full build → push → sign → attest → verify flow that fails closed if verification breaks. The complete workflow and run history is public. The important property is that anyone can independently verify the signed artifact. Shell cosign verify \ --certificate-identity-regexp \ "^https://github\.com/opscart/docker-security-practical-guide/ \.github/workflows/supply-chain-gate\.yml@.+$" \ --certificate-oidc-issuer \ "https://token.actions.githubusercontent.com" \ ghcr.io/opscart/docker-security-practical-guide/dhi-sample-app:latest Verification for ghcr.io/opscart/.../dhi-sample-app:latest -- The following checks were performed on each of these signatures: - The cosign claims were validated - Existence of the claims in the transparency log was verified offline FIGURE 3 — cosign verify succeeds for any reader, without shared secrets. Capture from terminal: run the cosign verify command above against the published image at ghcr.io. This is what "supply chain security" means in practice: not "we sign our images," but "our trust assertions are independently verifiable by anyone, against neutral infrastructure, without prior trust setup." The published image can be verified directly against the public artifact. Fleet Drift: The Problem Nobody Watches Admission is point-in-time. Production is continuous. The enforcement layer's job is to answer the questions that admission can't: has the digest drifted since we admitted it? Has a new unsigned workload landed via a controller that bypasses admission? The lab's E1 experiment runs a drift audit against a synthetic 12-service fleet mixing DHI, Docker Hub, internally-built, and abandoned images. The fleet is intentionally constructed with an explicit variation matrix — the numbers below describe the synthetic fleet's structure, not measurements from a deployed environment. In this synthetic fleet, unsigned services averaged 13.0 critical CVEs while signed-and-verified services averaged 0.0. The exact ratio will vary by environment, but the audit makes the trust gap continuously visible. FIGURE 4a — Fleet drift audit: signing state vs CVE correlation across the synthetic fleet. Capture from terminal: run ./experiments/E1-drift-observation/analyze-drift.py. Screenshot Sections 1–3 (Fleet Summary + Origin×Signing Correlation + Signing State → CVE Accumulation) FIGURE 4b — Remediation order: compliance-scope risk concentration and prioritized action queue. Same script output, Sections 4 + 7 (Compliance Scope Risk Concentration + Recommended Remediation Order) The ratio isn't the point — your fleet will produce different numbers. What the control plane provides is the continuous, attributable surfacing of whatever the ratio actually is, including cases where the supposed benefit of hardening is harder to defend. That honest feedback loop is what turns the audit from a compliance checkbox into a supply chain prioritization tool. The Substitution Test A useful test for whether you've found an architectural pattern or a vendor recipe: can you swap a major component and have everything else continue to work? For this architecture, the test is straightforward. The lab demonstrates three configurations: Docker Hardened Images (dhi.io), Chainguard Images (cgr.dev/chainguard), and a self-built Alpine base signed against a project-owned GitHub Actions OIDC identity. In all three, the Kyverno policy structure is identical. The drift audit runs unchanged. The SBOM verification runs unchanged. Edits are confined to the identity matcher and the image references. The implication: "Should we standardize on DHI or Chainguard?" is a commercial decision (pricing, catalog coverage, support), not an architectural one. The architectural decision is whether to operate the trust control plane at all. A team that has invested in the control plane has built portable institutional capability. A team that has invested in "we use DHI" has bought a product, and a future migration off DHI is a structural rewrite rather than a configuration update. Production Friction: What Actually Goes Wrong The model works. It is also not free. Here are the operational costs my team hit, documented in detail in the companion repo's TROUBLESHOOTING.md: No shell. Distroless hardened images don't include /bin/sh, curl, wget, cat, or ls. When an engineer pages at 2 AM and runs kubectl exec -it pod -- /bin/sh, the command fails. The remediation is kubectl debug with an ephemeral debug container attached to the pod's process namespace. Train your on-call rotation on kubectl debug before migration, not after. The lab's E5 experiment documents three debug patterns (ephemeral containers, dev-variant images in dev namespaces only, pre-built debug sidecars) with runbook scenarios for unreachable services, crashloops, and OOM kills. Migration is not a FROM line change. The default user is nonroot (UID 65532), not root. Library paths differ. pip install --user installs to /home/nonroot/.local, not /root/.local. Required system packages (ca-certificates, timezone data) that come for free in stock bases must be explicitly carried over. The lab's Dockerfile required three iterations before the build succeeded locally: shell-form RUN failed (no /bin/sh), then pip --user installed to the wrong path, then requirements.txt pinned package versions that didn't exist on PyPI. Each of these is a 30-second local fix — and a 5-minute GitHub Actions round-trip if you don't test locally first. Signature paths vary by vendor. DHI signatures resolve via registry.scout.docker.com, not at the image's own registry path. Kyverno handles this through the policy's repository field, but any custom verification tooling needs to know. Plan to audit verification code before migration. Kyverno has schema gotchas. rekor and ctlog blocks must be inside keys, not siblings. webhookTimeoutSeconds is capped at 30. mutateDigest: true is incompatible with validationFailureAction: Audit. PolicyException requires an explicit feature flag. Each of these cost me 30–60 minutes of debugging — they're in TROUBLESHOOTING.md, so they don't cost you the same. None of these are deal-breakers individually. All of them together are why migrations slip from "next quarter" to "abandoned after two months." Budget for friction. When This Is Overkill The investment's value scales with three factors: regulatory pressure (HIPAA, PCI-DSS, SOC 2 Type II, FDA 21 CFR Part 11), fleet size and heterogeneity (8+ clusters, dozens of teams pushing images), and blast radius (pharmaceutical patient data vs. internal dashboard). Concretely: pre-production tools, side projects, prototypes, and developer sandboxes do not need this. They benefit from a hardened base image (free) and should not be put behind the full trust control plane. The overhead of policy maintenance, key rotation, and drift remediation outstrips the risk reduction. For most workloads outside regulated production, the supply chain layer alone — sign and SBOM your builds — captures most of the available value at a fraction of the cost. Conclusion: Architecture Over Image Choice Hardened images are useful. The point of this article is that they are one component of a broader architectural pattern, and the security outcomes regulated teams want are properties of the pattern, not the component. A team that adopts hardened images without the surrounding pattern has made a real but limited improvement. A team that adopts the pattern with any reasonable image vendor — DHI, Chainguard, or a self-built base — has built portable institutional capability. The substitution test is the diagnostic: ask whether a future migration away from your current image vendor is a configuration edit or a structural rewrite. If it's the former, you have the pattern. If it's the latter, you have a product dependency. The companion repository at github.com/opscart/docker-security-practical-guide (tag v1.12.0) contains everything in this article: working Kyverno policies, a keyless-signed sample image you can pull and verify right now, fleet drift audits, and five hypothesis-driven experiments. The cosign verify command above works against the published artifact today. Spend the design effort on the pattern. The image will be replaceable. The governance is what survives vendor replacement. This article is adapted from a longer write-up on OpsCart, which includes the complete threat model, substitution-test configurations, and an extended troubleshooting log.
Partitioning and Z-Ordering have long been fundamental techniques in Delta Lake for optimizing data layout and query performance. However, these methods require significant upfront design and ongoing maintenance and they often struggle to adapt to changing data and query patterns. Databricks Liquid Clustering introduced with Delta Lake 3.0 goes beyond traditional partitioning and Z-Order, offering a self-tuning, flexible approach to organizing data that is especially powerful for Unity Catalog managed tables. In this article, we’ll explore how Liquid Clustering works, how it compares to traditional methods, and how to implement it in Databricks Unity Catalog for improved performance and simpler data management. Recap: Partitioning and Z-Order Limitations Before diving into Liquid Clustering, it’s important to understand the challenges of conventional partitioning and Z-Ordering in large Delta Lake tables: Design Complexity & Rigidity: Choosing an optimal partitioning scheme is difficult and usually fixed. A static Hive-style partition strategy often demands careful upfront planning to avoid data skew and concurrency conflicts and it cannot easily adapt if query patterns change. Changing partition columns later means expensive data rewrites.Partition Explosion & Metadata Overhead: If you partition on high-cardinality columns or many levels, you may end up with too many small partitions. This proliferation of tiny files and directories increases metadata overhead and slows down query planning.Need for Additional Clustering (Z-Order): Z-Ordering is often applied on top of partitions to co-locate related data. While Z-Order can improve data skipping, it is expensive to maintain it requires heavy shuffle and rewrite jobs and does not handle concurrent writes well. In other words, Z-Ordering jobs can be lengthy and costly and must be re-run as new data arrives to maintain clustering.Manual Tuning & Maintenance: Both partitioning and Z-Order require continuous tuning. Data engineers must monitor query patterns and manually decide how to partition or when to re-Zorder. This ongoing maintenance is time-consuming and error-prone. In summary, traditional partitioning/Z-ordering yields performance benefits but at the cost of rigidity and operational overhead. This sets the stage for a more adaptive solution. What Is Liquid Clustering? Liquid Clustering is a new data layout strategy in Databricks Delta Lake designed to replace traditional partitioning and Z-Ordering for Delta tables. The name liquid signifies flexibility data is clustered by one or more columns in a way that can evolve over time without strict, static partitions. Key characteristics of Liquid Clustering include: Dynamic, Self-Tuning Layout: Instead of static partitions, data is dynamically clustered based on specified clustering keys. The table’s storage layout automatically adjusts to changing data and query patterns, incrementally clustering new data as it is written. This means the data layout flows with your workload.Simplicity in Key Selection: You choose a set of clustering columns based on query access patterns, typically the columns most commonly used in WHERE filters or joins. You don’t need to worry about column cardinality, order of keys or file size tuning the platform handles optimal file sizing and clustering internally. Even high-cardinality columns can be used effectively, which would be impractical as partition keys.Flexibility to Change Keys (No Rewrites): Perhaps the most revolutionary aspect is that clustering keys can be redefined without rewriting existing data files. If your query patterns shift, you can alter the clustering columns and the system will gradually reorganize data for the new keys. There’s no massive upfront cost of re-partitioning the entire dataset past data doesn’t need an immediate rewrite.Skew-Resistant & Efficient Storage: Liquid Clustering is designed to maintain balanced file sizes and avoid the pitfalls of skewed partitions. Under the hood, the data engine can combine or split clustering ranges to keep files at an optimal size.Reduced Maintenance Overhead: Because the data layout adapts automatically, the need for manual maintenance is drastically reduced. You no longer have to schedule regular Z-Ordering jobs or hand-tune partition schemes. Liquid Clustering, especially in its automatic mode, offloads these decisions to Databricks. Databricks recommends using Liquid Clustering for most new Delta tables going forward, especially for tables that are large, have high-cardinality filter columns, experience data skew, or have evolving access patterns. It simplifies data engineering by set it and forget it clustering. In fact, thousands of customers have already adopted it as of 2025, over 3,000 monthly customers were writing 200+ PB of data into Liquid Clustered tables. Liquid Clustering vs Traditional Methods Liquid Clustering addresses the limitations of partitions and Z-ordering in several ways: No Rigid Partition Boundaries: Unlike Hive partitions, liquid clustering can store a range of values in each data file. This fluid layout avoids issues like tiny partitions or unbalanced file sizes.Incremental and Low-Shuffle Clustering: New data is clustered as it’s ingested, without requiring a full table rewrite. When you enable clustering on a table, Databricks flags the table to cluster future writes according to the specified keys. Each new INSERT or MERGE automatically writes out files clustered on those keys, and small files are merged as needed. This incremental approach means no huge one-time sort jobs every time you add data. Maintenance operations like OPTIMIZE still play a role but they can operate more efficiently since the incoming data is already sorted/clustered on write. Notably, the OPTIMIZE command for a liquid-clustered table can be more adaptive than traditional OPTIMIZE+ZORDER it only rearranges data that isn’t well clustered yet rather than always rewriting everything.Adapting to Change Without Rewriting Everything: In a partitioned table, if you realize a month later that queries would run faster partitioned by a different column, you’d have to repartition the entire dataset. With Liquid Clustering, you can simply issue an ALTER TABLE to change the clustering column set. The system will use the new keys for all future writes, while existing files remain as they are until an optimization is triggered. You can later run a full optimize to reorganize historical data under the new scheme if needed. This means you can respond to evolving query patterns without incurring an immediate cost for reprocessing the whole table.Better Concurrency and Fewer Conflicts: Because Liquid Clustering avoids overly granular partitions and heavy-duty clustering jobs, it also mitigates concurrency problems. Traditional partitions can suffer write conflicts if too many jobs target the same partition, and Z-order optimize jobs can conflict with concurrent writes. Liquid Clustering’s design results in fewer such bottlenecks.Performance Gains: Ultimately, the goal is faster queries and lower cost. By clustering data on the actual query predicates, Liquid Clustering improves data skipping. This leads to less IO and faster execution. In one benchmark, Databricks observed that a 1 TB warehouse dataset clustered with Liquid Clustering ran 2.5× faster to optimize (cluster) than using Z-Ordering, and yielded significantly better query performance than both partitioning or Z-Order. In real workloads, users have reported dramatic improvements; for example, Healthrise (a Databricks customer) saw some queries run up to 10× faster after enabling Automatic Liquid Clustering on their tables. We’ll discuss Automatic mode shortly. How Liquid Clustering Works (Under the Hood) At a high level, manual Liquid Clustering works by clustering data files on chosen key columns, while automatic Liquid Clustering adds an intelligent layer to choose and adjust those keys for you. Let’s break down the mechanisms: Clustering on Write: When you define clustering keys for a Delta table, the Delta engine ensures that newly written data is organized according to those keys.Maintenance and OPTIMIZE: Over time, as data is appended, you may still accumulate some fragmentation. The OPTIMIZE command can be used on a clustered Delta table to compact small files and sort data more finely according to the clustering columns. Unlike Z-Ordering, an optimize on a liquid-clustered table doesn’t always have to rewrite all files it focuses on incremental clustering, merging files that are sub-optimally placed. You can think of it as tightening the clustering. If you change the clustering columns via ALTER TABLE, you can run OPTIMIZE FULL to recluster all existing records under the new key order. In normal operation, Databricks recommends running periodic OPTIMIZE to keep performance optimal, but these operations are more lightweight than traditional heavy Z-order jobs.Data Skipping with Statistics: Delta Lake maintains statistics that the query engine uses for data skipping. Liquid Clustering maximizes the effectiveness of data skipping by ensuring those min/max ranges align with query filters. Enabling Automatic Clustering To use Automatic Liquid Clustering, you need to have Predictive Optimization enabled for your workspace (this is the feature in Unity Catalog that handles these background optimizations). Many new Databricks accounts have this on by default since late 2024, but it can also be enabled via the account console (under Feature Enablement). Assuming it’s enabled, turning on Automatic clustering for a table is straightforward: SQL: Use the CLUSTER BY AUTO clause when creating or altering a Delta table. For example, to create a new table in Unity Catalog with auto clustering: SQL -- Creating a Unity Catalog managed table with Automatic Liquid Clustering CREATE TABLE main.analytics.user_events ( user_id STRING, event_type STRING, event_date DATE, details STRING ) CLUSTER BY AUTO; -- enables automatic liquid clustering on this table SQL ALTER TABLE main.analytics.user_events CLUSTER BY AUTO; This instructs Databricks to begin monitoring the table’s workload and to auto-select clustering keys for optimal performance. The table does not need to have any manual keys set; the system will determine them. (Under the hood, the first time it chooses keys, it will update the table’s metadata with those columns as clustering keys.) PySpark API: In code, you can also enable auto clustering when writing data. For instance, using the DataFrame Writer API in PySpark: Python # df is a DataFrame we want to save as a Delta table with auto clustering df.write.format("delta") \ .option("clusterByAuto", "true") \ .mode("overwrite") \ .saveAsTable("main.analytics.user_events_auto") The above will create the user_events_auto table as a Unity Catalog managed table with automatic clustering enabled. (If you want to provide an initial hint for clustering columns, you can combine .clusterBy("col1", "col2") with the clusterByAuto=true option, but it’s not required – the system will figure it out if you leave it open.) Once Automatic mode is on, no further action is needed from the user. Databricks will handle running background optimize jobs as needed. It’s worth noting that these maintenance operations run on a serverless compute in the background. The benefit is you no longer need to schedule OPTIMIZE or VACUUM on your own; predictive optimization will run them at optimal times. Using Manual Liquid Clustering (Custom Clustering Keys) In some cases, you may want to manually specify the clustering columns. Unity Catalog supports manual Liquid Clustering on managed tables as well. Here’s how to use it: Table Creation with Cluster Keys: You can define clustering keys in the CREATE TABLE statement via a CLUSTER BY clause. For example: SQL -- Create a Delta table clustered by specific columns (manual clustering) CREATE OR REPLACE TABLE main.analytics.sales_data ( sale_id BIGINT, region STRING, product STRING, sale_date DATE, amount DECIMAL(10,2) ) CLUSTER BY (region, sale_date); In this example, the table’s data will be clustered by region and sale_date. This means each file written will tend to contain a narrow range of region values and sale_date values. This is analogous to creating a partitioned table on multiple keys, but without creating separate directories for each region or date. Altering an Existing Table: If you have an unpartitioned Delta table and want to enable clustering on it, use an ALTER statement. For instance: SQL ALTER TABLE main.analytics.sales_data CLUSTER BY (region, sale_date); This will register region and sale_date as the clustering keys for sales_data. As mentioned, this does not rewrite existing files immediately. It flags the table so that future writes will be clustered by these keys. Any new data you append or merge into sales_data will now be written in clustered order. Data that was already in the table remains in its original layout until you optimize. Reclustering Existing Data: To apply the new clustering to old files, you can run an OPTIMIZE operation. For a large table, you might do this during a maintenance window. For example: Python OPTIMIZE main.analytics.sales_data; The above will compact small files and cluster data incrementally. If you recently changed the clustering keys and want to force a full re-cluster of all data under the new key order, use OPTIMIZE main.analytics.sales_data **FULL**. An OPTIMIZE FULL will read and rewrite all files in the table, arranging them according to the current clustering columns. In most cases, a regular OPTIMIZE will suffice, as it will naturally pick up new keys over time. PySpark Write with Clustering Keys: You can also write data from Spark with clustering, similar to how you’d write partitioned data. For example: Python # Given a Spark DataFrame df, write it to a Delta table with clustering on specified keys df.write.format("delta") \ .mode("append") \ .clusterBy("region", "sale_date") \ .saveAsTable("main.analytics.sales_data"); Here, .clusterBy("region", "sale_date") ensures the data in df gets written out clustered by those columns. If the table sales_data was not already created, this will create it with those cluster keys. Finally, remember that Liquid Clustering is supported only on Delta tables with the latest protocols. Enabling it will bump your table’s Delta protocol version which older clients cannot read. In a Databricks environment this is usually not an issue, but be cautious if you have external readers/writers that might be using older Delta Lake libraries. Conclusion Liquid Clustering represents a major evolution in data layout management for the Lakehouse. By moving beyond the rigidness of partitioning and the heavy operational cost of Z-Ordering, it delivers a simpler and more adaptive way to optimize tables. For Data Engineers, this means less time agonizing over partition strategies and maintenance jobs, and more time focusing on data and insights. With Unity Catalog’s Automatic Liquid Clustering, the process is taken a step further clustering becomes a self-driving process, leveraging query insights to continuously improve performance. In summary, Databricks Liquid Clustering dynamically organizes data based on actual usage, can adjust without expensive rewrites, and has been shown to boost query performance significantly. As you design your next Delta Lake tables in Unity Catalog, consider leveraging Liquid Clustering from the start it can simplify your architecture and ensure your tables automatically stay optimized as your data (and its use cases) grow.
EMR platforms are unique software beasts. They must live longer than most online apps due to regulatory constraints. A startup may reinvent its primary product every three years, but an EMR system must retain data integrity and workflow consistency for decades. This lifespan is difficult. How do you change a healthcare system without violating strict compliance rules? API-first thinking is the answer. This method goes beyond data endpoint exposure. The issue is architectural survival. In a business where "move fast and break things" is unacceptable, architects may offer modular development, safer changes, and long-term stability by prioritizing the API. The Unique Constraints of EMR Architecture EMRs are not typical CRUD applications. In a standard business app, updating a record might just mean overwriting a row in a database. In healthcare, that simple update triggers a cascade of regulatory realities. Every change requires an audit trail. Data retention policies dictate that information cannot simply vanish. Clinical decisions are based on the history of that data, meaning immutability is often more important than mutability. Furthermore, healthcare workflows are long-lived. A patient's treatment plan might span months or years. An architecture built around short-lived features will crumble under the weight of these persistent workflows. You cannot refactor a database schema overnight if it breaks the continuity of a patient's care record. This is why stability is the paramount quality attribute of any EMR. What “API-First” Really Means in Regulated Systems In the context of regulated systems, API-first means designing contracts before writing a single line of implementation code. It requires treating your APIs as long-term public interfaces, even if the only consumer initially is your own frontend team. They are not internal shortcuts; they are binding agreements. This approach forces you to separate clinical workflows from user interface concerns. A button click on a screen is transient; the clinical action it represents is permanent. By defining the API first, you establish a boundary that encapsulates compliance logic. The API becomes the gatekeeper. It enables regulatory compliance regardless of data access via mobile app, web portal, or third-party integration. Contract Stability as a Core Architectural Principle Breaking an API contract in an EMR is far costlier than breaking a UI component. If a button breaks, a user complains. If an API contract breaks, integrations fail, data synchronization stops, and patient care can be impacted. Therefore, request and response models must be designed to survive years of change. Architects must avoid overfitting contracts to current UI needs. Just because a specific screen needs a patient's name and their last three blood pressure readings doesn't mean you should create an endpoint specifically for that view. Instead, design resources that represent the domain accurately. This decoupling protects the backend from the volatility of frontend trends. Backward Compatibility Without Freezing Innovation The fear of breaking existing clients often paralyzes development teams. However, API-first design provides a path to evolve without stagnation. The key is distinguishing between additive changes and destructive changes. Adding a new field to a response is generally safe; removing one or renaming one is not. In .NET Web APIs, versioning strategies are critical. You can support legacy consumers while enabling new features for modern clients. This transforms deprecation from a sudden emergency into a managed process. You provide a sunset period for old versions, giving consumers time to migrate without disruption. In regulated systems, versioning is not a technical afterthought. Explicit versioned routes allow EMR platforms to evolve safely, giving downstream systems time to migrate without disrupting clinical workflows. Plain Text ```csharp [ApiController] [Route("api/v1/encounters")] public class EncountersV1Controller : ControllerBase { [HttpPost("{id}/sign")] public IActionResult SignEncounter(Guid id) { // Business rule: encounter must be complete before signing _encounterService.Sign(id); return Ok(); } } Modeling Regulated Workflows Through APIs Your API should encode business rules and compliance constraints directly. It is dangerous to rely on the UI to validate clinical workflows. If a doctor must sign a note before billing can occur, that rule belongs in the API layer, not in the JavaScript of the frontend. Consistency: Business rules enforced at the API level apply to every consumer, preventing "workflow drift" between the web portal and mobile apps.Security: Bypassing the UI via a direct API call (e.g., using Postman) should not allow a user to bypass compliance checks.Clarity: The API endpoints should reflect real-world clinical states (e.g., POST /encounters/sign) rather than generic database operations. API-First and Modular EMR Growth Monolithic EMRs eventually become unmaintainable. Decoupling large domains like scheduling, assessments, reporting, and case management is possible with API-first design. Well-defined interfaces allow you to upgrade the scheduling engine without affecting the billing module. This modularity supports parallel development. Different teams can work on different modules simultaneously without constant merge conflicts or integration friction. It also lays the foundation for extensibility. If a client needs a custom integration for a specific device, your public-facing API is already robust enough to handle it because it’s the same API you use internally. .NET-Specific Considerations for API-First EMRs ASP.NET Core is an excellent framework for building long-lived API platforms. Its middleware pipeline allows you to handle cross-cutting concerns like logging and validation globally. However, structuring your solution requires discipline. Controllers should be thin, delegating logic to service layers that handle the heavy lifting. Using Data Transfer Objects (DTOs) is non-negotiable. Never give API consumers access to internal domain entities or Entity Framework models. DTO buffers allow database schema refactoring without breaching the public contract. Your architecture should prioritize validation, authorization, and detailed auditing over afterthoughts. DTO boundaries are a compliance safeguard. They allow internal schema evolution while preserving external contracts, critical for EMR platforms that must retain compatibility over decades. Plain Text ```csharp // Entity (internal, mutable, persistence-focused) public class EncounterEntity { public Guid Id { get; set; } public DateTime SignedAt { get; set; } public string InternalNotes { get; set; } } // DTO (public, stable, contract-focused) public class EncounterDto { public Guid Id { get; set; } public bool IsSigned { get; set; } } Security, Authorization, and Role-Based Access Authorization in healthcare is complex. It is rarely a simple binary of "admin" vs. "user." You have doctors, nurses, auditors, billing specialists, and patients, all with overlapping permissions. This complexity cannot be delegated to the UI. Scope: Design APIs around granular scopes and responsibilities, ensuring a nurse can view a chart but only a doctor can sign an order.Context: Authorization logic must understand the context. A doctor may see patients solely in their ward.Enforcement: Use.NET policies to enforce these restrictions at the controller or action level to catch all requests. Lessons Learned From Long-Term EMR Ownership Looking back at years of EMR development, the cost of early shortcuts is evident. Every time we bypassed the API to hack a feature directly into the database or coupled the UI too tightly to the backend, we paid for it with interest later. The API-first approach drastically reduced risk during major platform changes. When we needed to rewrite our entire frontend framework, the backend remained stable. We didn't have to reinvent our compliance logic because it was safely encapsulated behind our API contracts. I would tighten contract design reviews if I started over. Taking time to design the interface right is more important than coding speed. Final Thoughts: Building EMRs That Outlast Trends Technology trends fade. JavaScript frameworks rise and fall. But medical records must persist. An EMR system must survive multiple generations of UI rewrites and shifting regulatory landscapes. API-first design is the strategy for this longevity. It separates your system's volatile portions from its compliance-heavy core. Architects in this field must supply features and maintain system integrity throughout time. By investing in solid, well-designed APIs today, you assure your platform's longevity.
TL;DR A single straggling node held up a 4-node distributed training job. We found it by fanning out one SQL query to all four nodes and getting the answer in under a second. This is distributed GPU training debugging with eBPF – no central service, no Prometheus, no time-series database, just the same single-binary agent already running on each machine. The Problem We Kept Hitting We’ve been building Ingero — an eBPF agent that traces CUDA API calls and host kernel events to explain GPU latency. Until v0.9, it was single-node only. Trace one machine, explain what happened on that machine. For single-GPU inference or training, that worked well. But distributed training spreads the debugging surface across machines. When a 4-node DDP job slows down, the question is always: which node? And then: why? nvidia-smi on each machine reports healthy utilization. dstat shows nothing obvious. The typical workflow is SSH-ing into each box, eyeballing logs, diffing timestamps across terminals, and hoping the issue is still happening. We wanted a cross-node investigation without adding infrastructure. The question was: what’s the simplest architecture that works? What We Shipped in v0.9.1 Three features, all built on top of the existing per-node agent. No new services, no new daemons, no new ports. 1. Node Identity Every event now carries a node tag. The agent stamps each event with a name from a --node flag, an ingero.yaml config value, or the hostname as fallback: Shell sudo ingero trace --node gpu-node-01 Event IDs become node-namespaced (gpu-node-01:4821) so databases from different nodes can merge without collisions. For torchrun workloads, rank and world size are auto-detected from environment variables (RANK, LOCAL_RANK, WORLD_SIZE) — no extra configuration needed. 2. Fleet Fan-Out Queries Each Ingero agent already exposes a dashboard API over HTTPS (TLS 1.3, auto-generated ECDSA P-256 cert if no custom cert is provided). The new fleet client sends the same query to every node in parallel, collects the results, and concatenates them with a node column prepended. For production clusters, the client supports mTLS — --ca-cert, --client-cert, --client-key — so both sides authenticate. Plain HTTP is available via --no-tls but requires an explicit opt-in, and even then, it’s intended for trusted VPC networks only. The --nodes flag works for ad-hoc queries, but for anything beyond a handful of nodes, the node list goes into ingero.yaml once and every command picks it up automatically: YAML fleet: nodes: - gpu-node-01:8080 - gpu-node-02:8080 - gpu-node-03:8080 - gpu-node-04:8080 A full example config is in configs/ingero.yaml. Here’s what it looked like when we ran it against a 4-node cluster where one node was misbehaving: Shell $ ingero query --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 \ "SELECT node, source, count(*) as cnt, avg(duration)/1000 as avg_us FROM events GROUP BY node, source" node source cnt avg_us ---------------- ------ ----- ------ gpu-node-01 4 11009 5.2 gpu-node-01 3 847 18400 # ← 9x higher than peers gpu-node-02 4 10892 5.1 gpu-node-02 3 412 2100 gpu-node-03 4 10847 5.3 gpu-node-03 3 398 1900 gpu-node-04 4 10901 5.0 gpu-node-04 3 421 2200 8 rows from 4 node(s) Node 1 jumps out immediately: 847 host events at 18.4ms average, while the other three sit around 2ms. One more command to see the causal chains: Shell $ ingero explain --nodes gpu-node-01:8080,gpu-node-02:8080,gpu-node-03:8080,gpu-node-04:8080 FLEET CAUSAL CHAINS - 2 chain(s) from 4 node(s) [HIGH] [gpu-node-01] cuLaunchKernel p99=843us (63.9x p50) - 847 sched_switch events + heavy block I/O Root cause: 847 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset; Add nice -n 19 to background jobs [MEDIUM] [gpu-node-01] cuMemAlloc p99=932us (5.0x p50) - 855 sched_switch events + heavy block I/O Root cause: 855 sched_switch events + heavy block I/O Fix: Pin training process to dedicated cores with taskset Both chains are on gpu-node-01. The other three nodes have zero issues. The root cause: CPU contention from block I/O — checkpoint writes preempting the training process. Two commands to go from “distributed training is slow” to “pin the training process on node 1 and investigate the I/O source.” 3. Offline Merge and Perfetto Export Not every environment allows live HTTP queries between nodes. Air-gapped clusters, locked-down VPCs, compliance constraints — there are real reasons the network path isn’t always available. For those cases, ingero merge combines SQLite databases from each node into a single queryable file: Shell # 1. Collect traces from each node scp gpu-node-01:~/.ingero/ingero.db node-01.db scp gpu-node-02:~/.ingero/ingero.db node-02.db # 2. Merge and analyze ingero merge node-01.db node-02.db -o cluster.db ingero explain -d cluster.db Stack traces are deduplicated by hash. Events keep their node-namespaced IDs. Old databases that predate the node column work with --force-node. For visual timeline analysis, ingero export --format perfetto produces a Chrome Trace Event Format JSON that opens in ui.perfetto.dev. Each node gets its own process track. Causal chains show up as severity-colored markers. The straggler is visible at a glance in the timeline. Why We Built It This Way The obvious approach to multi-node observability is a central collector: ship events to a time-series database, build dashboards, set up alerts. Prometheus, Datadog, Honeycomb — the well-trodden path. We deliberately avoided that. No new infrastructure. Ingero is a zero-config, single-binary agent with no dependencies. Adding a central collector contradicts that. The fleet client is 400 lines of Go in the existing binary. It reuses the HTTPS API the agent already exposes. Nothing new to deploy, nothing new to secure — the same TLS 1.3 + mTLS configuration that protects a single node’s dashboard protects the entire fleet. Client-side fan-out is simple and sufficient. The CLI sends concurrent HTTP requests, collects results, and merges them locally. A sync.WaitGroup, some JSON decoding, column concatenation. No distributed query planning, no consensus protocol, no coordinator election. For 4-50 nodes, this is the right level of complexity. Partial failure is first-class. If one node is unreachable, results from the others still come back, plus a warning. No all-or-nothing semantics. In practice, the unreachable node is often the one in trouble — and knowing which nodes failed is diagnostic information in itself. Clock skew is measured, not ignored. eBPF timestamps come from bpf_ktime_get_ns() (CLOCK_MONOTONIC), which is per-machine. When correlating events across nodes, clock differences matter. The fleet client runs NTP-style offset estimation in parallel with the actual query — 3 samples per node, median filter. On a typical LAN with sub-millisecond RTT, precision should be well under 10ms. If skew exceeds a threshold, it warns. This adds zero latency since it runs concurrently with the data query. Offline merge covers air-gapped environments. Some production GPU clusters have no internal HTTP connectivity between nodes. SCP the databases, merge locally, investigate. The merge path also serves as a permanent record of the cluster state at investigation time. MCP: AI-Driven Fleet Investigation The fleet is also accessible through Ingero’s MCP server via the query_fleet tool. Here’s what the raw tool output looks like for a chains query across the same 4-node cluster: Python query_fleet(action="chains", since="5m") Fleet Chains: 2 chain(s) [HIGH] gpu-node-01 | cuLaunchKernel p99=843us (63.9x p50) | 847 sched_switch events + heavy block I/O [MEDIUM] gpu-node-01 | cuMemAlloc p99=932us (5.0x p50) | 855 sched_switch events + heavy block I/O That’s the complete response — an AI assistant gets this back from one tool call, no SSH access to each node, no manual SQL. The tool supports four actions: chains (causal analysis), sql (arbitrary queries), ops (operation breakdown per node), and overview (event counts). Clock skew warnings are prepended automatically when detected. Where This Stands v0.9.1 is the initial step in cluster-level tracing, not the destination. What we have now works well for the reactive investigation workflow: something went wrong, we need to find out what and where. Fan-out queries, offline merge, Perfetto export — these are diagnostic tools for after the fact. We’re actively working on cross-node correlation and straggler detection — more updates coming soon. And since the instrumentation sits on host-level eBPF rather than vendor-specific hooks, none of this is limited to a specific GPU vendor. The bet is that client-side fan-out scales to 50+ nodes before anything centralized is needed. When it doesn’t, the node-namespaced ID scheme and offline merge path ensure the architecture can evolve without breaking existing deployments. We’re stress-testing the fan-out architecture against larger clusters and would welcome feedback from teams running multi-node training. Open an issue on GitHub. The investigations/ directory has ready-to-query databases for trying this without a GPU cluster: sample-gpu-node-01.db, sample-gpu-node-02.db, sample-gpu-node-03.db – individual node traces from a 3-node clustersample-cluster.db – all three merged into one (600 events, 6 chains, 9 stacks) GitHub (give us a star!): github.com/ingero-io/ingero. No NVIDIA SDK, no code changes, production-safe by design. If you are facing distributed training issues in your own workloads, we’d love to take a look. Drop an issue on GitHub, and we will gladly dive into it together. Ingero is free & open source software licensed under Apache 2.0 (user-space) + GPL-2.0/BSD-3 (eBPF kernel-space). One binary, zero dependencies, <2% overhead. Related Reading GPU incident response in 60 seconds with eBPF – single-node investigation workflow that the fleet feature extends11-second time to first token on a healthy vLLM server – kernel-level scheduling contention causing hidden latency, similar to the straggler root cause in this postGPU showing 97% utilization while training runs 3x slower – why nvidia-smi metrics alone miss the real story
Evidence of the ideas behind generative AI is not challenging to build, but the barrier between experimentation and production presents another group of concerns: repeatability, workflow predictability, safety, tracking, and scalability. The quality of the model is often not the bottleneck, and many teams find it challenging to apply GenAI into real systems and have enterprise-grade level guarantees. The Vertex AI Agent Builder offered by Google Cloud fills the gap with a managed infrastructure of deploying intelligent agents run on Gemini models, generation based on retrieval-augmented generation (RAG), and tools orchestration. In place of manually configuring a collection of services, Agent Builder is a unified runtime that allows balanced application development, both data grounding and deployment as well as monitoring, to be authored in GenAI. Architecture Foundations for Production GenAI A GenAI system on GCP that is production-grade is usually designed to have a layered architecture. The client applications communicate with Cloud Run or API Gateway and send requests to agents that are hosted by Vertex AI Agent Builder. Such agents plan prompts, access contextual information in indexed enterprise datastores like Big Query or Cloud Storage, reason using Gemini models and access external (or internal) tools (including Cloud Functions and internal APIs) when necessary. This division of labor enables frontend services, agent logic and knowledge systems to scale independently, without involving business workflows in immediate templates. The fundamental unit of this architecture is Retrieval Augmented Generation. In the absence of RAG, the model only uses pretrained knowledge and therefore, it tends to hallucinate or provide general answers. The use of agent Builder supports native indexing over both structured and unstructured data, thus enabling the application of outputs by applications to be based on actual organizational content. Documents are divided, inserted and filled with metadata to enable retrieval based on access level, department or domain. This practically forms a pipeline whereby user queries activate retrieval, dynamically assembled relevant context is formed and responses are produced by Gemini based on authoritative data. This method is much more accurate but flexible because the knowledge of the enterprise is going to change. Production GenAI Architecture Using Vertex AI Agent Builder on GCP Orchestration, Security, and Operational Readiness Recent GenAI applications do not typically limit themselves to text generation. There are databases, ticketing systems, and business services that must be touched by the production agents. Vertex AI Agent Builder allows the calling of tools so that models can invoke external actions like asking the status of orders, creating support tickets or running workflows. The teams do not have to write the logic inside prompts but can define structured flows using the assistance of Agent Builder, Cloud Workflows, or event-driven Cloud Functions. This renders orchestration checkable and verifiable whilst allowing the model to focus on argumentation and language production. Security is also the important thing. Vertex AI is connected to GCP IAM directly, allowing role-to-agent and role-to-dataset access as well as supporting service-to-service authentication. Sensitive areas may be covered in retrieval, audit logs can be viewed on the interactions of the agents, and VPC Service Controls are used to provide a boundary on data. Such capabilities are required in controlled settings where GenAI must abide by the current governance systems. Making agents like any other production service, which is subject to identity management, network controls, and logging, makes GenAI not an exception in architecture. Observability, Deployment, and Continuous Improvement The operational risk of deploying GenAI is that it is not observable. Vertex AI also offers logging of requests, latency, and tracing of the usage of tokens, although production teams often go further and export interaction data to BigQuery to analyze it offline. Gaining feedback on users, assessing response quality and versioning allows constant improvement, without destabilizing production systems. Another typical trend is to A/B test the promotion of prompt or agent changes in staging before they go to production, as with the traditional software release process. During deployment, the teams tend to open the agents through secured endpoints enabled by Cloud Run, manage the infrastructure with the help of Terraform, and create CI/CD pipelines to modify agent settings. This ensures that it can be replicated and it has reduced manual effort. Like traditional microservice ecosystems, successful GenAI platforms can be said to be monitored, versioned and constantly optimized in the long term. Vertex AI Agent Builder makes this process faster by bringing models, retrieval, orchestration and governance together on a single platform, which enables engineering teams to build reliable products instead of gluing the infrastructure together. Finally, GenAI in its production form will not be about access to powerful models, but rather the construction of robust systems to run them. Verse AI Agent Builder enables organizations to push agent deployment that is based on enterprise data, with cloud-native controls, and enhanced by feedback loops that are measurable to go to dependable applications. Conclusion Bringing GenAI out of the prototype and into production takes much more than model integration, it needs to be reliable in retrieval, deterministic in orchestration, hard security boundaries and continuously observable. The Vertex AI Agent Builder, offered by Google Cloud, unites all these abilities into one platform so that the teams can develop agents whose foundation lies in enterprise data, which relates to actual business processes and are controlled by cloud-native mechanisms. The integration of the Gemini models with Retrieval Augmented Generation, tool calling, and the operational ecosystem of GCP would enable organizations to implement scalable GenAI-based systems, which act similarly to the other production services. With enterprises becoming more entangled into AI-driven applications, they will find success once they start considering GenAI as part of infrastructure and not an experimental setup. Vertex AI Agent Builder can help speed up this shift by lowering the complexity of the existing architecture and allowing an engineering team to concentrate on the provision of quantifiable business value by offering reliable and production-ready intelligent systems.
Modern API-led architectures are built for resilience. We add: Retries for transient failuresReplication for durabilityAutoscaling for elasticityCircuit breakers for isolation Each mechanism improves availability. Under stress, their interaction can bring the system down. Most enterprise outages aren’t caused by missing fault tolerance. They’re caused by unbounded fault-tolerance mechanisms reacting simultaneously. Let’s break down how this happens — and how to design bounded reliability instead. 1. Retry Storms: When Resilience Multiplies Traffic Retries are meant to protect against temporary failures. But retries multiply load. This is a simplified version of what we often see in service-to-service retry logic: Plain Text import time import random def downstream_service(): latency = random.choice([0.1, 0.2, 0.8]) time.sleep(latency) if latency > 0.7: raise TimeoutError("Slow response") return "OK" def call_with_retries(max_attempts=3): for attempt in range(max_attempts): try: return downstream_service() except TimeoutError: print(f"Retry {attempt+1}") raise Exception("Failed after retries") Under normal conditions: Works fine. Under load: Latency increases.Timeouts trigger.Each request retries 3 times.Traffic triples.Backend slows further.More retries fire. That’s a retry storm. Now imagine this inside an API-led architecture: Gateway → Experience API → Process API → System APIs → ERP/DB If each layer retries independently, load amplification becomes multiplicative. In one system I worked on, we saw a single downstream slowdown take out three upstream APIs within minutes because each layer had its own retry logic. Bounded Retry Pattern (Production-Safe) Retries must be: LimitedBacked off exponentiallyJitteredDisabled under system stress Safer version: Plain Text def call_with_bounded_retries(max_attempts=2, system_load=0.5): if system_load > 0.75: return None # fail fast when under stress for attempt in range(max_attempts): try: return downstream_service() except TimeoutError: backoff = 0.2 * (2 ** attempt) time.sleep(backoff + random.uniform(0, 0.1)) return None Key differences: Retry ceiling reducedExponential backoffJitter prevents synchronized wavesLoad-aware short-circuit Retries should dampen instability — not amplify it. 2. Replication Fan-Out and Coordination Collapse Replication improves durability. But synchronous replication increases coordination cost. Example: Plain Text import time def simulate_write(): time.sleep(0.2) def write_to_replicas(data, replicas=3): for _ in range(replicas): simulate_write() Under surge traffic: Write volume increases.Each write fans out to 3 replicas.Replica lag grows.Clients retry writes.Effective write load doubles. Durability turned into a bottleneck. In enterprise integration systems (order processing, billing, reconciliation), this pattern causes throughput collapse — not because data was lost, but because coordination overwhelmed the system. Tiered Durability Strategy Not all writes need identical guarantees. Plain Text def write(data, critical=True): if critical: write_to_replicas(data, replicas=3) else: write_to_replicas(data, replicas=1) Separate: Critical transactions → strong durabilityNon-critical logs/events → reduced coordination Reliability must be scoped — not maximized blindly. 3. Autoscaling Feedback Loops Autoscaling reacts to traffic metrics. But traffic metrics may be artificial. If retries inflate request counts: Plain Text def autoscale(request_rate): if request_rate > 100: print("Scaling up") Scaling triggers: New instances initialize.Initialization hits shared DB/cache.Backend latency increases.More timeouts occur.Retry rate rises. Autoscaling accelerated instability. Safer Scaling Signals Scale on: Sustained demand (not spikes)Latency distribution trendsOrganic RPS (excluding retries)Queue growth rate Example: Plain Text def autoscale_safe(request_rate, sustained_load): if sustained_load and request_rate > 120: print("Scaling safely") Autoscaling should respond to organic demand — not retry amplification. 4. The Real Problem: Correlated Reactions Retries respond to latency.Replication responds to writes.Autoscaling responds to traffic.Circuit breakers respond to error rates.Under stress, they react to the same signal.That correlation creates cascading failure.Distributed systems behave like feedback systems.Unbounded feedback loops destabilize them. Real-World Scenario: Payment Reconciliation API Consider a payment reconciliation service: Gateway → Process API → Billing → ERP → Database What happens during a minor ERP slowdown? ERP latency increases to 700ms.Billing times out at 500ms.Billing retries 3 times.Process API retries orchestration.Gateway retries client request.Autoscaling reacts to spike.DB replication lag increases.DLQ starts growing. Within minutes, a small slowdown becomes a platform-wide incident. Root cause: unbounded reaction. 5. Guardrails for Bounded Reliability in API Systems 1. Retry Budgets Effective Load = Incoming RPS × Retry Count If RPS = 1,000 and retries = 3 Effective load = 3,000 Cap retries per request and per service. 2. Failure Classification Not all errors are retriable. Error Type Retry? Action CONNECTIVITY Yes Bounded retry TIMEOUT Yes Backoff VALIDATION No Fail fast AUTH No Alert Blind retries are architectural debt. 3. Idempotency Enforcement Retries without idempotency cause corruption. Unsafe: Plain Text transaction_id = uuid() Safe: Plain Text transaction_id = payload.get("transaction_id") or request.headers["correlation-id"] Every retry must produce the same logical result. 4. DLQ With Observability Track: Retry percentageTimeout frequencyDLQ growth velocityP95 latency shifts These are early warning signals. None of these controls are free. Reducing retries can increase error rates in some scenarios, and limiting replication can affect durability guarantees. The goal isn’t to eliminate these mechanisms, but to apply them intentionally based on system behavior. 5. Design for Stability, Not Perfection The goal of distributed reliability isn’t maximum redundancy. It’s controlled degradation under stress. Bound retries. Scope replication. Dampen scaling reactions. Enforce idempotency. Monitor feedback loops. Minor latency should not become a cascading outage. Reliability is not about adding mechanisms. It’s about controlling how they interact. Final Thoughts Retry storms don’t start with catastrophic failure. They start with: A small latency increaseA few timeoutsA handful of retries Then fault-tolerance mechanisms react — together. Retries multiply traffic.Replication increases coordination pressure.Autoscaling amplifies backend load. Within minutes, a minor slowdown becomes a cascading outage. Reliability in API-led distributed systems is not about adding more safety nets. It’s about bounding how those safety nets behave under stress. Limit retries.Classify failures.Enforce idempotency.Scale on sustained demand — not noise.Monitor feedback loops before they spiral. The difference between a resilient platform and a cascading failure often comes down to one thing: Whether your reliability mechanisms are controlled — or uncontrolled. Design for stability under stress. Not perfection under ideal conditions.
The Problem Nobody Warned You About You bought the GPUs. Maybe you've got a couple of NVIDIA A100s in a rack, some RTX 4090s under desks, or a Kubernetes cluster with mixed hardware. You've got the compute. Congratulations! Now what? Here's the part that catches most teams off guard: having GPUs is the easy part. Managing them is where things go sideways. You need to figure out which models fit on which cards, how to balance load across machines, how to handle a node going down at 2 AM, and how to expose all of this as a clean API your application team can actually call. Most teams end up building a brittle collection of Python scripts and crontab entries that haven't been updated since 2022. It works until it doesn't, and then someone's paging you on a Saturday. This is the problem GPUStack was built to solve. What Is GPUStack, Exactly? GPUStack is an open-source tool for managing GPU clusters. Think of it as Kubernetes for your inference workloads, except you don't need to spend three days debugging a whitespace error in a Helm chart. At its core, GPUStack does three things well: It aggregates your GPUs. Whether your hardware is spread across bare-metal servers, Kubernetes pods, or cloud instances, GPUStack sees them all as a single pool of compute. One dashboard, full visibility. It orchestrates inference engines. GPUStack doesn't try to reinvent the inference wheel. It plugs into engines like vLLM, SGLang, and TensorRT-LLM, picks the right one for the job, configures it, and manages the lifecycle so you don't have to. It serves models through an OpenAI-compatible API. Once a model is deployed, your application team gets a familiar REST endpoint. No custom client libraries. No new protocols to learn. Swap out the base URL, and you're talking to your own infrastructure. Getting Started in Under 5 Minutes I'm not exaggerating on the timeline. Here's how you go from zero to a running GPUStack server. Step 1: Fire Up the Server You need one machine to act as your control plane. It doesn't even need a GPU. A basic CPU-only box works fine for the server role. Shell sudo docker run -d --name gpustack \ --restart unless-stopped \ -p 80:80 \ --volume gpustack-data:/var/lib/gpustack \ gpustack/gpustack That's it. Open your browser, navigate to http://<your-server-ip>, and you'll see the GPUStack dashboard. The first time you log in, you'll set up your admin credentials. Step 2: Add Your GPU Workers Now for the fun part. On each worker node, make sure you have the NVIDIA driver and NVIDIA Container Toolkit installed, then run: Shell sudo docker run -d --name gpustack-worker \ --restart unless-stopped \ --gpus all \ -e GPUSTACK_SERVER_URL=http://<your-server-ip> \ -e GPUSTACK_TOKEN=<your-token> \ gpustack/gpustack Replace the server URL and token (grab the token from the GPUStack dashboard). Within seconds, your worker appears in the cluster view with GPU model info, VRAM capacity, and health status. Rinse and repeat for every GPU machine you want to add. Got 3 machines? Three commands. Got 30? Thirty commands, or one Ansible playbook if you're smart about it. Running the worker command is actually the easiest part. The real final boss of GPU clusters is usually getting the drivers and toolkit installed correctly on the host. Step 3: Deploy a Model Head over to the model catalog in the web UI. GPUStack supports pulling models from Hugging Face and the Ollama Library. Pick a model and click deploy. Here's where the scheduler really excels. It reads the model's metadata, computes the resource requirements for VRAM, compute, and memory, then figures out which workers can handle it. If the model is too big for a single GPU, it can shard it across multiple cards. You don't have to manually calculate whether a 70B parameter model fits on your hardware. GPUStack does the math for you. Step 4: Call the API Once the model is running, you get an OpenAI-compatible endpoint. Grab an API key from the dashboard and test it: Shell curl http://<your-server-ip>/v1/chat/completions \ -H "Authorization: Bearer <your-api-key>" \ -H "Content-Type: application/json" \ -d '{ "model": "llama3", "messages": [ {"role": "user", "content": "Explain GPU cluster management in one paragraph."} ] }' If you're already using the OpenAI Python SDK, switching to your GPUStack endpoint is a one-line change: Python from openai import OpenAI client = OpenAI( base_url="http://<your-server-ip>/v1", api_key="<your-api-key>" ) response = client.chat.completions.create( model="llama3", messages=[{"role": "user", "content": "Hello from my own GPU cluster!"}] ) print(response.choices[0].message.content) Your application code stays the same. Your infrastructure is now fully under your control. Why This Actually Matters Let me break down the features that make GPUStack more than a nice-looking dashboard. Multi-Backend Flexibility GPUStack supports vLLM, SGLang, and TensorRT-LLM out of the box. This matters because no single engine is best for every workload. vLLM is great at high-throughput batch processing. TensorRT-LLM squeezes out every last drop of performance on NVIDIA hardware. SGLang shines with structured generation. GPUStack lets you pick the right tool for each deployment, or lets the scheduler pick for you. Built-In Monitoring GPUStack integrates with Grafana and Prometheus, giving you real-time dashboards for GPU utilization, VRAM usage, token throughput, and API request rates. No need to bolt on a separate monitoring stack (which usually ends up being three half-finished Grafana dashboards anyway). When something breaks at 2 AM, you'll know exactly which GPU on which machine is the problem. Automated Failure Recovery We’ve all been there - a node drops off the map because of a weird PCIe bus error or a driver mismatch that only appears under heavy load. Normally, that means your inference API just returns 500s until you manually intervene. GPUStack handles the panic phase for you. When Should You Use GPUStack? GPUStack isn't the right fit for every scenario. Here's a quick way to think about it: Use GPUStack if: You have 2+ GPU machines and want to serve LLMs or other AI models behind a unified API. Especially if your team doesn't want to become full-time infrastructure engineers just to keep models running. You want to run inference on your own hardware instead of paying per-token to a cloud provider. The cost savings at scale are real, and GPUStack removes the operational overhead that usually makes self-hosting painful. Maybe skip GPUStack if: You have a single GPU and just want to run a model locally for personal use. Tools like Ollama are simpler for that use case. You're already deep into a custom Kubernetes-based ML platform with KubeFlow or similar. GPUStack can work alongside Kubernetes, but if you've already invested heavily in that ecosystem, the overlap might not be worth it. The Bigger Picture The AI infrastructure landscape is shifting. A year ago, most teams defaulted to API providers for inference. Today, with open-weight models getting better every month and GPU costs coming down, self-hosted inference is becoming a real option. Not just for Big Tech, but for startups and mid-size companies too. The bottleneck isn't hardware anymore. It's operations. It's the glue code between "we have GPUs" and "our application can reliably call a model." GPUStack is a serious attempt at solving that gap, and it's open source under the Apache 2.0 license, so you can inspect, modify, and deploy it without vendor lock-in. If you’re sitting on a pile of hardware that’s currently just acting as expensive space heaters, or if you’re tired of seeing cloud inference bills that look like mortgage payments, give this a shot. You might find that self-hosting is actually viable again!
This is not "just another article about Springdoc," I promise. This is a ready-to-use recipe I was struggling to find one day, and had to build it from scratch. Have you ever needed to generate OpenAPI documentation directly from your code and, more importantly, do it in a way that fits cleanly into a CI pipeline? Swagger UI is commonly used in Spring Boot applications to visualize and test APIs from the browser. It can also expose the generated OpenAPI definition through a configurable endpoint, and that endpoint is exactly what we will use in this article. Why OpenAPI Documentation Matters Frontend Client Generation One of the most practical uses of OpenAPI documentation is automatic client generation. Tools such as OpenAPI Generator or Swagger Codegen can take an OpenAPI definition and produce TypeScript, JavaScript, or Java clients with very little manual effort. Mocking a Service Before It Is Ready In early development stages, a team may want to spin up a mock server before the real endpoints are fully implemented. Tools such as Mockoon or WireMock can use an OpenAPI specification to simulate the service. This is especially useful for frontend teams that need to move forward while backend work is still in progress. Verifying Contracts Between Services When multiple services depend on one another, compatibility becomes critical. OpenAPI documentation can be used together with tools such as Spring Cloud Contract to verify that both providers and consumers still conform to the agreed contract. The Manual Approach to Generating OpenAPI Documentation Let us start with a simple Spring Boot project. Add the following dependencies to pom.xml: XML <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-security</artifactId> </dependency> <dependency> <groupId>org.springdoc</groupId> <artifactId>springdoc-openapi-starter-webmvc-ui</artifactId> <version>2.6.0</version> </dependency> Then add Springdoc configuration to application.yml: YAML springdoc: api-docs: path: /api-docs enabled: true swagger-ui: url: /api-docs enabled: true Now create a simple REST controller: Java @RestController @Tag(name = "default", description = "General API") @RequestMapping("/api/v1/default") public class WebRestController { private static final Logger log = LoggerFactory.getLogger(WebRestController.class); @GetMapping(produces = MediaType.TEXT_PLAIN_VALUE) @ResponseStatus(HttpStatus.OK) public String get() { log.info("GET method called"); return "Hello!"; } @PostMapping( consumes = MediaType.TEXT_PLAIN_VALUE, produces = MediaType.APPLICATION_JSON_VALUE ) @ResponseStatus(HttpStatus.OK) public Set<String> post(@RequestBody String body) { log.info("POST method called"); return Set.of(body); } Finally, add a security configuration that allows access to both the REST API and to Swagger UI: Java @Configuration @EnableWebSecurity @EnableMethodSecurity public class WebSecurityConfig { @Profile("!openapi") @Bean public SecurityFilterChain filterChain(HttpSecurity httpSecurity) throws Exception { return httpSecurity.authorizeHttpRequests( request -> request .requestMatchers("/api-docs", "/api-docs/**").permitAll() .requestMatchers("/swagger-ui/*").permitAll() .requestMatchers("/api/v1/default").permitAll() .requestMatchers("/**").authenticated() ) .csrf(CsrfConfigurer::disable) .build(); } @Profile("openapi") @Bean public SecurityFilterChain filterChainOpenApi(HttpSecurity httpSecurity) throws Exception { return httpSecurity.authorizeHttpRequests( request -> request.anyRequest().permitAll() ) .csrf(CsrfConfigurer::disable) .build(); } Notice the separate openapi profile. We will use it later during automated generation. At this point, you can run the application and open Swagger UI at http://localhost:8080/swagger-ui/index.html. From there, the generated OpenAPI document is available at http://localhost:8080/api-docs. You can save that response manually and use it as your specification file. This works, but it is repetitive and not very practical for build automation. So let us move to the more useful approach: generating the spec during the Maven build. Automatic Generation To generate an OpenAPI file automatically, it helps to understand what actually happens during the build. The springdoc-openapi-maven-plugin does not generate the specification out of thin air. It calls the application endpoint that exposes the OpenAPI definition. In other words, your Spring Boot application must be running while the plugin executes. That is why the spring-boot-maven-plugin and springdoc-openapi-maven-plugin are typically used together. Because the application has to be started during the build, the security configuration must also allow the documentation endpoint to be accessed in that scenario. This is exactly why the separate openapi Spring profile is useful. Add a Dedicated Maven Profile Add the following Maven profile to pom.xml: XML <profile> <id>openapi</id> <properties> <maven.test.skip>true</maven.test.skip> </properties> <build> <plugins> <!-- When the Maven profile is openapi, run Spring with the openapi profile --> <plugin> <artifactId>spring-boot-maven-plugin</artifactId> <groupId>org.springframework.boot</groupId> <configuration> <jvmArguments> -Dspring.application.admin.enabled=true -Dspring.profiles.active=openapi </jvmArguments> </configuration> <executions> <execution> <id>pre-integration-test</id> <goals> <goal>start</goal> </goals> </execution> <execution> <id>post-integration-test</id> <goals> <goal>stop</goal> </goals> </execution> </executions> </plugin> <!-- Generate the OpenAPI file during the build --> <plugin> <artifactId>springdoc-openapi-maven-plugin</artifactId> <groupId>org.springdoc</groupId> <version>1.4</version> <configuration> <skip>false</skip> <apiDocsUrl>http://localhost:8080/api-docs.yaml</apiDocsUrl> <outputDir>${project.build.directory}</outputDir> <outputFileName>openapi.yml</outputFileName> </configuration> <executions> <execution> <id>integration-test</id> <goals> <goal>generate</goal> </goals> </execution> </executions> </plugin> </plugins> </build> </profile> The important parts here are: We create openapi Maven and openapi Spring profiles, but they are not the same (and should not necessarily have those exact names or share one name).When openapi Maven profile is run, we run Spring app with openapi profile (look at jvmArguments)-Dspring.profiles.active=openapi enables the relaxed security profile created specifically for documentation generation.apiDocsUrl points to the endpoint that returns the OpenAPI document.outputDir and outputFileName control where the generated file is written. These are the exact parts I struggled to find in one place, hence the "recipe" article. Run the Generation Step Once the profile is in place, generating the spec is easy: Shell ./mvnw verify -Popenapi After the build completes, the generated OpenAPI spec should be here: YAML ./target/openapi.yml Using It in a CI Pipeline This setup is CI-friendly because the same command can run locally and in your pipeline: YAML ./mvnw verify -Popenapi From there you can archive target/openapi.yml as a build artifact, publish it to an artifact repository, pass it to frontend code generators, mock servers, and contract verification jobs. Conclusion Generating OpenAPI documentation manually from Swagger UI is fine for quick inspection, but it does not scale well when you need repeatability. By wiring Spring Boot and Springdoc into a dedicated Maven profile, you can generate the specification automatically during the build in your CI. That gives you a reliable OpenAPI artifact that can support client generation, service mocking, and contract verification without adding a separate manual step to the development workflow. Bonus: Represent Set as an Array In some cases, you may want a Set to be represented as a regular array in the generated OpenAPI specification instead of an array with uniqueItems: true. This can be useful when downstream tools expect a plain array schema (this is the exact request I once got from the frontend team). You can customize Springdoc behavior with a small configuration class: Java import org.springdoc.core.utils.SpringDocUtils; import io.swagger.v3.oas.models.media.Schema; import java.util.Collections; import java.util.Set; public class SwaggerConfig { // Make springdoc generate an Array schema for Set.class // and remove uniqueItems: true public SwaggerConfig() { var schema = new Schema<Set<?>>(); schema.type("array").example(Collections.emptyList()); SpringDocUtils.getConfig().replaceWithSchema(Set.class, schema); } With this adjustment in place, the generated schemas for Set will be emitted as an array, which can simplify integration with some client generators and consumers.
Jubin Abhishek Soni
Senior Software Engineer,
Yahoo
Satrajit Basu
Chief Architect,
TCG Digital