Security by Design
Security teams are dealing with faster release cycles, increased automation across CI/CD pipelines, a widening attack surface, and new risks introduced by AI-assisted development. As organizations ship more code and rely heavily on open-source and third-party services, security can no longer live at the end of the pipeline. It must shift to a model that is enforced continuously — built into architectures, workflows, and day-to-day decisions — with controls that scale across teams and systems rather than relying on one-off reviews. This report examines how teams are responding to that shift, from AI-powered threat detection to identity-first and zero-trust models, supply chain hardening, quantum-safe encryption, and SBOM adoption strategies. It also explores how organizations are automating governance across build and deployment systems, and what changes when AI agents begin participating directly in DevSecOps workflows. Leaders and practitioners alike will gain a grounded view of what is working today, what is emerging next, and what security-first software delivery looks like in practice in 2026.
The Digital Archaeology Experiment We all have that one folder. The one labeled "v1_final_do_not_touch_2016." It is a sprawling ecosystem of spaghetti code, global variables, and comments that simply read // I am sorry. In an era of large language models (LLMs), we often hear about AI writing boilerplate, but can it actually perform digital archeology? I decided to feed my most "haunted" legacy script — a 2,000-line monolith responsible for processing data — into a hypothetical next-generation model, Gemini 3. The goal wasn't just to see if it could fix the bugs, but to see if it could transform a maintenance nightmare into a modern, scalable architecture. What followed was a masterclass in software engineering best practices. The AI didn't just move code around; it applied structural patterns that we often neglect in the heat of deadlines. This guide breaks down the core best practices Gemini 3 utilized to transform legacy junk into production-grade software, and why you should apply these practices even if you aren't using an AI assistant. 1. The Single Responsibility Principle (SRP): Deconstructing the Monolith The first thing the AI flagged was the "God Object" syndrome. In my legacy code, a single function called process_claim() was responsible for: Validating user input.Connecting to a MySQL database.Calculating claim totals with hardcoded tax rules.Sending an email notification.Logging errors to a local file. The Bad Practice (The Monolith) Plain Text def process_claim(claim_data): # Validation if not claim_data.get("id"): return "Error" # Database logic db = connect_to_db("prod_db") db.execute(f"INSERT INTO claims VALUES ({claim_data['id']})") # Business logic total = claim_data['amount'] * 1.15 # Hardcoded tax # Notification send_email("[email protected]", f"Claim {total} processed") return "Success" Why This Fails This code is impossible to test in isolation. If you want to test the tax calculation, you must have a live database connection and an email server ready. Furthermore, a change in the email provider's API forces a change in the business logic file, violating the principle that software should be easy to change without unintended side effects. The Good Practice (Applying SRP) Gemini 3 refactored this into distinct services. Validation, Persistence, Calculation, and Messaging were separated. Plain Text class ClaimValidator: def validate(self, data): if not data.get("id"): raise ValidationError("Missing ID") class TaxCalculator: def calculate(self, amount, region_code): rate = self._get_rate(region_code) return amount * rate class ClaimService: def __init__(self, validator, calculator, repository, notifier): self.validator = validator self.calculator = calculator self.repository = repository self.notifier = notifier def execute(self, claim_data): self.validator.validate(claim_data) total = self.calculator.calculate(claim_data['amount'], "US") self.repository.save(claim_data) self.notifier.send(f"Claim {total} processed") Why It Matters By separating concerns, the code becomes modular. You can now swap the TaxCalculator for a different regional version without touching the ClaimService. Testing becomes a matter of passing "mock" objects into the constructor, ensuring your unit tests are fast and reliable. 
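The testability claim is easy to see in practice. Below is a minimal sketch, not taken from the refactor itself, of a unit test that exercises the ClaimService from the example above by injecting unittest.mock stand-ins; it assumes the ClaimService class is importable, and the specific values and assertions are illustrative.

Python
from unittest.mock import MagicMock

def test_claim_service_happy_path():
    # Stand-in collaborators: no real database, tax table, or SMTP server needed
    validator = MagicMock()
    calculator = MagicMock()
    calculator.calculate.return_value = 115.0
    repository = MagicMock()
    notifier = MagicMock()

    service = ClaimService(validator, calculator, repository, notifier)
    service.execute({"id": "c-1", "amount": 100.0})

    validator.validate.assert_called_once_with({"id": "c-1", "amount": 100.0})
    calculator.calculate.assert_called_once_with(100.0, "US")
    repository.save.assert_called_once_with({"id": "c-1", "amount": 100.0})
    notifier.send.assert_called_once_with("Claim 115.0 processed")

No database connection or mail server is involved, which is exactly the payoff the refactor is after.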
Checklist for Applying SRP TaskDescriptionIdentify "Ands"If a function does A and B, it needs to be split.Extract LogicMove business rules into separate, pure functions.Isolate I/OKeep database and API calls outside of core logic classes.Limit LinesAim for functions under 20 lines of code. 2. Decoupling Through Dependency Injection One of the most profound changes Gemini 3 suggested involved how objects interact. In the legacy code, objects instantiated their own dependencies. If Class A needed Class B, it would simply call b = new ClassB() inside its constructor. This creates "tight coupling." Visualizing the Transformation Below is a Flowchart illustrating the decision-making process for decoupling legacy dependencies. The Pitfall: The "New" Keyword When you use new inside a class, you are locking that class to a specific implementation. This makes it impossible to substitute a mock version for testing or a different implementation for a new environment (like a staging server). The Solution: Dependency Injection (DI) Instead of creating the dependency inside the class, you "inject" it — usually via the constructor. This practice shifts the responsibility of object creation to the caller or a dedicated DI container. Comparison: Before vs. After Bad (Tight Coupling): Plain Text class OrderService { constructor() { this.database = new PostgresDatabase(); // Hardcoded dependency } } Good (Loose Coupling): Plain Text class OrderService { constructor(database) { // Injected dependency this.database = database; } } The Benefit: In your production environment, you pass a real PostgresDatabase. In your test environment, you pass an InMemoryDatabase. The OrderService doesn't know the difference, making it highly reusable. 3. Defensive Programming and Error Handling Legacy code often treats error handling as an afterthought, using generic try-catch blocks that swallow exceptions or returning null values that eventually lead to the dreaded "Null Reference Exception." Gemini 3's refactoring emphasized Defensive Programming: the practice of designing software to continue functioning under unforeseen circumstances. Sequence Diagram: Proper Error Handling Flow This Sequence Diagram shows the interaction between a client, a service, and an external API using resilient patterns. Key Defensive Practices Fail Fast: Validate inputs at the very beginning of a function. If they are invalid, throw an exception immediately.Use Meaningful Exceptions: Instead of throwing Error, throw InsufficientFundsError or UserNotFoundError.Circuit Breakers: If an external service is down, don't keep hammering it. Stop the calls and return a cached result or a graceful failure. Good vs. Bad Error Handling Bad Practice: Plain Text try: result = api.call() except: pass # Silently failing is the worst thing you can do Good Practice: Plain Text try: result = api.get_user(user_id) except ConnectionError as e: logger.error(f"Failed to connect to UserAPI for ID {user_id}: {e}") raise ServiceUnavailableError("Our user service is temporarily down.") except UserNotFoundError: return None # Explicitly handled 4. Modernizing State Management In my legacy script, the code relied heavily on global state. A variable like current_user_id was updated by multiple functions across the file. This led to unpredictable bugs where the state would change in the middle of a process due to an asynchronous callback. Implementation: Using Immutability Instead of modifying an existing object, create a new one. 
This ensures that other parts of the system holding a reference to the old object aren't surprised by a sudden change. Bad (Mutable): Plain Text function updatePrice(product, newPrice) { product.price = newPrice; // Changes the object everywhere } Good (Immutable): Plain Text function updatePrice(product, newPrice) { return { ...product, price: newPrice }; // Returns a new object } By using immutability, you make your code thread-safe and much easier to debug. If a bug occurs, you can inspect the state at any point in time without worrying that it was modified downstream. 5. Refactoring Summary: The Do's and Don'ts To help you apply these findings to your own legacy codebases, here is a summary table of the transformations Gemini 3 performed. AreaDon't Do This (Legacy)Do This (Modern)LogicGiant functions with nested if/else.Small, pure functions with early returns.DataDirect manipulation of global state.Immutable data structures and local state.DependenciesHardcoded new instances.Injected dependencies via interfaces.ErrorsGeneric try-catch with empty bodies.Domain-specific exceptions and logging.PerformanceNested loops with O(n^2) complexity.Optimized algorithms with O(n) or O(log n).DocumentationComments explaining what code does.Self-documenting code explaining why. Common Pitfalls to Avoid During Refactoring Even with an AI as powerful as Gemini 3, refactoring is not without risks. Here are three common pitfalls I encountered during this experiment: Refactoring Without Tests: Never start refactoring until you have "Characterization Tests" — tests that describe how the code currently behaves. If you change the code and the tests pass, you know you haven't broken existing functionality.Over-Engineering: It is tempting to apply every design pattern (Factory, Strategy, Observer) at once. Only introduce complexity when it solves a specific problem. If a simple function works, you don't need a class.The "Big Bang" Rewrite: Resist the urge to rewrite the entire system from scratch. This almost always leads to project failure. Instead, refactor one small module at a time, ensuring the system remains operational throughout the process. Practical Guidance: An Implementation Roadmap If you are staring at a mountain of legacy code today, here is the recommended roadmap for modernization: Identify the Pain Points: Which part of the code breaks most often? Start there.Write Integration Tests: Capture the current behavior of that module.Decouple the Core: Identify the business logic and extract it from the infrastructure (database/UI).Introduce Dependency Injection: Allow your business logic to be tested in isolation.Clean Up the Syntax: Use modern language features (like Async/Await or Type Hints) to improve readability. Conclusion: AI as the Ultimate Pair Programmer Feeding my worst legacy code to Gemini 3 was an eye-opening experience. The AI didn't just "fix" the code; it enforced a level of discipline that is often lost in the day-to-day grind of feature delivery. It reminded me that the most important audience for our code isn't the compiler — it is the human developer who has to maintain it six months from now. By prioritizing the Single Responsibility Principle, decoupling dependencies through injection, and embracing defensive programming, we can turn even the most frightening legacy scripts into robust, modern systems. Whether you use an AI assistant or your own expertise, these best practices remain the bedrock of professional software engineering. 
Further Reading & Resources
Refactoring: Improving the Design of Existing Code by Martin Fowler
Clean Code: A Handbook of Agile Software Craftsmanship
The Twelve-Factor App Methodology
Google Software Engineering Best Practices
SOLID Principles of Object-Oriented Design
As the AI tidal wave continues to break on our shores, there are two existential questions we’re all struggling to answer: Knowledge workers and other content producers – how can we survive the AI wave with some kind of defensible capability we can offer our employers and our audiences that AI won’t be able to replace, even as it matures?Software vendors – how can we survive the AI wave with some kind of defensible product capability we can offer our customers that AI agents won’t be able to replace, even as they mature? If you’re a pessimist, the situation may seem hopeless. AI is getting so much better so quickly that even if it can’t quite replace us or our software products today, it’s only a matter of time, right? Should we abandon hope? Or perhaps you’re an optimist. There must be some aspect of what we as humans bring to the table that AI won’t be able to replace, no matter how good it gets. If only we had a way of understanding and measuring just what that essential value-add that humans can bring to the table, whether we are creating content, addressing business needs as knowledge workers, or building software products that provide value to their users. The good news: there is hope. Here is a way of looking at the problem that will help illuminate that je ne sais quoi – that ineffable human contribution that AI will never be able to replace. First, Understand Semantic Density Generative AI (genAI) depends upon large language models (LLMs) that deal well with content that has specific, well-defined meaning. The better defined our inputs – training data, retrieval augmented generation (RAG) data, and information in prompts – the better formed our outputs. In contrast, when the meaning of input data contains too many nuances – implications, unspoken references, intuitive leaps and the like – then LLMs fall short. The models’ creators simply have no way to build them to account for such subtleties. Language experts have a term for how to understand such differences in meaning: semantic density. You create a message with high semantic density by cramming a lot of meaning into a few words. In contrast, a message has low semantic density if it takes a lot of words to express a simple idea. Humans are particularly good at creating semantically dense content – and in fact, we generally identify higher semantic density with better written content. On the other hand, LLMs excel at both consuming and producing content with low semantic density. Such output is especially useful when we are looking for clear, precise explanations, accurate summaries, etc. – just the sorts of content we’ve come to expect and demand from genAI. Is Semantic Density the Answer? An obvious conclusion at this point would be for humans to focus on creating semantically dense content to survive the onslaught of AI. Unfortunately, there are problems with this argument. First, LLMs can also generate semantically dense content, especially when source data are also semantically dense, for example, asking genAI to create an abstract for a semantically dense academic paper. Asking an LLM to write the paper is a recipe for plagiarism and hallucinations (as many students have learned to their chagrin), but the models are quite skilled at summarizing such content. Second, it’s overly simplistic to equate semantically dense human-generated content with good writing vs. less dense content with poorer writing. After all, sometimes we want human-generated content to be less semantically dense. 
A simple example would be writing for children – something genAI can do for sure, but the best child-oriented content still comes from real people. On the flip side, extreme semantic density typically makes the text obscure and difficult to read – clearly not hallmarks of excellent writing. So, while semantic density has a loose correlation to how well LLMs can perform, it’s not the whole story. The missing piece: context density. The Importance of Context Density While semantic density measures the internal complexity of meaning within a message, context density measures the meaningful content around a message. Context density is similar to semantic density. More meaning crammed into fewer words leads to more density, so it’s easy to confuse the two. The reason context density is so important, however, is because of the role context plays in how LLMs behave – in particular, agentic behavior. In fact, we could even say that what makes an LLM-based application into an AI agent is how it understands and takes action based upon context. Such context can include: Information about available local files, databases, and APIsAvailable tools and how to access themSecurity information necessary to access required assetsOther metadata relevant to each query. Such context must be clear and unambiguous for the agents to behave properly. In other words, agents require context that has low context density. In fact, this requirement for low context density is one of the reasons why the Model Context Protocol (MCP) has been such a rapid success. The MCP is an open integration protocol standard for interactions with and among LLMs. It’s based on JSON, a flexible format for expressing data with low semantic density – or in the case of MCP, low context density. While the creators of MCP didn’t explicitly design it with low context density in mind, they did intend for the protocol to prioritize clarity and structure over density. Given that each system in an agentic interaction must understand the relevant context without hidden assumptions or other nuances of meaning, explicit context with low density is essential to the success of agentic systems. What, then, Is the Role of High Context Density? Human-to-human interactions, aka conversations, have inherently high context density – even though we rarely notice it. Every human conversation contains layers of subtext and hidden meaning via facial expressions, hand gestures, tone of voice, words with ambiguous meaning, patterns of pauses in speech, and other subtle aspects of human communication. Such nuance goes right over the proverbial head of AI – even LLMs that do such a good job of mimicking human conversation. In other words, it’s virtually impossible for LLMs to deal with high context density. Agentic interactions in particular are quite sensitive to excessive context density. Agents rely so heavily on the precision possible with low context density that any nuance in context will throw them off entirely. At the very least, they will completely ignore it. How Context Density Helps Us Humans Where agents (and genAI in general) is weak, humans are strong. Context density, therefore, helps us answer the questions at the top of this article. If we look at various applications of AI, context density drives essential distinctions: Knowledge work – ask your favorite copilot to handle tasks with low context density. 
Focus human attention and activity on those tasks that require high context density.Automation – processes with low context density are easy for AI to automate. Processes with high context density require human input and control.Building software – anyone can leverage code generation tools to build applications with low context density. For applications that require high context density, code generation tools must be secondary to skilled human effort, insight, and control. Context density thus becomes the differentiating metric between activities and applications that LLMs are well-suited for vs. those activities and applications that will continue to require human input and control, even as AI technologies mature. The Intellyx Take The most important part of this story is not identifying where AI is useful. It’s identifying where it is not. As AI inevitably transforms how we work and live, we must all come to terms with the fact that AI will take various tasks off our respective plates, leaving us wondering what our purpose will be in this arguably dystopian future. Take heart: there will always be roles for us humans. We are the masters of insight, creativity, nuance, and hidden meaning – the essence of context density. Our challenge moving forward: identifying those activities where we can provide value as individuals by offering just those capabilities that AI is so woefully unable to provide. The opportunity for software vendors: make sure your products have high context density. That way agents won’t be able to do what your products do. Instead, agents will need to call upon your products to accomplish their tasks successfully. The opportunity for humans: make sure your work is both semantically and contextually dense. Focus on the meaning that LLMs can’t grasp. Express your intuition, insight, and creativity in terms of meaning, both within your work as well as its human context. AI gives us an amazing set of tools. Knowing how to use them well means focusing our efforts on providing the value that we as humans are uniquely qualified to contribute.
There's a particular kind of incident that doesn't show up in your error dashboards. No alerts fire. Latency looks fine, actually — or fine-ish, in that flickering, indeterminate way that makes you suspicious but not certain. What shows up, days later, is a billing anomaly. A line item that's 4x what you budgeted. And when you dig, you find it: retries. Hundreds of thousands of them. Loyal, tireless, utterly pointless retries, hammering a dependency that was never going to recover within the retry window, each one spinning up a Lambda invocation, writing to CloudWatch, touching the database, accruing egress. The system was "retrying" its way into insolvency. This is what I mean when I call uncontrolled retries a self-inflicted Denial-of-Wallet attack. Not metaphorically. Mechanically. The Seductive Logic of "Just Try Again" The impulse is almost irresistible. Networks are flaky. Downstream services hiccup. Transient faults are real, they are common, and a single retry genuinely does rescue a meaningful fraction of requests that would otherwise fail. Every distributed systems textbook will tell you this. The problem is that the textbook version of a retry — lone request, momentary fault, clean recovery — bears almost no resemblance to what retries actually do inside a system operating at load under a real failure. Under real failure, the math inverts. Say Service A depends on Service B. B starts returning 500s — maybe a deployment went sideways, maybe a database connection pool saturated. A is configured with what seems reasonable: three retries, linear backoff, no jitter. What happens next is not three polite attempts and a graceful degradation. What happens is multiplication. Every original request to A becomes four requests to B (the original plus three retries). If A is receiving 1,000 RPS, B is now absorbing 4,000 RPS — on top of the load it was already failing to handle. Each of those extra requests touches middleware, writes a log line, maybe hits a queue. B, already struggling, gets worse. A's retries accelerate B's failure. The snowball rolls. The Stanford RetryGuard researchers have a name for this: the retry storm. It's not exotic. It's what happens when you deploy reasonable-looking retry policies without thinking about what they do in aggregate. What the Cost Actually Looks Like People underestimate the surface area of a retry. They think: one extra HTTP call. They don't think about what's attached to that HTTP call. In a Lambda-backed architecture, each retry is an invocation — billed separately. Each invocation likely emits structured logs to CloudWatch, which charges per GB ingested. If the function hits a DynamoDB table, that's another read unit consumed, possibly another write. If there's an API Gateway in front, that's another API call counted against your tier. If the response is large, there's egress cost. And this happens in parallel across however many concurrent requests are in flight. Now consider the timeline. Service B fails at 2 AM. The on-call engineer doesn't see it until 2:17. During those 17 minutes, if A was receiving 500 RPS and each request retried three times, you've generated roughly 2 million additional requests to B. You've paid for every one of them. You've gotten nothing back. The original failure wasn't solved; the retries just made the failure expensive. One way to think about this: retries without circuit breakers are paying a premium to prolong a failure. 
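To make the multiplication concrete, here is a small back-of-the-envelope sketch using the same 1,000 RPS and 17-minute-outage numbers as above; the function name and the assumption of a total outage are purely illustrative.

Python
def downstream_rps(incoming_rps: float, max_retries: int, failure_fraction: float = 1.0) -> float:
    """RPS hitting the downstream when every failed call is retried max_retries times.

    With a total outage (failure_fraction = 1.0) and no budget or breaker,
    each original request becomes (1 + max_retries) downstream requests.
    """
    return incoming_rps * (1 + max_retries * failure_fraction)

# The scenario above: 1,000 RPS into A, three retries, B completely down.
print(downstream_rps(1_000, 3))          # 4000.0 RPS absorbed by B

# A 17-minute outage at 500 RPS with three retries per failed request:
outage_seconds = 17 * 60
retries_only = 500 * outage_seconds * 3  # roughly 1.5M extra calls you pay for
total_to_b = 500 * outage_seconds * 4    # roughly 2M calls hitting B in total
print(f"{retries_only:,} retries, {total_to_b:,} total requests to B")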
The Hidden Feedback Loops Nobody Draws on the Architecture Diagram The simple A-calls-B diagram is almost always wrong. What's usually true is that A, B, and C all call each other in some configuration, and several of them share infrastructure. So when B degrades: A retries B, increasing load on B's shared database connection pool. The pool saturates. Now C, which also reads from that database, starts timing out. C's callers — let's say D and E — start retrying. D and E's retries hit the same pool. The pool is now so saturated that even requests that have nothing to do with the original B failure are timing out. This is the cascade that the RetryGuard paper captures: service A experiences a retry storm and pays the price, but the price is actually distributed across the whole graph. The bulkhead patterns — isolating thread pools, rate-limiting per-dependency — exist precisely to prevent this. Most systems don't have them, or have them configured with defaults that were never tuned for actual traffic. The other feedback loop worth naming is the log-based one. Your observability stack is probably downstream of your services. If it's Elasticsearch or Loki or CloudWatch, it absorbs your logs. Under a retry storm, log volume can spike 5–10x. That means your observability system — the thing you're depending on to diagnose the problem — is now also under load. I've been in incidents where the logging pipeline itself started dropping messages at exactly the moment we needed full fidelity. The retry storm ate its own evidence. Exponential Backoff Is Not Enough (and Jitter Matters More Than You Think) Backoff is the first thing people reach for. Double the wait between attempts. It's better than nothing. But standard exponential backoff without jitter has a subtle and nasty property: it synchronizes retries. Suppose 500 requests arrive simultaneously. They all fail. They all back off by 1 second. They all retry simultaneously at T+1. They all fail again. They all back off by 2 seconds. They all retry simultaneously at T+3. You've turned continuous load into synchronized bursts — which are, in some ways, worse than continuous load, because they create spike conditions that can exceed per-second rate limits and overwhelm autoscaling that hasn't had time to provision. Jitter — adding a random offset to the backoff interval — breaks this synchronization. The AWS Architecture Blog's "Exponential Backoff and Jitter" post from 2015 remains one of the clearest explications of why, and the "full jitter" strategy (where the wait is uniformly random between zero and the calculated backoff) outperforms "equal jitter" in most workloads. The math isn't complicated. The intuition is: you want your retriers to spread out across time, not march in lockstep. The formula you actually want: Plain Text wait = random_between(0, min(cap, base * 2^attempt)) That min(cap, ...) is important. Without a ceiling, your backoff can grow to minutes or hours, which creates its own problems — held connections, stale state, zombie sessions that reconnect long after the original context is gone. Retry Budgets: The Underused Primitive Here's where Linkerd gets something importantly right that most service meshes and client libraries don't foreground: the retry budget. The idea is simple. Instead of configuring retries per-request ("retry up to N times"), you configure retries per-traffic-volume ("retries may not exceed X% of requests"). 
Linkerd's default is 20% — meaning if your service is handling 1,000 RPS, it will allow at most 200 retry requests per second, regardless of how many individual requests are failing. Once the budget is exhausted, requests fail fast. This is a fundamentally different mental model. Per-request retry limits think locally — this request failed, try it again. Retry budgets think globally — the system is under stress, we cannot afford to amplify that stress beyond this threshold. The budget makes the cost of retrying explicit at the system level. The Istio equivalent is less elegant but workable. You can cap numRetries and set aggressive perTryTimeout values to bound the worst-case amplification, though you're still thinking per-route rather than per-budget. A rough YAML configuration: YAML retries: attempts: 3 perTryTimeout: 2s retryOn: "5xx,connect-failure,refused-stream" Notice retryOn. This matters. You should not retry on every error code. A 400 Bad Request doesn't get better with retries — the request is malformed and will fail identically on every attempt. Retrying 4xx errors is particularly wasteful because they're often client-side problems that the server will consistently reject. The codes worth retrying are: transient network failures, 503 Service Unavailable, 429 Too Many Requests (with appropriate backoff), and sometimes 502 Bad Gateway. Even 504 Gateway Timeout deserves scrutiny — if B is genuinely overwhelmed, retrying a timed-out request doesn't help B recover. Circuit Breakers: The Pattern Everyone Claims to Use and Almost Nobody Tunes Resilience4j, Hystrix (RIP), Polly, Istio's outlier detection — the options are plentiful. The implementations, in my experience, are often misconfigured to the point of uselessness. A circuit breaker has three states: closed (passing requests through), open (failing fast), and half-open (letting a probe request through to test recovery). The transitions between states are governed by parameters: failure rate threshold, minimum number of calls before the threshold applies, wait duration in open state, permitted calls in half-open state. The defaults in most libraries are conservative in a way that makes them nearly inert. A failure rate threshold of 50% sounds aggressive, but if your minimum call count is 100, the breaker won't open until you've seen 50 failures in the sampling window. With a small sliding window of, say, 10 calls, you might need 5 consecutive failures before it trips. In practice, by the time the breaker opens, you've already generated substantial unnecessary load. The tuning questions nobody asks at configuration time: What's the expected recovery time for this dependency? Set your waitDurationInOpenState to something meaningful relative to that. If your downstream service typically recovers in 30 seconds, a 5-second open window means the breaker will half-open and immediately re-trip multiple times before recovery, adding noise to your metrics and extending the incident.What's the right sampling window? A count-based window (last N calls) can be gamed by low-traffic services where N takes minutes to fill. Time-based windows (last N seconds) are usually more appropriate for production.What should happen when the circuit is open? This is the graceful degradation question. Returning an error is fine. Returning a cached response is better. Returning a sensible default is sometimes correct. The teams I've seen handle this best define the fallback behavior explicitly, in code, with the same rigor they'd apply to the happy path. 
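To make those tuning knobs concrete, here is a minimal, library-agnostic sketch of the state machine with the parameters discussed above: failure-rate threshold, minimum call count, a time-based sliding window, open-state wait duration, and permitted half-open probes. It is an illustration, not a drop-in replacement for Resilience4j or Istio's outlier detection, and the parameter names are my own.

Python
import time
from collections import deque

class CircuitBreaker:
    """Minimal closed -> open -> half-open state machine with a time-based window."""

    def __init__(self, failure_rate_threshold=0.5, minimum_calls=20,
                 window_seconds=30.0, wait_duration_open=30.0,
                 permitted_calls_half_open=5):
        self.failure_rate_threshold = failure_rate_threshold
        self.minimum_calls = minimum_calls            # threshold only applies after this many calls
        self.window_seconds = window_seconds          # time-based window, not count-based
        self.wait_duration_open = wait_duration_open  # align with the dependency's real recovery time
        self.permitted_calls_half_open = permitted_calls_half_open
        self.state = "closed"
        self.opened_at = 0.0
        self.half_open_calls = 0
        self.results = deque()                        # (timestamp, succeeded) pairs

    def allow_request(self) -> bool:
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at < self.wait_duration_open:
                return False                          # fail fast, no downstream call
            self.state = "half-open"
            self.half_open_calls = 0
        if self.state == "half-open":
            if self.half_open_calls >= self.permitted_calls_half_open:
                return False                          # probes exhausted, wait for their results
            self.half_open_calls += 1
        return True

    def record(self, succeeded: bool) -> None:
        now = time.monotonic()
        if self.state == "open":
            return                                    # late results while open are ignored
        self.results.append((now, succeeded))
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()
        if self.state == "half-open":
            if not succeeded:
                self._trip(now)                       # a failed probe re-opens immediately
            elif self.half_open_calls >= self.permitted_calls_half_open:
                self.state = "closed"                 # all permitted probes allowed, latest succeeded
            return
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.minimum_calls
                and failures / len(self.results) >= self.failure_rate_threshold):
            self._trip(now)

    def _trip(self, now: float) -> None:
        self.state = "open"
        self.opened_at = now
        self.results.clear()

A caller checks allow_request() before the downstream call and reports the outcome with record(); the gradual half-open ramp described next is exactly the part this naive version lacks.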
The half-open state is where circuit breakers most often fail in practice. Probe requests succeed in the test environment because the test environment has predictable load. In production, the first probe arrives when the downstream service has just recovered and is still warming up — and under the concurrent burst of all the callers that were queued behind the open breaker. The probe succeeds. The breaker closes. 200 requests hit simultaneously. The service tips over again. Repeat. The fix is to open the circuit gradually: allow, say, 5% of traffic through in half-open state, ramp to 25%, ramp to 100%. Most libraries don't do this natively. Istio's outlier detection is closer to this model, ejecting individual hosts rather than binary-tripping a per-service breaker. What You Actually Change on Monday Morning Not everything. The systems are running. You don't get to redesign the retry architecture from scratch during business hours. But some things are cheap and high-value: Audit your retry configurations. Find every place in your codebase where retries are configured — client libraries, service mesh configs, SDK defaults you didn't know were there. AWS SDKs retry by default. Many HTTP clients retry on timeout by default. The retry behavior you didn't configure is often more dangerous than the retry behavior you did. Add jitter to anything that doesn't have it. If you have backoff = base * 2^attempt, change it to backoff = random(0, base * 2^attempt). Twenty minutes of work. Immediate improvement in thundering herd conditions. Turn on retry rate monitoring. Your APM or service mesh almost certainly exposes retry counts. Surface them. Add a dashboard. Set an alert at, say, 1% retry rate under normal conditions — abnormal elevations will catch incipient retry storms before they become billing anomalies. Identify your non-idempotent paths and either remove retries or add idempotency keys. POST endpoints that create resources cannot be safely retried without idempotency controls. If you're retrying a payment or an order creation, you're potentially creating duplicates. This is its own class of disaster, separate from cost — but it compounds cost because you're now also writing extra records. Define your fallbacks. For each service your system depends on, what should happen when it's unavailable? The answer "retry indefinitely" is almost never correct. "Return a cached response" or "return a degraded but valid result" or "queue for later processing" are usually better. The fallback should be in code, tested, and not a surprise to the on-call engineer at 2 AM. The Broader Frame There's something philosophically interesting about retry storms that I keep coming back to. Each individual retry is rational. From the perspective of a single request that failed due to a transient network glitch, retrying is exactly the right behavior. The emergence of a retry storm from individually-rational retries is a classic collective action problem — something that's good for each agent is destructive when everyone does it simultaneously. Circuit breakers and retry budgets are collective action solutions. They impose a global constraint that each individual caller would have no incentive to impose on itself. This is, incidentally, why they work better when implemented in the mesh layer (where they can see aggregate traffic) than in individual client libraries (where they can only see their own requests). The Denial-of-Wallet framing is useful because it names the threat model correctly. 
You don't need an external attacker. You don't need a misconfigured adversary. You need one failure, one reasonable-looking retry policy, and enough traffic that the multiplication matters. The attack surface is your own response to your own failures. That's the part that's hard to internalize. The retries feel like resilience. They feel like diligence. They are, under the wrong conditions, the instrument of your own undoing.
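As a concrete starting point for the "add jitter" and "retry only what's retryable" items in the Monday-morning list above, here is a minimal sketch of a capped, full-jitter retry helper; the retryable status set, defaults, and names are illustrative rather than a recommendation for any particular SDK.

Python
import random
import time

RETRYABLE_STATUSES = {429, 502, 503}   # deliberately excludes 4xx client errors and plain 504s

def call_with_retries(do_request, max_retries=3, base=0.5, cap=10.0):
    """Full jitter: sleep a uniform random time in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(max_retries + 1):
        status, body = do_request()
        if status < 400:
            return body
        if status not in RETRYABLE_STATUSES or attempt == max_retries:
            raise RuntimeError(f"giving up after {attempt + 1} attempts, last status {status}")
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

Here do_request stands in for any zero-argument callable returning a (status, body) pair; a production version would also want an overall deadline, retry-budget accounting, and idempotency keys for anything that is not a safe GET.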
Java has always been a serious language for production systems, and in 2026, the Generative AI ecosystem has finally caught up. For years, Java developers watched from the sidelines as Python and TypeScript accumulated framework after framework for building LLM-powered applications. Today, the picture is very different. Java has multiple mature, actively maintained AI frameworks, each with its own philosophy and trade-offs. This article covers the four frameworks I have personally used to ship Java AI applications: Genkit Java, Spring AI, LangChain4j, and Google ADK Java. Each one represents a meaningfully different bet on what a Java AI framework should be, and understanding those differences will save you from picking the wrong tool. Genkit Java History and Direction Genkit started life as a TypeScript-first framework launched by Google at I/O 2024. The Java SDK arrived as a community-maintained effort, built and maintained by developers within the Google ecosystem who wanted to bring the same developer experience to Java that Genkit had established in TypeScript. As of 2026, Genkit Java is unofficial; it is not an official Google product, but it is actively maintained, follows the core Genkit design closely, and ships its own plugin ecosystem. The framework’s first stable release landed in early 2026 after months of preview use. Its ambition mirrors the TypeScript SDK’s: bring Genkit’s multi-level abstractions (vanilla generation, typed flows, agents), its broad provider-neutral plugin model, and, crucially, the Genkit Developer UI to Java developers. The Java SDK ships with Spring Boot and Jetty server plugins, making it a natural fit for teams that already run Java services in production. The Javadoc and architecture are clean and idiomatic Java; this does not feel like a port; it feels designed for the language. The direction is clear: maintain parity with the TypeScript Genkit SDK’s abstractions while embracing Java idioms (builder patterns, typed schemas via Java classes, annotation-free configuration). Support for evaluation, MCP (Model Context Protocol), RAG with pgvector and Pinecone, and multi-agent patterns is already in place. What Makes Genkit Java Stand Out Like its TypeScript counterpart, Genkit Java provides three levels of abstraction in a single SDK: direct model calls, typed flows (observable pipelines), and agents. This is unique in the Java AI space; no other Java framework gives you all three in one coherent API. Supported languages: Java 21+ (primary). Deploys to Spring Boot, Jetty, or Firebase Cloud Functions. Vanilla Generation Java import com.google.genkit.Genkit; import com.google.genkit.ai.GenerateOptions; import com.google.genkit.plugins.googlegenai.GoogleGenAIPlugin; Genkit genkit = Genkit.builder() .plugin(GoogleGenAIPlugin.create()) .build(); String text = genkit.generate(GenerateOptions.builder() .model("googleai/gemini-flash-latest") .prompt("Explain the CAP theorem in two sentences.") .build()).getText(); Typed Flows: Observable Pipelines Flows are the heart of Genkit Java. They wrap your AI logic in a named, typed, traceable unit that is automatically exposed as an HTTP endpoint and visible in the Dev UI. 
Java import com.google.genkit.Genkit; import com.google.genkit.flow.FlowOptions; import com.google.genkit.plugins.googlegenai.GoogleGenAIPlugin; import com.google.genkit.plugins.jetty.JettyPlugin; import com.google.genkit.plugins.jetty.JettyPluginOptions; record TranslateRequest(String text, String targetLanguage) {} record TranslateResponse(String translation, String detectedLanguage) {} JettyPlugin jetty = new JettyPlugin(JettyPluginOptions.builder().port(8080).build()); Genkit genkit = Genkit.builder() .plugin(GoogleGenAIPlugin.create()) .plugin(jetty) .build(); genkit.defineFlow( FlowOptions.<TranslateRequest, TranslateResponse>builder() .name("translateText") .inputClass(TranslateRequest.class) .outputClass(TranslateResponse.class) .build(), (ctx, request) -> { var response = genkit.generate(GenerateOptions.builder() .model("googleai/gemini-flash-latest") .prompt("Translate '%s' to %s. Return JSON with 'translation' and 'detectedLanguage'." .formatted(request.text(), request.targetLanguage())) .outputClass(TranslateResponse.class) .build()); return response.getOutput(TranslateResponse.class); } ); Tools and Agents Java import com.google.genkit.ai.tool.ToolDefinition; var weatherTool = genkit.defineTool( ToolDefinition.<String, String>builder() .name("getWeather") .description("Returns current weather for a city.") .inputClass(String.class) .outputClass(String.class) .build(), (ctx, city) -> "Sunny, 24°C in " + city ); // Use the tool inside a flow or agent var result = genkit.generate(GenerateOptions.builder() .model("googleai/gemini-flash-latest") .prompt("What's the weather like in Tokyo?") .tools(List.of(weatherTool)) The Dev UI: Same Power as TypeScript One of Genkit Java’s most compelling features is that the same Genkit Developer UI used by the TypeScript SDK works directly with Java applications. You install the Genkit CLI (Node.js-based) and start your Java app through it: Shell npm install -g genkit The Dev UI opens at http://localhost:4000 and gives you: Flow runner – execute any flow interactively with custom inputs and inspect typed outputs.Trace explorer – full OpenTelemetry traces for every generate and flow call, showing latency, token counts, and exact prompts.Model playground – test any registered model directly.Tool testing – stub and test tools in isolation.Dotprompt editor – edit .prompt files live with variable injection. This is the single biggest advantage Genkit Java has over every other Java AI framework: a zero-config, local developer UI that replaces the need for LangSmith or Grafana during development. Provider Support Genkit Java ships plugins for: Google GenAI (Gemini), OpenAI, Anthropic (Claude), AWS Bedrock, Azure AI Foundry, Ollama, xAI (Grok), DeepSeek, Cohere, Mistral, and Groq. All accessed through the same genkit.generate() interface. Vector store plugins cover: Firebase Firestore, Weaviate, PostgreSQL (pgvector), Pinecone, and a local in-memory store. 
Pros and Cons ✅ Pros❌ ConsBest-in-class Dev UI with local trace explorerUnofficial/community-maintained (not a Google product)Multi-level abstractions: vanilla, flows, agentsArtifacts on GitHub Packages (requires auth to pull)Broadest provider support in Java ecosystemJava 21+ requiredSpring Boot and Jetty deployment pluginsSmaller community than LangChain4j or Spring AIOpenTelemetry built inStill SNAPSHOT versioned (1.0.0-SNAPSHOT)Idiomatic Java with builder patterns Spring AI History and Direction Spring AI was announced by the Spring team (Broadcom) in mid-2023 and reached its 1.0 GA release in mid-2024. It is the most enterprise-grade option in this comparison, built by the same team that maintains Spring Framework, Spring Boot, and Spring Data, which together underpin a vast proportion of the world’s Java server-side applications. The founding premise of Spring AI is that AI integration in Java applications should feel like every other Spring integration: auto-configured, testable, portable, and production-ready out of the box. The project draws inspiration from LangChain and LlamaIndex, but explicitly avoids being a port; it is designed from the ground up to be idiomatic Spring. If you have written Spring applications, Spring AI will feel immediately familiar: @Autowired AI clients, Spring Boot starters, application.properties configuration, and Advisor patterns that mirror Spring’s existing interception model. Spring AI’s direction through 2025 and into 2026 has been to deepen its observability story (Micrometer-native metrics and traces), expand its ChatClient fluent API, and ship more vector store integrations. The framework is now the de facto standard for teams that are already invested in the Spring ecosystem and want to add AI capabilities without introducing a foreign dependency philosophy. What Makes Spring AI Stand Out Spring AI’s killer feature is Spring Boot integration depth. There is no framework on this list, in any language, that integrates AI capabilities as seamlessly into an existing application framework as Spring AI does with Spring Boot. Auto-configuration, conditional beans, health indicators, Actuator endpoints for AI metrics, everything a Spring developer expects, applied to AI. Supported languages: Java (primary). Also supports Kotlin (via Spring’s Kotlin DSL). Runs anywhere Spring Boot runs: embedded Tomcat, Jetty, Undertow, GraalVM native images. Java // application.properties // spring.ai.openai.api-key=${OPENAI_API_KEY} // spring.ai.openai.chat.options.model=gpt-4o import org.springframework.ai.chat.client.ChatClient; import org.springframework.web.bind.annotation.*; @RestController public class ChatController { private final ChatClient chatClient; public ChatController(ChatClient.Builder builder) { this.chatClient = builder.build(); } @GetMapping("/chat") public String chat(@RequestParam String message) { return chatClient.prompt() .user(message) .call() .content(); } Structured Output Spring AI’s BeanOutputConverter maps model responses directly to Java POJOs, using the class schema to generate format instructions automatically. Java import org.springframework.ai.chat.client.ChatClient; import org.springframework.ai.converter.BeanOutputConverter; record MovieReview(String title, int rating, String summary, List<String> pros) {} BeanOutputConverter<MovieReview> converter = new BeanOutputConverter<>(MovieReview.class); MovieReview review = chatClient.prompt() .user(u -> u.text("Review the movie Inception. 
{format}") .param("format", converter.getFormat())) .call() .entity(MovieReview.class); RAG With Advisors Spring AI’s Advisors API is one of its most elegant features. Advisors wrap ChatClient calls with cross-cutting concerns, RAG retrieval, chat memory, logging, guardrails in a declarative, composable way. Java import org.springframework.ai.chat.client.ChatClient; import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor; import org.springframework.ai.vectorstore.VectorStore; @Service public class DocumentQAService { private final ChatClient chatClient; public DocumentQAService(ChatClient.Builder builder, VectorStore vectorStore) { this.chatClient = builder .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore)) .build(); } public String answerQuestion(String question) { return chatClient.prompt() .user(question) .call() .content(); } } Observability Spring AI ships with Micrometer integration out of the box. Every chat call generates spans (Spring Boot tracing) and metrics (prompt token count, completion token count, model latency) visible in any Micrometer-compatible backend: Prometheus, Grafana, Zipkin, or Datadog. There is no separate Dev UI, observability is handled by your existing Spring Boot infrastructure. Broad Vector Store and Model Support Spring AI supports 10+ model providers (OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, Azure OpenAI, Mistral, Ollama, Groq, and more) and 20+ vector stores (PGVector, Pinecone, Weaviate, Redis, Elasticsearch, MongoDB Atlas, Chroma, and more), the broadest integration coverage of any Java AI framework. Pros and Cons ✅ Pros❌ ConsDeepest Spring Boot integration, feels nativeNo standalone Dev UI for flow inspectionMicrometer-native observabilityAgent abstractions are less mature than LangChain4jBroadest model and vector store integrationsAdvisors pattern has a learning curveProduction-tested by the Spring ecosystemHeavier spring context overhead for simple use casesGraalVM native image supportNo flow/pipeline abstraction like GenkitIdiomatic Java and Kotlin support LangChain4j History and Direction LangChain4j was started in early 2023 by a small community of Java developers who noticed that the LLM framework explosion happening in Python had no Java equivalent. Despite the name, the project is not a mechanical port of LangChain Python; it is a fusion of ideas from LangChain, Haystack, LlamaIndex, and original innovation, packaged in a way that makes sense for Java. It grew quickly through 2023 and 2024, driven by its comprehensive integration list (20+ LLM providers, 30+ vector stores) and its clean two-level abstraction model: low-level primitives for maximum control and high-level AI Services for rapid development. The AI Services pattern, where you define an interface with annotations, and LangChain4j implements it for you at runtime, became the framework’s signature feature and arguably the most Java-idiomatic approach to LLM integration in the ecosystem. By 2025, LangChain4j had formal integrations with Quarkus, Spring Boot, Micronaut, and Helidon, covering every major Java application framework. The team’s direction in 2026 is focused on deepening agentic capabilities (multi-step tools, planning loops, MCP support) and improving the observability story, which has historically been a weaker point compared to Spring AI’s Micrometer integration or Genkit’s Dev UI. What Makes LangChain4j Stand Out LangChain4j’s AI Services pattern is its defining feature. 
Instead of writing imperative LLM call code, you declare an interface, annotate it with @SystemMessage, @UserMessage, and memory annotations, and LangChain4j generates the implementation. The result is AI code that reads like a Java service contract, clean, testable, and completely familiar to Java developers. Supported languages: Java (primary). Kotlin extensions available (coroutine-based async support). Integrates with Spring Boot, Quarkus, Micronaut, Helidon. Java import dev.langchain4j.service.AiServices; import dev.langchain4j.service.SystemMessage; import dev.langchain4j.model.openai.OpenAiChatModel; interface TranslationAssistant { @SystemMessage("You are a professional translator. Translate text accurately and naturally.") String translate(@UserMessage String text, @V("language") String targetLanguage); } var model = OpenAiChatModel.withApiKey(System.getenv("OPENAI_API_KEY")); TranslationAssistant assistant = AiServices.builder(TranslationAssistant.class) .chatLanguageModel(model) .build(); String result = assistant.translate("The quick brown fox jumps over the lazy dog", "Spanish"); System.out.println(result); Memory and Streaming Java import dev.langchain4j.memory.chat.MessageWindowChatMemory; import dev.langchain4j.service.MemoryId; interface ConversationalAssistant { @SystemMessage("You are a helpful assistant.") String chat(@MemoryId String userId, @UserMessage String message); } ConversationalAssistant assistant = AiServices.builder(ConversationalAssistant.class) .chatLanguageModel(model) .chatMemoryProvider(memoryId -> MessageWindowChatMemory.withMaxMessages(20)) .build(); // Each userId gets its own isolated memory assistant.chat("user-42", "My name is Alice."); String response = assistant.chat("user-42", "What's my name?"); // Returns: "Your name is Alice." RAG Pipeline Java import dev.langchain4j.data.document.loader.UrlDocumentLoader; import dev.langchain4j.data.document.splitter.DocumentSplitters; import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore; import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever; // Ingest documents var documents = UrlDocumentLoader.load("https://example.com/docs"); var splitter = DocumentSplitters.recursive(500, 50); var segments = splitter.splitAll(documents); var embeddingModel = OpenAiEmbeddingModel.withApiKey(apiKey); var embeddingStore = new InMemoryEmbeddingStore<TextSegment>(); EmbeddingStoreIngestor.ingest(segments, embeddingStore, embeddingModel); // Build RAG-enabled assistant interface DocsAssistant { String answer(@UserMessage String question); } var retriever = EmbeddingStoreContentRetriever.builder() .embeddingStore(embeddingStore) .embeddingModel(embeddingModel) .maxResults(3) .build(); DocsAssistant assistant = AiServices.builder(DocsAssistant.class) .chatLanguageModel(model) .contentRetriever(retriever) .build(); Two Abstraction Levels LangChain4j explicitly offers two levels: Low level – ChatModel, UserMessage, AiMessage, EmbeddingStore: full control, more code.High level – AiServices: declarative interfaces, minimal boilerplate. This mirrors what Genkit Java achieves differently. Where Genkit gives you flows and agents as pipeline concepts, LangChain4j uses interface-based AI Services as its high-level abstraction, very idiomatic in Java terms. 
Pros and Cons ✅ Pros❌ ConsAI Services pattern is uniquely Java-idiomaticNo built-in Dev UI or trace explorerLargest integration ecosystem (20+ models, 30+ stores)Observability requires external tooling (no Micrometer by default)Two clear abstraction levels (low and high)Agent capabilities still maturing (2026)Spring Boot, Quarkus, Micronaut, Helidon integrationsLarge number of modules can be overwhelmingKotlin coroutine supportLess opinionated, more choices to make yourselfStrong RAG tooling out of the box Google ADK Java History and Direction Google ADK (Agent Development Kit) launched in 2024 as a Python-first agent framework targeting enterprise deployments on Google Cloud. Java was a late addition to the multi-language roadmap, with ADK Java 1.0 shipping in early 2026 alongside ADK Go 1.0. The Java SDK arrival was significant: it signaled that Google views ADK as a serious enterprise runtime, not just a Python scripting tool. ADK Java follows the same design philosophy as the Python SDK: everything is an agent, workflow, or tool. The framework is optimized for building reliable, evaluatable, production-grade multi-agent systems and deploying them to Google Cloud infrastructure, primarily Vertex AI Agent Engine, Cloud Run, and GKE. Like its Python counterpart, ADK Java carries the weight of Google Cloud gravity. The best developer experience, the smoothest deployment path, and the most mature observability story all assume you are running on GCP. ADK Java 1.0 includes the full agent runtime (LLM agents, sequential/loop/parallel workflow agents), tool calling, MCP support, A2A (Agent-to-Agent) protocol, session/memory management, and streaming. The Java API closely mirrors the Python API in structure, which means the mental model transfers well, but also means the Java SDK carries a style that reflects Python-first design decisions. ADK Java’s Position: Agent-Only, Enterprise-Grade Like its Python counterpart, ADK Java is an agent framework; it has no vanilla generation primitive or flow abstraction outside the agent model. Its raison d’être is spinning up reliable, evaluatable agents and deploying them at enterprise scale. If you are building a multi-agent system on Google Cloud and Java is your language of choice, ADK Java 1.0 is Google’s recommended path. Supported languages: Java (with ADK Java 1.0). Also: Python (primary), TypeScript, Go. Java import com.google.adk.agents.LlmAgent; import com.google.adk.tools.GoogleSearchTool; import com.google.adk.runner.InMemoryRunner; import com.google.genai.types.Content; var researchAgent = LlmAgent.builder() .name("researcher") .model("gemini-flash-latest") .instruction("You help users research topics thoroughly and accurately.") .tools(List.of(new GoogleSearchTool())) .build(); var runner = new InMemoryRunner(researchAgent); var session = runner.sessionService().createSession( researchAgent.name(), "user-1" ).blockingGet(); var userMessage = Content.fromParts(Part.fromText( "What are the latest developments in fusion energy?" )); runner.runAsync(researchAgent.name(), session.id(), userMessage) .blockingForEach(event -> { if (event.finalResponse()) { System.out.println(event.stringifyContent()); } }); Multi-Agent Orchestration ADK Java’s multi-agent capabilities match the Python SDK’s, including sequential, parallel, and loop orchestration. 
Java import com.google.adk.agents.SequentialAgent; var researcher = LlmAgent.builder() .name("researcher") .model("gemini-flash-latest") .instruction("Research the given topic and provide key facts.") .build(); var writer = LlmAgent.builder() .name("writer") .model("gemini-flash-latest") .instruction("Write a clear, well-structured article from the research provided.") .build(); var editor = LlmAgent.builder() .name("editor") .model("gemini-flash-latest") .instruction("Polish and format the article for publication.") .build(); var pipeline = SequentialAgent.builder() .name("contentPipeline") .subAgents(List.of(researcher, writer, editor)) .build(); Vertex AI Lock-In ADK Java’s production deployment story is built around Vertex AI Agent Engine and Google Cloud. While you can run ADK Java locally (via the ADK CLI or directly) and deploy to Cloud Run or GKE independently, the managed evaluation tools, performance dashboards, and enterprise support all assume GCP. This is the clearest example in the Java AI space of a framework built to serve a platform rather than being platform-neutral. Pros and Cons ✅ Pros❌ ConsOfficial Google support with production SLATightly coupled to Vertex AI and GCPBest multi-agent orchestration in JavaAgent-only framework, no vanilla generation or flowsA2A protocol for agent interoperabilityPython-first design reflected in Java API styleFull evaluation tools (user simulation, custom metrics)Requires GCP for full observability and deployment featuresScales to enterprise on Google CloudYoungest Java SDK (1.0 released 2026)Streaming support (Gemini Live API) Head-to-Head Comparison Developer Experience FrameworkDX HighlightsShortcomingsGenkit JavaDev UI for local tracing is unmatched. Idiomatic Java builder API.GitHub Packages auth friction; unofficial statusSpring AIFeels native to any Spring Boot codebase. Zero-surprise API.No visual Dev UI; observability via Micrometer onlyLangChain4jAI Services pattern is the cleanest Java-native AI abstractionNo Dev UI; agent features still maturingADK JavaPowerful multi-agent tooling. Official Google support.GCP-centric; Python-style reflected in Java API Abstraction Levels Genkit Java is the only Java AI framework that provides all three levels: vanilla generation, typed flows (pipelines), and agents. Spring AI covers generation and a basic agent model via tools, but lacks a flow abstraction. LangChain4j provides two levels (low-level primitives and high-level AI Services) but is agent/service focused. ADK Java is agent-only. Observability FrameworkLocal DevProductionGenkit JavaDev UI with trace explorerOTEL-compatible exportSpring AILogs and Actuator endpointsMicrometer (Prometheus, Grafana, Datadog)LangChain4jLogging onlyManual OTEL setupADK JavaADK Web UICloud Trace + Vertex (GCP) Framework Neutrality Genkit Java and LangChain4j are built to be provider-neutral: they support every major model and deploy to any infrastructure. Spring AI is similarly neutral on model providers, though it carries Spring’s opinionated application framework as a dependency, a worthwhile trade for most Java shops. ADK Java carries the heaviest platform dependency: its full value is unlocked on Google Cloud. Java Ecosystem Fit FrameworkSpring BootQuarkusMicronautNative ImageGenkit Java✅ Plugin❌❌❌Spring AI✅ Native❌❌✅ GraalVMLangChain4j✅ Module✅ Extension✅ ModulePartialADK Java❌❌❌❌ Which Framework Should You Choose? 
Choose Genkit Java if: You want to iterate on your AI fast and get feedback with less back and forth — Genkit was built from the ground up for powerful local tooling and observability, and the Dev UI is genuinely transformative.You need multiple abstraction levels (vanilla calls, typed flows, and agents) in one SDK.Provider neutrality matters: you need to swap or mix Gemini, Claude, OpenAI, and Bedrock.Your team also writes TypeScript and wants a consistent framework story across both stacks. Choose Spring AI if: You are already running Spring Boot and want AI to feel like any other Spring integration.Micrometer-native metrics and traces plugging into your existing Prometheus/Grafana stack are a priority.You need the broadest model and vector store coverage with production-grade auto-configuration.GraalVM native images are a requirement for your deployment targets. Choose LangChain4j if: You want the most Java-idiomatic high-level AI abstraction: interface-based AI Services with annotations.You need the largest integration ecosystem and don’t want to be tied to any application framework.Your team works across Spring Boot, Quarkus, Micronaut, and Helidon, LangChain4j is the most framework-agnostic.RAG pipelines with rich document ingestion and retrieval are a core use case. Choose ADK Java if: You are building enterprise-grade multi-agent systems and Google Cloud is your runtime.You need official Google support and SLA-backed infrastructure for agent deployment.Multi-agent orchestration (sequential, parallel, loop) and the A2A interoperability protocol matter.Your team is already using the ADK Python SDK and wants to extend to Java services. Conclusion Java’s AI framework landscape in 2026 is surprisingly rich. The four frameworks covered here serve genuinely different needs, and unlike in the JavaScript world, where Genkit, Vercel, Mastra, LangChain, and ADK overlap significantly, the Java options each occupy a clearer niche. For enterprise Spring Boot teams, Spring AI is the obvious choice, with zero friction, production-ready observability via Micrometer, and the broadest integration matrix. For teams that value developer experience above all, Genkit Java’s Dev UI is a category apart and worth the unofficial status trade-off. For framework-agnostic Java developers who want the most idiomatic Java AI service abstraction, LangChain4j’s AI Services pattern is hard to beat. And for Google Cloud enterprise workloads that need reliable multi-agent orchestration at scale, ADK Java 1.0 is where Google is putting its weight. The most important thing is that you no longer have an excuse to reach for Python just because it has better AI tooling. Java’s time in generative AI has arrived. Last updated: April 2026. Framework versions referenced: Genkit Java 1.0.0-SNAPSHOT, Spring AI 1.x, LangChain4j 0.36.x, Google ADK Java 1.0.
The key-value (KV) cache is a fundamental optimization in transformer-based LLM inference. It stores intermediate attention states, i.e., the keys and values computed during the prefill phase, so that subsequent tokens can reuse them instead of recomputing from scratch. This significantly reduces compute cost and latency, especially for long-context or multi-turn agentic workloads. KV caching has been extensively discussed across several blogs and documentation [1, 2, 3, 4, 5]. This article, instead of revisiting those well-known concepts, digs into the KV cache implementation details of vLLM (v0.20.0) for a deeper understanding. By walking through code internals with concrete code pointers and design insights, it aims to bridge the gap between high-level understanding and real-world system design.

KV Cache Is Not a Standard Cache

At first glance, KV cache sounds like a standard caching problem: storing computed results to reuse later. However, in systems like vLLM, the KV cache behaves fundamentally differently from traditional caches such as a Redis cache. It is not a simple key-value lookup system sitting outside the execution path, but rather a tightly coupled component of the model's forward pass that must be accessed at every decoding step. Unlike conventional caches, the KV cache is dynamic, partially reusable, and deeply intertwined with GPU memory allocation. This means that KV cache design is as much about memory management and scheduling as it is about cache reuse. Thinking of it as just a cache hides its true complexity; it is better understood as a virtualized memory layer for intermediate computation.

Traditional cache (e.g., Redis) vs. KV cache in LLMs (e.g., vLLM), by dimension:
Purpose: avoid recomputing full results vs. avoid recomputing intermediate attention state.
Common access pattern: key -> value lookup vs. key -> key-value bytes lookup during model execution.
Reuse type: all or nothing vs. partial reuse (prefix based).
Storage: in-memory / persisted vs. primarily GPU memory, which can also be persisted.
Consistency: eventual or strong consistency vs. must match the exact token sequence.
Scheduling dependency: independent vs. strongly coupled with request scheduling.
Failure mode: cache miss results in recompute vs. cache miss results in recompute.
Cache locality sensitivity: low (can often be distributed for better reliability and scalability) vs. very high (node/worker local) and I/O latency sensitive.

The kv_cache_manager is a good entry point for seeing that the KV cache in vLLM is not a traditional cache but an active memory manager used during inference: it handles allocation, reuse, eviction, prefix cache hits, and request lifecycle state.

Python

class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        max_num_batched_tokens: int | None = None,
        enable_caching: bool = True,
        use_eagle: bool = False,
        log_stats: bool = False,
        enable_kv_cache_events: bool = False,
        dcp_world_size: int = 1,
        pcp_world_size: int = 1,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ) -> None:

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

vLLM KV Cache Design

vLLM's KV cache design treats KV memory like virtual memory rather than contiguous tensors to avoid memory bottlenecks. Instead of allocating large blocks per request, it introduces a layer of indirection via fixed-size blocks and block tables.
This allows memory to be used efficiently, reused across requests, and dynamically resized as sequences grow. Two core primitives enable this design: block tables and an eviction mechanism. Together, they solve critical problems in memory fragmentation, reuse, and scalability. Block Tables The block table is the central abstraction in vLLM's KV cache design. Instead of storing KV tensors contiguously in GPU memory, each request maintains a mapping from logical token positions to physical memory blocks. This indirection layer is conceptually similar to a page table in operating systems. When the model accesses KV for a given token, it resolves through the block table to locate the physical block in GPU memory. This design allows KV memory to be non-contiguous, shared across multiple requests, and dynamically extended as tokens are generated. The code pointers below are a good entry point to understand this concept in detail. vLLM maintains a BlockTable whose rows correspond to active request slots. Each row maps a request's logical token/block positions to physical KV cache block IDs in GPU memory. This indirection lets KV blocks be allocated non-contiguously and lets multiple requests refer to reused/shared cached blocks. Python class BlockTable: def __init__( self, block_size: int, max_num_reqs: int, max_num_blocks_per_req: int, max_num_batched_tokens: int, pin_memory: bool, device: torch.device, kernel_block_size: int, cp_kv_cache_interleave_size: int, ): Source: (v0.20.0) vllm/vllm/v1/worker/block_table.py at main · vllm-project/vllm · GitHub vLLM's KV cache is divided into fixed size KVCacheBlocks. These blocks are the fundamental unit of allocation, prefix cache reuse, reference counting, and eviction. The code below is a good set of pointers for understanding that lifecycle. Python class BlockPool: def __init__( self, num_gpu_blocks: int, enable_caching: bool, hash_block_size: int, enable_kv_cache_events: bool = False, metrics_collector: KVCacheMetricsCollector | None = None, ): Source: (v0.20.0) vllm/vllm/v1/core/block_pool.py at main · vllm-project/vllm · GitHub allocate_slots() asks the coordinator how many blocks are needed, checks the shared block_pool for free capacity, and then calls allocate_new_blocks() only for the current request's needed slots. That shows blocks are dynamically assigned from a shared pool rather than preallocated per request. Python def allocate_slots( self, request: Request, num_new_tokens: int, num_new_computed_tokens: int = 0, new_computed_blocks: KVCacheBlocks | None = None, num_lookahead_tokens: int = 0, num_external_computed_tokens: int = 0, delay_cache_blocks: bool = False, num_encoder_tokens: int = 0, full_sequence_must_fit: bool = False, ) -> KVCacheBlocks | None: Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub Cache Eviction Eviction in vLLM is more complex than a typical least recently used (LRU) policy due to dependencies between tokens. KV blocks form a logical prefix chain, meaning later tokens depend on earlier ones. As a result, eviction cannot arbitrarily remove blocks without breaking correctness. Instead, vLLM uses a reference count-based mechanism combined with recency heuristics. Blocks are only eligible for eviction when no active request depends on them, and even then, eviction typically proceeds from the tail of sequences to preserve prefix integrity. This constrained eviction behavior ensures correctness while still allowing the system to operate under memory pressure. 
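To make the block-table indirection and block lifecycle concrete, here is a deliberately simplified, toy sketch (not vLLM's actual code; the class names and block size are illustrative) of a per-request block table backed by a shared, reference-counted block pool:

Python

BLOCK_SIZE = 16  # tokens per KV block; illustrative, vLLM's block size is configurable

class ToyBlockPool:
    """Shared pool of physical KV blocks with reference counting."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))   # physical block IDs
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block_id = self.free_blocks.pop()
        self.ref_counts[block_id] = 1
        return block_id

    def release(self, block_id: int) -> None:
        self.ref_counts[block_id] -= 1
        if self.ref_counts[block_id] == 0:           # reclaimable only when unreferenced
            self.free_blocks.append(block_id)

class ToyBlockTable:
    """Maps one request's logical token positions to physical block IDs."""
    def __init__(self, pool: ToyBlockPool):
        self.pool = pool
        self.blocks: list[int] = []                  # index = logical block number

    def ensure_capacity(self, total_tokens: int) -> None:
        needed = (total_tokens + BLOCK_SIZE - 1) // BLOCK_SIZE
        while len(self.blocks) < needed:             # grow on demand, non-contiguously
            self.blocks.append(self.pool.allocate())

    def slot_for(self, token_pos: int) -> tuple[int, int]:
        # Resolve a logical position to (physical block ID, offset within the block).
        return self.blocks[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

vLLM realizes the same ideas with a per-request BlockTable, a shared BlockPool, and the additional machinery for prefix hashing and constrained eviction discussed in this section.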
Blocks are reference-counted. A block can only be freed when no active request depends on it to ensure correctness. Python def free(self, request: Request) -> None: self.coordinator.free(request.request_id) Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub When a request completes, KVCacheCoordinator.free() calls free() on each per-type manager. The manager removes that request's blocks from req_to_blocks and returns them to BlockPool.free_blocks(), where they become reclaimable once their reference count reaches zero. Python def free(self, request_id: str) -> None: req_blocks = self.req_to_blocks.pop(request_id, []) Source: (v0.20.0) vllm/vllm/v1/core/single_type_kv_cache_manager.py at main · vllm-project/vllm · GitHub How the Request Flow Works Understanding how KV cache works requires following a request through the system. At a high level, vLLM attempts to reuse previously computed KV blocks by matching prefixes, allocates new blocks for unseen tokens, and schedules requests in a way that maximizes reuse while balancing GPU utilization. Prefix matching identifies previously computed KV blocks that can be reused for the incoming request. find_longest_cache_hit() takes the incoming request's block_hashes, searches the prefix cache for matching cached blocks, and returns the reusable KVCacheBlocks plus the number of computed tokens. Python def find_longest_cache_hit( self, block_hashes: list[BlockHash], max_cache_hit_length: int, ) -> tuple[tuple[list[KVCacheBlock], ...], int]: Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_coordinator.py at main · vllm-project/vllm · GitHub New tokens are assigned newly allocated KV blocks on demand, and the returned block IDs are later used by the worker side block table to extend the request's physical KV mapping. Python new_blocks = self.coordinator.allocate_new_blocks( request.request_id, num_tokens_need_slot, num_tokens_main_model, num_encoder_tokens, ) Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub The scheduler decides which requests run together. Requests with shared prefixes benefit from co-location, improving cache hit rate. Even with perfect caching logic, poor scheduling can eliminate all cache benefits. Python def schedule(self) -> SchedulerOutput: Source: (v0.20.0) vllm/vllm/v1/core/sched/scheduler.py at main · vllm-project/vllm · GitHub Before the model forward pass execution, gpu_mode_runner uses the request block table to resolve scheduled logical token positions into physical KV cache slot IDs. During attention execution, the resulting slot_mapping and block_table_tensor are passed through attention metadata so kernels can read/write the correct KV cache locations. Python def _prepare_inputs( self, scheduler_output: "SchedulerOutput", num_scheduled_tokens: np.ndarray, ) -> tuple[ torch.Tensor, SpecDecodeMetadata | None, ]: self.input_batch.block_table.compute_slot_mapping( num_reqs, self.query_start_loc.gpu[: num_reqs + 1], self.positions[:total_num_scheduled_tokens], ) source: (v0.20.0) vllm/vllm/v1/worker/gpu_model_runner.py at main · vllm-project/vllm · GitHub Future Work While vLLM's prefix-based KV caching is highly effective, it has inherent limitations that motivate future work. Today, reuse is mainly strongest when requests share the same prefix, because cached blocks are validated through prefix/block hash chains. 
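To see why reuse is effectively bound to prefixes, consider a toy sketch of hash-chained block keys (illustrative only; vLLM's real block hashes fold in additional metadata):

Python

import hashlib

BLOCK_SIZE = 16  # tokens per block; illustrative value

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with its parent's hash, forming a prefix chain."""
    hashes: list[str] = []
    parent = ""
    for start in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + "," + ",".join(map(str, block))
        digest = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(digest)
        parent = digest
    return hashes

def longest_prefix_hit(request_hashes: list[str], cached_blocks: dict[str, int]) -> int:
    """Count reusable leading blocks; stop at the first miss (no mid-sequence reuse)."""
    hits = 0
    for block_hash in request_hashes:
        if block_hash not in cached_blocks:
            break
        hits += 1
    return hits

Because each block's key incorporates its parent's key, changing any earlier token changes every later key, so lookups can only ever match the longest shared prefix.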
One future direction is a more general segment or chunk-level reuse, where systems try to reuse repeated prompt regions beyond strict prefixes. This could help when shared content appears later in prompts, but it is harder than prefix caching because KV states depend on position and surrounding context. Another direction is distributed KV caching, where KV state can be stored, transferred, or shared across workers or replicas rather than remaining purely local to one GPU/node. This can improve reuse and scaling, but introduces challenges around latency, routing, placement, and consistency. Together, these directions move KV caching from a local per-worker optimization toward a broader system-level capability. Conclusion vLLM rethinks KV caching as a memory management and scheduling problem rather than a simple reuse mechanism. Through fixed-size block allocation, block tables for logical to physical indirection, prefix-aware reuse, reference-counted block lifetimes, and LRU-like cached block eviction, it turns KV cache into a virtualized resource that can be shared efficiently across requests. However, the effectiveness of this system depends not only on its internal design, but also on how requests are scheduled, batched, and routed. These nuances show that KV caching is not merely a local optimization, but a core systems primitive for modern LLM inference. As inference systems evolve toward more general segment/chunk level reuse and distributed KV caching, these same principles will continue to shape scalable and efficient serving platforms. References https://github.com/vllm-project/vllmhttps://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llmshttps://bentoml.com/llm/inference-optimization/kv-cache-offloadinghttps://cloud.google.com/blog/topics/developers-practitioners/boosting-llm-performance-with-tiered-kv-cache-on-google-kubernetes-engine/https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/https://pub.towardsai.net/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d
Three Lessons from an AI Makeathon

I recently participated in a makeathon focused on building AI-powered applications. Over 2–3 intense days, I watched teams go from idea to demo — and the patterns that separated working products from frustrated debugging sessions were remarkably consistent, especially for teams building AI agents. From this makeathon and from my experience working with teams building AI applications and agents, here are the three lessons I took away on how to build reliable AI applications by engineering around non-determinism. Together, these form what I would like to call "The Architecture for Reasoning Control".

1. Start Small — Non-Determinism Compounds

AI models are non-deterministic. The same input won't always produce the same output. This is a feature when you want creativity. It's a problem when you want reliability. In a small app — one model call, one task — non-determinism is manageable: you can observe the behavior, tune your prompts, and build confidence. You iterate fast and catch drift early. In a large app, like an AI agent where the model must reason, select tools, and manage state across multiple steps, these non-determinism errors compound. Every AI call is a roll of the dice. Chain ten of them together and you're rolling ten dice simultaneously. The probability of a successful end-to-end run — P(success)^n — decays exponentially. For example, ten steps that each succeed 95% of the time yield only about a 60% end-to-end success rate (0.95^10 ≈ 0.60). The probability of at least one undesired result doesn't just grow — it compounds quickly. In my experience building bigger AI agents, we often spend the majority of our time chasing unpredictable outputs across these long chains. By scoping small, we found we could build working demos and deployable applications that actually stay on the rails.

The Architectural Lesson: Apply the Single Responsibility Principle (SRP) of architecture design: an AI module should have one, and only one, reason to change. You can think of these as analogous to microservices — small, single-purpose AI units that can be composed safely. Get one agentic interaction working with high reliability before you dream of chaining it. If the foundation is shaky, the agentic skyscraper will fall.

2. Multipass Guardrails — Defense in Depth

Even the best guardrails we built didn't have 100% effectiveness. A single validation pass catches most bad outputs — but "most" isn't enough when you're shipping to users. To understand why, consider the full surface area you need to guard. Most teams think about content safety — blocking violent or illegal content. But that's just one of six categories to guard. To get more determinism in our guardrail efficacy and build true Defense-in-Depth, we experimented with a "double-pass" approach — running the same guardrail logic against both input and output. While this bumped our success rate slightly, it quickly revealed a structural flaw: correlated blind spots. When our detection logic misclassified an illegal query as merely "off-topic" at the input stage, it consistently made the same error at the output stage. Similarly, PII that bypassed the upstream filter sailed through downstream because the detection signature was identical. We realized that doubling down on the same logic slightly increased our safety margin, but it just mirrored our existing weaknesses. So we researched shifting from symmetrical filtering to a model built on orthogonal, independent layers. The goal was to ensure that if one layer failed, the next would approach the problem from a completely different technical angle.
This “cops-and-robbers” dynamic makes it significantly less likely that failures align — requiring multiple, differently designed systems to fail simultaneously for an issue to reach the user. If you’re looking to move beyond simple “pass/fail” filters, here are the layers you could analyze to stack with your guardrail: Dedicated Scanners (NER & Regex): Use deterministic PII scanners (regex for SSNs/credit cards) and Named Entity Recognition (NER) to catch data leaks before the query even hits the model.Intent Routing: Use a fast, specialized classifier to bucket queries into “benign,” “ambiguous,” or “high-risk.” This allows you to route high-risk queries through stricter handling paths or specialized system prompts before they reach the primary generative model.Structural Enforcement (JSON Schema): Move the goalposts from “free-text” to “data validation.” By forcing the model to output in a strict JSON schema, you turn unpredictable “Model Behavior” risks into a predictable code problem that can be caught by a standard parser.LLM-as-a-Judge: Introduce a secondary, smaller “observer” model tasked purely with evaluating the primary model’s response against a different set of criteria.Retrieval-grounded responses (RAG) Constraining the model to answer only from retrieved context and validating that outputs are traceable to sources — reducing hallucination and unsupported claims.Confidence / uncertainty gating Using signals (judge scores, validation checks, or model uncertainty) to decide when to answer, ask for clarification, or fall back — rather than treating all outputs equally. The overarching lesson was that there is no such thing as a “perfect” guardrail. Instead, assemble a stack of diverse, independent checks. By assuming that every individual layer will occasionally fail, you can design a system where those failures never align — creating a robust “Swiss Cheese” model of AI safety that actually holds up under adversarial pressure. 3. Flow Engineering — Mix AI with Deterministic Processing: Control What You Can AI excels at ambiguity: reasoning over messy inputs, interpreting intent, and generating natural language. But for problems requiring guaranteed correctness — precise data lookups, workflow sequencing, or state management — it remains fundamentally probabilistic. It can often arrive at the right answer, but it cannot reliably guarantee it every time. The insight that worked best: Use AI for reasoning; use deterministic code for execution. Let AI decide what to do (intent, analysis, extraction). Then let code decide how to do it (orchestration, API calls, state management). This separation doesn’t just improve reliability — it fundamentally changes how the system behaves: Controlled Scope: By limiting LLM calls to only the steps that require reasoning, you reduce unnecessary model invocations and keep the AI surface area small. This reinforces Lesson 1 — when the scope is smaller, the non-determinism is easier to observe.Targeted Safety: It strengthens Lesson 2 — guardrails are most effective when applied to fewer, well-defined points rather than across an unbounded flow. This is the “agentic pattern” emerging across the industry: a deterministic workflow engine that delegates to AI only where human-like reasoning is needed, then pulls the result back into controlled, predictable code. The best AI applications aren’t the ones that give AI the most freedom — they’re the ones that give AI the right freedom. This is the core of Flow Engineering. 
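To make this split concrete, here is a minimal sketch of the pattern (the call_llm wrapper and the order/refund helpers are hypothetical stand-ins, not code from the makeathon project): the model only classifies intent and extracts fields into a strict schema, while routing, validation, and execution stay in deterministic code.

Python

import json

ALLOWED_INTENTS = {"check_order_status", "request_refund", "other"}

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your model provider; returns the raw model text."""
    raise NotImplementedError

def lookup_order_status(order_id: str | None) -> str:
    return f"Order {order_id}: shipped" if order_id else "Please share your order ID."

def start_refund_workflow(order_id: str | None) -> str:
    return f"Refund started for order {order_id}." if order_id else "Please share your order ID."

def extract_intent(user_message: str) -> dict:
    # AI decides *what*: classify intent and extract fields into a strict schema.
    prompt = (
        "Return only JSON with keys 'intent' (one of "
        f"{sorted(ALLOWED_INTENTS)}) and 'order_id' (string or null).\n"
        f"Message: {user_message}"
    )
    data = json.loads(call_llm(prompt))            # structural enforcement: must parse
    if data.get("intent") not in ALLOWED_INTENTS:  # validate before acting
        data["intent"] = "other"
    return data

def handle(user_message: str) -> str:
    # Deterministic code decides *how*: routing, lookups, and state changes.
    request = extract_intent(user_message)
    if request["intent"] == "check_order_status":
        return lookup_order_status(request.get("order_id"))
    if request["intent"] == "request_refund":
        return start_refund_workflow(request.get("order_id"))
    return "I can help with order status and refunds."

Every guardrail from Lesson 2 now has a narrow, well-defined place to attach: the single JSON boundary between the model and the code.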
Instead of letting an agent navigate a dark room, we hard-coded the rails. By using the LLM as a cognitive engine at specific steps in a verifiable chain — rather than a free-roaming driver — we replaced a porous process with a solid structural track. Why This Works Reliability: Deterministic systems eliminate randomness in mission-critical steps.Cost & Latency: Fewer LLM calls lead to lower inference costs and faster responses.Observability: A smaller AI surface area is easier to monitor, test, and debug.Safety: Guardrails become exponentially more effective when applied at controlled, well-defined points. You’re not just optimizing performance — you’re containing non-determinism. Exec Insight: High-risk business logic should stay deterministic; creative and reasoning tasks can be probabilistic. Conclusion All three lessons point to the same principle: respect the non-determinism. The goal isn’t to eliminate non-determinism. It’s to build systems where it can’t break you by using “ARC: The Architecture for Reasoning Control”. AI systems don’t fail because they’re non-deterministic. They fail because that non-determinism is poorly bounded. Don’t fight it. Don’t ignore it. Don’t pretend your model is a function that returns the same output every time. The teams that built the most impressive demos at the makeathon weren’t the ones with the most ambitious prompts. They were the ones who understood where AI helps — and where it doesn’t. Summarizing using a Swiss Cheese metaphor: Lesson 1 (Start Small): Shrinking the size of the “holes” in the cheese by limiting scope using Architectural principle of SRP.Lesson 2 (Orthogonal Defense-in-Depth): Stacking the slices so the “holes” never align through orthogonal layers.Lesson 3 (Flow Engineering): Reduce how much is cheese in the first place in the system by using Deterministic Flows for critical logic. While our team was recognized with a special award, the real takeaway was the framework we discovered along the way. Start small. Guard deep. Stay deterministic. That’s what turns AI from a demo into a system you can trust.
Agentic development is rapidly becoming one of the most talked-about paradigms in software development. The talk is not just of using AI to assist in coding but of using systems where an AI agent is capable of planning, executing tasks, and even deciding. From a surface-level perspective, agentic systems are a new abstraction. But if we look under the hood, we find something that looks rather familiar: distributed systems. In microservices, asynchronous workflows, or event-driven architectures, many of the same challenges apply: Irregular behaviorPartial terminal conditionsLatency fluctuationsLack of observability The biggest mistake teams make is treating agents like deterministic scripts. In reality, they require the same rigor and design discipline as distributed systems. The Illusion of Determinism The traditional software model is fundamentally deterministic. Under similar conditions, one expects the same result. Agentic systems contradict this assumption. Identical prompts and inputs cannot always cause the same outputs because of: Model variabilityContext variationToken limitsThe response from an external tool This is akin to the behavior of distributed systems that have to deal with the real-world conditions - network latency, retries, and service dependencies that generate differences. This logically means that you cannot rely on "it worked once" as proof of correctness. Instead, you must design for: VariabilityApproximationProbabilistic correctness This one modification is sufficient to prompt engineers to reconsider the entire approach to achieving reliability. Agents Are Just Services With Unstable Contracts In the realm of distributed systems, services often interact with clearly defined contracts. This is usually an API, schema, or a versioned interface. However, the converse is often true for the agentic systems. A typical agent flow might look like: Create a responseCall a toolParse the outputDecide on the next Action However, without strict contracts things break: The model returns JSON that is not entirely the sameThere is a field that is either missing or has been renamedThe tool response format is different These problems are not edge cases; they are expected behaviors. The solution is to treat agents like services with stricter contracts: Ensure that the outputs are structured clearly (JSON schemas, typed responses)Validate each interaction that takes placeFail fast on invalid responses You don't trust the model, you would rather encase it in a construct that ensures correctness at the boundaries. Orchestration Over Autonomy There is a general perception that agents are autonomous and can thus operate independently. In reality, this is not often the case in production scenarios. What actually works is orchestration. Like the distributed systems that make use of orchestrators (workflow engines, schedulers, queues), agentic systems also require: Feedback control loopsStepwise executionExplicit state transitions The robust agentic workflow includes the following main steps: Propose the taskImplement a single stepCheck outputChoose the next stepLoop or terminate This is not autonomy, but rather controlled implementation. It’s a bit like a state machine rather than a self-driving system. The more critical the workflow, the more you need control: Limiting agent freedomSpecifying allowed actionsAdding human-in-the-loop checkpoints when needed Without a doubt, orchestration is what makes systems reliable, though autonomy does have its own charm. 
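As a rough sketch of that controlled loop (the propose_step and execute functions are hypothetical placeholders for your model call and tool layer), the orchestrator, not the model, owns the state machine and the step budget:

Python

from dataclasses import dataclass, field

MAX_STEPS = 5

@dataclass
class AgentState:
    goal: str
    history: list[dict] = field(default_factory=list)  # explicit, persisted state transitions
    done: bool = False

def propose_step(state: AgentState) -> dict:
    """Hypothetical model call: returns a structured action, e.g. {'action': ..., 'args': ...}."""
    raise NotImplementedError

def execute(action: dict) -> dict:
    """Hypothetical deterministic execution of a single allowed action (tool call, API request)."""
    raise NotImplementedError

def is_valid(result: dict) -> bool:
    # Check the output against a contract (schema, bounds, business rules).
    return isinstance(result, dict) and "error" not in result

def run(goal: str) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(MAX_STEPS):               # bounded execution, never unbounded autonomy
        action = propose_step(state)         # 1. propose the next step
        result = execute(action)             # 2. implement a single step
        step = {"action": action, "result": result}
        if not is_valid(result):             # 3. check the output
            step["rejected"] = True
            state.history.append(step)
            continue                         # 4. choose the next step (here: retry)
        state.history.append(step)
        if result.get("final"):              # 5. loop or terminate
            state.done = True
            break
    return state

The model suggests the next action, but every transition, retry, and termination decision happens in ordinary code.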
Failure Is the Default State

Distributed systems are built around the assumption that failure is not a special event but a normal occurrence. The same holds for agentic systems: failure is always a possibility. Errors can arise on different fronts:

The model might misjudge what the issue actually is
A tool call could fail or time out
The agent might get stuck in a loop
The output is syntactically correct but semantically wrong

If your system assumes success, it will fail in production. Instead, design for failure:

Add retries with limits
Implement timeouts
Introduce fallback paths
Detect and break infinite loops

For example:

If the agent is unable to produce valid output after 3 attempts, fall back to a deterministic flow
If a tool call fails, the system can still give a degraded yet safe response

These are the circuit-breaker and retry-policy patterns from distributed systems at work. Reliability comes not from avoiding errors but from handling errors gracefully.

Observability Is Non-Negotiable

One of the hardest issues in distributed systems is observability: understanding what happened when something has gone wrong. In agentic systems, it is even harder. Why? Because failures are often not binary. The system could:

Deliver an answer that is quietly wrong
Use the wrong reasoning
Adopt incorrect assumptions

Without observability, debugging is guesswork. Running agentic systems in production therefore requires:

Structured logs of every step
Prompt and response tracing
Tool invocation tracking
Decision-path visibility

Think of it as distributed tracing for agents. Instead of just logging outputs, log:

Inputs
Intermediate reasoning (if safe)
Tool calls and results
Final decisions

This allows you to answer critical questions: Where did the system go astray? Was it the model, the prompt, or the tool? Is this an isolated issue, or is it a pattern? Good observability turns unpredictable systems into manageable ones.

Idempotency and State Management

In distributed systems, idempotency guarantees that repeated actions don't produce unintended consequences. Agentic systems need this even more. Consider the scenarios where:

A step is retried
A tool is called multiple times
The agent restarts mid-flow

These situations can lead to duplicated actions, inconsistent outputs, and corrupted workflows. Best practices include:

Store explicit state between steps
Make tool calls idempotent where possible
Keep track of execution history

For example, rather than allowing the agent to "remember" context implicitly, persist:

What steps were completed
What outputs were produced
What decisions were made

This turns brittle state into recoverable state.

Guardrails Over Intelligence

One common misconception is that improving the model will solve most problems. However, system design matters more than model capability. More capable models make fewer mistakes, but they do not eliminate:

Ambiguities
Misinterpretations
Unexpected outputs

Guardrails are what make systems usable:

Input validation
Output constraints
Action limits
Safety checks

For example:

The agent can only call tools that are explicitly allowed
Outputs are validated before execution
Destructive actions are blocked

This resembles the way distributed systems enforce:

Access controls
Rate limits
Data validation

You don't trust components blindly; rather, you constrain them.
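A minimal sketch of such guardrails (the tool names and rules are illustrative) might look like the following: every proposed tool call passes an allow-list check, a human-approval check for destructive actions, and basic argument validation before anything executes.

Python

ALLOWED_TOOLS = {"search_docs", "get_order", "send_email", "issue_refund"}
DESTRUCTIVE_TOOLS = {"issue_refund", "delete_record"}   # require human approval

class GuardrailError(Exception):
    """Raised when a proposed action violates a guardrail."""

def validate_tool_call(tool_name: str, args: dict) -> None:
    # Action limits: the agent may only call tools on the allow-list.
    if tool_name not in ALLOWED_TOOLS:
        raise GuardrailError(f"Tool '{tool_name}' is not allowed")
    # Safety checks: destructive actions never run without a human in the loop.
    if tool_name in DESTRUCTIVE_TOOLS:
        raise GuardrailError(f"Tool '{tool_name}' requires human approval")
    # Input validation: enforce basic argument constraints before execution.
    if not isinstance(args, dict) or not all(isinstance(k, str) for k in args):
        raise GuardrailError("Tool arguments must be a dict with string keys")

def guarded_execute(tool_name: str, args: dict, registry: dict) -> object:
    validate_tool_call(tool_name, args)          # fail fast at the boundary
    return registry[tool_name](**args)           # only runs if every check passed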
Closing Thoughts Agentic development is not about replacing the engineering discipline. It is about rigor in applying it. The most effective systems are not necessarily the most independent. They are the ones that are: Intelligently orchestratedHeavily constrainedDeeply observable Ultimately, the agents are simply another layer in your architecture.
One of the most unsettling characteristics of AI systems is how often they appear perfectly healthy. Infrastructure dashboards report stable CPU utilization, normal latency levels, and acceptable throughput. No alerts are triggered. From an operational standpoint, the system is functioning exactly as designed. Yet the outputs are wrong. In many AI deployments, engineers eventually encounter this situation: a recommendation system begins suggesting irrelevant items, a support chatbot produces inconsistent answers, or an AI assistant gradually becomes less reliable in answering domain-specific questions. Despite infrastructure stability evidenced by nominal CPU and latency metrics, AI systems frequently exhibit what can be described as silent degradation, a condition where semantic accuracy deteriorates while the transport layer remains fully operational. This failure mode is increasingly common in modern AI pipelines. Why Traditional Dashboards Can Be Misleading Monitoring platforms such as Prometheus, Datadog, and CloudWatch were designed for deterministic software systems. They track signals like request latency, memory usage, and service availability. These metrics are still essential. However, they only capture infrastructure health — not model behavior. Consider a typical retrieval-augmented generation (RAG) architecture. A user query moves through several layers before producing an answer: an API gateway, an embedding service, a vector database, a re-ranking layer, and finally the language model responsible for generating the response. If the embedding service experiences a brief latency spike, the system might reduce the number of retrieved documents or fall back to cached embeddings. The request still completes successfully, and infrastructure metrics remain within healthy ranges. But the language model now receives weaker context. The generated response may still appear fluent and coherent, yet its factual accuracy has quietly declined. From the perspective of the monitoring dashboard, the system remains healthy. From the perspective of the end user, the system has degraded. This gap highlights a fundamental challenge: AI reliability problems often occur in the semantic layer rather than the infrastructure layer. Retrieval Pipelines: A Hidden Source of Instability Retrieval systems are particularly vulnerable to subtle instability. Modern AI applications depend heavily on vector search to provide contextual knowledge to language models. Even small disturbances in this pipeline can significantly alter system behavior. For example, if a vector index update is delayed or embedding quality drifts slightly, similarity search may return documents that are only partially relevant. The model must then infer missing context on its own, increasing the probability of hallucination. Several factors can introduce this instability: embedding drift caused by model updatesdelayed indexing of newly ingested documentslatency spikes reducing the retrieval windowincomplete ranking signals in re-ranking layers None of these conditions necessarily produce infrastructure failures. Instead, they reduce the informational quality available to the model, weakening its reasoning capability. Hallucination Amplification Large language models generate responses probabilistically based on the context they receive. When that context becomes incomplete or noisy, the model compensates by relying more heavily on internal patterns. This is where hallucinations begin. 
A small retrieval error may initially produce a slightly uncertain response. In more complex systems, particularly agentic frameworks, this uncertainty can cascade through downstream workflows. For example, autonomous agents may execute follow-up API calls or trigger actions based on the model's interpretation of retrieved data. If the underlying reasoning is degraded, those actions can amplify the original error. In other words, a minor retrieval issue can evolve into a chain of incorrect decisions. Traditional monitoring tools rarely capture this phenomenon because they do not measure the semantic integrity of outputs.

Metrics That Actually Matter

If infrastructure metrics alone cannot detect these issues, what signals should engineers monitor instead? AI reliability requires a new class of observability metrics focused on model behavior.

One important signal is accuracy drift. Continuous evaluation pipelines can periodically test model outputs against benchmark datasets or validated queries, allowing teams to detect gradual declines in model performance.
Another critical metric is retrieval precision. In RAG systems, measuring the relevance of retrieved documents helps identify when embedding quality or vector index freshness begins to deteriorate.
Engineers should also monitor inference variance, the degree to which identical prompts produce different outputs over repeated runs. High variance can indicate unstable context, inconsistent retrieval results, or fluctuating model states.

Tracking these signals provides visibility into how the AI system is reasoning, rather than simply confirming that it is responding.

Accuracy Drift: detects a gradual decline in model correctness; an early indicator of model degradation.
Retrieval Precision: measures the quality of documents retrieved in RAG pipelines; poor retrieval leads to hallucinations.
Inference Variance: captures output instability across repeated prompts; indicates context inconsistency.
Context Coverage: tracks the percentage of relevant documents retrieved; measures knowledge completeness.
Response Entropy: measures uncertainty in generated responses; high entropy signals weak model confidence.

Example: Detecting Semantic Drift in a RAG Pipeline

A simple reliability monitor can periodically test model responses against expected outputs to detect early-stage degradation.

Python

from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

expected_answer = "The Eiffel Tower is located in Paris, France."
test_query = "Where is the Eiffel Tower located?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": test_query}],
)
generated_answer = response.choices[0].message.content

expected_embedding = embedder.encode([expected_answer])
generated_embedding = embedder.encode([generated_answer])
similarity = cosine_similarity(expected_embedding, generated_embedding)[0][0]

if similarity < 0.80:
    print("⚠️ Potential semantic drift detected.")

The AI Reliability Stack: A Proposed Architecture

Addressing these hidden failure modes requires integrating semantic monitoring into the AI development lifecycle. A typical AI reliability stack may include several layers of observability. At the infrastructure level, traditional monitoring tools such as Prometheus or OpenTelemetry continue to track system health metrics. These tools ensure that core services remain operational.
Above this layer sits model observability platforms such as LangSmith or Arize. These tools track prompt-response pairs, analyze model outputs, and detect anomalies in inference behavior. A third layer focuses on evaluation pipelines integrated into CI/CD workflows. Automated tests evaluate model performance using curated datasets, enabling teams to detect accuracy drift before it reaches production environments. Together, these layers provide a more complete picture of system reliability. Infrastructure monitoring ensures services remain available, while semantic monitoring ensures the system’s intelligence remains intact. In my work developing intent-based chaos models for distributed systems (for which I hold a USPTO-recognized patent), I observed that infrastructure telemetry alone rarely detects early-stage AI failures. Combining topology-aware chaos testing with semantic observability allows engineering teams to detect reliability issues before they propagate through production systems. Plain Text User Query │ ▼ API Gateway │ ▼ Embedding Service │ ▼ Vector Database │ ▼ LLM Inference │ ▼ Semantic Evaluation Layer │ ├── Accuracy Drift Monitor ├── Retrieval Precision Tracker └── Hallucination Detection Toward Reliability Engineering for AI Systems As AI systems become embedded in production environments, reliability engineering must evolve alongside them. Traditional observability practices remain essential for maintaining infrastructure stability. However, they must be complemented by tools that measure how AI systems actually behave. The next generation of reliability frameworks will likely combine infrastructure telemetry with semantic evaluation pipelines, enabling engineers to detect not just outages, but the early signals of degraded reasoning. The hidden failure modes of AI systems cannot be eliminated entirely. But with the right monitoring strategies, they can be detected before they undermine the reliability of intelligent systems. Building trustworthy AI requires more than uptime dashboards. It requires visibility into how the system thinks. CategoryTraditional Monitoring (Infrastructure)AI Observability (Semantic)Primary GoalDetect system outages and latency.Detect quality degradation and drift.Core MetricsCPU, RAM, HTTP 500s, p99 Latency.Faithfulness, Answer Relevancy, Context Recall.Failure StateBinary (Up or Down).Spectrum (Accurate to Hallucinated).ToolingPrometheus, Grafana, Datadog.LangSmith, Arize, Fiddler, DeepEval.Root CauseCode bugs, Hardware failure, Traffic spikes.Embedding drift, Retrieval gaps, Prompt sensitivity.
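Building on the semantic drift check shown earlier, a similar probe can track inference variance. The sketch below (the model choice, run count, and 0.15 threshold are illustrative assumptions) replays the same prompt several times and flags unstable outputs:

Python

import itertools
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def inference_variance(prompt: str, runs: int = 5) -> float:
    """Replay the same prompt and return the mean pairwise dissimilarity of the answers."""
    answers = []
    for _ in range(runs):
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        answers.append(response.choices[0].message.content)
    embeddings = embedder.encode(answers)
    similarities = [
        cosine_similarity([a], [b])[0][0]
        for a, b in itertools.combinations(embeddings, 2)
    ]
    return float(1 - np.mean(similarities))   # higher value = less stable outputs

if inference_variance("Where is the Eiffel Tower located?") > 0.15:  # threshold is illustrative
    print("⚠️ High inference variance detected.")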
The generative AI tooling ecosystem has exploded over the past two years. What started as a handful of Python libraries has grown into a rich, opinionated landscape of frameworks spanning multiple languages, deployment targets, and philosophical bets. As a developer who has shipped production applications using all five of the frameworks covered in this article, Genkit, Vercel AI SDK, Mastra, LangChain, and Google ADK, I want to offer a practical, hands-on view of where each one excels, where each one falls short, and what I would reach for depending on the project I’m building. This is not a benchmark post. Tokens per second and latency numbers go stale within weeks. Instead, this is a developer experience and architecture comparison, the kind of thing that matters when you’re deciding what framework will carry your product through 2026 and beyond. A quick note on scope: all five frameworks are in active development and moving fast. Code samples in this article use the APIs as of April 2026. Genkit History and Direction Genkit was announced by Google at Google I/O 2024 as an open-source framework designed to bring production-ready AI tooling to full-stack developers, regardless of their cloud provider. At the time, the JavaScript/TypeScript ecosystem lacked a coherent story for building AI-powered features with the kind of developer ergonomics you’d expect from, say, a Next.js app. Firebase’s team set out to fix that, building Genkit not as a proprietary Firebase product but as a cloud-agnostic SDK with first-class support for plugins. By mid-2024, Genkit had already attracted a community plugin ecosystem covering AWS Bedrock, Azure OpenAI, Ollama, Cohere, and a growing list of vector stores. The framework reached its 1.0 milestone in late 2024 and shipped major expansions in 2025, most notably adding Python (preview), Go, and Dart (preview) SDKs alongside the primary TypeScript runtime. This multi-language vision is central to Genkit’s story: it aspires to be the framework you reach for no matter what stack you’re running. As of 2026, the Dart SDK has matured notably, making Genkit one of the very few AI frameworks with meaningful Flutter support, giving mobile developers a first-class path into generative AI that no other framework on this list can match. It is also important to note that Genkit has an unofficial Java SDK, maintained by the community, which has been used in production but is not officially supported by the Genkit team. The team’s declared direction is to deepen Genkit’s role as a full-stack AI layer: strong observability primitives baked into the runtime, composable workflow abstractions (flows), and an expanding model plugin ecosystem. The ambition is not just to be a bridge to a single model provider but to be the connective tissue that lets you swap providers, mix modalities, and trace every hop in your pipeline, all from one coherent API. Of course, adding more capabilities to its DEV UI is also a major focus, with the goal of making it the best local development experience for AI applications, regardless of where they deploy. What Makes Genkit Stand Out Genkit occupies a unique position among the frameworks in this comparison: it is the only one that provides multiple levels of abstraction in a single, coherent API. You can call a model directly (vanilla generation), compose steps into a typed flow, or wire up a fully autonomous agent, and you can mix all three in the same application. Most other frameworks force you to choose a lane. 
Supported languages: TypeScript/JavaScript (primary, stable), Python (preview), Go, Dart/Flutter (preview) JavaScript import { genkit } from 'genkit'; import { googleAI } from '@genkit-ai/google-genai'; const ai = genkit({ plugins: [googleAI()] }); // Vanilla generation — no abstraction needed const { text } = await ai.generate({ model: googleAI.model('gemini-flash-latest'), prompt: 'What is the capital of France?', }); Flows — Composable, Typed Pipelines Flows are Genkit’s first-class pipeline primitive. They are strongly typed, observable end-to-end, and automatically traced in the Dev UI. You define them once and can invoke them from CLI, HTTP, or the Dev UI without any extra scaffolding. import { genkit, z } from 'genkit'; import { googleAI } from '@genkit-ai/google-genai'; const ai = genkit({ plugins: [googleAI()] }); const summarizeFlow = ai.defineFlow( { name: 'summarizeArticle', inputSchema: z.object({ url: z.string().url() }), outputSchema: z.object({ summary: z.string(), keyPoints: z.array(z.string()) }), }, async ({ url }) => { const { output } = await ai.generate({ model: googleAI.model('gemini-flash-latest'), prompt: `Summarize the article at ${url} and list the key points.`, output: { schema: z.object({ summary: z.string(), keyPoints: z.array(z.string()) }), }, }); return output!; } ); Agent Abstractions For agents, Genkit uses definePrompt with tools and a system prompt to define specialized agents, along with tool calling via defineTool and conversation memory, all integrated with the same tracing and observability infrastructure that flows use. The agent model is deliberate: it gives you control over how much autonomy you hand over to the model. JavaScript import { genkit, z } from 'genkit'; import { googleAI } from '@genkit-ai/google-genai'; const ai = genkit({ plugins: [googleAI()] }); const weatherTool = ai.defineTool( { name: 'getWeather', description: 'Returns current weather conditions for a given city.', inputSchema: z.object({ city: z.string() }), outputSchema: z.object({ temperature: z.number(), condition: z.string() }), }, async ({ city }) => { // Real implementation would call a weather API return { temperature: 22, condition: 'Sunny' }; } ); const travelAgent = ai.definePrompt( { name: 'travelAdvisor', description: 'Travel Advisor can help with trip planning and weather-based advice', model: googleAI.model('gemini-flash-latest'), tools: [weatherTool], system: 'You are a helpful travel advisor. Use available tools to give accurate advice.', } ); // Start a chat session with the agent const chat = ai.chat(travelAgent); const response = await chat.send('Should I pack a jacket for my trip to Lisbon?'); console.log(response.text); The Dev UI — Where Genkit Truly Shines The Genkit Developer UI is, frankly, the killer feature. No other framework in this comparison comes close to what Genkit offers locally. You launch it with a single command: Shell npx genkit start The Dev UI gives you: Flow runner – execute any flow with a custom input, inspect the typed output, and view the full execution trace.Model playground – invoke any registered model directly, tweak prompt templates, compare outputs.Tool testing – stub and test individual tools in isolation before wiring them into an agent.Trace explorer – every generate, flow, and agent call is traced with latency breakdowns, token counts, and the exact prompts and completions sent to the model. 
This is OpenTelemetry-compatible telemetry, exportable to Cloud Trace, Langfuse, or any OTEL collector.Dotprompt editor – Genkit’s .prompt files (Dotprompt) are editable live in the UI, with real-time preview and variable injection.Session replay – replay any traced session end-to-end to reproduce bugs without re-running the full application. This local observability loop collapses what normally requires a deployed tracing backend (LangSmith, Langfuse, Weave) into a zero-config experience that runs entirely offline. For development speed, this is enormous. Vercel’s Developer Tool, by comparison, is a lightweight panel primarily for inspecting HTTP streaming responses. It doesn’t offer flow visualization, trace exploration, or tool testing. It’s functional but basic, the kind of thing you’d expect as a starting point, not a full developer experience. Broad Model Support — Provider Neutral by Design Genkit ships official plugins for Google AI (Gemini), Google Vertex AI, OpenAI, Anthropic Claude, Cohere, Mistral, Ollama (local models), AWS Bedrock, and more. The community has extended this to xAI, DeepSeek, Perplexity, and Azure OpenAI. Every model, regardless of provider, is accessed through the same ai.generate() interface, and every call is automatically traced. JavaScript import { genkit } from 'genkit'; import { anthropic } from 'genkitx-anthropic'; import { openAI } from 'genkitx-openai'; const ai = genkit({ plugins: [anthropic(), openAI()] }); // Switch between providers without changing downstream code const { text: claudeResponse } = await ai.generate({ model: anthropic.model('claude-sonnet-4-5'), prompt: 'Explain transformer attention in one paragraph.', }); const { text: gptResponse } = await ai.generate({ model: openAI.model('gpt-4o'), prompt: 'Explain transformer attention in one paragraph.', }); Pros and Cons ✅ Pros❌ ConsBest-in-class Dev UI with local tracing and flow visualizationDart/Python SDKs still in previewMultiple abstraction levels: vanilla, flows, and agentsSmaller community than LangChainTruly provider-neutral with broad plugin ecosystemSome advanced patterns require deeper framework knowledgeStrong Flutter/Dart support for mobile AI Idiomatic TypeScript API Firebase, Cloud Run, or self-hosted deployment OpenTelemetry-compatible observability built in Vercel AI SDK History and Direction The Vercel AI SDK was born out of a practical need: Vercel builds the infrastructure that powers a large portion of the modern web, and as developers started shipping AI features inside Next.js apps in 2023, the friction of integrating streaming LLM responses into React was painfully apparent. Vercel released the initial AI SDK as an open-source library to standardize streaming, provider integration, and UI hooks across its ecosystem. The SDK grew quickly, adding support for Vue, Svelte, SolidJS, and plain Node.js, but its DNA remains deeply tied to the Vercel and Next.js stack. Version 3 in 2024 introduced streamUI, which lets you stream React components as model output, a paradigm-shift for building truly generative user interfaces. Version 4, shipping in late 2024, brought generateObject and streamObject with Zod schemas, structured output across all providers, and an expanded agent API. By 2026, AI SDK v6 will have established itself as the go-to choice for teams that live in the Vercel/React ecosystem and want the lowest-friction path from a prompt to a production UI. Vercel’s direction is clear: deeper integration between AI, edge compute, and the frontend. 
The AI Gateway, launched in 2025, acts as a provider proxy with load balancing and fallback, another layer of lock-in dressed as a convenience. The SDK is intentionally lower-level than Genkit or Mastra, favoring simplicity and composability over opinionated abstractions.

What Makes the Vercel AI SDK Stand Out

The Vercel AI SDK's greatest strength is its seamless integration with React and the web UI layer. useChat, useCompletion, and useObject hooks wire directly into streaming AI responses with built-in state management, loading indicators, and error boundaries. If you're building a Next.js app and want to add a chat interface or a streaming form, nothing gets you there faster.

Supported languages: TypeScript/JavaScript (primary). Node.js, React, Next.js, Nuxt, SvelteKit, SolidStart, Expo (React Native).

TypeScript

// app/api/chat/route.ts (Next.js App Router)
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = await streamText({
    model: openai('gpt-4o'),
    messages,
  });

  return result.toDataStreamResponse();
}

TypeScript

// app/page.tsx — chat UI with one hook
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}><b>{m.role}:</b> {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Say something..." />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Structured Generation and Agent Patterns

The SDK provides clean primitives for structured output and tool use, though the abstractions are deliberately minimal. You get generateText, streamText, generateObject, streamObject, and a simple maxSteps loop for agentic behavior. There is no high-level "flow" abstraction or graph; you compose these primitives yourself.

JavaScript

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: z.object({
    recipe: z.object({
      name: z.string(),
      ingredients: z.array(z.object({ name: z.string(), amount: z.string() })),
      steps: z.array(z.string()),
    }),
  }),
  prompt: 'Generate a recipe for a vegan chocolate cake.',
});

Genkit vs. Vercel AI SDK — Abstraction Levels

Compared to Genkit, the Vercel AI SDK operates at a lower level of abstraction. This is by design; Vercel wants to give you sharp, composable tools, not an opinionated framework. The trade-off is that you assemble more boilerplate yourself. Want to trace a multi-step agent? Wire up OpenTelemetry manually. Want a typed pipeline? Build it yourself. Genkit bakes these in. Conversely, Vercel's deep UI integration (streaming RSC, useChat, generative UI patterns) is something Genkit does not attempt to own. For Flutter-based applications, Genkit's Dart SDK fills this role, but in the web domain, Vercel wins on integration depth.
Pros and Cons of Permalink ✅ Pros❌ ConsUnmatched React/Next.js/Edge integrationPrimarily TypeScript/JavaScript onlyMinimal API surface, easy to learnNo built-in flow or pipeline abstractionuseChat / useCompletion hooks are best-in-classDeveloper Tool is basic (no trace explorer, no flow runner)Generative UI with RSC streamingObservability requires external toolingBroad provider support via official adaptersDeeper use cases accumulate boilerplate quicklyIdiomatic TypeScript throughoutVercel-ecosystem bias (AI Gateway, templates) Mastra History and Direction Mastra is the youngest framework in this comparison, founded in 2024 by the team behind Gatsby (Cade Diehm and Sam Bhagwat). Coming from a background of developer experience, tooling, and static-site generation, Mastra’s founders approached AI framework design with a strong bias toward TypeScript ergonomics, workflow-first thinking, and integrated tooling. The name “Mastra” (Swahili for “master”) reflects the team’s ambition to be the definitive TypeScript-native AI orchestration layer. Mastra reached public beta in late 2024 and gained significant traction in early 2025 among TypeScript developers frustrated with LangChain’s Python-ported patterns. The framework’s distinct feature, a built-in Studio UI, arrived in early 2025 and quickly became its marquee differentiator. Mastra Studio is a web-based visual interface for defining, testing, and running agents and workflows, accessible locally or in the cloud. By mid-2025, Mastra had secured seed funding and announced hosted cloud infrastructure for deploying Mastra agents directly from the Studio. Mastra’s direction is firmly in the TypeScript/JavaScript ecosystem. The team has shown no signs of pursuing multi-language support; instead, they are doubling down on deep integrations with popular TypeScript meta-frameworks like Next.js, Astro, SvelteKit, and Hono. Think of Mastra as the opinionated, batteries-included agent framework for TypeScript developers who want to spin up production agents as fast as possible, without writing any platform glue. What Makes Mastra Stand Out Mastra is purpose-built for one thing: spinning up agents fast. It is an agent-only framework; you will not find vanilla model calls or a “flow” primitive. Everything in Mastra is modeled around agents, tools, memory, and workflows. If you know exactly what you need (an agent with memory and tool access), Mastra gets you there in fewer lines of code than any other framework here. Supported languages: TypeScript/JavaScript exclusively. Integrations with Next.js, Astro, SvelteKit, Hono, Express. JavaScript import { Mastra, Agent } from '@mastra/core'; import { openai } from '@mastra/openai'; const researchAgent = new Agent({ name: 'researcher', model: openai('gpt-4o'), instructions: `You are a research assistant. Find relevant information, synthesize key points, and present clear, well-structured summaries.`, tools: { // Tools added here }, }); const mastra = new Mastra({ agents: { researchAgent } }); const response = await mastra.getAgent('researcher').generate([ { role: 'user', content: 'Summarize the latest developments in quantum computing.' }, ]); console.log(response.text); Workflows Mastra’s workflow primitive lets you chain agent steps into typed, directed graphs, useful when you need a mix of deterministic logic and LLM reasoning. 
Workflows

Mastra's workflow primitive lets you chain agent steps into typed, directed graphs, useful when you need a mix of deterministic logic and LLM reasoning.

JavaScript

import { Workflow, Step } from '@mastra/core';
import { z } from 'zod';

const contentPipeline = new Workflow({
  name: 'contentPipeline',
  triggerSchema: z.object({ topic: z.string() }),
});

contentPipeline
  .step({
    id: 'research',
    execute: async ({ context }) => {
      const { topic } = context.triggerData;
      // Agent call to research the topic
      return { research: `Key facts about ${topic}` };
    },
  })
  .then({
    id: 'draft',
    execute: async ({ context }) => {
      const { research } = context.getStepResult('research');
      // Agent call to draft the article
      return { draft: `Article draft using: ${research}` };
    },
  })
  .commit();

Pros and Cons

✅ Pros:
- Fastest path to a production-ready agent in TypeScript
- Excellent Studio UI for visual workflow building
- Idiomatic TypeScript API with strong type inference
- Good memory and tool-calling primitives
- Integrates well with popular JS meta-frameworks

❌ Cons:
- Agent-only: no flows, no vanilla generation primitives
- TypeScript/JavaScript only
- Younger ecosystem, fewer plugins
- Observability still maturing
- No mobile/cross-platform story

LangChain

History and Direction

LangChain is, by a significant margin, the most widely used AI framework in the world, but its story is complicated. Harrison Chase created LangChain in October 2022 as a Python library for chaining LLM calls, and it spread virally through the developer community in early 2023 as everyone scrambled to experiment with GPT-3 and GPT-4. Its key insight, that useful AI applications require structured chains of calls, retrieval augmentation, and tool integration, was correct and arrived at the right moment. GitHub stars and npm downloads shot to the top of every chart.

The JavaScript port, langchain on npm, arrived shortly after and has tracked the Python library closely in both API design and feature parity. This is the source of one of LangChain's most persistent criticisms: the JavaScript SDK feels like Python idioms force-translated into TypeScript. Patterns like BaseChain, runnable pipelines with .pipe(), and the LCEL (LangChain Expression Language) make perfect sense coming from Python's compositional patterns but feel unnatural to TypeScript developers accustomed to async/await and module-based composition.

LangChain, the company, raised $35M in 2023 and has since built a growing platform around LangSmith (observability and evaluation) and LangGraph (graph-based orchestration). This is where the tension lies: LangChain's open-source SDK and LangSmith are designed to complement each other. Getting the best observability experience requires using LangSmith. While you can configure other backends, the seamless experience is on their platform. The framework is excellent and featureful, but its commercial direction is unmistakably pointed toward LangSmith adoption.

In 2025, LangChain reorganized its JavaScript library around a cleaner agent API (createAgent) and introduced Deep Agents, pre-built agent implementations with built-in context compression and subagent spawning. LangGraph remains the recommended framework for complex multi-step workflows, and LangSmith continues to be the best-in-class platform for production LLM observability.

LangChain's Position: Agent-First, Platform-Tied

LangChain is squarely an agent framework. Its sweet spot is spinning up capable agents quickly, particularly for teams coming from the Python AI ecosystem who want to move to or stay in JavaScript without losing the LangChain mental model.
It is the most feature-complete framework here in terms of raw agent capabilities, RAG patterns, and integrations, but that breadth comes with complexity.

Supported languages: Python (primary, feature-complete), JavaScript/TypeScript (JS port, near-parity). Note: the JS SDK carries Python-style patterns.

JavaScript

import { createAgent } from 'langchain/agents';
import { ChatOpenAI } from '@langchain/openai';

function getWeather(city: string): string {
  // Real implementation would call a weather API
  return `It's always sunny in ${city}!`;
}

const model = new ChatOpenAI({ model: 'gpt-4o', temperature: 0 });

const agent = createAgent({
  model,
  tools: [
    {
      name: 'get_weather',
      description: 'Get weather for a given city.',
      func: getWeather,
    },
  ],
  systemPrompt: 'You are a helpful assistant.',
});

const result = await agent.invoke({
  messages: [{ role: 'user', content: 'What is the weather in Madrid?' }],
});

console.log(result.messages.at(-1)?.content);

LangSmith Observability

LangSmith is LangChain's answer to the observability problem. It provides trace visualization, dataset management, prompt versioning, and LLM evaluation, all polished and production-grade. The integration with LangChain is seamless: set LANGSMITH_TRACING=true and every run is captured automatically. The catch is that LangSmith is a SaaS platform. Genkit's Dev UI provides comparable local observability with zero cloud dependency. If you need hosted, team-scale observability, LangSmith is arguably the best option in the market. If you need local, zero-config development tracing, Genkit wins.

Pros and Cons

✅ Pros:
- Largest community and integration ecosystem
- LangSmith is best-in-class for production observability
- Feature-complete agent, RAG, and chain primitives
- Excellent Python SDK for Python teams
- Deep Agents provide batteries-included patterns
- LangGraph for advanced workflow orchestration

❌ Cons:
- JavaScript SDK feels like Python ported to TS
- Tight coupling to LangSmith for full observability
- Complex API surface, steep learning curve
- LangGraph required for complex graph workflows
- Heavy bundle size in browser/edge environments
- Commercial platform pressure

Google ADK (Agent Development Kit)

History and Direction

Google ADK was announced at Google Cloud Next 2024 as Google's opinionated take on a production-grade agent framework, specifically targeting enterprise deployments on Google Cloud. Unlike Genkit, which is cloud-agnostic and full-stack, ADK was designed from day one around Vertex AI and Google Cloud's agent infrastructure, including Agent Engine, Cloud Run, and GKE. It is the framework Google recommends when you're building agents that will live in a Google Cloud environment at scale.

ADK's initial release was Python-only, which told the story clearly: this was a framework for the enterprise Python AI developer, data scientists, ML engineers, and cloud architects who think in agents and workflows and are already committed to Google Cloud. The TypeScript, Go, and Java SDKs followed in 2025, with ADK Go 1.0 and ADK Java 1.0 shipping in early 2026. This multi-language expansion signals that Google is positioning ADK as more than a Python script runner; it wants to be the enterprise agent runtime for any Google Cloud workload.

ADK 2.0, released in 2026, brought significant refinements: graph-based workflow APIs, a visual Web UI builder, enhanced evaluation tooling (including user simulation and environment simulation for testing agents end-to-end), and deeper A2A (Agent-to-Agent) protocol support.
The A2A protocol is an open standard that allows ADK agents to communicate with agents built on other frameworks, a meaningful interoperability effort in a fragmented ecosystem.

Google's direction with ADK is unmistakable: this is enterprise AI infrastructure for Google Cloud customers. If your organization runs on GCP and needs reliable, scalable, observable agent deployments with enterprise support, ADK is Google's answer. If you need to be cloud-agnostic, look elsewhere.

ADK's Position: Agent-First, Enterprise-Grade

Like LangChain and Mastra, ADK is an agent-only framework; its reason for existing is to make building, evaluating, and deploying agents fast and reliable. Unlike Mastra (which targets indie developers and startups), ADK is purpose-built for enterprise scenarios: multi-agent systems, graph-based orchestration, agent evaluation at scale, and deployment to Google's managed infrastructure.

Supported languages: Python (primary, feature-complete), TypeScript/JavaScript, Go, Java. Note: the API design and documentation are heavily Python-first; TypeScript and other SDKs track but sometimes lag the Python feature set.

Python

# Python — ADK's primary language
from google.adk import Agent
from google.adk.tools import google_search

research_agent = Agent(
    name="researcher",
    model="gemini-flash-latest",
    instruction="You help users research topics thoroughly and accurately.",
    tools=[google_search],
)

# Run locally
result = research_agent.run("What are the latest developments in fusion energy?")
print(result.text)

TypeScript

// TypeScript ADK
import { Agent } from '@google/adk';
import { googleSearch } from '@google/adk/tools';

const researchAgent = new Agent({
  name: 'researcher',
  model: 'gemini-flash-latest',
  instruction: 'You help users research topics thoroughly and accurately.',
  tools: [googleSearch],
});

const result = await researchAgent.run(
  'What are the latest developments in fusion energy?'
);
console.log(result.text);

Multi-Agent Systems

ADK's multi-agent support is one of its strongest features. You can compose agents hierarchically, assign them different models, and let them collaborate via the A2A protocol.

Python

from google.adk import Agent
from google.adk.agents import SequentialAgent, ParallelAgent

researcher = Agent(name="researcher", model="gemini-flash-latest", instruction="Research the topic.")
writer = Agent(name="writer", model="gemini-pro-latest", instruction="Write a clear article from the research.")
editor = Agent(name="editor", model="gemini-flash-latest", instruction="Polish and format the article.")

content_pipeline = SequentialAgent(
    name="contentPipeline",
    sub_agents=[researcher, writer, editor],
)

Vertex AI Lock-In

ADK's evaluation, deployment, and production observability features lean heavily on Vertex AI Agent Engine, Cloud Trace, and Google's managed infrastructure. You can run ADK locally and even deploy to Cloud Run or GKE independently, but to get the full ADK experience, including agent evaluation, performance dashboards, and managed scaling, you're on Google Cloud. This is similar to how LangSmith is the intended observability backend for LangChain: technically optional, practically expected.

Frameworks like Genkit, Vercel AI SDK, and Mastra were designed from the ground up to be cloud-neutral. ADK and LangChain, by contrast, have strong ecosystem gravity toward their respective platforms.
Pros and Cons

✅ Pros:
- Enterprise-grade agent infrastructure
- Multi-language: Python, TypeScript, Go, Java
- Best-in-class multi-agent and A2A support
- Graph-based workflows and evaluation tools
- Direct integration with Google Search, Vertex Search
- Agent evaluation with user simulation

❌ Cons:
- Strongly tied to Vertex AI and Google Cloud
- Python-first: TS/Go/Java APIs lag in features
- Brings Python coding patterns to JS developers
- Less suitable for cloud-agnostic deployments
- Heavier setup and operational complexity
- Not a full-stack framework (agent-only)

Head-to-Head Comparison

Developer Experience

- Genkit. Highlights: Dev UI is unparalleled for local debugging; idiomatic TypeScript; multi-level abstractions. Shortcomings: less prescriptive, more choices to make upfront.
- Vercel AI SDK. Highlights: frictionless React/Next.js integration; minimal API. Shortcomings: assembles boilerplate for complex scenarios.
- Mastra. Highlights: fastest path to a working agent; great Studio UI. Shortcomings: agent-only, JS-only.
- LangChain. Highlights: vast documentation and community; battle-tested patterns. Shortcomings: Python idioms in TypeScript, complex API.
- ADK. Highlights: powerful multi-agent tooling; strong eval story. Shortcomings: GCP-centric, Python-first.

Abstraction Levels

Genkit is the only framework that gives you all three levels in one SDK: vanilla generation, typed flows (pipelines), and agents. Vercel AI SDK lives at the lower end; it gives you clean generation and tool-calling primitives but no flow abstraction. Mastra, LangChain, and ADK are agent frameworks: they optimize for spinning up agents quickly but don't offer a coherent story for when you just want to generate text or structure a pipeline without agent autonomy.

Observability

- Genkit. Local dev: built-in Dev UI, trace explorer, Dotprompt editor. Production: OTEL-compatible, Cloud Trace, Langfuse.
- Vercel AI SDK. Local dev: basic Developer Panel. Production: OTEL, Vercel Observability (platform-tied).
- Mastra. Local dev: Studio UI for workflows. Production: still maturing.
- LangChain. Local dev: minimal without LangSmith. Production: LangSmith (best-in-class, SaaS).
- ADK. Local dev: ADK Web UI. Production: Cloud Trace + Vertex (GCP-tied).

Language Support

- Genkit. Primary: TypeScript. Additional: Python (preview), Go, Dart/Flutter (preview), Java (unofficial).
- Vercel AI SDK. Primary: TypeScript. Additional: Node.js runtimes, Edge.
- Mastra. Primary: TypeScript. Additional: JS runtimes only.
- LangChain. Primary: Python. Additional: TypeScript (near-parity, Python idioms).
- ADK. Primary: Python. Additional: TypeScript, Go, Java.

Framework Neutrality

Genkit, Vercel AI SDK, and Mastra were built from the ground up to be provider-neutral. They support OpenAI, Anthropic, Google, and others through a unified API, and they deploy to any infrastructure. LangChain and ADK are platform-influenced. LangChain's full power unlocks with LangSmith; ADK's full power unlocks on Google Cloud. This is not a dealbreaker; both platforms are excellent, but it is an architectural commitment you should make consciously.

Idiom and Code Style

Genkit, Mastra, and Vercel AI SDK feel natively TypeScript: async/await everywhere, Zod schemas for validation, module-based composition, and no runtime class inheritance chains to navigate. LangChain and ADK's TypeScript SDKs carry the weight of their Python origins. You'll find class-heavy APIs, .pipe() chains, and patterns that feel natural if you've written LangChain Python but unfamiliar if you're coming from the TypeScript world. This is not a quality judgment; it's a cultural fit question.
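To make the stylistic gap concrete, here is the same "prompt a model, get a string back" task in the two idioms. This is a hedged illustration of style only, not a feature comparison; the first half uses the runnable/.pipe() composition LangChain JS inherits from Python's LCEL, and the second uses a plain async call (Vercel AI SDK shown as the TypeScript-native example).

TypeScript

import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { ChatOpenAI } from '@langchain/openai';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// LangChain JS: runnable composition with .pipe(), ported from Python's LCEL style.
const chain = ChatPromptTemplate.fromTemplate('Summarize in one sentence: {text}')
  .pipe(new ChatOpenAI({ model: 'gpt-4o' }))
  .pipe(new StringOutputParser());
const viaLangChain = await chain.invoke({ text: 'The history of the transistor' });

// TypeScript-native style: one plain async call, no runnables to learn.
const { text: viaAiSdk } = await generateText({
  model: openai('gpt-4o'),
  prompt: 'Summarize in one sentence: The history of the transistor',
});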
Which Framework Should You Choose?

After building with all five, here's my honest take:

Choose Genkit if:
- You want to iterate on your AI fast and get feedback with less back and forth — Genkit was built from the ground up for powerful local tooling and observability.
- You need to mix vanilla generation, typed pipelines (flows), and agents in the same app.
- Provider neutrality is important now or likely to be important later.
- You're building a Flutter/Dart mobile app and need AI capabilities.
- You want OpenTelemetry-compatible tracing without configuring a separate backend.

Choose Vercel AI SDK if:
- You're building a React/Next.js app and want the lowest-friction path to streaming AI UI.
- Simplicity and minimal API surface matter more than built-in abstractions.
- You're already on the Vercel platform and want native integration.
- Your use case maps well to the UI hooks (useChat, useCompletion, generative UI).

Choose Mastra if:
- You're a TypeScript developer who wants to spin up a production agent as fast as possible.
- You want a clean, idiomatic TypeScript agent API without Python-ported patterns.
- The visual Studio UI for workflow design appeals to your team.
- You're building in the Next.js/SvelteKit/Hono ecosystem.

Choose LangChain if:
- Your team is coming from the Python AI ecosystem and wants cross-language continuity.
- You need the broadest possible integration ecosystem (the most integrations of any framework).
- You're investing in LangSmith for production observability and want a cohesive platform.
- LangGraph's graph-based orchestration matches your workflow complexity.

Choose ADK if:
- You're building enterprise-grade multi-agent systems on Google Cloud.
- Vertex AI's infrastructure (Agent Engine, Cloud Trace, Vertex Search) is already in your stack.
- You need battle-tested multi-language support, including Go and Java.
- Agent evaluation at scale (user simulation, custom metrics) is a core requirement.

Conclusion

The Generative AI framework landscape in 2026 is not a winner-take-all market. Each of the five frameworks covered here has a legitimate use case, a growing community, and an active development team.

If I had to crown one framework as the most versatile choice for teams that haven't already committed to a cloud platform, it would be Genkit. Its combination of multi-level abstractions, provider neutrality, and, above all, the Developer UI creates a development experience that genuinely accelerates iteration. The fact that it is expanding to Dart/Flutter, Python, and Go while keeping its TypeScript SDK as the best-in-class experience is a sign of a team thinking about the long game.

That said, none of these frameworks is going away. LangChain's ecosystem depth, ADK's enterprise footprint, Vercel's UI ergonomics, and Mastra's TypeScript-native speed all serve real needs. The most important thing is to make the choice deliberately, understanding what you're trading when you pick a platform-tied framework, and what you're gaining when you pick a more opinionated one.

Happy building.

Last updated: April 2026. Framework versions referenced: Genkit 1.x, Vercel AI SDK 6.x, Mastra 0.x (latest), LangChain JS 0.3.x, Google ADK 2.0.
Keeping two PostgreSQL databases in sync sounds simple. Until it isn't.

At first, everything looks fine:
- Logical replication is enabled
- Changes are flowing
- The target database looks current

Then, a few days later, something is off:
- Rows are missing
- Some updates appear twice
- Replication lag jumps for no obvious reason
- A small schema change breaks the pipeline
- Restarting the job does not clearly continue from the right place

Now the problem is no longer "how do I stream changes from PostgreSQL?" The problem is proving that the target database is still correct.

That is where most PostgreSQL CDC guides stop being useful. They explain how to enable logical replication. They explain replication slots, publications, WAL, maybe Debezium. That is useful. But production CDC usually breaks somewhere else: in the handoff between initial load and CDC, in retries, in checkpoints, in ordering, and in the recovery paths nobody tests until something fails.

The Promise of PostgreSQL CDC

Change data capture exists for a good reason. Instead of repeatedly querying whole tables or running batch exports, PostgreSQL can stream committed changes from its write-ahead log. In theory, that gives you:
- Near real-time replication
- Less load on the source database
- No polling loops
- No full reload after every change

And yes, that part works. WAL is reliable. Logical replication is mature. PostgreSQL can tell you what changed and in what order.

The hard part starts after that. Because WAL is only one part of the system. Once changes leave PostgreSQL, they usually pass through readers, queues, workers, retry logic, target writes, checkpoints, and monitoring. That is where things get interesting. Also annoying.

Where PostgreSQL CDC Actually Breaks

A simplified CDC pipeline often looks like this:

Plain Text

PostgreSQL WAL
      |
      v
  CDC reader
      |
      v
Queue / buffer
      |
      v
   Workers
    |     |
    v     v
 Target   Retry / failure handling

PostgreSQL WAL is ordered. Your pipeline may not be. The moment you add queues, parallel workers, retries, and target writes, correctness becomes your responsibility. Not just throughput. Not just "events per second." Correctness.

That means answering questions like:
- Did the target receive every committed change?
- Were updates applied in the right order?
- Did a retry apply the same change twice?
- Was the checkpoint saved before or after the target write?
- What happens if the job stops halfway through a large initial load?
- What happens if CDC starts after the snapshot, but not from the snapshot boundary?

Those are the questions that decide whether CDC is reliable in production.

1. Initial Load Is Not CDC

CDC starts from a point in time. It does not recreate everything that already existed before that point. So if the target is empty, you first need a baseline copy of the existing data. Usually, this is called an initial load or snapshot.

The hard part is what happens next. If CDC starts from the wrong position, the target may miss rows, replay rows, or apply stale updates. This is the snapshot gap. It appears when initial load and CDC are run as separate steps without a shared WAL boundary.

snapshot + CDC is two separate operations. snapshot → CDC is a controlled handoff.

This is the difference between "snapshot + CDC" and a continuous snapshot → CDC flow. For PostgreSQL, the safe version means:
- Read a consistent snapshot
- Record the matching WAL position
- Start CDC from that position

Anything else is guessing.
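What a shared WAL boundary looks like in practice: PostgreSQL's replication protocol can create the logical slot and export a snapshot in one step, so the initial load and the CDC start point refer to the same moment. The sketch below assumes a node-postgres client plus hypothetical helpers (createSlotWithSnapshot, copyTables, startCdc); the SQL it wraps (CREATE_REPLICATION_SLOT ... EXPORT_SNAPSHOT, SET TRANSACTION SNAPSHOT) is real, but the helpers are not any specific tool's API.

TypeScript

// Hedged sketch of a snapshot → CDC handoff. Helper names are hypothetical.

// 1. On a replication connection, create the slot and export a snapshot.
//    The server returns the consistent LSN and a snapshot name.
const { consistentLsn, snapshotName } = await createSlotWithSnapshot('sync_slot');

// 2. Copy existing rows inside a transaction pinned to that exported snapshot,
//    so the initial load sees exactly the state at consistentLsn.
await client.query('BEGIN ISOLATION LEVEL REPEATABLE READ');
await client.query(`SET TRANSACTION SNAPSHOT '${snapshotName}'`);
await copyTables(client); // initial load of the selected tables
await client.query('COMMIT');

// 3. Start streaming changes from the recorded boundary, not from "now".
await startCdc({ slot: 'sync_slot', startLsn: consistentLsn });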
2. WAL Is Ordered, But Workers Can Break Ordering

PostgreSQL emits changes in commit order. That does not mean your target receives them in a safe order. Once a pipeline introduces parallelism, ordering can break. For example:

Plain Text

Event 1: update customer id = 42
Event 2: delete customer id = 42

If these events are processed by different workers, the delete might reach the target first, leaving the later update to fail or, with upsert-style writes, to quietly resurrect the row. Or this:

Plain Text

Event 1: insert parent row
Event 2: insert child row

If the child insert arrives first, the target may reject it because the parent row does not exist yet.

The usual answer is controlled parallelism:
- Preserve ordering per table where needed
- Avoid unsafe reordering inside the same key/table stream
- Batch carefully
- Retry without changing event order

Parallelism is not bad. Blind parallelism is bad. CDC systems usually fail here when they optimize for speed before defining ordering guarantees.

3. At-Least-Once Delivery Means Duplicates Are Normal

Most CDC pipelines are at least once. That means an event may be delivered more than once. This is not automatically a bug. It is a normal recovery behavior. If the pipeline writes to the target, then crashes before saving its checkpoint, it may replay the same event after restart.

That is why target writes must be idempotent. For database-to-database replication, this usually means:
- Inserts should behave like upserts when possible
- Updates should be safe to apply more than once
- Deletes should not fail the whole stream if the row is already gone
- Primary keys matter

If the target write logic is not idempotent, retries can silently corrupt data. For example:
- Duplicate inserts
- Counters incremented twice
- Audit rows repeated
- Append-only targets growing incorrect history

CDC without idempotent target writes is fragile. It may work in a demo. Production will eventually find the retry path. Production has a talent for that.

4. Checkpoints Must Be Commit-Aware

Checkpointing sounds simple: save the last processed WAL position. But the timing matters. If the checkpoint is saved too early, the failure path looks like this:
1. CDC reads an event at LSN X.
2. Checkpoint advances to X.
3. Target write fails.
4. Process crashes before the failure is handled.
5. Restart begins after X, so the event is skipped.

The system now believes the event was delivered. But it was never written to the target. That is silent data loss.

The safe order is:
1. Read event
2. Write to target
3. Wait for target ACK
4. Advance checkpoint

This way, if the process crashes after reading but before committing to the target, the event can be replayed. That may create duplicates if target writes are not idempotent, but it avoids skipping committed source changes.

This is the usual tradeoff:
- Checkpoint too early → data loss
- Checkpoint after commit → possible replay
- Replay + idempotency → safe recovery

A reliable CDC system must choose the boring, safe option.
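Points 3 and 4 fit together in a few lines. A minimal sketch of the apply-then-checkpoint loop, assuming a node-postgres style target client and hypothetical cdcStream / saveCheckpoint helpers; the customers table and its columns are also invented for illustration.

TypeScript

// For each change event: write to the target first, then move the checkpoint.
for await (const event of cdcStream()) {
  // Idempotent write: an upsert keyed on the primary key tolerates replays.
  await target.query(
    `INSERT INTO customers (id, name, updated_at)
     VALUES ($1, $2, $3)
     ON CONFLICT (id) DO UPDATE
       SET name = EXCLUDED.name, updated_at = EXCLUDED.updated_at`,
    [event.id, event.name, event.updatedAt]
  );

  // The checkpoint advances only after the target acknowledged the write.
  // A crash between the two steps causes a replay, never a skipped event.
  await saveCheckpoint(event.lsn);
}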
5. Schema Changes Are Not Free

CDC captures data changes. It does not magically solve schema evolution. In real systems, someone eventually:
- Adds a column
- Changes a type
- Renames a table
- Drops a column
- Changes a default
- Modifies a constraint

Then the CDC pipeline has to answer:
- Does the target schema already have this column?
- Can this type be mapped safely?
- Should the pipeline stop or continue?
- What happens to old events?
- What happens to in-flight batches?

Some platforms try to automate schema evolution. That can be useful, especially in analytics pipelines. But for database-to-database replication, automatic schema changes can also be dangerous. A production target is not always just a passive copy. It may have constraints, indexes, permissions, triggers, or application dependencies.

The safest practical answer is usually:
- Detect schema mismatch clearly
- Fail loudly when target writes are unsafe
- Let operators coordinate schema changes
- Do not silently invent a broken target schema

A CDC pipeline that keeps running incorrectly is worse than one that stops. At least a stopped pipeline is honest.

6. Long Transactions Create Hidden Lag

CDC lag is not always caused by slow networking or slow consumers. Sometimes the source transaction itself is the problem. PostgreSQL changes become safe to replicate only after the transaction commits. So a large transaction can look quiet for a while, then suddenly release a huge batch of changes at once.

SQL

BEGIN;

UPDATE orders
SET status = 'archived'
WHERE created_at < '2024-01-01';
-- this runs for several minutes
-- CDC cannot treat these row changes as final yet

COMMIT;

While the transaction is open, downstream replication may appear to be stuck or falling behind. After COMMIT, all those changes become visible together. Result:
- Replication lag jumps
- Target writes arrive in a burst
- Workers suddenly have a backlog
- Monitoring graphs look haunted

This is common during bulk updates, maintenance jobs, large imports, or application code that keeps transactions open too long. CDC cannot remove this behavior. It can only process the changes once PostgreSQL makes them committed and visible.

7. Restarts Are Where Fake Reliability Gets Exposed

A CDC pipeline that works while everything is healthy is not enough. The real test is what happens after:
- Service restart
- Database disconnect
- Target write failure
- Process crash
- Machine reboot
- Operator pressing "Stop"

Restart behavior must be explicit. The system should know:
- The last durable source position
- Whether the target write was acknowledged
- Whether the initial load had completed
- Whether CDC handoff had happened
- Whether a partially loaded table can continue safely

If those states are not stored durably, restart becomes guesswork. And guesswork is not a recovery strategy.

Treat CDC as a workflow, not just a stream.

Most real database movement work does not start with CDC. It starts with questions:
- Which tables should be copied?
- How large are they?
- Does the target schema match?
- Can the existing data be loaded safely?
- Where is the handoff point between the initial load and CDC?
- How do we validate the initial load?
- How do we keep validating after the CDC starts?
- How do we recover if something stops?

But many setups split those steps across tools:

Plain Text

SQL client → export → scripts → pipeline → CDC → validation

Each tool may be fine on its own. The problems live in the gaps:
- Assumptions are lost
- State is not shared
- Validation becomes manual
- Handoff points are unclear
- Restart behavior is inconsistent

That is why CDC should be treated as part of the full data movement workflow:

Plain Text

explore → load → validate → replicate → keep validating

Not because workflows look nice on a diagram, but because the correctness problems happen between those steps.
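Point 7 and the workflow framing come down to the same requirement: the state that makes restarts safe has to be explicit and durable somewhere. A minimal sketch of what that record might look like; the field names are hypothetical, not any particular tool's schema.

TypeScript

// Hypothetical shape of the state a CDC job must persist to restart safely.
interface ReplicationState {
  slotName: string;               // PostgreSQL replication slot in use
  snapshotLsn: string | null;     // WAL position recorded at the snapshot boundary
  confirmedLsn: string | null;    // last LSN whose target write was acknowledged
  initialLoadComplete: boolean;   // has the snapshot → CDC handoff happened?
  tablesInProgress: Record<string, { rowsCopied: number; lastKey: string }>; // resumable loads
}

If every question in the restart checklist above can be answered from a record like this, recovery is a lookup. If it cannot, recovery is guesswork.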
What a Reliable CDC System Needs

A production CDC system should handle failure and recovery paths deliberately:
- Pure CDC without initial load: target writes must be idempotent, and checkpoints must be durable.
- Initial load → CDC: CDC must start from the snapshot boundary.
- Restart after stop: checkpoints must advance only after successful target writes.
- Interrupted large load: the system must know what was already copied.
- Delayed CDC after snapshot: the system must not start blindly from "now."
- Schema mismatch: the system should fail clearly, not silently corrupt data.

These scenarios look boring. They are also where many CDC implementations fail. Not because PostgreSQL is unreliable. Because the workflow around PostgreSQL CDC is incomplete.

When Debezium + Kafka is the right answer

A common production CDC architecture looks like this:

Plain Text

PostgreSQL → Debezium → Kafka → consumer → target database

This can be the right architecture. Especially when you need:
- Kafka as the central event backbone
- Multiple independent consumers
- Event-driven services
- Very high throughput
- Existing Kafka operations

Debezium is a serious CDC tool. Kafka is not the villain. The problem starts when the architecture is much bigger than the job. If the goal is simply PostgreSQL → PostgreSQL replication, the stack becomes a distributed system around a relatively direct task. Now every issue has several possible owners:
- PostgreSQL WAL
- Replication slot state
- Debezium connector config
- Kafka topic lag
- Consumer retry logic
- Target database writes
- Schema handling between all of the above

When lag appears, where is it? When a row is missing, who skipped it? When an event replays, was it Debezium, Kafka, the consumer, or the checkpoint? When the target schema changes, which layer owns the fix?

Nothing here is impossible. But it changes the problem. You are no longer just moving data. You are operating a multi-component CDC platform. That may be worth it. But it should be a conscious tradeoff, not the default answer for every sync job. Kafka did not break. The architecture became heavier than the job required.

Where DBConvert Streams Fits

DBConvert Streams keeps the risky parts of PostgreSQL CDC in one workflow:

Plain Text

load → handoff → replicate → resume → validate

That does not remove the hard parts of CDC. It makes them explicit. Instead of stitching together a snapshot job, a CDC process, retry logic, checkpoints, and validation queries by hand, the workflow is visible in one place.

What Changed in DBConvert Streams 2.1

DBConvert Streams 2.1 focuses on several of these recovery paths:
- Initial load → CDC now hands off automatically from a saved position.
- CDC resumes from the last durable checkpoint after Stop or restart.
- Eligible large load runs can continue from saved progress instead of starting again from zero.
- Schema changes are still not handled automatically and need coordination.

These are workflow changes, not new WAL magic. That is the point.

What DBConvert Streams Does Not Solve Automatically

DBConvert Streams 2.1 does not automatically handle:
- Schema evolution
- Exactly-once delivery across source, pipeline, and target
- Lag caused by long PostgreSQL transactions
- Target repair after manual changes or divergence

These are still operational boundaries.

When CDC Is the Wrong Solution

CDC is not always the answer. Use something simpler if:
- Data changes rarely
- Latency does not matter
- A nightly reload is acceptable
- The target can be rebuilt cheaply
- Correctness matters more than freshness

Batch jobs are boring. But boring is not an insult.
Boring systems often fail in predictable ways. A full reload that takes 10 minutes and is easy to verify may be better than a CDC pipeline nobody fully understands. CDC is worth it when freshness matters and the source cannot be repeatedly reloaded. Otherwise, do not add moving parts just to feel enterprise. PostgreSQL will not be impressed. The important part is that these limits and tradeoffs are explicit, not hidden behind a "CDC just works" promise.

Final Thought

PostgreSQL CDC is not hard because WAL is unreliable. It is hard because a real CDC system has state:
- Snapshot state
- WAL position
- Checkpoint state
- Target commit state
- Retry state
- Schema state

If that state is implicit, CDC breaks in strange ways. If it is explicit, CDC becomes boring. And boring is exactly what production replication should be.

DBConvert Streams 2.1 handles this as one controlled workflow: initial load, CDC handoff, checkpointing, resume, and monitoring.

See: Log-based CDC for MySQL and PostgreSQL