DZone Spotlight
Friday, May 8
Production Checklist for Tool-Using AI Agents in Enterprise Apps

By Pier-Jean MALANDRINO
Agents Need a Production Gate, Not Just a Demo Review

I have seen this pattern more than once. A team builds an agent that summarizes tickets, queries CRM data, and opens service requests. The demo lands well. Leadership wants it in production next month. The agent works, but production is not a quality bar; it is an operational contract.

The moment an agent can call a tool, it stops being an ML artifact and becomes production software. Most of what we know about shipping production software still applies: identity, authorization, logs, rate limits, and rollback. None of this is new. But four assumptions of traditional ops quietly break when the caller is an agent. Execution is no longer deterministic. An HTTP 200 no longer means the action was correct. The threat surface is not static; it grows with every prompt. And on-call engineers cannot resolve every incident on their own, because the relevant judgment is often a business one.

If you need a reminder of how bad this can go, look at the Replit incident from July 2025, when an autonomous coding agent deleted a live production database during an explicit code freeze, then misreported the rollback options to the user [1]. That was not a model failure. It was a missing production gate.

The OWASP Top 10 for Agentic Applications 2026, released in December 2025, formalizes most of the failure modes referenced below [2]. The relevant ASI categories are noted as we go.

| Assumption of traditional ops | Still holds for agents? |
|---|---|
| Identity, authorization, audit, and rollback are required | Yes |
| Same code, same input, same output | No |
| HTTP 200 means the action succeeded | No |
| Threat surface is defined at design time | No |
| On-call ops can resolve most incidents | Partially |

1. Define the Agent's Job Before Defining Its Tools

Scoping a microservice is mostly product work. Scoping an agent is also risk work, because the job description is the blast radius.
Tell an agent to "help customer support" with broad tool access, and in practice it has the union of every permission it can reach. OWASP captures this in the 2026 list under the principle of Least Agency: only grant agents the minimum autonomy required for safe, bounded tasks [2]. It is the same logic as least privilege, applied to decision space rather than access space.

Before any technical work, write down: who owns the agent and what business process it serves, the approved use cases, the explicitly excluded use cases, the input and output boundaries, the human escalation paths, and measurable success and failure criteria. The goal is testable boundaries, not aspirational ones.

For a concrete reference of what surgical scoping looks like, an npm packaging error on March 31, 2026, exposed the full source of Anthropic's Claude Code agent, system prompt included [11]. The passage that governs action authorization is worth reading in full:

"But for actions that are hard to reverse, affect shared systems beyond your local environment, or could otherwise be risky or destructive, check with the user before proceeding. [...] A user approving an action (like a git push) once does NOT mean that they approve it in all contexts [...]. Authorization stands for the scope specified, not beyond. Match the scope of your actions to what was actually requested. Examples of the kind of risky actions that warrant user confirmation:

  • Destructive operations: deleting files/branches, dropping database tables, killing processes, rm -rf, overwriting uncommitted changes
  • Hard-to-reverse operations: force-pushing, git reset --hard, amending published commits, removing or downgrading packages/dependencies, modifying CI/CD pipelines"

Notice the structure. The principle is stated first: confirmation is tied to scope, not to a session-wide trust toggle. Then come two named categories, each with a closed list of concrete operations. No vague "be careful with sensitive operations."
The agent does not have to interpret what "risky" means at runtime; the policy enumerates it. This is the level of granularity an enterprise agent's job description needs. Not "the agent helps with customer support," but a closed list of allowed operations, a closed list of escalation triggers, and a default of "ask before any irreversible action." If you cannot write your agent's job at this level of precision, you are not ready to scope its tools.

2. Assign Identity to Every Actor in the Workflow

The pattern is familiar from service-to-service auth. Each actor in the chain needs its own identity, no shared credentials, full propagation downstream. The twist for agents: there are now three actors, not two. The end user authenticates and grants delegated authority. The agent has its own machine identity that downstream services can recognize and rate-limit independently. Tools authenticate the agent on behalf of the user, typically through OAuth on-behalf-of flows.

South et al. (2025) propose a concrete extension of OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata, designed to maintain a clear chain of accountability between user, agent, and service [3]. Worth reading if you are building this from scratch.

A practical note: most teams I have worked with start by giving the agent the user's session token directly. That works for a demo, but breaks the moment two agents act on the same resource, or when you need to audit which agent did what for which user. Get the three identities right early; it is much harder to retrofit.

3. Enforce Authorization at the Action Level

"Can call CRM" is the wrong granularity. The production question is whether the agent can read an account, update a record, approve a refund under €50, or trigger a shipment. Each of these is a distinct authorization decision, often with distinct approvers. OWASP categorizes the failure modes here as ASI03 (Identity and Privilege Abuse) and ASI02 (Tool Misuse) [2].
There is a parallel worth making explicit. Enterprise architecture has spent twenty years getting this right for services. Bounded contexts. Clear ownership per domain. Explicit responsibilities per system. Governance over what each block can and cannot do. When we ship a new microservice, no one questions the need to define its scope, its dependencies, and its allowed operations. That discipline is mainstream. Most engineering organizations have an architecture team whose entire job is to enforce it.

Then an agent arrives, and the same teams who would never give a payment service write access to the HR database hand the agent a single broad token that reaches both. We tend to underestimate what an agent can actually do. A microservice does what its code says. An agent does whatever it decides to do, within whatever permissions it holds. The blast radius is wider, not narrower. The discipline should be tighter, not looser.

Over-permission is a problem in any application. Audits flag it, security teams chase it, and everyone knows it should not be there. But in a normal application, the gap between holding a permission and exercising it is a human decision. A user with too many rights still has to choose to abuse them, or be compromised first. With an agent, the gap collapses. Every over-permission is an action the agent might take by mistake, with no intent and no compromise required. The Replit case is exactly this: the agent had write access to a production database it should never have been able to reach.

Shi et al. (2025) propose Progent, a programmable privilege-control framework for LLM agents. On the Agent Security Bench, fine-grained policy enforcement dropped attack success rates from 70.3% to 7.3% in the autonomous version, and to 0% with manual policies [4]. The lesson is not the specific framework; it is that action-level scoping is measurable and effective.
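To make the idea concrete, here is a minimal sketch of an action-level policy table with deny-by-default behavior. The action names, approval levels, and the €50 threshold mirror the examples in this article; the class and method names are hypothetical, not from any specific framework.

```java
import java.util.Map;

public class ActionPolicy {
    enum Approval { AUTO, MANAGER, HUMAN_REVIEW, BLOCKED }

    // Per-action policy table: authorization is decided per action,
    // not per tool. "Can call billing" is never the question.
    private static final Map<String, Approval> POLICY = Map.of(
        "crm.read_account", Approval.AUTO,
        "crm.update_owner", Approval.MANAGER,
        "billing.refund_small", Approval.AUTO,
        "billing.refund_large", Approval.HUMAN_REVIEW,
        "data.bulk_export", Approval.BLOCKED
    );

    // Deny-by-default: an action absent from the table is blocked.
    static Approval required(String action) {
        return POLICY.getOrDefault(action, Approval.BLOCKED);
    }

    // Refund routing is a coded rule, not a model decision.
    static Approval forRefund(double amountEur) {
        return amountEur < 50.0 ? required("billing.refund_small")
                                : required("billing.refund_large");
    }

    public static void main(String[] args) {
        System.out.println(required("crm.update_owner")); // MANAGER
        System.out.println(forRefund(20.0));              // AUTO
        System.out.println(forRefund(75.0));              // HUMAN_REVIEW
        System.out.println(required("hr.read_salary"));   // BLOCKED (not in table)
    }
}
```

The point of the sketch is the shape, not the contents: a closed, reviewable table that sits outside the agent, with a blocked default for anything unlisted.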
| Tool | Read | Write | High-impact action | Required approval |
|---|---|---|---|---|
| CRM | Account, contact | Notes, draft email | Update account owner | Manager |
| Billing | Invoice status | None | Issue refund under €50 | Auto |
| Billing | None | None | Issue refund €50 or above | Human review |
| Inventory | Stock level | None | Trigger shipment | Human review |
| Data export | None | None | Bulk export | Blocked |

4. Build Tool Allowlists and Deny-by-Default Behavior

In a microservice mesh, dynamic discovery is a feature. Services find each other through a registry. For an agent in production, dynamic discovery is a vulnerability. The agent should never call a tool that was not declared, signed, and versioned at deploy time.

The recent rise of the Model Context Protocol (MCP) makes this concrete. MCP standardizes how agents discover and invoke external tools, decoupling the application from each individual integration. Hasan et al. (2025) summarize the architecture in a single figure that is worth reading carefully:

[Figure: MCP architecture. Without MCP, each framework requires custom tool implementations; with MCP, tools are decoupled via the MCP Registry and capability negotiation. Reproduced from Hasan et al. (2025), arXiv:2506.13538.]

The trade-off is that this decoupling moves trust to the descriptor layer. The agent reads tool metadata at invocation time and decides what to call based on it. If the descriptor lies, the agent acts on a lie. Hasan et al. studied 1,899 open-source MCP servers and found that 7.2% contained general vulnerabilities and 5.5% exhibited MCP-specific tool poisoning, where malicious instructions are hidden in tool descriptions that the agent reads at invocation time [5].

Your allowlist is not just "which tools can the agent call." It is "which tool descriptors do I trust, at which version, signed by which key." Treat the tool catalog like an IAM policy file: checked into source control, peer reviewed, signed, and pinned to a release. Adding a tool is an explicit change with an audit trail. And the validation has to run at every call, not just at deploy.
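A minimal sketch of that per-call validation might look like the following. The descriptor record, the manifest layout, and the string-comparison "signature check" are all hypothetical stand-ins; a real implementation would verify a cryptographic signature against a pinned key.

```java
import java.util.Map;

public class ToolGate {
    record ToolDescriptor(String name, String version, String signature) {}

    // Deployed manifest, checked into source control and signed:
    // tool name -> { pinned version, expected descriptor signature }.
    private static final Map<String, String[]> MANIFEST = Map.of(
        "crm.search", new String[]{"1.4.2", "sig-abc"},
        "billing.refund", new String[]{"2.0.1", "sig-def"}
    );

    // All three gates must pass on EVERY call; any failure rejects it.
    static boolean allow(ToolDescriptor d) {
        String[] pinned = MANIFEST.get(d.name());      // gate 1: allowlisted?
        if (pinned == null) return reject(d, "not in allowlist");
        if (!pinned[1].equals(d.signature()))          // gate 2: signature valid?
            return reject(d, "descriptor signature mismatch");
        if (!pinned[0].equals(d.version()))            // gate 3: version pinned?
            return reject(d, "version drift");
        return true;
    }

    static boolean reject(ToolDescriptor d, String reason) {
        // In production this would emit a security event, not a log line.
        System.out.println("SECURITY_EVENT tool=" + d.name() + " reason=" + reason);
        return false;
    }

    public static void main(String[] args) {
        System.out.println(allow(new ToolDescriptor("crm.search", "1.4.2", "sig-abc")));  // true
        System.out.println(allow(new ToolDescriptor("crm.search", "1.5.0", "sig-abc")));  // false: drift
        System.out.println(allow(new ToolDescriptor("shell.exec", "1.0.0", "sig-xyz"))); // false: unlisted
    }
}
```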
Three gates, in this order: is the tool in the allowlist, is the descriptor signature still valid, and does the version match the deployed manifest? Any one of them failing means the call is rejected and a security event is logged. Skip any of them and you have a hole. Without all three, you do not have an allowlist. You have a wishlist.

5. Add Audit Logs for Every Tool Call

Traditional application logs answer "what happened." Agent logs need to answer "why did the agent decide that?" A forensic investigation of an agent incident, a wrong refund, or a wrong customer contact is not the reconstruction of a single API call. It is the reconstruction of a chain of decisions.

AlSayyad et al. (AAAI 2026) frame this explicitly. Agent observability requires capturing logs across three surfaces simultaneously: operational (what happened), cognitive (the reasoning trace), and contextual (what state the agent saw). Standard application observability covers the first surface; the other two are agent-specific [6]. Their proposed runtime architecture, AgentTrace, illustrates how these three surfaces flow together end-to-end:

[Figure: AgentTrace end-to-end runtime architecture. Initialization (logger setup, OpenTelemetry enablement, auto-instrumentation hooks), runtime instrumentation with trace/span ID generation and cognitive extraction, a logging pipeline for event encoding and routing, and downstream storage and visualization. Reproduced from AlSayyad et al. (2026), arXiv:2602.10133.]

Two things in this diagram are worth pointing out for production teams. First, cognitive events flow on a separate path from operational events, but they share the same trace and span IDs. That is what allows you to reconstruct a decision and a tool call as a single causally linked unit, not as two parallel streams of mystery. Second, the contextual layer rides on OpenTelemetry attributes, which means you do not need a parallel observability stack.
The agent-specific signal lives inside the infrastructure that your platform team already operates.

The Replit case illustrates the cost of getting this wrong. The agent itself misreported what had happened, claiming the rollback was impossible when it was not [1]. Without independent traces of the actual tool calls, the user had no way to verify the agent's narrative. Trust nothing the agent says about its own actions; log the actions independently. At the same time, raw user prompts can carry PII or injected secrets, so redaction policies have to be deliberate. Log the structure of the decision, redact the content where sensitive, and version everything that can change between runs.

6. Set Rate Limits, Quotas, and Cost Controls

Rate limiting an API protects the system from clients. Rate limiting an agent also protects you from itself. Agents have planner loops that can re-invoke themselves, retry on ambiguous failures, and chain tool calls in ways that the threat model of each individual tool never anticipated.

The classic controls all apply: retries with backoff, concurrency limits, and per-user request limits. Two additional dimensions matter for agents. First, cost limits, because LLM spend is uncapped by default, and a runaway agent can burn a quarter's budget in an afternoon. Lemkin reported $607 in additional Replit charges over three and a half days, on a $25 monthly plan, before the catastrophic incident [7]. That is not the disaster, but it is a signal that should have triggered a review. Second, asymmetric limits between read and write tools. A loop that re-reads is annoying. A loop that rewrites is a production incident.

There is a third dimension that gets overlooked, and it matters more than the other two. The components that enforce these limits must be deterministic and external to the agent. Not a tool the agent can call. Not a setting the agent can adjust. Not a confirmation that the agent can self-issue.
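As a sketch of what "external and deterministic" means in code, consider a limiter the agent holds no reference to. The class name, thresholds, and use of HTTP-style status codes are illustrative assumptions; the agent only ever sees the resulting 200 or 429.

```java
public class AgentLimiter {

    private final int maxWritesPerRun;  // asymmetric: writes are capped tightly
    private final double costCapUsd;    // hard spend ceiling at the LLM proxy
    private int writes = 0;
    private double spentUsd = 0.0;

    public AgentLimiter(int maxWritesPerRun, double costCapUsd) {
        this.maxWritesPerRun = maxWritesPerRun;
        this.costCapUsd = costCapUsd;
    }

    // Called by the gateway before forwarding a write-tool invocation.
    // The agent cannot read or reset the counter; it just gets refused.
    public int onWriteCall() {
        if (writes >= maxWritesPerRun) {
            return 429;                 // refused without negotiation
        }
        writes++;
        return 200;
    }

    // Called by the LLM proxy after metering a model call.
    public int onModelCall(double costUsd) {
        spentUsd += costUsd;
        return spentUsd > costCapUsd ? 429 : 200;
    }

    public static void main(String[] args) {
        AgentLimiter limiter = new AgentLimiter(2, 1.00);
        System.out.println(limiter.onWriteCall());     // 200
        System.out.println(limiter.onWriteCall());     // 200
        System.out.println(limiter.onWriteCall());     // 429: write budget exhausted
        System.out.println(limiter.onModelCall(0.60)); // 200
        System.out.println(limiter.onModelCall(0.60)); // 429: cost cap exceeded
    }
}
```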
The whole point of a limit is that it holds when the thing it constrains misbehaves. Putting it under the agent's control defeats the purpose. The agent should not know its own quota any more than a microservice knows its own rate-limit headers from the gateway side.

In practice, rate limits live at the gateway, cost caps live at the LLM proxy, concurrency limits live at the orchestrator, and circuit breakers live at the tool boundary. All of them sit outside the agent's call surface, all of them refuse without negotiation, and all of them log the rejection. The agent finds out it hit a limit the same way a misbehaving client finds out: by getting a 429, not by reading a config file. This is the piece teams miss most often, and it is the cheapest fix in the entire checklist. The right thresholds come from your traffic shape, not from a blog post. Set them explicitly, alert on them, and revisit them after every incident.

7. Design Deterministic Fallbacks

The previous section made the case that limits must sit outside the agent. Fallbacks are the same idea applied one level deeper: when the model itself produces a bad output, what runs instead must be deterministic, external, and not negotiable by the agent. Every non-deterministic component needs a deterministic floor. The question is not whether the model will fail; it is what runs when it does.

Three failure modes need explicit fallbacks:

  • Output validation failure: the model returns a malformed structure. Retry once with a stricter schema, then fall through to a templated response and escalation.
  • Low confidence: when a confidence score or a judge model disagrees with the actor, route to a human.
  • High-stakes actions: never let the LLM be the sole decider on irreversible operations.

Bhagwatkar et al. (March 2026) propose tool-input and tool-output firewalls, called Minimizer and Sanitizer, placed at the agent-tool boundary as a deterministic, non-LLM defense layer [8].
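The fallback routing for the first two failure modes can be sketched as a deterministic decision floor. The model client interface, the confidence threshold, the shape check, and the return tokens are all hypothetical; the point is that the routing logic is plain code the model cannot negotiate with.

```java
import java.util.function.Supplier;

public class FallbackFloor {
    record ModelOutput(String json, double confidence) {}

    static final double CONFIDENCE_FLOOR = 0.8;  // illustrative threshold

    // A loose structural check standing in for real schema validation.
    static boolean validShape(String json) {
        return json != null && json.startsWith("{") && json.endsWith("}");
    }

    // The floor: retry once on malformed output, then fall through to a
    // templated response; route low-confidence answers to a human.
    static String decide(Supplier<ModelOutput> model) {
        for (int attempt = 0; attempt < 2; attempt++) {
            ModelOutput out = model.get();
            if (!validShape(out.json())) continue;     // retry on malformed output
            if (out.confidence() < CONFIDENCE_FLOOR)
                return "ESCALATE_TO_HUMAN";            // low confidence -> human
            return out.json();
        }
        return "TEMPLATED_RESPONSE";                   // deterministic last resort
    }

    public static void main(String[] args) {
        System.out.println(decide(() -> new ModelOutput("{\"ok\":true}", 0.95))); // passes through
        System.out.println(decide(() -> new ModelOutput("garbage", 0.95)));       // TEMPLATED_RESPONSE
        System.out.println(decide(() -> new ModelOutput("{\"ok\":true}", 0.3)));  // ESCALATE_TO_HUMAN
    }
}
```

The third failure mode, high-stakes actions, is handled by the coded approval rules discussed earlier, not by anything the model returns.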
Same logic as the limit-enforcement components in the previous section, applied at the data boundary instead of the call boundary: a coded check between the model's intent and the system's action. A refund of €50 or above, a record deletion, and a customer-facing email all go through a coded rule, not a generated decision. The deterministic floor is the contract. It is what is auditable, regardless of what the model produces.

8. Treat Deployment and Rollback as First-Class Requirements

Rolling back code is a solved problem. Rolling back an agent is not, because the unit of deployment is no longer code alone. A working agent is the joint state of code, prompts, tool catalog, and model version. A rollback that touches only the code can leave the system in a worse state than the one it was rolled back from, because the prompt now references tools that no longer exist, or expects model behavior that has shifted.

Production-ready agents version all four artifacts together and ship them as a bundle. Canary deployments run on a tenant subset, with shadow mode for the rest: the agent computes a decision, the system does not act on it, and telemetry compares it to the previous version. Operational hooks, like those exposed by platforms such as OutSystems DevOps APIs, make this kind of automated deploy-and-rollback workflow tractable.

After the Replit incident, the company implemented exactly this kind of architectural separation: development and production database isolation, a planning-only mode to enforce code freezes, and improved rollback [1]. Basic deployment hygiene that should have been in place from the start.

| Artifact | Versioned? | Rollback together? |
|---|---|---|
| Application code | Yes | Yes |
| System and tool prompts | Yes | Yes |
| Tool catalog (allowlist, schemas) | Yes | Yes |
| Model ID and parameters | Yes | Yes |
| Telemetry baseline | Yes | Reference for canary |

9. Validate With Failure-Mode Testing

Unit tests check correctness. Failure-mode tests check robustness.
Unlike traditional adversarial testing, the relevant adversary is not always malicious. Real users discover prompt injections by accident. Models drift on minor version bumps. Tools time out under load.

Two academic benchmarks define the current state of the art. AgentDojo (Debenedetti et al., NeurIPS 2024) provides 97 realistic tasks and 629 security test cases across email, banking, and travel-booking environments [9]. Agent Security Bench (Zhang et al., 2025) extends this to direct prompt injection, indirect prompt injection, memory poisoning, and plan-of-thought backdoor attacks across ten scenarios [10]. Their threat model maps the attack surfaces of a tool-using agent better than any prose I could write:

[Figure: Attack surfaces of an LLM agent. Direct Prompt Injection at the user input, PoT (Plan-of-Thought) Backdoor on the system prompt, Memory Poisoning on long-term and short-term memory, and Indirect Prompt Injection on tool responses returned by the external environment. Reproduced from Zhang et al. (2025), arXiv:2410.02644.]

Four distinct attack surfaces, four distinct test classes. The shift from "prompt injection" to "indirect prompt injection" matters most here. The threat in 2026 is not a hostile user typing "ignore previous instructions" into a chat box. It is a malicious instruction hidden in an email, a web page, a RAG document, or a tool descriptor that the agent reads in the course of normal work. ASI06 in the OWASP Top 10 calls this Memory Poisoning when it reaches the stored state [2]. Both surfaces need to be in your test suite, with their own dedicated cases, not lumped under a single "we test for prompt injection" box.

Beyond adversarial inputs, your suite needs to cover the boring failure modes too: tool timeouts, malformed payloads, model version bumps, out-of-scope requests, concurrent runs on the same resource, and looping behavior on ambiguous tool results. None of these is exotic. All of them happen in the first month of production.
The teams that catch them in staging are the ones who wrote the tests on purpose, not the ones who hoped for the best.

10. Close With a Launch-Readiness Table

None of this is optional. A team that cannot fill in the table below for its agent is not ready for production. Not because the agent will fail tomorrow, but because when it does fail, the team will not know why, will not be able to contain it, and will not be able to roll it back cleanly.

The intuition that runs through every row is the same one we started with. Most of what we know about shipping production software still holds. What changes is that an agent collapses the gap between holding a permission and exercising it, between calling a tool and deciding to call it, between seeing data and acting on it. Every control on this list exists to put that gap back in, deliberately, with code you wrote and own, sitting outside the agent itself.

| Readiness domain | Question to ask | Production evidence | Agent-specific subtlety |
|---|---|---|---|
| Scope (1) | What can't this agent do? | Closed list of allowed operations and escalation triggers | Authorization stands for the scope specified, not beyond |
| Identity (2) | Who is acting, on whose behalf, in which run? | Three identities propagated end-to-end (user, agent, tool) | Identity carries decision context, not just access |
| Authorization (3) | Action-level scopes, per tool and per parameter? | Per-action ACL, reviewed with the same rigor as a microservice contract | Blast radius is wider; the discipline must be tighter |
| Allowlist (4) | Deny-by-default, with descriptor and version validation? | Signed catalog plus runtime gates on tool, signature, and version | Trust moves to the descriptor layer; verify at every call |
| Audit (5) | Can we reconstruct a decision, not just an API call? | Operational, cognitive, and contextual events sharing trace IDs | Do not trust the agent's narrative; log the actions independently |
| Rate limits (6) | Are limits external to and not negotiable by the agent? | Enforcement at the gateway, LLM proxy, orchestrator, and tool boundary | The agent finds out by getting a 429, not by reading a config |
| Fallback (7) | What runs when the model fails? | Deterministic floor tested; coded rules for irreversible actions | The floor is the contract, not the model |
| Deploy / rollback (8) | Atomic version bundle? | Code, prompts, tool catalog, and model versioned and reverted together | Rolling back code alone leaves an incoherent system |
| Failure tests (9) | Coverage of all four attack surfaces, plus boring failures? | Tests for direct PI, indirect PI, memory poisoning, PoT backdoor, plus timeouts and drift | Indirect prompt injection is the dominant threat surface in 2026 |

Agents in production are not an ML problem. They are an ops problem with a new edge case at every layer. The teams that succeed will treat them with the same discipline they apply to any production system, plus an extra column on every checklist for the subtlety the agent introduces. The frameworks, the benchmarks, and the threat models referenced throughout this article (OWASP ASI Top 10 [2], AgentDojo [9], ASB [10], and the academic work on delegation [3], privilege control [4], MCP security [5], observability [6], and tool firewalls [8]) are not academic curiosities. They are the tools the field has already built for exactly this problem. Use them.

References

[1] Fortune, AI-powered coding tool wiped out a software company's database in 'catastrophic failure', July 23, 2025. https://fortune.com/2025/07/23/ai-coding-tool-replit-wiped-database-called-it-a-catastrophic-failure/

[2] OWASP GenAI Security Project, Top 10 for Agentic Applications 2026, December 2025. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

[3] South, T., Marro, S., Hardjono, T., Mahari, R., Whitney, C. D., Greenwood, D., Chan, A., Pentland, A., Authenticated Delegation and Authorized AI Agents, arXiv:2501.09674, January 2025. https://arxiv.org/abs/2501.09674

[4] Shi et al., Progent: Programmable Privilege Control for LLM Agents, arXiv:2504.11703, 2025.
https://arxiv.org/abs/2504.11703

[5] Hasan et al., Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers, arXiv:2506.13538, 2025. https://arxiv.org/abs/2506.13538

[6] AlSayyad, A. et al., AgentTrace: A Structured Logging Framework for Agent System Observability, arXiv:2602.10133, AAAI 2026. https://arxiv.org/abs/2602.10133

[7] The Register, Vibe coding service Replit deleted production database, faked data, and lied about it, July 21, 2025. https://www.theregister.com/2025/07/21/replit_saastr_vibe_coding_incident/

[8] Bhagwatkar, R. et al., Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?, arXiv:2510.05244, v2 March 2026. https://arxiv.org/abs/2510.05244

[9] Debenedetti, E., Zhang, J., Balunović, M., Beurer-Kellner, L., Fischer, M., Tramèr, F., AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents, NeurIPS 2024 Datasets and Benchmarks Track, arXiv:2406.13352. https://arxiv.org/abs/2406.13352

[10] Zhang et al., Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents, arXiv:2410.02644, 2025. https://arxiv.org/abs/2410.02644

[11] The Register, Claude Code's source reveals extent of system access, April 1, 2026. https://www.theregister.com/2026/04/01/claude_code_source_leak_privacy_nightmare/
How to Test PUT API Request Using REST-Assured Java

By Faisal Khatri
PUT requests are typically used for updating an existing resource. This means replacing the current data for the target resource with the data sent in the API request body. Just like POST requests, the Content-Type header is important because it tells the server how to interpret the data we're sending. When successful, a PUT request usually returns a 200 OK status along with the updated resource in the response. That said, not all APIs behave the same way; some may choose not to return any data at all, depending on how the API is designed.

Difference Between PUT and POST APIs

The following table shows the clear difference between PUT and POST APIs:

| Criteria | PUT | POST |
|---|---|---|
| Purpose | Used to update or replace an existing resource entirely. | Used to create a new resource or submit data to a resource. |
| Idempotency | Idempotent; multiple identical requests result in the same outcome. | Not idempotent; multiple identical requests may create multiple records. |
| Response status code | Commonly returns 200 OK with the updated resource. | Commonly returns 201 Created with the new resource details. |
| Use case example | Updating a user's profile information. | Creating a new user account. |

PUT API Example

Let's take an example of the PUT /updateOrder/{id} API that updates an existing order using its order ID. This API is a part of the RESTful e-commerce application available on GitHub. This API requires an authentication token to identify and update the order. If the token is missing or invalid, the request will fail with an error. The order ID must be provided as a path parameter to identify the respective order, and the updated order details must be included in JSON format in the request body.

It is important to note that since it is a PUT request, we have to send the entire order object, not just the field we want to change. Even if we're updating a single value, the full order data must be supplied.
How to Test PUT APIs Using REST-Assured Java

The following test scenario will be used to demonstrate testing PUT APIs with REST-assured Java.

Plain Text

## Test Scenario
Title: Update the existing orders in the system.
## Pre-condition:
Valid orders are available in the system.
## Test
1. Update all the order details of order_id "1."
2. Verify that the Status Code 200 is returned in the response.
3. Assert that the order details have been updated correctly.

Test Implementation

The implementation of this test scenario is divided into two parts:

1. Writing a test to hit the Authorization API and extract the token from it (since an authorization token is mandatory to update the order).
2. Updating the order and verifying the updated details.

Step 1: Write a Test to Generate and Extract the Token

The POST /auth API takes the username and password as the request body and returns the Authorization token in the response with a 201 status. Let's create a new Java class named TestPutRequestExamples to implement this test scenario, and create a new method testTokenGeneration() in it.

Java

import io.restassured.http.ContentType;
import org.testng.annotations.Test;

import static io.restassured.RestAssured.given;
import static org.hamcrest.Matchers.notNullValue;

public class TestPutRequestExamples {

    private String token;

    @Test
    public void testTokenGeneration() {
        String requestBody = """
            {
              "username": "admin",
              "password": "secretPass123"
            }""";

        token = given().contentType(ContentType.JSON)
            .when()
            .body(requestBody)
            .post("http://localhost:3004/auth")
            .then()
            .statusCode(201)
            .and()
            .body("token", notNullValue())
            .extract()
            .path("token");
    }
}

The testTokenGeneration() test sends a POST request with login credentials to generate an authentication token using REST-assured. It validates that the response returns a 201 status code and ensures the token is present. Finally, it extracts and stores the authorization token in the token field, which is declared at the class level so other tests can use its value.

Step 2: Updating the Order With a PUT Request

Let's create a new test method, testUpdateOrder(), in the same class.
This test will use the OrderData generated using the Builder pattern + Datafaker library, so we don't have to worry about updating the test data manually.

POJO for the order object:

Java

@Getter
@Setter
@Builder
@JsonPropertyOrder({ "user_id", "product_id", "product_name", "product_amount", "qty", "tax_amt", "total_amt" })
public class OrderData {

    @JsonProperty("user_id")
    private String userId;

    @JsonProperty("product_id")
    private String productId;

    @JsonProperty("product_name")
    private String productName;

    @JsonProperty("product_amount")
    private int productAmount;

    private int qty;

    @JsonProperty("tax_amt")
    private int taxAmt;

    @JsonProperty("total_amt")
    private int totalAmt;
}

This OrderData class is a POJO used to represent order details, where the Lombok annotations @Getter, @Setter, and @Builder reduce boilerplate code. The Jackson annotations @JsonProperty and @JsonPropertyOrder ensure proper JSON field mapping and ordering when sending or receiving API requests.

Generating a new order with random values using the Datafaker library:

Java

public class OrderDataBuilder {

    public static OrderData getOrderData() {
        Faker faker = new Faker();
        int productAmount = faker.number().numberBetween(1, 1999);
        int qty = faker.number().numberBetween(1, 10);
        int grossAmt = qty * productAmount;
        int taxAmt = (int) (grossAmt * 0.10);
        int totalAmt = grossAmt + taxAmt;

        return OrderData.builder()
            .userId(String.valueOf(faker.number().numberBetween(301, 499)))
            .productId(String.valueOf(faker.number().numberBetween(201, 533)))
            .productName(faker.commerce().productName())
            .productAmount(productAmount)
            .qty(qty)
            .taxAmt(taxAmt)
            .totalAmt(totalAmt)
            .build();
    }
}

The OrderDataBuilder class generates dynamic test data for testing the Order APIs using the Datafaker library.
It creates random values for fields such as product amount, quantity, and product name, calculates the tax and total amount, and then uses the Builder pattern to construct and return an OrderData object with all fields populated.

Step 3: Writing the Test to Update and Verify the Order

Let's create a new test method, testUpdateOrder(), in the existing class TestPutRequestExamples.

Java

@Test
public void testUpdateOrder() {
    int orderId = 1;
    OrderData updatedOrder = getOrderData();

    String responseBody = given().contentType(ContentType.JSON)
        .header("Authorization", token)
        .when()
        .log().all()
        .body(updatedOrder)
        .put("http://localhost:3004/updateOrder/" + orderId)
        .then()
        .log().all()
        .statusCode(200)
        .and()
        .assertThat()
        .body("message", equalTo("Order updated successfully!"))
        .extract()
        .response()
        .asPrettyString();

    JSONObject responseObject = new JSONObject(responseBody);
    JSONObject orderObject = responseObject.getJSONObject("order");

    assertThat(orderObject.get("id"), equalTo(orderId));
    assertThat(orderObject.get("user_id"), equalTo(updatedOrder.getUserId()));
    assertThat(orderObject.get("product_id"), equalTo(updatedOrder.getProductId()));
    assertThat(orderObject.get("product_name"), equalTo(updatedOrder.getProductName()));
    assertThat(orderObject.get("product_amount"), equalTo(updatedOrder.getProductAmount()));
    assertThat(orderObject.get("qty"), equalTo(updatedOrder.getQty()));
    assertThat(orderObject.get("tax_amt"), equalTo(updatedOrder.getTaxAmt()));
    assertThat(orderObject.get("total_amt"), equalTo(updatedOrder.getTotalAmt()));
}

The testUpdateOrder() method sends a PUT request with authorization and a JSON body, validates the response status, and verifies that the updated order data matches the request payload. It updates the order with ID 1.
The code can be divided further into the following categories to understand it in simple terms:

Request building methods:
given(): The entry point to start building the API request.
.contentType(ContentType.JSON): Specifies that the request body is in JSON format.
.header("Authorization", token): Adds the Authorization header and uses the global variable token for authentication. Here, it should be ensured that the token generation test runs first; otherwise, the update order test will fail at this point.

Request execution methods:
when(): Marks the transition from request setup to execution.
.log().all(): Logs the complete request details, including headers and request body. Logging request details helps simplify the debugging process.
.body(updatedOrder): Sends the updatedOrder object as the request payload and automatically converts the Java object to JSON.
.put("http://localhost:3004/updateOrder/" + orderId): Sends a PUT request to update an existing order. The orderId value is passed as a path parameter in the URL to target the specific order.

Response validation methods:
then(): Starts the response validation.
.log().all(): Logs the entire response, including headers, response body, and other details.
.statusCode(200): Verifies that the request was successfully executed and a 200 OK status was returned in the response.
.body("message", equalTo("Order updated successfully!")): Verifies that the response contains the expected success message. equalTo() is a static method from the Hamcrest library.

Response body extraction:
.extract().response().asPrettyString(): Extracts the response and converts it into a formatted, readable JSON string.
JSON
{
  "message": "Order updated successfully!",
  "order": {
    "id": 1,
    "qty": 7,
    "user_id": "326",
    "product_id": "1046",
    "product_name": "muller.org",
    "product_amount": 22,
    "tax_amt": 287,
    "total_amt": 1713
  }
}

Response body verification:

Java
JSONObject responseObject = new JSONObject(responseBody);
JSONObject orderObject = responseObject.getJSONObject("order");

Using this code, the response string is converted into a JSON object. Then, the nested order object is extracted.

Java
assertThat(orderObject.get("id"), equalTo(orderId));
assertThat(orderObject.get("user_id"), equalTo(updatedOrder.getUserId()));
assertThat(orderObject.get("product_id"), equalTo(updatedOrder.getProductId()));
assertThat(orderObject.get("product_name"), equalTo(updatedOrder.getProductName()));
assertThat(orderObject.get("product_amount"), equalTo(updatedOrder.getProductAmount()));
assertThat(orderObject.get("qty"), equalTo(updatedOrder.getQty()));
assertThat(orderObject.get("tax_amt"), equalTo(updatedOrder.getTaxAmt()));
assertThat(orderObject.get("total_amt"), equalTo(updatedOrder.getTotalAmt()));

The assertThat() method is a static method from the Hamcrest library, used for validating each field of the order object in the response body. The "updatedOrder" object, generated using the Builder pattern and the Datafaker library, is used here to verify that the response order object contains the same values that were provided in the request. Note that asserting all the fields in the response body is recommended; however, depending on the test case, only the relevant assertions may be performed.

Test Execution

The tests should be executed in such a way that the token generation test runs first, followed by the update order test, as it depends on the token. Let’s create a testng.xml file, as it allows running the tests in a specific order.
XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Restful ECommerce Test Suite">
  <test name="Restful ECommerce End to End tests">
    <classes>
      <class name="restfulecommerce.tutorial.TestPutRequestExamples">
        <methods>
          <include name="testTokenGeneration"/>
          <include name="testUpdateOrder"/>
        </methods>
      </class>
    </classes>
  </test>
</suite>

The test execution shows that both tests run successfully.

Summary

Testing PUT requests is important in automation testing, as they are commonly used to update or modify existing data. Using Datafaker helps generate random test data, which enables effective testing of PUT endpoints while avoiding duplicate data issues. When automating a PUT request, it’s important to provide the correct headers and ensure the data is sent in the expected format. If the API requires an authorization token, that should be included as well. Additionally, testng.xml can be used to run the API automation tests sequentially, and it is a recommended way to run the tests locally or in the CI/CD pipeline. In my experience, covering both happy paths and negative scenarios in automated tests helps keep regression suites efficient and makes it easier to catch issues early. Using a POJO-based approach helps in easily creating, managing, and maintaining request and response payloads, making the test code more readable and maintainable. Happy testing!

I Gave Gemini 3 My Worst Legacy Code — Here’s What Happened

The Digital Archaeology Experiment

We all have that one folder. The one labeled "v1_final_do_not_touch_2016." It is a sprawling ecosystem of spaghetti code, global variables, and comments that simply read // I am sorry. In an era of large language models (LLMs), we often hear about AI writing boilerplate, but can it actually perform digital archaeology? I decided to feed my most "haunted" legacy script — a 2,000-line monolith responsible for processing data — into a hypothetical next-generation model, Gemini 3. The goal wasn't just to see if it could fix the bugs, but to see if it could transform a maintenance nightmare into a modern, scalable architecture. What followed was a masterclass in software engineering best practices. The AI didn't just move code around; it applied structural patterns that we often neglect in the heat of deadlines. This guide breaks down the core best practices Gemini 3 utilized to transform legacy junk into production-grade software, and why you should apply these practices even if you aren't using an AI assistant.

1. The Single Responsibility Principle (SRP): Deconstructing the Monolith

The first thing the AI flagged was the "God Object" syndrome. In my legacy code, a single function called process_claim() was responsible for:

Validating user input.
Connecting to a MySQL database.
Calculating claim totals with hardcoded tax rules.
Sending an email notification.
Logging errors to a local file.

The Bad Practice (The Monolith)

Python
def process_claim(claim_data):
    # Validation
    if not claim_data.get("id"):
        return "Error"
    # Database logic
    db = connect_to_db("prod_db")
    db.execute(f"INSERT INTO claims VALUES ({claim_data['id']})")
    # Business logic
    total = claim_data['amount'] * 1.15  # Hardcoded tax
    # Notification
    send_email("[email protected]", f"Claim {total} processed")
    return "Success"

Why This Fails

This code is impossible to test in isolation.
If you want to test the tax calculation, you must have a live database connection and an email server ready. Furthermore, a change in the email provider's API forces a change in the business logic file, violating the principle that software should be easy to change without unintended side effects.

The Good Practice (Applying SRP)

Gemini 3 refactored this into distinct services. Validation, Persistence, Calculation, and Messaging were separated.

Python
class ClaimValidator:
    def validate(self, data):
        if not data.get("id"):
            raise ValidationError("Missing ID")

class TaxCalculator:
    def calculate(self, amount, region_code):
        rate = self._get_rate(region_code)
        return amount * rate

class ClaimService:
    def __init__(self, validator, calculator, repository, notifier):
        self.validator = validator
        self.calculator = calculator
        self.repository = repository
        self.notifier = notifier

    def execute(self, claim_data):
        self.validator.validate(claim_data)
        total = self.calculator.calculate(claim_data['amount'], "US")
        self.repository.save(claim_data)
        self.notifier.send(f"Claim {total} processed")

Why It Matters

By separating concerns, the code becomes modular. You can now swap the TaxCalculator for a different regional version without touching the ClaimService. Testing becomes a matter of passing "mock" objects into the constructor, ensuring your unit tests are fast and reliable.

Checklist for Applying SRP

Identify "Ands": If a function does A and B, it needs to be split.
Extract Logic: Move business rules into separate, pure functions.
Isolate I/O: Keep database and API calls outside of core logic classes.
Limit Lines: Aim for functions under 20 lines of code.

2. Decoupling Through Dependency Injection

One of the most profound changes Gemini 3 suggested involved how objects interact. In the legacy code, objects instantiated their own dependencies. If Class A needed Class B, it would simply call b = new ClassB() inside its constructor. This creates "tight coupling."
Visualizing the Transformation

Below is a flowchart illustrating the decision-making process for decoupling legacy dependencies.

The Pitfall: The "New" Keyword

When you use new inside a class, you are locking that class to a specific implementation. This makes it impossible to substitute a mock version for testing or a different implementation for a new environment (like a staging server).

The Solution: Dependency Injection (DI)

Instead of creating the dependency inside the class, you "inject" it — usually via the constructor. This practice shifts the responsibility of object creation to the caller or a dedicated DI container.

Comparison: Before vs. After

Bad (Tight Coupling):

JavaScript
class OrderService {
  constructor() {
    this.database = new PostgresDatabase(); // Hardcoded dependency
  }
}

Good (Loose Coupling):

JavaScript
class OrderService {
  constructor(database) { // Injected dependency
    this.database = database;
  }
}

The Benefit: In your production environment, you pass a real PostgresDatabase. In your test environment, you pass an InMemoryDatabase. The OrderService doesn't know the difference, making it highly reusable.

3. Defensive Programming and Error Handling

Legacy code often treats error handling as an afterthought, using generic try-catch blocks that swallow exceptions or returning null values that eventually lead to the dreaded "Null Reference Exception." Gemini 3's refactoring emphasized Defensive Programming: the practice of designing software to continue functioning under unforeseen circumstances.

Sequence Diagram: Proper Error Handling Flow

This sequence diagram shows the interaction between a client, a service, and an external API using resilient patterns.

Key Defensive Practices

Fail Fast: Validate inputs at the very beginning of a function.
If they are invalid, throw an exception immediately.
Use Meaningful Exceptions: Instead of throwing Error, throw InsufficientFundsError or UserNotFoundError.
Circuit Breakers: If an external service is down, don't keep hammering it. Stop the calls and return a cached result or a graceful failure.

Good vs. Bad Error Handling

Bad Practice:

Python
try:
    result = api.call()
except:
    pass  # Silently failing is the worst thing you can do

Good Practice:

Python
try:
    result = api.get_user(user_id)
except ConnectionError as e:
    logger.error(f"Failed to connect to UserAPI for ID {user_id}: {e}")
    raise ServiceUnavailableError("Our user service is temporarily down.")
except UserNotFoundError:
    return None  # Explicitly handled

4. Modernizing State Management

In my legacy script, the code relied heavily on global state. A variable like current_user_id was updated by multiple functions across the file. This led to unpredictable bugs where the state would change in the middle of a process due to an asynchronous callback.

Implementation: Using Immutability

Instead of modifying an existing object, create a new one. This ensures that other parts of the system holding a reference to the old object aren't surprised by a sudden change.

Bad (Mutable):

JavaScript
function updatePrice(product, newPrice) {
  product.price = newPrice; // Changes the object everywhere
}

Good (Immutable):

JavaScript
function updatePrice(product, newPrice) {
  return { ...product, price: newPrice }; // Returns a new object
}

By using immutability, you make your code thread-safe and much easier to debug. If a bug occurs, you can inspect the state at any point in time without worrying that it was modified downstream.

5. Refactoring Summary: The Do's and Don'ts

To help you apply these findings to your own legacy codebases, here is a summary table of the transformations Gemini 3 performed.
Area | Don't Do This (Legacy) | Do This (Modern)
Logic | Giant functions with nested if/else. | Small, pure functions with early returns.
Data | Direct manipulation of global state. | Immutable data structures and local state.
Dependencies | Hardcoded new instances. | Injected dependencies via interfaces.
Errors | Generic try-catch with empty bodies. | Domain-specific exceptions and logging.
Performance | Nested loops with O(n^2) complexity. | Optimized algorithms with O(n) or O(log n).
Documentation | Comments explaining what code does. | Self-documenting code explaining why.

Common Pitfalls to Avoid During Refactoring

Even with an AI as powerful as Gemini 3, refactoring is not without risks. Here are three common pitfalls I encountered during this experiment:

Refactoring Without Tests: Never start refactoring until you have "Characterization Tests" — tests that describe how the code currently behaves. If you change the code and the tests pass, you know you haven't broken existing functionality.
Over-Engineering: It is tempting to apply every design pattern (Factory, Strategy, Observer) at once. Only introduce complexity when it solves a specific problem. If a simple function works, you don't need a class.
The "Big Bang" Rewrite: Resist the urge to rewrite the entire system from scratch. This almost always leads to project failure. Instead, refactor one small module at a time, ensuring the system remains operational throughout the process.

Practical Guidance: An Implementation Roadmap

If you are staring at a mountain of legacy code today, here is the recommended roadmap for modernization:

Identify the Pain Points: Which part of the code breaks most often?
Start there.
Write Integration Tests: Capture the current behavior of that module.
Decouple the Core: Identify the business logic and extract it from the infrastructure (database/UI).
Introduce Dependency Injection: Allow your business logic to be tested in isolation.
Clean Up the Syntax: Use modern language features (like Async/Await or Type Hints) to improve readability.

Conclusion: AI as the Ultimate Pair Programmer

Feeding my worst legacy code to Gemini 3 was an eye-opening experience. The AI didn't just "fix" the code; it enforced a level of discipline that is often lost in the day-to-day grind of feature delivery. It reminded me that the most important audience for our code isn't the compiler — it is the human developer who has to maintain it six months from now. By prioritizing the Single Responsibility Principle, decoupling dependencies through injection, and embracing defensive programming, we can turn even the most frightening legacy scripts into robust, modern systems. Whether you use an AI assistant or your own expertise, these best practices remain the bedrock of professional software engineering.

Further Reading & Resources

Refactoring: Improving the Design of Existing Code by Martin Fowler
Clean Code: A Handbook of Agile Software Craftsmanship
The Twelve-Factor App Methodology
Google Software Engineering Best Practices
SOLID Principles of Object-Oriented Design

By Jubin Abhishek Soni DZone Core CORE
Context Density: How to Survive the AI Tidal Wave

As the AI tidal wave continues to break on our shores, there are two existential questions we’re all struggling to answer:

Knowledge workers and other content producers – how can we survive the AI wave with some kind of defensible capability we can offer our employers and our audiences that AI won’t be able to replace, even as it matures?
Software vendors – how can we survive the AI wave with some kind of defensible product capability we can offer our customers that AI agents won’t be able to replace, even as they mature?

If you’re a pessimist, the situation may seem hopeless. AI is getting so much better so quickly that even if it can’t quite replace us or our software products today, it’s only a matter of time, right? Should we abandon hope? Or perhaps you’re an optimist. There must be some aspect of what we as humans bring to the table that AI won’t be able to replace, no matter how good it gets. If only we had a way of understanding and measuring just what that essential value-add is that humans can bring to the table – whether we are creating content, addressing business needs as knowledge workers, or building software products that provide value to their users. The good news: there is hope. Here is a way of looking at the problem that will help illuminate that je ne sais quoi – that ineffable human contribution that AI will never be able to replace. First, Understand Semantic Density Generative AI (genAI) depends upon large language models (LLMs) that deal well with content that has specific, well-defined meaning. The better defined our inputs – training data, retrieval augmented generation (RAG) data, and information in prompts – the better formed our outputs. In contrast, when the meaning of input data contains too many nuances – implications, unspoken references, intuitive leaps and the like – then LLMs fall short. The models’ creators simply have no way to build them to account for such subtleties.
Language experts have a term for how to understand such differences in meaning: semantic density. You create a message with high semantic density by cramming a lot of meaning into a few words. In contrast, a message has low semantic density if it takes a lot of words to express a simple idea. Humans are particularly good at creating semantically dense content – and in fact, we generally identify higher semantic density with better written content. On the other hand, LLMs excel at both consuming and producing content with low semantic density. Such output is especially useful when we are looking for clear, precise explanations, accurate summaries, etc. – just the sorts of content we’ve come to expect and demand from genAI. Is Semantic Density the Answer? An obvious conclusion at this point would be for humans to focus on creating semantically dense content to survive the onslaught of AI. Unfortunately, there are problems with this argument. First, LLMs can also generate semantically dense content, especially when source data are also semantically dense, for example, asking genAI to create an abstract for a semantically dense academic paper. Asking an LLM to write the paper is a recipe for plagiarism and hallucinations (as many students have learned to their chagrin), but the models are quite skilled at summarizing such content. Second, it’s overly simplistic to equate semantically dense human-generated content with good writing vs. less dense content with poorer writing. After all, sometimes we want human-generated content to be less semantically dense. A simple example would be writing for children – something genAI can do for sure, but the best child-oriented content still comes from real people. On the flip side, extreme semantic density typically makes the text obscure and difficult to read – clearly not hallmarks of excellent writing. So, while semantic density has a loose correlation to how well LLMs can perform, it’s not the whole story. 
The missing piece: context density. The Importance of Context Density While semantic density measures the internal complexity of meaning within a message, context density measures the meaningful content around a message. Context density is similar to semantic density. More meaning crammed into fewer words leads to more density, so it’s easy to confuse the two. The reason context density is so important, however, is because of the role context plays in how LLMs behave – in particular, agentic behavior. In fact, we could even say that what makes an LLM-based application into an AI agent is how it understands and takes action based upon context. Such context can include:

Information about available local files, databases, and APIs
Available tools and how to access them
Security information necessary to access required assets
Other metadata relevant to each query

Such context must be clear and unambiguous for the agents to behave properly. In other words, agents require context that has low context density. In fact, this requirement for low context density is one of the reasons why the Model Context Protocol (MCP) has been such a rapid success. The MCP is an open integration protocol standard for interactions with and among LLMs. It’s based on JSON, a flexible format for expressing data with low semantic density – or in the case of MCP, low context density. While the creators of MCP didn’t explicitly design it with low context density in mind, they did intend for the protocol to prioritize clarity and structure over density. Given that each system in an agentic interaction must understand the relevant context without hidden assumptions or other nuances of meaning, explicit context with low density is essential to the success of agentic systems. What, then, Is the Role of High Context Density? Human-to-human interactions, aka conversations, have inherently high context density – even though we rarely notice it.
Every human conversation contains layers of subtext and hidden meaning via facial expressions, hand gestures, tone of voice, words with ambiguous meaning, patterns of pauses in speech, and other subtle aspects of human communication. Such nuance goes right over the proverbial head of AI – even LLMs that do such a good job of mimicking human conversation. In other words, it’s virtually impossible for LLMs to deal with high context density. Agentic interactions in particular are quite sensitive to excessive context density. Agents rely so heavily on the precision possible with low context density that any nuance in context will throw them off entirely. At the very least, they will completely ignore it. How Context Density Helps Us Humans Where agents (and genAI in general) are weak, humans are strong. Context density, therefore, helps us answer the questions at the top of this article. If we look at various applications of AI, context density drives essential distinctions:

Knowledge work – ask your favorite copilot to handle tasks with low context density. Focus human attention and activity on those tasks that require high context density.
Automation – processes with low context density are easy for AI to automate. Processes with high context density require human input and control.
Building software – anyone can leverage code generation tools to build applications with low context density. For applications that require high context density, code generation tools must be secondary to skilled human effort, insight, and control.

Context density thus becomes the differentiating metric between activities and applications that LLMs are well-suited for vs. those activities and applications that will continue to require human input and control, even as AI technologies mature. The Intellyx Take The most important part of this story is not identifying where AI is useful. It’s identifying where it is not.
As AI inevitably transforms how we work and live, we must all come to terms with the fact that AI will take various tasks off our respective plates, leaving us wondering what our purpose will be in this arguably dystopian future. Take heart: there will always be roles for us humans. We are the masters of insight, creativity, nuance, and hidden meaning – the essence of context density. Our challenge moving forward: identifying those activities where we can provide value as individuals by offering just those capabilities that AI is so woefully unable to provide. The opportunity for software vendors: make sure your products have high context density. That way agents won’t be able to do what your products do. Instead, agents will need to call upon your products to accomplish their tasks successfully. The opportunity for humans: make sure your work is both semantically and contextually dense. Focus on the meaning that LLMs can’t grasp. Express your intuition, insight, and creativity in terms of meaning, both within your work as well as its human context. AI gives us an amazing set of tools. Knowing how to use them well means focusing our efforts on providing the value that we as humans are uniquely qualified to contribute.

By Jason Bloomberg
When Retries Become a Denial-of-Wallet

There's a particular kind of incident that doesn't show up in your error dashboards. No alerts fire. Latency looks fine, actually — or fine-ish, in that flickering, indeterminate way that makes you suspicious but not certain. What shows up, days later, is a billing anomaly. A line item that's 4x what you budgeted. And when you dig, you find it: retries. Hundreds of thousands of them. Loyal, tireless, utterly pointless retries, hammering a dependency that was never going to recover within the retry window, each one spinning up a Lambda invocation, writing to CloudWatch, touching the database, accruing egress. The system was "retrying" its way into insolvency. This is what I mean when I call uncontrolled retries a self-inflicted Denial-of-Wallet attack. Not metaphorically. Mechanically. The Seductive Logic of "Just Try Again" The impulse is almost irresistible. Networks are flaky. Downstream services hiccup. Transient faults are real, they are common, and a single retry genuinely does rescue a meaningful fraction of requests that would otherwise fail. Every distributed systems textbook will tell you this. The problem is that the textbook version of a retry — lone request, momentary fault, clean recovery — bears almost no resemblance to what retries actually do inside a system operating at load under a real failure. Under real failure, the math inverts. Say Service A depends on Service B. B starts returning 500s — maybe a deployment went sideways, maybe a database connection pool saturated. A is configured with what seems reasonable: three retries, linear backoff, no jitter. What happens next is not three polite attempts and a graceful degradation. What happens is multiplication. Every original request to A becomes four requests to B (the original plus three retries). If A is receiving 1,000 RPS, B is now absorbing 4,000 RPS — on top of the load it was already failing to handle. Each of those extra requests touches middleware, writes a log line, maybe hits a queue. 
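That multiplication is worth making concrete. A back-of-the-envelope sketch, using the illustrative numbers from the scenario above (this is arithmetic, not a real retry client):

```python
def amplified_rps(original_rps: int, max_retries: int) -> int:
    """Worst-case load on the downstream service when every request
    fails and exhausts its retries: the original attempt plus each retry."""
    return original_rps * (1 + max_retries)

# 1,000 RPS into A with three retries means B absorbs 4,000 RPS
print(amplified_rps(1000, 3))
```

And the factors compound across a call chain: if two hops each retry three times, one top-level request can become (1 + 3) × (1 + 3) = 16 attempts at the bottom of the stack.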
B, already struggling, gets worse. A's retries accelerate B's failure. The snowball rolls. The Stanford RetryGuard researchers have a name for this: the retry storm. It's not exotic. It's what happens when you deploy reasonable-looking retry policies without thinking about what they do in aggregate. What the Cost Actually Looks Like People underestimate the surface area of a retry. They think: one extra HTTP call. They don't think about what's attached to that HTTP call. In a Lambda-backed architecture, each retry is an invocation — billed separately. Each invocation likely emits structured logs to CloudWatch, which charges per GB ingested. If the function hits a DynamoDB table, that's another read unit consumed, possibly another write. If there's an API Gateway in front, that's another API call counted against your tier. If the response is large, there's egress cost. And this happens in parallel across however many concurrent requests are in flight. Now consider the timeline. Service B fails at 2 AM. The on-call engineer doesn't see it until 2:17. During those 17 minutes, if A was receiving 500 RPS and each request retried three times, you've generated roughly 1.5 million additional requests to B (500 × 60 × 17 × 3). You've paid for every one of them. You've gotten nothing back. The original failure wasn't solved; the retries just made the failure expensive. One way to think about this: retries without circuit breakers are paying a premium to prolong a failure. The Hidden Feedback Loops Nobody Draws on the Architecture Diagram The simple A-calls-B diagram is almost always wrong. What's usually true is that A, B, and C all call each other in some configuration, and several of them share infrastructure. So when B degrades: A retries B, increasing load on B's shared database connection pool. The pool saturates. Now C, which also reads from that database, starts timing out. C's callers — let's say D and E — start retrying. D and E's retries hit the same pool.
The pool is now so saturated that even requests that have nothing to do with the original B failure are timing out. This is the cascade that the RetryGuard paper captures: service A experiences a retry storm and pays the price, but the price is actually distributed across the whole graph. The bulkhead patterns — isolating thread pools, rate-limiting per-dependency — exist precisely to prevent this. Most systems don't have them, or have them configured with defaults that were never tuned for actual traffic. The other feedback loop worth naming is the log-based one. Your observability stack is probably downstream of your services. If it's Elasticsearch or Loki or CloudWatch, it absorbs your logs. Under a retry storm, log volume can spike 5–10x. That means your observability system — the thing you're depending on to diagnose the problem — is now also under load. I've been in incidents where the logging pipeline itself started dropping messages at exactly the moment we needed full fidelity. The retry storm ate its own evidence. Exponential Backoff Is Not Enough (and Jitter Matters More Than You Think) Backoff is the first thing people reach for. Double the wait between attempts. It's better than nothing. But standard exponential backoff without jitter has a subtle and nasty property: it synchronizes retries. Suppose 500 requests arrive simultaneously. They all fail. They all back off by 1 second. They all retry simultaneously at T+1. They all fail again. They all back off by 2 seconds. They all retry simultaneously at T+3. You've turned continuous load into synchronized bursts — which are, in some ways, worse than continuous load, because they create spike conditions that can exceed per-second rate limits and overwhelm autoscaling that hasn't had time to provision. Jitter — adding a random offset to the backoff interval — breaks this synchronization. 
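A minimal sketch of what jittered backoff looks like in practice (the function name and default parameters are illustrative, not taken from any particular library):

```python
import random

def backoff_with_full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Full jitter: wait a uniformly random interval between zero and the
    capped exponential backoff, so concurrent retriers spread out across
    time instead of retrying in synchronized bursts."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Three clients on the same failed attempt each pick a different point
# in the window, breaking the lockstep retry pattern described above
delays = [backoff_with_full_jitter(5) for _ in range(3)]
```

The `cap` argument matters as much as the jitter: without it, later attempts would wait minutes or longer, which creates its own pathologies.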
The AWS Architecture Blog's "Exponential Backoff and Jitter" post from 2015 remains one of the clearest explications of why, and the "full jitter" strategy (where the wait is uniformly random between zero and the calculated backoff) outperforms "equal jitter" in most workloads. The math isn't complicated. The intuition is: you want your retriers to spread out across time, not march in lockstep. The formula you actually want:

Plain Text
wait = random_between(0, min(cap, base * 2^attempt))

That min(cap, ...) is important. Without a ceiling, your backoff can grow to minutes or hours, which creates its own problems — held connections, stale state, zombie sessions that reconnect long after the original context is gone.

Retry Budgets: The Underused Primitive

Here's where Linkerd gets something importantly right that most service meshes and client libraries don't foreground: the retry budget. The idea is simple. Instead of configuring retries per-request ("retry up to N times"), you configure retries per-traffic-volume ("retries may not exceed X% of requests"). Linkerd's default is 20% — meaning if your service is handling 1,000 RPS, it will allow at most 200 retry requests per second, regardless of how many individual requests are failing. Once the budget is exhausted, requests fail fast. This is a fundamentally different mental model. Per-request retry limits think locally — this request failed, try it again. Retry budgets think globally — the system is under stress, we cannot afford to amplify that stress beyond this threshold. The budget makes the cost of retrying explicit at the system level.

The Istio equivalent is less elegant but workable. You can cap numRetries and set aggressive perTryTimeout values to bound the worst-case amplification, though you're still thinking per-route rather than per-budget. A rough YAML configuration:

YAML
retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: "5xx,connect-failure,refused-stream"

Notice retryOn. This matters.
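Expressed as code, a conservative retry predicate might look like this (a sketch; the class and method names are illustrative, and the right status-code list is always dependency-specific):

```java
// Sketch: retry only errors that can plausibly succeed on a later attempt.
class RetryPolicy {
    static boolean isRetryable(int httpStatus) {
        return switch (httpStatus) {
            case 503 -> true;  // Service Unavailable: the canonical transient error
            case 429 -> true;  // Too Many Requests: retry, but only with backoff
            case 502 -> true;  // Bad Gateway: sometimes transient
            case 504 -> false; // Gateway Timeout: retrying rarely helps an overloaded upstream
            default -> false;  // 4xx in particular will fail identically on every attempt
        };
    }
}
```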
You should not retry on every error code. A 400 Bad Request doesn't get better with retries — the request is malformed and will fail identically on every attempt. Retrying 4xx errors is particularly wasteful because they're often client-side problems that the server will consistently reject. The codes worth retrying are: transient network failures, 503 Service Unavailable, 429 Too Many Requests (with appropriate backoff), and sometimes 502 Bad Gateway. Even 504 Gateway Timeout deserves scrutiny — if B is genuinely overwhelmed, retrying a timed-out request doesn't help B recover.

Circuit Breakers: The Pattern Everyone Claims to Use and Almost Nobody Tunes

Resilience4j, Hystrix (RIP), Polly, Istio's outlier detection — the options are plentiful. The implementations, in my experience, are often misconfigured to the point of uselessness. A circuit breaker has three states: closed (passing requests through), open (failing fast), and half-open (letting a probe request through to test recovery). The transitions between states are governed by parameters: failure rate threshold, minimum number of calls before the threshold applies, wait duration in open state, permitted calls in half-open state. The defaults in most libraries are conservative in a way that makes them nearly inert. A failure rate threshold of 50% sounds aggressive, but if your minimum call count is 100, the breaker won't open until you've seen 50 failures in the sampling window. With a small sliding window of, say, 10 calls, you might need 5 consecutive failures before it trips. In practice, by the time the breaker opens, you've already generated substantial unnecessary load.

The tuning questions nobody asks at configuration time:

- What's the expected recovery time for this dependency? Set your waitDurationInOpenState to something meaningful relative to that.
If your downstream service typically recovers in 30 seconds, a 5-second open window means the breaker will half-open and immediately re-trip multiple times before recovery, adding noise to your metrics and extending the incident.

- What's the right sampling window? A count-based window (last N calls) can be gamed by low-traffic services where N takes minutes to fill. Time-based windows (last N seconds) are usually more appropriate for production.
- What should happen when the circuit is open? This is the graceful degradation question. Returning an error is fine. Returning a cached response is better. Returning a sensible default is sometimes correct. The teams I've seen handle this best define the fallback behavior explicitly, in code, with the same rigor they'd apply to the happy path.

The half-open state is where circuit breakers most often fail in practice. Probe requests succeed in the test environment because the test environment has predictable load. In production, the first probe arrives when the downstream service has just recovered and is still warming up — and under the concurrent burst of all the callers that were queued behind the open breaker. The probe succeeds. The breaker closes. 200 requests hit simultaneously. The service tips over again. Repeat. The fix is to open the circuit gradually: allow, say, 5% of traffic through in half-open state, ramp to 25%, ramp to 100%. Most libraries don't do this natively. Istio's outlier detection is closer to this model, ejecting individual hosts rather than binary-tripping a per-service breaker.

What You Actually Change on Monday Morning

Not everything. The systems are running. You don't get to redesign the retry architecture from scratch during business hours. But some things are cheap and high-value:

Audit your retry configurations. Find every place in your codebase where retries are configured — client libraries, service mesh configs, SDK defaults you didn't know were there. AWS SDKs retry by default.
Many HTTP clients retry on timeout by default. The retry behavior you didn't configure is often more dangerous than the retry behavior you did.

Add jitter to anything that doesn't have it. If you have backoff = base * 2^attempt, change it to backoff = random(0, base * 2^attempt). Twenty minutes of work. Immediate improvement in thundering herd conditions.

Turn on retry rate monitoring. Your APM or service mesh almost certainly exposes retry counts. Surface them. Add a dashboard. Set an alert at, say, 1% retry rate under normal conditions — abnormal elevations will catch incipient retry storms before they become billing anomalies.

Identify your non-idempotent paths and either remove retries or add idempotency keys. POST endpoints that create resources cannot be safely retried without idempotency controls. If you're retrying a payment or an order creation, you're potentially creating duplicates. This is its own class of disaster, separate from cost — but it compounds cost because you're now also writing extra records.

Define your fallbacks. For each service your system depends on, what should happen when it's unavailable? The answer "retry indefinitely" is almost never correct. "Return a cached response" or "return a degraded but valid result" or "queue for later processing" are usually better. The fallback should be in code, tested, and not a surprise to the on-call engineer at 2 AM.

The Broader Frame

There's something philosophically interesting about retry storms that I keep coming back to. Each individual retry is rational. From the perspective of a single request that failed due to a transient network glitch, retrying is exactly the right behavior. The emergence of a retry storm from individually-rational retries is a classic collective action problem — something that's good for each agent is destructive when everyone does it simultaneously. Circuit breakers and retry budgets are collective action solutions.
They impose a global constraint that each individual caller would have no incentive to impose on itself. This is, incidentally, why they work better when implemented in the mesh layer (where they can see aggregate traffic) than in individual client libraries (where they can only see their own requests). The Denial-of-Wallet framing is useful because it names the threat model correctly. You don't need an external attacker. You don't need a misconfigured adversary. You need one failure, one reasonable-looking retry policy, and enough traffic that the multiplication matters. The attack surface is your own response to your own failures. That's the part that's hard to internalize. The retries feel like resilience. They feel like diligence. They are, under the wrong conditions, the instrument of your own undoing.

By David Iyanu Jonathan
Comparing Top Gen AI Frameworks for Java in 2026

Java has always been a serious language for production systems, and in 2026, the Generative AI ecosystem has finally caught up. For years, Java developers watched from the sidelines as Python and TypeScript accumulated framework after framework for building LLM-powered applications. Today, the picture is very different. Java has multiple mature, actively maintained AI frameworks, each with its own philosophy and trade-offs. This article covers the four frameworks I have personally used to ship Java AI applications: Genkit Java, Spring AI, LangChain4j, and Google ADK Java. Each one represents a meaningfully different bet on what a Java AI framework should be, and understanding those differences will save you from picking the wrong tool.

Genkit Java

History and Direction

Genkit started life as a TypeScript-first framework launched by Google at I/O 2024. The Java SDK arrived as a community-maintained effort, built and maintained by developers within the Google ecosystem who wanted to bring the same developer experience to Java that Genkit had established in TypeScript. As of 2026, Genkit Java is not an official Google product, but it is actively maintained, follows the core Genkit design closely, and ships its own plugin ecosystem. The framework’s first stable release landed in early 2026 after months of preview use. Its ambition mirrors the TypeScript SDK’s: bring Genkit’s multi-level abstractions (vanilla generation, typed flows, agents), its broad provider-neutral plugin model, and, crucially, the Genkit Developer UI to Java developers. The Java SDK ships with Spring Boot and Jetty server plugins, making it a natural fit for teams that already run Java services in production. The Javadoc and architecture are clean and idiomatic Java; this does not feel like a port, it feels designed for the language.
The direction is clear: maintain parity with the TypeScript Genkit SDK’s abstractions while embracing Java idioms (builder patterns, typed schemas via Java classes, annotation-free configuration). Support for evaluation, MCP (Model Context Protocol), RAG with pgvector and Pinecone, and multi-agent patterns is already in place.

What Makes Genkit Java Stand Out

Like its TypeScript counterpart, Genkit Java provides three levels of abstraction in a single SDK: direct model calls, typed flows (observable pipelines), and agents. This is unique in the Java AI space; no other Java framework gives you all three in one coherent API. Supported languages: Java 21+ (primary). Deploys to Spring Boot, Jetty, or Firebase Cloud Functions.

Vanilla Generation

Java
import com.google.genkit.Genkit;
import com.google.genkit.ai.GenerateOptions;
import com.google.genkit.plugins.googlegenai.GoogleGenAIPlugin;

Genkit genkit = Genkit.builder()
    .plugin(GoogleGenAIPlugin.create())
    .build();

String text = genkit.generate(GenerateOptions.builder()
    .model("googleai/gemini-flash-latest")
    .prompt("Explain the CAP theorem in two sentences.")
    .build()).getText();

Typed Flows: Observable Pipelines

Flows are the heart of Genkit Java. They wrap your AI logic in a named, typed, traceable unit that is automatically exposed as an HTTP endpoint and visible in the Dev UI.
Java
import com.google.genkit.Genkit;
import com.google.genkit.ai.GenerateOptions;
import com.google.genkit.flow.FlowOptions;
import com.google.genkit.plugins.googlegenai.GoogleGenAIPlugin;
import com.google.genkit.plugins.jetty.JettyPlugin;
import com.google.genkit.plugins.jetty.JettyPluginOptions;

record TranslateRequest(String text, String targetLanguage) {}
record TranslateResponse(String translation, String detectedLanguage) {}

JettyPlugin jetty = new JettyPlugin(JettyPluginOptions.builder().port(8080).build());

Genkit genkit = Genkit.builder()
    .plugin(GoogleGenAIPlugin.create())
    .plugin(jetty)
    .build();

genkit.defineFlow(
    FlowOptions.<TranslateRequest, TranslateResponse>builder()
        .name("translateText")
        .inputClass(TranslateRequest.class)
        .outputClass(TranslateResponse.class)
        .build(),
    (ctx, request) -> {
        var response = genkit.generate(GenerateOptions.builder()
            .model("googleai/gemini-flash-latest")
            .prompt("Translate '%s' to %s. Return JSON with 'translation' and 'detectedLanguage'."
                .formatted(request.text(), request.targetLanguage()))
            .outputClass(TranslateResponse.class)
            .build());
        return response.getOutput(TranslateResponse.class);
    }
);

Tools and Agents

Java
import com.google.genkit.ai.tool.ToolDefinition;
import java.util.List;

var weatherTool = genkit.defineTool(
    ToolDefinition.<String, String>builder()
        .name("getWeather")
        .description("Returns current weather for a city.")
        .inputClass(String.class)
        .outputClass(String.class)
        .build(),
    (ctx, city) -> "Sunny, 24°C in " + city
);

// Use the tool inside a flow or agent
var result = genkit.generate(GenerateOptions.builder()
    .model("googleai/gemini-flash-latest")
    .prompt("What's the weather like in Tokyo?")
    .tools(List.of(weatherTool))
    .build());

The Dev UI: Same Power as TypeScript

One of Genkit Java’s most compelling features is that the same Genkit Developer UI used by the TypeScript SDK works directly with Java applications.
You install the Genkit CLI (Node.js-based) and start your Java app through it:

Shell
npm install -g genkit

The Dev UI opens at http://localhost:4000 and gives you:

- Flow runner – execute any flow interactively with custom inputs and inspect typed outputs.
- Trace explorer – full OpenTelemetry traces for every generate and flow call, showing latency, token counts, and exact prompts.
- Model playground – test any registered model directly.
- Tool testing – stub and test tools in isolation.
- Dotprompt editor – edit .prompt files live with variable injection.

This is the single biggest advantage Genkit Java has over every other Java AI framework: a zero-config, local developer UI that replaces the need for LangSmith or Grafana during development.

Provider Support

Genkit Java ships plugins for: Google GenAI (Gemini), OpenAI, Anthropic (Claude), AWS Bedrock, Azure AI Foundry, Ollama, xAI (Grok), DeepSeek, Cohere, Mistral, and Groq. All accessed through the same genkit.generate() interface. Vector store plugins cover: Firebase Firestore, Weaviate, PostgreSQL (pgvector), Pinecone, and a local in-memory store.

Pros and Cons

✅ Pros | ❌ Cons
Best-in-class Dev UI with local trace explorer | Unofficial/community-maintained (not a Google product)
Multi-level abstractions: vanilla, flows, agents | Artifacts on GitHub Packages (requires auth to pull)
Broadest provider support in Java ecosystem | Java 21+ required
Spring Boot and Jetty deployment plugins | Smaller community than LangChain4j or Spring AI
OpenTelemetry built in | Still SNAPSHOT versioned (1.0.0-SNAPSHOT)
Idiomatic Java with builder patterns |

Spring AI

History and Direction

Spring AI was announced by the Spring team (Broadcom) in mid-2023 and reached its 1.0 GA release in mid-2024. It is the most enterprise-grade option in this comparison, built by the same team that maintains Spring Framework, Spring Boot, and Spring Data, which together underpin a vast proportion of the world’s Java server-side applications.
The founding premise of Spring AI is that AI integration in Java applications should feel like every other Spring integration: auto-configured, testable, portable, and production-ready out of the box. The project draws inspiration from LangChain and LlamaIndex, but explicitly avoids being a port; it is designed from the ground up to be idiomatic Spring. If you have written Spring applications, Spring AI will feel immediately familiar: @Autowired AI clients, Spring Boot starters, application.properties configuration, and Advisor patterns that mirror Spring’s existing interception model. Spring AI’s direction through 2025 and into 2026 has been to deepen its observability story (Micrometer-native metrics and traces), expand its ChatClient fluent API, and ship more vector store integrations. The framework is now the de facto standard for teams that are already invested in the Spring ecosystem and want to add AI capabilities without introducing a foreign dependency philosophy.

What Makes Spring AI Stand Out

Spring AI’s killer feature is Spring Boot integration depth. There is no framework on this list, in any language, that integrates AI capabilities as seamlessly into an existing application framework as Spring AI does with Spring Boot. Auto-configuration, conditional beans, health indicators, Actuator endpoints for AI metrics: everything a Spring developer expects, applied to AI. Supported languages: Java (primary). Also supports Kotlin (via Spring’s Kotlin DSL). Runs anywhere Spring Boot runs: embedded Tomcat, Jetty, Undertow, GraalVM native images.
Java
// application.properties
// spring.ai.openai.api-key=${OPENAI_API_KEY}
// spring.ai.openai.chat.options.model=gpt-4o

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.*;

@RestController
public class ChatController {

    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    @GetMapping("/chat")
    public String chat(@RequestParam String message) {
        return chatClient.prompt()
            .user(message)
            .call()
            .content();
    }
}

Structured Output

Spring AI’s BeanOutputConverter maps model responses directly to Java POJOs, using the class schema to generate format instructions automatically.

Java
import java.util.List;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.converter.BeanOutputConverter;

record MovieReview(String title, int rating, String summary, List<String> pros) {}

BeanOutputConverter<MovieReview> converter = new BeanOutputConverter<>(MovieReview.class);

MovieReview review = chatClient.prompt()
    .user(u -> u.text("Review the movie Inception. {format}")
        .param("format", converter.getFormat()))
    .call()
    .entity(MovieReview.class);

RAG With Advisors

Spring AI’s Advisors API is one of its most elegant features. Advisors wrap ChatClient calls with cross-cutting concerns (RAG retrieval, chat memory, logging, guardrails) in a declarative, composable way.
Java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class DocumentQAService {

    private final ChatClient chatClient;

    public DocumentQAService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder
            .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
            .build();
    }

    public String answerQuestion(String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }
}

Observability

Spring AI ships with Micrometer integration out of the box. Every chat call generates spans (Spring Boot tracing) and metrics (prompt token count, completion token count, model latency) visible in any Micrometer-compatible backend: Prometheus, Grafana, Zipkin, or Datadog. There is no separate Dev UI; observability is handled by your existing Spring Boot infrastructure.

Broad Vector Store and Model Support

Spring AI supports 10+ model providers (OpenAI, Anthropic, Google Vertex AI, Amazon Bedrock, Azure OpenAI, Mistral, Ollama, Groq, and more) and 20+ vector stores (PGVector, Pinecone, Weaviate, Redis, Elasticsearch, MongoDB Atlas, Chroma, and more), the broadest integration coverage of any Java AI framework.

Pros and Cons

✅ Pros | ❌ Cons
Deepest Spring Boot integration, feels native | No standalone Dev UI for flow inspection
Micrometer-native observability | Agent abstractions are less mature than LangChain4j
Broadest model and vector store integrations | Advisors pattern has a learning curve
Production-tested by the Spring ecosystem | Heavier Spring context overhead for simple use cases
GraalVM native image support | No flow/pipeline abstraction like Genkit
Idiomatic Java and Kotlin support |

LangChain4j

History and Direction

LangChain4j was started in early 2023 by a small community of Java developers who noticed that the LLM framework explosion happening in Python had no Java equivalent.
Despite the name, the project is not a mechanical port of LangChain Python; it is a fusion of ideas from LangChain, Haystack, LlamaIndex, and original innovation, packaged in a way that makes sense for Java. It grew quickly through 2023 and 2024, driven by its comprehensive integration list (20+ LLM providers, 30+ vector stores) and its clean two-level abstraction model: low-level primitives for maximum control and high-level AI Services for rapid development. The AI Services pattern, where you define an interface with annotations and LangChain4j implements it for you at runtime, became the framework’s signature feature and arguably the most Java-idiomatic approach to LLM integration in the ecosystem. By 2025, LangChain4j had formal integrations with Quarkus, Spring Boot, Micronaut, and Helidon, covering every major Java application framework. The team’s direction in 2026 is focused on deepening agentic capabilities (multi-step tools, planning loops, MCP support) and improving the observability story, which has historically been a weaker point compared to Spring AI’s Micrometer integration or Genkit’s Dev UI.

What Makes LangChain4j Stand Out

LangChain4j’s AI Services pattern is its defining feature. Instead of writing imperative LLM call code, you declare an interface, annotate it with @SystemMessage, @UserMessage, and memory annotations, and LangChain4j generates the implementation. The result is AI code that reads like a Java service contract: clean, testable, and completely familiar to Java developers. Supported languages: Java (primary). Kotlin extensions available (coroutine-based async support). Integrates with Spring Boot, Quarkus, Micronaut, Helidon.

Java
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.V;

interface TranslationAssistant {
    @SystemMessage("You are a professional translator. Translate text accurately and naturally.")
    String translate(@UserMessage String text, @V("language") String targetLanguage);
}

var model = OpenAiChatModel.withApiKey(System.getenv("OPENAI_API_KEY"));

TranslationAssistant assistant = AiServices.builder(TranslationAssistant.class)
    .chatLanguageModel(model)
    .build();

String result = assistant.translate("The quick brown fox jumps over the lazy dog", "Spanish");
System.out.println(result);

Memory and Streaming

Java
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.service.MemoryId;

interface ConversationalAssistant {
    @SystemMessage("You are a helpful assistant.")
    String chat(@MemoryId String userId, @UserMessage String message);
}

ConversationalAssistant assistant = AiServices.builder(ConversationalAssistant.class)
    .chatLanguageModel(model)
    .chatMemoryProvider(memoryId -> MessageWindowChatMemory.withMaxMessages(20))
    .build();

// Each userId gets its own isolated memory
assistant.chat("user-42", "My name is Alice.");
String response = assistant.chat("user-42", "What's my name?"); // Returns: "Your name is Alice."
RAG Pipeline

Java
import dev.langchain4j.data.document.loader.UrlDocumentLoader;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.openai.OpenAiEmbeddingModel;
import dev.langchain4j.rag.content.retriever.EmbeddingStoreContentRetriever;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Ingest documents
var documents = UrlDocumentLoader.load("https://example.com/docs");
var splitter = DocumentSplitters.recursive(500, 50);
var segments = splitter.splitAll(documents);

var embeddingModel = OpenAiEmbeddingModel.withApiKey(apiKey);
var embeddingStore = new InMemoryEmbeddingStore<TextSegment>();
EmbeddingStoreIngestor.ingest(segments, embeddingStore, embeddingModel);

// Build RAG-enabled assistant
interface DocsAssistant {
    String answer(@UserMessage String question);
}

var retriever = EmbeddingStoreContentRetriever.builder()
    .embeddingStore(embeddingStore)
    .embeddingModel(embeddingModel)
    .maxResults(3)
    .build();

DocsAssistant assistant = AiServices.builder(DocsAssistant.class)
    .chatLanguageModel(model)
    .contentRetriever(retriever)
    .build();

Two Abstraction Levels

LangChain4j explicitly offers two levels:

- Low level – ChatModel, UserMessage, AiMessage, EmbeddingStore: full control, more code.
- High level – AiServices: declarative interfaces, minimal boilerplate.

This mirrors what Genkit Java achieves differently. Where Genkit gives you flows and agents as pipeline concepts, LangChain4j uses interface-based AI Services as its high-level abstraction, very idiomatic in Java terms.
Pros and Cons

✅ Pros | ❌ Cons
AI Services pattern is uniquely Java-idiomatic | No built-in Dev UI or trace explorer
Largest integration ecosystem (20+ models, 30+ stores) | Observability requires external tooling (no Micrometer by default)
Two clear abstraction levels (low and high) | Agent capabilities still maturing (2026)
Spring Boot, Quarkus, Micronaut, Helidon integrations | Large number of modules can be overwhelming
Kotlin coroutine support | Less opinionated, more choices to make yourself
Strong RAG tooling out of the box |

Google ADK Java

History and Direction

Google ADK (Agent Development Kit) launched in 2024 as a Python-first agent framework targeting enterprise deployments on Google Cloud. Java was a late addition to the multi-language roadmap, with ADK Java 1.0 shipping in early 2026 alongside ADK Go 1.0. The Java SDK's arrival was significant: it signaled that Google views ADK as a serious enterprise runtime, not just a Python scripting tool. ADK Java follows the same design philosophy as the Python SDK: everything is an agent, workflow, or tool. The framework is optimized for building reliable, evaluatable, production-grade multi-agent systems and deploying them to Google Cloud infrastructure, primarily Vertex AI Agent Engine, Cloud Run, and GKE. Like its Python counterpart, ADK Java carries the weight of Google Cloud gravity. The best developer experience, the smoothest deployment path, and the most mature observability story all assume you are running on GCP. ADK Java 1.0 includes the full agent runtime (LLM agents, sequential/loop/parallel workflow agents), tool calling, MCP support, A2A (Agent-to-Agent) protocol, session/memory management, and streaming. The Java API closely mirrors the Python API in structure, which means the mental model transfers well, but also means the Java SDK carries a style that reflects Python-first design decisions.
ADK Java’s Position: Agent-Only, Enterprise-Grade

Like its Python counterpart, ADK Java is an agent framework; it has no vanilla generation primitive or flow abstraction outside the agent model. Its raison d’être is spinning up reliable, evaluatable agents and deploying them at enterprise scale. If you are building a multi-agent system on Google Cloud and Java is your language of choice, ADK Java 1.0 is Google’s recommended path. Supported languages: Java (with ADK Java 1.0). Also: Python (primary), TypeScript, Go.

Java
import java.util.List;

import com.google.adk.agents.LlmAgent;
import com.google.adk.runner.InMemoryRunner;
import com.google.adk.tools.GoogleSearchTool;
import com.google.genai.types.Content;
import com.google.genai.types.Part;

var researchAgent = LlmAgent.builder()
    .name("researcher")
    .model("gemini-flash-latest")
    .instruction("You help users research topics thoroughly and accurately.")
    .tools(List.of(new GoogleSearchTool()))
    .build();

var runner = new InMemoryRunner(researchAgent);
var session = runner.sessionService().createSession(
    researchAgent.name(), "user-1"
).blockingGet();

var userMessage = Content.fromParts(Part.fromText(
    "What are the latest developments in fusion energy?"
));

runner.runAsync(researchAgent.name(), session.id(), userMessage)
    .blockingForEach(event -> {
        if (event.finalResponse()) {
            System.out.println(event.stringifyContent());
        }
    });

Multi-Agent Orchestration

ADK Java’s multi-agent capabilities match the Python SDK’s, including sequential, parallel, and loop orchestration.
Java
import java.util.List;

import com.google.adk.agents.LlmAgent;
import com.google.adk.agents.SequentialAgent;

var researcher = LlmAgent.builder()
    .name("researcher")
    .model("gemini-flash-latest")
    .instruction("Research the given topic and provide key facts.")
    .build();

var writer = LlmAgent.builder()
    .name("writer")
    .model("gemini-flash-latest")
    .instruction("Write a clear, well-structured article from the research provided.")
    .build();

var editor = LlmAgent.builder()
    .name("editor")
    .model("gemini-flash-latest")
    .instruction("Polish and format the article for publication.")
    .build();

var pipeline = SequentialAgent.builder()
    .name("contentPipeline")
    .subAgents(List.of(researcher, writer, editor))
    .build();

Vertex AI Lock-In

ADK Java’s production deployment story is built around Vertex AI Agent Engine and Google Cloud. While you can run ADK Java locally (via the ADK CLI or directly) and deploy to Cloud Run or GKE independently, the managed evaluation tools, performance dashboards, and enterprise support all assume GCP. This is the clearest example in the Java AI space of a framework built to serve a platform rather than being platform-neutral.

Pros and Cons

✅ Pros | ❌ Cons
Official Google support with production SLA | Tightly coupled to Vertex AI and GCP
Best multi-agent orchestration in Java | Agent-only framework, no vanilla generation or flows
A2A protocol for agent interoperability | Python-first design reflected in Java API style
Full evaluation tools (user simulation, custom metrics) | Requires GCP for full observability and deployment features
Scales to enterprise on Google Cloud | Youngest Java SDK (1.0 released 2026)
Streaming support (Gemini Live API) |

Head-to-Head Comparison

Developer Experience

Framework | DX Highlights | Shortcomings
Genkit Java | Dev UI for local tracing is unmatched. Idiomatic Java builder API. | GitHub Packages auth friction; unofficial status
Spring AI | Feels native to any Spring Boot codebase. Zero-surprise API. | No visual Dev UI; observability via Micrometer only
LangChain4j | AI Services pattern is the cleanest Java-native AI abstraction | No Dev UI; agent features still maturing
ADK Java | Powerful multi-agent tooling. Official Google support. | GCP-centric; Python style reflected in Java API

Abstraction Levels

Genkit Java is the only Java AI framework that provides all three levels: vanilla generation, typed flows (pipelines), and agents. Spring AI covers generation and a basic agent model via tools, but lacks a flow abstraction. LangChain4j provides two levels (low-level primitives and high-level AI Services) but is agent/service focused. ADK Java is agent-only.

Observability

Framework | Local Dev | Production
Genkit Java | Dev UI with trace explorer | OTEL-compatible export
Spring AI | Logs and Actuator endpoints | Micrometer (Prometheus, Grafana, Datadog)
LangChain4j | Logging only | Manual OTEL setup
ADK Java | ADK Web UI | Cloud Trace + Vertex (GCP)

Framework Neutrality

Genkit Java and LangChain4j are built to be provider-neutral: they support every major model and deploy to any infrastructure. Spring AI is similarly neutral on model providers, though it carries Spring’s opinionated application framework as a dependency, a worthwhile trade for most Java shops. ADK Java carries the heaviest platform dependency: its full value is unlocked on Google Cloud.

Java Ecosystem Fit

Framework | Spring Boot | Quarkus | Micronaut | Native Image
Genkit Java | ✅ Plugin | ❌ | ❌ | ❌
Spring AI | ✅ Native | ❌ | ❌ | ✅ GraalVM
LangChain4j | ✅ Module | ✅ Extension | ✅ Module | Partial
ADK Java | ❌ | ❌ | ❌ | ❌

Which Framework Should You Choose?
Choose Genkit Java if:

- You want to iterate on your AI fast and get feedback with less back and forth; Genkit was built from the ground up for powerful local tooling and observability, and the Dev UI is genuinely transformative.
- You need multiple abstraction levels (vanilla calls, typed flows, and agents) in one SDK.
- Provider neutrality matters: you need to swap or mix Gemini, Claude, OpenAI, and Bedrock.
- Your team also writes TypeScript and wants a consistent framework story across both stacks.

Choose Spring AI if:

- You are already running Spring Boot and want AI to feel like any other Spring integration.
- Micrometer-native metrics and traces plugging into your existing Prometheus/Grafana stack are a priority.
- You need the broadest model and vector store coverage with production-grade auto-configuration.
- GraalVM native images are a requirement for your deployment targets.

Choose LangChain4j if:

- You want the most Java-idiomatic high-level AI abstraction: interface-based AI Services with annotations.
- You need the largest integration ecosystem and don’t want to be tied to any application framework.
- Your team works across Spring Boot, Quarkus, Micronaut, and Helidon; LangChain4j is the most framework-agnostic.
- RAG pipelines with rich document ingestion and retrieval are a core use case.

Choose ADK Java if:

- You are building enterprise-grade multi-agent systems and Google Cloud is your runtime.
- You need official Google support and SLA-backed infrastructure for agent deployment.
- Multi-agent orchestration (sequential, parallel, loop) and the A2A interoperability protocol matter.
- Your team is already using the ADK Python SDK and wants to extend to Java services.

Conclusion

Java’s AI framework landscape in 2026 is surprisingly rich. The four frameworks covered here serve genuinely different needs, and unlike in the JavaScript world, where Genkit, Vercel, Mastra, LangChain, and ADK overlap significantly, the Java options each occupy a clearer niche.
For enterprise Spring Boot teams, Spring AI is the obvious choice, with zero friction, production-ready observability via Micrometer, and the broadest integration matrix. For teams that value developer experience above all, Genkit Java’s Dev UI is a category apart and worth the unofficial status trade-off. For framework-agnostic Java developers who want the most idiomatic Java AI service abstraction, LangChain4j’s AI Services pattern is hard to beat. And for Google Cloud enterprise workloads that need reliable multi-agent orchestration at scale, ADK Java 1.0 is where Google is putting its weight. The most important thing is that you no longer have an excuse to reach for Python just because it has better AI tooling. Java’s time in generative AI has arrived. Last updated: April 2026. Framework versions referenced: Genkit Java 1.0.0-SNAPSHOT, Spring AI 1.x, LangChain4j 0.36.x, Google ADK Java 1.0.

By Xavier Portilla Edo DZone Core CORE
KV Cache Implementation Inside vLLM

The key-value (KV) cache is a fundamental optimization in transformer-based LLM inference. It stores intermediate attention states, i.e., the keys and values computed during the prefill phase, so that subsequent tokens can reuse them instead of recomputing them from scratch. This significantly reduces compute cost and latency, especially for long-context or multi-turn agentic workloads. KV caching has been extensively discussed across blogs and documentation [1, 2, 3, 4, 5]. Rather than revisiting those well-known concepts, this article walks through the KV cache implementation of vLLM (v0.20.0), using concrete code pointers and design insights to bridge the gap between high-level understanding and real-world system design.

KV Cache Is Not a Standard Cache

At first glance, KV cache sounds like a standard caching problem: store computed results to reuse later. However, in systems like vLLM, the KV cache behaves fundamentally differently from a traditional cache like Redis. It is not a simple key-value lookup system sitting outside the execution path, but a tightly coupled component of the model's forward pass that must be accessed at every decoding step. Unlike conventional caches, the KV cache is dynamic, partially reusable, and deeply intertwined with GPU memory allocation. This means that KV cache design is as much about memory management and scheduling as it is about cache reuse. Thinking of it as just a cache hides its true complexity; it is better understood as a virtualized memory layer for intermediate computation.
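To see why the reuse matters, here is a toy cost model (illustrative only, not real attention code) comparing how many key/value projections must be computed per generated token with and without a cache:

```python
# Toy cost model: without a KV cache, generating token t recomputes the
# keys/values for all previous positions; with a cache, only the newest
# token's K/V pair is computed and the rest are read back from memory.
def kv_projections_computed(num_generated: int, prompt_len: int, cached: bool) -> int:
    total = 0
    for t in range(num_generated):
        context = prompt_len + t
        # cached: one new projection; uncached: the whole context plus the new token
        total += 1 if cached else context + 1
    return total

without_cache = kv_projections_computed(64, 512, cached=False)  # 34848 projections
with_cache = kv_projections_computed(64, 512, cached=True)      # 64 projections
```

Even in this crude model, 64 decode steps over a 512-token prompt cost hundreds of times more projection work without reuse, which is exactly the recomputation the KV cache eliminates.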
| Dimension | Traditional cache (e.g., Redis) | KV cache in LLMs (e.g., vLLM) |
| --- | --- | --- |
| Purpose | Avoid recomputing full results | Avoid recomputing intermediate attention state |
| Common access pattern | Key -> value lookup | Key -> key-value bytes lookup during model execution |
| Reuse type | All or nothing | Partial reuse (prefix-based) |
| Storage | In-memory / persisted | Primarily GPU memory, which can also be persisted |
| Consistency | Eventual or strong consistency | Must match the exact token sequence |
| Scheduling dependency | Independent | Strongly coupled with request scheduling |
| Failure mode | Cache miss results in recompute | Cache miss results in recompute |
| Cache locality sensitivity | Low (can often be distributed for better reliability and scalability) | Very high (node/worker local) and I/O latency sensitive |

The kv_cache_manager is a good entry point for seeing that the KV cache in vLLM is not a traditional cache but an active memory manager: during inference it handles allocation, reuse, eviction, prefix cache hits, and request lifecycle state.

```python
class KVCacheManager:
    def __init__(
        self,
        kv_cache_config: KVCacheConfig,
        max_model_len: int,
        hash_block_size: int,
        max_num_batched_tokens: int | None = None,
        enable_caching: bool = True,
        use_eagle: bool = False,
        log_stats: bool = False,
        enable_kv_cache_events: bool = False,
        dcp_world_size: int = 1,
        pcp_world_size: int = 1,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ) -> None:
```

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

vLLM KV Cache Design

vLLM's KV cache design treats KV memory like virtual memory rather than contiguous tensors to avoid memory bottlenecks. Instead of allocating large blocks per request, it introduces a layer of indirection via fixed-size blocks and block tables. This allows memory to be used efficiently, reused across requests, and dynamically resized as sequences grow.
Two core primitives enable this design: block tables and an eviction mechanism. Together, they solve critical problems in memory fragmentation, reuse, and scalability.

Block Tables

The block table is the central abstraction in vLLM's KV cache design. Instead of storing KV tensors contiguously in GPU memory, each request maintains a mapping from logical token positions to physical memory blocks. This indirection layer is conceptually similar to a page table in an operating system. When the model accesses KV state for a given token, it resolves through the block table to locate the physical block in GPU memory. This design allows KV memory to be non-contiguous, shared across multiple requests, and dynamically extended as tokens are generated.

vLLM maintains a BlockTable whose rows correspond to active request slots. Each row maps a request's logical token/block positions to physical KV cache block IDs in GPU memory. This indirection lets KV blocks be allocated non-contiguously and lets multiple requests refer to reused/shared cached blocks.

```python
class BlockTable:
    def __init__(
        self,
        block_size: int,
        max_num_reqs: int,
        max_num_blocks_per_req: int,
        max_num_batched_tokens: int,
        pin_memory: bool,
        device: torch.device,
        kernel_block_size: int,
        cp_kv_cache_interleave_size: int,
    ):
```

Source: (v0.20.0) vllm/vllm/v1/worker/block_table.py at main · vllm-project/vllm · GitHub

vLLM's KV cache is divided into fixed-size KVCacheBlocks. These blocks are the fundamental unit of allocation, prefix cache reuse, reference counting, and eviction. The code below is a good set of pointers for understanding that lifecycle.
```python
class BlockPool:
    def __init__(
        self,
        num_gpu_blocks: int,
        enable_caching: bool,
        hash_block_size: int,
        enable_kv_cache_events: bool = False,
        metrics_collector: KVCacheMetricsCollector | None = None,
    ):
```

Source: (v0.20.0) vllm/vllm/v1/core/block_pool.py at main · vllm-project/vllm · GitHub

allocate_slots() asks the coordinator how many blocks are needed, checks the shared block_pool for free capacity, and then calls allocate_new_blocks() only for the current request's needed slots. This shows that blocks are dynamically assigned from a shared pool rather than preallocated per request.

```python
def allocate_slots(
    self,
    request: Request,
    num_new_tokens: int,
    num_new_computed_tokens: int = 0,
    new_computed_blocks: KVCacheBlocks | None = None,
    num_lookahead_tokens: int = 0,
    num_external_computed_tokens: int = 0,
    delay_cache_blocks: bool = False,
    num_encoder_tokens: int = 0,
    full_sequence_must_fit: bool = False,
) -> KVCacheBlocks | None:
```

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

Cache Eviction

Eviction in vLLM is more complex than a typical least recently used (LRU) policy due to dependencies between tokens. KV blocks form a logical prefix chain, meaning later tokens depend on earlier ones. As a result, eviction cannot arbitrarily remove blocks without breaking correctness. Instead, vLLM uses a reference count-based mechanism combined with recency heuristics. Blocks are only eligible for eviction when no active request depends on them, and even then, eviction typically proceeds from the tail of sequences to preserve prefix integrity. This constrained eviction behavior ensures correctness while still allowing the system to operate under memory pressure. Blocks are reference-counted: a block can only be freed when no active request depends on it, which guarantees correctness.
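As a toy sketch of that reference-counting rule (not vLLM code; the class and method names are illustrative), a shared pool only reclaims a block once every request holding it has released it:

```python
class ToyBlockPool:
    """Toy sketch of reference-counted KV block freeing (not vLLM's implementation)."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))  # shared free list
        self.ref_count = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> None:
        # A prefix-cache hit: another request now depends on this block.
        self.ref_count[block] += 1

    def release(self, block: int) -> None:
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:  # only unreferenced blocks are reclaimable
            self.free_blocks.append(block)

pool = ToyBlockPool(4)
b = pool.allocate()          # request A computes this block
pool.share(b)                # request B reuses the same cached prefix block
pool.release(b)              # A finishes, but B still holds a reference
reclaimed_early = b in pool.free_blocks   # False: block is still pinned by B
pool.release(b)              # B finishes; the block returns to the pool
```

The real entry points for this lifecycle in vLLM are the free() paths shown below.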
```python
def free(self, request: Request) -> None:
    self.coordinator.free(request.request_id)
```

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

When a request completes, KVCacheCoordinator.free() calls free() on each per-type manager. The manager removes that request's blocks from req_to_blocks and returns them to BlockPool.free_blocks(), where they become reclaimable once their reference count reaches zero.

```python
def free(self, request_id: str) -> None:
    req_blocks = self.req_to_blocks.pop(request_id, [])
```

Source: (v0.20.0) vllm/vllm/v1/core/single_type_kv_cache_manager.py at main · vllm-project/vllm · GitHub

How the Request Flow Works

Understanding how the KV cache works requires following a request through the system. At a high level, vLLM attempts to reuse previously computed KV blocks by matching prefixes, allocates new blocks for unseen tokens, and schedules requests in a way that maximizes reuse while balancing GPU utilization. Prefix matching identifies previously computed KV blocks that can be reused for the incoming request. find_longest_cache_hit() takes the incoming request's block_hashes, searches the prefix cache for matching cached blocks, and returns the reusable KVCacheBlocks plus the number of computed tokens.

```python
def find_longest_cache_hit(
    self,
    block_hashes: list[BlockHash],
    max_cache_hit_length: int,
) -> tuple[tuple[list[KVCacheBlock], ...], int]:
```

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_coordinator.py at main · vllm-project/vllm · GitHub

New tokens are assigned newly allocated KV blocks on demand, and the returned block IDs are later used by the worker-side block table to extend the request's physical KV mapping.

```python
new_blocks = self.coordinator.allocate_new_blocks(
    request.request_id,
    num_tokens_need_slot,
    num_tokens_main_model,
    num_encoder_tokens,
)
```

Source: (v0.20.0) vllm/vllm/v1/core/kv_cache_manager.py at main · vllm-project/vllm · GitHub

The scheduler decides which requests run together.
Requests with shared prefixes benefit from co-location, improving the cache hit rate. Even with perfect caching logic, poor scheduling can eliminate all cache benefits.

```python
def schedule(self) -> SchedulerOutput:
```

Source: (v0.20.0) vllm/vllm/v1/core/sched/scheduler.py at main · vllm-project/vllm · GitHub

Before the model forward pass executes, gpu_model_runner uses the request block table to resolve scheduled logical token positions into physical KV cache slot IDs. During attention execution, the resulting slot_mapping and block_table_tensor are passed through attention metadata so kernels can read/write the correct KV cache locations.

```python
def _prepare_inputs(
    self,
    scheduler_output: "SchedulerOutput",
    num_scheduled_tokens: np.ndarray,
) -> tuple[
    torch.Tensor,
    SpecDecodeMetadata | None,
]:
    self.input_batch.block_table.compute_slot_mapping(
        num_reqs,
        self.query_start_loc.gpu[: num_reqs + 1],
        self.positions[:total_num_scheduled_tokens],
    )
```

Source: (v0.20.0) vllm/vllm/v1/worker/gpu_model_runner.py at main · vllm-project/vllm · GitHub

Future Work

While vLLM's prefix-based KV caching is highly effective, it has inherent limitations that motivate future work. Today, reuse is strongest when requests share the same prefix, because cached blocks are validated through prefix/block hash chains. One future direction is more general segment- or chunk-level reuse, where systems try to reuse repeated prompt regions beyond strict prefixes. This could help when shared content appears later in prompts, but it is harder than prefix caching because KV states depend on position and surrounding context. Another direction is distributed KV caching, where KV state can be stored, transferred, or shared across workers or replicas rather than remaining purely local to one GPU/node. This can improve reuse and scaling, but introduces challenges around latency, routing, placement, and consistency.
Together, these directions move KV caching from a local per-worker optimization toward a broader system-level capability.

Conclusion

vLLM rethinks KV caching as a memory management and scheduling problem rather than a simple reuse mechanism. Through fixed-size block allocation, block tables for logical-to-physical indirection, prefix-aware reuse, reference-counted block lifetimes, and LRU-like cached block eviction, it turns the KV cache into a virtualized resource that can be shared efficiently across requests. However, the effectiveness of this system depends not only on its internal design, but also on how requests are scheduled, batched, and routed. These nuances show that KV caching is not merely a local optimization, but a core systems primitive for modern LLM inference. As inference systems evolve toward more general segment/chunk-level reuse and distributed KV caching, these same principles will continue to shape scalable and efficient serving platforms.

References

1. https://github.com/vllm-project/vllm
2. https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
3. https://bentoml.com/llm/inference-optimization/kv-cache-offloading
4. https://cloud.google.com/blog/topics/developers-practitioners/boosting-llm-performance-with-tiered-kv-cache-on-google-kubernetes-engine/
5. https://developer.nvidia.com/blog/how-to-reduce-kv-cache-bottlenecks-with-nvidia-dynamo/
6. https://pub.towardsai.net/the-secret-behind-fast-llm-inference-unlocking-the-kv-cache-9c13140b632d

By Bhala Ranganathan DZone Core CORE
ARC: The Architecture for Reasoning Control

Three Lessons from an AI Makeathon

I recently participated in a makeathon focused on building AI-powered applications. Over 2–3 intense days, I watched teams go from idea to demo — and the patterns that separated working products from frustrated debugging sessions were remarkably consistent, especially for teams building AI agents. From this makeathon, and from my experience working with teams building AI applications and agents, here are the three lessons I took away on how to build reliable AI applications by engineering around non-determinism. Together, these form what I like to call “The Architecture for Reasoning Control”.

1. Start Small — Non-Determinism Compounds

AI models are non-deterministic: the same input won’t always produce the same output. This is a feature when you want creativity. It’s a problem when you want reliability. In a small app — one model call, one task — non-determinism is manageable: you can observe the behavior, tune your prompts, and build confidence. You iterate fast and catch drift early. In a large app like an AI agent, where the model must reason, select tools, and manage state across multiple steps, these non-determinism errors compound. Every AI call is a roll of the dice. Chain ten of them together and you’re rolling ten dice simultaneously. If each step succeeds with probability P(success), the probability of a successful end-to-end run over n steps, P(success)^n, decays exponentially, so the probability of at least one undesired result doesn’t just grow — it compounds quickly. In my experience building bigger AI agents, we often spend the majority of our time chasing unpredictable outputs across these long chains. By scoping small, we found we could build working demos and deployable applications that actually stay on the rails.

The Architectural Lesson: Apply the Single Responsibility Principle (SRP) of architecture design: an AI module should have one, and only one, reason to change.
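The compounding math above is worth making concrete; a quick sketch, assuming each step succeeds independently with probability p:

```python
# End-to-end success probability of an n-step agent chain, assuming each
# step independently succeeds with probability p.
def chain_success(p: float, n: int) -> float:
    return p ** n

single_step = chain_success(0.95, 1)   # a 95%-reliable step looks fine alone
ten_steps = chain_success(0.95, 10)    # ten chained calls: ≈ 0.599 end-to-end
```

A step that fails one time in twenty feels solid in isolation, yet a ten-step chain built from such steps completes cleanly only about 60% of the time, which is exactly why small scopes are easier to keep on the rails.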
You can think of these as analogous to microservices — small, single-purpose AI units that can be composed safely. Get one agentic interaction working with high reliability before you dream of chaining it. If the foundation is shaky, the agentic skyscraper will fall.

2. Multipass Guardrails — Defense in Depth

Even the best guardrails we built didn’t have 100% effectiveness. A single validation pass catches most bad outputs — but “most” isn’t enough when you’re shipping to users. To understand why, consider the full surface area you need to guard. Most teams think about content safety — blocking violent or illegal content. But that’s just one of six categories:

To get more determinism in our guardrail efficacy and build true defense in depth, we experimented with a “double-pass” approach — running the same guardrail logic against both input and output. While this bumped our success rate slightly, it quickly revealed a structural flaw: correlated blind spots. When our detection logic misclassified an illegal query as merely “off-topic” at the input stage, it consistently made the same error at the output stage. Similarly, PII that bypassed the upstream filter sailed through downstream because the detection signature was identical. We realized that doubling down on the same logic slightly increased our safety margin but just mirrored our existing weaknesses. So we researched shifting from symmetrical filtering to a model built on orthogonal, independent layers. The goal was to ensure that if one layer failed, the next would approach the problem from a completely different technical angle. This “cops-and-robbers” dynamic makes it significantly less likely that failures align — requiring multiple, differently designed systems to fail simultaneously for an issue to reach the user.
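As a minimal sketch of that orthogonality, here are two independently designed layers — a deterministic regex scanner and a structural JSON check; the regex pattern, required keys, and function names are illustrative assumptions, not a production rule set:

```python
import json
import re

# Layer 1: deterministic PII scanner (regex), fully independent of any model.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def pii_scan(text: str) -> bool:
    """Return True if the text passes (no SSN-like pattern found)."""
    return SSN_RE.search(text) is None

# Layer 2: structural enforcement. The model must emit JSON with known keys,
# so malformed "model behavior" becomes an ordinary parsing failure.
REQUIRED_KEYS = {"answer", "sources"}

def structural_check(raw: str) -> bool:
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS <= payload.keys()

def guardrail_stack(raw_output: str) -> bool:
    # An output must clear every independent layer before reaching the user.
    return pii_scan(raw_output) and structural_check(raw_output)
```

Because the two layers fail for unrelated reasons (a regex miss versus a parse failure), a single blind spot no longer lets an output through both.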
If you’re looking to move beyond simple “pass/fail” filters, here are the layers you could analyze to stack with your guardrail:

- Dedicated scanners (NER and regex): Use deterministic PII scanners (regex for SSNs/credit cards) and Named Entity Recognition (NER) to catch data leaks before the query even hits the model.
- Intent routing: Use a fast, specialized classifier to bucket queries into “benign,” “ambiguous,” or “high-risk.” This allows you to route high-risk queries through stricter handling paths or specialized system prompts before they reach the primary generative model.
- Structural enforcement (JSON schema): Move the goalposts from “free text” to “data validation.” By forcing the model to output a strict JSON schema, you turn unpredictable “model behavior” risks into a predictable code problem that can be caught by a standard parser.
- LLM-as-a-judge: Introduce a secondary, smaller “observer” model tasked purely with evaluating the primary model’s response against a different set of criteria.
- Retrieval-grounded responses (RAG): Constrain the model to answer only from retrieved context and validate that outputs are traceable to sources, reducing hallucination and unsupported claims.
- Confidence/uncertainty gating: Use signals (judge scores, validation checks, or model uncertainty) to decide when to answer, ask for clarification, or fall back, rather than treating all outputs equally.

The overarching lesson was that there is no such thing as a “perfect” guardrail. Instead, assemble a stack of diverse, independent checks. By assuming that every individual layer will occasionally fail, you can design a system where those failures never align — creating a robust “Swiss Cheese” model of AI safety that actually holds up under adversarial pressure.

3. Flow Engineering — Mix AI with Deterministic Processing: Control What You Can

AI excels at ambiguity: reasoning over messy inputs, interpreting intent, and generating natural language.
But for problems requiring guaranteed correctness — precise data lookups, workflow sequencing, or state management — it remains fundamentally probabilistic. It can often arrive at the right answer, but it cannot reliably guarantee it every time. The insight that worked best: use AI for reasoning; use deterministic code for execution. Let AI decide what to do (intent, analysis, extraction). Then let code decide how to do it (orchestration, API calls, state management).

This separation doesn’t just improve reliability — it fundamentally changes how the system behaves:

- Controlled scope: By limiting LLM calls to only the steps that require reasoning, you reduce unnecessary model invocations and keep the AI surface area small. This reinforces Lesson 1 — when the scope is smaller, the non-determinism is easier to observe.
- Targeted safety: It strengthens Lesson 2 — guardrails are most effective when applied to fewer, well-defined points rather than across an unbounded flow.

This is the “agentic pattern” emerging across the industry: a deterministic workflow engine that delegates to AI only where human-like reasoning is needed, then pulls the result back into controlled, predictable code. The best AI applications aren’t the ones that give AI the most freedom — they’re the ones that give AI the right freedom. This is the core of flow engineering. Instead of letting an agent navigate a dark room, we hard-coded the rails. By using the LLM as a cognitive engine at specific steps in a verifiable chain — rather than a free-roaming driver — we replaced a porous process with a solid structural track.

Why This Works

- Reliability: Deterministic systems eliminate randomness in mission-critical steps.
- Cost and latency: Fewer LLM calls lead to lower inference costs and faster responses.
- Observability: A smaller AI surface area is easier to monitor, test, and debug.
- Safety: Guardrails become exponentially more effective when applied at controlled, well-defined points.
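A minimal sketch of this reasoning/execution split, with the model call stubbed out; classify_intent, the intent names, and the handlers are illustrative assumptions:

```python
# AI decides *what* to do; deterministic code decides *how* to do it.
def classify_intent(user_message: str) -> str:
    """Stand-in for an LLM call that maps a free-text message to an intent."""
    # In a real system this would be a (guarded) model invocation.
    return "refund" if "refund" in user_message.lower() else "faq"

def handle_refund(msg: str) -> str:
    # Deterministic business logic: validated API calls, state updates, etc.
    return "refund-workflow-started"

def handle_faq(msg: str) -> str:
    return "faq-answer"

# Deterministic dispatch: the model can only select from an allow-list.
HANDLERS = {"refund": handle_refund, "faq": handle_faq}

def run(user_message: str) -> str:
    intent = classify_intent(user_message)
    handler = HANDLERS.get(intent, handle_faq)  # unknown intents fall back safely
    return handler(user_message)
```

The model's output is reduced to picking one key from a fixed table, so even a wrong classification can only route to a known, safe code path.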
You’re not just optimizing performance — you’re containing non-determinism. Exec insight: high-risk business logic should stay deterministic; creative and reasoning tasks can be probabilistic.

Conclusion

All three lessons point to the same principle: respect the non-determinism. The goal isn’t to eliminate non-determinism; it’s to build systems where it can’t break you, using “ARC: The Architecture for Reasoning Control”. AI systems don’t fail because they’re non-deterministic. They fail because that non-determinism is poorly bounded. Don’t fight it. Don’t ignore it. Don’t pretend your model is a function that returns the same output every time. The teams that built the most impressive demos at the makeathon weren’t the ones with the most ambitious prompts. They were the ones who understood where AI helps — and where it doesn’t.

Summarizing with the Swiss Cheese metaphor:

- Lesson 1 (Start Small): Shrink the size of the “holes” in the cheese by limiting scope using the architectural principle of SRP.
- Lesson 2 (Orthogonal Defense in Depth): Stack the slices so the “holes” never align, through orthogonal layers.
- Lesson 3 (Flow Engineering): Reduce how much cheese is in the system in the first place by using deterministic flows for critical logic.

While our team was recognized with a special award, the real takeaway was the framework we discovered along the way. Start small. Guard deep. Stay deterministic. That’s what turns AI from a demo into a system you can trust.

By Ananth Iyer
Designing Agentic Systems Like Distributed Systems

Agentic development is rapidly becoming one of the most talked-about paradigms in software development. The conversation is no longer just about using AI to assist with coding, but about systems where an AI agent plans, executes tasks, and even makes decisions. From a surface-level perspective, agentic systems are a new abstraction. But if we look under the hood, we find something rather familiar: distributed systems. Whether in microservices, asynchronous workflows, or event-driven architectures, many of the same challenges apply:

- Irregular behavior
- Partial failures
- Latency fluctuations
- Lack of observability

The biggest mistake teams make is treating agents like deterministic scripts. In reality, they require the same rigor and design discipline as distributed systems.

The Illusion of Determinism

The traditional software model is fundamentally deterministic: under the same conditions, you expect the same result. Agentic systems contradict this assumption. Identical prompts and inputs do not always produce the same outputs because of:

- Model variability
- Context variation
- Token limits
- Responses from external tools

This is akin to the behavior of distributed systems, which have to deal with real-world conditions — network latency, retries, and service dependencies — that generate variance. The practical consequence is that you cannot rely on "it worked once" as proof of correctness. Instead, you must design for:

- Variability
- Approximation
- Probabilistic correctness

This shift alone is enough to prompt engineers to reconsider their entire approach to reliability.

Agents Are Just Services With Unstable Contracts

In distributed systems, services interact through clearly defined contracts — usually an API, a schema, or a versioned interface. The converse is often true for agentic systems.
A typical agent flow might look like:

1. Create a response
2. Call a tool
3. Parse the output
4. Decide on the next action

However, without strict contracts, things break:

- The model returns JSON that is not quite the same shape
- A field is missing or has been renamed
- The tool response format changes

These problems are not edge cases; they are expected behaviors. The solution is to treat agents like services with stricter contracts:

- Ensure that outputs are clearly structured (JSON schemas, typed responses)
- Validate every interaction
- Fail fast on invalid responses

You don't trust the model; you encase it in a construct that enforces correctness at the boundaries.

Orchestration Over Autonomy

There is a general perception that agents are autonomous and can thus operate independently. In production scenarios, this is rarely the case. What actually works is orchestration. Like distributed systems that use orchestrators (workflow engines, schedulers, queues), agentic systems require:

- Feedback control loops
- Stepwise execution
- Explicit state transitions

A robust agentic workflow includes the following main steps:

1. Propose the task
2. Implement a single step
3. Check the output
4. Choose the next step
5. Loop or terminate

This is not autonomy but controlled execution — more like a state machine than a self-driving system. The more critical the workflow, the more control you need:

- Limiting agent freedom
- Specifying allowed actions
- Adding human-in-the-loop checkpoints when needed

Autonomy has its charm, but orchestration is what makes systems reliable.

Failure Is the Default State

Distributed systems are structured around the assumption that failure is not a special event but a normal occurrence. The same holds for agentic systems: failure must be treated as the default.
Errors can arise on different fronts:

- The model might misjudge what the issue actually is
- A tool call could fail or time out
- The agent might get stuck in a loop
- The output is syntactically correct but semantically wrong

If your system assumes success, it will fail in production. Instead, design for failure:

- Add retries with limits
- Implement timeouts
- Introduce fallback paths
- Detect and break infinite loops

For example:

- If the agent is unable to produce valid output after 3 repeated attempts, fall back to a deterministic flow
- If a tool call fails, return a degraded yet safe response

These are the circuit-breaker and retry-policy patterns of distributed systems at work. Reliability comes not from avoiding errors but from handling them gracefully.

Observability Is Non-Negotiable

One of the hardest issues in distributed systems is observability: understanding what happened when something has gone wrong. In agentic systems, it is even harder. Why? Because failures are often not binary. The system could:

- Deliver an answer that's covertly erroneous
- Use the wrong reasoning
- Adopt incorrect assumptions

Without observability, debugging is guesswork. Running agentic systems in production therefore needs:

- Structured logs of every step
- Prompt and response tracing
- Tool invocation tracking
- Decision-path visibility

Think of it as distributed tracing for agents. Instead of just logging outputs, log:

- Inputs
- Intermediate reasoning (if safe)
- Tool calls and results
- Final decisions

This allows you to answer critical questions:

- Where did the system go astray?
- Was it the model, the prompt, or the tool?
- Is this an isolated issue, or is it a pattern?

Good observability turns unpredictable systems into manageable ones.

Idempotency and State Management

In distributed systems, idempotency guarantees that repeated actions don't produce unintended consequences. Agentic systems need this even more.
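A request-key cache is one classic way to get idempotency; a minimal sketch with hypothetical names:

```python
# Idempotent tool execution via a request-key cache (illustrative sketch).
_results: dict[str, str] = {}

def idempotent_call(request_key: str, action) -> str:
    """Run `action` at most once per request_key; retries return the cached result."""
    if request_key not in _results:
        _results[request_key] = action()
    return _results[request_key]

side_effects = []

def create_ticket() -> str:
    side_effects.append(1)   # the side effect we must not duplicate
    return "ticket-42"

first = idempotent_call("req-001", create_ticket)
retry = idempotent_call("req-001", create_ticket)  # retried step: no second ticket
```

In production the cache would be a durable store keyed per step, but the contract is the same: a retried step observes the original result instead of re-executing the side effect.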
Consider the scenarios where:

- A step is retried
- A tool is called multiple times
- The agent restarts mid-flow

Without idempotency, these situations lead to outcomes such as:

- Duplicated actions
- Inconsistent outputs
- Corrupted workflows

Best practices include:

- Store explicit state between steps
- Make tool calls idempotent where possible
- Keep track of execution history

For example, rather than allowing the agent to "remember" context implicitly, persist:

- What steps were completed
- What outputs were produced
- What decisions were made

This turns a brittle state into a recoverable one.

Guardrails Over Intelligence

One common misconception is that improving the model will solve most problems. However, system design matters more than model capability. More robust models make fewer mistakes, but they do not eliminate:

- Ambiguities
- Misinterpretations
- Unexpected outputs

Guardrails are what make systems usable:

- Input validation
- Output constraints
- Action limits
- Safety checks

For example:

- The agent can only call tools that are explicitly allowed
- Outputs are validated before execution
- Destructive actions are prevented

This resembles the way distributed systems enforce:

- Access controls
- Rate limits
- Data validation

You don't trust components blindly; you constrain them.

Closing Thoughts

Agentic development is not about replacing engineering discipline; it is about applying it with rigor. The most effective systems are not necessarily the most independent. They are the ones that are:

- Intelligently orchestrated
- Heavily constrained
- Deeply observable

Ultimately, agents are simply another layer in your architecture.

By Satyam Nikhra
The Hidden Failure Modes of AI Systems (That Traditional Monitoring Misses)

One of the most unsettling characteristics of AI systems is how often they appear perfectly healthy. Infrastructure dashboards report stable CPU utilization, normal latency levels, and acceptable throughput. No alerts are triggered. From an operational standpoint, the system is functioning exactly as designed. Yet the outputs are wrong.

In many AI deployments, engineers eventually encounter this situation: a recommendation system begins suggesting irrelevant items, a support chatbot produces inconsistent answers, or an AI assistant gradually becomes less reliable at answering domain-specific questions. Despite infrastructure stability, evidenced by nominal CPU and latency metrics, AI systems frequently exhibit what can be described as silent degradation: a condition where semantic accuracy deteriorates while the transport layer remains fully operational. This failure mode is increasingly common in modern AI pipelines.

Why Traditional Dashboards Can Be Misleading

Monitoring platforms such as Prometheus, Datadog, and CloudWatch were designed for deterministic software systems. They track signals like request latency, memory usage, and service availability. These metrics are still essential. However, they only capture infrastructure health — not model behavior. Consider a typical retrieval-augmented generation (RAG) architecture. A user query moves through several layers before producing an answer: an API gateway, an embedding service, a vector database, a re-ranking layer, and finally the language model responsible for generating the response. If the embedding service experiences a brief latency spike, the system might reduce the number of retrieved documents or fall back to cached embeddings. The request still completes successfully, and infrastructure metrics remain within healthy ranges. But the language model now receives weaker context. The generated response may still appear fluent and coherent, yet its factual accuracy has quietly declined.
From the perspective of the monitoring dashboard, the system remains healthy. From the perspective of the end user, the system has degraded. This gap highlights a fundamental challenge: AI reliability problems often occur in the semantic layer rather than the infrastructure layer.

Retrieval Pipelines: A Hidden Source of Instability

Retrieval systems are particularly vulnerable to subtle instability. Modern AI applications depend heavily on vector search to provide contextual knowledge to language models. Even small disturbances in this pipeline can significantly alter system behavior. For example, if a vector index update is delayed or embedding quality drifts slightly, similarity search may return documents that are only partially relevant. The model must then infer missing context on its own, increasing the probability of hallucination. Several factors can introduce this instability:

• Embedding drift caused by model updates
• Delayed indexing of newly ingested documents
• Latency spikes reducing the retrieval window
• Incomplete ranking signals in re-ranking layers

None of these conditions necessarily produces an infrastructure failure. Instead, they reduce the informational quality available to the model, weakening its reasoning capability.

Hallucination Amplification

Large language models generate responses probabilistically based on the context they receive. When that context becomes incomplete or noisy, the model compensates by relying more heavily on internal patterns. This is where hallucinations begin. A small retrieval error may initially produce a slightly uncertain response. In more complex systems, particularly agentic frameworks, this uncertainty can cascade through downstream workflows. For example, autonomous agents may execute follow-up API calls or trigger actions based on the model's interpretation of retrieved data. If the underlying reasoning is degraded, those actions can amplify the original error.
In other words, a minor retrieval issue can evolve into a chain of incorrect decisions. Traditional monitoring tools rarely capture this phenomenon because they do not measure the semantic integrity of outputs.

Metrics That Actually Matter

If infrastructure metrics alone cannot detect these issues, what signals should engineers monitor instead? AI reliability requires a new class of observability metrics focused on model behavior.

• One important signal is accuracy drift. Continuous evaluation pipelines can periodically test model outputs against benchmark datasets or validated queries, allowing teams to detect gradual declines in model performance.
• Another critical metric is retrieval precision. In RAG systems, measuring the relevance of retrieved documents helps identify when embedding quality or vector index freshness begins to deteriorate.
• Engineers should also monitor inference variance, the degree to which identical prompts produce different outputs over repeated runs. High variance can indicate unstable context, inconsistent retrieval results, or fluctuating model states.

Tracking these signals provides visibility into how the AI system is reasoning, rather than simply confirming that it is responding.

Metric | What It Detects | Why It Matters
Accuracy Drift | Gradual decline in model correctness | Early indicator of model degradation
Retrieval Precision | Quality of documents retrieved in RAG pipelines | Poor retrieval leads to hallucinations
Inference Variance | Output instability across repeated prompts | Indicates context inconsistency
Context Coverage | Percentage of relevant documents retrieved | Measures knowledge completeness
Response Entropy | Uncertainty in generated responses | High entropy signals weak model confidence

Example: Detecting Semantic Drift in a RAG Pipeline

A simple reliability monitor can periodically test model responses against expected outputs to detect early-stage degradation.
Python

from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

# Modern openai>=1.0 client (the legacy openai.ChatCompletion API is deprecated)
client = OpenAI()
model = SentenceTransformer("all-MiniLM-L6-v2")

expected_answer = "The Eiffel Tower is located in Paris, France."
test_query = "Where is the Eiffel Tower located?"

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": test_query}],
)
generated_answer = response.choices[0].message.content

# Compare the live answer against the validated reference via embedding similarity
expected_embedding = model.encode([expected_answer])
generated_embedding = model.encode([generated_answer])
similarity = cosine_similarity(expected_embedding, generated_embedding)[0][0]

if similarity < 0.80:
    print("⚠️ Potential semantic drift detected.")

The AI Reliability Stack: A Proposed Architecture

Addressing these hidden failure modes requires integrating semantic monitoring into the AI development lifecycle. A typical AI reliability stack may include several layers of observability. At the infrastructure level, traditional monitoring tools such as Prometheus or OpenTelemetry continue to track system health metrics. These tools ensure that core services remain operational. Above this layer sit model observability platforms such as LangSmith or Arize. These tools track prompt-response pairs, analyze model outputs, and detect anomalies in inference behavior. A third layer focuses on evaluation pipelines integrated into CI/CD workflows. Automated tests evaluate model performance using curated datasets, enabling teams to detect accuracy drift before it reaches production environments.

Together, these layers provide a more complete picture of system reliability. Infrastructure monitoring ensures services remain available, while semantic monitoring ensures the system's intelligence remains intact. In my work developing intent-based chaos models for distributed systems (for which I hold a USPTO-recognized patent), I observed that infrastructure telemetry alone rarely detects early-stage AI failures.
Combining topology-aware chaos testing with semantic observability allows engineering teams to detect reliability issues before they propagate through production systems.

Plain Text

User Query
    │
    ▼
API Gateway
    │
    ▼
Embedding Service
    │
    ▼
Vector Database
    │
    ▼
LLM Inference
    │
    ▼
Semantic Evaluation Layer
    │
    ├── Accuracy Drift Monitor
    ├── Retrieval Precision Tracker
    └── Hallucination Detection

Toward Reliability Engineering for AI Systems

As AI systems become embedded in production environments, reliability engineering must evolve alongside them. Traditional observability practices remain essential for maintaining infrastructure stability. However, they must be complemented by tools that measure how AI systems actually behave. The next generation of reliability frameworks will likely combine infrastructure telemetry with semantic evaluation pipelines, enabling engineers to detect not just outages, but the early signals of degraded reasoning.

The hidden failure modes of AI systems cannot be eliminated entirely. But with the right monitoring strategies, they can be detected before they undermine the reliability of intelligent systems. Building trustworthy AI requires more than uptime dashboards. It requires visibility into how the system thinks.

Category | Traditional Monitoring (Infrastructure) | AI Observability (Semantic)
Primary Goal | Detect system outages and latency | Detect quality degradation and drift
Core Metrics | CPU, RAM, HTTP 500s, p99 latency | Faithfulness, answer relevancy, context recall
Failure State | Binary (up or down) | Spectrum (accurate to hallucinated)
Tooling | Prometheus, Grafana, Datadog | LangSmith, Arize, Fiddler, DeepEval
Root Cause | Code bugs, hardware failure, traffic spikes | Embedding drift, retrieval gaps, prompt sensitivity
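As one closing illustration, the retrieval precision metric discussed above lends itself to a very lightweight implementation: score the top-k retrieved documents against labeled relevance judgments from an evaluation set. This is a minimal sketch in plain Python; the document IDs and the "relevant" set are hypothetical, and in practice the labels would come from curated evaluation queries.

```python
# Minimal precision@k monitor for a retrieval pipeline (illustrative only:
# in a real pipeline, the "relevant" set comes from labeled evaluation queries).

def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

# A healthy retrieval run vs. one affected by embedding drift.
relevant = {"d1", "d2", "d3"}
healthy_run = ["d1", "d2", "d3", "d9"]
drifted_run = ["d1", "d7", "d8", "d9"]

print(round(precision_at_k(healthy_run, relevant, k=3), 2))  # 1.0
print(round(precision_at_k(drifted_run, relevant, k=3), 2))  # 0.33
```

Run on a schedule and plotted over time, a drop in this number is often the earliest visible symptom of embedding drift or index staleness, well before any infrastructure alert fires.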

By Sayali Patil
Top JavaScript/TypeScript Gen AI Frameworks for 2026

The generative AI tooling ecosystem has exploded over the past two years. What started as a handful of Python libraries has grown into a rich, opinionated landscape of frameworks spanning multiple languages, deployment targets, and philosophical bets. As a developer who has shipped production applications using all five of the frameworks covered in this article (Genkit, Vercel AI SDK, Mastra, LangChain, and Google ADK), I want to offer a practical, hands-on view of where each one excels, where each one falls short, and what I would reach for depending on the project I'm building.

This is not a benchmark post. Tokens-per-second and latency numbers go stale within weeks. Instead, this is a developer experience and architecture comparison, the kind of thing that matters when you're deciding which framework will carry your product through 2026 and beyond. A quick note on scope: all five frameworks are in active development and moving fast. Code samples in this article use the APIs as of April 2026.

Genkit

History and Direction

Genkit was announced by Google at Google I/O 2024 as an open-source framework designed to bring production-ready AI tooling to full-stack developers, regardless of their cloud provider. At the time, the JavaScript/TypeScript ecosystem lacked a coherent story for building AI-powered features with the kind of developer ergonomics you'd expect from, say, a Next.js app. Firebase's team set out to fix that, building Genkit not as a proprietary Firebase product but as a cloud-agnostic SDK with first-class support for plugins. By mid-2024, Genkit had already attracted a community plugin ecosystem covering AWS Bedrock, Azure OpenAI, Ollama, Cohere, and a growing list of vector stores. The framework reached its 1.0 milestone in late 2024 and shipped major expansions in 2025, most notably adding Python (preview), Go, and Dart (preview) SDKs alongside the primary TypeScript runtime.
This multi-language vision is central to Genkit's story: it aspires to be the framework you reach for no matter what stack you're running. As of 2026, the Dart SDK has matured notably, making Genkit one of the very few AI frameworks with meaningful Flutter support and giving mobile developers a first-class path into generative AI that no other framework on this list can match. It is also worth noting that Genkit has an unofficial Java SDK, maintained by the community, which has been used in production but is not officially supported by the Genkit team.

The team's declared direction is to deepen Genkit's role as a full-stack AI layer: strong observability primitives baked into the runtime, composable workflow abstractions (flows), and an expanding model plugin ecosystem. The ambition is not just to be a bridge to a single model provider but to be the connective tissue that lets you swap providers, mix modalities, and trace every hop in your pipeline, all from one coherent API. Adding more capabilities to the Dev UI is also a major focus, with the goal of making it the best local development experience for AI applications, regardless of where they deploy.

What Makes Genkit Stand Out

Genkit occupies a unique position among the frameworks in this comparison: it is the only one that provides multiple levels of abstraction in a single, coherent API. You can call a model directly (vanilla generation), compose steps into a typed flow, or wire up a fully autonomous agent, and you can mix all three in the same application. Most other frameworks force you to choose a lane.
Supported languages: TypeScript/JavaScript (primary, stable), Python (preview), Go, Dart/Flutter (preview)

JavaScript

import { genkit } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });

// Vanilla generation — no abstraction needed
const { text } = await ai.generate({
  model: googleAI.model('gemini-flash-latest'),
  prompt: 'What is the capital of France?',
});

Flows — Composable, Typed Pipelines

Flows are Genkit's first-class pipeline primitive. They are strongly typed, observable end-to-end, and automatically traced in the Dev UI. You define them once and can invoke them from the CLI, HTTP, or the Dev UI without any extra scaffolding.

JavaScript

import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });

const summarizeFlow = ai.defineFlow(
  {
    name: 'summarizeArticle',
    inputSchema: z.object({ url: z.string().url() }),
    outputSchema: z.object({ summary: z.string(), keyPoints: z.array(z.string()) }),
  },
  async ({ url }) => {
    const { output } = await ai.generate({
      model: googleAI.model('gemini-flash-latest'),
      prompt: `Summarize the article at ${url} and list the key points.`,
      output: {
        schema: z.object({ summary: z.string(), keyPoints: z.array(z.string()) }),
      },
    });
    return output!;
  }
);

Agent Abstractions

For agents, Genkit uses definePrompt with tools and a system prompt to define specialized agents, along with tool calling via defineTool and conversation memory, all integrated with the same tracing and observability infrastructure that flows use. The agent model is deliberate: it gives you control over how much autonomy you hand over to the model.
JavaScript

import { genkit, z } from 'genkit';
import { googleAI } from '@genkit-ai/google-genai';

const ai = genkit({ plugins: [googleAI()] });

const weatherTool = ai.defineTool(
  {
    name: 'getWeather',
    description: 'Returns current weather conditions for a given city.',
    inputSchema: z.object({ city: z.string() }),
    outputSchema: z.object({ temperature: z.number(), condition: z.string() }),
  },
  async ({ city }) => {
    // Real implementation would call a weather API
    return { temperature: 22, condition: 'Sunny' };
  }
);

const travelAgent = ai.definePrompt({
  name: 'travelAdvisor',
  description: 'Travel Advisor can help with trip planning and weather-based advice',
  model: googleAI.model('gemini-flash-latest'),
  tools: [weatherTool],
  system: 'You are a helpful travel advisor. Use available tools to give accurate advice.',
});

// Start a chat session with the agent
const chat = ai.chat(travelAgent);
const response = await chat.send('Should I pack a jacket for my trip to Lisbon?');
console.log(response.text);

The Dev UI — Where Genkit Truly Shines

The Genkit Developer UI is, frankly, the killer feature. No other framework in this comparison comes close to what Genkit offers locally. You launch it with a single command:

Shell

npx genkit start

The Dev UI gives you:

• Flow runner – execute any flow with a custom input, inspect the typed output, and view the full execution trace.
• Model playground – invoke any registered model directly, tweak prompt templates, compare outputs.
• Tool testing – stub and test individual tools in isolation before wiring them into an agent.
• Trace explorer – every generate, flow, and agent call is traced with latency breakdowns, token counts, and the exact prompts and completions sent to the model.
This is OpenTelemetry-compatible telemetry, exportable to Cloud Trace, Langfuse, or any OTEL collector.

• Dotprompt editor – Genkit's .prompt files (Dotprompt) are editable live in the UI, with real-time preview and variable injection.
• Session replay – replay any traced session end-to-end to reproduce bugs without re-running the full application.

This local observability loop collapses what normally requires a deployed tracing backend (LangSmith, Langfuse, Weave) into a zero-config experience that runs entirely offline. For development speed, this is enormous. Vercel's Developer Tool, by comparison, is a lightweight panel primarily for inspecting HTTP streaming responses. It doesn't offer flow visualization, trace exploration, or tool testing. It's functional but basic, the kind of thing you'd expect as a starting point, not a full developer experience.

Broad Model Support — Provider Neutral by Design

Genkit ships official plugins for Google AI (Gemini), Google Vertex AI, OpenAI, Anthropic Claude, Cohere, Mistral, Ollama (local models), AWS Bedrock, and more. The community has extended this to xAI, DeepSeek, Perplexity, and Azure OpenAI. Every model, regardless of provider, is accessed through the same ai.generate() interface, and every call is automatically traced.
JavaScript

import { genkit } from 'genkit';
import { anthropic } from 'genkitx-anthropic';
import { openAI } from 'genkitx-openai';

const ai = genkit({ plugins: [anthropic(), openAI()] });

// Switch between providers without changing downstream code
const { text: claudeResponse } = await ai.generate({
  model: anthropic.model('claude-sonnet-4-5'),
  prompt: 'Explain transformer attention in one paragraph.',
});

const { text: gptResponse } = await ai.generate({
  model: openAI.model('gpt-4o'),
  prompt: 'Explain transformer attention in one paragraph.',
});

Pros and Cons

✅ Pros:
• Best-in-class Dev UI with local tracing and flow visualization
• Multiple abstraction levels: vanilla, flows, and agents
• Truly provider-neutral with broad plugin ecosystem
• Strong Flutter/Dart support for mobile AI
• Idiomatic TypeScript API
• Firebase, Cloud Run, or self-hosted deployment
• OpenTelemetry-compatible observability built in

❌ Cons:
• Dart/Python SDKs still in preview
• Smaller community than LangChain
• Some advanced patterns require deeper framework knowledge

Vercel AI SDK

History and Direction

The Vercel AI SDK was born out of a practical need: Vercel builds the infrastructure that powers a large portion of the modern web, and as developers started shipping AI features inside Next.js apps in 2023, the friction of integrating streaming LLM responses into React was painfully apparent. Vercel released the initial AI SDK as an open-source library to standardize streaming, provider integration, and UI hooks across its ecosystem. The SDK grew quickly, adding support for Vue, Svelte, SolidJS, and plain Node.js, but its DNA remains deeply tied to the Vercel and Next.js stack. Version 3 in 2024 introduced streamUI, which lets you stream React components as model output, a paradigm shift for building truly generative user interfaces. Version 4, shipping in late 2024, brought generateObject and streamObject with Zod schemas, structured output across all providers, and an expanded agent API.
By 2026, AI SDK v6 has established itself as the go-to choice for teams that live in the Vercel/React ecosystem and want the lowest-friction path from a prompt to a production UI. Vercel's direction is clear: deeper integration between AI, edge compute, and the frontend. The AI Gateway, launched in 2025, acts as a provider proxy with load balancing and fallback, another layer of lock-in dressed as a convenience. The SDK is intentionally lower-level than Genkit or Mastra, favoring simplicity and composability over opinionated abstractions.

What Makes the Vercel AI SDK Stand Out

The Vercel AI SDK's greatest strength is its seamless integration with React and the web UI layer. The useChat, useCompletion, and useObject hooks wire directly into streaming AI responses with built-in state management, loading indicators, and error boundaries. If you're building a Next.js app and want to add a chat interface or a streaming form, nothing gets you there faster.

Supported languages: TypeScript/JavaScript (primary). Node.js, React, Next.js, Nuxt, SvelteKit, SolidStart, Expo (React Native).

TypeScript

// app/api/chat/route.ts (Next.js App Router)
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = await streamText({
    model: openai('gpt-4o'),
    messages,
  });
  return result.toDataStreamResponse();
}

TypeScript

// app/page.tsx — chat UI with one hook
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}><b>{m.role}:</b> {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Say something..."
        />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Structured Generation and Agent Patterns

The SDK provides clean primitives for structured output and tool use, though the abstractions are deliberately minimal. You get generateText, streamText, generateObject, streamObject, and a simple maxSteps loop for agentic behavior. There is no high-level "flow" abstraction or graph; you compose these primitives yourself.

JavaScript

import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const { object } = await generateObject({
  model: openai('gpt-4o'),
  schema: z.object({
    recipe: z.object({
      name: z.string(),
      ingredients: z.array(z.object({ name: z.string(), amount: z.string() })),
      steps: z.array(z.string()),
    }),
  }),
  prompt: 'Generate a recipe for a vegan chocolate cake.',
});

Genkit vs. Vercel AI SDK — Abstraction Levels

Compared to Genkit, the Vercel AI SDK operates at a lower level of abstraction. This is by design; Vercel wants to give you sharp, composable tools, not an opinionated framework. The trade-off is that you assemble more boilerplate yourself. Want to trace a multi-step agent? Wire up OpenTelemetry manually. Want a typed pipeline? Build it yourself. Genkit bakes these in. Conversely, Vercel's deep UI integration (streaming RSC, useChat, generative UI patterns) is something Genkit does not attempt to own. For Flutter-based applications, Genkit's Dart SDK fills this role, but in the web domain, Vercel wins on integration depth.
Pros and Cons

✅ Pros:
• Unmatched React/Next.js/Edge integration
• Minimal API surface, easy to learn
• useChat / useCompletion hooks are best-in-class
• Generative UI with RSC streaming
• Broad provider support via official adapters
• Idiomatic TypeScript throughout

❌ Cons:
• Primarily TypeScript/JavaScript only
• No built-in flow or pipeline abstraction
• Developer Tool is basic (no trace explorer, no flow runner)
• Observability requires external tooling
• Deeper use cases accumulate boilerplate quickly
• Vercel-ecosystem bias (AI Gateway, templates)

Mastra

History and Direction

Mastra is the youngest framework in this comparison, founded in 2024 by members of the team behind Gatsby, including Sam Bhagwat. Coming from a background of developer experience, tooling, and static-site generation, Mastra's founders approached AI framework design with a strong bias toward TypeScript ergonomics, workflow-first thinking, and integrated tooling.

Mastra reached public beta in late 2024 and gained significant traction in early 2025 among TypeScript developers frustrated with LangChain's Python-ported patterns. The framework's distinctive feature, a built-in Studio UI, arrived in early 2025 and quickly became its marquee differentiator. Mastra Studio is a web-based visual interface for defining, testing, and running agents and workflows, accessible locally or in the cloud. By mid-2025, Mastra had secured seed funding and announced hosted cloud infrastructure for deploying Mastra agents directly from the Studio.

Mastra's direction is firmly in the TypeScript/JavaScript ecosystem. The team has shown no signs of pursuing multi-language support; instead, they are doubling down on deep integrations with popular TypeScript meta-frameworks like Next.js, Astro, SvelteKit, and Hono.
Think of Mastra as the opinionated, batteries-included agent framework for TypeScript developers who want to spin up production agents as fast as possible, without writing any platform glue.

What Makes Mastra Stand Out

Mastra is purpose-built for one thing: spinning up agents fast. It is an agent-only framework; you will not find vanilla model calls or a "flow" primitive. Everything in Mastra is modeled around agents, tools, memory, and workflows. If you know exactly what you need (an agent with memory and tool access), Mastra gets you there in fewer lines of code than any other framework here.

Supported languages: TypeScript/JavaScript exclusively. Integrations with Next.js, Astro, SvelteKit, Hono, Express.

JavaScript

import { Mastra, Agent } from '@mastra/core';
import { openai } from '@mastra/openai';

const researchAgent = new Agent({
  name: 'researcher',
  model: openai('gpt-4o'),
  instructions: `You are a research assistant. Find relevant information,
synthesize key points, and present clear, well-structured summaries.`,
  tools: {
    // Tools added here
  },
});

const mastra = new Mastra({ agents: { researchAgent } });

const response = await mastra.getAgent('researcher').generate([
  { role: 'user', content: 'Summarize the latest developments in quantum computing.' },
]);
console.log(response.text);

Workflows

Mastra's workflow primitive lets you chain agent steps into typed, directed graphs, useful when you need a mix of deterministic logic and LLM reasoning.
JavaScript

import { Workflow, Step } from '@mastra/core';
import { z } from 'zod';

const contentPipeline = new Workflow({
  name: 'contentPipeline',
  triggerSchema: z.object({ topic: z.string() }),
});

contentPipeline
  .step({
    id: 'research',
    execute: async ({ context }) => {
      const { topic } = context.triggerData;
      // Agent call to research the topic
      return { research: `Key facts about ${topic}` };
    },
  })
  .then({
    id: 'draft',
    execute: async ({ context }) => {
      const { research } = context.getStepResult('research');
      // Agent call to draft the article
      return { draft: `Article draft using: ${research}` };
    },
  })
  .commit();

Pros and Cons

✅ Pros:
• Fastest path to a production-ready agent in TypeScript
• Excellent Studio UI for visual workflow building
• Idiomatic TypeScript API with strong type inference
• Good memory and tool-calling primitives
• Integrates well with popular JS meta-frameworks

❌ Cons:
• Agent-only: no flows, no vanilla generation primitives
• TypeScript/JavaScript only
• Younger ecosystem, fewer plugins
• Observability still maturing
• No mobile/cross-platform story

LangChain

History and Direction

LangChain is, by a significant margin, the most widely used AI framework in the world, but its story is complicated. Harrison Chase created LangChain in October 2022 as a Python library for chaining LLM calls, and it spread virally through the developer community in early 2023 as everyone scrambled to experiment with GPT-3 and GPT-4. Its key insight, that useful AI applications require structured chains of calls, retrieval augmentation, and tool integration, was correct and arrived at the right moment. GitHub stars and npm downloads shot to the top of every chart. The JavaScript port, langchain on npm, arrived shortly after and has tracked the Python library closely in both API design and feature parity. This is the source of one of LangChain's most persistent criticisms: the JavaScript SDK feels like Python idioms force-translated into TypeScript.
Patterns like BaseChain, runnable pipelines with .pipe(), and LCEL (the LangChain Expression Language) make perfect sense coming from Python's compositional patterns but feel unnatural to TypeScript developers accustomed to async/await and module-based composition.

LangChain, the company, raised $35M in 2023 and has since built a growing platform around LangSmith (observability and evaluation) and LangGraph (graph-based orchestration). This is where the tension lies: LangChain's open-source SDK and LangSmith are designed to complement each other. Getting the best observability experience requires using LangSmith. While you can configure other backends, the seamless experience is on their platform. The framework is excellent and featureful, but its commercial direction is unmistakably pointed toward LangSmith adoption.

In 2025, LangChain reorganized its JavaScript library around a cleaner agent API (createAgent) and introduced Deep Agents, pre-built agent implementations with built-in context compression and subagent spawning. LangGraph remains the recommended framework for complex multi-step workflows, and LangSmith continues to be the best-in-class platform for production LLM observability.

LangChain's Position: Agent-First, Platform-Tied

LangChain is squarely an agent framework. Its sweet spot is spinning up capable agents quickly, particularly for teams coming from the Python AI ecosystem who want to move to or stay in JavaScript without losing the LangChain mental model. It is the most feature-complete framework here in terms of raw agent capabilities, RAG patterns, and integrations, but that breadth comes with complexity.

Supported languages: Python (primary, feature-complete), JavaScript/TypeScript (JS port, near-parity). Note: the JS SDK carries Python-style patterns.
JavaScript

import { createAgent } from 'langchain/agents';
import { ChatOpenAI } from '@langchain/openai';

function getWeather(city: string): string {
  // Real implementation would call a weather API
  return `It's always sunny in ${city}!`;
}

const model = new ChatOpenAI({ model: 'gpt-4o', temperature: 0 });

const agent = createAgent({
  model,
  tools: [
    {
      name: 'get_weather',
      description: 'Get weather for a given city.',
      func: getWeather,
    },
  ],
  systemPrompt: 'You are a helpful assistant.',
});

const result = await agent.invoke({
  messages: [{ role: 'user', content: 'What is the weather in Madrid?' }],
});
console.log(result.messages.at(-1)?.content);

LangSmith Observability

LangSmith is LangChain's answer to the observability problem. It provides trace visualization, dataset management, prompt versioning, and LLM evaluation, all polished and production-grade. The integration with LangChain is seamless: set LANGSMITH_TRACING=true and every run is captured automatically. The catch is that LangSmith is a SaaS platform. Genkit's Dev UI provides comparable local observability with zero cloud dependency. If you need hosted, team-scale observability, LangSmith is arguably the best option on the market. If you need local, zero-config development tracing, Genkit wins.
Pros and Cons

✅ Pros:
• Largest community and integration ecosystem
• LangSmith is best-in-class for production observability
• Feature-complete agent, RAG, and chain primitives
• Excellent Python SDK for Python teams
• Deep Agents provide batteries-included patterns
• LangGraph for advanced workflow orchestration

❌ Cons:
• JavaScript SDK feels like Python ported to TS
• Tight coupling to LangSmith for full observability
• Complex API surface, steep learning curve
• LangGraph required for complex graph workflows
• Heavy bundle size in browser/edge environments
• Commercial platform pressure

Google ADK (Agent Development Kit)

History and Direction

Google ADK was announced at Google Cloud Next 2024 as Google's opinionated take on a production-grade agent framework, specifically targeting enterprise deployments on Google Cloud. Unlike Genkit, which is cloud-agnostic and full-stack, ADK was designed from day one around Vertex AI and Google Cloud's agent infrastructure, including Agent Engine, Cloud Run, and GKE. It is the framework Google recommends when you're building agents that will live in a Google Cloud environment at scale.

ADK's initial release was Python-only, which told the story clearly: this was a framework for the enterprise Python AI developer — data scientists, ML engineers, and cloud architects who think in agents and workflows and are already committed to Google Cloud. The TypeScript, Go, and Java SDKs followed in 2025, with ADK Go 1.0 and ADK Java 1.0 shipping in early 2026. This multi-language expansion signals that Google is positioning ADK as more than a Python script runner; it wants to be the enterprise agent runtime for any Google Cloud workload. ADK 2.0, released in 2026, brought significant refinements: graph-based workflow APIs, a visual Web UI builder, enhanced evaluation tooling (including user simulation and environment simulation for testing agents end-to-end), and deeper A2A (Agent-to-Agent) protocol support.
The A2A protocol is an open standard that allows ADK agents to communicate with agents built on other frameworks, a meaningful interoperability effort in a fragmented ecosystem.

Google's direction with ADK is unmistakable: this is enterprise AI infrastructure for Google Cloud customers. If your organization runs on GCP and needs reliable, scalable, observable agent deployments with enterprise support, ADK is Google's answer. If you need to be cloud-agnostic, look elsewhere.

ADK's Position: Agent-First, Enterprise-Grade

Like LangChain and Mastra, ADK is an agent-only framework; its reason for existing is to make building, evaluating, and deploying agents fast and reliable. Unlike Mastra (which targets indie developers and startups), ADK is purpose-built for enterprise scenarios: multi-agent systems, graph-based orchestration, agent evaluation at scale, and deployment to Google's managed infrastructure.

Supported languages: Python (primary, feature-complete), TypeScript/JavaScript, Go, Java. Note: the API design and documentation are heavily Python-first; TypeScript and other SDKs track but sometimes lag the Python feature set.

Python

```python
# Python — ADK's primary language
from google.adk import Agent
from google.adk.tools import google_search

research_agent = Agent(
    name="researcher",
    model="gemini-flash-latest",
    instruction="You help users research topics thoroughly and accurately.",
    tools=[google_search],
)

# Run locally
result = research_agent.run("What are the latest developments in fusion energy?")
print(result.text)
```

TypeScript

```typescript
// TypeScript ADK
import { Agent } from '@google/adk';
import { googleSearch } from '@google/adk/tools';

const researchAgent = new Agent({
  name: 'researcher',
  model: 'gemini-flash-latest',
  instruction: 'You help users research topics thoroughly and accurately.',
  tools: [googleSearch],
});

const result = await researchAgent.run(
  'What are the latest developments in fusion energy?'
);
console.log(result.text);
```

Multi-Agent Systems

ADK's multi-agent support is one of its strongest features. You can compose agents hierarchically, assign them different models, and let them collaborate via the A2A protocol.

Python

```python
from google.adk import Agent
from google.adk.agents import SequentialAgent, ParallelAgent

researcher = Agent(name="researcher", model="gemini-flash-latest",
                   instruction="Research the topic.")
writer = Agent(name="writer", model="gemini-pro-latest",
               instruction="Write a clear article from the research.")
editor = Agent(name="editor", model="gemini-flash-latest",
               instruction="Polish and format the article.")

content_pipeline = SequentialAgent(
    name="contentPipeline",
    agents=[researcher, writer, editor],
)
```

Vertex AI Lock-In

ADK's evaluation, deployment, and production observability features lean heavily on Vertex AI Agent Engine, Cloud Trace, and Google's managed infrastructure. You can run ADK locally and even deploy to Cloud Run or GKE independently, but to get the full ADK experience, including agent evaluation, performance dashboards, and managed scaling, you're on Google Cloud.

This is similar to how LangSmith is the intended observability backend for LangChain: technically optional, practically expected. Frameworks like Genkit, Vercel AI SDK, and Mastra were designed from the ground up to be cloud-neutral. ADK and LangChain, by contrast, have strong ecosystem gravity toward their respective platforms.
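Under the hood, the SequentialAgent composition shown above is a simple contract: each agent's output becomes the next agent's input. A framework-free sketch of that pattern in plain Python (stub functions stand in for the LLM-backed agents; no google.adk involved, so this is illustrative only):

```python
from typing import Callable, List

# A "step agent" here is just a function from input text to output text.
StepAgent = Callable[[str], str]

def sequential_agent(agents: List[StepAgent]) -> StepAgent:
    """Compose agents so each one receives the previous agent's output."""
    def pipeline(task: str) -> str:
        result = task
        for agent in agents:
            result = agent(result)
        return result
    return pipeline

# Stand-ins for the researcher/writer/editor agents above.
def researcher(topic: str) -> str:
    return f"research notes on {topic}"

def writer(notes: str) -> str:
    return f"article based on {notes}"

def editor(draft: str) -> str:
    return f"polished {draft}"

content_pipeline = sequential_agent([researcher, writer, editor])
print(content_pipeline("fusion energy"))
# → polished article based on research notes on fusion energy
```

The real framework adds model routing, state, and observability on top, but the data flow between sequential agents is exactly this threading of outputs into inputs.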
Pros and Cons

| ✅ Pros | ❌ Cons |
|---|---|
| Enterprise-grade agent infrastructure | Strongly tied to Vertex AI and Google Cloud |
| Multi-language: Python, TypeScript, Go, Java | Python-first: TS/Go/Java APIs lag in features |
| Best-in-class multi-agent and A2A support | Brings Python coding patterns to JS developers |
| Graph-based workflows and evaluation tools | Less suitable for cloud-agnostic deployments |
| Direct integration with Google Search, Vertex Search | Heavier setup and operational complexity |
| Agent evaluation with user simulation | Not a full-stack framework (agent-only) |

Head-to-Head Comparison

Developer Experience

| Framework | DX Highlights | Shortcomings |
|---|---|---|
| Genkit | Dev UI is unparalleled for local debugging. Idiomatic TypeScript. Multi-level abstractions. | Less prescriptive, more choices to make upfront |
| Vercel AI SDK | Frictionless React/Next.js integration. Minimal API. | You assemble the boilerplate for complex scenarios |
| Mastra | Fastest path to a working agent. Great Studio UI. | Agent-only, JS-only |
| LangChain | Vast documentation and community. Battle-tested patterns. | Python idioms in TypeScript, complex API |
| ADK | Powerful multi-agent tooling. Strong eval story. | GCP-centric, Python-first |

Abstraction Levels

Genkit is the only framework that gives you all three levels in one SDK: vanilla generation, typed flows (pipelines), and agents. Vercel AI SDK lives at the lower end; it gives you clean generation and tool-calling primitives but no flow abstraction. Mastra, LangChain, and ADK are agent frameworks: they optimize for spinning up agents quickly but don't offer a coherent story for when you just want to generate text or structure a pipeline without agent autonomy.
Observability

| Framework | Local Dev Observability | Production Observability |
|---|---|---|
| Genkit | Built-in Dev UI, trace explorer, Dotprompt editor | OTEL-compatible, Cloud Trace, Langfuse |
| Vercel AI SDK | Basic Developer Panel | OTEL, Vercel Observability (platform-tied) |
| Mastra | Studio UI for workflows | Still maturing |
| LangChain | Minimal without LangSmith | LangSmith (best-in-class, SaaS) |
| ADK | ADK Web UI | Cloud Trace + Vertex (GCP-tied) |

Language Support

| Framework | Primary | Additional |
|---|---|---|
| Genkit | TypeScript | Python (preview), Go, Dart/Flutter (preview), Java (unofficial) |
| Vercel AI SDK | TypeScript | Node.js runtimes, Edge |
| Mastra | TypeScript | JS runtimes only |
| LangChain | Python | TypeScript (near-parity, Python idioms) |
| ADK | Python | TypeScript, Go, Java |

Framework Neutrality

Genkit, Vercel AI SDK, and Mastra were built from the ground up to be provider-neutral. They support OpenAI, Anthropic, Google, and others through a unified API, and they deploy to any infrastructure. LangChain and ADK are platform-influenced. LangChain's full power unlocks with LangSmith; ADK's full power unlocks on Google Cloud. This is not a dealbreaker; both platforms are excellent, but it is an architectural commitment you should make consciously.

Idiom and Code Style

Genkit, Mastra, and Vercel AI SDK feel natively TypeScript: async/await everywhere, Zod schemas for validation, module-based composition, and no runtime class inheritance chains to navigate. LangChain and ADK's TypeScript SDKs carry the weight of their Python origins. You'll find class-heavy APIs, .pipe() chains, and patterns that feel natural if you've written LangChain Python but unfamiliar if you're coming from the TypeScript world. This is not a quality judgment; it's a cultural fit question.

Which Framework Should You Choose?
After building with all five, here's my honest take:

Choose Genkit if:

• You want to iterate on your AI fast and get feedback with less back and forth; Genkit was built from the ground up for powerful local tooling and observability.
• You need to mix vanilla generation, typed pipelines (flows), and agents in the same app.
• Provider neutrality is important now or likely to be important later.
• You're building a Flutter/Dart mobile app and need AI capabilities.
• You want OpenTelemetry-compatible tracing without configuring a separate backend.

Choose Vercel AI SDK if:

• You're building a React/Next.js app and want the lowest-friction path to streaming AI UI.
• Simplicity and minimal API surface matter more than built-in abstractions.
• You're already on the Vercel platform and want native integration.
• Your use case maps well to the UI hooks (useChat, useCompletion, generative UI).

Choose Mastra if:

• You're a TypeScript developer who wants to spin up a production agent as fast as possible.
• You want a clean, idiomatic TypeScript agent API without Python-ported patterns.
• The visual Studio UI for workflow design appeals to your team.
• You're building in the Next.js/SvelteKit/Hono ecosystem.

Choose LangChain if:

• Your team is coming from the Python AI ecosystem and wants cross-language continuity.
• You need the broadest possible integration ecosystem (the most integrations of any framework).
• You're investing in LangSmith for production observability and want a cohesive platform.
• LangGraph's graph-based orchestration matches your workflow complexity.

Choose ADK if:

• You're building enterprise-grade multi-agent systems on Google Cloud.
• Vertex AI's infrastructure (Agent Engine, Cloud Trace, Vertex Search) is already in your stack.
• You need battle-tested multi-language support, including Go and Java.
• Agent evaluation at scale (user simulation, custom metrics) is a core requirement.

Conclusion

The Generative AI framework landscape in 2026 is not a winner-take-all market.
Each of the five frameworks covered here has a legitimate use case, a growing community, and an active development team. If I had to crown one framework as the most versatile choice for teams that haven’t already committed to a cloud platform, it would be Genkit. Its combination of multi-level abstractions, provider neutrality, and, above all, the Developer UI creates a development experience that genuinely accelerates iteration. The fact that it is expanding to Dart/Flutter, Python, and Go while keeping its TypeScript SDK as the best-in-class experience is a sign of a team thinking about the long game. That said, none of these frameworks is going away. LangChain’s ecosystem depth, ADK’s enterprise footprint, Vercel’s UI ergonomics, and Mastra’s TypeScript-native speed all serve real needs. The most important thing is to make the choice deliberately, understanding what you’re trading when you pick a platform-tied framework, and what you’re gaining when you pick a more opinionated one. Happy building. Last updated: April 2026. Framework versions referenced: Genkit 1.x, Vercel AI SDK 6.x, Mastra 0.x (latest), LangChain JS 0.3.x, Google ADK 2.0.

By Xavier Portilla Edo DZone Core CORE
Why PostgreSQL CDC Breaks in Production

Keeping two PostgreSQL databases in sync sounds simple. Until it isn't.

At first, everything looks fine:

• Logical replication is enabled
• Changes are flowing
• The target database looks current

Then, a few days later, something is off:

• Rows are missing
• Some updates appear twice
• Replication lag jumps for no obvious reason
• A small schema change breaks the pipeline
• Restarting the job does not clearly continue from the right place

Now the problem is no longer "how do I stream changes from PostgreSQL?" The problem is proving that the target database is still correct.

That is where most PostgreSQL CDC guides stop being useful. They explain how to enable logical replication. They explain replication slots, publications, WAL, maybe Debezium. That is useful. But production CDC usually breaks somewhere else: in the handoff between initial load and CDC, in retries, in checkpoints, in ordering, and in the recovery paths nobody tests until something fails.

The Promise of PostgreSQL CDC

Change data capture exists for a good reason. Instead of repeatedly querying whole tables or running batch exports, PostgreSQL can stream committed changes from its write-ahead log. In theory, that gives you:

• Near real-time replication
• Less load on the source database
• No polling loops
• No full reload after every change

And yes, that part works. WAL is reliable. Logical replication is mature. PostgreSQL can tell you what changed and in what order. The hard part starts after that. Because WAL is only one part of the system. Once changes leave PostgreSQL, they usually pass through readers, queues, workers, retry logic, target writes, checkpoints, and monitoring. That is where things get interesting. Also annoying.

Where PostgreSQL CDC Actually Breaks

A simplified CDC pipeline often looks like this:

```
PostgreSQL WAL
      |
      v
  CDC reader
      |
      v
Queue / buffer
      |
      v
   Workers
    |    |
    v    v
 Target  Retry / failure handling
```

PostgreSQL WAL is ordered. Your pipeline may not be.
The moment you add queues, parallel workers, retries, and target writes, correctness becomes your responsibility. Not just throughput. Not just "events per second." Correctness.

That means answering questions like:

• Did the target receive every committed change?
• Were updates applied in the right order?
• Did a retry apply the same change twice?
• Was the checkpoint saved before or after the target write?
• What happens if the job stops halfway through a large initial load?
• What happens if CDC starts after the snapshot, but not from the snapshot boundary?

Those are the questions that decide whether CDC is reliable in production.

1. Initial Load Is Not CDC

CDC starts from a point in time. It does not recreate everything that already existed before that point. So if the target is empty, you first need a baseline copy of the existing data. Usually, this is called an initial load or snapshot. The hard part is what happens next.

If CDC starts from the wrong position, the target may miss rows, replay rows, or apply stale updates. This is the snapshot gap. It appears when initial load and CDC are run as separate steps without a shared WAL boundary.

"snapshot + CDC" is two separate operations. "snapshot → CDC" is a controlled handoff. That is the difference between "snapshot + CDC" and a continuous snapshot → CDC flow.

For PostgreSQL, the safe version means:

• Read a consistent snapshot
• Record the matching WAL position
• Start CDC from that position

Anything else is guessing.

2. WAL Is Ordered, But Workers Can Break Ordering

PostgreSQL emits changes in commit order. That does not mean your target receives them in a safe order. Once a pipeline introduces parallelism, ordering can break. For example:

```
Event 1: update customer id = 42
Event 2: delete customer id = 42
```

If these events are processed by different workers, the delete might reach the target first.
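A common mitigation for this update/delete hazard is to partition events by primary key, so every change to the same row is handled by the same worker in arrival (WAL) order. A minimal sketch, assuming a fixed pool of worker queues and not tied to any particular queue or CDC library:

```python
from collections import defaultdict

def partition_for(key, n_workers):
    """Stable routing: the same row key always maps to the same worker."""
    return hash(key) % n_workers

def route(events, n_workers=4):
    """Fan events out to worker queues while keeping per-key order intact."""
    queues = defaultdict(list)
    for event in events:  # events arrive in WAL commit order
        queues[partition_for(event["key"], n_workers)].append(event)
    return queues

events = [
    {"lsn": 1, "key": 42, "op": "update"},  # update customer id = 42
    {"lsn": 2, "key": 42, "op": "delete"},  # delete customer id = 42
]
queues = route(events)

# Both events share key 42, so they land in the same queue in WAL order:
# a worker can never apply the delete before the update.
assert [e["op"] for q in queues.values() for e in q] == ["update", "delete"]
```

Note that this preserves order within a key but says nothing about ordering across keys or tables, which is why cross-row dependencies still need care.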
Or this:

```
Event 1: insert parent row
Event 2: insert child row
```

If the child insert arrives first, the target may reject it because the parent row does not exist yet.

The usual answer is controlled parallelism:

• Preserve ordering per table where needed
• Avoid unsafe reordering inside the same key/table stream
• Batch carefully
• Retry without changing event order

Parallelism is not bad. Blind parallelism is bad. CDC systems usually fail here when they optimize for speed before defining ordering guarantees.

3. At-Least-Once Delivery Means Duplicates Are Normal

Most CDC pipelines are at-least-once. That means an event may be delivered more than once. This is not automatically a bug. It is normal recovery behavior. If the pipeline writes to the target, then crashes before saving its checkpoint, it may replay the same event after restart. That is why target writes must be idempotent.

For database-to-database replication, this usually means:

• Inserts should behave like upserts when possible
• Updates should be safe to apply more than once
• Deletes should not fail the whole stream if the row is already gone
• Primary keys matter

If the target write logic is not idempotent, retries can silently corrupt data. For example:

• Duplicate inserts
• Counters incremented twice
• Audit rows repeated
• Append-only targets growing incorrect history

CDC without idempotent target writes is fragile. It may work in a demo. Production will eventually find the retry path. Production has a talent for that.

4. Checkpoints Must Be Commit-Aware

Checkpointing sounds simple: save the last processed WAL position. But the timing matters. If the checkpoint is saved too early, the failure path looks like this:

1. CDC reads an event at LSN X.
2. Checkpoint advances to X.
3. Target write fails.
4. Process crashes before the failure is handled.
5. Restart begins after X, so the event is skipped.

The system now believes the event was delivered. But it was never written to the target. That is silent data loss.
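This failure path is easy to reproduce in miniature. The toy simulation below (a sketch, not any real CDC tool) crashes mid-event, resumes from the saved checkpoint, and compares checkpoint-before-write against write-before-checkpoint with an idempotent target:

```python
def run(events, checkpoint_first, crash_at_lsn):
    """Stream events into a dict target, crash once mid-event, then resume."""
    target, checkpoint = {}, 0

    def process(start, crash):
        nonlocal checkpoint
        for ev in events:
            if ev["lsn"] <= start:  # already checkpointed: skipped on restart
                continue
            if checkpoint_first:
                checkpoint = ev["lsn"]            # unsafe: checkpoint advances early
                if crash and ev["lsn"] == crash_at_lsn:
                    raise RuntimeError("crash before target write")
                target[ev["key"]] = ev["value"]
            else:
                target[ev["key"]] = ev["value"]   # idempotent: reapplying is harmless
                if crash and ev["lsn"] == crash_at_lsn:
                    raise RuntimeError("crash before checkpoint")
                checkpoint = ev["lsn"]

    try:
        process(start=0, crash=True)         # first attempt crashes at crash_at_lsn
    except RuntimeError:
        pass
    process(start=checkpoint, crash=False)   # restart from the durable checkpoint
    return target

events = [{"lsn": 1, "key": "a", "value": 1},
          {"lsn": 2, "key": "b", "value": 2}]

# Checkpoint before write: the crashed event is skipped on restart (silent loss).
print(run(events, checkpoint_first=True, crash_at_lsn=2))   # {'a': 1}
# Write before checkpoint: the event is replayed and safely reapplied.
print(run(events, checkpoint_first=False, crash_at_lsn=2))  # {'a': 1, 'b': 2}
```

With the unsafe order, the restart skips the crashed event entirely; with the safe order, the event is replayed and the idempotent write absorbs the duplicate.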
The safe order is:

1. Read event
2. Write to target
3. Wait for target ACK
4. Advance checkpoint

This way, if the process crashes after reading but before committing to the target, the event can be replayed. That may create duplicates if target writes are not idempotent, but it avoids skipping committed source changes. This is the usual tradeoff:

• Checkpoint too early → data loss
• Checkpoint after commit → possible replay
• Replay + idempotency → safe recovery

A reliable CDC system must choose the boring, safe option.

5. Schema Changes Are Not Free

CDC captures data changes. It does not magically solve schema evolution. In real systems, someone eventually:

• Adds a column
• Changes a type
• Renames a table
• Drops a column
• Changes a default
• Modifies a constraint

Then the CDC pipeline has to answer:

• Does the target schema already have this column?
• Can this type be mapped safely?
• Should the pipeline stop or continue?
• What happens to old events?
• What happens to in-flight batches?

Some platforms try to automate schema evolution. That can be useful, especially in analytics pipelines. But for database-to-database replication, automatic schema changes can also be dangerous. A production target is not always just a passive copy. It may have constraints, indexes, permissions, triggers, or application dependencies.

The safest practical answer is usually:

• Detect schema mismatch clearly
• Fail loudly when target writes are unsafe
• Let operators coordinate schema changes
• Do not silently invent a broken target schema

A CDC pipeline that keeps running incorrectly is worse than one that stops. At least a stopped pipeline is honest.

6. Long Transactions Create Hidden Lag

CDC lag is not always caused by slow networking or slow consumers. Sometimes the source transaction itself is the problem. PostgreSQL changes become safe to replicate only after the transaction commits. So a large transaction can look quiet for a while, then suddenly release a huge batch of changes at once.
```sql
BEGIN;

UPDATE orders
SET status = 'archived'
WHERE created_at < '2024-01-01';
-- this runs for several minutes
-- CDC cannot treat these row changes as final yet

COMMIT;
```

While the transaction is open, downstream replication may appear to be stuck or falling behind. After COMMIT, all those changes become visible together. Result:

• Replication lag jumps
• Target writes arrive in a burst
• Workers suddenly have a backlog
• Monitoring graphs look haunted

This is common during bulk updates, maintenance jobs, large imports, or application code that keeps transactions open too long. CDC cannot remove this behavior. It can only process the changes once PostgreSQL makes them committed and visible.

7. Restarts Are Where Fake Reliability Gets Exposed

A CDC pipeline that works while everything is healthy is not enough. The real test is what happens after:

• Service restart
• Database disconnect
• Target write failure
• Process crash
• Machine reboot
• Operator pressing "Stop"

Restart behavior must be explicit. The system should know:

• The last durable source position
• Whether the target write was acknowledged
• Whether the initial load had completed
• Whether CDC handoff had happened
• Whether a partially loaded table can continue safely

If those states are not stored durably, restart becomes guesswork. And guesswork is not a recovery strategy.

Treat CDC as a Workflow, Not Just a Stream

Most real database movement work does not start with CDC. It starts with questions:

• Which tables should be copied?
• How large are they?
• Does the target schema match?
• Can the existing data be loaded safely?
• Where is the handoff point between the initial load and CDC?
• How do we validate the initial load?
• How do we keep validating after the CDC starts?
• How do we recover if something stops?

But many setups split those steps across tools:

```
SQL client → export → scripts → pipeline → CDC → validation
```

Each tool may be fine on its own.
The problems live in the gaps:

• Assumptions are lost
• State is not shared
• Validation becomes manual
• Handoff points are unclear
• Restart behavior is inconsistent

That is why CDC should be treated as part of the full data movement workflow:

```
explore → load → validate → replicate → keep validating
```

Not because workflows look nice on a diagram, but because the correctness problems happen between those steps.

What a Reliable CDC System Needs

A production CDC system should handle failure and recovery paths deliberately:

• Pure CDC without initial load: target writes must be idempotent, and checkpoints must be durable.
• Initial load → CDC: CDC must start from the snapshot boundary.
• Restart after stop: checkpoints must advance only after successful target writes.
• Interrupted large load: the system must know what was already copied.
• Delayed CDC after snapshot: the system must not start blindly from "now."
• Schema mismatch: the system should fail clearly, not silently corrupt data.

These scenarios look boring. They are also where many CDC implementations fail. Not because PostgreSQL is unreliable. Because the workflow around PostgreSQL CDC is incomplete.

When Debezium + Kafka Is the Right Answer

A common production CDC architecture looks like this:

```
PostgreSQL → Debezium → Kafka → consumer → target database
```

This can be the right architecture. Especially when you need:

• Kafka as the central event backbone
• Multiple independent consumers
• Event-driven services
• Very high throughput
• Existing Kafka operations

Debezium is a serious CDC tool. Kafka is not the villain. The problem starts when the architecture is much bigger than the job. If the goal is simply PostgreSQL → PostgreSQL replication, the stack becomes a distributed system around a relatively direct task.
Now every issue has several possible owners:

• PostgreSQL WAL
• Replication slot state
• Debezium connector config
• Kafka topic lag
• Consumer retry logic
• Target database writes
• Schema handling between all of the above

When lag appears, where is it? When a row is missing, who skipped it? When an event replays, was it Debezium, Kafka, the consumer, or the checkpoint? When the target schema changes, which layer owns the fix?

Nothing here is impossible. But it changes the problem. You are no longer just moving data. You are operating a multi-component CDC platform. That may be worth it. But it should be a conscious tradeoff, not the default answer for every sync job. Kafka did not break. The architecture became heavier than the job required.

Where DBConvert Streams Fits

DBConvert Streams keeps the risky parts of PostgreSQL CDC in one workflow:

```
load → handoff → replicate → resume → validate
```

That does not remove the hard parts of CDC. It makes them explicit. Instead of stitching together a snapshot job, a CDC process, retry logic, checkpoints, and validation queries by hand, the workflow is visible in one place.

What Changed in DBConvert Streams 2.1

DBConvert Streams 2.1 focuses on several of these recovery paths:

• Initial load → CDC now hands off automatically from a saved position.
• CDC resumes from the last durable checkpoint after Stop or restart.
• Eligible large load runs can continue from saved progress instead of starting again from zero.
• Schema changes are still not handled automatically and need coordination.

These are workflow changes, not new WAL magic. That is the point.

What DBConvert Streams Does Not Solve Automatically

DBConvert Streams 2.1 does not automatically handle:

• Schema evolution
• Exactly-once delivery across source, pipeline, and target
• Lag caused by long PostgreSQL transactions
• Target repair after manual changes or divergence

These are still operational boundaries.

When CDC Is the Wrong Solution

CDC is not always the answer.
Use something simpler if:

• Data changes rarely
• Latency does not matter
• A nightly reload is acceptable
• The target can be rebuilt cheaply
• Correctness matters more than freshness

Batch jobs are boring. But boring is not an insult. Boring systems often fail in predictable ways. A full reload that takes 10 minutes and is easy to verify may be better than a CDC pipeline nobody fully understands.

CDC is worth it when freshness matters and the source cannot be repeatedly reloaded. Otherwise, do not add moving parts just to feel enterprise. PostgreSQL will not be impressed. Whatever you choose, the important part is that the tradeoffs are explicit, not hidden behind a "CDC just works" promise.

Final Thought

PostgreSQL CDC is not hard because WAL is unreliable. It is hard because a real CDC system has state:

• Snapshot state
• WAL position
• Checkpoint state
• Target commit state
• Retry state
• Schema state

If that state is implicit, CDC breaks in strange ways. If it is explicit, CDC becomes boring. And boring is exactly what production replication should be.

DBConvert Streams 2.1 handles this as one controlled workflow: initial load, CDC handoff, checkpointing, resume, and monitoring. See: Log-based CDC for MySQL and PostgreSQL

By Dmitry Narizhnykh DZone Core CORE
