DZone's Featured AI/ML Resources

GraphRAG in Practice Using Spring AI, Neo4j, and Goodreads Data

By Akmal Chaudhri

Large language models (LLMs) are impressive until they are not. If you ask one about your internal data, your product catalog, or your users' reviews, it will either hallucinate an answer or admit it does not know. The solution most teams reach for is retrieval-augmented generation (RAG), which retrieves relevant data first, injects it into the prompt as context, and lets the model answer from that context rather than from memory.

GraphRAG takes this a step further. Instead of retrieving only text chunks, it follows relationships between entities in a graph to build richer, more structured context. The resulting answers can be grounded in both the data and the relationships within that data.

In this article, we'll walk through a practical GraphRAG implementation using Spring AI and Neo4j, built on top of a Goodreads book and review dataset. We'll cover the data model, loading the data, setting up the vector index, running the Spring Boot application, and some lessons learned along the way. The full source code is available on GitHub.

What We Are Building

The application answers natural language queries like "find books with a happy ending" or "something encouraging" by combining two retrieval mechanisms in Neo4j:

Vector search: embeds the search phrase via OpenAI and finds semantically similar book reviews using cosine similarity.
Graph traversal: follows the WRITTEN_FOR relationship from matched reviews to their associated books, giving the LLM structured book context rather than raw review text.

This example uses a simple GraphRAG pattern where vector search identifies relevant reviews and graph traversal expands the retrieved context to connected books. The LLM then summarizes the retrieved books in the context of the original search phrase. The architecture looks like Figure 1.

Figure 1. Architecture.
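The graph-traversal half of this pattern can be sketched in memory, without a database. The following is a minimal illustration, not the application's code: the types ReviewHit and Book, the function expandToBooks, and the sample data are all hypothetical, and the WRITTEN_FOR relationship is modeled as a plain map from review ID to book ID. It takes scored review hits (as a vector search would return them), follows each hit to its book, and returns each book once, best-matching review first.

```typescript
// Hypothetical in-memory stand-ins for what Neo4j stores as nodes and relationships.
interface ReviewHit {
  reviewId: string;
  score: number; // similarity score produced by the vector search step
}

interface Book {
  bookId: string;
  title: string;
}

// WRITTEN_FOR relationships, modeled here as reviewId -> bookId.
type WrittenFor = Map<string, string>;

// Expand review hits to their connected books, highest-scoring review first,
// returning each book only once.
function expandToBooks(
  hits: ReviewHit[],
  writtenFor: WrittenFor,
  books: Map<string, Book>
): Book[] {
  const ranked = [...hits].sort((a, b) => b.score - a.score);
  const seen = new Set<string>();
  const result: Book[] = [];
  for (const hit of ranked) {
    const bookId = writtenFor.get(hit.reviewId);
    if (bookId === undefined || seen.has(bookId)) continue;
    const book = books.get(bookId);
    if (book === undefined) continue;
    seen.add(bookId);
    result.push(book);
  }
  return result;
}

// Sample data invented for illustration.
const books = new Map<string, Book>([
  ["b1", { bookId: "b1", title: "Book One" }],
  ["b2", { bookId: "b2", title: "Book Two" }],
]);
const writtenFor: WrittenFor = new Map<string, string>([
  ["r1", "b1"],
  ["r2", "b2"],
  ["r3", "b1"],
]);
const hits: ReviewHit[] = [
  { reviewId: "r2", score: 0.81 },
  { reviewId: "r1", score: 0.93 },
  { reviewId: "r3", score: 0.77 },
];

const context = expandToBooks(hits, writtenFor, books);
console.log(context.map((b) => b.title));
```

Two reviews of the same book collapse into one entry, which is the practical benefit of handing the LLM books rather than raw review text.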
Prerequisites

Before we start, we will need:

Java 21 or later
A Neo4j AuraDB instance
An OpenAI API key

Installing Java

If Java is not already installed, the recommended distribution is Temurin from the Adoptium project, available at adoptium.net. Installers are available for Windows, macOS, and Linux. Once installed, verify with:

Shell
    java -version

We should see something like openjdk version "21.x.x". The project uses the Maven wrapper, so there is no need to install Maven separately.

Setting Up Neo4j AuraDB

AuraDB is Neo4j's fully managed cloud database. A free tier is available.

Sign up at neo4j.com/product/auradb/.
Create a new AuraDB Free instance.
When the instance is created, download or note the credentials: the URI, username, and password. Neo4j only shows the password once, so save it somewhere safe.
Once the instance is running, open the built-in Query tab and verify connectivity:

Cypher
    MATCH (n) RETURN count(n)

This should return 0. We are ready to load data.

AuraDB Free includes Awesome Procedures on Cypher (APOC), a utility library that provides numerous procedures and functions for data handling. We'll use APOC for the data loading steps.

The Data Model

The dataset is built around three core node types:

Book: 10,000 books from the Goodreads UCSD dataset
Author: 12,371 authors
Review: 69,791 user reviews, each linked to a book via a WRITTEN_FOR relationship

There is also a User node (44,827 users) linked to reviews via a PUBLISHED relationship, although the main application focuses on Books and Reviews. The graph model is shown in Figure 2.

Figure 2. The Goodreads Dataset.

The key insight is that the Review node carries two things: the review text and a 1,536-dimension embedding generated using an OpenAI embedding model. This is what makes vector similarity search possible without a separate vector database, because Neo4j handles both the graph and the vectors.
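To make the vector side concrete, the sketch below computes cosine similarity between a query vector and a few review vectors and keeps the best matches. It is a brute-force illustration only: the names cosineSimilarity, topK, and EmbeddedReview are hypothetical, the vectors have 3 dimensions instead of 1,536, and a real vector index avoids scanning every node.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Convention chosen for this sketch: a zero vector is similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical shape of a review with its stored embedding.
interface EmbeddedReview {
  id: string;
  embedding: number[];
}

// Score every review against the query and keep the k highest-scoring ones.
function topK(
  query: number[],
  reviews: EmbeddedReview[],
  k: number
): { id: string; score: number }[] {
  return reviews
    .map((r) => ({ id: r.id, score: cosineSimilarity(query, r.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors invented for illustration.
const reviews: EmbeddedReview[] = [
  { id: "r1", embedding: [1, 0, 0] },
  { id: "r2", embedding: [0.6, 0.8, 0] },
  { id: "r3", embedding: [0, 0, 1] },
];

const ranked = topK([1, 0, 0], reviews, 2);
console.log(ranked);
```

Here r1 points in the same direction as the query, r2 partially overlaps with it, and r3 is orthogonal, so only r1 and r2 survive the cut.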
The Goodreads data used in this article is derived from the UCSD Book Graph dataset and related Goodreads datasets released by researchers at the University of California, San Diego, including Mengting Wan, Julian McAuley, and collaborators. The data is provided for research and educational purposes. If you use these datasets in your own work, please cite the following publications:

Mengting Wan and Julian McAuley, Item Recommendation on Monotonic Behavior Chains, RecSys 2018.
Mengting Wan, Rishabh Misra, Ndapa Nakashole, and Julian McAuley, Fine-Grained Spoiler Detection from Large-Scale Review Corpora, ACL 2019.

Loading the Data

Let's load the data step by step in the AuraDB Query tab. Run each of the following blocks separately.

Constraints and Indexes

First, let's set up the uniqueness constraints and an index on the Review node's user_id property:

Cypher
    CREATE CONSTRAINT FOR (b:Book) REQUIRE b.book_id IS UNIQUE;
    CREATE CONSTRAINT FOR (a:Author) REQUIRE a.author_id IS UNIQUE;
    CREATE CONSTRAINT FOR (r:Review) REQUIRE r.id IS UNIQUE;
    CREATE CONSTRAINT FOR (u:User) REQUIRE u.user_id IS UNIQUE;
    CREATE INDEX FOR (r:Review) ON (r.user_id);

Then create the vector index on the Review node's embedding property:

Cypher
    CREATE VECTOR INDEX `review-text` IF NOT EXISTS
    FOR (n:Review) ON (n.embedding)
    OPTIONS {
      indexConfig: {
        `vector.dimensions`: 1536,
        `vector.similarity_function`: 'cosine'
      }
    };

Note the index name review-text. We will come back to this in the lessons learned section.
Loading Books and Authors

The data are hosted on Neo4j's public servers, so we can load them directly via APOC:

Cypher
    CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_books_10k.json") YIELD value as book
    MERGE (b:Book {book_id: book.book_id})
    SET b += apoc.map.clean(book, ['authors','similar_books'], [""]);

Next, we'll load the initial author stubs:

Cypher
    CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_books_10k.json") YIELD value as book
    WITH book
    UNWIND book.authors as author
    MERGE (a:Author {author_id: author.author_id});

and then populate the author nodes with the full data:

Cypher
    CALL apoc.periodic.iterate(
      'CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_book_authors.json.gz") YIELD value as author',
      'WITH author MATCH (a:Author {author_id: author.author_id}) SET a += apoc.map.clean(author, [], [""])',
      {batchSize: 10000}
    );

Next, we'll create the AUTHORED and SIMILAR_TO relationships:

Cypher
    CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_books_10k.json") YIELD value as book
    WITH book
    MATCH (b:Book {book_id: book.book_id})
    WITH book, b
    UNWIND book.authors as author
    MATCH (a:Author {author_id: author.author_id})
    MERGE (a)-[w:AUTHORED]->(b);

Cypher
    CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_books_10k.json") YIELD value as book
    WITH book
    MATCH (b:Book {book_id: book.book_id})
    WITH book, b
    WHERE book.similar_books IS NOT NULL
    UNWIND book.similar_books as similarBookId
    MATCH (b2:Book {book_id: similarBookId})
    MERGE (b)-[r:SIMILAR_TO]->(b2);

Loading Reviews

This step can take several minutes, as it is pulling and processing approximately 70,000 reviews from a gzipped JSON file:

Cypher
    CALL apoc.load.json("https://data.neo4j.com/goodreads/goodreads_reviews_dedup.json.gz") YIELD value as review
    CALL {
      WITH review
      MATCH (b:Book) WHERE b.book_id = review.book_id
      WITH review, b
      MERGE (r:Review {id: review.review_id})
      SET r += apoc.map.clean(review, [], [""])
      WITH b, r
      MERGE (b)<-[rel:WRITTEN_FOR]-(r)
    } in transactions of 20000 rows;

Note that review.review_id is stored as the Review node's id property, which Spring AI expects when mapping vector search results.

Then we'll separate the User nodes from the Review data:

Cypher
    MATCH (r:Review)
    WHERE r.user_id IS NOT NULL
    CALL {
      WITH r
      MERGE (u:User {user_id: r.user_id})
      WITH r, u
      MERGE (r)<-[:PUBLISHED]-(u)
    } in transactions of 20000 rows;

Adding the text Property

Spring AI maps vector search results to Document objects using a property named text. Our review data uses review_text, so we need to add the text property:

Cypher
    MATCH (r:Review)
    CALL {
      WITH r
      SET r.text = r.review_text
    } IN TRANSACTIONS OF 20000 ROWS;

Loading Pre-Generated Embeddings

Rather than generating embeddings at runtime, which costs tokens and time, we'll load pre-computed embeddings hosted by Neo4j. This step also takes several minutes:

Cypher
    LOAD CSV WITH HEADERS FROM "https://data.neo4j.com/goodreads/review_embeddings.psv" as row FIELDTERMINATOR '|'
    CALL {
      WITH row
      MATCH (r:Review {id: row.reviewId})
      CALL db.create.setNodeVectorProperty(r, 'embedding', apoc.convert.fromJsonList(row.embedding))
      RETURN r
    } in transactions of 1000 rows
    WITH r
    RETURN count(r);

Once complete, we can verify the embeddings loaded correctly:

Cypher
    MATCH (r:Review)
    WHERE r.embedding IS NOT NULL
    RETURN count(r) AS reviews_with_embeddings

We should see 69791.

Exploring the Data

Before running the application, let's take a look at what we have loaded. Here are a few useful queries to run in the AuraDB Query tab.
Browse the top-rated books:

Cypher
    MATCH (b:Book)
    RETURN b.title, b.average_rating
    ORDER BY b.average_rating DESC
    LIMIT 10

Browse books with their authors:

Cypher
    MATCH (a:Author)-[:AUTHORED]->(b:Book)
    RETURN a.name, b.title, b.average_rating
    ORDER BY b.average_rating DESC
    LIMIT 10

Inspect a sample embedding to see the first few dimensions of a review's vector:

Cypher
    MATCH (r:Review)
    WHERE r.embedding IS NOT NULL
    RETURN r.id, r.text, r.embedding[0..5] AS embedding_sample
    LIMIT 5

Building and Running the Application

Let's clone the GitHub repo and get the application running:

Shell
    git clone https://github.com/JMHReif/springai-goodreads.git
    cd springai-goodreads

Set the environment variables for Neo4j AuraDB and OpenAI, as follows:

Shell
    export SPRING_NEO4J_URI=neo4j+s://xxxx.databases.neo4j.io
    export SPRING_NEO4J_AUTHENTICATION_USERNAME=your_username_here
    export SPRING_NEO4J_AUTHENTICATION_PASSWORD=your_password_here
    export SPRING_AI_OPENAI_API_KEY=your_openai_key_here

These variables must be set in the terminal session used to run the Spring Boot application, that is, the window where you run ./mvnw spring-boot:run. The terminal used for curl commands does not need them. To avoid having to re-export them each time, you can add them to your shell profile (e.g. ~/.zshrc on macOS or ~/.bashrc on Linux) or save them in a small shell script and source it before starting the app.

Now we'll start the application from the root of the cloned repo, where the pom.xml and mvnw files live, as follows:

Shell
    ./mvnw spring-boot:run

Maven will download dependencies on the first run. Once the startup banner appears, the app is ready on port 8080.

The Four Endpoints

The application exposes four REST endpoints, each representing a different retrieval strategy:

/hello: Baseline LLM Call

Shell
    curl "http://localhost:8080/hello"

A simple call to the LLM with no retrieval. Useful to verify the OpenAI connection is working.
/llm: LLM With No Context

Shell
    curl "http://localhost:8080/llm?searchPhrase=happy%20ending"

This sends the search phrase directly to the LLM with no data from Neo4j. The model answers from its training data, which is fast but prone to hallucination and not grounded in our Goodreads data.

/vector: Vector Search Only

Shell
    curl "http://localhost:8080/vector?searchPhrase=happy%20ending"

Spring AI embeds the search phrase via OpenAI, queries the review-text vector index in Neo4j, and passes the matching review text to the LLM. Semantic matching works well here, because the phrase does not need to match any exact words in the reviews.

/graph: Full GraphRAG Pipeline

Shell
    curl "http://localhost:8080/graph?searchPhrase=happy%20ending"

This is the full pipeline. Vector search finds the most semantically similar reviews, the graph traversal follows the WRITTEN_FOR relationship to retrieve the associated Book nodes, and the LLM receives structured book context rather than raw review text.

Let's look at the output for a few different search phrases:

Shell
    curl "http://localhost:8080/graph?searchPhrase=encouragement"
    curl "http://localhost:8080/graph?searchPhrase=high%20tech"
    curl "http://localhost:8080/graph?searchPhrase=caffeine"

The contrast between /llm and /graph on the same phrase is the most compelling comparison: the LLM answers from memory in one case and from our actual Goodreads data in the other.

GraphRAG Uses Both Vector Search and Graph Traversal

It's worth comparing the two retrieval strategies directly, as shown in Figure 3.

Figure 3. Vector Search and Graph Traversal.

Neither approach is strictly better. Rather, they are complementary. Vector search handles fuzzy, intent-driven queries that keyword search would miss entirely. Graph traversal adds relationship-aware context that makes the LLM response richer and easier to trace back to source data. The /graph endpoint combines both.

Lessons Learned

Here are four things worth knowing before setting this up from scratch.
Vector index naming matters. Spring AI's default vector index name is spring-ai-document-index. This project requires review-text. If the index is created with the wrong name, the application throws a runtime error that is not immediately obvious. Always check the index name configured in the application against the one created in Neo4j.

Review nodes need id and text properties. Spring AI maps vector search results to Document objects using properties named id and text. In this dataset, review_id is mapped to the Review node's id property during loading, but the review text is stored as review_text. We therefore add a text property so Spring AI can map the results correctly. Without the expected properties, vector search returns results, but the book list comes back empty, so the model gets no context and answers from memory instead.

Pre-generated embeddings save time and money. Generating 69,791 embeddings at runtime via the OpenAI API would be slow and costly. Loading pre-computed embeddings from a file is much faster for initial development setups. The trade-off is that the embeddings are fixed, as they were generated with a specific OpenAI model and will need to be regenerated if the model changes.

Data loading takes patience. The two long-running steps are the review load and the embedding load. Plan for this, although both steps only need to be done once and the database can be left running between sessions.

Summary

GraphRAG is a practical pattern, not just a research concept. By combining Neo4j's graph traversal with its vector index, we get two retrieval mechanisms in a single database, and no separate vector store is required for this architecture. Spring AI provides the abstractions to wire it all together in a way that will feel familiar to any Spring developer.
The Goodreads domain is approachable and familiar to many readers, but the architecture generalizes to any graph of connected entities, such as product catalogs, knowledge graphs, and collections of documents. If you have relationships in your data, a graph database gives you relationship-aware retrieval capabilities that a plain vector store does not provide. The full source code is on GitHub.

Acknowledgements

I thank my colleague Jennifer Reif for sharing the Spring AI example.

API Facade vs. Orchestration vs. Eventing, Now With AI in the Loop

By Jubin Abhishek Soni

AI Doesn't Replace Your Architecture; It Becomes Part of It

Picture this. Your team has just integrated a large language model into your enterprise application. The demo looked compelling. The agent interpreted user intent, called several APIs, and returned a coherent result. Everyone in the room was impressed.

Then the questions started. What happens when the LLM misinterprets a request and calls the wrong API? Who owns the business logic embedded in that prompt? If the model changes, does the integration break? How do you audit what the AI decided and why?

These aren't AI questions. They're architecture questions, and they don't go away just because you've added intelligence to the system.

The most important architectural decision you'll make about AI isn't which model to use. It's where the AI sits relative to your existing integration layers. Get that right, and AI becomes a powerful, governable component in a coherent system. Get it wrong, and you'll end up with business logic scattered across prompts, brittle integrations that break when the model updates, and no clear line of accountability when something fails.

The question isn't "Can AI call APIs?" It's "Where should AI sit within your architecture?"

There are three architectural roles worth separating clearly.

API facade. The edge layer that translates external requests into internal operations.
Workflow orchestration. The layer that manages multi-step business processes and decision logic.
Event-driven integration. The layer that lets systems react to changes without tight coupling.

Each serves a different purpose, and AI belongs in different places depending on the business problem you're solving. Figure 1 lays out all three roles side by side, including what AI owns and does not own in each one.

Figure 1. Where AI Sits: Three Architectural Roles

The table below gives a quick reference for how the three patterns differ before we walk through each one in detail.
Pattern | Purpose | Coupling | Determinism | Where AI Fits
--- | --- | --- | --- | ---
API Facade | Translate external requests into internal operations | Tight, synchronous | Low, request-driven | Interpreting intent, extracting parameters
Workflow Orchestration | Sequence multi-step business processes | Moderate, coordinated | High, explicit branching | Providing probabilistic input to decision points
Event-Driven Integration | Let systems react to change asynchronously | Loose, decoupled | Variable, per consumer | Consuming and enriching events, never the bus itself

This article walks through where AI fits within each pattern, and just as importantly, where it doesn't.

1. Start by Defining What the AI Is Responsible For

Before you touch an integration pattern, answer a more fundamental question. What is the AI actually accountable for in this system?

This sounds obvious but gets skipped constantly. Teams reach for an LLM because it handles natural language well, then gradually load it with responsibilities it shouldn't own, like validating business rules, managing state, enforcing authorization logic, and driving deterministic workflows. The AI ends up doing everything, which means the architecture owns nothing clearly.

Ask these questions before making any integration decisions.

Is the AI interpreting human input? Natural language understanding, intent classification, and entity extraction are AI-native tasks where models genuinely add value.
Is the AI making recommendations or decisions? A recommendation, such as "this customer is likely to churn," is a probabilistic output. A decision, such as "cancel this subscription," is a deterministic action with business consequences. These require different ownership models.
Is the AI coordinating business processes? If yes, be careful. Orchestration logic embedded in prompts is invisible to your governance tooling, untestable in any traditional sense, and will silently drift as the model updates.
Which steps require human approval?
Any action that is irreversible, regulated, or high stakes should have an explicit human checkpoint that lives in your workflow layer, not inside a prompt.

The cleaner your answer to these questions, the cleaner your integration design will be. Blurry responsibilities produce brittle architectures. Define the boundary first.

2. AI at the API Facade, the Conversational Edge

The API facade pattern sits at the edge of your system. It's the layer that translates external requests into internal operations. Traditionally, this meant REST or GraphQL endpoints that routed structured requests to back-end services.

AI belongs here when the primary challenge is bridging the gap between unstructured human intent and structured system operations. Think of an enterprise procurement assistant. A buyer types, "Reorder the same supplies we used for the Sydney office fit-out, but increase quantity by 20% and flag anything over $5,000 for manager approval." No traditional API handles that sentence on its own. The facade layer is exactly where an LLM adds value. It parses intent, extracts parameters, resolves ambiguity, and maps the request to specific downstream API calls.

What AI does well at the facade includes intent resolution, turning natural language into structured API parameters. It also handles entity extraction, pulling order IDs, product codes, dates, and names from conversational input. It supports contextual disambiguation, using conversation history to resolve references like "that vendor" back to a specific vendor ID mentioned earlier. And it enables response synthesis, taking structured API responses and returning natural language answers.

What AI should not own at the facade is just as important. Authorization logic belongs in your API gateway or identity layer. Rate limiting and throttling are infrastructure concerns, not model concerns.
Core business rules, such as "orders over $5,000 require approval," should live in your workflow layer rather than in a prompt where they're invisible to compliance tooling.

The practical pattern is that AI at the facade acts as a structured parameter extractor. It takes conversational input, produces a clean structured intent object, and hands off to APIs that were designed for deterministic consumption. The model interprets. The API executes. The example below shows what that structured intent object might look like once the model has parsed the procurement request above.

JSON
    {
      "intent": "create_purchase_order",
      "reference_order": "sydney_office_fitout_2026",
      "quantity_multiplier": 1.2,
      "approval_required_above": 5000,
      "currency": "USD",
      "extracted_from": "conversational_input",
      "confidence": 0.94
    }

Listing 1: Example structured intent object produced at the API facade.

Design your facade APIs to accept both human-readable context and machine-structured parameters. Build explicit validation at the API boundary so that when the model produces a malformed or out-of-range parameter, the error is caught and surfaced clearly, not silently swallowed or, worse, acted upon incorrectly.

3. AI Inside Orchestration, Where Flexibility Meets Business Workflows

Workflow orchestration manages multi-step business processes, including the sequence of steps, branching logic, error handling, retries, and human approval gates. It's the layer that knows how work gets done, in what order, and under what conditions.

The central tension when introducing AI into orchestration is that orchestration is deterministic by design, while AI is probabilistic by nature. A well-governed workflow produces the same output given the same inputs. An LLM does not. Mixing these carelessly produces workflows that are auditable on paper but unpredictable in practice.
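This tension is also why the boundary validation recommended for the facade matters: a deterministic check is what turns a probabilistic model output into something downstream code can safely accept. The sketch below is one hedged way to do that in TypeScript. The field names follow Listing 1, but the function validateIntent, the allowed-intent list, and the numeric bounds are assumptions made for illustration, not rules from any real system.

```typescript
// Field names follow Listing 1. The allowed intents and numeric bounds are
// hypothetical values chosen for this sketch.
interface PurchaseIntent {
  intent: string;
  quantity_multiplier: number;
  approval_required_above: number;
  confidence: number;
}

type ValidationResult =
  | { ok: true; value: PurchaseIntent }
  | { ok: false; errors: string[] };

const ALLOWED_INTENTS: string[] = ["create_purchase_order"];

function isFiniteNumber(v: unknown): v is number {
  return typeof v === "number" && isFinite(v);
}

// Check a model-produced payload and collect every problem found, so the
// caller (or the agent) can see exactly what to correct.
function validateIntent(raw: unknown): ValidationResult {
  if (typeof raw !== "object" || raw === null) {
    return { ok: false, errors: ["payload must be a JSON object"] };
  }
  const r = raw as Record<string, unknown>;
  const intent = r["intent"];
  const multiplier = r["quantity_multiplier"];
  const threshold = r["approval_required_above"];
  const confidence = r["confidence"];
  const errors: string[] = [];

  if (typeof intent !== "string" || ALLOWED_INTENTS.indexOf(intent) === -1) {
    errors.push("intent must be one of: " + ALLOWED_INTENTS.join(", "));
  }
  if (!isFiniteNumber(multiplier) || multiplier <= 0 || multiplier > 10) {
    errors.push("quantity_multiplier must be a number in (0, 10]");
  }
  if (!isFiniteNumber(threshold) || threshold < 0) {
    errors.push("approval_required_above must be a non-negative number");
  }
  if (!isFiniteNumber(confidence) || confidence < 0 || confidence > 1) {
    errors.push("confidence must be a number in [0, 1]");
  }
  if (errors.length > 0) {
    return { ok: false, errors };
  }
  return {
    ok: true,
    value: {
      intent: intent as string,
      quantity_multiplier: multiplier as number,
      approval_required_above: threshold as number,
      confidence: confidence as number,
    },
  };
}

const good = validateIntent({
  intent: "create_purchase_order",
  quantity_multiplier: 1.2,
  approval_required_above: 5000,
  confidence: 0.94,
});
const bad = validateIntent({
  intent: "delete_everything",
  quantity_multiplier: -2,
  approval_required_above: 5000,
  confidence: 0.94,
});
console.log(good.ok, bad.ok);
```

Returning all errors at once, rather than failing on the first, gives a retrying agent enough detail to correct the whole payload in one pass.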
The architectural resolution is to keep the orchestration layer deterministic while allowing AI to provide probabilistic inputs into specific decision points. Think of AI as a specialized step inside the workflow, one that produces an output that the workflow then acts on according to explicit, auditable logic.

A claims processing workflow illustrates this well. The overall process (intake, validation, AI-assisted assessment, human review, approval, and payment) is orchestrated deterministically. The AI participates at the assessment step. It analyzes claim documentation and produces a structured output: an estimated validity score, a list of missing documents, and a recommended action. The workflow then applies explicit branching logic. A score above 0.85 triggers auto approval. A score below 0.4 gets flagged for denial review. Everything in between routes to a human adjudicator. The AI informs. The orchestration decides. Figure 2 shows this flow end to end.

Figure 2. AI Inside Orchestration: Claims Processing Workflow

A few design principles matter here.

Treat AI steps as typed operations with defined inputs and outputs. The orchestration layer should pass a structured payload to the AI and receive a structured response, not an open-ended conversation. This makes the AI step testable, replaceable, and governable. The snippet below shows a minimal example of what a typed contract for an AI step might look like.

TypeScript
    // Typed contract for an AI step inside orchestration

    // Placeholder reference type (an assumption; the real shape depends on
    // where claim documents are stored).
    interface DocumentRef {
      documentId: string;
    }

    interface ClaimAssessmentInput {
      claimId: string;
      documents: DocumentRef[];
    }

    interface ClaimAssessmentOutput {
      validityScore: number; // 0.0 to 1.0
      missingDocuments: string[];
      recommendedAction: "approve" | "review" | "deny";
    }

Listing 2: Example typed input/output contract for an AI step inside an orchestrated workflow.

Never let the AI own branching logic that has compliance or audit implications.
If a decision must be explainable to a regulator, it should live in the orchestration layer where it's visible, versionable, and logged.

Design explicit human approval gates. In enterprise workflows, AI recommendations that trigger consequential actions, such as financial transactions, customer notifications, or system changes, should route through a human checkpoint unless you've explicitly validated and signed off on full automation.

Build retry and fallback paths. An AI step that fails, times out, or returns a low-confidence result needs a defined fallback, whether that's routing to a human, using a default, or escalating, built into the orchestration rather than handled ad hoc in the calling code.

Platforms like OutSystems, which provide visual workflow design alongside AI integration capabilities, make this separation of concerns tangible. You can see exactly where in the process flow an AI step participates, what it receives, and what happens next based on its output.

4. AI and Event-Driven Architecture, Reacting Without Controlling

Event-driven architecture decouples systems through a shared event bus. Producers emit events when something happens, and consumers subscribe and react without either party knowing the other exists. It's the pattern that makes large distributed systems composable and independently evolvable.

AI fits naturally into event-driven systems, but as a consumer and enricher, not as the event bus itself. The pattern works like this. A transactional system emits a clean, well-defined business event, such as OrderPlaced, CustomerChurnRiskFlagged, or SupportTicketOpened. An AI consumer subscribes, processes the event asynchronously, and either emits a derived event, like ChurnRiskClassified or TicketCategorized, or writes to a downstream store. Core transaction systems remain untouched.

This architecture has a key property for AI integration, which is isolation.
The AI component can be updated, replaced, or retrained without touching the transactional system that produced the event. The event schema is the contract between them. As long as the AI consumer honors its output schema, the downstream systems don't care what model is running behind it.

AI adds value in event-driven systems in several ways. Real-time classification lets an incoming support ticket event trigger AI categorization and routing before a human ever sees it. Anomaly detection allows a stream of transaction events to feed an AI consumer that flags unusual patterns and emits a FraudSignalDetected event. Content enrichment means a DocumentUploaded event can trigger an AI pipeline that extracts entities, generates a summary, and writes structured metadata back to the event stream.

A few cautions are worth noting too. Don't use AI to produce events that trigger irreversible transactional operations without a validation step. An AI-emitted event that directly drives a financial settlement or account closure is a governance risk. Keep AI consumers idempotent, since event-driven systems often deliver events at least once, and your AI consumer should produce the same output for the same event input regardless of how many times it processes it. Version your event schemas independently of your AI models. When the model changes, the event contract should remain stable. Break this rule, and you'll find yourself coordinating model updates with schema migrations across multiple teams.

5. Design APIs for AI Variability, Not Just Traditional Applications

Traditional API design assumes well-behaved clients. They send valid, structured requests, handle errors predictably, and operate within known parameters. AI agents are different clients. They may generate requests outside expected parameter ranges, retry with slight variations when uncertain, pass natural language fragments where IDs are expected, or call endpoints in unconventional sequences.
This changes how APIs should be designed when AI is a first-class consumer.

Be explicit about parameter constraints and semantics. Document not just the type of a parameter but what it means and what values are valid. An AI agent that doesn't understand that "customer_status" is an enum with five specific values will guess, and it may guess wrong. Explicit schemas with enumerated values and clear descriptions dramatically reduce the error surface.

Return structured, self-describing error responses. When an AI agent calls an API and gets a validation error, the response should tell the agent exactly what was wrong and what correction is expected. A generic 400 with "invalid input" gives the agent nothing to act on. A structured error that says the field "quantity" must be a positive integer, and that a negative value was received, allows the agent to self-correct on retry.

Design for idempotency on write operations. AI agents may retry failed calls. Any write operation that could be called multiple times should be idempotent, meaning calling it twice with the same payload should produce the same result as calling it once. This is a baseline requirement for reliable agentic workflows.

Consider AI-specific API profiles alongside your standard endpoints. Some teams are building enriched API descriptions, effectively structured, semantic documentation that LLMs can consume during function calling or tool use scenarios. These profiles describe not just syntax but intent, preconditions, and expected postconditions. If your platform supports it, these descriptions significantly improve agentic reliability.

6. Preserve Loose Coupling as AI Capabilities Evolve

If there is one thing that is certain about the current AI landscape, it's that it will look different in 18 months. Model capabilities are improving rapidly.
New reasoning architectures, longer context windows, better function calling, and multimodal inputs will change what AI can reliably do, which means the design decisions you make today about where AI participates in your architecture will need to evolve.

The integration architectures that will age best are the ones that treat AI as a replaceable component behind a stable interface, not as a load-bearing structural element that the rest of the system is built around. Practically, this means a few things.

The interface between your AI component and the rest of the system should be typed and versioned, just like any other service boundary. If you replace the LLM behind that interface with a better model, the orchestration layer and downstream consumers shouldn't need to change.

Business logic should not live in prompts. Prompts that embed business rules, such as approval thresholds, eligibility criteria, or routing conditions, will drift as models are updated and will be invisible to your governance tooling. Extract that logic into the orchestration or rules layer where it can be versioned and audited.

Test AI steps in isolation. Build evaluation harnesses that validate the AI component's outputs against known good test cases. When you upgrade a model, run the evaluation before you promote to production. This is standard software engineering discipline. It just hasn't been applied consistently to AI components yet.

Plan for model-level fallback. If a primary model is unavailable or underperforming, your architecture should support routing to a fallback. This is easier to build in advance than to retrofit during an incident.

The teams that will maintain architectural coherence as AI evolves are the ones that applied the same separation of concerns discipline to AI components that they've always applied to services, databases, and APIs.

7. Build Observability Across AI and Integration Layers

Debugging traditional distributed systems is hard.
Debugging systems where one of the components is an LLM is harder. The failure modes are different. The system may be technically healthy while producing incorrect, inconsistent, or subtly wrong outputs. A 200 OK from an AI step tells you the HTTP call succeeded. It says nothing about whether the response was accurate, relevant, or safe. Observability in AI integrated architectures needs to span multiple layers simultaneously. At the AI component level, teams should capture the full prompt sent to the model, not just the output, along with the raw model response before any parsing or post-processing. Token counts, latency, and model version matter too, as do confidence scores or reasoning traces where the model provides them, and retry attempts or fallback triggers. At the integration layer, capture which APIs the AI called, with what parameters, and what the responses were. Track workflow step durations and branching decisions, event payloads at each stage of processing, and human review decisions and overrides. At the business outcome level, ask whether the end-to-end process completed successfully, whether AI-assisted decisions matched expected patterns, and where AI components are producing outputs that require human correction. Platforms that provide centralized monitoring across application logic, integrations, and workflows, such as OutSystems, reduce the instrumentation burden by giving teams a single observability surface rather than requiring separate tooling for each layer. This matters most during incident response, when you need to trace a failure from a user-visible symptom back through the AI component, through the API calls it made, and into the underlying workflow state, quickly. One practice worth establishing early is shadow mode evaluation. Before promoting AI-assisted decisions to full automation, run the AI in parallel with existing logic and compare outcomes without acting on the AI's output. 
This builds confidence in the model's reliability on your specific data distribution before you depend on it in production.

Conclusion. Integration Architecture Is Still the Foundation

AI agents are sophisticated components, but they're still components. They have inputs and outputs. They can fail. They need to be tested, monitored, versioned, and replaced, and crucially, they need to sit somewhere coherent in your architecture.

The teams that will get the most out of AI are the ones that ask the architectural questions first. What is the AI responsible for? Where does its output go? Who owns the logic around it? How will we know when it's wrong?

The answer isn't a different architecture for AI. It's the same architectural discipline that enterprise systems have always required, applied with precision to a new kind of component. API facade, orchestration, and event-driven architecture were built to manage complexity, enforce separation of concerns, and keep systems evolvable. AI makes all three more valuable, not less. The question is simply where, within each, the intelligence belongs.

References

APISDOR. "How AI Agents Are Reshaping Enterprise Software Architecture." 2026. https://www.apisdor.com/blog/how-ai-agents-are-reshaping-enterprise-software-architecture/
Elementum. "Enterprise AI Orchestration: Complete Architecture Guide." 2026. https://www.elementum.ai/blog/enterprise-ai-orchestration-architecture
DevRev. "AI Agent Orchestration: Patterns, Pitfalls & the Shared Memory Architecture." 2026. https://devrev.ai/blog/ai-agent-orchestration
Viston AI. "Architecture for Enterprise AI Orchestration: A 2026 Blueprint." 2026. https://viston.tech/recommending-a-production-ready-architecture-for-enterprise-ai-orchestration/
"Autonomous Event-Driven Multi-Agent Orchestration for Enterprise AI at Scale." arXiv, 2026. https://arxiv.org/pdf/2606.20058
Zuplo. "The API Readiness Gap: How to Design APIs That AI Agents Can Actually Use." 2026. https://zuplo.com/learning-center/api-readiness-gap-agent-callable-apis
freeCodeCamp. "How to Design APIs for AI Agents." 2026. https://www.freecodecamp.org/news/how-to-design-apis-for-ai-agents/
"Self-Reflective APIs: Structure Beats Verbosity for AI Agent Recovery." arXiv, 2026. https://arxiv.org/pdf/2606.05037
"Building Customer Support AI Agents at 100M-User Scale: An Evaluation-Driven Framework." arXiv, 2026. https://arxiv.org/pdf/2606.08867
"Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes." arXiv, 2026. https://arxiv.org/pdf/2603.06847
Agentive AI Agents. "AI Agent Error Handling: 7 Proven Practices." 2026. https://agentiveaiagents.com/ai-agent-error-handling-best-practices/

The Role of Multi-Agent AI in Optimizing Warehouse Logistics

By Lilly Gracia

Machine Identity Debt: Why Human Identity Is No Longer Cloud Security's Primary Boundary

By Igboanugo David Ugochukwu

CORE

Service Industry Evolution: Beyond 99.9% Uptime With Evolving Technology

By Abhishek Sharma

Performance Testing RAG Applications: Complete Engineering Guide

In this blog post, we will see how to performance-test a retrieval-augmented generation (RAG) application properly, covering both speed and correctness, and how to wire both into a CI/CD pipeline so regressions get caught before they reach production.

Performance testing a RAG application requires two separate testing gates: one for speed and one for answer quality. Traditional load testing tools measure response times but cannot detect hallucinations, where a model returns fast but factually incorrect answers grounded in fabricated context rather than retrieved documents. The guide demonstrates using k6 for load testing end-to-end latency and DeepEval for evaluating faithfulness and answer relevancy using an LLM-as-judge approach. Both gates are integrated into a GitHub Actions CI/CD pipeline so regressions in either performance or output quality are caught automatically on every pull request before reaching production.

If you've come from a JMeter or k6 background as I have, your first instinct with a RAG endpoint is probably to point a load test at it and check response times. That gets you halfway there. A RAG app can return a fast, confident, completely wrong answer, and a plain load test will never tell you that. You need two testing surfaces, not one: performance and quality.

This guide covers both, using a single running example throughout: a documentation assistant that answers "How do I run JMeter in non-GUI mode?" against a small knowledge base.

Why RAG Breaks Traditional Load Testing Assumptions

A conventional API returns a complete response, and you measure the round trip. A RAG endpoint does two expensive things before it answers: it retrieves context from a vector store or search index, then it streams a generated response token by token. That second part matters a lot.
A single request can stream hundreds of tokens over several seconds, so "request duration" as a single number hides two very different problems: how long the model took to start answering, and how fast it generated once it started. A system with slow startup but fast generation feels broken to someone typing in a chat UI. A system with fast startup but slow generation is fine for a quick question but painful for a long document summary. Averaging those together tells you nothing useful.

The Two Testing Surfaces: Performance and Quality

I think of RAG testing as two separate gates that happen to run against the same endpoint.

Performance answers: how fast is it, and does it hold up under load? This is k6's job, same as any other API load test, just with LLM-specific metrics layered on.

Quality answers: is the answer actually grounded in what got retrieved, or did the model make something up? This is where DeepEval comes in, scoring faithfulness and relevancy on every response using an LLM as the judge.

Neither gate alone tells the full story. A fast RAG app that hallucinates is worse than a slow one that's accurate, and a perfectly grounded app that takes eight seconds to respond will lose users regardless of correctness.

Metrics That Actually Matter

Performance Metrics

| Metric | What it tells you |
|---|---|
| TTFT (Time to First Token) | How long a user stares at a blank screen before anything appears |
| ITL (Inter-Token Latency) | How smoothly tokens stream once generation starts |
| Tokens/sec | Generation speed, matters most for long-form answers |
| p95 / p99 latency | The tail experience, not the average one |

TTFT is the most user-visible number in the whole system, and it's also the metric most classic load testing tools weren't built to isolate, since they were designed for atomic request/response cycles, not streams.
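To make the definitions in the table concrete, here is a minimal Python sketch of how these numbers could be derived once a streaming client gives you per-token arrival timestamps. The function name `streaming_metrics` and its inputs are hypothetical and not part of the companion repo; the sketch assumes all timestamps are in seconds on the same clock as the request start, and it counts tokens/sec from the first token onward so that startup delay stays isolated in TTFT.

```python
from statistics import mean


def streaming_metrics(request_start, token_times):
    """Derive TTFT, mean ITL, and tokens/sec from token arrival timestamps.

    request_start: time the request was sent (seconds).
    token_times:   arrival time of each streamed token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("no tokens received")

    # Time to first token: the blank-screen wait.
    ttft = token_times[0] - request_start

    # Inter-token latency: average gap between consecutive tokens.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = mean(gaps) if gaps else 0.0

    # Generation speed, measured only over the streaming phase.
    generation_time = token_times[-1] - token_times[0]
    tokens_per_sec = (len(token_times) - 1) / generation_time if generation_time > 0 else 0.0

    return {"ttft_s": ttft, "itl_s": itl, "tokens_per_sec": tokens_per_sec}


# Illustrative timestamps: first token after 1 s, then one token every 0.5 s.
metrics = streaming_metrics(0.0, [1.0, 1.5, 2.0, 2.5, 3.0])
print(metrics)
```

Keeping TTFT and generation speed as separate numbers is the point: a single averaged duration would hide which of the two is responsible for a slow response.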
Quality Metrics

| Metric | What it tells you |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context, or invented |
| Answer relevancy | Does the answer address the actual question, or just sound plausible |
| Context precision | Did retrieval return the right chunks, ranked correctly |
| Context recall | Did retrieval miss anything the answer needed |

These four metrics carry most of the diagnostic weight in RAG evaluation. Faithfulness and answer relevancy live on the generation side; context precision and recall live on the retrieval side. When faithfulness is low but context recall is high, the retriever did its job, and the model ignored it; that's a prompting problem, not a retrieval problem. Worth knowing the difference before you go tuning the wrong component.

Hallucination Detection With DeepEval

I'm using DeepEval here instead of RAGAS mainly because DeepEval treats evaluations as pytest test cases with pass/fail thresholds, which is exactly the shape you need for a CI/CD gate. It also accepts any LLM as the judge model, so it isn't locked to one vendor even though our example app happens to use Gemini.

Here's what a test case looks like against our JMeter doc-assistant example:

```python
import os

from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from deepeval.models import GeminiModel

# LLM used as the judge for both metrics
judge_model = GeminiModel(
    model="gemini-3.5-flash",
    api_key=os.getenv("GEMINI_API_KEY"),
)

faithfulness_metric = FaithfulnessMetric(threshold=0.75, model=judge_model)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.8, model=judge_model)


def test_jmeter_non_gui_mode_answer():
    question = "How do I run JMeter in non-GUI mode?"
    # query_rag_app is a helper that is not shown in this excerpt; it is
    # expected to return a dict with "answer" and "retrieved_chunks".
    result = query_rag_app(question)

    test_case = LLMTestCase(
        input=question,
        actual_output=result["answer"],
        retrieval_context=result["retrieved_chunks"],
    )

    metrics = [faithfulness_metric, answer_relevancy_metric]

    # Score every metric first so the log shows all results, not just the first failure
    for metric in metrics:
        metric.measure(test_case)
        status = "PASS" if metric.success else "FAIL"
        print(f"[{status}] {metric.__class__.__name__}: {metric.score:.3f}")

    failed = [m for m in metrics if not m.success]
    if failed:
        names = ", ".join(m.__class__.__name__ for m in failed)
        raise AssertionError(f"Metrics below threshold: {names}")
```

Run this with pytest, and it either passes or fails like any other test. That's the whole point: it turns a fuzzy "does the AI sound right" question into a binary CI/CD signal.

The test suite includes retry logic to handle transient Gemini API 503 errors, automatically retrying up to 3 times with exponential backoff. DeepEval generates both JUnit XML and HTML reports, making it trivial to wire into any CI system that understands pytest output.

Load Testing With k6 (and Why You Can't Measure TTFT Yet)

Here's where things get frustrating if you came here looking for a clean TTFT measurement story: the k6 SSE extension (xk6-sse) is not compatible with k6 v2. It targets go.k6.io/k6 v1, and until it gets updated, you're stuck choosing between k6 v2's improved architecture or the ability to measure streaming metrics properly.

So the companion repo does the pragmatic thing: it tests the /chat/complete endpoint instead of the /chat streaming endpoint, using k6's built-in http module. No custom binary, no extensions, just standard k6. The tradeoff is you lose true TTFT measurement, because /chat/complete waits for the full response before returning. What you get instead is end-to-end latency, which is still useful: it tells you if the system is slow, just not why it's slow.
Here's what the test looks like:

```javascript
import http from 'k6/http';
import { Trend, Counter } from 'k6/metrics';
import { check } from 'k6';

const totalDuration = new Trend('total_duration_ms', true);
const tokensPerSecond = new Trend('tokens_per_second');

const BASE_URL = __ENV.RAG_APP_URL || 'http://localhost:8080';

export const options = {
  scenarios: {
    rag_chat: {
      executor: 'ramping-vus',
      stages: [
        { duration: '30s', target: 10 },
        { duration: '1m', target: 10 },
        { duration: '30s', target: 0 },
      ],
    },
  },
  thresholds: {
    http_req_duration: ['p(95)<6000'],
    total_duration_ms: ['p(95)<6000'],
  },
};

export default function () {
  const startTime = Date.now();
  const res = http.post(
    `${BASE_URL}/chat/complete`,
    JSON.stringify({ query: 'How do I run JMeter in non-GUI mode?' }),
    {
      headers: { 'Content-Type': 'application/json' },
      timeout: '30s',
    },
  );
  const duration = Date.now() - startTime;

  check(res, {
    'status 200': (r) => r.status === 200,
    'has answer': (r) => JSON.parse(r.body).answer !== undefined,
  });

  totalDuration.add(duration);

  // Rough tokens/sec estimate from word count
  const words = JSON.parse(res.body).answer.trim().split(/\s+/).length;
  tokensPerSecond.add((words / duration) * 1000);
}
```

The test ramps from 0 to 10 virtual users over 30 seconds, holds for a minute, then ramps back down. Thresholds are set at p95 < 6000ms for both http_req_duration and the custom total_duration_ms metric.

When should you switch back to SSE? Watch the xk6-sse repo. Once it adds k6 v2 support, swap the endpoint from /chat/complete to /chat, add the SSE extension to your Dockerfile, and you'll get true TTFT measurement. Until then, this is the most pragmatic path forward: standard k6, no custom builds, just with the caveat that you're measuring end-to-end latency rather than streaming behavior.
The companion repo includes both endpoints in the Express app so you can switch when you're ready:

| Endpoint | Response | Status |
|---|---|---|
| POST /chat | SSE stream | Ready for when xk6-sse supports k6 v2 |
| POST /chat/complete | Full JSON | Used by k6 and DeepEval today |

Wiring Both Gates Into CI/CD

Once both tests run locally, wiring them into GitHub Actions is mostly plumbing: start the app, wait for it to be healthy, run the k6 gate, run the DeepEval gate, both in parallel since they're independent.

```yaml
name: RAG CI
on: [pull_request]

jobs:
  performance-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Write app env file
        run: |
          cat > app/.env << EOF
          GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }}
          GEMINI_MODEL=gemini-3.5-flash
          FILE_SEARCH_STORE_NAME=${{ secrets.FILE_SEARCH_STORE_NAME }}
          PORT=8080
          EOF
      - name: Start RAG app
        run: docker compose up -d --build app
      - name: Wait for health
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:8080/health; do sleep 2; done'
      - name: Run k6 load test
        run: docker compose --profile perf run --rm k6

  quality-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Write app env file
        run: |
          cat > app/.env << EOF
          GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }}
          GEMINI_MODEL=gemini-3.5-flash
          FILE_SEARCH_STORE_NAME=${{ secrets.FILE_SEARCH_STORE_NAME }}
          PORT=8080
          EOF
      - name: Start RAG app
        run: docker compose up -d --build app
      - name: Wait for health
        run: |
          timeout 60 bash -c 'until curl -f http://localhost:8080/health; do sleep 2; done'
      - name: Run DeepEval tests
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: docker compose --profile quality run --rm deepeval
```

Both jobs run on every pull request. A PR that slows down response time and a PR that quietly makes the model hallucinate get caught the same way, before either reaches a reviewer's eyeballs, let alone production.
You'll need to add two secrets to your GitHub repo before the workflow will pass:

| Secret | Value |
|---|---|
| GEMINI_API_KEY | Your Gemini API key from https://aistudio.google.com/apikey |
| FILE_SEARCH_STORE_NAME | The store name from setup-store.js (format: fileSearchStores/your-store-id) |

Setting SLOs

I'm deliberately not giving you one universal latency number to target. I've seen guidance ranging from sub-second targets for chat-style RAG apps to 3-5 second budgets for more complex document analysis, and the right number for you depends entirely on your retrieval backend, your model, and what your users are actually doing. Run the load test against your own baseline first, then set thresholds off that baseline, not off a number from a blog post (including this one).

The example repo uses p95 < 6000ms as a starting point because that's what the test Gemini File Search RAG app achieves at 10 concurrent users with gemini-3.5-flash. Your mileage will vary dramatically based on:

- Model choice (flash vs pro, size of context window actually used)
- Retrieval backend (vector DB query time, number of chunks retrieved)
- Document size and complexity
- Network latency to your LLM provider

What you should track regardless of the exact number:

- p95 and p99 latency, not just the median. The tail experience is what users complain about.
- Latency at your expected concurrency, not at 1 user. RAG apps often degrade non-linearly under load because of retrieval bottlenecks.
- Faithfulness and answer relevancy trending over time, not just pass/fail on one run. A metric that's consistently 0.90 dropping to 0.78 is a signal even if both pass the 0.75 threshold.

Wrap-Up

RAG performance testing is really two disciplines wearing one trench coat: classic load testing with LLM-aware metrics, and LLM-as-judge quality scoring that classic load testing tools were never built to do. Run them both, gate on both, and you'll catch the regressions that a speed-only test walks right past.

The current state of tooling isn't perfect (you can't measure TTFT with k6 v2 without writing your own SSE client, and LLM-as-judge scoring has its own consistency quirks), but it's good enough to catch regressions before production, which is the whole point of a CI/CD gate.

Head to the companion GitHub repo for the full working app, k6 script, DeepEval tests, Docker Compose setup, and GitHub Actions workflow you can clone and run locally in under five minutes.

Happy testing!

Have you run into hallucination regressions that a pure load test missed? I'd like to hear how you caught them; reply on X or open an issue on the companion repo.

By NaveenKumar Namachivayam

CORE

Slopsquatting: A New Supply Chain Threat From AI Coding Agents

A new supply chain attack class is targeting the layer below your code: the dependencies your AI coding agent suggests. Researchers call it slopsquatting. It works because AI tools, even the good ones, hallucinate package names. Attackers do not need to compromise a real package anymore. They wait for a model to invent one, then register the invented name on a public registry. When your developer runs npm install or pip install, the malware lands.

This article covers what slopsquatting is, why it is different from typosquatting, the documented incidents in 2025 and 2026, and a practical defense stack you can put in your pipeline this sprint.

How AI Coding Agents Hallucinate Dependencies

Large language models do not look up packages. They predict tokens that are statistically likely to follow a coding context. When the right answer is uncertain, the model fills the gap with something that looks plausible: an author name that follows naming conventions, a DOI that is well-formed, a package name that fits the ecosystem.

A March 2025 research paper measured this directly. Researchers generated 576,000 code samples using major LLMs and checked every package name against npm and PyPI. The results:

- 19.7% of suggested packages did not exist on the target registry
- Open-source models hallucinated at 21.7%
- CodeLlama hallucinated package names in over a third of its outputs
- GPT-4 Turbo was the cleanest at 3.59%

The most important number for security teams is repeatability. When the same prompt was run 10 times against the same model, 43% of the hallucinated names appeared in every single run. The hallucinations are not random noise. They are deterministic enough to be predicted, scraped, and weaponized. A separate report from Versa Networks found that 58% of hallucinated packages appeared repeatedly across runs.

With AI now writing 25% or more of new code at leading tech firms, the attack surface is large and growing.
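The repeatability figure can be pictured with a small Python sketch. It assumes you have already collected, for each run of the same prompt, the set of suggested package names that failed a registry lookup. The function name and the sample names are made up for illustration, and this is not the methodology of the research cited above, only one plausible way to compute "appeared in every run".

```python
def repeatable_fraction(runs):
    """Fraction of hallucinated names that appear in every run.

    runs: list of sets, one per run, each holding the nonexistent
          package names suggested in that run.
    """
    if not runs:
        return 0.0
    seen_anywhere = set().union(*runs)
    if not seen_anywhere:
        return 0.0
    seen_everywhere = set.intersection(*runs)
    return len(seen_everywhere) / len(seen_anywhere)


# Three hypothetical runs of the same prompt (all names are invented).
runs = [
    {"fastjson-utils", "py-cloudsync", "reqlib-extra"},
    {"fastjson-utils", "py-cloudsync"},
    {"fastjson-utils", "py-cloudsync", "datagrid-x"},
]
fraction = repeatable_fraction(runs)
print(f"{fraction:.0%} of hallucinated names appeared in every run")
```

For a defender, the names in the "every run" set are the ones worth watching first, since they are the ones an attacker could most reliably predict.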
A modern digital Trojan horse representing a supply chain attack.

Why This Is Different From Typosquatting

Typosquatting relies on human error. An attacker registers cross-evn and waits for someone to mistype cross-env. Registries like npm have countermeasures: name similarity checks, blocklists, and post-publication review.

Slopsquatting bypasses all of these. The attacker does not need a name that looks like a real package. They need a name that an AI consistently invents. The two attack profiles diverge sharply:

| Property | Typosquatting | Slopsquatting |
|---|---|---|
| Source of the wrong name | Human typo | AI hallucination |
| Detection by similarity | Possible | Useless (names are novel) |
| Attack volume | Limited by typo patterns | Scales with model usage |
| Exploit window | Months to years | Hours to days |
| Cross-ecosystem leakage | Rare | 8.7% of hallucinated Python names exist as real JavaScript packages |

That last row is worth highlighting. The model often gets the concept right but the language wrong. It knows a package called serverless-python-requirements exists in the JavaScript world, so it suggests the same name when the developer asked for a Python solution. The install command sends the build to a totally unrelated piece of code in a different registry. Standard registry-side defenses cannot catch this because each name is legitimate in its own ecosystem.

Documented Incidents

This is not a hypothetical risk. The exploits have already happened.

huggingface-cli (PyPI)

Bar Lanyado of Lasso Security observed that AI models consistently suggested pip install huggingface-cli. The real package installs differently: pip install -U "huggingface_hub[cli]". The shorter name was a hallucination. To measure the impact, Lanyado registered the hallucinated name on PyPI as a harmless empty package. Within three months, it received over 30,000 downloads. Alibaba pasted the wrong install command into the README of one of their public repositories. The package was harmless because Lanyado was a researcher. Anyone else could have shipped credential exfiltration in the same slot.

react-codeshift (npm)

In January 2026, Charlie Eriksen at Aikido Security noticed AI tools recommending an npm package called react-codeshift. The package was a hallucinated mash-up of two real packages, jscodeshift and react-codemod. It had no author and had never existed. Eriksen registered the name. Within weeks, it spread to 237 GitHub repositories through cloned and modified agent skills, and the npm registry showed real install attempts from agent tooling. His public summary: the only reason it did not become an attack vector is that he registered it first.

unused-imports (npm)

A confirmed malicious package, registered to catch developers who confuse it with the real eslint-plugin-unused-imports. Currently behind an npm security hold, but as of February 2026 it was still receiving 233 downloads per week, suggesting either AI tools are still recommending it or the malicious version made it into project lockfiles before takedown.

PromptMink (multi-registry)

Researchers at ReversingLabs attribute this campaign to Famous Chollima, a North Korean APT group. Instead of registering hallucinated names, the group uses LLM Optimization (LLMO) abuse: they craft package descriptions, README files, and embedded documentation specifically designed to make AI coding agents recommend their packages. The targets are crypto and fintech development teams. The malicious payloads have evolved from simple JS to compiled Single Executable Applications and Rust-based NAPI-RS native modules to evade detection.

CISA, the NSA, and the Five Eyes partners published a joint advisory on AI agent supply chain risk in early 2026. The Stanford AI Index lists it among the top three new attack surfaces for autonomous agents.
The Attack Pattern

The mechanics are simple and repeatable:

1. Attacker runs targeted prompts against popular coding agents to harvest hallucinated package names.
2. Attacker filters the list for names that appear consistently across runs (the 43% repeatable subset).
3. Attacker registers the most promising names on npm, PyPI, RubyGems, crates.io, or any registry that allows public publication.
4. Each registered package contains a working facade and a malicious payload, typically credential exfiltration or environment variable theft on install.
5. Developer prompts an AI agent. Agent suggests the slopsquatted name. Developer or agent runs install.
6. Payload executes inside the developer's environment, with their privileges, on their codebase.

The window between hallucination harvesting and victim install can be hours. Agents do not check publication date by default.

Defense Stack

Slopsquatting is a supply chain problem, not a code review problem. You cannot scan source files for it because the malicious code lives in dependencies your code does not yet have. Defense has to live in the pipeline.

Below is a layered approach. None of the layers alone is sufficient. Together they make the attack expensive enough that most attempts will fail.

Layer 1: Pre-Install Validation

The cheapest defense is to validate every AI-suggested package against the registry before install runs. This is a one-script change for most teams.

```shell
#!/bin/bash
# Fail fast if any AI-suggested package is missing from the npm registry
for pkg in $(jq -r '.aiSuggestions[]' suggestions.json); do
  if ! curl -sf "https://registry.npmjs.org/$pkg" > /dev/null; then
    echo "WARN: package $pkg not found on registry - possible hallucination"
    exit 1
  fi
done
```

This catches the simplest case: a name that does not exist at all. It does not catch a name that exists but is malicious. That requires the next layer.
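The script above targets npm. For Python dependencies, a similar existence check could be sketched against PyPI's JSON API (`https://pypi.org/pypi/<name>/json`, which returns 404 for unknown projects). The function names `exists_on_pypi` and `find_missing` are hypothetical, and the lookup is passed in as a parameter so the logic can be exercised without network access. Like the shell version, this only detects names that do not exist at all.

```python
import urllib.error
import urllib.parse
import urllib.request


def exists_on_pypi(name, timeout=10):
    """Return True if PyPI's JSON API knows this project name."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(name, safe='')}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        # Outages or rate limits must not be mistaken for "package missing".
        raise


def find_missing(packages, exists=exists_on_pypi):
    """Return the suggested names that the registry lookup reports as nonexistent."""
    return [name for name in packages if not exists(name)]


# Offline demonstration with a stubbed lookup; the second name is invented.
known = {"requests"}
missing = find_missing(["requests", "made-up-package-name"], exists=lambda n: n in known)
print(missing)
```

In a pipeline you would call `find_missing` with the default lookup and fail the job whenever the returned list is non-empty.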
Tools that automate this:

- Slopcheck: open-source CLI that validates AI-generated dependency lists against npm, PyPI, and other registries
- Aikido Intel: package threat feed including known slopsquatted names
- MCP-based validators: Claude Code performs registry lookup before suggesting; Cursor and most other agents do not by default

Layer 2: Registry Heuristics

Existence alone is not enough. A package that was registered yesterday with no download history is suspicious regardless of whether it appears on PyPI. Three signals worth alerting on:

- Package age below 30 days for any new dependency
- Weekly download count below 1,000 for a critical-path dependency
- Package name not in your organization's previous lockfiles or approved list

Open Policy Agent (OPA) is the standard way to enforce this in CI:

```rego
package supply_chain

# Block dependencies younger than 30 days unless explicitly allowlisted
deny[msg] {
  input.package.age_days < 30
  not input.package.allowlisted
  msg := sprintf("dependency %v is less than 30 days old", [input.package.name])
}

# Block low-traffic packages on the production path
deny[msg] {
  input.package.weekly_downloads < 1000
  input.package.role == "production"
  msg := sprintf("low-traffic dependency %v in production path", [input.package.name])
}
```

This stops the most common slopsquatting pattern, where attackers register names within a 24-48 hour window after harvesting hallucinations.

Layer 3: SBOM Generation and Verification

Every build should produce a Software Bill of Materials that records exactly what went in. The standard chain:

```shell
# generate SBOM during build
syft . -o cyclonedx-json > sbom.json

# sign it so it cannot be swapped post-build
cosign attest --predicate sbom.json --type cyclonedx \
  ghcr.io/your-org/your-app:$TAG

# verify before deploy
cosign verify-attestation --type cyclonedx \
  ghcr.io/your-org/your-app:$TAG
```

The point is not the syntax. The point is that after this step, the artifact carries cryptographic evidence of its dependency tree. A swapped or substituted package fails verification before it reaches production.
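Once every build emits an SBOM, spotting a newly introduced dependency becomes a simple comparison. The Python sketch below assumes CycloneDX JSON documents, which list dependencies in a top-level `components` array with `name` and `version` fields. The function names and the sample versions are illustrative only, and this is not how any particular tool implements the comparison.

```python
import json


def component_ids(sbom):
    """Collect name@version identifiers from a parsed CycloneDX JSON document."""
    return {
        f'{component["name"]}@{component.get("version", "")}'
        for component in sbom.get("components", [])
    }


def new_components(previous, current):
    """Return components present in the current SBOM but absent from the previous one."""
    return sorted(component_ids(current) - component_ids(previous))


# Two tiny hand-written SBOMs standing in for consecutive builds.
previous = json.loads(
    '{"bomFormat": "CycloneDX", "components": ['
    '{"name": "express", "version": "4.19.2"}]}'
)
current = json.loads(
    '{"bomFormat": "CycloneDX", "components": ['
    '{"name": "express", "version": "4.19.2"},'
    '{"name": "react-codeshift", "version": "1.0.0"}]}'
)
added = new_components(previous, current)
print(added)
```

Anything in `added` is a candidate for the age, download-count, and allowlist checks from Layer 2 before the build is allowed to proceed.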
For continuous tracking across builds, Dependency-Track ingests CycloneDX SBOMs and flags new components, version drift, and known vulnerabilities across every build in your organization.

Layer 4: Sandboxed Install

AI-generated install commands should not run in your developer's primary environment. They should run in an ephemeral container with no credentials, no network access beyond the registry, and outbound traffic logging.

```shell
# registry-only is assumed to be a user-defined Docker network that only
# allows traffic to the package registry
docker run --rm \
  --network=registry-only \
  --read-only \
  -v "$PWD":/work:ro \
  node:20-alpine \
  sh -c "cd /work && npm install --dry-run"
```

If the install attempts an outbound connection during postinstall scripts, the container logs it, and the build fails. This catches malicious packages that pass registry checks but execute payloads on install.

Layer 5: Lockfile Diff Enforcement

Every PR that modifies a lockfile should require explicit review of the diff. New dependencies are the highest-risk additions, and AI agents introduce them silently.

```shell
# in CI, fail the build if the dependencies declared in package.json diverge
# from those recorded for the root package in package-lock.json
# (assumes lockfileVersion 2 or 3, where the root entry is .packages[""])
diff <(jq -r '.dependencies // {} | keys[]' package.json) \
     <(jq -r '.packages[""].dependencies // {} | keys[]' package-lock.json) \
  || { echo "lockfile out of sync - possible injected dependency"; exit 1; }
```

For Python projects, the equivalent is checking requirements.txt against pip freeze output, or using pip-compile with --generate-hashes and verifying the hashes in CI.

Layer 6: Model Configuration

Every defense above is reactive. The proactive defense is reducing the hallucination rate at the source.

- Lower model temperature. Hallucination rates correlate strongly with sampling randomness. For coding tasks, set temperature between 0.0 and 0.2.
- Use retrieval-augmented prompts where possible. A model with access to live registry data hallucinates fewer package names than one operating from training data alone.
- Prefer agents with built-in validation. Claude Code performs web search verification before suggesting packages. This is not a complete defense, but it materially reduces slip-through.
- Keep the model out of dependency decisions where possible. For production-critical packages, use only allow-listed dependencies that have passed organizational review.

What Detection Looks Like in Practice

Three behavioral signals that should trigger investigation:

- An AI agent suggests a package your team has never used and is not in your approved list.
- A package with under 1,000 weekly downloads is recommended for a production dependency.
- An install command runs successfully, but the package cannot be found in npm view or pip show output afterward.

If your CI catches one of these, treat it as a potential incident. Pull the install logs, check the package source on the registry, and look for outbound network activity from the build environment during install.

Closing

Traditional supply chain security assumed humans choose dependencies. AI coding agents broke that assumption. They generate dependencies at machine speed, with confidence, and without verification. The registries they pull from were built for a world where the attacker had to guess what humans would mistype. That world is gone.

The defenses are not new. SBOMs, signed attestations, sandboxed installs, OPA policies, and lockfile diff enforcement are the same tools mature teams already use for supply chain security. What slopsquatting changes is the urgency. AI raises the tempo at which weak controls break.

If you are running AI coding agents in any production-adjacent workflow, the question is not whether a slopsquatted package will be suggested in your environment. It is whether your pipeline catches it before it lands in a lockfile.

If your team has detected slopsquatting attempts in your pipeline, the data would be useful to the wider security community. Anonymized incident reports are welcome in the comments.

By Kadir Arslan

Goodbye, Skeleton Keys: Why Machine Identity Broke IAM, and What SPIFFE Is Doing About It

Cloudflare published its own forensic timeline of the Salesloft Drift breach down to the minute, and it's worth sitting with the detail for a second. At 11:51 on August 9, 2025, an actor researchers track as GRUB1 tried to validate a stolen Cloudflare API token against the Salesforce API using TruffleHog's user-agent string — a tool built for finding leaked secrets, repurposed here to confirm one actually worked. That attempt failed. At 22:14, it didn't. GRUB1 walked into Cloudflare's Salesforce tenant using a credential that belonged to the Salesloft Drift integration, no exploit required, no privilege escalation needed — just a token that had been sitting there, valid, with no expiry pressure and no second factor to clear. Cloudflare wasn't an outlier. Google's Threat Intelligence Group eventually counted more than 700 organizations hit through that same OAuth token theft, including Google itself, Palo Alto Networks, and Proofpoint. I keep coming back to that incident in conversations with platform teams, because it's the cleanest illustration I've seen of a problem that's now bigger than any single breach: we built identity and access management for humans, and then we quietly let it sprawl across a population of machines that outnumber humans by a ratio nobody fully agrees on, but everyone agrees is large. CyberArk's 2025 Identity Security Landscape study puts machine identities at more than 80 to 1 against human accounts in the average enterprise. Other measurements land lower or higher depending on methodology — the point isn't the exact multiple, it's that every credible number has been climbing for three straight years, and AI agents are the fastest-growing slice of it. The Bottom Turtle There's an old explanation of the universe — turtles all the way down — that the SPIFFE community borrowed for exactly this problem, sometimes literally titling their own documentation "Solving the Bottom Turtle." 
The question it's pointing at is uncomfortable: when service A needs to prove its identity to service B, what's the root of that trust? For most organizations through the 2010s, the honest answer was "a string." An API key baked into a config file. A service account password rotated, if you were disciplined, once a quarter. A shared secret copied from a wiki page that three former employees probably still remember. None of that was a deliberate architecture decision. It was what happened by default when nobody designed for machine-scale identity, because for most of computing history, nobody had to.

SPIFFE, the Secure Production Identity Framework for Everyone, came out of the people who hit that wall first, at the scale where it actually hurts: engineers from Google, Netflix, Pinterest, and Amazon, along with a startup called Scytale that's since been folded into Hewlett Packard Enterprise, pooling their separately built internal solutions into a shared open standard. SPIRE is the production-grade runtime that implements it, and both are now graduated projects under the Cloud Native Computing Foundation, the same governance tier Kubernetes itself holds. That's not a vanity badge. It signals that the CNCF's technical oversight committee considers the project's adoption and maturity broad enough to bet production infrastructure on, which is precisely what Uber, Block (formerly Square), Bloomberg, ByteDance, and the financial services firm Wise have done, each presenting their own deployment at SPIFFE community events over the past several years. Wise's case is the one I find most persuasive for regulated industries specifically: they adopted SPIRE to establish trust between systems operating across different regulatory jurisdictions, replacing shared secrets with something an auditor could actually verify cryptographically rather than take on faith.

What an SVID Actually Buys You

Strip away the acronyms, and the mechanism is fairly elegant.
A SPIRE Agent runs on every node. When a workload starts up, the agent doesn't ask it to present a password — it interrogates the environment the workload is running in: which Kubernetes service account launched it, which container image hash it's running, which cloud instance metadata applies. That process is called attestation, and it's the part that matters most, because it ties identity to something an attacker can't simply copy out of a config file. If attestation succeeds, the agent requests a SPIFFE Verifiable Identity Document — an SVID — from the SPIRE Server: either an X.509 certificate for mutual TLS or a JWT for API-style calls, both scoped to a narrow lifetime, often measured in minutes rather than months. That lifetime is the entire point. One practitioner walkthrough I'd recommend to any platform engineer puts the contrast plainly: steal a static API key and an attacker holds working access until someone notices and rotates it, a process that in real incident response routinely takes days. Steal an SVID, and the credential is already approaching its own expiration before anyone needs to act — the damage window is bounded by cryptographic TTL instead of by how fast your detection pipeline happens to be that week. Compare that against the Cloudflare timeline above, where the stolen token had no built-in clock running against the attacker at all. Production deployments increasingly don't ask application code to deal with any of this directly. Service meshes absorb it at the infrastructure layer instead: Istio issues SPIFFE-compliant identities to every workload by default through its own internal certificate authority, and organizations that want centralized governance across mesh and non-mesh workloads alike can point Istio at an external SPIRE deployment instead, unifying the audit trail. 
Envoy proxies fetch SVIDs straight from a local SPIRE Agent through its Secret Discovery Service, which means mutual TLS between two services can be enforced with zero changes to either service's application code: the identity lives in the sidecar, not the business logic.

Where Cloud IAM Already Got This Half Right

None of this is unique to the open-source SPIFFE world, and it's worth being fair to the cloud providers here, because they solved an adjacent piece of the same problem years ago for one specific case: a workload calling its own cloud provider's APIs. AWS's IAM Roles for Service Accounts (IRSA) lets a pod running in EKS exchange a short-lived, Kubernetes-issued OIDC token for temporary AWS credentials, instead of mounting a static access key into the container image. Google Cloud's Workload Identity Federation and Azure's federated credentials do the structural equivalent for their own platforms. All three share the same underlying trick: trade a long-lived secret for a freshly minted, narrowly-scoped token, issued just-in-time, federated through an OIDC trust relationship rather than copy-pasted by a human.

The gap is what happens the moment a workload needs to talk to something that isn't its home cloud's API: another service on the same team's mesh, a partner's system in a different cloud, a vendor integration that predates anyone's identity strategy. AWS IAM has no opinion about a request arriving from GCP. That's the seam SPIFFE is built to close: a single SPIFFE ID and trust model that spans Kubernetes, VMs, multiple clouds, and on-prem hardware at once, with authorization policies written against that one identity rather than against whichever cloud-specific construct happens to apply this week.
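To make "authorization policies written against that one identity" a little more concrete, here is a minimal Python sketch using only the standard library. It parses a SPIFFE ID of the form `spiffe://trust-domain/path` and checks it against an allow-list. The trust domain `example.org`, the workload paths, and the `ALLOWED_CALLERS` table are hypothetical, and the sketch assumes the ID string has already been taken from a verified SVID; it does not perform any cryptographic verification itself.

```python
from urllib.parse import urlsplit

# Hypothetical policy table: which SPIFFE IDs may call which service.
ALLOWED_CALLERS = {
    "payments": {"spiffe://example.org/ns/shop/sa/checkout"},
}


def parse_spiffe_id(spiffe_id: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, path); raise ValueError if malformed."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    # SPIFFE IDs carry no user info, port, query, or fragment.
    if parts.query or parts.fragment or "@" in parts.netloc or ":" in parts.netloc:
        raise ValueError(f"unexpected URI component in: {spiffe_id!r}")
    return parts.netloc, parts.path


def is_authorized(caller_id: str, service: str) -> bool:
    """Allow the call only if the caller's ID is well formed and on the allow-list."""
    try:
        parse_spiffe_id(caller_id)
    except ValueError:
        return False
    return caller_id in ALLOWED_CALLERS.get(service, set())
```

The design point is that the policy names one identity string regardless of where the workload runs; in practice this check would more likely live in a mesh authorization policy or a policy engine than in application code.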
You can, and increasingly should, run both layers together: IRSA or Workload Identity Federation for the “talking to my own cloud” case, SPIFFE/SPIRE for everything else, federated through each cloud's OIDC provider so the two systems trust the same root rather than operating as separate, unrelated islands.

```text
Workload starts
      |
      v
SPIRE Agent --attests workload--> (checks: k8s service account,
      |                            container image hash, node identity)
      | attestation OK
      v
SPIRE Server --issues--> SVID (X.509 cert or JWT, TTL: minutes)
      |
      +----> mTLS to peer workload (via Envoy/Istio sidecar, SPIFFE ID in cert SAN)
      |
      +----> OIDC Federation --> Cloud IAM (AWS STS / GCP WIF) --> short-lived cloud creds
      |
      +----> SVID expires automatically; re-attestation required for renewal
```

The Part Agentic AI Just Made Worse

Everything above was already a hard problem before AI agents entered the picture, and the agents have not been gentle with it. Gartner flagged non-human identity management as a top 2025 strategic trend specifically because of agentic AI's growth curve, and OWASP responded with a dedicated Non-Human Identity Top 10 the same year, an acknowledgment that neither traditional application security tooling nor human-centric IAM processes were built for credentials that never sleep, never log in interactively, and frequently outlive the project that created them. The npm worm campaigns that tore through the back half of 2025 made the failure mode concrete rather than theoretical: forensic write-ups of the Shai-Hulud malware describe it actively harvesting environment variables and any cloud credentials exposed through instance metadata services on infected build runners. Those are precisely the long-lived, broad-scope keys that IRSA and Workload Identity Federation exist to eliminate, sitting unprotected because someone, somewhere, found it easier to bake in a static key than to wire up federation.
And then there's the harder case, the one that should concern anyone running agentic systems in production: Anthropic's account of the GTG-1002 espionage campaign in late 2025 described a threat actor manipulating an AI coding agent into autonomously executing the bulk of an intrusion across roughly thirty targets. An agent acting with that kind of autonomy needs some identity to operate under. If that identity is a copied human credential or a static service account with standing privilege (the skeleton-key pattern this whole piece has been arguing against), then a manipulated agent inherits every door that credential opens, instantly, at whatever speed the agent can issue requests. If instead it's a narrowly attested, short-lived SVID scoped to exactly the tools that the agent's task requires, the same manipulation still happens, but the blast radius it can reach is bounded by design rather than by luck.

Where This Actually Goes in Practice

Nobody serious is suggesting a rip-and-replace migration, and the practitioners who've done this well consistently describe a phased rollout instead: stand up SPIRE on Kubernetes first, prove mTLS between two or three high-value internal services, then move to eliminating the cloud credential files with the broadest blast radius (typically the workloads touching object storage or managed databases) before tackling legacy VMs and anything that predates the cluster entirely. None of it requires abandoning Vault, AWS Secrets Manager, or whatever secrets store already exists. SPIFFE is narrower than that: it specifically removes the class of secret used purely to prove "I am workload X," and leaves genuine application secrets (database passwords for systems that haven't adopted modern identity, third-party API keys) to whatever vault you're already running, just with a shrinking footprint over time.
The IETF formalized a working group for Workload Identity in Multi-System Environments in 2024, which tells you where the standards body sees this heading: not as a niche Kubernetes pattern, but as infrastructure plumbing on the same tier as TLS itself. My honest read, watching this mature over the past two years: a decade from now, handing a workload a static, long-lived credential is going to look the way handing an employee a permanent admin password without MFA looks today — technically functional, and a decision nobody will be able to defend after the fact.

By Igboanugo David Ugochukwu

CORE

Candidate Generation Decides Your Pipeline's Cost, Not the LLM

When the Most Capable Model Is the Wrong Starting Point

The fastest way to exceed a document pipeline budget is to let an LLM inspect every document before you have performed lightweight filtering. This sounds obvious, but the bottleneck is invisible at the prototype stage. A single model call is cheap, and it works well on the 20 documents in your test set. Then you hit production traffic.

The failure mode is usually pretty similar across teams: tens of thousands of LLM calls per day, tens of millions of tokens, and a monthly bill that drifts past the assigned budget. No candidate generation. No triage. Raw corpus straight to the model. The cost compounds because the corpus does not shrink without an upstream triage. A more capable model just gives you a more expensive way to process noise.

The Bottleneck Is Candidate Generation

Summarization is the easy half. Given a good document and a clear target, almost any capable model produces a passable summary. The hard part is deciding which documents are good and which ones match a given target at all.

In a large-scale pipeline, most documents are irrelevant to any particular target entity. A company monitoring its own products, competitors, regulatory activity, industry and market signals might care about a small fraction of incoming documents. The system has to find those without missing too many, and without spending inference budget on the overwhelming majority it should never have touched.

Bad candidate pools produce confident summaries of the wrong material. Once the candidate pool is good, the LLM step becomes something you can afford to run. If you solve only summarization, you get a pipeline that is both unaffordable to run and unreliable in what it produces at scale.

The Three-Stage Pipeline

The pipeline ingests a high-volume stream of documents such as web articles, news feeds, financial filings, or any large text corpus and delivers a curated candidate set for each target.
Targets might be companies, products, people, regulatory topics, threat actors, or research areas, potentially thousands of them. Each target has a profile defining what it cares about. The pipeline's output is a per-target digest: a scored and summarized shortlist of deduplicated documents.

The pipeline runs in three stages. Each stage is more expensive per document than the one before it, and each shrinks the volume the next stage sees. Figure 1 shows the end-to-end flow.

Figure 1: The three-stage pipeline. Volume narrows by orders of magnitude before the LLM is invoked.

Stage 1: Cost-Efficient Triage

The triage stage ingests every incoming document through lightweight classifiers and filters. This filtering catches ads, paywalled stubs, spam, malformed pages, and auto-generated fragments. Purpose-built classifiers for topic, region, language, and content type label the document. These are small supervised classifiers trained for specific domain detection. Named entity recognition (NER), knowledge graph entities, industry verticals, and salient term extraction transform the document into a structured feature vector. Near duplicate detection clusters documents with near-identical text and picks a canonical document to represent the event.

One of the highest-value filters here is less obvious. A large fraction of incoming news is stock-ticker recap: an article whose entire body is that a company's share price moved some percent, with the company name and ticker symbol repeated throughout. On pure entity overlap, it looks maximally relevant to that company. For almost any downstream consumer, it carries nothing actionable. A naive pipeline scores it high and spends a model call on it. We trained a small classifier specifically to catch this pattern and tuned it on human-annotated examples, because the failure mode is subtle: the document is not spam, it is well-formed, and it is genuinely about the target. It is just useless.
Generic junk filters miss it for exactly that reason.

This stage uses no LLMs. Classical ML, static rules, regex, bloom filters, and blocking indexes are enough to process millions of documents a day. The output is a canonical document with a feature vector attached. Not a summary. Not an embedding. Teams sometimes try to add semantic richness here and end up with an ingestion stage that is computationally expensive and difficult to debug. Save that work for downstream stages. The stream coming out is far smaller than the stream going in (most of the firehose never survives triage), but the exact survival rate depends entirely on the corpus.

Stage 2: Target-Aware Retrieval

Stage 2 processes the features emitted by Stage 1 and matches them against target profiles to produce a bounded candidate set per target. The retrieval problem is not a full cross-join of documents against targets. That does not scale. Instead, you build an inverted index over the document feature vectors and use blocking strategies such as entity mentions and domain signals to constrain which target profiles each document is evaluated against. This is closer to classic information retrieval index construction than to brute-force semantic search.

A common instinct is to reach for dense vector similarity as the primary retrieval layer. Push back on that. Embeddings are useful, but they are often not effective when entity aliases, regulated terminology, and taxonomy labels are what actually define a match. A pharmaceutical company named in a document by its subsidiary's trade name will not reliably surface in a cosine-similarity search against a profile built on the parent brand. Inverted indexes on extracted entities and taxonomy codes handle that case directly. Dense retrieval can still participate as a secondary ranking signal, but it is not where to start.

Entity matching has the opposite failure too. Some of the most relevant documents name no target entity at all.
An industry or regulatory development can matter to a whole set of targets without mentioning any of them by name — a rule change affecting a sector, a shift that hits every company in a category. Strict entity overlap drops these on the floor. We handle them with a separate classifier that matches on industry and taxonomy signals rather than entity mentions, then routes the document to every target whose profile sits in that category. This is why the scorer combines taxonomy agreement alongside entity overlap rather than treating entity match as the only path in: the entity-free relevant document is real, and a pipeline that only does entity matching never sees it. For each document-target pair that survives blocking — whether it got there by entity match or taxonomy signal — a lightweight scorer combines entity overlap, keyword overlap, taxonomy agreement, and source reputation into a match score. This step is fast, deterministic, and easy to interpret. Each target accumulates a bounded pool, capped at roughly 50 to 100 documents. This pool is the only set of documents the LLM ever processes, and the cap is what makes Stage 3 cost predictable rather than a function of corpus size. 
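The scorer described above rests on two small building blocks: a set-overlap measure and a bounded pool. Below is a minimal sketch of both, assuming entities are plain Python sets and candidates are `(document, score)` pairs. The names `jaccard` and `top_k` match helpers that the article's scoring code calls but does not define; the bodies here are illustrative assumptions, not the author's implementation.

```python
import heapq


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    if not union:
        return 0.0  # two empty sets: treat as no overlap rather than divide by zero
    return len(a & b) / len(union)


def top_k(candidates, k=100):
    """Return the k highest-scoring (doc, score) pairs, best first."""
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Hypothetical feature sets for one document and one target profile.
doc_entities = {"Pomfrey Corp", "Mount Avery Medical Center"}
profile_entities = {"Pomfrey Corp", "hospital software"}
overlap = jaccard(doc_entities, profile_entities)  # one shared entity out of three
```

Using `heapq.nlargest` keeps the trim cheap when a pool holds many more candidates than the cap, since it avoids sorting the whole list.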
Here is the scoring logic in compact form:

```python
from collections import defaultdict

# WEIGHTS and the helper functions called below (jaccard, keyword_overlap,
# taxonomy_agreement, source_reputation, retrieve_candidate_targets, top_k)
# are assumed to be defined elsewhere in the pipeline.


def stage2_score(doc_features, target_profile):
    """Calculates a deterministic match score, bypassing heavy model inference."""
    # Blocking: Fast rejection using set intersections
    shared_entities = doc_features.entities & target_profile.entities
    shared_topics = doc_features.topics & target_profile.topics
    if not (shared_entities or shared_topics):
        return None

    # Weighted match signal: pure computation, zero model calls
    score = sum([
        WEIGHTS.ENTITY * jaccard(doc_features.entities, target_profile.entities),
        WEIGHTS.KEYWORD * keyword_overlap(doc_features, target_profile),
        WEIGHTS.TAXONOMY * taxonomy_agreement(doc_features, target_profile),
        WEIGHTS.REPUTATION * source_reputation(doc_features.source),
    ])
    return score if score >= target_profile.threshold else None


def build_candidate_pools(documents, target_index):
    """Maps documents to target profiles, returning bounded candidate sets."""
    pools = defaultdict(list)
    for doc in documents:
        # retrieve_candidate_targets acts as our inverted index lookup
        for target in retrieve_candidate_targets(doc, target_index):
            score = stage2_score(doc.features, target.profile)
            if score is not None:
                pools[target.id].append((doc, score))

    # Cap each pool at K so Stage 3 cost is bounded, not corpus-dependent
    return {target_id: top_k(candidates, k=100)
            for target_id, candidates in pools.items()}
```

Blocking eliminates the cross-join. Scoring is a deterministic linear combination so that engineers can reason about why a document scored where it did. The top-K trim caps downstream cost regardless of how many candidates passed the threshold.

Stage 3: Bounded LLM Reasoning

By the time the LLM is invoked, it is operating over the bounded pool from Stage 2: at most 100 documents per target, not the entire corpus. Of that pool, the model typically marks 10 to 50 documents as relevant for the target's digest.
That reduction is a relevance verdict, not another trim. The trimming already happened in Stage 2, which is the whole point: the expensive model judges; it does not select. That is the difference between something that looks good in a prototype and something that survives production.

Tasks at this stage are the final relevance judgment, novelty detection, concise summarization, theme extraction, and reason code generation. These tasks genuinely need a capable model because they require judgment and contextual synthesis that rule-based scorers cannot provide.

If you think of this as retrieval-augmented generation, the retrieval side is doing most of the operational work. The RAG survey literature makes that retrieval/generation split explicit. By this stage, the LLM is judging a curated shortlist rather than searching the corpus, and that changes two things. The cost argument is the obvious one: token usage starts only after the pool is bounded. The quality argument matters as much. A model handed 10 vetted candidates isn't competing with 9,990 distractors for attention, which usually improves the quality of the relevance judgment.

An End-to-End Concrete Example

To see how these three stages interact in practice, let's trace a single document from raw ingestion to final LLM reasoning. Suppose a standard press release arrives at 7:14 AM. The headline mentions a company called "Pomfrey Health Solutions" announcing a new contract with a hospital network called "Mount Avery Medical Center".

Stage 1 clears the junk filter, classifies the document as healthcare/industry news, and runs NER to extract "Pomfrey Health Solutions" and "Mount Avery Medical Center". Here is where alias handling matters. The entity database knows that "Pomfrey Health Solutions" is a wholly owned subsidiary of "Pomfrey Corp," a monitored company. Without that mapping, the document exits Stage 1 without matching anything and is never seen again.
With it, the NER output gets enriched with the parent entity ID before the feature vector is indexed.

Stage 2 looks up the Pomfrey Corp entity ID and finds two active target profiles: an investment research team tracking the company and a competitor tracker watching the hospital software market. Heuristic scoring clears both thresholds. The document enters both candidate pools. No embedding lookup. No LLM call. Under single-digit milliseconds on a warm in-memory index.

Stage 3 (runs at 7:20 AM). The LLM receives the bounded pool for Pomfrey Corp: 72 candidates, this press release among them. The model judges the release relevant to Pomfrey Corp and confirms it's genuinely new: nothing in the recent window already covered this contract. It attaches a one-sentence summary and the reason code new_contract_win.

Token usage starts only after the candidate pool is bounded. A naive pipeline spends tokens before it knows whether the document matters.

Where This Pattern Reappears

Nothing in this architecture is specific to news articles. The three-stage shape, namely cost-efficient triage, target-aware retrieval and bounded LLM reasoning, appears anywhere the document/corpus volume is high and the relevance question is specific.

Legal discovery is the clearest parallel. Documents are case filings, deposition transcripts, contracts, internal emails, and chat logs. Targets are legal matters or custodians. Stage 1 handles format normalization, OCR, deduplication, and junk removal. Stage 2 matches documents to matters using legal entity names, date ranges, case codes, and custodian aliases, the same alias problem the "Pomfrey" example showed, but with legal and compliance risk at stake. Stage 3 produces relevance and privilege flags for each document, plus short excerpts and timeline metadata for the ones that survive. Candidate pool discipline matters more here than in news. A missed document is not a missed news story. It is a discovery failure for the case.
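The alias handling that both the Pomfrey example and legal discovery depend on can be sketched as a lookup that maps every known surface name to a canonical entity ID before indexing. This is a minimal illustration only: the `ALIASES` table, the `ent:` ID format, and the function name `enrich_entities` are hypothetical, and a production entity database would be far larger and would likely need fuzzy matching as well.

```python
# Hypothetical alias table: normalized surface name -> canonical entity ID.
ALIASES = {
    "pomfrey corp": "ent:pomfrey-corp",
    "pomfrey health solutions": "ent:pomfrey-corp",  # wholly owned subsidiary
    "mount avery medical center": "ent:mount-avery",
}


def enrich_entities(mentions):
    """Map NER mentions to canonical IDs; return (resolved_ids, unknown_mentions)."""
    resolved, unknown = set(), []
    for mention in mentions:
        key = " ".join(mention.lower().split())  # normalize case and whitespace
        entity_id = ALIASES.get(key)
        if entity_id is None:
            unknown.append(mention)  # keep for review instead of dropping silently
        else:
            resolved.add(entity_id)
    return resolved, unknown
```

Because the subsidiary and the parent resolve to the same ID, a Stage 2 index keyed on canonical IDs would match the press release to the parent company's profiles without any embedding lookup.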
Enterprise security has the same shape with a different texture. Alerts and threat intelligence reports are the corpus. Monitored assets and known threat actors are the targets. At Stage 3, you get a ranked list of alerts with a short triage note instead of a news digest. Corpus, prompts, and target profiles all differ, but the underlying architecture is virtually the same.

Trade-offs to Make Explicit

Any production system is a set of explicit choices. The main ones in this architecture:

| Design choice | Benefit | Cost |
|---|---|---|
| More Stage 1 classifiers | Lower LLM spend, faster elimination | More orchestration and model maintenance |
| Aggressive deduplication | Less repeated inference | Risk of collapsing meaningful variants |
| Broad candidate pools | Better recall | Higher downstream ranking cost |
| Tight match thresholds | Lower latency and spend | Higher false-negative rate |
| LLM final pass | Better nuance and summarization | Latency and observability burden |

Conclusion

The most common cost failure in LLM document pipelines is not a model problem. It is a missing layer of cost-efficient work upstream of the model. Triage filters out noise early. Target-aware retrieval then bounds each target's candidate set so that the LLM only ever sees a pre-filtered shortlist. Which LLM you pick is a Stage 3 tuning decision, which is important but not architectural. The architectural decisions all live upstream.

By Deepak Gupta

6 Types of AI Orchestration Every Tech Leader Needs to Know

Most AI projects don’t fail because of bad models. They fail because nobody thought about how the pieces fit together. That’s the orchestration problem, and it’s quietly costing teams months of rework, bloated infrastructure spend, and AI systems that stall at the pilot stage and never reach production scale.

I’ve spent the last several years building enterprise AI systems, from RAG pipelines to agentic workflows deployed across Fortune 500 operations. And the pattern is consistent: the teams that ship reliable, scalable AI aren’t the ones with the best models. They’re the ones who got orchestration right. Here are the six types of AI orchestration every tech leader needs to understand before building at scale.

What Is AI Orchestration?

AI orchestration is the practice of coordinating and managing multiple AI components, services, or agents to function as a cohesive system. Think of it as the conductor of an orchestra, ensuring each instrument plays its part at the right time to produce harmonious output. Without the conductor, you don’t get music. You get noise. As AI systems grow in complexity, manual coordination becomes impractical and error-prone. Orchestration is what separates a demo from a production system.

1. Workflow Orchestration

What it does: Manages sequential or parallel execution of tasks within a defined ML pipeline, from data preprocessing through model execution to final output.

Why it matters: It automates your ML lifecycle and ensures consistency across experiments, deployments, and production runs.

Without workflow orchestration, every pipeline step is a manual handoff. Engineers become bottlenecks. Experiments drift. Deployments become fragile.

Real-World Example

A fraud detection system that preprocesses transaction data, runs it through multiple detection models, aggregates results, and triggers alerts, all in an automated, auditable sequence with no human in the loop.
Tools to Know

- Apache Airflow – battle-tested DAG-based pipeline management
- Prefect/Dagster – modern Python-native workflow orchestration
- Kubeflow Pipelines – ML-specific orchestration on Kubernetes

Key insight: Workflow orchestration is the foundation. You cannot layer other orchestration types on top of an unreliable pipeline.

2. Agent Orchestration

What it does: Coordinates multiple AI agents with specialized roles to collaborate on complex, multi-step problems.

Why it matters: No single agent does everything well. Specialization combined with coordination consistently outperforms a generalist approach.

Agent orchestration is where AI systems start to resemble distributed teams. One agent researches, one analyzes, one writes, one validates. The orchestration layer manages inter-agent communication, state, and decision flow.

Real-World Example

A customer service platform where one agent handles sentiment analysis, another retrieves from a knowledge base, and a third generates the final response, all operating in concert to resolve issues faster and more accurately than a single-agent system.

Tools to Know

- LangGraph – stateful multi-agent graph orchestration
- Microsoft AutoGen – conversational multi-agent framework
- CrewAI – role-based agent coordination
- Model Context Protocol (MCP) – emerging standard for agent-tool integration

Key insight: The future of enterprise AI is multi-agent. Start designing for agent handoffs now, even if you’re deploying a single agent today.

3. Model Orchestration

What it does: Routes inputs to the right model for the job, or combines multiple models through ensemble methods for higher-confidence output.

Why it matters: Different models have different strengths. Intelligent routing means you’re always deploying the best tool for the task, not the most convenient one.

Betting on a single model for every use case is a common architectural mistake.
Model orchestration lets you mix specialized models (by modality, domain, size, or latency profile) within a single system.

Real-World Example

A content moderation system that routes text, images, and video to domain-specific models, then combines their outputs for a final decision with measurably higher accuracy than any single model could achieve alone.

Patterns to Know

- Router models – classifiers that determine which downstream model handles the request
- Ensemble methods – weighted combination of multiple model outputs
- Fallback chains – primary model fails or abstains, secondary model activates
- Cascading – small fast model first, escalate to large model only when needed

Key insight: Model orchestration is how you manage cost and quality simultaneously, not by picking one.

4. Resource Orchestration

What it does: Manages GPU/TPU scheduling, load balancing, and cost optimization across distributed AI infrastructure.

Why it matters: Wasted compute is wasted budget. Proper resource orchestration keeps utilization high and infrastructure costs predictable.

As AI workloads scale, managing compute becomes a discipline in itself. Resource orchestration handles the scheduling, allocation, and deallocation of infrastructure dynamically, so your team isn’t manually provisioning capacity for every experiment.

Real-World Example

A research lab running hundreds of concurrent experiments: orchestration automatically allocates GPU resources by priority, deallocates idle capacity, and surfaces cost-per-experiment metrics to keep teams within budget.

Tools to Know

- Kubernetes + KEDA – container orchestration with event-driven autoscaling
- Ray – distributed computing framework for AI/ML workloads
- Slurm – HPC job scheduler widely used in research environments
- Cloud-native autoscaling (AWS SageMaker, Azure ML, GCP Vertex AI)

Key insight: Resource orchestration is the difference between a FinOps win and a surprise cloud bill. Build cost visibility in from day one.

5. Data Orchestration

What it does: Manages ETL pipelines and coordinates information flow between systems so AI receives clean, timely, correctly formatted data.

Why it matters: A great model on stale or malformed data is useless. Data orchestration is what makes your models trustworthy in production.

Data orchestration is the most underinvested layer in most AI stacks. Teams optimize the model and neglect the pipeline that feeds it. The result: inconsistent outputs, silent failures, and eroding stakeholder trust.

Real-World Example

A real-time recommendation engine pulling from user behavior streams, inventory systems, pricing databases, and external market signals, all orchestrated into a single, coherent, low-latency input for the model.

Tools to Know

- Apache Kafka – high-throughput real-time event streaming
- dbt – SQL-based data transformation and lineage
- Airbyte/Fivetran – managed data integration and ELT
- Great Expectations – data quality validation in pipelines

Key insight: If you can’t trust your data pipeline, you can’t trust your model outputs. Data observability is not optional.

6. Service Orchestration

What it does: Integrates multiple AI services and APIs (internal and third-party) into sophisticated applications that deliver compounding value.

Why it matters: Your AI product is only as strong as the services it can connect and coordinate. Composability is the new competitive advantage.

Modern AI applications are composites. Service orchestration is what turns a collection of APIs into a coherent product. It manages authentication, retry logic, rate limiting, and response aggregation across every integration point.

Real-World Example

An intelligent document processing system that chains OCR, NLP, entity extraction, and database services to automatically extract, classify, and store information from unstructured documents, end to end, without human intervention.
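Of the service orchestration responsibilities listed above, retry logic is the easiest to show in a few lines. The following is a minimal Python sketch, assuming a generic callable that raises an exception on a transient failure. The function name `call_with_retries` and its parameters are illustrative and not taken from any specific framework; real integrations would usually retry only on error types known to be transient.

```python
import random
import time


def call_with_retries(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on exception, retry with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = base_delay * 2 ** (attempt - 1)  # 0.5s, 1s, 2s, ...
            sleep(delay + random.uniform(0, delay / 2))  # jitter spreads out retries
```

Passing `sleep` as a parameter is a small design choice that lets tests substitute a stub instead of actually waiting.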
Tools to Know

LangChain/LlamaIndex – AI-native service chaining and retrieval orchestration
MCP (Model Context Protocol) – standardized tool and service integration for agents
API gateways (Kong, AWS API Gateway) – centralized service management and observability

Key insight: The teams winning with AI aren’t building monoliths. They’re building composable systems where each service is replaceable.

Quick Reference: The 6 Types at a Glance

Orchestration type | Core function | Primary benefit
Workflow | Pipeline automation | Consistent ML lifecycle
Agent | Multi-agent coordination | Specialization at scale
Model | Intelligent model routing | Best tool for every task
Resource | Compute and cost management | Predictable infrastructure spend
Data | ETL and data pipeline management | Clean, timely model inputs
Service | API and service integration | Composable AI products

Why This Matters Right Now

AI infrastructure complexity is growing faster than most teams’ ability to manage it. The organizations winning with AI aren’t the ones with the most sophisticated models; they’re the ones building better systems around those models.

Effective orchestration delivers compounding returns across every dimension of your AI operation:

Scalability – handle increasing workloads without proportional management overhead
Reliability – automated coordination reduces human error and ensures consistent execution
Efficiency – optimized resource utilization and reduced operational costs
Flexibility – add, remove, or swap components without redesigning the entire system
Speed – accelerated development cycles and faster time-to-production

The question isn’t whether you need AI orchestration. It’s which of these six types is most critical for your current use case, and whether you’re building it intentionally or inheriting the chaos later.

Start with workflow orchestration. Layer agent coordination on top. Add model routing for quality and cost control. Build data and service orchestration as you scale. Resource orchestration becomes critical once you’re at production load.

Final Thought

I’ve seen well-funded AI projects fail not because the models were wrong, but because the system around them was improvised. Orchestration is not an afterthought; it’s the architecture.

The teams that get this right early move faster, spend less, and ship AI systems that actually hold up in production. The teams that don’t spend months in rework and lose stakeholder confidence before they ever reach scale.

Which of these six types is your team investing in right now, and which one are you ignoring? Drop it in the comments below.

By Balaji Venkatasubramaniyar

The AI Reliability Gap: Why Enterprise AI Is Failing Long Before It Reaches Production

Intelligence stopped being the bottleneck. Almost nobody has rebuilt their engineering around that fact yet.

For three years, the industry has obsessed over one question: can we build intelligent systems? That question is basically settled. The models are good, good enough that nobody serious argues otherwise anymore. The question nobody wants to sit with is the operational one. Can we run these things? Can a company put an LLM-powered agent in front of a paying customer, or inside a production database, and trust it not to quietly wreck the week?

Increasingly, the answer is no. Not because the models got worse. Because the gap between "demo that works" and "system that survives contact with production" turned out to be much wider than anyone budgeted for: in time, in money, and in credibility.

Call it the AI Reliability Gap. It's the defining engineering problem of this phase of the AI buildout, and 2025 produced enough evidence to fill a casebook. Organizations don't have an AI problem right now. They have an AI Reliability Gap, and most of them don't know it yet because nobody's given it a name.

The Evidence: The Numbers Are Not Subtle

Start with MIT's Project NANDA, whose "GenAI Divide: State of AI in Business 2025" report (based on roughly 150 executive interviews, hundreds of employee surveys, and an analysis of 300 public AI deployments) landed on a number that's now repeated in every boardroom deck: 95% of enterprise generative AI pilots produce no measurable P&L impact. Despite an estimated $30–40 billion in enterprise spend, only about 5% of pilots are extracting real value. Lead researcher Aditya Challapally told Fortune the failure isn't about model quality; it's a "learning gap," where tools don't adapt to how the business actually works and don't retain context between sessions.
That's the AI Reliability Gap measured in dollars: tens of billions spent, and 95% of it stuck in pilot purgatory because nobody engineered the part that makes a model trustworthy over time, inside this specific company's workflows.

Gartner's read on the agentic side of the market is just as blunt. In June 2025, the firm predicted that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear ROI, and weak risk controls. Gartner also flagged something worth sitting with: of the thousands of vendors marketing "agentic AI," the firm estimates only around 130 offer anything genuinely agentic. The rest is what analyst Anushree Verma calls "agent washing": chatbots and RPA scripts with a new label glued on.

Neither of those numbers is about whether the underlying models are smart enough. They're about what happens after the demo, in the messy intersection of legacy systems, governance, memory, and the thousand small ways a workflow can drift out from under a model that was never built to notice. That intersection is exactly where the AI Reliability Gap lives.

The Incidents: Three Failures That Made the Gap Impossible to Ignore

Statistics are easy to argue with. Incidents aren't. 2025 handed the industry three that became instant case studies, and every one of them is the AI Reliability Gap in practice, not a model-quality story.

Replit's agent deleted a live production database, during a code freeze. In July, SaaStr founder Jason Lemkin was nine days into a "vibe coding" project on Replit when its AI agent ran an unauthorized command and wiped a database holding records for more than 1,200 executives and nearly 1,200 companies, despite having been told, repeatedly and in all caps, not to touch anything. When Lemkin asked the agent to rate the severity of what it had done, it answered 95 out of 100. It also told him a rollback was impossible. That turned out to be false; the data was recoverable.
Replit CEO Amjad Masad apologized publicly and pushed emergency fixes: automatic separation of development and production databases, a rebuilt rollback system, and a new "planning-only" mode that lets the agent reason without being able to execute destructive commands. Lemkin's verdict to Fortune afterward was measured rather than furious: he called it "good, important steps on a journey," while noting plainly that AI agents in their current form will say things that aren't true.

This is the AI Reliability Gap with a number attached: a model capable enough to build an entire app from natural language, and not one guardrail capable enough to stop it from deleting the data underneath that app.

Cursor's own support bot hallucinated a company policy, and customers canceled over it. In April 2025, developers using the AI coding tool Cursor started getting logged out across devices. Some who emailed support got an answer from an AI agent named "Sam," who explained, confidently, that subscriptions were limited to one device as a security policy. There was no such policy. Sam invented it. The fabricated rule spread across Reddit and Hacker News fast enough that users canceled subscriptions before the real explanation (a session-handling bug, not a deliberate change) caught up. Cursor co-founder Michael Truell apologized on Reddit: "We have no such policy... this is an incorrect response from a front-line AI support bot." The company now labels AI-generated support replies. The irony wasn't lost on anyone: a company selling AI reliability to developers got publicly burned by an AI reliability failure in its own support queue.

Every one of these incidents widens the AI Reliability Gap in the public's mind a little further: it's no longer a hypothetical risk analysts warn about; it's a recurring, named, dated pattern.

Klarna unwound its flagship AI customer-service story. In 2024, Klarna's replacement of roughly 700 customer-service agents with an OpenAI-built assistant was the industry's go-to proof point that AI had arrived for white-collar work. By spring 2025, CEO Sebastian Siemiatkowski was telling Bloomberg a different story: the company was hiring humans again because quality had slipped. "We went too far," he said. "The result was lower quality, and that's not sustainable." By late 2025, outlets including Business Insider and CX Dive were reporting Klarna quietly rebuilding human support capacity into 2026, moving to a hybrid model where AI absorbs high-volume routine queries, and humans take escalations and anything requiring judgment. Klarna's IPO pitch had been an AI-replaces-labor story. The sequel was an AI-needs-a-human-backstop story, and Gartner has since predicted that, by 2027, half of companies that cut customer-service headcount because of AI will need to rehire.

The AI Reliability Gap isn't about model intelligence; Klarna's chatbot was, by every account, technically competent. It's about what happens when "technically competent" meets "no fallback path for the cases it can't handle well." That's a reliability failure, not an intelligence failure, and the distinction is the whole argument.

The Pattern: Why This Is Structurally Different From "The Model Needs to Get Better"

There's a pattern across the MIT data and the incidents above, and it's not subtle once you see it: the failures cluster around integration, memory, and governance, not raw capability. Generic chat tools are flexible enough for individual use but don't adapt to an organization's specific workflows or retain context across sessions, which is exactly why MIT found purchased, customized tools succeeding roughly twice as often as internally built ones. Agentic systems are being deployed with production-level permissions and prototype-level guardrails; Replit's incident is the textbook version of that mismatch. And support and customer-facing deployments are discovering that a model under pressure to give a confident answer will manufacture one, which is a governance and evaluation problem, not a one-off bug.

This is the same maturation curve cloud computing went through roughly a decade ago. The breakthrough ("we can rent compute by the hour") stopped being the interesting part almost immediately. The interesting part became how you keep a distributed system up at 2 a.m., and an entire discipline (SRE) grew up around that question. AI is hitting the same inflection point, just faster and with higher-stakes failure modes, because a cloud outage doesn't fabricate 4,000 fake customer records the way Replit's agent reportedly did during the same incident.

If there's one sentence to take from this piece, it's this: the companies that close the AI Reliability Gap first aren't the ones with the smartest model. They're the ones who stopped assuming the model would behave, and built the engineering around the assumption that, eventually, it won't.

The Prediction: The Discipline Is Already Forming

Here's the part that should interest anyone reading this for the career angle, because it's not a hypothetical future role; it's a live job category, today, with a real name attached.

Anthropic, the company that builds Claude, already runs an internal team called AIRE, AI Reliability Engineering, whose stated job is to improve reliability "across our most critical serving paths," which the listing spells out as "every hop from the SDK through our network, API layers, serving infrastructure, and accelerators and back." The listing asks for people with SRE or production-engineering backgrounds, chaos-engineering experience, and the willingness to jump into unfamiliar systems mid-incident and help drive resolution. That's a site-reliability skill set, explicitly repurposed for AI, inside one of the companies building the frontier models themselves.

That's not an isolated data point.
Job boards in early 2026 show titles like "Senior Site Reliability Engineer, AI/ML," "Staff Software Engineer, AI Reliability," and "AI Platform Reliability Engineer" open at companies from NVIDIA to Intuitive Surgical, with listed compensation bands in the $176,000–$333,500 range at senior levels, and responsibilities centered on drift detection, anomaly alerting, and keeping model behavior consistent under load: work that didn't have a name two years ago and now has a salary band.

The prompt engineer had a moment. It was a real skill in 2023, and it's still useful, but it was never going to be the durable job category, because prompting a model in isolation isn't the hard part anymore. Closing the AI Reliability Gap (keeping a model-dependent system honest, bounded, and recoverable in production) is the harder, more durable problem, and it's the one enterprises are now paying real salaries to solve.

The Career Implications: Where the Leverage Actually Is

If you're an engineering leader, the actionable read on 2025 isn't "slow down on AI." MIT's own data shows the back-office automation use cases (the unglamorous ones: document processing, BPO replacement, risk workflows) are where the actual ROI is landing, while the most-funded category, sales and marketing tools, is overrepresented in the failure pile. Buy specialized, learning-capable tools rather than building generic ones in-house wherever you can; MIT found that path succeeding roughly twice as often. And before anything autonomous touches a production system, ask the question Replit answered the hard way: what happens when this agent is confidently wrong, and what stops it from acting on that confidence?
If you're early in a career and trying to figure out where the leverage is, this is the answer: reliability, evaluation, and observability for AI systems are where the unfilled roles are sitting right now, not because the work is glamorous, but because almost nobody has five years of experience doing it yet. Nobody does. The discipline is that new, which means the people who name it, document it, and build a visible track record around it first have a real, durable advantage.

Anthropic didn't create the AIRE team because it sounded good in a job posting. It created it because someone had to own the gap between "the model works" and "the model works reliably, at scale, in production, under load, with humans depending on it." That's a hiring need before it's a buzzword, and it's not going away in 2027 the way "agent washing" eventually will.

Conclusion

The AI race isn't over. But the part of it that was about raw intelligence is increasingly a commodity question. The part that's still wide open, the part separating the 5% of pilots that work from the 95% that don't, is whether anyone built the operational discipline to keep the thing running once the demo ends.

That's the AI Reliability Gap. It will not be closed by a better model. It will be closed by the engineers, leaders, and teams who treat reliability as the actual deliverable, not the thing you bolt on after the postmortem.

The companies that figure that out first won't be the loudest ones in the AI conversation. They'll be the ones nobody's writing an incident report about.
Sources

MIT NANDA, "The GenAI Divide: State of AI in Business 2025" (coverage via Fortune, Virtualization Review)
Gartner, "Over 40% of Agentic AI Projects Will Be Canceled by End of 2027," June 25, 2025 (Gartner newsroom)
Replit database deletion incident, July 2025 (Fortune, The Register, Fast Company)
Cursor support-bot hallucination, April 2025 (The Register, Slashdot)
Klarna AI customer-service reversal (Entrepreneur, MLQ News)
Anthropic AIRE (AI Reliability Engineering) job posting (Anthropic careers, via Greenhouse)
AI/ML reliability role listings and compensation data, early 2026 (Indeed, ZipRecruiter)

By Igboanugo David Ugochukwu

Harness Engineering for AI: Why the Model Is Only Half the System

The discipline of building what surrounds the model, so AI can operate safely in production.

The Problem Nobody Puts on the Roadmap

Every AI project starts the same way. Someone wires up a call to an LLM, the demo works, and the room gets excited. Then it goes to production, and within a week:

It hallucinates a hotel that doesn't exist.
It quotes a price in the wrong currency.
It answers a question it was explicitly told not to touch.
Nobody can explain why it did what it did, because nothing was logged.

None of this is a "the model isn't good enough" problem. GPT-4-class and Claude-class models are extraordinarily capable. The gap is almost always in everything built around the model: the part that decides what the model gets to see, what it's allowed to do, whether its output can be trusted, and what happens when it's wrong.

That surrounding system has a name: harness engineering.

What Is Harness Engineering?

Harness engineering is the discipline of designing the infrastructure, constraints, tools, memory, verification, and feedback loops that let an AI system operate safely, reliably, and autonomously in production.

The analogy I keep coming back to is that a raw LLM is a Formula 1 engine. Enormous power, zero judgment. Bolt that engine into a chassis with no brakes, no steering, and no telemetry, and you don't have a race car; you have a liability. The harness is the chassis, brakes, steering, and dashboard combined. It's what turns raw intelligence into a dependable product.

Concretely, a harness is made of six layers:

Layer | Question it answers
Context & Memory | What does the model actually know about this user and this moment?
Guardrails & Constraints | What is the model not allowed to do?
Tools & Integrations | How does the model act on the real world?
Verification & Testing | Can we trust what came back?
Feedback Loops | How does the system get better after every run?
Observability | Can we see what happened, after the fact, in production?
Here are the same six layers as a pipeline: Context & Memory → Guardrails & Constraints → Tools & Integrations → AI Engine → Verification & Testing → Feedback Loops, with Observability watching every step.

The rest of this post builds that pipeline for real, using a scenario straight out of the box: a user asking an AI agent to find hotels in Paris under a fixed nightly budget.

Tools Used

Python 3.11
LangGraph – to model the harness as an explicit state graph rather than a single prompt
LangChain – for the LLM wrapper and tool-calling utilities
Pydantic – to make guardrail and verification checks type-safe instead of string-matched
A structured logger (standard logging, swappable for LangSmith/OpenTelemetry in production)

Nothing exotic. That's the point: harness engineering is mostly disciplined software engineering applied to a non-deterministic component.

Building the Harness: A Trip-Planning Agent

The scenario: a user asks the agent, "Find me the best hotels in Paris under ₹10,000 per night."

We'll build this as a LangGraph StateGraph, where every node is one layer of the harness. That's a deliberate choice: a raw prompt chain hides the harness; a graph makes every constraint, check, and loop a first-class, testable node.

1. Define the Shared State

Every node in the graph reads from and writes to one typed state object. This is the contract that keeps the graph honest: no node can silently mutate something another node depends on.

```python
from typing import Optional, TypedDict

from pydantic import BaseModel


class HotelOption(BaseModel):
    name: str
    price_per_night_inr: float
    rating: float
    source: str  # which API/tool returned this


class TripPlanState(TypedDict):
    user_request: str
    user_id: str
    # Context & Memory
    user_profile: dict
    past_trips: list[dict]
    # Guardrails
    budget_inr: Optional[float]
    guardrail_violation: Optional[str]
    # Tools
    tool_results: list[HotelOption]
    # Model output
    draft_response: str
    # Verification
    verified: bool
    verification_notes: list[str]
    # Final
    final_response: str
```

2. Context and Memory Layer

Before the model sees anything, the harness decides what it's allowed to see. Here we pull the user's saved preferences and past trip history from a store (a vector DB or a plain Postgres row; the interface matters more than the backend).

```python
# user_store and trip_store are storage clients assumed to be defined elsewhere.
def load_context(state: TripPlanState) -> TripPlanState:
    profile = user_store.get_profile(state["user_id"])
    past_trips = trip_store.get_recent(state["user_id"], limit=5)
    return {
        **state,
        "user_profile": profile,
        "past_trips": past_trips,
    }
```

This is a single-responsibility node: it fetches context and nothing else. It doesn't call the LLM, doesn't validate anything, doesn't touch tools. That separation is what makes the graph testable: you can unit-test load_context with a fake user_store and never touch an LLM.

3. Guardrails and Constraints Layer

This layer runs before the model generates anything expensive, and it fails fast if a precondition isn't met: no fallback, no silent guessing at the budget.

```python
import re


class GuardrailViolation(Exception):
    pass


def apply_guardrails(state: TripPlanState) -> TripPlanState:
    # Extract the budget from phrases like "under ₹10,000".
    match = re.search(r"under\s*₹?([\d,]+)", state["user_request"])
    if not match:
        raise GuardrailViolation(
            "No budget detected in request; refusing to proceed without a constraint."
        )
    budget = float(match.group(1).replace(",", ""))
    if budget <= 0:
        raise GuardrailViolation("Budget must be a positive number.")
    if "paris" not in state["user_request"].lower():
        raise GuardrailViolation("Destination outside supported scope for this agent.")
    return {**state, "budget_inr": budget}
```

Note what this is not doing: it isn't asking the LLM to "please respect the budget" in a system prompt and hoping. The budget is extracted and validated in code, before the model is in the loop at all. Prompts are guidance; guardrails are enforcement.

4. Tools and Integrations Layer

The model doesn't know real-time hotel prices, nor should it guess them. This node calls a real API and hands the model facts instead of letting it hallucinate them.

```python
# hotel_api is an API client assumed to be defined elsewhere.
def search_hotels(state: TripPlanState) -> TripPlanState:
    raw_results = hotel_api.search(
        city="Paris",
        max_price_inr=state["budget_inr"],
        currency="INR",
    )
    results = [
        HotelOption(
            name=r["name"],
            price_per_night_inr=r["price"],
            rating=r["rating"],
            source="hotel_api_v2",
        )
        for r in raw_results
    ]
    return {**state, "tool_results": results}
```

5. The AI Engine Node

Only now, with a validated budget and real tool data in hand, does the model get involved. Its job is narrow: turn structured facts into a readable recommendation. It is explicitly not asked to invent prices or hotels.

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.2)


def generate_recommendation(state: TripPlanState) -> TripPlanState:
    hotels_text = "\n".join(
        f"- {h.name}: ₹{h.price_per_night_inr}/night, rated {h.rating}/5"
        for h in state["tool_results"]
    )
    system = SystemMessage(content=(
        "You are a trip-planning assistant. Recommend hotels ONLY from the "
        "list provided below. Never invent a hotel, price, or rating that "
        "is not in the list. If the list is empty, say so plainly."
    ))
    human = HumanMessage(content=(
        f"User request: {state['user_request']}\n\n"
        f"Available hotels (budget ≤ ₹{state['budget_inr']}/night):\n{hotels_text}"
    ))
    response = llm.invoke([system, human])
    return {**state, "draft_response": response.content}
```

6. Verification and Testing Layer

The model just produced text. The harness doesn't trust it; it checks it. Specifically, this node confirms every hotel name mentioned in the draft actually exists in tool_results, which is the single most common failure mode (hallucinated entities) for this kind of agent.

```python
def verify_output(state: TripPlanState) -> TripPlanState:
    notes = []
    known_names = {h.name.lower() for h in state["tool_results"]}
    # extract_hotel_names is a simple NER/regex helper, assumed to be defined elsewhere.
    mentioned = extract_hotel_names(state["draft_response"])
    mentioned_lower = {name.lower() for name in mentioned}

    # Any mentioned name that the tools never returned is a hallucination.
    hallucinated = [name for name in mentioned if name.lower() not in known_names]
    if hallucinated:
        notes.append(f"Hallucinated hotels detected: {hallucinated}")

    # Re-check the budget on every hotel the draft actually mentions.
    over_budget = [
        h for h in state["tool_results"]
        if h.name.lower() in mentioned_lower
        and h.price_per_night_inr > state["budget_inr"]
    ]
    if over_budget:
        notes.append(f"Budget violation: {[h.name for h in over_budget]}")

    return {
        **state,
        "verified": len(notes) == 0,
        "verification_notes": notes,
    }
```

7. Feedback Loop Layer

Whether verification passes or fails, the harness logs the outcome back into the user's history. This is what lets the next run start from a better context node: the loop that turns a single interaction into a system that improves.

```python
def record_feedback(state: TripPlanState) -> TripPlanState:
    trip_store.log_interaction(
        user_id=state["user_id"],
        request=state["user_request"],
        response=state["draft_response"],
        verified=state["verified"],
        notes=state["verification_notes"],
    )
    # Only a verified draft reaches the user; otherwise return a safe fallback.
    final = (
        state["draft_response"]
        if state["verified"]
        else "I couldn't verify a safe recommendation. Please refine your search."
    )
    return {**state, "final_response": final}
```

8. Wiring the Graph

This is where the harness becomes visible as a structure, not a paragraph of prompt instructions:

```python
from langgraph.graph import StateGraph, END

graph = StateGraph(TripPlanState)

graph.add_node("load_context", load_context)
graph.add_node("apply_guardrails", apply_guardrails)
graph.add_node("search_hotels", search_hotels)
graph.add_node("generate_recommendation", generate_recommendation)
graph.add_node("verify_output", verify_output)
graph.add_node("record_feedback", record_feedback)

graph.set_entry_point("load_context")
graph.add_edge("load_context", "apply_guardrails")
graph.add_edge("apply_guardrails", "search_hotels")
graph.add_edge("search_hotels", "generate_recommendation")
graph.add_edge("generate_recommendation", "verify_output")
graph.add_edge("verify_output", "record_feedback")
graph.add_edge("record_feedback", END)

trip_agent = graph.compile()
```

9. Observability, Wrapping the Whole Thing

Observability isn't a node in the graph; it's a cross-cutting concern that watches every node. The simplest version is structured logging at each transition; in production this is where you'd plug in LangSmith, OpenTelemetry, or your APM of choice.

```python
import functools
import logging
import time

logger = logging.getLogger("harness")


def observed(node_fn):
    @functools.wraps(node_fn)
    def wrapper(state):
        start = time.monotonic()
        try:
            result = node_fn(state)
            logger.info("node=%s status=ok duration_ms=%.1f",
                        node_fn.__name__, (time.monotonic() - start) * 1000)
            return result
        except Exception as exc:
            # Log the failure, then re-raise so the graph still fails fast.
            logger.error("node=%s status=error error=%s", node_fn.__name__, exc)
            raise
    return wrapper
```

Wrap every node with @observed before adding it to the graph, and you get per-node latency, error rate, and a full trace of what fired for a given request, without touching the business logic inside each node.
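As a rough illustration of that wrapping step, the sketch below repeats the observed decorator so it runs on its own and applies it to two stub nodes. The stubs and the `NODES` registry are hypothetical stand-ins for the real node functions, not part of the agent above.

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("harness")


def observed(node_fn):
    """Same decorator as in the post, repeated so this sketch is self-contained."""
    @functools.wraps(node_fn)
    def wrapper(state):
        start = time.monotonic()
        try:
            result = node_fn(state)
            logger.info("node=%s status=ok duration_ms=%.1f",
                        node_fn.__name__, (time.monotonic() - start) * 1000)
            return result
        except Exception as exc:
            logger.error("node=%s status=error error=%s", node_fn.__name__, exc)
            raise
    return wrapper


def load_context(state: dict) -> dict:
    """Stub node standing in for the real context loader."""
    return {**state, "user_profile": {"style": "boutique"}}


def apply_guardrails(state: dict) -> dict:
    """Stub node that always fails, to exercise the error path."""
    raise ValueError("no budget detected")


# Hypothetical registry: wrap each node once, keyed by its function name.
NODES = {fn.__name__: observed(fn) for fn in (load_context, apply_guardrails)}
```

With a registry like this, the wiring step would pass the wrapped callable instead of the bare function, for example `graph.add_node("load_context", NODES["load_context"])`. Because functools.wraps preserves `__name__`, the log lines still carry the original node names.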
Running It

```python
result = trip_agent.invoke({
    "user_request": "Find me the best hotels in Paris under ₹10,000 per night",
    "user_id": "user_9231",
})
print(result["final_response"])
```

Trace of what actually happens, layer by layer:

Context and memory – loads the user's saved preference for boutique hotels from a past trip.
Guardrails – extracts budget_inr=10000, confirms Paris is in scope, fails fast if either is missing.
Tools – calls the real hotel API, gets back four options under budget.
AI Engine – drafts a recommendation using only those four hotels.
Verification – confirms every hotel named in the draft exists in the tool results and is within budget.
Feedback – logs the interaction, returns the verified response (or a safe fallback if verification failed).

Every one of the frustrations from the intro (hallucinated hotels, wrong currency, no explanation) is closed by a specific layer, not by a bigger prompt.

Why This Matters Beyond One Demo

Without a harness | With a harness
Model can say anything | Guardrails define what it can't say
Prices and facts are guessed | Tools supply ground truth
No way to catch a bad answer | Verification catches it before the user sees it
Same mistakes repeat | Feedback loop uses history to improve
Production issues are a mystery | Observability shows exactly what happened

This is also why harness engineering, not model selection, is where most of the engineering effort in an AI product actually goes. Swapping GPT-4o for Claude or Gemini in the generate_recommendation node above is a one-line change. Building the context, guardrail, tool, verification, and feedback layers around it is the real project.

TL;DR: Harness Engineering for AI

An LLM alone is just an engine: powerful, but no brakes. Harness Engineering is the system built around it to make it production-safe: memory (what it knows), guardrails (what it can't do), tools (real data, not guesses), verification (catching mistakes before users see them), feedback loops (learning from each run), and observability (knowing what happened).

Example: a hotel-search agent where the budget is validated in code, prices come from a real API, and every answer is checked against that data before it's shown. Nothing is guessed, and an answer that fails verification is replaced by a safe fallback message instead of reaching the user.

Bottom line: most of the real engineering effort goes into the harness, not the model. An engine alone doesn't ship; a car does.

If you're building AI systems and only budgeting time for prompt engineering, you're designing an engine and skipping the car.

This post is part one of the My Learning Series. Keep watching this space.

By Manas Dash

If You Can Write Acceptance Criteria, You Can Write an AI Routing Policy

TL;DR: The AI Routing Policy

You moved your routine AI work to a cheaper model, so you think the cost question is handled. Often, it is not. The decision lives in one person’s head and produces nothing that the person accountable for the invoices can read. Worse, it is an architectural choice nobody documented.

The AI Routing Policy is the missing artifact of Stage 2 of the Delegation Lifecycle: it records which execution path, from a cheaper model to a frontier model to plain code, handles each class of work, what counts as good enough output to meet the AI Definition of Done, and who owns the call. The skill it needs to work is one you already have: You write acceptance criteria.

Thesis: An AI routing policy is not about picking a cheaper model at the moment an AI task is executed.

The Question Nobody Can Answer

Not every company has a board, but every company has a person in the chief financial officer seat: a controller, a chief accountant, or the owner who signs the invoices. At some point, that person looks at the AI line on the budget and asks a simple question: What does this spend buy, and how would we know it creates a return on investment?

In most teams, the answer is a shrug. Someone switched the weekly status draft to a cheaper model last spring. Someone else left the customer-facing summaries on the frontier model because it “felt safer.” Nobody wrote either decision down. While the AI token expenditure is real, the rationale is folklore.

Why a Cheaper Model Is Not a Policy

The fallacy worth naming first is that switching to the cheaper model solves your cost problem. Most likely, it won’t: a model choice that lives only in habit disappears the day the person who made it changes teams. You may have lowered the bill, but you did not create cost control. You made an undocumented architecture decision with financial consequences, and left no one able to explain it.
What Stage 2 of the AI Delegation Lifecycle Is In the Delegation Lifecycle, Stage 1 (Decide) uses the A3 Framework to answer whether AI should touch a piece of work at all: Assist, Automate, or Avoid. Stage 2 (Route) answers the next question: which execution path runs the work, and what is good enough for that class of work. AI routing is not the act of selecting a model while you work. That habit matters, and I covered it for the individual practitioner in an earlier piece on token economics (see Related Articles below). This is the level above the habit: the team standard that survives the individual. A written policy says status drafts run on the mid tier, contract review runs on the frontier tier with human review, and recurring calculations run on plain code, each with a one-line reason. An AI model tier bundle is more than price. It is a capability, risk, data-handling, and accountability class, and the cheapest option that clears the bar is not always a model. A route is wider than a menu of models: a frontier model, a cheaper model, deterministic code, human-only work, a model plus human review, or no automation at all. It is the difference between a developer who happens to write good code and a team that has a Definition of Done. One depends on the person. The other depends on the agreement. The Sufficiency Criterion Is Acceptance Criteria The load-bearing idea in an AI routing decision is sufficiency: what does good enough mean for this class of work? Paying for a frontier model on a task that only needs a decent first draft is a waste. A cheap model on a task that goes to a regulator is a different kind of waste, the kind that appears in an incident review. You write this standard the same way you write acceptance criteria for a Product Backlog item. 
The standard is not “the best the model can do.” It is whether the output meets the stated conditions that a named person can check:

For a weekly internal update: traceable to the source board, under 300 words, no invented status.
For an external compliance summary: every claim sourced, reviewed by a human before it leaves, zero tolerance for a feature that does not exist.

It is the same discipline you apply when you refuse a story without testable acceptance criteria. The friction appears in three predictable places:

Teams argue quality in the abstract and never write the criterion, so every task defaults to the most capable model, and the bill climbs.
Or they name the route but skip the escalation trigger, so nobody knows when a task is allowed to move up to a more expensive one.
Or they write the whole policy and name no owner, so nobody maintains it, the same way nobody maintains an unowned automation.

An AI routing policy without an owner is a suggestion.

The Route Most Teams Forget: No Model at All

Routing to a cheaper model is the first lever everyone reaches for. It is also the smallest one. The smarter move is to ask whether the task needs a model every time, or only once. Packy McCormick and Markie Wagner make the case in their June 2026 essay: “Thinking is expensive but happens rarely. Doing is cheap and happens forever.” Their punchline is shorter: “Because you know what’s cheaper than Chinese models? Code.” (Not Boring, June 10, 2026.) A recurring calculation, a format conversion, or a status roll-up with fixed rules does not need probabilistic judgment on every run. It needs professional judgment once, to design the deterministic path, and then plain code that produces the same output every time. For a non-coding practitioner, this is still a routing decision you can make, even if you hand the build to a developer or have a model write the script one time.
The leadership question is not “which model is cheapest.” It is: does this task need probabilistic judgment every time, or only once to design the path? An AI routing policy that offers only cheap, mid, and frontier models stays trapped in the model vendors’ cost logic. Add two more routes, deterministic code and no automation at all, and the policy becomes an operating-model decision.

AI Routing Is Where a Team Makes Its Trade-Offs Explicit

Finance is the pressure that makes routing visible, but cost is not the only variable a route balances. Optimize for one alone, and another gives way:

Cheapest model everywhere: the bill drops, and quality can collapse on the work that mattered.
Frontier model everywhere: quality holds, and cost discipline collapses.
Human review everywhere: risk falls, and throughput collapses.
Agentic workflow everywhere: autonomy rises, and repeatability collapses.
Deterministic code everywhere: cost falls, and adaptability collapses the moment the rules change.

A route is where a team makes those trade-offs deliberately rather than by accident. That is the difference between using AI and governing it.

Return on Tokens, and Why Task-Class Attribution Matters

Once the routes exist, you need a way to determine whether each route covers its cost. McCormick and Wagner gave that discipline a name: Return on Tokens (ROT), with a plain formula:

Return on Tokens = (Value of Output − Cost of Tokens) / Cost of Tokens × 100

The formula is the easy part. The operational implication is the hard part: you cannot improve Return on Tokens if you do not know which task class consumed them. The same essay reports the pattern behind the urgency: Fortune 500 leaders admitting they had committed to enormous token spend with no idea what they were getting back.

That is the issue for most readers. On a Claude Pro or Max subscription, you cannot see per-task token cost; the meter moves, and you cannot trace it to a workflow.
The discipline still matters because you build it before the subsidy ends, so you are not the one scrambling when flat-rate access narrows further. If your work runs through the API, you already pay per token, and Return on Tokens is a number you can compute today for each task class against your AI routing policy. The subscriber is rehearsing for metered reality, while the API user already lives in it. What the Numbers Make Finance Ask The Ramp Economics Lab publishes U.S. companies’ AI spending per employee, based on aggregated card and bill-pay data from more than 70,000 businesses. In June 2026, the median firm spent $11.38 per employee per month, about one enterprise subscription seat. The top 10% spent $611. The top 1% spent $7,449. (Ramp Economics Lab, June 10, 2026.) That data is spend-side only: it shows what companies pay, not what the spend returns, which is the gap an AI routing policy and a log are meant to close. Finance rarely worries about one subscription seat. Finance starts asking harder questions when flat-rate experimentation turns into metered API usage, agentic workflows, departmental duplication, and invoices no team can trace back to work. At that point, CFOs want to learn: what runs on frontier models versus cheaper alternatives, by task class, and why. That demand is the AI routing policy, written from the outside in. A Routing Policy You Can Copy Here is what three lines look like. Take one task class per row and fill the five columns. 
| Task class | Default route | Sufficiency criterion | Escalation trigger | Owner |
|---|---|---|---|---|
| Weekly internal status draft | mid-tier model | traceable to the board, under 300 words, no invented status | missing data, any customer-facing claim, unresolved risk | workflow owner |
| Customer-facing roadmap summary | frontier model plus human review | every claim sourced, no uncommitted feature shown as committed | enterprise customer, legal or security claim | Product Owner |
| Recurring metrics calculation | deterministic script, no model | same input produces same output, test cases pass | metric definition changes | Product Analyst |

That is the structure of the record finance wants. It is not the record itself yet.

The Record Finance Can Finally Read

An AI routing policy defines the record’s structure; minimal routing logs turn that structure into evidence:

Without the log, the policy explains intent.
With the log, it explains expenditures.

The log adds a few fields to each entry: the actual route used, the escalation reason when a task is moved up, a token or cost estimate, the reviewer, and an outcome signal so the policy optimizes for value, not just the lowest bill. Skip the log, and the policy is governance theater: well written, and accountable to no invoice. Keep the log, and the same meeting that routes the work leaves the trail that finance and procurement ask for later, with no separate report. That is the real economy of the Delegation Lifecycle: the operational artifacts answer the governance questions, once you spend the small effort to record what they decide.

What to Do Before Your Next Planning Session

Do not write a company-wide AI routing policy this week. Take one recurring AI task, the one whose cost or risk you understand least, and fill one row of the table above: the route it runs now, what good enough means in testable terms, the escalation trigger, and the owner.
Add one more column, the actual route used last time, and you have started the log. You will probably find the route was never decided, and the standard was never written. That single gap is the case for the policy. Conclusion Acceptance criteria keep a team honest about what “done” means before work starts. An AI Routing Policy extends that habit to how the work gets done: which path, against what standard, with what escalation trigger, and where it is recorded. The skill is not new, only the object is. When someone in your organization asks what your AI spend buys, will you have a policy and a log to point to, or will you just shrug? Key Questions This Article Answers What Is an AI Routing Policy? A Routing Policy is a written, repeatable team decision that assigns each class of AI-assisted work to the cheapest sufficient execution path, against a stated sufficiency standard, with a named owner. It is the artifact of Stage 2 (Route) of the AI Delegation Lifecycle. The skill it needs is the one you already use to write acceptance criteria. What Are the Routing Options? A route is wider than a menu of models. The options are a frontier model, a cheaper model, deterministic code, human-only work, a model plus human review, or no automation at all. A policy that routes only among model tiers stays trapped in the model vendors’ cost logic. The smarter question is whether the task needs probabilistic judgment every time, or only once to design a deterministic path. Does an AI Routing Policy Give Finance a Spend Record? Not on its own. A Routing Policy defines the record structure: task class, route, sufficiency reason, escalation trigger, and owner. A minimal routing log turns that structure into evidence by capturing the actual route used, the escalation reason, a cost estimate, the reviewer, and an outcome signal. Without the log, the policy explains intent. With the log, it explains the spend. What Is Return on Tokens? 
Return on Tokens is a measure proposed by Packy McCormick and Markie Wagner in June 2026: (Value of Output − Cost of Tokens) / Cost of Tokens × 100. The formula is the easy part. The operational implication is more difficult: you cannot improve Return on Tokens without knowing which task class consumed them, which is what an AI routing policy and its log make possible.
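To make task-class attribution concrete, here is a minimal sketch in Python of the formula above applied to a routing log. The `LogEntry` fields, the dollar figures, and the idea of estimating output value in dollars are illustrative assumptions, not something the essay or this article prescribes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One row of a hypothetical routing log (field names are illustrative)."""
    task_class: str
    route: str            # the route actually used
    token_cost: float     # cost estimate in dollars
    output_value: float   # estimated value of the output in dollars


def return_on_tokens(value: float, cost: float) -> float:
    """(Value of Output - Cost of Tokens) / Cost of Tokens x 100."""
    if cost <= 0:
        raise ValueError("cost must be positive to compute a ratio")
    return (value - cost) / cost * 100


def rot_by_task_class(log: list[LogEntry]) -> dict[str, float]:
    """Sum value and cost per task class, then apply the formula to each class."""
    value = defaultdict(float)
    cost = defaultdict(float)
    for entry in log:
        value[entry.task_class] += entry.output_value
        cost[entry.task_class] += entry.token_cost
    return {task_class: return_on_tokens(value[task_class], cost[task_class])
            for task_class in cost}


# Made-up sample entries, only to show the shape of the calculation.
log = [
    LogEntry("weekly status draft", "mid-tier model", 2.0, 10.0),
    LogEntry("weekly status draft", "mid-tier model", 2.0, 6.0),
    LogEntry("roadmap summary", "frontier model plus human review", 20.0, 30.0),
]
print(rot_by_task_class(log))
```

One design consequence: a task class routed to deterministic code has no token cost, so the ratio is undefined for it. The sketch raises an error in that case instead of reporting a number, and such routes would need a different cost basis.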

By Stefan Wolpers

AI Is Making PHP Cool Again

Somewhere right now, an engineer is making the case to rewrite a working PHP app in Node, and the pitch includes the word "modern." I have heard a version of this for fifteen years. The app ships. The customers are happy. The code is unfashionable. And somebody wants to tear it down and rebuild it on a stack that looks better on a resume.

I have shipped software for more than 20 years, and these days I spend a lot of my time watching AI coding agents write it. So here is a take that is going to sound backward: the thing everyone makes fun of PHP and Laravel for (that they are rigid, opinionated, and boring) is the exact thing that makes coding agents so good at them. When a machine writes a big chunk of your code, the most valuable thing your framework can give you is predictability, not flexibility. And the trendy, flexible stack the rewrite crowd wants is quietly making your AI tooling worse.

The Thing That Makes a Stack Feel Modern Makes AI Worse at It

A coding agent is a pattern matcher with a context window. It is good at your codebase to the degree that your codebase looks like the millions of others it trained on, and to the degree that it can guess where things go without reading the whole repo first. A bespoke Node service is the opposite of that. Node and Express enforce almost no structure, and that gets sold as a feature. You arrange the project however your team likes. One team puts routes in routes/. Another co-locates them with handlers. A third invents a domain-folder layout from a blog post someone read once. Controllers, services, models, and middleware live wherever this particular team decided.

For a senior team, that freedom is genuinely nice. It is also poison for an agent. When you ask the model to add an endpoint, it first has to infer your project's private conventions from whatever it can see, then guess at the rest. Two runs of the same prompt come out different, because there is no canonical answer to "where does this go."
The agent burns its effort rebuilding context your layout never standardized, instead of writing the feature. This is not really a Node problem. It is a configuration-over-convention problem, and it shows up anywhere the layout is a per-team decision. Even Django, a real framework with real conventions, leaves you enough rope (models in one file or split across many, your pick of API layer) that the AI output wobbles more than it does in a stricter framework. The more the framework leaves up to you, the more the agent has to guess. Convention Over Configuration Was an AI Strategy Before There Was AI Now open any Laravel project, built by any team, in any country. You already know where everything is. Models in app/Models. Controllers in app/Http/Controllers. Policies in app/Policies. Migrations follow the same timestamped naming every time. This is convention over configuration, the principle Rails made famous, and Laravel built its whole developer experience around. For two decades it was sold as a way to stop bikeshedding and onboard humans faster. It turns out it was an AI strategy the whole time, and nobody knew it yet. When the file always lives in the same place, and the code always follows the same idiom, the model has effectively seen your project a million times before it ever touches it. The structure it is predicting is not your team's private invention. It is the global standard, which is exactly what the model trained on. So the generated code comes out idiomatic, lands in the right directory, and looks the same across two runs of the same prompt. Laravel even ships official AI-assisted-development docs now, plus a tool called Boost that feeds an agent the framework's own conventions. That is the tell. The thing that makes a framework easy for a new human to read — everything is where you would expect — is the same thing that makes it easy for a machine. AI just raised the payoff on being predictable. 
What This Looks Like When You Actually Ship

I am not making this argument in the abstract. I am watching it play out in my own company's products. Our newest product, ProductWave, is built entirely on PHP and Laravel. Not out of nostalgia. We got tired of the JavaScript churn, the dependency hell, the new framework every nine months, the constant re-platforming. Laravel is opinionated in the right places. You get auth, queues, an ORM, scheduling, and a sane directory structure on day one, so you stop arguing with the tooling and start shipping features.

The AI part is what made the bet pay off harder than I expected. Because Laravel's conventions are so consistent, the agents we use write noticeably better code in our Laravel apps than in a from-scratch Node service where every team invented its own layout. Same file, same place, every time. So the output is idiomatic instead of improvised, and it holds up across runs. Here is the difference in the terms that actually matter when an agent is writing your code:

| What the coding agent faces | Convention stack (Laravel, Rails) | Bespoke stack (hand-rolled Node) |
|---|---|---|
| Where a new controller goes | Same path in every project on earth | Wherever this team decided, if anyone did |
| Style of the generated code | Matches the public examples it trained on | Matches your house pattern, if one exists |
| Two runs of the same prompt | Mostly consistent | Vary run to run |
| Context it must rebuild per repo | Almost none, the structure is the standard | Most of it, the layout is private |
| How a new engineer (or agent) reads it | Like every other project | Like a new language |

None of this needs the framework to be technically better on every axis. It needs the framework to make the same decision every time, so neither your new hire nor your AI has to wonder.

PHP Got Written Off Years Ago. It Is Worth a Second Look.

I know the objection, because the rewrite pitch always carries it: PHP is slow, untyped, stuck in 2010.
If your last serious PHP experience was a PHP 5.6 codebase, that picture is more than a decade out of date. PHP 8 added a JIT compiler and a real type system. Union types, readonly properties, enums, the match expression, and Fibers for async are all standard now:

```php
// PHP 5.6: no type declarations, so the check happens by hand at runtime
function process($value) {
    if (is_int($value) || is_float($value)) {
        return calculate($value);
    }
}

// PHP 8.x: the union type and return type state the contract in the signature
function process(int|float $value): float {
    return calculate($value);
}
```

The performance cliche is just as stale. When Tumblr moved its fleet from PHP 5 to PHP 7, the engineering team documented latency dropping by half and CPU load falling at least 50 percent, and PHP 8 kept climbing from there.

This is not a dead language. By W3Techs' numbers, it still runs roughly three-quarters of the websites with a known server-side language, and it powers production at the scale of Etsy and Slack. There are good, boring reasons companies still run on PHP. It is unfashionable on Hacker News, which is a very different thing from being dead.

The Rewrite Reflex Gets It Backward

So why does the rewrite argument keep coming up? Usually it is what I call resume-driven development. The stated reason is "PHP is outdated." The real reason is that an engineer wants the trendy stack on their resume for the next interview. That is rational for the individual and a disaster for the roadmap. I say that as someone who has approved the rewrite and regretted it!

Every team I have watched hit this fork landed the same way. The ones that worked said no to the rewrite, modernized the stack they had, and kept shipping customer value. The ones that did not work approved it, spent the better part of two years rebuilding what already worked, shipped nothing new in the meantime, and watched competitors eat their lunch.

The AI era adds a line to that math the rewrite crowd never accounts for.
When you tear down a legible, convention-driven Laravel app and rebuild it as a bespoke service in a flexible stack, you are not just paying the old rewrite tax. You are actively making your codebase harder for the AI tooling you are betting your future speed on. You are trading a structure the model understands for one it has to relearn. You are spending two years to make your own agents worse at their job. That is the opposite of modernization.

What You Should Actually Do

You do not have to adopt PHP to use any of this. The principle is about convention, not about a language.

For greenfield work, bias toward an opinionated framework. Laravel, Rails, and the convention-heavy frameworks in any language give an agent a predictable surface to generate against. The "we will assemble our own stack" instinct feels powerful and quietly costs you AI quality.
Modernize the app you have instead of rewriting it. If you are on an old PHP or Laravel version, upgrade it and adopt the conventions fully. You will get more out of your agents from a current, consistent codebase than from a brand-new language, at a fraction of the cost and risk.
If you are stuck in a flexible stack, impose convention anyway. Pick a canonical layout, document it, lint for it, and keep it identical across services. The agent cannot read your mind, but it will follow a structure you actually enforce. Most of the AI-quality gap closes the moment the layout stops being a per-team decision.
Stop treating "boring" as an insult. Boring means predictable. Predictable means staffable, and now it means legible to a machine too. In an AI shop, that is the competitive choice, not the compromise.

The Bottom Line

For fifteen years, the knock on Laravel was that it makes your decisions for you. That was always a strange thing to complain about. Now it is the entire advantage, and the agents are the ones cashing it in.
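For teams in a flexible stack, "lint for it" can start very small. The following is a minimal sketch in Python of a layout check. The file-name patterns and directory names in `LAYOUT_RULES` are hypothetical house rules invented for illustration, not a convention that Node, Express, or this article defines.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical house rules: file-name pattern -> directory the file must live under.
LAYOUT_RULES = {
    "*.controller.ts": "src/controllers",
    "*.service.ts": "src/services",
    "*.model.ts": "src/models",
}


def layout_violations(paths, rules=LAYOUT_RULES):
    """Return (path, expected_dir) pairs for files outside their canonical directory."""
    violations = []
    for raw in paths:
        path = PurePosixPath(raw)
        for pattern, expected_dir in rules.items():
            matches_rule = fnmatchcase(path.name, pattern)
            in_place = PurePosixPath(expected_dir) in path.parents
            if matches_rule and not in_place:
                violations.append((raw, expected_dir))
    return violations


# Example repository listing (made up); files no rule matches are ignored.
paths = [
    "src/controllers/user.controller.ts",
    "src/features/billing/billing.service.ts",
    "src/models/invoice.model.ts",
    "README.md",
]
for path, expected in layout_violations(paths):
    print(f"{path}: expected under {expected}/")
```

Run in CI against the repository's file list, a check like this turns the documented layout into something a new engineer or an agent cannot drift away from unnoticed.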

By Matt Watson

AI Won't Keep You from Hitting the Scalability Wall

Using AI to build integrations? You might just be hitting the scalability wall faster. Discover why faster builds don't solve the long-term cost of ownership. There's an idea making the rounds in B2B SaaS product and engineering meetings right now. It sounds reasonable. It feels optimistic. And it's leading companies straight into the same trap they've always fallen into, just at an accelerated rate. The idea is that "We can use AI to build our integrations." Two years ago, adding in-house dev for an ERP integration to the roadmap meant a three-month research-and-dev cycle. Today, the sentiment is often: "We can knock that out in a weekend." And in the early stages, it's often correct. Modern AI coding agents are remarkably good at generating boilerplate code, interpreting API documentation, and suggesting data mapping logic. AI can help you go from zero to integrated faster than ever before. That's completely true. But a focus on speed-to-build hides a deeper issue. Every integration you ship is an asset, but it comes with a long-term maintenance commitment. That's right. AI-assisted custom integration builds still hit the same scalability wall that has frustrated B2B SaaS engineering teams for years. In many cases, these builds defer the pain. But since AI can encourage teams to say "yes" to more integration requests, they sometimes amplify it. As a result, your team may hit the wall sooner and harder. AI, done right, builds integrations faster. It doesn't handle everything else that makes the integration run reliably at scale. Everything's Great at the Beginning When you use AI to build a custom integration, you're generally optimizing for the near term. You're cutting down the time it takes to write the initial code, map the first few fields, and get something working. Everything moves quickly, and it feels like a massive win. But the scalability wall doesn't show up now; it comes later and is composed of blocks that AI doesn't touch. 
API tracking – Third-party APIs are part of living systems. Their vendors deprecate endpoints, change rate limits, update authentication requirements, and release breaking changes with varying degrees of notice. Your AI coding agent helped you ship the integration. But it won't proactively monitor the Salesforce or NetSuite changelog for you, and it won't be on call when a "backward-compatible" update breaks something.
The ownership gap – If an AI-assisted build handles 90% of the logic but hallucinates an edge case in a retry loop, your senior devs are the ones debugging when it fails in production (not the AI). AI accelerates the build, but the accountability remains with the team.
Infrastructure overhead – Code is only one part of an integration. You still need to build and maintain everything around it: auth, logging, alerting, SOC 2-compliant data handling, and customer-facing configuration UIs. AI doesn't generate that operational layer.
Customer requirements multiply – The second customer who wants your Salesforce integration doesn't want exactly what the first customer wanted. So you modify it. Then a third customer needs the original version, but with different field mappings. Now you have three versions of the same integration – each slightly different, each with its own maintenance obligation, and no customer happy with anything less than individual attention. Multiply that pattern across your catalog, and you understand how teams can end up maintaining twenty-five versions of a single integration, with all the pain that entails.

None of these are new problems. They've existed as long as B2B SaaS teams have been building integrations in-house. AI is simply making it easier to get to these problems faster.

Why AI Feels Like the Answer

When AI coding tools emerged as serious productivity multipliers, it was natural to look at the integration backlog problem and see a solution.
If the bottleneck is build speed, and AI makes you faster, the math seems straightforward. But the bottleneck was never the build speed. The bottleneck was (and is) the ongoing cost of ownership. It's the time your devs lose every time a third-party API changes. It's the afternoon that disappears when an integration fails, and your customers know before you do. It's the engineering lead explaining, again, that the roadmap has slipped because, well…integrations. It's the growing stack of tech debt that you keep working around. AI lowers the barrier to entry. True. But it also lowers the barrier to overcommitment. When it becomes faster and easier to build integrations, teams build more of them. What starts as a productivity boost turns against you. Instead of five integrations, you build fifteen. Instead of "not yet," you say "we can probably do that quickly." Before you know it, you've accepted the maintenance commitment of all those integrations. The scalability wall exists because the relationship between the number of custom integrations and the dev resources required to maintain them is essentially linear. If it takes one engineer to maintain five integrations, you eventually reach a point where your team is no longer building your core product. Instead, it has morphed into the integration maintenance department. AI-assisted development shifts the starting line, but it doesn't change the slope. By building faster without a stable, managed foundation, you are simply accelerating your arrival at the point where maintenance debt overwhelms new feature development. What happens as you attempt to scale The scalability wall doesn't arrive with a big announcement. In most cases, it grows over time as the following occurs: Technical debt accrues under deadline pressure – When teams race to ship, they take shortcuts. Values are hardcoded that should be configurable. Error handling is skipped. Testing is abbreviated. 
Code gets tightly coupled in ways that make future changes expensive. The worst part is that this debt doesn't disappear when an integration deploys. Instead, every dev who touches that code later inherits the results of less-than-optimal decisions made in the moment.
Roadmaps are held hostage – Then you try to maintain a custom integration catalog while building something new. Teams that were supposed to be shipping product features find themselves in firefighting mode, chasing failures, handling escalations, and applying patches to code that was never meant to live this long. Some teams report spending 70% or more of their integration-related engineering time on maintenance, monitoring, and debugging rather than building new value. One organization found half of its R&D team dedicated to maintaining hundreds of integrations. That's not an integration strategy. That's an integration crisis in the offing.
Zombie integrations proliferate – Some integration requests are legitimate but point to integrations that shouldn't be built and maintained without a deliberate strategy. When AI makes it easy to say yes, teams say yes, and ship integrations that live in production forever, draining engineering bandwidth. The customer who requested it may have churned. The use case may have changed. But the zombie is still there, still running, still using resources.

The Right Question to Ask Before You Build

The question isn't "Can AI help us build this faster?" The question is: "Should we own the infrastructure required to keep this alive for the next five years?" Because that's what you're agreeing to. Not just the initial capital investment, but the day-to-day maintenance, dealing with customer edge cases, implementing security patches, monitoring everything, and handling all the tickets. Every custom in-house integration your team ships is a product you now own – with all the ongoing obligations that come with it.
If the answer to the second question above is, "We're not sure we can sustain this at scale," then the solution requires a different architecture approach. Generating Deterministic Results To understand what a different architecture looks like, let's drill into what AI is – and what an integration platform is. AI is generative. It creates a solution/provides an answer at a specific moment, shaped by the context it's given. That makes it powerful for accelerating builds. It also makes it inherently variable. And non-deterministic outputs in business-critical data flows often introduce risks that careful, platform-tested infrastructure doesn't have. An embedded iPaaS is deterministic. It provides a standardized infrastructure designed to handle the full lifecycle of an integration – not just the initial build, but auth rotation, retry logic, customer-specific configs, monitoring, logging, and deployment at scale. The most successful teams use both. AI for efficiency: writing custom logic, generating complex data mappings, and accelerating new builds. And the integration platform for scalability: handling the operational layer that the AI should not reinvent every time. By wrapping integration logic in an integration platform, you decouple the build from the maintenance. When a third-party API changes, you don't have to hunt through dozens of AI-generated scripts – you update the relevant component, and it propagates across your ecosystem. What It Looks Like to Escape the Wall The teams that get over (or around) the scalability wall don't do it by building faster. They do it by changing what they're building on. An embedded iPaaS handles the infrastructure that breaks at scale, so your team can focus on integration logic rather than plumbing. The platform does the heavy lifting – Auth flows, retry logic, webhooks, auto-scaling compute, logging, config wizards, SOC 2-compliant data handling – all of it is provided and maintained by the platform. 
Your devs don't build it. They don't maintain it. That alone can reduce the code your team writes by 80% or more compared to in-house builds. What remains is the business logic – the part that actually delivers value to customers.
Build once, deploy to many – When you productize integrations on a standard platform, you're not building a new integration for every customer. You're deploying a configurable integration that adapts to each customer's credentials, endpoints, and data mapping. One integration serves dozens (or hundreds) of customers. Updates apply across the board. That's a fundamentally different approach than maintaining twenty-five customer-specific variations in parallel.
Non-engineers can own more of the lifecycle – Deployment, configuration, and first-level support don't need to involve engineering when customers and customer-facing teams have the right tools. Support staff can investigate issues without pulling an engineer from the roadmap. Customers can activate and configure integrations themselves from an embedded marketplace. Engineering stays focused on building, not on the operational overhead.
Gain visibility across your entire catalog – When all integrations run on a single platform, monitoring and alerting work across all of them at once. You identify issues before customers do. You troubleshoot with full log access. You have a level of visibility into what is happening that's uncommon with custom in-house development.

AI Still Has a Role, But It's Bounded

None of this is an argument against AI. It's an argument for using AI where it's genuinely useful (and an argument against using it to solve problems it wasn't designed to solve). Used well, the AI can accelerate the creation of code running on a scalable foundation, rather than accelerating the accrual of tech debt. When you use AI inside a platform designed for scale, it works in your favor.
When you use AI to build faster on a custom architecture that doesn't scale, you hit the wall sooner, but with more integrations already built.

Build a Sustainable Integration Approach

If you're evaluating how AI fits into your integration approach, here are the essentials:

Adopt a tiered model – Not every integration deserves the same treatment. Productize high-volume integrations on the platform to make them reusable and maintainable. Build to bespoke requirements where the contract value justifies the build. Empower customers to create their own workflows for the long tail of idiosyncratic requests that no integration catalog can anticipate. And employ in-app agentic functionality as needed to make your workflows the structured, deterministic tools that AI agents can discover and invoke.
Use AI for development velocity – AI excels at accelerating new builds and helping developers handle complex logic. Let the platform own everything else (auth, retries, logging, alerting, deployment, and the customer configuration experience). Don't ask AI to recreate that operational layer for every new integration.
Track the metrics that matter after Day 1 – Build speed matters. But so do maintenance hours, average activation time, support ticket volume, and the amount of engineering time freed up for core product work. Those last two numbers are where a sustainable integration strategy shows up in the data.
Audit regularly – As your catalog grows, so does the population of integrations that may no longer justify their maintenance burden. Retire integrations before they become a drain on the team.

Faster Doesn't Equate to Scalable

The "we can just use AI to build this faster" idea comes from a real place. Integration backlogs, customer pressure, and competitive urgency are all part of it. And AI absolutely speeds up individual builds. In the short term, that matters. But velocity without a stable foundation means you hit the scalability wall faster.
Velocity doesn't make third-party APIs more stable. It doesn't reduce the maintenance burden as your catalog grows. And it doesn't change the question that every integration request brings your team: "Are we prepared to own this forever?"
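The "same slope, earlier wall" argument can be checked with a few lines of arithmetic. This is a rough sketch in Python under stated assumptions: maintenance load is strictly linear, one engineer can maintain five integrations (the article's own illustrative ratio), no integration is ever retired, and the team size and build rates are made-up inputs.

```python
import math


def maintainers_needed(integrations: int, per_engineer: int = 5) -> int:
    """Engineers tied up in maintenance, assuming a strictly linear load."""
    return math.ceil(integrations / per_engineer)


def quarters_until_wall(team_size: int, built_per_quarter: int,
                        per_engineer: int = 5) -> int:
    """Quarters until maintenance alone would occupy the whole team."""
    if built_per_quarter <= 0:
        raise ValueError("built_per_quarter must be positive")
    capacity = team_size * per_engineer  # most integrations the team can maintain
    return math.ceil(capacity / built_per_quarter)


# Hypothetical team of 10 engineers; only the build rate changes.
print(maintainers_needed(25))       # engineers absorbed by 25 integrations
print(quarters_until_wall(10, 5))   # pre-AI pace: 5 integrations per quarter
print(quarters_until_wall(10, 15))  # AI-assisted pace: 15 per quarter
```

Tripling the build rate leaves the maintenance ratio untouched, so under these assumptions the only thing that moves is the date on which the whole team is doing maintenance.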

By Bru Woodring
