The Benchmark Trap

The retrieval-augmented generation (RAG) ecosystem has matured remarkably fast. Vector databases are production-grade, embedding models are cheaper than ever, and retrieval pipelines are being deployed across healthcare, finance, legal, and education systems worldwide. Every major benchmark shows impressive numbers. Almost every major benchmark is in English.

This is not a minor oversight. It is a structural blind spot that has allowed a critical class of failures to accumulate in production systems largely undetected. When your evaluation dataset is monolingual and your deployment is multilingual, you are not measuring what you think you are measuring.

The gap between benchmark performance and real-world performance for non-English users is not a rounding error. In documented cases, it is up to 29% accuracy degradation for non-English queries compared to equivalent English ones. That number comes from Oracle AI researchers who studied RAG consistency across languages in enterprise deployments. Twenty-nine percent. In a medical context, that is not a metric. That is a patient safety issue.

Where Exactly Does Cross-Lingual RAG Break?

The failure is not in one place. It cascades across all three stages of the RAG pipeline, which makes it particularly difficult to diagnose and fix.

At Retrieval

Most embedding models used in production RAG systems are trained predominantly on English corpora. When a Tamil or Arabic query is embedded, it enters a vector space whose geometry was shaped by English semantics. The nearest neighbors retrieved may appear topically related but carry subtle semantic misalignments that compound downstream.

Amazon AGI's XRAG benchmark, published in 2026, was one of the first systematic evaluations of this failure mode. Their findings were stark: in monolingual retrieval settings, where an English knowledge base serves non-English queries, all evaluated models struggled with response language correctness.
The system retrieved the right document. It still got the answer wrong.

At Augmentation

The naive fix, retrieving documents in the user's language alongside English documents and concatenating them into the context, introduces a different problem. A French document and a Hindi document about the same topic may express subtly different facts, use different cultural reference points, or carry implicit contradictions that the model has no mechanism to resolve. Concatenation without alignment is not multilingual RAG. It is multilingual noise.

At Generation

This is the most insidious failure mode. Research has consistently shown that large language models tend to reason internally in English even when processing non-English inputs. The model receives a Tamil query, retrieves relevant context, and then effectively thinks in English before generating a Tamil response. Cultural grounding, local conventions, and contextual meaning are lost at the final and most consequential step.

The result is a response that may be grammatically correct in Tamil but conceptually rooted in English assumptions: wrong units of measurement, unfamiliar care protocols, culturally inappropriate framing.

The Research Is There. The Attention Is Not.

A small but growing body of research is directly addressing this problem. It deserves far more attention than it is currently receiving in mainstream AI engineering conversations.

XRAG (Amazon AGI, 2026) introduced one of the first dedicated benchmarks for cross-lingual RAG evaluation, covering monolingual and multilingual retrieval scenarios with relevancy annotations per retrieved document. Their finding that cross-lingual reasoning, not just language generation, is the core challenge reframes the problem in an important way. This is not a translation problem. It is a reasoning problem.
CroSearch-R1 (Beijing Jiaotong University/Université de Montréal, SIGIR 2026) proposed using reinforcement learning, specifically Group Relative Policy Optimization (GRPO), to dynamically align multilingual knowledge during retrieval. Rather than treating documents in different languages as competing contexts, their framework integrates them as complementary evidence. Results showed measurable improvements in cross-lingual RAG effectiveness across multiple language pairs.

CrossRAG (University of Edinburgh, EACL 2026) took a different approach: translating retrieved documents into a common language before generation rather than translating the query before retrieval. Their experiments showed that this document-side translation strategy significantly outperforms query-side translation, particularly for low-resource languages, because it preserves the semantic richness of retrieval while giving the generation model a consistent linguistic context to reason over.

BordIRLines (ACL 2025) introduced a dataset of territorial disputes across 49 languages to study cross-lingual RAG robustness in culturally sensitive scenarios. Their finding that retrieving multilingual documents actually improves response consistency over monolingual retrieval, when done correctly, is an important signal that the solution lies in better multilingual architecture, not in defaulting to English-only retrieval.

Together, these papers paint a clear picture: the problem is real, measurable, and solvable. What is missing is the engineering community treating it as a first-class concern.

Who Is Actually Affected

The framing of this as a technical NLP problem undersells its human stakes. Consider the populations for whom English-centric RAG is not an inconvenience but a genuine barrier:

- A patient in rural Tamil Nadu querying a hospital AI system about post-surgery medication.
- A student in rural Nigeria using an AI tutoring system to access global research in Yoruba.
- A refugee querying a legal AI system about asylum rights in their native Dari.
- A farmer in rural India asking an agricultural advisory AI about crop disease treatment in Marathi.

In every one of these cases, a RAG system that was benchmarked at 90%+ accuracy in English may be operating at 60-70% accuracy in the language that actually matters to the user. The people least able to absorb the consequences of AI errors are the ones most exposed to them.

This is not an edge-case population. Over 6.5 billion people speak a language other than English as their primary language. The majority of the world is the edge case in most RAG deployments.

What Good Cross-Lingual RAG Looks Like

The research points toward a few clear architectural principles for building RAG systems that work equitably across languages.

Shared semantic embedding spaces over language-specific ones. Models like LaBSE and the multilingual E5 family (mE5), including multilingual-E5-large, represent meaningful progress here. They map semantically equivalent content across languages into nearby regions of vector space, reducing the retrieval gap for non-English queries without requiring query translation.

Explicit cross-lingual knowledge alignment rather than naive concatenation. The CroSearch-R1 approach of using RL to integrate multilingual evidence as complementary knowledge is a significant step forward. The goal is a retrieval-augmented context that is linguistically unified before the generation model ever sees it.

Document-side translation over query-side translation when translation is necessary at all. CrossRAG's findings suggest that translating retrieved documents into a common language preserves more semantic fidelity than translating the user's query into English. This is counterintuitive but empirically supported.

Culture-aware generation as a design goal, not an afterthought. Language and culture are not separable.
A RAG system that generates linguistically correct but culturally inappropriate responses has not solved the problem; it has reframed it.

A Proposal Worth Exploring

The building blocks for genuinely equitable cross-lingual RAG exist today. What does not yet exist is an intentional, end-to-end architecture that assembles them with language equity as a first-class design principle rather than a post-hoc consideration.

We call this architectural vision PolyRAG: a framework that coordinates multilingual semantic retrieval, reinforcement learning-based cross-lingual knowledge fusion, and culture-aware generation into a unified pipeline. The goal is not to make RAG work slightly better in non-English languages. It is to eliminate the architecture-level reasons why it fails in the first place.

Each of the three components draws from independently validated research. What remains is the engineering work of intentionally combining them, rigorously benchmarking them across low- and high-resource language pairs, and releasing the results openly so the broader community can build on them.

The Conversation We Should Be Having

The RAG community has done extraordinary work optimizing retrieval latency, chunk strategies, reranking approaches, and hallucination reduction. Almost all of it assumes English. The question worth asking in 2026 is simple: what would RAG look like if we designed it for everyone from the start?
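The benchmark gap discussed in this article only becomes visible when evaluation results are broken out by query language. The following is a minimal sketch of that breakdown, assuming each graded answer is tagged with a language code. The class name, the record, and the sample data are hypothetical and not taken from any of the cited benchmarks. It reports the gap in percentage points against a baseline language.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Per-language accuracy breakdown for a RAG evaluation run (illustrative sketch). */
class LanguageGapReport {

    /** One graded answer: the query language and whether the answer was judged correct. */
    record Graded(String language, boolean correct) {}

    /** Accuracy per language, in first-seen order. */
    static Map<String, Double> accuracyByLanguage(List<Graded> results) {
        Map<String, int[]> counts = new LinkedHashMap<>(); // value is {correct, total}
        for (Graded g : results) {
            int[] c = counts.computeIfAbsent(g.language(), k -> new int[2]);
            if (g.correct()) {
                c[0]++;
            }
            c[1]++;
        }
        Map<String, Double> accuracy = new LinkedHashMap<>();
        counts.forEach((lang, c) -> accuracy.put(lang, (double) c[0] / c[1]));
        return accuracy;
    }

    /** Gap to the baseline in percentage points; positive means worse than the baseline. */
    static double gapInPoints(Map<String, Double> accuracy, String baseline, String language) {
        return (accuracy.get(baseline) - accuracy.get(language)) * 100.0;
    }

    public static void main(String[] args) {
        // Made-up grades purely to exercise the code; not real benchmark data.
        List<Graded> results = List.of(
            new Graded("en", true), new Graded("en", true),
            new Graded("en", true), new Graded("en", false),
            new Graded("ta", true), new Graded("ta", false),
            new Graded("ta", false), new Graded("ta", true));

        Map<String, Double> accuracy = accuracyByLanguage(results);
        accuracy.forEach((lang, value) -> System.out.printf(
            "%s accuracy=%.2f gap=%.1f points%n", lang, value, gapInPoints(accuracy, "en", lang)));
    }
}
```

Reporting an absolute gap in points is one choice among several. A relative drop (gap divided by baseline accuracy) gives a different number for the same data, so a report should say which one it uses.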
AI has transformed the persistence landscape, and that shift is the central challenge architects now face. Enterprise applications were once built almost exclusively on relational databases, making JPA a keystone of Jakarta EE. Today, modern systems use a mix of relational databases, document stores, caches, graph engines, and increasingly, vector databases that support semantic search, retrieval-augmented generation (RAG), and AI-powered applications.

Polyglot persistence is now the industry standard. While Jakarta EE standardized relational persistence through JPA, it still lacks a vendor-neutral standard for non-relational persistence. This gap forces developers to rely on fragmented, proprietary solutions, creating barriers to portability, productivity, and innovation.

The rise of AI makes this gap critical. Vector databases are now essential to intelligent systems, supporting semantic search, embeddings, and contextual retrieval. For Jakarta EE to remain the leading enterprise Java platform in the AI era, it must offer a standardized approach to NoSQL persistence, as it did for relational databases.

Jakarta NoSQL is not just another specification; it is a strategic investment in the ecosystem's future. By offering a familiar programming model, reducing vendor lock-in, and integrating with AI workloads, Jakarta NoSQL ensures that Jakarta EE remains relevant and competitive for the next generation of enterprise applications.

NoSQL in the AI Era: Understanding the Modern Data Landscape

For years, enterprise data persistence focused on relational databases. Systems relied on tables, rows, foreign keys, and SQL, making relational technology the standard for business applications. While still essential, modern architectures now use polyglot persistence, where multiple database types coexist, each satisfying specific requirements.
Today, NoSQL refers to a family of database paradigms, each engineered for specific workloads and architectural needs, rather than just document databases.

- Key-value databases store data as key-value pairs, enabling fast lookups and low latency. Typical uses include caching, user sessions, feature flags, and temporary application state.
- Document databases store data as structured documents, such as JSON or BSON. They are effective for applications with hierarchical or evolving schemas, including web applications, e-commerce platforms, and content management systems.
- Column-family databases organize data by columns instead of rows, supporting high write throughput and horizontal scalability. They are used for IoT telemetry, event logging, analytics, and large-scale distributed systems.
- Graph databases model entities and relationships as nodes and edges. This structure is ideal for social networks, fraud detection, recommendation engines, dependency analysis, and knowledge graphs in which relationships are critical.
- Vector databases store high-dimensional embeddings from machine learning models and large language models (LLMs). They enable semantic search, similarity matching, retrieval-augmented generation (RAG), recommendation platforms, and other AI-driven features by matching on meaning instead of exact text.
- Time-series databases specialize in timestamped data that changes over time. They are used for observability, monitoring, financial markets, industrial sensors, and operational metrics where high-performance temporal data storage and analysis are essential.

These database types often coexist within the same architecture. Modern applications may use PostgreSQL for transactions, Redis for caching, MongoDB for documents, Neo4j for relationships, InfluxDB for telemetry, and a vector database like Milvus, Pinecone, or Weaviate for AI-powered search and retrieval. This approach, known as polyglot persistence, is now standard in enterprise systems.
The industry has embraced this shift. The Stack Overflow Developer Survey shows that while relational databases still dominate enterprise workloads, NoSQL technologies are now standard tools for developers. Technologies like Redis, MongoDB, and Elasticsearch are used alongside PostgreSQL and MySQL. Organizations no longer choose between SQL and NoSQL; instead, they combine multiple persistence technologies to leverage their strengths. Polyglot persistence is now the baseline for modern software systems.

Vector databases are especially important among NoSQL categories, as they are foundational to modern AI systems. In contrast to traditional databases that store explicit business data, vector databases store numerical representations called embeddings. Generated by machine learning models, these embeddings encode the semantic meaning of words, documents, images, or other content as mathematical vectors. This enables software to search and retrieve information based on meaning rather than exact text matches.

The distinction between lexical and semantic search illustrates the significance of vector databases. For example, a traditional SQL search for “Pet” returns records with that exact term, such as “Pet Shop,” but ignores related expressions like “Dog” or “Puppy.” Semantic search, by comparing embeddings, retrieves documents about dogs, puppies, or animal companions because it recognizes their semantic relationship. The search engine matches meaning, not just syntax.

This function is vital for modern AI architectures. Large language models do not process relational tables directly; they use embeddings and contextual connections between concepts. Systems such as retrieval-augmented generation (RAG), enterprise knowledge search, recommendation engines, and intelligent assistants depend on similarity searches across millions of vectors.
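The lexical-versus-semantic distinction above can be sketched in a few lines of plain Java. This is a toy illustration under stated assumptions: the three-dimensional vectors are hand-written stand-ins for embeddings a real model would produce, the titles and the class name are hypothetical, and a real vector database would use an approximate index rather than the full scan shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy comparison of substring matching and embedding similarity (illustrative sketch). */
class SearchComparison {

    /** Lexical search: keep titles that contain the term, ignoring case. */
    static List<String> lexical(String term, List<String> titles) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (String title : titles) {
            if (title.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(title);
            }
        }
        return hits;
    }

    /** Cosine similarity of two equal-length, non-zero vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Semantic search: rank every title by similarity to the query and keep the top k. */
    static List<String> semantic(double[] query, Map<String, double[]> index, int k) {
        return index.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, double[]> e) -> -cosine(query, e.getValue())))
            .limit(k)
            .map(Map.Entry::getKey)
            .toList();
    }

    /** Hand-written stand-in embeddings; a real system would call an embedding model. */
    static Map<String, double[]> sampleIndex() {
        Map<String, double[]> index = new LinkedHashMap<>();
        index.put("Pet Shop", new double[] {0.9, 0.1, 0.0});
        index.put("Puppy Training Guide", new double[] {0.8, 0.3, 0.1});
        index.put("Dog Food Reviews", new double[] {0.85, 0.2, 0.05});
        index.put("Quarterly Tax Report", new double[] {0.0, 0.1, 0.95});
        return index;
    }

    public static void main(String[] args) {
        Map<String, double[]> index = sampleIndex();
        double[] petQuery = {0.9, 0.15, 0.0}; // assumed embedding of the query "Pet"
        System.out.println("lexical:  " + lexical("Pet", new ArrayList<>(index.keySet())));
        System.out.println("semantic: " + semantic(petQuery, index, 3));
    }
}
```

With this sample data the substring match finds only the title that literally contains "Pet", while the similarity ranking also surfaces the dog and puppy titles and leaves the tax report out of the top three.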
While relational databases can support some vector operations through extensions, vector databases are purpose-built for these workloads, offering optimized indexing and similarity algorithms for large-scale semantic retrieval. As AI adoption grows, vector databases are becoming a strategic component of enterprise architecture.

Recognizing the importance of NoSQL, several Java ecosystems have developed their own solutions. Spring offers independent projects like Spring Data MongoDB, Spring Data Redis, and Spring Data Cassandra. These integrations provide a productive programming model but are tightly coupled to the Spring ecosystem. Quarkus supports NoSQL persistence through Panache and database-specific integrations, emphasizing developer productivity and cloud-native deployment. Micronaut Data supports several NoSQL engines, using compile-time code generation and ahead-of-time processing to improve performance and reduce runtime overhead.

While these solutions are effective, they remain framework-specific rather than platform standards. Developers switching frameworks encounter different APIs, abstractions, annotations, and operational models, even when solving similar persistence challenges.

Jakarta EE addressed this for relational persistence with Jakarta Persistence (JPA), delivering a standardized, vendor-independent programming model. As NoSQL technologies expand and AI workloads increasingly depend on vector databases, the lack of a vendor-neutral NoSQL standard is a significant gap in the Jakarta ecosystem.

The Java Standardization Journey

The need for a standardized NoSQL solution in the Java ecosystem has been discussed for years. During the Java EE era, several proposals tried to integrate non-relational databases into the enterprise platform. As NoSQL technologies grew in popularity throughout the 2010s, developers anticipated a dedicated specification to accompany traditional enterprise APIs at JavaOne conferences.
Despite clear demand, no such initiative emerged within Java EE. The platform remained focused on relational persistence via JPA, leaving NoSQL adoption to rely on vendor-specific libraries and framework integrations.

The transition of Java EE to the Eclipse Foundation provided an opportunity to address this challenge. Instead of waiting for a platform-level solution, the community launched Eclipse JNoSQL, an open-source project supplying a unified programming model for NoSQL databases. Drawing on JPA's success, Eclipse JNoSQL introduced mapping annotations, repositories, templates, and communication APIs that support document, key-value, column-family, and graph databases. The project showed that a consistent developer experience could be attained without compromising each database model's unique features.

As Jakarta EE matured, Eclipse JNoSQL became the foundation for a new standardization effort: Jakarta NoSQL. Jakarta NoSQL was the first persistence specification created entirely within the Jakarta EE process. Unlike earlier specifications that migrated from Java EE, Jakarta NoSQL was conceived, developed, and released under the Eclipse Foundation governance model. It was among the first to complete the full Jakarta Specification Process from inception to release.

Jakarta NoSQL's impact extended beyond its initial scope. During development, the expert group identified a common challenge for both relational and non-relational databases: developers needed a consistent repository abstraction independent of the underlying persistence engine. This led to the creation of a separate specification, Jakarta Data. The need to standardize NoSQL access patterns directly influenced the development of Jakarta Data's repository-oriented programming model, which applies across multiple persistence technologies.

The relationship between these specifications highlights Jakarta NoSQL's broader influence on the Jakarta EE ecosystem.
Jakarta NoSQL focuses on mapping and interacting with non-relational databases, while Jakarta Data delivers a unified repository abstraction for both relational and NoSQL implementations. Together, they significantly reduce fragmentation in enterprise persistence.

This evolution continued beyond Jakarta Data. The drive to standardize modern persistence requirements has inspired new specifications, such as Jakarta Query, which aims to deliver a portable, type-safe, and expressive query language for various persistence technologies.

As the Jakarta ecosystem grows, Jakarta NoSQL acts as a key milestone. It addressed the long-standing absence of a NoSQL standard and helped lay the foundation for the next generation of persistence specifications within Jakarta EE.

Jakarta NoSQL: Built for NoSQL, Not Adapted to It

When architects consider standardizing NoSQL development in Jakarta EE, a common question arises: why not extend Jakarta Persistence (JPA) to support NoSQL databases? JPA has long provided a unified programming model for relational databases in the Java ecosystem. The answer is based on a core architectural principle: tools should be optimized for their intended purpose.

The first challenge is that JPA was designed specifically for relational databases, relying on concepts like tables, columns, joins, foreign keys, and transactional consistency. These are not simply implementation details but core elements of the specification. Forcing document, graph, key-value, or vector databases into this model creates friction and limits the use of each database’s native features.

The second challenge is that NoSQL systems behave fundamentally differently. Graph databases perform path traversals, document databases store nested structures without normalization, key-value databases focus on fast lookups, and vector databases handle similarity calculations. These systems also differ in consistency, transactions, query languages, indexing, and scalability capabilities.
Representing all these paradigms through a single relational abstraction leads to compromises.

The third challenge is the importance of specialization. As Abraham Maslow noted, “if the only tool you have is a hammer, it is tempting to treat everything as if it were a nail.” Relational databases are effective, but not ideal for every persistence need. Semantic search, graph traversal, and high-volume telemetry storage are not relational problems. Applying a relational abstraction to all database types risks losing the unique optimizations each technology provides.

Consider the analogy of transportation: cars, boats, submarines, and airplanes all address transportation but are specialized for different environments. Forcing them to use the same controls would result in mediocrity across all. Similarly, a single persistence abstraction may remove the features that make each database effective.

Therefore, Jakarta NoSQL does not extend JPA beyond its intended scope. Instead, it offers a dedicated persistence model for non-relational databases, while keeping the familiar developer experience that contributed to JPA’s success.

A key design goal of Jakarta NoSQL is to reduce cognitive load for enterprise Java developers. Teams experienced with JPA should find the specification immediately approachable, as Jakarta NoSQL intentionally uses familiar terminology and concepts from the Jakarta EE community. Developers will encounter annotations like @Entity, @Id, and @Column, enabling a smooth transition from relational to non-relational persistence.

Java
    @Entity
    public class Car {

        @Id
        private Long id;

        @Column
        private String name;

        @Column
        private CarType type;
    }

At first glance, this entity closely resembles a JPA entity, which is intentional. However, the underlying implementation is fundamentally different. Jakarta NoSQL is built to support schema flexibility, embedded structures, nested documents, and database-specific storage models.
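The entity above references a CarType enum, and the article's Template example calls Car.builder(), but neither is shown. The following is a minimal sketch of those supporting pieces in plain Java, with the mapping annotations left out so it compiles with only the JDK. Everything beyond the names Car, CarType, SPORT, id, name, and type is an assumption for illustration, including the extra enum constants, the getters, and the builder's shape.

```java
/** Vehicle categories; only SPORT appears in the article, the others are placeholders. */
enum CarType { SPORT, SEDAN, SUV }

/** Plain-Java Car with a builder; mapping annotations are omitted in this sketch. */
class Car {
    private Long id;
    private String name;
    private CarType type;

    /** Kept because mapping frameworks commonly create entities reflectively (assumption). */
    Car() {
    }

    private Car(Builder builder) {
        this.id = builder.id;
        this.name = builder.name;
        this.type = builder.type;
    }

    static Builder builder() {
        return new Builder();
    }

    Long getId() { return id; }
    String getName() { return name; }
    CarType getType() { return type; }

    /** Fluent builder matching the call chain Car.builder().id(..).name(..).build(). */
    static final class Builder {
        private Long id;
        private String name;
        private CarType type;

        Builder id(Long id) { this.id = id; return this; }
        Builder name(String name) { this.name = name; return this; }
        Builder type(CarType type) { this.type = type; return this; }

        Car build() {
            if (id == null) {
                throw new IllegalStateException("id is required");
            }
            return new Car(this);
        }
    }

    public static void main(String[] args) {
        Car ferrari = Car.builder().id(1L).name("Ferrari").type(CarType.SPORT).build();
        System.out.println(ferrari.getName() + " " + ferrari.getType());
    }
}
```

Setting the type on the builder matters if the object is later queried by type: an entity built without it would not match a filter on CarType.SPORT.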
This approach is reflected throughout the API. Instead of requiring developers to manage low-level driver details, Jakarta NoSQL offers a high-level programming model via the Template API.

Java
    @Inject
    Template template;

    Car ferrari = Car.builder()
        .id(1L)
        .name("Ferrari")
        .build();

    template.insert(ferrari);

    List<Car> sports = template.select(Car.class)
        .where("type").eq(CarType.SPORT)
        .orderBy("name")
        .result();

The objective mirrors JPA’s original mission: permitting developers to focus on domain models and business logic, rather than serialization, connection management, or vendor-specific APIs.

This foundation shaped Jakarta NoSQL 1.0. The initial release introduced the mapping layer, CDI integration, repository support, template operations, and standardized endpoints for four major NoSQL categories:

- Document databases
- Key-value databases
- Column-family databases
- Graph databases

Jakarta NoSQL 1.0 showed that a unified Java programming model can respect the particular characteristics of each database family.

Jakarta NoSQL 1.1 continued this evolution. While version 1.0 focused on mapping and persistence, version 1.1 expanded querying capabilities through integration with Jakarta Query. A key addition is support for parameterized queries, letting developers safely bind parameters instead of manually constructing query strings.

Java
    List<Car> cars = template.query(
            "FROM Car WHERE type = :type")
        .bind("type", CarType.SPORT)
        .result();

Version 1.1 also introduces projection support, allowing applications to retrieve lightweight views instead of entire entities.

Java
    @Projection
    public record TechCarView(
        String name,
        CarType type) {
    }

    List<TechCarView> views = template
        .typedQuery(
            "FROM Car WHERE type = 'SPORT'",
            TechCarView.class)
        .result();

These features improve performance, reduce data transfer, and align with modern Java features such as records.

An important aspect of Jakarta NoSQL is its long-term architectural vision.
While most developers use the mapping layer, the specification also defines a lower-level communication API for advanced scenarios.

Java
    DocumentManagerFactory factory = ...;
    DocumentManager manager = factory.get("users");

    DocumentRecord record = ...;
    manager.put(record);

    Optional<DocumentRecord> result = manager.findByKey("user:10");
    manager.deleteByKey("user:10");

This communication layer is optional. Application developers can build complete systems without it, but it is valuable for database vendors, framework authors, and advanced integrations needing direct access to database capabilities.

This design is fundamentally different from JDBC, which assumes communication through SQL statements and tabular result sets. That model works well because relational databases share a common language and interaction pattern. NoSQL databases do not. Document databases may use BSON, graph databases may offer traversal languages, and vector databases may provide similarity-search APIs. Others use REST endpoints, binary protocols, gRPC streams, or vendor-specific mechanisms. Forcing these models into a JDBC-style abstraction would limit their capabilities or demand ongoing vendor-specific extensions.

For this reason, Jakarta NoSQL uses a layered architecture. The mapping layer offers a portable, productive programming model for developers, while the communication layer remains flexible to support diverse NoSQL systems.

This architecture positions the specification for future growth. As new technologies like vector databases, time-series engines, and AI-native storage emerge, Jakarta NoSQL can evolve without imposing a relational mindset. Rather than treating every database as a nail for the JPA hammer, Jakarta NoSQL recognizes that different problems require different tools, while still presenting a consistent and familiar experience for enterprise Java developers.
I ran an AI coding agent against a broken Kubernetes deployment for five minutes. The agent called Anthropic's API dozens of times while reasoning about manifests, running kubectl commands, and redeploying workloads. It made fully authenticated requests throughout the entire session. The API key was never in its environment.

Shell
    env | grep -iE "anthropic|api_key|secret|token|password"
    # (empty)

That is Docker Sandbox's credential isolation model in action. This article is about what that actually means, and what else the isolation holds, breaks, and surprises you with when you probe it properly.

Key Takeaways

- Docker Sandbox uses a host-side proxy to inject API credentials without the agent ever seeing them. The agent makes authenticated calls without possessing the key.
- Seven live isolation probes confirmed the boundary held throughout real AI agent activity, not just at rest.
- Network policy is hostname-scoped HTTP filtering, not a full network control plane, with three specific behaviors the documentation doesn't make clear.
- DevOps agents can run docker build and kubectl inside the sandbox without any path to the host Docker daemon or cluster credentials.
- The --branch parallel agent mode is Git-level isolation, not VM-level. This is an important distinction for threat models requiring separate credentials per agent.

The Setup

I manage eight AKS clusters for Fortune 500 clients. My laptop has Azure service principals, SSH keys, kubeconfig files with a dozen cluster contexts, and twenty-plus repos, some with .env files containing real API keys. Running an AI agent from this machine without guardrails means the agent inherits all of it.

Docker Sandbox changes that. Each sandbox is a microVM with its own Linux kernel, its own Docker daemon, and its own network stack. You mount one project directory. The agent sees one project directory. Everything else on the machine does not exist inside the sandbox.

I spent two weeks testing this claim. Here is what I found.
Test environment:

What | Detail
sbx version | v0.31.1 · commit e658be1
Host | macOS Apple Silicon
Network endpoints probed | 13
Isolation probes | 7 targeted commands
Kubernetes scenario | Real agent task, two bugs, timed

All findings backed by real terminal output. Full repo: github.com/opscart/docker-sandbox-devops.

How the Credential Isolation Actually Works

The sandbox environment has no API keys. But the agent made authenticated API calls. Here is the mechanism:

Shell
    env | grep proxy
    # https_proxy=http://gateway.docker.internal:3128
    # http_proxy=http://gateway.docker.internal:3128
    # JAVA_TOOL_OPTIONS=-Dhttp.proxyHost=gateway.docker.internal -Dhttp.proxyPort=3128 ...

Every outbound request (HTTP, HTTPS, even Java tools) routes through a proxy at gateway.docker.internal:3128. That proxy runs on the Mac host, completely outside the microVM boundary.

When the agent sends a POST to api.anthropic.com, there is no Authorization header, because the agent does not have the key. The request reaches the host-side proxy. The proxy checks the allowlist: api.anthropic.com is in the default AI services group under the Balanced policy. Authentication is performed by the host-side proxy using credentials stored outside the sandbox boundary. The authenticated request is forwarded to Anthropic. The agent receives the response. It has no idea what key was used, where it came from, or how to find it again.

Think of it like an OAuth gateway. The proxy holds the credential and vouches for the agent's requests. The agent gets access without ever possessing the key. You cannot steal what you never had.

This is architecturally different from the standard setup where ANTHROPIC_API_KEY sits in the shell environment, one echo $ANTHROPIC_API_KEY away from being exfiltrated.

What the Four Isolation Layers Actually Do

Docker Sandbox stacks four layers:

Hypervisor isolation. Separate Linux kernel per sandbox. Host processes invisible. Other sandboxes invisible.
A compromised sandbox cannot escalate to the host kernel. This is the fundamental difference from a Docker container — a container shares the host kernel. The microVM does not. Network isolation. All outbound HTTP/HTTPS routes through the host-side proxy. Raw TCP, UDP, and ICMP are blocked at the network layer. Three policy tiers: allow-all, balanced (curated dev allowlist), deny-all. Set before starting your first sandbox: Shell sbx policy set-default balanced Docker Engine isolation. Each sandbox runs a private Docker daemon with its own socket. No path to the host Docker daemon. An agent can run docker build and docker run without socket mounting — which is the tradeoff that breaks isolation in plain container-based approaches. Credential isolation. Proxy-based injection as described above. The raw key never enters the microVM. macOS host with sensitive assets and proxy on the left, Docker Sandbox microVM in the center, network policy zones on the right. Seven Isolation Proofs — Run Live After a Real Agent Task The agent exited after completing the debugging task. The sandbox remained alive, and I executed the following commands from the same shell session the agent had used — to show exactly what was accessible throughout the entire run. 1. Filesystem Boundary Shell ls /Users/opscart/ # Source ls /Users/opscart/.ssh/ 2>&1 One directory. The workspace mount. SSH keys, other repos, credential directories — none of them exist inside the sandbox. Parent directories above the workspace are read-only stubs with no siblings. One critical implication: if your workspace is your home directory, your entire home is visible and writable. Always mount a project subdirectory, not your home. 2. No Credentials in Environment Shell env | grep -iE "anthropic|api_key|aws|secret|token|password" # (empty) Confirmed. The agent that just made dozens of API calls had no raw credentials anywhere in its environment. 3. 
Proxy Confirms the Injection Mechanism

Shell
env | grep proxy
# https_proxy=http://gateway.docker.internal:3128
# no_proxy=localhost,127.0.0.1,::1,[::1],gateway.docker.internal

Proxy address visible. Credentials it carries: not visible. The mechanism described above is confirmed live inside the running sandbox.

4. Process Namespace

Shell
ps aux | wc -l
# 13

A macOS host runs hundreds of processes. The sandbox shows 13, all internal. The stack includes dockerd, containerd, socat bridging SSH agent forwarding, and the coding agent. Host processes are completely invisible. There is no way to inspect or interact with anything running on the host.

5. Private Docker Engine

Shell
docker info | grep -E "Server Version|Operating System|ID"
# Server Version: 29.4.3
# Operating System: Ubuntu 25.10 (containerized)
# ID: e6934b23-368c-4259-a873-96f879f587e5

Ubuntu 25.10. A unique daemon ID that differs from docker info on the host, confirming the sandbox runs a fully isolated daemon. The agent deployed a full Kubernetes cluster using this daemon. No path to the host Docker socket existed.

6. Host Services Unreachable

Shell
curl -s --max-time 3 https://localhost:6443 2>&1 || echo "blocked"
# curl: (7) Failed to connect to localhost port 6443: Connection refused

Port 6443 is my minikube cluster on the Mac host. From inside the sandbox, localhost is the sandbox's own loopback. Host clusters, host SSH, host services: unreachable by default. There are eight AKS contexts on this machine. None is reachable from inside the sandbox without an explicit policy rule.

7. What the Agent Had vs. What It Didn't

During the entire debugging task, the agent had full access to one project directory, kubectl to the sandbox-internal Kubernetes cluster, and full Docker capabilities against the private daemon. It could not reach any other directory, cloud credentials, other kubeconfig contexts, the host Docker daemon, or any cluster not running inside the sandbox.
All seven proofs held throughout the session without exception.

Three Network Policy Findings That Change How You Think About It

Network policy is not a full network control plane. It is hostname-scoped HTTP filtering. Three findings define the actual scope:

Finding 1: Blocking returns HTTP 403, not TCP rejection.

Plain Text
probe "example.com" "https://example.com"
# example.com | exit=0 | http=403

Exit code 0. The curl command succeeded. The proxy returned 403 directly. An agent that retries on 403 will retry blocked requests indefinitely. It cannot distinguish a blocked domain from a legitimate server-side error by exit code. For DevOps workflows, this means an agent hitting a blocked container registry will keep retrying silently rather than failing fast.

Finding 2: HTTP CONNECT established a tunnel to port 22 on an allowed host.

Plain Text
# Port 22: SSH port
curl -s --max-time 5 telnet://github.com:22
# Connected to github.com port 22
# Port 9999: non-standard port
curl -s --max-time 5 telnet://github.com:9999
# Connected to github.com port 9999

github.com is on the Balanced allowlist. HTTP CONNECT established TCP tunnels to github.com on both port 22 and the non-standard port 9999; both succeeded. Port-based restrictions are not enforced at the proxy layer. The Balanced policy is hostname-scoped only. Any port to an allowed host is reachable via HTTP CONNECT.

Finding 3: DNS is not filtered.

A common assumption is that all outbound traffic routes through the HTTP proxy, including DNS. Lab results show DNS resolution occurs independently:

Plain Text
dig example.com +short
# 172.66.147.243

A blocked domain resolved. The microVM has an internal stub resolver that forwards DNS independently of the HTTP proxy. An agent can resolve any hostname regardless of the active policy. DNS cannot serve as a secondary enforcement layer.

These findings do not break the isolation model. They define its actual boundary. Network policy controls HTTP/HTTPS access by hostname.
It does not control DNS, TCP tunnels to allowed hosts on arbitrary ports, or how agents interpret 403 responses. The Agent Scenario: Isolation Under Real Load The real test of isolation is not seven probe commands — it is whether the boundary holds while an agent is actively working, making API calls, running kubectl, deploying containers. I gave an AI agent a broken Kubernetes deployment: a payments-service with memory limits set to 64Mi on a service that needs ~150Mi at peak. The agent received a task file and a set of manifests. No other context. The agent completed the task in under five minutes. It found two bugs — one planted, one discovered independently by reading the manifest and noticing health check probes targeting port 8080 on an nginx container that only serves on port 80. The task said nothing about probes. Result: both pods 1/1 Running, 0 restarts. The seven isolation proofs above were verified immediately after — throughout the entire debugging session, the boundary held without exception. Full article and complete repo at opscart.com/docker-sandbox-devops. What This Means for DevOps Engineers Specifically Most Docker Sandbox articles target software developers running Claude Code on a single codebase. The DevOps case is different and more demanding. A DevOps engineer running an AI agent faces a broader attack surface: multiple cluster contexts, infrastructure credentials, IAM roles, service accounts, kubeconfigs that grant production access. The blast radius of a compromised or manipulated agent is not one repo — it is potentially every system those credentials touch. Docker Sandbox addresses this at the architecture level rather than the prompt level. You are not relying on the agent being well-behaved. You are relying on the microVM boundary, the proxy, and the private Docker daemon. The agent can be fully autonomous inside the sandbox because the guardrail is the environment, not the agent's behavior. 
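To make the proxy-side guardrail concrete, here is a minimal sketch of the credential-injection idea described earlier: the agent's request carries no key, and a host-side component checks an allowlist and attaches the credential before forwarding. This is not Docker Sandbox's implementation. The `Request` type, the `forward` function, the allowlist contents, the secret value, and the header format are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical host-side state: it never exists inside the sandbox.
ALLOWED_HOSTS = {"api.anthropic.com"}
HOST_SECRETS = {"api.anthropic.com": "host-side-secret"}


@dataclass
class Request:
    host: str
    headers: Dict[str, str] = field(default_factory=dict)


def forward(request: Request) -> Tuple[int, Optional[Request]]:
    """Return (status, outbound request) the way a policy proxy might."""
    if request.host not in ALLOWED_HOSTS:
        # Blocked hosts get an HTTP-level 403, not a dropped connection.
        return 403, None
    # Copy the request so the agent's own object is never modified.
    outbound = Request(request.host, dict(request.headers))
    # The credential is attached here, outside the sandbox boundary.
    # The header name and format are illustrative assumptions.
    outbound.headers["Authorization"] = "Bearer " + HOST_SECRETS[request.host]
    return 200, outbound


agent_request = Request("api.anthropic.com", {"content-type": "application/json"})
status, sent = forward(agent_request)
blocked_status, _ = forward(Request("example.com"))
```

The point of the design is visible in the last three lines: `agent_request` never gains the credential, only the outbound copy does, so nothing inside the sandbox can read it back.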
The private Docker Engine is particularly significant. DevOps agents need to build and test containers. Every other local isolation approach that allows container operations requires socket mounting — which gives the agent direct access to the host Docker daemon and every image and volume on the host. Docker Sandbox eliminates this tradeoff. What Is Still Rough The image iteration cycle is the primary friction point. Adding a tool requires editing a Dockerfile, rebuilding, pushing to a registry, and recreating the sandbox. For a stable toolchain, this is acceptable. For rapid experimentation, it is not. The --branch parallel agent mode is Git isolation, not VM isolation. Both agents run in one microVM with shared Docker and network. For separate credentials or separate network policies per agent, you need separate workspace directories. The network policy CLI has non-obvious syntax in several places — sbx policy deny does not remove an allow rule, and external cluster access requires two policy rules not one. Neither behavior is documented. The CLI changes between minor versions. v0.31.1 changed login flow, renamed policy tiers, and introduced --clone mode. Pin your version. When Not to Use Docker Sandbox Docker Sandbox is the right tool for a specific set of problems. It is not the right tool when: You need raw UDP or ICMP. Network tracing tools (traceroute, mtr), some mTLS configurations, and anything relying on ICMP will not work — the sandbox proxy only handles HTTP/HTTPS. Your toolchain requires host-device access. USB devices, GPU passthrough beyond basic forwarding, and hardware security keys are not accessible from inside the microVM. You are on a memory-constrained machine. Each sandbox runs a full microVM plus its own Docker daemon. On a machine with 8GB RAM, running multiple sandboxes simultaneously alongside Docker Desktop and a browser will cause pressure. You need production-grade audit logging. Docker Sandbox is Experimental. 
Audit trails, compliance logging, and enterprise controls are not mature yet. For regulated environments, evaluate accordingly. Your agent needs to coordinate across multiple repositories simultaneously. The one-sandbox-per-workspace model means cross-repo agent work requires careful orchestration. The --clone mode helps but adds git workflow overhead. Conclusion The credential isolation model is the headline: the agent made authenticated API calls throughout the session without the API key ever entering the sandbox. Authentication was performed by the host-side proxy using credentials stored outside the sandbox boundary. The agent could use the credential — it could never see, copy, or exfiltrate it. Seven isolation proofs confirmed the boundary held under real active load. One directory visible. No credentials. No host processes. No host clusters. No host Docker daemon. The network policy findings add important nuance. The --branch mode reality is different from what the documentation implies. Docker Sandbox is Experimental, and the CLI is moving. Use it knowing what it is — and what it is not.
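Finding 1 has a practical consequence for anyone writing agent tooling: a retry loop should treat 403 as a possible policy block and fail fast. The sketch below shows one way to do that, assuming a caller-supplied `send` function that returns an HTTP status code. `fetch_with_policy`, `BlockedByPolicy`, and the set of retryable codes are hypothetical choices, not part of Docker Sandbox or any agent framework.

```python
import time
from typing import Callable, List

RETRYABLE = {429, 500, 502, 503, 504}  # transient conditions worth retrying


class BlockedByPolicy(Exception):
    """Raised when a 403 is treated as a policy block rather than retried."""


def fetch_with_policy(send: Callable[[], int], max_attempts: int = 3, delay: float = 0.0) -> int:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = 0
    for attempt in range(1, max_attempts + 1):
        status = send()
        if status == 403:
            # Fail fast: inside the sandbox a 403 may come from the proxy itself.
            raise BlockedByPolicy("403 received; check the network policy allowlist")
        if status not in RETRYABLE:
            return status
        if attempt < max_attempts:
            time.sleep(delay)
    return status


calls: List[str] = []


def blocked_sender() -> int:
    # Stand-in for a request to a host that is not on the allowlist.
    calls.append("attempt")
    return 403


try:
    fetch_with_policy(blocked_sender)
    outcome = "returned"
except BlockedByPolicy:
    outcome = "blocked"
```

With this policy the blocked request is attempted once and surfaces as an explicit error, instead of being retried silently.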
If you write technical documentation in markdown, you already know the tension: some parts of your document are hand-written prose, while others — a table of contents, an included code snippet, a rendered diagram — are generated from somewhere else. How you handle that boundary says a lot about your workflow. Most documentation toolchains resolve it the same way preprocessors like PET or Jamal do: separate the source from the output. You maintain a template file, run a build step, and get a rendered document as the result. Clean, predictable, and easy to reason about — but it adds a build step, and the output file is not the thing you actually edit or share. mdship takes a different approach. It is a command-line tool and MCP server that edits your markdown in place: it reads the file, updates specific sections, and writes the result back to the same file. Everything else — your prose, your headings, your structure — is untouched. No separate output file, no build pipeline. The document you see is the document you ship. Think of it less like a preprocessor and more like a very opinionated editor that knows how to regenerate a table of contents, pull in a code snippet from another file, or render a Mermaid diagram — all within the file you are already editing. One File: The Trade-Off Working in a single file has real advantages for technical writers. The managed content — including snippets, generated TOC entries — is visible inline while you are editing. You can read the full document as your readers will see it, without switching to a preview mode or running a build. There is no output file to track separately, and markdown-aware tools like GitHub or your IDE render it correctly wherever it lives. The downside is equally real: because managed and hand-written content share the same file, it is easy to accidentally edit a section that is meant to be regenerated. You fix a typo in an included code snippet; on the next run, your fix is gone. 
You add a note inside a generated TOC block; mdship overwrites it without warning. Preprocessor tools sidestep this entirely. The source is one file, the output is another, and you never edit the output directly. The separation of concerns is clean. But you pay for it: every change requires a build step, the output is not portable without that step, and contributors who are not familiar with the toolchain may not know which file to edit. Neither model is universally better. mdship makes the pragmatic choice that for most documentation workflows, a single file with good guardrails beats a clean architecture that requires a build. Content Integrity: The Guardrail The guardrail is a checksum. Every time mdship writes content into a managed section — a TOC block, an INCLUDE block, a MERMAID block — it records a checksum of that content inside the opening placeholder marker, under a key called _content_generated_. On the next run, before overwriting anything, it verifies that the checksum still matches. If it does not, mdship stops and reports an error instead of silently discarding your edits. Plain Text ERROR: Placeholder TOC content was manually edited. Hash mismatch detected. Delete _content_generated_ line to override and accept data loss. This turns an accidental overwrite — which would otherwise be invisible until you notice the missing content — into an explicit decision. You can delete the _content_generated_ line to tell mdship "I know, proceed anyway," or you can pass --force on the command line to skip the check for a single run. Either way, you are opting in, not being surprised. AI-Generated Sections: The Same Idea, Extended The same pattern extends naturally to sections written by an LLM. mdship supports an <!--AI--> placeholder: an HTML comment embedded in the markdown file that contains a prompt. 
When you invoke the /ai-placeholder skill in Claude Code, it reads the prompt and writes the generated content between the opening and closing markers, directly into the file, in place, just like any other mdship operation.

The workflow has three steps, enforced by the skill:

Check: before writing anything, the skill calls mdship ai-check via MCP to verify that the existing content has not been manually edited since it was last generated. If the checksum does not match, the skill stops and reports the conflict to you rather than overwriting your edits.
Generate: if the check passes (or there is no checksum yet, meaning the section is new), the LLM reads the prompt and writes the content.
Seal: after writing, the skill calls mdship ai-fix via MCP to record a new checksum for the freshly generated content, protecting it against accidental edits until the next intentional update.

The MCP integration means these calls happen automatically, as part of the skill's defined behavior, not as something the LLM has to remember to do.

The Prompt Is Documentation, Too

There is a subtler benefit to this approach that is easy to overlook. The prompt that instructs the LLM remains embedded in the file as a non-rendered HTML comment, right above the content it produced. It does not live in a commit message, a Jira ticket, or a separate prompt library that may be hard to find six months later. It is part of the document.

This has practical consequences. If you need to regenerate a section (because the underlying API changed, or a referenced file was updated, or you simply want a fresh pass), you re-run the same prompt against the same file. The instruction is already there; you do not have to reconstruct it. The prompt can also reference external files: other documentation pages, source code, configuration files. If those change, rerunning the prompt automatically picks up the changes. The document becomes self-updating in the sense that the machinery to update it is built in.
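The check, generate, seal cycle can be sketched in a few lines. This is an illustration of the pattern only, not mdship's code: the hash algorithm (SHA-256 here), the function names `checksum` and `regenerate`, and the `force` parameter's behavior are assumptions made for the example.

```python
import hashlib
from typing import Optional, Tuple


class ManualEditError(Exception):
    """Raised when managed content no longer matches its recorded checksum."""


def checksum(content: str) -> str:
    # SHA-256 is an assumption; the article does not say which hash mdship uses.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def regenerate(current: str, recorded: Optional[str], new_content: str,
               force: bool = False) -> Tuple[str, str]:
    """Check the old content, then write and seal the new content."""
    edited = recorded is not None and checksum(current) != recorded
    if edited and not force:
        raise ManualEditError("content was manually edited; refusing to overwrite")
    # Seal: the returned checksum protects the new content until the next run.
    return new_content, checksum(new_content)


# First run: no checksum recorded yet, so the section is simply written and sealed.
content, seal = regenerate("", None, "- Intro\n- Usage")

# Someone edits the managed block by hand.
edited_by_hand = content + "\n- my note"
try:
    regenerate(edited_by_hand, seal, "- Intro\n- Usage\n- FAQ")
    result = "overwritten"
except ManualEditError:
    result = "stopped"

# An explicit override accepts the data loss.
forced, _ = regenerate(edited_by_hand, seal, "- Intro\n- Usage\n- FAQ", force=True)
```

The guard only fires when a recorded checksum exists and disagrees with the current content, which is why a brand-new section passes straight through.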
Conclusion mdship's in-place editing model and its LLM integration are two expressions of the same design choice: keep everything in one file, protect it with checksums, and let the tooling manage the regeneration cycle rather than the author. For technical writers, this means fewer context switches, no build step, and a document that carries both its content and the instructions for maintaining that content in a single portable file. The trade-off — shared space for managed and hand-written content — is managed by the checksum guardrail, which turns silent overwrites into explicit decisions. Whether the content is generated by mdship itself or by an LLM following an embedded prompt, the contract is the same: write it, seal it, and trust that the next update will ask before it overwrites.
(Note: A list of links for all articles in this series can be found at the conclusion of this article.) In the previous installments of this series, we traced the arc from raw compliance intent — regulations such as NIST 800-53, FedRAMP, PCI DSS, EU AI Act — all the way to machine-readable OSCAL artifacts managed via GitOps pipelines and Trestle-powered automation. The central thesis has been that treating compliance artifacts as code, subject to the same versioning, testing, and review disciplines as software, is the only sustainable path to continuous assurance at scale. Part 3 of this series explored the collaboration topology: Regulators publishing OSCAL catalogs, Control Providers authoring component definitions, System Owners assembling SSPs, and Assessors generating SAPs and SARs — all mediated by Trestle's markdown-to-OSCAL round-trip. The friction was always the same: every persona still needed CLI fluency or IDE comfort to engage productively with OSCAL JSON. That friction is now removable. The Model Context Protocol (MCP) brings a standardized, AI-agent-ready interface to compliance tooling — and compliance-trestle-mcp, the first OSCAL-native MCP server from the OSCAL Compass community, makes every Trestle operation invocable by any MCP-compliant AI client: Claude, Roo Code, GitHub Copilot Workspace, or a custom agentic pipeline. Compliance-as-Code Game Changer With MCP The Model Context Protocol, incubated under the Linux Foundation and now an industry-wide open standard, provides a JSON-RPC layer by which AI models discover and invoke "tools" — discrete, typed operations exposed by servers. Think of it as the USB-C port for AI agents: standardized, self-describing, composable. Once an MCP server is registered, any compliant client can call its tools without custom integration work. For compliance workflows, this changes the architecture of engagement fundamentally. 
Today, driving Trestle to resolve a NIST 800-53 profile, generate SSP markdown, and assemble the resulting OSCAL JSON requires CLI invocations with precise arguments, work that falls to the Trestle-literate members of a compliance team. With compliance-trestle-mcp, those same operations become natural-language-addressable: an AI assistant executes the correct Trestle command sequence, validates the output, and surfaces results in whatever interface the persona is already working in.

Compliance-trestle-mcp: Architecture and Capabilities

The server is published on PyPI as compliance-trestle-mcp (v0.1.2, February 2026) and registered on the Official MCP Registry at registry.modelcontextprotocol.io under the identifier io.github.oscal-compass/compliance-trestle-mcp. Status is Active. Source: https://github.com/oscal-compass/compliance-trestle-mcp.

Figure 1: compliance-trestle-mcp listed as Active on the Official MCP Registry (registry.modelcontextprotocol.io), v0.1.2.

Tool Surface

Six tools are currently exposed by the server, each wrapping a core Trestle operation:

Tool | What it does
trestle_init | Initialize a Trestle workspace, creating the OSCAL folder hierarchy (catalogs, profiles, component-definitions, system-security-plans, etc.)
trestle_import | Import an existing OSCAL model (catalog, profile, SSP, component definition) from a local file or remote URL into the active workspace
trestle_author_catalog_generate | Generate per-control Markdown files from a catalog JSON, enabling human-readable editing without touching raw OSCAL
trestle_author_profile_generate | Generate Markdown documentation for the controls selected by a profile, preserving parameter overrides and guidance additions
trestle_author_profile_resolve | Resolve a layered OSCAL profile to a flat resolved-profile catalog, collapsing all imports and modifications
trestle_author_profile_assemble | Assemble edited Markdown controls back into a valid OSCAL Profile JSON, completing the round-trip

Installation (One Liner)

Add the following stanza to your agent's MCP configuration file (e.g., .roo/mcp.json for Roo Code or the Claude Desktop config):

JSON
{
  "mcpServers": {
    "trestle": {
      "command": "uvx",
      "args": ["--from", "compliance-trestle-mcp", "trestle-mcp"]
    }
  }
}

Personas Revisited: Now With an AI Co-Pilot

Part 3 of this series established the canonical compliance-as-code collaboration model: five personas, each with distinct artifacts, editing interfaces, and OSCAL expertise levels. The MCP layer transforms each persona's relationship with those artifacts.

Regulator

Regulators publish security regulations and standards (NIST 800-53, GDPR, HIPAA) typically as PDFs. With compliance-trestle-mcp, a Regulator's technical team can instruct an AI agent to call trestle_import against a raw OSCAL catalog URL (e.g., the NIST GitHub releases), then trestle_author_catalog_generate to produce reviewable Markdown. Editorial cycles that previously required Trestle CLI expertise are now conversational. The AI handles the workspace plumbing; the domain expert focuses on control prose accuracy.
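For readers curious what "calling a tool" looks like on the wire, the sketch below builds the JSON-RPC 2.0 messages an MCP client would send for the Regulator flow above. The `tools/call` method and the `name`/`arguments` parameter shape come from the MCP specification. The argument names and values passed to each Trestle tool are hypothetical placeholders, since the tool input schemas are not given in this article, and `tools_call` is a helper invented for the example.

```python
import json
from itertools import count
from typing import Any, Dict

_ids = count(1)  # JSON-RPC request ids only need to be unique per session


def tools_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request for the MCP tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


# Argument names and values below are hypothetical placeholders.
import_request = tools_call(
    "trestle_import",
    {"file": "https://example.org/catalog.json", "output": "my-catalog"},
)
generate_request = tools_call(
    "trestle_author_catalog_generate",
    {"name": "my-catalog", "output": "md_catalog"},
)

print(json.dumps(import_request, indent=2))
```

In practice the AI client constructs these messages itself after discovering the server's tools; the persona only sees the natural-language exchange.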
Compliance Officer/CISO Compliance Officers author organizational overlays — parameter tailoring, guidance additions, control selections — expressed as OSCAL profiles layered on a regulatory catalog. With the MCP server, the AI can be prompted to "resolve the FedRAMP Moderate profile against the NIST 800-53 Rev5 catalog and generate the delta markdown for my SSP authoring queue." The agent chains trestle_author_profile_resolve→ trestle_author_profile_generate autonomously, surfacing the output for human review. This eliminates manual multi-step CLI orchestration and radically compresses profile maintenance cycles. Control Provider (Component Author) Control Providers — the engineers maintaining component definitions that map control implementations to policy-as-code rules — have traditionally needed both OSCAL fluency and DevSecOps context simultaneously. Now, an AI agent can assist by importing existing component definitions, generating Markdown stubs for unmapped controls, and prompting the engineer for implementation prose inline in the chat. The component definition round-trip (JSON → Markdown → edit → trestle_author_profile_assemble → JSON) is fully MCP-orchestrated. System Owner/SSO The System Owner assembles SSPs from profiles and component definitions — historically the most labor-intensive and error-prone step. With compliance-trestle-mcp, an AI agent can be directed to initialize the workspace, import all upstream artifacts, resolve the applicable profile, and generate the SSP Markdown scaffolding in a single conversational exchange. What once required mastery of four distinct Trestle sub-commands and careful argument threading is reduced to a natural-language instruction sequence. Assessor Assessors generating Security Assessment Plans (SAPs) and Reports (SARs) need to trace every selected control back through the SSP to the component definition and the originating catalog. 
With the MCP server, an AI agent can navigate that traceability chain on demand, resolving profiles and surfacing control implementation status, evidence links, and outstanding POA&M items, all without the assessor ever touching Trestle directly.

The Emerging OSCAL MCP Ecosystem

compliance-trestle-mcp is the first OSCAL-native MCP server from an established open-source compliance project, but it is not alone. A brief survey of the emerging ecosystem:

Server | Origin | Focus
compliance-trestle-mcp | OSCAL Compass / CNCF Sandbox | Full Trestle workflow: init, import, catalog/profile generate-assemble-resolve. First CNCF OSCAL MCP server. Registered at registry.modelcontextprotocol.io.
mcp-server-for-oscal | AWS Labs (awslabs) | OSCAL schema introspection, model listing, and reference resource retrieval. Optimized for AI agents needing authoritative OSCAL structural guidance rather than authoring workflows.
OSCAL MCP UI Apps | Atelier Logos / Community | Visual MCP UI layer for FedRAMP and HIPAA OSCAL workflows; interactive SSP visualization and compliance gap analysis via agentic app runtime.

The AWS Labs server (github.com/awslabs/mcp-server-for-oscal) serves a complementary purpose: where compliance-trestle-mcp is workflow-centric (authoring and assembly), the AWS server is schema-centric (introspection and reference), providing AI agents with authoritative answers about OSCAL model structure, valid element sets, and use-case patterns. Together, they cover both the "what is OSCAL" and "do OSCAL" dimensions of agent-assisted compliance.

NIST's Vision and the CSWP 53 Horizon

The timing is not coincidental. NIST CSWP 53 ("Charting the Course for NIST OSCAL," December 2025 initial public draft) explicitly names agentic AI and digital twins as the next integration frontier for OSCAL: autonomous risk reasoning and continuous assurance driven by AI agents operating on machine-readable compliance artifacts.
The compliance-trestle-mcp server is a concrete early instantiation of exactly that vision, with the CNCF Sandbox project providing governance and sustainability guarantees that standalone tools lack. What Comes Next for compliance-trestle-mcp The v0.1.2 release covers the catalog and profile authoring surface. The roadmap naturally extends toward the full OSCAL lifecycle for AI-assisted System Security Plan and MCP resource exposure — surfacing OSCAL documents as MCP resources (not just tool outputs) so AI clients can reason over live workspace state. Conclusion Compliance as Code has always promised to make compliance automation as natural as software development. The MCP layer removes the final adoption barrier: the requirement for personas to learn Trestle directly. With compliance-trestle-mcp, every compliance stakeholder — from the Regulator drafting a new catalog overlay to the Assessor closing out a FedRAMP SAR — can now engage with OSCAL artifacts through natural language, mediated by an AI agent that understands both the domain and the toolchain. The server is live, registered, and installable in seconds. The OSCAL ecosystem is building out MCP coverage rapidly, with NIST's own roadmap pointing in the same direction. The gap between compliance intent and continuous machine-readable assurance has never been smaller. References and Learn More [1] OSCAL Compass / compliance-trestle-mcp GitHub. https://github.com/oscal-compass/compliance-trestle-mcp [2] Official MCP Registry — io.github.oscal-compass/compliance-trestle-mcp. https://registry.modelcontextprotocol.io [3] AWS Labs mcp-server-for-oscal. https://github.com/awslabs/mcp-server-for-oscal [4] COMPASS Part 3: Artifacts and Personas (DZone). https://dzone.com/articles/compliance-automated-standard-solution-compass-part-3-artifacts-and-personas [5] NIST CSWP 53: Charting the Course for NIST OSCAL (Dec 2025 IPD). 
https://csrc.nist.gov/pubs/cswp/53/charting-the-course-for-nist-oscal/ipd
[6] Building Visual MCP UI Apps for FedRAMP & HIPAA with OSCAL (Atelier Logos, Jan 2026). https://www.atelierlogos.studio/blog/2026-01-08-using-the-aws-mcp-server-for-oscal
[7] OSCAL Hub: Open-Source OSCAL Platform (RegScale / OSCAL Foundation). https://regscale.com/blog/introducing-oscal-hub/
[8] Model Context Protocol Roadmap (Linux Foundation, updated Mar 2026). https://modelcontextprotocol.io/development/roadmap

Below are the links to other articles in this series:

Compliance Automated Standard Solution (COMPASS), Part 1: Personas and Roles
Compliance Automated Standard Solution (COMPASS), Part 2: Trestle SDK
Compliance Automated Standard Solution (COMPASS), Part 3: Artifacts and Personas
Compliance Automated Standard Solution (COMPASS), Part 4: Topologies of Compliance Policy Administration Centers
Compliance Automated Standard Solution (COMPASS), Part 5: A Lack of Network Boundaries Invites a Lack of Compliance
Compliance Automated Standard Solution (COMPASS), Part 6: Compliance to Policy for Multiple Kubernetes Clusters
Compliance Automated Standard Solution (COMPASS), Part 7: Compliance-to-Policy for IT Operation Policies Using Auditree
Compliance Automated Standard Solution (COMPASS), Part 8: Agentic AI Policy as Code for Compliance Automation With Prompt Declaration Language
Compliance Automated Standard Solution (COMPASS), Part 9: Taking OSCAL-Compass to Industry Complexity Level
Compliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability
AI has already moved beyond text generation. Modern agents can browse the internet, read documents, call APIs, query databases, and coordinate numerous actions between tools and services. They are expected to do more than simply provide a single nebulous answer. In real-world systems, agents evaluate the quality of their own results, independently identify errors, and adapt. This capacity for reflection and adaptation distinguishes deep agent systems from the simple, one-off interactions of language models based on the 'one question, one answer' principle. A single-pass answer often suffers from incomplete reasoning, missing context, unclear instructions, or contradictory constraints.

Rather than treating the generated results as final, the agent verifies them by asking questions:

- Does the result match the user's intentions?
- Are there any logical inconsistencies?
- Is the answer comprehensive and well-structured?

Consequently, generating a response takes longer, because it involves several verification steps.

Generation and evaluation are different tasks and should not be handled by the same agent. The generator creates an initial response, while the evaluator analyzes it for correctness, clarity, and alignment with the user's intentions. As with humans, the evaluator should not be constrained by the same assumptions that led to the generator's initial output. If the evaluator finds an error, the response is sent back to the generator for revision, and the cycle repeats.

It is important to manage feedback loops and response revisions effectively. Endless revision cycles are counterproductive and can become very costly. Clear evaluation criteria, follow-up questions for the user, a list of corrective strategies, and explicit decision points are required.

A good prompt should describe how the system is supposed to operate, which tools must be used, and what steps should be taken. However, the more complex the task, the greater the chance of making a mistake, as in every other IT process.
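The generator and evaluator loop with an explicit stopping point can be sketched as follows. This is a minimal illustration, not a specific framework's API: `refine`, `stub_generate`, and `stub_evaluate` are hypothetical names, and in a real system the two callables would wrap separate model calls.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str, List[str]], str]   # (task, feedback) -> draft
Evaluator = Callable[[str, str], List[str]]   # (task, draft) -> list of problems


def refine(task: str, generate: Generator, evaluate: Evaluator,
           max_rounds: int = 3) -> Tuple[str, int, bool]:
    """Revise a draft until the evaluator accepts it or the round budget is spent."""
    feedback: List[str] = []
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        problems = evaluate(task, draft)
        if not problems:
            return draft, round_number, True
        # Feed the evaluator's findings into the next generation round.
        feedback = problems
    # Explicit decision point: stop revising and report the draft as unaccepted.
    return draft, max_rounds, False


def stub_generate(task: str, feedback: List[str]) -> str:
    # Stand-in for a model call: applies the feedback if there is any.
    return task.upper() if feedback else task


def stub_evaluate(task: str, draft: str) -> List[str]:
    # Stand-in for a second model with a single, clear criterion.
    return [] if draft.isupper() else ["use upper case"]


answer, rounds, accepted = refine("summarize the report", stub_generate, stub_evaluate)
```

The `max_rounds` budget is what prevents the endless revision cycle: when it runs out, the caller gets the last draft together with a flag saying it was never accepted, and can escalate to the user instead of looping.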
This is where the Model Context Protocol (MCP) comes in. The MCP enables us to identify and execute the necessary actions across different programs, access external resources, and retrieve results. For instance, to parse a website and create a mock-up of it in Figma, you would use the Selenium URL loader. Think of the MCP as a bridge facilitating pre-defined interactions between models, tools, and external systems.

MCP reduces the effort required of the user to describe actions. Tools and resources are pre-loaded onto the MCP server rather than being described in text instructions. If a user requests a summary of recent news, for example, Newspaper3K is configured to retrieve the relevant data, and Ollama and the OpenAI API are set up for local and server-side text generation, respectively. The model itself decides which tool to use, rather than trying to recreate the behavior from the user's prompts. MCP transforms the model into something suitable for real-world tasks.

The MCP can be viewed as a coordination system that links intelligence and execution. The model focuses on understanding user intentions and answering the question, 'What does the user want from me?' The MCP manages the discovery, verification, and orchestration of tools and available resources. The LLM can't call APIs on its own; the calls are made through the MCP. The MCP also helps to prevent context fragmentation. The context window is the maximum number of tokens that the model can process in a single request.

However, there is no magic solution; the 'do it right' button has yet to appear, so we still have a job to do. It's best to interact with an LLM using structured, detailed prompts to ensure predictable, consistent behavior. Providing clear instructions reduces the likelihood of misuse, wasted tokens, and confusion. Tokens are the basic units of text that a model processes. There are various tokenization methods; popular examples include WordPiece, SentencePiece, and BPE.
You can import the nltk library and extract tokens from a sentence yourself: 'What goes around comes around' would be split into 'What', 'goes', 'around', 'comes', 'around', and each token would then be mapped to a numeric ID that the model can work with. In this sense, LLMs resemble classical models such as linear regression: they operate on numbers, not on raw text.

Key components of MCP:

- Clients that manage user interactions, conversation state, and orchestration.
- Servers that provide discoverable tools and resources. Typically, these are HTTP-based servers that act as lightweight backends, remaining active and accepting requests via URLs.
- Messages that convey intent, context, and execution results.
- Structures for incoming and outgoing data.

This separation helps the MCP avoid entanglement between models and execution logic. While each component remains independent, they continue to work together via a common protocol (which may be the MCP or another protocol). Models do not speculate or invent actions; they operate strictly within the capabilities defined by the MCP. This simplifies system debugging, makes deployment safer, and ensures more predictable behavior.

Broadly speaking, resources are documents, files, or any other type of structured content. All of these are accessible via a URI. This ensures that the model operates within defined rules and constraints, which makes it easy to debug errors. Therefore, it is important that each tool can be tested in isolation and reused. This is the only way to scale the system.

However, there are a few rules to follow when working with resources. Typically, businesses want instant access via an LLM to all the documentation accumulated over the last 30 years: legacy systems, sets of PDFs, and so on. Even if we are technically able to provide the entire text at once upon request, we should still avoid large documents. Smaller resources stay readable and fit more comfortably within the context window.
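One simple way to avoid serving very large documents is to split them into overlapping chunks before exposing them as resources. The sketch below uses a plain word count as a rough stand-in for a token count, which is an assumption: real tokenizers (WordPiece, SentencePiece, BPE) will count differently. `chunk_words` and its default sizes are hypothetical choices for illustration.

```python
from typing import List


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of at most `size` words, sharing `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    words = text.split()
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            # The last chunk already reaches the end of the text.
            break
    return chunks


sample = "one two three four five six seven"
pieces = chunk_words(sample, size=3, overlap=1)
# pieces == ["one two three", "three four five", "five six seven"]
```

The overlap keeps a little shared context between neighboring chunks, so a sentence that straddles a boundary is not lost entirely in either piece.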
Here, we will use an actor-critic architecture with two models: one selects the tool, and the other validates the quality of the selection via a reward. One model is responsible for the rules and the other for the value to the user.

What If There Are Any Errors?

Architecture inevitably becomes more complex over time, sometimes as early as the first iteration. The more complex and interconnected AI becomes, the greater the likelihood of errors or even failure. The key question, given that we are no longer dealing with predictable CRUD services, is: ‘How can we properly restore operations after errors occur?’ For AI systems, recovery from failures means ensuring that the system keeps operating, and results remain acceptable, even if individual components fail. Rather than allowing one failure to bring everything to a halt, a resilient system degrades gracefully. Is GPT-5.4 unavailable? In that case, we switch to Gemini 2.5. The system may degrade, but it will continue to operate, which is better than a complete system failure. Ideally, you should have alternative tools and models, as well as simplified logical paths. And, of course, backups. If the model starts producing answers that are unsafe or violate policy and we cannot identify and fix the problem, we provide only conservative responses. The debugging process involves checking the input data and then testing the functionality of the tools and APIs, including checking their availability, latency, and response integrity.

Multi-Step Reasoning

Single-step reasoning is effective for simple queries, but becomes less so when tasks involve dependencies or intermediate solutions. In such situations, rather than immediately producing a final answer, the agent must track the progress of execution at every stage.
Multi-stage reasoning addresses this by breaking down complex goals into smaller subtasks, preserving context separately at intermediate stages, and altering the execution sequence when an assumption turns out to be incorrect. Validation acts as a control mechanism in multi-stage workflows: checking each intermediate result prevents errors from different stages from accumulating, and prevents tokens from being wasted on calculations based on incorrect data. The likelihood of failure is very high if an agent has to tackle a highly complex, long-running task. One of the main reasons for this is an inability to prioritize subtasks. Hierarchical planning is required to distinguish between strategy and implementation. To focus on the long-term goal, we need temporal abstraction and constant feedback from the user.

Monitoring

LangSmith is a useful tool for monitoring agents. It is compatible with both LangChain and LangGraph and organizes its traces around runs. An alternative is Langfuse, which, in my view, is better suited to enterprise environments where there is a dedicated role for analyzing the request processing pipeline. It has a great dashboard, too. Langfuse enables you to troubleshoot issues using tracing. If a problem arises due to unexpected interactions between search processes, request formation, or model execution, Langfuse can help. However, LangSmith also shows the sequence of events from start to finish, taking context into account. Classic Prometheus and Datadog are still suitable for tracking agents' activities. Overall, however, combining the Streamlit interface, LangChain pipelines, vector storage, and LangSmith tracing into a single app.py is a good solution. Centralization simplifies tracking, debugging, and analyzing workflows.

So, the problem has been identified. What next? When implementing AI in a large company, API failures are most often caused by incorrect input data or unexpected response structures rather than errors in the model itself.
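Since malformed input is the usual culprit, one inexpensive defense is to validate every payload before it reaches the model. The following is a minimal sketch using only the standard library; the `SummaryRequest` shape and the `validate_request` helper are hypothetical, and in practice a schema library would typically take over this job.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class SummaryRequest:
    """Hypothetical request shape for a summarization endpoint."""
    query: str
    max_tokens: int


def validate_request(payload: Dict[str, Any]) -> Tuple[Optional[SummaryRequest], List[str]]:
    """Return (request, []) when the payload is valid, else (None, errors)."""
    errors: List[str] = []

    query = payload.get("query")
    if not isinstance(query, str) or not query.strip():
        errors.append("query must be a non-empty string")

    max_tokens = payload.get("max_tokens", 256)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(max_tokens, bool) or not isinstance(max_tokens, int) or max_tokens <= 0:
        errors.append("max_tokens must be a positive integer")

    if errors:
        return None, errors
    return SummaryRequest(query=query.strip(), max_tokens=max_tokens), []


# A well-formed payload passes; a malformed one is rejected with reasons,
# before any tokens are spent on a model call.
good, good_errors = validate_request({"query": "Summarize the Q3 report", "max_tokens": 128})
bad, bad_errors = validate_request({"query": "", "max_tokens": "many"})
```

Returning the list of problems, instead of raising on the first one, makes the rejected request easier to log and reproduce later.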
LangServe's automatic schema inference reduces the number of failures before the request even reaches the model, so this is nothing new. I would suggest using containerization to reproduce errors. This provides service isolation to prevent dependency conflicts and enables reproducible deployments using container images with pinned versions. There are also other benefits of container orchestration. Containerized components include:

- Agent APIs: access to tool execution via LangServe or similar frameworks.
- MCP servers: provide standardized access to tools and resources using the MCP client-server model. Containerization of MCP servers ensures consistent tool availability across all environments. The key is to avoid hard-coded file paths.
- Monitoring: log execution traces, performance metrics, and assessments using LangSmith or similar tools.
- Supporting infrastructure: databases, vector stores, or simply files accessed by agents.

Data

We’ve received a PDF file, and our task is to make it accessible via an LLM. First, the PDF needs to be split into chunks, each with a unique UUID. After embedding, these chunks should be stored in a vector database. The text must be split either sentence by sentence or with chunk overlap to preserve context between chunks. RAG will then enable us to interact with the document. RAG is essentially an LLM that has access to a knowledge base. It can also reduce hallucinations to some extent. As always, the key to success here is data: its quality, stability, backups, and access speed. The high-level process is as follows: HTML query > retrieve > generate

To implement RAG on AWS, you can consider using Bedrock for the LLM, OpenSearch as the vector database (with S3 as the document source), and Lambda. Bedrock is Amazon’s service for deploying AI agents, and I love their prompt management. The most critical aspect of RAG is the content you upload: the system can only retrieve and answer from what it is given, so it is crucial to provide high-quality files.
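The chunking step described above (split the text, give each chunk a UUID, keep some overlap) can be sketched as follows. This is a simplified illustration that works on character windows; real pipelines usually split on sentences or tokens, and the embedding and vector-store steps are left out. The function name `chunk_text` and its default sizes are assumptions, not part of any particular library.

```python
import uuid
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[Dict[str, str]]:
    """Split text into overlapping character windows, each tagged with a UUID."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[Dict[str, str]] = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({"id": str(uuid.uuid4()), "text": piece})
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "RAG lets a model answer from a knowledge base. " * 10
chunks = chunk_text(sample, chunk_size=120, overlap=30)

# The tail of each chunk is repeated at the head of the next one,
# which is what preserves context across chunk boundaries.
shared = chunks[0]["text"][-30:] == chunks[1]["text"][:30]
```

The overlap trades a little extra storage for retrieval quality: a sentence cut at a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.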
Here, we have to keep in mind Amdahl's law in the context of parallel computing. The idea is simple: performance gains plateau as the number of processing threads increases because the sequential parts of the task cannot be parallelized. Formally, if a fraction p of the work can be parallelized across n threads, the speedup is at most 1 / ((1 - p) + p / n). When compiling llama.cpp on a 24-core, 64-thread AMD Threadripper processor, I noticed that increasing the number of threads from 12 to 64 significantly reduced the time taken for compilation. However, exceeding 64 threads only yielded a marginal improvement, due to I/O bottlenecks and sequential dependencies.

As part of the Amazon ecosystem, Bedrock is bundled with SageMaker for model training, AWS App Studio, and Amazon Q, which is a ready-to-use AI assistant. Also, if the free version of Google Colab proves insufficient, AWS SageMaker is a solid alternative. If you have chosen Bedrock and work in Rust, you will most likely use async/await with the Tokio runtime for parallel Bedrock API calls. Amazon OpenSearch Serverless can be used as a vector database, and it's a pretty popular option. Rather than performing searches based on keyword matches, it indexes documents and performs searches based on semantic similarity. In the RAG pipeline on AWS, documents from S3 are split into fragments, embedded using Amazon Titan or a similar model, and stored in a vector index. This allows the most relevant content to be retrieved in response to user queries and synthesized using an LLM.

Take all of this with a grain of salt, though. With Amazon mentioned so many times, it is time to consider the associated costs. It’s important to keep costs under control. Data is the new gold, for sure. But having too much data isn’t good for the wallet. It's important to be able to cache frequently executed queries.
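Caching repeated queries can start as a small in-memory store with a time-to-live. The sketch below is an illustration under stated assumptions: `QueryCache`, `cached_answer`, and `answer_query` are hypothetical names, and `answer_query` merely stands in for the expensive retrieve-and-generate call.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple


class QueryCache:
    """Tiny in-memory cache with a time-to-live, keyed by normalized query text."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different queries share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._store[key]  # entry expired
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (self._clock(), answer)


call_log: List[str] = []  # records every call that reached the "model"


def answer_query(query: str) -> str:
    """Hypothetical stand-in for the expensive retrieve-and-generate step."""
    call_log.append(query)
    return f"answer to: {query}"


def cached_answer(cache: QueryCache, query: str) -> str:
    hit = cache.get(query)
    if hit is not None:
        return hit
    answer = answer_query(query)
    cache.put(query, answer)
    return answer


cache = QueryCache(ttl_seconds=60)
first = cached_answer(cache, "What is RAG?")
second = cached_answer(cache, "  what is rag? ")  # served from the cache
```

Only the first call reaches `answer_query`; the second is answered from memory. For a multi-instance deployment, the same interface could sit in front of a shared store instead of a dictionary.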
If you need a step-by-step guide:

1. Use Bedrock alongside S3 as your data source and OpenSearch Serverless as your vector search engine.
2. Implement smart chunking to optimize documents for search.
3. If real-time data freshness is not required, use batch loading intervals instead of continuous updates.
4. Add a caching layer for frequently asked queries.

The development of the agent can be broken down into three stages.

- Data preparation involves data loading, pre-processing, and structuring. Chunking and embedding.
- Indexes: preparing for successful data retrieval. Vector stores and SQL are all available in ChromaDB, Pinecone, and FAISS. The type of database is important because FAISS can store the index and perform searches on the GPU, which can speed up searches substantially. Meanwhile, GraphRAG enables you to link information to context and build connections.
- Retrievers are used to find the right document based on a query. Hybrid search retrieves the required document. It can also delete documents.

One challenge you’ll face repeatedly is reducing your monthly LLM costs while maintaining response quality and ensuring compliance with data privacy regulations. To achieve this, you should examine your current pay-per-call costs on Bedrock and compare them with fixed-price alternatives. You will most likely need to migrate workloads involving large volumes of data and heightened privacy requirements to a locally deployed llama.cpp setup with GGUF quantized models. This will eliminate API usage fees and improve data security. However, we won’t be able to completely abandon Bedrock if we require massive models. We can prototype on Canvas while MLOps keeps an eye on costs.

Fine-Tuning

Although pre-trained models are useful, we usually need our own. We can adapt models that have been pre-trained on large datasets to our smaller task. The simplest approach is standard fine-tuning, which involves updating the weights to adapt the model to our dataset.
We take a pre-trained model as the starting point and do not overwrite the original. If your tasks are typical and you have a large dataset, then standard fine-tuning is the way to go.

The second fine-tuning option is low-rank adaptation (LoRA), which involves adding small trainable matrices to specific layers. This approach trains only around 0.1% as many parameters as the original model has. In effect, it enables targeted adjustments to be made to the model when computational resources are limited. It even works for large models. The original weights remain frozen, and the low-rank matrices are added on top of them. This enables us to adapt the model for a wide variety of tasks. We use it when resources are limited, for multitasking, and to avoid catastrophic forgetting. LoRA is well-suited to open-source projects, and the PEFT library is widely used to apply it.

The third option is Supervised Fine-Tuning (SFT), which trains the model on labeled examples by minimizing a loss function. It is particularly well-suited to tasks requiring high accuracy when a labeled dataset is available.

The overall process will look like this:

1. We need a dataset.
2. It is prepared.
3. A new layer is created.
4. The model is trained.
5. The model is tested and deployed.

Lesson from my painful experience: pay particular attention to the file ID, as one small mistake here can be costly. If you have a subject-matter expert (SME) available, you could opt for RLHF (reinforcement learning from human feedback). In practice, the training data is stored in JSONL format and uploaded to OpenAI’s servers. Then, a fine-tuning job is created. You can view the demo here. I prefer to use jq when working with JSONL. Before training the model, make sure you have defined and configured the training parameters. Key parameters:

- Learning rate: If this is set too high, the results will be unsatisfactory.
If it is too low, the model will take a very long time to train.
- Batch size: The smaller the batch size, the noisier and less stable the training will be.
- The number of epochs: The lower this is, the weaker the training will be. Setting the epochs parameter to 5 means that the dataset will be iterated through five times.

LLAMA

Would you like to install the model locally? GGUF is the ideal format for local models with llama.cpp. It acts as a sort of bridge. It feeds into the GGUF Conversion Pipeline, a multi-stage process that converts a model from the original Hugging Face format into a single artifact file ready for deployment. After quantization, we reduce the file size from 62 gigabytes to approximately 19 gigabytes using llama-quantize. If the system can handle it, we can use the model to our heart's content. My code is not the best, and an LLM could generate a better one. However, this code has worked fine on five different machines with different parameters and operating systems, so it's pretty robust.

Download llama.cpp and its tooling. The llama.cpp C++ toolkit converts models into locally deployable helpers.

```shell
git clone https://github.com/ggerganov/llama.cpp.git
# Install uv, which is used later to run the conversion script
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Check the remote repositories configured for the current Git repository.

```shell
git remote -v
```

Build llama.cpp and install huggingface_hub.

```shell
# Build with Metal and Accelerate support (macOS); if your checkout does not
# support make, use the CMake build from the quantization step instead
make GGML_METAL=1 GGML_ACCELERATE=1 -j8
# The brackets are escaped or quoted so that zsh does not treat them as a glob
pip3 install --user huggingface_hub\[cli\]
pip3 install --upgrade --user 'huggingface_hub[cli]'
```

And we use a script to download a 23-gigabyte model.
```shell
python3 -c "
from huggingface_hub import hf_hub_download

print('Downloading Qwen 2.5 Coder 32B Q5_K_M...')
hf_hub_download(
    repo_id='Qwen/Qwen2.5-Coder-32B-Instruct-GGUF',
    filename='qwen2.5-coder-32b-instruct-q5_k_m.gguf',
    local_dir='.',
    local_dir_use_symlinks=False
)
print('Download complete!')
"
```

Or a smaller version, because the larger version runs very slowly on my computer:

```shell
cd ~/git/llama.cpp
python3 -c "
from huggingface_hub import hf_hub_download

print('Downloading Qwen 2.5 Coder 7B Q5_K_M (~5GB)...')
hf_hub_download(
    repo_id='Qwen/Qwen2.5-Coder-7B-Instruct-GGUF',
    filename='qwen2.5-coder-7b-instruct-q5_k_m.gguf',
    local_dir='.',
    local_dir_use_symlinks=False
)
print('Download complete!')
"
# Confirm that the model file is in place
ls -lh ~/git/llama.cpp/*.gguf
```

Install uv with curl -LsSf https://astral.sh/uv/install.sh | sh, then check the version using uv --version. uv is required to run the script that converts from PyTorch to GGUF. Download the dependencies.

```shell
# Run the conversion script with its dependencies resolved on the fly by uv
uv run --with transformers --with torch --with sentencepiece \
  python convert_hf_to_gguf.py /actual/path/to/model
# Alternative without uv: install the dependencies with pip
pip3 install --user transformers torch sentencepiece protobuf numpy
```

After running uv, the next steps, which are mainly useful for troubleshooting, are uv venv to create the environment and uv sync to install the dependencies.

Optionally, quantize to reduce the model size, as discussed earlier in the article.

```shell
cd ~/git/llama.cpp
# Create build directory
mkdir -p build
cd build
# Configure with Metal support (for Mac GPU)
cmake .. -DGGML_METAL=ON
# Build (use -j8 for parallel compilation)
cmake --build . --config Release -j8
ls -la bin/
# The input here is already quantized (Q5_K_M), so --allow-requantize is needed;
# quantizing from an F16/F32 file gives better quality
./bin/llama-quantize --allow-requantize \
  ../qwen2.5-coder-32b-instruct-q5_k_m.gguf \
  ../qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  Q4_K_M
```

llama-server runs the model locally and serves a chat interface. Start it with the command that follows, then, to start a conversation, go to http://127.0.0.1:8082/.
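```shell
cd ~/git/llama.cpp/build
# -m: model file, -c: context size in tokens,
# -ngl: number of layers to offload to the GPU, --port: HTTP port
./bin/llama-server \
  -m ../qwen2.5-coder-7b-instruct-q5_k_m.gguf \
  -c 8192 \
  -ngl 99 \
  --port 8082
```

I hope this article helps you save money on LLMs, tokens, and MCPs.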
I have spent the better part of a decade building data protection products for global enterprises. Cloud DLP, CASB, SSPM, Behavior Threats, AI Access Security, ISPM, etc. The kinds of things that sit between a user, an agent, or an application and the sensitive data nobody wants to see in the wrong place. Every conversation I have had with a customer security architect this year eventually arrives at the same question. The threat landscape has clearly changed. What does that mean for the controls we already own? This article is the analysis I have been sharing with security architects across industries who are evaluating how their data protection programs need to evolve. It is grounded in what is publicly documented, what it actually changes for enterprise data security, and where I would direct the next dollar of investment based on a decade of building these products at scale. What Actually Shifted, With Sources There are three publicly verifiable data points worth understanding before any control conversation makes sense. Discovery Is Becoming Inexpensive Mozilla shipped Firefox 150 in April 2026 with two hundred and seventy-one fixes that came out of a single sweep using an early version of Anthropic’s Mythos preview model. That is roughly four times the project’s typical annual baseline, in one pass. Mozilla also added the most honest sentence I have read on this topic all year. They said they have not seen any bug in the set that an elite human researcher could not have found, given enough time. SecurityWeek covered the details: securityweek.com/claude-mythos-finds-271-firefox-vulnerabilities. Read that caveat carefully. The thing that became automated is not novelty. It is the cost of finding a class of bugs that humans were always capable of finding. When the price of an action drops by an order of magnitude, the action gets done at scale. That is the shift, and it is the shift that matters. 
Patching Is Not Getting Cheaper at the Same Rate HackerOne paused new submissions to its Internet Bug Bounty program on March 27, 2026. The IBB is the oldest crowdsourced vulnerability reward program for open source, dating back to 2013. The pause was not a budget decision. It was an admission that the gap between AI-assisted discovery volume and the ability of volunteer maintainers to ship patches had become unbridgeable on the existing incentive model. Dark Reading’s coverage is here: darkreading.com on the IBB pause. Earlier in the year, the curl project removed bounties from its program for the same reason, after a wave of low-quality AI-generated submissions overwhelmed triage. If the upstream open source ecosystem is struggling to keep pace with discovery, every enterprise that ships software with open source dependencies is downstream of that struggle. That is most enterprises. Autonomous Agents Are Already Creating Real Incidents In April 2026, the Cloud Security Alliance published two surveys that I think every data security team should read. The first study found that fifty-three percent of organizations have had AI agents exceed their intended permissions, and forty-seven percent have already experienced a security incident involving an agent in the past year. The second, published a week later, reported that eighty-two percent of enterprises have discovered previously unknown agents running in their environments, and sixty-five percent have had an agent-related incident. The most common consequence was data exposure. CSA’s findings: Enterprise AI Security Starts with AI Agents and Autonomous but Not Controlled. Take those three threads together. Bug discovery is industrializing. The patch side is bottlenecked. And inside the enterprise, autonomous agents are already operating in places nobody fully maps. That is the operating reality, not a forecast. 
Why This Matters More for Data Security Than for Any Other Function Most of the AI security conversation is framed around vulnerabilities and exploits. I think that framing misses what is actually changing for enterprises. When a class of vulnerabilities becomes cheaper to discover, the average time between exposure and exploitation shortens. When average exposure time shortens, the probability that any given control fails inside that window goes up. When more controls fail more often, the consequence shows up at the data layer. Data is the asset. Everything else is a path to it. The CSA finding I keep coming back to is the one that says agent incidents most often produce data exposure, not service outages. That tracks with what I see at customer sites. The blast radius of an agent compromise is determined by the data the agent had access to, the policies that were being watched, and the speed at which someone noticed. None of those three is improving on the timeline that adversaries are improving. If an agent has access to your sensitive data, the agent is part of your data security perimeter, whether your DLP product knows it or not. That sentence is the part of the conversation that I find most data security teams are not yet having internally. It needs to happen this quarter. Three Things Data Security Programs Should Rethink Now 1. Stop Treating Non-Human Identities as a Hygiene Problem CyberArk’s 2025 Identity Security Landscape, surveying 2,600 cybersecurity decision-makers globally, found that machine identities now outnumber human identities by more than 80 to 1 in the typical enterprise, up from roughly 45 to 1 in their 2024 study. GitGuardian’s State of Secrets Sprawl 2025 report found 23.8 million new secrets exposed on public GitHub in 2024 alone, a 25 percent year-over-year increase, with non-human identities flagged as the dominant credential population behind that growth. 
The exact ratio in any given environment is a question for the IAM team, but the order of magnitude is consistent across every serious study I have read, and it is rising fast. Most enterprise IAM programs were designed around human users. Periodic access reviews. Quarterly attestation cycles. Manager signoff. None of that infrastructure was built for a population that is now eighty times larger, that provisions itself, and that often outlives its original use case. CSA’s research found that only 21 percent of organizations have a formal decommissioning process for AI agents. Everyone else is accumulating what the report calls retirement debt: agents who completed their task months ago and still hold credentials, tokens, and data access. From a data security standpoint, the practical consequence is that an enterprise’s most overprivileged identity is rarely a person. It is a service account from 2022 that nobody remembers, an OAuth grant that an integration test attached to a real production scope, or a workflow agent that picked up admin-level permissions during deployment because the person setting it up did not want to debug a permission-denied error at 11 p.m. These identities are reachable by adversaries through a single credential compromise, and they often have direct access to the kinds of data that DLP policies were written to protect at the human user layer. The remediation requires a structured non-human identity program with a named owner, a defined lifecycle covering provisioning, rotation, and decommissioning, and quarterly access reviews that apply to bots the way they apply to humans. Workload identity federation rather than long-lived secrets. Scoped service accounts. Logging that captures what each non-human identity touched, not just whether it authenticated successfully. From a tooling perspective, this work sits at the intersection of CASB, IAM, and DLP, and in most enterprises, it has no clear owner across those three functions. 
Establishing that ownership is the precondition for everything else.

2. Refresh Classification and Tagging for an Agentic Environment

In my own work on DLP product strategy, I have come to think of classification and tagging as the foundation that every other data control sits on. If sensitive content is correctly identified at the moment it is created or ingested, downstream policies have a fighting chance. If it is not, no amount of policy authoring downstream will catch up. Most enterprise tagging programs were designed for documents flowing through email, endpoints, and a manageable list of SaaS applications. The current generation of AI agents and copilots flows through none of those choke points cleanly. An agent reads a corpus, generates a derivative artifact, and writes that artifact somewhere else. The original tag, if there was one, often does not survive the round trip. The derivative may contain sensitive content that was reassembled from sources that were each individually below the policy threshold. Three practical refreshes are worth funding now.

- Treat AI-generated outputs as a first-class data class. Anything produced by an agent or copilot needs provenance metadata that travels with it: which model produced it, against which prompt, derived from which sources, with which level of human review. Most enterprise classification taxonomies do not have a slot for this yet. Add one.
- Lower the threshold for tagging at ingestion. The cost of misclassifying a sensitive document used to be that a human eventually emailed it to the wrong person. The cost now includes an agent reading it as part of a larger context and producing a derivative that lands in a SaaS workspace your DLP product does not inspect. Err on the side of more aggressive classification at the source.
- Audit your DLP coverage of LLM endpoints and agentic SaaS surfaces.
Most DLP deployments I see in the field have rich coverage of email and endpoints, partial coverage of cloud applications, and almost no coverage of the LLM and agent traffic that has become a meaningful share of how sensitive data now leaves the environment. That is the coverage gap most likely to show up in a 2026 incident report. 3. Put a Model in the Pull Request Path This is one of the few areas where the offensive shift in AI capability cuts directly in defenders’ favor, and most enterprise application security programs are not yet using it. The traditional SAST and DAST queue is where AppSec hours go to die. Thousands of unverified findings, most of them noise, validated entirely by humans on a backlog that never empties. The newer pattern is to put a model-based reviewer in the pull request path itself. Every PR is reviewed by an automated agent for security defects before a human sees it. Findings show up as inline comments. High-confidence findings can block the merge. OpenAI publicly stated in April 2026 that its Codex Security agent has contributed to over 3,000 critical and high-severity vulnerability fixes across the ecosystem since launch, and that its Codex for Open Source program now provides free security scanning to more than 1,000 open-source projects. Anthropic, Semgrep, and several other vendors have shipped comparable capabilities. Whether you build on a commercial offering or assemble an internal pipeline, the workflow is what matters. One nuance worth knowing about. Standard commercial models often refuse legitimate dual-use security queries by policy. Binary reverse engineering, exploit reasoning, malware analysis. If your AppSec team has been telling you that AI tools “do not work for security,” this refusal threshold is usually the reason. 
Both Anthropic’s Glasswing program and OpenAI’s Trusted Access for Cyber, expanded on April 14, 2026, to thousands of verified individual defenders, exist precisely to provide a lower refusal threshold for verified defensive use cases. Enterprise procurement and legal teams should start the verification paperwork now, not after a need arises. The Supply Chain Is the Other Half of the Data Exposure Problem Two recent incidents are worth holding in mind whenever this conversation comes up. On September 8, 2025, eighteen widely used npm packages, including chalk, debug, and ansi-styles, were trojanized after a phishing campaign targeting the maintainer known as qix. Those eighteen packages collectively account for over 2.6 billion weekly downloads. The malicious versions were live for roughly two hours and were written to drain cryptocurrency wallets, but the same access could have been used to exfiltrate environment secrets, build credentials, or sensitive data from any CI pipeline that pulled the bad version during that window. Palo Alto Networks Unit 42 and others published detailed breakdowns: paloaltonetworks.com on the qix incident. A week later, on September 15, 2025, the Shai-Hulud worm became the first self-propagating malware in the npm ecosystem, compromising hundreds of packages in its initial wave and continuing to evolve through follow-on campaigns into late 2025 and early 2026. The malware integrated TruffleHog to scan for secrets in compromised environments, harvested credentials from cloud instance metadata services where available, and weaponized GitHub Actions workflows for persistence. Palo Alto Networks Unit 42, ReversingLabs, Wiz, and others have continued to track variants of the same family. The reason these matter for a data security conversation is that the attacker's objective in both cases was credentials and secrets in build environments. From there, the path to data is short. 
A compromised CI runner with cloud credentials can read whatever those credentials can read. A compromised GitHub token can read whatever the org allows. A compromised npm publish token can introduce a future payload that does both. Treat the build pipeline as a data security boundary, not just an engineering productivity surface. A dependency firewall that validates package provenance before installation (Sonatype Nexus Firewall, JFrog Xray, Socket.dev, or equivalents) is the highest-leverage single control I know of for closing this attack surface. The Shadow Agent Problem Is a DLP Problem in Disguise The single most striking statistic in the April 2026 CSA research, to me, was that eighty-two percent of organizations had discovered previously unknown AI agents in their environment over the past year, and forty-one percent had discovered them more than once. The agents most commonly emerged in internal automation and scripting environments, in custom assistants and plugins built on LLM platforms, in SaaS tools with built-in automation, and in developer-created workflows. This is, structurally, the same problem that shadow IT was a decade ago, and the same problem that shadow SaaS became five years ago. The difference is that the average shadow agent has read access to more sensitive data than the average shadow application ever did, because agents are useful precisely in proportion to how much context they can reach. A finance team’s reconciliation agent, helpfully built in an afternoon, often ends up with broader visibility into financial data than the human who built it. A customer support copilot frequently has a service account with access to the entire ticket database, including PII. None of this is malicious. It is the path of least resistance for getting an agent to do something useful. Three controls help close the gap, and they are mostly extensions of capabilities a mature data security team already owns. 
- CASB and SSPM coverage of LLM and agent platforms. The platforms hosting these agents (custom GPTs, Copilot Studio, internal MCP servers, vendor copilots) are SaaS applications. They need posture management, sanctioned application policies, and inline data protection just as much as Salesforce or Workday do. Most CASB and SSPM deployments are still catching up here. Push your vendor.
- Inline DLP on prompt and completion traffic. The point at which sensitive data leaves the environment is increasingly the prompt itself. Inline data inspection at the LLM gateway, using the same content matchers (EDM, IDM, OCR, vector ML) you trust for email and endpoints, is the right architectural pattern. The vendors are building this, but few enterprises have it deployed.
- An agent registry, even a basic one. Until the agent population is enumerable, no policy applied to it is provable. A spreadsheet is fine to start. The goal is to be able to answer, on any given Monday, three questions: which agents exist in production, what data each one can read, and who is the human owner of each. CSA’s data shows that most enterprises cannot answer those questions today.

What I Would Actually Start on This Week

Comprehensive ninety-day plans tend to lose momentum after the first two weeks of execution. The more effective approach, which I have refined over years of operationalizing data security programs at enterprise scale, is a focused set of starting moves that can ship in two weeks and that compound into a larger program over the quarter.

- Run an inventory pass for AI agents and copilots in your environment. Spreadsheet is fine. Capture name, owner, data scope, and approval status. The goal is to convert the CSA shadow agent statistic from an industry survey number into a number you actually have for your own organization.
- Review the data scope of every service account and OAuth grant tied to an LLM, agent, or copilot platform.
Most of them were sized for development convenience, not production. Tighten the ones that need tightening. Decommission the ones that are no longer in active use.Pilot a model-based reviewer in the pull request path of one codebase. Measure the false positive rate and developer satisfaction at week four. If the numbers are reasonable, expand. If they are not, tune and try again.Add provenance metadata to your data classification taxonomy. Even if the only label you can ship this quarter is “generated by an AI system,” shipping it now is more valuable than waiting for a perfect schema. Tagging at ingestion is the part of the program that compounds, and it has been undersized for the agent era at most enterprises I have seen.Open the verified access conversation with your AI vendors. Anthropic Glasswing, OpenAI Trusted Access for Cyber, and equivalent programs from other providers offer pathways to models with reduced refusal thresholds for legitimate defensive work. The application process involves coordination with General Counsel and procurement, which is why initiating it before an urgent need is critical. Programs of this kind will become foundational infrastructure for enterprise security teams over the next two years. These moves represent the structural transition that enterprise data security programs need to make over the next eighteen months. Programs that begin this work now will spend that window refining the controls and integrating them across their existing security stack. Programs that delay will spend the same window writing postmortems that explain why the controls were not in place. Conclusion The cybersecurity industry has navigated several genuine inflection points over the past decade, and the current moment qualifies as one of them on a specific structural ground: the cost curve for finding software flaws has bent, while the cost curve for shipping patches has not. 
The gap between those two curves is where every enterprise security program now operates, and the consequences land first at the data layer, which is where my work has been concentrated for the past decade. Data security teams that internalize this framing now will spend 2026 building defensible programs around a fundamentally changed threat economy. Teams that wait for a more dramatic signal will spend the same period responding to incidents that the structural shift made predictable.
AI-powered features often behave perfectly during testing and quietly degrade in production. The model has not changed. The prompts have not changed. Latency looks normal. Error rates are clean. Yet the responses gradually feel off: slightly disconnected, missing nuance, referencing things that are no longer relevant to the task at hand.

This pattern has a name: context rot. It does not throw exceptions. It does not appear in dashboards. It is one of the more subtle failure modes in production AI systems, and understanding it early makes a meaningful difference in the quality of what gets built.

How Attention Works in LLMs

To understand context rot, only a little of the underlying mechanics is needed. Before an LLM generates each new token, it looks at every token in the context and decides how much weight to give each one. This is called attention.

The key insight: attention scores are normalized, and they sum to 1.0 across all tokens. That means attention is a fixed budget. When the context has 500 tokens, each important piece of information might receive 0.15 or 0.20 of the total attention. When the context has 50,000 tokens, that same important piece might receive only 0.002, even if it is equally critical to the task.

Java
// Simplified illustration, not actual LLM code.
// computeRelevance, softmax, and weightedCombination are placeholder helpers.
public float[] generateNextToken(float[] currentState, String[] contextTokens) {
    float[] scores = new float[contextTokens.length];
    for (int i = 0; i < contextTokens.length; i++) {
        // How relevant is each past token to what we are generating?
        scores[i] = computeRelevance(currentState, contextTokens[i]);
    }
    // Scores must sum to 1.0: a fixed attention budget
    float[] weights = softmax(scores);
    return weightedCombination(contextTokens, weights);
}

Every token added, relevant or not, slightly dilutes the attention available for everything else. That is the seed of context rot.
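The dilution effect can be made concrete with a small numeric sketch. It is a toy model, not how any real LLM is implemented: it assumes a single attention distribution, one "key" token with a fixed raw relevance score, and filler tokens that all share a lower score. The function names and the score values (4.0 and 0.0) are illustrative assumptions.

```python
import math


def softmax(scores):
    """Convert raw relevance scores into weights that sum to 1.0."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weight_of_key_token(context_len, key_score=4.0, filler_score=0.0):
    """Attention weight of one high-relevance token among filler tokens.

    key_score and filler_score are made-up values for illustration only.
    """
    scores = [key_score] + [filler_score] * (context_len - 1)
    return softmax(scores)[0]


for n in (500, 5_000, 50_000):
    print(f"{n:>6} tokens -> key token weight {weight_of_key_token(n):.4f}")
```

The key token's raw score never changes between runs; only the number of competing tokens does. Its share of the budget still shrinks as the context grows, which is the mechanism the paragraph above describes.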
Context Position and Attention

A well-known multi-document question-answering experiment revealed something that should give every engineer building AI systems reason to pause. The correct answer was hidden at different positions across a long context, and retrieval accuracy was measured purely by position:

- Answer at the beginning: ~75% accuracy
- Answer at the end: ~72% accuracy
- Answer in the middle: ~55% accuracy

A 20 percentage point drop caused entirely by where the information sat, not by its quality or relevance. The information was present. The model could technically see it. It simply was not attending to it properly.

This is known as the Lost-in-the-Middle effect. It is an emergent architectural property of the transformer training process itself. Models learn to attend strongly to the beginning and end of their inputs, where the most signal-dense content tends to appear in human writing. The middle of a long context becomes an attention dead zone as a natural consequence of how these models are trained, not as an oversight.

Does this still apply to modern models? The honest answer is: yes, with important nuance. Newer models have largely resolved the effect for simple factoid retrieval: finding a specific fact at a specific position in a long context is something recent architectures handle well. The problem persists, and arguably intensifies, on multi-step reasoning tasks where the model must synthesize information across several documents simultaneously. That is precisely the category most production AI systems fall into, so the practical risk remains significant even as benchmark numbers improve.

What Context Rot Looks Like in Practice

Scenario 1: The wandering coding agent. An agent is asked to fix a bug. It reads 15 files, explores 3 wrong leads, and backtracks. Each file, each search result, each dead end accumulates in context. By the time the agent finds the right file, buried in the middle of 20,000 tokens, attention is spread thin.
The analysis of the one file that actually matters is noticeably weaker than it would have been with a clean context.

Scenario 2: The RAG pipeline that drifts. A retrieval pipeline fetches 10 document chunks per query, roughly 5,000 tokens. For most queries, this works fine. But longer queries trigger larger system prompts and conversation history. Total context grows to 40,000 tokens, and the documents retrieved third and fourth, sitting in the middle, fall into the attention dead zone. The model answers confidently, drawing on what it can see well. A crucial nuance from chunk 4 gets missed.

The pattern is always the same: no error, no warning, just answers that are subtly less accurate than they should be.

How to Detect It

Step 1: Log context length alongside every LLM call. What cannot be measured cannot be managed.

Step 2: Run a positional accuracy test. Place a key fact at different positions in a realistic context and check whether the model retrieves it correctly.

Java
public void positionalAccuracyTest(LlmClient client, String keyFact, String fillerText) {
    double[] positions = {0.1, 0.5, 0.9}; // beginning, middle, end
    for (double pos : positions) {
        // Insert the key fact at the chosen relative position in the filler text
        int split = (int) (fillerText.length() * pos);
        String context = fillerText.substring(0, split)
                + "\nKEY: " + keyFact + "\n"
                + fillerText.substring(split);
        String response = client.complete(context, "Summarise the most important information from the context.");
        boolean found = response.toLowerCase().contains(keyFact.toLowerCase());
        System.out.printf("Position %d%%: %s%n", (int) (pos * 100), found ? "RECALLED" : "MISSED");
    }
}

If the model passes at 10% and 90% but fails at 50%, context rot is measurable in that system at that context length.

Step 3: Alert on context length thresholds. Set a warning at around 50,000 tokens and a hard alert at 100,000. These are starting points; the positional accuracy test above will help calibrate the right numbers for a specific model and task type.
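Steps 1 and 3 can be combined into one small check that runs before every LLM call. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the two thresholds are the starting points from the text, and the token estimate uses a rough four-characters-per-token heuristic for English text rather than a real tokenizer, so it should be swapped for the provider's own token counter where accuracy matters.

```python
import logging

logger = logging.getLogger("llm.context")

# Starting points from the text; calibrate per model and task type.
WARN_TOKENS = 50_000
ALERT_TOKENS = 100_000


def estimate_tokens(text):
    """Rough estimate (assumption): about four characters per token."""
    return max(1, len(text) // 4)


def check_context_budget(context):
    """Log the context length and classify it as 'ok', 'warn', or 'alert'."""
    tokens = estimate_tokens(context)
    if tokens >= ALERT_TOKENS:
        level = "alert"
        logger.error("context_tokens=%d exceeds hard limit %d", tokens, ALERT_TOKENS)
    elif tokens >= WARN_TOKENS:
        level = "warn"
        logger.warning("context_tokens=%d exceeds warning limit %d", tokens, WARN_TOKENS)
    else:
        level = "ok"
        logger.info("context_tokens=%d", tokens)
    return tokens, level


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(check_context_budget("some prompt text " * 100))
```

Returning the level alongside the count lets the caller decide what to do next, for example triggering compaction on "warn" and refusing the call on "alert".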
Context Rot Is Also a Cost Problem

Most conversations about context rot focus on quality, and rightly so. But at any meaningful scale, it is equally a financial problem, and that dimension tends to get overlooked until the infrastructure bill arrives.

LLM providers charge by the token. Every token in the context window is billed on every single call. A context that has grown to 80,000 tokens costs roughly 8x more per call than one held at 10,000 tokens, for the same task, often with worse output quality. That is not a trade-off; it is strictly worse in both dimensions simultaneously. The exact cost per token varies by provider and model tier, but the proportionality holds: longer context means a proportionally larger bill.

The compute reality makes this more pronounced. Transformer attention scales quadratically with context length. Doubling the number of tokens does not double the attention compute required; it roughly quadruples it. At low volumes, this is invisible. With millions of calls per day, it becomes one of the largest line items in an AI system's operating cost. The numbers are illustrative, but the ratio is the point.

Context rot at scale is not just a quality inconvenience. It is a budget problem. Compaction, precise retrieval, and subagent isolation are not just engineering best practices; they are cost controls.

4 Practical Mitigations

1. Compact early: do not wait until quality degrades. Summarize older conversation turns before the context gets large, not after the damage is done.
Java
import java.util.ArrayList;
import java.util.List;

public List<Message> compactIfNeeded(List<Message> messages, LlmClient client) {
    int limit = 30_000;
    if (estimateTokens(messages) < limit) return messages;

    // Need at least a system prompt + messages to summarise + recent turns
    if (messages.size() < 7) return messages;

    // Everything except system prompt and last 5 turns
    List<Message> older = messages.subList(1, messages.size() - 5);
    String summary = client.complete("Summarise concisely: " + format(older));

    List<Message> compacted = new ArrayList<>();
    compacted.add(messages.get(0)); // system prompt
    compacted.add(new Message("system", summary)); // summary replaces the older turns
    compacted.addAll(messages.subList(messages.size() - 5, messages.size()));
    return compacted;
}

2. Use subagents for exploration. When an agent needs to search or explore, do it in a dedicated subagent with its own context window. Only the compact result, not the exploration trace, returns to the parent agent. Noise stays isolated.

3. In RAG, pass less to the model and rerank. Three precisely relevant chunks consistently outperform ten loosely relevant ones. Retrieval quantity does not equal retrieval quality. Fetch a wider candidate set, rerank by relevance, and pass only the top results to the model.

4. Position critical content deliberately. Given what is known about the attention curve, the most important context belongs at the beginning or end, not sandwiched in the middle. The system prompt and the current user query naturally occupy those positions. Keep them there, and be intentional about what fills the space between.

What This Means at Each Level

For early-career engineers: when an AI feature works in local testing but feels off in production, check context length first. Adding llm.context_tokens to an observability stack, alongside latency and error rate, is a small change with a meaningful signal.

For tech leads and architects: context is not a free resource.
Every design session for an LLM-powered feature should include a clear answer to "what is in this context window and why?" If that question cannot be answered clearly, the design is incomplete. For engineering managers and leaders: context rot does not appear in standard dashboards. Error rate and latency can look perfectly healthy while response quality silently degrades. Correlating context length with downstream quality metrics, task success rates, and user satisfaction is the monitoring work that production AI systems now require. Conclusion Context rot is one of those concepts that feels advanced until it is encountered in production, and then it feels like something that should have been understood from day one. The core reality is simple: transformer attention is a finite, dilutable resource. Every token added to a context window reduces the focus available for everything else. When contexts grow long, and important information ends up in the middle, quality degrades in ways that are real, measurable, and unfortunately silent. The good news is that it is manageable. Compact early. Isolate exploration into subagents. Be precise with retrieval. Position critical content deliberately. None of these requires advanced machine learning knowledge; they are engineering disciplines applied to a new kind of resource. The mental model that tends to help most is treating context the way experienced engineers treat memory: allocate it deliberately, release what is no longer needed, and keep the working set small and focused. The models are already capable of doing remarkable work, if given a clean signal and kept free of noise.
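The cost argument made earlier in this article reduces to two ratios, which can be checked with a few lines of arithmetic. This is a simplified sketch: it assumes flat per-token pricing with no caching discounts, and it models only the quadratic self-attention term rather than the full cost of a forward pass. The function names are illustrative.

```python
def billing_ratio(long_tokens, short_tokens):
    """Per-call input cost under flat per-token pricing scales linearly."""
    return long_tokens / short_tokens


def attention_compute_ratio(long_tokens, short_tokens):
    """Self-attention compute scales with the square of the sequence length."""
    return (long_tokens / short_tokens) ** 2


print(billing_ratio(80_000, 10_000))            # 8.0
print(attention_compute_ratio(20_000, 10_000))  # 4.0
print(attention_compute_ratio(80_000, 10_000))  # 64.0
```

The first line is the 8x billing figure from the text, the second is the "doubling quadruples" claim, and the third shows why the same 8x growth in context is far more expensive on the compute side.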
The Abrupt End of Amazon Q Developer

In May 2026, AWS dropped a bombshell: Amazon Q Developer IDE plugins and paid subscriptions will reach end-of-support on April 30, 2027, with new signups blocked as of May 15, 2026. The successor? Kiro, AWS's next-generation AI IDE that reframes how engineers build software from scratch.

If you're a backend engineer who has been relying on Q Developer for code completion, inline chat, and security scanning inside VS Code or JetBrains, the clock is ticking. But before you begrudgingly migrate, it's worth understanding why this transition is happening, what Kiro actually offers, and whether the trade-offs are worth it, especially in production backend contexts like microservices, distributed systems, and observability pipelines.

Historical Context: From CodeWhisperer to Q Developer to Kiro

AWS's AI coding journey started with Amazon CodeWhisperer (launched in preview in 2022), which was a single-model code suggestion tool (think GitHub Copilot, but AWS-native). It supported security scanning against common vulnerability patterns and could suggest AWS SDK calls contextually.

In April 2024, AWS folded CodeWhisperer into the broader Amazon Q branding, an umbrella AI assistant that spanned not just code but AWS console assistance, documentation search, and operational queries. Q Developer became the IDE-facing arm of that product.

The problem? Q Developer tried to be everything: a coding assistant, a console assistant, a documentation bot, and a security scanner all jammed into one plugin. Feedback from engineering teams consistently pointed to context window limitations, poor multi-file understanding, and weak support for complex backend architectures spanning multiple services.

Kiro is AWS's response.
Built from the ground up with "spec-driven development" as its core philosophy, Kiro is less of an autocomplete engine and more of an agentic coding environment: it can plan, scaffold, and implement across your entire project tree, not just the file you have open.

Architecture Comparison

The architectural difference is significant. Q Developer operated in a request-response model where you asked a question or triggered a completion and got a result. Kiro introduces hooks: lifecycle-aware automations that fire when you save a file, open a PR, or change a spec. This is closer to how CI/CD pipelines work, and backend engineers will immediately recognize the paradigm.

Feature-by-Feature Breakdown

| Feature | Amazon Q Developer | Amazon Kiro |
| --- | --- | --- |
| Multi-file context | Limited (single file primary) | Full project tree |
| Agentic task execution | No | Yes (plan → implement → test) |
| Spec-driven development | No | Yes (SPEC.md driven) |
| MCP integration | No | Yes (external tool calls) |
| Security scanning | Yes (CodeWhisperer rules) | Yes (enhanced) |
| JetBrains support | Yes | Yes |
| VS Code support | Yes | Yes |
| AWS Free Tier access | Yes | Yes (via AIdeas Competition) |
| Paid subscription | $19/mo (deprecated) | Separate Kiro pricing |
| End of support | April 30, 2027 | Active |

Production Code Example 1: Spec-Driven Microservice Scaffolding With Kiro

One of Kiro's most powerful features is its SPEC.md-driven workflow. Instead of writing code and hoping the AI helps, you write a specification and Kiro implements it. Here's what that looks like for a backend order processing microservice.
TypeScript
// SPEC.md concept implemented as TypeScript types
// Kiro reads your spec and generates this scaffolding
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDBClient, PutItemCommand, GetItemCommand } from '@aws-sdk/client-dynamodb';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { marshall, unmarshall } from '@aws-sdk/util-dynamodb';

const logger = new Logger({ serviceName: 'order-service', logLevel: 'INFO' });
const tracer = new Tracer({ serviceName: 'order-service' });
const ddb = tracer.captureAWSv3Client(new DynamoDBClient({}));
const sqs = tracer.captureAWSv3Client(new SQSClient({}));

interface Order {
  orderId: string;
  customerId: string;
  items: Array<{ sku: string; qty: number; price: number }>;
  status: 'PENDING' | 'CONFIRMED' | 'SHIPPED' | 'CANCELLED';
  createdAt: string;
}

interface CreateOrderResult {
  success: boolean;
  orderId?: string;
  error?: string;
}

// Kiro-generated handler with full error handling + structured logging
export const createOrder = async (
  order: Omit<Order, 'orderId' | 'status' | 'createdAt'>
): Promise<CreateOrderResult> => {
  const segment = tracer.getSegment();
  const subsegment = segment?.addNewSubsegment('createOrder');
  try {
    const orderId = `ORD-${Date.now()}-${Math.random().toString(36).slice(2, 7).toUpperCase()}`;
    const newOrder: Order = {
      ...order,
      orderId,
      status: 'PENDING',
      createdAt: new Date().toISOString(),
    };
    logger.info('Creating order', { orderId, customerId: order.customerId, itemCount: order.items.length });

    // Persist to DynamoDB
    await ddb.send(new PutItemCommand({
      TableName: process.env.ORDERS_TABLE!,
      Item: marshall(newOrder),
      ConditionExpression: 'attribute_not_exists(orderId)', // idempotency guard
    }));

    // Publish to downstream processing queue
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.ORDER_QUEUE_URL!,
      MessageBody: JSON.stringify(newOrder),
      MessageGroupId: order.customerId, // FIFO ordering per customer
      MessageDeduplicationId: orderId,
    }));

    logger.info('Order created and queued', { orderId });
    return { success: true, orderId };
  } catch (error) {
    const err = error as Error;
    logger.error('Failed to create order', { error: err.message, stack: err.stack });
    subsegment?.addError(err);
    return { success: false, error: err.message };
  } finally {
    subsegment?.close();
  }
};

What Q Developer would do: Suggest inline completions line-by-line based on your cursor position.

What Kiro does: Reads your SPEC.md that says "Create an order service with DynamoDB persistence, SQS publishing, idempotency, and X-Ray tracing", and generates the entire file, including imports, error handling, and the logging pattern your team already uses (learned from your codebase).

Production Code Example 2: Using Kiro Hooks for Automatic Test Generation

Kiro's hook system is where backend engineers will find the most leverage. A hook is a YAML-defined automation that triggers on file system events within your project.

YAML
# .kiro/hooks/auto-test.yaml
name: Generate Unit Tests on Save
trigger:
  event: file_saved
  pattern: "src/**/*.ts"
  exclude: "**/*.test.ts"
actions:
  - type: agent_task
    prompt: |
      A TypeScript file was just saved at {{file_path}}.
      Review the exported functions. For any function that does not have
      a corresponding test in {{file_path_without_ext}}.test.ts, generate
      comprehensive unit tests using Vitest. Include:
      - Happy path tests
      - Error boundary tests (network failures, malformed input)
      - Edge cases for empty arrays and null values
      Mock the AWS SDK v3 clients with aws-sdk-client-mock.
    output_file: "{{file_path_without_ext}}.test.ts"
    mode: merge # Don't overwrite existing tests, only append missing ones

TypeScript
// Auto-generated test from the hook above (Vitest)
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { mockClient } from 'aws-sdk-client-mock';
import { DynamoDBClient, PutItemCommand } from '@aws-sdk/client-dynamodb';
import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs';
import { createOrder } from './order-service';

const ddbMock = mockClient(DynamoDBClient);
const sqsMock = mockClient(SQSClient);

describe('createOrder', () => {
  beforeEach(() => {
    ddbMock.reset();
    sqsMock.reset();
    process.env.ORDERS_TABLE = 'test-orders';
    process.env.ORDER_QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123/orders.fifo';
  });

  it('should create order and return orderId on success', async () => {
    ddbMock.on(PutItemCommand).resolves({});
    sqsMock.on(SendMessageCommand).resolves({ MessageId: 'msg-123' });
    const result = await createOrder({
      customerId: 'cust-001',
      items: [{ sku: 'SKU-A', qty: 2, price: 29.99 }],
    });
    expect(result.success).toBe(true);
    expect(result.orderId).toMatch(/^ORD-/);
  });

  it('should return error when DynamoDB PutItem fails', async () => {
    ddbMock.on(PutItemCommand).rejects(new Error('ProvisionedThroughputExceededException'));
    const result = await createOrder({
      customerId: 'cust-001',
      items: [{ sku: 'SKU-A', qty: 1, price: 9.99 }],
    });
    expect(result.success).toBe(false);
    expect(result.error).toContain('ProvisionedThroughputExceededException');
  });

  it('should handle empty items array gracefully', async () => {
    ddbMock.on(PutItemCommand).resolves({});
    sqsMock.on(SendMessageCommand).resolves({});
    const result = await createOrder({ customerId: 'cust-001', items: [] });
    expect(result.success).toBe(true);
  });
});

This hook fires every time you save a .ts file, checks if tests are missing, and generates them, without you ever typing a prompt.
Production Code Example 3: Migrating Q Developer Inline Chat Workflows to Kiro

If you've built muscle memory around Q Developer's /dev inline chat, here's how equivalent workflows map in Kiro.

TypeScript
// Q Developer workflow (deprecated):
// You'd type: /dev add retry logic with exponential backoff to this fetch call
// And it would modify the highlighted code block.

// Kiro equivalent: Add to your SPEC.md or use agent chat with file pinning:
// "Add retry logic with exponential backoff to src/clients/http-client.ts"

// The result Kiro produces, with observability built in:
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({ serviceName: 'http-client' });

interface RetryOptions {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
  retryableStatusCodes: number[];
}

const DEFAULT_RETRY_OPTIONS: RetryOptions = {
  maxAttempts: 3,
  baseDelayMs: 100,
  maxDelayMs: 5000,
  retryableStatusCodes: [429, 500, 502, 503, 504],
};

export async function fetchWithRetry<T>(
  url: string,
  options: RequestInit = {},
  retryOpts: Partial<RetryOptions> = {}
): Promise<T> {
  const opts = { ...DEFAULT_RETRY_OPTIONS, ...retryOpts };
  let lastError: Error | null = null;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    // Exponential backoff: baseDelayMs * 2^(attempt - 1), capped at maxDelayMs
    const delay = Math.min(opts.baseDelayMs * Math.pow(2, attempt - 1), opts.maxDelayMs);

    let response: Response;
    try {
      response = await fetch(url, options);
    } catch (error) {
      // Network-level failure: retry unless this was the last attempt
      lastError = error as Error;
      if (attempt === opts.maxAttempts) break;
      logger.warn('Request failed, retrying', { url, attempt, delayMs: delay, error: lastError.message });
      await new Promise(res => setTimeout(res, delay));
      continue;
    }

    if (response.ok) {
      return (await response.json()) as T;
    }

    lastError = new Error(`HTTP ${response.status}: ${response.statusText}`);

    // Non-retryable status (for example 404): fail immediately instead of retrying
    if (!opts.retryableStatusCodes.includes(response.status)) {
      throw lastError;
    }
    if (attempt === opts.maxAttempts) break;

    logger.warn('Retryable HTTP error, backing off', { url, status: response.status, attempt, delayMs: delay });
    await new Promise(res => setTimeout(res, delay));
  }

  logger.error('All retry attempts exhausted', { url, maxAttempts: opts.maxAttempts });
  throw lastError ?? new Error('Unknown fetch error after retries');
}

When to Migrate Now vs. Wait

Migrate now if:

- You're starting a new service or greenfield project: Kiro's spec-driven approach saves the most time at project inception
- Your team does heavy test generation: the hook system is a net productivity win
- You're building MCP-integrated tooling or AWS-native agentic workflows

Wait if:

- You have a heavily customized Q Developer security scanning ruleset: give the Kiro security scanner time to mature
- You're on a locked-down enterprise network: Kiro's agentic features require broader outbound connectivity than Q Developer's plugin model

Performance and Productivity Metrics

| Metric | Amazon Q Developer | Amazon Kiro (early data) |
| --- | --- | --- |
| Avg. context window (tokens) | ~16K | ~128K+ |
| Multi-file edits per session | 1-2 | 10-20+ |
| Test coverage improvement | ~15% | ~35% (with hooks) |
| Time to scaffold new service | ~2-3 hrs manual | ~20-40 min spec-driven |
| Security scan languages | 15 | 20+ |

Summary

The Q Developer → Kiro transition isn't just a rebranding. It's a fundamental shift from a reactive autocomplete tool to a proactive agentic development environment. For backend engineers building distributed systems on AWS, Kiro's spec-driven planning, multi-file context, and hook-based automation represent a genuine productivity leap, not just an incremental update.

Start your migration now. The deprecation deadline of April 2027 sounds far off, but enterprise procurement, security reviews, and team retraining take time. Get ahead of it.
The xAI Grok API provides access to powerful frontier models, including the Grok 4 series, supporting chat completions (text + vision), image generation, tool calling (function calling and built-in tools like web search), and more advanced features.

Quick Intro

- Sign up at https://x.ai/api.
- Generate an API key from the console.
- Install the SDK: pip install xai-sdk.
- Set the environment variable: export XAI_API_KEY="your_key_here".
- Models list: https://docs.x.ai/developers/models.

I'll share some samples in Python.

[Image: Learn how to use Grok AI - xAI]

Basic Chat API Call

Let's first prepare our project before making the API call.

1. Install the xai-sdk.

Shell
pip install xai-sdk

2. Set the environment variable with export XAI_API_KEY="your_key_here", or use a .env file.

Now, create a new file and add this basic setup:

Python
import os
from xai_sdk import Client
from xai_sdk.chat import user, system
from dotenv import load_dotenv

load_dotenv()
XAI_API_KEY = os.environ.get("XAI_API_KEY")
client = Client(api_key=XAI_API_KEY)

Ensure you can print out your XAI_API_KEY correctly at this stage. Next, let's call the chat function:

Python
...
model = "grok-4-1-fast-non-reasoning"
chat = client.chat.create(model=model)
chat.append(system("You are Grok, a highly intelligent, helpful AI assistant."))
chat.append(user("How can I be a good developer?"))
response = chat.sample()
print(response.content)

Feel free to switch the model based on your needs or preferences. Here is an example output:

[Image: Grok AI API basic call]

Image Generation API

Let's see how to generate an image with the Grok API. We'll need to use the "grok-imagine-image" model for this.

Python
...
response = client.image.sample(
    model="grok-imagine-image",
    prompt="detective cat searching on website"
)
print(f"Generated image: {response.url}")

The output is a URL like this:

[Image: Image generation API using xAI API]

Video Generation API

Generating a video is as easy as generating an image with the Grok API. We'll need to use the "grok-imagine-video" model for this.
Python
response = client.video.generate(
    prompt="A glowing crystal-powered rocket launching from the red dunes of Mars, ancient alien ruins lighting up in the background as it soars into a sky full of unfamiliar constellations",
    model="grok-imagine-video",
    duration=10,
    aspect_ratio="16:9",
    resolution="720p",
)
print(response.url)

[Image: Grok Video API example]

You can set the duration, aspect ratio, and resolution.

Tools in Grok

The xAI Grok API features powerful tool-calling capabilities, allowing Grok to go far beyond simple text generation. It can take real actions such as performing web searches, running code, retrieving information from your own data sources, or invoking any custom functions you've defined.

[Image: From x.ai - available tools]

Tool Calling (Function Calling)

Let's start by calling a custom function, as it'll help us call any internal or external API or function. Let's say we want to call a function to look up an item's price. First, we need to define the tool by giving it a name, a description, and parameters.

Python
...
import json
from xai_sdk.chat import user, tool, tool_result
...

# Define tools
tools = [
    tool(
        name="get_item_price",
        description="Get the price of an item from the store",
        parameters={
            "type": "object",
            "properties": {
                "item_name": {"type": "string", "description": "Name of the item to get the price for"},
            },
            "required": ["item_name"]
        },
    ),
]

When creating the chat, we now need to include the tools we declared above.

Python
chat = client.chat.create(
    model="grok-4.20-reasoning",
    tools=tools,
)
chat.append(user("What is the price of a laptop?"))
response = chat.sample()
print("========= response ===========")
print(response)
print("==========================")

Important: At this stage, Grok doesn't care whether we have the actual function to check the price or not. The model simply needs to know which tools are available for it to use. Try to run the code to see the output from the chat call.
[Image: Function calling output sample]

As you can see, Grok can detect the tool we need to call. You can see it under outputs > message > tool_calls. It consists of the name of the function and the arguments that are extracted from the user's prompt, so it'll be dynamic.

Function Call Simulation

Next, let's create a fake function to call. In real life, it could be a call to a database or an API.

Python
def get_item_price(item_name):
    prices = {
        "laptop": 999.99,
        "smartphone": 499.99,
        "headphones": 199.99,
    }
    return {"item_name": item_name, "price": prices.get(item_name, "Item not found")}

Following up on the latest code, we can check whether the response has a "tool_calls" object. If so, we'll call the actual function we just declared above.

Python
# Handle tool calls
if response.tool_calls:
    chat.append(response)
    for tc in response.tool_calls:
        args = json.loads(tc.function.arguments)
        result = get_item_price(args["item_name"])
        chat.append(tool_result(json.dumps(result)))
    response = chat.sample()
    print(response.content)

- We loop through the tool_calls object.
- We extract the arguments to pass to the function.
- We call the actual function with the argument value.
- We add the result back to the chat.

Calling chat.sample() again now includes all the information we received from calling the "fake function" before.

[Image: Sample result for function calling]

Let's try with a different prompt:

Python
chat.append(user("I need to buy two laptops and a smartphone. Can you tell me how much that will cost?"))

Here is the result:

[Image: Function calling result sample]

Web Search API

Grok can access real-time information through this feature, so you can get up-to-date content. Unlike the function calling above, we don't need to declare a custom function, as it's an internal tool.
Here is a simple example:

Python

    import os
    from xai_sdk import Client
    from xai_sdk.chat import user
    from xai_sdk.tools import web_search
    from dotenv import load_dotenv

    load_dotenv()
    XAI_API_KEY = os.environ.get("XAI_API_KEY")

    client = Client(api_key=XAI_API_KEY)
    chat = client.chat.create(
        model="grok-4.20-reasoning",  # reasoning model
        tools=[web_search()],
        include=["verbose_streaming"],
    )
    chat.append(user("Grok VS OpenAI API"))

    is_thinking = True
    for response, chunk in chat.stream():
        # Show each tool call as it happens
        for tool_call in chunk.tool_calls:
            print(f"\nCalling tool: {tool_call.function.name} with arguments: {tool_call.function.arguments}")
        if response.usage.reasoning_tokens and is_thinking:
            print(f"\rThinking... ({response.usage.reasoning_tokens} tokens)", end="", flush=True)
        # The first content chunk marks the end of the thinking phase
        if chunk.content and is_thinking:
            print("\n\nFinal Response:")
            is_thinking = False
        if chunk.content and not is_thinking:
            print(chunk.content, end="", flush=True)

    print("\n\nCitations:")
    print(response.citations)

- Use tools=[web_search()]
- To show what's happening in the process, we use include=["verbose_streaming"]
- The is_thinking variable is a boolean flag that tracks whether Grok is still in the thinking phase; it flips to False once the first content chunk arrives

Web Search API with Grok AI

As you can see, it'll perform several searches on the internal database with different queries. It'll then visit a specific URL to get more context.

Allowed Domains

You can restrict the search to specific domains using allowed_domains.

Python

    tools=[
        web_search(allowed_domains=["grokipedia.com"]),
    ],

Exclude Domains

Conversely, you can exclude specific domains:

Python

    chat = client.chat.create(
        model="grok-4.20-reasoning",
        tools=[
            web_search(excluded_domains=["grokipedia.com"]),
        ],
    )

Better Web Search API

While you can choose the domains, the keywords Grok uses to find answers on the internet are picked by the model and can be unpredictable. For example, when I asked for "Top 3 pizza restaurants from Google Maps in Boston. Share some reviews and ratings for each place."
This is what I saw from the thinking process:

It needs to perform multiple queries before returning the answer.

Another sample, when asking simply for three images:

It runs across multiple pages, and unfortunately, the links are not valid. Grok may hallucinate at this point.

Web Search API Alternative

In some cases, AI-generated keywords are fine, but if you're building an app where you want efficiency and full control over the process, the native "Web Search Tool" can be replaced with a simple call to the specific API your app needs.

For example, to find answers online, SerpApi offers 100+ APIs.

Need a generic Google answer? We have:

- Google Search API
- Google AI Overview
- Google AI Mode

Same with Bing, DuckDuckGo, and other top search engines.

Need a restaurant review? We have:

- Yelp Reviews API
- Google Maps Reviews API

Need an API for traveling apps? We have:

- Google Hotels API
- Google Flights API
- TripAdvisor API

and more! See how SerpApi is the Web Search API for your AI apps, LLMs, and agents.

Using Grok API With SerpApi

To get a sense of how SerpApi works, feel free to test the results in our playground. You can play with different parameters and directly see the JSON sample we return.

SerpApi Playground

Sample Case

Let's say we want to find images via the Google Images API like this:

Sample result search with SerpApi

Step 1: Preparation

You can register for free at serpapi.com to get your API key.

Step 2: Parsing Keyword

Let's say we need three images from Google. Since users can type anything, we need to extract a keyword first, as SerpApi simply performs a search using the keyword it is given.

Python

    from xai_sdk.chat import system, user

    USER_QUERY = "Show me 3 cute cat images from the internet"

    # Ask Grok to extract a search keyword from the user's natural language
    keyword_chat = client.chat.create(model="grok-3-fast")
    keyword_chat.append(system(
        "Extract the most relevant search keyword or phrase from the user's message. "
        "Reply with only the keyword, nothing else."
    ))
    keyword_chat.append(user(USER_QUERY))
    keyword_response = keyword_chat.sample()
    search_keyword = keyword_response.content.strip()
    print(f"Extracted keyword: {search_keyword}")

Step 3: Search via SerpApi

We now have the keyword. Let's run a search on SerpApi.

Python

    import requests

    # Search Google Images through SerpApi with a plain HTTP request.
    # SERPAPI_API_KEY is expected to hold the key you got in Step 1.
    serpapi_params = {
        "api_key": SERPAPI_API_KEY,
        "engine": "google_images",
        "q": search_keyword,
        "hl": "en",
        "gl": "us",
    }
    serpapi_url = "https://serpapi.com/search"
    serpapi_response = requests.get(serpapi_url, params=serpapi_params)
    serpapi_response.raise_for_status()  # stop early on an HTTP error
    results = serpapi_response.json()

At this stage, you already have the answers you're looking for.

Step 4: Filter Results (Optional)

Sometimes, we don't need all the information. It's good to filter it programmatically first, so we don't use too many tokens. For example, I'm only interested in the top five answers:

Python

    # Keep only the first five images and format them as "- title: url" lines
    image_results = results.get("images_results", [])[:5]
    formatted_results = "\n".join(
        f"- {img.get('title', 'No title')}: {img.get('original', img.get('thumbnail', 'No URL'))}"
        for img in image_results
    )
    print(f"\nSerpAPI results:\n{formatted_results}")

As a bonus, this also formats the results into a compact list.

Step 5: Reply in Natural Language (Optional)

Depending on your application, you may want to answer the user in natural language. We just need to pass the results above back to the AI:

Python

    # Feed the search results back to Grok for a final response
    final_chat = client.chat.create(model="grok-3-fast")
    final_chat.append(system("You are a helpful assistant. Use the provided search results to answer the user's question."))
    final_chat.append(user(f"User question: {USER_QUERY}\n\nSearch results from SerpAPI:\n{formatted_results}\n\nPlease answer the user's question based on these results."))
    final_response = final_chat.sample()
    print(f"\nFinal Response:\n{final_response.content}")

Final result

You can try the other APIs for other use cases.
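The filtering in Step 4 can also be wrapped in a small reusable helper, which makes the token-saving step easy to test without calling any API. The sketch below assumes the same response keys used in the walkthrough (images_results, title, original, thumbnail). The function name format_image_results and the sample payload are made up for illustration and are not real SerpApi output.

```python
def format_image_results(results: dict, limit: int = 5) -> str:
    """Keep the first `limit` images and render one "- title: url" line each."""
    lines = []
    for img in results.get("images_results", [])[:limit]:
        title = img.get("title", "No title")
        # Prefer the full-size image, fall back to the thumbnail
        url = img.get("original") or img.get("thumbnail") or "No URL"
        lines.append(f"- {title}: {url}")
    return "\n".join(lines)


# Made-up sample payload for illustration only; not real SerpApi output.
sample = {
    "images_results": [
        {"title": "Cat A", "original": "https://example.com/a.jpg"},
        {"title": "Cat B", "thumbnail": "https://example.com/b_thumb.jpg"},
        {"original": "https://example.com/c.jpg"},
    ]
}
print(format_image_results(sample, limit=2))
```

Passing a smaller limit is the simplest way to keep the prompt short when the results are fed back to the model.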
Sidenote

It's also possible to call the API with the OpenAI SDK. Sample:

Python

    import os
    from openai import OpenAI

    # Point the OpenAI client at the xAI endpoint
    client = OpenAI(
        api_key=os.getenv("XAI_API_KEY"),
        base_url="https://api.x.ai/v1",
    )

Check out the full SerpApi article collection here.