Artificial intelligence (AI) and machine learning (ML) are two fields that work together to create computer systems capable of perception, recognition, decision-making, and translation. Separately, AI is the ability for a computer system to mimic human intelligence through math and logic, and ML builds off AI by developing methods that "learn" through experience and do not require instruction. In the AI/ML Zone, you'll find resources ranging from tutorials to use cases that will help you navigate this rapidly growing field.
Mastering Fluent Bit: Beginners' Guide for Contributing to Our CNCF Project Website
Testing AI-Infused Apps: A Dual-Layer Framework for AI Quality Assurance
AI agents have a memory problem. Not the kind that we all hear daily — hallucination, wrong answers, but a much quieter and fundamental problem. When you start a new conversation with the agent, it forgets who you are. It doesn't know what you have already worked on, what you have clarified multiple times across sessions, or what is common across all the sessions. You start from scratch every single time. While this does sound good in a way, in case you weren't getting what you wanted out of the agent, it does pose some challenges. LLMs are capable of maintaining a rich context of a conversation. The problem is more architectural: most of the agents designed the scope to include all state files, memory, and history into a single thread. When that thread ends, so does the state. This results in an intelligent agent but amnesiac across sessions. LangChain's deepagents have a solution with three components that work together: StoreBackend – stores files outside the conversation thread in LangGraph's BaseStoreCompositeBackend – routes specific file paths to persistent storage while also keeping everything else ephemeralMemoryMiddleware – loads memory into the agent's context automatically before any run. By the end of this article, you will learn how to create a working personal assistant remembering your preferences, provides feedback across sessions, has per-user isolation, and a clear path from local SQLite to production Postgres. Why Persistent Memory Matters for AI Agents Consider a case of a customer support agent. A customer chats with the agent, and just like how they converse with a normal human, they try to bring up something that they brought up during the last conversation, but the agent has no idea about this. This creates friction and a poor user experience. There are other such scenarios, like a coding assistant that does not remember your team's conventions and coding patterns and gives generic answers, or a personal assistant that asks for your timezone every time the agent is asked to schedule a meeting. LangChain's deepagents approach is notable because it doesn't require a vector database, an embeddings pipeline, or any kind of retrieval step at query time. Memory is a pure file. Loading memory means reading a file. Agent updates it the same way it edits any file, just like a human. The complexity comes in the routing and persistence layer, which CompositeBackend and Storebackend handle independently of the agentic loop. The Problem Conversations are stateless by default. In deep agents, every file that the agent reads or writes goes through a backend. The default is StateBackend. This stores files inside the LangGraph conversation state, which is scoped to a thread_id. Starting a new conversation? new thread_id. New state. Files gone. The fix requires separating two distinct storage concerns: Working files, scratch notes -> scope is usually session -> this shouldn't be in the memory.User profile, preferences -> this is scoped at the user level -> this should survive in the memory. Deepagents handles this with three cooperating primitives: StoreBackend, CompositeBackend, and MemoryMiddleware, and there are two storage primitives -> conversation thread, which is scoped to thread_id, and BaseStore, which is a key-value store that exists independently of threads. StateBackend reads and writes from the conversation state. StoreBackend reads and writes from BaseStore. The key difference is where the agent reads from. Setting Up the Persistent Memory Assistant Installation uv add deepagents langchain-anthropic langgraph Backend Wiring Python from deepagents.backends.composite import CompositeBackend from deepagents.backends.state import StateBackend from deepagents.backends.store import StoreBackend from langgraph.store.memory import InMemoryStore store = InMemoryStore() store_backend = StoreBackend( store=store, namespace=lambda rt: (f"user:{user_id}", "memories"), ) backend = CompositeBackend( default=StateBackend(), routes={"/memories/": store_backend}, ) The namespace lambda is what can isolate users. Consider a case where there are two users: Alice and Bob. Alice's memory lives at ("user:alice", "memories") and Bob's at ("user:bob", "memories"). Agent Creation Python from deepagents import create_deep_agent from langchain_anthropic import ChatAnthropic from langgraph.checkpoint.memory import InMemorySaver agent = create_deep_agent( model=ChatAnthropic(model="claude-sonnet-4-6"), system_prompt=SYSTEM_PROMPT, memory=["/memories/profile.md"], backend=backend, checkpointer=InMemorySaver(), ) The memory parameter is all MemoryMiddleware needs. It reads that path along with the configured backend. At the start of the session, the content is cached in state and is then injected into the system prompt before model calls within the session. If the file does not exist, then it injects "(no memory loaded)" so the agent knows to create a new one. Architecture The System Prompt Contract The agent needs to know when to update the memory and how to update this memory. The system prompt decides this contract: Python SYSTEM_PROMPT = """You are a personal assistant with persistent memory. Your persistent memory file lives at /memories/profile.md and survives across all conversations. When to update memory: - User shares name, role, or background - User mentions ongoing projects or goals - User states a preference (language, tools, response format) - User corrects you or gives explicit feedback How to update: - First conversation: write_file to create /memories/profile.md - Later conversations: edit_file to update it Keep the file concise — bullet points, not prose. Never store credentials. """ MemoryMiddleware also appends its own guidelines, which include heuristics for what not to save. Multi-User Isolation Now you might be wondering. Having this agent sounds amazing, but how to scale it for multiple people? Do we need to create separate instances for each user? The answer is no!. The namespace lambda is the only thing that separates users: namespace=lambda rt: (f"user:{user_id}", "memories") In the CLI, user_id is a flag. In LangGraph deployment, this can be derived from the request context. namespace=lambda rt: (rt.server_info.user.identity, "memories") Different Storing Backends In this example, I experimented with in-memory store, SQLite, and PostgreSQL. Python #In-memory (demos): from langgraph.store.memory import InMemoryStore store = InMemoryStore() #Resets when the process exits. Good for demo runs. #SQLite (local development, survives restarts): import sqlite3 from langgraph.store.sqlite import SqliteStore conn = sqlite3.connect("assistant_memory.db", isolation_level=None) store = SqliteStore(conn) store.setup() #Note: isolation_level=None (autocommit) is required by SqliteStore. #PostgreSQL (production, multi-instance): import os from langgraph.store.postgres import PostgresStore with PostgresStore.from_conn_string(os.environ["DATABASE_URL"]) as store: store.setup() #Set DATABASE_URL to a standard Postgres connection string. Advantages LangChain's deepagents framework provides several advantages, such as: Cross-session continuity – memory injected into the system prompt directly - no search, no embedding lookup, no extra latency.Per-user isolation – easier namespacing using StoreBackend.Explicit, inspectable memory – it's a plain markdown file. You can read it, edit it, and audit it without any special tooling.Adaptable with existing middleware – MemoryMiddleware is part of the middleware stack along with permission checks and logging. Adding persistent memory is additive and not a total rewrite. Disadvantages While there are several advantages to using LangChain's deepagents, it does come with some limitations: Context window consumption – Since the memory files are injected into the system prompt every time, it could become really large, and it could exceed the context budget. The system prompt needs to be clear and concise on what to save and what not to save. Agent manages its own memory – A poorly prompted agent may over-save, under-save, or save the wrong things. The system prompt contract is very important.Not suitable for large-scale memory – For a compact user-profile, this sounds perfect — a few hundred words. But applications that need to remember several past interactions, a RAG-based approach with a vector store makes much more sense. It doesn't scale to large memory corpora. Extending the Pattern Multiple memory files — separate concerns: Python memory=[ "/memories/profile.md", # identity and background "/memories/projects.md", # active work "/memories/preferences.md", # style and tool preferences ] Write-scoped permissions — prevent the agent from writing outside /memories/: Python from deepagents import FilesystemPermission permissions=[ FilesystemPermission(operations=["write"], paths=["/memories/**"]), FilesystemPermission(operations=["write"], paths=["/**"], mode="deny"), ] Shared team context alongside per-user memory: Python backend = CompositeBackend( default=StateBackend(), routes={ "/memories/": StoreBackend(store=store, namespace=lambda rt: (f"user:{user_id}", "memories")), "/shared/": StoreBackend(store=store, namespace=("team:engineering", "shared")), }, ) Running the Example Shell git clone -b feat/permissions-execute-task https://github.com/NinaadRao/deepagents cd examples/persistent-memory-assistant uv venv && source .venv/bin/activate uv pip install -e . export ANTHROPIC_API_KEY=your_key # Built-in two-session demo python assistant.py --demo # Interactive with SQLite persistence python assistant.py --store sqlite --user alice "I prefer Python and FastAPI" python assistant.py --store sqlite --user alice "What do you know about me?" # Different user — isolated memory python assistant.py --store sqlite --user bob "I build data pipelines in Spark" python assistant.py --store sqlite --user alice "What do you know about me?" # Alice only Conclusion Most agent memory problems trace back to two things: conversation and the user's context. Keeping them separate in the storage layer and not the application code is what makes the solution clean. The three-component design in deep agents, i.e., StoreBackend + CompositeBackend + MemoryMiddleware, handles this without coupling any layer to the others. You can change the model, store, or routing rules independently of each other, which makes it a good use case for abstraction.
Generating sequential numeric IDs sounds like one of those problems that should have been solved decades ago. And in a monolithic application, it mostly was. You create a database sequence, use an auto-increment column, and move on. Every new record gets a unique number, the ordering is preserved, and nobody on the engineering team loses sleep over it. That simplicity disappears the moment the system becomes distributed. Once your application is running across multiple services, multiple instances, or multiple Kubernetes pods, generating ordered numeric identifiers turns into a very different problem. What used to be a harmless database feature suddenly becomes a scalability bottleneck. Every request that depends on “the next number” now has to coordinate through shared state, and shared state is exactly where distributed systems become expensive. We ran into this problem while building a service that needed globally unique, monotonically increasing numeric identifiers at very high throughput. UUIDs were not a good fit because the business wanted readable, ordered numbers. At the same time, we could not afford to make a database round trip on every request. The pattern that solved it cleanly was the Hi-Lo algorithm, backed by Azure Cosmos DB for coordination. It gave us a practical way to preserve uniqueness and ordering while dramatically reducing database contention and keeping request latency extremely low. Why This Problem Gets Hard So Quickly The most obvious solution is also the one that fails first under scale. Store the current sequence value in a database record. For each request, increment it atomically and return the new value. From a correctness perspective, it works. From a scalability perspective, it is painful. The issue is not that databases cannot increment counters. They can. The issue is that when every service instance depends on the same counter, you create a write hotspot. All traffic funnels through a single piece of mutable state. As request volume grows, latency increases, write contention rises, and horizontal scaling stops helping as much as it should. You can add more pods, but they are all still lining up to talk to the same centralized counter. That is the point where teams discover that sequential ID generation is not really an ID problem. It is a coordination problem. And in distributed systems, coordination is usually the thing you want to minimize. Architecture Diagram The Idea Behind Hi-Lo The Hi-Lo algorithm works by separating identifier generation into two layers: A high value, reserved centrallyA low value, generated locally in memory Instead of asking the database for the next number every time, a service instance reserves an entire block of numbers in one operation. After that, it generates values locally from that reserved range until the block is exhausted. For example, if the current global boundary is 1000 and the configured lot size is 1000, one pod can reserve the next block: 1001 to 2000. From that point on, it does not need the database for every request. It can serve identifiers from memory until it reaches 2000. That changes the coordination model completely. Instead of one database write per ID, the system performs one database write per batch of IDs. If the batch size is 1000, the database pressure drops by roughly a factor of 1000. That is the core advantage of Hi-Lo. It does not make centralized coordination faster. It makes it far less frequent. Using Cosmos DB as the Source of Truth In our implementation, Azure Cosmos DB maintains the global upper boundary of allocated ranges. The coordination model is simple: A pod reads the current boundary, calculates the next range it wants, and tries to update the stored value to reflect the newly reserved upper limit. If the write succeeds, the range belongs to that pod. If it fails because another pod updated the value first, the pod retries. The important detail is that this is done using optimistic concurrency control through ETag validation. That gives us atomic range reservation without introducing heavyweight locks or a custom coordination service. Two pods may try to reserve a range at nearly the same time, but only one can successfully update the shared document. The others detect the conflict and try again. This is exactly the kind of pattern Cosmos DB handles well, as long as the design acknowledges that the shared document is a coordination point and treats it carefully. We also made a few deliberate configuration choices: Session consistency was used to preserve read-your-own-write behaviorDirect TCP mode helped minimize reservation latencyMulti-write regions were disabled because monotonic ordering mattered more than geographically distributed writes That last point is easy to underestimate. If strict ordering is a requirement, you cannot casually spread writes across regions and still assume the sequence semantics will behave the way the business expects. The Fast Path Is Purely In Memory Once a pod owns a range, the hot path becomes extremely lightweight. The service keeps the current range in memory, along with the current pointer and the maximum value of the reserved block. Every request simply increments the local counter and returns the next number. No network call. No shared lock. No database hit. No cross-pod communication. That means steady-state performance is not tied to remote I/O. It is essentially the cost of incrementing a number and returning it. This is where the architecture starts to feel elegant. The database is still the source of truth for range allocation, but it is no longer involved in day-to-day ID generation. The expensive coordination step has been pushed out of the critical request path. In practice, that made a major difference not just for throughput, but also for latency consistency. Preventing Pauses With Pre-Fetching One subtle issue with batch allocation is what happens when the current range runs out. If the service waits until the final value has been consumed before reserving the next block, some request will eventually have to pay the cost of going back to the database. That creates latency spikes right at the boundary between ranges. The fix is straightforward: pre-fetch the next range before the current one is exhausted. In our case, once the service had consumed around 80 percent of the current lot, a background process started reserving the next block from Cosmos DB. That block was stored as a standby range. When the active range reached its end, the generator simply switched to the pre-fetched range and continued without interruption. That small design choice helped keep the request path smooth even during transitions. Under stable conditions, callers never noticed when one block ended, and another began. It also made the system feel much more production-ready. Without pre-fetching, the architecture still works, but the boundary behavior becomes a lot noisier under load. Handling Contention Without Making It Worse Even with batched reservation, multiple pods can still collide when they try to reserve ranges around the same time. That is normal. The key is making sure those collisions stay localized and do not turn into synchronized retry storms. When a reservation fails because the ETag has changed, the pod retries with: A bounded retry countRandomized backoffJitter between attempts The jitter matters more than it might seem. Without it, competing instances can become accidentally synchronized, failing and retrying in lockstep. That creates more contention than the original conflict ever did. With randomized retry timing, the contention spreads out naturally, and one of the pods usually succeeds quickly. Most importantly, this contention only occurs during range reservation. It does not happen for every generated ID. That is a huge shift from the naive design, where every request competes for the same shared state. What the Performance Profile Looks Like The performance difference between the two approaches is dramatic. In a centralized per-request counter design, generating 20,000 IDs per second means 20,000 coordinated database operations per second. With Hi-Lo and a lot size of 1000, the same throughput requires roughly 20 database reservations per second per pod. That is not a small optimization. It is a different scaling model. The practical benefits include: Much lower write pressure on the databaseBetter request latencyMore predictable tail latencyReduced risk of hot partition behaviorBetter horizontal scalability as pods increase The architecture still has a centralized coordination point, but the frequency of access is reduced so much that it stops being the dominant constraint. That is often the real win in distributed systems: not eliminating coordination entirely, but moving it off the hot path and amortizing its cost. The Tradeoff You Have to Accept Like most scalable designs, this one is not free. The biggest tradeoff is that the sequence is not gap-free. If a pod reserves a range and crashes before consuming all of it, the unused numbers in that block are lost forever. The system still guarantees uniqueness and monotonic increase across allocated values, but it does not guarantee perfect continuity with no missing numbers. For many business cases, that is completely acceptable. For some financial, legal, or regulatory workflows, it may not be. That tradeoff has to be explicit. There is also a startup dependency on Cosmos DB. A pod cannot safely generate values until it has reserved its first range. In our design, if Cosmos DB is unavailable during initialization, the service fails fast rather than generating inconsistent identifiers. That is the safer operational choice, even if it is less forgiving. Where This Pattern Fits Best The Hi-Lo pattern makes sense when you need all of the following at once: Numeric IDs rather than UUIDsGlobal uniquenessMonotonic orderingHigh throughputDistributed deployment across multiple service instances It is especially useful in cloud-native systems where a simple database counter becomes a scaling liability. On the other hand, if your system is low-volume or does not truly need ordered numeric identifiers, this pattern may be unnecessary. Sometimes the better solution is to stop insisting on sequences and use UUIDs or another coordination-free identifier format. But when the business requirement is real, Hi-Lo is one of the cleanest ways to satisfy it without punishing the system on every request. Conclusion One of the most useful lessons in distributed architecture is that performance often improves not when coordination gets faster, but when coordination happens less often. That is exactly why the Hi-Lo algorithm works so well. By reserving ranges instead of individual values, we turned a centralized bottleneck into an occasional coordination step. Cosmos DB remained the source of truth, but it was no longer involved in every ID request. The hot path stayed local, fast, and predictable. With in-memory generation, optimistic concurrency, proactive pre-fetching, and jitter-based retries, this approach gave us a sequence generator that was both scalable and operationally practical. For teams building high-throughput distributed systems that still need ordered numeric IDs, Hi-Lo is one of those patterns that feels almost too simple at first.
Last March, our VP of Engineering asked me a deceptively simple question during our quarterly review: "How much CO2 does our AI platform emit?" I had no idea. We'd been obsessing over token costs — tracking every cent spent on OpenAI and Anthropic — but we'd never connected those tokens to their environmental impact. We were processing 42 million tokens daily. The finance team knew exactly what that cost in dollars: $127K monthly. But carbon? Nobody had asked. That question kicked off a four-month deep dive that fundamentally changed how we architect AI systems. We didn't just bolt carbon metrics onto existing dashboards. We rebuilt our entire token management strategy around efficiency as a first-class design constraint. The results surprised us: 89% reduction in token consumption, $113K monthly savings, and 19.7 tons of CO2 avoided over the program's life. What really surprised us? Being green and being lean turned out to be the same engineering problem. This article shares the specific techniques that got us there. Not abstract principles — battle-tested patterns with code examples, hard numbers, and the honest failures that taught us what actually works in production. Understanding the Real Cost of Tokens Before we optimized anything, we needed to understand what we were actually optimizing for. Token pricing is straightforward: GPT-4 costs $0.03 per 1K input tokens, $0.06 per 1K output tokens. Simple math. The environmental cost? Not so obvious. Here's what nobody tells you: every 1 million tokens processed generates approximately 0.47 kg of CO2. That number comes from energy consumption data for GPU inference (roughly 2.4 kWh per million tokens) combined with average US grid carbon intensity (0.195 kg CO2 per kWh). Your mileage will vary based on provider and data center location, but this gives a working baseline. Our initial audit painted a stark picture: That last row was the killer. We were flying blind. No breakdown by feature, no per-user attribution, no way to identify wasteful patterns. We needed observability before we could optimize. Pattern 1: Ruthless Prompt Optimization Our first discovery was embarrassing. We were using prompts that looked like they'd been written by committee — verbose, redundant, stuffed with unnecessary context-setting. Our document summarization prompt was 1,247 tokens. The actual content being summarized averaged 892 tokens. We were spending more on instructions than on the work itself. Here's the original prompt for our contract analysis feature — and what we replaced it with: We ran A/B tests on 10,000 requests per template to validate no accuracy loss. The results held across all 37 prompt templates. Key principles: Average request tokens dropped from 2,847 to 312. But here's what I didn't expect: response quality improved. Shorter prompts meant clearer instructions. The models had less noise to navigate. Less is genuinely more. Pattern 2: Streaming With Early Termination Most developers use streaming APIs for perceived performance — users see text appear progressively. We found a different benefit: streaming lets you stop generation early when you already have enough information. Think about a customer support feature. When a user asks "How do I reset my password?", the first 2–3 sentences usually contain the complete answer. Without streaming, you pay for 500 words and discard 80% of it. With early termination, you stop the moment the response is sufficient. We built a satisfaction scoring system that evaluates streaming chunks in real-time. The logic: Python def stream_with_early_stop(prompt, query_type): buffer = "" tokens_generated = 0 for chunk in client.stream(prompt): buffer += chunk tokens_generated += count_tokens(chunk) if tokens_generated >= 50: # grace period score = satisfaction_score(buffer, query_type) if score > 0.85: return buffer, tokens_generated return buffer, tokens_generated # fallback: full response Pattern 3: Context Pruning With Relevance Ranking Our RAG system was the biggest token sink. Every query triggered a vector search returning the top 10 document chunks. We embedded all 10 into the prompt context. Average context size: 8,200 tokens. Embedding cost for 42M daily tokens: $1,260 per day. Here's the uncomfortable truth: 7 of those 10 chunks were noise. The model would focus on 2–3 highly relevant passages and ignore the rest. We were paying to confuse it. We implemented two-stage relevance ranking. First, vector search returns the top 20 candidates (expanding the pool). Then, for each candidate, we compute a combined score: The counterintuitive finding still surprises me when I explain it to teams: better quality with fewer resources. More context isn't better. Relevance is better. Pattern 4: Token Budgeting by Request Type Even with all those optimizations, we still had runaway requests. A user would upload a 50-page PDF and ask for "detailed analysis." The model would generate 8,000 tokens of output — $0.48 per request. Multiply that across careless enterprise users, and you've got budget chaos. We implemented token budgets at two levels. Request-type limits are set at the 95th percentile of actual useful output length, based on 100,000 historical requests per feature: Three months after implementation: zero runaway requests consuming more than 5,000 tokens. Monthly cost standard deviation dropped from $18K to $1.1K. And here's the counterintuitive part — only a negligible number of users complained about the limits. Because limits set at the 95th percentile feel invisible to normal usage. The Carbon Calculus: Making Sustainability Visible All these optimizations saved us money. We also wanted to track the environmental impact — partly for our VP's original question, partly because it turned out to be a useful forcing function for engineering discipline. We built this into our monitoring dashboard. Every request now shows token count, API cost, and carbon footprint. We also integrated the Electricity Maps API for real-time grid carbon intensity. That last bit mattered more than we expected. Grid carbon intensity varies wildly by time of day. In California, it drops 60% at 2 PM (peak solar) compared to 8 PM. For batch workloads with no latency requirements, we schedule them during low-carbon hours. That single change reduced our carbon footprint by 23% for nightly document processing jobs — with zero accuracy or cost impact. Is 6.3 tons of CO2 annually going to save the planet? No. But multiply this by thousands of companies running AI workloads at scale, and the impact compounds. More practically, these optimizations made our system faster, cheaper, and more accurate. Sustainability was the bonus, not the trade-off. What Didn't Work: Honest Failures Three Dead Ends — So You Don't Have to Revisit Them 1. Aggressive Response Caching We thought caching responses for similar queries would save enormous tokens. In practice, cache hit rate was 4%. User queries are too diverse and too contextual. The overhead of maintaining the cache exceeded the savings. We killed it after two weeks. 2. Routing Everything to Smaller Models We tried sending all requests to GPT-3.5 instead of GPT-4. Token costs dropped 90%. Retry rate increased 340% — because users weren't satisfied with initial responses. Paying for quality upfront is cheaper than paying for retries. Every time. 3. Blanket Token Limits Without Classification Before per-request-type budgets, we tried a flat 500-token limit on everything. Users revolted. Code generation got cut off mid-function. Contract analysis was unusable. Intelligent limits beat arbitrary ones. Always classify first. Implementation Roadmap: Where to Start 8-Week Rollout Plan WEEK 1 Baseline audit. Measure current token consumption by feature, user tier, and request type. You cannot optimize what you don't measure. Add per-feature attribution to your existing observability stack first. WEEK 2 Prompt optimization. Highest ROI of any pattern. Audit your 10 most-used prompts, cut ruthlessly, A/B test on 5K+ requests to confirm no quality loss. Expect 50–80% token reduction from prompts alone. WEEK 3 Token budgets. Set max_tokens per request type based on P95 of historical output length. Stops runaway costs immediately with minimal user impact. WEEKS 4–6 Context pruning (RAG only). Implement two-stage relevance ranking. Add BM25 scoring alongside your existing vector search. Expect 60–80% context reduction and measurable accuracy gains. WEEKS 6–8 Streaming early termination. Build satisfaction scoring for factual and procedural queries. Start conservatively (threshold: 0.90) and tune down as you gather production data. ONGOING Carbon monitoring. Add CO₂ metrics to dashboards. Consider scheduling batch jobs during low-carbon grid hours via the Electricity Maps API. Free wins on an already-optimized system. Efficiency as an Engineering Discipline Token optimization isn't about being cheap. It's about being intentional. Every token you generate costs money, burns carbon, and adds latency. When you treat tokens as a constrained resource — like CPU cycles or memory — you build better systems. These patterns aren't novel. Ruthless efficiency has been a software engineering principle since resources were measured in kilobytes. What's new is applying that discipline to AI systems where token costs are invisible until suddenly they're not — usually when your finance team asks an embarrassing question during a quarterly review. Our results — 89% reduction in consumption, $1.35M annual savings, 6.3 tons of CO2 avoided — came from treating token efficiency as a first-class architectural concern. Not an afterthought. We measured, experimented, failed, and iterated. The environmental impact is real but secondary. Fast, predictable, cost-effective AI systems are the primary goal. Sustainability follows naturally from good engineering. If you're processing millions of tokens daily and haven't optimized your prompts, you're leaving money and carbon on the table. Start with the baseline audit. You'll be shocked at the waste. Then chip away systematically. Every million tokens saved is roughly $40 in your pocket and half a kilogram of CO2 out of the atmosphere. That's the hidden cost of AI tokens. The good news? It's entirely within your control to fix.
Real-time AI inference has become a fundamental feature of modern applications and has been used to drive applications in conversational agents, recommendation engines, fraud detection, and computer vision pipelines. In contrast to batch workloads, real-time inference requires stable, low-latency, predictable scaling, and resource efficiency. With the increase in the size or the number of computations performed by models, it becomes more complicated to provide these experiences at a reliable level, particularly when considering the performance versus the cost of operation. Cloud Run Cloud Run offers a simple, scalable, and managed infrastructure that delivers real-time machine learning models in the Google Cloud platform with the help of GPU acceleration and Vertex AI. This architecture allows teams to deploy containerized inference services that automatically scale with traffic while using GPUs to execute high-throughput model inference. Instead of deploying fixed clusters or provisioning resources manually, organizations can adopt a serverless-first approach, which has the capacity to bring compute capacity in step with demand. With the combination of these services, engineering teams are able to construct inference pipelines, which appear like current microservice platforms. Traffic is directed via controlled points, models are executed on specialized hardware, and observability is built into the operating system. This model takes away a significant portion of the complexity found within the underlying infrastructure, enabling the developers to concentrate on application logic and still attain production-grade performance. Deploying Low-Latency Inference With Cloud Run and GPUs Cloud Run is a service that provides a serverless experience to deploy containerized workloads. It is easily applicable to real-time inference services. Cloud Run can be used to run models that consume a lot of compute, though, with automatic scaling and billed on a request basis, when combined with instances that have GPUs. This enables teams to run stateless services as models that spin up when incoming traffic is detected and scale down when idle, enhancing responsiveness and cost efficiency. Practically, the models are bundled into containers that provide endpoints of inference via thin APIs. Such services are able to preload models upon startup and maintain them in the memory of the GPUs so that they can be swiftly executed. Cloud Run also does traffic routing, instance management, and scaling, and does not require managing node pools or orchestration layers. For latency-sensitive applications, concurrency settings can be configured, and the minimum number of instances can be set to minimize cold-start effects and guarantee a predictable response time. This deployment pattern can serve a wide variety of workloads, from transformer-based language models to vision inference pipelines. Since Cloud Run is seamlessly connected to GCP networking and identity services, inference endpoints can be sheltered under an API gateway and authenticated with IAM-based access. This allows the deployment of production that satisfies enterprise security and still offers the agility of serverless infrastructure. Integrating Vertex AI for Model Management and Observability Whereas Cloud Run supports inference serving, Vertex AI offers a support MLOps environment that can be used to scale models. Vertex AI provides a centralized system of record for the teams by handling model artifacts, experiment tracking, and versioning. This isolation of concerns enables engineers to deploy models without considering the serving infrastructure while still being able to trace iterations. Interestingly, Vertex AI also allows tracing model performance and system behavior. Numerical indicators, e.g., latency, throughput, and error rates, can also be gathered alongside model-specific indicators, helping teams notice regressions or slowdowns over time. A good number of organizations send inference logs and prediction data to BigQuery to perform offline analyses on it to gain a better understanding of how it is used and the quality of responses it offers. This feedback loop helps with continuous improvement without interrupting live services. Vertex AI is often combined with CI/CD pipelines to automatically promote models across environments in production environments. The validation of the new versions can be done in staging and deployed to Cloud Run endpoints, which are stable with the capability to quickly iterate. This practice of operation can be compared to the current software delivery practices, where machine learning models are perceived as versioned parts of a broader application ecosystem. Scaling, Cost Optimization, and Production Readiness Inference in real time can be scaled by paying special attention to the cost and performance. GPUs provide high acceleration, but they have to be put to good use to warrant their cost. A request-driven scaling model for Cloud Run can scale resources in accordance with actual demand, and utilization during peak load can be enhanced with batching strategies and concurrency controls. The teams use these techniques in conjunction with caching and request deduplication to further optimize throughput. Security and good governance are also required in production readiness. Inference services are normally executed with dedicated service accounts with limited privileges, and sensitive information is isolated using encryption protocols and access controls. Privacy can be implemented by blocking inference traffic out of trusted environments by restricting connections between networks with firewall rules and network links. These controls assist companies in launching AI services that adhere to company policies and regulations. Finally, effective real-time inference systems are similar to well-developed cloud-native systems. They are visible, automated, and constantly honed. Opposite to the traditional approach to AI platform building, which combines Cloud Run to offer scalable serving, GPUs to realize performance, and Vertex AI to provide lifecycle management, organizations can create AI platforms that provide low-latency experiences and ensure operational discipline. The combined solution will enable teams to go beyond experimentation and deliver reliable AI functionality at enterprise scale.
(Note: A list of links for all articles in this series can be found at the conclusion of this article.) In the previous installments of this series, we traced the arc from raw compliance intent — regulations such as NIST 800-53, FedRAMP, PCI DSS, EU AI Act — all the way to machine-readable OSCAL artifacts managed via GitOps pipelines and Trestle-powered automation. The central thesis has been that treating compliance artifacts as code, subject to the same versioning, testing, and review disciplines as software, is the only sustainable path to continuous assurance at scale. Part 3 of this series explored the collaboration topology: Regulators publishing OSCAL catalogs, Control Providers authoring component definitions, System Owners assembling SSPs, and Assessors generating SAPs and SARs — all mediated by Trestle's markdown-to-OSCAL round-trip. The friction was always the same: every persona still needed CLI fluency or IDE comfort to engage productively with OSCAL JSON. That friction is now removable. The Model Context Protocol (MCP) brings a standardized, AI-agent-ready interface to compliance tooling — and compliance-trestle-mcp, the first OSCAL-native MCP server from the OSCAL Compass community, makes every Trestle operation invocable by any MCP-compliant AI client: Claude, Roo Code, GitHub Copilot Workspace, or a custom agentic pipeline. Compliance-as-Code Game Changer With MCP The Model Context Protocol, incubated under the Linux Foundation and now an industry-wide open standard, provides a JSON-RPC layer by which AI models discover and invoke "tools" — discrete, typed operations exposed by servers. Think of it as the USB-C port for AI agents: standardized, self-describing, composable. Once an MCP server is registered, any compliant client can call its tools without custom integration work. For compliance workflows, this changes the architecture of engagement fundamentally. Today, driving Trestle to resolve a NIST 800-53 profile, generate SSP markdown, and assemble the resulting OSCAL JSON requires CLI invocations with precise arguments — work that falls to the Trestle-literate members of a compliance team. With compliance-trestle-mcp, those same operations become natural-language-addressable: an AI assistant executes the correct Trestle command sequence, validates the output, and surfaces results in whatever interface the persona is already working in. Compliance-trestle-mcp: Architecture and Capabilities The server is published on PyPI as compliance-trestle-mcp (v0.1.2, February 2026) and registered on the Official MCP Registry at registry.modelcontextprotocol.io under the identifier io.github.oscal-compass/compliance-trestle-mcp. Status is Active. Source: https://github.com/oscal-compass/compliance-trestle-mcp. Figure 1: compliance-trestle-mcp listed as Active on the Official MCP Registry (registry.modelcontextprotocol.io), v0.1.2. Tool Surface Six tools are currently exposed by the server, each wrapping a core Trestle operation: toolwhat it does trestle_init Initialize a Trestle workspace, creating the OSCAL folder hierarchy (catalogs, profiles, component-definitions, system-security-plans, etc.) trestle_import Import an existing OSCAL model (catalog, profile, SSP, component definition) from a local file or remote URL into the active workspace trestle_author_catalog_generate Generate per-control Markdown files from a catalog JSON, enabling human-readable editing without touching raw OSCAL trestle_author_profile_generate Generate Markdown documentation for the controls selected by a profile, preserving parameter overrides and guidance additions trestle_author_profile_resolve Resolve a layered OSCAL profile to a flat resolved-profile catalog, collapsing all imports and modifications trestle_author_profile_assemble Assemble edited Markdown controls back into a valid OSCAL Profile JSON, completing the round-trip Installation (One Liner) Add the following stanza to your agent's MCP configuration file (e.g., .roo/mcp.json for Roo Code or the Claude Desktop config): JSON { "mcpServers": { "trestle": { "command": "uvx", "args": [ "--from", "compliance-trestle-mcp", "trestle-mcp" ] } } } Personas Revisited: Now With an AI Co-Pilot Part 3 of this series established the canonical compliance-as-code collaboration model: five personas, each with distinct artifacts, editing interfaces, and OSCAL expertise levels. The MCP layer transforms each persona's relationship with those artifacts. Regulator Regulators publish security regulations and standards (NIST 800-53, GDPR, HIPAA) typically as PDFs. With compliance-trestle-mcp, a Regulator's technical team can instruct an AI agent to call trestle_import against a raw OSCAL catalog URL (e.g., the NIST GitHub releases), then trestle_author_catalog_generate to produce reviewable Markdown. Editorial cycles that previously required Trestle CLI expertise are now conversational. The AI handles the workspace plumbing; the domain expert focuses on control prose accuracy. Compliance Officer/CISO Compliance Officers author organizational overlays — parameter tailoring, guidance additions, control selections — expressed as OSCAL profiles layered on a regulatory catalog. With the MCP server, the AI can be prompted to "resolve the FedRAMP Moderate profile against the NIST 800-53 Rev5 catalog and generate the delta markdown for my SSP authoring queue." The agent chains trestle_author_profile_resolve→ trestle_author_profile_generate autonomously, surfacing the output for human review. This eliminates manual multi-step CLI orchestration and radically compresses profile maintenance cycles. Control Provider (Component Author) Control Providers — the engineers maintaining component definitions that map control implementations to policy-as-code rules — have traditionally needed both OSCAL fluency and DevSecOps context simultaneously. Now, an AI agent can assist by importing existing component definitions, generating Markdown stubs for unmapped controls, and prompting the engineer for implementation prose inline in the chat. The component definition round-trip (JSON → Markdown → edit → trestle_author_profile_assemble → JSON) is fully MCP-orchestrated. System Owner/SSO The System Owner assembles SSPs from profiles and component definitions — historically the most labor-intensive and error-prone step. With compliance-trestle-mcp, an AI agent can be directed to initialize the workspace, import all upstream artifacts, resolve the applicable profile, and generate the SSP Markdown scaffolding in a single conversational exchange. What once required mastery of four distinct Trestle sub-commands and careful argument threading is reduced to a natural-language instruction sequence. Assessor Assessors generating Security Assessment Plans (SAPs) and Reports (SARs) need to trace every selected control back through the SSP to the component definition and the originating catalog. With the MCP server, an AI agent can navigate that traceability chain on demand, resolving profiles and surfacing control implementation status, evidence links, and outstanding POA&M items — all without the assessor ever touching Trestle directly. The Emerging OSCAL MCP Ecosystem compliance-trestle-mcp is the first OSCAL-native MCP server from an established open-source compliance project, but it is not alone. A brief survey of the emerging ecosystem: serveroriginfocus compliance-trestle-mcp OSCAL Compass / CNCF Sandbox Full Trestle workflow: init, import, catalog/profile generate-assemble-resolve. First CNCF OSCAL MCP server. Registered at registry.modelcontextprotocol.io. mcp-server-for-oscal AWS Labs (awslabs) OSCAL schema introspection, model listing, and reference resource retrieval. Optimized for AI agents needing authoritative OSCAL structural guidance rather than authoring workflows. OSCAL MCP UI Apps Atelier Logos / Community Visual MCP UI layer for FedRAMP and HIPAA OSCAL workflows; interactive SSP visualization and compliance gap analysis via agentic app runtime. The AWS Labs server (github.com/awslabs/mcp-server-for-oscal) serves a complementary purpose: where compliance-trestle-mcp is workflow-centric (authoring and assembly), the AWS server is schema-centric (introspection and reference), providing AI agents with authoritative answers about OSCAL model structure, valid element sets, and use-case patterns. Together, they cover both the "what is OSCAL" and "do OSCAL" dimensions of agent-assisted compliance. NIST's Vision and the CSWP 53 Horizon The timing is not coincidental. NIST CSWP 53 ("Charting the Course for NIST OSCAL," December 2025 initial public draft) explicitly names agentic AI and digital twins as the next integration frontier for OSCAL — autonomous risk reasoning and continuous assurance driven by AI agents operating on machine-readable compliance artifacts. The compliance-trestle-mcp server is a concrete early instantiation of exactly that vision, with the CNCF Sandbox project providing governance and sustainability guarantees that standalone tools lack. What Comes Next for compliance-trestle-mcp The v0.1.2 release covers the catalog and profile authoring surface. The roadmap naturally extends toward the full OSCAL lifecycle for AI-assisted System Security Plan and MCP resource exposure — surfacing OSCAL documents as MCP resources (not just tool outputs) so AI clients can reason over live workspace state. Conclusion Compliance as Code has always promised to make compliance automation as natural as software development. The MCP layer removes the final adoption barrier: the requirement for personas to learn Trestle directly. With compliance-trestle-mcp, every compliance stakeholder — from the Regulator drafting a new catalog overlay to the Assessor closing out a FedRAMP SAR — can now engage with OSCAL artifacts through natural language, mediated by an AI agent that understands both the domain and the toolchain. The server is live, registered, and installable in seconds. The OSCAL ecosystem is building out MCP coverage rapidly, with NIST's own roadmap pointing in the same direction. The gap between compliance intent and continuous machine-readable assurance has never been smaller. References and Learn More [1] OSCAL Compass / compliance-trestle-mcp GitHub. https://github.com/oscal-compass/compliance-trestle-mcp [2] Official MCP Registry — io.github.oscal-compass/compliance-trestle-mcp. https://registry.modelcontextprotocol.io [3] AWS Labs mcp-server-for-oscal. https://github.com/awslabs/mcp-server-for-oscal [4] COMPASS Part 3: Artifacts and Personas (DZone). https://dzone.com/articles/compliance-automated-standard-solution-compass-part-3-artifacts-and-personas [5] NIST CSWP 53: Charting the Course for NIST OSCAL (Dec 2025 IPD). https://csrc.nist.gov/pubs/cswp/53/charting-the-course-for-nist-oscal/ipd [6] Building Visual MCP UI Apps for FedRAMP & HIPAA with OSCAL (Atelier Logos, Jan 2026). https://www.atelierlogos.studio/blog/2026-01-08-using-the-aws-mcp-server-for-oscal [7] OSCAL Hub — Open-Source OSCAL Platform (RegScale / OSCAL Foundation). https://regscale.com/blog/introducing-oscal-hub/ [8] Model Context Protocol Roadmap (Linux Foundation, updated Mar 2026). https://modelcontextprotocol.io/development/roadmap Below are the links to other articles in this series: Compliance Automated Standard Solution (COMPASS), Part 1: Personas and RolesCompliance Automated Standard Solution (COMPASS), Part 2: Trestle SDKCompliance Automated Standard Solution (COMPASS), Part 3: Artifacts and PersonasCompliance Automated Standard Solution (COMPASS), Part 4: Topologies of Compliance Policy Administration CentersCompliance Automated Standard Solution (COMPASS), Part 5: A Lack of Network Boundaries Invites a Lack of ComplianceCompliance Automated Standard Solution (COMPASS), Part 6: Compliance to Policy for Multiple Kubernetes ClustersCompliance Automated Standard Solution (COMPASS), Part 7: Compliance-to-Policy for IT Operation Policies Using AuditreeCompliance Automated Standard Solution (COMPASS), Part 8: Agentic AI Policy as Code for Compliance Automation With Prompt Declaration LanguageCompliance Automated Standard Solution (COMPASS), Part 9: Taking OSCAL-Compass to Industry Complexity LevelCompliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability
AI has already moved beyond text generation. Modern agents can browse the internet, read documents, call APIs, query databases, and coordinate numerous actions between tools and services. They are expected to do more than simply provide a single nebulous answer. In real-world systems, agents evaluate the quality of their own results, independently identify errors, and learn. This capacity for reflection and adaptation distinguishes deep agent systems from the simple, one-off interactions of language models based on the 'one question, one answer' principle. A single answer implies incomplete reasoning, a lack of context, unclear instructions, and contradictory constraints. Rather than treating the generated results as final, the agent verifies them by asking questions: Does the result match the user’s intentions?Are there any logical inconsistencies?Is the answer comprehensive and well-structured? Consequently, generating a response takes a long time as it involves numerous verification steps. Generation and evaluation are not the same task and for the same agent. The generator creates an initial response, while the evaluator analyses it for correctness, clarity, and alignment with the user’s intentions. As with humans, the evaluator should not be constrained by the same assumptions that led to the generator’s initial output. If an error is found, it is sent back, and the model is retrained, and so on, in a cycle. It is important to manage feedback loops and response revisions effectively. Endless cycles of revision are counterproductive and super-super costly sometimes. Clear evaluation criteria, follow-up questions for the user, a list of corrective strategies, and explicit decision points are required. A good prompt should describe how the system is supposed to operate, which tools must be used, and what steps should be taken. However, the more complex the task, the greater the chance of making a mistake. Like in every other aspect of IT processes. This is where the Model Context Protocol (MCP) comes in. The MCP enables us to identify and execute the necessary actions across different programs, access external resources, and retrieve results. For instance, to parse a website and create a mock-up of it in Figma, you would use the Selenium URL loader. Think of the MCP as a bridge facilitating pre-defined interactions between models, tools, and external systems. MCP reduces the effort required of the user to describe actions. Tools and resources are pre-loaded onto the MCP server rather than being described in text instructions. If a user requests a summary of recent news, for example, Newspaper3K is configured to retrieve the relevant data, and the Oolama + OpenAI API is set up for local and server-side text generation. It is the model itself that decides which feature to use, rather than attempting to recreate behavior using prompts from the user. MCP transforms the model into something suitable for real-world tasks. The MCP can be viewed as a coordination system that links intelligence and execution. The model focuses on understanding user intentions and answering the question, 'What does the user want from me?' The MCP manages the discovery, verification, and orchestration of tools and available resources. The LLM can't call APIs independently; this is done by the MCP. The MCP also helps to prevent context fragmentation. The context window represents the maximum number of tokens that the model can process in a single request. However, there is no magic solution; the 'do it right' button has yet to appear, so we still have a job to do. It’s best to interact with an LLM using structured, detailed prompts to ensure predictable, consistent behavior. Providing clear instructions reduces the likelihood of misuse, wasted tokens, and confusion. Tokens are the basic units of text. There are various tokenisation methods; popular examples include WordPiece, SentencePiece and BPE. You can import the nltk library and extract tokens from a sentence yourself: 'What goes around comes around' would be split into 'what', 'goes', 'around', 'comes', 'around', and these would then be converted into 0 and 1 for ML. As we can see, in this sense, LLMs are very similar to linear regression in fact. Key components of MCP: "Clients" that manage user interactions, conversation state, and orchestration.Servers that provide discoverable tools and resources. Typically, these are HTTP-based servers that act as lightweight backends, remaining active and accepting requests via URLs.Messages convey intent, context, and execution results.Structures for incoming and outgoing data. This separation helps the MCP avoid entanglement between models and execution logic. While each component remains independent, they continue to work together via a common protocol (which may be the MCP or another protocol). Models do not speculate or invent actions; they operate strictly within the capabilities defined by the MCP. This simplifies system debugging, makes deployment safer, and ensures more predictable behavior. Broadly speaking, resources are documents, files, or any other type of structured content. All of these are accessible via a URI. This ensures that the model operates within defined rules and constraints, which makes it easy to debug errors. Therefore, it is important that each tool can be tested in isolation and reused. This is the only way to scale the system. However, there are a few rules to follow when working with resources. Typically, businesses want instant access via an LLM to all the documentation accumulated over the last 30 years. You know, legacy, a set of PDFs, and so on. Even if we are technically able to provide the entire text at once upon request, we should still avoid large documents. This helps to maintain readability. Here, we will use an actor-critic architecture with two models: one selects the tool, and the other validates the quality of the selection via a reward. One model is responsible for the rules and the other for the value to the user. What If There Are Any Errors? Architecture inevitably becomes more complex over time. Or maybe even at the first iteration. The more complex and interconnected AI becomes, the greater the likelihood of errors or even failure. The key question, given that we are no longer dealing with predictable CRUD services, is: ‘How can we properly restore operations after errors occur?’ For AI systems, recovery from failures means ensuring system operation continues, and results remain acceptable, even if individual components fail. Rather than allowing a failure to bring the entire system to a halt, well-designed systems continue to operate. In other words, the system must be resilient, continuing to function even if some components fail. Is GPT-5.4 unavailable? In that case, we switch to Gemini 2.5. The system may degrade, but it will continue to operate. This is better than a complete system failure. Ideally, you should have alternative tools and models, as well as simplified logical paths. And, of course, backups. If we cannot identify and fix the problem, we will only provide conservative responses if the model starts producing answers that are unsafe or violate policy. The debugging process involves checking the input data and then testing the functionality of the tools and APIs, including checking their availability, latency, and response integrity. Multi-Step Reasoning Single-step reasoning is effective for simple queries, but becomes less so when tasks involve dependencies or intermediate solutions. In such situations, rather than immediately producing a final answer, the agent must track the progress of execution at every stage. Multi-stage reasoning addresses this by breaking down complex goals into smaller subtasks, preserving context separately at intermediate stages, and altering the execution sequence in the event of incorrect assumptions. Validation acts as a control mechanism in multi-stage workflows in the event of failures. This prevents errors from different stages from accumulating, and prevents tokens from being wasted on calculations based on incorrect data. The likelihood of failure is very high if an agent has to tackle a highly complex, long-term task. One of the main reasons for this is an inability to prioritize sub-tasks. Hierarchical planning is required to distinguish between strategy and implementation. To focus on the long-term goal, we need temporal abstraction and constant feedback from the user. Monitoring LangSmith is a useful tool for monitoring agents. It is compatible with both LangChain and LangGraph and is run on Runs. An alternative is Langfuse, which is better suited to enterprise environments where there is a dedicated role for analyzing the request processing pipeline (from my PoV). It has a great dashboard, too. Langfuse enables you to troubleshoot issues using tracing. If a problem arises due to unexpected interactions between search processes, request formation, or model execution, Langfuse can help. However, LangSmith also shows the sequence of events from start to finish, taking context into account. Classic Prometheus and Datadog are still suitable for tracking agents' activities. Overall, however, combining the Streamlit interface, LangChain pipelines, vector storage, and LangSmith tracing into a single app.py is a good solution. Centralization simplifies tracking, debugging, and analyzing workflows. So, the problem has been identified — what next? When implementing AI in a large company, API failures are most often caused by incorrect input data or unexpected response structures rather than errors in the model itself. LangServe's automatic schema inference reduces the number of failures before the request even reaches the model, so this is nothing new. I would suggest using containerization to reproduce errors. This provides service isolation to prevent dependency conflicts and enables reproducible deployments using container images with specific versions. There are also other benefits of container orchestration. Containerized components include: Agent APIs: access to tool execution via LangServe or similar frameworks.MCP servers: provide standardized access to tools and resources using the MCP client-server model. Containerization of MCP servers ensures consistent tool availability across all environments. The key is to avoid hard-coded file paths. Monitoring: Log execution traces, performance metrics, and assessments using LangSmith or similar tools.Supporting infrastructure: Databases, vector stores, or simply files accessed by agents. Data We’ve received a PDF file, and our task is to make it accessible via an LLM. First, the PDF needs to be split into chunks, each with a unique UUID. After embedding, these chunks should be stored in a vector database. The text must be transferred either sentence by sentence or with chunk overlap to preserve context between chunks. RAG will then enable us to interact with the document. RAG is essentially an LLM that has access to a knowledge base. It can also reduce hallucinations to some extent. As always, the key to success here is data: its quality, stability, backups, and access speed. The high-level process is as follows: HTML query > retrieve > generate To implement RAG on AWS, you can consider using Bedrock for the LLM, OpenSearch for access to the vector database (S3), and Lambda. Bedrock is Amazon’s service for deploying AI agents, and I love their prompt management. The most critical aspect of RAG is uploading files; it is crucial to provide high-quality content that the system will process and respond to. Here, we have to keep in mind Amdahl's law in the context of parallel computing. The idea is simple: performance gains plateau as the number of processing threads increases because the sequential parts of the task cannot be parallelized. When compiling the llama.cpp file on a 24-core, 64-thread AMD Threadripper processor, I have noticed that increasing the number of threads from 12 to 64 significantly reduced the time taken for compilation. However, exceeding 64 threads only yielded a marginal improvement, due to I/O bottlenecks and sequential dependencies. As part of the Amazon ecosystem, Bedrock is bundled with SageMaker for model training, AWS App Studio, and Amazon Q, which is a ready-to-use AI assistant. Also, if the free version of Google Colab proves insufficient, AWS SageMaker is a more or less excellent alternative. If you have chosen Bedrock, you will most likely use the async/await architecture in Rust and the Tokio runtime for parallel Bedrock API calls. Amazon OpenSearch Serverless can be used as a vector database. And it's a pretty popular option. Rather than performing searches based on keyword matches, it indexes documents and performs searches based on semantic similarity. In the RAG pipeline on AWS, documents from S3 are split into fragments, embedded using Amazon Titan or a similar model, and stored in a vector index. This allows the most relevant content to be retrieved in response to user queries and synthesized using an LLM. Well, grain of salt. After Amazon had been mentioned so many times, the experts began to consider the associated costs. It’s important to keep costs under control. Data is the new gold, for sure. But having too much data isn’t good for the wallet. It's important to be able to cache frequently executed queries. If you need a step-by-step guide: Use Bedrock alongside S3 as your data source and OpenSearch Serverless as your vector search engine.Implement smart chunking to optimize documents for search.If real-time data freshness is not required, use batch loading intervals instead of continuous updates. Add a caching layer for frequently asked queries. The development of the agent can be broken down into three stages. Data preparation involves data loading, pre-processing, and structuring. Chunking and embedding.Indexes: preparing for successful data retrieval. Vector stores and SQL are all available in ChromaDB, Pinecone, and FAISS. The type of database is important because FAISS can store the index and perform searches on the GPU, speeding up searches by orders of magnitude. Meanwhile, GraphRAG enables you to link information to context and build connections.Retrievers are used to find the right document based on a query. Hybrid search retrieves the required document. It can also delete documents. One challenge you’ll face repeatedly is reducing your monthly LLM costs while maintaining response quality and ensuring compliance with data privacy regulations. To achieve this, you should examine your current pay-per-call costs on Bedrock and compare them with fixed-price alternatives. You will most likely need to migrate workloads involving large volumes of data and heightened privacy requirements to the locally deployed llama.cpp platform with GGUF quantized models. This will eliminate API usage fees and improve data security. However, we won’t be able to completely abandon Bedrock if we require massive models. We can prototype on Canvas while MLOps keeps an eye on costs. Fine-Tuning Although pre-trained models are useful, we usually need our own. We can adapt models that have been pre-trained on large datasets to our smaller task. The simplest approach is standard fine-tuning, which involves updating the weights to adapt the model to our dataset. We take a pre-trained model and do not overwrite it. If your tasks are typical and you have a large dataset, then standard fine-tuning is the way to go. The second fine-tuning option is low-rank adaptation (LoRa), which involves adding small matrices to specific layers. This approach requires only around 0.1% of the original set of parameters. In effect, it enables targeted adjustments to be made to the model when computational resources are limited. It even works for large models. The original weights remain unchanged, but are combined with the matrices. This enables us to adapt the model for a wide variety of tasks. We use it when resources are limited, for multitasking, and to avoid catastrophic forgetting. LoRa is well-suited to open-source projects, and PEFT is widely used. It also enables models to adapt easily to new tasks. The third option is Supervised Fine-Tuning (SFT), which is a model that minimizes the loss function. It is particularly well-suited to tasks requiring high accuracy when a labeled dataset is available.\ The overall process will look like this: We need a dataset.It is prepared.A new layer is created.The model is trained.The model is tested and deployed. Lesson from my painful experience: pay particular attention to the file ID, as one small mistake could result in costly mistakes. If you have someone specially trained in a specific area (SME), you could opt for RLHF (training via human feedback). In practice, the training data is stored in JSONL format and uploaded to OpenAI’s servers. Then, a task is created on FineTuning. You can view the demo here. I prefer to use jqlang when working with JSONL. Before training the model, make sure you have defined and configured the training parameters. Key parameters: Learning rate: If this is set too high, the results will be unsatisfactory. If it is too low, the model will take a very long time to train.Batch size: The smaller the batch size, the less stable the model will be.The number of epochs: The lower this is, the weaker the training will be. Setting the epochs parameter to 5 means that the dataset will be iterated through five times. LLAMA Would you like to install the model locally? GGUF is the ideal solution for local models on LLAMA. It acts as a sort of bridge. It feeds into the GGUF Conversion Pipeline, a multi-stage process that converts a model from the original Hugging Face format into a single artifact file ready for deployment. After quantization, we reduce the file size from 62 gigabytes to approximately 19 gigabytes using llama-quantize. If the system can handle it, we can use the model to our heart's content. My code is not the best, and an LLM could generate a better one. However, this code has worked fine on five different machines with different parameters and operating systems, so it's pretty robust. Download Llama and its extensions. The Llama C++ toolkit converts models into locally deployable helpers. Python git clone https://github.com/ggerganov/llama.cpp.git curl -LsSf https://astral.sh/uv/install.sh | sh Check all the configured repositories that have been deleted in the current Git repository. Python git remote -v Installing huggingface_hub. Python make GGML_METAL=1 GGML_ACCELERATE=1 -j8 pip3 install --user huggingface_hub\[cli\] pip3 install --upgrade --user 'huggingface_hub[cli]' And we use a script to download a 23-gigabyte model. Python python3 -c " from huggingface_hub import hf_hub_download print('Downloading Qwen 2.5 Coder 32B Q5_K_M...') hf_hub_download( repo_id='Qwen/Qwen2.5-Coder-32B-Instruct-GGUF', filename='qwen2.5-coder-32b-instruct-q5_k_m.gguf', local_dir='.', local_dir_use_symlinks=False ) print('Download complete!') " Or a smaller version, because the larger version runs very slowly on my computer: Python cd ~/git/llama.cpp python3 -c " from huggingface_hub import hf_hub_download print('Downloading Qwen 2.5 Coder 7B Q5_K_M (~5GB)...') hf_hub_download( repo_id='Qwen/Qwen2.5-Coder-7B-Instruct-GGUF', filename='qwen2.5-coder-7b-instruct-q5_k_m.gguf', local_dir='.', local_dir_use_symlinks=False ) print('Download complete!') " ls -lh ~/git/llama.cpp/*.gguf Run the following command: curl -LsSf https://astral.sh/uv/install.sh | sh, then check the version using uv --version. Download the dependencies. UV is required to run the script that converts from PyTorch to GGUF. Python uv run --with transformers --with torch --with sentencepiece \ python convert_hf_to_gguf.py /actual/path/to/model pip3 install --user transformers torch sentencepiece protobuf numpy After running UV, the next steps are uv venv to create the environment and uv sync to install the dependencies. It's for troubleshooting. Quantization to reduce the model size, as discussed in the article. Optional. Python curl -LsSf https://astral.sh/uv/install.sh | sh cd ~/git/llama.cpp # Create build directory mkdir build cd build # Configure with Metal support (for Mac GPU) cmake .. -DGGML_METAL=ON # Build (use -j8 for parallel compilation) cmake --build . --config Release -j8 ls -la bin/ ./bin/llama-quantize \ ../qwen2.5-coder-32b-instruct-q5_k_m.gguf \ ../qwen2.5-coder-32b-instruct-q4_k_m.gguf \ Q4_K_M llama-cli runs the model locally. Now, to start a conversation, go to http://127.0.0.1:8082/. Python cd ~/git/llama.cpp/build ./bin/llama-server \ -m ../qwen2.5-coder-7b-instruct-q5_k_m.gguf \ -c 8192 \ -ngl 99 \ --port 8082 I hope this article helps you save money on LLMs, tokens, and MCPs.
Artificial Intelligence is rapidly becoming a part of everyday devices — smartphones, cars, cameras, and even home appliances. Traditionally, these systems rely on cloud servers to send, process, and analyze data before making decisions, which increases latency and delays responses. However, many applications require instant decision-making, where even a slight delay can be critical. In such scenarios, relying on network connectivity is not always practical, and decisions need to be made locally on the device itself. This has led to a growing shift toward running intelligence directly on devices, making real-time local processing more important than ever. In this article, we’ll explore why this shift matters and how it is shaping the future of modern intelligent systems. What is Edge AI? Edge AI refers to running AI models directly on devices such as IoT systems, smartphones, autonomous cars, drones, and sensors — right where the data is generated. With this approach, there is no need to transfer data to cloud servers or centralized systems. Edge AI enables faster, real-time decision-making by processing data locally, without sending it elsewhere. For example, Instead of sending every transaction to a central server for analysis, the system can analyze transaction patterns locally in real time. If any unusual activity is detected — such as an abnormal withdrawal amount, location mismatch, or suspicious behavior — the system can instantly block the transaction or trigger an alert. Why Real-Time Processing Matters? Real-time processing means a system can process data instantly and make decisions without delay. Even small delays in decision-making can create critical situations and lead to serious consequences. For example, an autonomous car must detect obstacles and react within milliseconds. If it relies on the cloud, even a small delay could lead to serious consequences. By processing data locally, Edge AI enables immediate decisions — such as braking or steering — making the system safer and more efficient. Reduce Latency and Faster Decisions Latency is the time it takes for data to travel to the cloud and back. Even a delay of a few milliseconds can be too slow for certain applications. With Edge AI: Data is processed instantly on the device itself.There’s no need to wait for a network response.Performance is faster, more reliable, and less dependent on connectivity. For example, a voice recognition system on a smartphone can respond much faster when speech processing runs locally on the device, rather than relying on cloud or centralized servers. Improved Privacy and Data Security Sending sensitive data to the cloud raises privacy concerns, as it can be exposed during transmission or storage. Edge AI minimizes these risks by processing data directly on the local device instead of sending it to the cloud. This approach enhances data security and helps maintain user privacy, since sensitive information never leaves the device. It also supports compliance with data protection regulations and reduces the chances of unauthorized access or data breaches. For example, a healthcare wearable that monitors heart activity should not transmit sensitive personal health data to external servers. Instead, it can analyze patterns locally on the device and instantly alert the user if any irregularities are detected. This approach not only protects patient privacy but also enables faster, real-time responses in critical situations. Such local processing is especially important in industries like banking, healthcare, finance, and smart homes, where data security and immediate decision-making are essential. Reliability Without Internet Dependency Edge devices can operate even without an internet connection, making them more stable and reliable in remote areas or environments with poor network coverage. This ensures continuous performance without interruptions or delays caused by connectivity issues. As a result, critical applications can function smoothly regardless of network availability. For example, a drone used in disaster rescue operations cannot depend on internet connectivity. It must process images locally and detect survivors in real time, enabling faster and more effective rescue efforts. Lower Bandwidth Usage and Reduce Infrastructure Costs Sending large amounts of data to the cloud consumes significant bandwidth and increases operational costs. Edge AI helps reduce these costs by processing data locally on the device. This minimizes the need for constant data transmission and optimizes network usage. Only relevant or critical information is sent to the cloud, making the system more efficient and cost-effective. For example, a factory machine monitoring system can analyze sensor data locally and send alerts only when an issue is detected, instead of continuously streaming all the data. Scalability and Cost Efficiency Cloud processing for millions of devices can become expensive and resource-intensive. Edge AI addresses this by distributing computations across devices, reducing the load on central servers. This decentralized approach lowers infrastructure costs, improves scalability, and enhances overall system performance. It also reduces latency by minimizing the need for constant communication with the cloud. For example, in a smart city, thousands of cameras can process data locally instead of sending everything to a central cloud system. This not only saves bandwidth and infrastructure costs but also enables faster, real-time insights and responses. Better User Experience Real-time processing significantly improves user experience by making systems feel faster, smoother, and more responsive. Quicker responses lead to higher user satisfaction and a more seamless interaction. With Edge AI, data is processed instantly on the device, eliminating delays and ensuring consistent performance. This is especially important for applications that require immediate feedback. For example, in gaming or augmented reality (AR), local AI can render objects and interactions in real time, creating a smoother, more immersive, and engaging user experience. An edge-based platform helps by enabling data processing and decision-making directly on devices, rather than relying entirely on centralized cloud systems. It supports faster, real-time responses by analyzing data locally, which is essential for applications that require immediate action. This leads to improved performance and reliability, especially in environments with limited or unstable internet connectivity. It also enhances data privacy and security by keeping sensitive information on the device, reducing the need for data transmission. Additionally, it optimizes bandwidth usage and lowers infrastructure costs by sending only meaningful insights or alerts to central systems instead of continuous raw data. Overall, this approach helps build systems that are faster, more efficient, secure, and scalable by bringing intelligence closer to where data is generated. Conclusion Edge AI is transforming modern systems by bringing intelligence closer to where data is created, enabling faster and real-time decision-making. It reduces latency and improves performance by processing data locally instead of relying on the cloud. This approach also enhances privacy and minimizes dependence on constant internet connectivity. Additionally, it helps reduce bandwidth usage and lowers infrastructure costs. From smart cities to healthcare and industrial automation, edge computing is driving a new era of faster, smarter, and more efficient systems. Edge AI brings intelligence closer to where data is created, enabling real-time decisions, faster performance, enhanced privacy, and reliable operation without depending on constant connectivity.
I set out to build a simple Slack bot that could answer questions about our GitHub repository — open bugs, pending PRs, and recent releases. Straightforward enough. It turned into 400 lines of API glue code. When I asked Claude, ChatGPT, Gemini, and several coding assistants for architecture advice, they all converged on the same conventional pattern: What every AI suggestedWhat it means in practice1. Slack receives the mentionWrite a GitHub REST client2. Bot calls GitHub REST APIRouting logic per question type3. Feed response into Claude/GPTPagination per endpoint4. Model formats the answerMaintain API versions5. Bot posts back to SlackRepeat for every new data source This works. I built it. Three days, 400 lines of API client code, and it answered perhaps 60% of the questions my team asked. Questions like "Are any critical bugs related to PRs merged this week?" required custom correlation logic across multiple endpoints. Every new question type meant new code. Adding error monitoring as a second data source meant a separate integration entirely. After digging deeper into how AWS Bedrock handles tool use, I discovered the Model Context Protocol. I rebuilt the same bot in an afternoon — 150 lines, answering a far wider range of questions, and adding a new data source is a handful of lines in a single function. This article explains what changed and why it matters. The core insight: don't build an API client that feeds a model. Build a model that calls tools. These are fundamentally different architectures. The Architecture: Three Layers, One Loop The system is built in three layers. Each has exactly one responsibility and hands off cleanly to the next: Slack (Socket Mode) User types @mention → question received ↓ question passed to agent AWS Bedrock — Claude (Agent Loop) Reason → decide tools → call → read results → repeat ↓ tool calls routed via registry MCP Servers (GitHub + any other) 40+ tools per server — issues, PRs, releases, code search… ↓ tool results → reasoning → formatted answer → Slack Slack receives the @mention and passes the question down. Bedrock runs the agent loop — Claude reasons about which GitHub MCP tools to call, executes them, reads the results, and loops until it has enough data to answer. The tool registry routes each call to the correct MCP server automatically. The answer travels back up to Slack. Before vs. After: A Real Question To understand why this matters, consider a specific question a developer might ask in Slack: "Are any critical bugs related to PRs merged this week?" On the surface, this seems simple. But answering it correctly requires data from two separate GitHub API endpoints — the issues API for bugs, and the pull requests API for recent merges — and then correlation logic to match issue references in PR descriptions. If you are writing a traditional bot, you need to anticipate this question, write the two API calls, handle pagination on each, and write the join logic. Now imagine a dozen different question types. Each one is a new coding task. Traditional approachMCP approach1. Search GitHub for critical bugsClaude calls list_merged_prs (this week)2. Search for PRs merged this weekClaude calls search_issues (critical bugs)3. Write correlation logic across bothClaude calls get_issue for each candidate4. Handle pagination on each endpointClaude cross-references links in PR bodies5. Feed combined data to model to formatClaude returns correlated, formatted answer6. New question? Write new logic.New question? Model figures out new tools. What makes the MCP approach powerful is not just the line count — it is what the model is doing. Claude receives the full JSON Schema for every available GitHub tool at startup. When the question arrives, it reasons over those tool descriptions, selects the relevant ones, calls them in the right order, and then reasons over the combined results to produce an answer. It does not need to be told: "for bug questions, use search_issues". It reads the tool description and figures that out. The result is that the model can handle questions you never anticipated. "Show me PRs merged this week still linked to open bugs" — a slightly different framing of the same question — works without any code changes, because Claude adapts its tool selection to the new phrasing. Example Slack response: Plain Text :rotating_light: *Critical Bugs Linked to Recent PRs* • <https://github.com/org/repo/issues/1234|#1234> — Payment processing failure (linked to <https://github.com/org/repo/pull/5678|PR #5678>, merged Apr 14) • <https://github.com/org/repo/issues/1290|#1290> — Auth token timeout on mobile (linked to <https://github.com/org/repo/pull/5691|PR #5691>, merged Apr 15) Summary: 2 critical bugs found. Both linked to PRs merged this week. 6 tool calls: list merged PRs, search critical issues, get_issue per candidate. What the Model Context Protocol Does MCP is an open standard that lets AI models discover and call external tools through a uniform interface. Every MCP server exposes a tools/list endpoint returning every available action as a full JSON Schema. The model loads these at startup and reasons over them autonomously. Your application code never routes a single query. GitHub's official MCP server at api.githubcopilot.com/mcp/ exposes 40+ tools — issues, PRs, releases, code search — and a single GitHub token is all the authentication required. The shift is architectural, not cosmetic. The conventional model is a formatter — it receives data you fetched. The MCP model is a reasoning agent — it decides what to fetch, fetches it, and synthesizes the results. The first scales with the API code you write. The second scales with the MCP ecosystem. Why SRE and Platform Teams Should Care This bot started as a developer productivity tool. But when our SRE and platform engineering teams reviewed the architecture, they saw something broader: a pattern that could eliminate an entire category of operational toil. Platform teams spend considerable time maintaining integrations — every API change means updating a client, every new data source means a new integration project. The MCP pattern changes that calculus entirely. Integration toil. MCP server owners maintain compatibility with their own APIs. When GitHub updates its REST API, GitHub's MCP server absorbs that change. You own zero API client code.API drift. Traditional bots silently degrade when response schemas change. With MCP, the server owner tracks those changes — your bot keeps working.Correlation complexity. Linking deploys to errors, PRs to bugs, incidents to changesets — this logic is brittle in code and breaks constantly. Models do this naturally by reasoning across tool results in context.Platform rebuilds for new capabilities. Each new MCP server extends the bot without touching the agent loop. The loop is infrastructure. The servers are plugins. New team joins? New tool added? It is configuration, not development.The compounding effect matters most: every new MCP server registered is immediately available for any question the model asks. Traditional integrations accumulate glue code. MCP integrations accumulate capabilities. Conclusion The conventional approach to building AI-powered developer tools is not wrong — it works, and many teams run it successfully. But it carries a hidden cost: every new capability requires new code, every new data source requires a new integration, and every API change requires maintenance. Over time, that cost compounds. The Model Context Protocol eliminates that cost. By exposing tools through a uniform interface that the model discovers at startup, MCP shifts the integration burden away from your codebase and onto the ecosystem. The model reasons about which tools to call. You reason about what questions to answer. Part 1 has covered the why — the architectural distinction, the before/after comparison on a real question, and why this matters especially for SRE and platform teams. Part 2 puts it into practice with the complete implementation, step-by-step setup, and production lessons that make it reliable for daily use. Continue to Part 2: Implementation, Setup, and Production Patterns. Full project code on GitHub: https://github.com/sangharshcs/slack-github-mcp-bot.
Artificial intelligence is rapidly transforming software development. Many developers now use AI-powered tools to generate code, but the next advancement is integrating AI directly into applications. Modern systems increasingly use large language models (LLMs) to answer questions, automate workflows, summarize information, and enhance user experiences. Software engineers must therefore combine traditional enterprise development practices with AI capabilities while ensuring reliability, scalability, and maintainability. This evolution offers Jakarta EE developers a significant opportunity. Jakarta EE provides a mature platform for enterprise applications, with standards for dependency injection, RESTful services, configuration, persistence, and cloud-native development. By integrating Jakarta EE with LangChain4j, developers can access advanced AI models through a straightforward Java API, adding intelligent features without leaving the familiar Jakarta EE environment. In this article, we will build a simple "Hello World" AI application to demonstrate how easily a Large Language Model can be integrated into a Jakarta EE application using LangChain4j. Configuring LangChain4j With Jakarta EE Technologies Before developing your first AI-powered application, it is important to understand LangChain4j’s role in the Java ecosystem and its popularity for AI integration. LangChain4j serves as an orchestration layer between Java applications and AI providers. It simplifies AI integration by offering a consistent programming model, regardless of the underlying vendor. If you are familiar with Spring Data or Jakarta Data, this concept will be familiar. With Spring Data and Jakarta Data, developers define repository interfaces and use annotations to specify behavior. Implementation details are handled by a provider that generates the concrete implementation and manages database communication. This allows developers to focus on business logic rather than low-level database operations. LangChain4j uses a similar approach for artificial intelligence. Instead of writing HTTP clients, building JSON payloads, and managing provider-specific APIs, developers define Java interfaces representing AI capabilities. LangChain4j then generates the implementation and manages communication with the chosen AI provider. LangChain4j can be viewed as the AI equivalent of Jakarta Data or Spring Data, with the AI provider dependency functioning like a JDBC driver. Switching from one AI provider to another, such as from OpenAI to a different provider, usually only requires updating the dependency and configuration, while the application code remains largely unchanged. While this article uses a Java SE application for simplicity, the same approach applies to Jakarta EE, Spring Boot, Quarkus, Helidon, Micronaut, and other Java platforms. Project Dependencies The first step is to create a Maven Quickstart project and add the required dependencies for CDI, Eclipse MicroProfile Config, and LangChain4j: XML <dependency> <groupId>io.smallrye.config</groupId> <artifactId>smallrye-config-core</artifactId> <version>3.17.2</version> <scope>compile</scope> </dependency> <dependency> <groupId>io.smallrye.config</groupId> <artifactId>smallrye-config</artifactId> <version>3.17.2</version> </dependency> <dependency> <groupId>org.jboss.weld.se</groupId> <artifactId>weld-se-core</artifactId> <version>6.0.4.Final</version> </dependency> <dependency> <groupId>dev.langchain4j.cdi</groupId> <artifactId>langchain4j-cdi-portable-ext</artifactId> <version>${langchain4j-cdi.version}</version> </dependency> <dependency> <groupId>dev.langchain4j.cdi.mp</groupId> <artifactId>langchain4j-cdi-config</artifactId> <version>${langchain4j-cdi.version}</version> </dependency> <dependency> <groupId>dev.langchain4j</groupId> <artifactId>langchain4j-open-ai</artifactId> <version>1.15.0</version> </dependency> This example uses the langchain4j-open-ai dependency, which serves as the provider-specific driver for communicating with OpenAI models. The application code remains independent of the provider implementation. Configuring the AI Provider LangChain4j integrates with Eclipse MicroProfile Config, allowing you to externalize all provider settings. Create a microprofile-config.properties file and add the following configuration: Properties files dev.langchain4j.cdi.plugin.chat-model.class=dev.langchain4j.model.openai.OpenAiChatModel dev.langchain4j.cdi.plugin.chat-model.config.api-key=<<API_KEY>> dev.langchain4j.cdi.plugin.chat-model.config.model-name=gpt-5 This configuration specifies the chat model implementation, the authentication API key, and the model that will process prompts. A key advantage of this approach is flexibility. If you choose another provider in the future, you typically only need to replace the provider dependency and update the configuration. The application code often remains unchanged, reinforcing the provider dependency’s role as similar to that of a JDBC driver in traditional data access. For this sample, you can place the API key directly in the configuration file or provide it through environment variables. In production, use environment variables, secret managers, or vault solutions. Never commit API keys to source control, as exposed credentials can lead to unauthorized use, unexpected costs, and security risks. Your First AI Service With the project configured, we can now build our first AI-powered service. As is customary in software development, we will begin with a “Hello World” example. Rather than printing a static message, we will send a question to an AI model and display its response. This example uses the simplest contract: a String as input and a String as output. Although real-world applications typically use more complex domain objects, starting with plain text helps us focus on the core LangChain4j programming model and understand how to create and use AI services. The first step is defining an AI service interface: Java import dev.langchain4j.cdi.spi.RegisterAIService; import jakarta.enterprise.context.ApplicationScoped; @RegisterAIService @ApplicationScoped public interface AssistantService { String chat(String prompt); } This interface does not include an implementation. LangChain4j generates the implementation automatically at runtime. The @RegisterAIService annotation directs LangChain4j to create an AI-backed implementation for this interface. The @ApplicationScoped annotation makes the generated implementation available as a CDI bean, which can be injected or accessed like any other Jakarta EE component. The method signature defines the AI contract. When the chat method is called, the parameter serves as the prompt for the AI model, and the returned value contains the generated response. In this example, both the request and response are simple strings. Next, we need a client application to consume this service: Java import jakarta.enterprise.context.control.RequestContextController; import jakarta.enterprise.inject.se.SeContainer; public class App { public static void main(String[] args) { try (SeContainer container = jakarta.enterprise.inject.se.SeContainerInitializer .newInstance() .initialize()) { RequestContextController requestContextController = container.select(RequestContextController.class).get(); requestContextController.activate(); AssistantService assistantService = container.select(AssistantService.class).get(); String response = assistantService.chat("What is the capital of France?"); System.out.println("Assistant response: " + response); requestContextController.deactivate(); } } } The application starts a CDI container using Weld SE, which provides dependency injection in a Java SE environment. After initializing the container, we activate the request context and obtain an instance of AssistantService from CDI. Although there is no concrete implementation in the codebase, CDI returns a fully functional service generated by LangChain4j. When the chat method is called, LangChain4j sends the prompt to the configured AI model, waits for the response, and converts the result into a Java String. Running the application produces an output similar to the following: Plain Text Assistant response: Paris is the capital of France. The exact wording may vary because large language models are probabilistic systems. Unlike traditional methods that always return the same result for a given input, AI models may produce slightly different responses while maintaining the same meaning. While using strings is useful for learning the fundamentals, enterprise applications rarely exchange raw text between layers. Business applications typically use structured data, domain objects, commands, and responses to ensure stronger contracts and better maintainability. In the next section, we will enhance this example by replacing raw strings with dedicated input and output classes, enabling LangChain4j to map between Java objects and AI interactions in a more type-safe and expressive manner. Working With Structured Input and Output The previous example showed a basic AI interaction: a string input produces a string output. While this illustrates the fundamentals, real-world applications rarely use unstructured text alone. Enterprise systems typically exchange well-defined objects that represent business concepts, making code more expressive, maintainable, and type-safe. LangChain4j’s key strength is its ability to map Java objects directly to AI interactions. It automatically converts structured input into prompts and transforms AI responses into strongly typed Java objects, eliminating the need for manual serialization and parsing. Developers can work with domain concepts instead of raw text. To demonstrate this, we will build a simple book recommendation engine. Given a book title and author, the AI will suggest three books that logically follow in a learning journey. We begin by defining the input object: Java public record BookRequest(String title, String author) { } This record captures the user’s input. Instead of manually creating a textual prompt, we provide a structured Java object with the book’s title and author. Next, we define the domain model representing a recommended book: Java import java.util.List; public record Book( String title, String author, String description, List<String> keywords) { } This record contains richer information than a simple title. This record includes more than just the title and author. It also provides a short description and a set of keywords to further characterize the recommendation, with the reason why the book was selected: Java public record Recommendation(Book book, String reason) { } Finally, we create a wrapper object that represents the complete response returned by the AI service: Java import java.util.List; public record NextReadBooks(List<Recommendation> recommendations) { } At this stage, we have a complete domain model for both the request and the expected response. Next, we define the AI service: Java import dev.langchain4j.cdi.spi.RegisterAIService; import dev.langchain4j.service.SystemMessage; import jakarta.enterprise.context.ApplicationScoped; @ApplicationScoped @RegisterAIService public interface NextReadBookService { @SystemMessage(""" Recommend up to 3 books that should naturally follow the provided book in a learning journey. Recommendations should prioritize: - conceptual progression - complementary knowledge - technical depth - thematic similarity For each recommendation provide: - title - author - concise description - relevant keywords - a short recommendation reason Keep recommendations concise, technically relevant, and focused on software engineering and architecture learning. """) NextReadBooks nextReadBooks(BookRequest bookRequest); } This example also introduces the concept of a system message. The @SystemMessage annotation provides instructions that guide the model’s behavior. Unlike user input, which varies with each request, the system message serves as a permanent set of rules for AI responses. Here, we instruct the model to recommend up to three books, explain each recommendation, and return the information using our defined Java records. The method signature uses only domain objects: BookRequest as input and NextReadBooks as output. There is no need for manual JSON handling, prompt creation, or response parsing, as LangChain4j manages these tasks automatically. The application code remains straightforward: Java import jakarta.enterprise.context.control.RequestContextController; import jakarta.enterprise.inject.se.SeContainer; public class BookApp { public static void main(String[] args) { try (SeContainer container = jakarta.enterprise.inject.se.SeContainerInitializer .newInstance() .initialize()) { RequestContextController requestContextController = container.select(RequestContextController.class).get(); requestContextController.activate(); var bookService = container.select(NextReadBookService.class).get(); BookRequest request = new BookRequest( "The Great Gatsby", "F. Scott Fitzgerald"); var recommendations = bookService.nextReadBooks(request); for (var recommendation : recommendations.recommendations()) { System.out.println( "Recommended book: " + recommendation.book().title() + " by " + recommendation.book().author()); System.out.println( "Reason: " + recommendation.reason()); } requestContextController.deactivate(); } } } When executed, LangChain4j converts the BookRequest into a prompt, sends it to the model, validates the response against the target structure, and maps the result back into NextReadBooks. For developers, this interaction is similar to calling a standard Java service. This approach offers clear advantages over raw string-based interactions. The code is easier to understand, IDE autocompletion enhances productivity, and refactoring is safer because inputs and outputs are explicit domain models. The application can also adapt more easily to new business requirements. So far, our examples have used explicit user requests and static system instructions. However, modern AI applications often need additional context beyond user input. In the next section, we will explore how to enrich AI interactions with external knowledge and context, enabling the model to produce more accurate and relevant responses aligned with the application’s domain.
Abstract This is a continuation of the first article in this series, Building a Spring AI Assistant with MCP Servers: A Step-by-Step Tutorial, and describes how one may address a serious concern when thinking of going from prototype to production — security. The Problem The MCP specification recommends that MCP servers using HTTP as their transport layer be secured with OAuth 2.0 access tokens. In practice, plenty of teams don't have the surrounding infrastructure — an authorization server, token introspection, and operational maturity — ready when they start exposing internal tools to an AI assistant. But the traffic still needs to be authenticated. This article walks through a simpler scheme that fits that gap: per-server API keys carried in a custom HTTP header. The MCP server only authorizes requests that present a valid key; the MCP client analyzes each outbound request at runtime and attaches the right header for the right destination. We'll use Spring AI 1.1.4, MCP Spring Security 0.1.5, and Spring Security on Java 25. The setup involves three applications: telecom-assistant – the AI host and MCP client (port 8080)invoice-mcp-server – exposes invoice tools, keeps API keys in PostgreSQL (port 8081)vendor-mcp-server – exposes vendor tools, keeps a single API key in memory (port 8082) Two servers, two different storage strategies, on purpose - to show both ends of the spectrum. Every MCP server has its own API key id and secret. The picture below sketches the flow and the requirements to accomplish that. To be able to follow along, switch to the 2-main branch of the designated GitHub repository. Upon resolving the TODOs in there, this goal will have been fulfilled. Securing the Vendor Service (In-Memory Keys) TODO 1. This is the simpler case. Start by adding the security dependencies in pom.xml: XML <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-security</artifactId> </dependency> <dependency> <groupId>org.springaicommunity</groupId> <artifactId>mcp-server-security</artifactId> <version>0.1.5</version> </dependency> TODO 2. Since the API keys are stored in memory in this case, they are declared in the application.properties file, still as environmental variables. Properties files api.key.id = ${API_KEY_ID} api.key.secret = ${API_KEY_SECRET} TODO 3. The main aspect regarding this enhancement is the security configuration. In this regard, the below @Configuration class is added. Java @EnableWebSecurity @Configuration public class SecurityConfig { @Value("${api.key.id}") private String apiKeyId; @Value("${api.key.secret}") private String apiKeySecret; @Bean ApiKeyEntityRepository<ApiKeyEntity> apiKeyRepository() { return new InMemoryApiKeyEntityRepository<>( List.of(ApiKeyEntityImpl.builder() .name("API key") .id(apiKeyId) .secret(apiKeySecret) .build())); } @Bean SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { return http.authorizeHttpRequests(auth -> auth.anyRequest().authenticated()) .with(McpApiKeyConfigurer.mcpServerApiKey(), apiKeyConfig -> apiKeyConfig.apiKeyRepository(apiKeyRepository()) .headerName("vendor-x-api-key")) .build(); } } A single ApiKeyEntity instance is constructed and stored as part of an InMemoryApiKeyEntityRepository. Then, when the SecurityFilterChain is built, a SecurityConfigurerAdapter is applied and an McpApiKeyConfigurer is used via which two concerns are addressed. On one hand, the expected security header name is set — vendor-x-api-key, while on the other, the repository that stores the server API key. At this point, the MCP server is secured. In order to be able to successfully communicate, an MCP client shall send HTTP requests that contain the required header that has the following form: Plain Text "vendor-x-api-key": [api-key-id].[api-key-secret] where api-key-id and api-key-secret are replaced with the values configured above. To test this functionality, the MCP Inspector [Resource 3] is used again, and additionally, before connecting to the running server, the authentication data is configured — vendor-x-api-key header is set to the known id.secret value. Securing the Invoice Service (Keys in PostgreSQL) Switching to the invoice-mcp-server, the enhancements here are a bit more complex as the API keys are stored in an external repository. TODO 4. Again the security dependencies are added in the pom.xml file, as before. TODO 5. As API keys are stored in the database, more exactly in the ServerApiKeys table, an mapping entity is created. Java @Table("ServerApiKeys") public class ServerApiKey { public static final String COL_SERVER = "Server"; public static final String COL_KEY_ID = "KeyId"; public static final String ON_CONFLICT_CLAUSE = String.format("(%s,%s)", COL_SERVER, COL_KEY_ID); @PkColumn("Id") private int id; @Column(COL_SERVER) private String server; @Column(COL_KEY_ID) private String keyId; @Column("KeySecret") private String keySecret; ... } As the Asentinel ORM library is already present in this module’s class-path, it is used to manage these entities; thus, the class is decorated with specific annotations. TODO 6. Just as previously done for the vendor server, the security configuration needs an ApiKeyEntityRepository. The approach here is more general, the interface is implemented, and the specific manner is suited. Java public class DbApiKeyEntityRepository implements ApiKeyEntityRepository<DbApiKeyEntityRepository.InvoiceApiKeyEntity> { private final OrmOperations orm; public DbApiKeyEntityRepository(OrmOperations orm) { this.orm = orm; } @Override public InvoiceApiKeyEntity findByKeyId(@NonNull String keyId) { return orm.newSqlBuilder(ServerApiKey.class) .select() .where() .column(ServerApiKey.COL_SERVER).eq("invoice-mcp").and() .column(ServerApiKey.COL_KEY_ID).eq(keyId) .execForOptional() .map(serverApiKey -> new InvoiceApiKeyEntity(keyId, serverApiKey.getKeySecret())) .orElse(null); } } As every record (API key) in the table is uniquely identified by Server and KeyId, whenever a request is received, the repository checks it and returns an implementation of the ApiKeyEntity interface, in our case Java public static final class InvoiceApiKeyEntity implements ApiKeyEntity { private final String id; @Nullable private String secret; private InvoiceApiKeyEntity(String id, @Nullable String secret) { this.id = id; this.secret = secret; } @Override public String getId() { return id; } @Override public @Nullable String getSecret() { return secret; } @Override public void eraseCredentials() { this.secret = null; } @Override public InvoiceApiKeyEntity copy() { return new InvoiceApiKeyEntity(id, secret); } } built from the database entity upon retrieval. It is a good practice to keep the secret of a ServerApiKey entity encoded in the database. In this tutorial, the default one — bcrypt — is used. To check the repository, the following simple integration test is used. Java @SpringBootTest @Transactional class DbApiKeyEntityRepositoryTest { private DbApiKeyEntityRepository apiKeyRepository; @Autowired private OrmOperations orm; private final PasswordEncoder passwordEncoder = PasswordEncoderFactories.createDelegatingPasswordEncoder(); @BeforeEach public void setUp() { apiKeyRepository = new DbApiKeyEntityRepository(orm); } @Test void provisionServerApiKey() { ServerApiKey serverApiKey = new ServerApiKey(); serverApiKey.setServer("invoice-mcp"); serverApiKey.setKeyId("api-key-id"); serverApiKey.setKeySecret(passwordEncoder.encode("api-key-secret")); orm.upsert(serverApiKey, PostgresJdbcFlavor.UPSERT_CONFLICT_PLACEHOLDER, ServerApiKey.ON_CONFLICT_CLAUSE); DbApiKeyEntityRepository.InvoiceApiKeyEntity apiKey = apiKeyRepository.findByKeyId(serverApiKey.getKeyId()); Assertions.assertNotNull(apiKey); Assertions.assertEquals(serverApiKey.getKeyId(), apiKey.getId()); Assertions.assertEquals(serverApiKey.getKeySecret(), apiKey.getSecret()); } } TODO 7. Once this is completed, the security configuration is set up, just as before. The only difference is that the previously created ApiKeyEntityRepository repository implementation is used and not an in-memory one this time. Java @Configuration @EnableWebSecurity public class SecurityConfig { private OrmOperations orm; @Autowired public void setOrm(OrmOperations orm) { this.orm = orm; } @Bean ApiKeyEntityRepository<DbApiKeyEntityRepository.InvoiceApiKeyEntity> apiKeyRepository() { return new DbApiKeyEntityRepository(orm); } @Bean SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception { return http.authorizeHttpRequests(auth -> auth.anyRequest().authenticated()) .with(McpApiKeyConfigurer.mcpServerApiKey(), apiKeyConfig -> apiKeyConfig.apiKeyRepository(apiKeyRepository()) .headerName("invoice-x-api-key")) .build(); } } At this point, the invoice-mcp server is secured as well, it can be checked with the MCP Inspector. Making the Client Send the Right Header to the Right Server Both servers are locked down. Now the client needs to know that requests to http://localhost:8081/mcp-invoice should carry the invoice-x-api-key header and requests to http://localhost:8082/mcp-vendor should carry vendor-x-api-key. A clean way to encode this is a chain of responsibility of resolvers. TODO 8. The expected API keys’ ids and secrets for the two servers are configured in the application.properties and for convenience, read from environment values. For simplicity, here both read from the same, although in a real implementation would not. Properties files mcp.server.api-key.parameters.invoice.id = ${API_KEY_ID} mcp.server.api-key.parameters.invoice.secret = ${API_KEY_SECRET} mcp.server.api-key.parameters.vendor.id = ${API_KEY_ID} mcp.server.api-key.parameters.vendor.secret = ${API_KEY_SECRET} Then, mapped into a @ConfigurationProperties annotated class. Java @ConfigurationProperties(McpServerApiKeyProperties.CONFIG_PREFIX) public class McpServerApiKeyProperties { public static final String CONFIG_PREFIX = "mcp.server.api-key"; private final Map<String, ApiKeyParams> parameters = new HashMap<>(); public Map<String, ApiKeyParams> getParameters() { return parameters; } public record ApiKeyParams(String id, String secret) {} } TODO 9. As there are two MCP servers involved, whenever the LLM instructs the AI host that it needs to query one of them, a destination-resolving strategy applied at runtime is introduced. It’s implemented as a chain of MCP server resolvers. The common interface of this chain of responsibility is McpServerResolver. It’s generic and declares two methods as part of the contract it proposes. Java public interface McpServerResolver<T> { Optional<T> resolve(URI uri); default String id() { return getClass().getSimpleName(); } } The central method that each implementer shall define receives the destination uri and attempts to resolve one of the available servers. If successful, the result is further used. The second optional method has the particular scope of identifying the current resolver; it has a default implementation and might help down the line during the resolving process (logging, etc.). As the items in the chain here have a similar approach, the next part is an abstract common implementation of the above interface. Java abstract class AbstractMcpServerResolver<T> implements McpServerResolver<T> { private static final Logger log = LoggerFactory.getLogger(AbstractMcpServerResolver.class); private final McpServerResolver<T> next; protected AbstractMcpServerResolver(McpServerResolver<T> next) { this.next = next; } @Override public Optional<T> resolve(URI uri) { if (uri == null) { return Optional.empty(); } log.debug("[{}]: Checking request towards {}.", id(), uri); Optional<T> result = resolveSpecific(uri); if (result.isPresent()) { log.debug("[{}]: Resolved target endpoint {}.", id(), uri); return result; } if (next == null) { log.debug("[{}]: No next resolver configured.", id()); return Optional.empty(); } log.debug("[{}]: Target endpoint {} not resolved. Delegating to [{}].", id(), uri, next.id()); return next.resolve(uri); } protected abstract Optional<T> resolveSpecific(URI endpoint); } The particular action is to be defined by each link in the chain as part of the Optional resolveSpecific(URI endpoint) method. Here, the functionality is similar; thus, the next common implementation is enough. Java public class UrlMcpServerResolver extends AbstractMcpServerResolver<ApiKeyHeader> { private static final Logger log = LoggerFactory.getLogger(UrlMcpServerResolver.class); private final URI serverUri; private final ApiKeyHeader header; public UrlMcpServerResolver(McpServerResolver<ApiKeyHeader> nextResolver, String serverUrl, ApiKeyHeader header) { super(nextResolver); this.serverUri = URI.create(serverUrl); this.header = header; } @Override protected Optional<ApiKeyHeader> resolveSpecific(URI endpoint) { if (serverUri.equals(endpoint)) { log.debug("[{}]: Target endpoint {} and config URL {} match.", id(), endpoint, serverUri); return Optional.of(header); } log.debug("[{}]: Target endpoint {} and config URL {} don't match.", id(), endpoint, serverUri); return Optional.empty(); } } TODO 10. The above resolveSpecific() method decides whether the current request is towards a particular server. If successful, an ApiKeyHeader object is returned so that it can be further used. Java public record ApiKeyHeader(String name, String value) {} TODO 11. The last step is the security @Configuration class that glues together the above-created pieces. Java @Configuration @EnableConfigurationProperties({McpServerApiKeyProperties.class}) public class SecurityConfig { private static final Logger log = LoggerFactory.getLogger(SecurityConfig.class); public McpStreamableHttpClientProperties mcpClientProps; public McpServerApiKeyProperties mcpServerApiKeys; @Autowired public void setMcpClientProps(McpStreamableHttpClientProperties mcpClientProps) { this.mcpClientProps = mcpClientProps; } @Autowired public void setMcpServerApiKeys(McpServerApiKeyProperties mcpServerApiKeys) { this.mcpServerApiKeys = mcpServerApiKeys; } @Bean ApiKeyHeader invoiceApiKeyHeader() { var apiKey = mcpServerApiKeys.getParameters().get("invoice"); return new ApiKeyHeader("invoice-x-api-key", String.format("%s.%s", apiKey.id(), apiKey.secret())); } @Bean ApiKeyHeader vendorApiKeyHeader() { var apiKey = mcpServerApiKeys.getParameters().get("vendor"); return new ApiKeyHeader("vendor-x-api-key", String.format("%s.%s", apiKey.id(), apiKey.secret())); } @Bean McpServerResolver<ApiKeyHeader> serverResolver() { var mcpProps = mcpClientProps.getConnections(); var mcpInvoice = mcpProps.get("invoice"); var mcpVendor = mcpProps.get("vendor"); return new VendorMcpServerResolver(new InvoiceMcpServerResolver(null, String.format("%s%s", mcpInvoice.url(), mcpInvoice.endpoint()), invoiceApiKeyHeader()), String.format("%s%s", mcpVendor.url(), mcpVendor.endpoint()), vendorApiKeyHeader()); } @Bean McpSyncHttpClientRequestCustomizer requestCustomizer() { return (builder, method, endpoint, body, context) -> { log.info("MCP Client request: method={}, endpoint={}, body={}", method, endpoint, body); serverResolver() .resolve(endpoint) .ifPresent(apiKeyHeader -> builder.header(apiKeyHeader.name(), apiKeyHeader.value())); }; } } The McpServerResolver returned by serverResolver() is used by the McpSyncHttpClientRequestCustomizer to analyze the request and add the necessary security header. Watching it Work Upon reaching this point, the MCP servers are restarted, together with the telecom-assistant. If the prompt in the screenshot below is issued, obviously, the invoice server should be queried, and the response should be received accordingly. In the AI host logs, the server resolving process can be depicted. Plain Text DEBUG i.m.c.t.HttpClientStreamableHttpTransport - Sending message JSONRPCRequest[jsonrpc=2.0, method=tools/call, id=2902fe67-2, params=CallToolRequest[name=get-invoices-by-pattern-on-number, arguments={pattern=vdf}, meta={}]] INFO c.h.t.config.SecurityConfig - MCP Client request: method=POST, endpoint=http://localhost:8081/mcp-invoice, body={"jsonrpc":"2.0","method":"tools/call","id":"2902fe67-2","params":{"name":"get-invoices-by-pattern-on-number","arguments":{"pattern":"vdf"},"_meta":{}} DEBUG c.h.t.c.r.AbstractMcpServerResolver - [VendorMcpServerResolver]: Checking request towards http://localhost:8081/mcp-invoice. DEBUG c.h.t.c.r.UrlMcpServerResolver - [VendorMcpServerResolver]: Target endpoint http://localhost:8081/mcp-invoice and config URL http://localhost:8082/mcp-vendor don't match. DEBUG c.h.t.c.r.AbstractMcpServerResolver - [VendorMcpServerResolver]: Target endpoint http://localhost:8081/mcp-invoice not resolved. Delegating to [InvoiceMcpServerResolver]. DEBUG c.h.t.c.r.AbstractMcpServerResolver - [InvoiceMcpServerResolver]: Checking request towards http://localhost:8081/mcp-invoice. DEBUG c.h.t.c.r.UrlMcpServerResolver - [InvoiceMcpServerResolver]: Target endpoint http://localhost:8081/mcp-invoice and config URL http://localhost:8081/mcp-invoice match. DEBUG c.h.t.c.r.AbstractMcpServerResolver - [InvoiceMcpServerResolver]: Resolved target endpoint http://localhost:8081/mcp-invoice. DEBUG i.m.c.t.HttpClientStreamableHttpTransport - Received SSE stream response, using line subscriber DEBUG i.m.spec.McpSchema - Received JSON message: {"jsonrpc":"2.0","id":"2902fe67-2","result":{"content":[{"type":"text","text":"[{\"id\":8,\"number\":\"vdf-tf-rev-1\",\"date\":\"2025-05-20\",\"vendor\":{\"id\":2,\"name\":\"Vodafone\"},\"serviceType\":{\"id\":3,\"name\":\"TollFree\"},\"status\":\"UNDER_REVIEW\",\"total\":10.44},{\"id\":7,\"number\":\"vdf-mpls-app-1\",\"date\":\"2025-05-10\",\"vendor\":{\"id\":2,\"name\":\"Vodafone\"},\"serviceType\":{\"id\":4,\"name\":\"MPLS\"},\"status\":\"APPROVED\",\"total\":80.44},{\"id\":6,\"number\":\"vdf-lo-paid-1\",\"date\":\"2025-06-10\",\"vendor\":{\"id\":2,\"name\":\"Vodafone\"},\"serviceType\":{\"id\":5,\"name\":\"Local\"},\"status\":\"PAID\",\"total\":85.44}]"}],"isError":false} The invoice server validates the invoice-x-api-key header against its database-backed repository, and the tool call proceeds. Final Notes API key authentication is a pragmatic stepping stone, not an end state. In a production environment, OAuth 2.0 remains the recommended approach, and Spring Security supports both. Even with API keys in play, take the basics seriously: store secrets encoded (bcrypt or stronger), rotate them, use a distinct key per server, and combine with TLS so the headers aren't visible in transit. The resolver-chain pattern on the client side gives you a natural place to add more rules later — token-fetch logic for OAuth, region-based routing, anything URI-shaped — without touching the rest of the AI host. The next (also the last) article in this series concludes the tutorial by instrumenting the chat client with advisors for memory, token tracking, and logging, and ultimately formulating several takeaways. Resources [1] – The source code for the Spring AI Telecom Assistant [2] – asentinel-orm project [3] – MCP Inspector
Tuhin Chattopadhyay
AI Decision Intelligence Scholar-Practitioner | Founder, Tuhin AI Advisory | Professor & Area Chair, AI & Analytics,
JAGSoM
Frederic Jacquet
Technology Evangelist,
AI[4]Human-Nexus
Pratik Prakash
Principal Solution Architect,
Capital One