Cloud architecture refers to how technologies and components are built in a cloud environment. A cloud environment comprises a network of servers that are located in various places globally, and each serves a specific purpose. With the growth of cloud computing and cloud-native development, modern development practices are constantly changing to adapt to this rapid evolution. This Zone offers the latest information on cloud architecture, covering topics such as builds and deployments to cloud-native environments, Kubernetes practices, cloud databases, hybrid and multi-cloud environments, cloud computing, and more!
Infrastructure efficiency is rapidly becoming one of the most important factors determining profitability for cloud providers, managed service providers, and SaaS companies. For years, infrastructure growth followed a simple formula: add more servers, more storage, and more capacity whenever demand increased. That model worked when hardware prices consistently declined, and inefficiencies could be absorbed through growth. Those conditions no longer exist. Today, providers face rising costs for memory, enterprise SSDs, GPUs, power, cooling, and colocation, while customers continue to expect lower pricing, better performance, stronger SLAs, and faster service delivery.

Several industry shifts have fundamentally changed infrastructure economics. Changes in virtualization licensing models have increased costs for many organizations. AI adoption has driven demand for GPUs, high-capacity memory, and high-performance storage. Power and colocation costs continue to rise globally, while sovereign cloud initiatives are creating demand for regional infrastructure that must compete economically with hyperscale cloud providers. The challenge is clear: infrastructure costs are rising faster than revenue.

What Does a Workload Really Cost?

Infrastructure efficiency ultimately comes down to a simple question: what does it cost to deliver a workload? Customers do not buy servers, storage systems, or software licenses. They buy virtual machines, Kubernetes clusters, databases, AI environments, SaaS applications, and business services. The true cost of delivering those workloads includes much more than infrastructure hardware:

- Software licensing
- Power and cooling
- Colocation
- Network connectivity
- Storage
- Capacity buffers
- Staffing and operations
- Support and SLA commitments

The providers that achieve the lowest cost per workload while maintaining performance and service quality gain a significant competitive advantage.
As infrastructure costs continue to increase, "cost per workload delivered" is becoming a useful framework for evaluating efficiency. Unlike traditional metrics focused solely on hardware utilization or licensing costs, this approach considers the complete economics of delivering customer-facing services.

Beyond Infrastructure Utilization

Infrastructure efficiency is not measured only by CPU, memory, or storage utilization. Operational metrics often have an equally significant impact on the cost of delivering workloads. Examples include administrator-to-server ratio, administrator-to-VM ratio, workload deployment times, incident resolution times, and the number of infrastructure platforms that must be maintained.

Cost alone is also a misleading metric. A workload delivered at lower cost may also deliver lower performance, higher contention, or slower support response times. A virtual machine with two vCPUs does not necessarily provide the same amount of usable compute across platforms. CPU oversubscription ratios, noisy-neighbor effects, storage latency, network performance, and support commitments all influence the actual customer experience. The relevant metric is not simply cost per workload, but cost per workload delivered at a defined SLA.

Architectural Choices and Efficiency

Infrastructure architecture plays a major role in determining workload economics. Traditional infrastructure environments often combine separate virtualization, storage, networking, monitoring, backup, and orchestration platforms. While this approach offers flexibility, it can also increase operational complexity, encourage overprovisioning, and create management overhead. As a result, many organizations are moving toward more integrated infrastructure models, including hyperconverged infrastructure (HCI) and software-defined platforms that consolidate multiple functions into a unified operational framework. The goal is not merely consolidation.
The real objective is to reduce operational overhead, improve resource utilization, simplify scaling, and lower long-term total cost of ownership. This becomes particularly important for sovereign cloud initiatives. Unlike hyperscalers that benefit from massive global scale, regional cloud providers often need to achieve competitive economics within a specific country or market while maintaining local data residency, compliance, and operational control. In these environments, maximizing infrastructure efficiency is often critical to long-term profitability.

Infrastructure Efficiency Metrics Worth Tracking

Organizations evaluating infrastructure efficiency should look beyond traditional utilization metrics and monitor indicators that directly affect workload economics, including:

- Cost per virtual machine
- Cost per container
- Cost per Kubernetes cluster
- Cost per AI workload
- Storage efficiency ratios
- Power consumption per workload
- Administrator-to-server ratio
- Workload deployment times
- Mean time to resolution (MTTR)
- Resource utilization across compute and storage environments

These metrics provide a more accurate view of infrastructure performance than hardware utilization alone.

Why AI Changes the Equation

The emergence of AI workloads has made infrastructure efficiency even more important. GPU resources are expensive, but GPUs alone do not determine the economics of AI infrastructure. Storage performance, networking efficiency, workload orchestration, and operational processes all directly impact GPU utilization and overall service profitability. In many environments, the challenge is no longer acquiring GPUs. It is ensuring that the surrounding infrastructure can keep them fully utilized. As GPU, storage, and power costs continue to rise, organizations are increasingly focused on maximizing the value extracted from every infrastructure resource.
AI infrastructure economics are becoming less about acquiring the largest amount of hardware and more about achieving the highest utilization and operational efficiency from existing investments.

Measuring Infrastructure Economics

One of the challenges with infrastructure efficiency is that it often remains invisible until it is measured. Many organizations focus on software licensing when evaluating infrastructure costs, but licensing is only one part of the equation. Utilization rates, storage efficiency, operational overhead, power consumption, hardware refresh cycles, staffing requirements, and SLA commitments often have a much greater impact on long-term economics. This is why Total Cost of Ownership (TCO) modeling is becoming increasingly important. Effective infrastructure evaluations should account for:

- Software costs
- Hardware acquisition
- Energy consumption
- Colocation expenses
- Storage efficiency
- Staffing requirements
- Operational complexity
- Support and maintenance costs

Organizations that perform these broader analyses often discover that the greatest opportunities for savings come not from individual licensing decisions but from improving overall workload economics.

Conclusion

The next phase of cloud infrastructure optimization is unlikely to be driven by capacity growth alone. As infrastructure costs continue to rise and customer expectations continue to increase, providers must focus on delivering more workloads with fewer resources while maintaining performance and service quality. In that environment, infrastructure efficiency becomes more than a technical objective. It becomes a business metric. The organizations that can achieve the lowest cost per workload delivered at a defined service level will be best positioned to protect margins, remain competitive, and build sustainable cloud and AI services for the future.
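As a rough illustration of the "cost per workload delivered at a defined SLA" framing, the sketch below adds up monthly cost categories and divides by the number of workloads that actually met their service level. The `MonthlyCosts` fields, the function name, and every figure are hypothetical placeholders, not data from any provider. A real TCO model would also need refresh cycles, capacity buffers, and a precise definition of "meeting the SLA".

```python
from dataclasses import dataclass, fields


@dataclass
class MonthlyCosts:
    """Hypothetical monthly cost categories, all in the same currency unit."""
    software_licensing: float
    hardware_amortization: float
    power_and_cooling: float
    colocation: float
    network: float
    staffing: float
    support: float

    def total(self) -> float:
        # Sum every declared cost category.
        return sum(getattr(self, f.name) for f in fields(self))


def cost_per_workload(costs, workloads_total, workloads_meeting_sla):
    """Return (naive cost per workload, cost per workload delivered at SLA)."""
    if workloads_total <= 0 or workloads_meeting_sla <= 0:
        raise ValueError("workload counts must be positive")
    total = costs.total()
    return total / workloads_total, total / workloads_meeting_sla


# Made-up example: 1,000 workloads hosted, 800 of them within their SLA.
costs = MonthlyCosts(
    software_licensing=20_000,
    hardware_amortization=30_000,
    power_and_cooling=12_000,
    colocation=8_000,
    network=5_000,
    staffing=20_000,
    support=5_000,
)
naive, at_sla = cost_per_workload(costs, workloads_total=1_000, workloads_meeting_sla=800)
print(f"naive={naive:.2f} at_sla={at_sla:.2f}")
```

With these invented numbers the naive figure understates the cost of each workload that was delivered to spec, which is the gap the SLA-qualified metric is meant to expose.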
Every time you open your banking app, send a private message, or log into your company's systems, a math problem is standing between your data and the rest of the world. A very specific kind of math problem, one that takes thousands of years to solve, even for the fastest computers we have today. Here is the uncomfortable truth: quantum computers are coming. And when they arrive, that math problem gets solved in hours. The lock breaks. Everything behind it becomes readable. The good news? There is already a replacement. It is called lattice cryptography, it is already available, and the window to start adopting it is open right now. Whether you act on that window is the real question. Why "Hard Math" Is the Entire Foundation of Internet Security Most encryption used today, including the RSA algorithm that secures the majority of HTTPS connections, e-commerce transactions, and enterprise systems, rests on a single idea: it is very easy to multiply two large prime numbers together, but extraordinarily hard to reverse that process. Multiply 7 and 3, and you get 21 instantly. But hand someone the number 21 and ask them to find the two prime factors without any hints, and the process gets harder. Scale that problem up to a 600-digit number, and even the most powerful supercomputers on Earth would need thousands of years to crack it. That difficulty is what makes your data safe. For now. A quantum computer with 4000 stable qubits could run Shor's algorithm to factor large integers, breaking RSA-2048 in a matter of hours. - NIST IR 8105, "Report on Post-Quantum Cryptography," 2016 Shor's algorithm, discovered in 1994, is essentially a quantum shortcut through that math. Once quantum hardware catches up to the algorithm's requirements, RSA and similar schemes collapse. Not weaken. Collapse. 
The Threat You Cannot See Yet: Harvest Now, Decrypt Later Here is what makes this threat different from most others in cybersecurity: you do not need to wait for quantum computers to exist before they can hurt you. Sophisticated adversaries, including nation-state actors, are already collecting encrypted data today. They store it. They wait. When sufficiently powerful quantum systems arrive, they decrypt everything in that archive. Health records, financial data, intellectual property, classified communications from years ago - all of it becomes accessible retroactively. Adversaries may be stealing encrypted data now with the intent to decrypt it later when quantum computing capabilities mature. This 'harvest now, decrypt later' strategy is a real and present danger. — CISA, NSA, NIST Joint Advisory: "Quantum-Readiness: Migration to Post-Quantum Cryptography" 2023 If your systems handle data that must remain confidential for more than five to ten years, that window is already a concern. Medical records. Legal documents. Financial histories. Long-term contracts. Any of these could be sitting in an adversary's archive right now, waiting. Lattice Cryptography: A Math Problem Even Quantum Computers Cannot Shortcut The replacement is built on a completely different class of hard math problem. One where quantum computers have no known shortcut. Picture a chess knight on an infinite board. In standard chess, a knight moves in a fixed pattern: two squares in one direction, one in another. If you know the move pattern, you can easily reach any target square by combining moves. That is basic, predictable cryptography. Now imagine the board has a thousand dimensions instead of two. The target point does not land exactly on any reachable square. You can only get close, never exact. And every attempt to get closer involves navigating a space so vast that trying every possible combination of moves would take longer than the age of the universe. 
That is the core idea behind lattice cryptography, and more specifically, a problem called Learning With Errors (LWE).

The Learning With Errors problem asks to find a secret vector given a set of approximate linear equations over a finite field. The hardness of LWE is based on the worst-case hardness of standard lattice problems, which are believed to be resistant to quantum attacks.
- Oded Regev, "On Lattices, Learning with Errors, Random Linear Codes, and Cryptography," Journal of the ACM, 2009

The "noise" Regev introduces into the problem is the key. Without it, solving the system of equations would be straightforward. With it, even a quantum computer exploring multiple solution paths simultaneously hits a wall. There is no elegant shortcut. Just brute force, across a space too large to brute force.

NIST Has Already Done the Hard Work

The U.S. National Institute of Standards and Technology ran an open global competition for nearly a decade, inviting cryptographers worldwide to submit quantum-resistant algorithms. In 2024, three algorithms were standardized.

NIST has finalized its principal set of encryption algorithms designed to withstand cyberattacks from a quantum computer. These post-quantum cryptography (PQC) standards are ready for immediate use.
- NIST, "NIST Releases First 3 Finalized Post-Quantum Encryption Standards," August 2024

The three standards are CRYSTALS-Kyber (now called ML-KEM, FIPS 203) for key encapsulation, CRYSTALS-Dilithium (ML-DSA, FIPS 204) for digital signatures, and SPHINCS+ (SLH-DSA, FIPS 205) as a second signature scheme. ML-KEM and ML-DSA are lattice-based. SLH-DSA is built on hash functions instead, so it does not depend on lattice assumptions. All three are publicly available, have open-source implementations, and are deployable today on existing hardware. You do not need quantum computers to run quantum-safe encryption. That is a critical point. The algorithms run on the same servers and devices you already have.

What "Crypto Agility" Actually Means in Practice

For software architects and engineering leaders, the challenge is not just adopting new algorithms. It is building systems that can swap algorithms without a full architectural overhaul.
The concept is called crypto agility. Think of it as designing your cryptographic layer the same way you would design a database abstraction layer: the rest of your system should not care which specific algorithm is running underneath. When a vulnerability surfaces, or when standards evolve, you should be able to change the algorithm with minimal blast radius. Getting there requires a structured approach. It starts with discovery: building a complete inventory, sometimes called a Cryptographic Bill of Materials (CBOM), of every place in your environment where cryptography is in use. That includes custom implementations, third-party libraries, hardware security modules, APIs, certificates, and protocols. Many organizations discover they have hundreds of instances they were not tracking. From that inventory, you triage by sensitivity. Data with long confidentiality requirements gets migrated first. Then you remediate, test, and build the feedback loop that lets you keep the CBOM current as your systems evolve. Organizations that do not understand their current cryptographic deployments will be unable to prioritize or execute a successful migration to post-quantum cryptography. - NIST SP 1800-38B, "Migration to Post-Quantum Cryptography," 2023 (Draft) This is not a one-time project. It is an ongoing capability. The organizations that will handle the next generation of cryptographic transitions well are the ones building that capability now, not the ones scrambling to respond when a deadline arrives. The Clock Is Running, But the Path Is Clear Estimates on when quantum computers will be capable enough to break RSA at production scale vary. Some researchers say a decade. Some say sooner. Nobody says never. We assess that a cryptographically relevant quantum computer could be built within the next decade, with nation-state actors most likely to be first. 
- Global Risk Institute, "2023 Quantum Threat Timeline Report," Michele Mosca and Marco Piani What is not in dispute is that the migration itself takes time. Updating cryptographic infrastructure across large organizations, particularly those running complex legacy systems or regulated environments, is measured in years, not weeks. The organizations that start now will be ready when the capability arrives. The ones that wait will be in the worst possible position: racing to retrofit under pressure. The math has already changed. The only remaining variable is whether your architecture changes with it.
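To make the Learning With Errors idea from earlier in this article concrete, here is a toy, Regev-style single-bit encryption sketch. The parameters `Q`, `N`, and `M` are assumptions chosen only so the arithmetic is easy to follow. They are far too small to be secure, Python's `random` module is not a cryptographic generator, and standardized schemes such as ML-KEM use structured lattices and different encodings. The point is only to show where the noise enters and why the key holder can still decrypt.

```python
import random

Q = 3329   # small prime modulus (toy choice)
N = 8      # secret dimension (real schemes use hundreds)
M = 16     # number of noisy equations in the public key


def keygen(rng):
    s = [rng.randrange(Q) for _ in range(N)]                      # secret vector
    A = [[rng.randrange(Q) for _ in range(N)] for _ in range(M)]  # public matrix
    e = [rng.choice((-1, 0, 1)) for _ in range(M)]                # small noise
    # Each public value is an *approximate* linear equation in s.
    b = [(sum(a * x for a, x in zip(row, s)) + err) % Q for row, err in zip(A, e)]
    return s, (A, b)


def encrypt(pub, bit, rng):
    A, b = pub
    picks = [i for i in range(M) if rng.random() < 0.5]  # random subset of equations
    u = [sum(A[i][j] for i in picks) % Q for j in range(N)]
    v = (sum(b[i] for i in picks) + bit * (Q // 2)) % Q  # hide the bit near 0 or Q/2
    return u, v


def decrypt(s, ct):
    u, v = ct
    # Subtracting <u, s> leaves only accumulated noise plus bit * (Q // 2).
    d = (v - sum(a * x for a, x in zip(u, s))) % Q
    return 1 if Q // 4 < d < 3 * Q // 4 else 0


rng = random.Random(0)
secret, public = keygen(rng)
ciphertext = encrypt(public, 1, rng)
print(decrypt(secret, ciphertext))
```

Decryption works because the accumulated noise (at most 16 in absolute value here) is much smaller than `Q // 4`. Without the secret, an attacker sees only noisy equations, which is the problem the article describes as having no known quantum shortcut.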
Bug triage on a graphics engineering team is one of those tasks nobody really wants to own. A new crash report comes in, and somebody has to work out whether it looks like a known issue, what the stack trace points at, which subsystem the affected code lives in, and which sub-team should pick it up. The answers exist in the issue tracker, the source repo, and the architecture docs, but pulling them together by hand takes time. And the engineers best at it are the ones you least want spending hours on it.

On our team, the archive of resolved bugs had grown to over 1,100 issues. That is a real corpus. It contains the answer to a lot of incoming questions, but only if you can find the right three or four entries quickly. The agent described here does that lookup automatically, combines it with crash log parsing and source code search, and produces a root cause analysis with a confidence score. Triage that used to take hours now takes minutes.

This article is about the architecture choices: why AWS Bedrock with Claude, why OpenSearch with HNSW indexing, why DynamoDB for workflow state, and why ECS Fargate. None of these choices is unique. The reasoning behind them is what's portable.

What the Agent Actually Has to Do

Before the architecture, it's worth being concrete about the work. When a bug report arrives, the agent produces an analysis built on five signals:

- Historical pattern match against the knowledge base of resolved issues.
- Source code match against the repositories the trace points into.
- Crash stack analysis on the trace itself.
- Log evidence from whatever logs were attached or linkable.
- Fix ownership, derived from who has historically fixed bugs in the affected components.

Each signal contributes to a final confidence score. The combination matters because no single signal is reliable on its own. A stack trace can match a bug that was fixed three releases ago, a source-code hit can be unrelated, and ownership data can be stale.
A useful triage answer leans on multiple signals together. That is the work. The architecture exists to support it reliably, repeatedly, and without baking in assumptions that will hurt later. Why RAG, and Why These Pieces The obvious wrong move is to skip retrieval and pass the whole corpus to the model. Context windows aren't the bottleneck people think they are. Even when they're large, signal-to-noise gets bad fast, and cost and latency scale with input size. For any given bug, the relevant slice is small: a few prior tickets, a couple of source files, maybe one architecture doc. Retrieval-augmented generation (RAG) is the right shape because the retrieval layer's job is precisely to find that slice. OpenSearch With HNSW Indexing The knowledge base lives in OpenSearch with vector search over a k-NN HNSW index. HNSW (Hierarchical Navigable Small World) suits corpora in the low thousands to low millions of documents. Query time stays low, and recall stays high without the tuning effort IVF-based indexes demand at smaller scales. OpenSearch was chosen over a dedicated vector database for operational reasons. It runs in the same AWS environment as the rest of the stack, supports keyword and vector search in the same index when you need hybrid retrieval, and doesn't add a new vendor to the diagram. For a team-internal tool, the integration cost of a separate vector DB outweighs the marginal performance gain. Titan Embeddings Embeddings are generated with Amazon Titan. The main reason: the data (bug reports, stack traces, code snippets) never has to leave AWS. That removes a class of compliance questions that come up the moment you start sending source code or internal tickets to an external embedding API. Titan handles technical text well enough for this corpus, and it shares IAM, quotas, and billing with everything else. Claude on Bedrock as the Reasoning Model The reasoning step takes the retrieved context and the parsed crash log and produces the actual analysis. 
It runs on Claude through Bedrock. Two properties matter here. First, Claude handles long, messy, structured input well: stack traces aren't clean prose, and the surrounding context is a mix of code, logs, and ticket descriptions. Second, it expresses uncertainty rather than picking a confident-sounding wrong answer. For a system whose output a human engineer is going to read and either trust or push back on, that calibration matters more than fluency.

The Five-Signal Confidence Score

The most consequential part of the system isn't the model call. It's the scoring layer that wraps it. The agent doesn't just say "this looks like a duplicate of bug X." It produces a confidence score, and that score is what triagers use to decide whether to accept the suggestion or dig in themselves.

The score is a weighted combination of the five signals listed earlier. Each contributes a sub-score; the weights reflect how predictive each signal has been, in this team's experience, of a correct triage outcome.

The interesting design choice is that the weights are not static. Real bug reports don't always include all five signals. Some arrive without attached logs. Some point at code with no clear ownership history. With static weights, missing signals would drag the final score down even when the available signals were strongly aligned. The agent redistributes the weight of any unavailable signal across the available ones, normalized to sum to one. The conceptual shape:

```python
# Conceptual sketch of dynamic weight adjustment.
# w1..w5 are placeholders for the tuned base weights, not defined here.
BASE_WEIGHTS = {
    "historical_match": w1,
    "source_code_match": w2,
    "crash_stack": w3,
    "log_evidence": w4,
    "fix_ownership": w5,
}


def adjusted_weights(available_signals):
    # Keep only the signals present for this report.
    active = {k: v for k, v in BASE_WEIGHTS.items() if k in available_signals}
    total = sum(active.values())
    if total == 0:
        return {}  # no usable signals: avoid dividing by zero
    # Renormalize so the remaining weights sum to one.
    return {k: v / total for k, v in active.items()}
```

This is a small piece of code that does a disproportionate amount of the work of making the agent's output trustworthy.
A given confidence score should mean roughly the same thing whether the bug arrived with logs or without. DynamoDB for Workflow State A triage run is not a single API call. The agent parses the report, retrieves embeddings, runs vector search, fetches matched documents, pulls source code context, calls the reasoning model, computes the score, and writes results back. Each step can fail or be slow independently. Workflow state for each in-flight triage lives in DynamoDB. The schema is intentionally simple: a triage ID as the partition key, a status field, and the accumulated context. Two reasons it's external rather than in-process memory. First, recovery. If the model call fails or times out, the workflow should resume without redoing the embedding and retrieval work. Token costs add up otherwise. Second, observability. The Flask dashboard the team uses to monitor triage operations reads from this same DynamoDB table. That includes real-time status, filterable history, analytics, and the routing view for issues that don't belong to this team. There is no separate event log to maintain. Workflow state is the source of truth, and the dashboard is a view onto it. ECS Fargate for Orchestration The triage workflow runs on ECS Fargate. The choice is shaped by what the workflow looks like: a sequence of calls to external services (Bedrock, OpenSearch, the issue tracker), with the long pole being model latency. Not CPU-heavy, not bursty. Incoming bugs arrive at a steady rate. Fargate handles this shape cleanly. No cold start, no execution time limit, and the operational model is straightforward: container in, container out, IAM and networking inherited from the cluster. The Flask dashboard runs in the same Fargate cluster, sharing the same VPC and observability tooling. The general pattern: short, stateless, bursty work fits Lambda. Orchestrated workflows with slower external calls and a need for predictable behavior fit Fargate. 
For a team-internal agent that runs continuously, Fargate's properties matter more than its slightly higher baseline cost.

Keeping the Knowledge Base Current

None of this works if the corpus goes stale. The ingestion pipeline syncs three sources continuously: the issue tracker, where newly resolved bugs become new entries; the documentation repo; and the source code repositories, which provide both file content and ownership signal. The pipeline is fully automated. New content is chunked, embedded with Titan, and indexed in OpenSearch without manual intervention. Ingestion is decoupled from query. They share the index but nothing else, so a slow ingestion run never affects live triage latency, and a problematic batch can be rolled back without touching the query path.

What's Worth Taking From This

The model layer (Bedrock, Claude, Titan) is interchangeable. Swap them for OpenAI plus their embeddings, or for a self-hosted setup, and the architecture still works. What is not interchangeable, or not easily, is the shape of the rest:

- Retrieval before reasoning. Don't ask the model to do retrieval against a large corpus. Get the relevant slice with a dedicated retrieval layer, then hand it over with a tight prompt.
- Multiple signals with dynamic weights. Single-signal confidence scores break under real-world data. Multiple signals with weight redistribution handle the cases where inputs are incomplete.
- Persist workflow state externally. Even for short workflows, having state in a queryable store pays off in failure recovery and gives the dashboard a single source of truth.
- Decouple ingestion from query. They have different reliability requirements and should be able to fail independently.
- Match compute to workload shape. Fargate for orchestrated, latency-tolerant workflows. The wrong choice here shows up later as cold starts, timeouts, or surprise bills.

The agent has been doing useful work since it shipped.
The thing that took the longest to get right wasn't any single component. It was the scoring layer and the decision to make state external. Those are the parts that determine whether a system like this is something the team relies on or something the team works around.
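To see why weight redistribution matters for the scoring layer, the following sketch compares a static weighted score with a redistributed one for a report that arrived without logs or ownership history. The base weights and sub-scores are invented for illustration, since the article does not publish the real values. The helper names `confidence` and `static_confidence` are likewise hypothetical.

```python
# Hypothetical base weights; the team's actual tuned values are not published.
BASE_WEIGHTS = {
    "historical_match": 0.30,
    "source_code_match": 0.20,
    "crash_stack": 0.25,
    "log_evidence": 0.15,
    "fix_ownership": 0.10,
}


def adjusted_weights(available_signals):
    """Redistribute weight across the signals that are actually present."""
    active = {k: w for k, w in BASE_WEIGHTS.items() if k in available_signals}
    total = sum(active.values())
    if total == 0:
        return {}
    return {k: w / total for k, w in active.items()}


def confidence(sub_scores):
    """sub_scores maps signal name to a score in [0, 1]; absent signals are omitted."""
    weights = adjusted_weights(sub_scores.keys())
    return sum(weights[k] * sub_scores[k] for k in weights)


def static_confidence(sub_scores):
    """Same inputs, but missing signals silently count as zero."""
    return sum(w * sub_scores.get(k, 0.0) for k, w in BASE_WEIGHTS.items())


# A report with three strongly aligned signals and two missing ones.
report = {"historical_match": 0.9, "source_code_match": 0.8, "crash_stack": 0.9}
print(round(static_confidence(report), 3), round(confidence(report), 3))
# static is about 0.655, redistributed is about 0.873
```

With static weights the missing signals pull a well-supported match down toward the middle of the range, while the redistributed score reflects only the evidence that exists. One consequence of this design is that a report with a single available signal can still score high, so a production system might also want to surface how many signals contributed.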
I run test automation for a graphics team that ships software to streaming devices. About a year ago, we changed how our visual regression suite stores and compares its references. The old approach kept around 18GB of PNG golden images in the test repo and ran a pixel-by-pixel diff on every comparison. The new approach stores around 19KB of MD5 hashes in a JSON file and compares hash strings. Storage dropped by roughly three orders of magnitude. Comparisons became effectively free. A category of flaky tests stopped being flaky. This article is about how that works, when it makes sense, and when it doesn't. It also covers the parts that surprised me, because the approach has real downsides and I want to be honest about them up front. How It Works The idea is simple once the constraints are right. On the embedded devices we test, we have access to the raw GPU frame buffer through the graphics stack. The test harness reads it as a bytes object, computes an MD5 hash of those bytes, and compares the hash against a stored reference. If the hashes match, the test passes. If they don't match, the test captures the actual frame and saves it as a failure artifact for a human to look at. The stored reference is a 32-character hex string per screen, kept in a JSON file checked into the test repo alongside the test code. 
The full implementation is short:

```python
import hashlib
import json
from pathlib import Path

REFERENCE_FILE = Path("references/visual_hashes.json")


def frame_hash(frame_bytes: bytes) -> str:
    """MD5 of the raw GPU frame buffer."""
    return hashlib.md5(frame_bytes).hexdigest()


def load_references() -> dict:
    if REFERENCE_FILE.exists():
        return json.loads(REFERENCE_FILE.read_text())
    return {}


def check_frame(test_id: str, frame_bytes: bytes, references: dict) -> tuple[bool, str]:
    """Returns (passed, actual_hash)."""
    actual = frame_hash(frame_bytes)
    expected = references.get(test_id)
    if expected is None:
        return False, actual  # no reference yet
    return actual == expected, actual


def on_failure(test_id: str, frame_bytes: bytes, actual: str):
    """Only called when hashes diverge. Save the frame for review."""
    artifact_dir = Path(f"artifacts/{test_id}")
    artifact_dir.mkdir(parents=True, exist_ok=True)
    (artifact_dir / f"{actual}.raw").write_bytes(frame_bytes)
```

That's essentially the whole system. Because the references are text, intentional UI changes show up as normal source-control diffs in code review instead of opaque binary blob swaps. Because the comparison is string equality on a hex digest, it's effectively instant regardless of frame size.

Why MD5 Specifically

MD5 is cryptographically broken. You can construct collisions on demand, and using it for password storage or signature verification is malpractice. None of that matters here. Visual regression testing is not a cryptographic problem. The two inputs being compared are the rendered output of our own GPU yesterday and the rendered output of our own GPU today. There is no adversary trying to construct a frame buffer that hashes to a specific value.

What you actually need from a hash function in this context is fast computation, low accidental collision rate on real-world inputs, and stable output across runs and platforms. MD5 covers all three.
The accidental collision probability between two different rendered frames at typical buffer sizes is small enough that we have not encountered one. SHA-256 covers the same three properties at slightly higher CPU cost. If the cryptographic concern is going to come up in code review every quarter, just use SHA-256. The Conditions That Have to Hold This approach only works when three things are true about your environment. The first is access to the raw frame buffer before any encoding step. Browser-based testing, mobile UI testing through the standard automation frameworks, and most desktop application testing give you a captured screenshot, which has been through some encoding step before you see it. PNG encoders can vary across versions, and two systems can render the same pixels and produce different PNG files. If your only access point is a captured screenshot, you are comparing post-encoding output, and encoder noise will sink hashing. On embedded devices with a graphics stack you control, you usually do have raw frame buffer access, which is why this worked for us. The second condition is that the rendering pipeline has to be deterministic. Same input, same GPU state, same output bytes. If antialiasing produces different pixels for the same logical input from one run to the next, or if time-based animations get sampled at slightly different moments, or if the GPU driver rounds inconsistently, the hashes will diverge for reasons that aren't real bugs. In our case, the pipeline is deterministic, so this isn't a problem. In a lot of environments, it isn't, and you would need pixel-diff with a tolerance threshold or perceptual hashing to handle the noise. The third condition is that capture points have to be stable. The test harness has to call the capture function at the same logical point in the pipeline every run, after the same set of operations. This is usually the easiest of the three to engineer. 
Frame buffer access either exists or it doesn't, and determinism is sometimes a property you can't change. Capture point stability is just a discipline about where you instrument your tests. If any of these three conditions fail, frame buffer hashing is the wrong tool. Pixel-diff with a tolerance threshold is the right default for most setups, and perceptual hashing covers the middle ground where you have raw access but some non-determinism. The narrow case this article is about is the one where all three hold. What You Give Up The biggest tradeoff is failure diagnosis. With golden images, when a test fails, you have a stored reference and a new screenshot, and you can render a side-by-side diff or an overlay highlighting the changed pixels. With hash comparison, you have two strings that don't match. The failure handler captures the actual frame on the spot, but the reference image (which doesn't exist anymore in storage) has to be reconstructed by running the same test against a known-good build whenever you want to do a side-by-side comparison. That extra step is annoying when failures are common. In our case, they aren't, so the cost is manageable. If your suite has a high baseline failure rate, the math changes, and you may want to keep both the hashes and the reference images, using the hash for fast pass/fail detection and the image only for diagnosis. The other thing you give up is fuzzy matching, but that's the same point as the determinism condition. Fuzzy matching exists to compensate for non-determinism in the rendering pipeline. If your pipeline is deterministic, you don't need it. If it isn't, you do, and hashing won't work. What It Changed for Us Storage going from 18 GB to 19 KB is the change people notice first, but the second-order effects matter more in day-to-day work. Repository operations got faster because the test repo no longer carries gigabytes of binary history. Cloning a fresh checkout takes a fraction of the time it used to. 
PR reviews got cleaner because UI changes show up as readable JSON diffs instead of opaque PNG swaps. The flaky-test rate from encoder noise dropped to zero, which was the change that got the most attention from people on the team. Some of the old goldens had been re-saved at some point with slightly different encoder settings, and tests would fail mysteriously even though the rendered pixels were identical to the human eye. The only fix had been to regenerate the golden, which nobody really trusted. Removing the encoder from the comparison loop removed the entire class of failure. CI runs got faster, too, because hash comparison is essentially free compared to image diffing. None of these wins is novel; Skia, PDFium, and the apitrace project have used hash-based comparison of rendered output for years. What was new for us was committing to it as the primary mechanism for an entire UI test suite on embedded hardware, and accepting the implication that the stored reference is text rather than a binary asset. If you're working in an environment where the three conditions hold, the implementation is small enough that a prototype takes a day. If even one of them is missing, this isn't the right tool, and the alternatives are well understood. The interesting part is recognizing which environment you're actually in.
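The comparison loop described above can be sketched in a few lines. This is a minimal illustration, not the suite's actual harness: it assumes SHA-256 as the hash (one of the options mentioned earlier) and a JSON manifest of reference hashes. The names `frame_hash`, `check_frame`, and `home_screen`, and the 2x2 RGBA test frame, are hypothetical.

```python
import hashlib
import json


def frame_hash(frame_bytes: bytes) -> str:
    """Hash the raw, pre-encoding frame buffer bytes."""
    return hashlib.sha256(frame_bytes).hexdigest()


def check_frame(name: str, frame_bytes: bytes, references: dict) -> bool:
    """Pass only when the captured frame hashes to the stored reference."""
    return references.get(name) == frame_hash(frame_bytes)


# Hypothetical 2x2 RGBA frame standing in for a real frame buffer capture.
golden_frame = bytes([255, 0, 0, 255] * 4)

# The stored reference is text, so it diffs cleanly in a PR.
references = {"home_screen": frame_hash(golden_frame)}
manifest = json.dumps(references, indent=2, sort_keys=True)

# A single changed byte produces a different hash, so the check fails.
changed_frame = bytes([255, 0, 0, 255] * 3 + [254, 0, 0, 255])

same_ok = check_frame("home_screen", golden_frame, json.loads(manifest))
changed_ok = check_frame("home_screen", changed_frame, json.loads(manifest))
```

Because the comparison is exact, this only makes sense when the three conditions hold; any non-determinism in the bytes passed to `frame_hash` shows up as a failure.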
AWS has been building agentic infrastructure for some time now (Bedrock, AgentCore, Strands), mostly aimed at engineers who want to build their own agent systems from scratch. Amazon Quick is a different layer of the same bet: a ready-to-use agentic workspace that targets teams directly, without requiring custom orchestration code. This article walks through what Quick is, how its components fit together technically, how the MCP integration model works with real code, and where it sits relative to the rest of AWS's agent stack.

What Amazon Quick Is

Amazon Quick is an AI assistant for work that connects to your existing tools (Slack, Microsoft Teams, Outlook, CRMs, databases, and local files) and gives a unified layer for querying, automating, and acting across them. It launched in preview at AWS's "What's Next with AWS" event on April 28, 2026.

The product is aimed at teams, not just individual users. One person can build a custom agent scoped to a specific dataset or workflow, and the whole team benefits from it. Responses from Quick agents are grounded in your actual business data, not the underlying model's training distribution.

Under the hood, Quick is built on Amazon Bedrock AgentCore and uses the Model Context Protocol (MCP) as its standard for connecting to external tools. It runs on AWS IAM and VPC, which means it inherits the same security and compliance posture as the rest of your AWS workloads.

Components

Quick bundles five distinct capabilities. It helps to understand each one separately before thinking about how they compose.

| Component | What it does |
| Spaces | Collaborative workspaces where teams pool files, dashboards, and data sources. Agents in a Space are grounded in that Space's data. |
| Agents | Custom, domain-scoped agents built on your team's specific data. One person builds, everyone uses. |
| Research | Multi-source synthesis across internal data, the public web, and third-party datasets. Produces structured reports. |
| Visualize (Quick Sight) | Integrated BI layer. Conversational access to dashboards, charts, and forecasting, with no separate BI tool required. |
| Automate (Quick Flows) | Workflow automation from simple daily tasks to complex multi-step processes with cross-app action execution. |

Each component is available through the web app, mobile, and a native desktop app (currently in preview for macOS and Windows) that can read local files and calendar context without requiring browser access.

Where Quick Sits in the AWS Agent Stack

AWS is building in two directions at once. AgentCore is the infrastructure layer for engineers who want to compose their own agent systems (runtime, memory, gateway, observability) with any model and any framework. Quick is the product layer on top: opinionated, team-facing, and deployable without writing orchestration code.

The practical implication: if you're an engineer building internal tools or automation pipelines, you'll likely interact with both layers. AgentCore for the infrastructure wiring; Quick as a surface where non-technical teammates interact with the agents you build.

The Integration Architecture

The core question for any engineer evaluating Quick is: how does it actually connect to external systems, and what does the request path look like?

Quick uses MCP (Model Context Protocol) as its primary integration standard. This is significant because MCP is an open protocol: Quick agents are not locked into AWS-specific connectors, and any MCP-compatible server can be registered as a tool source.

High-Level Request Flow

The sequence below shows the full lifecycle of a single agent-triggered tool call, from the moment Quick receives a prompt through to the response returning from a downstream API.

Quick acts as the MCP client. Your MCP server exposes tools via listTools and callTool. Quick discovers them at registration time and makes them available to any agent or automation in the workspace.
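For orientation, the two operations map onto the `tools/list` and `tools/call` JSON-RPC 2.0 methods defined by the MCP specification. The sketch below only builds the request payloads to show their general shape; it is not a capture of what Quick sends, and the helper `jsonrpc_request` and the `ENG-1234` issue key are illustrative assumptions.

```python
import json
from itertools import count

_request_ids = count(1)


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope (hypothetical helper)."""
    message = {"jsonrpc": "2.0", "id": next(_request_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Discovery: the client asks the server which tools it exposes.
list_request = jsonrpc_request("tools/list")

# Invocation: the client names one tool and passes arguments that
# must satisfy the input schema the server advertised for it.
call_request = jsonrpc_request(
    "tools/call",
    {"name": "get_ticket", "arguments": {"issue_key": "ENG-1234"}},
)

wire_payload = json.dumps(call_request)
```

An MCP SDK normally builds and parses these envelopes for you, so server code rarely touches them directly.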
Authentication flows through OAuth 2.0, with support for Dynamic Client Registration (DCR) so Quick can register itself automatically without manual credential setup.

Building an MCP Server for Quick

Here is a minimal Python MCP server using the mcp SDK that exposes two tools Quick can invoke: get_ticket and list_open_tickets. This pattern works whether you host the server yourself or run it on AgentCore Runtime.

Install Dependencies

Shell

    pip install mcp[server] httpx uvicorn

Server Implementation

Python

    # server.py
    from mcp.server import Server
    from mcp.server.sse import SseServerTransport
    from mcp.types import Tool, TextContent
    import httpx
    from starlette.applications import Starlette
    from starlette.responses import Response
    from starlette.routing import Mount, Route

    app = Server("jira-quick-integration")

    JIRA_BASE_URL = "https://yourorg.atlassian.net"
    JIRA_TOKEN = "Bearer <your-token>"  # in production, load from AWS Secrets Manager


    @app.list_tools()
    async def list_tools() -> list[Tool]:
        # Advertise the tools and the JSON Schema for their arguments.
        return [
            Tool(
                name="get_ticket",
                description="Retrieve details for a single Jira ticket by issue key.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "issue_key": {
                            "type": "string",
                            "description": "The Jira issue key, e.g. ENG-1234",
                        }
                    },
                    "required": ["issue_key"],
                },
            ),
            Tool(
                name="list_open_tickets",
                description="List open Jira tickets assigned to a given user.",
                inputSchema={
                    "type": "object",
                    "properties": {
                        "assignee": {
                            "type": "string",
                            "description": "The Jira username or email of the assignee",
                        }
                    },
                    "required": ["assignee"],
                },
            ),
        ]


    @app.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[TextContent]:
        headers = {"Authorization": JIRA_TOKEN, "Content-Type": "application/json"}
        async with httpx.AsyncClient() as client:
            if name == "get_ticket":
                key = arguments["issue_key"]
                resp = await client.get(
                    f"{JIRA_BASE_URL}/rest/api/3/issue/{key}", headers=headers
                )
                resp.raise_for_status()
                data = resp.json()
                summary = data["fields"]["summary"]
                status = data["fields"]["status"]["name"]
                return [TextContent(type="text", text=f"{key}: {summary} [{status}]")]

            elif name == "list_open_tickets":
                assignee = arguments["assignee"]
                # Quote the assignee so emails and names with spaces stay valid JQL.
                jql = f'assignee = "{assignee}" AND status != Done ORDER BY updated DESC'
                resp = await client.get(
                    f"{JIRA_BASE_URL}/rest/api/3/search",
                    headers=headers,
                    params={"jql": jql, "maxResults": 20},
                )
                resp.raise_for_status()
                issues = resp.json().get("issues", [])
                results = [f"{i['key']}: {i['fields']['summary']}" for i in issues]
                return [
                    TextContent(
                        type="text",
                        text="\n".join(results) or "No open tickets found.",
                    )
                ]

        raise ValueError(f"Unknown tool: {name}")


    # Wire up SSE transport for Quick compatibility
    sse = SseServerTransport("/messages/")


    async def handle_sse(request):
        async with sse.connect_sse(
            request.scope, request.receive, request._send
        ) as streams:
            await app.run(streams[0], streams[1], app.create_initialization_options())
        # Starlette expects the endpoint to return a response once the stream closes.
        return Response()


    starlette_app = Starlette(
        routes=[
            Route("/sse", endpoint=handle_sse),
            # The SSE transport receives client messages as POSTs on this path.
            Mount("/messages/", app=sse.handle_post_message),
        ]
    )

    if __name__ == "__main__":
        import uvicorn

        uvicorn.run(starlette_app, host="0.0.0.0", port=8080)

A few design constraints to be aware of when building for Quick:

- Each MCP tool call has a 300-second hard timeout. Operations that exceed this fail with HTTP 424. Keep individual tool calls narrow and fast.
- The tool list is treated as static after registration. If you add or remove tools on the server, the Quick admin must re-establish the connection to pick up changes.
- Quick supports both Server-Sent Events (SSE) and streamable HTTP as transports. Streamable HTTP is preferred for new implementations.

Registering the MCP Server in Quick

Once your server is running and publicly reachable over HTTPS, registration in Quick takes the following path:

    Quick Console → Integrations → Add Integration → MCP

    Fields:
      Server URL:        https://your-mcp-server.example.com/sse
      Auth type:         OAuth 2.0 (or Service, or None)
      Client ID:         <from your identity provider>
      Authorization URL: https://auth.example.com/oauth/authorize
      Token URL:         https://auth.example.com/oauth/token

If your identity provider supports OAuth Dynamic Client Registration, Quick will auto-register and you skip the manual client ID step entirely. Quick sends an initial unauthenticated request to the MCP server; if it receives a 401 with a WWW-Authenticate header containing a resource_metadata URL, it fetches the metadata document and proceeds with DCR automatically.

Once registered, Quick calls listTools at startup and exposes every discovered tool to agents and automations in the workspace.

The AgentCore Gateway Option

For teams that don't want to write and operate an MCP server from scratch, Amazon Bedrock AgentCore Gateway provides a managed alternative. You point Gateway at a Lambda function or an OpenAPI spec, and it handles the MCP wrapping, auth, logging, and semantic tool discovery automatically. If you use it, Quick never calls your internal APIs directly; everything flows through Gateway's auth and routing layer, as shown in the sequence diagram above.

The semantic search capability is worth noting specifically. When an agent has access to dozens or hundreds of tools, passing the full tool list on every turn wastes context and causes the model to pick the wrong tool.
Gateway's built-in x_amz_bedrock_agentcore_search tool lets Quick find the right tool by semantic similarity rather than scanning the entire registry each turn.

Practical Considerations

A few things worth keeping in mind before integrating:

- Tool scope matters. When agents are given too many tools simultaneously, selection accuracy degrades: the model reasons over too many options per turn and picks incorrectly more often. Keeping each agent or MCP server to a focused set of 3 to 5 tools produces better results than exposing everything through one endpoint. This is a known pattern in multi-agent architectures and applies equally to Quick agents.
- The 300-second timeout is real. Design each tool call to complete a single, bounded operation. Avoid chaining multiple downstream API calls inside a single tool invocation. If you need a multi-step workflow, model it as separate tools and let the agent orchestrate the sequence.
- Local context on the desktop app. The desktop app reads local files and calendar events directly, without upload. For engineers who work primarily in terminals and local editors, this is a meaningful integration point: meeting context, local documentation, and recent file changes are all available to the assistant without any configuration.
- MCP interoperability. Because Quick uses MCP as the standard, the same MCP server you build for Quick can also be consumed by Claude Code, Amazon Q Developer, and other MCP-compatible clients. The integration contract is portable.

References

- Amazon Quick: Product overview and features
- Integrate external tools with Amazon Quick Agents using MCP (AWS ML Blog, Feb 2026)
- MCP integration: Amazon Quick User Guide
- Amazon Bedrock AgentCore: Overview and documentation
- Introducing Amazon Bedrock AgentCore Gateway (AWS ML Blog)
- Top announcements of the What's Next with AWS, 2026 (AWS News Blog, Apr 2026)
Most AI agent frameworks treat the model as a black box: you register tools, the model picks one, the tool runs, and the cycle repeats. This pattern is fine for demos, but a production system needs more. We have to manage context windows, cache API calls, filter sensitive tools by role, and compact the message history to stay within token limits.

I landed on middleware while reviewing issues for deepagents and reading its codebase. That made me wonder what middleware really is in the context of AI agents, why it matters, and how other frameworks handle the same problem. So I installed Pydantic AI, read the CrewAI source, and checked LangChain and AutoGen.

This article compares two frameworks that implement middleware as a primitive, Deep Agents (from LangChain) and Pydantic AI. It explains the difference between middleware and callbacks, and why that difference matters when running agents at scale.

What You Will Learn

By the end of this article, you will be able to:

- Distinguish middleware from tool callbacks and event callbacks, and explain why the distinction matters
- Read working code for deepagents' AgentMiddleware and Pydantic AI's AbstractCapability
- Understand the differences between the two frameworks: cross-turn AgentState access, production middleware, and config-driven profiles via HarnessProfile
- Understand why frameworks built on callbacks cannot support patterns that middleware enables

What Is Middleware?

The term "middleware" is overloaded. In the context of AI agents, it means code that runs before or after every model call, with the ability to read and rewrite the request or response.
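To make that definition concrete, here is a framework-agnostic sketch. It is not the deepagents or Pydantic AI API: `Request`, `Response`, `fake_model`, `restrict_tools`, `uppercase_response`, and `compose` are all hypothetical names, and the "model" is a stub that echoes what it was given. One middleware rewrites the request before the call, the other rewrites the response after it.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class Request:
    system_prompt: str
    tools: Tuple[str, ...]


@dataclass(frozen=True)
class Response:
    text: str


Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]


def fake_model(request: Request) -> Response:
    """Stand-in for an LLM call: echoes what the model was allowed to see."""
    return Response(text=f"tools={list(request.tools)}")


def restrict_tools(allowed: Set[str]) -> Middleware:
    """Rewrite the request BEFORE the model call (e.g. role-based filtering)."""
    def middleware(request: Request, handler: Handler) -> Response:
        kept = tuple(t for t in request.tools if t in allowed)
        return handler(replace(request, tools=kept))
    return middleware


def uppercase_response(request: Request, handler: Handler) -> Response:
    """Rewrite the response AFTER the model call."""
    response = handler(request)
    return Response(text=response.text.upper())


def _bind(middleware: Middleware, next_handler: Handler) -> Handler:
    return lambda request: middleware(request, next_handler)


def compose(middlewares: List[Middleware], model: Handler) -> Handler:
    """First middleware in the list is outermost; the last sits next to the model."""
    handler = model
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


pipeline = compose([restrict_tools({"search"}), uppercase_response], fake_model)
result = pipeline(Request("Be brief.", ("search", "delete_repo")))
```

The stub model never sees `delete_repo`, because the request was rewritten before it arrived. A hook that only fires after the call could report what happened but could not have prevented it.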
What Differentiates Middleware From the Rest

Middleware is different from:

- Tool callbacks: fired when a tool is called, not when the model is called.
- Event callbacks: fire-and-forget notifications that let you observe a call but not change it.
- Post-processing: wrapping the final output after the agent loop ends.

Middleware sits inside the request/response cycle of every LLM call, which gives it unique capabilities.

Where the Middleware Sits in the Agent Loop

It's the only layer with access to the request before it reaches the model and the response before it reaches the tool executor.

| Capability | Middleware | Tool callback | Event callback |
| Modify system prompt per call | ✓ | ✗ | ✗ |
| Filter tool list dynamically | ✓ | ✗ | ✗ |
| Transform message history | ✓ | ✗ | ✗ |
| Cancel the model call | ✓ | ✗ | ✗ |
| Track state across turns | ✓ | Partial | ✗ |
| Observe output | ✓ | ✓ | ✓ |

Deep Agents: Middleware as a Composable Hook

Installation:

Shell

    pip install deepagents
    # Requires Python >=3.10
    # Docs: https://docs.langchain.com/oss/python/deepagents/overview

deepagents ships AgentMiddleware as a base class from langchain.agents.middleware.types. Every middleware subclass can override these key hooks (each has an async variant):

Python

    class AgentMiddleware:
        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelCallResult:
            # Intercept before AND after the model call. Call handler() to execute it.
            return handler(request)

        def before_model(self, state: AgentState, runtime: Runtime) -> dict | None:
            # Runs before the model is called. Can update agent state.
            return None

        def after_model(self, state: AgentState, runtime: Runtime) -> dict | None:
            # Runs after the model responds. Can inject new messages into state.
            return None

        def wrap_tool_call(
            self,
            request: ToolCallRequest,
            handler: Callable[[ToolCallRequest], ToolMessage],
        ) -> ToolMessage:
            # Intercept individual tool calls for retry logic, monitoring, or modification.
            return handler(request)

        # async def awrap_model_call(...): ...
        # async versions of each hook also available

The key insight: wrap_model_call receives the full request (messages, tools, settings) and can pass a modified request on to the next middleware in the stack. Multiple middleware compose like nested functions:

    Request  -> Middleware A -> Middleware B -> Model
    Response <- Middleware A <- Middleware B <- Model

Deep Agents middleware composition (innermost = closest to model)

Built-In Middleware Deep Agents Ships

Deep Agents includes several production-grade middleware out of the box:

Python

    from deepagents.middleware import (
        FilesystemMiddleware,         # Filesystem read/write tools + permission enforcement
        MemoryMiddleware,             # Injects relevant memories into system prompt each turn
        SkillsMiddleware,             # Injects SKILL.md definitions into system prompt
        SubAgentMiddleware,           # Spawns synchronous subagents as tools
        AsyncSubAgentMiddleware,      # Spawns async background subagents
        SummarizationMiddleware,      # Auto-compacts history when token budget fills
        SummarizationToolMiddleware,  # Exposes compact_conversation as an explicit tool
    )

Writing a Custom Middleware

Here is a practical example: a tool-budget middleware that counts the tool results in the conversation and appends a warning to the system message when the agent is being "chatty":

Python

    from langchain.agents.middleware.types import (
        AgentMiddleware,
        ModelRequest,
        ModelResponse,
        ModelCallResult,
    )
    from langchain_core.messages import SystemMessage
    from collections.abc import Callable


    class ToolBudgetMiddleware(AgentMiddleware):
        """Warn the model when it has used many tools in a single turn."""

        def __init__(self, budget: int = 5) -> None:
            self.budget = budget

        def wrap_model_call(
            self,
            request: ModelRequest,
            handler: Callable[[ModelRequest], ModelResponse],
        ) -> ModelCallResult:
            # Count tool messages in the conversation (each = one tool call made)
            tool_calls_this_turn = sum(
                1 for m in request.messages if hasattr(m, "tool_call_id")
            )
            if tool_calls_this_turn >= self.budget:
                warning = (
                    f"\n\n[Budget notice: you have called {tool_calls_this_turn} tools "
                    f"this turn. Prefer to synthesize results rather than calling more tools.]"
                )
                system = request.system_message
                # The warning is only added when a system message already exists.
                if system:
                    new_content = str(system.content) + warning
                    request = request.override(
                        system_message=SystemMessage(content=new_content)
                    )
            return handler(request)

You can wire this custom middleware alongside built-ins:

Python

    from deepagents import create_deep_agent
    from deepagents.middleware import FilesystemMiddleware, SummarizationMiddleware
    from deepagents.backends import FilesystemBackend

    backend = FilesystemBackend(root_dir="/workspace")

    summarizer = SummarizationMiddleware(
        model="anthropic:claude-haiku-4-5",
        backend=backend,
        trigger=("fraction", 0.85),
        keep=("fraction", 0.10),
    )

    agent = create_deep_agent(
        model="anthropic:claude-sonnet-4-6",
        middleware=[
            FilesystemMiddleware(backend=backend),
            summarizer,
            ToolBudgetMiddleware(budget=5),  # custom
        ],
    )

Middleware runs in list order: FilesystemMiddleware wraps first, then SummarizationMiddleware, then your custom one. The innermost middleware is the one closest to the model.

The Profiles API: Middleware Configuration Without Code

deepagents v0.5.4 added HarnessProfile, which lets you declare middleware changes declaratively (add extra middleware, exclude specific middleware, override tool descriptions) without touching create_deep_agent call sites.

HarnessProfile merge semantics (additive, model-specific overrides, provider-level):

Python

    from deepagents.profiles import HarnessProfile, register_harness_profile

    register_harness_profile(
        "anthropic:claude-haiku-4-5",
        HarnessProfile(
            system_prompt_suffix="Be concise. Prefer short answers.",
            excluded_middleware={SummarizationMiddleware},  # Haiku has small context, skip
            extra_middleware=[ToolBudgetMiddleware(budget=3)],
        ),
    )

    # Now any agent using claude-haiku-4-5 automatically gets this profile applied
    agent = create_deep_agent(model="anthropic:claude-haiku-4-5")

You can also load a profile from a YAML file for a config-file-driven deployment:

YAML

    # haiku-profile.yaml
    system_prompt_suffix: "Be concise. Prefer short answers."
    excluded_middleware:
      - SummarizationMiddleware

Python

    import yaml
    from deepagents.profiles import HarnessProfileConfig, register_harness_profile

    with open("haiku-profile.yaml") as f:
        register_harness_profile(
            "anthropic:claude-haiku-4-5",
            HarnessProfileConfig.from_dict(yaml.safe_load(f)),
        )

Pydantic AI: Capabilities as the Closest Parallel

Installation:

Shell

    pip install pydantic-ai
    # Docs: https://ai.pydantic.dev

Pydantic AI's AbstractCapability is the closest architectural equivalent to LangChain's deepagents middleware. Subclass it from pydantic_ai.capabilities and override any of these lifecycle hooks:

Python

    from pydantic_ai.capabilities import AbstractCapability

    class MyCapability(AbstractCapability):
        # Run-level hooks
        async def before_run(self, ctx, ...): ...               # Before run starts
        async def after_run(self, ctx, *, result): ...          # Observe/modify result
        async def wrap_run(self, ctx, *, handler): ...          # Full wrap: intercept + resume
        async def on_run_error(self, ctx, *, error): ...        # Handle run-level errors

        # Graph-node hooks
        async def before_node_run(self, ctx, *, node): ...      # Before each graph node
        async def wrap_node_run(self, ctx, *, node, handler): ...
        async def on_node_run_error(self, ctx, *, node, error): ...

        # Model-request hooks: intercept the raw LLM call
        async def before_model_request(self, ctx, request_context): ...  # Modify messages/tools
        async def wrap_model_request(self, ctx, *, request_context, handler): ...
        async def after_model_request(self, ctx, *, request_context, response): ...
        async def on_model_request_error(self, ctx, *, request_context, error): ...

Note on granularity: Pydantic AI's before_model_request hook receives a ModelRequestContext containing messages, model_settings, and model_request_parameters (which includes the tool list). You can return a modified ModelRequestContext to rewrite what gets sent to the model, which is similar to deepagents' wrap_model_call. The key remaining difference is state persistence: these hooks operate within a single run's context, not across agent turns via a shared graph state.

A practical example, wrapping a run to add timing and error context:

Python

    from pydantic_ai import Agent
    from pydantic_ai.capabilities import AbstractCapability
    import time


    class TimingCapability(AbstractCapability):
        async def wrap_run(self, ctx, *, handler):
            start = time.monotonic()
            try:
                result = await handler()
                elapsed = time.monotonic() - start
                print(f"Run completed in {elapsed:.2f}s")
                return result
            except Exception as e:
                elapsed = time.monotonic() - start
                print(f"Run failed after {elapsed:.2f}s: {e}")
                raise


    agent = Agent(
        "anthropic:claude-sonnet-4-6",
        capabilities=[TimingCapability()],
    )

For injecting dynamic content into system prompts, you can use before_model_request to return a modified ModelRequestContext with updated instruction_parts, or use the instructions field and callable system_prompt at agent construction time.

Pydantic AI vs. Deep Agents Middleware: The Key Differences

| Dimension | deepagents | Pydantic AI |
| Hook class | AgentMiddleware | AbstractCapability |
| Hook granularity | Per LLM request, tool call, node, run | Per LLM request, node, and run |
| System prompt injection | via ModelRequest in wrap_model_call | via ModelRequestContext in before_model_request |
| Error hooks | No dedicated hook | on_run_error, on_node_run_error, on_model_request_error |
| State persistence across turns | AgentState dict shared with LangGraph | Per-run context only |
| Tool list access & filtering | ModelRequest.tools in wrap_model_call | via ModelRequestContext.model_request_parameters |
| Cross-framework portability | deepagents / LangGraph only | Pydantic AI only |
| Config-driven (no code) | Yes - HarnessProfile + YAML | No |
| Built-ins included | 7 production middleware | None - user-defined |

The biggest practical difference is that Deep Agents' middleware has access to AgentState (the full LangGraph graph state across turns) through after_model, which means middleware can read message history, inject summary nodes, and write back to the state. Pydantic AI capabilities are scoped to a single run's context, so there is no shared graph state across agent turns.

What Other Frameworks Do Instead

LangChain Callbacks (v0.1 Style)

Python

    from langchain_core.callbacks.base import BaseCallbackHandler

    class MyCallback(BaseCallbackHandler):
        def on_llm_start(self, serialized, prompts, **kwargs): ...
        def on_llm_end(self, response, **kwargs): ...

You cannot modify or cancel the request, and callbacks do not compose. This is useful for logging, but not for request transformation.

CrewAI Step Callbacks

Python

    from crewai import Crew

    def my_step_callback(output):
        print(f"Step completed: {output}")

    crew = Crew(agents=[...], tasks=[...], step_callback=my_step_callback)

Step callbacks are called after each task step completes. They have no access to the request, so you cannot modify the list of tools or the system prompt. The limitations are the same as with LangChain callbacks.
AutoGen v0.4 Message Middleware

AutoGen's message-passing model means you can inject agents into the conversation (e.g., a logging proxy agent), but there's no formal pre- or post-hook around model calls. The closest equivalent is a UserProxy agent that intercepts messages, but it's a peer agent and not a transparent middleware layer.

What the Middleware Gap Can Actually Cost You

- Token budget. When a conversation is approaching the model's context limit, you want to summarize old tool outputs before the model call, not after. A callback fires too late to help, so you either run out of tokens or overshoot your token budget.
- Per-user tool filtering. In any organization, different users have different roles and different access permissions. Without middleware, it's hard to filter out tools that certain users cannot run. Without that filter, the LLM is called, it picks a tool, and the tool call then fails on access permissions. That wastes tokens and LLM calls that could easily have been avoided.
- Prompt caching across providers. Anthropic's prompt caching requires cache_control in the request. AnthropicPromptCachingMiddleware rewrites the message and tool definitions of every model call to apply cache breakpoints in the right places. Without middleware, this would have required changes to every call site.

Conclusion

The middleware gap is why some production patterns are trivially simple in Deep Agents and Pydantic AI, but not possible in other frameworks. Summarizing message history before the model call, filtering tools based on roles, and injecting cache-control blocks in the right position are all possible with middleware, but not with a callback that fires after the call completes.

For teams choosing a framework today: if you need to transform what the model sees on every call rather than just observe it, the choice narrows to Deep Agents or Pydantic AI. If you want that transformation to reference or rewrite history spanning multiple turns, deepagents with LangGraph is the only framework that supports this today.

Middleware is not the most visible feature of an agent framework, but it is a primitive that sets the ceiling for everything else.
Real-time AI inference has become a fundamental feature of modern applications and has been used to drive applications in conversational agents, recommendation engines, fraud detection, and computer vision pipelines. In contrast to batch workloads, real-time inference requires stable, low-latency, predictable scaling, and resource efficiency. With the increase in the size or the number of computations performed by models, it becomes more complicated to provide these experiences at a reliable level, particularly when considering the performance versus the cost of operation. Cloud Run Cloud Run offers a simple, scalable, and managed infrastructure that delivers real-time machine learning models in the Google Cloud platform with the help of GPU acceleration and Vertex AI. This architecture allows teams to deploy containerized inference services that automatically scale with traffic while using GPUs to execute high-throughput model inference. Instead of deploying fixed clusters or provisioning resources manually, organizations can adopt a serverless-first approach, which has the capacity to bring compute capacity in step with demand. With the combination of these services, engineering teams are able to construct inference pipelines, which appear like current microservice platforms. Traffic is directed via controlled points, models are executed on specialized hardware, and observability is built into the operating system. This model takes away a significant portion of the complexity found within the underlying infrastructure, enabling the developers to concentrate on application logic and still attain production-grade performance. Deploying Low-Latency Inference With Cloud Run and GPUs Cloud Run is a service that provides a serverless experience to deploy containerized workloads. It is easily applicable to real-time inference services. 
Cloud Run can be used to run models that consume a lot of compute, though, with automatic scaling and billed on a request basis, when combined with instances that have GPUs. This enables teams to run stateless services as models that spin up when incoming traffic is detected and scale down when idle, enhancing responsiveness and cost efficiency. Practically, the models are bundled into containers that provide endpoints of inference via thin APIs. Such services are able to preload models upon startup and maintain them in the memory of the GPUs so that they can be swiftly executed. Cloud Run also does traffic routing, instance management, and scaling, and does not require managing node pools or orchestration layers. For latency-sensitive applications, concurrency settings can be configured, and the minimum number of instances can be set to minimize cold-start effects and guarantee a predictable response time. This deployment pattern can serve a wide variety of workloads, from transformer-based language models to vision inference pipelines. Since Cloud Run is seamlessly connected to GCP networking and identity services, inference endpoints can be sheltered under an API gateway and authenticated with IAM-based access. This allows the deployment of production that satisfies enterprise security and still offers the agility of serverless infrastructure. Integrating Vertex AI for Model Management and Observability Whereas Cloud Run supports inference serving, Vertex AI offers a support MLOps environment that can be used to scale models. Vertex AI provides a centralized system of record for the teams by handling model artifacts, experiment tracking, and versioning. This isolation of concerns enables engineers to deploy models without considering the serving infrastructure while still being able to trace iterations. Interestingly, Vertex AI also allows tracing model performance and system behavior. 
Numerical indicators, e.g., latency, throughput, and error rates, can also be gathered alongside model-specific indicators, helping teams notice regressions or slowdowns over time. A good number of organizations send inference logs and prediction data to BigQuery to perform offline analyses on it to gain a better understanding of how it is used and the quality of responses it offers. This feedback loop helps with continuous improvement without interrupting live services. Vertex AI is often combined with CI/CD pipelines to automatically promote models across environments in production environments. The validation of the new versions can be done in staging and deployed to Cloud Run endpoints, which are stable with the capability to quickly iterate. This practice of operation can be compared to the current software delivery practices, where machine learning models are perceived as versioned parts of a broader application ecosystem. Scaling, Cost Optimization, and Production Readiness Inference in real time can be scaled by paying special attention to the cost and performance. GPUs provide high acceleration, but they have to be put to good use to warrant their cost. A request-driven scaling model for Cloud Run can scale resources in accordance with actual demand, and utilization during peak load can be enhanced with batching strategies and concurrency controls. The teams use these techniques in conjunction with caching and request deduplication to further optimize throughput. Security and good governance are also required in production readiness. Inference services are normally executed with dedicated service accounts with limited privileges, and sensitive information is isolated using encryption protocols and access controls. Privacy can be implemented by blocking inference traffic out of trusted environments by restricting connections between networks with firewall rules and network links. 
These controls help companies launch AI services that comply with company policies and regulations. Ultimately, effective real-time inference systems resemble well-built cloud-native systems: they are observable, automated, and continuously refined. In contrast to traditional approaches to AI platform building, organizations that combine Cloud Run for scalable serving, GPUs for performance, and Vertex AI for lifecycle management can create AI platforms that deliver low-latency experiences while maintaining operational discipline. This combined approach enables teams to move beyond experimentation and deliver reliable AI functionality at enterprise scale.
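The concurrency and minimum-instance settings discussed in this article lend themselves to a back-of-the-envelope capacity estimate. The sketch below is illustrative only: it assumes steady traffic and applies Little's law (requests in flight = arrival rate × average latency). The function and parameter names are hypothetical and are not Cloud Run settings or APIs.

```javascript
// Rough capacity estimate for a request-driven inference service.
// Assumption: steady-state traffic, so Little's law applies
// (requests in flight = arrival rate * average latency).
// All names here are illustrative, not Cloud Run settings or APIs.
function estimateInstances({
  requestsPerSecond,
  avgLatencySeconds,
  concurrencyPerInstance,
  minInstances = 0,
}) {
  if (concurrencyPerInstance <= 0) {
    throw new RangeError('concurrencyPerInstance must be positive');
  }
  const inFlight = requestsPerSecond * avgLatencySeconds;
  const needed = Math.ceil(inFlight / concurrencyPerInstance);
  // Warm instances kept around to reduce cold starts act as a floor.
  return Math.max(needed, minInstances);
}

const peak = estimateInstances({
  requestsPerSecond: 40,
  avgLatencySeconds: 0.5,
  concurrencyPerInstance: 4,
  minInstances: 1,
});
console.log(`instances at peak: ${peak}`);
```

An estimate like this is only a starting point; real GPU workloads usually need load testing, because latency tends to rise as per-instance concurrency increases.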
What This Series Is About This is Part 2 of a two-part series on building a Slack bot that answers natural language questions about a GitHub repository using AWS Bedrock (Claude) and GitHub's official Model Context Protocol (MCP) server. Part 1 covered the why: most AI tools suggest wrapping GitHub's REST API and feeding the response to a model. That approach works, but it produces brittle glue code that grows with every new question type and every new data source. MCP offers a fundamentally better pattern — a tool registry that the model queries at runtime, making routing decisions autonomously. The result is a 150-line bot that answers questions you never anticipated and extends to new data sources with four lines of configuration. If you have not read Part 1, it is available here: https://dzone.com/articles/build-a-github-slack-bot-with-aws-bedrock-and-mcp. The full project code is on GitHub: https://github.com/sangharshcs/slack-github-mcp-bot. This article covers the implementation — the four key architectural pieces, how to get it running, how to extend it to new MCP servers, and the production lessons from running it on a real engineering team. How It Is Built — The Four Key Pieces The bot has four distinct components. Understanding each one separately makes the whole system easier to reason about and extend. 1. The MCP Request Function All communication with GitHub's MCP server goes through a single function. GitHub MCP returns Server-Sent Events (SSE) rather than plain JSON, so the function handles both response types transparently. It also checks HTTP status and surfaces MCP-level errors cleanly — without this, a 401 or 500 from the server fails silently. The function signature accepts the endpoint and headers as parameters, not hardcoded values. This is the detail that makes the whole system extensible: the same function routes to GitHub today and to any other MCP server tomorrow. 2. 
The Tool Registry

At startup, the bot calls tools/list on every connected MCP server and records which server owns each tool. This registry, a simple JavaScript object mapping tool name to endpoint and auth headers, is the entire routing mechanism. When Claude calls a tool, the bot looks up its origin and sends the request there. Adding a new MCP server means calling the same loadServer() function with the new URL and credentials. The registry grows. The agent loop never changes. This four-line pattern is the extensibility mechanism Eric described as worth expanding on:

```javascript
// Same pattern for every MCP server you add:
const myServiceHeaders = {
  Authorization: `Bearer ${process.env.MY_SERVICE_TOKEN}`,
  'Content-Type': 'application/json',
  Accept: 'application/json, text/event-stream',
};
await loadServer(process.env.MY_SERVICE_MCP_URL, myServiceHeaders);
// Then add routing guidance to your system prompt.
// The agent loop below does not change.
```

3. The Agent Loop

The loop sends the question to Claude with the full tool list. Claude selects tools, the bot executes them via the registry, results return to Claude, and the cycle repeats until Claude has enough to answer, typically 3 to 8 tool calls. The loop is generic: it does not know whether it is answering a bug or a PR question. The system prompt configures the behavior. The same code handles every question type, present and future.

4. The System Prompt

The system prompt is the highest-leverage piece in the entire system. The difference between a bot your team uses daily and one they quietly stop using is almost always prompt quality, not code quality. Three rules matter most:

- Explicit Slack markdown syntax. Claude defaults to standard Markdown. Without being told otherwise, it uses **bold** and [text](url), which Slack renders as raw asterisks and broken links. The prompt must specify *bold*, <url|text>, no # headings, no markdown tables.
- High-volume handling. Without a rule, asking 'list all open issues' on a large repo returns an unusable wall of text. The prompt should specify: if results exceed 15 items, summarise by category and show the top 10.
- Tool routing for multiple servers. When you add a second MCP server, the prompt tells Claude which questions map to which server. This reduces unnecessary tool calls and keeps responses fast.

The complete index.js, package.json, and .env template are in the project repository at https://github.com/sangharshcs/slack-github-mcp-bot.

Getting It Running

Setup involves three external services (Slack, GitHub, and AWS Bedrock), each requiring a token. Rather than reproducing the full step-by-step here (that lives in the project README at https://github.com/sangharshcs/slack-github-mcp-bot), here is what each token is and where to get it.

- The Slack bot token (xoxb-...) comes from creating a Slack app at api.slack.com/apps with Socket Mode enabled. Socket Mode is what lets the bot run from your laptop without a public URL: it connects outbound over WebSocket. You also need an App-Level Token (xapp-...) for the socket connection itself, and a Signing Secret from Basic Information. The bot needs these scopes: app_mentions:read, chat:write, im:history, im:write.
- The GitHub token is a fine-grained personal access token from github.com/settings/tokens. Select your target repository and grant read access to Issues, Pull Requests, Contents, and Metadata. No write access is needed.
- The Bedrock API key comes from the AWS console under Amazon Bedrock → API keys. Enable the Claude Sonnet 4.6 model under Model access first. One detail that catches everyone: Claude 4.x models require a cross-region inference profile prefix. Use us.anthropic.claude-sonnet-4-6, not anthropic.claude-sonnet-4-6. The bare ID returns "on-demand throughput isn't supported".

With credentials in .env, npm install and node index.js is all it takes.
The bot logs the number of GitHub tools loaded and is ready to receive mentions.

Extending to Other MCP Servers

loadServer() is the entire extension mechanism. Call it with any MCP-compatible service. The registry grows, Claude discovers the new tools, and you add one line to the system prompt describing when to use them.

| MCP Server | URL | What it adds |
| --- | --- | --- |
| Linear | mcp.linear.app/mcp | Issues, projects, cycles, roadmaps |
| Cloudflare | mcp.cloudflare.com/mcp | Workers, analytics, DNS, R2 storage |
| Stripe | mcp.stripe.com/mcp | Payments, customers, subscriptions |
| Custom | @modelcontextprotocol/sdk | Any internal REST API as MCP tools |

What We Ran Into in Production

We have been running this bot on a busy engineering repository for several months. Before sharing the limitations we documented, it is worth saying that none of them were showstoppers, but they were real, and ignoring them would leave you unprepared.

The biggest adjustment was latency. Complex queries that trigger 6 to 8 tool calls take 15 to 30 seconds. We handled this with the thinking-indicator pattern (post a placeholder message immediately, then update it when the answer is ready), which kept the experience feeling responsive even when the underlying calls were slow.

Debugging took more work than expected. When a traditional API client gives a wrong answer, the fix is obvious. When an agentic loop gives a wrong answer, you need to know which tools Claude chose, what they returned, and how Claude reasoned over the results. We solved this by logging every tool call (name, input, result, timestamp) and shipping those logs to our observability platform. That log became our primary debugging tool for agent behavior.

Prompt quality turned out to be load-bearing in a way we did not fully anticipate. Early versions of the bot would return raw asterisks in Slack, generate unusable walls of text for large result sets, and occasionally pick the wrong tool. Each of these was a prompt fix, not a code fix.
Investing time in the system prompt before going live would have saved us several rounds of iteration. Cost is worth monitoring at scale. A query triggering 8 Bedrock calls costs meaningfully more than a single response. For an internal team tool used dozens of times a day, the cost is negligible. At higher volume, it warrants attention. The productivity gain from not maintaining API clients has outweighed all of these constraints at the scale we operate. The right framing is not "is MCP perfect?" but "is MCP better than the alternative?" For our team, the answer has consistently been yes. Conclusion The architecture across these two articles is intentionally small. A tool registry, a generic agent loop, and a system prompt that configures behavior — that is all of it. The 150 lines in the repository are a starting point your team can run today and grow as your toolchain does. Start with GitHub MCP. Get it answering questions in Slack. Test it with your team. Then look at what they ask most often and which data sources those questions touch. The next MCP server to register will be obvious. The code to add it is four lines. If you landed on Part 2 first and want to understand the architectural reasoning — why MCP is a fundamentally better pattern than the conventional REST API wrapper approach, and why it matters especially for SRE and platform teams — Part 1 covers all of that and is the recommended starting point. Part 1: Why MCP Changes Everything About AI Tool Integration. Full project code: https://github.com/sangharshcs/slack-github-mcp-bot.
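To make the tool registry from "The Tool Registry" section concrete, here is a minimal offline sketch of the idea: a plain object that maps each tool name to the server that owns it, plus a lookup used when the model requests a tool. The helper names (registerTools, routeToolCall), the example endpoints, and the tool names are all hypothetical; the project's actual loadServer() fetches the tool list over the network and may be structured differently.

```javascript
// Minimal sketch of a tool registry: tool name -> owning MCP server.
// registerTools and routeToolCall are hypothetical helpers, not the
// project's loadServer(); endpoints and tool names are placeholders.
const registry = {};

function registerTools(endpoint, headers, tools) {
  for (const tool of tools) {
    // Last registration wins if two servers expose the same tool name.
    registry[tool.name] = { endpoint, headers, description: tool.description };
  }
  return tools.length;
}

function routeToolCall(toolName) {
  const origin = registry[toolName];
  if (!origin) {
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return origin;
}

// In a real bot the tool lists would come from each server's tools/list
// response; here they are hard-coded so the sketch runs offline.
registerTools(
  'https://github-mcp.example/mcp',
  { Authorization: 'Bearer <github-token>' },
  [{ name: 'list_issues', description: 'List issues in a repository' }],
);
registerTools(
  'https://tracker-mcp.example/mcp',
  { Authorization: 'Bearer <tracker-token>' },
  [{ name: 'list_projects', description: 'List projects' }],
);

console.log(routeToolCall('list_projects').endpoint);
```

Because routing is a single lookup, registering another server only adds entries to the object; the code that executes tool calls does not need to change.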
(Note: A list of links for all articles in this series can be found at the conclusion of this article.) In the previous installments of this series, we traced the arc from raw compliance intent — regulations such as NIST 800-53, FedRAMP, PCI DSS, EU AI Act — all the way to machine-readable OSCAL artifacts managed via GitOps pipelines and Trestle-powered automation. The central thesis has been that treating compliance artifacts as code, subject to the same versioning, testing, and review disciplines as software, is the only sustainable path to continuous assurance at scale. Part 3 of this series explored the collaboration topology: Regulators publishing OSCAL catalogs, Control Providers authoring component definitions, System Owners assembling SSPs, and Assessors generating SAPs and SARs — all mediated by Trestle's markdown-to-OSCAL round-trip. The friction was always the same: every persona still needed CLI fluency or IDE comfort to engage productively with OSCAL JSON. That friction is now removable. The Model Context Protocol (MCP) brings a standardized, AI-agent-ready interface to compliance tooling — and compliance-trestle-mcp, the first OSCAL-native MCP server from the OSCAL Compass community, makes every Trestle operation invocable by any MCP-compliant AI client: Claude, Roo Code, GitHub Copilot Workspace, or a custom agentic pipeline. Compliance-as-Code Game Changer With MCP The Model Context Protocol, incubated under the Linux Foundation and now an industry-wide open standard, provides a JSON-RPC layer by which AI models discover and invoke "tools" — discrete, typed operations exposed by servers. Think of it as the USB-C port for AI agents: standardized, self-describing, composable. Once an MCP server is registered, any compliant client can call its tools without custom integration work. For compliance workflows, this changes the architecture of engagement fundamentally. 
Today, driving Trestle to resolve a NIST 800-53 profile, generate SSP markdown, and assemble the resulting OSCAL JSON requires CLI invocations with precise arguments, work that falls to the Trestle-literate members of a compliance team. With compliance-trestle-mcp, those same operations become natural-language-addressable: an AI assistant executes the correct Trestle command sequence, validates the output, and surfaces results in whatever interface the persona is already working in.

Compliance-trestle-mcp: Architecture and Capabilities

The server is published on PyPI as compliance-trestle-mcp (v0.1.2, February 2026) and registered on the Official MCP Registry at registry.modelcontextprotocol.io under the identifier io.github.oscal-compass/compliance-trestle-mcp. Status is Active. Source: https://github.com/oscal-compass/compliance-trestle-mcp.

Figure 1: compliance-trestle-mcp listed as Active on the Official MCP Registry (registry.modelcontextprotocol.io), v0.1.2.

Tool Surface

Six tools are currently exposed by the server, each wrapping a core Trestle operation:

| Tool | What it does |
| --- | --- |
| trestle_init | Initialize a Trestle workspace, creating the OSCAL folder hierarchy (catalogs, profiles, component-definitions, system-security-plans, etc.) |
| trestle_import | Import an existing OSCAL model (catalog, profile, SSP, component definition) from a local file or remote URL into the active workspace |
| trestle_author_catalog_generate | Generate per-control Markdown files from a catalog JSON, enabling human-readable editing without touching raw OSCAL |
| trestle_author_profile_generate | Generate Markdown documentation for the controls selected by a profile, preserving parameter overrides and guidance additions |
| trestle_author_profile_resolve | Resolve a layered OSCAL profile to a flat resolved-profile catalog, collapsing all imports and modifications |
| trestle_author_profile_assemble | Assemble edited Markdown controls back into a valid OSCAL Profile JSON, completing the round-trip |

Installation (One Liner)

Add the following stanza to your agent's MCP configuration file (e.g., .roo/mcp.json for Roo Code or the Claude Desktop config):

```json
{
  "mcpServers": {
    "trestle": {
      "command": "uvx",
      "args": ["--from", "compliance-trestle-mcp", "trestle-mcp"]
    }
  }
}
```

Personas Revisited: Now With an AI Co-Pilot

Part 3 of this series established the canonical compliance-as-code collaboration model: five personas, each with distinct artifacts, editing interfaces, and OSCAL expertise levels. The MCP layer transforms each persona's relationship with those artifacts.

Regulator

Regulators publish security regulations and standards (NIST 800-53, GDPR, HIPAA) typically as PDFs. With compliance-trestle-mcp, a Regulator's technical team can instruct an AI agent to call trestle_import against a raw OSCAL catalog URL (e.g., the NIST GitHub releases), then trestle_author_catalog_generate to produce reviewable Markdown. Editorial cycles that previously required Trestle CLI expertise are now conversational. The AI handles the workspace plumbing; the domain expert focuses on control prose accuracy.
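Under the hood, a flow like the Regulator's import-then-generate sequence is a pair of JSON-RPC 2.0 tools/call requests sent by the MCP client. The sketch below shows roughly what those messages could look like. The envelope (jsonrpc, id, method, params.name, params.arguments) follows the MCP convention, but the argument names (file, output, name) and the example URL are assumptions for illustration; the real input schema should be read from the server's tools/list response.

```javascript
// Sketch of the JSON-RPC 2.0 messages an MCP client would send to invoke
// two of the server's tools in sequence. The argument names (file, output,
// name) and the example URL are assumptions for illustration; consult the
// server's tools/list output for the real input schema.
let nextId = 1;

function toolCall(name, args) {
  return {
    jsonrpc: '2.0',
    id: nextId++,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

const regulatorFlow = [
  toolCall('trestle_import', {
    file: 'https://example.org/catalog.json',
    output: 'my-catalog',
  }),
  toolCall('trestle_author_catalog_generate', {
    name: 'my-catalog',
    output: 'md_catalog',
  }),
];

console.log(JSON.stringify(regulatorFlow, null, 2));
```

In practice the AI client constructs these requests itself from the conversation; the point of the sketch is only that each persona's natural-language instruction reduces to a short, inspectable sequence of typed tool invocations.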
Compliance Officer/CISO Compliance Officers author organizational overlays — parameter tailoring, guidance additions, control selections — expressed as OSCAL profiles layered on a regulatory catalog. With the MCP server, the AI can be prompted to "resolve the FedRAMP Moderate profile against the NIST 800-53 Rev5 catalog and generate the delta markdown for my SSP authoring queue." The agent chains trestle_author_profile_resolve→ trestle_author_profile_generate autonomously, surfacing the output for human review. This eliminates manual multi-step CLI orchestration and radically compresses profile maintenance cycles. Control Provider (Component Author) Control Providers — the engineers maintaining component definitions that map control implementations to policy-as-code rules — have traditionally needed both OSCAL fluency and DevSecOps context simultaneously. Now, an AI agent can assist by importing existing component definitions, generating Markdown stubs for unmapped controls, and prompting the engineer for implementation prose inline in the chat. The component definition round-trip (JSON → Markdown → edit → trestle_author_profile_assemble → JSON) is fully MCP-orchestrated. System Owner/SSO The System Owner assembles SSPs from profiles and component definitions — historically the most labor-intensive and error-prone step. With compliance-trestle-mcp, an AI agent can be directed to initialize the workspace, import all upstream artifacts, resolve the applicable profile, and generate the SSP Markdown scaffolding in a single conversational exchange. What once required mastery of four distinct Trestle sub-commands and careful argument threading is reduced to a natural-language instruction sequence. Assessor Assessors generating Security Assessment Plans (SAPs) and Reports (SARs) need to trace every selected control back through the SSP to the component definition and the originating catalog. 
With the MCP server, an AI agent can navigate that traceability chain on demand, resolving profiles and surfacing control implementation status, evidence links, and outstanding POA&M items, all without the assessor ever touching Trestle directly.

The Emerging OSCAL MCP Ecosystem

compliance-trestle-mcp is the first OSCAL-native MCP server from an established open-source compliance project, but it is not alone. A brief survey of the emerging ecosystem:

| Server | Origin | Focus |
| --- | --- | --- |
| compliance-trestle-mcp | OSCAL Compass / CNCF Sandbox | Full Trestle workflow: init, import, catalog/profile generate-assemble-resolve. First CNCF OSCAL MCP server. Registered at registry.modelcontextprotocol.io. |
| mcp-server-for-oscal | AWS Labs (awslabs) | OSCAL schema introspection, model listing, and reference resource retrieval. Optimized for AI agents needing authoritative OSCAL structural guidance rather than authoring workflows. |
| OSCAL MCP UI Apps | Atelier Logos / Community | Visual MCP UI layer for FedRAMP and HIPAA OSCAL workflows; interactive SSP visualization and compliance gap analysis via agentic app runtime. |

The AWS Labs server (github.com/awslabs/mcp-server-for-oscal) serves a complementary purpose: where compliance-trestle-mcp is workflow-centric (authoring and assembly), the AWS server is schema-centric (introspection and reference), providing AI agents with authoritative answers about OSCAL model structure, valid element sets, and use-case patterns. Together, they cover both the "what is OSCAL" and "do OSCAL" dimensions of agent-assisted compliance.

NIST's Vision and the CSWP 53 Horizon

The timing is not coincidental. NIST CSWP 53 ("Charting the Course for NIST OSCAL," December 2025 initial public draft) explicitly names agentic AI and digital twins as the next integration frontier for OSCAL: autonomous risk reasoning and continuous assurance driven by AI agents operating on machine-readable compliance artifacts.
The compliance-trestle-mcp server is a concrete early instantiation of exactly that vision, with the CNCF Sandbox project providing governance and sustainability guarantees that standalone tools lack. What Comes Next for compliance-trestle-mcp The v0.1.2 release covers the catalog and profile authoring surface. The roadmap naturally extends toward the full OSCAL lifecycle for AI-assisted System Security Plan and MCP resource exposure — surfacing OSCAL documents as MCP resources (not just tool outputs) so AI clients can reason over live workspace state. Conclusion Compliance as Code has always promised to make compliance automation as natural as software development. The MCP layer removes the final adoption barrier: the requirement for personas to learn Trestle directly. With compliance-trestle-mcp, every compliance stakeholder — from the Regulator drafting a new catalog overlay to the Assessor closing out a FedRAMP SAR — can now engage with OSCAL artifacts through natural language, mediated by an AI agent that understands both the domain and the toolchain. The server is live, registered, and installable in seconds. The OSCAL ecosystem is building out MCP coverage rapidly, with NIST's own roadmap pointing in the same direction. The gap between compliance intent and continuous machine-readable assurance has never been smaller. References and Learn More [1] OSCAL Compass / compliance-trestle-mcp GitHub. https://github.com/oscal-compass/compliance-trestle-mcp [2] Official MCP Registry — io.github.oscal-compass/compliance-trestle-mcp. https://registry.modelcontextprotocol.io [3] AWS Labs mcp-server-for-oscal. https://github.com/awslabs/mcp-server-for-oscal [4] COMPASS Part 3: Artifacts and Personas (DZone). https://dzone.com/articles/compliance-automated-standard-solution-compass-part-3-artifacts-and-personas [5] NIST CSWP 53: Charting the Course for NIST OSCAL (Dec 2025 IPD). 
https://csrc.nist.gov/pubs/cswp/53/charting-the-course-for-nist-oscal/ipd
[6] Building Visual MCP UI Apps for FedRAMP & HIPAA with OSCAL (Atelier Logos, Jan 2026). https://www.atelierlogos.studio/blog/2026-01-08-using-the-aws-mcp-server-for-oscal
[7] OSCAL Hub — Open-Source OSCAL Platform (RegScale / OSCAL Foundation). https://regscale.com/blog/introducing-oscal-hub/
[8] Model Context Protocol Roadmap (Linux Foundation, updated Mar 2026). https://modelcontextprotocol.io/development/roadmap

Below are the links to other articles in this series:

- Compliance Automated Standard Solution (COMPASS), Part 1: Personas and Roles
- Compliance Automated Standard Solution (COMPASS), Part 2: Trestle SDK
- Compliance Automated Standard Solution (COMPASS), Part 3: Artifacts and Personas
- Compliance Automated Standard Solution (COMPASS), Part 4: Topologies of Compliance Policy Administration Centers
- Compliance Automated Standard Solution (COMPASS), Part 5: A Lack of Network Boundaries Invites a Lack of Compliance
- Compliance Automated Standard Solution (COMPASS), Part 6: Compliance to Policy for Multiple Kubernetes Clusters
- Compliance Automated Standard Solution (COMPASS), Part 7: Compliance-to-Policy for IT Operation Policies Using Auditree
- Compliance Automated Standard Solution (COMPASS), Part 8: Agentic AI Policy as Code for Compliance Automation With Prompt Declaration Language
- Compliance Automated Standard Solution (COMPASS), Part 9: Taking OSCAL-Compass to Industry Complexity Level
- Compliance Automated Standard Solution (COMPASS), Part 10: How OSCAL Mapping Paves the Way for Continuous Compliance Scalability
There's a specific kind of failure that never makes the post-mortem blog post. It's not a dramatic outage. There's no war room, no all-hands, no apology email sent to a hundred thousand users. It's quieter than that. It looks like a product that worked beautifully for thirty clients, suddenly becoming unreliable at sixty. It looks like an engineering team that can no longer ship without breaking something else. It looks like a sales pipeline that stalls because the platform can't pass a security questionnaire. This is where most SaaS products actually fail — not at launch, but somewhere around the eighteen-month mark, when the architectural decisions made during the sprint-first MVP phase start extracting their tax. I've been watching this pattern long enough to recognize it early. The symptoms vary; the underlying causes rarely do. This article is an attempt to lay out the structural decisions that determine whether a SaaS platform scales cleanly or degrades under its own weight — and to be specific enough about why things go wrong that the analysis is actually useful. The Multi-Tenancy Decision Is Made Once Every SaaS platform is a multi-tenant system. One application codebase, one infrastructure stack, multiple clients operating inside it simultaneously. That sentence sounds simple. The architectural reality it describes is not. The core question — how you isolate one tenant's data from another's — has a small number of answers, each with a distinct set of long-term consequences. AWS's SaaS Architecture Fundamentals whitepaper offers one of the cleaner frameworks for thinking about this: a spectrum from fully siloed tenancy (dedicated infrastructure per client) to fully pooled tenancy (shared everything, separated by tenant ID in the data layer), with hybrid models in between. The AWS multi-tenant architectures guidance is direct about the fundamental trade-off: "The Silo Model provides the strongest tenant isolation but incurs the most cost and complexity. 
Inversely, the Pool Model offers the least tenant isolation but costs the least." What this framing leaves implicit is worth stating explicitly: whichever model you choose, the choice shapes almost every subsequent technical decision your team will make. Siloed tenancy gives each client a dedicated database instance. Data isolation is structural — a bug affecting one tenant's environment cannot, by definition, reach another's. Compliance requirements from healthcare or financial services clients become dramatically simpler to satisfy because the isolation boundary is physical, not logical. The cost is proportional: you're provisioning, patching, and scaling N database instances, where N grows with your client count. Pooled tenancy places all tenants in a shared schema, differentiated by a tenant ID column embedded in every relevant table. Infrastructure costs are substantially lower, and horizontal scaling benefits all tenants simultaneously. The risk is what practitioners call the noisy neighbor problem: a single tenant running expensive aggregate queries can degrade performance for everyone sharing the same database. More critically, a bug in the tenant-filtering logic — a missing WHERE tenant_id = ?, a misconfigured ORM, a caching layer that doesn't scope keys by tenant — can expose one client's data to another. This failure mode isn't theoretical. It happens. The incidents don't always become public, but they reliably end enterprise contracts and occasionally end companies. Hybrid tenancy — dedicated infrastructure for high-value or compliance-sensitive clients, pooled resources for the long tail — is where most mature platforms land. The operational complexity of managing both models is real, but the economics usually justify it. What's not recoverable is discovering which model you've accidentally built after three years of feature development. 
Retrofitting siloed tenancy onto a codebase that has pooled assumptions baked into a hundred query paths is not a refactor. It's a rewrite. The teams that avoid it are the ones who treat the tenancy decision as an architectural constraint from day one — defined, documented, and intentionally chosen. Start With a Monolith; Plan to Leave It There is a category of architectural advice that circulates with great confidence among engineers who've read extensively about microservices but haven't operated them at scale under incident conditions. The advice is: "Build microservices from the start — it scales better." Martin Fowler's documented observation on this is worth citing directly: almost every successful microservices story started with a monolith that got too large and was split apart. Almost every system built as microservices from the beginning has encountered serious trouble. The trouble is operational. Running twelve services means twelve deployment pipelines, twelve sets of logs, twelve independent failure domains, and a distributed tracing requirement that doesn't exist when you have one process. A team of four engineers who are also building features, writing tests, and responding to client requests does not have the operational bandwidth for this. The cognitive overhead alone slows delivery. The alternative — a modular monolith — is not a compromise. It's a deliberate choice that preserves the ability to move to microservices later, without paying the full operational cost now. A well-structured modular monolith has clean module boundaries, explicit interfaces between modules, and no cross-module data access except through those interfaces. The billing logic doesn't reach into the notification module's tables. The reporting engine doesn't call internal functions of the core domain layer. 
When the time comes to extract the notification service because it needs to scale independently, or because it needs to deploy on a different cadence, there's a clean seam to cut along. You're lifting a well-defined box out of a larger structure, not untangling five years of implicit dependencies. The trigger for that extraction should always be evidence, not intuition. Real performance data. A concrete scaling bottleneck. A deployment coupling that's slowing down a specific team. Not hypothetical future requirements or architectural preference. Statelessness is the constraint that applies regardless of which model you choose. Individual application instances need to be replaceable without ceremony. Session state belongs in a distributed cache — Redis for most teams, though the technology matters less than the principle. File uploads go to object storage. Background jobs are queued and processed independently of the request/response cycle. If you can terminate any running instance without losing data or breaking user sessions, you have horizontal scalability. If you can't, no amount of autoscaling configuration will save you. The CI/CD Pipeline Is a Promise to Your Clients Here's a framing that changes how teams invest in deployment infrastructure: the CI/CD pipeline is not tooling. It is the mechanism by which your engineering organization makes and keeps reliability commitments. Every commit that flows through automated testing and staged deployment is an implicit promise that you are not shipping surprises. Every deployment that uses blue/green or canary strategies is a commitment that you can recover from problems without taking clients offline. The pipeline is the operational expression of your engineering standards. When it's not enforced, those standards become suggestions. A properly constructed pipeline enforces several stages without exception: Source control discipline. Protected main branches. Required pull request reviews. 
Automated checks that block merge on failing tests. This seems obvious. It isn't universal. Automated testing at multiple levels. Unit tests catch logic errors in isolation. Integration tests verify that components interact correctly at boundaries. End-to-end tests validate that user-facing flows behave correctly under production-like conditions. Coverage numbers are a proxy metric, and they get gamed. What matters is whether the test suite catches regressions before they reach clients. Security scanning in the pipeline. Static analysis for common vulnerability patterns. Dependency scanning for known CVEs. Container image scanning before any artifact reaches a deployment stage. None of this replaces a professional security review, but it raises the baseline that your security review starts from, and it catches low-hanging fruit on every commit rather than periodically. Staged deployment with canary releases. A canary release routes a controlled percentage of traffic — five or ten percent — to the new version before full rollout. Error rates and latency are monitored during the canary window. If metrics degrade beyond defined thresholds, the release rolls back automatically. Blue/green deployment maintains two production environments, with the router switching between them on successful validation. Rollbacks take seconds because the previous version is still running. Automated rollback triggers. Post-deployment error rate exceeds a defined threshold? The pipeline reverts without waiting for human acknowledgment. This requires defining what "good" looks like before the deployment goes out, which forces teams to think about observability requirements proactively. The DORA research on software delivery performance is consistent with practitioner experience: teams with mature CI/CD pipelines ship more frequently, experience fewer high-severity incidents, and recover faster when incidents do occur. The correlation isn't coincidental. 
Frequent small deployments are inherently lower-risk than infrequent large ones. The pipeline creates the conditions where frequent deployment is safe.

One practical note on pipeline architecture: the staging environment needs to mirror production in configuration, even if not in scale. Misconfigured environment variables, incorrect secrets injection, and infrastructure assumptions that don't hold in the target environment all generate bugs that only appear at deployment and can't be caught by any amount of unit testing.

Observability: What You Cannot See, You Cannot Fix

Observability is the property of a system that allows you to understand its internal state from the signals it produces. Logs, metrics, and distributed traces are the three pillars. Most teams have logs. Fewer have metrics instrumented at meaningful granularity. Fewer still have distributed tracing that lets an engineer follow a single user request through every service it touches.

The Google SRE team's framework (the four golden signals of latency, traffic, errors, and saturation) remains the clearest starting point for deciding what to measure. If you instrument nothing else, instrument these four things. They answer the question "Is the system working correctly right now?" without requiring an engineer to synthesize information from a dozen different dashboards.

The gap matters most during incidents. When a client reports slow dashboards and the on-call engineer has only raw application logs to work with (logs that say "request processed in 4.3 seconds" without any breakdown of where that time went), the mean time to resolution depends entirely on how quickly the engineer's intuition gets lucky. When the same engineer has distributed traces showing the request blocking for 3.9 seconds waiting on a single database query in the reporting service, the resolution path is immediate.
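To make the difference between "4.3 seconds total" and "3.9 seconds in one query" concrete, here is a minimal sketch of per-stage timing for a single request. It is not a real tracing library: the `Trace` and `Span` classes, the span names, and the injectable `clock` are all hypothetical, and a production system would more likely use an established tracing SDK that propagates context across services.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Span:
    name: str
    duration_s: float


@dataclass
class Trace:
    """Collects timed spans for one request (hypothetical helper)."""
    tenant_id: str
    clock: Callable[[], float] = time.perf_counter
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        # Record how long the wrapped block took, even if it raises.
        start = self.clock()
        try:
            yield
        finally:
            self.spans.append(Span(name, self.clock() - start))

    def total_s(self) -> float:
        return sum(s.duration_s for s in self.spans)

    def slowest(self) -> Span:
        return max(self.spans, key=lambda s: s.duration_s)


if __name__ == "__main__":
    # A fake clock stands in for real work so the example is deterministic.
    ticks = iter([0.0, 0.1, 0.1, 4.0, 4.0, 4.3])
    trace = Trace(tenant_id="tenant-a", clock=lambda: next(ticks))
    for stage in ("auth", "reporting.db_query", "render"):
        with trace.span(stage):
            pass
    worst = trace.slowest()
    print(f"trace {trace.trace_id} total {trace.total_s():.1f}s, "
          f"slowest stage: {worst.name} ({worst.duration_s:.1f}s)")
```

The point of the sketch is the shape of the data: once every stage reports its own duration under a shared trace ID, the slow stage can be read off directly instead of inferred.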
For multi-tenant SaaS specifically, per-tenant observability is a non-optional requirement that general monitoring guidance doesn't address. The ability to filter every metric, log line, and trace by tenant ID enables two things that matter:

When a specific client reports a problem, you can immediately determine whether it's a platform-wide issue or specific to their tenant.

You can detect the noisy neighbor problem in metrics before the affected client experiences it in their user interface.

A single tenant whose analytics jobs are consuming disproportionate database CPU will appear in per-tenant metrics as an anomaly before their query patterns start affecting neighboring tenants' response times. That's the kind of early signal that separates reactive operations from proactive ones.

Service Level Objectives translate quality commitments into measurable engineering targets. An SLO is not an SLA. SLAs are contractual commitments to clients; SLOs are internal targets that the engineering team holds itself to, set stricter than the SLA so that there is a buffer before a contractual commitment is at risk.

Alerting on SLO burn rate ("we're consuming our weekly error budget at three times the sustainable rate") is meaningfully different from alerting on static thresholds like "error rate above 1%." The former fires on conditions that threaten the actual reliability commitment. The latter fires on every routine blip until engineers learn to ignore it.

The SRE workbook's case studies on SLO implementation are worth reading carefully for teams setting up SLOs for the first time. The recurring insight is that getting SLOs slightly wrong is better than having no SLOs, and that they improve through iteration as the team develops better intuitions about what clients actually care about.
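The burn-rate idea reduces to a small calculation: the observed error rate divided by the error rate the SLO allows. The sketch below assumes a request-based SLO; the function names, the 99.9% target, the paging threshold, and the two-window check are illustrative choices, not values taken from the article.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    1.0 means the budget would be exactly used up over the SLO period;
    3.0 means it is being consumed three times too fast.
    """
    budget = error_budget(slo_target)
    if total == 0:
        return 0.0
    return (failed / total) / budget


def should_page(long_window, short_window, slo_target: float, threshold: float) -> bool:
    """Page only if both windows burn faster than the threshold.

    Each window is a (failed, total) pair. Requiring the short window too
    (an assumption of this sketch) avoids paging for a problem that has
    already stopped.
    """
    return all(
        burn_rate(failed, total, slo_target) >= threshold
        for failed, total in (long_window, short_window)
    )


if __name__ == "__main__":
    # 30 failures in 10,000 requests against a 99.9% SLO.
    print(f"burn rate: {burn_rate(30, 10_000, 0.999):.2f}")
    print("page:", should_page((30, 10_000), (5, 1_000), 0.999, threshold=2.0))
```

A static "error rate above 1%" rule would stay silent at 0.3% errors, even though at that rate a 99.9% SLO is burning its budget about three times too fast.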
Caching Is Architecture, Not Optimization

There's a point in the growth curve of most SaaS platforms, somewhere between one hundred and five hundred active users, where the engineering team discovers that their application has been making an implicit performance bet. Every page load triggers database queries that should have been answered from a cache. Every API call recomputes values that could have been stored. The system that felt responsive at twenty clients is visibly straining at two hundred.

The teams that handle this gracefully anticipated it. They designed caching into the architecture rather than retrofitting it as an emergency optimization.

In a multi-tenant SaaS context, caching is more complex than "put Redis in front of your database." Every cached object must be scoped to a specific tenant. Cached data for Tenant A cannot, under any circumstances, be served to Tenant B. Cache key design must include tenant ID as a required component: not an optional one, not something checked at read time, but structurally embedded in every key.

Cache invalidation, famously one of the two hard problems in computer science, becomes harder in multi-tenant environments because you're managing invalidation across tenant boundaries, and harder still when multiple application instances each maintain their own local in-process cache. An update to Tenant A's configuration needs to invalidate the right cache entries across every instance. Getting this wrong produces subtle, intermittent bugs that are difficult to reproduce and unpleasant to debug.

A layered caching strategy handles different data categories appropriately.

In-process cache for hot, rarely-changing data (feature flags, tenant configuration, static reference data).

Distributed cache (Redis or equivalent) for session data, frequently-accessed query results, and computed aggregates that are expensive to regenerate.
CDN for static assets, public-facing content, and anything that can be served without touching the application layer.

Queue-based async processing is the complementary pattern for handling workload spikes without translating them into latency spikes. Long-running operations (report generation, bulk exports, email campaigns, file processing) do not belong in the synchronous request/response cycle. They belong in a job queue. The user receives an acknowledgment that the job has been accepted. The job runs in the background. The result is delivered when it's complete. This keeps p99 response times stable even under unusual load conditions, which is what enterprise SLAs actually measure.

Security Is an Architecture Constraint, Not a Feature

The framing problem with enterprise SaaS security is that most development teams treat it as a compliance checklist (a set of features to implement before a security audit) rather than a design constraint that shapes the system from the beginning.

The OWASP Top 10 Proactive Controls are explicit about this for access control specifically:

"Once you have chosen a specific access control design pattern, it is often difficult and time-consuming to re-engineer access control in your application with a new pattern. Access Control is one of the main areas of application security design that must be thoroughly designed up front, especially when addressing requirements like multi-tenancy and horizontal (data dependent) access control."

The architectural implication: your access control model should be able to answer a three-variable question before every data access: does user X have permission Y in tenant Z? Note all three variables. A user with full administrative permissions in their own tenant has zero permissions in any other tenant. A service account with cross-tenant reporting access should be an explicit, audited exception, not an assumed default.
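The three-variable question can be sketched as a deny-by-default lookup keyed on the (user, tenant) pair. Everything here is illustrative: `AccessPolicy`, `grant`, `is_allowed`, `require`, and the permission strings are hypothetical names, and a real system would back this with persistent storage and role definitions rather than an in-memory dictionary.

```python
from collections import defaultdict


class AccessPolicy:
    """Deny-by-default permission store keyed by (user, tenant)."""

    def __init__(self):
        self._grants = defaultdict(set)

    def grant(self, user_id: str, tenant_id: str, permission: str) -> None:
        # Grants are always tenant-scoped; there is no global grant.
        self._grants[(user_id, tenant_id)].add(permission)

    def is_allowed(self, user_id: str, permission: str, tenant_id: str) -> bool:
        # .get() avoids creating empty entries for unknown pairs.
        return permission in self._grants.get((user_id, tenant_id), ())

    def require(self, user_id: str, permission: str, tenant_id: str) -> None:
        if not self.is_allowed(user_id, permission, tenant_id):
            raise PermissionError(
                f"{user_id} lacks {permission} in tenant {tenant_id}"
            )


if __name__ == "__main__":
    policy = AccessPolicy()
    policy.grant("alice", "tenant-a", "invoices:delete")
    # Cross-tenant access exists only where it was granted explicitly.
    policy.grant("svc-reporting", "tenant-a", "reports:read")
    policy.grant("svc-reporting", "tenant-b", "reports:read")

    print(policy.is_allowed("alice", "invoices:delete", "tenant-a"))
    print(policy.is_allowed("alice", "invoices:delete", "tenant-b"))
```

Because the tenant is part of the lookup key rather than a separate check, an administrator in one tenant has no entry at all for any other tenant, so the answer there is a denial without any special-case code.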
Role-Based Access Control implemented at the framework level, where permission checks happen automatically on every request, is fundamentally more secure than RBAC implemented at the individual endpoint level, where checks can be forgotten or inconsistently applied.

Audit logging is the forensic record that makes security audits tractable and incident investigations answerable. Every action that creates, modifies, or deletes sensitive data (and ideally, every access to sensitive data) should generate an immutable log entry recording: who took the action, which tenant they were acting within, what data was affected, and when. This is not only a compliance requirement. It's the record that lets you answer "what happened to this client's data between Tuesday evening and Wednesday morning" when that question needs answering under time pressure.

Broken Access Control has held the top position on the OWASP Top 10 since 2021. In multi-tenant SaaS, it's not just the most common vulnerability; it's the one that carries the most severe consequences, because a broken access control bug doesn't just affect one user. It can expose one tenant's entire dataset to another tenant.

SSO federation and enforced MFA address the credential attack surface. The majority of cloud environment security incidents involve compromised credentials, not novel exploits. Allowing enterprise clients to authenticate through their existing identity provider reduces credential surface area and eliminates the parallel set of credentials that would otherwise need to be managed, rotated, and secured.

Dependency and container image scanning in the CI/CD pipeline handles the supply chain attack surface. Known CVEs in third-party packages are a growing attack vector. Automated scanning on every build, blocking deployments when critical vulnerabilities are detected, keeps the baseline clean without requiring manual security reviews for every dependency update.
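The audit entry described earlier in this section (who, which tenant, what data, when) can be sketched as a frozen record in an append-only log. The hash chaining is an addition of this sketch, one possible way to make tampering detectable, and is not something the article prescribes. The `AuditEntry` and `AuditLog` names and their fields are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64


@dataclass(frozen=True)
class AuditEntry:
    actor_id: str      # who took the action
    tenant_id: str     # which tenant they were acting within
    action: str        # e.g. "create", "update", "delete", "read"
    resource: str      # what data was affected
    occurred_at: str   # when, as an ISO 8601 UTC timestamp
    prev_hash: str     # hash of the previous entry in the chain
    entry_hash: str = ""


def _digest(entry: AuditEntry) -> str:
    # Hash every field except the entry's own hash.
    payload = asdict(entry)
    payload.pop("entry_hash")
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class AuditLog:
    """Append-only, in-memory log; a real one would use durable storage."""

    def __init__(self):
        self._entries = []

    def record(self, actor_id, tenant_id, action, resource, occurred_at=None):
        when = occurred_at or datetime.now(timezone.utc).isoformat()
        prev = self._entries[-1].entry_hash if self._entries else GENESIS_HASH
        draft = AuditEntry(actor_id, tenant_id, action, resource, when, prev)
        entry = replace(draft, entry_hash=_digest(draft))
        self._entries.append(entry)
        return entry

    def entries_for(self, tenant_id):
        return [e for e in self._entries if e.tenant_id == tenant_id]

    def verify(self) -> bool:
        # Recompute the chain; any edited or reordered entry breaks it.
        prev = GENESIS_HASH
        for entry in self._entries:
            if entry.prev_hash != prev or entry.entry_hash != _digest(entry):
                return False
            prev = entry.entry_hash
        return True
```

Filtering with `entries_for` and a time range is what answers the "Tuesday evening to Wednesday morning" question, and `verify` gives some assurance that the answer has not been altered since it was written.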
Why So Many Platforms Stumble Quietly

The failures rarely announce themselves dramatically. There's rarely a single decision you can point to. The pattern is a series of small optimizations for short-term velocity that individually make sense and collectively produce an architecture that resists change, punishes growth, and generates incidents faster than the team can resolve them.

Treating SaaS like a desktop application. Session state held in process memory. File writes to local disk. Synchronous operations for everything. No consideration for multiple concurrent instances. This architecture has a hard ceiling on horizontal scalability that isn't visible until you're past the point where addressing it is easy.

Neglecting tenant isolation until after the first incident. "We'll add proper tenant isolation once we have more clients" is a statement that makes practical sense and architectural nonsense. The isolation boundary is cheapest to implement correctly before there's existing code to refactor and existing clients whose data is stored in ways that need to be migrated.

Skipping automated testing because there's no time. The codebase gradually becomes too risky to refactor. The parts that aren't understood don't get touched. Tests that were never written don't get written retroactively because the cost of retrofitting tests is higher than writing them alongside the code. Features slow down. Good engineers leave.

Building observability as an afterthought. When incidents occur, and they will occur, the engineering team is debugging production systems with inadequate information, under client pressure, without the data they need to isolate the root cause quickly. Mean time to recovery extends. Trust erodes. The SLA that seemed achievable suddenly isn't.

Designing for the first twenty clients, not the first two hundred. This one is subtle because the decisions feel responsible at the time. A shared database works fine for twenty clients.
A monolith with no queue-based async works fine at low volume. A single deployment environment is fine for a small team. None of these are wrong in isolation. They become wrong when they're treated as permanent rather than temporary, when the plan to address them "when we need to" never gets made concrete.

The honest summary is this: the decisions that are expensive to change later are cheapest to make correctly at the beginning. Not because teams should over-engineer early systems, but because the specific set of decisions that require early attention (tenant isolation model, stateless service design, CI/CD infrastructure, access control architecture) are structural, not incidental. Getting them right doesn't add months to the timeline. It adds a few weeks of design discipline that prevents a year of unplanned remediation.

Applying This in Practice: An Engineering Lifecycle

None of the above is useful as abstract principle. Here's what it looks like as a working process.

Discovery and architecture design: Before writing code, define the problem space, the target client profiles, the compliance requirements, and the expected scale envelope. These inputs determine the tenant isolation model. They determine the access control design. They determine what "encrypted at rest" means for this specific platform. The output is a set of documented architecture decision records, not a market analysis.

Infrastructure before features: The CI/CD pipeline, observability stack, secrets management system, and staging environment should exist before the first feature is developed. This is the investment that pays dividends across every subsequent sprint. A pipeline that's been running for six months has established a baseline of normal behavior; deviations from that baseline during deployments are immediately visible.

Test-driven feature development: Code doesn't merge without tests.
Not because 100% coverage is the goal, but because a test written for a new behavior is the cheapest possible insurance against that behavior regressing in a future sprint.

Per-tenant metrics from the start: Instrumenting tenant ID into your metrics and logging schema from the beginning costs almost nothing. Retrofitting it into a mature observability stack after you have fifty tenants costs considerably more, and the retrofitted version is never as clean.

Scheduled security and performance reviews: Not one-time events before launch. Recurring checkpoints. Load testing that simulates realistic tenant distributions. Security reviews that look for new attack surface introduced by recent features.

Evidence-driven architectural evolution: As the platform grows, observability data guides structural changes. A service that needs to scale independently gets extracted when the data shows it's a bottleneck, not when someone has an architectural preference for microservices.

Conclusion

Architectural foresight isn't caution. It isn't the enemy of velocity. It's the precondition for sustained velocity, the kind that lets teams ship confidently at month twenty-four rather than spending month twenty-four unwinding the debt from month six.

The SaaS platforms that degrade quietly at scale don't fail because they ran out of good ideas. They fail because the structural decisions made when speed was the only metric start exacting costs that compound faster than the team can pay them down. Multi-tenant isolation decisions made incorrectly become security incidents. CI/CD pipelines that were never built become deployment bottlenecks. Access control implemented as a checklist item becomes a failed enterprise security review.

The specific decisions that prevent this aren't exotic. They're established. They're documented. They're the kind of decisions that experienced teams have been making and refining for a decade.
The value in understanding them clearly is that you can make them deliberately, before the consequences of the wrong choice are already in production.

References and Further Reading

AWS SaaS Architecture Fundamentals Whitepaper – AWS's foundational framework for tenancy models and SaaS architecture
AWS Guidance for Multi-Tenant Architectures – Silo, bridge, and pool model implementation patterns
Martin Fowler: Breaking a Monolith into Microservices – Practical patterns for architectural evolution
Google SRE Book: Monitoring Distributed Systems – Four golden signals and SLO methodology
Google SRE Workbook: SLO Case Studies – Real-world SLO implementation at Evernote and Home Depot
OWASP Top 10 Proactive Controls: Access Control – Access control design for multi-tenant environments
OWASP Top 10 – Current web application security risk rankings
SapientPro SaaS Development – Architecture, multi-tenant platform design, and CI/CD delivery for SaaS products