Shipping Production-Grade AI Agents

Table of Contents

Introduction
What Is a Production-Ready Agent?
Configuration and Governance
Building the Agent: State, Memory, and Security
Testing, Eval Gates, and Deployment
Running the Agent: Monitoring, Cost, and Feedback Loops
Conclusion

Section 1

Introduction

Building an AI agent that works on a laptop is a weekend project. Building one that works for real users, under real load, with real data, is a different problem. Demos run under your supervision. Production runs at 2am on inputs you did not anticipate, with no one watching. This Refcard gives engineers and tech leads a concrete path through the decisions that shape everything downstream: config and model governance, state and memory design, security boundaries, evaluation before every release, and operational monitoring once you go live.

Section 2

What Is a Production-Ready Agent?

This section defines what production readiness means. Before architecture choices and tooling, it pays to be precise about what breaks when a prototype meets real traffic, because those failure modes are what the rest of this Refcard is designed to prevent.

The Four Things That Break at Scale

Most prototype-to-production failures fall into one of four categories. They are predictable, preventable, and almost always appear together. Knowing what to expect lets you design against them from the start.

What Breaks	Why It Breaks	What it Looks Like
State	Prototypes keep everything in memory. Prod has parallel sessions, restarts, and crashes.	Two users get each other’s context. A process restart wipes a mid-task job with no way to resume.
Secrets	API keys are hard-coded or stored in `.env` files that travel with the codebase.	A key ends up in a shared repo or log file. Rotation triggers an outage because the old key was baked into a config nobody updated.
Trust	The agent can call any tool with any argument, with no validation layer between intent and execution.	A malicious input tricks an agent into calling a delete endpoint. An argument type mismatch fires a tool against the wrong resource ID.
Observability	Instrumentation beyond print statements was never wired up during rapid prototyping.	The agent starts giving subtly wrong answers. Nobody finds out until a user files a support ticket days later.

Architecture Patterns

Pick your architecture before you write the first route handler. Changing it later means restructuring state, testing, and tracing simultaneously. Three patterns cover most production use cases.

Pattern	How it Works	Best Fit	Where it Breaks
ReAct	Think, act, observe, repeat until the task is done	Variable-step tasks with many tool options	Loops without a step budget; hard to trace mid-run
Plan and Execute	Build a complete plan first, then run each step in sequence	Predictable long-horizon tasks with stable structure	Plans go stale when the environment changes mid-run
Multi-Agent	Specialist agents hand tasks to each other through a coordinator	Complex workflows that need clear role separation	State passing between agents; debugging requires a full cross-agent trace

Orchestration Patterns

A single looping agent is easy to reason about locally. Once multiple agents are handing tasks to each other, retrying on failure, and reading from shared context, you need an orchestration layer that manages coordination explicitly. Without it, you end up with untracked state and failures that are nearly impossible to reproduce. Orchestration turns a collection of agents into a system another engineer can own and operate.

The Cost of DIY Infrastructure

Custom pipelines work until the team starts spending more time maintaining infrastructure than building the agent. Every hand-rolled piece is something you maintain, upgrade, and debug. Beyond time cost, fragmented pipelines create operational risk: config inconsistencies, secrets in too many places, and deployment steps only one person knows. Enterprise platforms absorb this layer so engineering stays focused on the actual problem.

Deployment Promotion Path

Every agent should travel through a consistent set of environments before reaching production. Each gate in the path is an opportunity to catch problems while they are still cheap to fix. Automating the checks at each gate is what prevents human error from letting a bad build through.

Figure 1: Agent deployment promotion path

Every agent should earn its way through each gate. Automated checks block promotion until the previous stage is confirmed stable.

Frameworks and Orchestration Tools

The tools in this category handle the execution model: how the agent reasons, how steps chain together, how sub-agents communicate, and how long-running tasks stay durable across failures. Choosing a framework early matters because it sets the primitives your eval, tracing, and deployment tooling will integrate with. The table contains example open-source tools and frameworks for your reference.

Tool	Purpose
LangGraph	Stateful multi-step agents
CrewAI	Multi-agent role coordination
AutoGen	Agent conversation loops
Haystack	RAG and pipeline agents
Semantic Kernel	Enterprise .NET and Python agents
Agno	Lightweight agent runtime
Temporal	Durable workflow orchestration
Pydantic AI	Type-safe agent construction

Section 3

Configuration and Governance

Config and governance are prerequisites to agent logic, not operational details you add later. How you handle config determines whether the same agent runs reliably across environments, whether a model change requires a code release, and whether your team can rotate secrets without an outage.

Environment Promotion Model

The same agent should run identically in local dev, staging, and production. What changes between environments is the config and secrets injected at runtime. Use a config service that understands named environments so developers can pull staging config locally without copying files. A single pipeline handles promotion, swapping environment-specific values automatically. Without this discipline, environment-specific bugs become a constant drain because every environment differs in ways that are hard to reproduce.
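As a minimal sketch of named environments, the snippet below resolves one config object from an environment name. The `AGENT_ENV` variable, the `ENVIRONMENTS` table, and the URLs are illustrative assumptions; a real setup would pull these values from a config service rather than a dict in source.

```python
import os
from dataclasses import dataclass

# Hypothetical per-environment settings, shown inline only for illustration.
ENVIRONMENTS = {
    "local": {"api_base": "http://localhost:8000", "log_level": "DEBUG"},
    "staging": {"api_base": "https://staging.example.internal", "log_level": "INFO"},
    "production": {"api_base": "https://agents.example.internal", "log_level": "WARNING"},
}


@dataclass(frozen=True)
class AgentConfig:
    environment: str
    api_base: str
    log_level: str


def load_config(env=None):
    """Resolve config for a named environment; fail loudly on unknown names."""
    name = env or os.environ.get("AGENT_ENV", "local")
    if name not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {name!r}")
    return AgentConfig(environment=name, **ENVIRONMENTS[name])
```

Calling `load_config("staging")` on a laptop returns the same shape of object the staging deployment sees, which is the property the promotion model depends on.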

Figure 2: Secrets management in each environment

Local .env files stay on the developer’s machine and never reach staging or production.

AI Model Management and Control

The model’s name and version should live in config, not code. Treat model selection the same way you treat a database connection string: Change it by updating config and restarting the service, with no code diff required. This lets you roll back a degraded model, swap for compliance, or run cost-optimized models in lower environments without a release cycle. Governance over which teams can change which model versions belongs in your config story from day one.
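One way to keep model selection out of code is to read it from the environment at startup and refuse to start without it. This is a sketch only: the variable names `AGENT_MODEL_NAME` and `AGENT_MODEL_VERSION` and the `name@version` format are assumptions, not a standard.

```python
import os


def resolve_model(environ=None):
    """Read the model name and version from config (here, environment variables)."""
    env = os.environ if environ is None else environ
    name = env.get("AGENT_MODEL_NAME")
    version = env.get("AGENT_MODEL_VERSION")
    if not name or not version:
        # No fallback in code: a missing value is a config error, not a default.
        raise RuntimeError("AGENT_MODEL_NAME and AGENT_MODEL_VERSION must be set")
    return f"{name}@{version}"
```

Because there is no hard-coded default, rolling back a model is a config change plus a restart, and a misconfigured environment fails at startup instead of silently running the wrong model.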

Feature Flags for Agent Behavior

Feature flags let you change agent behavior, such as enabling a new tool or switching model versions, without a redeploy. New capabilities can be toggled for a subset of users before rolling out broadly, and a kill switch lets you disable a misbehaving tool instantly without rolling back. Keep tool-access flags and model-selection flags separate since their risk profiles and approval requirements differ.
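The sketch below shows one possible shape for tool-access flags with a kill switch and a percentage rollout. The flag table, tool names, and hashing scheme are hypothetical; a hosted flag service would normally replace the dict.

```python
import hashlib

# Hypothetical tool-access flags, kept separate from any model-selection flags.
TOOL_FLAGS = {
    "web_search": {"enabled": True, "rollout_percent": 100},
    "send_email": {"enabled": True, "rollout_percent": 10},
    # Kill switch engaged: the tool is disabled for everyone.
    "delete_record": {"enabled": False, "rollout_percent": 0},
}


def bucket(user_id, flag):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def tool_enabled(flag, user_id, flags=TOOL_FLAGS):
    """Return True only if the flag exists, is on, and the user is in the rollout."""
    cfg = flags.get(flag)
    if cfg is None or not cfg["enabled"]:
        return False  # unknown or killed tools are always off
    return bucket(user_id, flag) < cfg["rollout_percent"]
```

Hashing the flag name together with the user ID keeps each user's assignment stable across requests while giving different flags independent rollout groups.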

Manual Plumbing vs. Unified Platform

Hand-built pipelines work until the engineer who built them leaves, or until you scale from one agent to 10. A unified platform handles config management, model governance, secrets rotation, deployment pipelines, and access control as a coherent system. The risk of custom DevOps for AI is not just time. It is consistency gaps, undocumented steps, and outages because no one updated the third system when the API key rotated.

Never commit .env files to source control. Add them to .gitignore globally and run a secrets scanning tool like truffleHog or git-secrets in your CI pipeline to catch credentials before they reach a shared branch.

Section 4

Building the Agent: State, Memory, and Security

State, memory, and security produce the most production incidents in agent systems, and are the most commonly deferred until after launch. The decisions in this section affect your testing strategy, deployment model, and incident response when something goes wrong.

Short-Term vs. Long-Term Memory

Memory in an agent system means two things: where information lives between steps right now, and whether it survives the session. Too little persistence and the agent cannot complete multi-turn tasks; too much and you carry compliance risk and retrieval latency you did not plan for.

Short-term memory lives in the context window or a TTL-keyed store, and it is the right default for most tasks. Do not reach for long-term memory until you have a concrete reason to, at which point more than a database is required. You need a schema, indexing strategy, retrieval mechanism, deletion path, and retention policy reviewed with your legal team before you write the first row.

The table below summarizes when to use each type and which backends fit each scope.

Aspect	Short-Term Memory	Long-Term Memory
Scope	Current session only	Cross-session, durable
Best for	Single-turn and multi-step tasks in one session	User preferences, past decisions, semantic retrieval
Cleared when	Session ends	Explicit deletion or policy
Key requirement	TTL or window management	Schema, indexing, retention policy, deletion path
Example tools	Redis with TTL, LangGraph checkpoints, or in-memory dict	Vector store such as Chroma
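To make the short-term column concrete, here is a minimal in-memory store with a per-session TTL. The class name and the sliding-expiry behavior (each write refreshes the TTL) are assumptions for illustration; Redis with TTL would play this role across processes.

```python
import time


class SessionMemory:
    """In-memory short-term store whose sessions expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}  # session_id -> (expires_at, messages)

    def get(self, session_id):
        entry = self._items.get(session_id)
        if entry is None:
            return []
        expires_at, messages = entry
        if self._clock() >= expires_at:
            del self._items[session_id]  # expired: drop the session's context
            return []
        return list(messages)

    def append(self, session_id, message):
        # Each write refreshes the TTL, so active sessions stay alive.
        messages = self.get(session_id) + [message]
        self._items[session_id] = (self._clock() + self._ttl, messages)
```

Keying every read and write by session ID is what prevents two users from seeing each other's context, the state failure described in Section 2.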

Checkpointing Long-Running Tasks

An agent running a 20-step task can fail at step 14. Without a checkpoint, the entire job restarts, burning tokens and quota on work already done. Write agent state to a durable store after each significant step, tagged with a step identifier and timestamp. On restart, load the latest checkpoint and continue from there.
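The checkpoint-and-resume idea can be sketched with the standard library's `sqlite3` as the durable store. The table layout and function names are illustrative assumptions, and the sketch assumes each step is a function that takes and returns a JSON-serializable state dict.

```python
import json
import sqlite3
import time


def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "job_id TEXT, step INTEGER, saved_at REAL, state TEXT, "
        "PRIMARY KEY (job_id, step))"
    )
    return conn


def save_checkpoint(conn, job_id, step, state):
    # Tag each checkpoint with a step identifier and a timestamp.
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
        (job_id, step, time.time(), json.dumps(state)),
    )
    conn.commit()


def load_latest(conn, job_id):
    row = conn.execute(
        "SELECT step, state FROM checkpoints WHERE job_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, {})


def run_job(conn, job_id, steps):
    """Run steps in order, resuming after the last checkpointed step."""
    done, state = load_latest(conn, job_id)
    for index in range(done, len(steps)):
        state = steps[index](state)
        save_checkpoint(conn, job_id, index + 1, state)
    return state
```

If a step raises, everything before it is already persisted, so calling `run_job` again with the same `job_id` skips the completed steps. Steps with external side effects still need to be idempotent, since a crash between the side effect and the checkpoint write would repeat that one step.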

The Six Threats Specific to Agent Systems

Agents process content from the outside world and act on it, creating an attack surface that standard security checklists do not fully address. The following six threats require specific mitigations at the architecture level.

Threat	How it Happens	What to Do
Prompt injection	Hostile content in tool output rewrites agent’s planned action	Treat all external content as untrusted; sanitize before it enters the context window
Tool abuse	Agent calls tool with wrong arguments or fires far more than expected	Validate arguments against a strict schema; enforce step and call budgets per run
Data exfiltration	Agent leaks sensitive data through output or external API calls	Apply output filtering before responses leave; enforce network egress allowlists on containers
Privilege escalation	Agent accesses tool or resource outside its assigned role	Use explicit per-role allowlists at runtime, not denylists; audit every tool call
Runaway execution	Agent loops indefinitely or fires hundreds of tool calls	Set hard limits on steps, tokens, and wall-clock time; return an error, not a continuation
LLM-generated code	Agent executes code that may access unintended resources	Sandbox all generated code with no network access and a read-only filesystem
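Two of the mitigations in the table, strict argument schemas and per-run call budgets, can be sketched in a few lines. The tool names, schema format, and `registry` mapping are hypothetical; a production system would more likely use a schema library and also cap tokens and wall-clock time.

```python
class BudgetExceeded(Exception):
    """Raised when a run has used up its tool-call budget."""


# Hypothetical tool schemas: argument name -> required Python type.
TOOL_SCHEMAS = {
    "get_order": {"order_id": int},
    "search_docs": {"query": str, "limit": int},
}


def validate_call(tool, args, schemas=TOOL_SCHEMAS):
    """Reject unknown tools, missing or extra arguments, and wrong types."""
    if tool not in schemas:
        raise ValueError(f"unknown tool: {tool}")
    schema = schemas[tool]
    if set(args) != set(schema):
        raise ValueError(f"{tool}: expected arguments {sorted(schema)}")
    for name, expected in schema.items():
        # Exact type match, so True is not accepted where an int is required.
        if type(args[name]) is not expected:
            raise ValueError(f"{tool}: {name} must be {expected.__name__}")


class RunBudget:
    """Hard cap on tool calls for a single agent run."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.calls = 0

    def charge(self):
        if self.calls >= self.max_calls:
            raise BudgetExceeded(f"tool call budget of {self.max_calls} exhausted")
        self.calls += 1


def call_tool(tool, args, budget, registry):
    budget.charge()  # rejected calls still count, so invalid-call loops terminate
    validate_call(tool, args)
    return registry[tool](**args)
```

When the budget is exhausted the run ends with an error instead of continuing, which matches the "return an error, not a continuation" guidance for runaway execution.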

Implementing Configurable Agent Guardrails

Guardrails enforced at the platform layer are out of reach of the attacks that subvert prompt-level instructions. Each agent role needs a declarative, version-controlled allowlist specifying which tools it can call, which data it can read, and which arguments it can pass. Enforcement happens at runtime, below the agent logic, so the system cannot exceed its mandate, even if its instructions are tampered with.
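A declarative per-role allowlist might look like the sketch below. The role names, tool names, and policy layout are assumptions; in practice the policy would live in a version-controlled file loaded at startup, and the check would sit in the runtime that dispatches tool calls, not in the prompt.

```python
# Hypothetical policy, shown inline only for illustration.
ROLE_POLICY = {
    "support_agent": {"tools": {"search_docs", "get_order"}, "data": {"orders"}},
    "billing_agent": {
        "tools": {"get_order", "issue_refund"},
        "data": {"orders", "payments"},
    },
}


class PolicyViolation(Exception):
    """Raised when a role attempts something outside its allowlist."""


def authorize(role, tool, dataset, policy=ROLE_POLICY):
    """Allow a call only if both the tool and the dataset are allowlisted."""
    rules = policy.get(role)
    if rules is None:
        raise PolicyViolation(f"no policy for role {role!r}")
    if tool not in rules["tools"]:
        raise PolicyViolation(f"{role} may not call {tool}")
    if dataset not in rules["data"]:
        raise PolicyViolation(f"{role} may not read {dataset}")
    return True
```

Anything not listed is denied by default, including roles that have no policy at all, so a new tool stays unreachable until someone adds it to the allowlist in a reviewed change.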

Human-in-the-Loop Gates

Deleting records, sending messages outside the organization, triggering payments, and making infrastructure changes all require human confirmation before execution. Build gates as named, explicit objects in your system that are straightforward to add and impossible to bypass accidentally. Log every gate trigger, approval, rejection, and the time each decision took.
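A gate as a named, explicit object could be sketched as follows. `ApprovalGate` and the `ask_human` callback are hypothetical; the callback stands in for whatever approval channel you use, such as a ticket, a chat prompt, or a review UI.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ApprovalGate:
    """Named gate that blocks an action until a human decision is recorded."""

    name: str
    log: list = field(default_factory=list)

    def run(self, action, description, ask_human, clock=time.monotonic):
        started = clock()
        approved = bool(ask_human(self.name, description))
        # Log every trigger, its outcome, and how long the decision took.
        self.log.append({
            "gate": self.name,
            "description": description,
            "approved": approved,
            "decision_seconds": clock() - started,
        })
        if not approved:
            raise PermissionError(f"{self.name}: rejected by reviewer")
        return action()
```

Because the risky action is passed in as a callable and only invoked after approval, there is no code path that reaches it without going through the gate.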

Security Layers

Security in an agent system is a stack of independent layers, each of which fails in isolation. A compromise in one layer should not grant access to the capabilities controlled by the layers above it.

Figure 3: Security layers

Each layer is enforced independently. A bypass in one does not open the others.

Human-in-the-loop gates are not optional for actions that delete data, send external messages, or trigger payments. Wire them in at project start, not after the first incident.

Section 5

Testing, Eval Gates, and Deployment

Shipping without a proper evaluation setup is the same as deploying with no tests. With agents, the failure mode is often invisible: plausible-looking answers that are wrong, or refusals on tasks that should succeed. This section explains what a complete eval setup looks like, how to connect it to your pipeline, and how to package the agent so it behaves consistently across environments.

Why Standard Tests Are Not Enough

Unit tests verify that your code runs, and integration tests verify that components connect. Neither tells you whether the agent completed the task correctly, whether its reasoning held up, or whether a prompt change introduced a safety issue mid-run. Agent evaluation sits on top of your regular test suite as a separate discipline with its own dataset, scoring criteria, and place in CI. Teams that skip this eval step end up running informal manual checks before each release, an approach that does not scale.

The Eval Hierarchy

Evaluation is a hierarchy of checks, each running at a different cadence and answering a different question. Understanding where each type fits in your pipeline is the difference between genuine confidence and the appearance of it.

Eval Type	What it Tests	Run When	Blocks Deploy
Unit	Single prompt or tool call returns expected output	Every commit	Yes, on hard failure
Integration	Tool chains and handoffs work correctly end to end	Every PR	Yes, on any regression
End to end	Full task completion on realistic, prod-like inputs	Every release	Yes, if below threshold
Regression	Key metrics did not drop vs. the last release baseline	Every release	Yes, on defined delta
Human spot	Person reviews random sample of real output quality	Weekly	No, informs next cycle

Building an Eval Dataset That Reflects Production

An eval system is only as good as the dataset it runs against. A dataset built from simple, well-formed examples will pass confidently on builds that fail badly on real user inputs. The cases you can construct from imagination are almost never the cases that break in production.

Build your dataset from three real sources: inputs from your first staging runs, edge cases your team has encountered or can deliberately construct from known failure patterns, and cases where the agent should decline rather than attempt the task. Keep the dataset versioned in source control alongside the agent code. After every production incident, add the failing case to the dataset before writing a single line of fix code, so the same failure cannot return silently in a future release.

Eval Gates in CI/CD

An eval gate is the mechanism that connects your evaluation system to your deployment pipeline. It is what turns evaluation from a periodic quality check into an automated release criterion that runs on every build. This section explains how to set one up, where in the pipeline it should sit, how to set meaningful thresholds, and how to use LLM-as-judge evaluation to scale the process without losing accuracy.

Integrate eval gates into your non-production stages, not just before the final production deployment, so you catch regressions early when they are cheapest to fix. Set thresholds before the first release: task success above 80%, safety refusals under 3% of completable tasks, and latency p95 within your SLA. Adjust with real data once traffic starts flowing. An LLM-as-judge setup can scale evaluation automatically, but calibrate it against human reviewers at least once a quarter.
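The starting thresholds above can be turned into a pass/fail check that a CI job runs after the eval suite. This is a sketch under assumptions: the result record format is hypothetical, task success is measured over completable cases, and p95 uses the nearest-rank method.

```python
def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n)
    return ordered[rank - 1]


def evaluate_gate(results, latency_sla_ms, min_success=0.80, max_refusal=0.03):
    """Return (passed, metrics) for a list of per-case eval results.

    Each result is a dict with keys: completable (bool), succeeded (bool),
    refused (bool), latency_ms (float).
    """
    completable = [r for r in results if r["completable"]]
    if not completable:
        raise ValueError("eval dataset has no completable cases")
    metrics = {
        "task_success": sum(r["succeeded"] for r in completable) / len(completable),
        "refusal_rate": sum(r["refused"] for r in completable) / len(completable),
        "latency_p95_ms": p95([r["latency_ms"] for r in results]),
    }
    passed = (
        metrics["task_success"] > min_success
        and metrics["refusal_rate"] < max_refusal
        and metrics["latency_p95_ms"] <= latency_sla_ms
    )
    return passed, metrics
```

In a pipeline, the calling script would exit non-zero when `passed` is false so the stage blocks promotion, and would print `metrics` so the failing threshold is visible in the build log.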

Packaging for Production

Agents follow the same container discipline as any production service, with one addition: Pin the model version in config so you can roll it back without touching code. Pin every other layer in the table below as well, which eliminates an entire class of hard-to-reproduce environment inconsistencies.

What to Pin	How	Why
Base OS image	Exact tag in Dockerfile (e.g., `python:3.11.9-slim`)	Reproducible builds across all machines and CI runners
Runtime version	`.tool-versions` or ARG in Dockerfile	Prevents silent behavior change on minor upgrades
Dependencies	`requirements.txt` with hashes, or `package-lock.json`	Stops supply chain attacks from compromised packages
Model name	Config file or environment variable, never source code	Roll back the model without a code change or CI run
System prompt	Version-controlled prompt registry with change history	Full audit trail of what the agent was told to do and when

Async Execution Flow

Agents that take more than a second should run asynchronously. The API accepts the request and returns immediately. A task queue holds the job until a worker picks it up, and the client retrieves the result by polling or webhook. This keeps the API fast and lets the worker pool scale independently.
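The accept, queue, work, poll flow can be illustrated in a single process with the standard library. This is only a sketch: the function names are hypothetical, and in production the in-process `queue.Queue` and `results` dict would be replaced by a real task queue and a shared result store.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}  # job_id -> {"status": ..., "result": ...}
results_lock = threading.Lock()


def submit(payload):
    """API layer: enqueue the job and return an ID immediately."""
    job_id = str(uuid.uuid4())
    with results_lock:
        results[job_id] = {"status": "queued", "result": None}
    jobs.put((job_id, payload))
    return job_id


def get_status(job_id):
    """Polling endpoint: report the current state of a job."""
    with results_lock:
        return dict(results[job_id])


def worker(run_agent):
    """Worker loop: pull jobs until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        job_id, payload = item
        try:
            outcome = {"status": "done", "result": run_agent(payload)}
        except Exception as exc:  # record the failure instead of killing the worker
            outcome = {"status": "failed", "result": str(exc)}
        with results_lock:
            results[job_id] = outcome
        jobs.task_done()
```

Starting more `worker` threads (or, in a real deployment, more worker processes) scales throughput without touching `submit`, which is the independence the pattern is meant to provide.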

Figure 4: The standard async execution pattern

The API layer stays fast while agents run to completion in the background.

Define rollback trigger criteria before deployment day, not during an incident. A rollback should take under five minutes. If it takes longer, practice it until it does not.

Section 6

Running the Agent: Monitoring, Cost, and Feedback Loops

Shipping is not the finish line. An agent that started giving wrong answers after a model update does not throw an exception. Standard infrastructure monitoring will not catch it. You need a behavioral observability layer on top of your infrastructure metrics, and a process for turning production failures into future test coverage.

What to Instrument

Token usage, tool call behavior, step-level latency, and sampled eval scores are signals that standard APM tools do not capture by default. Start with traces and cost on your first production deployment. Add scheduled eval scoring once you have enough traffic to sample from.

Signal	What to Capture	Tool Examples
Traces	Full run: every step, tool call, model call, and decision point in sequence	LangSmith, Arize Phoenix, Weave
Token usage	Input and output tokens per run, per user, and per model version	Provider dashboards, custom span attributes
Tool calls	Which tools fired, with what arguments, and how long each call took	Custom spans inside your task runner
Latency	p50, p95, and p99 for full runs and for individual steps	Prometheus, Datadog, Grafana
Error rates	Tool failures, model timeouts, gate triggers, and parse errors	Your alerting platform plus custom counters
Output quality	Eval scores on a random sample of live output via scheduled job	DeepEval, Ragas, LangSmith eval runs
Cost	Dollars per run, per user, per agent type, and per model version	Provider billing API with custom resource tagging

Alerting on Behavior, Not Just Uptime

An agent can be fully healthy from an infrastructure perspective while producing consistently wrong outputs. Add alerts on eval score drops, refusal rate spikes, and run cost increases alongside infrastructure alerts. An automated eval run on a random sample of live traffic, run daily, is one of the most practical early-warning systems you can put in place without significant investment.

Cost Guardrails per Agent and per Tenant

Token costs are not linear with load. A retrieval step that starts returning longer documents can double input tokens per run. A reasoning loop with one extra step adds tokens to every request across every user. Set budget alerts per agent type and per tenant. A sudden cost spike almost always points to a concrete, fixable problem.

Budget Signal	What it Usually Means	First Place to Look
Cost per run doubled	Retrieval returning longer docs, or a loop added a step	Token count breakdown in the trace, step by step
Refusal rate above baseline	Prompt change, model update, or input distribution shift	Diff prompt versions against the deploy timestamp
Latency p95 up 50%	Tool call taking longer, or provider under load	Tool span durations in trace, then provider status page
Eval score dropped 5% or more	Model drift, data drift, or prompt regression	Run full eval suite against the last known-good baseline
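A per-tenant budget check can be as small as the sketch below. The prices are placeholder values, not any provider's actual rates, and `TenantBudget` is a hypothetical name; in practice spend would be persisted and reset per billing period.

```python
from collections import defaultdict

# Hypothetical prices in dollars per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def run_cost(input_tokens, output_tokens, prices=PRICE_PER_1K):
    """Dollar cost of one run from its token counts."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1000


class TenantBudget:
    """Accumulates spend per tenant and flags tenants over their limit."""

    def __init__(self, limits):
        self.limits = limits  # tenant -> dollar limit
        self.spend = defaultdict(float)

    def record(self, tenant, input_tokens, output_tokens):
        """Add one run's cost; return True if the tenant is now over budget."""
        self.spend[tenant] += run_cost(input_tokens, output_tokens)
        # Tenants without a configured limit are treated as having none to spend.
        return self.spend[tenant] > self.limits.get(tenant, 0.0)
```

The boolean returned by `record` is the hook for an alert or a hard stop; which one you choose depends on whether a tenant exceeding budget should degrade service or only page someone.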

Incident Response for Agents

Agent incidents require a different initial response than standard software failures. The nondeterministic nature of model outputs means you often cannot reproduce the exact failure, and the blast radius can be hard to assess without knowing which external systems the agent touched. This section gives you a three-question framework for starting any agent incident investigation.

When something goes wrong, start with three questions: what was the agent trying to do, based on the input it received and the plan it generated; what did it do, based on the trace; and what external state did it change, based on the audit log. The audit log is not optional. It lets you undo or compensate after a bad run and is the only reliable way to answer the third question. Without it, you are guessing.
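To answer the third question quickly, the audit log needs to be queryable by run. A minimal sketch using a JSON Lines file follows; the field names and file format are assumptions, and a production log would go to append-only, access-controlled storage.

```python
import json
import time


def audit_entry(run_id, tool, args, target, reversible):
    """Build one audit record for a state-changing tool call."""
    return {
        "ts": time.time(),
        "run_id": run_id,
        "tool": tool,
        "args": args,
        "target": target,
        "reversible": reversible,
    }


def append_audit(path, entry):
    """Append one record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def changes_for_run(path, run_id):
    """Answer 'what external state did this run change?' from the log."""
    with open(path, encoding="utf-8") as fh:
        entries = [json.loads(line) for line in fh if line.strip()]
    return [e for e in entries if e["run_id"] == run_id]
```

Recording whether each change is reversible at write time tells the responder which actions can be undone directly and which need a compensating step.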

Closing the Feedback Loop

Every production failure that reaches your team should become a named test case in the eval dataset before the next release. Review eval score trends, cost anomalies, and gate triggers on a weekly cadence. Failures feed the eval dataset, the dataset feeds the regression gate, and the gate prevents the same failure from returning silently.

Figure 5: Feedback loop

Production failures flow back into the eval dataset before the next release, closing the gap between what you tested and what users encountered.

Section 7

Conclusion

A production-ready agent is not a smarter prototype; it is a system built deliberately around the things that break: config that travels safely across environments, state that survives failures, tools that cannot be abused, an eval pipeline that catches regressions before users do, and instrumentation that tells you what happened after every run. This is the work that gets skipped when the demo is working and the deadline is close.

When config management, model governance, secret rotation, deployment pipelines, and access control are handled at the platform level, the team focuses on the problem the agent is actually solving. That is how you get AI speed without giving up enterprise control. Start with secrets management and an eval dataset, add a human gate for your riskiest tool, instrument your first deploy, and treat every release after that as the beginning of the next eval cycle, not the end of the build cycle.

Resources: