Refcard #403

Shipping Production-Grade AI Agents

Building From Localhost to Production

Building an AI agent prototype is one thing; running it safely and reliably in production is another. This Refcard walks engineers and technical leaders through the architecture, governance, security, evaluation, deployment, and monitoring decisions required to move AI agents from demo to production. Explore practical guidance for managing state and memory, enforcing guardrails, building eval gates, handling secrets and model configuration, instrumenting agent behavior, controlling costs, and closing the feedback loop after every release.


Brought to You By

OutSystems

Written By

Vidyasagar (Sarath Chandra) Machupalli FBCS
Software Developer Operations Manager | Executive IT Architect, IBM
Table of Contents
► Introduction ► What Is a Production-Ready Agent? ► Configuration and Governance ► Building the Agent: State, Memory, and Security ► Testing, Eval Gates, and Deployment ► Running the Agent: Monitoring, Cost, and Feedback Loops ► Conclusion
Section 1

Introduction

Building an AI agent that works on a laptop is a weekend project. Building one that works for real users, under real load, with real data, is a different problem. Demos run under your supervision. Production runs at 2am on inputs you did not anticipate, with no one watching. This Refcard gives engineers and tech leads a concrete path through the decisions that shape everything downstream: config and model governance, state and memory design, security boundaries, evaluation before every release, and operational monitoring once you go live.

Section 2

What Is a Production-Ready Agent?

This section defines what production readiness means. Before architecture choices and tooling, it pays to be precise about what breaks when a prototype meets real traffic, because those failure modes are what the rest of this Refcard is designed to prevent.

The Four Things That Break at Scale

Most prototype-to-production failures fall into one of four categories. They are predictable, preventable, and almost always appear together. Knowing what to expect lets you design against them from the start.

What Breaks | Why It Breaks | What It Looks Like
State | Prototypes keep everything in memory. Prod has parallel sessions, restarts, and crashes. | Two users get each other’s context. A process restart wipes a mid-task job with no way to resume.
Secrets | API keys are hard-coded or stored in .env files that travel with the codebase. | A key ends up in a shared repo or log file. Rotation triggers an outage because the old key was baked into a config nobody updated.
Trust | The agent can call any tool with any argument, with no validation layer between intent and execution. | A malicious input tricks an agent into calling a delete endpoint. An argument type mismatch fires a tool against the wrong resource ID.
Observability | Instrumentation beyond print statements was never wired up during rapid prototyping. | The agent starts giving subtly wrong answers. Nobody finds out until a user files a support ticket days later.

Architecture Patterns

Pick your architecture before you write the first route handler. Changing it later means restructuring state, testing, and tracing simultaneously. Three patterns cover most production use cases.

Pattern | How It Works | Best Fit | Where It Breaks
ReAct | Think, act, observe, repeat until the task is done | Variable-step tasks with many tool options | Loops without a step budget; hard to trace mid-run
Plan and Execute | Build a complete plan first, then run each step in sequence | Predictable long-horizon tasks with stable structure | Plans go stale when the environment changes mid-run
Multi-Agent | Specialist agents hand tasks to each other through a coordinator | Complex workflows that need clear role separation | State passing between agents; debugging requires a full cross-agent trace

Orchestration Patterns

A single looping agent is easy to reason about locally. Once multiple agents are handing tasks to each other, retrying on failure, and reading from shared context, you need an orchestration layer that manages coordination explicitly. Without it, you end up with untracked state and failures that are nearly impossible to reproduce. Orchestration turns a collection of agents into a system another engineer can own and operate.

The Cost of DIY Infrastructure

Custom pipelines work until the team starts spending more time maintaining infrastructure than building the agent. Every hand-rolled piece is something you maintain, upgrade, and debug. Beyond time cost, fragmented pipelines create operational risk: config inconsistencies, secrets in too many places, and deployment steps only one person knows. Enterprise platforms absorb this layer so engineering stays focused on the actual problem.

Deployment Promotion Path

Every agent should travel through a consistent set of environments before reaching production. Each gate in the path is an opportunity to catch problems while they are still cheap to fix. Automating the checks at each gate is what prevents human error from letting a bad build through.

Figure 1: Agent deployment promotion path

Every agent should earn its way through each gate. Automated checks block promotion until the previous stage is confirmed stable.

Frameworks and Orchestration Tools

The tools in this category handle the execution model: how the agent reasons, how steps chain together, how sub-agents communicate, and how long-running tasks stay durable across failures. Choosing a framework early matters because it sets the primitives your eval, tracing, and deployment tooling will integrate with. The table contains example open-source tools and frameworks for your reference.

Tool | Purpose
LangGraph | Stateful multi-step agents
CrewAI | Multi-agent role coordination
AutoGen | Agent conversation loops
Haystack | RAG and pipeline agents
Semantic Kernel | Enterprise .NET and Python agents
Agno | Lightweight agent runtime
Temporal | Durable workflow orchestration
Pydantic AI | Type-safe agent construction
Section 3

Configuration and Governance

Config and governance are prerequisites to agent logic, not operational details you add later. How you handle config determines whether the same agent runs reliably across environments, whether a model change requires a code release, and whether your team can rotate secrets without an outage.

Environment Promotion Model

The same agent should run identically in local dev, staging, and production. What changes between environments is the config and secrets injected at runtime. Use a config service that understands named environments so developers can pull staging config locally without copying files. A single pipeline handles promotion, swapping environment-specific values automatically. Without this discipline, environment-specific bugs become a constant drain because every environment differs in ways that are hard to reproduce.

Figure 2: Secrets management in each environment

Local .env files stay on the developer’s machine and never reach staging or production.
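A minimal sketch of runtime-injected config, assuming three hypothetical variable names (`APP_ENV`, `AGENT_MODEL`, `AGENT_API_KEY`) that do not come from any particular platform. The same code runs in every environment; only the injected values differ, and a missing value fails at startup rather than mid-run.

```python
import os
from dataclasses import dataclass

# Hypothetical environment names and variable names, chosen for illustration.
ALLOWED_ENVS = {"local", "staging", "production"}
REQUIRED_KEYS = ("APP_ENV", "AGENT_MODEL", "AGENT_API_KEY")


@dataclass(frozen=True)
class AgentConfig:
    env: str
    model: str
    api_key: str


def load_config(environ=None) -> AgentConfig:
    """Build config from injected variables and fail fast if any are missing."""
    environ = os.environ if environ is None else environ
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise RuntimeError(f"missing config: {', '.join(missing)}")
    env = environ["APP_ENV"]
    if env not in ALLOWED_ENVS:
        raise RuntimeError(f"unknown environment: {env!r}")
    return AgentConfig(env=env, model=environ["AGENT_MODEL"], api_key=environ["AGENT_API_KEY"])
```

Passing a plain dict instead of `os.environ` keeps the loader testable without touching the real process environment.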

AI Model Management and Control

The model’s name and version should live in config, not code. Treat model selection the same way you treat a database connection string: change it by updating config and restarting the service, with no code diff required. This lets you roll back a degraded model, swap for compliance, or run cost-optimized models in lower environments without a release cycle. Governance over which teams can change which model versions belongs in your config story from day one.
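One way to picture model governance in config is an approved-model list per environment, checked when the service starts. The sketch below assumes made-up model names and a hard-coded dictionary; in practice the list would live in whatever config service you use.

```python
# Hypothetical model names and environments, for illustration only.
APPROVED_MODELS = {
    "production": {"example-model-v1", "example-model-v2"},
    "staging": {"example-model-v1", "example-model-v2", "example-model-v3-preview"},
    "local": {"example-model-small"},
}


def resolve_model(env: str, requested: str, approved=APPROVED_MODELS) -> str:
    """Return the requested model only if it is approved for this environment."""
    allowed = approved.get(env, set())
    if requested not in allowed:
        raise ValueError(f"model {requested!r} is not approved for {env!r}")
    return requested
```

Rolling back a degraded model then means changing the configured name to another approved entry and restarting, with no code diff.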

Feature Flags for Agent Behavior

Feature flags let you change agent behavior, such as enabling a new tool or switching model versions, without a redeploy. New capabilities can be toggled for a subset of users before rolling out broadly, and a kill switch lets you disable a misbehaving tool instantly without rolling back. Keep tool-access flags and model-selection flags separate since their risk profiles and approval requirements differ.
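The sketch below shows one possible shape for a tool flag with a kill switch and a percentage rollout. The flag names and the in-memory `FLAGS` dictionary are assumptions; a real deployment would read flags from a flag service so they can change without a redeploy.

```python
import hashlib

# Hypothetical flags; "enabled": False acts as the kill switch for a tool.
FLAGS = {
    "tool.web_search": {"enabled": True, "rollout_percent": 25},
    "tool.delete_record": {"enabled": False, "rollout_percent": 100},
}


def bucket(user_id: str, flag: str) -> int:
    """Map a user to a stable bucket in 0..99 so decisions do not flip between requests."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, flags=FLAGS) -> bool:
    """Unknown or disabled flags are off; otherwise enable for the rollout share of users."""
    cfg = flags.get(flag)
    if cfg is None or not cfg["enabled"]:
        return False
    return bucket(user_id, flag) < cfg["rollout_percent"]
```

Hashing the flag name together with the user ID means a user who lands in the early cohort for one flag is not automatically in the early cohort for every other flag.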

Manual Plumbing vs. Unified Platform

Hand-built pipelines work until the engineer who built them leaves, or until you scale from one agent to 10. A unified platform handles config management, model governance, secrets rotation, deployment pipelines, and access control as a coherent system. The risk of custom DevOps for AI is not just time. It is consistency gaps, undocumented steps, and outages because no one updated the third system when the API key rotated.

Never commit .env files to source control. Add them to .gitignore globally and run a secrets scanning tool like truffleHog or git-secrets in your CI pipeline to catch credentials before they reach a shared branch.

Section 4

Building the Agent: State, Memory, and Security

State, memory, and security produce the most production incidents in agent systems, and they are the areas most commonly deferred until after launch. The decisions in this section affect your testing strategy, deployment model, and incident response when something goes wrong.

Short-Term vs. Long-Term Memory

Memory in an agent system means two things: where information lives between steps right now, and whether it survives the session. Too little persistence and the agent cannot complete multi-turn tasks; too much and you carry compliance risk and retrieval latency you did not plan for.

Short-term memory lives in the context window or a TTL-keyed store, and it is the right default for most tasks. Do not reach for long-term memory until you have a concrete reason to, and when you do, a database alone is not enough. You need a schema, indexing strategy, retrieval mechanism, deletion path, and retention policy reviewed with your legal team before you write the first row.

The table below summarizes when to use each type and which backends fit each scope.

Aspect | Short-Term Memory | Long-Term Memory
Scope | Current session only | Cross-session, durable
Best for | Single-turn and multi-step tasks in one session | User preferences, past decisions, semantic retrieval
Cleared when | Session ends | Explicit deletion or policy
Key requirement | TTL or window management | Schema, indexing, retention policy, deletion path
Example tools | Redis with TTL, LangGraph checkpoints, or in-memory dict | Vector store such as Chroma
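To make the TTL idea concrete, here is a small in-memory session store that expires entries on read. It is a sketch only: the class name `TTLStore` is made up, and an in-process dictionary does not survive restarts or span multiple workers, which is where a shared store with native TTL support comes in.

```python
import time


class TTLStore:
    """Session-scoped memory: each entry expires ttl_seconds after it was written."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._data = {}

    def set(self, session_id: str, value) -> None:
        self._data[session_id] = (self._clock() + self._ttl, value)

    def get(self, session_id: str, default=None):
        entry = self._data.get(session_id)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily, on first read after expiry.
            del self._data[session_id]
            return default
        return value
```

Keying by session ID is also what keeps two users from seeing each other's context.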

Checkpointing Long-Running Tasks

An agent running a 20-step task can fail at step 14. Without a checkpoint, the entire job restarts, burning tokens and quota on work already done. Write agent state to a durable store after each significant step, tagged with a step identifier and timestamp. On restart, load the latest checkpoint and continue from there.
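The checkpoint-and-resume flow described above can be sketched with the standard library's `sqlite3` module. The table layout, the `CheckpointStore` class, and the `run_task` helper are illustrative assumptions, and the sketch assumes agent state is JSON-serializable.

```python
import json
import sqlite3
import time


class CheckpointStore:
    """Persist agent state per run and step; pass a file path for durability."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "run_id TEXT, step INTEGER, saved_at REAL, state TEXT, "
            "PRIMARY KEY (run_id, step))"
        )

    def save(self, run_id, step, state):
        self._db.execute(
            "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?, ?)",
            (run_id, step, time.time(), json.dumps(state)),
        )
        self._db.commit()

    def latest(self, run_id):
        """Return (last completed step, state), or (0, {}) for a fresh run."""
        row = self._db.execute(
            "SELECT step, state FROM checkpoints WHERE run_id = ? "
            "ORDER BY step DESC LIMIT 1",
            (run_id,),
        ).fetchone()
        return (row[0], json.loads(row[1])) if row else (0, {})


def run_task(store, run_id, steps):
    """Run steps in order, skipping those a previous attempt already completed."""
    done, state = store.latest(run_id)
    for index, step_fn in enumerate(steps, start=1):
        if index <= done:
            continue
        state = step_fn(state)
        store.save(run_id, index, state)
    return state
```

Resuming this way is only safe if each step is idempotent or its side effects are recorded, since a crash between a side effect and its checkpoint will replay that step.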

The Six Threats Specific to Agent Systems

Agents process content from the outside world and act on it, creating an attack surface that standard security checklists do not fully address. The following six threats require specific mitigations at the architecture level.

Threat | How It Happens | What to Do
Prompt injection | Hostile content in tool output rewrites agent’s planned action | Treat all external content as untrusted; sanitize before it enters the context window
Tool abuse | Agent calls tool with wrong arguments or fires far more than expected | Validate arguments against a strict schema; enforce step and call budgets per run
Data exfiltration | Agent leaks sensitive data through output or external API calls | Apply output filtering before responses leave; enforce network egress allowlists on containers
Privilege escalation | Agent accesses tool or resource outside its assigned role | Use explicit per-role allowlists at runtime, not denylists; audit every tool call
Runaway execution | Agent loops indefinitely or fires hundreds of tool calls | Set hard limits on steps, tokens, and wall-clock time; return an error, not a continuation
LLM-generated code | Agent executes code that may access unintended resources | Sandbox all generated code with no network access and a read-only filesystem
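The hard limits listed for runaway execution can be sketched as a budget object that the agent loop charges on every step. `RunBudget` and `BudgetExceeded` are hypothetical names, and the limits you pass in are placeholders to tune per agent.

```python
import time


class BudgetExceeded(Exception):
    """Raised when a run crosses a hard limit; the caller returns an error to the user."""


class RunBudget:
    def __init__(self, max_steps, max_tokens, max_seconds, clock=time.monotonic):
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self._clock = clock
        self._started = clock()
        self.steps = 0
        self.tokens = 0

    def charge(self, tokens: int) -> None:
        """Record one step and its token use; raise instead of letting the run continue."""
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"step limit {self.max_steps} exceeded")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded(f"token limit {self.max_tokens} exceeded")
        if self._clock() - self._started > self.max_seconds:
            raise BudgetExceeded(f"time limit {self.max_seconds}s exceeded")
```

Calling `charge` before each model or tool call, rather than after, keeps a run from starting work it has no budget left to finish.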

Implementing Configurable Agent Guardrails

Guardrails enforced at the platform layer hold even when an attack subverts prompt-level instructions. Each agent role needs a declarative, version-controlled allowlist specifying which tools it can call, which data it can read, and which arguments it can pass. Enforcement happens at runtime, below the agent logic, so the system cannot exceed its mandate, even if its instructions are tampered with.
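As a rough illustration of a declarative per-role allowlist, the sketch below checks every tool call against a policy before it executes. The role, tool names, and the simple name-to-type schema are assumptions made for the example; a production policy would likely live in a version-controlled file and use a fuller schema validator.

```python
# Hypothetical policy: role -> tool -> {argument name: required type}.
POLICY = {
    "support_agent": {
        "lookup_order": {"order_id": str},
        "send_reply": {"ticket_id": str, "body": str},
    },
}


class PolicyViolation(Exception):
    pass


def authorize(role: str, tool: str, args: dict, policy=POLICY) -> None:
    """Raise PolicyViolation unless the role may call this tool with exactly these arguments."""
    allowed_tools = policy.get(role, {})
    if tool not in allowed_tools:
        raise PolicyViolation(f"{role!r} may not call {tool!r}")
    schema = allowed_tools[tool]
    if set(args) != set(schema):
        raise PolicyViolation(f"{tool!r} expects arguments {sorted(schema)}")
    for name, expected_type in schema.items():
        if not isinstance(args[name], expected_type):
            raise PolicyViolation(f"{tool!r} argument {name!r} must be {expected_type.__name__}")
```

Because anything not listed is denied, adding a new tool to the agent does nothing until someone also adds it to the policy, which is the point of an allowlist over a denylist.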

Human-in-the-Loop Gates

Deleting records, sending messages outside the organization, triggering payments, and making infrastructure changes all require human confirmation before execution. Build gates as named, explicit objects in your system that are straightforward to add and impossible to bypass accidentally. Log every gate trigger, approval, rejection, and the time each decision took.
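A gate as a named object might look like the sketch below. The `HumanGate` class and the `approver` callback are hypothetical; in a real system the approver would block on a review queue or chat approval rather than a local function, and the log would go to durable storage.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HumanGate:
    name: str
    log: list = field(default_factory=list)

    def run(self, action, approver, *args):
        """Execute action only if approver returns True; record every decision."""
        started = time.monotonic()
        approved = bool(approver(self.name, args))
        self.log.append(
            {
                "gate": self.name,
                "approved": approved,
                "decision_seconds": time.monotonic() - started,
            }
        )
        if not approved:
            raise PermissionError(f"{self.name}: rejected by reviewer")
        return action(*args)
```

Routing the risky action through `run` means the only code path to the action passes the approval check, and the log captures approvals, rejections, and how long each decision took.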

Security Layers

Security in an agent system is a stack of independent layers, each failing separately. A compromise in one layer should not grant access to the capabilities controlled by the layers above it.

Figure 3: Security layers

Each layer is enforced independently. A bypass in one does not open the others.

Human-in-the-loop gates are not optional for actions that delete data, send external messages, or trigger payments. Wire them in at project start, not after the first incident.

Section 5

Testing, Eval Gates, and Deployment

Shipping without a proper evaluation setup is the same as deploying with no tests. With agents, the failure mode is often invisible: plausible-looking answers that are wrong, or refusals on tasks that should succeed. This section explains what a complete eval setup looks like, how to connect it to your pipeline, and how to package the agent so it behaves consistently across environments.

Why Standard Tests Are Not Enough

Unit tests verify that your code runs, and integration tests verify that components connect. Neither tells you whether the agent completed the task correctly, whether its reasoning held up, or whether a prompt change introduced a safety issue mid-run. Agent evaluation sits on top of your regular test suite as a separate discipline with its own dataset, scoring criteria, and place in CI. Teams that skip this step fall back on informal manual checks before each release, which does not scale.

The Eval Hierarchy

Evaluation is a hierarchy of checks, each running at a different cadence and answering a different question. Understanding where each type fits in your pipeline is the difference between genuine confidence and the appearance of it.

Eval Type | What It Tests | Run When | Blocks Deploy
Unit | Single prompt or tool call returns expected output | Every commit | Yes, on hard failure
Integration | Tool chains and handoffs work correctly end to end | Every PR | Yes, on any regression
End to end | Full task completion on realistic, prod-like inputs | Every release | Yes, if below threshold
Regression | Key metrics did not drop vs. the last release baseline | Every release | Yes, on defined delta
Human spot | Person reviews random sample of real output quality | Weekly | No, informs next cycle

Building an Eval Dataset That Reflects Production

The quality of an eval system is only as good as the dataset it runs against. A dataset built from simple, well-formed examples will pass confidently on builds that fail badly on real user inputs. The cases you can construct from imagination are almost never the cases that break in production.

Build your dataset from three real sources: inputs from your first staging runs, edge cases your team has encountered or can deliberately construct from known failure patterns, and cases where the agent should decline rather than attempt the task. Keep the dataset versioned in source control alongside the agent code. After every production incident, add the failing case to the dataset before writing a single line of fix code to prevent the same failure from returning silently in a future release.

Eval Gates in CI/CD

An eval gate is the mechanism that connects your evaluation system to your deployment pipeline, and what turns evaluation from a periodic quality check into an automated release criterion that runs on every build. This section explains how to set one up, where in the pipeline it should sit, how to set meaningful thresholds, and how to use LLM-as-judge evaluation to scale the process without losing accuracy.

Integrate eval gates into your non-production stages, not just before the final production deployment, so you catch regressions early when they are cheapest to fix. Set thresholds before the first release: task success above 80%, safety refusals under 3% of completable tasks, and latency p95 within your SLA. Adjust with real data once traffic starts flowing. An LLM-as-judge setup can scale evaluation automatically, but calibrate it against human reviewers at least once a quarter.
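The starting thresholds above can be expressed as a small gate function that a CI job calls after the eval run. This is a sketch under assumptions: the `EvalResult` fields are invented for the example, the 10-second p95 default stands in for your own SLA, and p95 uses the nearest-rank method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    completable: bool  # should the agent have attempted this case?
    succeeded: bool
    refused: bool
    latency_s: float


def p95(values):
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def gate_failures(results, min_success=0.80, max_refusal=0.03, sla_p95_s=10.0):
    """Return the names of failed checks; an empty list means the build may promote."""
    completable = [r for r in results if r.completable]
    if not completable:
        raise ValueError("eval set has no completable cases")
    n = len(completable)
    success_rate = sum(r.succeeded for r in completable) / n
    refusal_rate = sum(r.refused for r in completable) / n
    failures = []
    if success_rate <= min_success:  # success must be above the threshold
        failures.append("task_success")
    if refusal_rate >= max_refusal:  # refusals must stay under the threshold
        failures.append("safety_refusals")
    if p95([r.latency_s for r in results]) > sla_p95_s:
        failures.append("latency_p95")
    return failures
```

In a pipeline, the job would exit non-zero whenever `gate_failures` returns a non-empty list, which is what turns the thresholds into a release criterion.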

Packaging for Production

Agents follow the same container discipline as any production service, with one addition: pin the model version in config so you can roll it back without touching code. Pin every layer below it as well to eliminate an entire class of hard-to-reproduce environment inconsistencies.

What to Pin | How | Why
Base OS image | Exact tag in Dockerfile (e.g., python:3.11.9-slim) | Reproducible builds across all machines and CI runners
Runtime version | .tool-versions or ARG in Dockerfile | Prevents silent behavior change on minor upgrades
Dependencies | requirements.txt with hashes, or package-lock.json | Blocks installs of packages whose contents no longer match the pinned hash, reducing supply chain risk
Model name | Config file or environment variable, never source code | Roll back the model without a code change or CI run
System prompt | Version-controlled prompt registry with change history | Full audit trail of what the agent was told to do and when

Async Execution Flow

Agents that take more than a second should run asynchronously. The API accepts the request and returns immediately. A task queue holds the job until a worker picks it up, and the client retrieves the result by polling or webhook. This keeps the API fast and lets the worker pool scale independently.

Figure 4: The standard async execution pattern

The API layer stays fast while agents run to completion in the background.
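The accept, queue, work, and poll flow can be sketched in a single process with `asyncio`. Everything here is a stand-in: the in-memory `jobs` dictionary replaces a durable job store, `asyncio.Queue` replaces a real task queue, and `fake_agent` replaces the agent run.

```python
import asyncio
import uuid

# job_id -> {"status": ..., "result": ...}; a real system would use a durable store.
jobs = {}


async def submit(queue, payload):
    """API side: record the job, enqueue it, and return an ID the client can poll."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "queued", "result": None}
    await queue.put((job_id, payload))
    return job_id


async def worker(queue, run_agent):
    """Worker side: pull jobs until cancelled and record the outcome of each."""
    while True:
        job_id, payload = await queue.get()
        jobs[job_id]["status"] = "running"
        try:
            jobs[job_id]["result"] = await run_agent(payload)
            jobs[job_id]["status"] = "done"
        except Exception as exc:
            jobs[job_id].update(status="failed", result=str(exc))
        finally:
            queue.task_done()


async def demo():
    async def fake_agent(payload):  # placeholder for a slow agent run
        await asyncio.sleep(0.01)
        return payload.upper()

    queue = asyncio.Queue()
    worker_task = asyncio.create_task(worker(queue, fake_agent))
    job_id = await submit(queue, "hello")
    await queue.join()  # a real client would poll jobs[job_id] or wait for a webhook
    worker_task.cancel()
    return jobs[job_id]
```

Running `asyncio.run(demo())` returns the finished job record. Because workers only share the queue and the job store with the API, the worker pool can scale without changing the request path.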

Define rollback trigger criteria before deployment day, not during an incident. A rollback should take under five minutes. If it takes longer, practice it until it does not.

Section 6

Running the Agent: Monitoring, Cost, and Feedback Loops

Shipping is not the finish line. An agent that started giving wrong answers after a model update does not throw an exception. Standard infrastructure monitoring will not catch it. You need a behavioral observability layer on top of your infrastructure metrics, and a process for turning production failures into future test coverage.

What to Instrument

Token usage, tool call behavior, step-level latency, and sampled eval scores are signals that standard APM tools do not capture by default. Start with traces and cost on your first production deployment. Add scheduled eval scoring once you have enough traffic to sample from.

Signal What to Capture Tool Examples
Traces Full run: every step, tool call, model call, and decision point in sequence LangSmith, Arize Phoenix, Weave
Token usage Input and output tokens per run, per user, and per model version Provider dashboards, custom span attributes
Tool calls Which tools fired, with what arguments, and how long each call took Custom spans inside your task runner
Latency p50, p95, and p99 for full runs and for individual steps Prometheus, Datadog, Grafana
Error rates Tool failures, model timeouts, gate triggers, and parse errors Your alerting platform plus custom counters
Output quality Eval scores on a random sample of live output via scheduled job DeepEval, Ragas, LangSmith eval runs
Cost Dollars per run, per user, per agent type, and per model version Provider billing API with custom resource tagging

Alerting on Behavior, Not Just Uptime

An agent can be fully healthy from an infrastructure perspective while producing consistently wrong outputs. Add alerts on eval score drops, refusal rate spikes, and run cost increases alongside infrastructure alerts. An automated eval run on a random sample of live traffic, run daily, is one of the most practical early-warning systems you can put in place without significant investment.

Cost Guardrails per Agent and per Tenant

Token costs do not scale linearly with load. A retrieval step that starts returning longer documents can double input tokens per run. A reasoning loop with one extra step adds tokens to every request across every user. Set budget alerts per agent type and per tenant. A sudden cost spike almost always points to a concrete, fixable problem.

Budget Signal | What It Usually Means | First Place to Look
Cost per run doubled | Retrieval returning longer docs, or a loop added a step | Token count breakdown in the trace, step by step
Refusal rate above baseline | Prompt change, model update, or input distribution shift | Diff prompt versions against the deploy timestamp
Latency p95 up 50% | Tool call taking longer, or provider under load | Tool span durations in trace, then provider status page
Eval score dropped 5% or more | Model drift, data drift, or prompt regression | Run full eval suite against the last known-good baseline
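For the first row of the table, a per-tenant check might compute cost per run from token counts and flag tenants whose cost has doubled against a baseline. The prices and model name below are placeholders, not real provider rates, and the baseline is assumed to come from your own historical data.

```python
# Hypothetical prices in dollars per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {"example-model-v1": {"input": 0.50, "output": 1.50}}


def run_cost(model, input_tokens, output_tokens, prices=PRICE_PER_1K):
    """Dollar cost of one run from its input and output token counts."""
    price = prices[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1000


def budget_alerts(cost_by_tenant, baseline_by_tenant, spike_factor=2.0):
    """Return tenants whose cost per run is at least spike_factor times their baseline."""
    return sorted(
        tenant
        for tenant, cost in cost_by_tenant.items()
        if tenant in baseline_by_tenant and cost >= spike_factor * baseline_by_tenant[tenant]
    )
```

For example, `run_cost("example-model-v1", 2000, 1000)` is 2.5 under these placeholder prices. Tenants without a baseline are skipped rather than flagged, so new tenants need a separate absolute budget.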

Incident Response for Agents

Agent incidents require a different initial response than standard software failures. The nondeterministic nature of model outputs means you often cannot reproduce the exact failure, and the blast radius can be hard to assess without knowing which external systems the agent touched. This section gives you a three-question framework for starting any agent incident investigation.

When something goes wrong, start with three questions: what was the agent trying to do, based on the input it received and the plan it generated; what did it do, based on the trace; and what external state did it change, based on the audit log. The audit log is not optional. It lets you undo or compensate after a bad run and is the only reliable way to answer the third question. Without it, you are guessing.

Closing the Feedback Loop

Every production failure that reaches your team should become a named test case in the eval dataset before the next release. Review eval score trends, cost anomalies, and gate triggers on a weekly cadence. Failures feed the eval dataset, the dataset feeds the regression gate, and the gate prevents the same failure from returning silently.

Figure 5: Feedback loop

Production failures flow back into the eval dataset before the next release, closing the gap between what you tested and what users encountered.
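One lightweight way to turn an incident into a named test case is to append it to a JSONL eval dataset kept in source control. The record fields and the function name below are assumptions for illustration; adapt them to whatever schema your eval runner reads.

```python
import json
from pathlib import Path


def add_regression_case(dataset_path, case_id, agent_input, expected, incident_ref):
    """Append a named failing case to a JSONL dataset; return False if the ID already exists."""
    path = Path(dataset_path)
    existing_ids = set()
    if path.exists():
        lines = path.read_text(encoding="utf-8").splitlines()
        existing_ids = {json.loads(line)["id"] for line in lines if line.strip()}
    if case_id in existing_ids:
        return False
    record = {
        "id": case_id,
        "input": agent_input,
        "expected": expected,
        "source": incident_ref,  # ties the case back to the incident that produced it
    }
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return True
```

Rejecting duplicate IDs keeps the dataset stable when the same incident is reported more than once.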

Section 7

Conclusion

A production-ready agent is not a smarter prototype; it is a system built deliberately around the things that break: config that travels safely across environments, state that survives failures, tools that cannot be abused, an eval pipeline that catches regressions before users do, and instrumentation that tells you what happened after every run. This is the work that gets skipped when the demo is working and the deadline is close.

When config management, model governance, secret rotation, deployment pipelines, and access control are handled at the platform level, the team focuses on the problem the agent is actually solving. That is how you get AI speed without giving up enterprise control. Start with secrets management and an eval dataset, add a human gate for your riskiest tool, instrument your first deploy, and treat every release after that as the beginning of the next eval cycle, not the end of the build cycle.

Resources:

  • OWASP Top 10 for LLM Applications
  • NIST AI Risk Management Framework (AI RMF)
  • The Twelve-Factor App
  • “Choosing the Right Multi-Agent Architecture” by LangChain
  • “The Twelve-Factor Agents: Building Production-Ready LLM Applications” by Vidyasagar Sarath Chandra Machupalli
  • The Orchestration of Multi-Agent Systems: Architectures, Protocols, and Enterprise Adoption
