Advanced Argo Rollouts With Datadog Metrics for Progressive Delivery
This article explores how progressive delivery in Kubernetes environments can be enhanced by combining Argo Rollouts with Datadog metrics.
In modern DevOps environments, delivering software quickly and reliably is essential. Progressive delivery strategies such as canary deployments have emerged as effective methods to reduce risk during application updates. Argo Rollouts is a Kubernetes-native controller that enables progressive delivery using deployment strategies like canary and blue-green. When integrated with Datadog, a monitoring and observability platform, Argo Rollouts can automatically make deployment decisions based on real-time metrics. This article explores how Argo Rollouts and Datadog work together to automate analysis, reduce manual intervention, and ensure safe, data-driven deployments in Kubernetes environments.
Introduction
As organizations adopt microservices and cloud-native architectures, the complexity of application deployments has increased significantly. Traditional deployment methods often lead to downtime, user disruption, or production issues due to the lack of real-time feedback. Progressive delivery offers a solution by incrementally rolling out changes and continuously validating them.
Argo Rollouts is an open-source Kubernetes controller that supports progressive delivery through strategies like canary and blue-green deployments. It allows teams to control how new versions are gradually introduced to users while monitoring their impact. By integrating with Datadog, teams gain the ability to evaluate deployment success or failure using live metrics such as error rates, response times, and system health indicators.
This article discusses the architecture, configuration, and benefits of integrating Argo Rollouts with Datadog to enable intelligent deployment automation. It also highlights a real-world use case where this integration supports automatic rollbacks and promotions, ensuring deployments are both resilient and efficient.
Modern Deployment Patterns: Canary, Blue-Green, and Beyond
Argo Rollouts is a Kubernetes controller and a set of Custom Resource Definitions (CRDs) that support advanced deployment strategies, including:
- Canary Releases: Gradually shifting traffic from the old version to the new one through defined steps.
- Blue-Green Deployments: Redirecting traffic between two environments with minimal downtime.
- Experiments: Deploying multiple versions simultaneously to compare performance using key metrics.
- Automated Rollbacks: Reverting to a stable version when health checks fail.
Argo Rollouts works alongside Argo CD and is fully compatible with GitOps practices. It also supports several types of analysis to enable progressive delivery. In this article, we use Datadog metrics to drive rollout decisions.
By leveraging real-time metrics as gating conditions, rollouts proceed only when performance remains within healthy thresholds. This approach avoids blind deployments and enables smarter, data-informed decisions:
- Rollouts pause if latency increases or error rates spike.
- Rollbacks trigger automatically when service-level indicators (SLIs) degrade.
- Real-time analysis provides confidence in each stage of the rollout.
In contrast, traditional deployments often push changes to production without sufficient safeguards. With Datadog integrated into Argo Rollouts:
- Regressions are detected earlier in the release cycle.
- Automated rollbacks occur before faulty versions reach full traffic.
- Production outages and customer-facing errors are minimized.
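As a rough illustration of metrics acting as a gate, the fragment below sketches a canary strategy with a background analysis that runs alongside the rollout steps. It is a partial Rollout spec, not a complete manifest. It reuses the loq-error-rate template shown under Sample Templates; the step weights, pause durations, and startingStep value are illustrative assumptions.

```yaml
# Sketch only: the strategy section of a Rollout, assuming the
# loq-error-rate AnalysisTemplate already exists in the same namespace.
strategy:
  canary:
    # Background analysis: runs in parallel with the steps below and
    # aborts the rollout if the analysis fails.
    analysis:
      templates:
      - templateName: loq-error-rate
      # Step indexes are zero-based, so analysis starts at "setWeight: 50".
      startingStep: 2
      args:
      - name: service-name
        value: analysis-service
    steps:
    - setWeight: 20
    - pause: {duration: 5m}
    - setWeight: 50
    - pause: {duration: 5m}
```

Delaying the start of analysis gives the canary time to receive enough traffic for the metric to be meaningful.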
Datadog for Metrics-Based Analysis
Datadog offers comprehensive observability across infrastructure, applications, and services. It supports custom metric ingestion, Kubernetes monitoring, Application Performance Monitoring (APM), and alerting.
When integrated with Argo Rollouts, Datadog metrics can act as automated gates, helping determine whether a rollout continues, pauses, or rolls back based on real-time performance indicators such as:
- Latency (e.g., http.request.latency)
- Error rate (e.g., http.error.rate)
- CPU and memory utilization
- Custom business KPIs (e.g., payment.success.count)
Datadog provides a unified view across:
- Infrastructure metrics like CPU, memory, and disk I/O
- Application telemetry including APM, logs, and traces
- Business KPIs such as conversion rates and transaction counts
This holistic observability allows teams to base deployment decisions not only on system health, but also on user experience and business impact.
By tying deployments to real-time business indicators, rollouts can adapt dynamically. For example, if a new feature negatively affects payment success rates or slows down checkouts, the rollout is automatically halted. This gives SREs and business stakeholders confidence that changes won’t compromise the user experience.
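A business KPI gate of this kind could be sketched as an AnalysisTemplate that divides successful payments by attempted payments. The template name, the payment.attempt.count metric, and the 0.98 threshold are hypothetical; payment.success.count is the example metric named earlier. The sketch assumes the Datadog v2 provider, which accepts multiple named queries combined by a formula.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payment-success-ratio          # hypothetical name
spec:
  args:
  - name: service-name
  metrics:
  - name: payment-success-ratio
    interval: 5m
    count: 3                           # stop after three measurements
    # An empty Datadog result is treated as 0, so a silent metric
    # fails the check instead of passing it.
    successCondition: default(result, 0) >= 0.98   # illustrative threshold
    failureLimit: 1
    provider:
      datadog:
        apiVersion: v2
        interval: 5m
        queries:
          a: "sum:payment.success.count{service:{{args.service-name}}}"
          b: "sum:payment.attempt.count{service:{{args.service-name}}}"   # hypothetical metric
        formula: "a/b"
```

Expressing the KPI as a ratio rather than a raw count keeps the threshold stable when traffic volume changes between rollout steps.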
Argo Rollouts Architecture
Prerequisites
To get started, make sure the following components are installed in your Kubernetes (CaaS) environment:
- Argo Rollouts
- Datadog Agent
- Ingress Controller
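In addition to these components, the Argo Rollouts controller needs Datadog credentials before it can run queries. A minimal sketch of the Secret it reads is shown below, assuming the controller runs in the argo-rollouts namespace and your account uses the default Datadog API endpoint. The key values are placeholders, not real credentials.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: argo-rollouts             # assumed controller namespace
type: Opaque
stringData:
  address: https://api.datadoghq.com   # adjust for your Datadog site
  api-key: <datadog-api-key>           # placeholder
  app-key: <datadog-app-key>           # placeholder
```

In a GitOps setup, the real keys would normally come from a secret manager rather than being committed in plain text.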
Benefits of the Integration
- Automated Decision Making: Deployment progression or rollback is driven by real-time metrics.
- Reduced Risk: Early detection of anomalies halts faulty rollouts before reaching full production.
- Business KPI-Driven Rollouts: Tie deployment decisions to user-facing metrics, not just infrastructure stats.
- Improved Observability: Datadog dashboards and alerts offer full visibility into rollout performance.
Best Practices
- Set conservative thresholds at the start, especially for latency and error rates.
- Define rollback hooks to minimize service disruption during failure scenarios.
- Use pause durations to allow metric collection and stabilization between rollout steps.
- Consider A/B testing with Experiments to compare multiple versions in parallel.
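The last two practices can be combined in a single canary strategy. The fragment below is a sketch, not a complete manifest: an experiment step launches short-lived baseline and canary ReplicaSets, analyzes them with the loq-error-rate template from Sample Templates, and only then shifts live traffic. The durations and the analysis name are illustrative assumptions.

```yaml
# Sketch only: the strategy section of a Rollout.
strategy:
  canary:
    steps:
    # Run stable and canary pods side by side and analyze them
    # before any production traffic is shifted.
    - experiment:
        duration: 15m
        templates:
        - name: baseline
          specRef: stable
        - name: canary
          specRef: canary
        analyses:
        - name: error-rate
          templateName: loq-error-rate
          args:
          - name: service-name
            value: analysis-service
    - setWeight: 20
    # Pause so metrics can accumulate and stabilize before the next step.
    - pause: {duration: 5m}
```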
Templates to Be Created
- Create one or more analysis templates based on health check requirements—these may focus on infrastructure health, application health, or both.
- Your rollout templates should include an analysis step referencing the appropriate analysis template.
- Based on the success conditions of Datadog metrics, the rollout will either automatically promote or roll back.
For more details on configuring Datadog-based analysis templates, refer to the official Argo Rollouts documentation.
Sample Templates
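apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: loq-error-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: error-rate
    # Take a measurement every 5 minutes.
    interval: 5m
    # A measurement passes when the query result is at most 0.01.
    successCondition: result <= 0.01
    # The analysis fails once more than 3 measurements have failed.
    failureLimit: 3
    provider:
      datadog:
        apiVersion: v2
        # Time window covered by each Datadog query.
        interval: 5m
        query: |
          sum:requests.error.rate{service:{{args.service-name}}}
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: datadog-cpu-check
spec:
  metrics:
  - name: cpu-check
    interval: 1m
    # As written, this check passes only when user CPU is above 0.5. A health gate
    # normally asserts an upper bound, so adjust the operator and threshold to your baseline.
    successCondition: result > 0.5
    provider:
      datadog:
        query: "avg:system.cpu.user{*}"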
# This example demonstrates a Rollout that runs an inline analysis step during a canary update.
#
# Prerequisites:
# * kubectl apply -f analysis-templates.yaml
#
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: rollout-background-analysis
spec:
  replicas: 4
  revisionHistoryLimit: 2
  selector:
    matchLabels:
      app: rollout-background-analysis
  template:
    metadata:
      labels:
        app: rollout-background-analysis
    spec:
      containers:
      - name: rollouts-demo
        image: argoproj/rollouts-demo:blue
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 5m}
      # The analysis step references an AnalysisTemplate and starts an AnalysisRun when the
      # step is reached. The rollout waits for the run to finish. A failure/error of the
      # analysis will cause the rollout's update to abort, and set the canary weight to zero.
      # Note: a metric with only `interval` set keeps measuring indefinitely, so a template
      # used in an inline step normally also sets `count`.
      - analysis:
          templates:
          - templateName: loq-error-rate
          args:
          - name: service-name
            value: analysis-service
Suggested Service Architecture
Note: The technology is not yet finalized, and this diagram is only a reference for the request flow.

Challenges and Considerations
- Metrics Lag: Datadog metrics can arrive with a short delay. Size the interval and pause duration fields so that each measurement covers data that has already been ingested.
- Cost Overhead: Frequent metric queries in Datadog can incur cost; be efficient with scope and frequency.
- Debugging: Failed analysis runs can be hard to trace—use verbose logging and alerts to aid troubleshooting.
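One way to account for metric lag and query cost is to tune the timing fields on the metric itself. The fragment below is a sketch of the metrics section of an AnalysisTemplate. It assumes the template declares a service-name argument, as loq-error-rate does, and the specific delay, interval, and count values are illustrative assumptions rather than recommendations.

```yaml
# Sketch only: the metrics section of an AnalysisTemplate.
metrics:
- name: error-rate
  # Wait before the first measurement so Datadog has ingested
  # data from the new pods.
  initialDelay: 2m
  # Time between measurements.
  interval: 2m
  # A bounded count caps the number of Datadog queries per run.
  count: 5
  failureLimit: 1
  successCondition: result <= 0.01
  provider:
    datadog:
      apiVersion: v2
      # Time window covered by each query. It is separate from the
      # measurement interval above.
      interval: 5m
      query: |
        sum:requests.error.rate{service:{{args.service-name}}}
```

A wider query window smooths out short spikes, while a bounded count keeps both the analysis duration and the number of API calls predictable.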
Conclusion
The integration of Argo Rollouts with Datadog metrics marks a significant step forward in modern deployment practices. By combining Kubernetes-native progressive delivery with real-time observability, development and operations teams can deploy with greater confidence and precision.
This synergy enables automated, intelligent decision-making, promoting or rolling back releases based on live system health and performance data. It minimizes the risk of faulty deployments reaching production and reduces the need for manual oversight during critical rollout phases.
As reliability, agility, and automation become core to modern DevOps pipelines, tools like Argo Rollouts and Datadog are becoming essential. Together, they offer a scalable, metrics-driven approach to continuous delivery, making deployments safer, faster, and more resilient.
Opinions expressed by DZone contributors are their own.