AI Agents for DevOps on Kubernetes Need Real Engineering, Not Magic
Kubernetes incident triage: OpenTelemetry → Kafka → CrewAI → RBAC scale. DORA 2024: 75% AI use, 39% low trust. AI correlates, humans approve changes.
In a real Kubernetes cluster, incidents rarely appear as a single, clean alert. They arrive as waves of Kubernetes events, latency spikes, pod restarts, rollout failures, and unpredictable autoscaling behavior all at once. The hard part is usually not “Can we fix it?” but “Can we understand what’s happening fast enough to make a safe decision?”
AI agents for DevOps can help here — but only when they sit on solid engineering foundations. They should compress the early correlation and triage phase, not take opaque, unsafe control of production.
Google’s 2024 DORA report underlines why this matters: more than 75 percent of respondents now rely on AI for at least one professional task each day, and over one‑third report moderate to extreme productivity gains, yet 39 percent still have little to no trust in AI‑generated code. That gap between use and trust is exactly where our architecture and guardrails matter.
Why Incident Triage Needs Help Now
Traditional AIOps pitches often promise full automation, but most SREs do not want a black‑box system taking unilateral action in production. What they need is help with triage:
- Grouping noisy alerts into a single incident view
- Correlating Kubernetes events, metrics, and recent rollouts
- Proposing safe, reversible next steps — not silently applying risky changes
The DORA research still centers on the same four key metrics: lead time, deployment frequency, change failure rate, and time to restore service. AI can absolutely improve developer productivity and documentation, but it can also undermine delivery stability when used on top of weak fundamentals such as oversized batch changes and poor test coverage.
For a broader perspective on integrating DevOps services, see "Incorporating DevOps Services into Software Development."
That is why everything the agents do in this architecture should be:
- Traceable – every recommendation is explainable from telemetry and cluster state
- Auditable – logs and decisions reviewable after the fact
- Reversible – actions easy to roll back
- Least‑privilege – permissions constrained by Kubernetes RBAC
Architecture Overview
| Layer | Responsibility | Key Technologies |
|---|---|---|
| Telemetry capture | Collect traces, metrics, logs, and Kubernetes events | OpenTelemetry Collector |
| Event bus | Buffer and fan‑out telemetry | Kafka |
| Lightweight consumer | Normalize/enrich data, build incident context | Custom service |
| AI agent layer | Triage, correlate, draft next actions | CrewAI, Llama via Ollama |
| Controlled execution | Safe, reversible scaling under RBAC | Kubernetes RBAC, scale subresource |
Related: DZone's "10 Best Practices for Managing Kubernetes at Scale."
The pattern that consistently holds up under load uses simple, composable layers:
- OpenTelemetry collector – capture traces, metrics, logs, and Kubernetes events
- Kafka event bus – buffer, fan‑out, and replay telemetry
- Lightweight consumer – normalize signals into “incident contexts”
- AI agent layer – CrewAI agents backed by Llama 3.1 via Ollama
- Slack approval – humans approve or reject remediation steps
- RBAC‑limited scaling – Kubernetes permissions restricted to the `scale` subresource
Each layer can be tested, inspected, and replaced without rewriting the entire system.
Why OpenTelemetry Fits Kubernetes
OpenTelemetry Collector gives you one place to capture multi‑signal telemetry (traces, metrics, logs, and Kubernetes events) with pluggable receivers and exporters.
Key points for Kubernetes:
- The Kubernetes events receiver (`k8s_events`, in contrib distributions) captures events from the Kubernetes API server and converts them into logs.
- Kubernetes events are short‑lived in the cluster (often an hour or less) and are not persisted long term; exporting them via OpenTelemetry preserves them for incident analysis.
- Events complement, but do not replace, application logs and traces; they describe what Kubernetes is doing to your workloads (e.g., scheduling failures, image pull errors, autoscaling decisions).
Why Kafka Belongs in the Middle
Dropping all telemetry straight into an AI model couples your reasoning to whatever the cluster happens to emit at that moment. Kafka gives you a much sturdier backbone:
- Replayable telemetry – reproduce incident contexts for testing and post‑mortems
- Multiple consumers – feed different tools (dashboards, anomaly detectors, AI agents) from the same topics
- Decoupled ingestion and analysis – collectors push at their own pace, consumers pull at theirs
Kafka does not fix bad metric names or broken alert rules, but it does give you a consistent, durable pipe to reason about.
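To make the replay and fan‑out properties concrete, here is a minimal sketch that uses an in‑memory list as a stand‑in for one topic partition. It is not a Kafka client: `TopicLog`, `ConsumerGroup`, and the sample records are hypothetical names invented for illustration, and a real deployment would use a Kafka client library with committed offsets.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """In-memory stand-in for one topic partition (an append-only log)."""
    records: list = field(default_factory=list)

    def append(self, record: dict) -> int:
        """Append a record and return its offset."""
        self.records.append(record)
        return len(self.records) - 1

    def read(self, offset: int, max_records: int = 100) -> list:
        """Return up to max_records starting at offset; reading deletes nothing."""
        return self.records[offset:offset + max_records]


@dataclass
class ConsumerGroup:
    """Tracks its own offset, independent of every other group."""
    name: str
    offset: int = 0

    def poll(self, log: TopicLog) -> list:
        batch = log.read(self.offset)
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Rewind (or skip ahead) to replay a window of telemetry."""
        self.offset = offset


# Made-up sample records standing in for exported Kubernetes events.
log = TopicLog()
for reason in ["Pulled", "BackOff", "OOMKilled"]:
    log.append({"reason": reason})

dashboard = ConsumerGroup("dashboard")
agents = ConsumerGroup("ai-agents")

first_pass = agents.poll(log)         # the agents read every record
dashboard_view = dashboard.poll(log)  # the dashboard sees the same records

agents.seek(1)                        # rewind for a post-mortem replay
replayed = agents.poll(log)
```

Because each group keeps its own offset, rewinding the agents for a post‑mortem does not disturb what the dashboard has already consumed.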
A typical OpenTelemetry Collector configuration for this pattern looks like this (simplified):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Kubernetes events receiver from the contrib distribution
  k8s_events:
    namespaces: [production, staging]

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    timeout: 10s
    send_batch_size: 1000
    send_batch_max_size: 1500

exporters:
  kafka:
    brokers:
      - kafka-1.example.com:9092
      - kafka-2.example.com:9092
      - kafka-3.example.com:9092
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
    traces:
      topic: otel-traces
      encoding: otlp_proto
    metrics:
      topic: otel-metrics
      encoding: otlp_proto
    logs:
      topic: otel-logs
      encoding: otlp_proto

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [kafka]
    logs:
      receivers: [otlp, k8s_events]
      processors: [memory_limiter, batch]
      exporters: [kafka]
```
This keeps the collector focused on one job: getting signals in and pushing them reliably to Kafka.
Why a Separate Consumer Layer Matters
It is tempting to point your AI agents directly at Kafka topics, but that couples fragile prompt engineering with noisy raw data. A thin consumer service in the middle gives you a deterministic place to:
- De‑duplicate repeated events and alerts
- Join pod‑level signals to the Deployment and Service metadata
- Attach rollout information (who changed what, when, and via which pipeline)
- Apply simple rules (“ignore known‑benign events,” “group alerts by owner team”) before AI sees them
This consumer produces a single “incident context” document per active incident. AI agents then reason over this structured context instead of a firehose of raw logs.
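The de‑duplication and grouping described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the consumer's actual implementation: the `Event` and `IncidentContext` shapes, the `BENIGN_REASONS` ignore list, and the sample data are all hypothetical, and the pod‑to‑Deployment join is assumed to have happened already.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    namespace: str
    deployment: str  # owning Deployment, assumed already resolved from the pod
    reason: str      # e.g. "OOMKilled", "BackOff"
    owner_team: str


@dataclass
class IncidentContext:
    namespace: str
    deployment: str
    owner_team: str
    reasons: dict = field(default_factory=dict)  # reason -> occurrence count


# Hypothetical ignore list of known-benign event reasons.
BENIGN_REASONS = {"Pulled", "Scheduled", "Started"}


def build_contexts(events):
    """Collapse raw events into one context per (namespace, deployment)."""
    contexts = {}
    for ev in events:
        if ev.reason in BENIGN_REASONS:
            continue  # simple rule applied before any model sees the data
        key = (ev.namespace, ev.deployment)
        ctx = contexts.setdefault(
            key, IncidentContext(ev.namespace, ev.deployment, ev.owner_team)
        )
        # De-duplicate: a repeated event only bumps a counter.
        ctx.reasons[ev.reason] = ctx.reasons.get(ev.reason, 0) + 1
    return list(contexts.values())


# Made-up sample events for illustration.
events = [
    Event("production", "checkout", "OOMKilled", "payments"),
    Event("production", "checkout", "OOMKilled", "payments"),
    Event("production", "checkout", "BackOff", "payments"),
    Event("production", "search", "Scheduled", "discovery"),
]
contexts = build_contexts(events)
```

Keeping this step deterministic means the same Kafka records always produce the same context, which makes agent behavior easier to test and explain.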
A straightforward Kubernetes Deployment for the consumer might look like this:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: incident-context-consumer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: incident-context-consumer
  template:
    metadata:
      labels:
        app: incident-context-consumer
    spec:
      serviceAccountName: agent-runner
      containers:
        - name: consumer
          image: your-registry/incident-consumer:v1.0.0
          env:
            - name: KAFKA_BROKERS
              value: "kafka-1:9092,kafka-2:9092,kafka-3:9092"
            - name: INCIDENT_TOPIC
              value: "otel-logs"
            - name: OUTPUT_TOPIC
              value: "incident-contexts"
```
AI Agent Layer With CrewAI and Llama 3.1
On top of incident contexts, we can deploy a small CrewAI‑based agent layer. Meta’s Llama 3.1 models are available in 8B, 70B, and 405B parameter sizes, and the llama3.1:8b variant runs comfortably on a single modern GPU or even a beefy workstation via Ollama.
We split responsibilities into three agents:
- Triage Agent – groups related alerts, assigns severity, and identifies the likely owning team
- Diagnosis Agent – correlates Kubernetes events, metrics, and rollout changes to propose the most likely root cause
- Executor Agent – drafts safe, reversible next steps and requests human approval
A minimal CrewAI definition might look like this (illustrative):
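```python
from crewai import Agent, Crew, LLM, Task

# Hypothetical local module: each name is assumed to be a CrewAI-compatible
# tool class wrapping the Kubernetes, Slack, and Prometheus APIs.
from tools import K8sTool, PrometheusTool, SlackTool

# Llama 3.1 8B served by Ollama; point base_url at your own gateway.
llm = LLM(model="ollama/llama3.1:8b", base_url="http://ollama-gateway:11434")

# CrewAI expects a backstory per agent and an expected_output per task;
# the wording used here is illustrative.
triage_agent = Agent(
    role="Incident Triage Engineer",
    goal="Group related alerts and identify likely impact and owning team.",
    backstory="On-call SRE who turns noisy alerts into one incident view.",
    tools=[K8sTool(), SlackTool()],
    llm=llm,
)

diagnosis_agent = Agent(
    role="Correlation Analyst",
    goal="Correlate Kubernetes events with metrics and recent rollout data.",
    backstory="Engineer who links cluster events to recent changes.",
    tools=[PrometheusTool(), K8sTool()],
    llm=llm,
)

executor_agent = Agent(
    role="Runbook Automator",
    goal="Draft safe, reversible next steps and send them for approval.",
    backstory="Cautious operator who never acts without human sign-off.",
    tools=[K8sTool(), SlackTool()],
    llm=llm,
)

crew = Crew(
    agents=[triage_agent, diagnosis_agent, executor_agent],
    tasks=[
        Task(
            description="Triage incident context and assign severity.",
            expected_output="Severity, impacted services, and owning team.",
            agent=triage_agent,
        ),
        Task(
            description="Diagnose probable causes.",
            expected_output="Ranked root-cause hypotheses with evidence.",
            agent=diagnosis_agent,
        ),
        Task(
            description="Draft a safe remediation step and request approval.",
            expected_output="One reversible action posted for approval.",
            agent=executor_agent,
        ),
    ],
)
```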
The key is that only the Executor Agent proposes actions, and even then, those actions are routed through Slack for explicit human approval.
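The approval gate itself is ordinary code, not model behavior. The sketch below shows one way it could work, assuming a hypothetical `ProposedAction` record and an `apply_fn` callback standing in for the real scaling call; the Slack callback that would invoke `record_decision` is not shown.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class ProposedAction:
    deployment: str
    from_replicas: int
    to_replicas: int
    rationale: str
    decision: Decision = Decision.PENDING
    decided_by: str = ""

    def record_decision(self, approver: str, approved: bool) -> None:
        """Intended to be called by a (hypothetical) Slack callback handler."""
        if self.decision is not Decision.PENDING:
            raise ValueError("decision already recorded")
        self.decision = Decision.APPROVED if approved else Decision.REJECTED
        self.decided_by = approver


def execute(action: ProposedAction, apply_fn) -> bool:
    """Run apply_fn only for explicitly approved actions."""
    if action.decision is not Decision.APPROVED:
        return False  # pending and rejected actions are never applied
    apply_fn(action.deployment, action.to_replicas)
    return True


# Made-up example: scale a canary from 10 replicas back to 2.
applied = []
action = ProposedAction("checkout-canary", 10, 2, "error rate rose after rollout")

ran_before = execute(action, lambda name, n: applied.append((name, n)))
action.record_decision("oncall-sre", approved=True)
ran_after = execute(action, lambda name, n: applied.append((name, n)))
```

Storing `from_replicas` alongside the proposal keeps the rollback value next to the decision record, which helps the audit trail.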
RBAC: Safe, Scale‑Only Permissions
Kubernetes RBAC lets you grant fine‑grained permissions to specific subresources, including deployments/scale. This is exactly what we want for an AI‑assisted incident system: the ability to scale workloads up or down, without the power to change container images, environment variables, or security settings.
Scaling is reversible and far safer than mutating Deployment specs. See the official Kubernetes RBAC docs for full details on subresource permissions.
A typical “scaling‑only” role for agents looks like this:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: agent-runner
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: deployment-scaler
rules:
  # Read deployments and replica sets to understand current state
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # Scale deployments via the scale subresource
  - apiGroups: ["apps"]
    resources: ["deployments/scale"]
    verbs: ["get", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: agent-runner-deployment-scaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: deployment-scaler
subjects:
  - kind: ServiceAccount
    name: agent-runner
    namespace: default
```
By operating only on the `/scale` subresource, you give the agent layer exactly enough power to adjust replica counts and nothing else. See DZone's Implementing RBAC Configuration for Kubernetes Applications for more RBAC patterns.
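As a rough illustration of what a scale‑only call looks like, the sketch below builds (but does not send) a merge‑patch request against a Deployment's scale subresource. The helper names and the `MAX_REPLICAS` ceiling are assumptions made for this example; an actual agent would send the request through a Kubernetes client using the `agent-runner` service account.

```python
import json

MAX_REPLICAS = 50  # hypothetical safety ceiling, not a Kubernetes default


def scale_request(namespace: str, name: str, replicas: int) -> dict:
    """Describe a PATCH against the Deployment's scale subresource."""
    if not 0 <= replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be between 0 and {MAX_REPLICAS}")
    return {
        "method": "PATCH",
        "path": f"/apis/apps/v1/namespaces/{namespace}/deployments/{name}/scale",
        "headers": {"Content-Type": "application/merge-patch+json"},
        # Only spec.replicas is sent; the pod template is never touched.
        "body": json.dumps({"spec": {"replicas": replicas}}),
    }


def rollback_request(namespace: str, name: str, previous_replicas: int) -> dict:
    """Undoing a scale action is the same call with the earlier count."""
    return scale_request(namespace, name, previous_replicas)


request = scale_request("production", "checkout", 2)
undo = rollback_request("production", "checkout", 10)
```

The bounds check is a second guardrail on top of RBAC: RBAC limits which API the agent can call, while the ceiling limits how far a single approved action can move the replica count.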
How a Real Incident Flows
When a rollout goes wrong, or a dependency starts failing, a typical incident flows like this through the system:
- Telemetry capture: The OpenTelemetry Collector gathers metrics, traces, logs, and Kubernetes events, and exports them to Kafka.
- Context building: The consumer service reads relevant records from Kafka and builds an “incident context” (involving namespaces, Deployments, pods, events, SLOs, and recent changes).
- AI‑assisted triage: The Triage Agent classifies severity (e.g., SEV‑1 vs SEV‑3), identifies impacted services, and tags likely owner teams.
- Correlation and diagnosis: The Diagnosis Agent matches pod failure reasons (`ImagePullBackOff`, `OOMKilled`, `CrashLoopBackOff`, etc.) with rollout timelines and metric anomalies to propose plausible root‑cause hypotheses.
- Drafting a reversible action: The Executor Agent proposes a small, clearly reversible change: for example, temporarily scaling a canary deployment from 10 replicas back to 2, or scaling a known‑stable previous version up to absorb traffic.
- Human approval: The proposed command and rationale are posted to a Slack incident channel. An on‑call SRE or incident commander explicitly approves or rejects the action.
- Execution under RBAC: If approved, the agent uses its `deployments/scale` permissions to apply the change. Every call is logged and auditable.
For deeper context on incident response, see DZone's Incident Response Guide.
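Part of the correlation step can be done deterministically before any model is involved. The sketch below flags rollouts that landed shortly before the first failure event; the function name, the 30‑minute window, and the sample timestamps are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def rollouts_before_failure(rollouts, failures, window=timedelta(minutes=30)):
    """Return rollouts that happened within `window` before the first failure."""
    if not failures:
        return []
    first_failure = min(f["time"] for f in failures)
    return [
        r for r in rollouts
        if timedelta(0) <= first_failure - r["time"] <= window
    ]


# Made-up sample data for illustration.
rollouts = [
    {"deployment": "checkout", "revision": 41, "time": datetime(2024, 5, 1, 9, 0)},
    {"deployment": "checkout", "revision": 42, "time": datetime(2024, 5, 1, 13, 50)},
]
failures = [
    {"reason": "CrashLoopBackOff", "time": datetime(2024, 5, 1, 14, 2)},
    {"reason": "OOMKilled", "time": datetime(2024, 5, 1, 14, 5)},
]

suspects = rollouts_before_failure(rollouts, failures)
```

Here revision 42 is flagged because it landed 12 minutes before the first failure, while revision 41 falls outside the window. Handing the Diagnosis Agent a short list of suspects like this keeps its hypotheses traceable to specific changes.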
Where This Pattern Works Best (and Where It Doesn’t)
This architecture is strongest when:
- Telemetry is clean and labeled (good metric names, consistent labels, sane alerts)
- Triage, not remediation, is the bottleneck
- Runbooks already exist with reversible actions
- Platform teams are comfortable owning Kafka and the consumer service
It is less effective when:
- Every incident is truly novel and unstructured
- Data is sparse or heavily delayed
- Organizational trust in automation is low, and there is no appetite for experimental changes
- The AI endpoint itself has no SLOs, rate limits, or clear failure modes
Final Thoughts
This pattern fits squarely within the 2024–2026 shift toward platform engineering and AI-augmented DevOps workflows, but it succeeds only when built on strict operational guardrails. The goal isn't to replace humans in the incident response loop — it's to dramatically compress the time between "something broke" and "we understand the blast radius and have safe, reversible recovery options on the table."
AI agents excel at grouping noisy Kubernetes signals into coherent incident contexts and proposing next steps grounded in telemetry and recent changes. Humans remain the final decision-makers for production actions, retaining full control through Slack approval gates and Kubernetes RBAC constrained to the safe scale subresource.
When telemetry is clean, runbooks exist, and platform teams can own the Kafka/consumer layers, this architecture delivers measurable wins in mean time to understanding. When incidents remain truly novel or organizational trust in automation is low, it gracefully falls back to human-led triage. Either way, the system stays transparent, auditable, and reversible, and never expands the blast radius through opaque "magic."