Devs Don't Want More Dashboards; They Want Self-Healing Systems

Developers don't want more dashboards to stare at or more complex alerts to manage; they want systems that actively heal themselves.

Thomas Johnson

CORE ·

Jun. 22, 26 · Analysis

Likes (0)

Comment

Save

1.2K Views

Every observability vendor's roadmap right now includes some version of "AI-powered insights." Smarter dashboards, with an assistant bolted on, to help you make sense of the data faster.

That's not what developers are asking for.

Nobody opens a laptop hoping for a better dashboard. What they're actually hoping for is a system that goes from bug to fix on its own, so their job shifts from digging through logs at 3 a.m. to something that actually uses their judgment: governing outcomes, managing risk, deciding which fixes get shipped and which need a second look.

That idea of self-healing software isn't new. IBM coined the term in 2001 with the vision formalized into a loop: monitor, analyze, plan, execute. For two decades, only the first and last steps were actually automated. Analyzing why something broke and planning a fix for it requires judgment, and that's always been a human job. AI coding agents are the first real candidates to take it on.

This article looks at what that actually means in practice and what has to change before AI agents can close the bug-to-fix loop for real.

The Self-Healing Loop, Only Half Solved

Infrastructure heals itself constantly. Kubernetes restarts what crashes. Autoscalers add capacity. Circuit breakers fail over. Nobody gets paged for any of it. Graceful degradation, resilient architecture, automated failover: these ideas are so built into how we expect distributed systems to behave that it's easy to forget they weren't always there.

"Self-healing" was formally introduced by IBM in 2001, when Paul Horn proposed systems that could regulate themselves the way the human autonomic nervous system does: automatically, without conscious thought.

“An autonomic computing system must perform something akin to healing — it must be able to recover from routine and extraordinary events that might cause some of its parts to malfunction.

It must be able to discover problems or potential problems, then find an alternate way of using resources or reconfiguring the system to keep functioning smoothly.”

The paper itself acknowledges that the easy part was already solved even in 2001: "certain types of 'healing' have been a part of computing for some time. Error checking and correction […] and redundant storage systems like RAID allow data to be recovered even when parts of the storage system fail." In other words, IBM already knew the hard part would be root cause analysis: figuring out what actually broke and why, not just recovering from the fact that something did.

These principles were eventually formalized into the MAPE-K loop: Monitor, Analyze, Plan, Execute, all running against a shared Knowledge base. It became the reference model for how a self-managing system should behave.

Two and a half decades later, half of that loop is a solved problem. Monitoring and executing are largely mechanical: detect a deviation, run a predefined response. The infrastructure layer works so well that we've stopped calling it "self-healing" at all. It's just how systems behave now.

The other half was, and still is, the hard part. Analyzing why something broke and planning what to do about it requires reasoning about what a system is supposed to do, not just whether it's currently running. For application-level bugs, that reasoning has always required a person. No amount of infrastructure automation changes the fact that someone still has to figure out why the checkout flow is returning the wrong total.

With AI coding agents, we finally have the first credible candidate to take on the analysis and the planning.

Closing the Loop

Here's what fixing a bug actually looks like for most developers today.

An alert fires in PagerDuty or Slack. You open your APM (Datadog, New Relic, ...) and start hunting for the error. Once you find it, you switch to logs, search for the request ID, and start piecing together what happened. From there, it's traces: open Tempo or Jaeger, scroll through 200+ spans looking for the one that matters. By now, you've switched tools four times, and you still don't have a fix. You move to your IDE, run git blame to figure out who touched this code last and why, form a theory about what's actually wrong, and finally try something.

Five tools. Eleven steps. Four hours. And that's the good outcome, where the fix on the first attempt is the right one.

This is the loop that "AI-powered observability" claims to close. Bolt an agent onto the APM, give it access to the logs and traces, and, in principle, the agent does steps 2 through 10, and a developer just reviews step 11.

In practice, this doesn't close the loop. It automates a workflow that was designed for humans and not agents (and that matters greatly).

Every step in that staircase exists because the data needed for the next step lives somewhere else, in a different tool, with a different data model, often with no shared identifier connecting them. A human bridges these gaps with intuition: they know, roughly, what an error in the APM probably looks like in the logs, and what a slow span in the trace probably means for the code. An agent doing the same walk doesn't have that intuition. It has to either guess at the same correlations a human guesses at, usually with less context, or be given a stack where those correlations already exist before it starts.

Closing the loop, for real, means the agent's starting point isn't step 1. It's closer to step 11, already holding the unsampled, full-stack session data, pre-correlated, deduplicated, with the relevant code already identified, before it ever opens a single tool.

Systems That Watch and Heal Themselves

Developers don't want more dashboards to stare at or more alerts to triage. They want the thing that broke to fix itself, the way a bruise heals without you having to think about it.

Getting there starts with the telemetry layer. Today it's a passive record: data gets written somewhere, and someone (human or agent) comes along later to dig through it.

An architecture built for this new consumer, AI coding agents, works differently. It captures full-fidelity, pre-correlated session data at the source, so an agent isn't reconstructing a failure from sampled traces or scattered tools. The data arrives ready to reason about.

A few concrete shifts follow from that. Random sampling fades, replaced by systems that cache locally and decide, in the moment, what's worth keeping when something goes wrong. Observability stops being a separate product bolted onto the side of a system and becomes part of how the system runs: less a storage bucket, more an active participant. And the whole model flips from pull to push: instead of someone opening a dashboard to go looking for a problem, the system surfaces what happened, pre-correlated by user, session, and deployment, the moment it happens.

What changes for developers is the shape of the work itself. Less time spent reconstructing what broke from fragments across five tools. More time spent on the things that actually require judgment: deciding which fixes are safe to ship automatically, which ones need a second look, and what the system should and shouldn't be allowed to do on its own.

That's the goal "self-healing" has pointed at since IBM coined the term in 2001, modeled on a nervous system that handles the details so you can think about the things that matter. Paul Horn put it simply: the best measure of success is when people think about the functioning of computing systems "about as often as they think about the beating of their hearts."

Twenty-five years later, that's finally within reach.

systems Observability

Opinions expressed by DZone contributors are their own.

Related

Trending