Seeing the Whole System: Why OpenTelemetry Is Ending the Era of Fragmented Visibility

By a technology correspondent who has sat through enough war rooms to know that the data you need is almost always in a system nobody thought to connect.

Igboanugo David Ugochukwu

Apr. 16, 2026 · Opinion

The incident had been running for forty-seven minutes when I watched the on-call engineer open his sixth browser tab. Grafana for the infrastructure metrics. Splunk for the application logs. A separate Jaeger instance — legacy, running on a server that was itself poorly monitored — for traces from the API layer. A custom dashboard someone had built in Kibana eighteen months earlier for the payment service, which used a different logging format than everything else. And a Datadog trial that a team had spun up six weeks prior for a new microservice, not yet integrated with anything.

He wasn't incompetent. He was experienced, methodical, and clearly doing his best under pressure. The problem was that the answer — a cascade that had started when a downstream dependency began timing out under load, causing queue depth to grow on a service that nobody had instrumented with queue metrics — was distributed across four systems that had no awareness of each other. He had to hold the context in his head. Manually. While an incident was live.

They found the root cause at minute sixty-one. The customer-facing impact had lasted forty-four of those minutes. The postmortem identified the observability fragmentation as a contributing factor, listed it under "areas for improvement," and moved on to the next agenda item.

I've watched variations of that scene in a half-dozen organizations over the past two years. The tooling changes. The services change. The outcome — an engineer assembling context manually from disconnected systems while something is actively broken — remains depressingly consistent.

The Silo Problem Nobody Talks About Honestly

Here is the honest history of how most engineering organizations arrived at their current monitoring stack: incrementally, by accident, without design.

A team needs metrics. They stand up Prometheus. Another team is doing distributed tracing and chooses Jaeger because a consultant recommended it in 2021. The security team wants log aggregation and procures an ELK deployment. A new service gets built by an engineer who prefers Datadog and expenses a trial. An acquired company brings its own observability tooling in the merger. Nobody made a bad decision in isolation. The aggregate result is four or five disconnected systems, each with partial visibility into the environment, none of which speak to each other.

The cost of this architecture isn't obvious until an incident. In steady state, the fragmentation is an inconvenience — a bit of extra work to check multiple dashboards, some duplicated alerting logic, occasional inconsistencies between what different systems report. Engineers adapt. Runbooks get written that specify which tab to open first.

Then something goes wrong in a way that crosses system boundaries (which, in a microservices environment, is basically every interesting incident) and the cost becomes immediate and concrete. The trace context doesn't propagate from the service instrumented with one agent to the service instrumented with another. The log timestamp in one system doesn't align with the metric spike in the other, and you spend eight minutes working out whether the difference is a timezone issue or a real sequence of events. The tool that would answer your question doesn't have the data because that service was never instrumented for it.

The question OpenTelemetry is answering — slowly, imperfectly, but at a scale that suggests genuine momentum — is whether the industry can agree on a common foundation for telemetry that makes this fragmentation a choice rather than an inevitability.

What OpenTelemetry Actually Is, Stripped of the Hype

The CNCF project's ambitions are larger than its name implies. OpenTelemetry isn't primarily a tool. It's a specification, a set of APIs, a collection of SDKs across most major languages, and a Collector — a standalone service that receives, processes, and routes telemetry — that together constitute a vendor-neutral foundation for how applications produce and transmit observability data.

The practical significance of "vendor-neutral" is easy to understate. Before OpenTelemetry reached maturity — and it only really reached meaningful production stability in its core components sometime in 2023 — instrumenting an application for observability meant tying yourself to a specific vendor's agent or SDK. Switch from Datadog to Honeycomb, or from Jaeger to a commercial backend, and you were re-instrumenting. Not just reconfiguring — actually touching code, removing one library, adding another, retesting.

With OpenTelemetry, the instrumentation in application code emits to a standard protocol: OTLP, the OpenTelemetry Protocol. The Collector receives that data and routes it wherever you configure. Change your backend, change the Collector configuration. The application code doesn't know and doesn't care.
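
The separation described here can be sketched with a toy model. This is not the real Collector or any OpenTelemetry API: `ToyCollector`, `handle_checkout`, and the two in-memory "backends" are hypothetical names invented for illustration, and the only point is that the backend swap happens in configuration while the application function stays untouched.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Span = Dict[str, str]
Exporter = Callable[[Span], None]


@dataclass
class ToyCollector:
    """Toy stand-in for a Collector: receives spans, fans them out to configured exporters."""
    registry: Dict[str, Exporter]
    config: Dict[str, List[str]] = field(default_factory=lambda: {"exporters": []})

    def receive(self, span: Span) -> None:
        # Routing is driven purely by configuration, not by the caller.
        for name in self.config["exporters"]:
            self.registry[name](span)


# Two hypothetical backends, modeled as plain lists.
jaeger_store: List[Span] = []
vendor_store: List[Span] = []

collector = ToyCollector(
    registry={"jaeger": jaeger_store.append, "vendor": vendor_store.append},
    config={"exporters": ["jaeger"]},
)


def handle_checkout(target: ToyCollector) -> None:
    """Application code: it emits a span and knows nothing about backends."""
    target.receive({"name": "POST /checkout", "service": "payments"})


handle_checkout(collector)                     # lands in the Jaeger-like store
collector.config = {"exporters": ["vendor"]}   # the "migration": config only
handle_checkout(collector)                     # same code, different backend
```

In a real deployment the equivalent change would be made in the Collector's configuration file rather than in a Python object, but the shape of the argument is the same.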

This portability is real and I've watched organizations use it in practice. A fintech company in São Paulo that I spent time with in mid-2025 had been running Jaeger for distributed tracing. Their compliance team needed traces available in a system their auditors could access with enterprise-level controls — specifically, a commercial vendor's platform. Because their instrumentation was already OTel-native, the migration was a Collector configuration change and a two-day integration project. The engineers were visibly surprised it went that smoothly. Their previous vendor migration, before OTel, had taken three months.

The Adoption Numbers and What They Mean

EMA Research published figures in 2025 that I found genuinely striking for a project that was still cutting release candidates as recently as 2022: nearly half of organizations surveyed reported active OpenTelemetry usage in production, with another quarter indicating planned adoption. Grafana's observability survey from the same period showed Prometheus running at 67% adoption — its established position — while OpenTelemetry had closed to 41%, an extraordinary trajectory for a project that was pre-1.0 on most signals until 2023.

What explains that velocity? Partly the backend consolidation play: organizations that have already committed to multiple observability vendors simultaneously see real value in a neutral collection layer. Partly the engineering community's attraction to open standards over proprietary lock-in, which has only intensified as vendor pricing for high-cardinality metrics and traces has become a genuine budget line item. And partly, I think, the slow accumulation of platform engineering investment: teams that are already thinking about their infrastructure as a product are more likely to make deliberate observability decisions rather than accumulating tools reactively.

The figure that surfaces in EMA's research, with 84% of OTel adopters reporting meaningful cost reductions, is worth treating carefully, since vendor-adjacent surveys have obvious incentive structures. But the cost argument has a structural logic independent of any survey. When you centralize telemetry collection through a Collector with sampling and filtering capabilities, you gain control over what you're actually sending to backends. A common pattern I've seen in large-scale deployments: teams instrument comprehensively at the source, then configure tail-based sampling at the Collector level to send perhaps 10 to 15 percent of traces to expensive storage backends while retaining 100 percent of errored or slow traces. The result is complete visibility into what's actually going wrong, at a fraction of the ingestion cost of sending everything everywhere.
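
The sampling policy described above can be sketched as a decision function. This is a minimal illustration in plain Python, not the Collector's tail sampling processor: `SpanRecord`, `keep_trace`, the one-second slow threshold, and the 10 percent baseline are all assumptions chosen for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class SpanRecord:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans: List[SpanRecord], slow_ms: float = 1000.0, baseline_pct: int = 10) -> bool:
    """Decide whether to export a trace once all of its spans have been seen.

    Seeing the complete trace first is what makes the decision tail-based:
    errors and slow spans are only known after the request has finished.
    """
    if not spans:
        return False
    # Always keep traces that contain an error or a slow span.
    if any(s.is_error for s in spans):
        return True
    if any(s.duration_ms >= slow_ms for s in spans):
        return True
    # For healthy traces, keep a deterministic slice: hashing the trace ID
    # means every replica makes the same choice for the same trace.
    digest = hashlib.sha256(spans[0].trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < baseline_pct


failed_checkout = [
    SpanRecord("4bf92f3577b34da6a3ce929d0e0e4736", 12.0, False),
    SpanRecord("4bf92f3577b34da6a3ce929d0e0e4736", 40.0, True),
]
print(keep_trace(failed_checkout))
```

The trade-off worth noting is that a tail-based decision requires buffering every span of a trace until the trace completes, which is part of why the Collector needs real capacity planning.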

The Collector as the Linchpin

Of all OpenTelemetry's components, the Collector is the one I've watched teams misunderstand most consistently — both underinvesting in it and overcomplicating it.

The underinvestment failure mode: treat the Collector as a pass-through and configure it to forward everything to a single backend without filtering, sampling, or enrichment. This works. It also eliminates most of the architectural benefit of centralizing collection in the first place. A Collector that simply relays raw telemetry is better than per-service direct export to a vendor — at least your configuration is centralized — but it's not capturing the value of having a processing layer in the pipeline.

The overcomplication failure mode: attempt to route telemetry to five backends simultaneously from day one, with complex processor chains, multiple sampling strategies, and attribute transformations that nobody fully understands six months later. I've seen this create Collector configurations that are harder to reason about than the systems they're observing, maintained by one engineer who has become the de facto owner of something that should be team-legible infrastructure.

The teams that do this well — and the pattern is consistent enough that I've started calling it out explicitly in conversations — start with one receiver, one processor, one or two exporters, and a clear ownership model. They expand the pipeline deliberately, treating each new processor or export target as a discrete decision with documented rationale. Their Collector configuration is in Git. Changes go through review. The observability pipeline is itself observable: they watch the Collector's own health metrics for export latencies and drop rates.
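
Watching the pipeline's own drop rate can be as simple as comparing two counters from the Collector's metrics endpoint. The sketch below assumes a Prometheus-style text scrape without timestamps. The metric names are modeled on the Collector's internal exporter counters, but exact names and suffixes vary by version, so treat them as placeholders and check your own scrape output.

```python
from typing import Dict

# Hypothetical scrape output; names and values are illustrative only.
SAMPLE_SCRAPE = """
# TYPE otelcol_exporter_sent_spans counter
otelcol_exporter_sent_spans{exporter="otlp"} 9500
# TYPE otelcol_exporter_send_failed_spans counter
otelcol_exporter_send_failed_spans{exporter="otlp"} 500
"""


def parse_counters(exposition: str) -> Dict[str, float]:
    """Sum samples by metric name, ignoring labels and comment lines.

    Simplification: assumes each sample line ends with its value (no timestamp).
    """
    totals: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def failure_ratio(totals: Dict[str, float], sent: str, failed: str) -> float:
    """Fraction of export attempts that failed; 0.0 when nothing was attempted."""
    ok = totals.get(sent, 0.0)
    bad = totals.get(failed, 0.0)
    attempted = ok + bad
    return bad / attempted if attempted else 0.0


ratio = failure_ratio(
    parse_counters(SAMPLE_SCRAPE),
    sent="otelcol_exporter_sent_spans",
    failed="otelcol_exporter_send_failed_spans",
)
print(f"export failure ratio: {ratio:.2%}")
```

A ratio like this is a natural candidate for an SLO on the Collector itself, which is the posture the quote that follows describes.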

An SRE manager at a US-based SaaS company described this to me in September 2025 with unusual clarity: "We treat the Collector like a service. It has an owner, it has SLOs, it has an on-call rotation. When we first deployed it we treated it like infrastructure — just set it up and forgot about it. That lasted until it became a single point of failure for our entire telemetry path during an incident and we had no visibility into why."

The Correlation Problem That OTel Mostly Solves

The deepest value of a unified telemetry standard isn't the cost savings or the backend portability. It's correlation — the ability to move from a metric anomaly to the trace that explains it to the log line that identifies the specific operation.

Before unified context propagation, this was manual. You saw a latency spike in your metrics, pulled up your tracing tool, searched by time window and service name, found the relevant traces — maybe — then looked for correlated logs by timestamp, hoping the clocks were synchronized and the log levels were informative enough to be useful. For an experienced engineer who knew all the systems, this might take five minutes. For someone less familiar with the environment, or dealing with an unfamiliar failure mode, it could take much longer.

OpenTelemetry's trace context propagation makes correlation mechanical. The mechanism is the W3C Trace Context traceparent header, which OTel SDKs and instrumentation libraries attach automatically to HTTP calls between services. A single trace ID links the request path across every service it touched. If your logs also emit that trace ID, which OTel log instrumentation handles, you can navigate from a slow span in a trace directly to the log lines produced during that span, in the same system, with a single click in any backend that supports the correlation.
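
To make the header concrete: a traceparent value in the W3C Trace Context format has four hyphen-separated hex fields (version, trace ID, parent span ID, flags). The sketch below parses one and builds the header a service would send downstream. It is a simplified illustration that accepts version 00 only, and the names `TraceContext`, `parse_traceparent`, and `child_traceparent` are hypothetical. In practice the OTel SDK's propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00: 2 hex - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
if ctx is not None:
    print(child_traceparent(ctx))
```

The trace ID is the one field that never changes from hop to hop, which is why it works as the join key across traces and logs.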

I watched a junior engineer at a retailer do a root-cause analysis last November that, by the on-call lead's estimate, would have taken forty minutes before their OTel migration. It took nine. She had been on the team for three months. She'd never seen the failure mode before. The trace context gave her a path through the system that she could follow without needing to know in advance which service to look at next.

That's the promise that the observability conversation has been making for five years. OpenTelemetry is the first time I've watched it delivered consistently enough, in enough organizations, to stop treating it as aspirational.

What Remains Hard

Honesty requires acknowledging the parts that haven't gotten easier.

Auto-instrumentation — OTel's mechanism for capturing telemetry from common libraries without code changes — is excellent for standard HTTP calls, database queries, and gRPC. It's considerably less useful for anything proprietary or unusual: custom message queue implementations, legacy protocols, in-house frameworks built before any of this existed. Teams with significant legacy surface area still face manual instrumentation work that is unglamorous and time-consuming.

Log integration is the signal that has lagged furthest. Traces and metrics in OTel are mature and stable. The logging specification and its SDK implementations have been catching up, and the situation is meaningfully better in early 2026 than it was eighteen months ago, but organizations with established logging pipelines face real migration complexity if they want fully correlated logs under OTel. The teams I've seen navigate this most smoothly have done it incrementally: add trace and span IDs to existing log output first, then migrate the collection path when the operational picture is clearer.
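
The incremental first step, adding trace and span IDs to existing log output, can be sketched with the standard library alone. The assumption here is a hypothetical `current_trace` context variable holding the active IDs. In a real service those values would come from the tracing SDK's current span, and the logger name and format string are illustrative.

```python
import contextvars
import io
import logging
from typing import TextIO, Tuple

# Hypothetical holder for the active (trace_id, span_id); a tracing SDK
# would normally provide this.
current_trace: contextvars.ContextVar[Tuple[str, str]] = contextvars.ContextVar(
    "current_trace", default=("", "")
)


class TraceContextFilter(logging.Filter):
    """Stamp every record with the active trace and span IDs."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_trace.get()
        record.trace_id = trace_id or "-"
        record.span_id = span_id or "-"
        return True  # never drop the record, only enrich it


def build_logger(stream: TextIO) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    ))
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger("payments")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
current_trace.set(("4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7"))
log.info("charge declined")
print(buffer.getvalue(), end="")
```

Because only the log format changes, the existing collection path keeps working, and the IDs are already in place when the pipeline is migrated later.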

And the Collector's operational complexity is real. It's not prohibitive, but it's not invisible either. A production Collector deployment handling high-cardinality telemetry from dozens of services is infrastructure that requires capacity planning, failure mode analysis, and ongoing operational attention. Teams that assume the Collector is a set-and-forget component inevitably discover otherwise.

The Visibility You Don't Have Is the One That Matters

I've thought about that engineer in the war room often over the past year. Six browser tabs, forty-seven minutes in, and an incident that the existing data could have explained. That data just lived in the wrong places.

The case for unified observability isn't primarily theoretical. It's the accumulated cost of every incident that ran longer than it needed to because context was scattered, every postmortem that identified observability gaps as a contributing factor and then filed that finding away, every junior engineer who couldn't navigate an unfamiliar system under pressure because there was no coherent thread to follow.

OpenTelemetry doesn't eliminate incidents. It doesn't make systems less complex. What it does — when it's implemented thoughtfully, with a Collector that's treated as real infrastructure and instrumentation that covers the services that actually matter — is make the complexity legible. One data model. One propagation standard. One collection pipeline. Backends that can be swapped without touching application code.

For an industry that has been drowning in its own telemetry for the better part of a decade, that's not nothing.

The author covers cloud infrastructure, reliability engineering, and distributed systems for enterprise technology organizations. They have reported from engineering teams across North America, Europe, and South America over fifteen years.

Opinions expressed by DZone contributors are their own.
