OpenTelemetry: Unifying Application and Infrastructure Observability

Explore how OpenTelemetry is revolutionizing observability by unifying application and infrastructure monitoring and empowering developers with open standards.

Tom Smith

Jul. 25, 24 · Interview

In this insightful Q&A, Goutham Veeramachaneni, a long-time Prometheus maintainer and Product Manager at Grafana Labs, shares his unique perspective on the transformative impact of OpenTelemetry (OTel) in the observability landscape. Veeramachaneni discusses how OTel is standardizing telemetry data and inspiring new open-source data collectors and workflows that bridge the gap between application and infrastructure monitoring. He offers valuable insights into the evolving ecosystem, the challenges ahead, and the exciting possibilities for developers in composing more effective telemetry data pipelines.

Q: As a Long-Time Prometheus Maintainer, What’s Your Take on the Overall Impact That OpenTelemetry Has Had on the Market?

A: It’s given developers and platform teams much greater ownership of their data, along with flexibility and freedom that they didn’t have before. Previously, with no universal open standard for telemetry data, proprietary vendor mousetraps were designed to make it super difficult to migrate to other solutions, which was insane. These vendors didn’t have much incentive to innovate or compete, because they had built such effective mousetraps to lock users in. They spoke their own protocols and collected their own metrics, and there was no standardization. OpenTelemetry has already forced the entire market to standardize on OTLP (the OpenTelemetry Protocol) and its ecosystem of SDKs and APIs. That has taken the power away from vendors and created a standard that is dynamic and open and where everyone collaborates, which is driving a ton of innovation.

Q: What Is the Most Exciting New Progress That You’ve Seen With OTel in the Last Year?

A: I’m really excited to see how Prometheus and OTel are coming together, and all the momentum that’s bringing application observability to the same level of standardization and consistency as infrastructure observability. Prometheus is such a staple in infrastructure observability: everyone uses it directly, or a flavor of Prometheus from a vendor. One of the reasons Prometheus is so popular is that there is an exporter for just about every infrastructure component and a massive community supporting it. Until OTel, however, no such standardization and velocity of innovation existed in application observability, because it took a lot of effort to create auto-instrumentation agents, and only the prominent vendors with teams of 20 people working on these agents had that skill. So application observability had all these proprietary protocols and methods for metrics collection, and there was no standardization. But now OTel has created a foothold by bringing in a standard for monitoring your application, and it is implementing that standard for all the popular languages.

Q: What’s the Implication of Application and Infrastructure Observability Coming Together, and What Needs To Happen for Us To Get There?

A: Well, we already have a standard. Just as Prometheus has its exporters, the application observability world now has all these SDKs and auto-instrumentation agents generating OpenTelemetry data. Prometheus has made a lot of strides in the past year to start ingesting OTel data, so infrastructure and application metrics can now sit side by side in the same system. Prometheus 3.0, coming later this year, has OpenTelemetry support as one of its main features and focus areas.

However, the story doesn’t end there. You need to be able to correlate the metrics easily. For example, you need to be able to relate a spike in errors in your RED (rate, errors, duration) metrics to CPU saturation on the node the application is running on. This is tricky because the conventions for Prometheus and OpenTelemetry don’t line up yet. I believe this is what the community will focus on in the next year: making sure you can seamlessly correlate and navigate the data between the two worlds of app monitoring with OTel and infra monitoring with Prometheus.
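As a rough illustration of that convention gap, the Python sketch below shows a simplified form of the name translation between OTel-style and Prometheus-style identifiers, followed by a join of application series to node CPU data on a shared node label. The unit table, function names, attribute keys, and sample data are assumptions made for this example; the real translation rules cover many more cases.

```python
import re

# Assumed subset of unit-to-suffix mappings, chosen for illustration only.
UNIT_SUFFIXES = {"s": "seconds", "ms": "milliseconds", "By": "bytes"}


def otel_to_prom_name(name, unit="", is_monotonic_sum=False):
    """Simplified translation of an OTel metric name into Prometheus style."""
    prom = re.sub(r"[^a-zA-Z0-9_:]", "_", name)  # dots are not valid in classic Prometheus names
    suffix = UNIT_SUFFIXES.get(unit)
    if suffix and not prom.endswith("_" + suffix):
        prom += "_" + suffix
    if is_monotonic_sum and not prom.endswith("_total"):
        prom += "_total"
    return prom


def otel_to_prom_label(key):
    """Simplified translation of an OTel attribute key into a Prometheus label name."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", key)


def correlate_by_node(app_series, node_cpu, cpu_threshold=0.9):
    """Return (service, node, cpu) for app series running on saturated nodes.

    app_series: dicts with an "attributes" mapping using OTel-style keys.
    node_cpu:   hypothetical mapping of node name -> CPU utilization (0..1).
    """
    hits = []
    for series in app_series:
        labels = {otel_to_prom_label(k): v for k, v in series["attributes"].items()}
        node = labels.get("k8s_node_name")
        cpu = node_cpu.get(node)
        if cpu is not None and cpu >= cpu_threshold:
            hits.append((labels.get("service_name"), node, cpu))
    return hits


if __name__ == "__main__":
    print(otel_to_prom_name("http.server.request.duration", unit="s"))
    apps = [
        {"attributes": {"service.name": "checkout", "k8s.node.name": "node-a"}},
        {"attributes": {"service.name": "cart", "k8s.node.name": "node-b"}},
    ]
    print(correlate_by_node(apps, {"node-a": 0.97, "node-b": 0.40}))
```

The join only works once both sides agree on what the node label is called, which is the kind of alignment the answer above describes as still in progress.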

Finally, there is also the sticking point of “collection” of this data. While you can collect Prometheus metrics with the OpenTelemetry Collector, you’ll convert the Prometheus data into OTLP and then back into Prometheus data (as the datastore is Prometheus). This has a high CPU overhead today and is something we want to optimize. This is also why I am excited about Alloy, Grafana’s new open-source collector. Alloy comes from a Prometheus-first heritage and embeds several Prometheus infrastructure exporters. It is also an OTel Collector distribution and supports collecting and processing OTel data efficiently. It shines because, if you are collecting Prometheus data and your final destination is also Prometheus, it avoids the CPU cost of converting into OTLP in the pipeline.
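The round trip described above can be pictured with a toy model. The Python sketch below is only an illustration: the `PromSample` and `OtlpLikePoint` types and the `Pipeline` class are hypothetical stand-ins, far simpler than the real data models, and they do not reflect how the OpenTelemetry Collector or Alloy are implemented. It shows that a Prometheus-to-Prometheus path through an OTLP-like intermediate form does two conversions per sample to arrive at the same data a pass-through delivers with none.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromSample:
    name: str
    labels: tuple        # sorted (key, value) pairs
    value: float
    timestamp_ms: int    # Prometheus timestamps are in milliseconds


@dataclass(frozen=True)
class OtlpLikePoint:
    name: str
    attributes: tuple
    value: float
    time_unix_nano: int  # OTLP timestamps are in nanoseconds


class Pipeline:
    """Toy collector pipeline that counts how many conversions it performs."""

    def __init__(self, convert_through_otlp):
        self.convert_through_otlp = convert_through_otlp
        self.conversions = 0

    def _to_otlp(self, sample):
        self.conversions += 1
        return OtlpLikePoint(sample.name, sample.labels, sample.value,
                             sample.timestamp_ms * 1_000_000)

    def _to_prom(self, point):
        self.conversions += 1
        return PromSample(point.name, point.attributes, point.value,
                          point.time_unix_nano // 1_000_000)

    def forward(self, samples):
        if not self.convert_through_otlp:
            return list(samples)  # Prometheus in, Prometheus out: no conversion
        return [self._to_prom(self._to_otlp(s)) for s in samples]


if __name__ == "__main__":
    scraped = [
        PromSample("node_cpu_seconds_total", (("cpu", "0"), ("mode", "idle")),
                   1234.5, 1721900000000),
    ]
    round_trip, direct = Pipeline(True), Pipeline(False)
    assert round_trip.forward(scraped) == direct.forward(scraped)
    print(round_trip.conversions, direct.conversions)
```

In a real collector each conversion also involves allocation and name and label translation, which is where the CPU cost mentioned above would come from.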

That’s one of the beauties of having open standards and something stable to build against: you can use the OTel Collector directly, you can use a new collector like Alloy that is optimized for making Prometheus and OTLP work together more seamlessly, or you can try any other collector. OTel has created a buyer’s market where developers have many options for which observability tools they use, while knowing that they own their telemetry data underneath and that OTel itself is doing the heavy lifting of language support.

Q: Can You Describe the Historical Disconnect Between Application and Infrastructure Observability?

A: Today, if you look at the systems you’re trying to debug, you have the infrastructure (databases such as MySQL and Postgres, plus node-level data like which hosts you’re running on and how much memory and CPU they’re using), and then you are running applications and containers on top. When you get a page because users are seeing errors, you have a set of dashboards that show a list of applications and how each application is performing. It’s easy to see where an error is occurring (say, in MySQL), but going from there to finding out what is wrong with MySQL is not easy, because the application is talking OpenTelemetry metrics and MySQL is talking Prometheus metrics, and there is no standardization between the two. It takes a lot of expertise to find the correct instance of a running service and do these types of correlations. This is going to get much easier as we see deeper native integration between Prometheus and OpenTelemetry.

Q: What Are Some of the Problem Domains That You See OTel Tackling in the Future That Are Yet To Be Solved for Telemetry Data?

A: To get anything across the goal line in terms of standardization, you have to build consensus across many people and groups, and that’s one of the things that’s so remarkable about what OTel has accomplished. Once specifications are settled, execution and innovation on top of them become easy. I see some realms where it’s inevitable that OpenTelemetry will have a similar impact on standardizing telemetry data, but where things are still so early that consensus is still some way off. Front-end user monitoring is one example. In logging, we need to see more databases and logging systems adopting the OTel specification. Semantic conventions for LLMs are still largely proprietary. Observing messaging queues and the applications built on top of them is still in its early days. There’s a ton of innovation in the CI/CD space, which needs better metrics to understand how long builds are taking, and that’s another exciting opportunity area for OTel.
