
Designing Prometheus Metrics


In Prometheus, tagging is essential, but somewhat different. Learn what you need to know about how it applies to designing metrics.

· Performance Zone ·


Foreword: This article was written before the first release of the Vert.x Micrometer module, which was inspired by the examples and code snippets presented here.

Working on Hawkular, Vert.x Micrometer, and Kiali, I am often confronted with the responsibility of designing metrics in a way that makes them easy to query and to build dashboards from.

With Hawkular Metrics, as well as many other time series engines, one essential concept for producing good, searchable metrics is tagging. I had the occasion to write about it, specifically about how Hawkular Metrics tagging could be leveraged while building Grafana dashboards, and concluded with these words: “Proper metrics tagging is the cornerstone to make sense of [your] data.”

With Prometheus, this is even truer. And also very different.

In Hawkular Metrics, tags are essentially metadata. A metric, defined by its name, can be annotated with key-value pairs. It’s just like a Map associated with the metric. A tag key can only have one value for a given metric. As a side-effect, tags can be modified over time without impacting the metric itself.

In Prometheus, first of all, they don’t call it a tag, they call it a label. And it means dimension. That is to say, a metric can have several dimensions, one per label, and every combination of label values represents a unique time series. Influx is another TSDB that leverages dimensions. That might seem complicated for those who’ve never used it, but it’s not. Let’s take a concrete example, from an implementation of Vert.x metrics¹:

 vertx_eventbus_handlers{address="my_endpoint"}

Here vertx_eventbus_handlers is the metric name, address is a label, and the whole vertx_eventbus_handlers{address="my_endpoint"} denotes a time series.

Now if I have:

 vertx_eventbus_handlers{address="another_endpoint"}

It denotes another time series that exists independently from the first one.

One crucial aspect that comes into play at this point is that label cardinality must be considered carefully while designing Prometheus metrics. If you set uncontrolled, unbounded label values on a metric, you will blow up your Prometheus server with tons of time series.
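To make the cardinality problem concrete, here is a toy sketch (illustrative Python, not Prometheus internals) of how a dimensional TSDB ends up with one time series per unique label combination. A bounded label value keeps the series count flat, while an unbounded one, like a client IP, creates a new series for every distinct value:

```python
# Toy model: a time-series store keyed by (metric name, sorted label pairs).
# This illustrates the cardinality problem; it is not real Prometheus code.
series_store = {}

def record(name, labels, value):
    """Each unique label combination becomes its own time series."""
    key = (name, tuple(sorted(labels.items())))
    series_store.setdefault(key, []).append(value)

# A bounded label value ("address") keeps the series count under control:
for i in range(1000):
    record("vertx_eventbus_handlers", {"address": "my_endpoint"}, 1)
bounded_series = len(series_store)  # still just 1 series for 1000 samples

# An unbounded label value (one per client IP) explodes the series count:
for i in range(1000):
    record("http_response_time", {"remote": f"10.0.0.{i}"}, 0.2)

print(bounded_series)     # 1
print(len(series_store))  # 1001
```

A thousand samples on a bounded label cost one series; a thousand distinct clients cost a thousand series, and real traffic is far worse.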

Still with Vert.x Metrics, there is a metric that tracks an HTTP server response time:

 vertx_http_server_response_time{local=<server address>,remote=<client address>}

Is there a problem here? Sure there is. If a Vert.x server is going to be used as a public web server, which is very likely, the remote label values are unbounded. We would get tons of time series, like:

  • vertx_http_server_response_time{local="localhost",remote="(an IP)"}
  • vertx_http_server_response_time{local="localhost",remote="(another IP)"}
  • vertx_http_server_response_time{local="localhost",remote="(yet another one)"}
  • and so on…

Prometheus doesn’t like it. However, there might be some cases where the remote end is bounded. For instance, think about a private REST API that is consumed only by a small set of known clients. In Vert.x, the strategy we have is to ignore unbounded labels by default, but they can still be activated through configuration. As a general rule, we must resist the temptation to label everything.
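The “ignored by default, activated by config” strategy can be sketched as a simple label filter. The `ALLOWED_LABELS` set below is a hypothetical stand-in for the configuration, not the actual Vert.x option:

```python
# Sketch of label filtering: potentially unbounded labels are dropped unless
# the operator explicitly opts in. ALLOWED_LABELS is a hypothetical config.
ALLOWED_LABELS = {"local"}  # "remote" is off by default; add it to opt in

def filter_labels(labels, allowed=ALLOWED_LABELS):
    """Keep only the labels the operator has opted into."""
    return {k: v for k, v in labels.items() if k in allowed}

raw = {"local": "localhost", "remote": "203.0.113.7"}
print(filter_labels(raw))  # {'local': 'localhost'}
```

The point of the design is that the safe behavior (dropping the unbounded dimension) is the default, and the risky one is a deliberate choice.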

Why is it tempting, by the way? It is tempting because, on the querying side, we can play with labels and aggregate things very easily. These examples are taken from a demo of my own:

  • Query a specific time series:

 vertx_eventbus_received{address="my_endpoint"}

  • Query all series for this metric:

 vertx_eventbus_received

  • Query the sum:

 sum(vertx_eventbus_received)

And so on. PromQL can do a lot. And labels are essential here.
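What a label-collapsing `sum` actually does can be shown with a self-contained sketch. The metric name and sample values here are made up for illustration; the mechanics mirror how PromQL aggregates away a dimension:

```python
# Illustrative sketch of summing a metric over its label dimension,
# the way PromQL's sum() collapses series. Sample values are made up.
samples = {
    ("vertx_eventbus_received", ("address", "my_endpoint")): 120,
    ("vertx_eventbus_received", ("address", "other_endpoint")): 30,
}

def query_sum(metric):
    """Aggregate all series of a metric, dropping the label dimension."""
    return sum(v for (name, _), v in samples.items() if name == metric)

print(query_sum("vertx_eventbus_received"))  # 150
```

One query, no enumeration of label values: that convenience is exactly what makes labelling everything so tempting.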

But this leads us to another important aspect of designing labels: if we sum a metric’s series over one of its dimensions, the result has to be meaningful. In the above example, it is meaningful: it is the sum of all event bus messages received per address (in Vert.x, an address is more or less an event bus endpoint). In other words, it’s all event bus messages received by the application.

Alongside this metric, there are others such as “vertx_eventbus_sent”, “vertx_eventbus_pending”, “vertx_eventbus_delivered”, etc. One might think: hey, why not have a “state” label? We would have just one metric, “vertx_eventbus”, with a label “state” and possible values “received”, “sent”, “pending”, “published”, etc. The label cardinality is low, and the values are under control.

Bad idea. While it’s possible, it would make aggregations, such as the sum in the above example, meaningless on the base metric. This is because an event bus message can have several states; for instance, it can be both “received” and “delivered”, so it might be counted twice. One would have to supply additional labels to make sense of it again, like sum(vertx_eventbus{state="received"}), but that loses the immediacy of the metric’s usability, making it harder for people querying or building dashboards to reason about.
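The double-counting can be demonstrated with made-up numbers. A single message that passes through two states increments two series of the hypothetical “vertx_eventbus” metric, so summing over the state dimension no longer counts messages:

```python
# Made-up illustration of why a "state" label breaks aggregation:
# one event bus message increments every state it passes through.
counters = {}

def inc(metric, **labels):
    """Increment the counter series for this label combination."""
    key = (metric, tuple(sorted(labels.items())))
    counters[key] = counters.get(key, 0) + 1

# One single message, which is both received and delivered:
inc("vertx_eventbus", state="received")
inc("vertx_eventbus", state="delivered")

total = sum(v for (name, _), v in counters.items() if name == "vertx_eventbus")
print(total)  # 2 -- but only 1 message went through
```

With separate metrics per state, each one sums to a meaningful count on its own; with a state label, the base metric sums to nothing interpretable.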

Personally, I consider that there’s a strict relation between the logical element we’re reasoning about (e.g. an event bus message) and the metric dimensions: the logical element has one and only one “projection” onto these dimensions. To draw another parallel, it’s a bit like partitioning/sharding a database: it’s the art of finding all the relevant partition keys that uniquely address an entity. The same is true for Prometheus labelling.

¹ It has now turned into the Vert.x Micrometer module, which supports Prometheus reporting. Examples showed here may not match exactly what's in the current module.



