
Prometheus Monitoring: Pros and Cons


Read about the features, design, and scalability in Prometheus open source monitoring that might be pros or cons when choosing it as a solution.


Prometheus is a monitoring solution that gathers time-series numerical data. It is an open-source project started at SoundCloud by ex-Googlers who wanted to monitor a highly dynamic container environment. Dissatisfied with the traditional monitoring tools, they started working on Prometheus, taking Google's internal monitoring system, Borgmon, as inspiration.

In this post, we will discuss the important design decisions made by Prometheus and their implications. We will highlight its strong points, look at how Digital Ocean managed to scale Prometheus to 1M machines, and show how you can leverage Prometheus when using CoScale.

How Prometheus Works

A nice overview of how Prometheus works can be found here. To monitor your services using Prometheus, each service needs to expose a Prometheus endpoint: an HTTP interface that returns a list of metrics together with their current values.
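To make the endpoint concrete, here is a minimal sketch of the text exposition format such a /metrics endpoint returns. The metric name, labels, and values are invented for illustration; real services typically use a Prometheus client library rather than formatting this by hand.

```python
# Sketch of the Prometheus text exposition format served at /metrics.
# Metric names, labels, and values here are hypothetical.

def render_metric(name, help_text, metric_type, samples):
    """samples: list of (labels_dict, value) tuples."""
    lines = [f"# HELP {name} {help_text}",
             f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines)

body = render_metric(
    "http_requests_total",
    "Total HTTP requests served.",
    "counter",
    [({"method": "get", "code": "200"}, 1027),
     ({"method": "post", "code": "200"}, 3)],
)
print(body)
```

Each metric carries a HELP line (description), a TYPE line, and one sample line per combination of label values; the Prometheus server parses exactly this format on every scrape.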

Prometheus has a wide range of service discovery options to find your services and start retrieving metric data from them. The Prometheus server polls the metrics interface on your services and stores the data.

In the Prometheus UI, you can write queries in the PromQL language to extract metric information. For example:

topk(3, sum(rate(container_cpu_time[5m])) by (app, proc))

will return the top three most CPU-consuming services.

Alerting rules are written in PromQL and evaluated by the Prometheus server; the Alertmanager then deduplicates and routes the resulting alerts. Grafana is a popular option for creating dashboards for Prometheus metrics.
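As an illustration, a hypothetical alerting rule reusing the CPU query above might look like this (the threshold, duration, and label names are invented for the example):

```yaml
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCpuUsage
        expr: sum(rate(container_cpu_time[5m])) by (app, proc) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage for {{ $labels.app }}"
```

The `for: 10m` clause means the condition must hold for ten minutes before the alert fires, which filters out short spikes.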


Prometheus Design Decisions

We will start off by talking about the Prometheus metric endpoints. These endpoints define the metrics and their values and are exposed over HTTP, providing a standardized way of gathering metrics. Prometheus metrics follow many of the guidelines set out by Metrics 2.0: each metric has a name, a description, dimensions, and values. The only thing missing is a unit for the metric.

Many services expose Prometheus endpoints natively, which makes gathering their metrics really easy. For services that don't have a native Prometheus endpoint, converters are required. This means that for these services, an extra sidecar container has to be deployed to expose metrics in the Prometheus format.
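The essence of such a converter can be sketched in a few lines: poll the service's own stats interface and re-expose the numbers in the Prometheus text format. The JSON stats shape and the `myservice_` metric prefix below are hypothetical; real converters (often called exporters) also run an HTTP server and map types more carefully.

```python
# Minimal sketch of a converter for a service without a native
# Prometheus endpoint. The stats payload and metric names are invented.
import json

def stats_to_prometheus(stats_json):
    """Translate a flat JSON stats object into Prometheus text format."""
    stats = json.loads(stats_json)
    lines = []
    for name, value in stats.items():
        metric = f"myservice_{name}"  # hypothetical metric prefix
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric} {value}")
    return "\n".join(lines)

raw = '{"connections": 12, "queue_depth": 3}'
print(stats_to_prometheus(raw))
```

In practice you would serve this output on an HTTP port next to the service, and point Prometheus at that port.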

The second design decision I would like to discuss here is pull vs push.

Prometheus polls services for metrics. This means that all services you want to monitor using Prometheus should expose a Prometheus metrics endpoint. Prometheus uses service discovery, which is nicely integrated with Kubernetes, to find all your services. Once it has found all services, it will gather metrics for all those services by polling their Prometheus metrics endpoint.
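A typical scrape configuration wiring this together for Kubernetes might look like the following sketch. The annotation-based filter is a common convention, but the job name and relabeling choices here are just one example setup:

```yaml
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via a prometheus.io/scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

With this in place, Prometheus discovers pods through the Kubernetes API and starts polling the ones that opt in, with no per-service configuration.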

The strong points for the pull approach are that there is no need to install an agent and that the metrics can be pulled by multiple Prometheus instances.

The downsides are:

  • All metric endpoints have to be reachable by the Prometheus poller, which implies a more elaborate, secure network configuration, and

  • Scaling becomes an issue in large deployments. Prometheus itself advises a push-based approach for collecting metrics from short-lived jobs.
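For those short-lived jobs, Prometheus provides the Pushgateway: a batch job pushes its metrics there before exiting, and Prometheus scrapes the gateway as usual. The sketch below only builds the URL and payload such a push would send (the gateway address, job name, and metric are hypothetical); a real job would issue an HTTP PUT or use a client library.

```python
# Sketch of what a short-lived batch job pushes to a Pushgateway.
# The Pushgateway accepts the text exposition format at
# /metrics/job/<job_name>; names and values here are invented.

def build_push_request(gateway, job, metric, value):
    url = f"{gateway}/metrics/job/{job}"
    body = f"# TYPE {metric} gauge\n{metric} {value}\n"
    return url, body

url, body = build_push_request(
    "http://pushgateway.example:9091", "nightly_backup",
    "backup_duration_seconds", 312.4)
print(url)
print(body)
```

Note that the Pushgateway is intended only for service-level batch jobs; regular long-running services should stick to the pull model.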

One of the primary design objectives of Prometheus is operational simplicity: limiting the number of possible failure modes of the monitoring system. Following this principle, Prometheus is currently limited to a single node, since clustering comes with additional operational complexity. A single node is less complex to run, but it puts a hard limit on the number of metrics Prometheus can monitor.

What Prometheus Is Not

Prometheus deliberately leaves several aspects of monitoring out of scope.

One is support for logs. Both metrics and logs are necessary to get complete visibility into your applications, but there are already plenty of open- and closed-source log aggregators that handle logs.

Prometheus also does not offer durable long-term storage, anomaly detection, automatic horizontal scaling, or user management. From our customer base, we see that these features are required in most large-scale enterprise environments.

Prometheus is not a dashboarding solution. It features a simple UI for experimenting with PromQL queries, but relies on Grafana for dashboarding, which adds some setup complexity.

How Digital Ocean Scaled Prometheus to 1M Machines

In his talk at PromCon 2016, Matthew Campbell from Digital Ocean explained how they scaled Prometheus to 1 million machines: how they started off with a default Prometheus installation and what they had to change to make it scale.

They started off with one Prometheus machine per data center. They ran into scalability issues and created larger and larger machines to run Prometheus. Once they had scaled the machines to the maximum size, they decreased the retention of the system to three days and decided to drop certain metrics. This approach only goes so far, so they decided to shard their Prometheus setup further, based on node labels. The difficulty with this approach is that querying becomes harder, and they ended up implementing a Prometheus proxy that gathers data from multiple shards. Bigger problems that they were unable to solve this way were shard redistribution and over-provisioning.

When scaling from ten thousand servers to one million virtual machines, they decided to take a different approach. They created a "reverse node exporter," which is basically an agent installed on the nodes that pushes data to a central point. On the backend side, they also made major changes: they kept the Prometheus API, but added a Kafka cluster for incoming metrics and Cassandra for metric storage. They also introduced downsampling. This project is called Vulcan and is available as open source.

The approach taken by Vulcan looks a lot like the approach taken by CoScale. We also use an agent and a scalable, highly available backend.

We believe that there is a lot of value in having a standardized metrics format. This makes it easy to gather metrics from different types of services. CoScale offers a Prometheus plugin that gathers metrics exposed in the Prometheus format. This allows you to easily get the metrics from your Prometheus-enabled services into CoScale.

There are, however, still lots of services that don't expose a Prometheus endpoint, and deploying a converter for each of them is cumbersome. CoScale instead offers an approach that combines an agent with a scalable, highly available backend, available both as SaaS and on-premises.



Published at DZone with permission of
