Prometheus is an open-source monitoring system for infrastructure and services that has become popular for Kubernetes and other cloud-native applications. Although Prometheus itself is just the metrics server, the name is often used for the entire monitoring stack. The monitoring system is built from a few pieces:
- Exporters: Processes, often deployed as sidecar containers, that collect and expose container and service metrics
- Prometheus server: Periodically walks through all the exporter and other metrics endpoints, collecting (pulling) the data
- Alertmanager: Provides alerting capabilities on top of the metrics server
- Grafana: Dashboarding interface for querying and displaying metrics
- Prometheus metrics: The metric exposition format used by Prometheus; many of the first libraries implementing this format were born within the Prometheus project

While deploying Prometheus and quickly starting to monitor your infrastructure and services is easy and straightforward, there are areas where it falls short. Common challenges that organizations face once Prometheus is in production involve scalability, high availability, long-term storage, and other day-2 operational drawbacks. In the last few months, several solutions have become available, and we will discuss which approaches can be taken to run Prometheus at scale.
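To make the exposition format concrete, here is a minimal sketch in plain Python (deliberately avoiding the official prometheus_client library; the metric name and labels are illustrative, not from any real exporter) of the plain-text counter format that exporters serve over HTTP:

```python
# Minimal sketch of the Prometheus text exposition format.
# The metric name and labels below are made up for illustration.

def render_counter(name, help_text, labels, value):
    """Render one counter sample in the Prometheus text format."""
    # Labels are sorted for a deterministic, canonical ordering.
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} counter\n"
        f"{name}{{{label_str}}} {value}\n"
    )

sample = render_counter(
    "http_requests_total",
    "Total HTTP requests served.",
    {"method": "GET", "code": "200"},
    1027,
)
print(sample)
# http_requests_total{code="200",method="GET"} 1027
```

The Prometheus server scrapes plain-text payloads like this from each target's `/metrics` endpoint on a configurable interval, which is what the pull model described above amounts to in practice.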
While many cloud-native services are heavily instrumented with Prometheus metrics, other services and applications are not. Metrics are a precious commodity, but far richer and more powerful insights can be obtained from your infrastructure, services, and applications, supporting essential workflows like troubleshooting and even security. This Refcard introduces several ideas on how to see more deeply so you can fix issues faster, or even before they happen:
- Simplified metric collection
- Events correlation and alerts
- Troubleshooting and tracing
This is page 1 of the Scaling and Augmenting Prometheus Refcard; the full Refcard is available as a PDF download.