Refcard #301

Kubernetes Monitoring Essentials

Exploring Approaches for Monitoring Distributed Kubernetes Clusters

A centralized framework for monitoring your Kubernetes ecosystem offers valuable insights into how containerized workloads are running and can help you optimize them for better performance. However, as with any distributed system, monitoring Kubernetes is a complex undertaking. This Refcard first presents the primary benefits and challenges of Kubernetes monitoring; you'll then learn the fundamentals of building a Kubernetes monitoring framework, including how to capture monitoring data insights, leverage core Kubernetes components for monitoring, identify key metrics, and monitor the critical Kubernetes components and services in your cluster.


Written By

Sudip Sengupta
Technical Writer, Javelynn
Table of Contents
  • Introduction
  • Monitoring a Kubernetes Ecosystem: Benefits vs. Challenges
  • Fundamentals of Building a Kubernetes Monitoring Framework
  • Popular Open-Source Kubernetes Monitoring Tools
  • Conclusion
Section 1

Introduction

Kubernetes was designed from the ground up to be a scalable and flexible platform for orchestrating containerized workloads across multiple servers. It automates many of the tasks that would otherwise need to be performed manually, such as provisioning new resources, scheduling tasks, and monitoring performance. This enables developers to focus on developing and deploying applications rather than worrying about infrastructure specifics.

However, abstracting application code from infrastructure introduces operational complexities of its own in managing a distributed ecosystem. Because Kubernetes is composed of a number of different components, each with its own purpose and role in the overall system, monitoring it requires some key considerations.

In this Refcard, we discuss the fundamentals of building a Kubernetes monitoring framework, various approaches, and key metrics to monitor. We also discuss popular open-source tools for Kubernetes monitoring while exploring features and limitations of each.

Section 2

Monitoring a Kubernetes Ecosystem: Benefits vs. Challenges

There are several reasons to monitor a Kubernetes cluster. By understanding what is happening inside an operating cluster, you can diagnose performance issues more quickly, identify vulnerabilities, and remain compliant. Some benefits of monitoring Kubernetes include:

  • Enables tracking cluster resource capacity and utilization
  • Provides complete visibility into cluster health and operational data
  • Enables quicker deployment of updates and new features
  • Enables quick identification and remediation of errors in security and configuration
  • Helps maintain compliance
  • Detects security vulnerabilities

While monitoring offers a number of benefits, the dynamic nature of a Kubernetes ecosystem is often one of the primary challenges that complicates the collection and visualization of cluster data. Challenges of monitoring Kubernetes include:

  • Clusters generate large metric volumes
  • Ephemeral pods make tracking trend data challenging
  • Dynamic provisioning increases infrastructure monitoring complexity
  • Microservices architectures can result in incomplete observability
Section 3

Fundamentals of Building a Kubernetes Monitoring Framework

One of the key benefits of a centralized monitoring framework is that it can help you quickly identify and mitigate issues across your entire Kubernetes deployment. It is also important to note that monitoring a distributed cluster requires not only monitoring individual components and services, but also visibility into the characteristics of the cluster as a whole.

As a result, it is crucial to build a central system that offers valuable insights by aggregating both quantitative and qualitative data from multiple sources. This can help you generate usability trends and surface anomalous patterns that would otherwise be difficult to spot.

Capture Monitoring Data Insights 

Some common approaches to capture monitoring data insights include Golden Signals, the RED Method, and the USE Method.

Golden Signals 

These represent vital performance indicators that offer valuable insights into the overall performance of your system. They include:

  • Latency – time taken for a request to be processed by the system to help identify potential performance issues
  • Errors – number of errors per second being generated by the system to help identify potential problems with the application or infrastructure
  • Saturation – degree to which the system is being utilized to help identify whether additional capacity is required
  • Traffic – number of requests per second being handled by the system to help identify potential bottlenecks

While the Golden Signals can be used to monitor any type of system, they are particularly well suited for general, service-level metrics, such as those captured through health and liveness probes.
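
To make these signals concrete, below is a minimal sketch, assuming the Python prometheus_client library, that exposes all four Golden Signals from an application; handle_request() is a hypothetical stand-in for a real request handler:

# golden_signals.py - a minimal sketch exposing the four Golden Signals with
# the Python prometheus_client library; handle_request() is a hypothetical
# placeholder for a real request handler.
import random
import time
from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Latency: time taken to process a request")
REQUEST_TRAFFIC = Counter("app_requests_total", "Traffic: total requests handled")
REQUEST_ERRORS = Counter("app_request_errors_total", "Errors: total failed requests")
SATURATION = Gauge("app_worker_saturation_ratio", "Saturation: fraction of capacity in use")

def handle_request():
    """Hypothetical request handler instrumented with the Golden Signals."""
    REQUEST_TRAFFIC.inc()
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.1))   # simulate work
        if random.random() < 0.05:              # simulate occasional failures
            raise RuntimeError("simulated failure")
    except RuntimeError:
        REQUEST_ERRORS.inc()
    finally:
        REQUEST_LATENCY.observe(time.time() - start)
        SATURATION.set(random.uniform(0.2, 0.9))  # e.g., in-flight / max workers

if __name__ == "__main__":
    start_http_server(8000)   # metrics exposed at :8000/metrics for scraping
    while True:
        handle_request()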

The RED Method 

In contrast to Golden Signals, the RED Method focuses on request-level metrics, combined with liveness and readiness probes, to determine whether a pod or the overall deployment is healthy. This allows for more granular, user-controlled performance visibility into individual containers as well as the applications and services that they power. Tooling built around the RED Method also typically offers a robust set of alerting and notification features, giving users more control over when and how they are notified of issues with their clusters.

Metrics captured through this approach include the following (see the sketch after the list):

  • Rate – number of requests a service handles in a given period
  • Error – number of failed requests in a given period
  • Duration – latency or time taken by each request
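
As a rough worked example of how these three metrics relate, the following sketch derives rate, errors, and duration from a small set of in-memory request records; the records and the 60-second window are fabricated purely for illustration:

# red_metrics.py - a minimal sketch deriving RED metrics from request records.
# Each record is (timestamp_seconds, duration_seconds, succeeded); the sample
# data below is fabricated purely for illustration.
import statistics

WINDOW_SECONDS = 60
requests = [
    (0.0, 0.120, True),
    (1.5, 0.340, True),
    (2.1, 0.090, False),
    (3.7, 0.210, True),
]

rate = len(requests) / WINDOW_SECONDS                        # Rate: requests per second
errors = sum(1 for _, _, ok in requests if not ok)           # Errors: failed requests in the window
duration_p50 = statistics.median(d for _, d, _ in requests)  # Duration: median latency

print(f"rate={rate:.2f} req/s, errors={errors}, p50 duration={duration_p50:.3f}s")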

The USE Method 

While Golden Signals and the RED approach monitor overall system health, the USE Method helps track how resources are being used within the system. Metrics captured through the USE Method include:

  • Utilization – degree to which the system is being utilized
  • Saturation – degree to which the system has more work than it can immediately service and requires additional resources to complete a task
  • Error – rate of error in a given period

As the USE Method relies on data collected from within the Kubernetes cluster, the approach can provide more accurate and detailed insights into a cluster's resource behavior. Additionally, the USE Method is built to work with multiple data sources and can help track changes over time by generating trends and patterns of a working cluster.
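
For a node-level illustration of the USE Method, the sketch below reports utilization, saturation, and errors using the Python psutil library; in practice, a monitoring agent would export these values rather than print them:

# use_metrics.py - a minimal sketch of node-level USE metrics using psutil.
import os
import psutil

# Utilization: how busy the resource is
cpu_utilization = psutil.cpu_percent(interval=1)   # percent of CPU in use
mem = psutil.virtual_memory()

# Saturation: extra work the resource cannot service immediately
load_1min, _, _ = os.getloadavg()                   # runnable tasks vs. cores
cpu_saturation = load_1min / psutil.cpu_count()

# Errors: error events for the resource
net = psutil.net_io_counters()
net_errors = net.errin + net.errout

print(f"CPU utilization: {cpu_utilization:.1f}%")
print(f"Memory utilization: {mem.percent:.1f}%")
print(f"CPU saturation (load per core): {cpu_saturation:.2f}")
print(f"Network errors: {net_errors}")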

Leverage Kubernetes Components for Monitoring 

Besides adopting different approaches to capture metrics, it is equally important to adopt the right approach to monitoring a Kubernetes cluster. To determine the right approach, it is crucial to weigh factors such as cluster complexity, workload types, and the level of granularity at which monitoring is to be administered.

Kubernetes Metrics Server 

The Metrics Server is a Kubernetes add-on that aggregates resource usage data from all nodes in a cluster. The Metrics Server does not store metrics long term; rather, it collects CPU and memory usage from the kubelet on each node and exposes the data through the Metrics API in the Kubernetes API server. From there, the metrics can be viewed with kubectl top commands or surfaced in the Kubernetes Dashboard for monitoring and tracking resource utilization.
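
As an example, the same node-level data behind kubectl top nodes can be read programmatically from the Metrics API; the sketch below assumes the official Python kubernetes client, a working kubeconfig, and an installed Metrics Server:

# node_metrics.py - a minimal sketch reading node resource usage from the
# Metrics API (the same data behind `kubectl top nodes`).
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config()
api = client.CustomObjectsApi()

node_metrics = api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="nodes"
)

for item in node_metrics["items"]:
    name = item["metadata"]["name"]
    usage = item["usage"]          # e.g., {"cpu": "250m", "memory": "1024Mi"}
    print(f"{name}: cpu={usage['cpu']} memory={usage['memory']}")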

In addition to providing node-level resource usage data, the Metrics Server's data can also be used for troubleshooting performance issues and for capacity planning of applications running on top of Kubernetes. For application-level metrics, the typical approach is to instrument workloads with an exporter (such as a Prometheus client library) and expose those metrics to a dedicated monitoring backend for querying and visualization of resource usage.

DaemonSets With Monitoring Agents 

One of the key features of Kubernetes is its ability to run DaemonSets, which guarantee that a copy of a given pod runs on every node in the cluster (or on a selected subset of nodes). Leveraging Kubernetes DaemonSets to deploy monitoring agents can therefore provide a comprehensive solution for keeping tabs on your system: the DaemonSet controller ensures that there is always a working copy of the agent running on each node.

This approach also has the benefit of being highly scalable since adding more nodes to your cluster will automatically add more copies of the agent, which ensures that no matter how many nodes are added or removed from the cluster, there will always be at least one agent running on each node. 
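
As a rough sketch of this pattern, the following code builds and submits a bare-bones monitoring-agent DaemonSet using the Python kubernetes client; the image name example/monitoring-agent:latest and the labels are placeholders, not a real agent:

# agent_daemonset.py - a minimal sketch that deploys a monitoring agent as a
# DaemonSet so one copy runs on every node; the image and labels are placeholders.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

labels = {"app": "monitoring-agent"}
daemonset = client.V1DaemonSet(
    metadata=client.V1ObjectMeta(name="monitoring-agent", labels=labels),
    spec=client.V1DaemonSetSpec(
        selector=client.V1LabelSelector(match_labels=labels),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels=labels),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="agent",
                        image="example/monitoring-agent:latest",
                    )
                ]
            ),
        ),
    ),
)

apps.create_namespaced_daemon_set(namespace="kube-system", body=daemonset)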

Monitoring clusters through DaemonSets offers several advantages:

  • The approach ensures that there is always data being collected from every node in the cluster, which is essential for detecting anomalies early on.
  • Because the monitoring agents themselves are deployed on Kubernetes, they can easily be scaled up or down without impacting the rest of the system.

Identify Key Metrics for Monitoring Kubernetes 

As a Kubernetes cluster grows in scale and complexity, so too does the importance of observing resource utilization. Thankfully, Kubernetes offers a variety of utilization metrics that can help you gain insights into the health and performance of your applications. These include the following featured metrics.

Cluster State Metrics 

Cluster state metrics provide a high-level overview of the health of a Kubernetes cluster. While these metrics can be used to determine how well the system is performing as a whole and to identify potential areas for performance optimization, cluster state metrics are mostly used as the first step toward analyzing deeper anomalies of a system.

A typical approach is to use cluster state metrics to assess an overview of the system, and then use more specific metrics (such as network latency metrics) to identify the root cause and take corrective action. 

Key cluster state metrics to monitor include the following (a collection sketch follows the list):

  • Cluster size
  • Pod count
  • Node count
  • Deployment count
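
A minimal collection sketch for these counts, assuming the Python kubernetes client and a working kubeconfig:

# cluster_state.py - a minimal sketch collecting the cluster state metrics
# listed above (node, pod, and deployment counts).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

node_count = len(core.list_node().items)
pod_count = len(core.list_pod_for_all_namespaces().items)
deployment_count = len(apps.list_deployment_for_all_namespaces().items)

print(f"nodes={node_count} pods={pod_count} deployments={deployment_count}")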

Node State Metrics 

Node state metrics provide an overview of the status of each node in a Kubernetes cluster. They can be used to monitor the health of nodes and identify any issues that may be affecting their performance. The most important node state metric is CPU utilization, which can be used to determine if a node is underutilized or overloaded. 

Other important metrics include the following (a node-health sketch follows this list):

  • Memory usage
  • Disk space
  • Network traffic
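
Beyond resource usage, each node also reports health conditions (such as Ready, MemoryPressure, and DiskPressure) that can be read directly; a minimal sketch using the Python kubernetes client:

# node_conditions.py - a minimal sketch reading the health conditions
# (Ready, MemoryPressure, DiskPressure, PIDPressure) reported by each node.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    flagged = [
        c.type for c in (node.status.conditions or [])
        if (c.type == "Ready" and c.status != "True")
        or (c.type != "Ready" and c.status == "True")
    ]
    state = "healthy" if not flagged else f"check: {', '.join(flagged)}"
    print(f"{node.metadata.name}: {state}")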

Pod Availability 

These include the number or percentage of pods that are available, the average time to recover from a pod failure, and more. There are a number of pod availability metrics that can be used for Kubernetes monitoring, including the following (a computation sketch follows the list):

  • Pod ready – Percentage of pods that are ready and available to receive traffic; a high pod ready percentage indicates a healthy cluster
  • Pod failed – Percentage of pods that have failed; a high pod failed percentage indicates an unhealthy cluster
  • Pod restarts – Number of times a pod has been restarted; a high number of restarts indicates an unstable cluster
  • Pod age – Age of a pod; a high pod age indicates a stale or outdated cluster
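
The pod ready percentage and restart counts above can be computed straight from the pod list; a minimal sketch using the Python kubernetes client:

# pod_availability.py - a minimal sketch computing the pod ready percentage
# and total container restarts described above.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pods = core.list_pod_for_all_namespaces().items
ready = sum(
    1 for p in pods
    if any(c.type == "Ready" and c.status == "True" for c in (p.status.conditions or []))
)
restarts = sum(
    cs.restart_count
    for p in pods
    for cs in (p.status.container_statuses or [])
)

print(f"pod ready: {100 * ready / max(len(pods), 1):.1f}% ({ready}/{len(pods)})")
print(f"total container restarts: {restarts}")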

Disk Utilization 

Disk utilization metrics for Kubernetes monitoring provide visibility into the health of your Kubernetes cluster by offering insights on how much disk space is being used. When monitoring disk utilization, there are a few key metrics to keep track of, including: 

  • Utilized storage space – Indicates the size of the data files and any accompanying metadata to help identify services that may strain capacity in the future
  • Disk read/write rate – Indicates the rate at which data is being read from or written to disk
  • Disk latency – Measures the time taken to complete disk I/O operations; high latency can lead to slowdowns and capacity overrun

CPU Utilization 

Considered among the core metrics of Kubernetes monitoring, CPU utilization metrics help identify issues that lead to suboptimal compute performance, as well as their underlying causes. Some key CPU utilization metrics include: 

  • CPU cores – Number of processing cores being used by an app or container to help identify whether underlying workloads are using the optimal number of cores
  • CPU usage – Amount of time that a process or thread is using the CPU
  • CPU idle time – Amount of time the CPU is idle; an unusually high or low percentage of idle time indicates inefficient resource allocation and utilization

Memory Utilization 

Memory utilization can be affected by a number of factors, including container density, pod memory limits, and application memory usage. By monitoring memory utilization metrics, you can identify out-of-memory errors proactively and prevent them from becoming performance bottlenecks. 

Some key memory utilization metrics include the following (a per-pod usage sketch follows the list):

  • Overall memory usage – General overview of how much memory the Kubernetes system is using
  • Memory usage per pod – Memory consumption of each individual pod in a system
  • Memory usage per node – Amount of memory each node in a Kubernetes system is consuming
  • Memory pressure – Indicates if the system is struggling to keep enough free memory available for other services to operate
  • Available memory – Amount of memory available for new apps or cluster operations
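
Per-pod memory usage can be pulled from the same Metrics API shown earlier; a minimal sketch, again assuming the Metrics Server add-on is installed:

# pod_memory.py - a minimal sketch listing per-pod memory usage from the Metrics API.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

pod_metrics = api.list_cluster_custom_object(
    group="metrics.k8s.io", version="v1beta1", plural="pods"
)

for item in pod_metrics["items"]:
    pod = f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
    for container in item["containers"]:
        print(f'{pod} {container["name"]}: memory={container["usage"]["memory"]}')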

API Request Latency 

API request latency metrics are key performance indicators (KPIs) for any Kubernetes deployment, helping track whether workloads are responsive and performing as expected for end users or services. There are a few different metrics that you can use to measure API request latency (a worked percentile example follows the list): 

  • Median API request latency – Time taken for half of all API requests to be processed, to get an overall sense of a cluster's performance
  • 95th percentile API request latency – Time taken for 95% of all API requests to be processed, to identify potential bottlenecks
  • 99th percentile API request latency – Time taken for 99% of all API requests to be processed, to help identify severe bottlenecks of a cluster
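
As a quick worked example of how these percentiles are derived, the sketch below computes the median, 95th, and 99th percentile from a list of latency samples; the samples are randomly generated for illustration:

# latency_percentiles.py - a minimal sketch computing median, p95, and p99
# latencies from a list of request latency samples (in seconds).
import random
import statistics

samples = [random.uniform(0.01, 0.5) for _ in range(1000)]

median = statistics.median(samples)
cuts = statistics.quantiles(samples, n=100)   # returns the 99 cut points p1..p99
p95, p99 = cuts[94], cuts[98]

print(f"median={median * 1000:.1f}ms p95={p95 * 1000:.1f}ms p99={p99 * 1000:.1f}ms")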

Monitor Core Components and Services of Kubernetes 

Kubernetes is supported by different components and services that work together to provide a unified platform for running containerized applications. When monitoring Kubernetes, it is a common approach to identify the core components and services that support its ecosystem. Some critical components and services to monitor in Kubernetes include the control plane and worker nodes, as shown in Figure 1:

Figure 1: Key components and services of a Kubernetes architecture

Control Plane

Known as the brain of a Kubernetes cluster, the control plane coordinates operations of all other nodes (referred to as worker nodes) of the cluster. The control plane is also responsible for handling all the cluster's communications with the outside world through APIs. 

A control plane further comprises a number of components, including:

  • Etcd – As Kubernetes uses the etcd distributed key-value store for storing all cluster data, it is one of the most critical components that should be monitored for optimum performance and availability. To monitor etcd, a common approach is to check for errors in the logs or use automated tools to verify that data is being replicated across servers to maintain high availability (HA).
  • API Server – The Kubernetes API server acts as the central point of communication for all components within and external to the system, and it needs to be responsive at all times during cluster operations. To monitor its health, a typical approach is to detect errors in the logs and make sure that all API calls are returning timely responses; a minimal health-probe sketch follows this list.
  • Scheduler – As the scheduler is responsible for assigning pods onto nodes, it is crucial to ensure workloads are running where they should be. Monitoring the scheduler helps verify if pods are being scheduled onto the correct nodes.
  • Controller manager – A controller manager is responsible for running one or more controllers, which are essentially background processes that oversee the state of a cluster. To ensure controllers are scalable and highly available as the cluster grows, it is crucial to monitor the behavior of both individual controllers and the controller manager.
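
One lightweight way to check API server responsiveness is to probe its health endpoints; the sketch below assumes kubectl proxy is running locally (which exposes the API server on 127.0.0.1:8001) and uses the Python requests library:

# apiserver_health.py - a minimal sketch probing the API server's health
# endpoints, assuming `kubectl proxy` is listening on 127.0.0.1:8001.
import requests

API = "http://127.0.0.1:8001"

for endpoint in ("/livez", "/readyz", "/healthz"):
    resp = requests.get(f"{API}{endpoint}", timeout=5)
    status = "ok" if resp.status_code == 200 else f"HTTP {resp.status_code}"
    print(f"{endpoint}: {status}")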

Worker Nodes 

Worker nodes are standalone machines of a Kubernetes cluster that run applications and store underlying data. Monitoring worker nodes helps troubleshoot issues affecting hosted applications and identify performance bottlenecks. Key services of worker nodes include:

  • Kubelet – Performs basic tasks such as starting and stopping containers, collecting log files, and managing storage. In addition, the kubelet also makes sure that containers are running the correct version of application code. Monitoring worker nodes with the kubelet helps ensure all pods on the node are appropriately assigned and are operating without resource-level conflicts.
  • Container runtimes – Are used to launch and manage containers in a cluster. Kubernetes uses container runtimes to manage its pods, which are groups of containers that need to be co-located and co-scheduled on a host. Besides troubleshooting performance issues, monitoring runtimes also helps track compliance and detect security vulnerabilities.
  • Kube-proxy – Monitoring kube-proxy helps inspect service connectivity and ensure that all nodes are able to communicate with each other. This can be especially important when you are rolling out updates as any issues with kube-proxy can cause uneven distribution of traffic between nodes.
Section 4

Popular Open-Source Kubernetes Monitoring Tools

Monitoring Kubernetes clusters is an extensive task that often relies on automated monitoring platforms to help reduce the manual overhead of keeping track of an application's performance. The following sections list popular open-source monitoring tools, their features, and their limitations.  

Kubernetes Dashboard 

Kubernetes offers the Dashboard as a native, web-based interface for general-purpose monitoring and troubleshooting of Kubernetes clusters. Besides offering you an overview of the cluster's health, the tool also provides detailed insights on individual pods, nodes, services, and deployments.

Although the Kubernetes Dashboard is an easy-to-use, quick-start platform to inspect runtime resource utilization and perform critical actions such as scaling deployments or rolling out new versions of applications, the platform has several limitations when monitoring clusters at scale. These include a lack of actionable insights or efficient alerting mechanisms to detect anomalies in real time. 

Prometheus 

Managed by the Cloud Native Computing Foundation, Prometheus is one of the most popular open-source monitoring and alerting solutions that leverages a time series database for monitoring distributed Kubernetes architectures. Apart from offering a wide range of metrics out of the box to monitor the health and performance of Kubernetes workloads, Prometheus also supports the creation of your own custom metrics to support different use cases.

The platform's multi-dimensional data model and powerful query language allow for efficient metric collection and storage while enabling precise control over which metrics are returned for visualization. By leveraging exporters to support a wide range of integrations, Prometheus can be used to expose and visualize just about any system- or service-level metric.
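
For example, any PromQL expression can be evaluated programmatically through Prometheus' HTTP query API; the sketch below assumes a Prometheus server reachable at localhost:9090 that is already scraping kubelet/cAdvisor metrics, and uses the Python requests library (the metric name may differ between setups):

# prom_query.py - a minimal sketch querying Prometheus' HTTP API for per-pod
# CPU usage; assumes Prometheus is reachable at localhost:9090.
import requests

PROMETHEUS = "http://localhost:9090"
query = 'sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    pod = result["metric"].get("pod", "<unknown>")
    _, value = result["value"]          # [timestamp, value-as-string]
    print(f"{pod}: {float(value):.3f} CPU cores")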

The ELK Stack 

As a combination of three open-source tools, namely Elasticsearch, Logstash, and Kibana, the ELK Stack is a comprehensive observability platform with an extensive set of features for managing and monitoring log data of Kubernetes clusters. The ELK Stack is a popular choice for log analysis and monitoring because of its powerful search capabilities and ability to handle large amounts of data.

  • Elasticsearch is an Apache Lucene-based search engine that indexes and searches data in real time.
  • Logstash is a data processing platform that ingests and transforms data from multiple sources to a centralized data store.
  • Kibana is a web interface that allows users to create dashboards and charts for visualization of the data stored in Elasticsearch. 

Jaeger

Jaeger is an efficient distributed tracing system that provides accurate insights on distributed Kubernetes clusters by tracing individual requests as they flow through the system. The platform integrates with the Kubernetes API to collect data about a cluster's state and resources, and it can be deployed either as a standalone application or via a DaemonSet in a Kubernetes cluster.

Grafana 

Grafana is a popular visualization and monitoring tool for observing time series Kubernetes data. Apart from allowing users to create customized displays of metrics, logs, and traces associated with containerized applications, Grafana also offers granular control over the data it collects and stores, allowing users to select which data sources to include in their dashboards.

The following comparison covers the features and limitations of these five open-source tools:

Kubernetes Dashboard

Features:
  • Easy-to-use, web-based interface
  • Extensible, can be customized to support various use cases
  • Helps manage and deploy nodes, pods, services, and deployments
  • Offers access to the Kubernetes API from the Dashboard UI

Limitations:
  • Only shows data for the past 24 hours
  • Limited scaling capabilities
  • Large number of nodes or pods may cause the Dashboard to slow down or become unresponsive
  • Doesn't provide granular control over monitoring data
  • Can only be used to manage resources in a single namespace

Prometheus

Features:
  • Automated service discovery
  • Time-stamped metric capture
  • Comprehensive metric collection of all cluster services
  • Powerful query language (PromQL) for inspecting collected metrics
  • Provides easy-to-use dashboards for efficient visualization of collected metrics

Limitations:
  • Complex to integrate with existing monitoring systems
  • Uncompressed metric data consumes significant storage space
  • PromQL query language has a steep learning curve

ELK Stack

Features:
  • Easy to use, highly customizable
  • Comprehensive solution for gathering, processing, and visualizing cluster state
  • Supports ingestion, parsing, and transformation of data from various sources
  • Offers plugins to monitor core services (e.g., Kubernetes API server, Docker daemon)

Limitations:
  • Highly resource-intensive
  • Requires setting up an additional cluster to support HA
  • Requires custom scripting for automatic collection and mapping of Kubernetes metadata to fields in Elasticsearch
  • Requires considerable customization of default dashboards for Kubernetes data visualization

Jaeger

Features:
  • Enables tracing individual requests traveling through the cluster
  • Helps visualize request latency timelines
  • Helps identify performance bottlenecks

Limitations:
  • Complex configuration of tracing rules
  • Can only monitor containerized workloads
  • Inefficient for monitoring multi-cluster or hybrid-cluster setups

Grafana

Features:
  • Highly customizable dashboards
  • Supports a wide range of data source integrations
  • Offers seamless scalability

Limitations:
  • Doesn't support persistent volumes
  • Can't monitor historical data out of the box
  • Doesn't offer free services for certain on-prem data sources
Section 5

Conclusion

Monitoring Kubernetes offers valuable insights into how containerized workloads are running and can help you optimize them for better performance. However, as with any distributed system, monitoring Kubernetes is a complex undertaking. There are a number of factors that can make this difficult, including the dynamic nature of Kubernetes, the scale at which it can operate, and the variety of ways in which it can be used. 

In order to get the most out of monitoring, it is important to adopt the right monitoring strategy, identify the critical components, and choose the right monitoring tool. It is also important to make sure that the monitoring framework is scalable and can handle the constant stream of data generated by a Kubernetes cluster.

Further reading:

  • Getting Started With Kubernetes – https://dzone.com/refcardz/getting-started-kubernetes
  • Kubernetes Security Essentials – https://dzone.com/refcardz/kubernetes-security-1
  • Advanced Kubernetes – https://dzone.com/refcardz/advanced-kubernetes
  • Getting Started With Prometheus – https://dzone.com/refcardz/scaling-and-augmenting-prometheus
  • Monitoring and the ELK Stack – https://dzone.com/refcardz/monitoring-and-the-elk-stack
