Improving Kubernetes Observability and Monitoring
Find out what's really going on inside your Kubernetes processes, or find out how you can find out.
Monitoring has always been a big part of solutions design. It is a continuous process of data collection about a system for the purpose of analyzing that system. Monitoring is usually done in an active way, with a tool pinging or probing a system in order to get responses. Those responses are then analyzed to better understand how the system is performing.
In recent years, as we have shifted toward more open and manageable systems in cloud environments, monitoring has become a routine part of the process. This is where observability comes in. Despite the many debates about what observability really means, the idea is simple: it is the practice of making the data required for monitoring available from within the system.
The way cloud environments are set up and how microservices are now used to construct complex apps make observability a more appealing concept. It is as much about acquiring mission-critical data for business purposes as it is about keeping services running. It is a new world, and Prometheus and Grafana are among the most popular inhabitants of that world.
Site Reliability Engineering (SRE) at the Core
Before we can get to observability, we need to know the true purpose of it: better systems reliability. For a complex app made from microservices to be reliable, it needs to be designed for reliability. Site Reliability Engineering, or SRE, principles are used to ensure maximum reliability through simple means.
SRE starts with a simple premise of availability. A system needs to be available in order to perform its tasks. This is why a Service-Level Objective (SLO) becomes the first component of SRE. It governs the target availability that a system must achieve in order to reach a reliable stage.
Interestingly, SLOs are now often put into practice through techniques such as planned downtime and availability planning. Rather than letting servers and services become more available than the target requires, you can use planned downtime to probe your system and measure how services use server resources.
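To make the link between an availability target and planned downtime concrete, here is a minimal sketch of the arithmetic, often described as an error budget. The function names, the 30-day window, and the 99.9% figure are illustrative assumptions, not values taken from any particular system.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Downtime (in minutes) the window can absorb while still meeting the SLO.

    Both parameters are hypothetical inputs chosen for illustration.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in the range (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def remaining_budget_minutes(
    slo_percent: float, downtime_so_far_minutes: float, window_days: int = 30
) -> float:
    """How much downtime is still available for planned work in this window."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return budget - downtime_so_far_minutes


if __name__ == "__main__":
    # A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
    print(round(allowed_downtime_minutes(99.9), 1))
```

A negative result from `remaining_budget_minutes` would suggest the target has already been missed for the window, so further planned downtime should probably wait.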
The next element is a Service-Level Agreement, or SLA. Once the SLO is defined, a promise is made to the user. That promise is the SLA. The availability promised in an SLA isn’t defined exclusively by the SLO, which means your organization can have an internal SLO that is different from the promised SLA.
In practice, an SLA usually focuses on a specific part of the SLO rather than availability in general. Metrics such as downtime are more commonly used than performance or lead time. After all, an SLA is a promise that needs to be kept.
Then, we have Service-Level Indicators, or SLIs. They are the indicators monitored and analyzed to measure whether the organization’s SLO and SLA are met. Metrics such as Request Error Ratio or Request Latency, monitored at different percentiles, create a more holistic set of SLIs for better monitoring.
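As a rough illustration of the two SLIs named above, the sketch below computes a request error ratio and latency percentiles from a list of request records. The `Request` type and its fields are hypothetical, and the nearest-rank percentile method is one simple choice among several.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    latency_ms: float
    ok: bool


def error_ratio(requests: List[Request]) -> float:
    """Fraction of requests that failed (0.0 when there is no traffic)."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests: List[Request], percentile: float) -> float:
    """Latency at the given percentile, using the nearest-rank method."""
    if not requests:
        raise ValueError("no requests to measure")
    if not 0 < percentile <= 100:
        raise ValueError("percentile must be in the range (0, 100]")
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest rank: smallest value with at least `percentile` percent at or below it.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[rank - 1]
```

Looking at several percentiles side by side (for example the 50th, 90th, and 99th) tends to reveal slow outliers that an average would hide.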
USE vs. RED
That brings us straight to availability. SLIs can be monitored better—and availability can be achieved in a more manageable way—when the system is designed to be highly observable. There are challenges to solve when it comes to monitoring microservices, especially in a complex cloud environment. Measuring request latency isn’t always enough.
Two methods that try to solve those challenges are known as USE and RED. USE stands for Utilization, Saturation, and Errors. As the components suggest, USE aims to take the guesswork out of monitoring by making systems more observable. Utilization measures the time during which a resource is in use. Saturation, on the other hand, focuses more on the queue of work handled by that resource. "Errors," of course, actually counts errors in the system.
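A minimal sketch of the three USE signals for a single resource might look like the following. The `ResourceSample` fields are assumptions made for illustration; a real system would read these values from the operating system or a metrics agent.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResourceSample:
    """Hypothetical measurements for one resource over one sampling interval."""
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work items waiting for the resource
    error_count: int         # errors observed during the interval


def use_summary(sample: ResourceSample) -> Dict[str, float]:
    """Summarize Utilization, Saturation, and Errors for one sample."""
    if sample.interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        # Utilization: share of the interval during which the resource was in use.
        "utilization": sample.busy_seconds / sample.interval_seconds,
        # Saturation: approximated here by the amount of queued work.
        "saturation": sample.queue_length,
        # Errors: a plain count of error events.
        "errors": sample.error_count,
    }
```

For example, a disk that was busy for 45 of the last 60 seconds with three requests queued would report a utilization of 0.75 and a saturation of 3.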
RED takes a slightly different approach, one that is more in tune with how microservices are set up. RED stands for Rate, Errors, and Duration. Rate is the number of requests per second, and Errors is the number of those requests that fail per second. Duration completes the set by measuring the amount of time needed to process a request.
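The RED signals can be sketched in a few lines as well. This example assumes requests are available as hypothetical `(duration_seconds, ok)` pairs collected over a known time window; summarizing duration as an average and a maximum is a simplification, since duration is usually tracked as a distribution.

```python
from typing import Dict, List, Tuple


def red_summary(
    requests: List[Tuple[float, bool]], window_seconds: float
) -> Dict[str, float]:
    """Summarize Rate, Errors, and Duration for requests seen in one window.

    Each request is a hypothetical (duration_seconds, ok) pair.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    total = len(requests)
    failed = sum(1 for _, ok in requests if not ok)
    durations = [duration for duration, _ in requests]
    return {
        "rate": total / window_seconds,     # requests per second
        "errors": failed / window_seconds,  # failed requests per second
        "duration_avg": sum(durations) / total if total else 0.0,
        "duration_max": max(durations) if durations else 0.0,
    }
```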
Observable Containers for Improved Monitoring on Kubernetes
Container monitoring may not seem challenging when there is only one container to monitor. With a complex system, you may have a lot more containers to analyze. This is where high observability becomes a real advantage.
Unfortunately, Kubernetes doesn’t have a built-in monitoring tool, hence the need for an observable system. Kubernetes monitoring becomes much easier, though, when the containers running your microservices actively make monitoring data available.
The actual monitoring process can be handled by tools like Prometheus, another Cloud Native Computing Foundation (CNCF) project that gets along very well with Kubernetes. In fact, Prometheus and Grafana are among the most common pairings for creating a Kubernetes monitoring dashboard.
Prometheus handles the heavy lifting. It collects metrics from predefined targets. Prometheus uses a pull model and relies on service discovery for better automation. As long as the microservices are configured to expose their metrics, Prometheus will have no trouble monitoring your Kubernetes system.
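To illustrate the pull model from the service's side, here is a standard-library-only sketch of a `/metrics` endpoint that returns a counter in the Prometheus text exposition format. The metric name `http_requests_total`, its labels, the `REQUEST_COUNTS` dictionary, and port 8000 are all assumptions made for this example; in practice a service would more likely use an official Prometheus client library than hand-render the format.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict, Tuple

# Hypothetical in-process counters keyed by (method, status).
# A real service would update these as it handles requests.
REQUEST_COUNTS: Dict[Tuple[str, str], int] = {("GET", "200"): 0}


def render_metrics(counts: Dict[Tuple[str, str], int]) -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counts.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the current counters whenever a scraper requests /metrics."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Start the metrics server (blocks until interrupted)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The service only publishes its current state; deciding when to collect it is left to the scraper, which is the essence of a pull model.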
Grafana, on the other hand, utilizes Prometheus as a data source. Since Grafana is a data visualization tool, its primary purpose is transforming data pulled by Prometheus and creating a visual dashboard for easier monitoring.
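Behind a dashboard panel there is typically a PromQL query sent to Prometheus. The sketch below only builds the URL for such a query against the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`); the host name `prometheus:9090`, the metric `http_requests_total`, and the helper name are assumptions for illustration, and Grafana itself handles this step for you once the data source is configured.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical query: overall request rate over the last five minutes.
    print(build_query_url(
        "http://prometheus:9090",
        "sum(rate(http_requests_total[5m]))",
    ))
```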
Combined with the methodologies discussed earlier, and a server designed to feed data and handle requests, setting up an observable system is no longer a big hurdle. Once you have an observable system, monitoring makes long-term maintenance less of an issue.
For more on container monitoring, read our article Container Monitoring: Prometheus and Grafana Vs. Sysdig and Sysdig Monitor.
Published at DZone with permission of Mauricio Ashimine. See the original article here.