Questions on Monitoring: Before You Monitor
Questions to consider before monitoring so that you can improve the environment and mitigate infrastructure incidents before they impact your business.
As organizations continue to move applications from legacy systems to cloud-native architectures, new practices and approaches for monitoring have emerged. Modern monitoring uses RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) as standard approaches for dealing with the complexity of these environments.
However, organizations must also consider the meta-questions that need to be answered — those that set the basis for how successful you are in your monitoring. In truth, you most likely have these answers, at least in an instinctive manner, but considering them and understanding them can make monitoring your environment even easier. So let’s dive into what these meta-questions are to better monitor for and mitigate infrastructure incidents before they impact your business.
Has This System Ever Performed Well?
And with this first question, we probably have the most “meta” of all questions. Our applications and infrastructures are changing at an amazing pace. We’re all on some part of a journey towards cloud-native computing. Our application world is defined by microservices architectures and powered by elastic computing either in a public or private cloud. And yes, we still have the legacy systems to deal with.
So, how do you know the system has ever performed well, or at all?
In general, we should approach this in our testing and deployment phase. Obviously, we would expect the system to run without errors and within some performance range. But how do we know? Let’s boil it down to one word: observability.
By giving us insightful data on the behavior of our applications and infrastructure, observability lets us establish both that a test completed successfully and what our baseline performance range is. But we need to make sure that our test case is understood and repeatable. Most often we’d use synthetic monitoring for that insight because it gives us a controlled simulation of the real-user, real-world experience.
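As a minimal sketch of what a "base performance range" could look like in practice, the snippet below summarizes the durations of repeated synthetic test runs into a median and a 95th percentile, then checks a new run against that range. The function names, the sample durations, and the 1.2x tolerance are all illustrative assumptions, not values from any particular monitoring tool.

```python
from statistics import median, quantiles


def build_baseline(durations_ms):
    """Summarize repeated synthetic-test durations (hypothetical data) into a baseline."""
    if len(durations_ms) < 2:
        raise ValueError("need at least two runs to build a baseline")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    # method="inclusive" keeps the result inside the observed min..max range.
    p95 = quantiles(durations_ms, n=20, method="inclusive")[18]
    return {"median_ms": median(durations_ms), "p95_ms": p95}


def within_baseline(baseline, duration_ms, tolerance=1.2):
    """True if a new run is no slower than tolerance x the baseline p95 (assumed rule)."""
    return duration_ms <= baseline["p95_ms"] * tolerance


# Example: ten runs of the same scripted transaction, in milliseconds.
runs = [410, 395, 420, 405, 430, 415, 400, 425, 390, 435]
baseline = build_baseline(runs)
print(baseline, within_baseline(baseline, 450), within_baseline(baseline, 600))
```

The point is less the exact statistics than the repeatability: the same scripted test, run the same way, gives numbers we can compare over time.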
This simulation becomes more important as we continue to move from multi-page web apps (rendered in the back end) to single-page web apps (rendered on the client side). Since our single-page apps may also incur more communications (XHR, API calls), we need ways to reliably compare performance over time.
Synthetic monitoring emulates end-user transactions using scripting, which gives us repeatable, comparable results. Synthetics also allow us to build performance baselines that show the app is working, and to collect metrics that answer our questions on performance. In fact, using synthetics both as part of your testing strategy and regularly in production can help you identify and stop error budget burn before you exhaust your error budgets or breach your SLOs.
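To make the error budget idea concrete, here is a small sketch of the usual arithmetic: an SLO target implies a number of allowed failures, and the burn rate compares the observed error ratio with the ratio the SLO permits. The function names and the sample numbers are assumptions for illustration only.

```python
def _check_slo(slo_target):
    # An SLO of exactly 1.0 leaves no error budget at all, so reject it here.
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")


def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative means overspent)."""
    _check_slo(slo_target)
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def burn_rate(slo_target, total_requests, failed_requests):
    """Observed error ratio divided by the error ratio the SLO allows.

    A value above 1.0 means the budget is being spent faster than planned.
    """
    _check_slo(slo_target)
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed_ratio = failed_requests / total_requests
    return observed_ratio / (1.0 - slo_target)


# Example: a 99.9% SLO over 100,000 synthetic and real requests, 25 of which failed.
print(error_budget_remaining(0.999, 100_000, 25))  # roughly three quarters left
print(burn_rate(0.999, 1_000, 5))                  # roughly five times the allowed rate
```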
What Makes You Think There Is a Problem?
Today, we need to detect erroneous behavior rapidly and accurately. Errors are problems that cause an incorrect, incomplete, or unexpected result. They are often tracked by metrics (such as counts of HTTP errors) or by flag indicators (marking a result as an error, as in a status code).
Of course, in the modern world, slow is the new downtime. Slow costs sales, leads to unhappy users, and can be a massive headache when determining the underlying causes.
When we need to detect, alert, and troubleshoot incidents, we most often turn to a combination of monitoring — both in the infrastructure and in the application. And yes, you do need both.
Infrastructure monitoring gives us data-powered observability insights into what our elastic and ephemeral computing environment is doing. Application Performance Monitoring (APM) helps us track our application as it twists between the services or microservices and the underlying infrastructure.
Of course, traditional monitoring and alerting require that we have some idea of what can go wrong. We monitor for disk space, network bandwidth, memory, and more. But in today’s environments, the detection must extend to help us track and trigger the “unknown unknowns.”
In modern practice, we most often find USE and RED as base dashboards. USE allows us to track and find problems in our underlying systems, both hardware and operating software. RED, usually generated from our APM, is more the domain of applications. RED uses distributed tracing as the basis for tracking our requests and, indirectly, our user happiness. APM helps us understand those tricky “slow” issues and get to the root causes quickly.
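As a rough illustration of what a RED dashboard computes, the sketch below derives rate, error ratio, and a duration figure from a list of request records. The `Request` record, its field names, and the choice to count only 5xx responses as errors are assumptions made for this example; a real APM would derive these values from trace data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical request record, e.g. flattened from a trace span."""
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration for one time window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    count = len(requests)
    # Assumption: only server-side failures (HTTP 5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": count / window_seconds,
        "error_ratio": errors / count if count else 0.0,
        "median_ms": median(r.duration_ms for r in requests) if count else None,
    }


# Example: four requests seen in a two-second window.
window = [Request(120, 200), Request(80, 200), Request(300, 500), Request(100, 404)]
print(red_summary(window, window_seconds=2))
```

A USE view would look similar in shape, but would be fed by resource measurements (utilization, saturation, error counts) per host or device rather than by requests.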
What Changed? Software? Hardware? Load?
Most of us use DevOps principles in our day-to-day activities, and a major part of our practice is in deployment and integration. With microservices architectures, it’s not unusual to see pushes daily, or even more frequently. Add to that the elastic spin-up and spin-down of resources, the serverless functions powering our compute environments, and the loosely coupled communications linking those services, and we have not only constant planned changes but also unpredictable changes in our observability world.
So what changed?
Well, here we need to consider two interrelated data factors: streaming data and full-fidelity data. We need to see the data in real-time, as our structures can change in a heartbeat. Think of it this way: a serverless function might take 30 microseconds to start and less than a second to complete. When that occurs, we need the data showing that activity and its impact as soon as possible. After all, when you go looking for that function, it’s not going to be there.
That leads us to the full-fidelity data. If something we don’t expect goes wrong, we need data to be able to triage and deduce the underlying causes, as well as the full impact on the environment. Traditionally, monitoring has focused on one application or service at a time, which can cause us to miss cascading problems. Observability allows us to expand our space to resolve those concerns, but we need all the data to ensure that can happen. Nothing is worse than identifying an issue and then finding out that there is no data to allow for an in-depth inspection.
So make sure your data is fast and complete — anything else makes your life much harder.
Could the Problem Affect Other People or Applications?
As we move into more services-based applications, it is not unusual to see disparate teams being responsible for separate services. So how do you deal with blast radius issues, where one failure may impact a completely different app or an unexpected group of users?
Well, many of the points we have considered help here, but two items that need to be called out are AI (artificial intelligence)/ML (machine learning) and correlated data.
Our world is complex, and it can be a challenge to deeply understand the exact flow through the system on each transaction. An alert generated on one service may stem from a root cause on a completely separate service that is not even directly connected. Using AI techniques, we can track from the alerting cause to the potential root causes, if we have all the connecting data. Similarly, AI/ML techniques can help identify “normal” behaviors even when they seem out of scope, like recognizing that a potential anomaly is actually a historically recurring event.
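The recurring-event case can be sketched without any machine learning at all. The example below uses a plain z-score against the history of the same recurring time slot (say, Mondays at 09:00) as a stand-in for what an ML-based tool might learn; the function name, the threshold of 3.0, and the sample values are assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(slot_history, current, threshold=3.0):
    """Compare a value with earlier values from the same recurring slot.

    slot_history holds past observations for one slot, such as request
    counts on previous Mondays at 09:00 (hypothetical data).
    """
    if len(slot_history) < 2:
        return False  # not enough history to judge either way
    mu = mean(slot_history)
    sigma = stdev(slot_history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# A weekly spike looks alarming against the overall average, but it is
# normal for its own slot; the same value in a quiet slot is not.
monday_9am = [950, 1010, 990, 1005]
sunday_3am = [50, 55, 48, 52]
print(is_anomalous(monday_9am, 1000))  # False: a recurring, expected peak
print(is_anomalous(sunday_3am, 1000))  # True: far outside that slot's history
```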
But when something is wrong, we need to be able to step cleanly and simply between our data environments (like from app to infrastructure, from metric to trace to log files) without having to repeat our forensic steps. This is only possible if our data is correlated, aligned in a way that the monitoring tools understand and can analyze, visualize, and inspect. Correlated data may sound simple, but when infrastructure data can be coming in from thousands of virtual servers powering Kubernetes pods, containers, and apps, we need our system to handle that load, including issues around lag and skew.
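One common way to get correlated data is to stamp every signal with a shared identifier. The sketch below groups trace spans and log records by a `trace_id` field so we can step from one to the other without repeating the search. The field names and sample records are assumptions for this example, and it leaves out the lag and skew handling that a real pipeline would need.

```python
from collections import defaultdict


def correlate(spans, logs):
    """Group spans and log records that share the same trace_id."""
    by_trace = defaultdict(lambda: {"spans": [], "logs": []})
    for span in spans:
        by_trace[span["trace_id"]]["spans"].append(span)
    for record in logs:
        by_trace[record["trace_id"]]["logs"].append(record)
    return dict(by_trace)


# Hypothetical signals from two services handling the same request.
spans = [
    {"trace_id": "t1", "service": "checkout", "duration_ms": 840},
    {"trace_id": "t1", "service": "payments", "duration_ms": 790},
    {"trace_id": "t2", "service": "checkout", "duration_ms": 45},
]
logs = [
    {"trace_id": "t1", "service": "payments", "message": "upstream timeout"},
]

correlated = correlate(spans, logs)
# Starting from the slow checkout span, the payments log line is one lookup away.
print(correlated["t1"]["logs"][0]["message"])
```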
Stuff to Remember
While the hot term these days is observability, we’re still really dealing with monitoring. Observability gives us new and deeper data from all of our potential sources like our infrastructure, orchestration, and apps. But it’s still up to us to understand what that data is telling us. While our tools can help analyze and visualize our environments, it’s still up to us to answer those questions that in turn set the stage for identifying and responding as needed to incidents and issues.