Metrics That Help Reduce MTTR When Looking for a Root Cause
It's not always easy to pinpoint the right metrics. Follow these tips to more easily locate the cause of incidents in your systems.
Recently there was a mini-incident in the data center where we host our servers. In the end, it didn't affect our service, and thanks to the right operational metrics we were able to figure out what was happening almost instantly. But it got me thinking about how we would have been racking our brains trying to understand the situation without two simple metrics.
The story begins with an on-call engineer spotting an anomalous increase in a service's response time. He then checks whether this is true for the service overall or only for certain handlers by examining the response-time percentiles of the service's /ping handler. This handler doesn't call any other service or database; it just returns 200 OK and exists for the sole purpose of health checks by load balancers and Kubernetes.
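One way to reproduce such a check is to compute percentiles over a window of recorded /ping latencies. Here is a minimal sketch using only the standard library; the sample data and function name are illustrative, not the author's actual tooling:

```python
# Illustrative sketch: p50/p95/p99 over a window of /ping latencies (ms).
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    # quantiles() with n=100 returns the 1st..99th percentile cut points.
    cuts = quantiles(sorted(samples_ms), n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Mostly fast responses with a few slow outliers, like in the incident.
window = [12, 14, 11, 13, 250, 15, 12, 300, 14, 13] * 10
print(latency_percentiles(window))
```

A healthy /ping handler should show tight percentiles; a p99 that diverges sharply from p50, as here, hints that something below the application layer is wrong.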
So what comes to mind first? Probably resource starvation, specifically of the CPU. Let's check it:
OK, we see a surge. Let's figure out which process on the server is responsible, to see whether it's one of the neighboring services or something else:
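A per-process breakdown like this can be derived from /proc. The following sketch ranks processes by accumulated CPU time (utime + stime from /proc/&lt;pid&gt;/stat, per proc(5)); real monitoring would sample twice and plot the difference as a rate, but the parsing is the same. Function names are illustrative:

```python
# Sketch: rank processes by accumulated CPU jiffies read from /proc.
import glob

def cpu_jiffies(stat_line):
    """Total user+system jiffies from one /proc/<pid>/stat line."""
    # The process name (field 2) may contain spaces, so split after the
    # closing parenthesis that terminates it.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])  # utime (field 14) + stime (field 15)

def top_cpu_processes(limit=5):
    """Snapshot of the processes that have used the most CPU time so far."""
    usage = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                line = f.read()
        except OSError:
            continue  # the process exited between glob and open
        name = line.split("(", 1)[1].rsplit(")", 1)[0]
        usage.append((cpu_jiffies(line), name))
    return sorted(usage, reverse=True)[:limit]
```

Two such snapshots a few seconds apart give per-process CPU rates, which is what the chart in the investigation shows.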
We see that it's not one particular process misbehaving; all of them started using more CPU time simultaneously. So now there's no easy way forward: since all the services are tangled together, we would need to examine their load profiles, understand what generated the load (users or some internal cause), and so on. Or it could be some sort of degradation of the resources themselves.
Though I tried to keep you intrigued, you might have already figured out that it was the CPU itself that was in a degraded state. dmesg showed:
CPU3: Core temperature above threshold, cpu clock throttled (total events = 88981)
Basically, the CPU frequency was lowered. Let's check the temperature:
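On Linux, CPU temperatures are exposed through the kernel's thermal sysfs interface, with values reported in millidegrees Celsius. A minimal reading sketch (assuming the standard /sys/class/thermal layout; some hardware exposes sensors under /sys/class/hwmon instead):

```python
# Sketch: read temperatures from the kernel thermal sysfs interface.
# /sys/class/thermal/thermal_zone*/temp reports millidegrees Celsius.
import glob

def millideg_to_celsius(raw):
    """Convert a sysfs millidegree reading (string or int) to degrees C."""
    return int(raw) / 1000.0

def read_temperatures():
    """Map each thermal zone's type (e.g. 'x86_pkg_temp') to degrees C."""
    temps = {}
    for zone in glob.glob("/sys/class/thermal/thermal_zone*"):
        try:
            with open(zone + "/type") as f:
                kind = f.read().strip()
            with open(zone + "/temp") as f:
                temps[kind] = millideg_to_celsius(f.read().strip())
        except OSError:
            continue  # a zone may be unreadable on some hardware
    return temps
```

This is what a collection agent samples to draw the temperature chart referenced above.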
OK, it's now clear what was going on. Since we saw this happening to six servers at the same time, we figured it was definitely a data center issue, though not a global one: only some racks were affected.
Let's Get Back to Our Metrics
For future events like this, we want to know as soon as possible that any server's CPU is overheating. On the other hand, you wouldn't want to add CPU temperature charts to your dashboard: they take up screen space and people's attention, yet actual problems like this are ridiculously rare.
Usually, you would use triggers to monitor such parameters and metrics automatically. But to set up a trigger, one needs to choose a proper threshold. What should we set for CPU temperature?
It's the difficulty of choosing the right threshold that pushes a lot of software operations engineers to dream of universal anomaly detection, the kind that will solve all of their problems.
Still, in the real world we need a threshold.
Keep it simple, right? What do we care about? Our service's performance. So let's set the threshold at the temperature where our service experienced issues. But what about other services, running on servers that have never overheated?
OK, how about some physical intuition then? Let's check the "usual" temperature across our cluster to get a baseline:
90°C seems appropriate, right? Let's just check against another cluster:
Hm... here it's way lower on average. Should we set a different threshold?
Digging deeper, it's not the temperature that caused the service issue; it's the CPU frequency being lowered!
Let's check the number of such events. Linux sysfs gives us that in per-CPU counters such as /sys/devices/system/cpu/cpu0/thermal_throttle/core_throttle_count:
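A trigger of the kind described can be sketched directly on top of those counters: sample them twice and alert if the rate of throttle events exceeds a threshold. The threshold and interval below are illustrative, not okmeter's actual settings:

```python
# Sketch of a throttle-rate trigger over the sysfs thermal_throttle counters.
import glob
import time

PATTERN = "/sys/devices/system/cpu/cpu[0-9]*/thermal_throttle/core_throttle_count"

def read_throttle_total():
    """Sum of core throttle events across all CPUs since boot."""
    total = 0
    for path in glob.glob(PATTERN):
        with open(path) as f:
            total += int(f.read().strip())
    return total

def throttle_rate(count_before, count_after, interval_s):
    """Throttle events per second between two counter samples."""
    return (count_after - count_before) / interval_s

def check(threshold_per_s=2.0, interval_s=5.0):
    """Alert if throttling exceeds a couple of events per second."""
    before = read_throttle_total()
    time.sleep(interval_s)
    rate = throttle_rate(before, read_throttle_total(), interval_s)
    if rate > threshold_per_s:
        print(f"ALERT: CPU throttling at {rate:.1f} events/s")
```

Because the counter only moves when the kernel actually throttles a core, this trigger fires exactly when performance is affected, regardless of what "normal" temperature looks like on a given cluster.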
Though our monitoring service collects this counter automatically by default, it isn't even rendered as a chart anywhere in the system. There's just an auto-trigger, enabled for all of our clients, that alerts whenever there are more than a couple of such throttle events per second. It's invisible and doesn't require your attention, and one might argue it fires once in a century, but when it does, it works like a charm, notifying you the moment throttling affects your server's performance. No guessing games: it's all served up on a platter for you.
That's what takes most of our effort at okmeter.io: researching and developing a knowledge base of error-proof auto-triggers that save you the trouble of figuring out unknown problems. It takes work to find the right metrics, ones that are simple but effective. Simple is hard.
Published at DZone with permission of Pavel T. See the original article here.