Last year, I had the honor of presenting a talk titled “The Unrealized Role of Monitoring and Alerting” during All Day DevOps, a free, 15-hour global DevOps conference.
The online event provided amazing content to over 13,500 attendees across multiple time zones and included topics like Continuous Integration, Continuous Deployment, Modern Infrastructure, Monitoring and Alerting, DevSecOps, Cultural Transformations, DevOps in Government, and more.
I based my talk on a monitoring and alerting panel I took part in last year. The panel moderator asked the question, “What is the main role of monitoring?”
As you might imagine, the responses varied in ways that correlated almost directly with the company the panelist represented.
One panelist believed that the purpose of monitoring is to prevent downtime related to over-utilized resources, and that it’s most important to measure the availability of IT assets. In his view, we must keep a watchful eye on disk, CPU, and memory usage because those metrics help Ops teams detect and investigate imminent or active problems.
Another panelist from an Application Performance Monitoring (APM) company felt that the purpose of monitoring was to inform developers about, you guessed it, application performance. Measuring the quality of service to the end user was just as important as (or perhaps more important than) assessing whether a disk was full. It’s a valid point, particularly for organizations leveraging the cloud. Low on hard disk space? Spin up more resources as demand changes.
One by one, each expert shared their position on early detection, methods of preventing downtime, and appeasing disgruntled customers.
A Different Perspective
My mind went in a different direction. Since I represent VictorOps, it’s probably no surprise that I agreed with both viewpoints while adding this thought: monitoring is of no value if you can’t alert the right teams or individuals with actionable context and resources to address the problem. After all, alerting is the other half of the monitoring equation. Until we have collectively automated every possible fix to every possible problem (i.e., never), people will always be a big part of the story.
Were the previous panelists incorrect in their claims about the role of monitoring? No. But they had a myopic view of the critical role monitoring can play. Of course we want to know about problems early. Of course we need feedback on the user experience. There’s no argument there.
What Might We Do Differently Next Time?
However, there is a value to monitoring that is often overlooked: the chance to learn and innovate after service disruptions are over and unhappy customers have been pacified.
Continually understanding and responding to feedback from monitoring and alerting (including logs, metrics, etc.) prepares us to use information about past events to drive future actions. In other words, monitoring and alerting aren’t just about prediction and prevention. They also serve as a lever to help teams reduce the time it takes to respond to and recover from problems. And when their output is examined in retrospect, they help teams learn and innovate.
After all, failure of IT systems is inevitable. The good news is that success is a result of failure. But success only happens if we make the effort to learn from the spoils of information available once the proverbial fire hoses have been rolled up and engineers’ heart rates return to normal levels.
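To make that retrospective learning a little more concrete, here is a minimal sketch of how a team might summarize past incidents by how long they took to acknowledge and resolve. The `Incident` record, its field names, and the sample data are all hypothetical; a real export from your alerting tool will look different, so treat this as an illustration rather than a recipe.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record shape; real incident exports will differ.
    service: str
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def mean_minutes(deltas):
    """Average a list of timedeltas, expressed in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def review(incidents):
    """Return {service: (mean minutes to acknowledge, mean minutes to resolve)}."""
    by_service = {}
    for inc in incidents:
        by_service.setdefault(inc.service, []).append(inc)
    return {
        service: (
            mean_minutes([i.acknowledged - i.triggered for i in group]),
            mean_minutes([i.resolved - i.triggered for i in group]),
        )
        for service, group in by_service.items()
    }


# Made-up sample data for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident("checkout", t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    Incident("checkout", t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=26)),
    Incident("search", t0, t0 + timedelta(minutes=10), t0 + timedelta(minutes=90)),
]

for service, (ack, res) in review(incidents).items():
    print(f"{service}: acknowledged in {ack:.0f} min, resolved in {res:.0f} min")
```

Numbers like these don’t explain why a service was slow to recover, but they can point a post-incident review toward the services and time windows worth a closer look.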
I explored this topic further in an article I wrote for TechBeacon titled “How to use monitoring for innovation and resilience, not firefighting.” I’ve spent a good part of my career helping engineering teams get over this common shortcoming. When they move past chaotic responses to service disruptions and toward proactive learning, organizations make huge advancements in their own operational maturity.
All Day DevOps has something for anyone with any interest in DevOps. The lineup is stacked with experts from across the world. If you’d like to watch my talk from last year, here’s the recording. I look forward to watching this year’s talks and taking part in the conversation on Twitter. Use and follow the hashtag #AllDayDevOps to contribute to the conversation. See you in the Twitter feed!