A conversation I recently had with the DevOps manager of a major online retailer really made me think about DevOps monitoring tools. The manager and I discussed how several DevOps shops seem to define themselves based on the number of tools they have monitoring their build and IT stack. The point he went on to make is:
You can go up and down the aisles at a conference with the corporate credit card and buy every tool in sight, but all those purchases don’t make you a DevOps. All it makes you is the owner of many tools.
The point of the manager’s comment is that being an effective DevOps shop or IT service provider means you go beyond just owning tools. You have to incorporate those tools into a meaningful DevOps philosophy and an understanding of proper tool management and proper team integration.
DevOps, as a philosophy, encourages shifting left and putting testing earlier into the process so that teams can be proactive in their support rather than reactive to problems. So, how does a DevOps team enable this shift in thinking from reactive to proactive? Read on to find out.
DevOps Monitoring Tools: A Love Affair
DevOps is about bringing development and operational teams together. To some extent, tools can be a way to improve this relationship. A recent whitepaper from Puppet describes how:
Adopting DevOps practices usually means embracing automation as a default solution to many problems.
Indeed, Dev and Ops alike love their shiny new toys. Tools do allow for faster builds, quicker deployment, greater visibility, and faster feedback.
Puppet, for example, can be used for server configuration management. Nagios is also a favorite for infrastructure monitoring. Jenkins can be used to build code, create Docker containers, and push code to production. Jenkins is also great for Continuous Integration.
Yet these tools, as strong as they are at dealing with reams of data, do not alert the end user, be it Dev or Ops, when a real issue arises. For the most part, they will not solve underlying issues that arise in any operation, such as failed deployments, security issues, or scaling problems. Instead, those types of issues need to be alerted on and responded to appropriately.
Don’t Forget the Alerting
If DevOps teams were to rely on their tools alone, they would be left in a position where they were always reacting to situations rather than being proactive. Metrics provided by all the shiny DevOps tools enable us to measure and observe various components of the operation. However, it is alerting that draws attention to the particular systems that require observation, inspection, and intervention. It is alerting that furthers proactive management.
By putting alerting earlier in the monitoring process, DevOps teams take the true meaning of shifting left to heart. When software doesn’t deploy as expected, an alert to the proper team members lets the team see it early on. Similarly, security vulnerabilities can be detected early on, with alerts going to the engineers who can react appropriately and intervene.
Not All Alerts Are Created Equal
Even though most DevOps teams have adopted alerting practices, they are often far from alerting best practices. It’s not enough to just have an alerting tool. Like a monitoring tool, an alerting tool left uncalibrated will simply produce a sea of noisy data. Instead, teams should calibrate alerts so that they are meaningful.
For example, a meaningful alert might be something along the lines of web requests taking more than x seconds to process and respond, or new servers failing to spin up as expected. These are great examples of what could be high-priority alerts for a company. The Ops team, in these cases, can then investigate based on specific information rather than complaints from end users.
Alternatively, less urgent conditions, such as a server being 90% full, can be low-priority alerts that are forwarded to the on-call engineer but don’t rise to the level of a 2 a.m. wake-up call.
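To make the distinction between high- and low-priority alerts concrete, here is a minimal Python sketch of threshold-based classification. Everything in it is an assumption for illustration: the metric names, the RULES table, and the threshold values (including the 2-second stand-in for "x seconds") are hypothetical and would need to be calibrated for a real system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Priority(Enum):
    HIGH = "high"  # worth waking someone up for
    LOW = "low"    # forward to on-call, but no 2 a.m. wake-up call


@dataclass
class Reading:
    metric: str
    value: float


# Hypothetical metric names and thresholds, not taken from any real tool.
RULES = {
    "web_request_seconds": (2.0, Priority.HIGH),
    "failed_server_launches": (1.0, Priority.HIGH),
    "disk_used_percent": (90.0, Priority.LOW),
}


def classify(reading: Reading) -> Optional[Priority]:
    """Return the alert priority for a reading, or None if no alert is needed."""
    rule = RULES.get(reading.metric)
    if rule is None:
        return None  # unknown metric: stay quiet rather than add noise
    threshold, priority = rule
    # Alert only when the value is at or above the configured threshold.
    return priority if reading.value >= threshold else None
```

With these assumed rules, a slow web request would classify as a high-priority alert, a server at 92% would classify as low priority, and a server at 40% would raise nothing at all.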
6 Steps to Alerting Best Practices
It’s an important realization that not all alerting needs to wake up an engineer. Successful adoption of DevOps means planning ahead and providing meaningful alerts when issues do occur.
1. Make Sure Your Alerts Are Calibrated
Establish a baseline so you know how your systems are supposed to work.
2. Ensure Alerts Are Tied to a Schedule
As weird as it sounds, some shops just alert everyone. You never want to alert everyone. Make sure your alerts are tied to a schedule so that one person is alerted. If the engineer is unavailable, then escalate to the next person on call.
3. Ensure Alerts Are Actionable
Who wants to be woken up to a message that is pointless, such as there being a problem with deployment in the test environment? Instead, ensure alerts have a direct piece of information that needs to be investigated and resolved.
4. Develop Run Books
Publish operating procedures so that on-call response becomes more standardized.
5. Review Audit Trails
Make sure alerts went to the person on the team who is best able to resolve the issue.
6. Review On-Call at Weekly Meetings
Review alerts that were received during the week to ensure sufficient information is arriving with alerts and that alerts are actionable. If they are not, then alter the alert messaging so it is more effective.
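Steps 1 and 2 above lend themselves to a short illustration. The Python sketch below is one possible approach, not a prescribed method: the function names are hypothetical, the "three standard deviations above the mean" rule is an assumed way to define a baseline, and the on-call schedule is modeled as a simple ordered list.

```python
from statistics import mean, stdev


def is_anomalous(history, value, sigmas=3.0):
    """Step 1: flag a value that sits far above the baseline built from history.

    The baseline here is the mean of past samples, and "far above" is an
    assumed rule of `sigmas` standard deviations. History needs at least
    two samples for stdev to work.
    """
    baseline = mean(history)
    spread = stdev(history)
    return value > baseline + sigmas * spread


def pick_responder(schedule, available):
    """Step 2: alert exactly one person, escalating down the schedule.

    `schedule` is an ordered list of on-call engineers; `available` is the
    set of engineers who can currently respond. Returns None if nobody is
    reachable, which a real system would need to handle separately.
    """
    for engineer in schedule:
        if engineer in available:
            return engineer
    return None
```

For example, with a response-time history of [1.0, 1.2, 0.9, 1.1, 1.0] seconds, a new reading of 5.0 would be flagged while a reading of 1.1 would not. If the first engineer on the schedule is unavailable, pick_responder moves on to the next person instead of alerting everyone.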
By following these steps, your DevOps team will begin to think from a proactive rather than a reactive position.