A DevOps Approach to Incident Management Means You Can Still Innovate With ITIL
Approaching incident management and response with DevOps and Lean principles creates space for innovation and decentralized monitoring.
Collaboration for Incident Management
It is a misconception that ITIL disallows innovation, or that DevOps mixes with IT Service Management (ITSM) and the IT Infrastructure Library (ITIL) like oil and water. ITIL is a framework from which you can adopt the portions you find useful and leave the rest and, in fact, it provides many useful paradigms for DevOps implementations.
There is actually a lot in common between ITIL and DevOps. ITIL is a set of detailed practices that provides process frameworks. DevOps is primarily a culture of collaboration, so there is no reason a process framework cannot integrate well with a culture of collaboration.
This article looks at how we can take the core principles of DevOps and apply them to the ITIL process of Incident Management.
Incident Management With DevOps Principles
The Service Management process of Incident Management is focused on the resolution of issues impacting technology services. Tightly honed incident management processes rely on collaboration between teams to drive the rapid resolution of issues and are therefore a good opportunity for the application of DevOps tools and principles. Tools such as Slack and PagerDuty can help ensure that the right people are engaged at the right time, while practices such as blameless postmortems help ensure continuous improvement.
A lot of Incident Management activity can happen before an incident occurs. From clear roles and responsibilities to tightly honed instrumentation, DevOps principles can be applied to prevent incidents from happening in the first place and to reduce their impact when they do occur. DevOps focuses a lot on system telemetry, the data we collect about how our systems and services are performing, and for good reason. If we instrument our systems properly, we can find issues before they impact customers and prevent minor incidents from becoming major incidents. In addition, if systems are properly instrumented, we can begin to apply machine learning and predictive analytics to anticipate incidents before they occur.
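As a rough illustration of catching a problem before it reaches customers, the sketch below keeps a rolling average of latency samples and reports a warning level before a critical one. The class name `LatencyMonitor`, the window size, and both thresholds are hypothetical values chosen for the example, not figures from any particular monitoring tool.

```python
from collections import deque
from statistics import mean


class LatencyMonitor:
    """Tracks recent latency samples and grades the rolling average."""

    def __init__(self, window: int = 5, warn_ms: float = 300.0, critical_ms: float = 500.0):
        # deque(maxlen=...) silently drops the oldest sample once the window is full
        self.samples = deque(maxlen=window)
        self.warn_ms = warn_ms          # assumed early-warning threshold
        self.critical_ms = critical_ms  # assumed customer-impacting threshold

    def record(self, latency_ms: float) -> str:
        """Add a sample and return 'ok', 'warn', or 'critical' for the rolling average."""
        self.samples.append(latency_ms)
        average = mean(self.samples)
        if average >= self.critical_ms:
            return "critical"
        if average >= self.warn_ms:
            return "warn"
        return "ok"
```

The "warn" state is the point of the exercise: it gives responders a chance to act on a minor incident while the service is still below the level at which customers would notice.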
DevOps has its roots in Lean Manufacturing, and we see many of the concepts of Lean reflected in DevOps practices. One great process we borrow from Lean is the Andon Cord. In Lean, pulling this cord literally shut down an assembly line if something went wrong. This decentralized the decision to stop the assembly line and ensured that all resources were brought to bear on localized problems that impacted the end-to-end delivery of a product. We can apply similar concepts to Incident Management by allowing anyone to declare an incident when there is a problem. In addition, we can bring all resources to bear on a problem through swarming, to ensure that issues are resolved as quickly as possible.
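The Andon Cord idea can be sketched in a few lines. In this minimal example, anyone can declare an incident (there is deliberately no role check), and declaring one pulls in every on-call responder to swarm the problem. The names `AndonCord`, `Incident`, and `pull`, and the team roster, are all hypothetical; a real setup would hand off to a paging or chat tool rather than an in-memory list.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Incident:
    summary: str
    declared_by: str
    declared_at: datetime
    responders: list[str] = field(default_factory=list)


class AndonCord:
    """Lets anyone declare an incident and swarms all on-call responders onto it."""

    def __init__(self, on_call: dict[str, str]):
        self.on_call = on_call  # hypothetical mapping of team name -> on-call engineer
        self.incidents: list[Incident] = []

    def pull(self, declared_by: str, summary: str) -> Incident:
        # No permission check here: the decision to "stop the line" is decentralized.
        incident = Incident(
            summary=summary,
            declared_by=declared_by,
            declared_at=datetime.now(timezone.utc),
            responders=sorted(self.on_call.values()),  # swarm: engage every team's on-call
        )
        self.incidents.append(incident)
        return incident
```

For example, `AndonCord({"db": "dana", "web": "will"}).pull("sam", "checkout errors")` records an incident declared by `sam` with both on-call engineers attached as responders.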
Part of the DevOps culture of collaboration is accountability. Empowering engineers and fostering a learning culture means that people working on incidents can focus on resolving the issue rather than deflecting blame or pointing fingers. This gets back to the idea of cross-functional teams and full-stack engineers focused on the system as a whole rather than on individual horizontal slices. An example would be a site reliability engineer who resolves a problem rather than just throwing it over a wall. This practice is very much in line with proper incident management processes and ultimately leads to faster resolution times.
The ephemeral nature of systems in modern DevOps architectures has changed the approach to Incident Management. In these systems, preserving the state of any individual system is no longer important, so incident responders can simply kill the systems on which applications run and restart them. In fact, in some cases, you can kill the whole system. For services with ephemeral infrastructure, this has shifted the resolution process from investigate-and-diagnose to rapid restart-and-restore procedures, leading to significant improvements in resolution times. This is not to say that we should disregard diagnosis. It is important to capture information such as application and system logs so that diagnosis can happen at a later time, which allows us to separate restoration from investigation. Without further investigation or post-incident reviews, we will not build a learning culture, nor will we prevent repeat incidents.
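The ordering described above (capture evidence first, restore second, investigate later) can be sketched as follows. The function `restore_then_investigate` and the three callables it takes are hypothetical placeholders; in practice they would wrap whatever logging, storage, and orchestration tools are in use. The in-memory stand-ins exist only so the sketch runs on its own.

```python
from typing import Callable, Dict, List


def restore_then_investigate(
    instance_id: str,
    fetch_logs: Callable[[str], List[str]],
    archive: Callable[[str, List[str]], str],
    restart: Callable[[str], None],
) -> str:
    """Preserve diagnostics, then restart; return a reference for the later review."""
    logs = fetch_logs(instance_id)          # 1. capture evidence while the instance still exists
    reference = archive(instance_id, logs)  # 2. store it somewhere that outlives the instance
    restart(instance_id)                    # 3. restore service without waiting for diagnosis
    return reference


# In-memory stand-ins (assumptions for illustration, not real infrastructure calls).
events: List[str] = []
archive_store: Dict[str, List[str]] = {}


def fake_fetch_logs(instance_id: str) -> List[str]:
    events.append("fetch")
    return [f"{instance_id}: OutOfMemoryError"]


def fake_archive(instance_id: str, logs: List[str]) -> str:
    events.append("archive")
    key = f"incident-logs/{instance_id}"
    archive_store[key] = logs
    return key


def fake_restart(instance_id: str) -> None:
    events.append("restart")


ref = restore_then_investigate("web-1", fake_fetch_logs, fake_archive, fake_restart)
```

The returned reference is what links the fast restoration to the slower investigation: it can be attached to the incident record so the post-incident review has the logs even though the original instance is gone.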
A culture of continuous learning is also a key component of DevOps, and this type of learning can be instilled in organizations through the ITIL process of Post-Incident Reviews (PIRs). This DevOps principle is highlighted in Gene Kim’s Third Way of DevOps, which “is about creating a culture that fosters two things: continual experimentation, taking risks and learning from failure.” Also referred to as Postmortems or Post Action Reviews, the PIR process codifies this DevOps principle through ITIL practice. It is key that these reviews are approached in a spirit of learning rather than a spirit of blame. To truly have a culture of continual learning, it is critical that everyone involved in the incident uses the incident and the PIR as an opportunity for the organization to grow and learn.
We have worked in a lot of environments that include both ITIL and DevOps, and it has become increasingly apparent that they can not only coexist but also build on and enhance one another. Especially in enterprise environments, ITIL practices such as Incident Management and Post-Incident Reviews can be used to codify and enhance DevOps principles.