Measuring in a DevOps World
What do we do with all the data collected by our DevOps tools? Learn about using a big data analytics approach to your CI/CD pipeline.
It is quite easy to find articles and books about DevOps practices like Continuous Delivery, and there seems to be general agreement on which practices are good. In the measurement space, I am not so sure we have reached a consensus yet, and there are a few very counterintuitive things happening that challenge our traditional approach — just like Agile and DevOps did to project management best practices.
In this article, I want to address two aspects of measuring in the DevOps world to help navigate this space: the technical dimension of how measurements should be done and some of the challenges associated with them, and then, how to look at some common measures with a new set of eyes.
For years, we in IT have been speaking to businesses about the use of analytics and big data to make them more effective by leveraging the insights coming from that data. Yet, when we look at what we have done in IT, we can see that most of the data we create is not being analyzed with the same mindset. Think of all the data that is “hiding” in your Agile lifecycle tool, your Continuous Integration server, your code quality tools, etc. You are likely looking at that data within the context of those tools but don’t have a formal way to correlate the data and derive analytics from it. That data is mostly “wasted” without its full value being realized — it’s “dark data.”
The answer, of course, is to use the practices we have been recommending to business ourselves: collect all the data, visualize it through a dashboard, and use analytics and machine learning to improve performance. My team has been using a Splunk-based solution to do this, and I know that Capital One has open-sourced their Hygieia dashboard solution for the community to use. I am surprised that there are not many more solutions for this problem easily available. A key challenge is, of course, that pretty much every organization is using a diverse tool stack.
The challenge that we often have to overcome while collecting data from our IT delivery tools is that many tools do not easily part with their data. Some of the data is only accessible within the tool itself, and the database structures that sit behind it are convoluted and unintuitive to query (or perhaps not queryable at all). I think tool vendors have some work to do here to make this a more straightforward process for us in the community. In the meantime, we can use hooks and logging mechanisms to write important events and data out into a place that can be accessed by our analytics solution. There is no reason why your Continuous Delivery pipeline cannot write out the results after each step with as much metadata as possible so that we can use that as input into our analytics solution later.
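As one illustration, a pipeline step could append its result as a structured JSON line to a file that a log forwarder (Splunk or otherwise) picks up. This is only a sketch; the function name, field names, and log path are illustrative assumptions, not a standard schema:

```python
import json
import time

def log_pipeline_event(stage, status, log_path="pipeline-events.log", **metadata):
    """Append one structured event per pipeline step as a JSON line.

    A log forwarder can ingest this file into an analytics solution;
    the field names here are made up for illustration.
    """
    event = {
        "timestamp": time.time(),
        "stage": stage,          # e.g. "build", "unit-test", "deploy"
        "status": status,        # e.g. "passed", "failed"
        **metadata,              # commit id, duration, test counts, ...
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(event) + "\n")

# Example: record a unit-test step together with its metadata
log_pipeline_event("unit-test", "passed",
                   commit="a1b2c3d", duration_sec=42.7, tests_run=318)
```

Because each line is self-describing JSON, you can keep adding metadata fields over time without breaking earlier events.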
The data that we create within IT will not grow that quickly, which means we can be quite generous with the amount of logging that we do. Imagine what we can do when we have collected data from many years of work, hundreds of releases, and thousands of application builds and test runs. You can start to validate your assumptions about IT delivery. Did the quality of our software improve when we increased unit test coverage? Did late changes in requirements cause higher production outages? Do teams with high amounts of build failures have a higher chance of delaying their deployments? These are all questions that in the past we answered with examples or intuition but that we will now be able to answer with real data.
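To make the last question concrete, here is a sketch of the kind of analysis the collected events would enable. The numbers are invented for illustration; in practice they would be aggregated from your pipeline logs:

```python
# Illustrative per-team records: (build-failure rate, did the last
# deployment slip?). Real values would come from the collected events.
records = [
    (0.05, False), (0.08, False), (0.12, False), (0.10, True),
    (0.25, True),  (0.30, True),  (0.07, False), (0.28, True),
]

def mean_failure_rate(records, delayed):
    """Average build-failure rate for teams with the given deploy outcome."""
    rates = [rate for rate, was_delayed in records if was_delayed == delayed]
    return sum(rates) / len(rates)

on_time = mean_failure_rate(records, delayed=False)
slipped = mean_failure_rate(records, delayed=True)
print(f"mean failure rate, on-time deploys: {on_time:.2f}")
print(f"mean failure rate, delayed deploys: {slipped:.2f}")
```

A gap between the two averages would suggest build stability and deployment delays are related — which you could then investigate properly rather than argue from anecdote.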
There is one other aspect that we can start to address through this approach, as well: status reporting. I cannot believe that the main way of looking at status continues to be through PowerPoint and Excel. The data in those documents is at least a few hours old (if not several days) and has gone through many hands and interpretations before it is presented. We should be able to look at real-time status by gathering information straight from the systems we use, which is something my team has started doing and is something I highly encourage you to do, too. It will change the way you see status. It will force a rigor across your processes and systems that will improve data quality and transparency for everyone. It will take a while to clean the data up because you cannot “fudge” things anymore, but it pays back later when you can always get accurate data from your system.
What else can we do with all the data that we are now collecting for our dashboard? You can start using some basic machine learning and data analysis to look at ticket data, for example, and all those requests that your team is dealing with. Of course, the quantitative data is important for measurement, but the content of the tickets can also be used. You can analyze where the volume of work is coming from and identify opportunities for automation. With machine learning, you can derive from the text where to route the ticket and thus improve resolution times because you are avoiding ticket ping-pong between teams. And as you automate services and expose them as self-service capabilities, your team will start to have more and more time to focus on improving automation — a virtuous cycle that you just have to start. It will take off and feed itself.
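As a sketch of the routing idea: a real system would train a text classifier on historical tickets, but even simple keyword scoring shows the shape of it. The team names and keyword lists below are invented for illustration:

```python
# Toy ticket router. A production system would use a trained text
# classifier; keyword scoring just demonstrates the routing concept.
ROUTES = {
    "database-team": ["sql", "query", "deadlock", "replication"],
    "network-team":  ["vpn", "dns", "firewall", "latency"],
    "identity-team": ["password", "login", "mfa", "account"],
}

def route_ticket(text):
    """Return the team whose keyword list best matches the ticket text."""
    words = text.lower().split()
    scores = {team: sum(word in keywords for word in words)
              for team, keywords in ROUTES.items()}
    return max(scores, key=scores.get)

print(route_ticket("User cannot login after password reset"))
```

Even this crude version sends the ticket straight to the right queue on the first hop, which is exactly the ping-pong the article describes avoiding.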
There are some surprising things you need to consider when you go down the path of DevOps measurements that are worth exploring. One of the things that most people measure is compliance with Service Level Agreements (SLAs), which is a deeply conflicted space. Assume your team has only two SLAs: two hours for a password reset and 24 hours for a Sev 2 incident in production. When both happen at the same time, which one do you think your team will do first? Which one is more important to the company? The two answers are, unfortunately, often conflicting. You can see how this quickly becomes a problem with more SLAs; it creates a complex system with outcomes we cannot predict.
Measuring KPIs and trends is much better than adherence to an SLA because we want to see that we are becoming better in each category of work. But even here, automation throws us a curveball. A common thing to measure is first-time resolution rate and time to resolve a ticket. We want the first to increase and the second to go down as we get better. As we move towards self-service capabilities, the easy tasks will be automated, which means only the more complex and difficult problems require a ticket that the team has to resolve. But this means that we will need more time to resolve those, and the chance of getting it wrong on the first attempt increases. As a result, our time to resolve and first-time resolution rate will look worse as we increase the levels of automation. If you don’t prepare your management for this counterintuitive situation, you might get into trouble.
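The arithmetic behind this effect is easy to see with invented numbers: automate away the easy tickets and the average resolution time of what remains goes up, even though overall service improved:

```python
# Illustrative resolution times (hours) before automation.
easy = [0.5] * 80   # 80 easy tickets, resolved quickly
hard = [8.0] * 20   # 20 hard tickets

# Before automation, the many easy tickets drag the average down.
before = sum(easy + hard) / len(easy + hard)

# After automation, easy tickets become self-service; only hard
# tickets still reach the team, so the average jumps.
after = sum(hard) / len(hard)

print(f"avg time to resolve before automation: {before:.1f}h")
print(f"avg time to resolve after automation:  {after:.1f}h")
```

The metric quadruples while the users' experience got strictly better, which is exactly the conversation you need to have with management before the dashboard does it for you.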
I think there is tremendous opportunity in this space, and we are just at the beginning of our measurement story. As we get better at this, we will have more meaningful conversations and will learn a lot more about what works and what doesn’t. We will move from intuition to a more scientific discussion of DevOps. I gave some examples of legacy ways of looking at measures that have to change, and I am sure we will learn about many more. Let’s be open-minded and curious in finding new ways to look at the data we have.
Published at DZone with permission of Mirco Hering, DZone MVB. See the original article here.