Starting and Scaling DevOps in the Enterprise: Optimizing The Basic Deployment Pipeline
This excerpt from an industry veteran's DevOps guide talks about the planning and testing shifts required to optimize a deployment pipeline.
The original post can be found on the Electric Cloud blog.
Gary Gruver has been sharing parts of his new book with Electric Cloud to give readers a sneak peek into what it's all about! This is the third free chapter from his recent book, “Starting and Scaling DevOps in the Enterprise.” You can read the first chapter here.
The book provides a concise framework for analyzing your delivery processes and optimizing them by implementing the DevOps practices that will have the greatest immediate impact on your organization's productivity. It covers the engineering, architectural, and leadership practices that are critical to achieving DevOps success. This will be a helpful resource for you on your DevOps path!
Chapter 3: Optimizing the Basic Deployment Pipeline
Setting up your deployment pipeline (DP) and using DevOps practices to increase its throughput while maintaining or improving quality is a journey that takes time for most large organizations. This approach, though, provides a systematic method for finding inefficiencies in your software development processes and improving those processes over time. We will look at the different types of work, the different types of waste, and the metrics that highlight inefficiencies. We start there because it is important to put the different DevOps concepts, metrics, and practices into perspective so you can begin your improvements where they will provide the biggest benefits and start driving positive momentum for your transformation.
The technical and cultural shifts associated with this will change how everyone works on a day-to-day basis. The goal is to get people to accept these cultural changes and embrace different ways of working. For example: as an operations person, I have always logged into a server to debug and fix issues on the fly. Now I can log on to debug, but the fix is going to require updating and running the script. This is going to be slower at first and will feel unnatural to me, but the change means that I know, as does everyone else, that the exact state of the server with all of its changes is under version control, and I can create new servers at will that are exactly the same. Short-term pain for long-term gain is going to be hard for some people to embrace, but this is the type of cultural change that is required to truly transform your development processes.
Additionally, there are lots of breakthroughs coming from the field of DevOps that will help you address issues that have been plaguing your organization for years but were not very visible while operating at a low cadence. When you do one deployment a month, you don't see the issues repeating often enough to spot a common cause that needs to be fixed. When you do a deployment each day, you see a pattern that reveals the things that need fixing. When you are deploying manually on a monthly basis, you can use brute force, which takes up a lot of time, requires a lot of energy, and creates a lot of frustration. When you deploy daily, you can no longer use brute force. You need to automate to improve frequency, and that automation allows you to fix repetitive issues.
As you look to address inefficiencies, it is important to understand that there are three different kinds of work in software that require different approaches to eliminate waste and improve efficiency. First, there is new and unique work, such as the new features, new applications, and new products that are the objective of the organization. Second, there is triage work that must be done to find the source of the issues that need to be fixed. Third, there is repetitive work, which includes creating an environment, building, deploying, configuring databases, configuring firewalls, and testing.
Since the new and unique work isn't a repetitive task, it can't be optimized the way you would optimize a manufacturing process. In manufacturing, the product being built is constant, so you can make process changes and measure the output to see if there was an improvement. With the new and unique part of software you can't do that, because you are changing both the product and the process at the same time. Therefore, you don't know if the improvement was due to the process change or just a different outcome based on processing a different type or size of requirement. Instead, the focus here should be on increasing the feedback so that people working on these new capabilities don't waste time and energy on things that won't work with changes other people are making, won't work in production, or don't meet the needs of the customer. Providing fast, high-quality feedback helps to minimize this waste. It starts with feedback in a production-like environment, with their latest code working with everyone else's latest code, to ensure real-time resolution of those issues. Then, ideally, the feedback comes from the customer with code in production as soon as possible. Validating with the customer is done to address the fact that 50% of new software features are never used or do not meet their business intent. Removing this waste requires getting new features to the customers as fast as possible to find which parts of that 50% are not meeting their business objective, so the organization can quit wasting time on those efforts.
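Learning which half of the features is paying off requires deliberately collecting usage data. As a rough, hedged sketch (all of the names below are hypothetical, not from the book), a new feature can be gated behind a flag and its usage counted so the business result can be checked before further investment:

```python
# Hypothetical sketch: gate a new feature behind a flag and count usage,
# so the team can check whether the feature meets its business intent
# before investing further. All names are illustrative.

from collections import Counter

FEATURE_FLAGS = {"new_checkout_flow": True}  # toggled per release or cohort
usage = Counter()

def is_enabled(feature: str) -> bool:
    return FEATURE_FLAGS.get(feature, False)

def checkout(cart: list) -> str:
    if is_enabled("new_checkout_flow"):
        usage["new_checkout_flow"] += 1   # raw material for the "50%" metric
        return f"new flow: {len(cart)} items"
    return f"old flow: {len(cart)} items"

for cart in (["a"], ["a", "b"], []):
    checkout(cart)
print(usage)  # Counter({'new_checkout_flow': 3})
```

A real deployment would use an analytics or experimentation platform rather than an in-process counter; the point is simply that the data needed to kill or keep a feature has to be collected on purpose.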
In large software organizations, triaging and localizing the source of an issue can consume a large amount of effort. Minimizing waste in this area requires minimizing the amount of triage required, and then designing processes and approaches that localize the source of issues as quickly as possible when triage is required. DevOps approaches work to minimize the amount of triage required by automating repetitive tasks for consistency. They are also designed to improve the efficiency of the triage process by moving to smaller batch sizes, resulting in fewer changes that need to be investigated as potential sources of an issue.
The waste with repetitive work is different. DevOps moves to automate these repetitive tasks for three reasons. First, it addresses the obvious waste of doing something manually when it could be automated. Automation also enables the tasks to be run more frequently, which helps with batch sizes and thus the triage process. Second, it dramatically reduces the time associated with these manual tasks so that feedback cycles are much shorter, which helps to reduce the waste in new and unique work. Third, because the automated tasks are executed the same way every time, it reduces the amount of triage required to find manual mistakes or inconsistencies across environments.
DevOps practices are designed to help address these sources of waste, but with so many different places that need to be improved in large organizations, it is important to understand where to start. The first step is documenting the current DP and starting to collect data to help target the bottlenecks in flow and the biggest sources of waste. In this chapter we will walk through each step of the basic DP and review which metrics to collect to help you understand the magnitude of the issues you have at each stage. Then, we will describe the DevOps approaches people have found effective for addressing the waste at that stage. Finally, we will highlight the cultural changes that are required to get people to accept working differently.
This approach should help illustrate why so many different people have different definitions of DevOps. It really depends on which part of the elephant they are seeing. For any given organization, the constraint in flow may be the planning/requirements process, the development process, obtaining consistent environments, the testing process, or deploying code. Your view of the constraint also depends on your role in the organization. While everything you are hearing about DevOps is typically valid, you can't simply copy the rituals, because they might not make sense for your organization. One organization's bottleneck is not another organization's bottleneck, so you must focus on applying the principles!
Planning and Requirements
Here we are talking about new and unique work, not repetitive work, so fixing it requires fast feedback and a focus on end-to-end cycle time for ultimate customer feedback.
For organizations trying to better understand the waste in the planning and requirements part of their DP, it is important to understand the data showing the inefficiencies. It may not be possible to collect all the data at first, but don't let this stop you from starting your improvements. As with all of the metrics we describe, get as much data as you can to target issues and start your continuous improvement process. It is more important to start improving than it is to get a perfect view of your current issues. Ideally, though, you would want to know the answers to the following questions (a small sketch for estimating days of supply follows the list):
- What percentage of the organization's capacity is spent on documenting requirements and planning?
- What is the amount of requirements inventory waiting for development, roughly, in terms of days of supply?
- What percentage of the requirements are reworked after originally being defined?
- What percentage of the delivered features are being used by the customers and achieving the expected business results?
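“Days of supply” borrows from inventory management: divide the requirements waiting for development by the rate at which the organization completes them. A minimal sketch, with illustrative numbers:

```python
# Hypothetical sketch: estimating "days of supply" of requirements
# inventory. The numbers are illustrative; real values would come from
# your planning and tracking tools.

def days_of_supply(requirements_waiting: int, completed_per_day: float) -> float:
    """Inventory waiting for development divided by daily throughput."""
    return requirements_waiting / completed_per_day

# e.g., 240 requirements waiting, and the teams complete about 4 per day
print(days_of_supply(240, 4.0))  # -> 60.0 days of supply
```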
Optimizing this part of the DP requires moving to a just-in-time approach: documenting and decomposing requirements only to the level needed to support the required business decisions, while limiting long-term delivery commitments to a subset of the overall capacity. The focus here is to limit the inventory of requirements as much as possible. Ideally, this would mean waiting until a developer is ready to start working on the requirement before investing in defining the feature. This approach minimizes waste because effort is not exerted until you know for sure the feature is going to be developed. It also enables quick responsiveness to changes in the market, because great new ideas don't have to wait in line behind all the features that were previously defined.
While this is the ideal situation, it is not always possible, because organizations frequently need a longer-range view of when things might happen in order to support different business decisions. For example, you might ask yourself, “Do I need to ramp up hiring to meet the schedule, or should I build the manufacturing line because a product is going to be ready for launch?” The problem is that most organizations create far more requirements inventory, much further into the future, than is needed to support their business decisions. They want to know exactly which features will be ready and when, using waterfall planning, because that is what they do for every other part of the business. This approach drives a lot of waste into the system and locks what should be your most flexible asset into a committed plan. Additionally, most organizations push their software teams to commit 100% of their capacity, meaning they are not able to respond to changes in the marketplace or discoveries made during development. This is a significant source of waste in a lot of organizations.
Environments
For many organizations, like the one described in Chapter 2, the time it takes for operations to create an environment for testing is one of the lengthiest steps in the DP. Additionally, the consistency between this testing environment and production is frequently so lacking that a whole new set of issues must be found and fixed at each stage of testing in the DP. Creating these environments is one of the main repetitive tasks that can be documented, automated, and put under revision control. The objective here is to be able to quickly create environments that provide consistent results across the DP. This is done through a movement to infrastructure as code, which has the additional advantage of documenting everything about the environments so it is easier for different parts of the organization to track and collaborate on changes.
To better understand the impact environment issues are having on your DP, it would be helpful to have the following data (a sketch for tallying defect sources follows the list):
- Time from environment request to delivery
- How frequently new environments are required
- The percent of time environments need fixing before acceptance
- The percent of defects associated with code vs. environment vs. deployment vs. database vs. other at each stage in the DP
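The last metric only requires consistently recording a source for each defect and aggregating per stage. A hedged sketch with made-up sample records:

```python
# Hypothetical sketch: tallying the percent of defects by source at each
# stage of the DP. The sample records are made up; real data would come
# from your defect-tracking tool.

from collections import Counter, defaultdict

defects = [
    {"stage": "integration", "source": "environment"},
    {"stage": "integration", "source": "code"},
    {"stage": "integration", "source": "environment"},
    {"stage": "release", "source": "deployment"},
    {"stage": "release", "source": "code"},
]

by_stage = defaultdict(Counter)
for d in defects:
    by_stage[d["stage"]][d["source"]] += 1

for stage, sources in by_stage.items():
    total = sum(sources.values())
    for source, count in sources.most_common():
        print(f"{stage}: {source} {100 * count / total:.0f}%")
```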
One of the biggest improvements coming out of the DevOps movement concerns the speed and consistency of environments, deployments, and databases. This started with Continuous Delivery by Jez Humble and David Farley, who showed the value of infrastructure as code, where all parts of the environment are treated with the same rigor and controls as the application code. The process of automating the infrastructure and putting it under version control has some key advantages. First, the automation ensures consistency across the different stages and different servers in the DP. Second, the automation supports the increased frequency that is required to drive toward smaller batch sizes and more frequent deployments. Third, it provides working code that is a well-documented definition of the environments, which everyone can collaborate on when changes are required to support new features.
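At its core, infrastructure as code means keeping a desired state under version control and converging servers toward it idempotently. A minimal, hypothetical sketch of that idea (real tools such as Chef, Puppet, or Ansible do this with far more depth; every name below is illustrative):

```python
# Hypothetical sketch of the core idea behind infrastructure as code:
# a desired state kept under version control, and an idempotent "apply"
# that converges a server toward it. All names are illustrative.

desired_state = {          # this definition would live in the SCM
    "packages": {"nginx": "1.24", "openssl": "3.0"},
    "services": {"nginx": "running"},
}

current_state = {          # discovered from the server being converged
    "packages": {"nginx": "1.22"},
    "services": {"nginx": "stopped"},
}

def converge(current: dict, desired: dict) -> list:
    """Return the actions needed to reach the desired state.
    Running it a second time produces no further actions (idempotence)."""
    actions = []
    for pkg, ver in desired["packages"].items():
        if current["packages"].get(pkg) != ver:
            actions.append(f"install {pkg}=={ver}")
            current["packages"][pkg] = ver
    for svc, state in desired["services"].items():
        if current["services"].get(svc) != state:
            actions.append(f"set {svc} -> {state}")
            current["services"][svc] = state
    return actions

print(converge(current_state, desired_state))  # actions on the first run
print(converge(current_state, desired_state))  # [] on the second run
```

The second run producing no actions is the property that makes frequent, repeated deployments safe and makes every server with the same definition identical.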
Technical solutions in this space are evolving quickly because organizations are seeing that getting control of their environments provides many benefits. Smart engineers around the world are constantly inventing new ways to make this process easier and faster. Cloud capabilities, whether internal or external, tend to help a lot with speed and consistency. Scripting capabilities from Chef, Puppet, Ansible, and others help with getting all the changes in scripts under source control management. There have also been breakthroughs with containers that are helping with speed and consistency. The “how” in this space is evolving quickly because of the benefits the solutions are providing, but the “what” is a lot more consistent. For environments, you don't want the speed of provisioning to be a bottleneck in your DP. You need to be able to ensure consistency of the environment, deployment process, and data across the different stages of your DP. You need to be able to qualify infrastructure code changes efficiently so your infrastructure can move as quickly as your applications. Additionally, you need to be able to quickly and efficiently track everything that changes from one build and environment to the next.
Having development and operations collaborate on these scripts for the entire DP is essential. The environments across different stages of the DP are frequently different sizes and shapes, so often no one person understands how a configuration change in the development stage should be implemented in every stage through production. If you are going to change the infrastructure code, it has to work for every stage, and if you don't know how it should work in those stages, that forces the necessary discussions. If you change it and break other stages without telling anyone, the SCM will find you out, and the people managing the DP will provide appropriate feedback. Working together on this code is what forces the alignment between development and operations. Before this change, development would tend to make a change to fix their environment so their code would work, but they wouldn't bother to tell anyone or let people know that, in order for their new feature to work, something would have to change in production. It was release engineering's job to try to figure out everything that had changed and how to get it working in production. With the shift to infrastructure as code, it is everyone's responsibility to work together and clearly document, in working automation code, all of the changes.
This shift to infrastructure as code also has a big impact on the ITIL and auditing processes. Instead of the ITIL practice of manually documenting the configuration change in a ticket, it is all documented in code that is under revision control in an SCM tool. The SCM is designed to make it easy to track any and all changes automatically. You can look at any server and see exactly what was changed, by whom, and when. Combine this with automated testing that can tell you when the system started failing, and you can quickly get to the change that caused the problem. This localization gets easier when the cycle time between tests limits the search to a few changes.
Right now, the triage process takes a long time, sorting through clues to find the change that caused the problem. It is hard to tell whether it is a code, environment, deployment, data, or test problem, and currently the only thing under control for most organizations is code. Infrastructure as code changes that and puts everything under version control, where it is tracked. This eliminates server-to-server variability and enables version control of everything else. It means that the process for making a change and the process for documenting it are the same thing, so you don't have to look at the documentation of the change in one tool to see what was approved and then validate that it was really done in another tool. You also don't have to look at everything that was done in one tool and then go to the other tool to ensure it was documented. This is what auditors do during auditing. The other thing done during auditing is tracking to ensure everyone is following the manual processes every time, something that humans do very poorly but computers do very well. When all this is automated, it meets the ITIL test of tracking all changes, and it makes auditing very easy. The problem is that the way DevOps is currently described to process and auditing teams makes them dig in their heels and block changes, when instead they should be championing those changes. To avoid this resistance to these cultural changes, it is important to help the auditing team understand the benefits DevOps will provide and to include them in defining how the process will work. This will make it easier for them to audit, and they will know where to look for the data they require.
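Small batches pay off directly in triage. With every change under version control and an automated test that flags when things broke, a binary search over the ordered changes (the idea behind git bisect) localizes the offending change in a handful of test runs. A hedged, self-contained sketch with made-up commits:

```python
# Hypothetical sketch: with all changes under version control and an
# automated pass/fail signal, a binary search finds the first bad change
# in O(log n) test runs. The commits and the failure rule are made up.

def first_bad_change(changes: list, is_broken) -> str:
    """Binary-search for the first change after which the test fails,
    assuming everything before it passes and everything after fails."""
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid          # failure is at mid or earlier
        else:
            lo = mid + 1      # failure is after mid
    return changes[lo]

changes = [f"commit-{i}" for i in range(1, 17)]
print(first_bad_change(changes, lambda c: int(c.split("-")[1]) >= 11))
# -> commit-11, found in about 4 test runs instead of 16
```

This is why shrinking the batch size between tested states shrinks triage time: the search space per failure is a few changes, not a month of them.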
Using infrastructure as code across the DP also has the benefit of forcing cultural alignment between development and operations. When development and operations use different tools and processes for creating environments, deploying code into those environments, and managing databases, they tend to find lots of issues when releasing new code into production. This can lead to a great deal of animosity between development and operations. As they start using the same tools, and more specifically the same code, you will likely find that making the code work in all the different stages of the DP forces them to collaborate much more closely. They need to understand each other's needs and the differences between the stages much better. They also need to agree that any changes to the production environments start at the beginning of the DP and propagate through the system just like the application code. Over time, you will likely find that this working code is the forcing function that starts the cultural alignment between development, operations, and all the organizations in between. This is a big change for most large organizations. It requires that people quit logging in to servers and making manual changes. It requires an investment in creating automation for the infrastructure. It also requires everyone to use common tools, communicate about any infrastructure changes that are required, and document the changes with automated scripts. It requires much better communication across the different silos than exists in most organizations.
Organizations doing embedded development typically have a unique challenge with environments because the firmware/software system is being developed in parallel with the actual product, so there is very little, if any, product available for early testing. Additionally, even when the product is available, it is frequently difficult to fully automate testing on the final product. These organizations need to invest in simulators to enable them to test the software portions of their code as frequently and cheaply as possible. They need to find or create a clean architectural interface between the software parts of their code and the low-level embedded firmware parts. Code is then written to simulate this interface running on a blade server, so they can test the software code without the final product. The same principle holds true for the low-level embedded firmware, but that testing frequently requires validating the interactions of the code with the custom hardware in the product. For this testing, they need to create emulators that support testing the hardware and firmware together without the rest of the product.
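To make the idea of that clean architectural seam concrete, here is a minimal, hypothetical sketch (none of these names come from the book): the application code depends only on an interface, so a simulator can stand in for the real hardware during frequent, cheap test runs.

```python
# Hypothetical sketch of a clean seam between application software and
# low-level firmware: the application talks only to an interface, so a
# simulator can replace real hardware in tests. All names are illustrative.

from abc import ABC, abstractmethod

class FirmwareInterface(ABC):
    @abstractmethod
    def read_sensor(self) -> float: ...
    @abstractmethod
    def set_actuator(self, level: float) -> None: ...

class SimulatedFirmware(FirmwareInterface):
    """Runs on a plain build server; no product hardware required."""
    def __init__(self):
        self.level = 0.0
    def read_sensor(self) -> float:
        return 20.0 + self.level * 5.0   # trivial stand-in physical model
    def set_actuator(self, level: float) -> None:
        self.level = level

def control_loop(fw: FirmwareInterface, target: float) -> float:
    """Application-level code under test, unaware of which implementation
    of the interface it is driving."""
    reading = fw.read_sensor()
    fw.set_actuator(max(0.0, min(1.0, (target - reading) / 10.0)))
    return fw.read_sensor()

print(control_loop(SimulatedFirmware(), target=25.0))  # testable anywhere
```

The same `control_loop` would later run against an implementation backed by the real firmware, while the firmware itself is validated against hardware emulators.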
This investment in simulators and emulators is a big cultural shift for most embedded organizations. They typically have never invested in creating these capabilities and instead just do big-bang integrations late in the product lifecycle that don't go well. Additionally, those that have created simulators or emulators often have not invested in continually improving these capabilities to ensure they catch more and more of the defects over time. These organizations need to make the cultural shift to more frequent test cycles just like any other DevOps organization, but they can't do that if they don't have test environments they can trust for finding code issues. If the organization is not committed to maintaining and improving these environments, it tends to lose trust in them and quit using them. When this happens, it ends up missing a key tool for transforming how it does embedded software and firmware development.
Testing
To understand the biggest sources of waste and delay in your testing processes, it would be helpful to have the following data:
- The time it takes to run the full set of tests
- The repeatability of the testing (false failures)
- The percent of defects found with unit tests, automated system tests, and manual tests
- The time it takes the release branch to meet production quality
Release
For the release and deployment stage, it would be helpful to have the following data:
- Approval times
- Batch sizes or release frequency at each stage
- The time and effort required to deploy and release into production
- The number of issues found during release and their source (code, environment, deployment, test, data, etc.)
Operation and Monitoring
- Issues found in production
- Time to restore service
In the coming weeks, Gruver will be sharing additional chapters and tips from the book.
Can't wait? You can download your free copy now.