Infrastructure and operations as code is an essential practice for realizing the advantages of modern clouds. For enterprises looking to migrate to Amazon Web Services, Azure, or Google Cloud Platform, scripted infrastructure and automation are the key first steps through which other DevOps practices become accessible. This post will enumerate some key benefits that become possible once we embrace infrastructure as code practices.
Codifying our infrastructure improves testability and monitoring, lowers the cost of experimentation and innovation, makes deployments more efficient and predictable, and decreases the mean time to recovery (MTTR) when issues occur.
Automate Your Deployment and Recovery Processes
With infrastructure automation, reproducible environments become possible. We can use the same automation scripts to deploy identical copies of our infrastructure to the development, test, and production environments. With these consistent deployments, we are able to achieve the ever-elusive development-to-prod parity, finally putting an end to the “it worked on my machine!” problems.
The pinnacle of infrastructure automation is the Blue/Green deployment strategy. This strategy enables zero downtime deployments and allows us to run live tests before releasing our changes to our users. Blue/Green Deployments take advantage of our ability to run exact copies of our environments in parallel. By controlling when traffic is routed to our new copy, we can defer a release until we are 100% confident that our new environment is ready.
In a Blue/Green deployment, we deploy a new, isolated copy of our environment. This new, copied environment is named Green. It is our release candidate. It contains our new changes and is isolated from the live environment, which we call Blue. The Green environment is configured for production and is ready to go live, but it is launched darkly – that is, no traffic is routed to Green.
Next, we run our acceptance tests against the live Green environment. If we encounter an error, we can simply log the error, remove the Green environment and go back to the drawing board. No users ever know a difference, as we never routed any live traffic to Green.
If our acceptance tests do pass, we promote our Green environment to be the new live environment. This can be done by changing a DNS entry to point at the Green environment or by removing the Blue environment from our load balancer and adding the Green environment to the load balancer.
The Blue environment does not need to be deleted right away. If necessary, we can keep it around for a short grace period in case we need to roll back. The rollback process would consist of reversing the traffic swap so that traffic points back at Blue.
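The promote-or-discard flow described above can be sketched in a few lines of Python. This is a minimal illustration, not a real deployment tool: the `Router` class stands in for a load balancer or DNS record, and `blue_green_release`, `acceptance_tests`, and `teardown` are hypothetical names chosen for this example.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

log = logging.getLogger("bluegreen")


@dataclass
class Router:
    """Hypothetical stand-in for a load balancer or DNS entry."""
    live: str
    previous: Optional[str] = None

    def switch_to(self, env: str) -> None:
        # Remember the old target so that a rollback is a single swap.
        self.previous, self.live = self.live, env

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous environment to roll back to")
        self.live, self.previous = self.previous, self.live


def blue_green_release(
    router: Router,
    green: str,
    acceptance_tests: Iterable[Callable[[str], bool]],
    teardown: Callable[[str], None],
) -> bool:
    """Promote `green` only if every acceptance test passes against it."""
    failures = [test.__name__ for test in acceptance_tests if not test(green)]
    if failures:
        # No live traffic ever reached Green, so users see no difference.
        log.error("acceptance tests failed on %s: %s", green, failures)
        teardown(green)
        return False
    router.switch_to(green)
    return True
```

Because the router keeps the previous target, Blue stays available for the grace period, and `router.rollback()` reverses the swap.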
This is merely an overview of the Blue/Green deployment strategy. For an in-depth discussion of Blue/Green techniques in an AWS environment, see the AWS whitepaper on the topic.
Rollback With the Same Tested Processes
Our deployment scripts are also our rollback scripts. Because our deployments are automated, we can reproduce the state of the infrastructure any number of times by simply re-running the deployment scripts with the same inputs. With our codified infrastructure, we can reach back in version control to grab any commit since the repository began. By reverting to the desired commit and re-running our deployment scripts, we can restore the state of the infrastructure as it was on any given day.
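One way to picture why the same script serves both directions is to treat a deployment as converging the current state toward whatever a given commit describes. The sketch below assumes a simplified model in which the infrastructure is a dictionary of named resources; the `plan` function and the two sample definitions are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]


def plan(current: State, desired: State) -> List[Tuple[str, str]]:
    """List the actions needed to move `current` to `desired`."""
    actions = []
    for name in sorted(desired.keys() - current.keys()):
        actions.append(("create", name))
    for name in sorted(current.keys() - desired.keys()):
        actions.append(("delete", name))
    for name in sorted(current.keys() & desired.keys()):
        if current[name] != desired[name]:
            actions.append(("replace", name))
    return actions


# Two hypothetical infrastructure definitions, as stored at two commits.
older = {"web": {"image": "web:1.0"}, "db": {"image": "db:9"}}
newer = {
    "web": {"image": "web:1.1"},
    "db": {"image": "db:9"},
    "cache": {"image": "cache:3"},
}

print(plan(older, newer))  # rolling forward
print(plan(newer, older))  # rolling back uses the same function
print(plan(newer, newer))  # re-running with the same inputs changes nothing
```

Rolling back is the same call with an earlier commit's definition as the desired state.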
Don’t Repair, Redeploy
Server time is cheap, but engineer time is expensive. Further, troubleshooting server performance issues can be very time-consuming. For these reasons, it rarely makes sense to troubleshoot and repair individual servers. Rather, it is now more economical to destroy the old server instance and replace it with a new, working copy.
We can use our automated deployment scripts to deliver working servers to replace broken and impaired servers. We can now follow an immutable infrastructure pattern, in which nothing ever changes on a server after it is deployed. This helps avoid the problem of configuration drift and also greatly simplifies our operations. Now, the only repair operation is to redeploy the service. A service crashed? Redeploy. Having performance issues on a host? Redeploy. Lost connectivity to a host? Redeploy.
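A replace-instead-of-repair loop might look like the following sketch. It assumes the fleet is tracked as a mapping from instance ID to deployed version, and that `is_healthy`, `launch`, and `terminate` are callables supplied by whatever platform is in use; all of these names are hypothetical.

```python
from typing import Callable, Dict, List


def replace_unhealthy(
    fleet: Dict[str, str],
    is_healthy: Callable[[str], bool],
    launch: Callable[[str], str],
    terminate: Callable[[str], None],
) -> List[str]:
    """Replace every unhealthy instance with a fresh one of the same version.

    `fleet` maps instance id -> deployed version and is updated in place.
    Returns the ids of the instances that were replaced.
    """
    replaced = []
    for instance_id in list(fleet):  # snapshot, since we mutate the dict
        if is_healthy(instance_id):
            continue
        version = fleet.pop(instance_id)
        terminate(instance_id)          # never repaired, only destroyed
        new_id = launch(version)        # same automated deployment as always
        fleet[new_id] = version
        replaced.append(instance_id)
    return replaced
```

Nothing in this loop logs in to a server or edits its configuration, which is what keeps the servers immutable.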
Focus on Mean Time to Recovery
They say you can’t fix what you don’t measure, but it’s important to choose the right metrics to measure and improve upon. To traditional IT organizations, the key metric is Mean Time Between Failures (MTBF). Server uptime is paramount, and this is the metric that gets optimized. This leads to a reluctance to accept changes, as each change can potentially introduce a failure. Moreover, configuration changes are generally made manually by administrators. This leads to long-running snowflake servers which are virtually impossible to reproduce. This presents a very nasty challenge in restoring service availability when the inevitable failures do occur.
Failure of an IT component means the organization is losing money. But failures do and will happen. In a cloud-native world, we solve this problem by turning it on its head. Rather than trying to avoid failures, DevOps organizations accept that failures are a part of life and design our applications to minimize the impact of those failures by recovering gracefully. To accomplish this, we focus on Mean Time To Recovery (MTTR) as our key metric. By minimizing the time it takes to recover from failure, we minimize the impact of each failure. Optimizing for MTTR necessitates automation of our processes. Our recovery processes must be consistent and reliable.
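To make the two metrics concrete, the sketch below computes both from a list of incidents, using the common definitions: MTTR is total downtime divided by the number of incidents, and MTBF is total uptime in the observation window divided by the number of failures. The function names and the sample incident data are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Incident = Tuple[datetime, datetime]  # (outage start, service restored)


def total_downtime(incidents: List[Incident]) -> timedelta:
    return sum((end - start for start, end in incidents), timedelta())


def mean_time_to_recovery(incidents: List[Incident]) -> timedelta:
    """Average time from the start of an outage to restored service."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return total_downtime(incidents) / len(incidents)


def mean_time_between_failures(
    incidents: List[Incident], window: timedelta
) -> timedelta:
    """Average uptime per failure over the observation window."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return (window - total_downtime(incidents)) / len(incidents)


# Hypothetical incidents: one 10-minute outage and one 30-minute outage.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 10)),
    (datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 14, 30)),
]
print(mean_time_to_recovery(incidents))  # 0:20:00
```

Halving the recovery time halves the downtime of every future incident, even if the failure rate itself never improves.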
Practice Makes Perfect
If we want to improve at anything, we have to practice. Recovering from failures is no different. We do not want the first test of our recovery processes to be during an actual disaster. Rather, we want to test our recovery process numerous times before we actually need it. Doing so gives us confidence that our recovery process will work as intended and restore the availability of our service.
Traditionally, creating an isolated environment for disaster recovery was too costly and time-consuming to be a feasible strategy. The only way to test our process was to actually have a disaster. However, with modern cloud environments, we no longer have this limitation. Creating a new environment is an API call away. Once we’ve codified our infrastructure, we can create a copy of our production environment by running the same code we used to create production.
We create our new copy environment to be totally isolated from our production environment. We are now free to simulate disasters and test our recovery processes. This can be done regularly in a low-stress environment, allowing our engineering teams to troubleshoot and strategize without the added pressure of an actual outage.
Each time the process fails, we learn a little bit more. We can then use this information to correct the problem and improve our automated recovery scripts. At the very least, we document the known issues and add the solutions to common problems in our standard procedures.
We should practice these failures regularly. By the time an actual disaster occurs, we should have multiple practice runs of recovering from the disaster, as well as hundreds or even thousands of trial runs from the deployments being run with the same scripts.
Use Testing Tools to Verify Your Infrastructure
With our infrastructure codified and our restore process automated, the next step is to design a set of automated tests that will verify that the deployed infrastructure works as intended. Because we now think of our infrastructure as a software application, we should use software testing tools to test it. By using tools like Python’s Behave or Ruby’s RSpec, we can test that our service is behaving as expected.
These tests don’t have to be complicated and can start out very simply. The first test can just be “Is the service up and reachable?” After all, this is the entire goal of the software project – if it is not up and working, it is of no use. Then we can start to further refine our tests to include those behaviors we expect a healthy service to exhibit. A good starting point is to hit each of our service’s endpoints in an automated fashion. These basic tests give us a high level of certainty that the app is behaving as expected, and we can add more detailed testing to test for specific failure cases.
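A first "is it up, and does each endpoint answer?" test needs very little code. The sketch below uses only the Python standard library rather than a testing framework such as Behave. The function names, the base URL, and the paths in the usage note are placeholders, not part of any real service.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None      # DNS failure, refused connection, timeout, ...


def smoke_test(
    base_url: str,
    paths: Iterable[str],
    fetch: Callable[[str], Optional[int]] = http_status,
) -> Dict[str, bool]:
    """Map each path to True when it answered with a 2xx status."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = status is not None and 200 <= status < 300
    return results
```

A call such as `smoke_test("https://service.example", ["/health", "/api/items"])` returns one pass/fail entry per endpoint. The `fetch` parameter is injectable so the check itself can be tested without a network.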
As we practice our failures and recovery process, we will find new issues that can cause our system to not operate correctly. As these issues are discovered, we test for them and add those tests to our suite of automated tests. These tests also double as regression tests. When a new feature is added and a test breaks, we know exactly which change caused the service tests to fail. As time goes by, we build a more comprehensive test suite and incrementally increase our confidence in our recovery process.
Hook Your Tests Into Your Monitoring System
Our automated test suite gives us confidence that our service is behaving correctly during deployment and recovery. In these situations, the conditions are known and assumptions can be hidden. But what happens when our service is used in unexpected ways, as is bound to happen when real users start using the application? We can hook these tests into our monitoring systems and run them on a periodic basis. In this way, we can be alerted soon after something goes wrong, within one test interval. Running our tests in this fashion allows us to test against real-world scenarios. This is our first line of defense in detecting real-world errors.
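Running the same checks on a schedule can be as simple as the loop sketched below. It assumes each check is a zero-argument callable that returns a truthy value on success, and that `alert` is whatever hook the monitoring system provides. Both are hypothetical here, as are the function names.

```python
import time
from typing import Callable, Dict, List, Optional


def run_checks(
    checks: Dict[str, Callable[[], bool]],
    alert: Callable[[str, str], None],
) -> List[str]:
    """Run every check once and alert on each failure; return failed names."""
    failed = []
    for name, check in checks.items():
        try:
            passed = bool(check())
            detail = "check returned a falsy value"
        except Exception as exc:  # a crashing check counts as a failure
            passed = False
            detail = f"check raised {exc!r}"
        if not passed:
            failed.append(name)
            alert(name, detail)
    return failed


def monitor(
    checks: Dict[str, Callable[[], bool]],
    alert: Callable[[str, str], None],
    interval_seconds: float = 60.0,
    rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the checks every `interval_seconds`; forever if `rounds` is None."""
    completed = 0
    while rounds is None or completed < rounds:
        run_checks(checks, alert)
        completed += 1
        if rounds is None or completed < rounds:
            sleep(interval_seconds)
```

In practice a scheduler or the monitoring system itself would usually own the loop; the point is only that the deployment tests and the monitoring checks can be the same functions.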
You don’t have to be Netflix or Airbnb to take advantage of DevOps practices. Fortune 500 companies and government agencies are adopting these patterns so that they can recover from failure more quickly, deploy more often, and deploy more quickly. The prerequisite to practicing these modern DevOps techniques is Infrastructure as Code. If your organization is looking to begin capitalizing on the benefits of modern clouds but does not know where to start, codifying infrastructure should be the first step.