It’s Not All Bad! Using Cloud Drift for Teachable Moments
Typically, drift between running cloud resources and their associated infrastructure as code (IaC) is something to avoid at all costs. When it does inevitably happen, however, treat it as an opportunity to learn from your mistakes and train your teams to prevent it in the future.
Stack Overflow’s 2021 Developer Survey found that 54% of developers use AWS, yet only 7% use Terraform. That suggests far more developers provision, manage, and decommission cloud infrastructure using methods other than infrastructure as code (IaC). And that’s not including non-developers, such as Ops teams.
Because of this knowledge and adoption gap, not everyone on a large team will know how to write IaC. At least a few will resort to making changes in the cloud console or CLI, causing drift.
Drift is when your IaC state, or configurations in code, differs from your cloud state, or configurations of running resources. For example, if you use Terraform to provision a database without encryption turned on (oops) and an SRE at your company goes in and adds encryption using the cloud console, then your Terraform file is no longer in sync with the actual running cloud infrastructure. Now your Terraform code is less useful and you may not even be aware of that until you go to apply your code again.
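To make the definition concrete, here is a minimal sketch of drift as a comparison between two sets of attributes. It is not how any particular tool works; the function name `detect_drift` and the attribute dictionaries are hypothetical, chosen to mirror the encryption example above.

```python
def detect_drift(declared: dict, running: dict) -> dict:
    """Return {attribute: (declared_value, running_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | running.keys()):
        code_value = declared.get(key)
        cloud_value = running.get(key)
        if code_value != cloud_value:
            drift[key] = (code_value, cloud_value)
    return drift


# What the Terraform code declares (hypothetical attributes and values).
declared_db = {
    "engine": "postgres",
    "instance_class": "db.t3.micro",
    "storage_encrypted": False,
}

# What is actually running after the SRE turned on encryption in the console.
running_db = {
    "engine": "postgres",
    "instance_class": "db.t3.micro",
    "storage_encrypted": True,
}

drift = detect_drift(declared_db, running_db)
print(drift)  # {'storage_encrypted': (False, True)}
```

An empty result means code and cloud agree. Anything else is a difference that someone needs to either codify or revert.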
Drift happens in spite of the benefits of IaC. IaC, especially in combination with a version control system (VCS), is:
- Faster to provision
- Easier to understand and therefore improve
- Easier to collaborate around
- Able to be secured before runtime
- Easier to replicate for disaster recovery and new projects
In the encryption example above, the drift was a good thing: a misconfiguration (lack of encryption) was remediated. The fix just should have been made in code and in collaboration between the two teams. Perhaps the person who made the change didn’t know Terraform well enough to make it in code, or didn’t know you, the developer, well enough to talk to you about fixing the issue.
According to Stack Overflow’s survey, when learning new tools, like IaC, many developers turn to online resources, school, or books. However, my favorite way to learn is missing from the list: learning by doing. Nothing works like picking something to build and hacking away at it until you create a working version. You make mistakes, but the learning sticks with you. Similarly, nothing breaks down silos like working together to solve a problem in the trenches.
This can also be effective in architecting secure cloud infrastructure. Making changes manually to a cloud you’re familiar with, in combination with a drift detection tool, can actually help you learn IaC. Also, identifying drift caused by different teams can open up the conversation on how to improve collaboration. Let’s explore the two different reasons drift occurs and how each can help teach secure IaC and cloud infrastructure best practices.
When Drift Is Temporary or Accidental (or Heaven Forbid a Misconfiguration)
Sometimes a change was made directly to cloud infrastructure that was meant to be temporary but turned into a permanent fixture. Let’s say during an incident, an SRE is unable to access a virtual machine, so they go into a cloud firewall (such as a security group) and manually open up SSH (port 22) to the world (0.0.0.0/0). They identify and fix the problem in the application, but forget to close up SSH once they’re done, adding to the attack surface of that VM.
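A forgotten rule like this is easy to check for mechanically. The sketch below assumes firewall rules have already been exported into plain dictionaries; the field names (`protocol`, `from_port`, `to_port`, `cidr`) and the helper names are hypothetical and do not correspond to a specific cloud API.

```python
import ipaddress


def is_world_open(cidr: str) -> bool:
    """True when the CIDR block covers every address (0.0.0.0/0 or ::/0)."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def find_open_ssh(rules: list) -> list:
    """Return the ingress rules that expose TCP port 22 to the whole internet."""
    findings = []
    for rule in rules:
        covers_ssh = rule["from_port"] <= 22 <= rule["to_port"]
        if rule["protocol"] == "tcp" and covers_ssh and is_world_open(rule["cidr"]):
            findings.append(rule)
    return findings


# Hypothetical ingress rules for the VM's security group.
rules = [
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "0.0.0.0/0"},
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr": "0.0.0.0/0"},
    {"protocol": "tcp", "from_port": 22, "to_port": 22, "cidr": "203.0.113.10/32"},
]

for rule in find_open_ssh(rules):
    print(f"SSH open to the world: {rule['cidr']}")
```

Only the first rule is flagged. HTTPS open to the world and SSH restricted to a single address both pass, which is the distinction the questions below are meant to surface.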
If the DevOps or security team identifies this drift, it’s a great time to ask questions.
- Why was it necessary to open SSH to the world instead of a specific IP address?
- How long did it take to identify?
- Why wasn’t it detected and remediated earlier?
- How do we improve the roadblocks that exacerbated the issue?
This conversation should happen in a blameless environment, with both teams in the room, where issues are raised in good faith as a growth opportunity. In this case, it may be an opportunity to improve the collaboration between the two teams, or to improve the response or approval process so that the right IP address can be granted access to a host without exposing it to the world.
When Drift Is Permanent or a Fix
Let’s say you’re new to IaC. You deploy your multiple-tier web application using Terraform, but you receive a runtime alert that your running database isn’t encrypted. That’s not great. If anyone gets a hold of that database, they’ll have free access to all of the data!
If you go into your cloud console, take an encrypted backup, and relaunch an encrypted database, you’ve improved your security posture, but you’ve created drift. You’ve lost the auditability, collaboration, and repeatability benefits of IaC. You wouldn’t want to use that Terraform code as a starting point for your next web project, and if you apply your code again, you’ll just overwrite that fix.
These are arguably the more important questions to address when getting comfortable with IaC and GitOps:
- How long did it take to identify the unencrypted database?
- Do we have the right security tooling in place?
- Why wasn’t the mistake caught in code?
- Why was that insecure code allowed to be deployed?
- Why was the fix made in the cloud instead of in code?
The last one could be because you aren’t familiar enough with IaC to make the change in code. A good drift detection tool will identify not only that there is a difference between your code and cloud, but also the code needed to bring your Terraform code back in line. Use this as an opportunity to study the correction and learn how to add encryption or other proper configurations in templates in the future.
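As a rough illustration of what such a suggested correction might look like, the sketch below turns detected drift into Terraform-style attribute lines. The helper names, the resource address `aws_db_instance.app_db`, and the shape of the `drift` dictionary are all assumptions made for this example, not the output format of any real drift detection tool.

```python
import json


def to_hcl_value(value) -> str:
    """Render a Python value as a simple HCL literal (booleans, numbers, strings)."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # json.dumps gives a double-quoted, escaped string, which is close enough here.
    return json.dumps(str(value))


def suggest_fix(resource_address: str, drift: dict) -> str:
    """Build the attribute lines that would bring the code in line with the cloud."""
    lines = [f"# Suggested update for {resource_address}"]
    for attribute, (declared, running) in sorted(drift.items()):
        lines.append(
            f"{attribute} = {to_hcl_value(running)}  # was {to_hcl_value(declared)}"
        )
    return "\n".join(lines)


# Hypothetical drift: attribute -> (value in code, value in the running cloud).
drift = {"storage_encrypted": (False, True)}

print(suggest_fix("aws_db_instance.app_db", drift))
# # Suggested update for aws_db_instance.app_db
# storage_encrypted = true  # was false
```

Reading a suggestion like this next to your original template is a small, concrete lesson in which argument you missed the first time.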
Also, take this opportunity to work on the collaboration between teams. Bring the two teams involved into a room and discuss how to increase the communication between them. Set up processes for notifying the right engineer when a runtime misconfiguration is identified by the SRE team.
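On the question of why the mistake wasn’t caught in code, one option is a check that runs before deployment. The sketch below scans a dictionary shaped like the `resource_changes` section of Terraform’s JSON plan output and flags database instances without encryption. The sample plan is a trimmed, hypothetical stand-in, and the function name is an assumption for this example; a dedicated IaC scanning tool would cover far more cases.

```python
def find_unencrypted_databases(plan: dict) -> list:
    """Return addresses of planned aws_db_instance resources lacking encryption."""
    findings = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_db_instance":
            continue
        # "after" holds the planned attribute values; it can be null on deletes.
        after = (change.get("change") or {}).get("after") or {}
        if not after.get("storage_encrypted", False):
            findings.append(change["address"])
    return findings


# Trimmed, hypothetical plan data for a two-resource web application.
plan = {
    "resource_changes": [
        {
            "address": "aws_db_instance.app_db",
            "type": "aws_db_instance",
            "change": {"actions": ["create"], "after": {"storage_encrypted": False}},
        },
        {
            "address": "aws_instance.web",
            "type": "aws_instance",
            "change": {"actions": ["create"], "after": {"instance_type": "t3.micro"}},
        },
    ]
}

unencrypted = find_unencrypted_databases(plan)
if unencrypted:
    print("Blocked: unencrypted databases in plan:", ", ".join(unencrypted))
```

Wiring a check like this into a pull request pipeline gives both teams a shared place to catch the problem before it ever reaches the cloud.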
Drift Detection for Continuous Learning
Learning a new coding language and a new way of operating infrastructure can be difficult. You can expect both cultural and knowledge speed bumps as you adopt it. However, building real-world projects is the best way to fully understand some of the nuances and details you may miss from online or classroom learning. We encourage everyone to move towards solely managing their cloud infrastructure in code, but that doesn’t happen overnight.