
Terraform Drift Detection at Scale: How to Catch Configuration Drift Early

Automated drift detection ensures Terraform-managed infrastructure stays aligned with Git, reducing risk from manual changes across multi-cloud environments.

By Yugandhar Suthari · Apr. 15, 25 · Analysis


While Terraform provides a declarative model for managing infrastructure across cloud platforms, it rests on one assumption that rarely holds at scale: that deployed infrastructure is always managed exclusively through Terraform. In practice, environments evolve. Under pressure, teams make manual changes, apply hotfixes directly in the cloud console, and deploy infrastructure through parallel automation pipelines.

These changes create configuration drift outside of the Terraform lifecycle. The infrastructure stays functional but is misaligned with the Terraform codebase, causing unpredictable behavior, broken expectations, and sometimes even production incidents.

To resolve this problem, we built an automated drift detection system that runs continuously across our multi-cloud environments (AWS, Azure, GCP). It detects and reports divergence between what is provisioned and what is declared in Git. In this article, we explain how it works, how we integrated it into our workflows, and the decisions we made to ensure it scaled reliably.

Understanding Configuration Drift

How Drift Happens in Terraform Workflows

Configuration drift occurs when infrastructure is modified outside of Terraform — through manual edits, external tools, or emergency fixes not committed to code. This can happen through:

  • Manual changes in the cloud console
  • Untracked hotfixes during incidents
  • Other tools and APIs that modify resources
  • Environmental factors such as auto-scaling or third-party tagging

Terraform does not detect changes made directly to infrastructure unless the state is explicitly refreshed: by default, it compares the Terraform code to the last known state, not to the actual infrastructure. Even with a refresh, minor or cloud-managed changes may go unnoticed unless they are adequately differentiated.
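On recent Terraform versions, a refreshed plan exposes out-of-band changes in its machine-readable output (terraform show -json) under the resource_drift key. A minimal sketch of extracting drifted resources from that JSON; the sample document below is illustrative, not real provider output:

```python
import json

def extract_drift(plan_json: dict) -> list[dict]:
    """Pull drifted resources from Terraform's machine-readable plan.

    `resource_drift` lists resources whose remote state changed outside
    Terraform; entries whose only action is "no-op" are ignored.
    """
    drifted = []
    for change in plan_json.get("resource_drift", []):
        actions = change.get("change", {}).get("actions", [])
        if actions and actions != ["no-op"]:
            drifted.append({"address": change["address"], "actions": actions})
    return drifted

# Illustrative fragment shaped like Terraform's JSON plan representation
sample = {
    "resource_drift": [
        {"address": "aws_security_group.web",
         "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["no-op"]}},
    ]
}
print(extract_drift(sample))
# → [{'address': 'aws_security_group.web', 'actions': ['update']}]
```

In practice the input would come from json.load() over the output of terraform show -json plan.out.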

Risk Profile of Drift

While drift will not always break infrastructure immediately, it slowly degrades reliability. For example, IAM policies changed outside of the source of truth typically remain undocumented because they never land in version control. Network changes applied by hand to restore service during an incident become long-term liabilities when a later Terraform apply silently reverts them.

Drift introduces fragility in multi-team environments. As small changes propagate inconsistencies across environments, engineers lose confidence in Terraform, which should be the source of truth.

Designing Drift Detection That Scales

Anchoring to Git, Not Just Terraform State

In our design, the core principle was to treat Git as the source of infrastructure truth rather than the Terraform state. We measure drift by checking the live environment (via a fresh terraform plan -refresh=true) against the code as committed to version control.

For each module, we added a CI job that runs on a regular schedule and on each PR build. The job:

  • Refreshes the Terraform state from the provider
  • Runs a plan against the code at the current Git SHA
  • Compares the output with the expected plan (previous commit or baseline)

The module is flagged if there are differences and no matching code change is found.
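The flagging logic can be sketched as a fingerprint comparison: hash the rendered plan, compare it with the baseline's hash, and flag the module only when the plans diverge without an explaining code change. Function names and the code_changed signal are hypothetical simplifications of our CI job:

```python
import hashlib

def plan_fingerprint(plan_text: str) -> str:
    """Stable fingerprint of a rendered plan, used to compare runs."""
    return hashlib.sha256(plan_text.encode()).hexdigest()

def flag_module(current_plan: str, baseline_plan: str,
                code_changed: bool) -> bool:
    """Flag a module when its plan diverges from the baseline and no
    code change in Git explains the difference."""
    diverged = plan_fingerprint(current_plan) != plan_fingerprint(baseline_plan)
    return diverged and not code_changed

# A plan that changed with no corresponding commit is flagged as drift
print(flag_module("~ update sg rule", "no changes", code_changed=False))  # True
# The same divergence with a matching code change is expected, not drift
print(flag_module("~ update sg rule", "no changes", code_changed=True))   # False
```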

[Figure: Terraform drift detection system architecture]

Multi-Cloud Support

Since our infrastructure is spread over AWS, GCP, and Azure, modular runners help us run drift checks per cloud provider. In each environment, a scheduled drift detector is executed in a restricted read-only context, which prevents accidental modification and allows it to run in production.

The plan hash, commit ID, timestamp, and diff details are pushed together with the results and metadata into a central storage location (in our case, an S3 bucket). This data is surfaced and aggregated through Slack alerts and Grafana dashboards.
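Each check emits one self-describing record. A sketch of how such a record might be serialized before upload; the field names and S3 key layout are illustrative, and the actual upload (e.g., via boto3) is omitted:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_drift_record(module: str, commit_sha: str, plan_text: str,
                       diff_summary: list[str]) -> str:
    """Serialize one drift-check result for central storage.

    In a pipeline like ours, this JSON document would be uploaded to a
    key such as drift-results/<module>/<timestamp>.json (illustrative).
    """
    record = {
        "module": module,
        "commit": commit_sha,
        "plan_hash": hashlib.sha256(plan_text.encode()).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "diff": diff_summary,
    }
    return json.dumps(record, sort_keys=True)

doc = build_drift_record("network/vpc", "ab12cd3",
                         "~ aws_route_table.main",
                         ["route added out-of-band"])
print(doc)
```

Keeping the records as plain JSON makes them easy to aggregate into Slack alerts and Grafana dashboards downstream.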

Integrating Detection into the Developer Workflow

Pull Request Integration

Drift surfaces directly during the development cycle. When a pull request that modifies infrastructure code is opened, we also check whether the corresponding module has drifted.

If drift is detected, we annotate the PR with a detailed plan diff and a warning. This lets the developer check whether the drift is benign or should be reconciled. For instance, if a security group rule was modified manually, the diff shows what was edited before the PR removes it anyway.

Suppressing Known Non-Issues

Not all drift is actionable. Cloud providers frequently add or change metadata fields (e.g., last_updated, self_link, and dynamically applied tags). These changes do not require remediation.

To address this, we introduced configurable suppression rules per module. With those rules in place, the drift detection job reports only actionable differences, giving teams a focused signal to act on.
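A suppression rule can be as simple as a glob pattern over attribute paths, checked before a change is reported. The module name and patterns below are hypothetical examples of such rules:

```python
from fnmatch import fnmatch

# Hypothetical per-module suppression rules: attribute paths matching any
# pattern are treated as provider noise, not actionable drift.
SUPPRESSIONS = {
    "network/vpc": ["tags.last_updated", "self_link", "tags.auto-*"],
}

def actionable_diffs(module: str, changed_attrs: list[str]) -> list[str]:
    """Drop attribute changes covered by the module's suppression rules."""
    rules = SUPPRESSIONS.get(module, [])
    return [a for a in changed_attrs
            if not any(fnmatch(a, pat) for pat in rules)]

print(actionable_diffs("network/vpc",
                       ["cidr_block", "tags.last_updated", "tags.auto-scaler"]))
# → ['cidr_block']
```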

Improving Signal and Accountability

Human-Readable Diff Output

In earlier iterations, the system would simply post "drift detected" logs. These alerts weren't helpful. We improved the toolchain to present human-readable diffs between code and live config, grouped by resource type and attribute for readability.

This allowed developers and SREs to triage drift reports without running plans locally or manually browsing the cloud console.
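The grouping itself is straightforward: bucket changed attributes by resource type and render a before/after line per attribute. The change-record shape below is an illustrative stand-in for our internal format:

```python
from collections import defaultdict

def render_drift_report(changes: list[dict]) -> str:
    """Group drifted attributes by resource type for a readable report.

    Each entry carries the resource address, the attribute that drifted,
    and the code vs. live values (field names illustrative).
    """
    by_type = defaultdict(list)
    for c in changes:
        rtype = c["address"].split(".")[0]
        by_type[rtype].append(c)
    lines = []
    for rtype in sorted(by_type):
        lines.append(f"{rtype}:")
        for c in by_type[rtype]:
            lines.append(f"  {c['address']}  {c['attribute']}: "
                         f"{c['code']!r} -> {c['live']!r}")
    return "\n".join(lines)

report = render_drift_report([
    {"address": "aws_security_group.web", "attribute": "ingress.0.cidr",
     "code": "10.0.0.0/8", "live": "0.0.0.0/0"},
])
print(report)
```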

Drift Aging and Priority

We also track the age of drift (changes that have persisted through multiple check cycles without resolution). A three-day-old drift on a dev database is tolerable; a 15-minute-old drift on a public-facing IAM policy is not.

With this aging model in place, we can escalate the alert appropriately without overwhelming teams during triage.
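An aging model reduces to an age budget per criticality tier: once a drift outlives its tier's budget, it escalates. The tier names and thresholds below are hypothetical, chosen to mirror the examples above:

```python
from datetime import timedelta

# Hypothetical escalation budgets: the more security-sensitive the
# module, the less drift age is tolerated before paging someone.
MAX_AGE = {
    "critical": timedelta(minutes=15),   # e.g., public-facing IAM policies
    "standard": timedelta(days=1),
    "dev":      timedelta(days=3),       # e.g., dev databases
}

def escalation_level(tier: str, drift_age: timedelta) -> str:
    """Return 'page' once drift outlives its tier's budget, else 'notify'."""
    return "page" if drift_age > MAX_AGE[tier] else "notify"

print(escalation_level("critical", timedelta(hours=1)))  # page
print(escalation_level("dev", timedelta(days=2)))        # notify
```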

Lessons from Production

Start With Critical Modules

We started by running drift detection only on modules associated with production environments and security-sensitive resources such as IAM, VPCs, and external-facing gateways. When drift occurs in these resources, the blast radius is the largest.

Because the rollout was phased, we built confidence while showing value early without slowing down delivery or introducing unnecessary friction.

Detect Before You Enforce

At first, we did not enforce hard gates for drift. The goal was awareness. Once teams had visibility, they were proactive in helping tidy up and codify manual changes.

Today, our most critical modules fail PR checks if known drift exists. Others report drift but do not block deployment. This allows temporary fixes without sacrificing long-term consistency.

[Figure: Drift frequency and resolution time by environment]


Conclusion

Infrastructure-as-Code tools such as Terraform are a solid base for managing cloud resources; without drift detection, however, they operate on blind trust. Real-world environments are never that tidy: manual edits, unmanaged exceptions, and hotfixes will always exist. The trick is to catch drift early, before it cascades into outages or regressions.

We combined Terraform, Git history, automated CI, and structured reporting to build a system that detects when infrastructure is not consistent with the code. It doesn't eradicate all risk but is primed for the feedback loop required to maintain consistency at scale.

For any organization running a multi-cloud environment across multiple teams, this approach is not optional. At its core, it is the difference between configuration as documentation and configuration as reality.


Opinions expressed by DZone contributors are their own.
