Infrastructure as Code: How Automation Evolved to Power AI Workloads

Learn about how Infrastructure as Code progressed in 2025 and how it helped automation, particularly for provisioning AI infrastructure.

By Vidyasagar (Sarath Chandra) Machupalli FBCS · DZone Core · Dec. 18, 25 · Analysis

If you've read my DZone articles this year, you'll have sensed that I love automation and that Infrastructure as Code (IaC) is my buddy for automating infrastructure provisioning. Recently, I started exploring the major shifts happening in the IaC landscape.

In my weekend reading over the last couple of months, I came across several exciting announcements from HashiConf 2025, Pulumi's new AI capabilities, and a revolutionary platform called Formae. In this article, let's look at how IaC progressed in 2025 and how it helped automation, particularly for provisioning AI infrastructure.

Infrastructure-as-Code has transformed how we manage cloud resources, yet 2025 brought innovations that fundamentally changed the game. From AI-powered agents that write and deploy infrastructure code to stateless platforms that eliminate drift detection complexity, this year marked a turning point in infrastructure automation.

The State of IaC: Where We Stand Today

Before diving into specific tools and announcements, let's understand the current landscape. According to the State of IaC 2025 report, cloud complexity has grown for 65% of organizations. Only 6% achieved full cloud codification, meaning most organizations still manage at least part of their infrastructure manually. Fewer than 33% continuously monitor drift; the rest take a reactive approach to infrastructure changes.

The report makes it clear that manual provisioning is legacy. Declarative configuration files are table stakes. The automation-first pipeline emerged as the gold standard, where infrastructure changes are treated the same way as code deployments: version-controlled, tested, reviewed, and automated.

HashiConf 2025: Major Announcements That Matter

September 2025 marked HashiConf's 10th anniversary in San Francisco. HashiCorp, now part of IBM, had several announcements that caught my attention.

Project Infragraph: Real-Time Infrastructure Intelligence

Project Infragraph represents a fundamental shift in infrastructure observability. Instead of piecing together data from multiple monitoring tools, teams get a unified view that understands relationships between resources.

Project Infragraph


Project Infragraph enables infrastructure that can observe its own state, reason about optimal configurations, and act autonomously. The private beta launches in December 2025.

Source: https://newsroom.ibm.com/2025-09-25-hashicorp-previews-the-future-of-agentic-infrastructure-automation-with-project-infragraph

Terraform Stacks: General Availability

After months in public beta, Terraform Stacks reached general availability with backward-compatible APIs. The concept addresses a pain point I've experienced countless times: coordinating deployments across different teams, each managing their own state files.

Stacks use a component-based architecture. Here's a simple example showing how you define reusable components:

Plain Text
 
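# stack.tfcomponent.hcl
# Variable declarations and provider configuration are omitted for brevity;
# a complete component also needs a `providers` argument.
component "vpc" {
  source = "./modules/vpc"
  inputs = {
    region      = var.region
    environment = var.environment
  }
}

component "eks_cluster" {
  source = "./modules/eks"
  inputs = {
    # Referencing the vpc component's output also tells Terraform
    # to deploy the VPC before the cluster.
    vpc_id = component.vpc.vpc_id
    region = var.region
  }
  depends_on = [component.vpc]
}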


What Changed in the GA Release? 

Component configuration files now use the .tfcomponent.hcl extension instead of .tfstack.hcl, providing a standardized naming convention (deployment configuration stays in .tfdeploy.hcl files). Deployment groups support new orchestration rules for better control over deployment order. Destroy operations work through code instead of UI-only, giving teams version-controlled teardown workflows. Most importantly, Terraform manages dependency resolution automatically, eliminating manual orchestration.

What used to require careful orchestration and multiple deployment windows now happens with one action. Terraform handles orchestration, dependency resolution, and change propagation automatically. This makes managing complex multi-component infrastructures significantly simpler.
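
To make "dependency resolution" concrete, here is a minimal Python sketch of the general idea: given which components depend on which, compute an order in which dependencies are always deployed first. This is not Terraform's implementation; the graph below is a hypothetical stand-in for the vpc and eks_cluster components shown earlier.

```python
from graphlib import TopologicalSorter

# Hypothetical component graph: each key maps to the set of
# components it depends on (mirrors the vpc -> eks_cluster example).
components = {
    "vpc": set(),
    "eks_cluster": {"vpc"},
}


def deployment_order(graph: dict[str, set[str]]) -> list[str]:
    """Return components in an order where dependencies always come first."""
    # TopologicalSorter raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(graph).static_order())


print(deployment_order(components))  # ['vpc', 'eks_cluster']
```

An orchestrator that knows this order can deploy every component in one pass and propagate a change in the VPC to everything downstream of it.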

MCP Servers: Bridging AI and Infrastructure

HashiConf introduced Model Context Protocol servers for Terraform, Vault, and Vault Radar. These MCP servers act as bridges between AI agents and existing infrastructure tools.

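Here's an illustrative sketch of what driving Terraform through MCP from natural language could look like. The mcp_client module and MCPClient class are hypothetical stand-ins for an AI agent connected to the Terraform MCP server, not a published library:

Python
 
# Illustrative only: `mcp_client` and `MCPClient` are hypothetical stand-ins
# for an AI agent wired to the Terraform MCP server.
from mcp_client import MCPClient

client = MCPClient("terraform")

# Natural language request
response = client.execute(
    "Trigger a workspace run for the production environment and notify the team on Slack when complete"
)

# The agent maps the request to tools exposed by the MCP server,
# which in turn call the Terraform API.
# No need to write complex API integration code
print(f"Workspace run initiated: {response.run_id}")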


You can now tell your AI assistant to trigger workspace runs, query secrets, or discover unmanaged resources without switching contexts or writing complex scripts. This dramatically reduces the friction in infrastructure operations.
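
Under the hood, MCP messages are JSON-RPC 2.0, and an agent invokes a server-side tool with a tools/call request. The sketch below only builds such a message so you can see its shape. The tool name create_run and its arguments are my own assumptions for illustration, not tools documented by the Terraform MCP server.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(request)


# "create_run" and "workspace" are hypothetical names used for illustration.
message = build_tool_call("create_run", {"workspace": "production"})
print(message)
```

The natural language part lives in the AI agent: it reads the tool descriptions the server advertises and decides which tool to call with which arguments.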

Additional Features Worth Noting

HashiConf also announced several other features that reached general availability. Terraform search helps teams discover and import resources in bulk more efficiently. Azure Copilot with Terraform integration simplifies adoption without requiring deep Terraform knowledge. Hold Your Own Key gives organizations ownership of encryption keys used to access sensitive data in HCP Terraform. HCP Waypoint provides application template catalogs, shielding developers from code-level infrastructure details.

Pulumi Neo: AI-Powered Infrastructure Agent

While Terraform continued its market dominance, Pulumi made serious waves with Neo, their AI infrastructure agent. After a long journey with Terraform, when HCL2 came out, I started exploring alternatives where I could use the programming language of my choice. That's when I found Pulumi.

Why Pulumi Matters

Pulumi is a modern Infrastructure as Code platform that enables developers to create, deploy, and manage cloud resources using familiar programming languages instead of domain-specific languages. Instead of learning HCL, you can use TypeScript, Python, Go, C#, Java, or even YAML. This means full IDE support with code completion, error checking, and refactoring capabilities that come naturally with general-purpose programming languages.

Here's a comparison of the same infrastructure in Terraform vs Pulumi. First, the Terraform approach:

Terraform:

Plain Text
 
resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-data-bucket"
  tags = {
    Environment = "Production"
  }
}


Pulumi:

Python
 
import pulumi
import pulumi_aws as aws

data_bucket = aws.s3.Bucket(
    "data-bucket",
    bucket="my-data-bucket",
    tags={"Environment": "Production"}
)

pulumi.export("bucket_name", data_bucket.id)


Neo: The AI Infrastructure Agent

Neo represents Pulumi's answer to the "velocity trap," where AI coding assistants make developers faster, but infrastructure teams can't keep up.

Neo request flow

Neo offers progressive autonomy. Development environments might permit fully autonomous operation, like daily waste cleanup and weekly drift reconciliation. Production changes may require human approval. When Neo encounters unexpected states or errors, it can self-diagnose or loop in a human for assistance. As confidence builds, the autonomy boundary expands.
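
The idea of an autonomy boundary can be sketched as a small decision function. This is not Neo's API or policy model; the environment names, the ProposedChange type, and the three outcomes are assumptions I made up to illustrate the concept.

```python
from dataclasses import dataclass

# Hypothetical policy: environments where unattended changes are allowed.
AUTONOMOUS_ENVIRONMENTS = {"dev", "sandbox"}


@dataclass(frozen=True)
class ProposedChange:
    environment: str
    description: str
    unexpected_state: bool = False  # the agent hit something it cannot explain


def decide(change: ProposedChange) -> str:
    """Return 'apply', 'needs_approval', or 'escalate' for a proposed change."""
    if change.unexpected_state:
        return "escalate"  # loop in a human regardless of environment
    if change.environment in AUTONOMOUS_ENVIRONMENTS:
        return "apply"  # inside the autonomy boundary
    return "needs_approval"  # e.g. production changes


print(decide(ProposedChange("dev", "delete idle volumes")))
print(decide(ProposedChange("production", "resize node pool")))
```

Expanding the boundary as confidence builds then amounts to adding environments (or change types) to the autonomous set.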

Formae: Rethinking IaC Fundamentals

In October 2025, Platform Engineering Labs launched Formae, challenging fundamental assumptions about how IaC should work. Let's learn about how it uses PKL and introduces a stateless approach.

The Problems Formae Solves

State file corruption and drift detection have plagued infrastructure teams forever. You know the scenario: someone makes a manual change in the console, your Terraform state drifts, and now you're spending hours reconciling reality with what your code thinks exists.

Traditional IaC tools require importing existing resources through a painful manual process, maintaining state files with corruption risk, detecting drift reactively, and reconciling manual changes in a time-consuming manner. Formae eliminates these issues by making reality itself the state.
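
To see what drift detection boils down to, here is a minimal Python sketch that compares what the code declares with what actually exists. It is a toy model under my own assumptions (resources as plain dictionaries keyed by a made-up identifier), not how Terraform or Formae represent state.

```python
def detect_drift(declared: dict[str, dict], actual: dict[str, dict]) -> dict[str, list[str]]:
    """Classify resources by comparing declared config with observed cloud state."""
    return {
        # In code, but not found in the cloud account
        "missing": sorted(declared.keys() - actual.keys()),
        # In the cloud account, but not in code (e.g. created in the console)
        "unmanaged": sorted(actual.keys() - declared.keys()),
        # Present in both, with different settings
        "changed": sorted(
            rid for rid in declared.keys() & actual.keys()
            if declared[rid] != actual[rid]
        ),
    }


# Hypothetical resource identifiers and attributes.
declared = {"bucket/logs": {"versioning": True}, "vpc/main": {"cidr": "10.0.0.0/16"}}
actual = {"bucket/logs": {"versioning": False}, "sg/manual": {"ingress": "0.0.0.0/0"}}

report = detect_drift(declared, actual)
print(report)
```

A state file is essentially a cached copy of the actual side of this comparison, which is why it can go stale or get corrupted.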

Metastructure: A New Concept

Formae introduces "Metastructure," which combines infrastructure configuration with operational logic. Traditional IaC uses static configuration and planned state files, requires manual imports, performs periodic drift detection, and only manages tool-specific resources. Formae's Metastructure combines configuration with operational logic, uses reality as state, provides automatic discovery, performs continuous synchronization, and discovers all resources regardless of creation method.

Here's a PKL configuration example:

Plain Text
 
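// infrastructure.pkl
// Illustrative sketch: the "pkl:" scheme is reserved for Pkl's standard library,
// so treat the import path and resource types below as placeholders.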
module infrastructure
import "pkl:aws"

vpc: aws.VPC {
  cidrBlock = "10.0.0.0/16"
  tags {
    ["Name"] = "production-vpc"
  }
}

instances: Listing<aws.EC2.Instance> {
  new {
    instanceType = "t3.large"
    vpcId = vpc.id
  }
}


Brownfield Environments

Formae excels in brownfield environments where existing infrastructure needs code management. For existing AWS resources, traditional approaches require importing each resource manually, while Formae runs formae extract --target aws. For resources created via the console, traditional tools raise drift alerts, whereas Formae automatically codifies and merges them. When several management tools are in use, traditional approaches run into import conflicts, whereas Formae co-exists with them. And where teams would otherwise face a steep learning curve with tool-specific import syntax, Formae's discovery is automatic and needs no imports.

Example workflow:

Shell
 
# Discover existing infrastructure
formae extract --target aws --output current-infrastructure.pkl

# Review and modify
vim current-infrastructure.pkl

# Apply changes
formae apply current-infrastructure.pkl


Formae automatically discovers and codifies existing infrastructure, eliminating painful import processes.

AI Infrastructure Provisioning: The Driving Force

Much of the IaC innovation in 2025 came from one driving force: provisioning and managing AI infrastructure at scale. Training frontier AI models requires coordination that makes traditional deployments look simple.

The AI Infrastructure Challenge

Let's understand the complexity through a diagram showing the full AI training infrastructure stack:

AI infrastructure provisioning

You're dealing with petabytes of data preparation across thousands of CPU cores, and with massive GPU clusters running hot for months. Checkpoint management matters because losing hours of training to an improperly saved state costs real money. Gradients have to be synchronized across hundreds of GPUs, and scheduling has to be fault tolerant because hardware failures become statistical certainties rather than edge cases.

Traditional applications use CPU-based compute running for minutes to hours, dealing with gigabytes of data, using standard retry logic for fault tolerance, having predictable costs, and scaling horizontally. AI training requires GPU clusters with hundreds of GPUs, runs for days to months, handles petabytes of data, needs checkpoint-based recovery with resumption on failure, has extremely high costs requiring constant optimization, and uses specialized distributed training approaches.
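
Checkpoint-based recovery is easy to describe and easy to get wrong, so here is a minimal Python sketch of the pattern: save progress periodically with an atomic write, and resume from the last checkpoint after a failure. It is a toy loop under my own assumptions (a JSON file holding a step counter); real training jobs checkpoint model and optimizer state through their framework.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename it over the target.

    A crash mid-write therefore never leaves a half-written checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path: str, total_steps: int, every: int, fail_at: Optional[int] = None) -> int:
    """Toy loop: resume from the checkpoint and save every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            raise RuntimeError(f"simulated hardware failure at step {step}")
        state["step"] = step  # a real job would update model weights here
        if step % every == 0:
            save_checkpoint(path, state)
    return state["step"]
```

With a checkpoint every 10 steps, a failure at step 57 costs only the work since step 50. The checkpoint interval is the knob that trades storage and I/O against lost GPU hours.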

If you are new to the processing units world, learn about CPUs vs GPUs vs TPUs in this article.

Here's provisioning a GPU training environment with Pulumi:

Python
 
import pulumi
import pulumi_gcp as gcp

cluster = gcp.container.Cluster(
    "ai-training-cluster",
    initial_node_count=1,
    remove_default_node_pool=True,
    location="us-central1-a"
)

gpu_node_pool = gcp.container.NodePool(
    "a100-node-pool",
    cluster=cluster.name,
    location=cluster.location,
    node_count=10,
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="a2-highgpu-8g",
        guest_accelerators=[
            gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
                type="nvidia-tesla-a100",
                count=8
            )
        ]
    )
)

pulumi.export("cluster_name", cluster.name)


Platform Engineering: The Abstraction Layer

Platform engineering emerged as a discipline providing self-service infrastructure catalogs. Instead of learning Terraform or Pulumi directly, developers select pre-built templates for common use cases, customize a few parameters, and get infrastructure that meets organizational standards.

The platform engineering stack consists of multiple layers working together. The self-service portal layer uses tools like Backstage, Port, and Humanitec to provide the developer interface. The IaC templates layer leverages Terraform modules and Pulumi components as reusable infrastructure patterns. Policy enforcement happens through OPA, Sentinel, or CrossGuard for governance and compliance. Deployment automation uses ArgoCD, Flux, or HCP Waypoint for GitOps workflows. Cost management relies on tools like Cloudability and Kubecost for spending visibility.

Here's a self-service database template:

Python
 
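import pulumi
import pulumi_aws as aws

class DatabaseService(pulumi.ComponentResource):
    """Opinionated PostgreSQL database with organizational defaults baked in."""

    def __init__(self, name, args, opts=None):
        super().__init__('custom:database:Service', name, None, opts)

        is_production = args.get("environment") == "production"
        self.db = aws.rds.Instance(
            f"{name}-db",
            engine="postgres",
            db_name=args.get("database_name"),
            instance_class=args.get("instance_class", "db.t3.medium"),
            allocated_storage=args.get("storage_gb", 100),
            username="dbadmin",                # assumed default admin user
            manage_master_user_password=True,  # RDS keeps the password in Secrets Manager
            storage_encrypted=True,
            multi_az=is_production,
            skip_final_snapshot=not is_production,
            tags={"team": args.get("team", "unknown")},
            opts=pulumi.ResourceOptions(parent=self)
        )
        # Expose the endpoint and mark the component as fully constructed
        self.endpoint = self.db.endpoint
        self.register_outputs({"endpoint": self.endpoint})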


Developers use this without understanding RDS details:

Python
 
user_database = DatabaseService(
    "user-service-db",
    args={
        "database_name": "users",
        "team": "backend-team",
        "environment": "production"
    }
)
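
A self-service catalog usually validates requests before any infrastructure code runs. Here is a minimal sketch of such a guardrail for the database template above. The allowed values and limits are hypothetical organizational standards I invented for illustration.

```python
# Hypothetical organizational standards.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "production"}
ALLOWED_INSTANCE_CLASSES = {"db.t3.medium", "db.r6g.large"}
MIN_STORAGE_GB, MAX_STORAGE_GB = 20, 1000


def validate_database_request(args: dict) -> list[str]:
    """Return human-readable violations; an empty list means the request passes."""
    errors = []
    if args.get("environment") not in ALLOWED_ENVIRONMENTS:
        errors.append(
            "environment must be one of: " + ", ".join(sorted(ALLOWED_ENVIRONMENTS))
        )
    if args.get("instance_class", "db.t3.medium") not in ALLOWED_INSTANCE_CLASSES:
        errors.append("instance_class is not in the approved list")
    if not MIN_STORAGE_GB <= args.get("storage_gb", 100) <= MAX_STORAGE_GB:
        errors.append(
            f"storage_gb must be between {MIN_STORAGE_GB} and {MAX_STORAGE_GB}"
        )
    if not args.get("team"):
        errors.append("team is required for cost attribution")
    return errors


request = {"database_name": "users", "team": "backend-team", "environment": "production"}
print(validate_database_request(request))  # []
```

Returning every violation at once, rather than failing on the first, gives developers a single round trip to fix their request.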

Security and Compliance in the AI Era

With AI tools generating more infrastructure code than ever, security validation has become critical. Google has reported that more than a quarter of its new code is AI-generated, which makes automated security validation non-negotiable.

Essential security tools include Checkov for static analysis of misconfigurations, tfsec for Terraform-specific security scanning, Terrascan for policy-as-code security, OPA for runtime policy enforcement, and Sentinel as HashiCorp's policy framework. These tools integrate at different points: Checkov and tfsec run in pre-commit hooks and CI/CD pipelines, OPA validates at runtime, and Sentinel enforces policies within HCP Terraform.

Here's an example of how you might configure Checkov for your infrastructure repository:

YAML
 
# .checkov.yml
branch: main
download-external-modules: true
framework:
  - terraform
  - terraform_plan
  - cloudformation
soft-fail: false
check:
  - CKV_AWS_20  # S3 bucket ACL must not allow public read
  - CKV_AWS_21  # S3 bucket versioning
  - CKV_AWS_19  # S3 bucket encryption at rest
  - CKV_AWS_18  # S3 bucket access logging
  - CKV_AWS_145 # S3 bucket KMS encryption
  - CKV2_AWS_6  # S3 bucket public access block
skip-check:
  - CKV_AWS_23  # Security group rule descriptions
output: cli
quiet: false
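
To show what such scanners do conceptually, here is a minimal Python sketch of a custom check over a Terraform plan. It assumes the input is the JSON produced by terraform show -json for a saved plan, which lists planned changes under resource_changes. The sample plan is hand-written and heavily trimmed, and the check itself is my own example, not a Checkov rule.

```python
import json


def unencrypted_databases(plan_json: str) -> list[str]:
    """Return addresses of aws_db_instance resources planned without encryption."""
    plan = json.loads(plan_json)
    offenders = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_db_instance":
            continue
        after = rc.get("change", {}).get("after")
        if after is None:  # the resource is being destroyed
            continue
        if not after.get("storage_encrypted", False):
            offenders.append(rc["address"])
    return offenders


# Hand-written, trimmed sample in the shape of `terraform show -json` output.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.users", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"storage_encrypted": True}}},
        {"address": "aws_db_instance.legacy", "type": "aws_db_instance",
         "change": {"actions": ["create"], "after": {"storage_encrypted": False}}},
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"actions": ["create"], "after": {}}},
    ]
})

print(unencrypted_databases(sample_plan))  # ['aws_db_instance.legacy']
```

In a CI pipeline, a non-empty result would fail the build before the plan is ever applied.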


Comprehensive IaC Platform Comparison

OpenTofu, the open-source fork created after HashiCorp's license change, continued gaining traction in 2025 under the Linux Foundation. Organizations appreciated having a community-driven alternative without vendor lock-in concerns.

Terraform uses the Business Source License (BSL), which is source-available rather than open source, while OpenTofu uses the open-source Mozilla Public License 2.0. Governance differs significantly: Terraform is controlled by HashiCorp, which is now owned by IBM, while OpenTofu operates under the Linux Foundation with community governance. Terraform includes proprietary additions in its feature set, while OpenTofu maintains community-driven development. Enterprise support for Terraform comes through HCP Terraform, while OpenTofu relies on third-party vendors. Both maintain robust provider ecosystems, though Terraform's is officially backed by HashiCorp while OpenTofu's is community-maintained.

Terragrunt also announced its own Stacks feature reaching GA in May 2025, providing orchestration capabilities for teams in the OpenTofu ecosystem. Gruntwork built Terragrunt Stacks through extensive community engagement, with the RFC gathering dozens of positive reactions and hundreds of comments from participants.

After exploring the major developments in 2025, here's a comprehensive comparison of leading IaC platforms:

| Feature | Terraform | OpenTofu | Pulumi | Formae | CloudFormation | Crossplane |
|---|---|---|---|---|---|---|
| Language | HCL (proprietary) | HCL (open) | TypeScript, Python, Go, C#, Java | PKL | YAML/JSON | YAML (CRDs) |
| License | BSL (proprietary) | MPL 2.0 (open) | Apache 2.0 | FSL → Apache 2.0 | Proprietary (AWS) | Apache 2.0 |
| State Management | Local/remote files | Local/remote files | SaaS backend | Stateless (reality = state) | AWS-managed | Kubernetes etcd |
| Resource Discovery | Manual import | Manual import | Manual import | Automatic | Manual import | Kubernetes-native |
| Drift Detection | Periodic checks | Periodic checks | Periodic checks | Continuous sync | AWS-only | Controller-based |
| Testing Support | Limited | Limited | Native unit/integration | Built-in | Limited | Kubernetes tests |
| IDE Support | Basic | Basic | Full (LSP, completion) | PKL tooling | Basic | YAML validation |
| Learning Curve | Learn HCL DSL | Learn HCL DSL | Use existing language | Learn PKL | Learn CFN syntax | Learn K8s + CRDs |
| Multi-Cloud | Excellent | Excellent | Excellent | Growing | AWS only | Good |
| Provider Ecosystem | 3000+ providers | Community-maintained | 150+ packages | Early stage | AWS services | 80+ providers |
| AI Capabilities | MCP servers | None | Neo agent | None | None | None |
| Governance | HashiCorp/IBM | Linux Foundation | Pulumi Corp | Platform Eng Labs | AWS | CNCF |
| Best For | Mature ecosystems | Open governance | Developer teams | Brownfield envs | AWS-only shops | K8s-centric orgs |
| Brownfield Support | Manual process | Manual process | Manual process | Excellent | Manual process | K8s resources only |
| Enterprise Features | HCP Terraform | Third-party | Pulumi Cloud | Coming | AWS Orgs | Enterprise distros |
| Cost | Free/Enterprise | Free | Free/Team/Enterprise | Open source | Free (AWS costs) | Free |
| Deployment Speed | Moderate | Moderate | Fast | Fast | Moderate | Moderate |
| Community Size | Very large | Growing | Medium | Small | AWS community | Growing |


The Challenges That Remain

Despite progress, significant challenges persist. Only 6% of organizations have achieved full codification. Configuration drift continues to plague teams, and fewer than a third monitor it continuously. Cloud complexity has grown for 65% of organizations. The human element remains crucial for defining policies, setting guardrails, and making architectural decisions.

Conclusion

The infrastructure community stands at an inflection point. Manual provisioning is legacy. The next frontier involves infrastructure that can observe its own state, reason about optimal configurations, and act autonomously.

Project Infragraph represents this future. AI agents will reason about infrastructure state and act across the application lifecycle. These agents won't replace infrastructure engineers, but they'll handle repetitive tasks that currently burn out teams.

As we close out 2025, one thing seems certain: infrastructure automation will only accelerate. The organizations that embrace these tools, invest in platform engineering, and leverage AI while maintaining proper guardrails will move faster than competitors still manually clicking through cloud consoles.

The infrastructure has become code. Now the code is becoming intelligent.
