Infrastructure as Code: How Automation Evolved to Power AI Workloads

Learn about how Infrastructure as Code progressed in 2025 and how it helped automation, particularly for provisioning AI infrastructure.

Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE ·

Dec. 18, 25 · Analysis

Likes (3)

Comment

Save

2.0K Views

If you read my articles published on DZone this year, you would have sensed that I love automation and that Infrastructure as Code (IaC) is my buddy for automating infrastructure provisioning. Recently, I started exploring and learning about the major shifts happening in the IaC landscape.

As part of my weekend readings in the last couple of months, I came across several exciting announcements from HashiConf 2025, Pulumi's new AI capabilities, and a revolutionary platform called Formae. In this article, let's learn about how IaC progressed in 2025 and how it helped automation, particularly for provisioning AI infrastructure.

Infrastructure-as-Code has transformed how we manage cloud resources, yet 2025 brought innovations that fundamentally changed the game. From AI-powered agents that write and deploy infrastructure code to stateless platforms that eliminate drift detection complexity, this year marked a turning point in infrastructure automation.

The State of IaC: Where We Stand Today

Before diving into specific tools and announcements, let's understand the current landscape. According to the State of IaC 2025 report, cloud complexity has grown for 65% of organizations. Only 6% achieved full cloud codification, meaning most infrastructure is still managed manually. Less than 33% continuously monitor drift, taking a reactive approach to infrastructure changes.

The report makes it clear that manual provisioning is legacy. Declarative configuration files are table stakes. The automation-first pipeline emerged as the gold standard, where infrastructure changes are treated the same way as code deployments: version-controlled, tested, reviewed, and automated.

HashiConf 2025: Major Announcements That Matter

September 2025 marked HashiConf's 10th anniversary in San Francisco. HashiCorp, now part of IBM, had several announcements that caught my attention.

Project Infragraph: Real-Time Infrastructure Intelligence

Project Infragraph represents a fundamental shift in infrastructure observability. Instead of piecing together data from multiple monitoring tools, teams get a unified view that understands relationships between resources.

Project Infragraph

Project Infragraph enables infrastructure that can observe its own state, reason about optimal configurations, and act autonomously. The private beta launches in December 2025.

Source: https://newsroom.ibm.com/2025-09-25-hashicorp-previews-the-future-of-agentic-infrastructure-automation-with-project-infragraph

Terraform Stacks: General Availability

After months in public beta, Terraform Stacks reached general availability with backward-compatible APIs. The concept addresses a pain point I've experienced countless times: coordinating deployments across different teams, each managing their own state files.

Stacks use a component-based architecture. Here's a simple example showing how you define reusable components:

    Plain Text
   
 

   # stack.tfcomponent.hcl
component "vpc" {
  source = "./modules/vpc"
  inputs = {
    region = var.region
    environment = var.environment
  }
}

component "eks_cluster" {
  source = "./modules/eks"
  inputs = {
    vpc_id = component.vpc.vpc_id
    region = var.region
  }
  depends_on = [component.vpc]
}
  

What Changed in the GA Release?

All configuration files now use the .tfcomponent.hcl extension instead of .tfstack.hcl, providing a standardized naming convention. Deployment groups support new orchestration rules for better control over deployment order. Destroy operations work through code instead of UI-only, giving teams version-controlled teardown workflows. Most importantly, Terraform manages dependency resolution automatically, eliminating manual orchestration.

What used to require careful orchestration and multiple deployment windows now happens with one action. Terraform handles orchestration, dependency resolution, and change propagation automatically. This makes managing complex multi-component infrastructures significantly simpler.

MCP Servers: Bridging AI and Infrastructure

HashiConf introduced Model Context Protocol servers for Terraform, Vault, and Vault Radar. These MCP servers act as bridges between AI agents and existing infrastructure tools.

Here's a simple example of how you might use MCP to interact with Terraform through natural language:

    Python
   
 

   # Example: Using MCP to trigger Terraform workspace runs
from mcp_client import MCPClient

client = MCPClient("terraform")

# Natural language request
response = client.execute(
    "Trigger a workspace run for the production environment and notify the team on Slack when complete"
)

# MCP translates this to Terraform API calls
# No need to write complex API integration code
print(f"Workspace run initiated: {response.run_id}")
  

You can now tell your AI assistant to trigger workspace runs, query secrets, or discover unmanaged resources without switching contexts or writing complex scripts. This dramatically reduces the friction in infrastructure operations.

Additional Features Worth Noting

HashiConf also announced several other features that reached general availability. Terraform search helps teams discover and import resources in bulk more efficiently. Azure Copilot with Terraform integration simplifies adoption without requiring deep Terraform knowledge. Hold Your Own Key gives organizations ownership of encryption keys used to access sensitive data in HCP Terraform. HCP Waypoint provides application template catalogs, shielding developers from code-level infrastructure details.

Pulumi Neo: AI-Powered Infrastructure Agent

While Terraform continued its market dominance, Pulumi made serious waves with Neo, their AI infrastructure agent. After a long journey with Terraform, when HCL2 came out, I started exploring alternatives where I could use the programming language of my choice. That's when I found Pulumi.

Why Pulumi Matters

Pulumi is a modern Infrastructure as Code platform that enables developers to create, deploy, and manage cloud resources using familiar programming languages instead of domain-specific languages. Instead of learning HCL, you can use TypeScript, Python, Go, C#, Java, or even YAML. This means full IDE support with code completion, error checking, and refactoring capabilities that come naturally with general-purpose programming languages.

Here's a comparison of the same infrastructure in Terraform vs Pulumi. First, the Terraform approach:

Terraform:

    Plain Text
   
 

   resource "aws_s3_bucket" "data_bucket" {
  bucket = "my-data-bucket"
  tags = {
    Environment = "Production"
  }
}
  

Pulumi:

    Python
   
   import pulumi_aws as aws

data_bucket = aws.s3.Bucket(
    "data-bucket",
    bucket="my-data-bucket",
    tags={"Environment": "Production"}
)

pulumi.export("bucket_name", data_bucket.id)

Neo: The AI Infrastructure Agent

Neo represents Pulumi's answer to the "velocity trap," where AI coding assistants make developers faster, but infrastructure teams can't keep up.

Neo request flow

Neo offers progressive autonomy. Development environments might permit fully autonomous operation, like daily waste cleanup and weekly drift reconciliation. Production changes may require human approval. When Neo encounters unexpected states or errors, it can self-diagnose or loop in a human for assistance. As confidence builds, the autonomy boundary expands.

Formae: Rethinking IaC Fundamentals

In October 2025, Platform Engineering Labs launched Formae, challenging fundamental assumptions about how IaC should work. Let's learn about how it uses PKL and introduces a stateless approach.

The Problems Formae Solves

State file corruption and drift detection have plagued infrastructure teams forever. You know the scenario: someone makes a manual change in the console, your Terraform state drifts, and now you're spending hours reconciling reality with what your code thinks exists.

Traditional IaC tools require importing existing resources through a painful manual process, maintaining state files with corruption risk, detecting drift reactively, and reconciling manual changes in a time-consuming manner. Formae eliminates these issues by making reality itself the state.

Metastructure: A New Concept

Formae introduces "Metastructure," which combines infrastructure configuration with operational logic. Traditional IaC uses static configuration and planned state files, requires manual imports, performs periodic drift detection, and only manages tool-specific resources. Formae's Metastructure combines configuration with operational logic, uses reality as state, provides automatic discovery, performs continuous synchronization, and discovers all resources regardless of creation method.

Here's a PKL configuration example:

    Plain Text
   
 

   // infrastructure.pkl
module infrastructure
import "pkl:aws"

vpc: aws.VPC {
  cidrBlock = "10.0.0.0/16"
  tags {
    ["Name"] = "production-vpc"
  }
}

instances: Listing<aws.EC2.Instance> {
  new {
    instanceType = "t3.large"
    vpcId = vpc.id
  }
}
  

Brownfield Environments

Formae excels in brownfield environments where existing infrastructure needs code management. For existing AWS resources, traditional approaches require manually importing each resource, while Formae simply runs formae extract --target aws. For resources created via the console, traditional tools show drift detection alerts, but Formae automatically codifies and merges them. When using multiple management tools, traditional approaches face import conflicts, but Formae co-exists with all tools. The team learning curve is steep with tool-specific import syntax, but Formae makes it automatic with no imports needed.

Example workflow:

    Shell
   
   # Discover existing infrastructure
formae extract --target aws --output current-infrastructure.pkl

# Review and modify
vim current-infrastructure.pkl

# Apply changes
formae apply current-infrastructure.pkl

Formae automatically discovers and codifies existing infrastructure, eliminating painful import processes.

AI Infrastructure Provisioning: The Driving Force

Much of the IaC innovation in 2025 came from one driving force: provisioning and managing AI infrastructure at scale. Training frontier AI models requires coordination that makes traditional deployments look simple.

The AI Infrastructure Challenge

Let's understand the complexity through a diagram showing the full AI training infrastructure stack:

AI infrastructure provisioning

You're dealing with petabytes of data preparation across thousands of CPU cores. Massive GPU clusters running hot for months. Checkpoint management, where you lose hours of training because you didn't save state properly, costs real money. Gradient synchronization across hundreds of GPUs. Fault-tolerant scheduling where hardware failures become statistical certainties rather than edge cases.

Traditional applications use CPU-based compute running for minutes to hours, dealing with gigabytes of data, using standard retry logic for fault tolerance, having predictable costs, and scaling horizontally. AI training requires GPU clusters with hundreds of GPUs, runs for days to months, handles petabytes of data, needs checkpoint-based recovery with resumption on failure, has extremely high costs requiring constant optimization, and uses specialized distributed training approaches.

If you are new to the processing units world, learn about CPUs vs GPUs vs TPUs in this article.

Here's provisioning a GPU training environment with Pulumi:

    Python
   
 

   import pulumi_gcp as gcp

cluster = gcp.container.Cluster(
    "ai-training-cluster",
    initial_node_count=1,
    remove_default_node_pool=True,
    location="us-central1-a"
)

gpu_node_pool = gcp.container.NodePool(
    "a100-node-pool",
    cluster=cluster.name,
    location=cluster.location,
    node_count=10,
    node_config=gcp.container.NodePoolNodeConfigArgs(
        machine_type="a2-highgpu-8g",
        guest_accelerators=[
            gcp.container.NodePoolNodeConfigGuestAcceleratorArgs(
                type="nvidia-tesla-a100",
                count=8
            )
        ]
    )
)

pulumi.export("cluster_name", cluster.name)
  

Platform Engineering: The Abstraction Layer

Platform engineering emerged as a discipline providing self-service infrastructure catalogs. Instead of learning Terraform or Pulumi directly, developers select pre-built templates for common use cases, customize a few parameters, and get infrastructure that meets organizational standards.

The platform engineering stack consists of multiple layers working together. The self-service portal layer uses tools like Backstage, Port, and Humanitec to provide the developer interface. The IaC templates layer leverages Terraform modules and Pulumi components as reusable infrastructure patterns. Policy enforcement happens through OPA, Sentinel, or CrossGuard for governance and compliance. Deployment automation uses ArgoCD, Flux, or HCP Waypoint for GitOps workflows. Cost management relies on tools like Cloudability and Kubecost for spending visibility.

Here's a self-service database template:

    Python
   
 

   import pulumi_aws as aws

class DatabaseService(pulumi.ComponentResource):
    def __init__(self, name, args, opts=None):
        super().__init__('custom:database:Service', name, None, opts)
        
        self.db = aws.rds.Instance(
            f"{name}-db",
            engine="postgres",
            instance_class=args.get("instance_class", "db.t3.medium"),
            allocated_storage=args.get("storage_gb", 100),
            storage_encrypted=True,
            multi_az=args.get("environment") == "production",
            opts=pulumi.ResourceOptions(parent=self)
        )
  

Developers use this without understanding RDS details:

    Python
   
 

   user_database = DatabaseService(
    "user-service-db",
    args={
        "database_name": "users",
        "team": "backend-team",
        "environment": "production"
    }
)
  

Security and Compliance in the AI Era

With AI tools generating more infrastructure code than ever, security validation has become critical. Google reports that 25% of its new code comes from AI, making automated security validation non-negotiable.

Essential security tools include Checkov for static analysis of misconfigurations, tfsec for Terraform-specific security scanning, Terrascan for policy-as-code security, OPA for runtime policy enforcement, and Sentinel as HashiCorp's policy framework. These tools integrate at different points: Checkov and tfsec run in pre-commit hooks and CI/CD pipelines, OPA validates at runtime, and Sentinel enforces policies within HCP Terraform.

Here's an example of how you might configure Checkov for your infrastructure repository:

    YAML
   
 

   # .checkov.yml
branch: main
download-external-modules: true
framework:
  - terraform
  - terraform_plan
  - cloudformation
soft-fail: false
check:
  - CKV_AWS_20  # S3 bucket encryption
  - CKV_AWS_21  # S3 bucket versioning
  - CKV_AWS_19  # S3 bucket logging
  - CKV_AWS_18  # S3 bucket access logging
  - CKV_AWS_145 # S3 bucket KMS encryption
  - CKV2_AWS_6  # S3 bucket public access block
skip-check:
  - CKV_AWS_23  # Skip unencrypted S3 for public static assets
output: cli
quiet: false
  

Comprehensive IaC Platform Comparison

OpenTofu, the open-source fork created after HashiCorp's license change, continued gaining traction in 2025 under the Linux Foundation. Organizations appreciated having a community-driven alternative without vendor lock-in concerns.

Terraform uses the Business Source License, which is proprietary, while OpenTofu uses the Mozilla Public License 2.0. Governance differs significantly: Terraform is controlled by HashiCorp, which is now owned by IBM, while OpenTofu operates under the Linux Foundation with community governance. Terraform includes proprietary additions in its feature set, while OpenTofu maintains community-driven development. Enterprise support for Terraform comes through HCP Terraform, while OpenTofu relies on third-party vendors. Both maintain robust provider ecosystems, though Terraform's is officially backed by HashiCorp while OpenTofu's is community-maintained.

Terragrunt also announced its own Stacks feature reaching GA in May 2025, providing orchestration capabilities for teams in the OpenTofu ecosystem. Gruntwork built Terragrunt Stacks through extensive community engagement, with the RFC gathering dozens of positive reactions and hundreds of comments from participants.

After exploring the major developments in 2025, here's a comprehensive comparison of leading IaC platforms:

Feature	Terraform	OpenTofu	Pulumi	Formae	CloudFormation	Crossplane
Language	HCL (proprietary)	HCL (open)	TypeScript, Python, Go, C#, Java	PKL	YAML/JSON	YAML (CRDs)
License	BSL (proprietary)	MPL 2.0 (open)	Apache 2.0	FSL → Apache 2.0	Proprietary (AWS)	Apache 2.0
State Management	Local/remote files	Local/remote files	SaaS backend	Stateless (reality = state)	AWS-managed	Kubernetes etcd
Resource Discovery	Manual import	Manual import	Manual import	Automatic	Manual import	Kubernetes-native
Drift Detection	Periodic checks	Periodic checks	Periodic checks	Continuous sync	AWS-only	Controller-based
Testing Support	Limited	Limited	Native unit/integration	Built-in	Limited	Kubernetes tests
IDE Support	Basic	Basic	Full (LSP, completion)	PKL tooling	Basic	YAML validation
Learning Curve	Learn HCL DSL	Learn HCL DSL	Use existing language	Learn PKL	Learn CFN syntax	Learn K8s + CRDs
Multi-Cloud	Excellent	Excellent	Excellent	Growing	AWS only	Good
Provider Ecosystem	3000+ providers	Community-maintained	150+ packages	Early stage	AWS services	80+ providers
AI Capabilities	MCP servers	None	Neo agent	None	None	None
Governance	HashiCorp/IBM	Linux Foundation	Pulumi Corp	Platform Eng Labs	AWS	CNCF
Best For	Mature ecosystems	Open governance	Developer teams	Brownfield envs	AWS-only shops	K8s-centric orgs
Brownfield Support	Manual process	Manual process	Manual process	Excellent	Manual process	K8s resources only
Enterprise Features	HCP Terraform	Third-party	Pulumi Cloud	Coming	AWS Orgs	Enterprise distros
Cost	Free/Enterprise	Free	Free/Team/Enterprise	Open source	Free (AWS costs)	Free
Deployment Speed	Moderate	Moderate	Fast	Fast	Moderate	Moderate
Community Size	Very large	Growing	Medium	Small	AWS community	Growing

The Challenges That Remain

Despite progress, significant challenges persist. Only 6% achieved full codification. Configuration drift continues to plague teams. Multi-cloud complexity affects 65% of organizations. The human element remains crucial for defining policies, setting guardrails, and making architectural decisions.

Conclusion

The infrastructure community stands at an inflection point. Manual provisioning is legacy. The next frontier involves infrastructure that can observe its own state, reason about optimal configurations, and act autonomously.

Project Infragraph represents this future. AI agents will reason about infrastructure state and act across the application lifecycle. These agents won't replace infrastructure engineers, but they'll handle repetitive tasks that currently burn out teams.

As we close out 2025, one thing seems certain: infrastructure automation will only accelerate. The organizations that embrace these tools, invest in platform engineering, and leverage AI while maintaining proper guardrails will move faster than competitors still manually clicking through cloud consoles.

The infrastructure has become code. Now the code is becoming intelligent.

AI Terraform (software) Infrastructure as code

Opinions expressed by DZone contributors are their own.

Related

Trending