The Platform or the Pile: How GitOps and Developer Platforms Are Settling the Infrastructure Debt Reckoning
By a technology correspondent who has spent the better part of a decade watching engineering teams drown in YAML they wrote themselves.
There is a specific kind of organizational dysfunction that doesn't show up in sprint velocity metrics or deployment frequency dashboards. It lives in Slack threads where a senior engineer is, for the third time this week, helping a product team figure out why their staging environment behaves differently from production. It lives in the postmortem where someone admits, with genuine embarrassment, that a misconfigured resource limit brought down a service because the relevant YAML file was copied from a two-year-old deployment that nobody remembers creating. It lives in the quiet calculation a platform team lead makes when she realizes her team of six is fielding forty tickets a week, almost none of which required human judgment, and almost all of which could have been prevented by infrastructure that didn't exist yet.
The response to this dysfunction has a name now, though it took the industry a while to agree on one: platform engineering, the practice of building deliberate, opinionated abstractions between developers and the underlying complexity of modern infrastructure. And in 2025, it stopped being a trend and started being a reckoning.
The Spreadsheet That Broke a Release Cycle
One conversation I keep returning to took place in October 2024, with a site reliability engineer at a German industrial software company. His team had inherited a Kubernetes environment that had grown organically across three years and two acquisitions. By the time he arrived, they had over four thousand cluster-specific configuration files spread across eleven repositories, maintained by roughly thirty teams who had each developed their own conventions for structuring them.
Nobody had planned this. It had accreted, the way technical debt always does — one reasonable decision at a time, in the absence of a shared standard. A team needed a slightly different ingress rule. Another needed non-default resource limits for a memory-intensive service. A third had a custom network policy that predated the company's security baseline. Multiply this across thirty teams over three years and you get a configuration landscape that no single person fully understands.
The release that broke him wasn't dramatic. A routine Kubernetes version upgrade that should have taken a long weekend consumed six weeks, because the team couldn't confidently predict which of those four thousand files would conflict with the new API versions and which wouldn't. They needed to test everything. They had no automated way to do it. They did it manually.
He told me, with the flat affect of someone who has processed the experience thoroughly: "We weren't doing infrastructure. We were doing archaeology."
What GitOps Actually Solves — and What People Get Wrong About It
GitOps is one of those terms that has been repeated enough times in conference talks that it has acquired a kind of rhetorical inevitability. Everyone agrees it's the right approach. Fewer people can articulate precisely why, or why it keeps failing to deliver on its promise in practice.
The core idea is genuinely simple and genuinely powerful: Git is your system of record for infrastructure state. Tools like Argo CD or Flux run continuously inside your clusters, comparing what's deployed with what's in the repository, and reconciling any differences. A change to infrastructure is a pull request. A rollback is a revert. An audit trail is just the commit history.
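To make the reconciliation idea concrete, here is a minimal sketch of the compare-and-converge loop. It is not how Argo CD or Flux are implemented: the `diff` and `reconcile` functions and the in-memory dictionaries standing in for "what's in Git" and "what's in the cluster" are hypothetical, chosen only to show the shape of the logic.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins: resource name -> configuration.
# "desired" plays the role of the Git repository, "live" the cluster.
State = Dict[str, dict]


def diff(desired: State, live: State) -> List[Tuple[str, str]]:
    """Return the (action, name) pairs needed to make live match desired."""
    actions: List[Tuple[str, str]] = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: State, live: State) -> List[Tuple[str, str]]:
    """Apply the diff to live in place and report what was changed."""
    actions = diff(desired, live)
    for action, name in actions:
        if action == "delete":
            del live[name]
        else:
            # Copy so later edits to desired do not silently alter live.
            live[name] = dict(desired[name])
    return actions


desired = {"web": {"replicas": 2}, "worker": {"replicas": 1}}
live = {"web": {"replicas": 3}, "legacy-job": {"replicas": 1}}

first_pass = reconcile(desired, live)   # corrects drift
second_pass = reconcile(desired, live)  # nothing left to do
```

The second pass returning no actions is the property that matters: reconciliation is idempotent, so running it continuously is safe, and a rollback is nothing more than pointing `desired` at an earlier commit.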
The benefits are real. I've talked to enough engineering organizations that have made this transition to be confident that they're not imaginary. Drift — the quiet divergence between what you think is deployed and what's actually deployed — is dramatically reduced. Incident response gets faster because rollbacks are mechanical rather than procedural. Security teams can audit changes without asking engineers to reconstruct what happened from memory.
But here's what the GitOps advocates tend to understate: Git as a source of truth for infrastructure only works if the things committed to Git are trustworthy representations of intent. If thirty teams are each committing their own raw Kubernetes YAML, with their own conventions, their own interpretations of what a "standard" deployment looks like, you haven't solved the configuration sprawl problem. You've just moved it into version control. You have a very auditable pile.
The insight that platform engineering adds to GitOps is the layer that was always implied but rarely explicit: someone has to own what goes into Git. Not the individual teams, working independently with their own preferences and their own copy-paste histories. A platform abstraction, curated by people whose job is to encode organizational best practices into templates that generate correct configuration, rather than trusting that correct configuration will emerge organically from thirty autonomous teams.
The Compiler Metaphor That Actually Lands
The frame I've found most useful — borrowed from a conversation with a platform architect in Amsterdam who worked on Humanitec's orchestration model — is the compiler.
When a developer writes application code, they don't write machine instructions. They write in a high-level language, and a compiler translates their intent into the machine instructions required to execute it. The developer doesn't need to understand register allocation or instruction pipelining to write correct software. The compiler handles the gap between intent and implementation.
An Internal Developer Platform is doing something structurally analogous for infrastructure. A developer describes what they need: a web service, two replicas, monitoring enabled, a Postgres database attached. The platform — the orchestrator, in the language the field has settled on — translates that description into the full complement of Kubernetes manifests, Helm values, network policies, service mesh configuration, and whatever else the organization's standards require. The developer doesn't write those artifacts. They can't misconfigure them. The platform generates them correctly, every time, from templates that the platform team maintains and updates centrally.
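A rough sketch of that translation step, under heavy assumptions: the `WorkloadSpec` class, the `render` function, the default resource values, the port numbers, and the `example.com/monitoring` annotation are all hypothetical, and a real orchestrator does far more. The manifest shapes follow the standard Kubernetes Deployment, Service, and NetworkPolicy layouts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkloadSpec:
    """Hypothetical developer-facing description of a service."""
    name: str
    image: str
    replicas: int = 2
    monitoring: bool = True


# Platform-owned defaults (assumed values, for illustration only).
DEFAULT_RESOURCES = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}


def render(spec: WorkloadSpec) -> List[dict]:
    """Expand a short spec into the manifests the platform standard requires."""
    labels = {"app": spec.name}
    # Hypothetical annotation key; a real platform would use its own.
    annotations = {"example.com/monitoring": "enabled"} if spec.monitoring else {}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec.name, "labels": labels},
        "spec": {
            "replicas": spec.replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels, "annotations": annotations},
                "spec": {"containers": [{
                    "name": spec.name,
                    "image": spec.image,
                    "resources": DEFAULT_RESOURCES,
                }]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": spec.name, "labels": labels},
        # Port numbers are assumptions, not part of the developer's spec.
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": 8080}]},
    }
    # Default-deny ingress: selects the pods and lists no allowed sources.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{spec.name}-default-deny-ingress"},
        "spec": {"podSelector": {"matchLabels": labels}, "policyTypes": ["Ingress"]},
    }
    return [deployment, service, network_policy]


manifests = render(WorkloadSpec(name="checkout",
                                image="registry.example.com/checkout:1.4.2"))
kinds = [m["kind"] for m in manifests]
```

The developer supplied two fields. Everything else, including the resource limits and the network policy, came from code the platform team owns, which is why changing `DEFAULT_RESOURCES` in one place changes it for every service rendered afterwards.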
The compiler metaphor breaks down at the edges, as all metaphors do. But the core intuition, that abstraction layers are how complex systems become manageable, is sound. And the organizational implication is significant: the platform relocates complexity from distributed to centralized, from implicit to explicit, from configuration sprawl to versioned platform code.
Bechtle's Numbers and Why They're Credible
When I first heard the figure — roughly a 95% reduction in configuration file volume after a platform engineering adoption — I was skeptical in the way that I'm always skeptical of round numbers from case studies. Vendor-backed success stories have a tendency to report the metric that flatters the product and omit the ones that complicate the narrative.
So I spent some time understanding what that number actually means in the Bechtle context. They implemented a tool called Score, which provides a developer-facing schema for describing workloads at a level of abstraction above raw Kubernetes. A developer says, in essence: my service needs a Postgres database and a Redis cache. The platform resolves that into whatever the underlying environment requires — production might mean managed cloud services, staging might mean containerized versions — without the developer ever seeing the infrastructure-specific YAML.
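The environment-dependent resolution can be pictured as a lookup the platform performs on the developer's behalf. This is only an illustration of the idea, not Score's actual schema or API: the `RESOLVERS` table, the `resolve` function, and the provisioning names are all hypothetical.

```python
from typing import Dict, List

# Hypothetical mapping from (resource type, environment) to what the
# platform provisions. The developer never sees or edits this table.
RESOLVERS: Dict[tuple, str] = {
    ("postgres", "production"): "managed-cloud-postgres",
    ("postgres", "staging"): "containerized-postgres",
    ("redis", "production"): "managed-cloud-redis",
    ("redis", "staging"): "containerized-redis",
}


def resolve(resources: List[str], environment: str) -> Dict[str, str]:
    """Map each requested resource type to its environment-specific backing."""
    resolved: Dict[str, str] = {}
    for resource in resources:
        key = (resource, environment)
        if key not in RESOLVERS:
            # Fail loudly rather than guess at an unsupported combination.
            raise KeyError(f"no resolver for {resource!r} in {environment!r}")
        resolved[resource] = RESOLVERS[key]
    return resolved


# The developer's request is identical in both environments.
request = ["postgres", "redis"]
production = resolve(request, "production")
staging = resolve(request, "staging")
```

The same two-item request yields different infrastructure per environment, and an unsupported combination is rejected instead of being improvised by whichever team hits it first.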
The 95% reduction isn't a fabrication. It's an arithmetic consequence of the architecture. If a hundred services each previously had their own deployment manifests, service definitions, network policies, ingress configurations, and resource quota files — say, ten to fifteen files per service — and the platform now generates all of those from a single five-line developer schema, the math is roughly right. The files still exist. They're generated, not handwritten. No individual team owns them. The platform does.
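The arithmetic is easy to check. The sketch below uses the illustrative figures from the paragraph above (a hundred services, ten to fifteen handwritten files each, one schema file per service afterwards); the function names are mine, and none of these numbers are Bechtle's actual counts.

```python
def handwritten_reduction(services: int, files_per_service: int,
                          schema_files_per_service: int = 1) -> float:
    """Fraction of handwritten files eliminated when each service keeps
    only its developer-facing schema file(s)."""
    before = services * files_per_service
    after = services * schema_files_per_service
    return 1 - after / before


def files_needed_for(target_reduction: float) -> float:
    """Handwritten files per service that one schema file must replace
    to reach the target reduction."""
    return 1 / (1 - target_reduction)


low = handwritten_reduction(services=100, files_per_service=10)
high = handwritten_reduction(services=100, files_per_service=15)
needed = files_needed_for(0.95)

print(f"10 files per service: {low:.1%} fewer handwritten files")
print(f"15 files per service: {high:.1%} fewer handwritten files")
print(f"files per service needed for 95%: about {round(needed)}")
```

With ten to fifteen files per service the reduction lands between 90% and roughly 93%, and a full 95% corresponds to about twenty handwritten files per service collapsing into one. The result depends only on the per-service ratio, not on how many services there are, which is why the figure is plausible for any estate with heavily duplicated configuration.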
What this buys you operationally is harder to quantify but equally important. When your security baseline changes — new network policy requirements, updated container security contexts, a revised resource limit standard — you update the platform template. Every service gets the update on its next deployment. There's no manual propagation across a hundred repositories. There's no version of the security standard that some teams are on and others aren't.
The Ticket Queue as Organizational Symptom
One pattern I've noticed repeatedly in platform engineering adoptions, which rarely gets written about because it's organizational rather than technical: the transformation of the platform team's role.
Before: platform teams are primarily a service desk. Developers need something new, they file a ticket, a platform engineer interprets the request, configures the infrastructure manually or semi-manually, closes the ticket. The platform team's productivity is measured by ticket throughput. Their ceiling is the number of hours in the day.
After: platform teams are primarily a product team. Their customers are developers. Their product is the abstraction layer — the templates, the CLI, the portal, the orchestrator. Their productivity is measured by the quality of the self-service experience they've built. Their ceiling is the value of the platform they've shipped, not the capacity to process requests.
This sounds like a subtle distinction. It isn't. I talked with a platform team lead at a UK-based financial services firm in early 2025 who described the before-and-after with unusual precision. Before their IDP rollout, her team averaged about forty tickets per week. After — three months into the rollout, with roughly sixty percent of their internal services onboarded — they were averaging seven. The other thirty-three had become self-service actions that developers completed without human involvement.
Her team didn't shrink. They redirected. The people who had been triaging tickets were now building better templates, improving documentation, running office hours that were actually about capability building rather than issue escalation. The work was harder, in the sense of requiring more design thinking. It was also, by her account, significantly more sustainable.
The Security Case That Gets Underemphasized
GitOps and platform engineering are usually sold on developer productivity. Faster deployments, less toil, better developer experience. These benefits are real and worth pursuing. But I'd argue the security case is at least as strong, and it gets underemphasized in most of the literature.
Consider the attack surface of a configuration landscape where every team manages its own infrastructure files, with their own conventions, and deploys through processes they've assembled themselves. Security policies are applied inconsistently, if at all. New vulnerabilities in base images or Helm charts propagate to services that are only updated when someone remembers to update them. Drift between environments means security controls that are present in staging may not be present in production.
Now consider the same organization with a centralized platform. Security controls — image scanning, runtime policy enforcement, secret management patterns, network segmentation — are encoded into templates. They're not optional. They're not something individual teams remember or forget. They're the output of the platform, automatically, for every service. When a new CIS benchmark requirement comes through, the platform team ships an updated template. Compliance propagates.
I spoke with a CISO at a mid-market enterprise software company in November 2025 who made a point I hadn't heard framed this way before: the audit-readiness argument. His company operates in a regulated sector. Before their platform engineering investment, SOC 2 audit preparation was a two-month project every year, involving manual evidence collection across dozens of teams. After — with every infrastructure change committed to Git, every deployment traceable to a specific approved template version — the audit became primarily an automated evidence export. His estimate: the platform investment paid for itself in audit cost reduction within eighteen months, before accounting for any of the deployment velocity benefits.
What This Doesn't Solve
I'd be doing readers a disservice if I left the impression that GitOps plus an IDP is a complete answer to infrastructure complexity. It isn't.
The templates themselves need maintenance. A platform team that doesn't invest continuously in the quality of its abstractions ends up with a different kind of sprawl — one that lives inside the platform rather than outside it. Opinionated abstractions that made sense in 2023 may actively constrain what teams need to do in 2026. The platform has to evolve with the organization, which means someone has to own that evolution and treat it with the same seriousness as any other product roadmap.
The organizational adoption is harder than the technical implementation, in my experience. Developers who have spent years with full control over their own YAML sometimes resist abstractions that feel limiting. Platform teams that haven't operated as product teams before sometimes underinvest in the developer experience of their own tools. Both failure modes are common and both are addressable, but neither is automatic.
And there's a dependency risk that doesn't get discussed enough: a well-adopted IDP becomes critical infrastructure. If the orchestrator goes down at the wrong moment, your deployment pipeline stops. The platform team's on-call rotation becomes a central dependency for every team that uses the platform. This is a solvable architecture problem — idempotent reconciliation, robust failure modes — but it has to be designed for explicitly, not assumed.
The Organizational Bet Worth Making
I've been covering enterprise infrastructure long enough to remember when containerization was a controversial technology decision, when Kubernetes was something you adopted cautiously, when "infrastructure as code" was a novel phrase rather than a baseline expectation.
Platform engineering is in that same phase now. The organizations that are doing it well are visibly ahead of those that aren't — not in benchmark numbers, but in the qualitative texture of how their engineering organizations operate. Less firefighting. Less configuration archaeology. Fewer incidents traced back to a YAML file that nobody recognized as the source of truth for anything.
The investment required is real. A platform team is a product team, and building a product is expensive and slow before it's cheap and fast. The organizations that have made the investment, in my observation, made it because they did the math on what the alternative was costing them: in engineering time, in incident rate, in developer frustration, in compliance overhead.
The pile is always cheaper until it isn't. And by the time it isn't, you're doing archaeology at the worst possible moment.
The author covers enterprise infrastructure, developer tooling, and organizational technology strategy. They have reported from engineering organizations across three continents over a fifteen-year career.
Opinions expressed by DZone contributors are their own.