Cost Is an SLI: Why Your System Is “Healthy” but Burning Cash

Runaway cloud spend hides in healthy systems — driven by poor cost visibility, idle resources, and scaling inefficiencies. Fix it with cost-per-request metrics.

David Iyanu Jonathan

May. 04, 26 · Opinion

There's a class of failure that doesn't page anyone.

No SLO breaches, no latency spikes, no 3 AM Slack messages from an on-call engineer clutching cold coffee. The system is working — by every conventional measure it's healthy — and yet something is deeply wrong. Money is hemorrhaging out of the infrastructure at a rate that won't become visible until the CFO opens a billing dashboard, squints at a number that seems obviously misformatted, and then realizes with a specific, cold dread that it isn't.

This is what runaway cloud spend actually feels like from the inside. Not an explosion. A slow bleed mistaken for normal circulation.

I've watched this happen to teams that were, by all accounts, technically sophisticated. Engineers who could discourse fluently on consensus algorithms and distributed tracing, who had meticulous runbooks and well-tended Grafana boards — and who had absolutely no instrumentation on what their systems cost per request. The money question was someone else's problem. Finance's problem. The CFO's problem. Right up until it became everyone's problem simultaneously, in a conference room, with a spreadsheet nobody had any good answers for.

The SaaS startup whose AWS bill doubled to $500,000 in a single month didn't have a cloud problem. They had an instrumentation problem wearing a cloud problem's clothing. Orphaned virtual machines — instances spun up for a load test, or a one-off migration, or some experiment that concluded months ago — sitting there, billing hourly, invisible because nobody had tagged them to a team or a project or a cost center. Reserved-instance coverage that looked adequate in aggregate but had grown misaligned with the actual workload topology. The machines doing real work were on-demand; the reservations were funding a ghost fleet. This isn't exotic negligence. It's the default state of systems that grow faster than their accounting practices.

The $2.4 million cloud bill where 80% of charges were data egress — that one is almost elegant in how completely it exposes a conceptual failure. Egress fees are the tollbooth you forget exists until you've already driven through it ten thousand times. Architects design for compute. They think in CPUs and memory and IOPS. Network transfer is ambient, assumed-cheap, treated as infrastructure rather than metered consumption. But cloud providers have always made money on the exits. Data flowing inward is free or nearly so; data flowing outward is where the revenue hides. A system that fetches large payloads, transforms them, and then ships the results to another region or a third-party analytics endpoint can accumulate egress charges that dwarf its compute costs — and nothing in the default monitoring stack will tell you this is happening until the invoice arrives.

The deeper pathology here is architectural, and it predates cloud computing entirely.

Distributed systems were designed by people who had to fight for every byte of memory and every millisecond of CPU time. Scarcity was the operating assumption. The engineering culture that emerged from that constraint treated resource efficiency as a first-order concern — you measured it, you optimized it, you were embarrassed when your code was wasteful. Then the cloud arrived with its promise of elasticity, its pay-as-you-go rhetoric, its infinite-seeming provisioning capacity, and something in the collective engineering psyche decided that scarcity was solved. Spin up what you need. Scale to meet demand. The infrastructure will handle it.

This was always a category error. Elasticity is not abundance. It's the ability to acquire resources quickly, which is genuinely useful — but those resources still cost money, real money, money with line items and quarterly reviews attached to it. The "elastic" metaphor implies that the system returns to its original state, like a rubber band. Most autoscaling configurations do the opposite: they scale out aggressively and scale in lazily, because the engineers who configured them were optimizing for availability, not for cost. Of course they were. Availability failures page you. Cost failures invoice you three weeks later.

This asymmetry in feedback latency is, I'd argue, the root cause of most cloud waste. You feel a reliability failure immediately, in your monitoring, in your error rates, in the angry emails from customers. You feel a cost failure at month-end, abstracted behind aggregates and allocation reports, at a distance from the specific code that caused it. The causal chain is so long and so obscured that attribution becomes genuinely difficult. Which service? Which deployment? Which query that suddenly started doing full table scans because someone dropped an index? You're doing forensic accounting on systems that didn't bother to leave evidence.

Consider what actually happens inside a Lambda-based microservice when the retry logic goes wrong.

A downstream dependency starts returning 429s — rate limiting, legitimate, expected under load. The Lambda function catches the error, implements exponential backoff, retries. Fine. Normal. But the backoff parameters were configured for a dependency that's usually briefly unavailable, not one that's rate-limiting at scale, and the jitter is insufficient, so you get retry storms: dozens of function instances all backing off to similar intervals, all hammering the dependency in synchronized bursts, all being rejected, all retrying again. Each invocation is cheap individually — fractions of a cent, execution measured in milliseconds. But you're running thousands of them simultaneously, each one burning GB-seconds of memory while waiting on a backoff interval, and the function is stateless so there's no circuit breaker state shared between invocations, and AWS will happily keep invoking your function at full concurrency because from its perspective, demand is high and capacity is available.

Nobody gets paged. The error rate might actually look acceptable — most requests eventually succeed. Latency is elevated but within the p99 SLO. Meanwhile the bill for this three-hour incident is climbing toward what would normally be a week's worth of Lambda spend.
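The synchronized-burst problem described above is usually attacked with randomized ("full jitter") backoff plus a hard cap on attempts. What follows is a minimal sketch of that idea, not anyone's production retry policy: the function names, the 100 ms base, the 20-second cap, and the five-attempt limit are all assumptions chosen for illustration.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.1, cap: float = 20.0,
                      rng: Optional[random.Random] = None) -> float:
    """Seconds to sleep before retry number `attempt` (zero-based)."""
    source = rng if rng is not None else random
    # Exponential ceiling, capped so late retries do not wait unboundedly.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: draw uniformly from [0, ceiling] so that concurrent
    # invocations spread out instead of retrying in synchronized bursts.
    return source.uniform(0.0, ceiling)


def should_give_up(attempt: int, max_attempts: int = 5) -> bool:
    """A retry cap bounds how long a single invocation can keep billing."""
    return attempt >= max_attempts
```

Jitter only spreads the load. It does not replace shared circuit-breaker state, which a stateless function would have to keep somewhere external.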

This is the failure mode that the "cost as SLI" framing is trying to address, and it's worth being precise about what that means mechanically. A Service Level Indicator is a measurement of some property of the service's behavior. Latency, error rate, throughput — these are the canonical SLIs because they directly reflect the user experience. Cost doesn't appear in that list because it doesn't affect the user, not directly. But cost does reflect system behavior in ways that the other SLIs might not. A function that's executing correctly but expensively is exhibiting a real defect. The defect is just measured in dollars instead of milliseconds.

Define it concretely: cost-per-request, tracked as a rolling average with a time window short enough to catch anomalies before they compound. For a Lambda function handling API traffic, this is derivable — you know the invocation count, you know the GB-seconds consumed, you know the memory configuration, you know the egress bytes. The math isn't complicated. What's missing in most stacks is the pipeline to compute it continuously and route it somewhere actionable.
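To make "the math isn't complicated" concrete, here is a rough sketch of the calculation for one measurement window. The three unit prices are assumptions standing in for whatever your region and architecture are actually billed at, and `WindowUsage` is a hypothetical container for numbers you would pull from your metrics and billing data.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed unit prices, for illustration only. Check current rates for your region.
PRICE_PER_MILLION_REQUESTS = 0.20    # USD per 1M invocations
PRICE_PER_GB_SECOND = 0.0000166667   # USD per GB-second of execution
PRICE_PER_EGRESS_GB = 0.09           # USD per GB transferred out


@dataclass
class WindowUsage:
    invocations: int
    billed_seconds: float   # billed duration summed across all invocations
    memory_gb: float        # configured function memory, in GB
    egress_bytes: int


def cost_per_request(usage: WindowUsage) -> Optional[float]:
    """Average USD per invocation for the window, or None if there was no traffic."""
    if usage.invocations == 0:
        return None
    request_cost = usage.invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    compute_cost = usage.billed_seconds * usage.memory_gb * PRICE_PER_GB_SECOND
    egress_cost = usage.egress_bytes / 1e9 * PRICE_PER_EGRESS_GB
    return (request_cost + compute_cost + egress_cost) / usage.invocations
```

Once a number like this is exported as a metric, it can be alerted on like any other signal, as in the rule that follows.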

    - alert: CostPerRequestAnomaly
      expr: |
        (
          increase(cloud_spend_dollars_total{service="payment-processor"}[30m])
          /
          increase(http_requests_total{service="payment-processor"}[30m])
        ) > 0.02
      for: 15m
      labels:
        severity: warning
      annotations:
        summary: "Payment processor cost/request exceeding $0.02 threshold"
        runbook: "https://wiki.internal/runbooks/cost-anomaly"

The alert above is simple to the point of being naive — a real implementation needs to handle the edge cases around division-by-zero when request volume drops, needs to account for fixed-cost components that don't scale with traffic, needs to be tuned per-service rather than applying a uniform threshold. But the principle is sound: instrument cost the way you instrument latency. Put it in the same pipeline. Give it the same alerting treatment. Let it page someone.
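The three edge cases named above can be sketched in a few lines. This is one possible shape, under assumptions I'm making up for illustration: a minimum-traffic floor of 100 requests per window, a known fixed-cost figure per service, and a hypothetical per-service threshold table.

```python
from typing import Dict, Optional

# Hypothetical per-service thresholds in USD per request.
THRESHOLDS: Dict[str, float] = {
    "payment-processor": 0.02,
    "thumbnailer": 0.001,
}


def marginal_cost_per_request(window_spend: float, fixed_spend: float,
                              requests: int, min_requests: int = 100) -> Optional[float]:
    """Variable USD per request, or None when traffic is too low to be meaningful."""
    # Guard against division by zero and against noisy ratios at tiny volumes.
    if requests < min_requests:
        return None
    # Remove the part of the bill that does not scale with traffic.
    variable_spend = max(window_spend - fixed_spend, 0.0)
    return variable_spend / requests


def is_anomalous(service: str, cost: Optional[float]) -> bool:
    """True only when a value exists and exceeds that service's own threshold."""
    threshold = THRESHOLDS.get(service)
    return cost is not None and threshold is not None and cost > threshold
```

Returning `None` at low volume is a deliberate choice here: a quiet service should produce no data point, not an alert.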

Zombie resources deserve particular attention because they're so easy to dismiss as solved problems that keep not being solved.

The inventory of forgotten things in a mature cloud environment is, in my experience, always larger than anyone expects. Unattached EBS volumes, left behind when instances were terminated but the delete-on-termination flag wasn't set. Elastic IPs not associated with any running instance, costing $0.005/hour each: individually trivial, collectively real money at scale. NAT Gateways in regions where you decommissioned the VPC workloads but left the gateway standing because the Terraform state wasn't cleaned up and touching Terraform state makes everyone nervous. RDS snapshots accumulating indefinitely because the backup retention policy was set aggressively and nobody wrote the cleanup job. Elastic Load Balancers with no healthy targets, running health checks against nothing, billing for capacity they're delivering to nobody.

The cumulative drag of this kind of waste is hard to calculate but easy to feel when you run the audit. It's almost never catastrophic individually. It's ambient cost noise that compounds month over month, gradually shifting the baseline upward so that each budget cycle starts from a floor that's slightly higher than the last one, and nobody can quite pinpoint why the efficiency curve keeps drifting in the wrong direction.

Tooling exists for this — AWS Trusted Advisor, Compute Optimizer, the idle resource detection in Cost Explorer — but tooling that generates recommendations is only useful if someone's job is to act on them. That organizational detail is where most cost hygiene programs quietly fail. The recommendations accumulate in a dashboard somewhere. Engineers see them, acknowledge them, add them to a backlog, and then prioritize the feature work that someone with authority is actually asking for. The idle resources survive because their survival costs nobody anything immediately measurable.

The fix isn't more tooling. It's accountability, specifically the kind that creates immediate feedback. Tag enforcement at resource creation: if you can't create a resource without a team tag and an environment tag and a project tag, the tagging happens. Automated cleanup of untagged resources after a grace period: not a suggestion, an actual termination, which focuses attention remarkably. Chargeback rather than showback: bill teams for their cloud consumption, real money against real budgets, rather than sending informational usage reports that feel abstract because no actual transfer occurs.
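The tagging rule and the grace-period cleanup reduce to a small decision function. The sketch below assumes a seven-day grace period and the three tag keys named above. The function names and the "keep / notify / terminate" outcomes are hypothetical, and wiring the result to a real cloud API is left out on purpose.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

REQUIRED_TAGS = ("team", "environment", "project")
GRACE_PERIOD = timedelta(days=7)  # assumed value; pick what your org can tolerate


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Required tag keys that are absent or empty on a resource."""
    return [key for key in REQUIRED_TAGS if not tags.get(key)]


def cleanup_action(tags: Dict[str, str], created_at: datetime, now: datetime) -> str:
    """Decide what the cleanup job should do with one resource."""
    if not missing_tags(tags):
        return "keep"
    # Inside the grace period the owner gets a warning, not a termination.
    if now - created_at < GRACE_PERIOD:
        return "notify"
    return "terminate"
```

For example, a resource created eight days ago with only an empty `team` tag would come back as `"terminate"`, while the same resource created yesterday would come back as `"notify"`.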

Showback is useful. Chargeback is clarifying.

Autoscaling deserves its own reckoning, because the failure mode isn't as simple as "it scales too much."

Horizontal Pod Autoscaler configurations are typically written by engineers whose primary experience with the service was getting it to scale up fast enough during an incident. The scaling-out parameters get tuned aggressively; the scaling-in parameters stay at default or get made more conservative, because prematurely scaling in caused latency problems once and nobody wants that phone call again. The result is a ratchet: the cluster grows to accommodate load peaks and then stays grown, because the hysteresis is asymmetric.

At a per-node cost of, say, $0.20/hour for a reasonable compute instance, running thirty nodes when twelve would suffice represents roughly $31,500 in annual waste (18 surplus nodes × $0.20 × 8,760 hours) for a single workload. Multiply across services in a mid-sized platform organization and you're looking at numbers that fund engineering headcount.
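The arithmetic is simple enough to keep as a helper and rerun per workload. A minimal sketch, assuming a flat hourly rate and a surplus that stays constant all year (real clusters fluctuate, so treat the result as an upper-bound estimate):

```python
HOURS_PER_YEAR = 24 * 365  # 8,760; ignores leap years


def annual_overprovisioning_cost(nodes_running: int, nodes_needed: int,
                                 hourly_rate: float) -> float:
    """USD per year spent on nodes beyond what the workload needs."""
    surplus_nodes = max(nodes_running - nodes_needed, 0)
    return surplus_nodes * hourly_rate * HOURS_PER_YEAR


# Thirty nodes running, twelve needed, $0.20 per node-hour.
waste = annual_overprovisioning_cost(30, 12, 0.20)
print(f"${waste:,.0f} per year")
```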

The reactive-versus-predictive scaling trade-off is real, but the usual framing understates the implementation cost of predictive scaling. Getting good predictions requires either historical data with stable periodicity (traffic patterns that repeat weekly, that kind of thing) or ML-based forecasting infrastructure with its own operational overhead. Most teams don't have clean enough signals to train reliable forecasting models, especially if their traffic has high variance or is driven by irregular external events: marketing campaigns, news cycles, competitor outages generating unexpected traffic. Scheduled scaling is more tractable: if you know from two years of logs that traffic increases 40% on weekday mornings at 9 AM in your primary timezone, you can pre-scale before that ramp rather than chasing it reactively. This doesn't require ML. It requires looking at your traffic patterns, which is a thing engineers often don't do because it doesn't feel like engineering.

The honest trade-off isn't between reactive and predictive scaling as equivalent strategies with symmetric costs and benefits. It's between the certainty of reactive (it's always correct, just sometimes late) and the efficiency of predictive (it's cheaper when right, occasionally wrong, and wrong in ways that are visible and embarrassing). Most platform teams are better served by reactive scaling with more aggressive scale-in parameters and shorter cooldown windows than they currently run, plus scheduled pre-scaling for known patterns, than by investing in a forecasting infrastructure they won't maintain properly.
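Reactive scaling plus scheduled pre-scaling composes naturally: compute what the current load demands, compute what the calendar demands, and take the larger. The sketch below is an illustration of that composition, not a real autoscaler. The schedule entry (weekdays, 08:00 to 12:00, at least ten replicas), the per-replica capacity, and all function names are assumptions.

```python
import math
from datetime import datetime

# Hypothetical schedule: (weekdays, start hour, end hour, minimum replicas).
# Monday is 0. The window opens an hour before the 9 AM ramp.
SCHEDULED_FLOORS = [
    ({0, 1, 2, 3, 4}, 8, 12, 10),
]


def reactive_replicas(current_rps: float, rps_per_replica: float,
                      min_replicas: int = 2) -> int:
    """Replicas needed for the load observed right now."""
    return max(min_replicas, math.ceil(current_rps / rps_per_replica))


def scheduled_floor(now: datetime) -> int:
    """Minimum replicas demanded by the calendar at this moment, or 0 if none."""
    floor = 0
    for weekdays, start_hour, end_hour, replicas in SCHEDULED_FLOORS:
        if now.weekday() in weekdays and start_hour <= now.hour < end_hour:
            floor = max(floor, replicas)
    return floor


def desired_replicas(current_rps: float, rps_per_replica: float, now: datetime) -> int:
    """Reactive answer, raised to the scheduled floor when one applies."""
    return max(reactive_replicas(current_rps, rps_per_replica), scheduled_floor(now))
```

Outside the scheduled window the floor drops to zero, so scale-in is governed entirely by the reactive path. That is where the more aggressive scale-in settings pay off.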

What does a careful builder actually do on Monday morning?

Probably not a complete FinOps transformation. Those take quarters, involve organizational dynamics that engineering can't resolve unilaterally, and have a tendency to generate dashboards that everyone nods at in the monthly review and nobody uses between reviews.

Start with the instrumentation gap. Identify the three services that collectively drive the most spend — AWS Cost Explorer will tell you this, broken down by service type, in about five minutes. For each of those three services, answer the question: do you know what a normal cost-per-request looks like? If the answer is no, you don't have a cost monitoring problem, you have a cost observability problem, and that's the thing to fix first. Add the metrics, derive the baseline, set an alert threshold at 2x baseline as a starting point. You'll tune it. But you won't tune something you haven't measured.
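Deriving the baseline and the starting threshold can be as plain as this. The sketch uses the median of recent cost-per-request samples, which is my assumption and not something the approach requires; a median is simply less distorted by the very spikes you are trying to detect than a mean would be. The sample values are invented.

```python
from statistics import median
from typing import Sequence, Tuple


def baseline_and_threshold(samples: Sequence[float],
                           multiplier: float = 2.0) -> Tuple[float, float]:
    """Return (baseline, alert threshold) from historical cost-per-request samples."""
    if not samples:
        raise ValueError("need at least one sample to derive a baseline")
    baseline = median(samples)
    return baseline, baseline * multiplier


# Invented samples in USD per request; the last one is a spike.
history = [0.004, 0.005, 0.006, 0.005, 0.030]
baseline, threshold = baseline_and_threshold(history)
print(f"baseline={baseline:.4f} threshold={threshold:.4f}")
```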

Then: run the idle resource report. Not for the purposes of immediately cleaning things up — though do that too — but to understand what the organizational failure mode is that produced those resources. Someone created them. Someone forgot them. Was there no offboarding process for decommissioned projects? Was there no budget owner for that cost center? Was the tagging policy unenforced? The idle resources are symptoms. The absence of process is the condition.

And then — this is the uncomfortable one — have the conversation with whoever owns budget decisions about treating a cost anomaly the same way you'd treat a reliability incident: with a postmortem, with a timeline, with root cause analysis, with action items. Not blame. Not forensic punishment. The same blameless retrospective process you'd apply to a production outage, applied to a billing spike. Because a billing spike is a production incident. It just bills you for it differently.

The systems are already distributed. The costs are already real. The instrumentation is the part you chose not to build yet.

Build it.

Opinions expressed by DZone contributors are their own.
