The Hidden Latency of Autoscaling

Autoscaling isn’t real elasticity — it’s slow, reactive, and can mislead. Use demand metrics, keep warm capacity, and pair with circuit breakers & observability.

May. 05, 26 · Opinion

There is a comfortable fiction at the center of most cloud architectures, one that gets written into runbooks and repeated in postmortems with the same exhausted confidence: we autoscale. As if the declaration itself is a reliability posture. As if telling your HPA to watch CPU utilization is the same thing as building a system that breathes.

It isn't. And the gap between those two things has eaten more than a few production environments.

Let's be precise about what autoscaling actually does, mechanically, because the abstraction conceals something important. The Kubernetes HPA controller does not watch your pods continuously. It queries the metrics API on a sync period that defaults to 15 seconds, and the numbers it reads were themselves scraped on an interval of their own, so the signal is already stale before anyone acts on it. The controller compares current utilization against the configured target. If the ratio calls for more replicas, it raises the replica count. The scheduler then has to find nodes with sufficient allocatable resources, which may require the cluster autoscaler to provision new nodes from the cloud provider, a process that itself takes between 90 seconds and several minutes depending on instance type, AMI warm state, and the sheer caprice of the underlying hypervisor orchestration. Only then does your pod actually start, pull its image layers if they aren't cached, initialize its runtime, maybe warm a database connection pool, and finally register itself as healthy behind the load balancer.

The whole chain, optimistically, is three to five minutes. Under load, during the exact moment you need capacity, three to five minutes is a geologic epoch.
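To make that arithmetic concrete, here is a minimal sketch that adds up the stages of the chain. The stage names and every duration are illustrative assumptions chosen to sit inside the ranges quoted above, not measurements from any real cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    best_s: int   # optimistic duration in seconds (assumed)
    worst_s: int  # pessimistic duration in seconds (assumed)


# Hypothetical budget; replace with numbers measured in your own cluster.
SCALE_UP_CHAIN = [
    Stage("metric scrape + HPA evaluation", 15, 30),
    Stage("node provisioning", 120, 180),
    Stage("image pull", 15, 40),
    Stage("runtime init + pool warm-up", 20, 40),
    Stage("readiness + load balancer registration", 10, 10),
]


def chain_latency(stages):
    """Return (best, worst) end-to-end seconds for a strictly sequential chain."""
    best = sum(s.best_s for s in stages)
    worst = sum(s.worst_s for s in stages)
    return best, worst


best, worst = chain_latency(SCALE_UP_CHAIN)
print(f"time to first useful pod: {best / 60:.1f} to {worst / 60:.1f} minutes")
# prints "time to first useful pod: 3.0 to 5.0 minutes"
```

The stages are summed rather than overlapped because each one waits on the one before it. That is why shaving any single stage (a cached image, a pre-warmed node) shortens the whole chain.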

Meanwhile, your existing pods are absorbing a traffic spike that the autoscaler hasn't yet responded to. Latency climbs. Thread pools exhaust. The CPU metric that HPA is watching? By the time it reads 80%, you're not in the early stages of a problem — you're already in the middle of one. The SLA breach happened somewhere around 65%. The metric is a lagging indicator dressed up as a trigger.

Slack's January 2021 outage is instructive here, though not quite in the way their postmortem presents it. When their web tier started degrading, the platform attempted to scale: 1,200 additional servers, an aggressive response. But the provisioning service itself was under strain, and newly requested nodes sat in a liminal state: allocated but not configured, counted in the autoscaler's math but useless to actual request traffic. The scale event created the appearance of capacity expansion while the actual serving pool remained undersized. The autoscaler saw the scale directive succeed. The system saw the latency continue to climb. These two truths coexisted, quietly catastrophic.

This is a failure mode that doesn't have a common name, but it should. Call it phantom capacity: the autoscaler believes it has scaled, the infrastructure believes it has provisioned, and only your users know the truth. It's distinct from scale-up delay in a meaningful way: delay is about time, phantom capacity is about the decoupling of control plane state from data plane reality. And it's not unique to Slack. Anyone who has watched an ASG report healthy instances while their application servers were crashing on boot, or seen a Kubernetes deployment report three of three replicas created while every pod sat in an init container crash loop, has met this failure mode before.
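Phantom capacity is easy to state as a number once you record both views of the fleet. The sketch below is one way to do that. The `CapacitySnapshot` fields and the 10% tolerance are hypothetical, and how you count "serving" (health checks, observed traffic, both) is a decision your own instrumentation has to make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacitySnapshot:
    desired: int      # what the autoscaler asked for
    provisioned: int  # what the infrastructure reports as allocated
    serving: int      # instances that are healthy and taking requests


def phantom_capacity(snap: CapacitySnapshot) -> int:
    """Instances the control plane counts that are not serving requests."""
    return max(snap.provisioned - snap.serving, 0)


def is_misleading(snap: CapacitySnapshot, tolerance: float = 0.1) -> bool:
    """True when more than `tolerance` of provisioned capacity is phantom."""
    if snap.provisioned == 0:
        return False
    return phantom_capacity(snap) / snap.provisioned > tolerance


# Hypothetical incident: the scale-out "succeeded", most of it never served.
snap = CapacitySnapshot(desired=12, provisioned=12, serving=4)
print(phantom_capacity(snap), is_misleading(snap))  # prints "8 True"
```

Alerting on the gap between `provisioned` and `serving`, rather than on either number alone, is what separates this from an ordinary replica-count dashboard.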

The thrashing problem is its own category of misery. Configure your HPA with too-aggressive thresholds and too-short cooldown windows, and you'll watch your replica count oscillate (up, down, up, down) with a rhythm that correlates inversely with your sleep quality. No scale event is free. Each one consumes scheduler cycles, can run into pod disruption budgets when underused nodes are drained, and potentially shifts traffic in ways that expose session affinity bugs you didn't know you had. The stabilization window in Kubernetes HPA exists precisely because someone experienced this in production and was sufficiently traumatized to write the feature. The default is 300 seconds, five minutes, for scale-down; scale-up has no stabilization window by default. Most teams I've seen leave it there without understanding why it exists, or override it to something aggressive because they want to save money, and then wonder why their service occasionally falls off a cliff.
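The effect of a scale-down stabilization window is easier to see in a toy replay than in prose. The sketch below is a simplification under stated assumptions: it counts the window in evaluation ticks rather than seconds, ignores the HPA tolerance band and min/max bounds, and assumes each replica offers one core, so utilization is load divided by replicas. The load series is invented.

```python
import math
from collections import deque


def recommend(load_cores: float, target_util: float) -> int:
    """HPA-style ratio. With utilization = load / replicas,
    ceil(replicas * utilization / target) reduces to ceil(load / target)."""
    return max(1, math.ceil(load_cores / target_util))


def simulate(loads, target_util=0.5, window=0):
    """Replay load samples, one per evaluation tick.

    `window` is how many previous recommendations are consulted before a
    scale-down (0 disables stabilization). Returns replicas per tick.
    """
    history = deque(maxlen=window + 1)
    replicas = recommend(loads[0], target_util)
    trace = []
    for load in loads:
        history.append(recommend(load, target_util))
        rec = history[-1]
        # Scale up at once; scale down only as far as the highest
        # recommendation still inside the window.
        replicas = rec if rec >= replicas else max(history)
        trace.append(replicas)
    return trace


def scale_events(trace):
    """Count ticks where the replica count changed."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)


bursty = [2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 2.0, 2.0, 2.0]
print(scale_events(simulate(bursty, window=0)))  # prints 6
print(scale_events(simulate(bursty, window=3)))  # prints 2
```

With no window the replica count follows every wobble in the load. With a three-tick window it rides out the bursts at the higher count and steps down once, after the load has actually settled.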

There's also the cold-start problem, which is particularly acute in Lambda-based architectures but present anywhere you're running containerized workloads with non-trivial initialization. A Java service with Spring Boot can take 20-40 seconds to reach a healthy state even on warm hardware. During that window, your load balancer is either routing traffic to a pod that isn't ready — causing errors — or excluding it via health checks — extending the period of under-provisioning. AWS Lambda's provisioned concurrency is an honest acknowledgment of this: we cannot eliminate cold starts, so we'll let you pay to not have them. It's a tax on the fiction that scale-to-zero is truly elastic.

What would a careful builder actually change? A few things that don't require exotic tooling, just different thinking.

The first is to stop treating CPU as a primary scaling signal for latency-sensitive workloads. CPU is a decent proxy for throughput in batch processing — it maps reasonably well to work being done. But for services where latency is the SLO, CPU tells you about utilization, not about the queue of work waiting to be processed. A service can be at 40% CPU with request latencies spiking because its downstream dependency is slow and it's accumulating in-flight connections. KEDA's SQS queue depth trigger — or more generally, any demand-side metric — responds to the actual pressure on the system rather than an internal resource utilization proxy. The scaling trigger should be as close to the user experience as possible. Queue depth, active connection count, P95 latency where you can get it: these are meaningful. CPU is one level of abstraction removed from what you care about.
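A small comparison shows how far the two signals can diverge. This is a sketch with hypothetical numbers and function names, not KEDA's implementation: it only illustrates sizing from a backlog target (messages per replica) against sizing from a CPU ratio.

```python
import math


def cpu_replicas(current: int, cpu_pct: int, target_pct: int) -> int:
    """Utilization-side sizing, using the HPA-style ratio."""
    return max(1, math.ceil(current * cpu_pct / target_pct))


def backlog_replicas(queue_depth: int, msgs_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 50) -> int:
    """Demand-side sizing: one replica per `msgs_per_replica` queued messages,
    clamped to the configured bounds."""
    if msgs_per_replica <= 0:
        raise ValueError("msgs_per_replica must be positive")
    wanted = math.ceil(queue_depth / msgs_per_replica)
    return max(min_replicas, min(wanted, max_replicas))


# Hypothetical snapshot: 5 replicas idling at 40% CPU while 900 messages wait.
print(cpu_replicas(current=5, cpu_pct=40, target_pct=80))       # prints 3
print(backlog_replicas(queue_depth=900, msgs_per_replica=100))  # prints 9
```

In this made-up snapshot the CPU signal recommends scaling down at the exact moment the backlog signal asks for nearly double the fleet.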

The second change is boring but important: maintain a warm baseline. Not everything needs to scale to zero. For services on your critical path, the cost of keeping three or five pods running at minimal utilization is trivial compared to the cost of a scale event that takes four minutes during a traffic surge. Sizing that baseline is a conversation between your traffic patterns and your cost tolerance — but the conversation should happen explicitly, not by accident because nobody configured a minimum replica count.
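One way to start that conversation is to ask how much traffic can arrive before the first scaled-up pod is ready. The sketch below assumes a surge that grows linearly, which real surges rarely do, and all four inputs are hypothetical placeholders for numbers you would measure.

```python
import math


def warm_baseline(steady_rps: float, surge_rps_per_s: float,
                  scale_latency_s: float, pod_capacity_rps: float) -> int:
    """Pods needed so that a surge growing at `surge_rps_per_s` is still
    within capacity when the first scaled-up pod becomes ready."""
    if pod_capacity_rps <= 0:
        raise ValueError("pod_capacity_rps must be positive")
    peak_before_help = steady_rps + surge_rps_per_s * scale_latency_s
    return max(1, math.ceil(peak_before_help / pod_capacity_rps))


# Hypothetical service: 200 rps steady, each pod handles 100 rps,
# and new capacity takes four minutes (240 s) to arrive.
print(warm_baseline(200, 0, 240, 100))  # prints 2: enough for steady state only
print(warm_baseline(200, 1, 240, 100))  # prints 5: survives a 1 rps/s ramp
```

The difference between the two answers is the price of the warm baseline, stated in pods, which makes it something you can put next to the cost of an incident.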

The third change is harder and more cultural: use load testing to tune autoscale parameters, not intuition. Most teams configure cooldown windows, thresholds, and buffer percentages once, when they first deploy, based on a guess. Then they never revisit them because nothing catastrophically broke. But systems change — traffic patterns shift, dependencies get slower, code gets heavier. The HPA config that was adequate eighteen months ago may be quietly wrong today. Periodic load tests that exercise scale-up and scale-down scenarios, instrumented to measure actual time-to-ready for new capacity, are the only way to keep these parameters grounded in reality.

Predictive scaling is worth discussing, with appropriate skepticism. AWS Predictive Scaling and schedule-based autoscale rules on Azure work well for workloads with legible periodicity: the Monday morning login rush, the end-of-month billing batch, the daily ETL pipeline. AWS Predictive Scaling looks at historical CloudWatch metrics, identifies patterns, and pre-provisions capacity ahead of predicted load. A scheduled rule does the same job by calendar, with the pattern supplied by a human instead of a forecast. Both are genuinely useful and materially better than purely reactive scaling for those cases.

But most interesting failure modes aren't periodic. They're caused by viral content, cascading failures from dependencies, configuration errors that cause request fan-out, or any number of irregular events that no forecasting model would anticipate. Predictive scaling buys you safety for the events you know are coming. Reactive scaling with good metrics buys you safety for surprises. You need both, layered, with explicit thought about which layer covers which failure scenario.
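Layering can be as simple as letting whichever layer asks for more capacity win. The sketch below assumes a hand-written calendar of known surges and a queue-depth signal for the reactive side. `KNOWN_SURGES`, `DEFAULT_FLOOR`, and the function names are all hypothetical.

```python
import math
from datetime import datetime

# Hypothetical calendar of known surges: (weekday, start_hour, end_hour, floor).
# Monday is 0, matching datetime.weekday().
KNOWN_SURGES = [(0, 8, 11, 12)]
DEFAULT_FLOOR = 3


def scheduled_floor(now: datetime) -> int:
    """Predictive layer: the minimum replicas promised for this hour."""
    floors = [floor for day, start, end, floor in KNOWN_SURGES
              if now.weekday() == day and start <= now.hour < end]
    return max(floors, default=DEFAULT_FLOOR)


def reactive_replicas(queue_depth: int, msgs_per_replica: int) -> int:
    """Reactive layer: sized from live demand."""
    return math.ceil(queue_depth / msgs_per_replica)


def layered_replicas(now: datetime, queue_depth: int, msgs_per_replica: int) -> int:
    """Whichever layer asks for more capacity wins."""
    return max(scheduled_floor(now),
               reactive_replicas(queue_depth, msgs_per_replica))


monday_rush = datetime(2024, 1, 1, 9, 0)   # a Monday
quiet_tuesday = datetime(2024, 1, 2, 9, 0)
print(layered_replicas(monday_rush, 200, 100))     # prints 12: the calendar covers it
print(layered_replicas(quiet_tuesday, 2500, 100))  # prints 25: the surprise is caught reactively
```

Writing the two layers as separate functions also forces the explicit question of which one is expected to cover which scenario.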

A word on circuit breakers and the relationship between autoscaling and network-level controls, because these pieces are often treated as unrelated. When your service is scaling up and the new pods aren't ready yet, your existing pods are absorbing more than their designed share of traffic. If you've configured retry logic naively (and most default retry configurations are naive), then timeouts from the overwhelmed pods cause clients to retry. Every retry lands on top of the original request, which multiplies the load, which makes the problem worse. This is a thundering herd variant, and it happens specifically because autoscaling has introduced a capacity deficit that triggers retries.
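The size of that amplification is simple to estimate. The sketch below assumes every attempt fails independently with the same probability, which understates the problem: in a real overload the failure rate climbs as the retries arrive.

```python
def load_multiplier(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per original request when each attempt fails with
    probability `failure_rate` and clients retry up to `max_retries` times.

    This is the geometric sum 1 + p + p^2 + ... + p^max_retries.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if max_retries < 0:
        raise ValueError("max_retries must not be negative")
    return sum(failure_rate ** k for k in range(max_retries + 1))


print(load_multiplier(0.0, 3))  # prints 1.0: healthy service, no extra load
print(load_multiplier(0.5, 3))  # prints 1.875: half of attempts timing out
print(load_multiplier(1.0, 3))  # prints 4.0: everything times out
```

A single retry against a service that is timing out on every request is the doubling case; three retries make it a fourfold load on pods that were already past their limit.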

Istio's RetryBudget or Envoy's circuit breaking can interrupt this positive feedback loop by shedding load before retries compound the problem. The right mental model is that autoscaling and circuit breaking are complementary, not redundant: autoscaling restores capacity over time, circuit breaking manages demand in the gap before capacity is restored. Deploying one without the other leaves you exposed to the exact window where both would have mattered.
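The idea behind a retry budget fits in a few lines. The sketch below is not Istio's or Envoy's implementation. It is a generic token bucket, with hypothetical names and defaults, in which first attempts earn retry allowance and retries spend it. Integer units avoid floating-point drift.

```python
class RetryBudget:
    """Each first attempt deposits `retry_percent` units; each retry costs 100.
    Over time, retries are capped at roughly `retry_percent`% of traffic."""

    COST = 100  # units spent per retry

    def __init__(self, retry_percent: int = 20, max_banked_retries: int = 10):
        self.retry_percent = retry_percent
        self.cap = max_banked_retries * self.COST
        self.units = 0

    def record_request(self) -> None:
        """Call once per original (non-retry) request."""
        self.units = min(self.cap, self.units + self.retry_percent)

    def try_retry(self) -> bool:
        """Return True and spend budget if a retry is allowed right now."""
        if self.units >= self.COST:
            self.units -= self.COST
            return True
        return False


# Hypothetical overload: 100 requests, every one of them wants a retry.
budget = RetryBudget(retry_percent=20)
allowed = 0
for _ in range(100):
    budget.record_request()
    if budget.try_retry():
        allowed += 1
print(allowed)  # prints 20
```

Under total failure an unbudgeted client would have sent 100 retries. The budget holds the extra load to 20 and fails the rest fast, which is the breathing room new capacity needs.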

There's a monitoring gap that most teams discover too late. You track CPU. You track request rate. You track error rate. But do you track scale latency — the actual measured time from when a scaling event was triggered to when the new capacity was serving traffic? Probably not. Without that metric, you have no visibility into whether your autoscaling configuration is performing adequately. You might discover during an incident that your scale events routinely take eight minutes, which makes your reactive HPA configuration essentially decorative for any spike shorter than that.
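Measuring scale latency needs only two timestamps per event. The sketch below assumes you can record when a scale-up was triggered and when the new capacity first served traffic; where those timestamps come from is left to your own tooling, and the sample values are invented.

```python
import math
from datetime import datetime


def scale_latency_s(triggered_at: datetime, first_served_at: datetime) -> float:
    """Seconds from the scaling decision to new capacity serving traffic."""
    return (first_served_at - triggered_at).total_seconds()


def percentile(values, pct: int):
    """Nearest-rank percentile; needs at least one value."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def spikes_outrun(latency_s: float, spike_durations_s) -> float:
    """Fraction of past spikes that ended before new capacity could arrive."""
    if not spike_durations_s:
        return 0.0
    missed = sum(1 for d in spike_durations_s if d < latency_s)
    return missed / len(spike_durations_s)


# Hypothetical history of five scale-ups and four traffic spikes (seconds).
latencies = [100, 200, 300, 400, 480]
p95 = percentile(latencies, 95)
print(p95)                                       # prints 480
print(spikes_outrun(p95, [120, 300, 600, 900]))  # prints 0.5
```

The second number is the uncomfortable one: the share of spikes for which reactive scaling could not have helped at all.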

Define an SLO for provisioning latency. Measure seconds-under-provisioned as a metric — time spent in a state where demand exceeds available capacity. These aren't standard out-of-the-box metrics, but they're not difficult to instrument once you decide they matter. And they should matter, because they're the honest measure of whether your autoscaling configuration is actually achieving elasticity or just providing the comforting appearance of it.
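Seconds-under-provisioned can be computed from samples you probably already collect. The sketch below assumes a time-ordered series of (timestamp, demand, capacity) tuples in matching units and treats each sample as holding until the next one. The sample data is invented.

```python
def seconds_under_provisioned(samples):
    """Total time during which demand exceeded capacity.

    `samples` is a list of (t_seconds, demand, capacity) sorted by time.
    Each sample is assumed to hold until the next one (a step function),
    so the final sample contributes no duration.
    """
    total = 0.0
    for (t0, demand, capacity), (t1, _, _) in zip(samples, samples[1:]):
        if demand > capacity:
            total += t1 - t0
    return total


# Hypothetical 15-second samples: a surge arrives, capacity catches up at t=45.
samples = [
    (0, 100, 200),
    (15, 250, 200),
    (30, 300, 200),
    (45, 300, 400),
    (60, 150, 400),
]
print(seconds_under_provisioned(samples))  # prints 30.0
```

Summed per day or per week, this is a number you can set an SLO against and watch move when you change a threshold or a baseline.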

Elasticity, as a systems property, means that capacity tracks demand closely enough that neither users nor the service itself can perceive the gap. That's the aspiration. What cloud autoscaling delivers, in its default configurations, is something narrower and more qualified: capacity that reacts to demand, with a lag, after thresholds are breached, subject to provisioning delays and control-plane accuracy. That's useful. It's not the same thing.

The distance between those two definitions is where outages live.

Opinions expressed by DZone contributors are their own.
