Autoscaling Is Not Elasticity

Autoscaling is reactive, not resilient. Without caps, metrics, or overrides, it can worsen failures. True elasticity requires policy, testing, and bottleneck awareness.

Mar. 05, 26 · Opinion

The first autoscaling incident I handled personally left me staring at CloudWatch graphs at 3 AM, watching our infrastructure commit suicide by optimization. The Auto Scaling Group was doing exactly what we'd configured it to do — launching EC2 instances to meet target capacity — except AWS's control plane was choking on an internal partition failure, and every instance we launched hung in pending state, timed out health checks, got terminated, triggered a replacement launch. Over and over. We'd built a perfect feedback loop of failure, and the irony was we'd followed the documentation to the letter.

The fix took four minutes once we understood what was happening: manually pin the ASG at current capacity, stop all scaling activity, let the control plane recover on its own schedule. But we didn't have a procedure for that. No Terraform config sitting in version control with min_size = desired_capacity = max_size = 12. No runbook that said "when AWS is on fire, turn off your autoscaler first." We improvised it in the AWS console while our service degraded, and afterward we wrote it down in a document we titled "The Red Button."

That's the lesson compressed: autoscaling reacts. Elasticity absorbs. They're not the same category of thing.

When Scaling Makes It Worse

Slack's January 2021 outage is the textbook case, documented in their incident report with the kind of granular honesty that makes it valuable. Network degradation left web-tier servers waiting on slow backend calls, so CPU utilization dropped even though the service was struggling. The autoscaler read the low CPU as overcapacity and started terminating instances. Standard behavior. Economically rational.

Then the partition healed. Suddenly the system was understaffed for actual load, queues backed up, request timeouts cascaded, and the autoscaler saw the inverse signal: we need more capacity, immediately. It added 1,200 nodes. Except those nodes hit cold-start penalties — JVM warmup, cache priming, connection pool establishment — while also overwhelming the database connection limit, exhausting file descriptors, and generally creating more problems than they solved. The postmortem uses careful language: "the scale-up did not work as intended" and "made things worse."

You can construct scenarios where adding capacity actively degrades total system throughput. It's not theoretical.

Consider: your application servers maintain connection pools to a Postgres database. RDS has max_connections=300. Each app server uses 15 connections. Simple math says the ceiling is 20 instances (20 × 15 = 300), with nothing left over for admin sessions or migrations. Say you're running 15. Now a traffic spike triggers your autoscaler, which adds 30 instances because CPU crossed 75%. Those 30 new instances attempt to establish connection pools. The fleet of 45 is now requesting 675 connections against a limit of 300. Postgres refuses the excess, new instances can't serve traffic, health checks fail, instances terminate, autoscaler tries again. You've built a livelock where scaling activity prevents service recovery.
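The connection arithmetic above is simple enough to encode and check before any scaling policy ships. This is a minimal sketch; the function names and the `reserved` parameter (headroom for admin or migration sessions) are illustrative assumptions, not part of any AWS or Postgres API.

```python
def max_instances(max_connections: int, pool_size: int, reserved: int = 0) -> int:
    """Largest fleet whose connection pools fit inside the database limit."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return max(0, (max_connections - reserved) // pool_size)


def connection_demand(instances: int, pool_size: int) -> int:
    """Connections requested if every instance fills its pool."""
    return instances * pool_size


# Numbers from the example: max_connections=300, 15 connections per app server.
print(max_instances(300, 15))        # 20 instances is the hard ceiling
print(connection_demand(45, 15))     # 675 connections requested by a fleet of 45
```

Whatever `max_instances` returns is a candidate for the autoscaler's maximum size, so the database limit is enforced by configuration instead of by failed health checks.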

The fix isn't "turn off autoscaling." The fix is "understand what you're scaling and why."

Primitives Don't Think

Every cloud vendor provides autoscaling primitives. AWS has Auto Scaling Groups and Spot Fleet. Azure has VM Scale Sets. GCP has Managed Instance Groups. Kubernetes has Horizontal Pod Autoscaler and Cluster Autoscaler, plus newer entrants like Karpenter that bypass some of the legacy abstractions for faster node provisioning. They all give you similar knobs: target metrics, threshold values, cooldown periods, step adjustments.

These are mechanisms, not intelligence.

I've debugged an ASG configured with a 60-second scale-up cooldown and a 300-second scale-down cooldown. Looks reasonable in isolation — don't thrash, give scale-up priority over scale-down. Except in practice: traffic spike triggers scale-up, instances take 120 seconds to pass health checks and start serving requests, traffic normalizes, 300-second cooldown expires, scale-down begins, next spike arrives during the scale-down, system is understaffed again. The oscillation had a period of about eight minutes, and you could see it in the graphs like a sine wave.

The team switched to a flat 10-minute cooldown on all scaling activities and added a composite trigger: CPU and active connection count and p95 request latency. Single metrics lie. CPU can spike because of a memory leak causing GC thrash. Request count can drop because clients are timing out and giving up. Latency can degrade because a downstream dependency is slow, not because you need more instances. Composite signals reduce false positives.
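A composite trigger of the kind described above can be sketched as a plain predicate that requires all three signals to agree. The threshold values and the names `Signals`, `Thresholds`, and `should_scale_up` are hypothetical; real limits depend on the workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    cpu_pct: float
    active_connections: int
    p95_latency_ms: float


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; tune against your own traffic.
    cpu_pct: float = 75.0
    active_connections: int = 200
    p95_latency_ms: float = 500.0


def should_scale_up(s: Signals, t: Thresholds = Thresholds()) -> bool:
    """Scale up only when CPU, connection count, and p95 latency are all high."""
    return (
        s.cpu_pct > t.cpu_pct
        and s.active_connections > t.active_connections
        and s.p95_latency_ms > t.p95_latency_ms
    )


# High CPU from GC thrash with few connections and normal latency: no scale-up.
print(should_scale_up(Signals(cpu_pct=95.0, active_connections=40, p95_latency_ms=120.0)))
```

Requiring agreement trades some responsiveness for fewer false positives, which is the same trade the team made with the longer cooldown.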

But even composite metrics are just symptoms.

The tooling ecosystem tries to paper over this. Prometheus scrapes metrics and feeds them into HPA policies through a metrics adapter. Warm pools (AWS) or standby pools (Azure) keep instances pre-initialized, often in a stopped state, which can cut scale-up time from minutes to tens of seconds. Infrastructure-as-Code tools like Terraform, CDK, and Pulumi let you version-control your scaling policies so you can track changes and roll back when something breaks, which it will.

The actual tool that matters is paranoid policy design.

Guardrails Over Cleverness

Unbounded reactive scaling is the anti-pattern that keeps showing up. No cap on instance count. No check on whether downstream services can handle the increased load. No escape hatch when things go sideways.

Here's a real failure mode I've seen twice: autoscaler detects high queue depth, adds 40 instances, those instances boot and immediately start polling the queue, which is backed by Redis. Your Redis instance has 50GB of memory and is now handling 40 additional clients hammering BLPOP commands. Memory spikes, eviction policy kicks in, cache hit rate drops, application performance degrades, more requests time out, queue depth increases further, autoscaler adds another 30 instances. You're in a death spiral.

The missing piece was a bound on scale-up speed. Step scaling: add 5 instances, wait two minutes, assess, add another 5 if needed. Or a dependency health check before scaling — if Redis latency is above threshold, don't scale the thing that talks to Redis. Or a hard cap: "never run more than 50 instances regardless of metrics" because you know your architecture has bottlenecks that won't improve past that point.
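The three guardrails in the paragraph above (bounded steps, a dependency health gate, and a hard cap) compose into one small decision function. This is a sketch under assumptions: `next_capacity` and its parameters are invented for illustration, the 50 ms latency limit is arbitrary, and the caller is expected to wait between evaluations (two minutes in the text).

```python
def next_capacity(
    current: int,
    wanted: int,
    *,
    dependency_latency_ms: float,
    latency_limit_ms: float = 50.0,  # hypothetical Redis latency threshold
    step: int = 5,                   # "add 5 instances, wait, assess"
    hard_cap: int = 50,              # "never run more than 50 instances"
) -> int:
    """Return the capacity to request for this evaluation round (scale-up only)."""
    if dependency_latency_ms > latency_limit_ms:
        # Dependency is already struggling: do not scale the thing that talks to it.
        return current
    target = min(wanted, hard_cap)
    if target <= current:
        return current
    return min(current + step, target)


# The metrics ask for 40 more instances, but each round adds at most 5.
print(next_capacity(10, 50, dependency_latency_ms=5.0))    # 15
print(next_capacity(10, 50, dependency_latency_ms=400.0))  # 10, gate holds
```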

Circuit breakers fit here. Hystrix is ancient now but the pattern remains: when a downstream dependency is failing, stop calling it. Let requests fail fast instead of piling up retries. The autoscaler sees reduced load (because you're shedding work) rather than amplified load (because every request is timing out and retrying). The scaling decision becomes sane.
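For readers who have not implemented the pattern, here is a minimal circuit breaker sketch. It is not Hystrix's design; the class name, the thresholds, and the injectable `clock` are assumptions made to keep the example short and testable.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Reset period elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error instead of a timeout, so the load the autoscaler observes reflects work actually being done.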

Queue-based buffering is another mitigation. Put SQS or Kafka between your API and workers. When workers are saturated, the queue grows but you're not dropping requests. Scale worker count based on queue depth with a defined ceiling — say, 100 workers max — and apply back-pressure to API clients (HTTP 429 with a Retry-After header) if the queue itself exceeds 10,000 messages. You're absorbing bursts without infinite resource commitment, and you're giving clients explicit feedback to back off.
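The worker ceiling and the back-pressure rule above can be sketched as two small functions. The ceiling of 100 workers and the 10,000-message limit come from the text; `msgs_per_worker`, `min_workers`, the 30-second `Retry-After`, and the 202 success status are assumptions for illustration.

```python
import math

MAX_WORKERS = 100     # defined ceiling on worker count
QUEUE_LIMIT = 10_000  # queue depth beyond which clients are told to back off


def desired_workers(queue_depth: int, msgs_per_worker: int = 50, min_workers: int = 2) -> int:
    """Scale workers with queue depth, but never past the ceiling."""
    wanted = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(wanted, MAX_WORKERS))


def admit(queue_depth: int, retry_after_s: int = 30):
    """Return (status, headers) for an incoming API request."""
    if queue_depth > QUEUE_LIMIT:
        return 429, {"Retry-After": str(retry_after_s)}
    return 202, {}


print(desired_workers(50_000))  # 100, capped
print(admit(12_000))            # (429, {'Retry-After': '30'})
```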

Scheduled scaling handles predictable patterns. If your SaaS sees traffic peaks every weekday at 9 AM as people start work, pre-scale at 8:45. If your ML training jobs run nightly at 2 AM, don't wait for CPU to spike — allocate capacity beforehand. Reactive scaling handles surprises. Proactive scaling handles inevitability.
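One way to express the proactive side is a schedule that raises the minimum capacity ahead of known peaks, leaving reactive scaling to work above that floor. The schedule entries, capacities, and the `capacity_floor` name below are hypothetical; only the 8:45 pre-scale and the 2 AM job come from the text.

```python
from datetime import datetime, time

# (weekdays, start, end, minimum capacity); Monday is 0. Values are illustrative.
SCHEDULE = [
    ({0, 1, 2, 3, 4}, time(8, 45), time(18, 0), 20),       # pre-scale before the 9 AM peak
    ({0, 1, 2, 3, 4, 5, 6}, time(1, 45), time(5, 0), 12),  # ahead of the 2 AM training run
]


def capacity_floor(now: datetime, baseline: int = 4) -> int:
    """Minimum instance count the autoscaler may not go below at this moment."""
    floor = baseline
    for days, start, end, capacity in SCHEDULE:
        if now.weekday() in days and start <= now.time() < end:
            floor = max(floor, capacity)
    return floor


print(capacity_floor(datetime(2026, 3, 2, 8, 50)))  # Monday 08:50 -> 20
```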

But you still need a manual override. The red button. A way to say "stop all automatic changes right now" when the automation is making things worse.
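A red button for an AWS Auto Scaling Group might look like the sketch below. It assumes `client` is a boto3 Auto Scaling client (for example `boto3.client("autoscaling")`) and uses only the `describe_auto_scaling_groups`, `suspend_processes`, and `update_auto_scaling_group` calls. The function name and the choice of which processes to suspend are assumptions; check them against your own runbook.

```python
# Processes to suspend so the group stops launching, terminating, and reacting to alarms.
FROZEN_PROCESSES = [
    "Launch",
    "Terminate",
    "ReplaceUnhealthy",
    "AlarmNotification",
    "ScheduledActions",
    "AZRebalance",
]


def freeze_asg(client, group_name: str) -> int:
    """Pin the group at its current desired capacity and suspend automatic changes."""
    resp = client.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    groups = resp["AutoScalingGroups"]
    if not groups:
        raise LookupError(f"no Auto Scaling Group named {group_name!r}")
    current = groups[0]["DesiredCapacity"]

    # Stop the churn first, then pin min = desired = max at the current size.
    client.suspend_processes(
        AutoScalingGroupName=group_name,
        ScalingProcesses=FROZEN_PROCESSES,
    )
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=current,
        MaxSize=current,
        DesiredCapacity=current,
    )
    return current
```

Passing the client in makes the procedure easy to rehearse against a stub, which matters for something you only run during an incident. The matching "unfreeze" step (`resume_processes` plus restoring the original bounds) belongs in the same runbook.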

Testing the Failure Cases

Most teams load-test. Fewer chaos-test their autoscalers specifically.

Chaos Mesh, Gremlin, AWS Fault Injection Simulator — they let you inject failures: terminate instances, throttle API calls, introduce network latency, corrupt packets. But the scenario you want to test is: what happens when your autoscaler tries to add capacity during a control plane failure? What if your metrics endpoint becomes unreachable and the autoscaler goes blind? What if the scaling API is degraded and requests time out?

One pattern I've used: simulate a scale-up event in staging, then use iptables to block traffic from the autoscaler to the cloud control plane. Does the system recover gracefully, or does it retry forever, accumulating state and eventually deadlocking? You can automate the mitigation — an EventBridge rule that triggers a Lambda to disable autoscaling if it detects sustained API errors — but you need to test that the Lambda actually works under realistic failure conditions.

Another test: artificially saturate your database or cache, then trigger a traffic spike that causes autoscaling. Do the new instances improve throughput, or do they make latency worse? If they make it worse, your scaling policy is fundamentally broken and needs rethinking. Maybe you need a semaphore that limits concurrent database queries. Maybe you need a smarter health check that fails instances if database latency exceeds a threshold, so they don't join the load balancer pool until conditions improve.
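Both mitigations mentioned above fit in a few lines. This is a sketch, not a drop-in component: `QueryLimiter`, `health_status`, the 250 ms limit, and the short wait timeout are all assumed names and values.

```python
import threading
from contextlib import contextmanager


class QueryLimiter:
    """Caps concurrent database queries issued by one process."""

    def __init__(self, max_concurrent: int, wait_s: float = 0.1):
        self._sem = threading.BoundedSemaphore(max_concurrent)
        self._wait_s = wait_s

    @contextmanager
    def slot(self):
        # Fail fast instead of queueing unbounded work behind a saturated database.
        if not self._sem.acquire(timeout=self._wait_s):
            raise TimeoutError("database concurrency limit reached")
        try:
            yield
        finally:
            self._sem.release()


def health_status(db_latency_ms: float, limit_ms: float = 250.0) -> int:
    """HTTP status for a health endpoint that keeps instances out of rotation
    while the database is slow."""
    return 200 if db_latency_ms <= limit_ms else 503
```

A latency-gated check is best wired to load balancer readiness only. If the same check drives instance replacement, a slow database will trigger exactly the terminate-and-relaunch loop this article warns about.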

Kubernetes adds layers of complexity. Pods have readiness probes, but if those probes depend on downstream services being healthy, you can cascade: new Pods get scheduled but never become Ready, old Pods are evicted because the newcomers add resource pressure on their nodes, and you're effectively scaling down healthy capacity while scaled-up capacity contributes nothing. Pod Disruption Budgets cap how many Pods voluntary disruptions (node drains, Cluster Autoscaler scale-downs) can take out at once, which is critical because autoscaling shouldn't create availability gaps. They don't stop node-pressure evictions, though, so resource requests and limits still matter.

The interaction between HPA (scales Pods) and Cluster Autoscaler (scales nodes) deserves specific testing. HPA requests more Pods, but if nodes aren't available, Pods stay Pending. Cluster Autoscaler provisions nodes, but that takes minutes. Meanwhile you're down. Karpenter tries to solve this with faster provisioning, but you're still trusting a complex distributed system to coordinate correctly under load. Test it by breaking it on purpose.

Composite Signals and Cold Reality

Single-metric autoscaling is a gamble. CPU utilization alone doesn't tell you if the system is healthy or dying. You need overlapping signals:

CPU and memory utilization (catches leaks that won't show in CPU)
Request rate and error rate (distinguishes genuine load from failure)
Queue depth and message age (detects processing stalls vs. legitimate backlog)
p50, p95, p99 latency distributions (surface degradation before total failure)

And you must account for cold starts. A new EC2 instance takes 90-180 seconds to boot, join the ASG, pass health checks, and start receiving traffic. A new container in Kubernetes can be faster — maybe 20 seconds — but there's still JVM warmup if you're running Java, cache warming if your application relies on local state, connection pool establishment if you're talking to databases. During that window, the instance is consuming resources but providing no value.

If you scale up during a spike and the spike resolves before new instances become useful, you've wasted money and potentially created instability. Some systems solve this with "warm standby" — instances already running but not in the active pool, promoted instantly when needed. That's expensive. You're paying for idle capacity. Trade-off.

Step scaling mitigates overshoot: instead of jumping from 10 instances to 30, go 10 → 15 → 20 → 25 → 30, with pauses to evaluate whether each increment was sufficient. It's slower to respond to genuine emergencies, but it prevents wild overcorrection.

Target tracking is often superior to threshold-based scaling. Instead of "if CPU > 70%, add instances," you specify "maintain average CPU at 60%." The autoscaler makes incremental adjustments to keep the metric near target. But it still requires bounds (a minimum and maximum instance count), and it still can't know whether the root cause of high CPU is something horizontal scaling can actually address. If CPU is high because of a memory leak, more instances just means more leaking processes.
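To make the idea concrete, the sketch below uses the proportional rule that the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × metric / target)) and clamps the result to explicit bounds. Whether a given cloud provider's target tracking uses exactly this formula is not something this example claims; the function name is illustrative.

```python
import math


def target_tracking(
    current_instances: int,
    metric_value: float,
    target_value: float,
    min_size: int,
    max_size: int,
) -> int:
    """Proportional capacity estimate, clamped to [min_size, max_size]."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_instances * metric_value / target_value)
    return max(min_size, min(raw, max_size))


# 10 instances averaging 90% CPU against a 60% target.
print(target_tracking(10, 90, 60, min_size=2, max_size=20))  # 15
print(target_tracking(10, 90, 60, min_size=2, max_size=12))  # 12, the cap wins
```

The clamp is the part that matters: the formula will happily ask for more instances to chase a memory leak, and only the bound stops it.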

The Monday Morning Checklist

The incident is over. You've been paged, you've fought the autoscaler or watched it do nothing while the system burned, and now you're writing the postmortem. Here's what actually changes:

Hard instance caps. If your database supports 200 connections and each app server uses 10, your maximum instance count is 20. Maybe 25 if you have PgBouncer or another connection pooler, but fundamentally you're capped by downstream capacity. Enforce that ceiling in your ASG configuration.

Manual override procedure. A Terraform module or kubectl patch command that freezes autoscaling at current capacity. You can deploy it in under 60 seconds. Document it. Test it. This is your red button.

Composite metric triggers. If you're only watching CPU, add request rate and error percentage. If you're only watching queue depth, add message processing latency. No single number captures system health.

Longer cooldowns. A scale-down cooldown comfortably longer than your instance warm-up time (the team above settled on ten minutes after five proved too short) prevents oscillation. You'll be slower to release capacity, which costs money in EC2 hours or pod-minutes, but you won't thrash. That's the trade-off. Stability over marginal cost savings.

Dependency health gates. Before scaling app servers, verify the database isn't already struggling. Before scaling workers, check that the message queue is responsive. If dependencies are degraded, adding capacity makes it worse, not better.

Scaling event instrumentation. Log every scaling action: timestamp, triggering metric and value, instances added or removed, health check results, downstream latency impact. Surface this in a dashboard. Alert if scaling events happen more than five times per hour — something is oscillating.

Runbooks for scaling failures. Document how to disable autoscaling, how to manually add capacity, how to recover from a scaling-induced cascade. Your 2 AM self, woken by a pager, will not remember the kubectl incantation. Write it down.

Periodic chaos drills. Schedule quarterly tests where you deliberately inject control plane failures, metric pipeline failures, or dependency timeouts while autoscaling is active. Observe what breaks. Fix it before production hits it.

Cost circuit breakers. Use AWS Budgets or GCP billing alerts to catch runaway scaling. If your normal monthly spend is $12K and you hit $20K in a single day, something is catastrophically wrong. Stop it before the bill becomes existential.
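Of the checklist items above, the scaling-event alert is the easiest to prototype. The sketch below keeps a sliding one-hour window of scaling actions and flags when more than five have occurred. The class name and the use of plain epoch seconds are assumptions; in practice the timestamps would come from your scaling activity log.

```python
from collections import deque


class ScalingEventMonitor:
    def __init__(self, max_events: int = 5, window_s: float = 3600.0):
        self.max_events = max_events
        self.window_s = window_s
        self._events = deque()

    def record(self, timestamp_s: float) -> bool:
        """Record one scaling action; return True if the rate suggests oscillation."""
        self._events.append(timestamp_s)
        # Drop events that have aged out of the window.
        while self._events and timestamp_s - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) > self.max_events


monitor = ScalingEventMonitor()
# A scaling action every eight minutes, like the oscillation described earlier.
alerts = [monitor.record(i * 480.0) for i in range(7)]
print(alerts)  # the sixth event within the hour trips the alert
```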

The Consulting Opportunity

There's a market here because the failure modes are public and expensive, and most engineering teams have stepped on this rake at least once.

You can sell:

Policy audits. Review autoscaling configurations, identify unbounded rules and single-metric triggers, recommend multi-signal policies with guardrails. Deliverable: a report and updated Terraform modules.

Chaos testing as a service. Run controlled failure injections against staging and production-like environments, document how autoscalers behave under degradation, provide remediation plans.

Training workshops. Teach resilient autoscaling patterns — circuit breakers, back-pressure, dependency health checks, composite metrics, pod disruption budgets. Include hands-on labs where participants break and fix autoscaling policies.

Managed infrastructure. Offer DevOps-as-a-Service where autoscaling configuration, tuning, and monitoring are included in a monthly retainer. Teams get expert-configured policies without building in-house expertise.

Specialty products exist too. Knative provides request-driven autoscaling on Kubernetes, including scale-to-zero. Custom ML-based scalers try to predict traffic patterns instead of reacting to simple threshold rules. Auto-remediation bots detect control plane outages and automatically disable autoscaling until recovery. If you can build a tool that prevents one Slack-style thrashing incident, enterprises will pay five figures for it annually.

The demand persists because the failure surface is large and the cost of getting it wrong is measured in incident postmortems and customer churn.

Not a Magic Fix

Autoscaling is reactive elasticity assistance. It helps you handle load variation without manual intervention at 4 PM on a Friday. It does not make your system resilient to cascading failures. It does not diagnose whether high CPU is caused by legitimate traffic or a memory leak. It does not protect you from your own architectural bottlenecks.

Used carelessly — single metrics, no bounds, no dependency checks — it amplifies failure modes. The autoscaler becomes the problem.

Used carefully — composite signals, hard caps, manual overrides, chaos-tested under realistic failure scenarios — it's a useful operational tool that reduces toil and improves resource efficiency.

The difference is whether you've watched it fail in production, understood the failure mechanism in detail, and built guardrails accordingly.

I have. You probably will too, if you haven't already.

When it happens, remember: elasticity is not an AWS feature you enable with a checkbox. It's an emergent property of thoughtful system design, defensive automation, and the willingness to distrust your own clever optimizations. The autoscaler is a servo mechanism. You're the control system. Act like it.


Opinions expressed by DZone contributors are their own.
