The Myth of Horizontal Scalability

Horizontal scaling only works for the parts of your system that are parallel. Find and shrink the serial fraction before adding more infrastructure.

Apr. 02, 26 · Opinion


There's a particular kind of confidence that sets in after you've watched Kubernetes spin up a dozen fresh pods in under a minute. You've done the work. You've containerized everything, written your Helm charts, set your HPA thresholds. The architecture diagram shows twelve clean boxes behind a load balancer, arrows flowing neatly left to right, and some unspecified cloud thing in the corner labeled "DB" — as if that box were inert, as if it were just furniture.

Then traffic doubles. And the database CPU pegs at 99%.

The pods multiply. The database CPU stays at 99%. Orders queue. Latency climbs from 80ms to 800ms to "we need to page someone." What you've built, effectively, is a wider funnel feeding a drain that cannot drain any faster. You scaled the part that was already keeping up.

Amdahl's Law is taught in every undergraduate systems course and ignored in roughly the same proportion of production architectures. The math is almost insultingly simple: if 10% of your work is serial (meaning it must happen in sequence, on one node, under one lock), then no matter how many parallel workers you add, you cannot speed up the workload as a whole by more than 10×. Ever. The ceiling isn't a soft limit you can engineer around. It's a hard asymptote baked into the nature of the workload. Most engineers nod at this in class and then go design a checkout service with a single Postgres writer anyway.
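To make the ceiling concrete, here is a small sketch of the standard Amdahl's Law formula, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n is the number of workers. The function name and the sample worker counts are illustrative choices, not anything from a particular system.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Overall speedup predicted by Amdahl's Law for a given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # With a 10% serial fraction the speedup approaches, but never reaches, 10x.
    for n in (1, 12, 100, 10_000):
        print(f"{n:>6} workers -> {amdahl_speedup(0.10, n):.2f}x")
```

With a 10% serial fraction, twelve workers yield roughly a 5.7× speedup rather than 12×, and the curve flattens well before it gets near 10×.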

The failure mode is seductive because it's invisible until it isn't. Stateless API tiers scale magnificently. Kubernetes autoscaling works exactly as advertised for work that is genuinely embarrassingly parallel — request parsing, JWT validation, template rendering, lightweight computation. You add nodes, throughput climbs, p99 latency holds, everything looks like success. The graph of requests-per-second trends up and to the right and someone makes a slide about it.

The shared state is where your bottlenecks hide. Not the stateless tier. Not the load balancer. The shared parts: the single writer, the global lock, the config service that every pod hammers on startup, the session store that wasn't supposed to be stateful but became stateful over eighteen months of accumulated convenience.

Consider what actually happens inside a relational database under load. The write path isn't a pipeline that gets faster with more concurrent input — it's a serialized process that writes to a WAL, acquires row-level locks, flushes pages, updates indexes, and emits replication events, approximately in that order, approximately one batch at a time. You can tune buffer pool sizes and checkpoint intervals and connection pool settings, and all of that buys real headroom — but the fundamental serialization of writes is load-bearing in the consistency guarantees you're almost certainly relying on. You can't parallelize it away without changing what the database is.

So more API pods mean more concurrent connections, mean more concurrent write requests, mean more lock contention, mean more time each transaction spends waiting rather than executing, mean higher CPU burn on mutex overhead, mean slower individual transactions, mean longer queue depths, mean the exact inverse of what the autoscaler was supposed to deliver.

One e-commerce team I know about spent three quarters optimizing their checkout service — caching product lookups, moving to async order confirmation emails, replacing synchronous third-party payment polling with webhooks. They were diligent, thoughtful engineers. The checkout service became genuinely fast: 40ms p50, 120ms p99 under moderate load. Then they ran a Black Friday load test and watched the inventory database fall over at roughly 3× their target throughput. All that work on the checkout service had made it more efficient at generating writes to a table that couldn't absorb more writes. The optimization surfaced the real limit faster. That's not irony — that's Amdahl's Law working as specified.

The inventory table was the chokepoint because inventory management has an inherent consistency requirement that their product owners correctly insisted on: you cannot oversell. You cannot allow two concurrent purchases of the last unit. This is a serializable constraint, and serializable constraints require serialization somewhere in the stack. You can push that serialization around, but you can't make it disappear. What the team needed wasn't faster checkout — they needed to think hard about where the serialization lived and whether it was doing unnecessary work.

In their case, much of it was. The inventory update was holding a row lock for the entire duration of a downstream API call to a fulfillment service, a call that averaged 200ms and occasionally spiked to two seconds during peak load. The lock existed because some engineer, probably under time pressure, had placed the fulfillment call inside the transaction boundary. There was no good reason for it to be there. Extracting it (making the inventory decrement optimistic, publishing a fulfillment event to Kafka, handling the rare cases of downstream failure with compensating transactions) reduced their typical lock hold times from ~200ms to ~2ms. Suddenly the writer wasn't the bottleneck anymore, because it was no longer holding a row lock for 200ms, and sometimes two seconds, on every transaction.
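A rough back-of-the-envelope check shows why the hold time matters so much. If every write to a contended row has to take the same exclusive lock, that row can accept at most 1 / hold_time transactions per second, no matter how many pods are sending them. The sketch below assumes a single hot row and ignores every other cost, so treat the numbers as an upper bound, not a prediction.

```python
def max_serialized_tps(lock_hold_seconds: float) -> float:
    """Upper bound on transactions/second through one exclusive lock."""
    if lock_hold_seconds <= 0:
        raise ValueError("lock_hold_seconds must be positive")
    return 1.0 / lock_hold_seconds


if __name__ == "__main__":
    scenarios = (
        ("fulfillment call inside the transaction", 0.200),
        ("fulfillment call moved out", 0.002),
    )
    for label, hold in scenarios:
        print(f"{label}: at most {max_serialized_tps(hold):.0f} tx/s per hot row")
```

Under those assumptions, a 200ms hold caps a hot row at about 5 transactions per second, while a 2ms hold raises the cap to about 500.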

This is the kind of change that isn't visible in an architecture diagram but matters enormously at runtime. The diagram still showed "DB" in a box.

The vocabulary of distributed systems is full of words that promise more than they deliver. "Microservices" suggests independence; in practice, microservices that share a database or coordinate on every request are a distributed monolith with extra latency and a much harder debugging story. "Event-driven" suggests decoupling; in practice, synchronous consumers on an event bus are just RPC with more steps. "Stateless" suggests horizontal scalability; in practice, stateless services frequently externalize their state into a shared Redis instance that becomes the new bottleneck, just with different error messages.

None of these patterns are wrong. They're right, conditionally. The condition is usually something like "and also you've correctly identified where your coordination actually lives and designed that part deliberately." Engineers skip the second half of that sentence.

Sticky sessions are an instructive case. The naive cure for session stickiness is to move sessions into Redis. Fine, that works. But now you've traded CPU spikes on individual app servers for hot keys in Redis if any session becomes pathologically large or pathologically active. The right design usually involves making the session genuinely stateless — encoding enough authorization and preference context into a signed JWT to avoid the session lookup entirely on most requests. That's not always possible; sometimes session state is legitimately large or mutable. But the question is worth asking before you build the Redis cluster.

Similarly, read replicas are a classic scaling lever for read-heavy workloads, and they genuinely work. But they introduce replication lag, which means reads can return stale data, which means some fraction of your application logic is now reasoning about eventually-consistent state whether you've accounted for that or not. If your application was written assuming strong consistency — if the code does a write followed immediately by a read and expects the read to reflect the write — you've now introduced a subtle, intermittent, load-dependent bug. It will not appear in development. It will not appear in staging. It will appear in production at 2am during a traffic spike, and it will look like a race condition in code that hasn't changed in six months.
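The write-then-read hazard is easy to reproduce in a toy model. The sketch below uses in-memory stand-ins for a primary and a lagging replica; every class and method name is hypothetical, and real databases expose replication positions in their own ways. The `read` method shows one common mitigation under these assumptions: serve a session's reads from the primary until the replica has applied that session's last write.

```python
class Primary:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) writes

    def write(self, key, value) -> int:
        self.data[key] = value
        self.log.append((key, value))
        return len(self.log)  # position of this write in the log


class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self):
        """Apply log entries not yet replicated (simulates lag ending)."""
        for key, value in self.primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.primary.log)


class Session:
    """Tracks this session's last write so reads can avoid a stale replica."""

    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica
        self.last_write_pos = 0

    def write(self, key, value):
        self.last_write_pos = self.primary.write(key, value)

    def naive_read(self, key):
        return self.replica.data.get(key)  # may be stale

    def read(self, key):
        if self.replica.applied >= self.last_write_pos:
            return self.replica.data.get(key)
        return self.primary.data.get(key)  # replica is behind this session


if __name__ == "__main__":
    primary = Primary()
    session = Session(primary, Replica(primary))
    session.write("order:1", "paid")
    print(session.naive_read("order:1"), session.read("order:1"))
```

The routing rule trades some read offload for correctness: sessions that have just written fall back to the primary, which is exactly the load the replica was supposed to remove, so it only helps when such reads are a small share of traffic.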

Eventual consistency is not free. It relocates complexity from the database to the application tier and from the infrastructure layer to the reasoning layer. Whether that trade-off is worth it depends on the workload and the team's ability to reason clearly about distributed state. Pretending the trade-off doesn't exist is how you get six-month-long debugging sagas.

CQRS and Event Sourcing are tools that genuinely address the scalability problem, and they're frequently misapplied in ways that create new versions of the same problem. The insight behind CQRS, which is to separate your write model from your read model, is sound: writes need consistency and coordination; reads need low latency and high throughput. Why force the same data model to serve both masters? Separate them, let each be optimized for its actual use case, and project the write model into whatever read model shape serves the query patterns.

But. CQRS requires that you process events in order, reliably, with at-least-once delivery and idempotent consumers. Building that correctly is a real engineering problem. The event bus becomes a critical dependency. The projection code becomes complex and fragile. Debugging is harder because the causal chain between a write and its visible effect now crosses multiple services, multiple queues, potentially multiple data stores. Teams that adopt CQRS because they read a blog post about scalability often end up with lower throughput and far more operational complexity than the write-heavy Postgres setup they replaced.
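To illustrate what "idempotent consumer" means under at-least-once delivery, here is a minimal in-memory sketch of a projection that ignores redelivered events by tracking event IDs. The event shape and class names are hypothetical, and the sketch leaves out ordering and durability. In a real system the deduplication record and the projection update would need to be committed atomically.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OrderPlaced:
    event_id: str
    sku: str
    quantity: int


@dataclass
class UnitsSoldProjection:
    units_sold: dict = field(default_factory=dict)
    seen_event_ids: set = field(default_factory=set)

    def handle(self, event: OrderPlaced) -> bool:
        """Apply the event once; return False if it was a redelivery."""
        if event.event_id in self.seen_event_ids:
            return False  # already applied, so applying again would double-count
        self.units_sold[event.sku] = self.units_sold.get(event.sku, 0) + event.quantity
        self.seen_event_ids.add(event.event_id)
        return True


if __name__ == "__main__":
    projection = UnitsSoldProjection()
    first = OrderPlaced("evt-1", "widget", 2)
    second = OrderPlaced("evt-2", "widget", 1)
    for event in (first, second, first):  # the bus redelivers evt-1
        projection.handle(event)
    print(projection.units_sold)
```

Even this toy version hints at the operational cost: the set of seen IDs grows without bound unless something prunes it, which is one more piece of state the team now owns.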

The pattern is right. The prerequisite is honest assessment of whether your team has the operational maturity to run it. That prerequisite is usually glossed over.

What would a careful builder actually do on Monday morning?

First: identify the serial fraction. Not by intuition, by measurement. Distributed tracing — Jaeger, Zipkin, Honeycomb, whatever your stack supports — will show you exactly which spans are sequential and which are parallelizable. You're looking for long chains where each span depends on the completion of the previous one, especially if those chains cross service boundaries. A checkout flow that does: auth token validate → cart fetch → inventory lock → payment charge → inventory decrement → order record → fulfillment event, all sequentially, has a serial fraction close to 1.0 regardless of how many pods the stateless services run on.
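As a crude proxy for what a trace viewer shows, the sketch below takes the (start, end) times of a request's leaf spans and computes the share of busy wall-clock time during which exactly one span was active. It assumes parent and root spans have already been filtered out, and the function name is hypothetical. This is not how any particular tracing tool defines the metric.

```python
def serial_fraction(spans):
    """Share of busy wall-clock time with exactly one active span.

    spans: iterable of (start, end) pairs for leaf spans, in any time unit.
    """
    events = []
    for start, end in spans:
        if end <= start:
            raise ValueError("each span must end after it starts")
        events.append((start, 1))
        events.append((end, -1))
    # At equal timestamps, -1 sorts before +1, so back-to-back spans do not overlap.
    events.sort()

    active = 0
    serial_time = busy_time = 0.0
    previous = None
    for timestamp, delta in events:
        if previous is not None and active > 0:
            elapsed = timestamp - previous
            busy_time += elapsed
            if active == 1:
                serial_time += elapsed
        active += delta
        previous = timestamp
    return serial_time / busy_time if busy_time else 0.0


if __name__ == "__main__":
    sequential_checkout = [(0, 10), (10, 40), (40, 90)]  # each step waits for the last
    fan_out = [(0, 10), (10, 20), (10, 20)]              # one step, then two in parallel
    print(serial_fraction(sequential_checkout), serial_fraction(fan_out))
```

A fully sequential chain scores 1.0 and a fully overlapped fan-out scores 0.0. Anything in between points at the spans worth examining first.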

Second: scrutinize lock hold times. Every lock in your system is a serialization point. How long does it hold? Is it holding longer than it needs to because work that could happen outside the transaction is happening inside it? This is the most common and most tractable bottleneck — not architecture, just discipline about transaction scope.
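Here is a minimal sketch of tight transaction scope, using SQLite from the standard library as a stand-in for a production database. The table layout and the `notify_fulfillment` callable are hypothetical. The point is only that the slow downstream call runs after the commit, so nothing is locked while it waits.

```python
import sqlite3


def reserve_unit(conn: sqlite3.Connection, sku: str) -> bool:
    """Decrement stock in one short transaction; never go below zero."""
    with conn:  # commits on success, rolls back on exception
        cursor = conn.execute(
            "UPDATE inventory SET quantity = quantity - 1 "
            "WHERE sku = ? AND quantity > 0",
            (sku,),
        )
        return cursor.rowcount == 1


def checkout(conn: sqlite3.Connection, sku: str, notify_fulfillment) -> bool:
    if not reserve_unit(conn, sku):
        return False  # sold out, nothing was reserved
    # The transaction is already committed here; no lock is held during this call.
    notify_fulfillment(sku)
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER NOT NULL)")
    conn.execute("INSERT INTO inventory VALUES ('widget', 1)")
    conn.commit()
    print(checkout(conn, "widget", print), checkout(conn, "widget", print))
```

The trade-off is that a failed notification now happens after stock has been decremented, so the caller needs a retry or a compensating step to release the unit.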

Third: question the consistency requirements. Not to eliminate them — some consistency requirements are load-bearing, non-negotiable, correct. But some have been inherited from an earlier version of the system where they made sense and now they're just habit. "We've always locked that row" is not a consistency requirement. It's a pattern. The requirement is the business invariant underneath it. Sometimes you can satisfy the same invariant with an optimistic approach: attempt the update assuming no conflict, detect conflict after the fact, retry or compensate. That works when conflicts are rare. Whether conflicts are rare is an empirical question.
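One common way to express "attempt, detect conflict, retry" is a version column checked at write time. The sketch below uses SQLite as a stand-in; the schema, function name, and retry limit are hypothetical, and it is one possible shape for the idea, not a prescription.

```python
import sqlite3


class TooManyConflicts(Exception):
    """Raised when every attempt lost the race to another writer."""


def update_quantity_optimistically(conn, sku, transform, max_attempts=3):
    """Read, compute without holding a lock, then write only if the row is unchanged."""
    for _ in range(max_attempts):
        row = conn.execute(
            "SELECT quantity, version FROM inventory WHERE sku = ?", (sku,)
        ).fetchone()
        if row is None:
            raise KeyError(sku)
        quantity, version = row
        new_quantity = transform(quantity)  # arbitrary work, done outside any transaction
        with conn:  # short write transaction
            cursor = conn.execute(
                "UPDATE inventory SET quantity = ?, version = version + 1 "
                "WHERE sku = ? AND version = ?",
                (new_quantity, sku, version),
            )
        if cursor.rowcount == 1:
            return new_quantity  # nobody changed the row in between
        # Conflict: another writer bumped the version, so re-read and try again.
    raise TooManyConflicts(sku)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (sku TEXT PRIMARY KEY, quantity INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 10, 1)")
    conn.commit()
    print(update_quantity_optimistically(conn, "widget", lambda q: q - 1))
```

Counting how often the retry branch fires in production is one way to answer the empirical question: if retries are frequent, the optimistic approach is burning work and a lock may be the cheaper option.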

Fourth: load test both tiers. Load test the stateless tier and the database tier separately, and then together. You want to know the saturation point of each component independently. If the DB saturates at 5,000 writes/second and your app tier can already generate 20,000 writes/second with eight pods, adding a ninth pod doesn't raise throughput. It just adds to your cloud bill.
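The arithmetic behind that example can be written down directly. The sketch below assumes the app tier scales linearly, so eight pods generating 20,000 writes/second means about 2,500 per pod. That per-pod figure is an inference from the numbers above rather than a measurement, and the function names are illustrative.

```python
import math


def effective_write_throughput(pods: int, writes_per_pod: float, db_max_writes: float) -> float:
    """System throughput is capped by whichever tier saturates first."""
    return min(pods * writes_per_pod, db_max_writes)


def pods_until_db_saturates(writes_per_pod: float, db_max_writes: float) -> int:
    """Smallest pod count that can already saturate the database."""
    return math.ceil(db_max_writes / writes_per_pod)


if __name__ == "__main__":
    per_pod = 20_000 / 8  # assumed linear scaling of the app tier
    for pods in (1, 2, 8, 9):
        print(pods, "pods ->", effective_write_throughput(pods, per_pod, 5_000), "writes/s")
    print("DB saturated from", pods_until_db_saturates(per_pod, 5_000), "pods onward")
```

Under these assumptions the database is already saturated at two pods, so pods three through nine add cost without adding write throughput.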

Fifth: don't add complexity before you understand the current limit. The temptation when facing a scalability problem is to reach for the sophisticated tool — partition the data, add a Kafka topic, introduce CQRS. Sometimes that's right. More often there's a simpler change that doubles throughput without increasing operational surface area: a missing index, an unnecessary serializable isolation level on a read-mostly query, a connection pool set to 10 connections when the DB can handle 200. Do the boring work first. Save the interesting work for problems that actually require it.

There's a version of this essay that ends with a tidy framework, a numbered list of scalability principles, a checklist you can print and post near the whiteboard. I'm skeptical of that version. Scalability problems are context-specific in ways that make generalized frameworks feel reassuring but behave unreliably. The e-commerce team's problem was lock hold time. The next team's problem might be connection pool exhaustion, or hot shards in a Cassandra cluster, or a chatty service graph where adding instances multiplies cross-node traffic superlinearly, or a configuration service that becomes a synchronization bottleneck as pods start up, or something nobody has named yet.

What generalizes is the habit of thought: assume there's a serial fraction; find it; measure it; shrink it before adding parallelism. Everything else is execution.

The "adding servers" instinct isn't wrong — it's incomplete. Horizontal scaling works. It just doesn't work everywhere, uniformly, automatically, without understanding which parts of your system are fundamentally not horizontal. The servers multiply. The bottleneck doesn't care.

Scalability

Opinions expressed by DZone contributors are their own.
