Engineering Capacity Plans for Load-Shedding in High-Demand Enterprise Apps

This article provides a practical approach to capacity planning and demonstrates load-shedding patterns for large-scale enterprise applications.

Amit Kumar Padhy

Mar. 27, 26 · Analysis

Large-scale enterprise applications typically comprise many microservices deployed across multiple cloud providers and geographic regions. During a high-demand period (e.g., a peak campaign), the most significant engineering challenge is not simply to avoid slowing down, but to "be correct when correctness matters," "degrade gracefully when correctness doesn't matter," and "recover reliably and predictably."

Below, I present a practical approach to capacity planning and demonstrate load-shedding patterns for large-scale enterprise applications, based on historical campaign behavior, with examples and data points drawn from previous campaigns.

1. Capacity Planning: Prioritize the Critical Paths Through Your System Over the Overall Average TPS

Traditional capacity planning does not suit campaign-style events because it treats the overall system as one entity and relies on averages. Instead, treat your system as a collection of "critical paths": ordered sets of services that must function properly to drive revenue and keep users safe.

Below are examples of critical paths in an e-commerce application:

Browse path: CDN -> Edge -> Product API -> Catalog -> Pricing Preview -> Recommendations
Cart path: Edge -> Cart -> Inventory -> Promotions -> Pricing -> Tax
Checkout path (highest priority): Edge -> Checkout -> Payments -> Fraud -> Order -> Confirmation

Determine the SLOs, budgets (for latency and concurrency), and dependencies for each path.

SLOs determine the acceptable performance of each path (e.g., Checkout P95 < 1.5 seconds; error rate < 0.1% during peak).
Budgets describe the amount of latency that may occur on each path and the amount of concurrency that may be supported by each service (maximum number of concurrent requests).
Dependencies are the systems needed to allow the flow (payment gateways, fraud vendors, databases, queues).

This approach helps identify the parts of your architecture that need to be over-provisioned and the parts that can rely on elastic scaling.
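As a minimal sketch of how a critical path and its budgets might be written down, the snippet below models the checkout path as data. The class names, per-hop latency budgets, and concurrency figures are hypothetical values chosen for illustration; only the path order and the 1.5-second P95 target come from the text above. Summing per-hop P95 budgets is a conservative planning shortcut, not a statistically exact path P95.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Hop:
    service: str
    p95_budget_ms: int     # latency this hop is allowed to consume (assumed)
    max_concurrency: int   # concurrent requests this hop can safely serve (assumed)


@dataclass(frozen=True)
class CriticalPath:
    name: str
    slo_p95_ms: int
    hops: Tuple[Hop, ...]

    def budget_total_ms(self) -> int:
        # Conservative approximation: add up the per-hop latency budgets.
        return sum(h.p95_budget_ms for h in self.hops)

    def within_slo(self) -> bool:
        return self.budget_total_ms() <= self.slo_p95_ms

    def bottleneck(self) -> Hop:
        # The hop with the lowest safe concurrency caps the whole path.
        return min(self.hops, key=lambda h: h.max_concurrency)


# Hypothetical numbers for the checkout path described above.
checkout = CriticalPath(
    name="checkout",
    slo_p95_ms=1500,
    hops=(
        Hop("edge", 50, 20000),
        Hop("checkout", 200, 4000),
        Hop("payments", 600, 1500),
        Hop("fraud", 400, 1200),
        Hop("order", 150, 3000),
        Hop("confirmation", 50, 5000),
    ),
)
```

With these assumed figures, the hop budgets fit inside the SLO, and the fraud dependency is the hop to provision (or protect) first.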

2. Multi-Region + Multi-Cloud: Account for Asymmetric Demand Across Regions and the Blast Radius of a Single Region's Failure

Campaign-style events typically produce demand asymmetrically across regions. A single region may receive a marketing push and see high demand while other regions see very little activity. Multi-cloud adds further asymmetry in scaling behavior, rate limits, network jitter, and operational tooling.

Successful strategies include:

Regional capacity envelopes: Establish firm target values for capacity per region (minimum, nominal, maximum) and threshold values for failovers.
Blast radius controls: Create a circuit breaker per region so that an overwhelmed region cannot trigger a cascading global failure.
Traffic steering policies: Use GSLB/Anycast/Traffic Manager to steer traffic toward regions that are healthy and have spare capacity.

Example (Illustrative): If Checkout in US-East is running at about 85% of the safe concurrency, start sending new checkout sessions to US-West while keeping all current checkout sessions sticky to avoid losing carts and checkout progress.
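The steering rule in this example could be sketched as follows. The function name, the region identifiers, and the shape of the utilization map are assumptions for illustration; the 85% threshold and the sticky-session behavior come from the example above.

```python
from typing import Dict, Optional, Sequence

SPILLOVER_THRESHOLD = 0.85  # fraction of safe concurrency, per the example above


def route_new_checkout(
    home: str,
    utilization: Dict[str, float],
    fallbacks: Sequence[str],
    pinned_region: Optional[str] = None,
) -> str:
    """Pick a region for a checkout request (hypothetical helper)."""
    # Existing sessions stay sticky so carts and checkout progress are not lost.
    if pinned_region is not None:
        return pinned_region
    # New sessions stay home while the home region has headroom.
    if utilization[home] < SPILLOVER_THRESHOLD:
        return home
    # Otherwise spill over to the least-loaded fallback that still has headroom.
    candidates = [r for r in fallbacks if utilization.get(r, 1.0) < SPILLOVER_THRESHOLD]
    if not candidates:
        # Nowhere better to go: stay home and let admission control decide.
        return home
    return min(candidates, key=lambda r: utilization[r])
```

Unknown regions are treated as fully utilized, so a missing metric never attracts traffic.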

3. Service-Level Capacity: Design Services Around Concurrency Limits, Not Just CPU Utilization

In a microservices-based architecture, the most common bottlenecks during campaigns are usually:

Thread pools/event loops
Database connection pools
Downstream vendor call limits (per second, per minute, etc.)
Queue consumer lag
Cache miss storms

Service-level capacity plans should include both concurrency limits for each service and downstream call limits per second. For example, a service made up of 500 pods can still run out of capacity if each pod holds 50 open database connections and the database has a hard limit of 5,000 connections: the pods would ask for 500 × 50 = 25,000 connections, five times what the database allows (all numbers in this example are illustrative).
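The arithmetic behind this example is simple enough to keep next to the deployment configuration. The helper names below are hypothetical; the numbers are the illustrative ones from the paragraph above.

```python
def connection_demand(pods: int, conns_per_pod: int) -> int:
    """Total connections the service will try to open at full scale."""
    return pods * conns_per_pod


def max_pods(db_limit: int, conns_per_pod: int, reserved: int = 0) -> int:
    """Largest pod count that fits under the database ceiling.

    `reserved` is an assumed allowance for admin and migration sessions.
    """
    return (db_limit - reserved) // conns_per_pod


def max_pool_size(db_limit: int, pods: int, reserved: int = 0) -> int:
    """Largest per-pod pool that keeps a fixed pod count under the ceiling."""
    return (db_limit - reserved) // pods


demand = connection_demand(pods=500, conns_per_pod=50)   # 25,000 requested
fits = max_pods(db_limit=5000, conns_per_pod=50)         # 100 pods at most
pool = max_pool_size(db_limit=5000, pods=500)            # or 10 connections per pod
```

Either the pod count or the per-pod pool has to shrink; scaling pods alone makes the shortage worse.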

Useful Technique: Assign a Dependency Budget to Each Service:

Database connections: Hard ceiling
Redis operations/second: Hard ceiling
Vendor calls/second: Contracted ceiling

Next, develop load-shedding rules that trigger before the ceilings are reached.
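One way to express a dependency budget with an early trigger is sketched below. The class name, the 80% shedding threshold, and the three action labels are assumptions for illustration, not values from a specific system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyBudget:
    name: str
    ceiling: float          # hard or contracted limit for this dependency
    shed_at: float = 0.8    # assumed fraction of the ceiling where shedding starts

    def action(self, current: float) -> str:
        """Map current usage to an action (hypothetical labels)."""
        if current >= self.ceiling:
            return "reject"   # at the ceiling: refuse new work outright
        if current >= self.ceiling * self.shed_at:
            return "shed"     # approaching the ceiling: drop low-priority work
        return "allow"


db_budget = DependencyBudget("db_connections", ceiling=5000)
vendor_budget = DependencyBudget("vendor_calls_per_s", ceiling=300, shed_at=0.7)
```

The gap between `shed_at` and the ceiling is the room the system has to react before a hard limit turns into errors.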

4. Load-Shedding Strategies That Balance System Stability and Revenue Protection

Load shedding is not a single mechanism; it is a set of mechanisms working in concert so that the system fails in consistent and predictable ways. Good systems layer multiple load-shedding strategies and shed the least critical work first.

Strategy A: Admission Controls (Global and Service-Level)

Place a "gate" at the edge of the system (API Gateway/Service Mesh/Ingress), and at each of the critical services. Utilize token buckets keyed by:

Region
Tenant/Customer segment
Request type (Checkout/Browse)
Service dependency (e.g., "Promo Engine Calls")

Illustrative example: Limit "Apply Coupon" to 1000 RPS per region during peak hours while allowing checkout to continue without recalculation of coupons (use the last known valid price, or use a simplified set of rules).
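A token bucket keyed by region and request type could look like the sketch below. The class names and the choice to set the burst size equal to the per-second rate are assumptions; the 1000 RPS coupon limit is the illustrative figure from the example above. The clock is passed in explicitly so the behavior is easy to test.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucket:
    def __init__(self, rate: float, burst: float, now: float):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = burst
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """One bucket per (region, request type); unlisted request types are not limited."""

    def __init__(self, rps_by_type: Dict[str, float]):
        self.rps_by_type = rps_by_type
        self.buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, region: str, request_type: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        rate = self.rps_by_type.get(request_type)
        if rate is None:
            return True
        key = (region, request_type)
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(rate, burst=rate, now=now)
        return bucket.allow(now)


limiter = KeyedLimiter({"apply_coupon": 1000})
```

When `allow` returns False for a coupon request, the caller would fall back to the last known valid price rather than fail the checkout.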

Strategy B: Priority Queuing + Fairness

Not all requests are equal. Prioritize in this order, highest first:

Checkout confirmation
Payment authorization
Cart updates
Browse/recommendations

Ensure fairness within each priority level to prevent a single client/tenant from consuming all available capacity.

Illustrative example: An aggressively retrying B2B tenant could consume 60% of total available capacity; per-tenant fairness ensures that other tenants and consumer traffic remain protected.
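A minimal sketch of strict priority across request types with round-robin fairness across tenants inside each level is shown below. The class name, request-type keys, and tenant names are hypothetical; a production scheduler would also need bounds on queue length.

```python
from collections import OrderedDict, deque

# Lower number means higher priority (ordering taken from the list above).
PRIORITY = {"checkout_confirm": 0, "payment_auth": 1, "cart_update": 2, "browse": 3}


class FairPriorityQueue:
    def __init__(self):
        # priority level -> OrderedDict of tenant -> pending requests
        self.levels = {p: OrderedDict() for p in sorted(set(PRIORITY.values()))}

    def push(self, tenant: str, request_type: str, payload) -> None:
        level = self.levels[PRIORITY[request_type]]
        level.setdefault(tenant, deque()).append(payload)

    def pop(self):
        """Return (tenant, payload) for the next request, or None if empty."""
        for p in sorted(self.levels):
            level = self.levels[p]
            if not level:
                continue
            # Serve the tenant at the front, then move it to the back so
            # one noisy tenant cannot starve the others at this level.
            tenant, pending = next(iter(level.items()))
            payload = pending.popleft()
            del level[tenant]
            if pending:
                level[tenant] = pending
            return tenant, payload
        return None
```

For example, if a B2B tenant has three cart updates queued ahead of a single cart update from another tenant, the second tenant is served after one B2B request rather than after all three.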

Strategy C: Load Shedding Through Experience Degradation (Feature Flags)

Do not shed users; shed experiences. Degrade non-essential features first, for example:

Disable recommendations
Skip calls for image resizing and personalization
Serve a cached "price band" instead of an exact price breakdown
Combine multiple downstream calls into a single summary call
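Degradation of this kind is often driven by a small table that maps load levels to feature flags. The sketch below assumes hypothetical flag names and utilization thresholds; the point is only that features are switched off in a fixed, least-critical-first order.

```python
from typing import List, Set, Tuple

# (utilization threshold, feature flag to turn off); all values are assumed.
DEGRADATION_TIERS: List[Tuple[float, str]] = [
    (0.70, "recommendations"),
    (0.80, "personalization"),
    (0.90, "exact_price_breakdown"),   # fall back to a cached price band
]


def disabled_features(utilization: float) -> Set[str]:
    """Return the set of features to disable at the given utilization."""
    return {flag for threshold, flag in DEGRADATION_TIERS if utilization >= threshold}


def is_enabled(flag: str, utilization: float) -> bool:
    return flag not in disabled_features(utilization)
```

Checkout-critical features never appear in the table, so no load level can turn them off by accident.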

Strategy D: Layered Circuit Breaking and Bulkheading

Open circuits around failing dependency calls, and isolate resources so that failures cannot propagate through the entire application:

Thread pools for vendor calls versus internal calls
Connection pools for read versus write
Pod/node pools for checkout versus browse

Illustrative example: Vendor fraud calls show increased latency. Open the circuit breaker and switch to a "risk-based fallback" (low-risk transactions are allowed; high-risk transactions are queued asynchronously for review) so that timeouts do not spread across all checkouts.
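The fraud example can be sketched with a small consecutive-failure circuit breaker. The class, the thresholds, the risk cutoff, and the verdict strings are all assumptions for illustration, and the half-open handling is deliberately simplified (any call after the cool-down acts as a trial).

```python
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        # After the cool-down, let trial calls through (simplified half-open).
        return now - self.opened_at >= self.reset_after_s

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now   # open, or re-open after a failed trial


def fraud_decision(
    breaker: CircuitBreaker,
    call_vendor: Callable[[], str],
    risk_score: float,
    now: float,
    low_risk_cutoff: float = 0.2,
) -> str:
    if breaker.allow(now):
        try:
            verdict = call_vendor()
            breaker.record(True, now)
            return verdict
        except TimeoutError:
            breaker.record(False, now)
    # Risk-based fallback: approve low risk, queue everything else for review.
    return "approve" if risk_score < low_risk_cutoff else "queue_for_review"
```

While the breaker is open, the vendor is not called at all, which is what keeps vendor timeouts from tying up checkout threads.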

Strategy E: Queue First for Non-Interactive Work

Remove expensive operations from the request path:

Order enrichment
Dispatch email/SMS
Analytics events
Inventory reconciliation

Illustrative example: During a product launch, synchronous inventory reconciliation was moved to an idempotent async worker; checkout now places only a lightweight "hold."
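A rough in-memory sketch of the hold-then-reconcile pattern follows. The class and method names are hypothetical, and a real system would use a durable queue and datastore; the sketch only shows that checkout does cheap work and that the worker tolerates duplicate deliveries by keying on the order ID.

```python
from collections import deque
from typing import Callable, Dict, Set, Tuple


class Inventory:
    def __init__(self, stock: Dict[str, int]):
        self.available = dict(stock)                 # sku -> units not yet held
        self.holds: Dict[str, Tuple[str, int]] = {}  # order_id -> (sku, qty)
        self.reconciled: Set[str] = set()            # idempotency keys already processed
        self.jobs: deque = deque()                   # stand-in for a durable queue

    def checkout_hold(self, order_id: str, sku: str, qty: int) -> bool:
        """Cheap, synchronous step on the request path."""
        if order_id in self.holds:
            return True                              # repeated request: already held
        if self.available.get(sku, 0) < qty:
            return False
        self.available[sku] -= qty
        self.holds[order_id] = (sku, qty)
        self.jobs.append(order_id)                   # expensive work happens later
        return True

    def run_worker(self, reconcile: Callable[[str, str, int], None]) -> int:
        """Drain the queue; return how many orders were actually reconciled."""
        done = 0
        while self.jobs:
            order_id = self.jobs.popleft()
            if order_id in self.reconciled:
                continue                             # duplicate delivery: skip
            sku, qty = self.holds[order_id]
            reconcile(order_id, sku, qty)
            self.reconciled.add(order_id)
            done += 1
        return done
```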

5. How To Prevent Two Common Failures During Peaks

Failure 1: Retry Storms

During latency spikes, clients, gateways, and services all retry. Because retries at each layer multiply those of the layer above, load is amplified just when the system can least absorb it. Use:

Retry budgets for each request type
Bounded retries with jitter
Hedged requests only for safe idempotent reads
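A retry budget and jittered backoff could be sketched as follows. The 10% budget, the base and cap delays, and the names are assumptions; the backoff uses the common "full jitter" variant, which is one reasonable choice rather than the only one.

```python
import random


class RetryBudget:
    """Permit retries only while they stay within a percentage of primary requests."""

    def __init__(self, retry_percent: int = 10):
        self.retry_percent = retry_percent   # assumed budget: 10 retries per 100 requests
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        # Integer arithmetic avoids floating-point edge cases at the boundary.
        if (self.retries + 1) * 100 <= self.requests * self.retry_percent:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt: int, base_s: float = 0.1, cap_s: float = 2.0, rng=random) -> float:
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt)."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))
```

A caller would check `try_spend()` before each retry and sleep for `backoff_delay(attempt)`; when the budget is exhausted, the request fails fast instead of adding load. This sketch counts over the lifetime of the object; a real budget would use a sliding window.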

Failure 2: Cache Miss Storms ("Thundering Herd")

If many cache keys expire at the same time, tens of thousands of requests will stampede the backend. Use:

Single-flight request coalescing
Staggered TTLs (jitter)
Stale-while-revalidate
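The first two techniques can be sketched in a few lines. The function and class names are hypothetical, and the 10% TTL spread is an assumed value; stale-while-revalidate is left out to keep the sketch short.

```python
import random
import threading


def jittered_ttl(base_s: float, spread: float = 0.1, rng=random) -> float:
    """Spread expirations over base_s +/- spread so keys do not expire together."""
    return base_s * (1.0 + rng.uniform(-spread, spread))


class SingleFlight:
    """Coalesce concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}   # key -> shared state for the load in progress

    def do(self, key, loader):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "value": None, "error": None}
                self._inflight[key] = entry
        if not leader:
            # Followers wait for the leader and share its result or its error.
            entry["done"].wait()
            if entry["error"] is not None:
                raise entry["error"]
            return entry["value"]
        try:
            entry["value"] = loader()
            return entry["value"]
        except Exception as exc:
            entry["error"] = exc
            raise
        finally:
            with self._lock:
                del self._inflight[key]
            entry["done"].set()
```

On a cache miss, every request for the same key calls `do(key, loader)`; only the first one reaches the backend, and the result is then written to the cache with a `jittered_ttl`.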

6. Establish an Actionable Peak Runbook

A successful peak plan needs a runbook that the team can actually execute:

T-7 days: Run synthetic load tests against critical paths; verify dependency budgets
T-24 hours: Pre-scale the baseline; warm up caches; enable campaign feature flags
T-0 (peak): Enable admission control thresholds; monitor concurrency, queue lag, and vendor latencies
T+1 hour: Gradually scale down from peak; monitor for delayed queue replays

Define success as controlled degradation, not perfect throughput.

Conclusion

Capacity planning for multi-cloud, multi-region microservices requires discipline in deciding what to protect. Over-provisioning everything is too costly and rarely works. Identify your critical paths, apply dependency budgeting, and develop layered load shedding that protects your checkout and core flows while shedding non-essential experiences. The best campaign architecture does not just get through a peak; it makes behavior under peak traffic predictable and repeatable.

Opinions expressed by DZone contributors are their own.