Engineering Capacity Plans for Load-Shedding in High-Demand Enterprise Apps
This article provides a practical approach to capacity planning and demonstrates load-shedding patterns for large-scale enterprise applications.
Large-scale enterprise applications typically have many microservices deployed across multiple cloud providers and geographic regions. During a high-demand period (e.g., a peak campaign), the most significant engineering challenge for these applications is not "don't slow down," but rather: "be correct when correctness matters," "be gracious when correctness doesn't matter," and "recover reliably and predictably."
Below, I present a practical approach to capacity planning and proven load-shedding patterns for large-scale enterprise applications, based on historical campaign behavior, with examples and illustrative data points from previous campaigns.
1. Capacity Planning: Prioritize the Critical Paths Through Your System Over the Overall Average TPS
Traditional capacity planning does not lend itself well to campaign-style activities because it treats your overall system as one entity and relies on averages. However, you should treat your system as a collection of "critical paths": ordered sets of services that need to function properly to drive revenue and keep users safe.
Below are examples of critical paths in an e-commerce application:
- Browse path: CDN -> Edge -> Product API -> Catalog -> Pricing Preview -> Recommendations
- Cart path: Edge -> Cart -> Inventory -> Promotions -> Pricing -> Tax
- Checkout path (highest priority): Edge -> Checkout -> Payments -> Fraud -> Order -> Confirmation
Determine the SLOs, budgets (for latency and concurrency), and dependencies for each path.
- SLOs determine the acceptable performance of each path (e.g., Checkout P95 < 1.5 seconds; error rate < 0.1% during peak).
- Budgets describe the amount of latency that may occur on each path and the amount of concurrency that may be supported by each service (maximum number of concurrent requests).
- Dependencies are the systems needed to allow the flow (payment gateways, fraud vendors, databases, queues).
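As a minimal sketch of how a path's SLO and per-service budgets could be written down and checked, the Python below models the checkout path. The class names, the per-hop latency shares, and the concurrency figures are all hypothetical. Summing per-hop budgets is used here only as a conservative allocation check, not as a statistical statement about P95 latency.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    service: str
    latency_budget_ms: int  # this hop's share of the path budget (hypothetical)
    max_concurrency: int    # concurrent requests this hop can safely hold


@dataclass(frozen=True)
class CriticalPath:
    name: str
    slo_p95_ms: int
    hops: tuple

    def budget_total_ms(self) -> int:
        # Total latency allocated across all hops on the path.
        return sum(h.latency_budget_ms for h in self.hops)

    def headroom_ms(self) -> int:
        # Negative headroom means the hop budgets already exceed the SLO.
        return self.slo_p95_ms - self.budget_total_ms()

    def bottleneck(self) -> Hop:
        # The hop with the lowest concurrency limit caps the whole path.
        return min(self.hops, key=lambda h: h.max_concurrency)


# Illustrative numbers only; the 1.5 s target mirrors the Checkout SLO example.
checkout = CriticalPath(
    name="checkout",
    slo_p95_ms=1500,
    hops=(
        Hop("edge", 50, 5000),
        Hop("checkout", 200, 2000),
        Hop("payments", 600, 800),
        Hop("fraud", 300, 1000),
        Hop("order", 200, 1500),
        Hop("confirmation", 100, 3000),
    ),
)

print(checkout.budget_total_ms(), checkout.headroom_ms(), checkout.bottleneck().service)
```

With these assumed numbers, the hop budgets leave little headroom under the SLO, and the payments hop is the concurrency bottleneck, which is where provisioning attention would go first.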
This approach helps identify where in your architecture you need to over-provision and where you can rely on elastic scaling instead.
2. Multi-Region + Multi-Cloud: Account for Asymmetric Demand Across Regions and the Blast Radius of a Single Region's Failure
Campaign-style events typically produce demand asymmetrically across regions. A single region may get a marketing push and therefore see high demand, while the other regions see very little activity. Additionally, multi-cloud introduces multiple levels of asymmetry regarding scalability behavior, rate limiting, network jitter, and different operational tooling.
Successful strategies include:
- Regional capacity envelopes: Establish firm target values for capacity per region (minimum, nominal, maximum) and threshold values for failovers.
- Blast radius controls: Create a circuit breaker per region so that an overwhelmed region cannot trigger a cascading failure across the others.
- Traffic steering policies: Use GSLB/Anycast/Traffic Manager to steer traffic toward healthy regions that still have capacity.
Example (Illustrative): If Checkout in US-East is running at about 85% of the safe concurrency, start sending new checkout sessions to US-West while keeping all current checkout sessions sticky to avoid losing carts and checkout progress.
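The steering rule in this example can be sketched as a small routing function. This is an assumption-laden illustration: the function name, the region identifiers, and the shape of the `utilization` input (current concurrency divided by safe concurrency, per region) are hypothetical, and a real deployment would express the same rule in its GSLB or traffic manager configuration.

```python
def pick_region(session_region, utilization,
                primary="us-east", overflow="us-west", threshold=0.85):
    """Choose a region for a checkout request.

    session_region: region an existing session is pinned to, or None if new.
    utilization: dict of region -> fraction of safe concurrency in use.
    """
    # Existing sessions stay sticky so carts and checkout progress are kept.
    if session_region is not None:
        return session_region
    # New sessions overflow only when the primary is hot and the overflow
    # region still has room; otherwise overflow would just move the problem.
    if utilization[primary] >= threshold and utilization[overflow] < threshold:
        return overflow
    return primary


load = {"us-east": 0.90, "us-west": 0.40}
print(pick_region(None, load))       # new session
print(pick_region("us-east", load))  # existing session stays put
```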
3. Service-Level Capacity: Plan Services Around Concurrency Limits, Not Just CPU Utilization
In a microservices-based architecture, the most common bottlenecks during campaigns are usually:
- Thread pools/event loops
- Database connection pools
- Downstream vendor call limits (per second, per minute, etc.)
- Queue consumer lag
- Cache miss storms
Service-level capacity plans should include both concurrency limits for each service and downstream call limits per second. For example, a service made up of 500 pods can still run out of capacity if each pod opens 50 database connections and the database has a hard limit of 5,000 connections: at full scale the pods would request 25,000 connections, five times the limit (all numbers in this example are illustrative).
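The arithmetic behind the pod example can be checked in a few lines. The variable names are hypothetical, and the figures are the illustrative ones from the paragraph above.

```python
pods = 500
connections_per_pod = 50
db_max_connections = 5_000

# Connections the fleet would try to open at full scale.
demand = pods * connections_per_pod

# How far demand exceeds the database's hard limit.
oversubscription = demand / db_max_connections

# Largest per-pod pool that keeps the whole fleet under the limit.
safe_pool_size = db_max_connections // pods

print(demand, oversubscription, safe_pool_size)
```

Under these assumptions, either the per-pod pool shrinks to 10 connections or a pooling layer sits between the pods and the database. Adding more pods makes the problem worse, not better.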
Useful Technique: Assign a Dependency Budget to Each Service:
- Database connections: Hard ceiling
- Redis operations/second: Hard ceiling
- Vendor calls/second: Contracted ceiling
Next, develop load-shedding rules that trigger before the ceilings are reached.
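One way to express "shed before the ceiling" is a budget object with a soft threshold below the hard limit. This is a sketch under assumptions: the class name, the 80% default, and the three action labels are hypothetical rather than part of any specific framework.

```python
from dataclasses import dataclass


@dataclass
class DependencyBudget:
    name: str
    ceiling: float        # hard or contracted limit for this dependency
    shed_at: float = 0.8  # fraction of the ceiling where shedding begins

    def action(self, current: float) -> str:
        """Decide what to do at the current usage level."""
        if current >= self.ceiling:
            return "reject"             # at the ceiling: admit nothing new
        if current >= self.ceiling * self.shed_at:
            return "shed_low_priority"  # near the ceiling: drop optional work
        return "admit"


db = DependencyBudget("db_connections", ceiling=5000)
for in_use in (3000, 4000, 5000):
    print(in_use, db.action(in_use))
```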
4. Strategies for Balancing Load-Shedding With Both System Stability and Revenue Protection
Load shedding is not a single mechanism; it is a set of mechanisms working in concert to make failure modes consistent and predictable. Good systems layer multiple load-shedding strategies and shed the least critical work first.
Strategy A: Admission Controls (Global and Service-Level)
Place a "gate" at the edge of the system (API Gateway/Service Mesh/Ingress), and at each of the critical services. Utilize token buckets keyed by:
- Region
- Tenant/Customer segment
- Request type (Checkout/Browse)
- Service dependency (e.g., "Promo Engine Calls")
Illustrative example: Limit "Apply Coupon" to 1000 RPS per region during peak hours while allowing checkout to continue without recalculation of coupons (use the last known valid price, or use a simplified set of rules).
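A keyed token bucket for the coupon example might look like the sketch below. Everything here is illustrative: the class and function names are hypothetical, time is passed in explicitly to keep the example deterministic, and a production limiter would usually live in the gateway or a shared store rather than in process memory.

```python
class TokenBucket:
    def __init__(self, rate_per_s, capacity, now=0.0):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now, cost=1.0):
        # Refill in proportion to elapsed time, never above capacity.
        if now > self.updated:
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyedLimiter:
    """One bucket per key, e.g. (region, request_type)."""

    def __init__(self, limits):
        self.limits = limits  # key -> (rate_per_s, capacity)
        self.buckets = {}

    def allow(self, key, now):
        if key not in self.limits:
            return True  # keys without a configured limit are not gated
        if key not in self.buckets:
            rate, capacity = self.limits[key]
            self.buckets[key] = TokenBucket(rate, capacity, now)
        return self.buckets[key].allow(now)


def coupon_price(limiter, region, now, recalculate, last_known_price):
    # When the coupon gate is closed, checkout continues on the last valid price.
    if limiter.allow((region, "apply_coupon"), now):
        return recalculate()
    return last_known_price


limiter = KeyedLimiter({("us-east", "apply_coupon"): (1000, 1000)})
print(coupon_price(limiter, "us-east", 0.0, lambda: 90.0, 100.0))
```

The design choice worth noting is that the limiter rejects only the coupon recalculation, not the checkout itself, so the gate degrades the price calculation while keeping the revenue path open.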
Strategy B: Priority Queuing + Fairness
Not all requests are equal. Prioritize, from highest to lowest:
- Checkout confirmation
- Payment authorization
- Cart updates
- Browse/recommendations
Ensure fairness among the same priority level to prevent a single client/tenant from consuming all available capacity.
Illustrative example: An aggressive B2B tenant stuck in a retry loop could consume 60% of total available capacity; per-tenant fairness ensures that other tenants and consumer traffic remain protected.
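Priority plus fairness can be sketched as strict priority between request classes and round-robin between tenants within a class. The class name and the request-type labels are hypothetical. Strict priority means lower classes can starve under sustained load, which in this context is the intended shedding behavior.

```python
from collections import OrderedDict, deque

# Lower number = higher priority (labels are illustrative).
PRIORITY = {"checkout_confirm": 0, "payment_auth": 1, "cart_update": 2, "browse": 3}


class FairPriorityQueue:
    def __init__(self):
        # Per priority level: tenant -> queue of that tenant's pending items.
        self.levels = [OrderedDict() for _ in PRIORITY]

    def put(self, request_type, tenant, item):
        level = self.levels[PRIORITY[request_type]]
        level.setdefault(tenant, deque()).append(item)

    def get(self):
        """Return the next item, or None if the queue is empty."""
        for level in self.levels:
            if level:
                # Serve the tenant at the front of the rotation...
                tenant, pending = level.popitem(last=False)
                item = pending.popleft()
                # ...then send it to the back if it still has work queued.
                if pending:
                    level[tenant] = pending
                return item
        return None


q = FairPriorityQueue()
for i in range(3):
    q.put("cart_update", "noisy-b2b", f"noisy-{i}")
q.put("cart_update", "consumer", "consumer-0")
q.put("checkout_confirm", "consumer", "confirm-0")
print([q.get() for _ in range(5)])
```

In this usage, the checkout confirmation is served first, and the single consumer request is served after only one request from the noisy tenant rather than after all three.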
Strategy C: Load Shedding Through Experience Degradation (Feature Flags)
Do not shed users; shed experiences. Degrade non-essential features first, for example:
- Disable recommendations
- Reduce image sizes and skip personalization calls
- Serve a cached "price band" instead of an exact price breakdown
- Combine multiple downstream calls into one summary call
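A degradation ladder driven by feature flags could be sketched as follows. The thresholds, flag names, and the `product` dictionary shape are hypothetical; in practice, the flags would come from whatever feature-flag system is already in place.

```python
# (load threshold, feature switched off at or above that load); illustrative values.
DEGRADATION_LADDER = [
    (0.70, "recommendations"),
    (0.80, "personalization"),
    (0.90, "exact_price_breakdown"),
]


def disabled_features(load):
    """Features to switch off at the given load (fraction of safe capacity)."""
    return {feature for threshold, feature in DEGRADATION_LADDER if load >= threshold}


def render_product(product, load):
    off = disabled_features(load)
    page = {"id": product["id"]}
    # Under heavy load, serve the cached price band instead of the full breakdown.
    if "exact_price_breakdown" in off:
        page["price"] = product["price_band"]
    else:
        page["price"] = product["price_breakdown"]
    # Recommendations are the first experience to go.
    if "recommendations" not in off:
        page["recommendations"] = product["recommendations"]
    return page


product = {
    "id": "sku-1",
    "price_band": "$20-$25",
    "price_breakdown": {"base": 20.0, "tax": 1.6},
    "recommendations": ["sku-2", "sku-3"],
}
print(render_product(product, load=0.95))
```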
Strategy D: Layered Circuit Breaking and Bulkheading
Open circuits when dependency calls fail, and segregate resource usage so that failures do not propagate through the entire application:
- Thread pools for vendor calls versus internal calls
- Connection pools for read versus write
- Pod/node pools for checkout versus browse
Illustrative example: Fraud-vendor calls show increased latency. Open the circuit breaker and switch to a "risk-based fallback" (low-risk transactions are allowed, high-risk transactions are queued asynchronously for review) so that timeouts do not spread across all checkouts.
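The fraud-vendor scenario can be sketched with a small circuit breaker and a risk-based fallback. This is a simplified illustration: the class, the function names, the amount-based risk rule, and the thresholds are hypothetical, and the half-open state here simply lets calls through again after the reset interval instead of limiting them to a single probe.

```python
class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def is_open(self, now):
        if self.opened_at is None:
            return False
        # After the reset interval, let trial calls through (half-open).
        return now - self.opened_at < self.reset_after_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open, or re-open after a failed trial


def check_fraud(order, breaker, vendor_call, now, low_risk_limit=100.0):
    """Call the vendor when the circuit allows it; otherwise fall back by risk."""
    if not breaker.is_open(now):
        try:
            verdict = vendor_call(order)
            breaker.record_success()
            return verdict
        except TimeoutError:
            breaker.record_failure(now)
    # Risk-based fallback: small orders proceed, larger ones await async review.
    if order["amount"] <= low_risk_limit:
        return "allow"
    return "queue_for_review"


def slow_vendor(order):
    raise TimeoutError("fraud vendor timed out")


breaker = CircuitBreaker(failure_threshold=2)
print(check_fraud({"amount": 40.0}, breaker, slow_vendor, now=0.0))
print(check_fraud({"amount": 900.0}, breaker, slow_vendor, now=1.0))
```

Once the breaker is open, checkout requests skip the vendor call entirely, so a slow vendor costs each request no time at all until the reset interval passes.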
Strategy E: Queue First for Non-Interactive Work
Remove expensive operations from the request path:
- Order enrichment
- Dispatch email/SMS
- Analytics events
- Inventory reconciliation
Illustrative example: Synchronous inventory reconciliation during a product launch was moved to an idempotent async worker, while checkout uses a lightweight "hold."
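A minimal sketch of the "hold now, reconcile later" pattern is shown below, using an in-process queue as a stand-in for a real message broker. The class and method names are hypothetical. The order ID serves as the idempotency key, so a redelivered message does not decrement stock twice.

```python
import queue


class Inventory:
    def __init__(self, stock):
        self.stock = dict(stock)   # sku -> units on hand
        self.holds = {}            # sku -> units held by in-flight checkouts
        self.jobs = queue.Queue()  # stand-in for a durable message queue
        self.processed = set()     # order IDs that were already reconciled

    def place_hold(self, order_id, sku, qty):
        """Cheap, synchronous step on the checkout path."""
        available = self.stock[sku] - self.holds.get(sku, 0)
        if available < qty:
            return False
        self.holds[sku] = self.holds.get(sku, 0) + qty
        self.jobs.put((order_id, sku, qty))  # reconciliation happens later
        return True

    def drain(self):
        """Worker loop body: reconcile queued holds off the request path."""
        while True:
            try:
                order_id, sku, qty = self.jobs.get_nowait()
            except queue.Empty:
                return
            if order_id in self.processed:
                continue  # duplicate delivery: already applied, skip
            self.stock[sku] -= qty
            self.holds[sku] -= qty
            self.processed.add(order_id)


inv = Inventory({"sku-1": 5})
inv.place_hold("order-1", "sku-1", 2)
inv.jobs.put(("order-1", "sku-1", 2))  # simulate a redelivered message
inv.drain()
print(inv.stock, inv.holds)
```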
5. How To Prevent Two Common Failures During Peaks
Failure 1: Retry Storms
During latency spikes, clients retry, gateways retry, and services retry, multiplying the load at every layer. Use:
- Retry budgets for each request type
- Bounded retries with jittered backoff
- Hedged requests only for safe, idempotent reads
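A retry budget and a capped, jittered backoff can be sketched as below. The names and default values are hypothetical. The budget here counts over the lifetime of the object for simplicity, whereas a production version would track a sliding time window.

```python
import random


class RetryBudget:
    """Permit retries only while they stay under a fraction of requests seen."""

    def __init__(self, ratio=0.1):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Allow the retry only if it keeps retries within ratio * requests.
        if self.retries + 1 <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base_s=0.1, cap_s=2.0, rng=random):
    """Full jitter: uniform delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * 2 ** attempt))


budget = RetryBudget(ratio=0.25)
for _ in range(8):
    budget.record_request()
print([budget.try_spend() for _ in range(3)])
print(backoff_delay(3) <= 0.8)
```

The cap bounds the worst-case wait, while the jitter spreads retries out so that clients that failed together do not retry together.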
Failure 2: Cache Miss Storms ("Thundering Herd")
If many cache keys expire at the same time, tens of thousands of requests can stampede the backend. Use:
- Single-flight request coalescing
- Staggered TTLs (jitter)
- Stale-while-revalidate
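The three techniques can be combined in one small cache, sketched below with hypothetical names. It is deliberately simplified: the first caller to find an expired key refreshes it synchronously, while concurrent callers get the stale value if one exists or wait for the refresh if not. A fuller stale-while-revalidate implementation would also refresh in the background. The `now` parameter is passed in to keep the example deterministic.

```python
import random
import threading


class SingleFlightCache:
    def __init__(self, loader, ttl_s=60.0, jitter=0.1, rng=random):
        self.loader = loader
        self.ttl_s = ttl_s
        self.jitter = jitter  # TTLs vary by +/- this fraction
        self.rng = rng
        self.lock = threading.Lock()
        self.entries = {}     # key -> (value, expires_at)
        self.inflight = {}    # key -> Event set when the load finishes

    def get(self, key, now):
        with self.lock:
            entry = self.entries.get(key)
            if entry and now < entry[1]:
                return entry[0]  # fresh hit
            event = self.inflight.get(key)
            leader = event is None
            if leader:
                event = self.inflight[key] = threading.Event()
            elif entry:
                return entry[0]  # stale value while another caller refreshes
        if leader:
            try:
                value = self.loader(key)  # the only backend call for this key
                ttl = self.ttl_s * (1 + self.rng.uniform(-self.jitter, self.jitter))
                with self.lock:
                    self.entries[key] = (value, now + ttl)
                return value
            finally:
                with self.lock:
                    del self.inflight[key]
                event.set()
        event.wait()  # cold key: wait for the leader's result
        with self.lock:
            entry = self.entries.get(key)
        if entry is None:
            raise RuntimeError(f"load failed for {key!r}")
        return entry[0]


backend_calls = []
cache = SingleFlightCache(lambda k: backend_calls.append(k) or f"value-for-{k}")
workers = [threading.Thread(target=cache.get, args=("price:sku-1", 0.0)) for _ in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(backend_calls))
```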
6. Establish a Peak Runbook That Can Be Executed
A successful peak plan must have an executable runbook:
- T-7 days: Synthetic load test critical paths — verify dependency budgets
- T-24 hours: Pre-scale baseline — warm up caches — enable campaign feature flags
- T-0 (peak): Enable admission control thresholds — monitor concurrency — watch queue lag — monitor vendor latencies
- T+1 hour: Gradually scale down from peak — monitor for delayed queue replays
Define success as controlled degradation, not perfect throughput.
Conclusion
Capacity planning for multi-cloud, multi-region microservices requires discipline in deciding what to protect. Over-provisioning everything is too costly and rarely works. Identify your critical paths, apply dependency budgets, and build layered load shedding that protects checkout and core flows while shedding non-essential experiences. The best campaign architecture does not just survive a peak; it makes behavior under peak traffic predictable and repeatable.