Congestion Control in Cloud Scale Distributed Systems

Distributed systems must handle traffic surges and provide predictable performance for the best customer experience.

By Ankit Kumar · Dec. 19, 23 · Analysis

Distributed systems are composed of multiple systems wired together to provide a specific functionality. Systems that operate at cloud scale can receive expected or unexpected traffic surges from one or more callers, and they are expected to perform predictably throughout.

This article analyzes the effects of traffic surges on a distributed system. It lays out a detailed breakdown of how every layer is affected and provides mechanisms to achieve predictable performance during traffic surges.

Anatomy of a Distributed System

To understand how traffic surges affect different layers of a distributed system, let's look at the composition of a typical distributed system:

  • Client: Client software sends a request to an endpoint. The endpoint is resolved via DNS to point to a Virtual IP address (VIP).
  • VIP (Virtual IP address) and Load Balancer: A VIP address, in turn, points to one or more load balancers. A load balancer is a networking device that proxies traffic to a set of service hosts.
  • Service Fleet: A service fleet consists of multiple service hosts. A service host is a machine on which the service application logic runs.
  • Dependent Service(s): A service may call a set of dependent services to do part of the work for a customer request. For example, a service that needs stored state to answer a request may look that state up in a data storage system or service.

Traffic Surges, Congestion, and Predictable Performance

Traffic surges lead to congestion in one or multiple layers of the distributed system. Let's analyze how congestion affects each layer and how we can provide predictable performance. Let's look at the call stack bottom up:

Dependent Service(s)

Congestion in dependent services results in request timeouts or increased latency at the service host layer. As a result, each customer request holds a thread on the service host for longer. In extreme cases, this leads to thread exhaustion on the service host, which in turn causes requests to queue on the load balancer. Queuing on the load balancer can lead to spillovers: the load balancer cannot route a new request to a viable host, and the client's request times out.

To provide predictable performance in case a dependent service is suffering from congestion, we can take the following approaches:

  • Rate limiting: We can rate limit traffic on the service host so that it accepts no more traffic than the dependent service can process.
  • Circuit breaker: Circuit breakers track the health of a dependent service and reject any new requests that need the dependent service until it has recovered.
  • Thread pool isolation: Another approach is thread pool isolation for requests that rely on the dependent service being available. Thread pool isolation allocates a fixed share of the service host's capacity to work that involves the dependent service. This allows the service to keep serving other requests predictably while the dependent service works through the congestion.
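
The circuit breaker approach above can be sketched as follows. This is a minimal illustration, not a production implementation: the class name, the `failure_threshold` and `recovery_timeout` parameters, and the injectable `clock` are all assumptions made for the example. After a run of consecutive failures the breaker fails fast, and once the timeout has elapsed it lets a trial request through to check whether the dependency has recovered.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    # All names and default values here are hypothetical, chosen for illustration.
    def __init__(self, failure_threshold=3, recovery_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to fail fast before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.recovery_timeout:
                # Open: reject immediately instead of tying up a thread.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so let this trial request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and reset the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A service host would wrap each call to the dependent service in `breaker.call(...)` and translate `CircuitOpenError` into a fast rejection or a fallback response.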

A dependent service can also act on requests asynchronously by tracking them in a queue and working offline on the queued requests. This is typically referred to as a producer-consumer pattern. We talk about the congestion characteristics of this pattern in a separate article.
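
As a rough sketch of the producer-consumer idea, the example below uses a bounded in-memory queue from Python's standard library. The bound of 2, the `submit` and `drain` helper names, and the choice to reject work when the queue is full are assumptions for illustration; a real system would typically use a durable queue and long-running consumers.

```python
import queue

# Hypothetical bound; a bounded queue keeps the backlog from growing without limit.
work_queue = queue.Queue(maxsize=2)


def submit(request):
    """Producer side: enqueue the request, or reject it at once if the queue is full."""
    try:
        work_queue.put_nowait(request)
        return True
    except queue.Full:
        return False


def drain(handler):
    """Consumer side: process everything currently queued and return the count."""
    processed = 0
    while True:
        try:
            request = work_queue.get_nowait()
        except queue.Empty:
            return processed
        handler(request)
        processed += 1
```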

Service Fleet

Congestion in the service fleet caused by a traffic surge can lead to request timeouts for the customer. Congestion happens because the service fleet has hit a certain limit. This limit can be a software or a hardware limit. Some examples are as follows:

  • Thread pool size limit that governs how many requests can be processed concurrently on a service host.
  • Physical hardware limits such as disk space, network bandwidth, or CPU usage limits. 

Let's look at the ways we can configure the service fleet to provide predictable performance in the face of a traffic surge:

  • Understand service host limits: We should test the service fleet and understand the various scaling cliffs. In most systems, the hardware limits are difficult to reach, and we typically run into a software limit before hitting a hardware one. By understanding the maximum request volume at which our service host(s) perform predictably, we can set limits that ensure a host does not process more requests than that.
  • Rate limit traffic: We should rate limit traffic on service host(s) so that any additional traffic outside the bounds is rejected. This leads to a service host providing predictable performance. A good way to implement rate limiting is via a token bucket algorithm. 
  • Scale on demand: An external capacity monitoring system can track how much traffic the service fleet is serving and add hosts when traffic rises beyond a certain threshold. While the scaling kicks in, some traffic will be rejected, but this is better than taking in more traffic than the hosts can handle, which can cause the entire service to collapse.
  • Prevent performance regressions: Performance regressions are possible whenever new code is deployed across a service fleet or hardware is swapped. A good way to prevent this is to bake performance regression testing into the release workflow for the software that is to be deployed on the service fleet, and also for any planned hardware changes.
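
The token bucket algorithm mentioned in the rate limiting point above can be sketched in a few lines. This is a minimal, single-threaded illustration; the class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions for the example, and a real service host would also need locking around the shared state.

```python
import time


class TokenBucket:
    # Hypothetical parameter names; values should come from load testing the host.
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second (sustained request rate)
        self.capacity = capacity  # maximum tokens held (largest burst allowed)
        self.clock = clock
        self.tokens = capacity    # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The `capacity` bounds how large a burst the host absorbs, while `rate` bounds the sustained throughput, so both can be set from the scaling cliffs found during testing.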

VIP and Load Balancer

The mapping from an endpoint to a VIP is just a DNS record. DNS systems rely heavily on caching and rarely get congested, so we will keep DNS congestion out of the scope of this article as a separate topic.

Load balancers direct requests to a service host. Congestion in load balancers can lead to request timeouts for customer requests. Like any other physical computing device, the load balancer has its limits as well. One such limit that most systems run into is the network throughput limit on the load balancers. A simple way to avoid running into this limit is to track how close to the cliff we are, add more load balancers, and provision their IP addresses in the VIP configuration.  
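
Tracking distance from the throughput cliff reduces to simple arithmetic. The sketch below is an assumption-laden illustration: the function name, the use of Gbps as the unit, and the default `target_utilization` headroom factor are all hypothetical, and the per-device limit would have to come from the load balancer's own specification or from testing.

```python
import math


def load_balancers_needed(peak_gbps, per_lb_limit_gbps, target_utilization=0.6):
    """Return how many load balancers keep each one under the target utilization."""
    # Usable throughput per device once headroom is reserved.
    usable_gbps = per_lb_limit_gbps * target_utilization
    return math.ceil(peak_gbps / usable_gbps)
```

Comparing this number with the count currently provisioned behind the VIP gives an early signal to add capacity before the limit is reached.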

Client

The client plays a pivotal role both in how congestion develops and in how quickly the system recovers. When a system is congested, it rate limits traffic to aid in recovery. Clients retry requests when they receive timeouts or are rate limited. Retries add more traffic to the service, creating a vicious cycle: traffic surge -> timeouts/rate limiting -> retries -> traffic surge. An effective way to avoid a traffic surge from retries is to add jitter and exponential backoff in clients.
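
A minimal sketch of client-side retries with exponential backoff and jitter follows. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing cap; that choice, the function names, and the default values for `base`, `cap`, and `attempts` are assumptions for illustration rather than a prescribed policy.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random.random):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**attempt))."""
    for attempt in range(attempts):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, retryable=(TimeoutError,), sleep=time.sleep, **backoff):
    """Call func, retrying on retryable errors with jittered exponential backoff."""
    delays = backoff_delays(**backoff)
    while True:
        try:
            return func()
        except retryable:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted; surface the error to the caller
            sleep(delay)
```

The randomness spreads retries from many clients over time instead of letting them arrive in synchronized waves, and the bounded number of attempts keeps a struggling service from being retried indefinitely.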

Conclusion

Traffic surges can catch any cloud-scale distributed system off guard. Understanding the various system limits, tuning each layer for predictable performance, rate limiting, and scaling on demand can help us provide the best customer experience.
