Cache Wisely: How You Can Prevent Distributed System Failures

This article describes approaches to protect systems from the hidden scaling bottlenecks that can creep in when caching is implemented.

By Tejas Ghadge · Jul. 09, 24 · Opinion · 11.5K Views


Caching is often implemented as a generic solution when we think about improving the latency and availability characteristics of dependency service calls. Latency improves because we avoid the network round trip to the dependency service, and availability improves because the cache can serve the required response even during temporary downtimes of the dependency service. It is important to note that caching does not help if requests to a dependency service produce a distinct response every time, or if a client makes vastly different request types with little overlap between responses. Caching is also constrained if our service cannot tolerate stale data.

We won’t be delving into caching types, techniques, and applicability, as those are covered broadly on the internet. Instead, we will focus on a less-discussed risk of caching that gets ignored as systems evolve and puts the system at risk of a broad outage.

When To Use Caching

In many cases, caching is deployed to mask known scaling bottlenecks in a dependency service, or caching gradually takes over the role of hiding a scaling deficiency of the dependency service over time. As our service makes fewer calls to the dependency service, its operators start believing that this reduced volume is the norm for steady-state traffic. If our cache hit rate is 90%, meaning 9 out of 10 calls to the dependency service are served by the cache, then the dependency service sees only 10% of the actual traffic. If client-side caching stops working due to an outage or bug, the dependency service suddenly receives 10x the traffic it is accustomed to! In almost all cases, this surge will overload the dependency service and cause an outage. If the dependency service is a data store, this can bring down multiple other services that depend on that data store.
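To make the amplification concrete, here is a minimal sketch (the `dependency_load` helper and the numbers are illustrative, not from the article):

```python
def dependency_load(steady_tps: float, hit_ratio: float) -> float:
    """Traffic that reaches the dependency service, given client-side caching."""
    return steady_tps * (1.0 - hit_ratio)

# With a 90% hit ratio, a 1,000 TPS client sends only ~100 TPS downstream,
# and the dependency gets sized for that.
normal = dependency_load(1000, 0.90)

# If the cache fails entirely (hit ratio drops to 0), the dependency
# suddenly absorbs the full 1,000 TPS -- 10x what it was sized for.
outage = dependency_load(1000, 0.0)

print(round(outage / normal))  # prints 10
```

The point of the sketch is that the amplification factor is 1 / (1 - hit ratio): the better the cache performs in steady state, the larger the cliff when it disappears.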

To prevent such outages, both clients and services should consider the following recommendations to protect their systems.

Recommendations

Cache Is Not Just a “Good to Have” Optimization

For clients, it is important to stop treating the cache as a "good to have" optimization and instead treat it as a critical component that needs the same treatment and scrutiny as a regular service. This includes monitoring and alarming on a cache hit ratio threshold, as well as on the overall traffic sent to the dependency service. Any deviation from the pre-defined threshold should be investigated with the same criticality as an availability drop in your own service.
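As a sketch of what such monitoring could look like, here is a hypothetical `CacheHitMonitor` (the 0.85 alarm threshold is an assumed value; a real system would emit these counters to its metrics pipeline and alarm there rather than check them in process):

```python
import threading

class CacheHitMonitor:
    """Tracks the cache hit ratio so it can be alarmed on like any other SLO."""

    def __init__(self, alarm_threshold: float = 0.85):
        self.alarm_threshold = alarm_threshold
        self.hits = 0
        self.misses = 0
        self._lock = threading.Lock()

    def record(self, hit: bool) -> None:
        with self._lock:
            if hit:
                self.hits += 1
            else:
                self.misses += 1

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 1.0

    def breached(self) -> bool:
        # Treat a breach with the same criticality as an availability drop.
        return self.hit_ratio() < self.alarm_threshold

monitor = CacheHitMonitor(alarm_threshold=0.85)
for hit in [True] * 8 + [False] * 2:   # simulate an 80% hit window
    monitor.record(hit)
print(monitor.hit_ratio(), monitor.breached())  # prints 0.8 True
```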

Maintain Extreme Caution for Processes Around Cache-Related Updates

Any update or change to caching business logic needs to go through the same testing rigor in development environments and pre-production stages. Deployments to servers participating in caching should either ensure that the stored state is transferred to the new servers coming up post-deployment, or confirm that the drop in cache hit rate during deployment is tolerable for the dependency service. If a large number of cache-serving servers are taken down during a deployment, the cache hit ratio can drop proportionally, putting pressure on the dependency service.
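A back-of-the-envelope estimate of that deployment risk can be sketched as follows (the `hit_ratio_during_deploy` helper is hypothetical and assumes cached keys are spread evenly across cache servers, each of which restarts with an empty cache):

```python
def hit_ratio_during_deploy(base_hit_ratio: float,
                            total_servers: int,
                            servers_down: int) -> float:
    """Rough estimate: each cache server restarting empty loses its share
    of the cached keys, so the fleet-wide hit ratio drops proportionally
    until the new servers warm up."""
    surviving_fraction = (total_servers - servers_down) / total_servers
    return base_hit_ratio * surviving_fraction

# Taking down half of a 10-node cache fleet at once halves a 90% hit
# ratio to 45%, so the dependency sees 55% of raw traffic instead of 10%.
print(hit_ratio_during_deploy(0.90, total_servers=10, servers_down=5))
```

This is why rolling out one cache node at a time, or warming new nodes before they take traffic, is usually safer than replacing a large batch at once.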

Clients Being Good Citizens

Clients need to implement guardrails to control the overall traffic, measured as transactions per second (TPS), sent to the dependency service. Algorithms like the token bucket can restrict TPS from the client fleet when caching goes down. This needs to be periodically tested by taking down caching instances and observing how clients send traffic to the dependency service. Clients should also consider implementing a negative caching strategy with a small time-to-live (TTL). Negative caching means that the client stores the error response from the dependency service, ensuring the dependency service is not bombarded with retry requests during an extended outage.
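The two client-side guardrails above can be sketched as follows (a simplified token bucket plus a toy negative cache; the rate, burst, and TTL values are illustrative assumptions):

```python
import time

class TokenBucket:
    """Caps outbound TPS to the dependency even if the cache disappears."""

    def __init__(self, rate: float, burst: float):
        self.rate = rate              # tokens refilled per second (TPS cap)
        self.burst = burst            # bucket capacity (allowed burst size)
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # shed locally instead of overloading

# Negative caching: remember error responses for a small TTL so retries
# during an extended dependency outage are absorbed on the client side.
NEGATIVE_TTL_SECONDS = 5.0
_negative_cache = {}                  # key -> expiry timestamp

def record_error(key) -> None:
    _negative_cache[key] = time.monotonic() + NEGATIVE_TTL_SECONDS

def should_call_dependency(key) -> bool:
    return _negative_cache.get(key, 0.0) < time.monotonic()

bucket = TokenBucket(rate=100.0, burst=10.0)
allowed = sum(bucket.allow() for _ in range(50))   # burst of 50 calls
# Only roughly the first `burst` calls pass immediately; the rest are
# rejected until tokens refill at `rate` per second.
```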

Server Side Techniques

Similarly, on the service side, load-shedding mechanisms need to be implemented to protect the service from getting overloaded. Overloaded here means the service is unable to respond within the client-side timeout. Note that as load increases, it usually manifests as increased latency, because server resources are oversubscribed and responses slow down. We want to respond before the client-side timeout for a request, and start rejecting requests once overall latency starts breaching that timeout.

There are different techniques to prevent overloading; one of the simplest is to restrict the number of connections from the Application Load Balancer (ALB) to your service host. However, this means dropping requests indiscriminately; if that is not desirable, prioritization techniques can be implemented in the application layer of the service to drop less important requests first. The objective of load shedding is to protect the service's goodput, i.e., requests served within the client-side timeout, as overall load grows. The service also needs to periodically run load tests to validate the maximum TPS a service host can handle, which allows fine-tuning of the ALB connection limit. The techniques introduced here should be widely applicable, but there are more approaches that readers can explore depending on their service's needs.
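A minimal application-layer load-shedding sketch, assuming a fixed in-flight request limit derived from load tests (the `LoadShedder` class and its limit are hypothetical):

```python
import threading

class LoadShedder:
    """Rejects new work once concurrency implies we'd miss the client
    timeout. max_inflight would be tuned from periodic load tests that
    establish the maximum TPS a single host can serve in time."""

    def __init__(self, max_inflight: int):
        self.max_inflight = max_inflight
        self.inflight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        with self._lock:
            if self.inflight >= self.max_inflight:
                return False   # shed: a fast rejection protects goodput
            self.inflight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self.inflight -= 1

shedder = LoadShedder(max_inflight=2)
print(shedder.try_acquire(), shedder.try_acquire(), shedder.try_acquire())
# prints True True False
```

The fast rejection is the key design choice: a request that would have timed out anyway is turned into a cheap, immediate failure, leaving capacity for the requests that can still finish in time.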

Conclusion

Caching offers immediate benefits for availability and latency at a low cost. However, neglecting the areas we discussed above can expose hidden scaling bottlenecks when the cache goes down, potentially leading to system failures. Regular diligence to ensure the proper functioning of the system even when the cache is down is crucial to prevent catastrophic outages that could affect your system's reliability. 

Here is an interesting read about a large-scale outage triggered by cache misses.


Opinions expressed by DZone contributors are their own.
