Apache RocketMQ: Lessons Learned on How to Ensure Stable Capacity
See how to ensure stable capacity with RocketMQ.
In a previous article, we looked at how Apache RocketMQ addresses latency bottlenecks.
Remember Little's law? The average number of requests in a system equals the arrival rate multiplied by the average time each request spends in it. So when latency rises under steady traffic, in-flight requests pile up, and capacity is at risk.
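As a quick refresher, here is Little's law worked through with hypothetical numbers (the rates below are illustrative, not RocketMQ benchmarks):

```java
public class LittlesLaw {
    public static void main(String[] args) {
        // Little's law: L = lambda * W
        //   L      : average number of requests in the system
        //   lambda : average arrival rate (requests per second)
        //   W      : average time a request spends in the system (seconds)
        double lambda = 10_000; // hypothetical: 10k requests/second
        double w = 0.005;       // hypothetical: 5 ms average latency

        System.out.println("In-flight requests: " + lambda * w);   // 50.0

        // If latency spikes to 500 ms at the same arrival rate,
        // the in-flight count grows 100x -- the queue builds up.
        System.out.println("After a latency spike: " + lambda * 0.5); // 5000.0
    }
}
```

The takeaway: a latency spike alone, with no change in traffic, is enough to exhaust capacity.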
Inevitably, there are moments when performance fluctuates. How do we keep capacity stable in those situations? Urgency matters here: if not dealt with immediately, such emergencies can trigger a cascading failure across the whole cluster. The remedies come down to three well-known approaches: downgrading, traffic shaping, and circuit breaking.
Downgrading means the system acknowledges that something is wrong and adapts accordingly. It is the bluntest of the three approaches: it simply drops certain messages. Which messages get dropped depends on the QoS configuration and an analysis of the user data; as a result, topics deemed "less important" are closed.
There are two classic models for traffic shaping: the leaky bucket and the token bucket. In the leaky bucket model, water drains from the bucket at a steady rate, which frees room to accept more. If water arrives faster than it leaks out, the bucket overflows and the excess is discarded.
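A minimal leaky-bucket sketch in Java (illustrative only, not RocketMQ's actual code; the capacity and leak rate are hypothetical parameters):

```java
// Leaky bucket: work "leaks" out at a fixed rate; an arriving request
// is rejected when the bucket would overflow.
public class LeakyBucket {
    private final double capacity;     // max queued work
    private final double leakPerMilli; // steady drain rate
    private double water = 0;
    private long lastMillis;

    public LeakyBucket(double capacity, double leakPerSecond) {
        this.capacity = capacity;
        this.leakPerMilli = leakPerSecond / 1000.0;
        this.lastMillis = System.currentTimeMillis();
    }

    public synchronized boolean tryAccept() {
        long now = System.currentTimeMillis();
        // drain whatever leaked out since the last call
        water = Math.max(0, water - (now - lastMillis) * leakPerMilli);
        lastMillis = now;
        if (water + 1 > capacity) {
            return false; // bucket would overflow: shed this request
        }
        water += 1;
        return true;
    }
}
```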
In the token bucket model, every request must acquire a token before it is processed, and tokens are replenished at a constant rate. When requests arrive and no tokens are left, that is the signal to cut off the traffic.
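The token bucket can be sketched the same way (again illustrative, with hypothetical parameters):

```java
// Token bucket: tokens refill at a constant rate; a request that
// cannot take a token is throttled.
public class TokenBucket {
    private final long capacity;
    private final double refillPerMilli;
    private double tokens;
    private long lastMillis;

    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMilli = refillPerSecond / 1000.0;
        this.tokens = capacity; // start full
        this.lastMillis = System.currentTimeMillis();
    }

    public synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // replenish tokens accrued since the last call, up to capacity
        tokens = Math.min(capacity, tokens + (now - lastMillis) * refillPerMilli);
        lastMillis = now;
        if (tokens < 1) {
            return false; // no token: cut off this request
        }
        tokens -= 1;
        return true;
    }
}
```

Note the difference in failure mode: the leaky bucket smooths output to a fixed rate, while the token bucket tolerates short bursts up to its capacity.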
Both models are used in real software: Guava provides a RateLimiter class, and Netty ships traffic-shaping handlers.
RocketMQ tackles this problem by categorizing traffic. Requests identified as frequent and repetitive are put on a fail-fast track: they are terminated as soon as their waiting time exceeds a threshold. Infrequent or large requests go through mechanisms such as a sliding window that adjust their frequency and limit their impact.
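The fail-fast idea can be sketched as a queue that evicts requests which have waited past a threshold. This is a simplified illustration with hypothetical names, not the broker's actual code (in RocketMQ the comparable logic lives around the broker's fast-failure handling of the send queue):

```java
import java.util.concurrent.LinkedBlockingQueue;

public class FailFastQueue {
    // each entry records when the request was enqueued
    record Pending(long enqueueMillis, Runnable task) {}

    private final LinkedBlockingQueue<Pending> queue = new LinkedBlockingQueue<>();
    private final long maxWaitMillis;

    public FailFastQueue(long maxWaitMillis) {
        this.maxWaitMillis = maxWaitMillis;
    }

    public void submit(Runnable task) {
        queue.offer(new Pending(System.currentTimeMillis(), task));
    }

    // Called periodically: terminate requests that have waited too long
    // instead of letting them pile up and stall the broker.
    public int evictStale(long nowMillis) {
        int evicted = 0;
        Pending head;
        while ((head = queue.peek()) != null
                && nowMillis - head.enqueueMillis() > maxWaitMillis) {
            queue.poll(); // a real broker would reply "system busy" here
            evicted++;
        }
        return evicted;
    }
}
```

Rejecting a stale request costs microseconds; letting it occupy a thread for seconds is what cascades.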
Many Java developers are familiar with Netflix Hystrix. For those who are not, circuit breaking is a mechanism that rejects or times out requests in order to protect the cluster. Consider the following example:
A dependency ("Dependency I") becomes the bottleneck. If left unattended, more and more requests will hang on it and eventually bring down the whole system. A circuit breaker is introduced to reject those requests instead.
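A minimal circuit-breaker sketch (illustrative only; real implementations such as Hystrix also track rolling windows, thread pools, and metrics; the threshold and timings are hypothetical):

```java
// Trips open after N consecutive failures, rejects calls immediately
// while open, and allows a trial call through after a cool-down.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;          // how long to stay open
    private int consecutiveFailures = 0;
    private long openedAt = -1;             // -1 means closed

    public CircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    public synchronized boolean allowRequest(long nowMillis) {
        if (openedAt < 0) return true;                        // closed: pass through
        if (nowMillis - openedAt >= openMillis) return true;  // half-open: trial call
        return false;                                         // open: reject fast
    }

    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1; // close the breaker
    }

    public synchronized void recordFailure(long nowMillis) {
        if (++consecutiveFailures >= failureThreshold) {
            openedAt = nowMillis; // trip: start rejecting
        }
    }
}
```

While the breaker is open, callers fail in microseconds instead of hanging on the sick dependency, which is exactly what keeps the failure from cascading.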
However, many products like Hystrix need 30 seconds or so to determine that something is the bottleneck, especially when the cause is a system resource. That is far too long for a message queue. RocketMQ instead identifies hardware-level bottlenecks quickly and responds within milliseconds.
Stable capacity is a critical feature of a message queue, and RocketMQ works hard to guarantee it. Now, many of us might wonder: how do these three mechanisms work together? Like any design, each has its targeted use cases and its blind spots. We use all three so that no scenario is left uncovered.
Opinions expressed by DZone contributors are their own.