This blog post could have two titles: the one above, but also "Stuck Messages in ActiveMQ, Not Getting Dispatched to Active Consumer." Both can be symptoms of the same problem I will discuss in this post.
In my day-to-day job, I analyze many ActiveMQ-related problems and have seen a number of weird behaviors that are often caused by users not knowing the side effects of their ActiveMQ tuning activities. But I was also very surprised by the following discovery.
When you download and extract one of the later (including the latest) versions of Apache ActiveMQ or JBoss A-MQ and look at the out-of-the-box configuration, you will notice that the <destinationPolicy> section of the broker configuration does not configure any destination limits anymore. Older versions did configure these limits out of the box.
I was always a supporter of this configuration. Why would you want to restrict every queue cursor to a particular memory limit if most of the time your destinations have small backlogs? If a backlog accumulates on a particular queue, it is better to use the broker's full <memoryUsage> to cache messages in memory irrespective of the destination, in order to dispatch them quickly when needed. This also lets you better utilize the broker's <memoryUsage>: queues on which a backlog builds up can use the broker's memory; queues that have no backlog obviously don't need the memory at the moment. If the backlog grows too high or if backlogs build up on too many queues, the broker will enforce the overall <memoryUsage> limit across all destinations. So from this point of view, setting no destination limits makes perfect sense.
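For illustration, such a configuration looks roughly like the following sketch (the broker name, data directory, and 64 mb limit are placeholder values, not defaults from any particular release): a broker-wide <memoryUsage> limit is set, but the policy entries carry no per-destination memoryLimit attribute.

```xml
<broker xmlns="http://activemq.apache.org/schema/core"
        brokerName="localhost" dataDirectory="${activemq.data}">

  <!-- No memoryLimit attribute on the policy entries: every destination
       may use memory up to the broker-wide memoryUsage limit below. -->
  <destinationPolicy>
    <policyMap>
      <policyEntries>
        <policyEntry queue=">"/>
        <policyEntry topic=">"/>
      </policyEntries>
    </policyMap>
  </destinationPolicy>

  <systemUsage>
    <systemUsage>
      <memoryUsage>
        <!-- placeholder value; the shipped default varies by version -->
        <memoryUsage limit="64 mb"/>
      </memoryUsage>
    </systemUsage>
  </systemUsage>
</broker>
```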
However, we recently discovered a not-so-exotic use case where not setting a destination limit caused problems. Here are the details:
We initially reproduced the problem in a test case that may be less likely to mirror a real production environment. However, this test case makes it easier to explain the situation. In that test we used only two JMS queues. The broker configuration did not set any destination limits, and it does not matter how high the <memoryUsage> limit is set. The higher the limit, the more messages are needed in the test, but the problem can be reproduced with any <memoryUsage> limit. We used KahaDB as the persistence store.
The broker was started with a few messages on the first destination, let's say queue A, stored in KahaDB. As this queue had no consumers attached upon broker start, the messages remained in the store only and did not get loaded into the cursor cache. Note that messages only get loaded from the store into memory when there is demand, i.e. when there is an active consumer.
Now a producer connected to the second queue, queue B, and pushed enough messages until 70% of the broker's <memoryUsage> limit got used by queue B. Remember, no destination limits were set, so each destination can use up to the broker's full <memoryUsage> limit. However, the StoreQueueCursor used for a JMS queue stops caching more messages in memory once it reaches 70% (the threshold is configurable via cursorMemoryHighWaterMark). Any additional messages received from a producer are written to the store only but not accepted by the cursor. When it's time to dispatch these additional messages (i.e. once the cache runs empty again), they will be loaded from the KahaDB store.
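The 70% threshold mentioned above can be tuned per destination via the cursorMemoryHighWaterMark attribute of a policy entry. A hedged sketch (the queue pattern and value are examples only; 70 is the default):

```xml
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <!-- Lowering this value makes the cursor stop caching messages earlier;
           70 (percent) is the default behavior described in the text. -->
      <policyEntry queue=">" cursorMemoryHighWaterMark="70"/>
    </policyEntries>
  </policyMap>
</destinationPolicy>
```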
So we had a few messages on queue A that were not loaded into memory but resided only in KahaDB, and we had a few tens of thousands of messages on queue B that were all loaded into memory, making the cursor for queue B use 70% of the configured <memoryUsage> limit. Since queue B had no destination limit configured, it inherited the broker's <memoryUsage> limit and had therefore used 70% of it.
However, the same applied to all other JMS queues: they also set no destination limits and hence also inherited the broker's <memoryUsage> limit, which was already 70% utilized (due to queue B).
Since no consumer was connected to queue B, messages would not get removed from that queue, and the <memoryUsage> utilization would not decline.
Next, a consumer connected to queue A, ready to receive messages. The cursor for queue A would typically go to the store now and load up to maxPageSize (200 by default) messages from the persistence store into memory in one batch. But it could not do so this time, because 70% of the broker's <memoryUsage> limit was already reached. Again, remember that 70% is the tipping point at which a cursor stops accepting or loading more messages into its cache. The cursor's own MemoryPercentUsage JMX attribute for queue A was 0% (it had not loaded any messages into memory yet), but the broker's MemoryPercentUsage was already at 70%. The latter condition alone is enough to prevent the cursor for queue A from loading any more messages into memory. The broker needs to protect against running out of memory and must enforce its <memoryUsage> limit. That's why a cursor loads a full maxPageSize batch (again, 200 by default) while MemoryPercentUsage is below 70%, but stops loading any messages into memory once the 70% limit is reached.
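Schematically, the batch-loading decision described above behaves like the following pseudocode (a simplification for illustration, not the actual broker source):

```
// Simplified sketch of the store cursor's page-in decision.
// brokerMemoryPercentUsage is the broker-wide value, NOT the
// destination's own cursor usage.
if (brokerMemoryPercentUsage < cursorMemoryHighWaterMark) {  // default 70
    // load up to a full batch from KahaDB into the cursor cache
    loadFromStore(min(maxPageSize, backlogSize));            // default maxPageSize = 200
} else {
    // load nothing -- even if this destination's own cache is empty
    // and an active consumer is waiting
}
```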
The result is an active consumer on queue A that does not receive any messages although there is a backlog of messages sitting in the persistence store. Unless a consumer drains off some messages from queue B and thereby reduces the broker's MemoryPercentUsage below 70%, the cursor for queue A will not be able to load any messages from the store. The consumer on queue A is starved as a result.
A few consequences:
- If there are multiple consumers on queue A, they will all get starved.
- If there are other destinations with no messages loaded into memory but messages in the store and active consumers connected, those consumers get starved as well.
- You don't necessarily need a single destination with no consumers that uses 70% of the broker's <memoryUsage> limit. Multiple destinations with no consumers whose backlogs together add up to 70% of the broker's <memoryUsage> limit reproduce the same behavior.
How to Identify This Problem
The following conditions should all be met when you run into this problem:
- Do you detect consumer(s) that receive no messages despite a queue backlog?
- Does the destination to which the consumer(s) are connected show a MemoryPercentUsage of 0%?
- Look at the broker's overall MemoryPercentUsage. Is it at 70% or higher?
- Then drill into the JMX MemoryPercentUsage values of the various destinations and check for destinations that use a substantial portion of those 70% and have no consumers attached.
- If all of these conditions are met, you may have hit this problem.
How to Resolve This Situation
On a running broker, you can either connect one or more consumers to queue A and start consuming messages, or, if you can afford it from a business perspective, purge queue A. Both should bring the broker's <memoryUsage> utilization below 70% and allow the cursors of other destinations to load messages from the store into their caches.
Restarting the broker would also help, as after the restart messages only get loaded from the store if consumers are connected. The many messages on queue A won't be loaded unless a consumer connects, and even then the cursor loads only maxPageSize messages in one go (200, as you surely know by now). The broker's <memoryUsage> utilization should remain well below 70% in this case.
Configuring destination limits would typically also work around the problem. If you know that certain destinations may not have consumers for a while, consider explicitly configuring decent memory limits for those destinations so they cannot take up the entire broker's <memoryUsage>.
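Such per-destination limits are set via the memoryLimit attribute of a policy entry. A hedged sketch (the queue names and limit values are examples only; the most specific matching entry applies to a given destination):

```xml
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <!-- Cap a queue known to sit without consumers for a while,
           so it cannot occupy the broker-wide memoryUsage on its own. -->
      <policyEntry queue="QUEUE.A" memoryLimit="5mb"/>
      <!-- Example default limit for all other queues -->
      <policyEntry queue=">" memoryLimit="10mb"/>
    </policyEntries>
  </policyMap>
</destinationPolicy>
```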
I raised this problem in ENTMQ-1543; however, no fix was made, as fixing it turned out to be very difficult.
With that much background, we can now come to the second symptom of this problem. Above, I talked about one or more destinations with large backlogs and no consumers starving the consumers of other destinations.
If you take this a step further, perhaps queue A does have a consumer connected, but the consumer is much slower than the rate at which messages get produced. Perhaps it takes a few seconds to process each message (not entirely unrealistic for certain use cases). Now imagine we have multiple destinations like queue A: slow consumers and a large backlog of messages.
Together these destinations could use 70% of the broker's <memoryUsage>. Now consider what happens to other destinations that have fast consumers and (almost) no backlog. These destinations could generally see high throughput. But because the destinations with slow consumers and large backlogs together constantly hold 70% of the broker's <memoryUsage> limit, any new messages sent to the destinations with fast consumers and no backlog do not get loaded into those destinations' cursor caches. It's the same problem as above. These fast consumers don't receive messages until the slow consumers of other destinations have consumed some messages and reduced the broker's <memoryUsage> utilization below 70%. In essence, the fast consumers are not starved completely, but they are slowed down to roughly the speed of the slow consumers on other destinations.
I produced a unit test that shows this problem in action. If you are interested, check out the code from demo StuckConsumerActiveMQ and follow the instructions in the test's README.md.
Again, the solution to this problem is to set destination limits for at least those destinations that have slow consumers and a large message backlog.