For the past two months, we have been busy polishing our next major release. Through our private beta program, more than 100 different companies got access to the locked thread detection functionality. Based on their feedback, we ironed out the problems and are now happy to announce that all Plumbr users can access the locked thread detection as of today.
Before demonstrating our solution, let's dig into the problem itself. Based on early tests conducted on 500 different applications, we saw that 16% of the applications regularly stop application threads for 5,000 ms or longer due to lock contention. In the worst cases, we saw applications locking threads for several minutes, or locking so frequently that only 15 minutes of every hour was spent on actual work. So I would dare to say that the problem is frequent and severe enough to be worth tackling.
The main challenge was to build a solution that clearly communicates the severity and root cause of each incident back to the end user. When application threads are forced to wait behind locks, we wanted to make the incident transparent along several dimensions:
- Has this happened before or is this an isolated incident?
- How severe is the particular incident and what is the total impact on end users?
- What was the thread executing when it was locked?
- What was the lock the thread was waiting for?
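For context, the JDK itself exposes the raw data behind the last two questions. The sketch below is not how Plumbr works internally; it only illustrates, using the standard `java.lang.management` API, what "what was the thread executing" and "which lock was it waiting for" look like for a single blocked thread. The class name `LockProbe`, the lock object `SHARED_LOCK` and the thread name `worker-1` are made up for this illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class LockProbe {
    // Hypothetical shared lock standing in for any contended monitor.
    private static final Object SHARED_LOCK = new Object();

    /** Starts a worker that blocks on SHARED_LOCK and returns what the JVM reports about it. */
    static ThreadInfo probeBlockedWorker() throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Thread worker = new Thread(() -> {
            synchronized (SHARED_LOCK) {
                // Nothing to do here; the interesting part is the wait to get in.
            }
        }, "worker-1");

        ThreadInfo info;
        synchronized (SHARED_LOCK) {              // hold the lock so the worker has to wait
            worker.start();
            while (worker.getState() != Thread.State.BLOCKED) {
                Thread.sleep(1);                  // wait until the worker is stuck on the monitor
            }
            // Snapshot of the blocked worker, with up to 10 stack frames.
            info = threads.getThreadInfo(worker.getId(), 10);
        }
        worker.join();                            // lock released above, worker can finish
        return info;
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadInfo info = probeBlockedWorker();
        System.out.println("State:      " + info.getThreadState());
        System.out.println("Waiting on: " + info.getLockName());
        System.out.println("Held by:    " + info.getLockOwnerName());
        for (StackTraceElement frame : info.getStackTrace()) {
            System.out.println("    at " + frame);
        }
    }
}
```

A single snapshot like this says nothing about recurrence or total impact. `ThreadInfo.getBlockedCount()` gives a per-thread counter, and `getBlockedTime()` reports accumulated blocked time only when thread contention monitoring is enabled, so answering the first two questions across a whole application takes continuous monitoring rather than a one-off dump.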
We succeeded in building a solution that answers all of the above questions. In the following screenshot you can see a situation where Plumbr detected a thread being locked for 16.5 seconds, confirmed that this was a recurring issue (application threads had been forced to wait 182 times, for a total of 3 hours and 24 minutes), and pointed out that the threads had all been waiting for the same lock to be released:
Equipped with this information, you can quickly triage the issue based on its severity. The next step is finding the actual root cause, where we again equip you with the necessary information:
We can see that the particular thread was just trying to execute a Log4j logging call. Log4j in this case had been configured to use synchronized loggers, and checking the source code of callAppenders() confirms that the call is indeed synchronized. When the call is synchronized on the RootLogger class, calls from thousands of different places in your application code are pretty much guaranteed to eventually end up waiting behind that lock. The solution in the example above is as simple as switching to asynchronous loggers (or reducing the log verbosity as a workaround).
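To make the difference concrete, here is a minimal sketch of the two patterns. It is not Log4j code and does not reproduce how Log4j implements asynchronous loggers; `SlowSink`, `AsyncWriter` and `timeCallers` are hypothetical names, and the 5 ms sleep merely simulates I/O performed while holding the lock. In the first run every caller goes through one synchronized method, in the second the callers only enqueue the message and a single background thread does the slow write.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

class AsyncHandOff {
    /** Hypothetical slow sink: one lock guards every write, like a synchronized appender call. */
    static final class SlowSink {
        private final List<String> written = new ArrayList<>();

        synchronized void write(String message) {
            try {
                Thread.sleep(5);                  // simulated I/O while holding the lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            written.add(message);
        }

        synchronized int count() {
            return written.size();
        }
    }

    /** Callers only enqueue; a single background thread owns the slow sink. */
    static final class AsyncWriter {
        // Unique sentinel instance, deliberately compared by identity below.
        private static final String STOP = new String("STOP");
        private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        private final Thread drainer;

        AsyncWriter(SlowSink sink) {
            drainer = new Thread(() -> {
                try {
                    for (String m = queue.take(); m != STOP; m = queue.take()) {
                        sink.write(m);            // only this thread ever takes the sink's lock
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "async-writer");
            drainer.start();
        }

        void log(String message) {
            queue.add(message);                   // unbounded queue: the caller does not wait for I/O
        }

        void close() throws InterruptedException {
            queue.add(STOP);
            drainer.join();                       // everything queued before STOP has been written
        }
    }

    /** Runs `threads` callers, each logging `perThread` messages; returns elapsed time in ms. */
    static long timeCallers(int threads, int perThread, Consumer<String> logCall)
            throws InterruptedException {
        List<Thread> callers = new ArrayList<>();
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            int id = t;
            Thread caller = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    logCall.accept("caller-" + id + " message " + i);
                }
            });
            callers.add(caller);
            caller.start();
        }
        for (Thread caller : callers) {
            caller.join();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        SlowSink direct = new SlowSink();
        long syncMs = timeCallers(4, 20, direct::write);

        SlowSink background = new SlowSink();
        AsyncWriter async = new AsyncWriter(background);
        long asyncMs = timeCallers(4, 20, async::log);
        async.close();

        System.out.println("synchronized callers finished after ~" + syncMs + " ms");
        System.out.println("async callers finished after ~" + asyncMs + " ms");
        System.out.println("messages written: " + direct.count() + " and " + background.count());
    }
}
```

The hand-off does not make the writes themselves any faster; it only moves the waiting off the application threads. The price in this sketch is an unbounded queue that can grow when messages arrive faster than they are written, and messages still sitting in the queue are lost if the JVM dies before they are drained.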
The example above is the most common cause of locks, which is why I picked it to illustrate our solution. But we have already detected thousands of different locks in hundreds of different JVMs, so go ahead and find out how your JVM is performing in this regard. Grab your free trial and start monitoring your JVMs for those nasty locks!