In our labs, we are working hard on the new performance optimization solutions. One of such new features taking shape is a lock contention detector. In just few months from now we are about to give wider access to the lock detection functionality currently in private beta.
During the beta program we trace the JVM internals to detect locks and determine how long the threads are waiting for a certain lock to grant access. Private beta program members get notified about the detected lock and we can verify whether we have indeed found a performance bottleneck and whether our solution guidelines are correct.
Based on this data we build further improvements into the detection algos. To give you a hint what you can expect from us soon, we open up an example belonging into “eat your own dog food” category. Indeed, one of the applications we monitor for lock contention issues is Plumbr itself. In last week statistics about lock contention we were suddenly facing the following:
|Lock identifier||# locks||# long locks||Long wait (ms)|
The easiest way to interpret the data above is to read it as following:eu.plumbr.SocketRegistry.addSocket() method invocations were waiting behind a lock 19,024 times. Out of those 19,024 accesses, 1,422 waited for a long time (long time being defined longer than 10,000ms) to get access to the synchronized content. In total, such long wait times resulted in more than six hours (or 22,520,822 ms) of total wait times for our end users. Oops.
Checking the code base we were able to confirm the problem. SocketRegistry is a class responsible for keeping track of connections from Plumbr agents sending data to the Server. These connections are also connected to a web interface to provide real time updates in UI.
Internally sockets were kept in several maps, to enable quick lookup for a session using different keys. Access for modifications of those maps inside those methods was done using standard Java synchronization on these maps. All this was done to ensure that connecting sockets, disconnecting sockets and web interface querying existing sockets are seeing the same picture.
As usual, during software evolution initial assumptions are changing and some initial nitty-gritty details slip away during further development. So, while initially operations on adding, removing and accessing sockets were fast, new features added a couple of database calls within those synchronized methods. Those added database round-trips increased the time of holding the lock. Increased lock holding meant that once the number of clients increased, the number of threads waiting concurrently for the lock also increased. Combination with increased method execution times snowballed the lock wait times to tens of seconds in worst cases.
We immediately took action and fixed the contention issue. As of now, we do not perform any database calls from inside synchronized blocks keeping the synchronized sections as short as possible. But the moral of the story is a bit different. As it played out – eating your own dog food can sometimes be a surprisingly pleasurable.
And as another take-away: we are more and more certain that the lock contention problems are truly difficult to trace down and resolve. We can and we will provide an easy way to tackle this complex problem. In case you are interested in getting early access to this functionality, let us know and we can add you to our private beta program.