Case Study: Switching JVMs May Help Reveal Issues in Multi-Threaded Apps
In one of my previous articles at Javalobby, I explained why an official JBoss AS release did not work on any JVM other than Sun(-Oracle) HotSpot. The root cause was (unintended) reliance on Java implementation features not enforced by the Java spec. This story, based on a customer case study, is also about lurking bugs and about how testing on different JVMs may help you bring problems to light.
On fixing deadlocks
Deadlocks in "oversynchronized" code and data races provoked by "undersynchronized" code are serious issues in parallel programming. Typically, they appear as volatile, hard-to-reproduce bugs in multi-threaded applications, and often remain undiscovered during QA. Once the software ships, such issues bite the end users and significantly increase the maintenance cost.
The main reason for such "volatility" is that non-deterministic thread scheduling may produce billions of program execution states, only a few of which are erroneous. As a result, it is tough to create the conditions needed to reliably reproduce the bugs with stress tests.
Practice shows that changing the thread scheduling pattern, e.g. by improving code performance, may help reveal such latent issues. In the case of Java, this can be done by running the application not only on the standard JRE, but also on other JVMs that may deliver higher execution speed. For example, Excelsior JET, a compliant JVM with an AOT compiler, may provide better performance than traditional JIT-based JVMs, especially on application startup.
Charles O'Dale from Senomix Software Inc. has kindly agreed to write down an interesting case of deadlock detection that occurred when testing a client-server application. Here is Senomix's story.
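As a minimal sketch of why so few execution states are erroneous, consider the classic lock-ordering deadlock. Two threads take two locks in opposite order, and the program hangs only if each thread grabs its first lock before the other grabs its second. The class and method names below (LockOrderDemo, lockBoth, forceAndDetect) are hypothetical and not taken from any application discussed here. The latch is there only to force the rare interleaving, and the standard ThreadMXBean API is used to confirm the deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderDemo {

    // Takes 'first', then 'second'. The latch makes each thread wait until
    // both threads hold their first lock, which is the rare bad interleaving.
    static void lockBoth(ReentrantLock first, ReentrantLock second,
                         CountDownLatch bothHoldFirst) {
        try {
            first.lockInterruptibly();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                second.lockInterruptibly(); // blocks forever without an interrupt
                second.unlock();
            } finally {
                first.unlock();
            }
        } catch (InterruptedException e) {
            // Interrupted by forceAndDetect() to break the deadlock.
        }
    }

    // Forces the deadlock, asks the JVM which threads are deadlocked,
    // then interrupts both threads so the program can finish.
    static int forceAndDetect() throws InterruptedException {
        ReentrantLock a = new ReentrantLock();
        ReentrantLock b = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);

        Thread t1 = new Thread(() -> lockBoth(a, b, bothHoldFirst));
        Thread t2 = new Thread(() -> lockBoth(b, a, bothHoldFirst));
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] deadlocked = null;
        // Poll for up to about two seconds until the cycle is visible.
        for (int i = 0; i < 100 && deadlocked == null; i++) {
            Thread.sleep(20);
            deadlocked = mx.findDeadlockedThreads();
        }

        t1.interrupt();
        t2.interrupt();
        t1.join();
        t2.join();
        return deadlocked == null ? 0 : deadlocked.length;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("Deadlocked threads found: " + forceAndDetect());
    }
}
```

Without the latch, the same code would usually run to completion, because one thread tends to finish both acquisitions before the other starts. That is exactly why a change in timing, such as faster generated code, can surface a bug that stress tests on one JVM never hit.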
By Charles O'Dale, Senomix Software Inc., May 2010
As part of the development process of Senomix Time Tracking, our networked software, a number of automated tests have been created to stress-test the system and ensure any one-in-a-million threading race conditions will be caught before release. In a typical test, test client applications are left to 'attack' our application's server over the course of a few hours and simulate the amount of traffic the system could expect to see over a few centuries of real-time use. If the server continues to operate through that stress test without any difficulties, we can then conclude it will be able to smoothly run through any peak-period traffic an office will experience. The server program is then packaged up as an Excelsior JET executable and tested again before being distributed to our customers for installation.
Our problem occurred when performing this final set of tests for the latest version of our system with the executable compiled under Excelsior JET. Although the Java .jar version of our server program was able to handle any amount of network traffic our tests could throw at it when run against Sun's JRE, the JET-compiled executable would freeze within a few minutes, with the application halting at a seemingly random location in the code on every interruption.
Under normal circumstances we would conclude that a new race condition had been discovered and go about correcting it. However, every test run against the pure Java version of our server would perform flawlessly when operated on Sun's JRE -- only tests run against the JET-compiled executable resulted in failure. After implementing every check we could think of to prevent deadlock in the executable program's threads, we concluded there must be a problem with JET and set about informing Excelsior of the issue.
Our test environment made duplicating this problem a straightforward process, and the test applications and troublesome Java server jar were sent along by e-mail to Excelsior for review. Excelsior's support team were then able to use their test environment to duplicate the problem we were seeing in the executable and set about identifying the underlying cause.
It turned out the problem was in our own code after all! A newly created thread involved in communication had its run method mistakenly set to be synchronized, with that declaration causing a deadlock in the faster code generated by the JET compiler.
After correcting that mistaken declaration from:
public synchronized void run()
to:
public void run()
the JET-compiled executable ran flawlessly, with the program demonstrating the same reliability as our standard Java jar file.
If our system's server application only ever operated as a standard Java jar, it's unlikely we would ever encounter a problem with this code (as the conditions required to bring about the deadlock would truly be a one-in-a-billion event). However, the improved efficiency of the Excelsior JET executable increased thread speed just enough to bring this problem to light.
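The story does not show the code around the faulty declaration, so the following is only a hedged Java sketch of one way a synchronized run method can block other threads. Every name in it (SynchronizedRunDemo, Worker, requestStop, stopCompletes) is hypothetical. A synchronized run() holds the object's monitor for the thread's entire lifetime, so any other synchronized method on the same object cannot be entered while the thread is running. The lockInRun flag stands in for the presence or absence of the synchronized keyword on run().

```java
import java.util.concurrent.CountDownLatch;

public class SynchronizedRunDemo {

    // Hypothetical worker. lockInRun = true mimics "public synchronized void run()".
    static class Worker implements Runnable {
        private final boolean lockInRun;
        private final CountDownLatch started = new CountDownLatch(1);
        private volatile boolean stopRequested;

        Worker(boolean lockInRun) {
            this.lockInRun = lockInRun;
        }

        @Override
        public void run() {
            if (lockInRun) {
                synchronized (this) { // same monitor a synchronized run() would hold
                    loop();
                }
            } else {
                loop();
            }
        }

        private void loop() {
            started.countDown();
            while (!stopRequested) {
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    return; // lets the demo shut the worker down
                }
            }
        }

        // Needs the worker's monitor, so it cannot proceed while run() holds it.
        synchronized void requestStop() {
            stopRequested = true;
        }
    }

    // Returns true if requestStop() finished within half a second.
    static boolean stopCompletes(boolean lockInRun) throws InterruptedException {
        Worker worker = new Worker(lockInRun);
        Thread workerThread = new Thread(worker);
        workerThread.setDaemon(true);
        workerThread.start();
        worker.started.await(); // the worker is now inside its loop

        Thread stopper = new Thread(worker::requestStop);
        stopper.setDaemon(true);
        stopper.start();
        stopper.join(500);
        boolean completed = !stopper.isAlive();

        workerThread.interrupt(); // frees the monitor so the demo can exit
        workerThread.join(1000);
        stopper.join(1000);
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("synchronized run: stop completed = " + stopCompletes(true));
        System.out.println("plain run:        stop completed = " + stopCompletes(false));
    }
}
```

In this sketch the stop request never gets through while run() holds the monitor, and it goes through immediately once the lock is dropped. If the worker were in turn waiting for the blocked thread, the result would be a full deadlock. Whether the actual Senomix code failed in this way is an assumption.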
Data races and deadlocks in multi-threaded applications are so hard to reproduce that they are often called not bugs, but "random features". Using different JVMs in your QA process may help you reproduce and fix such issues in less time, and spend the savings on developing less random features.
Java theory and practice: Characterizing thread safety: a helpful article on safe parallel programming in Java
Multi-Thread Run-time Analysis Tool for Java: a dynamic analyzer of deadlocks and race conditions
Excelsior JET JVM: product info
A Tale of Four JVMs and One App Server: yet another article on revealing latent bugs by switching JVMs