This story goes back. For weeks or even decades, depending on how you mark the starting date. Anyhow, few weeks ago one of our customers had problems with interpreting a leak reported by Plumbr. Quoting his words “It seems that Java itself is broken”.
As a matter of fact, the customer was right. Java was indeed broken. But let’s check the case and see what we can learn from it. Lets start by looking into the report generated by Plumbr. It looked similar to the one below.
From the report we can see that the application at hand contains a classloader leak. This is a specific type of a memory leak when classloaders cannot be unloaded (for example on Java EE application redeploys) and thus all the class definitions referenced by the classloader are left hanging in your permanent generation.
In this specific case there are 14,343 class definitions wasting your precious PermGen:
Those classes are all loaded by the org.apache.catalina.loader.WebAppClassloader which cannot be garbage collected because it is still referenced through the following chain:
- This classloader is referenced from a field contextClassLoader in a java.lang.Thread instance.
- The Thread blocking our classloader to be garbage collected is referenced from the sun.net.www.http.KeepAliveCache instance field keepAliveTimer
- And last in this hierarchy is sun.net.www.http.HttpClient which seems to do something clever and keeping a cache of something internally used in an instance variable kac.
Now we are indeed in a situation where all the symptoms of the problem at hand are pinpointing to the JVM internals and not to the application code. Could this really be true?
Immediately after googling for “sun.net.www.http.HttpClient leak” I stumbled upon endless pages of references to the same problem. And about the same amount of different workarounds found for different libraries and application servers. So for some reason indeed it seems like the caching solution in this HttpClient class does not let go of the internal keep-alive cache. Which in turn refuses to release a reference to the classloader it was created in.
But what is the actual cause for it? Most of the stackoverflow threads or the application server vendor bugreports only offered workarounds to the problem. But there has to be a real reason why this keeps happening. Some more googling revealed a possible suspect in Oracle Java SE public bug database – an issue 7008595.
Let’s look into the issue and see what we can conclude from it. First, those of you who are not familiar with a nice bug report – take another look into it and learn. This is how you should file a report – with minimal test case to reproduce the problem and just two steps to go through when concluding the test. But praising aside, it seems that this problem has been present in Java at least since 1.4 was released. And was patched in a 2011 Java 7 release. Which translates to at least NINE years of buggy releases and thousands (maybe even millions) of affected applications.
But now into the code packaged along with the sample testcase. Its relatively simple. In very general level it goes through the following steps:
- After start, application creates a new classloader and sets this newly created classloader as a context classloader to the running thread. This is done to emulate a typical web application, where the classloader of the current thread is a special classloader and not inherited from the system. So the author sets the context classloader to his own.
- Next it loads a new class using the newly created classloader and invokes a getConnection() static method on this class.
- The getConnection() method opens an URL, connects to it, reads the content and closes the stream. In the very same method author is doing something completely weird as well. Namely allocating 20MB to a byte array never used. He is doing it solely to highlight the leak later on, so I guess we do not have to point fingers and call him mad here. Let’s be grateful instead.
- Now all the references are set to null and System.gc() is called within the the code.
- One should now expect that the to the ApplicationClass declaration is now garbage collected as it is no longer referenceable from anywhere.
After walking you through the steps the test application is using we are now ready to compile and run the application. For this run I used the latest Java 6 build 37 available. After I have run the application, taken a heap dump and opened it in Eclipse MAT we see the problem staring right into our face:
And if you are thinking now that you will never mess with custom classloaders and so this whole talk is not relevant to you, then think again. Vast majority of java developers work with customer classloaders every day. Most often with classloaders your application servers (like Tomcat or Glassfish or JBoss) use for creating and loading web applications. If your web application open a http connection somewhere, and as a result KeepAlive timer is spawn, I congratulate you. You have the exact memory leak described in this article.
So indeed, we have verified the assumption that “Java is broken”. And has been broken ever since the Java 1.4 was release. Which was 12 years ago. Luckily the new patches to Java 7 no longer have this problem. But as the different statistics show – vast majority of the applications out there have not migrated to Java 7 as we speak. So most often than not, your application at hand has got the very same problem waiting to surface.
In either case, the story definitely serves as a great case study on how hard it is to trace down a memory leak. Or how difficult it used to be without Plumbr. It took just one customer with one report and the culprit was staring right into our face. But this is now turning into a commercial and this is not you guys are into, so I am going to stop here.
If you enjoyed the post then – stay tuned for more. We get new and interesting insights from JVM on a daily basis nowadays. Unfortunately we do have to work with our product also every once in awhile, but I do promise interesting posts on a weekly basis!