
java.lang.OutOfMemoryError: PermGen Space - Garbage Collecting a Classloader


We recently ran into the "java.lang.OutOfMemoryError: PermGen space" issue, and the experience was a real eye-opener. In SMART we use classloader isolation to separate multiple tenants. We started testing it in beta a few weeks back and found that every 2 days our server went down with OutOfMemoryError: PermGen space. We had 6 tenants running with very little data. This was very worrying, since it happened even when the server was not being accessed and the load on it was very low. It was evident to us that the classloaders were leaking. The question was: why were they not being garbage collected? To summarize our findings, the following had to be fixed before the classloader could be garbage collected:

  • JCS cache clearing
  • Solr Searcher threads
  • Static Variables
  • Threads and Threadpools

Tracking the leaks

Before I go into detail about each of the items in the above list, let me describe how we tracked down these leaks. The major problem in tracking and fixing memory leaks is re-creating the problem consistently. If you are able to recreate it, then half the problem is solved. It took us some time and a lot of outside-the-box thinking, but we were able to pinpoint the exact set of steps needed to recreate the problem: we had to remove tenants from the JCS cache and reload them into the cache again. Doing this 4 or 5 times recreated our OutOfMemoryError. When we initially started seeing this issue, we added the standard Java parameters to dump the heap when the process ran out of memory:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapDumps

This helped us identify the ClassLoader not being garbage collected as the problem. But the dump occurred only about every 2 days, irregularly, depending on how the server was used, and every time a fix was put in we had to wait two days to see if it had really worked. First lesson learnt: you don't have to run out of memory to dump the heap. A very useful tool for heap analysis is jmap, which comes with JDK 6. Two really useful jmap commands are:

jmap -permstat <pid>

This command shows the classloaders in the perm gen space and their status if they are live or dead. This is very useful to check if the classloaders have been garbage collected.

jmap -dump:format=b,file=heap.bin <pid>

This dumps the heap into the file heap.bin, which can then be examined to find the reason the classloader is not being garbage collected. To examine it we used VisualVM, another very useful tool. It can view the dumped heap and show the nearest GC root that is holding an object in the heap and preventing it from being garbage collected. With these tools it became an iterative process of:

  • Recreate the problem with a single tenant
    • We reduced the JCS timeout to as little as 2 minutes.
    • The tenant now got removed from the cache every 2 minutes.
  • Dump the heap using the jmap command
  • Examine the heap using VisualVM
  • Find the object holding the classloader and fix the leak.
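
The last step of each iteration, checking whether the classloader has actually been released, can also be automated in-process with a WeakReference instead of reading jmap -permstat output by hand. The following is only a minimal sketch under stated assumptions: it assumes a HotSpot-style JVM where System.gc() normally triggers a collection (it is only a hint), and the class name ClassLoaderLeakProbe and the empty URLClassLoader standing in for a tenant classloader are hypothetical names used for illustration.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;

class ClassLoaderLeakProbe {

    // Requests a GC up to 'attempts' times and reports whether the referent was cleared.
    // System.gc() is only a hint, so 'false' means "not collected yet", not "leaked for sure".
    static boolean isCollected(WeakReference<?> ref, int attempts) {
        for (int i = 0; i < attempts && ref.get() != null; i++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return ref.get() == null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a tenant classloader.
        ClassLoader tenantLoader = new URLClassLoader(new URL[0], null);
        WeakReference<ClassLoader> probe = new WeakReference<ClassLoader>(tenantLoader);

        boolean early = isCollected(probe, 2);
        System.out.println("collected while " + tenantLoader + " is still referenced: " + early);

        tenantLoader = null; // drop the last strong reference
        System.out.println("collected after release: " + isCollected(probe, 20));
    }
}
```

If the probe still reports the loader as not collected after the tenant has been removed, that is the moment to take a heap dump and look for the GC root holding it.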

JCS cache Clearing

The first question facing us when we started to fix this was: "How do we capture the event of a tenant being removed from the JCS cache and add our own processing to it?" One suggestion for this is:

     JCS jcs = JCSCacheFactory.getCacheInstance(regionName);
     IElementAttributes attributes = jcs.getDefaultElementAttributes();
     attributes.addElementEventHandler(handler);
     jcs.setDefaultElementAttributes(attributes);

But setting it this way did not change the attributes. What was clear from this, though, is that:

  1. Events are raised for each object in the JCS cache
  2. The element attributes hold the handler for these events

We just had to find a way to replace this event handler. The ElementAttributes are actually configured in cache.ccf, so all we had to do was replace the configuration with our own element attributes class. We created a TenantMemoryAttributes class that did the following:

public class TenantMemoryAttributes extends ElementAttributes implements java.io.Serializable
{
    public class CacheObjectCleanup implements IElementEventHandler
    {
        public void handleElementEvent(IElementEvent event)
        {
                CacheElement elem = (CacheElement)(((ElementEvent)event).getSource());
                if (elem != null)
                {
                    SmartTenant tenant = (SmartTenant)elem.getVal();
                    // .... cleanup here ...
                }
        }
    }
    public TenantMemoryAttributes()
    {
        super();
        System.out.println("TenantMemoryAttributes is instantiated and new handler registered.");
        addElementEventHandler(new CacheObjectCleanup());
    }
}

Once we had this class, we just had to add it to cache.ccf:

jcs.region.TenantsHosted.elementattributes=org.anon.smart.base.tenant.TenantMemoryAttributes

Now we had the hook to clean up when a tenant was removed from memory. With this in place, we found that we were not yet done with the JCS cache. Though we had called the JCS.clear() function to clear out all the elements in the JCS cache (data used by the tenant), other things had to be done to release the classloader. The CompositeCacheManager is a singleton and has a static variable "instance". Calling clear just cleared the data, but the instance still held the CompositeCache object. So we had to call freeCache on the manager so that all references to the CompositeCache were released. Once this was done, we realized that there is an event queue processor thread that sends out the events, and this thread was not being stopped. To do both of these we had to call:

CompositeCacheManager.getInstance().freeCache(_name);
CompositeCache.elementEventQ.destroy();

This ensured that the JCS cache was no longer holding on to the classloader.

Solr Searcher threadpool

We use embedded Solr in SMART to index and provide text-based search on stored data. This turned out to be the next thing preventing the classloader from being released. The SolrCore class represents each core in Solr and contains a threadpool called searcherExecutor. The Solr classes are loaded by the bootstrap loader in SMART. We expected that, since these classes are loaded by the bootstrap classloader, Solr would not prevent our classloader from unloading. Here we were wrong, and the reason is tied deeply into how Java threads and security work, which made it a pain to find and fix. In a Java classloader, when a class is defined with a call as below:

defineClass(className, classBytes, 0, classBytes.length, null); // note the null for the domain

Java creates a default ProtectionDomain in this manner (Check out ClassLoader.java in rt.jar):

      this.defaultDomain = new ProtectionDomain(localCodeSource, null, this, null); 
......

This default domain is then used as the protection domain for the classes that are defined with a null domain. By itself it is just a circular reference and should not really have caused any problem. Yet, combined with thread security, it became critical in preventing our classloader from being released. Threads in Java contain what is called an "inheritedAccessControlContext". From this article on Java security:


When a new thread is created, we actually ensure (via thread creation and other code) that it automatically inherits the parent thread's security context at the time of creation of the child thread, in such a way that  subsequent checkPermission calls in the child thread will take into consideration the inherited parent context.

This reflects in the code for thread creation. Check out the code in Thread.java from rt.jar. The init method has this piece of code.

    this.inheritedAccessControlContext = AccessController.getContext();

And the getContext in AccessController has this piece of code:

    AccessControlContext localAccessControlContext = getStackAccessControlContext();

Now this seems correct, logical and harmless. Yet, we have a sequence as below:

my Class Loader
|---> creates thread running tenant code
     |---> creates EmbeddedSolrServer (bootstrap cl)
         |---> creates threadpool for searcherExecutor
              |---> creates threads (these inherit the parent thread's security context and our CL's default protection domain)

So now the bootstrap CL has a thread object that holds a reference to a protection domain, which in turn references my classloader. To overcome this problem, we had to force our classloader to define classes with a protection domain that does not reference the classloader, as below, and hence release the classloader:

    Permissions perms = new Permissions();
    perms.add(new AllPermission());
    ProtectionDomain domain = new ProtectionDomain(new CodeSource(url, new CodeSigner[0]), perms);
    defineClass(className, classBytes, 0, classBytes.length, domain);
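
The difference between the two kinds of domain can be seen on ProtectionDomain objects alone, without defining any classes. The sketch below is illustrative only: the class name ProtectionDomainDemo and its helper methods are hypothetical, and a null code source URL is used purely to keep the example self-contained. It relies on the two public ProtectionDomain constructors: the four-argument one records a classloader, while the two-argument one does not.

```java
import java.security.AllPermission;
import java.security.CodeSigner;
import java.security.CodeSource;
import java.security.Permissions;
import java.security.ProtectionDomain;

class ProtectionDomainDemo {

    // Same shape as the default domain quoted above: the domain keeps a reference to the loader.
    static ProtectionDomain domainBoundTo(ClassLoader loader) {
        CodeSource source = new CodeSource(null, new CodeSigner[0]);
        return new ProtectionDomain(source, null, loader, null);
    }

    // Two-argument constructor: static permissions, no classloader reference.
    static ProtectionDomain loaderFreeDomain() {
        Permissions perms = new Permissions();
        perms.add(new AllPermission());
        CodeSource source = new CodeSource(null, new CodeSigner[0]);
        return new ProtectionDomain(source, perms);
    }

    public static void main(String[] args) {
        ClassLoader tenantLoader = new ClassLoader() {}; // hypothetical tenant classloader

        ProtectionDomain bound = domainBoundTo(tenantLoader);
        ProtectionDomain free = loaderFreeDomain();

        System.out.println("bound domain references loader: " + (bound.getClassLoader() == tenantLoader));
        System.out.println("loader-free domain references a loader: " + (free.getClassLoader() != null));
    }
}
```

Any thread that inherits a security context containing the loader-free domain can therefore outlive the tenant without pinning the tenant's classloader through the domain.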


Static Variables

This was the problem we least expected, and the most surprising of all. It was easily fixed, but it was tough to find all the places where we had declared static variables and release them. It is very easy to write singleton classes and classes that register and store cached data without thinking about their impact on garbage collection until the problem hits. We had to add a cleanup method to all singleton and cached-value classes to clear the static variables and set them to null.
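
As an illustration of the kind of cleanup this means, here is a minimal sketch of a singleton with a static cleanup method. TenantRegistry and its methods are hypothetical names, not classes from SMART; the point is only that the static field is both emptied and set to null so that nothing reachable from it keeps tenant classes alive.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TenantRegistry {
    // The static reference that would otherwise keep cached objects (and their classes) alive.
    private static TenantRegistry instance;

    private final Map<String, Object> cache = new ConcurrentHashMap<String, Object>();

    static synchronized TenantRegistry getInstance() {
        if (instance == null) {
            instance = new TenantRegistry();
        }
        return instance;
    }

    void put(String key, Object value) {
        cache.put(key, value);
    }

    int size() {
        return cache.size();
    }

    static synchronized boolean isLoaded() {
        return instance != null;
    }

    // Called when the tenant is removed: clear the cached data and drop the static reference.
    static synchronized void cleanup() {
        if (instance != null) {
            instance.cache.clear();
            instance = null;
        }
    }

    public static void main(String[] args) {
        TenantRegistry.getInstance().put("tenantA", new Object());
        System.out.println("loaded before cleanup: " + TenantRegistry.isLoaded());
        TenantRegistry.cleanup();
        System.out.println("loaded after cleanup: " + TenantRegistry.isLoaded());
    }
}
```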

Threads and Threadpools

Last but not least come the thread-related problems. Some of the standard points to remember here are:

  • Remove all ThreadLocal values once they have been used, so that they are released
  • Shut down all thread pools so that the threads are released
  • If the thread classes are loaded by the bootstrap classloader and you have called setContextClassLoader on the threads, reset it.
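
The three points above can be sketched as follows. This is a hedged illustration rather than the code used in SMART: TenantThreadCleanup, CURRENT_TENANT, runAsTenant and shutdown are hypothetical names, and the timeout value is arbitrary.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class TenantThreadCleanup {
    // Hypothetical per-thread tenant context.
    static final ThreadLocal<Object> CURRENT_TENANT = new ThreadLocal<Object>();

    // Runs work with the tenant's context classloader and ThreadLocal set, and always undoes both.
    static void runAsTenant(ClassLoader tenantLoader, Object tenant, Runnable work) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(tenantLoader);
        CURRENT_TENANT.set(tenant);
        try {
            work.run();
        } finally {
            CURRENT_TENANT.remove();                  // release the ThreadLocal value
            thread.setContextClassLoader(previous);   // stop referencing the tenant loader
        }
    }

    // Shuts the pool down and waits, so that its worker threads actually exit.
    static boolean shutdown(ExecutorService pool, long timeoutSeconds) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                pool.shutdownNow();
                return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            }
            return true;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader tenantLoader = new ClassLoader() {}; // hypothetical tenant classloader
        runAsTenant(tenantLoader, "tenantA", new Runnable() {
            public void run() {
                System.out.println("running as " + CURRENT_TENANT.get());
            }
        });
        ExecutorService pool = Executors.newFixedThreadPool(2);
        System.out.println("pool terminated: " + shutdown(pool, 5));
    }
}
```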

One of the non-standard problems we faced here was related to subclassAudits. Again, this was something we were not aware existed in threads, but it is present and can prevent a classloader from unloading. subclassAudits is a static variable in the Thread.java class (I still have no idea why it is there), and it contains references to classes derived from the Thread class. What I mean by this is that if you have declared, as we do, a class derived from Thread and use it to start threads rather than the standard Thread class, then a reference to this class is stored in the subclassAudits variable and remains there indefinitely. We had to manually clear this variable using reflection to release our classes and hence the classloader.

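// Clear the static subclassAudits map held by java.lang.Thread via reflection.
Class<?> cls = Thread.class;
Field fld = cls.getDeclaredField("subclassAudits");
fld.setAccessible(true);
Object val = fld.get(null); // static field, so no instance is needed
if (val != null)
{
    synchronized(val)
    {
        Map<?, ?> m = (Map<?, ?>)val;
        m.clear();
    }
}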


Once this was done, the going was smooth: the classloaders got released and no more OutOfMemoryErrors were thrown. Releasing a classloader for garbage collection may be a tough job, but it can be done, even if it is one step at a time and highly time-consuming to find and fix.


 

Topics:
java, solr, high-perf, performance, threads, tips and tricks, classloaders, outofmemory, permgen space
