This article will describe the complete root cause analysis and solution of a
High CPU problem (infinite looping) affecting a Weblogic 10.0 environment running on the Java HotSpot VM 1.5.
This case study again demonstrates the importance of mastering Thread Dump analysis skills and CPU correlation techniques such as Solaris prstat.
- Java EE server: Oracle Weblogic Portal 10.0
- Middleware OS: Solaris 10
- Java VM: Java HotSpot VM 1.5
- Platform type: Portal application
Monitoring and troubleshooting tools
- JVM Thread Dump (HotSpot format)
- Solaris prstat (CPU contributors analysis)
Problem type: High CPU observed from our Weblogic production environment
A high CPU problem was observed on our Solaris physical servers hosting a Weblogic Portal 10 environment. Users were also reporting a major slowdown of the portal application.
Gathering and validation of facts
As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
- What is the client impact? HIGH
- Recent change of the affected platform? No
- Any recent traffic increase to the affected platform? Yes
- How does this high CPU manifest itself? A sudden CPU increase was observed and does not go back down, even after the load drops to a near zero level.
- Did an Oracle OSB recycle resolve the problem? Yes, but the problem returns after a few hours or a few days (unpredictable pattern)
Conclusion #1: The high CPU problem appears to be intermittent rather than purely correlated with load
Conclusion #2: Since high CPU remains after load goes down, this typically indicates the presence of either infinite looping or heavy processing Threads
Solaris CPU analysis using prstat
Solaris prstat is a powerful OS command that reports CPU utilization per process but, more importantly, per Thread (LWP) within a process. As you can see below from our case study, the CPU utilization was confirmed to go up as high as 100% (saturation level).
## PRSTAT (CPU per Java Thread analysis)
prstat -L -p 8223 1 1
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
8223 bea10 2809M 2592M sleep 59 0 14:52:59 38.6% java/494
8223 bea10 2809M 2592M sleep 57 0 12:28:05 22.3% java/325
8223 bea10 2809M 2592M sleep 59 0 11:52:02 28.3% java/412
8223 bea10 2809M 2592M sleep 59 0 5:50:00 0.3% java/84
8223 bea10 2809M 2592M sleep 58 0 2:27:20 0.2% java/43
8223 bea10 2809M 2592M sleep 59 0 1:39:42 0.2% java/41287
8223 bea10 2809M 2592M sleep 59 0 4:41:44 0.2% java/30503
8223 bea10 2809M 2592M sleep 59 0 5:58:32 0.2% java/36116
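As you can see from the above data, 3 Java Threads (LWP 494, 412 and 325) together accounted for most of the CPU utilization: about 89% in this snapshot (38.6% + 28.3% + 22.3%).
For our root cause analysis, we focused on Thread #494 (decimal format), which corresponds to 0x1ee (hexadecimal format). The hexadecimal value is what appears as the nid attribute in a HotSpot Thread Dump.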
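The decimal to hexadecimal conversion above is easy to get wrong by hand, so here is a minimal Java sketch of it. The class and method names (LwpToNid, toNid, parseLwpId) are hypothetical helpers introduced only for illustration; the sample line is taken from the prstat output above.

```java
// Hypothetical helper: maps the LWP id reported by "prstat -L" to the
// hexadecimal "nid" value found in a HotSpot Thread Dump.
class LwpToNid {

    // Convert a decimal LWP id (e.g. 494) to the nid notation (e.g. "0x1ee").
    static String toNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    // Extract the LWP id from the last column of a prstat -L line,
    // which has the form PROCESS/LWPID (e.g. "java/494").
    static int parseLwpId(String prstatLine) {
        String trimmed = prstatLine.trim();
        int slash = trimmed.lastIndexOf('/');
        return Integer.parseInt(trimmed.substring(slash + 1));
    }

    public static void main(String[] args) {
        String line = "8223 bea10 2809M 2592M sleep 59 0 14:52:59 38.6% java/494";
        int lwpId = parseLwpId(line);
        System.out.println(lwpId + " -> nid=" + toNid(lwpId)); // prints "494 -> nid=0x1ee"
    }
}
```

The same conversion gives 0x19c for LWP 412 and 0x145 for LWP 325, the two other high CPU Threads.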
Thread Dump analysis and PRSTAT correlation
Once the culprit Threads were identified, the next step was to correlate this data with the Thread Dump data (which was captured exactly at the same time as prstat).
A quick search within the generated Thread Dump file revealed the Thread Stack Trace (Weblogic Stuck Thread #125) for nid 0x1ee, as shown below.
"[STUCK] ExecuteThread: '125' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=1 tid=0x014c5030 nid=0x1ee runnable [0x536fb000..0x536ffc70]
Thread Dump analysis – HashMap.get() infinite loop condition!
As you can see from the above Thread Stack Trace, the Thread is stuck in an infinite loop inside java.util.HashMap.get(), called from the Apache Axis library.
This finding was quite revealing. The 2 other Threads using high CPU also revealed an infinite looping condition within the same Apache Axis HashMap Object.
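The correlation step described above (locating a Thread in the dump by its nid) can also be scripted when several Threads must be checked. The following is only a sketch under stated assumptions: the class ThreadDumpSearch and its methods are hypothetical, it assumes HotSpot format where each Thread header starts with a double quote and a blank line ends the stack, and the stack frame in the sample is a shortened illustration, not the real trace from our dump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: extracts one Thread (header + stack frames) from a
// HotSpot format Thread Dump, given the hexadecimal nid (e.g. "0x1ee").
class ThreadDumpSearch {

    // True only if the header carries exactly this nid (0x1ee must not match 0x1ee5).
    static boolean hasNid(String headerLine, String nid) {
        for (String token : headerLine.split("\\s+")) {
            if (token.equals("nid=" + nid)) {
                return true;
            }
        }
        return false;
    }

    // Returns the header line and the stack lines of the matching Thread,
    // or an empty list if no Thread has the given nid.
    static List<String> findThread(List<String> dumpLines, String nid) {
        List<String> result = new ArrayList<String>();
        boolean inThread = false;
        for (String line : dumpLines) {
            if (line.startsWith("\"")) {            // a new Thread header
                if (inThread) {
                    break;                          // the next Thread begins: stop
                }
                inThread = hasNid(line, nid);
            } else if (inThread && line.trim().isEmpty()) {
                break;                              // blank line ends the stack
            }
            if (inThread) {
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "\"[STUCK] ExecuteThread: '125' for queue: 'weblogic.kernel.Default (self-tuning)'\" daemon prio=1 tid=0x014c5030 nid=0x1ee runnable [0x536fb000..0x536ffc70]",
            "\tat java.util.HashMap.get(HashMap.java)", // illustrative frame only
            "");
        for (String line : findThread(dump, "0x1ee")) {
            System.out.println(line);
        }
    }
}
```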
Root cause: non Thread safe HashMap in Apache Axis 1.4
Additional research revealed a known defect affecting Apache Axis 1.4, which is the version that our application was using.
As you may already know, using a non Thread safe / non synchronized HashMap under concurrent Thread access is very dangerous and can easily lead to internal HashMap index corruption and / or infinite looping. This is also a golden rule for any middleware software such as Oracle Weblogic, IBM WAS and Red Hat JBoss, which rely heavily on HashMap data structures for various Java EE and caching services.
This best practice also applies to any Open Source third party API such as Apache Axis.
The most common solution is to use the ConcurrentHashMap data structure which is designed for that type of concurrent Thread execution context.
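As an illustration of that approach, here is a minimal sketch of a shared cache backed by ConcurrentHashMap. It is not the Apache Axis code: the SafeCache class, its methods and the "value-for-" payload are assumptions made up for this example. It uses Java 8 syntax (lambdas, computeIfAbsent); on an older JVM such as 1.5, putIfAbsent would play a similar role.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical shared cache that is safe to read and write from many Threads.
class SafeCache {

    private final Map<String, String> cache = new ConcurrentHashMap<String, String>();

    // Atomically returns the cached value, computing it on first access.
    String lookup(String key) {
        return cache.computeIfAbsent(key, k -> "value-for-" + k);
    }

    int size() {
        return cache.size();
    }

    // Starts several Threads that all hit the same keys, then returns the
    // number of entries once every Thread has finished.
    static int runConcurrently(int threads, int keysPerThread) {
        SafeCache shared = new SafeCache();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < keysPerThread; i++) {
                    shared.lookup("key-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return shared.size();
    }

    public static void main(String[] args) {
        // 8 Threads share the same 1000 keys, so exactly 1000 entries are expected.
        System.out.println(runConcurrently(8, 1000));
    }
}
```

With a plain HashMap in place of ConcurrentHashMap, the same test offers no guarantee: entries can be lost and, on older JVMs, a Thread can spin forever inside get() or put().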
Our team applied the proposed patch from Apache (synchronizing access to the non Thread safe HashMap), which resolved the problem. We are also currently looking at upgrading our application to a newer version of Apache Axis.
I hope this case study has helped you understand how to pinpoint the root cause of high CPU Threads and the importance of proper Thread safe data structures for highly concurrent applications.
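For readers unfamiliar with the synchronization approach, the sketch below shows the general idea using Collections.synchronizedMap. This is not the actual Apache patch, whose content is not reproduced here; SynchronizedRegistry and its methods are hypothetical names used only to show how a plain HashMap can be guarded.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry: a plain HashMap wrapped so that every call is
// serialized on a single lock.
class SynchronizedRegistry {

    private final Map<Integer, String> entries =
            Collections.synchronizedMap(new HashMap<Integer, String>());

    void register(int id, String value) {
        entries.put(id, value);
    }

    String find(int id) {
        return entries.get(id);
    }

    int size() {
        return entries.size();
    }

    // Each Thread registers its own range of ids; returns the final entry count.
    static int fillFromThreads(int threads, int perThread) {
        final SynchronizedRegistry registry = new SynchronizedRegistry();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int offset = t * perThread;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    registry.register(offset + i, "entry-" + (offset + i));
                }
            });
            workers[t].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return registry.size();
    }

    public static void main(String[] args) {
        // 4 Threads x 5000 distinct ids each: 20000 entries are expected.
        System.out.println(fillFromThreads(4, 5000));
    }
}
```

The trade-off compared with ConcurrentHashMap is that all callers contend on one lock, which is simple and safe but can limit throughput under heavy load.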
Please don’t hesitate to post any comment or question.