Oracle Service Bus – Stuck Thread Case Study
This case study describes the complete root cause analysis process of a stuck thread problem experienced with Oracle Service Bus 11g running on AIX 6.1 and the IBM Java VM 1.6.
This article is also a great opportunity for you to improve your thread dump analysis skills, and I highly recommend that you study and properly understand the following analysis approach. It will also demonstrate the importance of proper data gathering, as opposed to premature middleware (Weblogic) restarts.
Environment specifications
- Java EE server: Oracle Service Bus 11g
- OS: AIX 6.1
- JDK: IBM JRE 1.6.0 (64-bit)
- RDBMS: Oracle 10g
- Platform type: Enterprise Service Bus
Troubleshooting tools
- Quest Software Foglight for Java (monitoring and alerting)
- Java VM Thread Dump (IBM JRE javacore format)
Problem overview
Major performance degradation was observed in our Oracle Service Bus Weblogic environment. Alerts were also sent from the Foglight agents indicating a significant surge in Weblogic thread utilization.
Gathering and validation of facts
As usual, a Java EE problem investigation requires the gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:
- What is the client impact? HIGH
- Recent change of the affected platform? Yes, logging level changed in OSB console for a few business services prior to outage report
- Any recent traffic increase to the affected platform? No
- How long has this problem been observed? It is a new problem, observed following the logging level changes
- Did a restart of the Weblogic server resolve the problem? Yes
Conclusion #1:
The logging level changes applied earlier on some OSB business services
appear to have triggered this stuck thread problem. However, the root
cause remains unknown at this point.
Weblogic threads monitoring: Foglight for Java
Foglight for Java (from Quest Software) is a great monitoring tool that allows you to fully monitor any Java EE environment, with complete alerting capabilities. This tool is used in our production environment to monitor the middleware (Weblogic) data, including threads, for each of the Oracle Service Bus managed servers. You can see below a consistent increase in thread usage along with a growing pending request queue.
For your reference, Weblogic identifies slow-running threads as “Hogging Threads”; these can eventually be promoted to “STUCK” status if they keep running for several minutes (as per your configured StuckThreadMaxTime threshold).
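For readers without a commercial monitoring tool such as Foglight, the same thread pool health data (hogging thread count, pending request queue) can be pulled from the Weblogic ThreadPoolRuntime MBean over JMX. Below is a minimal sketch; the host, port, credentials and managed server name are placeholders of mine, and it assumes the Weblogic JMX client library (e.g. wljmxclient.jar) is on the classpath.

import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ThreadPoolMonitor {
    public static void main(String[] args) throws Exception {
        // Hypothetical host, port and managed server name
        JMXServiceURL url = new JMXServiceURL("iiop", "osb-host", 7001,
                "/jndi/weblogic.management.mbeanservers.runtime");
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(javax.naming.Context.SECURITY_PRINCIPAL, "weblogic");   // placeholder credentials
        env.put(javax.naming.Context.SECURITY_CREDENTIALS, "password");
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");
        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        MBeanServerConnection conn = connector.getMBeanServerConnection();
        ObjectName threadPool = new ObjectName(
                "com.bea:Name=ThreadPoolRuntime,Type=ThreadPoolRuntime,ServerRuntime=osb_server1");
        // Hogging thread count and pending request queue depth for this managed server
        System.out.println("HoggingThreadCount=" + conn.getAttribute(threadPool, "HoggingThreadCount"));
        System.out.println("PendingUserRequestCount=" + conn.getAttribute(threadPool, "PendingUserRequestCount"));
        connector.close();
    }
}

Polling these two attributes at a short interval gives you the same early-warning signal as the Foglight thread graphs.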
Now what should be your next course of action? Weblogic restart? Definitely not…
Your first “reflex” for this type of problem is to capture a JVM Thread Dump. Such data is critical for you to perform proper root cause analysis and understand the potential hanging condition. Once such data is captured, you can then proceed with Weblogic server recovery actions such as a full managed server (JVM) restart.
Stuck Threads: Thread Dump to the rescue!
The next course of action in this outage scenario was to quickly generate a few thread dump snapshots from the IBM JVM before attempting to recover the affected Weblogic instances. The thread dumps were generated using kill -3 <Java PID>, which produced a few javacore files at the root of the Weblogic domain.
javacore.20120610.122028.15149052.0001.txt
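As a side note, an equivalent thread dump can also be captured programmatically from within the JVM via the standard java.lang.management API. The sketch below is a generic illustration of mine, not how the javacore above was produced (that was kill -3):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpCapture {
    public static void main(String[] args) {
        // Dump all live threads, including locked monitors and ownable synchronizers
        ThreadMXBean mxBean = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : mxBean.dumpAllThreads(true, true)) {
            // ThreadInfo.toString() prints the thread state, contended lock
            // and a truncated stack trace for each thread
            System.out.print(info);
        }
    }
}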
Once the production environment was back up and running, the team quickly proceeded with the analysis of the captured thread dump files, as per the steps below.
Thread Dump analysis step #1 – identify a thread execution pattern
The first analysis step is to quickly go through all the Weblogic threads and attempt to identify a common problematic pattern, such as threads waiting for remote external systems, threads in deadlock state, threads waiting for other threads to complete their tasks, etc.
The analysis quickly revealed many threads involved in the same blocking situation, as per below. In this sample, we can see an Oracle Service Bus thread in blocked state within the TransactionManager Java class (OSB kernel code).
[ACTIVE] ExecuteThread: '292' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x0000000139B76B00, j9thread_t:0x000000013971C9A0, java/lang/Thread:0x07000000F9D80630, state:B, prio=5 (native thread ID:0x2C700D1, native priority:0x5, native policy:UNKNOWN) Java callstack: at com/bea/wli/config/transaction/TransactionManager._beginTransaction(TransactionManager.java:547(Compiled Code)) at com/bea/wli/config/transaction/TransactionManager.beginTransaction(TransactionManager.java:409(Compiled Code)) at com/bea/wli/config/derivedcache/DerivedResourceManager.getDerivedValueInfo(DerivedResourceManager.java:339(Compiled Code)) at com/bea/wli/config/derivedcache/DerivedResourceManager.get(DerivedResourceManager.java:386(Compiled Code)) at com/bea/wli/sb/resources/cache/DefaultDerivedTypeDef.getDerivedValue(DefaultDerivedTypeDef.java:106(Compiled Code)) at com/bea/wli/sb/pipeline/RouterRuntimeCache.getRuntime(RouterRuntimeCache.java(Compiled Code)) at com/bea/wli/sb/pipeline/RouterManager.getRouterRuntime(RouterManager.java:640(Compiled Code)) at com/bea/wli/sb/pipeline/RouterContext.getInstance(RouterContext.java:172(Compiled Code)) at com/bea/wli/sb/pipeline/RouterManager.processMessage(RouterManager.java:579(Compiled Code)) at com/bea/wli/sb/transports/TransportManagerImpl.receiveMessage(TransportManagerImpl.java:375(Compiled Code)) at com/bea/wli/sb/transports/local/LocalMessageContext$1.run(LocalMessageContext.java:179(Compiled Code)) at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:363(Compiled Code)) at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:146(Compiled Code)) at weblogic/security/Security.runAs(Security.java:61(Compiled Code)) at com/bea/wli/sb/transports/local/LocalMessageContext.send(LocalMessageContext.java:144(Compiled Code)) at com/bea/wli/sb/transports/local/LocalTransportProvider.sendMessageAsync(LocalTransportProvider.java:322(Compiled Code)) at sun/reflect/GeneratedMethodAccessor980.invoke(Bytecode PC:58(Compiled Code)) at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code)) at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code)) at com/bea/wli/sb/transports/Util$1.invoke(Util.java:83(Compiled Code)) at $Proxy111.sendMessageAsync(Bytecode PC:26(Compiled Code)) ……………………………

Thread Dump analysis step #2 – review the blocked threads chain
The next step was to review the affected and blocked thread chains involved in our identified pattern. As we saw in thread dump analysis part 4, the IBM JVM thread dump format contains a separate section that provides a full breakdown of all blocked thread chains, e.g. the Java object monitor pool locks.
A quick analysis revealed the culprit thread, as per below. As you can see, Weblogic thread #16 is the actual culprit, with 300+ threads waiting to acquire a lock on the shared object monitor TransactionManager@0x0700000001A51610/0x0700000001A51628.
2LKMONINUSE sys_mon_t:0x000000012CCE2688 infl_mon_t: 0x000000012CCE26C8:
3LKMONOBJECT com/bea/wli/config/transaction/TransactionManager@0x0700000001A51610/0x0700000001A51628: Flat locked by "[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CA3C800), entry count 1
3LKWAITERQ Waiting to enter:
3LKWAITER "[STUCK] ExecuteThread: '0' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011C785C00)
3LKWAITER "[STUCK] ExecuteThread: '1' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011CA93200)
3LKWAITER "[STUCK] ExecuteThread: '2' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011B3F2B00)
3LKWAITER "[STUCK] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000011619B300)
3LKWAITER "[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012CBE8000)
3LKWAITER "[STUCK] ExecuteThread: '21' for queue: 'weblogic.kernel.Default (self-tuning)'" (0x000000012BE91200)
..................
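Counting waiters by hand is error-prone with hundreds of queued threads, so a tiny parser helps. The following hypothetical sketch summarizes the LOCKS section of a javacore file (path passed as the first argument) by counting the 3LKWAITER entries queued under each 3LKMONOBJECT monitor; only the tag layout shown above is assumed.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class JavacoreLockSummary {
    public static void main(String[] args) throws IOException {
        // Reads the javacore file and tallies waiters per contended monitor
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String currentMonitor = null;
        int waiters = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("3LKMONOBJECT")) {
                if (currentMonitor != null) {
                    System.out.println(waiters + " waiter(s) on " + currentMonitor);
                }
                currentMonitor = line.substring("3LKMONOBJECT".length()).trim();
                waiters = 0;
            } else if (line.startsWith("3LKWAITER")) {
                waiters++;
            }
        }
        if (currentMonitor != null) {
            System.out.println(waiters + " waiter(s) on " + currentMonitor);
        }
        reader.close();
    }
}

Run against the javacore above, it would immediately surface the TransactionManager monitor and its 300+ waiters.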
Thread Dump analysis step #3 – thread culprit deeper analysis
Once you identify a primary culprit thread, the next step is to perform a deeper review of the computing task this thread is currently performing. Simply go back to the raw thread dump data and start analyzing the culprit thread's stack trace from the bottom up.
As you can see below, the thread stack trace for our problem case was quite revealing. It shows that thread #16 was attempting to commit a change made at the Weblogic / Oracle Service Bus level. The problem is that the commit operation was hanging and taking too much time, causing thread #16 to retain the shared TransactionManager object monitor for too long and “starve” the other Oracle Service Bus Weblogic threads.
"[ACTIVE] ExecuteThread: '16' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x000000012CA3C800, j9thread_t:0x000000012C9F0F40, java/lang/Thread:0x0700000026FCE120, state:P, prio=5 (native thread ID:0x35B0097, native priority:0x5, native policy:UNKNOWN) Java callstack: at sun/misc/Unsafe.park(Native Method) at java/util/concurrent/locks/LockSupport.park(LockSupport.java:184(Compiled Code)) at java/util/concurrent/locks/AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:822) at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:853(Compiled Code)) at java/util/concurrent/locks/AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1189(Compiled Code)) at java/util/concurrent/locks/ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:911(Compiled Code)) at com/bea/wli/config/derivedcache/DerivedCache$Purger.changesCommitted(DerivedCache.java:80) at com/bea/wli/config/impl/ResourceListenerNotifier.afterEnd(ResourceListenerNotifier.java:120) at com/bea/wli/config/transaction/TransactionListenerWrapper.afterEnd(TransactionListenerWrapper.java:90) at com/bea/wli/config/transaction/TransactionManager.notifyAfterEnd(TransactionManager.java:1154(Compiled Code)) at com/bea/wli/config/transaction/TransactionManager.commit(TransactionManager.java:1519(Compiled Code)) at com/bea/wli/config/transaction/TransactionManager._endTransaction(TransactionManager.java:842(Compiled Code)) at com/bea/wli/config/transaction/TransactionManager.endTransaction(TransactionManager.java:783(Compiled Code)) at com/bea/wli/config/deployment/server/ServerDeploymentReceiver$2.run(ServerDeploymentReceiver.java:275) at weblogic/security/acl/internal/AuthenticatedSubject.doAs(AuthenticatedSubject.java:321(Compiled Code)) at weblogic/security/service/SecurityManager.runAs(SecurityManager.java:120(Compiled Code)) at com/bea/wli/config/deployment/server/ServerDeploymentReceiver.commit(ServerDeploymentReceiver.java:260) at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.doCommitCallback(DeploymentReceiverCallbackDeliverer.java:195) at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer.access$100(DeploymentReceiverCallbackDeliverer.java:13) at weblogic/deploy/service/internal/targetserver/DeploymentReceiverCallbackDeliverer$2.run(DeploymentReceiverCallbackDeliverer.java:68) at weblogic/work/SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:528(Compiled Code)) at weblogic/work/ExecuteThread.execute(ExecuteThread.java:203(Compiled Code)) at weblogic/work/ExecuteThread.run(ExecuteThread.java:171(Compiled Code))
Root cause: connecting the dots
At this point, the collection of facts and the thread dump analysis allowed us to determine the chain of events below:
- Logging level change applied by the production Oracle Service Bus administrator
- Weblogic deployment thread #16 was unable to commit the change properly
- Weblogic runtime threads executing client requests quickly started to queue up and wait for a lock on the shared object monitor (TransactionManager)
- The Weblogic instances ran out of threads, generating alerts and forcing the production support team to shut down and restart the affected JVM processes
Our team is planning to open an Oracle SR shortly to share this OSB deployment behaviour, along with the hard dependency between the client request threads and the OSB logging layer. In the meantime, no OSB logging level change will be attempted outside the maintenance window period until further notice.
Conclusion
I hope this article has helped you understand and appreciate how powerful thread dump analysis can be to pinpoint the root cause of stuck thread problems, and the importance for any Java EE production support team of capturing such crucial data in order to prevent future re-occurrences.
Published at DZone with permission of Pierre-Hugues Charbonneau, DZone MVB. See the original article here.