PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
9116 bea 3109M 2592M sleep 59 0 21:52:59 8.6% java/76
9116 bea 3109M 2592M sleep 57 0 4:28:05 0.3% java/40
9116 bea 3109M 2592M sleep 59 0 6:52:02 0.3% java/10774
9116 bea 3109M 2592M sleep 59 0 6:50:00 0.3% java/84
9116 bea 3109M 2592M sleep 58 0 4:27:20 0.2% java/43
9116 bea 3109M 2592M sleep 59 0 7:39:42 0.2% java/41287
9116 bea 3109M 2592M sleep 59 0 3:41:44 0.2% java/30503
9116 bea 3109M 2592M sleep 59 0 5:48:32 0.2% java/36116
9116 bea 3109M 2592M sleep 59 0 6:15:52 0.2% java/36118
9116 bea 3109M 2592M sleep 59 0 2:44:02 0.2% java/36128
9116 bea 3109M 2592M sleep 59 0 5:53:50 0.1% java/36111
9116 bea 3109M 2592M sleep 59 0 4:27:55 0.1% java/55
9116 bea 3109M 2592M sleep 59 0 9:51:19 0.1% java/23479
9116 bea 3109M 2592M sleep 59 0 4:57:33 0.1% java/36569
9116 bea 3109M 2592M sleep 59 0 9:51:08 0.1% java/23477
9116 bea 3109M 2592M sleep 59 0 10:15:13 0.1% java/4339
9116 bea 3109M 2592M sleep 59 0 10:13:37 0.1% java/4331
9116 bea 3109M 2592M sleep 59 0 4:58:37 0.1% java/36571
9116 bea 3109M 2592M sleep 59 0 3:13:46 0.1% java/41285
9116 bea 3109M 2592M sleep 59 0 4:27:32 0.1% java/48
9116 bea 3109M 2592M sleep 59 0 5:25:28 0.1% java/30733
Thread Dump and PRSTAT correlation approach
In order to correlate the PRSTAT data with your Thread Dump, please follow the steps below:
- Identify the primary culprit / contributor Thread(s) and locate the Thread Id (decimal format) in the last column
- Convert the decimal value to hexadecimal, since the Thread Dump native Thread Id is in hexadecimal format (a small conversion sketch is shown further below)
- Search within your captured Thread Dump data and locate the native Thread Id >> nid=<Thread Id in hexadecimal>
For our case study:
- Thread Id #76 identified as the primary culprit
- Thread Id #76 was converted to hexadecimal format >> 0x4c
- Thread Dump data Thread native Id >> nid=0x4c
This correlation confirmed that the rogue Thread was actively processing (reading from / writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.
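The conversion and lookup steps can be illustrated with the minimal Java sketch below. The LWPID value (76) comes from the PRSTAT snapshot above; the Thread Dump file name (threaddump.txt) is only a placeholder for your own captured data.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PrstatThreadDumpCorrelation {

    public static void main(String[] args) throws IOException {
        // LWPID of the top CPU contributor as reported by prstat (decimal)
        int lwpId = 76;

        // Thread Dumps expose the native thread id in hexadecimal >> nid=0x4c
        String nid = "nid=0x" + Integer.toHexString(lwpId);
        System.out.println("Searching for " + nid);

        // Scan the captured Thread Dump and print the matching thread header line(s)
        try (Stream<String> lines = Files.lines(Paths.get("threaddump.txt"))) {
            lines.filter(line -> line.contains(nid))
                 .forEach(System.out::println);
        }
    }
}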
Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts, based on the message context in a pipeline, and send them to an alert destination. Such an alert destination is the actual WebLogic diagnostic File Store, which means this structure will grow over time depending on the volume of Alerts that your OSB application is generating.
It is located under >> //domain_name/servers/server_name/data/store/diagnostics/
In our case, the File Store size was around 800 MB.
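As a simple way to keep an eye on this growth, the following Java sketch sums the size of the files found under the diagnostic File Store directory. The path below reuses the same domain_name / server_name placeholders as above and must be adjusted to your environment; it is a monitoring sketch only, not an Oracle-provided utility.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DiagnosticStoreSizeCheck {

    public static void main(String[] args) throws IOException {
        // Placeholder path; replace domain_name and server_name with your own values
        Path storeDir = Paths.get("/domain_name/servers/server_name/data/store/diagnostics");

        // Sum the size of every regular file found under the diagnostic File Store directory
        long totalBytes;
        try (Stream<Path> files = Files.walk(storeDir)) {
            totalBytes = files.filter(Files::isRegularFile)
                              .mapToLong(p -> p.toFile().length())
                              .sum();
        }
        System.out.printf("Diagnostic File Store size: %.1f MB%n", totalBytes / (1024.0 * 1024.0));
    }
}

In the case study environment, such a check would have reported the roughly 800 MB figure mentioned above.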
The increase of the diagnostic File Store size over time leads to an increase in the elapsed time of the Thread involved in read/write operations, which ends up allocating a large amount of memory on the Java Heap. Such memory cannot be garbage collected until the Thread completes, which leads to OutOfMemoryError conditions and performance degradation.
Two actions were required to resolve this problem:
- Reset the diagnostic File Store by renaming the existing data file and forcing WebLogic to create a fresh one (a small sketch of this step follows below the list)
- Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level
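For illustration purposes only, here is a hedged Java sketch of the first action. The directory path reuses the placeholders from above, the *.DAT file name pattern is an assumption based on a typical WebLogic file store layout, and the rename must only be performed while the affected managed server is stopped; a plain shell rename works just as well.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DiagnosticStoreReset {

    public static void main(String[] args) throws IOException {
        // Placeholder path; replace domain_name and server_name with your own values
        Path storeDir = Paths.get("/domain_name/servers/server_name/data/store/diagnostics");

        // Rename every existing data file (assumed *.DAT) so WebLogic creates a fresh,
        // empty diagnostic File Store on the next server startup
        try (DirectoryStream<Path> files = Files.newDirectoryStream(storeDir, "*.DAT")) {
            for (Path dataFile : files) {
                Path backup = dataFile.resolveSibling(dataFile.getFileName() + ".old");
                Files.move(dataFile, backup);
                System.out.println("Renamed " + dataFile + " to " + backup);
            }
        }
    }
}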
The reset of the diagnostic File Store brought some immediate relief by ensuring short and optimal diagnostic File Store operations that no longer put excessive pressure on the Java Heap.
The level of OSB alerting is still in review and will be reduced shortly in order to prevent this problem at the source.
Regular monitoring and purging of the diagnostic File Store will also be implemented, as per Oracle recommendations, in order to prevent such performance degradation going forward.
Conclusion and recommendations
I hope this article has helped you understand how to identify and resolve this Oracle Service Bus problem and appreciate how powerful Thread Dump analysis and the Solaris PRSTAT tool can be for pinpointing this type of issue.
Please do not hesitate to post any comment or question.