Java Application Troubleshooting the Sherlock Holmes Way
Java application failure troubleshooting can closely resemble a typical crime scene investigation
Some of Sherlock Holmes' methods of solving crimes inspired me to do better in my own troubleshooting. Java application failures can be painful when the application doesn't leave enough traces to help us solve the mystery. In the next few paragraphs, I will share my experience of how our troubleshooting was done, step by step.
I recently joined a project that had been running for a few years. Before going into the problem, here are some facts about it:
- This system has an API that accepts various kinds of requests from POS devices.
- While processing a request, it sends queries to third-party systems.
- It has a database that stores product and purchase order information.
- In recent times, a few more businesses were integrated into the system, so it received more traffic.
- Even though the system provides a lot of functionality, it had not yet been tuned for performance.
1) The Mystery Happens, and It Leaves No Clue
One fine morning, our system failed to respond to any API requests, and we got complaints from clients. We had to restart the system to restore it to its normal state. We didn't have many traces to help us understand why the system had gone into that weird state; the few we had were not sufficient to determine the reason. Now we had a mystery on our hands, but no evidence to lead us.
We were able to capture the following information during the time of the incident.
- Slow query log of our MySQL database.
It contained several slow queries that appeared only during the incident, and we were confused about why they were not present at other times. Furthermore, the incident did not happen during peak traffic, so we were puzzled about why our system failed under normal load.
- We had CPU usage logs, which showed our Java application consuming nearly 100% of the CPU for more than 20 minutes.
This showed that our threads were kept busy during the incident, but we couldn't determine what kept them busy.
- Application request logs were reviewed many times with no luck.
We couldn't determine the reason from the limited information above, so we fell back on guesswork for a solution and thought of increasing the heap memory.
2) The Trap
Anyway, we planned to wait until the issue happened next time to collect further evidence, so we set a trap to capture more details. The JDK offers some tools that are quite handy for identifying this kind of mystery.
1) jstack — This tool takes a thread dump. A thread dump gives a clear indication of every thread's status, along with its associated stack trace, at a given moment.
Since our Java application took nearly 100% CPU, we were curious to know what our threads were doing at that time. It is usually advisable to take 3-4 such thread dumps at intervals of about 5 seconds, and it is important to take them while the system is experiencing the issue.
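On the command line, this is simply `jstack <pid> > dump_1.txt`, repeated a few times. As a minimal sketch (this class is my own illustration, not part of our code base), the same kind of snapshot can also be captured from inside the application with `ThreadMXBean`, which is handy if you want a diagnostics endpoint that dumps its own threads:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Programmatic equivalent of a jstack snapshot, taken from inside the JVM.
public class ThreadDumpSketch {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        // dumpAllThreads(false, false): skip lock/monitor details for brevity
        for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                .dumpAllThreads(false, false)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

Note that `ThreadInfo.toString()` truncates deep stacks, so for serious analysis the external `jstack` output is still preferable.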
2) jmap -histo — This gives a memory snapshot as an object histogram, showing how many objects of each class exist within the heap. If the system has a memory leak, this histogram can easily point out the culprit.
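The command is `jmap -histo <pid>`, and its output is a table with one row per class: a rank, an instance count, a byte total, and the class name. As an illustrative sketch (the histogram lines and numbers below are made up, not from our incident), picking out the heaviest class from such output might look like this:

```java
import java.util.Arrays;
import java.util.Comparator;

// Finds the class with the largest byte footprint in jmap -histo style output.
public class HistoTopFinder {
    public static String topByBytes(String[] histoLines) {
        return Arrays.stream(histoLines)
                // columns: rank, #instances, #bytes, class name
                .map(l -> l.trim().split("\\s+"))
                .max(Comparator.comparingLong((String[] p) -> Long.parseLong(p[2])))
                .map(p -> p[3])
                .orElse("");
    }

    public static void main(String[] args) {
        // Fabricated sample in the format jmap -histo prints
        String[] histo = {
            "   1:        981234      47099232  [B",
            "   2:        412000      13184000  java.lang.String",
            "   3:         90321       8670816  com.gotham.api.Request"
        };
        System.out.println("Top by bytes: " + topByBytes(histo));
    }
}
```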
This is how we set up the trap for our unknown villain.
3) The Villain Visits Again, so the Incident Happens Yet Again
The same issue occurred again, and this time we had only a short window to collect traces, so the team took one thread dump and one jmap -histo just before the system restart. We prepared a summary of the thread states using the evergreen Linux shell commands, and it showed that one-third of the application threads were stuck trying to place their incoming commands into a request queue.
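A state summary like ours can be produced with a one-liner along the lines of `grep 'java.lang.Thread.State' dump.txt | sort | uniq -c` (the exact commands we used are not reproduced here). The same counting can be sketched in Java; the dump fragment below is fabricated for illustration:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Counts thread states in jstack-style output, mimicking
// `grep 'java.lang.Thread.State' dump.txt | sort | uniq -c`.
public class ThreadStateSummary {
    public static Map<String, Long> summarize(String dump) {
        Matcher m = Pattern
                .compile("java\\.lang\\.Thread\\.State: (\\w+)")
                .matcher(dump);
        Map<String, Long> counts = new TreeMap<>();
        while (m.find()) {
            counts.merge(m.group(1), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny fabricated fragment; a real dump has one such line per thread
        String dump = String.join("\n",
            "\"http-1\"   java.lang.Thread.State: WAITING (parking)",
            "\"http-2\"   java.lang.Thread.State: WAITING (parking)",
            "\"worker-1\" java.lang.Thread.State: RUNNABLE");
        System.out.println(summarize(dump));
    }
}
```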
4) Time to Investigate the Evidence
Here is how I processed the thread dump.
A. First, I looked for references to our local package name. Our local packages start with 'com.gotham'.
Boy, there were quite a lot of threads stuck in our code. Our own code was keeping the system busy.
B. Next, I wanted to find the most popular lines of code, the ones where most of the threads were stuck. This is where the 'uniq' command is handy.
This is an extract of that command's output. The first column shows the number of appearances of the line listed in the second column. I have always loved the 'uniq' command for summarizing with counts.
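The pipeline was roughly `grep 'com.gotham' dump.txt | sort | uniq -c | sort -rn`. The counting logic can also be sketched in Java; the class and line numbers below are hypothetical stand-ins, not our real stack lines:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Mimics `sort | uniq -c | sort -rn`: count identical stack lines,
// then rank them by frequency, most common first.
public class StackLineCounter {
    public static List<Map.Entry<String, Long>> rank(List<String> lines) {
        return lines.stream()
                .collect(Collectors.groupingBy(l -> l, Collectors.counting()))
                .entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stack lines pulled from a thread dump with grep
        List<String> lines = List.of(
            "at com.gotham.api.RequestHandler.enqueue(RequestHandler.java:58)",
            "at com.gotham.api.RequestHandler.enqueue(RequestHandler.java:58)",
            "at com.gotham.api.RequestHandler.enqueue(RequestHandler.java:58)",
            "at com.gotham.db.ProductDao.find(ProductDao.java:112)");
        rank(lines).forEach(e ->
            System.out.println(e.getValue() + " " + e.getKey()));
    }
}
```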
My shell commands showed that most of the threads were stuck on one particular line.
We finally had a place to start. I checked the relevant part of the code base to understand what our code does at that particular line.
C. I then read the complete stack trace associated with that line and found that 91 threads were waiting to put their requests into a queue, which must have been full at that particular time.
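Our queue implementation is not shown here, but the behavior those stack traces pointed at is the standard `BlockingQueue.put` contract: when a bounded queue is full, `put` blocks until space frees up. A minimal sketch (all names are mine) of both the blocking behavior and a fail-fast alternative:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// When an ArrayBlockingQueue is full, put() blocks indefinitely,
// which is exactly how request threads end up stuck.
public class FullQueueDemo {
    public static boolean tryEnqueue(BlockingQueue<String> queue,
                                     String request, long timeoutMs)
            throws InterruptedException {
        // offer() with a timeout fails fast instead of blocking forever
        return queue.offer(request, timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        queue.put("first");   // fills the queue to capacity
        // queue.put("second") would now block until a consumer drains it
        boolean accepted = tryEnqueue(queue, "second", 100);
        System.out.println("accepted=" + accepted);
    }
}
```

With no consumer making progress, every thread that calls `put` on the full queue parks, which matches the 91 waiting threads we saw.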
D. I went to the code base and investigated the reasons for the queue becoming full. I found that the queue consumer pushes notifications to the Apple push-notification service, and even though that call was asynchronous, we were waiting on the Future object just to log the status of the notification-sending activity.
Unfortunately, the lone queue-consumer thread was stuck waiting indefinitely for that response, as its stack trace in the thread dump showed; it may have lost the response due to packet loss. We had just that one queue-consumer thread.
Packet loss can happen at any time on any network, and we may unfortunately have stepped into exactly such a situation.
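One way to harden such code, assuming the status is only needed for logging, is a bounded `Future.get(timeout, unit)` so the consumer survives a lost response instead of waiting forever. A sketch under that assumption (class and method names are my own illustration, not our production code):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// A consumer that waits a bounded time for the notification status
// instead of blocking forever on Future.get().
public class NotificationConsumer {
    public static String awaitStatus(Future<String> reply, long timeoutMs) {
        try {
            return "status=" + reply.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            reply.cancel(true);   // give up on the lost response
            return "status=unknown (timed out)";
        } catch (InterruptedException | ExecutionException e) {
            return "status=failed: " + e;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Simulates a push-notification call whose response never arrives
        Future<String> lost = pool.submit(() -> {
            Thread.sleep(60_000);
            return "delivered";
        });
        System.out.println(awaitStatus(lost, 200));
        pool.shutdownNow();
    }
}
```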
5) Crime Scene Reconstruction
Even though we had derived all of the above, we needed to reproduce the same scenario to prove it more strongly. So we introduced an artificial wait in the queue-consuming thread to see whether the same situation would occur, and sure enough, CPU usage started to increase as we fired more requests at our system. Our villainous code was revealed, so now it was time to terminate the villain.
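Our reconstruction can be sketched roughly as follows (queue size, timings, and names are my own illustration, not our actual test harness): stall the single consumer artificially and watch the producers fall behind almost immediately.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Reproduces the failure mode: a single stalled consumer lets the
// bounded request queue fill up, and producers start to back up.
public class StallReproduction {
    public static int floodWhileStalled(int requests) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    queue.take();
                    Thread.sleep(5_000);   // artificial wait, as in our experiment
                }
            } catch (InterruptedException stop) {
                // exit on interrupt
            }
        });
        consumer.start();
        int enqueued = 0;
        for (int i = 0; i < requests; i++) {
            // A real request thread would block in put(); offer() lets us count instead
            if (queue.offer(i, 50, TimeUnit.MILLISECONDS)) {
                enqueued++;
            }
        }
        consumer.interrupt();
        return enqueued;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("enqueued " + floodWhileStalled(10) + " of 10 requests");
    }
}
```

In production the producers called a blocking `put`, so instead of dropped requests we saw threads piling up, exactly as in the thread dump.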
We removed that unnecessary thread-wait, and we lived happily ever after :).
The End Nears...
In this situation, we couldn't solve the mystery by analyzing the existing traces, so we prepared to capture more information in the form of a thread dump and a jmap -histo. Luckily, the thread dump alone was quite sufficient to solve the mystery.
A thread dump is always a suitable weapon for analyzing a Java application that has become unresponsive, and shell commands are really powerful for distilling the essence of an issue. It is always useful to learn them.
Opinions expressed by DZone contributors are their own.