Fighting Java Performance Issues in Production Systems
Fight Java performance issues in your production systems!
In my first article, I explained how to fight memory leaks in production systems. In this article, I explain how to troubleshoot Java performance issues in production systems. Performance issues can appear at any time: immediately after a system goes into production, or months or even years after it is deployed. I once had to work on a performance issue that appeared three years after the system originally went to production, so you have to be ready to work on performance issues at any time. We will start by exploring the reasons for performance issues in production.
Reasons for Performance Issues in Production Systems
1. Lack of Local Performance Testing
There are many combinations of conditions under which performance tests need to be run. For example:
- Different delay patterns, e.g. network delays, third-party response delays, etc.
- Different proportions of request types. Certain request types put much more stress on the system than others, so it is good to test with the request mix forecast for production.
Further, we may not have production-equivalent hardware in the local testbed. This can be a serious issue, because it limits how much performance testing we can do. So, even when production-equivalent hardware is not available, it is recommended to test performance up to some accepted level.
2. Third-Party Systems May Behave in Unexpected Ways
If our system is a middleware platform, many third-party systems may connect to it. Different third parties behave in different ways, which is hard to control and hard to predict in the early stages. Therefore, these behaviors need to be monitored so that we can adjust our system to cope with them.
3. Unexpected Network Delays
Network delays can have a big impact on the system's performance, as request-sending threads are blocked until responses are received. If those threads wait synchronously for the response, network delays reduce the number of free threads at any given time. These network delays are quite hard to prove unless a valid packet trace is taken during the time of the delay.
4. Lack of Logs to Measure the System's Performance
If you don't have a proper performance log, you won't know that your system is struggling until your users complain. The performance log's format will vary based on the system's workflow. If your system's response time matters to its users, log the average response time over a given period, along with other measurable parameters.
If your system depends heavily on database response times or third-party response times, it is better to measure those delays and log the average and maximum times taken. So, choose your performance log's format wisely to suit your performance indicators.
5. Data Growth in Database Systems Will Slow Down Your System
As time goes on, production systems accumulate data in their databases, and database queries may become slower as they have to deal with more data. So, we have to generate enough test data and test the performance locally.
What You Should Do to Fix Production Performance Issues
1. Make Sure to Go Through All the Possible Traces in Hand
- Performance logs: Start with the performance logs to see whether anything unusual shows up at the times the performance issues were reported. These logs can give partial conclusions and tell you where to look next in your investigation. But they need to be detailed enough to identify performance-related issues.
- Trace logs: Next, look for the matching information in the trace logs or request logs to identify possible issues. Try to learn how the system behaved while the issue was occurring. If the trace logs contain any unexpected errors or entries, investigate whether they have any impact on performance.
- Database slow query logs: Check whether there are any slow queries.
- Third-party request/response logs: Analyze whether there are any delays from third-party systems that might contribute to your system's delays.
2. Taking Packet Traces
If you suspect a network delay and the performance issue is happening at the time of your investigation, take network packet captures. Otherwise, ask your System Support Engineers to take one the next time the performance issue occurs.
- Your system logs might indicate a big delay between the request and the response. But your requests might be queued up due to a system bug, so the only way to be sure is to bring in an external tool to capture the network traffic. If network delays are the probable cause of the performance issue, packet traces will help you prove that the delay is in the third-party responses. Tools such as Wireshark can be used for this purpose.
3. Take Thread Dumps
- Typical profiling tools are generally not allowed to connect to production systems. Even if you are lucky enough to connect your tool to the production system, it adds overhead and might lead you down the wrong path. Your system is already struggling to perform, so profiling might push it into a further slump. Therefore, a thread dump is the other way to get the current state of your threads.
- Thread dumps are usually lightweight and do not impact system performance in a big way. Multiple thread dumps taken over time may tell you how your threads are being utilized in your production system. If something is holding your threads for a long time, that can be seen in the thread dumps, but they may not reveal the actual issue every time. You need some expertise to read thread dumps, or you have to use a good tool to read them.
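As a rough illustration of the kind of summary you would tally by hand from several thread dumps, the sketch below counts live threads per state from inside the JVM using the standard java.lang.management API. The class name ThreadStateSummary and the method countByState are hypothetical, and this assumes you are able to run or expose such code in the affected process; an external dump tool serves the same purpose when you cannot.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper: class and method names are illustrative only.
final class ThreadStateSummary {

    // Counts live threads per state (RUNNABLE, BLOCKED, WAITING, ...).
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // false, false: skip locked-monitor and synchronizer details to keep the call cheap.
        ThreadInfo[] infos = threads.dumpAllThreads(false, false);
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : infos) {
            if (info != null) { // a thread may have exited while the dump was taken
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one summary; in practice, compare several taken a few seconds apart.
        System.out.println(countByState());
    }
}
```

If the number of BLOCKED or WAITING threads keeps growing across successive summaries, that is a hint to open the full dumps and see what those threads are waiting on.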
4. Check GC Logs
- GC logs indicate the health of your heap memory handling. You can check how memory is allocated and reclaimed in the GC logs to see whether your heap allocation is sufficient. Further, you can confirm whether there is a memory leak in the system, which might be the actual cause of the performance issue.
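When GC logs are not available, similar figures can be read from inside the JVM. The following is a minimal sketch using the standard GarbageCollectorMXBean and MemoryMXBean interfaces; the class name GcSnapshot and the output format are assumptions made for illustration, not something taken from a real system.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: the name and the output format are illustrative only.
final class GcSnapshot {

    // Builds a short text summary of collector activity and current heap usage.
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())   // -1 if undefined
              .append(" timeMs=").append(gc.getCollectionTime())   // accumulated, -1 if undefined
              .append('\n');
        }
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        sb.append("heap usedBytes=").append(heap.getUsed())
          .append(" maxBytes=").append(heap.getMax());             // -1 if no maximum is defined
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Heap usage that stays high after collections, together with a steadily rising collection time between two snapshots, points toward an undersized heap or a leak and is worth confirming against the full GC log.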
5. Add More Logs if You Need Clear Proof to Determine the Situation
- It is always tricky to add logs to production servers, as you have to apply a code change to the running production system. So, this should be the last resort, used to prove something for which you have no clear evidence. Make sure the new logs do not affect the current state of the system and that they print enough detail to confirm your hypothesis, or at least give you more detail to evaluate the issue. All of this is only possible if your client allows you to apply a patch to the production system.
How to Solve It
After finding the reason for the performance issue, you have to take suitable action to solve it. Here are some of the typical ways to do so.
1. Introduce Suitable Database Indexes to Speed up Queries
- Choose a suitable index for the queries listed in the slow query logs. Make sure to check the "query execution plan" to see whether the added index is actually being used.
2. Cache Repeatedly Queried Information
- If caching does not affect the correctness of your system in a significant way, you can use it to gain performance.
- But this decision needs to be validated properly for correctness.
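One common way to limit the correctness risk is to let cached entries expire after a fixed time. The sketch below is a minimal time-to-live cache; the class TtlCache, its constructor parameters, and the injected clock are all hypothetical names chosen for illustration, and a production system would more likely use an established caching library.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical minimal cache: entries are reloaded once they are older than ttlMillis.
final class TtlCache<K, V> {

    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so that expiry can be tested without sleeping

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Returns the cached value, calling the loader only if the entry is missing or expired.
    // Note: compute() runs the loader while holding the map's lock for that key,
    // so a slow loader delays other callers asking for the same key.
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) -> {
            if (old != null && old.expiresAtMillis > now) {
                return old;
            }
            return new Entry<V>(loader.apply(k), now + ttlMillis);
        });
        return entry.value;
    }
}
```

A typical use would be new TtlCache<String, String>(60_000, System::currentTimeMillis), with the loader performing the repeated database query. The time-to-live is the knob that trades freshness against database load, which is exactly the correctness decision that has to be validated.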
3. Increase the Allocated Heap Memory if the Current Allocation Is Not Sufficient
- If memory cannot be increased, you may have to change your code to eliminate unnecessary memory allocations.
4. Third-Party Response Delays
- If third-party calls block your usual request-handling threads, third-party response delays can stall your system.
- You may have to redesign your threading model to make sure external delays cannot affect your system's health.
- Request-processing threads need to be separated from the threads that call third parties.
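The separation described above can be sketched with a dedicated, bounded thread pool and a timeout on each call. This is only one possible design under stated assumptions: the class ThirdPartyGateway, its parameters, and the fallback-value approach are hypothetical and are not taken from any particular system.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical gateway: third-party calls run on their own bounded pool,
// and the calling request thread waits at most timeoutMillis for a result.
final class ThirdPartyGateway {

    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    ThirdPartyGateway(int threads, int queueSize, long timeoutMillis) {
        // Bounded queue plus AbortPolicy: when the third party is slow and the
        // pool is saturated, new calls are rejected instead of piling up.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    // Runs the remote call on the dedicated pool; returns the fallback on
    // rejection, timeout, or failure so the request thread is never held for long.
    String call(Callable<String> remoteCall, String fallback) {
        Future<String> future;
        try {
            future = pool.submit(remoteCall);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker that is still waiting
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the remote call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The request thread still waits for the result here, but only up to the timeout, and a slow third party can exhaust only its own pool rather than the request-handling threads. A fully asynchronous design would avoid even that bounded wait, at the cost of more complex code.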
How Can We Make the Systems More Traceable
Depending on your system's behavior, you can think of your own ways to trace performance. Here are some of the ways I suggest:
1. Enable GC Log With Human-Readable Time
- You may have to configure the JVM parameters so that GC logs are printed with readable timestamps, for example -XX:+PrintGCDateStamps on Java 8, or the time decorator of -Xlog:gc* on Java 9 and later.
2. Add a Valid Performance Log That Captures all the Possible Performance Parameters to Suit Your System
- Performance logs should be short enough to read easily and dense enough to cover all the performance parameters relevant to troubleshooting.
- It is best to print the parameters on a single line so that the logs are compact and easier to monitor. For that reason, use short names.
Here is an example of a performance log line with the details in short form. Each line summarizes the information for a particular second.
01 Feb 2019 00:26:12 - REC ACC REJ DB-AVG DB-MAX TH-SND TH-REC TH-TIMEOUT
Some of the short names are explained below:
REC — number of requests received in that second
ACC — number of requests accepted in that second
REJ — number of requests rejected in that second
DB-AVG — average time spent on database queries
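A log line like this can be produced by a small set of counters that is flushed once per second. The sketch below covers only REC, ACC, REJ, and DB-AVG as explained above, plus DB-MAX, which is assumed here to mean the maximum database query time in that second. The class PerfCounters, its method names, and the name=value output format are all hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical per-second counters; call snapshotAndReset from a scheduler once per second.
final class PerfCounters {

    private final LongAdder received = new LongAdder();
    private final LongAdder accepted = new LongAdder();
    private final LongAdder rejected = new LongAdder();
    private final LongAdder dbCalls = new LongAdder();
    private final LongAdder dbTotalMillis = new LongAdder();
    private final AtomicLong dbMaxMillis = new AtomicLong();

    void requestReceived() { received.increment(); }
    void requestAccepted() { accepted.increment(); }
    void requestRejected() { rejected.increment(); }

    // Records the duration of one database query.
    void dbCall(long millis) {
        dbCalls.increment();
        dbTotalMillis.add(millis);
        dbMaxMillis.accumulateAndGet(millis, Math::max);
    }

    // Formats one compact log line and resets all counters for the next interval.
    // The counters are not reset atomically as a group, so an update racing with
    // the snapshot may be counted in the next interval; that is fine for monitoring.
    String snapshotAndReset(String timestamp) {
        long rec = received.sumThenReset();
        long acc = accepted.sumThenReset();
        long rej = rejected.sumThenReset();
        long calls = dbCalls.sumThenReset();
        long total = dbTotalMillis.sumThenReset();
        long max = dbMaxMillis.getAndSet(0);
        long avg = calls == 0 ? 0 : total / calls;
        return timestamp + " - REC=" + rec + " ACC=" + acc + " REJ=" + rej
                + " DB-AVG=" + avg + " DB-MAX=" + max;
    }
}
```

LongAdder is used because the counters are written from many request threads and read only once per interval, which keeps the overhead of the performance log itself low.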
3. Enable Slow Query Logs in the Database
4. Log All the Errors and Exceptional Situations — Including Third-Party-Responsible Errors
- These errors can affect your system's performance, depending on how they are handled.
5. Print the Round-Trip Time for Third-Party Calls and Critical Database Calls
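A minimal way to capture round-trip times is to wrap each third-party or critical database call in a timing helper. The sketch below assumes a hypothetical class RoundTripTimer and a caller-supplied sink that receives the label and the elapsed milliseconds; in a real system the sink would write to the performance log.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BiConsumer;
import java.util.function.Supplier;

// Hypothetical helper: measures how long a call takes and reports it to a sink.
final class RoundTripTimer {

    // Runs the call, then reports (label, elapsed milliseconds) even if the call throws.
    static <T> T timed(String label, Supplier<T> call, BiConsumer<String, Long> sink) {
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        try {
            return call.get();
        } finally {
            long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            sink.accept(label, elapsedMillis);
        }
    }
}
```

For example, RoundTripTimer.timed("billing.lookup", () -> client.lookup(id), logger) would report the time spent in a hypothetical client.lookup call, where client and logger stand in for your own objects.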
When all is said and done, performance tuning is a lengthy process. You will need to spend a lot of time learning more about your system. Then, and only then, can you reach the correct conclusion.
Thanks for reading!