APM — Production Monitoring With New Relic
A new tool? Sign me up!
After a software release, the next major concern for application teams is production monitoring, which is broadly classified into:
- Application Monitoring
- Infrastructure Monitoring
Production monitoring is made easier by APM (Application Performance Monitoring) tools, which cover both application and infrastructure monitoring. An APM agent is installed on each application server and pushes metrics by hooking into the corresponding server process.
Before selecting an APM tool, run a performance test to understand the overhead its agent adds to the server and the application.
Our application is an enterprise-level application serving more than 30 million requests per week. We have experience with two APM tools: we used AppDynamics for some time, and we now use New Relic.
In this article, I am going to talk about a few metrics that we monitor regularly in New Relic to assess the health of our application and infrastructure.
In New Relic, we broadly focus on the following metrics:
- Overview of historical metrics
- Transaction level metrics
Historical metrics provide data about the trendline. They give us key performance indicators for the application and show whether performance is normal or has deviated from the trendline.
We must look at the deviations and dig deeper to understand their root cause.
The historical performance metrics are available under the “Overview” tab, under “Monitoring” in the main menu.
The overview metrics provide details about:
- Response time
- Error Rate
- Application node details
Response Time Metric
Response time is the time taken by an application to complete a user transaction. The historical view compares the current day's response time with yesterday's and last week's, which shows the application's response time trendline.
The lower the response time, the better the performance. Below is an example.
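The comparison against yesterday and last week can also be reproduced on exported numbers. The following is a minimal sketch, not New Relic's own calculation: the function names and the 20% tolerance are assumptions chosen for illustration.

```python
def percent_change(current_ms: float, baseline_ms: float) -> float:
    """Percentage change of the current response time versus a baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return (current_ms - baseline_ms) * 100.0 / baseline_ms


def deviations(current_ms: float, yesterday_ms: float, last_week_ms: float,
               tolerance_pct: float = 20.0) -> dict:
    """Return the baselines that today's response time exceeds by more than
    tolerance_pct. The 20% default is a hypothetical value, not a New Relic setting."""
    changes = {
        "yesterday": percent_change(current_ms, yesterday_ms),
        "last_week": percent_change(current_ms, last_week_ms),
    }
    return {name: pct for name, pct in changes.items() if pct > tolerance_pct}


if __name__ == "__main__":
    # Today 600 ms, yesterday 400 ms, last week 550 ms
    print(deviations(600, 400, 550))
```

An empty result means today's response time is within the tolerance of both baselines; any entry in the result is a deviation worth digging into.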
Apdex Score
Apdex (Application Performance Index) is a score that expresses the user satisfaction level for an application. A threshold T is set, typically based on the SLA (Service Level Agreement). If T is 1.2 seconds, a response that completes in 0.5 seconds is satisfactory. A response that takes longer than 1.2 seconds is dissatisfactory, and one that takes more than 4.8 seconds (four times T) frustrates the user. More details can be found in the New Relic documentation on Apdex.
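To make the thresholds concrete, here is a small sketch of the standard Apdex formula: satisfied requests count fully, requests between T and 4T count half, and slower requests count as zero. The function name and the sample response times are illustrative assumptions, not data from our application.

```python
def apdex(response_times_s, t=1.2):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= T
    tolerating: T < response time <= 4T
    frustrated: response time > 4T (contributes 0)
    """
    if not response_times_s:
        raise ValueError("at least one sample is required")
    satisfied = sum(1 for r in response_times_s if r <= t)
    tolerating = sum(1 for r in response_times_s if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times_s)


if __name__ == "__main__":
    # Hypothetical samples: two satisfied, one tolerating, one frustrated
    samples = [0.5, 1.0, 2.0, 5.0]
    print(apdex(samples))  # (2 + 0.5) / 4 = 0.625
```

A score of 1 means every request finished within T, which is why the score is read as "the closer to 1, the better".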
Below is an example graph
Error Rate
The error rate provides information on the total number of exceptions in an application. We must keep it as low as possible, and resolving the exceptions with the highest counts first is the quickest way to bring it down. I always aim to keep it below 1%.
However, the acceptable value varies from application to application. From my perspective, it should never be allowed to exceed 5%. The percentage is the number of errors per minute relative to the number of transactions per minute.
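The percentage and the two limits mentioned above (1% as the target, 5% as the upper bound) can be expressed as a short check. This is only a sketch: the function names and the "ok" / "warning" / "critical" labels are assumptions for illustration and are not New Relic terminology.

```python
def error_rate_pct(errors_per_min: float, transactions_per_min: float) -> float:
    """Errors per minute as a percentage of transactions per minute."""
    if transactions_per_min <= 0:
        return 0.0  # no traffic, so no meaningful error rate
    return errors_per_min * 100.0 / transactions_per_min


def classify(rate_pct: float, target_pct: float = 1.0, limit_pct: float = 5.0) -> str:
    """Map an error rate to a hypothetical severity label."""
    if rate_pct <= target_pct:
        return "ok"
    if rate_pct <= limit_pct:
        return "warning"
    return "critical"


if __name__ == "__main__":
    rate = error_rate_pct(errors_per_min=60, transactions_per_min=2000)
    print(rate, classify(rate))  # 3.0 warning
```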
Below is an example graph:
Application Node Details
This report provides performance metrics for every application node that is part of the deployment. The metrics provided for each node are as follows:
- Response time
- Error rate
- CPU Usage
- Memory consumption for each node
Transaction Level Metrics
The “Overview” tab lets us identify the application performance trendline. The details behind deviations in the trendline can be accessed under the “Transaction” tab.
It gives three important views of an application's URLs:
- Most time consuming
- Slowest average response time
- Apdex most dissatisfying
The sections below explain how to use each of these views to debug and understand the root cause of performance issues.
Most Time Consuming
This tab lists the URLs that consume the most total time in the application, which in practice are usually the URLs with the highest number of requests. They are not necessarily slow URLs; rather, they receive the most requests. We must explore whether that specific transaction really needs so many requests. Caching can be one option to reduce the total number of requests.
Slowest Average Response Time
This option lists the URLs with the highest average response time. They need more attention to make them perform better. We must review them periodically and focus on the URLs that consistently perform poorly.
Apdex Most Dissatisfying
This option lists the most dissatisfying URLs. All of them exceed the SLA set as the satisfying threshold. We must dig deeper to understand why.
How to Identify the Root Cause for Slow-Performing URLs?
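The three views are essentially three sort orders over the same per-URL statistics. The sketch below illustrates the idea on made-up data. It assumes that "most time consuming" means throughput multiplied by average response time, so heavily requested URLs rank high even when each call is fast, and it simplifies "most dissatisfying" to the lowest Apdex. The class, field names, and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TransactionStats:
    name: str
    calls_per_min: float
    avg_response_s: float
    apdex: float


def most_time_consuming(stats):
    """Sort by total time per minute (calls x average response time), highest first."""
    return sorted(stats, key=lambda s: s.calls_per_min * s.avg_response_s, reverse=True)


def slowest_average(stats):
    """Sort by average response time, slowest first."""
    return sorted(stats, key=lambda s: s.avg_response_s, reverse=True)


def most_dissatisfying(stats):
    """Sort by Apdex, lowest (most dissatisfying) first. A simplification."""
    return sorted(stats, key=lambda s: s.apdex)


if __name__ == "__main__":
    # Hypothetical sample data
    stats = [
        TransactionStats("/search", 900, 0.3, 0.95),
        TransactionStats("/checkout", 50, 2.5, 0.70),
        TransactionStats("/report", 5, 6.0, 0.40),
    ]
    print([s.name for s in most_time_consuming(stats)])
    print([s.name for s in slowest_average(stats)])
    print([s.name for s in most_dissatisfying(stats)])
```

In this sample, "/search" tops the first view purely because of its request volume, while "/report" tops the other two because each call is slow.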
Under the “Transaction” menu, select the URL with the slowest average response time. The details of the transaction appear on the right side.
Below the drop-down, you will get a list of poorly performing transactions. Select a transaction to understand the root cause of a specific poor-performance incident.
Once you select the key transaction, the graph below is displayed on the right side.
In the above graph, note the Apdex of 0.86; we should attempt to bring it as close to 1 as possible. How do you decide whether it needs immediate attention? Look at the average RPM (requests per minute). If the RPM is very high and the transaction is listed in the slowest average response time group, you should investigate it.
Below “Track as Key Transaction”, the instances with the highest response time are listed. Select a specific instance of the transaction for deeper analysis. The transaction trace displays the specific method or stored procedure that takes the most time. This way, we can identify the root cause of the slow-performing transaction.
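The triage rule described here, where a transaction in the slowest group that also has high traffic gets investigated first, can be written as a simple predicate. This is a sketch only: the function name and the RPM threshold of 100 are assumptions, and what counts as "very high" depends on the application.

```python
def should_investigate(in_slowest_group: bool, avg_rpm: float,
                       rpm_threshold: float = 100.0) -> bool:
    """Flag a transaction when it is both slow and heavily used.

    rpm_threshold is a hypothetical cut-off for "very high" traffic.
    """
    return in_slowest_group and avg_rpm >= rpm_threshold


if __name__ == "__main__":
    # A slow transaction called 450 times per minute deserves attention first
    print(should_investigate(in_slowest_group=True, avg_rpm=450))  # True
    # A slow transaction called twice per minute can wait
    print(should_investigate(in_slowest_group=True, avg_rpm=2))    # False
```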
To conclude, I have tried to list the key metrics that we should monitor in New Relic. New Relic provides many other metrics; however, we use the ones above on a day-to-day basis to assess application performance in production. Many reports, such as SLA, web transactions, and database, are also available under the "Reports" tab; you can use them to export the data to an Excel sheet and analyze it.
There is also an option to set automated alerts, which fire when a threshold set in New Relic is crossed.