AWS CloudWatch + yCrash = Monitoring + RCA
We had an outage in our online application GCeasy on the morning of Monday, Oct 11, 2021 (PST). When customers uploaded their garbage collection logs for analysis, the application returned an HTTP 504 error. The HTTP 504 (Gateway Timeout) status code indicates that a gateway or proxy in front of the application did not receive a timely response from the upstream server; in other words, transactions were timing out. In this post, we document how we identified the root cause of the problem.
Here are the primary components of the technology stack of the application:
- AWS EC2 instance
- AWS Elastic Beanstalk
- Nginx Web Server
- AWS Elastic Load Balancer
- Java 8
- Tomcat 8
- MySQL (RDS Service)
- AWS S3
AWS CloudWatch – Monitoring Tool
Fig 1: AWS CloudWatch report.
We use AWS CloudWatch as our monitoring tool. In Fig 1, AWS CloudWatch clearly reports that CPU consumption and the MySQL DB connection count started to climb on Oct 09 (Friday). That was the day we deployed new code to the production environment, so it was clear that the new code was the culprit behind the instability in production.
AWS CloudWatch clearly indicated two things:
- Problem symptom (i.e., CPU and DB connection count spiked up).
- Time frame since the problem started (Oct 09, Friday).
However, AWS CloudWatch didn't report which line of code (i.e., the root cause) was causing the CPU consumption and DB connection count to spike.
Fig 2: yCrash summary report.
We use yCrash as our root cause analysis tool. It captures the GC log, thread dump, heap dump, netstat, vmstat, iostat, top, disk usage, kernel logs, and other system-level artifacts from the sick application, analyzes them, and instantly generates a root cause analysis report. Fig 2 shows the summary page of the yCrash report. The red arrow in Fig 2 points to the finding that "20 threads are stuck waiting for a response from the external system." The report also provides a hyperlink to the thread report, where the stack traces of those 20 BLOCKED threads can be examined. Clicking the hyperlink shows the stack traces of those 20 threads, as shown in Fig 3.
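To make the idea of "many threads stuck in the same place" concrete, here is a minimal sketch of how a thread dump can be grouped by identical top stack frames. It is not yCrash's implementation. The class name `StuckThreadFinder`, the frame depth of 3, and the reporting threshold of 5 are all assumptions chosen for illustration; only the standard `java.lang.management` API is used.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper: groups threads whose top stack frames are identical. */
class StuckThreadFinder {

    /** Joins the first `depth` frames of a stack trace into a grouping key. */
    static String signature(StackTraceElement[] stack, int depth) {
        return Arrays.stream(stack)
                .limit(depth)
                .map(StackTraceElement::toString)
                .collect(Collectors.joining("\n"));
    }

    /** Maps each stack signature to the names of the threads that share it. */
    static Map<String, List<String>> groupBySignature(
            Map<String, StackTraceElement[]> stacks, int depth) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, StackTraceElement[]> entry : stacks.entrySet()) {
            if (entry.getValue().length == 0) {
                continue; // skip threads for which no frames are available
            }
            groups.computeIfAbsent(signature(entry.getValue(), depth),
                    key -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Take an in-process thread dump of the current JVM.
        Map<String, StackTraceElement[]> stacks = new LinkedHashMap<>();
        for (ThreadInfo info
                : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            stacks.put(info.getThreadName() + "#" + info.getThreadId(),
                    info.getStackTrace());
        }

        int depth = 3;     // assumed: compare only the top three frames
        int threshold = 5; // assumed: report groups of five or more threads
        groupBySignature(stacks, depth).forEach((sig, names) -> {
            if (names.size() >= threshold) {
                System.out.println(names.size()
                        + " threads share this stack:\n" + sig + "\n");
            }
        });
    }
}
```

If a large group of threads shares a signature whose frames sit in a database driver or DAO class, that is a strong hint that those threads are all waiting on the same backend call.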
Fig 3: yCrash thread report.
Based on the stack trace, you can see that these threads were making MySQL Database calls. Look at the red arrow mark in Fig 3. It points to 'com.tier1app.diamondgc.dao.
It turned out that this SQL statement was added in the recent release, which went live on Friday. Because yCrash pointed out the exact line of code causing the degradation, we commented out this 'select' SQL. Once the new code was deployed, the application's performance recovered right away.
In an earlier blog post, we explained in theory the difference between a monitoring tool (AWS CloudWatch) and a root cause analysis tool (yCrash). In this post, we have tried to illustrate that difference through a real-world production problem. Thank you for reading.
Published at DZone with permission of Ram Lakshmanan, DZone MVB. See the original article here.