AWS CloudWatch + yCrash = Monitoring + RCA

We had an outage in our online application GCeasy on Oct 11, 2021. In this post, we would like to document our journey to identify the root cause of the problem.

By Ram Lakshmanan, CORE · Nov. 01, 21 · Tutorial

We had an outage in our online application GCeasy on Monday morning (PST), Oct 11, 2021. When customers uploaded their Garbage Collection logs for analysis, the application returned an HTTP 504 error. The HTTP 504 status code indicates that requests are timing out at the gateway before the backend responds. In this post, we would like to document our journey to identify the root cause of the problem.

Application Stack

 Here are the primary components of the technology stack of the application:

  • AWS EC2 instance
  • AWS Elastic Beanstalk
  • Nginx Web Server
  • AWS Elastic Load Balancer
  • Java 8
  • Tomcat 8
  • MySQL (RDS Service)
  • AWS S3

AWS CloudWatch – Monitoring Tool


Fig 1: AWS CloudWatch report.

We use AWS CloudWatch as our monitoring tool. From Fig 1, you can see AWS CloudWatch clearly reporting that CPU consumption and the MySQL DB connection count started to climb on Oct 09 (Friday). On Oct 09, we had deployed new code to the production environment, so it was clear that the new code was the culprit causing the instability in production.

 AWS CloudWatch clearly indicated two things:

  1. Problem symptom (i.e., CPU and DB connection count spiked up).
  2. Time frame since the problem started (Oct 09, Friday).

However, AWS CloudWatch didn't report which line of code (i.e., the root cause) was causing the CPU or DB connections to spike.
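As an aside, the CloudWatch metrics shown in Fig 1 can also be pulled programmatically. Below is a minimal sketch (not part of the original article) that queries the CPUUtilization metric with the AWS SDK for Java v2; the instance ID and the three-day window are placeholder values.

// Minimal sketch: fetch hourly CPUUtilization datapoints for one EC2 instance.
// The instance ID and time window below are placeholders, not values from the article.
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsResponse;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class CpuMetricCheck {
    public static void main(String[] args) {
        try (CloudWatchClient cw = CloudWatchClient.create()) {
            GetMetricStatisticsRequest request = GetMetricStatisticsRequest.builder()
                    .namespace("AWS/EC2")
                    .metricName("CPUUtilization")
                    .dimensions(Dimension.builder()
                            .name("InstanceId")
                            .value("i-0123456789abcdef0")   // placeholder instance ID
                            .build())
                    .startTime(Instant.now().minus(3, ChronoUnit.DAYS)) // e.g., back to the deployment day
                    .endTime(Instant.now())
                    .period(3600)                             // one datapoint per hour
                    .statistics(Statistic.AVERAGE, Statistic.MAXIMUM)
                    .build();

            GetMetricStatisticsResponse response = cw.getMetricStatistics(request);
            response.datapoints().forEach(dp ->
                    System.out.printf("%s avg=%.1f%% max=%.1f%%%n",
                            dp.timestamp(), dp.average(), dp.maximum()));
        }
    }
}

Such a query can confirm when a metric started climbing, but, as noted above, it still won't point to the line of code responsible.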

yCrash Summary Report

Fig 2: yCrash summary report.

We use yCrash as our root cause analysis tool. This tool captures the GC log, thread dump, heap dump, netstat, vmstat, iostat, top, disk usage, kernel logs, and other system-level artifacts from the sick application, analyzes them, and generates a root cause analysis report instantly. Fig 2 shows the summary page of the yCrash report. Note the red arrow in Fig 2: it points out that "20 threads are stuck waiting for a response from the external system." It also gives a hyperlink to the thread report so you can examine the stack traces of those 20 BLOCKED threads. Clicking on the hyperlink shows the stack traces of those 20 threads, as shown in Fig 3.
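For readers less familiar with thread dumps: they can be captured manually with the JDK's jstack tool (jstack <pid>) or in-process, as in the minimal sketch below. This sketch is only illustrative and is not yCrash's implementation; in our case, yCrash collected the thread dump and the other artifacts for us.

// Minimal sketch: capture an in-process thread dump with the JDK's ThreadMXBean.
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSnapshot {
    public static void main(String[] args) {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        // true, true => include monitor and synchronizer info, which helps spot BLOCKED threads
        for (ThreadInfo info : threadMXBean.dumpAllThreads(true, true)) {
            System.out.println(info);   // thread name, state, and (truncated) stack trace
        }
    }
}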

yCrash Thread Report

Fig 3: yCrash thread report.

Based on the stack trace, you can see that these threads were making MySQL database calls. Look at the red arrow in Fig 3: it points to 'com.tier1app.diamondgc.dao.GCReportDAO.selectReportById(GCReportDAO.java:335)'. This is the line of code that makes the MySQL database call. We looked up the source code of this line; it was issuing a 'select' SQL query against a table in the MySQL database, and that query turned out to be quite inefficient. The inefficiency wasn't exposed in the lower test environments because the table there had only a handful of records. In production, however, the table had several million records, so the 'select' query performed poorly, taking anywhere from 5 to 7 minutes to complete. During this period, application threads were completely stuck, and customer requests ultimately started to time out with HTTP 504 errors.
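As a general safeguard against this failure mode, a slow query can be bounded so that request threads fail fast instead of hanging for minutes. The sketch below is hypothetical (the class, table, and column names are illustrative and not taken from our actual DAO); it shows a plain JDBC lookup by ID with a query timeout.

// Hypothetical sketch: bound a SELECT with a JDBC query timeout so threads fail fast.
// Table and column names (gc_report, report_id, report_json) are illustrative only.
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ReportDao {
    private final DataSource dataSource;

    public ReportDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public String selectReportById(long reportId) throws SQLException {
        String sql = "SELECT report_json FROM gc_report WHERE report_id = ?"; // indexed lookup, no full scan
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setQueryTimeout(5);   // give up after 5 seconds instead of holding the request thread
            ps.setLong(1, reportId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("report_json") : null;
            }
        }
    }
}

A timeout like this would not have fixed the inefficient query, but it would have kept stuck threads from piling up and turning one slow statement into a site-wide outage.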

It turned out that this SQL was added in the recent release that went live on Friday. Because yCrash pointed out the exact line of code causing the degradation, we commented out this 'select' SQL. Once the new code was deployed, the application's performance recovered right away.

Conclusion

In an earlier blog, we attempted to explain the difference between a monitoring tool (AWS CloudWatch) and a root cause analysis tool (yCrash) in theory. In this blog, we have once again tried to explain the difference through a real-world production problem. Thank you for reading this post.

Published at DZone with permission of Ram Lakshmanan, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
