One of the most difficult tasks a system administrator faces during an application outage is getting to a root cause analysis. Most traditional data center applications were built as monoliths, often with data and application all-in-one. Even where there was a shared data layer, message queuing was rarely used because a single application still fronted the back-end database.
Microservices and the Death of RCA
OK, it may not be the death of the root cause analysis, but it is certainly a change in the importance and speed with which we must get to the RCA. Horizontally distributed applications with message queues and a scalable data tier introduce a new way of thinking when it comes to surviving application outages. The reason is that we can now have partial outages that don't take down the entire application environment.
Look at this distributed environment example on AWS, which doesn't even delve into microservices. Even the interim step from a traditional monolith to a distributed application design shows where RCA becomes less of a pressing concern.
While the image above may appear complicated and subject to many interdependencies that could create points of failure, it is quite the opposite. Each layer of the application is backed by resilient infrastructure that is consumed to present application data: the inbound request resolved through Route 53, the Elastic Load Balancer behind it, and the VPC hosting a cluster of EC2 instances (launched from an AMI) that run our front-end web application. SQS (Simple Queue Service) then feeds a MySQL database distributed across multiple instances.
On the front end, we also have an S3 bucket with CloudFront caching content for the fastest delivery and to reduce the traffic crossing the rest of the infrastructure. In effect, we have layers upon layers of redundancy, which puts us in a unique situation when it comes to performing RCA.
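To make the queue's role in that redundancy concrete, here is a minimal Python sketch of a front end that keeps accepting work while the data tier is down. It is an illustration under stated assumptions, not the architecture's actual code: the in-memory deque only stands in for an SQS queue, and the Database class and the accept_order and drain functions are hypothetical names invented for this example.

```python
from collections import deque


class Database:
    """Hypothetical stand-in for the MySQL tier."""

    def __init__(self):
        self.available = True
        self.rows = []

    def write(self, row):
        if not self.available:
            raise ConnectionError("database unavailable")
        self.rows.append(row)


def accept_order(orders, order):
    """Front end: enqueue the order and return; it never touches the database."""
    orders.append(order)
    return "accepted"


def drain(orders, db):
    """Worker: write queued orders to the database, stopping at the first failure.

    An order is removed from the queue only after its write succeeds,
    so nothing is lost while the database is unavailable.
    """
    written = 0
    while orders:
        try:
            db.write(orders[0])  # peek at the oldest order
        except ConnectionError:
            break                # leave the backlog in place and retry later
        orders.popleft()
        written += 1
    return written


orders = deque()  # in-memory stand-in for an SQS queue (assumption)
db = Database()

db.available = False                     # simulate a data-tier outage
print(accept_order(orders, {"id": 1}))   # the front end still answers
print(accept_order(orders, {"id": 2}))
print(drain(orders, db))                 # 0: nothing can be written yet

db.available = True                      # the data tier recovers
print(drain(orders, db))                 # 2: the backlog is written
```

The point of the sketch is the decoupling: the outage in the data tier shows up as a growing backlog rather than as failed requests, which is what turns a full outage into a partial one.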
RCA Still Matters
Don't discount the need for legitimate root cause analysis. Code may roll into production before anyone knows its effect. There may be real outages that affect the environment, even with resiliency built in at multiple tiers. The reality is that by separating roles into a more service-based approach, we reduce the outage risks, and in doing so we reduce the need to do RCA.
There will still be outages. That's a reality. With an effective, resilient design, the need to do RCA post-mortems on application outages will be greatly reduced. That's just the way things go, and it's a fundamental reason that we should look towards this type of application design.
Keeping environments loosely coupled and service-oriented means resiliency, risk limiting, and much more. Don't put away the pager just yet, but we may go through a lot fewer batteries if we do things right.