Chaos Engineering in the World of SaaS and Cloud Computing
My take on conducting an effective chaos exercise, based upon the experience of conducting chaos engineering in a wide array of applications.
The 3-step approach listed below is based on experience conducting chaos engineering across a wide array of applications, as well as building Disaster Recovery (DR) solutions in the past. This is part 1 of a 2-part series, and it covers SaaS offerings. I hope it adds value to your engineering journey.
What's Chaos Engineering?
It's a technique where you automatically inject failures into an active system to study their impact and how the system recovers. The term "chaos engineering" originated from Netflix's internal practice.
Why Chaos Engineering and What Type of System Needs It
Disasters are inevitable in the SaaS world. In the era of relentless engineering, we must foresee and anticipate the issues for managing reliable systems. The chaos exercise reduces any disaster's blast radius and in many cases solves it without users experiencing it. Systems consisting of distributed services or components with dependencies need chaos engineering to validate SLAs against disruption of one or many services.
How Do We Approach the Chaos Engineering Exercise?
The 3-step approach laid out below helps you optimize your effort and get the best results. The word "chaos" sounds crazy, but the exercise still has to be well-planned and controlled.
"We still don't know what we don't know yet."
The first step is a coordinated effort from the engineering team to do the following:
- Define steady-state metrics that state the overall health of the system (synthetic checks as experienced by customers).
- Derive various baseline hypotheses against steady states mentioned above.
- List all well-known disaster scenarios along with fixes (triage post-mortems).
- Identify all existing reusable code/tools required to support automation.
- List all the tests, covering a wide range of issues and real-world problems. Examples below:
  - Data-center/region failures
  - Race conditions
  - Overall or individual services load
  - Dependency breakdowns
  - Functional bugs
  - 3rd party service failures
- Chaos exercise flow plan
- Ownerships (SMEs)
- Template (to record each triggered test plan)
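As a rough illustration of the planning items above, a test-plan template can tie each scenario to a hypothesis, an owner, and the steady-state metrics it is judged against. This is only a sketch: the class names, field names, metric names, and thresholds are all hypothetical, and the observed values are made up rather than taken from real synthetic checks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SteadyStateMetric:
    """A customer-facing health signal and the range considered healthy."""
    name: str
    lower: float
    upper: float

    def is_healthy(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper


@dataclass
class ChaosTestPlan:
    """One template entry: what is broken, what we expect, who owns it."""
    scenario: str
    hypothesis: str
    owner: str  # SME responsible for the affected service (hypothetical name)
    metrics: List[SteadyStateMetric] = field(default_factory=list)

    def deviations(self, observations: dict) -> List[str]:
        """Return the names of metrics that left their steady-state range."""
        return [
            m.name
            for m in self.metrics
            if m.name in observations and not m.is_healthy(observations[m.name])
        ]


plan = ChaosTestPlan(
    scenario="Dependency breakdown: payment provider unreachable",
    hypothesis="Checkout success rate stays at or above 99% via the fallback path",
    owner="payments-sme",
    metrics=[
        SteadyStateMetric("checkout_success_rate", lower=0.99, upper=1.0),
        SteadyStateMetric("p95_latency_ms", lower=0.0, upper=800.0),
    ],
)

# Illustrative readings only; real values would come from synthetic checks.
observed = {"checkout_success_rate": 0.97, "p95_latency_ms": 640.0}
print(plan.deviations(observed))  # ['checkout_success_rate']
```

Recording deviations by metric name keeps the later report factual: it states which steady-state signal moved, not an interpretation of why.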
"Your output is as good as your in-depth planning and focus."
2. War Room Exercise
Build a replica of the production environment to conduct all types of disaster exercises. This environment should emulate production not only in infrastructure setup but also in load and traffic characteristics. Adopt or build automated tools to conduct chaos exercises; the right tools may vary depending on your tech stack. The chaos exercise on a replica should achieve the following objectives:
- Chaos engineering operators should have in-depth knowledge of the systems. If not, this is the time to train them.
- It consolidates tools (possibly off the shelf), playbooks, and automated recovery processes.
- It reduces the blast radius in the production environment by fixing some of the issues ahead of time.
- It validates/invalidates a few hypotheses (but do not completely disregard them yet).
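To make the idea of validating a hypothesis on a replica more concrete, here is a minimal sketch of fault injection against a stand-in dependency. Everything in it is an assumption for illustration: `FlakyDependency`, `inject_failures`, the 30% fault rate, the 3-attempt retry, and the 95% threshold are hypothetical and not tied to any particular chaos tool.

```python
import random
from contextlib import contextmanager


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


class FlakyDependency:
    """Stand-in for a downstream service in the replica environment."""

    def __init__(self):
        self.failure_rate = 0.0
        self._rng = random.Random(42)  # seeded so runs are repeatable

    def call(self) -> str:
        if self._rng.random() < self.failure_rate:
            raise InjectedFault("dependency unavailable")
        return "ok"


@contextmanager
def inject_failures(dep: FlakyDependency, rate: float):
    """Raise the failure rate for the duration of the block, then restore it."""
    previous = dep.failure_rate
    dep.failure_rate = rate
    try:
        yield dep
    finally:
        dep.failure_rate = previous


def call_with_retry(dep: FlakyDependency, attempts: int = 3) -> bool:
    """The recovery behaviour under test: retry a few times before giving up."""
    for _ in range(attempts):
        try:
            dep.call()
            return True
        except InjectedFault:
            continue
    return False


def success_rate(dep: FlakyDependency, requests: int = 1000) -> float:
    """Fraction of requests that succeed, counting retries as one request."""
    successes = sum(call_with_retry(dep) for _ in range(requests))
    return successes / requests


dep = FlakyDependency()
with inject_failures(dep, rate=0.3):
    rate_under_chaos = success_rate(dep)

# Hypothesis (assumed): with 3 attempts, success stays at or above 95%
# even when 30% of individual calls fail.
hypothesis_holds = rate_under_chaos >= 0.95
print(f"success rate under chaos: {rate_under_chaos:.3f}, holds: {hypothesis_holds}")
```

The context manager restores the original failure rate even if the measurement raises, which mirrors the requirement that an exercise leave the environment as it found it.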
"Real-world events will showcase real-world problems."
3. Live Exercise
By this time, you should have automation in place to conduct your chaos exercise in your production environment. Remember that you are trying to break the system while still staying within the guardrails of your SLOs and complying with your SLA. A few precautions to take:
- SMEs for various services are available and are on-call.
- You follow the flow and prepare a factual report based on observed deviations from steady-state, associated triggered actions, and any tactical fixes.
- Validate various metrics against SLOs. Terminate the chaos exercise if a metric comes close to breaching its SLO (again, through automation). Revisit the test once you have a fix for the issue.
- Automation should clean up when you terminate or interrupt the chaos exercise and ensure that no zombie processes are left behind.
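The last two precautions can be sketched as a small guardrail loop: watch an SLI while the fault is active, abort before the SLO is breached, and always clean up. This is a simplified illustration under stated assumptions: the function name, the `safety_margin` parameter, the fault name, and the availability readings are hypothetical, and the SLI is assumed to be "higher is better".

```python
from typing import Callable, Iterable


def run_with_guardrail(
    start_fault: Callable[[], None],
    stop_fault: Callable[[], None],
    sli_samples: Iterable[float],
    slo: float,
    safety_margin: float,
) -> dict:
    """Run a fault while watching an SLI; abort before the SLO is breached.

    Assumes a "higher is better" SLI such as availability.
    """
    abort_threshold = slo + safety_margin
    report = {"aborted": False, "samples_seen": 0, "lowest_sli": None}
    start_fault()
    try:
        for sli in sli_samples:
            report["samples_seen"] += 1
            if report["lowest_sli"] is None or sli < report["lowest_sli"]:
                report["lowest_sli"] = sli
            if sli <= abort_threshold:
                # Close to breaching the SLO: stop the exercise early.
                report["aborted"] = True
                break
    finally:
        # Cleanup runs on normal completion, abort, or an unexpected error.
        stop_fault()
    return report


active_faults = []
report = run_with_guardrail(
    start_fault=lambda: active_faults.append("kill-cache-node"),
    stop_fault=active_faults.clear,
    sli_samples=[0.9995, 0.9993, 0.9991, 0.9989],  # made-up availability readings
    slo=0.999,
    safety_margin=0.0002,
)
print(report)         # {'aborted': True, 'samples_seen': 3, 'lowest_sli': 0.9991}
print(active_faults)  # []
```

Putting the cleanup in `finally` is the design choice that matters here: the fault is removed whether the run finishes, aborts on the guardrail, or fails for an unrelated reason.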
Mantra: “The harder it gets to break, the more stable your system is.”
What's the End Game of Chaos Engineering?
Its objective is to build resilient systems that will consistently improve SLIs and ROI.
Coming soon, part 2 of 2: Chaos Engineering in the World of IoT, Robotics, and Edge Computing.
Published at DZone with permission of Shankar Muniyappa. See the original article here.