Getting Started With Chaos Engineering
In this article, learn ways to ease your concerns about chaos engineering and help your organization get started implementing this practice.
Breaking stuff on purpose, often in the production environment, is one of the mantras of chaos engineering. But when you share that plan with your engineering manager or product owner, you will often meet resistance.
Their concerns are valid. What if the breakage is irreversible? What will happen to the end users? Will the support ticket queue get flooded?
This article will help you ease these concerns and get started with chaos engineering in your organization.
What Is Chaos Engineering?
There are various definitions of chaos engineering from industry leaders. The Principles of Chaos Engineering, for example, describes it as the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production.
Getting Started
The goal of chaos experiments is to learn how our system will behave in case of catastrophic failures in production and how resilient our system is. This gives us an opportunity to optimize and fix the issues.
Here is how you can begin your chaos engineering practice.
Get Buy-In from Your Manager
The first step is to get approval from your manager to carry out the experiments in a test environment. Chaos experiments are typically run in production, but I suggest you take baby steps: you can perform chaos experiments in any valid environment, and if production is not an option, run them in a non-production (staging) environment first.
Explain the values that you are bringing by performing the chaos experiments, such as:
- Identifying failures and bottlenecks
- Resiliency validation
- Scalability validation
Understand the Architecture
Systems fail all the time. Before running your chaos experiments, thoroughly understand your system's architecture. Have a working session with your developers, architects, and SREs, and walk through the application architecture together. Make sure everyone understands the upstream/downstream components, dependencies, timeline, game day schedule, and so on. This will help you get a better sense of where your system could fail.
Write Hypotheses
Start writing a list of hypotheses, e.g., given a Kubernetes deployment, deleting one pod should not increase the service response time under typical load. Another example: a load balancer must route requests only to healthy, running nodes.
There is no right or wrong when writing a hypothesis. It is an iterative process.
Our objective is not to make our hypotheses "pass" or "fail." Testing each hypothesis will give us an opportunity to understand our system.
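If it helps to keep the hypotheses iterable, you can capture them in a structured form alongside the steady-state metric and threshold you intend to watch. Here is a minimal sketch in Python; the service names, metrics, and thresholds are illustrative assumptions, not recommendations:

```python
# A lightweight, iterable list of chaos hypotheses.
# Names, metrics, and thresholds below are illustrative placeholders.
hypotheses = [
    {
        "id": "H1",
        "statement": "Deleting one pod of the checkout deployment does not "
                     "increase p95 response time under typical load",
        "steady_state_metric": "p95_response_time_ms",
        "threshold": 500,            # assumed SLO, for illustration only
        "blast_radius": "1 pod",
    },
    {
        "id": "H2",
        "statement": "The load balancer routes requests only to healthy nodes",
        "steady_state_metric": "error_rate_percent",
        "threshold": 1.0,
        "blast_radius": "1 node drained",
    },
]

for h in hypotheses:
    print(f"{h['id']}: {h['statement']} "
          f"(watch {h['steady_state_metric']} <= {h['threshold']})")
```

Keeping them in this form makes it easy to revise thresholds and add new hypotheses as you learn more about the system.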
Minimize the Blast Radius
Always start small. Reduce the impact on end users while running the chaos experiments by minimizing the blast radius, e.g., instead of deleting an entire deployment in the Kubernetes cluster, delete a single pod and validate the resiliency. Even if you do delete a deployment, make sure GitOps is in effect so that the GitOps flow recreates the deployment automatically.
Another example: instead of shutting down all the nodes in the cluster, go for 50% of the running nodes; instead of pulling the plug on an entire region, shut down a single zone.
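To make the "delete a pod, not the deployment" idea concrete, here is a minimal sketch using the official Kubernetes Python client. The namespace and the app=checkout label selector are assumptions; point them at your own workload, and only run this in an environment you have approval for:

```python
import random

# Requires: pip install kubernetes
from kubernetes import client, config

NAMESPACE = "default"            # assumed namespace
LABEL_SELECTOR = "app=checkout"  # assumed label of the target deployment


def delete_one_pod():
    """Delete a single pod matching the selector, keeping the blast radius small."""
    config.load_kube_config()  # uses your local kubeconfig
    v1 = client.CoreV1Api()

    pods = v1.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR).items
    if len(pods) < 2:
        # Refuse to run if there is no redundancy to validate.
        print("Not enough replicas to experiment safely; aborting.")
        return

    victim = random.choice(pods)
    print(f"Deleting pod {victim.metadata.name} (1 of {len(pods)})")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)


if __name__ == "__main__":
    delete_one_pod()
```

The replica check is there on purpose: if only one pod is running, the experiment would simply cause an outage rather than validate resiliency.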
Once the chaos practice has matured and your team is comfortable, you can slowly increase the blast radius.
Plan for a Game Day
Plan ahead and always have a Plan B for your game day. Notify all your stakeholders at least one week in advance, and start a unified communication channel in Slack (or your company's collaboration platform) to post regular updates. I suggest having your developers, or your DevOps or SRE team, on call when you run your first experiment.
Run Your First Experiment
No one is perfect. It is okay if you run into trouble during your first experiment. Post an update promptly and inform all stakeholders.
If you are able to run your first experiment successfully, here is my virtual pat on the back.
Believe me, running the first chaos experiment is like riding a high-thrill roller coaster. If things go south, make sure you are able to halt the experiment and revert the infrastructure with the help of your DevOps or SRE team.
During the experiment, monitor your observability dashboard and watch vitals such as response time, disk utilization, passed/failed transactions, health checks, and so on.
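If your observability dashboard does not capture everything you need, even a simple probe that records response time and pass/fail status during the experiment helps with the analysis later. Here is a rough sketch, assuming a hypothetical health-check endpoint:

```python
import csv
import time
from datetime import datetime, timezone

import requests  # pip install requests

URL = "https://example.com/healthz"  # hypothetical health-check endpoint
DURATION_SECONDS = 300               # observe for 5 minutes
INTERVAL_SECONDS = 5

# Record one row per probe so the results can be analyzed afterwards.
with open("chaos_observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "status_code", "response_time_ms", "passed"])

    deadline = time.time() + DURATION_SECONDS
    while time.time() < deadline:
        ts = datetime.now(timezone.utc).isoformat()
        try:
            resp = requests.get(URL, timeout=5)
            ms = resp.elapsed.total_seconds() * 1000
            writer.writerow([ts, resp.status_code, round(ms, 1), resp.ok])
        except requests.RequestException as exc:
            writer.writerow([ts, "error", "", False])
            print(f"{ts}: probe failed: {exc}")
        time.sleep(INTERVAL_SECONDS)
```

The resulting CSV doubles as the spreadsheet input for the analysis step below.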
Analysis
Once the experiment is done, log all your observations in a spreadsheet, analyze them, and define your hypothesis verdict. Again, there is no pass or fail; it is all learning.
Brainstorm
Schedule a meeting with your developers, architects, and DevOps/SRE team to discuss your verdicts. This will help the team understand the findings and fix the issues you discovered. Once the issues are addressed, you can rerun the experiments.
If you find that the system is resilient, you can try increasing the blast radius and rerun the experiments.
Next Steps
After running several game days, you will learn about team dynamics, system performance, and more. The next step is embedding chaos experiments into your development workflow so that they run automatically, which brings more confidence to your team.
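One simple way to wire an experiment into the workflow is to wrap the hypothesis check in a script that exits non-zero when the steady state is not met, so a CI job or scheduled pipeline can flag it. A minimal sketch, reading the observations file produced by the probe above; the threshold is an assumed SLO, not a recommendation:

```python
import csv
import sys

OBSERVATIONS_FILE = "chaos_observations.csv"  # produced during the experiment
P95_THRESHOLD_MS = 500                        # assumed SLO, for illustration


def main() -> int:
    """Exit 0 if the steady-state hypothesis held, 1 otherwise (for CI gating)."""
    with open(OBSERVATIONS_FILE, newline="") as f:
        rows = list(csv.DictReader(f))

    if not rows:
        print("No observations recorded; treating as a failed hypothesis.")
        return 1

    latencies = sorted(
        float(r["response_time_ms"]) for r in rows if r["response_time_ms"]
    )
    failed = [r for r in rows if r["passed"] != "True"]
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else float("inf")

    print(f"p95 latency: {p95:.1f} ms, failed probes: {len(failed)}/{len(rows)}")
    return 1 if p95 > P95_THRESHOLD_MS or failed else 0


if __name__ == "__main__":
    sys.exit(main())
```

A scheduled pipeline that runs the experiment, the probe, and this check keeps the hypotheses honest without anyone having to remember a game day.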
Happy chaos-ing!