Getting Started With Chaos Engineering

In this article, learn ways to ease your concerns about chaos engineering and help your organization get started implementing this practice.

By NaveenKumar Namachivayam
Sep. 04, 22 · Analysis

Breaking stuff on purpose, primarily in the production environment, is one of the mantras of chaos engineering. But when you share your plan with your engineering manager or product owner, you will often meet some resistance.

Their concerns are valid. What if breaking stuff is irreversible? What will happen to the end users? Will our support ticket system get busy?

This article will help you ease these concerns and get started with chaos engineering in your organization.

What Is Chaos Engineering?

There are various definitions available from industry leaders about chaos engineering. Here is a slide from one of my videos:

[Slide: Chaos engineering definitions]


Getting Started

The goal of chaos experiments is to learn how our system behaves under catastrophic failures in production and how resilient it is. That knowledge gives us an opportunity to fix issues and optimize.

Here is how you can begin your chaos engineering practice.

Get Buy-In from Your Manager

The first step is to get approval from your manager to carry out the experiments in a test environment. Ideally, chaos experiments are carried out in production, but I suggest you take baby steps. You can perform chaos experiments in any valid environment. If the production environment is not available, I recommend running the experiments in a non-production (or staging) environment.

Explain the value that chaos experiments bring, such as:

  • Identifying failures and bottlenecks
  • Resilience validation
  • Scaling validation

Understand the Architecture

Systems fail all the time. Before running your chaos experiments, thoroughly understand your system's architecture. Have a working session with your developers, architects, and SREs, and brainstorm the application architecture. Make sure everyone understands the upstream/downstream components, dependencies, timeline, game day schedule, and so on. This will help you to get a better sense of where your system could fail.

Write Hypotheses

Start by writing a list of hypotheses. For example: given a Kubernetes deployment, deleting one pod should not increase the service response time under typical load. Another example: a load balancer must route requests only to healthy, running nodes.

There is no right or wrong while writing a hypothesis. It is an iterative process.

Our objective is not to make our hypotheses "pass" or "fail." Testing each hypothesis will give us an opportunity to understand our system.
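To make the pod-deletion hypothesis concrete, here is a minimal Python sketch of how you might write it down and check it against measurements. The `Hypothesis` class, the `evaluate` function, the 10% tolerance, and the sample numbers are all my own assumptions for illustration; they are not part of any chaos tool. The result is labelled "held" rather than pass or fail, in keeping with the idea that every outcome is something to learn from.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Hypothesis:
    """A hypothesis: what we do, and how much latency change we tolerate."""
    description: str
    max_p95_increase_pct: float  # hypothetical tolerance, chosen for illustration


def p95(samples_ms):
    """95th percentile of response-time samples, in milliseconds."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples_ms, n=100)[94]


def evaluate(hypothesis, baseline_ms, experiment_ms):
    """Compare p95 response time before and during the experiment."""
    before = p95(baseline_ms)
    during = p95(experiment_ms)
    increase_pct = (during - before) / before * 100
    return {
        "hypothesis": hypothesis.description,
        "p95_before_ms": round(before, 1),
        "p95_during_ms": round(during, 1),
        "increase_pct": round(increase_pct, 1),
        "held": increase_pct <= hypothesis.max_p95_increase_pct,
    }


# Made-up samples standing in for data exported from a load test.
hypothesis = Hypothesis(
    description="Deleting one pod does not raise p95 response time by more than 10%",
    max_p95_increase_pct=10.0,
)
baseline = [120, 125, 130, 128, 122, 135, 140, 126, 131, 129]
during_experiment = [124, 130, 133, 129, 127, 139, 146, 128, 134, 131]
print(evaluate(hypothesis, baseline, during_experiment))
```

Writing the tolerance down before the experiment keeps the discussion afterwards about the system, not about where the line should have been.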

Minimize the Blast Radius

Always start small. Reduce the impact on end users while running chaos experiments by minimizing the blast radius. For example, instead of deleting a deployment in the Kubernetes cluster, delete individual pods and validate resiliency. If you do delete a deployment, make sure GitOps is in effect so that the GitOps flow recreates the deployment automatically.

Another example: instead of shutting down all the nodes in the cluster, shut down 50% of the running nodes, or instead of taking an entire region offline, shut down a single zone.

Once the chaos process has matured and your team is comfortable, you can slowly increase the blast radius.
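One way to keep the blast radius explicit is to compute the list of targets from a fraction instead of picking them by hand. The sketch below is an assumption-laden illustration: `pick_targets`, its parameters, and the pod names are hypothetical, and the rule of always leaving at least one candidate untouched is my own safety choice, not a chaos engineering standard.

```python
import math
import random


def pick_targets(candidates, fraction, max_targets=None, seed=None):
    """Pick a random subset of candidates, sized by `fraction`.

    At least one candidate is always left untouched (an assumption made
    here for safety), and `max_targets` can cap the subset further.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(candidates) < 2:
        return []  # nothing can be taken down while leaving a survivor
    count = max(1, math.floor(len(candidates) * fraction))
    count = min(count, len(candidates) - 1)  # keep at least one survivor
    if max_targets is not None:
        count = min(count, max_targets)
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(list(candidates), count)


# Hypothetical pod names; in practice these would come from your cluster.
pods = ["checkout-1", "checkout-2", "checkout-3", "checkout-4"]
print(pick_targets(pods, fraction=0.25))  # start small: one pod
print(pick_targets(pods, fraction=0.5))   # later: half of the pods
```

Raising `fraction` step by step gives you a simple record of how the blast radius grew from one game day to the next.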

Plan for a Game Day

Plan ahead and always have a Plan B for your "game day." Notify all your stakeholders at least one week in advance, and set up a dedicated communication channel in Slack (or your company's collaboration platform) to post regular updates. I suggest having your developers, or your DevOps or SRE team, on call when you run your first experiment.

Run Your First Experiment

No one is perfect. It is okay if you have trouble running your first experiment. Post an update promptly and inform all the stakeholders.

If you are able to run your first experiment successfully, here is my virtual pat on the back.

Believe me, running the first chaos experiment is like riding a high-thrill roller coaster. If things go south, make sure you are able to halt the experiment and revert the infrastructure with the help of your DevOps or SRE team.

During the experiment, monitor your observability dashboard and watch vitals such as response time, disk utilization, passed/failed transactions, health checks, and so on.
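The "halt if things go south" rule can be written down as guardrails that are checked while the experiment runs. This is only a sketch under stated assumptions: the metric names, the limits, and the `read_vitals` and `halt` callables are hypothetical stand-ins for whatever your observability stack and rollback procedure provide. A real run would also wait between polls.

```python
def breached_guardrails(vitals, limits):
    """Return a list of reasons to halt; an empty list means keep going."""
    reasons = []
    for name, limit in limits.items():
        value = vitals.get(name)
        if value is None:
            # Assumption: missing telemetry is itself a reason to stop.
            reasons.append(f"{name}: no data")
        elif value > limit:
            reasons.append(f"{name}: {value} exceeds limit {limit}")
    return reasons


def run_with_guardrails(read_vitals, limits, checks, halt):
    """Poll vitals up to `checks` times; call `halt` and stop on the first breach."""
    for _ in range(checks):
        reasons = breached_guardrails(read_vitals(), limits)
        if reasons:
            halt(reasons)
            return reasons
    return []


# Hypothetical limits and canned readings standing in for a dashboard query.
guardrail_limits = {
    "response_time_ms": 500,
    "disk_utilization_pct": 85,
    "failed_transaction_pct": 1.0,
}
canned_readings = iter([
    {"response_time_ms": 210, "disk_utilization_pct": 60, "failed_transaction_pct": 0.1},
    {"response_time_ms": 740, "disk_utilization_pct": 61, "failed_transaction_pct": 0.4},
])
run_with_guardrails(
    read_vitals=lambda: next(canned_readings),
    limits=guardrail_limits,
    checks=2,
    halt=lambda why: print("Halting experiment:", why),
)
```

Agreeing on the limits with your DevOps or SRE team before game day means nobody has to decide under pressure when to stop.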

Analysis

Once the experiment is done, log all your observations in a spreadsheet, analyze them, and record a verdict for each hypothesis. Again, there is no pass or fail; it is all learning.
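If you would rather capture observations from a script than type them in by hand, a CSV file opens in any spreadsheet tool. Here is a small sketch using Python's standard `csv` module. The column names, the `write_observations` helper, and the sample row are my assumptions; use whatever columns suit your team.

```python
import csv
import io

# Hypothetical columns; adjust them to what your team wants to review.
FIELDS = ["experiment", "hypothesis", "observation", "verdict"]


def write_observations(rows, stream):
    """Write observation rows (dicts keyed by FIELDS) to a text stream as CSV."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


observations = [
    {
        "experiment": "delete one pod",
        "hypothesis": "response time stays flat under typical load",
        "observation": "brief latency spike while the replacement pod started",
        "verdict": "needs follow-up",
    },
]

# Written to memory here; for a real file, pass open("game_day.csv", "w", newline="").
buffer = io.StringIO()
write_observations(observations, buffer)
print(buffer.getvalue())
```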

Brainstorm

Schedule a meeting with your developers, architects, and DevOps/SRE team to discuss your verdict. This will help the team to understand the verdicts and fix the issues you discovered. Once the issues are addressed, you can rerun the experiments.

If you find that the system is resilient, you can try increasing the blast radius and rerunning the experiments.

Next Steps

After running several game days, you will have learned about team dynamics, system performance, and more. The next step is to embed chaos experiments into your developer workflow so that they run automatically, which will give your team more confidence.

Happy chaos-ing!

Published at DZone with permission of NaveenKumar Namachivayam, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
