
Embracing the Chaos of Chaos Engineering

Learn how chaos engineering helps you find seemingly random errors in today's modern, increasingly complex applications.

by Chris Ward · Jul. 20, 18 · Tutorial

This article is featured in the new DZone Guide to Automated Testing: Your End-to-end Ecosystem. Get your free copy for more insightful articles, industry statistics, and more! 

Modern applications are growing in complexity, adding a dizzying number of moving parts, layers of abstraction, reliance on external systems, and distribution, all of which result in a stack that few fully understand.

Any developer worth hiring now knows the merits of a thorough testing regime, but one of the issues with testing is that you are often testing for predictable outcomes. Despite our "logical systems," show-stopping issues are typically unexpected: situations that no one foresaw.

These unforeseen eventualities are what chaos engineering attempts to account for. It's a reasonably new practice, used by Netflix for several years and then formalized in 2015, when its principles were set out in a time-honored manifesto.

Naturally, there are critics of the practice, and the comments at the bottom of this TechCrunch article summarize some of them. The typical counterarguments are that the principle is a band-aid for applications that were poorly planned and architected in the first place, or that it's another buzzword-laden excuse to invent shiny new tools that no one knew they needed.

Still, its proponents are a friendly bunch, so in this article, I summarize my findings on the practice and let you decide.

Organized Chaos

In many ways, while the term "chaos" is a good eye-catching phrase, it's misleading, summoning images of burning servers and hapless engineers running around an office screaming. A better term is experimental engineering, but I agree that is less likely to get tech blog or conference attention.

The core principles of chaos engineering follow similar lines to those you followed in school or university science classes:

  1. Form a hypothesis.

  2. Communicate to your team.

  3. Run experiments.

  4. Analyze the results.

  5. Increase the scope.

  6. Automate experiments.

Early in the lifetime of chaos engineering at Netflix, most engineers thought chaos engineering was about "breaking things in production," and it is in part. But while breaking things is great fun, it's not a useful activity unless you learn something from it.

These principles encourage you to introduce real-world events and events you expect to be able to handle. I wonder if fully embracing the "chaos" might produce more interesting results, i.e., measuring the worst that could happen. True randomness and extremity could surface even more insightful observations.

Let's look at each of these steps in more detail.

1. Form a Hypothesis

To begin, you want to make an educated guess about what will happen in which scenarios. The key word here is "educated;" you need to gather data to support the hypothesis that you'll share with your team.

Decide on Your Steady State

What is "steady" depends on your application and use case, so decide on a set of metrics that are important to you and what variance in those metrics is acceptable. For example:

  • When completing checkout, the majority of customers should have a successful payment processed.

  • Users should experience latency below a particular threshold.

  • A process should complete within a timeframe.

When deciding on these metrics, also consider external factors such as SLAs and KPIs for your team or product(s).
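
As a rough sketch of what "deciding on your steady state" could look like in code, the snippet below records each metric alongside the range you consider acceptable. The metric names and thresholds are hypothetical placeholders, not recommendations; yours should come from your own data, SLAs, and KPIs.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SteadyStateMetric:
    """One metric plus the range of values considered 'steady'."""
    name: str
    lower: float  # smallest acceptable value
    upper: float  # largest acceptable value

    def is_steady(self, observed: float) -> bool:
        return self.lower <= observed <= self.upper


# Hypothetical thresholds, for illustration only.
STEADY_STATE = [
    SteadyStateMetric("payment_success_ratio", lower=0.95, upper=1.0),
    SteadyStateMetric("p95_latency_ms", lower=0.0, upper=300.0),
    SteadyStateMetric("batch_duration_s", lower=0.0, upper=60.0),
]


def violations(observed: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or outside their range."""
    return [
        metric.name
        for metric in STEADY_STATE
        if metric.name not in observed
        or not metric.is_steady(observed[metric.name])
    ]
```

Calling `violations()` with a snapshot of current readings returns an empty list while the system is in its steady state, which gives later experiments a single yes/no question to ask.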

Introduce Real-World Events

The sorts of events to test vary depending on your use case, but common to most applications are:

  • Hardware/VM failure

  • State inconsistency

  • Running out of CPU, memory, or storage space

  • Dependency issues

  • Race conditions

  • Traffic spikes

  • Service unavailability
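
Two of the events above, service unavailability and slow dependencies, can be simulated inside your own code without touching infrastructure. The following is a minimal sketch, assuming you can wrap the function that calls the dependency. The `inject_faults` decorator, the `InjectedFailure` exception, and the `charge_card` function are all hypothetical names invented for this example, not part of any chaos engineering tool.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised instead of calling the real dependency."""


def inject_faults(failure_rate=0.0, added_latency_s=0.0, rng=None):
    """Decorator that makes a call fail or slow down some of the time.

    failure_rate: probability (0.0 to 1.0) of raising InjectedFailure.
    added_latency_s: delay added before every call that is not failed.
    rng: optional random.Random, so a run can be repeated with a seed.
    """
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            if added_latency_s > 0:
                time.sleep(added_latency_s)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical dependency call standing in for a real payment provider client.
@inject_faults(failure_rate=0.2, rng=random.Random(42))
def charge_card(amount_cents):
    return {"status": "ok", "amount_cents": amount_cents}
```

Passing a seeded `random.Random` keeps a run repeatable, which helps when you want to compare results before and after a fix.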

Run in Production

"Testing in production" has long been a tongue-in-cheek reference to an untested code base, but as chaos engineering is likely run in collaboration with a properly pre-tested code base, it takes on a different meaning.

The principles we're working with here encourage you to undertake tests in production or, if you have a genuine reason for not doing so, in an environment as close to production as possible. Chaos engineering principles are designed to identify weakness, so they argue that running in production is fundamentally a good thing.

Some banks are already following these principles, and while engineers behind safety-critical systems should be confident of their setup before embarking on chaos engineering, the principles also recommend you design each experiment to have minimal impact and ensure you can abort at any time.

Metrics

While the most tempting hypothesis is "let's see what happens" (much like "let's just break things"), it's not a constructive one. Try to concoct a hypothesis based on your steady state, for example:

  • If PayPal is unavailable, successful payments will drop by 20 percent.

  • During high traffic, latency will increase by 500ms.

  • If an entire AWS region is unavailable, a process will take 1 second longer to complete.
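
One way to keep a hypothesis honest is to write it down as a prediction you can check afterwards. The sketch below is only an illustration: the `Hypothesis` class is a hypothetical helper, and the tolerance value is an assumption I picked for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A predicted change in one metric, with a margin of error."""
    description: str
    predicted_change: float  # expected (experiment - baseline), in metric units
    tolerance: float         # how far off the prediction may be and still hold

    def evaluate(self, baseline: float, during_experiment: float) -> bool:
        actual_change = during_experiment - baseline
        return abs(actual_change - self.predicted_change) <= self.tolerance


# The 100 ms tolerance is an assumption made for this example.
latency_hypothesis = Hypothesis(
    description="During high traffic, latency will increase by 500ms",
    predicted_change=500.0,
    tolerance=100.0,
)
```

If baseline latency is 200 ms and you measure 650 ms during the experiment, `latency_hypothesis.evaluate(200.0, 650.0)` holds; a measurement of 1200 ms would not, and that gap is the interesting finding.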

2. Communicate to Your Team

As a technical communicator, this is perhaps the most important step to me. If you have a team of engineers running experiments on production systems, then relevant people (if not everyone) deserve to know. It's easy to remember engineers, but don't forget people who deal with the public, too, such as support and community staff who may start receiving questions from customers.

3. Run Your Experiments

The way you introduce your experiments varies: some through code deployments, others by injecting calls you know will fail, and others with simple scripts. There are myriad tools available to help simulate these events; the resources section below lists places to find them. Make sure you have alerting and reporting in place, both to stop an experiment if needed and to analyze the results later.
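
To illustrate the "stop an experiment if needed" part, here is a minimal sketch of an experiment loop with an abort condition. The `probe`, `inject`, and `rollback` callables are placeholders for whatever your monitoring and tooling provide, and the sketch assumes a metric where higher is worse, such as an error ratio.

```python
import time


def run_experiment(probe, inject, rollback, abort_threshold,
                   max_samples, interval_s=0.0):
    """Run one experiment, stopping early if the metric gets too bad.

    probe: callable returning the current value of the watched metric
           (higher is assumed to be worse, e.g. an error ratio).
    inject: callable that switches the fault on.
    rollback: callable that switches the fault off; always called.
    """
    samples = []
    aborted = False
    inject()
    try:
        for _ in range(max_samples):
            value = probe()
            samples.append(value)  # kept for analysis afterwards
            if value > abort_threshold:
                aborted = True
                break
            if interval_s > 0:
                time.sleep(interval_s)
    finally:
        # Remove the fault even if probing itself raised an error.
        rollback()
    return {"aborted": aborted, "samples": samples}
```

The `finally` block is the important design choice: whatever happens during the run, the fault is switched off again, and the collected samples are still available for the analysis step.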

4. Analyze the Results

There's no point in running an experiment if you don't take time to reflect on what data you gathered and to learn from it. There are many tools you probably already use to help with this stage, but make sure you involve input from any teams whose services were involved in the experiment.

5. Increase the Scope

After defining your ideal metrics and the potential effects on them, it's time to start testing your hypothesis. Much like other aspects of modern software development, be sure to iterate these events, changing parameters or the events you test for.

Once you've tried one experiment, learned from it, and potentially fixed issues it identified, then move on to the next one. This may be introducing a new experiment or increasing the metrics of an existing one to find out where a system really starts to break down.
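
Increasing the intensity of an existing experiment can be sketched as a simple ramp: run the same experiment at each level and stop at the first one where the system leaves its steady state. Both callables below are hypothetical stand-ins for your own experiment and steady-state check.

```python
def find_breaking_point(run_at_level, levels, is_steady):
    """Run one experiment at increasing intensity.

    run_at_level: callable taking an intensity level and returning the
                  metric observed while the fault ran at that level.
    levels: intensity levels to try, in increasing order.
    is_steady: callable deciding whether an observed value is acceptable.

    Returns the first level at which the system left its steady state,
    or None if it held at every level.
    """
    for level in levels:
        observed = run_at_level(level)
        if not is_steady(observed):
            return level
    return None
```

For example, with levels expressed as the percentage of dependency calls you fail (5, 10, 20, 40), the returned level tells you roughly where the system starts to break down, and stopping at the first failure avoids pushing a production system further than necessary.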

6. Automate the Experiments

The first time(s) you run an experiment, running it manually is fine: you can monitor the outcome and abort it if necessary. But you should (especially with teams that follow continuous deployment) automate your experiments as quickly as possible. This means that the experiment can run when new factors are introduced into an application, but it also makes it easier to change input parameters for the scope of your experiments.

Again, the resources section below lists places to find tools to help with this.
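
Before reaching for a dedicated tool, one possible shape for automation is to describe experiments as data and run them as a suite, for instance from a deployment pipeline step. This is a sketch under that assumption; `Experiment` and `run_suite` are hypothetical names, not an existing library.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Experiment:
    name: str
    run: Callable[[], float]                # injects the fault, returns the metric
    is_acceptable: Callable[[float], bool]  # steady-state check for that metric


def run_suite(experiments: List[Experiment]) -> List[str]:
    """Run every experiment and return the names of those that failed."""
    failed = []
    for experiment in experiments:
        observed = experiment.run()
        if not experiment.is_acceptable(observed):
            failed.append(experiment.name)
    return failed
```

A pipeline step could then fail the build whenever `run_suite()` returns a non-empty list, and changing the parameters of an experiment becomes a one-line edit to the list you pass in.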

Quietly Chaotic

While engineers and developers are divided on the usefulness of chaos engineering, the most interesting aspects to me are not the technical ones, but rather that it tests and checks ego.

The principles state in many places that if you are truly confident in your application, then you shouldn't fear what they propose. They force you to put your money where your mouth is and (albeit in a careful and controlled way) prove your application is as robust as you believe it to be. I can imagine many insightful debriefing sessions after a chaos engineering experiment.

Tools and Resources

  • The free O'Reilly book on Chaos Engineering

  • The comprehensive Chaos Engineering awesome list that features a plethora of useful tools and resources

  • The Chaos Community, Mailing List, and Meetups

  • Gremlin, whose staff are often behind a lot of the Netflix-independent chaos engineering resources, runs a 'Failure-as-a-service' platform that commoditizes many of the tools and practices featured in this post.


Published at DZone with permission of Chris Ward, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
