Assessing Software Services for Resiliency With Chaos Testing

How do you ensure resilient services? Cause chaos. Learn the four steps of chaos testing to reinforce good performance.

This is the second post in a series on resilience, self-healing systems, and ongoing testing. Part one was about the inevitability of IT failures. This post focuses on the steps xMatters took to introduce chaos testing.

As engineering teams build large-scale, distributed software services that run on cloud infrastructure, it has become imperative that those services are resilient to inevitable failures. This makes testing for “resiliency” a crucial step in software engineering. Its purpose is to build confidence that systems are designed to withstand and recover from failures. At xMatters, we started looking at the principles of chaos engineering and at how we could adopt chaos testing within our engineering department to ensure our services can handle turbulent conditions without impacting the SLAs we have with our clients.

To facilitate this kind of testing, we took four steps:

1. Defining “Steady State” Behavior

We first simulated production-like traffic in our test infrastructure, which involved setting up environments that closely match production in terms of requests per second and types of request. We built an in-house load testing application based on Locust (a modern, open source load-generating framework) to generate a swarm of HTTP requests targeting various services in our test environments. This traffic simulation tool can be started and stopped on demand and constitutes the foundation of our resilience testing platform. By putting our test environments in a state of steady, production-like traffic, we establish a “steady state” baseline that we use as a reference when measuring the impact of induced failures.
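To make the idea of a baseline concrete, here is a minimal sketch of how a window of request samples could be reduced to a few steady-state numbers. The `Sample` and `Baseline` shapes and the choice of metrics (requests per second, error rate, 95th percentile latency) are assumptions for illustration, not a description of the xMatters tooling.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Sample:
    """One observed request (hypothetical record shape)."""
    timestamp: float   # seconds since the start of the run
    latency_ms: float
    ok: bool           # True if the request succeeded


@dataclass(frozen=True)
class Baseline:
    requests_per_second: float
    error_rate: float
    p95_latency_ms: float


def summarize(samples: list[Sample]) -> Baseline:
    """Reduce a window of request samples to a steady-state baseline."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a baseline")
    timestamps = [s.timestamp for s in samples]
    duration = max(timestamps) - min(timestamps)
    if duration <= 0:
        raise ValueError("samples must span a positive time window")
    errors = sum(1 for s in samples if not s.ok)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles([s.latency_ms for s in samples], n=20)[-1]
    return Baseline(
        requests_per_second=len(samples) / duration,
        error_rate=errors / len(samples),
        p95_latency_ms=p95,
    )
```

Capturing the baseline while the simulated traffic is running, and before any failure is induced, gives later comparisons a fixed reference point.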

2. Monitoring Services

We defined dashboards to monitor the state of our software services and provide statistics on the traffic going through them. Monitoring also includes alerting engineering teams when a failure prevents services from meeting their service level agreements (SLAs). We also have early warning alerts that notify teams when there is an increased probability of service degradation that could impact SLAs.
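The distinction between an SLA alert and an early warning can be sketched as a simple threshold check. The function name, the use of error rate as the monitored metric, and the 80% warning ratio are all assumptions chosen for illustration; real alerting rules would likely combine several signals.

```python
from enum import Enum


class AlertLevel(Enum):
    OK = "ok"
    EARLY_WARNING = "early_warning"
    SLA_BREACH = "sla_breach"


def classify_error_rate(
    error_rate: float,
    sla_limit: float,
    warning_ratio: float = 0.8,  # hypothetical: warn at 80% of the SLA limit
) -> AlertLevel:
    """Classify an observed error rate against an SLA limit."""
    if not 0 < warning_ratio <= 1:
        raise ValueError("warning_ratio must be in (0, 1]")
    if error_rate > sla_limit:
        return AlertLevel.SLA_BREACH
    if error_rate >= sla_limit * warning_ratio:
        return AlertLevel.EARLY_WARNING
    return AlertLevel.OK
```

For example, with an SLA limit of 1% errors, an observed rate of 0.9% would raise an early warning while 2% would count as a breach.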

3. Simulating Failure Scenarios

As mentioned in the first article of this series, we developed Cthulhu (our in-house chaos testing tool) to facilitate the introduction of failures within the services of our cloud-based infrastructure. The tool executes failure scenarios that simulate events like the untimely shutdown of a service or a service stuck in a deadlock (done by pausing the application’s process). Other scenarios aim at testing the limits of our services. For example: given that Service A is unable to restart, how long do we have before clients are impacted?
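One way to pause a process on a POSIX system is to send it `SIGSTOP` and later `SIGCONT`. The sketch below shows that approach; it is an assumption about how such a scenario could be implemented, not a description of how Cthulhu does it, and `pause_process` is a hypothetical helper name.

```python
import os
import signal
import time


def pause_process(pid: int, seconds: float) -> None:
    """Freeze a process with SIGSTOP, then resume it with SIGCONT (POSIX only)."""
    os.kill(pid, signal.SIGSTOP)
    try:
        time.sleep(seconds)
    finally:
        # Always resume the process, even if the sleep is interrupted.
        os.kill(pid, signal.SIGCONT)
```

While paused, the process keeps its sockets and memory but makes no progress, which is what makes it resemble a hung service rather than a crashed one.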

When chaos scenarios are running, we notify the engineering teams without giving details about the nature of the failure. Engineers can then correlate the chaos test events with the subsequent recovery or failure alerts from the service and determine whether the service behaved as expected.
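The correlation step can be sketched as matching alerts to a chaos event by service and time window. The `Event` record, its `kind` values, and the `alerts_following` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since epoch
    service: str
    kind: str         # hypothetical values: "chaos_start", "failure_alert", "recovery_alert"


def alerts_following(chaos: Event, events: list[Event], window_s: float) -> list[Event]:
    """Return alerts for the same service raised within window_s after a chaos event."""
    matches = [
        e for e in events
        if e.service == chaos.service
        and e.kind != "chaos_start"
        and 0 <= e.timestamp - chaos.timestamp <= window_s
    ]
    return sorted(matches, key=lambda e: e.timestamp)
```

An empty result for a scenario that should have triggered an alert is itself a finding: it suggests a gap in monitoring rather than a resilient service.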

4. Analyzing the Differences in Service Behavior

As we monitor the impact of executing chaos test scenarios on our services, we can compare the observed behavior with what we know to be normal, steady state behavior (prior to any induced failures).
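A comparison against the steady state can be as simple as computing the relative deviation of each metric and flagging those that exceed a tolerance. This is a minimal sketch; the metric names and the 10% tolerance in the usage note are assumptions.

```python
def relative_deviation(baseline: float, observed: float) -> float:
    """Signed deviation of observed from baseline, as a fraction of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (observed - baseline) / baseline


def out_of_tolerance(
    baseline: dict[str, float],
    observed: dict[str, float],
    tolerance: float,
) -> dict[str, float]:
    """Return the metrics whose absolute relative deviation exceeds tolerance."""
    flagged = {}
    for name, base_value in baseline.items():
        deviation = relative_deviation(base_value, observed[name])
        if abs(deviation) > tolerance:
            flagged[name] = deviation
    return flagged
```

For instance, with a baseline of `{"rps": 100.0, "p95_latency_ms": 200.0}`, an observation of `{"rps": 98.0, "p95_latency_ms": 300.0}`, and a tolerance of `0.1`, only the latency metric is flagged, with a deviation of `0.5`.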

Because our system is made up of a series of distributed services, each managed by a different engineering team, we expect alerts to be triggered and the right team to be notified when anomalies are detected or when the steady state of its service is compromised, for example when a service fails to process requests or is not available to process them.

Also read: 4 Ways to Improve Your DevOps Testing.

All of the above steps are based on one core principle: the harder it is to impact the steady state behavior, the higher our confidence can be in the resiliency of our system and in meeting the service level agreements we made with our customers.

Chaos testing is a powerful practice for testing the resiliency of software services, but because of its nature, it can have severe consequences if it’s used carelessly on an unprepared environment. We must always be aware of the potential impacts and ensure that the effects are contained to minimize disruption to our valued customers.

As we are in the early adoption phase of this practice, many of these steps are currently performed only at small scale, within our test infrastructure. Such tests have enabled us to improve the resiliency of our services by detecting deficiencies earlier in the development cycle, before the rollout to production. In the near future, we plan to introduce some randomness in selecting the failure scenario and to execute such tests automatically on a schedule, matching the continuous evolution of our cloud-based software.
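As an illustration of what random scenario selection on a schedule could look like, the sketch below builds a run plan by picking one scenario per time slot. The `plan_runs` helper and the scenario names are hypothetical; passing in a seeded `random.Random` is a design choice that keeps a plan reproducible when a run needs to be replayed.

```python
import random


def plan_runs(
    scenarios: list[str],
    run_count: int,
    interval_s: float,
    rng: random.Random,
) -> list[tuple[float, str]]:
    """Pick a random scenario for each slot; returns (offset_seconds, scenario) pairs."""
    if not scenarios:
        raise ValueError("at least one scenario is required")
    return [(i * interval_s, rng.choice(scenarios)) for i in range(run_count)]


# Hypothetical scenario names, one run per hour.
plan = plan_runs(
    ["shutdown_service", "pause_process", "block_restart"],
    run_count=5,
    interval_s=3600.0,
    rng=random.Random(42),
)
```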

To learn more about chaos testing, read our white paper, 4 Ways to Improve Your DevOps Testing. To check out our integration-driven collaboration platform, try xMatters free today.

The next blog in this series will focus on Components of Resilient Architecture.

Topics:
performance, tutorial, monitoring, resiliency, chaos testing

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
