Chaos Engineering — Simulate AZ Failures on AWS

In this article, I will walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS.

By Gaurav Gupta · Jun. 08, 20 · Tutorial

Chaos engineering is about introducing the turbulent conditions that systems are likely to face in production environments. Chaos experiments uncover new information, which can then be used to change the code and make our systems more resilient than they were before. Chaos experiments are not equivalent to testing. In testing, we check the system's response against a predefined expected result. In a chaos experiment, however, we don't have a predefined outcome. The experiment gives us new information about the system, which can then be used to improve it.

In this article, I will walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS. Highly available applications need to be resilient against AZ failures. Your application, for example a Kubernetes cluster spanning multiple AZs, should be able to survive such a failure. These chaos simulations let you check and prepare for that.

Chaos Toolkit provides a good framework for defining chaos experiments. I have forked the chaostoolkit-aws repo and added AZ failure probes and actions to the ec2 module. The experiments use boto3, the AWS SDK for Python. The code is available here: AZ Failure Git Repo

This is how an AZ failure experiment comes together:

  • Steady-state hypothesis: Before we kick off the experiment, we want to establish a steady-state hypothesis, that is, "what normal looks like". In this case, I assume that if I can successfully SSH into the EC2 instance, there is no AZ failure and the system is in its normal state.
Python

# .... Refer to the code repo for the full function ....

logger.info('Starting SSH into EC2 instance - ' + instance.instance_id)
ssh = paramiko.SSHClient()
# Accept unknown host keys automatically (acceptable for a throwaway test instance)
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
privkey = paramiko.RSAKey.from_private_key_file(pem_file_path)

try:
    ssh.connect(instance.public_dns_name, username='ec2-user', pkey=privkey, timeout=10)
except Exception:
    # Any connection error, including a timeout, means the steady state is not met
    logger.info('SSH connection failed or timed out (timeout: 10 seconds)')
    return False
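
A full SSH login needs a key pair and the paramiko dependency. If the probe only has to detect that the instance has become unreachable, a lighter check is to open a TCP connection to port 22. The following is a minimal sketch using only the standard library; the function name `tcp_port_open` and its parameters are hypothetical and are not part of chaostoolkit-aws.

```python
import socket


def tcp_port_open(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS failures alike.
        return False
```

When a network ACL drops the packets, the connection attempt is expected to time out rather than be refused; both cases return False here, which is what a steady-state probe with `tolerance: true` would treat as a broken hypothesis.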



  • Action, simulate AZ failure: To simulate an AZ failure, I create a blackhole network ACL and attach it to the subnet of our instance. The blackhole ACL has a single rule that denies all ingress traffic: CIDR '0.0.0.0/0', all protocols, and all ports.
Python

# .... Refer to the code repo for the full function ....

logger.info('Simulating AZ failure for - ' + subnet.availability_zone)

# Create a new network ACL
acl_response = create_network_acl(vpc_id)
logger.info('Created new network ACL - ' + str(acl_response))
acl_id = acl_response['NetworkAcl']['NetworkAclId']

# Blackhole rule: deny all ingress traffic (protocol "-1" means all protocols)
logger.info('Creating blackhole ACL')
create_network_acl_ingress_entry(acl_id, rule_num=1, protocol="-1", cidr_block="0.0.0.0/0", from_port=-30000, to_port=30000, allow=False)

NetworkAclAssociationId = None
prev_NetworkAclId = None

# Find the network ACL currently associated with the subnet
nw_acl_dict = get_network_acls()

for x in nw_acl_dict['NetworkAcls']:
    for y in x['Associations']:
        if y['SubnetId'] == subnet_id:
            NetworkAclAssociationId = y['NetworkAclAssociationId']
            prev_NetworkAclId = y['NetworkAclId']
            # Save both ACL IDs so that the rollback can restore the original ACL
            d2["acl_id"] = prev_NetworkAclId
            d2["blackhole_acl_id"] = acl_id
            with open("exp_data1.txt", 'w') as f:
                json.dump(d2, f)

logger.info('Replacing original ACL - ' + prev_NetworkAclId + ' with blackhole ACL ' + acl_id + ' on subnet ' + subnet_id)

# Associate the subnet with the blackhole ACL
change_network_acl_association(acl_id, NetworkAclAssociationId)
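
The helper functions called in the listing above live in the linked repo and are not shown in the article. As an illustration of what such helpers might wrap, here is a sketch written directly against the boto3 EC2 client methods `create_network_acl`, `create_network_acl_entry`, `describe_network_acls`, and `replace_network_acl_association`. The function names and the returned dictionary layout are assumptions made for this example, and the `ec2` argument is assumed to be a client such as `boto3.client("ec2", region_name="us-east-2")`.

```python
from typing import Tuple


def create_blackhole_acl(ec2, vpc_id: str) -> str:
    """Create a network ACL with explicit deny-all rules in both directions."""
    acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    # A new custom ACL already denies all traffic through its default rule;
    # the explicit entries below only make the intent visible.
    for egress in (False, True):
        ec2.create_network_acl_entry(
            NetworkAclId=acl_id,
            RuleNumber=1,
            Protocol="-1",  # all protocols, so no port range is required
            RuleAction="deny",
            Egress=egress,
            CidrBlock="0.0.0.0/0",
        )
    return acl_id


def find_acl_association(ec2, subnet_id: str) -> Tuple[str, str]:
    """Return (association_id, acl_id) for the ACL currently on the subnet."""
    response = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    for acl in response["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == subnet_id:
                return assoc["NetworkAclAssociationId"], assoc["NetworkAclId"]
    raise LookupError("No network ACL association found for " + subnet_id)


def blackhole_subnet(ec2, vpc_id: str, subnet_id: str) -> dict:
    """Swap the subnet's ACL for a blackhole ACL and return rollback data."""
    association_id, original_acl_id = find_acl_association(ec2, subnet_id)
    blackhole_acl_id = create_blackhole_acl(ec2, vpc_id)
    ec2.replace_network_acl_association(
        AssociationId=association_id, NetworkAclId=blackhole_acl_id
    )
    return {
        "subnet_id": subnet_id,
        "acl_id": original_acl_id,
        "blackhole_acl_id": blackhole_acl_id,
    }
```

Passing the client in as a parameter keeps the functions easy to exercise with a stub. Note that network ACLs apply per subnet, so an AZ that contains several subnets would need the same swap repeated for each of them.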


  • Check the steady-state hypothesis again: Our steady-state hypothesis was that we can successfully SSH into the EC2 instance. With the AZ failure simulated, SSH times out and the steady-state hypothesis is broken. This failure means that our system would not survive an AZ failure. We now have new information that can be used to make the system more resilient.

  • Roll back the AZ failure: It is also important to roll back the simulated AZ failure. To do so, we re-attach the previous ACL to the subnet and delete the blackhole ACL.

Python

# Roll back the AZ failure by restoring the original ACL on the subnet and deleting the blackhole ACL
def rollback_az_failure():

    # Load the IDs saved by the AZ failure action
    with open("exp_data1.txt") as f:
        d2 = json.load(f)
    prev_NetworkAclId = d2["acl_id"]
    blackhole_acl_id = d2["blackhole_acl_id"]
    subnet_id = d2["subnet_id"]

    logger.info('Rolling back ACL for subnet ' + subnet_id + ' from blackhole ACL - ' + blackhole_acl_id + ' to original ACL - ' + prev_NetworkAclId)
    nw_acl_dict = get_network_acls()

    # Find the subnet's current association, which now points at the blackhole ACL
    NetworkAclAssociationId = None
    for x in nw_acl_dict['NetworkAcls']:
        for y in x['Associations']:
            if y['SubnetId'] == subnet_id:
                NetworkAclAssociationId = y['NetworkAclAssociationId']

    change_network_acl_association(prev_NetworkAclId, NetworkAclAssociationId)
    logger.info('Removing blackhole ACL - ' + blackhole_acl_id)
    delete_network_acl(blackhole_acl_id)
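
The rollback reads `subnet_id`, `acl_id`, and `blackhole_acl_id` from the state file, so the action has to write all three before the rollback can run. One way to make that contract explicit is a pair of small helpers that write the state and validate it on load. This is only a sketch: `save_state`, `load_state`, and `REQUIRED_KEYS` are hypothetical names and do not come from the repo.

```python
import json
from pathlib import Path

# Keys the rollback needs in order to restore the original ACL.
REQUIRED_KEYS = ("subnet_id", "acl_id", "blackhole_acl_id")


def save_state(path, subnet_id, acl_id, blackhole_acl_id):
    """Write the rollback data to a JSON file and return it as a dict."""
    state = {
        "subnet_id": subnet_id,
        "acl_id": acl_id,
        "blackhole_acl_id": blackhole_acl_id,
    }
    Path(path).write_text(json.dumps(state))
    return state


def load_state(path):
    """Read the rollback data and fail early if a required key is missing."""
    state = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if key not in state]
    if missing:
        raise KeyError("State file is missing keys: " + ", ".join(missing))
    return state
```

Failing early on a missing key is a deliberate choice here: a rollback that stops with a clear error is easier to recover from than one that detaches an ACL and then crashes halfway.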



The Chaos Toolkit framework lets us piece this experiment together. The YAML below defines the experiment for Chaos Toolkit:

YAML

version: 1.0.0
title: What happens if there is an AZ Failure
description: Simulate AZ failure by creating blackhole Network ACL
configuration:
  aws_region: us-east-2

steady-state-hypothesis:
  title: SSH access to EC2 machine is working
  probes:
    - type: probe
      name: Check SSH Access to EC2 Instance
      tolerance: true
      provider:
        type: python
        module: chaosaws.ec2.probes
        func: ssh_test
        arguments:
          pem_file_path: Test-Chaos.pem

method:
- type: action
  title: Simulate AZ Failure by creating Blackhole ACL and attaching to Subnet
  name: AZ Failure Action creates a blackhole ACL, attaches it to a subnet thereby simulating AZ failure
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: az_failure

rollbacks:
- type: action
  title: Rollback AZ failure and restore original ACL to Subnet
  name: Rollback AZ failure by restoring Original ACL to Subnet and deleting blackhole ACL
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: rollback_az_failure
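
With Chaos Toolkit installed, an experiment file like this is typically started with `chaos run experiment.yaml` (the file name is only an example). Chaos Toolkit checks the steady-state hypothesis, runs the method, checks the hypothesis again, and then runs the rollbacks. The sketch below illustrates that order with plain Python callables. It is a simplified illustration, not Chaos Toolkit's implementation, and `run_experiment` is a hypothetical name.

```python
from typing import Callable, Dict, List


def run_experiment(
    probe: Callable[[], bool],
    actions: List[Callable[[], None]],
    rollbacks: List[Callable[[], None]],
) -> Dict[str, bool]:
    """Run probe -> actions -> probe -> rollbacks and report what happened."""
    if not probe():
        # Do not inject a failure into a system that is already unhealthy.
        return {"steady_before": False, "steady_after": False, "ran_method": False}
    steady_after = False
    try:
        for action in actions:
            action()
        steady_after = probe()
    finally:
        # Always try to undo the injected failure, even if an action raised.
        for rollback in rollbacks:
            rollback()
    return {"steady_before": True, "steady_after": steady_after, "ran_method": True}


# Usage with a stand-in "system" whose reachability the action switches off.
system = {"reachable": True}
report = run_experiment(
    probe=lambda: system["reachable"],
    actions=[lambda: system.update(reachable=False)],
    rollbacks=[lambda: system.update(reachable=True)],
)
print(report)
```

In the AZ failure experiment, the probe corresponds to `ssh_test`, the action to `az_failure`, and the rollback to `rollback_az_failure`. A `steady_after` value of False is the interesting result: it signals that the system did not tolerate the simulated failure.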



Happy Coding! 
