Chaos Engineering — Simulate AZ Failures on AWS

In this article, I walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS.

By Gaurav Gupta · Jun. 08, 20 · Tutorial

Chaos engineering is about introducing the turbulent conditions that systems are likely to face in production. Chaos experiments uncover new information, which can then be used to make changes to code, making our systems more resilient than they were before. Chaos experiments are not equivalent to testing. In testing, we check a system's response against a predefined expected result. In a chaos experiment, we don't have a predefined outcome: the experiment gives us new information about the system, which we can then use to improve it.

This article shows how to build a chaos experiment that simulates an Availability Zone (AZ) failure on AWS. Highly available applications need to be resilient against AZ failures. Your application, for example a Kubernetes cluster spanning multiple AZs, should be able to survive such a failure. These chaos simulations let you check and prepare for that.

Chaos Toolkit provides a good framework for defining chaos experiments. I have forked the chaostoolkit-aws repo and added AZ failure probes and actions to the ec2 module. I used boto3, the AWS SDK for Python, to build these experiments. You can access the code here: AZ Failure Git Repo

This is how an AZ failure experiment comes together:

  • Steady-state hypothesis: Before we kick off the experiment, we establish a steady-state hypothesis, that is, "what normal looks like". Here, I assume that if I can successfully SSH into the EC2 instance, there is no AZ failure and the system is in its normal state.
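Python

    # ... refer to the code repo for the full function ...
    # Excerpt from the body of the SSH probe: `instance` is looked up earlier
    # in the function, and `pem_file_path` is the probe's argument.

    logger.info('Starting SSH into EC2 instance - ' + instance.instance_id)
    ssh = paramiko.SSHClient()
    # Accept unknown host keys automatically (reasonable only for test instances)
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    privkey = paramiko.RSAKey.from_private_key_file(pem_file_path)

    try:
        ssh.connect(instance.public_dns_name, username='ec2-user', pkey=privkey, timeout=10)
    except Exception:
        # Any connection failure counts as a broken steady state
        logger.info('SSH timed out - waited for 10 seconds')
        return False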
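
A lighter-weight variant of this probe only checks whether the SSH port accepts TCP connections, without logging in. The sketch below is not taken from the repo: the function name `port_is_reachable` and its parameters are hypothetical, and it relies only on Python's standard `socket` module.

```python
import socket


def port_is_reachable(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens the TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures
        return False
```

In a probe, `instance.public_dns_name` from the listing above could be passed as `host`. Note that this only shows the port is open; it does not prove that a login would succeed.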



  • Action: simulate AZ failure. To simulate an AZ failure, I create a blackhole network ACL and attach it to the subnet of our instance. This blackhole ACL has one rule, which denies all ingress traffic for CIDR '0.0.0.0/0' across all ports.
Python

# ... refer to the code repo for the full function ...

logger.info('Simulating AZ failure for - ' + subnet.availability_zone)

# Create a new network ACL in the instance's VPC
acl_response = create_network_acl(vpc_id)
logger.info('Created new network ACL - ' + str(acl_response))
acl_id = acl_response['NetworkAcl']['NetworkAclId']

# Add the blackhole rule: deny all ingress traffic from any source
logger.info('Creating blackhole ACL')
create_network_acl_ingress_entry(acl_id, rule_num=1, protocol="-1", cidr_block="0.0.0.0/0", from_port=-30000, to_port=30000, allow=False)

NetworkAclAssociationId = None
prev_NetworkAclId = None

# Find the ACL currently associated with the subnet
nw_acl_dict = get_network_acls()

for x in nw_acl_dict['NetworkAcls']:
    for y in x['Associations']:
        if y['SubnetId'] == subnet_id:
            NetworkAclAssociationId = y['NetworkAclAssociationId']
            prev_NetworkAclId = y['NetworkAclId']
            # Persist both ACL IDs so the rollback can restore the original
            d2["acl_id"] = prev_NetworkAclId
            d2["blackhole_acl_id"] = acl_id
            with open("exp_data1.txt", 'w') as f:
                json.dump(d2, f)

logger.info('Replacing original ACL - ' + prev_NetworkAclId + ' with blackhole ACL ' + acl_id + ' on subnet ' + subnet_id)

# Associate the subnet with the blackhole ACL
change_network_acl_association(acl_id, NetworkAclAssociationId)
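
The helper functions in the listing above come from the forked repo and are not shown in this article. As a rough illustration of what creating a blackhole ACL can look like when calling the boto3 EC2 client directly, here is a minimal sketch. The function name `create_blackhole_acl` is hypothetical, and `ec2` is assumed to be a client created with `boto3.client("ec2")`. Because network ACLs are stateless, the sketch adds a deny rule for each direction.

```python
def create_blackhole_acl(ec2, vpc_id):
    """Create a network ACL that denies all traffic in both directions.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Returns the ID of the new network ACL.
    """
    acl_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    for egress in (False, True):
        ec2.create_network_acl_entry(
            NetworkAclId=acl_id,
            RuleNumber=1,
            Protocol="-1",          # all protocols, so no port range is needed
            RuleAction="deny",
            Egress=egress,          # False = inbound rule, True = outbound rule
            CidrBlock="0.0.0.0/0",  # match traffic from and to any address
        )
    return acl_id
```

A custom network ACL already denies all traffic until rules are added, so the explicit deny entries mainly make the intent visible to anyone inspecting the ACL during the experiment.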


  • Check the steady-state hypothesis again: Our steady-state hypothesis was a successful SSH connection to the EC2 instance. Now that the AZ failure is simulated, SSH will time out and the steady-state hypothesis is broken. This failure tells us that our system would not survive an AZ failure. We now have new information, which we can use to make the system more resilient.

  • Roll back the AZ failure: It is also important to roll back the AZ failure. To do this, we reattach the previous ACL to the subnet and delete the blackhole ACL.

Python

# Roll back the AZ failure by restoring the original ACL to the subnet and deleting the blackhole ACL
def rollback_az_failure():

    # Load the IDs saved when the failure was injected
    with open("exp_data1.txt") as f:
        d2 = json.load(f)
    prev_NetworkAclId = d2["acl_id"]
    blackhole_acl_id = d2["blackhole_acl_id"]
    subnet_id = d2["subnet_id"]

    logger.info('Rolling back ACL for subnet ' + subnet_id + ' from blackhole ACL - ' + blackhole_acl_id + ' to original ACL - ' + prev_NetworkAclId)
    nw_acl_dict = get_network_acls()

    # Look up the subnet's current association (its ID changes when the ACL is swapped)
    NetworkAclAssociationId = None
    for x in nw_acl_dict['NetworkAcls']:
        for y in x['Associations']:
            if y['SubnetId'] == subnet_id:
                NetworkAclAssociationId = y['NetworkAclAssociationId']

    change_network_acl_association(prev_NetworkAclId, NetworkAclAssociationId)
    logger.info('Removing blackhole ACL - ' + blackhole_acl_id)
    delete_network_acl(blackhole_acl_id)
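
For ad hoc scripts outside Chaos Toolkit, one risk is that an exception raised between the failure injection and the rollback leaves the subnet blackholed. A possible alternative, sketched here under assumptions, wraps the swap in a context manager so the original ACL is restored in a `finally` block. The name `swapped_network_acl` is hypothetical, `ec2` is assumed to be a boto3 EC2 client, and `describe_network_acls` and `replace_network_acl_association` are the client calls it relies on.

```python
from contextlib import contextmanager


@contextmanager
def swapped_network_acl(ec2, subnet_id, temporary_acl_id):
    """Attach `temporary_acl_id` to the subnet and restore the original ACL on exit.

    `ec2` is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2").
    Yields the ID of the original network ACL.
    """
    # Look up the ACL currently associated with the subnet
    response = ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )
    original = None
    for acl in response["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == subnet_id:
                original = assoc
    if original is None:
        raise LookupError("No network ACL association found for " + subnet_id)

    # Replacing an association returns a new association ID
    swapped = ec2.replace_network_acl_association(
        AssociationId=original["NetworkAclAssociationId"],
        NetworkAclId=temporary_acl_id,
    )
    try:
        yield original["NetworkAclId"]
    finally:
        # Runs even if the body raises, so the subnet is not left blackholed
        ec2.replace_network_acl_association(
            AssociationId=swapped["NewAssociationId"],
            NetworkAclId=original["NetworkAclId"],
        )
```

Anything run inside `with swapped_network_acl(ec2, subnet_id, acl_id):` sees the simulated failure, and the original ACL is reattached when the block exits. Deleting the temporary ACL afterwards is left to the caller.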



Chaos Toolkit lets us piece this experiment together. The YAML below shows how the experiment is defined:

YAML

version: 1.0.0
title: What happens if there is an AZ Failure
description: Simulate AZ failure by creating blackhole Network ACL
configuration:
  aws_region: us-east-2

steady-state-hypothesis:
  title: SSH access to EC2 machine is working
  probes:
    - type: probe
      name: Check SSH Access to EC2 Instance
      tolerance: true
      provider:
        type: python
        module: chaosaws.ec2.probes
        func: ssh_test
        arguments:
          pem_file_path: Test-Chaos.pem

method:
- type: action
  title: Simulate AZ Failure by creating Blackhole ACL and attaching to Subnet
  name: AZ Failure Action creates a blackhole ACL, attaches it to a subnet thereby simulating AZ failure
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: az_failure

rollbacks:
- type: action
  title: Rollback AZ failure and restore original ACL to Subnet
  name: Rollback AZ failure by restoring Original ACL to Subnet and deleting blackhole ACL
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: rollback_az_failure
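
To make the order of operations concrete, the sketch below mimics the sequence described in this article: check the steady state, inject the failure, check again, and roll back. It is a simplified illustration, not Chaos Toolkit's implementation, and `run_experiment` and its arguments are hypothetical names. The `tolerance` argument plays the role of `tolerance: true` in the YAML: the probe's return value has to equal it for the steady state to hold.

```python
def run_experiment(probe, action, rollback, tolerance=True):
    """Run probe, action, probe, rollback and report what was observed."""
    journal = {
        "steady_before": probe() == tolerance,
        "steady_after": None,
        "deviated": False,
    }
    if not journal["steady_before"]:
        # The system is not in its normal state, so do not inject a failure
        return journal
    try:
        action()
        journal["steady_after"] = probe() == tolerance
        journal["deviated"] = not journal["steady_after"]
    finally:
        # Undo the failure even if the action or the second probe raised
        rollback()
    return journal
```

With the SSH probe and the AZ failure action from this article, the expected observation would be a steady state that holds before the action and is broken after it, which is the new information the experiment is meant to surface.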



Happy Coding! 
