Chaos Engineering — Simulate AZ Failures on AWS

In this article, I will walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS.

Chaos engineering is about introducing the turbulent conditions that systems are likely to face in production environments. Chaos experiments uncover new information about a system, which can then be used to change the code and make the system more resilient than it was before. Chaos experiments are not equivalent to testing. In testing, we check the system's response against a predefined expected result. In a chaos experiment, we don't have a predefined outcome; the experiment gives us new information about the system, which we can then use to improve it.

In this article, I will walk you through creating a chaos experiment that simulates an Availability Zone (AZ) failure on AWS. Highly available applications need to be resilient to AZ failures. Your application, for example a Kubernetes cluster spanning multiple AZs, should be able to survive the loss of an AZ. These chaos simulations let you check and prepare for that.

Chaos Toolkit provides a good framework for defining chaos experiments. I forked the chaostoolkit-aws repo and added AZ failure probes and actions to the ec2 module, using boto3, the AWS SDK for Python, to implement them. You can access the code here: AZ Failure Git Repo

This is how an AZ failure experiment comes together:

  • Steady-state hypothesis: Before we kick off the experiment, we establish a steady-state hypothesis, that is, "what normal looks like". In this case, I assume that if I can successfully SSH into the EC2 instance, there is no AZ failure right now and the system is in its normal state.
Python

  # ... Refer to the code repo for the full function ...
  # `instance` and `pem_file_path` are defined in the full function.

  logger.info('Starting SSH into EC2 instance: ' + instance.instance_id)
  ssh = paramiko.SSHClient()
  ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
  privkey = paramiko.RSAKey.from_private_key_file(pem_file_path)

  try:
    ssh.connect(instance.public_dns_name, username='ec2-user', pkey=privkey, timeout=10)
  except Exception:
    # Any connection error, including a timeout, breaks the steady state
    logger.info('SSH failed or timed out (timeout is 10 seconds)')
    return False



  • Action: Simulate AZ failure. To simulate an AZ failure, I create a blackhole network ACL and attach it to the subnet of our instance. This blackhole ACL has a single rule that denies all ingress traffic: it covers the CIDR '0.0.0.0/0', all protocols, and all ports.
Python

# ... Refer to the code repo for the full function ...
# vpc_id, subnet_id, subnet and the state dict d2 are defined in the full function.

logger.info('Simulating AZ failure for: ' + subnet.availability_zone)

# Create a new network ACL
acl_response = create_network_acl(vpc_id)
logger.info('Created new network ACL: ' + str(acl_response))
acl_id = acl_response['NetworkAcl']['NetworkAclId']

# Add the blackhole rule: deny all ingress traffic (allow=False).
# With protocol "-1" (all protocols) the rule applies to every port,
# regardless of the port range passed here.
logger.info('Creating blackhole ACL')
create_network_acl_ingress_entry(acl_id, rule_num=1, protocol="-1", cidr_block="0.0.0.0/0", from_port=-30000, to_port=30000, allow=False)

NetworkAclAssociationId = None
prev_NetworkAclId = None

# Get the list of network ACLs
nw_acl_dict = get_network_acls()

# Find the subnet's current ACL association and save it for the rollback
for x in nw_acl_dict['NetworkAcls']:
  for y in x['Associations']:
    if y['SubnetId'] == subnet_id:
      NetworkAclAssociationId = y['NetworkAclAssociationId']
      prev_NetworkAclId = y['NetworkAclId']
      d2["acl_id"] = prev_NetworkAclId
      d2["blackhole_acl_id"] = acl_id
      with open("exp_data1.txt", 'w') as f:
        json.dump(d2, f)

logger.info('Replacing original ACL ' + prev_NetworkAclId + ' with blackhole ACL ' + acl_id + ' on subnet ' + subnet_id)

# Associate the subnet with the blackhole ACL
change_network_acl_association(acl_id, NetworkAclAssociationId)


  • Check the steady-state hypothesis again: Our steady-state hypothesis was a successful SSH connection to the EC2 instance. With the AZ failure simulated, SSH times out and the steady-state hypothesis is broken. This failure means that our system will not survive an AZ failure. We now have new information that we can use to make it more resilient.

  • Roll back the AZ failure: It is important to also roll back the AZ failure. Here we re-attach the previous ACL to the subnet and delete the blackhole ACL.

Python

# Roll back the AZ failure by restoring the original ACL to the subnet
# and deleting the blackhole ACL
def rollback_az_failure():

    # Load the state saved by the AZ failure action
    with open("exp_data1.txt") as f:
        d2 = json.load(f)
    prev_NetworkAclId = d2["acl_id"]
    blackhole_acl_id = d2["blackhole_acl_id"]
    subnet_id = d2["subnet_id"]

    logger.info('Rolling back ACL for subnet ' + subnet_id + ' from blackhole ACL ' + blackhole_acl_id + ' to original ACL ' + prev_NetworkAclId)
    nw_acl_dict = get_network_acls()

    # Look up the subnet's current association (it changed when the ACL was swapped)
    NetworkAclAssociationId = None
    for x in nw_acl_dict['NetworkAcls']:
        for y in x['Associations']:
            if y['SubnetId'] == subnet_id:
                NetworkAclAssociationId = y['NetworkAclAssociationId']

    change_network_acl_association(prev_NetworkAclId, NetworkAclAssociationId)
    logger.info('Removing blackhole ACL ' + blackhole_acl_id)
    delete_network_acl(blackhole_acl_id)
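
The action and rollback steps above can be sketched end to end against an EC2 client. This is a minimal sketch under stated assumptions, not the code from the repo: the function names `find_association`, `apply_blackhole`, and `rollback_blackhole` and the `state_path` parameter are hypothetical, and `ec2` is assumed to be an object with the method names of a boto3 EC2 client (for example `boto3.client("ec2")`). The client is passed in as an argument so the logic can be exercised with a stub before it touches a real VPC.

```python
import json


def find_association(acls_response, subnet_id):
    """Return (association_id, acl_id) for the subnet, or (None, None)."""
    for acl in acls_response["NetworkAcls"]:
        for assoc in acl["Associations"]:
            if assoc["SubnetId"] == subnet_id:
                return assoc["NetworkAclAssociationId"], assoc["NetworkAclId"]
    return None, None


def _describe_for_subnet(ec2, subnet_id):
    # Ask only for ACLs associated with this subnet.
    return ec2.describe_network_acls(
        Filters=[{"Name": "association.subnet-id", "Values": [subnet_id]}]
    )


def apply_blackhole(ec2, vpc_id, subnet_id, state_path):
    """Swap the subnet's ACL for a deny-all ACL and save what is needed to undo it."""
    # Look up the current association first, so a failed lookup leaves nothing behind.
    assoc_id, prev_acl_id = find_association(_describe_for_subnet(ec2, subnet_id), subnet_id)
    if assoc_id is None:
        raise ValueError("no network ACL association found for " + subnet_id)

    blackhole_id = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]
    # Explicit deny-all ingress rule; protocol "-1" covers all protocols and ports.
    ec2.create_network_acl_entry(
        NetworkAclId=blackhole_id,
        RuleNumber=1,
        Protocol="-1",
        RuleAction="deny",
        Egress=False,
        CidrBlock="0.0.0.0/0",
    )
    ec2.replace_network_acl_association(AssociationId=assoc_id, NetworkAclId=blackhole_id)

    state = {"subnet_id": subnet_id, "acl_id": prev_acl_id, "blackhole_acl_id": blackhole_id}
    with open(state_path, "w") as fh:
        json.dump(state, fh)
    return blackhole_id


def rollback_blackhole(ec2, state_path):
    """Re-attach the original ACL and delete the blackhole ACL."""
    with open(state_path) as fh:
        state = json.load(fh)
    # Replacing an association yields a new association ID, so look it up again.
    assoc_id, _ = find_association(
        _describe_for_subnet(ec2, state["subnet_id"]), state["subnet_id"]
    )
    if assoc_id is None:
        raise ValueError("no network ACL association found for " + state["subnet_id"])
    ec2.replace_network_acl_association(AssociationId=assoc_id, NetworkAclId=state["acl_id"])
    ec2.delete_network_acl(NetworkAclId=state["blackhole_acl_id"])
```

Saving the original ACL ID to a file before swapping, as the article's code does, is what makes the rollback possible from a separate process. In a real account the rollback would be called with the same state file, for example `rollback_blackhole(boto3.client("ec2"), "exp_data1.txt")`.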



The Chaos Toolkit framework lets us piece this experiment together. The YAML below shows how to declare it in Chaos Toolkit:

YAML

version: 1.0.0
title: What happens if there is an AZ Failure
description: Simulate AZ failure by creating blackhole Network ACL
configuration:
  aws_region: us-east-2

steady-state-hypothesis:
  title: SSH access to EC2 machine is working
  probes:
    - type: probe
      name: Check SSH Access to EC2 Instance
      tolerance: true
      provider:
        type: python
        module: chaosaws.ec2.probes
        func: ssh_test
        arguments:
          pem_file_path: Test-Chaos.pem

method:
- type: action
  title: Simulate AZ Failure by creating Blackhole ACL and attaching to Subnet
  name: AZ Failure Action creates a blackhole ACL, attaches it to a subnet thereby simulating AZ failure
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: az_failure

rollbacks:
- type: action
  title: Rollback AZ failure and restore original ACL to Subnet
  name: Rollback AZ failure by restoring Original ACL to Subnet and deleting blackhole ACL
  provider:
    type: python
    module: chaosaws.ec2.actions
    func: rollback_az_failure
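
To make the order of events in the experiment file concrete, here is a toy runner that walks a dictionary shaped like the YAML above: check the hypothesis, run the method, check the hypothesis again, then run the rollbacks. It is only an illustration of that sequence under simplifying assumptions and is not Chaos Toolkit's implementation; the function names `call_provider`, `hypothesis_holds`, and `run_experiment` are hypothetical, and the real tool adds configuration, secrets, journaling, and configurable rollback strategies.

```python
import importlib


def call_provider(provider):
    """Import provider['module'] and call provider['func'] with its arguments."""
    module = importlib.import_module(provider["module"])
    func = getattr(module, provider["func"])
    return func(**provider.get("arguments", {}))


def hypothesis_holds(hypothesis):
    """True when every probe's result equals its declared tolerance."""
    return all(
        call_provider(probe["provider"]) == probe["tolerance"]
        for probe in hypothesis["probes"]
    )


def run_experiment(experiment):
    """Run hypothesis -> method -> hypothesis -> rollbacks and report the outcome."""
    hypothesis = experiment["steady-state-hypothesis"]
    if not hypothesis_holds(hypothesis):
        # The system is not in its normal state, so nothing is injected.
        return "steady state not met before the experiment"
    try:
        for activity in experiment.get("method", []):
            call_provider(activity["provider"])
        survived = hypothesis_holds(hypothesis)
    finally:
        # Always try to undo the injected failure, even if a step raised.
        for activity in experiment.get("rollbacks", []):
            call_provider(activity["provider"])
    return "steady state held" if survived else "deviation found"
```

For the AZ experiment in this article, the second hypothesis check is the interesting one: the SSH probe returns False after the blackhole ACL is attached, which does not match `tolerance: true`, so the run ends as a deviation and the rollback restores the original ACL.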



Happy Coding! 

Topics:
aws, chaos, chaos engineering, tutorial
