
Serverless Approach to Backup and Restore EBS Volumes

Check out the types of EBS storage that AWS offers and a serverless approach to backing up your volumes.


Amazon Elastic Compute Cloud (EC2) instances use Elastic Block Store (EBS) as a root volume as well as additional data storage for applications. It is necessary to select the proper EBS volume type for the workload to achieve high performance, and to back up EBS volumes regularly in production environments. We need a solution that can back up and restore application data from EBS volume snapshots at any point in time, without paying unnecessary costs for archiving older snapshots. This article covers choosing the right EBS volume type for your application and provides a mechanism to manage EBS snapshots using serverless technology.

EBS Volume Types

Amazon EBS provides different volume types with different performance characteristics and cost models. We can choose the volume type based on our application requirements (the type of workload) to achieve higher performance as well as to save on overall storage cost. EBS volumes come in two categories: SSD-backed volumes and HDD-backed volumes. SSD-backed volumes are used when the workload is I/O intensive, such as transactional workloads where frequent reads and writes happen in an application; their performance is rated in IOPS. HDD-backed volumes are used when an application requires continuous reads and writes to disk at a lower price with high throughput; their performance is rated in MiB/s of throughput.

SSD-Backed EBS Volumes

SSD volumes are high-performance EBS storage backed by modern solid-state drive technology. They are available as General Purpose SSD (gp2) and Provisioned IOPS SSD (io1) volumes. A gp2 volume suits a wide variety of workloads by balancing price and performance, whereas io1 provides high performance and high throughput and is suitable for mission-critical, low-latency workloads, especially hosting databases such as Cassandra, MongoDB, Postgres, MySQL, and Oracle Database. A gp2 volume can be created from 1 GiB to 16 TiB in size and an io1 volume from 4 GiB to 16 TiB. They provide up to 16,000 (gp2) or 64,000 (io1) IOPS per volume, depending on volume type and size.
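As a quick illustration, here is a minimal boto3 sketch of creating both SSD volume types; the region, Availability Zone, sizes, and IOPS figure are placeholder assumptions:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# General Purpose SSD (gp2), e.g. for a root/boot volume
gp2_volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',  # placeholder AZ
    Size=100,                       # size in GiB
    VolumeType='gp2'
)

# Provisioned IOPS SSD (io1) for latency-sensitive database storage
io1_volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',
    Size=500,
    VolumeType='io1',
    Iops=10000  # provisioned IOPS; must respect the allowed IOPS-to-size ratio
)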

HDD-Backed EBS Volumes

HDD volumes are throughput-optimized volumes at a lower price, backed by magnetic storage technology. They are further classified into Throughput Optimized HDD (st1) and Cold HDD (sc1). These volume types can be used for frequently and infrequently accessed workloads at low and the lowest prices, respectively. They are generally used in scenarios such as hosting data warehouses and big data solutions, where we need to consider throughput as well as storage price, but they cannot be used as a boot volume for any EC2 server. These volume types can be created from 500 GiB to 16 TiB in size and provide from 250 to 500 IOPS per volume. The maximum throughput per volume ranges from 250 MiB/s (sc1) to 500 MiB/s (st1).
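A matching sketch for the HDD-backed types, again with placeholder region, Availability Zone, and sizes:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# Throughput Optimized HDD (st1) for frequently accessed, throughput-heavy data
st1_volume = ec2.create_volume(AvailabilityZone='us-east-1a', Size=500, VolumeType='st1')

# Cold HDD (sc1) for infrequently accessed data at the lowest price point
sc1_volume = ec2.create_volume(AvailabilityZone='us-east-1a', Size=500, VolumeType='sc1')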

Selecting a Proper EBS Volume Type for an Application

If our solution relies on frequent access to data and transactions between the application and the database are frequent, we need high-performing, reliable data storage at a comparatively lower cost. We use a General Purpose SSD volume for the root volume and a Provisioned IOPS SSD volume for the application and database data directories, which require continuous read and write operations with low latency. We can monitor the volume's performance under load, when the application receives the maximum traffic it can handle without much latency. Based on that, we may increase or decrease the IOPS count to reduce the storage cost, as sketched below.
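For illustration, a minimal sketch of tuning provisioned IOPS in place using the Elastic Volumes API; the volume ID and IOPS target are placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# raise or lower the provisioned IOPS of an existing io1 volume in place
ec2.modify_volume(
    VolumeId='vol-0123456789abcdef0',  # placeholder volume ID
    Iops=6000                          # new IOPS target based on observed load
)

# track the progress of the modification
mods = ec2.describe_volumes_modifications(VolumeIds=['vol-0123456789abcdef0'])
print(mods['VolumesModifications'][0]['ModificationState'])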

EBS Volume Data Protection

Encrypting an EBS volume protects the data at rest as well as in transit. We can encrypt EBS volumes when creating them. An encrypted volume encrypts all the data inside that volume and also encrypts the data as it moves between the EC2 instance and EBS storage. The EBS snapshots created from these volumes are encrypted as well. Using encrypted volumes has no significant performance impact beyond a minimal effect on I/O latency. We cannot make encrypted snapshots public, and they cannot be shared between AWS accounts when encrypted with the default EBS key. We can use encrypted volumes for the application data directory to secure our data at rest as well as in transit.
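A minimal sketch of creating an encrypted volume; omitting KmsKeyId falls back to the account's default EBS key, and the Availability Zone and size are placeholders:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# encrypted volume; omitting KmsKeyId uses the account's default EBS key
encrypted_volume = ec2.create_volume(
    AvailabilityZone='us-east-1a',  # placeholder AZ
    Size=200,                       # size in GiB
    VolumeType='gp2',
    Encrypted=True
)
# snapshots taken from this volume will be encrypted as well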

Architecture

This architecture diagram explains the implementation of a Lambda function that takes a snapshot backup of tagged EBS volumes in an AWS account across all regions daily and purges snapshots older than the specified retention period in days. With this solution in place, we do not need to clean up any EBS snapshots manually, and we can keep the number of EBS snapshots in line with the retention policy. It helps us reduce unnecessary AWS billing costs and avoids keeping snapshots that are no longer useful. A volume to be backed up by this solution must carry a tag with a key such as "Backup" and some value for that key; the snapshot is created with that name followed by a timestamp. The function then checks for snapshots created before the number of days specified as the retention period and deletes them. The tag key "Name" is not supported for taking backups. A volume can be opted in as shown below.
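For example, a volume can be opted into the backup schedule with a sketch like this; the volume ID and tag value are placeholders, and the value is free-form since the function filters on the tag key only:

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

# opt a volume into the backup schedule by adding the 'Backup' tag key
ec2.create_tags(
    Resources=['vol-0123456789abcdef0'],  # placeholder volume ID
    Tags=[{'Key': 'Backup', 'Value': 'daily'}]
)

The full Lambda function implementing the backup and purge logic follows.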

import os
import boto3
import datetime
import traceback

def lambda_handler(event, context):
    #default environment variables if env var is not set
    snapshot_tag = os.getenv('snapshot_tag', 'Backup')
    retention_period_in_days = os.getenv('retention_period_in_days', '7')
    sns_topic_arn = os.getenv('sns_topic_arn', 'arn:aws:sns:us-east-1:113809544561:ebs-snapshot-backup')

    def get_aws_regions(ec2_client):
        try:
            regions = ec2_client.describe_regions()
            aws_regions = []
            for region in regions['Regions']:
                aws_regions.append(region['RegionName'])
            return aws_regions
        except Exception as e:
            print 'Unable to get AWS region list'
            print e.message
            traceback.print_exc()
            exit(1)

    # list all ebs volumes with the specified tag
    def get_ebs_volumes_with_specified_tags(client, snapshot_tag):
        try:
            ebs_volumes = []
            paginator = client.get_paginator('describe_volumes')
            response_iterator = paginator.paginate(
                Filters=[{'Name': 'tag-key', 'Values': [snapshot_tag]}],
                PaginationConfig={
                    'MaxItems': 1000,
                    'PageSize': 5,
                }
            )
            for response in response_iterator:
                for volume in response['Volumes']:
                    ebs_volumes.append(volume['VolumeId'])
            return ebs_volumes
        except Exception as e:
            print 'Unable to get EBS volumes with specified tags'
            print e.message
            traceback.print_exc()
            exit(1)

    # take snapshot of all volumes from the above list
    def create_ebs_volume_snapshot(client, ebs_volumes, snapshot_tag):
        try:
            snapshot_list = []
            timestamp = datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S')
            for volume in ebs_volumes:
                snapshot = client.create_snapshot(
                    Description='EBS-Volume-Snapshot-Backup-' + volume + '-' + timestamp,
                    VolumeId=volume,
                    TagSpecifications=[
                        {
                            'ResourceType': 'snapshot',
                            'Tags': [
                                {
                                    'Key': 'Name',
                                    'Value': volume + '-' + timestamp
                                },{
                                    'Key': snapshot_tag,
                                    'Value': volume + '-' + timestamp
                                }
                            ]
                        },
                    ],
                )
                snapshot_list.append(snapshot['SnapshotId'])
            return snapshot_list
        except Exception as e:
            print 'Unable to create EBS Snapshot Backup'
            print e.message
            traceback.print_exc()
            exit(1)

    # list all ebs snapshots which are older than the retention period
    def get_ebs_snapshots_older_than_retention_period(client, snapshot_tag):
        try:
            old_ebs_snapshots = []
            paginator = client.get_paginator('describe_snapshots')
            response_iterator = paginator.paginate(
                Filters=[{'Name': 'tag-key', 'Values': [snapshot_tag]}],
                PaginationConfig={
                    'MaxItems': 1000,
                    'PageSize': 5,
                }
            )
            for response in response_iterator:
                for snapshot in response['Snapshots']:
                    # StartTime is a timezone-aware UTC datetime; drop the tzinfo
                    # so it can be compared with a naive utcnow() directly
                    start = snapshot['StartTime'].replace(tzinfo=None)
                    age_in_days = (datetime.datetime.utcnow() - start).total_seconds() / (3600 * 24)
                    if age_in_days >= float(retention_period_in_days):
                        old_ebs_snapshots.append(snapshot['SnapshotId'])
            return old_ebs_snapshots
        except Exception as e:
            print 'Unable to get EBS Snapshots which are older than the retention period'
            print e.message
            traceback.print_exc()
            exit(1)

    # main method called by the lambda handler
    def run_task():
        try:
            if snapshot_tag == 'Name':
                print 'Tag key **Name** is not allowed. Use different tag key such as ProdBackup, DevBackup, Backup, etc.'
                exit(1)
            file_path = '/tmp/msg.txt'
            if os.path.isfile(file_path):
                os.remove(file_path)
            sts_client = boto3.client('sts', region_name='us-east-1')
            account_id = sts_client.get_caller_identity()["Account"]
            with open(file_path, 'a') as file:
                file.write('*' * 100 + '\n')
                file.write('EBS Volume Snapshot Backup Details(AWS account ' + str(account_id) + '):' + '\n')
                file.write('Backup retention period in days: ' + str(retention_period_in_days) + '\n')
                file.write('Snapshot Tag: ' + snapshot_tag + '\n')
                file.write('EBS volumes under snapshot tag key will be backed up by creating an EBS volume snapshot.' + '\n')
                file.write('EBS snapshots will be persisted up to the retention period in days and older snapshots will be purged.' + '\n')
            ec2_client = boto3.client('ec2', region_name='us-east-1')
            regions = get_aws_regions(ec2_client)
            overall_ebs_volumes_tobe_backed_up = 0
            overall_ebs_snapshot_created = 0
            overall_ebs_snapshot_deleted = 0
            for region in regions:
                ec2_conn = boto3.client('ec2', region_name=region)
                ebs_volumes = get_ebs_volumes_with_specified_tags(client=ec2_conn, snapshot_tag=snapshot_tag)
                if len(ebs_volumes) != 0:
                    with open(file_path, 'a') as file:
                        file.write('Number of volumes to be backed up in ' + region + ': ' + str(len(ebs_volumes)) + '\n')
                        file.write(','.join(ebs_volumes) + '\n')
                create_snapshot = create_ebs_volume_snapshot(client=ec2_conn, ebs_volumes=ebs_volumes, snapshot_tag=snapshot_tag)
                if len(create_snapshot) != 0:
                    with open(file_path, 'a') as file:
                        file.write('Number of snapshots created in ' + region + ': ' + str(len(create_snapshot)) + '\n')
                        file.write(','.join(create_snapshot) + '\n')
                old_ebs_snapshots = get_ebs_snapshots_older_than_retention_period(client=ec2_conn, snapshot_tag=snapshot_tag)
                # delete all the ebs snapshots from the above list
                for old_ebs_snapshot in old_ebs_snapshots:
                    ec2_conn.delete_snapshot(SnapshotId=old_ebs_snapshot)
                if len(old_ebs_snapshots) != 0:
                    with open(file_path, 'a') as file:
                        file.write('Number of deleted snapshots in ' + region + ': ' + str(len(old_ebs_snapshots)) + '\n')
                        file.write(','.join(old_ebs_snapshots) + '\n')
                overall_ebs_volumes_tobe_backed_up = overall_ebs_volumes_tobe_backed_up + len(ebs_volumes)
                overall_ebs_snapshot_created = overall_ebs_snapshot_created + len(create_snapshot)
                overall_ebs_snapshot_deleted = overall_ebs_snapshot_deleted + len(old_ebs_snapshots)
            with open(file_path, 'a') as file:
                file.write('Number of volumes to be backed up across regions ' + ': ' + str(overall_ebs_volumes_tobe_backed_up) + '\n')
                file.write('Number of ebs snapshots created across regions ' + ': ' + str(overall_ebs_snapshot_created) + '\n')
                file.write('Number of ebs snapshots deleted across regions ' + ': ' + str(overall_ebs_snapshot_deleted) + '\n')

            #sending notification to the cloud administrator
            count = overall_ebs_volumes_tobe_backed_up + overall_ebs_snapshot_created + overall_ebs_snapshot_deleted
            if count != 0:
                with open(file_path, 'r') as file:
                    message = file.read()
                    print message
                    try:
                        print 'Sending email to the cloud administrator'
                        sns_client = boto3.client('sns', region_name=sns_topic_arn.split(':')[3])
                        send_message = sns_client.publish(TopicArn=sns_topic_arn,
                                                          Message=message,
                                                          Subject='EBS Volume Snapshot Backup Details(AWS account ' + str(account_id) + ')',
                                                          )
                        print 'Message id: ' + send_message['MessageId']
                        print 'Email sent successfully.'
                    except Exception as e:
                        print 'Unable to send email.'
                        print e.message
            else:
                print 'No EBS volumes to be backed up and no EBS snapshots to be deleted.'
        except Exception as e:
            print e.message
            traceback.print_exc()
            exit(1)

    run_task()

# if __name__ == '__main__':
#     lambda_handler(1, 1)


How to Implement This Solution

We implement this solution as a CloudFormation stack. The CloudFormation template creates resources such as a Lambda function with all the important configuration: runtime, handler, memory settings, timeout value, etc. To tighten security, we attach a fine-grained IAM policy to the Lambda execution role; this IAM role and policy are defined in the same CloudFormation template. The Lambda function is triggered by a CloudWatch event rule that acts as a cron scheduler. Once the Lambda function executes, it creates volume snapshots and purges the snapshots that are older than the retention period. The CloudFormation template also takes care of creating the event rule with the necessary permission to invoke the Lambda function. This infrastructure can be reproduced in multiple AWS accounts by using the same CloudFormation template; a minimal deployment sketch follows the template below.


---
AWSTemplateFormatVersion: 2010-09-09

Description: This template creates a lambda function that backs up EBS volumes and deletes backups older than the retention period.

Parameters:
  LambdaFunctionName:
    Type: String
    Description: Name of the lambda function
  LambdaS3Bucket:
    Type: String
    Default: ebs-snapshot-backup-s3-bucket
    Description: Name of the S3 bucket where the zipped lambda code is available.
  LambdaS3BucketKey:
    Type: String
    Default: ebs-snapshot-backup.zip
    Description: Name of the S3 bucket key(zipped lambda code).
  CloudwatchScheduleExpression:
    Type: String
    Default: 'cron(30 0 * * ? *)'
    Description: Cron Expression for Cloudwatch Event
  CloudWatchEventState:
    Type: String
    Default: ENABLED
    Description: Enable/Disable cloudwatch event
  RetentionPeriod:
    Type: Number
    Default: 7
    Description: Number of days to retain the snapshot backup and to delete the older backups
  SnapshotTag:
    Type: String
    Default: Backup
    Description: Name of the tag associated with an EBS volumes to be backed up.
  Email:
    Type: String
    Description: Valid email id to receive notifications. Only one email id is allowed by this template. You can add multiple subscription endpoints from the AWS SNS Management Console.

Resources:
  LambdaExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - lambda.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - logs:*
                Resource: arn:aws:logs:*:*:*
              - Effect: Allow
                Action:
                  - s3:ListBucket
                Resource: !Join [ '', [ 'arn:aws:s3:::', !Ref LambdaS3Bucket ] ]
              - Effect: Allow
                Action:
                  - s3:GetObject
                Resource: !Join [ '', [ 'arn:aws:s3:::', !Ref LambdaS3Bucket, '/*' ] ]
              - Effect: Allow
                Action:
                  - ec2:DescribeRegions
                  - ec2:DescribeVolumes
                  - ec2:CreateSnapshot
                  - ec2:DeleteSnapshot
                  - ec2:DescribeSnapshots
                  - ec2:DescribeSnapshotAttribute
                  - ec2:CreateTags
                  - ec2:DeleteTags
                  - ec2:DescribeTags
                Resource: '*'
              - Effect: Allow
                Action:
                  - sts:GetCallerIdentity
                Resource: '*'
              - Effect: Allow
                Action:
                  - SNS:Publish
                Resource: !Ref Topic
    DependsOn: Topic

  LambdaFunction:
    Type: "AWS::Lambda::Function"
    Properties:
      Description: "Lambda function for taking backup of EBS volumes and delete the backup which is lesser than the retention period."
      FunctionName: !Ref LambdaFunctionName
      Handler: 'handler.lambda_handler'
      Runtime: 'python2.7'
      Timeout: 900
      MemorySize: 256
      Role: !GetAtt LambdaExecutionRole.Arn
      Code:
        S3Bucket: !Ref LambdaS3Bucket
        S3Key: !Ref LambdaS3BucketKey
      Environment:
        Variables:
          snapshot_tag: !Ref SnapshotTag
          retention_period_in_days: !Ref RetentionPeriod
          sns_topic_arn: !Ref Topic
    DependsOn:
    - LambdaExecutionRole
    - Topic

  CleanupEventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: "Cloudwatch Rule to backup ebs volumes across regions and purge the old volume snapshots"
      ScheduleExpression: !Ref CloudwatchScheduleExpression
      State: !Ref CloudWatchEventState
      Targets:
        - Arn: !Sub ${LambdaFunction.Arn}
          Id: "CleanupEventRule"
    DependsOn: LambdaFunction

  LambdaSchedulePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: 'lambda:InvokeFunction'
      FunctionName: !Sub ${LambdaFunction.Arn}
      Principal: 'events.amazonaws.com'
      SourceArn: !Sub ${CleanupEventRule.Arn}
    DependsOn: LambdaFunction

  Topic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: !Join ['-', ["sns-topic", !Ref "AWS::StackName"]]

  Subscription:
    Type: AWS::SNS::Subscription
    Properties:
      Endpoint: !Ref Email
      Protocol: email
      TopicArn: !Ref Topic
    DependsOn: Topic
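Assuming the template above is saved locally and the zipped function code has already been uploaded to the LambdaS3Bucket location, a minimal boto3 deployment sketch looks like this; the stack name, template path, region, and parameter values are placeholders:

import boto3

cfn = boto3.client('cloudformation', region_name='us-east-1')

# the zipped Lambda code must already exist at LambdaS3Bucket/LambdaS3BucketKey
with open('ebs-snapshot-backup.yaml') as template:  # placeholder template path
    cfn.create_stack(
        StackName='ebs-snapshot-backup',  # placeholder stack name
        TemplateBody=template.read(),
        Capabilities=['CAPABILITY_IAM'],  # the template creates an IAM role
        Parameters=[
            {'ParameterKey': 'LambdaFunctionName', 'ParameterValue': 'ebs-snapshot-backup'},
            {'ParameterKey': 'Email', 'ParameterValue': 'admin@example.com'},
        ]
    )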


Monitoring the Serverless Application

We can use CloudWatch metrics to monitor our solution. CloudWatch exposes metrics about Lambda executions and error rates, including latencies. With those metrics, we can optimize our solution if needed. For debugging, we use the logs that Lambda publishes to its CloudWatch log stream. A sketch of pulling the error metric follows.
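For example, a minimal sketch that pulls the hourly count of failed executions for the function over the last day; the function name is a placeholder:

import boto3
import datetime

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

# hourly count of failed executions over the last day
errors = cloudwatch.get_metric_statistics(
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'ebs-snapshot-backup'}],  # placeholder name
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(days=1),
    EndTime=datetime.datetime.utcnow(),
    Period=3600,
    Statistics=['Sum']
)
print(errors['Datapoints'])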

Refactoring/Maintenance of the Application

Implementing this solution does not require a maintenance window, and it does not impact anything else in our infrastructure. This solution is easy to maintain. The codebase of the Lambda function lives in S3; whenever we change the code, we have to update the Lambda function so the changes take effect during the next execution, for example as sketched below. To update or disable the backup schedule, update the CloudFormation stack.
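A minimal sketch of refreshing the function code after uploading a new zip; the function name is whatever was passed as LambdaFunctionName, and the bucket and key match the template defaults:

import boto3

lambda_client = boto3.client('lambda', region_name='us-east-1')

# point the function at the freshly uploaded zip in S3
lambda_client.update_function_code(
    FunctionName='ebs-snapshot-backup',        # the LambdaFunctionName stack parameter
    S3Bucket='ebs-snapshot-backup-s3-bucket',  # LambdaS3Bucket default from the template
    S3Key='ebs-snapshot-backup.zip'            # LambdaS3BucketKey default from the template
)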
