We are using Amazon's DynamoDB (DDB) as part of our platform. AWS itself replicates the data across three facilities (Availability Zones, AZs) within a given region to automatically cope with an outage of any of them. This is reassuring, and useful as an out-of-the-box solution, but you'll probably want to go beyond this setup, depending on your high availability and disaster recovery requirements.
I have recently done some research and proof-of-concepts on how best to achieve a solution in line with our current setup. We needed it to:
- Be as cost effective as possible, while covering our needs.
- Introduce the least possible complexity in terms of deployment and management.
- Satisfy our current data backup needs and pave the way for handling high availability in the near future.
There's definitely some good literature on the topic online (1), aside from related AWS resources, but I have decided to write a series of posts that will hopefully provide a more practical view on the problem and the different range of possible solutions.
In terms of high availability, your safest bet is probably cross-region replication of your DDB tables. In a nutshell, this allows you to create replicas of your master tables in a different AWS region. Luckily, AWS Labs provides an open-source implementation for this, hosted on GitHub. If you take a close look at the project's README, you'll notice it is implemented using the Kinesis Client Library (KCL). It relies on DDB Streams, so you need to enable streaming for the DDB tables you want to replicate, at least for the masters (replicas don't need it).
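For reference, enabling streaming on an existing table is a one-liner with the AWS CLI. This is a sketch: the table name and region are placeholders, and `NEW_AND_OLD_IMAGES` is the view type I would expect a replication process to need — check the project's README for the exact requirement:

```shell
# Enable a DDB stream on an existing table.
# "MyMasterTable" and "us-east-1" are placeholder values.
aws dynamodb update-table \
    --table-name MyMasterTable \
    --region us-east-1 \
    --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES
```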
From what I've seen, there are several ways of accomplishing our data replication needs:
Using a CloudFormation Template
You can use a CloudFormation (CF) template to take care of the infrastructure setup needed to run the cross-region replication implementation mentioned above. If you're not very familiar with CF, AWS describes it as:
AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.
Creating a stack with it is quite straightforward: the wizard lets you configure the options shown in the following screenshot, plus some more advanced ones on the next screen, for which you can use the defaults in a basic setup.
Using this template takes care of everything, from creating your IAM roles and Security Groups to launching the EC2 instances that perform the job. One of those instances coordinates the replication, and the other(s) carry out the actual replication process (i.e. run the KCL worker processes). The worker instances are implicitly defined as part of an Auto Scaling group, so as to guarantee that workers are always running. This prevents events from the DDB stream from going unprocessed, which would lead to data loss.
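If you prefer the command line over the console wizard, the same stack can be created with the AWS CLI. A hedged sketch: the stack name is arbitrary, and the template URL and parameter names must come from the project's README, so treat those values as placeholders:

```shell
# Create the replication stack; $TEMPLATE_URL and the parameter key/value
# are placeholders to be taken from the project's README.
# CAPABILITY_IAM is required because the template creates IAM roles.
aws cloudformation create-stack \
    --stack-name ddb-cross-region-replication \
    --template-url "$TEMPLATE_URL" \
    --capabilities CAPABILITY_IAM \
    --parameters ParameterKey=SourceTable,ParameterValue=MyMasterTable
```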
I couldn't fully test this method: after CF finished setting everything up, I couldn't use the ReplicationConsoleURL to configure master/replica tables due to the AWS error below. In any case, I wanted more fine-grained control over the process, so I looked into the next option.
Manually Creating Your AWS Resources and Running the Replication Process
This basically implies performing most of what CF does on your behalf, so it means quite a bit more work in terms of infrastructure configuration, whether through the AWS console or as part of your standard environment deployment process.
I believe this is a valid scenario if you want to use your existing AWS resources to run the worker processes. You'll need to weigh your cost restrictions and computing resource needs before considering this a valid approach. In our case it would help with both, so I decided to explore it further.
Given that we already have EC2 resources set up as part of our deployment process, I decided to create a simple bash script to kick off the replication process during deployment. It takes care of installing the required OS dependencies, cloning and building the git repo, and then executing the process. It requires four arguments (source region/table and target region/table). Obviously, it doesn't perform any setup on your behalf, so the tables passed as arguments must already exist in the specified regions, and the source table must have streaming enabled.
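The script is not much more than the following sketch. Hedged heavily: the package names assume a yum-based host, and the jar path and flag names passed to the replication process are assumptions to be verified against the project's README:

```shell
#!/usr/bin/env bash
# Sketch of the deployment hook. Package names, jar path and CLI flags
# are assumptions; check the awslabs project's README before using.
run_replication() {
    if [ "$#" -ne 4 ]; then
        echo "usage: run_replication <src-region> <src-table> <dst-region> <dst-table>" >&2
        return 1
    fi
    local src_region=$1 src_table=$2 dst_region=$3 dst_table=$4

    # OS dependencies (assumes an Amazon Linux / yum-based instance)
    sudo yum install -y git java-1.8.0-openjdk maven

    # Clone and build the AWS Labs replication library
    git clone https://github.com/awslabs/dynamodb-cross-region-library.git
    (cd dynamodb-cross-region-library && mvn clean install -DskipTests)

    # Kick off the KCL worker process; flag names are assumptions
    java -jar dynamodb-cross-region-library/target/dynamodb-cross-region-replication-*.jar \
        --sourceRegion "$src_region" --sourceTable "$src_table" \
        --destinationRegion "$dst_region" --destinationTable "$dst_table"
}
```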
This proved to be a simple enough approach, and it worked as expected. The only downside is that, even though it runs within our existing EC2 fleet, we still needed to figure out a mechanism for monitoring the worker process and restarting it if it dies for any reason, to avoid the data loss mentioned above. This is definitely an approach we might end up using in the near future.
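One low-tech option for that monitoring piece is a small supervisor loop around the worker. The `keep_alive` helper below is hypothetical and purely for illustration; in practice you would more likely reach for an init system (upstart/systemd), monit, or your configuration management tool:

```shell
# keep_alive: rerun a command whenever it dies, up to MAX_RESTARTS times.
# Hypothetical helper for illustration only.
keep_alive() {
    local tries=0 max="${MAX_RESTARTS:-5}"
    while [ "$tries" -lt "$max" ]; do
        if "$@"; then
            return 0                       # clean exit: stop supervising
        fi
        tries=$((tries + 1))
        echo "worker died; restart #$tries" >&2
        sleep "${RESTART_DELAY:-5}"
    done
    echo "giving up after $max restarts" >&2
    return 1
}

# e.g.: keep_alive java -jar replication.jar <args...>
```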
Using Lambda to Process the DDB Stream Events
This method takes the same approach as above, in that it relies on events from your DDB table streams, but removes the need to manage the AWS compute resources involved. You will still need to handle some infrastructure and write the Lambda function that performs the actual replication, but it definitely helps with the cost and simplicity requirements mentioned in the introduction.
I'll leave the details of this approach for the last post of this series, though, as it is quite a broad topic that I will cover there in detail.
In the upcoming posts I will discuss the overall solution we ended up going with. But before getting to that, my next post will cover how to back up your DDB tables to S3.