This post is part one of a two-part series to help you understand the performance and inner workings of DynamoDB, one of the most popular cloud NoSQL databases currently on the market. In this post, we'll talk about two key performance metrics and why it's important to monitor them. In the next post, we'll explore other DynamoDB metrics.
As part of the Amazon Web Services (AWS) portfolio, DynamoDB is a NoSQL cloud database designed to provide single-digit millisecond latency to applications running at any scale. In other words, when your application goes enterprise and starts to store large volumes of data, DynamoDB will still provide the same fast and consistent performance. But once implemented, how do you actually know whether your DynamoDB is providing minimal latency and great throughput for all your database operations? The only way to find out is to quantify and measure several key parameters.
However, monitoring DynamoDB tables can get a little tricky, especially when the number of base tables and secondary indexes starts to increase. Also, when it comes to monitoring, every metric matters. The lingering question in the minds of database administrators will always be, "Which metric should I focus on?" Understanding the importance of these metrics will go a long way in helping you make smart decisions in your AWS environment.
For the Uninitiated
With a fully managed service like DynamoDB, you don't need to think about infrastructure. You just need to configure the request throughput you want to achieve for your application. This value determines how many read and write operations per second your DynamoDB table and secondary indexes can process. In AWS parlance, this throughput is measured by two different metrics: provisioned read capacity units and provisioned write capacity units. DynamoDB lets you set request throughput for both your base table and its secondary indexes. Another important point to keep in mind is that you'll be paying AWS a flat, hourly rate based on the throughput you provision.
The Importance of Provisioning the Right Throughput
Wait, hold on a second. Didn't AWS just make my job as an administrator a lot simpler by introducing AutoScaling? Well, to answer your question, yes, it kind of did. With AutoScaling, you get to automate capacity management for your DynamoDB tables by specifying target throughput utilization and by setting upper and lower limits for both reads and writes.
But to understand why monitoring is necessary, let's pull the curtain back to see how AutoScaling works for DynamoDB. When the throughput utilization for your table exceeds or drops below the target value, your application's AutoScaling policy kicks in and starts to increase or decrease capacity to make sure the throughput utilization gets back to the target value again.
If you really look at it, target utilization (the ratio of consumed capacity units to provisioned capacity units), which is the trigger at the heart of AutoScaling, still depends on the initial provisioned throughput you set up manually.
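To make the utilization ratio concrete, here is a minimal sketch of the arithmetic. It is a simplified model and not the actual AutoScaling algorithm: the function names, the 70% target, and the 100 and 1,000 unit limits are all assumptions picked for illustration.

```python
import math


def utilization(consumed_units, provisioned_units):
    """Ratio of consumed capacity units to provisioned capacity units."""
    return consumed_units / provisioned_units


def next_provisioned(consumed_units, target_utilization, min_units, max_units):
    """Capacity that would bring utilization back to the target.

    Simplified model (assumption): resize so that consumed / provisioned
    equals the target, then clamp to the configured lower and upper limits.
    """
    desired = math.ceil(consumed_units / target_utilization)
    return max(min_units, min(max_units, desired))


# Hypothetical table: 500 provisioned units, 70% target, limits of 100 to 1,000.
print(utilization(450, 500))                   # 0.9, above the 0.7 target
print(next_provisioned(450, 0.7, 100, 1000))   # 643
print(next_provisioned(900, 0.7, 100, 1000))   # 1000, capped by the upper limit
```

Notice that the upper and lower limits, like the starting capacity, are still numbers you have to choose yourself.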
Difficulties in Optimizing Provisioned Capacity
But determining the exact throughput requirement of an application at the onset is difficult. You need to factor in application logic, user activity, geography, and other estimates before deciding on a throughput capacity.
The Other Estimates
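There are two key factors you need to consider before you start provisioning throughput capacity units. The first is the request size (the size of the item), and the other is the request rate (the number of read/write operations performed per second). Of these two, capacity calculations are primarily bound by the size of the items being read or written, because capacity units are metered in fixed size increments: one read capacity unit covers a strongly consistent read of an item up to 4 KB, and one write capacity unit covers a write of an item up to 1 KB.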
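As a rough illustration of how request size and request rate combine, the sketch below estimates capacity units for a workload. It relies on the DynamoDB unit sizes (4 KB per read capacity unit for a strongly consistent read, half that cost for an eventually consistent read, and 1 KB per write capacity unit). The function names and the 6 KB example workload are made up for this post.

```python
import math

READ_UNIT_BYTES = 4 * 1024   # one read capacity unit covers up to 4 KB
WRITE_UNIT_BYTES = 1024      # one write capacity unit covers up to 1 KB


def read_capacity_units(item_bytes, reads_per_second, strongly_consistent=True):
    """Estimate read capacity units for items of a given size and read rate."""
    units_per_read = max(1, math.ceil(item_bytes / READ_UNIT_BYTES))
    total = units_per_read * reads_per_second
    if not strongly_consistent:
        total = total / 2  # eventually consistent reads cost half as much
    return math.ceil(total)


def write_capacity_units(item_bytes, writes_per_second):
    """Estimate write capacity units for items of a given size and write rate."""
    units_per_write = max(1, math.ceil(item_bytes / WRITE_UNIT_BYTES))
    return units_per_write * writes_per_second


# Hypothetical workload: 6 KB items, 100 reads and 50 writes per second.
print(read_capacity_units(6 * 1024, 100))                             # 200
print(read_capacity_units(6 * 1024, 100, strongly_consistent=False))  # 100
print(write_capacity_units(6 * 1024, 50))                             # 300
```

Because item sizes are rounded up to the next unit boundary, a small change in item size can have a large effect on the capacity you need.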
Another small matter to consider is partitioning. As the number of database requests and the volume of data stored start to increase, your table will get partitioned. Provisioned throughput is divided equally among the partitions of your table. DynamoDB partitions your table automatically, and the number of partitions is not disclosed to the user. This can put a lot of doubt in the mind of the user, so to be on the safe side, people generally tend to over-provision.
The Hot Partition Issue
The throttled requests metric starts to climb when consumed read/write requests go beyond the provisioned limit. However, sometimes your application's requests can get throttled even when you are well within the provisioned limit. Looking at the graph above, you can see that the application's write activity was well within the write capacity specified. But when you look at the throttled write request graph, there was a spike during the same time period. Why? Looking back at the graph, we can see that the provisioned write capacity was set to 1,000 units. Imagine that there are 10 partitions, so each partition gets only 100 write capacity units. That per-partition limit is much closer to the average consumed write value than the table-wide limit is, so if writes concentrate on a single partition, that partition can be throttled even though the table as a whole is under its limit. This could be why we're seeing a rise in the throttled request metric.
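The arithmetic behind this scenario can be sketched in a few lines. The 1,000 provisioned units and 10 partitions come from the example above. The per-partition consumption figures are invented for illustration, and the equal split is the simplified model described in this post.

```python
def per_partition_share(provisioned_units, partition_count):
    """Capacity each partition gets when throughput is split equally."""
    return provisioned_units / partition_count


def throttled_partitions(consumed_by_partition, provisioned_units):
    """Indexes of partitions whose consumption exceeds their equal share."""
    share = per_partition_share(provisioned_units, len(consumed_by_partition))
    return [
        index
        for index, consumed in enumerate(consumed_by_partition)
        if consumed > share
    ]


# Hypothetical load: nine quiet partitions and one hot partition.
consumed = [30] * 9 + [130]
print(sum(consumed))                           # 400, well under 1,000
print(per_partition_share(1000, 10))           # 100.0
print(throttled_partitions(consumed, 1000))    # [9]
```

The table-wide total is only 40% of the provisioned capacity, yet the last partition is over its share and would see throttled writes.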
Monitor Benchmark Throughput Requirements
How do you know whether the throughput you have set up is exceedingly high or downright low? Monitoring the consumed read/write capacity units metric will tell you how much of the provisioned throughput is actually in use. Armed with this data, you can analyze your application's throughput pattern to see if it's favoring reads or writes. You can also compare throughput usage against the provisioned limit to check whether your read/write operations are well within it or dangerously close to exceeding it.
Set up thresholds for the consumed read/write capacity units metric and get alerted before your database requests start to get throttled.
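One way to set up such a threshold is a CloudWatch alarm on consumed capacity. The sketch below only builds the alarm parameters and makes no AWS calls. The table name "orders", the 80% threshold, and the five evaluation periods are assumptions chosen for illustration. Because CloudWatch reports consumed capacity as a sum over each period, the per-second provisioned value is multiplied by the period length.

```python
def consumed_capacity_alarm(table_name, metric_name, provisioned_units,
                            threshold_percent=80, period_seconds=60):
    """Build CloudWatch alarm parameters for a consumed capacity metric.

    metric_name is expected to be "ConsumedReadCapacityUnits" or
    "ConsumedWriteCapacityUnits".
    """
    # Units consumed per period at the chosen percentage of provisioned capacity.
    threshold = provisioned_units * period_seconds * threshold_percent / 100
    return {
        "AlarmName": f"{table_name}-{metric_name}-high",
        "Namespace": "AWS/DynamoDB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_seconds,
        "EvaluationPeriods": 5,  # assumption: alert after five periods in a row
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Hypothetical table "orders" with 1,000 provisioned write capacity units.
params = consumed_capacity_alarm("orders", "ConsumedWriteCapacityUnits", 1000)
print(params["Threshold"])  # 48000.0 units per 60-second period
```

The keys mirror the parameters of the CloudWatch PutMetricAlarm API, so if you use boto3 (an assumption, since this post doesn't prescribe a client), the dictionary could be passed as boto3.client("cloudwatch").put_metric_alarm(**params).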
Successful Request Latency
Latency is the time it takes for a database operation to get processed. The CloudWatch metric SuccessfulRequestLatency reports latencies for read/write operations such as GetItem, PutItem, and Scan. However, this metric is collected from the DynamoDB service side, meaning that any delay due to client-side application errors or network flapping issues is not taken into account.
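Since the service-side metric leaves out client and network delays, it can help to time requests from the application as well and compare the two numbers. Here is a minimal sketch. The helper names are made up, and the operation being timed is a stand-in for whatever DynamoDB call your application makes.

```python
import time


def timed_call(operation, *args, **kwargs):
    """Run an operation and return its result with the elapsed time in ms."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def client_side_overhead_ms(client_elapsed_ms, service_latency_ms):
    """Time spent outside the service: network, serialization, retries."""
    return max(0.0, client_elapsed_ms - service_latency_ms)


# Stand-in operation; in practice this would be your DynamoDB request.
result, elapsed_ms = timed_call(lambda value: value * 2, 21)
print(result)                               # 42
print(client_side_overhead_ms(12.5, 4.0))   # 8.5
```

A large gap between the client-side timing and the service-side metric points to the network or the application rather than to DynamoDB itself.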
DynamoDB Is Optimized for Highly Concurrent, Throughput-Intensive Workloads
In a distributed database system like DynamoDB, each database table is backed by an EC2 server instance. This instance provides the necessary computational resources needed to service the incoming read/write requests. This practice is done to ensure low latency. Even if you start adding more data, DynamoDB will still maintain a low latency response by creating partitions and equally distributing data. For each partition created, DynamoDB will spin up another instance to handle the requests.
However, in spite of all this, if database access is not uniform, then low latency can't be guaranteed. For example, if your application is making frequent requests for the same item using the same hash key, then only a small subset of the table's partitions will get accessed, leaving the other partitions untouched. The instance behind the hot partition will start to get taxed, leading to more and more queued requests and negatively impacting overall performance.
Optimize your table design, use a wide range of hash keys, and make your workload more uniform.
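A quick way to check how uniform your workload is would be to look at how requests are spread across hash keys. The sketch below assumes you can collect a sample of the keys your application accessed (from application logs, for example). The function name and the sample keys are hypothetical.

```python
from collections import Counter


def hottest_keys(accessed_keys, top_n=3):
    """Return the most requested hash keys with their share of all requests."""
    counts = Counter(accessed_keys)
    total = sum(counts.values())
    return [(key, count / total) for key, count in counts.most_common(top_n)]


# Hypothetical sample: one key receives 8 out of 10 requests.
sample = ["user#1"] * 8 + ["user#2", "user#3"]
print(hottest_keys(sample, top_n=1))   # [('user#1', 0.8)]
```

If a single key accounts for most of the traffic, as it does here, that is a sign the key design needs a wider range of values.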
Optimizing DynamoDB usage and performance will depend a lot on your application's throughput pattern, which needs to be profiled and fine-tuned with the help of a monitoring tool. If you are interested in learning more about other important DynamoDB metrics, keep an eye out for part two of this series.