Amazon’s Relational Database Service (RDS) is one of the most popular database services in the world, used by 47% of companies on AWS according to 2nd Watch’s 2015 AWS Scorecard. Although the media buzz tends to focus on the newer Big Data and NoSQL data stores, relational databases continue to be the workhorse for many websites and applications. Companies like Qubole and Zenefits rely on relational databases like Postgres and MySQL to scale to millions of daily active users. Amazon Web Services (AWS) has made it easier to scale to thousands of requests per second and terabytes of data under management. However, monitoring all the metrics for a large and complex database remains a significant challenge. Here are the top things you should monitor, each of which can affect availability, performance, and cost.
Availability

You need to know quickly when your database suffers an outage, elevated latency, or even a downward trend in availability. Any of these can lead to application performance problems and, ultimately, data loss.
If you have a distributed database, you should know whether each instance, and the cluster as a whole, is healthy. While Amazon can track the availability of each instance, a better health indicator tells you whether instances are responding to requests within an expected time and without errors. Realistically, if your queries take longer than they should, you’re not getting full value out of the database. More importantly, from a product perspective, high query latency is not only an indication of possible poor database health but also of a bad end-user experience in your application. So it’s important to think of availability metrics in the context of higher-level strategic policies as well.
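To make the idea concrete, here is a minimal health-probe sketch in Python. The `check_health` function and its latency budget are hypothetical names of ours, not an AWS or RDS API; `run_query` stands in for whatever database driver wrapper you use and is assumed to raise an exception on error:

```python
import time


def check_health(run_query, latency_budget_ms=200):
    """Run a trivial probe query and compare its latency against a budget.

    run_query: any callable that executes the given SQL against the
    database and raises an exception on failure.
    """
    start = time.monotonic()
    try:
        run_query("SELECT 1")
    except Exception as exc:
        # An error means the instance is unhealthy regardless of timing.
        return {"healthy": False, "latency_ms": None, "error": str(exc)}
    latency_ms = (time.monotonic() - start) * 1000
    # Slow-but-successful responses still count against health.
    return {"healthy": latency_ms <= latency_budget_ms,
            "latency_ms": latency_ms, "error": None}
```

A monitoring agent could run this probe against every instance on a schedule and alert when `healthy` flips to false, catching both hard errors and latency regressions with one check.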
Databases can become unavailable or unhealthy for a variety of reasons. All RDS databases include weekly maintenance windows for patches and security updates. While most maintenance completes with minimal impact on performance, some updates may require a Multi-AZ failover. Even then, it’s possible that an upgrade could cause an unexpected outage or performance issue in your application. You can find details about past upgrades in the logs in the RDS console.
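If you want to check programmatically what maintenance is queued for your instances, the RDS API exposes `DescribePendingMaintenanceActions`. Below is a rough sketch using boto3; the helper names are ours, and the code assumes AWS credentials are already configured in your environment:

```python
def flatten_actions(resp):
    """Flatten the nested API payload into (resource ARN, action) pairs."""
    return [(item["ResourceIdentifier"], detail["Action"])
            for item in resp.get("PendingMaintenanceActions", [])
            for detail in item.get("PendingMaintenanceActionDetails", [])]


def pending_maintenance(region="us-east-1"):
    """List RDS resources with maintenance actions queued for a window."""
    import boto3  # lazy import; assumes credentials are configured
    rds = boto3.client("rds", region_name=region)
    return flatten_actions(rds.describe_pending_maintenance_actions())
```

Feeding the result into your alerting system lets you know ahead of time which instances will be touched in the next maintenance window, rather than discovering it afterward in the console.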
Resource Usage and System Errors
Databases can also become unavailable due to hitting limits on resource usage, including the table size limit in MySQL, the max number of database connections allowed per instance, and capacity constraints during periods of peak demand. You should also check your database error log in the RDS console for clues about errors that are internal to the database.
A database could also go offline due to an accidental change by one of your team members. You can take a look at the CloudTrail log to audit changes made by your team using the console or API calls. A misconfigured schema or an out-of-sync cluster could also make the database appear unavailable. Watching your error rate after making schema or cluster changes can highlight this kind of issue.
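As a sketch of such an audit, CloudTrail’s `LookupEvents` API can be filtered down to RDS activity. The `changes_by_user` summary helper below is hypothetical, and the code assumes boto3 with configured AWS credentials:

```python
from collections import Counter


def changes_by_user(events):
    """Count change events per user (pure helper, easy to test)."""
    return Counter(e.get("Username", "unknown") for e in events)


def recent_rds_changes(hours=24, region="us-east-1"):
    """Pull recent RDS API activity from CloudTrail for auditing."""
    from datetime import datetime, timedelta, timezone
    import boto3  # lazy import; assumes credentials are configured
    ct = boto3.client("cloudtrail", region_name=region)
    resp = ct.lookup_events(
        LookupAttributes=[{"AttributeKey": "EventSource",
                           "AttributeValue": "rds.amazonaws.com"}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=hours),
    )
    return resp["Events"]
```

When the error rate jumps right after a schema or cluster change, `changes_by_user(recent_rds_changes())` gives you a quick list of who changed what, and when.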
Performance

Every application hits some kind of performance limit as it scales. It’s important to know what that limit is so you can monitor it and optimize around it. All of these performance improvements depend on a strong monitoring system to identify the bottlenecks worth optimizing. You need to watch each of these limits continuously because, even if you stay below them on average, you can exceed them during bursts or periods of peak demand.
If you have a larger database, watch out for limits that could impact performance. These include limits imposed by RDS as a whole as well as limits specific to each database engine, such as MySQL. For example, RDS caps the maximum number of database connections allowed per instance and the maximum table size.
To optimize performance, consider whether your database is heavier on reads (select) or writes (insert, update, delete). If it’s read-heavy, you might benefit from indexes, read replicas, materialized views, or upstream caches. If it’s write-heavy, you may benefit from removing slow indexes, table locking, and foreign key constraints. Adding shards or partitions and using asynchronous replication can also improve performance.
Standard system metrics include CPU, memory, and disk usage. It’s relatively easy to see if you are hitting the limit for these resources because the AWS Console shows them alongside the maximum threshold.
If you are hitting a CPU limit on a shared instance, you might want to check out your CPU credit balance as shown in the CloudWatch dashboard below. Even if you are using less than the full amount of disk storage space, you might be hitting a ceiling on I/O operations per second (IOPS) or volume throughput. If you’re using too many IOPS, you might want to consider adding more RAM so fewer disk reads are needed. Additionally, each instance type has a different amount of network throughput. It’s easy to overlook dependencies between resources. For example, using Elastic Block Storage (EBS) volumes can also impact your network usage.
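You can also pull the `CPUCreditBalance` metric for a burstable instance yourself via the CloudWatch API. In the sketch below, the `low_credit` threshold check is an illustrative assumption of ours, not an AWS recommendation, and the code assumes boto3 with configured credentials:

```python
def low_credit(datapoints, floor=20.0):
    """Flag when the most recent credit-balance datapoint drops below a floor."""
    if not datapoints:
        return False
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    return latest["Average"] < floor


def cpu_credit_balance(instance_id, region="us-east-1", minutes=60):
    """Fetch recent CPUCreditBalance datapoints for a burstable RDS instance."""
    from datetime import datetime, timedelta, timezone
    import boto3  # lazy import; assumes credentials are configured
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    return resp["Datapoints"]
```

Alerting on `low_credit(...)` warns you before the credit balance hits zero and the instance is throttled to its baseline CPU.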
Slow Query Log
Additionally, a good way to see which of your statements takes the most time in production is to look at your slow query log. This log is not kept by default, so you will need to explicitly turn it on. It’s often kept in a separate database table, and Amazon keeps it to below 2% of your total database size. Once you know which statements take the most time, you can examine the query plan the database uses to execute each statement and try to optimize it. For example, if you find a statement doing a full table scan, it might be improved by adding an index. Integer keys are often faster to look up than strings, especially when string operations or wildcards are involved.
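As an illustration: when MySQL on RDS is configured with `slow_query_log = 1` and `log_output = TABLE` in its parameter group, the log lands in the `mysql.slow_log` table. The sketch below assumes the `pymysql` driver and uses hypothetical helper names of ours:

```python
def slowest(rows, top_n=5):
    """Rank slow-log rows of (seconds, sql_text) by duration, longest first."""
    return sorted(rows, key=lambda r: r[0], reverse=True)[:top_n]


def fetch_slow_queries(conn_kwargs, top_n=5):
    """Read recent entries from mysql.slow_log on an RDS MySQL instance."""
    import pymysql  # lazy import; any DB-API driver would work similarly
    conn = pymysql.connect(**conn_kwargs)
    try:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT TIME_TO_SEC(query_time), sql_text "
                "FROM mysql.slow_log ORDER BY start_time DESC LIMIT 500")
            return slowest(cur.fetchall(), top_n)
    finally:
        conn.close()
```

Once a culprit surfaces, run `EXPLAIN` on the statement to see whether a full table scan could become an index lookup.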
Cost

Taking advantage of Amazon’s elasticity when scaling up is an easy short-term solution. Over the long term, however, managing your costs requires smarter approaches. A good monitoring and alerting system can help your team manage costs and proactively prevent or address issues.
Small systems can be scaled vertically by switching to larger instances, but you’ll eventually reach a point where you’re running the largest instance and paying a hefty price. You can save money by choosing lower-cost Amazon instances and scaling automatically based on need. If you have a steady baseline capacity need, you can purchase reserved instances at a lower cost than on-demand instances. Furthermore, you can save a lot of money by taking advantage of off-hours pricing for spot instances, such as by running nightly batch jobs on read replicas.
Scaling horizontally by adding read replicas or shards is another good way to grow. However, you’ll hit limits with queries that can’t be confined to a single shard because they join across tables, and with delayed replication between nodes. Large clusters also take more operational effort to maintain because the systems are more complex, requiring logic to route queries to read replicas and proxies that manage load and add fault tolerance. Delayed replication can also lead to sacrifices in user experience.
You don’t want to wake up your operations team every night to run batch jobs, or keep them on call 24×7 for every spike in traffic. You can configure scripts to automatically scale your infrastructure to match demand. For example, if you hit a CPU threshold as monitored in CloudWatch, a script can add read replicas and update your proxies with the new node. If you use Amazon Aurora, which can spin up read replicas quickly and typically keeps replica lag in the tens of milliseconds, you can closely follow your demand curve and avoid paying for excess capacity.
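A bare-bones version of such a script might pair a sustained-threshold check with the RDS `CreateDBInstanceReadReplica` call. The decision rule below is an assumption of ours for illustration; the boto3 call is real but simplified, and production code would also wait for the replica to become available and register it with your proxy layer:

```python
def should_scale_out(cpu_samples, threshold=75.0, sustained=3):
    """Scale out only when the last `sustained` samples all exceed the threshold.

    Requiring several consecutive high samples avoids reacting to a
    single momentary spike.
    """
    recent = cpu_samples[-sustained:]
    return len(recent) == sustained and all(s > threshold for s in recent)


def add_read_replica(source_id, replica_id, region="us-east-1"):
    """Create a new RDS read replica of an existing instance."""
    import boto3  # lazy import; assumes credentials are configured
    rds = boto3.client("rds", region_name=region)
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_id,
    )
```

The sustained-threshold rule is deliberately conservative; since adding a replica costs real money, you want the trigger to reflect genuine sustained load, not noise.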
Taking advantage of auto-scaling requires trust and intelligence in your monitoring solution because real money is on the line. You don’t want a bug in your code or some kind of security incident to cause runaway spending. It’s not good enough to find out about this on your monthly bill.
Challenges of Monitoring
Amazon RDS makes managing databases much easier, but databases remain a challenge to monitor and optimize. As we discussed above, Amazon makes a variety of data available through CloudWatch metrics, CloudWatch Logs, CloudTrail, and various log tables. That’s not even counting metrics from on-prem systems or other applications. Overall, it’s not easy to keep an eye on all these sources of infrastructure data, or to get a crucial service-wide view of performance trends. Aggregating data from your cloud infrastructure and the other services in your environment into one place for easy analysis is not possible with CloudWatch alone, but it is absolutely critical to operating a distributed architecture without disrupting the end-user experience.
Furthermore, the analytics capabilities of Amazon’s services vary widely, and many offer no visualization or aggregated analytics at all. For example, CloudWatch lacks calculated fields, which are helpful when looking for patterns, calculating how much of a resource you have left (in days, weeks, etc.), or watching for anomalies.
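As an example of a calculated field CloudWatch can’t express directly, the helper below (entirely hypothetical, with names of our choosing) fits a straight line through recent `FreeStorageSpace` samples to estimate how many days of storage runway remain:

```python
def days_until_full(samples):
    """Estimate days of storage runway from (day, free_bytes) samples.

    Fits a least-squares line through the samples and projects when it
    crosses zero. Returns None if free space is flat or growing, since
    there is no projected exhaustion date.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    denom = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / denom
    if slope >= 0:
        return None  # free space is not shrinking
    intercept = mean_y - slope * mean_x
    # Zero crossing of the fitted line, measured from the latest sample.
    return -intercept / slope - samples[-1][0]
```

Fed with a week of daily `FreeStorageSpace` averages, this turns a raw metric into the question operators actually care about: how long until we run out?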
To manage and analyze log data, you might prefer to use the Elastic Stack or a third-party log management service. To monitor metrics, SignalFx will aggregate data from CloudWatch metrics with metrics from other AWS services, open-source middleware, your custom app components, and on-prem systems. It also includes more advanced features like calculated fields and outlier detection.
Come back tomorrow for part two of this blog series, where you’ll learn how SignalFx aggregates Amazon RDS metrics as they stream for the most relevant view of service availability and threats. We give you a powerful dashboard right out of the box and built-in detectors so that you can proactively monitor the metrics that matter to performance without the guesswork or painful trial-and-error.