A staggering number of businesses build their core applications around Amazon Web Services' cloud offerings, such as Elastic Compute Cloud (EC2). In 2015, Amazon reported an EC2 growth rate of 95%. EC2 has been used for everything from creating an on-demand supercomputing cluster for pharma research to reinforcing the backend of the music-recognition app Shazam.
EC2 is extremely useful for business IT teams looking to assemble massive computing power without purchasing physical infrastructure, and for those that must prepare for regular spikes in customer demand. That being said, EC2 is a tool like any other. It's prone to being misused or misconfigured, and if that occurs, your business could be left with nowhere to scale. Here is how to avoid the most common EC2 errors before your application starts to crash.
1. Confusing Storage Latency
Amazon EC2's storage volumes (called Elastic Block Store, or EBS) come in two flavors. Standard volumes serve data at roughly the same rate as a standard desktop hard drive, but Provisioned volumes are designed to serve data much faster. If you need throughput of up to 4,000 IOPS (input/output operations per second), Provisioned volumes can get that for you, assuming you jump through several hoops.
For example, the instance type in use must be able to support the number of IOPS you need. The block size must be 16 KB or less, and each block must have been accessed at least once. Your volumes will also slow down when a backup is being performed. Lastly, Amazon EC2 apparently doesn't include a tool that will give you real-time IOPS data from a given volume (although this can be calculated using third-party monitoring applications).
This last restriction can lead to some difficulties. It's possible for the following to happen:
- IOPS increases without administrator knowledge until it hits the threshold for a volume.
- Operations subsequently begin queueing up.
- The entire application is then rate-limited by the EBS volume.
- The application fails entirely.
Fixing this issue means finding other ways to track IOPS. The VolumeQueueLength metric in CloudWatch tracks the number of pending I/O requests for a volume. If your EBS volumes are slowing down and the associated VolumeQueueLength is high, the volume may be hitting its IOPS limit.
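The check described above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring tool: it assumes you have already pulled per-period sums of the CloudWatch metrics VolumeReadOps and VolumeWriteOps and the average VolumeQueueLength for a volume. The VolumeSample container, the 90% ratio, and the queue-length cutoff of 1.0 are hypothetical choices made for this example, and you would tune them for your own workload.

```python
from dataclasses import dataclass


@dataclass
class VolumeSample:
    """One monitoring period for a single EBS volume (hypothetical container)."""
    read_ops: float        # sum of VolumeReadOps over the period
    write_ops: float       # sum of VolumeWriteOps over the period
    queue_length: float    # average VolumeQueueLength over the period
    period_seconds: float  # length of the period in seconds, e.g. 300


def estimated_iops(sample: VolumeSample) -> float:
    """Average I/O operations per second over the period."""
    return (sample.read_ops + sample.write_ops) / sample.period_seconds


def is_saturated(sample: VolumeSample, provisioned_iops: float,
                 iops_ratio: float = 0.9, max_queue: float = 1.0) -> bool:
    """Flag a volume that is near its IOPS limit AND has requests queueing.

    iops_ratio and max_queue are assumed thresholds, not AWS recommendations.
    """
    near_limit = estimated_iops(sample) >= iops_ratio * provisioned_iops
    backed_up = sample.queue_length > max_queue
    return near_limit and backed_up


# Example: a 5-minute period on a volume provisioned for 4,000 IOPS.
sample = VolumeSample(read_ops=1_000_000, write_ops=140_000,
                      queue_length=12.5, period_seconds=300)
print(estimated_iops(sample))      # 3800.0
print(is_saturated(sample, 4000))  # True
```

Requiring both conditions keeps the check from firing on a volume that is merely busy but keeping up, or on one with a brief queue spike well below its limit.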
2. Idle EC2 Instances
Amazon can now charge by the second for use of an EC2 instance. For some enterprises (for example, those that need less than two minutes of flexible capacity at a time), this represents a good deal. For others, it might represent a slow trickle of wasted money. Leftover test environments, underperforming applications, and other use cases lend themselves to underutilized EC2 instances. How do you track them down?
Metrics are one way to sort out this problem, but they're a moving target. Different applications utilize different amounts of CPU, for example, so you can't simply look at CPU utilization to see where the underperformers are. You need to weed out instances in which all usage metrics are low, where "low" means "lower than the baseline utilization of all your EC2 instances." Instances with usage metrics that are low across the board can be spun down or consolidated.
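As a rough sketch of that filter, the function below computes a fleet-wide baseline for each metric and returns only the instances that fall below it on every metric. It assumes you have already exported average usage per instance into a dictionary. The instance IDs and metric names are made up for illustration, and using the fleet mean as the "baseline" is one possible reading of the rule above, not the only one.

```python
from statistics import mean


def find_idle_instances(usage, fraction=1.0):
    """Return IDs of instances that are below baseline on ALL metrics.

    usage: {instance_id: {metric_name: average_value}}; every instance is
           assumed to report the same set of metrics.
    fraction: how far below the fleet mean counts as "low" (1.0 = below mean).
    """
    if not usage:
        return []
    metric_names = list(next(iter(usage.values())).keys())
    # Baseline for each metric = mean of that metric across the whole fleet.
    baseline = {m: mean(inst[m] for inst in usage.values())
                for m in metric_names}
    idle = [
        instance_id
        for instance_id, metrics in usage.items()
        if all(metrics[m] < fraction * baseline[m] for m in metric_names)
    ]
    return sorted(idle)


# Hypothetical fleet: i-batch has low CPU but heavy network traffic,
# so only i-oldtest is low across the board.
fleet = {
    "i-web":     {"cpu_percent": 60, "network_kb_s": 800, "disk_ops_s": 500},
    "i-db":      {"cpu_percent": 45, "network_kb_s": 300, "disk_ops_s": 900},
    "i-batch":   {"cpu_percent": 5,  "network_kb_s": 700, "disk_ops_s": 50},
    "i-oldtest": {"cpu_percent": 2,  "network_kb_s": 10,  "disk_ops_s": 5},
}
print(find_idle_instances(fleet))  # ['i-oldtest']
```

The `all(...)` condition is the important part: an instance with low CPU but busy network or disk is doing real work and should not be flagged.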
3. EC2 Memory Leaks
It's possible for EC2 instances to run out of memory. Because some EC2 instances do not include swap volumes, the operating system has nowhere to move data when applications allocate memory faster than they free it. This tends to occur at times of especially heavy application usage, so it's possible for an EC2 instance to freeze up right as it's undergoing peak workload, more or less completely defeating the purpose of purchasing EC2 instances in the first place.
As is the case with IOPS, it can be difficult to see under EC2's hood in order to figure out which applications are using too much memory. While the instance's operating system will automatically kill memory-hogging processes (on Linux, through the kernel's out-of-memory killer), it's better practice to not have to kill these processes in the first place. It's also possible to add more EC2 instances in order to function as a swap volume, or as failover in case a vital process is killed. Adding these instances is expensive, however.
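One inexpensive way to get some visibility before processes start getting killed is to check memory headroom on the instance itself. The sketch below assumes a Linux instance and parses text in the format of /proc/meminfo, using its MemTotal, MemAvailable, and SwapTotal fields. The 10% threshold and the warning strings are arbitrary choices for illustration, and the check reports host-level pressure only, not which application is responsible.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field_name: kilobytes} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_warnings(meminfo, min_available_fraction=0.10):
    """Return a list of warnings; the 10% default is an assumed threshold."""
    warnings = []
    if meminfo.get("SwapTotal", 0) == 0:
        warnings.append("no swap configured")
    total = meminfo.get("MemTotal", 0)
    available = meminfo.get("MemAvailable", 0)
    if total and available / total < min_available_fraction:
        warnings.append("available memory below threshold")
    return warnings


# Sample input in /proc/meminfo format (values invented for the example).
sample_text = """MemTotal:        8000000 kB
MemAvailable:     400000 kB
SwapTotal:             0 kB
"""
print(memory_warnings(parse_meminfo(sample_text)))
# ['no swap configured', 'available memory below threshold']
```

On a real instance you would pass in the contents of /proc/meminfo instead of the sample text, and run the check on a schedule so that a warning arrives before peak load rather than during it.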