Originally authored by Ming Lee
Cloud computing has now entered most executives' common vernacular, and one misconception comes through again and again: the cloud is cheap, right? In some circumstances it is even free (check out the Amazon Web Services Free Tier), so one would be a fool not to consider it as a viable deployment option. However, the reality can be quite different, and some perceptions of the cost of cloud computing need to be adjusted to the reality of running live, customer-paying websites, applications, and services, complete with high-availability options and failover. In this article I'm going to look at some of the ways your cloud solution can accidentally bleed money, and what you can do about it.
The advantages of cloud computing are well known: a pay-for-what-you-use subscription model, elasticity of supply to match demand, and little or no hardware maintenance. So it was a surprise to some when I presented a quarterly report on the cost of running our cloud infrastructure. Now, the amount was high only relative to expectations – the quarterly cost was still only 75% of the cost of the hardware alone for a typical project.
However, I was asked to explain the main areas of cost to the stakeholders.
As per my previous post, most of the expense when running in the cloud is determined by your architecture: how many servers (instances) are you running and for how long? In most cases, you will be running something 24/7. That suggests calculating the cost should be quite easy, right? I was pressed on this point – did I get the initial architecture wrong?
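Architecture-driven costs really do reduce to a simple sum: rate times hours times instances. As a minimal sketch (the $0.32/hour rate, the fleet, and the 30-day month are all illustrative numbers, not real AWS prices):

```python
def monthly_cost(fleet):
    """Rough monthly spend for a fleet of (hourly_rate, hours_per_day) pairs.

    The rates and the flat 30-day month are illustrative assumptions,
    not actual AWS pricing.
    """
    return sum(rate * hours_per_day * 30 for rate, hours_per_day in fleet)

# Hypothetical fleet: two live servers running 24/7 plus a staging box
# that is only needed 10 hours a day
fleet = [(0.32, 24), (0.32, 24), (0.32, 10)]
print(round(monthly_cost(fleet), 2))  # 556.8
```

The point of writing it down like this is that every surprise on the bill shows up as a term you forgot to include in `fleet`.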
Checking the current system against the submitted plans, there were no major changes. We have live servers, staging servers, enterprise load balancers, elastic IPs, and other bits of AWS EC2-related paraphernalia. All was in order, yet the cost, broken down per month, was consistently higher than expected. Clearly, there was something that I had either missed or underestimated.
When I first submitted the cost estimate, I added a contingency of an extra 15% of the total, just in case. It was just enough, but I was personally surprised by how close I came to going over. Those in charge of the budgets were even more surprised.
After some investigation I discovered the culprits and drew some valuable lessons from them. Now, I’d like to share some hard-won insight into controlling the cost of using AWS.
Truly be elastic – shut down unwanted instances
We made the mistake, which I think many have made and many will, of merely transferring physical hardware into the cloud. We continued to treat the instances like our own hardware and kept them running 24/7 even when they were not being used, partly because our monitoring and logging systems were all designed around continuous 24/7 monitoring. In reality, outside office hours some of the services had no customers at all. They should have been switched off, either manually or via a batch script, and restarted in time for the first customers.
Lesson: Think elastic; shut down all un-used instances; re-architect your design to accommodate this.
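The schedule itself is the easy part. Here is a minimal sketch of an office-hours policy check – the window, the weekday rule, and the idea of padding the start time are all assumptions for illustration; a cron job or scheduled task would call something like this and then stop or start instances via the EC2 command line tools or an API library:

```python
def should_run(hour, weekday, office_hours=(8, 19), weekdays_only=True):
    """Decide whether an office-hours-only instance should be up right now.

    hour is 0-23, weekday is 0 (Monday) through 6 (Sunday). The window
    starts an hour before the office opens so servers are warm for the
    first customers; the exact hours are a hypothetical policy.
    """
    if weekdays_only and weekday >= 5:
        return False
    start, stop = office_hours
    return start <= hour < stop

print(should_run(10, 2))   # True: mid-morning on a Wednesday
print(should_run(3, 2))    # False: 03:00, nobody is using the service
print(should_run(10, 6))   # False: Sunday
```

Re-architecting for this mostly means making sure nothing on the instance assumes it has been up continuously – logs shipped off-box, state in a database, monitoring that tolerates planned downtime.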
It’s easy to spin up new instances, almost too easy
Related to the above point, it's easy to spin up a new instance for testing or out of sheer curiosity, and it's just as simple to tear one down and launch a replacement from an AMI. The problem is that each time a new instance is launched, it incurs a full hour's charge even if it only runs for a few minutes.
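That rounding-up is what makes rapid spin-up/tear-down cycles expensive. A small sketch of the arithmetic (EC2 billed by the instance-hour at the time; the $0.32 rate is an illustrative figure):

```python
import math

def billed_cost(minutes, hourly_rate):
    """Per-hour billing: every separate launch is rounded up to a whole hour."""
    return math.ceil(minutes / 60) * hourly_rate

# Ten separate five-minute experiments on a $0.32/hour instance
# bill as ten full hours...
print(round(10 * billed_cost(5, 0.32), 2))   # 3.2
# ...even though the same fifty minutes in one session fits in one hour
print(billed_cost(50, 0.32))                 # 0.32
```

Ten short experiments cost ten times what one longer session would – which is exactly the pattern a troubleshooting loop of launch, test, tear down produces.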
We had some issues with applications not working properly due to misconfiguration, so a lot of troubleshooting was performed, with each major step saved as a snapshot to enable roll-back. Roll-backs were frequent as various troubleshooting avenues were explored and rejected. The fixes got done, but at times we were effectively paying for two to three times the usual number of instance-hours.
Lesson: A subscription service isn't efficient for troubleshooting; you are better off working on local resources or local virtual machines. Understand the 'rhythm' of your environment. The elasticity people speak of is only worthwhile when handling usage peaks in live, production systems where extra use means extra revenue. For testing, research and development, and other related functions, consider using local resources.
Being human and forgetful
There were a few occasions when Quadruple Extra Large instances ($2.44 per hour) were accidentally left running idle over a long weekend. On another occasion someone spun up a new instance in a different region, forgot to switch it off, and only remembered a couple of weeks later.
Accidents happen but it all builds up.
Lesson: Institute some sort of policy that controls and monitors which instances are running, which regions they are in, and who owns them. Use tags to label the instances properly (for example: 'test – can be deleted after 01/01/2012') to aid in managing them.
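Such a policy can be enforced with a trivial audit script. A minimal sketch – the `Owner`/`Expires` tag names and the dictionary shape are hypothetical; in practice you would build this list from the EC2 describe-instances output:

```python
# Hypothetical policy: every instance must say who owns it and when it expires
REQUIRED_TAGS = {"Owner", "Expires"}

def missing_tags(instances):
    """Return the ids of instances lacking any required tag."""
    return [i["id"] for i in instances if REQUIRED_TAGS - set(i.get("tags", {}))]

fleet = [
    {"id": "i-1234", "tags": {"Owner": "ops", "Expires": "01/01/2012"}},
    {"id": "i-5678", "tags": {"Owner": "dev"}},   # no expiry date
    {"id": "i-9abc", "tags": {}},                 # anonymous test box
]
print(missing_tags(fleet))  # ['i-5678', 'i-9abc']
```

Run on a schedule and mailed to the team, a list like this turns "whose instance is that?" from an investigation into a lookup.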
Use the new medium instances wherever possible
Recently, AWS added the medium instance type to the main product line; previously it was only available as an option via the EC2 command line tools. The medium instance has very good specifications, and we are actively looking at its performance metrics. Already, our 'test' and 'staging' environments run as medium instances.
Lesson: Consider using the micro or medium instance types instead of the default large instance.
Use Linux where you can
A comparable Linux instance usually costs around 50% less per hour than its Windows counterpart, so definitely look at porting what code and applications you can to open source alternatives. Websites, forums, and e-business backend applications have hundreds of open source options that run happily under Linux. AWS has almost all of the main flavours of Linux available, including Ubuntu, SUSE, Red Hat, Debian, and CentOS, as well as Amazon's own Linux offering. Recently, a core part of our application stack was significantly re-engineered so that it works very happily under Red Hat and SUSE Linux.
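Because the discount applies to every instance-hour, the saving compounds quickly across a 24/7 fleet. A back-of-the-envelope sketch (the $0.48/hour Windows rate and the four-instance fleet are hypothetical figures):

```python
def annual_saving(windows_rate, fleet_size, linux_discount=0.5):
    """Yearly saving from moving a 24/7 fleet to Linux pricing.

    All rates here are hypothetical; the 50% discount is the rough
    Linux-vs-Windows gap discussed above.
    """
    return windows_rate * linux_discount * fleet_size * 24 * 365

print(round(annual_saving(0.48, 4)))  # 8410
```

Even a small always-on fleet makes the porting effort easy to justify.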
Identify and eliminate ‘dead-time’
Any process that requires human interaction can be plagued by delays, with each stage waiting for input. A process that finishes at 02:00 and then waits for a human to click the 'next' button at 09:00 when the office opens is not very efficient: you have now paid for seven hours of 'dead-time'.
Lesson: Automating processes through scripts and programs is important. This includes well-known options such as Windows Task Scheduler and Python scripts driving the EC2 command line tools and AWS Elastic Beanstalk. Automation should also be able to shut down instances when a process is completed.
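Putting a price on dead-time makes the case for automation concrete. A minimal sketch using the Quadruple Extra Large rate mentioned earlier:

```python
def dead_time_cost(finish_hour, resume_hour, hourly_rate, instances=1):
    """Money spent while a finished job waits for a human.

    Hours are 0-23; the modulo handles jobs that finish before midnight
    and resume the next morning.
    """
    idle_hours = (resume_hour - finish_hour) % 24
    return idle_hours * hourly_rate * instances

# A job ending at 02:00 that waits for a 09:00 click, on a
# $2.44/hour Quadruple Extra Large instance
print(round(dead_time_cost(2, 9, 2.44), 2))  # 17.08
```

Seventeen dollars per overnight pause, every night, on one instance – a shutdown hook at the end of the job recovers all of it.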
Actively manage the AWS environment
Over time, snapshots and copies of Elastic Block Store (EBS) volumes will litter the dashboard. Everything there costs money, and a big factor is orphaned disk volumes, especially those belonging to instances that are unused or have already been terminated: they accrue charges even though nothing can use them. Because it is so easy to spin up an instance and then tear it down, AWS sometimes leaves a number of artefacts behind – including the disks. The same applies to the Relational Database Service (RDS): snapshots take up space, so look at downloading a copy of the data outside of AWS as part of any disaster recovery plan.
Lesson: Be proactive and seek out orphaned disk volumes, redundant snapshots, and other artefacts. Then see about removing or archiving them.
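Finding the orphans is straightforward, because an unattached EBS volume reports the `available` state. A minimal sketch – the volume list is illustrative; in practice you would feed in the output of the EC2 describe-volumes call:

```python
def orphaned_volumes(volumes):
    """EBS volumes in the 'available' state are attached to nothing but still billed."""
    return [v["id"] for v in volumes if v["state"] == "available"]

# Illustrative data; real entries would come from a describe-volumes query
disks = [
    {"id": "vol-1111", "state": "in-use"},      # attached to a live instance
    {"id": "vol-2222", "state": "available"},   # left behind by a deleted instance
]
print(orphaned_volumes(disks))  # ['vol-2222']
```

Review the flagged volumes before deleting anything – an 'available' volume may still be a snapshot source someone depends on – but a weekly sweep keeps the dashboard, and the bill, honest.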
Cost control and monitoring
Since AWS makes it easy to manage the account, the task of keeping track of spend had, I realised, fallen to me. I manage the technical operations but not the budget, yet those who worry about the budget had no visibility of the regular spending: they did not have the login details, and the billing did not go through the usual purchase order / invoice mechanism. It was, in fact, too easy to subscribe and use resources, since a subscription could be started with a simple credit card sign-up. In the past, one had to raise a purchase order, which went through a number of steps before approval.
Lesson: Controlling and managing cost needs to be embraced by everyone in your organization.
When a cost/benefit analysis is performed between a 'traditional' hosting platform (where you buy your own server and run it in your server room) and the cloud, I am still quite sure the cloud option will turn out significantly cheaper both to start up (lower capital expenditure) and to run (lower operational costs).
However, the bills will come in regularly, and some months will be higher than others. The cloud is elastic, so it is easy to activate more resources – there is nothing like throwing CPU at a problem! Because it is so easy, a particular job can balloon as more hardware is thrown at it to speed things up, and that can add significantly to the cost. Loose release procedures and inadequate resource tracking compound the problem. Being able to think and act elastically means reviewing your existing processes and working practices: treating a public cloud provider as if it were still a finite bunch of servers downstairs in your server room can be just as costly a mistake as a poorly thought-out plan of execution.
Be warned, but do not be afraid: the cloud is great. Just don't get too soaked when it pours down!