I spoke on a panel about a month ago at a Mitchell developers' conference, and a question came up about how Mitchell avoids cloud cost overruns. I thought this was a good question, and after reading a recent LinkedIn article on out-of-control cloud OpEx, I wanted to share my thoughts on the matter in a quick post.
Most cloud providers, AWS included, do not provide the basic controls companies need to ensure predictable expenses. We believe at Skytap that it is the cloud provider's responsibility to offer both visibility into usage and controls over it, so that companies can avoid moving their on-premises VM sprawl problem to the cloud.
In on-premises environments, teams typically camped on VMs because they were so difficult to acquire. Forms were filled out, justifications written, and approvals secured, all before the gear was even ordered and shipped to the loading dock. Even with virtualization, these practices continued, leading to days or weeks of waiting before teams got the resources they needed. It's no wonder they were reluctant to give resources up, and this reluctance is what has led to astoundingly low utilization of on-premises development and testing environments.
In the case of the cloud, resources have now become easily available. Self-service access to computing is one of the core tenets of the cloud, but sprawl still occurs for two primary reasons:
What is really needed to avoid sprawl and runaway costs is a handful of key capabilities, which we offer with Skytap cloud:
- Ability to suspend an environment. By "suspend," I mean "capture the in-memory running state of the application — just like it works on your laptop." How many of you reboot your laptop on a regular basis? And yet, this is how most clouds work today. When an environment is suspended, the compute resources running it can be released, and only storage costs accrue to the camped environment. If teams can quickly resume an environment and pick up where they left off without rebuilding it from scratch, they are less likely to keep it running.
- User and Department Quotas. I know it sounds old-school and "IT-ish," but simply put, if you can assign quotas to individuals (for smaller companies) or to departments (for larger companies), users will self-police their usage. To do this reasonably, you need coarse-grained controls like concurrent VM RAM or storage. In the case of AWS, the number of billable resource types is so large that quotas would likely have to be expressed as cost limits, which are harder for end users to reason about.
- Reporting and Notifications. Finally, you need good tools for reporting and notifications on usage. If you can report usage by department, group, project, and so on, you can ensure funds are being spent appropriately. With notifications, you can avoid hard limits like quotas but still step in when usage crosses certain thresholds.
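To make the quota and notification ideas above concrete, here is a minimal sketch of a coarse-grained accounting model. All of the names, limits, and the 80% threshold are hypothetical illustrations, not Skytap's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class DepartmentQuota:
    """Hypothetical coarse-grained quota on concurrent VM RAM and storage."""
    ram_gb_limit: int
    storage_gb_limit: int
    notify_at: float = 0.8  # soft alert when usage crosses 80% of a limit
    ram_gb_used: int = 0
    storage_gb_used: int = 0

    def request_vm(self, ram_gb: int, storage_gb: int) -> bool:
        """Grant a new VM only if both limits still hold; otherwise refuse."""
        if (self.ram_gb_used + ram_gb > self.ram_gb_limit
                or self.storage_gb_used + storage_gb > self.storage_gb_limit):
            return False
        self.ram_gb_used += ram_gb
        self.storage_gb_used += storage_gb
        return True

    def notifications(self) -> list[str]:
        """Soft alerts instead of hard limits: list any thresholds crossed."""
        alerts = []
        if self.ram_gb_used >= self.notify_at * self.ram_gb_limit:
            alerts.append(f"RAM at {self.ram_gb_used}/{self.ram_gb_limit} GB")
        if self.storage_gb_used >= self.notify_at * self.storage_gb_limit:
            alerts.append(
                f"storage at {self.storage_gb_used}/{self.storage_gb_limit} GB")
        return alerts


# Example: a department capped at 64 GB of concurrent RAM and 500 GB of storage.
dept = DepartmentQuota(ram_gb_limit=64, storage_gb_limit=500)
dept.request_vm(ram_gb=32, storage_gb=100)   # granted
dept.request_vm(ram_gb=48, storage_gb=100)   # refused: would exceed RAM limit
dept.request_vm(ram_gb=24, storage_gb=300)   # granted; now near both limits
print(dept.notifications())
```

The point of the sketch is the self-policing loop: a hard `request_vm` check gives the predictable ceiling, while `notifications` lets an administrator intervene before a department actually hits it.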