Why Your Cloud is Speeding for a Scalability Cliff
Don’t believe me that you’re headed for the cliff?
A startup scales up to no avail
Towards the end of 2012, I worked with an internet startup in the online education space. Their web application was not unusual: built in PHP on Linux, Apache and MySQL, all running on Amazon Web Services. They had three web servers in the mix and were seeing 1000 simultaneous users during peak traffic.
All this sounds normal, except they were hitting major stalls and app slowdowns. Before I was brought in, they had scaled their MySQL server from a large to an extra large instance, but were still seeing slowdowns. "What can we do?" they asked.
I dug in and took a look at the server variables. They seemed to have substantial memory allocated to the server and to InnoDB. I then dug into the slow query log. This is a great facility in MySQL that watches the activity happening against your database and logs the queries that take a long time. In this case we had the threshold set to half a second and found tons of activity.
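To make the idea behind a slow query log concrete, here is a minimal sketch in Python. It is not how MySQL implements the feature, and it uses an in-memory SQLite database rather than MySQL so that it runs anywhere. The `timed_query` helper, the `lessons` table and the `slow_log` list are hypothetical names made up for illustration.

```python
import sqlite3
import time

# Mirrors the half-second cutoff mentioned above.
SLOW_THRESHOLD_SECONDS = 0.5


def timed_query(conn, sql, params=(), threshold=SLOW_THRESHOLD_SECONDS, slow_log=None):
    """Run a query and record it in slow_log if it takes at least `threshold` seconds."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    if slow_log is not None and elapsed >= threshold:
        slow_log.append((elapsed, sql))
    return rows


# Hypothetical table, populated with throwaway rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lessons (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO lessons (title) VALUES (?)",
    [(f"lesson {i}",) for i in range(1000)],
)

slow_log = []
# A threshold of 0 logs every query, which is handy for a demonstration.
timed_query(
    conn,
    "SELECT * FROM lessons WHERE title = ?",
    ("lesson 500",),
    threshold=0.0,
    slow_log=slow_log,
)
for elapsed, sql in slow_log:
    print(f"{elapsed:.6f}s  {sql}")
```

In a real deployment you would rely on the database's own slow query log rather than timing queries in application code, since the server sees every client and not just one.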
What was happening? It turned out there were lots of missing indexes and badly written SQL queries.
A related popular piece: AirBNB didn’t have to fail despite Amazon’s outage.
How can we resolve these problems?
The customer asked me to explain the situation. I asked them to imagine finding a friend’s apartment in NYC without an address. Not easy, right? You have to visit each of its 8 million residents until you locate your friend’s home.
Also check out: Real Disaster Recovery Lessons from Sandy.
This is what you’re asking the database to do without indexes, and it’s very serious. The problem compounds when hundreds or thousands of other users are hitting different pages, all with the same problem. Your whole dataset can fit in memory, you tell me? So-called logical I/Os still cost, and can indeed cost dearly. What’s more, sorting, joining, and grouping all add to the amount of memory your dataset can require.
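The address analogy can be seen directly in a query plan. The sketch below again uses an in-memory SQLite database as a stand-in (MySQL's own EXPLAIN statement plays a similar role there), and the `residents` table, its columns and the index name are hypothetical. It asks the database how it would find one resident by name, first without an index and then with one.

```python
import sqlite3

# Hypothetical table standing in for "everyone in the city".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE residents (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
conn.executemany(
    "INSERT INTO residents (name, address) VALUES (?, ?)",
    [(f"resident-{i}", f"{i} Main St") for i in range(10000)],
)

QUERY = "SELECT address FROM residents WHERE name = ?"


def plan(conn, sql, params):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


# Without an index the plan is a scan: knock on every door.
before = plan(conn, QUERY, ("resident-4242",))

# With an index on name the plan becomes a search: go straight to the address.
conn.execute("CREATE INDEX idx_residents_name ON residents (name)")
after = plan(conn, QUERY, ("resident-4242",))

print("without index:", before)
print("with index:   ", after)
```

The exact wording of the plan varies between SQLite versions, but the first plan reports a scan of the table and the second reports a search that uses `idx_residents_name`. On a table this small both are quick; the gap widens as the table grows and as more users run the same query at once.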
Why didn’t a bigger server help?
Modern computers are fast and EC2 extra large instances have a lot of memory. But with thousands or tens of thousands of users hitting pages simultaneously, you can take down even the largest servers.
High performance code isn’t automatic
We have automation, we have agile processes, and we can scale web, cache and search servers with ease. The danger is in thinking that deploying in the cloud will magically deliver scalability. Another danger is thinking that ORMs like ActiveRecord in Ruby or Hibernate in Java will solve these problems. Yes, they are great tools to speed up prototyping, but we become dependent on them, and they are difficult to rip out later.
Want more? Check out our 5 Things Toxic To Scalability.
Fred Wilson says Speed is an essential Feature
Fred Wilson recently gave a talk on his top 10 golden principles for successful web applications. He says speed is the most important feature. Enough said!