Lessons Learned from a Cloud Emergency
I have been focusing my career on the cloud for the past five years or so, and people reach out to me from time to time for help on a project. I am always willing to pitch in when I can, and if I have the time. Of course, I have to remind them of the limits I am under as an evangelist. I can’t write code, or give you code that you intend to put into production, and I can’t see your real production code or data. I am just not bonded and insured to be a consultant.
A couple of Fridays ago I received an urgent IM from a colleague. He had a customer who just went live with a Facebook marketing app/promotion, and it just crashed from load. They were already using Windows Azure, so they wanted some help to get it back up and running. I had time, so I jumped in.
A few minutes later I was on a conference call with the customer and the agency they hired to build the site. I started off by gathering the basic information. What type of app is this? A promotion/Facebook app. What happened? They went live and crashed under the load. What is it built with? Ubuntu/Linux on Azure VMs, Apache, MySQL, and PHP.
I then asked (since this was just after the SimCity debacle) whether they had done any load testing. No, they hadn’t (it wasn’t in the Statement of Work), but since the crash they had done some testing. They found that after 100 users the system came to a halt. The fact that it was exactly 100, and not 123, or 1,459 (or some other random number) immediately struck me as odd.
I immediately started brainstorming causes of this sudden crash. They were running everything on a single Ubuntu server, with 8 cores. While this is not the way “I” would architect it, it should have had plenty of horsepower for many more than 100 users.
I explained that in these situations (<rage>the site is DOWN!</rage>) they should create a kill board (a brainstorm) of causes, and start investigating as many of the most likely candidates in parallel as they can. They shouldn’t single-thread their response, since the goal was to get the site back up as soon as possible.
We spent some time coming up with possible reasons for the crash. The dev team was eager to pin everything on Azure. They were already in the process of getting the site running over at Amazon. I thought to myself that this was a horrible idea. Since they would be running in a VM on both clouds, the move was very unlikely to fix anything. And it was very likely that the cause was not Azure in any case.
I pushed gently on this, and their thinking was that they had another project with a similar architecture, on Azure, that was having the same problems. My comment back was that coincidence doesn’t guarantee causality. Since the two projects shared the same architecture, in my mind, the architecture was the first suspect, and not the virtual environment. I have worked with some very big scale applications in Azure, and I know that it can easily handle the load they were talking about, even with their weak sauce use of the cloud. And this clearly was not some innate limit of PHP or MySQL. It had to be the code that they wrote, or the config of the deployment.
As we started brainstorming, I mentioned that the first thing I would look at is to make sure that the dev environment configuration for PHP/MySQL didn’t get copied to production. I have seen this happen before, and it’s easy to miss, especially if you don’t have either a great deployment guide or a tested automated deployment script.
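To make that check concrete, here is a minimal sketch of a pre-deployment config comparison. It is only an illustration: the section names and values below are made up (I never saw the real files), `max_connections` and `display_errors` are just examples of the kind of setting that tends to differ between dev and production, and real `my.cnf` or `php.ini` files may need more careful parsing than Python's `configparser` gives you.

```python
import configparser


def load_settings(ini_text):
    """Flatten INI-style text into {(section, key): value}."""
    parser = configparser.ConfigParser(interpolation=None, allow_no_value=True)
    parser.read_string(ini_text)
    return {
        (section, key): value
        for section in parser.sections()
        for key, value in parser.items(section)
    }


def config_drift(expected_text, deployed_text):
    """Return {(section, key): (expected, deployed)} for every mismatch."""
    expected = load_settings(expected_text)
    deployed = load_settings(deployed_text)
    drift = {}
    for name in sorted(set(expected) | set(deployed)):
        want = expected.get(name)
        got = deployed.get(name)
        if want != got:
            drift[name] = (want, got)
    return drift


# Hypothetical values for illustration only; not the customer's settings.
EXPECTED_PROD = """
[mysqld]
max_connections = 500
[php]
display_errors = Off
"""

DEPLOYED = """
[mysqld]
max_connections = 100
[php]
display_errors = On
"""

if __name__ == "__main__":
    for (section, key), (want, got) in config_drift(EXPECTED_PROD, DEPLOYED).items():
        print(f"[{section}] {key}: expected {want!r}, deployed {got!r}")
```

Run as a step in a deployment script, a non-empty result from `config_drift` would be a reason to stop the release and look before going live.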
1. Check the request queue in the web server. As requests come in, they are queued until a thread/process of the web server can handle them. If the queue keeps growing, your server can’t keep up.
Answer: Checked, not an issue.
2. Check CPU/memory usage during the spike.
Answer: Checked. Nominal.
Nominal tells me that the issue is not a memory leak in the software, or a thread leak. And the app wasn’t doing anything particularly memory or CPU intensive.
3. Check that the MySQL server is configured for persistent/pooled connections.
Answer: Will have to check.
When app servers connect to a database server, they use a connection. In a stateful app environment, connections can be set up once and used all day. Web apps are rarely stateful; they are usually stateless. In a stateless app, connections are set up, used for a second, and then torn down. This leads to a lot of churn, and you repeatedly incur the overhead of setting up and tearing down the connections. Most web/data platforms now have the concept of pooling. A set of connections is created and pooled. As the app server needs to use the db, it borrows a connection out of the pool, and returns it when it is done.
Sometimes the issue is that the pool can’t keep up with the demand. Sometimes the pool has too many or too few connections. It is very easy with MySQL to turn the pool off altogether, which just kills performance.
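The borrow-and-return idea is easier to see in code. This is a toy sketch, not how PHP or MySQL implement it: `FakeConnection` is a hypothetical stand-in for a real database connection, and the pool is just a fixed-size queue. It counts how many connections get opened with and without pooling, and shows what happens when the pool is too small for the demand.

```python
import queue
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a real (expensive to open) DB connection."""

    opened = 0  # class-wide count of connections ever created

    def __init__(self):
        FakeConnection.opened += 1

    def query(self, sql):
        return f"ran: {sql}"


class ConnectionPool:
    """Fixed-size pool: connections are created once and then reused."""

    def __init__(self, size, factory=FakeConnection):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self, timeout=1.0):
        # Waits for a free connection; raises queue.Empty if none frees up.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)  # always hand the connection back


def handle_requests_unpooled(n):
    """Stateless style: open a fresh connection for every request."""
    for i in range(n):
        FakeConnection().query(f"SELECT {i}")


def handle_requests_pooled(pool, n):
    """Pooled style: borrow a connection, use it, return it."""
    for i in range(n):
        with pool.borrow() as conn:
            conn.query(f"SELECT {i}")


if __name__ == "__main__":
    FakeConnection.opened = 0
    handle_requests_unpooled(1000)
    print("unpooled connections opened:", FakeConnection.opened)  # 1000

    FakeConnection.opened = 0
    handle_requests_pooled(ConnectionPool(size=5), 1000)
    print("pooled connections opened:", FakeConnection.opened)  # 5
```

Note the failure mode too: if every connection is checked out and nothing is returned before the timeout, `borrow` raises `queue.Empty`. In a real system that shows up as requests stalling at a suspiciously round number of users.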
There were many ideas, but you get the picture. We were going through the stack, trying to pick out what might be causing such a low user level (100 users) to crash such a big box. My instinct was still screaming ‘connection to the database’.
The call ends, everyone on the dev team has a lot of work to do, and I go about my weekend. A few times they had clarifying questions in email.
Come Monday I emailed the customer and asked if the situation had been resolved and if there was anything else I could do to help. He said yes, it had. They had deployed the dev configuration into production by accident, and he was busy screaming at the CEO of the dev shop for such a stupid mistake, one that cost him a lot of money and bad PR.
I thought this was an interesting story, but I also wanted to pull a few lessons learned from it:
- 1. Don’t start using the cloud unless you understand the platform.
- The cloud can seem really easy, and I know that we work really hard to make it as easy as possible. But, like with any platform, the more you expect out of it, the better you need to understand it. If you are deploying a simple blog, with few readers (like this one), then roll on and learn as you go.
- If you are going to go all in with a big and high profile app as your first use of the cloud, please engage someone early on in the process who has the experience. I could have helped them architect this system in a better way so it would have worked out of the gate.
- 2. Test. Test. Test.
- This is the age of test in the software development industry. It is all about testing, at all levels of the process and architecture. Even though a load test wasn’t in the SOW, they should have done some load testing, even if out of curiosity! For crying out loud, this is a Facebook app that is meant to have rapid scale. Don’t you think you could do some simple testing? As a developer, with my name on the project, I would want to know. There are plenty of tools out there to easily test scale on your system. Do it!
- 3. Don’t use your jump to conclusions mat.
- It was easy for this team to waste hours of valuable time because ‘they had another app with the same architecture behaving poorly in Azure’. They just assumed it was Azure. They should not have jumped to a conclusion, but looked at all of the variables and understood that with the same architecture, of course you would have the same results. They should have immediately started to diagnose the situation with DATA. Looking in logs, performance monitor counters, etc to see what was going on. I also recommended that they take the machine in the cloud as is into their environment and do the same tests. That would prove if it was Azure or not.
- 4. Run with scissors during an outage.
- During the normal course of debugging you should only ever try one solution at a time. If you fix three things, and the problem goes away, you have no idea what fixed it. The opposite is true when THE SITE IS DOWN!!! RAGE!! FIX IT!!! In these cases you need to fork the team/process/code in as many paths as you can so you can find the problem. If you don’t do this, you end up spending critical hours on a path that doesn’t work, and then you have to start over with the second idea on the list. If you run four or five of them down in parallel, then you can cover more ground.
- 5. Perform a post mortem, with results to the project sponsors.
- Hopefully the dev team did a post mortem and root cause analysis on this problem, and sent a very transparent report to their company’s executives, and to their customer. This has to be open kimono. It has to explain what happened, without being a blame pointing device. You can’t improve unless you look hard at what really happened. The report should include how you are going to avoid this problem in the future. You need to do this as soon as people have caught up on sleep, while everything is still fresh in their heads. If you drag this out, the impact will be lost.
- 6. Reach out for help.
- When you are in the weeds, and the server is on fire, the last thing you want to do is bring in someone without any context. They will just slow you down, question every decision you’ve made, and possibly make your team look bad. BUT YOU HAVE TO! Why? Because you need someone with a clear head to ask you questions, someone who isn’t in a panic. This could be someone from another team at your company, or anyone who has the experience to help out. It will hurt, you will be frustrated with this person, but it will help a lot. This is where professional networking comes in. You should know someone that you can informally bounce ideas off of, to get some outside help. Someone you know won’t be busy trying to make you look bad. In this case, the only thing I had to gain was to make sure Azure was working for our customer. I had nothing to gain from making the developers look bad.
- I am not saying I rescued the day, but by asking all the dumb questions I could, because I had nothing to lose (but a Friday night) and no context, the team did find the solution. These were questions that they skipped over because of their time on the project. Someone with outside context and experience can see things your team can’t see.
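
Going back to lesson 2 for a moment: a simple load test does not need a big tool. Here is a rough sketch of a concurrency ramp using only the Python standard library. It is an assumption-laden toy rather than a replacement for a proper load testing tool: the levels, the `requests_per_worker` count, and the URL in the usage note are all made up, and you should only ever point something like this at a system you own.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Send one real HTTP request; True means the server answered 200."""
    with urlopen(url, timeout=timeout) as response:
        return response.status == 200


def run_level(send, concurrency, requests_per_worker=5):
    """Fire requests at a fixed concurrency; return (ok, failed, seconds)."""

    def worker(_):
        ok = failed = 0
        for _ in range(requests_per_worker):
            try:
                if send():
                    ok += 1
                else:
                    failed += 1
            except Exception:
                # Timeouts, refused connections and HTTP errors all count.
                failed += 1
        return ok, failed

    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(worker, range(concurrency)))
    elapsed = time.perf_counter() - started
    return sum(r[0] for r in results), sum(r[1] for r in results), elapsed


def ramp(send, levels=(10, 50, 100, 200)):
    """Step through concurrency levels so you can see where failures begin."""
    report = []
    for level in levels:
        ok, failed, elapsed = run_level(send, level)
        report.append(
            {"concurrency": level, "ok": ok, "failed": failed,
             "seconds": round(elapsed, 2)}
        )
    return report
```

Something like `ramp(lambda: fetch("http://staging.example.test/"))` (a hypothetical staging URL) would give you one row per concurrency level. If the failures start at a suspiciously round number, go look at your configuration before you blame the platform.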