According to DownDetector, a site that tracks internet sites and mobile apps in real time, users were experiencing the most trouble with Twitter’s website, smartphone app and tablet apps. Third-party services, such as TweetDeck, were also intermittently unavailable. It turned out that Twitter had experienced an issue ‘related to an internal code change’ that caused a prolonged outage. On Tuesday afternoon, Twitter said it had reverted the change, which fixed the issue.
This application downtime had a huge impact on Twitter’s business. The average hourly cost of a critical application failure is $500,000 - $1 million. Tuesday’s outage lasted for more than six hours, and the stock price hit a new low, losing 7% and almost $700 million in market value.
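A rough back-of-the-envelope estimate, assuming the cited $500,000 - $1 million hourly cost and the six-hour duration (these are industry averages, not Twitter’s actual losses):

```python
# Rough estimate of the outage's direct cost, multiplying the cited
# average hourly cost range by the outage duration.
# All figures are illustrative assumptions, not Twitter's real numbers.

HOURLY_COST_LOW = 500_000      # lower bound, USD per hour
HOURLY_COST_HIGH = 1_000_000   # upper bound, USD per hour
OUTAGE_HOURS = 6               # "more than six hours", so this is a floor

low = HOURLY_COST_LOW * OUTAGE_HOURS
high = HOURLY_COST_HIGH * OUTAGE_HOURS
print(f"Estimated direct cost: ${low:,} - ${high:,}")
# → Estimated direct cost: $3,000,000 - $6,000,000
```

Even this conservative floor runs into the millions, before counting the far larger hit to market value.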
More importantly: how could this outage happen, and why did it take so long to fix? Most likely someone wrote or edited code, deployed it, and everything went down as a result. The problem-finding process at Twitter appears to be a hell of a job. They didn’t know who changed the code, what was changed, or how it affected critical business services. They had to launch a time-consuming investigation across DevOps teams to find and resolve the problem. A better way to deal with outages is to fully automate the problem-finding process across teams. Every DevOps team should be aware of what’s happening across the full IT stack. Providing business services is, and always will be, a multi-team effort. To prevent future outages, Twitter has to step up its game and take a proactive approach with full visibility into its IT operations. It can’t wait for the next big incident to happen.
Let’s hope Twitter learns from these outages. In the end, you and I, the customers, suffer the most. We can’t tweet and have to log in to Facebook to complain about our problems. ;-)