3 Lessons DevOps Can Learn From the 5 Biggest Outages of Q2 2020
Read this article to learn 3 lessons from the biggest outages of IBM Cloud, T-Mobile, and GitHub.
‘Learn from the mistakes of others. You can't live long enough to make them all yourself’ – Eleanor Roosevelt.
Nobody is immune to outages, but it’s better to learn from others’ mistakes than from your own. The second quarter of 2020 was marked by several serious outages at prominent services including IBM Cloud, GitHub, Slack, Zoom and even T-Mobile (Source: StatusGator Report). I’m sure you noticed these outages, as our team did. I decided to share the lessons we learned from this downtime, hoping we can all grow from it.
Lesson 1: Don’t Host Your Status Page on Your Own Infrastructure – Outage of IBM Cloud
A status page helps you communicate with clients and keep them abreast of changes. It serves both clients and internal teams, and it can reduce support tickets because users can see for themselves what has happened. Put simply, the status page is a convenient, efficient and necessary communication tool. But it becomes useless if you host it on your own infrastructure. It is best to host your status page on a separate domain and on infrastructure that your main services do not depend on.
On June 10, 2020, IBM Cloud had an outage that impacted many of its cloud services: Kubernetes Service, Cloud Object Storage, VPN for VPC, Identity and Access Management (IAM), Continuous Delivery, App Connect, Watson AI and… its status page. Fortunately, the page was available in the early stages of the outage, but later it was reachable only intermittently. Many users criticized IBM on social media for a lack of transparency and communication. So, we can draw the first conclusion.
Hosting the status page on your own infrastructure can damage your reputation because it leaves users with a negative impression. It’s also quite useless because, in the event of downtime, it won’t be available, just like the rest of your services.
What Should IBM Do Now?
Of course, set up a status page that runs outside their own infrastructure.
There are three options they can consider:
- StatusPage.io. The largest and most popular. This is probably one of the few status page providers big enough to handle companies like IBM Cloud.
- Status.io. They are also very large and popular, and most likely able to offer the kind of scale that IBM Cloud would require.
- Or they could build their own, being careful not to depend on any of their own infrastructure. They could host it as a static page on a third-party CDN to reduce complexity and dependency on their network.
If you are smaller than IBM, take a look at Cachet, an open-source status page tool. It’s pretty good and easy to deploy to any number of providers, such as DigitalOcean.
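For teams that go the build-your-own route, the static-page idea can be sketched in a few lines of Python. This is only a minimal illustration, not how IBM or any provider does it: the `Component` class, the `render_status_page` function, the status labels and the `index.html` output file are all hypothetical names chosen for this example. The point is that the result is a single HTML file with no server-side dependencies, which can then be uploaded to whatever third-party CDN or static host you choose.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Component:
    """One monitored service (hypothetical structure for this sketch)."""
    name: str
    status: str  # free-form label, e.g. "operational", "degraded", "outage"


def render_status_page(title: str, components: list[Component], message: str = "") -> str:
    """Return a complete, self-contained HTML document as a string."""
    # Escape every value so a stray "<" in a name or message cannot break the page.
    rows = "\n".join(
        f"      <li>{escape(c.name)}: <strong>{escape(c.status)}</strong></li>"
        for c in components
    )
    notice = f"    <p>{escape(message)}</p>\n" if message else ""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '  <head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        "  <body>\n"
        f"    <h1>{escape(title)}</h1>\n"
        f"{notice}"
        "    <ul>\n"
        f"{rows}\n"
        "    </ul>\n"
        "  </body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    page = render_status_page(
        "Example Service Status",
        [Component("API", "operational"), Component("Dashboard", "degraded")],
        message="We are investigating slow dashboard loads.",
    )
    # Write the file locally; uploading it to an external host is a separate step.
    with open("index.html", "w", encoding="utf-8") as f:
        f.write(page)
```

Because the page is regenerated and re-uploaded on each update, the script that produces it should also run somewhere that stays reachable when your main infrastructure is down.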
Lesson 2: If You Don’t Have an Established Status Page, Communicating and Projecting Confidence to Your Users Is Hard – Outage of T-Mobile
Did you know that there are huge providers with more than 80 million clients… without status pages? T-Mobile is one of them. On June 15, 2020, it experienced downtime across the US. After 13 hours of outage, T-Mobile’s CEO Mike Sievert wrote a post on the company’s blog about the issue. The post is no longer available but I have a screenshot for you.
Users wrote negative posts on social media during the downtime. Even Federal Communications Commission Chairman Ajit Pai wrote on Twitter that the T-Mobile outage was unacceptable.
Users were displeased not only because of the outage but also because of T-Mobile’s silence. A status page could have eased this problem. Today every company should have a status page, even a huge telecom provider. It is a good way to talk to your customers. Even Netflix has a status page! If T-Mobile had notified its users about the downtime and the steps it was taking, they wouldn’t have been so angry and disappointed. In turn, the company’s reputation wouldn’t have suffered as much.
The conclusion from this lesson: if you don’t have a status page, create one immediately!
Lesson 3: Detailed Post Mortems Can Help You Regain Trust – Outage of Slack and GitHub
The first two lessons were about what not to do. The next one is about what to do.
On May 12, 2020, Slack had a substantial outage that affected the entire service: nobody could log in or receive notifications. Every user of the Slack desktop app received a generic HTTP error. Even after the issue was resolved, users still saw the error message because the application didn’t refresh automatically. This was confusing for non-technical users, who didn’t know that they needed to press Ctrl+R to refresh the app and make the error message go away. Still, Slack did a great job. It updated its status page accurately throughout the incident and wrote a detailed technical post mortem about the downtime. Even the CEO of Slack, Stewart Butterfield, commented on the situation on Twitter and said that he was in the same boat as other users.
Another case worth examining is GitHub and its outage on June 29, 2020. The service wasn’t available for two hours. Afterward, GitHub decided to publish monthly availability reports and post mortems on its blog.
It is great to see how companies care about users and explain their own mistakes and challenges. It makes the company more human and user-friendly.
We can draw a few small conclusions from these cases. Try to inform your users as fast as possible. The time when a huge distance existed between a company and its customers is long gone, so try to be closer to people. Also, don’t neglect writing post mortems about incidents. They can ease users’ displeasure and increase their loyalty to the company.
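To make the idea of an availability report concrete, here is a small worked calculation in Python. It is a sketch under stated assumptions, not GitHub’s actual methodology: the `availability_percent` function is a hypothetical helper, and the example treats the two-hour outage mentioned above as the only downtime in June 2020, which the sources for this article do not claim.

```python
def availability_percent(period_hours: float, downtime_hours: float) -> float:
    """Share of the period during which the service was up, as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return (period_hours - downtime_hours) / period_hours * 100


if __name__ == "__main__":
    june_hours = 30 * 24  # June has 30 days, so 720 hours in total
    # Assumption for illustration only: a single two-hour outage in the month.
    print(f"{availability_percent(june_hours, 2):.2f}%")  # prints 99.72%
```

A single number like this hides a lot, which is why it works best next to a written post mortem that explains what actually happened.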
Final Thoughts and Conclusions
We can’t avoid outages, even if we use all the tools at our disposal and hire super-skilled specialists. As you can see, even big companies with highly qualified IT staff have downtime and make mistakes. And you can’t predict everything. A good example is the Zoom outage on May 17, 2020. Demand during the pandemic placed a huge load on the service, and Zoom couldn’t have predicted such a large number of users. A few months earlier, it would have been impossible to foresee the popularity of online meetings, webinars, and lectures. So, Zoom and similar tools had to scale up in a short space of time and sometimes experienced outages as a result.
Downtime is unpleasant, but it is part of the business. Far more important is how you communicate with your users and how you pass on information. In the social media era, silence for hours is strange. People understand that services can stop working, and they’re ready to keep calm and support you if you tell them why their favorite tools aren’t available. You can write a message on the corporate blog, or post an announcement on Twitter or Facebook. All these methods are OK, but the most efficient and modern is having a status page. A post mortem is also a good idea. But if you rely on post mortems alone, you’ll forever be ‘putting out the fires’. So, having a status page should be considered a real solution for every company. But, please, don’t host your status page on your own infrastructure :)
Opinions expressed by DZone contributors are their own.