Incident Review: Google Cloud Outage
Let's see what went wrong.
Outages on the Internet always catch you by surprise, whether you are the end user or the Head of Site Reliability trying to keep a clear mind while you execute your incident playbook.
As people in charge of ensuring reliable services for our customers, our normal experience of outages involves surfing a deluge of fire alarms and crisis calls as we work to solve the problem as quickly as we can. We often forget, therefore, what an outage means to the end user.
On Tuesday, November 16, 2021, however, I was reminded of exactly how the shoe feels on the other foot.
In the Shoes of the End User: The Broken Google Bot
On Tuesday, as I was trying to purchase something for my home on Homedepot.com, my browser rendered an unusual page: the Google bot page with its 404 message.
Surprised, I clicked “reload.” Nope, still the same page. I typed the URL again with www. and then without it, but still the same broken Google bot greeted me.
“Well, there’s no way Google acquired Home Depot,” I thought (although its parent company is missing a company that starts with the letter H).
Being a tech geek, my next thought was that maybe I was somehow using Google’s Public DNS (8.8.8.8), the DNS lookup was failing, and Google had decided to launch a new feature that routed non-resolvable domains to its own IP (a technique we’ve seen a few companies use before). But no, that wasn’t it either.
Giving up on being a consumer and blind to what was incomprehensible, I simply logged onto the Catchpoint platform to see what was going on. The answer was immediately clear: multiple sites were failing, all experiencing the same error message and all of them customers of Google Cloud. I visited Google’s status page, and nothing was posted yet… and nothing appeared on it for thirty minutes from when the problems started.
Let’s take a quick look at the incident itself.
The Latest Outage of 2021
On Tuesday, November 16, starting soon after midday ET, many companies not owned by Google saw their websites knocked offline and replaced by the Google 404 page.
What was going on?
Google didn’t acquire your favorite site and then shut it down. In fact, the site was collateral damage from the latest outage of 2021, this time on Google Cloud, which many, many companies rely on for hosting. For a lot of these companies, the impact would have been lost revenue and possible damage to business reputation.
Catchpoint Saw a Sudden Burst of Test Failures
At Catchpoint, we saw a sudden burst of test failures, beginning at 12:39pm ET. It impacted many companies, large and small. Some of the businesses that were affected include the likes of Nest, 1800Flowers, CNET, Home Depot, Etsy, Priceline, Spotify, and Google itself.
By around 13:10 ET, the problem was partially resolved. However, according to Google, the issue was not fully resolved for all impacted products for almost two hours, lasting until 14:28 ET. Some companies were back online quickly, while others continued to experience errors or long loading times for some time.
Below is a chart showing the availability of many of these sites. You can easily see where the sudden plunge from the cliff edge took place, diving from steady high availability to 0%.
Failed tests show the impact on many of our customers (Catchpoint)
Spotify dashboard showing availability at 0% (Catchpoint)
The customer impact varied according to the Google Cloud service it depended on. For instance, Google App Engine saw an 80% decrease in traffic in central parts of the U.S. and portions of Western Europe. Google Cloud Networking customers were unable to make changes to website load balancing, which led to the 404 error pages.
Indeed, it was not just web pages that were impacted, but multiple Google Cloud products, including:
- Google Cloud Networking
- Google Cloud Functions
- Google Cloud Run
- Google App Engine
- Google App Engine Flex
“A Latent Bug in a Network Configuration Service”
Google Cloud apologized for the service outage and any inconvenience caused to its downstream customers. The organization specified the root cause as “a latent bug in a network configuration service which was triggered during a leader election change.” The cloud giant has assured customers there are now “two forms of safeguards protecting against the issue happening in the future.”
With the massive adoption of public cloud, this latest incident (in a long year of outages) illustrates how significant the downstream impact of a public cloud vendor outage can be. It also illustrates how vulnerable enterprises are to third-party vendor outages.
You can clearly see that many enterprises rely on public Internet, services, and infrastructure to conduct their business and deliver digital experiences to their clients. While there are many positives to this situation, the challenge is that those same businesses have little to no control over the underlying infrastructure on which their organizations run.
Three Key Lessons for Any Company
Below are three critical lessons that we took away from this outage, which you can apply to your own business:
- While failure is bound to happen, don’t overlook the importance of communicating to the end user through a proper error page. Don’t assume your users will find your status page or go to Twitter to see your communications. They will have already moved on to a competitor, and will eventually read about it in the news. If there was one thing that confused the end users in this instance, it was the Google error page that greeted people. Most people would have expected an error message from the company they were trying to reach, not the hosting company. It’s a little unclear if Google Cloud allows folks to modify this. Perhaps it’s not possible. However, any company should be ready for such failures, and implement a process that lets it change the DNS or CDN configuration to point people to a proper error page, with its own branding and messaging, that apologizes for the failure in its own words. And ideally, make it fun. Don’t be afraid to be human and relate to the end users. A proper error page is always better than a confusing error page (as in this instance), an obscure error (such as “the server failed to respond”), or worse, nothing at all while the browser hangs indefinitely trying to connect to the server.
- Ensure you implement proper observability of your services, which means observing them from outside your firewall, datacenter, or cloud. While many observability platforms have defined “observability” to fit their products (tracing and logging), in reality observability had its origins long before tracing came about. In control theory, observability is defined as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. This will not have been the first time a company learned it was down from the news or from a customer complaint. You do not want to find out about a problem this way. Far better to stay ahead of it by observing your services from outside your cloud provider. If you rely on code tracing and logs alone, you won’t see the problem. Be proactive and stay on top of your services, as well as the services and infrastructure providers you rely on that are single points of failure.
- Finally, track the SLA of your services, and know your MTTR. You need to track how good your teams and providers are at resolving issues. This is how you build trust and verify people are doing what they are accountable for. Real-time data from an independent monitoring and observability solution will allow you to find out precisely when the issue started and when it was resolved. You cannot rely on status pages to be accurate about the impact the problem had on your site. Everyone will have been impacted differently: slightly earlier or later, shorter or longer…
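To make the second and third lessons a little more concrete, here is a minimal Python sketch of how results from an outside-in check could be turned into outage windows and an MTTR figure. Everything in it is an assumption for illustration: the `Probe` record, the `brand_marker` check, and the sample data are hypothetical and do not reflect how Catchpoint or any other product works. The times in the sample only loosely follow the timeline described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Probe:
    """One result from an external check (hypothetical record shape)."""
    at: datetime
    status: int
    body: str


def is_healthy(probe, brand_marker):
    """Healthy means a 2xx status AND our own branding in the body.

    The second condition catches the case where a provider's generic
    error page is served in place of our site.
    """
    return 200 <= probe.status < 300 and brand_marker in probe.body


def outage_windows(probes, brand_marker):
    """Return (first_failure, first_recovery) pairs in time order."""
    windows = []
    started = None
    for probe in sorted(probes, key=lambda p: p.at):
        healthy = is_healthy(probe, brand_marker)
        if not healthy and started is None:
            started = probe.at
        elif healthy and started is not None:
            windows.append((started, probe.at))
            started = None
    # An outage still open at the end of the data is omitted.
    return windows


def mean_time_to_recovery(windows):
    """Average outage duration, or None when there were no outages."""
    if not windows:
        return None
    total = sum((end - start for start, end in windows), timedelta())
    return total / len(windows)


# Illustrative data only; not real measurements.
day = datetime(2021, 11, 16)
probes = [
    Probe(day.replace(hour=12, minute=38), 200, "<title>Example Shop</title>"),
    Probe(day.replace(hour=12, minute=39), 404, "<title>Error 404</title>"),
    Probe(day.replace(hour=12, minute=55), 404, "<title>Error 404</title>"),
    Probe(day.replace(hour=13, minute=10), 200, "<title>Example Shop</title>"),
]
windows = outage_windows(probes, brand_marker="Example Shop")
print(mean_time_to_recovery(windows))
```

Because the windows come from your own probes rather than from a provider’s status page, the start and end times reflect what your users actually saw. The precision is limited by how often you probe: with checks every few minutes, each window can be off by up to one interval at either end.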
Opinions expressed by DZone contributors are their own.