Not even Google is immune to downtime caused by the constant change of modern data center environments. Recently Gmail, Google Drive, Google Documents, Google Spreadsheets, Google Presentations, Google Groups, and the Admin control panel/API were all down for some users. Twitter was alight with complaints, including some from the folks at ZDNet, who were getting a server error when trying to use their Google Apps.
So what was the cause? According to Google, “the outage was caused by a misconfiguration of a user authentication system, which caused a fraction of the login requests to be unintentionally concentrated on a relatively small number of servers. At the time the misconfiguration occurred, monitoring systems detected a load increase and alerted Google Engineering at 1:08 a.m. PT on April 17. However, the alert cleared and the authentication system operated normally under the current load conditions.
“At 5:00 a.m. as login traffic increased, the misconfigured servers were unable to process the load. This began to cause errors for some users logging in to Google services. The request load, exacerbated by retry requests from users and automated systems such as IMAP clients, initially appeared as the cause of the login errors.”
This is a great example of the challenge IT operations teams face in managing configurations. Problematic changes can slip into systems at any time, causing anomalies that stay hidden until service interruptions strike. Even an infrastructure the size of Google's can be brought down by a single small change that slipped by. Could this disaster have been avoided? We think so. Anomaly detection designed for highly scalable, dynamic web environments is one of the key solutions DevOps teams can use to minimize downtime.
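To make the failure mode concrete, here is a minimal Python sketch of one way to spot load that has been concentrated on a few servers, even while the fleet-wide average still looks healthy. It is purely illustrative: the server names, request counts, and the `cutoff` value are made up, and the modified z-score used here is just one common outlier test, not Google's method or a summary of any particular product.

```python
from statistics import median


def concentrated_servers(requests_per_server, cutoff=3.5):
    """Return the servers whose load is an outlier relative to the fleet.

    Uses the modified z-score, built on the median and the median absolute
    deviation (MAD), so the outliers themselves do not distort the baseline.
    The cutoff of 3.5 is a conventional default, assumed here for illustration.
    """
    loads = list(requests_per_server.values())
    med = median(loads)
    mad = median([abs(load - med) for load in loads])
    if mad == 0:
        # More than half the fleet carries identical load, so any server
        # above that common value is treated as an outlier.
        return sorted(n for n, load in requests_per_server.items() if load > med)
    return sorted(
        name
        for name, load in requests_per_server.items()
        if 0.6745 * (load - med) / mad > cutoff
    )


# Hypothetical login requests per second on seven authentication servers.
fleet = {
    "auth-01": 100, "auth-02": 104, "auth-03": 98, "auth-04": 101,
    "auth-05": 97, "auth-06": 420, "auth-07": 395,
}

fleet_average = sum(fleet.values()) / len(fleet)
print(f"fleet average: {fleet_average:.0f} req/s")
print("concentrated on:", concentrated_servers(fleet))
```

In this made-up fleet the average stays under 200 requests per second, so a fleet-wide threshold set at, say, 250 would never fire, yet two servers are carrying roughly four times the load of their peers. Comparing each server against the rest of the fleet is what surfaces the problem before peak traffic arrives.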
We’ve put together a white paper called “Anomaly Detection in the Data Center and the Cloud” to help shed some light on best practices for anomaly detection. It covers how the different types of environmental and behavioral anomalies occur, why automated provisioning is only a start, why thresholds don’t tell the whole story, and the current anomaly detection methods. Hopefully the info in this short (nine-page) white paper can help you avoid some downtime (and 3 a.m. phone calls!).
Let me know if you find the white paper helpful. If you have suggestions, need help with anomalies, or just want to say “hello”, leave a comment or shoot us a tweet @metaforsoftware.