Standard practice for updating the production database is to have a human review the proposed change and implement it. We have done it that way for a long time. We trust the human, or better yet the expert database administrator (DBA), to properly make the change and avoid mistakes caused by people not familiar with the specifics around a unique production database.
To get uber-technical with a theoretical example, we count on that DBA to look for errors like the seemingly benign addition of a column with a default value set. That may appear to be an innocuous change at first glance, but when the production table row count is very large (greater than 500K, for instance), that simple mistake can cause a data manipulation language (DML) lock on the table, blocking writes while the change runs. That, in turn, would extend the maintenance window and impact the service-level agreement (SLA).
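For the curious, here is roughly what handing that particular check to a machine could look like. This is a minimal sketch, not a real tool: the function name `risky_default_additions`, the `row_counts` dictionary, and the regular expression are all made up for illustration, and the 500K threshold is simply the figure from the example above. A real implementation would parse SQL properly and pull row counts from the database's own statistics.

```python
import re

# Threshold taken from the "greater than 500K" example in the text.
LARGE_TABLE_ROWS = 500_000

# Hypothetical pattern: matches statements such as
#   ALTER TABLE orders ADD COLUMN status VARCHAR(10) DEFAULT 'new'
ADD_COLUMN_WITH_DEFAULT = re.compile(
    r"\bALTER\s+TABLE\s+(?P<table>[\w.]+)\s+ADD\s+(?:COLUMN\s+)?\w+\b.*?\bDEFAULT\b",
    re.IGNORECASE | re.DOTALL,
)


def risky_default_additions(sql, row_counts, threshold=LARGE_TABLE_ROWS):
    """Return tables where `sql` adds a defaulted column to a large table.

    `row_counts` maps lowercase table names to approximate row counts
    (an assumption: in practice these would come from database statistics).
    """
    flagged = []
    # Naive statement split; a semicolon inside a string literal would fool it.
    for statement in sql.split(";"):
        match = ADD_COLUMN_WITH_DEFAULT.search(statement)
        if match is None:
            continue
        table = match.group("table").lower()
        if row_counts.get(table, 0) > threshold:
            flagged.append(table)
    return flagged


if __name__ == "__main__":
    change_script = """
        ALTER TABLE orders ADD COLUMN status VARCHAR(10) DEFAULT 'new';
        ALTER TABLE tags ADD COLUMN color VARCHAR(10) DEFAULT 'red';
        ALTER TABLE orders ADD COLUMN note TEXT;
    """
    counts = {"orders": 2_000_000, "tags": 1_200}
    for table in risky_default_additions(change_script, counts):
        print(f"Review needed: defaulted column added to large table '{table}'")
```

The point is not the regular expression. The point is that a check like this runs the same way on every change, including the one submitted at 4:55 on a Friday.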
We count on individuals to make proper changes and catch errors, and most of the time they get it right. However, as Gliffy experienced recently, humans make mistakes. On March 21, an administrator deleted the production database. I repeat: an administrator deleted the entire production database. It wasn’t until a very long three days later that all data was restored. A single human error with the database resulted in the entire system being down and customers feeling the impact for three painful days.
Let that sink in for a second. The individual trusted to make the change and consider all of its possible consequences made a mistake that led to three whole days of lost business, not to mention the long-term impact on customer trust and relationships.
I don’t mean to kick the poor soul while he or she is down. I’m presenting this incident as evidence of a systemic issue. A breakdown in process and technology. I’ve been rocking a command line since the 1980s. In that time, I’ve seen the same pattern emerge over and over. I call that pattern "The Hand on the Rudder." Actually, I’m going to start calling it an anti-pattern.
As humans, we have false confidence that, because we (personally) are the ones making the changes, the process is somehow superior to one run by a machine. "Let’s not trust the autopilot; I’m a human." Thing is, unlike machines, humans become tired, sick, or hungry. We become distracted thinking about our weekend plans or our sick child at home. Yet time and time again, we believe that there is something inherently valuable in us pushing the "Enter" key on the keyboard.
I’d posit that we often choose to perform these tasks manually because we do not have confidence in automation. Garbage in, garbage out. If we can go through steps manually, we valiantly believe we can catch errors on the fly and respond appropriately. I’m certain the administrator at Gliffy thought the same thing. They were wrong, and it cost them. Big time.
If you need a more mainstream argument for trusting automation over humans, let’s consider the self-driving car. Since 2009, Google’s Self-Driving Cars (SDCs) have logged 1,452,177 miles. In that time, the cars have experienced one lone accident while in autonomous mode. All other accidents occurred while a human was driving the SDC. (You can read the monthly reports here: http://www.google.com/selfdrivingcar/reports/.)
We’ve seen these types of repetitive tasks successfully taken over by automation systems in IT as well. There was a time when I actually performed manual builds on my workstation and used sneaker-net to copy them to a test server on a CD-R. (Kids, sneaker-net is when you walk the file over.) Since CruiseControl was first released, I’ve never performed a manual build. There was a time when the "webmaster" would update webpages using Notepad and FTP them to a server. Now, we use a webhost’s admin console to make changes.
I know what you’re about to say. Yes, we still need a person to create the build process. We still need someone to design the webpages. Humans aren’t exiting IT anytime soon. But the boring, repetitive tasks in the process are a recipe for disaster because the human brain simply was not made to perform boring, repetitive tasks. The human mind specializes in creatively solving problems. This is why we are the most successful species on the planet. (No offense, ants… we won on quality, not quantity.) What humans fail at miserably is completing the same task over and over again with a zero failure rate.
In order to get to that zero failure rate and avert three-day catastrophes in database change or any other IT discipline, we need to embrace the autopilot, not hold it at arm’s length. We need the ability to restrict bad behavior and prevent DBAs and other ordinary mortals from making innocent but inevitable mistakes. We need to shuck some manual rituals, stop thinking our limited human capacity is always superior to that of a computer, and trust automation.