When we talk about failures in DevOps, we are usually talking about how we want to fail fast. What is failing fast, though? The term “fail fast” has several origins, but the one used in Agile is probably the most appropriate for DevOps. Fail fast is a strategy in which you try something, it fails, you get feedback quickly, you adapt accordingly, and you try again. Failing fast is the only way to fail in DevOps. Roll out a product, service, or application quickly, and if it doesn’t pan out, move on quickly.
We don’t often think of military failures in a positive light, and rightfully so. The Battle of the Little Bighorn did not work out well for General Custer. Operation Eagle Claw, the aborted operation to rescue hostages in Iran, resulted in loss of life and contributed to President Jimmy Carter’s failed re-election bid. And a nuclear mishap almost resulted in the creation of “a very large Bay of North Carolina,” according to Dr. Jack ReVelle, Explosive Ordnance Disposal officer. For the scope of this article, we’ll focus on this near-nuclear B-52 blunder, exploring what went wrong and applying the lessons learned from it to our DevOps craft today.
The 1961 Goldsboro B-52 crash is what I consider a near miss for humanity and a very well-timed wake-up call for the US military, which nearly bombed its own country. Around midnight on January 24, 1961, a B-52G Stratofortress was flying a Cold War alert flight out of Seymour Johnson Air Force Base, North Carolina. These alert flights were part of the US military’s answer to what was believed to be a superior Soviet ballistic missile threat. The B-52G that took off that night in ’61 had two Mark 39 thermonuclear weapons on board.
The Cold War was a very tense time in world history. The race between East and West over who could deploy weapons fastest was far fiercer than any Vim vs. Emacs flamewar could ever hope to be; it was almost unquantifiable. The US military commanders in charge of the nuclear weapons were so afraid of not being able to respond to an attack that they routinely fought against the use of safeties in nuclear weapons. We can draw a parallel here to eager investors wanting to see return on investment or managers trying to meet unrealistic delivery dates.
Continuing our analogy, the role of development and operations will be played by the scientists and bomb makers of the era, who wanted weapons to fail safely (that is, not explode when something went wrong). The makers wanted safe backout plans laid out as part of the design, planning, and implementation. Meanwhile, the investors/managers (military commanders) wanted weapons that could be deployed quickly and cheaply. The makers and commanders were at odds in their philosophies. As a result, the Mark 39 bombs had their safeties disabled while the aircraft carrying them was aloft. On this night in 1961, the B-52G carrying these weapons above Faro, North Carolina, had a structural failure and broke up in mid-air.
The two Mark 39 bombs, clocking in at 3.8 megatons apiece (more than 250 times the destructive power of the Hiroshima bomb), plummeted to earth. One Mark 39 deployed its parachute (a part of a planned detonation of the weapon), and it was later discovered that three of the four safety mechanisms had been flipped off during the accident, leaving the weapon one step away from a nuclear explosion. This particular bomb landed in a tree and was safely recovered.
The other Mark 39 bomb did not deploy its parachute. Instead, it became a nuclear lawn dart and slammed into a swampy patch of earth at an estimated 700 miles per hour, breaking apart on impact. The core (or pit) of the bomb was safely recovered. A complete recovery of all of the weapon’s components was not possible due to the conditions. As a result, a small chunk of eastern North Carolina has some fissile material left over from the accident. There is a concrete pad in place to prevent tampering and a hastily written law that no farming or other activity will take place deeper than five feet at the site. There was no disaster recovery plan, and it showed.
The 1961 Goldsboro B-52 crash is a prime example of failing fast going wrong. We were one step away from not one but two multi-megaton nuclear detonations on the US eastern seaboard. The US military did not want inert bombs to fall on the Soviet Union if a delivery system was somehow disabled. The investors/managers wanted the weapons to fail spectacularly. The development and operations teams wanted the weapons to fail fast and safely. Investors/managers did not want to bother with security testing, failure scenarios, and other DevOps-type planning, and this nearly resulted in catastrophic costs.
Luckily, most of us following DevOps practices do not have lives or the fate of humanity in our hands. We can deploy things like Chaos Monkey in our production environments with little risk to life and limb. If you break something in stage, is it not serving its intended purpose: to catch bugs safely before they manifest themselves in production? Take advantage of your dev, test, and stage environments. If those non-production environments are not easily rebuilt, do the work to make them immutable. Automate their deployment so you can take the time to rigorously test your services. Practice failures vigorously; spend the time needed to correct or automate issues out of the systems.
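To make "practice failures" a little more concrete, here is a minimal sketch of a failure drill you might run in a stage environment. It is not Chaos Monkey and does not use its API; FlakyDependency, fetch_with_retry, and run_drill are hypothetical names invented for this illustration. The idea is simply to inject faults on purpose and check that the caller fails fast and loudly instead of hanging.

```python
import random


class FlakyDependency:
    """Hypothetical stand-in for a downstream service in a stage environment."""

    def __init__(self, failure_rate, seed=0):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a drill is repeatable
        self.calls = 0

    def fetch(self):
        self.calls += 1
        # Inject a failure with the configured probability.
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "ok"


def fetch_with_retry(dep, attempts=3):
    """Retry a bounded number of times, then fail loudly instead of hanging."""
    last_error = None
    for _ in range(attempts):
        try:
            return dep.fetch()
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def run_drill(failure_rate, requests=100, seed=0):
    """Send requests through the retry wrapper and count the outcomes."""
    dep = FlakyDependency(failure_rate, seed)
    outcomes = {"ok": 0, "failed": 0}
    for _ in range(requests):
        try:
            fetch_with_retry(dep)
            outcomes["ok"] += 1
        except RuntimeError:
            outcomes["failed"] += 1
    return outcomes


if __name__ == "__main__":
    for rate in (0.0, 0.3, 1.0):
        print(rate, run_drill(rate))
```

Raising the failure rate between runs shows how much abuse the retry policy absorbs before requests start failing outright, which is the kind of thing you would rather learn in stage than in production.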
Following the near disaster in Goldsboro, the US government conducted an amazingly detailed postmortem. Speaking with Dr. Jack ReVelle, the Explosive Ordnance Disposal (EOD) officer who responded to the accident, I learned that significant improvements were made to training and documentation. Security and safety became an iterative part of the development and deployment processes. Prior to this deployment, Dr. ReVelle was not trained in disaster recovery of nuclear devices: “We were writing the book on it as we went.” As a result of this accident, techs were taught how to manage the systems before deployment. Additionally, documentation was updated continuously to include necessary information about the systems as they were being developed. Better tooling was procured for the teams to manage incidents. In short, significant rigor was added to training programs, deployment plans, and documentation processes in the EOD teams. EOD teams now planned for failures in the way DevOps teams of today plan for failures.
One thing you have to keep in mind about everything procured by the US government is that it is delivered by the person or company that is the lowest bidder. In other words, a failure rate is expected with anything. The military has to practice failing in any and all forms that can be thought of, just as DevOps professionals should. Practicing failures is important because it will improve your recognition and response times to these failures. Your teams will build a sort of “muscle memory” to effectively quash issues before they become incidents. This “muscle memory” will allow your teams to iterate through one scenario while discussing other scenarios more calmly. Common sense is not common, so explicit documentation and processes are incredibly important. Remember, the most important part of failing is not the fact that something failed, but how you respond to and learn from such failures.
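If you want to see whether practice is actually improving recognition and response times, you have to record them. The following is a rough sketch under stated assumptions: the Drill record, its field names, and the summarize helper are hypothetical, and the timestamps are assumed to be seconds measured from the start of each drill.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Drill:
    """Hypothetical record of one practiced failure (times in seconds)."""
    name: str
    injected_at: float   # when the failure was introduced
    detected_at: float   # when the team recognized it
    resolved_at: float   # when service was restored


def summarize(drills):
    """Average recognition and response times across a set of drills."""
    return {
        "mean_time_to_detect": mean([d.detected_at - d.injected_at for d in drills]),
        "mean_time_to_respond": mean([d.resolved_at - d.detected_at for d in drills]),
    }


if __name__ == "__main__":
    history = [
        Drill("database failover", injected_at=0, detected_at=60, resolved_at=360),
        Drill("expired certificate", injected_at=0, detected_at=30, resolved_at=150),
    ]
    print(summarize(history))
```

Comparing these averages from one round of drills to the next gives a simple, if crude, signal of whether the “muscle memory” is forming.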