I spent this morning reflecting on everything I had learned and experienced in the past two days at the London DevOps Enterprise Summit, which I co-hosted. I was so inspired by all the amazing tales of courageous business transformation by the amazing technology leaders representing almost every industry vertical.
And then I ran into this article, which I found utterly dismaying. It quoted how Willie Walsh, CEO of International Airlines Group, parent of British Airways, responded to the BA incident that stranded 75,000 passengers.
Walsh was quoted as saying, "It was a power failure, not an IT failure." And he then goes on to say, "There are incidents from time to time that are damaging to our reputation, but we recover from these."
I find this startling, utterly unbelievable and frustrating for a variety of reasons:
First the facts: The Economist wrote: "More than 1,200 flights, booked to carry over 75,000 passengers, were canceled over three days; hundreds of thousands more miserable travelers had their trips ruined by delays, lost luggage and missed connections. Analysts estimate that the total cost to BA of refunds, plus compensation of up to €600 ($675) for each delayed passenger, could climb as high as £150m ($192m)." (Source: The Economist)
This was, without doubt, a colossal business failure. It was not merely an IT failure, and calling it merely a power failure trivializes it to the point of absurdity. It jeopardized the achievement of business goals, financial goals, operational goals, and certainly technology goals… and maybe even strategic goals, too. After all, £150m in passenger compensation would nearly wipe out the unexpected record 2017 first quarter profit of €170m, and could potentially result in delaying or canceling planned strategic programs. To not first identify this as primarily a business failure is astonishing to me. (Source: The Independent)
I could be totally misunderstanding Mr. Walsh, but it seems to me his comments trivialize the incident and abdicate his responsibilities as CEO to address the factors that led to this accident. It would be like a CEO of an automobile manufacturer saying about a production plant that was shut down for three days: "Someone tripped over a power cord. We've fired that person. Nothing else to discuss here."
It is disheartening to see the "name, blame, shame" pattern underway, where a worker will be identified (perhaps wrongly), then be punished or be fired, because of an accident that resulted from that person operating in an extremely fragile and unsafe technology environment. Instead of naming/blaming/shaming people, the job of the leader is to do whatever is required to create a safe system of work, where workers are able to perform their jobs safely, where small failures have small consequences, instead of a massive failure.
I've been in London for the past week, and BA CEO Alex Cruz told British television outlets this, as reported by The Register: "'All the parties involved around this particular event have not been involved with any type of outsourcing in any foreign country,' he said (Reg emphasis). 'They have all been local issues around a local data centre who [sic] has been managed and fixed by local resources.'" (Source: The Register) I'm sure there are some political lightning rods here that I'm not aware of or sensitive to when it comes to potentially outsourcing labor or sending it overseas. But simply put, no organization can outsource responsibility, regardless of who does the work, or where that work is performed.
I spent hours thinking about the letter I wanted to read from the CEO of an airline that is recovering from an incident like last week's. If I had to write The Phoenix Project for an airline instead of an automotive parts manufacturer, maybe the CEO would write a letter like this…
My Imaginary Letter as CEO of a Fictitious Airline
To all our customers and employees —
My name is Gene Kim, the CEO of Mythical Airlines, and I want to personally apologize to each of the 75,000 of our customers that we left stranded or who were unable to complete their trips, because of the 1,200 flights we had to cancel. I also want to apologize to the hundreds of thousands of passengers that we delayed, who lost their bags, or who missed their connection.
Our business is a service business. We make a simple promise to you: we fly you from Point A to Point B safely, comfortably, quickly, on time, and do it at a price low enough so you can experience the amazing joys of air travel.
We let you down last week in many ways: some were stranded in the middle of a trip, we were unable to book some of you quickly on another airline, we were unable to find some of you overnight lodging, you were separated from your luggage, and many other indignities that we would never willingly inflict on our customers.
Although our promise to you is simple, our business is incredibly complex. We fly the newest planes, which often cost hundreds of millions of dollars, and each requires skilled and certified pilots, flight crew, mechanics, baggage handlers, and customer agents.
We have forty thousand dedicated employees around the globe, and a million things must go right every day in order for you to make all your flight connections, be able to travel on our partner airlines without fuss, arrive at the right place, safely, on-time, with the right assigned seat, with all your bags, be rewarded in our frequent flyer programs, and so forth…
We rely on hundreds of incredibly complex technology systems that handle everything from ground operations, baggage tracking, flight crew scheduling, customer check-in, seat assignment, ticket sales, loyalty programs… All these software systems were built over decades, some of them requiring us to solve problems that had never been solved before, all developed and run by thousands of passionate technologists, some of whom are employees and others from our critical partners.
In fact, those hundreds of technology systems that our organization relies upon every day are actually orders of magnitude more complex than the airplanes we fly; they are an engineering achievement that represents our best-known understanding of how to run a great airline.
But last week, many things went wrong. From what we can tell, we experienced a complex and cascading failure in the critical technology systems that run these incredibly important business processes.
I want to credit the tens of thousands of hard-working employees who, for three days, worked around the clock to try to help our customers at airports, on the telephone, and in our offices worldwide. I got to thank many of you for your heroic efforts from our Portland Operations Center, but I want to express my gratitude to everyone in this letter.
But I especially want to thank the technology teams around the world who helped us recover from this unprecedented accident, the likes of which we have never experienced before. It is during times like this when we see so clearly how critical technology is to our business.
One of the narratives I've read in the press is that the failure was due to a power failure.
This is absolutely false, and I want to put that to rest.
The accident last week was not due to a power failure, or an IT failure — this was a business failure. After all, we were unable to perform some of our most critical business operations for nearly three days.
I say this not to place blame — in fact, if there is someone to blame, the only person that can be blamed is me.
As CEO, I am ultimately responsible for the performance of the organization, and I am responsible for the quality of our people and processes. I am responsible for guaranteeing that we have safe systems of work that support our employees and our customers.
As CEO, I promise to work tirelessly to make sure that this doesn't happen again — and if it does happen again, we must do what it takes so that we can recover more quickly, ideally long before any of our customers are ever inconvenienced.
I have reflected deeply each day since the accident last week that created so much suffering for our customers, and also upon the four other times in the last year when our technology systems have caused delays for our customers.
As a result, I have made a decision. To better serve our customers and our employees, I will dedicate myself to helping achieve zero delayed or canceled passengers here at Mythical Airlines.
I know it is an ambitious, and likely even an impossible, goal. But why would we not shoot for perfection, when perfection would mean ecstatic customers, an offering that would dominate those markets we choose to compete in, and, by definition, a world-class organization whose professionals are admired by everyone, especially our competition?
But to achieve that, I will need your help — and that includes you, our customers, all our employees, contractors, and our suppliers.
For employees, one of the most important things I need from you is for you to tell me about the obstacles that are in your way that prevent us from achieving our goal of zero delayed or canceled customers. I'm asking you to contact me directly — you can email me, or you can call me at my office or my cell phone.
You may ask why I want you to contact me directly — it's because I remember so vividly watching on TV as a child in elementary school the Space Shuttle Challenger explode in 1986, 73 seconds after it launched, killing all seven astronauts. I will never forget my feelings of utter disbelief, shock, and sadness.
I was too young to remember the safety commission put together afterward, but many decades later, I read about how Dr. Richard Feynman and many other experts served on the Rogers Commission, appointed to investigate the accident. One of their conclusions was that organizational culture and decision-making processes were key contributors to the accident — I'll never forget reading about how engineers at NASA and its contractors were concerned about how an O-Ring failure in cold weather could jeopardize the safety of the Space Shuttle astronauts.
They tried to communicate their concerns to NASA leadership, but were unable to convince them to delay the launch, leading to that fateful and tragic day.
When I read that ten years ago as a young professional, I dreamed of having been the head of NASA, and of being able to write an email to all my employees and contractors the day before that Challenger launch: "I know you care as much about safety as I do. If you ever have concerns about safety and you feel like you have not been heard, contact me directly, by email or by telephone. And here is my telephone number at work or at my home. I promise you I will listen."
None of us can change the past, but as CEO of this airline, I will not let something like that happen here. And that is why I need to hear from you.
Because I suspect that somewhere in this great and talented organization, there are similar emails from employees who care passionately about our customers, warning that the incident last week could have happened. And I'm guessing that those people writing those emails had ideas on what we could do to help make our organizations more resilient to accidents and failures.
And maybe they felt like their concerns weren't heard, and were not acted upon.
Or worse, maybe people were afraid to tell bad news or bring up problems. I've worked in organizations where there was a culture where people were afraid to do the right thing, where daily heroics were required to defeat the bureaucracy that prevented people from serving the needs of our customers.
I will not let that happen on my watch — because I need all your best ideas on how we can prevent the accident last week from happening again. If you see something that jeopardizes our ability to achieve zero delayed or canceled flights, email me or call me at my office or my home.
I promise that I will listen, and I promise to dedicate myself and my entire leadership team to helping you do what it takes for us to create the resilient systems and processes to achieve our goal of zero delayed or canceled passengers.
I have long admired the work of Paul O'Neill, the former CEO of Alcoa, who helped them become one of the safest workplaces in the industry, reducing the workplace accident rate over ten years from 2% of the workforce being hurt on the job per year in 1987 (that's seven injuries per day in their 90,000 strong workforce), to less than 0.05% yearly.
Before Mr. O'Neill started his job as CEO, he declared that he would have one simple goal: no workplace injuries.
Long after he retired, he reflected on the first workplace fatality that occurred during his tenure at Alcoa. He flew to the plant with his entire executive team, where they learned how a 17-year-old boy was killed in his third week on the job when he cleared a jammed machine: he climbed over a safety barrier and was killed instantly by a spinning boom, watched by supervisors.
After being briefed at the plant, he told all the assembled executives and plant supervisors, "We killed him. Yes, the supervisors were there, but we killed him. I killed him. Because obviously, I didn't do a good enough job communicating clearly to everyone how safety is the single most important thing to me. Safety is not just a priority, it is a pre-condition for how we work."
Many years later, he said, "People were stunned that they should take personal responsibilities for these deaths. Believe me, Alcoans were some of the most compassionate people around. We always cared for people and we mourned. But they thought safety was someone else's job. No one ever told them that when someone died on the job, we all helped get that person killed. We needed to create a system of work where everyone is empowered to create a genuinely safe workplace, where accidents like that could never happen."
One of the first things he did was to put in a new rule: any workplace injury had to be reported to him within 24 hours by the business unit president, along with an action plan. It wasn't a statistic, it was the name of each person who was hurt on the job.
And the purpose of the report wasn't for Mr. O'Neill to punish people, but to elevate the importance of any needed preventive actions to the highest level, so that Mr. O'Neill got fast feedback whenever an accident occurred, and could help mobilize whatever was needed to complete the preventive actions. He said, "We don't budget or prioritize safety. Anything that could harm a fellow Alcoan, we must fix it immediately. Safety is not a priority; it is a necessary precondition of work."
Mr. O'Neill dedicated his organization to the goal of zero injuries on the job, and over years, created one of the most dynamic and competitive organizations in the industry.
If Mr. O'Neill can do it for Alcoa, then we can do this for Mythical Airlines.
And so now I ask all of our employees and valued partners to help me achieve a similar revolution in the airline industry. I believe that by focusing ourselves on zero delayed or canceled passengers, we will create one of the most admired organizations on the planet. And we will create an organization that we will all be proud to work for.
And we will start by beginning an unflinching examination of what happened last week, and figure out how to mobilize this organization to create a safer and more resilient system of work, and to help prevent disasters like last week from ever happening again.
(And let me briefly state: we cannot blame anyone last week who, while attempting to restore our technology systems, did something that inadvertently led to the meltdown, just as we cannot blame the 17-year-old Alcoa worker for his actions that led to his death.)
Please email me if you have ideas. As the CEO, I am the person ultimately responsible for the outcomes of this organization, and I need to hear from you. Don't worry about passing it up the chain or about politics or someone passing the buck. Just as I would have if I had been the head of NASA before the launch of Challenger, I care about what you have to say and will listen.
Thank you for reading this letter, and I look forward to Day 2 of our journey toward our goal of zero delayed or canceled passengers.
CEO, Mythical Airlines
Call to Action
I wrote this fictitious letter as a thought-experiment, to see whether it was even possible for a CEO-level call to action to provide a platform that would enable technologists to build/re-build a safer operating environment that would prevent a similar systems meltdown from happening in the future.
I'll write further on the construction of this letter, as well as a potential follow-up letter that would be directed at other internal stakeholders (e.g., public relations, shareholders, finance, internal operations, IT).
In the meantime, if this letter resonated with you, I'd be honored if you could leave a quick comment below, so that we can show that these issues are relevant to technology organizations across all industry verticals, etc.
Thank you to Mary Kirby at Runway Girl and everyone at The Register for their great reporting. Thanks to Mik Kersten, Scott Prugh, Mike Nygard, Norman Marks, Sam Newman, Tim Birkett, and Brian Mericle for their always astute observations on Twitter and their encouragement.
I want to acknowledge the lifetime of work of Dr. Steven Spear, author of The High Velocity Edge: How Market Leaders Leverage Operational Excellence to Beat the Competition, from whom I've learned so much over the years. He describes how and why dynamic, learning organizations are able to win, such as Toyota and its suppliers, Alcoa, the U.S. Naval Reactor Corp, Pratt & Whitney, and so many more organizations. You'll find a bunch of information on his work in the Resources section below.
- Some links on the Space Shuttle Challenger accident:
Article about Edward Tufte and his assertion that better information presentation could have averted the NASA Challenger accident: http://www.asktog.com/books/challengerExerpt.html
My writeup of the amazing work of Dr. Steven Spear, including his book The High Velocity Edge: How Market Leaders Leverage Operational Excellence to Beat the Competition: https://itrevolution.com/devops-book-review-the-high-velocity-edge-by-dr-steven-spear/
The 30m video of Dr. Steven Spear speaking at DevOps Enterprise Summit 2015: https://youtu.be/onwhZwroQHs
An amazing one-hour lecture from Paul O'Neill, former CEO of Alcoa: he talks about the 17-year-old boy who was killed at an Alcoa plant at the 12m mark: https://youtu.be/0gvOrYuPBEA
Incidentally, here's a great (and IMHO, highly effective) video released by Edward Bastian, CEO of Delta Air Lines, apologizing for a system outage that affected 800 passengers: https://youtu.be/n0zaE03T2Sw
All of these themes are explored in The DevOps Handbook, mostly around the principles and practices of the Third Way: Building A Culture Of Continuous Experimentation and Learning.