When Human Error Is a Good Thing

If someone makes a mistake, don't chastise them; congratulate them for finding a flaw in the process, then look at ways to patch or repair that flaw.

Yes, a person broke this plate, but why? Human error is anything but simple.

I have recently finished reading Sidney Dekker's The Field Guide To Understanding Human Error, and while the book is generally about safety, I believe the lessons learned could be applied to incident management with IT systems.

He first talks about the "bad apples" who make mistakes because they're not trying hard enough, or not paying enough attention and missing some significant detail. Dekker calls this the "old view," and I don't find its popularity surprising.

We as humans tend to see our creations as perfect, especially when it comes to processes we develop. We develop these complex steps that need to be followed, and when someone makes a mistake, it is their fault for not following the process. It is their fault for not trying hard enough. It is their fault for missing that crucial bit of information that could have prevented the incident.

So, when you have this mindset, what is the solution? Reprimand or fire the person who made the mistake? Give the person more training? Increase adherence to the process or make the processes stricter? Add more contingencies and paths to handle any situation? Add more technology?

According to Dekker, though, these steps do not work. The problem with blaming "someone" is that it stops learning. The investigation stops, and you're done. If you continue to investigate instead, you will see the underlying problems that caused the incident in the first place.

In this view, which Dekker dubs the "new view," human error is a symptom of an underlying problem. Something more systemic. The incident that occurred is but the start of the investigation, not the end.

First things first, you need to assume that when people come to work, they come to do a good job. If you have someone coming to work to cause havoc, then this is something different, but you can still investigate what's going on. This can be especially important for a long-term employee: what drove them to do this? For a short-term employee, you may ask what you missed in your hiring procedures.

One reason that the old view is popular is hindsight. You know what the effects of the actions were because the incident has already happened. We know that if you run "chown -R root /*", you change all files to have root as the owner (OK, I admit that the command could be wrong; I'm not willing to try it to verify), completely screwing up the system (and yes, I did that early in my career, and for the record, it does not save time when you're trying to change permissions on several directories in a hurry).

The thing is, when the person is taking the action at the time, they don't know what will happen. And what's more, you don't know what is going through the person's mind at the time they're performing the action.

Could they be concentrating on something else that they deemed important at the time? Could their priorities be elsewhere?

For example, hypothetically, a pilot brings a plane through a storm and crashes (no one injured, but lots of damage to the aircraft). Should they have flown through the storm? Obviously not, now that we know the consequences.

But let's say that they were already several hours delayed. Their priority was to get the passengers back on time. Had they diverted around the storm, they would have been late and reprimanded. Had they gone through the storm and nothing happened, they would have been heroes to the passengers. At the time that the decision needed to be made, without knowing the consequences, what would you have done?

Another example Dekker goes through is a certain type of aircraft that had a large number of crashes during WWII. Pilots were pulling the lever for the flaps instead of the landing gear. The two levers were near each other. Everything was tried, from reprimanding pilots to retraining them, but the crashes continued.

It wasn't until an engineer looked at the problem in a different way that things started to change. What the engineer did was glue little flaps on the lever for the flaps, and little wheels on the lever for the wheels. The pilots could now find the levers by touch while their concentration was focused on landing, which pretty much eliminated the crash landings altogether.

IT Incidents

How many of you record incidents? Do you put them into Jira, or some other bug tracker? Do you do anything with the incidents you record? Or, do you just fix the immediate issues and move on, only for the incident to occur again and again?

Recording incidents, even large ones, does not fix the problem.

And even if something is error-free, which is rather unlikely given how complex our jobs are, there can still be circumstances that cause issues. Users do tend to find ways to break things.

Sometimes, and I'm a big sucker for this one, we think that replacing a person with technology will prevent the issue. The problem here is that it may fix the immediate issue, but what about boundary conditions that were never thought of? These could turn a minor issue that a human could have resolved into a catastrophic one.

Dekker doesn't say "don't automate," but rather to be careful what you automate. Make sure it augments the person rather than replaces the person. Technology is good for repetitive problems, but isn't so good at handling changing conditions. Only a human can adapt to those.
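
To make "augment rather than replace" a little more concrete, here is a minimal Python sketch of a guard around a recursive ownership change like the chown example earlier. Everything in it is an assumption made for illustration: the function names (is_risky_target, guarded_chown_command), the list of protected paths, and the idea of passing in a confirm callback are hypothetical, not something from Dekker's book or any real tool. The sketch only builds the command; it never runs it.

```python
from pathlib import PurePosixPath

# Hypothetical list of paths where a recursive ownership change
# is almost never what the operator intended.
PROTECTED = {"/", "/bin", "/boot", "/etc", "/lib", "/sbin", "/usr", "/var"}


def is_risky_target(target: str) -> bool:
    """Return True if target contains a glob or is a protected path."""
    if "*" in target:
        return True
    # Drop trailing slashes, e.g. "/etc/" becomes "/etc".
    # Note: ".." segments are NOT resolved in this sketch.
    normalized = str(PurePosixPath(target))
    return normalized in PROTECTED


def guarded_chown_command(owner, target, confirm):
    """Build the chown argument list, or return None if the human declines.

    confirm is a callable that receives a prompt string and returns
    True or False. The tool flags the risk; the person makes the call.
    """
    if is_risky_target(target):
        if not confirm(f"Really run chown -R {owner} {target}?"):
            return None
    return ["chown", "-R", owner, target]
```

In interactive use, confirm could be something like `lambda prompt: input(prompt + " [y/N] ").strip().lower() == "y"`. The point of the design is that the check does not decide for the person or block them outright; it slows them down at the one moment where a hurried mistake is expensive, much like the shaped levers did for the pilots.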

Look at the error that someone made as the start of an investigation. Look at what caused the issue in the first place; what state of mind was the person in? What were their incentives at the time? What can be done to alleviate those issues to prevent the problem in the first place? Amazon does this, so there's probably some wisdom there. 

The thing is, to prevent incidents, you have to rely on the people you have. Make sure they have the knowledge, experience, and support to handle incidents. Make sure that they are the ones who figure out how to permanently solve the issue. They are the experts on the issue, after all, because it is part of their job.

If someone makes a mistake, don't chastise them; congratulate them for finding a flaw in the process, then look at ways to patch or repair that flaw.

This may start a culture where mistakes are not hidden, but made out in the open. When they are out in the open, they can be fixed -- permanently. This can only make your organization better.

And this, in my opinion, is the crux of Agile. Finding the problems in the system and fixing them to make you faster, make you better, make you more knowledgeable, and make you feel safe enough that you can expose more problems in the system.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.
