I have yet again engaged in an argument on Twitter with people who think Root Cause Analysis is a terrible, bad and misguided thing, and I've decided to put this very simplistic explanation of why I don't agree with that perspective here so I can just point people at it the next time I need to have this discussion.
In the interest of making sure we're all talking about the same thing, let's define what a Root Cause Analysis is and isn't within the context of this blog post.
What is a "Root Cause Analysis"?
A Root Cause Analysis is an investigative process undertaken after a bad thing happens with the goal of determining why the bad thing happened.
This process may identify one or more causes (things that immediately precipitated the bad thing happening, which if prevented would have prevented the bad thing) and factors (things that made it easier for the bad thing to happen, perhaps leading to one of the causes).
This process may lead to recommendations on how to prevent similar bad things from happening in the future.
The term bad thing is contextual. An airplane crash is a bad thing. So is an information security breach, or a medical error, or a product malfunction. Some bad things are absolute catastrophes (Tenerife), others are dangerous situations (a toaster that fails to shut off when the bread is done - best case your kitchen smells bad, worst case your house burns down), and still others are mild inconveniences (your favorite meme website went down because of a server misconfiguration).
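To make the cause/factor distinction above concrete, here is a minimal sketch (my own illustration, not a formal taxonomy or standard tooling) of how an analysis might record causes, factors, and recommendations separately, using the hypothetical toaster from the example:

```python
from dataclasses import dataclass, field

@dataclass
class Analysis:
    """Record of a Root Cause Analysis, per the definitions above:
    causes immediately precipitated the bad thing (preventing any one
    would have prevented it); factors merely made it easier to happen."""
    bad_thing: str
    causes: list = field(default_factory=list)
    factors: list = field(default_factory=list)
    recommendations: list = field(default_factory=list)

# Hypothetical findings for the toaster that fails to shut off.
toaster = Analysis(
    bad_thing="Toaster kept heating after the bread was done",
    causes=["Shutoff timer relay stuck closed"],
    factors=["No thermal fuse as a backup", "Relay operated past its rated cycle count"],
    recommendations=["Add a thermal fuse", "Use a relay rated for more cycles"],
)

print(f"{len(toaster.causes)} cause(s), {len(toaster.factors)} factor(s)")
```

The point of keeping the lists separate is that a factor is still worth fixing even when it isn't a cause - addressing it makes the next bad thing less likely or less severe.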
What is a "Root Cause Analysis" NOT?
- A Root Cause Analysis is NOT a process for assigning blame or responsibility for a bad thing happening.
If your process is undertaken with the goal of pointing the magic finger of blame at a person you are not performing a Root Cause Analysis. We'll call this process Blamestorming.
- A Root Cause Analysis is NOT a way to absolve a person or system of responsibility for a bad thing happening.
If your process is undertaken with the goal of pointing the magic finger of blame away from yourself or a system you're responsible for (e.g. by attributing an incident to "Human Error") then you are not performing a Root Cause Analysis. This is just another form of Blamestorming.
- A Root Cause Analysis is NOT identifying things to fix.
It's nice if the result of your analysis includes recommendations and fixes, and this is certainly a goal of analyzing how and why things failed, but those are results, not the process itself.
So what's wrong with a "Root Cause Analysis"?
Hopefully my definitions above seem sane to you - if you're like me they probably seem like the Right Thing To Do when something breaks. So if we're adhering to the definitions I provided above of what Root Cause Analysis is and is not, what's the problem? Why do smart and qualified people keep saying "Root cause analysis is bad!" all the time?
Lots of reasons. Some I agree with, and some I don't.
The semantics of the word "Root"
A frequent complaint about the term "Root Cause Analysis" is a semantic one: The implication (by use of the word "root") that there is only one underlying cause for a bad thing happening, and that you can magically make that bad thing stop happening if you can just fix that one underlying cause.
This is sometimes true, but as the complexity of a given system increases the chances of there being a single easily identifiable cause rapidly approach zero. A proper "Root Cause Analysis" should not lead to a single magical cause, but rather to a list of causes and factors (those words having the meanings which I gave above).
The semantic point is one that I will readily concede - If you don't like the term "Root Cause Analysis" then call what you do something else. As long as it meets my operating definition above and steers clear of the rest of these caveats I'm happy that you're Doing It Right.
If you want to argue semantics beyond the above you should feel free to do so, but not with me: The term "Root Cause Analysis" is embedded in regulations I work with every day. In many cases I am legally required to use that phrase to describe my investigative process, and must identify something as a "Root Cause" at the end of the process or my company will be found to be in violation.
If that bothers you then know that I am equally bothered, but your fight is with the United States Government and the International Organization for Standardization, not with me.
The "Easy Fixes" Temptation
There is a natural human tendency to look for the easiest solution to any given problem we're presented with.
- Problem: The car's engine stopped running on the way to grandma's house.
- Root Cause: I ran out of gas ("Human Error").
- Solution: Take more gas next time, or stop for gas on the way.
Seems simple and logical, right? That's the easiest fix in the world. No investigation required, right?
Here's a piece of information you didn't have: When the engine quit I looked down at the dashboard and the gas gauge said "Full" but when we looked in the tank it was bone dry.
Suddenly the "Root Cause" has changed - The gas gauge in this car is broken. The solution is different too: Get the gas gauge fixed so you know how much gas you have, then you can make an educated decision on when you need to get more.
People tend to point at the temptation to find "Easy Fixes" as a problem with Root Cause Analysis, but that is disingenuous: This temptation exists no matter what your investigative process is, therefore the investigative process must include measures to ensure that you don't fall victim to that temptation. If the Pumpkin Process doesn't do that it's just another bad implementation of Root Cause Analysis with a different name.
A common method to avoid the "Easy Fixes" temptation is to use iterative tools like Five Whys or Why-Because Analysis to drive investigators to look deeper into a problem, but none of these tools will completely eliminate the temptation to look for an easy fix (and the tools themselves can be misused to lead an investigation down the primrose path). The best defense is a good investigative culture within your organization and a strong cross-disciplinary investigation team, and there's no magic recipe for achieving those things.
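The iterative idea behind Five Whys can be sketched in a few lines. This is my own illustration, not a standard library or a formal implementation of the method - the function name and the canned answers are hypothetical, and a real investigation gets its answers from evidence, not a lookup table. It uses the gas-gauge example from earlier in this post:

```python
def five_whys(problem, ask):
    """Iteratively ask 'why?' until no deeper answer is available,
    up to five levels. Returns the chain of problem -> causes."""
    chain = [problem]
    for _ in range(5):
        deeper = ask(chain[-1])
        if deeper is None:  # no further cause identifiable - a legitimate outcome
            break
        chain.append(deeper)
    return chain

# Hypothetical answers for the car example above.
answers = {
    "The engine stopped on the way to grandma's house.": "The car ran out of gas.",
    "The car ran out of gas.": "The driver believed the tank was full.",
    "The driver believed the tank was full.": "The gas gauge read 'Full' with a dry tank.",
    "The gas gauge read 'Full' with a dry tank.": "The gas gauge is broken.",
}

chain = five_whys("The engine stopped on the way to grandma's house.", answers.get)
for step, why in enumerate(chain):
    print(f"Why #{step}: {why}")
```

Note how the chain refuses to stop at "ran out of gas" - the easy fix - and keeps digging until it reaches the broken gauge. The tool only works if the people answering each "why" are willing to keep digging, which is exactly the cultural problem described above.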
Failure to learn
This one takes many forms, but ultimately the criticism is that organizations don't learn from doing a Root Cause Analysis: They put stuff down on paper, maybe make some fixes based on the recommendations, and then they go on to make similar mistakes later.
My view is that this, like the "Easy Fixes" temptation, is not an issue with the "Root Cause Analysis" process as I've defined it, because regardless of what you call your process it should be one where your organization looks at a bad thing that happened from a "How can we be better next time?" perspective. The problem is that I can't tell you how to guarantee that happens (and I don't think anyone else can give you a recipe for it either) - it has to be part of your organization's culture: You have to be more interested in being better than in personally being "right" or finding out who was "wrong."
I am not an organizational psychologist - or a psychologist of any kind - but having been in a lot of incident reviews analyzing bad things that happened I'm firmly of the opinion that "Failure to Learn" is brought about by the spectre of blame: It puts everyone in a defensive "This is not my fault!" stance, because nobody wants to be responsible for the bad thing. Being the one responsible is embarrassing ("You took down the company's website with that change!"), and you might even fear for your job.
This is why the investigative process must not be one in which blame is assigned, but even if you're scrupulous in making sure your process is not one that formally assigns blame you're still going to have problems: When a bad thing happens it often makes people upset and emotional, and talking about/investigating the bad thing will rekindle those feelings in people no matter how gently you approach it. It will feel like someone is being blamed - when you ask "Why did you take this action?" people will naturally defend their decision because human beings do not like to be wrong.
The unhealthy quest for 'cause'
I'm going to say something that may make a lot of people angry: Sometimes you just can't figure out why something happened, and that's OK! "The incident review board was unable to determine the root cause for this event based on the factual information available." is a perfectly legitimate outcome for a Root Cause Analysis.
That does NOT mean that you get to just throw your hands up and say "We don't know what happened!" - it means that you have performed a thorough investigation and found that there were no specific things which could have been done differently to interrupt the sequence of events that led to the bad thing happening. This is the nature of complex systems in the real world: Sometimes everything within your control goes right but a bad thing still happens, and you can't determine why based on the information available to you. (If you believe this is not possible and that every investigation can and must lead to a cause determination then you're just wrong, and I submit as proof the unsolved case files of every law enforcement agency in history.)
There is a tendency to call the above a "Failed Root Cause Analysis" - You didn't identify a cause for the bad thing, so you failed. That's an overly simplistic way of looking at Root Cause Analysis because it focuses on the outcome, not the process.
The value of Root Cause Analysis is in the process: Even if you do not identify specific dominoes in your failure cascade that you could remove to protect yourself in the future your investigation may identify things that you can improve. These things may or may not prevent a similar bad thing from happening in the future, but they still make your organization better.
The "Human Error" trap
Perhaps the most legitimate complaint about Root Cause Analysis is the overwhelming tendency to attribute bad things happening to Human Error. It's actually a common complaint from pilots when reading NTSB reports: "It's always pilot error!"
Human Error can be a legitimate cause. Consider an all-too-common real-world scenario:
A pilot plans to make a flight from point A to point B. This pilot does not have an instrument rating (which is fancy aviation talk for "You don't have the training required to fly into a cloud and survive the experience.").
Shortly before the flight the pilot calls Flight Service and receives a weather briefing, telling the briefer they intend to make the flight VFR (fancy aviation talk for "I don't want to go through any clouds because I don't have an instrument rating.").
They are advised that conditions are currently Marginal VFR (fancy aviation talk for "The weather is kinda lousy.") and forecast to become IFR (fancy aviation talk for "You WILL be in a goddamn cloud if you make this flight."). The briefer ends the briefing with the phrase "VFR not recommended," which is the sternest warning they can give a pilot - when the nice person on the phone tells you that, they're basically saying "If you try this you will kill yourself."
The pilot makes the flight anyway. The pilot flies into a cloud, comes out of it upside down, and crashes in the middle of a field. The aircraft is destroyed and the pilot is killed.
I read NTSB reports on incidents like that all the time. They all end the same way: "The NTSB determines the probable cause of this accident to be the pilot's improper decision to continue VFR flight into IMC." - Human Error.
This is a correct conclusion: Inherent hazards exist in the activity being undertaken, the human had ample data and warnings about these hazards, but the human chose to ignore the data and warnings. The human cannot be physically prevented from doing a bad thing in this case any more than a chef can be prevented from stabbing themselves with their own knife: Things have been made as safe as reasonably possible.
Human error is also often used as a cop-out. Consider another all-too-common real-world scenario:
A pilot departs on a 3 hour flight, with 4 hours of fuel on board. Two hours into the flight the engine quits, the pilot attempts an emergency landing, and the aircraft is badly damaged during the attempt.
The NTSB shows up and inspects the aircraft. They find that the left fuel tank contains no fuel, and that the right fuel tank is full. The fuel tanks were not damaged or compromised in any way, and there were no fuel leaks evident from either tank. The aircraft's fuel selector was found set to the left fuel tank.
I read NTSB reports like this all the time too; they're often about the exact type of airplane I fly. They also all end the same way: "The NTSB determines the probable cause of this accident to be the loss of engine power due to fuel starvation as a result of the pilot's fuel mismanagement." but that's a cop-out: The system is not as safe as it could be.
As pilots we know we have to switch tanks to manage fuel during flight. We make calculations of fuel burn, and we even have gauges on the instrument panel telling us how much fuel is in each tank. That's how our vintage 1960s aircraft design works, even in the PA28 aircraft being built today. The fuel system could be made safer, though: redesigning it to eliminate the need to switch tanks is possible (per-tank fuel pumps and check valves to prevent crossfeeding or starvation when one tank runs dry, or a dozen other solutions).
Instead the finger of blame is used, and the Blamestorming process points at the pilot rather than the system.
So what should I take away from all this?
That was a whole lot of words, but ultimately I really don't have any grand wisdom to impart here, folks. This is all I can offer you:
- Bad things will happen.
- When a bad thing happens you should investigate it and try to find out why.
- Your investigation should be a REAL investigation, focused on improving things rather than just assigning blame.
- If you're doing all these things, you're "Doing Root Cause Analysis Right".
- If you don't like the term "Root Cause Analysis" call it something else, just fucking Do It Right.