Sometimes a root cause analysis (RCA) simply cannot pin down the exact root cause. Some might recommend finding an expert to help solve the riddle, or to help prepare to mitigate the consequences of the next occurrence, since it is bound to happen. What else can you do? The fact is, in the real world critical evidence is often lost or destroyed, and yet you still have to do all you can within the constraints of time, money, resources and regulations to avoid a recurrence of the failure.
When the severity of the consequences dictates that preparing for the worst and hoping for the best is just not enough, here is one approach that can be used to systematically reduce the chances of the next failure. You can focus on the most probable cause or causes and either mitigate or eliminate them. Set up “data traps” (instruments or procedures to ensure that the critical determining information is not lost or destroyed) for the next time it happens, so that the real cause can be verified. ALWAYS verify.
How do you find the most probable cause or causes given the lack of complete data? First, construct a very detailed fault tree that covers all the physically possible causes, regardless of how improbable they appear. This usually requires a good team of people. Then systematically eliminate those pieces in each branch of the fault tree for which there is concrete evidence of a break in the sequence. That is, first eliminate what you can PROVE the cause WAS NOT. From a Sherlock Holmes story: “…when you have eliminated the impossible, whatever remains, however improbable, must be the truth.”
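The elimination step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class, the `prune` and `leaves` helpers, and the motor example are all hypothetical, and a real fault tree would carry far more detail. The one rule it encodes is the one above: remove only what the evidence proves was not the cause, and leave every "unknown" in place.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    gate: str = "BASIC"       # "OR", "AND", or "BASIC" (a leaf event)
    ruled_out: bool = False   # True only when evidence PROVES this was not the cause
    children: List["Node"] = field(default_factory=list)


def prune(node: Node) -> Optional[Node]:
    """Return a copy of the tree without disproved branches, or None if nothing is left."""
    if node.ruled_out:
        return None
    if node.gate == "BASIC":
        return Node(node.name)
    kept = [p for p in (prune(c) for c in node.children) if p is not None]
    # An AND branch needs every input, so one disproved input breaks the sequence.
    if node.gate == "AND" and len(kept) < len(node.children):
        return None
    if not kept:
        return None
    return Node(node.name, node.gate, False, kept)


def leaves(node: Node) -> List[str]:
    """List the basic events that are still possible causes."""
    if node.gate == "BASIC":
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# Hypothetical example: a motor failure with four candidate branches.
tree = Node("Motor failed", "OR", children=[
    Node("Bearing seized"),                      # unknown, no evidence either way
    Node("Winding fault", ruled_out=True),       # insulation test passed after the event
    Node("Overload went unprotected", "AND", children=[
        Node("Sustained overload"),
        Node("Overload relay failed to trip", ruled_out=True),  # relay tested good
    ]),
    Node("Supply voltage sag"),                  # unknown, no power quality record
])

remaining = leaves(prune(tree))
print(remaining)
```

In this made-up case, two of the four branches drop out, and the two survivors are exactly the places where a data trap would be worth installing.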
That alone is often enough to significantly narrow down what you have to deal with. It may still be a case of “shotgunning” the answer, but at least you are now aiming in the proper direction instead of pointing around indiscriminately. Often the process of elimination reduces the problem to a few causes that can be mitigated with reasonable effort. If the first round of elimination is still insufficient, apply failure rate data (human as well as hardware failure rates) to those events in the fault tree branches that are “unknown” because no concrete data, for or against, exists.
If there is no good failure rate data for the system under study or for similar ones, then go to published data that is used for risk assessment and risk management analysis. There is plenty of it in the public domain. However you get it, use the most appropriate and probable failure rate data you can find or generate. For instance, if you don’t know the failure rate for that particular motor, calculate a gross Mean Time Between Failures (MTBF) for the population of the site, or the department, or that class of equipment on site as a starting point. You have now at least removed some element of subjectivity and are probably within an order of magnitude of the actual failure rate.
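As a rough illustration of the gross MTBF calculation, the sketch below divides the total operating hours of a population by the number of failures it saw. The function name and all of the numbers (40 motors, 24,000 hours each, 12 failures, 8,000 running hours per year) are assumptions made up for the example, not site data.

```python
def gross_mtbf(unit_count: int, hours_per_unit: float, failures: int) -> float:
    """Gross MTBF in hours: total operating hours of the population / failures."""
    if failures <= 0:
        raise ValueError("need at least one recorded failure to estimate MTBF")
    return unit_count * hours_per_unit / failures


# Hypothetical population: 40 similar motors, 24,000 running hours each
# over the review period, 12 failures recorded in total.
mtbf_hours = gross_mtbf(unit_count=40, hours_per_unit=24_000, failures=12)

# Convert to an expected failure rate per motor per year,
# assuming 8,000 running hours per year (also an assumption).
failures_per_motor_year = 8_000 / mtbf_hours

print(f"Gross MTBF: {mtbf_hours:,.0f} hours")
print(f"About {failures_per_motor_year:.2f} failures per motor per year")
```

A figure like this says nothing about why any single motor failed, but it gives the "unknown" events in the fault tree a defensible starting number.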
Then it is just a matter of doing the simple probability math to obtain the overall probability of occurrence for each branch in the fault tree. Next, mitigate either the highest-probability branches or the few causes or events that are common to the most probable branches. When doing risk analysis, keep in mind that significance is generally measured in orders of magnitude. Unless two events or branches have probabilities at least an order of magnitude apart, they are, for all intents and purposes, equivalent. So don’t worry about whether the failure frequency is two times per year or five times per year; worry about whether it is two times a year or two times a decade.
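The "simple probability math" can be sketched as follows. This assumes the events in a branch are independent and that all probabilities refer to the same time interval; the helper names and the example values are hypothetical. An AND gate multiplies its input probabilities, an OR gate takes one minus the product of the complements, and two results are treated as different only when they are at least a factor of ten apart.

```python
import math


def p_and(probs):
    """Probability that ALL inputs occur (assumes independent events)."""
    return math.prod(probs)


def p_or(probs):
    """Probability that AT LEAST ONE input occurs (assumes independent events)."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def differ_significantly(a: float, b: float) -> bool:
    """True when two rates or probabilities are at least an order of magnitude apart."""
    hi, lo = max(a, b), min(a, b)
    return hi / lo >= 10


# Hypothetical branch probabilities over the same one-year interval.
branch_a = p_and([0.1, 0.02])    # both events required: about 0.002
branch_b = p_or([0.01, 0.005])   # either event suffices: about 0.015
branch_c = 0.3                   # a single basic event

print(differ_significantly(branch_c, branch_b))  # factor of about 20
print(differ_significantly(branch_b, branch_a))  # factor of about 7.5

# The frequency example from the text, expressed as failures per decade.
print(differ_significantly(20, 50))  # two vs. five times per year
print(differ_significantly(20, 2))   # two times a year vs. two times a decade
```

With these made-up numbers, branch C stands apart from the other two, while branches A and B are effectively tied, so mitigation effort would go to C first.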
Having done this, the risk of having chosen a poor corrective action still exists. But you have significantly reduced the overall risk without randomly trying to fix everything. And after all, risk reduction is what reliability engineering is all about. This method has been used on many occasions with success. Where time does not allow waiting for the next failure to get missing data, or the business cannot bear the consequences of the next failure, this approach meets the requirement to exercise the best due diligence possible.
© Life Cycle Engineering