MANAGING FAILURE ANALYSIS
BY: Ronald L. Hughes
To be a good failure analyst one must also be a good manager.
After all, failure analysis or problem solving is more than
just brainstorming a solution to an identified problem.
Successful analysis can only be achieved when a structured
technique that uncovers the facts of the incident being
investigated is used and adhered to at every step of the
analysis process. As the manager or Principal Analyst for
the failure, your management skills will not only be put
to the test but will also be an integral part of the investigation.
Managing The Failure Definition
The first step in the analysis effort is to clearly
define what constitutes a failure. This may sound simple,
but I can assure you that it is not. Ask anyone and they
will tell you that they know what their failures are.
Explore a little deeper and you will find that they
all know what is breaking down, but each of them cares for a
different reason. Many factors directly affect the reason
why we care, thereby changing our failure
definition. For example, consider a plant whose production
levels are low and whose maintenance, downtime, and parts costs
are high. In this example the Operations Manager considers the
low production levels to be the failure, while the Maintenance
Manager considers the Mean Time Between Failure (MTBF) and
Mean Time To Repair (MTTR) to be the failure. The Plant
Manager considers the low bottom line to be the failure
while the maintenance staff cares about the number of times
that they must repair the equipment. What we have here is
clearly a failure, but with a different failure definition at
every level of the organization. Now consider another
factor that affects how we feel about the failure: the
business environment.
Low production levels in a non-sold-out condition are not
as big a problem as high maintenance costs. Conversely, in
a sold-out condition maintenance costs are not nearly as
important as production levels and downtime. The job of
the Principal Analyst is to recognize these factors and
apply the necessary focusing tools (Impact-Effort
Matrix, Decision by Pairs, Force Field Analysis, Failure
Modes and Effects Analysis, etc.) to uncover those failures
that represent the greatest amount of potential return or
unrealized opportunity based on the right definition of
failure for the facility.
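As an illustration of one of these focusing tools, the sketch below sorts candidate failures into the four quadrants of an Impact-Effort Matrix. The candidate names, the figures, the cut-off values, and the quadrant labels are all assumptions made up for the example; a real facility would substitute its own estimates and its own definition of impact.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    impact: float  # estimated annual loss recovered (any consistent unit)
    effort: float  # estimated effort to analyze and fix (any consistent unit)


def quadrant(c, impact_cut, effort_cut):
    """Place a candidate failure in one of the four Impact-Effort quadrants."""
    high_impact = c.impact >= impact_cut
    low_effort = c.effort <= effort_cut
    if high_impact and low_effort:
        return "do first"
    if high_impact:
        return "plan as a larger project"
    if low_effort:
        return "do when convenient"
    return "defer"


# Hypothetical figures for illustration only.
candidates = [
    Candidate("pump seal leaks", impact=120, effort=10),
    Candidate("conveyor line stoppages", impact=300, effort=80),
    Candidate("gauge glass breakage", impact=5, effort=2),
    Candidate("lighting ballast failures", impact=3, effort=40),
]

matrix = {c.name: quadrant(c, impact_cut=100, effort_cut=20) for c in candidates}
for name, cell in matrix.items():
    print(f"{name}: {cell}")
```

Changing what is counted as impact (lost production in a sold-out market, maintenance cost otherwise) changes which failures land in the "do first" cell, which is the point of settling the failure definition before ranking anything.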
Managing the Scope of the Analysis
Don’t bite off more than you can chew! The size and
scope of the analysis you intend to tackle should not exceed
the available resources for the analysis effort. In other words,
the scope of the analysis should be scaled
to the resources available to conduct it. Always
remember that the bigger the scope, the bigger the analysis.
Process- or system-related analyses tend to be the largest
in size because of the many variables associated with the
modes of failure, whereas single components tend to be
the smallest due to the relatively few variables associated
with a single item. The key is to determine what is really
important and what you can reasonably manage. This is easily
done if you have already determined the amount of opportunity
by performing a Failure Modes and Effects Analysis (FMEA)
and know the available resources on hand. Here the scope
and the opportunity have already been identified. The goal
is to eliminate failure and recover opportunity as quickly
as possible by going after the biggest “bang for the
buck”. In essence, limit the scope of the analysis
at an early stage and get a payback as soon as possible.
By doing so it becomes easier to dedicate resources for
those analyses that are larger in scope and therefore more
time consuming to resolve. Although the analysis with the
largest scope may have the greatest potential return it
is not always the best analysis to go after first. Managing
the scope of the analysis is important when you realize
that an incomplete effort is worse than a smaller, completed
problem resolution. In effect, don’t go after world
hunger on your first attempt. Although it is an attractive opportunity,
it may be a bit more than you can chew with the available
resources at hand.
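One way to picture the "biggest bang for the buck" rule is a simple greedy selection: rank the candidate analyses by expected return per hour of effort, then take them in that order until the available hours run out. The sketch below assumes the returns and hours have already been estimated (for example from an FMEA); every name and number in it is hypothetical.

```python
def pick_analyses(candidates, available_hours):
    """Greedy choice: best return per hour first, skipping anything that no longer fits."""
    ranked = sorted(candidates, key=lambda c: c["return"] / c["hours"], reverse=True)
    chosen, remaining = [], available_hours
    for c in ranked:
        if c["hours"] <= remaining:
            chosen.append(c["name"])
            remaining -= c["hours"]
    return chosen


# Hypothetical estimates for illustration only.
candidates = [
    {"name": "plant-wide process upsets", "return": 900, "hours": 400},
    {"name": "pump bearing failures", "return": 150, "hours": 40},
    {"name": "valve packing leaks", "return": 60, "hours": 30},
]

selected = pick_analyses(candidates, available_hours=100)
print(selected)
```

With only 100 hours on hand, the plant-wide analysis is left for later even though it carries the largest total return; the two smaller analyses can be finished and paid back first.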
Managing the Failure Data
One of the most challenging aspects of any failure analysis
effort is the management of the data necessary to solve
the failure. Failure data provides the key that unlocks
the mystery when problem solving. The data tells you
the facts of the failure. Therefore, the management
of failure data is vital to the successful outcome of the
analysis.
It is not enough to merely sit down and identify the data
necessary to find the root cause(s) of failure; you must also develop
and implement a data collection strategy that ensures that
the integrity of the failure data is maintained. That means not just
identifying the person responsible for data collection,
but also deciding how they are going to obtain the data and what they
are going to do with it once it has been collected. Think
of it like a police investigation. The forensic strategy
is handled in such a manner as to ensure that all the evidence
is collected and stored until needed. Pictures are taken,
evidence is bagged and tagged for use in the investigation
and in court, all the witnesses are interviewed and their
statements recorded, locations and times are noted to determine
all the positional information, etc. The collection of failure
data should receive exactly the same stringent attention to detail
as the evidence collected at any crime scene.
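A data collection strategy of this kind can be sketched as a small record that captures who collects each item, how, where it is stored, and who has handled it since. The class and field names below are assumptions chosen for the illustration, not a prescribed format, and the sample entries are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class EvidenceItem:
    description: str   # what is being collected
    collector: str     # person responsible for collecting it
    method: str        # how it will be obtained
    storage: str       # where it is kept until needed
    collected_at: Optional[datetime] = None
    custody_log: list = field(default_factory=list)  # (time, person, action)

    def collect(self, when):
        """Record the time of collection and start the custody log."""
        self.collected_at = when
        self.custody_log.append((when, self.collector, "collected"))

    def transfer(self, when, to_person):
        """Record a hand-off so the item's history stays traceable."""
        self.custody_log.append((when, to_person, "received"))

    def current_holder(self):
        """Return whoever handled the item last, or None if not yet collected."""
        return self.custody_log[-1][1] if self.custody_log else None


# Hypothetical example entries.
item = EvidenceItem(
    description="failed bearing from pump P-101",
    collector="J. Smith",
    method="remove, photograph in place, bag and tag",
    storage="evidence locker, shelf 3",
)
item.collect(datetime(2003, 1, 15, 8, 30))
item.transfer(datetime(2003, 1, 15, 10, 0), "M. Lee")
print(item.current_holder())
```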
Managing the Analysis Team
Managing the analysis team consists of more than just managing
the people. It includes making sure you have the right
team, not only in size but also in makeup. A common mistake
made by most organizations is to form an ad hoc committee
composed entirely of subject matter experts (led by the
most senior or experienced of the experts) to solve the
egregious effects of the incident being investigated. The
results tend to be pre-tailored solutions for the specific
problem based on the expertise of the team. Make no mistake
about it: subject matter experts are absolutely
necessary to solve the failure, but to make sure all the possibilities
are covered, individuals who have little or no knowledge
of the failure being investigated should complement them.
Non-subject-matter experts bring the element of questioning
to the table. When they ask a question such as “can
this happen or occur?” the subject matter experts
must then think about the possibility and answer yes or
no to the question. The problem with a team composed solely
of subject matter experts is that they often overlook possibilities
due to their intimate knowledge of the failure. They believe
that they already know why the failure is occurring and
want to follow that path to uncover root cause(s). Non-subject-matter
experts want to explore all the possibilities because
they have no preconceived notions.
It is not necessary for the Principal Analyst to be a
subject matter expert in the failure. Quite the contrary:
this is often a detriment to the analysis effort because
the analyst will also have developed preconceived notions as to
why the failure is occurring. What the Principal Analyst
needs to be an expert in is the science of Problem Solving
or Failure Analysis.
The perfect analysis team is usually made up of 5 to 7
cross-functional people who have a common goal and commitment
to solving the failure under investigation. Proper management
of the team involves not only the selection of the right
people, but also the correct assignment of the individuals involved.
Each must have clearly defined roles and duties based on
their unique strengths and weaknesses. For example, every
team needs a critic to keep the team honest. Fortunately,
every organization seems to have an abundance of people
with this characteristic. The job of the Principal Analyst
is to make sure this individual is critical but not to the
point of disruption.
Managing the Analysis Effort
The first step in managing the actual analysis effort is
to determine what you expect from the final outcome. This
can be easily accomplished by developing a charter that
clearly delineates the terminal objective of the analysis.
This is further enhanced through the development of critical
success factors that will tell you whether or not the terminal
objective has been attained. For example, if you are solving
a problem involving an administrative issue such as slow
invoice processing, your charter could be something like
the following:
“Uncover the root causes of the recurring invoice
processing problems. This includes identifying deficiencies
in or lack of management systems. Appropriate recommendations
for root causes will be communicated to management for rapid
resolution.”
Examples of possible critical success factors could include
the following:
- Reduce invoice processing turnaround time from two
  weeks to one week.
- No lost invoices.
- No incorrect invoices.
- Maintain an invoice tracking system that is 100% accurate.
By developing a good charter and critical success factors
for the analysis, the team gains a common goal and a focusing
mechanism to keep it on track and stop it from straying
off on tangents.
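Because critical success factors are meant to be measurable, they can be written down as simple pass/fail checks. The sketch below encodes the four invoice-processing factors listed above; the measured values and the dictionary keys are hypothetical and stand in for whatever the facility actually tracks.

```python
def check_success_factors(measured):
    """Evaluate each critical success factor against the measured results."""
    return {
        "turnaround of one week or less": measured["turnaround_days"] <= 7,
        "no lost invoices": measured["lost_invoices"] == 0,
        "no incorrect invoices": measured["incorrect_invoices"] == 0,
        "tracking system 100% accurate": measured["tracking_accuracy_pct"] == 100,
    }


# Hypothetical measurements taken after the recommendations were implemented.
measured = {
    "turnaround_days": 6,
    "lost_invoices": 0,
    "incorrect_invoices": 2,
    "tracking_accuracy_pct": 100,
}

results = check_success_factors(measured)
objective_met = all(results.values())

for factor, passed in results.items():
    print(f"{factor}: {'met' if passed else 'NOT met'}")
print("terminal objective attained:", objective_met)
```

In this made-up case three factors are met, but the incorrect invoices mean the terminal objective has not yet been attained.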
When failure analysis begins, the goal of the Principal Analyst
is to make sure that the logic is sound and that all hypotheses
have been proven or disproved. Here it is good to understand
that the Principal Analyst manages the analysis and is responsible
for its successful outcome. He owns the process; the team
owns the failure. Keeping this in mind, if the team can
prove it to the Principal Analyst, then he can subsequently
prove it to management.
Often during the logic tree development portion of the
analysis, team members will disagree and some conflict will
result. This conflict is not necessarily a bad thing. With
conflict comes valuable discussion. As long as the conversation
is pertinent to the analysis and provides benefit, it should
be allowed to continue. The trick is to keep this conflict
from becoming confrontational and therefore detrimental
to the analysis. One management technique used to maintain
control during the analysis is for the Principal Analyst
to ask questions that will help to clarify points. Questioning
not only minimizes the amount of conflict between the team
members but also keeps the team focused. This is especially important
for those team members who are not subject matter experts
in the failure under investigation.
Managing the Final Report
The final report is the alpha and omega of the failure.
It represents the culmination of the analysis effort and
the beginning of failure elimination. Remember that the
goal of any failure analysis should be the elimination of
identified causes. The final report is the tool used to
obtain the resources necessary to implement solutions to
the uncovered root cause(s) of the failure thereby achieving
that goal. In essence, the final report can be thought of
as a sales tool and should be developed with that in mind.
At a minimum, the final report should not only provide solutions
with expected returns on investment but also identify how
the failure occurred in the first place. To accomplish this,
an event summary, a description of the failure mechanism,
and a list of recommendations should be included in the report.
The event summary is nothing more than a brief description
of how the failure was first noticed, how long it has been
going on, and the method(s) used to isolate or mitigate the
consequences of the failure.
The failure mechanism can be thought of as a summary of
the root cause(s) that led to failure occurrence. It chronologically
characterizes the things that must occur in order for the
failure to manifest itself.
The list of recommendations should not only explain what is to be done,
when, and who is going to be responsible for implementation;
it should also include a detailed cost-benefit ratio associated
with each recommendation.
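A recommendation list of this kind might be laid out as in the following sketch, which records the what, who, and when for each item and computes a benefit-to-cost ratio so the strongest case appears first. The recommendations, owners, dates, and dollar figures are invented for illustration, and a plain benefit divided by cost is only one assumed way to express the ratio.

```python
def with_ratio(recommendations):
    """Attach a benefit-to-cost ratio to each item and sort the best ratio first."""
    rated = [dict(r, ratio=r["benefit"] / r["cost"]) for r in recommendations]
    return sorted(rated, key=lambda r: r["ratio"], reverse=True)


# Hypothetical recommendations for illustration only.
recommendations = [
    {"what": "replace invoice tracking spreadsheet", "who": "Accounts Payable lead",
     "when": "within 90 days", "cost": 20000, "benefit": 50000},
    {"what": "add a receipt check at mail intake", "who": "Office manager",
     "when": "within 30 days", "cost": 5000, "benefit": 40000},
]

report = with_ratio(recommendations)
for r in report:
    print(f'{r["what"]} | {r["who"]} | {r["when"]} | benefit/cost = {r["ratio"]:.1f}')
```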
Summary
The success or failure of your problem solving efforts
often depends on the management strategies used to conduct
the analysis. A sound management strategy must be devised
and put into place for every step in the Root Cause Analysis
process in order for the analysis to be both effective and
efficient.
Obviously, collecting and maintaining the paperwork associated
with the failure investigation can be a daunting task. For
this reason the use of software that is designed specifically
for this purpose is extremely beneficial and is highly recommended.
Although there are several packages on the market, RCI’s
PROACT® is by far the best and most complete of the
software packages designed for this purpose.
RCI’s PROACT® software not only makes this difficult
job seem almost effort free, but also provides a mechanism
that allows easy and ready access to all the pertinent data
associated with the analysis, including the structured logic
tree. Failure data is maintained in a database unique to
the failure and can be sorted by type, person responsible
for its collection, date required, etc.
Of equal importance to the analysis is keeping track of the
verification techniques used for the hypotheses pertaining
to how the failure occurred. PROACT® automatically requires
the completion of a verification log once a hypothesis is
identified. This log can then be retrieved at any time to
determine how to proceed with the analysis. In addition,
PROACT® has many features that help the analyst do his
job. It will help you to determine what your critical success
factors are for the analysis, write a report on the analysis,
communicate your findings to management, and track the results
of your analysis efforts, just to name a few.
As a failure analyst I find that PROACT® is an invaluable
tool for doing my job. My analysis efforts are not only
easily managed, but are much quicker than ever before.
Mr. Hughes, a mechanical engineer,
is a member of the American Society of Mechanical
Engineers (ASME) & the American Society of Training
and Development (ASTD). He is currently a Senior Training
and Reliability Consultant with Reliability Center,
Inc. (an engineering and consulting firm). His expertise
encompasses all areas of Human and Plant Reliability
including the training/mentoring and facilitation
of Root Cause and Opportunity Analysis efforts worldwide
for client companies.
©2003 Reliability Center, Inc. All rights reserved.
P.O. Box 1521, Hopewell, VA 23860
Phone: (804) 458-0645  Fax: (804) 452-2110
Email: rhughes@reliability.com