
J Fail. Anal. and Preven. (2009) 9:185–192
DOI 10.1007/s11668-009-9226-1

FEATURE, PEER-REVIEWED

A View on the General Practice in Engineering Failure Analysis


S. K. Bhaumik

Submitted: 30 September 2008 / in revised form: 28 January 2009 / Published online: 20 March 2009
© ASM International 2009

Abstract Analysis of engineering failures is a complex
process that requires information from personnel having
expertise in many areas. From the information gathered, a
failure analyst tries to discover what was fundamentally
responsible for the failure. This fundamental cause is
termed the "root cause" and helps in the determination of
the sequence of events that led to the final failure. Root
cause analysis also helps in finding solutions to the
immediate problem and provides valuable guidelines as to
what needs to be done to prevent recurrence of similar
failures in future. However, experience suggests that most
failure analyses fall short of this goal. A significant number
of failure analysts incorrectly use the term "root cause"
when what they really establish is the primary cause of
failure or simple physical cause. This paper examines a few
service failures to demonstrate that the term "root cause" is
not adequately understood.
Keywords Engineering failures · Errors · Failure analysis · Root cause

S. K. Bhaumik
Failure Analysis & Accident Investigation, Materials Science
Division, National Aerospace Laboratories, Council of Scientific
and Industrial Research (CSIR), Bangalore 560 017, India
e-mail: subir@css.nal.res.in

Introduction
Failure represents an adverse situation wherein a component or assembly fails to satisfactorily perform its intended
function. In other words, failure can be defined as the gap
between the expected performance and the actual performance of any component or assembly. The purpose of
failure analysis is to establish the mechanism and causes of
the failure and to recommend a solution to the problem.
Often failures do not just happen but are caused, and
determination of the cause for the failure helps to identify
what exactly went wrong and what needs to be done to
avoid similar failures in future. Even the most sophisticated
simulation testing cannot adequately duplicate the varied
factors and the many unanticipated events that may lead to
failure. Hence, failure analysis offers the most reliable tool
for ensuring the continuing safety of a component, an
assembly, or a system.
Analysis of engineering failures is a formidable, complex, and challenging task. It is a task that requires
information from personnel with expertise in many areas. It
also demands tremendous responsibility and coordination
on the part of the analyst and a thorough knowledge of
materials science supplemented with an appreciation for,
and willingness to apply, related engineering disciplines.
The effort to identify the root cause of failure not only
helps to solve the immediate problem, but provides valuable guidance as to what needs to be done to prevent
recurrence of similar failures in a given system or organization. However, experience suggests that most failure
analyses fall short of this goal. This is because a significant
fraction of failure analysts incorrectly use the term "root
cause" when what they really establish is the primary cause
of failure or the simple physical process of failure [1, 2].
This aspect of the failure analysis process is discussed in
this paper. A few examples of service failures are cited
wherein the immediate causes of physical failures were
obvious, but the underlying causes or the root causes
leading to these failures were traced to errors involving
human factors that are often overlooked.


Failure Analysis Process: Approach and Expertise


In a way, failure analysis is much like the work of a
detective. The process begins with the collection of information and then proceeds to the examination of hardware
and the gathering of unique data from each step throughout
the failure analysis process. The data collected, when
analyzed properly, provides insight into what may have
caused the failure and what contributing factors may have
been involved. There is no preset procedure for conducting
failure analysis, and it is largely dependent on the nature of
the failure and the expertise of the analyst [1, 3, 4]. Nevertheless, the following general approach is recommended
for a successful failure investigation.
1. Identify and describe the apparent problem or reason(s) that the system was not working properly
2. Collect information related to the problem, separating facts from assumptions and inferences
3. Define the real problem and look beyond the immediate cause(s)
4. Generate data by evaluations and analysis and identify primary cause(s)
5. Identify all probable root causes
6. Identify most likely root cause(s)
7. Generate probable solutions and courses of action
8. Evaluate the merits, risks, potential for success and implications of respective options
9. Select the best option available and develop a suitable plan of action
10. Implement the plan of action
11. Follow-up and obtain feedback
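
The ordering of these steps can be made explicit in a short sketch. The Python below is only an illustration and not part of the recommended procedure: the names Step, last_step_completed, and stopped_before_root_cause are hypothetical, and treating step 6 as the point at which root cause determination is reached is an assumption made for the example.

```python
from enum import IntEnum
from typing import Iterable


class Step(IntEnum):
    """The eleven steps of the general approach, in the order listed."""
    DESCRIBE_APPARENT_PROBLEM = 1
    COLLECT_INFORMATION = 2
    DEFINE_REAL_PROBLEM = 3
    IDENTIFY_PRIMARY_CAUSES = 4
    LIST_PROBABLE_ROOT_CAUSES = 5
    SELECT_LIKELY_ROOT_CAUSES = 6
    GENERATE_SOLUTIONS = 7
    EVALUATE_OPTIONS = 8
    SELECT_PLAN = 9
    IMPLEMENT_PLAN = 10
    FOLLOW_UP = 11


def last_step_completed(completed: Iterable[Step]) -> int:
    """Return the highest step reached with no earlier step skipped (0 if none)."""
    done = set(completed)
    reached = 0
    for step in Step:  # iterates in definition order, 1 through 11
        if step not in done:
            break
        reached = step.value
    return reached


def stopped_before_root_cause(completed: Iterable[Step]) -> bool:
    """True when the work ended before step 6 (assumed root cause milestone)."""
    return last_step_completed(completed) < Step.SELECT_LIKELY_ROOT_CAUSES
```

For an investigation that ends once the primary cause is identified (steps 1 to 4), stopped_before_root_cause returns True.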

Experience suggests that unfortunately, in many
instances, the investigation stops after identification of the
immediate cause of failure. The failure analysis reports
submitted merely describe the physical failure of the
component or system. Hence the recommendations suggested do not provide long-term solutions to the problems.
What could be the reasons for such incomplete investigations? Is it because of the lack of initiative from the
customer/client or lack of expertise of the analyst?
It is important to note that the outcome of an investigation depends on the attitude of the customer/client as
well as the expertise of the failure analyst. For any investigation, the information provided by the customer/client
forms a vital input to the analysis. If the inputs are incorrect
or conceal relevant information, a failure analyst can do
very little to discover the root cause of a failure. This is
common when the customer/client has a vested interest in
the outcome of the analysis. Additionally, even when all
relevant information is made available, the investigation
can still be unsatisfactory because of inadequate expertise
of the analyst. Many failure analysis practitioners follow an
established procedure for the evaluation of failed components/systems, and in the process they ignore the complexity involved in establishing the facts of the failure.
They may adopt a preset procedure without even trying to
discover the situations or conditions under which the failure had occurred. The use of a preset procedure generally
occurs because of a lack of appreciation that the same
physical failure can be arrived at in many ways. Because of
the many potential paths to a specific physical failure, an
understanding of the relative importance of various factors
in the specific case at hand is essential. It must be borne in
mind that all failures are unique, and, hence, each of them
should be treated uniquely. As described by Dennies [1],
established recipe-type procedures are generally inadequate in determining the root cause(s) of a failure.

Analysis of a Few Investigations: Case Studies
The failure analysis trends in organizations vary widely
from one to another, and the process adopted is often
dependent on the organizational culture or rules [1, 2].
Root cause analysis shows that eventually all failures are
caused by human errors. According to Zamanzadeh et al.
[2], these can be broadly classified into the following three
categories:

• Errors of knowledge
• Errors of performance (which might be caused by negligence)
• Errors of intent (which might even be acts of greed or sabotage)

Over the past four decades, about 1100 cases of service
failures have been analyzed in the author's laboratory. In
some of these cases, the investigations were referred to the
author's laboratory for a second opinion after the failures
had been investigated elsewhere. There were also a few
occasions where more than one laboratory was involved in
the investigations. A statistical survey indicates that in a
significant number of cases, it was not possible to complete
the investigations to a desired level of root cause determination since the contributions of the human element in
the causal chain of the failures were significant. Human
factors are generally very difficult to investigate within any
organization, be it a manufacturer or a user. The investigation into human factors often raises questions about
organizational culture that may need to be changed before
implementing the recommendations and preparing to prevent/minimize future failures. When culture changes are
required, the failure analyst faces the daunting task of
determining how to proceed further with the investigation
without upsetting the customer. In the author's opinion,
this is one of the major reasons why a large number of
investigations of engineering failures are concluded without any attempt at root cause
determination. The investigation simply stops after the
identification of the physical cause(s) for the failure, and
the impact of human error is not thoroughly investigated.
However, the failure analyst may also contribute to such an
incomplete investigation because of a lack of expertise.
The following examples amply illustrate these facts.

Fig. 1 Fractured tooth of gear
Fig. 2 Fatigue striations
Fig. 3 Spalling on the working flank

A Gear Failure: Deviation in Heat Treatment Procedure
Fabrication processes for aircraft components are generally
well designed and strictly controlled because of safety
concerns. This is supplemented by inspection at
various stages of the fabrication schedule and maintenance of
records at each stage. Therefore, deviation in the manufacturing process/parameters, if any, is easily traceable.
Although the engineers working in the production shop are
well aware of this fact, errors can still occur. The following
is an example.

The pilot of a helicopter reported continuous chip
warning in flight. On inspection, a large chip with one side
shining and trapezoidal in shape was found attached to the
magnetic chip detector. This was followed by a thorough
examination of the main gearbox, and it was discovered
that the metal piece observed in the chip detector was a
fractured fragment of a tooth from the gear.
Metallurgical failure analysis showed that the tooth of
the gear had failed by fatigue and there were multiple
fatigue crack initiation sites at the dedendum of the
working flank (Figs. 1, 2). Detailed examination revealed
that the fatigue crack initiation was promoted by severe
spalling on the tooth flank (Fig. 3).
The material of construction and the microstructure of
the core material of the gear tooth were found to be satisfactory. However, examination showed that there was
wide variation in the case depth across the length of the
tooth flank. While the hardness profile was satisfactory
toward the top land and root of the teeth, the central region
had a hardness profile typical of a decarburized case with
surface hardness much lower than the acceptable limit
(Fig. 4). The reason for the physical failure of the gear
tooth was established. The customer was satisfied with the
findings and requested an analysis report. It is not
uncommon for investigations to be concluded at this
stage. However, was the investigation complete? Did the
analyst identify anything beyond the physical cause of
failure? Would it be possible to suggest recommendations
based on this analysis? Hence, the question arises, "Why
was the investigation concluded?" Was it because of the
customer's attitude/intention to discourage further investigation or the failure of the analyst to
appreciate that root cause analysis of the failure had not
been performed?
The next part of the investigation should be to identify
why the decarburization of the gear occurred. Was it
because of improper control of process parameters during
the postcarburizing heat treatment or because of a deviation
in the carburizing process itself? Examination of the documents showed that recommended process parameters
were maintained during the heat treatment of the gear.
However, it was discovered that this particular gear was
re-heat-treated, which, of course, was a deviation from the
recommended practice. Why was the deviation allowed by
the shop engineer? The reason was not very clear from the
records.

Fig. 4 Hardness profile on gear tooth
Fig. 5 Fractured brace bolt

The rationale behind the decision of re-heat-treatment
taken by the shop engineer was unknown for some time.
Subsequent interrogation revealed that the gear was
rehardened and tempered to take care of the excessive
distortion noticed after the first heat treatment schedule.
The engineer was unaware of the fact that repeating the
heat treatment does not correct the distortion already
introduced in the component and that the reheat involves
the risk of decarburization of the case with an undesirable
hardness profile on the teeth. What are the corrective
measures to be taken to prevent such practices? Can
transferring the engineer to another section solve the
problem? Root cause analysis is required to answer these
questions, and, therefore, both the failure analyst and the
organization have to be keen on pursuing the investigation
further.
What is the root cause in this failure? Is it an error
involving insufficient knowledge, education, training, and/
or experience or an error of intent to salvage the defective
components? Is the failure restricted to an individual or the
result of an organizational culture? Investigation of these
aspects is necessary for suggesting any actionable plan as
a long-term solution. Merely asking the organization to
follow the recommended practice or transferring the personnel may solve the immediate problem, but it does not
guarantee that similar defective
components will not be produced in the organization in the long run.


Failure of a Brace Bolt: Incorrect Surface Reconditioning Procedure
A brace bolt fitted on to the oleo leg assembly of the main
undercarriage of an aircraft had fractured unexpectedly
during the functional check. The main undercarriage was
manufactured in 1978 and had been overhauled four times.
The last overhaul was carried out in 2003, and the undercarriage was cleared for an additional 1500 landings and
was kept in stores until 2007. A functional check (retraction and extension) was being carried out, and while
retracting for the second time at normal pressure the brace bolt
failed. The brace bolt, made of high-strength steel, had been
zinc plated and phosphate coated.
The brace bolt was found to have fractured at the change
of cross section and outside the threaded region (Fig. 5).
The fracture surface had a smooth and speckled appearance. There were two steps on the fracture surface
indicative of propagation of two crack fronts leading to the
fracturing of the component (Fig. 6). Fractographic study
showed that the majority of the fracture was by brittle intergranular mode (Fig. 7). However, at the steps, the fracture
features were entirely different. These regions showed
dimple rupture, typical of that observed in overload ductile
failure (Fig. 8). Similar fractographic features were
observed along the circumference of the bore. There was
no evidence of fatigue failure. Also, there were no signatures of corrosion on the fracture surface. Metallographic
study revealed secondary cracks perpendicular to the
fracture surface at a few locations.
The aforementioned fractographic features are typical of
hydrogen embrittlement failure. In hydrogen embrittlement, the crack nucleates and propagates over a period of
time under sustained tensile load. It appears that two cracks
had developed in the brace bolt at diametrically opposite
locations in the fillet region of the change of cross section.
These two cracks then propagated progressively through
the thickness. The fracture surfaces generated by these two
individual cracks were at different planes, and, hence, the
bolt was held together by a thin ridge of material separating
these two crack surfaces. Also, the cracks had propagated
almost through thickness, leaving only a rim of material
along the bore. After the fitting of the undercarriage onto
the aircraft and during functional check, the bolt fractured,
creating two steps on the fracture surface and shear lip
along the circumference of the bore.

Fig. 6 Fracture surface of the brace bolt

The failure mechanism of the brace bolt was established.
However, what was the source of hydrogen that caused the
hydrogen embrittlement failure? Study of the records
showed that the brace bolt had completed 4000 landings
and the surface of the bolt was reconditioned during the last
overhaul after ensuring no cracks in the component through
magnaflux inspection. Hence, it is probable that the brace
bolt had picked up hydrogen at the time of surface reconditioning during the last overhaul. The steps followed
during surface reconditioning were:
1. Depainting
2. Sand blasting
3. Zinc metallizing
4. Oxyparkerizing (phosphating)
5. Repainting
6. Assembly and storage since 2003

Fig. 7 Intergranular fracture (brittle) features

Study showed that the brace bolt could pick up hydrogen
during the depainting and oxyparkerizing processes.
Therefore, baking of the component immediately after
these processes should have been incorporated in the procedure. It is puzzling that the baking process
was not incorporated during surface reconditioning of used
parts, because it was a process requirement for the production of new parts. In this case, what is the root cause of
failure: hydrogen embrittlement or incorrect process for
surface reconditioning? Was the failure caused by an error
of knowledge or an error of performance or an error of
communication? What are the corrective measures to be
suggested to prevent similar failures? Can they be identified
without the involvement of the organization that reconditioned the bolt? In the author's opinion, probably not.
Moreover, should the recommendations address only
this specific case of brace bolt failure or the manufacturing
of all components where vulnerability to hydrogen
pickup exists during the surface treatments associated with
reconditioning?
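
The missing baking step can be expressed as a simple check on the recorded process sequence. The sketch below is illustrative only: the step labels are taken from the reconditioning sequence reported for the brace bolt, while the label "baking" and the function name missing_bakes are hypothetical, and the rule that baking must be the very next step is an assumed reading of "immediately after".

```python
# Steps recorded for the last overhaul, in the order reported.
RECONDITIONING_STEPS = [
    "depainting",
    "sand blasting",
    "zinc metallizing",
    "oxyparkerizing",
    "repainting",
    "assembly and storage",
]

# Steps identified in the study as possible sources of hydrogen pickup.
HYDROGEN_RISK_STEPS = {"depainting", "oxyparkerizing"}

BAKING = "baking"  # hypothetical label for the hydrogen-relief bake


def missing_bakes(steps):
    """Return the hydrogen-risk steps that are not immediately followed by baking."""
    missing = []
    for i, step in enumerate(steps):
        if step in HYDROGEN_RISK_STEPS:
            next_step = steps[i + 1] if i + 1 < len(steps) else None
            if next_step != BAKING:
                missing.append(step)
    return missing


# The recorded sequence has no bake after either risk step.
print(missing_bakes(RECONDITIONING_STEPS))
```

Such a check flags only the physical omission. It cannot say why baking was left out of the procedure for used parts, which is the root cause question raised above.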
Fig. 8 Dimple rupture (ductile) features at the steps and along the rim of the bore

Failure of a Rod-End Top: Inadequate Processing Equipment Maintenance
The rod-end top was made of a titanium alloy and fitted
with a spherical bearing by a process referred to as staking.
Inconsistencies in fatigue life of the component were
noticed during mandatory qualification tests of a few production batches. The acceptance criterion for fatigue life was
fixed as 10 × 10⁶ cycles at 20 kN load, and failures
occurred after as few as 1.86 × 10⁵ cycles.
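
To put these two figures side by side, the short calculation below compares the shortest reported life with the acceptance criterion. It uses only the two values quoted above; the variable names are chosen for illustration.

```python
REQUIRED_CYCLES = 10e6   # acceptance criterion: 10 x 10^6 cycles at 20 kN
SHORTEST_LIFE = 1.86e5   # earliest reported failure, in cycles

# Fraction of the required life actually achieved by the worst component.
fraction_of_required = SHORTEST_LIFE / REQUIRED_CYCLES

# How many times longer the component should have lasted.
shortfall_factor = REQUIRED_CYCLES / SHORTEST_LIFE

print(f"{fraction_of_required:.2%} of the required life")
print(f"about {shortfall_factor:.0f} times short of the criterion")
```

The earliest failure therefore occurred at under 2% of the required life, a shortfall of roughly a factor of 54.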
A prematurely cracked rod-end top is shown in Fig. 9.
Examination revealed that the fatigue crack had initiated at
stress concentrations arising from mechanical damage
caused by metal-to-metal contact (Fig. 10). Detailed
investigation revealed two reasons for the
mechanical damage: (a) nonuse of the recommended
MoS2 coating at the bearing interface and (b) improper fit
within the bearing. The failures continued even after restoration of the MoS2 coating. Examination did not show any
deviation in the dimensions of the rod-end top bore and the
bearing or problems in the manufacturing procedure.
Hence, the manufacturer was asked to examine the staking
tools used for the assembly of the bearing, closely monitor
the process, and report. In spite of all care, the failure
continued to happen. Analysis of a few more failed components confirmed that in all cases, there was nonuniform
contact at the bearing/bore interface. As an additional
check, the manufacturer was asked to examine the press
used for the staking process. After examination, the
supervisor/operator reported back saying there were no
abnormalities in the press.
The reasons for the improper fit and the nonuniform
contact were difficult to identify at that stage. Based on
the physical failure, no other conclusions could be established. Finally, it was suggested that the staking process for
a few components be carried out using a well-calibrated/
maintained universal testing machine. The components so
manufactured were found to pass the qualification test.
Following this, the staking press was dismantled and
examined in detail, and it was found that during compression/staking, the platens in contact with the component
were not parallel because of excessive wear in the staking
press system. This wear had occurred over long usage.
Immediately, the press was reconditioned by replacing the
worn-out components and the problem was solved.

Fig. 9 (a) A prematurely cracked rod-end top. (b) Close-up view of the cracked region. (c) Crack surface showing fatigue crack origin (arrow)
Fig. 10 (a) Fracture surface showing crack origin (arrow). (b) Corresponding region on the bore surface (arrow); note mechanical damage

Why did the process of identification of the cause for the
improper fitment of the rod-end top take so long? In spite
of examination, why was the fault in the press not determined early in the evaluation? Was the operator properly
trained on the equipment operation and maintenance? The
answer appears to be "NO." The press was examined
under no-load condition, and everything was found to be
all right. However, under pressure, the slippage in the
transmission system caused by excessive wear on the shaft
of the top anvil was responsible for the misalignment of the
platens in contact with the component during the staking
process. In this particular case, the incident prompted a
regular audit of the manufacturing equipment used by the
organization. However, whether or not training of the
personnel on the maintenance of the equipment was made
mandatory is not known to the author.
The customer was very satisfied that the problem was
solved, and regular production resumed. However, no
one was critical about the omission of the MoS2 coating on
the surface of the bore before the staking process. It was a
process requirement to avoid metal-to-metal contact at the
interface, and the process had been followed for many years.
Why was the decision made to exclude the MoS2 coating?
Investigation revealed that the process was excluded from
the manufacturing schedule in order to minimize the time of
production per unit so that the committed production target
could be achieved. Was it an error of knowledge or an error
of intent? Should not a failure analyst look into these
aspects? Does it not reflect on the organizational culture if
such practices are regularly adopted?


Root Cause Analysis: Definition and Requirements


Why does a failure occur? There can be various reasons
such as (a) the component/system is subjected to an environment beyond its design envelope, (b) the choice of
material and its condition is inappropriate for the design
and operating conditions, (c) the material of construction is
defective, or (d) the design itself is wrong. It is often seen
that there is no single cause or no single train of events
leading to the failure. Generally, several factors combine at
a particular time and place to cause a failure to occur [1, 3].
When a failure occurs, an understanding of the failure
process is required so that attempts can be made to determine how and why a component/system failed. Root cause
analysis provides this understanding.
As described by Zamanzadeh et al. [2], the old saying,
"For want of a nail the shoe was lost, for want of a shoe the
horse was lost, for want of a horse the battle was lost, for
want of the battle the kingdom was lost," summarizes a
classic primary/proximate cause. The primary cause of
failure is the set of conditions or parameters that initiated
the sequence of events leading to failure, and this cause is
defined as "events that occurred, including any conditions
that existed immediately before the failure, directly resulted in its occurrence and if eliminated or modified, would
have prevented the failure" [5]. On the other hand, root
cause is defined as "one of multiple factors (events, conditions, or organizational factors) that contributed to or
created the primary/proximate cause and subsequent failure
and if eliminated, or modified, the failure would not have
occurred or would have occurred differently" [1, 5].
Typically, multiple root causes contribute to a failure.

Fig. 11 Causal factor tree of the failure of gear. Level A, event; Level B, physical failure; Level C, primary cause; and Level D, root cause

It must be understood that identification of the root
cause requires examination beyond the immediately visible
cause, which is often the primary/proximate cause.
Unfortunately, once this cause is determined, the failure
analysis process frequently stops. For determination of root
cause, the primary cause needs to be supplemented by the
entire history of the failed component or system, which
includes both manufacturing and use. This has been illustrated through examples in the previous section. There are
various tools/methods available for root cause analysis
[5–9]. One of them is generation of a causal factor tree for an
event. If one takes the example of the gear failure cited in
this paper, the tree can be constructed as shown in Fig. 11.
By no means is the tree exhaustive, and many more elements can be added to the tree depending on the situation at
hand.
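
A causal factor tree of this kind can be held in a very small data structure. The sketch below is not a reproduction of Fig. 11: the four levels follow the figure caption, but the node descriptions are paraphrased from the gear case as narrated in this paper, and the names Node, causal_chains, and reaches_root_cause are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Level labels as given in the caption of Fig. 11.
LEVELS = {"A": "event", "B": "physical failure",
          "C": "primary cause", "D": "root cause"}


@dataclass
class Node:
    level: str                      # one of "A", "B", "C", "D"
    description: str
    causes: List["Node"] = field(default_factory=list)


def causal_chains(node, prefix=()):
    """Yield every path from the given node down to a leaf cause."""
    path = prefix + (node,)
    if not node.causes:
        yield path
    for cause in node.causes:
        yield from causal_chains(cause, path)


def reaches_root_cause(tree):
    """True only if every causal chain in the tree ends at Level D."""
    return all(chain[-1].level == "D" for chain in causal_chains(tree))


# Illustrative single-branch tree for the gear failure (assumed wording).
gear_tree = Node("A", "Chip warning in flight; fractured gear tooth", [
    Node("B", "Fatigue fracture promoted by spalling on the working flank", [
        Node("C", "Decarburized case with low surface hardness", [
            Node("D", "Engineer unaware of the risks of re-heat-treatment"),
        ]),
    ]),
])

for chain in causal_chains(gear_tree):
    print(" -> ".join(f"{n.level}: {LEVELS[n.level]}" for n in chain))
```

A tree whose chains end at Level B or C corresponds to an investigation that stopped at the physical or primary cause; reaches_root_cause returns False for such a tree.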
An individual with basic training may be able to analyze
a simple physical failure. However, root cause analysis
requires input from people with many areas of expertise.
Hence, a failure analyst performing root cause analysis
must possess expertise in the various aspects of human
relations, education, and training [1]. In addition to
understanding the manufacturing and functioning of the
engineering component or system, the analyst must visit
the failure site, interrogate people, and understand the
organizational culture/practice. Most importantly, a failure
analyst must not shy away from recognizing that human
interactions often contribute to failure and that corporate
culture changes may be required to minimize the probability of failure.


In the example presented on the failure of a gear, the
primary cause for the fatigue fracture of the tooth may have
been the decarburization of the surface-hardened gear
teeth. However, the root cause was the systemic problem in
the organization wherein the engineer engaged for the job
lacked the knowledge or training or experience. Unless this
is addressed, similar production-related
failures are bound to recur in the organization from time to
time. The other examples cited in the paper also point
toward one or another aspect of the human element that
contributes significantly to the causal chain of all engineering failures. Experience suggests that the human
factors are largely dictated by the organizational culture/
practice.
In the true sense, root cause determination is performed
only in a handful of investigations, while the rest do not go
beyond the determination of the primary/proximate cause of
failure. Study shows that the majority of investigations stop
after the primary cause determination because of lack of
initiative and/or information from the customer/client. The
contribution of failure analysts to such incomplete investigations
cannot be ignored either. A failure analyst must, however, always
advocate the necessity of root cause analysis. Without
this, the investigation is not complete.
Summary
It has been shown that the identification of the causes for
the physical failure of a component or system is a part of
the total investigation process. This is only the first step in
the entire process and often sets the direction of the
investigation. The most important part of the investigation
is to determine the root causes, that is, what went wrong to
create the conditions for the physical failure to occur. In
this phase of investigation, dealing with human factors is
inevitable. Therefore, further progress into the investigation largely depends on the attitude of the customer/client
and expertise of the failure analyst. In many cases, unfortunately, the investigation stops when the physical cause is
determined, and the conclusions are drawn without even
making an attempt to establish the root cause.

Acknowledgments The author thanks Head, Materials Science
Division and Director, NAL, for granting permission to publish this
paper. The contributions of Mr. M.A. Venkataswamy, Dr. M. Sujata,
and Mr. M. Madan in the investigation of the failure cases presented
in this paper are acknowledged.

References
1. Dennies, D.P.: How to Organize and Run a Failure Investigation.
ASM International, Materials Park, OH (2005)
2. Zamanzadeh, M., Larkin, E., Gobbin, D.: A re-examination of
failure analysis and root cause determination. www.matcoinc.com.
Accessed July 2008
3. Ramachandran, V., Raghuram, A.C., Krishnan, R.V., Bhaumik,
S.K.: Failure Analysis of Engineering Structures: Methodology
and Case Histories. ASM International, Materials Park, OH (2005)
4. Bhaumik, S.K.: An aircraft accident investigation: revisited.
J. Fail. Anal. Preven. 8, 399–405 (2008)
5. Office of Safety & Mission Assurance, Chief Engineer's Office,
Root cause analysis: overview. www.hq.nasa.gov. Accessed July
2008
6. Kepner-Tregoe, Inc., Princeton, NJ. www.kepner-tregoe.com.
Accessed July 2008
7. The Reliability Center, Inc., Hopewell, VA. www.reliability.com.
Accessed July 2008
8. The Failsafe Network, Inc., Montebello, VA. www.failsafenetwork.com. Accessed July 2008
9. Shanin LLC, Livonia, MI. www.shanin.com. Accessed July 2008
