
CHRIS MCANDREW

Fault Diagnosis
Table of Contents
Troubleshooting
Theory of Operation
Intermittent Symptoms
Multiple Failures
Gathering Information
The Five Whys
Example
Clearly State the Problem
Form a Hypothesis
Test the Hypothesis
Observe the Results and Draw Conclusions
Repeat Until You Are Happy With Your Conclusions
Failure Analysis
Root Cause Analysis
RCA Based Corrective Action
Basic Elements of Root Cause
Applied Logic
Sutton's Law
Occam's Razor
Hickam's Dictum
Holmesian Deduction
Murphy's Law

ii Chris McAndrew
T R O U B L E S H O O T I N G

Chapter 1

Troubleshooting
The following is an outline of diagnostic principles which can be applied to virtually any fault-finding situation.

The basic premise is to distil fault finding down to a common set of instructions which can be adapted to suit
the needs of the problem at hand.

Diagnostics is an investigative process which applies logic to test theories through observation and
experimentation.

Solving problems first requires a logical and systematic procedure which allows you to gather the available
information, discard that which is irrelevant, discover other useful facts and draw logical conclusions in order
to arrive at the cause of the problem.

The process of diagnosing faults within electrical systems is commonly referred to as troubleshooting.

Troubleshooting is the systematic search for the source of a problem in order to facilitate rectification.

The fault is normally described as symptoms of a failure and troubleshooting is the process of determining the
causes of these symptoms.

Troubleshooting is often a process of elimination - eliminating potential causes of a problem.

The process of elimination is a basic logical tool used to solve problems. Removing options that are
impossible, illogical, or ruled out by an explicit understanding of the scenario in question shrinks the
pool of remaining possibilities.
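
The idea can be sketched in a few lines of Python. This is only an illustration: the candidate causes and the two observations are hypothetical, chosen to match the telephone examples used later in this guide.

```python
# Hypothetical candidate causes for a dead extension; the names are illustrative only.
candidates = {"handset", "curly cord", "Cat5 cable", "wall port", "network switch"}

def eliminate(candidates, ruled_out):
    """Return the candidates that remain once the ruled-out ones are removed."""
    return candidates - ruled_out

# Observation 1 (assumed): a known-good handset and cord show the same symptom.
remaining = eliminate(candidates, {"handset", "curly cord"})

# Observation 2 (assumed): other extensions on the same switch work normally.
remaining = eliminate(remaining, {"network switch"})

print(sorted(remaining))  # the pool of remaining possibilities
```

Each observation removes options that it makes impossible, and whatever is left is what the next test should target.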

One of the core principles of troubleshooting is that reproducible problems can be reliably isolated and
resolved. Considerable effort and emphasis is therefore often placed on reproducibility, that is, on finding a
procedure that reliably induces the symptom.

Once this is done then systematic strategies can be employed to isolate the cause or causes of a problem; and
the resolution generally involves repairing or replacing those components which are at fault.

Efficient methodical troubleshooting starts with a clear understanding of the expected behaviour of the system
and the symptoms being observed. From there the engineer forms hypotheses on potential causes, and devises
tests (or perhaps references a standardised checklist of tests) to eliminate these prospective causes.


Theory of Operation

A theory of operation is a description of how a device or system should work. It should be included in
documentation, especially maintenance documentation, or a user manual. It aids troubleshooting by helping to
provide the engineer with a mental model that will aid him or her in diagnosing the problem.

This should not be confused with the undocumented version, which is generally what the customer expected
after the salesman had left!


Intermittent Symptoms

An intermittent fault can be defined as a fault which occurs irregularly or inconsistently.

Some of the most difficult troubleshooting issues relate to symptoms that are only intermittent.

In electronics this is often the result of components that are thermally sensitive (the resistance of a circuit
varies with the temperature of its conductors and, by Ohm's Law, a change in resistance changes the currents
and voltages in the circuit). Compressed air or freezer spray can be used to cool specific spots on a circuit
board, and a heat gun can be used to raise their temperature; thus troubleshooting of electronic systems
frequently entails applying thermal stress in order to reproduce a problem.

Equally, there is a distinction between frequency of occurrence and a "known procedure to consistently
reproduce" an issue. For example, knowing that an intermittent problem occurs within an hour of a
particular stimulus or event, but that sometimes it happens in five minutes and other times it takes almost an
hour, does not constitute a "known procedure", even if the stimulus does increase the frequency with which
the symptom appears.

Nevertheless, sometimes engineers must resort to statistical methods, and can only find procedures that
increase the symptom's occurrence to a point at which serial substitution or some other technique is feasible.
In such cases, even when the symptom seems to disappear for significantly longer periods, there is only low
confidence that the root cause has been found and that the problem is truly solved.
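
A rough way to put a number on that confidence is sketched below. It assumes, purely for illustration, that each test run is independent and that before the change the symptom appeared in a known fraction `p_fail` of runs; neither assumption comes from the text above, and real intermittent faults are often not independent from run to run.

```python
import math

def chance_of_clean_streak(p_fail, runs):
    """Probability of seeing `runs` clean runs in a row if the fault is still present."""
    return (1.0 - p_fail) ** runs

def runs_needed(p_fail, confidence=0.95):
    """Smallest number of consecutive clean runs needed to reach the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))

# Assumed figure: the symptom used to appear in 20% of test runs.
p_fail = 0.2
print(chance_of_clean_streak(p_fail, 5))   # chance that five clean runs are pure luck
print(runs_needed(p_fail))                 # clean runs needed before trusting the fix
```

With a 20% failure rate, five clean runs in a row would still happen by luck about a third of the time, which is why a short quiet period is weak evidence that the problem has been solved.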

Multiple Failures

Isolating single component failures which cause reproducible symptoms is relatively straightforward.

However, many problems only occur as a result of multiple failures or errors. This is particularly true of
fault-tolerant systems, or those with built-in redundancy. Features which add redundancy, fault detection and
failover to a system may themselves be subject to failure, and sufficient failures in any system will "take it down."

Even in simple systems the engineer must always consider the possibility that there is more than one fault.
(Replacing each component, using serial substitution, and then swapping each new component back out for
the old one when the symptom is found to persist, can fail to resolve such cases. More importantly the
replacement of any component with a defective one can actually increase the number of problems rather than
eliminating them).

Note that, while we talk about "replacing components", the resolution of many problems involves adjustment
or tuning rather than "replacement." For example, intermittent breaks in conductors, or "dirty or loose
contacts", might simply need to be cleaned and/or tightened.

The first phase of troubleshooting, therefore, is:


Gathering Information
You must gather all the information available about the effect of the fault in order to discover its cause. Ideally
that information should come from multiple sources:

1/. Check power lights, status indicators and displays. Do not forget the basics: has it been unplugged, or has
the fuse blown?

2/. Seek commonalities and exclusivities.

Is everyone having the same fault?

Does it only affect people in one area?

Does it only affect one person?

Does the fault occur at a specific time?

3/. Ask users what they are experiencing, but treat this information with care; most users are non-technical.
Users have also been known to lie, particularly if they believe that they are responsible for the fault.

4/. Check the system log files.

5/. Check and verify system configurations – are there any known software issues?

6/. ALWAYS try to reproduce the fault.

7/. Ask yourself – Is this really a fault?
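
The search for commonalities and exclusivities in step 2 can be sketched as a small comparison over fault reports. Everything in the data below (the users, floors, switch names and handset models) is hypothetical and exists only to show the comparison.

```python
# Hypothetical fault reports; every value here is invented for illustration.
reports = [
    {"user": "A", "floor": 2, "switch": "sw-2a", "model": "X", "affected": True},
    {"user": "B", "floor": 2, "switch": "sw-2a", "model": "Y", "affected": True},
    {"user": "C", "floor": 2, "switch": "sw-2b", "model": "X", "affected": False},
    {"user": "D", "floor": 3, "switch": "sw-3a", "model": "Y", "affected": False},
]

def common_factors(reports, keys):
    """For each key shared by every affected user, report the shared value and
    whether it is exclusive (that is, no unaffected user has it)."""
    affected = [r for r in reports if r["affected"]]
    unaffected = [r for r in reports if not r["affected"]]
    result = {}
    for key in keys:
        values = {r[key] for r in affected}
        if len(values) == 1:  # a commonality: all affected users share this value
            value = next(iter(values))
            exclusive = all(r[key] != value for r in unaffected)
            result[key] = (value, exclusive)
    return result

print(common_factors(reports, ["floor", "switch", "model"]))
```

In this invented data set the affected users share both a floor and a switch, but only the switch is exclusive to them, so the switch is the more promising place to look first.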

Any system can be described in terms of its components or subsystems. Each subsystem can be described in
terms of its expected behaviour. So the inputs to a system can be described as a cascade of inputs and results
among the components of the system.

For example: handset to curly cord, curly cord to telephone, telephone to Cat5 cable, Cat5 cable to wall port,
wall port to patch cord, patch cord to network switch... Now think of an unplugged handset curly cord and an
unplugged Cat5 cable at the wall port. Both are unplugged wires; is the effect the same? Obviously not! But now
you are beginning to visualise the subsystems in their own right.

Often, troubleshooting is applied to something which has suddenly stopped working, since its previously
working state forms the expectations about its continued behaviour.

So the initial focus is often on recent changes to the system or to the environment in which it exists.

Consider, for example, a handset that "was working when it was plugged in over there". However, there is a
well-known principle that correlation does not imply causality.

For example, the failure of a device shortly after it has been plugged into a different outlet doesn't necessarily
mean that the events were related. The failure could have been a matter of coincidence.

It's useful to consider, at this point, the common experience we have with light bulbs. Light bulbs "blow"
more or less at random; eventually the repeated heating and cooling of the filament, and fluctuations in the
power supplied to it, cause the filament to crack or vaporise. The same principle applies to most other
electronic devices, and similar principles apply to mechanical devices. Some failures are part of the normal
wear and tear of components in a system.

The Five Whys

The ‘five whys’ is a question-asking method used to explore the cause/effect relationships underlying a
particular problem. Ultimately, the goal of applying the five whys method is to determine a root cause of a
defect or problem.

Example

The following example demonstrates the basic process:

My car will not start. (The problem)

1. Why? - The battery is dead. (1st)

2. Why? - The alternator is not functioning. (2nd)

3. Why? - The alternator belt has broken. (3rd)

4. Why? - The alternator belt was well beyond its useful service life and has never been replaced. (4th)

5. Why? - I have not been maintaining my car according to the recommended service schedule. (5th)

The questioning for this example could be taken further to a sixth, seventh, or even greater level. This would
be legitimate, as the "five" in five whys is not gospel; rather, it is postulated that five iterations of asking why are
generally sufficient to get to a root cause. The real key is to encourage the engineer to avoid assumptions and
logic traps and instead to trace the chain of causality in direct increments from the effect through any layers of
abstraction to a root cause that still has some connection to the original problem.
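
The chain of causality in the example above can be written down as a simple lookup that is followed one "why" at a time. This is a minimal sketch: the dictionary layout and the `max_whys` limit are assumptions made for illustration, and the entries paraphrase the car example.

```python
# Each observed effect maps to the cause found by asking "why?" once.
# The entries paraphrase the car example; the layout is an assumption.
causes = {
    "My car will not start": "The battery is dead",
    "The battery is dead": "The alternator is not functioning",
    "The alternator is not functioning": "The alternator belt has broken",
    "The alternator belt has broken": "The belt was beyond its service life",
    "The belt was beyond its service life": "The car has not been serviced on schedule",
}

def ask_why(problem, causes, max_whys=10):
    """Follow the causal chain from the problem until no further cause is known."""
    chain = [problem]
    while chain[-1] in causes and len(chain) - 1 < max_whys:
        chain.append(causes[chain[-1]])
    return chain

chain = ask_why("My car will not start", causes)
for depth, statement in enumerate(chain):
    label = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {statement}")
```

The loop stops when no further cause is recorded rather than at exactly five steps, which reflects the point that the "five" is a guideline and not a rule.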

Only once all the available information has been gathered will the engineer be ready to move on to the next
phase.


Clearly State the Problem

This is the process of reviewing all the available information and getting a clear understanding of the perceived
fault.

For example, let's say that you have a user complaining that they cannot transfer a call to an outside line.

Upon investigation you find that they are pressing this [image] when they should be pressing this [image].

Now we have a training issue, NOT A FAULT!


Form a Hypothesis

Having collected your information and clearly stated what the problem is, you now need to form your initial
hypothesis.

The best way to do this is in the form of a question which can be proven or disproved.

1/. Extension 123 was working before and it is not working now. At what point in time did it break and what
broke it?

Is it

a). The handset

b). The cables

c). The power, etc.?

2/. Why won't Ops Manager install/start/run, etc.?

When starting to form your hypothesis, bear in mind that a basic principle in troubleshooting is to start with
the simplest and most probable problems. This is illustrated by the old saying "When you see
hoof prints, look for horses, not zebras", or, to use another maxim, the KISS ("Keep It Simple, Stupid") principle.

This principle results in the common complaint about help desks or manuals, which sometimes first ask: "Is it
plugged in and is the power turned on?" This should not be taken as an affront; rather, it should serve as a
reminder or conditioning to always check the simple things first.
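
One way to make "simplest and most probable first" concrete is to rank the candidate checks by how likely each is to find the fault relative to how long it takes. The sketch below assumes a single fault, checks that never give a wrong answer, and entirely made-up probabilities and costs (in minutes); under those assumptions, ordering by probability divided by cost gives the lowest expected time to find the fault.

```python
# (check, assumed probability that this is the cause, assumed cost in minutes)
# All figures are invented for illustration.
checks = [
    ("switch hardware", 0.10, 60),
    ("handset", 0.20, 10),
    ("power/plug", 0.30, 1),
    ("switch port config", 0.15, 15),
    ("cable seating", 0.25, 2),
]

def prioritise(checks):
    """Order checks so the most probable, cheapest ones come first."""
    return sorted(checks, key=lambda c: c[1] / c[2], reverse=True)

def expected_cost(ordered):
    """Expected minutes spent before the faulty item is found (single fault assumed)."""
    total, elapsed = 0.0, 0
    for _name, probability, cost in ordered:
        elapsed += cost
        total += probability * elapsed
    return total

ordered = prioritise(checks)
print([name for name, _p, _c in ordered])
print(expected_cost(ordered), "vs", expected_cost(list(reversed(ordered))))
```

With these figures the plug and cable checks come first, and working through the list in the opposite order costs several times as much on average.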


Test the Hypothesis

Once you have stated the problem and formed a hypothesis you must devise a method to test that hypothesis.

Your testing must enable you to eliminate one possible cause at a time, by changing only one setting at a time.

If you make more than one change at a time you will not be able to tell which change produced the result, and
it is quite likely that you will never find the root cause of the actual problem.
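
A minimal sketch of the one-change-at-a-time discipline follows. The settings (`vlan`, `codec`, `poe`) and the `works` test are hypothetical stand-ins for whatever the real system exposes; the point is only that each change is tested on its own and reverted if it does not help.

```python
def isolate(settings, candidate_changes, works):
    """Apply one change at a time; revert it if the symptom persists.
    Returns the key whose change fixed the problem, or None."""
    for key, new_value in candidate_changes:
        old_value = settings[key]
        settings[key] = new_value      # make exactly one change
        if works(settings):            # test the hypothesis
            return key                 # this change alone made the difference
        settings[key] = old_value      # revert before trying the next change
    return None

# Hypothetical configuration of a handset's switch port.
settings = {"vlan": 10, "codec": "G.711", "poe": False}

# Stand-in for a real test; here the (assumed) fault is that PoE is disabled.
def works(current):
    return current["poe"] is True

fix = isolate(settings, [("vlan", 20), ("codec", "G.729"), ("poe", True)], works)
print(fix, settings)
```

Because unsuccessful changes are reverted, the system ends up differing from its original state only by the one change that mattered, which is what allows the cause to be named.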

An engineer could check each component in a system one by one, substituting known good components for
each potentially suspect one. However, this process of "serial substitution" could be considered wasteful when
components are substituted without regard to a hypothesis concerning how their failure could result in the
symptoms being diagnosed (e.g. there is no power light on, so let's change the hard drive...).

Engineers commonly use two strategies. The first is to check for frequently encountered or easily tested
conditions first.

For example, checking to ensure that a handset's display is on and that its cables are firmly seated at both ends.

The second is to "bisect" the system.

For example, checking to see if the voice packets which leave a handset also leave the network switch further
down the line.

This latter technique can be particularly efficient in systems with long chains of serialized dependencies or
interactions among their components. It's simply the application of a binary search across the range of
dependencies.
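
The bisection can be sketched as an ordinary binary search over the chain of components. The sketch assumes a single fault, that the signal is good everywhere before the faulty stage and bad from it onwards, and that a probe exists which can test the signal after any given stage; the `signal_ok_after` function below merely simulates such a probe.

```python
# The chain of components from the Gathering Information section.
stages = ["handset", "curly cord", "telephone", "Cat5 cable",
          "wall port", "patch cord", "network switch"]

def first_bad_stage(stages, signal_ok_after):
    """Binary search for the first stage whose output is bad.
    Assumes the output of the final stage is known to be bad."""
    lo, hi = 0, len(stages) - 1        # the fault lies somewhere in stages[lo..hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if signal_ok_after(mid):
            lo = mid + 1               # everything up to mid is fine
        else:
            hi = mid                   # the fault is at mid or earlier
    return stages[lo]

# Simulated probe: pretend the wall port (index 4) is the faulty stage.
probes = []
def signal_ok_after(index):
    probes.append(stages[index])
    return index < 4

print(first_bad_stage(stages, signal_ok_after), "found after", len(probes), "probes")
```

Seven stages are narrowed down in three probes, whereas checking each stage in turn from the handset would have taken five.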

Troubleshooting can also take the form of a systematic checklist, procedure, flowchart or table that is made
before a problem occurs. Developing troubleshooting procedures in advance allows sufficient thought to be
given to the steps to take and to organising them into the most efficient process.

Always remember, when you are testing your hypothesis: if you cannot identify the root cause, how will you
repair it the next time?


Observe the Results and Draw Conclusions

After each test, note whether the change you made did or did not solve the problem, gather any new
information and then draw a conclusion as to whether the problem is solved or whether the change you made
had any effect on the problem at all.

Once you have drawn conclusions you can devise new tests to eliminate other possible causes.


Repeat Until You Are Happy With Your Conclusions

This entire methodology is based upon removing the possible causes of the fault, one at a time, until the root
cause has been identified and eliminated.

Therefore, until the root cause has been identified, go back and start again!


Failure Analysis

Once the root cause has been identified and the fault has been rectified then we move to the final phase.

Failure analysis is the process of collecting and analyzing data to determine the cause of a failure and how to
prevent it from recurring. It is an important discipline and it is a vital tool used in the development of new
products and for the improvement of existing products.

R O O T C A U S E A N A L Y S I S

Chapter 2

Root Cause Analysis


Root cause analysis (RCA) is a class of problem solving methods aimed at identifying the root causes of
problems or events. The practice of RCA is predicated on the belief that problems are best solved by
attempting to correct or eliminate root causes, as opposed to merely addressing the immediately obvious
symptoms. By directing corrective measures at root causes, it is hoped that the likelihood of problem
recurrence will be minimized. However, it is recognized that complete prevention of recurrence by a single
intervention is not always possible. Thus, RCA is often considered to be an iterative process, and is frequently
viewed as a tool of continuous improvement.

Root cause analysis is not a single, sharply defined methodology; there are many different tools, processes, and
philosophies of RCA in existence. However, most of these can be classed into five very broadly defined
"schools" that are named here by their basic fields of origin: safety-based, production-based, process-based,
failure-based, and systems-based.

• Safety-based RCA descends from the fields of accident analysis and occupational safety and health.

• Production-based RCA has its origins in the field of quality control for industrial manufacturing.

• Process-based RCA is basically a follow-on to production-based RCA, but with a scope that has been
expanded to include business processes.

• Failure-based RCA is rooted in the practice of failure analysis as employed in engineering and
maintenance.

• Systems-based RCA has emerged as an amalgamation of the preceding schools, along with ideas taken
from fields such as change management, risk management, and systems analysis.

Despite the seeming disparity in purpose and definition among the various schools of root cause analysis,
there are some general principles that could be considered as universal. Similarly, it is possible to define a
general process for performing RCA.


RCA Based Corrective Action

Notice that, in the process below, RCA (steps 3, 4 and 5) forms the most critical part of successful corrective
action, because it directs the corrective action at the root of the problem. That is to say, it is effective solutions
we seek, not root causes. Root causes are secondary to the goal of prevention, and are only revealed after we
decide which solutions to implement.

1. Define the problem.

2. Gather data/evidence.

3. Ask why and identify the causal relationships associated with the defined problem.

4. Identify which causes if removed or changed will prevent recurrence.

5. Identify effective solutions that prevent recurrence, are within your control, meet your goals and
objectives and do not cause other problems.

6. Observe the recommended solutions to ensure effectiveness.

7. Repeat steps 1 to 6 until you are happy with the results observed in step 6.

8. Implement the recommendations.


Basic Elements of Root Cause

• Materials

  - Defective raw material
  - Wrong type for job
  - Lack of raw material

• Machine/Equipment

  - Incorrect tool selection
  - Poor maintenance or design
  - Poor equipment or tool placement
  - Defective equipment or tool

• Environment

  - Orderly workplace
  - Job design or layout of work
  - Surfaces poorly maintained
  - Physical demands of the task
  - Forces of nature

• Management

  - No or poor management involvement
  - Inattention to task
  - Task hazards not guarded properly
  - Other (horseplay, inattention...)
  - Stress demands

• Methods

  - No or poor procedures
  - Practices are not the same as written procedures
  - Poor communication

• Management system

  - Training or education lacking
  - Poor employee involvement
  - Poor recognition of hazard
  - Previously identified hazards were not eliminated

A P P L I E D L O G I C

Chapter 3

Applied Logic

Sutton’s Law
Sutton's law states that in attempting to diagnose a problem, one should first do the experiment that can
confirm the most likely diagnosis. It is taught in medical schools to guide new doctors in ordering tests in a
way that leads to faster treatment, while minimizing unnecessary costs. It is also applicable to other disciplines,
such as debugging computer programs.

A more thorough analysis will consider the false positive rate of the test and the possibility that a less likely
diagnosis might have more serious consequences.
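
The effect of a test's false positive rate can be illustrated with Bayes' theorem. In the sketch below the prior probability, sensitivity and false positive rate are invented figures, not measurements from any real test.

```python
def probability_given_positive(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: probability the suspected cause is real, given a positive test."""
    true_positives = sensitivity * prior
    false_positives = false_positive_rate * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# Assumed figures: the most likely diagnosis has a 60% prior, and the test
# detects it 90% of the time but also fires on 10% of healthy systems.
likely = probability_given_positive(prior=0.60, sensitivity=0.90, false_positive_rate=0.10)

# The same test applied to an unlikely diagnosis with a 5% prior.
unlikely = probability_given_positive(prior=0.05, sensitivity=0.90, false_positive_rate=0.10)

print(round(likely, 3), round(unlikely, 3))
```

A positive result on the likely diagnosis is strong evidence, while the same positive result on the unlikely one still leaves it improbable, which is one reason for testing the most likely diagnosis first.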

The law is named after the bank robber Willie Sutton, who supposedly answered a reporter inquiring why he
robbed banks by saying "because that's where the money is." He later denied saying it, however.

A similar idea is contained in the adage, "When you hear hoof beats, think horses, not zebras."


Occam’s Razor

Occam's razor (sometimes spelled Ockham's razor) is a principle attributed to the 14th-century English
logician and Franciscan friar William of Ockham. The principle states that the explanation of any
phenomenon should make as few assumptions as possible, eliminating those that make no difference in the
observable predictions of the explanatory hypothesis or theory. The principle is often expressed in Latin as the
lex parsimoniae ("law of parsimony" or "law of succinctness"): "entia non sunt multiplicanda praeter
necessitatem", roughly translated as "entities must not be multiplied beyond necessity".

This is often paraphrased as "All other things being equal, the simplest solution is the best." In other words,
when multiple competing theories are equal in other respects, the principle recommends selecting the theory
that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam's razor
is usually understood.

Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic
maxim (rule of thumb) that advises economy, parsimony, or simplicity, often or especially in scientific theories.

Hickam's Dictum

Hickam's dictum is a counterargument to the use of Occam's razor in the medical profession. The principle is
commonly stated: "Patients can have as many diseases as they damn well please". The principle is attributed to
John Hickam, MD.


Holmesian Deduction

Sherlock Holmes is famous for his intellectual prowess, and is renowned for his skilful use of "deductive
reasoning" (in practice, abductive reasoning, or inference to the best explanation) and astute observation to
solve difficult cases.

As Holmes says in the story The Sign of the Four: "How often have I said to you that when you have
eliminated the impossible, whatever remains, however improbable, must be the truth?"

Murphy’s Law

Murphy's Law, applied to troubleshooting, states that if you skip a step, the step you skip is where the problem will lie.
