Dugan Comp Sys Fta Tutor

Fault Tree Analysis
of Computer-Based Systems
Joanne Bechta Dugan

Professor of Electrical & Computer Engineering
University of Virginia
(jbd@Virginia.edu)
RELIABILITY and MAINTAINABILITY Symposium

slide(1)
Presentation Outline
I. Introduction to fault trees
II. Fault tree analysis of an example control system
III. Fault trees as design aid for software systems
IV. Adapting the fault tree to analysis of computer-based systems
V. Dynamic fault trees for modeling sequential behavior
VI. Modular approach to fault tree analysis
VII. Sensitivity analysis
VIII. Summary and Conclusions

slide(2)
Introduction to fault tree analysis
Fault trees provide a good framework for both qualitative and
quantitative analysis because they have both a logical (boolean
algebra) and probabilistic basis.
What is a fault tree?

• not a tree (in the graph-theoretic sense)
• a graphical representation of a logical function
• shows logical relationship between an event (failure) and its causes
• provides a logical framework for expressing combinations of compo-
nent failures that can lead to system failure

slide(3)
Why use fault tree analysis?
A fault tree model provides a logical framework for analyzing the failure
behavior of a system.
A fault tree model precisely documents which failure scenarios have

been considered and which have not.
Fault tree analysis can be used to support engineering and

management decisions, trade-off analysis and risk assessment.
The fault tree model has a well-defined boolean algebraic and

probabilistic basis which relates probability calculations to boolean
logic functions.

slide(4)
Basic Static Fault Tree Constructs
Basic Events
Basic Event: corresponds to a basic failure

name event (usually a component failure) in the system.
Characterized by failure rate or failure probability
Undeveloped Basic Event: A basic event that is not

name completely developed, usually because of unavailable
information.
Replicated Basic Event; represents k statistically

k * name
identical copies of a component

slide(5)
Static fault tree gates
AND gate - output event occurs

only if ALL input events occur
OR gate - output event occurs

if one or more input events occur
m/n gate - output event occurs if

m/n m or more of the n inputs occur

slide(6)
Example Fault Tree
Washing Machine Overflows
Structure Function:
Fail = valve_failed OR
fill mode too long (timer_failed AND sensor_failed)
valve
stuck open
F = A + BC
timeout full
control sensor
failed failed

slide(7)
Probabilistic Fault Tree Analysis of Example
F = A + BC Suppose
Pr[A] = 0.01
Pr[F] = Pr[ A + BC]
Pr[B] = 0.05
= Pr[A] + Pr[BC] - Pr[ABC]
Pr[C] = 0.075
= 0.01 + 0.00375 - 0.0000375
(all failures are independent)
= 0.0137125

slide(8)
Fault Tree Analysis - Cutsets
Most fault tree analysis techniques start with the generation of cutsets
A cutset is a set of basic events; if all the basic events in a cutset occur,
then the top event (system failure) occurs.
A mincut (minimum cutset) is one that contains no redundant elements.

If an element is removed from a mincut, it ceases to be a mincut.
Cutsets:
fill mode too long
valve
stuck open {valve} (single point of failure)
timeout
control
failed
full
sensor
failed
{timer, sensor}

slide(9)
Cutset generation by example
G1
{A1,A3}
G3 {A4,A5}
G2 G3 {A1,G5} {A1,A4}
G1
G2 {G4,G5}
{A2,G5}
A5 {A2,A3}
G4 G5
{A2,A4}
A1 A2 A3 A4

slide(10)
Probabilistic Analysis using Cutsets
The probability of system failure is simply the probability that one or
more of the cutsets occur.
But the cutsets are not disjoint so we cannot sum their individual
probabilities. We must account for the overlap of the events.
Cutsets:
fill mode too long
valve
stuck open {valve} (single point of failure)
timeout
control
failed
full
sensor
failed
{timer, sensor}
Prob(failure) = Prob(valve failure) + Prob(both timer and sensor fail)

- Prob(all three fail)

slide(11)
Probabilistic Analysis using Inclusion-Exclusion
A AB B
ABC
AC BC
Pr(A) + Pr(B) + Pr(C)

- Pr(A and B) - Pr(B and C) - Pr(A and C)
+ Pr(A and B and C)

slide(12)
Probabilistic Analysis using Sum-of-Disjoint-Products
_
A AB
__
ABC
Pr(A) + Pr(not A and B) + Pr (not A and not B and C)

slide(13)
Probabilistic Analysis
using Binary Decision Diagrams
Fault tree model BDD representation
A1
B1 B1
A2 1
A2
0
A1 B1 0
B2
0 AB
0 1
A2 B2
AB

slide(14)

slide(15)
An example control system including software
Consider a simple tank level and flow control system.
The key features of this system are:
• a water tank, fed by a water pump on the inflow and regulated by

control and stop valves on the inflow and outflow pipes.
• a tank and level control system with three sensors (level, inflow and
outflow) implemented in software
• a tank bypass to prevent overflow, controlled by the three stop

valves
• valve actuation and control implemented in software

slide(16)
Function of example system
The function is to maintain the water level and downstream flow rate
at particular values by opening and closing control valves cv1 and cv2.
The controller receives inputs from the three sensors, implements the
control logic and then gives commands to the two control valves and
the two stop valves.
(This example is adapted from an example in: S. Guarro, M. Yau and M. Motamed,
“Development of tools for safety analysis of control software in advanced reactors”,
NUREG/CR-6465, April 1996.)

slide(17)
Diagram of example system
Digital
Controller
Control Flow Normally

Pump Valve Sensor Open Level
cv1 valve v1 Sensor
Check
valve
Check Normally Flow Control
valve Open Sensor Valve
valve v3 cv2
Normally
Closed
valve v2

slide(18)
Control Flow
close v1
level too high Drain water open v2
from tank open v3
cv1 = min
cv2 = max
flow set point
open v1
compare level calculate close v2
level
with level level positions for open v3
set points within control valves calculate cv1
bounds calculate cv2
upstream downstream
flow measurement flow measurement
open v1
replenish water
in tank close v2
level too low close v3
cv1 - max
cv2 = min

slide(19)
Software Operational modes
very very
low normal high
low high
Underflow Overflow
correct operation
non-critical failure behavior
critical failure behavior

slide(20)
System Failure modes
The control system is not designed to be fault tolerant, so that most

hardware failures present either an overflow or underflow hazard.
These are the two system-level failure modes being considered. If
either of the control valves or any of the three sensors fail, the system
fails, as the software will be unable to control the system. A tank leak,
pipe leak or pump failure are also considered single points of failure.
Further, an underflow can occur if valve v1 or v2 fails, thus preventing

proper inflow, unless valve v3 can be closed. Therefore, if v3 and either
v1 or v2 fail, the system fails. Overflow can occur if valve v3 fails, thus
preventing outflow, unless v1 can be closed. Thus the failure of both v1
and v3 leads to system failure.

slide(21)
Software Failure Modes
Software failures have been characterized as an improper change of

state. Software failures are further classified as critical and non-critical.
Non-critical software failures lead to less than optimal performance but

do not lead to system failure.
A software failure when the tank water level is very low or very high
can lead to system failure.

slide(22)
Fault Tree Model for Example System
Overflow Control valves Mechanical systems Underflow
Processor
SW-VH-Failure Control Valve 1 tank leak SW-VL-Failure

Control Valve 2 pipe leak
pump failure
Stop Valve 1 Level sensor Stop Valve 2 Stop Valve 1

Stop Valve 3 Stop Valve 3 Stop Valve 3
inflow sensor
outflow sensor

slide(23)

slide(24)
Fault trees as a design aid for software systems
Fault tree analysis can help to insure that the software system does not
do what it is not supposed to do. (As contrasted with a formal design
review which helps insure that the software does what it is supposed to
do.)
For robust software systems, fault trees can help identify high-risk
areas (either quantitatively or qualitatively).
Can manage risk by preventive or protective measures applied to

identified high-risk areas.
• exhaustive testing
• formal methods
• exception handling
• acceptance tests
• interlocks
• redesign

slide(25)
Using fault trees to manage risk
AND gates can be protected by disallowing one of the inputs[1]
• exhaustive testing or formal proof to show module cannot fail

• test for failure condition and provide recovery routine
OR gate can be protected by disallowing all inputs or by providing

detection and recovery point. (The detection and recovery routines
must be simple enough to be certifiably correct.)
[1] Herbert Hecht and Myron Hecht, “Fault Tolerant Software.” In D.K. Pradhan, editor, Fault-
Tolerant Computing: Theory and Techniques, volume 2, pages 658-696. Prentice-Hall, 1986.

slide(26)
Example Risk Mitigation
Suppose basic events represent software modules.

G1
Can protect G3 by preventing failure of module A4
G2 G3 - exhaustive testing
- proof of correctness
A5 Then can protect G2 by

G4 G5
-preventing failure of A3
- preventing failure of both A1 and A2
A1 A2 A3 A4
- or by providing detection and recovery handler for G4

slide(27)

slide(28)
Modeling Fault-Tolerant Computer Systems
Fault Tolerant Computer (FTC) systems can actively handle many

faults and errors that may occur.
Because FTC are adaptive and flexible, faulty components can be

switched out automatically, and spares switched in.
However, adaptability and flexibility often result in increased

complexity. Increased complexity can mean decreased reliability.
If the fault tolerance mechanisms (error detection, recovery,

reconfiguration) fail, this failure could lead to overall system failure,
even if adequate functioning resources remain.
A coverage model is used to analyze the behavior of the computer

system in the presence of a fault. The results of the coverage model
are then incorporated into the overall system model.

slide(29)
Covered vs. Uncovered Faults
A covered fault is one from which the system can automatically

recover.
• Recovery from transient does not change system state.
• Recovery from a permanent fault discards faulty component.
An uncovered fault is one which leads to immediate system failure,

regardless of the state of the system.

slide(30)
How to estimate coverage?
If a working or prototype version of the system exists, or if enough

information is available about a system being designed, then coverage
probabilities can be estimated.
A model of the recovery process can be developed. The parameters for

the model can be measured from a fault injection on the working
prototype and or estimated from data collected in the field.
A detailed simulation model of the system recovery process can be

developed.
If the details of the recovery process are not known, reasonable

parameters can be deduced from other, similar systems.

slide(31)
General structure of a coverage model
fault occurs
C exit
Coverage permanent coverage
Model (covered failure}
R exit
transient restoration
(covered fault) S exit
single-point failure
(uncovered failure)
The entry point to the model is the occurrence of the fault, and
the three exits (R,C, and S) are the three possible outcomes.

slide(32)
R exit: Transient Restoration
Correct recognition of and recovery from a transient fault. A transient
is usually caused by external or environmental factors, such as
excessive heat or a glitch in the power line.
The vast majority of faults are transient.
Successful recovery from a transient fault restores the system to an

operational state without discarding any components - for example
by masking the error, retrying an instruction, or rolling back to a
previous checkpoint.
Reaching this exit successfully requires:
• timely detection of an error produced by the fault;

• performance of an effective recovery procedure; and
• swift disappearance of the fault (the cause of the error).

slide(33)
C exit: Permanent coverage
Determination of the permanent nature of the fault, and the successful

isolation and removal of the faulty component.
S exit: Single Point failure
A single fault causes the system to fail, generally when

an undetected error propagates through the system, or if the faulty
unit cannot be isolated and the system cannot be reconfigured.

slide(34)
Typical fault recovery for a processor
A processor contains built-in test circuitry so that error checking occurs
concurrently with instruction execution. If an error is detected, the
instruction is retried immediately. Partial results are stored in case the
retry is unsuccessful, so that the computation can be continued from
some intermediate point (called a checkpoint).
The process of continuing a computation from a previously saved

checkpoint is called a rollback. In some cases the fault is such that the
rollback is not successful, so the computation must start over after a
system-level recovery procedure is invoked.

slide(35)
Example coverage model for processors
Wait
Retry
Rollback
Permanent
Recovery
Transient
Restoration
Exit R
Permanent
Coverage Failure
Exit C Exit S

slide(36)
Example of transient restoration
Transient Restoration attempt: Assume that the fault is transient, and

begin a multi-step recovery procedure that continues as long as an
error is detected. If an error persists after all three steps have been
performed, then a permanent recovery procedure must be invoked.
• Step 1: Wait for 0.1 second and do nothing. If the fault is transient it
may disappear during this time, allowing rollback to succeed.
• Step 2: Retry the current instruction several times, for as long as a

half-second. The probability that the retry will be successful (i.e., no
error is detected) is 0.5.
• Step 3: If an error persists, perform a rollback to a previous check-

point, followed by recomputation, taking 2 sec. total. The rollback suc-
ceeds in removing the error 80% of the time.

slide(37)
Example of permanent coverage and
single-point failure
If an error still persists after the rollback, it is assumed to be caused by
a permanent fault, and a system level permanent fault recovery
process is begun, to remove the offending processor from the set of
active units and to reconfigure the system to continue without it.
The permanent fault recovery process succeeds with probability 0.875.
The permanent coverage procedure is invoked against a a persistent

transient fault as well as against a permanent fault.
If the permanent fault recovery process fails, then a single-point failure

is said to occur.

slide(38)
Coverage model for memories
Error Occurs
0.02
0.98
Single bit Multiple bit

Memory error Memory error
0.95 0.05
Error masked detected not detected

in zero time
Failure
Transient Exit S
Attempt
Restoration
recovery
Exit R
0.85 0.15
successful unsuccessful
Permanent Failure
Coverage Exit S
Exit C

slide(39)
Example of recovery process for memory faults
The memory uses an error correcting code, so a single-bit error is

always detectable and correctable, and no reconfiguration is required.
If 98% of all memory faults affect only a single bit, then
the probability of reaching the R exit is 0.98.
The 2% of faults that affect more than one memory bit are 95%
detectable. When a multiple memory error is detected, the affected
portion of memory is discarded, the memory mapping function is
updated, and the needed information is reloaded from a previous
checkpoint and updated to represent the current state of the system.
Experimentation on a prototype system revealed that this recovery

from the detected multiple memory errors works 85% of the time.
Thus, the probability of reaching the C exit is the probability that a
multiple fault occurs, is detected, and is recovered from is:
c = 0.02 × ( 0.95 × 0.85 ) = 0.01615

slide(40)
Single point failure for memory faults
There are two paths to the single point failure exit.
• The memory fault causes a single-point failure if a multiple-bit error is

not detected (with probability 0.02 x 0.05)
• A multiple-bit memory error is detected, but the attempted recovery is

not successful, with probability 0.02 x 0.95 x 0.15
Thus the probability of single point failure is the sum of these two
cases, or 0.00385.

slide(41)
Example - 3P2M system
Processors
Bus
Memories
3 Processors and 2 memories connected via a bus. 2 Processors, one

memory and the bus are needed for correct operation.

slide(42)
3P2M fault tree
System Failure
2/3
Bus
P1 P2 P3 M1 M2
Processors Memories

slide(43)
Adding coverage to fault tree
System Failure
2/3
Bus
S C S
S C C S C S C
R R R R R
P1 P2 P3 M2 M1
Processors Memories

slide(44)
Probabilities for covered and uncovered basic
events
Pr[component fault] = p
No fault covered fault
Pr[component operational] = q
or transient
restoration Pr[covered failure] = cp
uncovered Pr[Uncovered failure] = sp
fault Pr[No fault or transient restoration] = q + rp
Note that the events “fail covered” and “fail uncovered” are mutually
exclusive (i.e. not independent).

slide(45)

slide(46)
Sequence dependencies
Traditional fault trees cannot model sequence dependent failures, in

which the order that events occur is important.
We define special purpose gates for modeling sequence

dependencies, and solve the resulting fault tree as a Markov chain.
The development of a correct Markov model for a complex system can

be difficult. Our approach is to use the fault tree for model development
and automatically convert the fault tree to the equivalent Markov chain.
The dynamic fault tree model is considerably simpler than the
equivalent Markov chain.
Coverage models are automatically added to the resulting Markov

chain which is solved via a numerical differential equation solver.

slide(47)
Example of sequence dependency
Primary
Switch
Spare
If switch fails after primary fails (and after spare is activated) then the
system is still operational.
If the switch fails before the primary fails, then the spare cannot be
activated and the system fails, even though the spare is operational.
Failure criteria depends on order in which failures occur. This system

can be solved correctly via a Markov model.

slide(48)
Sequence dependency gates
Several special purpose gates have been added to the traditional fault
tree gates. These special dynamic gates capture sequence
dependencies which frequently arise when modeling fault tolerant
computer systems. If a dynamic gate is part of a fault tree then it is
solved via a Markov chain, rather than by using traditional methods.
The special dynamic gates include:
• Functional dependency gate for modeling situations where one com-

ponent’s correct operation is dependent upon the correct operation of
some other component
• Spare gate for modeling cold, warm and hot pooled spares
• Priority-AND gate for modeling ordered ANDing of events. Note that
many traditional fault trees include the Priority AND gate; most simply
approximate with an AND gate

slide(49)
Functional Dependency gate
FDEP produces no logical output.
Its only effect is on propogating failures.
FDEP
Trigger event
(may be subtree)
dependent basic events

are forced to occur when trigger
event occurs

slide(50)
Spare gate
SPARE
Primary
component Spare units which are assumed to
fail at (possibly) reduced rate
before being switched into active use

slide(51)
Priority-AND gate
Output occurs if A and B

occur, and A occurs before B
A B
Inputs (may be subtrees)
which may occur in any order

slide(52)
Cascading Priority-AND gates
A before B before C
A B

slide(53)
HECS: Hypothetical Example Computer System
Redundant
Bus
Memory Memory
A1 Cold Spare A
Interface Interface
A2 Unit 1 Unit 2 Operator console
Operator
& Software
M1 M2 M3 M4 M5

slide(54)
HECS system description
HECS consists of dual-redundant processors A1 and A2 and a cold
spare which can replace either upon failure. A cold spare is one which
is assumed not to fail before being used.
HECS has 5 memory units; three are required. These memory units
are connected to the bus via two memory interface units. If the memory
interface unit fails, the memory units connected to it are unusable.
Memory unit 3 (M3) is connected to both interfaces for redundancy;
thus M3 is accessible as long as either interface unit is operational.
There is also a human operator who interfaces with the system via a
console, and runs some software application.
HECS requires at least one of the three A processors, at least 3 of the

memory units, at least one of the redundant busses, and the operator,
console and software to be operating correctly.

slide(55)
Modeling the cold spares
A processors
and spare
Cold Spare Cold Spare
A1 A2
Cold Spare
A
Notice that the cold spare is shared between the two processors. First
to fail is replaced with the spare; the spare is then unavailable if the
other fails.

slide(56)
Modeling the memory units
memory
units
3/5
Functional
Functional Functional
Dependency
Dependency Dependency
M1 M2 M3
M4 M5
MIU 2
MIU 1

slide(57)
HECS system-level fault tree model
hypothetical system failure
operator, A processors memory
console & SW and spare units
2*Bus 3/5
operator console Functional

console Functional Functional
software
Dependency
Operator
A1 A2
M1 M2 M3
M4 M5
Cold Spare MIU 2

A MIU 1

slide(58)
Example: Fault Tolerant Parallel Processor
Processing
elements
Network
elements
NE3
NE2 NE4
NE1

slide(59)
FTPP configuration #1
One spare per triad
A3 B3 C3 D3
NE3
D2 AS
C2 BS
NE2 NE4
B2 CS
A2 DS
NE1
D1 C1 B1 A1

slide(60)
Fault tree for FTPP configuration #1
3/4 3/4 3/4 3/4
A1 A2 A3 AS B1 B2 B3 BS C1 C2 C3 CS D1 D2 D3 DS
NE1 NE2 NE3 NE4
FDEP FDEP FDEP FDEP
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 AS BS CS DS

slide(61)
FTPP configuration #2
One spare per NE
A3 C1 D2 S3
NE3
B1
S2
D1 C2
NE2 NE4
B3 D3
A2 S4
NE1
S1 C3 B2 A1

slide(62)
Failure conditions for FTPP configuration #2
Consider as an example, the first member of the A triad, specifically

component A1. Now A1 will fail if
• both A1 and its spare (S1) fails OR
• if either of the other processors on the same NE fail before A1does,
thus using the spare first. In this case there will be no spare available
when A1 fails.
A1 S1 A1
B2 C3

slide(63)
Fault tree for FTPP configuration #2
2/3 2/3 2/3 2/3
A1 S1 A1 A2 S2 A2 A3 S3 A3 B1 S4 B1 B2 S1 B2 B3 S2 B3 C1 S3 C1 C2 S4 C2 C3 S1 C3 D1 S2 D1 D2 S3 D2 D3 S4 D3
B2 C3 B3 D1 C1 D2 C2 D3 C3 A1 D1 A2 D2 A3 D3 B1 A1 B2 A2 B3 A3 C1 B1 C2
NE1 NE2 NE3 NE4
FDEP FDEP FDEP FDEP
A1 B2 C3 S1 A2 B3 D1 S2 A3 C1 D2 S3 B1 C2 D3 S4

slide(64)
Alternative Fault Tree for configuration #2
2/3 2/3 2/3 2/3
Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
AS
BS
CS
DS
NE1 NE2 NE3 NE4
FDEP FDEP FDEP FDEP
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 AS BS CS DS

slide(65)
Mission Avionics System Example
The success/failure of the system is driven by the need

to provide certain software functionality
crew station management

scene & obstacle processing
local path generation
system management functions
vehicle management
Fault tolerance is achieved via redundant processors

(hot spares), pools of cold spares and redundant buses.

slide(66)
MAS system architecture
Memory 1 Memory 2
Background Data Bus
crew scene& local path system

SPARE 1
station a obstacle a gen. a mgmt a
crew scene & local path system

SPARE 2
station b obstacle b gen. b mgmt b
Mission Management Bus
vehicle vehicle vehicle vehicle VM VM

mgmt 1a mgmt 1b mgmt 2a mgmt 2b SPARE 1 SPARE 2
Vehicle Management Bus

slide(67)
Fault Tree model for MAS
MAS Failure
Module DB Module MM Module VMB Module M Module VMS
VM VM
BD BD MM MM Bus Bus Mem Mem Veh Mgt 1 Veh Mgt 2
Bus Bus Bus Bus 1 2 1 2
1 3 1 3
BD MM
Bus Bus
2 2
Vehicle Vehicle Vehicle Vehicle
Mgmt 1a Mgmt 1b Mgmt 2a Mgmt 2b
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Module CPU
VM
S1 VM
S2
Crew S&O Sys Mgt Path Gen
Crew 1a Crew 1b S&O 1a S&O 1b Sys Mgt Sys Mgt Path Gen Path Gen
1a 1b 1a 1b
CSP CSP CSP CSP CSP CSP CSP CSP
Crew Crew S&O S&O Sys Sys Path Path

1a 1b 1a 1b 1a 1b 1a 1b
S1 S2

slide(68)
Redundant Software Architecture
Next Consider the path generation (PathGen) and
scene&obstacle (S&O) functions:
Each function needs a single processor to provide full
functionality.
There is also a reduced version of each function that can

provide minimum functionality (PathGenMin and S&OMin).
In the event of a detected software fault in PathGen the

system can switch to PathGenMin (similarly for S&O)
Further, if there are no longer 2 full processors available, the

system will switch to PathGenMin and S&OMin running on a
single processor.

slide(69)
Redundant Software MAS model
Loss
Software
Minimize
FDEP
Both Proc One Proc S&O SW Path SW
CSP CSP
Crew S&O S&O Path Path Sys Mgt

Path Gen S&O SW SW SW SW
full Min full Min
Crew 1a Crew 1b Sys Mgt Sys Mgt

Path Gen Path Gen S&O 1a S&O 1b 1a 1b
1a 1b
CSP CSP CSP CSP
CSP CSP CSP CSP
Crew Crew Sys Sys

1a 1b Path Path S&O S&O 1a 1b
1a 1b 1a 1b
S2
S1

slide(70)
MAS Failure
VM VM
1 3 1 3
BD MM
Bus Bus
2 2
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Loss VM
S1 VM
S2
Software
Minimize
FDEP
CSP CSP

full Min full Min

1a 1b
CSP CSP CSP CSP
CSP CSP CSP CSP
Crew Crew Sys Sys

1a 1b 1a 1b
S2
S1

slide(71)

slide(72)
Modular Approach to fault tree analysis
Given a fault
tree model as input
Find Independent subtrees
Solve each subtree separately
Combine the results

slide(73)
The best of both worlds
Divide-and-conquer helps produce models that may

not be too large to solve.
For static modules (containing AND, OR, K-of-M) gates,

use the fast and efficient BDD (Binary Decision Diagram)
approach.
For dynamic modules, convert to equivalent Markov

model for solution.
Different solution methods can aid in validation and

testing.
Modularization allows consideration of different solution

methods (i.e. simulation).

slide(74)
Modular Solution of HECS
hypothetical system failure
Independent subtree 4 (buses)

type: static
operator, A processors memory
console & SW and spare units

independent subtree 3 (memories)
2*Bus type:dynamic
3/5
operator console Functional

console Functional Functional
software Dependency
Operator
A1 A2
M1 M2 M3
M4 M5
Independent subtree 1 Cold Spare MIU 2

A MIU 1
type: static
independent subtree 2
type: dynamic

slide(75)

slide(76)
Sensitivity analysis
Reliability Analysis tells only part of the story.
“What are the weak points in the system?”
“How do my results change with changing input parameters?”
“What is the most cost-effective way to improve reliability?”
These questions require sensitivity analysis of reliability analysis

results.

slide(77)
Modular approach to sensitivity analysis
Sensitivity analysis (also called importance analysis) can use partial

derivative.
Sensitivity results from different submodels can be easily combined

using “chain-rule”.
Sensitivity analysis for BDD is almost free while calculating reliability.
Sensitivity analysis for Markov chain is more troublesome but we have

developed an interesting, efficient approximation.

slide(78)
Example: Cardiac Assist System
TEDTS Pace
Coil Leads
TEDTS Power Primary Crossbar

Controller Supply CPU Switch
WSP
Backup Motor
Battery
CPU Amplifier
Motor
Pump Motor
Cable

slide(79)
Cardiac Assist System Fault Tree
System
Failure
Pump,Motor M2 CPU
Leads
M5
Motor
Pace Section
Pump
Leads
M1 FDEP WSP
Motor Motor Trigger

Motor Power
Cable. Amp.
TEDTS
Primary Backup
M4 CPU CPU
Crossbar System
Power TEDTS Switch. Superv.
Supply
M3
TEDTS TEDTS
Battery
Contr. Coil

slide(80)
MAS Failure
VM VM
1 3 1 3
BD MM
Bus Bus
2 2
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Loss VM
S1 VM
S2
Software
Minimize
FDEP
CSP CSP

full Min full Min

1a 1b
CSP CSP CSP CSP
CSP CSP CSP CSP
Crew Crew Sys Sys

1a 1b 1a 1b
S2
S1

slide(81)
Summary
The DFT (dynamic fault tree) methodology is ideally suited for the
analysis of computer-based systems.
DFT uses a modular approach to FTA, detecting modules using a fast

and efficient algorithm.
Modules are classified as static or dynamic, depending on the types of

gates included.
Static modules are solved using the BDD approach; dynamic modules
are solved using Markov chain methods.
Coverage models can assess the effect of complex recovery

mechanisms.
Dynamic gates can allow modeling of sequence dependencies that

arise from complex redundancy management.

slide(82)
Software for Dynamic Fault Tree Analysis
Galileo/ASSAP is a software package for fault tree

analysis which embodies the DFT approach.
(Being developed for NASA Langley Research

Center, expected completion Nov. 2001. Beta ver-
sion available for evaluation.)

slide(83)

Dugan Comp Sys Fta Tutor

Transféré par

Informations du document

Copyright

Formats disponibles

Partager ce document

Partager ou intégrer le document

Options de partage

Avez-vous trouvé ce document utile ?

Ce contenu est-il inapproprié ?

Droits d'auteur :

Formats disponibles

Dugan Comp Sys Fta Tutor

Transféré par

Droits d'auteur :

Formats disponibles

Fault Tree Analysis

Joanne Bechta Dugan

RELIABILITY and MAINTAINABILITY Symposium

I. Introduction to fault trees

II. Fault tree analysis of an example control system

III. Fault trees as design aid for software systems

IV. Adapting the fault tree to analysis of computer-based systems

V. Dynamic fault trees for modeling sequential behavior

VI. Modular approach to fault tree analysis

VII. Sensitivity analysis

VIII. Summary and Conclusions

RELIABILITY and MAINTAINABILITY Symposium

What is a fault tree?

RELIABILITY and MAINTAINABILITY Symposium

A fault tree model precisely documents which failure scenarios have

Fault tree analysis can be used to support engineering and

The fault tree model has a well-defined boolean algebraic and

RELIABILITY and MAINTAINABILITY Symposium

Basic Event: corresponds to a basic failure

Undeveloped Basic Event: A basic event that is not

Replicated Basic Event; represents k statistically

RELIABILITY and MAINTAINABILITY Symposium

AND gate - output event occurs

OR gate - output event occurs

m/n gate - output event occurs if

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

A mincut (minimum cutset) is one that contains no redundant elements.

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

Prob(failure) = Prob(valve failure) + Prob(both timer and sensor fail)

RELIABILITY and MAINTAINABILITY Symposium

Pr(A) + Pr(B) + Pr(C)

RELIABILITY and MAINTAINABILITY Symposium

Pr(A) + Pr(not A and B) + Pr (not A and not B and C)

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

I. Introduction to fault trees

II. Fault tree analysis of an example control system

III. Fault trees as design aid for software systems

IV. Adapting the fault tree to analysis of computer-based systems

V. Dynamic fault trees for modeling sequential behavior

VI. Modular approach to fault tree analysis

VII. Sensitivity analysis

VIII. Summary and Conclusions

RELIABILITY and MAINTAINABILITY Symposium

Consider a simple tank level and flow control system.

The key features of this system are:

• a water tank, fed by a water pump on the inflow and regulated by

• a tank bypass to prevent overflow, controlled by the three stop

• valve actuation and control implemented in software

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

Control Flow Normally

RELIABILITY and MAINTAINABILITY Symposium

RELIABILITY and MAINTAINABILITY Symposium

non-critical failure behavior

critical failure behavior

RELIABILITY and MAINTAINABILITY Symposium

The control system is not designed to be fault tolerant, so that most

Further, an underflow can occur if valve v1 or v2 fails, thus preventing

RELIABILITY and MAINTAINABILITY Symposium

Software failures have been characterized as an improper change of