Académique Documents
Professionnel Documents
Culture Documents
of Computer-Based Systems
A fault tree model provides a logical framework for analyzing the failure
behavior of a system.
Structure Function:
Fail = valve_failed OR
fill mode too long (timer_failed AND sensor_failed)
valve
stuck open
F = A + BC
timeout full
control sensor
failed failed
F = A + BC Suppose
Pr[A] = 0.01
Pr[F] = Pr[ A + BC]
Pr[B] = 0.05
= Pr[A] + Pr[BC] - Pr[ABC]
Pr[C] = 0.075
= 0.01 + 0.00375 - 0.0000375
(all failures are independent)
= 0.0137125
Most fault tree analysis techniques start with the generation of cutsets
A cutset is a set of basic events; if all the basic events in a cutset occur,
then the top event (system failure) occurs.
Cutsets:
fill mode too long
valve
stuck open {valve} (single point of failure)
timeout
control
failed
full
sensor
failed
{timer, sensor}
G1
{A1,A3}
G3 {A4,A5}
G2 G3 {A1,G5} {A1,A4}
G1
G2 {G4,G5}
{A2,G5}
A5 {A2,A3}
G4 G5
{A2,A4}
A1 A2 A3 A4
But the cutsets are not disjoint so we cannot sum their individual
probabilities. We must account for the overlap of the events.
Washing Machine Overflows
Cutsets:
fill mode too long
valve
stuck open {valve} (single point of failure)
timeout
control
failed
full
sensor
failed
{timer, sensor}
A AB B
ABC
AC BC
_
A AB
__
ABC
A1
B1 B1
A2 1
A2
0
A1 B1 0
B2
0 AB
0 1
A2 B2
AB
• a tank and level control system with three sensors (level, inflow and
outflow) implemented in software
The function is to maintain the water level and downstream flow rate
at particular values by opening and closing control valves cv1 and cv2.
The controller receives inputs from the three sensors, implements the
control logic and then gives commands to the two control valves and
the two stop valves.
(This example is adapted from an example in: S. Guarro, M. Yau and M. Motamed,
“Development of tools for safety analysis of control software in advanced reactors”,
NUREG/CR-6465, April 1996.)
Digital
Controller
Normally
Closed
valve v2
close v1
level too high Drain water open v2
from tank open v3
cv1 = min
cv2 = max
flow set point
open v1
compare level calculate close v2
level
with level level positions for open v3
set points within control valves calculate cv1
bounds calculate cv2
upstream downstream
flow measurement flow measurement
open v1
replenish water
in tank close v2
level too low close v3
cv1 - max
cv2 = min
very very
low normal high
low high
Underflow Overflow
correct operation
A software failure when the tank water level is very low or very high
can lead to system failure.
Processor
For robust software systems, fault trees can help identify high-risk
areas (either quantitatively or qualitatively).
• exhaustive testing
• formal methods
• exception handling
• acceptance tests
• interlocks
• redesign
[1] Herbert Hecht and Myron Hecht, “Fault Tolerant Software.” In D.K. Pradhan, editor, Fault-
Tolerant Computing: Theory and Techniques, volume 2, pages 658-696. Prentice-Hall, 1986.
G2 G3 - exhaustive testing
- proof of correctness
fault occurs
C exit
Coverage permanent coverage
Model (covered failure}
R exit
transient restoration
(covered fault) S exit
single-point failure
(uncovered failure)
The entry point to the model is the occurrence of the fault, and
the three exits (R,C, and S) are the three possible outcomes.
Retry
Rollback
Permanent
Recovery
Transient
Restoration
Exit R
Permanent
Coverage Failure
Exit C Exit S
• Step 1: Wait for 0.1 second and do nothing. If the fault is transient it
may disappear during this time, allowing rollback to succeed.
0.02
0.98
0.95 0.05
0.85 0.15
successful unsuccessful
Permanent Failure
Coverage Exit S
Exit C
The 2% of faults that affect more than one memory bit are 95%
detectable. When a multiple memory error is detected, the affected
portion of memory is discarded, the memory mapping function is
updated, and the needed information is reloaded from a previous
checkpoint and updated to represent the current state of the system.
Thus the probability of single point failure is the sum of these two
cases, or 0.00385.
Bus
Memories
System Failure
2/3
Bus
P1 P2 P3 M1 M2
Processors Memories
2/3
Bus
S C S
S C C S C S C
R R R R R
P1 P2 P3 M2 M1
Processors Memories
Pr[component fault] = p
No fault covered fault
Pr[component operational] = q
or transient
Note that the events “fail covered” and “fail uncovered” are mutually
exclusive (i.e. not independent).
Primary
Switch
Spare
If switch fails after primary fails (and after spare is activated) then the
system is still operational.
If the switch fails before the primary fails, then the spare cannot be
activated and the system fails, even though the spare is operational.
Several special purpose gates have been added to the traditional fault
tree gates. These special dynamic gates capture sequence
dependencies which frequently arise when modeling fault tolerant
computer systems. If a dynamic gate is part of a fault tree then it is
solved via a Markov chain, rather than by using traditional methods.
FDEP
Trigger event
(may be subtree)
SPARE
Primary
component Spare units which are assumed to
fail at (possibly) reduced rate
before being switched into active use
A B
Inputs (may be subtrees)
which may occur in any order
A before B before C
A B
Memory Memory
A1 Cold Spare A
Interface Interface
A2 Unit 1 Unit 2 Operator console
Operator
& Software
M1 M2 M3 M4 M5
HECS has 5 memory units; three are required. These memory units
are connected to the bus via two memory interface units. If the memory
interface unit fails, the memory units connected to it are unusable.
Memory unit 3 (M3) is connected to both interfaces for redundancy;
thus M3 is accessible as long as either interface unit is operational.
There is also a human operator who interfaces with the system via a
console, and runs some software application.
A1 A2
Cold Spare
A
Notice that the cold spare is shared between the two processors. First
to fail is replaced with the spare; the spare is then unavailable if the
other fails.
memory
units
3/5
Functional
Functional Functional
Dependency
Dependency Dependency
M1 M2 M3
M4 M5
MIU 2
MIU 1
2*Bus 3/5
Operator
A1 A2
M1 M2 M3
M4 M5
NE2 NE4
NE1
A3 B3 C3 D3
NE3
D2 AS
C2 BS
NE2 NE4
B2 CS
A2 DS
NE1
D1 C1 B1 A1
A1 A2 A3 AS B1 B2 B3 BS C1 C2 C3 CS D1 D2 D3 DS
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 AS BS CS DS
NE3
B1
S2
D1 C2
NE2 NE4
B3 D3
A2 S4
NE1
S1 C3 B2 A1
A1 S1 A1
B2 C3
A1 S1 A1 A2 S2 A2 A3 S3 A3 B1 S4 B1 B2 S1 B2 B3 S2 B3 C1 S3 C1 C2 S4 C2 C3 S1 C3 D1 S2 D1 D2 S3 D2 D3 S4 D3
B2 C3 B3 D1 C1 D2 C2 D3 C3 A1 D1 A2 D2 A3 D3 B1 A1 B2 A2 B3 A3 C1 B1 C2
A1 B2 C3 S1 A2 B3 D1 S2 A3 C1 D2 S3 B1 C2 D3 S4
Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare Spare
A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
AS
BS
CS
DS
A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 AS BS CS DS
Memory 1 Memory 2
Background Data Bus
VM VM
BD BD MM MM Bus Bus Mem Mem Veh Mgt 1 Veh Mgt 2
Bus Bus Bus Bus 1 2 1 2
1 3 1 3
BD MM
Bus Bus
2 2
Vehicle Vehicle Vehicle Vehicle
Mgmt 1a Mgmt 1b Mgmt 2a Mgmt 2b
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Module CPU
VM
S1 VM
S2
Crew 1a Crew 1b S&O 1a S&O 1b Sys Mgt Sys Mgt Path Gen Path Gen
1a 1b 1a 1b
CSP CSP CSP CSP CSP CSP CSP CSP
S1 S2
Software
Minimize
FDEP
CSP CSP
S2
S1
VM VM
BD BD MM MM Bus Bus Mem Mem Veh Mgt 1 Veh Mgt 2
Bus Bus Bus Bus 1 2 1 2
1 3 1 3
BD MM
Bus Bus
2 2
Vehicle Vehicle Vehicle Vehicle
Mgmt 1a Mgmt 1b Mgmt 2a Mgmt 2b
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Loss VM
S1 VM
S2
Software
Minimize
FDEP
CSP CSP
S2
S1
Given a fault
tree model as input
Operator
A1 A2
M1 M2 M3
M4 M5
independent subtree 2
type: dynamic
TEDTS Pace
Coil Leads
WSP
Backup Motor
Battery
CPU Amplifier
Motor
Pump Motor
Cable
Pump,Motor M2 CPU
Leads
M5
Motor
Pace Section
Pump
Leads
M1 FDEP WSP
Crossbar System
Power TEDTS Switch. Superv.
Supply
M3
TEDTS TEDTS
Battery
Contr. Coil
VM VM
BD BD MM MM Bus Bus Mem Mem Veh Mgt 1 Veh Mgt 2
Bus Bus Bus Bus 1 2 1 2
1 3 1 3
BD MM
Bus Bus
2 2
Vehicle Vehicle Vehicle Vehicle
Mgmt 1a Mgmt 1b Mgmt 2a Mgmt 2b
CSP CSP CSP CSP
VM VM VM VM
1a 1b 2a 2b
Loss VM
S1 VM
S2
Software
Minimize
FDEP
CSP CSP
S2
S1
Static modules are solved using the BDD approach; dynamic modules
are solved using Markov chain methods.