Fault Tolerance
Fault-Error-Failure.
Error models
Redundancy,
Error Detection,
Watchdog
Damage Confinement,
Error Recovery,
Fault Treatment,
Fault Prevention,
Anticipated and unanticipated Faults.
Roll No: 15
Fault:
Definition
A fault is a defect within the system that may cause a
malfunction
Examples:
Software bug
Random hardware fault
Memory bit stuck
Error:
Definition
An error is a deviation from the required operation of
the system or subsystem
A fault may lead to an error, i.e., an error is the
mechanism by which the fault becomes apparent
Example
A memory bit is stuck but the CPU does not access this
data, so the fault has not yet become apparent as an error
Failure:
Definition
A system failure occurs when the system fails to
perform its required function
Redundancy
All fault-tolerant techniques rely on extra elements
introduced into the system to detect & recover from
faults
Hardware Redundancy
Two types: static (or masking) and dynamic
redundancy
Static: redundant components are used inside a
system to hide the effects of faults; e.g. Triple Modular
Redundancy
TMR: three identical subcomponents and majority voting
circuits; the outputs are compared and if one differs
from the other two, that output is masked out
Dynamic: redundancy supplied inside a component
which indicates that the output is in error; provides an
error detection facility; recovery must be provided by
another component
E.g. communications checksums and memory parity bits
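The majority voting used by TMR can be sketched in a few lines. This is a minimal illustration, not a hardware voter design: the function name `tmr_vote` and the choice to raise an exception when all three replicas disagree are assumptions made for the example.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Return the majority value of three replica outputs.

    Raises ValueError when all three outputs differ, because then
    no majority exists and the fault cannot be masked.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three replicas disagree")
    return value


# The second replica produces a wrong output; the voter masks it.
masked = tmr_vote(42, 41, 42)
```

A single faulty replica is hidden from the rest of the system, which is why static redundancy is also called masking redundancy.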
Software Redundancy
The system is provided with different software versions
of a task
Written independently by different teams of programmers
Information Redundancy
Parity checking
Checksum error detection
Cyclic Redundancy check
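The three information redundancy techniques listed above can be compared on the same message. The sketch below is illustrative only: `even_parity_bit` and `additive_checksum` are hypothetical helper names, the 8-bit additive checksum is just one simple checksum variant, and the CRC uses the CRC-32 from Python's standard `zlib` module.

```python
import zlib


def even_parity_bit(data: bytes) -> int:
    """Return the bit that makes the total number of 1 bits even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2


def additive_checksum(data: bytes) -> int:
    """Return the sum of all bytes modulo 256 (a simple 8-bit checksum)."""
    return sum(data) % 256


message = b"fault"
parity = even_parity_bit(message)
checksum = additive_checksum(message)
crc = zlib.crc32(message)

# Flip the lowest bit of the first byte to simulate a single-bit error.
corrupted = bytes([message[0] ^ 0x01]) + message[1:]

parity_detects = even_parity_bit(corrupted) != parity
checksum_detects = additive_checksum(corrupted) != checksum
crc_detects = zlib.crc32(corrupted) != crc
```

All three detect this single-bit error. A single parity bit misses any even number of flipped bits, whereas a CRC detects a much wider class of error patterns.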
Error Detection
Types
Other types heartbeats etc.
Environmental detection
hardware, e.g. illegal instruction
O.S./RTS, e.g. null pointer
Application detection
Replication checks
Timing checks
Reversal checks
Coding checks
Reasonableness checks
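A reversal check applies where the output of a computation can be mapped back to its input. As a hedged sketch, the hypothetical function `checked_sqrt` below computes a square root and then squares the result to confirm that it reproduces the input; the tolerances are arbitrary values chosen for the example.

```python
import math


def checked_sqrt(x: float, rel_tol: float = 1e-9) -> float:
    """Compute sqrt(x) and verify it by squaring the result (reversal check)."""
    if x < 0:
        raise ValueError("input must be non-negative")
    root = math.sqrt(x)
    # Reverse the computation and compare with the original input.
    if not math.isclose(root * root, x, rel_tol=rel_tol, abs_tol=1e-12):
        raise RuntimeError("reversal check failed: result does not map back to input")
    return root


root_of_nine = checked_sqrt(9.0)
```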
Error Recovery
Introduction
Two approaches:
1. Forward Recovery
2. Backward Recovery
Forward Recovery
assessing and removing errors completely
Forward error recovery continues from an erroneous
state by making selective corrections to the system
state
This includes making safe the controlled environment
which may be hazardous or damaged because of the
failure
It is system specific and depends on
accurate
Backward Recovery
This has the same functionality but uses a different algorithm (cf. N-Version Programming) and therefore no fault
Fault Treatment
Introduction
Error recovery (ER) returned the system to an error-free
state; however, the error may recur; the final phase of
fault tolerance (F.T.) is to eradicate the fault from the system
Continue..
If the alternative module also fails the acceptance
test, the program is restored to the recovery point
and yet another module is executed, and so on
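The retry scheme described above (restore the recovery point, run the next alternative, re-apply the acceptance test) can be sketched as follows. The names `recovery_block`, `primary_sort`, `alternative_sort` and `accept` are hypothetical, and taking the recovery point with `copy.deepcopy` is an assumption made to keep the example small.

```python
import copy


def recovery_block(state, modules, acceptance_test):
    """Run modules in order from the same recovery point.

    Returns the first result that passes the acceptance test and
    raises RuntimeError if every module fails.
    """
    recovery_point = copy.deepcopy(state)        # establish the recovery point
    for module in modules:
        working = copy.deepcopy(recovery_point)  # restore state before each attempt
        try:
            result = module(working)
        except Exception:
            continue                             # a crash counts as a failed attempt
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternative modules failed the acceptance test")


def primary_sort(values):
    values.pop()             # deliberate bug: one element is lost
    return sorted(values)


def alternative_sort(values):
    return sorted(values)


data = [3, 1, 2]


def accept(result):
    """Accept only an ordered result with the same number of elements."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return len(result) == len(data) and ordered


result = recovery_block(data, [primary_sort, alternative_sort], accept)
```

The primary module fails the acceptance test, so the state is restored and the alternative module supplies the result.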
Channel is noisy
Channel output prone to error
We need measures to ensure the correctness of the bit
stream transmitted
Error control coding aims at developing methods for
coding to check the correctness of the bit stream
transmitted.
Continue..
Different error control mechanisms:
continue
We may therefore represent the codeword as
Repetition Codes
This is the simplest of linear block codes
Example:
Consider a linear block code which is also a repetition
code. Let
k = 1 and n = 5. From the analysis done in linear block
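For the k = 1, n = 5 case, each message bit is sent five times and the receiver can decode by majority vote. The sketch below assumes majority-vote decoding; `encode` and `decode` are hypothetical names. Because the two codewords differ in all five positions, up to two bit errors per codeword are corrected.

```python
N = 5  # codeword length of the (n, k) = (5, 1) repetition code


def encode(bit):
    """Repeat the single message bit N times."""
    return [bit] * N


def decode(received):
    """Decode by majority vote over the N received bits."""
    return 1 if sum(received) > N // 2 else 0


codeword = encode(1)

# Two of the five bits are flipped by the channel; the vote still recovers 1.
decoded = decode([1, 0, 1, 0, 1])
```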
Hamming Distance
Improves traditional measures by
differ.
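The Hamming distance between two equal-length words is the number of positions in which they differ, and the minimum distance of a code is the smallest such distance over all pairs of distinct codewords. A small sketch, with hypothetical function names:

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Count the positions in which two equal-length words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codewords):
    """Return the smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


d = hamming_distance("10110", "11100")
d_min = minimum_distance(["00000", "11111"])
```

A code with minimum distance d_min can detect up to d_min - 1 bit errors and correct up to (d_min - 1) // 2 of them.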
Watchdog processors
Error detection technique:
Watchdog Processor
Watchdog Timer
An inexpensive method of error detection
Process being watched must reset the timer
Watchdog timers only detect errors which manifest
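A software model of a watchdog timer is sketched below. It is an illustration only: the class name `WatchdogTimer`, its `reset` and `expired` methods, and the injectable `clock` parameter are assumptions for the example. A real watchdog is usually a hardware counter that resets the system on expiry.

```python
import time


class WatchdogTimer:
    """Expires unless the watched process resets it within timeout_s seconds."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the timer can be tested
        self._last_reset = clock()

    def reset(self):
        """Called periodically by the process being watched."""
        self._last_reset = self._clock()

    def expired(self):
        """Return True if the process failed to reset the timer in time."""
        return self._clock() - self._last_reset > self.timeout_s


# Simulated clock, so the example does not need to sleep.
now = [0.0]
watchdog = WatchdogTimer(1.0, clock=lambda: now[0])

now[0] = 0.5
watchdog.reset()                   # the process is alive and resets the timer
now[0] = 1.4
alive_at_1_4 = not watchdog.expired()
now[0] = 1.6                       # no reset for more than one second
expired_at_1_6 = watchdog.expired()
```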
Heartbeats
Includes
Heartbeats: Issues
The timeout period is pre-negotiated by the two parties
or sometimes even hard-coded by the programmer
The predefined timeout value cannot adapt to
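A heartbeat monitor with a fixed timeout can be sketched as follows. The class `HeartbeatMonitor`, its methods, and the node names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic. The timeout is set once at construction, which is the hard-coded arrangement described above.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat of each node against one fixed timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s     # fixed when the monitor is created
        self._last_seen = {}

    def beat(self, node, now):
        """Record a heartbeat received from node at time now."""
        self._last_seen[node] = now

    def suspected(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=2.0)        # node-a keeps sending, node-b goes silent
suspects = monitor.suspected(now=3.0)
```

A node whose heartbeats are merely delayed beyond the timeout is suspected in the same way as a node that has failed.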
proper authorization
Examples:
virtual address management (MMU usually has
capability check)
password checking
Consistency Checks
range check - confirms that a computed value is in a
valid range, e.g: a computed probability must be in
the range 0 to 1
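The probability example can be written as a small guard function. This is a sketch; the name `check_probability` and the use of `ValueError` are assumptions.

```python
def check_probability(p: float) -> float:
    """Return p unchanged if it lies in [0, 1], otherwise raise ValueError."""
    # The chained comparison is also False for NaN, so NaN is rejected too.
    if not (0.0 <= p <= 1.0):
        raise ValueError(f"computed probability {p!r} is outside the range 0 to 1")
    return p


valid = check_probability(0.25)
```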
Data Audits
Introduction
For dynamic data, the range of allowable values for database fields
is often stored in the database system catalog. This information is
used to perform a range check on the dynamic fields in the
database.
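A catalog-driven range audit might look like the sketch below. The catalog contents, the field names and the function `audit_records` are invented for illustration; a real audit would read the allowable ranges from the database's own system catalog.

```python
# Hypothetical catalog: field name -> (lowest allowed, highest allowed)
CATALOG = {
    "temperature_c": (-40.0, 125.0),
    "valve_open_pct": (0.0, 100.0),
}


def audit_records(records, catalog):
    """Return (record index, field, value) for every out-of-range or missing field."""
    violations = []
    for index, record in enumerate(records):
        for field, (low, high) in catalog.items():
            value = record.get(field)
            if value is None or not (low <= value <= high):
                violations.append((index, field, value))
    return violations


records = [
    {"temperature_c": 21.5, "valve_open_pct": 40.0},
    {"temperature_c": 300.0, "valve_open_pct": 40.0},   # corrupted field
]
violations = audit_records(records, CATALOG)
```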
all header fields at computed offsets with expected values
Assertions
Goals
Hardware Approaches
Software Approaches
Hardware Approaches
Embedded Signature Monitoring
Pre-computed signature embedded in the application
program
Recompilation of existing programs
Performance degradation of application
Autonomous Signature Monitoring
Watchdog Processor stores pre-computed signature in the
memory and mimics the control flow of application
Watchdog Processor rather complex
High memory overhead
Software Approaches
Software techniques partition the application into
blocks, either in the assembly language or in the high
level language
Appropriate instrumentation inserted at the beginning
and/or end of the blocks
The checking code is inserted in the instruction stream
eliminating the need for a hardware watchdog
processor
Two classes of approaches
non-preemptive signature checking
preemptive signature checking
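The idea of instrumenting block boundaries can be sketched in a high-level language. The example below is a simplified illustration, not either of the two classes named above: the block names, the `LEGAL_PREDECESSORS` table and the `ControlFlowChecker` class are hypothetical. Each block calls `enter` at its beginning, and the check confirms that the previously executed block is a legal predecessor.

```python
# Hypothetical control-flow graph: block name -> set of legal predecessor blocks.
# None stands for "no block has run yet".
LEGAL_PREDECESSORS = {
    "entry": {None},
    "loop": {"entry", "loop"},
    "exit": {"loop"},
}


class ControlFlowChecker:
    """Checks each block entry against the pre-computed legal transitions."""

    def __init__(self, legal_predecessors):
        self._legal = legal_predecessors
        self._previous = None

    def enter(self, block):
        """Instrumentation inserted at the beginning of every block."""
        if self._previous not in self._legal.get(block, set()):
            raise RuntimeError(
                f"illegal control flow: {self._previous!r} -> {block!r}"
            )
        self._previous = block


checker = ControlFlowChecker(LEGAL_PREDECESSORS)
for block in ["entry", "loop", "loop", "exit"]:
    checker.enter(block)           # a legal path raises no error

faulty = ControlFlowChecker(LEGAL_PREDECESSORS)
faulty.enter("entry")
try:
    faulty.enter("exit")           # jump that skips the loop block
    detected = False
except RuntimeError:
    detected = True
```

The check runs in the instruction stream of the application itself, so no separate watchdog processor is involved.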
THANK YOU