RELIABILITY
Part 1
Reliability
• Correct service = the delivered service fulfills the system function
• Incorrect service = the delivered service does not fulfill the system function
• Failure = transition from correct to incorrect service
• To quantify:
• Reliability = continuity of correct service
• Time to failure
• Availability = readiness for correct service
• Frequency of failure
Reliability model
Is this suitable for software reliability?
Hardware = real, physical component
Software = intangible, informational component
Reliability = continuity of correct service
Models
System = hw + sw
System reliability tasks
“Reliability assurance of combined hardware and software systems requires implementation of a thorough, integrated set of reliability modeling, allocation, prediction, estimation and test tasks. These tasks allow on-going evaluation of the reliability of system, subsystem and lower-tier designs. The results of these analyses are used to assess the relative merit of competing design alternatives, to evaluate the reliability progress of the design program, and to measure the final, achieved product reliability through demonstration testing.

At each step in the design-evaluate-design process, the metrics used to predict product reliability provide a mechanism for a total quality management system to provide ongoing control and refinement of the design process.”
[Diagram: hardware and software tasks, split into a reliability part and a reliability/management part]
Example: missile guidance system
• On February 25, 1991, during the Gulf War, an American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks and killed 28 soldiers.
• The cause: an inaccurate calculation of the time since boot due to computer
arithmetic errors.
• Specifically, the time in tenths of a second, as measured by the system's internal clock, was multiplied by 1/10 to produce the time in seconds. This calculation was performed using a 24-bit fixed-point register. In particular, the value 1/10, which has a non-terminating binary expansion, was chopped at 24 bits after the radix point. The small chopping error, when multiplied by the large number giving the time in tenths of a second, led to a significant error. Indeed, the Patriot battery had been up around 100 hours, and an easy calculation shows that the resulting time error due to the magnified chopping error was about 0.34 seconds. (The number 1/10 equals 1/2^4 + 1/2^5 + 1/2^8 + 1/2^9 + 1/2^12 + 1/2^13 + .... In other words, the binary expansion of 1/10 is 0.0001100110011001100110011001100.... The 24-bit register in the Patriot stored instead 0.00011001100110011001100, introducing an error of 0.0000000000000000000000011001100... binary, or about 0.000000095 decimal. Multiplying by the number of tenths of a second in 100 hours gives 0.000000095 × 100 × 60 × 60 × 10 = 0.34.) A Scud travels at about 1,676 meters per second, and so travels more than half a kilometer in this time. This was far enough that the incoming Scud was outside the "range gate" that the Patriot tracked. Ironically, the fact that the bad time calculation had been improved in some parts of the code, but not all, contributed to the problem, since it meant that the inaccuracies did not cancel.
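The arithmetic in this example can be checked with a short script. This is a sketch, not the Patriot code: the 23-bit fractional width below is inferred from the stored binary value quoted above.

```python
from math import floor

# Reproduce the Patriot clock drift: 1/10 chopped to a fixed-point
# fraction (23 fractional bits, matching the stored value
# 0.00011001100110011001100 quoted above).
FRACTION_BITS = 23
stored = floor(0.1 * 2**FRACTION_BITS) / 2**FRACTION_BITS  # chopped 1/10
error_per_tick = 0.1 - stored            # ~9.5e-8 seconds per clock tick

ticks = 100 * 60 * 60 * 10               # tenths of a second in 100 hours
drift = error_per_tick * ticks           # accumulated clock error, ~0.34 s
distance = drift * 1676                  # meters a Scud covers in that time

print(f"error per tick: {error_per_tick:.3e}")
print(f"drift after 100 h: {drift:.2f} s")
print(f"Scud travel in that time: {distance:.0f} m")
```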
Pay attention:
Once split into modules, the system must be evaluated
at the level of each module and as a whole.
System reliability and a bit of Math
• Reliability R(t) = the conditional probability that a system
functions correctly during the time interval [t0, t], given that
it was working properly at the initial time t0.
• Example (R_SS: two redundant modules in parallel; R_TMR: triple modular redundancy with majority voting):
• R_SS = R + (1-R)·R = 2R - R^2
• R_TMR = R·R·R + (1-R)·R·R + R·(1-R)·R + R·R·(1-R) = 3R^2 - 2R^3
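The two formulas above can be evaluated directly; a minimal sketch showing that TMR only improves on a single module when the module reliability exceeds 0.5:

```python
def r_ss(r: float) -> float:
    """Two redundant modules: the system works if at least one works."""
    return 2 * r - r ** 2

def r_tmr(r: float) -> float:
    """Triple modular redundancy: the system works if at least 2 of 3 work."""
    return 3 * r ** 2 - 2 * r ** 3

for r in (0.4, 0.5, 0.9):
    print(f"R={r}: SS={r_ss(r):.3f}, TMR={r_tmr(r):.3f}")
# For R < 0.5, r_tmr(R) < R: redundancy makes things worse
```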
System vs Module Reliability
[Figure: system reliability (y-axis, 0–1) vs. module reliability (x-axis, 0–1) for the TMR, SS, and single-module (=R) configurations]
The System, the Module and the Software
• software ≠ hardware
• in each mode of a software system's (CSCI) operation, different
software modules (CSCs) will be executing
• each mode has a unique time of operation associated with it
• A model should be developed for the software portion of a
system to illustrate which modules operate during each
system mode, and to indicate the duration of each mode.
• Example:
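As a sketch of such a mode model (the mode names, CSC lists, failure rates, and durations below are hypothetical, assuming a constant failure rate per CSC):

```python
from math import exp

# Hypothetical mission profile: mode -> (active CSCs, duration in hours)
modes = {
    "startup":  (["init", "self_test"], 0.5),
    "cruise":   (["nav", "logging"],    8.0),
    "shutdown": (["logging"],           0.5),
}

# Hypothetical constant failure rates per CSC (failures/hour)
rates = {"init": 1e-4, "self_test": 2e-4, "nav": 5e-5, "logging": 1e-5}

# Mission reliability = product over modes of exp(-(sum of active rates)*duration)
mission_r = 1.0
for mode, (cscs, hours) in modes.items():
    lam = sum(rates[c] for c in cscs)
    r_mode = exp(-lam * hours)
    mission_r *= r_mode
    print(f"{mode}: active={cscs}, R={r_mode:.6f}")

print(f"mission reliability: {mission_r:.6f}")
```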
Software Reliability Models
• Over 200 models
• since the early 1970s
• how to quantify software reliability remains a largely unsolved problem
• No single model that can be used in all situations
• No model is complete or even representative.
• One model may work well for a set of certain software, but may be
completely off track for other kinds of problems.
• Most software models contain:
• Assumptions
• Factors
• A mathematical function that relates reliability to the factors;
the function is usually higher-order exponential or logarithmic.
Software Reliability Models Comparison – 1/2
| Model | How it works | What data? | When |
|---|---|---|---|
| Prediction | observe and accumulate failure data; analyze with statistical inference | Uses historical data | Usually made prior to development or test phases; can be used as early as the concept phase |
| Estimation | observe and accumulate failure data; analyze with statistical inference | Uses data from the current software development effort | Usually made later in the life cycle |
Software Reliability Models Comparison – 2/2
| Model | + | - | Examples |
|---|---|---|---|
| Prediction | software reliability can be predicted early in the development phase; enhancements can be initiated to improve the reliability | an "educated guess"; with no/little historical data, predictions may have substantial errors | Musa's Execution Time Model, Putnam's Model, and Rome Laboratory models TR-92-51 and TR-92-15 |
| Estimation | more accurate values | can be used only after some data have been collected; enhancements may be difficult to implement | fault count/fault rate estimation models: exponential distribution, Weibull distribution; Bayesian fault rate estimation models: Thompson and Chelson's model |
Software Reliability Prediction
• Historical data
• Current data
[Diagram: historical and current data are used to predict the software's fault content]
Many, many metrics
• Product metrics
• Software size
• Lines Of Code (LOC), or LOC in thousands (KLOC)
• source code is used (SLOC, KSLOC)
• comments and other non-executable statements are not counted
• cannot faithfully compare software written in different languages
• Function point metric
• a count of inputs, outputs, master files, inquiries, and interfaces
• measures the functional complexity of the program
• is independent of the programming language
• used primarily for business systems; not proven in scientific or real-time applications
• Complexity-oriented metrics
• simplify the code into a graphical representation
• McCabe's Complexity Metric.
• Test coverage metrics
• software reliability is a function of the portion of software that has been successfully verified or tested.
• Project management metrics
• Cost
• Process metrics
• estimate, monitor and improve the reliability and quality of software, e.g.: ISO-9000
• Fault and failure metrics
• Mean Time Between Failures (MTBF), with MTBF = MTTF + MTTR
• Mean Time To Failure (MTTF)
• Mean Time To Repair (MTTR)
• number of faults found during testing
• failures (or other problems) reported by users after delivery are collected
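The fault and failure metrics above can be derived from field data; a minimal sketch using hypothetical uptime/repair records:

```python
# Hypothetical field records: alternating (uptime, repair time) in hours
records = [(120.0, 2.0), (95.0, 1.5), (200.0, 3.0), (150.0, 2.5)]

mttf = sum(up for up, _ in records) / len(records)    # mean time to failure
mttr = sum(rep for _, rep in records) / len(records)  # mean time to repair
mtbf = mttf + mttr                                    # MTBF = MTTF + MTTR

print(f"MTTF = {mttf:.2f} h, MTTR = {mttr:.2f} h, MTBF = {mtbf:.2f} h")
```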
Different models, different metrics
Using those metrics – MTTF
• MTTF = Mean time to failure
• Distribution of failures?
• Exponential
• Weibull
• Logarithmic
• Etc…
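For instance, the exponential and Weibull failure distributions give the reliability functions R(t) = exp(-λt) and R(t) = exp(-(t/η)^β); a sketch with illustrative parameter values:

```python
from math import exp

def r_exponential(t: float, lam: float) -> float:
    """Constant failure rate lam: R(t) = exp(-lam*t); MTTF = 1/lam."""
    return exp(-lam * t)

def r_weibull(t: float, eta: float, beta: float) -> float:
    """Weibull with scale eta and shape beta: R(t) = exp(-(t/eta)**beta).
    beta > 1 models wear-out, beta < 1 models infant mortality,
    beta = 1 reduces to the exponential with lam = 1/eta."""
    return exp(-(t / eta) ** beta)

# With beta = 1 the two models coincide
lam = 1e-3
print(r_exponential(500, lam))        # ~0.6065
print(r_weibull(500, 1 / lam, 1.0))   # same value
```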
Example (Xing)
• A module with constant failure rate λ will survive 200
hours without failure with a 0.97 probability
• MTTF?
• What is the probability of surviving 1000 hours?
• Solution:
• R(t) = exp(-λt); with t = 200, R(t) = 0.97
=> -λ·200 = ln(0.97)
=> λ = -0.030459 / -200 = 1.523·10^-4 per hour
• MTTF = 1/λ = 6566.16 hours
• R(1000) = exp(-λ·1000) = 0.858
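The same solution, worked numerically (a sketch of the three steps above):

```python
from math import exp, log

# Solve R(200) = 0.97 for the constant failure rate lam
lam = -log(0.97) / 200          # failures per hour, ~1.523e-4
mttf = 1 / lam                  # mean time to failure, ~6566 h
r_1000 = exp(-lam * 1000)       # probability of surviving 1000 h, ~0.859

print(f"lambda  = {lam:.4e} /h")
print(f"MTTF    = {mttf:.2f} h")
print(f"R(1000) = {r_1000:.3f}")
```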
Why does this help?