
Software Reliability

By Allesh Panda IIIT BBSR

Functional and Non-functional Requirements


System functional requirements may specify error checking, recovery features, and protection against system failure. System reliability and availability are specified as part of the non-functional requirements for the system.

System Reliability Specification


Hardware reliability
probability a hardware component fails

Software reliability
probability a software component will produce an incorrect output
software does not wear out
software can continue to operate after producing a bad result

Operator reliability
probability system user makes an error

Failure Probabilities
If a system depends on two independent components, A and B, and fails when either fails, then P(S) = 1 - (1 - P(A))(1 - P(B)), which is approximately P(A) + P(B) when the probabilities are small. If a component is replicated n times, the system fails only when all replicas fail at once, so P(S) = P(A)^n.
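Assuming independent component failures, these rules can be sketched in a few lines of Python (the probabilities used are made-up examples):

```python
# Failure probability of a system of independent components.
def series_failure(probs):
    """System fails if ANY component fails.
    Exact: P(S) = 1 - product(1 - p_i); approximately sum(p_i) for small p_i."""
    surviving = 1.0
    for p in probs:
        surviving *= (1.0 - p)
    return 1.0 - surviving

def replicated_failure(p, n):
    """n identical replicas; the system fails only if ALL fail at once: p**n."""
    return p ** n

print(series_failure([0.01, 0.02]))  # ~0.0298, close to 0.01 + 0.02
print(replicated_failure(0.01, 3))   # ~1e-06
```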

Functional Reliability Requirements


The system will check all operator inputs to see that they fall within their required ranges.
The system will check all disks for bad blocks each time it is booted.
The system must be implemented using a standard implementation of Ada.

Non-functional Reliability Specification


The required level of reliability must be expressed quantitatively.
Reliability is a dynamic system attribute.
Source code reliability specifications are meaningless (e.g. N faults/1000 LOC).
An appropriate metric should be chosen to specify the overall system reliability.

Hardware Reliability Metrics


Hardware metrics are not suitable for software since they are based on the notion of component failure.
Software failures are often design failures.
Often the system is available again after the failure has occurred.
Hardware components can wear out.

Software Reliability Metrics


Reliability metrics are units of measure for system reliability.
System reliability is measured by counting the number of operational failures and relating these to demands made on the system at the time of failure.
A long-term measurement program is required to assess the reliability of critical systems.

Reliability Metrics - part 1


Probability of Failure on Demand (POFOD)
POFOD = 0.001 means one in every 1000 service requests results in failure.

Rate of Fault Occurrence (ROCOF)


ROCOF = 0.02 means two failures in each 100 operational time units.
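Both metrics are simple ratios over observed operational data; a minimal illustrative sketch (the counts are made up):

```python
# Illustrative computation of POFOD and ROCOF from (made-up) observations.
def pofod(failed_requests, total_requests):
    """Probability of Failure on Demand: fraction of service requests that fail."""
    return failed_requests / total_requests

def rocof(failures, operational_time_units):
    """Rate of Occurrence of Failure: failures per operational time unit."""
    return failures / operational_time_units

print(pofod(1, 1000))  # 0.001: one failure per 1000 requests
print(rocof(2, 100))   # 0.02: two failures per 100 time units
```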

Reliability Metrics - part 2


Mean Time to Failure (MTTF)
average time between observed failures (aka MTBF)

Availability = MTBF / (MTBF+MTTR)


MTBF = Mean Time Between Failure MTTR = Mean Time to Repair

Reliability = MTBF / (1+MTBF)
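The two formulas above can be sketched directly (the MTBF and MTTR values are made-up examples):

```python
# Sketch of the availability and reliability formulas (values are made up).
def availability(mtbf, mttr):
    """Fraction of time the system is available: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def reliability(mtbf):
    """Reliability as defined on the slide: MTBF / (1 + MTBF)."""
    return mtbf / (1 + mtbf)

# e.g. 99 hours between failures, 1 hour to repair:
print(availability(99, 1))  # 0.99 -> available 99% of the time
```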

Time Units
Raw Execution Time
non-stop system

Calendar Time
If the system has regular usage patterns

Number of Transactions
demand type transaction systems

Availability
Measures the fraction of time the system is actually available for use.
Takes repair and restart times into account.
Relevant for non-stop, continuously running systems (e.g. traffic signal).

Probability of Failure on Demand


Probability that the system will fail when a service request is made.
Useful when requests are made on an intermittent or infrequent basis.
Appropriate for protection systems, where service requests may be rare and consequences can be serious if the service is not delivered.
Relevant for many safety-critical systems with exception handlers.

Rate of Fault Occurrence


Reflects the rate of failure in the system.
Useful when the system has to process a large number of similar requests that are relatively frequent.
Relevant for operating systems and transaction processing systems.

Mean Time to Failure


Measures the time between observable system failures.
For stable systems, MTTF = 1/ROCOF.
Relevant for systems where individual transactions take lots of processing time (e.g. CAD or WP systems).
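The MTTF = 1/ROCOF relationship can be checked on a made-up set of inter-failure times:

```python
# Checking MTTF = 1/ROCOF on made-up inter-failure times (hours).
inter_failure_times = [45, 55, 50, 48, 52]
mttf = sum(inter_failure_times) / len(inter_failure_times)
rocof_estimate = 1 / mttf
print(mttf)            # 50.0 hours on average between failures
print(rocof_estimate)  # 0.02 failures per hour
```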

Failure Consequences - part 1


Reliability does not take consequences into account.
Transient faults have no real consequences, but other faults might cause data loss or corruption.
It may be worthwhile to identify different classes of failure and use different metrics for each.

Failure Consequences - part 2


When specifying reliability, both the number of failures and the consequences of each matter.
Failures with serious consequences are more damaging than those where repair and recovery is straightforward.
In some cases, different reliability specifications may be defined for different failure types.

Failure Classification
Transient - only occurs with certain inputs
Permanent - occurs on all inputs
Recoverable - system can recover without operator help
Unrecoverable - operator has to help
Non-corrupting - failure does not corrupt system state or data
Corrupting - system state or data are altered
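One illustrative way to encode these three classification dimensions in code, so each observed failure can be tagged on all three axes (the enum names follow the slide; the tagged failure is hypothetical):

```python
from enum import Enum

# Illustrative encoding of the failure classification (names mirror the slide).
class Persistence(Enum):
    TRANSIENT = "only occurs with certain inputs"
    PERMANENT = "occurs on all inputs"

class Recovery(Enum):
    RECOVERABLE = "system can recover without operator help"
    UNRECOVERABLE = "operator has to help"

class Effect(Enum):
    NON_CORRUPTING = "failure does not corrupt system state or data"
    CORRUPTING = "system state or data are altered"

# A hypothetical failure, tagged on all three dimensions:
failure_tag = (Persistence.PERMANENT, Recovery.UNRECOVERABLE, Effect.NON_CORRUPTING)
print(failure_tag[0].value)  # prints the description of the persistence class
```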

Building Reliability Specification


For each sub-system, analyze the consequences of possible system failures.
From the system failure analysis, partition failures into appropriate classes.
For each failure class, set out the appropriate reliability metric.

Examples
Failure: ATM fails to operate with any card, must restart to correct
Class: Permanent, Non-corrupting
Metric: ROCOF = .0001, time unit = days

Failure: Magnetic stripe can't be read on undamaged card
Class: Transient, Non-corrupting
Metric: POFOD = .0001, time unit = transactions

Specification Validation
It is impossible to empirically validate high reliability specifications.
"No database corruption" really means POFOD < 1 in 200 million.
If each transaction takes 1 second to verify, simulating one day's transactions takes 3.5 days.
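A back-of-envelope check of the 3.5-day figure; the daily transaction volume below is our assumption, chosen because it reproduces the slide's number:

```python
# Back-of-envelope check: replaying one day's transactions at 1 second each.
transactions_per_day = 300_000            # assumed volume (not stated on the slide)
verify_seconds_each = 1
total_seconds = transactions_per_day * verify_seconds_each
simulation_days = total_seconds / (24 * 60 * 60)
print(round(simulation_days, 1))          # roughly 3.5 days per day of real traffic
```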

Statistical Reliability Testing


The test data used needs to follow typical software usage patterns.
Measuring the number of errors needs to be based on errors of omission (failing to do the right thing) and errors of commission (doing the wrong thing).

Difficulties with Statistical Reliability Testing


Uncertainty when creating the operational profile.
High cost of generating the operational profile.
Statistical uncertainty problems when high reliabilities are specified.

Safety Specification
Safety requirements should be specified separately.
These requirements should be based on hazard and risk analysis.
Safety requirements usually apply to the system as a whole rather than to individual components.
System safety is an emergent system property.

Safety Life Cycle - part 1


Concept and scope definition
Hazard and risk analysis
Safety requirements specification
  safety requirements derivation
  safety requirements allocation
Planning and development
  safety-related systems development
  external risk reduction facilities

Safety Life Cycle - part 2


Deployment
  safety validation
  installation and commissioning

Operation and maintenance

System decommissioning

Safety Processes
Hazard and risk analysis
assess the hazards and risks associated with the system

Safety requirements specification


specify system safety requirements

Designation of safety-critical systems


identify sub-systems whose incorrect operation can compromise entire system safety

Safety validation
check overall system safety

Hazard Analysis Stages


Hazard identification
identify potential hazards that may arise

Risk analysis and hazard classification


assess risk associated with each hazard

Hazard decomposition
seek to discover potential root causes for each hazard

Risk reduction assessment


describe how each hazard is to be taken into account when system is designed

Fault-tree Analysis
Hazard analysis method that starts with an identified fault and works backward to the cause of the fault.
Can be used at all stages of hazard analysis.
It is a top-down technique that may be combined with bottom-up hazard analysis techniques, which start with system failures and work forward to the hazards they can lead to.

Fault-tree Analysis Steps


Identify the hazard.
Identify potential causes of the hazard.
Link combinations of alternative causes using "or" and "and" symbols as appropriate.
Continue the process until root causes are identified; the result is an and/or tree (a logic circuit) whose leaves are the causes.

How does it work?


What would a fault tree describing the causes of a hazard like "data deleted" look like?
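As a sketch, such a fault tree can be represented with nested and/or nodes and evaluated against a set of observed root-cause events; the structure and event names below are purely illustrative:

```python
# Minimal fault-tree evaluation: leaves are root causes; internal nodes
# combine their children with "and"/"or". All names are illustrative.
def evaluate(node, events):
    if isinstance(node, str):          # leaf: a root cause
        return node in events
    op, children = node                # internal node: ("and"|"or", [children])
    results = [evaluate(child, events) for child in children]
    return all(results) if op == "and" else any(results)

# Hazard "data deleted": (accidental delete AND no backup) OR disk corruption.
data_deleted = ("or", [("and", ["accidental_delete", "no_backup"]),
                       "disk_corruption"])

print(evaluate(data_deleted, {"accidental_delete"}))               # False
print(evaluate(data_deleted, {"accidental_delete", "no_backup"}))  # True
```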

Risk Assessment
Assess the hazard severity, hazard probability, and accident probability.
The outcome of risk assessment is a statement of acceptability:
Intolerable (the risk must never arise)
ALARP (as low as reasonably practicable, given cost and schedule constraints)
Acceptable (consequences are acceptable and no extra cost should be incurred to reduce the risk further)

Risk Acceptability
Determined by human, social, and political considerations.
In most societies, the boundaries between the regions are pushed upwards with time (i.e. risk becomes less and less acceptable).
Risk assessment is always subjective (what is acceptable to one person may be ALARP to another).

Risk Reduction
The system should be specified so that hazards do not arise or, if they do, do not result in an accident.

Hazard avoidance
system designed so the hazard can never arise during normal operation

Hazard detection and removal


system designed so that hazards are detected and neutralized before an accident can occur

Damage limitation
system designed to minimize accident consequences

Security Specification
Similar to safety specification:
not possible to specify quantitatively
usually stated in "system shall not" terms rather than "system shall" terms

Differences
no well-defined security life cycle yet
security deals with generic threats rather than system-specific hazards

Security Specification Stages - part 1


Asset identification and evaluation
data and programs are identified along with their required level of protection
the degree of protection depends on the asset value

Threat analysis and risk assessment


security threats are identified and the risks associated with each are estimated

Threat assignment
identified threats are related to assets so that each asset has a list of associated threats

Security Specification Stages - part 2


Technology analysis
available security technologies and their applicability against the threats

Security requirements specification


where appropriate these will identify the security technologies that may be used to protect against different threats to the system
