
Software Failure and Classification of Software Faults

What is a software failure? According to Laprie et al. [20], ``a system failure occurs when the delivered service no longer complies with the specifications, the latter being an agreed description of the system's expected function and/or service''. This definition applies to both hardware and software system failures. Faults or bugs in a hardware or a software component cause errors. An error is defined by Laprie et al. [20] as that part of the system which is liable to lead to subsequent failure; an error affecting the service is an indication that a failure occurs or has occurred. If the system comprises multiple components, errors can lead to a component failure. As the various components in the system interact, the failure of one component may introduce one or more faults in another. Figure 1 shows this cyclic behavior.

Figure 1: Fault behavior

Classification of software faults

Some studies have suggested that since software is not a physical entity, and hence not subject to transient physical phenomena (as opposed to hardware), software faults are permanent in nature [15]. Other studies classify software faults as both permanent and transient. Gray [11] classifies software faults into Bohrbugs and Heisenbugs. Bohrbugs are essentially permanent design faults and hence almost deterministic in nature. They can be identified easily and weeded out during the testing and debugging phase (or early deployment phase) of the software life cycle. Heisenbugs, on the other hand, belong to the class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible. Hence these faults result in transient failures, i.e., failures which may not recur if the software is restarted. Typical situations in which Heisenbugs surface include interactions at the boundaries between software components, improper or insufficient exception handling, and interdependent timing of events. For this reason, Heisenbugs are extremely difficult to identify through testing. Hence a mature piece of software in the operational phase, released after its development and testing stages, is more likely to experience failures caused by Heisenbugs than by Bohrbugs. Most recent studies of failure data have reported that a large proportion of software failures are transient in nature [11,12], caused by phenomena such as overloads or timing and exception errors [7,27]. A study of failure data from Tandem's fault-tolerant computer system indicated that 70% of the failures were transient, caused by faults such as race conditions and timing problems [21,22].
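To make the notion of a Heisenbug concrete, consider the following illustrative Python sketch (not taken from the studies cited above). An unsynchronized read-modify-write on shared state is a latent fault whose activation depends on thread interleaving; a given run may or may not lose updates, and simply rerunning the program often masks the fault:

```python
import threading

def increment_unsafe(state, n):
    # Unsynchronized read-modify-write: a latent race condition whose
    # activation depends on thread interleaving -- a classic Heisenbug.
    for _ in range(n):
        tmp = state["count"]
        state["count"] = tmp + 1

def increment_safe(state, n, lock):
    for _ in range(n):
        with lock:               # make the update atomic
            state["count"] += 1

def run(worker, *extra):
    state = {"count": 0}
    threads = [threading.Thread(target=worker, args=(state, 100_000, *extra))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

racy = run(increment_unsafe)
correct = run(increment_safe, threading.Lock())
# `correct` is always 400000; `racy` may silently fall short on some runs
# and be correct on others -- restarting often hides the fault.
print(racy, correct)
```

The fault is permanent (the missing lock is always in the code), but its manifestation is transient, which is exactly what makes such bugs hard to catch in testing.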

Techniques for Fault Tolerance in Software


Means to cope with the existence and manifestation of faults in software are divided into three main categories:

Fault avoidance/prevention: This includes design methodologies which attempt to make software provably fault-free.

Fault removal: These methods aim to remove faults after the development stage is completed, through exhaustive and rigorous testing of the final product.

Fault tolerance: These methods assume that the system has unavoidable and undetectable faults, and make provisions for the system to operate correctly even in the presence of faults.

Most Bohrbugs, which are deterministic and repeatable, can be removed through rigorous and extensive testing and debugging. But, as argued in the previous sections, no amount of testing can certify software as fault-free, i.e., fault avoidance and fault removal cannot ensure the absence of faults. Therefore, any practical piece of software can be presumed to contain faults in the operational phase, and designers must deal with these faults if software failure has serious consequences. The faults remaining in software after testing and debugging are usually Heisenbugs which eluded detection during testing. Hence fault tolerance is the only remaining hope for achieving dependable software. Fault tolerance makes it possible for the software system to provide service even in the presence of faults; this means that an imminent failure needs to be prevented or recovered from. In this paper, we will only discuss methods to deal with software in the operational phase, i.e., methods to provide fault tolerance. There are two strategies for software fault tolerance: error processing and fault treatment. Error processing aims to remove errors from the software state. It can be implemented either by substituting an error-free state in place of the erroneous state, called error recovery, or by compensating for the error by providing redundancy, called error compensation. Error recovery can be achieved by either forward or backward error recovery. The second strategy, fault treatment, aims to prevent the activation of faults, so action is taken before the error creeps in. The two steps in this strategy are fault diagnosis and fault passivation. Figure 2 shows this classification. The nature of faults which typically occur in software has to be thoroughly understood in order to apply these strategies effectively. Techniques for tolerating faults in software have been divided into three classes: design diversity, data diversity and environment diversity. Table 1 shows the fault tolerance strategies used by these classes. All these classes are discussed briefly in the following sections.

Figure 2: Means of software fault tolerance

Table 1: Strategies used by different fault tolerance methods

                      Design     Data       Environment   Checkpointing
                      diversity  diversity  diversity     & recovery
Error compensation        X
Error recovery                                                  X
Fault treatment                      X           X

Design diversity

Design diversity techniques were developed specifically to tolerate design faults in software arising from wrong specifications and incorrect coding. Two or more variants of the software, developed by different teams to a common specification, are used in a time- or space-redundant manner to achieve fault tolerance. Popular techniques based on the design diversity concept are:

N-version programming: First introduced by Avizienis et al. [2] in 1977, this concept is similar to the NMR (N-modular redundancy) approach in hardware fault tolerance. In this technique, N (N >= 2) independently generated, functionally equivalent programs, called versions, are executed in parallel. Majority voting logic compares the results produced by all the versions and reports one of the results, which is presumed correct. The ability to tolerate faults here depends on how ``independent'' the different versions of the program are. This technique has been applied to a number of real-life systems such as railroad traffic control and flight control, even though the overhead involved in generating different versions and implementing the voting logic may be high.

Recovery block: Recovery blocks were first introduced by Horning et al. [14]. This scheme is analogous to the cold-standby scheme for hardware fault tolerance. In this approach, multiple functionally equivalent variants of the software are deployed in a time-redundant fashion. An acceptance test is used to check the validity of the result produced by the primary version. If the result from the primary version passes the acceptance test, this result is reported and execution stops. If, on the other hand, the result from the primary version fails the acceptance test, another version from among the multiple versions is invoked and the result it produces is checked by the acceptance test. Execution does not stop until the acceptance test is passed by one of the versions or all the versions have been exhausted. The significant differences between the recovery block approach and N-version programming are that only one version is executed at a time and that the acceptability of results is decided by a test rather than by majority voting. The recovery block technique has been applied to real-life systems and has been the basis for the distributed recovery block structure for integrating hardware and software fault tolerance, and for the extended distributed recovery block structure for command and control applications. Modeling and analysis of recovery blocks are described by Tomek et al. [28,29].

N-self-checking programming: In N-self-checking programming, multiple variants of the software are used in a hot-standby fashion, as opposed to the recovery block technique, in which the variants are used in cold-standby mode. A self-checking software component is a variant with an acceptance test, or a pair of variants with an associated comparison test [19]. Fault tolerance is achieved by executing more than one self-checking component in parallel. These components can also be used to tolerate one or more hardware faults.
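The voting and acceptance-test structures of the first two schemes can be sketched in a few lines of Python. The sorting variants and the acceptance test below are hypothetical stand-ins for independently developed versions:

```python
from collections import Counter

def n_version(inputs, versions):
    """N-version programming sketch: run every version (conceptually in
    parallel) and report the majority result."""
    results = [v(inputs) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count > len(versions) // 2:
        return value
    raise RuntimeError("voter found no majority")

def recovery_block(inputs, variants, acceptance_test):
    """Recovery block sketch: execute variants one at a time (primary
    first) until one produces a result passing the acceptance test."""
    for variant in variants:
        result = variant(inputs)
        if acceptance_test(inputs, result):
            return result
    raise RuntimeError("all variants exhausted")

# Hypothetical "independently developed" variants of a sorting routine;
# v2 contains a deliberate design fault.
v1 = lambda xs: tuple(sorted(xs))
v2 = lambda xs: tuple(sorted(xs, reverse=True))   # faulty variant
v3 = lambda xs: tuple(sorted(xs))
is_sorted = lambda xs, result: list(result) == sorted(xs)

print(n_version((3, 1, 2), [v1, v2, v3]))              # (1, 2, 3)
print(recovery_block((3, 1, 2), [v2, v1], is_sorted))  # (1, 2, 3)
```

Note the structural difference the text describes: the voter needs all versions to run and compares their outputs against one another, while the recovery block runs one variant at a time and judges each output against an independent test.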

The design diversity approach was developed mainly to deal with Bohrbugs. It relies on the assumption of independence between the multiple variants of the software. However, as some studies have shown, this assumption may not always be valid. Design diversity can also be used to treat Heisenbugs: since multiple versions of the software are operating, it is unlikely that all of them will experience the same transient failure. One of the disadvantages of design diversity is the high cost involved in developing multiple variants of the software. However, as we shall see in Section 3.3, there are other approaches which are more efficient and better suited to dealing with Heisenbugs.
Data diversity

Data diversity, a technique for fault tolerance in software, was introduced by Ammann and Knight [3]. While the design diversity approaches rely on multiple versions of the software written to the same specification, the data diversity approach uses only one version. It relies on the observation that software sometimes fails for certain values in the input space, and that this failure could be averted by a minor perturbation of the input data which is acceptable to the software. N-copy programming, based on data diversity, has N copies of a program executing in parallel, each copy running on a different input set produced by a diverse-data system. The diverse-data system produces a related set of points in the data space. Selection of the system output is done by an enhanced voting scheme, which need not be a majority voting mechanism. This technique may not be applicable to all programs, since equivalent input data transformations might not be acceptable by the specification. However, in some cases, such as a real-time control program, a minor perturbation in sensor values may be able to prevent a failure, since sensor values are usually noisy and inaccurate anyway. Data diversity works well with Bohrbugs and is cheaper to implement than design diversity techniques. To some extent, data diversity can also deal with Heisenbugs, since different input data is presented and, by definition, these bugs are non-deterministic and non-repeatable.
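The N-copy idea can be sketched as follows. The control routine and the perturbation set here are hypothetical illustrations: one copy receives the exact input that activates a fault, while the perturbed copies succeed, and an enhanced decision rule (here, a median rather than a majority vote) selects the output:

```python
import statistics

def n_copy(x, program, perturbations):
    """N-copy programming sketch: execute one program on several
    re-expressed (slightly perturbed) inputs and select the output with
    an enhanced decision rule -- here, the median of the surviving copies."""
    outputs = []
    for delta in perturbations:
        try:
            outputs.append(program(x + delta))   # minor, acceptable perturbation
        except Exception:
            pass                                 # this copy failed
    if not outputs:
        raise RuntimeError("all copies failed")
    return statistics.median(outputs)

# Hypothetical control routine with an input-dependent fault: it fails
# on one exact sensor reading but behaves correctly nearby.
def control_law(reading):
    if reading == 100.0:
        raise ZeroDivisionError("fault activated by this exact input")
    return reading * 0.5

print(n_copy(100.0, control_law, (-0.01, 0.0, 0.01)))   # close to 50.0
```

A median is a reasonable decision rule for noisy numeric outputs precisely because, unlike majority voting, it does not require the re-expressed copies to produce bit-identical results.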

Environment diversity

Environment diversity is the newest approach to fault tolerance in software. Although this technique has long been used in an ad hoc manner, only recently has it gained recognition and importance. Based on the observation that most software failures are transient in nature, the environment diversity approach re-executes the software in a different environment [17]. Environment diversity deals effectively with Heisenbugs by exploiting their very definition and nature.

Adams [1] proposed restarting the system as the best approach to masking software faults. Environment diversity, a generalization of restart, has been proposed in [15,17] as a cheap but effective technique for fault tolerance in software. Three components determine the behavior of a process or executing software [32]:

The volatile state: the program stack and the static and dynamic data segments.

The persistent state: all the user files related to the program's execution.

The operating system (OS) environment: all the resources the program accesses through the operating system, such as swap space, file systems, communication channels, keyboard and monitors.

Transient faults typically occur in computer systems due to design faults in software which result in unacceptable and erroneous states in the OS environment. Environment diversity therefore attempts to provide a new or modified operating environment for the running software, usually at the instant of a failure: when the software fails, it is restarted in a different, error-free OS environment state, achieved through some clean-up operations. Examples of environment diversity techniques include retrying the operation, restarting the application and rebooting the node. The retry and restart operations can be done on the same node or on a spare (cold/warm/hot) node. Tandem's fault-tolerant computer system [22] is based on the process-pair approach. It was observed that failures did not recur once the application was restarted on the second processor, because the second processor provided a different environment which did not trigger the error conditions that led to the failure on the first processor. Hence, in this case, hardware redundancy was used to tolerate most of the software faults. The basic observation behind all these transient failures is that the same error condition is unlikely to occur if the software is re-executed in a different environment. A specific form of environment diversity, called software rejuvenation [5,10,16,30,31], is the crux of our research.
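The retry-then-restart escalation can be sketched as follows. The task and its environment dictionary are hypothetical stand-ins for a real process and its OS environment state; a real system would escalate further to a node reboot or a failover to a spare node:

```python
def retry_then_restart(task, retries=2):
    """Environment diversity sketch with escalating recovery: retry the
    operation, discarding (cleaning up) the environment state between
    attempts, before escalating to reboot/failover in a real system."""
    env = {"scratch": [], "restarts": 0}   # stand-in for the OS environment state
    for _ in range(retries + 1):
        try:
            return task(env)
        except Exception:
            # clean-up: restart with a fresh, error-free environment
            env = {"scratch": [], "restarts": env["restarts"] + 1}
    raise RuntimeError("escalate: reboot the node or fail over to a spare")

# Hypothetical Heisenbug: the task fails only in the original environment.
def flaky_task(env):
    if env["restarts"] == 0:
        raise IOError("transient failure triggered by environment state")
    return "ok"

print(retry_then_restart(flaky_task))   # ok
```

The key property the sketch illustrates is that nothing about the task itself is fixed between attempts; only its environment changes, which suffices when the fault's activation condition lives in that environment.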

Checkpointing and Recovery

Checkpointing and recovery [6,18,24,25] belongs to the error recovery category of fault tolerance, as opposed to design diversity, which belongs to error compensation, and data and environment diversity, which belong to the fault treatment category. Error compensation, error recovery and fault treatment are complementary, and fault tolerance in software can be increased by deploying a combination of these techniques; the recovery block, for instance, uses checkpoints in its implementation. Garg [8,9] proposes and analyzes the combination of software rejuvenation (preventive fault treatment) with checkpointing and recovery, to reduce the chances of activating a fault while simultaneously minimizing the loss of computation upon failure.

Checkpointing involves occasionally saving the state of a process to stable storage during normal execution. Upon failure, the process is restarted from the saved state (the last checkpoint), which reduces the amount of lost work. Checkpointing and recovery was mainly intended to tolerate transient hardware failures, where the application is restarted upon repair of a failed hardware unit. The technique has been implemented in both software and hardware. Checkpointing and restart was used as early as 1948, when the ENIAC was used to solve a two-point boundary value problem with known conditions, although the computation was marred by a very high tube error rate [4]. Checkpointing has since also been used as a backward error recovery technique for handling intermittent software faults. For instance, the progressive retry technique [32] employs checkpointing along with message logs.
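A minimal sketch of the mechanism, using a temporary file as a stand-in for stable storage (file name and checkpoint interval are illustrative assumptions): the computation checkpoints periodically, a simulated failure interrupts it, and the restart resumes from the last checkpoint rather than from scratch.

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo.ckpt")  # stand-in for stable storage

def save_checkpoint(state):
    with open(CKPT, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint():
    with open(CKPT, "rb") as f:
        return pickle.load(f)

def long_computation(n, interval=10, crash_at=None):
    """Sum 1..n, checkpointing every `interval` steps; a restart resumes
    from the last saved checkpoint instead of from scratch."""
    if os.path.exists(CKPT):
        state = load_checkpoint()                # recover the saved state
    else:
        state = {"i": 0, "total": 0}
    for i in range(state["i"] + 1, n + 1):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated transient failure")
        state = {"i": i, "total": state["total"] + i}
        if i % interval == 0:
            save_checkpoint(state)
    return state["total"]

if os.path.exists(CKPT):
    os.remove(CKPT)                              # start the demo clean
try:
    long_computation(100, crash_at=57)           # fails mid-run ...
except RuntimeError:
    pass
result = long_computation(100)                   # ... restart resumes at step 50
print(result)                                    # 5050
os.remove(CKPT)
```

Only the work since the last checkpoint (steps 51-57 here) is lost; the checkpoint interval trades checkpointing overhead during normal execution against the amount of recomputation after a failure.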

Bibliography
1. E. Adams. Optimizing Preventive Service of Software Products. IBM Journal of Research & Development, 28(1):2-14, January 1984.
2. A. Avizienis and L. Chen. On the Implementation of N-version Programming for Software Fault Tolerance During Execution. In Proc. IEEE COMPSAC 77, pages 149-155, November 1977.
3. P. E. Ammann and J. C. Knight. Data Diversity: An Approach to Software Fault Tolerance. In Proc. of 17th Intl. Symposium on Fault Tolerant Computing, pages 122-126, June 1987.
4. W. C. Carter. Experiences in Fault-Tolerant Computing, 1947-1971. In The Evolution of Fault-Tolerant Computing, pages 1-36, Springer-Verlag, 1987.
5. V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan and W. Zeggert. Proactive Management of Software Aging. IBM Journal of Research & Development, Vol. 45, No. 2, March 2001.
6. K. M. Chandy, J. C. Browne, C. W. Dissly and W. R. Uhrig. Analytic Models for Rollback and Recovery Strategies in Data Base Systems. IEEE Trans. on Software Engineering, Vol. 1, pages 100-110, March 1975.
7. R. Chillarege, S. Biyani and J. Rosenthal. Measurement of Failure Rate in Widely Distributed Software. In Proc. of 25th IEEE Intl. Symposium on Fault Tolerant Computing, pages 424-433, Pasadena, CA, July 1995.
8. S. Garg, Y. Huang, C. Kintala and K. S. Trivedi. Minimizing Completion Time of a Program by Checkpointing and Rejuvenation. In Proc. 1996 ACM SIGMETRICS Conference, pages 252-261, Philadelphia, PA, May 1996.
9. S. Garg. Proactive Fault Management in Operational Software Systems. Ph.D. thesis, Department of Electrical and Computer Engineering, Duke University, Durham, NC, 1997.
10. S. Garg, A. van Moorsel, K. Vaidyanathan and K. S. Trivedi. A Methodology for Detection and Estimation of Software Aging. In Proc. of the Ninth Intl. Symposium on Software Reliability Engineering, pages 282-292, Paderborn, Germany, November 1998.
11. J. Gray. Why do Computers Stop and What Can be Done About it? In Proc. of 5th Symposium on Reliability in Distributed Software and Database Systems, pages 3-12, January 1986.
12. J. Gray. A Census of Tandem System Availability Between 1985 and 1990. IEEE Trans. on Reliability, 39:409-418, October 1990.
13. J. Gray and D. P. Siewiorek. High-Availability Computer Systems. IEEE Computer, pages 39-48, September 1991.
14. J. J. Horning, H. C. Lauer, P. M. Melliar-Smith and B. Randell. A Program Structure for Error Detection and Recovery. Lecture Notes in Computer Science, 16:177-193, 1974.
15. Y. Huang, P. Jalote and C. Kintala. Two Techniques for Transient Software Error Recovery. Lecture Notes in Computer Science, Vol. 774, pages 159-170, Springer-Verlag, Berlin, 1994.
16. Y. Huang, C. Kintala, N. Kolettis and N. D. Fulton. Software Rejuvenation: Analysis, Module and Applications. In Proc. of 25th Symposium on Fault Tolerant Computing, pages 381-390, Pasadena, California, June 1995.
17. P. Jalote, Y. Huang and C. Kintala. A Framework for Understanding and Handling Transient Software Failures. In Proc. 2nd ISSAT Intl. Conf. on Reliability and Quality in Design, Orlando, Florida, 1995.
18. V. Kulkarni, V. Nicola and K. S. Trivedi. Effects of Checkpointing and Queueing on Program Performance. Communications in Statistics - Stochastic Models, Vol. 6, No. 4, pages 615-648, 1990.
19. J. C. Laprie, J. Arlat, C. Bounes, K. Kanoun and C. Hourtolle. Hardware and Software Fault Tolerance: Definition and Analysis of Architectural Solutions. In Digest of 17th FTCS, pages 116-121, Pittsburgh, PA, 1987.
20. J. C. Laprie (Ed.). Dependability: Basic Concepts and Terminology. Springer-Verlag, Wien, New York, 1992.
21. I. Lee. Software Dependability in the Operational Phase. Ph.D. thesis, Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, IL, 1995.
22. I. Lee and R. K. Iyer. Software Dependability in the Tandem GUARDIAN System. IEEE Trans. on Software Engineering, Vol. 21, No. 5, pages 455-467, May 1995.
23. J. L. Lions. ARIANE 5 Flight 501 Failure: Report by the Inquiry Board. Paris, July 1996.
24. J. Long, W. K. Fuchs and J. A. Abraham. Implementing Forward Recovery Using Checkpoints in Distributed Systems. In Dependable Computing and Fault-Tolerant Systems - Dependable Computing for Critical Applications, Vol. 6, Meyer, Schlichting (Eds.), Springer-Verlag, New York, 1992.
25. D. K. Pradhan and N. H. Vaidya. Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture. IEEE Trans. on Computers, pages 1163-1174, October 1994.
26. D. K. Pradhan. Fault-Tolerant Computer System Design. Prentice Hall, Englewood Cliffs, NJ, 1996.
27. M. Sullivan and R. Chillarege. Software Defects and Their Impact on System Availability - A Study of Field Failures in Operating Systems. In Proc. 21st IEEE Intl. Symposium on Fault-Tolerant Computing, pages 2-9, 1991.
28. L. Tomek, J. Muppala and K. S. Trivedi. Modeling Correlation in Software Recovery Blocks. IEEE Trans. on Software Engineering (special issue on Software Reliability), Vol. 19, No. 11, pages 1071-1086, November 1993.
29. L. Tomek and K. S. Trivedi. Analyses Using Stochastic Reward Nets. In Software Fault Tolerance, M. Lyu (Ed.), John Wiley & Sons, 1994.
30. K. Vaidyanathan and K. S. Trivedi. A Measurement-Based Model for Estimation of Resource Exhaustion in Operational Software Systems. In Proc. of the Tenth IEEE Intl. Symposium on Software Reliability Engineering, pages 84-93, Boca Raton, Florida, November 1999.
31. K. Vaidyanathan, R. E. Harper, S. W. Hunter and K. S. Trivedi. Analysis and Implementation of Software Rejuvenation in Cluster Systems. In Proc. of the Joint Intl. Conference on Measurement and Modeling of Computer Systems, ACM SIGMETRICS 2001/Performance 2001, Cambridge, Massachusetts, June 2001.
32. Y.-M. Wang, Y. Huang, P.-Y. Chung, P. Vo and C. Kintala. Checkpointing and its Applications. In Proc. Symposium on Fault Tolerant Computer Systems, pages 22-31, Pasadena, CA, June 1995.
