
7th Annual Conference on Systems Engineering Research 2009 (CSER 2009)

An Efficient Fault Diagnosis Method for Complex System Reliability


Ozge Doguc¹ and Jose Emmanuel Ramirez-Marquez²

¹ Stevens Institute of Technology, USA, odoguc@stevens.edu
² Stevens Institute of Technology, USA, Jose.Ramirez-Marquez@stevens.edu

Abstract

Due to aging and environmental factors, system components may fail or stop functioning as expected, which causes unprecedented changes in system reliability. Determining the failed component is crucial for systems engineers in order to limit the effects of the failure and avoid future problems. However, finding the source of the failure is not trivial in systems with a large number of components and complex component relationships. In this study, we present an efficient method to detect unprecedented changes in system reliability and find the failed component. Our method uses Bayesian networks (BN) to model complex systems and employs a heuristic to reduce the time needed to find the failed component. We also show that the running time of our method does not increase with the number of components in the system, so it can be used efficiently in complex systems.

Key words: Fault diagnosis, Bayesian networks, complex systems, system reliability

1 Introduction

During a complex system's life cycle, component failures may occur for several reasons, such as component aging and environmental factors. Such failures may cause unprecedented changes in the system reliability value and affect the reliability of not only the failed component but also the overall complex system. Diagnosing unprecedented changes in system reliability and detecting their source is essential for removing faulty components, replacing them with better ones, restructuring the system architecture, and thus improving overall system reliability. However, modern complex systems make it challenging for systems engineers to understand and troubleshoot possible system problems. Because of their size, complex systems therefore call for efficient monitoring and fault diagnosis methods.

Since Bayesian networks (BN) combine expert knowledge of the system with probability theory to construct effective diagnosis methodologies, they have been used in various fault diagnosis applications [1, 2, 3]. From our perspective, a BN is a directed graph in which the variables are the components of the system and the links represent the interactions between the components. In this paper, we define a new fault diagnosis method based on continuous monitoring of the overall system reliability value. In our method, fault diagnosis mechanisms are triggered to find the failed component when significant changes in system reliability are detected. As part of our method, an efficient search algorithm equipped with an effective heuristic is designed specifically for BN. We discuss how our method can be used efficiently in complex systems, since our search algorithm needs to check only a small portion of the system's components before detecting the failed one. We believe that our method provides systems engineers with valuable information for diagnosing failed components and improving reliability in complex systems.

2 Literature Survey

Statistical Process Control (SPC) [4], Principal Component Analysis (PCA), Partial Least Squares (PLS), neural networks, fuzzy logic, Bayesian networks [5], decision trees and Kalman filters [6] are the most commonly used fault diagnosis methods in the literature. During the last decade there has been a surge of applications using BN models for system analysis and diagnosis in industrial process control, business fraud detection and homeland security [7].
Currently, BN models are used for building intelligent agents and adaptive user interfaces (Microsoft [8], NASA [2]), process control (NASA [2], General Electric, Lockheed), fault diagnosis (Hewlett Packard, Intel [1], American Airlines), pattern recognition and data mining, medical diagnosis (BiopSys, Microsoft), and security and fraud detection (credit cards, AT&T) [3].

In the literature, system success or failure conditions have been evaluated using different metrics such as availability, functionality and maintainability. Another important metric is the reliability of the system, which can be defined as the probability that a system will perform its intended function during a specified period of time under stated conditions [9]. Traditionally, engineers estimate reliability by understanding how the different components in a system interact for system success. However, for complex systems, understanding component interactions usually requires the intervention of a domain expert and may prove to be a challenging problem. BN have been proposed as an alternative to traditional reliability estimation approaches, partly because they are easy to use in interaction with domain experts in the reliability field [10]. The idea of using BN in systems reliability has gained acceptance mainly because of the simplicity with which BN represent systems and the efficiency with which they capture component associations [11].

The concept of BN has been discussed in several earlier studies [12, 13, 14]. More recently, BN have found applications in software reliability [15, 16] and general reliability modeling [17]. Currently, predefined BN are used for reliability estimation of specific systems. For example, Gran and Helminen [9] provide a BN for nuclear power plants and introduce a hybrid method for estimating the reliability of the plant. In another study, Helminen and Pulkkinen present a BN-based method for reliability estimation of a computer-based motor protection relay [18]. In addition, Amasaki et al. [19] use BN for software quality assessment. As stated in Section 1, in this study we also use BN to model complex systems. A brief overview of BN is provided in the next section.

3 Bayesian Networks (BN)

A BN can be summarized as an approach that represents the interactions among the components of a system from a probabilistic perspective. This representation takes the form of a directed acyclic graph, where the nodes represent the variables and the links between pairs of nodes represent the causal relationships between the variables. From a system reliability perspective, the variables of a BN are the components of the system, while the links represent the interactions of the components that lead to system success or failure. In a BN this interaction is represented as a directed link between two components, forming a child-parent relationship in which the dependent component is called the child of the other. The success probability of a child node is conditional on the success probabilities associated with each of its parents [15]. The conditional probabilities of the child nodes are calculated using Bayes' theorem and stored in conditional probability tables (CPT). The absence of a link between two nodes of a BN indicates that these components do not interact for system failure or success; they are considered independent of each other and their probabilities are calculated separately.

Figure 1 - A sample Bayesian network

To illustrate these concepts, the BN shown in Figure 1 presents an example of how the five components of a system interact. For this BN, the child-parent relationships of the components can be observed, where on the quantitative side [20] the degrees of these relationships are expressed as probabilities. In Figure 1 the topmost nodes (X1, X2 and X4, representing components 1, 2 and 4 respectively) do not have any incoming edges, so their success probabilities are not conditioned on any other component in the system. The prior probabilities assigned to these nodes should be known beforehand, either with the help of a domain expert or from historical data about the system. Based on these prior probabilities, the success probability of a dependent node, such as X3, can be calculated using Bayes' theorem. Like the prior probabilities, the CPT can also be computed from historical data of the system. The same methodology can be applied to node X5 by using X2 and X4 as inputs. Overall system reliability (i.e. the success probability of the System Behavior node) can be calculated in the same way.

4 System Monitoring for Detecting Unprecedented Changes

In real-life complex systems, system components may fail or stop functioning as expected over time. For example, due to aging or environmental factors, a system component may start failing more frequently, so that the mean time between failures (MTBF) for that component decreases. Such failures may cause unprecedented changes in the evaluated reliability values. This affects not only the reliability of the failed component but also that of the overall system. For systems engineers it is very important to detect unprecedented changes and diagnose their causes in order to improve system reliability.

In this study, a methodology is developed for monitoring the overall system reliability value and diagnosing the failed components in complex systems. BN are useful tools for diagnosis, and efficient search algorithms can be employed within a BN [21]. We provide an efficient method specifically for BN, equipped with a simple and efficient heuristic.

As the first step of our methodology for fault diagnosis in complex systems, we detect whether any significant change occurs in the system reliability. For this purpose, we continuously monitor [22] the evaluated system reliability value and observe significant changes in the CPT of the System Behavior node in the BN. Our method suggests monitoring the reliability value at intervals of t, chosen so that at most one significant change in system reliability can occur within an interval. The value of t should be decided by the systems engineer or a domain expert who has adequate knowledge of the system characteristics. In our methodology, the CPTs of the nodes in the BN are saved into the Current set at each observation x. The values in the Current set are then compared with those in the Previous set, which was created during the previous observation, t time units earlier (Figure 2).
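The bookkeeping behind this monitoring step, with a Current and a Previous set of CPTs compared at each observation, can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Monitor` class, the flat-list representation of a CPT, the `max_abs_change` measure, the example readings and the threshold value 0.05 are all hypothetical choices made for the example.

```python
from typing import Dict, List, Optional

# Assumed representation: node name -> flattened list of CPT entries.
CptSnapshot = Dict[str, List[float]]


def max_abs_change(previous: List[float], current: List[float]) -> float:
    """Largest absolute difference between two CPTs of the same shape."""
    return max(abs(c - p) for p, c in zip(previous, current))


class Monitor:
    """Holds only the Previous and Current snapshots; older ones are dropped."""

    def __init__(self, system_node: str, threshold: float) -> None:
        self.system_node = system_node
        self.threshold = threshold  # hypothetical value set by the engineer
        self.previous: Optional[CptSnapshot] = None
        self.current: Optional[CptSnapshot] = None

    def observe(self, snapshot: CptSnapshot) -> bool:
        """Store a new observation; return True if diagnosis should start."""
        # The old Current set becomes Previous; anything older is discarded.
        self.previous, self.current = self.current, snapshot
        if self.previous is None:
            return False  # first observation, nothing to compare against
        change = max_abs_change(
            self.previous[self.system_node], self.current[self.system_node]
        )
        return change > self.threshold


# Made-up readings taken at three consecutive observations.
readings: List[CptSnapshot] = [
    {"System Behavior": [0.95, 0.05], "X3": [0.90, 0.10]},
    {"System Behavior": [0.94, 0.06], "X3": [0.90, 0.10]},  # small drift
    {"System Behavior": [0.80, 0.20], "X3": [0.70, 0.30]},  # large change
]
monitor = Monitor(system_node="System Behavior", threshold=0.05)
alarms = [monitor.observe(r) for r in readings]
print(alarms)  # [False, False, True]
```

Only the third observation raises an alarm in this example, because the small drift in the second one stays under the assumed threshold. A different change measure, such as a relative change, could be substituted without altering the two-set structure.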


Figure 2 - Our method for continuous monitoring

As can be seen from Figure 2, our continuous monitoring methodology requires only two sets (CurrentX and PreviousX) at any time, so earlier sets can be discarded. This approach reduces the amount of resources (i.e. disk space and memory) required for monitoring and fault diagnosis. Next, if a change from the PreviousX set to the CurrentX set is observed, our methodology requires mechanisms to search for the system component that causes this change. However, not every change in the system reliability value is caused by a failed component. Therefore, a threshold value must be defined to decide whether the difference between the two sets counts as a significant change. This threshold value depends on the general system characteristics and the frequency of component failures in the system. When a significant change is identified, the diagnosis mechanisms are triggered to find the failed component in the system.

5 Diagnosis of the Unprecedented Changes

When a system component fails or stops functioning as expected, it will eventually affect the other components and also the overall system reliability. Using the BN representation of a system, the affected component(s) can be diagnosed by running a search algorithm over the network. Starting from the System Behavior node, the search algorithm traverses the BN until it finds the source of the change. In this study, we introduce a new search method that uses an efficient heuristic. We also compare the performance of our method with a commonly used naive search algorithm, depth first search (DFS). As a blind search algorithm, DFS always examines the leftmost child in a graph first. It keeps iterating until it reaches the end of the graph. Then it backtracks one level up and tries the next node. The algorithm either finds the desired node or keeps running until it has checked all the nodes in the graph [23].

The DFS algorithm can be used to search for the failed component in a system by using the BN model as shown in Figure 1, together with the Current and Previous sets explained in Section 4. The DFS algorithm starts from the System Behavior node, whose CPT was changed by the failed component in the system. Next, the algorithm checks the parent nodes for changes in their CPTs, since they are the candidates for the failed component. In the next step the DFS algorithm picks the leftmost parent whose CPT was also changed. The algorithm keeps iterating this loop until it reaches the end of the network or finds a node whose parents are not altered at all. In both cases the DFS algorithm returns that node as the source of the unprecedented change.

Figure 3 - Flowchart for our DFC method

In contrast to DFS, our proposed method uses an intelligent mechanism (a heuristic) to choose the nodes to be considered while searching for the failed one. Our Diagnose Failed Component (DFC) method uses the heuristic to select the node to be considered next rather than always picking the leftmost one. The heuristic reduces the number of nodes to be considered, so our DFC method is expected to be more efficient than the popular DFS algorithm. According to our heuristic, if there is a change in the CPT of a node, the DFC method chooses the parent that is closest to the source of that change. The difficulty here is to define such a heuristic, i.e. how to determine which node is closest to the source of the alteration. Moreover, the heuristic must be efficient, since it is re-evaluated at each iteration of the DFC method.

The heuristic defined in this study uses the percentage of change in the CPT of the considered node. In other words, our heuristic assumes that the node whose CPT changes most is also the one closest to the source of the change in the system, so it should be picked first. This heuristic is accurate, since according to Bayes' rule the values in the CPTs are calculated by multiplying the probabilities from the parent nodes [9]. As a result, a change in the CPT of a parent node is reflected in its children with a probability, in their children with a lower probability, and so on. Therefore, the closer a component is to the failed one, the more its CPT changes, and this can be used to find the failed component.

5.1 Experimental Analysis

In this section we compare the two methods (DFS and DFC) described in Section 5 to show how our method reduces the time needed to find the failed component in the system. For comparison purposes, we use different systems modeled as BN, where larger systems require more nodes in their BN representation. We evaluated the two methods on the different BNs and recorded the number of iterations each method required to find the failed component. In Figure 4 it can be observed that our DFC method finds the failed component in significantly fewer iterations, even for large systems (more than 30 nodes). Moreover, unlike the DFS algorithm, the number of iterations of our DFC method is not directly proportional to the number of components in the system. We show that our DFC method performs much better than DFS for complex systems with a large number of components.
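To make the difference between the two selection rules of Section 5 concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' code: the network, the node names A to F, the per-node change values and the helper names (`diagnose`, `leftmost`, `most_changed`) are all hypothetical, and the CPT changes are given as precomputed fractions instead of being derived from real Current and Previous sets. The walk follows changed parents as described in the text and does not implement textbook DFS backtracking.

```python
from typing import Callable, Dict, List, Tuple

Parents = Dict[str, List[str]]  # node -> its parents, in left-to-right order
Change = Dict[str, float]       # node -> fractional CPT change (0.0 = none)
Chooser = Callable[[List[str], Change], str]


def leftmost(candidates: List[str], change: Change) -> str:
    """DFS-style rule: always take the first changed parent."""
    return candidates[0]


def most_changed(candidates: List[str], change: Change) -> str:
    """DFC-style heuristic: take the parent whose CPT changed the most."""
    return max(candidates, key=lambda node: change[node])


def diagnose(start: str, parents: Parents, change: Change,
             choose: Chooser) -> Tuple[str, int]:
    """Walk from `start` towards changed parents; return (source, steps)."""
    node, steps = start, 0
    while True:
        changed = [p for p in parents.get(node, []) if change.get(p, 0.0) > 0.0]
        if not changed:
            # No parents, or none of them altered: this node is the source.
            return node, steps
        node = choose(changed, change)
        steps += 1


# Hypothetical network: F is the failed component. It reaches the system
# node through a long path (F -> D -> C -> A) and a short one (F -> B).
parents: Parents = {
    "System Behavior": ["A", "B"],
    "A": ["C"], "C": ["D"], "D": ["F"],
    "B": ["F", "E"],
    "E": [], "F": [],
}
# Made-up change values that shrink with distance from F.
change: Change = {"F": 0.40, "B": 0.30, "D": 0.25, "C": 0.15,
                  "A": 0.10, "E": 0.0, "System Behavior": 0.08}

print(diagnose("System Behavior", parents, change, leftmost))      # ('F', 4)
print(diagnose("System Behavior", parents, change, most_changed))  # ('F', 2)
```

With these made-up numbers both rules reach F, but the leftmost rule follows the long path in four steps while the most-changed rule takes the short path in two. This only illustrates the mechanism and does not reproduce the measurements reported in Figure 4.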

Figure 4 - Comparison of DFS and DFC algorithms.

6 Conclusion

During a system's life cycle, the system components may fail due to aging or environmental factors. Failure of a system component not only reduces its own reliability but also affects overall system reliability. In this study, using the BN model, we suggest an efficient method to find the failed component when an unprecedented change occurs in a complex system's reliability. We simulated the BN model of the complex system over time and monitored the changes in the overall system reliability. Then we defined the DFC method to detect a significant change in system reliability and efficiently find the failed component that causes it. The DFC method employs a simple heuristic to find the failed component efficiently within the system. As a benchmark, we compared the performance of our DFC method with a naive search algorithm, DFS. Although the DFS algorithm's performance was directly related to the number of components in the complex system, our method performed independently of the number of components.

7 References

[1] Kappen, B., Wiegerinck, W., Akay, E., Neijt, J. & Beek, A. v., "A Clinical Diagnostic Decision Support System," Bayesian Modeling Applications Workshop, 2003.
[2] Horvitz, E. & Barry, M. (1995), "Display of Information for Time-Critical Decision Making", Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995, pp. 296-305.
[3] Painter, J., "Uses of Bayesian Statistics," Tesella Support Services PLC, vol. V1.R1.M0, 2003.
[4] Weighel, M., Martin, E. B. & Morris, A. J. (1997), "Fault Diagnosis in Industrial Process Manufacturing Using MSPC", In Fault Diagnosis in Process Systems, 174(4), pp. 1-3.
[5] Yongli, Z., Limin, H. & Jinling, L., "Bayesian Networks Based Approach for Power Systems Fault Diagnosis," 2005.
[6] Cho, H. & Kim, K. (2002), "Fault Diagnosis of Batch Processes: A New Statistical Approach Using Discriminant Model", In International Journal of Production Research, 42(3), pp. 597-612.
[7] Koivo, H., "Fault Diagnosis Methods," Helsinki University of Technology, 2005.
[8] Dtas, http://www.research.microsoft.com/dtas. (2005)
[9] Gran, B. A. & Helminen, A. (2001), "A Bayesian Belief Network for Reliability Assessment", In Safecomp 2001, 2187, pp. 35-45.
[10] Sigurdsson, J. H., Walls, L. A. & Quigley, J. L. (2001), "Bayesian Belief Nets for Managing Expert Judgment and Modeling Reliability", In Quality and Reliability Engineering International, 17, pp. 181-190.
[11] Doguc, O. & Ramirez-Marquez, J. E. (2007), "A Generic Method for Estimating System Reliability Using Bayesian Networks", In Journal of Reliability Engineering and System Safety.
[12] Cowell, R. G., Dawid, A. P., Lauritzen, S. L. & Spiegelhalter, D. J. (Editors) (1999), Probabilistic Networks and Expert Systems, New York, NY: Springer-Verlag.
[13] Jensen, F. V. (eds) (2001), Bayesian Networks and Decision Graphs, New York, NY: Springer-Verlag.
[14] Pearl, J. (eds) (1988), Probabilistic Reasoning in Intelligent Systems, San Francisco, CA: Morgan Kaufmann.
[15] Fenton, N., Krause, P. & Neil, M. (2002), "Software Measurement: Uncertainty and Causal Modeling", In IEEE Software, 10(4), pp. 116-122.
[16] Gran, B. A., Dahll, G., Eisinger, S., Lund, E. J., Norstrøm, J. G., Strocka, P. & Ystanes, B. J. (2000), "Estimating Dependability of Programmable Systems Using BBNs", Proceedings of the Safecomp 2000, Springer, 2000, pp. 309-320.
[17] Bobbio, A., Portinale, L., Minichino, M. & Ciancamerla, E. (2001), "Improving the Analysis of Dependable Systems by Mapping Fault Trees into Bayesian Networks", In Reliability Engineering and System Safety, 71(3), pp. 249-260.
[18] Helminen, A. & Pulkkinen, U. (2003), "Quantitative Reliability Estimation of a Computer-Based Motor Protection Relay Using Bayesian Networks", In Safecomp 2003, 2788, pp. 92-102.
[19] Amasaki, S., Takagi, Y., Mizuno, O. & Kikuno, T. (2003), "A Bayesian Belief Network for Assessing the Likelihood of Fault Content", Proceedings of the 14th International Symposium on Software Reliability Engineering, 2003, pp. 215-226.
[20] Langseth, H. & Portinale, L., "Bayesian Networks in Reliability," 2005.
[21] Doguc, O., "An Assessment, Monitoring and Diagnosis (AMD) Tool for System Operational Effectiveness," Systems Engineering, Stevens Institute of Technology, Hoboken, 2006.
[22] Gaurav, B. (2000), "Auto-Diagnosis of Field Problems in an Appliance Operating System", Proceedings of the USENIX Annual Technical Conference, USENIX Association Berkeley, CA, USA, 2000, p. 24.
[23] Kopec, D., Cox, J. & Marsland, A. (Editors) (2004), The Computer Science and Engineering Handbook, CRC Press.

Loughborough University 20th - 23rd April 2009
