
Reliability Engineering

Proceedings of the ISPRA-Course held at the Escuela Tecnica Superior de Ingenieros Navales, Madrid, Spain, September 22-26,1986 in collaboration with Universidad Politecnica de Madrid

Edited by

Aniello Amendola
and

Amalio Saiz de Bustamante

ISPRA Courses on Reliability and Risk Analysis

Kluwer Academic Publishers

RELIABILITY ENGINEERING

COURSES
ON RELIABILITY AND RISK ANALYSIS

A series devoted to the publication of courses and educational seminars given at the Joint Research Centre, Ispra Establishment, as part of its education and training program. Published for the Commission of the European Communities, Directorate-General Telecommunications, Information Industries and Innovation.

The publisher will accept continuation orders for this series which may be cancelled at any time and which provide for automatic billing and shipping of each title in the series upon publication. Please write for details.

RELIABILITY ENGINEERING
Proceedings of the ISPRA-Course held at the Escuela Tecnica Superior de Ingenieros Navales, Madrid, Spain, September 22-26,1986 in collaboration with Universidad Politecnica de Madrid

Edited by

ANIELLO AMENDOLA
Commission of the European Communities, Joint Research Centre, Ispra Establishment, Ispra, Italy

and

AMALIO SAIZ DE BUSTAMANTE


Universidad Politecnica de Madrid, Escuela Tecnica Superior de Ingenieros Navales, Madrid, Spain


KLUWER ACADEMIC PUBLISHERS


DORDRECHT / BOSTON / LONDON

Library of Congress Cataloging in Publication Data

Reliability engineering : proceedings of the ISPRA-Course held at the Escuela Tecnica Superior de Ingenieros Navales, Madrid, Spain, 22-26 September 1986 in collaboration with Universidad Politecnica de Madrid / edited by Aniello Amendola and Amalio Saiz de Bustamante.
p. cm. (ISPRA courses on reliability and risk analysis)
Includes index.
ISBN 90-277-2762-7
1. Reliability (Engineering) - Congresses. I. Amendola, Aniello, 1938- . II. Saiz de Bustamante, Amalio. III. Universidad Politecnica de Madrid. IV. Series.
TA169.R4394 1988   620'.00452   CIP

ISBN 90-277-2762-7

Commission of the European Communities,

Joint Research Centre Ispra (Varese), Italy

Publication arrangements by Commission of the European Communities, Directorate-General Telecommunications, Information Industries and Innovation, Luxembourg
EUR 11587
1988 ECSC, EEC, EAEC, Brussels and Luxembourg
LEGAL NOTICE
Neither the Commission of the European Communities nor any person acting on behalf of the Commission is responsible for the use which might be made of the following information.

Published by Kluwer Academic Publishers, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. Kluwer Academic Publishers incorporates the publishing programmes of D. Reidel, Martinus Nijhoff, Dr W. Junk and MTP Press. Sold and distributed in the U.S.A. and Canada by Kluwer Academic Publishers, 101 Philip Drive, Norwell, MA 02061, U.S.A. In all other countries, sold and distributed by Kluwer Academic Publishers Group, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.

All Rights Reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner. Printed in The Netherlands.

Table of Contents

Introduction  vii

Part I: Reliability and Data
Fundamentals of Reliability Theory (A. Saiz de Bustamante)  3
Estimation of Parameters of Distribution (A.Z. Keller)  27
Inference, a Bayesian Approach (C.A. Clarotti)  49
Component Event Data Collection (A. Besi)  67
The Organisation and Use of Abnormal Occurrence Data (H.W. Kalfsbeek)  95

Part II: Modelling Techniques
Fault Tree and Event Tree Techniques (A. Poucet)  129
Elements of Markovian Reliability Analysis (I.A. Papazoglou)  171
Monte Carlo Methods (A. Saiz de Bustamante)  205
Common Cause Failures Analysis in Reliability and Risk Assessment (A. Amendola)  221
Human Factors in Reliability and Risk Assessment (I.A. Watson)  257

Part III: Study Cases
Systems Reliability Analysis in the Process Industry (A. Amendola and S. Contini)  303
The Rijnmond Risk Analysis Pilot Study and Other Related Studies (H.G. Roodbol)  319
Study Cases of Petroleum Facilities as Comparison Bases for Different Methods (J.P. Signoret, M. Gaboriaud and A. Leroy)  345
Study Case on Aerospace (S. Sanz Fernández de Córdoba)  367
Reliability of Electrical Networks (A.G. Martins)  387
Software Reliability: a Study Case (J. Muera)  417
Study Case on Nuclear Engineering (J. González)  447
Probabilistic Evaluation of Surveillance and Out-of-Service Times for the Reactor Protection Instrumentation System (I.A. Papazoglou)  463
Structural Reliability: an Introduction with Particular Reference to Pressure Vessel Problems (A.C. Lucia)  487
Reliability of Marine Structures (C. Guedes Soares)  513

Subject Index  561

Introduction

Reliability, Availability and Maintainability (RAM) are concepts which are nowadays entering all technological fields. They characterize, indeed, the objective of any engineering science, that is, the achievement of reliable, easy-to-operate and easy-to-maintain systems in a cost-effective way. Reliability analysis is also a fundamental part of any safety assessment of potentially dangerous plants, which are now subject to ever more stringent regulations and public attention. This book, which originated from a first JRC collaborative effort on this theme in an enlarged European Community, offers a comprehensive, even if incomplete, state-of-the-art review, which shows the maturity of reliability engineering as practised in different technological sectors.

The first part of the book is devoted to some basic definitions in reliability theory and to some problems of data collection and parameter estimation. The book does not enter into a review of existing data bases, since this was already the subject of a previous work which appeared in the same Reidel series under the title "Reliability Data Bases" (A. Amendola and A.Z. Keller, eds.). With respect to the original course programme, in which the theme was only briefly discussed, the Editors were happy to include in the book a rather provocative paper on Bayesian inference, which focuses the engineer's attention on the basic meaning of probabilistic assessment.

The second part of the book presents in a rather detailed manner the most commonly used approaches to systems reliability modelling, like fault trees, event trees, Markov and Monte Carlo methods, and includes review papers on controversial issues like common cause failure analysis and human factors. The third part of the book is of a more applied character; however, it also describes techniques like DYLAM, Petri nets, structural reliability theory and others which, whilst not specific to the presented study cases, are nevertheless of great theoretical interest.

The study cases are derived from the process industry, aerospace, telecommunications, electrical networks, nuclear power plants and marine structures. In addition to applications to availability and reliability assessments, examples are also given of risk studies for both chemical and nuclear plants, which help to locate the role of reliability engineering within more comprehensive safety assessments. Of course this book does not pretend to cover exhaustively the specific problems raised by the different technological systems; however, it is the hope of the Editors that it will prove of general interest to practitioners involved with RAM and safety assessments and that it will contribute to a cross-fertilization of techniques among the different sectors.

The Editors

PART I RELIABILITY AND DATA

FUNDAMENTALS OF RELIABILITY THEORY

A. Saiz de Bustamante, Universidad Politécnica de Madrid

ABSTRACT. The concepts of reliability, maintainability and availability are presented as applied to system behaviour. As an example of reliability modelling, the constant failure and repair rates model is developed. The last part of the paper is dedicated to the background of the deductive and inductive methods used to assess system availability.

1. INTRODUCTION

The reliability of a system (component) is the probability of performing without failure a specified function under given conditions for a specified period of time. It therefore represents the probability of survival at time t. By a "system" is meant a group of components that work together to accomplish a specific function. A component is a constituent of a higher-level system or component, and can include the human components of a system. A failure is the inability of a system (component) to perform its intended function, and may occur as a result of defects in the system (component), wear and tear, or because of unexpected stresses. A failure is classified as catastrophic if it is complete and sudden. A degradation failure is a partial and gradual failure. A failure is complete when the deviation in the characteristics of the item is such as to cause complete lack of the required function.

The maintainability of a system is the probability of restoring it to a specified condition after its failure within a given period of time, when the maintenance is performed in accordance with prescribed procedures. Maintenance is defined as all the actions necessary to restore an

A. Amendola and A. Saiz de Bustamante (eds.), Reliability Engineering, 3-25. 1988 by ECSC, EEC, EAEC, Brussels and Luxembourg.

item to a specified condition. The availability of a maintained system is defined as the probability that the system is able to perform its intended function at a given time during its life.

Both failures and repairs are probabilistic events, the time to each event being a random variable. At any given time a maintained system is either functioning or being repaired, and it is possible to define system states according to the status of its components. In the simple case of a repairable component two states can be assumed: the operating state, X = 0, and the failed state, X = 1, X being a random indicator variable depending on the parameter time. The transition from X = 0 to X = 1 is called a failure, and that from X = 1 to X = 0 a repair; it is also assumed that changes of state happen instantaneously and that at most one transition occurs in a sufficiently short time interval (Markov chains). The system behaviour can be represented by its transition diagram, as shown in Figure 1. The whole process can be understood as a train of random square waves, depending on two random variables: the Time To Failure and the Time To Repair, or the number of failures and the Time To Repair. The repair-to-failure process is analyzed in Section 2, the failure-to-repair process in Section 3, and the whole process in Section 4.

2. BASIC CONCEPTS OF RELIABILITY

The random variable "time to failure" (T = TTF), corresponding to the repair-to-failure process (a typical example being the life cycle), is described by means of its associated probability functions, according to the following definitions.

(i) R(t): Reliability. Probability of system survival up to time t:

R(t) = Pr(T > t)

(ii) F(t): Failure distribution function. Probability of system failure prior to, or at, time t:

F(t) = Pr(T ≤ t)

Therefore R(t) + F(t) = 1.

FIG. 1 Representation of the whole process, failure-repair-failure: (I) transition diagram (component fails / component is repaired); (II) random square waves X(t), F: failure, R: repair.

(iii) f(t): Failure density function. Unconditional probability of system failure between t and t+dt:

f(t)dt = Pr(t < T ≤ t + dt),   f(t) = dF(t)/dt

(iv) λ(t): Failure rate. Conditional probability of system failure between t and t+dt, given that it has survived to time t:

λ(t)dt = Pr(t < T ≤ t + dt | T > t)

λ(t) = f(t)/R(t) = -(d/dt) ln R(t)

or

R(t) = exp[-∫₀ᵗ λ(u) du]

and

f(t) = λ(t) exp[-∫₀ᵗ λ(u) du]

The expected lifetime of a non-maintained component, or its mean time to failure (MTTF), is the expected value of the random variable T:

MTTF = E(T) = ∫₀^∞ t f(t) dt = ∫₀^∞ R(t) dt

given that t R(t) → 0 as t → ∞.

The failure-rate distribution plays an important role in Reliability Engineering, since it explains the life characteristic of components. Figure 2 shows a typical failure-rate graph, known as the bathtub curve. The initial-failures or wear-in period corresponds to negative aging (zone I), i.e. a decreasing failure rate. The wear-out period (zone III) corresponds to positive aging, i.e. an increasing failure rate. The random-failures period (zone II) corresponds to a constant failure rate (electrical or electronic components) or slightly positive aging (mechanical components). A constant failure rate is described by an exponential distribution and represents an ageless condition, because the conditional probability of failure is invariant with time. Under these circumstances:

R(t) = exp(-λt)
f(t) = λ exp(-λt)

FIG. 2 The bathtub curve (zones I, II and III; the useful life corresponds to zone II).

λ(t) = λ,   MTTF = 1/λ

Any of these conditions can be represented by a three-parameter Weibull distribution, defined by

λ(t) = (β/η) ((t-γ)/η)^(β-1)

where
γ: threshold parameter
η: scale parameter
β: shape parameter

If β > 1, the distribution is unimodal and the failure rate is increasing (positive aging); if β < 1, there is no mode and the failure rate is decreasing (negative aging); if β = 1, the distribution is exponential and the failure rate constant. The conditional and unconditional probabilities are as follows:

R(t) = exp[-((t-γ)/η)^β]

f(t) = (β/η) ((t-γ)/η)^(β-1) exp[-((t-γ)/η)^β]

MTTF = γ + η Γ(1 + 1/β)

σ² = η² [Γ(1 + 2/β) - Γ²(1 + 1/β)]

Example 1

Assume a population of components which follow a Weibull distribution with η = 25 000 hours, γ = 0 and, depending on the life-cycle period,

wear-in:  β = 0.5
random:   β = 1
wear-out: β = 3

The end of the wear-in period is 10 hours and the useful life is 10⁴ hours. Calculate:

a) Reliability and failure rate at 10, 100, 1 000, 10 000 and 20 000 hours, assuming no debugging and assuming a debugging process.
b) Mean time to failure in the wear-in, random-failures and wear-out periods.

Solution:

a) R(t) = exp[-(t/η)^β],   λ(t) = (β/η)(t/η)^(β-1)

no debugging: R(t) = R_win(10) R_R(t) R_wout(t),  t > 10⁴
debugging:    R(t) = R_R(t) R_wout(t),            t > 10⁴

Wear-in period (β = 0.5):
t = 10:       R_win(10)   = 0.9802,  λ(10) = 0.001 hour⁻¹

Random-failures period (β = 1):
t = 100:      R_R(100)    = 0.9960,  R(100)   = 0.9802 x 0.9960 = 0.9763
t = 1 000:    R_R(1000)   = 0.9608,  R(1000)  = 0.9802 x 0.9608 = 0.9418
t = 10 000:   R_R(10000)  = 0.6703,  R(10000) = 0.9802 x 0.6703 = 0.6570
λ(t) = 1/η = 4 x 10⁻⁵ hour⁻¹

Wear-out period (β = 3):
t = 20 000:   R_wout(20000) = 0.5993,  R_R(20000) = 0.4493
debugging:    R(20000) = 0.5993 x 0.4493 = 0.2693
no debugging: R(20000) = 0.9802 x 0.2693 = 0.2640
λ(20000) = 7.68 x 10⁻⁵ hour⁻¹ (wear-out)

b) MTTF = η Γ(1 + 1/β), because γ = 0:
wear-in period:         MTTF = 2 x 25 000 = 50 000 hours
random-failures period: MTTF = 1 x 25 000 = 25 000 hours
wear-out period:        MTTF = 0.8934 x 25 000 = 22 335 hours
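The figures of Example 1 can be cross-checked with a few lines of plain Python; this is only a numerical sketch using the parameter values quoted above (η = 25 000 h, γ = 0, β = 0.5, 1 and 3), not part of the original solution.

```python
import math

ETA = 25_000.0  # Weibull scale parameter (hours) from Example 1

def weibull_R(t, beta, eta=ETA, gamma=0.0):
    """Weibull reliability R(t) = exp(-((t - gamma)/eta)**beta)."""
    return math.exp(-(((t - gamma) / eta) ** beta))

def weibull_rate(t, beta, eta=ETA, gamma=0.0):
    """Weibull failure rate lambda(t) = (beta/eta) * ((t - gamma)/eta)**(beta - 1)."""
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)

def weibull_mttf(beta, eta=ETA, gamma=0.0):
    """MTTF = gamma + eta * Gamma(1 + 1/beta)."""
    return gamma + eta * math.gamma(1.0 + 1.0 / beta)

# Wear-in contribution up to the end of the wear-in period (10 h, beta = 0.5)
print(weibull_R(10, 0.5), weibull_rate(10, 0.5))   # ~0.9802 and ~0.001 per hour

# Random-failures period (beta = 1)
for t in (100, 1_000, 10_000):
    print(t, weibull_R(t, 1.0))                    # 0.9960, 0.9608, 0.6703

# Wear-out contribution at 20 000 h (beta = 3)
print(weibull_R(20_000, 3.0))                      # ~0.5993

# MTTF for each period
for b in (0.5, 1.0, 3.0):
    print(b, weibull_mttf(b))                      # 50 000, 25 000, ~22 300 hours
```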

3. BASIC CONCEPTS OF MAINTAINABILITY

The probability functions associated with the random variable "time to repair" (T = TTR, repair to as good as new), corresponding to the failure-to-repair process, are defined as follows.

(i) M(t): Maintainability. Probability of the system being repaired before time t, given that it failed at time zero:

M(t) = Pr(T ≤ t)

(ii) m(t): Repair density function. Unconditional probability of system repair between t and t+dt:

m(t)dt = Pr(t < T ≤ t + dt),   m(t) = dM(t)/dt

(iii) μ(t): Repair rate. Conditional probability of system repair between t and t+dt, given that the system has been in the failed state from time zero to time t:

μ(t)dt = Pr(t < T ≤ t + dt | T > t)

μ(t) = m(t) / [1 - M(t)]

and

M(t) = 1 - exp[-∫₀ᵗ μ(u) du]
m(t) = μ(t) exp[-∫₀ᵗ μ(u) du]

The expected value of the time to repair is

MTTR = ∫₀^∞ t m(t) dt

If a constant repair rate μ can be assumed,

M(t) = 1 - exp(-μt)
m(t) = μ exp(-μt)
MTTR = 1/μ

4. BASIC CONCEPTS OF AVAILABILITY

The life cycle of a maintained system consists of repetitions of repair-to-failure and failure-to-repair processes; but the time origin of the whole process differs from the time origin of each individual process. The associated probabilistic functions for the whole process are given below.

(i) A(t): Instantaneous availability. Probability that the system is in the operating state at time t, given that it was in the operating state at time zero.

(ii) Q(t): Instantaneous unavailability. Probability that the system is in the failed state at time t, given that it was in the operating state at time zero.

A(t) + Q(t) = 1

(iii) w(t): Failure intensity. Unconditional probability of system failure between t and t+dt, given that the system was in the operating state at time zero.

(iv) v(t): Repair intensity. Unconditional probability of system repair between t and t+dt, given that the system was in the operating state at time zero.

(v) W(t₁,t₂): ENF in the interval [t₁,t₂]. The expected number of failures (ENF) in the interval [t₁,t₂] is given by

W(t₁,t₂) = ∫ from t₁ to t₂ of w(t) dt

because

W(t, t+dt) = w(t) dt

(vi) V(t₁,t₂): ENR in the interval [t₁,t₂]. Analogously to (v), the expected number of repairs (ENR) in the interval [t₁,t₂] is

V(t₁,t₂) = ∫ from t₁ to t₂ of v(t) dt

The failure and repair intensities differ from the failure and repair densities previously defined, because of the different time origins of the repair-to-failure and failure-to-repair processes and of the whole process; they are related by means of the following simultaneous integral equations:

w(t) = f(t) + ∫₀ᵗ f(t-u) v(u) du

v(t) = ∫₀ᵗ m(t-u) w(u) du

The first equation indicates that the probability of failure per unit time at time t is the sum of the probability of a first failure in that interval of time and of the probabilities of failure in that interval after a repair in [u, u+du], for any u in [0, t]. According to the second equation, the probability of repair per unit time at time t is the sum of the probabilities of repair in that interval after a failure in [u, u+du], for any u in [0, t].

The mentioned simultaneous integral equations can be solved by analytical or numerical methods if the probability functions of the isolated processes (repair-to-failure and failure-to-repair) are known. The unavailability of the system due to the whole repair-failure-repair process can be calculated from the failure and repair intensities:

Q(t) = ∫₀ᵗ [w(u) - v(u)] du = W(0,t) - V(0,t)

5. THE CONSTANT FAILURE AND REPAIR RATES MODEL

The assumption of constant failure and repair rates simplifies the analytical treatment of system reliability analysis; it is adequate for solid-state electronic components during their useful life (random-failures period), or as an approximation for components made of a large number of parts with different rates over long periods. The Laplace transforms of the simultaneous integral equations of the whole process are

w(s) = f(s) + f(s) v(s)
v(s) = m(s) w(s)

where

f(s) = L[f(t)] = ∫₀^∞ λ exp[-(s+λ)t] dt = λ/(s+λ)
m(s) = L[m(t)] = ∫₀^∞ μ exp[-(s+μ)t] dt = μ/(s+μ)

The simultaneous equations become

w(s) = λ/(s+λ) + [λ/(s+λ)] v(s)
v(s) = [μ/(s+μ)] w(s)

Therefore

w(s) = λ(s+μ) / [s(s+λ+μ)] = (1/s) λμ/(λ+μ) + [λ²/(λ+μ)] / (s+λ+μ)
v(s) = λμ / [s(s+λ+μ)] = (1/s) λμ/(λ+μ) - [λμ/(λ+μ)] / (s+λ+μ)

and the inverse Laplace transforms are

w(t) = λμ/(λ+μ) + [λ²/(λ+μ)] exp[-(λ+μ)t]
v(t) = λμ/(λ+μ) - [λμ/(λ+μ)] exp[-(λ+μ)t]

The unavailability is

Q(t) = ∫₀ᵗ [w(u) - v(u)] du = [λ/(λ+μ)] {1 - exp[-(λ+μ)t]}

and the availability

A(t) = 1 - Q(t) = μ/(λ+μ) + [λ/(λ+μ)] exp[-(λ+μ)t]

The steady-state availability A(∞) is defined as the proportion of time that the system is available for use when the time interval is very large,

A(∞) = lim (t → ∞) A(t)

For the present constant-rates model

A(∞) = μ/(λ+μ) = MTTF/(MTTF + MTTR)
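A minimal numerical sketch of the constant-rates model in plain Python is given below; the rates λ = 0.001 h⁻¹ and μ = 0.1 h⁻¹ are purely illustrative values, not data from the paper, and are only meant to show Q(t) rising to its asymptotic value λ/(λ+μ).

```python
import math

def availability(t, lam, mu):
    """A(t) = mu/(lam+mu) + lam/(lam+mu) * exp(-(lam+mu)*t) for the constant-rates model."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def unavailability(t, lam, mu):
    """Q(t) = 1 - A(t) = lam/(lam+mu) * (1 - exp(-(lam+mu)*t))."""
    return 1.0 - availability(t, lam, mu)

lam, mu = 0.001, 0.1                    # illustrative failure and repair rates (per hour)
for t in (1, 10, 50, 100, 1000):
    print(f"t = {t:>5} h: Q(t) = {unavailability(t, lam, mu):.6f}")
print("steady-state A =", mu / (lam + mu))   # = MTTF/(MTTF + MTTR)
```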

6. ASSESSMENT OF PROBABILISTIC FUNCTIONS

There are three general approaches to the assessment of the probabilistic functions of the repair-to-failure and failure-to-repair processes:

(i) Curve fitting, when it is possible to construct a histogram from data obtained from extensive life testing or field observations. The fitting is done by means of a piecewise polynomial approximation.

(ii) Statistical inference (classical approach), assuming a parametric probability distribution, by means of point and interval estimation and a goodness-of-fit test.

If the failure rate is increasing, a normal, lognormal or Weibull (β > 1) distribution can be assumed. For the repair process a lognormal distribution is frequently the best fit. The data from life tests consist of lifetimes from censored trials on identical components, called:

. Type I censorship, in which the test is terminated at a fixed time;
. Type II censorship, in which the test is terminated at the time t_r of the r-th failure, r ≤ n.

The maximum likelihood estimate of a component mean life, if a constant failure rate can be assumed and the data come from a censored life test ending at the time of the r-th failure (Type II censorship), is

θ̂ = 1/λ̂ = (1/r)[Σ (i=1..r) t_i + (n-r)t_r] = t_c/r

where
λ   = constant failure rate
t_i = time to the i-th failure
t_r = time to the last (r-th) failure
n   = number of components under test
r   = observed failures during the test
t_c = cumulative operating time of all components under test: t_c = (n-r)t_r + Σ (i=1..r) t_i

To construct an interval estimate for an exponential-distribution test without replacement, it can be shown that the statistic 2λt_c is a χ² variable with 2r degrees of freedom; therefore

λ_u = χ²_2r(α/2) / (2 t_c)

λ_l = χ²_2r(1 - α/2) / (2 t_c)

Pr(λ_u ≥ λ ≥ λ_l) = 1 - α

where α is the level of significance and χ²_ν(p) denotes the value of a χ² variable with ν degrees of freedom that is exceeded with probability p. The above formulae are not applicable to Type I censorship.
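A short sketch of this classical point and interval estimation is given below (Python with scipy's chi-square quantiles); the component lives, the number of items on test and the censoring pattern are invented purely for illustration.

```python
from scipy.stats import chi2

# Hypothetical Type II censored test: n components on test, stopped at the r-th failure.
failure_times = [120.0, 310.0, 540.0, 780.0, 950.0]   # hours, illustrative only
n = 10                                                 # components on test
r = len(failure_times)
t_r = failure_times[-1]

# Cumulative operating time and MLE of the failure rate / mean life
t_c = sum(failure_times) + (n - r) * t_r
lam_hat = r / t_c
theta_hat = t_c / r
print("lambda_hat =", lam_hat, "per hour; theta_hat =", theta_hat, "hours")

# Two-sided limits on lambda (alpha = 0.05), using 2*lambda*t_c ~ chi-square with 2r d.o.f.
alpha = 0.05
lam_l = chi2.ppf(alpha / 2, 2 * r) / (2 * t_c)       # lower limit (lower quantile)
lam_u = chi2.ppf(1 - alpha / 2, 2 * r) / (2 * t_c)   # upper limit (upper quantile)
print(f"{(1 - alpha) * 100:.0f}% interval for lambda: [{lam_l:.5f}, {lam_u:.5f}]")
```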

(iii) Statistical inference (Bayesian approach), in which λ is given the status of a random variable, because the sample size is finite. According to Bayes' theorem,

f(λ|D) = L(D|λ) π(λ) / ∫₀^∞ L(D|λ) π(λ) dλ

where
D = (t₁, t₂, ..., t_r): test observations
f(λ|D): posterior density function
L(D|λ): likelihood function of the results D, given λ
π(λ): prior density function (to be chosen)

If an exponential distribution can be assumed, the likelihood function is

L(D|λ) = λ^r exp(-λ Σ (i=1..r) t_i)

Choosing a uniform prior distribution, π(λ) = 1/K for 0 < λ < K, the posterior density can be approximated by a gamma distribution of parameters α and β, provided K is large compared with r/t_c:

f(λ|D) = [β^(α+1) / Γ(α+1)] λ^α exp(-βλ)

where
α = r: number of observations (shape parameter)
β = Σ t_i = t_c: cumulative operating time (scale parameter)

As a point estimate of λ one can choose the mode of f(λ|D), its mean, or its median; in this case the mode is α/β and the mean (α+1)/β. The two-sided confidence limits for λ can be defined by

∫ from 0 to λ_l of f(λ|D) dλ = α/2

∫ from λ_u to ∞ of f(λ|D) dλ = α/2
where α is the level of significance.

Example 2

The following repair times TTR of an electric generator are given below (in time units):

3.39 1.23 41

.20 .11 .13 .17 .35 .84 .01

.03
94

.26
53

.12 .29 .89


43 23

.66 .13 .15


25

.83
33

.86
33

.28 .48
I.38

.02 .18
47

.80 .24 .28 .00

.97
.02 .07 .21 .05 .85
I.48

.18
19

.68
54

1.29 1.87 1.28

.64

.50
70

.17 .07 .09

.15
1.19 50

.27
03 73

.44 .07
.01

.19 .17

.08
59

.43

.20

.29 .14 .04 .01

.50
.20 .06 .08

From these data assess the maintainability of the generator using: a) the classical approach; b) the Bayesian approach.

Solution:

The test results can be summarized by means of a frequency histogram with 10 intervals of 0.2 time units each, except for the 10th class, which covers from 1.8 to 4 time units (see Table 1). The data suggest an exponential distribution with a constant repair rate μ:

μ̂ = 1/MTTR = 1/0.456 = 2.19 (time units)⁻¹

The mean time to repair, MTTR = 0.456, is the mean value of the given times.

a) The point estimate of the standard deviation obtained from the data, S = 0.522, differs from the MTTR; the two need not coincide, even if the assumption of an exponential distribution is correct, because of randomness. The cumulative repair time is t_c = 36.48 time units, and for a confidence level (1-α) of 95%,

χ²₁₆₀(0.025) = 197.01
χ²₁₆₀(0.975) = 126.88

The limits of the interval are

μ_u = 197.01 / (2 x 36.48) = 2.70 (time units)⁻¹
μ_l = 126.88 / (2 x 36.48) = 1.74 (time units)⁻¹

so

Pr(2.70 ≥ μ ≥ 1.74) ≈ 0.95

The goodness of fit of the selected exponential distribution requires the statistic

χ² = Σ (f_o - f_th)² / f_th

where
f_o  = observed frequency of repairs in a time interval
f_th = theoretical frequency of repairs in a time interval: f_th = N[exp(-μt₁) - exp(-μt₂)]
N    = total number of repairs = 80
α    = level of significance
df   = degrees of freedom = number of class intervals minus number of parameters to be estimated minus one

According to Table 1, and for α = 5%,

df = 6 - 1 - 1 = 4
χ²₄(0.05) = 9.49
3.322 < 9.49

Therefore the null hypothesis μ = 2.19 can be accepted at a level of significance of 5%, according to the χ² test.

The Kolmogorov-Smirnov test of fit for an exponential distribution is based on the statistic

D(n,α) = Sup[d(i)]

where

d(i)  = max[d₁(i), d₂(i)]
d₁(i) = | i/n - F[t(i)] |
d₂(i) = | F[t(i)] - (i-1)/n |
F(t)  = 1 - exp(-μt)

n: sample size
α: level of significance
d(i): maximum deviation at time t(i)
i: number of repairs up to time t(i)
F(t): distribution under the null hypothesis

The null hypothesis of exponentiality is rejected if D(n,α) is too large (see Table I of the paper "Estimation of parameters of distribution"). In order to use the Kolmogorov statistic, the data should be rearranged in ascending order; the frequencies and cumulative frequencies are then calculated for each time to repair t(i) and compared with the theoretical cumulative distribution to obtain the maximum deviation d(i), as done in Table 2:

D(80, 0.05) = Sup[d(i)] = 0.0791

According to the mentioned Table I, the critical value corresponding to sample size 80 and level of significance α = 0.05 is 1.36/√80 = 0.1521. Since D(80, 0.05) < 0.1521, the null hypothesis of exponentiality, M(t) = 1 - exp(-μt) with μ = 2.19, is accepted.

b) The posterior distribution, assuming a uniform prior density, is a gamma distribution with the following parameters:

β = Σ t_i = 36.48 time units
α = r = 80 observations

f(μ|D) = (36.48⁸¹ / 80!) μ⁸⁰ exp(-36.48 μ)

as shown in Fig. 3.

FIG. 3 Posterior gamma density of μ.

As mentioned before, the following values can be adopted as point estimates of μ:

Mode   = 2.193
Mean   = 2.22
Median = 2.212

The mode of the posterior probability coincides with the classical point estimate. The confidence limits for a level of significance of 5% are

μ_l = 1.764
μ_u = 2.728

or

Pr(1.764 < μ < 2.728 | D) = 0.95
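The posterior summaries quoted above are easy to reproduce with scipy's gamma distribution, with shape 81 (= r + 1) and rate 36.48 (= t_c); the sketch below is only a numerical cross-check of the figures in the example.

```python
from scipy.stats import gamma

shape, rate = 81, 36.48                 # posterior Gamma(r + 1, t_c) from Example 2
post = gamma(a=shape, scale=1.0 / rate)

print("mode   =", (shape - 1) / rate)            # 80/36.48  = 2.193
print("mean   =", post.mean())                   # 81/36.48  = 2.220
print("median =", post.median())                 # ~2.212
print("95% interval:", post.interval(0.95))      # ~(1.764, 2.728)
```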

7. APPLICATION TO SYSTEMS

From the reliability-engineering standpoint, a system is a group of interacting components, such as hardware, materials and personnel. The system must be identifiable, and it should be possible to establish clearly its external and internal boundaries. The external boundary allows us to determine the environment-system interactions, and the internal boundaries allow us to establish the interactions among components within the system. The system state is a consequence of the component states and of their interrelations, as represented by the system logic. The relationship of events, or component states, to the system state can be expressed by means of the logical operators ∧ (AND) and ∨ (OR), or the algebraic operators + and ·, which can be manipulated according to the rules of Boolean algebra:

(i) Absorption
A + (A·B) = A        A·(A + B) = A
A ∨ (A∧B) = A        A ∧ (A∨B) = A

(ii) Identity
A + A = A            A·A = A
A ∨ A = A            A ∧ A = A

(iii) Commutative law
A + B = B + A        A·B = B·A

(iv) Associative law
A + (B + C) = (A + B) + C        A·(B·C) = (A·B)·C
A ∨ (B∨C) = (A∨B) ∨ C            A ∧ (B∧C) = (A∧B) ∧ C

(v) Distributive law
A + B·C = (A + B)·(A + C)        A·(B + C) = (A·B) + (A·C)
A ∨ (B∧C) = (A∨B) ∧ (A∨C)        A ∧ (B∨C) = (A∧B) ∨ (A∧C)

These rules allow the determination of the minimal path sets defining the success modes of the system or, for its complement, the minimal cut sets. The formal representation of the system logic is called the system structure function. The use of the structure function for each set of events allows the construction of a truth table (only two states for each component, represented by a binary indicator variable x_i = 0 or x_i = 1) or of a decision table (where more than two states are considered for each component) representing the system behaviour, which can be analyzed by means of the following two methodologies:

(i) The forward or inductive methods, based on the identification of the different failures of components and of their consequences on other components and on the system. Examples of this methodology are the "Failure Mode and Effect Analysis" (FMEA) and the "Event Tree Analysis" (ETA).

(ii) The backward or deductive methods, based on the identification of the different system states and then of the second-, third-, etc. tier events that, through logical links, contribute to them. An example of this methodology is the "Fault Tree Analysis" (FTA).

In order to implement either the deductive or the inductive methods of system reliability analysis, it is necessary to understand how the system works, especially the relationships among component events or states. For this purpose the knowledge of system flow charts and the development of logic diagrams are of prime importance. In spite of the complexity of reliability system analyses, there are simple systems where the application of probability theory allows a straightforward quantification:

(i) Series systems. A group of independent components in which the failure of one causes system failure:

R_sist = Π (i=1..n) R_i

(ii) Parallel configuration. The system will fail if, and only if, all components fail:

R_sist = 1 - Π (i=1..n) (1 - R_i)

(iii) Standby redundancy. This type of parallel configuration represents a system with one unit operating and n units as standby. If the components are statistically independent and identical, with a constant failure rate λ, and the switching arrangement is perfect,

R_sist(t) = Σ (j=0..n) [(λt)^j / j!] exp(-λt)

If the components are not identical, and assuming only two of them, the failure density function of the system is

f(t) = ∫₀ᵗ f₂(t-u) f₁(u) du = [λ₁λ₂/(λ₂-λ₁)] [exp(-λ₁t) - exp(-λ₂t)]

and its reliability

R(t) = exp(-λ₁t) + [λ₁/(λ₂-λ₁)] [exp(-λ₁t) - exp(-λ₂t)]

where λ₁ and λ₂ are the constant failure rates of the two components.

Example 3

Fig. 4 shows a standby power system in which a diesel generator is backed up by a battery. Answer the following questions:

(a) Reliability of the power system, with battery, for times 10 hours and 100 hours, assuming λ₁ = 0.0001, λ₂ = 0.0002 and λ₃ = 0.001 hour⁻¹.

(b) Indicate the system logic if the system can be maintained.

(c) Unavailability of the system for times 1, 5, 10 and 100 hours and at the steady state, assuming a constant-rates model with λ = 0.0003 and μ = 0.2 hour⁻¹.

Solution:

(a) Parallel-series system (components 1 and 2 in series, in parallel with component 3):

R_sis(t) = 1 - {1 - exp[-(λ₁+λ₂)t]} [1 - exp(-λ₃t)]
         = exp[-(λ₁+λ₂)t] + exp(-λ₃t) - exp[-(λ₁+λ₂+λ₃)t]
         = exp(-0.0003t) + exp(-0.001t) - exp(-0.0013t)

R(10)  = 0.99997
R(100) = 0.99719


FIG. 4 Three-component power system: (I) block diagram (generator and battery); (II) component and system status; (III) truth table; (IV) fault tree.

(b) The indicator random variables of the states of the components and of the system are

component #1: X1
component #2: X2
component #3: X3
system:       XS

The system logic can be represented by the following statement: "IF (X1 = 0 AND X2 = 0) OR X3 = 0 THEN XS = 0". Fig. 4 develops a truth table and a fault tree according to the above logic statement.

(c) Q(t) = [λ/(λ+μ)] {1 - exp[-(λ+μ)t]}

Q(1)   = 0.0002719
Q(5)   = 0.0009476
Q(10)  = 0.0012957
Q(100) = 0.0014977
Q(∞)   = λ/(λ+μ) = 0.0014977
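The numbers of Example 3 can be reproduced with the short plain-Python sketch below; it only re-evaluates the expressions already derived above and introduces no new data.

```python
import math

l1, l2, l3 = 1e-4, 2e-4, 1e-3      # component failure rates (per hour) from Example 3

def R_system(t):
    """Diesel-generator branch (components 1 and 2 in series) in parallel with the battery (3)."""
    return 1.0 - (1.0 - math.exp(-(l1 + l2) * t)) * (1.0 - math.exp(-l3 * t))

for t in (10, 100):
    print(f"R({t}) = {R_system(t):.5f}")       # 0.99997 and 0.99719

lam, mu = 0.0003, 0.2               # whole-system constant rates for part (c)

def Q(t):
    s = lam + mu
    return (lam / s) * (1.0 - math.exp(-s * t))

for t in (1, 5, 10, 100):
    print(f"Q({t}) = {Q(t):.7f}")              # 0.0002719, 0.0009476, 0.0012957, 0.0014977
print("Q(inf) =", lam / (lam + mu))            # 0.0014977
```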

REFERENCES

[1] S. Osaki, Stochastic System Reliability Modelling, World Scientific (1985).
[2] E.G. Frankel, Systems Reliability and Risk Analysis, Martinus Nijhoff Publishers (1984).
[3] E.J. Henley, H. Kumamoto, Reliability Engineering and Risk Assessment, Prentice-Hall, Inc. (1981).
[4] R.E. Barlow, F. Proschan, Statistical Theory of Reliability and Life Testing, McArdle Press Inc. (1975).
[5] N.R. Mann, R.E. Schafer, N.D. Singpurwalla, Methods for Statistical Analysis of Reliability and Life Data, John Wiley and Sons (1974).
[6] A. Serra and R.E. Barlow (eds.), Theory of Reliability, Proceedings of an International School of Physics "Enrico Fermi", North-Holland (1986).

ESTIMATION OF PARAMETERS OF DISTRIBUTION

A.Z. Keller, PhD, BSc, FBCS, FIQA, FSaRS School of Industrial Technology, University of Bradford, England.

1.

Introduction

Acquisition of dependable data for reliability and safety analysis is always a major problem for practitioners. Available data sets are usually small and subject to significant uncertainties arising from incomplete knowledge of the duty cycle or of the specific environment to which the particular component or subsystem is subjected and for which data are being acquired. This general shortage of data requires procedures whereby the probability distributions governing the reliability behaviour of the component can be identified and the parameters of the governing distribution estimated. This procedure can often provide a valuable insight into the reliability of the component. If, for example, failure behaviour is fitted by a Weibull distribution then, if the shape parameter is greater than unity, one has a wear-out phenomenon; again, if the parameter is less than unity, one has a growth situation; and if the shape parameter is of the order of unity one can infer that the failure mechanism is random in nature. The first stage of data analysis is the selection of candidate distributions; this is then followed by the estimation of parameters; the distribution best describing the data can then be selected on the basis of a goodness-of-fit test or other appropriate criteria. The following are the most commonly used methods of estimation:

1. Graphical methods.
2. Method of least squares.
3. Method of matching moments.
4. Method of maximum likelihood.

A. Amendola and A. Saiz de Bustamante (eds.), Reliability Engineering, 27-47. 1988 by ECSC, EEC, EAEC, Brussels and Luxembourg.

2. Graphical Methods

Graphical methods are the simplest, though the least accurate, and have many practical advantages in the early stages of data analysis. Any distribution can be transformed into a standard distribution by means of a linear transformation as follows:

y = (x - μ)/σ     (1)

where
μ: location parameter
σ: scale parameter
y: transformed variate
x: original variate

p = F(x) = G(y)     (2)

In this way, for all 0 < p < 1, the standard distribution G is free of the scale and location parameters, and

y = G⁻¹(p) = G⁻¹[F(x)]     (3)

Instead of plotting x against F(x), one may choose to plot x against G⁻¹[F(x)], which is equal to y. On graph paper with the y axis so calibrated, since x and y are linearly related, every straight line with positive slope represents a cdf F(x) with some scale and location parameters.

Example

For the Weibull distribution, the cumulative distribution function F(t) is given by

F(t) = 1 - exp[-(t/τ)ⁿ]     (4)

and the reliability function is given by

R(t) = exp[-(t/τ)ⁿ]

i.e.

ln R(t) = -(t/τ)ⁿ,   ln ln[1/R(t)] = n ln t - n ln τ     (5)

It is seen that if ln[1/R(t)] is plotted against t on log-log paper (equivalently, ln ln[1/R(t)] against ln t), the points lie on a straight line. The parameters n and τ are then estimated from the slope and the intercept of the straight line respectively.
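As a sketch of the graphical method in Python with numpy, the Weibull parameters follow from a straight-line fit of ln ln[1/R(t)] against ln t, using median-rank estimates of F(t); the sample below is generated synthetically, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.sort(rng.weibull(1.8, size=50) * 200.0)   # synthetic lives: shape 1.8, scale 200

# Median-rank estimate of F(t_i), then the Weibull linearisation of equation (5)
i = np.arange(1, t.size + 1)
F = (i - 0.3) / (t.size + 0.4)
x = np.log(t)
y = np.log(np.log(1.0 / (1.0 - F)))

# Straight-line fit: slope = shape n, intercept = -n * ln(tau)
n_hat, c = np.polyfit(x, y, 1)
tau_hat = np.exp(-c / n_hat)
print("estimated shape n =", n_hat, " scale tau =", tau_hat)
```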

3.

Method of Least Squares

The method of least squares is one of the most widely used methods for fitting a curve (or straight line) to a series of data points. The estimation of the parameters of the curve is based on the criterion of minimising the sum of squared deviations between the fitted curve and the actual data points.

Let (x₁,y₁), (x₂,y₂), ..., (x_N,y_N) be a set of N observations and let

y = f(x; θ_j),   j = 1, ..., M

be the functional relationship of the dependent variable y, where θ_j (j = 1, ..., M) are the parameters to be estimated. The method of least squares finds the values of θ_j such that

D² = Σ (i=1..N) [y_i - f(x_i; θ_j)]²

is a minimum. This is accomplished by setting the first M partial derivatives of D² with respect to its parameters θ_j (j = 1, ..., M) to zero and solving the set of M normal equations thus obtained for the M parameters, i.e.

∂D²/∂θ_j = 0   for j = 1, ..., M

Considering the example of fitting a straight line to a set of data, let y = ax + b be the form of the functional relationship of the dependent variable. The sum of squared deviations can be written as follows:

D² = Σ (i=1..N) [y_i - (a x_i + b)]²

where (x_i, y_i), i = 1, ..., N, are the paired data points. The parameters a and b are estimated as follows:

∂D²/∂a = 0 = Σ -2[y_i - (a x_i + b)] x_i,   i.e.   a Σ x_i² + b Σ x_i = Σ x_i y_i     (6)

∂D²/∂b = 0 = Σ -2[y_i - (a x_i + b)],   i.e.   a Σ x_i + N b = Σ y_i     (7)

The least-squares estimates a* and b* are obtained by solving the normal equations (6) and (7):

a* = [N Σ x_i y_i - Σ x_i Σ y_i] / [N Σ x_i² - (Σ x_i)²]     (8)

b* = [Σ x_i² Σ y_i - Σ x_i Σ x_i y_i] / [N Σ x_i² - (Σ x_i)²]     (9)
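The normal equations (6)-(9) can be checked with a few lines of Python (numpy); the data pairs below are invented purely for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # illustrative data points
N = x.size

# Closed-form least-squares estimates from equations (8) and (9)
a_star = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
b_star = (np.sum(y) - a_star * np.sum(x)) / N
print("a* =", a_star, " b* =", b_star)

# Cross-check against numpy's own least-squares line fit
print(np.polyfit(x, y, 1))                  # slope, intercept
```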

4.

Method of Matching Moments

In this method, theoretical moments of the distribution are equated with the sample moments. The k-th moment of a probability density function f(t) about the origin is defined as

μ'_k = ∫ t^k f(t) dt   (over all values of t)     (10)

and its estimator for a sample of N observations is given by

m'_k = (1/N) Σ (i=1..N) t_i^k     (11)

The m'_k are unbiased estimators of the μ'_k. The advantage of this method is that the estimators are relatively simple to evaluate, provided the theoretical moments exist. It may also be necessary sometimes to evaluate higher moments about the mean in order to estimate the parameters. One of the serious disadvantages of this method is that the sampling distributions of these estimators, needed for interval estimation, are not always easy to evaluate.

Examples

Binomial distribution: The probability density of the binomial distribution is given by

P(x; N, p) = C(N,x) p^x (1-p)^(N-x)     (12)

where
N: number of trials
x: number of successes
p: probability of success.

Suppose that in N trials x successes have been observed and we wish to estimate p. The expected number of successes (theoretical first moment about the origin) is given by

Σ (x=0..N) x P(x; N, p) = Np     (13)

However, we have observed x successes in N trials. When this is equated with (13) one has

x = N p̂     (14)

and the estimator is given by p̂ = x/N.

Exponential: The probability density of the exponential distribution is given by

f(t) = θ exp(-θt)     (15)

and its theoretical first moment about the origin (mean) is given by

μ'₁ = ∫₀^∞ t θ exp(-θt) dt = 1/θ     (16)

Let t₁, t₂, ..., t_N be the observed failure times of N randomly selected components, and suppose it is required to estimate θ (as given in (15)) to describe these failure times. The moment estimator of θ, θ̂, is obtained by equating the theoretical moment given in (16) with the sample moment. The first (sample) moment about the origin for the data points t₁, t₂, ..., t_N is given by

t̄ = (1/N) Σ (i=1..N) t_i     (17)

Hence the matching-moment estimate of θ is given by

θ̂ = N / Σ (i=1..N) t_i     (18)
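A minimal sketch of moment matching for the two examples above is given in Python below; the observation counts and failure times are invented purely for illustration.

```python
# Binomial: estimate p from x successes in N trials by matching the first moment (eq. 14)
x_successes, N_trials = 37, 100
p_hat = x_successes / N_trials
print("p_hat =", p_hat)

# Exponential: match the theoretical mean 1/theta with the sample mean (eqs. 17-18)
failure_times = [12.0, 45.0, 7.0, 88.0, 23.0, 64.0, 31.0, 5.0]   # hours, illustrative
theta_hat = len(failure_times) / sum(failure_times)
print("theta_hat =", theta_hat, "failures per hour")
```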

32
Normal distribution: The probability density function (p.d.f.) of the normal distribution is given by

f(t) = (2πσ²)^(-1/2) exp[-(t-μ)²/(2σ²)]     (19)

Since this distribution is characterised by two parameters, μ and σ², two sample moments are required to obtain the estimates of μ and σ². The first moment about the origin, μ'₁, and the second moment about the mean, μ₂, for (19) are given by

μ'₁ = ∫ t f(t) dt = μ     (20)

μ₂ = ∫ (t-μ)² f(t) dt = σ²     (21)

respectively (the integrals extending over all t). The corresponding sample moments for the observed failure times t₁, t₂, ..., t_N are given by

t̄ = (1/N) Σ (i=1..N) t_i   (for simplicity t̄ is used instead of t̄'₁)     (22)

m₂ = (1/N) Σ (t_i - t̄)² = (1/N) Σ (t_i² - 2 t_i t̄ + t̄²)     (23)

but Σ t_i = N t̄ from (22), i.e.

m₂ = (1/N) [Σ t_i² - N t̄²]     (24)

Note that in equation (23), instead of the true parameter μ, the estimator t̄ is used. Equating (20) with (22) and (21) with (24), the moment estimates of μ and σ² are given by

μ̂ = (1/N) Σ (i=1..N) t_i     (25)

σ̂² = (1/N) [Σ t_i² - N t̄²]     (26)

respectively. However, σ̂² is biased; an unbiased estimator of σ² is given by

s² = [1/(N-1)] Σ (i=1..N) (t_i - t̄)²     (27)

y2

5. Method of Maximum Likelihood

The most favoured method from the point of view of statistical theory is that of maximum likelihood. This is derived from the stipulation that the probability of obtaining the given sample values should be a maximum if the estimator equals the population value. Thus if the estimator of a population parameter is θ and one has reason to assume some given probability density f(t; θ), then one requires that

L = Π (i=1..N) f(t_i; θ)     (28)

be a maximum, or

ln L = Σ (i=1..N) ln f(t_i; θ)     (29)

The equation

∂(ln L)/∂θ = 0     (30)

will give the maximising value of θ. Maximum likelihood estimators (MLE) may sometimes be biased, but it is usually possible to adjust for this bias. MLEs are sufficient if a sufficient estimator exists for the problem. They are most efficient for large N (≥ 30) and are invariant. The distribution of the MLEs has the normal distribution as the

34
limiting distribution ( * ) and confidence limits for estimates can be derived from the normal distribution as follows: Let y

(09)/og

(31)

be the standardised normal variate y* exp(t2/2)dt

Then

p(y < y) = ^

<

with

* 2 s

rr
2*.

(32)

In general if more than one parameter is to be estimated from the sample the variance covariance matrix of the estimators 0.(j=1...M) is obtained from the inverse of the matrix of second partial derivatives of ., these are:

.. J are used to form the elements of the M matrix . i.e.

(33)

2 '

(34)

62
, . rf' tf

JL

is called Fisher's Matrix.


EXAMPLES Poisson Distribution: The probability density function of the Poisson distribution is given by (;) _ expt)
!

(35)

35
Where is the mean occurrence rate/unit time. Let , X2 ....X(j be the number of observed occurrences for equally spaced time intervals. Then the likelihood of such a sample is given as
) =

() x 1=1 r
i=1

(36)

i=1

L - x. 1

In .! 1

=0 =

iAxiN
i=1

(37)

The maximum likelihood estimates is given by = * i=1 Appropriate variance for this estimator can be obtained as follows

(38)

* 2 = / ' <
2

' ^

= i.e. ^ = 2/ = / (40)

Normal Distribution; The probability density (p.d.f.) of the normal distribution is given by

f(t)

^75 (2iTe)1/

[()2/2 ]

(41)

Let t , t , t ....t

be the observed failure times for components.

Likelihood of such a sample is given by

36 L = IT (22)"1/2 exp[(t )2/22] i=1 and

(42)

t = ^ in (2fT<r) ^ 2 < ) 2
2 i=1

(,3)

(44) I

i.e.

i=1

(t.) = O

and the maximum likelihood estimator of is given by

=| = t =t
i=1

(45)

(Note that this is the same as that of the matching moment estimator.)

2
N

2( ) =1 2

= <) **"' i=1


(47)

2 and the MLE of er is given by

= (t ) 2
i=1

Usually, the true parameter is unknown and an unbiased estimate of the population variance "2 is obtained from the sample using 1=1 _

S 2 = =1= . . (t t) N1 i

(48)

In order to obtain the variance of the estimator the second partial derivative of (43) is evaluated.

4 =4
<y

<9>

37 and hence from (44) the variance of the estimator is given by

< = 1 ^ 1

w/L ^"
=

N =^

(50)

This is also known as the sampling error. Similarly an approximate variance of the estimator Q 2 is obtained by taking the second partial derivative of (43).

JL
( )
A2
2 2

2(er )

2 2

( )

i3

(tuT
i=1

(51)
(52)

and

/ \ - 1
<
2 * 2

Even though equation (52) can be used to derive an approximate variance for the estimators, in practice the exact distribution for the unbiased estimator S 2 given in (48) can be derived from the relation 2 2y2 S = & \ 2 Where has .degrees of freedom, being equal to when the true parameter is known and N1 when is replaced by its estimator t. Weibull Distribution: The probability density function of Weibull distribution is given by n1 f(t)= exp[(t/T)n] (54) (53)

Let 1/T = oi so that (54) can be written as f(t) = noctn"1 exp[ottn] (55)

Suppose that in a typical life test N specimens are placed under observation and, as each failure occurs, the time is noted. Finally, at some predetermined time T the test is terminated. Let N be the number of items at the beginning of the test and R be the number of failures (R < N). Let t_i, i = 1, ..., R, be the failure times and T be the censoring time, i.e. at the end of the test there are (N-R) items which have not failed by time T. The likelihood function of the above observations can be written as

L = Π (i=1..R) f(t_i) · Π (i=R+1..N) R(T)     (56)

i.e.

ln L = Σ (i=1..R) ln[n α t_i^(n-1) exp(-α t_i^n)] + Σ (i=R+1..N) ln[exp(-α T^n)]     (57)

To obtain the MLEs of α and n, the first derivatives of ln L with respect to α and n are set to zero, giving

∂(ln L)/∂α = R/α - Σ (i=1..R) t_i^n - (N-R) T^n = 0     (58)

i.e.

α̂ = R / [Σ (i=1..R) t_i^n + (N-R) T^n]     (59)

and

∂(ln L)/∂n = R/n + Σ (i=1..R) ln t_i - α [Σ (i=1..R) t_i^n ln t_i + (N-R) T^n ln T] = 0     (60)

Equation (60) can only be solved numerically, using either Newton-Raphson or the binary chop method.
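Equation (60) is a single non-linear equation in the shape parameter n. A sketch of a Newton-Raphson solution for the complete-sample case (all N items failed, so the censoring terms in T drop out) is given below in Python; the derivative is taken numerically and the data are synthetic, purely for illustration.

```python
import numpy as np

def shape_equation(n, t):
    """Equation (60) for a complete sample: d(ln L)/dn = 0."""
    tn = t ** n
    return t.size / n + np.sum(np.log(t)) - t.size * np.sum(tn * np.log(t)) / np.sum(tn)

def weibull_mle(t, n0=1.0, tol=1e-8, max_iter=100):
    """Newton-Raphson on the shape parameter, with a numerical derivative."""
    n = n0
    for _ in range(max_iter):
        g = shape_equation(n, t)
        h = 1e-6
        dg = (shape_equation(n + h, t) - g) / h
        step = g / dg
        n -= step
        if abs(step) < tol:
            break
    tau = (np.sum(t ** n) / t.size) ** (1.0 / n)   # from equation (59), tau = alpha**(-1/n)
    return n, tau

rng = np.random.default_rng(1)
lives = rng.weibull(2.0, 200) * 50.0               # synthetic failure times
print(weibull_mle(lives))                          # shape near 2, scale near 50
```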

Example on the estimation of the parameters and goodness of fit

The lifetime, in hours, of a certain kind of tool is given below:

6 5 13 15 4 7 16 52 6 10 28 182 15 5 9 92 31 13 6 9
Solution

21 8 17 7 9 192 87 20 10 17 17 27 20 8 33 22 7 45 55 61

8 11 5 7 6 6 11 27 10 15 31 15 18 22 11 7 37 32 10 26 4 9 19 10 25 14 23 13 21 19 151 171 6 7 17 7 52 5 102 7

10 9 11 9 14 44 12 4 19 8

21 4 38 8 15 20 7 6 42 23

Range = Maximum observed value minimum observed value = 192 4 = 188 hours. In order to summarise the data, the data first of all is rearranged in an ascending order. The rearranged data is presented in table II. In

39 order to construct the frequency histogram, number of classes between 8 and 15 of convenient class interval are chosen. For the data given in the example, if equal class intervals were to be chosen, then a width of 20 hours would be suitable. However, as can be checked, in the case of distributions with positive skew, unequal class intervals gives a better summary. Table III gives the summary of frequencies for the chosen interval. The corresponding histogram is shown in Fig. 1. It can be seen that the shape of the histogram suggests that the underlying distribution of the failure times is exponential of the form A f(t) = e H / )T t i=1 * with the maximum likelihood estimate given by

which is the inverse of the sample mean.

From table II the mean is given by _ 42 t = i=1 i.e. I = 2528/100 = 25.28 hours. T(I).F(I)/100

The sample variance is given by s2 = () i=1 ( v<2 L u However, for computational purpose the following formula can be used.

S 2 = (zF(l)i) ( T(I)2/F(I) [IT(I).F(I)]2/IF(D frF>}> ,1 (


From table II. 2F(I) = (4+4+7+
2 2

1) = 100 (1922x1) = 189020 (192x1) = 2528

T(I) F(I) = (4 x4)+(52x4)+(62x7)+ 2T(I).F(I) = (4x4)+(5x4)+(6x7)+ The sample mean is now given by I = 2528/100 = 25.28 hours and the variance 2 _ 189020 25282/100 .... . 100 1 1263.79

40 and a standard deviation


o

s = v/S

= 35-55 which is an estimate offl-.

The maximum likelihood estimate of is then


A

= 1/t = 0.0396 failures/hour Even though the theoretical mean (1/ ) and the standard deviation take the same value for the exponential distribution, the corresponding sample mean and sample deviation are not necessarily the same as is illustrated above.

Goodness of Fit Tests 6. The Chi-square test for goodness of fit

The data consists of independent observations. These observations are grouped into C classes and the number of observations in each class are presented in the form of a 1 C contingency table.

Class Observed Frequency Expected Frequency

C total

Corresponding expected frequencies are obtained as follows E. = P.N j = 1 ...C

Where P' s are the probabilities of a random observation being in class j assuming that the probability distribution of hypothesised function is F(t). Then the null and the alternative hypothesis can be stated as below. H : The distribution function of the observed random observation is F(t) : The distribution function of the observed random observation is some function other than F(t)

The test statistic is given by

41
C
7

= (0. - E.) /E. 1 1 i=1


If some of the E'.s are small (<5) it is recommended that these classes be combined. The critical region for the c*.significance level can be obtained from the chi-square tables. i.e. Where by 2( ,df) is the significance level and df is the degrees of freedom given

df = C-1- number of parameters 2 Then accept if (o,df) otherwise reject. EXAMPLE In the example of tool life, table II the chi-square test statistic is calculated as = 8.6117 with degrees of freedom 5. From table II critical value corresponding to 5% significance level ( ( = .05) and 5 degrees of freedom ( y = 5) can be read as 11.070. * Hence the null hypothesis that the distribution of failure times follows a negative exponential distribution with = .0396 is accepted.

7. Kolmogorov-Smirnov test for goodness of fit This test is based on the distribution of values of cumulative probabilities. Let t. i=1 be the failure times arranged in an ascending order. The observed cumulative distribution function is given by S(t.) = i/N Where i is the accumulated number of failures up to and including time t . The test statistic Dmax is given by Dmax = Sup( | F(t.) - S(t.) | ) 1 i=1...N 1 Where F(t ) are the CDF of the hypothesised distribution and "sup" (SupremumT denotes the maximum absolute deviation. This is shown diagramatically below.

42
F(t)

.0 1
JO

) o >

0.8 ypothesised Distribution 0.4 0.2 0

( 0.6^ Empirical Distribution .

co
3

Note that for any observation t. the maximum deviation d. is given by d. = MAXClFttJSttJl.lFtt^Si^l)/] and Dmax is given by

Dmax = MAX i=1..N

( d. )
1

Critical values for given significance level are given in table I. EXAMPLE The previous example on tool life is considered here again to illustrate the test. Referring to table II. Column 3: the sample CDF, S(I) is obtained as follows S(1) = A/100 = .04 S(2) = (4+4)100 = .08 S(3) = (4+4+7)/100 = .15 the hypothesised probability P(I) is obtained as follows P(I) = 1exp( pt ) P(1) = 1exp(4/25.28) = .1463 P(2) = 1exp(5/25.28) = .1715

Column 4:

and so on. Column 5: D(I) = max[|s(I)P(I)|,|S(I1)P(I)| ] with S(0) = 0 D(1) = max[|0.1463|,|0.04.1463|] = .1463 D(2) = max[j.04 .1795|,|.08.1795|] = .1395

and so on.

43 The maximum absolute deviation Dmax = .1526 which occurs at time 23. The critical value corresponding to sample size 100 and significance level 5% = 1.36/V100 = .136. Since Dmax > .136, the null hypothesis that the distribution of tool life is exponential with = .0396 is rejected. Compared to the chisquare test described before, a contradictory result is observed. However, it is recommended that the Kolmogorov Smirnov test results be used since it is an exact test, whereas the chisquare test is an approximate test.

8. Confidence limits

Having obtained estimates of the parameters of a given distribution, it is often desirable, particularly in risk assessments, to supply confidence limits for the derived parameters. Except in special cases this is usually difficult to do. If the failure rate is exponential, explicit limits can be given in terms of the χ² distribution. Again, if the sample size is large (>20), then the multivariate normal distribution, using the variance-covariance matrix given by equation (32), can be used to construct confidence limits. If the sample size is small, the only general recourse left is that of using Monte Carlo simulation. With this process one repeatedly, by simulation, generates samples of the same size as the actual data sample, using the derived distribution. Parameter values for these simulated samples are then estimated; confidence limits are then obtained from the resulting spread of values occurring in the estimates. A possible alternative procedure to deal with confidence limits is to analyse the original data using Bayesian methods, obtaining a probability distribution directly for the parameters; however, discussion of these techniques is beyond the scope of the current paper.
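A sketch of the Monte Carlo procedure just described is given below (Python with numpy); an exponential sample is simulated purely for illustration: samples of the same size are generated from the fitted distribution, the parameter is re-estimated for each, and the limits are read from the spread of the estimates.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.exponential(scale=25.0, size=15)        # small illustrative sample
theta_hat = 1.0 / data.mean()                      # MLE of the failure rate

# Parametric Monte Carlo: re-estimate theta from many simulated samples of the same size
sims = rng.exponential(scale=1.0 / theta_hat, size=(10_000, data.size))
theta_sim = 1.0 / sims.mean(axis=1)

low, high = np.percentile(theta_sim, [2.5, 97.5])
print(f"theta_hat = {theta_hat:.4f}, 95% Monte Carlo limits: [{low:.4f}, {high:.4f}]")
```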

44
TABLE I. Critical values, d.(N), of the Maximum Absolute Difference between Sample and Population Cumulative Distributions. Values of d.(N) such that Pr|max| S(x)F.(x)>d.() =0i, where F.(x) is the theoretical cumulative distribution and S () is an observed cumulative distribution for a sample of N. Sample size (N)
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 25 30 35

Level o f S i g n i f i c a n c e (oC ) 0.20 0.900 0.684 0.585 0.494 0.446 0.410 0.381 0.358 0.339 0.322 0.307 0.295 0.284 0.274 0.268 0.258 0.250 0.244 0.237 0.231 0.21 0.19 0.18 1 .07
y

0.15 0.925 0.726 0.597 0.525 0.474 0.436 0.405 0.381 0.380 0.342 0.326 0.313 0.302 0.292 0.283 0.274 0.266 0.259 0.252 0.246 0.22 0.20 0.19 1 .14
N/TT

0.10 0.950 0.775 0.642 0.564 0.510 0.470 0.438 0.411 0.388 0.368 0.352 0.338 0.325 0.314 0.304 0.295 0.288 0.278 0.272 0.264 0.24 0.22 0.21 1.22
y/W

0.05 0.975 0.842 0.708 0.624 0.565 0.521 0.486 0.457 0.432 0.410 0.391 0.375 0.361 0.349 0.338 0.328 0.318 0.309 0.301 0.294 0.27 0.24 0.23 1.36 s/W

0.01 0.995 0.929 0.828 0.733 0.689 0.618 0.577 0.343 0.514 0.490 0.468 0.450 0.433 0.418 0.404 0.392 0.381 0.371 0.363 0.358 0.32 0.29 0.27 1 .63
y/W

over 35

45
TABLE II Table of Rearranged Failure Times

(1)

(2)
Frequency F(I)

(3)
Sample C.D.F. S(I)

(4)
Theoretical C.D.F. P(I)

(5)
Absolute Deviation D(I)

No.

Failure Time T(I) (hours)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 25 26 27 28 31 32 33 37 38 42 44 45 52 55 61 87 92 102 151 171 182 192

4 4 7 9 5 6 6 4 1 3 2 5 1 4 1 3 3 3 2 2

.0400 .0800 .1500 .2400 .2900 .3500 .4100 .4500 .4600 .4900 .5100 .5600 .5700 .6100 .6200 .6500 .6800 .7100 .7300 .7500 .7600 .7700 .7900 .8000 .8200 .8300 .8400 .8500 .8600 .8700 .8800 .8900 .9100 .9200 .9300 .9400 .9500 .9600 .9700 .9800 .9900 1.0000

.1463 .1795 .2113 .2419 .2713 .2995 .3267 .3528 .3779 .4020 .4252 .4475 .4689 .4895 .5094 .5283 .5467 .5643 .5812 .5974 .6280 .6425 .6563 .6697 .7066 .7180 .7289 .7686 .7776 .8101 .8246 .8313 .8721 .8864 .9104 .9679 .9737 .9823 .9975 .9988 .9993 .9995

.1463 .395 .1313 .0919 .0313 .0505 .0833 .0972 .0821 .0879 .0848 .1125 .1010 .1205 .1107 .1216 .1333 .1458 .1489 .1526 .1320 .1275 .1337 .1303 .1134 .1120 .1111 .0814 .0824 .0599 .0554 .0586 .0378 .0335 .0196 .0380 .0337 .0323 .0375 .0289 .0193 .0095

Maximum absolute deviation

.1526 at time 23 hours.

46
TABLE III Frequency Table (1) Nc^ (2) (3) (4)

Class Interval (Hours) 0-<10 10-<20 20-<30 30-<40 40-<50 50-<70 70-00 100-<140 140-<200

Observed Frequency 0.
1

Theoretical Probabili t y P(I) 0.3267 0.2200 0.1481 0.0997 0.0671 0.0757 0.0436 0.0152 0.0009

Expected Frequency
E

Chi-Square Contribution (0i-Ei)2/Ei 0.1662 2.9090 0.0075 1 .5808 2.0513 1.6836 0.2133
8.6117

1 2 3 4 5 6 7 8 9

35 30 15 6 3 A

2j

32.67 22.00 14.81 9.97 6.71 7.57, 4.36 1.52 5.88 0 J

100
Column 3 Estimated value of = 1/25.28 exp(0) - exp(-10/25.28) = 1 - 0 , 6733 = 0.3267 exp(-10/25.28) - exp(-20/25.28) = 0.6733 - 0.4533 = 0.2200 exp(-20/25.28) - exp(-30/25.28) = 0.4533 - 0.3052 = 0.1481 etc. Column 4

E. = probability 100. Since class interval 7, 8 and 9 have the expected frequency <5f these are combined. Chi-Square statistic = 8.6117. Degrees of freedom = number of class intervals _ 1 = 7-1-1 = 5. number of parameters

47

FIGURE 1. Frequency histogram of the tool-life data, with observed and expected (theoretical) frequencies.

REFERENCES

1. "Statistics for Technology", Chatfield, C , Chapman & Hall, 1975. 2. "Data Analysis for Scientists and Engineers", Meyer, S.L., John Wiley & Sons, 1975. 3. "Methods for Statistical Analysis of Reliability and Life Data", Mann, N.R. etal, Wiley, 1973.

INFERENCE A BAYESIAN APPROACH

C.A. Clarotti, ENEA TIB ISP, CRE Casaccia, S.P. Anguillarese 301, 00100 Rome, Italy

ABSTRACT. The paper dwells mainly on the following issue: "Reliability theory and practice are concerned with making decisions almost always relative to 'non-statistical' uncertain events; that is why Bayes statistics is the ideal tool for reliability analysts." In order to support this statement, fundamentals of subjective probability and utility are surveyed; Bayes' theorem, prior distributions and the other tools of reliability analysts are reframed into a decision-making context.

1. FOREWORD

Authors of most tutorial papers on the use of Bayes statistics in the reliability field are in a hurry to arrive at the following final statement: "... in view of the previous discussion, from a Bayesian standpoint the problem of estimating the failure rate of components which fail independently according to an exponential distribution can be solved as follows:

i) assign λ the status of a random variable (since it is unknown) and define your subjective prior pdf π₀(λ) on the possible values of λ;
ii) put some components on test and obtain some data D (a set of observed lives and some survivals);
iii) write down the likelihood L(D|λ) of the observed data;
iv) derive your posterior on λ according to Bayes' theorem,

π(λ|D) = L(D|λ) π₀(λ) / ∫₀^∞ L(D|λ) π₀(λ) dλ

v) use π(λ|D) for estimating λ."

In their haste, these authors unconsciously and unduly accept a
A. Amendola and A. Saiz de Bustamante (eds.), Reliability Engineering, 49-66. 1988 by ECSC, EEC, EAEC, Brussels and Luxembourg.

large inheritance of orthodox statistics. This clearly appears at the moment of deciding "how to use" π(λ|D) in order to guess λ₀, the "true value" of λ. Someone suggests assuming

λ₀ = ∫₀^∞ λ π(λ|D) dλ     (1.1)

someone else chooses λ₀ such that

∫ from 0 to λ₀ of π(λ|D) dλ = ∫ from λ₀ to ∞ of π(λ|D) dλ     (1.2)

and some others assume λ₀ to be the mode of π(λ|D); but there is no general consensus, and there cannot be, simply because estimation of "abstract" parameters is orthodox statistics' business and not a business of Bayes statistics. As a matter of fact, the very motivation for subjective probability and Bayes statistics is the need of making decisions in the face of uncertainty relative to events which cannot be repeated indefinitely (non-statistical events), and not the aim of guessing the "true value" of probability-distribution parameters.

No one better than engineers should appreciate the flavour of Bayes statistics. Engineers are not concerned with point estimation of probability-distribution parameters; they are concerned with practical decisions such as: different design solutions are available for a given system; whichever solution is adopted, the system turns out to be prone to failure; the system, when operating correctly, will produce benefit to people (e.g. a cheaper cost of electricity); system failure is hazardous to people; implementations of different solutions have different costs; which is "the best" solution? Of course, in order to answer the question above correctly, the attribute "the best" should be made precise (and it will be in the sequel), but no matter what the precise meaning of "the best" is, no doubt the concern of engineers is to maximize (minimize) in some sense the benefit (the hazard) to the "unique" population living in the surroundings of a finite (and generally rather small) number of systems to be located at pre-established sites.

Arguments such as "if an infinite number of systems were to be built, then (according to the classical probability theory) the observed failure frequency would be the same as the system failure probability; in the long-run case the 'best' solution would then be ..." are of little (if any) help in selecting the "best solution" in a "single case context" (a unique lot of systems which will turn out to be beneficial or hazardous to a pre-established population). The Bayes

51 statistics' capability of handling the "single case" is then of great importance to engineers. Yet, in a Bayesian view, only observable entities such as the failure time of a system are admissible subjects of statistical fore casts; parameters like are regarded as operational tools which sim plify calculations; guessing the "true value" of something () which will never be observed, is meaningless. This suits the attitude of mind of engineers who, of course, prefer to figure out a measure of the uncertainty relative events which ultimately will occur or not (such as system failure prior to end of mission) rather than to think of in tangible quantities like probabilitydistribution parameters. "Cooperation" between engineers and Bayes statistics is then very promising. In order it to be as profitable as it potentially is, fundamentals of subjective probability must be discussed in some detail ; a correct understanding of basic concepts will prevent misuses coming from our past experience with orthodox statistics. Polluting the Bayesian environment with practices borrowed from orthodox statistics must be avoided; Lord Russel's theorem reminds us that any false statement in a group of axioms makes it impossible to prove any theorem, or: peeling prickly pears as if they were apples is dangerous.
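Purely as a numerical illustration of the point summaries (1.1)-(1.2) that the author is criticizing, the sketch below computes the posterior mean, median and mode of a gamma posterior on lambda; the shape and scale values are arbitrary illustrative assumptions, not values from the text.

from scipy.stats import gamma

# Illustrative posterior on lambda: gamma with shape a and scale s (assumed values)
a, s = 4.0, 1.0 / 100.0          # e.g. roughly 4 failures over some 400 h of observation
posterior = gamma(a, scale=s)

lam_mean   = posterior.mean()            # candidate "estimate" in the spirit of (1.1)
lam_median = posterior.ppf(0.5)          # candidate "estimate" in the spirit of (1.2)
lam_mode   = (a - 1.0) * s               # mode of a gamma density (valid for a > 1)

print(lam_mean, lam_median, lam_mode)    # three different candidate "true values" of lambda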

2.

INGREDIENTS OF DECISION MAKING & BAYES STATISTICS

Let n be the number of different design solutions for the system of our example in Section 1; we can choose among (n+1) different decisions, namely:
  d_i: implement the i-th design solution, i = 1, ..., n;
  d_0: do not build the system.

The difficulty in making the decision lies in that we do not know with certainty which of the following exhaustive and mutually exclusive events will occur:
  E_0: the system fails prior to the end of the mission;
  E_1: the system does not fail prior to the end of the mission.
As a consequence, the first step towards selecting "the best" d_i is quantifying our uncertainty on E_1 (E_0); this must be done without any reference to the long run, since we are interested in the single case. If probability could handle the single case it would be a very nice uncertainty measure, since probability laws
- permit us to derive an uncertainty measure on complicated events starting from the uncertainty measures on much simpler ones (e.g. to derive the probability of E_0 starting from the probabilities of system components failing);
- provide procedures for changing the uncertainty measures on specified events as new information becomes available.
All that we have to do to obtain a suitable uncertainty measure is then to cut the tie between probability and the long run.

The tie-break is achieved by defining the probability of an uncertain event A as our degree of belief that A will occur. For making the definition precise we need only the concept of indifference between two gambles [1, Chapter 2]. In one gamble a prize is won if A occurs. In the other one the same prize is won if a black ball is drawn "at random" from an urn containing b black balls and n - b white ones. "At random" has the following meaning: let the balls in the urn be numbered from 1 to n; let gamble i, i = 1, ..., n, be a gamble such that a prize is won if ball i is drawn from the urn (the prize is the same for all gambles); the balls are drawn "at random" (in your view) if the gambles above are equally palatable to you. "At random in someone's view" is synonymous with "equally likely (to occur, to be drawn) in someone's view". Having defined randomness in terms of indifference among different gambles, let us define by the same token the probability of the event A. Keep n constant (n: total number of balls in the urn, white balls + black ones), let each ball be equally likely to be drawn and let b vary (b: number of black balls). If b = 0, winning the prize in the gamble contingent on a black ball being drawn is impossible; if b = n, winning the prize in the same gamble is certain. If b = 0 a gamble contingent on A occurring is preferable to a gamble contingent on a black ball being drawn; the converse is true when b = n. In between, 0 < b < n, small values of b will cause you to prefer a gamble contingent on A; large values of b will make it preferable to you to take a gamble contingent on a black ball being drawn. Let us denote by B_A the set of values of b such that the gamble contingent on A is preferred; let B_b be the set of values of b such that the gamble contingent on a black ball being drawn is preferred. For any b_A in B_A and b_b in B_b it results:

  b_A < b_b                                                                             (2.1)

If n is large enough to avoid "discontinuity in preference", there must exist a value b_0 such that you are indifferent between the two gambles. Since for any b each ball is equally likely to be drawn, the ratio

  b/n                                                                                   (2.2)

soundly measures your degree of belief (subjective probability) that the event B: "a black ball is drawn" will occur. Since in particular for b = b_0 the urn gamble is as attractive to you as the gamble contingent on A, your subjective probability of the event A is expressed by

  b_0/n                                                                                 (2.3)

Defining probability via comparison to a standard eliminates the tie between probability and the long run: "... the standard uses no repetition and is non-statistical. The ball is only drawn once from the urn to settle the gambles, after which the whole apparatus could be destroyed" [1, Chapter 2]. We are then in a position to measure the uncertainty in a single-case context. Notice furthermore that:
- Subjective probability is always a conditional probability, conditioning being with respect to the available information. In the case of the choice among different design solutions for a plant, for example, the more detailed your analysis of a given design solution has been, the more you are confident in that there are no design bugs or weak points which will cause the plant to fail.
- In a subjective probability frame there is no room for "unknown probabilities". You face uncertainty by assigning probabilities to uncertain events, and then no probability can be unknown.
Let us refer back to the example of selecting the best design solution and focus on the i-th design solution. We want to "incorporate" into the probability that the plant will not fail when built according to solution i the following information: N_i plants have been built in the past according to design solution i and k_i of them failed. For r = 1, ..., N_i let A_r^(i) = 1 if plant #r (built according to solution i) successfully operated and let A_r^(i) = 0 if plant #r failed. Let A_{N_i+1}^(i) be likewise defined for the plant whose construction is under scrutiny. The mission time is understood to be the same for all plants. If we were frequentists we would act as follows:
- A_1^(i), ..., A_{N_i+1}^(i) would be regarded as a set of independent Bernoulli trials with unknown probability p_i (probability of successful operation of the plant);
- the outcomes of the first N_i trials would be processed in order to derive an estimate of the unknown p_i;
- the decision as to building or not plant N_i+1 would be drawn by making use of that estimate.
What to do in a Bayesian setting, where such a thing as an unknown probability does not exist? First notice that from a Bayesian standpoint your being willing to "incorporate" the information above in the assessment of the probability that A_{N_i+1}^(i) = 1 means that in your view the events A_r^(i) = a_r, r = 1, ..., N_i+1, are not independent. If they were, indeed, for any choice i_1, ..., i_s, s = 2, ..., N_i+1, among the indices 1, ..., N_i+1, it should be:

  P((A_{i_1}^(i) = a_{i_1}) INTERSECT ... INTERSECT (A_{i_s}^(i) = a_{i_s})) = P(A_{i_1}^(i) = a_{i_1}) ... P(A_{i_s}^(i) = a_{i_s})

(P(.) = probability of the event in brackets) and in particular

  P((A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}) INTERSECT (A_{N_i+1}^(i) = 1)) = P(A_1^(i) = a_1) ... P(A_{N_i}^(i) = a_{N_i}) P(A_{N_i+1}^(i) = 1)

or, which is the same,

  P(A_{N_i+1}^(i) = 1 | A_1^(i) = a_1, ..., A_{N_i}^(i) = a_{N_i}) = P(A_{N_i+1}^(i) = 1)               (2.4)

(All the above probabilities are conditional also on d_i, i.e. on the plant having been (or going to be) built according to the i-th design solution; explicit conditioning is omitted for the sake of simplicity.) But the latter equation does not apply, since it contradicts our willingness to "incorporate" the information A_1^(i) = a_1, ..., A_{N_i}^(i) = a_{N_i} into the probability that (A_{N_i+1}^(i) = 1) occurs; as a matter of fact, after knowing that (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}) occurred, we want to change our mind and to assign to the event A_{N_i+1}^(i) = 1 a probability (left-hand side of Eq.(2.4)) which is not the same as the probability we would assign had we simply known that the plant is going to be built according to the design solution i (right-hand side of Eq.(2.4)). If A_1^(i), ..., A_{N_i+1}^(i) are not i.i.d. (independent identically distributed) random variables, what are they? How to obtain the left-hand side of Eq.(2.4) after assessing the right-hand side and having observed (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i})? "Having abandoned independence, the simplest choice open to us is to continue to regard the order as irrelevant" [2, p.211], i.e. to assume that all the products

  (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i})

such that SUM_{r=1}^{N_i} (1 - a_r) = k_i (number of observed failures) have the same probability. This is again a subjective judgement which, anyway, can be shared by all people who agree on that the plants are equivalent in all respects (design, construction, operating conditions), so that there is no reason for considering some of them more likely to fail within the mission time than the others. If the random variables A_r^(i), r = 1, ..., N_i, possess the property above, they are referred to as exchangeable random variables.

Note furthermore that the "exchangeable plants of type i" we can build are potentially infinite in number. For exchangeable processes which can continue indefinitely, de Finetti's representation theorem holds: "If A_1^(i), A_2^(i), ... are binary and exchangeable, then there exists a probability distribution pi_i on the interval [0,1] such that for any value of N (N = 1, 2, ...), any finite collection i_1, ..., i_N, and any outcomes a_1, ..., a_N,

  P((A_{i_1}^(i) = a_1) INTERSECT ... INTERSECT (A_{i_N}^(i) = a_N)) = INT[0,1] theta_i^{k_i} (1 - theta_i)^{N - k_i} dpi_i(theta_i)       (2.5)

where k_i = SUM_{j=1}^{N} (1 - a_j) (number of observed failures)."
The interpretation of the theorem is as follows: "If A_1^(i), A_2^(i), ... are exchangeable, then they can be regarded as being conditionally i.i.d. given the value of some fictional parameter theta_i that lies in the interval [0,1] and has the probability distribution pi_i" [3]. If we consider plant N_i+1 (the one under scrutiny for being built according to the design solution i) to be "exchangeable" with the first N_i plants, then by de Finetti's theorem it must be both

  P((A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i})) = INT[0,1] theta_i^{k_i} (1 - theta_i)^{N_i - k_i} dpi_i(theta_i)           (2.6)

and

  P((A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}) INTERSECT (A_{N_i+1}^(i) = 1)) = INT[0,1] theta_i^{k_i} (1 - theta_i)^{N_i + 1 - k_i} dpi_i(theta_i)     (2.6')

In order that our assessment of probabilities be coherent (i.e. in accordance with the laws of probability), it then follows

  P(A_{N_i+1}^(i) = 1 | A_1^(i) = a_1, ..., A_{N_i}^(i) = a_{N_i}) =

  = P((A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}) INTERSECT (A_{N_i+1}^(i) = 1)) / P((A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}))

  = INT[0,1] theta_i^{k_i} (1 - theta_i)^{N_i + 1 - k_i} dpi_i(theta_i) / INT[0,1] theta_i^{k_i} (1 - theta_i)^{N_i - k_i} dpi_i(theta_i)        (2.7)

Eq.(2.7) can be rewritten in a more familiar form as follows. Let pi_i have a density pi_i(theta_i), and let

  D_i = (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i})

Set

  L(D_i | theta_i) = theta_i^{k_i} (1 - theta_i)^{N_i - k_i}                                                      (2.8)

  pi_i(theta_i | D_i) = L(D_i | theta_i) pi_i(theta_i) / INT[0,1] L(D_i | theta_i) pi_i(theta_i) dtheta_i          (2.9)

(2.7) then becomes

  P(A_{N_i+1}^(i) = 1 | D_i) = INT[0,1] (1 - theta_i) pi_i(theta_i | D_i) dtheta_i                                (2.10)

L(D_i | theta_i) is the likelihood of the data we observed; it is the probability of the observed data given that the fictional parameter value is theta_i (that is why the A_r^(i)'s are said to be conditionally i.i.d.). pi_i(theta_i) is the "prior" pdf (probability density function) on the fictional parameter theta_i; pi_i(theta_i | D_i) is the "posterior" pdf on it. In no way are pi_i(theta_i) and pi_i(theta_i | D_i) pdf's on the "unknown failure probability" of the plant. Probability is not an observable quantity and a gamble contingent on a probability having a given value can never be decided in an unequivocal manner, so "probability of probability" cannot be defined via comparison to a standard or, simply, it is a nonsense. The parameter theta_i is just an operational tool for simplifying calculations and information processing, as is shown below.

Let A_1^(i), ..., A_{N_i+1}^(i) be exchangeable in your view, so that, by de Finetti's theorem, it must be

  P(A_{N_i+1}^(i) = 1) = INT[0,1] (1 - theta_i) pi_i(theta_i) dtheta_i                                            (2.11)

Suppose that after examining the design solution i, your subjective probability of the plant surviving the mission is

  P(A_{N_i+1}^(i) = 1) = 1 - theta_i^(0)

If you wanted to rely as much as possible on data for your final statement of P(A_{N_i+1}^(i) = 1 | D_i) you ought to choose a uniform prior on theta_i:

  pi_i(theta_i) = 1,    0 <= theta_i <= 1                                                                         (2.12)

(do not call (2.12) a non-informative prior; what priors are to be "classified" as non-informative is a controversial issue [4]). Eq.(2.12), together with (2.11), would entail

  P(A_{N_i+1}^(i) = 1) = INT[0,1] (1 - theta_i) dtheta_i = 1/2

which does not fit your degree of belief in plant surviving the mission. You could then choose

  pi_i(theta_i) = 1/theta_i^(0)     for theta_i <= theta_i^(0)
  pi_i(theta_i) = 0                 for theta_i > theta_i^(0)                                                     (2.13)

A choice like (2.13) would cause P(A_{N_i+1}^(i) = 1 | D_i) never to be smaller than 1 - theta_i^(0), whatever the observed data. This would be so even in the case that it were D_i: k_i = N_i, with N_i "large", while the latter evidence would rather be in favour of

  P(A_{N_i+1}^(i) = 1 | D_i) close to 0

A more reasonable choice might then be

  pi_i(theta_i) = (1 - epsilon)/theta_i^(0)       for theta_i <= theta_i^(0)
  pi_i(theta_i) = epsilon/(1 - theta_i^(0))       for theta_i > theta_i^(0)                                       (2.14)

with epsilon small but not equal to zero, so that the data can better do their job, i.e. they can push P(A_{N_i+1}^(i) = 1 | D_i) to zero as the frequency of observed failures increases (note that (2.14) reduces to (2.13) when epsilon = 0).

We are now able to measure the uncertainty relative to both statistical and non-statistical events; we are also able to update our uncertainty measure in the face of new information. In order to draw "rational" decisions we have to introduce a measure for the consequences of our decisions. This will be done with reference to the choice of the best design solution; the approach is borrowed from [1, Chapter 4]. For the sake of simplicity let us assume some additional hypotheses and make the notation more handy. Let the design solutions be ordered according to their costs, that is to say: design solution i+1 is more expensive than design solution i. Design solution i and the decision of implementing it will both be denoted by d_i, i = 1, ..., n; d_0 is the decision of not building the plant (keeping the status quo). Let the larger cost of d_{i+1} with respect to d_i be due only to additional cautions for avoiding plant failure (e.g. more redundancy); cautions taken for mitigating the impact (on people, environment) of the plant failure are the same for all d_i's. In the sequel we shall denote respectively by E_1 and E_0 the events: the plant survives the mission, the plant fails within the mission time. The possible consequences of our decisions can be represented by the couples

  (E_j, d_i),    j = 0, 1;  i = 1, ..., n                                                                         (2.15)

other than (E_j, d_0) (if the plant is not built it can neither fail nor survive). The "status quo" will be simply denoted by d_0. (2.15) is a short form for resuming the whole story. All the uncertainty lies in E_j occurring or not; if d_i has been selected and E_j has occurred, then there is no longer anything uncertain and we know the consequences. For instance, if d_i has been selected and E_0 occurred, we incur the monetary loss relative to having implemented d_i plus the loss inherent to the impact of the accident on the environment and people. From the hypotheses we assumed it follows that the sound ranking of consequences is:

  (E_0, d_n) <= (E_0, d_{n-1}) <= ... <= (E_0, d_1) <= d_0 <= (E_1, d_n) <= ... <= (E_1, d_1)                     (2.16)

(<= means "not preferred to").

Justification of (2.16) is as follows. The reason why we are contemplating building the plant is that, if it works correctly, we will have more benefits than we have in the status quo (no plant); we did not make straight the decision of building it because, if the plant fails, we will be in a situation worse than the status quo. If things go wrong (failure), the more money we will have spent, the worse it will be (the impact is the same for all d_i's). In the case of no failure, the cheaper a successful operation has been, the better it is. Let C and c be any two consequences such that

  c <= (E_j, d_i) <= C        for every j and i

(that is: c and C are any two consequences which encompass all the possible ones). By making use of C and c, a numerical measure for consequences will be given which is based on a capability which people continuously exercise in everyday life: the capability of comparing and preferring, for the purpose of exchange, benefits enjoyable with certainty with benefits which will be enjoyed or not according to chance (think of subscribing to insurance contracts, investing assets, choosing between a long-lasting riskless train trip to somewhere and more quickly being there by air, which involves taking a small chance). Focus on any (E_j, d_i) and suppose you have it with certainty. You are offered to exchange it with taking a gamble in which you receive C with probability u and c with probability 1 - u. If you prefer (E_j, d_i) to the gamble, u is increased up to the value u(E_j, d_i) which makes you indifferent between the gamble and (E_j, d_i) ("with probability u" is, of course, a short form for "you receive C if a black ball is drawn at random..."; the existence of u(E_j, d_i) follows by the same token as in the case of the definition of subjective probability). Note that for the purpose of exchanging (E_j, d_i) with C with chance u, the information (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}) is irrelevant. The uncertain event E_j (whose probability depends on that evidence) has occurred and, since the impact of plant failure (on people, environment) is the same and known for all d_i's, we are not influenced by having observed (A_1^(i) = a_1) INTERSECT ... INTERSECT (A_{N_i}^(i) = a_{N_i}).

The main results of the above are:
i) whichever decision is taken and whichever E_j occurs, the ultimate result is equivalent to "C or c";
ii) u(E_j, d_i), which is referred to as the utility of (E_j, d_i), obeys the laws of probability; it is nothing but the conditional probability of the most desirable consequence C given (E_j, d_i):

  u(E_j, d_i) = P(C | E_j, d_i)

The best d_i is then the one which achieves C with the highest probability, i.e. the d_i which yields the maximum of P(C | d_i). For what precedes, in the course of calculating P(C | d_i) we can simply write P(E_1 | d_i) in place of

  P(A_{N_i+1}^(i) = 1 | A_1^(i) = a_1, ..., A_{N_i}^(i) = a_{N_i})

The best design solution will then be the solution d_{i*} such that

  P(C | d_{i*}) = P(C | E_1, d_{i*}) P(E_1 | d_{i*}) + P(C | E_0, d_{i*}) P(E_0 | d_{i*}) =
               = u(E_1, d_{i*}) P(E_1 | d_{i*}) + u(E_0, d_{i*}) {1 - P(E_1 | d_{i*})} >=
               >= u(E_1, d_i) P(E_1 | d_i) + u(E_0, d_i) {1 - P(E_1 | d_i)}                                       (2.17)

for any other d_i. The decision criterion expressed by (2.17) is referred to as the Maximum Expected Utility principle because P(C | d_i) is, in fact, the expectation of the utility if d_i is selected. Note, however, that (2.17) has been justified by just the use of probability laws, which suit non-statistical events, without any reference to what happens in the case of indefinite repetition (in which case, according to classical probability, what you expect is exactly what comes out to occur). The Bayesian decision scheme is now clear. There is neither room nor need in it for estimation of probability distribution parameters. The role of pdf's on parameters is to simplify the application of Bayes' theorem in the face of "new information" and not to be the basis for abstract parameter estimation.
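As a small numerical companion to (2.17), the Python sketch below simply picks the decision with the maximum expected utility; the survival probabilities, utilities and status-quo utility used here are invented placeholders, not values from the text.

# Hypothetical inputs: P(E1|d_i) for each design solution and the utilities u(E_j, d_i)
p_survival = {"d1": 0.95, "d2": 0.99}            # posterior predictive probabilities (assumed)
utility    = {("E1", "d1"): 1.00, ("E0", "d1"): 0.0,
              ("E1", "d2"): 0.97, ("E0", "d2"): 0.0}
u_status_quo = 0.96                              # u(d0) = 1 - q for an accepted risk q (assumed)

def expected_utility(d):
    p = p_survival[d]
    return utility[("E1", d)] * p + utility[("E0", d)] * (1.0 - p)

candidates = {"d0": u_status_quo}
candidates.update({d: expected_utility(d) for d in p_survival})
best = max(candidates, key=candidates.get)
print(candidates, "-> best decision:", best)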

3.

IT PAYS TO BE BAYESIAN

Let us now turn to the problem which has captured so much attention in reliability practice: statistical inference in the case of an underlying exponential distribution. First of all we have to provide ourselves with a "Bayesian definition" of exponentiality. As has been shown in the previous section, in a Bayesian frame one cannot state the problem of inference as: components which fail independently according to an exponential distribution are put on test in order to estimate the unknown value of the failure rate lambda. This is a nonsense because:
1) If what we observed (failure times of the components in the sample) is independent of what is of interest to us (failure times of similar components which we will use in the future), then we learn nothing from the observation (Eq.(2.4)).
2) "Failure rate" has a well-defined physical meaning only if the orthodox definition of probability is accepted, which we are not permitted to do (peeling prickly pears...).
In view of the previous section, with little effort one can imagine that the way out is to extend the concept of exchangeability to non-binary random variables, and as a matter of fact de Finetti's theorem "... has been extended to an exchangeable sequence of arbitrary random variables X_1, X_2, ... by Hewitt and Savage. They show that one can construct a parameter theta such that X_1, X_2, ... are conditionally i.i.d. given theta, where theta itself has an appropriate probability distribution". This is not enough, because we want the component failure times not only to be conditionally i.i.d. but also to have a one-dimensional conditional density of the form

  f(t|lambda) = lambda exp(-lambda t)                                                                             (3.1)

For Eq.(3.1) to hold you have to judge the failure times to be exchangeable and such that, for any n, given the sum S_n of n failure times, your "probability assignment" is the same for all the n-tuples of failure times which add up to S_n. People interested in a deeper insight into this topic are referred to [5]. The brief discussion we have been entertaining is intended to be just a reminder of the fact that the classical definition of failure rate does not apply in a Bayesian setting, where lambda is a fictional parameter whose knowledge makes the failure times i.i.d. random variables as a consequence of your subjective judgement on what you learn from the observations. Up to now it could seem that Bayes statistics is there for making things more involved; that this is not so clearly appears in the case of the statistical analysis of field data. The need for reliability data banks arose for the purpose of inference on "reliable" components. These are expensive and fail rarely, so that one cannot afford putting a large number of them on test and waiting for a significant number of failures; operational data coming from the plants where the components of interest are installed must then be used (field data). Field data are gathered under non-homogeneous stopping rules. In one plant, for example, the observation lasted for a time T_1 (this is equivalent to a type I test of duration T_1, see Fundamentals of Reliability Theory, this volume); in another plant the observations terminated at T_2 different from T_1 (type I test with truncation time T_2); and a third plant stopped delivering data at the time of the k-th component failure (type II test, see Fundamentals of Reliability Theory, this volume).

The use of classical statistics for reliability purposes is impractical under the general sampling plan of field data [6]. In reliability applications of classical statistics, most of the time confidence intervals for parameters are needed. This is because reliability analyses are in general carried out for high (economical-environmental) risk systems and the analyst wants to be "really sure" that the values of the parameters do not exceed the thresholds which make the system failure probability comfortably low. For confidence intervals to be obtainable, the probability distribution of the parameter estimator must be known, and it depends on the sampling plan. In the case of inference on the parameter theta = 1/lambda of the exponential distribution, the situation for the very popular Maximum Likelihood Estimator is as follows [7]:
a) The estimator probability distribution is the handy, well-known chi-square only in the case of the type II test.
b) In the case of the type I test, the distribution is an involved combination of shifted chi-square distributions.
c) If the sampling plan is a mixture of type I tests of different durations, only the Laplace transform of the pdf of the estimator is available.
d) In the case of the general sampling plan of field data, nothing is known on the estimator probability distribution.
Furthermore, even if the less general sampling plan c) applied to field data and you were able to invert the Laplace transform (it has been there since 1963 [7]), the result would not be worth the effort. A classical confidence interval with confidence level alpha is the interval where the true value of the parameter will fall alpha% of the times in the long run. Nothing is guaranteed in the single case; a behaviour such as: alpha = 90% is selected and the true value of the unknown parameter is assumed to be the lower bound of the confidence interval at the 90% level (the "lowest mean life" resulting from the interval estimate), is not conservative, it is simply incoherent. Indeed, after observing a given set of failure times and calculating the confidence interval corresponding to the observed sample, the true value will be in the interval with either probability 1 or 0. If you acted as above, you would simply express your "degree of belief" of being in one of the favourable cases, but there is no room for degree of belief in orthodox statistics and it is unfair to let people think that the laws of objective probability endorse your behaviour. Inference on field data in the exponential case is coherent and much easier in a Bayesian frame. The prior on lambda is up to you and, if the stopping rules are non-informative as to lambda [8], the likelihood results to be

  L(lambda|D) = lambda^k exp(-lambda SUM_{i=1}^{n} t_i)                                                           (3.2)

where:
k is the number of observed failures;
n is the dimension of the sample;
t_i is the total operation time of component i, i = 1, ..., n.
The definition of non-informative stopping rules [8] is not reproduced here; the important issue is that the stopping rules of both the type I and the type II test are non-informative. A mixture of the two, i.e. the general sampling plan of field data, has non-informative stopping rules as well, and by (3.2) a posterior on lambda can easily be derived to be coherently used, in conjunction with a suitable utility function, for making decisions (and not for estimating lambda). So, it pays to be Bayesian.
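A minimal sketch of this last step: with the likelihood (3.2) and a conjugate gamma prior on lambda (a convenient choice, not mandated by the text), the posterior follows directly from Bayes' theorem, regardless of whether the individual plants stopped observing after a fixed time or after a fixed number of failures. The numbers below, including the prior parameters, are assumptions for illustration only.

# Field data pooled from several plants with non-homogeneous (but non-informative) stopping rules
failures = 3                                  # k: total number of observed failures
op_times = [8760.0, 4380.0, 12000.0]          # t_i: operating hours accumulated by each component
T = sum(op_times)

# Conjugate gamma prior on lambda with shape a0 and rate b0 (illustrative values)
a0, b0 = 0.5, 1000.0

# Posterior: gamma with shape a0 + k and rate b0 + T, since L(lambda|D) = lambda**k * exp(-lambda*T)
a_post, b_post = a0 + failures, b0 + T
post_mean = a_post / b_post                   # one convenient summary of the posterior on lambda
print(f"posterior on lambda: gamma(shape={a_post}, rate={b_post:.0f}); mean = {post_mean:.2e} per hour")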

4.

AN EXAMPLE OF APPLICATION

Again consider the example of the selection of the best design solution and suppose that two different design solutions are available. Let your posterior pdf's on the fictional parameters theta_i, i = 1, 2, be

  pi_i(theta_i | D_i) = [GAMMA(alpha_i + beta_i) / (GAMMA(alpha_i) GAMMA(beta_i))] theta_i^{alpha_i - 1} (1 - theta_i)^{beta_i - 1},    i = 1, 2        (4.1)

where GAMMA(.) is the gamma function

  GAMMA(t) = INT[0, infinity] x^{t-1} e^{-x} dx,        GAMMA(t) = (t - 1)!  for integer t's.

The class of distributions which have pdf's such as (4.1) is closed under sampling from the Bernoulli distribution. That is, if the prior has a pdf of the form (4.1), the posterior will have one too. This makes calculations handy and is the only reason why, for the sake of exemplification, we chose pi_i(theta_i | D_i) as defined by (4.1). By substituting (4.1) into (2.10) and by making use of the properties of the beta function, one easily gets:

  P(A_{N_i+1}^(i) = 1 | D_i) = beta_i / (alpha_i + beta_i),    i = 1, 2

Let alpha_1 = alpha_2 = 1, which corresponds to having observed no failures in the past, as clearly appears after inspection of (2.8) and (2.9) and comparison with (4.1). Let beta_2 > beta_1, that is, the second design solution is more reliable than the first one; let the latter be cheaper than the former. The impact in the case of plant failure is the same for both solutions. For taking a decision you have to define the utilities of the consequences (E_j, d_i), j = 0, 1; i = 1, 2.

Since any utility function is defined up to a linear transform [1, Chapter 4], you can arbitrarily assign the values of the utilities of two among the above consequences (provided that the assignment is coherent). So you can set

  u(E_0, d_2) = 0
  u(E_1, d_1) = 1

Suppose the severity of the impact of a plant failure is such that (E_0, d_1) is as undesirable as (E_0, d_2); i.e. having saved some money (d_1 taken in place of d_2) will not make you feel better in case of an accident. It then is

  u(E_0, d_1) = 0

For what concerns the utility of the status quo, note that it can be defined by comparison with some accepted societal risks such as aircraft crashes and others which involve consequences similar to those following from a plant accident (same number of fatalities and so on). Suppose then you have decided that it is acceptable to expose the public to a risk of an accident with probability q. This means that you have chosen

  u(d_0) = 1 - q

since if the plant reliability is smaller than 1 - q you will keep the status quo, while if it is higher you will prefer benefits with probability P(A_{N_i+1}^(i) = 1 | D_i) and accident with probability 1 - P(A_{N_i+1}^(i) = 1 | D_i) to no benefits with certainty; 1 - q is then by definition the utility of the status quo. As to u(E_1, d_2), for the sake of coherence you have to set

  u(E_1, d_2) = 1 - delta,      0 < delta < q

the meaning of delta will be discussed later on. We are now ready for making decisions, without any need of parameter estimation; the posteriors on the theta_i's have been used just for deriving P(A_{N_i+1}^(i) = 1 | D_i), i = 1, 2. The expected utilities for the possible decisions are

  u(d_0):   1 - q

  u(d_1):   beta_1 / (alpha_1 + beta_1)

  u(d_2):   (1 - delta) beta_2 / (alpha_2 + beta_2)

The decision to be taken is the one which yields the maximum expected utility. Suppose that it is u(d_i) > u(d_0) for both i = 1 and i = 2. You will take d_1 (higher risk with a cheaper investment) if it results

  u(d_1) > u(d_2)

i.e.

  beta_1 / (alpha_1 + beta_1) > (1 - delta) beta_2 / (alpha_2 + beta_2)                                           (4.2)

From (4.2) it follows that, for d_1 to be taken, it must be
  delta > (beta_2 - beta_1) / [beta_2 (1 + beta_1)]                                                               (4.3)

that is, delta can be understood as the minimal relative change of plant reliability which makes it worth spending the difference in cost between d_2 and d_1.
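To see (4.2)-(4.3) at work numerically, the fragment below evaluates the two expected utilities for illustrative values of beta_1, beta_2 and delta (all three numbers are assumptions, not taken from the text), with alpha_1 = alpha_2 = 1 as in the example.

alpha1 = alpha2 = 1.0        # no failures observed in the past
beta1, beta2 = 20.0, 40.0    # illustrative evidence: solution 2 looks more reliable
delta = 0.02                 # illustrative minimal reliability gain justifying the extra cost

u_d1 = beta1 / (alpha1 + beta1)                  # expected utility of the cheaper solution
u_d2 = (1.0 - delta) * beta2 / (alpha2 + beta2)  # expected utility of the more reliable one

threshold = (beta2 - beta1) / (beta2 * (1.0 + beta1))   # right-hand side of (4.3)
print(f"u(d1)={u_d1:.4f}  u(d2)={u_d2:.4f}  take {'d1' if u_d1 > u_d2 else 'd2'}")
print(f"(4.3): delta={delta} vs threshold={threshold:.4f}")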

REFERENCES
[1] D.V. Lindley: Making Decisions, J. Wiley & Sons, New York, 1985.
[2] B. de Finetti: Theory of Probability, a critical introductory treatment, Vol.2, J. Wiley & Sons, New York, 1975.
[3] M.H. de Groot: Modern aspects in probability and utility, in: Proc. of Int. School of Physics E. Fermi, Course CII: Accelerated Life Testing & Experts' Opinions in Reliability, C.A. Clarotti and D.V. Lindley (eds.), North-Holland, Amsterdam, in print.
[4] L. Piccinato: Predictive distributions & non-informative priors, in: Proc. of 7th Prague Conf. on Information Theory, Statistical Decision Functions and Random Processes, Prague, 1974.
[5] F. Spizzichino: Symmetry conditions on opinion assessment leading to time-transformed exponential models, in: Proc. of Int. School of Physics E. Fermi, Course CII: Accelerated Life Testing & Experts' Opinions in Reliability, C.A. Clarotti and D.V. Lindley (eds.), North-Holland, Amsterdam, in print.
[6] C.A. Clarotti, M. de Marinis, Manella: Field data as a correction of Mil-Handbook estimates, in: Reliability Data Collection and Use in Risk and Availability Assessment, H.J. Wingender (ed.), Springer Verlag, 1986, pp.177-184.
[7] D.J. Bartholomew: The sampling distribution of an estimate arising in life testing, Technometrics, Vol.5, No.3, August 1963, pp.361-374.
[8] R.E. Barlow, F. Proschan: Inference for the exponential distribution, in: Proc. of Int. School of Physics E. Fermi, Course XCIV: Theory of Reliability, R.E. Barlow and A. Serra (eds.), North-Holland, Amsterdam, 1986, pp.143-164.

COMPONENT EVENT DATA COLLECTION

A. Besi
Commission of the European Communities
Joint Research Centre - Ispra Establishment
21020 Ispra (Va) - Italy

ABSTRACT. This paper describes the basic features of a component failure-event data collection scheme, suitable for reliability evaluations, and the associated informatic structure, necessary for the organization of the information collected, its retrieval and processing. In addition, the general objectives of a computer-based component event data bank are briefly discussed. As an example of a data bank, reference is made to the classifications, the informatic structure and the on-line data processing programmes of the Component Event Data Bank, managed by the JRC.

1.

INTRODUCTION: What is a component event data collection

Performing a component event data collection in a plant means carrying out an organized collection of information on the operational history of some specific plant components, the engineering and operating characteristics of which are well identified. The operational history of a component is expressed by:
- a report of the events which occurred to the component during its operational life, such as failure events, repair actions, maintenance operations;
- the record of the operating hours (or the number of operational cycles, see footnote), the number of demands to intervene, during each year of its operational life.

Footnote: "operating time" for a component operating continuously at constant rate for long periods and "number of operating cycles" for a cyclic device are usually the most important figures to be recorded, where an estimate of the reliability of the item is the purpose of the data collection. For a piece of equipment subject to high stresses during the start phase, the number of operating cycles (equal to the number of starts) should be recorded. For the engine of a car, the distance travelled is the most meaningful figure to express its mechanical wear. For equipment subject to bad environmental conditions, the calendar time can be the most influential factor.

A computer-based component event data bank consists of:
- a data collection scheme, i.e. a classification/codification system and a set of forms to be filled in, designed to collect and organize all of the above-mentioned items;
- a Data Base Management System (DBMS), which manages data within the computer environment, making information retrieval and data processing possible.
In this lecture, only component event data collections of interest for reliability/availability studies of engineering systems are dealt with. As a consequence, the above-mentioned data processing is aimed at computing, through statistical treatment, reliability parameters. As an example of a data collection scheme and a bank structure conceived for reliability purposes, we will refer to the Component Event Data Bank (CEDB) of the European Reliability Data System (ERDS) /1/, the bank managed by the Joint Research Centre of the Commission of the European Communities.

2.

SOME EXAMPLES OF COMPONENT EVENT DATA COLLECTION SYSTEMS, DESIGNED FOR RELIABILITY PURPOSES

Most operating data collection systems are aimed at improving economy of operation and/or safety in commercial plants. Examples of such systems are the CEDB, collecting data in commercial Nuclear Power Plants (NPPs) and in fossil-fuelled electric power stations, and the OREDA Project /2/, collecting data in offshore oil drilling and production platforms. Some other data collection systems have to collect data in pilot or demonstration plants. In such cases, the data collection activity is to be considered relevant to the experimental programme associated with the plant being tested. As a consequence, the data collection system often had to be designed on an ad hoc basis for a specific experimental facility type. Examples of this second category of data collection systems are the "CEDB Fast Breeder Reactors" /3/ and the "Centralized Reliability Data Organization CREDO" /4/. The CEDB-FBR, developed and managed by ENEA, the Italian Committee for Research and Development of nuclear energy and alternative energies, is an extension and an adaptation of the CEDB for data collection in fast breeder reactors, still in an experimental and demonstration phase. The CREDO is a bank collecting data from fast breeder reactors and test loops in the USA and Japan; its scheme is also being implemented for data collection in a test facility for tritium handling (fusion reactor research) at Los Alamos. Some differences between the two categories of data collection systems, specifically between CEDB and CEDB-FBR, are:
- CEDB is conceived to collect information on a high number of components, operating in plants fairly standardized as to layout, system and component functions and related design. CEDB-FBR is used in plants working in an intermittent mode, with brief operating periods and many shut-downs for maintenance and incidents. In this scheme, new components such as control rods and drive mechanisms, still in a development phase, have been classified. Also the operation report forms have been modified (in the part related to the identification of nominal operating characteristics).
- In the CEDB it is the functional position of the component, rather than the component itself, which is monitored; the component which after a failure is definitively removed is considered as scrapped, even if, after a full overhaul, it is afterwards installed in another position in the plant (where it is considered as a new component). In the CEDB-FBR the physical component itself is monitored; all the events, operating times and demands are related to the physical component, irrespective of the working positions it had. This has also an influence on the reliability parameter computation.
- The event "maintenance operation" is not recorded in the CEDB; in CEDB-FBR it is reported with a classification similar to that of the failure event (maintenance time, parts replaced, etc.).

3.

OBJECTIVES OF DATA COLLECTION (see Chapter 1 of / 5 / for further details)

The objectives of most data collection systems in current use are:
a) To improve the reliability of plant engineering systems by reducing the failure rates of their components (by design improvements, by reducing failures induced by the staff in operation and in maintenance, by optimizing maintenance and operational test strategies, ...).
b) To improve the availability of the systems by reducing down-times. The component down-time results essentially from its unavailability during the repair action following a failure (i.e. during the active repair time and the waiting time) and from its periodic withdrawal from service for planned maintenance/inspection/tests. The repair time can be reduced by improving some design features and the capability of the maintenance staff; the waiting time can be reduced by organizational improvements.
c) To provide generic and plant-specific failure rates, to be used in reliability/availability studies (a minimal numerical sketch of this computation follows below).
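Turning the collected records into plant-specific parameters amounts to simple ratios once operating hours, demands and repair durations have been accumulated. The sketch below uses invented numbers purely to show the arithmetic; it is not an output of any particular data bank.

# Hypothetical totals accumulated from the annual operating reports of one component family
operating_hours = 250_000.0      # summed over all monitored components
failures_in_operation = 7
demands = 1_200
failures_on_demand = 2
repair_hours = [14.0, 6.0, 30.0, 9.0, 22.0, 11.0, 5.0]   # one entry per repair

failure_rate = failures_in_operation / operating_hours        # failures per operating hour
p_fail_on_demand = failures_on_demand / demands               # failure-on-demand probability
mttr = sum(repair_hours) / len(repair_hours)                  # mean time to repair
unavailability = failure_rate * mttr / (1.0 + failure_rate * mttr)   # steady-state estimate

print(f"lambda = {failure_rate:.2e}/h, p_d = {p_fail_on_demand:.2e}, "
      f"MTTR = {mttr:.1f} h, unavailability = {unavailability:.2e}")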

4.

DATA COLLECTION SCHEME; basic definitions, classifications and coding, data collection forms

A classification scheme suitable for the monitoring of specific components in a plant (or in a group of plants) consists of classifications (and associated codes) allowing the identification of the engineering characteristics of the component, its application in the plant, the type and attributes of the failure/event occurring.
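As an illustration of how such a classification scheme can be organized in software, the sketch below defines hypothetical record types loosely mirroring the items discussed in this section (component identification, failure event, annual operating report). The field names are the author's categories rendered as invented identifiers, not the actual CEDB coding.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ComponentRecord:                 # component identification and characteristics
    plant: str
    component_code: str                # functional position in the plant
    family: str                        # e.g. "VALV"
    engineering_characteristics: dict = field(default_factory=dict)
    operating_characteristics: dict = field(default_factory=dict)

@dataclass
class FailureEvent:                    # one filled-in failure report form
    component_code: str
    date: str
    failure_mode: str                  # coded failure mode
    parts_failed: List[str] = field(default_factory=list)
    unavailability_hours: float = 0.0

@dataclass
class AnnualOperatingReport:           # operating hours and demands, recorded once a year
    component_code: str
    year: int
    operating_hours: float
    demands: int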

Reference is made, in general, to the CEDB definitions (see Appendix) and classifications /1/, in connection with the four report forms (Figs. 2, 3, 4, 5). Sometimes we also refer to EuReDatA recommendations /5, 6/. The CEDB classifications and those recommended by EuReDatA /6/ are usually in good agreement. We recall that the exchange of data between users having the same (or a compatible) coding is greatly facilitated. In fact, a harmonization of classifications and coding is a primary objective of EuReDatA activity, aimed at promoting data exchange and pooling of data between organizations.

4.1 Component identification

a) Component family; it identifies a category of components having similar engineering characteristics. The CEDB has classified 44 component families (Table I).
b) Component boundary definition; the following definition is recommended by EuReDatA /5/: "The component boundary shall encompass all parts and associated mechanisms generally specified when buying, supplying or manufacturing the component." According to the CEDB definition of the component "VALVE", all protection and trip devices dedicated to the item (usually installed on the component and supplied by the manufacturer) are included in the boundary; the control devices and the actuator are not included. Experience shows that differences in defining the boundary of a component (i.e. of the material which can contribute to its failure) are a major cause of variation in quoted reliability parameters from one source to another.
c) Component piece-part list; a piece-part reference list for each component family has been set up to enable the failure reporter to single out the part(s) of the component involved in a failure and to better characterize the ensuing repair action. The piece-part list adopted for a valve by CEDB is presented in Table II.
d) Component engineering and operating characteristics; it is the minimum set of attributes required to assess the reliability of a component. An example of a generic description of a component family, according to EuReDatA recommendations /6/, is shown in Fig. 1, which refers to the family VALVE. Both /6/ and /1/ identify some common descriptors (i.e. valid for all component families) suitable to define some engineering features of the component, its operational requirements, and its external and internal environmental characteristics on the one hand, and family-specific descriptors on the other hand. According to /6/, important common descriptors are the following:
- type of industry/installation, an important piece of information on the area of activity (agriculture, industry, services, ...) in which the component is used (/6/, pages 24 and 54);


71 TABLE I AMPL ANCT BATC BATT BLOW BOIL CKBR CLTW CLUT COMP CP CT DRYE ELHE ENGI EXCH FILT FUSE GENE INCO INST INSU MOP PIPE PUMP RECT RSTR SAVA STGE SUPP SWIT SWTC TANK TRAN TRSD TUGE TURB VALV WIRE Component family reference classifit Amplifier Annunciator Modul/Alarm Battery Charger Battery Blower/Fan/Ventilator Boiler Circuit Breaker Cooling Tower Clutch Compressor Capacitor Air, Gas Dryer/Dehumidifier Electrical Heater Internal Combustion Engine Heat Exchanger Filter Fuse Electrical Generator Instrumentation-Controllers Instrumentation-Field Insulator Motor-Pump Unit Piping/Fitting Pump Rectifier Resistor Safety/Relief Valve Steam Generator Pipe-Support Switchgear Switch Accumulator/tank Transformer Transducer Turbine-Generator Set Turbine Valve (except safety valve) Electrical Conductor, Wire, Cable

Source: CEDB Handbook

PER 1187

72 TABLE II Piecepart list for a valve classified by CEDB

Valve (VALV) Piece part list 1 28 29 30 2 90X B 3 13 13 4 15 82X


17

16 81X 32 '.*) 90A 90E

3ocy BodyBonnet connection Body seat Disk/Ball/Plug/Wedge Bonnet Sealing (*) Stem Stuffing box Body trunnion (ball valves 1 Trunnion bearings ( " " ) Shaft (butterfly valves) Shaft bearings ( " " ) Hinge pin (check valves) Spring ( " ) Protection devices Absorber (check valves) Instrumentation/Monitors/Recorders Cooling system components Pipe connection steir. body sealing Packing

Source: CEDB Handbook

PER 1187

73
- 01 - Type

Mechanical valves (VALV)

02 Functlon Application 03 - Actuation


r 04-Size(SZ)

(Nominal diameter) Capacity Performance 05 - Design Pressure (PR) 06 Design Temperature (TE) Design Related Materialsr-07-Body Material (MA) -08-Seat Material (MA) 09 - Disc Material (MA) Construction features 10 - Body Construction Type (MP) 11 Seat Type (CO)
1

SealingH

r12 L

- Valve externally (SA; SB; SC)

13 -Valve Internally (SA; SB; SC) 14 - Safety Class/Standards

VALV15 - Process Pressure (PR) Process Related -16 - Process Temperature (TE) 17-Medium Handled (MH) 18-Type of Industry (El) 19- Vibrations (EV) 20 - (Environmental) Temperature (ET) Use/Application Related 21 - Radiation (ER) Environment Related 22 - Type of Installation (EL) 23 - Position Relative to Sea-level (EA) 24 - Climate (EC) 25 - Humidity (EH) 26 - (Environmental) Influences (EE)
L

27 - (Environmental) Pressure (EP) 28 - Maintenance Related (MS) 29 - Duty Related (M<B)

Fig. 1. Hierarchical diagram of the attributes defining the component VALVE according to the EuReDatA Guide /6/. Source: EuReDatA Guide /6/ PER 1187

74 - maintenance system (/6/, page 2 4 ) , giving an indication of the regime of monitoring and periodic checking; - operation mode or duty-related attributes (/6/, page 2 7 ) , giving an indication of the type of stresses to which the component is subjected; - codes from Existing Standards (/6/, pages 28 and 5 5 ) , giving ' indication of the basic design requirements and the quality level of the component. Other descriptors are unique to the component family: for a valve, for instance, those related to the type, the application/function for which it has been designed, the type of actuator (Table III, from / 6 / ) . e) Component mode of operation or duty; it identifies the operational requirements of the components in a specific plant system. A component may be called upon to perform a required function - in a continuous way, as a circulation pump operating in a coolant loop (its operating time is recorded), and/or - in operating cycles, as an on-off regulating valve or as a circulation pump, which is periodically called into service for a certain time (the number of openings of the regulating valve and the number of starts of the pumps are equivalent to the number of operating cycles and are recorded). Other components are permanently maintained in a stand-by condition (e.g. "active" protection devices in a process plant) and intervene only upon demand: the number of demands are recorded. f) Engineering system to which a component pertains; an important part of a data collection scheme designed for data collection in plants of the same type (e.g. NPP) is the "reference system classification". The specification of the engineering system in which the component operates gives an important contribution to a full characterization of its application/use. The CEDB reference system classification covers commercial Light Water Reactors (PWR and B W R ) , CANDU reactors and gas-cooled reactors (Magnox and AGR) and conventional electric power stations (only the engineering systems of the steam cycle). Failure description Component failure; is defined as the termination or the degradation of the ability of a component to perform a required function. It is a component malfunction which requires some repair. A physical impairment or a "functional" unavailability (i.e. lack of performance because of the absence of a proper input or support function) or a shut-down of the component due to conditions external to the process are not considered as a failure (see Appendix for further details) . Failure mode; is the effect by which a component failure is observed. Failure mode types are correlated with the component operation mode. The failure modes are subdivided into two general

4.2 a)

b)

75
TABLE III Descriptors unique to mechanical valves, suggested by the EuReDatA Guide //
01: Type

Category

Code

Bl al
Butterfly Check N O C Check, swing Check, lift Cylinder (piston & ports) Diaphragm Gate (sluice, wedge, split wedge) Globe N.O.C Globe, single seat Globe, single seat, cage trim Globe, double seat Needle Plug Poppe: Sleeve Other Category 02: Function/Application

10
20 30 31 32 40 50 60 70 71 72 73 80 90 A0 B0 ZZ

Bieed Bypass Control/regulation Dump Exhaust Isolation/stop Metering Non-return/check Pilot Pressure reducing Relief/safety Selector (multiport valve) Vent Other Category 03: Actuation

10 20 30 40 50 60 70 80 90 A0 B0 CO DO ZZ

Differential pressure/spring Electric motor/servo Float Hydraulic Pneumatic Mechanical transmission Solenoio Thermal Manua Other

10 20 30 40 50 60 70 80 90 ZZ

Source: EuReDatA Guide / 6 /

PER

1 1 8 7

76 classes: - "(not demanded) change of operating conditions (or state)" for components asked to accomplish a function during a certain time; - "demanded change of stage not achieved, or not correctly achieved" for components which are called to operate on demand. A set of reliability parameters (failure rate and failure-ondemand probability, repair rate) corresponds to each component failure mode. The classification adopted by CEDB (and recommended also by EuReDatA /6/) is presented in Table IV. Among the codes related to failure on demand, codes A, B, C and F apply to components such as valves, breakers, actuators. Codes D, E and F apply to rotating components such as pumps, electric motors, diesel generators, etc. As far as the second class of failure modes is concerned, "change of operating conditions (or state) not required", two categories have been singled out, namely: - degree of suddenness; - degree of seriousness. The first category describes whether the unavailability of the component is contemporary to the detection of the failure or of the abnormality or whether the unavailability of the component could be. deferred. The second category refers mainly to the mode of change of condition/state, i.e. to its gravity (no output, outside specification) or to its peculiarity (operation without request, erratic output). c) Failure cause; failure descriptors; failure detection; parts failed; plant status at the time of failure; effect of failure on the system to which the component pertains, on other systerns/components, on plant operation; corrective and administrative actions taken; plant operation restrictions; these expressions are selfexplanatory; some of them are defined in the Appendix. The repair action is mainly characterized by the repair time, the parts involved in the failure, the corrective action taken (component replacement, overhaul, modification, etc.). Yearly operating time and number of demands

4.3

This information item on the operational life of the monitored component is - directly available if the component is equipped with a counter; - or is to be evaluated on the basis of the diagram unit state versus calendar time and of an assumed operating factor of the component referred to unit state; - or is recorded in the computerized central data acquisition system. 4.4 Data collection forms

Coded formats are used by the CEDB (Figs. 2,3,4,5); a literal description (i.e. a free text) is foreseen only for the failure-event report

TABLE IV On demand

Failure modes classified by CEDB

A 3 C D

Wont'open Wont' close Neither opens nor closes/does not switch Fails to start Fails to stop Fails to reach design specification

On ooeration suddenness A 3 C Sudden failure Incipient failure Not defined

. degree of seriousness A C D No output Outside specifications Operation without request Erratic output (false, oscillt, instability, drift, etc.)

Source: CEDB Handbook

PER 1187

78
EUROPEAN COMPONENT RELIABILITY EVENTS DATA DATA SYSTEM BANK-CEDB -EROS

COMPONENT REPORT FORM


s S v.

"
MAT 1 3 J PLAUT t S

'I

fiero* a 13

COMPON is if

IDEM

COOE

lu C C *

ra 1 > 20 21 72 1

A SSES CL I M SERVICE ' 0 fJoJS I A TE P m U/TH S d S *UJ**-\aASfA UH A CL SS WTH. CLiSW A UJT\cU%4fUTH . OA SI tUTH-lo SsiUTHCL SS A ji i?ijj|j< |^^7|> | p | < j ^ | 4 j ^ 4 u'|*i ju *$\XiSi S H 33 3I 37 MO< < U J U * i3 I * u o 7o rt JJ * 75 71I77 * COMPtMEKT EROS P^JEHnFTEPiSTSTEM

" CODES/STAMDAROS/SAPETr

uh*

upr

z
c

f f
UV MIL TM. MAMUFACWKTR MODEL 47 4 4* I

# *

HiMJFACT. COOE

MANUFACTURER SERTA t UMBER 3 0 I 3 I 32 SJ 54 33 3< 3 7 5 St to ti

oirr er I co*er*ucr/o* | |

JI J71JJ J4 J 3 M 77

41 12 13 44 43

* *

f j >4 * i * # 7 M f

70 77 7J 174 7S|7I|7T7#

in

"

il

JffjTJ J4 SS M 37 JJ J f 40 47 47 4J 44 4 5 < 4 7 4 # [ 4 5 0 J7 3 7 5 3 | 3 4 U ! < 9 7 M 9 1 I II U M j [<3 M 7 | * j l 7 0 | n | 7 2 7 J |?4 |5^7717> | | M

or

MO.

r* EH 61 HEE RIM CHARACTER.ai EMtPiEERiMA CHARACJLZM.02 43 t J 44 43 EM+tMEERtH CHARACTER. 03

JI Q i j 0 4

I 1 II

b* 3S\3t

J7

1 41

47 4ft4l|3lsi 37 J 54 S3 U 3 7 M 3 0 o f c j

CHA**C7I.04

III
|

1 1 I I I 1 1 Mil
| vmEwmm A M CTER.O A CH CHARACTER.O I EH+iHEERIHt

tilts

M r i t i l t l w 71 72 Tl 7 < | 7 t l 7 7 | 7 7 f M

rl

EHtlMEERIH* A R CTERS A CH CHARACTERS

CHARACTER. Ot A R CTER. Ot A CH

rm+mcERim A R CTER07 A CH CHARACTER 07

i T l j j b i |js|j|j7|j#|jko l4iNd*j|44|*j 44U7WI4Isc |si|r?4w|34L5|37|5 1st UefiU^Lj|<4l|ef|f7|M|<>|7c>jn|77b.

riiiiiniiiiirriiiriiiinrriiiittftit
I* *
\1\ 13 3* EMVMCERim CHARAaEP.O }} M J7 | j 40 41 4^4j 44J4J 44 47 44*

ERUMEER1H* CHARACTER. 4J SO

EHVMEERiHO

I'll
I*
lo
c

3,

1
EMUHEERtHt CHARACTER. 14

II

1 "r

CHARACTER. It 34 33 M 7 M 3 f fT

CHARACTER. O 4 ts 1 7 t It
7171 77

73 TJ 74 73 7|77 71

to

EHtlHEERIM* CHARACTER. LI i i L i i . '"

ENttHEERIH CHARACTER. IS

EHUhCERtHO CHARACTER. M ?j\u\Ts\n\ii}jt\n\t

f i i i ***|<|[ si jjJsfjIs^ jjls4f Ijrls^lj jol^jtTj 13 l u k i l t t | 7 | M | |TP| TI\J2

tf

EMlHETRina CHARACTER. 17 40 47

EHtlHEERiHO CHARACTER M

I J l )J JJ J4 IJ J a 17 3

i)

44 43 44 47 4 4 SO 3!

I
JM J

EMvmrRiM* CHARACTER. It 34 S3 It

ENUHEERtHC CHARAKTER. tt t2 J t* 5

sr

5 3ff to

7 44 r i 70

74 73 TC 77l7 7

PREPA RED ; PREPARED Rf:

Fig. 2

79
EUROPEAN RELIABILITY DATA SYSTEM -EROS COMPONENT EVENT DATA BANK CEO

OPERATION REPORT FORM


^ 8 5 8
AEA CODE

a AT.
t
j

PLANT 3 4 5 i

Tl i
? t

S
o cc
REACTOR IJ COMPON. IS It

toe NI . cooe
23 24 33 3*

OllU

C C

lg

ufoc

3 ] \ DA TE J O Jf 32 23 34 3S 3t

13 ti

17 II rt TO ai

4*p ^ f

or.

no

ra.
l

# | OPER. \rrpf\

STO. | A LTiT. | ctw^tff]/fgu.|Wfxsnar w & s . |<my,ir.| rgww| w w a j wtmr. | cogeos.) 3T|3J|S4 333<57 SS\3l\u <f |JlljjiUs * ITIW 01 70 ? 73 7* 7Z 717 7* 7>(*|

17 M\3ti{o\tl\*l{*jU*Ui\iM\*7\**\*9\lo\st

J7ljjtj40J<l|u 4 j | AtUj | 4T"* 4>30af|Ml5Jj*53al37|j< St\to\tt\t2\t3 <\to\tt \t2t3 tttS I M U T IM U (707173 I7J74 \n\7t\T7\7e 7 1*J leiJ IM M M ld TP71177 |7J 74 |7S7e77l7

i i h 11 11 rrm
OPERATI CHA PACT*. 07 ORt PATINO CHARACTER. Off 5 et 7|M tff70l7l l7l7J74J73|7ffP7l7e 79\

0 J7I Old

OPt RATINO CHARACTER. 03

CMARACTCR. Of

fKoUlUUj

44 Us 44744 (190(51 u U j l 5 * ! i J | S f f [ j 7 M

[IMI

OPTRA OPERA IN A R CTER Off A CH CHARACTER, 13 I CHARACTER. ti CHARACTER. tO J7> 40|4 l<J|4J |44J4i|4<l|<7U< 4 J303J |j3J3J IJ433 jjtt | j 7 | V Uo|ffJ | | ] 3 17'(<> 70 I7 frj ~73 | |3 \?\7\ 7

OPERATING

1 II ITTI 1 imn
OPERATIN* CHARACTER. M 471*1 4ff|jo5; l u 3JIJ4|35lse|97l3 3 to ti

ifUo!! i*leja4

1[ l i i u l i i i II

TF

#
J7 J4 Jt D 0

OPERATING CHARACTER. 13

OPERATIN CHARACTER. 5 17 t3 4 ta ft 7 f

ATINS 707f \3 73\x\73 17j77 7*4 CHARACTER. *

iOltl tf

4J 44 ts it

1 I 1 1 1 11
30 b f

111
70 Ti 71 73 74 7 3 l 7 t l 7 7 79 7 l * j |

II 1 1

OPERATI NO CHARACTER. 17 Iff 40 il tf |4J 44 43 441 47

OPERATIN* CHARACTER. 1$ 33tt

OPERATIN* CHARACTER. 3 tO tl tl 13 t* t3 tl tT\tt

OPERA TI N CHARACTER. 20 If 70 7f |7J>pJ J74 7S |7|77J7 \tO\

OATS

PREP RED: A

PREPAREO Or;

FORM CEOB a

Fig. 3

80
Fig. 4. European Reliability Data System (ERDS), Component Event Data Bank (CEDB): failure report form (form layout not reproduced).
Fig. 5. European Reliability Data System (ERDS), Component Event Data Bank (CEDB): annual operating report form (form layout not reproduced).

form. The first two forms (concerning the component design-related characteristics and use/application-related characteristics, respectively) are filled in only once. A failure report form is filled in for each failure which occurs. The form with the component hours of operation and cumulative number of demands is filled in once a year.
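The organisation of these four report types can be pictured with a minimal data-model sketch (Python). The class and field names below are illustrative assumptions only, not the actual CEDB record layout:

from dataclasses import dataclass

# Hypothetical, simplified record types mirroring the four CEDB report forms.

@dataclass
class ComponentReport:            # filled in once per component (design data)
    component_id: str
    engineering_characteristics: dict   # e.g. {"diameter_mm": 150}

@dataclass
class OperationReport:            # filled in once per component (use/application data)
    component_id: str
    operating_characteristics: dict     # e.g. {"temperature_C": 280}

@dataclass
class FailureReport:              # one report per failure event
    component_id: str
    failure_date: str
    failure_mode: str
    unavailability_hours: float

@dataclass
class AnnualOperatingReport:      # one report per component per year
    component_id: str
    year: int
    operating_hours: float
    cycles_or_demands: int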

5. TYPES OF DATA COLLECTION; REQUIREMENTS FOR DATA COLLECTORS (ref. to /5/, Chapter 2)

5.1 Two basic data collection types exist:

a) On-going data collection: the failure reporting is made as a failure occurs. A structure for data collection has been established within the plant. The possibility to collect information from current activities and to participate in discussions with those in charge of plant operation and maintenance should enable the reporter to produce high-quality reports.

b) Historic data collection: information is collected through a search of the technical records of the plant. The quality of the information derived is strongly dependent on the quality of the records that are available. The completeness and credibility of these records can be improved by interviewing the personnel who entered the information and by making cross-checks of different files (e.g. maintenance files, operation log-books, spare parts inventory files).

5.2 Personnel responsible for the completion of the event report form should:

- have a good knowledge of the plant, its operation and maintenance regime;
- have a good knowledge of the specific component the event report is referring to;
- have the opportunity to participate in discussions with those in charge of plant operation and maintenance;
- have a thorough knowledge of the data collection scheme;
- have attended courses, and participated in seminars and conferences, dealing with reliability data collection and reliability engineering.

6. DATA RETRIEVAL AND PROCESSING (Fig. 6)

6.1 The CEDB, like all large computerized component event data banks, allows, through its enquiry procedure:

- the selection of a set of specified components and of the related set of failures of a specified type;
- the statistical treatment of this set of failures (and of the associated repairs);
- the retrieval and the display of all information items stored in the bank, such as the characteristics of a component and the report of each failure it suffered.
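As an illustration of this enquiry logic, the following sketch mimics, in Python, the kind of stepwise selection of a component set and of its failures detailed in Section 6.2, followed by a simple statistical treatment. The attribute names, code values and the small in-memory data set are hypothetical and do not represent the CEDB query language:

# Hypothetical component and failure records (attribute names are illustrative only).
components = [
    {"id": "V-101", "class": "valve", "reactor": "PWR", "system": "condensate and feed water"},
    {"id": "P-007", "class": "pump",  "reactor": "BWR", "system": "residual heat removal"},
]
failures = [
    {"component_id": "V-101", "mode": "incipient", "repair_hours": 6.0},
    {"component_id": "V-101", "mode": "sudden",    "repair_hours": 14.0},
]

# Step 1: stepwise refinement of the component set (analogue of a SELECTION command).
selected = [c for c in components if c["class"] == "valve"]
selected = [c for c in selected if c["reactor"] == "PWR"]
selected = [c for c in selected if c["system"] == "condensate and feed water"]

# Step 2: retrieve the failures of the selected components, then refine by failure mode.
ids = {c["id"] for c in selected}
sel_failures = [f for f in failures if f["component_id"] in ids and f["mode"] == "incipient"]

# Step 3: a simple statistical treatment of the selected failure set.
if sel_failures:
    mean_repair = sum(f["repair_hours"] for f in sel_failures) / len(sel_failures)
    print(len(sel_failures), "failures selected; mean repair time =", mean_repair, "h")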

(Diagram not reproduced: national utilities and data suppliers send their data, on magnetic tapes or CEDB report forms, for manual or automatic transcoding or direct storage into the CEDB on the JRC internal network, from which users perform data search and retrieval.)
Fig. 6. Flow of information from the data suppliers, through the CEDB, to the users.

Source: S. Balestreri, "The European Component Data Bank and its Uses", Ispra Course on "Reliability Data Bases", Ispra, 21-25, 1985.


6.2 Tables V, VI and VII give lists of the component, operation and failure attributes, with an indication for each attribute of whether it is queryable or not, i.e. whether a selection can be made on the basis of this attribute. Through stepwise refinements made by the "SELECTION" command, the user can select, for example, a set of valves, installed in PWRs, pertaining to the "condensate and feed water system", of the globe-valve type, of a specified diameter range, operating in a certain temperature range; of all the failures which occurred to this set of valves, the user can then select those corresponding to the failure mode in operation-degree of suddenness "incipient". To this last set of failures selected, the user can apply all the statistical application programmes included in the on-line enquiry procedure.

6.3 CEDB data processing methods and computer programmes

Through the CEDB interactive enquiry procedure, the analyst can estimate reliability parameters for a specific component category in which he is interested. The on-line statistical processing includes at present:

a. Point and interval estimation (for complete and censored samples) of:

a.1 constant reliability parameters (time-independent failure rate in operation, constant unavailability = constant failure rate on demand, time-independent repair rate), by /7,8,9,10/:
- the Bayesian parametric approach (with priors: beta, uniform, log-uniform, log-normal, histogram),
- the classical approach (maximum likelihood, confidence interval);

a.2 non-constant reliability parameters (time-dependent failure rate in operation, non-constant failure rate on demand, time-dependent repair rate), by the Bayesian non-parametric approach (with prior identified by a sample of times to failure or by a failure-time analytical distribution) /11,12/.

b. Tests of hypothesis on the law of the failure and repair time distribution:
- exponential (for complete and censored samples),
- Weibull, lognormal and gamma distribution, increasing failure rate, decreasing failure rate (only for complete samples).
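For the classical branch of point a.1, a minimal sketch of such an estimation is given below (Python with SciPy). The numbers and the two-sided 90% confidence level are illustrative assumptions; the chi-square interval shown is the standard one for a constant failure rate estimated from n failures observed in a total (time-terminated) operating time T:

from scipy.stats import chi2

# Illustrative data: n failures observed over a pooled operating time T (hours).
n_failures = 7
T_hours = 2.4e5

# Maximum-likelihood point estimate of the constant failure rate (failures per hour).
lam_hat = n_failures / T_hours

# Two-sided 90% chi-square confidence interval (time-terminated observation).
conf_level = 0.90
alpha = 1.0 - conf_level
lam_lower = chi2.ppf(alpha / 2, 2 * n_failures) / (2 * T_hours)
lam_upper = chi2.ppf(1 - alpha / 2, 2 * (n_failures + 1)) / (2 * T_hours)

print(f"lambda_hat = {lam_hat:.2e} /h "
      f"({conf_level:.0%} interval: {lam_lower:.2e} to {lam_upper:.2e} /h)")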

Effective graphical tools can give on-line the representation of an observed time-dependent failure rate; of the prior and the posterior distributions (Bayesian parametric approach); of the cumulative failure distribution function F of the observed, the prior and the posterior sample (Bayesian non-parametric approach), etc. In refining a selected sample of failures for a statistical analysis, the analyst can retrieve and review each event, to identify and delete from the sample those failures which appear not to be independent.
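One of these displays, the empirical cumulative failure distribution of an observed sample of times to failure, can be illustrated with the short sketch below (Python; the sample values are invented for the example):

import numpy as np

# Invented sample of observed times to failure (hours).
times = np.array([310.0, 95.0, 870.0, 460.0, 120.0, 640.0])

# Empirical cumulative failure distribution F(t) at the ordered sample points.
t_sorted = np.sort(times)
F_emp = np.arange(1, len(t_sorted) + 1) / len(t_sorted)

for t, F in zip(t_sorted, F_emp):
    print(f"t = {t:6.0f} h   F(t) = {F:.2f}")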

TABLE V

a) Attributes of the entity COMPONENT (from the CEDB Handbook, Volume V, Part 1, 4.1.3 "List of attributes")

ATTRIBUTE      EXTENDED ATTRIBUTE
CLA            CODES/STANDARD/SAFETY CLASS
COMPCODE       REACTOR COMPONENT CODE
CONDA          DATE OF CONSTRUCTION
CYCLES         ANNUAL PROGR. NUMBER OF CYCLES
ENCHCOD        ENGIN. CHARACTERISTIC
ENCHNU         ENGIN. NUMERICAL PARAMETER
FIRSTCYCLES    FIRST PROGR. CYCLES
FIRSTHOURS     FIRST PROGR. NUM. OPER. HOURS
HOURS          ANNUAL PROGR. OPERATING HOURS
IACOD          IAEA REACTOR CODE
ID             IDENTIFIER
MAN            MANUFACTURER CODE
MODEL          MANUFACTURER MODEL
NA             NATION
REAPO          REACTOR POWER RANGE
REATY          REACTOR TYPE
SCDA           DATE OF SCRAPPING
SERNUM         MANUFACTURER SERIAL NUMBER
SEDA           IN SERVICE DATE
SY             ERDS SYSTEM
YEARCYCLES     YEAR OF CYCLE PROGR.
YEARHOURS      YEAR OF OPERATING HOURS
YSTAR          STARTING YEAR

(The QUEST, TAB, CLASS, TYPE and INDEX columns of the original table are not legibly reproduced.)
TABLE VI

b) Attributes of the entity OPERATION (from the CEDB Handbook, Volume V, Part 1, 4.1.3 "List of attributes")

ATTRIBUTE      EXTENDED ATTRIBUTE
AL             ALTITUDE
CLI            CLIMATE
CORRO          CORROSION
               HUMIDITY
IND            TYPE OF INDUSTRY
INS            INSTALLATION
MAI            MAINTENANCE TYPE
MODE           MODE OF OPERATION
OPCHCOD        OPER. CHARACTERISTIC
OPCHNU         OPER. NUMERICAL PARAMETER
OPDA           OPERATION DATE
PR             PRESSURE
RA             RADIATION
TE             TEMPERATURE
               VIBRATION

(The codes for HUMIDITY and VIBRATION and the QUEST, TAB, CLASS, TYPE and INDEX columns are not legibly reproduced.)
TABLE VII

c) Attributes of the entity FAILURE (from the CEDB Handbook, Volume V, Part 1, 4.1.3 "List of attributes")

ATTRIBUTE      EXTENDED ATTRIBUTE
ACAD           ADMINISTRATIVE ACTION TAKEN
ACCORRE        CORRECTIVE ACTION TAKEN
CA             CAUSE
CYCLES         CYCLES AT FAILURE
DES            DESCRIPTOR
DET            DETECTION
EFOT           EFFECT ON OTHER SYSTEMS OPERATION
EFREA          EFFECT ON REACTOR
EFSY           EFFECT ON SYSTEM
FADA           FAILURE DATE
HOURS          HOURS AT FAILURE
MODEDEM        MODE ON DEMAND
MODEOPDEG      MODE OPER. DEGREE
MODEOPSU       MODE OPER. SUDDENNESS
PA             PART FAILED
REASTAT        REACTOR STATUS
REL            RELATED FAILURES
REMARKS        REMARKS
REPAIRTIME     REPAIR TIME
UDA            DATE OF UNAVAILABILITY
UNAVTIME       UNAVAILABILITY TIME

(One additional attribute relating to startup restrictions, together with the QUEST, TAB, CLASS, TYPE and INDEX columns, is not legibly reproduced.)
APPENDIX

Basic definitions adopted by the CEDB (reference to the report forms "component, operation, failure, annual operating" and to Vol. II of the Handbook /1/).

Actions taken (after a failure event)
The intervention and the technical corrective action on the failed component and all the official and documentation follow-up that had an administrative and operational impact (in terms of repair schedule, plant shut-down duration, failure reporting to technical bodies, ...).

Availability
The characteristic of a component expressed by the probability that it will be operational at a randomly selected future instant in time.

Component
A structure or equipment considered as a set of mechanical and/or electrical parts constituting a well-identified element inside the plant. It has well-defined performance characteristics inside the system to which it belongs and can be removed and replaced inside the plant.

Erratic output (failure mode in operation-degree of seriousness)
It refers to all those events in which the component function output is false or unstable, etc. It refers mainly to measurement devices and electronic components.

(External) environment
The local conditions in which the component is required to operate; these conditions are described in general terms, such as standard NPP environment, indoors/outdoors, open/sheltered, mechanically ventilated areas, naturally ventilated areas, etc.

Failure
The termination or the degradation of the ability of a component to perform a required function. It includes:
1) Failure of the component or its part(s), during normal operation or testing, which causes its contemporary unavailability and total loss of its main function, either due to a spontaneous failure or because of the automatic intervention (shut-down) of the component protective system.
2) Failure of part(s) of the component which permits its deferred unavailability for corrective actions; failure discovered during normal operation (surveillance, monitoring) or testing by observation of operation outside specifications.
3) Manual shut-down of the component (in the presence of operation outside specification) to avoid more serious damage to the item. The component is immediately unavailable.

4) Operation without request of the component due to non-required intervention of part(s) or protective device(s) associated with the component. The component is immediately unavailable. The reason for this change of state remains unknown (spurious operation).
5) Failure of part(s) of the component which requires repair, discovered during inspection or planned/preventive/corrective maintenance. Only unexpected anomalies considered as compromising the duty of the component on its return to operation (which may cause immediate or deferred unavailability; total or partial loss of function(s)) are to be recorded. End-of-life replacements are not considered as failures.

The following outages are not considered as failures:
- unavailability due to preventive or planned maintenance/testing;
- a physical impairment or a shut-down of the item owing to external (process) conditions;
- a "functional" unavailability, i.e. a lack of performance because of the absence of a proper input or support function.

Two types of failure may be distinguished:
- unannounced failure: the failure is not detected until the next test/inspection/maintenance or request;
- announced failure: the failure is detected at the instant of occurrence.

Failure cause
The original causes of the component failure, i.e. the circumstances during design, manufacturing, assembly, installation or use that led to failure.

Failure detection
The way and means by which a complete or an incipient failure has been detected or was revealed.

Failure descriptors
Attributes of the failure event which can be inferred from the knowledge of the failure mechanism and the physical phenomena causing or accompanying the event.

Failure mode
The effect by which a failure is observed on the failed component; it describes the variation of the state of the component that results from the failure. Two generic classes of failure mode are considered:
- failure mode in operation, i.e. a change of operating condition (state) of the component not demanded;
- failure mode on demand, i.e. a demanded change of state of the component not achieved (or not correctly achieved).

Failure rate in operation (failure/10 hours)
The conditional probability that a component, required to operate in stated conditions for a period of time, will suffer a failure (corresponding to a defined failure mode) in the time interval t, t+1, given it is working at time t.

Failure-on-demand probability (failure/10 demands)
The conditional probability of failure to operate upon demand, or demand probability, for those components that are required to start, change state, or function at the time of the demand, given they intervened successfully at the previous demand. The demand probability incorporates contributions from failure at demand, failure before demand, as well as failure to continue operation for a sufficient period of time for a successful response to the need. When pertinent, the demand data can be associated with cyclic data or can be interpreted as a general unavailability.

Incipient (failure mode in operation-degree of suddenness)
A category of failure modes in operation corresponding to a partial failure, such that the unavailability of the component, i.e. the necessary corrective action, can be deferred. It corresponds to an abnormality in the operating condition of the component; one or more of its fundamental functions are compromised, at possibly different levels of gravity, but not ceased. The component function(s) may be compromised by any combination of reduced, increased or erratic output.

Maintenance
The set of actions, planned or unplanned, to ensure that the levels of reliability and effectiveness of all plant structures, systems and components remain in accordance with design assumptions and intents, and that the safety status of the plant is not adversely affected after commencement of operation.

Maintenance program
The set of prescriptions covering all preventive and remedial measures, both administrative and technical, necessary to perform maintenance activities satisfactorily. It includes service, overhaul, repair and replacement of parts, testing, calibration, inspection and modification to the structures, systems and components.

MDBF (mean demands between failures)
It is the mean number of successful demands between consecutive failures, recorded for the set of components considered.

MTBF (mean time between failures)
It is the mean of the operating times between failures, recorded for the set of components considered.

MTTR (mean time to repair)
It is the mean of the repair times, recorded for the set of components considered.
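A minimal numerical illustration of these three estimators, assuming a small invented record of operating times between failures, demands between failures and repair times (Python; the data and variable names are purely hypothetical):

# Invented observations for a set of similar components.
operating_times_between_failures_h = [1200.0, 800.0, 2600.0, 1500.0]   # hours
demands_between_failures = [240, 310, 180]                              # successful demands
repair_times_h = [5.0, 12.0, 8.0, 3.0]                                  # hours

mtbf = sum(operating_times_between_failures_h) / len(operating_times_between_failures_h)
mdbf = sum(demands_between_failures) / len(demands_between_failures)
mttr = sum(repair_times_h) / len(repair_times_h)

print(f"MTBF = {mtbf:.0f} h, MDBF = {mdbf:.0f} demands, MTTR = {mttr:.1f} h")

Under an exponential (constant failure rate) model, the observed MTBF is the reciprocal of the estimated failure rate in operation.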

Mode of operation
An attribute characterizing the operational regime of the component. The following operational modes are considered:
- continuous;
- intermittent/cyclic;
- activated from stand-by conditions.

No output (failure mode in operation-degree of seriousness)
It refers to all those events in which the main function of the component, in the particular operation mode, is totally lost, either due to a "spontaneous" failure or because of the automatic intervention (shut-down) of the equipment protection system. For instance, a shaft blockage of a pump or the automatic withdrawal from operation of a pump due to loss of its cooling device.

Operation without request (failure mode in operation-degree of seriousness)
It refers to spurious operation of switches, valves, etc.

Operating time
The period of time during which a component is capable of performing a required function and is really in use.

Output outside specification (failure mode in operation-degree of seriousness)
It refers to all those events in which the main function (i.e. pumping) or some of the auxiliary functions (i.e. leakage, structural stability, etc.) of the component in the particular operation mode are outside the prescribed specifications without being totally lost. In this way it is possible to distinguish between failures independent of the particular application (the "no output" category) and failures which are, on the contrary, dependent on particular specifications. For instance, the same leakage in two similar pumps, one dealing with active medium and the other not, will affect the pumps' availability differently. According to /6/, this classification also includes failures of components found during preventive maintenance.

Parts failed
The parts of the failed component that originated the failure or have been involved in the failure itself.

Passive maintenance time
The time during which no work is as yet performed on the component withdrawn from service (isolation, access, shielding, decontamination, etc.).

Related failures
All the failures that are linked to the failure under consideration from the point of view of parallel detection, common-cause induced failures, ...

Reliability
The characteristic of a component expressed by the probability that it will perform a required mission under stated conditions during a stated mission time.

Repair
Any remedial measures necessary to ensure that a failed component is capable of performing its function(s) as intended.

Repair rate
It is the conditional probability that a repair action being performed on a component failed (in a specified failure mode) is completed in the time interval t, t+1 (minutes), given it was not completed at time t.

Repair time
The time required to repair a component, including: time to identify the failed part; time for disassembly of the parts or the components; time to transfer to workshop (when necessary); waiting time for in-plant spare-parts supply; time to repair, adjust, replace the failed parts; time to re-assemble; check-out time to perform functional tests after the repair completion.

Sudden (failure mode in operation-degree of suddenness)
A category of failure modes in operation corresponding to a sudden and complete failure, i.e. the unavailability of the component is contemporary to the detection of the failure; one or more fundamental functions of the component are suddenly lost. Immediate corrective action is required in order to return the item to a satisfactory condition.

System
A set of mechanical, electric and electronic components, unequivocally identified as:
- it accomplishes a well-defined function inside the plant;
- it accomplishes more than one function, but at different plant operating modes, i.e. cold shut-down, emergency, normal operation, etc.

System grouping
A set of nuclear power plant systems, characterized by common:
- function which can be framed in a more general logic function, i.e. protection and control system, engineering safety features, etc.;
- functions directly related to the accomplishment of general plant operating services, i.e. reactor auxiliary systems, etc.

Test
Any attempt to determine operability under controlled conditions. A test implies a perturbation that simulates a situation that could cause a specific response, for the purpose of verifying that the operational capability of a system/component has not been lost.

Test frequency
The number of tests of the same type per unit time interval.

Test interval
The elapsed time between the initiation of subsequent identical tests; the inverse of the test frequency.

Test procedure
An exhaustive description of the conditions and of all the subsequent actions needed to correctly perform the test.

Testing and maintenance
Information on test and maintenance programmes and procedures for the set of components considered is given, when available.

Unavailability time
The calendar time during which the component is not available for operation. It includes: undetected failure time, preparation time, repair time, passive maintenance time, and logistic delays.

REFERENCES

1. "The Component Event Data Handbook", CEC, JRC-Ispra, Technical Note No. I.05.CI.84.66, PER 855/11/84, 1984.
2. "OREDA - Off-shore Reliability Data Handbook", published by the participants to the OREDA Project, distributed by VERITEC in cooperation with PennWell Books, 1984.
3. G. Cambi (University of Bologna), R. Righini (ENEA), P.G. Sola and G. Zappellini (NIER), "Component Event Data Bank - Fast Breeder Reactors", Proc. of the EuReDatA Conference, Heidelberg 1986, Springer Verlag.
4. "Centralized Reliability Data Collection (CREDO) - Guide for Completing Data Input Forms", Oak Ridge National Laboratory, Oak Ridge, Tennessee.
5. "Guide to Reliability Data Collection and Management", EuReDatA Project Report No. 3, edited by B. Stevens, distributed by CEC, JRC-Ispra, 1986.
6. "Reference Classification concerning Component's Reliability", EuReDatA Report No. 1, edited by T. Luisi, distributed by CEC, JRC-Ispra, 1983.
7. A.G. Colombo, R.J. Jaarsma, "Bayesian estimation of constant failure rate and unavailability", IEEE Transactions on Reliability, R-31 (1982) 84-86.
8. A.G. Colombo, D. Costantini, "Ground-hypothesis for beta distribution as Bayesian prior", IEEE Transactions on Reliability, R-29 (1980) 17-21.
9. A.G. Colombo, R.J. Jaarsma, "BIDIPES - A conversational computer programme for point and interval estimation of constant failure rate, repair rate and unavailability", Report EUR 7645 EN, 1982.
10. A.G. Colombo, R.J. Jaarsma, "BAESNUM - A conversational computer programme for point and interval estimation of time-dependent failure rate by a Bayesian non-parametric method", Report EUR 10183 EN, 1985.
11. A.G. Colombo, R.J. Jaarsma, "Bayes non-parametric estimation of time-dependent failure rate", IEEE Transactions on Reliability, R-29 (1980) 109-112.
12. A.G. Colombo, R.J. Jaarsma, "BANPES - A conversational computer programme for point and interval estimation of time-dependent failure rate by a Bayesian non-parametric method", Report EUR 10183 EN, 1985.

THE ORGANISATION AND USE OF ABNORMAL OCCURRENCE DATA

H.W. Kalfsbeek Commission of the European Communities Joint Research Centre - Ispra Establishment 21020 Ispra (VA) Italy

ABSTRACT. This paper demonstrates how operational experience data can be exploited for safety/reliability assessments of (hazardous) industrial installations. To comply with this objective specific requirements are needed with respect to the organisation of the raw data, with consequences for the data collection scheme and computerised storage and retrieval capabilities. This is illustrated by means of a description of the set-up and the analysis possibilities of the Abnormal Occurrences Reporting System (AORS) of the Commission of the European Communities. This is an incident data bank on safety related events in nuclear power plants, and has been designed and realised specifically as a tool for safety assessment and feedback of safety related operating experience.

LIST OF ABBREVIATIONS

ANS/ENS   AMERICAN NUCLEAR SOCIETY/EUROPEAN NUCLEAR SOCIETY
AORS      ABNORMAL OCCURRENCES REPORTING SYSTEM
ASP       ACCIDENT SEQUENCE PRECURSOR
BWR       BOILING WATER REACTOR
CEA       COMMISSARIAT A L'ENERGIE ATOMIQUE
CEC       COMMISSION OF THE EUROPEAN COMMUNITIES
EdF       ELECTRICITE DE FRANCE
ENEL      ENTE NAZIONALE PER L'ENERGIA ELETTRICA
ERDS      EUROPEAN RELIABILITY DATA SYSTEM
IAEA      INTERNATIONAL ATOMIC ENERGY AGENCY
INPO      INSTITUTE OF NUCLEAR POWER OPERATIONS
IRS       INCIDENT REPORTING SYSTEM
JRC       JOINT RESEARCH CENTRE

LER       LICENSEE EVENT REPORT
MCA       MULTIPLE CORRESPONDENCE ANALYSIS
NEA       NUCLEAR ENERGY AGENCY (of OECD)
NPP       NUCLEAR POWER PLANT
NRC       NUCLEAR REGULATORY COMMISSION
OECD      ORGANISATION FOR ECONOMIC COOPERATION AND DEVELOPMENT
PDF       PROBABILITY DENSITY FUNCTION
PRA       PROBABILISTIC RISK ASSESSMENT
PSA       PROBABILISTIC SAFETY ASSESSMENT
PWR       PRESSURIZED WATER REACTOR
SKI       SWEDISH NUCLEAR POWER INSPECTORATE
UNPEDE    UNION INTERNATIONALE DES PRODUCTEURS ET DISTRIBUTEURS D'ENERGIE ELECTRIQUE
USERS     UNPEDE SIGNIFICANT EVENT REPORTING SYSTEM

1. GENERAL INTRODUCTION

In the course of the development of industry and technology, the experience gained during the related activities has always been an important input for design and operating practices. In many instances, however, the use of what had been experienced in earlier days has not been structured, and therefore far from optimal. This is particularly due to insufficient documentation and imperfect communication means and practices. With the present growing importance of large computerised data banks and related data transmission facilities, the possibilities are offered to improve this situation, especially in conjunction with the application of reliability engineering techniques, through which the formalisms and structures are created for exploiting past experience optimally.

The needs for various information types and analysis tools can be differentiated according to the three different parties involved in any (hazardous) industrial activity. The operator/owner of an installation (the utility) is primarily interested in information related to optimisation of the operation and minimisation of the costs. Also the designer/manufacturer/vendor is interested in such data, though for different reasons. Finally, the licensing authority puts by its nature emphasis on data related to the safe operation and protection of the public and the environment.

The common denominator of the various types of operating experience data is reliability data, more specifically failure data on the component, system and plant level, and related observation data on the various

levels, that is operating times, number of demands/cycles, numbers of similar equipment, etc. From this raw data the reliability parameters can be derived that are needed for the various types of probabilistic analyses. Most familiar examples are component failure rates, failure-on-demand probabilities, repair rates, common cause failure parameters, etc. The production of this type of data involves statistical processing of the raw data, that, moreover, has to meet specific requirements for this type of use (see the papers of Besi and Keller presented during this course).

The present paper concentrates on a different use of operating experience data, namely more qualitatively directed inferences. The raw data for these types of use differ from the input data needed for the statistically oriented elaborations, though overlaps in both directions exist. The first type of operating experience data can be referred to as "component event data", whereas the latter type of data is termed "abnormal occurrence data", the subject of this document. To point out roughly the difference between these two types of data, we can say that component event data consist of each and every failure event of each component of interest, whereas abnormal occurrence data in general consist of major events, where often more than one component failed. Also human failures are included. Moreover, these events must have relevance for the safe operation of the installation, and/or impact the reliability of the installation or a substantial part of it, leading to production losses.

To date in the nuclear field, various international data collection systems exist, giving rise to ever growing data banks. Among the systems of interest for the operators we mention USERS (UNPEDE Significant Event Reporting System) of UNPEDE, an organisation of European electricity generating companies /1/; operator and manufacturer needs are served by the INPO system; specific licensing authority interests are satisfied by the NEA-OECD and IAEA incident reporting systems (IRS) /2/; a broad public, including also research institutes, is addressed by the ERDS (European Reliability Data System) of the Commission of the European Communities /3/. One specific data bank from the latter system, namely the Abnormal Occurrences Reporting System (AORS), forms the main theme of this paper /4,5,6,7,8/. It must be emphasized here that all these data bank systems have different objectives, different characteristics and different uses; they complement each other.

Outside the nuclear domain, various international data systems exist or are being developed (chemical process industry, off-shore activities, aircraft industry, army, etc.). However, most of these have a private character, and therefore restricted access possibilities.

In almost every country that runs a peaceful nuclear programme, the licensing authority concerned prescribes mandatorily the types of abnormal occurrences that have to be reported. In this way the licensing body has at its disposal the information needed for its control and licensing duties. However, we see nowadays an increasing use of reliability techniques in these activities, and it appears that existing reporting systems are not always tuned to this trend, since they usually originated from a purely deterministic approach. For example, it is often not required to report single train failures of safety systems. This type of data, however, is extremely relevant within a reliability engineering approach.

Nevertheless, the AORS of the CEC, a conglomerate of various national abnormal occurrence reporting systems, homogenised in language (English) and information content (AORS reference reporting format), offers unique possibilities for a variety of - qualitative - analyses. Statistical types of analyses on the system (train) level are possible, but in view of the difficulties signalled above, should be carried out carefully.

The advantage of an operating experience data bank like the AORS is found mainly in two domains:
- supporting the achievement of completeness of system models (fault modes, fault interactions, human intervention, dependent failures, etc.) in reliability and safety analyses;
- giving insights into interrelationships/dependencies of (groups of) incident parameters.

The AORS reference reporting format is particularly tuned to the needs of the reliability/safety analyst. In this respect it takes a unique place amongst the various international operating experience data systems in the nuclear field. The present paper describes in some detail the set-up of the AORS, and gives examples of the various analysis possibilities offered by such a data collection scheme. The present subjects can be transferred to the non-nuclear field, changing, of course, the relevant details (specific systems, components, initiating events, event categories, etc.), thus contributing to the solution of the problem of incorporating operating experience into decision making in design and operation through the use of reliability engineering techniques.

In Section 2 an introduction to the AORS is given. Section 3 deals with some selected analysis methods that have been developed for the data collection of the AORS.

Some concluding remarks are given in Section 4.

2. THE ABNORMAL OCCURRENCES REPORTING SYSTEM

2.1. Introduction

The AORS collects information on safety related abnormal events from nuclear power plants in Europe and the U.S.A. The data input for the AORS is coming from different national reporting organisations, in different languages and with different reporting schemes and criteria, covering in general all the safety related operational events which occurred in the participating plants and which by law have to be reported to the competent safety authorities in the respective countries. Therefore, the AORS data bank has been developed so that the information is merged together, homogenized both in language and content (see in Appendix I the reference format being utilized). For the establishment of this AORS data bank the following principles have been adopted /9/:
- conservation of the original information;
- transcoding, applying coding guidelines and adopting a quality assurance programme;
- the possibility of inquiring homogenized event data from the national data files.

2.2. Aims of the system

The principal aim of the AORS is the collection, assessment and dissemination of information concerning occurrences (events or failures) that have or could have consequences for the safe operation of NPPs; the system intends:
- to homogenize national data in one unique data bank in order to facilitate data search and analysis;
- to be a support for safety assessment both by utilities and licensing authorities;
- to feed back relevant operational experience to the NPPs;
- to facilitate the data exchange between the various national reporting systems and utilities.

2.3. Structure of the AORS

The structure of the AORS (see Figure 1) is such that two types of input data for the AORS data bank are considered:

Fig. 1. Structure of AORS. (Diagram not reproduced: data tapes originating from national data banks (NRC LER, SKI, ENEL and EDF events) undergo semi-automatic transcoding and checking; data on reporting formats coming directly from NPPs are transcoded onto the AORS reporting format and checked; both feed the AORS data bank, which users access through on-line interrogation and analysis programmes.)
- first of all, the data coming from the national data files. At the moment 5 national reporting systems have reported their data by means of a data tape. These data suppliers are NRC (U.S.A.), SKI (Sweden), EdF and CEA (France) and ENEL (Italy). For each of these national systems a national event data bank has been installed, which can be interrogated one by one;
- secondly, data on reporting formats coming directly from the NPPs. This is the case for the Doel units I, II, III, IV in Belgium and Dodewaard in the Netherlands.

An agreement for the data exchange between the data suppliers assures the confidentiality of the reported information. Only the data suppliers have direct access to the information in the AORS. Information requested by non-data suppliers needs the agreement of the data suppliers concerned. The access to the various data banks is only possible for authorized persons using a multi-level password system.

All these input data are analysed in order to identify the sequence pattern and then transcoded onto the AORS reporting format. After being checked, the data are inserted into the AORS data bank. The output data are available to the data suppliers either for on-line interrogation or to be used in analysis programmes. A fully detailed description of the AORS is available in the AORS Handbook /10/.

2.4. Content of the AORS data bank

This section provides a summary of the data contained within the AORS, updated to February 1987. The countries which up to now have supplied data are listed below, along with the period during which each country supplied the data:

Belgium            1974 to 1984
France             1973 to 1984
Italy              1977 to 1985
The Netherlands    1981 to 1985
Sweden             1975 to 1984
U.S.A.             1969 to 1981

An overview of the stored event reports in the AORS data bank is given in Table I, according to both country and type of reactor. Another classification of the AORS reports, as a function of the observed and total number of reactor-years (nuclear operating experience), is summarized in Table II. It can be seen that about 780 reactor-years of experience from 134 reactor units are recorded. This is equivalent to 20% of the world-wide experience gained to date; more specifically, 24% of all existing experience of the reactor types present in

TABLE I - Contents of the AORS data bank.

The table gives the number of stored event reports according to country and data supplier (Belgium: Doel, Tihange; France: CEA, EDF; Germany: not yet available; Italy: Caorso; Netherlands: GKN; Sweden: SKI; U.S.A.: NRC) and according to type of reactor (PWR, BWR, GCR, HWGCR, FBR, HTGR); the individual entries are not legibly reproduced.

Total number of stored event reports in the AORS Data Bank (status at 1.2.87): 26.680, corresponding to a total of 45.481 occurrences.

TABLE II - Operating experience in the AORS.

AORS CONTENTS - 1 FEBRUARY 1987
CLASSIFICATION FOR REACTOR TYPE AND COUNTRY

TYPE         NUMBER OF       NR. OF REACTOR-YEARS    TOTAL OPER.
             AORS REPORTS    OBSERVED IN AORS        EXPERIENCE
PWR              13.462        422 (26%)                1.619
BWR              11.826        285 (32%)                  895
GCR                 929         47 ( 7%)                  670
HTGR                325         10 (90%)                   11
HWGCR                77          8 (24%)                   33
FBR                  63          4 (11%)                   36
OTHERS                -          -                        625
TOTALS           26.682        777 (20%)                3.889
                                (24% OF TYPES FOLLOWED)

COUNTRY      NUMBER OF       NR. OF REACTOR-YEARS    TOTAL OPER.
             AORS REPORTS    OBSERVED IN AORS        EXPERIENCE
BELGIUM             143         15 (31%)                   48
FRANCE            2.448        142 (40%)                  357
ITALY               988          8 (11%)                   74
NETHERLANDS          31          3 (10%)                   31
SWEDEN            1.909         69 (72%)                   96
USA              21.163        540 (49%)                1.094
OTHERS                -          -                      2.189
TOTALS           26.682        777 (20%)                3.889
                                (46% OF COUNTRIES FOLLOWED)

the AORS is covered, which is equivalent to 46% of the total experience in the "AORS countries".

2.5. The use of the AORS

In the AORS the events are screened for causal relationships between the facts reported; the causal sequences of failures and/or human interventions so identified are stored and are retrievable. This is in contrast to the usual (national and international) practice of abnormal event/incident reporting, where the cause-consequence relationships of the elements making up the event can only be read, in the best case, from the narrative descriptions, and where component faults are coded and stored as independent entities, even if they concern a complex sequence. In the AORS the power for retrieving relevant information from the data bank is thus increased considerably. But at the same time also the large amount of detail that can be specified allows the analyst to reduce the size of the material selected in a search, thus enabling efficient data retrievals. The use of such a very extensive compilation of operational experience becomes, therefore, very effective.

Other existing data systems have a restricted number of event reports stored, e.g. only so-called significant event reports, to avoid these problems of accessing a large amount of information: it is clear that in such a way potentially important information may be disregarded. For instance, small "non-significant" events, due to their high recurrence, may constitute a real problem in the end, or may lead, through an unforeseen combination of additional circumstances, to a "significant" event. Sharing each other's experience, also on a "lower" level of technical detail, to the extent offered by the AORS, may result in practical benefits, as will be illustrated in the following examples.

2.5.1. The problem of human failures and common cause failures is of continuous concern for the safe operation of nuclear power plants. In particular in these fields one can benefit from the experience gained elsewhere, even if it regards completely different reactor designs, operating regimes, etc., on the condition that this experience is reported at a detailed level. The ways in which human behaviour and common cause mechanisms can lead to serious operational problems are innumerable, but that their initiations have common origins can only be detected if a sufficiently detailed and large number of events is examined.
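The storage principle described at the beginning of Section 2.5, an event decomposed into causally linked occurrences, can be pictured with a minimal sketch (Python). The field names, link representation and sample values are illustrative assumptions only, not the actual AORS schema:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Occurrence:
    # One element of an event sequence (component failure, human action, system state).
    number: int                       # progressive number within the event
    system: str                       # code of the "failed system" (location)
    component: str                    # failed equipment code
    cause: str
    effect: str
    caused_by: Optional[int] = None   # link to the occurrence that caused this one

@dataclass
class EventReport:
    plant: str
    date: str
    occurrences: List[Occurrence] = field(default_factory=list)

# Illustrative two-step causal chain: an operator error causing a pump trip.
event = EventReport(
    plant="PLANT-X", date="1984-05-12",
    occurrences=[
        Occurrence(1, "electrical supply", "breaker", "human error", "loss of bus"),
        Occurrence(2, "feedwater", "pump", "loss of power", "pump trip", caused_by=1),
    ],
)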

2.5.2. Another example where useful operational feedback from "outside" could be obtained from the AORS concerns specific problems related to ageing of the installation. As the AORS contains historical records of nuclear operational experience dating back to 1969, it is readily available for temporal trend analysis. Data from plants older than the plant population in the AORS users' country, but of a related design, can provide a useful insight into suspected incipient ageing problems of younger plants.

2.5.3. A third example where the AORS could be beneficial regards specific problems encountered by a utility that are of minor importance from a safety point of view, but nevertheless disrupt the productivity. For this type of events no help can be found in the current major international incident reporting systems (e.g. IRS). The AORS, on the contrary, can be used, since also "minor" events are stored.

A final observation regards the cost-benefit aspects of the AORS. It should be recalled here that the Joint Research Centre of the Commission of the European Communities is a public service. All member states contribute financially; the products are available to them according to the rules established by an agreement between the member states and the Commission. This holds also for the AORS. Actually, data suppliers willing to exploit the data system pay for the connection through a data transmission network with the central computer where the system is installed, and some fee for computer time use. Also, specific requests for retrievals can be put forward to the Commission to be carried out by the AORS technical staff. The bulk of the costs arising from treating the input data flows and maintaining the system is for the moment carried by the European Commission in the frame of the JRC R&D programme. No specific additional effort is foreseen for the data supplier at the input side of the system.

Access to the AORS information has been requested by many organizations, involving a variety of subjects. Some examples are:
- Germany, September 1983. Data about failures of pumps, valves, control rods and control drives for a common mode analysis;
- Netherlands, March 1984. Direct access to the NRC LER data files;
- Germany, April 1984. Data for statistical analysis of tube ruptures in PWR;
- Belgium, September 1984. Information about fracture of the clamping bolts securing the upper grip springs of some fuel elements in similar plants;
- Italy, April 1985. Direct access to the AORS data bank;
- Italy, June 1985. Information about stress-corrosion cracking

problems;
- Finland, January 1986. Request about events in French, German, Dutch and Belgian NPPs;
- Spain, January 1987. Information about the times-to-restoration upon loss of off-site power events in the AORS.

3. SELECTION OF ANALYSIS METHODS DEVELOPED FOR AORS DATA

3.1. Introduction

In this section the analysis potential offered by a data collection scheme as realised in the AORS is demonstrated. For that purpose some selected methods, not all, will be outlined. Because of the confidentiality of the data, it is not possible to present results of the analyses. However, all methods and programs presented below have been tested thoroughly, and are ready for use and application by authorised parties in the AORS.

Globally, the use and the analysis potential of the AORS cover the following domains:
a) optimization of operating experience feedback for operation and safety assessment. The analysis tools pertaining to this field are based on the extensive retrieval capabilities, also on the failure sequence level, existing in the AORS;
b) surveillance of safety related plant performance. Analysis tools applicable here include statistical trend and pattern recognition techniques, time trending techniques, etc.;
c) supply of evidence for advanced (probabilistic) safety/reliability analysis techniques and applications. The analysis programs relevant in this domain are specialised search algorithms, e.g. for dependent failures, human factors, etc., with related data output evaluation exercises.

In the following sub-sections some analysis tools developed for the AORS data bank will be presented, where the items of Section 3.2 correspond to the first of the above-mentioned analysis domains, those of Section 3.3 to the second, and those of Section 3.4 to the last one.

3.2. Sequence analysis

The methods of analysis presented in this section exploit the feature that renders the AORS unique amongst the international incident data banks on safety related operating experience, namely the construction and storage of a sequence of causally and/or sequentially related occurrences for each event processed.

In the AORS an occurrence is not defined strictly. It may be a single component (piece part) failure, or a faulty human intervention/omission, or a degraded system state. The decomposition of each reported event into a chain of such occurrences is done manually before storage. The items that are stored for each occurrence are listed in the AORS reporting format, sheet 3 (Appendix I); they include the failed equipment, its location ("failed system"), the cause(s) and effect(s) of the occurrence, the safety systems or standby redundant systems called into operation due to the occurrence, the immediate action(s) taken, the way(s) of discovery and the way(s) in which the occurrence is related (linked) to the other occurrences of the event sequence. For each occurrence a description is stored stating in plain text what the occurrence represents. The automatically extracted keywords from these texts can be used together with the above coded information items for retrieval purposes.

3.2.1. Sequence histogramming. The purpose of this method is to produce an overview of the frequency of reporting of the various types of sequences recorded in the AORS. This serves to direct the analyst's attention towards, e.g., frequently occurring combinations of occurrences, or to specific occurrences involving certain systems or components, intersystem propagation of failure paths, etc., that he wants to investigate in more detail.

The total number of sequences presently stored in the AORS amounts to approximately 27,000, yielding about 45,500 occurrences. Given the 8 coded items per occurrence recorded, and the large number of possible code values to be entered (see AORS Handbook, Volume II), it is obvious that very few occurrence sequences will exist that have all codings identical. For practical reasons it is therefore necessary to reduce the number of coded items to be taken into account for characterizing the sequence types to be histogrammed.

In the AORS application programs library, containing utilities for analysis application development as well as concrete applications, all programmed in NATURAL (a 4th generation programming language /11/), two procedures exist for sequence histogramming, SORTSEQA and SORTSEQS. The SORTSEQA procedure processes all recorded sequences, each sequence represented by a string of system-component code value pairs ordered by the progressive numbers given to each occurrence within a

sequence. All other coded items are neglected, as well as any causal or sequential relationships between the occurrences within a sequence. The program SORTSEQS histograms non-branched causal chains of occurrences, also restricting the specification of each occurrence to the system and component code only. Both procedures generate two types of histograms: first one that is sorted by system and component codes, followed by another sorted in decreasing order of frequency of observation. For both procedures it is possible to specify a minimum frequency of reporting to be taken into account. The output of SORTSEQA (highly simplified sequence representation) shows that over 11,000 different types of sequences are stored in the AORS. SORTSEQS reveals that slightly more than 10,000 different types of non-branching causal occurrence chains have been reported.

3.2.2. Sequence search methods. When it is clear which types of occurrence sequences are of interest to study in more detail, all event reports featuring these sequences have to be retrieved from the AORS data bank. For developing application programs capable of performing this task, optionally followed by other types of processing of the retrieved reports, such as (partial) print-offs, trend and pattern analyses (see Section 3.3), etc., three utilities are available in the AORS NATURAL program library, named SEQAFIND, SEQSFIND and SEQ-FIND.

The program SEQAFIND operates on the system and component level only; it is not possible to specify occurrence links in the search criteria (cf. SORTSEQA in Section 3.2.1). The user may specify ranges of system and component code values to be searched for; a user-defined boolean expression determines the sequence types containing these code values that are to be retrieved. The output of the SORTSEQA procedure is indispensable for exploiting SEQAFIND, since it is the input of the search program for retrieving the specified sequence types.

The utility SEQSFIND is used to develop search programs involving causally related, non-branched chains of occurrences. As with SEQAFIND, this program is closely related to the corresponding histogramming procedure, and the input and output requirements and facilities are identical: all sequence types, specified by means of system-component codes in a boolean expression, that are to be retrieved are therefore automatically of the non-branched causal chain type.

Finally, the program SEQ-FIND may be used to develop applications where it is of interest that somewhere within the sequences to be retrieved a specific pair (or triplet, etc.) of occurrences exists, linked in a user-defined way, and with system-, component-, cause-, effect-code values, etc., that may be specified arbitrarily, e.g. also in the form of code value ranges for the various items. The basic utility

available is designed to retrieve events featuring some user-defined pair of causally related occurrences.

3.2.3. Sequence combining methods. This type of analysis assists in obtaining an as complete as feasible overview of all possible ways in which a specific failure event may emerge, exploiting to a maximum extent all information stored in the AORS. This is of interest when the AORS is consulted for tracing back the reasons and circumstances (possibly) responsible for a given type of operational problem, e.g. in the process of defining preventive measures for avoiding re-occurrence of this problem in the future (AORS for operational experience feedback). Another application of this type of analysis may be found in the case that the AORS is consulted by PSA/PRA specialists or reliability engineers performing probabilistic assessments of (safe) plant operation. In order to achieve as complete as feasible system and failure propagation models, incorporating events that happened in reality and therefore should be included, it is important to study all failure paths that have been observed to (possibly) lead to a given top-event, e.g. loss of system function of an engineered safety feature (AORS for PSA/PRA support). Exploiting the AORS with this analysis type, it is assured that the preventive measures and system models, usually based on engineering judgement and experience, also incorporate what can be learned from the operational experience.

Sequence combining methods identify on a generic level from the data bank all causal sequences leading to a given occurrence. The underlying motivation is that also things happening in plants other than the one under study may yield valuable insights into the circumstances and mechanisms leading to the problem treated. This holds particularly for human behaviour and dependent failure effects.

The NATURAL program library for the AORS contains two utilities for developing applications performing sequence combination analysis. They are named PRA-o and PSA-o. Both programs have the same structure; the difference is found in the criteria for retrieving relevant similar occurrence sequences. First the occurrence type of interest is specified (the basic occurrence), e.g. a top-event for a fault tree analysis, an initiating event type, etc. All AORS reports featuring this occurrence type are then retrieved and printed, as the first step of the multi-stage search procedure. From this set of sequences all the occurrences are retrieved that are causing the basic occurrences. This set of occurrences is then used for the second retrieval step, identifying all other sequences that

3.2.3. Sequence combining methods. This type of analysis assists in obtaining an as complete as feasible overview of all possible ways in which a specific failure event may emerge, exploiting to a maximum extent all information stored in the AORS. This is of interest when the AORS is consulted for tracing back the reasons and circumstances (possibly) responsible for a given type of operational problem, e.g. in the process of defining preventive measures for avoiding re-occurrence of this problem in the future (AORS for operational experience feedback). Another application of this type of analysis may be found in the case that the AORS is consulted by PSA/PRA specialists or reliability engineers performing probabilistic assessments of (safe) plant operation. In order to achieve as complete as feasible system- and failure propagation models, incorporating events that happened in reality and therefore should be included, it is important to study all failure paths that have been observed to (possibly) lead to a given top-event, e.g. loss of system function of an engineered safety feature (AORS for PSA/PRA support). Exploiting the AORS with this analysis type it is assured that the preventive measures and system models, usually based on engineering judgement and experience, are incorporating also what can be learned from the operational experience. Sequence combining methods identify on a generic level from the data bank all causal sequences leading to a given occurrence. The underlying motivation is that also things happening in other plants than the one under study, may yield valuable insights into the circumstances and mechanisms leading to the problem treated. This holds particularly for human behaviour and dependent failure effects. The NATURAL program library for the AORS contains two utilities for developing applications performing sequence combination analysis. They are named PRA-o and PSA-o. Both programs have the same structure, the difference is found in the criteria for retrieving relevant similar occurrence sequences. First the occurrence type of interest is specified (the basic occurrence), e.g. a top-event for a fault tree analysis, an initiating event type, etc. All AORS reports featuring this occurrence type are then retrieved and printed, as the first step of the multi-stage search procedure. From this set of sequences all the occurrences are retrieved that are causing the basic occurrences. This set of occurrences is then used for the second retrieval step, identifying all other sequences that

110 comprise an occurence(s) similar to these causing occurrences, but leading to different consequences. The event reports containing these occurrences are then printed out, the causing occurrences of these occurrences are then used for performing the third retrieval step, and so on. The programs mentioned above are designed to perform four of such retrieval steps, but this can be extended easily if necessary. The results of such a series of retrievals depend strongly on the similarity criteria defining the occurrences to be used as search criteria for the subsequent retrieval step. If the similarity is defined on a too generic level, e.g. only system and component matching, usually a "combinatorial explosion" is observed. The amount of retrieved sequences is much too large to be digested in a useful manner by the analyst in that case. For obtaining manageable outputs, the criteria for similarity have to be rather stringent. In the program PRA-o, matching of the code values for system, component, cause and action taken is the criterion for similarity, with as additional constraint that the occurrence must be a causing occurrence by itself; the effect of the occurrence is left free. In PSA-o, the criteria are more stringent, matching system, component, cause, effect and action taken altogether. In both programs, the organisation of the output is such that the analyst may easily trace back all the occurrence sequences of interest for this study. This is realised by grouping the retrieved reports from the several retrieval steps according the occurrences through which they were identified. Of each report the full AORS output is presented, because apart from the sequence coding also the other coded items and in particular the free text parts are indispensable for determining the relevance of a sequence. 3.3. Trend and pattern analysis The analysis methods presented in this section are useful by virtue of another characteristic of the AORS mentioned in the introduction, namely the presence of large, intrinsically homogeneous, complete samples of data. It is recalled here that from the participating countries all reportable safety related events in nuclear power stations (obligatory reporting) are input to the AORS, where this information is homogenised both in language and contents. Basically, one may discern between three types of trend and pattern analysis. First there are methods concerned with time evolution of event characteristics. Secondly, a variety of methods for analysing the interrelationships of incident characteristics exists. Finally, there

Ill exists a class of methods for investigating the time evolution of interrelationships, combining the previous ones. In conjunction with the type of data stored in the AORS, examples of these three basic analysis types are presented. As an example of time-trending, Section 3.3.1. describes a timebetween-events analysis method. For tracing possible relationships (dependencies) between incident characteristics the methods described in Sections 3.3.2. and 3.3.3. (cross-histogramming and code-combination histogramming) are relevant. This task may also be performed by means of the methods described in Section 3.3.4. (operating profiles, basically a contingency table analysis) and in Section 3.3.5. (Multivariate statistical analysis); but these may also be used for investigating the time evolution of incident characteristics and their relationships. In particular, the latter type of analysis is suited for this task, being the trend and pattern recognising method 'par excellence' with the best founded theoretical basis, but appearing to be the most difficult one in practical applications. 3.3.1. Time-between-event analysis. The objective of the method implemented presently for the AORS is to trace manually, i.e. without any underlying statistical model, relevant changes in the mean time between events for a particular unit. The NATURAL program CALC-TBF in the AORS program library may be used as basis for developing applications for a specific type of event for a specific unit. The procedure is only valid for a unit for which the event reports are available continuously in a certain time span. To identify the units satisfying this criterion, another program is available (HOMOGEEN). First the time window, the specific type of event and the unit to be analysed must be specified. From the time span all these events are retrieved, ordered by date of occurrence. Then the time between events is calculated, both in months and years, and from this the cumulative mean time between events, also in months and years. Finally, the inverse values of the latter quantities, i.e. the cumulative mean frequency of occurrence estimates, are calculated. From these an indication may be found, whether the rate of occurrence of the specific type of event is increasing, decreasing or constant during the specified time interval. If deemed necessary one may perform significant trend tests (least squares fitting with a linear function) on the output. This is not built-in presently. 3.3.2..Cross-histogramming. This type of analysis is a "quick and dirty" method for identifying possible interrelationships or dependencies of

the incident characteristics as represented by the values of the coded items in the AORS reports. It is based on a comparison of the relative frequencies of reporting ('scores') of the code values within two sets of reports: a 'selected set' and the 'complementary set'. The definitions of these two sets depend on the subject under study. For example, if one wants to study the characteristics of events of a specific type, the selected set will consist of these event reports, and the complementary set of all remaining AORS reports. Another example is the confrontation of event characteristics for two different reactor types, e.g. PWRs and BWRs (pressurized respectively boiling water reactors), or for two different operating regimes of the plants, e.g. events occurring within the first few years after first criticality versus events taking place at a more advanced plant age, e.g. after 8 years of operation. In the last example there is a link with the third type of analyses mentioned in the introduction.

As said, for both sets of reports input to the cross-histogramming procedure, the set scores of the code values of all coded items are calculated and compared using a statistical estimation model. It is necessary to adopt such a model when one desires to assess the significance, or say the non-randomness, of differing scores. It is clear that the smaller the size of (one of) the sets, the larger the probability that an observed difference in score is coincidental. For assessing the significance of differing scores (for a particular code value of one of the coded items or keywords) a bayesian model is applied. The observed score is interpreted as 'evidence' for an assumed score probability of the code value in each set, and used for updating the "prior knowledge" of this probability, being any probability equally likely, i.e. an "ignorant" prior. Applying Bayes' rule, a posterior estimate of the assumed score probability results, of which the probability density function's mean value and standard deviation are computed. The above sketched procedure is often referred to as a "pessimistic fiducial" method of estimation and takes into account in a natural way both the observed score and the set size on which this observation was made. Then for each code value the obtained mean values and standard deviations of the two posterior score probability density functions (one for the selected set and the other for the complementary set) are used for assessing the degree of significance of the score difference, employing Chebyshev's inequality: if the distance between the two mean values is larger than alfa times the sum of the two standard deviations, the code value concerned is marked as having a "significantly differing score" in the two sets. The significance factor alfa may be specified by the user, e.g. 2 or 3.
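The estimation and comparison step just described can be sketched as follows. This is only an illustration of the principle (a uniform "ignorant" prior combined with a binomial likelihood gives a Beta posterior for the score probability), not the actual HISTAORS/NATURAL implementation; the function names and the numbers in the example are invented.

```python
# Minimal sketch of the bayesian score comparison described above.
from math import sqrt

def posterior_mean_std(score, set_size):
    """Beta(1 + score, 1 + set_size - score) posterior for the score probability."""
    a, b = 1 + score, 1 + set_size - score
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, sqrt(var)

def significantly_different(score_sel, n_sel, score_comp, n_comp, alfa=2.0):
    """Mark a code value when the posterior means differ by more than
    alfa times the sum of the posterior standard deviations."""
    m1, s1 = posterior_mean_std(score_sel, n_sel)
    m2, s2 = posterior_mean_std(score_comp, n_comp)
    return abs(m1 - m2) > alfa * (s1 + s2)

# Example: a code value reported 40 times in 200 selected reports
# versus 300 times in 10000 complementary reports.
print(significantly_different(40, 200, 300, 10000, alfa=2.0))
```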

In the AORS program library the utility HISTAORS may be used for developing cross-histogramming applications. It follows the above sketched procedure. First the "selected set" and its complement are defined, whereafter the two sets are cross-histogrammed according to the above procedure. For each coded item and the keyword types, histograms across the two sets are produced, giving for each code value the absolute and relative score within each set (the latter expressed as a percentage), and the significance factor of the difference, i.e. the distance between the two estimated posterior mean values divided by the sum of the standard deviations; if this factor is larger than the user-specified value of alfa (see above), the code value involved is automatically marked with an asterisk. Finally, also the ratio of the two relative (percentual) scores and its inverse are given. For the histograms of the keywords, the user must specify a minimum absolute score value for a keyword value to be taken into account, e.g. 3 or 5. (Keyword histogramming is extremely time consuming given the large amount of observed keyword values.)

In interpreting the output of the cross-histogramming procedure the following consideration applies. The observed (significant) differences in score only direct the analyst's attention toward a possible item of interest. As with any statistical outcome it cannot prove anything by itself; an engineering explanation must always be given for an observed trend. Often observed score differences are due to trivial reasons, such as a certain reporting practice or obvious correlations (e.g. the BWR recirculation system significantly less reported in PWR incidents). This process of assigning engineering or physical significance to observed statistical trends and patterns is the essential part of all these types of analysis methods.

3.3.3. Code-combination histogramming. The idea of this type of method is to count the number of times that code values of specific items are observed together. First a set of coded items is selected for which the code value combinations present in the AORS are deemed to be of interest, then a histogram is produced giving the frequencies of co-occurrence of all observed combinations of code values for these items. In this way it can be seen whether specific combinations occur more often than others. This is then to be explained in engineering terms, and could yield insights into correlations of incident characteristics. There are no statistical models or significance considerations involved in this method. It serves only as a starting point for more detailed investigations, identifying frequently recurring combinations of incident parameters or (unique) combinations regarded as of interest in their own right.
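The counting itself is straightforward, as the following sketch shows. It is purely illustrative (not the SORTREC utility described next); the occurrence records and item names are invented.

```python
# Illustrative sketch of code-combination histogramming: count how often
# selected code values are observed together in the same occurrence.
from collections import Counter

# Hypothetical occurrence records; the coded item names are examples only.
occurrences = [
    {"system": "FW", "component": "PUMP",  "cause": "MECH", "action": "REPAIR"},
    {"system": "FW", "component": "PUMP",  "cause": "MECH", "action": "REPLACE"},
    {"system": "RC", "component": "VALVE", "cause": "PERS", "action": "REPAIR"},
    {"system": "FW", "component": "PUMP",  "cause": "MECH", "action": "REPAIR"},
]

items = ("system", "component", "cause")      # coded items selected for the histogram
min_count = 2                                 # report only the more frequent strings

histogram = Counter(tuple(occ[i] for i in items) for occ in occurrences)

# Sorted by decreasing frequency of occurrence of the code value strings.
for combo, count in histogram.most_common():
    if count >= min_count:
        print(count, " ".join(combo))
```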

In the NATURAL library of AORS application programs the utility SORTREC serves as a basis for developing applications of code-combination histogramming. It operates on the occurrence level, but it is designed to process coded items both from the general part of the AORS reports and from the occurrences together. After specification of the coded items to be considered in the histogramming, additional provisions may be made by the user, e.g. discarding of certain code values, counting values only once for sequences within the same system, etc. The user may also specify a minimum value for the frequencies of co-occurrence of code values to be processed, for cases where it is only of interest to identify the most frequently occurring combinations of code values. Next two types of histograms are produced: first one where the strings of code values are alphabetically ordered from left to right, then a histogram where the sorting is by decreasing frequency of occurrence of the code value strings. In its present form, the use and interpretation of the SORTREC program requires a thorough knowledge of the AORS coding scheme, the reporting format and, of course, of the coding values. This holds for all the analysis facilities offered by the programs in the AORS application program library. In addition, for running the programs with other parameter specifications, or for developing own applications from them, obviously the user must be capable of programming in NATURAL.

3.3.4. Operating profiles. The objective of this analysis method is to investigate for a specific unit how its reporting of abnormal events compares with that of a group of similar units. This serves to obtain, for each member of such a group of units, an indication of "off-normal" behaviour. Performing the analysis as a function of time, the "operating profile" of a unit is monitored against a standard defined by the group to which the unit belongs. Trends of deterioration or amelioration of the plant's operation may be detected in this way. A similar approach is presented in ref. /12/ for the U.S.A. Licensee Event Reports (LERs). The operating profile of a unit is defined here as a row of scores of code values (for each coded item present on the AORS reporting format). These scores are normalised with respect to the calendar time elapsed during the period of observation, or more precisely, treated as "evidence" for the bayesian estimation of the probability of reporting (per unit time) of that specific code value. What results is a row of mean values and standard deviations for the probability density functions of the reporting rates of the code values. In the same fashion such a row is also obtained for the ensemble of units investigated. The numbers of this row express the "normal" reporting rate probabilities for the unit population. For each unit then the "off-normal" reporting rate

probabilities are obtained by comparing with the standard, using a criterion similar to that described in the section on cross-histogramming.

The utility PROF in the AORS-NATURAL library serves as a basis for developing applications of operating profile analysis. It follows the procedure described above. First the user has to specify the ensemble of units to be investigated, e.g. units of the same type, in the same country, of the same power class, designed and built in the same era, etc. Then one might set a time window, if a time trending analysis of the profiles is to be performed. In that case the program has to be run for subsequent concatenated time windows. Since the technique is normalised with respect to time, the observation times of the member units need not be identical. Next PROF produces for each unit a histogram over all code values of all coded items featuring on the AORS reporting format, with the following output for each code value: the unit's absolute and percentual score of the code value over the observation period (of the unit), then the mean value and standard deviation of the probability density function for the reporting rate of the code value. These are obtained with a statistical model, where the absolute score and the unit's observation time are the observed sample (evidence) in a bayesian estimation process, assuming an ignorant prior distribution and a Poisson likelihood function for the realisation of the sample. So it is implicitly assumed that at least during the unit's observation period the reporting rate probability is constant in time. Next the same quantities as above, but for the ensemble of units, are output, where the population averaged reporting rate estimates are calculated in two different ways. The first is an observation-time-weighted average of the member mean values and standard deviations. This takes into account the unit to unit variability. The second is a pool based estimate, obtained by applying the above mentioned bayesian process to the accumulated scores and observation times over the unit population. This gives rise to unrealistically narrow p.d.f.s, since it neglects unit to unit variability. The former ensemble estimate is used for identifying off-normal reporting rate probabilities of the units. The user must specify the significance factor alfa (see Section 3.3.2) for marking automatically code values that are found to have a significantly low or high reporting rate probability. The code values thus identified trigger the process where the analyst must find engineering explanations for the reasons of deviating unit scores. Again, this is the most important (and complex) part of the method, for which no specific guidance can be given here. It is only stressed here that often non-relevant reasons exist for a unit's deviating score, such as the reporting practice, reporting thresholds, etc.
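The rate estimation and the ensemble comparison can be sketched as follows. This is a simplified illustration, not the PROF utility: a flat prior with a Poisson likelihood gives a Gamma(score + 1, T) posterior for the reporting rate, and the ensemble standard is an observation-time-weighted average of the unit estimates. All scores and observation times below are invented.

```python
# Sketch of the bayesian reporting-rate estimation and off-normal check.
from math import sqrt

def rate_posterior(score, obs_time_years):
    """Mean and std of the Gamma(score + 1, obs_time) posterior for the rate."""
    mean = (score + 1) / obs_time_years
    std = sqrt(score + 1) / obs_time_years
    return mean, std

units = {"unit A": (12, 8.0), "unit B": (3, 6.5), "unit C": (25, 9.0)}  # (score, years)
estimates = {u: rate_posterior(s, t) for u, (s, t) in units.items()}

# Observation-time-weighted ensemble average (accounts for unit to unit variability).
total_time = sum(t for _, t in units.values())
ens_mean = sum(t * estimates[u][0] for u, (_, t) in units.items()) / total_time
ens_std  = sum(t * estimates[u][1] for u, (_, t) in units.items()) / total_time

alfa = 2.0
for u, (m, s) in estimates.items():
    off_normal = abs(m - ens_mean) > alfa * (s + ens_std)
    print(f"{u}: {m:.2f}/yr (+/- {s:.2f})  off-normal: {off_normal}")
```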

3.3.5. Multivariate statistical analysis. The type of coded data stored in the AORS is basically qualitative (non-numeric), categorical and completely disjunctive. That is, the type of data found in a questionnaire, where one and only one answer (mode), to be selected from a finite set of possible answers, may be given for each question. In order to obtain a structured overview of such a large amount of data, descriptive, exploratory statistical methods may be applied. In particular, using multivariate techniques, where the various variables are analysed together, one seeks to determine hidden (complex) dependencies, interrelationships or new latent variables. All these are not readily recognised otherwise. Given the type of data in the AORS, techniques like multiple correspondence analysis (MCA) and related clustering methods utilising the same metric are particularly suited for exploring the structure of the data space and identifying relationships between incident parameters or groups of parameters. Furthermore, these methods are suited for checking the consistency and integrity of the data (on a global level), and serve as a preliminary for more fine-tuned analyses.

In the following the attention is focussed on MCA only. It is one of the few generally applicable techniques available for handling the type of data in the AORS. The basic method is described in many references (e.g. see refs. /13,14,15,16/), often in relation with other, closely related methods like principal components analysis (for quantitative data) and ordinary correspondence analysis (for contingency tables). It may be summarized as finding the best simultaneous representation of all the data in the multi-dimensional mode space, by least squares fitting in a weighted euclidean metric allowing for the notion of individual profiles and zero distance of equally profiled individuals (observations). A set of orthogonal axes is constructed in such a way that the variance of the projections of the data points on these axes is maximised. These axes are termed 'principal axes' and are ordered in terms of the amount of variance of the data points "explained" by them. Important elements for the interpretation of the principal axes and the planes spanned by pairs of them ('principal planes') are the modes that contribute the most to the orientation of the axes (these contribute the most in terms of inertia) and the squared correlations, being the percentages of the variance of a variable explained by the axes. In the process of assigning engineering meaning to the first (few) axes and planes, it is attempted to find the dependencies, interrelationships, new latent variables, "spectra" or gradients of incident characteristics, etc. For that aim also planar plots of the data points projected onto the principal planes are helpful.

An important intermediate result of an MCA is the multiple contingency table termed Burt table, which gives for all existing pairs of modes the observed frequencies of co-occurrence. From this table indications can be obtained for the possible correlation of modes (= code values) and hence for the relationship of the corresponding incident parameters. It is recalled here that a statistical correlation cannot prove by itself a physical correlation. It gives only a hint in that direction, and a sufficient engineering explanation is always needed for drawing conclusions.

The utility WRITAORS in the AORS program library serves for extracting data of a user-defined set of reports from the data bank for input to the external MCA and clustering programs. The MCA program MULTM produces as output all elements needed for a proper interpretation of the analysis, including: a dictionary of the variables and modes; the complete Burt table; a computer produced overview of correlated modes derived from this; for all modes (= code values) of all questions (= coded items of the AORS reporting format), the coordinates and squared correlations with respect to the first six principal axes; for the first 10 principal axes, the modes contributing the most in terms of inertia to the orientation of these axes; planar plots of the first 2 principal planes (spanned by axes 1, 2 and 3, 4 respectively) showing the projections of the data points onto these planes; and an overview of the variance explained (in percentages) by the various axes.

A final remark concerns the potentiality of MCA for time trend analysis. After extraction of the data to be analysed by means of a NATURAL program based on WRITAORS from the data bank, an external data set results that is processed by a special interfacing program for preparing the input of the MULTM code. This program assigns to each output individual (= observation, a string of code values corresponding to one occurrence, with in front the code values of the items of the general part of the AORS report from which the occurrence originates) an age category. This is realised by computing the time elapsed between the date that the unit in which the occurrence took place went into commercial operation and the date of the occurrence. This time then falls into one of the 38 age classes (= modes) defined for the question (= variable) 'plant age - fine' and into one of the 3 modes of the question 'plant age - coarse'. By incorporating these time related variables in the MCA, it is possible to trace the dependencies and interrelationships of the incident parameters with the operating age of the plants. An application of MCA to a subset of AORS data is described in ref.

/17/.
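To make the Burt table mentioned above concrete, the following sketch builds it for a handful of invented coded reports. It is only an illustration (not the MULTM code) and assumes the pandas library is available; a full MCA would go on to diagonalise a matrix derived from the indicator table.

```python
# Sketch of the Burt-table construction: the coded items are expanded into a
# complete disjunctive (indicator) table Z, and the Burt table is Z'Z, whose
# entries are the co-occurrence frequencies of every pair of modes.
import pandas as pd

# Hypothetical coded reports (questions = coded items, answers = modes).
reports = pd.DataFrame({
    "reactor_type": ["PWR", "PWR", "BWR", "BWR", "PWR"],
    "cause":        ["MECH", "PERS", "MECH", "ELEC", "MECH"],
    "plant_age":    ["young", "old", "old", "young", "old"],
})

Z = pd.get_dummies(reports).astype(int)   # complete disjunctive table (one column per mode)
burt = Z.T @ Z                            # Burt table: co-occurrence counts of all mode pairs

# How often the modes 'PWR' and 'MECH' are reported together:
print(burt.loc["reactor_type_PWR", "cause_MECH"])
```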

3.4. Specialised search algorithms

In this section some examples are presented of search methods for retrieving from the AORS reports containing (possibly) relevant information on one of the following subjects: dependent failures; a specific class of human errors; and events with a high potential for initiating hazardous plant states, the precursors to severe incidents. In all these examples, the presented extraction from the data bank is only the first step of an extensive analysis effort, which is beyond the scope of this paper. In all cases the follow-up analysis is essentially manual, involving classification schemes set up according to specific models, etc. Nevertheless, here it is demonstrated that the AORS is suited for supplying the necessary evidence from the operating experience, needed for the application and validation of the models involved.

3.4.1. Dependent failure classification. The term 'dependent failure' is preferred rather than 'common cause/mode' failure, though sometimes the same is referred to with the latter terminology /18/. Anyhow, what is meant here are all those instances where (similar) pieces of equipment failed (simultaneously) due to the same underlying cause. This cause could be either internal to the equipment ('pure common cause' cases) or external, i.e. the failure of some other equipment or a human action involving multiple equipment. Speaking in terms of AORS coding, any of these cases would be marked with the special dependent failure indicator, the occurrence cause code CSX = 'common cause'. To classify the cases as having an internal or external cause, it would be necessary to read the complete report; this cannot be done automatically.

The utility COMMODE in the AORS program library retrieves from the data bank three classes of dependent failure events. First, all those sequences where there is at least one occurrence that has the coding CSX and that has not been caused by some other occurrence. These events could very well comprise 'pure' common cause cases as defined above. The second group of reports retrieved concerns those sequences where there is at least one occurrence marked with CSX that has been caused by another occurrence. Here the pure cases are less probable. Finally a set of reports is retrieved describing sequences where there is at least one occurrence that causes more than one other occurrence, irrespective of any CSX indication. These three sets of reports are then 'cleaned', so that they are disjunct. The event reports found present in more than one of the above three sets are identified and stored in a fourth set. To give an idea of the size of the material thus retrieved from the AORS with its present contents of 27,000 reports, the first set contains 3043

reports, the second one 2062, the third one 264 and the fourth one 318 reports. All these sets have then to be screened by hand for obtaining the necessary information needed for the study in progress, e.g. a classification exercise /19/, scanning for general mechanisms leading to this type of failures, estimation of the relative frequency of occurrence of specific sub-classes of this type of failures, such as the simultaneous loss of more than one component (-train) from a redundantly designed system, etc.

3.4.2. Human factor analysis. In this section a procedure is described for retrieving all AORS reports that are possibly concerned with a specific type of human error, namely where the control room operators were misinterpreting the plant status, were presented with false or misleading information, etc. Since there are no dedicated code values defined in the AORS for this type of problems, the search algorithms developed basically rely on searches for keywords and/or strings of adjacent keywords. This implies that the retrieval is intrinsically incomplete. However, the aim is to obtain a set of cases large enough to be representative of the type of problems studied, not necessarily all relevant cases present in the AORS data bank.

The usual way to conduct a search of the type outlined above is the following. First the histograms of strings of adjacent keywords are scanned manually for relevant keyword combinations. The programs HISTSTR2, HISTSTR3 and HISTSTR4 in the AORS application program library produce such histograms for strings of 2, 3 and 4 adjacent keywords respectively. They have been designed to be used in successive analysis iterations, pinpointing gradually the set of keywords (strings) of interest. The user may specify a window for the string's frequency of occurrence to be considered in the histograms, as well as ranges of keyword values for each of the string members to be histogrammed between. These features allow for the necessary flexibility in handling the keyword string histograms in view of the extremely large number of combinations present in the AORS. As an example, it is easy to obtain a histogram of all strings of 3 keywords where the word in the middle is 'operator', occurring between 2 and 20 times in the data bank. Having identified in a couple of histogram loops the keywords (strings) of interest, the reports containing these keywords (strings) are retrieved by means of search programs derived from the utilities STRING-2, STRING-3 and STRING-4. These are for retrieving reports containing user-specified strings of respectively 2, 3 and 4 adjacent keywords. In parallel, a search is made for single keywords (specified usually as ranges of keyword values) occurring in the text parts of the reports, for covering all those cases not identifiable through a string

search. The disadvantage of using single keywords is that there is a higher probability that non-relevant reports are retrieved. In most cases it is possible, however, to exclude the most frequently appearing non-relevant cases by means of additional search criteria, either again on certain keywords and/or on other coded items. All material thus retrieved has to be screened by hand for relevance, after which the remaining set can be used for the investigation in progress, e.g. a classification of the types of failures of the kind described above, a study on the favourable circumstances for such failures to occur, all kinds of frequentistic considerations, etc.

3.4.3. System modelling support - precursor searches. As a final example of specialised search algorithms a method is presented for retrieving so-called precursors to severe incidents. These are events that have the potentiality for causing severely degraded states of the plant, from a safety point of view, given additional circumstances (not existing at the time of occurrence). The study of this type of events is of great importance for the system modelling phase of a probabilistic safety (risk) assessment (PSA) of the plant. It provides additional input to the safety analyst, helping him to design the system models so that they also incorporate what can be learned from the operating experience. It supplies the evidence for certain accident sequences occurring or potentially occurring, which hence have to be taken into account in the study.

The retrieval of the above type of events from the AORS is straightforward, bearing in mind that it was designed also for this purpose. All methods use two specific features of the AORS coding scheme, namely the initiating event/transient class and the system involved, i.e. the emergency system or engineered safety feature coming into operation or being challenged. The initiating event class into which each AORS report is categorised is in fact the precursor type, classified according to the effect of the event on the parameters relevant for the safety of the plant. If there is no such immediate effect, which holds for the majority of reports in the AORS, then the event is classified as not being a transient, which does not automatically imply that it is not a precursor as defined above.

In the AORS program library presently 4 utilities are available for deriving precursor search programs, each according to a special precursor definition. The program PRECURS1 retrieves events classified as transient, where there is also a failure within an emergency system and/or its support system(s). The utility PRECURS2 identifies all those cases classified as transient, where somewhere in the sequence the coming-into-operation or

challenge of an emergency system is reported. Then PRECURS3 selects all the cases classified as transient, with in the sequence both a demand of an emergency system and a failure within this system. Finally PRECURS4 identifies the total function loss of an emergency system, or loss of redundancy in at least two different emergency systems, irrespective of the event being classified as transient or not.

With the present content of 27,000 reports in the AORS, the following numbers of cases are retrieved by the respective search algorithms. With PRECURS1 there appear to be 159 transients where in the sequence an emergency system suffers complete loss of function, 37 where such a system loses more than one redundancy, and 252 cases where the maximum effect is loss of one redundancy. These sets are disjunct; the effects are the maximum effects observed in the sets. Running PRECURS2 it is found that there are 2055 transients with a challenge of an emergency system in one (or more) of the occurrences, and from the output of PRECURS3 it is concluded that in 120 sequences classified as transients an emergency system was both challenged and failing. Finally, according to PRECURS4 the AORS contains 843 reports on a complete function loss of an emergency system (or its support system) and 277 cases where loss of redundancy in at least two different emergency systems is described. The analysis of all these sets of reports depends on the study in progress. This might be a precursor classification exercise or a frequentistic type of study, or an analysis of the ASP type (accident sequence precursor methodology /20,21/), or an accident sequence inventory for a PSA /22/, etc.
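The selection logic of a PRECURS3-type search can be sketched as follows. This is purely illustrative (not the actual NATURAL utility): it keeps reports classified as transients in which some occurrence challenges an emergency system and some occurrence reports a failure within that same system. The record fields and system names are hypothetical.

```python
# Illustrative sketch of a PRECURS3-like precursor selection.
def is_precursor(report):
    if not report["is_transient"]:
        return False
    challenged = {o["system"] for o in report["occurrences"] if o.get("challenged")}
    failed     = {o["system"] for o in report["occurrences"] if o.get("failed")}
    return bool(challenged & failed)    # same emergency system demanded and failing

report = {
    "is_transient": True,
    "occurrences": [
        {"system": "AFW", "challenged": True},
        {"system": "AFW", "failed": True},
        {"system": "HPI", "challenged": True},
    ],
}
print(is_precursor(report))   # True: the AFW system was both demanded and failing
```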

4. SUMMARY AND CONCLUSIONS

An overview has been given of various uses and analysis possibilities of an incident data collection, given that the data collection scheme and the coding scheme (reporting format) meet specific requirements. This has been illustrated by means of the AORS data bank, which has been designed and realised specifically as a tool for safety assessment and feedback of safety related operating experience. The key features rendering an incident data collection most useful in this respect are: - completeness of the samples, i.e. collection of all (obligatory) reportable safety related events, also the minor ones,

- extensive free text descriptions combined with occurrence sequence coding. This requires a considerable effort before the data can be inserted into the data bank, but enhances the use and analysis possibilities dramatically.

REFERENCES

/1/ B. Saitcevsky and J.P. Chanssaude (EDF), 'Le système d'échange et de traitement de l'information sur les événements des centrales nucléaires de l'UNIPEDE (USERS)', Congrès d'Athènes, June 1985.
/2/ Guidelines for the Incident Reporting System (IRS), October 1981, SEN/SIN(81) 40.
/3/ G. Mancini et al., 'ERDS: An Organized Information Exchange on the Operation of European Nuclear Reactors', Int. Conf. on Nuclear Power Experience, IAEA Vienna, 13-17 September 1982.
/4/ J. Amesz, 'The European Abnormal Occurrences Reporting System (AORS)', IAEA Workshop on Reporting and Assessment of Safety Related Events in N.P.P.'s, Madrid, 22-26 November 1982.
/5/ J. Amesz et al., 'The European Abnormal Occurrences Reporting System (AORS)', IV EureData Conf., Venice, 23-25 March 1983.
/6/ J. Amesz et al., 'The Abnormal Occurrences Reporting System of the CEC', IAEA Technical Workshop on National Systems for Abnormal Events Reporting, Budapest, 14-18 May 1984.
/7/ J. Amesz et al., 'The European Reliability Data System: Main Developments and Use', ANS/ENS Int. Topical Meeting on Prob. Safety Methods and Appl., Vol. I, San Francisco, February 1985.
/8/ J. Amesz et al., 'Status of the Abnormal Occurrences Reporting System (AORS) of the CEC', IAEA Technical Committee on the National Incident Reporting System, Vienna, 13-17 May 1985.
/9/ H.W. Kalfsbeek et al., 'Merging of Heterogeneous Data Sources on N.P.P. Operating Experience', Vth EureData Conf., Heidelberg, 9-11 April 1986.
/10/ AORS Handbook, Commission of the European Communities, JRC-Ispra Establishment, T.N. I.05.C1.86.09, February 1986.
/11/ Natural Reference Manual, Version 1.2, SOFTWARE AG, Darmstadt.
/12/ O.V. Hester et al., USNRC, NUREG/CR-4129, January 1985.
/13/ L. Lebart, A. Morineau, K.M. Warwick, 'Multivariate descriptive statistical analysis', New York, 1984.
/14/ M.J. Greenacre, 'Theory and applications of correspondence analysis', Academic Press, London, 1984.
/15/ M. Volle, 'Analyse des Données', Economica, Paris, 1981.
/16/ L. Lebart, A. Morineau, N. Tabard, 'Techniques de la Description Statistique', Dunod, Paris, 1977.
/17/ H.W. Kalfsbeek, 'Multiple Correspondence Analysis of Abnormal Event Data from Nuclear Power Plants', Vth EureData Conference, Heidelberg, April 1986.
/18/ A.M. Games et al., 'Common Cause Failure Investigation Using the European Reliability Data System', 8th Advances in Reliability Technology Symposium, Bradford, April 1984.
/19/ K.N. Fleming et al., 'Classification and analysis of reactor operating experience involving dependent events', PLG-EPRI, PLG-400, NP-3967, February 1985.
/20/ W.B. Cottrell et al., 'Precursors to Potential Severe Core Damage Accidents: 1980-1981', NUREG/CR-3591, July 1984.
/21/ H. Hortner et al., 'German Precursor Study - Methods and Results', ANS/ENS Meeting on Probabilistic Safety Methods and Applications, San Francisco, February 1985.
/22/ S. Balestreri et al., 'Merging Predictive Modelling and Operating Experience Assessment by Means of the Systems Response Analyser and the European Reliability Data System', IVth European Symposium on System Safety, Deauville, June 1986.

[Annex: the A.O.R.S. Abnormal Event Reporting Format (E.R.D.S.), sheets 1 to 3.

Sheet 1 (general part): A.E. reference number; national reporting identification code; date of event; A.O.R.S. and N.R.S. category; total number of occurrences; reference codes C.E.D.B., O.U.S.R. and I.R.S.; title; initiating event and initiating event class; facility status at the initiation of the event (zero power/hot standby, starting up, reduced power, full power, cold shutdown, hot shutdown, shutting down, refuelling/revision, raising power, reducing power, routine test, special test, under construction, unidentified shutdown, unidentified, others; power level in MWel); effect on operation after the event (no significant effect, delayed coupling, plant outage, power reduction, reactor trip, turbine trip, hot shutdown, cold shutdown, loss of heat sink, loss of F.W. to S.G., unidentified, others; power level in MWel); activity release to personnel and environment (no release, within authorized limits, exceeding authorized limits); significance discussion.

Sheet 2: free text event and sequence description, with detailed information about activity release, plant parameter variations, common cause, human error, initiating event, etc.

Sheet 3 (one per occurrence): occurrence number and title; systems involved during the course of the occurrence; failed system or equipment; failed component or part; effect on failed system/component (no significant effect, loss of component function, degraded component operation, induced failure or unavailability of another system/component, loss of system function, degraded system operation, loss of one or more than one redundancy, unidentified, others); cause of failure (personnel, mechanical, electrical/instrument, environmental, hydraulic, previous failure, common cause, unknown, others); detailed information; way of discovery (audio/visual alarm, monitoring, routine surveillance/observation, testing, review of procedures or records, calling system into operation, inspection, maintenance, repair, external source, inferred by other fault, unidentified, others); action taken (no action taken, component/part replacement or repair, adjustment/recalibration, new procedure, training, redesign/modification, control of similar equipment, temporary repair/bypass, equipment clean-up, unidentified, others).]
PART II MODELLING TECHNIQUES

FAULT TREE AND EVENT TREE TECHNIQUES

A. Poucet Joint Research Centre Ispra Establishment 21020 Ispra (Varese) Italy ABSTRACT. After an introduction on systems analysis approaches, the fault tree technique is described. The mathematical representation of fault trees and the related logical analysis is presented first. Next the probabilistic analysis - calculation of unavailability, expected number of failures, unreliability and importance - is discussed. Some special problems related to fault tree analysis are shortly discussed. Subsequently, the event tree technique is shortly described. Finally a section is devoted to the use of computer codes in systems reliability assessment. In the appendix, an information sheet for some computer codes available at JRC is given.

1. INTRODUCTION

The technological achievements of recent decades have created a need for methods capable of analysing the reliability and safety of large, complex systems. Such methods can be of major value in the following areas:
1. assessing the reliability or availability of a system or plant;
2. detecting weak points in system design and operation;
3. optimizing system design with respect to safety or availability;
4. providing system engineers with insight into the (normal and abnormal) behaviour of the system.
To this aim, several system analysis methods have been developed. Before discussing the system analysis methods it is useful to first define what is meant by a system. A system can be defined in a very general way as: 'a deterministic entity of discrete interacting elements' /1/. The term deterministic implies that the system as a whole and its constituting discrete elements can be clearly determined.

The determination of the system and its elements leads to the definition of what is called the system external and internal boundaries:
1. the system external boundary determines what is assumed to be part of the system, and what not: e.g. consider some emergency system for injecting cold water in a reactor; one has to decide whether or not the power supply system needed to supply some motor pumps is part of the system, whether the protection system that generates the call signal for the emergency system will be included or not, possibly whether some system used to cool the pumps is included or not, etc.;
2. the system internal boundary defines the level of detail by which the system is subdivided into constituting elements.
In this way, the external boundary determines the scope of the systems analysis, while the internal boundary determines its level of resolution. Both boundaries can have influence on the results. The choice of both boundaries depends on various factors such as the aim of the analysis, the resources available, the statistical data available, etc. In any case, it is important that the analyst clearly documents his choices with respect to external and internal boundaries.
The systems analysis itself can be limited to be only qualitative or may be extended to include quantitative analysis. It can be carried out in an inductive way or in a deductive way:
1. the inductive approach is based on the study of events occurring in the elements of the system and of the consequences of such events on other elements and on the system itself. This approach could be characterised by the basic question 'WHAT IF?';
2. the deductive approach assumes some events on the system level and investigates the causes by tracing down to events occurring in the system elements. This approach could be characterised by the basic question 'WHY?'.
The most important inductive methods for systems analysis are:
1. Failure Mode and Effects Analysis (FMEA);
2. Operability Analysis or Hazard and Operability Analysis (HAZOP);
3. Markov Analysis;
4. Event Tree Analysis.

The first two methods are qualitative methods giving qualitative insight into the system but not answering questions about the likelihoods of system failures or successes. Markov analysis and the event tree technique can be used to provide quantitative answers also. The most important deductive technique is fault tree analysis. It

is a versatile, flexible mathematical tool for quantitative systems reliability analysis and has found wide applications. In this lecture note, the emphasis will be put on the fault tree method because of its importance in system reliability and safety analysis. The FMEA will be discussed 'en passant' as it is often used as a preliminary step to prepare a more in-depth fault tree analysis. The event tree analysis method will be discussed shortly because it is used frequently to complement the fault tree method in the safety analysis of complex plants.

2. FAULT TREE ANALYSIS

2.1. Introduction to the fault tree technique

In the fault tree analysis method, the starting point is an undesirable event at system level or an undesired state of some system process parameter, called the 'Top event'. The aim is to analyse the causes of the Top event and to quantify its likelihood of occurrence. The analysis of the causes is performed through the use of a logic diagram, the fault tree, that depicts how various primary events combine through a set of boolean operators to lead to the Top event. The primary events are events that can be associated to the elements of the system or events at the external boundary, e.g.:
1. basic component failures or states;
2. human errors during test, maintenance or operation;
3. external events such as fires, floods, ...;
4. events concerning the state of systems that interface with the system under concern at its external boundary.
The construction of a fault tree begins with the definition of the Top event and a determination of all the events that can cause this Top event and how they combine. Then, the same procedure is repeated for these latter events: i.e. their causes are determined, and so on until the basic causes (primary events) are reached. The primary events represent the limit of resolution of the fault tree. As already mentioned, the establishment of this limit of resolution (internal boundary) depends on many factors, among which the ability to produce some failure rate or probability data for the events at this level of resolution. The concept of a fault tree representation may be clarified by a simple example.

Consider a simple system for injecting water as illustrated in Figure 1. Suppose that each pump train is able to supply the nominal flow required (100% redundancy).

[Diagram: a storage tank feeding two parallel injection trains; each train consists of a manual valve (MAV1/MAV2), a motor pump (PMP1/PMP2), a motor-operated valve (MOV1/MOV2) and a check valve (CV1/CV2).]

Figure 1. Sample system.

Let us suppose that the system is in standby and has to start upon some demand. A problem could be to analyse the probability that the system does not respond successfully to a demand. Therefore the Top event could be defined as: 'No flow at the output of the system after startup demand'. The direct causes of this Top event are the absence of flow out of train 1 and the absence of flow out of train 2 (for the sake of simplicity, events such as reversed flow in any of the trains are not considered). This can be represented graphically in the following way:

[Diagram: gate 'no flow at output' (AND) with inputs 'no flow from train 1' and 'no flow from train 2'.]
133 The cause effect relationship is, in this case, obtained through the use of the logical 'AND' operator. The construction of the fault tree would then proceed with the research of the causes for the events concerning the loss of flow in train 1 and 2. For the loss of low in train 1 the direct causes are:

_L
no flow from traini

.traini

ZI

ecvl
check valve evi stuck

+incvl
no flou to input of cvl

TJ

7>

The causeeffect relationship is given in this case by the boolean 'OR' operator. The event 'check valve CV1 failed to open on demand' can be con sidered to be a primary event and, hence does not need to be developed further. The event 'no flow to the input of the check valve' must be de veloped further. Finally, a fault tree as in Figure 2 could be obtained. Events such as 'pump PMP1 failed to start' can be considered pri mary events or could be further developed as in Figure 3. This latter decision depends on the scope of the analysis, the availability of data, etc. The event 'power supply to PMP1 failed' would have to be developed further and would then lead to the analysis of the electrical supply system. If this is outside the scope of the analysis, and thus the electrical supply system is a system that is outside the external boundary of the system, then its failure to supply the pump PMP1 is included in the fault tree as a 'undeveloped event' and not further analysed. The analyst would then have to associate a probability value to the undeveloped event. This value could be the result of some pre vious calculation or it could simply be put to 0 (or 1) in order to analyse the system under the condition that power supply to PMP1 is always (or never) available. Such as assumption is called a boundary condition.

134

.topgate no f l o u from output

1 .gateOl no f l o u from traini

0
1 I

~1

,gate02 no f l o u from train2

0
I

evi checkvalve 1 f a i 1 .to open

.gateli

cv2 checkval ve 2 f a i l . t o open

1 .gatel2

\
1 ,gate21

O
pmp2 motor pump 2 no s t a r t

movi motorval ve 1 f a i l .to open

mov2 motorval ve 2 f a i l . t o open

.gate22

tupi motor pu mp no s t a r t
1

(
.gate31

O
1

1 .gate32

\
1 tank storage tank is empty

J
mav2 man. valve 2 is not open

mav man. va 1 ve 1 is not open

I tank storage tank is empty

Figure 2. Sample system fault tree (simplified).

btars level 1 ftuit tree draun.. Code: Type: Label: Initial unavailability: Failure rate: Repair rate: T 1 M between Inspections (hours): Fault Tree Nain: f 1 g ^ ( LoaJ ) ( S.v. 1 ( Quii )

r&zn
Nunber of Levels Draun:

"

_1_

S
V
it

""

5
.yuwtoa

~-

s
;

.supply a up n t * rail

pertor

grid
noraal over f U M ly f a i l s

diesel

xrgcncy upplu f a l l i

c>

^ ^

<^

Figure 3. Fault tree for pump failure to start.

136 The symbols that are used commonly in fault trees are summarised in Figure 4. Once the fault tree has been constructed, one could ask which are the combinations of primary and undeveloped events that cause the Top event. This question can be answered if the fault tree is translated in a logical expression on which the rules of Boolean algebra can be applied (see also 'Fundamentals of reliability in this volume). Let '+' denote the boolean 'OR' operator and '*' the boolean 'AND' operator. To illustrate the logical analysis, consider a (hypothetic) fault tree as represented in Figure 5. The logical expression for the Top event can be deduced by expanding the events (i.e. substituting events by their causes) starting from the top (top-down) or starting at the primary events (bottom-up). For the fault tree of Figure 5 the top-down approach gives: Top = = = = = = = G1+G2 A*G4+Q*G5 A*(I+L+G8)+Q*(S+T) A*I+A*L+A*G8+Q*S+Q*T A*I+A*L+A*G11*G12+Q*S+Q*T A*I*A*L+A*(M+N)*(0+P)+QS+Q*T A*I+A*L+A*M*0+A*M*P+A*N*0+A*N*P+Q*S+Q*T

The final expression is called a 'disjunctive form' of the Top event because it is a disjuncton of event combinations that lead to the Top event. The bottom-up approach would give:

QG5=Q*S+Q*T TOP = A*I+A*L+A*M0+A*M*P+A*N*0+A*N*P+Q*S+Q*T The fault tree in Figure 5 did not contain any repeated events. This simplified the analysis. Figure 6 gives another fault tree that contains repeated events (A,I and Q are repeated).

= = = = = G5 = G2 =

Gil G12 G8 G4 Gl

M+N 0+P
(M+N)*(0+P)=M*0+M*P+N*0+N*P I+L+M*O+M*P+N*0+N*P A*G4 A*l+A*L+A*M*0+A*M*P+A*N*0+A* N*P

S+T

137

OR gate

exclusive OR (EOR) gate

majority voting (K out of N) gate

AND gate

O O
/ \

primary event

non developed event

transfer triangle

NOT gate

Figure 4. Symbols used in fault tree technique.

138
Top

J
.Gl
.G2

H
.G4

.GS

T7 7


.Gli

^7
.G8

S
.G12

TJ

S
Top = G1+G2 = = = = = =

S
T7

17

77

Figure 5. Sample fault tree without repeated events.

Applying the top-down method to the tree of Figure 6 yields:

Top = G1+G2
    = A*G4+Q*G5
    = A*(I+L+G8)+Q*(S+A)
    = A*I+A*L+A*G8+Q*S+Q*A
    = A*I+A*L+A*G11*G12+Q*S+Q*A
    = A*I+A*L+A*(M+I)*(Q+I)+Q*S+Q*A
    = A*I+A*L+A*M*Q+A*M*I+A*I*Q+A*I*I+Q*S+Q*A

The last expression can be simplified by using the rules of the boolean algebra. Indeed, the combinations A*M*Q, A*M*I, A*I*Q and A*I*I can be deleted because they are already included in A*I and Q*A. Therefore, the final result yields:

[Fault tree diagram: same structure as Figure 5, but with G5 = S OR A, G11 = M OR I and G12 = Q OR I, so that the events A, I and Q appear in more than one place.]
Figure 6. Sample fault tree with repeated events. Top = A*I+A*L+Q*S+Q*A Each combination in this expression obtained after elimination is called a minimal cut set (MCS). If no repeated event are present, minimisaton is not necessary and the combinations in the disjunctive form are already MCS's. In the following section, the exact definition of MCS and a more rigorous treatment of the logical analysis will be given. 2.2. Mathematical representation of fault trees A logic diagram like a fault tree can be represented in a mathematical way by its structure function. Let = , , , ,..., \ be the set of primary events of a , ( 1 2 3 4 ni fault tree.

140 To each primarv event associated such that: 1 if event E E , a binary i indicator variable y can be

/ i=^

occurs

0 if event E. does not occur i > A vector Y = (y.iy_iy y ) can be used to indicate which pri l 2 3 mary events occur and which.don't. The binary indicator variable of the Top event can than be written as a function of the vector Y: 1 if the Top event occurs

(? < )
0 if the Top event does not occur This function is called the structure function of the fault tree. The vector Y is time dependent as at any time some event may have oc curred (e.g. components failed) and other events have not. To illustrate the concept of the structure function, consider a single boolean operator operating on the set of the primary events. The structure function is then: 1. If the operator is an 'AND' gate:

() = AND

y i=l

in which is the symbol for the logical 'AND' operator. This can also be written in algebraic notation as:

*
AND

(Y) = y.

i=l 2. If the operator is an 'OR' gate:

VY) = v
i=l

in which V i s the symbol for the logical 'OR' operator. This can also be written in algebraic notation at:

11 4

*0R(Y) = (iyJ
i=l The expressions obtained in the previous section were structure functions in logic notation. In general, a structure can be coherent or noncoherent. A coherent structure is a structure for which: 1. The structure function is monotone increasing with the vector Y i.e. : *(yn .y
i

y.=o,...,y ) < *(y_



l i

y.=i

y )

for all y with i=l,2,...,n. (This means in practical terms that the system must in no way become 'better' if some failure occurs); * 2. Each indicator variable of Y is relevant: i.e. there is no y. such 1 that: *(y, .yo..".y.=0 y )= *(y. ,...,y.=i y ) 1 2 n i l for all y. with j=l,2 il,i+l n. (This means in practice that every failure must have some relevance for the system). Fault trees in which the 'NOT' operator appears, do not fullfill the first conditions and, hence, are not coherent. The analysis of noncoherent fault trees is more complicated than the analysis of coherent trees. In most practical cases, the use of the 'NOT' operator, either direct or hidden in e.g. 'EXCLUSIVE OR' gates, can be avoided or the 'ANDORNOT' logic can be written in an approximate 'ANDOR' logic. Therefore, in this course, the analysis of noncoherent trees will not be treated and the discussion will be limited to coherent trees only. Coherent fault trees can be always represented by a special form of their structure function called the SumofProducts (sop) form: this is a disjunction of all the minimal cut sets of the tree. Let us first define more precisely what is meant by a (minimal) cut and path set. Consider a vector X containing indicator variable values such that ()=1. Such a vector is called a cut vector, the set of primary events for which .=1 in the cut vector is called a cut set. A minimal cut set . is a cut set that does not contain any other

cut set. In a more practical sense, a MCS is a combination of events whose occurrence is necessary and sufficient to cause the Top event. Analogously, a path vector X is defined such that \phi(X) = 0. The set of primary events for which x_i = 0 in the path vector is called a path set. A minimal path set is a path set that contains no other path set. In a more practical sense, a path set is a set of events whose non-occurrence guarantees the non-occurrence of the Top event, and a minimal path set is a combination of events whose non-occurrence is necessary and sufficient to guarantee the non-occurrence of the Top event.

Now the structure function can be expressed in terms of minimal cut sets and minimal path sets.

Let K = {K_1, K_2, K_3, ..., K_s} be the set of the s minimal cut sets of a fault tree. The structure function of a MCS can be written as:

k_j = \bigwedge_{i \in K_j} y_i

with i \in K_j: for all i such that event E_i appears in MCS K_j. In algebraic notation:

k_j = \prod_{i \in K_j} y_i

Let P = {P_1, P_2, P_3, ..., P_u} be the set of the u minimal path sets (MPS's) of a fault tree. The structure function of a MPS can be written as:

p_j = \bigvee_{i \in P_j} y_i

In algebraic notation:

p_j = 1 - \prod_{i \in P_j} (1 - y_i)

Now it can be demonstrated that any coherent structure function \phi(Y) can be written as follows:

\phi(Y) = \bigvee_{j=1}^{s} k_j = \bigwedge_{j=1}^{u} p_j
with s the total number of MCS's and u the total number of MPS's. In algebraic notation:

\phi(Y) = 1 - \prod_{j=1}^{s} (1 - k_j) = \prod_{j=1}^{u} p_j

The first identity (left) is called the sum-of-products form. The determination of the minimal cut sets (and, less commonly, minimal path sets) of a fault tree is called the logical analysis of the tree. The minimal cut sets themselves are already a valuable result as they indicate the combinations of events leading to the system Top event. The number of events appearing in a MCS is called the order of the MCS. MCS's of order 1 are thus single events causing the Top event. The logical analysis of fault trees can, in real cases, be quite complex. It is not uncommon that a fault tree has millions (or even billions) of cut sets. Therefore computerised tools are necessary to perform the logical analysis. They will be discussed in the last chapter of this course note.

2.3. Quantitative analysis of fault trees

2.3.1. Unavailability calculation. It is possible to obtain quantitative information from the fault tree if the primary events are quantified as follows: let Q_i(t) = P(y_i = 1) = E(y_i) be the probability that event E_i is true at some
instant of time, shortly called the unavailability of event E. (E is the expectation operator: E(y )=ey P(y )). i i i Let Q (t) be the unavailability of the (system) Top event, s Let QK (t) be the unavailability of the MCS K . J j Let QP (t) be the unavailability of the MPS . J j The occurrence probability (unavailability) of the Top event can be calculated by taking the expected value of the sop form of the structure function: s

Q (t) = E (Y(t)) = E V

y.(t)
*

j=l ieK. J

. Y(t) indicates the time dependence of the vector Y. This time depen dence will further be assumed implicitly. The former expression can be worked out if all primary events in the fault tree are statistically independent.

In that case: E(A+B) = E(A) + E(B) and E(A*B) = E(A)*E(B). Independence for events regarding failure of components means in practice:
1. the failure of a component must not influence the failure probability of some other component;
2. the probability that a component is repaired in some time t must be independent of the state of other components; this means e.g. that the repair of a component must not force the operators to change the state of other components (e.g. stop some operating component) or that the repair must not lead to some delay in the repair of other components (no queuing).
The sop form (in algebraic notation) can be expanded into a polynomial form on which the expectation operator can be applied. Such an exact calculation of the Top event unavailability is only possible in very simple cases. If we consider again the logical expression of the fault tree in Fig. 6:

Top = A*I+A*L+Q*S+Q*A

This would be in algebraic notation:

Top = 1 - (1 - A*I)(1 - A*L)(1 - Q*S)(1 - Q*A)
    = A*I + A*L + Q*S + A*Q - A*Q*S - A*L*Q - A*I*Q - A*I*L + A*I*L*Q

Taking the expectation:

E(Top) = E(A)E(I) + E(A)E(L) + E(Q)E(S) + E(A)E(Q) - E(A)E(Q)E(S) - E(A)E(L)E(Q) - E(A)E(I)E(Q) - E(A)E(I)E(L) + E(A)E(I)E(L)E(Q)

in which E(X) = P(X=1) is the unavailability of event X. In all but the simplest cases such a calculation would be too onerous and it is preferred to calculate bounds for the Top event unavailability.
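For a tree as small as this one the exact value can still be obtained by brute force, e.g. by enumerating all states of the independent primary events. The following sketch does that for the Figure 6 example; it is only an illustration, and the primary event unavailabilities are invented.

```python
# Exact Top event unavailability for the Figure 6 example by state enumeration,
# compared with the simple sum of the MCS unavailabilities.
from itertools import product
from math import prod

mcs = [{"A", "I"}, {"A", "L"}, {"Q", "S"}, {"Q", "A"}]
q = {"A": 0.01, "I": 0.02, "L": 0.03, "Q": 0.01, "S": 0.05}   # assumed unavailabilities

exact = 0.0
for states in product([0, 1], repeat=len(q)):
    y = dict(zip(q, states))                        # one realisation of the vector Y
    p = prod(q[e] if y[e] else 1.0 - q[e] for e in q)
    if any(all(y[e] for e in cut) for cut in mcs):  # phi(Y) = 1: at least one MCS occurs
        exact += p

approx = sum(prod(q[e] for e in cut) for cut in mcs)   # sum of MCS unavailabilities
print(f"exact: {exact:.6e}   sum of MCS's: {approx:.6e}")
```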

Frequently used bounds are given by:

\prod_{j=1}^{u} E(p_j) \le Q_s(t) \le 1 - \prod_{j=1}^{s} (1 - E(k_j))

in which:

E(p_j) = 1 - \prod_{i \in P_j} (1 - E(y_i)) = Q_{P_j}

and:

E(k_j) = \prod_{i \in K_j} E(y_i) = Q_{K_j}

These bounds are called the minimal cut set (or Barlow-Proschan) upper bound, respectively the minimal path set lower bound.

The inclusion-exclusion principle gives the probability for a union of independent events as e.g.:

P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)

By applying this principle to the minimal cut sets, the following bounds can be obtained for the Top event unavailability:

\sum_{j=1}^{s} E(k_j) - \sum_{i=2}^{s} \sum_{j=1}^{i-1} P_{ij} \le Q_s(t) \le \sum_{j=1}^{s} E(k_j)

in which P_{ij} is the occurrence probability of the intersection of MCS K_i and K_j.

The upper bound is very easy to calculate. It neglects the probability of more MCS's occurring at the same time, which is sensible in highly reliable systems, and is often called the 'rare event approximation'. It has to be noted that the inclusion-exclusion principle offers the possibility to calculate more precise bounds if probabilities of intersections of more than two MCS's are calculated. This may prove to be very time consuming if the fault tree has many MCS's.

As an example to illustrate the bounds, consider a fault tree for which the list of MCS's is:

K_1 = A*B
K_2 = A*C
K_3 = B*C

Let the unavailability of each primary event (A, B and C) be 5.0E-2; then the unavailability of each MCS is 2.5E-3. The application of the inclusion-exclusion bounds yields:

Q_u = \sum_{i=1}^{3} Q_{K_i} = 7.5E-3

and

Q_l = Q_u - P(K_1 K_2) - P(K_1 K_3) - P(K_2 K_3)
    = 7.5E-3 - P(ABC) - P(ABC) - P(ABC)
    = 7.5E-3 - 3 (5.0E-2)^3 = 7.125E-3

The Barlow-Proschan bound would yield:

Q_s \le 1 - \prod_{i=1}^{3} (1 - Q_{K_i}) = 1 - (1 - (5.0E-2)^2)^3 = 7.48E-3

The probability that the n−1 events have occurred at time t is:

    ∏_{n∈K_j, n≠i} Q_n(t)

The probability that event i occurs during (t, t+dt) is w_i(t)dt. Hence:

    w_{K_j}(t) = ∑_{i∈K_j} w_i(t) ∏_{n∈K_j, n≠i} Q_n(t)

The expected number of failures (ENF) is given by the integration of the failure intensity over the interval (0, t). Hence, for the ENF of a MCS:

    W_{K_j}(t) = ∫_0^t w_{K_j}(t') dt'

Bounds for the failure intensity and ENF of the Top event can be obtained by applying the inclusion-exclusion principle. An upper bound for the failure intensity, resp. ENF, of the Top event is given by:

    w_s(t) ≤ ∑_{j=1}^{s} w_{K_j}(t)

    W_s(t) ≤ ∑_{j=1}^{s} W_{K_j}(t)
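Returning to the numerical example of the previous subsection, the unavailability bounds can be reproduced with a short script. This is only an illustrative sketch: the event names and unavailabilities are those of the three-cut-set example above, and independence of the primary events is assumed.

```python
# Sketch: rare-event upper bound, second-order lower bound and Barlow-Proschan
# bound for the minimal cut sets {A,B}, {A,C}, {B,C}, each event having q = 5.0E-2.
from itertools import combinations

q = {'A': 5.0e-2, 'B': 5.0e-2, 'C': 5.0e-2}      # primary event unavailabilities
mcs = [{'A', 'B'}, {'A', 'C'}, {'B', 'C'}]       # minimal cut sets

def prob(events):
    """Occurrence probability of a set of independent events."""
    p = 1.0
    for e in events:
        p *= q[e]
    return p

q_mcs = [prob(k) for k in mcs]                   # 2.5E-3 each

upper = sum(q_mcs)                                                # 7.5E-3
lower = upper - sum(prob(k1 | k2) for k1, k2 in combinations(mcs, 2))  # 7.125E-3

bp = 1.0
for qk in q_mcs:
    bp *= (1.0 - qk)
bp_upper = 1.0 - bp                                               # approx. 7.48E-3

print(upper, lower, bp_upper)
```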

2.3.3. Calculation of the unreliability. The calculation of the unreliability of the Top event F_s(t) in the case of repairable events is not an easy task. It can be shown, however, that the ENF of the Top event is an upper bound for the unreliability. Another upper bound for the Top event unreliability is given by the sum of the unreliabilities of the MCS's:

    F_s(t) ≤ ∑_{j=1}^{s} F_{K_j}(t)

The unreliability of a minimal cut set can be determined by e.g. performing Markov analysis on the events in the MCS. The following upper bound can be deduced for the unreliability of a minimal cut set if both failure and repair times are exponentially distributed (constant failure rate λ and constant repair rate μ):

    F_{K_j}(t) ≤ 1 − exp[ −( ∑_{i∈K_j} μ_i ) Q_{K_j}(∞) t ]

with:

    Q_{K_j}(∞) = ∏_{i∈K_j} λ_i / (λ_i + μ_i)

2.3.4. Importance calculations. The importance of a primary event or of a MCS is the relative contribution the event or the MCS delivers to the Top event unavailability. The importance of MCS K_j is given by:

    I_{K_j} = Q_{K_j}(t) / Q_s(t)

and the importance of a primary event i by:

    I_{E_i} = ∑_j δ_{ij} Q_{K_j}(t) / Q_s(t)

in which δ_{ij} = 1 if component i belongs to MCS K_j, and 0 otherwise. Other measures of importance of primary events and MCS's have been proposed. For a more detailed discussion the reader is referred to the literature.

2.3.5. Special problems. In this section, some special problems related to fault tree analysis are shortly introduced. The detailed discussion of the topics in this section would go beyond the scope of this introductory course. For more information, the reader is referred to the literature and/or other courses on the respective topics.

Phased missions. A system can be said to perform a mission. If the system configuration changes during the performance of the mission, to each configuration a phase is associated and the mission is said to be a 'Phased Mission'. Since components might be used in different phases of a phased mission, the phases are in general not mutually independent. In that case, special techniques must be used to calculate the probability of mission failure. Examples of phased missions are, for instance, missile systems in which the flight consists of different stages, or emergency core cooling systems in nuclear power plants which normally have injection and recirculation phases.

Common cause failures (CCF's). The expressions given above for calculating the Top event unavailability and other system parameters are all based on the assumption of complete mutual independence between the primary events. A common cause failure is a multiple component failure in which different components are failed by a same underlying cause. Consider a MCS in which three events occur, each event related to the failure of some different component. If a CCF may hit two or more of the components, the events in the MCS are no longer completely independent. Hence, CCF must be in some way taken into account in the fault tree. The underlying cause of the CCF can in some cases be clearly identified and associated to e.g. failure of some other component or unavailability of some support function. In such cases the cause itself can be explicitly modelled: i.e. the cause is included in the fault tree as a primary event, thus removing the dependence introduced by it. In some other cases, explicit modelling is not possible, either because the cause cannot be clearly identified and associated to a primary event, or because the cause is already accounted for in the data which are used for the component failures, or because the scope of the analysis is such that explicit modelling would lead to too much detail. In such cases, parametric modelling can be performed. The various multiple failure events are modelled in the fault tree as shown in Figure 7. Different parametric models such as the Beta factor model, the Multiple Greek Letter model, the Binomial Failure Rate model or the Basic Parameter model can be used to quantify the multiple failure events in the tree.

Figure 7. Modelling of multiple failure combinations: failure of units A, B and C; dotted lines indicate how specific common causes are shared.

Uncertainty calculation. The reliability parameters (failure rates, failure on demand probabilities) have no deterministic fixed values, but can rather be described by distributions. In many cases, the lognormal distribution is chosen for describing the variability of the reliability parameters. The lognormal distribution can be characterised by two parameters: often the median and the error factor (defined as the ratio of the 95th percentile over the median) are used. In order to calculate the distribution of the Top event unavailability (or of other results), the propagation of the uncertainty of the parameters of the primary events should be carried out.
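A minimal sketch of simulative propagation is given below. The system (two minimal cut sets {A,B} and {C}), the medians and the error factors are assumed for illustration only, and the rare-event approximation is used for the Top event.

```python
# Sketch: Monte Carlo propagation of lognormal parameter uncertainties
# (hypothetical system; error factor EF = q95 / q50).
import numpy as np

rng = np.random.default_rng(0)

def lognormal_samples(median, error_factor, n):
    # For a lognormal distribution EF = exp(1.645*sigma), hence sigma = ln(EF)/1.645
    sigma = np.log(error_factor) / 1.645
    return rng.lognormal(mean=np.log(median), sigma=sigma, size=n)

n = 100_000
qA = lognormal_samples(1.0e-3, 3.0, n)
qB = lognormal_samples(1.0e-3, 3.0, n)
qC = lognormal_samples(1.0e-4, 10.0, n)

q_top = qA * qB + qC        # rare-event approximation for each sample

print('median  :', np.median(q_top))
print('95th pct:', np.percentile(q_top, 95))
```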

Such propagation can be done by using analytical or simulative (Monte Carlo) methods, which will be shortly presented in the section on computer codes. Many times the propagation of the uncertainties is omitted and only a point value of e.g. the Top event unavailability is calculated. This can lead to erroneous conclusions with respect to the system unavailability or other parameters.

2.3.6. Tasks in carrying out Fault Tree Analysis. In general, fault tree analysis of a system will consist of the following distinct phases:
1. system documentation and preliminary (qualitative) analysis;
2. system modelling: construction of the fault tree(s);
3. quantification of the fault tree(s);
4. logic and probabilistic analysis;
5. uncertainty evaluation.

During the first phase, the system design and operation are thoroughly examined. The success and failure criteria of the system are defined as accurately as possible, e.g. making use of the design basis calculations. The system external boundaries are defined and documented, as well as the boundary conditions: e.g. the states assumed for the interfacing systems. The internal boundaries are defined provisionally according to the aim and scope of the analysis. These resolution limits can later be modified, e.g. during quantification. It is very worthwhile to perform a Failure Modes and Effects Analysis (FMEA) in the first phase of the fault tree analysis. In an FMEA every component of the system is considered in a systematic way. For each component the failure modes are identified and the effect of these failure modes on other components and on the system is analysed. Use can be made of tables (Figure 8). The knowledge of the system acquired by performing an FMEA will prove to be very valuable during the system logic modelling: the systematic analysis of all components with their failure modes and effects helps to ensure that important failure modes are not overlooked and that unimportant failure mechanisms are not modelled in too much detail. During the second phase, the fault tree(s) are constructed for the Top event(s) identified during the preliminary analysis. It is important that the construction proceeds in an orderly and systematic way: the Top event is analysed into its immediate causes, then for each of these the proper immediate causes are determined, etc. The procedure continues until the limit of resolution is reached. In the third phase, the primary events are quantified.

Figure 8. Table for Failure Modes and Effects Analysis. For each system function, phase and state, the table records per component: the technical specification for the component, operational parameters, environment and topology, test and maintenance frequency and policy, state, functional elements, demands, failure mode, failure cause, failure detection possibility, corrective actions, effects on the system, effects on interfacing systems, overall effects and notes.

The following parameters might be supplied:
1. for component failures: the on-demand failure probability (fixed unavailability), the failure rate, repair rate and possibly inspection interval;
2. for human errors: the occurrence probability of the error, taking into account the recovery possibilities. An accurate estimation of such probabilities may only be obtained by a separate human reliability analysis. How this is performed is outside the scope of this course. It is possible, however, to assume arbitrary (but realistic) values and to perform sensitivity analysis in order to get useful results;
3. for other events: occurrence probabilities derived from operating experience (case histories).

The logic and probabilistic analysis is in most cases performed by using computer codes for determining MCS's and calculating the system unavailability bounds (or other parameters). In order to have a correct understanding of the quantitative results, it is necessary to perform the uncertainty calculation, i.e. to determine the distribution of the results or at least estimate the confidence bounds.

3.

EVENT TREES

Event tree analysis is an inductive method which is used frequently in risk assessment. Event tree analysis offers a systematic way to identify the various accident sequences that may follow a certain initiating event and to quantify the probability of each of the accident sequences. Figure 9 gives a simple example of an event tree describing the accident sequences that may follow the initiating event 'rupture of a large primary circuit piping' in a Pressurized Water Reactor. In order to construct an event tree for a given initiating event, first all safety functions and systems that can be used to mitigate the outcome of the initiating event are identified. These safety functions and systems appear as the headings; the failure or success of each of them may give rise to branching. Each path from the initiating event through the branches to the various outcomes depicts a possible scenario or event sequence. In theory, if there are n safety functions or systems, each of which may fail or succeed, 2^n event sequences are possible. In practice, dependencies may exist between the success or failure of different functions or systems (e.g. dependence on the availability of electric power in Fig. 9), so that the number of meaningful event sequences may be less.

Figure 9. Simplified event tree for the initiating event 'pipe break', with headings for electric power, ECCS, fission product removal and containment integrity; the outcomes range from a very small release to a very large release.

The quantification of the event sequences, in order to calculate the probability of each event sequence, can be carried out by determining the failure and success probabilities of the functions and systems. These probabilities are conditional probabilities: they are conditional to the initiating event and to the success and failures of other systems in the event sequence. Consider e.g. the event tree in Figure 10: q_2 is the probability that system S_2 fails under the condition that system S_1 did not fail; q'_2 is the probability that system S_2 fails under the assumption that S_1 failed. If the systems S_1 and S_2 are independent then q_2 and q'_2 are equal. In general this is not the case and, hence, care must be taken during the quantitative analysis of event trees.
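The conditional quantification of the sequences of Figure 10 can be sketched as follows. The initiating event frequency and the conditional failure probabilities are assumed values for illustration, not taken from the text.

```python
# Sketch: quantification of the four sequences of a simple event tree,
# with a different S2 failure probability depending on the state of S1.
f_ie = 1.0e-2      # initiating events per year (assumed)
q1   = 1.0e-3      # P(S1 fails | initiating event)        (assumed)
q2   = 1.0e-2      # P(S2 fails | S1 succeeded)            (assumed)
q2p  = 1.0e-1      # P(S2 fails | S1 failed), dependence   (assumed)

sequences = {
    'consequence 1': f_ie * (1 - q1) * (1 - q2),
    'consequence 2': f_ie * (1 - q1) * q2,
    'consequence 3': f_ie * q1 * (1 - q2p),
    'consequence 4': f_ie * q1 * q2p,
}
for name, freq in sequences.items():
    print(f'{name}: {freq:.3e} per year')
```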

Figure 10. Sample event tree: the initiating event is followed by the success or failure of systems S_1 and S_2, leading to consequences 1 to 4 with occurrence probabilities (1−q_1)(1−q_2), (1−q_1)q_2, q_1(1−q'_2) and q_1 q'_2, respectively.

Dependences between systems are often caused by common support functions or systems (power supply, compressed air, cooling, control). Two approaches to event tree analysis are currently used:
1. In the 'large event tree - small fault tree' approach, the frontline systems performing safety functions and the support systems necessary for the functioning of the frontline systems appear explicitly. This allows dependencies to be modelled explicitly in the event tree. The fault trees are 'small' because they do not have to model states of the support functions and systems. This method is also called 'event tree with boundary conditions', since fault trees have to be developed for the frontline system under various boundary conditions imposed by (partial) failure or success of the various support systems. For example, for a power supply system with two busbars, the following situations may occur: both busbars function, both busbars fail, busbar 1 functions and 2 fails, busbar 2 functions and 1 fails. For the systems which need the power supply, four fault trees will have to be constructed: one for each of the boundary conditions related to the busbar failure or

success situations. The event trees may become very complicated in real cases. Once the (large) event tree and the fault trees are constructed, the quantification of an accident sequence is rather straightforward, since the fault trees can be assumed to be independent.
2. The 'small event tree - large fault tree' approach is based on complete fault trees, including support system failures. Hence, the fault trees appearing in the event tree are in general not independent. An accident sequence can be quantified by putting the trees concerning its branches as inputs to an 'AND' gate (fault tree linking). Since in an accident sequence success of some systems is assumed, these trees are not only fault trees but may also be the complement of a fault tree: a so-called dual fault tree. Inclusion of the dual fault tree for systems that are assumed to function in any particular event sequence will eliminate minimal cut sets of the event sequence in which failure causes appear of a system that is defined to be in a success state. Fault tree linking may lead to very large fault trees containing complemented events. Such trees cannot be analysed easily. It has to be noticed that due to the presence of complemented events the linked tree of an accident sequence is non-coherent. An alternative to the dual fault tree approach is 'MCS list matching'. It is based on the fact that a MCS of a system failure in an event sequence that is also a cut set of a system assumed to be in the success state in that event sequence should be removed.

4.

USE OF COMPUTER CODES IN SYSTEMS ANALYSIS

4.1. Tasks in which computer codes can be used Systems analysis is a quite effort intensive task. Event tree and fault tree analysis involve the construction and analysis of complex logical structures and the handling of huge amounts of data. The tasks for which computer codes can usefully be applied are: 1. logic and probabilistic analysis of fault trees; 2. calculation of the uncertainties; 3. system modelling: i.e. construction of the logic model; 4. fault tree drawing; 5. analysis of event trees; 6. analysis of common cause failures; 7. Markov analysis; 8. reliability parameter estimation.

The first three points will be discussed in this note. The discussion of the other applications of computer codes falls outside the scope of this short introductory course.

4.2. Logic and probabilistic analysis of fault trees

The logical and probabilistic analysis of fault trees can only in very simple cases be performed by hand calculation. As soon as the trees become a little more complex, the number of (minimal) cut sets becomes too large and use must be made of a computer. The logical analysis of fault trees, i.e. the determination of the minimal cut sets, is part of a general class of problems for which no really fast algorithm can be constructed: the calculation time grows (exponentially or faster) with the size of the problem. For large fault trees, the calculation times can be considerable if no special measures are included in the fault tree code. A possibility to reduce the computer time is the modularisation of the tree. The splitting of the tree into a number of subtrees, to be analysed separately, can be efficient if these subtrees do not have too many interaction points. The best splitting is obtained when the subtrees are modules which are independent from each other: i.e. the subtrees do not contain events which are repeated in other subtrees (repeated events may exist within the module). Another possibility to limit the calculation time is to apply cut-off rules for limiting the number of minimal cut sets determined. Two kinds of cut-off can be applied:
1. logical cut-off: a limit is set on the order of the minimal cut sets to be determined; all minimal cut sets of order higher than the threshold will be neglected;
2. probabilistic cut-off: a probability threshold value can be specified; all minimal cut sets whose probability is lower than the threshold value will be disregarded.
The application of the logical cut-off only can lead to erroneous conclusions: some higher order cut sets may have substantial occurrence probabilities, while low order cut sets containing e.g. passive failures may have low probabilities. The first fault tree codes determined the minimal cut sets by using a combinatorial method. This results in very high computer times for analysing complex trees. In later codes a top-down or bottom-up approach is used: starting from the top or from the bottom of the tree, each gate is replaced by its cut sets. In the recent codes, modularisation and application of cut-off rules are implemented.
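The top-down replacement of gates by their cut sets can be sketched in a few lines. The gate table below is hypothetical and the sketch omits cut-off rules and modularisation; it only illustrates the principle of expansion followed by minimisation.

```python
# Sketch of a top-down (MOCUS-like) determination of minimal cut sets.
gates = {
    'TOP': ('OR',  ['G1', 'G2']),     # hypothetical fault tree
    'G1':  ('AND', ['A', 'B']),
    'G2':  ('AND', ['A', 'C']),
}

def expand(name):
    """Return the cut sets (frozensets of primary events) of a gate or event."""
    if name not in gates:                          # primary event
        return [frozenset([name])]
    kind, children = gates[name]
    child_cs = [expand(c) for c in children]
    if kind == 'OR':                               # union of the children's cut sets
        return [cs for lst in child_cs for cs in lst]
    cut_sets = [frozenset()]                       # AND: combine one cut set per child
    for lst in child_cs:
        cut_sets = [cs | other for cs in cut_sets for other in lst]
    return cut_sets

def minimise(cut_sets):
    """Drop non-minimal cut sets (those containing another cut set)."""
    unique = set(cut_sets)
    return [cs for cs in unique if not any(other < cs for other in unique)]

print(minimise(expand('TOP')))    # e.g. the minimal cut sets {A,B} and {A,C}
```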

The possibilities of the codes differ with respect to the probabilistic models:
1. consideration of staggered testing or not;
2. possibility to calculate phased missions;
3. different component failure detection and repair policies considered: e.g. on-line detection, periodically inspected components, ...;
4. allowing to model stand-by components;
5. calculation of importance measures.
In the appendix, short summary information sheets of the SALP-MP and SALP-PC codes are given.

4.3. Uncertainty calculation

Together with the fault tree codes, codes have been developed for performing the uncertainty analysis on the results: i.e. for calculating the uncertainty on the various resulting system parameters such as the system unavailability or the expected number of failures. In general, those codes can be based on:
1. Monte Carlo simulation: in which sampling is performed from the different distributions of the reliability parameters of the primary events;
2. analytical calculation of e.g. variances;
3. systematic combination of random variables: the primary event distributions are approached by histograms, and the distribution of any analytical function of random variables is obtained by systematic combination of their histograms.
The analytical approach is limited to certain types of distributions, while the other approaches can in principle be used for any shape or type of distribution.

4.4. Computer codes for system modelling

With the increasing popularity of the fault tree technique, and the application to more and more complex systems, there is a growing interest in computer codes which could be used to reduce the important effort needed to construct the fault trees. Indeed, while the analysis of the trees was already from the beginning performed by computer, the construction mainly remained a manual task involving a considerable amount of effort and skill. To reduce the mechanical subtasks, such as data transcription, and to reduce the likelihood of making errors, integrated software systems for fault tree drafting, manipulation and analysis were developed.

Such codes allow the analyst to interactively draw fault trees on a screen and include facilities for modification of the trees, introduction of reliability parameters, passing the tree description to the analysis program, etc. They provide a natural engineering environment for performing fault tree analysis or sensitivity studies, without however assisting the analyst in the system modelling itself. Many attempts have been made to develop automatic fault tree construction codes but, although they can be very successfully used in some cases, a general purpose automatic fault tree construction program is not yet available. The problem of fault tree construction is not an easy one to solve and requires a deep understanding of the system to be analysed. It is very difficult to formalise all the information and knowledge on the system in such a way that it can be used by a computer code to construct the fault tree. Moreover, the analyst uses quite a lot of judgment and experience in modelling the system: e.g. to make motivated simplifications or to go deeper into detail on some critical issues. It has to be noticed that fully automatic codes for fault tree construction reduce one of the major benefits of fault tree analysis, namely the deep understanding of the system behaviour that the analyst gains while modelling the system. On the other hand, by using automatic methods it is possible to guarantee completeness and correctness according to the underlying assumptions. Automatic models also decrease the likelihood of modelling errors and increase the comparability of different analyses. The conclusion is that automatic models can very usefully be applied to parts of the system to be analysed and that there should be a mechanism for combining automatic modelling with the application of expert knowledge and judgment from the analyst. An interactive approach, as used in computer aided design (CAD), seems to be indicated. In the appendix, a short summary information sheet of the CAFTS code is given. This code can be used to interactively construct fault trees starting directly from a description of the system Piping & Instrumentation diagram. The code also features interactive fault tree drafting and editing capabilities. For the documentation (drawing of fault trees) a companion code (MAPLE) is used. An information sheet on MAPLE is given in the appendix (the fault tree examples in this paper were constructed or drafted with CAFTS and drawn with MAPLE).

REFERENCES

/1/ Haasl D. and Roberts N., 'Fault Tree Handbook', U.S. Nuclear Regulatory Commission, NUREG-0492 (1978).

/2/ Henley E. and Kumamoto H., 'Reliability Engineering and Risk Assessment' (Prentice Hall, N.J. 1981).

/3/ U.S. Nuclear Regulatory Commission, 'PRA Procedures Guide, A Guide to the Performance of Probabilistic Risk Assessments of Nuclear Power Reactors', NUREG/CR-2300 (1982).

/4/ Astolfi M., Clarotti C., Contini S. and Picchia F., 'SALP-MP: A Computer Program For Fault Tree Analysis of Complex Systems and Phased Missions', EUROCOPI report No. 12, JRC Ispra (1980).

/5/ Poucet A., 'CAFTS: Computer aided fault tree analysis', ANS/ENS Int. Top. Meet. on Probabilistic Safety Methods and Applications, San Francisco, February 24 - March 1 (1985).

/6/ Poucet A., 'MAPLE: A computer code for plotting fault trees, description and how to use', EUROCOPI report No. 15 (Oct. 1981).

/7/ Jaarsma R. and Colombo A., 'SCORE: A Computer Program for the Systematic Combination of Random Variables', EURATOM report 6819 EN (1980).

/8/ Contini S. and Poucet A., 'SALP-PC: A Computer program for fault tree analysis on personal computers', ANS/ENS Int. Top. Conference on Probabilistic Safety Assessment and Risk Management (Zurich, Aug. 30 - Sept. 4, 1987).

161

APPENDIX

Code name: CAFTS Year: 1986

Organization: JRC, Ispra Italy

Problem Solved. Computer aided construction of fault trees starting from P&I diagrams. Editing and manipulation of fault trees through graphics interface. Method Used or Short Description. CAFTS is an interactive software package that allows to construct modify and manipulate fault trees and that is interlinked with SALP-MP for the logic and probabilistic analysis of the fault trees. The computer aided construction of a fault tree for a system TOP event is performed in two phases: 1. In a first phase, a high level fault tree (Macro Fault Tree: MFT) is generated automatically. This is performed by an expert system approach using a knowledge base containing production rules on generic component behaviour and by backward chaining from the state(s) in the TOP event through the rules. In this process, the system P&I diagram and other relevant system description are prompted to the user in a stepwise fashion. 2. In a second phase, the MFT is expanded in, a fully detailed and quantified fault tree by using a library of Modular Component Models (MCM) that describe the detailed causes of failure for standard types of components and subcircuits. The MCM's contain small data bases with the reliability parameters of the primary events they produce. As a result, the fault tree is foreseen of quantitative data after the expansion. A powerful graphic fault tree editor is built in allowing to display and modify the fault trees built and/or to interactively construct fault trees for (parts of) system for which no automatic modelling is desirable or possible (e.g. due to the fact that no rules or MCM's for some particular components are stored).

162 The code is interfaced with the SALP-MP code so that the constructed fault trees can be analysed on-line allowing to perform sensitivity analysis in a very efficient way. Features. Interactive package, fully menu driven. Fast graphic fault tree editor. Limitations. The production rules and MCM's have been developed for fluid systems and mechanical components. Some models and rules exist for electrical components. Input Needed. For computer aided construction, the code prompts the description of the TOP event and the P&I diagram of the system and some relevant information on components. The code accepts also previously constructed or manually constructed (sub) trees that have to be further developed, modified or analysed. As a third possibility, the code can be used as a fault tree drafting program for interactivity inputting the fault tree in a graphic form. Output Produced. System fault tree, analysis results (cfr. description of the code). Environment Information. Language: PL/1. Operating System and interactive environment: MVS/TSO or VM/CMS. Other utilities needed: ISPF and GDDM program products of IBM. Computer: IBM, AMDAHL. Contact Person(s). A. Poucet JRC-Ispra Establishment 21020 Ispra (VA) Italy

163 Code name: SALP-MP Year: 1980 Organization: JRC-SIGEN-ENEA

Problem Solved. Logical and probabilistic analysis of fault trees for single phase and

phased mission systems. Method Used or Short Description. Bottom-up approach Cut set cancellation and merging for phased mission problems. Features. Type of gates: AND, OR, K/N. Error checking on input free formatted. Sub-tops can be defined. Fault tree modularization. Boundary conditions. Comments allowed inside the fault tree description. Multiphase systems. Stand-by subsystems. Probabilistic and logical cut-offs. Estimate of the truncation error. Limitations. Up to 6 phases for phased mission systems. Number of events-gates is machine dependent. Input Needed. Fault tree description. Reliability parameters of events. Output Produced. Unavailability and Expected number of failures for significant MCSs and Top event. Estimate of the truncation error. Importance event analysis.

Environment Information. Language: PL/1. Computer: IBM 370/168.

Contact Person(s). S. Contini, JRC-Ispra Establishment, 21020 Ispra (VA), Italy

165
Code name: SALP-PC Year: 1987 Organization: JRC

Problem Solved. Logical and probabilistic analysis of fault trees for single phase. Method Used or Short Description. Bottom-up approach according to different levels of fault tree definition. Logical cut-off and/or probabilistic cut-off with/without estimation of truncation error. Features. Type of gates: AND, OR, K/N, NOT, XOR, INHIBIT. Error checking on input. Modularity: 6 processors for input, modularisation, logic analysis (2 phases), probabilistic analysis, reporting. Restart possible at each step of the analysis procedure. Boundary conditions. Limitations. Limit on size at tree and number of cut sets depending on available memory: e.g. PC with 192 kb RAM: max. 300 gates and primary events and max. 3000 cut sets up to order 10. Input Needed. Fault tree description and reliability parameters of events entered interactively. Output Produced. Unavailability and Expected number or failures for significant MCSs and Top event. Criticality of primary events. Estimate of the truncation error.

166 Environment Information. Language: FORTRAN 77 Computer: PC's and Compatibles running MS-DOS Contact Person(s). S. Contini JRC-Ispra Establishment 21020 Ispra (VA) Italy

167
Code name: MAPLE-II Year: 1986 Organization: SRC Belgium; JRC-Ispra (I)

Problem Solved. Plotting of fault trees. Method Used of Short Description. The MAPLE code produces high quality drawings of fault trees of whatever complexity starting from the structure function of a fault tree and (optional) event and gate text descriptions. The size of boxes in which the description of the events and gates will be drawn, the number and length of lines of description and character size to be used can be defined by the user or selected from preestablished standard options (2 lines of 12 characters, 3 lines of 20 characters). It is possible to lower gates in the drawing in order to change the lay-out. An automatic gate lowering facility is provided which produces a special output in which all events in the fault tree are drawn next to each other on the bottom of the sheet. For the drawing of large fault trees, MAPLE features the possibility to cut a fault tree in subtrees each to be drawn on separate sheets. It then automatically provides transfer triangle sysmbols in the places where the fault tree was cut referring to the sheet numbers. The cutting can be performed automatically based on some defined sheet format, or it can be performed on indication of the user if he likes to have particular subtrees drawn on a separate page, or both. Features. Type of gates: AND, OR, NOT, K/N, EOR, INHIBIT. Type of events: primary events, undeveloped events, protective events. Checking of logical correctness of tree. Gate lowering (optional). Cutting in subtrees (optional). Limitations. The size of the fault trees is limited only by the memory size of the computer. The version distributed has (for memory reasons) a limit that

can be easily modified. The current maxima are 60 logical levels and 200 events per logical level.

Input Needed. Fault tree structure function (a record for each gate describing the name of the gate, its type and descendants). Descriptive labels for events and gates (optional).

Output Produced. Fault tree drawing(s).

Environment Information. Language: PL/1. Graphics: CALCOMP, TEKTRONIX or GINO-F. Computer: IBM 30xx, IBM 43xx. Note: a PC version in Pascal is available.

Contact Person(s). A. Poucet, JRC-Ispra Establishment, 21020 Ispra (VA), Italy

169 Code name: SCORE Year: 1979 Organization: JRC-Ispra (I)

Problem Solved. Calculation of functions of random variables.

Method Used or Short Description. Histogram combination (method of systematic combination of equal probability intervals).

Features. Stored output and plot of histograms of intermediate and final results.

Limitations. Distributions of input random variables: uniform, triangular, log-uniform, log-triangular, normal, lognormal, exponential, gamma, point-by-point (histogram).

Input Needed. Fault tree or MCSs represented in the form of an analytical function, and distributions of the input random variables.

Output Produced. Histograms derived from input distributions, intermediate and final results.

Environment Information. Language: FORTRAN IV, PL/1. Computer: IBM 370/165.

Contact Person(s). A.G. Colombo / R.J. Jaarsma, JRC Information Analysis and Processing Division, 21020 Ispra (VA), Italy

ELEMENTS OF MARKOVIAN RELIABILITY ANALYSIS

Ioannis A. Papazoglou
Greek Atomic Energy Commission, Nuclear Technology Department, N.R.C. "DEMOKRITOS", 153 10 Aghia Paraskevi, Greece

ABSTRACT. The elements of Markovian reliability analysis are presented. Markovian models are appropriate whenever the stochastic behavior of the components of a system depends on the state of the other components or on the state of the system. Such cases arise when the system contains repairable components, when components are on warm or cold standby, when they share common loads, when the operating environment changes with time, and in certain instances of dependent failures. Markovian models are also required when the failure probability of standby safety systems must be calculated. Simple examples comparing the results of Markovian models to those obtained by other techniques are presented. An annotated literature review on Markovian reliability analysis is also provided.

1.

INTRODUCTION

This paper presents the basic principles of Markovian reliability analysis, along with an annotated literature review. Reliability analyses begin with the establishment of logic diagrams such as event trees, fault trees and cause-consequence graphs. These diagrams are essential for understanding qualitatively the operating and failure modes of the system and for identifying the operating and failed states. When the stochastic characteristics of the components of the system depend on the state of the system, the logic diagrams must be complemented by special techniques for the quantitative evaluation of the various reliability measures. In particular, when these characteristics depend only on each pair of initial and final states of the system, the technique best suited for evaluation of reliability is Markovian analysis. Examples of particular circumstances that generate dependences on the state of the system are: System with repairable components. Repair of a component may be possible only if the system is in an operating state and not if it is in a failed state. 171

172 System with different repair policies. Repair may depend on the number of repairmen available, or repair of certain components may not be possible while others are operating. In addition, several repair policies may be possible, such as repair all failed components before resuming operation or repair only those components that are necessary to resume operation. Operability of standby systems. Response to a challenge may depend on the state of the standby system. For example, for some states the standby system may respond to some challenges but not to others. Furthermore, a successful response to a challenge may reveal partial failures which if repaired could make a positive contribution to reliability. Standby redundancy. Standby failure and repair rates are in most cases different than the corresponding online rates. Common extreme environment. The failure and repair rates of components change significantly under extreme environments. Components sharing common loads. If a load is shared by several components, the failure rates of the components depend on the number of the components sharing the load. Common cause or common mode failures. Two or more components of a system may fail together because of an existing commonality. Furthermore, the failure of one component might cause that of others. The capabilities of Markovian models in reliability analyses have been recognized and extensively used. A brief review of Markovian processes and simple numerical examples of Markovian reliability analysis are given in sections two through nine. A survey of the relevant literature is given in section 10.

2. MARKOVIAN RELIABILITY ANALYSIS

2.1 Markov Processes - Definitions

A discrete-state, continuous-time random process describes the stochastic behavior of a system that can be in any of a finite number (z) of discrete states and that changes its state randomly as a function of time. A change of state state transition can occur at any instant of time. A discrete-state, continuous-time Markov process is a random process such that the probability that the system will perform a state transition from state i to state j at any time depends only on the initial state i and the final state j of the transition. If ir.(t) denotes the probability that the system is in state i at time t, and ir(t) the 1xz row vector with elements ir.(t), for i=1,2,...z, namely

    π(t) = [π_1(t)  π_2(t)  ...  π_z(t)]          (2.1)

then it can be shown that π(t) satisfies the state evolution equation given by the relation

    dπ(t)/dt = π(t) A          (2.2)

where A is a matrix with elements a_ij such that:

    a_ij dt = the probability that the system will transit to state j during the interval between t and t+dt, given that it is in state i at time t.

Vector π(t) is called the state-probability vector, with elements the state probabilities π_i(t). Matrix A is called the transition-rate matrix, with elements the transition rates a_ij. Each non-transition rate a_ii, for i = 1,2,...,z, satisfies the relation

    a_ii = − ∑_{j≠i} a_ij      or      ∑_{j=1}^{z} a_ij = 0          (2.3)

Indeed, the sum of the state probabilities

    ∑_{i=1}^{z} π_i(t) = 1          (2.4)

for all values of the π_i(t)'s and, therefore,

    ∑_{i=1}^{z} dπ_i(t)/dt = 0          (2.5)

Using Eq. 2.2 in Eq. 2.5 we find

    ∑_{i=1}^{z} [ ∑_{j=1}^{z} π_j(t) a_ji ] = ∑_{j=1}^{z} π_j(t) [ ∑_{i=1}^{z} a_ji ] = 0          (2.6)

Equation 2.6 must hold for all values of the π_j(t)'s and this, in turn, can be true if and only if Eq. 2.3 is satisfied. The solution of Eq. 2.2 is given by the relation

    π(t) = π(0) exp(A t)          (2.7)
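Equation 2.7 can be evaluated directly with a matrix exponential routine. The sketch below uses the three-state diagram of Fig. 2.1 with assumed numerical rates and an assumed initial state; it is illustrative only.

```python
# Sketch of Eq. 2.7: state probabilities from the matrix exponential.
import numpy as np
from scipy.linalg import expm

a12, a13, a21 = 1.0e-3, 5.0e-4, 1.0e-1       # assumed transition rates (per hour)
A = np.array([[-(a12 + a13), a12,  a13],
              [ a21,        -a21,  0.0],
              [ 0.0,          0.0, 0.0]])    # rows sum to zero (Eq. 2.3)

pi0 = np.array([1.0, 0.0, 0.0])              # assume the system starts in state 1
t = 100.0                                    # hours
pi_t = pi0 @ expm(A * t)                     # pi(t) = pi(0) exp(A t)
print(pi_t, pi_t.sum())                      # the probabilities sum to one
```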

where π(0) is the value of the state probability vector at time t = 0.

A discrete-state, discrete-time Markov process is a random process such that: (a) state transitions occur at discrete times t_n, where

    t_n = t_{n−1} + Δt(n)

or, with Δt(n) = constant,

    t_n = t_{n−1} + Δt

and (b) the probability that the system will perform a state transition from state i to state j at time t_n depends only on the states i and j of the transition. It can be shown that the 1×z state probability vector π(n) obeys the relation

    π(n+1) = π(n) P(n)          (2.8)

where the z×z transition probability matrix P(n) has as elements the transition probabilities p_ij(n), for i,j = 1,2,...,z.

2.2 State-Transition Diagrams

A discrete-state Markov process is customarily represented by a state-transition diagram such as sketched in Fig. 2.1. The arrows in the diagram correspond to the possible transitions of the system and are assigned the corresponding transition rates. The diagram in Fig. 2.1 depicts a system that can transit from state 1 to state 2 with a rate a_12, from state 1 to state 3 with a rate a_13, and from state 2 to state 1 with a rate a_21. Transitions from state 2 to state 3 and from state 3 to states 1 and 2 are not possible. The transition-rate matrix for the process shown in Fig. 2.1 has the form

    A = [ a_11  a_12  a_13
          a_21  a_22   0
            0     0   a_33 ]          (2.9)

where a_11 = −(a_12 + a_13) and a_22 = −a_21.

2.3 One-Component System

We will consider a system that can be in either of two states: (a) state 1, the system is operating; and (b) state 2, the system has failed. If the system is nonrepairable, a transition from state 2 back to state 1 is not possible. Then the state-transition diagram is as shown in Fig. 2.2a. The transition rate a_12 is the failure rate λ of the system. If the system is repairable, a transition from state 2 back to state 1 is possible. Then the state-transition diagram is as shown in


Fig. 2.1

Example of a state transition diagram

Fig. 2.2b. The transition rate a_21 is the repair rate μ of the system. The transition-rate matrix and the state evolution equation for the repairable system are

    A = [ −λ   λ
           μ  −μ ]          (2.10)

and

    d/dt [π_1(t), π_2(t)] = [π_1(t), π_2(t)] A          (2.11)

respectively. If at time zero the system is in state 1, i.e., π(0) = [1, 0], the solution of Eq. 2.11 is

    π_1(t) = μ/(λ+μ) + λ/(λ+μ) exp[−(λ+μ)t]          (2.12a)

    π_2(t) = λ/(λ+μ) {1 − exp[−(λ+μ)t]}          (2.12b)

If the system is nonrepairable (μ = 0), the solution reduces to

    π_1(t) = exp(−λt)          (2.13a)

Fig. 2.2  State-transition diagrams for a two-state system: (a) nonrepairable; (b) repairable

    π_2(t) = 1 − exp(−λt)          (2.13b)

In the repairable case, π_1(t) (Eq. 2.12a) gives the probability that the system is operating at time t, though it may have failed and been repaired several times in the time interval (0,t). This probability is the point availability, A(t), at time t. The complementary probability π_2(t) (Eq. 2.12b) is the point unavailability, U(t), at time t. In the nonrepairable case, the system can leave state 1 but cannot return to it. Accordingly, the probability π_1(t) (Eq. 2.13a) that the system be in state 1 at time t gives the probability that the system has been operating continuously from time 0 up to time t. This probability is the reliability, R(t), of the system. The complementary probability π_2(t) (Eq. 2.13b), the probability that the system occupies state 2 at time t, gives the probability that the system has failed between 0 and t. It is the failure probability, F(t), of the system.
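A quick numerical check of Eqs. 2.12 and 2.13 is given below; the failure and repair rates and the time are assumed values chosen only for illustration.

```python
# Sketch: availability, unavailability, reliability and failure probability
# of a single component (assumed rates).
import math

lam, mu, t = 1.0e-3, 1.0e-1, 100.0    # failure rate, repair rate (1/hr), time (hr)

A_t = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)   # Eq. 2.12a
U_t = lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))             # Eq. 2.12b
R_t = math.exp(-lam * t)                                               # Eq. 2.13a
F_t = 1.0 - R_t                                                        # Eq. 2.13b

print(A_t + U_t)     # equals 1: availability plus unavailability
print(U_t <= F_t)    # the point unavailability never exceeds the failure probability
```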

2.4 Two-Component Systems

We will consider a system consisting of two components A and B connected in parallel (Fig. 2.3). Each component has two possible states, an operating state and a failed state. In general, a system state is defined as a combination of component states. The number of system states is equal to the number of all possible combinations of component states. Hence, for the system in

Fig. 2.3  Two-component system. Component and system states are listed in the tables (each component is either operating or failed; system states 1 to 4 are the four combinations of the component states).

Fig. 2.4  Transition state diagram for the two-component system in Fig. 2.3

178 Fig. 2.3, the number of system states is four. The statetransition diagram for the system is shown in Fig. 2.4. A transition from state 1 to state 2 is equivalent to failure of compo nent A, a transition from state 4 back to state 2 is equivalent to repair of component B, etc. Since the components are connected in parallel (Fig. 2.3), the system is operating if at least one component is operat ing. Thus, the system is operating if it is in one of the states 1,2, or 3 (Fig. 2.4), and failed if it is in state 4. The availability of the system at time t is the probability that the system be operating at time t or that it be in one of the states 1,2 or 3. Hence, the availability of the system is given by
    A(t) = π_1(t) + π_2(t) + π_3(t)          (2.14)

Similarly, the unavailability of the system at time t is the probability that both components have failed, or the probability that the system be in state 4. Hence,

    U(t) = π_4(t)          (2.15)

In general, if the set Z of all possible states of a system is partitioned into two subsets X and Y containing all the operating and failed states, respectively, then we have that

    A(t) = ∑_{i∈X} π_i(t)

and

    U(t) = ∑_{i∈Y} π_i(t) = 1 − A(t)          (2.16)
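For the four-state model of Fig. 2.4 these quantities can be obtained numerically from the matrix exponential. The sketch below assumes equal failure and repair rates for the two components, no dependences, and repair possible from every state; all values are illustrative.

```python
# Sketch of Eqs. 2.14-2.16 for a two-component parallel system (assumed rates).
import numpy as np
from scipy.linalg import expm

lam, mu, t = 1.0e-3, 1.0e-1, 100.0
# States: 1 = both up, 2 = A down, 3 = B down, 4 = both down
A = np.array([[-2*lam,    lam,      lam,     0.0 ],
              [   mu, -(mu+lam),    0.0,     lam ],
              [   mu,     0.0,  -(mu+lam),   lam ],
              [  0.0,     mu,       mu,    -2*mu ]])

pi = np.array([1.0, 0.0, 0.0, 0.0]) @ expm(A * t)
availability   = pi[0] + pi[1] + pi[2]    # Eq. 2.14 / 2.16
unavailability = pi[3]                    # Eq. 2.15
print(availability, unavailability)
```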

The importance of the Markovian model lies in the fact that it allows the transition rates of the system to depend on the initial and final states of the transition. For example, in Fig. 2.4 a transition from state 1 to state 2 is equivalent to a failure of component A. The same is true for a transition from state 3 to state 4. Yet, the failure rate of component A may depend on whether component B is operating or not. This possibility is indicated in the state-transition diagram by denoting by λ_A* and λ_A the failure rates of component A for the system transitions from state 3 to state 4 and from state 1 to state 2, respectively, and taking λ_A* ≠ λ_A. Such dependences of the stochastic behavior of the components on the state of the system arise in many practical applications, as illustrated by the following examples.

3.

STANDBY REDUNDANCY

We will consider again the twocomponent system of Fig. 2.3 and assume that one of the components is operating while the other is either on a cold standby mode or on a warm standby mode.

3.1 Cold Standby Redundancy

In cold standby redundancy the nonoperating component is not subject to any stress and cannot fail. a. Markov model

The state-transition diagram for this case is as shown in Fig. 3.1. It is noteworthy that state 3 cannot be reached because component B cannot fail while component A is operating. Here, the important feature is that the failure rate of component B depends on the state of component A. This rate is equal to zero if component A is operating and to λ_2 if component A is failed. Hence, component B cannot be examined independently of component A. The transition rate matrix for the process is

    A = [ −λ_1   λ_1    0
            0   −λ_2   λ_2
            0     0     0  ]          (3.1)

(rows and columns ordered over the reachable states 1, 2 and 4)

Using Eq. 3.1 in Eq. 2.2 and solving for the failure probability π_4(t) of the system, we find the relation

    π_4(t) = F(t) = 1 − λ_2/(λ_2 − λ_1) exp(−λ_1 t) + λ_1/(λ_2 − λ_1) exp(−λ_2 t)          (3.2)

For λ_1 = λ_2 = λ, Eq. 3.2 reduces to

    F(t) = 1 − (1 + λt) exp(−λt)          (3.2a)

b.

Fault tree model

The fault tree for the system in Fig. 2.3 is shown in Fig. 3.2. The top event T (system failure) is given in terms of the basic events as T = A·B. Hence, the probability for the top event is

    Pr(T) = Pr(AB)          (3.3)

To calculate Pr(AB), i.e., the probability that both components are down, taking into consideration the cold standby characteristics of the system, we must solve the Markov model of the preceding subsection. Assuming that the two components are statistically independent, we could calcu late erroneous values. Indeed, then we find that


Fig. 3.1

State-transition diagram for a two-component system with one component in cold standby

    Pr(A) = 1 − exp(−λ_1 t) = failure probability for component A (Eq. 2.13b),
    Pr(B) = 1 − exp(−λ_2 t) = failure probability for component B (Eq. 2.13b),

and the system failure probability

    F(t) = Pr(T) = Pr(A)Pr(B) = 1 − exp(−λ_1 t) − exp(−λ_2 t) + exp[−(λ_1 + λ_2)t]          (3.4)

or, if λ_1 = λ_2 = λ,

    F(t) = 1 − 2 exp(−λt) + exp(−2λt)          (3.4a)

These results differ from those given by Eqs. 3.2 and 3.2a.

3.2 Warm or Hot Standby Redundancy

(3.4a)

In warm or hot standby redundancy, the nonoperating component can fail even when it is not online. If the failure rate of a component in a standby mode is different (usually less) than the failure rate of the online mode, the standby is called warm. If the failure rate or a

11 8

System Failure

Fig. 3.2

Faulttree for system in Fig. 2.3

component in a standby mode is equal to that of the online mode, the standby is called hot. The statetransition diagram for such standby conditions is shown in Fig. 3.3a. If the two components A and are identical, the state transition diagram in Fig. 3.3a can be reduced to that of Fig. 3.3b. In general, the reduction is called merging. It simplifies the nume rical difficulties associated with reliability analyses'^). Solving the state evolution equation corresponding to the state transition diagram in Fig. 3.3b, we find TT3.(t) = F(t) =1(1 +p)exp(Xt)+ exp[(X+X*)t] (3.5)

Again, this result differs from that given by Eq. 3.4a except when =*, namely, when the failure rate of component does not depend on the state of component A and vice versa.

4.

COMPONENTS SHARING COMMON LOADS

We will assume that the two components of the system shown in Fig. 2.3 share a common load so that if one of the two fails the other must ope rate at 100% of the load. In many applications, the failure rate of each component at partial load differs substantially from that at full load. Here, the statetransition diagram is as shown in Fig. 4.1. The

182

(a)

Fig. 3.3

State-transition diagrams for a two component system with one component in warm standby

failure rate of component A (or B) depends on the state of component (or A ) , and therefore, on the state of the system. The process can be merged (Fig. 4.1b), and the failure probability of the system is given by the relation

F(t) = TT4(t) = 1 - 2frp- exp(-X*t) + ^ * exp(-2Xt)

(4.1)

This correct result reduces to that given by Eq. 3.4a only if =*.

EXTREME ENVIRONMENT The failure rates of the components of a system depend on the environment to which they are exposed. For example, the failure rate of a high voltage transmission line depends on whether or not there is a storm. The arrival of a storm is in itself a random process. Hence, we

183

(a)

Figure 4.1

State transition diagram for a system with two components sharing a common load

can distinguish three states of the environmenttransmission line system. In state 1 (Fig. 5.1) there is no storm and the transmission line is up. In state 2, there is a storm and the transmission line is up. In state 3, the transmission line is down. The failure rate of the transmission line when there is no storm is ^ (transition rate between states 1 and 3). The failure rate of the transmission line when there is a storm is 2 (transition rate between states 2 and 3). The arrival rate for the storm is \a (transition rate between states 1 and 2). The failure probability of the transmission line is the probability that the system will be in state 3. Solving the relevant state evolution equation we find

F(t) = w3(t)1

2~1

2 "1

Trsexp E<Vx e >t]


(5.1)

VVxs

exp(Xt)

184

Line up, I j no storm

! Line up, storm

Line down I storm or J /no storm

Fig. 5.1

Statetransition diagram for the extreme environment example

If a Markovian model is not used, the failure probability must be appro ximated by either lexp(X.t) = failure probability under normal conditions (5.2a)

1exp(Xt) = failure probability under storm

(5.2b)

The probability given by Eq. 5.1 may differ by orders of magnitude from that given by Eq. 5.2. RELIABILITY OF SYSTEMS WITH REPAIRABLE C OMPONENTS The reliability of a system is by definition the probability that the system will operate continuously from time 0 to time t or the probabi lity that no system failure will be observed during the time interval (0,t). Whenever this quantity is of interest for a system with repair able components a Markovian model is necessary. As an example, we will consider two repairable components connected in parallel (Fig. 2.3) under the conditions: a. Repair of the components is possible even if the system is not operating. Then, the state transition diagram is as shown in Fig.6.1a Transitions from state 4 back to states 2 and 3 are possible. The

185

Fig. 6.1

State transition diagram of two-component system in Fig. 2.3: (a) when system is down, repair is possible; (b) when system is down, repair is impossible

probability that the system will occupy state 4 at time t is the unavailability of the system at time t. It is the probability that the system is unavailable at time t regardless or whether it has failed and been repaired during the time interval (0,t). b. Repair of a unit is possible only if the other unit is operating. Then, the state transition diagram is as shown in Fig. 6.1b. Transitions from state 4 back to states 2 and 3 are not possible. If the system enters state 4, it cannot leave again. Here, the probability that the system will occupy state 4 at time t is the probability that the system will fail during the time interval (0,t). It is the failure probability, F(t), the complement of the reliability. If we assume no dependences and use the fault tree model of Fig. 3.2, we can calculate the unavailability of the system. This is equivalent to considering the system under conditions (a). If, however, we are interested in the failure probability we must consider conditions (b). This distinction and the need for a Markov model are essential in the analysis of an engineered safety system of a nuclear reactor that starts operating at a certain time during the course of an accident and must continue to operate for a period of T hours. If the system fails before T hours have elapsed, unacceptable damage to the core will result. To calculate the probability of unacceptable damage to the core because

186

ur

Approximation with independent, non repairable components

li io 3
II

<

<


Approximation with independent, repairable components

50
Fig. 6.2

Time (hr)

100

Failure probability of twocomponent system with on line repair possible (Fig. 6.1b) = . 10"3 hr"1 =v-2 2 10" hr

187 of system failure we must consider conditions (b) and the corresponding Markov model. If we assume no dependence, we will miscalculate the probability. Assuming nonrepairable independent components, we will overestimate the value of the failure probability (conservative answer). Assuming repairable independent components, we will underestimate the value of the failure probability (nonconservative answer). These remarks are illustrated by the numerical results shown in Fig. 6.2. In particular, if the system must operate for a period of = 100 hours, the correct probability of unacceptable core damage is equal to 9.4x10 ^. A model that assumes independent nonrepairable components yields a failure probability of 9 * 1 0 " ^ an overestimation by a factor of 10. A model that assumes independent repairable compo nents yields a failure probability of 2.5 x10 5, an underestimation by a factor of 40.

7.

SYSTEMS WITH SPECIAL REPAIR POLICIES

In many applications, the repair policy of the components of a system depends on the state of the system. Then, a Markov model is necessary for the calculation of the unavailability or the failure probability. Two examples of special repair policies follow. 7.1 Limited Repair Capability

For some systems, the number of components that can be under repair at any instant of time depends on the number of repairmen (or repair crews) that are available. For example, for the twocomponent system examined in the previous sections, if only one repairman is available then only one component can be repaired at a time and, if we decide that component will be the first to be repaired, then the state transition diagram will be as shown in Fig. 7.1. Again, if two repairmen are available, then the state transition diagram will be as shown in Fig. 6.1a. Values of unavailability versus time for a specific system are shown in Fig. 7.2. 7.2 Repair All Components before Resuming Operation

In some situations, repair of components is more expedient if the system is not operating. In other situations, because of practical difficulties, such as radiation fields or invessel components, repair is possible only when the system is not operating. For these situations our two component example will have the statetransition diagram shown in Fig. 7.3. In this diagram, we have introduced two new states, state 5 and state 6, in which the components are under repair but the system is not operating. The unavailability, U(t), of the system is equal to the probability that the system will be in any of the states 4,5 or 6 and, therefore,
U(t) = irA(t) + TT5(t) + TT6(t) (7.1)

188

Fig. 7.1

State transition diagram for a repairable twocomponent system but one repairman available

Its values versus time for a specific system are shown in Fig. 7.2.

8.

CHALLENGEDEPENDENT FAILURE PROBABILITY

Usually a safety system remains in a standby mode until there is a need for it to operate. An undesirable event (an accident) occurs if the system is not available to operate when challenged to do so. Hence, the probability that an accident will occur during a time period is the probability that a challenge will occur at some instant in the period and at that instant the system is unavailable. The correct calculation of the accident probability requires proper handling of the dependences between the frequency of the challenge and the unavailability of the system. If we assume no dependence, then we will grossly overestimate the accident probability. We will confirm this assertion by using a simple numerical example. 8.1 Model with no Dependences

We will consider a safety system consisting of two components in parallel. We will assume that a challenge (a need for operation) for this system arrives according to a Poisson random process with an arrival rate X c . It can be shown that if the unavailability of the system is inde pendent of the occurrence of challenges, then the accident probability,

189
U(t)D

IO 3

^ '

(iii) No online repair

~~ _

(i) One repairman

/ /

1
IO" 4
/
/

li'
/ /'

V'"
I

(ii) Two independent _ repairable components

IO"5

50
Fig. 7.2

Time (hr)

100

Unavailability of a twocomponent system under different repair policies: (i) one repairman avail able (Fig. 7.1), = 2 = 10 3 hr 1 ; (ii) two repair men available (Fig. 6.1a), = 2 = 10_1hr I ; (iii) no online repair possible (Fig. 7.3), = 2 = 10 3 hr _ 1 ; = 2 = 1.5x10 1 hr 1 .

190

UP: system operating DOWN: system not operating

Fig. 7.3

State transition diagram of a two component system (Fig. 2.3) when online repair is not possible

F(T), is given by the expression F(T) =

(2)
(8.1)

1 -[- TU ()]

where U(T) is the average unavailability of the system during the time period given by the relation

U(T)

=1

U(t)dt

(8.2)

For small values of TU(), Eq. (8.1) can be approximated by the relation T(T) (8.3) c The statetransition diagram for the twocomponent system is shown in Fig. 8.1. We have assumed that the failures are undetectable and, therefore, that the components are unrepairable.. The unavailability of the system is the probability that the system will be in state 4. Hence, U(t) = ir.(t) = 12 exp(Xt) + exp(2Xt) (8.4) F(T) =


Fig. 8.1  State transition diagram for a two-component system with no dependence on the challenge rate

The same result could have been obtained with other methods such as fault trees, reliability block diagrams, or state enumeration. Using Eq. 8.4 in Eq. 2.27 we find

Ū(T) = (1/T) ∫₀ᵀ π₄(t) dt = 1 − (2/λT)[1 − exp(−λT)] + (1/2λT)[1 − exp(−2λT)]    (8.5)

Finally, substituting Eq. 8.5 in Eq. 8.1, we find F(T).
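The chain of Eqs. (8.2), (8.4), (8.5) and (8.1) is easy to evaluate numerically. The following sketch (illustrative only; λ and λ_c are assumed values, not taken from the text) does so for a few mission times.

```python
# Sketch: accident probability for the model with no dependences (Eqs. 8.1-8.5).
import math

lam = 3.0e-5      # component failure rate [1/hr] (assumed)
lam_c = 1.0e-5    # challenge arrival rate [1/hr] (assumed)

def avg_unavailability(T):
    """Eq. (8.5): time average of pi4(t) = (1 - exp(-lam*t))**2 over [0, T]."""
    return (1.0
            - 2.0 / (lam * T) * (1.0 - math.exp(-lam * T))
            + 1.0 / (2.0 * lam * T) * (1.0 - math.exp(-2.0 * lam * T)))

def accident_probability(T):
    """Eq. (8.1): F(T) = 1 - exp(-lam_c * T * Ubar(T))."""
    return 1.0 - math.exp(-lam_c * T * avg_unavailability(T))

for T in (1000.0, 5000.0, 8500.0):
    print(T, avg_unavailability(T), accident_probability(T))
```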

8.2 Model with Dependences

To account for dependences in the system of Sec. 8.1, we must include an additional state, the accident state 5, in the state transition diagram (Fig. 8.2). A transition to state 5 occurs if the system is unavailable (state 4) and the challenge occurs. The accident probability for a period of time T is the probability that the system will be in state 5 at time t = T. Solving the state evolution equation for the process in Fig. 8.2, we find

F(T) = π₅(T) = 1 − [2λ_c/(λ_c − λ)] exp(−λT) + [λ_c/(λ_c − 2λ)] exp(−2λT) − [2λ²/((λ_c − λ)(λ_c − 2λ))] exp(−λ_c T)    (8.6)


Fig. 8.2  State transition diagram for a two-component system with dependence on the challenge rate

The values of the accident probability given by Eq. 8.6 are lower than those obtained from Eqs. 8.1 and 8.5. This assertion is verified by the numerical results shown in Fig. 8.3. In particular, for T = 8500 hours the accident probabilities are 6×10⁻³ and 5×10⁻² for the models with and without dependences, respectively.

8.3 Model with Dependences and Renewal Effects

Another dependence of the unavailability of the system on the frequency of challenges is due to the renewal effect that successful challenges have on the system. For example, if a challenge occurs when the system is in either state 2 or 3 (Fig. 8.2), then the failure of the failed component is revealed and it can be repaired. Thus, a challenge in either state 2 or state 3 will bring the system back to state 1. The corresponding state transition diagram is as shown in Fig. 8.4. Again the accident probability is the probability of being in state 5 at time t = T. Numerical results for a specific system are shown in Fig. 8.3. For T = 8500 hours, this model yields an accident probability of 5.2×10⁻⁴, two orders of magnitude less than the model with no dependences.
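Both challenge-dependent models can be solved by direct integration of the state equations. The sketch below is illustrative only (it is not the original derivation, and λ and λ_c are assumed values); it switches the renewal transitions of Fig. 8.4 on or off.

```python
# Sketch: accident probability for the challenge-dependent models (Figs. 8.2, 8.4).
# States: 0 both up, 1/2 one component failed, 3 both failed, 4 accident.
import numpy as np

lam, lam_c = 3.0e-5, 1.0e-5   # assumed failure and challenge rates [1/hr]

def build_Q(renewal=False):
    Q = np.zeros((5, 5))
    Q[0, 1] = Q[0, 2] = lam
    Q[1, 3] = Q[2, 3] = lam
    Q[3, 4] = lam_c                    # challenge while both failed -> accident
    if renewal:
        Q[1, 0] = Q[2, 0] = lam_c      # successful challenge reveals the failure
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def accident_probability(T, renewal=False, dt=1.0):
    pi = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
    Q = build_Q(renewal)
    for _ in range(int(T / dt)):
        pi = pi + dt * pi @ Q          # explicit Euler step of dpi/dt = pi Q
    return pi[4]                       # F(T) = pi5(T)

print("no renewal  :", accident_probability(8500.0, renewal=False))
print("with renewal:", accident_probability(8500.0, renewal=True))
```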


Fig. 8.3  Failure probability with challenge-dependence: F(t) versus t (10³ hr) for the independent model, the dependent model with no renewal, and the dependent model with renewal.


Fig. 8.4

State transition diagram with dependence on the challenge rate and renewal effect of successful challenges

9.  COMMON CAUSE OR COMMON MODE ANALYSIS

The terms common cause and common mode refer to events that cause the simultaneous failure of more than one component of a system. Common cause or common mode failures play an important role in reliability

analyses because they reduce the reliability of systems consisting of redundant components. As explained in References 2 and 3, the common cause failure problem is actually a problem of stochastic dependences. Markovian models are especially suited for handling this class of dependences as well as an additional very important class called sympathetic failures.

We will consider the redundant two-component system shown in Fig. 9.1a. If the two components are completely independent, the state transition diagram is as shown in Fig. 9.1b. Common cause failures can be incorporated in Markovian or other analyses by assuming the existence of a dummy component, C, connected in series with the two parallel components A and B (Fig. 9.2a). Whenever component C fails, that is, whenever the common cause event occurs, the system fails. The common cause event may occur when either one or both actual components are operating. The system in Fig. 9.2a has 8 states, three operating and five failed. If the rate of occurrence of the common cause event is λ_c, the states can be merged into four, and the state-transition diagram is as shown in Fig. 9.2b. Here, state 4 includes all five states in which the system is failed. This method of analyzing the problem is equivalent to the β-factor approach.

It is evident from the equivalent system in Fig. 9.2a that the common cause events represented by this model are external to the system, such as earthquakes, floods, fires, and common support systems. There exists, however, another very important class of dependences that reduce the reliability of redundant systems. It includes all situations in which the probability of failure of one component depends on the state of the other components of the system. An example of such dependences was presented in Sec. 4. There, the failure rate of one component increased whenever the other failed. Situations exist for which the increase of the failure rate is so high that, for all practical purposes, the component fails instantly. This kind of failure is called "sympathetic".

Sympathetic failures occur when the failure of a component creates dynamic phenomena which generate stresses that challenge the strength of the remaining operating components. For example, the failure of a generating station challenges the frequency stability of the whole network and could result in a blackout. An analogous situation occurs as a result of a fault in a transmission line. On a smaller scale, the failure of one emergency diesel generator in a nuclear power plant can create a transient that entails the failure of the other diesel. Similar phenomena can happen with the redundant channels of the various logic circuits. Redundant legs of fluid systems are also subject to sympathetic failures. When one of the several supports of a pipe fails, the resulting dynamic forces can fail the remaining supports even if they are statically redundant. A valve can fail and spray with water the electrical controls of nearby valves. Sympathetic failures can also happen indirectly through human errors. For example, if the two components in Fig. 9.1a represent two redundant electrical motors, failure of one (say A) will result in some repair action. The repair crew could mistakenly disconnect motor B for repair.


Fig. 9.1  (a) Two-component redundant system; and (b) corresponding state-transition diagram with independent failures

Fig. 9.2  (a) Two-component redundant system subject to external common cause failures; and (b) corresponding state transition diagram


Fig. 9.3  State transition diagram for a two-component system with common cause and sympathetic failures

Sympathetic failures can be incorporated in Markov models by assuming that the failure of one redundant component could affect the failure of the other components. We can illustrate this point by using our two-component system (Fig. 9.2a) as an example. We will assume that failure of one component may result in the failure or non-failure of the other with probabilities a and 1 − a, respectively. Thus, the state-transition diagram, including common cause and sympathetic failures, will be as shown in Fig. 9.3. The system can transit from state 1 to state 2 or 3 with transition rate (1 − a)λ and to state 4 with transition rate λ_c + 2aλ. The latter consists of the external common cause failure contribution λ_c and the sympathetic failure contribution 2aλ.
It can be easily verified that for an m-out-of-n system (n components in parallel, of which at least m are required for successful system operation) the failure probability is

F_n(t) = 1 − Σ_{i=0}^{n−m} C(n,i) [(1 − a)(1 − exp(−λ_i t))]^i [exp(−λ_i t)]^{n−i} exp(−λ_c t)    (9.1)

To illustrate the importance of sympathetic failures, we have computed the ratio of the failure probability, F₁(t), of a single specific component to the failure probability, F_n(t), of a system consisting of n specific components in parallel, versus time, for several values of n. This ratio is called the redundancy factor.


Fig. 9.4  Redundancy factor versus λt for redundant systems of N = 2, 3, 4 and 5 components (a = 0.01, β = 0.01). The inset gives the optimum system: N = 2 for 0 < λt < 0.0125, N = 3 for 0.0125 < λt < 0.1125, N = 4 for 0.1125 < λt < 0.2775, and N = 5 for λt > 0.2775.

The failure probability, F₁(t), of the single component is given by the relation

F₁(t) = 1 − exp(−λt)    (9.2)

The probability F_n(t) is given by Eq. 9.1, and we have used the relations

λ_i + λ_c = λ ;  λ_c = βλ ;  λ_i = (1 − β)λ    (9.3)
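As an illustration (this sketch is not part of the original text), Eqs. (9.1)-(9.3) can be evaluated directly; the parameter values a = β = 0.01 mirror those quoted for Fig. 9.4, and m = 1 corresponds to the parallel systems compared there.

```python
# Sketch: redundancy factor F1(t)/Fn(t) for 1-out-of-n systems with sympathetic
# (a) and external common cause (beta) failure contributions, Eqs. (9.1)-(9.3).
from math import comb, exp

a, beta, lam = 0.01, 0.01, 1.0
lam_c = beta * lam
lam_i = (1.0 - beta) * lam            # relations (9.3)

def F_n(t, n, m=1):
    """Eq. (9.1): failure probability of an m-out-of-n system."""
    surv = sum(comb(n, i)
               * ((1.0 - a) * (1.0 - exp(-lam_i * t))) ** i
               * exp(-lam_i * t) ** (n - i)
               for i in range(n - m + 1)) * exp(-lam_c * t)
    return 1.0 - surv

def redundancy_factor(t, n):
    """Ratio F1(t)/Fn(t) plotted in Fig. 9.4."""
    return (1.0 - exp(-lam * t)) / F_n(t, n)

for lt in (0.01, 0.05, 0.2, 0.35):    # values of lambda*t
    print(lt, {n: round(redundancy_factor(lt, n), 1) for n in (2, 3, 4, 5)})
```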

The results of the calculations are shown in Fig. 9.4. For the numerical values considered, we see from Fig. 9.4 that the 5-component system is not always the most reliable, as it is when the components are completely independent. In particular, if the mission time t is less than (0.0125/λ) the two-component system is more reliable than the 3-, 4- and 5-component systems. For mission times t between (0.0125/λ) and (0.1125/λ) the 3-component system is the optimum, for mission times between (0.1125/λ) and (0.2775/λ) the 4-component system is optimum, and for t > (0.2775/λ) the 5-component system becomes the most reliable.

References

1. Howard R., Dynamic Probabilistic Systems, Vol. I and II, Wiley (1971).
2. Papazoglou I.A. and Gyftopoulos E.P., "Markovian Reliability Analysis Under Uncertainty with an Application on the Shutdown System of the Clinch River Breeder Reactor", Brookhaven National Laboratory Report, NUREG/CR-0405 (BNL-NUREG-50864), (1978).
3. Easterling R., "Probabilistic Analysis of Common Mode Failures", in Proceedings of the Topical Meeting on Probabilistic Analysis of Nuclear Reactor Safety, Los Angeles, California, ISBN 0-89448-101-0, May 8-10, 1978.
4. Fleming K. and Raabe P., "A Comparison of Three Methods for the Quantitative Analysis of Common Cause Failures", in Proceedings of the Topical Meeting on Probabilistic Analysis of Nuclear Reactor Safety, Los Angeles, California, ISBN 0-89448-101-0, May 8-10, 1978.

10.  LITERATURE REVIEW

The advantages of using Markov processes in reliability problems have been recognized since the inception of the reliability discipline. Almost every book published on reliability presents Markov modeling as the most powerful reliability technique because it can incorporate a great variety of system characteristics. Numerical difficulties, however, have limited the use of the technique to relatively small systems consisting of only a few components. A successful effort has been made

to apply this powerful technique to large systems, through the use of three techniques: state ordering, state merging and judicious choices of time steps. The three techniques are discussed by Papazoglou and Gyftopoulos in BNL-NUREG-50864 (1978) and in a paper in Nuclear Science and Engineering, Vol. 73, No. 1, Jan. 1980. What follows is a short list of publications together with a brief comment on each publication.

10.1 Books on Markov Processes

1. HOWARD R., Dynamic Probabilistic Systems, Vol. I and II, Wiley (1971).
Probably the most complete book on applications of Markov processes in studying dynamic probabilistic systems. Though it includes some examples, this treatise is not specifically oriented toward reliability analysis.

2. KEMENY J.G. and SNELL J.L., Finite Markov Chains, D. Van Nostrand (1961).
A classic reference for Markovian analysis but not specifically oriented toward reliability analysis.

10.2 Books on Reliability Analysis

3. BARLOW R.E. and PROSCHAN F., Mathematical Theory of Reliability, Wiley (1965).
This book presents the Markov approach in Chapter 5, "Stochastic Models for Complex Systems".

4. BILLINTON R., RINGLEE R. and WOOD A., Power System Reliability Calculations, MIT Press (1973).
The authors use exclusively Markov models in calculations of reliability of electric power systems.

5. DHILLON B.S. and SINGH C., Engineering Reliability: New Techniques and Applications, Wiley (1981).
In this book, the authors make the following comment on Markovian reliability analysis (Sec. 3.6.2, p. 37): "... The state space approach (Markov processes) is a very general approach and can generally handle more cases than any other method. It can be used when the components are independent as well as for systems involving dependent failure and repair modes. There is no conceptual difficulty in incorporating multistate components and modeling common cause failures". They treat common cause failures in terms of Markovian models (Sec. 4.14), and present applications of Markovian reliability analysis in software reliability, repairable three-state devices, generating capacity reliability (electric power systems), transmission and distribution systems (electric power systems), transit system reliability, and computer system reliability. The book also includes an extensive bibliography.

6. ENDRENYI J., Reliability Modeling in Electric Power Systems, Wiley (1978).
The author uses almost exclusively the state space approach (Markov models) to analyze many problems of reliability of electric power systems.

7. GNEDENKO B.V., BELYAYEV Y. and SOLOVYEV A., Mathematical Methods of Reliability Theory, Academic Press (1969).
The authors use Markov models to study a variety of problems on standby redundancy with renewal. Combinatorial analysis and the Markov approach are the only reliability techniques discussed.

8. GREEN A.E. and BOURNE A.J., Reliability Technology, Wiley-Interscience (1972).
The authors introduce the concept of state-change and use the corresponding Markov processes to derive general reliability and availability expressions (Chapters 10 and 11).

9. HENLEY E. and KUMAMOTO H., Reliability Engineering and Risk Assessment, Prentice-Hall Inc. (1981).
This book contains one of the most complete lists of reliability techniques. The Markov approach is presented as the only methodology capable of answering reliability questions for systems with dependences (Chapter 8: System quantification for dependent basic events), and for calculating the reliability of systems with repairable components (Chapter 9: System quantification, Reliability).

10. SANDLER G.H., System Reliability Engineering, Prentice-Hall (1963).
This book is devoted almost exclusively to Markovian reliability models. It is perhaps the most complete reference on Markovian models of small systems.

11. SINGH C. and BILLINTON R., System Reliability Modeling and Evaluation, Hutchinson, London (1977).
This book is exclusively devoted to Markovian reliability models.

12. SHOOMAN M.D., Probabilistic Reliability: An Engineering Approach, McGraw-Hill (1969).
This book includes many reliability techniques. Markov models are used for the analysis of systems incorporating dependences, repair or standby operation. The author comments: "The Markov model approach is perhaps the best and most straightforward approach to computations in systems with dependence, repair, or standby operation" (Sec. 5.8.4, p. 243).

10.3 Papers and Reports

10.3.1 Review documents

13. GENERAL ELECTRIC CO., "Reliability Manual for LMFBR", Vol. 1, Report SRD-75-064. Prepared by Corporate Research and Development, General Electric Co., for the Fast Breeder Reactor Department, General Electric Co., Sunnyvale, CA (1975).
This manual presents an extended list of reliability analysis techniques pertinent to nuclear reactor systems. Markovian analysis is described as the most suitable technique for reliability analysis of repairable systems (Sec. 3.5.7, Complex repairable systems, Markov Analysis).

14. RASMUSON D.M., BURDIC G.R. and WILSON J., "Common Cause Failure Analysis Techniques: A Review and Comparative Evaluation", EG&G Report TREE-1349, September (1979).
This report contains reviews and evaluations of selected common cause failure analysis techniques. Markovian reliability analysis is listed among the available techniques for quantitative evaluation of common cause failures. In evaluating the Markovian technique the authors state (Sec. 11.6, p. 113): "In terms of the variety of system characteristics which it can calculate, Markov modeling probably represents the most powerful reliability technique. However, due to limitations on the number of states for which calculations are feasible, the technique has been essentially ignored in the nuclear field until recent years. Two approaches have been used to solve the problem of size limitation: (a) small systems or resolution to subsystem level only; and (b) special calculation and reduction techniques. These approaches have still not resulted in widespread use of Markov modeling in the nuclear industry. Perhaps as failure data become more detailed the versatility of Markov modeling in calculating diverse reliability characteristics will be more appreciated".

15. BLIN A., CARNINO A. and GEORGIN J.P., "Use of Markov Processes for Reliability Problems", in Synthesis and Analysis Methods for Safety and Reliability Studies, edited by Apostolakis et al., Plenum Press (1980).
This paper summarizes French reliability efforts in nuclear systems. The authors state: "It is not possible to use methods such as fault tree analysis, to assess the reliability or the availability of time evolutive systems. Stochastic processes have to be used and among them the Markov processes are the most interesting ones."

10.3.2 Applications of Markovian Analysis in Large Nuclear Systems

16. PAPAZOGLOU I.A. and GYFTOPOULOS E.P., "Markovian Reliability Analysis Under Uncertainty with an Application on the Shutdown System of the Clinch River Breeder Reactor", Brookhaven National Laboratory Report, NUREG/CR-0405 (BNL-NUREG-50864), (1978).
The authors develop a methodology for the assessment of the uncertainties about the reliability of nuclear reactor systems described by Markov models and present an assessment of the uncertainties about the probability of loss of coolable core geometry of the CRBR due to shutdown system failures. The Markov model used in this study includes common cause failures,

interdependences between the unavailability of the system and the occurrence of transients, inspection and maintenance procedures that depend on the state of the system, and the possibility of human errors.

17. WESTINGHOUSE ELECTRIC CORPORATION, "Reliability Assessment of CRBR Reactor Shutdown System", WARD-D-0118, Nov. (1975).

18. ILBERG D., "An Analysis of the Reliability of the Shutdown Heat Removal System for the CRBR", UCLA-ENG-7682 (1976).
A Markovian model for the calculation of the reliability of the SHRS of the CRBR was used. The Markovian model was chosen because "it is convenient for the analysis of time dependent reliability (or availability) of safety systems, when subsystems rather than a large number of components are included. A Markov model treats easily repair rates, failure to start upon demand, changes with time of the system functional configuration, and common mode failure transitions between states of the systems" (Sec. 4.1, p. 5).

19. BLIN A., CARNINO A., BOURSIER M. and GREPPO J.F., "Détermination, par une Approche Probabiliste, d'une Règle d'Exploitation des Alimentations de 6.6 kV des Réacteurs à Eau Sous Pression (Tranches de MW(e))", in Reliability of Nuclear Power Plants, Proceedings of a Symposium, Innsbruck, IAEA (1975).

10.3.3 General Applications of Markovian Reliability Analysis

20. BUZACOTT J.A., "Markov Approach to Finding Failure Times of Repairable Systems", IEEE Trans. Reliability, Nov. 1979, pp. 128-134.

21. ENDRENYI J. and BILLINTON R., "Reliability Evaluation of Power Transmission Networks: Models and Methods", CIGRE, Paper No. 32-06 (1974).

22. ENDRENYI J., MAENHAUT P.C. and PAYN L.E., "Reliability Evaluation of Transmission Systems with Switching after Faults: Approximations and a Computer Program", IEEE Transactions on Power Apparatus and Systems, pp. 1863-1875, Nov/Dec (1973).

23. FLEHINGER B.J., "A Markovian Model for the Analysis of the Effects of Marginal Testing on System Reliability", Ann. Math. Stat., Vol. 33, June (1962), pp. 754-766.

24. SINGH C. and BILLINTON R., "Frequency and Duration Concepts in System Reliability Evaluation", IEEE Trans. Reliability, Vol. R-24, April (1975), pp. 31-36.

25. SINGH C. and BILLINTON R., "Reliability Modelling in Systems with Non-Exponential Down Time Distributions", IEEE Transactions on Power Apparatus and Systems, March/April (1973), pp. 790-800.

26. ZELENTSOV B.P., "Reliability Analysis of Large Nonrepairable Systems", IEEE Trans. Reliability, Vol. R-19, Nov. (1970), pp. 132-136.

10.3.4 Simple Applications of Markov Models in Fault Trees

Modeling of small portions of a system by a Markov process in relation to a fault tree is presented in the following papers.

27. NEUMAN C.P. and BONHOMME H.M., "Evaluation of Maintenance Policies using Markov Chains and Fault Tree Analysis", IEEE Transactions on Reliability, Vol. R-24, April (1975).

28. CALDAROLA L., "Fault Tree Analysis of Multistate Systems with Multistate Components", ANS Topical Meeting on Probabilistic Analysis of Nuclear Reactor Safety, Los Angeles, California, Paper VIII.1 (1978). Also appearing in Synthesis and Analysis Methods for Safety and Reliability Studies, edited by Apostolakis et al., Plenum Press (1980).

The following two reports present a fault-tree technique that can incorporate Markovian models for single components.

29. MODARRES M., RASMUSSEN and WOLF L., "Reliability Analysis of Complex Technical Systems using the Fault Tree Modularization Technique", MITNE-228 (1980).

30. KARIMI R., RASMUSSEN and WOLF L., "Qualitative and Quantitative Reliability Analysis of the Safety Systems", MIT-EL-80-015 (1980).

MONTE CARLO METHODS

A. Saiz de Bustamante
Universidad Politécnica de Madrid
Spain

ABSTRACT. Monte Carlo methods apply the simulation of random variables to the solution of problems by means of system probabilistic models. The paper examines first the generation of a uniform random variable, and then of the most common distributions, to be used in the direct simulation method or "crude" Monte Carlo. Later the variance reduction techniques are introduced in order to improve the utility of Monte Carlo codes.

1.  INTRODUCTION

The Monte Carlo methods are numerical methods which allow the solution of mathematical and technical problems by means of the simulation of random variables. They originated forty years ago with the mathematicians J. von Neumann and S. Ulam in the early development stages of Nuclear Technology. Today their applications have been extended to a much broader area of Science and Technology, mainly due to the expansion of the electronic computer.

This methodology can be used to simulate the behaviour of a random system, by means of the implementation of a system probabilistic model on a computer. Each trial run of the model constitutes an artificial sample or a single process output. Repeating this procedure several times, it is possible to develop a statistical picture of the system behaviour or its probability distribution function. The whole repair-failure-repair process of a component can be simulated if the distribution functions of the repair-failure and failure-repair times are known. The above mentioned distributions can be combined by means of the system logic to simulate the system behaviour or system state changes in time.

In order to simulate the different types of probability distributions it is required to have the corresponding random number generator, all based on the uniform distribution generator, which can

be obtained either from a computer, because most computing languages have a built-in function which provides it (for example RND in BASIC), or from a random number table (see Annex 1).

Monte Carlo methods can be used also to solve deterministic mathematical or technical problems, if a simulation can be developed by means of a probabilistic model.

Example 1. Calculation of the value of π by a probabilistic method (Buffon's problem). A needle of length 2ℓ is dropped on an infinite plane crossed by a set of parallel straight lines (distance between lines: 2a, where a > ℓ; see Fig. 1), assuming a uniform distribution for: i) the distance x from the centre of the needle to the nearest straight line; ii) the angle α between the needle and the straight lines. The probability p that the needle will cut any straight line, x ≤ ℓ sin α, is

p = (1/πa) ∫₀^π ∫₀^{ℓ sin α} dx dα = 2ℓ/(πa)

Therefore, the probabilistic model to calculate π is

π = 2ℓ/(a p̂)

p̂ being determined by sampling, i.e. by running n₀ times the experiment of dropping the needle on the plane and counting the number of successes n, p̂ = n/n₀.
2.

UNIF ORM RANDOM NUMBER GENERATOR

The generation of random numbers from a uniform distribution was based early on gambling devices. Thus, why the statistical simulation methods took the name of Monte Carlo. Annex 1 shows a table of 3500 random digits, in which the digits 0, 1, 2, ..., 9 have an equal chance of appearing at every draw. The digits are grouped for convenience in five digit numbers. A table of random numbers represents the sampling of a uniform distribution. Uniformly distributed random numbers generated by an algorithm are called pseudo-random numbers, because their generation is not at random but the number sequence has passed successfully statistical tests for randomness. Given a sequence of random numbers,

VYk+i'

207

fi .104 .241 .375 .995 .895 .289 .094 0.071 .023 .521

x - /i 26.00 60.25 93.75 248.75 223.75 72.25 23.50 17.75 5.75 130.25

+ .223 .421 .779 ..963 .854 .635 .103 .510 0.010 0.070

C = 180 V i + 1 X 40.140 75.780 140.220 173.340 153.720 114.300 18.540 91.800 1.800 0.001

JL s e n a 112.815 169.638 111.972 20.296 77.483 159.496 55.644 174.914 5.497 0.214

Success
Y Y Y Y Y Y H

6 10

2 ap

350 2.33 250.0.6

Figure 1.

P r o b a b i l i s t i c ir e s t i m a t i o n

( B u f f o n ' s problem),

208 where
]+1

= (])

it is required to supply 0, the seed or the starting point. The algorithm mainly used is the congruential or residue method, defined by (mod m)
Y

INT(KYk/m)m

k+1

γ_k being the current random number, K a constant and m the module of the method. Many high-level computer languages have a function which generates uniformly distributed random numbers in the range 0 to 1. In BASIC this function has the form RND, used in connection with the statement RANDOMIZE(x) to make RND start off at a definite place in its sequence of numbers, or, to get a different sequence each time the programme is run, with the statement RANDOMIZE TIMER.
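A minimal sketch of such a congruential generator is given below; it is illustrative only, and the constants K, m and the seed are assumed values, not parameters quoted in the text.

```python
# Sketch: multiplicative congruential generator gamma_{k+1} = K*gamma_k (mod m).
def lcg(seed, K=16807, m=2**31 - 1):
    gamma = seed
    while True:
        gamma = (K * gamma) % m       # equivalent to K*gamma - INT(K*gamma/m)*m
        yield gamma / m               # uniform pseudo-random number in (0, 1)

gen = lcg(seed=12345)
print([round(next(gen), 4) for _ in range(5)])
```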

Example 2. Estimation of the value of π by artificial sampling. According to Example 1,

π = 2ℓ/(a p̂)

2ℓ being the length of a needle dropped on a plane divided by a system of parallel straight lines (distance between lines: 2a) and p̂ the estimated probability that the needle cuts any straight line. Assuming ℓ = 175 mm and a = 250 mm, the condition that the needle cuts any straight line (success) is 250 γᵢ ≤ 175 sin(180° γᵢ₊₁), γᵢ and γᵢ₊₁ being two uniform random numbers for the i-th trial, which can be obtained from:

i) a random digits table (see Annex 1), as done in the Fig. 1 table. Each line of the Fig. 1 table represents a trial, the last column indicating success (Y) or failure (N); this gives the estimate π* = 2.33.

ii) a computer function, such as RND. A program based on a uniform random number generator which incorporates the success condition gives, for 10,000 trials, π* = 3.115.
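The following sketch illustrates point ii) above; it is not the original BASIC program, only an equivalent crude Monte Carlo estimate using the success condition of Example 2.

```python
# Sketch: crude Monte Carlo estimate of pi from Buffon's needle experiment.
import math
import random

def buffon_pi(n0=10_000, l=175.0, a=250.0):
    successes = 0
    for _ in range(n0):
        x = a * random.random()                 # distance of needle centre to line
        alpha = math.pi * random.random()       # needle angle in (0, pi)
        if x <= l * math.sin(alpha):            # needle crosses a line
            successes += 1
    p_hat = successes / n0
    return 2.0 * l / (a * p_hat)                # pi* = 2*l / (a * p_hat)

print(buffon_pi())
```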

3.  GENERATORS FROM SPECIFIED DISTRIBUTIONS
To convert a uniform random number into one with a specified density function, use is made of the fact that the associated distribution function is uniformly distributed over the (0,1) interval:

F(t) = Pr(T ≤ t) = ∫₀^t f(τ) dτ = γ

γ being a random uniform variable over (0,1). If F(t) can be expressed in closed form it is simple to obtain tᵢ, given γᵢ, by inversion of the distribution function, tᵢ = F⁻¹(γᵢ); a sequence of uniform random numbers thus yields a sequence of artificial samples from the specified distribution function. The application of the inversion procedure, or of alternative methods, to the most frequent probability distribution functions which appear in Reliability Engineering modelling is given below:

i) Exponential distribution

f(t) = λ exp(−λt) ;  F(t) = 1 − exp(−λt) = γ ,  so that  t = −ln(1 − γ)/λ = −ln γ/λ

because γ and 1 − γ have the same distribution.

ii) Weibull distribution (two parameters)

f(t) = (β/η)(t/η)^{β−1} exp[−(t/η)^β] ;  F(t) = 1 − exp[−(t/η)^β]

where β = shape parameter, η = scale parameter, and  t = η(−ln γ)^{1/β}

iii) Normal distribution

The Central Limit Theorem can be used to generate normally distributed random numbers,

x = Σ_{i=1}^{12} γᵢ ∈ N(6, 1) ,  so that  x − 6 ∈ N(0, 1)

and then

t = σ(x − 6) + μ

iv) Lognormal distribution

Accordingly, ln t is distributed normally; with x = Σ_{i=1}^{12} γᵢ ∈ N(6, 1),

ln t = σ(x − 6) + μ ,  t = exp[σ(x − 6) + μ]

v) Poisson distribution

To generate a Poisson variate, which is discrete, it is necessary to find the value of x that most closely satisfies

γ = Σ_{k=0}^{x} e^{−μ} μ^k / k! ,  x ≥ 0

The Poisson random number generator is based on a linear interpolation of the two integral values of x which bracket the value, producing a continuum of values. If a true Poisson distribution is desired, x should be an integral approximation of the generated value.
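A compact sketch of the samplers described above follows (illustrative only; the parameter values passed in the last line are assumptions, not data from the text).

```python
# Sketch: inverse-transform and CLT samplers for the listed distributions.
import math
import random

def sample_exponential(lam):
    return -math.log(1.0 - random.random()) / lam        # t = -ln(gamma)/lambda

def sample_weibull(beta_, eta):
    return eta * (-math.log(1.0 - random.random())) ** (1.0 / beta_)

def sample_normal(mu, sigma):
    s = sum(random.random() for _ in range(12)) - 6.0     # approx N(0,1) by CLT
    return mu + sigma * s

def sample_lognormal(mu, sigma):
    return math.exp(sample_normal(mu, sigma))             # ln t is normal

print(sample_exponential(1e-3), sample_weibull(1.5, 1000.0),
      sample_normal(0.0, 1.0), sample_lognormal(0.0, 0.5))
```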

4.  DIRECT SIMULATION METHOD

The application of Monte Carlo techniques to reliability/availability problems requires:
i) knowledge of the failure/repair probability distributions of the components, in order to be able to generate the pseudo-random times which define component events;
ii) knowledge of the system logic, by means of its truth or decision tables, or the equivalent fault tree, which allows to combine the

component events to get the system state. The model simulated time moves from one event to the next.

Fig. 2 shows a flow chart for the application of Monte Carlo methods to system reliability or availability estimation, h representing the length of the trial and n the number of trials. After running the program, an estimator of system reliability or availability is obtained and/or a histogram of the results, where both reliability and availability are considered as random variables. As a help for Monte Carlo programming, reference /2/ mentions a group of FORTRAN subroutines making up the simulation language GASP. Reference /1/ explains a special language designed for digital simulation: GPSS or General Purpose Simulation System, a flow-charted model with special symbols equivalent to program statements.

At the system design phase the Monte Carlo techniques represent an advantageous tool because of the ease of changing in the model the number and type of components to reach reliability/availability goals. At the system operating phase the model provides "operating experience" to the planning engineer.

In order to have accurate results n, the number of trials, must be large, therefore requiring a great amount of computer time. Assuming a system unavailability Q = 10⁻⁴, 10,000 trials are required on average to observe one failure; with 1000 trials most likely there will be no failure and, therefore, the estimate of availability will be one. The unavailability of the system is estimated after n trials.
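As a small illustration of the per-trial event generation just described (a sketch only, modelled loosely on the GEN subroutine of Annex 2 and using the rates of Example 4 for component 1), alternating times-to-failure and times-to-repair can be drawn until the trial length h is exceeded.

```python
# Sketch: event-driven generation of one component's failure/repair history.
import math
import random

def component_history(lam, mu, h):
    """Return the event times (failure, repair, failure, ...) within (0, h)."""
    t, events, failed = 0.0, [], False
    while True:
        rate = mu if failed else lam
        t += -math.log(1.0 - random.random()) / rate   # exponential inter-event time
        if t > h:
            return events
        events.append((t, "repair" if failed else "failure"))
        failed = not failed

print(component_history(lam=0.1, mu=1.0, h=35.0))
```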

2 = Q i
The accuracy of the estimation, if is sufficiently large to con sider Q to be normally distributed,

Q
= ()2
For a significance level = 0.05, K ^ 2 42 2~2 Q

Assuming ^Q, then it will be necessary to have 40,000 trials to get an estimator of system availability with an accuracy of = 0.01 for a level of significance a = 5%.

212

ISTMI

,h

iI

GENERATION PSEUDORANDOM TIMES

SYSTEM LOGIC

SAMPLE DATA CAL.

i i + J

1>

)*-

MONTE CARLO ESTIMATOR

ROO/Q(h)

( ^

STOP

Figure 2.

Flow c h a r t for Monte Carlo methods applied to system re l i a b i l i t y ( a v a i l a b i l i t y ) estimation.

213 Example 3. Draw a flowchart for unavailability estimation of a three component maintained system and write the related program, assuming an exponential distribution of time to failure and time to repair. Two components are linked by a series configuration being the third one redundant to them. Solution: The flowchart is represented at Fig. 3 and the corresponding BASIC program is given at Annex 2. During time h length of trial the subroutine GEN(,r) (sub routine 660 of Annex 2) generates Time To Failure (TTF) and Time To Repair (TTR) for each component from its constant failure rate (.(0)) and repair rate (r(0)), being 0 = 1 to 3. The events, failures and repairs, of components are represented by the subscripted variable S(j) which takes the following values : component No.1 No.2 No.3 failure 1 3 5 repair 2 4 6

The occurrence of the j event means to advance the trial clock, t(j), to the generated Time To Failure (TTF) or Time To Repair (TTR), what ever is the case, up to the trial time h for every component. The total number of events at trial i is called x(i). The sequence of events for the three components, t(j), s(j), at random are being sorted by the subroutine SORTING, subroutine 830 of Annex 2. From the scheduled sequence of events a behaviour matrix M(x(i),6) is generated representing column No.l, the times to events, t(j), and columns Nos.2, 3 and 4, the states of different components. The status of the system is given at column No.5 and its state at column No.6. The states of components (X1,X2,X3) and the status of the system (X4) are represented by indicator variables which can take the values: 0: operating state (components) or status (system) 1: repairing state (components) or status (system). The eight states of the system are given by the indicator variable XS which can take the following values: 1: IF XI = 0 AND X2 = 0 AND X3 = 0 2: IF XI = 0 AND X2 = 0 AND X3 = 1 3: IF XI = 0 AND X2 = 1 AND X3 = 0 4: IF XI = 1 AND X2 = 0 AND X3 = 0 5: IF XI = 0 AND X2 = 1 AND X2 = 1 6: IF XI = 1 AND X2 = 0 AND X3 = 1 7: IF XI = 1 AND X2 = 1 AND X3 = 0 8: IF XI = 1 AND X2 = 1 AND X3 = 1. Matrix M(x(i),6) can be condensed to matrix C(v,6) where represents the number of system status changes, being its first column the time to these changes and the second the status of the system at that time.

214

(3
f~

. ( ; t i f i l i *.it ill! | III' = l i l . i l

111=' III Ir l l

PUT

Mt

mm emilii nil III

ATII1 (III.)

Till CMOIMATIM CIMI

CJiEJ

Figure 3.

Flow chart for availability estimation of a threecomponent system.

215 The unavailability of the system at trial i is calculated from the condensed matrix where the up and down system times are given. The average value and the standard deviation of the availability of the system is estimated after trials using the described program where availability is considered as a random variable. Example 4. A power system is made up by a diesel generator, components No.l and No.2, back up by a battery, component No.3. Assuming the fol lowing failure and repair rates, Y! = 0.1 2 = 0.2 3 = 0.15 \i = 1 2 = 2 3 = 1 estimate by direct Monte Carlo method, the unavailability of the system for 35 time units and 50 trials. Solution: The execution of the BASIC program included in Annex 2 gives the following results: Q S2 S = 0.0235 = 0.00069 = 0.0263

5. VARIANCE REDUCTION TECHNIQUES The results of the application of a Monte Carlo method is an estimate, presently the mean of the simulation process, as shown_in example 4, and therefore subject to variability asmeasured by Var(6). In order to reduce the number of trials or the computing time, for a given accuracy e, it is required to decrease the variance of the estimate to obtain a new estimator * such that Var(6*) <Var(6). To get * several methods have been developed (refs./l,3/) which can be grouped into correlation/regression techniques and importance methods. Only the control variate (S) and the importance sampling will be considered as they are the most significant methods of each group. The control variate method requires a control random variable with a known probability density function and mean which corres pond to what is understood as a simplified model of the system. Moreover, 0 C should be positively correlated with , then an unbiased estimator * can be defined, * = (0) = + 0 The variance of * indicates that to get a variance reduction it is required Cov(I,ec) > j Var()

216 The system reliability or availability after trials can be estimated by

- + i ( ) ~ * ()

=1 being the random input to Monte Carlo models, full and simplified. The difficulty of this method consists in determining 9 C , the control variate, which can be overcome by a pilot testing of the full model, to obtain an histogram, which can be used as p.d.f. The importance sampling improves precision by taking more samples in domains which make great contribution to the integral
no

= J h (x)f(x)dx

= E(h(x))

using a different sample distribution such as


CO 00

V = J

h(x)

(^

(x)

g(x)dx=[ h(x)g(x)dx = [()]

being , . h(x)f(x) h(x) = g(x) to obtain

S = e*=.* V x )
i=l It is possible to get an important reduction in the variance of the estimate if g(x) is choosen of a similar shape as f(x), but a reduction cannot be assured. Anyhow, the application of variance reduction techniques is based on some qualitative or quantitative knowledge of the system to be modelled. If that knowledge is not existing, these techniques are not applicable.

217 REFERENCES 1 2 3 4 5 6 7 R.y. Rubinstein, Simulation and the Monte Carlo Method, John Wiley & Sons (1981). J. Henley and H. Kumamoto, Reliability Engineering and Risk Assessment, Prentice-Hall (1981). J.R. Wilson, Variance Reduction: The Current State, Mathematics and Computers in Simulation XXV, North Holland Publishing Co. ( 1983) . P. Rothery, The Use of Control Variates in Monte Carlo Estimation of Power, Royal Statistical Society (1982). G. Gordon, System Simulation, Prentice-Hall (1978). A.A. Pritsker, The GASP IV Simulation Language, John Wiley & Sons (1974). J.M. Hammersley and D.C. Hadscomb, Monte Carlo Methods, Methuen (1964).

218 ANNEX 1

* TABLE OF IMO RANDOM DKJIT if.Col.


1 2 3 4 5 6 7 8 9 10 11 12 13 14 13 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 33 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

(I)
10480 22566 24130 42167 37570 77921 99562 96301 89379 85475 2891 63553 09429 10363 07119 51085 02368 01011 52162 07056 48665 54164 32639 29334 02488 81325 29676 00742 05S66 91921 0058! 00725 69011 23976 09763 91367 17955 46503 92157 14577 98427 34914 70060 53976 7607! 90725 64364 08962 95012 13664

g)
IHM 46575 4BS60 93093 3997S 06907 72905 91977 14342 36837 9578 40961 93969 61129 9733 12765 11382 54092 53916 97628 91245 58492 32363 27001 35062 72295 20591 57392 04213 26418 04711 69884 65797 37948 85473 42593 56349 18584 89634 62763 07523 6397 28277 54914 29515 52210 67411 00358 68379 10493

(8)
01336 25595 22527 06243 SI837 lino 56420 05463 63661 43342 88231 48235 52636 87529 71048 51121 52404 33362 46369 33787 83828 22421 05597 87637 26834 04839 68086 39064 23669 64117 87917 62797 93876 29888 73377 27958 90999 18845 94824 53603 3336! 8720 59475 06990 40960 83974 5335 51662 93526 10492

(4)
02011 85393 97265 1680 16656 42751 9994 07972 10281 '53988 3327 03427 92737 85689 08178 31239 60268 94904 58586 09998 14346 74103 24200 87308 07351 6423 26432 66432 26422 94505 77341 56170 55293 88604 12908 50134 49127 49618 78171 81263 4270 82763 46475 7245 07391 9992 3192 25388 70763 58391

(5)
81647 30995 76393 0785 06121 27756 96872 18876 17453 53060 70997 49626 88974 48237 77235 77432 89368 31273 23216 42696 09172 47070 13563 58751 19731 24878 46901 84675 44407 2676 42206 6324 18968 67917 50885 04024 20044 02304 84610 39667 0I6J8 5447 23219 68350 58745 3831 14883 61642 10593 91151

(6)
9164 19198 4109 16376 91762 33498 3101 20922 18103 59535 79936 69445 33488 32267 1391 16308 19885 04146 14513 06691 30168 25306 38005 00256 92420 62651 20849 40027 44048 13940 33126 88072 17354 48708 18317 16383 59931 51038 82834 47356 92477 17032 55416 82948 15774 58857 24413 3407! 04542 11099

(7)
69179 279112 13170 39440 60468 18602 71194 94393 57740 38867 56865 18663 36320 67689 47364 60736 55322 18594 83149 76968 90229 76468 94342 45854 60951 66566 89768 32832 37937 39972 74087 76222 26575 18912 26290 29880 06115 20655 09922 36873 66969 873B9 94970 11398 22987 50490 59744 81249 76463 59516

(8)
M194 53402 24H30 33537 1305 70659 18738 36869 84378 2300 05859 72693 17617 93394 105 92144 44819 29832 98736 13602 04754 26384 28728 15398 1280 14778 61536 61362 63904 12209 99547 56086 06625 82271 35797 99730 20342' 58727 25417 36307 8420 40K36 25832 42B7B 80059 83765 92351 3SA4B 54328 81651

(9)
62590 93965 49340 71341 496*4 90655 44013 9014 25331 0815

(IO)
56207 J10U5 32061 57004 60672 13033 48840 60045 1256 17983

(11)
20969 52666 30660 OOB49 141 IO 2191 03213 I842S 58678 16439 01547 12234 84115 85104 29571 70960 64835 31132 94738 88916 30421 21524 17012 10367 3258 13300 92259 64760 75470 91401 43808 76038 29841 S36II 54952 29060 73708 56942 25555 B9656

(12)
99370 19174 19655 74917 06927 1125 21069 4905 44947 11438 5590 90311 27156 20265 744! 3990 44919 01915 17752 18309 6166 15227 4 I t i 07684 86679 87074 37101 64564 66520 42416 76635 65853 80150 54262 37888 09230 83317 53389 21246 MIOS 04102 86B63 72828 46634 14221 57375 04110 45578 14777 22923

(13)
91291 39615 6334 97758 01263 44394 10634 42508 05585 IS393 91610 35703 30613 29975 28551 73601 05944 92747 35136 15625 99904 96909 1829 56188 50720 79666 80428 96096 34693 07844 62028 7791 12777 85983 58917 79656 36103 20562 55309 77490 46880 77775 00101 06541 60697 56228 25726 7B547 62750 32261

(14)
90700 99505 311022 16378 34613 428X0 12958 32307 56941 4952 78188 90322 74952 89868 90707 4071 55157 4951 53749 58104 32812 44591 22851 18310 94933 3725 25280 98253 90449 69618 76630 88006 46301 03547 88030 73211 42791 87338 20468 1806! 45709 69348 66794 97809 59585 41546 51900 81788 92277 85653

90106 31595 52180 20847 30015 ' 0 8 2 7 2 01311 26358 97733 85977 49442 01188 71585 23495 31831 59193 58151 35806 46557 50001 76797 86645 98947 43766 71500 81817 84637 40601 65424 05998 55556 18059 28168 44137 61607 04880 32427 69973 80587 39911 53657 97473 56891 02349 27195 33900 65255 83030 64350 46104 2217 0664 06912 41133 7638 14780 12659 96067 66154 4368 42607 93161 59920 69774 41688 84833 02008 13475 48413 49518

45585 46363 70002 70663 94884 ' 19661 88267 47363 96189 41151 14561 89286 69332 17247 48225 31720 33931 48373 28865 -46751

219 ANNEX 2
20 'Failure repair process. THREE COMPONENTS SYSTEM. Events/samples! 300 30 RANDOMIZE TIMER 40 DIM L(3),R(3>,A(100) 50 FOR 01 TO 3 60 PRINT "C omponent"|0|"i"( 70 INPUT "Failure rate, Repair rate"L<0),R(0) BO NEXT O 90 INPUT "Samples number, Time"|N,HiDIM X(N),N*(N),K(N>iU=OiC U=0 100 'Sample il Nu. events=x(i)| unavailabilityk(i) 110 FOR 1=1 TO ISO DIM T(300>,8(300),B(3) 130 'Random time t(j) leading to event 5<j) at sample i. Number of events comp onent olB(o) 140 FOR 01 TO 3 150 LL<0>iRR(O) 160 GOSUB 660 170 B(0)MllIF M"" THEN N*<I)=M* 180 FOR Jl TO B<0> 190 T< J+XU))=A(J> 200 IF 0=1 AND J=2INT(J/2> THEN S(J+X<I>>=2:G0T0 S60 210 IF 01 THEN S<J+X<I>)=1iGOTO 260 220 IF 0=2 AND J=2*INT<J/2) THEN S< J+X < I > >=4 :G0T0 260 230 IF 0=2 THEN S(J+X<I))31GOTO 260 240 IF 03 AND J=2*INT(J/2) THEN S(J+X(I)>=6:G0T0 260 250 S(J+X(I))5 260 NEXT J 270 X(I)=X(I)+B<0) 2B0 NEXT 0 290 GOSUB 830 300 DIM M(X<I),6),C C X<I>,2>V0lD=0 310 'Behavior matrixiM. C olumnsxl time 2,3,4; components 1,2,3, states (0 or 1) 5; system status (O or 1) ~ 6; system states (1 to 8) 320 '"C ondensed matrixiC . Columnsil, time 2; system status (0 or 1 ) " 330 FOR Jl TO X(I ) 340 M(J,1)T(J) 350 IF S(J)=1 THEN M(J,2)=1lK=2lG0SUB 910 360 IF S(J)=2 THEN M<J,2)=0:K=2:G0SUB 950 370 IF S(J)=3 THEN M(J,31=1lK=3:G0SUB 910 380 IF B(J)4 THEN M<J,3>0:K=3:G0SUB 950 390 IF S(J)=5 THEN M< J,4 )l |K=4:G0SUB 910 400 IF S(J>=6 THEN M<J,4)=0:K4lGOSUB 950 410 IF (M(J,2)=0 AND M<J,3)=0) OR M(J,4)0 THEN M(J,5)0lG0TO 430 420 M(J,5)1 430 IF M(J,2)=0 AND M(J,3)=0 AND M(J ,4)=0 THEN M( J,6) = l iGOTO 510 440 IF M<J.2)0 AND M<J,3)0 AND M(J,4)=1 THEN M(J,6)=2tG0T0 510 450 IF M<J,2)=0 AND M(J,3)=1 AND M(J,4)=0 THEN M(J,6)=3lG0T0 510 460 IF M(J,2)=1 AND M(J,3)=0 AND M(J,4)=0 THEN M(J,6)=4:GOTO 510 470 IF M(J,2)=0 AND M(J,3)=1 AND M(J,4)=1 THEN M(J,6)=5lG0T0 510 480 IF M(J,2)=1 AND M(J,3)=0 AND M<J,4)=1 THEN M<J,6)=6:GOT0 510 490 IF M(J,2)=1 AND M(J,3)=1 AND M(J,4>=0 THEN M(J,6>=7:G0T0 510 500 M(J,6)B 510 NEXT J 520 FOR J2 TO X(I) 530 IF M(J,5)OM(Jl,5)THEN VV+l >C V, 1 )=M( J , 1 ) |C( V,2)=M< J,5) < 540 NEXT J 550 FOR J=l TO Vl 560 IF C(J,2)0 THEN DD+C IJ,1>C <Jl.1) 570 NEXT J 580 IF V O O AND V=V* INT<V/2)THEN D=D+ < V, 1 )C Vl , 1 ) I GOTO 600 C ( 590 IF V O O THEN D=D+HC ( V. 1 > lGOTO 600 600 IF N(I)<>"" THEN K(I)=D/HlU=U+K<I)|C U=C U+K<)2 610 ERASE M,C ,T,S,B

220 ANNEX 2 (cont.)

EO NEXT I 630 PRINT "System unavailability estimation**" }U/N 640 PRINT "Var<Q("Hs")";(CU/N)(U/N)'S 650 STOP 660 'Random square wave generator 670 DEF FNE<X)=LOG<RND)/X 680 T1=0:M=0 690 TTF=FNE<L) 700 'TTF: Time To Failure 710 Tl=Tl+TTF:M=M+llIF M M O O THEN BOO 7S0 IF T1>H THEN 810 730 A<M)=T1 740 TTR=FNE<R> 750 'TTR: Time To Repair 760 T1=T1+TTRM=M+1:IF M>100 THEN BOO 770 IF T1>H THEN 810 7B0 A(M)=T1 . 790 GOTO 690 BOO M*="*":GOTO 820 BIO M*="0" 830 RETURN 830 'Time events sorting. Sample i 840 FOR P=l TO X(I) 850 FOR 0=1 TO X(I) 860 IF T(Q)>T(P) THEN fil=T(P):B1=T(G)C1=S<P)|D1=S(B):T(P)=B1iS<P)=DlsT(Q)=Al:SI Q)=C1 B70 NEXT Q 880 NEXT 890 RETURN 900 'Behaviour matriz 910 FOR 0=J+1 TO X(I) 980 M(0,K)=1 930 NEXT 0 940 RETURN 950 FOR 0=J+1 TO X(I> 960 M(0,K)=0 970 NEXT O 980 RETURN

COMMON CAUSE FAILURES ANALYSIS IN RELIABILITY AND RISK ASSESSMENT

Aniello Amendola Commission of the European Communities Systems Engineering and Reliability Division Joint Research Centre - Ispra Establishment 21020 Ispra (VA) - Italy ABSTRACT. The analysis of the so-called Common Cause Failures and their effects on systems reliability has been long time a controversial issue. Recently a European project aimed at the assessment of PSA methods and procedures via benchmark exercises (i.e. independent analyses of a same reference problem by different teams) has given a significant contribution towards the establishment of an internationally agreed procedural framework for incorporating dependent failures in PSA. After a review of available definitions and models, and the description of the major results of the benchmark exercises, the principal steps of the recommended procedure are described. Even if the paper is referred to Nuclear Power Plants, the general framework can easily be transferred to other kinds of complex systems.

Abbreviations

AFWS = Auxiliary Feedwater System
BFR = Binomial Failure Rate model
BP = Basic Parameter model
CCF = Common Cause Failure
CMF = Common Mode Failure
ET = Event Tree
FMEA = Failure Mode and Effect Analysis
FT = Fault Tree
LOCA = Loss of Coolant Accident
MGL = Multiple Greek Letter model
NPP = Nuclear Power Plant
PSA = Probabilistic Safety Assessment
RBE = Reliability Benchmark Exercise

221

222 1. INTRODUCTION To protect hazardous processes against uncontrolled accident developments, safety systems are required to achieve high availability and reliability standards. Therefore special redundancies and multiplicities of critical items are usually adopted to avoid that the malfunction of some few components or the unavailability of some single systems be sufficient to degrade the overall plant safety feature. However possible unidentified dependency structures, random occurrences of abnormal environmental conditions, inadequacies in design or manufacturing, inappropriate system operation, etc. may provoke that multiple items are lost in a critical time period, nullifying the efficiency of the built-in defences. All the events provoking the loss of multiple items or defences, which cannot be explained by a simple conincidence of random independent failure events, have been commonly designated as common mode or common cause failure events. The quite different nature of the possible dependency structures implied under an unique term has resulted for a long time in a certain uncertainty with respects to the approaches adopted both for modelling CCFs, and for collecting the relevant data in a consistent way. On the one hand CCFs were recognized to have a dominating impact on the reliability of high redundancy systems, on the other one the paucity of the available data and the general modelling difficulties discouraged the analysts to approach the problem in usual risk and reliability assessments. PSAs of several NPPs did not include CCFs analysis. This unsatisfactory state-of-the-art was evident during the benchmark exercise organized by JRC, Ispra to assess models, and uncertainties linked with a system reliability analysis 1,2.1. Therefore a special project was launched to clarify the difficulties encountered by the RBE participants in dealing with CCFs and to eventually achieve a consensus on the most suitables approaches to be adopted. The CCF-RBE was very successful in matching the objectives assumed. Before describing its major outcomes, it is necessary to briefly review some relevant definition and modelling issues.

2. SOME TERMINOLOGY PROBLEMS The notation CMF has been used for a long time to indicate the loss of multiple redundant components at the demand or during the mission of a system, when this loss cannot be explained by the anticipated failure rates of the components assumed to be independent.

223
This notation is adequate in simple cases as for the scheme of Fig. 1, which shows four redundant identical circuit breakers: the closure of only one of these breakers is able to allow a current to flow between nodes A and B. A breaker has a finite number of failure modes (i.e. no closure on demand, spurious opening, stuck closed, abnormal contact resistance, etc.);, each failure mode might have a rather large number of different causes (mechanical failures provoked by several different factors, electrical failures, loss of command provoked by unavailability of other systems, etc.). When the breakers are demanded to close and they fail to close, a multiple failure event occurs which refer to the same failure mode (no closure on demand). Such simple case is well identified by the CMF label. However even to model and to quantify this event it is necessary to distinguish among its possible causes as it will be shown at section 3. In reality many times redundancy is obtained by different components or systems, based on diverse functional principles and fed by diverse energy sources: this in order to better protect the design against CMFs. In such cases, simultaneous loss of redundant systems can occur with different modes as a result of some "common" or, in any case, related causes. For instance, a human operator - because of wrong training or faulty instructions or his own negligence - may leave a valve "closed" instead of "open" after maintenance, and another one "open" instead of "closed": in this way, a same cause has provoked at the same time two different mode failures. Therefore, the term CCF has been more widely adopted and has been extended to different classes of events /3,4,5/. Table I presents a well-known classification scheme of possible "CCFs". It is a useful guide, a kind of check-list of potentially dan-

~^L

Figure 1. A 4-redundant circuit breaker system.

CMF CAUSES ENGINEERING {El

DESIGN I | (EOFI FUNCTIONAL DEFICIENCIES Hozord Undetectable Inadequole Instrumentation Inadequale Control

OPERATIONS I PROCEDURAL |0P| IOPMI | MAINTENANCE C TEST Imperlecl Repair Imperlecl Testing Imperlecl Calibration

ENVIRONMENTAL IOE I I |OEN]| |I0EEI NORMAL ENERGETIC EVENTS EXTREMES Tempera lure Pressure Humidity Vibration Acceleration

IEOI ||EDHI REALISATION FAULTS Channel Dependency Common Operation 4 Protection Components Operational Deficiencies

CONSTRUCTION IECI | IECMI MANUFACTURE

I IECI I
INSTALLATION I COMMISSIONING Inadequate Oualily Control Inadequate Standards Inadequale Inspection

'

| IOPOI OPERATION

Inadequate Oualily Control Inadequate Standards Inadequate Inspection' Inadequate Testing

,L ,

Operator Errors Inadequale Procedures Inadequate Supervision Communication Error

Fire
Flood Weather Earthquake Explosion

I
Inadequate Testing t Commissioning

I
Imperlecl Procedures Inadequate Supervision

I
Inadequale Componenti 0 e sign Errors Design Limitations

I
Stress Corrosion Contamination Interference Radialion

I
Missiles Eleclrlcol Power

I
Radiation Chemical Sources

I
Sialic Charqe

TABLE I. A CCF classification scheme /4/.

225 gerous factors against which defences must be incorporated in the design of a system to achieve the reliability target implied by the adoption of multiple redundancies. Too disparate factors are however included in Table I, which cannot be modelled in the same way in a CCFs analysis or cannot be used in data collection exercises without further distinctions. Therefore, more recently a large effort has been spent towards taxonomies allowing data to be more consistently collected and usefully screened for estimation of the relevant parameters of the models proposed /6,7/. Also, it has been recognized that "the label CCF is an inadequate descriptor of events that can be more precisely designated by an adequate classification system. It is, therefore, recommended that the technical community discontinue the use of this simplistic term" /6/. The reasons for supporting such a statement can be better understood when observing the differences in the modelling approaches that the different dependency structures may deserve.

3. CLASSES OF DEPENDENCY STRUCTURES Dependency among different events can be provoked either by functional reasons (and is then principally of deterministic nature) or by the occurrence of random events (and is then principally of stochastic nature). Furthermore each of this principal set can be subdivided into different subsets according to the possible differences in the existing functional links or in the nature of the random events themselves. Without attempting to be exhaustive the principal classes of possible dependency structures are briefly reviewed with respects to the most suitable modelling approach they deserve. 3.1. Functional unavailabilities within a same system This is the case when the unavaibility or the failure of a component A impedes one or more other components to perform their intended functions: these components are not failed but because they do not receive the required feed or input they are unavailable. For instance the spurious opening of a protection (failure event) which is in common to multiple electrical components make them unavailable because of loss of power supply. Such effect can be modelled by a series-parallel block diagram as in the exemplification scheme of Fig. 2, in which the loss of the common element A implies the loss of the functions of the items B, C and D. In this case, redundancy is built only in certain parts of the system, whereas some single failures are able to make the system un-

226

C
A

Figure 2. A seriesparallel system.

available. This might be an intended design choice. Some time however this occurs because of poor analyses or design errors. Identification of such kinds of dependencies is one of the princi pal tasks of a structured qualitative system analysis before any relia bility modelling. Once the functional unavailabilities have been identified they can be easily included in the logical model chosen for the system structure (f.i. into a fault tree). Therefore, they do not deserve a special analytical treatment. 3.2. Functional unavailabilities among different systems High degree of protection is generally ensured by the installation in a plant of multiple safety systems, fter based on different functional principles. Whichever the degree of separation, ,the systems interacts each other either because of some functional boundaries, or because they share some power supply source (for instance electrical grids), or sim ply because of their location, their interaction with a same process; or, finally, because they are controlled or supervised by human operators. Boundaries between systems need always a certain conventional def inition. For instance, the electrical systems can be considered sepa rately from the feedwater systems they supply with the required power. The boundary between electrical systems and the supported one can be put at the bus bar, or at the breaker, or at certain protections. The first step in a reliability analysis of multiple interfacing

227
systems is a clear definition of the boundaries assumed. Afterwards the support functions needs to be identified. The un availability of a system A may cause the loss of a support function for the system and therefore the unavailability of system B. For in stance, let be a pumping system driven by electrical motors, and A the electrical supply system. Obviously, if A is unavailable, the elec trical motors of system cannot be available. The interactions between complex systems are, in general, not al ways so easily detected: many times they are hidden and sometimes they are detected only because of the accidents they provoke. Before any modelling, they should be examined by an indepth qualitative analysis. In some cases different systems share some common components or some common support functions. Therefore unavailability of some part may provoke multiple unavailabilities in other parts of the affected systems. Such kinds of related unavailabilities can be explicitly modelled in fault trees or event trees. The analysis of event trees for systems not independent of each other can be performed by taking into account the appropriate conditional probabilities, as schematically illustrated in Fig. 3, or by the approaches described at Ref. 8. Therefore, once identified, intersystems functional unavailabili ties do not deserve any other particular treatment.

P(A ∩ B) = P(A) P(B/A)
P(A ∩ B̄) = P(A) P(B̄/A)
P(Ā ∩ B) = P(Ā) P(B/Ā)
P(Ā ∩ B̄) = P(Ā) P(B̄/Ā)

Figure 3. Event tree.
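As a minimal numerical sketch of the scheme of Fig. 3 (all probability values below are hypothetical and serve only as an illustration), the following fragment combines the probability of event A with the conditional probabilities of B to obtain the four branch probabilities of the event tree; for truly independent systems P(B/A) would simply reduce to P(B).

```python
# Branch probabilities of the two-event tree of Fig. 3.
# Hypothetical values: the dependence between the two systems is
# expressed by P(B|A) differing from P(B|not A).
p_A = 1.0e-2             # probability that system A fails
p_B_given_A = 0.3        # probability that B fails, given A failed
p_B_given_notA = 1.0e-3  # probability that B fails, given A available

branches = {
    "A and B":         p_A * p_B_given_A,
    "A and not B":     p_A * (1.0 - p_B_given_A),
    "not A and B":     (1.0 - p_A) * p_B_given_notA,
    "not A and not B": (1.0 - p_A) * (1.0 - p_B_given_notA),
}

for sequence, prob in branches.items():
    print(f"P({sequence}) = {prob:.3e}")

# The four branch probabilities sum to one.
assert abs(sum(branches.values()) - 1.0) < 1e-12
```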

3.3. Cascade failures

These events are also referred to as failure propagation or component-caused failures. The unavailability of a component (system) may in some cases provoke not only the unavailability of other components (systems) but also their failures. For instance, the failure of a protection may provoke the failure of electrical items subjected to too high a current, the loss of a lubrication system may provoke the failure of the related pumps, etc. The failure of a component A provoking failures of components B, C and D (Fig. 2) can still be modelled in a similar way as in the case of functional unavailability; however, in this case the restoration of the function supplied by the component A does not automatically imply the restoration of the functionality of B, C and D, since they are failed. Therefore longer repair times are to be expected, with possible consequences on the overall reliability and availability features of the system. As in the cases of inter-component or inter-system functional unavailability links, component (system) induced failures should be identified by a proper qualitative analysis and afterwards explicitly included into the logical fault tree/event tree models. The procedure is straightforward when the failure of a component A causes with certainty the failure of a component B. In some cases, however, the failure does not follow with probability 1, at least during the mission time considered. If the analyst does not wish to introduce the conservative assumption (P(B/A) = 1), he might either estimate the appropriate P(B/A) by engineering judgement or he should have adequate data. As an example, in the case that the failure of a component A provokes a higher temperature in the system, the failure rate of B might then be higher than its nominal value. Once this datum is available, the Markovian approach /9/ can handle such cases and, therefore, a combination of a fault tree model with a Markov one might result in more accurate probability estimations (a small sketch is given below).
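The following is a minimal sketch, with hypothetical failure rates, of the Markov treatment just mentioned: component B fails at an elevated rate once component A has failed, and the state equations are integrated numerically over the mission time.

```python
import math

# Markov sketch for a cascade-type dependency (item 3.3):
# once component A has failed, component B sees a higher failure rate.
# States: 0 = A and B working, 1 = A failed / B working,
#         2 = B failed / A working, 3 = A and B both failed.
LAM_A = 1.0e-4           # failure rate of A (per hour, hypothetical)
LAM_B = 2.0e-4           # nominal failure rate of B (hypothetical)
LAM_B_DEGRADED = 2.0e-3  # failure rate of B once A has failed (hypothetical)

def prob_b_failed(t_mission, steps=100_000):
    """Probability that B has failed by t_mission (states 2 or 3),
    obtained by explicit integration of the state equations."""
    p = [1.0, 0.0, 0.0, 0.0]
    dt = t_mission / steps
    for _ in range(steps):
        f01 = LAM_A * p[0] * dt           # A fails first
        f02 = LAM_B * p[0] * dt           # B fails at its nominal rate
        f13 = LAM_B_DEGRADED * p[1] * dt  # B fails at the elevated rate
        f23 = LAM_A * p[2] * dt           # A fails after B
        p[0] -= f01 + f02
        p[1] += f01 - f13
        p[2] += f02 - f23
        p[3] += f13 + f23
    return p[2] + p[3]

T = 1000.0  # mission time, hours
print(f"P(B failed), with dependency   : {prob_b_failed(T):.3e}")
print(f"P(B failed), independent model : {1.0 - math.exp(-LAM_B * T):.3e}")
```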

3.4. Constraints due to procedures

For the sake of completeness, it is worth recalling that particular kinds of dependencies can be introduced by administrative procedures (f.i. technical specifications) among the operability of plants and/or systems. Operations might be conditional on the state of certain systems or on the occurrence of given events. According to the particular case, fault trees with boundary conditions /8/, Markovian methods /9,10/ or Monte Carlo methods /11/ can analyse such problems in an appropriate manner.

Also, other constraints or dependencies can be introduced by the repair/maintenance policy assumed. For instance, in repairable systems the performance processes of the different components can be considered as independent only if an adequate number of repair operators is available. These kinds of problems should be considered by adopting the most suitable reliability analysis tool for the case under examination (e.g. FT, Markov, Monte Carlo).

3.5. External events

Abnormal environmental conditions may provoke the simultaneous loss of functional and protective systems. Major fires, floodings, hurricanes, earthquakes, external explosions, aircraft crashes, etc. might indeed be common causes for multiple failures or unavailabilities. They should be subjected to a special type of analysis which, within the limits of this lecture, can be summarized in the following steps /12/:
- establishment of a check-list, as complete as possible, of all hypothesizable external events;
- motivated exclusion from further consideration of events which are not significant for the specific site (e.g. floods in a desert, aircraft impacts where no flight corridors exist);
- evaluation of the occurrence and intensity probabilities of the relevant events;
- evaluation of the exposure of the components to these events (how the design defences can mitigate or make negligible their effect) and of the induced loads;
- evaluation of the vulnerability (f.i. fragility curves) of the components subjected to the effects of the relevant events (f.i. vibrations, high temperature, etc.);
- comparison of the fragility curves with the anticipated loads to evaluate which components are expected to fail (see the sketch below);
- repetition of the probabilistic analysis by excluding the components not surviving these events (or by considering their appropriate conditional probabilities of failure).
Of course, the procedure might involve different degrees of sophistication, from simple engineering judgement to very complex stochastic models, according to the seriousness of the investigated problem.
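Referring to the comparison of fragility curves with the anticipated loads mentioned in the list above, the sketch below (all curves and numbers are hypothetical) integrates a lognormal fragility over a lognormal distribution of the induced load to obtain the conditional failure probability of a component given the external event.

```python
import math

# Item 3.5: convolution of the induced load with a component fragility
# curve, to estimate the conditional failure probability given the
# external event. All parameters are hypothetical.

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def fragility(a, a_median=0.6, beta=0.4):
    """Lognormal fragility: P(component fails | load a)."""
    return normal_cdf(math.log(a / a_median) / beta)

def load_pdf(a, a_median=0.2, beta=0.5):
    """Lognormal density of the load induced by the event."""
    return math.exp(-0.5 * (math.log(a / a_median) / beta) ** 2) / (
        a * beta * math.sqrt(2.0 * math.pi))

# Numerical integration of fragility(a) * load_pdf(a) over the load range.
da = 0.001
a_values = [0.001 + i * da for i in range(3000)]
p_fail_given_event = sum(fragility(a) * load_pdf(a) * da for a in a_values)

event_frequency = 1.0e-3  # occurrences per year (hypothetical)
print(f"P(failure | event) = {p_fail_given_event:.3e}")
print(f"Failure frequency  = {event_frequency * p_fail_given_event:.3e} per year")
```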

3.6. Dependency structures linked with human factors

As Table I shows, many causes of dependencies can be attributed to human errors. These may occur at any stage of system design and operation and may present very different patterns.

The TMI accident is very instructive about some typical effects of human malfunctions and is therefore worth being briefly analyzed under this perspective. The incident was initiated by an anticipated harmless transient (event B), which provoked the shutdown of the plant and demanded the start-up of the auxiliary feedwater system. The AFWS had been designed with a redundancy degree large enough to ensure it a high availability figure. However, because of a human failure (event A) having occurred at a previous time, all the isolation valves of the redundant trains had been left closed after completion of the required maintenance interventions: a typical CCF event. Consequently, the loss of heat removal provoked a pressure increase in the primary cooling system; a relief valve on the pressurizer opened to mitigate the pressure transient but stuck open (component failure event C) when the pressure decreased below its critical level. All that happened between events B and C was a physical functional transient conditioned by the boundary condition A, with a pattern familiar to the operator. However, because of event C a loss of coolant incident (small LOCA) through the pressurizer relief valve was initiated: the operator failed (event F) to recognize this event for a time sufficient to provoke a partial core melt-down, despite later correct interventions. All actions before the correct diagnosis had in common a wrong representation of the event being faced.

Now the event F was not simply a wrong response of the operator to the actions demanded of him, which could be explained by an "independent" human error. F was in some way "functionally linked" with two other events having previously occurred in the design of the hardware and software of the man-machine interface. Namely, the operator did not recognize that the pressurizer relief valve was stuck open because the corresponding signal in the control room indicated that the valve was closed. However, by a design error (event D), the sensor for the control room indicator monitored only the presence of the command to close and not the actual valve state. Furthermore, the dynamics of the process had been too poorly analyzed (event E) with respect to small LOCA occurrences; and, therefore, the training and the operational procedures were not adequate under these circumstances. The event F was clearly dependent on D and E. Even if some other previously occurred events might suggest that the overall plant management might have introduced some other, more general dependency structures among the events, by limiting the analysis to the description presented above the events B and C have to be considered as independent random failures, and A has to be considered independent of F, D and E. However, the event A in itself and the events F, D and E are representative of some typical CCF categories included in Table I, and are worth discussing with respect to the most suitable approaches to be adopted

for them in a reliability assessment. Possible multiple failures caused by human interventions in test and maintenance operations (events like A) can be identified through an in-depth task analysis of the corresponding procedures and of the man-machine interface. It might also be possible to estimate some probability figure for multiple failures by using the methods proposed for human reliability /13/; these, however, still give rather uncertain results. Use of field data should therefore be recommended. It might also be useful in practice to include these events in the generic dependency classes to be dealt with via parametric models (item 3.8).

The existence of dependency structures between failure of diagnosis (F) and the hardware and software of the man-machine interface, and, therefore, the correction of possible design (D)/procedure (E) faults, can be investigated by the tools described at Ref. 12, characteristic of the assessment of human reliability. The major difficulties in such an assessment are the dynamic aspects both of the process and of the operator interventions; these can be better investigated experimentally via replica simulators or can be modelled via proper dynamic tools /14,15/. However, it is practically impossible to study the man-machine interaction in all possible conditions, so that one can never exclude that some design/procedure inadequateness might still be hidden and reveal its effects only under particular demands. Such possibly existing residual faults might be included into the generic dependency class for which parametric models can be used.

3.7. Statistical correlations among failure rates

A particular class of correlations is that which should be assumed to exist between failure rates of nominally identical components /16/. These correlations must be correctly taken into account when evaluating the uncertainty distributions and the mean values of the reliability figures of interest; otherwise underestimation errors will occur /17,18/.

3.8. Residual class of potential dependencies

After having analyzed the system in order to identify possible dependencies of the kinds described at items 3.1 to 3.7, further attention must be devoted to other factors which might additionally provoke multiple related failure events. Indeed, the operational experience of redundant systems indicates that, despite the defences normally built in against their occurrence /19/, a series of events have the potential to affect the performance processes of "independent" items in a related way. Whereas from a mathematical point of view it would be

possible to achieve an unlimitedly high reliability by increasing the number of redundancies, in practice these common factors will always put limits on what is really achievable. Fig. 4 /20/ visualizes how the UKAEA-NCSR has synthesized the experience processed from different technical fields by defining some cut-off levels for the minimum unavailability of different systems. This class of events is constituted of (see also Table I):
- errors or inadequacies in design, construction and installation which have not been detected during the commissioning tests or normal operations and may become evident under particular system demands;

(Bar chart: minimum unavailability cut-off levels, on a logarithmic scale, for the system types A to E described below.)

Type of system: A, simple, unique system; B, simple, redundant system; C, partially diverse system; D, completely diverse system; E, two separate systems, each one diverse.

Figure 4. Cut-off levels for different system types (UKAEA-NCSR).

- faults in operating, control, test and maintenance procedures which reveal their effects under particular system demands;
- incompleteness and errors in the system analysis and modelling (for instance, lack of identification of the existence of functional dependencies among components and systems such as those described previously);
- human errors during test and maintenance operations enhancing stress on components or leaving the systems unavailable;
- fabrication or material defects which might appear on components having in common piece-parts from a same manufacturer;
- environmental conditions which might enhance corrosion and wear of a certain group of components, etc.
Some of these faults present a "debugging" pattern similar to that encountered when dealing with software reliability. It would be theoretically possible to analyse and model each specific factor. However, because of the paucity of the data for each single phenomenon, and of the very significant increase in size and complexity of the resulting system model, it might be much more cost-effective to include all these residual factors within the so-called implicit or parametric models for CCFs. Several models have been proposed to take into account all the dependencies which are not explicitly considered in the model structure chosen for the system reliability assessment. The most usual ones are briefly reviewed in the following.

4. PARAMETRIC STOCHASTIC MODELS

Parametric stochastic models have been proposed to describe occurrences of random events which may affect multiple items in a same way. In the following, the models, amongst the most usual ones, which have been adopted by participants in the CCF-RBE will be briefly described. Further models will be mentioned after the discussion of the CCF-RBE results.

4.1. Basic Parameter model (BP)

This model is based on a symmetrical specialization of the Marshall-Olkin multivariate exponential distribution /21/. It requires a direct estimation of the rates of failure of groups of items in the redundancies considered /22/. Let us consider three identical items A, B, C and, for the sake of simplicity, let us refer to the case of the unavailability on demand. Then, the following rates are defined:

λ1 = rate of failure on demand for one out of the three items (independent);
λ2 = rate of failure on demand for two out of the three items due to some common cause;
λ3 = rate of failure on demand for all three items due to some common cause.
Because of the symmetry assumption, λ1 is the same for A, B and C, and λ2 applies to any couple AB, AC and BC. The generalization to n items is straightforward. If the items are in a 1/3 success redundancy scheme, the unavailability will then be given by

Q(1/3) = λ1³ + 3·λ1·λ2 + 3·λ2² + λ3     (1)

where
- λ1³ expresses the contribution of independent failures;
- 3·λ1·λ2 expresses the contribution of one independent failure plus a double failure (3 modes);
- 3·λ2² expresses the contribution due to double failure events (3 modes): if a cause occurs which provokes a failure of A and B, the occurrence of failure causes for both A and C should be considered even if in reality A is already failed;
- λ3 expresses the contribution of the common failure of A, B and C.
If the items are in a 2/3 success redundancy scheme, the unavailability will analogously be given by

Q(2/3) = 3·λ1² + 3·λ2 + λ3     (2)

Since the total failure probability (λt) of an item A is provoked by events which would fail A only, A and B or A and C together, or, finally, A, B and C at one time, it can be written

λt = λ1 + 2·λ2 + λ3     (3)

Were enough data available for a direct estimation of the relevant λi, this model would allow a straightforward probability assessment. However, the data normally retrievable from component or incident reports do not contain all the information needed for the λi estimation. Therefore, other models have been proposed which put less stringent requirements on the data base, by making some further hypotheses on the relations among the rates of multiple failures.
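As a small numerical illustration of Eqs. (1) to (3), the following fragment evaluates the Basic Parameter model for the three-item example; the rates used are hypothetical.

```python
# Basic Parameter model, three identical items (Eqs. 1 to 3).
# Hypothetical failure-on-demand rates:
lam_1 = 1.0e-2   # single (independent) failure of one item
lam_2 = 5.0e-4   # common cause failure of a given pair of items
lam_3 = 1.0e-4   # common cause failure of all three items

# 1-out-of-3 success scheme: the system is unavailable only if all
# three items fail (Eq. 1).
Q_1oo3 = lam_1**3 + 3.0 * lam_1 * lam_2 + 3.0 * lam_2**2 + lam_3

# 2-out-of-3 success scheme: two failed items are enough (Eq. 2).
Q_2oo3 = 3.0 * lam_1**2 + 3.0 * lam_2 + lam_3

# Total failure probability of one item (Eq. 3).
lam_t = lam_1 + 2.0 * lam_2 + lam_3

print(f"Q(1/3)   = {Q_1oo3:.3e}")
print(f"Q(2/3)   = {Q_2oo3:.3e}")
print(f"lambda_t = {lam_t:.3e}")
```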

4.2. The β-factor model

This was the first model to be applied /22/, since it had the advantage of requiring the estimation of only one parameter in addition to the total failure rate of a component. Furthermore, this estimation does not depend on success information about component or system demands. Such data are hard to find in incident reports, the most informative source for multiple failure events. It has, however, the disadvantage of being rather conservative for high redundancy configurations, since it assumes that common cause occurrences will fail all the components in the group. Coming back to the previous example (Eq. 3), λ2 ≈ 0 and therefore, at a fixed λt, λ3 is higher. By indicating with β the ratio between the common cause rate and the total failure rate, Eq. (3) becomes

λt = λI + λC = (1 - β)·λt + β·λt     (4)

where
λI = (1 - β)·λt is the "independent" failure rate;
λC = β·λt is the "common cause" failure rate, which applies to whichever number of redundancies.

4.3. The Multiple Greek Letter model

In this method /24/, which extends the β-factor model, further parameters were introduced to distinguish among common cause events affecting different numbers of components in high redundancy systems. For a system of k redundant items, for each failure mode, k parameters are defined. For example, the first four MGL parameters are:
λt = total failure rate due to all causes, as before;
β = conditional probability that the common cause of an item failure will be shared by one or more additional items;
γ = conditional probability that the common cause of a component failure that is shared by one or more components will be shared by two or more components in addition to the first;
δ = conditional probability that the common cause of a component failure that is shared by two or more components will be shared by three or more components in addition to the first.
If the example previously given for three components is considered, with reference to the rate of failure on demand, then (δ does not apply)


λ1 = (1 - β)·λt
λ2 = (1/2)·β·(1 - γ)·λt     (5)
λ3 = β·γ·λt

And, in the case of the 1/3 success criterion, Eq. (1) becomes

Q(1/3) = (1 - β)³·λt³ + (3/2)·β·(1 - β)·(1 - γ)·λt² + (3/4)·β²·(1 - γ)²·λt² + β·γ·λt     (6)

whereas for the 2/3 success criterion, Eq. (2) becomes

Q(2/3) = 3·(1 - β)²·λt² + (3/2)·β·(1 - γ)·λt + β·γ·λt     (7)

If γ = 1, then the simple β-factor model is obtained, as can be seen from Eq. (5). In this case Eqs. (6) and (7) would reduce to

Q(1/3) = (1 - β)³·λt³ + β·λt
Q(2/3) = 3·(1 - β)²·λt² + β·λt
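A minimal sketch of Eqs. (5) to (7), with hypothetical values of λt, β and γ: the MGL parameters are converted into λ1, λ2 and λ3, the unavailabilities of the 1/3 and 2/3 schemes are evaluated, and setting γ = 1 recovers the simple β-factor result.

```python
# Multiple Greek Letter model for three items (Eqs. 5 to 7).
# Hypothetical parameter values:
lam_t = 1.0e-2   # total failure probability on demand of one item
beta = 0.10      # fraction of failures shared with at least one other item
gamma = 0.30     # fraction of shared failures involving all three items

def mgl_rates(lam_t, beta, gamma):
    """Eq. (5): single, double and triple failure rates."""
    lam_1 = (1.0 - beta) * lam_t
    lam_2 = 0.5 * beta * (1.0 - gamma) * lam_t
    lam_3 = beta * gamma * lam_t
    return lam_1, lam_2, lam_3

def unavailabilities(lam_1, lam_2, lam_3):
    """Eqs. (1) and (2): 1-out-of-3 and 2-out-of-3 success schemes."""
    q_1oo3 = lam_1**3 + 3.0 * lam_1 * lam_2 + 3.0 * lam_2**2 + lam_3
    q_2oo3 = 3.0 * lam_1**2 + 3.0 * lam_2 + lam_3
    return q_1oo3, q_2oo3

print("MGL model  :", unavailabilities(*mgl_rates(lam_t, beta, gamma)))
# With gamma = 1 every shared failure involves all three items and the
# simple beta-factor model is recovered.
print("beta-factor:", unavailabilities(*mgl_rates(lam_t, beta, 1.0)))
```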

4.4. The Binomial Failure Rate (BFR) model

This model, first proposed by Vesely /25/ and then modified by Atwood /26,27/, describes the failure events as the effect of shocks which affect single components and shocks which affect all the redundant components. The latter can be distinguished according to whether they fail all components simultaneously (lethal shocks) or they can be survived by the affected components. In this last case a conditional failure probability (p) must be defined. Therefore the following parameters are defined:
λI = independent failure rate;
μ = rate of occurrence of non-lethal shocks;
p = conditional probability of failure of each component given a non-lethal shock (therefore multiple failures are given by the binomial distribution);
ω = frequency of occurrence of lethal shocks.
As Ref. 22 shows, it is possible to relate this model to the previous ones. Continuing the example previously presented, the λ's introduced in Eqs. (1) to (3) are given by:

λ1 = λI + p·(1 - p)²·μ
λ2 = p²·(1 - p)·μ     (8)
λ3 = p³·μ + ω

By substituting Eqs. (8) into Eqs. (1) and (2), the unavailabilities Q(1/3) and Q(2/3) can be easily calculated.

4.5. Some preliminary remarks

It should be noted that, even if the examples have been discussed with reference to the failure-to-start problem, which results in straightforward calculations of the unavailability on demand, the same relations apply to the rates which describe failures during item operation. Therefore the λ notation has been used. Furthermore, the models have been presented without discussing the problems connected with sound procedures for parameter estimation. Such problems will be discussed later in the paper, after the presentation of the CCF-RBE results. In this way their practical relevance will appear in a clearer perspective.
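As a closing illustration of the parametric models of this section, the following fragment (hypothetical parameter values) maps the BFR parameters onto the rates of Eqs. (8) and then reuses Eqs. (1) and (2) for the three-item example.

```python
# Binomial Failure Rate model for three items (Eqs. 8).
# Hypothetical parameters:
lam_I = 1.0e-2   # independent failure rate
mu = 1.0e-3      # rate of occurrence of non-lethal shocks
p = 0.5          # conditional failure probability given a non-lethal shock
omega = 1.0e-5   # rate of occurrence of lethal shocks

# Eqs. (8): rates of single, double and triple failures.
lam_1 = lam_I + mu * p * (1.0 - p) ** 2
lam_2 = mu * p**2 * (1.0 - p)
lam_3 = mu * p**3 + omega

# Substitution into Eqs. (1) and (2).
Q_1oo3 = lam_1**3 + 3.0 * lam_1 * lam_2 + 3.0 * lam_2**2 + lam_3
Q_2oo3 = 3.0 * lam_1**2 + 3.0 * lam_2 + lam_3
print(f"Q(1/3) = {Q_1oo3:.3e}")
print(f"Q(2/3) = {Q_2oo3:.3e}")
```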

5. THE COMMON CAUSE FAILURE RELIABILITY BENCHMARK EXERCISE

Benchmark exercises have proved to be valuable tools for correctly understanding the problems linked with reliability assessments of complex systems /28/. In particular, the CCF-RBE was very successful in suggesting sound procedural frameworks to be adopted in PSA. The recommended approaches can be better understood if supported by the insights gained through the project, which will be described in the following.

5.1. Background for the CCF-RBE

The project was launched because of the unsatisfactory results obtained for the CCF problem in a previous exercise on systems reliability assessment (RBE) /29,30/. This was aimed at the assessment of the complete procedure for reliability analysis of complex systems, via comparison and discussion of the results obtained by different teams independently evaluating the reliability of a same reference system. Therefore it started with the examination of the basic documentation and with the visit to, and familiarization with, the chosen system (i.e. the AFWS of the French PWR Paluel Unit); and it included a first step for

a structured qualitative analysis of the system and further steps for the reliability modelling and assessment. Participation in the exercise involved representatives of the principal parties concerned with NPP safety assessments in EEC member countries and Sweden: i.e. authorities, vendors, utilities, engineering companies and research institutes.

Fig. 5 shows a simplified scheme of the AFWS studied. This consists of two separated trains, TA and TB (for simplicity the figure does not repeat TB). Each train is able to feed two out of the four steam generators via a motor-driven pump and a turbine-driven pump. Each driving turbine is fed by steam spilled from the two corresponding steam generators. Availability of only one steamline is sufficient to drive the turbine at the rated speed. Redundancy and diversity characterize the system and make it an interesting study case for CCF analysis. However, during the several RBE working phases the discrepancy among the different approaches to the CCF problem could not be overcome: indeed, in this respect, it was not a question of differences in completeness or in assumptions utilized in constructing the relevant

(Flow diagram of train TA, with the containment boundary and the driving turbine indicated; train TB, feeding SG3 and SG4 via MDP2 and TDP2, is arranged as TA.)

TA, TB = trains A and B; MDP, TDP = electrical motor-driven and steam turbine-driven pumps; SG = steam generator

Figure 5. The Auxiliary Feedwater Reference System.

fault trees; rather, the way in which the problem was dealt with differed substantially from team to team, and this appeared to be a natural consequence of diverse philosophies. Namely, only a few teams attempted to quantify CCF occurrences in the FTs. Other teams analysed the possibility of CCF events in a purely qualitative manner, by indicating the reasons why such events could be excluded or neglected because of the defences built into the design against their occurrence. Other teams, finally, performed a sensitivity analysis of the overall results with respect to selected common events. However, as a result of the discussion of the RBE results and approaches, some important conclusions could be reached:
- first of all, the need was recognized for a separate treatment of the different classes of dependency structures (as already discussed at item 3 of this paper), whereas in most of the intermediate RBE contributions these appeared to be mixed up in a unique treatment;
- furthermore, there was general agreement on the importance which should be given to a well-structured system performance analysis preliminary to any reliability modelling and assessment. Some recommendations could be written down in the RBE final report /30/, especially towards procedures for structuring FMEA and interfacing systems analysis; these already allow the analyst to identify important dependency structures like functional unavailabilities, cascade failures, loss of support functions and similar.
All these results have been confirmed during the subsequent CCF-RBE. On the contrary, as far as quantification was concerned, it was only possible to reach a consensus on the inclusion in the final common FT model of the probability of the two Diesel generator sets failing to start on demand because of common events. For all other components, the lack of adequate data was generally lamented. A strong reluctance was shown towards the use of generic β-factors derived from the operating experience of diverse plants (i.e. from USA literature) without a thorough analysis of their applicability to the case under study. To progress beyond this unsatisfactory state of the art, an ad-hoc exercise was felt to be necessary.

5.2. Objectives and structure of the CCF-RBE

The objectives of the project /31,32/ were defined as follows:
a) to achieve a better understanding of available methods and procedures (state of the art);
b) to contribute to the establishment of procedures for the identification of potential sources of CCFs;
c) to agree on terminology and definitions;
d) to clarify objectives and boundaries of qualitative and quantitative methods of analysis;

e) to assess methods for quantifying CCF occurrences based on fixed data, expert judgement, etc.;
f) to assess methods for quantifying CCF effects in PSA when events and data are assumed to be known.
The achievement of the objectives assumed was made possible by:
- the availability of an adequate study case;
- the qualified and active participation in the project;
- the way in which the project was structured and executed.
The study case was offered by the KWU vendor and the Grohnde NPP utility in Germany and referred to a loss of electrical supply scenario (Emergency Power Mode) for the KWG NPP /33/. The event to be analyzed was the failure to supply the needed feedwater to the steam generators. In this design the feeding is assured by two diverse and redundant systems (see Fig. 6): namely the start-up and shut-down pumps, which also have process-related functions since they are normally demanded during reactor start-up and shut-down; and a four-train emergency pump system which has only safety functions and is protected against external events and sabotage. Availability of only one train assures the mission success. To avoid being confronted, in addition to the CCF problem, with the spreads in modelling and data with respect to "independent" events (like those found during the previous RBE), a common fault tree with common data was assumed for the "independent" primary events. The subject of the study was the change in the unavailability figure of the common fault tree as provoked by the introduction of the "dependent" failures identified by the participants. To limit the project to a reasonable size, external events as well as malfunctions of other systems were left outside the CCF-RBE boundaries.

Participation. The participation in the exercise was extremely well qualified to support the final recommendations. It included indeed:
- representatives of authorities, vendors, utilities, universities and research organizations, thus covering the different interests and points of view of the principal parties dealing with NPP safety;
- teams professionally involved in PSA, who had already participated in the RBE, where they were confronted with the unresolved CCF issue;
- experts active for many years in R & D efforts in the field, who had proposed and/or critically reviewed models and approaches in authoritative publications both in the USA and in Europe.
In total 10 independent teams, including 21 organizations from 8 different countries, were involved in the project, which was coordinated


1 Steam generators; 2 Valve compartments; 3 Stop and control valves; 4 Bypass station; 5 Reheater; 6 HP turbine; 7 LP turbine; 8 Condenser; 9 Main condensate pumps; 10 LP feed heating system; 11 Feedwater tank; 12 Feedwater pumps; 13 Start-up and shut-down pumps; 14 HP feed heater system; 15 Demineralized water storage tank; 16 Demineralized water pumps; 17 Emergency feed pumps; 18 Circulating water pumps; 19 Closed loop cooling system

Figure 6. Overview of KWG systems for secondary side heat removal.

by JRC-Ispra, namely: a Belgian team composed of Vinçotte, Tractionel, and the Leuven and Brussels Universities; the Danish RISØ national laboratory; a French team composed of EDF, CEA and Framatome; the GRS in Germany; KFA-Jülich; KWU (designer of the reference system); an Italian team composed of Ansaldo, ENEA and ENEL; the RELCON consultant group sponsored by the Swedish SKI authority; a UK team composed of UKAEA/SRD and CEGB; and, finally, the USA PLG consultant company sponsored by EPRI and USNRC.

Project structure. After a documentation phase (started with a first meeting in September 1984 and concluded in March 1985, two months after the visit to the Grohnde plant), the project was subdivided into two working phases. Working phase I was essentially devoted to assessing the state of the art in CCF analysis. Each team was left free, within the boundaries assumed for the exercise, to perform both qualitative and quantitative analyses according to its usual procedures. At the end of this phase the results were compared and discussed in a meeting held at Ispra in September 1985. This comparison resulted in the identification of the items which needed to be further investigated to meet the objectives assumed for the exercise. In this way contents and rules were defined for working phase II, which was concluded by a final meeting held at Ispra in April 1986. The project was finalized with the issue of the report at Ref. /31/, which was submitted for advice and agreement to the CCF-RBE participants before its publication.

Whereas the results of working phase I were very enlightening as far as qualitative analysis procedures were concerned, phase II succeeded in clarifying the factors dominating the spread of the probabilistic results. This spread is composed of three main contributors:
i) the identification of the common events which should be considered in the unavailability calculations;
ii) the choice of the stochastic model;
iii) the estimation of the model parameters from the relevant data base.
Since after working phase I the spread of the results obtained by the different participants was very significant, in order to understand the relative importance of the possible contributors it was decided to reduce the scope of working phase II but to perform a deeper investigation of selected aspects. Namely, since within the boundaries assumed for the exercise the start-up/shut-down pumps could be considered as independent of the emergency feed system, for working phase II the scope of the exercise was limited to the emergency feed system only. Furthermore, as a matter of a common discussion supported by theoretical arguments and experienced events, it was possible to agree on a common set of components to be considered for the application of CCF stochastic models in the unavailability calculations.

The tasks to be performed by the different participants during working phase II were:
- recalculation of the emergency feed system unavailability according to the common set of "CCF component groups", by using the same tools and the same data base assumed in working phase I (assessment of the spread due to contributor i);
- recalculation of this unavailability according to the common event

base, by using a common set of parameters estimated by only one team, in a consistent way for the different models (assessment of the spread due to contributor ii);
- evaluation by all the participants of the parameters of only one model, starting from a common set of relevant "raw" event data, and recalculation of the system unavailability by using these parameters (assessment of the spread due to contributor iii).

5.3. Main results of the CCF-RBE

The project was most successful in matching the objectives assumed. Furthermore, the participation of the USA team which was elaborating formal guidelines on the same subject allowed the exercise to result in a valuable test of the preliminarily elaborated procedures and to enrich them through the insights gained from the CCF-RBE.

The state of the art as derived from the project (objective a) covered both structured qualitative methods and models for probabilistic evaluations /31/. Suggestions for procedures for the identification of potential sources of CCFs (objective b) could be elaborated, as will be summarized in the next section. As far as objective c was concerned, the CCF-RBE did not attempt to achieve a fully agreed common terminology. However, the project resulted in a common understanding of what is meant by different terms and of how different classes of dependent events should, in general, be tackled in a systems reliability analysis /31/, as already extensively discussed at section 3 of this paper. As far as the objectives assumed for the probabilistic aspects of the analysis are concerned, the project stressed again the need, and gave suggestions, for qualitative analyses and for their consistent link with the subsequent quantification (objective d); it identified the overwhelming importance of system modelling and of parameter estimation as the most significant contributors to the uncertainties of the results, whereas the type of stochastic model adopted appeared to have a lesser impact on the results (objectives e and f).

These statements were supported by the analysis of the spread of the unavailability evaluations resulting from the different CCF-RBE steps. The fault tree provided by KWU, which included only "independent" events, resulted in an overall unavailability for the feeding via either the start-up/shut-down pumps or the emergency pumps of about 5·10⁻⁸. By including "dependent" events as judged and modelled by the participants, the overall figure increased to values within the range 10⁻⁷ to 10⁻⁴, with a spread of about 3 orders of magnitude. This large spread was due both to modelling effects (i.e. the events included in the fault tree and the component groups considered as

possibly affected by generic CCFs, of the type at item 3.8) and to the data assumed. Especially for the start-up/shut-down pumps the modelling differences were important: indeed, the relative importance of CCF events with respect to this system unavailability ranged between 0% and 100% according to the different participants. This result stressed again the need for structured qualitative analysis procedures to be applied before any probabilistic assessment. A more uniform modelling approach was found for the emergency pump system, on which working phase II was focussed. The steps i to iii (item 5.2) were very helpful for assessing the different contributors to the spread. Table II summarizes the different spreads, measured as the ratio of the maximum to the minimum unavailability values obtained by the participants, concerning the emergency feed system only.

TABLE II. Spread factors among unavailabilities assessed for the emergency feed system.

                   Working-phase I        Working-phase II          Uncertainty factors
                                        (i)     (ii)    (iii)
  Spread factor          90             13*     1.7    90-390**

* This figure should be compared with a factor of 25 in working phase I.
** By taking into account the results of sensitivity analyses with respect to different assumptions.

These spreads are due to the CCF evaluations only, since CCFs account for practically the whole unavailability figure. From this table the parameter estimation step (iii) is clearly identified as the most crucial problem in the overall assessment; an important role is played by the system model (i); whereas the choice of the stochastic parametric model (ii) appears to give a negligible contribution to the overall uncertainty. Table II also shows that the usual uncertainty factors (i.e. the uncertainty bounds estimated by the analysts as applicable to their own results by propagating data uncertainties) are not an appropriate measure of the uncertainty to be attached to the results; they indeed cannot take into account the team-to-team variability. The identification of the most significant sources of uncertainties was of course a very important result. However, the major outcomes from the exercise were the insights gained towards analysis procedures,

which were substantially agreed by the participants, and are described in the following.

6. RECOMMENDED ANALYSIS PROCEDURES

Identification and modelling of dependent events must be included in any system reliability assessment from the very beginning of the analysis. Any study not including dependency analysis leads to misleading results, especially when high reliability is achieved via redundancy. The suggested approaches involve all the steps needed, namely: structured qualitative system analysis; system modelling incorporating dependency structures explicitly and, implicitly, via stochastic parametric models; and estimation of the required parameters from the relevant data base. The description essentially follows the indications presented in the report by A. Poucet et al. /31/, enriched however by more recent literature which can be considered, in a certain way, as a follow-up of the CCF-RBE. In addition to the identification of the most important uncertainty sources, the CCF-RBE resulted in useful recommendations towards a sound procedural approach.

6.1. Qualitative analysis

The need for a systematic and well-structured qualitative analysis was already pointed out by the RBE /1/ as a key to completeness and correctness of the modelling. This holds even more for the identification of dependency structures, which mostly are consequences of hidden causes. The objectives of a qualitative analysis should be:
- to understand the mechanisms and factors determining dependences among system items;
- to identify potential dependent failure or unavailability events, caused by other components or systems, which have to be modelled explicitly in the system logic diagram (f.i. FT);
- to identify the groups of components which might be affected by generic dependency causes to be taken into account via stochastic models;
- to assess the effectiveness of the defences built in to prevent or to limit dependent failure occurrences;
- to rank and to screen the identified potential dependencies in order to link the qualitative analysis with the subsequent reliability model in a consistent and effective way.

These objectives can only be achieved when a close interaction between the analyst team, the designer and the operators is established. The methods available can be roughly distinguished into two types: the "component driven" methods, in which each component failure mode (FMEA) is analyzed with respect to its effects and its possible failure causes shared with other components; and the "root cause driven" methods, which start from a list of causes (like that in Table I) and assess the components possibly affected by these causes.

The FMEA-based method appears to be very effective for the identification of functional unavailabilities, cascade failures and intersystem dependencies (i.e. of the dependencies which must be modelled explicitly). In order to combine its advantages with those deriving from a cause checklist, a modified FMEA table has been proposed /31/, which is shown in Fig. 7. This differs from the normal FMEA tables in the following respects: the main attributes from generic CCF classification tables (like manufacturing, location, etc.) are included; a "cause column" for each component can be filled with other generic CCF causes; the table contains not only a column for the identification of the effects of a failure on further components (functional unavailabilities or cascade failures), but also a column for recording other components sharing a generic CCF cause.

An interesting way of performing a qualitative analysis with a direct attempt at quantification, via the structuring of engineering judgement on design and defences, is the so-called partial β-factor method elaborated by UKAEA-NCSR /34/.

6.2. Linking the qualitative analysis with the system modelling

The previous step allows the analyst to identify the dependencies which must be modelled explicitly and those which can be modelled via parametric models. (It is worth recalling that in some cases it is not possible to make a sharp cut between the two above categories: it is left to the judgment of the analyst up to which limit the explicit model is cost-effective). Of course, all the identified logical dependencies must be considered for inclusion in the explicit system model. As far as the generic cause groups are concerned, the number of potential dependencies tends to be very large and such as to uselessly complicate the overall model. Therefore some screening procedure is necessary to rank the single groups so that in the final model only the most significant effects are included.

Column headings: Component identification | Function | Component type | Component manufacturer | Component location | Test & maintenance | Failure mode | Detection possibility | Effects on other components | Failure cause categories | Other components sensitive to the same causes

Figure 7. Modified FMEA format for including dependent events.


To rank and to screen the relevant events, a useful guide might be the use of simple available quantitative terms (f.i. generic β factors for different groups of components). Some screening rules agreed by the CCF-RBE participants are given in the following (they obviously concern component groups to be considered for CCF parametric modelling):
- identical redundant active components should be considered in CCF groups;
- completely diverse components in separate, redundant trains might be considered to be much less significant with respect to CCF events;
- diverse components having some identical piece-parts should be considered as possibly affected by CCF events. These effects might be neglected when there is strong evidence that they are much less important than similar effects on identical components in the system.
Of course these rules do not decrease the importance of good engineering judgment in deciding what is "diverse" or "identical". It is therefore of the utmost importance that this step of the analysis is well documented, so that the assumptions of the analyst can be evident in any review by decision makers.

6.3. System modelling

The logic structure of the events leading to complex system unavailability is well represented by FT/ET. The dependencies which are provoked by functional links or propagations of failures can be easily included in these diagrams, as any other kind of events and logical links. In order to take the generic CCF events into account by the stochastic parametric models, the analyst should choose an appropriate model and include the multiple failure events in the FT. As far as the choice of the stochastic model is concerned, the results of the CCF-RBE show that this does not appear to be a critical issue. BFR, MGL and β-factor models gave comparable results. However, it is felt that, for high redundancy systems, models which give a better account of partial failures should be preferred to the simple β factor. Also, it can be expected that, from further exploitation of the data available from the operating experience of such systems, the difference in the results as obtained by the β-factor model and multiple parameter models might increase. The choice of the model should also take into account the arguments in support of correct statistical estimation procedures, which will be discussed in the following. Fig. 8 shows how Ref. /22/ suggests including multiple failure


General common cause fault sub-tree for Component A in a common cause group of m components (the number of unique basic events is indicated in the figure).

Figure 8. Inclusion of generic CCFs in a Fault Tree.

events, the rates of which are determined by the stochastic model chosen. The procedure illustrated in Fig. 8 significantly increases FT size and complexity, especially for systems with high redundancy, but it is the logically correct one. By incorporating the CCFs in the FT it is always possible to apply simplifications afterwards on the basis of some probability cut-off /8/. The alternative approach of applying the implicit models by

starting from the minimal cut-sets of a FT constituted of only independent events simplifies the overall system model, but can lead to overestimation of the unavailability and also to the loss of relevant minimal cut-sets if FT cut-off rules are applied in an incorrect way /35/.

6.4. Parameter estimation

It is recommended to estimate the parameters for the model adopted directly on the relevant data base and not to use generic literature parameters. This is because events that happened in other system designs do not necessarily apply to the design under examination. Generic data should only be used for screening purposes, as indicated at item 6.2. The estimation of the parameters has been recognized in the CCF-RBE as the most crucial issue in the entire procedure. The estimation of the parameters involves:
- the availability of a CCF event data base;
- the screening of the events in the data base, to assess the relevance of the single events for the system under consideration;
- the calculation of the parameters (or their distributions).

Data base. Normally the experience on CCFs for the system to be studied, or identical ones, is too sparse to be used for evaluating the impact of such rather rare events. Therefore, the examination of raw CCF events derived from the operating experience available even from diverse plants is recommended. Normally, the U.S. LER incident data reports constitute the most comprehensive source of information /36/. It is to be hoped that similar efforts will be made in Europe, by using other comprehensive abnormal occurrence reporting systems /37/ and by further exploiting the possibilities offered by component event data banks /7,38/.

Event screening. The screening of the event data can be done by using the "impact vectors" as proposed at /22/: this step is one of the most critical ones, since it really relies on the analyst's judgment. It is of the utmost importance to give adequate documentation of the events rejected and of the impact vectors assumed. In this way it would be possible, in any further review of the analysis, to keep track of the hypotheses dominating the results. The impact vector represents the analyst's judgment about how the causes of a particular event in the data base would impact the system under study. See for instance the example given in Table III. This table shows how an event that happened at the Dresden 3 plant might be transferred to the case of Grohnde; the reason why it is assumed that the wrong procedures might affect either one diesel generator (90%

TABLE III. Example of impact vector.

Event description in the data base: Plant Dresden 3: one diesel generator failed to start and one diesel generator showed an incipient failure due to wrong procedures.

Application                      P1     P2     P3     P4
  Dresden 3 (redundancy = 2)
  Grohnde (redundancy = 4)       0.9    0      0      0.1

probability) or all 4 generators, but not only 2 or 3, clearly depends on the way the expert analyst judges the description of the past events and their relevance for the system, which has already been subjected to a deep analysis. Even if it is not possible to formulate a prescriptive set of rules for the event screening step, some useful insight can be gained from the examples given at Ref. /31/. The uncertainty will be even larger when considering two main points raised during the CCF-RBE by P. Dörre and afterwards clarified by him at Ref. /39/, namely:
- the extrapolation of events concerning systems with lower redundancy to systems with higher redundancy;
- the use of "success" information, in particular when reducing from high redundancy systems to lower redundancy systems.
Attempts to respond to these questions during the CCF-RBE resulted in lower unavailability estimates and therefore in an increased spread of the obtained results (see the two different values of Table II, column iii). The problems raised were agreed to be very significant. Because of the limitations of this paper it is not possible to give further details, but Ref. /39/ should be taken attentively into account when performing the screening and the parameter calculations.

Parameter calculation. Each stochastic model described at section 4 has been proposed together with the corresponding estimation rules for the relevant parameters. Therefore the rules can easily be taken from the relevant references. As an example, for a four-redundancy system the most common estimators for the MGL parameters, in a Bayesian procedure, are
β = (Σ_{j=2}^{m} j·n_j + a) / (Σ_{j=1}^{m} j·n_j + a + b)

γ = (Σ_{j=3}^{m} j·n_j + a) / (Σ_{j=2}^{m} j·n_j + a + b)     (9)

δ = (Σ_{j=4}^{m} j·n_j + a) / (Σ_{j=3}^{m} j·n_j + a + b)
in which j varies between 0 and m = 4, n_j is the sum of the impact vectors over all the applicable events, and a and b are the two parameters of the prior distribution (of the beta family) selected. (In the case of the CCF-RBE a non-informative prior (a = b = 1) was used.)

The Eqs. (9) are an obvious extension of the method currently used for estimating the simple β factor. Now, this method was already subjected to some criticism by Parry /40/. Similar criticisms have recently been directed at this way of estimating the parameters for the MGL model /41,42,43/. Indeed, the procedure followed is based on component statistics rather than event statistics: an event provoking simultaneous failures of 2 components is counted twice, since Eqs. (9) take into account the number of components failed (2·n_2). This method results in an artificial strengthening of the evidence and hence in narrower distributions for the parameters and smaller uncertainty factors (last column of Table II). This criticism was recognized to be correct during the CCF-RBE. Afterwards, other models have been proposed, such as the multinomial failure rate /42/ or the alpha model /44/, to overcome the problem. However, it was also evident that the major uncertainties are those connected with the screening procedure, responsible for most of the spread in Table II (iii). Even if the impact on the results is not very large, it should be recommended to adopt the correct way of estimating parameters, or models based on event statistics.
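As an illustration of Eq. (9), the fragment below estimates β, γ and δ for a group of m = 4 redundant components from invented event counts n_j, using the non-informative prior a = b = 1; being based on the component counting discussed above, it is subject to the same criticism concerning event statistics.

```python
# Point estimators of the MGL parameters for m = 4 (Eq. 9),
# with a non-informative beta prior a = b = 1.
# n[j] = sum of the impact vectors over the applicable events, i.e. the
# (possibly fractional) number of events in which exactly j components
# failed. The values below are purely illustrative.
n = {1: 40.0, 2: 3.0, 3: 0.5, 4: 0.4}
a, b = 1.0, 1.0
m = 4

def weighted_sum(start):
    """Sum of j * n_j from j = start to m."""
    return sum(j * n[j] for j in range(start, m + 1))

beta  = (weighted_sum(2) + a) / (weighted_sum(1) + a + b)
gamma = (weighted_sum(3) + a) / (weighted_sum(2) + a + b)
delta = (weighted_sum(4) + a) / (weighted_sum(3) + a + b)

print(f"beta  = {beta:.3f}")
print(f"gamma = {gamma:.3f}")
print(f"delta = {delta:.3f}")
```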

7. CONCLUDING REMARKS

The paper constitutes a state-of-the-art review of available models and recommended procedures for CCF analysis. It should however be remembered that any design assessment should be followed by an optimal exploitation of the operating experience for the system when in operation. The best way to link assessments with experience is the Bayesian approach: the design assessment is the prior probability, which should be updated by the evidence drawn from the experienced system behaviour. The experience might also result in a need for changing the system model once previously unidentified dependencies appear to exist among system items; when correctly analysed, the experience might also lead to design changes before dramatic events occur. Retrieval and use of the operating experience are the best defence against CCFs once the plant is in operation.

ACKNOWLEDGEMENTS

Thanks are due to A. Poucet, who is the principal author of the CCF-RBE final report from which section 6 has been derived; to all participants in the CCF-RBE for their significant contribution towards the achievement of the methodological recommendations presented; and, finally, to G. Apostolakis (UCLA), who during his visiting period at JRC Ispra contributed to the CCF-RBE concluding discussions.

REFERENCES
1. A. AMENDOLA, 'Systems Reliability Benchmark Exercise', Final Report, CEC-JRC, EUR 10696 (1985).
2. A. AMENDOLA, 'Uncertainties in Systems Reliability Modelling: Insight Gained through European Benchmark Exercises', Nucl. Eng. Des. 93 (1986) 215-225.
3. G.T. EDWARDS and I.A. WATSON, 'A Study of Common Cause Failures', UKAEA, SRD R 146, July 1979.
4. I.A. WATSON, 'Review of Common Cause Failures', UKAEA, NCSR R 27, July 1981.
5. D.P. WAGNER, C.L. CATE and J.B. FUSSELL, 'Common Cause Failure Analysis Methodology for Complex Systems', in Nuclear Systems Reliability Engineering and Risk Assessment, Eds. J.B. Fussell and G.R. Burdick, SIAM (1977).
6. G.L. CRELLIN et al., 'A Study of Common Cause Failures, Phase II: A Comprehensive Classification System for Component Fault Analysis', LASA-EPRI-NP-3837, Interim Report, January 1985.
7. A.M. GAMES, P. MARTIN and A. AMENDOLA, 'Multiple Related Component Failure Events', Reliability '85, Birmingham, 10-12 July 1985.
8. A. POUCET, 'Fault Tree and Event Tree Techniques' (in this same book).
9. I. PAPAZOGLOU, 'Elements of Markovian Reliability Analysis' (in this same book).
10. I. PAPAZOGLOU, 'Probabilistic Evaluation of Surveillance and Out of Service Times for the Reactor Protection Instrumentation System' (in this same book).
11. A. SAIZ DE BUSTAMANTE, 'Monte Carlo Methods' (in this same book).
12. USNRC, 'PRA Procedures Guide: A Guide to the Performance of Probabilistic Risk Assessments for Nuclear Power Plants', NUREG/CR-2300, April 1982.
13. I.A. WATSON, 'Human Factors in Reliability and Risk Assessment' (in this same book).

14. A. AMENDOLA, G. REINA and F. CICERI, 'Dynamic Simulation of Man-Machine Interaction in Incident Control', Proceedings of the 2nd IFAC Conf. on Analysis, Design and Evaluation of Man-Machine Systems, Varese (1986), Pergamon Press.
15. A. AMENDOLA, U. BERSINI, P.C. CACCIABUE and G. MANCINI, 'Modelling Operators in Accident Conditions: Advancements and Perspectives of a Cognitive Model', to appear in the International Journal of Man-Machine Studies (Academic Press).
16. G. APOSTOLAKIS and S. KAPLAN, 'Pitfalls in Risk Calculations', Rel. Engineering, 2, 135-145 (1981).
17. P. KAFKA and H. POLKE, 'Treatment of Uncertainties in Reliability Models', Post SMIRT-8 Seminar on the Role of Data and Judgement in Probabilistic Risk and Safety Analysis, Brussels, August 26-27 (1985).
18. G. APOSTOLAKIS and P. MOIENI, 'On the Correlation of Failure Rates', in Proceedings of the 5th EuReDatA Conference on Reliability Data Collection and Use in Risk and Availability Assessment, Heidelberg, April 9-11, 1986, Springer Verlag.
19. A.J. BOURNE et al., 'Defences against Common-Mode Failures in Redundancy Systems: A Guide for Management, Designers and Operators', UKAEA, SRD R 196, February 1981.
20. G. BALLARD (UKAEA-SRD), 'Common Cause Failures Analysis', Ispra Course 'Reliability and Data', October 1984.
21. A.W. MARSHALL and I. OLKIN, 'A Multivariate Exponential Distribution', J. Am. Stat. Assoc., 62 (1967), pp. 30-44.
22. K.N. FLEMING, A. MOSLEH and R.K. DEREMER, 'A Systematic Procedure for the Incorporation of Common Cause Events into Risk and Reliability Models', Nucl. Eng. Des. 93 (1986), pp. 245-273.
23. K.N. FLEMING, 'A Reliability Model for Common Mode Failure in Redundant Safety Systems', Proceedings of the Sixth Annual Pittsburgh Conference on Modelling and Simulation, General Atomic Report GA-A13284, April 23-25, 1975.
24. K.N. FLEMING and A.M. KALINOWSKI, 'An Extension of the Beta Factor Method to Systems with High Levels of Redundancy', Pickard, Lowe and Garrick, Inc., PLG-0289, June 1983.
25. W.E. VESELY, 'Estimating Common Cause Failure Probability in Reliability and Risk Analysis: Marshall-Olkin Specialization', in Nuclear Systems Reliability Engineering and Risk Assessment (J.B. Fussell and G.R. Burdick eds.), SIAM, Philadelphia, 1977.
26. C.L. ATWOOD, 'Common Cause and Individual Fault Rates for LERs of Instrumentation and Control Assemblies at US Commercial NPPs', EG&G-EA-5623, 1981.
27. C.L. ATWOOD, 'Common Cause Fault Rates for Pumps', NUREG/CR-2098, February 1983.



28. A. AMENDOLA, 'Result of the Reliability Benchmark Exercise and the Future CEC-JRC Programme', Proc. of the ANS/ENS Int. Top. Meeting on Probabilistic Safety Methods and Applications, San Francisco (USA), 24 February - 1 March 1985.
29. A. AMENDOLA, 'Uncertainties in Systems Reliability Modelling: Insight Gained through European Benchmark Exercises', Nucl. Eng. Des. 93 (1986) 215-225.
30. A. AMENDOLA (ed.), 'Systems Reliability Benchmark Exercise, Final Report: Parts I and II', EUR 10696 (I-II), CEC-JRC Ispra (1986).
31. A. POUCET, A. AMENDOLA and P.C. CACCIABUE, 'CCF-RBE - Common Cause Failure Reliability Benchmark Exercise - Final Report', EUR 11054 EN, April 1987.
32. A. POUCET, A. AMENDOLA and P.C. CACCIABUE, 'European Benchmark Exercise on Common Cause Failure Analysis', ANS/ENS PSA Conference, Zürich, 31 August - 5 September 1987.
33. H. SOBOTTKA and H. FABIAN, 'Advanced Design of PWR through Probabilistic Safety Analysis', ANS/ENS PSA Conference, San Francisco, 24 February - 1 March 1985.
34. B.D. JOHNSTON, 'A Structured Procedure for Dependent Failure Analysis', Reliability Engineering 19 (1987) 125-126.
35. S. CONTINI, 'Algorithms for Common Cause Analysis, A Preliminary Report', T.N. 1.06.01.81.16, JRC Ispra (1981).
36. K.N. FLEMING and A. MOSLEH, 'Classification and Analysis of Reactor Operating Experience Involving Dependent Events', EPRI NP-3967 (1985).
37. H.W. KALFSBEEK, 'The Organisation and Use of Abnormal Occurrence Data' (in this same book).
38. A.M. GAMES, 'The Use of the European Reliability Data System's Component Event Data Bank for Common Cause Failure Event Analysis', in Reliability Data Bases (A. Amendola and A.Z. Keller eds.), D. Reidel Publishing Company, Dordrecht (NL), 1987.
39. P. DÖRRE, 'Possible Pitfalls in the Process of CCF Event Data Evaluation', Proceedings of ENS/ANS PSA 1987: Intern. Top. Conf. on Probabilistic Safety Assessment and Risk Management, Zürich, 30 August - 4 September 1987.
40. G.W. PARRY, 'Incompleteness in Data Bases: Impact on Parameter Estimation Uncertainty', Annual Meeting of the Society for Risk Analysis, Knoxville, 30 Sept. - 3 Oct. 1984.
41. G. APOSTOLAKIS and P. MOIENI, 'On the Correlation of Failure Rates', EuReDatA Conf. on Reliability Data Collection and Use in Risk and Availability Assessment (H.J. Wingender ed.), Springer-Verlag, Heidelberg (1986).

256 42. G. APOSTOLAKIS and P. MORENI, 'The Foundations of Models of Dependence in Probabilistic Safety Assessment', Reliability Engineering, 18 (1987) 177-195. 43. H.M. PAULA, 'Comments on the Analysis of Dependent Failures in Risk Assessment and Reliability Evaluation', Nuclear Safety, 27 April-June 1986. 44. A. MOSLEH and N.O. SIU, 'A Multi-Parameter, Event-Based CommonCause Failure Model', Paper M7/3, SMIRT 9, Lausanne, CH, August 1987.

HUMAN FACTORS IN RELIABILITY AND RISK ASSESSMENT

I. A. WATSON
Head of Systems Reliability Service (SRS)
UKAEA SRD
Wigshaw Lane, Culcheth, Warrington WA3 4NE, Cheshire, England

ABSTRACT. Human factors can have a significant impact on the reliable operation of technological plant. An explanation of human action theory is given followed by a review of the techniques of human reliability modelling now being utilised in reliability and safety assessment. The importance of task analysis is illustrated and the problem of incorporating management influences is discussed.

1.

INTRODUCTION AND BACKGROUND

It is now commonly accepted by concerned professionals that human factors (HF) can have a significant impact on the safe and reliable operation of technological plant. This understanding is manifest across a variety of industries and technologies eg, chemicals, processing, nuclear power, aviation, mining, computers and so on. What is still puzzling and controversial is how the matter can be wholly effectively dealt with. This common concern was expressed in a paper produced on behalf of the Commission of the European Communities (CEC)(1) surveying research on human factors and man-machine interaction and proposing a European Community collaborative research programme. Many probabilistic risk assessment (PRA) reports in the nuclear power industry(1)(2) have shown the tremendous significance of HF and man-machine interaction/interfaces (MMI) in nuclear power plant accident sequences. HF also plays a considerable role in software reliability and common cause failures(3). Its significance in human-computer interaction/interfaces (HCI) is now being appreciated in the field of advanced information technology, so it has become a recognised research category in the huge UK, European, USA and Japanese 5th generation computer research projects now underway until 1990. It is well known in the case of aviation that 70%(4) of accidents are due to crew error, and similar figures apply to the shipping and chemical industries. The

experience of the systems reliability service (SRS) in carrying out reliability assessments for the process industries is that human error (HE) can figure significantly somewhere in many plant or system safety/reliability assessments. An indication of the economic significance of HF can be obtained by a simple calculation based on the loss of availability or mean fractional downtime (D) due to HE and the annual product revenue (APR), ie,

HE cost pa = D x APR

eg, for D = 1% and APR = £10^6:

HE cost pa = 10^-2 x 10^6 = £10,000
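As a minimal sketch of this arithmetic, the fragment below simply evaluates HE cost pa = D x APR for the representative figures quoted above; the function name and currency handling are illustrative only.

```python
def annual_human_error_cost(downtime_fraction, annual_product_revenue):
    """Economic significance of human error: HE cost per annum = D x APR."""
    return downtime_fraction * annual_product_revenue

# Representative figures from the text: D = 1%, APR = 1,000,000 pounds.
print(annual_human_error_cost(0.01, 1_000_000))  # -> 10000.0 pounds per annum
```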

As these are representative figures, then clearly it is worth spending a few thousand pounds pa per million pounds worth of product in order to reduce HE. However such trade-off calculations can relate to many aspects of plant design and operation. The moral is that good HF should be part of the whole process of marketing, specification, design, operation and maintenance of plant/systems. The basis of good ergonomics has been laid down by workers over the past forty years. There is much useful material available relating to design and operation. Unfortunately this is not always utilised in industry, so NCSR in conjunction with the I Chem E is publishing a guidance document(5) which is aimed to help in the application of ergonomics as well as in understanding HF in order to reduce human error. This guide was produced with the aid of the Human Factors in Reliability Group (HFRG) which is supported by the National Centre of Systems Reliability (NCSR). Researchers in human factors related to safety and reliability, and those relating to advanced information technology (AIT) and hence HCI, clearly agree that fundamental work on human performance modelling (cognitive ergonomics/modelling user interactions) is necessary to make any substantial progress. One of the problems in communicating and co-ordinating views on relevant research which affect these issues is the lack of an agreed terminology and descriptions in the concerned community. To some extent this is part of the overall problem. It is clear from texts on cognitive psychology(11) that this science is still in an early state of development. Before the 1950s behavioural psychology ("behaviourism") was dominant, but a revolution has occurred leading experimental psychologists to turn increasingly to investigation of the mind, and so there has been a rebirth of interest in the "cognitive approach" to psychology. This emphasises three major characteristics which distinguish it from behaviourism:

1  It emphasises "knowing" rather than merely responding as in the behaviourists' stimulus-response (S-R) bonds.

2  It emphasises mental structure/organisation, which is inherent in all living creatures and provides an important impetus to cognitive development.

3  The individual is viewed as being active, constructive and planful rather than being a passive recipient of environmental stimulation.

Linguists have had a considerable impact by showing that behaviouristic theories could not in principle work for the acquisition and use of language and cognition. Cognitive psychology draws considerably on computer science, particularly analogies with computer structures eg, memory and processing, and from programming analogies with thinking. Also, great reliance has been placed on simulation for testing new theories. Out of this has grown a new theory called information processing theory, which has contributed mainly to the development of the theory of memory structures. From this have emerged developments in the understanding of language processing, problem solving, reasoning etc. Other voices also argue that more emphasis should be given to understanding human actions through interpretations of their meanings for the agents (humans) involved and through the elucidation of the purposes, rules, beliefs and reasonings which make these actions appropriate at the time of their performance(6). This is the so-called hermeneutical approach, which is very much at odds with a purely mechanistic approach to human behaviour. This approach I believe is important to MMI understanding, particularly for engineers and managers who wish to organise their understanding and those of plant operators, ie, their experience, so that the design and operation of the MMI is optimum. However, we need to be careful. It has been observed(11) that when people make decisions they "satisfice" rather than maximise or optimise, that is, people accept a choice that is good enough rather than continue to search for the best possible. One reason why Simon(11) won the Nobel prize for economics is that he noted that economic theories often fail to take this into account. They assume that humans have a greater capacity than they actually do to obtain information from the environment, ie, economic theories assume erroneously that people (and hence corporations) are optimisers or maximisers rather than satisficers. Managers and engineers can use their experience and understanding of plant operators to ensure that a proper understanding is taken into account in MMI design and operation. For this they must communicate effectively with the people involved. Even particle physicists have come to realise that they are in some ways participators in their experiments and that they can never be completely isolated observers. Engineers know this from practical experience, but their classical scientific education and general character traits tend to be at variance with an interactive approach to plant personnel. This needs guarding against.

2.

HUMAN ACTION

A general theory of the structure of action has been produced by John Searle in the 1984 Reith Lectures. This theory makes sense of the many issues involved, especially the anomalies; it can underpin some of the useful aspects of current human error models and can be specialised so as to be useful in understanding MMI and the occurrence of human error.

2.1 Principles of the Theory of the Structure of Action

2.1.1 Principle 1: ACTIONS characteristically consist of two components viz:

a mental component and a physical component

Suppose an operator is closing a valve or paginating a computer visual display; he will be conscious of certain experiences. If he is successful then the valve will close or the correct screen page is displayed. If he is not successful then he will still at least have had a mental component, ie, the experience of attempting to close the valve or paginate the VDU (or a misplaced intention, ie, a mistake leading to an error), together with some physical components such as turning switch handles or pressing keys, which may itself be in error due to a slip. This leads to:

2.1.2 Principle 2: The mental component is an INTENTION, ie, it has intentionality.

To say that a mental state has intentionality means that it is about something. For example, a belief is always that such and such is the case, a desire requires that such and such happen, as in the examples above. Intentional states have three key features:

1  They have a content in a certain mental type. The content is what makes it about something, eg, closing a valve. The mental type is whether it is a desire or belief, eg, the operator wants to close the valve, or the operator believes he will close the valve, or the operator intends (in common parlance) to close the valve.

2  They determine their own conditions of satisfaction, ie, they will be satisfied or not depending on whether the world (out there) matches the content of the state. Sometimes when errors referred to as mistakes occur, this is the result of a misplaced intention leading to inappropriate physical components.

3  They cause things to happen (by way of intentional causation) to bring a match between their content and the state of the world. There is an internal connection between the cause and the effect, because the cause is a representation of the very state that it causes. The cause both represents and brings about the effect. This kind of cause and effect relationship is called intentional causation and is crucial to both the structure and explanation of human action. The mind brings about the very state that it is thinking about (which sometimes may be mistaken).

The two characteristics of action propounded in Principle 1 and elaborated in Principle 2 and others below are indirectly supported by Norman(24), who spells out the different origins of mistakes and slips very succinctly as follows: "The division occurs at the level of

interaction. A person establishes an intention to act. If the intention is not appropriate, this is a mistake. If the action is not what was intended, this is a slip." Mistakes are deficiencies or failures in the judgement and/or inferential processes involved in the selection of an objective and/or in the specification of the means to achieve it. Slips (or lapses), by contrast, are errors which result from some failure in the execution stage of an action, regardless of whether or not the guiding plan was adequate to achieve its purpose.

2.1.3 Principle 3: The kind of causation which is essential to both the structure of action and the explanation of action is INTENTIONAL CAUSATION.

The physical components of actions are caused by intentions. Intentions are causes because they make things happen. They also have contents and so can figure in the process of logical reasoning. They can be both causal and have logical features because the kind of causation considered is mental or intentional causation and is realised in the human brain. This form of causation is quite different from the standard form of causation described in textbooks on philosophy or physics. What is special about intentional causation is that it is a case of a mental state making something else happen, and that something else is the very state of affairs represented by the mental state that causes it. This leads to the fourth Principle.

2.1.4 Principle 4: The explanation of an action must have the same content as was in the originator's head when the action was performed, or when the reasoning was carried out that led to the performance of the action.

If the explanation is really explanatory, the content that causes behaviour by way of intentional causation must be identical with the content of the explanation of the behaviour. In this respect actions differ from other natural events in the world (as already alluded to above). In the explanation of an earthquake or electricity the content of the explanation only has to represent what happened, ie, a model, and why it happened. It doesn't actually have to cause the event itself. But in explaining human behaviour the cause and the explanation both have contents, and the explanation only works because it has the same content as the cause.

2.1.5 Principle 5: There is a fundamental distinction between those actions that are premeditated, which are the result of advance planning, and those actions which are spontaneous and which we do without prior reflection.

An example of the latter is normal conversation, where one doesn't reflect on what is going to be said next, one just says it. In such cases there is indeed an intention, but it is not formed prior to the action. It is called an intention in action. In many cases relevant to industrial plant operation or technological activities, however, there are prior intentions, advance planning, practical reasoning etc. This process of reflection characteristically results either in the formation of an intention or sometimes in the action itself.

2.1.6 Principle 6: The formation of prior intentions is, at least generally, the result of practical reasoning.

Such reasoning is always about how best to decide between alternative (sometimes conflicting) possibilities and desires. The motive force behind most human action is desire based on needs or requirements. Beliefs function to enable us to figure out how best to satisfy our desires. Tasks are generally complex and involve practical reasoning at a high level on the way forward, and intentions in action and many physical components (often repetitious) at a lower level. Take for example the response of control room operatives to a simulated LOCA in a nuclear reactor. Without going into technical detail it can be seen from Figure 1 that the methods chosen by various operatives were different, although all were considered acceptable. If one particular strategy were preferred then all others could be considered to be in error, ie, mistaken, unless there is some overriding criterion, such as keeping below a maximum operating temperature. Notice also that errors, ie, slips, could occur at the detailed level in not correctly operating pumps or valves.

2.1.7 Principle 7: An intentional state only 'functions' as part of a network of other intentional states.

'Functions' here means that it only determines its conditions of satisfaction relative to many other intentional states. The example given above and illustrated in Figure 1 indicates some of life's typical complexities. One doesn't have intentions by themselves. The operatives are in the control room for many reasons, personal, organisational, technical etc. The desire to successfully control the LOCA functions against a whole series of other intentional states eg, to maintain the reactor working, the quality of its output, please the boss, maintain the integrity of the plant, keep their jobs, job satisfaction etc. They characteristically engage in practical reasoning that leads to intentions and actual behaviour. The other intentional states that give an intentional state its particular meaning are called the network of intentionality.

2.1.8 Principle 8: The whole network of intentionality only functions against a background of human capacities that are not themselves mental states.

Our mental states only function in the way they do because they function against a background of capabilities, skills, habits, ways of doing things etc, and general stances towards the world that do not themselves consist of intentional states. In order, for example, to form the intention to drive a car somewhere one must be able to drive, but this ability doesn't just consist of a whole lot of other intentional states. A skill is required of knowing 'how' rather than 'that'. Such skills, abilities, etc., against which intentional states function, are the "background".

2.1.9 Principle 9: The formation and development of intentions is affected by the results of our actions, which are continually evaluated.

FIGURE 1  VARIOUS STRATEGIES FOR CONTROLLING A SIMULATED LOCA
(Traces of bulk temperature and tank level against time for several operators, with a key for manual/automatic trips, pump starts/stops and riser valve operations.)

This is an additional principle to Searle's theory but is essential if we are to take into account practical experience of doing things and of correcting (or not correcting) errors. Even the simplest actions, such as pressing a computer key, have some form of feedback (touch, sight or sound) if they are to be fully satisfactory. More complicated tasks involve much more elaborate evaluation of their progress. This is illustrated in Figure 1, where continual adjustments are being made during the process of controlling the LOCA. At an even higher level, strategies may be evaluated and modified to obtain desired results. Unless actions are continually corrected to satisfy Principle 6 (prior intentions) then intentions in action (Principle 5) may be mistaken. Practical reasoning arising from Principle 6 is continually modified by evaluations required by Principle 9.

2.2 Man Machine Interaction (MMI)

This covers control room operation (CRO), human-computer interaction (HCI), local board and machine operation and plant maintenance. The theory already cited clearly indicates separate mental and physical aspects to MMI. This needs to be considered further. An operator may open/close a control valve for several reasons:

1  for routine control
2  as part of a start up procedure
3  as part of a test procedure
4  because of an emergency.

The physical component of the action is the same (or very nearly the same in all cases) but the intentions, purpose, reasons, motives etc, are different. There are many different types of valve and means of controlling them eg, bellows, globe, diaphragm, gate, butterfly, manual, electro-magnetic, hydraulic, direct digital control, remote switch, remote servo etc. Hence opening/closing a valve has a generic technological meaning and many physical realisations. This is all very much in accordance with the basic theory of the structure of human action. However, crossing the control room or walking between parts of a plant has similar physical actions but may be for a variety of reasons or intentions. Now in the case of using visual display units (VDUs) there are similar physical components of such actions eg, reading a screen and operating a keyboard, tracker balls, ... etc, but a huge variety of purposes for doing this narrow range of physical actions because of the tremendous flexibility of computerised control which is one of the reasons for its widespread use. Nevertheless the actual physical relationship to the process being controlled eg, information, chemical plant, factory machinery, ship, aircraft etc, is very indirect and in general could be almost anything these days. Maintenance activities can be much more physical and directed, nevertheless there is a strong desire to use standard techniques eg, in calibration, setting up, dis-assembly and assembly so similar activities will be utilised for a variety of jobs. Tools such as wrenches, saws, winches will be used with many different intentions.

FIGURE 2  POLYA'S (1957) FOUR STAGES OF PROBLEM SOLVING
(Polya's stages and their information processing translation: Understanding the Problem - Encode the Problem in Working Memory; Devising a Plan - Search Long-Term Memory for a Plan or Production System; Carrying out the Plan - Execute the Production System; Looking Back - Evaluate the Results and Respond.)

Indeed, to be slightly humorous and controversial, maintainers are renowned for treating whatever is to hand as a wrench, hammer, lever etc, ie, they improvise. Thus in general there is a discontinuity between the mental components of the various activities related to MMI and the physical phenomena being controlled.

3.

HUMAN PERFORMANCE MODELLING

3.1 Generic Models

The four stage model originally described by Polya(11), shown in Figure 2, has been updated by cognitive psychologists in a way related to the working of memory and language comprehension, along the lines of the modern information processing theory previously mentioned. More recently Norman(12) has produced a specialised version of this for HCI modelling. At the Interact '84 Conference he emphasised that as a professional psychologist he did not think that a scientifically satisfactory HCI model could be offered, but engineering required adequate models and he thought that this requirement might be met. He identified four different stages when a person performs an action in an interactive cycle of activities:

the formation of an intention
the selection of a method
the execution of that selection
the evaluation of the resulting action.

These stages overlap and interact, and the feedback or evaluation aspects are crucial, particularly the levels at which these occur and the recognition of this shown in the system design. An illustration of the design implications of the four stages of user activities, relevant to VDU screen layout, is shown in Table 1. The value of the above models, compared with a more conventional MMI "operator" model as an information processor as shown in Figure 3, is that they are set at a higher cognitive level and emphasise "feedback", which is often more appropriate to modern plant conditions and can be used more widely. With all these models, consideration has to be given to major performance shaping factors (PSF). These will be seen to recur as the discussion of appropriate models develops. Typical PSF are:

Level of mental loading
Level of mental skill
Activation and arousal
Environmental factors
Perception of risk
Error correction and feedback.

FIGURE 3  FLOW OF INFORMATION THROUGH THE HUMAN OPERATOR
(Block diagram: sensing and perception feed short-term memory, which exchanges information with long-term memory and supports decision making and action selection, leading to control actions at the interface.)

TABLE 1  DESIGN IMPLICATIONS FOR THE STAGES OF USER ACTIVITIES

STAGE                     TOOLS TO CONSIDER
Forming the Intention     Structured Activities; Workbenches; Memory Aids; Menus
Selecting the Action      Explicit Statement of Intention; Memory Aids; Menus
Executing the Action      Ease of Specification; Memory Aids; Menus; Naming (Command Languages); Pointing; Sufficient Workspace
Evaluating the Outcome    Information Required Depends on Intentions; Actions Are Iterations toward Goal; Errors as Partial Descriptions; Ease of Correction; Messages Should Depend upon Intention

The last mentioned PSF is probably the most fundamental, has implications at many levels and is consequently being incorporated much more explicitly in actual models. It emphasises the interconnectedness or network situation typical of human activity. The Riso National Research Laboratory in Denmark has become well known for the work of Rasmussen and his co-workers in this field and for their useful insights into human performance/error modelling. One of these, shown in Figure 4, is more related to the practical situation of decision making at the MMI. This model can be more easily understood in terms of a more generalised model related to human information processing, shown in Figure 5. This emphasises three basic levels of capability-based behaviour that can be directly related to the data processing activities of the previous model. The human performance models briefly discussed above have concentrated essentially on MMI with emphasis on the lone operator. They are approximations in an as yet poorly developed area of scientific understanding, but they are of practical assistance if used with caution. However they do not really take fully into account the interconnectedness and interactions of human activities. A model which takes this more fully into account came out of a Safety and Reliability

FIGURE 4  DECISION-MAKING MODEL (ADAPTED FROM RASMUSSEN)
(Ladder of data processing activities and resulting states of knowledge: activation and alert state, observations/data collection, identification of system state, interpretation of the situation and implications for problems, evaluation of alternatives and selection of goal state, task definition and procedure, with shortcuts such as release of preset responses.)

FIGURE 5  MODEL OF HUMAN DATA PROCESSES AND TYPICAL MALFUNCTIONS, REPRODUCED FROM RASMUSSEN, 1980
(Skill-, rule- and knowledge-based levels of behaviour, from sensory inputs and automated sensorimotor patterns through recognition and stored rules for tasks to identification, decision and planning, annotated with typical malfunctions such as familiar association, ineffective recall, omission of isolated acts, mistakes among alternatives, absentmindedness, low alertness and inadequate manual variability or spatial-temporal coordination.)

Directorate (SRD)(7) study of accidents for the UK Health and Safety Executive. This showed many influences at work affecting human performance leading to fatal accidents. The model is shown diagrammatically in Figure 6. From this an accident classification scheme was derived which reflects the principal influences shown in the model. This is illustrated in Figure 7. The centre of the influence model in Figure 6 is MAN, the modelling of which has been briefly reviewed above. Another is the actual plant concerned. From the reliability and risk point of view this is dealt with by well known reliability techniques which are related to, but not the concern of, this lecture. The other two, ie, TASK and MANAGEMENT, will now be considered.

3.2 Task Analysis

The most effective form of reliability analysis involving human operations usually involves some form of task analysis. This is because it is the reliability that can be achieved in the task(s) in which humans are involved that is the essential concern of risk analysis or reliability assessment. The most useful form of this type of analysis used by ergonomists is hierarchical task analysis(21).

3.3 An Illustration

To illustrate this process of redescription, consider an operation that might be carried out as one of the duties of a chemical plant operator - 'ensure caustic concentration is within limits specified by manufacturing instructions'. By questioning an informant competent at this operation, we may be able to say that the five sub-ordinate operations in Figure 8 need to be carried out. But simply listing these five sub-ordinates does not provide a complete redescription of the operation being examined. Their plan must also be stated. In this case the plan is most clearly stated in the form of the algorithm in Figure 9. The same process of redescription can now be applied to each of the five sub-ordinate operations identified in Figure 8. Figure 10 shows how some of these redescriptions may be carried out. Some of the operations so derived may also be treated in a similar fashion.

3.4 Control and Monitoring of Tasks

In the NCSR study of common cause failures(3) the importance of control, monitoring and feedback in reducing human error came to be realised, particularly in connection with maintenance, as did the importance of high level controls such as QA, design review and reliability assurance in minimising design error. The essential points are set out in the idealised flow diagram form of the task checking model shown in Figure 11. The solid line arrows represent stages of work and the p_c and p'_c dotted arrows represent the checking process at various stages; the

FIGURE 6  INFLUENCES ON MAN IN INDUSTRY
(Influence diagram centred on MAN, with factors such as personality, physical defects, psychology, experience, internal stressors, motivation, communications and relations.)

FIGURE 7  MACRO-STRUCTURE OF TAXAC
(Accident classification scheme: accident signature and causes/predisposing factors grouped under PLANT (environment, ergonomics, faults), ORGANISATION (software design, communications, industrial relations, miscellaneous safety aspects) and MAN (causes, PSFs, aberrant behaviour, proximate causal activity, activity of deceased, error types).)

FIGURE 8  TASK ANALYSIS: HIGH LEVEL DESCRIPTION
(Operation 1, 'Ensure caustic concentration is within limits specified by Manufacturing Instructions', redescribed into sub-operations: put on gloves and goggles; collect sample; test sample; add caustic to correct concentration; take off gloves and goggles.)

FIGURE 9  TASK ANALYSIS: INTERMEDIATE REDESCRIPTION
(Plan for the sub-operations as an algorithm: put on gloves and goggles, collect and test the sample; if the caustic concentration is as per manufacturing instructions, take off gloves and goggles; otherwise add caustic to correct concentration, take off gloves and goggles and repeat the check 30 minutes after the caustic addition.)

FIGURE 10  TASK ANALYSIS: FINAL REDESCRIPTION
(Further redescription of the sub-operations: collect sample - open man-lid, dip for sample, close man-lid; add caustic to correct concentration - estimate caustic required, collect required caustic from store, bring caustic to the vessel, tip caustic into vessel.)
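To make the redescription concrete, the sketch below represents the caustic-concentration task of Figures 8-10 as a simple hierarchy of operations, each with an optional plan governing its sub-operations. The data structure and field names are illustrative assumptions, not part of the published method.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Operation:
    """A node in a hierarchical task analysis: an operation, its plan and its sub-operations."""
    name: str
    plan: Optional[str] = None            # how the sub-operations are ordered/conditioned
    suboperations: List["Operation"] = field(default_factory=list)

# Top-level operation redescribed into five sub-operations (Figure 8),
# with the plan stated as in Figure 9 and one further level as in Figure 10.
task = Operation(
    name="Ensure caustic concentration is within limits specified by Manufacturing Instructions",
    plan="Do 2-3-4; if concentration correct do 6; otherwise do 5, wait 30 minutes and repeat from 3",
    suboperations=[
        Operation("Put on gloves and goggles"),
        Operation("Collect sample",
                  plan="Do 7-8-9 in order",
                  suboperations=[Operation("Open man-lid"),
                                 Operation("Dip for sample"),
                                 Operation("Close man-lid")]),
        Operation("Test sample"),
        Operation("Add caustic to correct concentration",
                  plan="Do 10-11-12-13 in order",
                  suboperations=[Operation("Estimate caustic required"),
                                 Operation("Collect required caustic from store"),
                                 Operation("Bring caustic to vessel"),
                                 Operation("Tip caustic into vessel")]),
        Operation("Take off gloves and goggles"),
    ],
)

def print_hierarchy(op: Operation, depth: int = 0) -> None:
    """Walk the hierarchy and print each operation with its plan, indented by level."""
    print("  " * depth + op.name + (f"  [plan: {op.plan}]" if op.plan else ""))
    for sub in op.suboperations:
        print_hierarchy(sub, depth + 1)

print_hierarchy(task)
```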

FIGURE 11  TASK CHECKING MODEL
(Stages of work in series, each with an associated checking/feedback stage (probabilities p and p_c), under an overall task control check p'.)

FIGURE 12  SUBSYSTEM CCF MODELLING STRUCTURE
(Subsystem failure model combining an independent failures model with CMF models; the common mode contribution is built up from maintenance error, engineering error and random/inherent error, with maintenance-originated errors, design-induced errors and operator errors treated through causal and environmental models.)

latter are shown as a feedback function. Making the important assumption that to a large degree these individual actions are independent, and taking the p and p_c symbols as probabilities of error, then assuming that the probabilities are small, the overall probability of failure is given by:
p_a ≈ p x p_c x p' + higher order terms.
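As a rough numerical sketch of this task checking model, the code below treats each stage of work as having a small error probability p and an associated check that misses the error with probability p_c, with an overall control check that misses it with probability p'; to first order the overall failure probability is the sum of the per-stage products. The structure and the numbers are illustrative assumptions, not data from the study.

```python
def unchecked_failure_probability(p_error, p_check_misses, p_control_misses):
    """First-order task checking model: an error survives only if it is made (p),
    missed by the stage check (p_c) and missed by the overall task control check (p')."""
    return p_error * p_check_misses * p_control_misses

def task_failure_probability(stages, p_control_misses):
    """Combine several nominally independent stages of work; for small probabilities the
    overall failure probability is approximately the sum of the per-stage contributions
    (higher order terms neglected, as in the text)."""
    return sum(unchecked_failure_probability(p, pc, p_control_misses)
               for p, pc in stages)

# Illustrative numbers only: three stages, each with a 1e-2 error probability and a
# checking stage that misses 1 error in 10, under an overall control that misses 1 in 10.
stages = [(1e-2, 0.1), (1e-2, 0.1), (1e-2, 0.1)]
print(task_failure_probability(stages, 0.1))   # ~3e-4
```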

Experience has shown that high integrity can be achieved by various means, eg, high skills, experience, QA. Generally this can be entitled "product assurance". According to the task checking model shown in Figure 11 this involves determining that p', the overall task control element, is adequate. Turning now to Figure 12, this represents the upper hierarchy of an overall subsystem CCF modelling structure where the CCF models incorporate maintenance, engineering and random errors (causal mechanisms) as previously discussed. The latter can be divided as shown in Figure 12, and various models have been discussed in the literature for dealing with them. The various generic factors which enter into the estimation of engineering error are shown in Figure 13. These are assumed to be nominally independent, although this may be not entirely true. Studies of plant have shown that engineering defects decrease by up to an order of magnitude over a few years(3). The regulatory authorities insist that their experience shows that mandatory codes of practice have a beneficial effect. The three principal types of product assurance shown, ie, design review, reliability assessment and QA, will also each contribute perhaps up to an order of magnitude improvement in error rate. The thorough implementation of all these factors can obviously have a very significant effect and indicates how a much lower error probability than 10^-3 may be achievable. Very little data is available to support these predictions except that from aircraft systems.

3.5 Management Assessment

This is the most problematic and least developed area from a risk and reliability viewpoint. It is a common influence affecting all aspects of plant operation. Some authoritative sources believe that the range from very good to very poor management can produce an order of magnitude increase in the risk of accidents. Some analysts believe it can best be dealt with by considering the effects of supervision, training, working environment and other management controlled factors at the detailed task level. Indeed the existence and performance of overall controls and monitoring, as previously described, is clearly a major management responsibility in reducing risk and improving reliability. In the aviation world(13) the flight crew training programmes are expanding beyond the traditional role of maintaining piloting skills and providing instruction, towards flight deck management, crew coordination, teamwork and communications.

FIGURE 13  PART OF SUBSYSTEM CMF MODELLING STRUCTURE
(Engineering error decomposed into basic engineering error and product assurance factors such as codes of practice, design review, reliability assessment and QA.)

Flight simulator training(13) now includes management programmes focusing on communications and management practices eg,

managerial philosophy
individual work styles
communications
integration of the "four" foundations of management - planning, organising, leading and controlling
management skills and involvement practices
specific strategies for the effective exertion of influence.

Flight experts tend to relate aircraft accidents to interpersonal and management factors far more than to lack of systems knowledge or to aircraft related factors. Studies(13) identify a "safety window" in which nearly 83% of accidents involving professional pilots occur, beginning at or about the final approach fix and extending through approach and landing. 90% of the accidents that occur in this window appear not to be aircraft related; they are pilot caused and seem to reflect failure to manage properly. As a result, in training pilots a role change is occurring, converting the pilot from a control manipulator to an information processor. Practically the only technique which has been developed to model and assess management explicitly from the risk viewpoint is the Management Oversight and Risk Tree (MORT)(14). This system safety programme has been developed and refined by the US Department of Energy (DOE). MORT is a systematic approach to the management of risks within an organisation. It incorporates ways to increase reliability, assess risks, control losses and allocate resources effectively. The acronym, MORT, carries two primary meanings:

1  the MORT "tree", or logic diagram, which organises risk, loss, and safety program elements and is used as a master worksheet for accident investigations and program evaluations; and

2  the total safety program, seen as a sub-system to the major management system of an organisation.

The MORT process includes four main analytical tools. The first main tool, Change Analysis, is based upon the Kepner-Tregoe method of rational decision making. Change Analysis compares a problem-free situation with a problem (accident) situation in order to isolate causes and effects of change. The second tool, Energy Trace and Barrier Analysis, is based on the idea that energy is necessary to do work, that energy must be controlled, and that uncontrolled energy flows in the absence of adequate barriers can cause accidents. The third, and most complex, tool is the MORT Tree Analysis. Combining principles from the fields of management and safety and using fault tree methodology, the MORT tree aims at helping the investigator discover what happened and why. The fourth tool, Positive (Success) Tree Design, reverses the logic of fault tree analysis. In positive tree design, a system for

281
successful operation is comprehensively and logically laid out. The positive tree, because it shows all that must be performed and the proper sequencing of events needed to accomplish an objective, is a useful planning and assessment tool. An illustration of a MORT "tree" or logic diagram is shown in Figure 14.

4.

QUANTIFICATION OF HUMAN ERROR

In a review(7) of the general approaches to human reliability quantification carried out by the Safety and Reliability Directorate (SRD) of the UK Health and Safety Executive (HSE), three broad categories of approach were described. The first of these relies primarily on combining together historical data on the probabilities of failure for relatively basic elements of human behaviour, such as operating switches, closing valves or reading dials, to give the likelihood of errors for more complex tasks which are aggregations of these basic elements. Such techniques are variously referred to as 'synthesis', 'reductionist' or 'decomposition' approaches. The second category comprises those approaches which attempt to apply classical reliability techniques of time dependent modelling to predict parameters such as failure probability as a function of time. The third category of techniques makes a much greater use of quantified subjective judgement, to supplement the currently inadequate data base of objective data on the probability of human error for various types of task. Also, these methods tend to take a more holistic approach to the evaluation of a task than the decomposition techniques. Further developments have taken place in some of the specific techniques described in the SRD/HSE report(7), new techniques have appeared and there has been a proliferation of work and PRA reports for the American nuclear power industry utilising many variations of the available methods. It must be emphasised that most of these techniques rest in some way, although often tentatively, on the human performance models previously described. They are loosely based on such models and are techniques to quantify certain kinds of events in probabilistic risk analysis (PRA). They represent an engineering solution to a problem that has resisted solution in the fields of psychology and human factors. A framework for the systematic application of these techniques has recently been provided through the Electric Power Research Institute (EPRI) of the USA by the NUS Corporation. This is the so-called SHARP (Systematic Human Action Reliability Procedure) framework(15). A description of the method of quantification will therefore be given with reference to this framework. The SHARP framework is shown in Figure 15, which shows the links between the seven steps involved. The objective of the first step is to ensure that potentially important human influences are included in plant logic diagrams such as event trees (ET) and fault trees (FT). An example of an enhanced fault tree produced after undergoing the detailed procedures of the definition step is shown in Figure 16. The failure

LETTER 0/N D/NP EROA F/ F/M F/MiR F/T HAP JSA LTA OSHA USO

nformotion System* LTA

Design plan LTA

Operational Readiness LTA

FIGURE 14

MANAGEMENT OVERSIGHT AND RISK TREE

w;

ABBREVIATIONS DIO HOT OIO NOT PROVIOE ENERGY RESEARCH I DEVELOPMENT ADMINISTRATION FAILED FAILURE FAILED TO MONITOR FAILEO TO MONITOR ( REVIEW FAILEO TO HAZARD ANAL. PROCESS JOB SAFETY ANAL. LESS THAN ADEQUATE OCCUPATIONAL SAFETY t HEALTH ADMINISTRATION REPORTED SIGNIFICANT OBSERVATION WITH

STEP 1 DEFINITION

STEP 2 SCREEENING

STEP 3 BREAKDOWN

NO

YES

STEP 7 DOCUMENTATION

STEP 6 QUANTIFICATION

STEP 5 IMPACT ASSESSMENT

STEP 4 REPRESENTATION

FIGURE 15

LINKS BETWEEN SHARP STEPS

284 "types" referred to in this figure are defined in the SHARP report, but are self-explanatory in the fault tree. In step 2 the objective is to reduce the number of human interactions identified in step 1 to those that might be significant. The application of coarse screening is shown in Figure 17 which is the same fault tree as the previous figure where the analyst has applied generic equipment data and a fixed human error probability, eg, 1.0. Coarse screening takes into account only those system features that diminish the impact of human interactions on accident sequences. Fine screening goes beyond this by also applying probabilities to human actions. Various examples of suggested screening data have been given in the literature(7)(15). Figure 18 shows a graph based on the Rasmussen model of human data processes and typical malfunctions previously described in Figure 5. The application of such error rates to the fault tree shown in the previous figures is shown in Figure 19. The impact of failure to maintain the breakers is thus seen to be very significant relative to the combination of the failure to scram automatically and manually. The objective of step 3 is to amplify the qualitative description of each key human interaction identified in step 2. This is essentially done by means of some form of hierarchical task analysis such as previously discussed. Influence parameters, performance shaping factors, ergonomie features (or lack of them) etc., need to be considered to establish a basis for selecting a model basis for representation of the human interactions. This would include organisational factors, quality of information, procedural matters as well as personnel factors. The steps described so far are usually followed to some limited degree by risk and reliability analysts. Some form of screen or sensitivity analysis is advisable because of the difficulties in carrying out the next steps 4, 5 and 6 concerned which is what is often regarded as human reliability modelling. In fact step 3 and step 4 require human factors specialists as well as risk/reliability assessors whereas the previous steps principally requires systems and reliability expertise. In recent work carried out by NCSR(16) on reactor pressure vessel ultrasonic inspection the ET/FT format was followed. The event tree following the sequence of welding and testing and fault trees was developed for the nodes of the ET each of which involved ultrasonic testing. The fault trees were generated to the level at which reasonable robust human reliability data could be generated as in Figure 20. A similar procedure was devised for human error treatment in major hazard assessment(17) by SRD. An example of an event tree from a typical example is shown in Figure 21. 4.2 Human Reliability Modelling

Not all the modelling techniques and data generation methods can be considered here, so only those most relevant to the power and process industries will be considered since their requirements do have considerable similarities. The models and data will be considered together rather than separately, since they are intimately linked. It is worth mentioning here that step 5, impact assessment, of the SHARP

285

FAILURE TO

INSEAT CONTROL RODS

I
FAILURE OF RODS TO HOVE FAILURE TO REMOVE ALL

FAILURE TO OPEN ALL BREAKERS

I
FAILURE OF OPERATOR TO OPEN AUX. BREAKERS

BREAKERS FAIL CLOSED

I
FAILURE TO OPEN SCRAM BREAKERS A

A
FAILURE TO OPEN SCRAM BREAKERS BREAKERS FAIL CLOSED

<> !
MECH. FAILURE OF BREAKER A BLADES TO OPEN

FAILURE TO DE ENERGIZE UNDER VOLTAGE COIL A

MECH. FAILURE OF BREAKER BLADES TO OPEN

FAILURE TO DE ENERGIZE UNDER VOLTAGE COIL

A.
INCORRECT DESIGN OF 3RE7JR MAINTENANCE

3.
FAILURE OF OPERATOR TO SCRAM PLANT FAILURE OF APS TO SCRAM

I
INCORRECT DESIGN OF BREAKER MAINTENANCE FAILURE

JZ

9.
FAILURE OF APS TO SCRAM

FAILURE

FIGURE 16

4 ^

FAILURE OF OPERATOR TO SCRAM PLANT

ENHANCED FAULT TREE

286

FAILURE TO INSERT C ONTROL

I
FAILURE OF RODS TO HOVE FAILURE TO REMOVE ALL POWER

FAILURE TO OPEN ALL BREAKERS

I
FAILURE OF OPERATOR TO OPEN AUX. BREAKERS

BREAKERS A FAIL C LOSED FAILURE TO OPEN SC RAM BREAKERS A

I
A.
FAILURE TO DE ENERGIZE UNDER VOLTAGE C OIL A 10'

"A"

FAILURE T O 1 OPEN SC RAM I BREAKERS I BREAKERS FAIL CLOSEI

MECH. FAILURE OF BREAKER A BLADES TO OPEN

.
MECH. FAILURE OF BREAKER BLADES TO OPEN

FAILURE TO DE ENERGIZE UNDER VOLTAGE COIL 10"

A
INCORRECT DESIGN OF BREAKER MAINTENANCE FAILURE FAILURE OF OPERATOR TO SCRAM PLANT

I
FAILURE OF APS TO SC RAM

,4
I
FAILURE OF OPERATOR TO SCRAM PLANT FAILURE OF APS TO SC RAN

INCORRECT DESIGN OF BREAKER

MAINTENANCE FAILURE'

T^^^
"

FIGURE 17

APPLICATION OF A COARSE SCREENING TECHNIQUE

287

SKILL

RULE

KNOWLEDGE

IO -5

IO"4

IO"3

IO -2

IO -1

1.0 ERROR RATE

FIGURE 18

ERROR RATE RANGL'S ASSOCIATED WITH HUMAN BEHAVIOUR

288

FAILURE TO INSEKT C ONTROL RODS

I
FAILURE OF RODS TO HOVE 105 FAILURE TO REMOVE ALL POWER 104

T
FAILURE TO OPEN ALL BREAKERS 102

1
FAILURE OF OPERATOR TO OPEN AUX. BREAKERS 102

104 BREAKERS A FAIL C LOSED FAILURE TO 5C RAM BREAKERS A 102

I
102

"S"
FAILURE TO OPEN SC RAM BREAKERS B

T
I
IMECH. FAILURE OF BREAKER A BLADES TO OPEN

102 FAILURE TO DE ENERGIZE UNDER VOLTAGE C OIL A 1107

A
HECH. FAILURE OF B REAKER BLADES TO

a.

BREAKERS B FAIL C LOSED

IO"

FAILURE TO DE ENERGIZE UNDER VOLTAGE C OIL

SL
INCORRECT DESIGN OF BREAKER MAINTENANCE FAILURE FAILURE OF OPERATOR TO SCRAM PLANT

103

I
INCORRECT DESIGN OF BREAKER MAINTENANCE FAILURE

JZ

.
FAILURE OF APS TO SC RAM

T^
104

FAILURE OF APS TO SC RAM

r^

FAILURE OF OPERATOR TO SCRAM PLANT

FIGURE 19

APPLICATION OF SCREENING USING GENERIC DATA, HUMAN AND EQUIPMENT

289

FAILURE OP CHANNEL TO IDENTIFY FLAW

FAILURE OF PASS TO IDENTIFY FLAW

J
FAILURE TO DETECT FLAW SIGNAL FAILURE TO CORRECTLY REC ORD RELEVANT SIGNAL FAILURE TO PRODUCE FLAW SIGNAL FAILURE TO DETECT DISPLAYED FLAW SIGNAL

FAILURE OF REDUN DANT CHANNFLS TO IDENTIFY FLAWS

tests

I
FAILURE TO PREVENT CONTAMINATION OF RES'JLTS

FAILURE TO ACCURATELY REC ORD RESULTS

/ \ AIoo

/ \A:

f
FAILURE TO PREVENT CORRUP TION OF RECORDS FAILURE TO PREVENT LOSS OF RECORDS FAILURE TO PREVENT DAMAGE TO REC ORDS

1
F M LIU." TO USE COPRECT THRESHOLD FAILURE TO CORRECTLY SYNCHRONISE RECORDING UITE DATA

l
FAILURE TO PREVENT DATA DRIFT

FAILURE TO MATCH RESOURCES TONUEDS

FAI LURE TO

USE
CORRECT SCALING

FAILURE TO APPLY CORRECT .BAND WIDTH

FAILURE TO PREVENT OMISSION IN RECORDS FAILURE TO PREVENT TRANSPOSITION IN RECORDS FAILURE TO PREVENT SUBSTITUTION IN RECORDS FAILURE TO PREVENT INSERTION IN RECORDS FAILURE TO PREVENT MULTIPLE ERRORS IN RECORDS FAILURE TO PREVENT ERRORS OF C OMMISSION IN RECORDS

FIGURE 20

FAILURE OF PASS TO IDENTIFY FLAW

290

INITIATING EVENT

at

B*

H2

Level naaxs normal fill lev!

Level Indicator work

Operator Acta <Clos Valv VI)

Valv VI Ciosas

Valva V2 Operable

Alara Mrka

Operator Acta (Cloe Valv V2)

Autotrip Hork

Consequence

Failure Probability

Y Y Y

E3

KEY

Y S

Ye No Safe Tank Overflows

pai
'overflow
<End S t e t c rehfclllt

y>

B ranches

la

H i n i a a l cut a e t on Pig 4

Contribution froa "nomai operation" state of plant

FIGURE 21

EVENT TREE FOR SEMIAUTOMATIC SYSTEM ON FIG. 3

* (Heading r e a r r a n g e d t o a c c o u n t f o r Common Mode F a i l u r e )

procedure allows a re-evaluation of the overall reliability/risk assessment so far and the incorporation of any insights gained, having decided which human reliability models should be used. The rest of this paper will only be concerned with human reliability modelling and not with the details of the SHARP procedure, which essentially only formalises what risk analysts and reliability assessors have been doing to varying degrees anyway.

4.3 Operator Action Tree (OAT)

This representation(15) of human decision making is shown in Figure 22. It allows for mis-interpretation by the operator at various key stages in the execution of a task. There are two significant aspects. The first is the limited time which the operator has to carry out the task; the OAT method has a time-failure (or non-response) probability relationship. The second is that the operator can take various decision paths and the assessor can determine whether they are key or not. If, as shown in Figure 22, all paths but one lead to failure then they can be grouped together. However, if for example failure to diagnose the event correctly could lead to inappropriate action (as evidence indicates has happened, since operators often do not follow standard procedures) then the OAT representation should reflect this. Although the OAT representation shown does not show recovery action, it may be appropriate also to allow for this key extension of the tree. The time related non-response probability data used to quantify OAT are shown typically in Figure 23. The grouping of these curves might tentatively be considered to show the essential character of skill, rule and knowledge based behaviour (moving from left to right across the graph). However further work on the use of simulator data and human behaviour modelling is required to clearly establish the relationship between human behaviour types and simulator results. The OAT representation is potentially capable of modelling human performance reliability with high levels of problem solving requirement.
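The time dependence can be sketched, for example, with a lognormal time-reliability curve of the broad shape fitted to the simulator data in Figure 23; the distribution choice and the parameters below are assumptions for illustration, not values taken from the reference.

```python
import math

def non_response_probability(t_minutes, median_minutes=5.0, sigma=1.0):
    """Illustrative time-reliability correlation: probability that the operator has not yet
    responded t minutes after the event, assuming a lognormal response-time distribution."""
    if t_minutes <= 0:
        return 1.0
    z = (math.log(t_minutes) - math.log(median_minutes)) / sigma
    # Survival function of the standard normal, via the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for t in (1, 5, 10, 30, 60):
    print(f"t = {t:3d} min   P(no response) = {non_response_probability(t):.3f}")
```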

4.3 Human Reliability Analysis (HRA)

(This technique was formerly called THERP, "Technique for Human Error Rate Prediction".) The HRA tree provides a flexible structure for representing the steps of a well defined procedure. Some steps may involve omissions while others may show up as errors of commission. This method has been extensively developed over the past decade. An overview of the procedure involved is shown in Figure 24. Details of the method have been extensively described in a handbook(18), which includes data sets, and in a procedures guide(19). An illustrative HRA tree, together with an explanatory glossary and data, is shown in Figure 25. The evaluation of performance shaping factors and the procedures for choosing data are explained in detail in the handbook. It may be seen that the HRA tree is similar to the fault tree approach used by NCSR in the reliability analysis of RPV inspections. However there are a variety of methods for generating the data for the basic events in the

FIGURE 22  BASIC OPERATOR ACTION TREE
(Event occurs; operator observes indications; operator diagnoses problem; operator carries out required response; each branch leads to success or failure.)

FIGURE 23  NON-RESPONSE PROBABILITY VERSUS RESPONSE TIME
(Non-response probability curves against time in minutes for a range of situations - instrument failures, steam generator leak, loss of coolant, control rod accident, inadvertent safety injection, small break LOCA etc - drawn from simulator studies, field data and expert estimates; from Hall, Fragola and Wreathall, 1982.)

FIGURE 24  AN OVERVIEW OF A HUMAN RELIABILITY ANALYSIS
(Phase 1, familiarization: plant visit and review of information from system analysts. Phase 2, qualitative assessment: talk- or walk-through and task analysis. Phase 3, quantitative assessment: develop HRA event trees, assign nominal HEPs, estimate the relative effects of performance shaping factors, assess dependence, determine success and failure probabilities and the effects of recovery factors, and perform a sensitivity analysis if warranted. Phase 4, incorporation: supply information to system analysts.)

FIGURE 25  HRA EVENT TREE FOR ACTIONS PERFORMED OUTSIDE THE CONTROL ROOM
(Tree for events A to D - control room operator omits ordering the tasks, operator omits verifying the position of MU-13, operator omits verifying/opening the DH valves, operator omits isolating the DH rooms - with HEPs of 0.02 to 0.04 (EF = 5), modified to reflect moderately high stress and protective clothing.)

trees, or indeed for whole tasks. It will be seen from the HRA tree illustrated that it is based on the task analysis approach previously described. Data has been estimated and presented in the handbook for a variety of task elements. The method of estimation was expert judgement by small groups of experts. This data has been verified to a limited extent by a recent simulator study(20). The observed error rates (OER) from the simulator study were compared with the adjusted (allowing for PSFs) human error probabilities (AHEP) derived from the handbook and found to be largely in agreement. A summary of the results is shown in Table 2. In the case of errors of commission, which appeared to be mainly due to operator slips, almost instantaneous error recovery was a significant factor, as indicated by the recovery rate in the Table.

TABLE 2  SUMMARY OF OERs

Error Type                   OER(a)      AHEP
Whole-Step Omissions
  With Procedures            .0314       .0148 (.0049-.044)
  Without Procedures         .0473       .0163 (.0033-.0815)
Within-Step Omissions        .0270       .0155 (.0031-.0775)
General Commission:
  Total                      .00316      .00453 (.0015-.0136)
  Unrecovered                .00042
Recovery Rate                .867

(a) Taken from Table 4-4 (reference 20)
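To make the quantification step concrete, the sketch below evaluates a small HRA tree of the kind shown in Figure 25: a sequence of required actions, each with a basic human error probability and an optional recovery credit, where the task fails if any action fails unrecovered. The numbers and the independence assumption are illustrative only; the handbook's treatment of dependence and performance shaping factors is not reproduced here.

```python
def hra_task_failure_probability(steps):
    """Failure probability of a procedure made of sequential steps.
    Each step is (hep, p_recovery): the step fails if the error occurs (hep) and is not
    recovered (1 - p_recovery). Steps are treated as independent for this sketch."""
    p_success = 1.0
    for hep, p_recovery in steps:
        p_step_failure = hep * (1.0 - p_recovery)
        p_success *= (1.0 - p_step_failure)
    return 1.0 - p_success

# Illustrative values only (of the order quoted in Figure 25): four steps with
# HEPs of 0.02 and 0.04 and no credit taken for recovery.
steps = [(0.02, 0.0), (0.04, 0.0), (0.04, 0.0), (0.04, 0.0)]
print(f"Task failure probability ~ {hra_task_failure_probability(steps):.3f}")  # ~0.133
```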

4.4 Expert Opinion

This has already been referred to in connection with the HRA method, which mainly utilised direct numerical estimation by expert groups. It appears to be quite successful and is supported by experience and trials in NCSR. Two other methods are also worthy of serious consideration. The paired comparison technique was originally developed for psychological scaling and was adopted for human reliability purposes by Hunns and Daniels of NCSR(7). Pairs of tasks from a set of interest are successively judged by each judge in a panel. This procedure is repeated for all possible pairings from the set and a scale of likelihood of failure is constructed, based on certain assumed mathematical relationships. The justification for these assumptions is theoretical, with very limited experimental evidence. The procedure tends to be long and laborious and has not been used extensively.
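A minimal sketch of how such a scale might be constructed is given below: judges' pairwise preferences are converted to proportions, the proportions to standard-normal deviates (Thurstone-style scaling), and the resulting scale values are mapped to error probabilities through a log-linear calibration against two tasks whose probabilities are assumed known. This is only an indicative reconstruction of the general idea under those assumptions, not the specific procedure of Hunns and Daniels.

```python
import math
from statistics import NormalDist

def scale_values(win_matrix):
    """Thurstone-style scaling sketch: win_matrix[i][j] = number of judgements in which
    task i was rated MORE error-prone than task j. Returns a relative scale value per task."""
    n = len(win_matrix)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                total = win_matrix[i][j] + win_matrix[j][i]
                p = (win_matrix[i][j] + 0.5) / (total + 1.0)   # smoothed proportion
                z[i][j] = NormalDist().inv_cdf(p)
    return [sum(row) / n for row in z]                          # row means as scale values

def calibrate_log_linear(scale, known):
    """Map scale values to error probabilities assuming log10(HEP) = a*scale + b,
    calibrated on two tasks with assumed-known HEPs: known = {task_index: hep}."""
    (i, hep_i), (j, hep_j) = known.items()
    a = (math.log10(hep_i) - math.log10(hep_j)) / (scale[i] - scale[j])
    b = math.log10(hep_i) - a * scale[i]
    return [10 ** (a * s + b) for s in scale]

# Illustrative data: three tasks, ten judgements per pair; tasks 0 and 2 assumed known.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
print(calibrate_log_linear(scale_values(wins), {0: 1e-1, 2: 1e-3}))
```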

4.5 Slim-Maud

This is the "Success Likelihood Index Methodology"(22) implemented through the use of an interactive computer programme called "Multi-attribute Utility Decomposition". The basic rationale underlying SLIM is that the likelihood of an error occurring in a particular situation depends on the combined effects of a relatively small set of performance shaping factors (PSFs). In brief, PSFs include both human traits and conditions of the work setting that are likely to influence an individual's performance. Examples of human traits that "shape" performance might include the competence of an operator (as determined by training and experience), his/her morale and motivation, etc. Conditions of the work setting affecting performance might include the time available to complete a task, task performance aids, etc. It is assumed that an expert judge (or judges) is able to assess the relative importance (or weight) of each PSF with regard to its effect on reliability to the task being evaluated. It is also assumed that, independent of the assessment of relative importance, the judge(s) can make a numerical rating of how good or how bad the PSFs are in the task under consideration. Having obtained the relative importance weights and ratings, these are multiplied together for each PSF and the resulting products are then summed to give the Success Likelihood Index (SLI). The SLI is a quantity which represents the overall belief of the judge(s) regarding the likelihood of success for the task under consideration. The logarithmic relationship between expert judgements and success probabilities can be expressed with the following calibration equation: log of the success probability = a SLI + b

where: a and b are empirically derived constants. In general, the field evaluation of the basic SLI methodology has been successful in achieving several objectives. Although it was not possible to verify the accuracy of the human error estimates produced by SLIM because of the absence of sufficient field data on the rare event scenarios being evaluated, the judges involved in the exercise had considerable confidence in the results. It also seemed apparent that SLIM provided a useful structure which assisted the judges in modelling the potential failure modes. 4.6 Comments on Human Reliability Modelling
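As an illustration of the SLI arithmetic just described, the following minimal sketch computes a Success Likelihood Index from importance weights and ratings and converts it through the calibration equation. The PSF names, weights, ratings and the constants a and b are invented for the example and are not values from any SLIM-MAUD field evaluation.

```python
# Minimal SLI sketch; all numbers below are illustrative assumptions.
def success_likelihood_index(weights, ratings):
    """Normalise the importance weights, then sum weight * rating per PSF."""
    total = sum(weights.values())
    return sum((weights[p] / total) * ratings[p] for p in weights)

def success_probability(sli, a, b):
    """Calibration equation: log10(success probability) = a * SLI + b."""
    return 10 ** (a * sli + b)

# Hypothetical performance shaping factors for one task
weights = {"time available": 40, "procedures": 25, "training": 20, "stress": 15}
ratings = {"time available": 0.6, "procedures": 0.8, "training": 0.7, "stress": 0.4}

# Hypothetical calibration constants, as if derived from two reference tasks
a, b = 0.25, -0.3

sli = success_likelihood_index(weights, ratings)
p_success = success_probability(sli, a, b)
print(f"SLI = {sli:.3f}, P(success) = {p_success:.4f}, HEP = {1 - p_success:.4f}")
```

In practice the calibration constants would be fixed by anchoring the SLI scale to at least two tasks whose error probabilities are already known.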

4.6 Comments on Human Reliability Modelling

Clearly there are limitations to the extent to which psychology can be used to produce well-based and useful techniques. It has been shown that there are also other related important considerations. The effect of feedback and error recovery and conditional probabilities can introduce considerable structural complexities into the model. In particular, such complexities make the choice of a taxonomy on which modelling and data collection can be based very difficult. There is as yet no universally accepted generic approach to decomposition methods

or the corresponding data bases. These have been and should continue to be the subjects of basic research. One way ahead being investigated is to use basic ergonomic research data on the effects of very specific influences such as unfamiliarity, time, poor recovery capability, overload, learning etc., in combination to produce failure data of practical use in reliability assessment. This may lead to a Bayesian methodology. An intermediate step intended by NCSR in conjunction with the HFRG is to produce a longer version of the guide(5), produced recently in co-operation with the I Chem E, incorporating some of these features. For industries with the capability, the use of simulators to verify modelling techniques and derived data will also be extremely useful if not indispensable.

RISK PERCEPTION

This is an area not so far mentioned, but which can sometimes be of acute interest and importance to the safety/reliability assessor. It is an area mainly affected by social and political considerations; however, the 'human factor' attributes which have been found to influence perceptions of technological risk are listed below(23). They are negatively valued by most people; therefore the stronger the belief that the technology is characterised by these attributes, the less likely people will be to accept it:
- involuntary exposure to risk
- lack of personal control
- uncertainty about the probabilities or consequences
- lack of personal experience
- difficulty in conceptualising or imagining delayed effects
- infrequent but catastrophic accidents
- the benefits are not highly visible
- the benefits go to others
- accidents caused by human failure.

It would be as well when risk criteria or targets are being set to bear these considerations in mind.

ACKNOWLEDGEMENTS

The author gratefully acknowledges the Institution of Chemical Engineers for permission to utilize, as part of the present lecture, the paper "Review of Human Factors in Reliability and Risk Assessment", which first appeared in the Institution's Symposium series No 43, "The Assessment and Control of Major Hazards". For the same reason, the author is grateful to Springer-Verlag GmbH, which allowed him to use for the same purpose his paper "Fundamental Constraints on some event data", previously published in the Proceedings of the 5th EuReDatA Conference 1986 (pp. 466-491).

REFERENCES

1  WATSON I A et al "Critical Survey of Research on Human Factors and the Man-machine Interaction", IAEA-SM-26B/29, International Atomic Energy Agency (1984).
2  "The German Risk Study", Rep EPRI-NP-1804 SR, Electric Power Research Institute, Palo Alto (1981).
3  EDWARDS G and WATSON I A "A Study of Common Mode Failures", The Safety and Reliability Directorate, UKAEA, SRD R146 (1980).
4  Flight International, 22 January (1975).
5  "Guide to Reducing Human Error in Process Operation", Short Version by the Human Factors in Reliability Group. Published by NCSR, UKAEA, Wigshaw Lane, Culcheth, Warrington.
6  GOULD A, SHOTTER J "Human Action and its Psychological Investigation", Routledge and Kegan Paul (1977).
7  "Human Factors in Industrial Systems - Review of Reliability Analysis Techniques". (Draft available from the HSE, Chapel Street, London).
8  "Loss Prevention Bulletin", Number 058, August (1984).
9  "Human Factors Related to Reliability", Proposed Summary Guidelines for Research. Obtainable from I A Watson, Head of SRS, SRD, UKAEA, Wigshaw Lane, Culcheth, Warrington.
10 INTERACT '84, First IFIP Conference on Human-Computer Interaction, Conference Papers, September 1984. SHACKEL, Keynote Address "Designing for People in an Age of Information", page 6.
11 "Cognitive Psychology" by Darlene V Howard, Macmillan Publishing Co Inc, New York; Collier Macmillan Publishers, London (1983).
12 NORMAN DAVID A "Four Stages of User Activities", INTERACT '84 (see Ref. 10), page 81.
13 Aviation Week & Space Technology, October 1 (1984), page 99. "Cockpit Crew Curriculums Emphasise Human Factors".
14 "MORT - Management and Oversight Risk Tree", International Risk Management Institute, Vol VI, No 2, October (1983).
15 "Systematic Human Action Reliability Procedure (SHARP)", EPRI NP-3583, Interim Report, June (1984).
16 WILLIAMS J C et al "A Method for Quantifying the Effectiveness of Ultrasonic Inspection to Identify and Detect Flaws in Welds", NCSR, UKAEA, 8th Advances in Reliability Technology Proceedings, University of Bradford, April (1984).
17 "A Suggested Method for the Treatment of Human Error in the Assessment of Major Hazards", Safety and Reliability Directorate SRD R254, UKAEA, Wigshaw Lane, Culcheth, Warrington WA3 4NE.
18 SWAIN A D et al "Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications", Final Report NUREG/CR-1278. Prepared for the US Nuclear Regulatory Commission, August (1983).
19 BELL Barbara Jean et al "A Procedure for Conducting a Human Reliability Analysis for Nuclear Power Plants", Final Report NUREG/CR-2254, May (1983). Prepared for the US Nuclear Regulatory Commission.
20 "A Simulator-based Study of Human Errors in Nuclear Power Plant Control Room Tasks", NUREG/CR-3309, January (1984). Prepared for the US Nuclear Regulatory Commission.
21 DUNCAN D et al "Task Analysis", Training Information Paper No 6 (1971), HMSO, London.
22 EMBREY D E et al, SLIM-MAUD: "An Approach to Assessing Human Error Probabilities using Structured Expert Judgement", NUREG/CR-3518, March (1984). Prepared for the US Nuclear Regulatory Commission.
23 OTWAY J "Beyond Acceptable Risk: On the Social Acceptability of Technologies", Policy Sciences 14 (1982), 247-256. Elsevier Scientific Publishing Co, Amsterdam.
24 NORMAN D A (1983). Position Paper on Error. NATO Conference on Human Error, Italy.

PART III: STUDY CASES

SYSTEMS RELIABILITY ANALYSIS IN THE PROCESS INDUSTRY: AN INTRODUCTION

A. Amendola and S. Contini
Commission of the European Communities, Joint Research Centre
Systems Engineering and Reliability Division
I-21020 Ispra (VA)

ABSTRACT. The paper is focussed on the problems connected with the need of establishing a correct link between process and reliability modelling. To this aim, particular attention is given to the procedures for linking the operability analysis with the fault tree construction, and to the DYLAM technique, which has been proposed to analyze process physics and systems reliability by a self-contained dynamic approach.

1. INTRODUCTION

Reliability engineering is extensively used in the process industry both for design of control and safety systems, and for integrated RAM (Reliability - Availability - Maintainability) programmes. It is also playing an ever-increasing role in safety and risk assessments of hazardous processes. Of course basic reliability theory and modelling techniques developed for other technological systems apply to process plants as well. There are however some process features which have led to the development of more specific approaches. Principally, these are aimed at: identifying the hazards; structuring the qualitative system analysis procedure in a way that gives an adequate account of the physical-chemical processes involved; and linking the qualitative insights with the fault tree modelling. In addition to these well-established procedures, a more detailed account of the process characteristics can be obtained by the DYLAM approach, which has been rather recently proposed to analyze process physics and systems reliability by a self-contained dynamic approach.


2. RANKING RISK BY HAZARD INDICES

Numerical indices have been developed as non-probabilistic measures of the risks connected with process plants or storage facilities. Their calculation does not require the performance of complex risk studies. Their significance for a proper probabilistic safety or availability assessment has to be seen in the need of focussing the study of complex and large installations on the most critical units with respect to the analysis objectives (safety, environmental impact, or investment-specific loss prevention). Under these aspects, hazard indices can be evaluated in the preliminary phases of more comprehensive studies, in order to rank plant sections or units according to their relative risk significance. The indices most frequently used are:
- the DOW Index /1/ proposed by DOW Chemicals (USA) for fire and explosion hazards; and
- the Mond Index /2,3/, an extended version of the DOW one developed by ICI (UK) that covers toxicity hazards as well.
To calculate such indices, a plant is subdivided into process or functional units; afterwards the risk level of each unit is estimated in a qualitative manner (Table I), depending on factors like the nature and the quantities of the involved substances, the kind of process, available protection and mitigation systems, etc.

TABLE I. The DOW Fire and Explosion Index.

Index        Degree of Hazard
< 60         Light
61 - 91      Moderate
92 - 127     Intermediate
128 - 158    Heavy
> 159        Severe

Both methods assume that the plant has been well designed and constructed, according to applicable standards. With respect to the DOW Index, the Mond Index gives a more detailed account of material and process features; furthermore, it considers human operational factors, like quality of training, procedures, supervision etc.
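The ranking use of such indices can be illustrated with a small sketch that maps a Fire and Explosion Index value onto the degrees of hazard of Table I; the unit names and index values below are hypothetical, and the index itself would of course be computed following the DOW guide /1/.

```python
# Illustrative mapping of a DOW Fire and Explosion Index value to the
# hazard classes of Table I; unit names and index values are assumed.
def dow_degree_of_hazard(index):
    if index <= 60:
        return "Light"
    elif index <= 91:
        return "Moderate"
    elif index <= 127:
        return "Intermediate"
    elif index <= 158:
        return "Heavy"
    return "Severe"

# Rank hypothetical plant units by their index to focus the detailed study
units = {"storage section": 143, "reaction section": 96, "utilities": 38}
for name, fei in sorted(units.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F&EI = {fei} -> {dow_degree_of_hazard(fei)}")
```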

3. OPERABILITY ANALYSIS

Hazop (Hazard and Operability Analysis) /4,5/ is a systematic procedure

Figure 1. Operability analysis working sheet (example entry).

Guide Word:    too high
Deviation:     too high pressure in the separator 10.S.3
Causes:        1) Blockage of valve LV.37; 2) Valve PS.72 does not open
Consequences:  - Overpressure in tank 2.T.1; - High level in 2.T.1
Actions:       - Automatic closure of valve LV.07 on high pressure or high level in 2.T.1
for identifying out-of-nominal process conditions, their causes and consequences and possible corrective measures. Whereas Failure Mode and Effect Analysis (Fig. 7 at Ref. 6) is driven by the component failure modes, operability analysis, being process oriented, is driven by hypothesizable deviations of the process variables from their nominal values. This technique can be used both as a self-standing design review tool and as a preparatory step for a reliability or probabilistic safety assessment. In the former case the objectives are:
- identification of hazards;
- verification of the adequacy of existing prevention and protective devices;
- identification of measures for risk reduction.
Of course, without probability estimations the whole assessment is of a qualitative nature. Nevertheless, even as a stand-alone technique, an operability study gives useful insights into the process hazard features and permits significant decisions for risk reduction to be taken. To get a correct system model and to identify the hazards as completely as possible, it is always recommended to perform a preliminary qualitative analysis before any reliability assessment /7/: in this respect operability analysis is the most adequate tool for process plants. It offers indeed a well structured approach for the identification of the hazards, which have to be assumed as the Top Events for the relevant fault trees. It is also a useful guide for a correct fault tree construction. An operability study can be performed at any design stage, as well as on plants already in operation. Depending on the objectives, the necessary documentation may vary from general flow schemes to detailed piping and instrumentation diagrams, including information on layout, test, maintenance, operating procedures etc. The study is normally performed by a multidisciplinary team (including design, process and instrumentation engineers, control and maintenance operators etc.), through a series of successive brainstormings which start from the subdivision of the plant into functional units and, after a ranking of the units according to their significance, concentrate the major resources on the most important parts. Figure 1 shows a typical example of a Hazop working sheet. The meanings of the different columns are self-explanatory: only the "Guide word" column needs to be briefly commented upon. Guide words, associated with each deviation possibly occurring in the value of process variables, are taken from a table constituting a kind of check list which should be followed during the analysis to improve completeness. An example of guide words is shown in Table II.

Figure 2. A fault tree oriented operability working sheet and implied fault tree.

Working sheet columns: Guide Word | Deviation | Causes | Consequences | Protective Actions (Automatic / Manual) | Notes | Top.
Example entry: guide word "too high"; deviation "too high pressure in 10.S.3"; causes 1. blockage of valve LV.37, 2. valve PS.72 does not open; consequence: overpressure in tank 2.T.1 and high level in 2.T.1; automatic protection: closure of valve LV.07 on high pressure or high level in 2.T.1; manual protection: closure of LV.07 on high pressure alarm; Top = No if one of the protective actions is successful, Yes if both protective actions failed.
Implied fault tree: Top event = (deviation occurs) AND (failure of protective actions); deviation occurs = Cause 1 OR Cause 2 OR ... OR Cause k; failure of protective actions = (failure of automatic protections) AND (failure of manual protective actions).
308
TABLE II. Guide words applicable to the process variable "flow".

no    reverse    low    high    too low    too high

4. LINKING OPERABILITY ANALYSIS WITH FAULT TREE CONSTRUCTION

The operability working sheets represent a basic source of information for constructing the fault trees needed to estimate the probability of occurrence of the most significant undesired abnormal events. The participation of the reliability engineer in the team performing the operability study should be recommended. Also, the working scheme can be modified in such a way that a direct link can be established with fault tree construction. Such procedures have already become industrial praxis. An interesting example is the procedure implemented by SNAMPROGETTI /8/, as summarized in the following. A key point of the procedure is a preliminary subdivision of the plant into separate sections: the boundaries of each section are defined by a set of nodes, which also represent the points at which the different sections interact with each other. A section can be identified by grouping together components performing a same process function, components on a same flow line, or components belonging to a same protective system, etc. The experience generally results in optimal decomposition schemes. The operability analysis is then performed by studying the causes and consequences of process variable deviations at the plant section nodes. In a same section further nodes might be defined to better describe how the deviation can propagate among units belonging to a same plant section. The analyst's attention is principally focussed on the units within the considered section, even if the overall view of the plant must be kept in mind to avoid misleading interpretations. The main information contained in the working modules (see Fig. 2) are sets of statements of the following type:
- if "causes i or j occur" then "deviations ΔF at nodes K and L occur";
- if "deviation ΔF occurs at node K" and "the automatic protection does not intervene" and "the operator fails to intervene" then "the deviation ΔFj at the node Hj occurs".

309

Configuration

Type of gate

Operations to be performed

N., V., D

AND
GATE
1

Remove the event E

N., V., D J h

OR

Remove the event E and all OR gates up to the first AND gate found

Sure to occur event

N., V., D J h
I 1

AND

GATE
1

Remove the event E and all AND gates up to the first OR gate found

N.,
1

V., D k J

OR

Remove the event E

Impossible to occur event

Figure 3.

Checks to be performed before developing a generic event E.

The fault tree construction for the relevant top events can be carried out by the direct links created by the procedure described in the operability working sheets among the deviations at the different nodes and the component failure modes: each consequence can easily be described in fault tree terms as shown in Fig. 2. By starting with the working sheets for the section at which the top event has been defined, the first subtree identifies the relevant causes and deviations at the section nodes, which in turn represent the roots of the subsequent subtrees. The procedure ends when all leaves of the tree cannot be further developed. During the fault tree construction, congruence and simplification rules are applied. To this aim an event to be developed is associated with a triplet (N, V, D) identifying the node, the process variable and the kind of deviation, respectively. An event E does not satisfy the congruence criteria, and therefore must be removed, when the associated process variable appears in a same path* with different forms of deviation (e.g. high flow and low flow). In addition to that, the development of events already present in a same path must be avoided, otherwise loops are generated. This results in the rules shown in Fig. 3.
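A minimal sketch of these checks on the (N, V, D) triplets is given below; it only classifies an event against its path, after which the gate-dependent removal operations of Fig. 3 would be applied. The node and variable names are illustrative assumptions.

```python
# Classify an event (N, V, D) against its path before developing it further.
def classify_event(event, path):
    """Return 'sure', 'impossible' or 'develop' for an event against its path.

    event: (node, variable, deviation); path: list of such triplets from the
    event being considered up to the top event.
    """
    node, var, dev = event
    for p_node, p_var, p_dev in path:
        if (p_node, p_var) == (node, var):
            if p_dev == dev:
                return "sure"        # already assumed true higher in the path
            return "impossible"      # contradictory deviation: congruence violated
    return "develop"                 # new event, can be developed further

path = [("10.S.3", "pressure", "too high"), ("2.T.1", "level", "high")]
print(classify_event(("2.T.1", "level", "low"), path))            # impossible
print(classify_event(("10.S.3", "pressure", "too high"), path))   # sure
print(classify_event(("K", "flow", "low"), path))                 # develop
```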

5. FAULT TREE ANALYSIS

Control and protection functions are generally not completely separated in process plants: indeed, valves of control loops can also have an emergency shutoff function and may be actuated by the protection system as well; a same sensor can monitor the value of a process variable in input to both the control system and the protections, etc. These aspects may call for the use of particular logical gates in fault trees of process systems in addition to the simple AND, OR and NOT operators. Indeed, the simple AND of the two input variables A and B does not take into account whether A becomes true before or after B has occurred. A Top Event in a process plant node is normally provoked by some failures in the process and by protection failures, but, of course, only if the protections fail before or at the same time as the process failure events. In order to express the sequentiality conditions described above, an extended definition of the inhibit gates can be successfully applied (see Fig. 4).

* Note. A path is a list of triplets from the event being considered to the top.

The output is true only if B is true before A becomes true (i.e. if the condition expressed by the variable B is not such as to inhibit the failure event A from propagating).

Figure 4. Inhibit gate used as a sequential AND gate.

Events able to initiate a failure chain (like failures of components in the production or control systems) are input to the gate (A in Fig. 4); conditions inhibiting the verification of the top event (B in Fig. 4) are modelled by the lateral input P. Initiating events are characterized by a failure frequency ω(t), whereas inhibiting events are characterized by an unavailability on demand q(t). Both ω(t) and q(t) are functions of the failure rates, repair rates and test intervals for the relevant components. A generic minimal cut set h of order n of the fault tree contains either "initiating" events only or both "initiating" and "inhibiting" failure events. It cannot contain inhibiting events only. Let k be the number of the initiating events; then the MCS contribution to the Top Event probability F_h(t) can be expressed by the expected number of failures as an upper bound, as follows:

F_h(t) <= W_h(t) = ∫_0^t [ Σ_{i=1..k} ω_i(τ) · Π_{j=k+1..n} q_j(τ) ] dτ        (1)

Chains of inhibit gates may be needed to describe the failure logic of plant and protection systems. The SALP-PC code /9/ has implemented efficient algorithms to analyze fault trees including inhibit gates as well.
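The following sketch evaluates the cut set contribution of Eq. (1) numerically for assumed frequencies and unavailabilities; it is not the SALP-PC algorithm, only an illustration of the formula with invented numbers.

```python
# Numerical evaluation of the cut set contribution (1): initiating events
# enter with their frequencies omega_i(tau), inhibiting events with their
# unavailability on demand q_j(tau); the rates below are assumptions.
def expected_failures(omegas, qs, t, steps=1000):
    """Integrate sum_i omega_i(tau) * prod_j q_j(tau) from 0 to t (trapezoid rule)."""
    dt = t / steps
    total = 0.0
    for s in range(steps + 1):
        tau = s * dt
        integrand = sum(w(tau) for w in omegas)
        for q in qs:
            integrand *= q(tau)
        weight = 0.5 if s in (0, steps) else 1.0
        total += weight * integrand * dt
    return total

# Two initiating events (per-hour frequencies) and one protection whose
# unavailability grows between periodic tests; all numbers are hypothetical.
omegas = [lambda tau: 1.0e-4, lambda tau: 5.0e-5]
qs = [lambda tau: 0.02 + 0.001 * (tau % 720) / 720]
print(f"Expected cut set failures over one year: {expected_failures(omegas, qs, 8760):.3e}")
```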

6. THE DYLAM TECHNIQUE

Safe and reliable plant operation requires that process variables be strictly kept within prescribed ranges, the exceeding of which would lead to dangerous conditions.

Fault trees are based on binary logic: even in the cases where multistate or discrete variables have been used to construct the fault trees /10,11,12/, the final analysis has always been reduced to the treatment of binary indicators. These are however not adequate descriptors of continuous quantities like the process variables. Furthermore, in fault trees physical time aspects can only be taken into account as discretized deviation events, just as those considered in operability studies. The DYLAM technique /13,14,15,16/ has been developed as specially addressed to process safety and reliability assessment by a probabilistic dynamical analysis of the possible process transient. Therefore, the principal feature which substantially differentiates DYLAM from other reliability analysis techniques is the ability to describe in self-contained calculations both the random performance process of the system components and the resulting (steady-state and transient) behaviour of the associated physical process. Indeed, incidents move in specific directions according to the values assumed by physical parameters (such as temperature, flowrate, pressure, etc.) which are, on the one hand, safety significant and, on the other hand, activate the intervention of control, protection and/or mitigating systems. These may be actuated manually or automatically at times which depend on the transient course and therefore can be determined only by the knowledge of the process physics. On the other hand, the availability of the protection systems, as well as the frequency of the initiating events, are random variables that need to be described by the probabilities associated with the different "states" (nominal, failed, degraded, etc.) of the component. In some cases these probabilities may be strongly dependent on the process variables (e.g. failure of a pressure vessel, success in the closure of a valve in certain critical flow conditions, etc.), and therefore need to be evaluated as conditional on the process situation. The main steps of the methodology can be summarized as follows, whereas for a detailed treatment the reader is referred to the more extensive papers /13-16/:
- Identification of the states (nominal, failed or degraded) into which system components may move;
- Modelling each state by a mathematical equation describing the physics of the component under that condition, according to the implemented parametrical algorithm which allows the DYLAM code to associate each state with the corresponding equation;
- Assignment of the probabilities associated with the initial state, rates for independent time transitions and/or conditional probability matrices in cases of dependence among states and process variables;
- Implementation of the most efficient solution scheme for the resulting equation system;

- By the previous steps, the system has been already implicitly described in all its possible states, so that no further model has to be established by the analyst: it is necessary only to define the TOP conditions that have to be investigated. Indeed, more than just one TOP can be analyzed. A TOP can be assigned with respect to values attained by process variables at certain times: e.g. a condition such as temperature above ignition point and concentration of flammable substances above the corresponding threshold after a specified time can be programmed.
At this point, program input variables control the extent and the detail at which the system needs to be studied according to automated procedures consisting of the following:
1) Combinatorial generation of all possible component states up to the prefixed resolution (cut-off rules are introduced to limit exponential explosion) or of a randomly selected number of possible transients;
2) Resolution of the system of equations corresponding to the resulting states of the components at each selected time step during the mission time;
3) Comparison of the values calculated for the process variables with the TOP conditions to identify the sequences (combinations of component states) leading the system into the TOP situation;
4) Evaluation of the occurrence probability of the TOP condition as a function of time.
Limitations for the analysis of complex systems might principally arise from too high computation times; this can result from a too large number of components, which rapidly increases the number of sequences to be investigated. Applicability to rather complex cases has been demonstrated by a study case referred to a batch chemical plant /17/, which will be briefly summarized in the following. The system under study was a sulfolane reactor (Fig. 5), which presented a hazard of runaway reactions. The system has been investigated for the following conditions: maximum pressures to be expected as a function of time and probability, provided that the safety valve fails to open. This information might be useful for the design of the pressure vessel: in normal conditions the maximum pressure is about 20 bars; however, it can increase to 55 bars in case of loss of coolant in the heat exchanger IE 102, and even more when there is no recirculation due to pump failure (BC 101 A/B). The simulation has been repeated a first time without allowing for operator correction actions and a second time considering the operator intervention.
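To make the exploration scheme concrete, the sketch below enumerates the state combinations of a few hypothetical components with a probabilistic cut-off, replaces the plant simulator by a toy pressure model and accumulates the probability of a pressure TOP condition. All component names, probabilities and physics are assumptions for illustration, not the DYLAM implementation.

```python
# A much simplified DYLAM-like exploration with invented components and physics.
from itertools import product

components = {
    "cooling_water": {"nominal": 0.995, "lost": 0.005},
    "recirc_pump":   {"nominal": 0.998, "failed": 0.002},
    "safety_valve":  {"opens": 0.99,  "stuck": 0.01},
}
CUTOFF = 1e-7
TOP_PRESSURE = 27.0   # bar

def simulate_pressure(states, hours=5.0, dt=0.1):
    """Toy model: pressure builds when cooling or recirculation is degraded
    and is relieved if the safety valve opens."""
    p, t = 20.0, 0.0
    while t < hours:
        heat = 0.0
        if states["cooling_water"] == "lost":
            heat += 4.0
        if states["recirc_pump"] == "failed":
            heat += 6.0
        relief = 8.0 if (states["safety_valve"] == "opens" and p > 22.0) else 0.0
        p = max(p + (heat - relief) * dt, 20.0)
        t += dt
    return p

top_probability = 0.0
names = list(components)
for combo in product(*(components[n] for n in names)):
    states = dict(zip(names, combo))
    prob = 1.0
    for n in names:
        prob *= components[n][states[n]]
    if prob < CUTOFF:          # cut-off rule to limit the exponential explosion
        continue
    if simulate_pressure(states) >= TOP_PRESSURE:
        top_probability += prob
print(f"P(max pressure >= {TOP_PRESSURE} bar) ~ {top_probability:.2e}")
```

In a real application the toy model would be replaced by the plant-specific simulation modules, and time-dependent transition rates would replace the simple per-mission state probabilities used here.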

Figure 5. Sulfolane batch reactor. Simplified flow diagram (IE: heat exchanger, BC: centrifugal pump, RM: reactor).

A dynamic simulator has been used for this sulfolane batch plant which includes the following models:
- Reactor: lumped parameter model (all properties are uniform in every phase inside the reactor). Vapour and liquid phases are considered, including mass transfer rates between the two phases and latent heats of vaporization in thermal balances. The reactor walls are included in the thermal balances, and the reactor is considered adiabatic.
- Double-tube heat exchangers IE 101 and IE 102: distributed parameter model (the temperature changes with time and distance). The exchangers are considered adiabatic and their walls included in the respective thermal balances.
- Pipe walls in the circuit are adiabatic and they were included in the thermal balances.
Several failures were included in the modelling of the different plant components and equipment, e.g. partial or complete lack of cooling water to IE 102 (caused by various reasons), partial or complete lack of warm water in IE 101 after the start of the reaction, changes in water temperature, recirculating pump failures, control device failures and/or delays, operator delays, etc.; possible interventions in opening the by-pass manual valves in IE 102/IE 101 when the control system fails to actuate them automatically. The TOP conditions have been assigned in terms of different levels of maximum pressures, namely: TOP1 = 27 bar, TOP2 = 36 bar, TOP3 = 46 bar and TOP4 = 56 bar. All these TOPs have been investigated in a

same code run; while, by the fault tree approach, four different fault trees, one for each of the TOP conditions, should have been analyzed. Table III summarizes the probabilistic results as obtained by exploring the system possible states, by applying a probabilistic cut-off of 10 and by exploring all sequences up to order 4 (maximum 4 failure or degradation events).

TABLE III. Probability of overpressures.

                         Probability (per year) of pressure greater than:
Because of event
sequence of order:       27 bar        36 bar        46 bar        56 bar

Without considering operator intervention
1                        0             0             0             0
2                        5.4 x 10^-4   5.3 x 10^-4   5.3 x 10^-4   0
3                        5.3 x 10^-4   0.3 x 10^-4   0.2 x 10^-4   1.5 x 10^-5
4                        1.7 x 10^-4   0.7 x 10^-5   0.3 x 10^-5   0.6 x 10^-7
Total                    1.3 x 10^-3   5.7 x 10^-4   5.6 x 10^-4   1.5 x 10^-5

Considering operator intervention
1                        0             0             0             0
2                        1.4 x 10^-6   1.4 x 10^-6   1.4 x 10^-6   0
3                        8.3 x 10^-4   2.5 x 10^-4   6.7 x 10^-5   1.5 x 10^-5
4                        3.4 x 10^-4   1.1 x 10^-4   1.5 x 10^-5   0.4 x 10^-7
Total                    1.2 x 10^-3   3.6 x 10^-4   8.4 x 10^-5   1.5 x 10^-5

As Table III shows, no single failure leads to the TOPs, and the mitigating action of the operator has some (even if not drastic) effect on reducing the probability of both TOP2 and TOP3. This last result reflects the high failure probability assumed because of the very short time available for a successful intervention of the operator. The time course of the transient as far as temperature is concerned is shown in Fig. 6 in nominal and selected failure conditions; such secondary curves are other typical results which can be directly obtained by DYLAM together with the probabilistic ones. This brief description should be sufficient to demonstrate the DYLAM capabilities with respect to other current techniques.

Figure 6. Some typical temperature transients (time in hours): normal operation; failure of heat exchanger IE 102; pumping system failure BC 101 A/B.

The advantages of DYLAM (consideration of process physics, time, multiple TOP events studied in a same run, completeness of the analysis not depending on the analyst's judgement in the construction of a system model such as fault tree, possibilities of modelling human interventions, etc.) are to be weighted against the need of providing system-specific simulation modules and long computational times in the case of complex systems. However, these disadvantages can be mitigated if the DYLAM basic package can be coupled with some general-purpose process simulation packages which can enable the analyst to use the method without spending too much effort in programming plant-specific equations.

7. CONCLUDING REMARKS

As the paper shows, the development of reliability analysis techniques is moving from purely qualitative approaches towards methods able not only to predict frequencies of undesired events but also to give an adequate description of the process characteristics. The limits of this paper did not allow an extensive review of all available techniques: further approaches presenting other interesting features for system reliability analysis are described in another paper in this book /18/. Each technique has of course its advantages and limitations in terms of capabilities and costs. The choice of the most appropriate approach for a particular application is therefore strongly

depending on the objectives of the analysis and on the seriousness of the possible accident consequences.

8. REFERENCES

/1/ Dow Chemical, 'Fire and Explosion Index. Hazard Classification Guide', 5th Edition, 1981, Midland, Mich.
/2/ I.C.I., 'The Mond Fire, Explosion and Toxicity Index', ICI publication.
/3/ J.D. Lewis, 'The Mond Index Applied to Plant Layout and Spacing', 3rd Loss Prevention Symposium, 1977.
/4/ C.T. Cowie, 'Hazard and Operability Studies. A New Safety Technique for Chemical Plants', Prevention of Occupational Risks, Vol. 3, 1976.
/5/ H.G. Lawley, 'Operability Studies and Hazard Analysis', 2nd Loss Prevention Symposium, Vol. 8, 1974.
/6/ A. Amendola, 'Common Cause Failures Analysis in Reliability and Risk Assessment' (in this same book).
/7/ A. Amendola, 'Uncertainties in Systems Reliability Modelling: insight gained through European Exercises', Nucl. Eng. Des. 93 (1986) 215-225.
/8/ S. Messina, I. Ciarambino, 'Analisi di operabilità: tecnica qualitativa di individuazione dei punti critici di un impianto', Ispra Course: Analisi di affidabilità e sicurezza, AAS/84/4, 12-14 Novembre 1984.
/9/ S. Contini, 'SALP-PC: a Fault Tree Analysis Package on Personal Computer', Ispra, PER 1427/87, 1987. To be published as EUR Report (1988).
/10/ S.A. Lapp and G.J. Powers, 'Computer-aided synthesis of fault-trees', IEEE Trans. Reliab. R26 (April 1977), pp. 1-13.
/11/ S. Salem and G. Apostolakis, 'The CAT methodology for fault-tree construction', in Synthesis and Analysis Methods for Safety and Reliability Studies, Plenum Press, New York, 1980.
/12/ L. Caldarola, 'Generalized fault-tree analysis combined with state analysis', KfK 2530-EUR 5754e, February 1980.
/13/ A. Amendola and G. Reina, 'DYLAM-1: a software package for event sequence and consequence spectrum methodology', EUR 9224N, JRC, Ispra, 1984.
/14/ G. Reina and A. Amendola, 'DYLAM-2: description and how to use', T.N. No. 1.87.128, JRC Ispra, October 1987.
/15/ A. Amendola and G. Reina, 'Event sequence and consequence spectrum: a methodology for probabilistic transient analysis', Nucl. Sci. Eng. 77 (March 1981), pp. 297-315.
/16/ A. Amendola, 'The DYLAM Approach to Systems Safety and Reliability Assessment', EUR 11361 EN, December 1987.
/17/ A. Amendola, .A. Labath, Z. Nivolianitou, G. Reina, 'Application of DYLAM to the Safety Analysis of Chemical Processes', International Journal of Quality & Reliability Management, Vol. 5, No. 2, pp. 48-59 (1988).
/18/ J.P. Signoret, M. Gaboriaud and A. Leroy, 'Study cases of petroleum facilities as comparison basis for different methods' (in this same book).

THE RIJNMOND RISK ANALYSIS PILOT STUDY AND OTHER RELATED STUDIES

H.G. Roodbol Central Environmental Control Agency Rijnmond 's-Gravelandseweg 565 3119 XT Schiedam The Netherlands

ABSTRACT. In Rijnmond a number of studies in the field of risk analysis are carried out. The most important ones are described here. The classical way of risk assessment appears to be very time- and money-consuming and gives only a limited accuracy. The cost-effectiveness of a risk assessment can be improved by simplifying the methodologies and computerising the techniques. It then becomes possible to assess the risks for the population in a whole area with numerous hazardous objects. Such a risk overview is necessary for safety policy decisions.

1. INTRODUCTION

Rijnmond is the area of the Rhine delta stretching from Rotterdam to the North Sea. It is about 40 km long and 15 km wide and more than one million people live in this area. The largest harbour in the world is situated here, with a vast agglomeration of chemical and petrochemical industries. So industrialised and residential areas are sometimes close together. In such an area accidents, with a relatively small area of influence, could cause calamities. The Central Environmental Control Agency Rijnmond (DCMR) registers approximately 400 incidents per year. Fortunately, most of these are minor incidents, such as spillages that result in minor pollution. Some of these incidents, however, could have produced a hazardous situation under less favourable conditions. Two examples of severe incidents that happened in the past are as follows: in 1963 toxic clouds, developed by decomposition of mixed fertilizer in storage, threatened 32,000 inhabitants. Favourable winds made an already prepared evacuation unnecessary, in 1968, heavy window pane damage occurred over a distance of 4 km due to an explosion of hydrocarbon vapours from an overheated slops tank.


So far there have been no fatalities among the population due to industrial accidents, but the presence of hazardous materials everywhere in the Rijnmond area is perceived by the population as a continuous source of danger. Therefore, in 1976, the executive council of the former Rijnmond Public Authority issued a note on industrial safety, in which an active policy was proposed along the following lines:
1. In judging the acceptability of risk, both the probability and the consequences will be considered, but the consequences will play a more important role.
2. After prescription of risk-reducing measures, there will still be a residual risk. In judging the acceptability of this risk, other aspects, such as social and economic aspects, should also be considered.
3. For new installations, the principle of "best practicable means" applies; in the case of toxic substances the principle of "best technical means" applies. Moreover, sanitation of existing situations may be required in some cases.
4. The elaboration of this policy into concrete measures will be done after an assessment of the present situation.
Consultation and cooperation with the industry is considered an essential part of the assessment process. In order to conduct such a policy, it is necessary to know the exposure of the population to potential risks. To this purpose, a number of studies on behalf of the Rijnmond Public Authority were started; these are described below. The results of these studies, together with results from other risk analysis studies carried out in the Netherlands, will be evaluated in a policy note on safety for the population at large, to be issued in 1988.

2. THE RISK ANALYSIS PILOT STUDY

2.1. Objectives of the study

A 'Commission for the Safety of the Population at large' (Dutch abbreviation: COVO) was set up, which decided to carry out a pilot study of risk assessment for six industrial "objects" in Rijnmond, in order to learn how well the consequences and the probabilities of possible accidents can be assessed. The principal objective of the study was to try out the techniques of quantitative hazard analysis in a realistic context, in order to get an answer to some questions that were formulated and were considered essential:
1. What is the reliability of the assessment of the consequences and probabilities of possible accidents with industrial installations when the procedures and methodology of risk analysis are carried out to their full extent?
2. What problems and gaps in knowledge exist in the field of risk analysis?

3. How can the results of a risk analysis be presented conveniently, without losing important details, so that it may be used for safety policy decisions?
4. How well can the influence of risk reducing measures on the consequences and on the probabilities be calculated?
5. What resources are required, in time and money, to assess the risks with sufficient accuracy to be useful for safety policy decisions?
The study was not to be concerned with the acceptability of risks or the acceptability of risk reducing measures. The objects selected to be studied were:
1. The storage of toxic material:
- 3700 m3 atmospheric storage of acrylonitrile;
- 1000 m3 sphere for pressurised storage of ammonia;
- 90 m3 horizontal tank for pressurised storage of chlorine.
2. The storage of flammable material:
- 57000 m3 cryogenic storage of LNG;
- 6000 tonnes sphere for pressurised storage of propylene.
3. A part of a chemical separation process:
- a diethanolamine (DEA) stripper of a hydrodesulphurizer plant.
These installations were chosen because all possible combinations of atmospheric and pressurised storage of both flammable and toxic materials were present, so that all available calculation methods for release and dispersion had to be used (see table 1). To give an answer to the above mentioned questions it was decided that the study should be done in a "classical way", i.e. it should contain the following steps:
- Collection of basic data (description of the installation with operation procedures and inspection routines, population densities, meteorological data etc.) and definition of the boundary limits of the study object (which components coupled to the main installation should also be studied, what are the smallest accidents to be considered etc.).
- Identification of potential failure scenarios (such as the failure of a pipeline etc.).
- Selection and application of the best available calculation models for physical phenomena (for the calculation of the effects of the failure scenarios).
- Selection and application of the best available models to calculate the consequences of physical effects (toxic doses, pressure waves) on people.
- Collection and application of the best available basic data and models to calculate the probabilities of such events.
- Choice and development of different forms of presentation of the final results.
- Investigation of the sensitivity of the results for variations in the assumptions used and an estimation of the accuracy and reliability of these results.
- Investigation of the influence of risk reducing measures.
A block diagram of the analysis is given in figure 1.

Table 1. Study objects and phenomena examined.

Study object              Release and spreading                Cloud formation     Type of hazard              Other effects
                                                               and dispersion
1. Acrylonitrile storage  Liquid jet and pool vaporisation     Dense               Toxic / pool fires          Confined explosion
2. Ammonia storage        2-phase flow and flash               Dense/neutral       Toxic
3. Chlorine storage       2-phase flow and flash               Dense/neutral       Toxic
4. LNG storage            Liquid flow and vaporisation         Dense/neutral       Flash fires / pool fires
                          from ground
5. Propylene storage      2-phase flow and flash               Dense/neutral       Explosions / flash fires
6. DEA stripper           Gas jet (H2S); liquid vaporisation   Neutral             Toxic                       Desorption of gas
                          (fat DEA)
323
COLLECT INFOCMATIOM O N STUDY OBJECT

t
IOEMTIFY PhssiBLC Lasses OF CONTAINMENT

t
CALCULATE SIZE A M O DUCATION OF EELEASE.

1
CeouP Losses COMSEQUENCE O F CONTAINMENT te. SIMILAR INTO CASES OF LIKE. Size, LOCATION E T C .

FlAMMABLt

CALCULATE

AXESASE

NUMBES

O F CASUALTIES P E E

BELEASE.

FAULT T e t e

^^CALCOLATD ^Teee .SfLBcreo

DATA

DKAW

FAULT

TICE

SELECT

OASE V E M T

PB^BABULITIES

ELECT fcm

FAILUCE FREQUENCY Lose, O F A M O GeauP CAIE

1
CALCULATE AMO MOST RANCE Tor EVENT PLUS FCEOOEMCY FKEOOCMCXOF CUT SETS SIGNIFICANT

EACH

CSMTAINMBNT FOB EACH

FAicuee

T
Geoup Foe E A C H FAILUCE CASE

MODIFY

FceoueNciH

B Y IGNITION

F*eoCkAB<LiTies

I F RELEVANT

T
CALCULATE S. FREQUENCY > CONSOUEMCC IMMTIFY T& FtasAiAue REMEDIAL. Bisic MCASUK.ES

CEOUCC

OveeAu.

FlGueE 1.
B L O C K D I A F A N I O P A N A L Y S I S

324 2.2. Collection of basic data

The different companies made available all the information required. This was partly collected through site visits with discussions with the plant engineers, managers and operators. The general approach used was to issue a general list of questions to the industries in advance of any meeting. This enabled industry to prepare answers in advance and was generally felt to be very useful. Two main meetings were held with each industry, the first being an initial familiarisation, to enable the investigators to draw up a specific and detailed set of technical questions in time for the later meeting and site visit. From the obtained information a data-package for each installation was produced that was approved by the company concerned and checked for commercial secrets, patent information etc. It was sometimes necessary to make special secrecy agreements. For the definition of the study object it was decided that all connected pipework up to the second independent shut off valve should belong to the object of study. Transport containers coupled to the installation for loading were assumed not to cause failures (otherwise all different types of such transport containers would have to be considered). Only such accidents that were typical for the hazardous material concerned were considered. Simple accidents caused by mechanical failures, falls etc. were excluded. Accidents which could cause less than two casualties among the employees or the population were excluded. Meteorological data obtained from Zestienhoven and Hoek van Holland were analysed and a manageable number of sets of weather conditions (combinations of wind speed and stability) were selected as being representative of the full range of meteorological conditions. For the calculation of the overall effect on the population, a data base was needed on the distribution of population around the plants being studied. A grid was constructed showing the population over an area of approximately 75 square kilometres, covering the whole of the Rijnmond area. Daytime and nighttime populations were estimated for each 500 m. square, from various sources of information. For the specific plants under study, information was obtained on the numbers of employees for day and night operation and also their locations. 2.3. Identification of failure scenarios

The failure cases selected for consideration in the study were identified by two distinctly different methods. The first of these is the so-called "checklist" method in which failure cases are derived from a procedure which is based on a knowledge of actual previous accidents. This list includes elements like: sudden and complete rupture of pressure vessels; "guillotine-breakage" of pipework connected to vessels;

325
small holes or cracks in piping and vessels; flange leaks; leaks from pump glands and similar seals. For each piece of equipment included in the study, the possibility of each type of event on the checklist occurring was considered. The second method used was the Hazard and Operability Study, using a technique that is assumed to be well known and therefore will not be discussed here. The "checklist" method was the one which was most cost effective in generating realistic failure cases. Some 95% of all of the failure cases used in this study were found in this way. The more systematic method of Hazard and Operability Study only provided a "polishing" effect as far as failure case identification was concerned. The low extra results of the operability studies over the checklist method can be explained as follows. In this assessment only releases with relatively extensive effects are of interest. These can also easily be found with the checklist method. Further the installations were all of a not too complex, well known design, for which often operability studies have been carried out earlier, so that any necessary counter-measures were already included in the design. However, the exercise of working through the H & 0 was very valuable in developing a thorough understanding of the behaviour of the equipment, and in this way it contributed very greatly to the assessment of possible causes of failure cases, especially in the development of fault trees. 2.4. Calculation models for physical phenomena

A report on physical models, prepared for the Labour Directorate by TNO, the so called "Yellow Book", was to be used as a reference. However, the best available and most realistic models were to be used, but an extensive justification had to be given when the proposed model differed from the reference. After long discussions between experts on physical modelling, models for all relevant phenomena were chosen, although uncertainty and a difference of opinion for some models remained. This was the case for example for the vapour cloud explosion model. The piston-type model of TNO was used with the following deviations. It was assumed that unconfined vapour clouds of ammonia and LNG do not explode. It was agreed that unconfined vapour clouds after normal ignition (not by a detonator) will not detonate but deflagrate, with the possible exception of a few very reactive gases such as acetylene. In confined spaces all combustible gases may detonate. There was no agreement on the explosion model itself to be used. Some experts preferred the so called correlation model, which is also given in the TNO "Yellow Book".There also remained a difference in view whether liquid flow or fully developed two phase flow would occur from a hole in the liquid phase of a containment for liquefied gases under pressure.

326 Further there were some uncertainties about the dispersion of heavy gases, and about the initial mixing rate, of ammonia with ambient air, forming a cloud. 2.5. Calculation of consequences

The consequences calculated in this study are expressed as number of fatalities. Regarding the nature of the hazardous materials involved and the possible failure scenarios it was considered that people might die as a result of inhalation of toxic gases, by blast waves or in demolished buildings or due to fire radiation. For toxic gas exposure use was made of probit relations, giving the percentage of people killed at a given concentration level and exposure time. For ammonia and chlorine the data of US Coast Guard (Vulnerability Model CG-D-137-75) were used. First at the end of the project we realised that these data are rather pessimistic whereas it was the intention to use realistic data and models at each stage of the calculations. For all toxic gases the toxic load giving 50% mortality was calculated (LTL50). It was assumed that all of the people exposed to more than the LTL50 value would be killed, and none of those exposed to lower values than the LTL50. Because the neglect of people above the LTL50 who survive is balanced by the people who are killed below the LTL50, this approximation is not unreasonable, but it does introduce an error which will be larger if the population distribution is non-uniform in the region of the LTL50. It has been noted that a degree of protection from toxic effects has been afforded by remaining indoors during the passage of a gas cloud, particularly in an airtight building where the rate of inleakage of the gas can be very slow. The concentration of the gas indoors is found by time-averaging the infiltration rate of the gas into the building. The indoor toxic load can then be calculated by the same method as the outdoor load. In the case of an explosion, there will be some degree of expansion of the flammable cloud, and if the cloud is wide but shallow, as is usually the case, this expansion will be preferentially in the vertical direction. An allowance is however made for some degree of horizontal expansion and for the estimation of numbers of fatalities it is assumed that all people inside the expanded cloud would be killed. Additionally, people outside the expanded cloud but inside buildings that would be demolished by the blast wave are assumed to be killed. For all types of flash fire, it was assumed that any person who was within the region covered by the flammable cloud at the moment of ignition would be killed. For calculating the consequences of fire radiation again use was made of the threshold data in the Vulnerability model.
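A sketch of the probit treatment of toxic exposure described above is given below; the probit constants and the simple indoor-infiltration average are placeholders and do not reproduce the Vulnerability Model (CG-D-137-75) values actually used in the study.

```python
# Probit / toxic-load sketch with placeholder constants.
import math

def probit_fatality_fraction(conc_ppm, minutes, a=-9.8, b=0.71, n=2.0):
    """Pr = a + b*ln(C^n * t); fraction killed = Phi(Pr - 5)."""
    pr = a + b * math.log((conc_ppm ** n) * minutes)
    return 0.5 * (1.0 + math.erf((pr - 5.0) / math.sqrt(2.0)))

def indoor_concentration(outdoor_ppm, air_changes_per_hour, minutes):
    """Time-averaged indoor concentration for a constant outdoor cloud."""
    lam = air_changes_per_hour / 60.0
    return outdoor_ppm * (1.0 - (1.0 - math.exp(-lam * minutes)) / (lam * minutes))

outdoor = 1500.0   # ppm, assumed cloud concentration
t = 30.0           # minutes of exposure
print(f"outdoor fatality fraction: {probit_fatality_fraction(outdoor, t):.2f}")
indoor = indoor_concentration(outdoor, air_changes_per_hour=0.5, minutes=t)
print(f"indoor concentration: {indoor:.0f} ppm, "
      f"fatality fraction: {probit_fatality_fraction(indoor, t):.2f}")
```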

327
The consequence calculations have been repeated for both day-time and nighttime population distribution and for six weather types (characterised by atmospheric stability and wind speed) and twelve wind directions (sectors of 30 degrees intervals) each. So each postulated failure scenario has 144 possible consequences, each with its own probability. 2.6. Calculation of frequencies

One of the techniques available for probability analyses, the Fault Tree Method, involves very laborious procedures which would only be justified on critical sections of a plant in which the integrity of the containment depends on the reliability of a complex control system. Therefore this study made use of a combination of this technique with simpler ones which are practical for application over all the failure cases associated with the whole plant. These simpler techniques involve the extraction of failure rate statistics from the available data banks and other published results, and (where relevant) the modification of these statistics by appropriate multiplicative or other correction factors to allow for any differences between the circumstances of the actual piece of equipment under study and that which pertained in generating the statistics. This hybrid method (a mixture of fault tree and data bank approaches) was found to be both practical and efficient. The advantage of the fault tree method was that it helped in identifying possible remedial measures and the likely benefit accruing from them. The data bank method was highly efficient in use, but care had to be taken in defining the failure cases consistently, in order to avoid "overlap" of cases or "gaps" between cases. The frequency of each failure case is in itself not the whole of the probability question, because each failure case may be associated with a variety of ultimate damage levels, due to such factors as operator intervention, wind direction, presence of ignition sources, time of day, etc. These factors each have probabilities associated with them, and so the analysis must be carried through each of the possible outcomes corresponding to every permutation of the factors. This yields an extensive set of damage/frequency pairs for each failure case, and these constitute the principal output of the analyses, at it most detailed and comprehensive level. To evaluate the consequences of each release case, the following general procedure was used: For each windspeed/stability combination, calculate the area affected by the release. For each direction, calculate the probability of occurrence of this speed/stability/direction combination both by day and by night. Calculate the number of people affected by the release by day and by night. Repeat the procedure above for all weather cases and generate a list of scenarios, each characterised by a probability and some specified consequences.
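The scenario bookkeeping described in this procedure can be sketched as follows; the weather-class probabilities, wind rose, release frequency and consequence numbers are invented for the illustration and a real dispersion/consequence calculation would replace the placeholder function.

```python
# Sketch: expand a release case over weather classes and wind sectors, then
# average the consequences with the corresponding probabilities.
weather_classes = {"D5": 0.55, "F2": 0.30, "B3": 0.15}            # P(stability/speed)
wind_sectors = {f"{30*i:03d}deg": 1.0 / 12.0 for i in range(12)}  # uniform wind rose
release_frequency = 3.0e-5                                        # events per year

def fatalities(weather, sector):
    """Placeholder consequence model; a dispersion calculation goes here."""
    base = {"D5": 4.0, "F2": 12.0, "B3": 1.0}[weather]
    populated = {"090deg", "120deg", "150deg"}                    # sectors towards housing
    return base if sector in populated else 0.2 * base

scenarios = []
for w, p_w in weather_classes.items():
    for s, p_s in wind_sectors.items():
        scenarios.append((release_frequency * p_w * p_s, fatalities(w, s)))

expected_fatalities_per_year = sum(f * n for f, n in scenarios)
print(f"{len(scenarios)} scenarios, "
      f"expected fatalities per year = {expected_fatalities_per_year:.2e}")
```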

328 The consequences can then be averaged, weighting them according to their corresponding probability. Finally, the contribution from each failure case to the overall rates of fatalities and injuries can be summed. Presentation of results

2.7.

It was considered that three basic forms of presentation of the risks were necessary, because they illustrate different aspects of the risks. The tables with failure scenarios form a way of presentation which is much more compact and conveniently arranged than a separate presentation of the results of the large number of possible consequence assessments associated with each single failure scenario. The deviation of the most extreme results with respect to the average result of each failure scenario presented in the tables is less than the total inaccuracy of the end results. The tables give insight into the relative contributions of parts of the installation to the total risk of the installation and allow comparison with accident statistics. (See table 2, showing a part of the failure scenarios for one study object). The cumulative frequency curves give an impression of the group risks and allow comparison with similar plots for other activities (See figure 2 ) . The individual risk contours give a good impression of the contribution of the installation to the individual risk. This.can be compared with other contributions to the individual risk such as transport of hazardous materials on a road nearby or the individual risk of accidents at home. A further advantage is the independency of these risk contours from the population distributions around the plant. This allows comparisons with other installations. (See figure 3.) 2.8. Reliability of the results

The reliability of the results depends on the correctness and accuracy of the models and on the uncertainty about the input parameters, assumptions and failure rates used. Because of the very large number of parameters involved in this study, and the large uncertainty about many of them, it was impractical to carry out a rigorous accuracy analysis on the whole calculation. Ideally, a Monte Carlo simulation should have been done over the whole study, but this was obviously not feasible. In just one case, the fault trees, a Monte Carlo method was used, and this indicated that although the base events in a tree might include several with large uncertainty ranges, the uncertainty range of the top event value was not excessive, and the "Z" value (ratio of 95 percentile points) lay generally within the range of "Z" values of the individual basic events. Ideally, in order to examine the accuracy of a physical model, it would be taken in its entirety and its performance tested on actual observations made in experimental or live situations.

Table 2. Accident scenarios and consequences (extract for one study object). For each failure mode (e.g. catastrophic burst of a full or half-full vessel, splits below or above the liquid level at different pressure differences, full-bore fracture of a connection at horizontal or downward angles), the table lists the release of material (mass flow and duration, or instantaneous mass), the frequency (events per year), the average number of fatalities per year for employees and for the population, the average individual chance of being killed per calendar year for employees, and typical hazard distances (in m) from the source for explosion, fire and toxic load effects, together with the totals per year.
Notes: (a) not relevant for population; (b) major structural damage assumed fatal to people inside structures (Δp = 0.3 bar); (c) repairable damage, pressure vessels remain intact, light structures collapse (Δp = 0.1 bar); (d) window breakage, possibly causing some injuries (Δp = 0.03 bar).

Figure 2. Cumulative frequency curves (frequency per year versus number of fatalities per event).

Figure 3. Individual risk contours.

These observations should not be those from which the model parameters were estimated, and should preferably be drawn from a number of different situations. In practice, this approach is usually not practicable. This may be because:
- the necessary experiments or observations do not exist, or
- although they do exist, they are in respect of a much smaller-scale incident or experiment than that for which the model is to be used, and are therefore not a reliable guide to accuracy, or
- the observations are only partial, in the sense that not all the data required as input to the model are available, or
- the observations are only partial, in the sense that the physical situation resulting from the experiment or incident is only partly recorded, so that the consequences in terms of the criteria which are important here, i.e. expected numbers of fatalities, cannot be properly calculated.

The approach which has been adopted for assessing uncertainty in the physical models has therefore been to take each physical model as it stands, to enumerate the main sources of error in that model, and then to try to assess what influence such errors might have on the final results. In many cases, this involves taking empirical parameters in the models and placing a range of values on each parameter within which the "true" value is thought almost certainly to lie. For convenience a specific accident scenario has been chosen as the reference case for most of the sensitivity analyses; the effects of changing parameter assumptions are thus presented in terms of the change in the estimated numbers of fatalities. The sensitivity analysis showed that changing model parameters caused a maximum variation in final outcomes of 40% for several models and lower percentages for other models. The overall uncertainty in the chain of physical models appeared to be approximately one order of magnitude.

The base failure rates used were obtained from literature or special data banks, and where possible were based on a statistically significant population of events. For each equipment item the mode and severity of failure was identified, together with a range of failure rates derived from the open literature. These failure rates however showed a variation of one order, and in some cases even two orders, of magnitude. As mentioned before, this has not led to an even more excessive uncertainty in the probabilities of the top events of the fault trees. In general, an uncertainty of somewhat more than one order of magnitude was reported for the top event probabilities. Possible inaccuracies in weather class distributions and the population distributions were believed to be of minor importance.

For the final results, expressed as casualties per year, in general an uncertainty of approximately one order of magnitude was found. This can be illustrated by comparing the Mortality Index (number of fatalities per tonne released) for real incidents with predictions from this study (see figure 4). It appeared that the predictions give somewhat higher results (probably due to the use of pessimistic toxicity data) than the real incidents, but not in an excessive way.
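A schematic of the one-at-a-time sensitivity check described above might look as follows; fatality_model and the parameter ranges are stand-ins for the real chain of physical models and their empirical parameters.

```python
# One-at-a-time sensitivity analysis around a reference accident scenario.
# fatality_model is a placeholder for the chain of discharge/dispersion/effect models.

def fatality_model(params):
    # Illustrative only: fatalities grow with release size and toxicity,
    # and shrink with effective dilution.
    return params["mass_t"] * params["toxicity"] / params["dilution"]

reference = {"mass_t": 50.0, "toxicity": 2.0, "dilution": 10.0}
ranges = {"toxicity": (1.0, 3.0), "dilution": (7.0, 15.0)}   # "almost certain" bounds

base = fatality_model(reference)
for name, (lo, hi) in ranges.items():
    outcomes = [fatality_model(dict(reference, **{name: v})) for v in (lo, hi)]
    spread = (max(outcomes) - min(outcomes)) / base
    print(f"{name}: relative variation in estimated fatalities = {spread:.0%}")
```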

Figure 4. Mortality index as a function of spill size for reported and predicted chlorine incidents (spill size in tonnes of chlorine; real incidents versus predictions).

Fig. 5. Flow chart for consequence calculations (clustered EDFs).

2.9. Main conclusions from the pilot study

The most important conclusions drawn from this study are summarised below.

1. The reliability of the assessments of the consequences and the probabilities of potential accidents with the installations is limited. The consequences were calculated with an uncertainty of approximately one order of magnitude. The assessed probabilities have an uncertainty of one and in some cases even two orders of magnitude. The uncertainty in the final results presented, expressed in the number of casualties per year, is estimated to be approximately one order of magnitude. A general conclusion is that the consequences can be assessed with more certainty and reliability than the probabilities. Only this limited accuracy could be obtained in spite of the great efforts to use the best data and best methods available.

2. During this study several gaps in knowledge were identified in the field of the modelling of physical phenomena, toxicity models and base failure rate data. In some cases the experts could not agree on what was the best model to use. This was mainly the case with models for the following phenomena: explosion of flat combustible vapour clouds; liquid or two-phase flow for liquefied gases under pressure; the influence of the topography and the microclimate on dispersion; and the dispersion of heavy gases.

3. The base result of a risk assessment is a number of accident scenarios, each of them characterised by a probability (expected frequency) of occurrence and a damage pattern. Three forms of presentation are necessary together to show the different aspects of the computed risks: consequence hierarchy tables; cumulative frequency curves; iso-risk contours.

4. The total cost of this study amounts to approximately 2.5 million Dutch guilders. The project took more than two and a half years. A more general conclusion is that the execution of a risk assessment costs much time and money.

5. In this study it was shown to be possible to study within the context of one project the risks for both employees and the external population. Large parts of the analysis are identical for both aspects.

6. In this study the operability studies that were made did not contribute much to the identification of the failure scenarios. All important failure scenarios had already been found by the checklist method. The reason for this is that the study was mainly concerned with modern and not too complex storage installations for which there exist worldwide many years of experience, so that potential causes of maloperation have already been foreseen and taken care of.

Another reason is that the relatively large minimum consequence (two or more deaths) to be considered restricted the failure scenarios of interest to relatively large ones. It is therefore doubtful whether operability studies have much use in risk assessment studies of this kind.

7. Because of the high costs on the one side and the limited accuracy of the results on the other side, it is advisable to use the instrument of risk assessment only selectively, for example in cases where existing technical and scientific experience gives too little insight into the risks, such as new technologies, or in cases where different alternative solutions have to be compared.

2.10. Areas for future development

There are many areas in the analysis where further progress would be valuable. They are identified here without extensive discussion.
- Identification of failure cases: a detailed checklist could be agreed.
- Discharge rates: more experiments are required to determine rates for flashing liquids.
- Initial mixing: more experimental work is required to establish mixing rates for jets and instantaneous releases.
- Aerosol formation and rain-out: more experiments are required (these could be done in conjunction with the former point).
- Probabilities of immediate ignition and of explosion of dispersed clouds: the evidence from the historical record is confused, and more could be done to clarify these aspects.
- Failure rate data: a usable set of generic data could be agreed and more work could be done to fill gaps.
- Simplification of the techniques: in order to be able to apply the risk analysis to more installations in a short space of time, the calculation methods could be simplified, but care must be taken that no bias is introduced as a result of the inevitable lack of detail in such an analysis.
- Toxicity criteria: there are significant uncertainties in the critical toxic load values for human populations, and these should be thoroughly reviewed, taking account of all the available observations and using the raw data wherever possible.

3. SIMPLIFIED RISK ANALYSIS

One of the conclusions of the previous study was that a detailed risk assessment is too expensive.

Fig. 6. Overall flow chart for summarisation routine.

In 1981, a study on low-cost risk analysis methods was started. The overall objective of this study was to review and identify the characteristics of possible analytical methods which could be used to generate economically a reasonably accurate picture of the total impact of major hazards to which the population is exposed as a result of fixed industrial activities. This could be done either by examining the possibilities of simplification of the methods and results of previous studies or by the use of fundamentally new methods. Three methods were reported.

The first method is the Simplified Classical Method (SCM), which has the same procedure as the full classical risk analysis, with the following simplifications:
- the presence of safety systems is allowed for, but not the detailed data on how those systems work;
- the number of failure scenarios is limited; the remaining scenarios should of course be representative of all risks to be assessed. E.g. for containment failure this should include: complete failure of the vessel (storage tank, process vessel); partial failure of the vessel (due to overflow, tank fire, etc.); guillotine breakage in main pipework; small leak in main pipework (flange leak, puncture);
- all calculations are standardised and automated.

In this way savings are achieved on collecting input data, though a full description of the installation is necessary, and on the amount of calculational work to be done. To identify the representative failure scenarios (EDF = equivalent discrete failure cases) two methods are indicated, the Clustered EDF and the Standard EDF. Using the CEDF, possible failure scenarios with about the same location and effects are grouped together and replaced by one scenario with a probability equal to the sum of the probabilities of all the scenarios in the cluster. Actually the same is done when carrying out a full risk analysis, because in reality one has to deal with a continuum of possible scenarios, which are more or less grouped together to keep the analysis manageable; the CEDF simply carries this grouping as far as possible. Using the SEDF, a number of standardised failure scenarios are defined in advance, e.g. the release of 1, 10 or 100 tonnes of material, together with their consequences, and the real scenarios of the installation are classed as well as possible under one of these standard scenarios.

The second method is the Parametric Correlation Method (PCM). The properties of the installation (type of hazardous material, contents, pressure, temperature, lengths of pipes, etc.) are characterised by a number of numerical parameters. It is assumed that these parameters correlate with a number of parameters characterising the risks of the installation.

The risks can then be calculated by converting the input parameters into the output parameters with the help of a number of more or less complicated correlation functions. The correlation functions must be derived from full risk analyses or can be established by experts, e.g. by using the Delphi method.

The third method is the Simplified Parametric Method (SPM). The SPM has only two input parameters and two output parameters. One output parameter is R(0), a measure of the risk in the centre of the plant. R(0) depends on the input parameter U, a measure of the unreliability of the plant. The other output parameter is D-max, a measure of the maximum consequences of the plant. D-max depends on the input parameter H, a measure of the hazard potential of the plant. It is not yet clear how U and H and the relation between input and output parameters should be established. To do this, further development is necessary.

Conclusions drawn from this study are:
- Preference should be given to the SCM because of the analogy with the full classical method and because the different steps of the analysis are easier to follow and control.
- The PCM and SPM will not be developed further because there are doubts whether these methods can be brought into operation in a satisfactory way, since it is difficult to correlate the input and output parameters.
- The SCM appeared to be not much less accurate than the full classical method.
- The amount of effort needed for the SCM is one tenth of that for the full classical method.
- If the SCM is to be developed into a computer program, this should be done on a national level because of the high costs.
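Returning to the clustered-EDF idea described earlier, a rough sketch of the grouping step is given below; the scenario list and the choice of release-rate bins as the similarity criterion are illustrative assumptions, not part of the methods described in the text.

```python
# Clustered EDF (CEDF) sketch: group failure scenarios with similar effects and
# replace each group by one scenario carrying the summed frequency.
import math
from collections import defaultdict

scenarios = [
    # (identifier, frequency per year, release rate in kg/s) -- invented examples
    ("flange leak A", 1e-3, 2.0),
    ("flange leak B", 8e-4, 2.6),
    ("pipe puncture", 5e-4, 9.0),
    ("guillotine break", 3e-6, 120.0),
]

def effect_bin(release_rate):
    """Scenarios falling in the same logarithmic release-rate bin are clustered."""
    return round(math.log10(release_rate), 0)

clusters = defaultdict(list)
for name, freq, rate in scenarios:
    clusters[effect_bin(rate)].append((name, freq, rate))

for members in clusters.values():
    total_freq = sum(f for _, f, _ in members)
    avg_rate = sum(f * r for _, f, r in members) / total_freq   # frequency-weighted
    print(f"CEDF: {total_freq:.1e} /yr, {avg_rate:.1f} kg/s  <- {[m[0] for m in members]}")
```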

4. COMPUTERISED RISK ASSESSMENT

Following the last conclusion of the previous chapter, the Ministry of Public Health and Environmental Protection, in cooperation with Rijnmond, commissioned the development of the SCM into a fully computerised risk assessment technique, which is now known as the SAFETI package. The SAFETI package is an integrated suite of programs to assess the risks associated with a chemical or petrochemical processing plant and present the results either as F-N curves or as 'risk contours'. These two measures of risk, taken together, can be used to give a balanced assessment of the risk from an installation both to the population at large (F-N curves) and to the individual (risk contours). The package contains programs to create the plant database file, to generate a fully representative set of possible hazardous events which might occur, to assess the extent of the effects of such events and finally to apply these effects to the surrounding population.

In order to provide as much information as possible, the package is designed to allow the analyst to examine the various files of data which are produced during the analysis in as much detail as is required. Furthermore, the various calculation methods and physical 'models' (of such phenomena as explosions, vapour cloud dispersion etc.) can, in principle, be checked directly by the analyst. In practice, the number of calculations in any study of even a moderate size is enormous and it would only be possible to check a very small proportion of them. However, all the models and calculation sequences in the package have been extensively tested both individually and in combination.

The process starts with the generation of the 'Plant File' which holds details of all the vessels and connecting pipework in the plant. Next, failure cases are generated in three stages:
a. all pipework failures are generated automatically;
b. vessel failures are generated interactively, allowing the analyst to vary the precise conditions of the releases;
c. there is a facility for the analyst to specify other hazardous events which may be liable to occur in relation to the plant but do not arise directly from pipework or vessels.

In the range of failure cases thus produced, it is likely that there will be a number of groups of cases with very similar physical characteristics which will, therefore, produce very similar consequences. To speed up processing, there is a facility to 'cluster' these groups of failure cases into 'average' cases. The file of these average cases is then processed in place of the file containing each case in detail. This process clearly loses some precision in the results and is optional.

The set of failure cases thus generated is processed by a consequence analysis program to produce a consequence file which contains such parameters as radiation radii for early ignition of flammable gases, dense cloud dispersion profiles and associated flammable masses for late ignition, and toxic effect probabilities as appropriate. Finally, these consequence parameters are combined to produce risk contours or, by applying them to a file holding population density, F-N curves.

In addition to the main data files mentioned above, details are held, both for day and for night, of population distribution, ignition source distribution and meteorological conditions, and there is also a file holding failure frequencies for pipework, vessels and valves. All these files may be amended and updated by the analyst as required. Figures 5 and 6 show the main logical flow of the package.
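The F-N presentation mentioned above can be derived from the scenario list in a few lines; the sketch below builds the cumulative frequency of N or more fatalities from hypothetical (frequency, fatalities) pairs.

```python
# Build an F-N curve (cumulative frequency of N or more fatalities) from the
# (frequency, fatalities) pairs produced by a consequence analysis.
# The scenario list is illustrative.

scenarios = [(1e-4, 2), (5e-5, 8), (1e-5, 30), (8e-7, 120), (3e-8, 900)]

def fn_curve(scenarios):
    points = []
    for n in sorted({fat for _, fat in scenarios}):
        f = sum(freq for freq, fat in scenarios if fat >= n)
        points.append((n, f))
    return points

for n, f in fn_curve(scenarios):
    print(f"N >= {n:4d}: F = {f:.2e} per year")
```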

5. TRANSPORTATION OF LIQUID AMMONIA AND CHLORINE

The SAFETI package has been used by the Dutch government for several risk assessment studies. Recently in Rijnmond a study has been carried out to assess the risks of transportation of ammonia and chlorine, both gases being transported in great quantities in the Rijnmond area, by sea-going vessels, inland waterways, rail, road and pipeline.


Figure 7. F-N curves for Cl2 and NH3 transportation (chlorine / ammonia / combined); frequency per year versus number of fatalities (N).

Figure 8. Individual risk contours for ammonia and chlorine transport, all modes.
Ammonia and chlorine were selected because they were considered to be representative for the transport of toxic materials in Rijnmond. In this study the SAFETI package was also applied. Although the SAFETI package was designed for tackling problems relating to industrial plants in a relatively restricted location (the plant site), it was in principle easy to adapt the use of this package to problems where transportation is involved. The nature of transportation is such that the hazard generator ('plant') is spread out over the transportation routes. Some slight modifications to the package were required to handle this type of situation, but the overall approach to inputting data on the hazard generation source is the same as for a conventional plant. This could be achieved by splitting up any transportation route into a series of discrete failure points, very roughly every 100 m apart, with corresponding failure frequencies. Assessing these failure frequencies for any failure point has been a major part of the study. For example, for marine transport, the size and speed of the ships, the width and length of the channel, the traffic density, the Rotterdam traffic control system and accident statistics were considered in estimating the failure frequencies. Some new subroutines have been added to the program for phenomena like the spreading and evaporation of ammonia on water. Another major part of the study was an extensive review of the toxicity of ammonia and chlorine. In the report two new probit relations for lethality are proposed:

Ammonia:  Pr = 0.71 ln(C·t) - 9.35
Chlorine: Pr = 0.5 ln(C·t) - 6.5

where C is the concentration of exposure in mg/m3 and t is the duration of exposure in minutes. The overall results of this study are given in figures 7 and 8.
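A probit value Pr is conventionally converted to an expected fraction of deaths through the standard normal distribution, P = Phi(Pr - 5); the short sketch below implements just this conversion. The Pr values fed in are arbitrary examples; in practice they would come from relations such as those quoted above.

```python
# Convert probit values to expected death fractions: P_death = Phi(Pr - 5),
# where Phi is the standard normal cumulative distribution function.
import math

def death_fraction(pr):
    """Expected fraction of the exposed population killed, for probit value pr."""
    return 0.5 * (1.0 + math.erf((pr - 5.0) / math.sqrt(2.0)))

for pr in (2.5, 4.0, 5.0, 6.0, 7.5):
    print(f"Pr = {pr:.1f} -> lethal fraction = {death_fraction(pr):.1%}")
```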

6. ORGANISATION OF THE STUDIES

All the studies mentioned in this paper were carried out under the supervision of the Committee for the Safety of the Population at Large (Dutch abbreviation: COVO). This committee included representatives of the Rijnmond Public Authority; Rijnmond industry; the Labour Directorate and its local Inspectorate; the Ministry of Housing, Physical Planning and Environment and its local Inspectorate; the Province of South Holland; and the Boiler Inspectorate. For all studies, steering committees were appointed by the COVO with representatives of the above-mentioned organisations. The steering committee supervises the entire project, watches the progress of the study and, if necessary, further specifies the aims of the study. The steering committee reports to the COVO, which advises the executive board of the Rijnmond Public Authority. The committee is able to call in experts from all disciplines for special points of discussion. Such a pool of expertise proved to be valuable, although the assimilation of all critical remarks and scientific arguments is time consuming.

7. FINAL REMARKS

Making use of the methodologies developed and the results of the studies carried out in the Rijnmond area and of other studies performed on behalf of the Dutch government (for example a study into the risks of importation, storage, transport and use of LPG and gasoline), and after completion of an inventory of all potentially hazardous objects in Rijnmond, which is still in progress, it will be possible to make an overall picture of all the risks the population in the Rijnmond area is exposed to. For reasons of cost and the time available, it is clear that these risks cannot be assessed in depth, but the main problems can be highlighted. This will be the input for a policy note on industrial safety that will be issued in 1988, which will enable the local authorities to conduct an active safety policy.

8. REFERENCES

Public Authority Rijnmond (1982) Risk Analysis of Six Potentially Hazardous Industrial Objects in the Rijnmond Area. A Pilot Study. D. Reidel, Dordrecht, Holland.
Technica (1981) Study of risk analysis methods. Final report prepared for openbaar lichaam Rijnmond. Technica Ltd., London.
Technica (1984) Report on a computer based system for risk assessment of chemical plant using a simplified classical method. Technica Ltd., London.
Technica (1985) Study into the risks from transportation of liquid chlorine and ammonia in the Rijnmond area. Technica Ltd., London.


STUDY CASES OF PETROLEUM FACILITIES AS COMPARISON BASES FOR DIFFERENT METHODS

J.P. SIGNORET SNEA(P) 64018 PAU CEDEX FRANCE

M. GABORIAUD SNEA(P) 64018 PAU CEDEX FRANCE

A. LEROY TOTAL CFP Cedex 47 92069 PARIS LA DEFENSE FRANCE

ABSTRACT. At the beginning of 1981, ELF AQUITAINE (Production) and TOTAL CFP decided to work in association in order to adapt (and develop) risk / reliability analysis methods and means to the specific problems encountered in the petroleum industry. Since 1981, many risk / reliability studies have been completed within the above framework. This paper describes two of these studies. The first one is concerned with the probabilistic calculation of the production of a subsea oil and gas production cluster, and the second one with the modelling of a drilling procedure. These typical activities in hydrocarbon exploration-production have been used as a basis to compare different methods: analytical, Markov and Petri net modelling for the first one; cause-consequence diagrams, Markov and Petri net modelling for the second one. The advantages and disadvantages of the related methods in handling these specific problems have been pointed out. As a consequence, general purpose computer codes for Markov processes (MARK SMP) and stochastic Petri nets (MOCA-RP) have been developed within the study. As the probabilistic calculation of production leads to new concepts in the risk / reliability field, it will be discussed in greater detail than the other study.

1. PROBABILISTIC CALCULATION OF THE PRODUCTION OF A SUBSEA PRODUCTION CLUSTER

1.1 Introduction

Small sized oil/gas fields often occur in the neighbourhood of the big fields. Exploitation of such marginal fields cannot be economic unless the main field equipment is used to the maximum. In order to do that, oil companies have started to design subsea production clusters comprising one or several production wells. These clusters are linked to the main production platforms by pipelines and remotely operated from these main platforms.

This is the case in the North Sea, where several projects are treated as experiments (SKULD), scheduled (EAST FRIGG) or already producing (NORTH EAST FRIGG). In order to determine whether such an oil/gas development is economic or not, it is necessary to evaluate, a priori, its expected production. To do that, several parameters have to be taken into account: failure rates, maintenance policy, availability of maintenance rigs and meteo-oceanological conditions. This paper aims to describe and compare three methods which have been developed, in order to solve the above problem, within a joint project called "Safety and Reliability of Petroleum Systems" by TOTAL-CFP, ELF AQUITAINE (Production) and the FRENCH PETROLEUM INSTITUTE (IFP).

1.2 Scope of the problem

As shown on figure n°1, a subsea production cluster comprises a certain number (k) of wells linked to the main platform through a common unit. At all times, the cluster state can be characterized by the hydrocarbon flow sent to the main platform. This flow depends, of course, on the number of available wells. Thus, between the perfect state where all the k wells are producing and the completely failed state where no well is producing, there are several intermediate states where some wells are producing and other wells are failed. These states cannot be considered as perfect nor as failed because they produce a certain amount of hydrocarbon which cannot be neglected. Therefore the classical reliability modelling which sorts the states into two classes "good/bad" or "work/don't work" cannot be used here and "multi-state" modelling has to be considered.

As a subsea production cluster is located on the sea bottom, repair of a failed component needs heavy maintenance facilities such as a maintenance vessel (rig), which has to be mobilized before being brought onto location. In addition, when the cluster is located within a hostile meteo-oceanological environment (like the North Sea), it is impossible to undertake any intervention activities in the winter season; failures occurring during this period have to wait for good weather to return before being repaired. From this point of view, the system has several functioning phases and "multi-phase" modelling is then needed.

The above considerations have led us to introduce a parameter that we have called the "Probabilistic Production Index (PPI)" in order to characterize the functioning of a cluster over a given period of time. This parameter is equal to the expected value of the cluster production divided by the maximum possible production over the period taken under consideration. In order to evaluate the PPI, we have developed three methods based on entirely different modelling principles:
- a specific analytical model,
- a Markovian model,
- Monte Carlo simulation based on Petri net modelling.
We will now describe quickly each of the three above methods.

Figure n°1: Scheme of a k-wells production cluster (wells W1...Wk with productions Prod1...Prodk feeding a central unit; whole cluster production = sum of the Prodi).

Figure n°2: Bad and good weather periods (1-year intervals; repair impossible during bad weather, possible during good weather).

1.3 Parameters

The main parameters needed in order to describe the cluster are the following:
- λ : single well failure rate
- λc : common unit failure rate
- mean time needed for rig mobilization
- mean time to repair
- good weather duration
- bad weather duration
- production per unit of time of the whole cluster

The whole cluster failure rate is then kλ + λc.

On the other hand, as the repair duration depends mainly on certain operations like module retrieval rather than the nature of the failure itself, we have only introduced a single parameter , which will be used as MTTR for every failure. 1.4 Specific analytical method The first method that we have developed is based on the elaboration of specific analytical formulae. These formulae are quite complex, only the main concepts are given in the paper. As shown on figure n2, the time is composed of a serie of inter vals having the same duration (1 year). For the purpose of modelling, each interval has been divided in three phases (Cf. figure n3). Phase nl : No repair possible ; in case of cluster failure the rig mobilization begins at in order for it to be onto location at the beginning of good weather period ; repair then ends at the time + . Phase n2 : Repair is possible and if a failure occurs during this phase, it will be repaired before the bad weather period returns. Phase n3 : It is too late to mobilize a maintenance rig because repair would not be achieved before the beginning of the bad weather period. A failure occuring during this phase has to wait the next good weather period in order to be repaired. Let us consider the diagram shown on figure n4 where E represents the perfect cluster state and E. the cluster states where the well nj is producing. This diagram shows that it is possible to calculate the probabilities (M.,., P.,,) of the above states at the end of one inter val if the probability M ) of state E is known at the beginning of this interval. Of course the parameters a, b, e, d, e have to be calculated but they are the same for all the intervals. 1.4.1 Bad weather period. For well j to be producing during this period, it has to be in good state at the beginning of the period after which no failure must occur on the well itself nor on the common unit :

Pj(t) = (M + Pj) · exp(-(λc + λ) t)

Figure n°3: Phases for the analytical model (subdivision of the 1-year interval).

Figure n°4: Transition diagram for the analytical model.

350 1.4.2 Good weather period. In order to find the proability that a given well will be producing at time t of the good weather period, the formula is far more difficult to establish than that above. Let us introduce Q(t) to be the probability that a given well produces at time t given the cluster was in perfect state at time zero. Figure n4 shows that there are two times of interest ( and +) where it is easy to determine the probability that the whole cluster is in perfect state. Then : P(t) = M . a . Q(tiM)) + (1M^a) . Q(t<|>6) 1.4.3 PPI evaluation. From the above formulae it is possible by an analytical integration to find a formula giving for a given well the mean cumulative time spent in a producing state. After that it is very easy to ascertain the production Prod, of this well during interval number i. As all the k wells are similar, the probabilistic production index of the whole cluster can be calculated, for a period of intervals (that is years), by the next formula :

PPI = k · Σ(i=1..n) Prod_i / (maximum possible cluster production over the n intervals)

It has not been possible to describe all the formulae involved in this short paper but what has to be kept in mind is that it is possible to handle the whole problem in an analytical way : this has led to a small'computation code which happens to be very efficient. 1.5 Markovian model For the analytical model, we considered that the repair and rig mobili zation parameters were constant values. This hypothesis does not work when handling Markov process modelling : accordingly we took into account failure and mobilization rates such as u = 1/, = 1/. As in the case of the analytical model, we were led to introduce several functioning phases : four were selected because of the exponen tial character of repair and mobilization events. Figure ne5 shows these four phases : Phase nl : No repair possible and there is no interest in begin ning mobilization of the rig, which would be ready too early (i.e. before good weather returns). Phase n2 : No repair possible but if a failure has occured, the rig mobilization can be initiated in order to have the rig ready at the beginning of the good weather period. Phase n"3 : Good weather period ; repair is possible. Phase n4 : End of good weather period ; it is possible to complete a repair in progress but it is not advisable to begin a new repair because the bad weather period is too close. 1.5.1 Markov diagrams. For each phase a Markov diagram was developed :

Figure n°5: Markov model phases (rig mobilization delay and mean time to repair within the 1-year interval).

Figure n°6: Multi-phases Markov model.

352 in order to clarify the demonstration, figure n6 shows the diagrams related to a cluster of four wells. Phase nl : In this phase there are 5 states to be considered (E4 to EO), each corresponding to a given number of producing wells. As no repair is possible, this diagram takes into account only the wells and common unit failure rates. Phase n2 : In addition to the five above states, four other states (W3 to WO) have to be introduced in order to model the fact that the maintenance rig can be ready before the good weather period. In that case it has to wait for good weather period before beginning the repair. Phase n3 : In this phase, the five states defined at phase nl level are still possible, but as repair is possible, the four additional states of phase n2 have disappeared and are replaced by the repair state (R). Phase n4 : This phase is very similar to the phase n3, but the transitions corresponding to the initiation of a new repair have been suppressed. After determining the various graphs for each phase, it is necessary to introduce how these phases are linked together, i.e. how the states at the end of phase ni give the states at the beginning of phase ni+l : . State Ej at the end of one phase gives state Ej at the beginning of the next phase. . State Wj at the end of phase 2 gives state R at- the beginning of phase 3. . State R at the end of phase 3 gives state R at the beginning of phase 4. . State R at the end of phase 4 gives state Eo at the beginning of phase 1. Then it is possible to make a calculation, phase by phase, and for the number of intervals needed as soon as initial conditions are given for time zero : the probabilities associated with the states at the end of one phase are used as initial conditions for the next phase. 1.5.2 PPI evaluation. 1.5.2.1 Mean cumulated sojourn time. As in the case of the analytical model, it is possible to compute the mean cumulated sojourn time spent in a given state from the probability of being in this state :

Tk(t) = ∫(0 to t) Pk(x) dx
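As a numerical illustration of this quantity, the sketch below integrates a phase-1 (no repair) diagram for a four-well cluster with a simple Euler scheme and accumulates the sojourn times; the rates, duration and step size are illustrative values, not the study's data, and the state labels follow the E4...E0 convention used in the text.

```python
# Phase 1 (no repair possible) of a four-well cluster: from state Ej (j wells
# producing) a well failure (rate j*lam) leads to E(j-1) and a common-unit
# failure (rate lam_c) leads directly to E0.  A simple Euler scheme gives the
# state probabilities Pk(t) and the cumulated sojourn times Tk(t).

lam, lam_c = 2.0e-5, 2.0e-5          # per hour (illustrative)
duration, dt = 4000.0, 1.0           # hours

P = [0.0, 0.0, 0.0, 0.0, 1.0]        # P[j] = probability of Ej; start in E4
T = [0.0] * 5                        # cumulated sojourn times per state

for _ in range(int(duration / dt)):
    dP = [0.0] * 5
    for j in range(1, 5):
        out = (j * lam + lam_c) * P[j]
        dP[j] -= out
        dP[j - 1] += j * lam * P[j]  # one well lost
        dP[0] += lam_c * P[j]        # common-unit failure stops the whole cluster
    for j in range(5):
        T[j] += P[j] * dt            # Tk(t) = integral of Pk
        P[j] += dP[j] * dt

for j in range(4, -1, -1):
    print(f"E{j}: P = {P[j]:.4f}   cumulated sojourn time = {T[j]:8.1f} h")
```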

1.5.2.2 States efficiency. Let us consider that the production of the cluster is proportional to the number of producing wells (that is each well, when not failed, produces the same amount of hydrocarbons). If we give an efficiency of 1 for the perfect state, and an efficiency of 0 for the completely failed state, it seems to be correct to give an efficiency lower than 1 but greater than zero for the others states. These considerations lead to definition of the state efficiency as the

number of producing wells divided by the number of wells of the cluster. Then the states E3 or W3, for which 3 wells are producing, have an efficiency of 3/4 = 0.75, the states E2 or W2 have an efficiency of 0.5, etc.

1.5.2.3 Induction related to states. Let us consider a given state k for which the mean cumulative sojourn time is Tk(t) and the efficiency is Vk; then Tk(t)·Vk·a is the expected value of the amount of hydrocarbons produced by this state for the period (0,t).

1.5.2.4 Cluster production. The cluster production is obviously equal to the sum of the production given by the various states within the period of interest (n intervals of 4 phases):

Production = a · Σ(i=1..n) Σ(j=1..4) Σk Ti,j,k · Vi,j,k

In the above formula, i is the interval, j is the phase within interval i, and k is the state.

1.5.2.5 Cluster efficiency. The maximum possible production over n intervals of four phases is n · (good weather duration + bad weather duration) · a. Then the production as defined above divided by the maximum possible production gives the probabilistic production index:

PPI = [ Σ(i=1..n) Σ(j=1..4) Σk Ti,j,k · Vi,j,k ] / [ n · (good weather duration + bad weather duration) ]
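Once the cumulated sojourn times and the state efficiencies are available, the PPI is a weighted sum; the sketch below uses made-up sojourn times for one year and also shows the 0/1-efficiency variant that yields the mean availability discussed next.

```python
# Efficiency-weighted use of the cumulated sojourn times: with Vk = (producing
# wells)/k the weighted sum gives the PPI; with Vk = 1 for "good" states and
# 0 for "failed" ones the same formula gives the mean availability.
# Sojourn times below are made-up numbers for a 1-year (8760 h) period.

k = 4
sojourn_hours = {4: 7900.0, 3: 500.0, 2: 100.0, 1: 20.0, 0: 240.0}   # per state Ej
total_time = sum(sojourn_hours.values())

efficiency = {j: j / k for j in sojourn_hours}                 # Vk for the PPI
good_or_bad = {j: 1.0 if j > 0 else 0.0 for j in sojourn_hours}

ppi = sum(efficiency[j] * t for j, t in sojourn_hours.items()) / total_time
availability = sum(good_or_bad[j] * t for j, t in sojourn_hours.items()) / total_time

print(f"PPI = {ppi:.3f}   mean availability = {availability:.3f}")
```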

Let us consider now that for classical availability study purpose, we have sorted the state into two classes (good/bad) and have given an efficiency of 1 for good states and an efficiency of 0 for bad states. It is then obvious that the above formula gives what is classicaly called the "mean availability" of the system over a given period of time. Then PPI and mean availability are too particular cases of a more general notion. We have called this one "efficiency" of the system under study. 1.5.3 Markov computer code : MARK SMP. For computing the formulae developed above, it is necessary to have a computer code able : . To process Markov diagrams . To link several Markov diagrams . To compute the mean cumulated sojourn times in the various states. We have accordingly developed such a code. Its name is MARK SMP (SMP stands for Multi Phased Systems). This code is based on an original algorithm which is able to compute the states probabilities and the mean

354 cumulated sojourn time practically by the same calculation, that is avoiding computation of the numerical integrals of the probabilities in order to calculate the mean cumulated sojourn times. 1.6 Monte Carlo simulation At the beginning of the study, we had no general purpose Monte Carlo simulation code. We consequently developed a small code in which the cluster behaviour model was built in the code. As the results were very promising, we decided to formulate a general purpose Monte Carlo simulation code. In order to give the code a good flexibility, we decided to describe the system behaviour by using "interpreted stochastic Petri Nets". Figure n7 shows the Petri net developed for evaluation of the PPI of a four well cluster. This net has been developed in order to fit closely with hypotheses used in the specific analytical model. It is far from the aim of this paper to describe Petri net theory : we therefore will present to you some of the elements in order to show how that works. . The circles, termed "places", are linked through "transitions" by the means of "arrows" (upstream and down stream). At each place we have attached given events. . Cluster states are determined by the "marking" of the Petri net and by the states of various "messages". - the marking consists of the presence or not of "marks" (tokens) in the places, - the various messages can have two states (true or false). . At a given moment, the Petri net is in a given state determined as defined above : - a transition is valid if all the places attached to its upstream arrows are marked by at least one token and if all the messages that have to be received (indicated by a 'question mark"?") are "true". - a transition is not valid if it does not meet the above conditions. . Changing the Petri net state consists of "firing" one of the valid transitions which result in : - substracting one token from the places linked to the fired transition by upstream arrows. - adding one token into the places linked to the fired transition by downstream arrows. - emitting the messages that have to be emitted (indicated by an exclamation mark"!"). . A new state is then reached with other valid transitions and the process can be started again. . When messages are used in order to synchronize transitions firing the Petri net is said to be "interpreted". . When probability laws are attached to transitions, the Petri net is said to be "stochastic". On the above bases, we have developed the MOCA RP code (MOCA stands for Monte Carlo and RP for Petri nets) which is able, in addition, to handle "inhibitor" arrows and "weighted" arrows. For transition probabi-

Figure n°7: Petri net for Monte Carlo simulation (calendar, rig mobilization and cluster behaviour sub-nets, synchronized by the messages D.RIG, M.RIG and RIG.M).

356 lity laws, a large choice Is given (exponential, log normal, Weibull...) comprising the deterministic law which is very useful in order to describe transitions which have to be fired after a fixed delay from the time they become valid. Coming back to figure n7, we can see that the Petri net is divided into three sub-nets : . Calendar on the right of the sheet . Rig mobilization on the middle of the sheet . Cluster behaviour on the left of the sheet. The three sub-nets are synchronized via three messages : . !M.Rig which means that we are within a period of time where we can mobilize the rig. . ID.Rig which means that a first failure has occured and has emitted a "demand" for a maintenance rig. . iRig.M which means the rig is already mobilized. At time t = 0, the state of the cluster is the next one : . Four wells producing (perfect state) => one token in place n"6 . Rig non mobilized => one token in place n c 4 . Bad weather/no mobilization possible => one token in place nl . No demand for a rig => Message "?D.Rig" is false => Message "?Rig.M" is false . Bad weather period => Message "?M.Rig" is false From this state transitions Tl, T6 and T16 are valid. Tl is a deterministic transition which is used only to determine in which phase we are. On the other hand, T6 and T16 which correspond to failures of one well or of the common unit are stochasitc transitions. Then by using random number it is possible to calculate when they are going to occur. Of course the transition which is actually fired first is the one for which the time of occurence is the lowest one. Let us consider, as an example, that T6 is fired first. Then the next events occur : . T16 is inhibited . T9 and T7 become valid . The message "!D.Rig" becomes "true" . Tl is still valid. Let us consider that the second transition to be fired is Tl then we have : . T2 becomes valid . The message "Ml.Rig" becomes"true" . As the messages "!D.Rig" and "!M.Rig" are true, T4 becomes valid. After a delay corresponding to the mobilization of the rig, and if no other transition has been fired in the meanwhile, T4 is fired and the message "IRig.M" becomes true. Then T3 becomes valid and is instantaneously fired in order to begin the repair (a deterministic delay of 0 is attached to T13). And so on... The process goes on until the limit time under interest is reached. This gives a "story" of the system. During the simulation of one story, it is easy to evaluate parameters under interest as, for example, the time spent in given states in order to find the production of the cluster for this story. By generating a certain number of such stories, a sample is obtai-

357 ned which can statistically processed in a very classical way. 1.7 Numerical results - Methods comparison The three methods quickly described above have led to development of computer codes written in BASIC and suitable on small sized computers. Whilst the analytical modelling has led to a very specific code, the MARKOV and MONTE CARLO modellings have led on the contrary to general purpose codes. In order to compare the results given by the three methods, we have studied a four well cluster, the data used coming from an actual case. The quantitative results are shown on table nl. As Monte Carlo simulation is very flexible from probability laws point of view, we have, in addition, compared several cases of probability repair laws : exponential law, Weibull law, Erlang law, and lognormal law. In these computations, we arranged in each case for the expected (mean) time to repair of the laws equal to the deterministic time to repair used for the analytical model. The results obtained are very similar. This shows that the PPI is not very sensitive to the model employed since the same phenomena are taken into consideration in each case. Expected mean values from the laws are far more important than the laws themselves. Then, from a practical point of view, it is better to use the method which is the most easy to handle. The three methods can be compared as follows : . Analytical method : model development is difficult and long but computing time is very short. . Markov model : model development is rather short and computing time is rather short. . Petri net model : model development is rather short and computing time is rather high. Then the Markov model seems to be the best compromise : . It is a general purpose model . It is very flexible . Computing time is short . Multi-state and multi phased systems can be handled. When the system under study is too complex in order to be described by Markov diagrams, then Petri net modelling and Monte Carlo simulation can be successfully used. Several studies that we have completed recently on complex clusters configurations have proven this method to be very efficient. The analytical method, which is not very flexible, can be kept in mind for simple systems or when a great number of computations is scheduled. 1.8 Conclusion This paper has shown that the method to be used in order to compute a parameter concerned with economic issues like the Probabilistic Production Index of a subsea production cluster does not differ basically from those classicaly used for reliability/availability calculations. Thus

Table No 1: PPI of the four-well cluster (λ = λc = 2.28E-5 /h) computed with the analytical model (constant repair), the Monte Carlo model with constant, exponential, Weibull, Erlang and lognormal repair laws, and the Markov model. PPI values obtained: 90.00%, 90.59%, 90.26%, 90.41%, 90.34% and 90.51% for the analytical and Monte Carlo variants, and 89.00% for the Markov model.

Table No 2: Frequencies of the main undesirable events obtained for the drilling study.

Risk | Causes-consequences diagram | Markov diagram
Blow out on floor or at the well head | 4E-6 /h | 2.3E-6 /h
Internal uncontrollable blow out | 10E-6 /h | 2.5E-6 /h
Killing and squeeze impossible | 3.5E-6 /h | 0.4E-6 /h
Well squeezed with plug | 15E-6 /h | 44E-6 /h
TOTAL | 32.5E-6 /h | 49.2E-6 /h

359 the same set of tools and means is able to handle both economical risk and safety. This work has led us to develop two general purpose computation codes : . MARK SMP which is able to handle "multi-state" "multi phased" systems by Markov process modelling. . MOCA-RP which is a Monte Carlo simulation code based on Interpreted Stochastic Petri Net modelling. These two codes which have proved to be very powerful tools, thanks to the major improvements on which they are based, open the way to easily solve many risk analysis problems otherwise difficult to handle.
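To close part 1, a much-reduced sketch of the Monte Carlo "story" generation of section 1.6 is given below; it uses an hourly grid, exponential-like failures, a fixed mobilization delay, a single yearly good-weather window and a single repair that restores the whole cluster, all of which are simplifying assumptions with illustrative parameter values rather than the study's model or data.

```python
# Reduced sketch of one Monte Carlo "story" of a k-well cluster on an hourly grid.
# Wells and the common unit fail at random; after the first failure a rig is
# mobilized (fixed delay) and, once good weather allows, one intervention of
# duration mttr restores the whole cluster.  All parameter values are illustrative.
import random

def one_story(years=5, k=4, lam=2e-5, lam_c=2e-5,
              mobilization=500, mttr=1000, gw_start=2000, gw_end=4000):
    producing, common_ok = [True] * k, True
    rig_ready_at = repair_ends_at = None
    produced = 0.0
    hours = years * 8760
    for h in range(hours):
        in_window = gw_start <= (h % 8760) < gw_end      # good-weather season
        failed = (not common_ok) or (not all(producing))
        if failed and rig_ready_at is None and repair_ends_at is None:
            rig_ready_at = h + mobilization              # call the rig
        if (rig_ready_at is not None and h >= rig_ready_at
                and in_window and repair_ends_at is None):
            repair_ends_at = h + mttr                    # start the intervention
        if repair_ends_at is not None and h >= repair_ends_at:
            producing, common_ok = [True] * k, True      # cluster fully restored
            rig_ready_at = repair_ends_at = None
        for i in range(k):                               # hourly failure draws
            if producing[i] and random.random() < lam:
                producing[i] = False
        if common_ok and random.random() < lam_c:
            common_ok = False
        if common_ok:
            produced += sum(producing) / k               # fraction of full flow
    return produced / hours                              # PPI of this story

stories = [one_story() for _ in range(20)]
print("estimated PPI =", sum(stories) / len(stories))
```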

2. MODELLING A DRILLING PROCEDURE 2.1 Introduction The main risk encountered when drilling a well is a "blow out". Then in 1983 ELF AQUITAINE (Production), TOTAL CFP and INSTITUT FRANCAIS DU PETROLE, within the joint research project described above, decided to study this kind of risk by risk / reliability analysis methodes. SERI-RENAULT automation was chosen as subcontractor to perform the study. This part aims to describe and compare three approaches (MARKOV, PETRI net and cause-consequence diagrams) which have been attempted in order to model such an indsirable event. 2.2 Scope of the problem The drilling of a well is basically a succession of phases (ex 26" drilling, 12"l/4 drilling, 9"5/8 casing, e t c . ) , each comprising a succession of operations which are related to the "drilling procedures" : when operation nn is successfully completed, then operation nn+l is started. Unfortunately in real life some problems are likely to occur at operation nn and in order to cope with the problem an operation n' must be started in place of the operation n+1 normally scheduled. Above problems can be related to geology, meteo oceanologicai conditions, component failures, human error,... In addition, the operation which has to be undertaken is depending on the position of the drilling bit into the well (near bottom or not, above blow out preventers, drill collar into blow out preventers,...). In order to model such a complex system it is necessary to use models able to handle a lot of events and to take into account the sequential aspect of the various operations. Cause-consequence diagram, PETRI nets and MARKOV models were then selected in order to be compared. 2.3 Cause-consequence diagram analysis As cause-consequence diagram is a well known method, it will be very shortly described here. Starting with an "initiating event" (able to lead to an accident if

360 a recovery procedure is not completed properly), the diagram is then built step by step (inductive way) by using YES/NO (success / failure) gates. Each answer defines a branch of the diagram and the various branches define different paths which describe the behaviour of the system under study. Then all the possible evolutions of the system (accident, incident, controlled, uncontrolled, etc... as well as the correct one) are described onto the same diagram. This is, in fact, the "consequence diagram". The causes diagrams (fault trees) can be attached to each YES/NO gate in order to describe how an operation can fail or how an event can occur. Figure n8 shows a part of the cause-consequence diagram derived from the initiating event "kick during 12"1/4 drilling, bit at the bottom of the well". For more clarity fault trees are not drawn and are identified by triangles. This diagram shows clearly that the situation can more or less quickly be under control or result in various risks more or less important as blow out onto the floor, leaks, craterization, etc... Fault trees can be derived from the diagram for each identified risk and processed in standard fashion. 2.4 Petri nets analysis As shown in 1.6 PETRI nets modelling is a very powerful tool in order to represent system in which several processes evolve simultaneously and asynchronously. For modelling the blow out risk when drilling, very big PETRI nets have been built. We have chosen to present here (Cf figure n"9) as an example, the PETRI net related to the elementary task "addition of a single into the drilling string" (a "single" is a set of three drill pipes screwed together). Since the decision to add a single is taken, the next sequence of operations occurs : a) take the single from the pipe rack b) introduce the single into the "mouse-hole" c) seize the drilling string d) loose the kelly bar e) move kelly bar into the single f) screw the kelly bar into the tool joint of the single g) screw the single into the drilling string h) loose the string Operations (a,b) and (c,d) can be realized simultaneously, three teams (El derrick man, E2 floor man, E3 driller) and the screwing and lifting facilities are needed. In fact other PETRI nets have been built concerning teams availability and facilities availability. For more clarity they are not represented on figure n e 10. The PETRI nets concerned with the teams availability, act on the PETRI net under study by removing or adding a token in places El, E2 and E3 and the PETRI nets concerned with facilities act by emitting messages (F.LEV for lifting facility failed, OC.LEV for lifting facility busy, F.CLE for screwing facility failed and OC.CLE for screwing facility busy). These messages

Figure 8: Example of causes-consequences diagram (initiating event "kick"; yes/no gates such as success in closing BOPs, diverter withstands pressure, BOP external leakage, success in closing BSR, emergency disconnection; consequences such as flow under control, blow out on board, sub-marine blow out).

Figure 9: Example of Petri net (elementary task "addition of a single into the drilling string").

Figure 10: Blow-out Markov diagram.

364 are used in order to validate certain transitions. PETRI net on figure n9 shows as example that operation b can be started only if operation a is completed, driller is available, lifting facility is not failed (?F.LEV) and is available (70C.LEV). If all these conditions are satisfied then the transition is fired, operation b is started and a message is emitted to say that the lifting facility is now busy (OC.LEV). Two methods can be considered in order to process such a PETRI net. The first one consists of identifying the various states and the transitions between these states by deriving the so-called "marking graph" from the PETRI net. The second one consists of using directly the PETRI net for a MONTE CARLO simulation. Because of the great number of possible states, the second one was (partly) used for our study. When the number of states is not too big and all transitions have exponential laws, the first method can be used in order to produce a MARKOV diagram. 2.5 MARKOV diagram analysis The MARKOV diagram method is quite a different approach compared to the two previous ones : it consists basically in identifying the states of the system and the transitions between these states. As said before, the above PETRI net could be used in order to derive a MARKOV diagram but this would lead to a too big number of states. Then state grouping and simplification are needed and that leads to a more global analysis than for PETRI nets and causes consequences diagrams. Figure n10 shows a 25 states MARKOV diagram where only states of importance according to our study have been kept. This MARKOV diagram has been processed by a classical MARKOV code. 2.6 Numerical results Numerical results are shown on table n c 2. They are concerned with the main indesireable events encountered during the study. No numerical results are produced from PETRI net model because simulation would have been too much costly. (Of course sub PETRI nets have been quantified by simulation in order to show that the method was working well). Causes consequences diagrams lead to a very detailed analysis but the sequential aspect is partly lost at the quantification level because of the use of fault-trees. MARKOV diagrams take very well the sequential aspect into account but the level of analysis is lower. Then it is normal that the numerical results are not the same, but as shown onto table n2, they are not too far from each others. 2.7 Conclusion of Part 2 Within the context of this type of safety study, one is tempted to summarize the respective advantages and drawbacks of the methods that we have compared. A classification versus different criteria is shown hereafter :

Modelling facility
- Cause-consequence diagrams: excellent, because this approach brings together the points of view of the analysts and of the specialists (1)
- MARKOV diagrams: mediocre, due to the absolute necessity of limiting the number of states by reductions and/or groupings (3)
- PETRI nets: excellent, after acquisition of the formalism, which requires a little practice (2)

Modelling finesse
- Cause-consequence diagrams: medium; the binary character of the answers sometimes lacks subtlety (2)
- MARKOV diagrams: mediocre, due to the need to limit the number of states (3)
- PETRI nets: excellent; a finesse inaccessible to the other methods (1)

Capacity to quantify
- Cause-consequence diagrams: good; fault tree processing software (1)
- MARKOV diagrams: medium; heavier software (2)
- PETRI nets: medium; simulation software (2)

Capacity to introduce operatory times
- Cause-consequence diagrams: bad (3)
- MARKOV diagrams: good; the times are transformed into rates (2)
- PETRI nets: good; the times are introduced as constants or by distribution laws (1)

Representation of common modes
- Cause-consequence diagrams: mediocre; requires "artifices" (2)
- MARKOV diagrams: very good (1)
- PETRI nets: mediocre; requires "artifices" (2)

Economy of means
- Cause-consequence diagrams: good (1)
- MARKOV diagrams: medium (2)
- PETRI nets: costly (3)

Although somewhat subjective, this table clearly indicates that there is, in absolute terms, neither a best method nor a worst one. The choice should depend on the objective pursued, on the fineness of analysis required and on the financial resources available.

STUDY CASE ON AEROSPACE

S. Sanz Fernández de Córdoba, Construcciones Aeronáuticas, S.A., Avda. John Lennon, S/N, Getafe - Madrid, Spain

ABSTRACT. A general view of the reliability techniques in use for aerospace vehicles is given. The application of these techniques to civil transport aircraft is reviewed in detail. Both the Reliability Analysis and the Safety Assessment Programs are discussed and their implication on the design of modern transport aircraft is presented. Civil air regulations are also discussed. Practical cases taken from a recently designed civil transport aircraft are presented at the end.

1.

EVOLUTION OF RELIABILITY ANALYSIS IN AEROSPACE

1.1. Historical background. The V-1

Practical reliability analysis on aerospace systems seems to have started with military systems back in WW II. According to most sources, worrying about system reliability and starting a scientific approach to the problem was triggered by the excessive rate of failures observed in the German V-1 flying bomb. Robert Lusser, one of the people involved in the development, describes that, confronted with the high level of unreliability of the system, they first took the approach of the chain strength theory (no chain is stronger than its weakest link), but the reliability figures they were getting were still unrealistic. Realising that some of the failures did not come from the weakest link, they got the idea that total reliability should be related to some kind of average of the failure probabilities of all the links involved. And yet they could not make much progress, since they were getting figures that did not correspond to experience. Apparently the solution was found by a mathematician, Erich Pieruschka, who advanced the theory that the reliability of a system (understood as its probability of survival) is equal to the product of the probabilities of survival of its components. In other words, the reliability of a complex chain-type system is always much lower than the reliability of its components (for instance, 100 components each with a reliability of 0.997 give a system reliability of only 0.997^100 ≈ 0.74). The result seems to have been impressive, with the V-1 reaching a 75% rate of success and being the first airborne system where reliability

was consciously sought and scientifically approached to achieve success. The basic WW II developments seem to have been kept in use by the air forces after the war, but almost exclusively for unmanned flight vehicles. The extension of these techniques to the civil world seems to have been relatively slow at the beginning, but the process accelerated after the start of serious fatigue problems in aircraft. The concepts of "fail safe" and "safe-life" structures spread somehow to systems. The civil aircraft regulations for transport airplanes also had a considerable influence since, in order to make transport safer and to handle air traffic better, they were forcing the use of more and more automatic systems which required a guarantee of correct working. The designers soon discovered that the military techniques for unmanned vehicles were quite appropriate for the failure analysis of automatic systems, and from there they were extended in general to all aircraft parts. A considerable part of this extension is to be credited to the extension of safety and reliability rules in the civil aircraft design codes of regulations, notably BCAR, FAR and JAR, where this type of rules, either on their own or mixed with other rules, have become increasingly important.

1.2. Present Trends

The decade of the 70's was probably the first moment in which reliability analysis was considered an end in itself as well as a tool to help in other areas of civil aircraft design. Up to that moment, it was seen mainly as a means to assure the correct working of the automatic systems or to analyze the back-up modes of functioning when the main functions failed. But it was soon realized that the basic technique in use (the failure mode and effect analysis) could be a perfect tool to analyze two other basic features of civil aircraft which seriously worried the designers in those years. These points were safety of the design and maintenance problems. Safety of the design had been mainly a qualitative art of the designers, as opposed to a quantitative technique. Several approaches like redundancy and parallel paths were in common use, but no sound technical approach was widely used. Quantification of component reliability, integrated into subsystem and system reliability, led to the quantification of the broader concept of manned aerospace vehicle safety. The technique used was a down-to-top approach, starting from component reliability and building up system reliability through the introduction of concepts such as multiple failures, common cause failures, cascade failures, etc., to arrive at what are commonly called "global failures", which are the ones that may affect the safety of the aircraft. This technique, still in use, had certain shortcomings (we will analyze it in detail later), and soon gave way to top-to-down techniques, in which basically the process is reversed. In an initial phase a "failure tree" (qualitative) is worked up, starting from the known global failure, branching down in several stages through logic gates

(AND, OR) down to component failures. In the second phase, the reliability of the components which appear in the tree is quantified and, retracing the tree back to the top, a measure of safety for the considered global failure is obtained. This technique is nowadays the most commonly used for the safety assessment of manned aerospace vehicles, and we will deal with it later. The maintenance problem was also handled in qualitative form until reliability techniques allowed quantification. Aircraft maintenance cost (as measured by the direct cost of maintenance plus the cost of having the vehicle out of service) depends basically on three factors: inspectability, accessibility and the repair/replacement price of the part subject to maintenance. Parts which, in the designer's feeling, would need more maintenance were traditionally made more accessible, easier to dismount with ordinary tools, etc., but the use of reliability techniques allowed a real quantification of this feeling of the designers. In this way, in the last years, reliability techniques have worked their way into the design of aerospace vehicles, and are nowadays a major factor in the design. The techniques used have separated into three major trends, namely:
a) Strict reliability analysis, mainly used for non-repairable or one-mission vehicles (unmanned satellites, space probes, rockets, missiles, etc.).
b) Safety assessments, extensively used for civil and military transport aircraft and for all manned aerospace vehicles.
c) Maintainability studies, done for all repairable or multi-mission vehicles.
The backbone of them all is reliability analysis and reliability techniques, although the final goal to be reached differs from one to another. In a simplistic way, strict reliability analysis aims to ensure a certain level of mission success. Safety assessment looks for a level of security for people and vehicle, assuring continued safe flight and landing. Maintainability studies aim at an operational cost reduction. For the present paper, we will narrow the scope of the exposition to the actual role of reliability techniques in modern civil transport aircraft.

1.3. Reliability techniques in civil transport airplanes

As previously mentioned, two major concerns for designers of transport airplanes are the safety of passengers and crew and the operational cost of the aircraft. The reason for the concern is obvious. The civil transport aircraft market is highly competitive and requires a high level of investment by the manufacturer, so it has to be sure of producing a marketable product. The operator who is buying an aircraft, and who is moving in a highly competitive market as well, when confronted with different aircraft equally suited to his needs, will carefully weigh the operational cost (often more important than the initial price) and the safety record of the aircraft, knowing that incident-prone aircraft may divert passengers to competitors. Hence the worries of the designer to provide safe, cheap-to-operate products. A further concern of the manufacturer is meeting the civil air regulations, which may be viewed (although this view is highly contested by

many people) as a set of safety standards. Meeting the regulations is the condition to achieve certification of the aircraft, the prime condition to be allowed to sell the product. The reliability techniques opened the way for the designer to quantify, at a probabilistic level at least, both the safety and the maintenance cost of the aircraft. Since maintenance cost is a purely economic concern, the regulations do not interfere much with it, other than generally requesting the manufacturer to provide an acceptable maintenance plan to ensure the continued airworthiness of the airplane for the time of accepted service. But this plan does not include cost. Safety, however, is the main concern of the certification process, and safety features are thoroughly reviewed by the civil air authorities. This difference has led manufacturers to establish, early in the design process, two different programs, commonly named the "Safety Assessment Program" and the "Reliability Analysis Program", both of which have a direct relation with our subject. The safety assessment becomes a fundamental part of the certification process, and the documents in which it is contained belong to the official documentation of the aircraft, which has to be submitted to and, in general, approved by the authorities. The reliability analysis is only a tool and does not belong to the official documentation of the aircraft, except for the parts which may be used in support of the safety assessment program. We will review in the next paragraphs the content of those programs.

2.

RELIABILITY ANALYSIS PROGRAM IN CIVIL TRANSPORT AIRCRAFT *

2.1. Economic implication of reliability

The main reason to perform a reliability analysis on a modern transport aircraft is economic. Reliability and cost are interrelated in three aspects:
a) Component price tends to increase exponentially with reliability, so requiring excessive reliability is uneconomical (it increases the price of the product).
b) Maintenance cost increases enormously with an increased need for unscheduled removal of parts, mainly if the removal time interferes with the operating schedule. Unscheduled removal of parts increases with lowered reliability of components.
c) Parts storage needs, and thus cost, increase with decreased reliability of components (more components need to be in storage), but also increase with excessive reliability of fundamental stocks (parts which have to be in storage in all cases, and are more expensive if they are over-reliable).
The main objective of the reliability analysis is to predict an optimal point between the above conflicting requirements by requiring a certain level of reliability for each of the aircraft components. The analysis is usually performed in two "runs" following the classical rules of an FMEA. In the first run, the different systems and subsystems of the aircraft are allocated "allowable failure rates" in order to achieve the desired levels of reliability on the complete aircraft.

This allocation is not made on purely economic grounds in all systems, since safety aspects usually play an important role in many systems. This failure rate allocation is one of the many inputs that the system designer uses when deciding the basic philosophy of the system design. Other important inputs, just to give an example, are weight allocation, sources and extent of available energy, etc. Starting from this required failure rate, since the designer knows the state-of-the-art reliability of the available components, and considering all the inputs, the designer may decide on a single-chain system, a multiple-path system, a redundant parallel system, or any other arrangement which seems adequate to meet the requirements. Often the designer receives incompatible requirements (i.e. the failure rates can only be met by multiplying channels, and this surpasses the weight allocation) which force a modification of the requirements in usually stormy meetings of all the parties involved. Once the system is laid down comes the second run of the reliability analysis, which is in practice a strict FMEA or similar study. In this second run the following main points are analyzed:
a) The system meets the reliability requirements set previously.
b) The reliability of the system components is rational, i.e. there are no unusually over-reliable parts when compared with the total system reliability requirements, unless there are peculiar factors which may require that.
c) System hidden failures have been properly considered.
d) Predicted tests, inspections, probable removals, etc., are feasible and adequate.
e) Multiple failures, and possible failures induced to or by other systems, are in accordance with the established goals.
This analysis, which has the usual format of an FMEA, is usually performed by a special section of the engineering office of the manufacturer, working in close relation with the design departments, since as a result the design may have to be modified, although normally changes are minor. The results of the analysis are then transferred to the logistic support unit of the company, which will use these results for:
a) Setting up the maintenance program (MRB document). We will come back to this later.
b) Setting up spare parts needs for operators.
c) Setting up production/storage of parts needed by the manufacturer and suppliers.
On some occasions, feedback from logistic support may change again certain features of the design. This is a very infrequent circumstance, but it may arise when suppliers cannot meet the needs of spare parts delivery as predicted by the reliability analysis. On those rare occasions a request for changing to more dependable parts to avoid a possible supply shortage may be produced. The proper setting of a maintenance program and of spare parts needs is of fundamental importance to the operator. Optimistic figures provided by the manufacturer, which day-to-day operation reveals to be false, may lead to a quick discredit of the company, ending its possibilities in the market. Pessimistic figures will destroy all possibilities of selling

the aircraft. Hence the economic influence of what, in principle, is a purely technical matter, and hence the responsibility which is entrusted to the reliability analysis.

2.2. The maintenance program and MRB document

Aircraft do become uneconomical, but never unsafe. This is a golden rule, set up from the beginning of commercial aviation. It means that, as time goes by, it may become unfashionable or uneconomic to operate a certain type of aircraft (i.e. piston-engine aircraft), but it is as safe to fly them later as it was on the first day. This principle has been spreading slowly to other industries like car manufacturing, but only the aviation industry has shown a definite commitment in this sense, and from the very beginning of commercial aviation aircraft have been sold together with a careful set of instructions, the maintenance manual, and the assurance that, following those instructions, the aircraft will remain for all its predicted life as airworthy as it was on the day of delivery from the factory. Needless to say, there have been some mistakes on the way, mistakes which are all the more notorious in commercial aviation because of the above contention. But all considered, there is an impressive record of maintained airworthiness in commercial aircraft which is probably unmatched by any other branch of industry. In not too old times, maintenance instructions were very much the result of the designer's feeling and experience, and there were continuous changes to those instructions coming as a result of the aircraft operational record, which has always been closely monitored by manufacturers. Nowadays, the results of the reliability analysis allow manufacturers to build up maintenance plans that are not much changed by the service record of the aircraft. In other words, operational records of aircraft tend more and more to conform to the predictions made at design time based on reliability analysis. The success of those procedures has made it standard practice for manufacturers to set up, at an early stage of the design, a group which is known as the Maintenance Review Board. In this group, representatives of the designing company, the manufacturers, the operators (i.e. airlines) and the civil air authorities are present. They work on producing what is known as the MRB document, a comprehensive guide to the maintenance procedures to be followed on the aircraft. They review each part for damage possibilities (i.e. damage due to corrosion, fatigue, accidental damage, operational damage, etc.) and, using the results of the reliability analysis, they categorize the part within a previously set up classification which takes into account the criticality of the part, economic implications, possibility of inspection, etc., to end up with a suitable maintenance task which is adequate for the component and ensures continued airworthiness of the aircraft. The final MRB document is normally accepted by the authorities as a comprehensive set of basic instructions to keep the aircraft airworthy, and is used by the operators to set up their particular maintenance programs according to their own possibilities.

3. THE SAFETY ASSESSMENT PROGRAM IN CIVIL AIRCRAFT

3.1. Introduction

Probably the most remarkable contribution of reliability analysis to modern civil transport aircraft design has been opening the way for the quantification, at probabilistic level, of the safety of the aircraft. This quantification has in turn allowed problematic areas to be pin-pointed, those problems to be solved and safer aircraft to be produced. Furthermore, it has allowed manufacturers and authorities alike to reduce to cold figures the generic term of safety of an aircraft type. The safety of a civil transport aircraft has always been defined around the concept of the ability of the aircraft for continued safe flight and landing under all foreseeable conditions. Foreseeable conditions have always been taken as both adverse environmental conditions (frost, stormy weather, etc.) and failures in aircraft systems. Several approaches were taken in the past to try to quantify, at least in discrete bands, the safety concepts. Some successful, and somehow still alive, concepts of the past were:
- The single failure approach (no single failure may result in the loss of the aircraft).
- The fail-safe approach (no single failure may prevent a system from performing its intended main task).
When dealing with those approaches, a series of rules, taken from experience, were systematically used and accepted. The most famous one was:
- No worst case consideration (in modern terms, unrelated serious failures with no possible common cause are not to be considered).
Reliability techniques were obviously more adequate to analyze safety, since they can give an answer to the question of how unsafe an aircraft is or, in other words, what is the probability that an aircraft will crash, losing the aircraft and/or a high number of passengers' lives. If the probability of such a catastrophe is p, 1-p is a measure of the safety of the aircraft. Conceptually, failure of an aircraft to achieve safe flight and landing is no different from failure of an electricity-producing plant to deliver the required voltage. Thus, the same reliability techniques may be applied. From such a starting point, modern safety assessment methods were derived. Two of those methods are commonly used nowadays, and we will review them in the next paragraphs.

3.2. The classic approach through reliability analysis

The classic way of performing a safety assessment, still considered by many people in industry and in the civil aviation authorities as the accepted one, is based on a down-to-top procedure, starting with a complete reliability analysis in the form of an FMEA of all the aircraft components and working the way up, through combination of cases, to arrive at what is usually known as a global failure of the aircraft. The global failure

is defined as a series of malfunctions observed by the pilot or crew which may have different origins and which require corrective action. The object of the assessment is to determine all the possible causes which provoke those apparent malfunctions, and to determine the seriousness of the failure and the probability of its occurrence. This probability must be commensurate with the risk involved. The conceptual block structure which is followed in this technique is sketched in FIG. 3.2.1. A central FMEA is made, based on the position occupied by the single element under consideration within the complex system and on the failure rate of the element. The safety assessment and the maintenance program are derived from this FMEA. As it is done in practice, the aircraft is divided into systems, so that each component of the aircraft belongs to a system. Those systems do not necessarily coincide with the designer's concept of a system. Characteristically, monitoring subsystems, such as indicators, warning lights, etc., which are normally not considered as part of the monitored system for design purposes, are commonly included for reliability purposes with the monitored system. Once the airplane is broken down into systems, the following steps are performed:
1.- System description and system boundaries, that is, the points where it interacts with other systems.
2.- System main and secondary functions performed in normal operation.
3.- System malfunctions, which include not only the lack of performance of its intended functions, but also incorrect performance of them or the performance of functions which were not intended at all in the design (e.g. bursting of a pressure accumulator).
4.- Categorization of malfunctions according to their impact on safety.
5.- Failure of system elements and effect of the failure upon the system itself and on the aircraft, according to the malfunctions previously listed.
6.- Failure of external elements which affect the system through its boundaries.
7.- Double and multiple failures (including combinations with external failures), defined as those failures which may have a common source (clogging of valves due to fluid contamination) or failures which, when combined, result in a danger to the aircraft of a bigger order than the simple cumulative effect of both (failure of the second engine in a two-engine aircraft).
8.- Global failure description, categorized according to the list of malfunctions, and calculation of the probability of occurrence according to the previous single and multiple failures.
This technique has many advantages, the main one being its uniformity, independently of the system under study, which allows people not too expert in systems to be capable of performing the analysis or following its results. Even more important nowadays, it is not too difficult to introduce it, at least in part, into a computer. In spite of these advantages, it has certain shortcomings. The main one is that it reduces the global failure probability to a figure which is not always very reliable in itself. Single element failure rates, mainly for non-electronic newly designed components, are difficult, or too expensive, or

both, to substantiate. Since, using this procedure, the logical sequence of the global failure is somewhat lost and difficult to reconstruct, it becomes at times very difficult to demonstrate compliance with the safety regulations. Thus, nowadays, safety assessments tend to be done more and more by a direct approach.

FIG. 3.2.1. Classical Approach (block diagram: single elements combined into a complex system -> failure mode and effect analysis (FMEA) -> reliability of systems and elements -> safety assessment and maintenance program)

3.3. The direct approach to safety assessment

Unlike the previous technique, which is used with small variations by all aeronautical companies, there are as many direct approaches as design companies, or more. The principle of them all is the same. You take a failure condition which is essential (large reduction in safety margins) or catastrophic (loss of the aircraft, death of a large number of occupants) and you build up, by means of a logical tree, the combinations of failures which may lead to that failure condition. Working the tree down to single element failures, of known rate, and back up, the final probability of the failure condition may be calculated if necessary. One advantage of the method is that the logical build-up, at aircraft level, of the failure condition is directly displayed in the tree, independently of how many different systems may be involved, a presentation that is nearly impossible in the classical method and which makes it, in most cases, unnecessary to rely on or even calculate the probability of the failure condition. When this technique is used, the reliability analysis becomes separated from the safety assessment and almost independent, as can be seen in FIG. 3.3.1. The main advantage of using this method is that the reliability analysis is not used as a basis to show compliance with the regulations, except on a few isolated occasions, thus ending the unending discussion on single element failure rates during the certification process of the aircraft. One problem of this technique, and the origin of the differences between companies, is the systematization of the failure conditions, in other words, the method used to make sure that you are not forgetting in your assessment essential or critical failure conditions. Unlike the classical approach, expert people with a good knowledge of the aircraft itself and of the way all its systems function are necessary to make a good job, since there is nothing intrinsic in the method to ensure that all essential or critical failure conditions have been properly included. To solve this problem, a systematic, ordered way of searching for failure conditions has to be developed, and here there are not two people agreeing on the best one. In FIG. 3.3.II is given the procedure followed at Construcciones Aeronáuticas, S.A. (CASA) for civil transport aircraft. A first division considering the source of failure (whether the aircraft itself or the environment) is made. Failures resulting from external conditions are listed and their possible effects on the aircraft are studied. For failures originated by malfunctions in the aircraft, the division is made on account of the energy involved in the failure. Failure of high-energy-containing elements may provoke extended uncontrolled damage, and minimizing that damage is the object of the study. The failure of an energy-producing system may render inoperative many other systems, and basic safety has to be maintained in those cases. Finally, passive system failures will provoke the characteristic malfunctions which have to be analyzed. The assessment of the so-called

377 "in extremis procedures", aims at giving a set of instructions to minimize damage when those procedures have to be performed.
FIG. 3.3.1. The direct approach (block diagram of the new procedure: a reliability analysis branch — single element combination into systems, failure mode and effects analysis, single element failure rates, reliability of systems and elements, maintenance program — and a safety assessment branch — system/function classification, selection of essential and critical systems/functions, safety analysis, demonstration of safety (redundancy, ...) — with the reliability results "sometimes required" and the safety branch "always required")

FIG. 3.3.II. Systematic Search of Failure Conditions (tree: damage from natural aggressive environment — frost, lightning, moisture, bird strike, etc. — versus damage from system malfunction; the latter split into extended uncontrolled damage — fire, high energy rotating elements, bursting of pressurized elements, other high energy failures —, extended controlled effects — electrical supply, hydraulic supply, engine stop —, system failures — autonomous, power-feeding related, functionally dependent — and in extremis procedures — landing with retracted L/G, water landing)

3.4. The influence of regulations

The main codes of civil regulations in use nowadays for transport airplanes are 14 CFR Part 25 (the American code, usually referred to as FAR 25) and JAR 25 (the European code). The main rules dealing with aircraft safety and reliability are FAR 25.1309 and JAR 25.1309, which are nearly identical. Both the American and the European authorities have issued explanatory papers to those regulations, the main ones being AC 25.1309-1 (U.S.) and ACJ No. 1 (up to No. 8) to JAR 25.1309 (Europe). The regulations themselves are, in principle, clear when taken together with the AC, but they are nowadays the main battle horse in the certification process of aircraft. The tendency is to favour the direct approach to the safety assessment over the classical approach. In the same direction points AC 25.1309-1, paragraph 6, "Acceptable techniques", where it is specified that only a limited number of cases may require a full numerical probability analysis. The Europeans, however, seem to place greater importance on numerical probability analysis according to the ACJs, but this impression may be somewhat misleading due to the fact that ACJ No. 1 to JAR 25.1309 was published earlier than AC 25.1309-1 and reflects an older way of looking at the problem.

4. STUDY CASES

After reviewing the place that reliability techniques occupy in modern aircraft design, and the role they play, we will show by means of a series of short examples how this is done in practice. Since the cases are quite numerous, we have just chosen examples from the CASA CN-235 aircraft, recently certified in Spain. From all the different safety studies which were done following the systematics explained in paragraph 3.3 and FIG. 3.3.1, we have chosen as examples two systems, the propeller brake system and the wheel brake system. The propeller brake system is introduced in the aircraft in order to stop the propeller on the ground while the engine is running. In this way, energy is available in the aircraft while people can be working on it without the danger of the propeller rotating. The wheel brake is intended to help arrest the aircraft on the runway or when it is at the parking place, as well as to help turn the aircraft while taxiing by means of differential braking. The wheel brakes have an antiskid system incorporated.

4.1. System versus failure criticality definition

In the first step, the criticality of the system function is defined. Following AC 25.1309-1, the system may be non-essential, essential or critical (see the definition of the terms in the AC). For our chosen systems, it comes out that the propeller brake is non-essential (if it is removed or inoperative, flight safety is unaffected) while the wheel brake system is essential (safety on landing is greatly decreased, but catastrophe does not follow, since the aircraft may still

be arrested using reverse propeller thrust). The CASA CN-235, like most modern transports, has no critical system (except for very special operations, like IFR night flight). The wheel brake system may not have a more serious failure than simply not working (bursting of the pressure accumulators is not considered here, but under failures with extended uncontrolled effects, so we do not need to analyze the bursting to see if it is a more serious failure), but the propeller brake may have two failures, namely unwanted release when stopped on the ground and unwanted braking of the propeller in flight, which are respectively essential (it endangers people working around) and possibly critical (if the propeller is suddenly stopped, the blades may break; the propeller brake in the CN-235 is not powerful enough to provoke that situation, but we will assume it may, for this example). This situation of having a non-essential system which nevertheless presents essential or critical ways of malfunctioning is not uncommon in modern transport aircraft, and the distinction between system criticality and failure mode criticality has to be carefully considered and taken into account.

4.2. Non-essential systems

This type of system is not involved in safety, and need not be studied as such. But when the system displays hazardous modes of failure, like the case of our assumed propeller brake, those failure modes need to be analyzed. The analysis is made by constructing the failure logical tree. For the study case, the trees for unwanted release and unwanted braking in flight are given in FIG. 4.2.I and FIG. 4.2.II. The trees are worked from the failure at the top, down to single element failures (marked as SEF in the trees). In the next step we would write on each SEF-marked box the probability of that element failing, and work the way up to arrive at the probability of the failures, but in our example this was considered unnecessary. Brake accidental release while in GPU mode, as an essential failure, can be produced only through the simultaneous failure of two independent chains or by the accidental switching of two separate switches of different type. This was considered safe enough for the failure under consideration, and adding some numbers would not add any real new information. For the brake turned on during flight, the tree shows that the simultaneous failure of five independent elements is necessary. This is considered safe enough, even if the failure is critical as we assumed, and it is explicitly stated that way in AC 25.1309-1, par. 6.d.(1). Thus, for this particular case, no numerical analysis was considered necessary, and the safety of the system was considered satisfactory at this stage, so no further analysis was performed. The above example illustrates the following general rules for the study of non-essential systems:
a) They need not be studied for failures in performing the function they are meant to perform. For complicated systems (such as ground power units) this means significant savings in time and money.

b) Hazardous modes of failure, essential or critical, need to be studied only from the point of view of the failure itself. Redundancies, duplications, etc., need only be considered as far as the failure is concerned, not for the complete system (multiple independent failures needed for the hazardous condition count as multiplications, even if for correctly performing the function there are no parallel channels). This represents a saving in weight and allows simplification of the design.
FIG. 4.2.I. Brake Release while in G.P.U. Mode (fault tree; SEF = single element failure; AND and OR gates; branches include: failure of feeding pressure to brake pad set 1 and to brake pad set 2 — each requiring failure of an accumulator, a thermal relief valve failure and a non-return valve failing open — and brake release undesired command — discharge solenoid actuated undesired, pull switch and move switch)


FIG. 4.2.II. Brake Turned On while Flying (fault tree; top event requires pressure fed to brake pad set 1 and to brake pad set 2; branches include: pressure solenoid actuated undesired, erroneous command to brake propeller, gust lock, landing gear failure, feather lever condition switch failure (SEF), off-switch failure (SEF), relay bypass (SEF), power lever idle switch failure (SEF))
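To make the quantification step concrete — attaching probabilities to the SEF boxes and working the gates up to the top event — the following minimal sketch evaluates a small AND/OR tree under an independence assumption. The gate structure and the numerical values are hypothetical illustrations, not the CN-235 data.

```python
# Minimal sketch of fault tree quantification with independent basic events.
# Gate structure and probabilities are hypothetical, for illustration only.

def and_gate(*probs):
    # All independent input events must occur.
    result = 1.0
    for p in probs:
        result *= p
    return result

def or_gate(*probs):
    # At least one of the independent input events occurs.
    result = 1.0
    for p in probs:
        result *= (1.0 - p)
    return 1.0 - result

# Hypothetical single element failure probabilities (per flight hour).
sef = {"switch": 1.0e-4, "relay": 2.0e-4, "solenoid": 5.0e-5,
       "valve": 1.0e-4, "lever_switch": 1.0e-4}

# An unwanted event requiring the simultaneous failure of five independent
# elements (cf. the brake-turned-on-in-flight argument in the text):
top = and_gate(sef["switch"], sef["relay"], sef["solenoid"],
               sef["valve"], sef["lever_switch"])
print(f"five independent failures ANDed: {top:.1e}")   # ~1e-20, clearly remote
```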

4.3. Essential systems

Systems which perform essential functions normally have duplication or even higher redundancy. For those systems, the direct approach to safety assessment offers the greatest advantages when compared with the classical approach. The reasoning behind it is as follows: if we have a system reliable enough to perform a function, and you add a second, fully independent system as back-up to perform the same function, it is obvious that the probability of being able to perform the function has been improved, and without further study you may say that it is a safer design. Curiously enough, when you make this duplication and are using the classical approach to safety assessment, you are penalized, i.e. you need to do a larger amount of work (a new chain to be analyzed) and face more difficulties in proving compliance (the lower, but existing, probability of failing combinations). The direct approach makes use of the rationale expressed above and simply states that if one of the independent chains is proven to be (nearly) safe enough to (almost) comply by itself with the required level of reliability, the addition of a second, independent, sound chain, according to good engineering judgement, improves the system enough to make further analysis unnecessary. This is accepted in the regulations (see AC 25.1309-1, par. 6.c), and it simply reflects a rational position when studying the system. To make full use of this rationale it has to be realized that a main system may be classified as essential because it performs an essential function, but it may also perform a series of non-essential ones which can make the system very complicated to analyze. When a second parallel system is added, it only duplicates the essential functions, not the secondary non-essential ones. Thus the analysis of this second system is usually much simpler. This is the case with the wheel brake system of the CASA CN-235 aircraft we have chosen as an example. The main system is partially duplicated, has double actuation from the pilot and copilot seats, and has differential braking and antiskid capability. The emergency parallel system is far simpler, since it is only meant to brake the aircraft on landing or on an aborted take-off. Thus, the analysis of this system is far more direct, clear and simple. The analysis of systems performing essential functions then proceeds as follows:
1.- Study and demonstration of the level of redundancy of the system. Redundancy has to be shown to be full, i.e. the systems have no common part and there are no external elements which may cause simultaneous failures in the two systems.
2.- Choosing one of the separated chains and working out the failure tree for the function involved. Usually, the simpler system is used for that purpose. In the case of the wheel brakes the emergency braking system was chosen, and the failure tree is given in FIG. 4.3.1.
3.- If the simpler chain has been chosen, normally, as in our example, a simple inspection of the tree is usually not enough to determine the

safety of the chain. In our example it comes out that a single failure of that chain will render it inoperative, which is not good enough for an essential function (i.e. the emergency braking system alone will not meet the requirements). Thus a numerical analysis is performed, adding to the single failures on the tree their failure probabilities, and working the way up to the top.
4.- The final figure for the failure at the top is usually near the requirement, but not quite meeting it. In our case, the required minimum level is 10^-9, and with the emergency braking system alone we only reach 0.66x10^-5. This means that the parallel, independent main braking system is required to have a failure probability of at most 1.5x10^-4. A simple inspection of the diagrams of the main braking system (not included in this paper) made it clear that the reliability of that system was much better than that, thus no further study was considered necessary.
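The allocation argument of step 4 can be checked in a few lines, assuming independence of the two braking chains; the target level and the emergency-chain figure below are the ones quoted above (as reconstructed from the text).

```python
# Sketch of the allocation between two independent braking chains (step 4 above).
target        = 1.0e-9      # required probability for the complete failure condition
p_emergency   = 0.66e-5     # failure probability obtained for the emergency chain
p_main_needed = target / p_emergency
print(f"main braking chain must not exceed {p_main_needed:.2e}")   # about 1.5e-4
```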

FIG. 4.3.1. (fault tree for the emergency wheel brake failure, with single element failure probabilities attached; branches include: no pressure at the system, brake/parking valve malfunction, undetected pressure loss at a critical flight time, indicator malfunction, discharge valve malfunction, accumulator, non-return valve, pipes and fittings, critical fuse, part expelled, upstream from fuses)

5. SUMMARY

We have reviewed in this paper the influence of reliability techniques on modern airplane design, with emphasis on civil transport aircraft. The principal techniques used for the safety studies have been presented, and a few examples have been reviewed to illustrate the practical use of reliability techniques and of the regulations controlling their use at the present time.

REFERENCES
1.- "Training Program for Reliability, Probability and Safety Analysis", McDonnell Douglas Corp., FAA contract No. DOT-FA75AC-5123.
2.- "Systematic Safety", C. Lloyd and W. Tye, CAA, London, July 1982.
3.- "Reliability Design Handbook", No. RDH 376, Reliability Analysis Center, IIT Research Institute.
4.- FAR, JAR, CAA regulations for Civil Transport Aircraft.

* Copies of these regulations are available from the civil aeronautical authorities of the issuing countries, namely the U.S. Federal Aviation Administration for the FAR regulations and AC papers, the British Civil Aviation Authority for the CAA regulations, and most European civil aeronautical authorities for the JAR regulations and ACJ material.

RELIABILITY OF ELECTRICAL NETWORKS

A. G. Martins, Departamento de Engenharia Electrotécnica, Largo Marquês de Pombal, 3000 Coimbra, Portugal. ABSTRACT. Some of the most used techniques for power systems reliability evaluation are presented, covering the three main areas of generation, transmission (bulk) and distribution systems. Generation systems are considered both from the static and the spinning reserve points of view. Case studies are used to illustrate some of the concepts.

1. INTRODUCTION

Reliability of electrical networks is a somewhat restrictive title for the notes that follow. In fact, it is difficult nowadays to find a textbook in the area of power systems reliability that does not include a discussion of the three main items to be considered: generation, transmission and distribution of electricity. These notes are intended as a complement to the lecture with the title above. On the other hand, the lecture itself is considered as a case study within the general scope of a course on reliability. In this context it appeared to the author that a discussion on the theme "power systems reliability" would be more adequate. Figure 1 depicts a scheme of power system organization, identifying the relative positions of the three main subsystems referred to above. It becomes hence clear that neglecting the generation reliability problem implies assuming a faultless, or one hundred per cent reliable, generation system. Actually, the first contributions that arose in this particular field of reliability concerned the generation system. The general formulation then given to this problem may be stated as: how is the necessary reserve capacity of the generation system to be determined in order to assure a reasonable confidence level in the ability to serve the load without discontinuities, at an affordable cost? Though considerable attention has been given to the generating-capacity problem since early in the 30's, it was not before 1947 that probability methods were first successfully applied. Since then, improvements and new contributions have been constantly presented by


researchers and engineers in the power systems field.

Fig. 1 Power system organization scheme (generation system -> transmission system -> substations -> distribution networks -> transformers -> load points)

2. POWER SYSTEMS SPECIFICS

There are two possible and alternative reasons to conduct a study of power systems reliability:
- planning reasons, where reliability evaluations play a major role in the decision-making process on investments;
- operating reasons, to help conduct the system within strict safety margins.
The planning activities within power systems are performed by means of procedures that simultaneously accomplish the minimization of cost and assure that the reliability indices used to characterize the system performance lie within pre-specified tolerances.

Figure 2. Power system states (normal, alert, emergency, in extremis, restorative)

Figure 2 illustrates the second point. It represents the various identifiable states of operation of a power system. They may be

characterized as follows:
Normal state - there is an adequate balance between generation and load (equality constraints respected) and all quantities lie inside the allowable intervals (inequality constraints respected, e.g. voltages within tolerances).
Alert state - the constraints are still respected but the system reserves may not be sufficient to prevent some constraint violations. Preventive actions must be undertaken.
Emergency state - violations of some inequality constraints exist. The system is still intact but severe measures are needed to take it back to at least the alert state.
In extremis state - both equality and inequality constraints are violated and it is no longer possible to recover without important losses of generation and load.
Restorative state - this is the lengthy process of bringing back to service the components of the power system affected by the disturbance.
Namely, large power plants with big turbine-generator groups have a long start-up time. They are designated as base plants and cannot be used to respond to rapid load growths. There is, especially at heavy load periods, or near them, a certain amount of power available but still not used, ready to take its share of service to the system load. This spinning reserve must be distributed over a certain number of groups and represent a certain amount of power at each instant in time in order to assure that the system does not leave the normal state. The spinning generating capacity reliability evaluation is a typical example of a short-term procedure, which will be referred to in section 6. Power system components are repairable. So, power systems themselves are repairable and the applicable reliability techniques are restricted to the universe of repairable systems.

3. STATIC GENERATING CAPACITY RESERVE EVALUATION

Usually, the studies on this subject assume that the power system is composed of the power stations and the loads, directly connected to a common busbar or to a few busbars. Transmission and distribution systems are considered reliable. The main goal to be attained by a generation system is to generate enough electricity to serve the load at all times. So, if there is not enough power to serve the load the system fails; otherwise it succeeds. From this arises the need for two different models to be considered and merged together: the generation system model and the load model.

3.1 The Loss-of-Load Approach

3.1.1 The generation model. The most simple representation of a generator consists of a component with two states: the up and the down states. Actually, several intermediate states may be considered, due to partial outages. We will use here, for illustration purposes, the

simplest model. Considering a mean cycle time T for a generator (as a repairable component) equal to

T = m + r

where m is the mean time to failure and r is the mean time to repair, as depicted in figure 3, it is possible to define the availability as the ratio

D = m / T = m / (m + r)    (1)

Figure 3. Mean life cycle of a two-state repairable component (generator): the unit alternates between an up period of mean duration m and a down period of mean duration r.

and the unavailability as U = 1 - D, given by

U = r / T = r / (m + r)    (2)

The outage frequency is defined as

f = 1 / T = 1 / (m + r)    (3)

The reciprocals of m and r are, respectively, λ, the failure rate, and μ, the repair rate. It is evident that

λ = 1 / (D T)    (4)

μ = 1 / (U T)    (5)

f = D λ = U μ    (6)

Representing the state-space model graphically as in figure 4, λ and μ assume the meaning of departure rates from the up state and from the down state, respectively. A different formulation for the unavailability is commonly used in power system jargon, the 'forced outage rate', given by the ratio

FOR = (forced outage hours) / (in-service hours + forced outage hours)    (7)


Figure 4. Statespace diagram of a single machine.
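As a quick numerical check of expressions (1)-(7), the sketch below uses assumed, purely illustrative values of m and r for a single generator; the numbers are not taken from the text.

```python
# Worked check of expressions (1)-(7) for a single generator with assumed
# (illustrative) mean times: m = 1900 h to failure, r = 100 h to repair.
m, r = 1900.0, 100.0          # hours
T = m + r                     # mean cycle time
D = m / T                     # availability, eq. (1)
U = r / T                     # unavailability, eq. (2)
f = 1.0 / T                   # outage frequency, eq. (3)
lam, mu = 1.0 / m, 1.0 / r    # failure and repair rates
assert abs(D * lam - f) < 1e-12 and abs(U * mu - f) < 1e-12   # eq. (6)
FOR = r / (m + r)             # forced outage rate, eq. (7), equal to U here
print(D, U, f, FOR)           # 0.95, 0.05, 0.0005 per hour, 0.05
```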

To model a system composed of several generators, two possibilities arise: a power plant with similar units, or a system with a natural diversity of generator groups. In the former case, the binomial distribution may be used to build up the so-called capacity outage probability table (COPT), which gives information about probable capacity deficiencies. Table 1 is an example of a COPT for a system with 2 similar units of 10 MW, each with a FOR of 0.03 (the corresponding availability being 0.97).

Table 1
Capacity out of service    Probability of occurrence
 0 MW                      1 x .97 x .97 = .9409
10 MW                      2 x .97 x .03 = .0582
20 MW                      1 x .03 x .03 = .0009

The probabilities were calculated using the known expression for p_k, the probability of the system being in a state where k units are out of service out of a total of n units:

p_k = [n! / (k! (n-k)!)] D^(n-k) U^k    (8)
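A minimal sketch of expression (8), reproducing Table 1 for the two identical 10 MW units, might look as follows (the code is illustrative only).

```python
# Binomial COPT for n identical units (here 2 x 10 MW, FOR = 0.03), eq. (8).
from math import comb

def binomial_copt(n_units, unit_mw, for_rate):
    D, U = 1.0 - for_rate, for_rate
    table = []
    for k in range(n_units + 1):                      # k units out of service
        p_k = comb(n_units, k) * D**(n_units - k) * U**k
        table.append((k * unit_mw, p_k))
    return table

for cap_out, p in binomial_copt(2, 10, 0.03):
    print(f"{cap_out:3d} MW out  p = {p:.4f}")
# 0 MW out p = 0.9409, 10 MW out p = 0.0582, 20 MW out p = 0.0009
```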

On the other hand, when different groups exist within the same system, a recursive technique is used to build the COPT. Each state where the system may reside has a probability given by

p_j = ( Π_{i=1..N} D_i ) ( Π_{k=1..M} U_k )    (9)

where N represents the number of up units and M the number of down units. It is likely that different combinations of up and down units

lead to the same value of capacity out of service. Therefore, to combine them into a single row of the COPT it is necessary to compute

p_j = Σ_{s=1..r} p_s    (10)

where r is the number of combinations just referred to and p_s is the probability of finding the same capacity on outage as represented by the aforementioned combinations of up and down groups, defining system state j. A very important quantity to consider in these studies is the probability of finding out of service a capacity value equal to or greater than a certain amount - this is a cumulative probability. Table 1 may be completed to illustrate the concept.

Table 1 (revisited)
Capacity out of service    Probability of occurrence      Cumulative probability
 0 MW                      1 x .97 x .97 = .9409          1.0000
10 MW                      2 x .97 x .03 = .0582          .0591
20 MW                      1 x .03 x .03 = .0009          .0009

As it is easily understood, the third column was built by computing


p_cum,j = p_cum,j+1 + p_j    (11)

where p_cum,j represents the cumulative probability of state j and p_cum,j+1 that of the state with the contiguous greater value of capacity on outage. To use the recurrence relation referred to above for computational purposes, it is necessary to build up an elementary COPT relative to a single generator. Then the recursive technique may be used to build up the complete COPT, and may be stated as

p1(X) = p2(X) D + p2(X - C) U    (12)

where C is the capacity of the new unit to be included in the table, p1(X) is the probability of X MW being on outage after including the new group, and p2(X) is the probability of X MW being on outage before the inclusion of the new group. To extend the use of (12) to the computation of cumulative probabilities, where it reads "X MW" it should read "X or more MW". This procedure proves to be very efficient and straightforward, requiring only the additional use of (10) to combine redundant states.

Case study no. 1 - To illustrate the use of expressions (9) through (12), with groups that are not identical, let us consider the case of a

generation system consisting initially of two units, as follows:

Unit    Capacity    FOR
G1      20 MW       0.02
G2      50 MW       0.03

The first COPT will be built with G1 data:

COPT no. 1
Capacity out of service    Probability of occurrence
 0 MW                      0.98
20 MW                      0.02

Now, applying (12), let us consider first the case when G2 is on outage:

1st half of COPT no. 2
Capacity out of service    Probability of occurrence
 0+50 MW                   0.98 x 0.03 = 0.0294
20+50 MW                   0.02 x 0.03 = 0.0006

followed by the case when G2 is in service

2nd half of COPT no. 2
Capacity out of service    Probability of occurrence
 0+0 MW                    0.98 x 0.97 = 0.9506
20+0 MW                    0.02 x 0.97 = 0.0194

The final COPT for the system presented will be

COPT no. 2
Capacity out of service    Probability of occurrence
 0 MW                      0.9506
20 MW                      0.0194
50 MW                      0.0294
70 MW                      0.0006

Case study no. 2 - Consider now the addition of a 30 MW group with FOR = 0.01, which will be designated by G3 for short, to the previous system. Following an analogous procedure, two intermediate COPTs will be built:
1st half of COPT (G3 on outage)
Capacity out of service    Probability of occurrence
 0+30 MW                   0.9506 x 0.01 = 0.009506
20+30 MW                   0.0194 x 0.01 = 0.000194
50+30 MW                   0.0294 x 0.01 = 0.000294
70+30 MW                   0.0006 x 0.01 = 0.000006

2nd half of COPT (G3 in service)
Capacity out of service    Probability of occurrence
 0+0 MW                    0.9506 x 0.99 = 0.941094
20+0 MW                    0.0194 x 0.99 = 0.019206
50+0 MW                    0.0294 x 0.99 = 0.029106
70+0 MW                    0.0006 x 0.99 = 0.000594

Combining the two states with 50 MW on outage leads to a single state with probability of occurrence given by 0.000194+0.029106= 0.029300. Hence, the final COPT becomes

COPT
Capacity out of service    Probability of occurrence
  0 MW                     0.941094
 20 MW                     0.019206
 30 MW                     0.009506
 50 MW                     0.029300
 70 MW                     0.000594
 80 MW                     0.000294
100 MW                     0.000006
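The recursive construction of expressions (10)-(12) is easy to mechanize. The sketch below adds the units one at a time, merges states with equal capacity on outage, and reproduces the COPT of case studies 1 and 2; the implementation details are illustrative, not taken from the text.

```python
# Recursive COPT construction: eq. (12) for adding a unit, eq. (10) for merging
# states with the same capacity on outage, eq. (11) for cumulative probabilities.

def add_unit(copt, capacity_mw, for_rate):
    """copt: dict {capacity_on_outage_MW: probability}; returns the enlarged table."""
    D, U = 1.0 - for_rate, for_rate
    new = {}
    for x, p in copt.items():
        new[x] = new.get(x, 0.0) + p * D                               # unit in service
        new[x + capacity_mw] = new.get(x + capacity_mw, 0.0) + p * U   # unit on outage
    return new

copt = {0: 1.0}                                          # empty system
for cap, fo in [(20, 0.02), (50, 0.03), (30, 0.01)]:     # G1, G2, G3
    copt = add_unit(copt, cap, fo)

cum = 0.0
for x in sorted(copt, reverse=True):                     # cumulative values, eq. (11)
    cum += copt[x]
    print(f"{x:4d} MW out  p = {copt[x]:.6f}   p_cum = {cum:.6f}")
```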

3.1.2 The load model. There are many possible ways to model the power system load. One of the most used consists of the so-called load duration curve (LDC). It is a cumulative load curve, assembled from the daily load peaks, with the abscissa indicating the percentage of days on which the peak exceeds the amount of load on the ordinate. Figure 5 depicts an example of such a curve. Other models include more or less detailed histograms of the chronological load variation during previously chosen typical periods of time, or an analytical approximation to the load curve.

Figure 5. Sample load duration curve (daily peak load on the ordinate versus percentage of days, 0-100%, on the abscissa).

3.1.3 The loss-of-load probability method. This method combines the COPT model of the generation system and the LDC model of the system load. There is loss of load whenever the remaining generation after an outage is less than the load demand. From this it is clear that an outage does not always lead to a loss of load. Quoting a classic in power systems reliability, "a particular capacity outage will contribute to the system expected load loss by an amount equal to the product of the

probability of existence of the particular outage and the number of time units in the study interval that loss of load would occur if such a capacity outage were to exist" (Billinton, 1970). Figure 6 illustrates the loss-of-load probability computation procedure.

Figure 6. Effect of an outage on the loss of load (load duration curve with the installed capacity and the peak load marked; an outage of magnitude C_k leaves the load curtailed for a time t_k).

The contribution to the loss of load made by an outage of magnitude C_k in figure 6 is then given by p_k t_k, where p_k is the probability of occurrence of the outage and t_k is the loss-of-load duration. The total loss of load will be given by

LOLP = Σ_{k=1..n} p_k t_k    (13)

Figure 7. Time periods of load loss occurrences (load duration curve of daily peaks with the installed capacity marked; each discrete outage magnitude adds an increment T_k of curtailed time).

If the LDC refers to a year and is obtained from the daily peaks, as

stated above, the LOLP will be expressed in days/year. In (13) the probabilities used are the exact (individual) values for the capacities on outage considered (the values in the second column of Table 1 revisited, e.g.) and the time values are cumulative. If the cumulative probabilities are to be used, the same result will be obtained if each time value represents the increment in curtailed time (T_k in figure 7) associated with each discrete outage magnitude. Expression (13) will assume the form

    LOLP = Σ (k=1..K) p_cum,k T_k                                 (14)
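As a numerical illustration of expression (13), the sketch below is given; it is illustrative only, assuming the COPT in the dictionary form of the previous sketch and a hypothetical straight-line load duration curve supplied as an exceedance function.

# Illustrative LOLP computation (expression (13)): exact state probabilities
# times the time (days/year) during which each outage causes load loss.

def lolp_days_per_year(copt, installed_mw, peak_exceedance, days=365):
    """copt: {capacity_out_MW: probability};
    peak_exceedance(load_MW): fraction of days whose daily peak exceeds load_MW."""
    total = 0.0
    for outage, prob in copt.items():
        remaining = installed_mw - outage
        t_k = peak_exceedance(remaining) * days       # days/year with peak above remaining capacity
        total += prob * t_k
    return total

# Hypothetical straight-line LDC between a 40 MW minimum and a 90 MW maximum daily peak:
def peak_exceedance(load_mw, lo=40.0, hi=90.0):
    if load_mw >= hi:
        return 0.0
    if load_mw <= lo:
        return 1.0
    return (hi - load_mw) / (hi - lo)

copt = {0: 0.941094, 20: 0.019206, 30: 0.009506, 50: 0.029300,
        70: 0.000594, 80: 0.000294, 100: 0.000006}
print(lolp_days_per_year(copt, installed_mw=100, peak_exceedance=peak_exceedance))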

3.2 The Frequency and Duration Approach

3.2.1 The load model. The LDC representation hides the daily load variations, for it implies that demand equals the peak during the whole day. Thus, the LOLP approach provides a not very accurate measure of

Figure 8. Two-level representation of the daily load: (a) approximation to the real daily load curve; (b) random sequence of peak levels Li over the low load level L0.

system failure probability, besides preventing the determination of the failure frequency. It is nevertheless extensively used in planning studies, where the details of daily load variations may be considered of negligible importance by the decision makers. In the frequency and duration approach the load model must account for such variations. Though there is much work done on multilevel representation of load, the most widely used is still a model with two levels, as depicted in figure 8. The representation in figure 8 (a) is an approximation to the real daily load curve. L0, the low load level, is always the same, and each Li (figure 8 (b)) is (or may be) different from the others. The Li's occur in a random sequence. The exposure factor e = t/d describes the mean peak duration t, d being the duration of the load cycle (one day in figure 8 (b)); e is considered constant, lying between 0 and 1. Each value Li may have a certain number n_i of occurrences. For d = 1 day, the total period considered has an extension given by

    E = Σ (i=1..N) n_i

N being the number of load levels other than L0. In these conditions other parameters may be derived:

    probability of Li                        p(Li) = n_i e / E
    transition rate to a greater load        λ+Li = 0
    transition rate to a lesser load         λ-Li = 1 / e
    frequency of occurrence of Li            f_i = n_i / E
    probability of L0                        p(L0) = 1 - e
    transition rates of L0                   λ+L0 = 1 / (1 - e),   λ-L0 = 0
    frequency of occurrence of L0            f_0 = 1
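A short illustrative sketch of these load-model parameters follows; the list of peak levels Li, their numbers of occurrences n_i and the exposure factor e are hypothetical.

# Illustrative two-level load model parameters (frequency and duration approach).

def load_model(peaks, e):
    """peaks: list of (Li_MW, n_i occurrences); e: exposure factor (0 < e < 1)."""
    E = sum(n for _, n in peaks)                      # total number of days considered
    levels = []
    for li, n in peaks:
        levels.append({"load": li,
                       "probability": n * e / E,
                       "rate_up": 0.0,                # from a peak the load can only decrease
                       "rate_down": 1.0 / e,
                       "frequency": n / E})
    low = {"load": "L0", "probability": 1.0 - e,
           "rate_up": 1.0 / (1.0 - e), "rate_down": 0.0, "frequency": 1.0}
    return levels, low

levels, low = load_model([(90, 12), (80, 83), (70, 270)], e=0.5)   # hypothetical data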

3.2.2 The generation model. This is an extension of what has been said in 3.1.1. Associating a state j to each value of capacity deficiency, λ_j+ and λ_j- are the transition rates from state j to the states with higher and lower remaining capacities, respectively. These indices may be used to complement the information already contained in the COPTs. The frequency of state j, characterized by C_j MW on outage with state probability p_j, is given by

    f_j = p_j (λ_j+ + λ_j-)                                       (15)

and the cumulative frequency f_cum,j (frequency of finding the system

in a condition where the capacity on outage is equal to or greater than C_j) by

    f_cum,j = f_cum,j+1 + p_j (λ_j+ - λ_j-)                       (16)

Expression (11) still holds for the cumulative probabilities.

Case study no. 3 - A complete COPT may be built with all the indices now referred for the system considered in case study no. 2, if some more data are included to define the three groups:

    Group    Failure rate (yr⁻¹)    Repair rate (yr⁻¹)    Unavailability (POR)
    G1       0.2                    9.8                   0.02
    G2       0.4                    12.9                  0.03
    G3       0.1                    9.9                   0.01

The complete state-space representation of system states is depicted in figure 9. As there are two states with identical capacities on outage,

Figure 9. Complete state-space diagram.

it becomes necessary to combine them. The resultant state has a probability of occurrence given by the sum of the probabilities of the original states, 0.029106 + 0.000194 = 0.029300, as already found above. It

is necessary, however, to compute also the transition rates between the resultant state and the adjacent states. If the original states are identified by subscripts x and y and the resulting state by j, the transition rates to and from an adjacent state i are computed through the following expressions:

    λ_ji = (p_x λ_xi + p_y λ_yi) / (p_x + p_y)

    λ_ij = λ_ix + λ_iy
As an example,

    λ_70,50 = μ1 + 0 = 9.8 yr⁻¹

and

    λ_50,70 = (0.029106 * 0.2 + 0.000194 * 0) / (0.029106 + 0.000194) = 0.1987 yr⁻¹
Figure 10. State-space diagram after combining identical states.

If state i is one with higher remaining capacity than j, and k is a state with lower remaining capacity than j, then the transition rates from state j up and down are computed through

    λ_j+ = Σ_i λ_ji        and        λ_j- = Σ_k λ_jk

As an example,

    λ_80+ = 12.9 + 9.9 = 22.8 yr⁻¹        and        λ_80- = 0.2 yr⁻¹

The following table uses expressions (15) and (16).

COMPLETE COPT

    State   Capacity     Individual    Cumulative     λ+      λ-     Frequency   Cumulative
    no.     on outage    probability   probability                               frequency
    1         0 MW       0.941094      1.000000        0      0.7    0.65877        -
    2        20 MW       0.019206      0.058906        9.8    0.5    0.19782     0.65766
    3        30 MW       0.009506      0.039700        9.9    0.6    0.10038     0.47905
    4        50 MW       0.029300      0.030194       12.9    0.3    0.38796     0.39064
    5        70 MW       0.000594      0.000894       22.7    0.1    0.01354     0.02026
    6        80 MW       0.000294      0.000300       22.8    0.2    0.00882     0.00684
    7       100 MW       0.000006      0.000006       32.6    0      0.00019     0.00019
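The complete table can be generated mechanically from the unit data. The sketch below is illustrative only; it enumerates the unit states, merges identical outage capacities with the rate-combination rule given above, and then applies expressions (15) and (16). Small differences with respect to the table are due to rounding of the unavailabilities.

from itertools import product

# Units: (capacity MW, failure rate λ [1/yr], repair rate μ [1/yr])
units = [(20, 0.2, 9.8), (50, 0.4, 12.9), (30, 0.1, 9.9)]

states = {}   # outage MW -> [probability, p-weighted λ+, p-weighted λ-]
for pattern in product((0, 1), repeat=len(units)):    # 1 = unit on outage
    prob, up, down, outage = 1.0, 0.0, 0.0, 0
    for (cap, lam, mu), out in zip(units, pattern):
        if out:
            outage += cap
            prob *= lam / (lam + mu)                  # unavailability
            up += mu                                  # repair improves capacity
        else:
            prob *= mu / (lam + mu)                   # availability
            down += lam                               # failure worsens capacity
    s = states.setdefault(outage, [0.0, 0.0, 0.0])
    s[0] += prob
    s[1] += prob * up                                 # merged rates: Σ p_i λ_i / Σ p_i
    s[2] += prob * down

rows = []
for outage in sorted(states):
    p, pu, pd = states[outage]
    lam_up, lam_dn = pu / p, pd / p
    rows.append((outage, p, lam_up, lam_dn, p * (lam_up + lam_dn)))   # frequency, (15)

f_cum, p_cum = 0.0, 0.0
for outage, p, lu, ld, f in reversed(rows):           # cumulate from the worst state up
    p_cum += p
    f_cum += p * (lu - ld)                            # expression (16)
    print(f"{outage:4d} MW  p={p:.6f}  p_cum={p_cum:.6f}  f={f:.5f}  f_cum={f_cum:.5f}")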

3.2.3 The frequency and duration method. The generation and the load models are to be considered independent, in the sense that an event probability within one of them does not change with changes in the other. On this basis it is possible to consider a procedure to merge the two models in order to assess the system reliability.
Figure 11. Distinction between success and failure domains (states with m_k > 0 and states with m_k < 0).

A margin m_k may be defined as the surplus of generating power relative to the load, given by

    m_k = C_j - L_i                                               (17)

where C_j now represents the remaining capacity in a generation outage state and L_i is a load state. A diagram may be built in which each state k is defined through m_k. The boundary between the states with positive m_k and negative m_k distinguishes the domains of success and failure states of the system as a whole (figure 11). The rates of departure of state k to higher and lower margin states are respectively

    λ_k+ = λ_Li- + λ_Cj+                                          (18)

and

    λ_k- = λ_Li+ + λ_Cj-                                          (19)

λ_k+ is the summation of the upward capacity transition rate, denoting an improvement in available capacity, and the downward load transition rate, denoting a reduction of load. The opposite applies to λ_k-. In figure 12 a combined generation-load model is presented, where six different load levels and three levels of capacity deficiency are considered.

Figure 12. Combined generation-load model example.

The availability of a margin state k is given by

    p_k = p_Cj p_Li                                               (20)

and its frequency by

    f_k = p_k (λ_k+ + λ_k-)                                       (21)

As identical margin states may result from different combinations of capacity and load states, they must be combined. Considering g identical states, the following combination shall be performed:

    p_k = Σ (i=1..g) p_i                                          (22)

    f_k = Σ (i=1..g) f_i                                          (23)

    λ_k± = Σ (i=1..g) p_i λ_i± / p_k                              (24)
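A compact illustrative sketch of the combination and merging operations (17)-(24) is given below; the capacity and load states are assumed to be available in the simple tuple form used in the earlier sketches.

# Illustrative margin-state construction (expressions (17)-(24)):
# combine independent capacity and load states, then merge equal margins.

def margin_states(capacity_states, load_states, installed_mw):
    """capacity_states: (outage_MW, prob, lam_up, lam_dn) rows of the COPT;
    load_states: (load_MW, prob, lam_up, lam_dn) rows of the load model."""
    merged = {}
    for c_out, pc, c_up, c_dn in capacity_states:
        for load, pl, l_up, l_dn in load_states:
            m = (installed_mw - c_out) - load         # margin, expression (17)
            p = pc * pl                               # availability, (20)
            up = c_up + l_dn                          # (18): capacity repaired or load reduced
            dn = c_dn + l_up                          # (19)
            s = merged.setdefault(m, [0.0, 0.0, 0.0])
            s[0] += p                                 # (22)
            s[1] += p * up                            # p-weighted rates for (24)
            s[2] += p * dn
    table = []
    for m in sorted(merged, reverse=True):
        p, pu, pd = merged[m]
        up, dn = pu / p, pd / p                       # (24)
        table.append((m, p, up, dn, p * (up + dn)))   # state frequency, (21) and (23)
    return table

def failure_indices(table):
    """Cumulative probability and frequency over the negative-margin (failure) states."""
    p_f = sum(p for m, p, up, dn, f in table if m < 0)
    f_f = sum(p * (up - dn) for m, p, up, dn, f in table if m < 0)
    return p_f, f_f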

A margin state table may be built with these values of probabilities and frequencies (state and cumulative values) and used to compute the whole system probability and frequency of failure. If d is the first negative margin state index, then the probability of failure will be

    P_F = p_cum,d        and the frequency        f_F = f_cum,d

The method just summarized has the advantage of giving detailed information about the probabilities and frequencies of failure states, relating directly to the magnitude of capacity on outage in each state.

3.3 Generation Expansion Planning

The scheduling of generating unit additions to an existing system is usually performed seeking to maintain a risk level below a chosen value. This risk level is expressed in one of the possible reliability indices. A load forecast must be available for the period under consideration, modeled according to the generation model used. A simulation process is then activated, by means of which the instants of generation additions are determined by the need to avoid excessive risk levels (from the strict point of view of reliability), when the forecast load places too heavy a demand on the generating units. If several different risk levels are used, the resulting schedules may be compared from the economic point of view. Other approaches are possible, where multiobjective programming allows the simultaneous consideration of economic and reliability criteria.

4. BULK POWER SYSTEMS

4.1 Introduction

Though the complexity of power systems and the diversity of their components usually lead to sectorial analysis, it is desirable to integrate as much as possible the evaluation of reliability, aiming to obtain, if not global, at least very general indices that characterize system performance taken as a whole. This perspective has led to approaches that seek to model the generation and transmission systems together. The latter is usually simplified in order to account for transformer and station failures within the model of the transmission lines. In bulk power system reliability studies the generation units are modeled by one of the methods briefly described in the previous section. On their hand, transmission lines are characterized by failure rates that depend both on line length and on the number of terminals. In general, one may consider the use of two broad categories of methods for Bulk Power Systems (BPS) reliability evaluation: the simulation methods (e.g. using the Monte Carlo approach) and the analytical methods. However, before any details are considered, a primary question is to be raised: what criteria are to be used to distinguish the success and failure states of a BPS? As a tentative answer some important ones may be referred: generation capacity not enough to meet the load, interruption of the continuity of supply to a load point, overloads on transmission lines, bus voltages outside tolerances. The first one is used in generation systems evaluation. The second is used in some simplified approaches to transmission systems reliability evaluation, as will be seen in 4.2. However, they are all considered together in BPS studies. The system states that proceed from any of the above conditions do not necessarily lead to a global failure state. System protections usually perform the functional isolation of the faulted element and prevent collapses. The stated criteria allow to measure the undesirability of some occurrences as a point of departure to obtain system indices. Both simulation and analytical methods are, in most cases, based on the enumeration of system states, testing each state against failure criteria.

4.2 The average interruption method

This method has been one of the first approaches to evaluate transmission systems reliability. It provides a measure of the continuity of supply, not observing any other criteria. The algorithm used is rather simple and deals with series and parallel structures to model the transmission network. The main principles behind it may be stated as: 1) every component is bistable, i.e. it is either working or in failure. The probability of a component being in its failure state is given by its forced outage rate, or unavailability, p, as given by (7). The availability is q = 1 - p. 2) There is no dependence between failures of different components. The probability of finding components i and j on outage is therefore p_i p_j. 3) The probability of success of a

series of two components i and j is q_s = q_i q_j. The probability of failure is p_s = 1 - q_i q_j = p_i + p_j - p_i p_j. If p is very small, p_s becomes simply p_i + p_j. 4) The probability of failure of two components in parallel is p_p = p_i p_j.
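As a small illustration of rules 1) to 4), the sketch below combines hypothetical component unavailabilities in series and in parallel; names and values are not part of the original text.

# Illustrative series/parallel combination rules of the average interruption method.

def series_unavailability(p_list):
    """Exact: 1 - product of availabilities; approximation: sum of unavailabilities."""
    q = 1.0
    for p in p_list:
        q *= (1.0 - p)
    exact = 1.0 - q
    approx = sum(p_list)              # valid when every p is very small
    return exact, approx

def parallel_unavailability(p_list):
    """All (redundant) components must be on outage at the same time."""
    p = 1.0
    for pi in p_list:
        p *= pi
    return p

print(series_unavailability([0.01, 0.02]))     # (0.0298, 0.03)
print(parallel_unavailability([0.01, 0.02]))   # 0.0002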

An Average Customer Annual Interruption Rate is defined for each bus in the physical system. It denotes the expected number of days in a year that an outage condition will occur for the load bus. Thus, it is a method that provides a local index for each bus but fails to obtain global indices to characterize system performance.

4.3 BPS approaches

The developments registered since 1964 in transmission systems and BPS reliability evaluation have led to the application of different techniques, even inside the universe of analytical methods. The common basis of all of them will be presented here as much as possible, along with some considerations about intermediate procedures as, for example, the load flow analysis.

4.3.1 Model of system branches. Each branch is usually modeled according to the state-space diagram in figure 4. As a consequence, expressions (1) through (6) still apply in this case. Some additional indices may be defined for transmission systems. One of them is a particular case of (9) and is defined as the probability of transmission network success, P_S = Π (i=1..n) D_i, n being the number of branches. The departure rate from the state with probability P_S is λ_S = Σ (i=1..n) λ_i. Considering the state characterized by branch j on outage, its probability of occurrence is given by (29):

    P_j = O_j Π (i=1..n, i≠j) D_i                                 (29)

In most cases the value of O_j is so small that the product in (29) results in a P_j practically identical to O_j (because Π D_i ≈ 1). It is common, therefore, to simplify to P_j = O_j. Hence, the situation of having two branches j and k on outage corresponds to a state with


approximate probability

    P_j,k = O_j O_k                                               (30)

The frequencies of occurrence of states with probabilities given by (29) and (30) are respectively

    f_j = P_j μ_j                                                 (31)

    f_j,k = P_j P_k (μ_j + μ_k)                                   (32)
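As a small illustration of (29)-(32), the sketch below enumerates the first- and second-order branch outage states; the branch unavailabilities and repair rates are hypothetical.

from itertools import combinations

# Illustrative first- and second-order branch outage states, expressions (29)-(32).
# branches: name -> (unavailability O, repair rate μ [1/yr])
branches = {"L1": (0.004, 80.0), "L2": (0.006, 60.0), "L3": (0.002, 90.0)}

single = {}   # branch -> (probability, frequency)
for j, (oj, mu_j) in branches.items():
    single[j] = (oj, oj * mu_j)                       # P_j ~ O_j, f_j ~ P_j * μ_j

double = {}   # (branch, branch) -> (probability, frequency)
for j, k in combinations(branches, 2):
    oj, mu_j = branches[j]
    ok, mu_k = branches[k]
    p = oj * ok                                       # (30)
    double[(j, k)] = (p, p * (mu_j + mu_k))           # (32)

print(single)
print(double)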

4.3.2 Contingency analysis. A contingency is an outage, leading to a particular system state. Obviously, it is not practical to perform an exhaustive contingency analysis in a large system, due both to the huge memory requirements and to the computation time. Therefore, some elimination procedure is needed to reduce the number of outage states considered. There are two main criteria to perform this reduction. According to one of them, the state space is truncated in such a way as to omit those states with a probability of occurrence below a predetermined value. According to the other criterion, a state is eliminated when its impact on system performance is negligible, even if its frequency of occurrence is considerable. A particular implementation using this latter approach quantifies the impact on system performance based on the severity of the line overload conditions resulting from each contingency considered. Generally speaking, the contingencies to be considered in BPS evaluations are those involving either generation outages, line outages, or generation and line outages. In practice, contingencies are usually limited to two components failed simultaneously; those of third and higher order are not considered. The load model clearly influences this stage of the study. The general approach to BPS reliability is to assume a constant load equal to the yearly peak. Under this condition, the distinction between a success and a failure state is quite clear, but it also leads to a pessimistic evaluation of system performance. As load changes with time, a certain outage may have minor consequences at one instant and very important consequences at some other instant. Some models account for this particular point by suitably combining the system and load representations (a time-varying model). These notes confine themselves to the former approach for the sake of simplicity. Either because of generation deficiency or of loss-of-load, the load-flow analysis can reveal the existence of overloaded lines. Load flow is a power system analysis tool. Its fundamental inputs are the electrical characteristics of the network, the load values and locations, the generator ratings and various other parameters that delimit the allowable excursions of some variables. As outputs, the most important are the bus voltages, the amount and direction of power

flows in the lines and the system losses. It assumes that the load in each bus is constant. The equations that describe the system operation are nonlinear and cannot be solved unless iterative techniques are used. Several of these have been studied and applied, such as the Gauss-Seidel, Newton-Raphson, decoupled or fast-decoupled load flows, just to mention some. They have been cited in increasing order of computation speed. However, even the fastest of all cannot cope with the requirements of most BPS reliability studies, where thousands of load-flow analyses may have to be performed. The solution usually adopted consists of computing an approximate load-flow solution by a method known as "DC Load-Flow". The system equations are linearized for the purpose, using some simplifications, and can then be solved by Gaussian elimination or matrix inversion. The solution obtained is approximate but leads to the determination of sufficiently accurate values for the power flows in the transmission lines. One drawback that may be pointed out is a consequence of one of the simplifications assumed for the linearization process: the voltage magnitudes are constant and equal to the nominal value at all the busses in the system. Eventual excursions outside pre-defined tolerances are not detected and cannot be considered and quantified in terms of reliability indices. In those cases where generation deficiency or loss-of-load occurs in the course of a contingency analysis, adjustments have to be made in order to maintain the balance between generation and demand. Load flows depend both on the locations and on the amount of power generated. It follows that load-flow analysis must include some form of generation redispatch. Although this may be accomplished in a variety of ways, each dependent on a particular and arbitrary criterion, the one that has been used in many implementations consists of minimizing line overloads. This minimization procedure depends on the load-flow algorithm. When the DC load-flow is used it is possible to formulate the dispatch problem in a form suitable for linear programming. The use of optimized generation redispatch procedures can avoid or mitigate some overload conditions and provide results not as pessimistic as if another strategy were applied.

4.3.3 Some particular aspects. Weather conditions have a significant influence on the failure rates of transmission lines. The two-weather Markov model shown in figure 13 assumes two possible states for the environmental conditions: normal and severe weather, their respective mean durations being N and S. The transition rates between the two weather states are given by 1/N and 1/S. Assuming independence between the cycles of failures and repairs of system components and the weather cycles, figure 13 b) represents a four-state model for one component. Here, (λ, μ) and (λ', μ') are the failure and repair rates under normal and severe weather conditions, respectively. The probabilities of each one of the four states are obtained through relatively complex expressions that would not help to clarify the understanding of the model. The procedure may be extended to the cases of two or three components, giving place to diagrams with eight and sixteen states.


Figure 13. Two-weather model: (a) weather states (normal and severe); (b) four-state model (component up/down under normal and severe weather).
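The closed-form steady-state expressions are not reproduced here; as an alternative illustration, the sketch below solves the balance equations of the four-state continuous-time Markov model numerically, with hypothetical weather durations and failure/repair rates.

import numpy as np

# Illustrative four-state model: component (up/down) x weather (normal/severe).
# States: 0 = up/normal, 1 = down/normal, 2 = up/severe, 3 = down/severe.
N, S = 200.0, 2.0            # mean durations of normal and severe weather (hours)
lam, mu = 0.001, 0.1         # failure/repair rates in normal weather (1/h)
lam_s, mu_s = 0.05, 0.1      # failure/repair rates in severe weather (1/h)

Q = np.zeros((4, 4))                 # transition-rate matrix
Q[0, 1], Q[1, 0] = lam, mu           # fail/repair under normal weather
Q[2, 3], Q[3, 2] = lam_s, mu_s       # fail/repair under severe weather
Q[0, 2] = Q[1, 3] = 1.0 / N          # normal -> severe weather
Q[2, 0] = Q[3, 1] = 1.0 / S          # severe -> normal weather
np.fill_diagonal(Q, -Q.sum(axis=1))  # diagonal makes row sums zero

# Solve p Q = 0 with sum(p) = 1.
A = np.vstack([Q.T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
p, *_ = np.linalg.lstsq(A, b, rcond=None)
print(dict(zip(["up/normal", "down/normal", "up/severe", "down/severe"], p.round(5))))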

Sometimes it happens that a single event gives place to more than one component failure at a time. These simultaneous outages are called common-mode failures and are usually due to natural events such as fires, earthquakes, storms, etc., or to interferences such as collisions with vehicles (either planes or surface transports). They usually occur where two or more lines are near each other. To be considered common-mode, the failures must not be independent. The approach normally used to deal with these situations restricts the universe of common-mode failures to those that occur on two lines sharing the same right-of-way, and assumes a single transition rate from the operational state to this particular failure state. Figure 14 depicts the model for two components, where λ_c is the common-mode failure rate.

Figure 14. Common-mode failure for a two-component model.

4.3.4 Indices calculations. In the course of the computational procedure represented in figure 12, the reliability indices for each bus in the power system are computed for each contingency. λ_jk, U_jk and r_jk are respectively the failure rate, the annual interruption time and the repair time of a bus j for contingency Ck. If p(Ck) denotes the probability of occurrence of contingency Ck, then the indices for

busbar j are computed as

    λ_j = Σ (k=1..h) p(Ck) λ_jk                                   (33)

    U_j = Σ (k=1..h) p(Ck) U_jk                                   (34)

    r_j = U_j / λ_j                                               (35)

where h is the number of contingencies considered. The global indices are then given by

    λ_G = Σ (j=1..N) λ_j                                          (36)

    U_G = Σ (j=1..N) U_j                                          (37)

    r_G = U_G / λ_G                                               (38)

where N is the number of contingencies corresponding to failure states.
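A short illustrative sketch of expressions (33)-(38) follows; the contingency probabilities and the per-contingency bus indices are hypothetical.

# Illustrative bus and global indices, expressions (33)-(38).
# contingencies: list of (p(Ck), {bus: (λ_jk [1/yr], U_jk [h/yr])})
contingencies = [
    (0.010, {"bus1": (2.0, 6.0), "bus2": (0.5, 1.0)}),
    (0.004, {"bus1": (1.0, 3.0), "bus3": (3.0, 12.0)}),
]

buses = {}
for p_ck, impact in contingencies:
    for bus, (lam_jk, u_jk) in impact.items():
        lam, u = buses.get(bus, (0.0, 0.0))
        buses[bus] = (lam + p_ck * lam_jk, u + p_ck * u_jk)       # (33), (34)

for bus, (lam, u) in buses.items():
    print(bus, lam, u, u / lam)                                   # r_j, (35)

lam_g = sum(lam for lam, _ in buses.values())                     # (36)
u_g = sum(u for _, u in buses.values())                           # (37)
print("global:", lam_g, u_g, u_g / lam_g)                         # (38)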

5. DISTRIBUTION SYSTEMS

5.1 Introduction

In distribution systems the consideration of serial and parallel arrangements of physical components is essential to evaluate reliability. In fact, distribution networks have an inherently radial structure, as exemplified in figure 15. As in BPS studies, local and global indices may be calculated. The former concern the designated load-point reliability, of primary interest from the customer point of view, as they measure the quality of service provided by the utility. The latter characterize the performance of the distribution system as a whole.

5.2 Load-point reliability

As distribution systems are made of series arrangements of feeder


branches, where sometimes a few branches are connected in parallel, a review of the evaluation of these typical structures is interspersed in the text that follows. The symbols used in (1) through (6) shall be maintained.

Figure 15. Distribution system structure example.

The probability of failure of a series of branches is given by

    P_FS = 1 - Π (i=1..n) D_i                                     (39)

By definition, the frequency of failure of the same system is

    f_FS = [Π (i=1..n) D_i] [Σ (i=1..n) λ_i]                      (40)

Finally, the mean duration of system failure is

    T_FS = P_FS / f_FS                                            (41)

If all the D_i (i = 1, 2, ..., n) are high enough, Π D_i ≈ 1 and f_FS ≈ Σ (i=1..n) λ_i. Under the same conditions it will be verified that U_i ≈ λ_i T_ri, and T_FS becomes, after some manipulations derived from the approximations mentioned above,

    T_FS = Σ (i=1..n) (λ_i T_ri) / Σ (i=1..n) λ_i                 (42)

where T_ri = 1 / μ_i.

Similar reasonings may be applied to a set of components in parallel. The probability of failure is

    P_FP = Π (i=1..n) P_i                                         (43)

and the frequency

    f_FP = [Π (i=1..n) P_i] [Σ (i=1..n) μ_i]                      (44)

As T_FP = P_FP / f_FP, it simply becomes

    T_FP = 1 / Σ (i=1..n) μ_i                                     (45)

Assuming simplifications similar to those referred above, the failure frequency becomes

    f_FP = [Π (i=1..n) (λ_i T_ri)] [Σ (i=1..n) (1 / T_ri)]        (46)

The most general case of parallel connection in distribution networks involves only two branches. A single equivalent branch to be considered in series must be determined. Its failure rate, approximately equal to the frequency as obtained from (46) for n=2, is

    λ_e = λ_1 λ_2 (T_r1 + T_r2)

The equivalent average failure duration, from (45), becomes

    T_e = 1 / (μ_1 + μ_2)
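As an illustration of the load-point evaluation just described, the sketch below (with hypothetical branch data) reduces a two-branch parallel section to its series equivalent and then applies the series approximations.

# Illustrative load-point reliability: reduce a parallel pair to an equivalent
# series branch, then combine the series path (expressions (39)-(46)).

def parallel_equivalent(branch1, branch2):
    """branch = (failure rate λ [1/yr], repair time T_r [yr]); returns equivalent branch."""
    (l1, t1), (l2, t2) = branch1, branch2
    lam_e = l1 * l2 * (t1 + t2)              # equivalent failure rate
    t_e = 1.0 / (1.0 / t1 + 1.0 / t2)        # equivalent repair time, 1/(μ1 + μ2)
    return lam_e, t_e

def series_path(branches):
    """Approximate series indices: failure rate, mean failure duration, unavailability."""
    lam = sum(l for l, _ in branches)
    dur = sum(l * t for l, t in branches) / lam       # expression (42)
    return lam, dur, lam * dur

# Hypothetical feeder: two series branches plus one parallel pair (repair times in years).
pair = parallel_equivalent((0.5, 8.0 / 8760.0), (0.4, 10.0 / 8760.0))
lam, dur, unav = series_path([(0.2, 4.0 / 8760.0), (0.3, 6.0 / 8760.0), pair])
print(f"failure rate {lam:.4f} /yr, duration {dur * 8760:.2f} h, unavailability {unav:.6f}")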

After reducing the parallel connections to series equivalents, the application of (39), (41) and (42) quantifies the load-point reliability through the probability of failure, the failure frequency (in failures/year) and the average duration of failures (in hours).

5.3 System indices

The load-point indices are suitably combined to quantify system performance. It will be assumed that the load-point indices refer to the equivalent series subsystem indices, and the respective subscripts shall be simply "F" instead of "FS". System failure frequency may then be obtained through the ratio of the total number of customer interruptions to the total number of customers:
    f_sys = Σ (i=1..m) (N_i f_Fi) / Σ (i=1..m) N_i                (47)

where m is the number of branches in series on the path to the load points and N_i is the number of customers on branch i. Mean system failure duration is the ratio of the total yearly duration of customer interruptions to the total yearly number of customer interruptions:

    T_sys = Σ (i=1..m) (N_i f_Fi T_Fi) / Σ (i=1..m) (N_i f_Fi)    (48)

Finally, the average total interruption time per customer is an index used by some authors:

    H_sys = Σ (i=1..m) (N_i f_Fi T_Fi) / Σ (i=1..m) N_i
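These system indices can be illustrated with a few lines of code; the customer counts and load-point indices below are hypothetical.

# Illustrative system indices (47), (48) and average interruption time per customer.
# rows: (customers N_i, failure frequency f_Fi [1/yr], mean failure duration T_Fi [h])
feeder = [(400, 0.8, 3.0), (250, 1.2, 5.0), (150, 0.5, 2.0)]

customers = sum(n for n, _, _ in feeder)
interruptions = sum(n * f for n, f, _ in feeder)
interruption_hours = sum(n * f * t for n, f, t in feeder)

f_sys = interruptions / customers                # (47), interruptions per customer per year
t_sys = interruption_hours / interruptions       # (48), hours per interruption
h_sys = interruption_hours / customers           # hours per customer per year

print(f"f_sys = {f_sys:.3f}, T_sys = {t_sys:.2f} h, H_sys = {h_sys:.2f} h")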

5.4 Further remarks

The reasonings just presented assume an ever faultless state of the protection system, namely the circuit breakers. These devices may actually fail to operate when needed, or even operate erroneously, tripping the supply as if a fault on a branch had occurred. This adds

to the overall system vulnerability. There are techniques, which will not be covered here, to generalize the model described in order to account for the circuit breakers' own failure indices. There are several other factors that influence distribution network performance, requiring the use of other models. Some of the more important ones are now briefly pointed out, all referring to situations of redundancy of two components. It may happen that, while one of two redundant components is on maintenance, the other experiences a forced outage. Also, when one of two components is on outage, be it forced or due to maintenance, the other may become overloaded beyond the technically allowable limits. Finally, common-mode failures, as referred for BPS, are also possible with two redundant components in a distribution network.

6. SPINNING RESERVE GENERATION RELIABILITY

At the beginning of these notes the need to conduct operational or short-term reliability studies in power systems has been referred to. In section 3 generating capacity reliability evaluation has been introduced in the planner's perspective. Here a brief introduction to short-term generation reliability will be made.

6.1 Introduction

The operation of power plants within a system must be managed according to two fundamental criteria: economy and security. Short-term forecasts of load evolution are normally available to system operators. According to the present and near-term load levels, decisions must be taken about which groups must be in service and, among them, how the load must be shared to minimize the cost of supplying demand. This is accomplished by means of some computational procedure, in many cases based on dynamic programming techniques. It will be assumed here that a minimum running cost table will be available as a result of this procedure, indicating, for a finite set of load levels, the optimum generation schedule to serve each level of demand. From the reliability point of view, the fundamental question is: are the scheduled generators able to respond satisfactorily to the eventual forced outage of some unit? The approach presented here has been chosen for its generality, which does not mean alternatives do not exist.

6.2 The security function

This formulation has first been introduced by Patton in 1970, and has been complemented by later contributions. The security function assesses the system security in an hour-to-hour schedule by means of probabilistic calculations. It may be set equal to

    S(t) = Σ_i P_i(t) Q_i(t)                                      (49)

where P_i(t) is the probability that the system is in state i at time t and Q_i(t) is the probability that state i constitutes a breach of system security at time t. If the only breach of security is insufficient generating capacity, (49) may be directly applied to spinning reserve studies. Also, if there are no uncertainties in load forecasts, Q_i(t) will be a binary variable, assuming the value '1' when demand exceeds generation and '0' in the opposite situation. The value assumed by S(t) shall never be greater than a pre-defined "maximum tolerable insecurity level" (MTIL). By repeatedly calculating S(t) it is possible to identify the need for a different allocation of the available generators. There always exist units on standby, running in a no-load condition. When S(t) > MTIL it is necessary to verify whether adequate standby capacity can be put into service within a time interval measured from the present instant to the time when S(t) will become higher than MTIL if no action is taken. If the answer is negative, the generation allocation is modified to a schedule of the units making up the combination that has the next higher total capacity in the minimum running cost table. S(t) is again computed for the new schedule and the process repeats as many times as necessary to assure that S(t) < MTIL for the next time interval considered. Assuming again a two-state Markov process to model a generator, the probability of finding unit i working at time t, given that it was working at time zero, is

    D_i(t) = μ_i / (λ_i + μ_i) + [λ_i / (λ_i + μ_i)] e^-(λ_i + μ_i)t        (50)

and the probability of finding it on outage

    O_i(t) = [λ_i / (λ_i + μ_i)] [1 - e^-(λ_i + μ_i)t]                      (51)
It is easy to verify that for t = ∞, D_i(t) and O_i(t) become equivalent to (1) and (2), given that m = 1/λ and r = 1/μ. Finally, the probability of the system being in state i at time t, as in (49), may be evaluated through (9). It should be noted that a table similar to the COPTs in section 3 may be constructed for each different generation schedule. The value of the security function then becomes equivalent to the cumulative probability of loss-of-load calculated for values of power equal to or greater than the difference "total capacity in service at time t - forecast load at time t".
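A compact illustrative sketch of this hour-by-hour security check follows, with hypothetical unit data, a single forecast load and the binary Q_i(t) discussed above.

import math
from itertools import product

# Illustrative security function S(t), expressions (49)-(51), for a given schedule.
# units: (capacity MW, failure rate λ [1/h], repair rate μ [1/h]), all in service at t = 0.
units = [(100, 0.0005, 0.02), (80, 0.0008, 0.03), (60, 0.0010, 0.05)]

def security(units, load_mw, t_hours):
    s = 0.0
    for pattern in product((True, False), repeat=len(units)):     # True = still working at t
        prob, capacity = 1.0, 0.0
        for (cap, lam, mu), working in zip(units, pattern):
            d = mu / (lam + mu) + (lam / (lam + mu)) * math.exp(-(lam + mu) * t_hours)  # (50)
            prob *= d if working else (1.0 - d)                    # (51) is the complement
            capacity += cap if working else 0.0
        q = 1.0 if capacity < load_mw else 0.0                     # binary breach indicator
        s += prob * q                                              # (49)
    return s

MTIL = 0.001
s = security(units, load_mw=170.0, t_hours=4.0)
print(f"S(4 h) = {s:.5f}  ->  {'reschedule' if s > MTIL else 'ok'}")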

7. REFERENCES

/1/ R. Billinton. Power Systems Reliability Evaluation (Gordon and Breach Science Publishers Inc., 1970).
/2/ R. Billinton et al. Power System Reliability Calculations (MIT Press, 1973).
/3/ J. Endrenyi. Reliability Modelling in Electric Power Systems (John Wiley and Sons, Ltd., 1978).
/4/ I. Nagrath and D. Kothari. Modern Power System Analysis (McGraw-Hill, 1980).
/5/ A. D. Patton, "Short-term Reliability Calculation". IEEE Trans. on Power Apparatus and Systems, vol. PAS-89, April 1970.

SOFTWARE RELIABILITY: A STUDY CASE

José Munera
R&D Dept., Telefónica
Beatriz de Bobadilla 3
28040 MADRID (Spain)

ABSTRACT

This paper presents a real-time development project as a case study of the problem of producing good-quality software. It describes the Software Engineering methods and tools currently being used within the development, and how they can improve the behaviour of the software in terms of some quality criteria widely used in software design.

1. OBJECTIVE.

This contribution presents a particular case of the solution to the problem of producing software with the quality characteristics required to control a quite complex real-time system. As this case refers to a development project currently under design at the R&D Department of Telefónica, the interest is focused on the description of a set of Software Engineering methods and tools and the way they are applied to the project, rather than on a discussion of the advantages and disadvantages of such methods and tools when compared to other available alternatives that could also be used in this kind of project. The study of the manner in which these methods are used in the project, and of the benefits that this usage is expected to provide, will be useful to try improvements in both the methods themselves and their application to real-life cases; on the other hand, the comparison of the methods presented here with other alternatives will perhaps allow some refinements to be introduced in the development methods that all the people involved in real-time software design use more and more

every day, thus contributing to minimize the ever-increasing cost of software (fig. 1).

FIG. 1. SOFTWARE COST (relative share of software design and hardware in total system cost, 1960-1990, in %).

2. INTRODUCTION.

Software has been heavily used in telecommunication systems for about 15 years, although the exploitation of a great deal of its real capabilities has become a reality due mainly to three major facts:

1. Availability of microprocessors at reasonable prices.

2. Increase of the integration of hardware devices (memory, peripherals, etc.).

3. Availability of sophisticated tools for helping software development.

These facts have made it possible for the designer to introduce devices controlled by software into almost every kind of product, maintaining prices at a reasonable level while increasing more and more the abilities and features offered by these products to their final users. Although quite late in using software as part of its products, the telecommunication industry has been forced to learn fast what it might expect from software, as well as what price it had to pay in order to fulfill those expectations; and this has been learned quite fortunately, as the new generation of telecommunication systems and networks becomes operative, especially the ISDN (Integrated Services Digital Network), whose very concept would be impossible without the extensive usage of software in its implementations. During this learning period, it was rapidly pointed out that the reliability requirements of the systems forced severe requirements on the reliability of their software components, and at the same time required large amounts of the total system's software to be devoted to maintenance tasks, an amount that is, for the majority of medium to large systems, around 50%. The reasons for this are a consequence of the very strict requirements for tolerable failures (typically, one of these systems shall not be out of service for more than one hour every 20 years). Satisfying these requirements means designing very high-quality software, which must have defense mechanisms that allow proper functioning of the system even when some of its components are in error, and that are able to identify and isolate the failing components. Besides, the long lifetime of these systems, together with the fast technological evolution and the new requirements that are added to them after installation, requires their design to be made in such a way that modifications may be introduced smoothly, without affecting normal operation and with no significant errors introduced along with the modification. This is the key for the systems to keep up with the technology and to be able to satisfy the new user requirements for some decades. Finally, the size and complexity of these products' software will in general require large design groups, which increases the tendency to make software errors because of communication problems among the designers in the group.

To minimize the error proneness of these large groups, a common set of rules, methods and standards, known and accepted by the whole group, should be defined and used throughout the project, and their usage supervised by some kind of control mechanism in order to ensure their correct and consistent use. Summarizing, three major characteristics might be pointed out for telecommunication systems today:

1. Strong reliability requirements, which force designing for reliability and large amounts of software devoted to maintenance.

2. A high rate of modification after installation, which means designing for independence between parts and for safe on-line modifications.

3. Large design groups, which are prone to communication errors, and which require using common methods and standards during the development.

These characteristics are going to be present in the development of the TESYS-B data packet switching system, that will fulfill the data network's requirements foreseen for the 90's. What is being done in this project, and its impact on the system's reliability will be discussed in the following sections of this paper.

3. BASIC CONCEPTS.

Before going to the main subject, it is advisable to provide a set of basic concepts about reliability in general and their application to software. Definitions for software error and software reliability, as given by Myers, will be introduced, and their meaning commented upon.

Differences between hardware and software failures and errors will also be discussed, and some general criteria for improving software reliability will be proposed.

3.1 Definitions.

SOFTWARE ERROR. A software error is present when the software does not do what the user reasonably expects it to do. A software failure is an occurrence of a software error.

The most important point in this definition is the use of the word "reasonably", which might be interpreted as a cause of ambiguity in the definition (what is reasonable?), but which takes into account the fact that the user specifications of a software product are neither complete nor correct, and so the user may reasonably expect the system to behave in a certain manner not described by the specification. Another important point of the definition is the explicit distinction between static aspects (error) and dynamic ones (failure) to be considered when studying the product's reliability. In this respect, one can think of two products having the same number of errors (same static conditions), but distributed in different modules, which will probably make their reliability characteristics different, depending on how often the error-containing modules are run in each product (dynamic conditions).

SOFTWARE RELIABILITY. Software reliability is the probability that the software will execute for a particular period of time without a failure, weighted by the cost to the user of each failure encountered.

The importance of this definition is the introduction of a certain economic quantification for the software failure, which measures its severity in terms of the impact caused to the system's users, rather than the impact on the design, which tends to be the major concern of the design team but which is not always the best criterion, as shown by the unfortunate case of the first Venus mission of NASA, which failed because of the apparently trivial error of changing a comma into a dot in a FORTRAN DO statement.

3.2 Hardware and Software failures and errors.

In order to clarify the issues of software reliability, it is important to consider the existing differences between the origin of software and hardware errors, as well as the way in which failures happen in each type of component. Starting with the errors, in hardware they may arise either during the design process or during the manufacturing process of individual components, whereas in software, errors can be considered to occur only during the design process, because the few errors that the manufacturing process can introduce are normally detected very easily and fixed immediately. On the other hand, irrespective of how correct the design is and how accurate the manufacturing, hardware components show wear-out characteristics that guarantee a failure somewhere during the lifetime, whereas this is not true for software components, which will perform

exactly the same as long as their supporting hardware does not fail. Hardware failures, as considered by current studies on reliability, are considered statistically independent from one another, and also independent of the environment of the failing component, which means that the reliability of the component is an inherent property of it (fig. 2); but for software components with design errors, the failures will only take place when certain environment conditions are present (a specific combination of input data, level of CPU load, etc.), which allows software reliability to be considered as associated with the manner in which the software is used, as well as with the product itself.

FAILURE RATE DESIGN & TESTING OPERATION END OF LIFECYCLE

TIME

RELIABILITY BEHAVIOUR
FIG. 2

As software failures are always caused by design errors, the behaviour of the software product may be considered deterministic, meaning that once a software error has been fixed, the product will never fail again due to this error (fig. 3); in other words, assuming that the process of fixing errors does not introduce new ones into the product, its reliability is directly related to the number of remaining errors, and it will improve as more errors are detected and fixed.

FIG. 3. RELIABILITY BEHAVIOUR (theoretical software failure rate versus time: design & testing, end of lifecycle).

There are theories that do not agree with this, but the concept seems to be quite descriptive for most cases, although the assumption made (no new errors introduced by fixing detected ones) is not true in general, and the expected decreasing rate of failures is not likely to happen (fig. 4).

FIG. 4. RELIABILITY BEHAVIOUR (real software failure rate versus time: design & testing, end of lifecycle).

3.3 Improving software reliability.

Assuming once again that software errors are directly caused by mistakes during the design process, and that software failures are somehow dependent on the number of remaining errors in the product, an obvious way to improve the reliability of such products is to minimize the number of errors present in the product when it is first installed,

and for doing so it is also obvious that any improvement in the design methods and support tools will translate into improvements in the quality of the product, which includes its reliability. The efforts being put into the development of Software Engineering all over the world, aimed at obtaining more sophisticated methods and tools for helping and supporting software design, are contributing primarily to lower the development costs and to improve the product's quality, by providing methods and tools for designing and testing that allow compact designs and early detection of errors; but this is not the only possible line of action that could improve software reliability. Considering for instance the cost to the user of a system failure, the use of design strategies (method independent) oriented towards designing fault tolerant systems will lead to products where total failures are not likely to happen, the system being degraded to some extent after a failure occurs, but without going into catastrophic crashes. As it should be accepted that putting a system to work in its real working environment does not guarantee the absence of errors, it is necessary to include in the system certain capabilities for detecting and isolating failing components and, if possible, for correcting detected errors or their consequences to the system, which offers two other methods of improving reliability: detecting the failures and correcting the errors during the normal operation of the system. So far, four possible methods for improving software reliability have been identified:

1. Minimizing the number of design errors, also known as fault avoidance design.

2. Minimizing the probability of a failure causing a total system crash, also known as fault tolerance design.

3. Including software components that are in charge of detecting failures and errors, denoted by fault detection design.

4. Including software components that correct the detected errors, denoted by fault correction design.

426 3.3.1 Fault avoidance design.

The main idea behind this concept is to optimize in terms of reliability, the development process, so that the number of errors produced during the design stage is minimum, and the number of errors detected and fixed during the test phases is maximum. The key to optimize the design process, is the use of an standard set of design methods and rules, in order to improve the understanding of the information among all the members of the design team, and of course to improve the quality of the design. These methods will define the way to go through the design steps (requirements, top-level design, etc.) by specifying what sould be produced at each step, what may and may not be done, etc. A very important part of the mechanisms that it provides, which methodology and also contribute misunderstandings, etc. methodology ensure the to detect are the control correct use of the inconsistencies,

It is also important that as much as possibe of the design process, including control mechanisms, be supported by automatic tools, so that the designer's work is easier and control and validation operations can be automated and more reliable. Testing the system and its components is a very important part of the development process (normally about half of the resources used in a development are devoted to testing activities), and it is also important from the system's reliability point of view, since a good testing of a system will detect errors before the system is put into normal operation. Using well defined testing methods, which should be consistent with the rest of the methodology is a must for a succesful testing, but the most important part here are the testing tools, whose major contribution is to ease the way tests are performed, so allowing the tester to plan and run complex tests, which would be difficult or impossible to make without the help of automatic tools. As an idea of what should be expected in terms of we can mention the following: 1. 2. 3. testing tools,

Good error detection and information capabilities for compilers, linkers, etc. (Syntax and semantic testing of source code). Symbolic simulators and emulators, with powerful commands (as close as possible to the source language). Consistency-check capability in the configuration management tools.

427 4. Real time and multi-processor distributed systems). emulation (only for real-time

In any case, the major problem in testing is the production test cases, which is still a manual task, and so error-prone.

of

3.3.2

Fault tolerance design.

The designers of a system can not assume that using fault avoidance design techniques the product will be error free, the assumption that the remaining errors in a system just put into operation are going to be very difficult to find and fix is probably closer to the reality. On the idea that there are errors in the system, the designer may try to improve its working behaviour by using architectural constructs that allow the system to perform adequately even under error conditions in some of its components. The most widely used technique for fault tolerance design is probably the redundancy of critical components, which works nicely when applied to hardware parts, but does not make much sense in software, because a copy of a faulty software component will fail exactly at the same point the original part did. What is normally done in software is to allow degradation of the service, that could be qualitative (faulty facilities go out of service for all the users), cuantitative (faulty users do not have access to the system), or a mixed approach. Degrading the service is a technique widely used in telecommunication systems, and is based on the existence of fault detection and isolation mechanisms present in the system.

3.3.3

Fault detection design.

Even when a system is designed for tolerating certain faults, as software failures are considered deterministic it is obvious the need for correcting the errors that causes the failures as soon as possible after a failure happens. This requires first of all that the failure be much as possible the causing error identified mechanism. detected, by means and as of some

Fault detection design aims to design the system with some kind of built-in mechanisms for detecting failures and/or identifying errors, which will detect some failures right when they occur.

428 Modern programming languages, such as Ada or CHILL provide specific constructs to deal with detecting and processing exception states, which may be defined by the designer. These languages also provide the option of including a run-time checks, that contribute to the detection of failures. set of

3.3.4

Fault correction design.

Fault correction by software is a technique well used in telecommunication systems... for their hardware components, specially by using redundant elements working in active-standby mode, and mechanisms for switching them when failures are detected in the active element. As has been mentioned early, similar techniques applied to software are not likely to improve the reliability, althoug some alternatives have been proposed, such as using functionally duplicated modules, but with different implementation. More useful approaches are those that try to correct the consequences of a failure rather than the error that causes it; among them, garbage collection mechanisms are a good and obvious example. These recovery techniques are useful to avoid propagation of failures through the system, but must be complemented with efficient detection mechanisms, that will inform of the probable presence of a failure that can be eventually corrected.

THE TESYS-B SYSTEM.

Te name TESYS-B represents a family of data packet switching systems that are planned to fulfill the requirements of public and private data networks during the next decade. The description of some of its design concepts and its development methodology is the subject of the rest of this paper. The TESYS-B family is seen as a natural evolution of the TESYS-A, currently serving public and private networks in several countrys, including of course the Spanish public data network. This evolution tries to take advantage of the new hardware products as well as the new software products and support tools; it also includes some changes to the architecture, oriented to match the new technological oportunities.

429 The system's hardware in its high-capacity configuration is composed of a set of Packet Switching Units (PSU), connected to a switching network that provides data links between any pair of PSU's (fig-5).

X.25 / ASYNC

X.25 / ASYNC

T.
-ZA.

GLOBAL PSU INTERCONNECTION SYSTEM

X.25 (MAINT" + ADMIN)

/ -

-zi

SSU

HARDWARE ARCHITECTURE
FIG. S
This network also links the System Supervision Units (SSU), which are PSUs in charge of controlling administration and maintenance tasks within the system, and of providing operator and network management communication. Each PSU supports several (up to 10) Line Interface Units (LIU), which control the packet lines (X.25/X.75) or asynchronous lines, and a Local Supervision Unit (LSU), all of them connected to a local link (fig. 6). Both LIU and LSU are controlled by a microprocessor, which allows the software to be distributed along the system.


PSU / SSU ARCHITECTURE - FIG. 6 (LIUs and LSU on the local interconnection system; interface to the GIS; optional disk; X.25/async lines).

Distributing the control elements so that the system's software may also be distributed is a good starting point for obtaining good fault tolerance characteristics in the system, as each PSU can work independently of the rest of the system, as can each LIU or LSU within a PSU. As this is not enough for solving the reliability of the system's software, other strategies are used in the design:

1. Common standards and methods.
2. High-level languages.
3. Automated tools for software development.
4. Well-defined documentation scheme.
5. Development control based on reviews.
6. Automated configuration management.
7. Modular and structured design, programming and testing.
8. Functional and operational independence among modules.

As a matter of fact, most of these strategies have been defined in the Development Methodology that, along with a support environment, has been the main activity of the R&D's Software Engineering Group since the end of 1984.

5. SYSTEM'S SOFTWARE ARCHITECTURE.

The basic goals for this software architecture are:

- Matching the distribution of the hardware.
- Ability for on-line modification.

It should be mentioned here that about 65% of the maintenance tasks during operation of a system are caused by software changes required by the users, which should be introduced in the system without affecting its operation. The main guideline for this design has been to achieve a high degree of functional independence among modules, by extensive use of standard logical interfaces, so that a working module needs a minimum of information about the rest of the system (just its interface with it) in order to perform its function. This guideline extends from the top-level design to the implementation phases by using standard operational interfaces, so that each module can only communicate with other parts of the system through a set of standard primitives. From a functional point of view, the system's software looks like a set of modules externally characterized by the signals (messages) that they send and receive, but hiding their function from the rest of the system (fig. 7). The logical test of the system may then be done at two levels, the first of them being testing the functionality of each module (local data transformations, sending the expected output messages when an expected input is present, etc.), and the second testing the consistency of the interface between two modules (is the message sent by module A what the receiving module is expecting to receive?). From the operational point of view (fig. 8), the standard interface (primitives) used for communicating/synchronizing the modules, or for them to access common resources or services, is centralized in the Operating System, so that most of the handling of critical resources is done in only one particular place within the software (the O.S.),


SOFTWARE FUNCTIONAL VIEW - FIG. 7 (modules characterized by their standard interfaces).
so detecting, isolating and correcting many of the errors involving critical resources is eased due to the low number of places that handle them. Modules that have some functional commonality are grouped in blocks, subsystems, etc., and so are the signals interchanged between modules in different blocks, so that higher levels blocks can also be characterized by their external interfaces, and testing procedures extended to them in the same way.

433

[ry

/MODULEX

SCHEDULING, INTERRUPTS,

TT

OS PRIMITIVES

OPERATING SYSTEM

SOFTWARE OPERATIONAL VIEW


FIG. 8
6. DEVELOPMENT METHODOLOGY.

The Development Methodology used in this project provides a set of standards and methods that organize the development tasks, the development groups, the project documentation, etc. The most important aspects of this Methodology, which are described below, are:

- Development Cycle
- Documentation Scheme
- Reviews and Change Control

6.1 Development Cycle.

This methodology is based upon a conventional development cycle organized as a set of sequential phases, in turn decomposed into well-identified tasks.

The development cycle (fig. 9) is composed of five phases, every one of them ending with a review and acceptance task applied to what the phase produced, which may force iteration on some of the phase's tasks.

DEVELOPMENT CYCLE - FIG. 9 ("this is what we want", "we can do it", "this is what it does", functional test plan, "this is how it does it", "God bless us, it works!").

The five phases are:
- Requirements
- Requirement's Analysis
- Top Level Design
- Detail Design
- System Building

The Requirements phase is basically handled by the final user (in this case by an ad-hoc group external to the design group), who defines what requirements the system must fulfill. The design group in this phase must formalize these requirements in a document defined by the methodology, which becomes the official reference for the rest of the development. Besides this formal objective, the intention of this phase is to put the users and the designers to work together, so that the defined requirements are as complete as possible, and the designers understand exactly what the user wants and what the user's critical points in the product are. Understanding what the real problem is is a good start for solving it, and also serves to minimize mistakes due to false ideas that the designer might have about the product, which would later cause the system to do more than required in some aspects and not to do some important functions. The next two phases are the start of the design activities, and cover the tasks associated with the functional design.

The first of them, Requirement's Analysis, tries to prove the viability of the project, by defining an initial planning for it that shows viability in terms of time and resources, and by establishing a set of high-level strategies and initial decisions that constrain the rest of the design and that will prove the technical viability of the project. It is in this phase that decisions such as which programming languages are going to be used, or which basic architectural constructs are to be used, are taken. During the Top Level Design phase, which is probably the key to the success of the project, the primary objective is to define the system's architecture, and to identify its functions and the interfaces among them. At the same time, the User's Documentation is written, the final planning is established and the possible configurations of the system are defined. The Top Level Design phase is also where the Integration Test Plan is designed, defining the functional and system tests that the system must pass before being put into operation. The analysis methods recommended for this phase are top-down methods, which are very convenient for decomposing the system into parts showing a greater level of detail at each step, until the last level of decomposition allows the definition of the concept of functional module,

related to the rest of the system through a well-defined interface, and simple enough to be implemented independently of the rest. This phase must verify the functional completeness of the design, as well as the consistency of the interfaces defined along the process of decomposing the system.

The last two phases of the development cycle cover the implementation and test of the system; after completing them, the system can be installed.

The Detail Design phase applies to each functional module identified at Top Level Design, which has been characterized by a functional specification (what the module does) and its interface specification; in this phase the modules are described in operational terms (how the module works), coded in the selected language and tested against their specification and their interface. The objective here is to make sure that the functional specification of each module maps properly onto its implementation, and also that the external behaviour of the module complies with its interface specification. The kind of errors normally detected in this phase are programming mistakes that drive the module to function wrongly.

When starting the last phase of this Development Cycle, the individual modules have been tested, so that one can assume that when a module receives a particular message it will perform a set of functions as planned and will issue a set of messages to the rest of the system (this is not always true, of course, but it should be the starting point of this phase). What is left to test then is the functional cooperation among the system modules for performing higher level functions, which is called integration test, and the behaviour of the different configurations of the system under working conditions closer to real ones, including overload, unexpected system inputs and so on, which is called system test. Most of the errors detected here come from the Top Level Design phase, since that is the point where cooperation rules and scenarios were defined, but there will also be implementation errors left that can be detected at this stage. For instance, if a module sends a message that contains a two-character string but the receiving module is expecting a three-character string, it is quite possible that this error will remain undetected until this phase (fig.10).
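This situation can be made concrete with a small, purely hypothetical sketch: each module passes a unit test written against its own assumption about the message format, and the mismatch only surfaces when the two cooperate:

    # Hypothetical illustration of an interface mismatch that unit tests miss.

    def sender():
        # Module A believes the status code is two characters long.
        return "OK"                      # two-character string

    def receiver(code):
        # Module B was written expecting a three-character code.
        if len(code) != 3:
            raise ValueError("malformed status code: %r" % code)
        return {"RUN": "running", "STP": "stopped"}.get(code, "unknown")

    # Unit tests: each module is correct against its own specification.
    assert sender() == "OK"
    assert receiver("RUN") == "running"

    # Integration test: the mismatch between the two specifications shows up
    # only when the modules cooperate.
    try:
        receiver(sender())
    except ValueError as err:
        print("integration test failed:", err)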

FIG. 10

6.2 Documentation Scheme.

The objective of every development is to characterize a product so that it is possible, using that characterization, to obtain as many systems as needed. This means that the output of the development is the documentation that describes what was made, how it works and how it can be built, rather than one or several prototypes of the system. The importance given to the documentation as the only valid product of a development requires that special care be taken regarding its format, contents and management. This methodology uses the following criteria for establishing the documentation scheme:

1. Information shall be formalized in a document as soon as it is available.

2. Information shall not be duplicated in different documents.

3. Each document to be produced shall have its format and the information that should be in it clearly defined.

4. All the documents of a development shall contain all the relevant information of the system.

Using these criteria, the methodology defines what documents must exist at the end of the development, what their formats are (at the level of main headers), what the contents of each section of every document is, and at what point in the development cycle each document should be produced. Naturally, the real contents of each document are conditioned by the project and the development team, but still the objective is to be sure of where one can find a particular class of information, and also that, through the definitions given by the methodology, the documents can be understood and interpreted correctly, so that they can be used consistently.

Looking again at the definition criteria, it can be noted that:

1. The first one attempts to avoid human errors by imposing the existence of a written reference as soon as the information becomes available and clear.

2. The non-redundant information (second criterion) tends to avoid the inconsistencies and contradictions that are possible when the same thing is described twice, and will also ease the updating of information.

3. The third one simplifies the necessary information retrieval operations, and at the same time contributes to easing the detection of inconsistencies and defects in the documents.

4. Finally, the last one means that there shall be one (and only one) written reference to any relevant aspect of the system, which, combined with the definition of the contents of each document, allows identifying either the desired aspect or its absence in the documentation.

On the other hand, using a standard documentation scheme makes the never-liked task of documenting easier for the designer, and also makes possible the use of automatic tools for generating the documents (editing, filing, etc.) and for their control (versions, configuration, etc.), which in turn eases the verification of consistency and completeness of the system documentation.
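A minimal sketch of the kind of automatic check such tools can perform is given below; the document types and required sections are invented for the illustration and are not the ones actually prescribed by the methodology:

    # Hypothetical completeness check: verify that each project document
    # contains the sections that the documentation scheme prescribes for it.

    REQUIRED_SECTIONS = {
        "requirements": ["scope", "functional requirements", "constraints"],
        "top level design": ["architecture", "interfaces", "configurations"],
    }

    def check_document(doc_type, present_sections):
        """Return the list of prescribed sections missing from a document."""
        required = REQUIRED_SECTIONS.get(doc_type, [])
        return [s for s in required if s not in present_sections]

    missing = check_document("top level design", ["architecture", "interfaces"])
    print(missing)        # ['configurations'] -> the review would flag this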

6.3 Reviews and Change Control.

In order to ensure the correctness and completeness of the tasks done at every phase of the development cycle, review activities shall take place at the end of each phase, with the objective of checking

that the required documentation has been generated, that it contains what is expected and that the contents of the documents satisfy the quality standards established for the project. Formal verification can be performed by automated tools without great problems, but quality control will in general require the presence of members of the design team (or of the quality control group when it exists), which will introduce subjective factors in the verification process. In any case, quality control may detect some design errors through reviews of the project documentation, and at the same time it may discover and correct the attempts at "bypassing" the existing standards, unfortunately quite frequent in software design activities.

This methodology establishes the creation of temporary groups for checking the documentation produced at any of the development phases against the applicable standards and the information that was used to produce it. The objective of these groups is then to identify errors within the documents (expected information not present, redundant non-expected information included, etc.) and to identify inconsistencies between the input information and the generated documentation (for instance, a specification may require finding the maximum of a table while the implementation finds the minimum). Depending on the kind of information to review, it is sometimes possible to use automatic tools for reviewing, at least partially, but the general rule is that human intervention is required.

The major problem that must be solved in order to properly use this kind of review procedure is the opposition of the design team to it, which in my opinion comes from two reasons:

1. Schedules too tight to allow time to study in depth the information to be reviewed before the review takes place. This is normally true, and comes from the fact that time is not properly allocated to these activities when the project is planned.

2. The feeling that this technique is a source of punitive actions against the designers, which has been true in many cases too.

To avoid these two problems, care should be taken that the project planning takes into account the time required for a review group to get ready for the review itself, and that the only objective of a review is to improve the product's quality.

A very important point during the development, and also after the product is in operation, is the way in which modifications are introduced into the system, either for correcting some errors or because the users ask for new or different functionality.

Considering for example the case of a detected error, the solution proposed for fixing it could have an impact on other parts of the system, so that fixing the error could introduce new ones into the system, perhaps more severe than the one fixed. On the other hand, there is a very strong trend in software design to consider that the only thing that matters is the code (after all, it is what runs in the system), so that when a modification has to be made, only the source code knows about it, thus creating inconsistencies between the product and its documentation, which is a very risky situation.

To avoid modifications out of control, this methodology proposes a rigid change control mechanism, which is triggered when an error is discovered or a change requested, and whose function is to analyze in depth the error or change request, which must be submitted to a control group along with documentation describing the problem and a proposed solution to it. As a result of this analysis, where the objective is to evaluate the impact that the change may have on the rest of the system, this control group can either validate the solution submitted to it or propose alternative ones, with a similar effect on the part of the system under modification but less impact on other functions. It is obvious that this control group needs to have a very detailed knowledge of the system, the design methodology and the development tools used, and that its effectiveness is directly related to the professional quality of its members.

7. SUPPORT TOOLS.

Developing software in accordance with the constraints imposed by a methodology like the one described in the previous section could be a near-impossible job in terms of human resources and time requirements if it were not supported by a set of tools that configure a development environment adequate to both the methodology and the project. In our case (Telefonica's R&D Dept.), the kind of projects we deal with is clearly identified, and the development methodology has been designed bearing in mind the type of project that will use it, so that the problem of selecting the tools was quite well bounded, the major constraint being what is available in the market. Our major goal when designing a development environment has been to achieve a high level of integration among its components, so that as much as possible of the information generated by every tool can be automatically used by the others, and so the environment itself can make consistency checks and detect errors to some extent (fig.11).

FIG. 11 AN INTEGRATED SUPPORT ENVIRONMENT (code generation, debugging, libraries, documentation, configuration control, planning and activity control tools sharing the project's documents, source and object code)
Starting with a minimum configuration, software development environments should include tools for generating object code from the source programs written by the designers. There is a wide set of compilers, linkers, etc., available in the market that can generate object code for several 8-bit, 16-bit or 32-bit microprocessors from source programs written in such popular languages as Pascal, C or PL/M. What we have is a variety of such compilers coming from several manufacturers, and the environment provides the adequate selection facilities, so that a source program written in Pascal, for instance, may be compiled using different compilers depending on its target machine, the tools that will be used for testing it, etc. What we do not have yet is any compiler for languages such as Ada or CHILL able to generate code for the microprocessors we normally use in our developments.

These code generation tools configure the minimum environment that is required, but they only help during the implementation tasks, and not in an extensive way. In any case, a good code generation system can detect a lot of errors and inconsistencies, saving time for the designers and improving the system's reliability.

The next logical step to increase the capability of a development environment is to include in it debugging tools that will support the testing and integration activities. Our Department, which has been working intensively with microprocessors since 1975, has a number of debugging tools, ranging from monitors with a minimum capability of setting breakpoints, dumping memory and modifying registers, to sophisticated debuggers with very elaborate control languages, high level debugging, real time emulation, symbolic traces, etc. As these tools are normally associated with some specific hardware and require specific formats for the files they use, communication facilities have been bought or developed so that source coding or compiling tools do not need to be aware of the testing tools that will be used later with the generated code.

These two sets of tools (code generation and debugging aids) have been the only development tools offered to the software designers for many years, but the increasing complexity of software designs makes it necessary, or at least very convenient, to provide tools for helping the designers in other phases or tasks of the development. In this sense, automatic documentation tools are always a good investment, for several reasons:

1. They can enforce very smoothly the correct application of the documentation standards for a given methodology, by offering many facilities to the user that follows the standards, and not being so friendly with those who don't.

2. The information that an automatic documentation tool can generate, such as when a document was updated and by whom, or how many versions of a document are currently available, is of great value to other tools, and of course to the design team.

The documentation problem is covered in our environment by two sets of tools:
- User Interface.
- Project Libraries.

The user interface helps individual designers to write or update draft versions of the project's documents or source programs. It enforces the use of standards by providing extra support for standard documents, and in general takes care of all the formatting and presentation issues of the documents, so that the user can concentrate on entering the proper information.

The project library helps to store all the project's documents, and also controls the available versions of each. It also provides a defense mechanism against spurious updates of documents, which if allowed could generate inconsistencies, by keeping track of all the accesses made to the library and by constraining such accesses, which require a previous authorization. When combined with configuration control methods and tools, the libraries can be a good help for system building, but in any case having a well organized library is a great help for keeping the project under control, especially when its size is medium to large.

Configuration control tools are of great help during the system's integration tasks. They use the information given by the design team to identify the components of the system and their relationships, and with the information given by the library they can perform operations such as: "Build the last version of the system", or "Build a system compatible with version X of a particular module". So far we only have the basic tool required to build up a more complete and effective configuration control tool, but as the functions of such a tool are already specified, manual operations make up for the lack of functionality of the basic tool, which is a commercial package and so is too general to fit our requirements exactly.

We also include in our environment automatic tools for helping the Top Level Design, quite important because they allow working out modifications in the top level design products (most of them graphics) without drafting problems, so that consistency is improved, and they can also perform some automatic tests, mainly for inconsistencies, that help to obtain a good product in what is probably the most critical phase of a software design.

Finally, mention shall be made of two tools that play a very important role in the project's development and in the integration of the support environment: the planning tools and the activity control tools. Planning tools are very useful for managing the project, defining the schedules and budgets, allocating resources to tasks and so on, and they also help to control how things go during the design, marking up delays or shortages of resources, and allowing simulation of certain situations.
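The two configuration control operations quoted above can be sketched roughly as follows; module names, version numbers and compatibility data are invented for the illustration:

    # Hypothetical sketch of version selection for configuration control.

    # Versions registered in the project library, newest last.
    LIBRARY = {
        "kernel":  ["1.0", "1.1", "2.0"],
        "io":      ["1.0", "1.2"],
        "monitor": ["3.0", "3.1"],
    }

    # Declared compatibility: which version of every other module each
    # kernel version was integrated and tested with.
    COMPATIBLE_WITH = {
        ("kernel", "1.1"): {"io": "1.0", "monitor": "3.0"},
        ("kernel", "2.0"): {"io": "1.2", "monitor": "3.1"},
    }

    def build_last_version():
        """'Build the last version of the system'."""
        return {mod: versions[-1] for mod, versions in LIBRARY.items()}

    def build_compatible_with(module, version):
        """'Build a system compatible with version X of a particular module'."""
        config = dict(COMPATIBLE_WITH[(module, version)])
        config[module] = version
        return config

    print(build_last_version())                   # newest of everything
    print(build_compatible_with("kernel", "1.1")) # consistent older configuration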

Activity control tools should make extensive use of the information given by the project library, as the repository of all the information concerning a project, and of the data contained in the project's plan. Combining these data with the configuration control ones, the activity control tools can foresee, and in some cases correct, potential problems that could arise during the development, related to shortage of resources, delays, milestone control, etc. Only planning tools are so far present in our environment but, based on an idea given to us by professor Barbacci from the Software Engineering Institute, we plan to implement an Expert System that will play the role of the activity control tool as soon as the resources become available, and we expect it to be one of the most important components of our environment.

8. CONCLUSIONS.

A set of general concepts about software reliability and how to improve it through software engineering procedures and software design concepts has been presented. As an application of these concepts, the basic architectural concepts of a system currently under development at the R&D Dept. of Telefonica and its development environment have been discussed. To finish the presentation, we can now say that:

1. Software reliability is a subject strongly related with the design process, and especially with the human mistakes made when translating information.

2. Improving software reliability can be achieved by using fault avoidance design techniques, which are based on software engineering procedures.

3. Fault tolerance, fault detection and fault correction techniques, which are required by any project, are based on architectural and design procedures, and are somehow opposite to fault avoidance techniques.

4. Using automatic tools for helping the software development activities tends to ease the process of developing software, and so to improve the quality of the products.

5. The best approach for building development environments is to integrate their components as much as possible, so that a maximum of their operations can be done automatically.


STUDY CASE ON NUCLEAR ENGINEERING

Julio González
NUCLENOR, S.A.
Hernán Cortés, 26
39003 SANTANDER
SPAIN

ABSTRACT. A nuclear plant is a complex industry in which a Probabilistic Safety Assessment is not only a method for a better understanding of each system but also a way of analyzing the whole plant and discovering weak points. This is the case of the PSA made for the "Santa María de Garoña Nuclear Power Station" in Burgos (Spain). A brief explanation of the plant followed by the main tasks of the study is presented in this lecture. The origin, objectives and organization of this work and the modifications that resulted from the study are helpful subjects for understanding the whole analysis.

1. INTRODUCTION

This lecture presents a general discussion of the main aspects of a Probabilistic Safety Assessment (PSA) recently performed for the Garoña Nuclear Power Station. It is assumed that an audience of individuals with a variety of backgrounds will attend this Reliability Engineering Course. Thus, a brief description of the Garoña Station and its systems is included in Section 2, for the benefit of all those attending the course with no specific training in nuclear technology. Section 3 deals with the origin, objectives and organization of the study. Section 4 dwells on the fundamental tasks of the study: accident sequence definition, system reliability and so on. The most relevant results of the study, as they relate to plant modifications, are discussed in Section 5. Finally, in Section 6, we refer to current activities at NUCLENOR in the area of PSA.

2. PLANT DESIGN AND SAFETY FEATURES

Santa María de Garoña Station has a nuclear reactor of the BWR-3 model supplied by General Electric and a containment building design of the Mark I type. This plant is owned by Nuclenor, S.A., a subsidiary of

Iberduero S.A. and Electra de Viesgo S.A. The first commercial power was produced in June 1971. The plant is rated at 460 MW and is located in the province of Burgos, Spain.

2.1 Nuclear Fuel

The uranium dioxide that constitutes the nuclear fuel is contained in a number of rods. Fuel rods are about 13 mm in diameter, with 3.6 meters of active fuel length sealed within zircaloy tubular cladding. Forty-nine fuel rods are installed in a metal channel of square cross section to form a fuel assembly. The channel is open at the top and bottom to permit coolant to flow upward through the assembly. The core contains 400 fuel assemblies, or 20,000 fuel rods, with a total weight of uranium dioxide of 80 tonnes.

2.2 Nuclear Steam Supply System

The nuclear steam supply system consists primarily of the reactor vessel, the reactor coolant recirculation pumps and piping, and the equipment inside the vessel. The nuclear fuel assemblies are arranged inside a core shroud in the reactor vessel. Water boils in the core and a mixture of steam and water flows out the top of the core and through steam separators at the top of the core shroud. Steam from the separators passes through dryers to remove all but traces of entrained water and then leaves the reactor vessel through pipes to the turbine generator. Water from the steam separators and water returned from the turbine condenser mix and flow downward through the annulus between the core shroud and the reactor vessel. From there it is pumped towards the bottom of the reactor vessel and back into the core (see Figure 1). The reactor pressure is maintained at about 70 kg/cm2. At this pressure water boils and forms steam at about 285°C.

2.3 Nuclear Accidents and Safety Features

After months of plant operation the reactor core contains a large amount and variety of radioactive atoms generated by the splitting of uranium in nuclear fission reactions. The radiation emitted by these atoms accounts for approximately 8% of the heat produced by the reactor core at steady state conditions. The fission process and, therefore, plant operation is immediately interrupted whenever some of the plant variables exceed predefined operating limits. This function is automatically performed by the so-called reactor protection system, which triggers a shutdown signal for all the control rods to be fully inserted into the core. Although the neutron chain reaction is quickly terminated, heat continues to be produced by the decay of the radioactive atoms present in the core. After several hours decay heat continues to be produced at a rate close to 1% of the reactor rated power (1430 MW for Garoña), that is, about 14 MW. This substantial amount of heat requires adequate cooling to be provided well after reactor shutdown. Otherwise, decay heat causes

449

Steam Lint (lo turbine)

Feed water (from condenser) ff

Fcedwater (from condenser)

Recirculation Pump

Recirculation Pump

Schemtic Arrangement of BUR HSSS. FIGURE 1

450 overheating and eventual melting of the nuclear fuel. Melting and overheating favour the release of the more volatile radioactive atoms from the molten core. This is the root cause of nuclear plant risk, and experience tells us that it is something not well understood outside the community of nuclear safety experts. 2.*\ Emergency Core Cooling System Light water reactors have multiple provisions for cooling the core fuel in the event of an unplanned transient or loss of coolant from the reactor. These provisions differ from plant to plant, but all plants have several independent systems to achieve flooding and/or spraying of the reactor core with coolant upon receiving an emergency signal. Garoa emergency core cooling systems -See Figure 2- include a high-pressure coolant-injection system which assures adequate cooling of the core in the event of a transient or a small leak that result in slow depressurization of the reactor. If, for any reason, the feedwater pumps and the high-pressure emergency cooling systems should be incapable of maintaining the desired reactor water level, an automatic depressurization system would operate to discharge steam through pressure relief valves and thereby lower the pressure in the reactor so that operation of the low-pressure emergency cooling systems could be initiated. A low-pressure core spray system uses two independent loops to provide emergency cooling after the reactor has been depressurized. These systems spray water onto the fuel assemblies at flow rates sufficient to prevent fuel damage. Another independent system, the low-pressure coolant-injection system, is provided to supplement the low-pressure core spray system and reflood the core. 2.5 Containment Systems The containment systems together with the residual heat removal systems perform the following safety functions: pressure suppresion, residual heat removal and radioactivity containment. These systems provide both "primary" and "secondary" containment for coolant and radioactivity releases from the reactor. The primary containment consists of a steel pressure vessel surrounded by reinforced concrete and designed to withstand peak transient pressures which might occur in the most severe of the postulated, though unlikely, loss-of-coolant accidents. The primary containment houses the entire reactor vessel and its recirculation pumps and piping, and is connected through large ducts to a large pressure-suppression chamber that is half full of water, as shown in Figure 3. Under accident conditions, valves in the main steam lines from the reactor to the turbine-generator would automatically close, and any steam escaping from the reactor would be released entirely within the primary containment. The resulting increase in pressure would force the air-steam mixture down into and through the water in the pressure-suppression chamber, where the steam would be condensed. Steam

451

FROM REACTOR

EMERGENCY CORE COOLING SYSTEM AND ISOLATION CONDENSER

452

UNCI* VltMl
O.J Will

PRIMARY CONTAINMENT FIGURE 3

released through the pressure relief valves of the automatic depressurization system would also be condensed in the pressure-suppression pool. This pool serves as one source of water for the emergency core cooling system. Secondary containment is provided by the reactor building, which houses the reactor and its primary containment system.

3. OBJECTIVES, SCOPE AND ORGANIZATION OF THE STUDY The existence in Spain of various Nuclear Power Plants with more than 10 years of operation, led the Spanish Nuclear Safety Council following the "Sistematic Evaluation Program" (SEP) of the U.S. Nuclear Regulatory Commision (NRC) to reevaluate their safety and, in this particular case, to request a Probabilistic Safety Assesment (PSA) in order to get additional information to the one given by the classic licensing criteria of the U.S. NRC. The performance of a Probabilistic Safety Assesment was required from NUCLENOR in August 1983, as a condition for the twelveth operating cycle. The objectives of the APS required to Santa Mara de Garoa were established by the Spanish Nuclear Safety Council (CSN) as follows: - First, the study should add new points of view to the ones given by the classic licensing criteria so as to improve decisions about the need to introduce new plant modifications. It was thought that this type of a study would serve to detect plant characteristics that contribute to the global risk in a significative way. Experience shows that some relevant aspects are not easily identified by means of deterministic analysis. In addition, it allows to take into account priority and cost-benefit considerations in the decision process related to the modifications that might be required. - The study should allow to get a general view of the plant thus assuring that all relevant safety aspects are adecuately considered. - Plant operations personnel should be involved in the study. This participation could be another way to alert them about the negative and positive aspects of the operation and design of the plant. The scope of the required study was similar to the ones made for the Interim Reliability Evaluation Program (IREP) in the U.S. This kind of an analysis tries to identify and estimate de frecuency of the accident sequences that dominate the risk for the reactor core to suffer serious damage. However, it does not include the evaluation of all those posible escape ways for the radioactive products or the contamination of the environment. - Another objective of the study was to get a wide introduction of these techniques in Spain, avoiding as much as posible, foreign expenses. At the start of the study, the experience available in Spain about Probabilistic Safety Analysis for nuclear power plants was reduced to several reliability analysis for particular nuclear plant systems and an incipient research and development effort on PSA

454
financed by the electric utility sector. Four different organizations were included in the project team under the overall direction of NUCLENOR: - A U.S. Consultant (Science Applications, Inc.) - A Spanish Engineering Company (INITEC) - The Department of Mathematics from the Civil Engineering School at Santander (SPAIN) - NUCLENOR Thirteen technical persons participated in the study. Ten of them were fully dedicated to the project. Most of these people have no previous experience on the subject. This lack of previous experience suggested to proceed in two phases. The first phase consisted of a training period of about four months at the offices of the Consultant (SAI) in the U.S.A. During this period a limited study (mini-PRA) was performed in order to get familiar with the methods, techniques, computer codes, etc., required to perform the study. The second phase took place in Santander (Spain) over a period of fifteen months approximately. Figure 4 lists the main activities of the project and indicates in an approximate way the duration of these activities and the effort in man-months that was put into them. At the iniciative of NUCLENOR, an Independent Study Review Group was constituted. The group was formed by two experienced nuclear safety professionals from Spain and a third one from the United States. They were asked to review the work on the light of the main objectives of the study.

. MAIN TASKS OF THE STUDY 4.1 Initiating Events The description of the most likely accidents has, as a starting point, the identification and adecuate grouping of the so called initiating events. We call initiating events to any type of ocurrences or failures that require the reactor to trip and some protection or mitigation function to be carried out succesfully in order to prevent core damage. The identification of the initiating events and their frecuencies is based on various sources of information as well as a detailed plant analysis. In the case of a plant with years of operating experience, as it is our case, the most frecuent initiating events have ocurred several times. The plant operating records were used to identify these events, and as a source of numerical information to estimate their frecuencies. However, the expected life of a nuclear plant is too short to provide us with sufficient data to idenfity all but the more likely initiating events. Thus, the review of the operating experience of other nuclear plants, generic reliability data for different types of equipment, as well as a detailed plant analysis are the sources of information that were used to complete the spectrum of initiating events to be considered. In our case, some sixty different ocurrences were studied and it was concluded that about forty of them have to be

Figure 4 Main project activities during 1984-1985, with approximate effort in man-months: Training (40), Familiarization with the plant (36), Initiating Events (8), Sequence Delineation (4), Systems Analysis (50), Data Bases (15), Human Reliability (5), Codes Preparation (6.5), Sequences Quantification (16), Results Analysis (9).

456 considered as initiating events. In order to reduce the complexity of the analysis of the accidents that may develop from each of the initiating events, the list of the initiating events was broken down into classes of events that were found to have the same general effect on the plant and its mitigation systems. If the event being looked at did not have the same effect as one that have already been considered, a new transient class was established. After all the events had been considered, fourteen clases of initiating events were identified, and the frecuency of each class was calculated by taking the sum of the frecuencies of the individual events included in that particular class. These frecuencies span over a range of four orders of magnitude. Experience shows that turbine trips occur more often than once per year. However, certain type of catastrophic pipe failures are estimated to occur only once every ten thousand years. The analysis that supports the definition of initiating events and the grouping of initiating events into different classes is a fundamental ingredient of the project and is a task that requires a thorough understanding of the plant design and operation. 4.2 Accident Sequence Definition Quantification of the risk associated with a commercial nuclear power plant requires the delineation of a large number of possible accident sequences. Because nuclear systems are complex, it is not feasible to write down, by simple inspection, a listing of important sequences. A sistematic and orderly approach is required to properly understand and accommodate the many factors that could influence the course of potential accidents. In our study, as is generally the case, the event tree method was used to identify the combination of system failures and human response failures that together with an initiating event constitute an accident sequence. In addition, system failure modes and system dependencies within a given accident sequence, was carried out by the fault tree method. In this section the even tree analysis performed for the Garoa study and the concept of the event tree are described. Next section discusses the fault tree methodology. In the Garoa study a separate event tree was constructed for each initiating event class. Each tree has a different structure from all the others, as a consecuence of some unique plant condition or system interrelationship created by the initiating event. In an event tree we begin with a postulated initiating event and the various event possibilities representing the systems or functions necessary to mitigate the consequences of the accident are listed across the top of the event tree. The line representing the initiating event branches into two, which represent success and failure of the first function or system (system operation may be automatic or manually initiated). Each of the resulting branches divides also in another two for the next system and so on. Some of these branches are not really considered because they do not serve to represent different accident

457
sequences or because they do not represent posible accident sequences. The end result of each sequence is assumed to be either the safe termination of the postulated sequence of events or some plant-damage state. Figure 5 shows an example of an event tree for a "Loss of Offsite Power" initiating event. The plant systems that are capable of performing the required functions are called front-line systems. The dependence of these systems on the succesful operation of other plant systems (the so called support systems) is not explicitly depicted on the event trees. The fault tree models take care of this type of system interrelationship. This approach (the IREP approach) gives rise to relatively simple event trees and rather complex fault trees. The relative complexity of event trees versus fault trees characterizes the different methodologies used in probabilistic safety assesments for nuclear power plants. Care was exercised to ensure that the event tree headings were consistent with actual plant-response modes or emergency procedures, that the heading could be precisely related to system success criteria and that they could be translated to top events for system-fault modeling. The order of the events across the tree is based on either the time sequence in which they occur, proceeding from left to right, or some other logical order reflecting operational interdependence. 4.3 System Reliability Analysis Fault tree analysis can be simply described as an analytical technique, whereby an undesired state of a system is specified (usually a state that is critical from a safety standpoint) and the system is then analyzed in the context of its environment and operation, to find all credible ways in which the undesired event can occur. The fault tree itself is a graphic representation of the various parallel and sequential combinations of faults that will result in the occurrence of the predefined undesired event. The fault can be due to events that are associated with component hardware failures, human errors, or any other pertinent event which can lead to the undesired event. A fault tree thus depicts the logical interrelationships of basic events that lead to the undesired event which is the top event of the fault tree. Figure 6 shows a page of the fault tree constructed for one of the systems referred in the previous example of the event tree; this is the "Core Spray System". Fault trees for the most of the front line and support systems were developed during this analysis. In a few cases it was not necessary to develop a fault tree because of similarity with already well studied systems or because the simplicity of the system allowed a direct reliability calculation. A generic data base was used for quantification of hardware faults. In some instances, plant specific data was used instead. Test and maintenance intervals and durations were obtained, where possible, from discussions with plant personnel and by reviewing plant logs.

FIG. 5 EVENT TREE FOR LOSS OF OFFSITE POWER (T4). (End states: SS = safe shutdown, SCM = slow core melt, FCM = fast core melt. The letter codes across the tree denote the reactor protection system, relief valves opening and closing, the isolation condenser, the high-pressure injection system, the automatic depressurization system, the core spray system, the low-pressure coolant injection system, the reactor shutdown cooling system and the suppression pool cooling system.)

FIG. 6 A page of the Core Spray System fault tree (original labels in Spanish; top event: the core spray system does not operate).

Human errors were classified in two groups: latent errors and diagnosis errors. For the probability of latent errors, which include test, maintenance and calibration errors, an upper value was obtained following the THERP (Technique for Human Error Rate Prediction) methodology. The second group of errors, the diagnosis errors, deals with those decisions and actions that the operator must face in an accident situation. For this group, screening values were chosen. These values depend on the time available for the operator to make up his mind to take a specific course of action and on the complexity of the decisions to be taken. This last methodology agrees with NUREG-3010.

4.4 Accident Sequence Quantification

The event tree and fault tree models and the data base are integrated in the accident sequence analysis task to calculate accident sequence frequencies and to identify the most probable faults contributing to each accident sequence. This is a time consuming task generally performed with the assistance of a computer. Many activities were performed in this part of the analysis: preparing computer inputs representing the structure of the fault trees, merging support system fault trees with the appropriate front-line system fault trees, quantifying the frequencies of all important accident sequences, including consideration of operator recovery actions, and many others. The results of this task were computerized models representing the plant systems, and both qualitative expressions of fault combinations and quantitative expressions of cut sets and accident sequence frequencies for all potentially important accident sequences.
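In very simplified form, the quantification performed in this task can be sketched as follows: the frequency of a sequence is the frequency of its initiating event class multiplied by the failure probabilities of the front-line functions that fail in the sequence, possibly reduced by a recovery factor. The numbers below are invented for the illustration:

    # Hypothetical, highly simplified accident-sequence quantification.

    initiating_event_freq = 0.5        # per year, e.g. one transient class

    branch_failure_prob = {
        "high_pressure_injection": 2.0e-3,
        "depressurization": 1.0e-3,
        "low_pressure_injection": 5.0e-3,
    }

    def sequence_frequency(failed_branches, recovery_factor=1.0):
        """Frequency = initiator frequency * product of failed-branch probabilities."""
        freq = initiating_event_freq
        for branch in failed_branches:
            freq *= branch_failure_prob[branch]
        return freq * recovery_factor

    # Sequence: initiator, high-pressure injection fails, depressurization fails,
    # with a 0.1 credit for operator recovery.
    print("%.2e per year" % sequence_frequency(
        ["high_pressure_injection", "depressurization"], recovery_factor=0.1))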

5. INSIGHTS OF THE STUDY

We will start this section by discussing the results of the analysis that were used, or will be used in the near future, to introduce modifications in the design and operation of the plant. After that we will make reference to other insights of a more general character.

A first group of modifications came out mainly from the system reliability analysis. From the consideration of the most probable causes of system unavailability, some aspects of design or operation susceptible to significant improvement were identified. Most of these modifications affect test or calibration procedures for systems or components. As a consequence, the testing of electrical circuitry becomes more thorough and includes all the components that are required to operate for the circuit to perform its function. Some specific components (valves, sensors, etc.) that were found to be important in some respect are now tested or calibrated more frequently, or included in a list of daily checks.

The most relevant results arose, as one could expect, during the accident sequence analysis task, in the last part of the study. From the point of view of its contribution to the reactor core damage

461
estimated frequency, the plant transients appear to be more important initiating events than the pipe breaks. Electrical failures are important contributors to serious transients. Among these failures, the loss of either one of the direct current buses are two very significant events. Failure modes of emergency systems due to common causes were identified. System failure modes, that contribute meaningly to several accident sequences, and that could be avoided, or, at least, be made less probable through changes in their testing procedures or test frequency, were also identified. These results allowed, in many cases, to project modifications that avoid the occurrence of some transients, or that improve the availability or reliability of some systems. Plant modifications to eliminate the cause of a reactor shutdown due to the loss of a direct current bus are considered particularly beneficial. In other cases, results show up that a certain aspect of the plant design or operation can be improved. However, an additional study to identify the more convinient modification appear to be necessary. This is the case of some of the identified common cause failures. In addition to the above mentioned aspects, in relation with the design, it was found convinient during the sequence analysis task to bring in some other changes of similar nature to those developed in the system analysis. These changes affect testing, calibration and operation procedures and they have an effect mainly on the test frequency and the number of checked components. As a conclusion of this brief summary of the results of Garoa PSA, it can be said that a significant number of modifications in different fields were or will be made. That means a considerable reduction in the estimated core damage frequency: 5,4 10~ 4 year"^ (mean value). The elimination of the transients due to the loss of one direct current bus cause a 34$ decrease in that frequency. From the experience of the study we conclude that the initial objectives were reasonable objectives and that they have been reached to a good extent. The PSA has developed suficiently to be a useful evaluation methodology. In some areas it is clearly superior to the evaluation methods proposed by the traditional regulation. The interaction between safety and control systems is one of these areas. In relation with the transference of technology, we understand that the objective has been fully reached. The team that has done the work has obtained the knowledge and the means necessary to make a new study of similar characteristics. This allows, as is being done, to include modifications in the study and correct errors in order to maintain an updated PSA. The experience of the colaboration between Nuclenor and the Department of Applied Mathematics of Santander University has been satisfactory for both sides and, consequently, a new agreement that includes broader objectives and extended time frame, was signed. The Independent Review Group has contributed to improving the quality of the study with a detailed analysis of the documents. This

analysis is included in the reports that the members of the Group wrote after each of the meetings held during the project.

6. CURRENT ACTIVITIES

After the completion of the project, under the scheme that we have already described, work continues to be performed at NUCLENOR in several areas related to PSA. An important area of work is the design of modifications that, given their complexity, require additional analysis as a complement to the PSA results. We are currently giving additional consideration to the human reliability task. The most likely errors of diagnosis identified during the study are going to be analyzed further. The new probability estimates for these human errors will be included in the study. There is a new area of work, which can be considered a consequence of the PSA, and which we could call transient simulation. We are currently planning to select and adapt a computer code capable of describing the behaviour of the reactor and the primary containment under a variety of transients. This computer model of the plant should allow, among other things, improving the accident sequence analysis and reducing the uncertainty associated with some of the hypotheses made in this part of the study. We expect that the study will be relevant in the future in several areas that receive frequent attention, such as Technical Specifications, vital area definition and protection, plant modifications, etc.

PROBABILISTIC EVALUATION OF SURVEILLANCE AND OUT OF SERVICE TIMES FOR THE REACTOR PROTECTION INSTRUMENTATION SYSTEM

Ioannis A. Papazoglou
Greek Atomic Energy Commission
Nuclear Technology Department
15310, Aghia Paraskevi
Greece

ABSTRACT. A methodology for the probabilistic evaluation of alternative plant technical specifications regarding system surveillances and out-of-service times is presented. A Markov model is employed that allows for the modeling of state dependences and other dynamic effects, like the renewal of the system after each successful challenge. Multiple states for the components and the system are considered. The methodology is applied to the Reactor Protection System of a 4-loop RESAR-3S type nuclear power plant. Various sets of Limiting Conditions of Operation are studied, using the probability of core damage and the expected reactor shutdown time per year of reactor operation as criteria.

1. INTRODUCTION

The objective of these notes is to present a methodology for the probabilistic evaluation of alternative plant technical specifications regarding system surveillance frequencies and out-of-service times. The methodology is applied to the Reactor Protection System (RPS) of a 4-loop RESAR-3S (1) type nuclear power plant. The effect of the statistical characteristics of the system on the relative comparison of various sets of technical specifications is examined through sensitivity studies. The Westinghouse Owner's Group (WOG) (2,3) requested from the USNRC revisions of the Limiting Conditions of Operation (LCO) in the RPS technical specifications. Justification for revisions of the LCO in plant technical specifications can be provided on the basis of probabilistic analyses and arguments. Given the randomness that characterizes the behavior of the various systems, probabilistic analyses can provide a quantitative assessment of the "impact" or "cost", and of the "value", of any proposed changes and, hence, provide a logical framework for decision making. This paper presents a methodology that can accurately quantify the effects of different LCO policies on risk as well as on economic attributes. The technique consists mainly of a detailed model of the

stochastic behavior of a standby system using the method of Markovian Reliability Analysis. A Markov model allows for the incorporation of many time and state dependences in the stochastic behavior of the system which are of particular importance for the problem at hand.

Several models for the unavailability of standby systems have been analyzed in the literature. Apostolakis and Chu (4) provide approximate analytical solutions for 1-out-of-2, 1-out-of-3, and 2-out-of-3 systems under periodic test and maintenance and different testing policies that correctly account for the time dependence of the phenomena. The same reference cites a number of other references that address various aspects of the unavailability of standby systems under periodic test and maintenance. The computer code FRANTIC developed by Vesely and Goldberg (5) provides the most complete coverage of the phenomena of periodic maintenance. It can describe failures that occur at a constant or varying rate, failures per demand, failures caused by common causes, and human errors. In addition, the effects of test downtimes, repair times, test efficiencies, test bypass capabilities, test-caused failures and test staggering can be modeled. In general, FRANTIC can analyze the unavailabilities of systems that can be represented by the minimal cut sets of a fault tree. Both the components and the system are, however, binary. The technique presented in this paper provides a model that can incorporate, in addition to the above mentioned characteristics, multistate components and systems, as well as temporal changes in the stochastic behaviour of the components caused by changes of the system state.

The paper is organized as follows. Section 2 briefly describes the Reactor Protection System (RPS) under study. Section 3 discusses the Markovian model for the RPS. Section 4 presents the data base and the basic results obtained. Section 5 summarizes the methodology, the results and the main conclusions of this study.
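As a reminder of the basic idea of the method, and in a deliberately trivial form unrelated to the RPS model developed later in the paper, the sketch below integrates the state equations of a single repairable component with constant failure and repair rates and prints its time dependent unavailability; all numerical values are arbitrary:

    # Minimal illustration of Markovian reliability analysis: one component
    # with two states (up/down), constant failure rate lam and repair rate mu.

    lam, mu = 1.0e-3, 1.0e-1      # per hour (arbitrary values)
    dt, t_end = 0.1, 200.0        # integration step and horizon, hours

    p_up, p_down = 1.0, 0.0       # start in the operating state
    t = 0.0
    while t < t_end:
        # Forward Euler integration of the two state equations dP/dt = P*Q.
        dp_up = (-lam * p_up + mu * p_down) * dt
        dp_down = (lam * p_up - mu * p_down) * dt
        p_up, p_down = p_up + dp_up, p_down + dp_down
        t += dt

    print("unavailability at t = %g h      : %.4e" % (t_end, p_down))
    print("asymptotic value lam/(lam+mu)   : %.4e" % (lam / (lam + mu)))

The same machinery, with many more states and time dependent transition rates, underlies the RPS model described in Section 3.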

2. REACTOR PROTECTION SYSTEM

2.1. System Description

The Reactor Protection System (RPS) keeps the reactor operating within a safe region. If one or more physical parameters enter an unacceptable range of values, a trip signal will be produced to de-energize the electromagnetic holding power of the control rod drive mechanisms. The control rods then drop because of gravity, which ensures an orderly shutdown of the nuclear power plant. The electrical portion of a typical RPS of Westinghouse designed pressurized water reactors consists of analog channels, logic trains and trip breakers. The specific design details may differ depending on the vintage of the reactors. The particular hardware configuration which is the subject of this study is that of a 4-loop RESAR-3S (1) type reactor with solid state combinational logic units. References 1 and 3 describe the RPS in greater detail.

2.1.1. Analog Channels. The analog channels sense the plant parameters and provide binary (on-off) signals to the logic trains. A typical analog channel is composed of a sensor/transmitter, a loop power supply, signal conditioning circuits and a signal comparator (bistable).

2.1.2. Logic Trains and Trip Breakers. There are two logic trains, and each logic train receives signals from the analog channels through input relays. The input signals are then applied to universal boards, which are the basic circuits of the protection system. They contain 1-out-of-2, 2-out-of-4 coincidence logic circuits, depending on the plant parameters and the corresponding analog channels.

2.2. Testing

The RPS is designed to allow periodic testing during power operation without initiating a protective action unless a trip condition actually exists. An overlapping testing scheme, where only parts of the system are tested at any one time, is used. Typical RPS testing involves verification of proper channel response to known inputs, proper bistable settings and proper operation of the coincidence logic and the associated trip breakers. Detailed testing procedures, including testing frequency and allowable bypass time, are described in References (2) and (3).

2.2.1. Analog Channel Testing. The analog channel testing is to verify that the analog channel is functioning properly and that the bistable settings are at the desired setpoint. During a test, the test switch disconnects the sensor/transmitter from the channel and the circuit is capable of receiving a test signal through test jacks. The input signal to the test jacks is then adjusted to check the operability and setpoints of the bistable. The analog channel under test is allowed to be bypassed for a duration specified by the technical specifications, and is put in a trip mode if the allowable bypass time is exceeded.

2.2.2. Logic Train and Trip Breaker Testing. This portion of the RPS testing encompasses three stages: (1) Testing of the input relays places each channel bistable in a trip mode, causing one input relay in logic train A and another in logic train B to de-energize. Each input relay operation will light the status lamp and annunciator. This stage of the testing provides overlap between the analog channel and logic train portions of the test procedure; (2) Testing of the logic trains involves one train at a time. The semi-automatic tester checks through the solid state logic to the UV coil of the reactor trip breaker. The logic train under test is also allowed to be bypassed for a specified duration, and the plant must be shut down if the allowable bypass time is exceeded; (3) Testing of the trip breaker requires a manual trip and operability verification of the bypass breaker and then a manual trip test of the trip breaker through the logic train.
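As a simple illustration of this coincidence logic, the sketch below evaluates a 2-out-of-4 vote over the four analog channels of one parameter and a 1-out-of-2 vote over the two logic trains; the channel and train states are invented test data and the sketch ignores all the details of the real circuits:

    # Hypothetical sketch of the RPS coincidence logic for one plant parameter.

    def k_out_of_n(k, states):
        """True if at least k of the given boolean states are True."""
        return sum(bool(s) for s in states) >= k

    def trip_signal(channel_tripped, train_operable):
        """2-out-of-4 over the analog channels, 1-out-of-2 over the logic trains."""
        parameter_trip = k_out_of_n(2, channel_tripped)
        train_trip = [parameter_trip and operable for operable in train_operable]
        return k_out_of_n(1, train_trip)

    # One tripped channel does not scram; two tripped channels do, provided
    # at least one logic train (and its trip breaker) is operable.
    print(trip_signal([True, False, False, False], [True, True]))   # False
    print(trip_signal([True, True, False, False], [True, True]))    # True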

3. MARKOVIAN MODEL OF THE REACTOR PROTECTION SYSTEM

The basic principles of Markovian reliability analysis are discussed in References (6-10). This section describes the Markov model developed for the electrical portion of the Reactor Protection System (RPS). The model developed for this study does not include the mechanical portion (control rod drive mechanisms and control rods) and the opera tor manual actions to scram the plant by pushing the buttons in the control room or by locally opening trip breakers or output breakers on the rod drive motorgenerator sets. A typical fourchannel parameter was considered to evaluated the effects of changes in the test procedures on unavailability and risk measures, e.g., increments in unavailability or core damange frequency. The RPS is represented in a functional block configuration in Figure 1. There are four blocks for analog channels (one for each channel) and two blocks for logic trains (one for each logic train and the associated trip breaker). Each functional block is considered as a supercomponent composed of several basic components in series. Hence, the failure rate of a block is simply the sum of the failure rates of the composing com ponents . The block for an analog channel consists of the following com ponents : a sensor/transmitter loop power supply (120 VAC) signal conditioning circuits a bistable an imput relay It is noted that each bistable feeds two imput relays, one for each logic train. To avoid complexity of the model, however, it is assumed that each bistable feeds only one imput relay. This is a slightly conservative assumption. The block for a logic train consists of the following components : solid state combinational logic circuits DC power for the logic circuits (15 VDC) undervoltage coils DC power for the undervoltage coils (48 VDC) a trip breaker The state transition diagram for an analog channel is given in Figure 2. An analog channel is represented by a fivestate component. State 1 : is the operating state. State 2 : is the failed state. In this state the component is failed, the failure can be detected in the next test and the com ponent will be put under repair. State 3 : is the tripped state. While in this state the channel gene rates a trip signal and it may undergo repair. State 1 : is the bypass state related to state 1. To perform a test the channel can be bypassed for a prespecified period of time : Allowable Bypass Time (ABT) . At the end of this period the component transists instana


Fig. 1. Reactor Protection System Functional Block Diagram (four analog channels, Channels 1-4, feeding logic Trains A and B).

State 5: is the bypass state related to state 2. If the channel is failed, the testing and repairing can be performed while in a bypass mode, provided that the ABT is not exceeded.
If the analog channel is in state 1, it may transit (see Figure 2):
a) to state 2 with a failure rate λ;
b) to state 3, when any one of the internal components gives a spurious trip signal or if it fails in a detectable way and the operator immediately trips the channel, with transition rate λ_s; and
c) to state 4 following a test, which takes place every T hours. Thus, the transition rate is represented by a delta function δ(t-kT), k = 1,2,...
If the analog channel is in state 2 it transits to state 5 following a test.
If the analog channel is in state 3 it transits back to state 1 once the repair is completed, with transition rate μ.
If the analog channel is in state 4 it may transit to:
a) State 3 if the testing is not completed within the ABT (instantaneous transition symbolized by the delta function δ(u-τ), where u is the time spent in state 4).
b) State 1 if the test is completed within the ABT and there


Figure 2. State Transition Diagram: Analog Channel, "Non-Markovian Model".

Figure 3. State Transition Diagram: Analog Channel, "Equivalent" Markovian Model.

is no human error in restoring the channel to its operating state (transition rate μ_1(1-p_1));
c) State 2 if the test is completed within the ABT and there is a human error that leaves the channel failed (transition rate μ_1 p_1).
If the analog channel is in state 5 it may transit to:
a) State 3 if the test/repair is not completed within the ABT (instantaneous transition symbolized by the delta function δ(u-τ));
b) State 1 if the test/repair is completed within the ABT and no human error is committed in restoring the channel to its operating state (transition rate μ_2(1-p_2));
c) State 2 if the test is completed within the ABT and there is a human error that leaves the channel failed (transition rate μ_2 p_2).
Whenever the allowable bypass time is small compared to the mean time of channel failure, the two test states (4 and 5) can be omitted by assuming that the transitions in and out of states 4 and 5 occur instantaneously at the time of testing and with the following probabilities (see Figure 3):
(i) from state 1 to state 3 with probability exp(-μ_1 τ), i.e., the probability that the test will last for more than τ units of time;
(ii) from state 1 to state 2 with probability p_1(1-exp(-μ_1 τ)), i.e., the probability of completing the repair in less than τ units of time and that a human error will occur;
(iii) from state 2 to state 3 and state 1 with probabilities exp(-μ_2 τ) and (1-p_2)(1-exp(-μ_2 τ)), respectively.
In this study exponentially distributed times to test completion were used. This assumption is not, however, a requirement of the model. Any distribution of testing time can be used. Only the cumulative probabilities are needed in the model.
The state transition diagram for the logic train and trip breaker is similar to the one of the analog channel. The six components (4 analog channels and 2 logic trains) form a system that can be in 729 (= 3^6) states. However, not all 729 states are necessary for the solution of the model. The system states have been regrouped into 198 states. The major grouping involves states that imply a spurious scram. If two analog channels are in the trip state or if one logic train is in the trip state, a spurious scram signal is generated because of the 2-out-of-4 and the 1-out-of-2 logic, respectively. The scram signal will cause a reactor shutdown that will result in a successful shutdown or in a core-damaged state, depending on the availability of the decay heat removal function. All the system states with two tripped analog channels or one tripped logic train were merged into two system states.
The 729 states can be grouped into the following 9 groups.
1) RPS Available With No Tripped Analog Channel: This group contains all system states with at least two analog channels and one logic train operable and no tripped analog channel.
2) RPS Available With One Tripped Analog Channel: This group contains all system states with one analog channel

tripped and at least one more analog channel and one logic train operable.
3) RPS Unavailable: This group contains all the states that imply system unavailability (two logic trains or three analog channels failed).
4) "Real" Scram - No Core Damage: This group contains all the states of the system that imply an available RPS and the successful reactor shutdown following a "Real" Scram Signal. Real Signal means a signal generated by the RPS by properly responding to abnormal conditions of the plant.
5) "Real" Scram - Core Damage: This group contains all the system states that imply an available RPS and the reactor in a core-damaged state. The RPS successfully responded to the "Real" challenge but the decay heat removal function failed.
6) "Spurious" Scram - No Core Damage: This corresponds to group No. 4 with the scram signal spuriously generated internally to the RPS.
7) "Spurious" Scram - Core Damage: This corresponds to group No. 5 with a spurious scram initiator.
8) ATWS - No Core Damage: This group contains all the system states that imply an unavailable RPS coupled with a real challenge (Anticipated Transient Without Scram - ATWS) but with successful mitigation of the event.
9) ATWS - Core Damage: This group contains all the system states that imply an unavailable RPS coupled with a real challenge (ATWS) that results in Core Damage.
The system transitions are graphically depicted, in summary form, in the state transition diagram in Figure 4. If the system is in a state of group 1 it can transit to another state in the same group, or to a state in group 3 if a component fails. The system transits from a state of group 1 to a state of group 2 if an analog channel trips. Transitions from groups 2 and 3 back to group 1 occur whenever a component is repaired. Similar transitions (involving failures and repairs of components) can occur within groups 2 and 3 as well as between groups 2 and 3.
If the system is in a state of group 1 or 2 (available), a real challenge, assumed to occur according to a Poisson random process with intensity λ, will generate a scram which in turn will result in Core Damage with probability p_c or in a safe shutdown with probability 1-p_c (see Figure 4). The "Real Scram - Core Damage" state is an absorbing state, that is, the system cannot leave this state. Following a successful scram, however, the reactor is brought back on-line after spending some time (a random variable) in the shutdown mode. This transition back to the operating state is depicted in Figure 4 by the transition rate r_R. It is further assumed that following a successful scram all existing failures in the RPS are detected and repaired.
Spurious scrams are modeled by transitions from either group 1 or group 2 to the "Spurious Scram - No Core Damage" state (group 6) and the "Spurious Scram - Core Damage" states (group 7). From a state in group 1, a spurious scram can occur if a spurious signal is generated (randomly with time) in a component of the logic train and trip breaker or if the

Figure 4. Generalized State Transition Diagram: Reactor Protection System.

ABT is exceeded while testing and/or repairing such a component. The same transitions are possible from a state in Group 2; in addition, a spurious scram occurs if a spurious scram signal is generated by an analog channel (one channel is already tripped) or if the allowable bypass time for testing/repairing an analog channel is exceeded. The conditional probability of core damage given a spurious scram is now denoted by p_c* (see Figure 4). From a safe shutdown following a spurious scram the system is brought back to the operating state (renewed) with rate r_s (see Figure 4).
ATWS events can occur from some states in Groups 1 and 2 and all states in Group 3. If the system is in a state of Group 3, it is unavailable to detect the need for shutdown and a challenge will bring the system to an "ATWS - No Core Damage" (Group 8) or "ATWS - Core Damage" (Group 9) state with probability 1-p_c and p_c, respectively. ATWS transitions can also occur from states in Groups 1 and 2 during tests. If the system has two analog channels and/or one logic train failed undetected, then a test of a "good" component (channel or logic train) will put this component in a bypass mode and it will render the system unavailable for the duration of the test. If a challenge occurs during this time an ATWS will occur. The system then transits to the "ATWS - Core Damage" and "ATWS - No Core Damage" states with probabilities p_c and 1-p_c, respectively. From the ATWS - No Core Damage state the system returns to the operating state (renewed) with rate r_A (see Figure 4).
Additional features of the model are staggered testing and inclusion of common-cause failure modes. Uniform staggered testing (4) has been assumed for the analog channels and logic trains. Externally (to the system) generated common cause failures are included in the model using the β-factor approach (11).
The Markov model for the RPS described above includes several characteristics of the stochastic behavior of the system that cannot be adequately modeled by the current state-of-the-art PRA techniques. In present PRA techniques, the system is modeled by a fault tree or an equivalent logic model which in turn is quantified by inputting the average unavailabilities of the components. The average (over time) component unavailabilities are estimated by considering each component independently of the other components or the system. Thus, the current PRA techniques do not consider the effects of the time dependence of the system characteristics and the effects of dependences of the stochastic behavior of a component on the state of other components and/or the system. It is almost always possible to apply the current PRA techniques with assumptions that will provide "conservative" answers in the sense that they will overestimate undesirable reliability parameters of the system. It is not, however, obvious that such overestimations are desirable or that they can provide useful insights in cases where operating policy is to be decided at least partly on the basis of the results of a probabilistic analysis.
The specific areas in which the model presented in this paper improves over current PRA techniques are the following.
(i) Modeling of Multiple States: A component can be in any number of discrete states. In particular, the Markov model allows for the modeling of Bypass and Trip states for the

analog channels and the logic trains. A current PRA technique would assume only one failed state (component unavailable) and it would assume that the component is unavailable every time it is tested and for a period of time equal to the mean time of the maintenance activity. This approach creates three problems: (a) It introduces a conservatism in the calculation by overestimating the unavailability of the system. This is because when a channel is in a trip mode it takes three additional failures for the system to be unavailable. Assuming that the channel is unavailable, however, requires only two additional failures to fail the system. (b) It introduces a non-conservatism by underestimating the probability of spurious scrams. When a channel is in a trip mode an additional spurious trip in any of the remaining channels will cause a spurious reactor scram. (c) It introduces a difficulty in estimating the real effect of an LCO policy change. It is conceivable that two alternative LCO policies are characterized by the same mean time to test and repair a channel (which is a component characteristic after all) and different allowable times in bypass.
(ii) State Dependences: The stochastic behavior of the system might depend on its state. For example, the allowable bypass time for an analog channel depends on whether another channel is already tripped or not. The repair rate of an analog channel might depend on whether another channel is under repair or on whether the reactor is shut down or on-line. Exceeding the allowable bypass time in an analog channel will generate a scram signal depending on whether another channel is tripped or not and on whether the reactor is on-line or not.
(iii) Renewal Effect of Challenges: A successful challenge to the system will reveal any existing failures, which will be subsequently repaired. Thus, the challenges to the system have the effect of randomly occurring tests. However, whether a challenge will have the equivalent effect of a test on a component will depend on whether the system is available at the time of the challenge.
(iv) Inclusion of the "NO CORE DAMAGE" and "CORE DAMAGE" States: The inclusion of no core damage states is important because they allow for the estimation of the expected reactor downtime that is directly related to the RPS. This quantity is an important attribute of any LCO policy. In addition, the inclusion of the no core damage and core damage states permits a more accurate estimation of the system unavailability and failure probability. This is due to the fact that the system spends a finite amount of time in the "no core damage" states. The time the system spends in states of Groups 1 to 3 is then reduced accordingly and thus some double counting in the estimation of the system's unavailability and failure probability is avoided.

The Markov model calculates the effect of these characteristics by considering their impact dynamically, that is, as a function of time.
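To illustrate how such a model can be quantified numerically, the sketch below propagates the state probabilities of a single analog channel in the reduced three-state form discussed above (operating, failed undetected, tripped), applying exponential failure and spurious-trip transitions between tests and the test-time transition probabilities at each test instant. This is a minimal sketch under stated assumptions: the rate values are illustrative, the same test-completion rate and human-error probability are reused for the operating and failed states, and the full six-component plant model is not reproduced.

    import numpy as np

    # Illustrative per-hour parameters (not the data base of Section 4)
    lam, lam_s, mu = 3.5e-6, 6.2e-6, 1.0 / 20.0   # failure, spurious trip, repair from trip
    mu1, p1, tau = 1.0, 1.0e-2, 4.0               # test completion rate, human error prob., ABT (hr)
    T = 720                                       # test period (hr)

    def step_matrix(dt):
        # Transitions between tests; states: 0 = operating, 1 = failed, 2 = tripped
        Q = np.array([[-(lam + lam_s), lam, lam_s],
                      [0.0,            0.0, 0.0],
                      [mu,             0.0, -mu]])
        return np.eye(3) + Q * dt                 # first-order approximation of exp(Q*dt)

    def test_matrix():
        # Instantaneous transition probabilities applied at each test
        exceed = np.exp(-mu1 * tau)               # probability the test outlasts the ABT
        row = [(1.0 - exceed) * (1.0 - p1), (1.0 - exceed) * p1, exceed]
        return np.array([row, row, [0.0, 0.0, 1.0]])

    p = np.array([1.0, 0.0, 0.0])
    for hour in range(1, 8761):
        p = p @ step_matrix(1.0)
        if hour % T == 0:
            p = p @ test_matrix()
    print("probability of undetected failure after one year:", p[1])

The same bookkeeping, extended to all six blocks and to the grouped system states, is what yields the group occupation probabilities used in Section 4.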

4. DATA BASE AND RESULTS

This section presents the data base and the main results obtained in the point calculations.

4.1. Data Base
The failure rates of the components comprising the analog channels and the logic trains are given in Table I. The numerical values of the other parameters required in the model are given in Table II.

4.2. Results
The Markov model described in Section 3 was quantified using the data base given in Section 4.1. The quantification of the model provides numerical values for two attributes of interest in the evaluation of the LCO policies: (1) the probability of core damage per year of reactor operation and (2) the average reactor downtime per year of reactor operation.
The quantification of the Markov model provides the probabilities that the system will occupy each of the possible states as a function of time. The probability of core damage per year of reactor operation is given by the probability that the system will occupy any of the states in Groups 5, 7 and 9 (see Section 3 and Figure 4) at the end of a one-year period. Since core damage is a catastrophic failure, that is, no recovery is possible, each of the states in these groups is an absorbing state. The probability of finding the system in one of these states at time t is then equal to the cumulative probability that the time of core damage will be less than or equal to t.
The probability that the reactor will be shut down at time t is equal to the probability that the system occupies a state in Group 4, 6 or 8 (see Section 3 and Figure 4). Since the reactor is brought back to power from such a state, the probability of being in a state of Groups 4, 6 or 8 is equal to the pointwise unavailability of the nuclear power plant (4). The average unavailability of the reactor (D̄) is obtained if the pointwise unavailability is integrated over the period of interest and divided by that period:

D̄ = (1/T) ∫ D(t) dt, the integral being taken from 0 to T.

The average reactor downtime for the period T is then simply equal to D̄T. To demonstrate the methodology we report the results of the model for various LCO policies. An LCO policy consists of the period of test-


TABLE I - Failure Data

Analog Channel Block
  Component                          Failure Mode           Failure Probability   Source
  Input Relay                        Fails to open          5.09(-7)/d            Ref.(15)
                                     Operates spuriously    3.6(-8)/hr            Ref.(15)
  Loop Power Supply (120 VAC)        Inoperable*            5.4(-7)/hr            Ref.(16)
                                     Reduced Capability*    9.1(-8)/hr            Ref.(16)
  Signal Conditioning Module         Inoperable**           2.6(-6)/hr            Ref.(16)
                                     Reduced Capability**   1.55(-6)/hr           Ref.(16)
  Comparator (Bistable)              Inoperable             6.5(-7)/hr            Ref.(16)
                                     Reduced Capability     8.4(-7)/hr            Ref.(16)
  Sensor/Transmitter - Neutron Flux  Inoperable             3.4(-6)/hr            Ref.(16)
                                     Reduced Capability     8.5(-7)/hr            Ref.(16)
  Sensor/Transmitter - Pressure      Inoperable             2.6(-7)/hr            Ref.(16)
                                     Reduced Capability     3.1(-6)/hr            Ref.(16)
  Total Flux Channel                 Fails to operate       6.65(-6)/hr
                                     Operates spuriously    3.91(-6)/hr
  Total Pressure Channel             Fails to operate       3.51(-6)/hr
                                     Operates spuriously    6.16(-6)/hr

Logic Train and Trip Breaker Block
  Component                          Failure Mode           Failure Probability   Source
  Trip Breaker                       Fails to open          2.27(-4)/d            Ref.(15)
                                     Operates spuriously    4.3(-8)/hr            Ref.(15)
  UV Coils                           Fails to open          5.09(-7)/d            Ref.(15)
                                     Operates spuriously    3.6(-8)/hr            Ref.(15)
  DC Power (48V) for UV Coils        Inoperable             5.4(-7)/hr            Ref.(16)
                                     Reduced Capability     9.1(-8)/hr            Ref.(16)
  Solid State Logic Circuits         Fails to operate       1.73(-6)/hr           Ref.(15)
                                     Operates spuriously    2.48(-6)/hr           Ref.(15)
  DC Power (15V) for Solid State
  Logic Circuits                     Inoperable             5.4(-7)/hr            Ref.(16)
                                     Reduced Capability     9.1(-8)/hr            Ref.(16)
  Total                              Fails to operate       2.52(-6)/hr
                                     Operates spuriously    3.28(-6)/hr

*  Both failure modes of the power supply are considered to produce spurious signals.
** In Ref.(17) "Inoperable" is defined as failure events involving actual failure and "Reduced Capability" as instrument drift, out-of-calibration and intermittent (spurious) events. The condition of reduced capability is considered to produce spurious signals.
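The channel totals in Table I are simply the series sums of the component rates, as stated in Section 3. The check below (our own illustration; the variable names are ours) reproduces the "Total Pressure Channel" fails-to-operate entry from the individual entries tabulated above.

    # Series "supercomponent" sum for a pressure analog channel (per hour)
    sensor_pressure_inoperable = 2.6e-7
    signal_conditioning_inoperable = 2.6e-6
    bistable_inoperable = 6.5e-7

    total = sensor_pressure_inoperable + signal_conditioning_inoperable + bistable_inoperable
    print(total)   # about 3.51e-06 per hour, matching the Table I entry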

TABLE II - Data Parameters of the Model

  Parameter    Data          Source     Comments
  TR 1         1 hr^-1       Ref.(3)
  TR 2         1/7 hr^-1     Ref.(3)
  CH 1         1 hr^-1       Ref.(3)
  CH 2         1/7 hr^-1     Ref.(3)
  CH 3         1/16 hr^-1    Ref.(3)
  λ            9.71 yr^-1    Ref.(12)   Challenge rate on RPS (frequency of transients)
  r_R = r_s    25.6 hrs      Ref.(3)
  p_c          1.43E-5/d     Ref.(13)   Indian Point-3 PRA revised by Sandia (internal transient initiators)
  p_c*         5.21E-7/d     Ref.(13)

ing of analog channels (T_CH), the period of testing of the logic trains (T_TR), the allowable time in "bypass" for an analog channel if no other channel is tripped (τ_0), the allowable time in "bypass" for an analog channel if another channel is tripped (τ_1), and the allowable time in "bypass" for a logic train (τ_TR). A uniformly staggered testing scheme (4) of the analog channels and the logic trains has been assumed for both policies. In general, one would like to determine the five values of these parameters that optimize the chosen risk and/or economic attributes. Such a general analysis is, however, outside the scope of this work. Instead, the main characteristics of the dependence of the attributes on the parameters of the LCO policy are demonstrated by means of a sensitivity study.
Two attributes have been chosen for evaluating the various LCO policies: the probability of reactor core damage per year of operation and the expected downtime of the reactor. The sensitivity studies were performed for two limiting cases of dependences, namely, no dependences (β_CH = β_TR = 0) and high dependences (β_CH = β_TR = 0.10). In the case of no dependences the system behavior is dominated by the two logic trains, since the analog channels exhibit a high degree of redundancy (2-out-of-4). In the case of high dependences the role of the analog channels becomes important. The results of the sensitivity studies are shown in Figures 5 through 8.
Figure 5 presents the probability of core damage per year of reactor operation (PCD) and its three constituents (i.e., core damage from spurious scram, from ATWS, and from real scram initiators) as a function of the period of testing the logic trains, when there are no dependences between trains or channels (β_CH = β_TR = 0). Two cases are shown: case I for short ABTs (i.e., τ_1 = 1 hr, τ_0 = 1 hr and τ_TR = 2 hrs); and case II for longer values of ABTs (i.e., τ_1 = 4 hrs, τ_0 = 6 hrs, τ_TR = 4 hrs). The curves labeled 1 in Figure 5 show the variation of the probability of core damage as a result of spurious scrams. This contribution decreases as the testing period for the logic trains increases. Spurious scrams are almost totally due to the exceeding of the ABT for the logic train testing (τ_TR). As T_TR increases, fewer tests are performed on the logic trains and the probability of spurious scrams decreases, with a corresponding decrease of the probability of core damage from such spurious scrams. As expected, the spurious scram contribution is smaller for case II (large ABTs).
The ATWS probability, and hence the corresponding contribution to the probability of core damage, increases with T_TR, since a higher logic train testing period means higher RPS unavailability. The combined effect of the spurious scram and ATWS contributions on the PCD is given by the curves labeled 2 in Figure 5. Thus, the contribution to the PCD from spurious scram and ATWS initially decreases with T_TR but then it increases again. The ATWS contribution is larger for case II. When the contribution of the "real scram" core damage probability is added to the other two contributions, the total probability of core damage remains practically constant for all values of T_TR, as it is shown by the curves labeled 3 in Figure 5. The probability of core damage from a real scram depends on the time the reactor is up and operating and hence susceptible to a real challenge. This time increases as T_TR increases since

the probability of spurious scrams and the associated reactor shutdown time decrease. The initial increase of the reactor-up time results in an increase of the probability of real scram and of the corresponding contribution to the PCD (see Figure 5). As T_TR continues to increase, the probability of an ATWS increases. This increase in ATWS probability compensates for the decrease of the spurious scram contribution both to the PCD and to the reactor-shutdown time. As a result, the reactor-up time decreases along with the probability of a real scram and the associated contribution to the PCD.
The variation of the reactor unavailability (per year of reactor operation) and its three constituents (i.e., downtime following a successful response to a real scram, ATWS, and spurious scram) as a function of T_TR is given in Figure 6. The unavailability decreases with T_TR because of the dominating effect of the corresponding decrease of the downtime from spurious scrams. The same qualitative behavior is observed for both small ABTs (case I) and larger ABTs (case II). The total unavailability for case II is, however, lower, because of the substantial decrease of the spurious scram contribution.
The results of the sensitivity analysis for the case of no dependences (β_TR = β_CH = 0) depicted in Figures 5 and 6 indicate that, if total PCD and reactor unavailability were the only criteria for assessing the ABTs and the period of testing for logic trains, then large ABTs and testing periods are favored, since they do not affect the PCD while they do decrease the reactor unavailability. It should be noted, however, that this conclusion might not hold if other risk criteria are considered (e.g., offsite health effects). In this case an increase in the period of testing or in the ABTs, while it does not change the total PCD, does affect the relative contribution of various accident sequences. An ATWS core-melt accident sequence, for example, could be more severe than an equally probable spurious-scram core-melt sequence, in terms of beyond-core-melt consequences.
Figures 7 and 8 correspond to Figures 5 and 6, respectively, when there are dependences among the logic trains and among the analog channels (β_TR = β_CH = 0.10). The total PCD does increase with T_TR and the ATWS contribution dominates the PCD changes (see Figure 7). This was expected, since the incorporation of dependences among the logic trains increases the RPS unavailability and hence renders the ATWS probability much more sensitive to the frequency of the logic train testing. The reactor unavailability (per year of reactor operation), on the other hand, takes practically the same values as for the case of no dependences. This is due to the fact that the incorporation of dependences affects mainly the reactor shutdown following a successfully mitigated ATWS. The latter, however, represents only a small contribution to the total unavailability.
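For completeness, the sketch below shows how the downtime attribute used in these comparisons can be extracted from a pointwise unavailability history, i.e. the time average D̄ = (1/T) ∫ D(t) dt and the expected downtime D̄T. The D(t) values here are made-up placeholders rather than output of the plant model.

    import numpy as np

    hours = np.arange(0.0, 8761.0)                               # one year, hourly grid
    D = 0.002 + 0.001 * np.sin(2 * np.pi * hours / 8760.0) ** 2  # placeholder unavailability history

    D_bar = np.trapz(D, hours) / hours[-1]   # average unavailability over the period
    downtime = D_bar * hours[-1]             # expected reactor downtime, hours per year
    print(f"average unavailability = {D_bar:.3e}, expected downtime = {downtime:.1f} h/yr")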

5. SUMMARY AND CONCLUSIONS

The purpose of the paper was to present a methodology and its application for the probabilistic evaluation of alternative plant technical specifications regarding system surveillances and out-of-

Figure 5. Total Core Damage Probability as a function of the logic train testing period T_TR (days), showing the real scram, ATWS and spurious scram contributions; β_TR = β_CH = 0, T_CH = 30 d. Case I: short ABTs (1 hr, 1 hr, 2 hrs); Case II: longer ABTs (4 hrs, 6 hrs, 4 hrs).

Figure 6. Reactor unavailability as a function of the logic train testing period T_TR (days), showing the spurious scram and ATWS contributions; parameters and Cases I and II as in Figure 5.

Figure 7. Total Core Damage Probability as a function of the logic train testing period T_TR (days); β_TR = β_CH = 0.10, T_CH = 30 d, Cases I and II as in Figure 5.

Figure 8. Reactor Unavailability as a function of the logic train testing period T_TR (days); β_TR = β_CH = 0.10, T_CH = 30 d, Cases I and II as in Figure 5.



service times. The methodology was based on a Markov model and applied to the Reactor Protection System of a 4-loop RESAR-3S type nuclear power plant. The two attributes used in the evaluation of alternate sets of technical specifications for the RPS are the probability of core damage per year of reactor operation (PCD) and the expected reactor unavailability. A Markov model was employed that allows for the modeling of state dependences and other dynamic effects like the renewal of the system after each successful challenge. These modeling capabilities result in greater realism in the calculation of the two attributes mentioned above. Furthermore, the model includes multiple component and system states that permit the calculation of the three contributors to the PCD: i.e., probability of core damage following a real scram, probability of core damage following a spurious scram, and probability of core damage following a failure of the RPS to scram the reactor. Each technical specification affects a different contributor to the PCD and thus the proposed model offers, in addition to the greater realism in the calculation of the PCD, a better insight into the effects of specific changes in the technical specifications.
The general trends identified in the calculations performed in this study are as follows:
(i) The probability of core damage is mainly affected by the unavailability of the RPS and consequently by the probability of an ATWS. The reactor unavailability, however, is mainly affected by the probability of spurious scrams. This behavior is due to the fact that the conditional probability of core damage given an ATWS is much higher than the conditional probability of core damage given a spurious scram.
(ii) The Allowable Bypass Times (ABT) for the analog channels and the logic trains affect mainly the probability of spurious scrams. In general, an increase in the ABTs results in a decrease in the probability of a spurious scram and in a much smaller increase in the probability of an ATWS. The conditional probability of core damage given a spurious scram is, however, much smaller than the conditional probability of core damage given an ATWS. Consequently, an increase of the ABTs results in either no increase or a small net increase of the probability of core damage, depending on the level of dependences among analog channels and among logic trains. On the other hand, the significant decrease in the probability of spurious scrams corresponds to a significant decrease in the reactor downtime. Given the very small increase in the PCD and the significant decrease in the expected reactor downtime obtained in this study, an increase in the ABTs might be justified if these two are the only attributes that evaluate the effect of changing LCOs.
(iii) The frequency of testing of the analog channels and the logic trains affects the probability of core damage more than it affects the expected reactor downtime, and in a way that depends on the level of dependences among the analog channels and among the logic trains (dependent failures). At low levels of dependences low frequencies of testing are justified while at high levels of

dependences, high frequencies result in lower probabilities of core damage.

6. ACKNOWLEDGEMENT

These notes are based on the report "Review and Assessment of Evaluation of Surveillance Frequencies and Out of Service Times for the Reactor Protection Instrumentation System", BNL-NUREG-51780, by I.A. Papazoglou and N.Z. Cho, and on a technical paper soon to be submitted for publication.

7. REFERENCES

(1) Reference Safety Analysis Report, Westinghouse Electric Corporation, RESAR-3S, July 1975.
(2) Standard Technical Specifications for Westinghouse Pressurized Water Reactors, U.S. Nuclear Regulatory Commission, NUREG-0452, Revision 3, September 1980.
(3) Jansen, R.L., Lijewski, L.M., and Masarik, R.J., "Evaluation of Surveillance Frequencies and Out of Service Times for the Reactor Protection Instrumentation System", Westinghouse Electric Corporation, WCAP-10271, January 1983.
(4) Apostolakis, G. and Chu, T.L., "The Unavailability of Systems Under Periodic Test and Maintenance", Nuclear Technology, 50, August 1980.
(5) Vesely, W.E. et al., "FRANTIC II - A Computer Code for Time Dependent Unavailability Analysis", Brookhaven National Laboratory, NUREG/CR-1924, April 1981.
(6) Ross, S.M., Introduction to Probability Models, Academic Press, New York, 1973.
(7) Howard, R., Dynamic Probabilistic Systems, Volumes I and II, John Wiley and Sons, Inc., New York, 1971.
(8) Shooman, M.L., Probabilistic Reliability: An Engineering Approach, McGraw-Hill Book Company, New York, 1968.
(9) Papazoglou, I.A. and Gyftopoulos, E.P., "Markovian Reliability Analysis Under Uncertainty with an Application on the Shutdown System of the Clinch River Breeder Reactor", Nuclear Science and Engineering, 73, 1 (1980).
(10) Papazoglou, I.A. and Gyftopoulos, E.P., "Markov Processes for Reliability Analyses of Large Systems", IEEE Trans. Reliability, R-26, 232 (1977).
(11) Fleming, K.N. and Raabe, P.H., "A Comparison of Three Methods for the Quantitative Analysis of Common Cause Failures", Proceedings of the ANS Topical Meeting on Probabilistic Analysis of Nuclear Reactor Safety, May 1978, Los Angeles, California.
(12) McClymont, A.S. and Poehlman, B.W., "ATWS: A Reappraisal - Part 3: Frequency of Anticipated Transients", Electric Power Research Institute, EPRI NP-2230, January 1982.
(13) Kolb, G.J. et al., "Review and Evaluation of the Indian Point Probabilistic Safety Study", Sandia National Laboratories, NUREG/CR-2934, December 1982.
(14) Zion Probabilistic Safety Study, Commonwealth Edison Company, September 1981.
(15) IEEE Guide to the Collection and Presentation of Electrical, Electronic, and Sensing Component Reliability Data for Nuclear Power Generating Stations, IEEE Std. 500-1977.
(16) Miller, C.F. et al., "Data Summaries of Licensee Event Reports of Selected Instrumentation and Control Components at U.S. Commercial Nuclear Power Plants", EG&G Idaho, Inc., NUREG/CR-1740, May 1981.
(17) Indian Point Probabilistic Safety Study, Power Authority of the State of New York and Consolidated Edison Company of New York, Inc., March 1982.




STRUCTURAL RELIABILITY: AN INTRODUCTION WITH PARTICULAR REFERENCE TO PRESSURE VESSEL PROBLEMS

A.C. Lucia
Commission of the European Communities
Joint Research Centre
Systems Engineering and Reliability Division
I-21020 Ispra (VA)

ABSTRACT. The problem of structural reliability assessment is tackled with particular reference to the modelling of cumulative damage processes. The estimation of the reliability of a pressure vessel is dealt with by means of a phenomenological approach.

INTRODUCTION

The general assessment problem dealt with by structural reliability analysis is the following: given a structure (i.e. pieces of material shaped and connected in order to perform a certain function) characterized by a number of "attributes" (e.g. geometry, material properties, fabrication or service induced defects) and by some "operating conditions" (e.g. external and internal loads, environment, etc.), which is the probability that, during a given period of time, it will perform its intended function or, in other words, it will not "fail"?
From an analytical point of view, the reliability of a structural component can be represented as the probability that a vector representing its properties and behaviour does not cross a "safe surface" defining an admissible region Ω:

R(T) = P[s(t) ∈ Ω, t ∈ (0,T)].    (1)

A possible approach to its assessment is constituted by the theory of multivariate stochastic processes. However, it is in practice difficult to estimate the reliability R using the general expression (1). It is difficult even to define the structure of s, and simplifying assumptions have to be made.
Component performance capacity and demand are not statistically independent. Loads indeed tend to activate ageing and damaging mecha-

nisms that have a definite effect on the capacity of the structure. Modification of strength may in turn influence loads. A simplified scheme might be to decompose loads into two parts: loads affecting the resistance, and loads arising from accidents but not adversely affecting the capacity.
As far as the variation in space of structural reliability is concerned, the component can be segmented into a number of regions whose failure characteristics can be studied independently.
The reliability is, therefore, reduced to a statistical interference, in time, of capacity and demand processes which are, in general, complex processes and which have to be decomposed into simpler ones. Additionally, these simple processes have to be quantitatively specified both by experimental techniques and by statistical inferences on data obtained by these techniques.
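As a reminder (our addition, not part of the original text) of what the simplest, time-independent form of such a capacity/demand interference calculation looks like, with capacity C and demand D treated as independent random variables with density f_D and distribution F_C:

    R = P(C > D) = ∫ f_D(x) [1 - F_C(x)] dx,   the integral being taken over all demand levels x.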

1. THE FATIGUE PHENOMENON AS A CUMULATIVE DAMAGE PROCESS

Most of the phenomena that structural reliability analysts have to deal with are cumulative: wear, corrosion, fatigue crack formation and propagation, creep, neutron or hydrogen embrittlement, diffusion, etc. The macroscopic effects of the accumulation of damage are determined by the progressive degradation of the structure until its operating characteristics get beyond a safe or functional limit: which constitutes a failure.
As far as fatigue is concerned, the interpretation of damage as the birth and propagation of defects in the elementary structure of the material, caused by the alternating stress field acting on imperfections of the crystalline network, on distortions due to impurities, etc., is generally accepted. Different mechanisms are presumably present in fatigue damage accumulation, from the initial nucleation of a microcrack to the growth of the large defects. The environmental conditions, the characteristics (i.e. frequency, shape, min/max ratio) of the stress field and of the material significantly affect the relative importance (in terms of number of cycles) of the various phases.
Experimentally evident fatigue failures in operating components are very often due to fabrication defects, generally introduced during welding procedures. This led reliability analysts to concentrate much larger efforts on the propagation stage starting from the fabrication defects.
Nevertheless, the present trend to extend the operating life of plants and to requalify them for different and sometimes harder operat-

ing conditions is now shifting more attention towards crack nucleation and microcrack coalescence.
The estimation of the reliability (number of cycles to failure) of a cyclically loaded structural component is an operational safety need and a practical problem of current design too, considered by the current standards (see, for example, the criteria of fatigue design of the ASME Sect. III standards and the acceptance criteria for defect propagation rates of the ASME Sect. XI standards). The need to go beyond the criteria of the standards is indicated by the fact that, for fatigue design, the ASME uses the PALMGREN-MINER rule, based on the hypothesis of linear accumulation of damage, which is often considered completely unjustified, Freudenthal /1/.
Widely employed prediction methods for the fatigue lifetime are based on the randomization of phenomenological formulations of the fatigue process. This type of method is also largely used for the interpretation of laboratory results, coping with the need to represent the scattering of results. More reliability oriented are the methods based on the extreme value theory, in which the distributions of the extreme values of load and resistance are taken into consideration. Models of greater complexity tackle the problem of representing the processes of strength degradation and of load combination, accounting also for their stochastic nature. Good results can be obtained in the interpretation of experimental results and in the prediction of the lifetime of operating components as well.
It does not seem possible at present to achieve a unification of the approaches to the problem of the evaluation of the lifetime of components. Although apparently arbitrary, the classification followed hereafter reflects different fields of application and purposes (design safety analysis, data interpretation, stochastic process simulation for prediction, etc.).

2. DETERMINISTIC METHODS 2.1. Criteria for fatigue design The ASME standards for components of conventional (Sect. VIII, Div. 2 /2/) or nuclear (Sect. Ill) installations are based on limitations imposed to a "cumulative factor of use" U:

490

i = Z i = <

(2)

where . is the number of cycles envisaged for the cyclic load of type i and N. the allowed number of cycles as deduced from the fatigue curves (SN curves). This criterion coincides with the Miner rule /3/, and is still the most commonly used, not only for components subject to ASME standards but also, and especially, for components in the aeronautical industry for which the dimensioning for fatigue is often the main consideration, Schutz /4/. The SN curves used are generally obtained from monoaxial cyclic load tests at constant amplitude, on notched samples; a suitable safety factor, of the order of 3 and 4, is applied to the mean experi mental curve to take account of the considerable scatter of results. With a safety factor of 4 the design reliability is of the order of 99.9% for a standard deviation of 20%. The question of the accuracy of the Miner rule has often been raised, although few valid attempts have been made to verify it experi mentally because of the high cost of these experiments. During the sev enties, many experimental campaigns were undertaken, . in particular by NASA and MesserschmittBolkowBlohm with a number of results reported by Schutz and leading to the conclusion that the Miner rule has some relevant drawbacks, because there are no criteria to estimate a priori whether the prediction will be conservative or not and because the pa rameters which influence the estimation are not known. The simplicity of the Miner rule is the result of the main hypothesis on which it is based: the linear accumulation of damage. This hypothesis corresponds to the assumption of an exponential distribution for n., number of cy cles to rupture in case of constant amplitude load:

P(n.) = exp I

(?)

).

(3)

In fact, if one considers a total of cycles, of which n. (i = 1, 2, ...) with constant amplitude ., the probability of overall survival will be:

R(N) = " . P ( n . ) = exp

| J

fey

(4)

from which

491
.

. 1/R(N) = Y ) = const i i

(5)

which expresses the hypothesis of lineair accumulation of damage. The distribution (3) is, on the other hand, characterized by a failure rate : P'(n.) l h(n.) = = = const. P(n ) i i

(6)

which is in contrast with the experimental evidence of fatigue damage, characterized by an increasing rupture rate. The Miner rule and, as a consequence, the above mentioned ASME standards take into consideration the formation of fatigue cracks in integer material. Different coinsiderations have obviously to be made when dealing with the propagation of existing cracks (see next paragraph). To this respect, Appendix A of Section XI of the ASME code /5/ de fines the criteria for the acceptance of defects in nuclear compo nents. The propagation analysis is carried out on the basis of the Paris relationship /6/. This procedure shows how the design criteria provided by the Miner rule can be integrated by a more accurate propa gation analysis and thus by acceptance standards. Starting from the initial dimension, a , of the defect, an incre o ment Aa is estimated for each type of transient, up to the final dimen sion a . This value is compared with the smallest critical value a , defined by equating the stress intensity factor with the static fracture toughness . This analysis, indicating the need for control ling organizations to go beyond the results of the Miner relationship towards a more rational approach to fatigue design, constitutes also a way of linking the nondestructive testing standards (definition of the initial defects) and the operating conditions envisaged, expressed by the environmental conditions and the load cycles. 2.2. Models of fatigue crack growth The fatigue crack propagation rate may be influenced by a lot of param eters, such as stress parameters (maximum stress, mean stress, frequen cy and shape of the load signal, ratio of the minimum to ma.ximum stress; possible presence of residual stresses); material parameters (chemical composition; heat treatment; microvoid distribution); macro defect parameters (defect type: crack, inclusion, porosity, etc.; di

492 mension; orientation; shape; tip diameter); environmental parameters (dry/wet situation; corrosion effects; embrittlement; temperatures, etc.). A model taking into account correctly all of the above parameters does not exist. Furthemore, the effect of some of them (e.g. residual stresses; corrosive environment) have not been exactly quantified and modeled yet. Anyway, about a hundred different models are suggested in the literature . They range from purely phenomenological models (e.g. Paris; Forman; Priddle; etc.), to models based on the dislocation theory (e.g. Bilby - Cottrel - Swinden); from models based on the material behaviour at the crack-tip, to others based on cyclic properties of the material. Generally speaking, the more complex the model, the more difficult its practical application, except in the same situation for which they have been established and their parameters estimated. The most widely employed models are semi-empirical and have been developed mainly as interpretative models of experimental results; they allow the prediction of the behaviour of the crack size "a" as a function of the number of cycles "N". This prediction is based on the integration of the growth rate, for fixed initial conditions of the defect. The result arrived at is not, however, in general, representative of the real growth situations. This is due to the fact that the initial conditions (stress, material, crack, environment parameters) are not exactly known and some of their effects are poorly modeled or neglected and to the fact that, for the same initial conditions, there is an intrinsic variability in the process of damage by fatigue which leads to a distribution of values "a", at cycle N.

3. METHODS BASED ON THE RANDOMIZATION OF FATIGUE CRACK GROWTH MODELS The considerations made in the preceding paragraph clearly indicate that a better representation and understanding of the process of fatigue crack growth (FCG) could be obtained by probabilistic methods. It appears natural to consider the relationships for defining the growth rate as stochastic. In this context prediction methods can be seen as being based on the integration of the FCG relationships with the parameters or the initial conditions or the loads represented by random variables. These procedures lead to the determination of a distribution of dimensions for the propagated defect, at cycle N, or of a distribution of the number of cycles N fer a given propagation from a o to a. These distributions are the basis for the prediction of the re-

493 sidual life of structures stressed by fatigue. Quite a wide range of applications of this approach can be found in the literature. One, for example, has been developed by Yang, Donath and Salivar /7/, representing the growth rate of the defect size by the Paris relationship:

da b b = () = XQ(AK) . dn

(7)

The randomization of (7) is performed by considering C to be a random variable which can be expressed as a product between the parame er Q and the variable X with lognormal distribution and unit mean val ue. The dispersion of experimental results is used to determine the parameter b, Q and the variance of = log X. In this way the propaga tion model and its intrinsic variability are defined; the integration of (7) also requires definition of the link between the stress intensi ty variation and the defect size, which is done with reference to fracture mechanics. It is then possible to proceed to the integration of (7) and thus to the definition of a particular function a(N) for an assigned probability level, associated with the X value with which the integration is carried out. The next step is the determination of the distribution F () (c.d.f. of the number of cycles necessary to obtain a fixed propaga tion level) and F (a) (c.d.f. of the propagated dimensions a for a fixed number of load cycles). A very interesting model is the one proposed by H.D. Madsen /8/ for constant amplitude loading and extended to variable apmlitude loading. The crack increment in one cycle is expressed by:

da , *m , ,/, m = Y(a,Y) C(o -hTa) dN

(8)

where to explicity account for uncertainties in the calculation of K, the geometry function Y is written as a function of "a" and of a vector Y of random parameters. The model, based on linear elastic fracture mechanics, is devel oped with consideration to experimentally observed results and accounts for uncertainties in the loading, initial crack size, critical crack size, material properties including spatial inhomogeneity and uncer tainty in the computation of the stress intensity factor.

494 In the DufresneLucia approach /9/, the fatigue crack growth rate is still expressed by the ParisErdogan law, in the formulation:

^ = C ( ) n dN A (9)

c "V n
where the propagation analysis is bidimensional. The stress intensification factors , are calculated at the A defect tips, using the fracture mechanics relationships. The randomization is applied to parameter C which is represented by means of histograms, derived from the analysis of a large number of experimental results. More details are presented in chapter 6.

4. METHODS BASED ON THE THEORY OF EXTREME VALUES The reliability of a structure can be though of as the probability that the largest of the possible loads is smaller than the smallest of the resistances hypothesized. This means that what one needs to know is the distribution of the extreme values of the loads and of the resis tances, rather than their effective distributions. This observation, together with the fact that the possible distributions of the extreme values of a random variable are asymptotically independent of the dis tribution of the variable itself, leads to the consideration of the ex treme values theory as a fundamental ingredient of structural reliabil ity. A number of methods generally based on the hypothesis that the lowest resistences have a Weibull distribution, have been proposed by several authors. The classic approach of A.M. Freudenthal /!/ assumes that the rup tures are triggered by the accumulation of damage in the weakest area of the material, characterized either by the largest value of the de fect propagation rate (fatigue) or by the largest value of the seconda ry deformation rate (creep). Implicit in this approach is the consider ation of the propagation phase as the essential aspect of rupture by fatigue. The presence of fabrication defects is so emphasized at the expense of natural material defects, the growth of which is considered negligible. The most interesting aspect of the method is that of being

495 able to represent fairly simply the phenomena of creepfatigue interac tion and of obtaining, as a final development, design relationships as simple as the Miner rule, but characterized by rupture rates which grow with component age. The semiempirical is: relationship used to describe fatigue failure

a.

\dN/i F

C.

(10)

The distribution of the number of Weibull one:

cycles to

rupture,

, F

is a

F z = 1 exp ( ^ : ) (c/u)

(11)

which expresses the probability that is smaller than an assigned value. In the method presented in /10/, A.S. Ang uses the experimental information contained in the usual SN fatigue curves integrated into a probabilistic approach. This method assumes that the lifetimes of fa tigue stressed components are reproduced by the Weibull distribution. This assumption, initially adopted for situations of monoaxial load with cycles of constant amplitude, is extended to the case of loads of variable amplitude, supposing that the accumulation of damage due to individual load cycles follows the Miner rule. This is expressed with:

s) ds (s)

f(s) being the p.d.f. of the load amplitude, N(s) = c/s the mean value of the lifetime for fatigue at constant amplitude s (deduced from the appropriate SN curve) and the number of cycles to rupture in condiF tions of variable load. Unlike the authors already mentioned, R. Talreja /ll/ account not only for the stage of defect propagation but also for the initial stage of defect formation. In this stage, the assumption is made of a resistance R uniformly decreasing, from the initial value R to the transition value R . The o c degradation of the resistance is expressed introducing the damage pa-

496 rameter D: R R D = . R R o c

(13)

The evolution of damage is supposed to occur according to the fol lowing relationship between damage rate, cyclic stress S and number of cycles N:

dN

\ 1D /

(14)

R is supposed to be a random Weibull variable which is defined from o the (Griffith) relationship: 1/2 , , R = k aa (15) c c c being a shape factor and a the transition dimension of the defect. c In the propagation stage the resistance, still expressed by a re lationship of type (15), is supposed to decrease with the increase of dimension a. The propagation stage is governed by PARIStype law. A re lationship is than found between the resistance R and the number of cy cles N. The representation of the degradation of R as a function of by RN diagrams with assigned probability is characteristic of the Taireja approach; in these diagrams, which are drawn for various stress values S, it is possible to represent histories of cyclic load with variable amplitude and thus make lifetime predictions. 5. THE ACCUMULATION OF DAMAGE AS A STOCHASTIC PROCESS In general both the loading actions and the resistance degradation mechanisms have the characteristics of stochastic processes. They can be defined as random variables which are functions of time. The partic ular load history which affects a component is one of the possible re alizations of the stochastic load process and the same applies for the environmental condition or for the evolution of the dimensions of a de fect inside the component. The damage accumulation mechanisms can, in general, be represented by a positive "damage rate" function such that the measure of damage is a monotonie increasing function of time.

497 Under rather general conditions such accumulation processes can be represented by a wellknown class of stochastic processes, the Marko vian class. The simplicity and flexibility of models based on Markov schemes is the reason for their frequent appearance in the literature. In this chapter we analyse in particular the BogdanoffKozin method /12,13/j because of its particular relevance to fatigue damage phenome na for the class of components of interest. Other applications of Markov schemes may be found in Volta /14/. In the case in which the emphasis is rather upon the stochastic process of the loads and environmental conditions than upon the mecha nism of damage accumulation, the traditional techniques for the treat ment of processes of this type become more important: see e.g., Bolotin /15/. In the KozinBogdanoff model, the process of damage accumulation is simulated with a discretestate, discretetime Markov process based on the assumptions that: 1) the increase in dimensions of the defect at the end of each cycle depends only on its dimensions at the beginning of the cycle; 2) the damage accumulation mechanism (defect growth) only allows posi tive changes and the damage level may increase at most by one unit in each cycle. The defect dimensions are associated with a discrete level or state which allows (without excessive restrictions) use of wellknown algorithms for discrete Markov processes. No assumption is made about the elementary mechanisms which cause the damage and its accumulation, apart from those of general character already defined; one can thus say that such an approach is purely prob abilistic, unlike those examined already in which the probabilistic as pect is linked to phenomenological aspects peculiar to the damage mech anism considered. The damage levels are represented by the states y = 1, 2 b, b being the conventional rupture state. The damage process is repre sented, at cycle x, by the probability transition matrix [p ]: 0 q P 2 0 0 q 2 0 0 0 0 0 (16) 0 ... q

tpJ

As the transition between the states is governed by (16), the dam age state at level is linked to that at cycle x1 by: <PX>
<PXI>[P X ]

(17)

498 and thus: (p ) = (p ) w. [Pj (18) o k k which describes a "stationary" process. These relationships represent the mathematical basis of the Markov process; from them one can easily find the probability distribution of the cycles to rupture and of the damage level. The cumulative distribution function F (x;b) of the number of cy w cles at the failure level b is, by definition: F (x;b) = (b) w (19)

while the cdf of the level of damage at a given number of cycles is: j F (j;x) = V
o

(k)

(20)

"

k=l The transition matrix represents the damage process in the sense that, once the values of ., q. are defined, the mean 'curve of the pro cess and the single realizations can be reproduced with (16), for a fixed initial condition. It is also possible to take into account the consequence of in spections by updating the (j) with the definition of a new initial xl distribution (j), after inspection n.i. o Bogdanoff and Kozin have evidenced /12/ the importance of incorpo rating in the model the history dependence. Furthemore, they have analyzed fatigue life data as well as fa tigue crack growth data showing positive skewness: which means, rather surprisingly, that distributions generated by weakest link model (e.g. the Weibull one) does not seem to fit with the statistical properties of the data. It is suggested, consequently, to investigate whether a strongest link statistical model would provide a better description of fatigue crack growth. Another application of stochastic process theory is constituted by the Bolotin method which addresses the problem of the estimation of the lifetime of a structural component either in the design phase (a priori estimate) or during operation (a posteriori or conditional estimate). It is assumed that the properties of the component relevant to its safety and the loads acting on it vary in time and that the variations can be described by a vector stochastic process U(t); the admissible region is defined in the space V based on economic and safety consid

Another application of stochastic process theory is constituted by the Bolotin method, which addresses the problem of the estimation of the lifetime of a structural component either in the design phase (a priori estimate) or during operation (a posteriori or conditional estimate). It is assumed that the properties of the component relevant to its safety and the loads acting on it vary in time and that the variations can be described by a vector stochastic process U(t); the admissible region Ω is defined in the space V on the basis of economic and safety considerations. The reliability of the component is defined as the probability that the representative vector of the process remains within the region Ω in the time interval (0,t):

R(t) = P(U(τ) ∈ Ω, τ ∈ (0,t))          (21)

Defining the lifetime t_f of the component as the time of first contact between U and the boundary of the admissible region, one has:

P(t_f) = 1 − R(t_f)          (22)

In the design phase the lack of information on U and Ω is compensated with the use of wider safety margins.
During operation of the component more information can become available by which the initial estimate can be modified. In this case a conditional reliability is defined based on the fact that the component has actually remained operational up to the time t_k of the last inspection:

R(t/t_k) = P(U(τ/t_k) ∈ Ω(t_k); τ ∈ (t_k, t))          (23)

and the consequent distribution of the residual life is:

P(t_f/t_k) = 1 − R(t_k + t_f/t_k)          (24)

These completely general relationships define the principal criteria for the analysis; further progress requires the definition of the failure mechanism and hence the study of the limit conditions for interference between the stochastic loading processes on the one hand and that of the degradation of resistance on the other.
The description of this model gives us the opportunity to remember that:
i. any lifetime prediction made at the beginning of service of a structure is quite unlikely to be correct;
ii. it is of paramount importance to collect and properly use information on structural behaviour during service to improve the predictions;
iii. to this end, information merging techniques and updating procedures may play quite a vital role in structural reliability assessment.
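A crude Monte Carlo sketch of equations (21)-(24) is given below. The scalar degradation-plus-load process, its parameters and the admissible threshold are invented for illustration and stand in for the vector process U(t) and the region Ω of the Bolotin formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sim, n_steps = 20000, 200
threshold = 10.0                                   # boundary of the admissible region (assumed)

# Assumed process: slow random degradation plus a random load peak at each time step
drift = rng.normal(0.02, 0.01, size=(n_sim, 1)).clip(min=0.0)
degradation = drift * np.arange(1, n_steps + 1)    # linear-in-time loss of margin per sample
load_peaks = rng.gumbel(loc=6.0, scale=0.8, size=(n_sim, n_steps))
u = degradation + load_peaks                       # scalar stand-in for U(t)

exceeded = np.cumsum(u > threshold, axis=1) > 0    # has the boundary been touched up to step k?
R = 1.0 - exceeded.mean(axis=0)                    # eq. (21): R(t), t = 1..n_steps

# eq. (23): conditional reliability given survival up to the inspection time t_k
t_k = 50
survivors = ~exceeded[:, t_k - 1]
R_cond = 1.0 - exceeded[survivors, :].mean(axis=0)

print("R(100)                      =", R[99])
print("R(100 | survived to t_k=50) =", R_cond[99])
```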

6. THE RELIABILITY OF A PRESSURE VESSEL: A PHENOMENOLOGICAL APPROACH
Estimating the failure probability of tanks and pressure vessels has, for various reasons, been of interest to Insurance Companies, boiler manufacturers, their associations and Union associations, Users, Safety Organizations. The results of these studies have not always been published; they have been picked up from Safety Authority summaries.
The failure probability of pressure vessels has often been assessed by means of a statistical method.
The method consists of defining a reference population which is as representative as possible of the structure to be examined, and proceeding with a systematic recording of the accidents which happen to this reference population. For such an examination to make sense, the following is essential: the reference population must be representative, that is to say, the materials used, manufacturing procedures, control means and acceptability criteria are as close as possible to those used in the structure being considered; working conditions, checks and periodic inspections, as well as the authorisation criteria, are as similar as possible.
Difficulties arise when very low probabilities have to be estimated, for which it is necessary to have available a very large population of components.
If one has not got such a population and tries to widen the available number, enlarging the sample characteristics, the result will be biased.
On the other hand, recording the incidents for such a large population would involve a lot of work in checking, in sorting out and in interpreting the results.
In effect, when the sample has been enlarged, certain accidents which have happened to the reference population could not happen on the equipment examined, and it is necessary to identify, for the equipment to be examined, the representative and unrepresentative accidents. Obviously problems can arise in defining these accidents.
The statistical approach is an overall approach. The consequent reliability R = 1 − P_f (P_f being the estimated probability of failure) has to be interpreted in a frequentist sense. It can be thought of as the fraction of the theoretically infinite set of identical structures which still perform their intended function after the design lifetime T.
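To illustrate why very low failure probabilities demand very large reference populations, the sketch below computes a classical point estimate and an upper confidence bound on the annual failure rate from the number of recorded failures and of accumulated vessel-years; the figures are invented for illustration only.

```python
from scipy.stats import chi2

failures = 2            # failures recorded in the reference population (assumed)
vessel_years = 150000   # accumulated observation time (assumed)

point_estimate = failures / vessel_years
# One-sided upper 95% bound for a Poisson failure rate (chi-square relationship)
upper_95 = chi2.ppf(0.95, 2 * (failures + 1)) / (2 * vessel_years)

T = 40                                  # design lifetime in years
p_f_40y = T * upper_95                  # rare-event approximation for P_f over T years

print(f"estimated failure rate : {point_estimate:.2e} per vessel-year")
print(f"95% upper bound        : {upper_95:.2e} per vessel-year")
print(f"P_f over T = {T} years  : {p_f_40y:.2e}   (R = 1 - P_f, frequentist sense)")
```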

When we are dealing with our particular structure and want to know its reliability, we had better refer to the so-called subjective reliability, or Bayesian reliability. It depends on the existing, although not well known, resistance characteristics of the given structure and the load process. Its value is not fixed, but may change as our information on the structure changes.
The reliability assessment methods presented in the preceding sections can be successfully used for subjective reliability estimation: a practical application is briefly presented hereafter.
6.1. The use of probabilistic fracture mechanics
In order to make estimates of the reliability of a structure and of its components, it is necessary to segment the structure, and the conditions it is subjected to, into elements, and to analyze the behaviour of each one of them.
Fracture mechanics is well suited to this type of approach since it allows one to follow the evolution of a defect during the life span of a structure and to subsequently calculate whether the fault might reach dimensions sufficient to cause failure. Such a method has been advanced by several authors: Becher /16/, Harris /17/, Marshall /18/, Vesely /19/.
The analyses consist of expressing the habitual criteria of fracture mechanics in probabilistic form, that is:
i. the growth rate of the crack by fatigue, calculated for repetitive load conditions;
ii. the initiation of unstable crack propagation, calculated for accidental situations;
iii. the possibility of arresting cracks which have become unstable, before the tank bursts.
Besides the structural problems related to existing defects and pertinent to fracture mechanics, the problem also exists of crack nucleation under fatigue loading. In principle, there are two possible ways to analyze the process of crack nucleation:
i. the S-N or ε-N curves;
ii. the damage mechanics at the microstructural level.
Some considerations and modelling proposals on that problem can be found in A. Jovanovic, A.C. Lucia /20/.
Once the existence of a macrodefect has been detected, the crack growth rate da/dN can be expressed as a function of the stress intensity factor variation during the cycle.
The initiation of unstable propagation can be expressed, on the one hand, for the limited plasticity situation (small scale yielding)

with the condition K_I > K_Ic and, on the other hand, for the generalized plasticity condition (general yielding) with a local plasticity instability criterion. The analysis therefore has to account for the distributions of the factors and parameters which are included in crack growth and failure laws and those which participate in calculating K_I, the size of the defects and the applied stresses depending on the location of the defect and the load situation.
In general, in order to reduce the number of random variables of the problem and to keep some of them deterministic, all the above factors and parameters should be the subject of a preliminary examination of their dispersion range, followed by a sensitivity analysis of the reliability assessment model to be used. This sensitivity analysis could be effectively performed by the Response Surface Methodology (see, e.g., Lucia /21/).
It is worth remembering that, once the distributions of the main variables have been established at a given time of the life of the structure, it is absolutely necessary to follow, as finely as possible, the elements which might further modify the thus defined original distribution. Taking the existence of defects in a structure as an example, if the origin has been chosen as being the beginning of operation, it will be necessary to make provisions for all the phenomena which might cause cracks to appear after the equipment has been put into service, e.g. wear and corrosion under stress. The new cracks thus created will be considered in the calculation programme from the moment they appear. On the contrary, if the origin has been taken as during welding, it will be necessary to consider the defects which might appear in the course of the heat treatment.
The probabilistic method thus defined will make available not only the absolute value of the failure probability relative to a structure, but also a lot more information, such as estimating the parameters which have an important role in the final result, both for their mean value and because of their dispersion. In the first case, it might show the usefulness of modifying the conception so as to reduce the influence of this parameter; in the second case it would be opportune to solicit examinations which allow reducing this dispersion, improving a manufacturing procedure or control method.
6.2.1. Stress Analysis. Modern stress calculation methods based on the finite element method generally give accurate results when the deformation is in the elastic field. In the elastoplastic field one might find it necessary to take calculation imprecisions into account, especially when three-dimensional calculations are involved.
Several procedures have been developed to propagate the uncertainties

from input variables to output variables (see, for example, Cox and Baybutt /22/). One of the most valuable and flexible methods is constituted by the response surface methodology, which has been widely employed to approximate long-running computer codes (Lucia /21/).
Nevertheless, the response surface method has some inconveniences, essentially due to the fact that the uncertainty associated with large random vectors requires an almost unbearable computational effort to be propagated. In order to overcome this limitation, an improved response surface approach was formulated (Veneziano et al. /23/). It expresses the output variable by the relationship:

y = η(x, θ) + Σ_i e_i + e

where η can be regarded as a second order polynomial, θ contains the coefficients and x represents the variables. This expression coincides with the one obtainable by the moment methods, except for the error terms e_i. The use of the response surface method allows the derivation of all the model parameters. An application of this approach to structural analysis of a vessel nozzle corner can be found in Faravelli-Lucia /24/.
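A minimal sketch of the basic response surface idea follows: a second-order polynomial is fitted to a handful of runs of a (here deliberately trivial) "expensive" model and then sampled cheaply to propagate the input uncertainty. The stand-in model, the input distributions and the design of experiments are all assumed for illustration and do not reproduce the improved formulation of /23/.

```python
import numpy as np

rng = np.random.default_rng(1)

def expensive_model(x1, x2):
    # Stand-in for a long-running structural code (e.g. a 3D finite element stress analysis)
    return 120.0 + 8.0 * x1 - 5.0 * x2 + 0.6 * x1 * x2 + 0.9 * x1**2

# Small design of experiments around the nominal point of the two input variables
design = np.array([(a, b) for a in (-1.5, 0.0, 1.5) for b in (-1.5, 0.0, 1.5)])
runs = np.array([expensive_model(a, b) for a, b in design])

# Second-order polynomial eta(x, theta) fitted by least squares
def basis(x1, x2):
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

theta, *_ = np.linalg.lstsq(basis(design[:, 0], design[:, 1]), runs, rcond=None)

# Cheap uncertainty propagation through the fitted surface instead of the full code
x1 = rng.normal(0.0, 1.0, 100000)
x2 = rng.normal(0.0, 0.8, 100000)
y = basis(x1, x2) @ theta

print("mean =", y.mean(), " std =", y.std(), " P(y > 140) =", (y > 140).mean())
```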

6.2.2. Stress Intensity Coefficients. Literature gives a great number of more or less empirical formulas which allow calculating the stress intensity coefficient for generally elliptic defects subjected to uniform or variable stress fields. However, there are no formulas suitable for very large defects or situations of extended plasticity.
One possible method consists of calculating, as accurately as possible, the stress intensity coefficient for a certain number of discrete values of the defect eccentricity, and then searching for analytical formulations which would allow interpolating the values obtained. The calculation method may use influence functions /25/. One can always assume that the pressure applied to the crack has the form:

σ(x) = σ_0 + σ_1 (x/h) + σ_2 (x/h)² + σ_3 (x/h)³ + σ_4 (x/h)⁴          (25)

The stress intensity factor linearly depends on σ_0, σ_1, σ_2, σ_3, σ_4, so that one can write:

K(θ) = √(πa) [ σ_0 i_0(θ) + σ_1 i_1(θ) + σ_2 i_2(θ) + σ_3 i_3(θ) + σ_4 i_4(θ) ]          (26)

One can demonstrate that the functions i_j(θ) have no dimension and are applicable to all cracked structures geometrically similar to those which were used for the calculation. The functions i_j(θ) only depend on the ratios 2a/b and 2a/h, where 2a and 2b are respectively the crack width and the crack length, and h is the wall thickness.
As soon as one knows the i_j(θ) functions it is very easy to determine the stress intensity factor variation along the edge of the crack under the action of almost every load type one would meet in practice.
To calculate the stress intensity factor or the i_j(θ) functions we can start by solving the corresponding elasticity problem using a calculation programme with surface integral equations, calculating the elastic displacements at the nodal points. One then determines the stress intensity factor using the displacement extrapolation method, which consists of finding the limit of C·u/√ρ when ρ → 0, where:
u = the crack semi-opening;
ρ = the distance from the point to the crack edge;
C = a constant depending on the elastic constants of the material.
The calculation method was defined and checked on the problem of a circular crack in an infinite medium /25/. It was checked again on an elliptical crack in an infinite medium for 2a/b = 1/3. To be certain of being able to calculate a very long crack with a ratio 2a/b = 1/10 it was necessary to take a case in which one knew an analytical solution. We took once again the case of an elliptical crack in an infinite medium with a ratio 2a/b = 1/10. The results of this calculation turned out to be correct within about 2%.
A. The case of surface defects
The stress intensity coefficient has to be calculated partly at the crack tip (small axis end) and partly at the emergent point (long axis end). Since certain load conditions cause relevant stress gradients at the surface, one must take into consideration a linear stress variation near the wall.

The stress intensity coefficient can be written under a general form, equation (27), in which K_I is expressed as a linear combination of the stress on the surface where the defect emerges, the stress at the small axis end and the stress at the opposite face, the weights being two dimensionless influence coefficients, one relative to the crack tip (small axis end) and one to the emergent point. Expressing these two influence coefficients in analytical form as polynomial functions of the ratios 2a/b and 2a/h gives equation (28). In the field 0.1 < 2a/h < 0.8 and 0.1 < 2a/b < 1 the accuracy of these expressions is 10%.

B. The case of internal defects
Under the assumption of a stress constant along the small axis of the defect, the following formula can be used:

K_I = [σ √(πa) / E(a,b)] exp[ (1/4)(a/d)⁴ √(b/a) ]          (29)

However, this formula is no longer valid for an infinite length defect. Calculating, by the formula suggested by Isida /26/, the value of K_I for a tunnel type defect, we noticed that it corresponded to the formula above for b/a = 11. It is therefore possible to adopt the given formula for values b/a < 11.
6.2.3. The J integral approach. As previously mentioned, formulas approximating the stress intensity coefficient value are no longer available in the case of large defects or extended plasticization. A possible approach is to evaluate K_I via a J integral calculation.
The computation of the J integral can be performed (see Jovanovic, Lucia /20/) by a finite element code, using a 3D mesh containing the representation of the crack. A number of calculations, performed for different crack sizes, allow one to determine the relationship between the crack size and the J integral. For the purpose of its further use in the fatigue crack growth analysis, the stress intensity factor can be calculated on the basis of the J. Assuming plane-strain conditions:

J = [(1 − ν²)/E] (K_I² + K_II²) + [(1 + ν)/E] K_III²          (30)

which, for the crack opening mode type I, reduces to:

J = J_I = [(1 − ν²)/E] K_I²          (31)
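Equation (31), inverted, gives the stress intensity factor from a computed J value. A minimal helper is sketched below; the material constants are assumed, illustrative values typical of a reactor pressure vessel steel.

```python
import math

def K_from_J(J, E=2.0e5, nu=0.3):
    """Mode I stress intensity factor from the J integral under plane strain (eq. 31).
    J in N/mm, E in MPa; the result is in MPa*sqrt(mm)."""
    return math.sqrt(J * E / (1.0 - nu**2))

# Example: J = 10 N/mm  ->  K_I of about 1480 MPa*sqrt(mm), i.e. roughly 47 MPa*sqrt(m)
print(K_from_J(10.0))
```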

A series of K_I values, computed by expression (31) for the nozzle corner of a pressure vessel (see /20/), were compared with some results from the literature.
The comparison showed a good agreement throughout the whole thickness, a minor difference appearing only in the range of small crack sizes, where the influence of the plastic zone on the crack tip is more significant and could not be well modelled in some of the literature work because it was based on a fully elastic analysis.
6.2.4. Damage accumulation law. A model describing the accumulation of damage (e.g. fatigue crack propagation) should try and take into account all the parameters which might influence the damage accumulation.
Actually, a complete model does not exist and a choice has to be made, among the available ones, according to the structural situations

we are looking at. In our code COVASTOL (see /27,28/), developed for the reliability analysis of a LWR pressure vessel, the well known Paris law has been adopted:

da/dn = C (ΔK_I)^n          (32)

when the coefficient C and the stress intensity factors are given in the form of distributions. This relationship, of course, does not describe the behaviour of the crack in the nucleation stage and near the fracture. In order to better model the material crack curve, a variation of C and n has been introduced as a function of ΔK range, environment and R = σ_min/σ_max value.
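The probabilistic use of equation (32) can be sketched as follows: the Paris coefficient C and the initial defect depth are sampled from assumed distributions (purely illustrative, not the COVASTOL input data), the crack is grown block of cycles by block of cycles, and the fraction of samples reaching an assumed critical depth within the design life estimates the failure probability; in a full analysis the critical depth would instead come from the failure criterion of Section 6.2.5.

```python
import numpy as np

rng = np.random.default_rng(2)
n_sim = 20000

# Assumed, illustrative distributions and constants (units: mm, MPa, cycles)
a0     = rng.lognormal(np.log(2.0), 0.6, n_sim)      # initial crack depth
C      = rng.lognormal(np.log(1.5e-12), 0.5, n_sim)  # Paris coefficient
m      = 3.0                                          # Paris exponent n
dsigma = 100.0                                        # stress range per load cycle
Y      = 1.12                                         # assumed constant geometry factor
a_crit = 20.0                                         # assumed critical depth (failure)

a = a0.copy()
block = 1000                                          # cycles integrated per block
for _ in range(40):                                   # design life of 40000 cycles
    dK = Y * dsigma * np.sqrt(np.pi * np.minimum(a, a_crit))  # stress intensity range
    a = a + block * C * dK**m                         # eq. (32) integrated over the block

print("estimated P(a >= a_crit at end of life) =", (a >= a_crit).mean())
```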

6.2.5. Failure criteria. Various solutions have been proposed to define the fracture onset in conditions in which linear elastic fracture mechanics cannot be applied: J, COD, equivalent energy.
In the case where the loading on the structure and the size of the defect can lead to a generalised plasticity and where the J or COD formulations are not valid, it is difficult to use these methods. On the other hand, the two criteria method developed by Dowling and Townley /29,30/ leads to interesting results; this method is based on considering three fields: a field governed by linear elastic fracture mechanics; a field where fracture is directly related to the limit load; an intermediate field connected to the two preceding ones.
Whichever field one considers, it is possible to estimate a load factor F_p given by:

F_p = [2/(π S_r)] arccos { exp[ −(π²/8)(S_r/K_r)² ] }          (33)

which entails the determination of a relative stress intensity K_r (ratio of the applied stress intensity K_I to the fracture toughness K_Ic), and of a relative stress S_r (ratio of the applied stress to the plastic collapse stress).

If only primary stresses are present, K_r is defined as:

K_r = K_rp = K_IP / K_Ic          (34)

where K_IP is the stress intensity due to primary stresses, computed by LEFM without any plasticity corrections.
If both primary and secondary stresses are present, K_r is given by:

K_r = K_rp + K_rs          (35)

where K_rp is the same as above, and a suggested procedure is given to compute K_rs; according to this procedure:

K_rs = K_IS / K_Ic + ρ          (36)

where K_IS is the stress intensity due to secondary stresses, computed by LEFM only, and ρ is a plasticity correction factor.
A simplified procedure has been implemented in COVASTOL, which neglects the usually small plasticity correction factor ρ; in this approximation K_r is simply computed as:

K_r = K_rp + K_rs = K_IP/K_Ic + K_IS/K_Ic = K_I/K_Ic          (37)

where K_I is the stress intensity due to all the stresses, computed by LEFM without any plasticity correction. In the first option, when all the stresses are considered to be primary, this result coincides with the exact value of K_r.
To compute the histogram of K_r, the histogram of K_I is first computed; the histogram of K_Ic is then computed. The histogram of K_r is finally obtained according to the rules of composition of random variables.
According to the R6 criterion, S_r is the ratio of the load generating primary stresses to the plastic collapse load of the flawed structure; if the region is statically determinate, S_r is the ratio of the primary stress to the plastic collapse stress. The general analytical method proposed is implemented in the COVASTOL code; S_r is thus given by:

S_r = { σ_mc (c/w) + σ_bc/4 + [ (σ_mc (c/w) + σ_bc/4)² + σ_mc² (1 − c/w)² ]^(1/2) } / [ σ_f (1 − c/w)² ]          (38)

In the previous expression:
σ_f is the flow stress, given by σ_f = 0.5 (σ_y + σ_u), where σ_y is the yield stress and σ_u the ultimate tensile stress;
σ_mc and σ_bc are the elastically calculated equivalent membrane and bending primary stresses:

σ_mc = N/t          σ_bc = 6M/t²

where t is the wall thickness and N, M are the tensile force and the bending moment, computed from the actual stress distribution acting on the whole wall thickness;
(c/w) is the effective flaw depth, determined as the ratio between the area of the flaw and the area of a conventional rectangle including the flaw:

Internal defects:   c/w = πab / [t(2b + t)]

Surface defects:   c/w = πab / [2t(2b + t)]   if a/b ≥ 0.1;   c/w = a/t   if a/b < 0.1

Axially propagated surface defects:   c/w = 2a/t

where 2a, 2b and t are respectively the crack width, the crack length and the wall thickness.

Once the histograms of K_r and S_r have been computed, the histogram of F_p is computed according to the rules applicable to the composition of random variables.
F_p plays the role of a "safety (or reserve) factor": F_p > 1 indicates that the structure is safe, F_p < 1 indicates a failure condition. In the (S_r, K_r) plane, each loading condition is represented by one point, and the equation F_p = 1 defines a limit curve; the safety of the structure can be assessed by the position of the point with respect to the limit curve.
The histogram of F_p can be represented by a series of points, or more simply by a segment connecting F_p,min and F_p,max. The probability of the representative point to fall beyond the limit curve is the propagation probability.
Plots of the (S_r, K_r) plane can be generated by the present version of COVASTOL, via the postprocessor COVAPLOT.
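A sketch of this composition-of-random-variables step is given below. K_r and S_r are sampled directly from assumed distributions (in COVASTOL they would come from the fracture mechanics and limit load calculations described above), and the load factor F_p is obtained by scaling each sampled point onto the two criteria limit curve, here taken as the R6 rev. 1 assessment line K_r = S_r [(8/π²) ln sec(π S_r/2)]^(-1/2); the failure probability is estimated as the fraction of points with F_p < 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n_sim = 200000

# Assumed, illustrative histograms of the two relative quantities
Kr = rng.lognormal(np.log(0.45), 0.25, n_sim)   # applied K_I / fracture toughness K_Ic
Sr = rng.lognormal(np.log(0.55), 0.20, n_sim)   # applied stress / plastic collapse stress

# Load factor F_p: scale the point (Sr, Kr) radially until it meets the assumed limit
# curve Kr = Sr * [(8/pi^2) * ln sec(pi*Sr/2)]^(-1/2), which gives in closed form:
Fp = (2.0 / (np.pi * Sr)) * np.arccos(np.exp(-(np.pi**2 / 8.0) * (Sr / Kr) ** 2))

print("P(F_p < 1)   =", (Fp < 1.0).mean())
print("F_p range    :", Fp.min(), "-", Fp.max())
```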

CONCLUSIONS
The assessment of the reliability of a structure relies on a number of steps or single procedures and bodies of knowledge to be correlated and concatenated. Non-destructive testing, material characterization, load sequence inference, stress analysis, damage accumulation analysis and failure modes identification constitute the main pieces of information needed to come out with a meaningful reliability estimate.
The necessity to include such a large number of elements makes the assessment complex and leads, on the one hand, to the need to use the best available methods and techniques for the single problems specified, and on the other, to the need of knowledge representation tools for the combination and use of information coming from different sources.

REFERENCES
/1/ A.M. Freudenthal, "Reliability of reactor components and systems subject to fatigue and creep", Nucl. Eng. Des. 28 (1974) 196-217.
/2/ ASME Section VIII, Division 2, "Boiler and pressure vessel code", Appendix 5, Design based fatigue analysis, 1983.
/3/ M.A. Miner, "Cumulative damage in fatigue", J. App. Mechanics, Trans. of ASME 12 (1945) 159-164.
/4/ W. Schutz, "The prediction of fatigue life in the crack initiation and propagation stages - A state of the art survey", Eng. Fracture Mechanics, V.11 (1979) 405-421.
/5/ ASME Section XI, Division 1, "Boiler and pressure vessel code, Rules for inservice inspection of nuclear power plant components", Appendix A (1972).
/6/ P.C. Paris et al., "Extensive study of low fatigue crack growth rates in A533 and A508 steels", ASTM STP 513 (1971) 141-176.
/7/ J.N. Yang, R.C. Donath and G.C. Salivar, "Statistical fatigue crack propagation of IN100 at elevated temperatures", ASME Int. Conf. on Advances in Life Prediction Methods, N.Y. (1983).
/8/ H.O. Madsen, "Random Fatigue Crack Growth and Inspection", in Proceedings of ICOSSAR '85, Kobe, Japan.
/9/ J. Dufresne, A.C. Lucia, J. Grandemange and A. Pellissier Tanon, "Etude probabiliste de la rupture de cuve de réacteurs à eau sous pression", EUR Report N. 8682, 1983.
/10/ A.H.S. Ang, "A comprehensive basis for reliability analysis and design", Japan-US Joint Seminar on Reliability Approach in Structural Engineering, Maruzen Co. Ltd., Tokyo (1975) 29-47.
/11/ R. Talreja, "Fatigue reliability under multiple amplitude loads", Eng. Fract. Mech., Vol. 11 (1979) 839-849.
/12/ F. Kozin, J.L. Bogdanoff, "On Probabilistic Modeling of Fatigue Crack Growth", in Proceedings of ICOSSAR '85, Kobe, Japan.
/13/ F. Kozin, J.L. Bogdanoff, "Probabilistic Models of Fatigue Crack Growth: Results and Speculations", to appear in Jour. of Nuclear Eng. and Design.
/14/ A.G. Colombo, G. Reina and G. Volta, "Extreme value characteristics of distributions of cumulative processes", IEEE Trans. Rel., Vol. R-23, N. 3 (1974) 179-186.
/15/ V.V. Bolotin, "Life prediction of randomly loaded structures", Nuclear Eng. and Design 69 (1982) 399-402.
/16/ P.E. Becher and A. Pedersen, "Application of statistical linear elastic fracture mechanics to pressure vessel reliability analysis", Nucl. Eng. Des. 27 (1974) 413.
/17/ D.O. Harris, "A means of assessing the effects of NDE on the reliability of cyclically loaded structures", Materials Evaluation (July 1977) 57-65.
/18/ W. Marshall et al., "An assessment of the integrity of PWR pressure vessels" (UKAEA, 1976).
/19/ W.E. Vesely, E.K. Lynn and F.F. Goldberg, "Octavia - a computer code to calculate the probability of pressure vessel failure", IAEA Symp. on Application of Reliability Technology to Nuclear Power Plants, Vienna, 1977.
/20/ A. Jovanovic, A.C. Lucia, "Behaviour of the nozzle corner region during the first phase of the fatigue test on scaled models of pressure vessels", EUR 11023 EN, Ispra JRC, 1987.
/21/ A.C. Lucia, "Response surface methodology approach for structural reliability analysis: an outline of typical applications performed at CEC-JRC, Ispra", Nucl. Eng. and Design, Vol. 71, N. 3, Aug. 1982.
/22/ D.C. Cox, P. Baybutt, "Methods for uncertainty analysis", Battelle, Columbus, Ohio, 1981.
/23/ D. Veneziano, F. Casciati, L. Faravelli, "Methods of seismic fragility for complicated systems", 2nd CSNI Spec. Meet. on Probab. Methods in Seismic Risk Assessment, Livermore, Ca, 1983.
/24/ L. Faravelli, A.C. Lucia, "Stochastic finite element analysis of nozzle corner response", 9th SMIRT Conf., Lausanne, August 1987.
/25/ J. Heliot, R.C. Labbens, A. Pellissier Tanon, "Semi-elliptical cracks in a cylinder subjected to stress gradients", 11th National Symposium on Fracture Mechanics, Blacksburg, Virginia, June 1978, ASTM STP 677.
/26/ Hiroshi Tada, "The stress analysis of cracks handbook", Del Research Corporation, Hellertown, Pennsylvania, 1973, pp. 101.
/27/ J. Dufresne, A.C. Lucia, J. Grandemange, A. Pellissier Tanon, "The COVASTOL program", Nucl. Eng. and Design, 86 (1985).
/28/ A.C. Lucia, G. Arman, A. Jovanovic, "Fatigue crack propagation: probabilistic models and experimental evidence", 9th SMIRT Conf., Lausanne, August 1987.
/29/ A.R. Dowling, Ch.A. Townley, Int. J. Press. Ves. and Piping 3 (1975) 77.
/30/ R.P. Harrison, K. Loosemore, I. Milne, A.R. Dowling, R.A. Ainsworth, CEGB, "Assessment of the integrity of structures containing defects", R/H/R6 Rev. 2, Berkeley Nuclear Lab., 1980, and R/H/R6 Suppl. 2, 1983.

RELIABILITY OF MARINE STRUCTURES

C. Guedes Soares Shipbuilding Engineering Programme Department of Mechanical Engineering Technical University of Lisbon 1096 Lisboa, Portugal

ABSTRACT. Various applications of reliability theory to ship and offshore structures are described. Consideration is given to the differences between general reliability theory and structural reliability, discussing also the role of the latter in structural design. The examples described concern both the total and the theoretical probability of structural failure, that is the failures due to any causes and the ones due only to structural causes. A brief account is given of the load effect and strength models adopted in the structural reliability studies of ships and offshore platforms.

1. INTRODUCTION In dealing with the reliability of marine structures attention will be focused mainly on ships and offshores structures leaving aside other types of marine structures such as submersible vehicles, subsea installations and pipelines. Reliability will be dealt with here from the viewpoint that has dominated its introduction in the marine industry, that is, as an analysis and design tool for the structural engineer |l-8|. This is intended to distinguish from an alternative approach of dealing with a mathematical theory of reliability applicable to any general system |9-,10|. In fact one can identify structural reliability as a special branch of reliability theory involving various specific aspects which have influenced the type of approaches developed. To identify some of its characteristic aspects it is worthwhile to refer to other formulations applicable to electronic or even mechanical systems. The theory of reliability was initially developed for systems that were composed of many elements of relatively low reliability. This allowed the failure rates of the elements to be accurately determined from service data and the basic problem to solve was the assessment of the reliability of the whole system, as dictated by the characteristics of each element and by the way the components were connected to each other. On the other hand, structures are very

514 reliable systems, being the structural components designed for high levels of reliability. This makes it very difficult, if not impossible to determine the failure rates from service data. The various actions or loadings on the structure are mostly due to environmental effects, which makes them unpredictable at the design stage. This is a major difference from most mechanical systems, in which the loads on the various machine components can be accurately determined from their operating conditions. An important problem is therefore the description in probabilistic terms of the environmental loads that will act on the structures. Structures are designed so that they are able to resist most of the loads which they are subjected to during their operational lifetime. Thus, given a load, it is necessary to verify whether the structure has enough resistance to withstand it. The structural capacity is provided by the set of structural components which transmit forces and resist them in different ways. Based on the principles of structural mechanics, as well as on experiments, it is possible to quantify the different types of component behaviour. It has become clear that the same component is resisting different sets of loads simultaneously and that it exhibits different behaviour for different types of loads. The checking of strength has evolved to limit state design methods, which identify the various possible modes of failure of the element or of the structure and makes an independent check for each. Thus reliability has to be considered in connection with the different possible failure modes. The strength of each structural component will depend on its geometry, its material properties, as well as on the way that it is attached to the rest of the structure, i.e. the boundary conditions In the description of geometry and material properties not only the nominal values are important but also the random deviations due to uncontrolled actions occuring during the fabrication process. Thus the strength of a given type of component to a given load will be dependent on those random quantities being itself best described in a probabilistic way. Being structural design concerned with guaranteeing the existence of a strength larger than the load effects, and having both described in probabilistic terms, reliability theory becomes the natural tool for assessments to be made about the likelihood of having safe structures.

2. THE ROLE OF RELIABILITY IN STRUCTURAL DESIGN Reliability theory allows one to quantify the probability of failure of systems. Thus, it can provide information about the availability of various operational systems of marine structures, which is important both for their design and for defining maintenance and monitoring policies. However, if the system under consideration is the structure itself, reliability analysis will provide indications about the overall safety, with its implications, not only in material losses,

515 but also in loss of lives. In fact,while failure of various mechanical systems can lead to a stop in the production of platforms or operation of ships, structural failures usually result in loss of lives and of material, with a much larger period of time without operation. There are some specific cases in which failure of mechanical systems may result in loss of life, as will be referred to in section 3.2, but this is not the general case. One of the main applications of reliability theory in the marine field has been the prediction of the structural safety of the various types of vessels and installations. The studies of structural reliability have also proved to be important design tools in that they identify the critical failure modes of elements as well as allowing consistent safety factors to be used in the design of the various structural components. One can identify two distinct approaches to structural reliability. On one hand there are attempts to calculate accurately the real failure rate of the structures, as a result of all types of failures and of all causes. On the other hand there is the so-called technical approach which is not concerned with the previous aspects, but only uses reliability theory as a tool for design decisions. This implies that only the variables that influence the specific design decision need to be modelled and that the predicted reliability levels have only a relative value to be used for comparative purposes. This is the difference between the real safety levels, which are derived from accident statistics, and the notional safety levels which result from structural reliability calculations or which serve as the basis for choosing safety factors to use in design. The compatibility between the two approaches and the choice of adequate notional safety levels for design is still an actual problem 111,121. Analysis of accident statistics for all types of structures has shown that only seldom are structural failures originated by structural weaknesses or by overloading. The most common cause is the occurrence of any major event that could not be accounted for in design. In the marine field common causes are collisions, fires, explosions and groundings. In general these major events occur because of failure to comply with established procedures often during the operation of the structure but also during their design and construction. These failures are commonly designated as human errors or gross errors. This finding motivated a re-evaluation of the purpose and usefulness of reliability theory which is based on the assumption that proper procedures are observed during design, construction and utilization of the structures. It also motivated the recent interest in the modelling of gross errors, on one hand to bridge the gap between notional and real safety levels, and on the other hand, because it can be of direct use to structural design, especially as concerns the choice of structural topology. Pugsley | X31 pointed out that because errors are human actions they are very dependent on the working environment. In particular he identified various aspects such as the scientific, professional, industrial, financial and political climate as the main influences

516 on the proneness to accident. According to his approach it would be possible to estimate the probability of occurrence of an accident without the knowledge of the detailed configuration of the structure, i.e., independently of its theoretical reliability. For clarity in the discussion, the designations proposed by Ferry Borges |l4| will be adopted here. Theoretical probability of failure will be defined as the one that results from a reliability analysis which includes load and resistance variables. It will be dealt with in section 4. Total probability of failure is the one derived from accident statistics including all causes, which will be considered in section 3. The difference between the two is the adjunct probability of failure, attributable to gross errors. Blockley |l5| treated the adjunct probability of failure in the light of the concepts of Pugsley. He developed a method to predict the likelihood of structural accidents due to human error and applied it to 23 major structural accidents. Brown |l6| on the other hand compared accident statistics with the theoretical probability of failure, concluding that the adjunct probability of failure was generally one order of magnitude larger than the theoretical one. He also suggested one safety measure that would represent both aspects of the problem, i.e., that could be associated with the total probability of failure. One interesting aspect of the work of Blockley and Brown is that they do not use a probabilistic framework that is common to structural safety studies, but they operate with the new concept of fuzzy sets | IT,181. This theory has been developed as a tool for dealing with vague or imprecise (fuzzy) concepts.This is a fundamental difference from the probability theory which deals with uncertain but precise concepts. Because fuzzy sets are based on different postulates than probability theory, it is difficult to relate them with each other, despite the research efforts already done on the subject. Although fuzzy sets may still be useful in relation to the assessment of the behaviour and safety of existing structures |l9|, it appears that in the near future it may be difficult to combine a gross error analysis based on fuzzy sets with the theoretical probability of failure, which is conditional on the avoidance of gross errors. Error models based on a probabilistic approach have a better potential for estimating the adjunct probability of failure. The work in this area is still in its infancy, but mention can already be made to the contributions of Rackwitz |20|, Lind |2l| and Melchers |22|. Rackwitz proposed a model for the accumulated effect of independent checking, while Lind presented various probabilistic models that are applicable to different types of errors. Lind addressed also the question of the influence of gross errors in the total probability of failure. This is in fact an important aspect that requires clarification, so that the results of the theoretical studies on the probability of failure can be set in the correct perspective. Other examples of modelling the effect of human actions concern the prediction of load effects in ship structures |23|.0ne application is the use of controlling instruments to avoid exceedences of the

517 maximum allowed still-water bending moments during loading and unloading, which only occur as a consequence of human errors. The other applications is the voluntary manoeuvering of ships in heavy weather to avoid capsizing, which may result in increased longitudinal bending moments |23]. Ditlevsen treated recently the question of gross errors |24|. He considered that structural safety should be described by a pair of safety measures, the theoretical and adjunct probabilities of failure which should not be combined in only one measure, contrary to the opinion of Brown |l6| and Blockley |l7|. His fundamental postulate is that the adjunct probability of failure is insensitive to the theoretical probability of failure. In fact, small changes of dimensions and of material strength have negligible effect on the proneness to failure due to gross errors. The latter is dependent on the structural system in its wide definition which includes design, construction and operational environment, as already formulated by Pugsley. Ditlevsen showed that, given an adjunct probability of failure, the structural dimensions that minimize the total expected costs of a structure depend on the theoretical probability of failure. The adjunct probability of failure allows a minimization of the cost of different structural systems. This means that, while the analysis of gross errors is related with the choice of the structural layout of topology,the structural reliability analysis allows an optimization of the dimensions of the components of a given structure. Therefore, the occurrence of gross errors does not diminishes the importance of structural reliability in the design of structures. Formulating the safety problem in these terms has direct implications in the philosophy of structural design, in that it provides the theoretical justification for the treatment of accidental loads, which arise as a consequence of gross errors. In discussing the safety of ship structures ]25|, it was noted that it could only be increased by an explicit account of accidental loads in design because these are the major causes of ship losses. In fact, accidental loads should be considered when establishing the main configuration of the primary structure of the hull, while theoretical reliability is useful to determine the dimensions of the components that will minimize the total expected cost of the structure. The need to design for accidental loads has also been recognised for offshore structures |26|.

3. APPLICATIONS OF RISK ANALYSIS TO MARINE STRUCTURES Risk Analysis is the designation that has become common within the marine industry to indicate the reliability studies that account for all possible failure modes. This is intended to distinguish them from the structural reliability studies which consider only failures of the structure resulting from the excessive service loads or from too low structural strength. The basic principles governing the carrying out of risk analysis

518 are the same that are used to construct fault-trees, to assess event failure rates and to quantify the probability of occurrence of the top event |9,10|. However, it may be worthwhile to provide a brief description of the process with the terminology that is commonly used in the marine industry |25,27|. The objective of risk analysis is to quantify the risks that are associated with the different modes of failure of marine structures. The risk is a combined measure of the probability of occurrence of an undesired event and of its consequences. Therefore, risks can be reduced either by decreasing the probability of occurrence of a hazardous event or its consequences. Risk analysis can be used as a basis for accepting a system, for improving a system's design or for implementing a risk control program. The first application is probably the most controversial one in that it requires a definition of acceptable safety levels, which is a difficult and still debatable task. By identifying the relative contribution of each element to the global failure rate risk analysis allows the system design to be improved through changes only at the element level. It becomes therefore an important design tool. When performed on already designed structures risk analysis can be the basis for the risk control programs of the safety management, as it is sometimes called. The identification of the critical elements allows the definition of operational procedures that will minimize the probability of occurrence of the undesired events. It also allows monitoring and inspection programmes to be designed so as to provide early warning during operation. Finally, risks can also be decreased by reducing the consequences of unwanted events through contingency plans (see Fig. 1). Risk analysis can be performed either on the basis of failure statistics or by and analytical approach such as through a fault-tree analysis. The use of historical data ensures that all relevant failure modes are included if the data is sufficiently extensive. However, it is applicable only to the type of structures that already have enough statistical information compiled. This approach is not applicable to structures of a novel type in which all the major failure modes have to be identified so that a fault-tree can be constructed for each one. The accident will occur when a sequence of small failures occur. Estimates must be made of the failure rates of the elemental components and they must be combined so as to produce the probability of the top event occurring. Next sections present a review of both types of approaches applied to ship and to offshore structures. Analysis of historical data has indicated how the different structures fail, which could be used to improve safety levels. The explicit risk analyses that have been performed were generally concerned with particular problems, some examples of which are also given in the next section. 3.1. Applications to Ship Structures One of the major aspects of risk analysis is the identification and


[Figure 1 - General procedure for risk analysis and for the formulation of a Risk Control Program |25|. The flow chart proceeds from analysis of the activity and of similar activities (hazard identification, estimation of risk levels), through analysis of causes and of consequences (estimation of probabilities and of costs), to risk assessment and the determination of the risk acceptance level for the activity; the risk control branch covers the identification of means to reduce the occurrence of hazards and the magnitude of consequences, the estimation of the associated costs, cost-benefit analysis and the determination of the actions in the Risk Control Program.]
520 assessment of the risk. Some work has been reported on the analysis of the causes of accidents and failures. Most of the data banks are kept by Classification Societies, by Underwriters and by Governmental bodies. Unfortunately most of the data on major accidents, including total ship losses, only state the type of accident and eventually its direct cause. However, for an accident to happen, it is necessary that several unlikely events occur simultaneously and for a meaningful analysis to be conducted it would be need to include all of them. Only in few cases were major accidents studied in some depth 128,291, which is most valuable for the development of design criteria and for the identification of modes of failure. Analysis of the statistical data on total losses of the world merchant fleet provides an overall quantification of the failure rates and of the types of failures most common on ships. In Ref. 1251 a study was made using data from Lloyd's Annual and Quarterly reports of accidents. The first one gives only global statistics, while the latter contains a brief description of the circumstances and causes of the accidents. Tables 1 and 2 provide a brief summary of the Tables in (23) and (25). Statistics are presented in terms of number and tonnage of ships lost. From a reliability point of view the statistics of interest are the number of ships lost since this will describe the frequency of occurrence of accidents. Comparing these statistics with the ones of tonnage lost provides an indication of the average size of the ships lost. The annual rate of ship losses world-wide shows a slightly decreasing trend in the period 1950-1982, although the global average is 0.0050. The statistics of tonnage loss did not show the same decreasing trend indicating an increased importance of accidents of large ships. The distribution of accidents by major cause has shown a steady pattern with 38% of ships lost by foundering, 31% by grounding, 17% by fire or explosion and 21% by collision (see Table 1). Thus foundering and grounding are the two major modes of ship loss.Analysis of the data classified by ship sizes indicates that collisions and groundings are equally likely for all ship sizes. However, losses by foundering occur more often in small ships, while fires are more common in large ships. Analysis of the geographical distribution of accidents indicates that they are concentrated in areas of heavier traffic, which is not surprising for losses by collision. Even for grounding one could expect that result because heavier traffic tends to be closer to the coasts. Losses by fire are greater in the route from the Arabic countries to the North of Europe, again as expected. The casualty rate increases with the age of the ships for all types of accidents, except for collisions, in which case they are independent of age (see Table 2 ) . This effect would be the result of general deterioration such as corrosion and fatigue, together with possible lowering of the standards of operation in old ships and improved safety levels in new ships. A detailed classification of the causes of foundering, reported in Ref. |23|, was the result of an analysis of the description of

521
Percentage Foundering Num. Tonn. Min. Max. Grounding Num. Tonn. Fire & explosions Num. Tonn. Collisions Num. Tonn. Other Num. Tom.

33. 16. 43. 38.

26. 19. 38. 42.

14. 14. 18. 44.

09. 05. 13. 23.

01. 01. 06. 13.

Table 1 - Annual loses of ships worldwide due to different causes in th'period between 1970 and 1982 |23,25|

Initiating cause Foundering Wrecked Fire/explosion Collision Other

< 5

5-9 1.8 0.9 0.5 0.6 0.1 3.9

Age (years) 20-24 15-19 10-14

25-30

> 30

1.0 0.5 0.3 0.5 0.1 2.4

2.3 1.4 0.8 0.7 0.2 5.4

2.3 2.6 1.3 0.7 0.2 7.1

2.8 3.8 2.4 0.8 0.3


10.1

3.8 3.9 2.5 0.5 0.4


11.1

3.5
3 0

1.4 0.6 0.3 8.8

All

Table 2 - Average annual loss rate (per 1000 ships at risk) by age at the time of loss in the period between 1970 and 1981 123,251 .

Cause

Heavy Weather Num Avg.Ton

Unspec . Weather Avg.Ton. Num

All Weather Num Avg.Ton.


17 6 236 94 82 85 98 31 596
1245 16811 5309 2070 3749 2780

Break in two Hull Fracture Leakage Take Water Develop List Capsize Cargo Shift Machine Failure Unknown All Causes

10 2 65 37 47 37 63 15 217 493

10027 6640 2824 5634 3200 1114 1771 1669 1650 2418

7 4 171 59 35 48 35 16 379
752

26502 4643 1784 2526 2215

895
1201 3887

990
1567 2814 1169 1934

893
1617

Table 3 - Breakdown of causes of foundering between 1973 and 1982 23

522 all accidents reported in Lloyd's Quarterly Returns between 1973 and 1982. The results indicated in Table 3, show that unknown causes are the major group of accidents, which is understandable because foundered ships are not generally available for inspection to assess the cause of failure. From the accidents with identified causes, leakage is a dominant one. It is probably the result both of cracks in the structure and of lack of water tightness at hull openings. Another major group is related with developing list, cargo shift and capsizing, which could be related to stability problems.Structural failures are present when the ship breaks in two or when there is fracture, which may be potentially included in leakage, as being the initiating event. Making some estimates of this conditional probability, and combining with the conditional probability of foundering, and the probability of ship loss, leads to the following bounds on the annual probability of structural failure of ships: 0.3 10 " and 0.6 10 . These numbers * should be used with caution and should be interpreted as indicators of the order of magnitude of the rates of structural failure. Some data is also available on statistics of damage which does not lead to total ship loss. This is information of interest to maintenance studies. The majority of failures is caused by corrosion, not only as a direct effect, but also as cracks and dents due to reduction of thickness. Detailed statistics are also available for the location of damages (e.g. 30), but these tend to be more dependent on ship type, being thus more difficult to summarise in this overview. In addition to the analysis of accident statistics, which allows the quantification of global safety levels and of the main modes of failure, it may also be of interest to refer to some detailed risk analyses that have considered ships, although not always specifically concerned with their structural strength. Caldwell and Yang |3l| have developed an approach to study the risk of ship capsizing which uses methods of structural reliability to quantify the risk of motion stability with Liapunov theory to describe the mode of failure. Barlow and Lambert 1321 conducted a study on the effect of U.S. Coast Guard Rules of harbour traffic in reducing the probability of collision between a LNG tanker and any other ship in the Boston harbour. Collisions are often the result of a sequence of human errors that occur under stress conditions. Therefore they tend to be statistically dependent.A fault-tree was constructed and probabilities were computed considering the human errors both statistically dependent and independent. The basic events were intentional human error, deliberate disobedience, bad weather and equipment failure. The minimal cut sets represented accident scenarios involving collisions that resulted in release of LNG. Large fault-trees were constructed allowing the consideration of about 700 accident scenarios for operation without Coast Guard rules and 20,000 for operation with the rules. The accident scenarios involving a chain of human errors dominated the analysis, being equipment failure, such as steering gear, as insignificant contributor. Ostergaard and Rabien 1331 have chosen a steering gear as an

523 example of the application of fault-trees and of various importance measures to decisions related to systems reliability.The reliability of the emergency water cooling system of the nuclear ship Savanah has also been studied using fault tree-analysis |34|. Another application of fault-trees is in the study of the reliability of a dynamic positioning system for a diving-support ship |35|. Application of Markov processes to risk analysis is another interesting field. Basically the successive sea states are discretized and transition matrices are constructed to model the stochastic variation of the discretized processes. Interesting studies have been conducted especially related with operability periods of crane ships |36.37|. Many other risk analysis have certainly been reported in the technical literature but the objective here is to mention some examples of different approaches or applications and not to conduct a systematic literature survey. 3.2 Applications to Offshore Structures Risk analyses have been more commonly applied to offshore structures than to ship structures. This is probably a result of the fact that offshore structures made their appearance at a time when the profession was more aware of probabilistic concepts. As in the case of ships, accident statisticshave also been analysed for fixed and mobile platforms. A summary based on data from Lloyd's Lists is given in Table 4 1381. It is apparent that there has been more major structural accidents with mobile than with fixed platforms. Because the number of fixed platforms is roughly five times greater than the number of mobile ones, the accident rate for mobile platforms is over five times that of the fixed ones. The reasons for this difference are, among others, the proneness of mobile platforms to errors during operations such as moving, ballasting and anchor handling. Furthermore, they are exposed to risks of loss of buoyancy and stability which, in addition to their direct consequences, can also amplify the consequences of small structural damages, as happened recently with the Alexander L Kielland accident |39|.There are different types of mobile platforms each having its own accident rate. For example, the rates for jack-ups are about 2 to 4 times greater than for semi-submersibles. However, fatality rates do not differ much between them, possibly because there is a large incidence of jack-up failures during towing, a situation in which human lives are not involved. About 10-15% of the reported accidents involve fatalities. About 25% of all reported accidents resulting in severe or total structural loss are fatal. The number of lives lost and the fatality rate have also been greater for mobile rigs than for fixed platfoms. The loss of lives depends heavily upon the possible warning preceding the accident and the means of evacuation available. In the North Sea storms can develop with only 2 to 3 hours warning instead of the 2 or 3 days warning usually given by the Gulf of Mexico hurricanes. As regards the type of accidents, they are mostly blow-out, fire,

524 Table 4 - Number of accidents for platform in world - wide operation during 70.01.01 - 80.12.31 according to initiating event and exent of structural damage |38| Source: Lloyds'list

ALL PLATFORMS (MOBILE PLATFORMS) Structural 1 3SS ( Initiating . event Weather Collision Blow-out Leakage Machine etc
1)

SUM Total Severe 12 (10) 5 ( 2) 13 ( 7) 2 ( 2) 2 ( 1) 6 ( 2) 3 ( 2) 6 ( 6) 4 ( 4) 6 ( 4) 3 ( 0) 62 (40) Damage 30 (22) 17 (11) 15 ( 9) 3 ( 3) 5 ( 4) 20 (12) 10 ( 4) 3 ( 2) 3 ( 2) 3 ( 1) 20 (14) 1 ( 0) 130 (84) Minor 21 (17) 21 (18) 14 ( 7) 5 ( 6) 19 (12) 9 ( 6) 5 ( 2) 1 ( 0) 6 ( 4) No 9 ( 8) 23 (12) 13 ( 6) 3 ( 2) 79 ( 60) 70( 45) 70( 34) 8 71 13( 11) 48( 27) 25( 12) 9( 4( 6) 1)

7 ( 3) 4 ( 2) 15 ( 5) 1 3 ( 1) 2 ( 0) 4 ( 1) 2 ( 1)

Fire Explosion Out-of-pos Foundering Grounding

1 ( 1) 17( 12)
2 ( 2) 15 (10) 73 (45) 19( 17) 54( 41) 33( 18) 449(291)

Capsizing 11 (11) Structural > Strength 1 ( 1) Other SUM 2 ( 0) 52 (25)

1 ( 1)
25 (20) 12 ( 8) 132 (97)

1) Fires and explosions occuring in connection with blow-outs do not belong to this category as the initiating event in this case is the blow-out 2) This category includes structural failures that are not apparently induced by rough weather or accidental loads.Hence, accidents caused by a deficient structure belong to this category

525 collision and heavy weather, in the case of fixed platforms. Heavy weather, collisions and blow-out are the main types f or mobile platforms. In the case of fixed concrete structures the major causes are blow-out and collision |40|. Statistical data on damages is more scarce for platforms than for ships. Most of the reported damages to fixed platforms are ductile failures of up to few braces due to collisions and falling objects, fatigue cracks due to inadequate design calculations and fabrication faults. Dented and deflected members are common, as a result of boat impacts. There has also been a high rate of failure of mooring lines of semi-submersibles during the handling of the cables. The importance for design of the other possible failure modes in addition to structural failure has been recognised in the offshore industry, contrary to the situation in the shipping industry. For example, the guidelines of the Norwegian Petroleum Directorate (NPD) on safety assessment of fixed platforms require risk analysis to be performed, and specify that certain basic accidents should be defined in quantitative terms in design. Accidental loads with an annual probability less than 10_1 can be disregarded. This implies,for example, that accidental collision from a supply vessel must be considered in design. The quantitative risk analysis should be conducted during the conceptual stage so as to compare the risks associated with different actual concepts, and to select the type of platform, its orientation in the field, and its lay-out. An account of this type of study is given, for example, by Vinem |4l|. In addition to the studies based on accident statistics, reference should also be made to some explicit risk analysis. Blow-out, being one of the main failure modes, has been the object of different studies.For example Signoret and Leroy used a fault-tree method to quantity the risks of blow-out and riser loss |42|, in a deep water drilling project. Fault-trees with thirteen and fifteen modes of failure have been used in the riser and blow-out study. Another important cause of platform failures is collision, either between ships and platforms |43|, or between floating andfixedplatforms |44|. These studies can be conducted by isolating the various causes leading to collision and combining them with structural mechanics models which indicate the consequences of failure. The installation phase is sometimes critical because the structures cannot yet develop their full resistance. For example, in |45| a risk analysis of a fixed offshore platform during the unpiled installation phase is presented. A topic that is increasingly becoming more important is the ice loading for fixed structures. An interesting study,aimingatestablishing a design criteria for these structures, has been reported in 1461 . It accounts for the uncertainties related to the mechanics of the ice-structure interaction and those related to the environment which dictates the scenarios to be considered. It uses a decision tree aproach to represent the chain of events in the process, that is,the probability of ice-structure collision, the uncertain iceberg characteristics in terms of size, texture and speed, the nature of the impact and the

526
mechanical strength model of the ice and of the structure. The applications of risks analysis are many and are of a very different nature. The examples indicated above are intended to illustrate the scope of problems studied.

4. APPLICATIONS OF RELIABILITY TO MARINE STRUCTURES Most of the developments in the methods of assessing the structural reliability have occurred in the civil engineering community. However, the modelling and the type of analysis must be adjusted to the type of structure under consideration. It is exactly in this class that the reliability studies in the marine field can be classified. This section will consider first the main aspects of structural reliability dealing afterwards with the applications in ships and off shore structures. The load models are almost the same but the load effect and the strength modelling is different in the two cases. 4.1. Developments of Structural Reliability Theory The theory of structural reliability has already reached an advanced stage of development. This makes any detailed review of the subject an extensive task. Thus, only some of the recent contributions will be considered here, reference being made to the early monographs |l3| and recent textbooks |48| to a more detailed account of the methods. This review follows closely the one presented in |23|. A major contribution to structural reliability theory is due to Freudenthal and associates |l|. They advocated and developed what is presently known as level III methods. The basic aspects of this formula tion is that the strength of the structure is made dependent of only one load (L) and one resistance variable (R) that are described by their probability density functions. The measure of safety is provided by the probability of failure:

oo o o Pf = fR(r).fL(i.).dr.di, = F R U).f L a).di, (1)


0 0 o where f and F are the density and the cumulative distribution functions of the variables. The use of only two variables to describe the structural behaviour is a very idealized model which only provides adequate description of simple structures. The generalization to several load and resistance variables does not raise conceptual difficulties. Inclusion of more variables only implies one additional integration for each variable Bowever, the computational problems in the numerical evaluation of these multiple integrals remained unsolvable for more than 20 years. These difficulties have only recently been solved by using approximate integration methods generally called advanced level II methods. Level II methods are approximate methods that owe much of their initial development to Cornell |47|. The essence of these methods is to describe the structural variables by their first two statistical moments instead of the probability density function required by level III approa-

527 ches. The measure of safety is provided by the reliability index: = _M (2)

M
where the over bar indicates mean, is the standard deviation and M is the safety margin defined as the difference between resistance and load: M = R L (3)

This distributionfree representation of the variables by only their first two statistical moments made it possible to accommodate multi variable descriptions of the resistance and of the load.Thus, instead of operating with two variables it is possible to operate with functions, and instead of considering safety margins, limit state or failure func tions can be handled. In addition to allowing more elaborate models to be treated with the inclusion of more variables, the formulation provided also a simple method for determining safety factors to use in design. Thus, second moment methods became important both for the development of safety assessment methods and for structural design either directly or through codes. Next sections will treat the main developments that second moment methods have experienced in the recent past. They deal basically with reliability studies conducted at the component level and involving fai lures under extreme loading. The topics of system reliability and of fatigue reliability, which are presently under active research, will not be covered due to limitations in space and scope. 4.1.1. Advanced Second Moment Methods These methods are based on secondmoment information about the design variables, being the measure of safety provided by the reliability index. The Cornell reliability index (eqn. 2 ) , initially formulated for the two variable problem, was shown to suffer from lack of invariance. This implies that, for a given problem, different transformations of the safety margin may result in different values of the reliability index. Moreover, it is possible to find two different safety margins that will yield the same value of the safety index. Hasofer and Lind |48| extended the concept of reliability index to the multivariable case and solved the invariance problem. If the safety margin is defined by a linear combination of variables X : = a 0 + aj . X l + ...+ anXj! = a0+ a.X (4)

where a^'s are constants and a_ and X are matrices, the reliability index is given by: o
BHL

a n + aTX
"

~V7 5 (a T C v a) 0 5

(5)

528 where C x is the covariance matrix, X is the mean vector and the super script indicates the transpose of the matrix. When the safety margin is a nonlinear function of the design va riables, linearization reduces it to the case of equation 4, where the constants a are the partial derivatives of the safety margin with res pect to variables X. If the linearization is done at the mean value of the design va riables, the reliability index is equivalent to the Cornell index and suffers from lack of invariance. This does not happen if the lineariza tion is done at a point of the limit surface which is defined by: = g(X, Xn)= 0 (6)

as was shown by Hasofer and Lind |A8|. In the transformed space of un corrected normal variables the HasoferLind index is the minimum dis tance from the origin to the limit state surface, which is indicated in Fig. 2 for one load or demand variable (SQ) and one resistance variable Whenever the limit surfaces are not hyperplanes, the HasoferLind index will not distinguish between limit surfaces that have the same minimum distance to the origin^ which was called lack of comparativeness |49|. The generalized reliability index proposed by Ditlevsen |49|, sol ves this problem by considering the whole limit state function. This in dex results from integrating a weighting function along the limit sur face. For convenience Ditlevsen chose the standardized multinormal pro bability density function fx(x). The reliability index is then given by:

3 G = (|>l{/MFx<xidx}

(7)

where the is the standardized multinormal distribution and the in tegral is to be understood as a multiple one over all variables in vector X. Various methods have been suggested recently for an efficient in tegration of the multinormal distribution in equation 7. However, this can still be an extensive computational task when many variables are involved. One alternative consists in representing the limit state sur face by convex polyhedral in which each face defines one hyperplane.This is equivalent to linearizing the limit surfaces at several points ins tead of just one, as happens in the calculation of the HasoferLind in dex. Use of the polyhedral approximations allows upper and lower bounds of the generalized reliability index to be determined, as shown by Ditlevsen |50|, (see Fig. 3 ) . The formulation of the generalized reliability index is based only on a second moment description of the design variables. They are trans formed to an uncorrelated set of standardized normal variables |4|, before the methods discussed above are applied. Nonnormal dependent va riables should also be transformed to an uncorrelated set before pro ceeding to calculate the reliability index |5l|. If the design variables are not normally distributed, significant differences in the reliability index result because the distributions

529

z = g(sD,sc) = o

Figure 2 - Illustration of the Hasofer-Lind reliabilityindex .

530

Figure 3 - Illustration of linearised failure surfaces for a multi-failure case.

X s (t)

fx (XV

Figure 4 - Illustration of an alternating pulse processI^JI.

531 can differ much in the tail, in spite of having the same mean value and standard deviation. Whenever additional information is available about the distribution type of the design variables, it can be incorporated in the analysis by an approximate procedure that adjusts the distribution in the tails. The procedure that has received widespread acceptance is due to Rackwitz and Fiessler p 2 | . It consists of representing the design va riables by a normal distribution that has the same value of density and of distribution functions as the original variable at the approximation point. This is equivalent to substituting the tail of the original dis tribution by a normal tail. An improvement of this procedure was recently proposed by Chen and Lind |53|. It consists of using one additional parameter in the normal tail approximation, i.e. the derivative of the probability density fun ction at the approximation point. A different approach, proposed by Grigoriu and Lind |54|.consists in using various probability functions to fit the tail of the distribution. The estimated distribution function is determined by weighting the distributions with parameters whose sum equals one. The optimal values of the parameters are determined by a minimization procedure. Parkinson |55[ suggests still another approach to transform the variables, which is based on the knowledge of their 3rd and 4th moments instead of on assumptions about the shape of distribu tion function. Having determined the reliability index,using whichever method is chosen, it can be related to the probability of failure by: Pf = () (8)

where is the standardized normal distribution. When the basic 'varia bles are jointly normally distributed and the failure surf ace is a hyper plane, the above expression is exact. In other cases it will be an ap proximation even when using the generalized reliability index. The formulation of the safety problem discussed so far considers time invariant load and resistance variables. In the next section it will be discussed how time effects can be taken into account in deter mining the probability of failure. Consideration will also be given to the effects of the modelling assumptions on the probability of failure. In summarizing the development of second moment methods,it can be said that presently it is possible to calculate the reliability index of a multidimensional problem with a nonlinear surface and with nonnor mal design variables. The second moment based methods, by including in formation about the distribution function of the variables and by per forming approximate multidimensional integrations have, in a way. ex tended and solved the level III formulation of Freudenthal et al.] 1 31. 4.1.2. Description of Time Dependence Whenever a structure is subjected to a random sequence of loads, the theoretical probability of failure discussed above is not a direct mea sure of its safety, independently of the method used being of level III | | or of level II 1561. To account for the load dependence on time,se l

532 veral methods have been developed based on an upper bound formulation for the probability of a stochastic process exceeding a specified level in a given period of time. The bound is established by noting that the probability Q(a,T) that the process X(t) exceeds level in the time from 0 to is given by:

Q(a,T) = P[X(0)>a] + P[X(0)<a].P[N(a,T)>1]

(9)

where N(a,T) is the number of upcrossings of level a during time T. The probability of more than one upcrossing is always smaller than the mean number of upcrossings: oo j.P[N(ct,T)=j] = E[N(a,T)] > P[N(a,T) > 1] = P[N(a,T)=j] j=i j=i (10) Thus the following bound results |57|s Q(a,T) < Q(ct,0) + [1Q(a,0)] . E[N(a,T)] (11a)

For a stationary process satisfying the condition that: o o j.p[N(a,T) = j] [(,)=1] , (12) j=2 a good approximation for the mean number of level upcrossings is given by: E[N(a,T)] = v . (13a)

where V a is the mean upcrossing rate of level 0i. It is defined as the limit as At tends to zero of the probability that the process is below o at time t and over o at t+At: c t

'

[1

Q (ot > t) ] Q(a.t)

(13b)

where is the arrival rate in the case of point processes or the zero upcrossing rate for continuous processes. It expresses the probability that there is a change in the process in the period between t and t+ At.Substituting equation (13) in (11a) yields the basic form of the bound :

Q(a,T) < Q(a,0) + [1Q(a,0)] . v.T

(11b)

The usefulness of dealing with crossing rates is that the formulation is also applicable to correlated sequences of loads, whether Gaussian or not |56|. One way of dealing with the effect of time on the structural safety is to represent the theoretical probability of failure as a time varying process, as proposed by Bolotin 15S| and further elaborated by Veneziano, Grigoriu and Cornell |59|. They made use of equation (11),where Q(a,0) is the probability of failure that results from equations (1 ) or (8) . Since Pf(0) is always very small, the probabilitv of failure during

533 the period of time is well approximated Pf(T) = V.T b: (14)

where V is the mean rate of outcrossing the failure surface. The difficulties in determining V for realistic problems have pre vented the widespread use of this approach. The alternative method that has been generalized is to treat load and resistance as random variables. The safety problem is then formulated with the resistance variables and a load random variable defined as the maximum of the load process during the structure's lifetime T. The simplest load process is the Ferry BorgesCastanhetamodel|2|. The load history X(t) is represented as a sequence of rectangular pulses of fixed duration. The sequence of pulse amplitudes X is des cribed by independent and identically distributed random variables with distribution function F(x). A mixed model can also account for a nonzero probability (1P) of intervals with no load: f () = (1).() + p.fx(x) (15)

where 6( x ) s a Dirac delta function and () is the probability den sity function of the load at an arbitrary point in time. A simple generalization of this process is the renewal pulse process, in which pulse lengths are allowed to vary, being identically distri buted and mutually independent random variables |60|.This formulation reduces to the Ferry BorgesCastanheta model when the fixed pulse du ration in the last model is made equal to the mean duration of pulses of the renewal process |61|. A generalization of the renewal pulse pro cess is the alternating pulse process which was considered in Ref. \23\ to model the stillwater load effects that are associated with the pe riods in port and at sea (see Fig. 4 ) . For the Ferry BorgesCastanheta process, the distribution of the maximum in repetitions is given by: F m ,n ( x ) = [F x (x)] n which can be approximated by
F

(16a)

m , n ( x ) = 1 n. [1Fx(x)]

(16b)

whenever n.[1Fx(x)] is much smaller than unity. In a renewal pulse process the upcrossing of the process X(t) can be well approximated by a Poisson process, whenever the crossed level is high. The distribution function of the maximum is then given by |57|: FM T(x) = exp ( X.T.[1Fx(x)]) (17)

where is the arrival rate of the pulses or the inverse of the mean pulse duration. These two load models are very useful for load combination studies.

534
The distribution function of the process X(t) obtained by summing Ferry BorgesCastanheta models X^ and X2 is given by |2|: FM.T(X) =i^f Xl (z).[F x (xz)]mdz}n (18) two

where in the period of time there are occurences of process X<(t) and during each occurence of process X^(t) there are m ocurrences of pro cess X 2 (t), (see Fig. 5 ) . If one considers the sum of two renewal pulse processes, the up crossing rate of the resulting process is given by |6|:
v

x () = f
00

fx0(y) . v x (xy).dy + /fx (z) .V


z 1 oo 1

(xz).dz
2

(19) and equation (11) still applies for the resulting process. By choosing an adequate mean pulse duration, these two types of models can provide an adequate representation of many load processes of interest in structural reliability. These two approaches have been applied in Ref. |23| to study the combination of stillwater and wave induced load effects in ships. For the accuracy necessary for applica tions in structural codes Turkstra and Madsen |6l| considered adequa te the Ferry BorgesCastanheta model, which was also adopted by Rackwitz and Fiessler |32|. 4.1.3. Description of Model Uncertainty The initial treatment of structural safety dealt with load and resis tance variables and expressed the reliability as a function of their fundamental variability. In a later stage it became clear that other sources of uncertainty should also be accounted for in a realistic ana lysis. It is well established nowadays that in addition to the funda mental uncertainties due to the intrinsic variability of the processes under study, statistical and model uncertainties contribute to the quan tification of structural safety 62|. Statistical uncertainty results from the estimation of the para meters of the probabilistic models from limited samples of data.Model uncertainty describes the limitations of the theoretical models used in the analysis. The models can be either the probabilistic representa tion of the fundamental variability of the design variables or the ma thematical model of load, load effect or structural capacity assessment. The methods to use in the assessment of model uncertainty are somewhat dependent on the type of problem and of the data at hand. However, the two main types of procedures are based on comparisons of model predic tions with experimental data or with the predictions of more elaborated methods. Subjective information based on engineering judgement and expe rience can also be used to improve the analysis or substitute it when data is lacking. This information is treated in the same way as model uncertainty. Cornell J621 dealt with Bayesian methods as a tool to in corporate additional information in a probabilistic model uncertainty. Cornell made also the important point that model selection should be

535

x, , 1

"

'

' '

ito.
Figure 5 Illustration of the principle on which the Ferry BorgesCastanheta load combination model is based.

, ^ 17 ^"l2.Tanker<20l

.3 __?_

/* " x ' ' f i ,-"9-157

'1-

/a4^fc i r '
s ' 3' 2 1 O 50 ___ .
M n

".Mariner (191 ,^' ^ . crip DeSSi RECENT . ^ \itorshipO?29iV * propoiS' * MERCHA NT SHIPS. Tankers Cargo Ships. B" Carriers. OUOre Carrier.

> _|. 0

1/1

_ . , ia Frigate I*

NA VA L SHIPS OF'SOS 860S C U R R E N T . . R U L e N A V A L S H IPS.

100

150

200

250 300 3S0 Length Between Perpendiculars LB. (m)

Figure 6 Calculated safety indices for naval and merchant ships I78|

536 done on a pragmatic basis instead of aiming at the most accurate repre sentation of reality. This is a consequence of realizing that the pur pose of the models is to provide the basis for decisions to be made.The previous section has already dealt with how different probabilistic mo dels can be accommodated in advanced second moment calculations |5254|. The other important type of model uncertainty is associated with the deterministic models that describe the mechanics of load generation and the strength of structures. A formal treatment of this type of un certainty is due to Ang and Cornell |6264|, who represented it by a random variable that operates on the model predictions X to yield an improved estimate of the variable X: X = .X In the initial treatments was called factor of uncertainty or judgement factor which was aimed at representing all socalled subjec tive uncertainties. This random variable represents in fact the bias and uncertainty of the mechanical model, which are given by its mean value and standard deviation. More recently Ditlevsen |S51 treated model un certainty in connection with its incorporation in advanced second moment methods. He showed that a representation that is invariant to mathema tical transformations of the limit state function is of the form: X = a X + b where X is the initial normally distributed variable in the transformed space and a and b are normally distributed random quantities that des cribe the model uncertainty. Comparison with equation (20) shows that the last expression is essentially a generalization of the Ang and Cornell proposal |64|. Lind 1661 dealt with model uncertainty in strength calculations emphasizing that the choice between two calculation methods of different degree of sophistication should be made on the basis of economic consi derations. This means that the model uncertainty of an approximate cal culation method should be weighted against the extra benefits and costs of a more exact method. Lind determined the model uncertainty in a number of cases by comparing the predictions of two theoretical methods with different levels of sophistication. However, the most common way of determining model uncertainty has been comparing model predictions with experimental results, as done, for example, by Bjorhovde, Galambos and Ravindra |67|, by Guedes Soares and Soreide|8| and by Das, Frieze, and Faulkner 69|. In all these cases model uncertainty was represented in the format of Ang and Cornell |64|. It is interesting to note that although the original formulation 1641 refered to model uncertainty in both the load and resistance va riables, only studies on the quantification of model uncertainty in strength calculations were found in the literature. An exception is the recent work of Guedes Soares, Moan and Syvertsen | 70, 711, which in cludes the model uncertainty in the theories of wave kinematics, as derived . from comparisons between theoretical predictions and mea

537 surements. Reference |23[ deals with the quantitative prediction of the model uncertainty in load effect calculation methods of ship structures, which appears to be the first treatment of this type of problem. 4.2. Applications to Ship Structures The first reference to structural safety of ships dates back to 1962 by Abrahamsen |72| who provided a very clear formulation of the relationship between safety factors and the safety of human lives. The use of reliability theory in the field of ship structures came relatively late. The first paper that was identified on the subject is due to Dunn |73| in 1964. He introduced the main concepts and methods of analysis, suggesting some applications. However, this was an isolated contribution that was much based from an electronic engineering point of view. The first reported work with a ship structural engineering back ground dates from 1969 and is due to Nordenstrom | 741 . He formulated the reliability problem between a normally distributed stillwater load, a Weibull distributed wave induced load and a normally distributed resis tance. He used a level III approach along the lines of the work of Freudenthal et al. |1|. Nordenstrom concentrated his further work on the probabilistic description of the fundamental variability of the wave induced loads. He showed that the longterm distribution of individual wave heights could be represented by a Weibull distribution. Moreover, he observed that the exponential distribution would provide an adequa te model in many cases ]74|. The first complete reliability analysis of a ship structure was not done before 1972 when Mansour developed a probabilistic model for ship strength and analysed a Mariner ship 175 [ This was a major contribution in the field. He adopted Nordenstrom's model for wave induced loads,con sidered different modes of failure of the structure and calculated the probability of failure according to the classical methods of Freudenthal et al.|l|. Mansour also discussed the model uncertainties present in the strength model. They were called subjective uncertainties and were treated along the lines formulated by Ang and Cornell |64|. These un certainties were also incorporated in the analysis but their values were just estimated. No analysis was done to quantify the model uncer tainties. Additional developments on the strength model were presented and calculations were performed for a tanker and a warship by Mansour and Faulkner |78|. Another major conibution was the introduction of second moment methods by Mansour |77| and by Faulkner and Sadden 78|. Mansour adopt ed the reliability index formulation of Cornell |47 and applied it to 18 merchant ships. The reliability indices that he calculated ranged between 4 and 7, which are somewhat larger than the typical ones in civil engineering practice. Faulkner and Sadden used a slightly approach to determine the load and resistance different but they used the same definition of reliabi lity index. They applied the method to 5 warships and obtained values of the reliability index between 1 and 4 (see Fig. 6). The significant differences between the index values for merchant and naval ships have

538 been attributed to different design philosophies, which is only a partial explanation, because the methods of analysis were not identical. Furthermore, the results presented by Mansour indicate that the reliability index decreases with ship length, and Faulkner's warships were smaller than most of the merchant ships considered. Faulkner formalised the use of second moment methods to analyse ship structures and has consistently advocated their usefulness in connection with design |79|. Recently Ferro, Pitalluga and Cervetto 180 , 81| have applied advanced second moment methods to the hull girder reliability . Consideration was also given to horizontal wave-induced effects and an improved strength model was utilised. In addition to these important contributions which dealt with ductile failures under extreme loads, mention must also be made to the work of Nitta 1821 and of Ivanov and Minchev |83| who treated the reliability problem related to fatigue failures. The theoretical probability for this type of failure is much higher than for ductile collapse. However, these analyses do not account for the inspection and repair which are the practical ways of avoiding fatigue failures in ships. Classification Societies have also been interested in the subject, as can be in the papers of Abrahamsen, Roren and Nordenstrom |84| in 1970, Akita et al |85| in 1976, Goodman and Mowat |86| in 1977, Planeix, Raynaud and Hunther |87| in 1977, Stiansen et al 1881 in 1980, Ostergaard and Rabien |89| em 1981, and Ferro and Pittaluga |80| in 1983. The interest of Classification Societies is very important since, in the author's viewpoint, one of the main applications of reliability theory is the calibration of design codes |90| which, in the case of ships, are issued by the Classification Societies. The major applications of reliability to ship structures, which have just been mentioned,use the methods described in section 4.1 specialised to somewhat different formulations of the load and the strength variables. In fact, what distinguishes the applications to ships from other types of structures are precisely those models. Thus a brief description will be given of the basic concepts generally accepted in those fields. 4.2.1. Probabilistic Modelling of the Load Effects Ships are subjected to various service loads which induce different effects or stress components in the primary structure. The main actions are the still-water load, the low and high frequency wave induced loads and the thermal loads. The stress components are the vertical and horizontal bending moments and shear forces, the torsional moment and the axial forces. The still-water and the wave induced loads are the most important ones with the vertical bending moment often being the dominant load component in the common ship types. Most of the reliability studies have only accounted for these two load variables. The still-water loads result from the different distribution of weight and buoyancy along the ships' length. Once the cargo distribution is defined, the equilibrium position of the ship as well as the longitudinal distribution of buoyancy is uniquely determined by the hull geometry. The amount of cargo carried as well as its distribution varies

539 from port to port in a random way which is governed by the market con ditions of the commodity that is being transported. When one concentra tes on a specific transverse section of a ship, the load effects indu ced by the stillwater cargo distributions vary as a random quantity. The common approach is to model these load components in successi ve voyages as outcomes of time independent random variables. The des cription of these variables is best achieved from the analysis of ships1 operational data because the effects that govern the amount of cargo transported and its distribution on board are very difficult to model mathematically. This has been the view taken by the few persons who ha ve tried to model them. In addition to exploratory type of studies J 91 94| a comprehensive analysis was only recently been undertaken |23|. The maximum stillwater bending moments, which tend to occur near the midship region for most ship types, can be satisfactorily represent ed by a normal distribution becomes somewhat skewed in sections towards the ship's ends but the bending moment intensity in that location is less important. Ships have instruments that indicate the load effect magnitude along their length for the input cargo conditions. These instruments are used in the choice of the appropriate distribution of cargo so that the ma ximum allowed values are not exceeded. This tends to make the probabi listic distribution truncated at the design values. The truncation is not absolute because in ships with a small number of large holds it may sometimes be difficult to redistribute the cargo and the maximum values are occasionally exceeded. However, in some cases, a truncated normal distribution is the best model. Since a normal distribution becomes completely defined by its mean value and standard deviation, these two statistics are enough to provi de a probabilistic description of the stillwater load effects. 'These statistics have been calculated for several transverse sections of ships of various types and sizes, and at different cargo conditions. These variables were shown to influence the load effect magnitude and regres sion equations were proposed to represent these effects. For example, the maximum mean value of bending moments M can be predicted from |23|: M = 114.7105.6W.154L+37.7D1+666D2+2.3D3+25 .6D47.7f 33.8D6 (22) where M has been normalised so that the Rule design value is 100, for hogging (+) and sagging () moments and the corresponding standard de viation is given by: S m = 17.47.0.W+.035L+9.90,1.902+10.03+9.304+4051.506 (23) where w is the mean carried deadweight normalised by its maximum value, L is the ship length and Di's are dummy variables which should be one for the ship type considered and zero otherwise. The tankers are the re ference case which has all D's equal to zero. Otherwise, D, corres ponds to dry cargo ships, D 2 to containerships, D3 to bulk carriers,D4 to 0B0 carriers, D 5 to chemical tankers and (, to ore/oil carriers.

540 The voyages have different durations which can also be described in probabilistic terms. The mean durations have been shown to depend on ship type and even on size. In addition to the major changes that occur . in stillwater load effects after each voyage, they also show a conti nuous variation during voyages, as a result of fuel consumption at least. Thus, the stillwater load effects have in reality a continuous varia tion with time, which can be modelled as a stochastic process as was done for the first time in ref. (23). Both of the probabilistic models can be used to derive the proba bility distribution of lifetime maximum load effects or to conduct load combination studies, primarily between the stillwater and the low fre quency wave induced component. The latter is induced on the ships as a result of the interaction between the waves and the rigid body motion which they induce on the ship. The response of the ship is to a great extent linear so that the probabilistic description of the input pro cess, the wave, is also applicable to the wave induced response. The free surface elevation of the sea can be modelled by an ergo dic Gaussian process for adequately short periods of time. This short term description implies that the process is homogeneous in time and in space, that is, its probabilistic properties do not change with time nor with location. Thus, it is equivalent to estimate those properties f rom several sea surface elevation records made at different times in the same point or made at the same time in different points. Because the free surface elevation at a point is Gaussian, it be comes completely described by its variance, once the equilibrium posi tion is used as reference, i.e. the mean surface elevation is zero. The corresponding wave elevation process becomes completely described by the autocorrelation function or by the power spectrum depending whe ther one prefers the time orthe frequency domain representation. The frequency description has proved to be easiest to handle and has been generally adopted to describe sea states and sea excited responses. The sea surface elevation is in reality a nonstationary process because it changes characteristics with time, as is well documented by the growth and decay of storms. However, it can be adequately modelled by piecewise stationary processes called sea states. Each sea state is completely described by a wave spectrum . These spectra result from physical processes and are therefore amenable to theoretical modelling. Various proposed mathematical expressions tore present average sea spectra have appeared in the past. The one that has become generally accepted and that has been commonly used in response analysis is due to Pierson and Moskowitz |95|, although it is most com monly seen in the form.parameterised by the International Ship Struc tures Congress (ISSC) |96[:

S(f) = 0.11 Hg 1

(Txf r^xpfO^CTif)''] m.sec

(24)

where f is the frequency, H s the significant wave height and T x the average wave period. The sea spectrum becomes completely described by these two parameters. This spectrum only provides a good description of the fully deve loped sea states with one wave system. When more than one wave system

541 is present the measured spectra often exhibits two peaks and only re cently has a simple theoretical model been proposed to describe these situations |97|. The linear response of a ship to a wave spectrum is also described by a spectrum S () which is obtained from the wave elevation spectrum Sjj(j) by operating with a transfer function () |98|;
S

R (a))

()

2()

(25)

where is the circular frequency ( = 2iif). The response which is also Gaussian with zero mean, becomes comple tely described by its variance R which is the zeroth moment of the spec trum: R = /S () d o i (26) o R The amplitude of a narrow band Gaussian process has been shown |99| to be Rayleigh distributed, so that the probability Q of exceeding the am plitude is given by: Q(x) = exp x2/2R (27)

Again this probability is fully described by the variance R of the process, whether it is applied to free surface elevation or to ship res ponse. On this basis, the largest maximum that is expected to occur in cycles is given by |99|: X ^ x = R 0 5 nN + y(R/,nN) 5/2 (28)

where is the Euler constant (equal to 0.5772...). The transfer function () represents the amplitude of ship res ponse to a unit amplitude wave with the specific frequency . It can be determined frommodel tests or from strip theory calculations e.g.|l00|. Basically the theoretical calculations determine the rigid body respon se of the ship assuming that the wave excitation corresponds to the re lative motion between the ship and the wave and that the hydrodynamic forces induced by the water can be calculated in two dimensional sec tions. To assess the ship hull's reliability it is necessary to have a load model applicable to the whole ship's lifetime. Thus, it is neces sary to describe the distribution of the short term sea states during the ship's lifetime. This is equivalent to determining the probabilis tic distribution of the two governing parameters of sea spectra: signi cant wave height and average period. Some compilations of statistical data on these two parameters are available, being the work of Hgben and Lumb |l0l| probably the most suitable for application to ships. This is so because it is based on observations in transiting ships, therefore having implicit bad weather avoidance and being concentrated along the ship routes. Some statistics are also available from wave measurements and from hindcasting models but they are not availabe in all ocean areas and the first are often not

542 long enough. To use the visual observations reported by Hogben and Lumb, a ca libration procedure becomes necessary so as to make them agree with measurements. Regression equations have been proposed to adjust the vi sually observed wave heights IL. |102| in metres: H s = 0.75 H v + 2.33 . S = 1.59 (29)

and wave periods T|23| in seconds: Ti = 1.17 T v + 1.30 . S = 2.17 (30)

where S represents the standard deviation of the residuals i.e. the un certainty of the regression predictions. While the situation is well clarified for wave heights, the regression proposed for period results from the analysis of only one set of data while several other studies have given inconclusive results. Because the short term descriptions depend on He and Tl, the short term distribution function Q () is in fact conditional on the value of those two parameters. Thus the marginal distribution, called the long term distribution Q L , is obtained by double integration: Qjx) = Qs(x | Hg,T ) f (hs, t,) dHg dT x (31)

where the conditional distribution is given by equation (27) 1111 | . While this expression is applicable to describe the longterm distribution of wave heights, the ship responses depend also on the transfer function which varies with relative heading between ship and waves a, with ships's speed v, and with the cargo condition of the ship c |91.103105|. The variance R of the response will depend on these variables so that the short term distribution, eqn. (27), should be interpreted to be conditional also on these variables. Thus the mar ginal distribution is obtained from a multiple integral involving them:
Q L (X)

= r r r r rQ ( x I H. , T l f a , v , c) o o o o o s ' s 1 * ' * (fH fT s . i . , , c) dHg dTj. det dv dc

(32)
independent

Some of the variables are usually considered to be so that the joint density function is represented as:
f(H

s> T i, a, v, c) = f(Hs, Ti) f(a) f( v ) f(c)

(33)

Most reliability studies have been performed using a simplified version of the longterm distributions with only one speed and cargo condition. More sophisticated load models are available |23|,account ting for directionality of wave climate, for the voluntary manoeuver ing to avoid heavy weather, for directional spreading of wave energy and its dependence on H and on a few other effects including the quan tification of several modelling uncertainities (see section 4.1.3). The longterm distribution, being the result of successive inte

543 grations involving density functions, some of which are empirical,like f(c) and f(v), does not follow any theoretically derived type.However, the shortterm distribution which is the basic one that is weighted by the different factors, is of Rayleigh type which is the special case of the Weibull distribution with an exponent of two. Fits of the re sulting distributions made by different authors have indicated that probably the most appropriate model for the longterm distribution is the Weibull distribution given by: F(x) = 1exp { (x/aw)*} (34)

where ( j and \ are the scale and location parameters. In particular, the exponential distribution, which is a special case of the Weibull with the exponent equal to unity, has been found appropriate on many occasions |74|. The reliability approaches to ship structures are mostly time in dependent so that the wave induced effects which are time dependentmust be reduced to a time independent formulation. This is done by saying that the ship structure is expected to survive the largest load like ly to occur anytime during its lifetime. This implies that one is in terested in having the probability distribution of the maximum ampli tude in cycles where should correspond to the mean number of cycles expected during the ship's lifetime. However, the previously described longterm distribution of load effects expresses the probability that any random load cycle that occurs anywhere during the ship's lifetime might have a specific value. For high levels of exceedance, which are the ones of interest, the probability of not exceeding the level in cycles can be approximated bya Poisson distribution I57| : P[x] = exp QL(x) (35)

which is a special case of eqn 17 when = and where Q () is given by equation 31 or 32. Thus, when the design reference level is chosen as Q L = 1/N the corresponding value of is the most probable extreme value which has a probability of occurrence of 1 exp ( 1) = 0.63. The maximum value of a variable that has an initial distribution having an exponential tail follows an extreme distribution. Thus, the design wave induced load effect can be represented by an extreme type I distribution given by |23|. F e (X e ) = exp{ exp (^^/} where the parameters are given by:
X

(36)

= a

(n ) 1 / )/

(37a)

^ (n N ) ( 1 "

(37b)

where a w and are the scale and location parameters of the initial Wei

544 bull distribution. The mean value and standard deviation of the extreme distribution is given respectively by: U e = + yo and = //6 (38b) e where is the Euler constant (=.5772). The coefficient of variation is therefore given by:

(38a)

V0 =

//6

Y+UntOVX

(38c)

which decreases with increasing return period N. For the return periods of 10 8 usually associated with the characteristic values used in the design codes V is equalto0.07 . 4.2.2. Probabilistic Modelling of the Resisting Capacity To assess the reliability of.the structure, it is necessary to comapre the values of the load effects in the various components with their res pective strengths. In view of the different load components present and of the corresponding different behaviour of the structural elements several modes of failure or limit states must be considered. In general, the modes of failures of the ship hull are duetoyield ding and plastic flow, to elastoplastic buckling and to crack growth by fatigue or fracture. When considering the primary hull structure, reference is usually made to midship section. However, checks on the capability of secondary structures were also made in some studies. The moment to cause first yield of the cross section either in the deck or in the bottom is a common limit state". This moment is equal to the minimum section modulus multiplied by the yield stress. It tends to be conservative in that the material has a reserve strength after ini tial yield, and because when first yield is reached in one of the ship's flanges, the other is still elastic. The moment M corresponding to first yield is given by: M e = Z e Oy = Jk Oy (39)

where is the material yield strength and is the elastic modulus given as the ratio of the section's moment of inertia I v and the distan ce d from the elastic neutral axis to extreme limit of the section. Another limit state is the plastic collapse moment, which_is reach ed when the entire section becomes fully plastic. This moment is calcu lated considering that all the material is at yield stress. Thus, the plastic neutral axis is in a position such that the total areas of ma terial above and below the neutral axis are equal. The plastic moment IL is equal to the product of the plastic section modulus Z p by the

545

yield s t r e n g t h :

Mp = Vy

(40)

The plastic section modulus for a hollow rectangular crosssection is given by 1106|: = A_ g + 2 A s (D g + g 2 ) + A (dg) (41)

where A, A^ and Ac are the total areas of deck, bottom and sides, D is the depth of the section and g is the distance of the centre of the deck area to the plastic neutral axis: A B + 2A S A D _ D (42) 4AS This limit state is generally unconservative because some of the plates that are subjected to compression may buckle locally decreasing their contribution to the overall moment. Thus, the ultimate collapse moment is the sum of the contribution of all the elements: g Mu = ^ i d a i y i y (43)

where d is the distance of the centroid of the element to the neutral axis and Ou is the ultimate strength of each element, which can be if it is in tension or the buckling collapse stress, ac if it is in com pression. The ultimate moment will in general be between the first yield and the plastic collapse moments. It is a more correct description of the real collapse although it has not always been adopted because it is more difficult to calculate. The other type of failure mode is the unstable buckling failure, which can occur basically in deck and bottom structures. In principle this should not be considered a failure in itself because either the bottom or deck may be able to contribute to the ultimate moment by yielding under tension. However, the reduction in strength may be so large that it may be considered as a global failure. Bottom and deck structures are generally grillages so that diffe rent buckling modes can occur: failure of plates between stiffeners,in terframe flexural buckling of the stiffeners, interframe tripping of the stiffeners and overall grillage failure. The elastoplastic buckling strength of this type of structural elements is the object of active research so that there are various adequate expressions to quantify their Strength.This is not the appropriate place to consider them in detail so that reference is made to 107| for plate behaviour, to 168[ for in terframe collapse and to |108 for global failure. The failure of deck or bottom structures under compressive loads can affect such a large portion of the crosssection that it is some times considered equivalent to a hull failure mode |75|. In fact a re cent study indicates that in some cases hull collapse occurs after the failure of a few individual plate elements |l09|. A more correct model would be the consideration of the ultimate

546 strength of the whole midship section although accounting for the redu. ced contribution of the buckled plate elements. Mansour and Faulkner 1761 accounted for that by a correction factor k\>, that was introduced in the equation for ultimate moment: "u = z e M l + kV) = Me (1 + Kv) (44)

where represents the collapse sttength of a critical panel and Kv de pends on the ratio of side to deck area, tipically around 0.1. The formulations and expressions refered to above are deterministic and based on structural mechanics concepts. The probabilistic models are built upon these by accounting for the uncertainty of the different parameters and combining them in a probabilistic way. One example of such a treatment is the modification of previous expression which includes modelling uncertainties |78] . Mu = a y z [ ( 1 a y + oiy a c ) oty a s with
a

(45a) (45b)

i = 1 + B , i = y , c , s

where B^ is a bias or systematic error in the model. The variable B y ac counts for the uncertainty in yield strength, accounts for the un certainty in design code expressions for collapse strength and B g ac counts for the margin between the moment at which collapse occurs in the weakest panels and the ultimate hull collapse. Another example of the application of probabilistic methods to strength formulations is given in |110| where modelling uncertainties in plate buckling predictions were quantified. It was emphasised that model uncertainties are different depending whether one is considering a laboratory test case, an analysis situation, or a design prediction. Different uncertainties are present, a feature that has not always been considered. In particular, the design situation must give full ac count for the possibility of plate corrosion during a ship's lifetime. This degrading effect on strength has been recognised previously and has been considered in assessing the time degradation of hull's section modulus |111|. However, the effect of the replacement of corroded plates was treat ed for the first time in |l10|. It was considered that plate replace ment depends in a probabilistic way on inspection policy and on the pla te replacement criterion. Once these are established, the average thick ness of replaced plates defines the limit condition of the plate's use full lifetime. By assuming that the point in time in which the extre me load acts on the ship is uniformly distributed along the individual plate's lifetime, a mean plate thickness is determined which is inde pendent of corrosion rate and depends only on the inspection and re placement policies. The strength models described in this section have been used in quantifying the reliability either by introducing them in equation 1175, 76, 84881, in equations 2 and 3 17779,231 or in equations 4 and 5180, 81 I .

547 4.3. Applications to Offshore Structures Ofsshore platforms serve as artificial bases supporting drilling and production facilities above the water surface. The basic types of the fixed platforms are the pile_supported steel jacket type and the concrete gravity based type. The pile supported platforms transmit the environmental and functional loads to the seabed through deeply embedded pile foundations. Gravity platforms resist lateral forces using their weight and the frictional and cohesive strength of the ocean floor sediments. New types of platforms have been developed for drilling in very deep water, such as the guyed tower, the tension-leg platform and even semi-submersibles |112, 113|. Most of the structural reliability studies reported in the literature have dealt with jackets and more recently with tension-leg platforms. These will be the examples considered in this brief survey. Reliability analysis has been used both to develop design criteria|114, 115| and to analyse individual structures |5,116118|. In the same way as done for ships, one can identify loading and resistance problems in which the geotechnical aspects will be included for the fixed platforms. Dead and live loads include the weight of the structure and of the equipment, which must all be considered when analysing the stress distribution on the structural elements and the soil reactions in the case of jackets. However, the major loads are the environmental ones, which are dominated by the wave actions, although currents, wind and eventual ly earthquakes need to be considered. Waves in the North Sea are well described by the same models treat ed in section 3.2.1. However, in the Gulf of Mexico the sea is very calm most of the time and occasionally hurricanes occur. Thus another model has been developed to characterise that behaviour. The storms have been considered to occur according to a Poisson model and the maximum wave in a storm is determined by considering a succession of sea states of increasing severity followed by others of decreasing intensity 1119-120|. The sea currents can be a result of the tides, or they can be wind driven. The first are more important in the North Sea while the opposite occurs in the Gulf of Mexico. The wind driven currents will be correlated with the wave and with the wind loading. They can give important contributions to the total load 170,1211. The Morrison equation |l23| is the mechanical model that represents the in-line force F on a slender vertical cylinder of diameter D in a uniformly accelerated fluid of density pf velocity V and acceleration A:

F = pl 4

C A +

1 2

D C n U|U|

(46)

where C M and C D are mass inertia and drag coefficients determined from empirical data. This formulation has been generalised to account for the simultaneous occurence of waves and current, for the inclination of the cylinders and for their dynamic response |122|.

The values of the coefficients of the Morison equation have been predicted by many authors based on measured data. One of the most recognised results is due to Sarpkaya |124|, who determined the dependence of the coefficients on the Reynolds and Keulegan-Carpenter numbers as well as on the relative roughness of the cylinder surface. The roughness will change during the platform life due to the growth of fouling |125|, which also increases the forces on the structure as a result of the change of diameter in equation (46).

The wave loading is the dominant design criterion, especially for medium and deep water platforms. In shallow water seismic regions, the earthquake design becomes relatively more important |126|. Different failure modes can be considered in these structures, some related to the ultimate load-carrying capacity and others to the fatigue strength |127|. The latter is more important in this type of structure than in ships because the possibilities of inspection and repair are much smaller. Strength is mostly checked at an element level. To do that, a stress analysis must be performed, which can be static up to moderate depths but must be dynamic for deep water and for compliant structures |128|.
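The coefficient selection described above is usually made in terms of non-dimensional flow parameters. The short sketch below computes the Reynolds and Keulegan-Carpenter numbers and a relative roughness for illustrative values; none of the figures come from the text, and the viscosity is only an approximate value for sea water.

# Illustrative (assumed) flow and member data
nu = 1.19e-6        # kinematic viscosity of sea water [m^2/s], approximate
D = 1.5             # member diameter [m]
Um, T = 2.6, 12.0   # velocity amplitude [m/s] and wave period [s]
k = 0.03            # roughness height after marine growth [m] (assumed)

Re = Um * D / nu    # Reynolds number
KC = Um * T / D     # Keulegan-Carpenter number
print(f"Re = {Re:.2e}, KC = {KC:.1f}, k/D = {k / D:.3f}")
# C_M and C_D would then be read from empirical charts such as Sarpkaya's |124|.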
In assessing the jacket strength, consideration must be given to the strength of tubular members under axial load and combined with lateral load |129|, both in an intact condition |130| and in a slightly damaged state |131|. Tubular joints must also be checked as regards their ultimate strength |132| and their fatigue strength |133|. Finally, the strength of the foundation |134| and of the piles |117| must also be accounted for.

Jackets are made up of many tubular components and they are attached to the ground by several piles. Thus it is necessary to assess the reliability of the whole system instead of only individual elements. It is very difficult to quantify the system reliability in this type of structure, and very often one works with bounds on the probability of failure. However, work has also been done on the direct assessment of system reliability |134-136|.

The tension-leg platform is a new concept of semi-submersible platform that has excess buoyancy, being kept on station by the tension of cables at its corners. The main structural components of these structures are the deck, the columns and the cable system, the latter two of which involve a certain degree of innovation. Reliability studies have also been conducted for this type of platform, related both to code development |137| and to structural analysis |138|. The code work concentrated very much on the analysis of the cylindrical elements in the columns |139|, although the load model was also improved |140|. The other innovation in this type of structure is the cable or tendon system that keeps the platform in position. Its effect results from the cumulative contribution of all cables in the bundle, so that they act as a parallel system from a reliability point of view |141|.
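To make the two system idealisations mentioned above concrete, the sketch below evaluates the classical simple bounds for a series (weakest-link) idealisation of a jacket, and the failure probability of an idealised parallel tendon bundle. The element probabilities are invented for illustration and the elements are treated as independent and brittle; real analyses must account for correlation and load redistribution, as discussed in |134-136| and |141|.

import numpy as np

# Series (weakest-link) idealisation of a jacket: simple first-order bounds,
# valid for non-negatively correlated element failure events.
p = np.array([1.0e-4, 5.0e-5, 2.0e-4, 1.0e-4])   # element failure probabilities (assumed)
lower = p.max()                     # fully dependent elements
upper = 1.0 - np.prod(1.0 - p)      # independent elements
print(f"series system: {lower:.2e} <= Pf <= {upper:.2e}")

# Parallel idealisation of a tension-leg tendon bundle: n brittle cables,
# each with failure probability q, no load redistribution modelled.
n, q = 8, 1.0e-3
print(f"parallel bundle: Pf = {q**n:.1e}")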

REFERENCES:

|1| A.M. Freudenthal, J.M. Garrelts and M. Shinozuka, 'The Analysis of Structural Safety', J. Struct. Div., American Society of Civil Engineers (ASCE), Vol 92, 1966, pp 235-246.

|2| J. Ferry Borges and M. Castanheta, Structural Safety, Laboratorio Nacional de Engenharia Civil, Lisboa, 1968 (2nd Edition, 1971).

|3| V.V. Bolotin, Statistical Methods in Structural Mechanics, Holden-Day Inc., San Francisco, 1969.

|4| O. Ditlevsen, Uncertainty Modelling, McGraw-Hill Book Co., New York, 1982.

|5| P. Thoft-Christensen, M.J. Baker, Structural Reliability Theory and Its Applications, Springer Verlag, Berlin, 1982.

|6| G. Augusti, A. Baratta, F. Casciati, Probabilistic Methods in Structural Engineering, Chapman & Hall, London, 1984.

|7| A. H-S. Ang, W.H. Tang, Probability Concepts in Engineering Planning and Design, Vol 2, John Wiley & Sons, New York, 1984.

|8| H.O. Madsen, S. Krenk, N.C. Lind, Methods of Structural Safety, Prentice Hall, New Jersey, 1985.

|9| R.E. Barlow, F. Proschan, Statistical Theory of Reliability and Life Testing, Holt, Rinehart & Winston, New York, 1975.

|10| R.E. Barlow, H.E. Lambert (Eds.), Reliability and Fault Tree Analysis, Society for Industrial and Applied Mathematics (SIAM), 1975.

|11| C. Guedes Soares, 'Basis for Establishing Target Safety Levels for Ship Structures', Annual Meeting, Committee on Design Philosophy, International Ship and Offshore Structures Congress (ISSC), Washington, DC, April 1986.

|12| D. Faulkner, 'On Selecting a Target Reliability for Deep Water Tension Leg Platforms', 11th IFIP Conference on System Modelling and Optimisation, Copenhagen, July 1983.

|13| A.G. Pugsley, 'The Prediction of the Proneness to Structural Accidents', Struct. Engr., Vol 51, 1973, pp 195-196.

|14| J. Ferry Borges, 'Implementation of Probabilistic Safety Concepts in International Codes', Proc. 3rd Int. Conf. on Struct. Safety and Reliability of Engrg. Struct. (ICOSSAR 77), Munich, 1977, pp 121-133.

|15| D.I. Blockley, 'Analysis of Structural Failures', Proc. Instn. Civ. Engrs., Part 1, Vol 62, 1977, pp 51-74.

|16| C.B. Brown, 'A Fuzzy Safety Measure', J. Engrg. Mech. Div., ASCE, Vol 105, 1979, pp 855-872.

|17| L.A. Zadeh, 'Outline of a New Approach to the Analysis of Complex Systems and Decision Processes', Trans. on Systems, Man and Cybernetics, Inst. of Electrical and Electronic Engineers, Vol SMC-3, 1973, pp 28-44.

|18| C. Guedes Soares, 'Introduction to the Theory of Fuzzy Sets and its Application in Engineering Design' (unpublished), Div. of Marine Structures, The Norwegian Institute of Technology NTH, April 1981.

|19| J.T.P. Yao, 'Damage Assessment of Existing Structures', J. Engrg. Mech. Div., ASCE, Vol 106, 1980, pp 785-799.

|20| R. Rackwitz, 'Note on the Treatment of Errors in Structural Reliability', Technische Universitat München, Laboratorium für den Konstruktiven Ingenieurbau, Rep. No. 21, 1977, pp 23-35.

|21| N.C. Lind, 'Models of Human Error in Structural Reliability', Structural Safety, Vol 1, 1983, pp 167-175.

|22| R.E. Melchers, 'Human Error in Structural Reliability - Recent Research Results', Reliability Theory and its Application in Structural and Soil Mechanics, P. Thoft-Christensen (Ed.), Martinus Nijhoff Pub., The Hague, 1983, pp 453-464.

|23| C. Guedes Soares, 'Probabilistic Models for Load Effects in Ship Structures', Report No. UR-84-38, Division of Marine Structures, Norwegian Institute of Technology, 1984.

|24| O. Ditlevsen, 'Fundamental Postulate in Structural Safety', J. Engrg. Mech., ASCE, Vol 109, 1983, pp 1096-1102.

|25| C. Guedes Soares and T. Moan, 'Risk Analysis and Safety of Ship Structures', Proc. CONGRESSO 81, Ordem dos Engenheiros, Lisboa, Dec. 1981. Also (in Portuguese), Ingenieria Naval, Vol 50, No 564, 1982, pp 202-212.

|26| S. Fjeld, 'Offshore Oil Production and Drilling Platforms. Design Against Accidental Loads', 2nd Int. Conf. on Behaviour of Offshore Structures (BOSS'79), London, 1979, pp 391-414.

|27| T. Moan, 'Safety of Offshore Structures', Proc. 4th Int. Conf. on Applications of Statistics and Probability in Soil and Structural Engineering, Firenze, 1983.

|28| J.A. Faulkner, J.D. Clarke, C.S. Smith and D. Faulkner, 'The Loss of HMS Cobra - A Reassessment', Transactions, Royal Institution of Naval Architects (RINA), Vol 127, 1985, pp 125-152.

|29| Y. Yamamoto, et al., 'Analysis of Disastrous Structural Damage of a Bulk Carrier', 2nd Int. Symp. on Practical Design in Shipbuilding (PRADS'83), Tokyo, 1983, pp 11-18.

|30| S. Gran, 'Reliability of Ship Hull Structures', Report No. 78-216, Det Norske Veritas, 1978.

|31| J.B. Caldwell, Y.S. Yang, 'Risk and Reliability Analysis Applied to Ship Capsize; A Preliminary Study', Int. Conf. on the Safeship Project: Ship Stability and Safety, London, June 1986.

|32| R. Barlow, H. Lambert, 'The Effect of U.S. Coast Guard Rules in Reducing the Probability of LNG Tankership Collision in the Boston Harbour', 4th Int. System Safety Conference, San Francisco, 1979.

|33| C. Ostergaard, U. Rabien, 'Use of Importance Measures in Systems', Schiffstechnik, Vol 31, 1984, pp 135-172.

|34| T. Matsuoka, 'An Application of a Reliability Analysis to the Emergency Sea Water Cooling System of the Nuclear Ship Savannah', Report No. 62, The Ship Research Institute, Tokyo, 1982.

|35| J.N.P. Gray, I.F. MacDonald, 'Safety Study of Part of a Dynamic Positioning System for a Diving Support Ship', Reliability Engineering, Vol 3, 1982, pp 179-192.

|36| D. Hoffman, V.K. Fitzgerald, 'Systems Approach to Offshore Crane Ship Operations', Trans. Society of Naval Architects and Marine Engineers (SNAME), Vol 86, 1978, pp 375-412.

|37| B.L. Hutchinson, 'Risk and Operability Analysis in the Marine Environment', Trans. SNAME, Vol 89, 1981, pp 127-154.

|38| T. Moan and I. Holland, 'Risk Assessment of Fixed Offshore Structures', Structural Safety and Reliability, T. Moan and M. Shinozuka (Eds.), Elsevier Sci. Pub., Amsterdam, 1981, pp 803-820.

|39| T. Moan, 'The Progressive Structural Failure of the Alexander L. Kielland Platform', Case Histories in Offshore Engineering, G. Maier (Ed.), Springer Verlag, 1985.

|40| O. Furnes, P.E. Kohler, 'Safety of Offshore Platforms, Classification Rules and Lessons Learned', Proc. Int. Conf. on Marine Safety, Dept. of Naval Architecture and Ocean Engineering, University of Glasgow, September 1983; Marine and Offshore Safety, P.A. Frieze et al. (Eds.), Elsevier, 1984.

|41| J.E. Vinnem, 'Quantitative Risk Analysis in the Design of Offshore Installations', Reliability Engineering, Vol 6, 1983, pp 1-12.

|42| J.P. Signoret, A. Leroy, 'The 1800m Water Depth Drilling Project: Risk Analysis', Reliability Engineering, Vol 11, 1985, pp 83-92.

|43| O. Furnes, J. Amdahl, 'Computer Simulation of Offshore Collisions and Analysis of Ship-Platform Impacts', Norwegian Maritime Research, Vol 8, 1980, pp 2-12.

|44| T. Moan and J. Amdahl, 'On the Risk of Floatel-Platform Collision', Proc. 4th ASCE Speciality Conf. on Probabilistic Mechanics and Structural Reliability, ASCE, 1984, pp 167-172.

|45| G. Kriger, E. Piermattel, J.D. White, R.W. King, 'Risk Analysis Applied to Offshore Platforms During Unpiled Installation Phase', Proc. 15th Annual Offshore Technology Conf., 1983, Vol 1, pp 9-18.

|46| M.A. Maes, I.J. Jordaan, J.R. Appleby, P. Fidjestol, 'Risk Assessment of Ice Loading for Fixed Structures', Proc. 3rd Int. Offshore Mechanics and Arctic Engng. (OMAE) Symp., ASME, 1984, Vol III, pp 220-227.

|47| C.A. Cornell, 'Structural Safety Specifications Based on Second Moment Reliability Analysis', Final Report, Symposium on Concepts of Safety of Structures and Methods of Design, IABSE, London, 1969, pp 235-246.

|48| A.M. Hasofer and N.C. Lind, 'An Exact and Invariant First-Order Reliability Format', J. Engrg. Mech. Div., ASCE, Vol 100, 1974, pp 111-121.

|49| O. Ditlevsen, 'Generalised Second Moment Reliability Index', J. Struct. Mech., Vol 7, 1979, pp 435-451.

|50| O. Ditlevsen, 'Narrow Reliability Bounds for Structural Systems', J. Struct. Mech., Vol 7, 1979, pp 453-472.

|51| M. Hohenbichler and R. Rackwitz, 'Non-Normal Dependent Vectors in Structural Reliability', J. Engrg. Mech. Div., ASCE, Vol 107, 1981, pp 1227-1238.

|52| R. Rackwitz and B. Fiessler, 'Structural Reliability under Combined Random Load Sequences', Comp. Struct., Vol 9, 1978, pp 489-494.

|53| X. Chen and N.C. Lind, 'Fast Probability Integration by Three-Parameter Normal Tail Approximation', Structural Safety, Vol 1, 1983, pp 269-276.

|54| M. Grigoriu and N.C. Lind, 'Optimal Estimation of Convolution Integrals', J. Engrg. Mech. Div., ASCE, Vol 106, 1980, pp 1349-1364.

|55| D.B. Parkinson, 'Four Moment Reliability Analysis for Static and Time-Dependent Problems', Reliability Engrg., Vol 1, 1980, pp 29-42.

|56| M. Grigoriu and C.J. Turkstra, 'Structural Safety Indices for Repeated Loads', J. Engrg. Mech. Div., ASCE, Vol 104, 1978, pp 829-844.

|57| M.R. Leadbetter, 'Extreme Value Theory and Stochastic Processes', Proc. 1st Int. Conf. on Structural Safety and Reliability (ICOSSAR), A.M. Freudenthal (Ed.), Pergamon Press, 1972, pp 71-89.

|58| V.V. Bolotin, 'Application of the Methods of the Theory of Probability and the Theory of Reliability to Analysis of Structures' (in Russian), 1971; English Translation, U.S. Department of Commerce, 1974.

|59| D. Veneziano, M. Grigoriu and C.A. Cornell, 'Vector Process Models for System Reliability', J. Engrg. Mech. Div., ASCE, Vol 103, 1977, pp 441-460.

|60| R.D. Larrabee and C.A. Cornell, 'Combination of Various Load Processes', J. Struct. Div., ASCE, Vol 106, 1980, pp 223-239.

|61| C.J. Turkstra and H.O. Madsen, 'Load Combinations in Codified Structural Design', J. Struct. Div., ASCE, Vol 106, 1980, pp 2527-2543.

|62| M. Shinozuka, 'Stochastic Characterisation of Loads and Load Combinations', Structural Safety and Reliability, T. Moan and M. Shinozuka (Eds.), Elsevier Sci. Pub., Amsterdam, 1981, pp 57-76.

|63| A. H-S. Ang, 'Structural Risk Analysis and Reliability Based Design', J. Struct. Div., ASCE, Vol 99, 1973, pp 1891-1910.

|64| A. H-S. Ang and C.A. Cornell, 'Reliability Bases of Structural Safety and Design', J. Struct. Div., ASCE, Vol 100, 1975, pp 1755-1769.

|65| O. Ditlevsen, 'Model Uncertainty in Structural Reliability', Structural Safety, Vol 1, 1982, pp 73-86.

|66| N.C. Lind, 'Approximate Analysis and Economics of Structures', J. Struct. Div., ASCE, Vol 102, 1976, pp 1177-1196.

|67| R. Bjorhovde, T.V. Galambos and M.K. Ravindra, 'LRFD Criteria for Steel Beam-Columns', J. Struct. Div., ASCE, Vol 104, No ST9, 1978, pp 1371-1388.

|68| C. Guedes Soares and T.H. Soreide, 'Behaviour and Design of Stiffened Plates under Predominantly Compressive Loads', Int. Shipbuilding Progress, Vol 30, No 341, 1983, pp 13-27.

|69| P.K. Das, P.A. Frieze, D. Faulkner, 'Reliability of Stiffened Steel Cylinders to Resist Extreme Loads', 3rd Int. Conf. on Behaviour of Offshore Structures (BOSS'82), M.I.T., Aug. 1982, pp 769-783.

|70| C. Guedes Soares and T. Moan, 'On the Uncertainties Related to the Hydrodynamic Loading of a Cylindrical Pile', Reliability Theory and its Applications in Structural and Soil Mechanics, P. Thoft-Christensen (Ed.), Martinus Nijhoff Pub., The Hague, 1983, pp 351-364.

|71| C. Guedes Soares and K. Syvertsen, 'Uncertainties in the Fatigue Loading of Offshore Structures', Report No. STF88 F81024, OTTER, Trondheim, May 1981.

|72| E. Abrahamsen, 'Structural Safety of Ships and Risks to Human Life', European Shipbuilding, Vol 11, 1962, pp 134-146.

|73| T.W. Dunn, 'Reliability in Shipbuilding', Trans. SNAME, Vol 72, 1964, pp 14-34.

|74| N. Nordenstrom, 'Probability of Failure for Weibull Load and Normal Strength', Report No. 69-28-S, Det Norske Veritas, 1969.

|75| A.E. Mansour, 'Probabilistic Design Concepts in Ship Structural Safety and Reliability', Trans. SNAME, Vol 80, 1972, pp 64-97.

|76| A.E. Mansour and D. Faulkner, 'On Applying the Statistical Approach to Extreme Sea Loads and Ship Hull Strength', Trans. RINA, Vol 115, 1973, pp 277-314.

|77| A.E. Mansour, 'Approximate Probabilistic Method of Calculating Ship Longitudinal Strength', J. Ship Research, Vol 18, 1974, pp 203-213.

|78| D. Faulkner and J.A. Sadden, 'Toward a Unified Approach to Ship Structural Safety', Trans. RINA, Vol 121, 1979, pp 1-38.

|79| D. Faulkner, 'Semi-probabilistic Approach to the Design of Marine Structures', Extreme Loads Response Symp., SNAME, 1981, pp 213-230.

|80| G. Ferro and A. Pittaluga, 'Probabilistic Modelling of Design Loads for Ships', Reliability Theory and its Application in Structural and Soil Mechanics, P. Thoft-Christensen (Ed.), Martinus Nijhoff Pub., The Hague, 1983, pp 465-476.

|81| G. Ferro, D. Cervetto, 'Hull Girder Reliability', Ship Structure Symposium, SNAME, 1984, pp 89-110.

|82| A. Nitta, 'Reliability Analysis of the Fatigue Strength of Ship Structures', Trans. Nippon Kaiji Kyokai, Vol 155, 1976, pp 1-6.

|83| L.D. Ivanov and A.D. Minchev, 'Comparative Analysis of the Hull Section Modulus on the Basis of the Theory of Reliability', Budownictwo Okretowe, Vol 34, No 11, 1979, pp 16-19.

|84| E. Abrahamsen, N. Nordenstrom and M.Q. Røren, 'Design and Reliability of Ship Structures', Proc. Spring Meeting, SNAME, 1970.

|85| Y. Akita, I. Yamaguchi, A. Nitta and H. Arai, 'Design Procedure Based on Reliability Analysis of Ship Structures', J. Soc. Nav. Arch. Japan, Vol 140, 1976.

|86| R.A. Goodman and G.A. Mowatt, 'Application of Strength Research to Ship Design', Steel Plated Structures, Crosby Lockwood Staples, London, 1977, pp 676-712.

|87| J.M. Planeix, J. Raynaud and M. Huther, 'New Outlooks for Guardians of Safety - Explicit Versus Implicit Risk Analysis in Classification Certification', Safety at Sea, RINA, 1977, pp 71-82.

|88| S.G. Stiansen, A.E. Mansour, H.Y. Jan and A. Thayamballi, 'Reliability Methods in Ship Structures', Trans. RINA, Vol 122, 1980, pp 381-397.

|89| C. Ostergaard and U. Rabien, 'Reliability Techniques for Ship Design' (in German), Trans. Schiffbau Tech. Gesellsch., Vol 75, 1981, pp 303-339.

|90| C. Guedes Soares, T. Moan, 'Uncertainty Analysis and Code Calibration of the Primary Load Effects in Ship Structures', Proc. 4th Int. Conf. on Structural Safety and Reliability (ICOSSAR 85), 1985, Vol III, pp 501-512.

|91| E.V. Lewis, et al., 'Load Criteria for Ship Structural Design', Report No. SSC-240, Ship Structure Committee, Washington, D.C., 1973.

|92| L.D. Ivanov and H. Madjarov, 'The Statistical Estimation of Still Water Bending Moments for Cargo Ships', Shipping World and Shipbuilder, Vol 168, 1975, pp 759-962.

|93| H. Mano, H. Kawabe, K. Iwakawa and N. Mitsumune, 'Statistical Character of the Demand on Longitudinal Strength (Second Report) - Long Term Distribution of Still Water Bending Moment' (in Japanese), J. Soc. Nav. Arch. of Japan, Vol 142, 1977, pp 255-263.

|94| C. Guedes Soares and T. Moan, 'Statistical Analysis of Still-Water Bending Moments and Shear Forces on Tankers, Ore and Bulk Carriers', Norwegian Maritime Research, Vol 10, 1982, pp 33-47.

|95| W.J. Pierson and L. Moskowitz, 'A Proposed Spectral Form for Fully Developed Wind Seas Based on the Similarity Theory of S.A. Kitaigorodskii', J. Geophysical Research, Vol 69, No 24, 1964, pp 5181-5190.

|96| N. Hogben et al., 'Environmental Conditions', Report of Committee 1.1, Proc. 6th International Ship Structures Congress, Boston, 1976.

|97| C. Guedes Soares, 'Representation of Double-Peaked Sea Wave Spectra', Ocean Engng., Vol 11, 1984, pp 185-207.

|98| M. St. Denis and W.J. Pierson, 'On the Motions of Ships in Confused Seas', Trans. SNAME, Vol 61, 1953, pp 280-357.

|99| M.S. Longuet-Higgins, 'The Statistical Distribution of the Height of Sea Waves', J. Marine Research, Vol 11, 1951, pp 245-266.

|100| N. Salvesen, E.O. Tuck and O. Faltinsen, 'Ship Motions and Sea Loads', Trans. SNAME, Vol 78, 1970, pp 250-287.

|101| N. Hogben and F.E. Lumb, 'Ocean Wave Statistics', Her Majesty's Stationery Office, London, 1967.

|102| C. Guedes Soares, 'Assessment of the Uncertainty in Visual Observations of Wave Height', Ocean Engineering, Vol 13, 1986, pp 37-56.

|103| J. Fukuda, 'Theoretical Determination of Design Wave Bending Moments', Japan Shipbuilding and Marine Engineering, Vol 2, No 3, 1967, pp 13-22.

|104| H. Soding, 'Calculation of Stresses on Ships in a Seaway', Schiff und Hafen, Vol 23, 1971, pp 752-762.

|105| M.K. Ochi, 'Wave Statistics for the Design of Ships and Ocean Structures', Trans. SNAME, Vol 86, 1978, pp 47-69.

|106| J.B. Caldwell, 'Ultimate Longitudinal Strength', Trans. RINA, Vol 107, 1965, pp 411-430.

|107| D. Faulkner, 'A Review of Effective Plating for Use in the Analysis of Stiffened Plating in Bending and Compression', J. Ship Research, Vol 19, 1975, pp 1-17.

|108| A.E. Mansour, A. Thayamballi, 'Ultimate Strength of a Ship's Hull Girder - Plastic and Buckling Modes', Report No. SSC-299, Ship Structure Committee, Washington, D.C., 1980.

|109| Y. Akita, 'Reliability of Ships at Collapse, Fatigue and Corrosive Damages', Proc. 1st Int. Symp. on Ship's Reliability, Varna, September 1985, pp 4-12.

|110| C. Guedes Soares, 'Uncertainty Modelling in Plate Buckling', Proc. 1st Int. Conf. on Ship's Reliability, Varna, September 1985.

|111| L.D. Ivanov, 'Statistical Evaluation of the Ship's Hull Section Modulus as a Function of the Ship's Age', Proc. 1st Int. Symp. on Ship's Reliability, Varna, September 1985, pp 44-56.

|112| T. Moan, 'Overview of Offshore Steel Structures', Fatigue Handbook, A. Almar-Naess (Ed.), Tapir, 1985, pp 1-38.

|113| C. Guedes Soares, 'Hydrodynamic Loads on Offshore Platforms' (in Portuguese), Revista Portuguesa de Engenharia de Estruturas, Vol 4, No 10, 1981, pp 32-41.

|114| R.G. Bea, 'Reliability Considerations in Offshore Platform Criteria', J. Struct. Div., ASCE, Vol 106, ST9, 1980, pp 1835-1853.

|115| S. Fjeld, 'Reliability of Offshore Structures', Proc. 9th Offshore Technology Conference, 1977, Vol IV, pp 459-471.

|116| H. Crohan, A. Tai, V. Hachemin-Safar, 'Reliability Analysis of Offshore Structures under Extreme Environmental Loading', Proc. 16th Offshore Technology Conf., 1984, Vol 3, pp 417-426.

|117| W.D. Anderson, M.N. Silbert, J.R. Lloyd, 'Reliability Procedure for Fixed Offshore Platforms', J. Struct. Div., ASCE, Vol 108, 1982, pp 2517-2538 and Vol 110, 1984, pp 902-906.

|118| H. Karadeniz, A. Vrouwenvelder, A.C. Bouma, 'Stochastic Fatigue Reliability Analysis of Jacket Type Offshore Structures', Reliability Theory and its Applications in Structural and Soil Mechanics, P. Thoft-Christensen (Ed.), Martinus Nijhoff, 1983, pp 425-443.

|119| H.O. Jahns, J.D. Wheeler, 'Long Term Wave Probabilities Based on Hindcasting Severe Storms', Proc. Offshore Technology Conf., 1972, Paper OTC 1590.

|120| L.E. Borgman, 'Probabilities for the Highest Wave in a Hurricane', J. Waterways, Harbours and Coastal Engng., ASCE, Vol 99, 1973, pp 185-207.

|121| S. Shyam Sunder, J.J. Connor, 'Sensitivity Analysis for Steel Offshore Platforms', Applied Ocean Research, Vol 3, 1981, pp 13-26.

|122| J.D. Wheeler, 'Method for Calculating Forces Produced by Irregular Waves', J. Petroleum Technology, No 22, 1970.

|123| J.R. Morison, M.P. O'Brien, J.W. Johnson, S.A. Schaaf, 'The Forces Exerted by Surface Waves on Piles', Petroleum Transactions, Vol 109, 1950, pp 149-157.

|124| T. Sarpkaya, 'The Hydrodynamic Resistance of Roughened Cylinders in Harmonic Flow', Trans. RINA, Vol 120, 1978, pp 41-55.

|125| N.J. Heaf, 'The Effect of Marine Growth on the Performance of Offshore Platforms in the North Sea', Proc. 11th Offshore Technology Conf., 1979, Paper No. OTC 3386.

|126| J.N. Yang, A.M. Freudenthal, 'Reliability Assessment of Offshore Platforms in Seismic Regions', Proc. 2nd Int. Conf. on Structural Safety and Reliability (ICOSSAR 77), Munich, 1977, pp 247-266.

|127| P.W. Marshall, R.G. Bea, 'Failure Modes of Offshore Platforms', Proc. Conf. on Behaviour of Offshore Structures (BOSS 76), Trondheim, 1976, Vol II, pp 579-635.

|128| J.H. Vugts, I.M. Hines, R. Nataraja, W. Schumm, 'Modal Superposition versus Direct Solution Techniques in the Dynamic Analysis of Offshore Structures', Proc. Conf. on Behaviour of Offshore Structures (BOSS 79), London, 1979, Paper No 49.

|129| J.E. Harding, P.J. Dowling, N. Angelidis (Eds.), Buckling of Shells in Offshore Structures, Applied Science Publishers, 1981.

|130| C. Guedes Soares, T.H. Soreide, 'Plastic Analysis of Laterally Loaded Circular Tubes', J. Structural Engng., ASCE, Vol 109, 1983, pp 451-467.

|131| C.S. Smith, W. Kirkwood, J.W. Swan, 'Buckling Strength and Post-Collapse Behaviour of Tubular Bracing Members including Damage Effects', Proc. 2nd Int. Conf. on Behaviour of Offshore Structures (BOSS 79), London, 1979, Vol 2, pp 303-326.

|132| P.W. Marshall, 'General Considerations for Tubular Joint Design', Proc. Conf. Welding in Offshore Construction, Welding Institute, UK, 1974.

|133| P.H. Wirsching, 'Fatigue Reliability for Offshore Structures', J. Structural Engng., ASCE, Vol 110, 1984, pp 2340-2356.

|134| R. Cazzulo, A. Pittaluga, G. Ferro, 'Reliability of Jacket Foundation System', Proc. 5th Int. Offshore Mechanics and Arctic Engineering (OMAE) Conference, ASME, 1986, Vol II, pp 73-80.

|135| Y. Murotsu, et al., 'Probabilistic Collapse Analysis of Offshore Structures', Proc. 4th Int. Offshore Mechanics and Arctic Engineering (OMAE) Symp., ASME, 1985, Vol I, pp 250-258.

|136| F. Moses, 'System Reliability Developments in Structural Engineering', Structural Safety, Vol 1, 1982.

|137| D. Faulkner, N.D. Birrell, S.G. Stiansen, 'Development of a Reliability Based Code for the Structure of Tension Leg Platforms', Paper OTC 4648, Proc. Offshore Technology Conference, 1983.

|138| Z. Prucz, T.T. Soong, 'Reliability and Safety of Tension Leg Platforms', Engineering Structures, 1984.

|139| P.K. Das, P.A. Frieze, D. Faulkner, 'Structural Reliability and Modelling of Stiffened Components of Floating Structures', Structural Safety, Vol 2, 1984, pp 3-16.

|140| Y.N. Chen, D. Liu, Y.S. Shin, 'Probabilistic Analysis of Environmental Loading and Motion of a Tension Leg Platform for Reliability-based Design', Proc. Int. Conf. on Marine Safety, Dept. of Naval Architecture and Ocean Engineering, University of Glasgow, September 1983; Marine and Offshore Safety, P.A. Frieze et al. (Eds.), Elsevier, 1984.

|141| B. Stahl, J.F. Geyer, 'Ultimate Strength Reliability of Tension Leg Platform Tendon Systems', Proc. Offshore Technology Conference, 1985, Vol I, pp 151-162.

ACKNOWLEDGEMENTS

The present work was done during the author's stay at the Department of Naval Architecture and Ocean Engineering of Glasgow University as an Honorary Senior Research Fellow. The author is grateful to Professor D. Faulkner for his kind hospitality and to the Department's staff for the assistance provided in the typing of the manuscript.

The author is also grateful to Professor Luciano Faria, Head of the Mechanical Engineering Department of the Technical University of Lisbon, for having made the arrangements necessary for this lecture to become possible.

The work is in the scope of the research project 'Structural Reliability' that the author holds at CEMUL, the Centre for Mechanics and Materials of the Technical University of Lisbon, which is financially supported by INIC, the National Institute for Scientific Research.

Subject Index

Abnormal occurrence Accessibility Accident (sequence, scenario, chains of events) Admissible region Adjunct Probability of Failure Aerospace Ageing Aircraft, Aviation Airworthiness Allowable failure rate Analytical Reliability Models Availability Average Interruption Method Basic Parameter Model (BP) Bayesian (approach, influence, statistics, etc.) Behaviourism Beta Factor Binomial Failure Rate Model (BFR) Boolean (algebra, operators, etc.) Bulk Power Systems Capacity Outage Probability Table Cascade failures CauseConsequence (diagrams, relation, etc.) Causes of Failure Challengedependent Probability of Failure Checklist Methods Chemical Industry Chemical Reactor Safety Civil Air Regulation Codecombination Analysis Cognitive Psychology Collision Common Cause Failures (CCF)

see incident 369 107,108,257,320,327,329,332,456457, 525 (see also incident) 498 516 367370 6,105,520 97,257,278,280,367385 370,372 370,371 346,348350,404415 3,1013,176,178,210215,222,303,346, 353,390,514 404405 see CCF parametric models 1516,1819,43,4966,84,112,251,298, 501,534 258 see CCF parametric models see CCF parametric models 20,133,136141 404409 391395,398401 228,248,368,371 104,171,345,359361,365 see Failure causes 188194 324325,334,335 see Process industry 313316 367369 113114 258,259 522,525 97,98,104,106,108,118119,149,150, 156,172,194198,221256,277,279, 289290,368,408,413,461 245252 149,233237 252 149,235,248,252 149,233234

CCF Analysis Procedures CCF Parametric Models:

BP BFR

149,236237,248
561

562 OCF Parametric Models: MFR MGL CCFRBE Common (Extreme) Environment Common Load Sharing Common Mode Failures (CMF) Component (classification, boun daries, operation, event states, etc.) Component Event Data Computer codes n CAFTS " " COVASTOL " " MAPLE " " MARK SMP " " MOCA RP " " SAFETI " " SALPMP " " SALPPC " " SCORE Confidence Limits Consequences of Incidents (evaluation, models, etc.) Contingency Analysis Contingency Table Control Strategy Corrosion Crack (growth, growth rate, etc.) Creep Criticality of Failures Cutoff 252 149,235236,248,251 237245 see Environment 172,181182 see CCF 3,27,6774,108,131

6794 129,156169 159,161162 507,509510 159,167168 345,353354,359 345,354,359 337343 163164 165166 169 15,43,84 320,321,325333,334,518 406407 111,117 263 105,372,492,520,522,546 491494,498,501502,522,538 (see also Fatigue) 494 379380 157,232,249

Damage (accumulation, process, etc.)488,494498,506507,522 Data Banks/Collections (general: 6787,95126 procedures, retrieval, etc.) " " AORS 95126 " " CEDB 6874 " " CEDBFBR 68 " " CREDO 68 " " ERDS 68,97 " 1RS 97,105 LER 114,250 Lloyds 520 " " OREDA 68 " " USERS 97 Data (Failure, Repair, Historical, 6794,95126,252,332,335,367,372, Incident, operational, etc.) 457,474476,515,520525 Decision Making 4966,98,269,319,388,403,413414, 463464,518,525 Decision Tables 21,23,24

563 Decision Tree Deductive (top-down methods, approach) Defects (propagation, etc.) Dependency structures Design criteria Design Optimization Displacement Extrapolation Method Dis tributions Beta Binomial Exponential Extreme Value Gamma Lognormal Loguniform Multinormal Normal Poisson Uniform Weibull Dynamic Process DYLAM technique 525 21,130,380 488,491,494,504,505,509 116,222,225-233 (see also CCF) 489,490,525 see Optimization 504 84 30 6,7,15-18,31,84,356,543 543-544 15,84 14,84,210,356 84 528 14,32,35-37,210 34-35,210 84,205-208 7-9,14,27-28,37-40,84,209,356,494, 495,496,498 231,311-316 303,311-317

Efficiency (production processes) 352-353 Electrical Networks 387-415 Electricity Capacity 387-403 Distribution Reliability 387,409-413 " Generation Reliability 387,398-399 " Transmission Reliability 387-404 Environment (extreme, land, para172,182-184,222,224,233,373,492, meters, common, etc.) 514,524,525 Ergonomics 258,298 EuReDatA 73,76 Events 21,131 Event Tree 21,129-131,153-156,171,227,248,280, 281,284,290,454-460 Expected Number of Failures 11,129,146-147 Expert Opinion/Engineering 296-298,534 Judgement Explosions 319,322,325,326,334,335,523 External Events 131,195,229 Extreme Value 494,538 Fatalities (number of, probability, FN curves) Fatigue (cracks, damage, failure, etc.) Fail Safe 326,328-330,332,334 368,372,488-489,491-495,497,498,520, 538 368,373

564 Failure Cause (mechanism, etc.) " Classification (definition, description) Failure Criteria " Intensity Mode Failure Mode and Effect Analysis (FMEA) Failure Probability Distributions Failure Rate 3,107,223,224,499,522

3,74-77,88-90,224 373,507-510,545-547 11,146 74,77,89-90,223,324,515 21,130,131,151-152,246-247,306,368, 370,371,373,375 4-9,22,176,367,500,502,516,531 5-9,12-19,22,49,84,97,231,346,348, 491,514,518 Failure Repair Process 4,11,12,27 Finite Element Method 502,506 Fire 131,321,322,326,523-524 Fit Test see Statistical Tests of Fitting Flow Discharge Models 325,334,335 Fracture (Mechanics, toughness, etc.) 491,501-502,507 Fragility curves 229 Frequency Duration Approach 397-403 Frequency of Occurrence 111,327-328,520 Functional Unavailability 225-227 Fuzzy Sets 516-517 Fault Avoidance (correction, 425-428,444 detection, tolerance) Design Fault Tree 21,23,24,109,129-169,171,179-181, 226,227,228,229,245,248,249,280, 281,284-286,303,307-311,327,328, 360,368,376,380-384,456-460,518, 522-523,525 Fault Tree Automated construction 156,159,161-162 Fault Tree Drawing 156,158,167-168 HAZOP (Hazard and Operability Analysis) Hazard indices Heavy Gas Dispersion Histogramming Historical Data History Dependence Human Behaviour Model Human Computer Interaction/ interface (HCl) Human Failures (Factors, actions, etc.) Human Failure Rate Human Performance Modelling Human Reliability Analysis (HRA) Impact vector Importance (of MCS, component) 130,303-308,325,334,335 303-304 334 107,108,111,113 518,521,524 498 269-270 257,258,264,266 97,98,104,106,107,109,119-120,131, 153,223,224,229-231,233,257-300, 460,515-517,522 287-288 266-281 291-296 251 148,158

565

Incident (data c o l l e c t i o n ,

analysis,95126,319,515,520,521,523525
120121 21,130,360,368,373 311 109,153,359,454456,522,523 158,224,284,369,371,498,518,538,546 506 2730,111 16 14,488,495499,540,546 463483 509510,528 489490 395398 488,493,495,496,514,516,524426, 531534,538544 see statistical Tests of Fitting 910,16,303,369 90,93,224,264,346,368,372,374,375, 413,457,464,514,522,546 257,271,278280 230,257,259,264266,268 401403,499,527,528 513559 130,156,171203,228,229,345,346, 350354,357,363365,407408,414, 463485,497498,523 233 60 14,27,3340,62,84 see Time to see CCF Parametric Models 348,350,356 27,3033 151,158,205220,228,229,328,346, 354365,404 280282 332333 139143,145,148,156 142,145 261 see Phased Missions

statistics, parameters, etc.) Incident Precursors Inductive Approaches (methods, downtop, etc.) Inhibit Gate Initiating Events Inspections J integral Approach Least squares Methods Level of Significance Life (test, time, prediction) Limiting conditions of operation (LCO) Limiting curves, surface Linear Accumulation of Damage Load models, cycles, etc.: Electrical Mechanical Kolmogorov Test Mai ntai nabili ty Maintenance Management Factors ManMachine Interface/inter action (MMI) Margin (safetystate) Marine structures Markov Analysis/Theory

Marsha11Olki Model Maximum Expected Utility Maximum Likelihood Estimation Mean time ... MGLModel Mobilization Moments Matching Monte Carlo (Analysis, Method, Simulation, etc.) MORT (Management and Oversight, and Risk Tree) Mortality Index Minimal CutSet Minimal Path Set Mistake Multiphase Systems

566 Multiple Correspondence Analysis Multi-state Multivariate Analysis Multivariate Stochastic Processes Non-Destructive Testing Nuclear Power Plants Off-shore Structures Oil Drilling Operability Analysis Operational Experience Operating Profile, Conditions Operator Action Tree Operator Procedures Optimization (cost, system, design) Outages (frequency, occurrence, mean time) Parameter Estimation Pattern Recognition Performance Shaping Factors .Petri Nets Phased Mission Planning Plasticity Power Systems Reliability Pressure Vessels Probability Distributions Process Industry Production Evaluation Production Index (PPI) PSA (Probabilistic Safety Analysis) Qualitative Analysis 116-117 346 111,116-117 487 491 67-126,221-256,257,447-462,463-385 513,515-517,518,523-526,547-548 68,97,345,359-365 see HAZOP 95-126,252,367,372,457 114-115,488,491 291-292 224,228,518 129,388,403,413,418,517 390,406-407,463

27-47,49-66,250-252 106,113 266 345,346,354-365 149,158,163,346,351,353 388,403,413-414 501-502,508,544-545 387-415 284,324,325,487-512 see Distributions 257,303-318,319-344,345-365 345-359 346,350,352,353,358 221,373,447-462,474-484 see FMEA, HAZOP, Cause-Consequence Diagrams 205-210 158,160 417 3-25,176,226 129,210-215,221,257,303,367-385, 457-460 23,467 408-409,528-531,537 370,371 21,222,368,371,413 9-10 11 4,76,92,158,171,172,174,184-188, 205,229,358,538

Random Number Generators Random Variable Combination Real Time Software Design Reliability (Definitions, Theory) " Assessment " Block Diagrams " index Reliability & Cost/Design Redundancy Repair Density " Intensity " Process, Policy

567 R e p a i r Rate " Time R e p a i r a b l e Components Residual Life R e s i s t a n c e ( D i s t r i b u t i o n , model, etc.) Response S u r f a c e Methodology (RSM) Risk A c c e p t a b i l i t y Risk A s s e s s m e n t / A n a l y s i s Risk Contours Risk P e r c e p t i o n Risk Reducing Measures Rupture Rate Safe S u r f a c e Safety Safety Factor Second Moment Methods Sequence (of failures/events) Sensitivity Analysis SHARP (Systematic Human Action Reliability Procedure) Shipping (Reliability, Safety) 9-10,12-13,97,175,348 s e e Time t o R e p a i r 369,389 492-493,499 513,526,531,544-546 502-503 320 221,257,303,319-344,345,517-526 328,331,334 298 320,321,327 495 487 95,98,106,129,153,221,257,369,370, 447-453,303,319,367,368,373-384, 388,463-485,514,518,531 490,509,515,527 526-531 104,106-110, see also Event Tree see Uncertainty Analysis 281,283-284,291

257,513,515-517,518-523,526,534, 535,537-546 SLIM-MAUD 297 Slip 261 Software Engineering 417,433-444 Software Errors/Faults 420-424 Software Reliability 233,417-445 Standby 21,22,107,132,158,172,178-181 State Probability vector 173 State Random Process 172 State Space Diagrams 399,401-403,405,468,471 Statistical Bayesian Analysis see Bayes Classical " 13-14,16,27-47 Correlation " 231 13,27-28 Graphical " 143 Independence Interference 488 Processing 13-19,27-47,84,97,110-117 Tests of Fitting 27,40-43,84 n Chebiar 112,115 " Kolmogorov-Smirnov 17-18,41-43 14,16,17,40-41 " X2 Stochastic Process 496,498 (see also state random process) Strength (Models, etc.) 488,514,522,526,544 Stress (Analysis, Parameters) 491,502-503 Stress Intensity Factor 491,493,494,501,503-508

568 Structural Reliability Structure Function Subjective Probability Success Tree System (boundaries, definition, etc.) System Logic/Model System Series/Parallel System States Task Analysis Test (interval, procedures, etc.) THERP Time to Failure Time to Repair/Restore Time Trending Toxic (cloud, release, effects, etc.) Transition Rate Matrix Transportation Risk Trend Analysis Unavailabili ty Unreliability Uncertainty (assessment, identification, etc.) Variance Reduction Techniques Vulnerability Model Weakest Link Wear (-in, -out) 487-511,513-559 21,139-143 50,53 280 3,20,107,108,130,131,155,226,227, 374,384 21,24,98,120-121,139-143,156,158, 205,226 21-23,404 388,404 257,271-278,284 93,158,224,371,457,464,465,491 291-296,460 4,6,16,90 4,9-10,16,90,92,106,348 106,110,111 319,321,322,326,332,333,335 173-177,497-498 338-343 105,106,110-117 10-13,22,129,143-146,176,178,213215,222,227,390,478,483 129,147 150,151,156,158,222,237-245,321, 326,328,332,335,502,525,534-537 215-216 326 367,498 6-9,27

Reliability Engineering focuses on the theory and application of reliability modeling techniques and is augmented by a series of case studies that offer a comprehensive treatment of the development and use of reliability engineering in Europe. The work is divided into three parts. Part I introduces the fundamental definitions and models of reliability theory and data collection. Part II describes the main reliability techniques, such as fault trees, Markov chains and Monte Carlo simulation; problems such as dependent failures and human fallibility are also discussed. Part III presents applications to both availability and safety assessment in several industrial sectors, such as major hazard installations, off-shore work, nuclear power plants, aerospace, electrical networks and telecommunications. There is also a discussion of structural reliability and applications to pressure vessels and marine structures. The book will be of great value to scientists and engineers who have to deal with reliability-availability-maintainability programmes and the safety assessment of industrial systems and structures.

