
Interactive Consistency With Multiple Failure Modes

Philip Thambidurai and You-Keun Park

Aerospace Technology Center


Allied-Signal Aerospace Company

Abstract

The theoretical requirement for Interactive Consistency, N > 3t, where t is the maximum number of faulty processors and N the total number of processors, is based on worst case models of failure. A design based on this model, while making no assumptions about failure modes, imposes a severe penalty in terms of hardware and performance. We address the problem of reaching Byzantine Agreement in a distributed system in the presence of different types of faults, and show that significant improvements in reliability and performance are possible if faults can be partitioned into disjoint classes. We show that in a distributed system, to guarantee Byzantine Agreement requires N > 2a + 2s + b + r, where N is the total number of processors, a is the number of malicious asymmetric faults (a ≤ r), s the number of malicious symmetric faults, b the number of non-malicious or intercepted faults, and r is an algorithm dependent term. The practical value of this unified model in designing ultra-reliable systems is demonstrated by examples.

1 Introduction and Motivation

Agreement [1,2] is one of the fundamental problems in the design of reliable distributed systems. In these systems, a single processor might be required to transmit the result of some computation to all the other processors in the system. If the sending processor (called the Transmitter) is faulty, it may send conflicting values to different processors, causing disagreement among the recipients of the message (called the Receivers). This situation could arise even if the Transmitter is non-faulty because the communication network may be faulty. Such a fault could result in some of the Receivers taking a certain action while other Receivers take a different action, leading to inconsistency or disagreement among non-faulty processors. This problem was called the Byzantine Generals Problem by Lamport et al [2]. Any solution to this problem must satisfy two conditions:

Agreement: All non-faulty Receivers will agree on the value of the Transmitter.

Validity: If the Transmitter is non-faulty, then the value selected by each non-faulty Receiver will correspond to the Transmitter's own value.

Together these conditions are referred to as the Interactive Consistency conditions [1] or the Byzantine Agreement conditions [2]. To guarantee Byzantine Agreement, the total number of processors must be more than three times the maximum number of faults [1,2]. The behaviour of faulty nodes (faults) is not restricted in any way, i.e., no assumptions are necessary about the failure modes (we refer to this as the Byzantine Generals model [2]). It is also known that at least t + 1 rounds of messages are required by any algorithm, where t is the maximum number of faults [3], unless it is assumed that a faulty processor cannot forge the signature of a non-faulty processor (authentication assumption).

While Byzantine Agreement is a desirable feature to have in ultra-reliable distributed systems, it does not come without a price. The algorithms are expensive in terms of the number of messages, in terms of time, and in the number of processors required. A t-round algorithm could require O(N^t) messages, and message paths of length up to t + 1, where N is the number of processors. It is this performance and cost overhead which precludes the use of these algorithms in many real systems. However, by making assumptions, it is possible to devise relatively efficient algorithms. For instance, the number of processors required and/or the number of rounds of message exchange can be reduced by placing appropriate restrictions on the failure modes. These restrictions are usually justified by the architecture and the application environment. Given that Byzantine Agreement is extremely expensive to implement, this paper introduces realistic failure models which still allow arbitrary failures.

CH2612-0/88/0000/0093/$01.00 © 1988 IEEE
A significant advantage of the Byzantine Generals Model in designing critical fault-tolerant systems is that it greatly simplifies a Failure Modes and Effects Analysis (FMEA). This type of fault analysis is required for certification of systems used in critical applications where loss of the system may be too costly. The FMEA procedures are complex and error-prone because typically a human must systematically categorize all the failure modes of the system. In general, for complex systems it is impossible to consider all failure modes (a combinatorial problem). Therefore, this approach can result in some subtle failure modes being excluded from consideration. The Byzantine Generals Model, however, makes absolutely no assumptions about failure modes; thus, the FMEA process is greatly simplified.

A real design problem is usually constrained in some way or other (e.g., number of processors, processor speed, cost, reliability of processors, inter-processor communications, etc.), regardless of the specifications or requirements. With this in mind, we develop a unified theory to allow the designer to build a system which can take advantage of constrained fault behaviour, if such constraints are present; but, unlike other approaches, our models still allow arbitrary failures.

In the quest for realistic (i.e., practical) models, several algorithms have been proposed that guarantee Byzantine Agreement under limited classes of faults. However, these algorithms assume that faults belong to a single class. No attempt has been made to combine these different types of faults under one model, except for [4]. Such models are limited in their application because the system designer must show that the system cannot exhibit failure modes not covered by the restricted model: this means that all faults are restricted. These models essentially require that the probability of a failure mode not covered by the model be zero. A significantly more useful approach is to allow different types of restricted faults to occur, while limiting the maximum number of unrestricted faults, in the same model; the probability of an arbitrary fault may now be non-zero. This approach can then provide optimal reliability in the presence of different types of faults.

Since the Byzantine Generals model [2] is expensive to implement, the majority of work in this area has dealt with restricted scenarios: truly arbitrary failure modes are not permitted. One such assumption is authentication [2,5], which assumes that a non-faulty processor's signature cannot be forged. Other approaches limit fault behaviour, for example, omission faults [5,6] and timing faults [5,4]. To ensure that failure modes are restricted, it is either necessary to restrict the architecture so that a different kind of failure mode does not occur, or to assume that the probability of other failure modes is negligible relative to the system specification.

In [5], these restrictions take the form of an atomic broadcast protocol. Efficient but separate algorithms are presented to tolerate timing faults, omission faults, and Byzantine faults (using authentication), respectively. Faults are partitioned into processor faults and link faults; the communication network need not be fully connected. A relatively significant restriction of the atomic broadcast is the requirement that every message that is broadcast be delivered to all non-faulty receivers or to none of them; this restriction can be enforced with high probability by constraining the architecture. The algorithms for omission and timing faults cannot cope with arbitrary faults.

Assuming the existence of a redundant broadcast network, Babaoglu et al [6] present interactive consistency algorithms to cope with processor faults, channel faults, and link faults. The redundant broadcast network and the authentication assumption permit these algorithms to require only two rounds of message exchange, rather than the t + 1 rounds of most algorithms. Their results are useful in practice because of the fault partitioning and because of the very low message overhead. However, they do not address the case where a processor may have multiple failure modes.

Meyer and Pradhan [4] have developed a model for interactive consistency in the presence of two types of failure modes: arbitrary faults and benign faults (strictly omission or delay of messages). Their algorithm, while being complex and inefficient, does not require a fully connected system. We note that our definition of non-malicious faults, which follows, includes both omission and timing faults, but is not restricted to these.

In the following section we present our classification of faults for distributed systems. Section 3 presents our central result: a model for interactive consistency in the presence of different types of failure modes. Section 4 is a brief discussion of the applicability of the model in designing fault-tolerant systems.

2 Fault Taxonomy

Our fault classification (Figure 1) is based on the model of a fully connected distributed system in which processors communicate by messages. A processor can infer the state of the other processors in
the distributed system only by receiving messages (or by the lack of expected messages relative to some protocol). The effect of a fault within a processor is completely invisible to the other processors except for the messages that the faulty processor sends. This means that from the system perspective, it is meaningful to define faults only in the context and content of messages transmitted by the faulty processors.

                 ALL FAULTS
                /          \
     NON-MALICIOUS        MALICIOUS
                         /         \
                  SYMMETRIC     ASYMMETRIC

Figure 1: Fault Partitioning

An arbitrary fault or Byzantine fault includes every type of fault in Figure 1, because no restrictions are imposed on faults (i.e., any fault classification is meaningless). We partition the set of all possible faults, F, into two disjoint subsets: Non-malicious faults B, and Malicious faults M. A Non-malicious or intercepted fault is defined to be a fault which is detected by every non-faulty receiver of the message. This detection must, of course, occur prior to the use of the message in the interactive consistency algorithm. Examples of Non-malicious faults include Timing faults [5,4], Omission faults [5,7] and Crash faults [6]. The case where all non-faulty processors can detect a fault in a message by virtue of its contents (for example, out of bounds data) is also a Non-malicious fault. A Malicious fault refers to the case when not every non-faulty receiver can detect that a fault has occurred. It is these faults that require the use of Byzantine Agreement algorithms with several rounds of message exchange. Unlike the current literature, we partition Non-malicious and Malicious faults into two disjoint subsets, Symmetric faults S, and Asymmetric faults A.

Asymmetric faults refer to the case when a message is not received identically by all the non-faulty receivers of that message. If a processor sends a message to other processors via different communication channels, the received messages could differ either due to the processor lying or due to faults in the communication network. In either case, if not all non-faulty receivers receive identical messages (erroneous or not) we say that an asymmetric fault has occurred. Some receivers might be able to detect errors in the message while others receive a valid message: this is also an example of an asymmetric fault. The distinguishing feature of an asymmetric fault is that at least one non-faulty recipient receives a different message.

A Symmetric fault refers to the case when all receivers obtain exactly the same message. An example is the case where processor PA received a 1 from PB, but informs all other processors that it received a 0. The receiving processors do not know what PB actually transmitted to PA. Symmetric faults may occur as a result of faulty processors or as a result of faulty communication channels. The case for defining a symmetric fault is best expressed by a dedicated broadcast channel (see Figure 2). In this method of communication, the channel is dedicated in the sense that only one processor may transmit on that channel; all other processors are receivers. Note that data transfer is strictly unidirectional. This is the approach taken by ultra-reliable fault-tolerant systems such as SIFT [8] and MAFT [9,10]. The major advantage of this approach is that the sending processor (the Transmitter) is physically limited to sending the same message to all the receivers. Therefore, the sending processor may commit only a symmetric fault. Symmetric faults are of course not restricted to dedicated broadcast networks. In the worst case, even the communication link (or any communication device in general) could be a source of asymmetric faults. However, a communication device (transmitter, receiver, channel) is extremely simple relative to a processor. Hence the probability of an asymmetric fault could be much less than the probability of a symmetric fault; our model permits the system designer to take advantage of this (if it is true in a particular architecture).

Figure 2: Example of dedicated broadcast: Single Transmitter and many Receivers

Our fault partitioning is useful because the probabilities of the different types of faults are different in practice. Non-malicious faults are far more likely to occur than malicious faults. If only non-malicious faults occur, the algorithms for Agreement are very simple and efficient. When a processor is faulty, the probability that it will generate valid messages is quite low; the probability of that faulty processor sending different but valid messages to different receivers is even smaller (by a valid message we mean a

message that is correct relative to the message protocol such as timing, ECC, framing). Thus, in practice, Pr{Non-malicious fault} > Pr{Malicious Symmetric fault} > Pr{Malicious Asymmetric fault}.

Also observe that it is possible for a fault to be symmetric or asymmetric and still be non-malicious. This partitioning of a non-malicious fault into symmetric and asymmetric cases is not shown in Figure 1, because our unified model is capable of treating both cases uniformly. A non-malicious asymmetric fault is no worse than a non-malicious symmetric fault.

3 The Unified Model

We are interested in providing Byzantine Agreement in the presence of different types of faults. The original Byzantine Generals Model [1,2] does not take advantage of the case where some faults may be arbitrary while others are restricted. By bounding the number of arbitrary faults, we develop a Unified model, which takes advantage of the fact that restricted faults do not require N > 3t, while also allowing for the possibility of arbitrary faults.

Theorem 1 below shows that it is possible to guarantee Byzantine Agreement with much less processor and message overhead if faults can be partitioned into symmetric faults, asymmetric faults, and non-malicious faults. In the following discussion, a is the number of malicious asymmetric faults, s the number of malicious symmetric faults, b the number of non-malicious faults (includes both non-malicious symmetric and non-malicious asymmetric), r is the number of rounds of message exchange excluding the initial transmission, and N the total number of processors in the system.

We make the following assumptions:

A1 A direct path exists from every processor to all other processors.

A2 The absence of a message can be detected by the intended recipient of that message. We use a synchronous agreement protocol.

A3 If both the sender and recipient of a message are non-faulty, then the message is delivered correctly. A fault in the communication link between any two processors is treated as a fault of one of these processors.

The algorithm Z(r), defined inductively for r ≥ 0, solves the Byzantine Agreement problem when there are more than 2a + 2s + b + r processors, in the presence of a + s + b faults, for a ≤ r. Note that r must be fixed prior to execution of the algorithm. In algorithm Z(r), a processor selects the value E if it detects an error in the message, or if it receives no message; this value is used in subsequent transmissions (if any). The algorithm uses a majority function to vote on a vector of values. This majority function excludes all elements having the value E before voting; the majority function operates only on the non-E values. If a majority does not exist among the non-E values (i.e., there is a tie), a default value may be chosen, including the value E; of course this default must be defined prior to execution of the algorithm. The algorithm presented below is adapted from Lamport et al [2], except for its use of the E values.

Algorithm Z(r)

Step 1: The Transmitter sends its value to every Receiver.

Step 2: For each i, let v_i denote the value that Receiver i gets from the Transmitter; if no message is received or if an erroneous message is received, the value E is chosen for v_i. If r = 0, then every Receiver uses the value v_i. Otherwise, every Receiver acts as the Transmitter in Algorithm Z(r - 1) to send the value v_i to each of the N - 2 other Receivers.

Step 3: For each i, and each j ≠ i, let v_j be the value that Receiver i received from Sender j in step 2 by using Algorithm Z(r - 1); if no message is received or if an erroneous message is received, the value E is used for v_j. Receiver i uses the majority value from the set (v_1, v_2, ..., v_{N-1}), excluding elements having the value E prior to the vote.

Lemma 1 For any r, any a, any s, and any b, Algorithm Z(r) satisfies the Validity condition if N > 2a + 2s + b + r.

Proof: Note that the Validity condition applies only when the Transmitter is non-faulty. The proof is by induction on r.

1. Basis step: If r = 0, since the Transmitter is non-faulty, all non-faulty Receivers must receive the Transmitter's value v by assumption A3. Thus, the Lemma is true when r = 0.

2. Induction step: Assuming that the Lemma is true for r - 1 (r > 0), we show that it is true for r.

In step 1, the Transmitter of Algorithm Z(r) sends a value v to each of the N - 1 Receivers and

each of the N - 1 Receivers executes Algorithm Z(r - 1) in step 2.

From the hypothesis N > 2a + 2s + b + r, we have (N - 1) > 2a + 2s + b + (r - 1). Thus, by the induction hypothesis, we conclude that every non-faulty Receiver gets v_j = v for each non-faulty Receiver j.

Since each of the N - 1 Receivers acts as the Transmitter in Z(r - 1), each Receiver gets a vector of N - 1 elements after the completion of Z(r - 1). Each element of this vector can take on one of three possible values (correct value, incorrect value, and E). Let k denote the total number of error values (E) and m the total number of non-error values (v_j ≠ E) in this vector. Then, m = N - 1 - k.

By hypothesis,

\[ N > 2a + 2s + b + r \tag{1} \]

so that

\[ m > 2a + 2s + b + r - 1 - k. \]

Since k ≥ b, we can write k = b + α, where α ≥ 0. Then,

\[ m > 2a + 2s + b + r - 1 - (b + \alpha) \]

which reduces to

\[ m > 2a + 2s + r - 1 - \alpha. \tag{2} \]

Note that the set of α elements (each element is E) cannot result from malicious symmetric faults. By definition, the E values can be generated only by non-malicious faults and malicious asymmetric faults. All non-malicious faults must generate E values, while all the malicious asymmetric faults need not generate E values. Therefore, we can write a = a' + α and, substituting into equation 2, obtain

\[ m > 2a' + 2s + \alpha + r - 1. \tag{3} \]

The total number of faulty values among the set of m elements, t, is then equal to a' + s. Then, equation 3 becomes m > 2t + α + r - 1. But since α ≥ 0 and r > 0, we can write m > 2t. Thus a majority of the m values are correct (equal to v). Therefore each non-faulty Receiver selects v for the Transmitter's value by the majority vote in step 3 of the algorithm, which satisfies the Validity condition. □

Lemma 2 For any r, any s, any b, and a ≤ r, Algorithm Z(r) satisfies the Agreement condition if N > 2a + 2s + b + r.

Proof: If the Transmitter is non-faulty, we have shown that Algorithm Z(r) satisfies the Validity condition; in this case Validity implies Agreement. We only have to prove that the Lemma is true when the Transmitter is faulty. The proof is by induction.

Basis step: When r = 0 there cannot be any malicious asymmetric faults because a ≤ r. Hence all non-faulty Receivers will receive an identical message, so that Z(0) satisfies the Agreement condition.

Induction step: Assuming that the Lemma is true for r - 1, we show that it is true for r, where r > 0.

In step 1, the Transmitter of Algorithm Z(r) sends a value v to each of the N - 1 Receivers and each of the N - 1 Receivers executes Z(r - 1) in step 2.

If the Transmitter commits a malicious symmetric or a non-malicious fault, all non-faulty Receivers receive the same value, and by Lemma 1 every non-faulty Receiver gets the same values for each non-faulty Receiver from Algorithm Z(r - 1). Hence the algorithm Z(r) satisfies the Agreement condition if the Transmitter commits a malicious symmetric or a non-malicious fault.

When the Transmitter commits a malicious asymmetric fault, there is one less malicious asymmetric fault and one less processor in Algorithm Z(r - 1) than in Algorithm Z(r), but the total number of malicious symmetric and non-malicious faults is the same in both Algorithms. Hence, from the hypothesis N > 2a + 2s + b + r, a ≤ r, we have (N - 1) > 2a + 2s + b + (r - 1) > 2(a - 1) + 2s + b + (r - 1) and (a - 1) ≤ (r - 1). We can therefore apply the induction hypothesis to conclude that every non-faulty Receiver gets the same value in step 3. Thus, every non-faulty Receiver selects the same value, which is the majority value of m non-error values (m = N - 1 - k, where m and k are defined in Lemma 1). Thus, Algorithm Z(r) also satisfies the Agreement condition. □

The central result of this paper is expressed in the following Theorem.

Theorem 1 For any r, any s, any b, and a ≤ r, Algorithm Z(r) satisfies Interactive Consistency if N > 2a + 2s + b + r.
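The bound of Theorem 1 and the three steps of Algorithm Z(r) can be exercised with a small Python sketch. It is not the authors' implementation: the names `min_processors`, `majority`, `z`, and the `send(src, dst, value)` callback, which stands in for whatever a (possibly faulty) sender actually delivers, are assumptions made for illustration, and the default vote result is taken to be E.

```python
from collections import Counter

E = "E"  # marker for a detected error or a missing message


def min_processors(a, s, b, r):
    """Smallest N satisfying N > 2a + 2s + b + r (the bound assumes a <= r)."""
    if a > r:
        raise ValueError("the bound assumes a <= r")
    return 2 * a + 2 * s + b + r + 1


def majority(values, default=E):
    """Strict majority among the non-E values, else the preset default."""
    votes = [v for v in values if v != E]
    if not votes:
        return default
    value, count = Counter(votes).most_common(1)[0]
    return value if 2 * count > len(votes) else default


def z(r, transmitter, receivers, value, send):
    """Run Z(r); returns a dict mapping each receiver to the value it selects.

    send(src, dst, v) is a hypothetical model of the channel: it returns
    what dst actually receives when src tries to send v.
    """
    # Steps 1-2: every receiver records what it got from the transmitter.
    got = {i: send(transmitter, i, value) for i in receivers}
    if r == 0:
        return got
    # Step 2: every receiver relays its value to the others using Z(r - 1).
    relayed = {
        j: z(r - 1, j, [i for i in receivers if i != j], got[j], send)
        for j in receivers
    }
    # Step 3: every receiver votes on its own value plus the relayed ones.
    return {
        i: majority([got[i]] + [relayed[j][i] for j in receivers if j != i])
        for i in receivers
    }


def faulty_send(src, dst, v):
    """Processor 3 lies asymmetrically; processor 4 is silent (non-malicious)."""
    if src == 3:
        return dst % 2
    if src == 4:
        return E
    return v


# N = 5, a = 1, b = 1, r = 1: the bound 5 > 2 + 0 + 1 + 1 holds.
decided = z(1, 0, [1, 2, 3, 4], 1, faulty_send)
print(decided[1], decided[2])  # values selected by the non-faulty receivers
```

Only the entries for non-faulty receivers are meaningful in the returned dictionary; a faulty processor's own decision is computed by the sketch but carries no guarantee.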

Proof: By Lemma 1 and Lemma 2, both the conditions necessary for Interactive Consistency are satisfied. Therefore, the Theorem is true. □

We now consider some special cases that follow from Theorem 1. The corollaries can easily be proved and are essentially special cases of Theorem 1; we do not supply the proofs of these corollaries here.

Corollary 1 For any r, any s, and any a ≤ r, algorithm Z(r) guarantees Interactive Consistency if N > 2a + 2s + r.

The above corollary is useful if all faults are considered to be malicious; there are no non-malicious faults. Although Theorem 1 above may be used (b = 0), we differentiate this case because now there is no need for a special E value when there are no non-malicious faults. Algorithm Z(r) is modified so that there is no E value corresponding to a non-malicious fault, and the majority vote does not exclude any elements; let this modified Z(r) be called algorithm Z'(r). Note that Z'(r) is identical to the well known algorithm OM(r) of [2].

We now consider the case when there are no malicious symmetric faults (s = 0). Then, the Unified model reduces to a form which is essentially equivalent to that of [4], whose benign faults are a subset of our non-malicious faults. However, our algorithm Z(r) is far more efficient than that provided by [4], but we do not consider arbitrary connectivity as they do. To obtain their result, we set r = a and s = 0 in Theorem 1, which then leads to the following corollary.

Corollary 2 For any a and any b, Algorithm Z(a) satisfies the Interactive Consistency conditions, if N > 3a + b.

The following corollary shows that the Unified model reduces to the result N > 3r of Lamport et al [2] if b = 0, s = 0, and a = r, where r is the maximum number of faulty processors.

Corollary 3 If N > 3r for any r, where r is the maximum number of faulty processors, then algorithm Z(r) satisfies the Interactive Consistency conditions.

4 Applications

In this section we briefly examine the practical advantages of the unified model in estimating system reliability and in the design of fault-tolerant systems. In the discussion below, the expression BG model refers to the model in which no assumptions are made about fault behaviour, i.e., all faults are arbitrary.

This example illustrates that the BG model actually imposes constraints on the system designer that may be unreasonable in some cases. Consider the case where all faults are arbitrary. Assume that a 1-round algorithm is used, so that at most one fault can be tolerated. At least four processors are necessary, and the occurrence of a second fault is assumed to cause system failure. Now suppose that the designer wishes to increase the system reliability by using more processors. If the BG model is used, then increasing the number of processors from four to either five or six will only decrease the system reliability because there are more components to fail. Increasing the number of rounds to r = 2 will not help if the number of processors is either five or six. If N is the total number of processors, and λ the failure rate of a processor, the system failure probability at time t, for r = 1, is approximately (for λt << 1)

\[ \Pr(\text{System Failure}) \approx \frac{N(N-1)\lambda^{2}t^{2}}{2} \]

which increases as N^2. The probability of failure at 1 hour is shown in Table 1, for systems with four, five, and six processors, respectively (UM in the table refers to the Unified Model). The processor failure rate was assumed to be 1.0 x 10^-? per hour.

Table 1: Reliability data for Example 1

Model   N   P(Failure)     Faults
BG      4   6.0 x 10^-?    1 arbitrary
BG      6   1.5 x 10^-?    1 arbitrary
UM      4   6.0 x 10^-?    1 arbitrary, b = 0, s = 0
UM      5   1.0 x 10^-?    1 arbitrary, b = 1, s = 0
UM      6   2.0 x 10^-?    1 arbitrary, b = 0, s = 1
UM      6   1.1 x 10^-?    1 arbitrary, b = 2, s = 0

If the BG model is used, the system designer cannot increase reliability simply by using additional processors. It is necessary to change the algorithm and also to increase the number of processors in steps (as the above example demonstrates, increasing from four to five or six processors is of no benefit). Increasing r results in an exponential increase in the number of messages, resulting in severe performance penalties.

Even if 7 processors are used with an r = 2 algorithm, the problem does not entirely go away. The 7-processor system can tolerate at most two faults. Now suppose that a fault has occurred, so that there are six non-faulty processors. From the standpoint of system reliability it would be best to degrade the

system to four nodes and use an r = 1 algorithm. However, under the BG model, it is not possible in general to tell which processor is faulty. But we have shown above that it is preferable to use 4 processors and not 6 processors if only one fault can be tolerated. Thus, the BG model cannot provide optimal reliability, in the sense that an increase in the number of processors need not always result in an increase in reliability.

An interesting question arises: since the five and six-processor systems cannot tolerate any more arbitrary faults than the four-processor system, is there any way of increasing the system reliability by increasing the system size to five or six processors?

The Unified model allows the designer to linearly increase the number of processors to obtain increased reliability, without increasing the number of rounds. Since r = 1, the model states that N > 2a + 2s + b + 1, so that a 5-processor system can indeed tolerate two faults (one malicious asymmetric and one non-malicious). A 6-processor system can tolerate two faults (a = 1, s = 1, b = 0) or three faults (a = 1, b = 2, s = 0). Table 1 shows that a significant increase in reliability is possible using the new model.

We now consider a hypothetical design problem having the following requirements and constraints:

1. System Failure Probability must be less than 1.0 x 10^-? for a 10-hour mission (reliability requirement).

2. Use the minimum number of processors (cost and space constraint).

3. Interactive Consistency is required (reliability requirement).

4. Individual processor failure rate is λ = 1.0 x 10^-? per hour.

5. The system must tolerate at least one arbitrary fault.

Item 5 may first appear to be unreasonable; however, this requirement is extremely useful from the perspective of "certification" and FMEA.

If the BG model were used to solve this problem, it can easily be shown that at least 10 processors and three rounds of message exchange (r = 3) are required. The performance overhead in this case is prohibitively high due to the message overhead (O(N^?)). The failure probability is 2.0 x 10^-?. There are many models in the literature which can be used if it is assumed that none of the faults is arbitrary: these do not satisfy the specifications because of item 5 above. Ideally we would like a model which takes advantage of different types of faults, and does not treat all faults as arbitrary. Such an approach can result in less cost and greater performance for a given level of reliability than the BG model.

The problem can be neatly solved by using our model, which, for this specific example, states that N > 2a + 2s + b + 1, where a ≤ 1. The model offers a choice of configurations to the designer. Using Markov models [11] and the SHARPE modelling package [12], it was found that the design specifications can be satisfied with six, seven, or eight processors. The 6-processor system is based on the occurrence of at most one arbitrary fault and two non-malicious faults; the failure probability in this case is 1.5 x 10^-? at 10 hours. A 7-processor system is necessary if the designer wishes to consider the occurrence of at most one arbitrary fault, one malicious symmetric fault, and one non-malicious fault; the failure probability is 3.5 x 10^-? at 10 hours. An 8-processor system is required if one arbitrary fault and two malicious symmetric faults must be tolerated, the failure probability being 7.0 x 10^-? for the 10-hour mission. Under our assumptions, it is interesting to observe that the BG model with 10 processors and three rounds of message exchange provides less reliability than the Unified model using 6, 7, and 8 processors.

In Table 2 we show the minimum number of processors required to guarantee Interactive Consistency using the Unified Model, for various combinations of faults, when r = 1.

Table 2: Resiliency of a System based on the Unified Model (minimum number of processors required)

5 Summary

A new model has been presented for interactive consistency in the presence of different types of faults; this model is much more useful in the design of real systems than the Byzantine Generals model. Our fault partitioning is unique and covers every possible

type of fault. We believe that our bounds are tight, and think this can be proven using the results of [3]. We have shown that, if an interactive consistency algorithm is used, then the ability to tolerate even non-malicious faults is diminished. If the probability of malicious asymmetric faults can be bounded, then the system reliability is vastly superior to that predicted by the Byzantine Generals model; the enhanced reliability is obtained at a performance and cost penalty that is much less than the Byzantine Generals model. Essentially, we have established a link between the theory and system design under realistic constraints.

6 Acknowledgements

The concept of symmetric and asymmetric faults and the suggestion to integrate them under the Byzantine Generals framework is due to A.M. Finn, R.M. Kieckhafer, and C.J. Walter. We acknowledge M.C. McElvany and John Wensley for critically reviewing the document.

References

[1] M. Pease, R. Shostak, L. Lamport, Reaching Agreement in the Presence of Faults, JACM, Vol. 27, No. 2, pp. 228-234, April 1980.

[2] L. Lamport, R. Shostak, M. Pease, The Byzantine Generals Problem, ACM Transactions on Programming Languages and Systems, Vol. 4, No. 3, pp. 382-401, July 1982.

[3] M.J. Fischer, N.A. Lynch, A Lower Bound for the time to assure Interactive Consistency, Information Processing Letters, Vol. 14, No. 4, pp. 183-186, June 1982.

[4] F.J. Meyer, D.K. Pradhan, Consensus with Dual Failure Modes, Seventeenth Int. Symposium on Fault Tolerant Computing, July 1987.

[5] F. Christian, H. Aghili, R. Strong, Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement, Fifteenth Int. Symposium on Fault Tolerant Computing, pp. 200-206, June 1985.

[6] O. Babaoglu, R. Drummond, Streets of Byzantium: Network Architectures for Fast Reliable Broadcasts, IEEE Transactions on Software Engineering, Vol. SE-11, No. 6, pp. 546-554, June 1985.

[7] T.K. Srikanth, S. Toueg, Simulating authenticated broadcasts to derive simple fault-tolerant algorithms, Distributed Computing (Springer-Verlag), Vol. 2, pp. 80-94, 1987.

[8] J.H. Wensley et al, SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control, Proceedings of the IEEE, Vol. 66, No. 10, October 1978.

[9] R.M. Kieckhafer, C.J. Walter, A.M. Finn, P.M. Thambidurai, The MAFT Architecture for Distributed Fault Tolerance, IEEE Transactions on Computers, Special Issue on Fault-Tolerant Computing, Vol. 37, No. 4, pp. 398-405, April 1988.

[10] C.J. Walter, R.M. Kieckhafer, A.M. Finn, MAFT: A Multicomputer Architecture for Fault-Tolerance in Real-Time Control Systems, Proc. IEEE Real-Time Systems Symposium, pp. 133-140, December 1985.

[11] K.S. Trivedi, Probability & Statistics with Reliability, Queuing, and Computer Science Applications, Prentice-Hall, 1982.

[12] R. Sahner, K. Trivedi, A Hierarchical, Combinatorial-Markov Method of Solving Complex Reliability Models, Proceedings ACM/IEEE Fall Joint Computer Conference, November 1986, Dallas, Texas.
