
CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS

A series of lectures on topics of current research interest in applied mathematics under the direction of

the Conference Board of the Mathematical Sciences, supported by the National Science Foundation and

published by SIAM.

D. V. LINDLEY, Bayesian Statistics, A Review

R. S. VARGA, Functional Analysis and Approximation Theory in Numerical Analysis

R. R. BAHADUR, Some Limit Theorems in Statistics

PATRICK BILLINGSLEY, Weak Convergence of Measures: Applications in Probability

J. L. LIONS, Some Aspects of the Optimal Control of Distributed Parameter Systems

ROGER PENROSE, Techniques of Differential Topology in Relativity

HERMAN CHERNOFF, Sequential Analysis and Optimal Design

J. DURBIN, Distribution Theory for Tests Based on the Sample Distribution Function

SOL I. RUBINOW, Mathematical Problems in the Biological Sciences

P. D. LAX, Hyperbolic Systems of Conservation Laws and the Mathematical Theory of Shock

Waves

I. J. SCHOENBERG, Cardinal Spline Interpolation

IVAN SINGER, The Theory of Best Approximation and Functional Analysis

WERNER C. RHEINBOLDT, Methods of Solving Systems of Nonlinear Equations

HANS F. WEINBERGER, Variational Methods for Eigenvalue Approximation

R. TYRRELL ROCKAFELLAR, Conjugate Duality and Optimization

SIR JAMES LIGHTHILL, Mathematical Biofluiddynamics

GERARD SALTON, Theory of Indexing

CATHLEEN S. MORAWETZ, Notes on Time Decay and Scattering for Some Hyperbolic Problems

F. HOPPENSTEADT, Mathematical Theories of Populations: Demographics, Genetics and Epidemics

RICHARD ASKEY, Orthogonal Polynomials and Special Functions

L. E. PAYNE, Improperly Posed Problems in Partial Differential Equations

S. ROSEN, Lectures on the Measurement and Evaluation of the Performance of Computing Systems

HERBERT B. KELLER, Numerical Solution of Two Point Boundary Value Problems

J. P. LASALLE, The Stability of Dynamical Systems - Z. ARTSTEIN, Appendix A: Limiting Equations

and Stability of Nonautonomous Ordinary Differential Equations

D. GOTTLIEB AND S. A. ORSZAG, Numerical Analysis of Spectral Methods: Theory and Applications

PETER J. HUBER, Robust Statistical Procedures

HERBERT SOLOMON, Geometric Probability

FRED S. ROBERTS, Graph Theory and Its Applications to Problems of Society

JURIS HARTMANIS, Feasible Computations and Provable Complexity Properties

ZOHAR MANNA, Lectures on the Logic of Computer Programming

ELLIS L. JOHNSON, Integer Programming: Facets, Subadditivity, and Duality for Group and Semi-

Group Problems

SHMUEL WINOGRAD, Arithmetic Complexity of Computations

J. F. C. KINGMAN, Mathematics of Genetic Diversity

MORTON E. GURTIN, Topics in Finite Elasticity

THOMAS G. KURTZ, Approximation of Population Processes

Probabilistic

Expert Systems

Glenn Shafer

Rutgers University

Newark, New Jersey

Probabilistic Expert

Systems

SIAM

SOCIETY FOR INDUSTRIAL AND APPLIED MATHEMATICS

PHILADELPHIA

Copyright 1996 by the Society for Industrial and Applied Mathematics.

10987654321

All rights reserved. Printed in the United States of America. No part of this book may be

reproduced, stored, or transmitted in any manner without the written permission of the

publisher. For information, write to the Society for Industrial and Applied Mathematics,

3600 University City Science Center, Philadelphia, PA 19104-2688.

Probabilistic expert systems / Glenn Shafer.

p. cm. -- (CBMS-NSF regional conference series in applied

mathematics ; 67)

"Sponsored by Conference Board of the Mathematical Sciences"-

-Cover.

Includes bibliographical references and index.

ISBN 0-89871-373-0 (pbk.)

1. Expert systems (Computer science) 2. Probabilities.

I. Conference Board of the Mathematical Sciences. II. Title.

III. Series.

QA76.76.E95S486 1996

006.3'3--dc20 96-18757

Contents

Preface

Chapter 1. Multivariate Probability

1.1 Probability distributions

1.2 Marginalization

1.3 Conditionals

1.4 Continuation

1.5 Posterior distributions

1.6 Expectation

1.7 Classifying probability distributions

1.8 A limitation

Chapter 2. Construction Sequences

2.1 Multiplying conditionals

2.2 DAGs and belief nets

2.3 Bubble graphs

2.4 Other graphical representations

Chapter 3. Propagation in Join Trees

3.1 Variable-by-variable summing out

3.2 The elementary architecture

3.3 The Shafer-Shenoy architecture

3.4 The Lauritzen-Spiegelhalter architecture

3.5 The Aalborg architecture

3.6 COLLECT and DISTRIBUTE

3.7 Scope and alternatives

4.1 Meetings

4.2 Software

4.3 Books

4.5 Other sources

Index

Preface

North Dakota at Grand Forks during the week of June 1-5, 1992, this mono-

graph analyzes join-tree methods for the computation of prior and posterior

probabilities in belief nets. These methods, pioneered by Pearl [42], [8], Lauritzen

and Spiegelhalter [37], and Shafer, Shenoy, and Mellouli [45] in the late

1980s, continue to be central to the theory and practice of probabilistic expert

systems.

In the North Dakota lectures, I began with the topics discussed here and

then moved on in two directions. First, I discussed how the basic architectures

for join-tree computation apply to other methods for combining evidence, especially

the belief-function (Dempster-Shafer) method, and also how they apply to

many other problems in applied mathematics and operations research. Second,

I looked at other aspects of computation in expert systems, especially Markov

chain Monte Carlo approximation, computation for model selection, and com-

putation for model evaluation.

I completed a draft of the three chapters that form the body of this mono-

graph in the summer of 1992, shortly after delivering the lectures. Unfortunately,

I set the project aside at the end of that summer, expecting to return in a few

months to write additional chapters covering at least the other major topics I

had discussed in Grand Forks. As it turned out, my return to the project was

delayed for three years, as I found myself increasingly concerned with another

set of ideas: the use of probability trees to understand probability and causality.

Rather than extend this monograph, I completed a new and much longer book,

The Art of Causal Conjecture (MIT Press, 1996).

The field of probabilistic expert systems has continued to flourish in the

past three years, yet the understanding of join-tree architectures set out in my

original three chapters is still missing from the literature. Moreover, the broader

research question that motivated my presentation (how well a general theory of

propagation along the same lines can account for the wide variety of recursive

computation in applied mathematics) remains open. I have decided, therefore,

to publish these three chapters on their own, essentially as they were written in

1992. I have resisted even attempting a brief survey of related topics. Instead I

have added a brief chapter on resources, which gives information on software and

includes an annotated bibliography. I have also added some exercises that will

help the reader begin to explore the problem of generalizing from probability to

broader domains of recursive computation.

The resulting monograph should be useful to scholars and students in artificial

intelligence, operations research, and the various branches of applied statistics

that use probabilistic methods. Probabilistic expert systems are now used in

areas ranging from diagnosis (in medicine, software maintenance, and space ex-

ploration) and auditing to tutoring, and the computational methods described

here are basic to nearly all implementations in all these areas.

I wish to thank Lonnie Winnrich, who organized the conference in North

Dakota, as well as the other participants. They made the week very pleasant

and productive for me. I also wish to thank the many students and colleagues,

at the University of Kansas and around the world, who helped me learn about

expert systems in the late 1980s and early 1990s. Foremost among them is

Prakash P. Shenoy, my colleague in the School of Business at the University

of Kansas from 1984 to 1992. I am grateful for his steadfast friendship and

indispensable collaboration.

Augustine Kong and A. P. Dempster, who joined with Shenoy and me in

the early 1980s in the study of join-tree computation for belief functions, were

also important in the development of the ideas reported here. Section 3.1 is in-

spired by an unpublished memorandum by Kong. Other colleagues and students

with whom I collaborated particularly closely during this period include Khalid

Mellouli, Debra K. Zarley, and Rajendra P. Srivastava.

Special thanks are due Niven Lianwen Zhang, Chingfu Chang, and the late

George Kryrollos, all of whom made useful comments on the 1992 draft of the

monograph.

I would also like to acknowledge the friendship and encouragement of many

other scholars whose work is reported here, especially A. P. Dawid, Finn V.

Jensen, Steffen L. Lauritzen, Judea Pearl, and David Spiegelhalter. The field of

probabilistic expert systems has benefited not only from their energy, intellect,

and vision, but also from their generosity and good humor.

Finally, at an even more personal level, I would like to thank my wife, Nell

Irvin Painter, who has supported this and my other scholarly work through thick

and thin.

CHAPTER 1

Multivariate Probability

This chapter reviews the basic ingredients of the theory of multivariate proba-

bility: marginals, conditionals, and expectations. These will be familiar topics

for many readers, but our approach will take us down some relatively unex-

plored paths. One of these paths opens when we develop an explicit notation

for marginalization. This notation allows us to recognize properties of marginal-

ization that are shared by many types of recursive computation. Another path

opens when we distinguish among probability distributions on the basis of how

they are stored. We distinguish between tabular distributions, which are sim-

ply tables of probabilities, and algorithmic distributions, which are algorithms

for computing probabilities. A parametric distribution is a special kind of al-

gorithmic distribution; it consists of a few numerical parameters and a rela-

tively simple algorithm, usually a formula, for computing probabilities from those

parameters.

The most complex topic in this chapter is conditional probability. Our pur-

poses require that we understand conditional probability from several viewpoints,

and we rely on some careful terminology to keep the viewpoints distinct. We

distinguish between conditional probabilities in general, which can stand on

their own, without reference to any prior probability distribution, and poste-

rior probabilities, which are conditional probabilities obtained by conditioning

a probability distribution on observations. And we distinguish two kinds of

tables of conditional probabilities: conditionals and posterior distributions. A

conditional consists of many probability distributions for a set of variables (the

conditional's head), one for each configuration of another set of variables (its

tail). A posterior distribution is a single probability distribution consisting of

posterior probabilities.

In the next chapter, we study how to construct a probability distribution

by multiplying conditional probabilities or, more precisely, by multiplying conditionals.

When we multiply the conditionals in an appropriate order, each

multiplication produces a larger marginal of the final distribution. This means

that each conditional is a continuer for the final distribution; it continues it from

a smaller to a larger set of variables. The concept of a continuer will help us

minimize complications arising from the presence of zero probabilities, which are

unavoidable in expert systems, where much of our knowledge is in the form of

rules that do not admit exceptions. Continuers will also help us, in Chapter 3,

to understand architectures for recursive computation.

This chapter is about multivariate probability, not about probability in general.

Not all probability models are multivariate. The chapter concludes with a

brief explanation of why multivariate models are sometimes inadequate.

1.1. Probability distributions.

The quickest way to orient those not familiar with multivariate probability is

to give an example. Table 1.1 gives a probability distribution for three variables:

Age, Sex, and Party. Notice that the numbers are nonnegative and add to one.

This is what it takes to be a discrete probability distribution.

TABLE 1.1

A discrete tabular probability distribution for three variables.

                     female                 male
               Dem    ind    Rep      Dem    ind    Rep
young          .08    .16    .08      .02    .04    .02
middle-aged    .05    .05    .05      .00    .00    .00
old            .05    .05    .05      .10    .10    .10

We will write $\Omega_X$ for the set of possible values of a variable X, and we will write $\Omega_x$ for the set of configurations of a set of variables x. We call $\Omega_X$ and $\Omega_x$ the frames for X and x, respectively. In general, $\Omega_x$ is the Cartesian product of the frames of the individual variables: $\Omega_x = \prod_{X \in x} \Omega_X$. In Table 1.1, we assume that

$$\Omega_{Age} = \{\text{young}, \text{middle-aged}, \text{old}\},$$

$$\Omega_{Sex} = \{\text{female}, \text{male}\},$$

and

$$\Omega_{Party} = \{\text{Democrat}, \text{independent}, \text{Republican}\}.$$

Thus the frame $\Omega_{\{Age,Sex,Party\}}$ consists of eighteen configurations:

(young,male,Democrat), (old,male,independent), ...

and Table 1.1 gives a probability for each of them. In general, as in this example, a discrete probability distribution for x gives a probability to every element of $\Omega_x$; abstractly, it is a nonnegative function on $\Omega_x$ whose values add to one.
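The definitions above can be made concrete with a small sketch. It stores Table 1.1 as a Python dictionary keyed by configurations and checks the two defining conditions of a discrete probability distribution. The names `frames`, `rows`, `P`, `configurations`, and `is_distribution` are illustrative choices, not notation from the monograph.

```python
from itertools import product

# Frames (sets of possible values) for the three variables of Table 1.1.
frames = {
    "Age": ("young", "middle-aged", "old"),
    "Sex": ("female", "male"),
    "Party": ("Democrat", "independent", "Republican"),
}

# Table 1.1, row by row; each triple is ordered as in frames["Party"].
rows = {
    "young":       {"female": (.08, .16, .08), "male": (.02, .04, .02)},
    "middle-aged": {"female": (.05, .05, .05), "male": (.00, .00, .00)},
    "old":         {"female": (.05, .05, .05), "male": (.10, .10, .10)},
}

# The distribution as a function on configurations (age, sex, party).
P = {
    (age, sex, party): rows[age][sex][i]
    for age in frames["Age"]
    for sex in frames["Sex"]
    for i, party in enumerate(frames["Party"])
}

# The frame for a set of variables is the Cartesian product of the
# frames of the individual variables.
configurations = list(product(*frames.values()))


def is_distribution(table, tol=1e-9):
    """Nonnegative values that add to one (up to rounding error)."""
    nonnegative = all(value >= 0 for value in table.values())
    return nonnegative and abs(sum(table.values()) - 1) < tol
```

Here `len(configurations)` is 18, matching the count given in the text, and `is_distribution(P)` holds for Table 1.1.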

1.2. Marginalization.

If we add together the numbers for males and females in Table 1.1, we get

marginal probabilities for Age and Party, as in Table 1.2. Adding further, we

get marginal probabilities for Age, as in Table 1.3.

Some readers may be puzzled by the name "marginal." The name is derived

from the example of a bivariate table, where it is convenient and conventional

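to write the sums of the rows and columns in the margins. In Table 1.4, for [...]

In general, if P is a probability distribution on a set of variables x and w is a subset of x, we write $P^{\downarrow w}$ for the marginal of P on w, which is given by

$$P^{\downarrow w}(c) = \sum_{d \in \Omega_{x \setminus w}} P(c.d) \tag{1.2}$$

for each configuration c of w. Here $x \setminus w$ consists of the variables in x but not w, and c.d is the configuration of x that we get by combining the configuration c of w and the configuration d of $x \setminus w$. For example, if x = {Age,Sex,Party} and w = {Age,Party}, then $x \setminus w$ = {Sex}; if c = (old,Democrat) and d = (male), then c.d = (old,male,Democrat).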

The arrow notation emphasizes the variables that remain when we marginalize. Sometimes we use instead a notation that emphasizes the variables we sum out: $P^{-y}$ is the marginal obtained when we sum out the variables in y. Thus when $x = w \cup y$, where w and y are disjoint sets of variables, and P is a probability distribution on x, both $P^{\downarrow w}$ and $P^{-y}$ will represent P's marginal on w.

Though we are concerned primarily with probability distributions, any numerical^2 function f on a set of variables x has a marginal $f^{\downarrow w}$ for every subset w of x. The function f need not be nonnegative or sum to one. If w is not empty, then $f^{\downarrow w}$ is a function on w. If w is empty, then $f^{\downarrow \emptyset}$ is simply a number, the sum of all the values of f. The function $f^{\downarrow w}$ will be equal to f if w = x.

Here are two important properties of marginalization:

Property 1. If f is a function on y, and $w \subseteq x \subseteq y$, then

$$(f^{\downarrow x})^{\downarrow w} = f^{\downarrow w}.$$

Property 2. If f is a function on x, and g is a function on y, then^3

$$(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y}).$$

It is informative to rewrite Properties 1 and 2 using the $f^{-}$ notation. This gives the following:

Property 1. If f is a function on y, and u and v are disjoint subsets of y, then

$$(f^{-u})^{-v} = f^{-(u \cup v)}.$$

Property 2. If f is a function on x, and g is a function on y, then

$$(fg)^{-(y \setminus x)} = f\,(g^{-(y \setminus x)}).$$

This version of Property 2 makes it clear that we are summing out the same variables on both sides of the equation $(fg)^{\downarrow x} = f\,(g^{\downarrow x \cap y})$. Summing these variables out of fg, which is a function on $x \cup y$, leaves the variables in x, but summing them out of g, which is a function on y, leaves the variables in $x \cap y$ (see Figure 1.1).

FIG. 1.1. Removing $y \setminus x$ from y leaves $x \cap y$; removing $y \setminus x$ from $x \cup y$ leaves x.

The second version of Property 2 also suggests the following generalization:

Property 3. If f is a function on x, g is a function on y, and w is a subset of $y \setminus x$, then

$$(fg)^{-w} = f\,(g^{-w}).$$

We leave it to the reader to derive this property also from equation (1.2).

^2 A numerical function is one that takes real numbers as values. We will consider only numerical functions in this monograph.

^3 In order to understand this equation, we must recognize that the product fg is a function on $x \cup y$. Its value for a configuration c of $x \cup y$ is given by $(fg)(c) = f(c^{\downarrow x})\,g(c^{\downarrow y})$, where $c^{\downarrow x}$ is the result of dropping from c the values for variables not in x. For example, if f is a function on {Age,Party} and g is a function on {Sex,Party}, then (fg)(old, male, Democrat) = f(old, Democrat) g(male, Democrat).

As we will see in Chapter 3, Properties 1 and 2 are responsible for the pos-

sibility of recursively computing marginals of probability distributions given as

products of tables. These properties also hold and justify recursive computation

in other domains, where we work with different objects and different meanings

for marginalization and multiplication. Because of their generality, we call Prop-

erties 1 and 2 axioms; Property 1 is the transitivity axiom, and Property 2 is the

combination axiom.
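A short numerical check may help make the two axioms concrete. The sketch below implements marginalization and multiplication for tables stored as dictionaries and verifies the transitivity and combination axioms on two arbitrary functions. The helper names `marginal` and `multiply`, the representation of a table as a pair (variables, dictionary), and the particular values of `f` and `g` are all assumptions made for illustration.

```python
from itertools import product

frames = {
    "Age": ("young", "middle-aged", "old"),
    "Sex": ("female", "male"),
    "Party": ("Dem", "ind", "Rep"),
}


def marginal(variables, table, keep):
    """Sum out every variable not in `keep`; return (kept_variables, table)."""
    kept = tuple(v for v in variables if v in keep)
    positions = [variables.index(v) for v in kept]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in positions)
        out[key] = out.get(key, 0) + value
    return kept, out


def multiply(fvars, f, gvars, g):
    """Return the product fg as a table on the union of the two domains."""
    union = tuple(fvars) + tuple(v for v in gvars if v not in fvars)
    out = {}
    for config in product(*(frames[v] for v in union)):
        c = dict(zip(union, config))
        out[config] = f[tuple(c[v] for v in fvars)] * g[tuple(c[v] for v in gvars)]
    return union, out


# Arbitrary integer-valued functions (not distributions), for illustration only.
x = ("Age", "Party")
y = ("Sex", "Party")
f = {c: 1 + i for i, c in enumerate(product(*(frames[v] for v in x)))}
g = {c: 2 + 3 * i for i, c in enumerate(product(*(frames[v] for v in y)))}

fg_vars, fg = multiply(x, f, y, g)

# Transitivity: marginalizing in two steps equals marginalizing in one.
two_steps = marginal(*marginal(fg_vars, fg, {"Age", "Party"}), {"Age"})
one_step = marginal(fg_vars, fg, {"Age"})

# Combination: fg marginalized to x equals f times (g marginalized to x ∩ y).
lhs = marginal(fg_vars, fg, set(x))
rhs = multiply(x, f, *marginal(y, g, set(x) & set(y)))
```

With these definitions `two_steps == one_step` and `lhs == rhs`. Marginalizing to the empty set returns a table with the single key `()`, whose value is the sum of all entries.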

The definition of marginalization, equation (1.2), together with the proofs

of Properties 1, 2, and 3, can be adapted to the continuous case by replacing

summation with integration. We leave this to the reader. We also leave aside

complications that arise if infinities are allowed, that is, if the sum or integral is over an

infinite frame or an unbounded function. Our primary interest is in distributions

given by tables, and here the frames are both discrete and finite.

1.3. Conditionals.

Table 1.5 gives conditional probabilities for Party given Age and Sex. We call

these numbers conditional probabilities because they are nonnegative and each

group of three (the three probabilities for Party given each Age-Sex configura-

tion) sums to one. In other words, the marginal for {Age,Sex}, Table 1.6, consists

of ones.

We call Table 1.5 as a whole a conditional. We call {Party} its head, and

we call {Age,Sex} its tail. In general, a conditional is a nonnegative function Q on the union of two disjoint sets of variables, its head h and its tail t, with the property that $Q^{\downarrow t} = 1_t$, where $1_t$ is the function on t that is identically equal to one.

TABLE 1.5

A conditional for Party given Age and Sex.

                     female                 male
               Dem    ind    Rep      Dem    ind    Rep
young          1/4    1/2    1/4      1/4    1/2    1/4
middle-aged    1/3    1/3    1/3      1/5    1/5    3/5
old            1/3    1/3    1/3      1/3    1/3    1/3

TABLE 1.6

The marginal of Table 1.5 on its tail.

               female   male
young             1       1
middle-aged       1       1
old               1       1

Two special cases deserve mention. If t is empty, then Q is a probability

distribution for h. If h is empty, then $Q = 1_t$. We are interested in conditionals

not for their own sake but because we can multiply them together to construct

probability distributions. This is the topic of the next chapter.

Frequently, we are interested only in a subtable of a conditional. In Table 1.5,

for example, we might be interested only in the conditional probabilities for

females, the subtable shown in Table 1.7. We call such a subtable a slice. In general, if f is a table on x and c is a configuration of a subset w of x, then we write $f|_{w=c}$ for the table on $x \setminus w$ given by

$$f|_{w=c}(d) = f(c.d) \tag{1.4}$$

and we call $f|_{w=c}$ the slice of f on w = c. We leave it to the reader to verify the following proposition.

PROPOSITION 1.1. Suppose Q is a conditional with head h and tail t, and suppose $w \subseteq t$. Then $Q|_{w=c}$ is a conditional with head h and tail $t \setminus w$.

Table 1.7 illustrates Proposition 1.1; it is a conditional with {Party} as its head and {Age} as its tail.

We will sometimes find it convenient to generalize the notation for slicing by allowing the variables whose values we fix to include variables that are outside the domain of the table and hence have no effect on the result. In general, if f is a table on x, w is a set of variables, and c is a configuration of w, then we write $f|_{w=c}$ for the table on $x \setminus w$ given by

$$f|_{w=c}(d) = f(c^{\downarrow w \cap x}.d).$$

TABLE 1.7

The slice of Table 1.5 on Sex = female.

               Dem    ind    Rep
young          1/4    1/2    1/4
middle-aged    1/3    1/3    1/3
old            1/3    1/3    1/3

TABLE 1.8

The marginal of Table 1.1 for Age and Sex.

               female   male
young            .32     .08
middle-aged      .15     .00
old              .15     .30
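The defining property of a conditional and the slicing operation of this section can be checked directly on Table 1.5. The following is a minimal sketch using exact fractions; the names `rows`, `Q`, `tail_marginal`, and `slice_on_sex` are hypothetical helpers, not notation from the text.

```python
from fractions import Fraction as Fr

parties = ("Dem", "ind", "Rep")

# Table 1.5: rows keyed by (age, sex), entries ordered as in `parties`.
rows = {
    ("young", "female"): (Fr(1, 4), Fr(1, 2), Fr(1, 4)),
    ("young", "male"): (Fr(1, 4), Fr(1, 2), Fr(1, 4)),
    ("middle-aged", "female"): (Fr(1, 3), Fr(1, 3), Fr(1, 3)),
    ("middle-aged", "male"): (Fr(1, 5), Fr(1, 5), Fr(3, 5)),
    ("old", "female"): (Fr(1, 3), Fr(1, 3), Fr(1, 3)),
    ("old", "male"): (Fr(1, 3), Fr(1, 3), Fr(1, 3)),
}

# Q as a function on configurations (age, sex, party).
Q = {(a, s, p): rows[a, s][i] for (a, s) in rows for i, p in enumerate(parties)}


def tail_marginal(table):
    """Sum out the head (Party), leaving a table on the tail (Age, Sex)."""
    out = {}
    for (a, s, p), value in table.items():
        out[a, s] = out.get((a, s), 0) + value
    return out


def slice_on_sex(table, sex):
    """Slice on Sex = sex: keep matching entries and drop the Sex coordinate."""
    return {(a, p): value for (a, s, p), value in table.items() if s == sex}


table_1_6 = tail_marginal(Q)            # every entry should equal one
table_1_7 = slice_on_sex(Q, "female")   # a conditional for Party given Age
```

Every entry of `table_1_6` equals one, as in Table 1.6, and each row of `table_1_7` sums to one, as Proposition 1.1 requires.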

1.4. Continuation.

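If f is a function on x, $w \subseteq x$, and Q is a function such that

$$f = f^{\downarrow w} Q, \tag{1.6}$$

then we say that Q continues f from w to x, and we call Q a continuer of f from w to x.

Here is an example. Suppose x = {Age,Sex,Party} and w = {Age,Sex}, and consider the probability distribution P given by Table 1.1 and the conditional Q given by Table 1.5. The marginal $P^{\downarrow w}$ is given by Table 1.8, and the reader can easily check that $P = P^{\downarrow w}Q$.^4 Thus Q continues P from w to x.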

When do continuers exist, and when are they unique?

PROPOSITION 1.2. Suppose f is a function on x, and suppose $w \subseteq x$.

1. If all of f's values are positive, then there is a unique function Q on x that continues f from w to x. This continuer Q is a conditional.

2. If all of f's values are nonnegative, then there is at least one function Q on x that continues f from w to x. We can choose Q to be a conditional.

Proof. Try to divide both sides of equation (1.6) by $f^{\downarrow w}$ to obtain

$$Q = \frac{f}{f^{\downarrow w}}$$

or

$$Q(c.d) = \frac{f(c.d)}{f^{\downarrow w}(c)}. \tag{1.8}$$

If the values of f are all positive, then the values of $f^{\downarrow w}$ are as well, and the division succeeds; Q is uniquely determined by equation (1.8). Applying the combination axiom, we find that

$$Q^{\downarrow w} = \frac{f^{\downarrow w}}{f^{\downarrow w}} = 1_w,$$

so Q is a conditional. If the values of f are only nonnegative, then the division in equation (1.8) may fail for some c, but if $f^{\downarrow w}(c) = 0$, then f(c.d) = 0 for all d, and hence equation (1.6) will be satisfied with arbitrary values of Q(c.d) for that c. In particular, we may choose the Q(c.d) to be nonnegative and add to one for each such c, so that Q is a conditional.

^4 Bear in mind that $P = P^{\downarrow w}Q$ means $P(c) = P^{\downarrow w}(c^{\downarrow w})\,Q(c)$. Thus each entry in Table 1.8 multiplies a whole row (three entries) in Table 1.5.

Since a probability distribution has nonnegative but not necessarily all pos-

itive values, it has continuers but not necessarily unique continuers. In our

example, the nonuniqueness is in the conditional probabilities for middle-aged

males. Since middle-aged males have probability zero in Table 1.8, we can change

the numbers 1/5, 1/5, and 3/5 in Table 1.5 however we want without falsifying

equation (1.6).
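The nonuniqueness described above can be seen numerically. The sketch below computes the marginal of Table 1.1 on {Age, Sex}, builds a continuer by division wherever the marginal is positive, and fills the middle-aged male row in two different ways; both results satisfy equation (1.6). The function names `continuer` and `continues` and the choice of fill values are assumptions made for illustration.

```python
from fractions import Fraction as Fr

parties = ("Dem", "ind", "Rep")

# Table 1.1 in hundredths, keyed by (age, sex).
hundredths = {
    ("young", "female"): (8, 16, 8), ("young", "male"): (2, 4, 2),
    ("middle-aged", "female"): (5, 5, 5), ("middle-aged", "male"): (0, 0, 0),
    ("old", "female"): (5, 5, 5), ("old", "male"): (10, 10, 10),
}
P = {(a, s, p): Fr(n, 100)
     for (a, s), row in hundredths.items() for p, n in zip(parties, row)}

# Marginal of P on w = {Age, Sex}; this reproduces Table 1.8.
P_w = {}
for (a, s, p), value in P.items():
    P_w[a, s] = P_w.get((a, s), 0) + value


def continuer(fill):
    """Divide P by its marginal where possible; use `fill` where it is zero."""
    Q = {}
    for (a, s, p), value in P.items():
        if P_w[a, s] > 0:
            Q[a, s, p] = value / P_w[a, s]
        else:
            Q[a, s, p] = fill[p]
    return Q


def continues(Q):
    """Check equation (1.6): P(c.d) = P_w(c) * Q(c.d) for every configuration."""
    return all(P[a, s, p] == P_w[a, s] * Q[a, s, p] for (a, s, p) in P)


# Two continuers that differ only in the row for middle-aged males.
Q_table_1_5 = continuer({"Dem": Fr(1, 5), "ind": Fr(1, 5), "Rep": Fr(3, 5)})
Q_uniform = continuer({p: Fr(1, 3) for p in parties})
```

Both `continues(Q_table_1_5)` and `continues(Q_uniform)` hold even though the two tables differ, which is the nonuniqueness the text attributes to the zero entry in Table 1.8.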

In addition to continuation to the whole domain of a function, we are also

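interested in continuation to subsets. So we generalize the definition of continuation: If f is a function on y, $w \subseteq x \subseteq y$, and Q is a function such that

$$f^{\downarrow x} = f^{\downarrow w} Q, \tag{1.9}$$

then we say that Q continues f from w to x.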

In the following chapters, we will frequently be interested in marginals and

continuers for probability distributions that are proportional to a given function.

The next proposition lists some relatively obvious but important aspects of this

situation.

PROPOSITION 1.3.

1. Suppose f is proportional to g. Then any marginal of f is proportional to the corresponding marginal of g, with the same constant of proportionality. In other words, if f = kg and w is a subset of the domain of f, then $f^{\downarrow w} = kg^{\downarrow w}$.

2. Suppose f is proportional to g; f = kg for some nonzero constant k. Then any continuer for f is also a continuer for g.

3. Suppose the probability distribution P is proportional to the function f on x. Then the constant of proportionality is $1/f^{\downarrow \emptyset}$, and P is f's unique continuer from $\emptyset$ to x:

$$P = \frac{f}{f^{\downarrow \emptyset}} \tag{1.10}$$

and

$$f = f^{\downarrow \emptyset} P. \tag{1.11}$$

Moreover, [...]

4. A probability distribution is its own unique continuer from the empty set to its domain.

Proof. Statement 1 follows directly from the definition of marginalization,

equation (1.2).

To prove statement 2, we substitute kg for f in equation (1.9), obtaining $(kg)^{\downarrow x} = (kg)^{\downarrow w}Q$. By the combination axiom, this becomes $kg^{\downarrow x} = kg^{\downarrow w}Q$, or $g^{\downarrow x} = g^{\downarrow w}Q$.

Again by the combination axiom, P = kf implies $P^{\downarrow \emptyset} = kf^{\downarrow \emptyset}$. Since P is a probability distribution, $P^{\downarrow \emptyset} = 1$, whence $k = 1/f^{\downarrow \emptyset}$. So equation (1.10) holds. Since $f^{\downarrow \emptyset}$ is a positive number, equation (1.10) is the unique solution of equation (1.11); P is the unique continuer of f from $\emptyset$ to x.

To prove statement 4, substitute P for f in equation (1.11) and again apply the combination axiom.

Equations (1.6) and (1.9) do not require that Q be a function on x. They require only that Q's domain, say v, should satisfy $x = w \cup v$ or, equivalently, $x \setminus w \subseteq v \subseteq x$. In some cases (when the right-hand side of equation (1.8) does not depend on all the coordinates of c), there is a continuer with a domain v that is smaller than x. The situation is illustrated in Figure 1.2, where we have written $u_1$ for $w \setminus v$, $u_2$ for $w \cap v$, and $u_3$ for $v \setminus w$. We may say, in this situation, that $u_2$ is sufficient for the continuation from w to x; the other variables in w, those in $u_1$, can be neglected.

If the function f that we are continuing is a probability distribution, then the idea of sufficiency can be elaborated in terms of the meaning of the probabilities. If we give the probabilities an objective interpretation, then we can say that once the configuration of $u_2$ is determined, the configuration of $u_1$ will not affect the determination of the configuration of $u_3$. If we give the probabilities a subjective interpretation, then we can say that once we know the configuration of $u_2$, information about the configuration of $u_1$ will not affect our beliefs about the configuration of $u_3$.

The philosophy of probability that underlies this monograph is neither strictly

objective nor strictly subjective. Instead, it is constructive. We see a probability

distribution as something we deliberately construct in order to make predictions.

Though these predictions may be the best we can do, we need not be fully com-

mitted to them as beliefs. And though they should be evaluated empirically,

they need not individually represent stable frequencies. In terms of this constructive interpretation, sufficiency simply means adequacy for prediction. Once the configuration of $u_2$ is specified, we ignore information about $u_1$ when we predict $u_3$.

Instead of saying that $u_2$ is sufficient for the continuation from w to x, we may say that $u_3$ is independent of $u_1$ given $u_2$. The concept of conditional independence thus defined is mathematically interesting. Its properties include the symmetry suggested by Figure 1.2: if $u_3$ is independent of $u_1$ given $u_2$, then $u_1$ is independent of $u_3$ given $u_2$ (see Dawid [27], Pearl [8], or Appendix F of Shafer [9]). Conditional independence is an important concept for both the objective and subjective interpretations of probability. In the objective interpretation, a [...] or perhaps about causation. In the subjective interpretation, it is a hypothesis about a person's beliefs. It is also important for the constructive interpretation of probability, but it does not play a large role in the purely computational issues considered in this monograph.

1.5. Posterior distributions.

Suppose the probability distribution P on x expresses our beliefs about the values of the variables in x. And suppose we now observe the values of the variables in a subset w of x; we observe that w has the configuration c. How should this change our beliefs about the remaining variables, the variables in $x \setminus w$?

The standard answer is that we should change our beliefs by conditioning P on w = c. This means that we should change our belief that $x \setminus w = d$ from $P^{\downarrow x \setminus w}(d)$ to

$$\frac{P(c.d)}{P^{\downarrow w}(c)}. \tag{1.13}$$

We call this number P's posterior probability for d given c. It exists only if $P^{\downarrow w}(c) > 0$, but we may suppose that if $P^{\downarrow w}(c)$ is zero we will not observe w = c.

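Equation (1.13) defines a whole probability distribution, a distribution on $x \setminus w$ that we may designate by $P^{x \setminus w | w=c}$:

$$P^{x \setminus w | w=c}(d) = \frac{P(c.d)}{P^{\downarrow w}(c)}. \tag{1.14}$$

As the next proposition notes, this distribution is proportional to a subtable of P, and it is equal to a subtable of any continuer of P from w to x.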

PROPOSITION 1.4. Suppose P is a probability distribution on x, $w \subseteq x$, and c is a configuration of w such that $P^{\downarrow w}(c) > 0$.

1. $P^{x \setminus w | w=c} \propto P|_{w=c}$.

2. If Q continues P from w to x, then $P^{x \setminus w | w=c} = Q|_{w=c}$.

Proof. Statement 1 follows from equation (1.14) and the definition of slice, equation (1.4). Statement 2 follows from equations (1.8) and (1.14).
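As an illustration of equation (1.14) and Proposition 1.4, the sketch below conditions Table 1.1 on an observed configuration of {Age, Sex} by normalizing the corresponding slice. The function name `posterior_for_party` and the decision to raise an error when the observation has probability zero are illustrative assumptions, not part of the text.

```python
from fractions import Fraction as Fr

parties = ("Dem", "ind", "Rep")

# Table 1.1 in hundredths, keyed by (age, sex).
hundredths = {
    ("young", "female"): (8, 16, 8), ("young", "male"): (2, 4, 2),
    ("middle-aged", "female"): (5, 5, 5), ("middle-aged", "male"): (0, 0, 0),
    ("old", "female"): (5, 5, 5), ("old", "male"): (10, 10, 10),
}
P = {(a, s, p): Fr(n, 100)
     for (a, s), row in hundredths.items() for p, n in zip(parties, row)}


def posterior_for_party(age, sex):
    """Posterior for Party given w = {Age, Sex} at configuration c = (age, sex)."""
    # The slice of P on w = c is a table on the remaining variable, Party.
    sliced = {p: P[age, sex, p] for p in parties}
    # The sum of the slice is the marginal probability of the observation.
    marginal_at_c = sum(sliced.values())
    if marginal_at_c == 0:
        raise ValueError("posterior undefined: observation has probability zero")
    # Normalizing the slice gives equation (1.14).
    return {p: value / marginal_at_c for p, value in sliced.items()}


posterior_old_male = posterior_for_party("old", "male")
posterior_young_female = posterior_for_party("young", "female")
```

The two posteriors agree with the corresponding rows of Table 1.5, as statement 2 of Proposition 1.4 predicts. For middle-aged males the marginal is zero, so no posterior is defined.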

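We can also define a posterior distribution not just for $x \setminus w$ but for the entire set of variables x. This is the probability distribution $P^{|w=c}$ on x given by

$$P^{|w=c}(e) = \begin{cases} P(e)/P^{\downarrow w}(c) & \text{if } e^{\downarrow w} = c, \\ 0 & \text{otherwise.} \end{cases} \tag{1.15}$$

This distribution consists mostly of zeros. The posterior for the remaining variables, $P^{x \setminus w | w=c}$, is related to $P^{|w=c}$ in two ways. It is a slice:

$$P^{x \setminus w | w=c} = (P^{|w=c})|_{w=c},$$

and it is a marginal:

$$P^{x \setminus w | w=c} = (P^{|w=c})^{\downarrow x \setminus w}.$$

Equation (1.15) says that $P^{|w=c}$ is equal to the product of P and the function on w that assigns the value $1/P^{\downarrow w}(c)$ to the configuration c and the value 0 to all other configurations. It follows that $P^{|w=c}$ is proportional to the product of P and a function on w that assigns 1 to c and 0 to all other configurations. This point is sufficiently important to merit being stated in symbols. To this end, we write $I_{w=c}$ for the function on w that assigns 1 to c and 0 to all other configurations:

$$I_{w=c}(e) = \begin{cases} 1 & \text{if } e = c, \\ 0 & \text{otherwise.} \end{cases}$$

PROPOSITION 1.5. If P is a probability distribution on x and c is a configuration of a subset w of x such that $P^{\downarrow w}(c) > 0$, then

$$P^{|w=c} \propto P \cdot I_{w=c}.$$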

In the following chapters, we will be interested in a probability distribution

P given in factored form, say

$$P \propto f_1 f_2 \cdots f_k,$$

where the $f_i$ are tables of reasonable size, but the number of variables involved

altogether is too large to allow the actual computation and storage of the table P.

(It will not be difficult to compute the value of P for a particular configuration,

at least if we know the constant of proportionality. But there may be too many

configurations for us to compute the value of P for all of them.) In this situation,

as we will see, we can often work from the factorization to find marginals for P,

even though we cannot compute P itself. We may also be interested in computing

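marginals for posteriors of P, and therefore we will be interested in transforming a factorization of P into factorizations of its posteriors. The next proposition shows how to do this.

PROPOSITION 1.6. Suppose P is a probability distribution on x, $P \propto f_1 \cdots f_k$, $w = \{X_1, \ldots, X_n\}$, and $c = (c_1, \ldots, c_n)$. Then

$$P^{x \setminus w | w=c} \propto f_1|_{w=c} \cdots f_k|_{w=c}$$

and

$$P^{|w=c} \propto f_1 \cdots f_k \, I_{X_1=c_1} \cdots I_{X_n=c_n}. \tag{1.19}$$

Proof. The first relation follows from statement 1 of Proposition 1.4, together with the fact that a slice of a product is the product of the corresponding slices of the factors.

Equation (1.19) follows from Proposition 1.5, together with the fact that $I_{w=c} = I_{X_1=c_1} \cdots I_{X_n=c_n}$.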

1.6. Expectation.

Most readers will be familiar with the idea of the expectation of a function V on x with respect to a probability distribution P on x. This is a number, usually denoted by $E_P(V)$. In the discrete case, it is obtained by multiplying corresponding values of P and V and adding the products. Thus

$$E_P(V) = \sum_{c \in \Omega_x} P(c)\,V(c).$$

If $w \subseteq x$ and Q is a continuer of P from w to x, then we call the function $E_P(V|w)$ on w given by

$$E_P(V|w) = (QV)^{\downarrow w} \tag{1.21}$$

a conditional expectation of V given w with respect to P (though we often omit to say "with respect to P"). If P is strictly positive, so that it has only one continuer from w to x, then the conditional expectation for V given w is also unique; in fact, equation (1.21) can be written

$$E_P(V|w) = \frac{(PV)^{\downarrow w}}{P^{\downarrow w}},$$

and P can be replaced in this formula by any function f proportional to P.

If w is not empty, then the conditional expectation $E_P(V|w)$ is a function, not a single number. It assigns a value to every configuration c of w. Usually, however, we write $E_P(V|w = c)$ instead of $(E_P(V|w))(c)$. If $P^{\downarrow w}(c) > 0$, then $E_P(V|w = c)$ is uniquely defined; it is equal to $(PV)^{\downarrow w}(c)/P^{\downarrow w}(c)$.
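A small worked example may clarify the two notions. The sketch below computes the expectation of a function V under Table 1.1, and its conditional expectation given Age, using the ratio of marginals of PV and P. The function V chosen here (an indicator that Party is Democrat) and the helper names are hypothetical, picked only for illustration.

```python
from fractions import Fraction as Fr

parties = ("Dem", "ind", "Rep")

# Table 1.1 in hundredths, keyed by (age, sex).
hundredths = {
    ("young", "female"): (8, 16, 8), ("young", "male"): (2, 4, 2),
    ("middle-aged", "female"): (5, 5, 5), ("middle-aged", "male"): (0, 0, 0),
    ("old", "female"): (5, 5, 5), ("old", "male"): (10, 10, 10),
}
P = {(a, s, p): Fr(n, 100)
     for (a, s), row in hundredths.items() for p, n in zip(parties, row)}


def expectation(V):
    """Multiply corresponding values of P and V and add the products."""
    return sum(P[c] * V(c) for c in P)


def conditional_expectation_given_age(V):
    """Ratio of the marginals of PV and P on w = {Age}, where P's marginal is positive."""
    numerator, denominator = {}, {}
    for (a, s, p), value in P.items():
        numerator[a] = numerator.get(a, 0) + value * V((a, s, p))
        denominator[a] = denominator.get(a, 0) + value
    return {a: numerator[a] / denominator[a]
            for a in denominator if denominator[a] > 0}


def is_democrat(c):
    """Hypothetical V: 1 if the Party coordinate is Dem, otherwise 0."""
    return 1 if c[2] == "Dem" else 0


overall = expectation(is_democrat)
by_age = conditional_expectation_given_age(is_democrat)
```

Because V is an indicator, `overall` is the marginal probability of Democrat, 3/10, and `by_age` gives the probability of Democrat within each age group: 1/4 for young and 1/3 for the other two.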

1.7. Classifying probability distributions.

The probability distributions we have been studying are tabular. A tabular dis-

tribution is a table that gives a probability for each configuration. We will find

it useful to distinguish tabular distributions from algorithmic distributions. An

algorithmic distribution consists of an algorithm, together possibly with some

numerical information, that enables us to compute the probabilities of individ-

ual configurations. Algorithmic distributions can involve more or less complex

algorithms and more or less numerical information. At one extreme are dis-

tributions such as the Poisson, which are specified by a single number (the

mean in the case of the Poisson) and a simple formula. At another extreme

are the posterior distributions that arise in Bayesian statistics, which may in-

volve many numbers and complicated algorithms. In the next few chapters,

we will be concerned with an intermediate case; we define a distribution for

a large number of variables as the product of many tables of numbers, each

involving only a few variables. Here there are many numbers but a simple

algorithm: multiply.

The line between tabular and algorithmic distributions cuts across the line

between discrete and continuous distributions. A continuous distribution, like a

discrete distribution, can be either tabular or algorithmic. In the tabular case, we

store the values of the density at a sufficiently large number of configurations. In

the algorithmic case, we store instead a formula or algorithm that enables us to

compute the value of the density at any configuration. To some extent, the line

also cuts across the line between numerical and categorical variables. (Variables

like Age, Sex, and Party are called categorical because their possible values are

categories, e.g., young, old, and middle-aged, rather than numbers.)

Distributions for categorical variables are usually tabular, but distributions for

numerical variables can be tabular or algorithmic.

When an algorithmic distribution involves only a few numbers, we call the

numbers parameters, and we call the distribution parametric. The distributions

with names (Poisson, multinomial, Gaussian, and so on) are parametric.

The terms tabular, parametric, and algorithmic can be applied to conditionals

and other functions as well as to distributions. These terms can help us keep track

of complications involved in finding marginals and continuers of distributions and

in multiplying conditionals. Figure 1.3 shows the main points. When we compute

marginals, we generally stay in the same class of distributions; a marginal of a

table is a table, a marginal of a Gaussian is a Gaussian, and so on. A continuer

or posterior for a tabular distribution is tabular, but only in a few cases (such as

the multinomial and the Gaussian) do continuers or posteriors stay in the same

parametric family as their distributions. Multiplication usually takes us out of

the class of tabular distributions. Given a collection of tables for the same small

set of variables, we can perform the multiplication to obtain a new table, but

given tables for many different small sets of variables, the size of the frame for

all the variables may prevent us from computing and storing the product; we

may have to settle for thinking of the multiplication as an algorithm that allows

us to find the probability for a particular configuration when we want it.
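
The idea of treating a product of tables as an algorithm can be sketched in Python as follows. The three tables and all names are hypothetical; the point is only that the value for one configuration is obtained by looking up one entry in each factor and multiplying, without ever building the full product table.

```python
from math import prod

# Each factor is (variables, table); the table maps a tuple of values
# for those variables to a number. These tables are hypothetical.
factors = [
    (("A",), {(0,): 0.6, (1,): 0.4}),
    (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}),
    (("B", "C"), {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.3, (1, 1): 0.7}),
]

def probability(config, factors):
    """Multiply one entry from each factor for the given configuration.

    `config` maps every variable to a value. The cost grows with the
    number of factors, not with the size of the joint frame.
    """
    return prod(table[tuple(config[v] for v in variables)]
                for variables, table in factors)

p = probability({"A": 1, "B": 1, "C": 0}, factors)
```

Here the joint frame has only eight configurations, so the full table could be stored; the same lookup-and-multiply routine works unchanged when the frame is far too large for that.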

handling of probabilities or density values for individual configurations. It is only

probabilities for individual configurations that are explicitly stored by a tabu-

lar distribution; probabilities for sets of configurations must still be computed.

This emphasis on individual configurations is appropriate for expert systems,

but it is not appropriate for all applications of probability. It is inappropriate

for advanced mathematical probability, which is concerned with infinitely many

variables.

1.8. A limitation.

Though the multivariate framework for probability is widely used, it has its

limitations. A principal limitation is that it requires every variable to have a

value no matter how matters come out. This is often appropriate in statistical

work; in our example, every individual has an age and a sex, and we invent the

category "independent" so that every individual will have a party affiliation. It is

less appropriate in expert-system work, where the meaningfulness of a variable

often depends on the values of other variables. A particular medical test or

procedure only has a result if it is carried out, and we carry it out only for

some patients. A particular phoneme has a certain characteristic in the seventh

millisecond only if it lasts that long, and sometimes it may not. "Number of

pregnancies" is applicable only to women, not to men and children. We can

pretend that these variables always have values, but when there are many of

them, this is computationally awkward as well as artificial.

It is one thing to recognize this limitation and another to correct it. The

multivariate framework is flexible as well as expressive, and the obvious alter-

natives lack much of its flexibility. A tree, for example, allows us to represent

some variables as being meaningful only if others have certain values but al-

lows access to the variables only in a certain order. Consequently, most work in

probability, both theory and application, is carried out within the multivariate

framework, and extensions to the framework are developed and used on a fairly

ad hoc basis.

The graphical models that we will study in the following chapters are squarely

within the multivariate framework. For some ideas about going beyond it, see

Dempster [16] and Chapter 16 of Shafer [9].

Exercises.

EXERCISE 1.1. Derive the three properties of marginalization listed in Section 1.2

from equation (1.2).

EXERCISE 1.2. Here are some familiar problems, each with its own concept

of combination and its own concept of marginalization. Discuss, in each case,

how to formalize the problems so that the axioms of transitivity and combination

are satisfied.

on numerical variables) are combined by pooling and marginalized

(we usually say "reduced") by eliminating variables.

2. Linear programming problems can be combined by adding (or

perhaps multiplying) their objective functions and pooling their con-

straints. They can be reduced by maximizing their objective functions

over variables that are eliminated.

3. Discrete belief functions are combined by Dempster's rule and

marginalized by restricting the events for which beliefs are demanded.

(One formalization is provided by Shafer, Shenoy, and Mellouli [45]

and another by Shenoy and Shafer [48].)

EXERCISE 1.3. Fix a set of variables X, and consider all pairs of the form

$(f, V)$, where f is a strictly positive table on some subset x of X, and V is an

arbitrary table on the same set of variables x. Call x the domain of $(f, V)$.

Define multiplication for such pairs by setting

Show that these operations satisfy the axioms of transitivity and combination.

(Compare equation (1.22).) This example, suggested to the author by Robert

Cowell, is relevant to computation in decision theory, where f may represent

a probability distribution and V may represent a utility function.

EXERCISE 1.4. Consider a function f on a set of variables x, together with a

collection $\{h_X\}_{X \in x}$ of functions on the individual variables in x. For each subset

w of x, let f^w be the marginal on w of the function obtained by multiplying f

by the $h_X$ for X not in w. In symbols,

certain factors out (Cowell and Dawid [25]).

Show that out-marginalization and multiplication satisfy the axioms of tran-

sitivity and combination. What is the meaning of out-marginalization in the

context of equation (1.19)?

EXERCISE 1.5. The numerical functions on a given set of discrete variables

and its subsets form a commutative semigroup under multiplication. The sets of

variables themselves form a lattice. Each element of the semigroup is labeled by

an element of the lattice. Marginalization reduces an element of the semigroup

to an element with a smaller label.

Formulate axioms of transitivity and combination in the abstract setting of a

commutative semigroup and associated lattice. Give examples where continuers

do and do not exist.

EXERCISE 1.6. In unpublished work [28], A. P. Dempster has shown how the

Kalman filter can be understood in terms of the combination and multiplication

of belief functions. Dempster calls the belief functions involved normal belief

functions. A normal belief function on a given linear space of variables consists

of a linear functional and an inner product on a subspace of the linear space.

Intuitively, the linear functional tells the expected values of variables in the sub-

space, and the inner product tells their covariances. Marginalization amounts

to restricting the linear functional and inner product to a yet smaller subspace.

Combination is most easily described in the dual of the linear space of variables,

the linear space of configurations. Here the normal belief function looks like an

inner product (the dual of the covariance inner product) on a hyperplane, and

combination amounts to intersecting hyperplanes and adding the inner products.

Verify that the axioms of transitivity and combination are satisfied in this

geometric framework.

CHAPTER 2

Construction Sequences

Under certain conditions on the heads and tails of a sequence of conditionals, the

product of the conditionals will be a probability distribution. We call a sequence

of conditionals satisfying these conditions a construction sequence.

As we will see, the conditionals in a construction sequence are continuers for

the probability distribution obtained by multiplying them together. Initial seg-

ments of the sequence produce marginals of this probability distribution. Thus

the construction sequence represents a step-by-step construction of the proba-

bility distribution.

After constructing a probability distribution, we may want to find a marginal

for it or one of its posteriors. This may be difficult computationally, especially

if the joint frame of all the variables is too large to permit us to carry out the

multiplication of the conditionals. Were we able to carry out this multiplication,

we could store the resulting table and work directly with it to find marginals.

But if we are obliged to keep the probability distribution stored as a product of

tables, then we must look for less direct methods.

In some cases, as we will see in this chapter, a computationally inexpensive

adaptation of a construction sequence will produce a construction sequence for

the marginal we desire. To obtain the marginal for the variables in an initial

segment of a construction sequence, we need only omit the later factors from the

construction sequence. To obtain the posterior for later variables given values

of the variables in an initial segment, we need only slice the later factors. If the

construction sequence is a chain, then we can find a construction sequence for

the variables in a final segment by a simple forward propagation. The general

case, however, requires the more general methods that we will study in the next

chapter, methods that apply to any distribution stored as a product of tables,

whether or not the tables form a construction sequence.

If each new conditional in a construction sequence involves a single new vari-

able, then the most essential qualitative aspects of the construction sequence

can be represented by a directed acyclic graph (DAG). Such graphs have been

widely used for knowledge acquisition for probabilistic expert systems, and on

the theoretical side, they have been studied as a representation of conditional in-

dependence relations (Pearl [8]). Here we emphasize the value of DAGs for repre-

senting alternative construction sequences, that is, construction sequences that use the

TABLE 2.1

$Q_1$, a probability distribution for Age. (This is a conditional with an empty tail and with

Age as its head.)

young .40

middle-aged .15

old .45

TABLE 2.2

$Q_2$, a conditional with Age as its tail and Sex as its head.

              female   male
young         4/5      1/5
middle-aged   1        0
old           1/3      2/3

TABLE 2.3

$Q_1Q_2$, a probability distribution for Age and Sex.

              female   male
young         .32      .08
middle-aged   .15      .00
old           .15      .30

same conditionals but order them differently. By bringing these alternative or-

derings into the picture, a DAG enlarges the number of marginals and posteriors

that we can find by simple manipulations. In the general case, where each new

conditional is allowed to involve more than one new variable, we can similarly

indicate alternative orderings with a bubble graph, which is slightly more general

than a DAG.

Table 2.1 gives a probability distribution $Q_1$ for Age (its single column adds to

one), and Table 2.2 gives a conditional $Q_2$ for Sex given Age (each row adds to

one). When we multiply these two tables, we get Table 2.3, which qualifies as a

probability distribution for Age and Sex (its six entries add to one). Notice that

$Q_1$ is a marginal of this probability distribution and hence $Q_2$ is a continuer.
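
The arithmetic behind Table 2.3 can be checked with a short Python sketch. The numbers are those of Tables 2.1 and 2.2; the dictionary layout and variable names are illustrative assumptions, and exact fractions are used so that the sums come out to exactly one.

```python
from fractions import Fraction as F

# Table 2.1: probability distribution for Age.
Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}

# Table 2.2: conditional for Sex given Age, stored as Q2[age][sex].
Q2 = {
    "young": {"female": F(4, 5), "male": F(1, 5)},
    "middle-aged": {"female": F(1), "male": F(0)},
    "old": {"female": F(1, 3), "male": F(2, 3)},
}

# Pointwise product: one entry for each configuration of (Age, Sex).
product = {(age, sex): Q1[age] * Q2[age][sex] for age in Q1 for sex in Q2[age]}

# The six entries add to one, and summing out Sex recovers Table 2.1.
total = sum(product.values())
age_marginal = {age: sum(product[age, sex] for sex in Q2[age]) for age in Q1}
```

The entries of `product` reproduce Table 2.3, and `age_marginal` coincides with `Q1`, which is the numerical content of the remark that the first table is a marginal of the product.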

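We need not carry out the numerical multiplication in order to see that the

product $Q_1Q_2$ is a probability distribution. We can instead perform an abstract

computation:

$$\sum_{\text{Age},\,\text{Sex}} Q_1 Q_2 = \sum_{\text{Age}} \Bigl( \sum_{\text{Sex}} Q_1 Q_2 \Bigr) = \sum_{\text{Age}} Q_1 \Bigl( \sum_{\text{Sex}} Q_2 \Bigr) = \sum_{\text{Age}} Q_1 = 1. \qquad (2.1)$$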

Here we have first broken the summation into a summation over Sex followed

by a summation over Age. Since $Q_1$ does not involve Sex, it can be factored out

of the first summation, leaving $Q_2$, which sums to one over Sex because it is a

conditional. This leaves us with the sum of $Q_1$ over Age, which is one because

$Q_1$ is a probability distribution.

Consider more generally any two conditionals $Q_1$ and $Q_2$. Write $t_i$ for the

tail, $h_i$ for the head, and $d_i$ for the domain of $Q_i$. (Recall that $d_i = t_i \cup h_i$.) Our

example generalizes to the following proposition.

PROPOSITION 2.1. Suppose $t_1$ is empty, $t_2$ is contained in $d_1$, and $h_2$ is

disjoint from $d_1$.

1. The product $Q_1Q_2$ is a probability distribution on $d_1 \cup d_2$.

2. The conditional $Q_1$ is $Q_1Q_2$'s marginal on $d_1$.

3. The conditional $Q_2$ continues $Q_1Q_2$ from $d_1$ to $d_1 \cup d_2$.

Proof. Since we do not have symbols for individual variables, we will not use

summations like those in equation (2.1); instead, we will use our notation for

marginalization. We prove statement 1 by writing

Here we have used both the transitivity and the combination axioms.

Since $Q_1$ has an empty tail, it is a probability distribution. By the combina-

tion axiom,

$Q_2$ continues $Q_1Q_2$ from $d_1$ to $d_1 \cup d_2$.

Now consider a sequence of n conditionals, $Q_1, \ldots, Q_n$. Proposition 2.1 gen-

eralizes, by induction, as follows.

PROPOSITION 2.2. Suppose $t_1$ is empty. Suppose $t_i$ is contained in $d_1 \cup \cdots \cup d_{i-1}$

and $h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$.

1. $Q_1 \cdots Q_n$ is a probability distribution with domain $d_1 \cup \cdots \cup d_n$.

2. For $i = 1, \ldots, n-1$, $Q_1 \cdots Q_i$ is the marginal of $Q_1 \cdots Q_n$ on $d_1 \cup \cdots \cup d_i$.

3. For $i = 2, \ldots, n$, $Q_i$ continues $Q_1 \cdots Q_n$ from $d_1 \cup \cdots \cup d_{i-1}$ to $d_1 \cup \cdots \cup d_i$.

4. More generally, if $1 < i \le j \le n$, then $Q_i \cdots Q_j$ continues $Q_1 \cdots Q_n$ from

$d_1 \cup \cdots \cup d_{i-1}$ to $d_1 \cup \cdots \cup d_j$.

When the hypotheses of Proposition 2.2 are satisfied, we call the sequence

$Q_1, \ldots, Q_n$ a construction sequence for the probability distribution $Q_1 \cdots Q_n$,

FIG. 2.1. Left: the first tail is empty. The second tail is contained in the first domain,

and the second head is disjoint from the first domain. Right: two more head-tail pairs have

been added. Each time, the new tail is contained in the existing domain, and the new head is

disjoint from it.

and we say that the construction sequence represents this probability distribu-

tion. The restrictions on the head-tail structure of a construction sequence are

illustrated in Figure 2.1.
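
The head-tail restrictions of Proposition 2.2 are purely structural, so they can be checked without looking at any numbers. The following Python sketch does this; the function name, the representation of each conditional as a (tail, head) pair of sets, and the two example sequences are assumptions made for illustration.

```python
def is_construction_sequence(conditionals):
    """Check the head-tail conditions of Proposition 2.2.

    `conditionals` is a list of (tail, head) pairs of sets of variable names.
    Only the head-tail structure is checked, not the numerical tables.
    """
    domain = set()
    for index, (tail, head) in enumerate(conditionals):
        tail, head = set(tail), set(head)
        if index == 0 and tail:
            return False          # the first tail must be empty
        if not tail <= domain:
            return False          # each tail lies in the domain built so far
        if head & domain:
            return False          # each head is disjoint from that domain
        domain |= tail | head
    return True

# Hypothetical head-tail structures for the variables Age, Sex, and Party.
age_sex_party = [(set(), {"Age"}), ({"Age"}, {"Sex"}), ({"Age", "Sex"}, {"Party"})]
out_of_order = [({"Age"}, {"Sex"}), (set(), {"Age"})]

ok = is_construction_sequence(age_sex_party)
bad = is_construction_sequence(out_of_order)
```

The second sequence fails because its first tail is not empty, even though the same two conditionals form a construction sequence when listed in the other order.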

Statement 2 of Proposition 2.2 indicates one way that we can exploit a con-

struction sequence. If we are interested only in the variables in $d_1 \cup \cdots \cup d_i$ and

not in the remaining variables (those in $h_{i+1} \cup \cdots \cup h_n$), then we can simply

omit the last $n - i$ conditionals from the construction sequence: $Q_1, \ldots, Q_i$ is a

construction sequence for the marginal probability distribution on $d_1 \cup \cdots \cup d_i$.

Another way to exploit a construction sequence is to fix the values of variables

we have observed. If these variables appear at the beginning of the construction

sequence, then this produces a construction sequence for the posterior distribu-

tion.

PROPOSITION 2.3. Suppose $Q_1, \ldots, Q_n$ is a construction sequence. Suppose

$1 < i < n$. Write d for $\bigcup_{j=1}^{n} h_j$, the domain of $Q_1 \cdots Q_n$, and write t for $\bigcup_{j=1}^{i} h_j$,

the domain of $Q_1 \cdots Q_i$. Suppose c is a configuration of t. Then

from t to d. So the proposition follows from Proposition 1.4, together with the

fact that a slice of a product is equal to the product of the corresponding slices

of the factors.
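
The effect of slicing the later factors can be checked numerically on Tables 2.1 and 2.2. In this minimal Python sketch, the observed initial segment is assumed to be {Age} and the observed value is assumed to be "old"; slicing the later factor $Q_2$ on Age = old should give the same posterior for Sex as slicing and normalizing the full product. Variable names are illustrative.

```python
from fractions import Fraction as F

Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}  # Table 2.1
Q2 = {  # Table 2.2, stored as Q2[age][sex]
    "young": {"female": F(4, 5), "male": F(1, 5)},
    "middle-aged": {"female": F(1), "male": F(0)},
    "old": {"female": F(1, 3), "male": F(2, 3)},
}

observed_age = "old"  # assumed observation, for illustration only

# Route 1: slice the later factor Q2 on Age = old. No multiplication is needed.
posterior_by_slicing = dict(Q2[observed_age])

# Route 2: build the joint table, slice it on Age = old, and normalize.
joint = {(age, sex): Q1[age] * Q2[age][sex] for age in Q1 for sex in Q2[age]}
row = {sex: joint[observed_age, sex] for sex in Q2[observed_age]}
norm = sum(row.values())
posterior_by_joint = {sex: value / norm for sex, value in row.items()}
```

The two routes agree, but only the second requires the product table, which is exactly what we want to avoid when the joint frame is large.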

The expert-systems literature has devoted considerable attention to construction

sequences that add one new variable at a time, i.e., construction sequences in

which each head consists of a single variable. In this case, we can write

in the head of $Q_i$, and $t_i \subseteq \{X_1, \ldots, X_{i-1}\}$. We began the chapter with an

example of equation (2.2):

TABLE 2.4

A conditional for Party given Age.

young         1/4   1/2   1/4
middle-aged   1/3   1/3   1/3
old           1/3   1/3   1/3

given by Table 1.5, then we obtain the probability distribution $P_{\text{Age,Sex,Party}}$

given by Table 1.1:

Notice that if we use instead the conditional $Q'_3$ given by Table 2.4, then we

obtain the same probability distribution $P_{\text{Age,Sex,Party}}$:

time construction sequences.

When one new variable is added at a time, the head-tail structure of the

construction sequence can be represented by a directed acyclic graph (DAG for

short). This graph has the variables as nodes, and it has arrows to $X_i$ from

each element of $t_i$, for $i = 2, \ldots, n$. We call this graph directed because the

links between the nodes are arrows, and we call it acyclic because there are no

cycles following the arrows.⁵ (Since the arrows we draw to each $X_i$ are all from

$X_j$ with $j < i$, any path following the arrows always goes in the direction of

increasing indices; it cannot cycle back to a smaller index.) Figure 2.2 shows

DAGs for the construction sequences represented by equations (2.3), (2.4), and

(2.5), respectively. Figure 2.3 shows the DAG for the more complex construction

sequence represented by the equation

The middle graph in Figure 2.2 and the graph in Figure 2.3 both have cycles,

but not cycles following the arrows. The cycle $X_1, X_3, X_4, X_1$ in Figure 2.3, for

example, goes against an arrow on its last step.

A belief net is a finite DAG with variables as nodes, together with, for each

node X, a conditional that has X as its head and X's immediate predecessors

⁵ Some authors prefer the name acyclic directed graph in order to emphasize that only

directed cycles are forbidden; a path that does not always follow the arrows is allowed to be a

cycle. But the name directed acyclic graph and the acronym DAG are strongly established in

the literature.

in the DAG as its tail.⁶ We have just explained how a construction sequence

determines a belief net. It is also true that the conditionals in a belief net can

always be ordered so as to form a construction sequence. This follows from the

following lemma.

LEMMA 2.1. The nodes of a finite DAG can always be ordered so that each

variable's immediate predecessors in the DAG precede it in the ordering. In other

words, we can find an ordering $X_1, \ldots, X_n$ such that the immediate predecessors

of $X_i$ in the DAG are a subset of $\{X_1, \ldots, X_{i-1}\}$. (In particular, $X_1$ has no

predecessors in the DAG.)

Proof. The simplest proof is by induction on n, the number of variables in

the DAG. There is at least one node in the DAG that has no successors; if

every node had a successor, then we could form a cycle by going from each node

to a successor until (because there are only finitely many nodes) we repeated

ourselves. If we choose a node with no successors as $X_n$, and if we then remove

this node and the arrows to it, then we obtain a DAG with only $n - 1$ nodes

which, by the inductive hypothesis, has an ordering $X_1, \ldots, X_{n-1}$ satisfying the

condition. The ordering $X_1, \ldots, X_n$ then also satisfies the condition.
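
The inductive proof translates directly into a procedure: repeatedly pick a node with no successors, place it last among the nodes still unplaced, and remove it. A minimal Python sketch, assuming the DAG is given as a dictionary from each node to the set of its immediate predecessors (the function name and the four-node example are hypothetical):

```python
def construction_ordering(parents):
    """Order the nodes of a finite DAG so that parents precede children.

    `parents` maps each node to the set of its immediate predecessors.
    Mirrors the inductive proof: repeatedly pick a node with no successors
    among the remaining nodes, place it last, and remove it.
    """
    remaining = set(parents)
    reversed_order = []
    while remaining:
        has_successor = {p for node in remaining
                         for p in parents[node] if p in remaining}
        sinks = remaining - has_successor
        if not sinks:
            raise ValueError("graph has a directed cycle")
        last = min(sinks)            # deterministic choice among the sinks
        reversed_order.append(last)
        remaining.remove(last)
    return reversed_order[::-1]

# Hypothetical DAG: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4.
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"}, "X4": {"X2", "X3"}}
order = construction_ordering(dag)
```

Choosing a different node among the available sinks at each step yields a different valid ordering, which is why a DAG that is not a chain has more than one.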

We may call an ordering of the nodes of a DAG that satisfies the conditions

of Lemma 2.1 a DAG construction ordering. Unless a DAG is merely a chain,

it has more than one DAG construction ordering. The DAG in Figure 2.3, for

example, has five:

⁶ A variety of other names are also in use, including Bayesian network and graphical model.

Every DAG construction ordering for the DAG of a belief net gives, of course,

an ordering of its conditionals that is a construction sequence for the probabil-

ity distribution represented by the belief net. Thus the five DAG construction

orderings we just listed produce five construction sequences for the probabil-

ity distribution in equation (2.6): five ways to permute the $Q_i$ and still have a

construction sequence.

We can talk about a belief net representing a probability distribution, without

reference to any particular construction sequence: a belief net represents a prob-

ability distribution P if P is equal to the product of the conditionals attached

to its DAG. We can also talk about a DAG by itself representing a probability

distribution: a DAG represents P if by attaching appropriate conditionals we

can make it into a belief net representing P, i.e., if P factors into conditionals

in the way indicated by the DAG.

Considered abstractly, a belief net represents a probability distribution more

concisely than a construction sequence does. It provides the same conditionals,

but it refrains from ordering them completely. For this reason, belief nets are

considered more fundamental than construction sequences in much of the litera-

ture on probabilistic expert systems. As a practical matter, however, belief nets

arise from a step-by-step construction that provides a complete ordering, and

we usually preserve this ordering when we store a belief net. Moreover, as we

will see in the next section, there is no practical advantage in considering only

construction sequences that introduce one new variable at a time. So in this

monograph, we take construction sequences as fundamental, and we treat belief

nets as secondary tools: tools that help us see alternative orderings for particu-

lar one-new-variable-at-a-time construction sequences. In small problems, where

we can actually draw the DAG, it enables us to see alternative orderings at a

glance. In larger problems, the idea of the DAG reminds us of the existence of

alternative orderings.

tive construction sequences that we can discern by studying a DAG are important

because they broaden the application of Propositions 2.2 and 2.3. Since we can

apply these propositions to any construction sequence consistent with the DAG,

we can obtain construction sequences for a much larger class of marginals and

posteriors than we can obtain by working with a single construction sequence.

Propositions 2.2 and 2.3 are concerned with initial segments of a construction

sequence. We may also talk about initial segments of a DAG. We say that a set

w of nodes of a DAG is an initial segment of the DAG if all the immediate

predecessors of each element of w are also in w.

LEMMA 2.2. A set w of nodes in a finite DAG is an initial segment of the

DAG if and only if the DAG has a DAG construction ordering $X_1, \ldots, X_n$ such

that

$$w = \{X_1, \ldots, X_k\} \qquad (2.7)$$

for some k.

conditions exists, then w is an initial segment in the DAG. To derive the existence

of such an ordering from the assumption that w is an initial segment in the DAG,

we adapt the proof of Lemma 2.1. We argue by induction on m, the number of

nodes not in w. If $m = 0$, then the ordering exists by Lemma 2.1. If $m \neq 0$, i.e.,

w does not include all the nodes in the DAG, then there is at least one node

outside w that has no successors, for if every node outside w had a successor, this

successor would also be outside w, and we could form a cycle of nodes outside w

by going from each node to a successor until we repeated ourselves. If we choose

a node that lies outside w and has no successors as $X_n$, and if we then remove

this node and the arrows to it, then we obtain a DAG with only $m - 1$ nodes

outside w which, by the inductive hypothesis, has a DAG construction ordering

$X_1, \ldots, X_{n-1}$ satisfying (2.7). By adding $X_n$ to the end of this ordering, we

obtain a DAG construction ordering $X_1, \ldots, X_n$ for the original DAG that also

satisfies (2.7).

The definition of initial segment in a DAG, together with Lemma 2.2 and

Propositions 2.2 and 2.3, yields the following proposition.

PROPOSITION 2.4. Suppose w is an initial segment of a belief net that rep-

resents a probability distribution P.

1. Suppose we delete the nodes not in w, together with the arrows to them, and

the conditionals associated with them. Then the resulting belief net represents P's

marginal on w.

2. Suppose c is a configuration of w. Suppose we delete the nodes in w,

together with the arrows from them and the conditionals associated with them,

and suppose we change the conditional on each of the remaining nodes by slicing

it on $w = c$. Then the resulting belief net represents P's posterior given $w = c$.

The simplicity and visual clarity of this proposition account for much of the

appeal of belief nets.
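
Statement 1 of Proposition 2.4 is easy to mechanize. The Python sketch below checks whether a set of nodes is an initial segment and, if so, keeps only those nodes and their conditionals. The data layout (a dictionary of immediate predecessors and a dictionary of conditionals, here represented by placeholder labels) and the function names are assumptions for illustration.

```python
def is_initial_segment(w, parents):
    """True if every immediate predecessor of each node in w is also in w."""
    w = set(w)
    return all(parents[node] <= w for node in w)

def restrict_to_initial_segment(w, parents, conditionals):
    """Keep only the nodes in w, with their arrows and conditionals.

    By statement 1 of Proposition 2.4, the result represents the marginal on w,
    provided w is an initial segment of the DAG.
    """
    w = set(w)
    if not is_initial_segment(w, parents):
        raise ValueError("w is not an initial segment of the DAG")
    kept_parents = {node: parents[node] for node in parents if node in w}
    kept_conditionals = {node: conditionals[node] for node in conditionals if node in w}
    return kept_parents, kept_conditionals

# Hypothetical belief net: X1 -> X2, X1 -> X3, X2 -> X4, X3 -> X4.
dag = {"X1": set(), "X2": {"X1"}, "X3": {"X1"}, "X4": {"X2", "X3"}}
tables = {"X1": "Q1", "X2": "Q2", "X3": "Q3", "X4": "Q4"}  # placeholder labels

sub_dag, sub_tables = restrict_to_initial_segment({"X1", "X3"}, dag, tables)
```

No arithmetic is performed: the marginal on {X1, X3} is represented by the conditionals that were already stored for those two nodes.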

Proposition 2.4 can be thought of as a statement about alternative construc-

tion sequences. It says that if we begin with one construction sequence (the one

we used to construct the belief net), then we can shift to an alternative one to

get marginals and conditionals. We can say this without reference to the belief

net as follows.

PROPOSITION 2.5. Suppose $Q_1, \ldots, Q_n$ is a one-new-variable-at-a-time con-

struction sequence for a probability distribution P. Suppose $i_1, \ldots, i_k$ is a se-

quence of distinct integers between 1 and n such that $t_{i_1}$ is empty and $t_{i_j}$ is

contained in $\{X_{i_1}, \ldots, X_{i_{j-1}}\}$ for $j = 2, \ldots, k$. Write w for $\{X_{i_1}, \ldots, X_{i_k}\}$.

1. $Q_{i_1}, \ldots, Q_{i_k}$ is a construction sequence for $P^{\downarrow w}$.

2. Suppose c is a configuration of w. Suppose we modify the sequence

$Q_1, \ldots, Q_n$ by deleting each $Q_{i_j}$ and by slicing each of the other conditionals

on $w = c$. Then the result is a construction sequence for P's posterior given

$w = c$.

Forward propagation in chains. As we have seen, it is trivial to reduce a

belief net to a belief net for an initial segment. If the belief net is a chain, then

with a bit of work we can also reduce it to a belief net for a final segment.

We call a DAG a chain if its nodes can be ordered, as in Figure 2.4, so that

the first has no immediate predecessors in the DAG and each of the others has

its predecessor in the ordering as its only immediate predecessor in the DAG.

Notice that a chain has only one DAG construction ordering: $X_1, \ldots, X_n$ is the

unique DAG construction ordering for the chain $X_1 \to \cdots \to X_n$.

We call a belief net a belief chain if its DAG is a chain. Thus a belief chain

consists of a chain $X_1 \to \cdots \to X_n$ and corresponding conditionals $Q_1, \ldots, Q_n$.

The first conditional has $X_1$ as its head and an empty tail; the ith conditional

has $X_i$ as its head and $X_{i-1}$ as its tail. The idea of forward propagation in such

a chain is based on the following lemma.

LEMMA 2.3. In a belief chain,

$\{X_2, \ldots, X_n\}$.

Proof. Since $\{X_2\}$ is the intersection of $\{X_2, \ldots, X_n\}$ with the domain of

$Q_1Q_2$, equation (2.8) is an instance of the combination axiom.

By applying Lemma 2.3 repeatedly, we can reduce our initial construction

sequence $Q_1, \ldots, Q_n$ to a construction sequence for any final segment of the

belief chain. Indeed, once we have a construction sequence $R_i, Q_{i+1}, \ldots, Q_n$ for

$X_i \to \cdots \to X_n$, we can obtain a construction sequence $R_{i+1}, Q_{i+2}, \ldots, Q_n$ for

$X_{i+1} \to \cdots \to X_n$ by setting $R_{i+1} = (R_i Q_{i+1})^{\downarrow \{X_{i+1}\}}$.
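
The update $R_{i+1} = (R_i Q_{i+1})^{\downarrow \{X_{i+1}\}}$ can be sketched in Python as follows. Each step multiplies a table on one variable by a table on two adjacent variables and sums out the earlier one, so no table ever involves more than two variables. The function names and the dictionary layout are assumptions; the example reuses Tables 2.1 and 2.2, viewed as the two-node chain from Age to Sex.

```python
from fractions import Fraction as F

def forward_step(R, Q_next):
    """Compute R_{i+1}: multiply R_i by Q_{i+1} and sum out X_i.

    R maps each value of X_i to its probability.
    Q_next[a][b] is the conditional probability of X_{i+1} = b given X_i = a.
    """
    out = {}
    for a, p_a in R.items():
        for b, p_b_given_a in Q_next[a].items():
            out[b] = out.get(b, 0) + p_a * p_b_given_a
    return out

def propagate(Q1, later_conditionals):
    """Return the tables R_1, ..., R_n, one for each variable of the chain."""
    marginals = [dict(Q1)]
    for Q_next in later_conditionals:
        marginals.append(forward_step(marginals[-1], Q_next))
    return marginals

Q1 = {"young": F(40, 100), "middle-aged": F(15, 100), "old": F(45, 100)}  # Table 2.1
Q2 = {  # Table 2.2, stored as Q2[age][sex]
    "young": {"female": F(4, 5), "male": F(1, 5)},
    "middle-aged": {"female": F(1), "male": F(0)},
    "old": {"female": F(1, 3), "male": F(2, 3)},
}

R1, R2 = propagate(Q1, [Q2])
```

Here `R2` is the marginal for Sex; its entries are the column sums of Table 2.3.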

The point of this step-by-step computation is that the tables will generally be

small enough for it to be implemented. In theory, we can move directly from the

construction sequence $Q_1, \ldots, Q_n$ to a construction sequence for the marginal

on $\{X_i, \ldots, X_n\}$, for the combination axiom implies that

Markov chains and hidden Markov models. Readers familiar with the

theory of Markov chains may find it illuminating to note that a finite Markov

chain is a special kind of belief net. It is a belief chain such that each variable has

the same frame and all the conditionals after the first are identical. Figure 2.5

shows a simple Markov chain.

Most of the theory of Markov chains is concerned with their repetitive nature

and hence does not extend to belief nets in general or even to belief chains in

general. For example, a Markov chain is sometimes described in terms of its state

graph. This is a directed graph (not usually acyclic) with the states (elements of

the common frame) as nodes and with an arrow from state i to state j whenever

the (i,j)th entry of the common conditional is positive. (Figure 2.6 shows the

FIG. 2.6. The state graph for the Markov chain in Figure 2.5.

state graph for the Markov chain of Figure 2.5.) In general, we cannot draw a

state graph for a belief chain because the successive variables may have different

frames. Even if the frames are the same, the possible transitions or at least their

probabilities will vary.
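
The rule for drawing a state graph (an arrow from state i to state j whenever the (i, j)th entry of the common conditional is positive) is simple enough to state as code. In this Python sketch the transition table is hypothetical and is not the one in Figure 2.5.

```python
def state_graph(transition):
    """Return the arrows of the state graph.

    transition[i][j] is the (i, j)th entry of the common conditional;
    there is an arrow (i, j) exactly when that entry is positive.
    """
    return {(i, j) for i, row in transition.items() for j, p in row.items() if p > 0}

# Hypothetical common conditional on the frame {a, b, c}.
transition = {
    "a": {"a": 0.5, "b": 0.5, "c": 0.0},
    "b": {"a": 0.0, "b": 0.0, "c": 1.0},
    "c": {"a": 1.0, "b": 0.0, "c": 0.0},
}
arrows = state_graph(transition)
```

In this example the arrows from a to b, b to c, and c to a form a directed cycle, which illustrates why a state graph, unlike the DAG of the belief chain, is not usually acyclic.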

In recent years, considerable use has been made of belief nets of a type

slightly more general than Markov chains: hidden Markov models. To form a

hidden Markov model, we begin with a Markov chain, say $X_1 \to \cdots \to X_n$, and

from each node $X_i$ we add an arrow to a new node, say $Y_i$, so as to obtain a

DAG as in Figure 2.7. All the $Y_i$ have the same frame (possibly different from the

frame for the $X_i$) and the same conditional. In applications, the $Y_i$ are observed,

while the $X_i$ are not; the Markov chain $X_1 \to \cdots \to X_n$ is hidden. We are

interested in finding posterior probabilities for the $X_i$. We may, for example,

want to find the most likely configuration of $X_1, \ldots, X_n$. Since the $Y_i$ do not

form an initial segment of the belief net, we cannot use Proposition 2.4 to find

posterior probabilities for the $X_i$. But efficient methods for finding posterior

probabilities (and for finding most likely configurations) have been developed in

the literature on hidden Markov models, and these methods, as it turns out, are

special cases of more general methods that we will study in Chapter 3.
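
As a hedged illustration of the kind of method meant here, the Python sketch below implements the standard forward recursion from the hidden Markov model literature; it is not taken from this monograph, and Chapter 3 develops the general theory. It computes the posterior for the last hidden variable $X_n$ given the observed $Y_1, \ldots, Y_n$, and a brute-force sum over all hidden configurations is included only as a check. The two-state model, its numbers, and all names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product as cartesian

def filtered_posterior(initial, transition, emission, observations):
    """Posterior for the last hidden variable X_n given Y_1, ..., Y_n.

    initial[s]        : probability that X_1 = s
    transition[s][t]  : probability that X_{i+1} = t given X_i = s
    emission[s][y]    : probability that Y_i = y given X_i = s
    """
    states = list(initial)
    # alpha[s] is the joint probability of the observations so far and X_i = s.
    alpha = {s: initial[s] * emission[s][observations[0]] for s in states}
    for y in observations[1:]:
        alpha = {t: emission[t][y] * sum(alpha[s] * transition[s][t] for s in states)
                 for t in states}
    total = sum(alpha.values())
    return {s: alpha[s] / total for s in states}

def brute_force_posterior(initial, transition, emission, observations):
    """Same posterior, by summing the full product over every hidden configuration."""
    states = list(initial)
    weight = {s: 0 for s in states}
    for path in cartesian(states, repeat=len(observations)):
        p = initial[path[0]]
        for s, t in zip(path, path[1:]):
            p *= transition[s][t]
        for s, y in zip(path, observations):
            p *= emission[s][y]
        weight[path[-1]] += p
    total = sum(weight.values())
    return {s: weight[s] / total for s in states}

# Hypothetical two-state model.
initial = {"rain": F(1, 2), "dry": F(1, 2)}
transition = {"rain": {"rain": F(7, 10), "dry": F(3, 10)},
              "dry": {"rain": F(3, 10), "dry": F(7, 10)}}
emission = {"rain": {"umbrella": F(9, 10), "none": F(1, 10)},
            "dry": {"umbrella": F(1, 5), "none": F(4, 5)}}
observations = ["umbrella", "umbrella", "none"]

fast = filtered_posterior(initial, transition, emission, observations)
slow = brute_force_posterior(initial, transition, emission, observations)
```

The recursion handles one observation at a time with work proportional to the square of the number of states, whereas the brute-force check enumerates every hidden configuration and becomes infeasible as n grows.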

Figure 2.7 represents only the simplest type of hidden Markov model; in

practice, the model is elaborated in various ways. One common elaboration

involves attaching more than one observable variable to each $X_i$. There may be

a fixed number of observable variables for each $X_i$, or this number itself may be

an observable variable. In speech recognition, for example, each $X_i$ represents

that the phoneme lasts. Since the length of a phoneme varies, the number of

observations will vary; it itself will be an observed variable. Strictly speaking,

this takes us outside the framework of the belief net; it even takes us outside

the multivariate framework. Fortunately, the computational methods needed are

natural extensions of the multivariate methods we will study in Chapter 3.

Though the visual clarity of belief nets is very attractive, there is no practi-

cal reason to limit ourselves to construction sequences involving only one new

variable at a time. All the computational ideas we considered in the preceding

section generalize to the general case, and we can also generalize the graphical

representation itself.

The simplest graphical representation of a general construction sequence is

the bubble graph. This graph has a node for each conditional. This node, called

a bubble, contains all the variables in the head and has an arrow to it from each

variable in the tail. Figure 2.8 shows a bubble graph for a construction sequence

for ten variables:

A bubble graph is acyclic in the same sense that a DAG is acyclicwe cannot

go in a cycle following the arrows. Moreover, a bubble graph, like a DAG,

permits us to pick out alternative construction orderings for the nodes, i.e.,

alternative construction sequences for the probability distribution. In Figure 2.8,

for example, the bubbles can be ordered in seven different ways:

And hence there are seven ways of ordering the conditionals to form a construc-

tion sequence:


at-a-time case, we can exploit alternative construction sequences to find prior

marginals for initial segments or posterior marginals given initial segments, and

we can propagate forward in chains to find prior marginals for final segments.

The idea of initial segments is defined for bubble graphs just as for DAGs,

and Proposition 2.4 continues to hold. Translating this proposition into a di-

rect statement about alternative construction sequences, we get the following

generalization of Proposition 2.5.

PROPOSITION 2.6. Suppose $Q_1, \ldots, Q_n$ is a construction sequence for $P$.

Suppose $i_1, \ldots, i_k$ is a sequence of distinct integers between 1 and $n$ such that $t_{i_1}$

is empty and $t_{i_j}$ is contained in $h_{i_1} \cup \cdots \cup h_{i_{j-1}}$ for $j = 2, \ldots, k$. Write $w$ for

$h_{i_1} \cup \cdots \cup h_{i_k}$.

1. $Q_{i_1}, \ldots, Q_{i_k}$ is a construction sequence for $P^{\downarrow w}$.

2. Suppose $c$ is a configuration of $w$. Suppose we modify the sequence

$Q_1, \ldots, Q_n$ by deleting each $Q_{i_j}$ and by slicing each of the other conditionals

on $w = c$. Then the result is a construction sequence for $P$'s posterior given

$w = c$.

A construction sequence $Q_1, \ldots, Q_n$ is a construction chain if each $t_i$ is contained

in $h_{i-1}$ for $i = 2, \ldots, n$. Figure 2.9 shows a bubble graph for a construction

chain: the bubbles are ordered, and each bubble has arrows only from variables

in the preceding bubble.

Lemma 2.3 generalizes as follows.

LEMMA 2.4. Suppose $Q_1, \ldots, Q_n$ is a construction chain. Then

$\cdots \cup h_n$.

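Forward propagation proceeds, based on this lemma, just as in the one-new-variable-at-a-time

case; from the sequence $R_i, Q_{i+1}, \ldots, Q_n$ for the marginal on

$h_i \cup \cdots \cup h_n$, we obtain the sequence $R_{i+1}, Q_{i+2}, \ldots, Q_n$ for the marginal on

$h_{i+1} \cup \cdots \cup h_n$ by setting $R_{i+1} = (R_i Q_{i+1})^{\downarrow h_{i+1}}$.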
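As a rough illustration of forward propagation, the following Python sketch pushes a marginal down a three-variable construction chain, assuming the update rule reads $R_{i+1} = (R_i Q_{i+1})^{\downarrow h_{i+1}}$ with $R_1 = Q_1$. The table representation (a tuple of variable names paired with a dictionary of values), the helper names `multiply`, `marginalize`, and `forward`, the binary frames, and the numerical entries are all assumptions made for the example; none of them comes from the text.

```python
from fractions import Fraction as F
from itertools import product

def multiply(f, g):
    """Pointwise product of two tables of the form (variables, {configuration: value})."""
    (fv, ft), (gv, gt) = f, g
    out = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for cfg in product((0, 1), repeat=len(out)):  # binary frames assumed
        a = dict(zip(out, cfg))
        table[cfg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out, table

def marginalize(f, keep):
    """Sum out of f every variable that is not in keep."""
    fv, ft = f
    out = tuple(v for v in fv if v in keep)
    table = {}
    for cfg, val in ft.items():
        key = tuple(c for v, c in zip(fv, cfg) if v in keep)
        table[key] = table.get(key, 0) + val
    return out, table

def forward(chain, heads):
    """Return R_1, ..., R_n with R_1 = Q_1 and R_{i+1} = (R_i Q_{i+1}) marginalized to h_{i+1}."""
    r = chain[0]
    marginals = [r]
    for q, head in zip(chain[1:], heads[1:]):
        r = marginalize(multiply(r, q), head)
        marginals.append(r)
    return marginals

# Hypothetical chain X1 -> X2 -> X3: Q1 = P(X1), Q2 = P(X2 | X1), Q3 = P(X3 | X2).
Q1 = (("X1",), {(0,): F(1, 5), (1,): F(4, 5)})
Q2 = (("X1", "X2"), {(0, 0): F(3, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(3, 4)})
Q3 = (("X2", "X3"), {(0, 0): F(3, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(3, 4)})

R = forward([Q1, Q2, Q3], [{"X1"}, {"X2"}, {"X3"}])
print(R[-1])  # prior marginal for the final head, X3
```

Each step touches only a table on two neighboring heads, so the joint table on all the variables is never formed.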

Figure 2.10 shows an alternative to the bubble graph in Figure 2.9. Here

instead of showing arrows from the individual variables, we put these variables in

the following bubble. They can still be identified; they constitute the intersection

of the two bubbles. A graph of the type shown in Figure 2.10 is called a join

graph. It has the property that the variables that a given node has in common

with any of the preceding nodes are all in the immediately preceding node. In the

next chapter, we will generalize the idea of a join chain to the idea of a join tree.

quence for which we cannot so easily find the marginals we want, consider the

external audit of an organization's financial statement. Figure 2.11 sketches, in

a simplified form, the structure of the evidence in one such audit. The auditor

is concerned with the accounts receivable, and she has distinguished between

the accounts receivable not allowing for bad debts and the net accounts receivable,

which do allow for bad debts. The accounts receivable are fairly stated

only if they are complete, properly classified, and properly valued. The auditor

has obtained evidence for completeness by tracing a sample from a subsidiary

ledger. Customer confirmations have provided evidence that the accounts are

properly classified and properly valued. In addition, the auditor's assessment of

the internal accounting system ("review of the environment") provides evidence

for the accounts receivable being correct, and her assessment of the state of the

economy ("analytic review") provides evidence for the adequacy of the allowance

for bad debts.

The bubble graph in Figure 2.12 depicts a probability model for the situation

described by Figure 2.11. Using the abbreviations indicated in Figure 2.13, we

write


evidence shown in Figure 2.11. The variable N, for example, might be a binary

variable indicating whether the net accounts receivable are fairly stated (N = 1)

or not (N = 0).

The auditor's evidence consists of observed values of the variables E, R, T,

and CC, which we may designate by corresponding lower case letters. We are

interested in the posterior distribution of the remaining variables given these

observations, and according to Equation 1.18 in Proposition 1.6, this is proportional

to the function obtained by substituting the observations in the right-hand

side of equation (2.11):

We are particularly interested in the marginal of this posterior for the variable

N, which corresponds to an overall judgment that the financial statement is fairly

stated. Since the observed variables do not form an initial segment of the bubble

graph, we cannot find this marginal using the methods we have studied in this

chapter. Instead, we must use the methods of the next chapter, which apply to

arbitrary factorizations.

There are a number of alternatives to the bubble graph for representing the head-

tail structure of construction sequences, including chain graphs (Wermuth and

Lauritzen [50]) and valuation networks (Shenoy [47]). Figure 2.14 shows a chain

graph and Figure 2.15 shows a valuation network corresponding to the bubble

graph of Figure 2.12. Both types of graph have uses beyond that of representing

construction sequences. In the chain graph for a construction sequence, all the


variables in each head are linked with each other, but by omitting some of these

links, we can represent additional conditional independence relations. By varying

the shape of the relational nodes and its arrows in a valuation network, we can

represent a wide variety of relations.

Another more complex graphical representation has been developed by Heckerman

[30] under the name similarity network. A similarity network is a tool for

knowledge acquisition; it allows someone constructing a probability distribution

to allow certain variables in a construction sequence to be sufficient for other

variables given some values for earlier variables but not given other values for

these earlier variables.

Exercises.

EXERCISE 2.1. The idea of a construction sequence for a probability distri-

bution generalizes to the idea of a construction sequence for a conditional. In


this generalization, we no longer require that the first tail be empty and that each

new tail be contained in the existing domain. We require only that each new head

be disjoint from the existing domain.

Consider first two conditionals $Q_1$ and $Q_2$. Under the hypothesis that $h_2$ is

disjoint from $d_1$ (Figure 2.16), prove the following statements:

1. The product $Q_1 Q_2$ is a conditional with head $h_1 \cup h_2$ and domain $d_1 \cup d_2$.

2. The product $Q_1 1_{t_2}$ is $Q_1 Q_2$'s marginal on $d_1 \cup t_2$.

3. The conditional $Q_2$ continues $Q_1 Q_2$ from $d_1 \cup t_2$ to $d_1 \cup d_2$.

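Then consider a sequence of conditionals $Q_1, \ldots, Q_n$. Under the hypothesis that

$h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$, prove the following statements:

1. The product $Q_1 \cdots Q_n$ is a conditional with head $h_1 \cup \cdots \cup h_n$

and domain $d_1 \cup \cdots \cup d_n$.

2. For $i = 2, \ldots, n$, $Q_1 \cdots Q_{i-1} 1_{(d_i \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)}$ is the

marginal of $Q_1 \cdots Q_n$ on $(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)$.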


FIG. 2.16. Here we ask only that the second head be disjoint from the first domain.

3. For $i = 2, \ldots, n$, the conditional $Q_i$ continues $Q_1 \cdots Q_n$ from

$(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)$ to $(d_1 \cup \cdots \cup d_n) \setminus (h_{i+1} \cup \cdots \cup h_n)$.

4. More generally, if $1 < i \le j \le n$, then the product $Q_i \cdots Q_j$

continues $Q_1 \cdots Q_n$ from $(d_1 \cup \cdots \cup d_n) \setminus (h_i \cup \cdots \cup h_n)$ to

$(d_1 \cup \cdots \cup d_n) \setminus (h_{j+1} \cup \cdots \cup h_n)$.

When $h_i$ is disjoint from $d_1 \cup \cdots \cup d_{i-1}$ for $i = 2, \ldots, n$, we say that $Q_1, \ldots, Q_n$

is a construction sequence for the conditional $Q_1 \cdots Q_n$. Notice that any subsequence

of a construction sequence is itself a construction sequence.

EXERCISE 2.2. Discuss how the idea of a state graph for a Markov chain can

be generalized so as to apply to more general belief chains.

EXERCISE 2.3. Devise graphical representations for hidden Markov models

in which the number of observed variables attached to a node in the Markov

chain is itself an observed variable.

EXERCISE 2.4. The basic graph in Figure 2.11 can be interpreted as an "and

graph": N = 1 if and only if A = 1 and B = 1, and A = 1 if and only

if C = 1, PC = 1, and PV = 1. This suggests arrows pointing the other

way, as in Figure 2.17. Show that the marginal on {N,A,B,C,PC,PV} of

a probability distribution of the form provided by equation (2.11) will not, in

general, be represented by the DAG in Figure 2.17.

EXERCISE 2.5. The conditionals involving a particular set of variables form

only a partial commutative semigroup, since products and marginals are not al-

ways conditionals.

Generalize the axioms of transitivity and combination you formulated in Ex-

ercise 1.5 to the case where the semigroup may be only partial. Consider also

the case where labels are binary: head and tail.


CHAPTER 3

Propagation in Join Trees

chapter, of computing marginals of a function given as a product of tables on

different sets of variables, say

variable $X$, on one of the sets $x_i$, or on some other set $x$ of variables. The frame

of all the variables, $\Omega_{\cup x_i}$, is too large for us to compute the table $f$ and then sum

variables out of this table. So our task is to compute marginals for $f$ without

computing $f$ itself.

The approach we take in this chapter is the obvious one: we exploit the

factorization as we sum variables out. We sum variables out one at a time, and

we deal each time only with factors that involve the variable we are summing out;

the others we factor out of the summation. Each step produces a new product of

the same form as the right-hand side of equation (3.1), possibly involving some

larger clusters of variables (when we sum Y out, we must multiply together

all the $f_i$ involving $Y$, and the resulting cluster may be large even after $Y$ is

removed). The next step must deal with these larger clusters, but with luck and

a good choice of the order in which we sum variables out, we may be able to

compute a given marginal without encountering a prohibitively large cluster.
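The summing-out procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's own code: tables are pairs (variables, dictionary of values), all variables are binary, and the names `multiply`, `sum_out`, and `eliminate`, as well as the three small tables, are invented for the example.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two tables of the form (variables, {configuration: value})."""
    (fv, ft), (gv, gt) = f, g
    out = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for cfg in product((0, 1), repeat=len(out)):  # binary frames assumed
        a = dict(zip(out, cfg))
        table[cfg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out, table

def sum_out(f, var):
    """Sum a single variable out of a table."""
    fv, ft = f
    out = tuple(v for v in fv if v != var)
    table = {}
    for cfg, val in ft.items():
        key = tuple(c for v, c in zip(fv, cfg) if v != var)
        table[key] = table.get(key, 0) + val
    return out, table

def eliminate(factors, order):
    """Sum the variables in `order` out of the product of `factors`, one at a time."""
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]
        if not involved:
            continue
        # Factors that do not involve var are factored out of the summation.
        others = [f for f in factors if var not in f[0]]
        combined = involved[0]
        for f in involved[1:]:
            combined = multiply(combined, f)
        factors = others + [sum_out(combined, var)]
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    return result

# Hypothetical product f = f1(A, B) * f2(B, C) * f3(C).
f1 = (("A", "B"), {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4})
f2 = (("B", "C"), {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8})
f3 = (("C",), {(0,): 2, (1,): 1})

marginal_A = eliminate([f1, f2, f3], ["C", "B"])
print(marginal_A)  # (('A',), {(0,): 60, (1,): 136})
```

With this order the largest table ever built involves two variables; the full table on A, B, and C is never computed.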

As it turns out, this variable-by-variable summing out produces a join tree,

and the process can be understood directly in terms of the join tree. A join tree

is a tree with clusters of variables as nodes, with the property that any variable

in two nodes is also in any node on the path between the two (equivalently, the

nodes containing any particular variable are connected). The join tree produced

by summing variables out in a given order has the clusters produced by the

summing out as its nodes, and each summing out can be thought of in terms of

a message passed (or "propagated") from one node to a neighbor in this tree.7

7 The name "join tree" was coined in the theory of relational databases in the early 1980s

(Beeri et al. [22]). An alternative, "junction tree," is also current in the literature on belief

nets.


tree. We can sum out more than one variable at a time. We can carry out

a multiplication after each summing out, or we can leave the multiplications

until they are required for a new summing out. In some cases, we can re-

duce the number of multiplications by judicious divisions. Thus we can distin-

guish different architectures for join-tree marginalization. In this chapter, we

study four: the elementary, Shafer-Shenoy, Lauritzen-Spiegelhalter, and Aal-

borg architectures. The elementary architecture produces the marginal for a

single node of the join tree. The other architectures produce marginals for all

nodes of the tree. The Shafer-Shenoy architecture achieves this by storing the

results of each summing out so they can be used for propagation in any di-

rection. This architecture is very general; it applies not only to the problem

we study in this chapter but also to other problems of recursive computation

involving unrestricted combination and marginalization operations that satisfy

the transitivity and combination axioms. It is somewhat wasteful, however, in

its appetite for multiplication. The Lauritzen-Spiegelhalter and Aalborg architectures

eliminate some of the multiplication by substituting a smaller number of

divisions.

If we are concerned only with calculating marginals of factored probability

distributions, the Aalborg architecture is the architecture of choice. Moreover,

the Aalborg architecture handles new evidence quite flexibly. Once it has com-

puted marginals for given observations, it can adjust the marginal for a particular

variable X after the further observation of a variable Y using only the part of

the join tree that lies between X and Y. But the alternative architectures come

into play for a wide variety of collateral problems that do not, for one reason

or another, satisfy all the assumptions made by the Aalborg architecture. For

example, when observations are subject to retraction, the Aalborg architecture

cannot be used because it does not retain the original inputs; Jensen [32] resorts

to the Shafer-Shenoy architecture in this case.

The methods of this chapter require only that the function $f$ be given as a

product of tables; it need not be a probability distribution, and even if it is,

the tables need not be conditionals. (In the case of the elementary and

Shafer-Shenoy architectures, they can even have negative entries.) But we are most

interested in the case where $f$ is equal or proportional to a probability distribution.

If $f$ is only proportional to a probability distribution $P$, it is usually the marginals

of $P$, not the marginals of $f$, that we want, but most of the work will be in finding

the marginals of $f$; we can obtain $P$'s marginals from $f$'s by equation (1.2).

As noted in the preface and in the exercises at the end of this chapter, join-

tree computation is much broader and older than the problem of finding marginal

posterior probabilities in probabilistic expert systems. In fact, techniques similar

to each of the architectures studied in this chapter have been applied to a variety

of problems in applied mathematics and operations research. Perhaps the oldest

such problem is that of solving a "sparse" set of linear equations, one in which

only a few variables appear in each equation. Other examples include the four-

color problem, dynamic programming, and constraint propagation (Diestel [2]).


on the nodes of the tree being sufficiently small. In the case of probability

propagation, they must be small enough that multiplication and marginalization

within nodes is inexpensive. Roughly speaking, this means that the sum of

the frame sizes must be small, or even more roughly, that the largest frame must

be small. Finding a join tree that achieves either of these minima exactly is an

NP-complete problem, but it is known that such minima are always achieved by

join trees that are produced by summing variables out in some order (Mellouli

[39]). Moreover, there are good heuristics for finding reasonable join trees if they

exist (Kong [36], Kjærulff [35]).

A simple example will suffice to show how variable-by-variable summing out

produces a join tree and how the summing out can be interpreted as message-

passing in this join tree.

Here is a function on seven variables given as a product of five tables:

The clusters of variables involved in the tables are shown in Panel 1 of Figure 3.1.

Let us imagine summing the variables out in the reverse of the order in which

they are numbered, keeping track as we go of the new clusters we create.

Summing $X_7$ out yields

factorization are shown in Panel 2. Above them, we have begun to construct a

join tree by drawing a node representing the variables involved in the summation,

$X_5$ and $X_7$. We temporarily link this node to the single variable $X_5$, which is

the only variable involved in the new table resulting from the summation.

Next, we sum $X_6$ out, obtaining


second node consisting of the variables involved in the summation on this step.

We have linked the new node to the cluster of variables involved in the new table

resulting from the summation.

The next step, which produces Panel 4, is more interesting. Here we sum $X_5$

out, obtaining


remove the clusters for tables absorbed in the summation, replacing them with a

single cluster for the new table resulting from the summation. One node already

in the picture was linked to a cluster removed from the list; it is now linked to

the new node.

The reader can write down the formulas for the remaining steps, which are

represented by Panels 5-8. At each step, we pull out from our product the factors

involving the variable we are summing out, multiply them together, perform the

summation, and give a new name to the resulting table (our system for naming

identifies the original tables involved in the subscript and the variables summed

out in the superscript, but this is of no importance). We add to our picture a node

representing the variables involved in the summation. We remove from the list

all the clusters corresponding to tables absorbed into the summation, replacing

them with the single cluster for the new table resulting from the summation;

this is the union of the clusters removed minus the variable summed out. We

link the node created to the cluster added. When a linked cluster is removed

from the list, the link is inherited by the new node that absorbs it.
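The bookkeeping in these steps can be sketched in Python as follows. The function name `build_join_tree` and its data layout are assumptions made for illustration. The clusters for the first, second, and fifth tables (23, 57, and 146) follow the text; the domains used for the third and fourth tables, {1, 2, 5} and {2, 4, 5}, are hypothetical stand-ins chosen only so that their union is 1245, since the original formula is not reproduced here.

```python
def build_join_tree(clusters, order):
    """Return (nodes, links) produced by summing variables out in the given order.

    Every variable in `order` is assumed to occur in at least one cluster.
    """
    # Each entry pairs a cluster in the current list with the nodes
    # that are temporarily linked to it.
    entries = [(frozenset(c), []) for c in clusters]
    nodes, links = [], []
    for var in order:
        absorbed = [e for e in entries if var in e[0]]
        entries = [e for e in entries if var not in e[0]]
        # The new node holds all the variables involved in this summation.
        node = frozenset().union(*(c for c, _ in absorbed))
        nodes.append(node)
        # Links held by absorbed clusters are inherited by the new node.
        for _, waiting in absorbed:
            links.extend((old, node) for old in waiting)
        # The cluster for the new table is linked to the node just created.
        entries.append((node - {var}, [node]))
    # The clusters that remain form the last node.
    last = frozenset().union(*(c for c, _ in entries))
    nodes.append(last)
    for _, waiting in entries:
        links.extend((old, last) for old in waiting)
    return nodes, links

# Clusters 23, 57, 146 follow the text; {1, 2, 5} and {2, 4, 5} are hypothetical.
clusters = [{2, 3}, {5, 7}, {1, 2, 5}, {2, 4, 5}, {1, 4, 6}]
nodes, links = build_join_tree(clusters, [7, 6, 5, 4, 3, 2])

for old, new in links:
    print(sorted(old), "->", sorted(new))
```

Under these assumptions the nodes come out as 57, 146, 1245, 124, 23, 12, and 1, which agrees with the nodes named in the discussion of Panel 8.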

The final result in Panel 8 is indeed a join tree. It is a tree with sets for

nodes, and whenever a variable is contained in two nodes, it is also contained in

all the nodes on the path joining the two. For example, the variable 2, which is

contained in both 23 and 1245, is also contained in the two nodes between them,

12 and 124.
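The join-tree property can also be checked mechanically. The sketch below tests that a graph is a tree and that, for each variable, the nodes containing it are connected. The function names are assumptions for illustration, and the node and link lists are read off from the description of the example (nodes 57, 146, 1245, 124, 23, 12, 1), not copied from the figure itself.

```python
def connected(nodes, links):
    """True if the given nodes form a connected graph using only links among them."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacent = {n: set() for n in nodes}
    for a, b in links:
        if a in nodes and b in nodes:
            adjacent[a].add(b)
            adjacent[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adjacent[stack.pop()] - seen:
            seen.add(m)
            stack.append(m)
    return seen == nodes

def is_join_tree(nodes, links):
    """Check that the graph is a tree and that each variable's nodes are connected."""
    nodes = [frozenset(n) for n in nodes]
    links = [(frozenset(a), frozenset(b)) for a, b in links]
    if len(links) != len(nodes) - 1 or not connected(nodes, links):
        return False
    variables = set().union(*nodes)
    return all(connected([n for n in nodes if v in n], links) for v in variables)

# Each node is written as a string of its variable labels.
nodes = ["57", "146", "1245", "124", "23", "12", "1"]
links = [("57", "1245"), ("146", "124"), ("1245", "124"),
         ("124", "12"), ("23", "12"), ("12", "1")]
print(is_join_tree(nodes, links))       # True

# Relinking 23 to 1 keeps the graph a tree but separates 23 from the other nodes with 2.
bad_links = [("23", "1") if link == ("23", "12") else link for link in links]
print(is_join_tree(nodes, bad_links))   # False
```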

Though we have worked in terms of an example, we have spelled out a general

algorithm. This algorithm applies to any product of tables and to any order for

summing the variables out of such a product. It identifies the clusters involved in

the variable-by-variable summing out, and it arranges these clusters in a graph.

Is this graph always a join tree?

Certainly the graph is always a tree, i.e., it is always connected and acyclic.

We introduce the nodes in a sequence. Each node except the last is linked with

some later node, so the graph is connected. (Since we can follow the links from

any node to the last node, we can follow them from one node to the last node

and then back to any other node we please.) Each node is linked with only one

later node, so there cannot be any cycles. (If there were a cycle, the earliest

node in it would have to be linked with two later nodes.)

To see that the tree is always a join tree, consider Figure 3.2, where the links

have become arrows pointing from old to new nodes, and each arrow is labeled

with the variable that was summed out when the node from which the arrow


comes was created. The node to which an arrow points always includes all the

variables in the node from which the arrow comes, except the variable that was

summed out. For any particular variable X, any node n containing X must be

connected to the node n' created when X is summed out, because the tables

created as we go downward from n continue to contain X until it is summed out.

It follows that all the nodes containing X are connected in the tree (i.e., form

a subtree), and this is equivalent to the tree being a join tree.

The join tree that we construct in this way is interesting because it can be

interpreted as a picture of the computations involved in the variable-by-variable

summing out. We interpret a node x as a register that can store a table for its

variables, and we interpret an arrow from x to y as an instruction to sum out a

variable from x's table and multiply y's table by the result.

We begin by putting tables in the storage registers; in Figure 3.2, for example,

we put the table $f_1$ in 23, the table $f_2$ in 57, the product $f_3 f_4$ in 1245, and the

table $f_5$ in 146. We put tables of ones in the other three nodes. The number

beside each arrow tells us which variable to sum out of the table in the node

preceding the arrow. Figure 3.3 shows the summations we perform when we

follow these instructions.

We summed the variables out in the reverse of the order in which they were

numbered: 7, 6, 5, 4, 3, 2. Figures 3.2 and 3.3 make it clear, however, that

this order can be varied to some extent without changing the join tree or the

computations performed. The only constraint is that we sum out of a given node

only after the node has absorbed messages from all nodes with arrows pointing

to it. Only the three nodes 23, 57, and 146 can begin the computation, 1245 can

act after 57, 124 can act after 1245 and 146, and so on.

We do not need the numbers beside the arrows in Figure 3.2. These numbers

tell us which variable to sum out, but we can also find this information by

comparing the node sending the message to the node receiving it. The sender

always sums out the variable it has that its neighbor does not have. In other

words, it marginalizes to its intersection with the neighbor.

The final result of the computation is $f^{\downarrow X_1}$, the marginal of $f$ for $X_1$. If we

continue by summing $X_1$ out of this table, then we obtain $f^{\downarrow \emptyset}$, the marginal of

$f$ on the empty set. Figure 3.2 can be extended to include this final summation;

we simply add $\emptyset$ as a node, with an arrow to it from 1.


Marginalization in join trees can be understood directly, without any reference

to an ordering of the variables. If we place tables in the nodes of an arbitrary

join tree and propagate to a root following the algorithm just described, then

the final table on the root will always be the marginal on the root of the product

of the initial tables. It is not necessary that the join tree or the placement of the

tables should have been determined by an ordering of the variables.

In this section, we will spell out the marginalization algorithm in terms of an

arbitrary join tree. Then we will prove, using only the transitivity and combi-

nation axioms, that the algorithm always produces the marginal on the root.

Before beginning the algorithm, we place in each node x of the join tree a

table on $x$, say $\varphi_x$. We write $\varphi$ for the product of the $\varphi_x$: $\varphi = \prod_{x \in N} \varphi_x$, where $N$

is the set consisting of all the nodes in the tree. The purpose of the algorithm is

to find the marginal $\varphi^{\downarrow r}$ for a particular node $r$, which we call the root of the tree.

To begin the algorithm, we make all the links in the tree into arrows in the

direction of r. (Each node other than r will then have exactly one arrow outward,

pointing to its unique neighbor in the direction of r.) Then we have each node

pass a message to its neighbor nearer r according to the rules we learned in

Figure 3.1:


Rule 1. Each node waits to send its message to its neighbor nearer

to r until it has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message, it computes the

message by summing out of its current table any variables it has but

the neighbor to whom it is sending the message does not have. (This

was always a single variable in Figure 3.1, but it could be several

variables or none.) In other words, it marginalizes its current table

to its intersection with the neighbor.

Rule 3. When a node receives a message, it replaces its current table

with the product of that table and the message.

Eventually, all the nodes except r will have sent messages, and r will have re-

ceived a message from each of its neighbors and will have multiplied its original

table by all these messages.
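The three rules can be sketched in Python as a recursive collection toward the root. This is an illustrative sketch under assumptions, not a definitive implementation: nodes are frozensets of variable names, tables are pairs (variables, dictionary of values) on binary variables, and the names `propagate_to_root`, `table`, and the small three-node join tree AB, BC, CD are invented for the example.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two tables of the form (variables, {configuration: value})."""
    (fv, ft), (gv, gt) = f, g
    out = tuple(sorted(set(fv) | set(gv)))
    table = {}
    for cfg in product((0, 1), repeat=len(out)):  # binary frames assumed
        a = dict(zip(out, cfg))
        table[cfg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out, table

def marginalize(f, keep):
    """Sum out of f every variable that is not in keep."""
    fv, ft = f
    out = tuple(v for v in fv if v in keep)
    table = {}
    for cfg, val in ft.items():
        key = tuple(c for v, c in zip(fv, cfg) if v in keep)
        table[key] = table.get(key, 0) + val
    return out, table

def propagate_to_root(tables, links, root):
    """Return the final table on the root after inward propagation."""
    tables = dict(tables)  # leave the caller's tables untouched
    neighbors = {n: set() for n in tables}
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def collect(node, parent):
        # Rule 1: hear from every neighbor other than the one nearer the root.
        for other in neighbors[node] - {parent}:
            message = collect(other, node)
            tables[node] = multiply(tables[node], message)  # Rule 3
        if parent is None:
            return tables[node]
        return marginalize(tables[node], node & parent)     # Rule 2

    return collect(root, None)

def table(variables, values):
    """Build a table on binary variables from values listed in lexicographic order."""
    return tuple(variables), dict(zip(product((0, 1), repeat=len(variables)), values))

x, y, z = frozenset("AB"), frozenset("BC"), frozenset("CD")
tables = {x: table("AB", [1, 2, 3, 4]), y: table("BC", [5, 6, 7, 8]), z: table("CD", [1, 0, 2, 3])}
links = [(x, y), (y, z)]

root_table = propagate_to_root(tables, links, x)
brute = marginalize(multiply(multiply(tables[x], tables[y]), tables[z]), x)
print(root_table == brute)  # True
```

The comparison with `brute`, which forms the full product before marginalizing, is only feasible because the example is tiny; the point of the propagation is to avoid that product.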

Here is the proposition we need to prove.

PROPOSITION 3.1. At the end of the algorithm just described, the table on r

will be $\varphi^{\downarrow r}$, the marginal on $r$ of the product of the initial tables.

Proof. Imagine for the moment that the nodes are peeled away from the join

tree as they send their messages, so that in the end only r remains. Thus a single

step of the algorithm consists of three parts: (1) a node t computes the marginal

of its table to $b \cap t$, (2) the neighbor $b$ multiplies this marginal into its current

table, and (3) the node t is removed from the tree. This allows us to state the

following lemma.

LEMMA 3.1. After each step, the product of the tables that remain is the

marginal to the variables that remain of the product of the tables before the step.

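To see that Lemma 3.1 is true, write $N_1$ for the set of nodes in the tree before

the step, $N_2$ for the set of nodes in the tree after the step, and $\psi_x$ for the table

in node $x$ before the step. Thus the product of the tables before the step is

$\prod_{x \in N_1} \psi_x$, and the product of the tables after the step is $\left(\prod_{x \in N_2} \psi_x\right) \psi_t^{\downarrow b \cap t}$ (see

Figure 3.4). Since the tree is a join tree, $b \cap t = (\cup N_2) \cap t$. So we find, using the

combination axiom, that

$$\Big(\prod_{x \in N_1} \psi_x\Big)^{\downarrow \cup N_2} = \Big(\Big(\prod_{x \in N_2} \psi_x\Big) \psi_t\Big)^{\downarrow \cup N_2} = \Big(\prod_{x \in N_2} \psi_x\Big) \psi_t^{\downarrow (\cup N_2) \cap t} = \Big(\prod_{x \in N_2} \psi_x\Big) \psi_t^{\downarrow b \cap t},$$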

which is a restatement of Lemma 3.1. Lemma 3.1, together with the transitivity

axiom, yields the next lemma.

LEMMA 3.2. After each step, the product of the tables that remain is the

marginal to the variables that remain of the product of the initial tables.


FIG. 3.4. The loaded join tree before and after t sends its inward message to b.

At the end of the algorithm, we have only one table, the table on the root,

and so we obtain Proposition 3.1 as a special case of Lemma 3.2.

We can gain some further insight into the algorithm by noting that when a

node b receives a message from a neighbor t, it is also receiving, indirectly, infor-

mation from the nodes on the other side of t. After any step (message-passing

and multiplication) in the algorithm, we can identify the nodes from which a

given node b has received information, either directly or indirectly. These nodes,

together with b itself, form a subtree, which we may call b's information

branch at that point (see Figure 3.5). The steps we have taken within this sub-

tree are the same as the steps we would have taken had we implemented the

algorithm on it alone, with b as the root. So as a corollary of Proposition 3.1,

we have the following proposition.

PROPOSITION 3.2. After each step, the table on a given node b will be the

marginal on b of the product of the initial tables in b's current information branch.

This is a generalization of Proposition 3.1, because at the end of the algo-

rithm, the root's information branch is the whole tree.

In the course of explaining our algorithm, we have found ourselves talking

about the nodes of the join tree as storage registers and even as individual

processors. Each node can store tables for a certain set of variables, multiply

such tables, and marginalize them. In effect, we have made the join tree, together

with the algorithm, into an architecture for marginalization. We call it the

elementary architecture. In the next few sections, we consider some alternative

architectures, based on the same join tree, that are able to compute marginals

for all the nodes, not merely for a single root node.

Join-tree architectures are potentially applicable to any instance of the gen-

eral problem of computing marginals of a function given as a product of tables,

as in equation (3.1), but in order to apply a join-tree architecture to such a prob-

lem, we first find a join tree that covers the product, one that includes for each

factor a node containing the domain of that factor. (If we want the marginal for a


FIG. 3.5. The dashed arrows are those over which messages have already been sent. The

circled subtree is b's information branch at this point.

cluster of variables that is not the domain of one of the factors, then we must

make sure that the join tree also has a node containing this cluster.) Once we

have such a join tree, we place each factor in a node containing its domain. If

a node x receives more than one factor, we multiply them together, and we also

multiply by $1_x$ if necessary in order to obtain a table that involves all the variables

in $x$. If a node $x$ does not receive a factor, we simply assign it the table $1_x$.

If the join tree has more than one node containing the domain of a particular

factor, we can put the factor in whichever of these nodes we please. In Figure 3.2,

for example, we have two different nodes that can accept a table on 124. To

minimize computation, we should choose the node with the smaller frame size,

but this is a minor consideration.
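The placement of factors in a join-tree cover might be sketched as follows. The function name `place_factors`, the default frame-size function (which assumes binary variables), and the extra table on 124 are assumptions made for the example; the node list is read off from the description of Figure 3.2.

```python
def place_factors(nodes, domains, frame_size=lambda node: 2 ** len(node)):
    """Map each node to the indices of the factors placed in it.

    Each factor goes to a node that contains its domain; when several nodes
    qualify, the one with the smallest frame is chosen. The default
    frame_size assumes binary variables.
    """
    nodes = [frozenset(n) for n in nodes]
    placement = {n: [] for n in nodes}
    for index, domain in enumerate(domains):
        candidates = [n for n in nodes if frozenset(domain) <= n]
        if not candidates:
            raise ValueError(f"the join tree does not cover {set(domain)}")
        placement[min(candidates, key=frame_size)].append(index)
    return placement

# Each node and domain is written as a string of its variable labels.
nodes = ["57", "146", "1245", "124", "23", "12", "1"]
domains = ["23", "57", "146", "124"]  # the table on 124 is hypothetical

placement = place_factors(nodes, domains)
# Nodes that receive no factor are assigned a table of ones.
needs_ones = [sorted(n) for n, placed in placement.items() if not placed]
print(needs_ones)
```

Here the table on 124 could go to either 124 or 1245, and the sketch picks 124 because its frame is smaller.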

The choice of the join tree is much more important. We want a join-tree

cover with nodes small enough to permit computation. If such a join-tree cover

does not exist, we will have to turn to alternative methods for marginalization,

such as Markov-chain Monte Carlo.

As we noted at the beginning of the chapter, there are heuristics that do

produce reasonable choices for join-tree covers. Some of these heuristics do

involve choosing an order for eliminating (summing out) the variables. This not

only produces a join-tree cover; it also determines a placement of the factors in

the join tree: each factor goes as close as possible to the root.

The elementary architecture allows us to find the marginal for an arbitrary root

of a join tree. If we then want to find the marginal for another node, we can

use the same join tree, but we must repeat the algorithm using the new node


FIG. 3.6. The partial Shafer-Shenoy architecture. Like the elementary architecture, it

finds the marginal for a single root node. In each separator, we have indicated the set of

variables involved in the messages that will be stored there; this is always the intersection of

the two neighboring nodes.

as the root. This usually involves a great deal of duplication. In Figure 3.4, for

example, most of the steps for computing the marginal on w will be the same as

those for computing the marginal on r.

The Shafer-Shenoy architecture provides one way to eliminate much of this

duplication. In this architecture, each node sends messages in all directions. It

is allowed to send its message to a particular neighbor as soon as it has messages

from all its other neighbors. In order that the computations for a message in one

direction should not interfere with those for a message in another direction, a

node no longer replaces its table each time it receives a message. Instead, it keeps

its initial table, stores the incoming messages, and performs multiplications only

as needed for computing outgoing messages.

As a first step in describing the Shafer-Shenoy architecture, we will describe

a partial version, in which, as in the elementary architecture, messages are prop-

agated only to a single root r. Figure 3.6 shows this partial architecture. The

squares on the arrows in this figure are called separators; they contain storage

registers for storing the messages sent in the direction of the arrows. As in the

elementary architecture, we begin with a table $\varphi_x$ on each node $x$ and we want

to find $\varphi^{\downarrow r}$ for a particular node $r$, where $\varphi$ is the product of the $\varphi_x$. The storage

registers in the separators are initially empty.

Here are the rules for propagation in the partial Shafer-Shenoy architecture:

Rule 1. Each node waits to send its message to its neighbor nearer to

r until it has received messages from all its other neighbors. (More

precisely, it waits until messages have been received by the separators

between it and these other neighbors.)

Rule 2. When a node is ready to send its message to its neighbor

nearer r (or, more precisely, to the separator between it and its neighbor

nearer r), it computes the message by collecting all its messages

from neighbors farther from r, multiplying its own table by these

messages, and marginalizing the product to its intersection with the

neighbor nearer r.

Rule 1 is the same as in the elementary architecture. Here, however, the messages

are intercepted by the separators, where they are stored until they are collected

in accordance with Rule 2. Rule 3, which provides for changing the tables on

nodes, has been omitted. In this architecture, propagation only has the effect of

filling the storage registers in the separators. It does not change the tables on

the nodes.

Since the rules for message-passing are the same in the partial Shafer-Shenoy

architecture as in the elementary architecture, the course of the propagation and

the messages sent will be the same. At the end of the propagation, the root r

will have a message from each neighbor stored in the separator it shares with

that neighbor. Thus we have the following proposition.

PROPOSITION 3.3. At the end of the partial Shafer-Shenoy propagation, we

can get $\varphi^{\downarrow r}$ by collecting all of r's incoming messages and multiplying r's table

by them.

The full Shafer-Shenoy architecture extends the partial architecture by

putting two storage registers in each separator, one for a message in each direc-

tion, as in Figure 3.7. Each node sends messages to all its neighbors, following

these rules:

Rule 1. Each node waits to send its message to a given neighbor until

it has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message to a particular

neighbor, it computes the message by collecting all its messages from

other neighbors, multiplying its own table by these messages, and

marginalizing the product to its intersection with the neighbor to

whom it is sending.

Here, as in the partial architecture, the tables on the nodes do not change. At

the end of the propagation, each node x still has its initial table $\varphi_x$. The only

effect of the propagation is to fill all the storage registers in the separators.

A comparison of the rules for the full and partial architectures makes it clear

that the full architecture produces the same messages towards any particular

node as the partial architecture with that node as root. So once we have com-

pleted the propagation in the full architecture, we can find the marginal for

any particular node by collecting all its incoming messages and multiplying the

node's table by them.

PROPOSITION 3.4. At the end of the full Shafer-Shenoy propagation, we can

get $\varphi^{\downarrow x}$ for any node x by collecting all of x's incoming messages and multiplying

x's table by them.

FIG. 3.7. The full Shafer-Shenoy architecture. The arrow in each storage register indicates

the direction of the message to be stored there.

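It is helpful, in comparing the Shafer-Shenoy architecture with

other architectures, to express its computations in formulas. Let us write $m_{n \to x}$

for the Shafer-Shenoy message to x from neighbor n, and $\mathcal{N}_x$ for the set of x's neighbors. Then Rule 2 says that the

message from x to neighbor w is given by

$$m_{x \to w} = \Bigl( \varphi_x \prod_{n \in \mathcal{N}_x \setminus \{w\}} m_{n \to x} \Bigr)^{\downarrow x \cap w}, \tag{3.2}$$

and Proposition 3.4 says that

$$\varphi^{\downarrow x} = \varphi_x \prod_{n \in \mathcal{N}_x} m_{n \to x}. \tag{3.3}$$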

Because of Rule 1, the computation must begin with the leaves, the nodes

that have only one neighbor. In Figure 3.7, for example, the leaves are 1, 23, 57,

and 146. Any of these leaves can begin, and the message they send is the only

message they send in the course of the computation. The situation for the other

nodes is more complicated. Node 12, for example, can send a message to 124 as

soon as it has heard from leaves 1 and 23, but it must then wait to hear

back from 124 before it can send messages back to 1 and 23.
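
The two rules of the full Shafer-Shenoy architecture can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions that are ours, not the text's: a table is represented as a pair (variables, NumPy array) with one axis per variable, all variables are binary, the join tree (nodes a, b, c, d) is hypothetical rather than the tree of Figure 3.7, and the names `multiply`, `marginalize`, and `shafer_shenoy` are invented for the example.

```python
import numpy as np

def multiply(f, g):
    """Product of two tables; a table is (variables, array) with one axis per variable."""
    (fv, fa), (gv, ga) = f, g
    out = sorted(set(fv) | set(gv))
    return tuple(out), np.einsum(fa, list(fv), ga, list(gv), out)

def marginalize(f, keep):
    """Sum a table down to the variables in `keep`."""
    fv, fa = f
    out = [v for v in fv if v in keep]
    return tuple(out), np.einsum(fa, list(fv), out)

def shafer_shenoy(tables, edges):
    """Fill both storage registers of every separator, then read off all marginals."""
    nbrs = {x: set() for x in tables}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    msgs = {}                                   # (sender, receiver) -> stored message
    todo = [(x, w) for x in tables for w in nbrs[x]]
    while todo:
        for x, w in todo:
            others = nbrs[x] - {w}
            if all((n, x) in msgs for n in others):       # Rule 1: heard from all others
                prod = tables[x]                          # initial table, never overwritten
                for n in others:
                    prod = multiply(prod, msgs[n, x])
                sep = set(tables[x][0]) & set(tables[w][0])
                msgs[x, w] = marginalize(prod, sep)       # Rule 2
                todo.remove((x, w))
                break
    marginals = {}
    for x in tables:                            # Proposition 3.4: table times all incoming messages
        prod = tables[x]
        for n in nbrs[x]:
            prod = multiply(prod, msgs[n, x])
        marginals[x] = prod
    return msgs, marginals

# Hypothetical join tree on binary variables 1..5.
rng = np.random.default_rng(0)
node_vars = {"a": (1, 2), "b": (2, 3), "c": (3, 4), "d": (3, 5)}
tables = {name: (vs, rng.random((2,) * len(vs)) + 0.1) for name, vs in node_vars.items()}
edges = [("a", "b"), ("b", "c"), ("b", "d")]
msgs, marginals = shafer_shenoy(tables, edges)
```

The sketch sends messages serially and never designates a root; which node first hears from all its neighbors simply falls out of the order in which ready messages are found.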

Figure 3.8 shows one sequence in which messages might be sent in the archi-

tecture of Figure 3.7. The messages first move inward to the node 1 and then

back outward again. The inward pass is identical to propagation to 1 in the

partial Shafer-Shenoy architecture of Figure 3.6.

FIG. 3.8. One order in which messages might be sent in the full Shafer-Shenoy architec-

ture.

If the computations are performed serially, there will necessarily be one node,

such as 1 in Figure 3.8, that is the first to receive messages from all its neighbors.

This node can be considered the root. The propagation consists of a pass inward

to the root and another pass back outward. It is not necessary, however, to

specify the root in advance. If the computations are performed in parallel (a

possibility suggested when we talk as if the nodes were individual processors),

then which node is the first to receive all its messages will depend on the pace

of the computations for the different nodes farther out in the tree, and it is even

possible that two nodes will tie for first. This happens in Figure 3.9, where

the computations proceed in parallel and in synchrony, and 124 and 12 receive

messages from each other simultaneously on the third step of the computation.

By comparing Figures 3.6 and 3.8, we can understand better why the Shafer-Shenoy

architecture stores so many messages. The elementary architecture uses

and discards each message when it is sent. But what would happen if we were

to follow the inward pass of the elementary architecture with an outward pass?

In the case of Figures 3.6 and 3.8, this means that after 1 absorbed the message

from 12, it would send a message back to 12. By the usual rule, the message back

would simply be its current table, which was obtained by multiplying its original

table by the message (no marginalization is needed, because the intersection of

1 with 12 is simply 1). Intuitively, this is wrong, because it forces 12 to

absorb again the message it just sent, effectively counting it twice. The Shafer-

Shenoy architecture sends instead only the original table, uncontaminated with

the message from 12. It is able to do this because it has kept both its original

table and the message. The same thing happens at each further step on the

outward pass. Node 12, for example, since it still has both its original table and

the messages from 23 and 1, is able to send a message back to 124 that is not

contaminated with the message it received from 124.
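
The double counting described above can be seen numerically on the smallest possible case. The following sketch uses a hypothetical two-node tree (node 1 on variable 1, node 12 on variables 1 and 2, both binary) with random tables; the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
phi_1 = rng.random(2) + 0.1         # table on node 1 (variable 1)
phi_12 = rng.random((2, 2)) + 0.1   # table on node 12 (axes: variable 1, variable 2)

# Inward pass: 12 marginalizes to the intersection {1}; node 1 absorbs the message.
m_12_to_1 = phi_12.sum(axis=1)
table_1 = phi_1 * m_12_to_1

# Naive outward pass: node 1 sends back its current table, so 12 absorbs
# its own message a second time.
naive_12 = phi_12 * table_1[:, None]

# Shafer-Shenoy outward message: node 1's original table, free of the message from 12.
m_1_to_12 = phi_1
correct_12 = phi_12 * m_1_to_12[:, None]

# The full product; in this tiny tree it is itself a table on {1, 2}.
phi = phi_1[:, None] * phi_12
```

Here `correct_12` equals the marginal of the product on node 12, while `naive_12` carries an extra factor of the message that 12 itself sent.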

Roughly speaking, the Shafer-Shenoy architecture computes marginals for

all the nodes at about three times the price for a single marginal. We double

the computation because we compute two messages instead of one for each link,

and then we increase it by about the same amount again when we do the final

multiplications to get the marginal for each node. This contrasts with repeat-

ing the elementary architecture for each node, which multiplies the amount of

computation for a single marginal by the number of nodes.
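
As a rough check on this comparison, we can count messages rather than individual multiplications (the cost of each message depends on the sizes of the tables, which this count ignores). The symbol $k$ for the number of nodes is introduced here only for illustration. A tree with $k$ nodes has $k-1$ links, so:

```latex
\begin{align*}
\text{one marginal, elementary architecture:} &\quad k-1 \text{ messages},\\
\text{all marginals, Shafer-Shenoy architecture:} &\quad 2(k-1) \text{ messages and } k \text{ final products},\\
\text{all marginals, elementary architecture repeated:} &\quad k(k-1) \text{ messages}.
\end{align*}
```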

Unfortunately, the Shafer-Shenoy architecture is still rather wasteful in its

demand for multiplication. Each node computes a message for each of its neigh-

bors only once (in contrast to what happens if we use the elementary architecture

over and over), but the multiplication a node performs to compute the message

to one neighbor still duplicates much of the multiplication it performs to com-

pute the message to another. In Figure 3.7, for example, node 124 will multiply

its original table by the message from 1245 once when it sends its message to

146 and again when it sends a message to 12. With yet more storage, we could

reduce this remaining duplication somewhat, but it is more effective to take an-

other tack. Instead of trying to keep the message a node sends on the inward

pass from being included in the message it gets back, we can allow for the message's

later return by dividing it out of the node's current table as it is sent.

This is the tack taken by the Lauritzen-Spiegelhalter architecture.

The Lauritzen-Spiegelhalter architecture explicitly designates a particular node

r as the root of the propagation. It does not use separators. It begins with a

pass inward to r that duplicates the elementary architecture, except that when

a node sends a message, it divides its own table by that message. It then follows

with a pass outward from r, during which it follows the elementary architecture's

rule for propagation, without the division. This is illustrated by Figure 3.10.

Here is a precise statement of the rules for the inward pass.

Rule 1. Each node waits to send its message to its neighbor nearer r

until it has received messages from all its other neighbors.

Rule 2. When a node is ready to send its message to its neighbor

nearer to r, it computes the message by marginalizing its current ta-

ble to its intersection with its neighbor. It sends this marginal to the

neighbor nearer to r, and then it divides its own current table by it.

Rule 3. When a node receives a message, it replaces its current table

with the product of that table and the message.

These rules are the same as the rules for the elementary architecture, except for

the addition of the final clause of Rule 2 (the division). For the outward pass, we use the

same rules, without the divisions:

Rule 1. Each node waits to send its message to a particular neigh-

bor outward from r until it has received messages from all its other

neighbors.

Rule 2. When a node is ready to send its message to a particular

neighbor outward from r, it computes the message by marginalizing

its current table to its intersection with this neighbor.

FIG. 3.10. Rules for the Lauritzen-Spiegelhalter architecture. The message, In or Out, is

always the marginal of the sender's current table to the sender's intersection with the receiver.

Rule 3. When a node receives a message, it replaces its current table

with the product of that table and the message.

Since each node received messages from all its outward neighbors on the inward

pass, we can restate Rule 1 for the outward pass in a simpler way: Each node

waits to send its messages outward until it has received a message from its unique

neighbor nearer to r. (This neighbor may be r itself; r must begin the outward

pass by sending one or more messages.)
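
The inward and outward rules can be sketched as follows. As before, this is an illustration under our own assumptions: tables are (variables, array) pairs on binary variables, all entries are strictly positive so that ordinary division is possible, the join tree is hypothetical, and the function names are invented for the example. The waiting conditions of Rule 1 are enforced implicitly by processing the nodes in an order that moves outward from the root.

```python
import numpy as np

def multiply(f, g):
    (fv, fa), (gv, ga) = f, g
    out = sorted(set(fv) | set(gv))
    return tuple(out), np.einsum(fa, list(fv), ga, list(gv), out)

def marginalize(f, keep):
    fv, fa = f
    out = [v for v in fv if v in keep]
    return tuple(out), np.einsum(fa, list(fv), out)

def divide(f, g):
    """Divide table f by a table g on a subset of f's variables (g must have no zeros)."""
    (fv, fa), (gv, ga) = f, g
    return fv, np.einsum(fa, list(fv), 1.0 / ga, list(gv), list(fv))

def outward_order(edges, root):
    """Return each node's inward neighbor and an ordering that moves outward from the root."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    parent, order = {root: None}, [root]
    for x in order:                          # the list grows while we walk it (breadth-first)
        for n in sorted(nbrs[x]):
            if n not in parent:
                parent[n] = x
                order.append(n)
    return parent, order

def lauritzen_spiegelhalter(tables, edges, root):
    parent, order = outward_order(edges, root)
    T = dict(tables)                         # current tables on the nodes
    for x in reversed(order[1:]):            # inward pass: outermost nodes first
        w = parent[x]
        msg = marginalize(T[x], set(T[w][0]))    # Rule 2: marginal of the current table
        T[w] = multiply(T[w], msg)               # Rule 3 at the receiver
        T[x] = divide(T[x], msg)                 # Rule 2: the sender divides by its message
    for x in order[1:]:                      # outward pass: no division
        msg = marginalize(T[parent[x]], set(T[x][0]))
        T[x] = multiply(T[x], msg)
    return T

rng = np.random.default_rng(2)
node_vars = {"a": (1, 2), "b": (2, 3), "c": (3, 4), "d": (3, 5)}
tables = {name: (vs, rng.random((2,) * len(vs)) + 0.1) for name, vs in node_vars.items()}
edges = [("a", "b"), ("b", "c"), ("b", "d")]
final = lauritzen_spiegelhalter(tables, edges, root="a")
```

No separators are stored; each node ends up holding a single table, which Proposition 3.5 below identifies as its marginal.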

Let us check that the Lauritzen-Spiegelhalter architecture produces the ap-

propriate marginals for all the nodes.

PROPOSITION 3.5. At the end of the Lauritzen-Spiegelhalter propagation, the

table on each node x is $\varphi^{\downarrow x}$.

Proof. First consider the situation at the end of the inward pass. On the

inward pass, the messages sent are the same as in the elementary architecture

and hence also the same as in the Shafer-Shenoy architecture. If x is not equal to

r, then during the inward pass, x sends its inward neighbor w the Shafer-Shenoy

message $m_{x \to w}$. At the end of the inward pass, x has received messages from

all its own outward neighbors (if any) and has sent only the message to w. This

gives the following lemma.

LEMMA 3.3. At the end of the inward pass, a node x not equal to the root

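has as its table

$$\frac{\varphi_x \prod_{n \in \mathcal{N}_x \setminus \{w\}} m_{n \to x}}{m_{x \to w}}, \tag{3.4}$$

where w is x's inward neighbor and $\mathcal{N}_x$ is the set of x's neighbors.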

The root r, on the other hand, receives messages from all its neighbors and

sends no messages on the inward pass. So at the end of the inward pass, it has

the same table as at the end of the elementary architecture.

LEMMA 3.4. At the end of the inward pass, the table on r is $\varphi^{\downarrow r}$.

Now consider the outward pass. On the outward pass, each node except the

root receives just one message: the message from its inward neighbor. The root

itself sends messages but does not receive any. So the table on the root does not

change, and each of the other tables changes exactly once, when it is multiplied

by the message from its inward neighbor. Since the propagation moves outward

from the root, Proposition 3.5 follows by induction from Lemma 3.4 together

with the following lemma.

LEMMA 3.5. Suppose w has $\varphi^{\downarrow w}$ as its table when it sends its message to outward

neighbor x. Then after absorbing the message, x will have $\varphi^{\downarrow x}$ as its table.

To prove Lemma 3.5, we need a formula for the message w sends to x.

LEMMA 3.6. If w has $\varphi^{\downarrow w}$ as its table when it sends its message to outward

neighbor x, then the message it sends is the product of the Shafer-Shenoy messages

in both directions: $m_{w \to x} m_{x \to w}$.

To prove Lemma 3.6, we note that by its hypothesis and equation (3.3), the

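table on w is

$$\varphi^{\downarrow w} = \varphi_w \prod_{n \in \mathcal{N}_w} m_{n \to w}.$$

The message w sends to x is the marginal of this table on $w \cap x$, which is

equal, by the combination axiom and equation (3.2), to

$$m_{x \to w} \Bigl( \varphi_w \prod_{n \in \mathcal{N}_w \setminus \{x\}} m_{n \to w} \Bigr)^{\downarrow w \cap x} = m_{x \to w}\, m_{w \to x}.$$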

Since the hypothesis of Lemma 3.6 is always true, its conclusion is too: the

Lauritzen-Spiegelhalter message from w back out to x is always the product of

the Shafer-Shenoy messages in both directions. This substantiates the intuitive

characterization of the Lauritzen-Spiegelhalter architecture with which we began:

dividing out the inward message when we send it compensates for the fact

that it will be part of the message that comes back.

Another equally important way of describing the message from w back out

to x is to say that it is the marginal of $\varphi$ on $w \cap x$. This is because w has the

marginal of $\varphi$ on w as its table before sending the message, and it computes the

message by marginalizing this table to $w \cap x$.

FIG. 3.11. The node x and its neighbor w nearer the root before and after w sends a

message back to x.

Using continuers. The alert reader will have noticed that we glossed over the

problem of zero probabilities in our description of the Lauritzen-Spiegelhalter

architecture. If the table $m_{x \to w}$ has zero values, then we will not be able to

perform the division in equation (3.4). Fortunately, it is not really necessary to

perform this division. The reasoning with which we proved Proposition 3.5 will

work if we can find a continuer, say $Q_{x \cap w \to x}$, of $\varphi_x \prod_{n \in \mathcal{N}_x \setminus \{w\}} m_{n \to x}$ from $x \cap w$

to x, for we can use $Q_{x \cap w \to x}$ as x's table after it has sent its message inward

to w, and this will have the same effect as the division. When the message

$m_{w \to x} m_{x \to w}$ comes back, we obtain

$$Q_{x \cap w \to x}\, m_{x \to w}\, m_{w \to x} = \varphi_x \prod_{n \in \mathcal{N}_x} m_{n \to x} = \varphi^{\downarrow x}$$

as our table on x, so that Lemma 3.3 and Proposition 3.5 still hold.

The requirement that continuers should exist makes the Lauritzen-

Spiegelhalter architecture slightly less general than the Shafer-Shenoy architecture,

which allows negative entries in the tables $\varphi_x$. Continuers may fail to exist

when negative values are allowed. But if the product of the $\varphi_x$ is proportional

to a probability distribution, then we can take it for granted that all the entries

are nonnegative, because dropping minus signs will not change the product.

And, in this case, continuers exist by Proposition 1.1.

Notice the other implication of Proposition 1.1: we can choose the continuers

to be conditionals. More precisely, we can choose the continuer $Q_{x \cap w \to x}$ to be a

conditional with head $x \setminus w$ and tail $x \cap w$.
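
To make the idea concrete, here is a small sketch of how a continuer that is also a conditional might be computed for a nonnegative table with zeros. The function name and the choice of a uniform distribution where the marginal vanishes are our own assumptions for illustration; any values would serve as a continuer there.

```python
import numpy as np

def conditional_continuer(T, summed_axes):
    """Factor a nonnegative table as T = marginal * Q, with Q a conditional.

    `summed_axes` are the axes of T for the variables in x but not in w (the head of Q).
    Where the marginal is zero, T is zero as well, so any Q works; we pick a uniform one.
    """
    marg = T.sum(axis=summed_axes, keepdims=True)
    n_head = T.size // marg.size              # number of configurations of the head
    Q = np.full(T.shape, 1.0 / n_head)
    np.divide(T, marg, out=Q, where=marg > 0)  # divide only where the marginal is positive
    return marg, Q

# Hypothetical table on x = {1, 2} (axes: variable 1, variable 2), with x ∩ w = {1}.
T = np.array([[0.2, 0.6],
              [0.0, 0.0]])
marg, Q = conditional_continuer(T, summed_axes=(1,))
```

Multiplying `marg` by `Q` recovers `T`, and each row of `Q` sums to one, so `Q` is a conditional with head {2} and tail {1} even though ordinary division by `marg` would fail in the second row.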

When we look beyond probability to other problems satisfying the transitiv-

ity and combination axioms (see the exercises at the end of Chapter 1 and at the

end of this chapter), we find that the Shafer-Shenoy and Lauritzen-Spiegelhalter

architectures have overlapping but distinct ranges of application. The Shafer-

Shenoy architecture works whenever there are no restrictions on multiplication

and marginalization, even if continuers do not exist. The Lauritzen-Spiegelhalter

architecture, on the other hand, can sometimes work under restrictions on

multiplication or marginalization that prevent the use of the Shafer-Shenoy

architecture.

An important feature of the Lauritzen-Spiegelhalter architecture is that the product of the tables on the nodes remains

equal to $\varphi$ during the inward pass. This is clear when we divide: each time we

divide one of the tables by a message, we multiply another by the same message,

so the product does not change. It is equally clear in terms of continuers: each

time we factor a table into a marginal and a continuer and remove the marginal

from the node, we add it as a factor in another node.

Suppose we always choose the continuers to be conditionals. Then at the

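end of the inward pass, we have transformed the original factorization of $\varphi$,

$\varphi = \prod_{x \in \mathcal{N}} \varphi_x$, into a new factorization,

$$\varphi = \varphi^{\downarrow r} \prod_{x \in \mathcal{N} \setminus \{r\}} Q_{x \cap w(x) \to x}, \tag{3.6}$$

where w(x) is x's inward neighbor. This new factorization, as it turns out, can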

be interpreted as a construction sequence.

In order to make the interpretation as a construction sequence precise, let us

take one more step, continuing the inward pass, as it were, from r to the empty

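set $\emptyset$. In other words, we factor the marginal $\varphi^{\downarrow r}$ into the product of $\varphi^{\downarrow \emptyset}$ and

a continuer from $\emptyset$ to r. Since $\varphi$ is proportional to a probability distribution

P, $\varphi^{\downarrow \emptyset} \neq 0$, and hence the continuer is unique; it is the marginal $P^{\downarrow r}$. So

equation (3.6) becomes

$$\varphi = \varphi^{\downarrow \emptyset}\, P^{\downarrow r} \prod_{x \in \mathcal{N} \setminus \{r\}} Q_{x \cap w(x) \to x}. \tag{3.7}$$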

If we imagine a node $\emptyset$ added to the join tree, with an arrow to it from r,

then at the end of the inward pass, we have the factors on the right-hand side

of equation (3.7) on the nodes of the tree (see Figure 3.12).

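By Proposition 1.3, the probability distribution P is equal to $\varphi / \varphi^{\downarrow \emptyset}$. So

equation (3.7) tells us that

$$P = P^{\downarrow r} \prod_{x \in \mathcal{N} \setminus \{r\}} Q_{x \cap w(x) \to x}. \tag{3.8}$$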

It is the conditionals on the right-hand side of this equation that can be arranged

in a construction sequence for P. Indeed, suppose $x_1, \ldots, x_m$ is an ordering of

the nodes of the join tree that moves outward from the root, i.e., such that $x_1$

is the root and each later $x_i$ is an outward neighbor of one of $x_1, \ldots, x_{i-1}$. (Such

orderings exist in any tree.) Write $Q_i$ for $Q_{x_i \cap w(x_i) \to x_i}$, for $i = 2, \ldots, m$. Then

we have the following lemma.

LEMMA 3.7. $P^{\downarrow r}, Q_2, \ldots, Q_m$ is a construction sequence for P.

Proof. Equation (3.8) says that P is the product of $P^{\downarrow r}, Q_2, \ldots, Q_m$, and

the union of their heads is clearly equal to the domain of P. So to prove

the lemma, we need only show that the head of each conditional is disjoint from

the domain of the preceding ones. But this is an obvious property of join trees:

whenever we order the nodes in a sequence moving outward from a root, the

intersection of each node $x_i$ with the preceding nodes is always contained in its

inward neighbor $w(x_i)$, and hence $x_i \setminus w(x_i)$ is disjoint from $x_1 \cup \cdots \cup x_{i-1}$.

Lemma 3.7 says that at the end of the inward pass, the tables on the nodes

are conditionals, and any outward sequence is a construction sequence.

The outward pass of the Lauritzen-Spiegelhalter architecture can be under-

stood in terms of the construction sequences produced by the inward pass. Con-

sider, for example, the action of the outward pass on the path going outward

from the root r to a particular node x (Figure 3.13). It is evident that the

conditionals along this path form a construction chain for the marginal of P

on the variables involved, and the propagation outward in this chain is forward

propagation in the sense of Chapter 2.

As we have seen, the message from x in to w in the Lauritzen-Spiegelhalter

architecture is the Shafer-Shenoy message in that direction, $m_{x \to w}$, while the

message from w back out to x is the product of the Shafer-Shenoy messages in

both directions, $m_{x \to w} m_{w \to x}$. When we send $m_{x \to w}$ inward, we divide it out of

the table on x in order to compensate for its later return.

The Aalborg architecture takes a more direct tack. In this architecture, we

do not divide $m_{x \to w}$ out of the table on x as we send it inward. Instead, we save

$m_{x \to w}$ and divide it out of $m_{x \to w} m_{w \to x}$ when this message comes back. This

requires more storage, but it saves computation, because the division is now in

the domain $w \cap x$ rather than in the larger domain x. Each entry in $m_{x \to w}$

divides a whole row, as it were, in the table on x, but only a single entry in the

table $m_{x \to w} m_{w \to x}$.

Messages are stored in separators, just as in the Shafer-Shenoy architecture.

Each message is computed as in the Lauritzen-Spiegelhalter architecture: the

node marginalizes its current table to the intersection with the node to which

it is sending the message. On the inward pass, we both store the messages in

the separators (as in the Shafer-Shenoy architecture) and multiply them into the

receiving nodes (as in the elementary and Lauritzen-Spiegelhalter architectures).

On the outward pass, the separator divides the outward message by the message

it has stored before passing it on to be multiplied into the table on the receiving

node. (See Figure 3.14.) By the end, the initial table on each node x will be

multiplied by the Shafer-Shenoy messages from all of x's neighbors. So the final

table on x will be the marginal $\varphi^{\downarrow x}$.

When a node w computes a message for its outward neighbor x, its own table

is already its marginal, $\varphi^{\downarrow w}$. So the message it sends to the separator is $\varphi^{\downarrow x \cap w}$.

Since we are more interested in this marginal than in the Shafer-Shenoy message,

we store it in the separator after we forward its quotient by the old message.

FIG. 3.14. The inward and outward action of the Aalborg architecture between x and its

inward neighbor w. Here $\psi_x$ and $\psi_w$ are the tables on x and w, respectively, just before x

computes its message to w, and $\psi_x$ and $\psi'_w$ are the tables just before w computes a message

to send back. The table on w may have changed one or more times as a result of messages

from other outward neighbors and its own inward neighbor.

The action of the separator on the inward pass seems different from its action

on the outward pass, but Figure 3.15 shows how to describe it in a way that

makes it similar. Instead of beginning with the separator empty, we begin with

it containing $1_{w \cap x}$, a table of ones. Since In is the same as In/$1_{w \cap x}$, we can

say that here too the separator is sending forward a quotient rather than merely

sending forward the message it receives. Thus we have the uniform action shown

in Figure 3.16; the separator always stores New but sends forward New/Old.

In summary, the Aalborg architecture uses a rooted tree with a separator

between each pair of nodes. Initially, each node x has a table $\varphi_x$, and each

separator has a table of ones. The propagation follows these rules:

Rule 1. Each nonroot node waits to send its message to a given

neighbor until it has received messages from all its other neighbors.

Rule 2. The root waits to send messages to its neighbors until it has

received messages from them all.

Rule 3. When a node is ready to send its message to a particular

neighbor, it computes the message by marginalizing its current table

to its intersection with this neighbor, and then it sends the message

to the separator between it and the neighbor.

Rule 4. When a separator receives a message New from one of its two

nodes, it divides the message by its current table Old, sends the quotient

New/Old on to the other node, and then replaces Old with New.

Rule 5. When a node receives a message, it replaces its current table

with the product of that table and the message.
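
The five rules can be sketched compactly once each separator is given a stored table. The code below is an illustration under our own assumptions: tables are (variables, array) pairs on binary variables with strictly positive entries (so ordinary division suffices), the join tree is hypothetical, each separator is keyed by the outward node of its link, and Rules 1 and 2 are enforced implicitly by scheduling an inward pass followed by an outward pass.

```python
import numpy as np

def multiply(f, g):
    (fv, fa), (gv, ga) = f, g
    out = sorted(set(fv) | set(gv))
    return tuple(out), np.einsum(fa, list(fv), ga, list(gv), out)

def marginalize(f, keep):
    fv, fa = f
    out = [v for v in fv if v in keep]
    return tuple(out), np.einsum(fa, list(fv), out)

def divide(f, g):
    """Divide table f by a table g on a subset of f's variables (g must have no zeros)."""
    (fv, fa), (gv, ga) = f, g
    return fv, np.einsum(fa, list(fv), 1.0 / ga, list(gv), list(fv))

def outward_order(edges, root):
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    parent, order = {root: None}, [root]
    for x in order:                          # the list grows while we walk it (breadth-first)
        for n in sorted(nbrs[x]):
            if n not in parent:
                parent[n] = x
                order.append(n)
    return parent, order

def aalborg(tables, edges, root):
    parent, order = outward_order(edges, root)
    T = dict(tables)                         # current tables on the nodes
    U = {}                                   # separator tables, keyed by the outward node
    for x in order[1:]:
        sep, arr = marginalize(T[x], set(T[parent[x]][0]))
        U[x] = (sep, np.ones_like(arr))      # each separator starts as a table of ones

    def send(src, dst, s):
        new = marginalize(T[src], set(U[s][0]))           # Rule 3
        T[dst] = multiply(T[dst], divide(new, U[s]))      # Rules 4 and 5: absorb New/Old
        U[s] = new                                        # the separator now stores New

    for x in reversed(order[1:]):            # inward pass
        send(x, parent[x], x)
    for x in order[1:]:                      # outward pass
        send(parent[x], x, x)
    return T, U

rng = np.random.default_rng(4)
node_vars = {"a": (1, 2), "b": (2, 3), "c": (3, 4), "d": (3, 5)}
tables = {name: (vs, rng.random((2,) * len(vs)) + 0.1) for name, vs in node_vars.items()}
edges = [("a", "b"), ("b", "c"), ("b", "d")]
T, U = aalborg(tables, edges, root="b")
```

Note that the same `send` step serves both passes, which is the uniform action of Figure 3.16: the separator stores New and forwards New/Old.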

FIG. 3.15. If we suppose that the separator begins with a table of ones, then the inward

action is the same as the outward.

FIG. 3.16. The uniform action of the Aalborg architecture: When u sends New to its

neighbor v, the message is intercepted by the separator, which divides it by Old and passes the

quotient on.

Rules 1 and 2 force the propagation to move in to the root and then back out.

At the end of the propagation, the tables on all the nodes and separators are

marginals of $\varphi$, where $\varphi = \prod_x \varphi_x$.

Dealing with zeros. We have again been making the simplifying assumption

that there are no negative or zero values in the $\varphi_x$, so that division is always

possible. Now let us relax this to the assumption that there are no negative

values, which is sufficient for continuers to exist.

When zeros are not allowed in the table Old, the quotient New/Old is the

unique solution $\psi$ of the equation $\text{Old} \cdot \psi = \text{New}$. As it turns out, this equation

can still be solved when we allow zeros; the solution is not unique, but it does

not matter what solution we use. So there are two ways we can proceed. We

can stop talking about division and talk instead about solving the equation

$\text{Old} \cdot \psi = \text{New}$. Or we can extend the definition of division by picking out a

particular solution of the equation $\text{Old} \cdot \psi = \text{New}$ and calling it the quotient

New/Old.

We will explore both approaches. First, let us see what happens when we

drop talk about division. Since division appears only in Rule 4, all we need to

do is replace that rule with the following rule:

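Rule 4'. When a separator receives a message from one of its two nodes,

say New, it solves the equation

$$\text{Old} \cdot \psi = \text{New} \tag{3.9}$$

for $\psi$ and sends $\psi$ on to its other node. It then discards Old and

stores New in its place.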

As the following proposition shows, this works; it is always possible to solve

equation (3.9), and doing so produces the result we want.

PROPOSITION 3.6. If there are no negative values in the initial tables on the

nodes, then propagation under Rules 1, 2, 3, 4', and 5 will result in each node and

separator containing its marginal of $\varphi$.

Proof. Since the propagation proceeds inward just as in the elementary archi-

tecture, the root will have its marginal at the end of the inward pass. So we can

prove the proposition by induction on the outward pass. Suppose propagation

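to w on the outward pass has resulted in the table $\varphi^{\downarrow w}$ on w, and let us show

that the next step will produce $\varphi^{\downarrow x}$ on w's outward neighbor x. On the inward

pass, x had sent in $m_{x \to w}$, and w now sends back $\varphi^{\downarrow x \cap w}$, or $m_{x \to w} m_{w \to x}$. So

equation (3.9) can be rewritten as

$$m_{x \to w}\, \psi = \varphi^{\downarrow x \cap w} \tag{3.10}$$

or

$$m_{x \to w}\, \psi = m_{x \to w}\, m_{w \to x}. \tag{3.11}$$

Equation (3.11) obviously has a solution, but it may have more than one. We

need to show that any solution will produce the marginal on x when it multiplies

the table now on x. To this end, let $Q_{x \cap w \to x}$ be a Lauritzen-Spiegelhalter continuer

for x. The current table on x is $Q_{x \cap w \to x} m_{x \to w}$, so the result of multiplying

it by any solution of equation (3.10) is

$$Q_{x \cap w \to x}\, m_{x \to w}\, \psi = Q_{x \cap w \to x}\, m_{x \to w}\, m_{w \to x} = \varphi^{\downarrow x}.$$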

Though the solution $\psi$ of equation (3.11) may not be unique, the range of

choice is simple. Since all the tables involved in the equation are the same size,

the multiplications are all entry-by-entry. When an entry in $m_{x \to w}$ is nonzero,

the corresponding entry in $\psi$ is unique; we obtain it by division. When an entry

in $m_{x \to w}$ is zero, the corresponding entry in $m_{x \to w} m_{w \to x}$ is also zero, and so we

can choose the entry in $\psi$ however we please. It is this fact (that we

can choose the entries of $\psi$ arbitrarily when they are not fully determined) that

allows us to handle the situation by extending the definition of division.

In the case at hand, we want to divide one table by another of the same

size, but with an eye to further developments, let us consider a more general

situation, where we want to divide one table by another of the same or possibly

smaller size. Say we want to divide a table B on y by a table A on x, where

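$x \subseteq y$. We will show how to do so under the assumption that whenever an entry

in A is zero, everything in the corresponding row in B is zero, i.e.,

$$A(c^{\downarrow x}) = 0 \implies B(c) = 0 \quad \text{for every configuration } c \text{ of } y,$$

or, equivalently,

$$B(c) \neq 0 \implies A(c^{\downarrow x}) \neq 0 \quad \text{for every configuration } c \text{ of } y,$$

where $c^{\downarrow x}$ denotes the restriction of c to the variables in x.

We will say that A supports B when this condition is met. Given a table A on

x that supports a table B on y, we define a table B/A on y by

$$(B/A)(c) = \begin{cases} B(c)/A(c^{\downarrow x}) & \text{if } A(c^{\downarrow x}) \neq 0, \\ 0 & \text{if } A(c^{\downarrow x}) = 0. \end{cases} \tag{3.14}$$

Here we have set the value of the quotient equal to zero when the value of the

denominator is zero. Any other number would do just as well for our immediate

purpose, but zero will prove convenient later.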
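
The extended definition of division is easy to state in code. The sketch below assumes tables are NumPy arrays, that the axes of A appear in B in the same relative order, and that the caller lists which axes of B carry A's variables; the names `expand`, `supports`, and `extended_divide` are invented for the example.

```python
import numpy as np

def expand(A, B, A_axes):
    """Broadcast A (a table on x) to the shape of B (a table on y containing x).

    A_axes lists, in increasing order, the axes of B that carry x's variables.
    """
    shape = [B.shape[ax] if ax in A_axes else 1 for ax in range(B.ndim)]
    return np.broadcast_to(A.reshape(shape), B.shape)

def supports(A, B, A_axes):
    """True if every entry of B is zero wherever the corresponding entry of A is zero."""
    return bool(np.all(B[expand(A, B, A_axes) == 0] == 0))

def extended_divide(B, A, A_axes):
    """B/A, with the quotient set to zero wherever the denominator is zero."""
    A_full = expand(A, B, A_axes)
    return np.divide(B, A_full, out=np.zeros(B.shape), where=A_full != 0)

# Hypothetical tables: A on x = {1}, B on y = {1, 2} (axes: variable 1, variable 2).
A = np.array([0.5, 0.0])
B = np.array([[0.2, 0.3],
              [0.0, 0.0]])
quotient = extended_divide(B, A, A_axes=[0])
```

Because A supports B here, multiplying `quotient` back by A recovers B exactly, even though one entry of A is zero.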

This extended definition of division immediately yields the following lemma.

LEMMA 3.8. If A supports B, then $A \cdot (B/A) = B$.

PROPOSITION 3.7. If there are no negative values in the initial tables on the

nodes, and we use equation (3.14) as the definition of division, then propagation

under Rules 1, 2, 3, 4, and 5 will result in each node and separator containing its

marginal of $\varphi$.

Proof. Since Old is $m_{x \to w}$ and New is $m_{x \to w} m_{w \to x}$, Old supports New. So

by Lemma 3.8, New/Old, defined as in equation (3.14), is a solution of equa-

tion (3.9). So Rule 4 with our extended definition of division is a special case of

Rule 4', and the proposition follows from Proposition 3.6.

As the following lemma asserts, we can work with extended division in much

the same way that we work with ordinary division. We can combine numerators

and denominators (statement 5), and we can cancel factors in denominators by

multiplication (statement 6).

LEMMA 3.9.

1. $(B/A)(c) = 0$ if and only if $B(c) = 0$.

2. If A supports B, then A supports BC.

3. If A supports B and C supports D, then AC supports BD.

4. If B is a table on y and $x \subseteq y$, then $B^{\downarrow x}$ supports B.

5. If A supports B, then $\frac{B}{A}\, C = \frac{BC}{A}$.

6. If A supports B and C supports D, then $\frac{B}{A} \cdot \frac{D}{C} = \frac{BD}{AC}$.

8. If A and C both support B, then $\frac{B}{A} = \frac{BC}{AC}$. (This may not be true if C

does not support B.)

9. If A on x supports B on y, then $\left( \frac{B}{A} \right)^{\downarrow x} = \frac{B^{\downarrow x}}{A}$.

We leave the proofs of these statements to the reader. In contrast to

Lemma 3.8, most of them (namely, 1 and 5-9) do depend on our having chosen

zero as the value of a quotient when the denominator is zero.

The Aalborg formula. Let us return, for just a moment, to the assumption

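that our tables never have zero entries. Write $\mathcal{N}$ for the set of nodes, $\mathcal{S}$ for the

set of separators, $T_x$ for the current table on the node x, and $U_s$ for the current

table on the separator s. At the beginning of the propagation, $T_x = \varphi_x$, $U_s = 1_s$,

and hence

$$\frac{\prod_{x \in \mathcal{N}} T_x}{\prod_{s \in \mathcal{S}} U_s} = \varphi.$$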

At each step, we change the table on one node and on one separator. The table

on the node is multiplied by New/Old, and the table on the separator is changed

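from Old to New, i.e., it also is multiplied by New/Old. Since the table on the

node is multiplied by the same factor as the table on the separator, the ratio

does not change. So throughout the propagation,

$$\varphi = \frac{\prod_{x \in \mathcal{N}} T_x}{\prod_{s \in \mathcal{S}} U_s}. \tag{3.16}$$

This is the Aalborg formula. In words, the function whose marginals we want is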

always the ratio of the product of the tables on the nodes to the product of the

tables on the separators.
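
The invariance expressed by the Aalborg formula can be checked numerically on the smallest nontrivial case. The sketch below assumes a hypothetical tree with two nodes, x = {1, 2} and w = {2, 3}, joined by a single separator on {2}, with binary variables and strictly positive random tables; the helper `ratio` and the other names are ours.

```python
import numpy as np

def ratio(T_x, T_w, U):
    """Product of the node tables over the separator table, as a table on (1, 2, 3)."""
    return T_x[:, :, None] * T_w[None, :, :] / U[None, :, None]

rng = np.random.default_rng(3)
T_x = rng.random((2, 2)) + 0.1   # table on x = {1, 2}; axes (variable 1, variable 2)
T_w = rng.random((2, 2)) + 0.1   # table on w = {2, 3}; axes (variable 2, variable 3)
U = np.ones(2)                   # separator table on {2}, initially a table of ones

phi = ratio(T_x, T_w, U)         # initially just the product of the node tables
unchanged = []

# Inward step: x sends New to the separator, which forwards New/Old to w.
new = T_x.sum(axis=0)
T_w = T_w * (new / U)[:, None]
U = new
unchanged.append(np.allclose(ratio(T_x, T_w, U), phi))

# Outward step: w sends New to the separator, which forwards New/Old to x.
new = T_w.sum(axis=1)
T_x = T_x * (new / U)[None, :]
U = new
unchanged.append(np.allclose(ratio(T_x, T_w, U), phi))
```

After both steps the ratio is still `phi`, and the tables on x, w, and the separator are the marginals of `phi` on {1, 2}, {2, 3}, and {2}.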

The Aalborg formula still holds even if zero entries are allowed in our tables,

but the reasoning with which we established it holds only if we plug a couple of

holes.

First, we must check that $\prod_{s \in \mathcal{S}} U_s$ always supports $\prod_{x \in \mathcal{N}} T_x$, so that the

ratio (3.16) is defined. To check this, we write x(s) for the outward neighbor

of the separator s. Since $U_s$, if it is not equal to $1_s$, is a marginal of $T_{x(s)}$, $U_s$

supports $T_{x(s)}$ (statement 4 of Lemma 3.9). Hence $\prod_{s \in \mathcal{S}} U_s$ supports $\prod_{s \in \mathcal{S}} T_{x(s)}$

(statement 3) and also $T_r \prod_{s \in \mathcal{S}} T_{x(s)}$ (statement 2), which is equal to $\prod_{x \in \mathcal{N}} T_x$.

Second, we must check that multiplying the top and bottom of the ratio

(3.16) by New/Old will not change it. This follows from statements 6 and 8

of Lemma 3.9, together with the fact that New/Old supports the numerator. We

know that New/Old supports the numerator because New is a marginal of one

of its factors, and by statement 1 of Lemma 3.9, New/Old supports whatever

New supports.

There is one point of notation that should be clarified in connection with the

Aalborg formula. For simplicity, we have been using a notation that identifies

each node x with a set of variables. We could also identify each separator with a

set of variables; we could say that the separator s between the nodes u and v is

equal to $u \cap v$. It is better, however, to assume that the names of the separators

are distinct from the sets of variables involved, for two or more separators might

involve the same set of variables. (We might have one pair of neighboring nodes

$u_1$ and $v_1$ and another pair $u_2$ and $v_2$ with $u_1 \cap v_1 = u_2 \cap v_2$.) It would burden

our notation unnecessarily for us to introduce distinct symbols for the separator

and its set of variables, but the distinction should be kept in mind, even when,

as will happen shortly, we write as if they are the same.

Although we have described the Aalborg propagation

under the assumption that the tables on the separators are initially tables of

ones, this assumption too can be relaxed. Suppose we put nonnegative tables

Tx and Us on the nodes and separators in such a way that the table on each

separator supports the tables on the neighboring nodes. Then the denominator

in equation (3.16) supports the numerator. If we set the quotient equal to $\varphi$ and

propagate by the Aalborg rules, then we have the following proposition.

PROPOSITION 3.8. At the end of the propagation, the tables on the nodes and

separators will be the corresponding marginals of $\varphi$.

Proof. By statements 5 and 6 of Lemma 3.9,

where x(s) is the outward neighbor of the separator s. This suggests that we

compare propagation with $U_s$ on s and $T_x$ on x to propagation with $1_s$ on s, $T_r$

on r, and $T_{x(s)}/U_s$ on x(s). Call the former the loaded propagation (because the

separators are loaded at the beginning) and the latter the adjusted propagation

(because the tables on the nodes are adjusted). We know that the adjusted

propagation results in the marginals of $\varphi$ on all the nodes and separators; let us

show that the loaded propagation gives the same results.

For the moment, we reserve Tx and Us for the initial tables in the loaded

propagation; we write $T_x^{\mathrm{loaded}}$ and $U_s^{\mathrm{loaded}}$ for the current tables in the loaded

propagation and $T_x^{\mathrm{adjusted}}$ and $U_s^{\mathrm{adjusted}}$ for the current tables in the adjusted

propagation. Initially,

and

These equations will hold throughout the inward pass, for if they hold before an

inward step, they hold after it. To see this, write $M_{x(s) \to s}$ for the message from

the inward loaded message from x(s) is multiplied by Us in comparison with the

inward adjusted message. Since this is the new table for s, equation (3.20) will

still hold. But the loaded propagation divides Us out before sending the message

on to the neighbor w; hence the message multiplied into w is the same in the two

propagations, and the relation between $T^{\mathrm{loaded}}$ and $T^{\mathrm{adjusted}}$ (equation (3.19) or

(3.21)) will also be unaffected.

Since the root has the same table at the end of the inward pass in the two

propagations, it sends the same messages back out. So we can complete the

proof by induction on the outward pass. We need only show that if the message

from w out back to s is the same in the two propagations, then the table on x(s)

will end up the same. But if we write $M_{w \to s}$ for the message from w back to s,

then the table we get on x(s) in the loaded propagation is

The Aalborg formula can be used to find a probability distribution that has

given marginals.

PROPOSITION 3.9. Suppose we are given a probability distribution Tx for

each node x in a join tree. And suppose these distributions are consistent in

the sense that for neighboring nodes x and y, $T_x^{\downarrow x \cap y} = T_y^{\downarrow x \cap y}$. Set $U_s$ for the

separator s between x and y equal to this common marginal. Then the func-

tion f given by equation (3.17) is a probability distribution with the Tx as its

marginals.

Proof. When we run the Aalborg propagation, nothing changes. The tables

on the separators are already the marginals of the tables on the nodes, so the

message to the separator is always identical with the table already there, and

the ratio, which is passed on to the neighboring node, is always a table of ones.

So the tables are already the marginals of $\varphi$. And any nonnegative table with a

probability distribution as a marginal is itself a probability distribution.
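The construction in Proposition 3.9 can be checked numerically on a small case. The sketch below assumes that equation (3.17) is the quotient of the product of the node tables by the product of the separator tables, as the surrounding discussion suggests. The two-node join tree {A, B} and {B, C}, the binary frames, the numbers, and the helper names `marginal` and `close` are all hypothetical choices made for illustration.

```python
from itertools import product

# Hypothetical consistent distributions on the nodes {A, B} and {B, C}.
# Keys are configurations (a, b) and (b, c); every variable is binary.
T_ab = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
T_bc = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.5, (1, 1): 0.1}


def marginal(table, keep):
    """Sum a table down to the coordinate positions listed in `keep`."""
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in keep)
        out[key] = out.get(key, 0.0) + value
    return out


def close(t1, t2, tol=1e-12):
    """True if two tables have the same keys and nearly equal values."""
    return t1.keys() == t2.keys() and all(abs(t1[k] - t2[k]) < tol for k in t1)


# Consistency: both node tables have the same marginal on the separator {B}.
U_b = marginal(T_ab, (1,))
assert close(U_b, marginal(T_bc, (0,)))

# Quotient of node tables by the separator table (0/0 is taken to be 0).
phi = {}
for a, b, c in product((0, 1), repeat=3):
    denom = U_b[(b,)]
    phi[(a, b, c)] = T_ab[(a, b)] * T_bc[(b, c)] / denom if denom != 0 else 0.0

# phi is a probability distribution with the given tables as marginals.
assert abs(sum(phi.values()) - 1.0) < 1e-12
assert close(marginal(phi, (0, 1)), T_ab)
assert close(marginal(phi, (1, 2)), T_bc)
```

The final assertions restate the conclusion of the proposition for this example: the quotient sums to one and reproduces each node table as a marginal.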

The three major architectures we have studied in this chapter (the Shafer-Shenoy,

Lauritzen-Spiegelhalter, and Aalborg architectures) move inward in a

tree and then back outward. How should we organize or program this move-

ment? This is a very general question, for many computations are tree recursive.

But we should take a moment to consider it.

We have described each of the three architectures by giving, along with rules

for what the nodes do, rules for when they are allowed to do it. The simplicity

of this description made it convenient for the theoretical understanding we have

been seeking, but at the programming level, it suggests a rather expensive control

regime in which each node constantly checks on whether it is allowed to act. In

a serial machine, we seem to be suggesting a regime (as in a rule-based program)

in which we constantly search for nodes that are ready to act (rules that are

ready to fire).

A more economical approach is to use the connections of the tree to propagate

signals to act as well as the results of actions. To trigger the inward pass, we

can have the root ask for inward messages from its neighbors, which, in order

to comply with the request, must ask for inward messages from their other

neighbors, and so on. To trigger the outward pass, we can have the root send

messages to its neighbors, together with the request that they pass messages on

to their other neighbors, and so on.

If we run the propagation in this way, the root need not be specified in the

data structure representing the tree; it is merely the node at which we begin the

propagation. Having propagated with one node as the root, and perhaps then

having made changes in the input tables, we can propagate with a different node

as the root.

The tree itself can be represented in object-oriented fashion, with each node

as an object. Each node has a list of neighbors and the ability to communicate

with these neighbors. At a coarse level of description that is common to all three

architectures, a node has two actions, COLLECT, which is used on the inward

pass, and DISTRIBUTE, which is used on the outward pass. Both actions can be

called from outside the system or from a neighboring node. These actions are

recursive, and they also trigger a more basic action, SENDMESSAGE.

When the action COLLECT is called in a node from outside the system, that

node in turn calls COLLECT in all its neighbors. When COLLECT is called in a

node by a neighbor, that node calls COLLECT in all its other neighbors and also,

after the neighbors have completed their action, performs SENDMESSAGE to the

neighbor that made the call. This means that we can trigger the inward pass

simply by calling COLLECT in the node that we want to act as the root. The

call is automatically relayed out toward the leaves, and when it has reached the

leaves, the messages come back in (Figure 3.17).

When the action DISTRIBUTE is called in a node from outside the system,

that node performs SENDMESSAGE to each neighbor and then calls DISTRIBUTE

in that neighbor. When DISTRIBUTE is called in a node by a neighbor, the node

performs SENDMESSAGE to and calls DISTRIBUTE in its other neighbors. So we

can trigger the outward pass by calling DISTRIBUTE in the node we have chosen

to be the root. The call will automatically move outward in the tree, preceded

by outward messages (Figure 3.18).
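The recursive control just described can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular system: the class name `Node`, the `caller` argument, the helper `connect`, and the shared `log` list are assumptions introduced here, and `send_message` is only a placeholder that records who sends to whom in place of the architecture-specific SENDMESSAGE.

```python
class Node:
    """A join-tree node that knows only its neighbors; no root is stored."""

    def __init__(self, name, log):
        self.name = name
        self.neighbors = []
        self.log = log  # shared list recording (sender, receiver) pairs

    def send_message(self, receiver):
        # Placeholder for the architecture-specific SENDMESSAGE action.
        self.log.append((self.name, receiver.name))

    def collect(self, caller=None):
        # Relay the call to every neighbor except the one that called us.
        for nb in self.neighbors:
            if nb is not caller:
                nb.collect(caller=self)
        # After those neighbors have finished, send the inward message.
        if caller is not None:
            self.send_message(caller)

    def distribute(self, caller=None):
        # Send each outward message, then relay the call to that neighbor.
        for nb in self.neighbors:
            if nb is not caller:
                self.send_message(nb)
                nb.distribute(caller=self)


def connect(u, v):
    """Make u and v neighbors of each other."""
    u.neighbors.append(v)
    v.neighbors.append(u)


# A small hypothetical tree: a - b, with b also joined to c and d.
log = []
a, b, c, d = (Node(name, log) for name in "abcd")
connect(a, b)
connect(b, c)
connect(b, d)

a.collect()            # calling COLLECT from outside makes a the root
inward = list(log)     # [('c', 'b'), ('d', 'b'), ('b', 'a')]
log.clear()
a.distribute()         # calling DISTRIBUTE from outside, again at a
outward = list(log)    # [('a', 'b'), ('b', 'c'), ('b', 'd')]
```

Because the root is simply the node on which the outside call is made, calling `c.collect()` and `c.distribute()` on the same objects would propagate with c as the root instead.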

The action SENDMESSAGE differs from architecture to architecture. In the

Lauritzen-Spiegelhalter architecture, there are actually two distinct SENDMESSAGE

actions, SENDMESSAGEIN, which is used by COLLECT, and SENDMESSAGEOUT,

which is used by DISTRIBUTE. But the other two architectures,

the Shafer-Shenoy architecture and the Aalborg architecture, use the same

SENDMESSAGE in COLLECT as in DISTRIBUTE.

FIG. 3.17. After COLLECT is called outward from the root, messages move inward.

the sending node and the receiving node. The message sent is divided out of

the table in the first and multiplied into the table in the second. The action

SENDMESSAGEOUT, on the other hand, affects only the receiving node.

The description of SENDMESSAGE in the Shafer-Shenoy and Aalborg archi-

tectures is affected by where we place the separators. In the case of the Shafer-

Shenoy architecture, it is most convenient to split the separator and put each

storage register in the node to which its messages are directed, so that the effect

of SENDMESSAGE is to fill the storage register in the receiving node. In the case

of the Aalborg architecture, it seems most appropriate to place copies of the

separator in both nodes; when a message is sent, it is stored in the copy in the

sending node and then sent to the receiving node, where it is stored again after

being used to compute the quotient that is multiplied into the node's main table.
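For the Aalborg case, the bookkeeping just described might look as follows. This is a sketch under stated assumptions: tables are plain dictionaries from configurations to numbers, each node is a dictionary with the hypothetical keys `vars`, `table`, and `sep` (its own copy of the separator table), the function names are invented here, and 0/0 is taken to be 0.

```python
def marginalize(variables, table, keep):
    """Sum `table` (configuration tuple -> value) down to the variables in `keep`."""
    idx = [variables.index(v) for v in keep]
    out = {}
    for config, value in table.items():
        key = tuple(config[i] for i in idx)
        out[key] = out.get(key, 0.0) + value
    return out


def aalborg_send(sender, receiver, sep_vars):
    """One Aalborg-style message from `sender` to `receiver` across a separator."""
    new = marginalize(sender["vars"], sender["table"], sep_vars)
    old = receiver["sep"]
    # Quotient New/Old on the separator, with 0/0 taken to be 0.
    ratio = {k: (new[k] / old[k] if old[k] != 0 else 0.0) for k in new}
    sender["sep"] = new  # stored in the sending node's copy
    idx = [receiver["vars"].index(v) for v in sep_vars]
    receiver["table"] = {
        config: value * ratio[tuple(config[i] for i in idx)]
        for config, value in receiver["table"].items()
    }
    receiver["sep"] = new  # stored again in the receiving node's copy


# Hypothetical example: x holds P(A)P(B|A), y holds P(C|B); separators start as ones.
x = {"vars": ("A", "B"),
     "table": {(0, 0): 0.3, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.3},
     "sep": {(0,): 1.0, (1,): 1.0}}
y = {"vars": ("B", "C"),
     "table": {(0, 0): 0.2, (0, 1): 0.8, (1, 0): 0.5, (1, 1): 0.5},
     "sep": {(0,): 1.0, (1,): 1.0}}

aalborg_send(x, y, ("B",))  # inward message, with y as the root
aalborg_send(y, x, ("B",))  # outward message back to x
```

After the two calls, `y["table"]` holds the joint distribution of B and C, and the second message leaves `x["table"]` essentially unchanged because the quotient on the separator is a table of ones.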

To complete the picture, we can also provide each node with a REPORT

action, which results in the node's marginal being sent to the user of the system.

In the Lauritzen-Spiegelhalter and Aalborg architectures, this action involves

no computation, but in the Shafer-Shenoy architecture, it requires the node to

collect the messages in its separators and multiply them all into its main table.

We can make REPORT an action that is called from outside the system, or we

can make it part of DISTRIBUTE, so that marginals are reported as the outward

pass proceeds.

FIG. 3.18. As DISTRIBUTE is called outward from the root, messages move outward.

Join-tree propagation may or may not succeed in finding marginals of a par-

ticular product of tables. It will not succeed if the belief net is so highly con-

nected that no feasible join-tree cover exists. In this case, we may be able to

use approximate rather than exact methods. Presently, the most widely used

approximate methods are Gibbs sampling and its cousins, methods now collectively

called "Markov-chain Monte Carlo." These methods were proposed

for probabilistic expert systems by Pearl [43], but they have been less success-

ful for expert systems than for vision (Geman and Geman [29]) and Bayesian

statistics (Besag et al. [13]). The small or zero conditional probabilities often

encountered in expert systems (where a priori knowledge is stronger) tend to

violate the conditions that allow the Markov-chain methods to converge. A re-

cent candidate to fill the gap left by the weakness of Markov-chain methods for

expert systems is mean-field theory, also borrowed from statistical physics (Saul

et al. [44]).

In this chapter, we have discussed only the problem of finding marginals

of probability distributions given as products of tables. In principle, join-tree

propagation is applicable to finding marginals in any other problem in which the

transitivity and combination axioms are satisfied. (Examples are given in the

exercises.) There are, however, problems in which the axioms are satisfied but

the operations are not feasible. Join-tree propagation depends on marginaliza-

numbers of variables), and sometimes it is not. Continuous probability den-

sities provide an example. We know how to marginalize (integrate) in many

parametric families of densities, but multiplication usually takes us outside the

parametric family, producing densities that are difficult to integrate, even if

only a few variables are involved. As a practical matter, join-tree propaga-

tion for continuous densities has been limited mainly to the multivariate nor-

mal distribution, where it is often discussed in connection with the Kalman

filter.

We should also note another limitation of the join-tree method: in general,

it only helps us find marginals for small clusters of variables. In many problems,

we want to compute other numbers: probabilities involving many variables and

expectations. Markov-chain Monte Carlo, when it works, allows us to compute

these numbers as well.

Exercises.

EXERCISE 3.1. How great is the computational advantage of the Lauritzen-

Spiegelhalter architecture over the Shafer-Shenoy architecture? For a first pass

at answering this question, you may wish to assume that each nonleaf in the join

tree has the same number of neighbors (the tree's "branching factor"), that each

variable has the same number of elements in its frame, and that each node has

the same number of variables in common with its branch as well as the same

number of new variables.

EXERCISE 3.2. Compare the three architectures on the basis of their storage

requirements. Consider the case where we need to keep the initial inputs and the

case where we do not.

EXERCISE 3.3. Show how to use join-tree computation to find $P^{\downarrow w}(x)$ for

any set w of variables and any single configuration x of w, even if w is too large

to be contained in any node of the join tree. (Hint: Pretend x is observed, and

exploit the fact that $P^{\downarrow w}(x)$ is the inverse of the normalizing constant for the

posterior probabilities.)

EXERCISE 3.4. Discuss ways of measuring the amount of computation re-

quired by a join tree. (In the introduction to Chapter 3, two measures were

suggested: the sum of the sizes of the frames, and the size of the largest frame.)

Discuss the issue separately for probability propagation and for each of the prob-

lems listed in Exercise 1.2.

EXERCISE 3.5. Verify that the elementary and Shafer-Shenoy architectures

always work in the abstract framework you formulated in Exercise 1.5.

EXERCISE 3.6. Explore the analogy between the outward pass of the

Lauritzen-Spiegelhalter architecture and the outward pass in recursive dynamic

programming, in which solutions of reduced problems are used to build up an

overall solution (Mitten [40], Bertele and Brioschi [1], Shenoy [46]). Formulate

an abstract theory that includes both examples as special cases.

ditionals in the nodes of a join tree in order for the results of Shafer-Shenoy

computations to remain within the partial semigroup of conditionals? (See

Exercise 2.5.) Explore conditions on the existence of continuers that allow the

Lauritzen-Spiegelhalter architecture to work in this context.

EXERCISE 3.8. In some problems, the mathematical objects that one com-

bines can be embedded in a larger class that comes closer to being a group, so

that the division required by the Aalborg architecture is possible. Discuss the

extent to which this is possible in the examples considered in Exercise 1.2.

CHAPTER 4

Resources and References

4.1. Meetings.

The annual Conference on Uncertainty in Artificial Intelligence (UAI) plays a

leading role in the development of probabilistic, belief-function, fuzzy, and qual-

itative expert systems. Papers given in its first six years (1985-1990) were col-

lected and published by North-Holland in a series entitled Uncertainty in Arti-

ficial Intelligence. Proceedings of subsequent meetings have been published by

Morgan Kaufmann. The Association for Uncertainty in Artificial Intelligence,

the sponsor of the conference, has a site on the World-Wide Web:

http://www.auai.org/

This site gives instructions for subscribing to the association's electronic mailing

list and includes links to many other sources of information about the manage-

ment of uncertainty in expert systems.

The biennial International Workshop on Artificial Intelligence and Statistics

is also devoted in part to uncertainty in expert systems. The Web site for its

sponsor, the Society for Artificial Intelligence and Statistics, is

http://www.vuse.vanderbilt.edu/~dfisher/ai-stats/society.html

This site is maintained by Douglas H. Fisher at Vanderbilt University.

Another important conference for this community is the International Confer-

ence on Information Processing and Management of Uncertainty in Knowledge-

Based Systems (IPMU), which has been held biennially since 1986. The proceedings

of the most recent conference, held in Paris in 1994, were published

by Springer-Verlag in 1995 under the title Advances in Intelligent Computing,

edited by Bernadette Bouchon-Meunier, Ronald R. Yager, and Lotfi A. Zadeh.

4.2. Software.

A number of software packages for probabilistic expert systems are available.

The most highly developed is the commercial product HUGIN. Developed at

Aalborg, Denmark, it uses the Aalborg architecture described in Chapter 3.

Information on HUGIN can be obtained at:

http://www.hugin.dk

cinella. Developed by the IRIDIA research group in Brussels, it handles belief

functions, categorical judgments, and possibility measures as well as probabili-

ties. It is implemented in Common Lisp and is distributed free. Information is

available from IRIDIA's Web site:

http://iridia.ulb.ac.be/pulcinella/

Further information on these and other packages, some commercial and some

free, is available at a Web site maintained by Russell Almond:

http://bayes.stat.washington.edu/almond/belief.html

4.3. Books.

There are now many excellent books on probabilistic expert systems and related

topics.

[1] Bertele, Umberto, and Francesco Brioschi (1972). Nonserial Dynamic

Programming. Academic Press. New York. A readable treatment

of join-tree computation for decomposable dynamic programming

problems.

[2] Diestel, R. (1990). Graph Decompositions. Clarendon Press. Ox-

ford. A general perspective on decompositions of the type exemplified

by join trees, with hints at the diversity of the applied problems that

inspire these decompositions.

[3] Jensen, Finn V. (1996). An Introduction to Bayesian Networks.

University College Press. London. An engaging and readable intro-

duction to probabilistic networks, with an emphasis on construction

and computation within the Aalborg architecture.

[4] Judd, J. Stephen (1990). Neural Network Design and the Com-

plexity of Learning. MIT Press. Cambridge. This interesting and

readable book demonstrates the relevance of join-tree ideas to the

problem of learning in neural networks.

[5] Lauritzen, Steffen L. (1996). Graphical Models. Oxford Univer-

sity Press. London. A superb treatment of probabilistic networks as

models for data, this book marries probabilistic expert systems with

up-to-date statistical methodology. Relatively comprehensive, it cov-

ers undirected as well as directed graphs, and continuous (normal)

as well as discrete probability distributions. Its greatest originality

lies in its treatment of mixed cases: chain graphs, which combine

directed and undirected graphs, and models with both discrete and

continuous variables.

[6] Neapolitan, E. (1990). Probabilistic Reasoning in Expert Systems.

John Wiley. New York. This readable book covered the state of the

publication. It is now somewhat dated.

[7] Oliver, Robert M., and James Q. Smith, eds. (1990). Influence

Diagrams, Belief Nets, and Decision Analysis. John Wiley. New

York. Still a good introduction to the motivations behind influence

diagrams, which generalize probabilistic expert systems by including

variables representing a user's decisions. It includes an introduc-

tory essay by Ron Howard, the most influential proponent of these

diagrams.

[8] Pearl, Judea (1988). Probabilistic Reasoning in Intelligent Sys-

tems. Morgan Kaufmann. San Mateo, California. In a series of

articles preceding this book, its author initiated the study and use

of probabilistic expert systems as the term is now understood. The

book, lively and energetic, introduced them to a wide audience.

[9] Shafer, Glenn (1996). The Art of Causal Conjecture. MIT Press.

Cambridge. A study of causality in terms of the dynamics of proba-

bility, this book shows that the causal interpretation of probabilistic

expert systems, like the causal interpretation of other statistical mod-

els, is often complex: models may have more than one possible causal

interpretation. This book also explores some generalizations of the

DAG structure.

[10] Shafer, Glenn, and Judea Pearl, eds. (1990). Readings in Un-

certain Reasoning. Morgan Kaufmann. San Mateo, California. This

volume collects classic and recent papers on uncertain reasoning in

artificial intelligence. Probabilistic, belief-function, fuzzy, and quali-

tative approaches are included.

[11] Spirtes, Peter, Clark Glymour, and Richard Scheines (1993).

Causation, Prediction, and Search. Lecture Notes in Statistics 81.

Springer-Verlag. New York. This monograph explores a variety of

non-Bayesian ideas for constructing belief nets from data. The em-

phasis is on using limited a priori assumptions about causal relations

among variables together with observed independencies among those

variables.

[12] Whittaker, J. (1990). Graphical Models in Applied Multivariate

Statistics. John Wiley. Chichester. A pioneering statistical treat-

ment of belief nets, emphasizing the multivariate normal distribution.

Many examples.

These articles review several topics mentioned in preceding chapters.

[13] Besag, Julian, Peter Green, David Higdon, and Kerrie Mengersen

(1995). Bayesian computation and stochastic systems (with

chain Monte Carlo methods, with an emphasis on Bayesian statistical

problems.

[14] Buntine, Wray (1996). A guide to the literature on learning

graphical models. IEEE Transactions on Knowledge and Data En-

gineering. An excellent review of the problem of selecting graphical

models for probabilistic expert systems on the basis of data.

[15] Charniak, Eugene (1991). Bayesian networks without tears. AI

Magazine. Winter 1991, pp. 50-63. A nontechnical introduction

to belief nets, especially useful for students with limited interest in

mathematical probability theory.

[16] Dempster, A. P. (1971). An overview of multivariate data anal-

ysis. Journal of Multivariate Analysis. 1, pp. 316-346. This classic

article includes a discussion of the limitations of the multivariate

framework, limitations still not overcome in the main body of work

in statistics and probabilistic expert systems.

[17] Neal, Radford M. (1993). Probabilistic inference using Markov

chain Monte Carlo methods. Technical Report. Department of Com-

puter Science. University of Toronto. In contrast to Besag et al., this

review emphasizes probabilistic expert systems.

[18] Rabiner, L. R. (1989). A tutorial on hidden Markov models and

selected applications in speech recognition. Proceedings of the IEEE.

77, pp. 257-286. Still one of the best introductions to hidden Markov

models.

[19] Spiegelhalter, David J., A. Philip Dawid, Steffen L. Lauritzen,

and Robert G. Cowell (1993). Bayesian analysis in expert systems

(with discussion). Statistical Science. 8, pp. 219-283. Currently

the best brief overview of the state of the art of probabilistic expert

systems.

[20] Tatman, J. A., and Ross Shachter (1990). Dynamic program-

ming and influence diagrams. IEEE Transactions on Systems, Man,

and Cybernetics. 20, pp. 365-379. This article reviews influence dia-

grams, which generalize belief nets by including nodes for decisions,

and shows how dynamic programming can be understood within the

framework of influence diagrams.

[21] Xu, Hong, and Robert Kennes (1994). Steps towards an effi-

cient implementation of Dempster-Shafer theory. Advances in the

Dempster-Shafer Theory of Evidence. R. R. Yager, M. Fedrizzi, and

J. Kacprzyk, eds. John Wiley. New York. Pp. 153-174. This article

reviews various ways of making the Shafer-Shenoy architecture as

efficient as possible for belief functions.

This is not a comprehensive bibliography of the very extensive work on proba-

bilistic expert systems, but it contains the articles and dissertations that have

most engaged the author's attention.

[22] Beeri, Catriel, Ronald Fagin, David Maier, and Mihalis Yan-

nakakis (1983). On the desirability of acyclic database schemes.

Journal of the Association for Computing Machinery. 30, pp. 479-

513. This very widely cited paper first introduced the idea of a join

tree into the literature on relational databases. It is also responsible

for the name "join tree."

[23] Cano, Jose, Miguel Delgado, and Serafin Moral (1993). An ax-

iomatic framework for propagating uncertainty in directed acyclic

networks. International Journal of Approximate Reasoning. 8, pp.

253-280. This article extends the axioms for join-tree computation,

discussed in Chapter 1 and in Shenoy and Shafer [48], to computa-

tion within directed acyclic graphs, in the style developed in Pearl's

Probabilistic Reasoning in Intelligent Systems [8].

[24] Cooper, Gregory F., and Edward Herskovits (1992). A Bayesian

method for the induction of probabilistic networks from data. Ma-

chine Learning. 9, pp. 309-347. An influential exposition of a

straightforward Bayesian approach to choosing and parametrizing a

DAG from data for a given set of variables. The method developed

in this article can be contrasted with the non-Bayesian methods developed

in Spirtes, Glymour, and Scheines's Causation, Prediction,

and Search [11].

[25] Cowell, Robert G., and A. Philip Dawid (1992). Fast retraction

of evidence in a probabilistic expert system. Statistics and Com-

puting. 2, pp. 37-40. Using out-marginalization (see Exercise 1.4),

this article gives a quick join-tree algorithm for adjusting marginal

probabilities to allow for the omission of previously included obser-

vations. The algorithm allows efficient computation of statistics for

monitoring the performance of a belief net.

[26] Cox, David R., and Nanny Wermuth (1993). Linear dependencies

represented by chain graphs (with discussion). Statistical Science. 8,

pp. 204-283. Taking DAGs and chain graphs as a starting point,

this article discusses a wide variety of graphical representations of

multivariate probability distributions.

[27] Dawid, A. Philip (1980). Conditional independence for statistical

operations. Annals of Statistics. 8, pp. 598-617. This pioneering

article studies general properties of conditional independence that

were later studied as axioms by Judea Pearl.

filter. Technical Report. Department of Statistics. Harvard Univer-

sity.

[29] Geman, Stuart, and Donald Geman (1984). Stochastic relax-

ation, Gibbs distributions, and the Bayesian restoration of images.

IEEE Transactions on Pattern Analysis and Machine Intelligence. 6,

pp. 721-741. This article shows how image-analysis problems can be

modeled so that the computation problems are susceptible to reso-

lution by Gibbs sampling. Very much influenced by the work of Ulf

Grenander, the article was in turn very influential in vision, artificial

intelligence, and Bayesian statistics.

[30] Heckerman, David (1990). Probabilistic similarity networks.

Networks. 20, pp. 607-636. This article explores an interesting generalization

of belief networks, in which the factorization that permits

representation by a DAG may apply only conditionally on some val-

ues of the preceding variables.

[31] Jensen, Finn V. (1991). Calculation in HUGIN of probabilities

for specific configurations - a trick with many applications. Scandinavian

Conference on Artificial Intelligence 91. IOS Press. Burke,

Virginia. Pp. 176-186. This article puts the trick of Exercise 3.3 to

use for practical tasks in probabilistic expert systems: comparison of

competing hypotheses, analysis of conflicts in data, and evaluation

of approximate calculations.

[32] Jensen, Finn V. (1995). Cautious propagation in Bayesian networks.

Proceedings of the 11th Conference on Uncertainty in Artificial

Intelligence. Philippe Besnard and Steve Hanks, eds. Morgan

Kaufmann. San Mateo, California. Pp. 323-328. This article uses

the Shafer-Shenoy architecture to supply a more general solution to

the problem considered by Cowell and Dawid [25].

[33] Jensen, Finn V., and Frank Jensen (1994). Optimal junction

trees. Proceedings of the 10th Conference on Uncertainty in Artificial

Intelligence. R. L. Mantaras and D. Poole, eds. Morgan Kaufmann.

San Mateo, California. Pp. 360-366. Even when sets of variables can

be arranged in a join tree, there may be more than one arrangement,

some more efficient than others. This paper presents an algorithm

for choosing an optimal one.

[34] Jensen, Finn V., Steffen L. Lauritzen, and K. G. Olesen (1990).

Bayesian updating in causal probabilistic networks by local compu-

tation. Computational Statistics Quarterly. 4, pp. 269-282. This

article, all of whose authors work at the University of Aalborg in

Aalborg, Denmark, introduced the architecture named after that city

in Chapter 3.

networks by simulated annealing. Statistics and Computing. 2, pp.

7-17. This article suggests a sophisticated heuristic for near-optimal

join trees (or, in the terminology it uses, near-optimal "decompositions"

or "triangulations"). It also gives references to other heuristics.

[36] Kong, Augustine (1986). Multivariate belief functions and graph-

ical models. Doctoral dissertation. Department of Statistics. Har-

vard University. This dissertation spells out how the concept of join-

tree cover is related to the concept of triangulation, which is used

more often in the older literature. It also studies some heuristics for

finding join-tree covers or triangulations.

[37] Lauritzen, Steffen, and David Spiegelhalter (1988). Local com-

putations with probabilities on graphical structures and their ap-

plication to expert systems (with discussion). Journal of the Royal

Statistical Society, Series B. 50, pp. 157-224. This classic article in-

troduced probabilistic expert systems to the statistical community.

It is the source of the Lauritzen-Spiegelhalter architecture discussed

in Chapter 3. The reader of this article should be cautioned that the

heuristic it uses for finding join-tree covers, maximum cardinality

search, gives rather poor results in general. See [35] and [36].

[38] Li, Zhaoyu, and Bruce D'Ambrosio (1994). Efficient inference

in Bayes networks as a combinatorial optimization problem. International

Journal of Approximate Reasoning. 11, pp. 55-81. The authors

formulate the problem of finding an optimal order for summing

variables out as a combinatorial problem.

[39] Mellouli, Khaled (1987). On the propagation of beliefs in networks

using the Dempster-Shafer theory of evidence. Doctoral dissertation.

School of Business. University of Kansas. This dissertation

includes a demonstration that the class of join-tree covers obtained

by summing out is always large enough to include optimal join-tree

covers.

[40] Mitten, L. G. (1964). Composition principles for synthesis of

optimal multistage processes. Operations Research. 12, pp. 610-619.

An early exploration of the extent of applicability of recursive meth-

ods for optimization such as those described in Bertele and Brioschi's

book.

[41] Ndilikilikesha, Pierre C. (1994). Potential influence diagrams.

International Journal of Approximate Reasoning. 10, pp. 251-285.

This article shows how influence diagrams can be solved using a

rooted join tree.

[42] Pearl, Judea (1986). Fusion, propagation, and structuring in be-

lief networks. Artificial Intelligence. 29, pp. 241-288. An extremely

influential contribution to computation in belief nets, emphasizing

the computation. The material in this article was incorporated into

Pearl's 1988 book [8].

[43] Pearl, Judea (1987). Evidential reasoning using stochastic simu-

lation. Artificial Intelligence. 32, pp. 245-257. This may be the first

proposal to use Markov-chain Monte Carlo for computations in belief

nets. The method had long been used in statistical physics and in

operations research.

[44] Saul, Lawrence K., Tommi Jaakkola, and Michael I. Jordan

(1995). Mean field theory for sigmoid belief networks. Computa-

tional Cognitive Science Technical Report 9501, Center for Biological

and Computational Learning. Massachusetts Institute of Technol-

ogy. This article sketches a program for borrowing the idea of mean-

field theory from statistical physics in order to address the prob-

lem of approximate computation in belief nets with extremely high

connectivity.

[45] Shafer, Glenn, Prakash P. Shenoy, and Khaled Mellouli (1987).

Propagating belief functions in qualitative Markov trees. Interna-

tional Journal of Approximate Reasoning. 1, pp. 349-400. This

paper explores a way of understanding constraint propagation and

belief-function computation abstractly, without variables.

[46] Shenoy, Prakash P. (1991). Valuation based systems for discrete

optimization. Uncertainty in Artificial Intelligence 6. P. P.

Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer, eds. North-

Holland. Amsterdam. Pp. 385-400. The abstract understanding of

inward and outward passes in join-tree computation in this article

generalizes the method of nonserial dynamic programming discussed

by Bertele and Brioschi [1].

[47] Shenoy, Prakash P. (1994). Representing conditional indepen-

dence relations by valuation networks. International Journal of Un-

certainty, Fuzziness and Knowledge-Based Systems. 2, pp. 143-165.

This article advances a general framework for propagating informa-

tion in expert systems. Shenoy's framework applies not only to prob-

ability but also to belief functions and other calculi satisfying the

axioms of Chapter 1.

[48] Shenoy, Prakash P., and Glenn Shafer (1990). Axioms for probability

and belief-function propagation. Uncertainty in Artificial Intelligence 4.

R. D. Shachter, T. S. Levitt, L. N. Kanal, and J. F. Lemmer, eds.

North-Holland, Amsterdam. Pp. 169-198. The axioms for join-tree

computation, discussed in Chapter 1, were first

isolated in this article. The article also describes the Shafer-Shenoy

architecture.

function formulas for audit risk. The Accounting Review. 67, pp.

249-283. This article discusses the propagation of evidence for finan-

cial audits, using belief functions rather than probabilities.

[50] Wermuth, Nanny, and Steffen L. Lauritzen (1990). On sub-

stantive research hypotheses, conditional independence graphs, and

graphical chain models (with discussion). Journal of the Royal Statistical

Society, Series B. 52, pp. 21-50. This wide-ranging article

includes a good introduction to the uses of chain graphs.

[51] Xu, Hong, and Philippe Smets (1996). Reasoning in evidential

networks with conditional belief functions. International Journal of

Approximate Reasoning. 14, pp. 158-185. This article adds a concept

of conditionals to the theory of belief functions and shows how they

can be implemented in join-tree computation.

[52] Zhang, Nevin Lianwen, Runping Qi, and David Poole (1994). A

computational theory of decision networks. International Journal of

Approximate Reasoning. 11, pp. 83-158. This article extends join-

tree computation to influence diagrams and even to slightly more

general networks; forgetting is allowed.

Index

Aalborg formula, 61
audit evidence, 29

Bayesian network, 22
Bayesian statistics, 66
belief chain, 25, 33
belief functions, 15
belief net, 21
bubble graph, 27

categorical variables, 13
chain, 25
chain graph, 30
COLLECT, 64
combination axiom, 5
computational cost, 50, 67
conditional, 5, 18
conditional probabilities, 5
conditioning, 10
configuration, 2
constraint propagation, 36
construction chain, 28
construction sequence, 19, 54
constructive interpretation of probability, 9
continuer, 7, 15, 16, 18, 53

DAG, 21
    construction ordering, 22
    initial segment, 23
density, 3
directed acyclic graph, 21
domain, 3
dynamic programming, 36

elementary architecture, 43
expectation, 12
extended division, 60

factorization, 35, 54
four-color problem, 36
frame, 2

Gibbs sampling, 66
graphical model, 22

head, 5
heuristics, 37
hidden Markov model, 26, 33

independence, 9
information branch, 43

join graph, 29
join tree, 35, 39
    cover, 43
    heuristics, 37
    root, 41
junction tree, 35

Kalman filter, 16, 67

lattice, 16
Lauritzen-Spiegelhalter architecture, 50
linear programming, 15

Markov chain, 25
Markov-chain Monte Carlo, 66
mean field theory, 66
multivariate framework, 2, 14

object-oriented computation, 64
out-marginal, 16

parallel computation, 48
parameter, 13
posterior probability, 10
probability distribution, 2
    algorithmic, 13
    continuous, 3
    discrete, 2
    parametric, 13
    posterior, 10
    tabular, 13
    with given marginals, 63

recursive computation, 5
recursive dynamic programming, 67
rules, 63

semigroup, 16, 33, 68
SENDMESSAGE, 64
separator, 45, 56, 62
Shafer-Shenoy architecture, 45
similarity network, 31
slice, 6
state graph, 25, 33
sufficient, 9
support, 60
systems of equations, 15, 36

tail, 5
transitivity axiom, 5

valuation network, 30
variable, 2
vision, 66

zeros, 58

(continued from inside front cover)

BRADLEY EFRON, The Jackknife, the Bootstrap, and Other Resampling Plans

M. WOODROOFE, Nonlinear Renewal Theory in Sequential Analysis

D. H. SATTINGER, Branching in the Presence of Symmetry

R. TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis

MIKLÓS CSÖRGŐ, Quantile Processes with Statistical Applications

J. D. BUCKMASTER AND G. S. S. LUDFORD, Lectures on Mathematical Combustion

R. E. TARJAN, Data Structures and Network Algorithms

PAUL WALTMAN, Competition Models in Population Biology

S. R. S. VARADHAN, Large Deviations and Applications

KIYOSI ITÔ, Foundations of Stochastic Differential Equations in Infinite Dimensional Spaces

ALAN C. NEWELL, Solitons in Mathematics and Physics

PRANAB KUMAR SEN, Theory and Applications of Sequential Nonparametrics

LASZLO LOVASZ, An Algorithmic Theory of Numbers, Graphs and Convexity

E. W. CHENEY, Multivariate Approximation Theory: Selected Topics

JOEL SPENCER, Ten Lectures on the Probabilistic Method

PAUL C. FIFE, Dynamics of Internal Layers and Diffusive Interfaces

CHARLES K. CHUI, Multivariate Splines

HERBERT S. WILF, Combinatorial Algorithms: An Update

HENRY C. TUCKWELL, Stochastic Processes in the Neurosciences

FRANK H. CLARKE, Methods of Dynamic and Nonsmooth Optimization

ROBERT B. GARDNER, The Method of Equivalence and Its Applications

GRACE WAHBA, Spline Models for Observational Data

RICHARD S. VARGA, Scientific Computation on Mathematical Problems and Conjectures

INGRID DAUBECHIES, Ten Lectures on Wavelets

STEPHEN F. McCORMICK, Multilevel Projection Methods for Partial Differential Equations

HARALD NIEDERREITER, Random Number Generation and Quasi-Monte Carlo Methods

JOEL SPENCER, Ten Lectures on the Probabilistic Method, Second Edition

CHARLES A. MICCHELLI, Mathematical Aspects of Geometric Modeling

ROGER TEMAM, Navier-Stokes Equations and Nonlinear Functional Analysis, Second Edition

GLENN SHAFER, Probabilistic Expert Systems
