Advisors:
P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth
Springer
New York
Berlin
Heidelberg
Barcelona
Budapest
Hong Kong
London
Milan
Paris
Santa Clara
Singapore
Tokyo
Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Andrews/Herzberg: Data: A Collection of Problems from Many Fields for the Student and Research Worker.
Anscombe: Computing in Statistical Science through APL.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Bolfarine/Zacks: Prediction Theory for Finite Populations.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Brémaud: Point Processes and Queues: Martingale Dynamics.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Daley/Vere-Jones: An Introduction to the Theory of Point Processes.
Dzhaparidze: Parameter Estimation and Hypothesis Testing in Spectral Analysis of Stationary Time Series.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear Models.
Farrell: Multivariate Calculation.
Federer: Statistical Design and Analysis for Intercropping Experiments.
Fienberg/Hoaglin/Kruskal/Tanur (Eds.): A Statistical Model: Frederick Mosteller's Contributions to Statistics, Science and Public Policy.
Fisher/Sen: The Collected Works of Wassily Hoeffding.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses.
Goodman/Kruskal: Measures of Association for Cross Classifications.
Grandell: Aspects of Risk Theory.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Hartigan: Bayes Theory.
Heyer: Theory of Statistical Experiments.
Huet/Bouvier/Gruet/Jolivet: Statistical Tools for Nonlinear Regression: A Practical Guide with S-PLUS Examples.
Jolliffe: Principal Component Analysis.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kres: Statistical Tables for Multivariate Analysis.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts.
Longford: Models for Uncertainty in Educational Testing.
Manoukian: Modern Concepts and Theorems of Mathematical Statistics.
Miller, Jr.: Simultaneous Statistical Inference, 2nd edition.
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of The Federalist Papers.
Theory of Statistics
With 26 Illustrations
Springer
Mark J. Schervish
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
USA
9 8 7 6 5 4 3 2 (Corrected second printing, 1997)
This text has grown out of notes used for lectures in a course entitled Ad-
vanced Statistical Theory at Carnegie Mellon University over several years.
The course (when taught by the author) has attempted to cover, in one
academic year, those topics in estimation, testing, and large sample theory
that are commonly taught to second year graduate students in a math-
ematically rigorous fashion. Most texts at this level fall into one of two
categories. They either ignore the Bayesian point of view altogether or
they cover Bayesian topics almost exclusively. This book covers topics in
both classical and Bayesian inference in a great deal of generality. My own
point of view is Bayesian, but I believe that students need to learn both
types of theory in order to achieve a fuller appreciation of the subject mat-
ter. Although many comparisons are made between classical and Bayesian
methods, it is not a goal of the text to present a formal comparison of the
two approaches as was done by Barnett (1982). Rather, the goal has been
to prepare Ph.D. students to be able to understand and contribute to the
literature of theoretical statistics with a broader perspective than would be
achieved from a purely Bayesian or a purely classical course.
After a brief review of elementary statistical theory, the coverage of the
subject matter begins with a detailed treatment of parametric statistical
models as motivated by DeFinetti's representation theorem for exchangeable
random variables (Chapter 1). In addition, Dirichlet processes and other
tailfree processes are presented as examples of infinite-dimensional param-
eters. Chapter 2 introduces sufficient statistics from both Bayesian and
non-Bayesian viewpoints. Exponential families are discussed here because
of the important role sufficiency plays in these models. Also, the concept
of information is introduced together with its relationship to sufficiency.
A representation theorem is given for general distributions based on suffi-
cient statistics. Decision theory is the subject of Chapter 3, which includes
discussions of admissibility and minimaxity. Section 3.3 presents an ax-
iomatic derivation of Bayesian decision theory, including the use of condi-
tional probability. Chapter 4 covers hypothesis testing, including unbiased
tests, P-values, and Bayes factors. We highlight the contrasts between the
traditional "uniformly most powerful" (UMP) approach to testing and de-
cision theoretic approaches (both Bayesian and classical). In particular, we
2These two appendices contain sufficient detail to serve as the basis for a full semester (or more) course in measure and probability. They are included in this book to make it more self-contained for students who do not have a background in measure theory.
Preface ix
from two early drafts of the text and provided invaluable feedback on the
(lack of) clarity in various sections.
As a student at the University of Illinois at Urbana-Champaign, I learned
statistical theory from Stephen Portnoy, Robert Wijsman, and Robert
Bohrer (although some of these people may deny that fact after reading this
book). Many of the proofs and results in this text bear startling resemblance
to my notes taken as a student. Many, in turn, undoubtedly resemble works
recorded in other places. Whenever I have essentially lifted, or cosmetically
modified, or even only been deeply inspired by a published source, I have
cited that source in the text. If results copied from my notes as a student
or produced independently also resemble published results, I can only apol-
ogize for not having taken enough time to seek out the earliest published
reference for every result and proof in the text. Similarly, the problems at
the ends of each chapter have come from many sources. One source used
often was the file of old qualifying exams from the Department of Statistics
at Carnegie Mellon University. These problems, in turn, came from various
sources unknown to me (even the ones I wrote). If I have used a problem
without giving proper credit, please take it as a compliment. Some of the
more challenging problems have been identified with an asterisk (*) after
the problem number. Many of the plots in the text were produced using
The New S Language and S-Plus [see Becker, Chambers, and Wilks (1988)
and StatSci (1992)]. The original text processing was done using LaTeX,
which was written by Lamport (1986) and was based on TeX by Knuth
(1984).
Pittsburgh, Pennsylvania MARK J. SCHERVISH
May 1995
Several corrections needed to be made between the first and second print-
ings of this book. During that time, I created a world-wide web page
http://www.stat.cmu.edu/~mark/advt/
on which readers may find up-to-date lists of any corrections that have
been required. The most significant individual corrections made between
the first and second printings are listed here:
- The discussion of the famous M-estimator on page 314 has been corrected.
- Theorems 7.108 and 7.116 each needed an additional condition concerning uniform boundedness of the derivatives of the $H_n$ and $H_n^*$ functions on a compact set. Only small changes were made to the proofs.
- The proofs of Theorems B.83 and B.133 were corrected, and small changes were made to Example 2.81 and Definition B.137.
1.1 Background
The purpose of this book is to cover important topics in the theory of statis-
tics in a very thorough and general fashion. In this section, we will briefly
review some of the basic theory of statistics with which many students are
familiar. All that we do here will be repeated in a more precise manner at
the appropriate place in the text.
If $X = (X_1, \ldots, X_n)$, where the $X_i$ are IID, each with density (or mass function) $f_{X_i|\Theta}(\cdot|\theta)$ when $\Theta = \theta$, then
$$f_{X|\Theta}(x|\theta) = \prod_{i=1}^{n} f_{X_i|\Theta}(x_i|\theta), \tag{1.1}$$
where $x = (x_1, \ldots, x_n)$. After observing the data $X_1 = x_1, \ldots, X_n = x_n$, the function in (1.1), as a function of $\theta$ for fixed $x$, is called the likelihood function, denoted by $L(\theta)$. Section 1.3 is devoted to a motivation of the
above structure based on the concept of exchangeability and DeFinetti's
representation theorem 1.49. Exchangeability is discussed in detail in Sec-
tion 1.2, and DeFinetti's theorem is the subject of Section 1.4.
A function of the data which takes values in the parameter space is called a (point) estimator of $\Theta$. Section 5.1 considers point estimation in depth.
Example 1.4. Suppose that $X = (X_1, \ldots, X_n)$ and the $X_i$ are IID with $N(\theta, 1)$ distribution under $P_\theta$; then $\phi(x) = \sum_{i=1}^n x_i/n = \bar{x}$ takes values in the parameter space and can be considered an estimator of $\Theta$.
Sometimes we wish to estimate a function $g$ of $\Theta$. An estimator $\phi$ of $g(\Theta)$ is unbiased if $\mathrm{E}_\theta[\phi(X)] = g(\theta)$ for all $\theta \in \Omega$. An estimator $\hat{\theta}$ of $\Theta$ is a maximum likelihood estimator (MLE) if
$$\sup_{\theta \in \Omega} L(\theta) = L(\hat{\theta}(x)),$$
where $I_B$ is the indicator function of the set $B$. It will often be possible (although not necessary) to think of the space $\mathcal{X} \times \Omega$ as if it were the underlying probability space $S$ which is introduced in Appendix B. In this way, $X$ and $\Theta$ are both easily recognized as functions from $S$ to their respective ranges. That is, if $s = (x, \theta)$, then $X(s) = x$ and $\Theta(s) = \theta$.
After observing the data $X = x$, one constructs the conditional distribution of $\Theta$ given $X = x$, which is called the posterior distribution, using Bayes' theorem:
$$f_{\Theta|X}(\theta|x) = \frac{f_{X|\Theta}(x|\theta)\,f_\Theta(\theta)}{\int_\Omega f_{X|\Theta}(x|t)\,f_\Theta(t)\,dt},\tag{1.6}$$
and the prior density is $f_\Theta(\theta) = \sqrt{\lambda}\,(2\pi)^{-1/2}\exp(-\lambda[\theta - \theta_0]^2/2)$. Multiplying these together and simplifying yield the following expression for the numerator of (1.6):
$$k(x)\exp\left(-\frac{n+\lambda}{2}[\theta - \theta_1]^2\right),\tag{1.8}$$
where $\theta_1 = (\lambda\theta_0 + n\bar{x})/(\lambda + n)$, and $k(x)$ does not depend on $\theta$. The expression in (1.8) is easily recognized as being proportional to the $N(\theta_1, 1/[\lambda+n])$ density as a function of $\theta$. So, the posterior distribution of $\Theta$ given $X = x$ is $N(\theta_1, 1/[\lambda+n])$.
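The update rule in (1.8) is just a precision-weighted average of the prior mean and the sample mean. A minimal sketch in Python (the function name and interface are mine, not the book's):

```python
def normal_posterior(x, theta0, lam):
    """Posterior of Theta for IID N(theta, 1) data given Theta = theta,
    with a N(theta0, 1/lam) prior, as in (1.8).

    Returns (theta1, posterior variance), where
    theta1 = (lam*theta0 + n*xbar)/(lam + n) and the variance is 1/(lam + n).
    """
    n = len(x)
    xbar = sum(x) / n
    theta1 = (lam * theta0 + n * xbar) / (lam + n)
    return theta1, 1.0 / (lam + n)
```

As $n$ grows, the posterior mean moves from $\theta_0$ toward $\bar{x}$ and the posterior variance shrinks toward 0.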
$$f_{Y|X}(y|x) = \int_\Omega f_{Y|\Theta}(y|\theta)\,f_{\Theta|X}(\theta|x)\,d\theta.$$
Example 1.9 (Continuation of Example 1.7; see page 4). Let $Y = X_{n+1}$, the next observation. The posterior predictive density of $Y$ is
1.2 Exchangeability
1.2.1 Distributional Symmetry
When one performs a statistical analysis, there are usually several quanti-
ties about which one is uncertain. For example, when conducting a political
poll, one never knows in advance which of several answers each respondent
will provide. In addition, even after the responses are in, one does not know
the answers that would have been supplied by all of the people who were
not polled. If one is interested in the proportions of the population who
would provide each of the available responses, then all of the would-be re-
sponses of all members of the population are potentially of interest. The
most complete specification of a probability distribution would give the
joint distribution of all of these responses. From this joint distribution, the
distributions of the various proportions of interest could also be calculated.
The quantities of interest can be more complicated than counts and pro-
portions without changing the basic considerations. For example, a company may keep track of the total amount of a sample of its sales to a sample of its customers at a sample of its stores on a sample of days. It may be
6 Chapter 1. Probability Models
Also, $(X_1, X_2)$ has the same joint distribution as $(X_{99}, X_1)$, $(X_5, X_2, X_{48})$ has the same joint distribution as $(X_{13}, X_{100}, X_3)$, and so on. The following fact is easy to prove.
Proposition 1.12. A collection $C$ of random quantities is exchangeable if and only if, for every finite $n$ less than or equal to the size of the collection $C$, every $n$-tuple of distinct elements of $C$ has the same joint distribution as every other such $n$-tuple.
As an example, we stated earlier that the assumption of IID random variables entailed symmetry and more.
Example 1.13. Consider a collection $X_1, X_2, \ldots$ (finite or infinite) of IID random variables. Clearly, $(X_{i_1}, \ldots, X_{i_n})$ has the same distribution as $(X_{j_1}, \ldots, X_{j_n})$ so long as $i_1, \ldots, i_n$ are all distinct and $j_1, \ldots, j_n$ are all distinct. Hence, every collection of IID random variables is exchangeable.
6The reader interested in an in-depth study of the early days of statistics and
statistical reasoning should read Stigler (1986).
Note that the right-hand side does not depend on $i_1, \ldots, i_n$.
The case of conditionally IID random quantities will turn out to be one of only two general forms of exchangeability. Theorem 1.49 will say that infinitely many random quantities are exchangeable if and only if they are conditionally IID given something.
Although an infinite sequence of exchangeable random variables is conditionally IID, sometimes the description of their joint distribution does not make this fact transparent. Example 1.15 is the famous Polya urn scheme. It is not obvious from the example that the random variables constructed are conditionally IID. Theorem 1.49, however, says that they are conditionally IID because they are exchangeable.7
Example 1.15. Let $\mathcal{X} = \{1, \ldots, k\}$, and let $u_1, \ldots, u_k$ be nonnegative integers such that $u = \sum_{i=1}^k u_i > 0$. Suppose that an urn contains $u_i$ balls labeled $i$ for $i = 1, \ldots, k$. We draw a ball at random8 and record $X_1$ equal to the label. We then replace the ball and toss in one more ball with the same label. We then draw a ball at random again to get $X_2$ and repeat the process indefinitely. To prove that the sequence $\{X_i\}_{i=1}^\infty$ is exchangeable, let $n > 0$ be an integer and let $j_1, \ldots, j_n$ be elements of $\mathcal{X}$. For $i = 1, \ldots, k$, let $c_i(j_1, \ldots, j_n)$ be the number of times that $i$ appears among $j_1, \ldots, j_n$. That is,9 $c_i(j_1, \ldots, j_n) = \sum_{l=1}^n I_{\{i\}}(j_l)$. Define the notation
$$(a)_b = a(a-1)\cdots(a-b+1),$$
7Hill, Lane, and Sudderth (1987) prove that for $k = 2$, the Polya urn process is the only exchangeable urn process aside from IID processes and deterministic ones. (An urn process is deterministic if all balls drawn are the same. The common label for all balls can still be random.)
8What we mean by this is that every ball in the urn has the same probability of being drawn.
9We will often use the symbol $I_A(x)$ to stand for the indicator function of the set $A$. That is, $I_A(x) = 1$ if $x \in A$ and $I_A(x) = 0$ if $x \notin A$.
$$\Pr(X_1 = j_1, \ldots, X_n = j_n) = \frac{\prod_{i=1}^k \left(u_i + c_i(j_1, \ldots, j_n) - 1\right)_{c_i(j_1, \ldots, j_n)}}{(u + n - 1)_n}.\tag{1.16}$$
For $n = 1$, this reduces to $\Pr(X_1 = j_1) = u_{j_1}/u$, which is true. If we suppose that (1.16) is true for $n = 1, \ldots, m$, then $\Pr(X_1 = j_1, \ldots, X_{m+1} = j_{m+1})$ equals
$$\Pr(X_1 = j_1, \ldots, X_m = j_m)\,\frac{u_{j_{m+1}} + c_{j_{m+1}}(j_1, \ldots, j_{m+1}) - 1}{u + m}.\tag{1.17}$$
In replacing $\Pr(X_1 = j_1, \ldots, X_m = j_m)$ by (1.16) in (1.17), we note that
Hence, $\Pr(X_1 = 0, \ldots, X_7 = 0) = \mathrm{E}(\Pr(X_1 = 0, \ldots, X_7 = 0 \mid Y)) > 0$. But this is absurd, since there are only 6 blue balls. It must be the case that the $X_i$, although exchangeable, are not conditionally IID.
Theorem 1.48 will say that a finite collection of random quantities is exchangeable if and only if they are like draws from an urn without replacement.
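The Polya urn probability (1.16) depends on $j_1, \ldots, j_n$ only through the counts $c_i$, which is exactly what exchangeability requires. A small numerical check (function names are mine; labels run from 0 rather than 1 for convenience):

```python
from math import prod

def falling(a, b):
    """(a)_b = a(a-1)...(a-b+1), the notation defined in the text."""
    out = 1
    for i in range(b):
        out *= a - i
    return out

def polya_prob(u, seq):
    """Pr(X_1 = j_1, ..., X_n = j_n) from (1.16) for a Polya urn whose initial
    contents are u[i] balls with label i (labels 0..k-1)."""
    n = len(seq)
    c = [seq.count(i) for i in range(len(u))]
    num = prod(falling(u[i] + c[i] - 1, c[i]) for i in range(len(u)))
    return num / falling(sum(u) + n - 1, n)
```

For an urn starting with two 0-balls and one 1-ball, every ordering of two 0s and one 1 has the same probability, $\frac{2}{3}\cdot\frac{3}{4}\cdot\frac{1}{5} = 0.1$, in agreement with (1.16).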
where $x$ stands for $\sum_{j=1}^n i_j$, and the other believes this probability to be $\left([n+1]\binom{n}{x}\right)^{-1}$. The first person believes that $\Pr(X_1 = 1) = 0.4$, while the second believes $\Pr(X_1 = 1) = 0.5$. On the other hand, both of these distributions are exchangeable, and so Theorem 1.47 says that both persons believe that $\Theta = \lim_{N\to\infty} \sum_{i=1}^N X_i/N$ exists with probability 1, and that $\Pr(X_1 = 1|\Theta = \theta) = \theta$. They must disagree on the distribution of $\Theta$. For example, the law of total probability B.70 says $\Pr(X_1 = 1) = \mathrm{E}(\Theta)$, hence they must have different values of $\mathrm{E}(\Theta)$.
If probabilities are not frequencies, then why are frequencies thought to be so important in calculating probabilities? The answer lies in careful examination of the implications of DeFinetti's representation theorem.
Example 1.20 (Continuation of Example 1.19). Suppose that the two people in this example both observe $X = (X_1, \ldots, X_{20}) = y$, and suppose that $y$ consists of 14 1s and 6 0s. It is not difficult to calculate the conditional distribution of $X_{21}$ given this data. For example, to get $\Pr(X_{21} = 1|X = y)$, we just divide the joint probability of $(X, X_{21}) = (y, 1)$ by the probability of $X = y$. The first person believes
$$\Pr(X_{21} = 1|X = y) = 0.64,$$
while the second person believes $\Pr(X_{21} = 1|X = y) = 15/22 \approx 0.68$. Notice how much closer these probabilities are to each other than were the prior probabilities of 0.4 and 0.5. Also, notice how close each of them is to the proportion of successes, 0.7.
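The second person's updated probability can be computed exactly as the example describes, as a ratio of sequence probabilities; the answer agrees with Laplace's rule of succession, $(x+1)/(n+2)$. A sketch (the function name is mine):

```python
from math import comb

def predictive_uniform_prior(successes, n):
    """Pr(X_{n+1} = 1 | data) for the person who assigns each 0-1 sequence with
    x ones among n trials probability 1/((n+1)*C(n,x)).  Computed, as in
    Example 1.20, by dividing the joint probability by the probability of the data."""
    def seq_prob(x, m):
        # probability of one particular 0-1 sequence with x ones among m trials
        return 1.0 / ((m + 1) * comb(m, x))
    return seq_prob(successes + 1, n + 1) / seq_prob(successes, n)
```

With 14 successes in 20 trials, this gives $15/22 \approx 0.68$.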
In Example 1.57 on page 31, we will see the general method for finding the conditional distribution of $\Theta$ after observing some Bernoulli trials. But
Example 1.20 gives us some hint of what happens. In Example 1.20, after observing 20 Bernoulli trials, the mean of $\Theta$ changed to a number closer to the proportion of successes, regardless of what the prior mean of $\Theta$ was. The conditional mean of $\Theta$ given the observed data is the probability of a success on a future trial given the data. If we believe a sequence of Bernoulli random variables to be exchangeable, and we are not already certain about the limit of the proportion of successes, then after we observe some data, we will modify our opinion about future observations so that the probability of success is now closer to the observed proportion of successes. This phenomenon has nothing to do with frequencies being probabilities. It is merely a consequence of exchangeability.
(1.21)
$$f_{X|\Theta}(x|\theta) = \frac{dP_\theta}{d\nu}(x).$$
(It will be common in this text to denote the conditional density function of one random quantity $X$ given another $Y$ by $f_{X|Y}$.) We can assume that $f_{X|\Theta}$ is measurable with respect to the product $\sigma$-field $\mathcal{B} \otimes \tau$.11 This will allow us to integrate this function with respect to measures on both $\mathcal{X}$ and $\Omega$. The function $f_{X|\Theta}(x|\theta)$, considered as a function of $\theta$ after $X = x$ is observed, is often called the likelihood function $L(\theta)$.
For each $\theta \in \Omega$, the function $f_{X|\Theta}(\cdot|\theta)$ is the conditional density with respect to $\nu$ of $X$ given $\Theta = \theta$. That is, for each $A \in \mathcal{B}$,
$$\Pr(X \in A \mid \Theta = \theta) = \int_A f_{X|\Theta}(x|\theta)\,d\nu(x).\tag{1.23}$$
Example 1.24. Let $X = (X_1, \ldots, X_n)$, where the $X_i$ are conditionally IID with $N(\mu, \sigma^2)$ distribution given $(M, \Sigma) = (\mu, \sigma)$. (Here the parameter is $\Theta = (M, \Sigma)$.) Let the prior distribution be that $\Sigma^2$ has inverse gamma distribution $\Gamma^{-1}(a_0/2, b_0/2)$ and $M$ given $\Sigma = \sigma$ has $N(\mu_0, \sigma^2/\lambda_0)$ distribution, with $a_0$, $b_0$, $\mu_0$, and $\lambda_0$ constants. The likelihood function in this case can be written as
where $\bar{x} = \sum_{i=1}^n x_i/n$ and $w = \sum_{i=1}^n (x_i - \bar{x})^2$. The prior density with respect to Lebesgue measure is
$$f_\Theta(\mu, \sigma) = \frac{2(b_0/2)^{a_0/2}\sqrt{\lambda_0}}{\sqrt{2\pi}\,\Gamma(a_0/2)}\,\sigma^{-(a_0+2)}\exp\left(-\frac{1}{2\sigma^2}\left[\lambda_0(\mu - \mu_0)^2 + b_0\right]\right), \quad \text{for } \sigma > 0.\tag{1.25}$$
The prior predictive distribution of the observations can be calculated by multiplying together the two functions above and integrating out the parameter. After completing the square in the exponent, the product can be written as
(1.26)
where
$$a_1 = a_0 + n, \qquad \lambda_1 = \lambda_0 + n,$$
$$b_1 = b_0 + w + \frac{n\lambda_0(\bar{x} - \mu_0)^2}{\lambda_0 + n}, \qquad \mu_1 = \frac{\lambda_0\mu_0 + n\bar{x}}{\lambda_0 + n}.$$
Note that, as a function of $(\mu, \sigma)$, this is in the same form as the prior density (1.25) with the four numbers $a_0, b_0, \mu_0, \lambda_0$ replaced by $a_1, b_1, \mu_1, \lambda_1$. Hence, the integral over $(\mu, \sigma)$ is just the constant factor that appears in (1.26) divided by
1.3. Parametric Models 15
the result of changing $a_0, b_0, \mu_0, \lambda_0$ to $a_1, b_1, \mu_1, \lambda_1$, respectively, in the constant factor in (1.25). That is,
(1.27)
where $k_{s,t}$ is the number of times that $s$ follows $t$ in the sequence $i_1, \ldots, i_n$. Diaconis and Freedman (1980c) prove that, aside from pathological cases, if all finite sequences of 0s and 1s that have the same first element and the same values of $k_{s,t}$ have the same probability, then (1.29) must hold.
Example 1.30. This example is the simple linear regression problem. Suppose that $x_1, x_2, \ldots$ are fixed known numbers and $E_1, E_2, \ldots$ are exchangeable random variables that are conditionally independent given $\Sigma = \sigma$ with density $f_{E_1}(e|\sigma)$.
$$C_0 = \left\{x : \int_\Omega f_{X|\Theta}(x|t)\,d\mu_\Theta(t) = 0\right\},$$
$$C_\infty = \left\{x : \int_\Omega f_{X|\Theta}(x|t)\,d\mu_\Theta(t) = \infty\right\}.$$
$$\mu_X(A) = \int_A \int_\Omega f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)\,d\nu(x).$$
It follows that
$$\int_{C_0}\int_\Omega f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)\,d\nu(x) = 0,$$
$$\int_{C_\infty}\int_\Omega f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)\,d\nu(x) = \int_{C_\infty} \infty\,d\nu(x).$$
This last integral will equal $\infty$ if $\nu(C_\infty) > 0$. Since this is impossible, it must be that $\nu(C_\infty) = 0$, hence $\mu_X(C_\infty) = 0$.
The posterior distribution $\mu_{\Theta|X}$ must satisfy the following. For all sets $A \in \mathcal{B}$ and all $B \in \tau$,
$$\int_A \mu_{\Theta|X}(B|x)\,d\mu_X(x) = \int_A \int_B f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)\,d\nu(x).\tag{1.33}$$
Next, write
Example 1.36. As an example in which Bayes' theorem does not apply, consider the case in which the conditional distribution of $X$ given $\Theta = \theta$ is discrete with $P_\theta(\{\theta - 1\}) = P_\theta(\{\theta + 1\}) = 1/2$. Suppose that $\Theta$ has a density $f_\Theta$ with respect to Lebesgue measure. The $P_\theta$ distributions are not all absolutely continuous with respect to a single $\sigma$-finite measure. It is still possible to verify that the posterior distribution of $\Theta$ given $X = x$ is the discrete distribution with
and $\Pr(\Theta = x + 1|X = x) = 1 - \Pr(\Theta = x - 1|X = x)$. Note that the posterior is not absolutely continuous with respect to the prior.
$$= \int_\Omega \prod_{i=1}^k f_{X_i|\Theta}(x_{n+i}|\theta)\,d\mu_{\Theta|X_1,\ldots,X_n}(\theta|x_1,\ldots,x_n).$$
Example 1.38 (Continuation of Example 1.35; see page 17). The posterior predictive distribution of future observations can be calculated after observing a sample of conditionally IID normal random variables. Let $Y_m$ be the average of $m$ future observations. Since the posterior distribution of $\Theta$ is in the same form as the prior (1.25) with $a_0, b_0, \mu_0, \lambda_0$ replaced by $a_1, b_1, \mu_1, \lambda_1$, it follows that the posterior predictive distribution of $Y_m$ is of the same form as the prior predictive distribution. Using the result from the end of Example 1.24 on page 14, we get that the posterior predictive distribution of $Y_m$ is $t_{a_1}\left(\mu_1, \sqrt{[1/m + 1/\lambda_1]\,b_1/a_1}\right)$.
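The conjugate update used in Examples 1.24 and 1.38 is easy to compute from the formulas displayed after (1.26). A minimal sketch (function name and interface are mine, not the book's):

```python
def nig_update(x, a0, b0, mu0, lam0):
    """Normal-inverse-gamma update: data conditionally IID N(mu, sigma^2),
    prior Sigma^2 ~ Gamma^{-1}(a0/2, b0/2) and M | Sigma = sigma ~ N(mu0, sigma^2/lam0).

    Returns (a1, b1, mu1, lam1) as in the text:
      a1 = a0 + n,  lam1 = lam0 + n,
      mu1 = (lam0*mu0 + n*xbar)/(lam0 + n),
      b1 = b0 + w + n*lam0*(xbar - mu0)^2/(lam0 + n).
    """
    n = len(x)
    xbar = sum(x) / n
    w = sum((xi - xbar) ** 2 for xi in x)
    a1 = a0 + n
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / (lam0 + n)
    b1 = b0 + w + n * lam0 * (xbar - mu0) ** 2 / (lam0 + n)
    return a1, b1, mu1, lam1
```

Because the posterior is in the prior's family, the same function can be applied again as more data arrive.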
$$\mathrm{E}_\theta(Z) = \int Z(x)\,dP_\theta(x).$$
Similarly, let $\mathrm{Var}_\theta(X)$ and $\mathrm{Cov}_\theta(X, Y)$ stand for the conditional variance of $X$ given $\Theta = \theta$ and the conditional covariance between $X$ and $Y$ given $\Theta = \theta$, respectively.
There will be times when we wish to condition on other random variables in addition to $\Theta$. Recall that for two random quantities $Z : S \to \mathbb{R}$ and $Y : S \to T$, the conditional expectation of $Z$ given $Y$ was defined to be an $\mathcal{A}_Y$-measurable function $\mathrm{E}(Z|Y)(s)$ satisfying
$$\mathrm{E}(Z I_B) = \int_B \mathrm{E}(Z|Y)(s)\,d\mu(s),$$
subjectivity into the analysis of data but choosing a parametric family does
not. These people are mistaken. Each choice one makes introduces subjec-
tivity.
Philosophy aside, suppose that one finds it difficult to specify a prior
distribution because one does not have much idea where the parameter is
likely to be located. In such cases, one may wish to do calculations based
on a prior distribution that spreads the probability very thinly over the
parameter space. A problem that often arises is that, if we take the limit
as the probability is spread more thinly, the prior distribution ceases to
satisfy the axioms of probability theory.
Example 1.40. Suppose that we choose the parametric family of normal distributions with variance 1 and parameter $\Theta$ equal to the mean. The parameter space is the real line $\mathbb{R}$. Suppose that we want a normal prior distribution for $\Theta$, but one with very high variance to indicate that we are not willing to say where we think $\Theta$ is with much certainty. The distribution $N(a, n)$ for large $n$ has this property. But how can we choose $n$? If we let $n \to \infty$, there is no countably additive limit to the sequence of probability distributions. There is no normal distribution with infinite variance.
What has become common in problems like Example 1.40 is to choose a measure $\lambda$ on $(\Omega, \tau)$ which may not be a probability but still pretend that it is the prior distribution of $\Theta$. That is, use $\lambda$ in place of $\mu_\Theta$ in Bayes' theorem 1.31. The "posterior" after observing $X = x$, if it exists, will have density with respect to $\lambda$,
$$\frac{f_{X|\Theta}(x|\theta)}{\int_\Omega f_{X|\Theta}(x|t)\,d\lambda(t)}.\tag{1.41}$$
The key is whether or not the denominator in (1.41) is finite and nonzero. If so, we can pretend that (1.41) is the posterior density of $\Theta$ given $X = x$ and then proceed with whatever analysis we want to perform. In this case, we call $\lambda$ an improper prior distribution. If the denominator in (1.41) is 0 or infinite, one may need to choose another prior distribution.
Example 1.42. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$. We can use $\lambda$ equal to Lebesgue measure as an improper prior. Suppose that we observe only one observation $X$. Since $f_{X|\Theta}(x|\theta) = (\sqrt{2\pi})^{-1}\exp(-[x - \theta]^2/2)$, it follows that the posterior density with respect to Lebesgue measure derived from Bayes' theorem 1.31 is equal to $f_{X|\Theta}(x|\theta)$ as a function of $\theta$. In other words, given $X = x$, $\Theta$ has $N(x, 1)$ distribution.
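Example 1.42 can be checked numerically: with Lebesgue measure as the improper prior, the denominator of (1.41) is $\int f_{X|\Theta}(x|t)\,dt = 1$, so the posterior is exactly the $N(x, 1)$ density. A sketch using a simple Riemann sum (the function name and grid settings are my choices, not the book's):

```python
from math import exp, pi, sqrt

def flat_prior_posterior(x, theta, half_width=40.0, steps=200_001):
    """Posterior density of Theta at `theta` given one observation x ~ N(theta, 1),
    using Lebesgue measure as the improper prior (Example 1.42).

    The denominator of (1.41) is approximated on a wide grid; it comes out
    essentially 1, so the result matches the N(x, 1) density."""
    f = lambda t: exp(-(x - t) ** 2 / 2) / sqrt(2 * pi)  # likelihood as a function of t
    h = 2 * half_width / (steps - 1)
    denom = sum(f(x - half_width + i * h) for i in range(steps)) * h
    return f(theta) / denom
```

Evaluating at $\theta = x$ recovers the $N(x,1)$ mode, $(2\pi)^{-1/2}$.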
The above discussion of improper priors is not particularly precise math-
ematically. There are two traditional ways to make the concept of improper
prior mathematically precise. Each of them opens its own particular can
of worms, so we will only describe each very briefly and point the reader
to relevant literature. First, one may remove the restriction that the prob-
ability of a set must be at most 1. Hartigan (1983) takes this approach
and allows sets to have infinite probability. This makes improper priors
$(M, \Sigma) = (\mu, \sigma)$. (In this way, $\mu$ and $\sigma$ are the mean and standard deviation in both cases.) We could let $\Omega$ be the set of all triples $(i, \mu, \sigma)$, where $i = 1$ means Laplace and $i = 2$ means normal. Consider the following data values:
-0.0820, 1.3312, -1.3518, -1.4930, 0.0850, 0.7022, 1.735,
-0.3164, 2.1948, -0.0371, 0.3377, -0.3124, 0.6087, 0.7339,
-0.4632, 0.3398, -0.0352, 0.1597, -0.6344, -0.4435.
The value of $a$ that leads to the largest $f_{X|\Theta}(x|a)$ is $(1, 0.0249, 0.9473)$. The largest value is $5.943 \times 10^{-12}$, which is only slightly larger than the value achieved at $a = (2, 0.1530, 0.8909)$, namely $4.772 \times 10^{-12}$. If we decide to use the Laplace distribution model, we will be pretending that we were sure from the start that the data would be $\mathrm{Lap}(\mu, \sigma/\sqrt{2})$, rather than taking into account the sizable amount of uncertainty that still remains about the underlying distribution.
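The comparison in this example amounts to evaluating each family's likelihood at its best-fitting parameters. A hedged sketch of the two densities involved (function names are mine; $\mathrm{Lap}(\mu, b)$ with scale $b$ has variance $2b^2$, so taking $b = \sigma/\sqrt{2}$ gives variance $\sigma^2$, matching the text's parameterization):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """N(mu, sigma^2) density."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def laplace_pdf(x, mu, sigma):
    """Lap(mu, sigma/sqrt(2)) density, so that sigma is the standard deviation."""
    b = sigma / sqrt(2)
    return exp(-abs(x - mu) / b) / (2 * b)

def likelihood(data, pdf, mu, sigma):
    """The f_{X|Theta}(x|a) being maximized in the example: a product of
    marginal densities over the data values."""
    out = 1.0
    for xi in data:
        out *= pdf(xi, mu, sigma)
    return out
```

Plugging the 20 data values above into `likelihood` at the two reported $(\mu, \sigma)$ triples should reproduce the likelihoods being compared, up to rounding of the printed data.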
In a classical setting, one might look at quantile plots to see whether the
data looked more normal or more like a Laplace distribution. Figure 1.44 shows
quantile plots for both Laplace and normal distributions. The two plots are about
equally straight, although the Laplace plot is a little bit straighter. Choosing
either distribution would surely be acting as if we knew something that was
quite uncertain.
[Figure 1.44: Quantile plots of the data against the Laplace and normal distributions. Horizontal axis: quantiles of distribution.]
tion 1.5.17 These theorems characterize all of the possible joint distributions for exchangeable random quantities which take values in a Borel space (think finite-dimensional Euclidean space), and they are essentially due to DeFinetti (1937). They can be summarized here as follows. If there is an infinite sequence of exchangeable random quantities $\{X_n\}_{n=1}^\infty$, then there must be some random quantity $P$ such that the $X_i$ are conditionally IID given $P$. If the random quantities have Bernoulli distribution, then $P$ can be taken to be the limit of the proportion of successes in the first $n$ observations. In general, $P$ will turn out to be the limit of the empirical distributions $P_n$ of $X_1, \ldots, X_n$, which were defined in (1.21). This explains how DeFinetti's theorem helps to motivate models like (1.10).
Example 1.45. Consider the case of Bernoulli random variables $\{X_i\}_{i=1}^\infty$. Here, $\mathcal{X} = \{0, 1\}$ and a random probability measure $P$ on $\mathcal{X}$ is equivalent to a random variable $\Theta \in [0,1]$, where $\Theta = P(\{1\})$. The empirical distribution $P_n$ is equivalent to $\overline{X}_n$, the average of the first $n$ observations, since $P_n(\{1\}) = \overline{X}_n$. Theorem 1.47 (a special case of Theorem 1.49) will say that $\overline{X}_n$ converges to $\Theta$ a.s., and that, conditional on $\Theta = \theta$, the $X_i$ are IID $\mathrm{Ber}(\theta)$ random variables. This is what we meant on page 7 when we said that "$\theta$ is given an implicit meaning as a random variable $\Theta$, rather than a fixed value." Also, the random variable $\Theta$ will have a distribution, which is the measure $\mu$ in (1.10).
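The convergence $\overline{X}_n \to \Theta$ in Example 1.45 is easy to see by simulation: draw $\Theta$ once, then condition on it. A sketch (the function name, seed, and the uniform mixing distribution are my choices for illustration):

```python
import random

def exchangeable_bernoulli_mean(n, seed=1):
    """Draw Theta ~ Uniform(0, 1), then n conditionally IID Ber(Theta) variables,
    as in Example 1.45.  Returns (theta, average of the n draws)."""
    rng = random.Random(seed)
    theta = rng.random()
    ones = sum(1 for _ in range(n) if rng.random() < theta)
    return theta, ones / n
```

The running average settles near the realized value of $\Theta$, not near any fixed number, which is exactly the content of Theorem 1.47.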
The heavy mathematics in the proof of Theorem 1.49 is required to make precise what it means to have a random probability measure $P$ and what it means to condition on such a thing. For random quantities that assume only finitely many different values, random probability measures are equivalent to finite-dimensional random vectors. For more general random quantities, random probability measures can be more complicated. For this reason, we prove Theorem 1.47 first, even though it is a special case of Theorem 1.49. The proof of Theorem 1.47 contains the essential ideas of the more complicated proof without being encumbered by so much mathematics.
If there are only finitely many exchangeable quantities $X_1, \ldots, X_N$, then all that we can prove is the following. Conditional on the empirical distribution $P_N$ of $X_1, \ldots, X_N$, every ordered $n$-tuple (for $n \le N$) of the $X_i$ has the distribution of $n$ draws without replacement from a finite population with distribution $P_N$. It is the "without replacement" qualifier that prevents us from proving that $X_1, \ldots, X_N$ are conditionally independent. (See Example 1.18 on page 10.) It is possible for a finite collection of exchangeable random variables to be conditionally independent; however, it is not necessary. Looking at the Bernoulli case first might aid in understanding
17In most of this text, proofs are given immediately or almost immediately after the statements of results. Because DeFinetti's representation theorem 1.49 is so important for motivating statistical modeling, and because its proof involves some rather heavy mathematics, many readers may wish to forego reading the proofs on a first pass through this material. However, every reader should at least try to understand what Theorem 1.49 says.
variables is exchangeable if and only if there is a random variable $\Theta$ taking values in $[0,1]$ such that, conditional on $\Theta = \theta$, $\{X_n\}_{n=1}^\infty$ are IID $\mathrm{Ber}(\theta)$. Furthermore, if the sequence is exchangeable, then the distribution of $\Theta$ is unique and $\sum_{i=1}^n X_i/n$ converges to $\Theta$ almost surely.
For the more general cases, we need some more notation. Let $(\mathcal{X}, \mathcal{B})$ be a Borel space, and let $\mathcal{P}$ be the set of all probability measures on $(\mathcal{X}, \mathcal{B})$. The theorems stated below will give the conditional distributions of certain random quantities taking values in $\mathcal{X}$ given certain probability measures. To be mathematically precise, these probability measures must themselves be random quantities. That is, we will need a $\sigma$-field $\mathcal{C}_\mathcal{P}$ of subsets of $\mathcal{P}$ such that the appropriate probability measures can be thought of as measurable functions from some probability space $(S, \mathcal{A}, \mu)$ to $(\mathcal{P}, \mathcal{C}_\mathcal{P})$. Let $\mathcal{C}_\mathcal{P}$ be the smallest $\sigma$-field of subsets of $\mathcal{P}$ containing all sets of the form $A_{B,t} = \{P \in \mathcal{P} : P(B) \le t\}$, for $B \in \mathcal{B}$ and $t \in [0,1]$. This is the smallest $\sigma$-field for which the evaluation functions $g_B : \mathcal{P} \to \mathbb{R}$ are measurable,18 where $g_B(P) = P(B)$. It is easy to show that $P_n$, defined in (1.21), is a measurable function from the $n$-fold product space $(\mathcal{X}^n, \mathcal{B}^n)$ to $(\mathcal{P}, \mathcal{C}_\mathcal{P})$ (see Problem 24 on page 77). If $(S, \mathcal{A}, \mu)$ is a probability space, a measurable function $P : S \to \mathcal{P}$ is called a random probability measure. In this way, $P_n$ is a random probability measure for every $n$.
Theorem 1.48. Suppose that $X_1,\ldots,X_N$ are random quantities taking values in a Borel space $(\mathcal{X},\mathcal{B})$. Let $X = (X_1,\ldots,X_N)$, and for each $B \in \mathcal{B}$, let $P_N(B) = \sum_{i=1}^N I_B(X_i)/N$ be the empirical distribution of $X$. The random quantities are exchangeable if and only if, for every ordered $n$-tuple
18Those familiar with topological concepts will recognize the sets $A_{B,t}$ as a subbase for the topology of pointwise convergence of functions from $\mathcal{B}$ to $\mathbb{R}$, which is also the product topology when that set of functions is considered as the product space $\mathbb{R}^\mathcal{B}$. As such, $(\mathcal{P},\mathcal{C}_\mathcal{P})$ is not a Borel space. This inconvenient circumstance will not cause problems for us, however. One of the steps in the proof of Theorem 1.49 is to show that the subset of $\mathcal{P}$ in which $P$ lies is the image of a Borel space $(\mathcal{X}^\infty,\mathcal{B}^\infty)$ under a measurable function. Hence regular conditional distributions are induced on $(\mathcal{P},\mathcal{C}_\mathcal{P})$ by the corresponding distributions in $(\mathcal{X}^\infty,\mathcal{B}^\infty)$.
$(i_1,\ldots,i_n)$ of distinct elements of $\{1,\ldots,N\}$, the joint distribution of $(X_{i_1},\ldots,X_{i_n})$, conditional on $P_N = P$, is that of a simple random sample without replacement from the distribution $P$.
$$r^k(1-r)^{n-k},$$
which corresponds to the $X_i$ being IID Ber$(r)$. Hence, the probability of observing $k$ ones in $n$ trials is $\binom{n}{k}r^k(1-r)^{n-k}$, the binomial probability.
The following example helps to explain why Theorem 1.48 is not used very often with random variables having continuous distributions.
1.4. DeFinetti's Representation Theorem 29
where $A$ is the union of the $\binom{n}{i_1,\ldots,i_k}$ product sets of the form $B_{s_1}\times\cdots\times B_{s_N}$, where the subscripts $s_1,\ldots,s_N$ are integers from 1 to $k$ with $j$ appearing $i_j$ times for each $j$. Needless to say, this formulation will not get us very far in general.
Example 1.52. This example is due to Bayes (1764). Suppose that $\{X_n\}_{n=1}^\infty$ are exchangeable Bernoulli random variables. One can verify the identity
$$\frac{1}{(n+2)\binom{n+1}{k+1}} + \frac{1}{(n+2)\binom{n+1}{k}} = \frac{1}{(n+1)\binom{n}{k}}.$$
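The identity above can also be checked numerically. The following sketch (illustrative code, not from the text) confirms it in exact rational arithmetic for a range of $n$ and $k$:

```python
from fractions import Fraction
from math import comb

def lhs(n, k):
    # 1/[(n+2) C(n+1, k+1)] + 1/[(n+2) C(n+1, k)]
    return (Fraction(1, (n + 2) * comb(n + 1, k + 1))
            + Fraction(1, (n + 2) * comb(n + 1, k)))

def rhs(n, k):
    # 1/[(n+1) C(n, k)]
    return Fraction(1, (n + 1) * comb(n, k))

# exact equality holds for every 0 <= k <= n
assert all(lhs(n, k) == rhs(n, k) for n in range(1, 30) for k in range(n + 1))
```

Working with `Fraction` rather than floats keeps the check exact, so no rounding tolerance is needed.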
To figure out what $P$ is, recall that $P_n(\{1\}) = \overline{X}_n$, the proportion of successes in the first $n$ trials, and $\lim_{n\to\infty} P_n(\{1\}) = P(\{1\})$. Since $\overline{X}_n$ converges a.s. to $\Theta$, it converges in distribution to $\Theta$ by Theorem B.90. Let $F_n(t) = \Pr(\overline{X}_n \le t)$ be the CDF of $\overline{X}_n$. Write
$$f_{X_1,\ldots,X_n}(x_1,\ldots,x_n) = \frac{\Gamma(a+n)\,b^a}{\Gamma(a)\left(b+\sum_{i=1}^n x_i\right)^{a+n}}, \quad \text{for all } x_i > 0,$$
where $\ell = \lfloor tn\rfloor$. The probability that the first $k$ $X_i$ are at most $c$ while the rest are greater is
$$\int_0^c\cdots\int_0^c\int_c^\infty\cdots\int_c^\infty \frac{\Gamma(a+n)\,b^a}{\Gamma(a)\left(b+\sum_{i=1}^n x_i\right)^{a+n}}\,dx_n\cdots dx_1$$
$$= \sum_{j=0}^{k}(-1)^j\binom{k}{j}\left(\frac{b}{b+(n-k+j)c}\right)^{a}$$
$$= \sum_{j=0}^{k}(-1)^j\binom{k}{j}\int_0^\infty \frac{b^a}{\Gamma(a)}z^{a-1}\exp(-z[b+(n-k+j)c])\,dz$$
$$= \int_0^\infty \frac{b^a}{\Gamma(a)}z^{a-1}\exp(-z[b+cn])[\exp(cz)-1]^k\,dz.$$
Multiplying this last expression by $\binom{n}{k}$ and summing over $k = 0,\ldots,\ell$ gives
$$F_n(t) = \int_0^\infty \frac{b^a}{\Gamma(a)}z^{a-1}\exp(-bz)\Pr(Y_{n,z}\le\ell)\,dz,$$
where $Y_{n,z}$ is a random variable with Bin$(n, 1-\exp[-cz])$ distribution. For each $z$, $\Pr(Y_{n,z}\le \lfloor tn\rfloor)$ converges to 1 if $1-\exp(-cz) < t$ and to 0 if $1-\exp(-cz) > t$, so
$$\lim_{n\to\infty}F_n(t) = \int_0^{-\log(1-t)/c} \frac{b^a}{\Gamma(a)}z^{a-1}\exp(-bz)\,dz,$$
which is the CDF of $1-\exp(-c\Theta)$, where $\Theta\sim\Gamma(a,b)$. That is, it is as if there were a random variable $\Theta\sim\Gamma(a,b)$ and $P((-\infty,c]) = 1-\exp(-c\Theta)$. Put another way, it is as if the $X_i$ were conditionally IID with Exp$(\theta)$ distribution given $\Theta=\theta$ and $\Theta\sim\Gamma(a,b)$. That this is indeed the case can be proven. (See Problem 22 on page 77.)
Example 1.54 (Continuation of Example 1.52; see page 29). In the case of Bernoulli random variables, it is not difficult to calculate conditional probabilities. Suppose that we observe $k^*$ successes in the first $n^*$ trials, and we are interested in the probability of $k$ successes in the next $n$ trials. It is straightforward to calculate the probability of $k$ successes in the next $n$ trials given $k^*$ successes in the first $n^*$ trials as
$$\binom{n^*}{k^*}\binom{n}{k}\frac{n^*+1}{\binom{n^*+n}{k^*+k}(n^*+n+1)}.$$
For example, we get that the probability of $k$ successes in the next $n$ trials given two successes in the first five trials is
$$\frac{60\binom{n}{k}}{\binom{n+5}{k+2}(n+6)}. \qquad (1.55)$$
It is easy to see that the future trials are still exchangeable given the past, and one could use the distribution in (1.55) to find the distribution of $\Theta$ given the observed trials, just as we found the original distribution of $\Theta$ to be U$(0,1)$. Alternatively, we have a theorem that applies to all exchangeable Bernoulli sequences.
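The predictive formula above can be double-checked against the direct Beta-integral computation. The sketch below (illustrative code, not from the text) compares the closed form with the exact posterior computation under the U$(0,1)$ prior, in rational arithmetic:

```python
from fractions import Fraction
from math import comb, factorial

def predictive(k, n, k_star, n_star):
    # closed form: C(n*,k*) C(n,k) (n*+1) / [C(n*+n, k*+k) (n*+n+1)]
    return Fraction(comb(n_star, k_star) * comb(n, k) * (n_star + 1),
                    comb(n_star + n, k_star + k) * (n_star + n + 1))

def beta_int(a, b):
    # B(a+1, b+1) = a! b! / (a+b+1)!  for nonnegative integers a, b
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def predictive_direct(k, n, k_star, n_star):
    # integrate the Bin(n, theta) likelihood against the posterior of theta
    # under a U(0,1) prior, exactly
    num = comb(n, k) * beta_int(k_star + k, n_star + n - k_star - k)
    return num / beta_int(k_star, n_star - k_star)

for k in range(6):
    assert predictive(k, 5, 2, 5) == predictive_direct(k, 5, 2, 5)
assert sum(predictive(k, 5, 2, 5) for k in range(6)) == 1
```

The final assertion confirms that the predictive probabilities form a proper distribution over $k = 0,\ldots,n$.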
Dividing this by
$$\int_S g(P)\,d\mu_P(P) = \int_{V^{-1}(S)} g(V(\theta))\,d\theta.$$
1.5. Proofs of DeFinetti's Theorem 33
The function $V^{-1}$ gives us a way of dealing with $P$ as if it were the real number $\theta = V^{-1}(P)$. This is a special case of a parametric index, to be defined in Definition 1.85.
$$\Pr(|Y_m-Y_n|\ge c) \le \frac{E(Y_m-Y_n)^2}{c^2} = \frac{E(Y_n^2)+E(Y_m^2)-2E(Y_nY_m)}{c^2}$$
$$= \left(\frac{1}{m^2}\left[m\mu_2+m(m-1)m_2\right] + \frac{1}{n^2}\left[n\mu_2+n(n-1)m_2\right] - \frac{2}{mn}\left[n\mu_2+n(n-1)m_2+n(m-n)m_2\right]\right)\frac{1}{c^2}$$
$$= \left(\frac{1}{n}-\frac{1}{m}\right)\frac{\mu_2-m_2}{c^2} \le \frac{\mu_2-m_2}{nc^2}. \qquad (1.60)$$
Now, let $Z_k = Y_{s_k}$ and $A_k = \{s : |Z_{k+1}-Z_k| \ge 2^{-k}\}$. It follows easily from (1.60) (with $c = 2^{-k}$) that, for every $k$, $\Pr(A_k) < (\mu_2-m_2)2^{-k}$. Now, let $A = \bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k$. It follows from the first Borel-Cantelli lemma A.20 that $\Pr(A) = 0$. We finish the proof by showing that, for every $s \in A^C$ and every $\epsilon > 0$, there exists $N_s$ such that $n, m \ge N_s$ implies $|Z_n(s)-Z_m(s)| < \epsilon$.
21Two theorems (7.49 and 7.80) make use of the strong law of large numbers for IID random variables 1.63. This result is available as a consequence of Theorem 1.62, but not from Theorem 1.59.
22This theorem is used in the proof of Theorem 1.49. Its proof resembles the proof of a similar claim in Loeve (1977, Section 6.3).
23A sequence of real numbers $\{x_n\}_{n=1}^\infty$ is Cauchy if, for every $\epsilon > 0$, there exists $N$ such that $n, m \ge N$ implies $|x_n - x_m| < \epsilon$. Since $\mathbb{R}$ is a complete metric space, every Cauchy sequence converges.
Write $A^C = \bigcup_{n=1}^\infty\bigcap_{k=n}^\infty A_k^C$. For every $s \in A^C$, there exists $c_s$ such that $s \in \bigcap_{k=c_s}^\infty A_k^C$. If $m > n \ge c_s$, it follows that
$$|Z_m(s)-Z_n(s)| \le \sum_{i=n}^{m-1}|Z_{i+1}(s)-Z_i(s)| < 2^{-n+1} \le 2^{-c_s+1}.$$
Each of the terms above, for which at least one of $i_1, i_2, i_3, i_4$ is not repeated, has mean 0 because the random variables are conditionally independent with mean 0. Let $M$ be a bound for $|X_n|$. It follows that $E(Y_n^4) \le 4M^4/n^2$ by the law of total probability B.70. It follows from the Markov inequality B.15 that, for each $\epsilon > 0$, $\Pr(|Y_n| > \epsilon) \le E(Y_n^4)/\epsilon^4 \le 4M^4/(n^2\epsilon^4)$. So, $\sum_{n=1}^\infty \Pr(|Y_n| > \epsilon) < \infty$. The first Borel-Cantelli lemma A.20 implies that $\Pr(|Y_n| > \epsilon$ infinitely often$) = 0$. Since the event that $Y_n$ converges to 0 is $\bigcap_{k=1}^\infty\{|Y_n| > 1/k$ infinitely often$\}^C$, it follows that $Y_n$ converges to 0 almost surely. $\square$
†This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
24This proof is also similar to one given for the case of IID random variables by Doob (1953, Section VII, 6). Those who are unfamiliar with martingale theory may safely skip this section and study the elementary version given earlier. But these readers should be aware that two theorems (7.49 and 7.80) do make use of Corollary 1.63.
25This theorem can be used in the proof of Theorem 1.49.
26This corollary is used in the proofs of Theorems 7.49 and 7.80.
for example,
$$\Pr(K = k\mid M = m) = \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{N}{n}}. \qquad (1.64)$$
Suppose that $N \to \infty$ in such a way that $M/N \to \theta$. For fixed $n$ and $k$, we can take limits in (1.64) as $N \to \infty$ and $m/N \to \theta$. Formally, we would get
$$\Pr(K = k\mid \Theta = \theta) = \binom{n}{k}\theta^k(1-\theta)^{n-k},$$
which is the model for $K \sim$ Bin$(n,\theta)$. In fact, this is what Theorem 1.47 says is the case. The precise proof is a bit more complicated than the heuristic argument above, but the idea is the same.
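The heuristic limit is easy to illustrate numerically. The sketch below (illustrative only) evaluates the hypergeometric probability in (1.64) for growing $N$ with $m/N$ held near $\theta$ and compares it with the Bin$(n,\theta)$ probability:

```python
from math import comb

def hypergeom_pmf(k, n, m, N):
    # Pr(K = k | M = m) as in (1.64): n draws without replacement
    return comb(m, k) * comb(N - m, n - k) / comb(N, n)

def binom_pmf(k, n, theta):
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

theta, n, k = 0.3, 10, 4
for N in (100, 10_000, 1_000_000):
    m = int(theta * N)
    gap = abs(hypergeom_pmf(k, n, m, N) - binom_pmf(k, n, theta))
    print(N, gap)  # the gap shrinks as N grows
```

The discrepancy is of order $n^2/N$, so it becomes negligible once the population is much larger than the sample.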
PROOF OF THEOREM 1.47. The "if" direction is simple and is left to the reader. For the "only if" direction, assume that $\{X_n\}_{n=1}^\infty$ is an exchangeable Bernoulli sequence. Let $Y_i = \sum_{j=1}^i X_j/i$ for $i = 1,2,\ldots$. By the strong law of large numbers 1.62 or 1.59, we know that $Y_n$ converges almost surely. Let $\Theta$ denote the limit when the limit exists, and let $\Theta = 1/2$ when the limit does not exist. Let $\mu_\Theta$ denote the distribution of $\Theta$ (a probability measure on $[0,1]$).
The main step in the proof is to show that for every integer $k$, every $j_1,\ldots,j_k \in \{0,1\}$, and every Borel subset $C$ of $[0,1]$,
$$\Pr(X_1 = j_1,\ldots,X_k = j_k, \Theta\in C) = \int_C \prod_{t=1}^k \theta^{j_t}(1-\theta)^{1-j_t}\,d\mu_\Theta(\theta). \qquad (1.65)$$
For each $t$, let $W_{i,t} = X_i$ if $j_t = 1$ and $W_{i,t} = 1-X_i$ if $j_t = 0$, so that
$$\frac{1}{m}\sum_{i=1}^m W_{i,t} = \begin{cases} Y_m & \text{if } j_t = 1,\\ 1-Y_m & \text{if } j_t = 0.\end{cases}$$
With this notation, we can write
$$Z_m = \frac{1}{m^k}\,I_C(\Theta)\sum_{i_1=1}^m\cdots\sum_{i_k=1}^m\prod_{t=1}^k W_{i_t,t}$$
$$= \frac{1}{m^k}\,I_C(\Theta)\sum_{\text{all } i_t \text{ distinct}}\ \prod_{t=1}^k W_{i_t,t} + \frac{1}{m^k}\,I_C(\Theta)\sum_{\text{at least two } i_t \text{ equal}}\ \prod_{t=1}^k W_{i_t,t}. \qquad (1.66)$$
The first sum on the right-hand side of (1.66) has $m!/(m-k)!$ terms. Since $E[I_C(\Theta)\prod_{t=1}^k W_{i_t,t}]$ equals the left-hand side of (1.65) when all $i_t$
are distinct, and since $m!/[(m-k)!m^k]$ converges to 1, the mean of the first term on the right of (1.66) converges to the left-hand side of (1.65). The second sum has $m^k - m!/(m-k)!$ terms, each of which is bounded between 0 and 1. Since $1 - m!/[(m-k)!m^k]$ converges to 0, so does the mean of the last expression in (1.66). This completes the proof of (1.65).
Equation (1.65) is exactly what it means to say that $X_1,\ldots,X_k$ are conditionally IID Ber$(\theta)$ given $\Theta = \theta$. To see that the distribution $\mu_\Theta$ is unique, let $C = [0,1]$ in (1.65) and note that this equation determines the means of all polynomial functions of $\Theta$. Since polynomials are dense in the set of all bounded continuous functions on $[0,1]$ by the Stone-Weierstrass theorem C.3, it follows that (1.65) determines the means of all bounded continuous functions of $\Theta$, and Corollary B.107 says that the means of all bounded continuous functions determine the distribution. To finish the proof, we note that since $\{X_n\}_{n=1}^\infty$ are bounded and conditionally IID, Lemma 1.61 says that $\{Y_i\}_{i=1}^\infty$ converges a.s. Obviously, the limit must be $\Theta$, a.s. $\square$
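A simulation makes Theorem 1.47 concrete. In the sketch below (illustrative code with an assumed U$(0,1)$ mixing distribution), each exchangeable sequence is generated by first drawing $\Theta$ and then IID Bernoulli trials; the running average $Y_n$ settles near that sequence's own realized $\Theta$:

```python
import random

random.seed(0)

def exchangeable_bernoulli(n):
    # draw Theta ~ U(0,1), then X_1, ..., X_n IID Ber(Theta) given Theta
    theta = random.random()
    xs = [1 if random.random() < theta else 0 for _ in range(n)]
    return theta, xs

# Y_n = (X_1 + ... + X_n)/n converges a.s. to Theta (Theorem 1.47)
for _ in range(5):
    theta, xs = exchangeable_bernoulli(100_000)
    y_n = sum(xs) / len(xs)
    assert abs(y_n - theta) < 0.01
```

Note that the limit differs from sequence to sequence: it is the random variable $\Theta$, not a fixed constant.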
stand for the probability that $n$ draws without replacement from an urn containing balls labeled $x_1,\ldots,x_N$ form a point in $B$. Also, let $M_n(B|x)$ stand for the probability that $n$ draws with replacement from an urn containing balls labeled $x_1,\ldots,x_N$ form a point in $B$. Let $Y_1,\ldots,Y_N$ be conditionally IID given $X = x$ with distribution $M_N(\cdot|x)$. Then
$$= \int H_n(B|x)\,dQ_X(x). \qquad \square \qquad (1.68)$$
$$= \int_C H_n(B|P)\,d\mu_N(P),$$
where the first equality follows from Lemma 1.67, and the second follows from Theorem A.81. This proves (1.68). $\square$
For the general case, suppose that for each $i$, ball $i$ has two labels $(i,y_i)$, and let $\mathcal{Y}_0$ denote the set of these labels. Let $X'$ and $Y'$ record both labels, so that the first part of the proof applies to their distributions. Call these distributions $Q'$ and $P'$. Assume that $X$ and $Y$ still only record the second parts of the labels. For each $A \subseteq \mathcal{Y}^n$ there exists a set $A' \subseteq \mathcal{Y}_0^n$ such that $X \in A$ if and only if $X' \in A'$, and $Y \in A$ if and only if $Y' \in A'$. In fact,
$$A' = \bigcup_{x\in A}\ \prod_{j=1}^n \{(i,y_i) : y_i = x_j\}. \qquad (1.71)$$
PROOF. Let $Q_X$ stand for the joint distribution of $X_1,\ldots,X_N$. For $x = (x_1,\ldots,x_N)$ and $B \in \mathcal{B}^n$, let $H_n(B|x)$ and $M_n(B|x)$ be as in Lemma 1.67. (If $P_N$ is the empirical distribution of $x$, then $M_n(B|x) = P_N^n(B)$.) By Lemma 1.69, we have $|H_n(B|x) - M_n(B|x)| \le n(n-1)/2N$ for all $x$. From Lemma 1.67, we have
(1.72)
The conditional probability that $Y = e$ given $Y' = e^*$ is in precisely the same form as the marginal probability of $Y = e$, except that the distribution of $M$ has been replaced by the conditional distribution of $M^* = M - 1$ given $Y' = e^*$.
For example, suppose that $p_m = 1/(N+1)$ for $m = 0,\ldots,N$. Then, the probability of one success in one trial is
The probability has been shifted to higher values of m* after seeing one success.
Suppose now that we see two observations $X_1 = 1$ and $X_2 = 0$. The probability of this is
$$\sum_{m=1}^{N-1}\frac{\binom{m}{1}\binom{N-m}{1}}{\binom{N}{2}}\,\frac{1}{N+1} = \frac{2}{N+1}\sum_{m=1}^{N-1}\frac{m(N-m)}{N(N-1)}$$
28The readers should convince themselves that $p^*_{m^*}$ is indeed equal to $\Pr(M^* = m^*\mid Y' = e^*)$.
for all $x$ with distinct coordinates. For each such $B$, Lemma 1.67 says that
$$\Pr((Y_1,\ldots,Y_k)\in B, X\in C) = \frac{k!\binom{N}{k}}{N^k}\Pr((X_1,\ldots,X_k)\in B, X\in C),$$
$$\Pr((Y_1,\ldots,Y_k)\in B) = \frac{k!\binom{N}{k}}{N^k}\Pr((X_1,\ldots,X_k)\in B).$$
Let $Q_{X_1,\ldots,X_k}$ and $Q_{Y_1,\ldots,Y_k}$ stand for the joint distributions of $(X_1,\ldots,X_k)$ and $(Y_1,\ldots,Y_k)$, respectively. It follows that for all integrable functions $f$ and all $B\in\mathcal{B}^k$ such that all points have distinct coordinates:
$$\int_B f(x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k) = \frac{k!\binom{N}{k}}{N^k}\int_B f(x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k). \qquad (1.79)$$
$$\Pr((X_1,\ldots,X_k)\in B, X\in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k),$$
$$\Pr((Y_1,\ldots,Y_k)\in B, X\in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k),$$
and the two conditional distributions are indeed the same for $(x_1,\ldots,x_k)$ vectors with distinct coordinates. Since such vectors have probability 1 under the distribution of $X$, the two conditional distributions are the same a.s.
Now, we apply Lemma 1.67 to both the conditional distributions given $X_1,\ldots,X_k$ and given $Y_1,\ldots,Y_k$. Let $B\in\mathcal{B}^n$ have all distinct coordinates, and let $C$ be the set of all $x\in\mathcal{X}^N$ with distinct coordinates. Then, we get
(1.81)
29Since the first term on the left-hand side of (1.71) does not depend on $N$, the limit in (1.81) must be the same for all convergent subsequences.
30Diaconis and Freedman (1980a) offer a sketch of an abstract proof showing directly that $P_N$ converges in distribution for general random quantities. We will not actually prove that $P_N$ converges in distribution to $P$. Rather, we prove that the finite-dimensional joint distributions of $P_N$ converge to those of $P$. Billingsley (1968), which contains an in-depth discussion of convergence in distribution, shows that convergence in distribution requires a condition called tightness in addition to convergence of finite-dimensional joint distributions. The work of the tightness condition is done in that part of the proof of Theorem 1.49 in which we prove equation (1.84). An alternative proof is given by Aldous (1985, Section 7). A very general theorem is proven by Hewitt and Savage (1955).
31Heath and Sudderth (1976) give an alternative proof in the Bernoulli case which relies on a different subsequence argument.
just (1.81). We will need to show that for every measurable subset C of P,
In the last expression above, the first sum has $m!/(m-k)!$ terms, each of which has mean equal to the left-hand side of (1.84). Since $m!/[(m-k)!m^k]$ converges to 1, the mean of $1/m^k$ times this first sum converges to the left-hand side of (1.84). The second sum has $m^k - m!/(m-k)!$ terms, each of which is bounded between 0 and 1. Since $1 - m!/[(m-k)!m^k]$ converges to 0, so does the mean of the second sum. This completes the proof of (1.84).
Apply (1.84) with $k = 1$, $i_1 = 1$, and $B_1 = B$ to get that $E(I_B(X_1)I_C) = E(P(B)I_C)$. This means that $P(B)$ is a version of $\Pr(X_1\in B|\mathcal{A}_\infty)$ for every $B$. Since $(\mathcal{X},\mathcal{B})$ is a Borel space, this can be assumed to be part of a regular conditional distribution, and we can assume that $P(B) = \Pr(X_1\in B|\mathcal{A}_\infty)$. In this way $P$ becomes a random probability measure so long as we can prove that it is a measurable function from $(S,\mathcal{A})$ to $(\mathcal{P},\mathcal{C}_\mathcal{P})$. The $\sigma$-field $\mathcal{C}_\mathcal{P}$ was set up so that $P$ is measurable if and only if $P(B)$ is measurable for all $B$. Since $\mathcal{A}_\infty\subseteq\mathcal{A}$, $P(B)$ is measurable for all $B$. Also, since $\Pr(X_1\in B|\mathcal{A}_\infty) = P(B)$ is a function of $P$ for each $B$ and since $P$ is $\mathcal{A}_\infty$-measurable, it follows from Theorem B.73 that $\Pr(X_1\in B|P) = P(B)$.
Now, let $\mu_P$ denote the distribution of $P$. To prove that the $X_i$ are conditionally independent given $P = P$ with distribution $P$, apply (1.84) with $C = \{P\in A\}$ for arbitrary $A\in\mathcal{C}_\mathcal{P}$. The result is
$$\Pr(X_1\in B_1,\ldots,X_k\in B_k,\ P\in A) = \int_A\prod_{j=1}^k P(B_j)\,d\mu_P(P) = \int_A\prod_{j=1}^k \Pr(X_1\in B_j\mid P = P)\,d\mu_P(P),$$
where the first equation is immediate from (1.84), the second follows from Theorem A.81, and the third follows from the fact that $\Pr(X_1\in B\mid P = P) = P(B)$. This completes the proof of conditional independence given $P$.
For the uniqueness, suppose that $\mu_1$ and $\mu_2$ are possible distributions for a random probability measure $P$ such that the $X_i$ are conditionally IID with distribution $P$ given $P = P$. We will prove that the finite-dimensional
distributions of $\mu_1$ and $\mu_2$ agree, and then Theorem B.131 says that $\mu_1 = \mu_2$. Let $B_1,\ldots,B_n\in\mathcal{B}$, and let $k_1,\ldots,k_n$ be positive integers. We have already proven that
Hence, the means of all polynomial functions of $(P(B_1),\ldots,P(B_n))$ are the same according to $\mu_1$ and $\mu_2$. By the Stone-Weierstrass theorem C.3 the means of all bounded continuous functions of finitely many $P(B)$ values are determined by the means of polynomial functions. Hence, the means of all bounded continuous functions of $(P(B_1),\ldots,P(B_n))$ are the same according to $\mu_1$ and $\mu_2$. Corollary B.107 says that $\mu_1$ and $\mu_2$ give the same joint distribution to $(P(B_1),\ldots,P(B_n))$, and the proof of uniqueness is complete.
The convergence claim follows from Theorem 1.62 or from Lemma 1.61 and the fact that the bounded random variables $I_B(X_i)$ are conditionally IID. $\square$
These cases all have something in common, namely that the set of distribu-
tions is finitely parameterized. That is, there exists a one-to-one mapping
between the set of distributions and a subset of a finite-dimensional Eu-
clidean space. In the normal example, the mapping associates the N$(m,s^2)$ distribution with $(m,s)\in\mathbb{R}\times\mathbb{R}^+$. With such a parameter mapping, we can
switch the problem of integration over subsets of P to integration over Eu-
clidean space. The problem of finding conditional distributions is resolved
the same way. The conditional distribution in Euclidean space induces the
appropriate conditional distribution in P. (See Theorem B.28 on page 617.)
There are cases (see Sections 1.6.1 and 1.6.2) in which we want the range of
the parameter mapping to be an infinite-dimensional space. In such cases,
we will need to develop special methods for calculating integrals.
Now, let $\mathcal{P}_0$ be a subset of $\mathcal{P}$ and let $\Theta':\mathcal{P}_0\to\Omega$ be a bimeasurable function, where $\Omega$ is a set with $\sigma$-field $\tau$. The $\sigma$-field of subsets of $\mathcal{P}$ which we need to consider is $\mathcal{C}_0 = \{A\cap\mathcal{P}_0 : A\in\mathcal{C}_\mathcal{P}\}$. Let $\mu_P$ be a probability measure on $(\mathcal{P}_0,\mathcal{C}_0)$ and let $\mu_\Theta$ be the probability on $(\Omega,\tau)$ induced by $\Theta'$ from $\mu_P$. That is, for each $A\in\mathcal{C}_0$, $\mu_P(A) = \mu_\Theta(\Theta'(A))$, and for each $B\in\tau$, $\mu_\Theta(B) = \mu_P(\Theta'^{-1}(B))$. To integrate a measurable function $h:\mathcal{P}_0\to\mathbb{R}$, we note that
$$\int h(P)\,d\mu_P(P) = \int h(\Theta'^{-1}(\theta))\,d\mu_\Theta(\theta),$$
$$S \xrightarrow{\,X^\infty\,} \mathcal{X}^\infty \xrightarrow{\,P\,} \mathcal{P}_0 \xrightarrow{\,\Theta'\,} \Omega.$$
Let the function $\Theta: S\to\Omega$ be defined by $\Theta(s) = \Theta'(P(X^\infty(s)))$. (Note that the value of $\Theta$ is the same as the value of $\Theta'$, hence we will often find it convenient to use the symbol $\Theta$ to refer to both $\Theta$ and $\Theta'$.) We call $\Theta$ the parameter. Let $\mu_\Theta$ be the probability induced on $(\Omega,\tau)$ by $\Theta$ from $\mu$. Let $\mathcal{A}_X$ be the sub-$\sigma$-field of $\mathcal{A}$ generated by $X^\infty$. Since $(\mathcal{X}^\infty,\mathcal{B}^\infty)$ is also a Borel space, regular conditional distributions given $\Theta$ exist. For each $A\in\mathcal{A}_X$, let $P'_\theta(A) = \Pr(A|\Theta)(s)$ for all $s$ such that $\Theta(s) = \theta$. For each $B\in\mathcal{B}^\infty$, let $P_\theta(B) = P'_\theta(X^{\infty-1}(B))$. In words, $\{P_\theta : \theta\in\Omega\}$ specifies the conditional distribution of $X^\infty$ given $\Theta$.
Example 1.86. Let $\mathcal{X} = \mathbb{R}$ and let $\mathcal{P}_0$ be the set of all normal distributions. Assume that $\mu_P$ assigns probability 1 to the set $\mathcal{P}_0$. We can let $\Theta'(P)$ be the vector consisting of the mean and standard deviation of the normal distribution $P$. Then $\Theta(s)$ is the vector consisting of the mean and standard deviation of the limit of the empirical distribution of a sequence $\{X_n\}_{n=1}^\infty$ of exchangeable random variables. By the strong law of large numbers 1.63 and the fact that the $X_n$ are conditionally IID given $P$, $\Theta(s)$ is also the limit (a.s.) of the sample average $\overline{X}_n = \sum_{i=1}^n X_i/n$ and the sample standard deviation $\sqrt{\sum_{i=1}^n (X_i-\overline{X}_n)^2/(n-1)}$ of the data sequence. If $\theta = (\mu,\sigma)$, then $P_\theta$ is the distribution that says that $\{X_n\}_{n=1}^\infty$ are IID N$(\mu,\sigma^2)$ random variables. The notation $P'_\theta$ stands for the probability measure on $(S,\mathcal{A}_X)$ defined by $P'_\theta(X^{\infty-1}(B)) = P_\theta(B)$ for $B\in\mathcal{B}^\infty$.
The probability measures $P_\theta$ for $\theta\in\Omega$ are on the space $(\mathcal{X}^\infty,\mathcal{B}^\infty)$. They induce probabilities on all of the spaces $(\mathcal{X}^n,\mathcal{B}^n)$, for $n = 1,2,\ldots$ via the obvious projections. It will prove convenient to refer to all of these induced probabilities by the same name, $P_\theta$. That is, if $A\in\mathcal{B}^n$, let $P_\theta(A)$ denote $P_\theta(A\times\mathcal{X}\times\mathcal{X}\times\cdots)$. This will be very convenient without causing any confusion. If it becomes important to know over which space, $(\mathcal{X}^n,\mathcal{B}^n)$ or $(\mathcal{X}^\infty,\mathcal{B}^\infty)$, $P_\theta$ is defined, we will be explicit.
Sometimes the parameter $\Theta$ can be expressed as a meaningful function of the distribution $P$, say $H(P)$, which is also defined for distributions outside of the parametric family. For example, $H(P) = \int x\,dP(x)$, the mean of the distribution, is defined for every distribution with finite mean whether or not that distribution is a member of a parametric family of interest. When this occurs, it may be that $H$ is continuous in the sense that $H(P_n)\stackrel{P}{\to} H(P)$ if $\lim_{n\to\infty} P_n = P$. The distribution of $\Theta$ can then be considered as an approximation to the distribution of $H(P_n)$, where $P_n$ is the empirical probability measure of the first $n$ observations.
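This plug-in idea is easy to sketch. The code below (illustrative only) takes $H(P) = 1/\int x\,dP(x)$, evaluates $H$ at the empirical measure of simulated Exp$(\theta)$ data, and checks that the result is close to the parameter:

```python
import random

random.seed(1)
theta = 2.5
# simulate exponential data with rate theta; P_n is their empirical measure
xs = [random.expovariate(theta) for _ in range(200_000)]
# plug-in value H(P_n) = 1 / (sample mean), since H(P) = 1 / mean of P
h_n = 1 / (sum(xs) / len(xs))
assert abs(h_n - theta) < 0.05  # close to the true parameter
```

The same recipe applies to any functional $H$ that is continuous at $P$: evaluate it at $P_n$ and rely on the convergence of $P_n$ to $P$.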
Example 1.87 (Continuation of Example 1.53; see page 29). In the Exp$(\theta)$ distribution, $\theta$ is one over the mean of the distribution. So, $H(P) = \left(\int x\,dP(x)\right)^{-1}$ and $H(P_n) = \left(\sum_{i=1}^n X_i/n\right)^{-1} = 1/\overline{X}_n$. Indeed, $1/\overline{X}_n\stackrel{P}{\to}\Theta$, so that we can take
$$\Pr\Big(\sum_{j\in C_i} Y_j \le t_i,\ \text{for } i = 1,\ldots,p\Big) = \Pr((Y_1,\ldots,Y_n)\in G(t)), \qquad (1.91)$$
where

It is easy to see that (1.92) is the same as the right-hand side of (1.90). $\square$
Since the distributions specified are consistent, we can use them for the
distribution of P.
where $\alpha_i = \alpha(B_i)$ for $i = 1,\ldots,n$, then we say that $P$ has Dirichlet process distribution with base measure $\alpha$, denoted by Dir$(\alpha)$.
The Dirichlet process is useful only if we can do the necessary calculations
for making inference. The most crucial is updating in the light of data.
Theorem 1.94. Suppose that $\{X_n\}_{n=1}^\infty$ is a sequence of exchangeable random quantities, that they are conditionally independent with distribution $P$ given $P = P$, and that $P$ has Dir$(\alpha)$ distribution. Then the marginal distribution of each $X_i$ is the probability measure $\alpha/\alpha(\mathcal{X})$ and the conditional distribution of $P$ given $X_1 = x_1,\ldots,X_k = x_k$ is Dir$(\beta)$, where $\beta$ is the measure defined by $\beta(C) = \alpha(C) + \sum_{i=1}^k I_C(x_i)$ for each $C$.
PROOF. First, we prove the claim about the marginal distribution of $X_i$. For $B\in\mathcal{B}$,
and note that $P\in A$ if and only if $(P(B_1^0),\ldots,P(B_n^0))\in A_B$. Let $\beta_i = \alpha(B_i^*)$ for $i = 1,\ldots,n$, and let $\beta_i = \alpha(B_i^0)$ for $i = n+1,\ldots,2n$. Let $c = \{i :$
33If $\alpha(B) = \alpha(\mathcal{X})$, it is trivial to prove that $\int_B \mu_{P|X_1}(A|x)\,d\mu_{X_1}(x) = \Pr(P\in A, X_1\in B)$, where $\mu_{P|X_1}$ is the conditional distribution of $P$ given $X_1$ to be defined later in this proof, and $\mu_{X_1}$ is the marginal distribution of $X_1$ already determined.
34Recall the extended definition of the Dirichlet distribution in which $\alpha_i = 0$ means that the $i$th coordinate is 0 with probability 1.
1.6. Infinite-Dimensional Parameters 55
$\beta_i \ne 0\}$, and let $k$ be the highest number in $c$. Let $c' = \{i : \beta_i \ne 0\}\setminus\{k\}$. If $\beta_j = 0$, let $z_j = 0$ in the following equations. Then we can write
$$\Pr(P\in A, X_1\in B) = \Pr(P(B_1)\le t_1,\ldots,P(B_n)\le t_n,\ X_1\in B)$$
$$= \int_A P(B)\,d\mu_P(P) \qquad (1.95)$$
$$= \int_{A_B}\sum_j z_j\,\frac{\Gamma(\alpha(\mathcal{X}))}{\prod_{i\in c}\Gamma(\beta_i)}\prod_{i\in c'} z_i^{\beta_i-1}\Big(1-\sum_{i\in c'}z_i\Big)^{\beta_k-1}\prod_{i\in c'}dz_i.$$
It follows that
$$= \sum_{j}\frac{\beta_j}{\alpha(\mathcal{X})}\int_{A_B}\frac{\Gamma(\alpha(\mathcal{X})+1)}{\prod_{i\in c}\Gamma(\beta_i^j)}\prod_{i\in c'} z_i^{\beta_i^j-1}\Big(1-\sum_{i\in c'}z_i\Big)^{\beta_k^j-1}\prod_{i\in c'}dz_i.$$
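The updating rule of Theorem 1.94 is mechanical once the base measure is tabulated on a finite partition. The sketch below (hypothetical helper, not from the text) forms the posterior masses $\beta(B) = \alpha(B) + \#\{i : x_i\in B\}$ on the cells of such a partition:

```python
from collections import Counter

def dirichlet_process_update(alpha, data):
    # posterior base measure: beta(B) = alpha(B) + #{x_i in B} (Theorem 1.94)
    counts = Counter(data)
    return {cell: mass + counts.get(cell, 0) for cell, mass in alpha.items()}

# prior mass alpha(X) = 2 spread over three cells; observe five labeled points
alpha = {"low": 0.5, "mid": 1.0, "high": 0.5}
beta = dirichlet_process_update(alpha, ["low", "mid", "mid", "mid", "high"])
assert beta == {"low": 1.5, "mid": 4.0, "high": 1.5}
```

On a finite partition this is exactly Dirichlet-multinomial conjugacy: the vector of cell probabilities has a Dirichlet distribution whose parameters grow by the observed counts.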
Ferguson (1973) and Blackwell (1973) prove that there is a set of discrete distributions $\mathcal{P}_0\subseteq\mathcal{P}$ such that the Dir$(\alpha)$ distribution assigns probability 1 to $\mathcal{P}_0$. Sethuraman (1994) proves an alternative theorem, which not only shows that the Dirichlet process is a probability on discrete distributions, but also gives an algorithm for approximately simulating a CDF with Dir$(\alpha)$ distribution. The result of Sethuraman (1994) is that the set of points on which the Dir$(\alpha)$ distribution concentrates its mass is an infinite IID sample $Y_1, Y_2,\ldots$ from the probability $\alpha/\alpha(\mathcal{X})$, and the probability assigned to $Y_n$ is $p_n$, where $p_1 = Q_1$, and for $n > 1$, $p_n = Q_n\prod_{i=1}^{n-1}(1-Q_i)$, where the $Q_i$ are IID with Beta$(1,\alpha(\mathcal{X}))$ distribution. What we prove here is a very simple theorem of Krasker and Pratt (1986) which implies that the Dir$(\alpha)$ distribution assigns probability 1 to a set of discrete distributions.
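Sethuraman's construction translates directly into a sampler. The sketch below (illustrative only, truncating the infinite stick-breaking once the leftover mass is negligible) draws the atoms and weights of one realization $P\sim$ Dir$(\alpha)$ with $\alpha(\mathcal{X}) = 5$ and base probability U$(0,1)$:

```python
import random

random.seed(2)

def stick_breaking(total_mass, base_sampler, tol=1e-8):
    # atoms Y_i IID from alpha/alpha(X); weights p_n = Q_n * prod_{i<n}(1 - Q_i)
    # with Q_i IID Beta(1, alpha(X)); stop once the unbroken stick < tol
    atoms, weights = [], []
    remaining = 1.0
    while remaining > tol:
        q = random.betavariate(1.0, total_mass)
        atoms.append(base_sampler())
        weights.append(remaining * q)
        remaining *= 1.0 - q
    return atoms, weights

atoms, weights = stick_breaking(5.0, random.random)
assert abs(sum(weights) - 1.0) < 1e-6  # the weights exhaust the stick
assert len(atoms) == len(weights)
```

The realization is a discrete distribution by construction, in agreement with the Ferguson-Blackwell result quoted above.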
Theorem 1.97. Let $\{X_n\}_{n=1}^\infty$ be conditionally IID with distribution $P$ given $P = P$. For $n > 1$, define
$$a_n = \Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1}).$$
If $\lim_{n\to\infty} a_n = 0$, then $P$ is a discrete distribution with probability 1.
PROOF. Define
$$B_\epsilon = \{P : \exists A\in\mathcal{B} \text{ such that } P(A) > \epsilon \text{ and } P(\{x\}) = 0 \text{ for all } x\in A\}.$$
It suffices to prove that $\Pr(B_\epsilon) = 0$ for all $\epsilon > 0$. The conditional probability, given $P = P$ and $X_1,\ldots,X_{n-1}$, that $X_n$ is distinct from $X_1,\ldots,X_{n-1}$ is at least $\epsilon$ for all $P\in B_\epsilon$. It follows that
$$a_n = E\,\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1}\mid X_1,\ldots,X_{n-1},P)$$
$$\ge E[\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1}\mid X_1,\ldots,X_{n-1},P)I_{B_\epsilon}(P)]$$
$$\ge \epsilon\Pr(B_\epsilon).$$
Since $\lim_{n\to\infty}a_n = 0$, $\Pr(B_\epsilon) = 0$ for all $\epsilon > 0$, as required. $\square$
For the Dir$(\alpha)$ distribution, it is easy to calculate $a_n = \alpha(\mathcal{X})/[\alpha(\mathcal{X})+n-1]$.
The posterior predictive distribution of a future observation is a weighted average of the prior measure $\alpha/\alpha(\mathcal{X})$ and the empirical probability measure.
Proposition 1.98. Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID with distribution $P$ given $P = P$ and that $P$ has Dir$(\alpha)$ distribution. The posterior predictive distribution of a future $X_i$ given $X_1 = x_1,\ldots,X_n = x_n$ is $\beta/[\alpha(\mathcal{X})+n]$, where $\beta(C) = \alpha(C)+\sum_{i=1}^n I_C(x_i)$.
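Applying Proposition 1.98 one observation at a time yields a sequential (Polya-urn-style) sampler. The sketch below (illustrative code) draws $X_1,\ldots,X_n$ this way and uses the draws to estimate $a_n = \alpha(\mathcal{X})/[\alpha(\mathcal{X})+n-1]$ for a continuous base measure:

```python
import random

random.seed(3)

def polya_urn_sample(n, total_mass, base_sampler):
    # X_{k+1} is a fresh draw from alpha/alpha(X) with probability
    # alpha(X)/(alpha(X) + k); otherwise it repeats a uniformly chosen
    # earlier observation (Proposition 1.98 applied one step at a time)
    xs = []
    for k in range(n):
        if random.random() < total_mass / (total_mass + k):
            xs.append(base_sampler())
        else:
            xs.append(random.choice(xs))
    return xs

# with a continuous base measure, a_10 = alpha(X)/(alpha(X)+9) = 2/11
trials, new = 20_000, 0
for _ in range(trials):
    xs = polya_urn_sample(10, 2.0, random.random)
    new += xs[-1] not in xs[:-1]
assert abs(new / trials - 2 / 11) < 0.02
```

Because the base measure here is continuous, a "fresh draw" is almost surely distinct from all earlier values, so the membership test identifies exactly the new-atom events.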
The predictive joint distribution of several future observations can be ob-
tained by applying Proposition 1.98 several times, each time after condi-
tioning on one more random variable. This gives a straightforward way to
generate a sample whose conditional distribution is P, which itself has a
Dirichlet process distribution. The joint distribution can also be described
as follows.
Lemma 1.99.35 Assume that $\{X_n\}_{n=1}^\infty$ are conditionally IID with distribution $P$ given $P = P$ and that $P$ has Dir$(\alpha)$ distribution. Let $n > 0$. If $p$ is a partition of $\{1,\ldots,n\}$, let $g(p)$ be the number of nonempty sets in $p$, and let $k_1(p),\ldots,k_{g(p)}(p)$ be the numbers of elements of the $g(p)$ sets. (Note that $\sum_{i=1}^{g(p)}k_i(p) = n$ for all $p$.) For each $x\in\mathcal{X}^n$, let $R(x)$ be the partition of $\{1,\ldots,n\}$ which matches $x$. (That is, $x$ has $g(R(x))$ distinct coordinates, and for each set $A$ in the partition $R(x)$, those coordinates of $x$ whose subscripts are in $A$ are all equal to each other.) For each $x\in\mathcal{X}^n$, define $Z(x)\in\mathcal{X}^{g(R(x))}$ to be the vector of distinct coordinates such that $Z(x)_i$ is repeated $k_i(R(x))$ times in $x$. For each $p$ and each subset $B$ of $\mathcal{X}^n$, define $B_p$ to be that subset of $\mathcal{X}^{g(p)}$ which consists of the set of distinct coordinates of points in $B\cap R^{-1}(p)$. (That is, $B_p = Z(B\cap R^{-1}(p))$.) Define the measure $\nu$ on $\mathcal{X}^n$ by
$$\nu(B) = \sum_{\text{all } p}\alpha^{g(p)}(B_p).$$
The joint distribution of $X_1,\ldots,X_n$ has the following density with respect to the measure $\nu$:
$$f_X(x) = \frac{\displaystyle\sum_{\text{all } p} I_{R^{-1}(p)}(x)\prod_{i=1}^{g(p)}\prod_{j=2}^{k_i(p)}\big(\alpha(\{Z(x)_i\})+j-1\big)}{\displaystyle\prod_{i=1}^{n}\big(\alpha(\mathcal{X})+i-1\big)}.$$
for every partition $p$, and the result will then follow by adding up finitely many terms. It is easy to see that $\nu(C) = \alpha^{g(p)}(C_p)$ for each $p$ and each subset $C$ of $R^{-1}(p)$, and that $f_X$ is a function of $Z$. It follows that
$$\int_{B\cap R^{-1}(p)} f_X(x)\,d\nu(x) = \prod_{i=1}^{n}\big(\alpha(\mathcal{X})+i-1\big)^{-1}\int_{B_p}\prod_{i=1}^{g(p)}\prod_{j=2}^{k_i(p)}\big(\alpha(\{z_i\})+j-1\big)\,d\alpha^{g(p)}(z). \qquad (1.101)$$
Now, we will show that (1.101) is the probability that $X\in B\cap R^{-1}(p)$. Let $j_1,\ldots,j_{g(p)}$ be $g(p)$ coordinates that are distinct for all $x$ in $R^{-1}(p)$. Let
Example 1.102. As a simple example of Lemma 1.99, suppose that $\mathcal{X} = \mathbb{R}$ and $\alpha$ is some finite continuous (no point masses) measure. The measure $\nu$ is then the sum of the various $k$-dimensional product measures of $\alpha$ for $k = 1,\ldots,n$ over the sets where there are exactly $k$ distinct coordinates. For example, if $n = 3$, then the partitions are
$$p_1 = \{\{1\},\{2\},\{3\}\},\quad p_2 = \{\{1,2\},\{3\}\},\quad p_3 = \{\{1\},\{2,3\}\},\quad p_4 = \{\{1,3\},\{2\}\},\quad p_5 = \{\{1,2,3\}\}.$$
So, $g(p_1) = 3$, and $k_i(p_1) = 1$, while $g(p_2) = g(p_3) = g(p_4) = 2$, and so on. Also,
$$R^{-1}(p_1) = \{(x_1,x_2,x_3) : x_1\ne x_2,\ x_1\ne x_3,\ x_2\ne x_3\},$$
$$R^{-1}(p_2) = \{(x_1,x_2,x_3) : x_1 = x_2,\ x_1\ne x_3\},$$
$$R^{-1}(p_3) = \{(x_1,x_2,x_3) : x_2 = x_3,\ x_1\ne x_3\},$$
$$R^{-1}(p_4) = \{(x_1,x_2,x_3) : x_1 = x_3,\ x_1\ne x_2\},$$
$$R^{-1}(p_5) = \{(x_1,x_2,x_3) : x_1 = x_2 = x_3\}.$$
$$f_X(x) = \frac{1}{\alpha(\mathcal{X})[\alpha(\mathcal{X})+1][\alpha(\mathcal{X})+2]}\times\begin{cases} 2 & \text{if } x\in R^{-1}(p_5),\\ 1 & \text{otherwise.}\end{cases}$$
To calculate the probability that $X$ is in the unit cube $B$, say, we must add up five integrals, one for each partition:
$$\Pr(0\le X_i\le 1,\ \text{for } i = 1,2,3) = \frac{\alpha^3(B_{p_1})+\alpha^2(B_{p_2})+\alpha^2(B_{p_3})+\alpha^2(B_{p_4})+2\alpha(B_{p_5})}{\alpha(\mathcal{X})[\alpha(\mathcal{X})+1][\alpha(\mathcal{X})+2]}.$$
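As a sanity check on this sum (a sketch, not from the text): if $\alpha$ is $c$ times the uniform measure on $[0,1]$, then $\alpha^3(B_{p_1}) = c^3$, each $\alpha^2(B_{p_j}) = c^2$, and $\alpha(B_{p_5}) = c$, so the five terms must combine to give probability 1 for any $c$:

```python
from fractions import Fraction

def cube_probability(c):
    # numerator: alpha^3(B_p1) + 3*alpha^2(B_pj) + 2*alpha(B_p5)
    #          = c^3 + 3c^2 + 2c
    # denominator: alpha(X) [alpha(X)+1] [alpha(X)+2] = c (c+1) (c+2)
    return Fraction(c**3 + 3 * c**2 + 2 * c, c * (c + 1) * (c + 2))

# all three observations must land in [0,1], so the probability is exactly 1
assert all(cube_probability(c) == 1 for c in (1, 2, 5, 10, 100))
```

The check works because $c^3+3c^2+2c = c(c+1)(c+2)$, i.e., the five partition terms reassemble the normalizing constant.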
$\Theta = \theta$ with density $a_\theta(\cdot)/c$. This is probably not the effect one thought one was achieving by using a Dirichlet process. That is, there is nothing the least bit nonparametric about the analysis one ends up performing in this situation. In fact, this phenomenon is quite general.
Lemma 1.104.36 Suppose that person 1 believes that $\{X_n\}_{n=1}^\infty$ are IID with a continuous distribution. For each $\theta\in\Omega$, let $\alpha_\theta$ be a continuous finite measure with $\alpha_\theta(\mathcal{X}) = c$ for all $\theta$. Suppose that person 2 models the data as conditionally IID given $P = P$ and $\Theta = \theta$ with distribution $P$ and that $P$ given $\Theta = \theta$ has Dir$(\alpha_\theta)$ distribution. Suppose that person 3 models the data as conditionally IID given $\Theta = \theta$ with distribution $\alpha_\theta/c$. Assume that $\alpha_\theta\ll\nu$ for all $\theta$. Suppose that person 2 and person 3 use the same prior distribution for $\Theta$. Then person 1 believes that, with probability 1, for every $n$, person 2 and person 3 will calculate exactly the same posterior distributions for $\Theta$ given $X_1,\ldots,X_n$.
PROOF. First, note that the density $f_X$ in Lemma 1.99 is constant in $\theta$ for every data set that has no observed values at points where $\alpha_\theta$ puts positive mass. Such a data set will occur with probability 1 according to person 1. Let $a_\theta = d\alpha_\theta/d\nu$. With probability 1 (according to person 1) person 2 will then have likelihood function proportional to $\prod_{i=1}^n a_\theta(x_i)$. This is the same as the likelihood function that person 3 will have. Hence, persons 2 and 3 will calculate the same posterior. $\square$
Example 1.105. Since the Dirichlet process assigns probability 1 to discrete CDFs, it may not be considered suitable for cases in which one really wants a continuous CDF. One possibility is to model the observable data $\{X_n\}_{n=1}^\infty$ as $X_i = Y_i + Z_i$, where $\{Y_n\}_{n=1}^\infty$ are conditionally IID with CDF $G$ given $G = G$, where $G$ has Dir$(\alpha)$ distribution, and $\{Z_n\}_{n=1}^\infty$ are independent of $\{Y_n\}_{n=1}^\infty$ and of $G$ and of each other with a distribution having density $f$. The posterior distribution of $G$ is not easy to obtain in this case, but a method that can be used to approximate it will be given in Section 8.5. Escobar (1988) gives an algorithm for implementing this method.
Definition 1.106. Let $(\mathcal{X},\mathcal{B})$ be a Borel space. For each integer $n > 0$, let $\pi_n$ be a countable partition of $\mathcal{X}$ whose elements are in $\mathcal{B}$. Suppose that
36This lemma is used to show why it may not be sensible to use a Dirichlet pro-
cess for the prior if there will also be an additional finite-dimensional parameter
of interest.
†This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
$$\sum_{\text{all } C \text{ such that } ps(C) = B} V_{n;C} = 1. \qquad (1.107)$$
Example 1.110. Let $\mathcal{X} = \mathbb{R}$, and let $\pi_1 = \{(-\infty,0),\{0\},(0,\infty)\}$. Let $\{\pi'_n\}_{n=2}^\infty$ be a sequence of nested partitions of $(-\infty,0)$. For each $n > 1$, let $\pi''_n$ be the partition of $(0,\infty)$ formed by the negatives of the sets in $\pi'_n$. Let $\pi_n = \pi'_n\cup\pi''_n\cup\{\{0\}\}$. So long as $V_{n;B} = V_{n;C}$ whenever $B = -C$, $P$ will be symmetric around 0.
When $X\sim P$ given $P$ and $P$ is a tailfree process, the predictive distribution of $X$ is easy to find. Let $A\in\pi_m$, and let $A = \bigcap_{i=1}^m B_i$, where $B_i\in\pi_i$ for each $i\le m$. Then the predictive probability that $X\in A$ is
$$\mu_X(A) = E(P(A)) = \prod_{i=1}^m E(V_{i;B_i}). \qquad (1.112)$$
It is sometimes possible to find a density for the predictive distribution of $X$ with respect to some measure $\nu$ of interest (like Lebesgue measure).
Lemma 1.113.37 Let $(\mathcal{X},\mathcal{B},\nu)$ be a $\sigma$-finite measure space. Assume that $P$ is tailfree with respect to $(\{\pi_n\}_{n=1}^\infty,\{V_{n;B} : n\ge 1, B\in\pi_n\})$. Assume that each element of each $\pi_n$ has positive $\nu$ measure. For each $x\in\mathcal{X}$, let $C_n(x)$ be the element of $\pi_n$ containing $x$ and define
$$f_n(x) = \frac{1}{\nu(C_n(x))}\prod_{i=1}^n E(V_{i;C_i(x)}).$$
If $\lim_{n\to\infty} f_n(x) = f(x)$, a.e. $[\nu]$, and $\int f(x)\,d\nu(x) = 1$, then $f = d\mu_X/d\nu$.
PROOF. We need to prove that for each B ∈ C, μ_X(B) = ∫_B f(x) dν(x). The extensions to the smallest field containing C and the smallest σ-field containing C are straightforward. Let B ∈ π_n, and let B ⊆ B_i ∈ π_i for i = 1,…,n. Then V_i(x) = V_{i;B_i} for all x ∈ B and i = 1,…,n. By (1.112), we have, for each x_0 ∈ B,

μ_X(B) = ∏_{i=1}^n E(V_{i;B_i}) = f_n(x_0) ν(C_n(x_0)) = ∫_B f_n(x) dν(x),

since f_n(x) = f_n(x_0) for all x ∈ B and B = C_n(x_0). For k > n, write B = ∪_{α∈A} D_α as the partition of B by elements of π_k. Since f_k is constant on each D_α (call the value f_k(x_α) for x_α ∈ D_α), we can write

∫_B f_k(x) dν(x) = ∑_{α∈A} f_k(x_α) ν(D_α) = ∑_{α∈A} μ_X(D_α) = μ_X(B).
Example 1.114. Suppose that ν is a finite measure and, for each n and B, E(V_{n;B}) = ν(B)/ν(ps(B)). In the notation of Lemma 1.113, f_n(x) = 1/ν(X) for all n and x. In this case, μ_X = ν/ν(X), and the density is constant. In fact, this gives a convenient way to force a tailfree process to have a desired predictive distribution for X.
Tailfree processes are conjugate in the sense that the posterior is tailfree if the prior is tailfree.

Theorem 1.115. Let P be a random probability measure that is tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}) and X ∼ P given P. Then P given X is tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}).
PROOF. Fix k and n_1 < ⋯ < n_k. Let V_i, for i = 1,…,k, be a finite (say, of size s_i) collection of the elements of π_{n_i}, and let F_{V_i} denote their joint CDF. Let F_{V_i|X}(·|x) denote the conditional CDF of V_i given X = x. Let A ∈ B, and let C_i ⊆ ℝ^{s_i}, for i = 1,…,k. We must show that

Pr(X ∈ A, V_i ∈ C_i for i = 1,…,k) = ∫_A ∏_{i=1}^k F_{V_i|X}(C_i|x) dμ_X(x),  (1.116)

where μ_X is given by (1.112). If we can prove (1.116) for all A ∈ C, it will be true for all A ∈ B by Theorem A.26.
First, we find F_{V_i|X}. By definition, for C ⊆ ℝ^{s_i}, F_{V_i|X}(C|·) is any measurable function h such that

∫_A h(x) dμ_X(x) = Pr(X ∈ A, V_i ∈ C)

for all A ∈ B. Once again, the equation need only hold for all A ∈ C. We propose the following function:

h(x) = E( I_C(V_i) V_{n_i;B_{n_i}(x)} ) / E( V_{n_i;B_{n_i}(x)} ),  (1.117)

where B_j(x) denotes the element of π_j containing x. Note that h is constant on each element of π_{n_i}; write h(B) for its value on B ∈ π_{n_i}. For A ∈ C, say A ∈ π_n,

∫_A h(x) dμ_X(x) = ∑_{B ∈ π_{n_i}} ∫_{A∩B} h(x) dμ_X(x).  (1.118)
If n_i ≤ n, there is only one term in the sum in (1.118). It is the term with B = B_{n_i} such that A ⊆ B_{n_i}, and the integral equals

h(B_{n_i}) μ_X(A) = [ E( I_C(V_i) V_{n_i;B_{n_i}} ) / E( V_{n_i;B_{n_i}} ) ] ∏_{j=1}^n E(V_{j;B_j}) = E( I_C(V_i) V_{n_i;B_{n_i}} ) ∏_{j≠n_i} E(V_{j;B_j}),
which is what we needed to show. Similarly, if n_i > n, the only terms in the sum in (1.118) that appear are those for which B ⊆ A, and A is the union of these sets. The sum becomes

∑_{B : B ⊆ A} h(B) μ_X(B),  (1.119)

where B_j(B) ∈ π_j and B ⊆ B_j(B) for j = 1,…,n_i. The right-hand side of (1.119) can be written as

∑_{B : B ⊆ A} E( I_C(V_i) V_{n_i;B} ) ∏_{j=1}^{n_i−1} E(V_{j;B_j(B)}),
Next, define, for j = 1,…,n,

D_j = { the whole range of W_j, if j ∉ {n_1,…,n_k};  C_i, if j = n_i. }

In what follows, w_{j,1} will be the first coordinate of w_j. It is easy to see that

Pr(X ∈ A, V_i ∈ C_i for i = 1,…,k) = Pr(X ∈ A, W_j ∈ D_j for j = 1,…,n)
= ∫_{D_1} ⋯ ∫_{D_n} ∏_{j=1}^n w_{j,1} dF_{W_n}(w_n) ⋯ dF_{W_1}(w_1)
= ∏_{j=1}^n E( W_{j,1} I_{D_j}(W_j) )
= ∏_{j∉{n_1,…,n_k}} E(V_{j;B_j}) ∏_{i=1}^k E( V_{n_i;B_{n_i}} I_{C_i}(V_i) )
= ∏_{j=1}^n E(V_{j;B_j}) ∏_{i=1}^k [ E( V_{n_i;B_{n_i}} I_{C_i}(V_i) ) / E( V_{n_i;B_{n_i}} ) ]
= ∫_A ∏_{i=1}^k F_{V_i|X}(C_i|x) dμ_X(x).  □
∫ f_n(x) d(μ × ν)(s, x) = 1 for all n, and

∫_{{(s,x) : f_n(x) > m}} f_n(x) d(μ × ν)(s, x) = ∫_X ∫_S f_n(x) I_{(m,∞)}(f_n(x)) dμ(s) dν(x)
≤ ∫_X ∫_S (1/m) f_n²(x) dμ(s) dν(x) = (1/m) ∫_X E[f_n²(x)] dν(x),  (1.122)

where the first equation follows from Tonelli's theorem A.69, and the inequality follows since I_{(m,∞)}(f_n(x)) ≤ f_n(x)/m. By assumption, the supremum over n of the last expression in (1.122) is a finite number divided by m, which goes to 0 with m, so the sequence is uniformly integrable.
Next, we prove that f is a density with respect to ν with probability 1. By Theorem A.60, we have

lim_{n→∞} ∫ f_n(x) d(μ × ν)(s, x) = ∫ f(x) d(μ × ν)(s, x).  (1.123)
0 ≤ ∑_{n=1}^∞ sup_x { E[V_n²(x)]/(E V_n(x))² − 1 } = ∑_{n=1}^∞ sup_x Var V_n(x)/(E V_n(x))² ≤ ∑_{n=1}^∞ sup_{B∈π_n} Var(V_{n;B})/(E V_{n;B})² < ∞.

Note that log(y) ≤ y − 1 for all y > 0. With y = sup_x E[V_n²(x)]/(E V_n(x))², we get, for each n,

sup_x log( E[V_n²(x)]/(E V_n(x))² ) ≤ sup_x { E[V_n²(x)]/(E V_n(x))² − 1 };

hence

∑_{n=1}^∞ sup_x log( E[V_n²(x)]/(E V_n(x))² ) ≤ ∑_{n=1}^∞ sup_x { E[V_n²(x)]/(E V_n(x))² − 1 } < ∞.

Since the V_n(x) are independent, ∏_{i=1}^n E[V_i²(x)]/(E V_i(x))² equals E[f_n²(x)]. Hence sup_n E[f_n²(x)] is integrable. □
The following simple corollaries follow from Tonelli's theorem A.69.

Corollary 1.125. Let X ∼ P given P. If P ≪ ν with probability 1, then the predictive distribution of X has density with respect to ν equal to the mean of dP/dν.

Corollary 1.126. If {X_n}_{n=1}^∞ are conditionally IID with distribution P given P and P ≪ ν with probability 1, then conditional on X_1,…,X_n, P ≪ ν with probability 1, a.s. with respect to the joint distribution of X_1,…,X_n.
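Corollary 1.125 can be illustrated in a toy parametric case (this example, including the two-point prior, is my own and not from the text): take P = N(Θ, 1) with Θ equally likely to be −1 or +1. Then dP/dν at x is the random normal density φ(x − Θ), and the corollary says the predictive density is its mean, (φ(x+1) + φ(x−1))/2, which a Monte Carlo average recovers:

```python
# Monte Carlo check of Corollary 1.125 in a toy case: P = N(theta, 1) with
# theta uniform on {-1, +1}. The predictive density at x is the mean of the
# random density dP/dnu, i.e. (phi(x+1) + phi(x-1))/2.
import math
import random

def phi(z):  # standard normal density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

random.seed(0)
x = 0.3
draws = [phi(x - random.choice([-1.0, 1.0])) for _ in range(200_000)]
mc = sum(draws) / len(draws)               # Monte Carlo mean of dP/dnu at x
exact = (phi(x + 1.0) + phi(x - 1.0)) / 2  # predictive density at x
assert abs(mc - exact) < 0.005
```

The same averaging idea is what makes the tailfree-process predictive densities below computable branch by branch.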
that happens in the Dirichlet process. The difference is that, for a Dirichlet
process, the above argument applies to every set, and every set is in the
first partition π_1 for the Dirichlet process. Given this description of the
posterior, we can use Corollary 1.125 to construct the predictive density of
a future observation X n +1 given the observed values of Xl, ... , X n .
Example 1.130. Let P have a Polya tree distribution. Suppose that the conditions of Theorem 1.121 hold and that P ≪ ν with probability 1. Let x_1,…,x_m be the observed values of the first m random quantities. For each x ∈ X and each n such that x is not in the same element of π_{n−1} as one of the x_i,

E[ V_n(x) | X_1 = x_1,…,X_m = x_m ] = E[V_n(x)] = ν(C_n(x))/ν(ps(C_n(x))).

Consider

(1/ν(C_{r−1}(x))) ∏_{i=1}^{r−1} g_i(x),  (1.131)

where r is the first integer such that x is not in the same element of π_{r−1} with any of the x_i, and g_i(x) is the posterior mean of V_i(x) given the observed data. It follows from Tonelli's theorem A.69 that E f(x) equals (1.131).
We can actually find an explicit formula for g_i(x). Using the same notation as above, suppose that V_n(x) is coordinate i_n(x) of W_n(x) and that W_n(x) has Dir_k(α_{n,1}(x),…,α_{n,k}(x)) prior distribution. Then, the posterior distribution of V_n(x) is Beta(a, b) with

a = α_{n,i_n(x)}(x) + ∑_{j=1}^m I_{C_n(x)}(x_j),

b = ∑_{l≠i_n(x)} α_{n,l}(x) + ∑_{j=1}^m I_{C_{n−1}(x)∖C_n(x)}(x_j).

That is, the first parameter of the posterior beta distribution equals the prior parameter plus the number of observations that are in the same partition set as x. The second parameter of the posterior beta distribution equals the prior parameter plus the number of observations that were in the same partition set as x in the most recent partition but now are not in the same partition set as x. It follows that g_n(x) = a/(a + b).
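The counting rule above can be sketched in code (illustrative only; I assume k = 2 with dyadic partitions of [0, 1) and α_{n,i} = n²/2, the setup of the examples that follow, and the function names are my own):

```python
# Beta(a, b) posterior update for the Polya tree branch probability V_n(x),
# assuming k = 2 dyadic partitions of [0, 1) and alpha_{n,i} = n^2 / 2.

def dyadic_interval(x, n):
    """The level-n dyadic interval C_n(x) containing x."""
    width = 0.5 ** n
    j = int(x / width)
    return j * width, (j + 1) * width

def branch_posterior(x, n, data):
    lo, hi = dyadic_interval(x, n)           # C_n(x)
    p_lo, p_hi = dyadic_interval(x, n - 1)   # ps(C_n(x)) = C_{n-1}(x)
    in_child = sum(1 for xj in data if lo <= xj < hi)
    in_parent_only = sum(1 for xj in data
                         if p_lo <= xj < p_hi and not (lo <= xj < hi))
    a = n * n / 2 + in_child          # prior weight plus hits in C_n(x)
    b = n * n / 2 + in_parent_only    # prior weight plus hits in C_{n-1}\C_n
    return a, b                       # posterior is Beta(a, b), mean a/(a+b)

# x = 0.47 has C_1(x) = [0, 0.5); 0.49 lands in C_1(x), while 0.51 and 0.9
# land in C_0(x) \ C_1(x) = [0.5, 1), so the posterior is Beta(0.5+1, 0.5+2).
assert branch_posterior(0.47, 1, [0.49, 0.51, 0.9]) == (1.5, 2.5)
```

The mean a/(a+b) of this Beta distribution is the factor g_n(x) used in (1.131).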
Example 1.132. Let {X_n}_{n=1}^∞ be an exchangeable sequence such that the prior marginal distribution of each X_i is N(0,100). Let Y_i = Φ(X_i/10), where Φ is the N(0,1) CDF. The prior marginal distribution of the Y_i is U(0,1). Suppose that we model the Y_i as conditionally IID given P and that P has a Polya tree process distribution on [0,1) with k = 2 and α_{n,i}(x) = n²/2 for all n, i, and x. This is a special case of Example 1.114 on page 63; hence each Y_i has marginal distribution U(0,1). Fifty observations X_1,…,X_50 were simulated from a Laplace distribution Lap(1,1), which does not look much like the prior marginal distribution N(0,100). The posterior mean of the density of P∘(10Φ^{−1})^{−1} was computed and is plotted in Figure 1.133 together with the prior marginal mean of the density and a histogram of the data values. The posterior mean of the density of P∘(10Φ^{−1})^{−1} is high where the data values are close together, as is to be expected. The posterior mean smoothes out some of the ups and downs in the histogram, especially those in the tails. The reason it smoothes the tails a bit more than the center of the distribution is that the partition sets in the tail which do and do not contain observations only belong to partitions π_n for relatively large n. A few observations do not have much impact on the posterior distribution of V_{n;B} for large n because the prior is Beta(n²/2, n²/2). In the center of the distribution, however, the different partition sets belong to the same π_n for smaller values of n.
FIGURE 1.133. Histogram of the data values, together with the prior mean density, the posterior mean density, and the true density.
have posterior distributions that depend in no way on where the two sets are located relative to each other. The same is true for elements of the first partition in a Polya tree process. For Polya tree processes, two sets in the nth partition will have their probabilities more closely related when they share more superset partition sets. For example, two subsets of B_{1,1} will be more closely related than a subset of B_{1,1} and a subset of B_{1,2}. Two subsets of B_{2,1} ⊆ B_{1,1} will be more closely related than a subset of B_{2,1} and a subset of B_{2,2} ⊆ B_{1,1}, even though both are subsets of B_{1,1}.

One potential problem with tailfree process priors is the dependence on the sequence of partitions. One consequence of this dependence is easily seen in Figure 1.133. The tall vertical lines in the posterior mean plot occur at boundaries of sets in early partitions. The following example explores this in more detail.
Example 1.134. Suppose that X = [0,1] and we use a Polya tree prior with
k = 2 and an,i(x) = n 2 /2 for all n, i, and x. Suppose that Xl = 0.49. The
predictive density of X 2 at the value x = 0.51 is calculated as in Example 1.130
on page 70. It is
f_{X_2|X_1}(0.51|0.49) = (0.5/2) / 0.5 = 0.5.

On the other hand, the predictive density of X_2 at x = 0.47 is

f_{X_2|X_1}(0.47|0.49) = ( ∏_{n=1}^5 (n²/2 + 1)/(n² + 1) ) × (18/37) × 2⁶ = 2.1183.
Note that in each of these cases, the proposed value of X_2 differs from the observed X_1 by 0.02 and they are all in the vicinity of 0.5, and yet the first predictive density is so much smaller than the second. The reason is the following. In the first case, the two data values share no partition sets in common, not even in π_1. In the second case, the two data values share the same partition set for the first five partitions. In symbols, C_1(0.49) ≠ C_1(0.51), while C_n(0.47) = C_n(0.49) for n = 1,…,5. Sharing partition sets is what makes predictive densities large.
One way to reduce the effect of the problem illustrated in Example 1.134
is to use a mixture of tailfree priors with partitions that have no common
boundaries.
Example 1.135 (Continuation of Example 1.134). Suppose that we use a half-and-half mixture of two Polya tree priors with k = 2, 3 and α_{n,i}(x) = n²/k. After some tedious algebra, one calculates the two predictive densities as

f_{X_2|X_1}(0.51|0.49) = 1.8312,
f_{X_2|X_1}(0.47|0.49) = 2.3191.

The reason that the first density is now almost as high as the second is that 0.49 and 0.51 appear together in one more partition set in the k = 3 prior than do 0.49 and 0.47. The densities are higher than with k = 2 alone because a prior with larger k tends to let the density track the data more. With values so close together as the ones in this example, the prior with k = 3 has very high posterior mean of f(x) for x near 0.49 and very low mean for x not near 0.49. (Using the k = 3 prior alone, the two predictive density values would have been 3.1624 and 2.5200, respectively.)
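The numbers in Examples 1.134 and 1.135 can be checked numerically. The sketch below (illustrative code; the function name and the recursive k-ary partitioning of [0, 1) are my assumptions, consistent with the examples) walks down the partition tree, multiplying the posterior mean g_n(x) of each branch probability and correcting for the k-fold shrinkage of the partition sets:

```python
# Predictive density f_{X2|X1}(x | x1) for a Polya tree prior on [0, 1) with
# k-ary partitions and alpha_{n,i} = n^2 / k, reproducing Examples 1.134-1.135.
def polya_tree_predictive(x, x1, k=2, max_depth=60):
    density, lo, hi = 1.0, 0.0, 1.0        # start from C_0(x) = [0, 1)
    for n in range(1, max_depth + 1):
        if not (lo <= x1 < hi):            # x1 has left C_{n-1}(x); remaining
            break                          # prior-mean factors cancel exactly
        width = (hi - lo) / k
        j = min(int((x - lo) / width), k - 1)          # child containing x
        c_lo, c_hi = lo + j * width, lo + (j + 1) * width
        in_child = 1 if c_lo <= x1 < c_hi else 0
        a = n * n / k + in_child                       # posterior Beta(a, b)
        b = (k - 1) * n * n / k + (1 - in_child)
        density *= k * a / (a + b)         # g_n(x) over the k-fold shrinkage
        lo, hi = c_lo, c_hi
    return density

f2a = polya_tree_predictive(0.51, 0.49, k=2)   # 0.5     (Example 1.134)
f2b = polya_tree_predictive(0.47, 0.49, k=2)   # 2.1183  (Example 1.134)
f3a = polya_tree_predictive(0.51, 0.49, k=3)   # 3.1624  (Example 1.135)
f3b = polya_tree_predictive(0.47, 0.49, k=3)   # 2.5200  (Example 1.135)
print(round(0.5 * (f2a + f3a), 4))             # mixture: 1.8312
print(round(0.5 * (f2b + f3b), 4))             # mixture: 2.3191
```

The half-and-half mixture values match 1.8312 and 2.3191 from Example 1.135, and the single-prior values match 0.5, 2.1183, 3.1624, and 2.5200 quoted in the text.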
1.7 Problems

Throughout this text problems are given, and the following type of expression is often used: "Suppose that (some random quantities) are conditionally independent given Θ = θ." This will mean that, for all θ in some parameter space (implicit or explicit), the random quantities are conditionally independent given Θ = θ with some distributions to be specified later in the problem. Some of the more challenging problems throughout the text have been identified with an asterisk (*) after the problem number.
Section 1.2:
Section 1.3:
where X = ∑_{i=1}^n X_i, and the numbers a_i are nonnegative and add to 1. Let Θ = lim_{n→∞} ∑_{i=1}^n X_i/n. Prove that the prior distribution of Θ is Pr(Θ = i/10) = a_i for i = 0,…,10.
12. Suppose that for every m = 1, 2,…,

f_{X_1,…,X_m}(x_1,…,x_m) = 2 / [ (m + 1) c_m(x_1,…,x_m)^{m+1} ], if all x_i ≥ 0,
(a) Prove that the X_i are exchangeable and that these distributions are consistent.
(b) Let Y_n = c_n(X_1,…,X_n). Find the distribution of Y_n and the limit of this distribution as n → ∞.
(c) Find the conditional density of X_{n+1} given X_1 = x_1,…,X_n = x_n, and assume that lim_{n→∞} c_n(x_1,…,x_n) = 0. Find the limit of the conditional density as n → ∞.
(d) Use DeFinetti's representation theorem to show that the prior (the answer to part (b)) and likelihood (the answer to part (c)) combine to give the original joint distribution.
13. Let Θ be a random variable. Suppose that {X_n}_{n=1}^∞ are IID N(0,1). Let T(0) = 0. For each i > 0, let T(i) be the first j > T(i−1) such that X_j ≥ Θ. Let Y_i = X_{T(i)} for i = 1, 2,…. If we use Lebesgue measure as an improper prior distribution for Θ, how many observations must we observe before the posterior becomes proper?
f_{X|Θ}(x|θ) = ∑_{i=1}^k π_i f_i(x|θ).

Show that there are numbers π_1(x),…,π_k(x) adding to 1 such that

f_{Θ|X}(θ|x) = ∑_{i=1}^k π_i(x) p_i(θ|x),

(a) Find the marginal distribution for X_{i_1},…,X_{i_n} for distinct integers i_1,…,i_n.
(b) Find the posterior distribution for Θ given X_1 = x_1,…,X_n = x_n; that is, find Pr(Θ ≤ θ | X_1 = x_1,…,X_n = x_n).
16. Suppose that {X_n}_{n=1}^∞ are conditionally independent random variables with X_i ∼ N(μ_i, 1) given (M, M_1,…,M_n,…) = (μ, μ_1,…,μ_n,…). Suppose also that, given M = μ,
1
2
1
for each n, and M ∼ N(μ_0, 1).
(a) Prove that {X_n}_{n=1}^∞ are exchangeable.
(b) Find a one-dimensional random variable Θ such that the X_i are conditionally IID given Θ = θ, and find the distribution of Θ.
(c) Show that X̄_n = ∑_{i=1}^n X_i/n does not converge in probability to M.
17. Let c be a constant, and let X, Y be conditionally independent given Θ = θ with X ∼ Poi(θ), Y ∼ Poi(cθ). Let Θ ∼ Γ(α_0, β_0).
18. Suppose that an expert believes that {X_n}_{n=1}^∞ are exchangeable Bernoulli random variables. Let Θ = lim_{n→∞} ∑_{i=1}^n X_i/n, and assume that a statistician wishes to model Θ ∼ Beta(a, b). The statistician tries to elicit the values of a and b from the expert by asking questions like, "What is the probability that X_1 = 1?" and "How many X_i = 1 in a row would you have to observe before you would raise the probability that the next X_j = 1 up to q?" Suppose that the answer to the first question is p, and suppose that q in the second question is chosen to be (1 + p)/2. Let the answer to the second question be m.
(a) Find values for a and b which are consistent with the model.
(b) Find the partial derivatives of a and b with respect to p, and find the effects of a change of 1 in m on both a and b.
(c) Suppose that the second question above was changed to "If you were to observe X_1 = ⋯ = X_10 = 1, what would you give as the Pr(X_11 = 1)?" Let the answer to this question be r. Find values of a and b consistent with the values of p and r, and find the partial derivatives of a and b with respect to p and r.
19. Suppose that an expert believes that {X_n}_{n=1}^∞ are conditionally IID with N(μ, σ²) distribution given (M, Σ) = (μ, σ). A statistician wishes to model M ∼ N(μ_0, σ²/λ_0) given Σ = σ and Σ² ∼ Γ^{−1}(a_0/2, b_0/2). The statistician tries to elicit the values of μ_0, λ_0, a_0, and b_0 from the expert by asking a sequence of questions such as:
What is the median of the distribution of X_1? (Suppose that the answer is u_1.)
Given that X_1 ≥ u_1, what is the conditional median of the distribution of X_1? (Suppose that the answer is u_2.)
Given that X_1 ≥ u_2, what is the conditional median of the distribution of X_1? (Suppose that the answer is u_3.)
If X_1 = u_2 is observed, what would be the conditional median of the distribution of X_2? (Suppose that the answer is u_4.)
(a) Prove that the following constraints on u_1, u_2, u_3, and u_4 are sufficient for there to exist a prior distribution of the desired form consistent with the responses: u_1 < u_4 < u_2 and (u_3 − u_1)/(u_2 − u_1) > 1.705511. (The constraints are actually necessary as well.)
(b) Suppose that the following answers are given: u_1 = 14.56, u_2 = 21.34, u_3 = 29.47, and u_4 = 19.25. Find values of μ_0, λ_0, a_0, and b_0 which are consistent with these answers.
Section 1.4:
20. For the joint density in Example 1.53 on page 29, prove that the distributions are consistent (as n changes).
21. Consider the joint density in Example 1.53 on page 29, and define Y_n = n/∑_{i=1}^n X_i. Find the distribution of Y_n. Also, let n → ∞ and prove that the limit of the distribution of Y_n is Γ(a, b).
22. For the joint density in Example 1.53, prove that the conditional distribution of the X_i given Θ = θ, namely Exp(θ), and the marginal distribution of Θ, found in Problem 21, namely Γ(a, b), do indeed induce the joint distribution for X_1,…,X_n for all n. How do we know that no other combination of distributions will induce the joint distribution of the X_i?
23. Let X_1,…,X_14 be exchangeable Bernoulli random variables, and let M = ∑_{i=1}^{14} X_i. Let the distribution of M be given by the mass function (density with respect to counting measure)

f_M(m) = { 0.3 if m = 2,
           0.2 if m = 8,
           0.5 if m = 13. }
(a) Find the probability that in four specific trials, we observe three suc-
cesses and one failure (without regard to which of the trials is a failure
and which are successes).
(b) Suppose that we observe three successes in the first four trials. Find
all the probabilities of k successes in n future trials for n = 1, ... ,10
and k = 0, ... , n. (Give a formula.)
24. Suppose that X_1,…,X_n are exchangeable and take values in the Borel space (X, B). Prove that the empirical probability measure P_n is a measurable function from the n-fold product space (X^n, B^n) to (P, C_P).
25.* Refer to Problem 29 on page 664.
(a) Find the distribution of Θ = lim_{n→∞} ∑_{i=1}^n X_i/n.
(b) Assume that we observe X_1 = 1 and X_2 = 0. Find the conditional distribution of X_3,…,X_n given this data for all n = 3, 4,….
(c) Using the same data, find the conditional distribution of Θ.
26. Suppose that {X_n}_{n=1}^∞ is an infinite sequence of exchangeable random variables with finite variance.
(a) Prove that the covariance of X_i with X_j is nonnegative for i ≠ j.
(b) Give an example of such a sequence in which Cov(X_i, X_j) = 0 but the random variables are not mutually independent.
27.* Let X_1, X_2, X_3 be IID U(0,1) random variables. After observing X_1 = x_1, X_2 = x_2, X_3 = x_3, define Y_1, Y_2 to be the results of drawing two numbers at random without replacement from the set {x_1, x_2, x_3}. Prove that Y_1 and Y_2 are IID U(0,1).
28. Let X_1,…,X_n be IID with some distribution P on a Borel space (X, B). Let the conditional distribution of Y_1,…,Y_k (for k < n) given X_1 = x_1,…,X_n = x_n be that of k draws without replacement from the set {x_1,…,x_n}. Prove that the joint distribution of Y_1,…,Y_k is that of IID random quantities with distribution P.
29. Let (X, B) be a Borel space, and let X_i take values in X for i = 1,…,n. Suppose that X_1,…,X_n are exchangeable. Let the conditional distribution of Y_1,…,Y_k (for k < n) given X_1 = x_1,…,X_n = x_n be that of k draws without replacement from the set {x_1,…,x_n}. Prove that the joint distribution of Y_1,…,Y_k is the same as the joint distribution of X_1,…,X_k.
30. In the setup of Problem 3 on page 73, let P be the limit of the empirical probability measures of X_1,…,X_n as n → ∞. Show that P is also the limit of the empirical probability measures of the Y_i.
31. State and prove a central limit theorem for exchangeable random variables. You may use Theorems B.97 and 1.49.
Section 1.5:
32. Prove Corollary 1.63 on page 36 using Theorem 1.62. (Hint: Prove that lim Y_n is measurable with respect to the tail σ-field of {X_n}_{n=1}^∞. Then apply the Kolmogorov zero-one law B.68.)
33. Refer to Example 1.76 on page 42. Let M* = M − X_1. Take the conditional distribution of M* given X_1 = x_1 as a prior distribution for M* after learning that X_1 = x_1.
(a) Find the probability that X_2 = 0 (conditional on X_1 = 1) using this new prior distribution.
(b) Find the posterior distribution for M* given X_2 = 0 (and X_1 = 1).
34. Let {X_n}_{n=1}^∞ be exchangeable Bernoulli random variables, and let

Y = min{ n : ∑_{i=1}^n X_i ≥ 2 },

that is, Y is the time until the second success (e.g., if X_1 = 1, X_2 = 0, X_3 = 1, then Y = 3).
(a) Find the distribution of Y using the form of DeFinetti's representation theorem in Example 1.82 on page 46.
(b) Find the conditional distribution of {X_{n+k}}_{k=1}^∞ given Y = n.
(c) Show that the distribution in part (b) is the same as the conditional distribution of {X_{n+k}}_{k=1}^∞ given ∑_{i=1}^n X_i = 2.
35. Suppose that {X_n}_{n=1}^∞ are bounded, exchangeable random variables. Let Θ = lim_{n→∞} ∑_{i=1}^n X_i/n, a.s. Prove that Var(Θ) = Cov(X_1, X_2).
36. Prove that the collection C_n in the proof of Theorem 1.62 is a σ-field. Also prove that f : ℝ^∞ → ℝ is measurable with respect to C_n if and only if f(y) = f(x) for all y that agree with x after coordinate n and such that the first n coordinates of y are a permutation of the first n coordinates of x.
37. Let {X_n}_{n=1}^∞ be IID nonnegative random variables with E(X_i) = ∞. Show that ∑_{i=1}^N X_i/N = Y_N diverges to ∞ almost surely.
38.* Under the conditions of Theorem 1.59 it is possible to prove that Y_n converges almost surely, rather than just the subsequence {Y_{n_k}}_{k=1}^∞.
(a) Let v = ∑_{i=1}^∞ i^{−3/2} and ε_{i,k} = 1/(k v i^{3/2}) for all i and k. Define V_{i,k} = {s : |Y_{(i+1)^4}(s) − Y_{i^4}(s)| < ε_{i,k}}. Use the second-to-last equation in (1.60) to prove that ∑_{i=1}^∞ Pr(V_{i,k}^C) < ∞.
(b) Let A_{k,n} = ∩_{i=n}^∞ V_{i,k}. Show that for each ε > 0 and k, there exists n_k such that Pr(A_{k,n_k}^C) < ε/2^k.
(c) For each i, j, k with i^4 ≤ j < (i+1)^4, define G_{i,j,k} = {s : |Y_j(s) − Y_{i^4}(s)| > 1/k}. Use the second-to-last equation in (1.60) to prove that Pr(G_{i,j,k}) is at most a fixed multiple of k²/i^5.
(d) Define H_{i,k} = ∪_{j=i^4}^{(i+1)^4−1} G_{i,j,k}. Prove that Pr(H_{i,k}) is at most a fixed multiple of k²/i².
(e) Define J_{k,n} = ∪_{i=n}^∞ H_{i,k}. Prove that for each k and ε > 0, there exists m_k such that Pr(J_{k,m_k}) < ε/2^k.
(f) Prove that for every pair of sequences {n_k}_{k=1}^∞ and {m_k}_{k=1}^∞,
Section 1.6.1:
Section 1.6.2:
50. Let (X, B, ν) be a probability space. Assume that P is tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}). Assume that each element of each π_n has positive ν measure. Show that E(V_{n;B}) = ν(B)/ν(ps(B)) for all n and B if and only if μ_X(B) = ν(B) for all B, where μ_X is defined in (1.112).
51.* Prove that (1.107) and (1.108) are necessary and sufficient in order that a tailfree process be a probability measure with probability 1. (Hint: That the two conditions are necessary is straightforward once one realizes that countable additivity of a measure μ is equivalent to lim_{n→∞} μ(D_n) = 0 when D_1 ⊇ D_2 ⊇ ⋯ and ∩_{n=1}^∞ D_n = ∅. Next prove that (1.107) is sufficient for the process to be countably additive on the smallest σ-field containing all sets in π_n for each n. Then show that the union of all these σ-fields is a field and that (1.108) is sufficient for the process to be countably additive on this field.)
52. Let (X_1,…,X_{n+1}) ∼ Dir_{n+1}(α_1,…,α_{n+1}). Define Y_i = ∑_{j=1}^i X_j for i = 1,…,n+1, and set Z_i = Y_i/Y_{i+1} for i = 1,…,n. Prove that the Z_i are mutually independent with Z_i ∼ Beta(α_1 + ⋯ + α_i, α_{i+1}).
53. Prove Corollary 1.125 on page 68.
54. Prove Corollary 1.126 on page 68.
55.* Assume that the conditions of Theorem 1.121 hold. Prove that the posterior distribution of P as found in Theorem 1.115 is the same thing that Bayes' theorem 1.31 gives, where we let the Θ in Bayes' theorem 1.31 be P.
56. Let X ∼ P given P. Suppose that E|g(X)| < ∞, where g : X → ℝ. Prove that E(|g(X)| | P) < ∞, a.s.
57.* Let P have a Polya tree process distribution with each set partitioned into k sets at the next level. For each x and n, let i_n(x) be such that C_n(x) is subset number i_n(x) of C_{n−1}(x). Let Dir_k(α_{n,1}(x),…,α_{n,k}(x)) be the prior distribution of W_n(x) in (1.128), and define S_n(x) = ∑_{i=1}^k α_{n,i}(x).
(a) Prove that Pr(X = x) = ∏_{n=1}^∞ (α_{n,i_n(x)}(x)/S_n(x)).
(b) If X_1, X_2 are conditionally IID with distribution P given P, prove that

Pr(X_2 = x | X_1 = x) = ∏_{n=1}^∞ b_{n,i_n(x)}(x)/(S_n(x) + 1),

where b_{n,i} is defined in (1.129).
(c) If inf_{x,n} S_n(x) > 0 and sup_{x,n,i} α_{n,i}(x)/S_n(x) < 1, then P is continuous with probability 1.
(Hint: Use Problem 49 on page 80.)
58. Consider a Dirichlet process Dir(α) as a Polya tree with k = 2 and α_{n,i}(x) = c_n for all n, i, and x.
(a) If a = α(X), show that c_n = a/2^n for all n. (Hint: Use the result of Problem 52 above, which implies that the product of a Beta(b, c) times an independent Beta(b + c, d) is Beta(b, c + d).)
(b) Show that a condition for P to be continuous in Problem 57 above is violated.
CHAPTER 2
Sufficient Statistics
We now turn our attention to the broad area of statistics. This will concern
the manner in which one learns from data. In this chapter, we will study
some of the basic properties of probability models and how data can be
used to help us learn about the parameters of those models.
2.1 Definitions
2.1.1 Notational Overview
We assume that there is a probability space (S, A, μ) underlying all probability calculations. It will be common to refer to probabilities calculated under μ using the symbol Pr(·). Conditional probabilities will be denoted Pr(·|·). We also assume that there is a random quantity X : S → X, where X is called the sample space (with σ-field B), which will usually be some subset of a Euclidean space, but will be a Borel space in any event. We will often refer to X as the data. Let A_X stand for the sub-σ-field of A generated by X (that is, A_X = X^{−1}(B)). Since (X, B) is a Borel space, B contains all singletons. This will allow us to claim that random quantities are functions of X if and only if they are measurable with respect to A_X by Theorem A.42. Generic elements of X will usually be denoted by x, y, or z, or x_1, x_2,…, depending on how many we need at once.

Assume that there is a parametric family P_θ of distributions for X, and let the parameter be Θ : S → Ω, where Ω is a parameter space with σ-field τ. Usually, Ω will be a subset of some finite-dimensional Euclidean space, but not always. X will usually be a vector of exchangeable coordinates, but this is not required. When the coordinates of X are exchangeable, then the
Example 2.1. Let {X_n}_{n=1}^∞ be conditionally IID random variables with N(θ, 1) distribution given Θ = θ. Let X = (X_1,…,X_n). If B ∈ B¹, the one-dimensional Borel σ-field, then
2.1.2 Sufficiency
A statistic is virtually any measurable function of the data, X.
Definition 2.3. Let (T, C) be a measurable space such that C contains all singletons. If T : X → T is measurable, then T is called a statistic.
It appears that almost anything can be a statistic. The only requirement is that it be a measurable function of X to a space in which singletons are measurable sets, such as a Borel space. It will prove convenient, when T : X → T, to refer to T as a random quantity. When we do this, we will
Next, treat T(X) = ∑_{i=1}^n X_i as the data. The density of T given Θ = θ (with respect to counting measure on the nonnegative integers) is f_{T|Θ}(t|θ) = (n choose t) θ^t (1−θ)^{n−t} for t = 0,…,n. It follows from Bayes' theorem 1.31 that the posterior given T = t = ∑_{i=1}^n X_i has derivative

This is the same as the other posterior, hence T is sufficient according to Definition 2.4.
Example 2.9 (Continuation of Example 2.5; see page 84). The X_i are IID Ber(θ) given Θ = θ, and X = (X_1,…,X_n). Let T(x) = ∑_{i=1}^n x_i. We need to compute Pr_θ(X = x | T(X) = t) for all θ and all x such that t = T(x). Since both X and T are discrete random variables,

Set r(·, t) to be the distribution that is uniform on the set of all x such that ∑_{i=1}^n x_i = t (probability (n choose t)^{−1} for each such x). We now see that T is sufficient according to Definition 2.8.
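The uniformity claim in Example 2.9 can be verified by direct enumeration. The sketch below (illustrative code, not from the text) checks that, for IID Ber(θ) coordinates, every binary string with sum t has conditional probability (n choose t)^{−1} given T = t, regardless of θ:

```python
# Check that Pr_theta(X = x | T(X) = t) is uniform over {x : sum(x) = t}
# for iid Ber(theta) coordinates, i.e., equals 1 / C(n, t) for any theta.
from itertools import product
from math import comb, isclose

n, t, theta = 4, 2, 0.3
joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
         for x in product([0, 1], repeat=n)}
pt = sum(p for x, p in joint.items() if sum(x) == t)      # Pr(T = t)
conditionals = [p / pt for x, p in joint.items() if sum(x) == t]
assert all(isclose(c, 1 / comb(n, t)) for c in conditionals)
```

Changing `theta` leaves the conditional probabilities unchanged, which is exactly the sense in which T carries all of the information about θ.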
Example 2.10 (Continuation of Example 2.7; see page 85). If X = (X_1,…,X_n) with the X_i having Exp(θ) distribution given Θ = θ, let T(x) = ∑_{i=1}^n x_i. We need to find the conditional distribution of X given T = t. By Corollary B.55, the conditional distribution of X given T = t and Θ = θ has density

with respect to a measure ν_{X|T}(·|t), which does not depend on θ. Since this distribution is the same for all θ, T is sufficient in the classical sense.
In general, sufficient statistics need not be much simpler than the entire data set.
Definition 2.11. Let X_1,…,X_n be random variables. Define X_(1) to be min{X_1,…,X_n}, and for k > 1, define

X_(k) = min( {X_1,…,X_n} ∖ {X_(1),…,X_(k−1)} ).

The vector (X_(1),…,X_(n)) is called the order statistics of X_1,…,X_n.
In all cases of interest to us, the two definitions of sufficient statistic are
equivalent.
Theorem 2.14.³ Let (T, C) be a Borel space, and let T : X → T be a statistic. The following are both true:

Pr(Θ = θ|X = x) = f_{X|Θ}(x|θ) / ∑_{i=1}^∞ c_i f_{X|Θ}(x|θ_i).

According to Lemma 2.6, for each θ, this is a function of T(x). That is, there exists a function h such that, for each θ,

f_{X|Θ}(x|θ) / ∑_{i=1}^∞ c_i f_{X|Θ}(x|θ_i) = h(θ, T(x)).  (2.16)

By the chain rule A.79, it can be seen that the left-hand side of (2.16) is equal to dP_θ/dν*(x).
For θ ∈ {θ_i}_{i=1}^∞, replace the prior above by one that has Pr(Θ = θ_i) = c_i for all i. We still have that (2.16) is dP_θ/dν*(x). Also, Lemma 2.6 says that Pr(Θ = θ_j|X = x) is still a function of T(x) for all j. But Pr(Θ = θ_j|X = x) is just c_j times the left-hand side of (2.16). □
PROOF OF THEOREM 2.14. For part 1, define the function r to be the conditional probability function on X given T = t calculated from the probability ν* in Lemma 2.15. That is, for every C ∈ C and every B ∈ B,

We now wish to show that this function r can serve as the conditional distribution of X given T = t and Θ = θ for all θ. To see that this is true, note that for all B ∈ B, P_θ(B|T = t) is any function m : T → [0,1] satisfying

= ∫_A r(B, t) dP_{θ,T}(t),

where the first equality follows from (2.17) and the second follows from (2.19). It follows that r(B, t) can play the role of m(t) in (2.18), and the proof of part 1 is complete.
To prove part 2, let r be as in Definition 2.8, and let μ_Θ be a prior for Θ. By the law of total probability B.70 (conditional on T), the conditional distribution of X given T = t, μ_{X|T}(·|t), is given for every B ∈ B by
4Versions of this theorem originated with Fisher (1922, 1925) and Neyman
(1935).
For the "only if" part, assume that T(X) is sufficient. According to Lemma 2.15, there is a measure ν* such that P_θ ≪ ν* for all θ, dP_θ/dν*(x) is a function h of θ and T(x), and ν* ≪ ν. It follows that

f_{X|Θ}(x|θ) = h(θ, T(x)) (dν*/dν)(x).

If we set m_1(x) equal to the second factor on the right and m_2(T(x), θ) equal to the first factor on the right, we are done. □
Example 2.22. Let P_θ say that {X_n}_{n=1}^∞ are IID U(0, θ), Ω = (0,∞), and let X = (X_1,…,X_n). Then

f_{X_1,…,X_n|Θ}(x_1,…,x_n|θ) = m_{1,n}(x_1,…,x_n) m_{2,n}(T_n(x_1,…,x_n), θ).

Suppose also that for all n and all t ∈ T,

where t′ = T_{n+ℓ}(x_1,…,x_n, y_1,…,y_ℓ). The posterior density of Θ with respect to the measure λ would be

m_{2,ℓ}(t, θ) m_{1,n}(x_1,…,x_n) m_{2,n}(T_n(x_1,…,x_n), θ) / ∫_Ω m_{2,ℓ}(t, ψ) m_{1,n}(x_1,…,x_n) m_{2,n}(T_n(x_1,…,x_n), ψ) dλ(ψ) = m_{2,n+ℓ}(t′, θ) / c(t′, n + ℓ),

by (2.26). □
The family of prior densities p and their corresponding distributions is
called a natural conjugate family of priors.
Example 2.27. Let {X_n}_{n=1}^∞ be a sequence of conditionally IID Ber(θ) random variables given Θ = θ. Let T_n = ∑_{i=1}^n X_i. Then m_{2,n}(t, θ) = θ^t (1−θ)^{n−t} and c(t, n) = t!(n−t)!/(n+1)!. The family of natural conjugate priors is a subset of the family of Beta distributions. In particular, m_{2,n}(t, θ)/c(t, n) is the Beta(t+1, n−t+1) density as a function of θ. Actually, the entire collection of Beta distributions has the property that if the prior is in the Beta family, then the posterior is as well. Theorem 2.25 only tells us that Beta distributions with integer parameters are natural conjugate.
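A minimal numerical sketch of this conjugate update (illustrative code, not from the text): a Beta(p, q) prior combined with t successes in n Bernoulli trials yields a Beta(p + t, q + n − t) posterior, and the uniform Beta(1, 1) prior reproduces the Beta(t + 1, n − t + 1) form above:

```python
# Beta-Bernoulli conjugacy: a Beta(p, q) prior and data with t successes in
# n trials give a Beta(p + t, q + n - t) posterior. With the uniform prior
# Beta(1, 1), this matches the Beta(t + 1, n - t + 1) density in Example 2.27.
def beta_update(p, q, data):
    t, n = sum(data), len(data)
    return p + t, q + n - t

p, q = beta_update(1, 1, [1, 0, 1, 1])   # t = 3, n = 4
assert (p, q) == (4, 2)                  # Beta(4, 2) = Beta(t + 1, n - t + 1)
print("posterior mean:", p / (p + q))    # 4/6 = 2/3
```

Note that only the sufficient statistic (t, n) enters the update, illustrating why the posterior depends on the data only through T_n.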
{y ∈ X : f_{X|Θ}(y|θ) = f_{X|Θ}(x|θ) h(x, y), ∀θ and some h(x, y) > 0},

⁷In most examples it is relatively easy to construct the function T, but you can actually prove that such a function exists in general. See Problem 15 on page 139.
2.1. Definitions 93
which is the same as the posterior after learning X = x. Hence, the posterior
is a function of T(x). Lemma 2.6 says that T(X) is sufficient.
Finally, we prove that T(X) is minimal. Let U(X) be another sufficient
statistic. Use the Fisher-Neyman factorization theorem 2.21 to write
(2.30)
Since P_θ({x : m₁(x) = 0}) = 0 for all θ, we can safely assume that m₁(x) > 0 for all x. We need to show that if U(x) = U(y) for some x, y ∈ X, then
y E V(x). It would then follow that T(y) = T(x), and this would make T
a function of U. Suppose that U(x) = U(y). Use (2.30) to write
is the same for all θ if and only if Σ_{i=1}^n x_i = Σ_{i=1}^n y_i, in which case h(x, y) = 1 and V(x) = {y : Σ_{i=1}^n y_i = Σ_{i=1}^n x_i}. Then T(X) = Σ_{i=1}^n X_i is the minimal sufficient statistic.
Example 2.32. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID U(0, θ) random variables. Let X = (X₁, …, X_n) and suppose that the sample space is ℝ₊ⁿ. Then f_{X|Θ}(x|θ) = θ^{−n} I_{[0,θ]}(max x_i). Now suppose that

where x̄ = Σ_{i=1}^n x_i. Set h(x, y) = Π_{i=1}^n (x_i!) / Π_{i=1}^n (y_i!) for all x and y. It follows that f_{X|Θ}(x|θ) = h(x, y) f_{X|Θ}(y|θ) for all θ if and only if Σ_{i=1}^n x_i = Σ_{i=1}^n y_i and max x_i = max y_i. Set T(x) = (Σ_{i=1}^n x_i, max x_i) and note that x ∈ V(y) if and only if T(x) = T(y). So T(X) is minimal sufficient.
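The defining relation behind these minimal sufficiency arguments can be probed numerically. For the Poisson part of the computation, the log likelihood ratio between two samples is free of θ exactly when their sums agree. A small sketch (ours, with arbitrary test points):

```python
import math

def poisson_loglik(xs, theta):
    """Log likelihood of IID Poisson(theta) observations xs."""
    return sum(-theta + x * math.log(theta) - math.lgamma(x + 1) for x in xs)

def ratio_free_of_theta(xs, ys, thetas=(0.5, 1.0, 2.0, 5.0)):
    """True when log f(x|theta) - log f(y|theta) is the same for every theta."""
    diffs = [poisson_loglik(xs, th) - poisson_loglik(ys, th) for th in thetas]
    return max(diffs) - min(diffs) < 1e-9

same_sum = ratio_free_of_theta([3, 1, 2], [2, 2, 2])   # both sums equal 6
diff_sum = ratio_free_of_theta([3, 1, 2], [4, 2, 2])   # sums 6 and 8
```

The ratio is θ^{Σx−Σy} Π yᵢ!/Π xᵢ!, so it is constant in θ exactly when the sums match.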
for all θ. Then Σ_{t=0}^∞ g(t)θ^t/t! = 0 for all θ. This expression is a power series representation of the analytic function h(θ) = 0. Since power series for analytic functions are unique, it must be that g(t)/t! = 0 for all nonnegative integers t. There are many functions g with this property, such as g(t) = sin(2πt). All such g satisfy P_θ(g(T) = 0) = 1 for all θ. So T is complete.
Theorem 2.36 (Bahadur's theorem).⁸ If U is a boundedly complete sufficient statistic and finite-dimensional, then it is minimal sufficient.
2.1.4 Ancillarity

At the other extreme from sufficiency lie statistics that are independent of the parameter.

Definition 2.37. A statistic U is called ancillary if the conditional distribution of U given Θ = θ is the same for all θ.

Example 2.38. Let X₁, X₂ be conditionally independent given Θ = θ, each with conditional distribution N(θ, 1). Let U = X₂ − X₁. The conditional density of U given Θ = θ is N(0, 2). Since this distribution is the same for all θ, U is ancillary.
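Ancillarity of U = X₂ − X₁ is easy to see in simulation: whatever θ generated the data, the draws of U look like N(0, 2). A sketch (ours; the θ values are arbitrary):

```python
import random
import statistics

def simulate_U(theta, n=100_000, seed=0):
    """Draws of U = X2 - X1 with X1, X2 conditionally IID N(theta, 1)."""
    rng = random.Random(seed)
    return [rng.gauss(theta, 1.0) - rng.gauss(theta, 1.0) for _ in range(n)]

summaries = []
for theta, seed in [(-3.0, 0), (10.0, 1)]:
    u = simulate_U(theta, seed=seed)
    summaries.append((statistics.fmean(u), statistics.variance(u)))
# Each (mean, variance) pair is near (0, 2), whatever theta generated the data.
```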
Sometimes the two extremes meet and a minimal sufficient statistic contains a coordinate that is ancillary.

Definition 2.39. If a minimal sufficient statistic is T = (T₁, T₂) and T₂ is ancillary, then T₁ is called conditionally sufficient given T₂.
Example 2.40. Suppose that X₁, …, X_n are conditionally IID given Θ = θ with U(θ − 1/2, θ + 1/2) distribution. Let X = (X₁, …, X_n). Then

f_{X|Θ}(x|θ) = I_{[θ−1/2, ∞)}(min x_i) I_{(−∞, θ+1/2]}(max x_i).
(2.44)
In the classical theory, one would be 50 percent confident that the random interval [T₁ − (1 + T₂)/4, T₁ + 1/4 − 3T₂/4] covers Θ conditional on T₂. In fact, since the probability in (2.44) is the same for all T₂ values, one would be 50 percent confident that the random interval covers Θ marginally. If one desires an interval in which one can place 50 percent confidence after seeing the data, then the interval in (2.44) makes far more sense than the one in (2.43). If T₂ is observed to be small, then we have not learned much about Θ, and the conditional interval is wide to reflect the uncertainty. The unconditional interval in (2.43) is very short, however, which is counterintuitive. Similarly, when T₂ is observed to be large, we have learned a lot about Θ and the second interval is short, while the first one is wide.
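The 50 percent conditional coverage can be checked by Monte Carlo. The sketch below is ours; it assumes the parameterization T₁ = max Xᵢ and T₂ = max Xᵢ − min Xᵢ, which is consistent with the interval formula above. Under that reading the interval covers θ with probability 1/2 regardless of the observed range.

```python
import random

def interval_coverage(theta=0.0, n=5, reps=100_000, seed=0):
    """Monte Carlo coverage of the interval built from T1 = max, T2 = range."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.uniform(theta - 0.5, theta + 0.5) for _ in range(n)]
        t1 = max(xs)
        t2 = t1 - min(xs)
        lower = t1 - (1 + t2) / 4
        upper = t1 + 1 / 4 - 3 * t2 / 4
        hits += lower <= theta <= upper
    return hits / reps

cov = interval_coverage()   # close to 0.5
```

Binning the replications by the observed T₂ shows the same 1/2 within each bin, which is the point of conditioning.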
Suppose that we have a prior distribution for Θ with density f_Θ(θ). Then the posterior density of Θ is a constant times f_Θ(θ) I_{[t₁−1/2, t₁−t₂+1/2]}(θ). If f_Θ(θ) is almost constant over the interval [t₁ − 1/2, t₁ − t₂ + 1/2], then the posterior is approximately

f_{Θ|X}(θ|x) ≈ (1/(1 − t₂)) I_{[t₁−1/2, t₁−t₂+1/2]}(θ).

The posterior probability that Θ is in the interval in (2.44) is nearly 1/2. If one uses the improper prior with constant density, then the posterior for Θ is U(t₁ − 1/2, t₁ − t₂ + 1/2), and the fact that the posterior probability is 1/2 that Θ is in the interval (2.44) will turn out to be a special case of Theorem 6.78.
Sometimes there is more than one ancillary statistic available. Some prin-
ciple is needed to choose between them.
Definition 2.45. An ancillary U is maximal if every other ancillary is a
function of U.
Example 2.46. Let P_θ say that {Y_n}_{n=1}^∞ are IID with density (with respect to counting measure on the set {(0,0), (0,1), (1,0), (1,1)})

f_{Y₁|Θ}(y|θ) = (1 − θ)/6 if y = (0,0),
               (1 + θ)/6 if y = (0,1),
               (2 + θ)/6 if y = (1,0),
               (2 − θ)/6 if y = (1,1).

Here, Ω = [0, 1]. Now, let X = (Y₁, …, Y_N). Let the observable counts be N_ij equal to the number of Ys with the first coordinate i and the second coordinate j. Let M_i be the number of vectors with the first coordinate i, and let N_j be the number with the second coordinate j.
                    First Coordinate
                      0      1
  Second        0    N₀₀    N₁₀    N₀
  Coordinate    1    N₀₁    N₁₁    N₁
                     M₀     M₁

M₀ ~ Bin(N, 1/3), given Θ = θ,
N₀ ~ Bin(N, 1/2), given Θ = θ.
Both Mo and No are ancillary, but neither is maximal. The conditional inference
will depend on which ancillary one chooses.
For example, E_θ(1 − 3N₀₀/N₀ | N₀) = θ, and E_θ(1 − 2N₀₀/M₀ | M₀) = θ. If one wanted to estimate Θ in the classical framework, there would seem to be two natural estimators available depending on which ancillary one chooses. (See Problems 31 and 32 on page 142.)
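A simulation sketch (ours; cell probabilities (1 ∓ θ)/6 and (2 ± θ)/6 as in Example 2.46) shows that both estimators average to θ, while being genuinely different statistics:

```python
import random

CELLS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def draw_cell(theta, rng):
    """One draw from the four-point family of Example 2.46."""
    probs = [(1 - theta) / 6, (1 + theta) / 6, (2 + theta) / 6, (2 - theta) / 6]
    u, acc = rng.random(), 0.0
    for p, cell in zip(probs, CELLS):
        acc += p
        if u < acc:
            return cell
    return CELLS[-1]

def estimates(theta, N, rng):
    """The two estimators based on the ancillaries N0 and M0.

    N is large enough here that M0 and N0 are positive essentially surely.
    """
    n00 = m0 = n0 = 0
    for _ in range(N):
        i, j = draw_cell(theta, rng)
        n00 += i == 0 and j == 0
        m0 += i == 0
        n0 += j == 0
    return 1 - 3 * n00 / n0, 1 - 2 * n00 / m0

rng = random.Random(42)
theta, N, reps = 0.4, 50, 10_000
sums = [0.0, 0.0]
for _ in range(reps):
    e1, e2 = estimates(theta, N, rng)
    sums[0] += e1
    sums[1] += e2
means = [s / reps for s in sums]   # both close to theta = 0.4
```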
On the other hand, if X < 2 is observed, the inference should be the same as if we had merely observed Z, since we actually did observe Z, and the fact that Z ≥ 2 could have occurred but didn't is irrelevant. If X = 2 is observed, the inference should be based on the fact that all we know is Z ≥ 2, since the fact that X < 2 had been possible is now irrelevant.
A possible Bayesian solution to this problem would be to let Θ have a conjugate prior distribution, say Θ ~ N(θ₀, σ₀²) for known values of θ₀ and σ₀². The conditional distribution of Θ given X = x is N(θ₁, σ₁²) if x < 2, where

θ₁ = (θ₀ + σ₀²x)/(1 + σ₀²),   σ₁² = σ₀²/(1 + σ₀²).

Inference would then proceed as if no truncation had been possible. On the other hand, if X = 2 is observed, the conditional distribution of Θ given X = 2 has density

E(Θ|X = 2) = θ₀ + σ₀² exp{−(2 − θ₀)²/[2(1 + σ₀²)]} / { √(2π) √(1 + σ₀²) [1 − Φ((2 − θ₀)/√(1 + σ₀²))] }.
Brown (1967) and Buehler and Fedderson (1963) prove that there are other statistics that are not ancillary but on which it might pay to condition when making inferences. In particular, they consider the case in which X₁, …, X_n are conditionally IID with N(µ, σ²) distribution, conditional on Θ = θ = (µ, σ). Let X̄ = Σ_{i=1}^n X_i/n and S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1). It is well known that P_θ(|X̄ − µ|/S > k) depends only on k and n; call it α(k, n). What these authors show is that there is a set C and a number a < α(k, n) such that P_θ(|X̄ − µ|/S > k | (X̄, S) ∈ C) ≤ a for all θ. Pierce (1973), Wallace (1959), and Buehler (1959) give conditions under which such examples can and cannot arise.
Ancillaries are only useful if there is no boundedly complete sufficient
statistic.
Theorem 2.48 (Basu's theorem).¹⁰ If T is a boundedly complete sufficient statistic and U is ancillary, then U and T are independent given Θ = θ, and they are marginally independent no matter what prior one uses.

PROOF. Let A be some measurable set of possible values of U. Since U is ancillary, P_θ(U ∈ A) = Pr(U ∈ A) for all θ. But P_θ(U ∈ A) = ∫ P_θ(U ∈ A | T = t) dP_{θ,T}(t). So

for all θ, since T is sufficient. Let g(t) = Pr(U ∈ A) − Pr(U ∈ A | T = t), which is a bounded measurable function. Equation (2.49) says that E_θ(g(T)) = 0 for all θ. Since T is boundedly complete, we have P_θ(g(T) = 0) = 1 for all θ. This means that P_θ(U ∈ A) = P_θ(U ∈ A | T = t), a.s. [P_{θ,T}] for all θ, which implies that U and T are conditionally independent given Θ = θ, for all θ.
¹⁰See Basu (1955, 1958).
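A classical illustration of Basu's theorem: in the N(µ, 1) family, the sample mean is a boundedly complete sufficient statistic and the sample variance is ancillary, so the two are independent. A Monte Carlo sketch (ours) shows their sample correlation is near zero:

```python
import random
import statistics

def mean_var_pairs(mu, n=10, reps=40_000, seed=3):
    """(sample mean, sample variance) pairs from N(mu, 1) samples of size n."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(reps):
        xs = [rng.gauss(mu, 1.0) for _ in range(n)]
        pairs.append((statistics.fmean(xs), statistics.variance(xs)))
    return pairs

pairs = mean_var_pairs(2.0)
ms = [m for m, _ in pairs]
vs = [v for _, v in pairs]
mbar, vbar = statistics.fmean(ms), statistics.fmean(vs)
cov = statistics.fmean([(m - mbar) * (v - vbar) for m, v in pairs])
corr = cov / (statistics.pstdev(ms) * statistics.pstdev(vs))   # near 0
```

Zero correlation alone does not prove independence, of course; the simulation merely illustrates what the theorem guarantees.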
To see that this is a better measure of the variance of X̄* than is the marginal variance, consider the simple case n = 3. The distribution of M is
ifm = 1,
ifm = 2,
ifm=3,
otherwise.
Var(X̄* | Θ = θ) = E( (N − M)σ² / [(N − 1)M] | Θ = θ ) = σ² (N² − 2N + 3) / (3N²).
If M = 3, the marginal variance is larger than the conditional variance, while if
M = 1, the marginal variance is too small.
To execute a Bayesian solution, we need a distribution for Θ. Suppose that we model the Θ_i as exchangeable random variables with the Θ_i conditionally independent N(ψ, σ²) given (Ψ, Σ) = (ψ, σ). The distribution of Ψ given Σ = σ is N(ψ₀, σ²/λ₀), while the distribution of Σ² is Γ⁻¹(a₀/2, b₀/2).¹¹ The data consist of observing M = m and Θ_{i_j} = x_j, for j = 1, …, m. The unobserved Θ_i are still exchangeable, and their conditional distribution given (Ψ, Σ) = (ψ, σ) is as in the prior. The distribution of Ψ given Σ = σ and X = x is Ψ ~ N(ψ₁, σ²/λ₁), and the distribution of Σ² given X = x is Γ⁻¹(a₁/2, b₁/2), where

a₁ = a₀ + m,   b₁ = b₀ + Σ_{i=1}^m (x_i − x̄*)² + [mλ₀/(m + λ₀)] (ψ₀ − x̄*)².

Conditional on (Ψ, Σ) = (ψ, σ) and the observed data, the population mean satisfies

Θ̄ ~ N( [m x̄* + (N − m)ψ]/N , σ²(N − m)/N² ).
The latter is very close to the traditional finite population sampling theory vari-
ance estimate.
We conclude this section with two examples that are similar on their
surface, but in one example the ancillary is part of the sufficient statistic
and in the other it is not.
Example 2.53. Let Z ~ Ber(1/2) (independent of Θ), and let Y and W be conditionally independent given Θ = θ (and independent of Z) with Y ~ N(θ, 1) and W ~ N(θ, 2). If Z = 0, we will observe X = (Y, Z). If Z = 1, we will observe X = (W, Z). Let X₁ stand for the first coordinate of X. The likelihood function is
for some measurable functions π₁, …, π_k, t₁, …, t_k and some integer k.
Example 2.56. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID N(µ, σ²), where θ = (µ, σ). Let X = (X₁, …, X_n).

so that the dependence on θ is through the vector π = (π₁(θ), …, π_k(θ)) ∈ ℝᵏ. We might as well let π be the parameter.
same as for X. In particular, there is a measure ν_T such that P_{θ,T} ≪ ν_T for all θ and dP_{θ,T}/dν_T(t) = c(θ) exp(θᵀt).

PROOF. Apply Lemma 2.24 with m₂(T(x), θ) = c(θ) exp(Σ_{i=1}^k θ_i t_i(x)) and m₁(x) = h(x). □
Example 2.59 (Continuation of Example 2.56; see page 103). In the case of n conditionally IID N(µ, σ²) random variables, the natural sufficient statistics are T₁ = Σ_{i=1}^n X_i and T₂ = Σ_{i=1}^n X_i². It is well known that T₁ and W = T₂ − T₁²/n are independent, with T₁ having N(nµ, nσ²) distribution and W having Γ([n − 1]/2, 1/[2σ²]) distribution. It follows that the joint density of (T₁, T₂) is

c(αθ₁ + (1 − α)θ₂)⁻¹ = ∫ exp{ tᵀ[αθ₁ + (1 − α)θ₂] } dν_T(t)
Example 2.63. The family of exponential distributions, Exp(ψ), with densities f_{X|Ψ}(x|ψ) = ψ exp(−ψx) for x > 0, has h(x) = I_{(0,∞)}(x) and natural parameter θ = −ψ. So c(θ) = −θ and 1/c(θ) = −1/θ is convex. The natural parameter space is (−∞, 0), a convex set.
max_{z ∈ C(ε,γ)} | (exp(tz) − 1)/z | ≤ max{ [exp(|t|ε) − 1]/ε, [exp(|t|γ) − 1]/γ }.

Since the limit as γ → 0 of the last term above is |t| and exp(|t|ε) − 1 > |t|ε, it follows that |(exp(tz) − 1)/z| ≤ exp(|t|ε)/ε for all |z| ≤ ε. Thus, we have that if |β| ≤ ε, the absolute value of the integrand in (2.65) is no more than |f(t)| exp(at) exp(|t|ε)/ε. Thus, the integral of the absolute value is at most ∫ |f(t)| (exp{t(a + ε)} + exp{t(a − ε)}) dν_T(t)/ε. Choose ε small enough so that a ± ε are in the interior of Ω. By the dominated convergence theorem,

c(θ) ∂/∂θ_i [1/c(θ)] = −[1/c(θ)] ∂c(θ)/∂θ_i = −∂/∂θ_i log c(θ).
Example 2.67 (Continuation of Example 2.63; see page 105). Consider the Exp(ψ) distribution with θ = −ψ. Here log c(θ) = log(−θ). So the partial with respect to θ is 1/θ = −E_ψ(T).
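The identity E(T) = −∂ log c(θ)/∂θ can be checked numerically for this family, where log c(θ) = log(−θ). A sketch (ours):

```python
import math

def log_c(theta):
    """log c(theta) = log(-theta), the Exp family's normalizer in natural form."""
    return math.log(-theta)

def expected_T(theta, h=1e-6):
    """E(T) = -d/dtheta log c(theta), via a central difference."""
    return -(log_c(theta + h) - log_c(theta - h)) / (2 * h)

psi = 2.5
m = expected_T(-psi)    # exponential family identity
exact = 1 / psi         # direct mean of an Exp(psi) random variable
```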
Example 2.68 (Continuation of Example 2.56; see page 103). Consider the case of N(µ, σ²) distributions. Here, the natural parameter is (θ₁, θ₂) = (µ/σ², −1/[2σ²]), and the natural sufficient statistic is (T₁, T₂) = (nX̄, Σ_{i=1}^n X_i²). So,

∂/∂θ₂ log c(θ) = n/(2θ₂) − nθ₁²/(4θ₂²) = −n(σ² + µ²) = −E_θ(T₂).
2.2. Exponential Families of Distributions 107
Cov_θ(T_i, T_j) = − ∂²/(∂θ_i ∂θ_j) log c(θ).
Example 2.71 (Continuation of Example 2.68; see page 106). For the N(µ, σ²) case, log c(θ) is given in (2.69). The covariance of X̄ and Σ_{i=1}^n X_i² is
Example 2.73. Suppose that X₁, …, X_n are conditionally IID with N(µ, σ²) distribution given M = µ and Σ = σ. Let the prior be natural conjugate as in Example 1.24 on page 14. The marginal density of the data is given in (1.27) in that example. Rewriting (1.27) in terms of the natural sufficient statistics T₁ = nX̄ and T₂ = W + nX̄², we get
¹⁴This proposition is used in the proofs of Theorems 3.44 and 7.57.
The partial derivative of this with respect to t₁ divided by g(t₁, t₂) equals

−(a₁/2) ( b₀ + t₂ − t₁²/n + [nλ₀/(λ₀ + n)] (t₁/n − µ₀)² )⁻¹ ( [2λ₀/(λ₀ + n)] (t₁/n − µ₀) − 2t₁/n ),
where g⁺ and g⁻ are respectively the positive and negative parts of g. Since E_θ(g(T)) exists for all θ, both sides of (2.75) are finite for all θ. Let θ₀ be interior to Ω, and let the common value of both sides of (2.75) be r when θ = θ₀. Define two probability measures:

at imaginary values of ψ near 0; hence they are also equal at all imaginary values because they are analytic in the region where the real part of ψ is in (−ψ₀, ψ₀). For ψ = iu, we get that the characteristic function of P equals the characteristic function of Q in a neighborhood of 0. By Corollary B.106, it follows that P = Q; hence g⁺(t) = g⁻(t) a.e. [ν_T]. This ensures that P_θ(g(T) = 0) = 1 for all θ. □
As examples, the sufficient statistics from normal, exponential, Poisson,
and Bernoulli distributions are complete.
Π_{i=1}^n f_{X_i|Θ}(x_i|θ) = m₁(x) m₂(t, θ).

Define

K_θ(t) = ∂/∂t log m₂(t, θ).

Assume the following conditions:
1. The set of y such that f_{X₁|Θ}(y|θ) > 0 is the same for all θ.
2. f_{X₁|Θ}(y|θ) is differentiable with respect to θ for each y.
3. f_{X₁|Θ}(y|θ) is differentiable with respect to y for each θ.
4. There exists θ₀ such that K_{θ₀}(t) has an inverse.
Then, X has an exponential family distribution with a one-dimensional natural parameter.

PROOF. Write

q_θ(r) = K_θ[K_{θ₀}⁻¹(r)],   r(x) = Σ_{i=1}^n v(x_i).

K_{θ₀}(t) c₁(θ) + c₂(θ) = ∂/∂θ log m₂(t, θ),
log m₂(t, θ) = K_{θ₀}(t) φ₁(θ) + φ₂(θ) + s(t),

where φ_i(θ) = ∫^θ c_i(u) du, for i = 1, 2, and s(t) is determined by boundary conditions. It follows that

Thus, we see that the density is in the form of an exponential family with k = 1 and

h(x) = m₁(x) exp{s(t)},   c(θ) = exp{φ₂(θ)},   t₁(x) = K_{θ₀}(t),   π₁(θ) = φ₁(θ). □
There are similar theorems in multiparameter cases, but they have even
more conditions. We give a different type of theorem characterizing expo-
nential families by their sufficient statistics in Theorem 2.114. The impor-
tance of the existence of a fixed-dimensional sufficient statistic is twofold.
First, it means that there is a fixed amount of information that must be stored for making inference about Θ regardless of the sample size. Second,
there is the possibility of using natural conjugate prior distributions as in
Theorem 2.25.
2.3 Information
It seems intuitively sensible to expect more data to provide more informa-
tion about a parameter or a distribution. Similarly, if a statistic is suffi-
cient, it should contain all of the information about the parameter, and
vice versa. To make these ideas precise, we need to define information.
There are two popular definitions of information: Fisher information and
Kullback-Leibler information.
2.3. Information 111
f_{X|Θ}(x|θ) = (2πb)^{−1/2} exp{ −(x − θ)²/(2b) },
∂/∂θ log f_{X|Θ}(x|θ) = (x − θ)/b,

and I_X(θ) = 1/b. Here we see that the smaller the known variance is, the more information there is in the data about Θ. This is intuitively sensible.
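Under the FI regularity conditions, I_X(θ) is the variance of the score; for this example the score is (X − θ)/b and the information is 1/b. A quick Monte Carlo sketch (ours, with arbitrary θ and b):

```python
import random
import statistics

def score_variance(theta=1.0, b=4.0, n=100_000, seed=7):
    """Variance of the score (X - theta)/b when X ~ N(theta, b)."""
    rng = random.Random(seed)
    scores = [(rng.gauss(theta, b ** 0.5) - theta) / b for _ in range(n)]
    return statistics.variance(scores)

info = score_variance()   # close to 1/b = 0.25
```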
Example 2.81. Suppose that X ~ U(0, θ) given Θ = θ. That is, f_{X|Θ}(x|θ) = θ⁻¹ I_{(0,θ)}(x). In this case FI regularity conditions 1 and 3 fail, but we can still calculate ∂ log f_{X|Θ}(x|θ)/∂θ = −1/θ. We could then try to define the Fisher information to be the mean of the derivative of this, namely I_X(θ) = 1/θ². But this function will not have the properties that Fisher information has under all three FI regularity conditions.
I_X(θ) = (1/σ²) [ 1  0
                  0  2 ].

families), we obtain

0 = ∫ ∂²/(∂θ_i ∂θ_j) f_{X|Θ}(x|θ) dν(x) = E_θ[ (∂²/(∂θ_i ∂θ_j) f_{X|Θ}(X|θ)) / f_{X|Θ}(X|θ) ]

in order to conclude
It follows that

∂/∂θ_i log f_{X|Θ}(x|θ) = ∂/∂θ_i log f_{Y|Θ}(y|θ) + ∂/∂θ_i log f_{X|Y,Θ}(x|y, θ), a.s. [Q_θ],   (2.87)

for all θ. We will prove that the two terms on the right-hand side of (2.87) are uncorrelated and that the last term is 0 a.s. if and only if Y is sufficient. Proposition 2.84 says that the first two expressions in (2.87) have mean 0 and that the last one has 0 conditional mean given Y, a.s. [P_{θ,Y}]. It follows from the law of total probability B.70 that for all i and j,

Hence, the two terms on the right-hand side of (2.87) are uncorrelated. Since the conditional mean of the conditional score function is 0, a.s., Proposition B.78 says that
I_X(θ) = [ 1/σ²    0
           0     2/σ² ].

Now suppose that we chose the natural parameterization of the exponential family, namely

η₁ = µ/σ² = θ₁/θ₂²,   η₂ = −1/(2σ²) = −1/(2θ₂²).

The function c is
I*(η) = [ −1/(2η₂)      η₁/(2η₂²)
          η₁/(2η₂²)    1/(2η₂²) − η₁²/(2η₂³) ].

∂/∂η_i f(g⁻¹(η)) = Σ_{j=1}^k [∂/∂θ_j f(θ)]|_{θ=g⁻¹(η)} ∂/∂η_i g_j⁻¹(η),
as

I_X(P; Q) = ∫ log[p(x)/q(x)] p(x) dν(x).

In the case of parametric families, let θ and ψ be two elements of Ω. The Kullback-Leibler information is then

I_X(θ; ψ) = E_θ{ log[ f_{X|Θ}(X|θ) / f_{X|Θ}(X|ψ) ] }.
It follows that I_X(θ; ψ) = (ψ − θ)²/2. This time I_X(θ; ψ) = I_X(ψ; θ).
Example 2.91. Suppose that X ~ Ber(θ) given Θ = θ. Then

log[ f_{X|Θ}(x|θ) / f_{X|Θ}(x|ψ) ] = x log(θ/ψ) + (1 − x) log[(1 − θ)/(1 − ψ)].

It follows that

I_X(θ; ψ) = θ log(θ/ψ) + (1 − θ) log[(1 − θ)/(1 − ψ)].
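As noted later in this section, twice differentiating I_X(θ; ψ) in ψ at ψ = θ recovers the Fisher information 1/[θ(1 − θ)]. A numeric sketch (ours; θ chosen arbitrarily) of both facts:

```python
import math

def kl_bernoulli(theta, psi):
    """I_X(theta; psi) for Ber(theta) relative to Ber(psi)."""
    return (theta * math.log(theta / psi)
            + (1 - theta) * math.log((1 - theta) / (1 - psi)))

theta, h = 0.3, 1e-4
at_theta = kl_bernoulli(theta, theta)                 # exactly 0
curvature = (kl_bernoulli(theta, theta + h)
             - 2 * at_theta
             + kl_bernoulli(theta, theta - h)) / h ** 2
fisher = 1 / (theta * (1 - theta))                    # curvature matches this
```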
Example 2.94. Let Ω = (0, 1) and suppose that P_θ says that {X_n}_{n=1}^∞ are IID Ber(θ). Let X = (X₁, …, X_n). Let ψ > θ, and let Θ be discrete with

f_Θ(y) = π₀ if y = θ,
         1 − π₀ if y = ψ.

Then

So, the probability of either θ or ψ increases with more data, depending on which one p_n is closer to in Kullback-Leibler information.
p(x) = (1/√(2π)) exp(−x²/2),   E_P(X²) = 1,   E_P(|X|) = √(2/π),
q(x) = (1/2) exp(−|x|),        E_Q(X²) = 2,   E_Q(|X|) = 1.

It follows that

log[p(x)/q(x)] = (1/2) log(2/π) − x²/2 + |x|,
I_X(P; Q) = (1/2) log(2/π) − 1/2 + √(2/π) = 0.07209,
I_X(Q; P) = −(1/2) log(2/π) + 2/2 − 1 = 0.22579.

If data come from a Laplace distribution, it is easier to tell that they don't come from a normal distribution than vice versa.
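Both numbers follow from the displayed moments; a few lines (our sketch) reproduce them:

```python
import math

# Moments: under P = N(0,1), E X^2 = 1 and E|X| = sqrt(2/pi);
# under Q = Laplace(1), E X^2 = 2 and E|X| = 1.
half_log = 0.5 * math.log(2 / math.pi)   # the constant in log[p(x)/q(x)]

kl_pq = half_log - 0.5 + math.sqrt(2 / math.pi)   # I_X(P; Q)
kl_qp = -half_log + 2 / 2 - 1                     # I_X(Q; P)
```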
Another advantage to Kullback-Leibler information is that no smooth-
ness conditions on the densities (like the FI regularity conditions) are
needed.
Example 2.96. Suppose that P_θ says that X has a uniform U(0, θ) distribution. For δ > 0,

I_X(θ; θ + δ) = ∫₀^θ log[(θ + δ)/θ] (1/θ) dx = log(1 + δ/θ),
∂²/(∂θ_i ∂θ_j) I_X(θ₀; θ) |_{θ=θ₀} = ∂²/(∂θ_i ∂θ_j) ∫ log[ f_{X|Θ}(x|θ₀) / f_{X|Θ}(x|θ) ] f_{X|Θ}(x|θ₀) dν(x) |_{θ=θ₀}
= ∫ −[ ∂²/(∂θ_i ∂θ_j) log f_{X|Θ}(x|θ) ]|_{θ=θ₀} f_{X|Θ}(x|θ₀) dν(x)
If one plugs in ψ = θ, one gets 1/[θ(1 − θ)] = I_X(θ), the Fisher information.
f_{X|Θ}(x|θ) / f_{X|Θ}(x|ψ) = f_{X|U,Θ}(x|u, θ) / f_{X|U,Θ}(x|u, ψ),

so that I_X(θ; ψ) = E[I_{X|U}(θ; ψ|U)]. □
Some data sets have more information and some have less depending
on the value of the ancillary U. Theorem 2.98 says that the amount of
information averages out to the marginal information over the distribution
of U, but we can make use of the observed value of U to tell us whether
we have one of the data sets with more or less information.
Example 2.99 (Continuation of Example 2.38; see page 95). In this example, X = (X₁, X₂) with the X_i being IID N(θ, 1) given Θ = θ. We had U = X₂ − X₁. The conditional distribution of X given U can be obtained from the conditional distribution of X₁ given U, which is N(θ + u/2, 1/2). The conditional score function is 2(X₁ − θ − u/2), which has conditional variance equal to 2 for all u. Similarly, the conditional Kullback-Leibler information is I_{X|U}(θ; ψ|u) = (θ − ψ)² for all u. Hence, this ancillary does not help distinguish data sets from each other.
Example 2.100 (Continuation of Example 2.52; see page 100). In this problem, there were two ancillaries, M₀ and N₀. We can write the second derivative of the logarithm of the density as

and get

I_{X|M₀}(θ|m₀) = [3m₀ + N(1 − θ²)] / [(1 − θ²)(4 − θ²)]

and

I_{X|N₀}(θ|n₀) = n₀/[(1 − θ)(2 + θ)] + (N − n₀)/[(1 + θ)(2 − θ)].

This is an increasing function of n₀. It is easy to verify that the means of these two are both equal to the marginal information, since E(M₀) = N/3 and E(N₀) = N/2.
In Example 2.100, one might ask which of the ancillaries does a better
job of distinguishing data sets from each other in terms of information.
This might be answered by looking at how spread out is the distribution
of the conditional information.
Example 2.102 (Continuation of Example 2.100; see page 119). We can compute Var(M₀) = 2N/9 and Var(N₀) = N/4, so that the variance of I_{X|M₀}(θ|M₀) is 2/θ² (> 1) times as large as the variance of I_{X|N₀}(θ|N₀). Suppose that we are interested in the statistic N₀₀. We can calculate any aspect we wish of the conditional distribution of N₀₀ given either M₀ or N₀ (and Θ). To see how much more M₀ distinguishes data sets than does N₀, Figure 2.103 shows the distribution of the conditional mean of N₀₀ given M₀ and N₀ for θ = 0.1 and N = 50. It is easy to see how much more the distribution is spread out conditional on M₀ than on N₀. Since the variance of the conditional mean of N₀₀ is greater given M₀ than given N₀ (2.25 versus 1.125), it follows from Proposition B.78 that the mean of the conditional variance of N₀₀ must be smaller (by the same amount) given
GI""n~
NO
Given
I
I
15 20
o 5 10
Condttlona. M n
Mo than given No. In fact, the values are 4.125 and 5.25, respectively. Because
this example is sufficiently simple, one can even calculate the probability that
the conditional variance of Noo given Mo will be smaller than the conditional
variance given No. The probability is 0.8346.
f_Θ(θ) ∝ √( Var_θ [ (∂/∂θ) h(X, θ) ] ).   (2.104)

To see that this works, let Ψ = g(Θ) and let the new parameter space be Ω′. Note that h must be modified to h′ : X × Ω′ → ℝ by

h′(x, ψ) = h(x, g⁻¹(ψ)),

or else the expression on the right of (2.104) makes no sense. Now compute (∂/∂ψ) h′(x, ψ) and

f_Ψ(ψ) = √( Var_ψ [ (∂/∂ψ) h′(X, ψ) ] ).
∂/∂θ h(x, θ) = ∂/∂θ log f_{X|Θ}(x|θ),

which is the score function. We have already seen that under the FI regularity conditions, Var_θ[∂h(X, θ)/∂θ] = I_X(θ). So, the method says to use f_Θ(θ) = c√(I_X(θ)) as the prior density, where c is chosen to make the integral of f_Θ(θ) equal to 1, if possible. If no such c exists, then f_Θ(θ) = √(I_X(θ)) is often used as an improper prior. This type of prior is called Jeffreys' prior after Harold Jeffreys, who proposed it in Jeffreys (1961, p. 181).
Example 2.105. Suppose that X ~ Bin(n, p) given P = p. Then we saw in Example 2.82 on page 112 that the Fisher information is I_X(p) = n(p[1 − p])⁻¹. This makes Jeffreys' prior proportional to p^{−1/2}(1 − p)^{−1/2}. This is the Beta(1/2, 1/2) distribution, which is a proper prior.
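The implicit normalizing constant here is B(1/2, 1/2) = π, so the Jeffreys prior density is [π√(p(1 − p))]⁻¹. A numerical sketch (ours) recovers π by midpoint integration:

```python
import math

def jeffreys_unnormalized(p):
    """Square root of the binomial Fisher information in p, dropping sqrt(n)."""
    return 1.0 / math.sqrt(p * (1 - p))

# Midpoint rule: the integrand is integrable despite the endpoint singularities.
m = 200_000
normalizer = sum(jeffreys_unnormalized((k + 0.5) / m) for k in range(m)) / m
# normalizer approaches B(1/2, 1/2) = pi as m grows.
```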
∂/∂θ log f(X − θ) = −f′(X − θ)/f(X − θ).

The distribution of this quantity given Θ = θ is the same as the distribution of −f′(X)/f(X) given Θ = 0; hence the variance is constant as a function of θ. This means that Jeffreys' prior would be constant. If Ω is an unbounded set, Jeffreys' prior is improper.
For multiparameter problems, a similar derivation is possible. Let f_Θ(θ) be proportional to the square root of the determinant of the covariance matrix of the gradient vector of h(X, θ) with respect to θ. It is easy to check that the gradient of h(X, g⁻¹(ψ)) with respect to ψ is equal to the matrix whose determinant is the Jacobian times the gradient of h(x, θ) with respect to θ evaluated at θ = g⁻¹(ψ). In the special case of Jeffreys' prior, h is the log of the density and f_Θ(θ) becomes the square root of the determinant of the Fisher information matrix, I_X(θ).
Example 2.108. In Example 2.83 on page 112, we found that if X ~ N(µ, σ²) given Θ = (µ, σ), the Fisher information matrix was diagonal with entries 1/σ² and 2/σ². The determinant is 2/σ⁴ and Jeffreys' prior is a constant over σ², an improper prior. The usual improper prior in this problem is a constant over σ.
One interesting feature of Jeffreys' prior is that its definition did not
depend on the parameter space (except that we were able to take deriva-
tives). That is, if the parameter space is actually an open subset of the
set {θ : f_{X|Θ}(·|θ) is a density}, then Jeffreys' prior has the same form. Obviously, a different normalizing constant will be required if the prior is proper.
Example 2.109 (Continuation of Example 2.107; see page 122). Suppose that the parameter space is actually only the open interval (a, b), but the conditional density of X given Θ = θ is still f(x − θ). Then Jeffreys' prior is the U(a, b) distribution, which is proper.
X = { x = (x₁, x₂, …) ∈ Π_{n=1}^∞ X_n : x_k = p_{k,n}(x_n) for all k < n }.
(It is easy to see that X is in the product σ-field.) The set X is the set of sequences of possible data which are consistent in the sense that the data at time n are an extension of the data at time n − 1 for all n > 1. Define, for k < n,

p_{k,n} = p_{k,k+1}(p_{k+1,k+2}(⋯(p_{n−1,n}(·)))).

Let B be the Borel σ-field of X. Let p_n : X → X_n be the nth coordinate projection function p_n(x₁, x₂, …) = x_n. Let X : S → X be measurable, and define X_n = p_n(X). The definition of X makes it clear that, for all x ∈ X, all n, and all k < n, p_{k,n}(p_n(x)) = x_k. That is, X_k = p_{k,n}(X_n) for all n and all k < n. Let ℰ_n be the sub-σ-field of B generated by {T_ℓ(p_ℓ)}_{ℓ=n}^∞, and set
2.4. Extremal Families 125
For brevity, we will use the symbol T_n to stand for T_n(p_n(X)) or for T_n(p_n). This makes ℰ the tail σ-field of the sequence of statistics {T_n}_{n=1}^∞.
Theorem 2.111. For each n, let r_n : B_n × T_n → [0, 1] be a transition kernel such that

r_n(T_n⁻¹({t}), t) = 1, for all t ∈ T_n.   (2.112)

Suppose that the following is true for each n and t ∈ T_{n+1}:¹⁶

Condition S: Assume that the distribution of X_{n+1} is r_{n+1}(·, t). Then r_n(·, s) is a regular conditional distribution for X_n given T_n = s for all s ∈ T_n(p_{n,n+1}(T_{n+1}⁻¹({t}))).

Let M be the set of all distributions on (X, B) such that r_n is a version of the conditional distribution of X_n given T_n for all n. Then M is a convex set. Let ℰ be the tail σ-field defined above. Suppose that M is nonempty. Then, there exist a set E ∈ ℰ and a transition kernel Q : B × E → [0, 1] such that
1. P(E) = 1 for all P ∈ M,
2. for each x ∈ E, r_n(·, T_n(x_n)) converges weakly to Q(·, x),
3. for each P ∈ M, Q is a regular conditional distribution for X given ℰ,
4. for each x ∈ E and A ∈ ℰ, Q(A, x) ∈ {0, 1},
5. for each P ∈ M, there is a unique probability R on (X, ℰ) such that
P = ∫ Q(·, x) dR(x),
all of its mass on the set of points where T_n = t, so that it really looks like
a conditional distribution given Tn = t. Condition S is a way of expressing
the idea that Tn is sufficient without introducing parameters. It says that
conditioning X n+1 on Tn+1 and then looking at the conditional distribution
of Xn given Tn is the same as conditioning Xn on Tn from the start.
The most common situation in which the above conditions hold is that in which X_n = Yⁿ and B_n = Dⁿ for some Borel space (Y, D). In this case p_{n,n+1}(y₁, …, y_{n+1}) = (y₁, …, y_n), and there is a bimeasurable function w : X → Y^∞ given by w(y₁, (y₁, y₂), …) = (y₁, y₂, …). If, in addition, the Y_i are exchangeable, we have a special version of Theorem 2.111.
Theorem 2.113. Let (Y, D) be a Borel space and let X_n = Yⁿ, B_n = Dⁿ. Let Y_k : S → Y be measurable for all k. Define X as above and define X : S → X by¹⁸

Let Y = ℝᵏ = T_n for all n. Suppose that T_n(y₁, …, y_n) = Σ_{i=1}^n y_i. Let h⁽¹⁾ = h and

h⁽ⁿ⁾(t) = ∫ h⁽ⁿ⁻¹⁾(t − y) h(y) dy.

Then the conditions of Theorem 2.113 are satisfied. Also, the members of the extremal family with distributions absolutely continuous with respect to Lebesgue measure are the members of the exponential family with the Y_i being IID with density h(y) exp(θᵀy)/c(θ), for some θ satisfying (2.115).
2.4.2 Examples
In this section, we present examples of the above theorems for exchange-
able sequences. For more general distributions, some examples are given in
Section 8.1.3. The theorems can be summarized as follows. Suppose that we specify a sequence of sufficient statistics and conditional distributions for {Y_n}_{n=1}^∞ given the sufficient statistics in such a way that the sequence is exchangeable. Then the X_n are conditionally IID with distribution being one of the limits of r_n(p_{1,n}(·), T_n(x_n)). Examination of these limits should reveal the collection of extremal distributions.
The most straightforward example of Theorem 2.113 is to show that it
implies DeFinetti's representation theorem 1.49 for random variables.
Example 2.116. Let X_n = ℝⁿ. Let T_n be the subset of ℝⁿ with the coordinates in nondecreasing order. Let r_n(A, t) equal 1/n! times the number of permutations of the coordinates of t which are elements of A. Let p_{n−1,n}(x₁, …, x_n) = (x₁, …, x_{n−1}). Clearly, every IID distribution is in M, so M is nonempty. Now, suppose that X_{n+1} has distribution r_{n+1}(·, t). The set Y(t) = p_{n,n+1}(T_{n+1}⁻¹({t})) for t ∈ T_{n+1} is just the set of vectors of length n whose coordinates are n draws without replacement from the coordinates of t, or equivalently, the set of vectors consisting of the first n coordinates of the permutations of the coordinates of t. The distribution of X_n is uniform over these (n+1)! points (with repeats counted as more than one point), and the distribution of T_n is uniform on T_n(Y(t)), which consists of the n+1 vectors obtained by removing one coordinate from t. The conditional distribution of X_n given T_n = s is clearly r_n(·, s) for each s ∈ T_n(Y(t)). Hence the conditions of Theorem 2.111 are satisfied.
A combinatorial argument like the one used in the proof of Theorem 1.49 shows that each limit point of a sequence {r_n(p_{k,n}(·), T_n(X_n))}_{n=1}^∞ of probabilities (for fixed k and x) has IID coordinates. Hence, Q(·, x) is a distribution of IID random variables for each x. We can determine which IID distribution by looking at the first coordinate. Since the r_n(p_{1,n}(·), T_n(x_n)) distributions are just the empirical probability measures of the first n coordinates of x, Q(·, x) is the limit of the empirical probability measures for those x such that the limit exists. Since all CDFs on ℝ are such limits, we get DeFinetti's representation theorem 1.49 out of Theorem 2.111.
When (Tn,Cn ) is the same space (T,C) for all n, it may be possible to
identify the extremal distributions with elements of T.
Example 2.117. Suppose that X = ℝ and T_n = ℝ × ℝ⁺ with T_n(x₁, …, x_n) = (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i²), and r_n(·, (t₁, t₂)) the uniform distribution on the surface of the sphere of radius √(t₂ − t₁²/n) around (t₁/n, …, t₁/n). If an n-dimensional vector Y is uniformly distributed on the sphere of radius 1 around 0, then r_n is the distribution of t₁/n + √(t₂ − t₁²/n) Y. So, we will find the distribution of Y₁. The conditional distribution of (Y₂, …, Y_n) given Y₁ = y₁ is uniform on the sphere in the (n − 1)-dimensional space in which the first coordinate is y₁ with radius √(1 − y₁²) around the point (y₁, 0, …, 0). The marginal density of Y₁ is then the ratio of the surface areas of these two spheres. The surface area of a sphere of radius r in n > 1 dimensions is 2π^{n/2} r^{n−1}/Γ(n/2). So

f_{Y₁}(y₁) = [Γ(n/2) / (Γ((n−1)/2) √π)] (1 − y₁²)^{(n−3)/2}.

Since

lim_{n→∞} Γ(n/2) / [Γ((n−1)/2) √n] = 1/√2,

we have that
If $\sigma_n$ converges to $\sigma$ and $t_1/n$ converges to $\mu$, this function converges uniformly on
compact sets to the $N(\mu,\sigma^2)$ density. If $\sigma_n$ goes to $\infty$, the limit is 0 and does not
correspond to a probability distribution. If $\sigma_n$ converges to 0 and $t_1/n$ converges
to $\mu$, the density goes to 0 uniformly outside of every open interval around $\mu$,
hence the limit distribution is concentrated at $\mu$. If $\sigma_n$ converges to 0 but $t_1/n$
goes to $\pm\infty$, the limit is not a probability distribution. Hence the set consists
of all IID distributions in which the coordinates are either normally distributed
or constant.
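A numerical check of this example (my own sketch; the sampler that normalizes a Gaussian vector is a standard device, not part of the text): the first coordinate of a uniformly distributed point on the rescaled sphere should behave like a $N(\mu,\sigma^2)$ draw when $t_1/n = \mu$ and $\sigma_n = \sigma$.

```python
import math, random

# Sketch (mine; the normalized-Gaussian sampler is a standard device, not part
# of the text): draw Y uniform on the unit sphere in R^n, rescale so that
# t1/n = mu and t2 - t1^2/n = n sigma^2, and check that the first coordinate
# behaves like N(mu, sigma^2).
random.seed(1)
n, mu, sigma = 100, 2.0, 1.5
t1, t2 = n * mu, n * (sigma ** 2 + mu ** 2)   # makes sigma_n = sigma exactly

def first_coordinate_draw():
    z = [random.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in z))
    return t1 / n + math.sqrt(t2 - t1 ** 2 / n) * z[0] / norm

draws = [first_coordinate_draw() for _ in range(10000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(abs(mean - mu) < 0.1 and abs(var - sigma ** 2) < 0.2)   # True
```

The match improves with $n$, in line with the uniform convergence on compact sets stated above.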
2.4.3 Proofs+
The proof of Theorem 2.111 will proceed by means of a sequence of lemmas.
The following simple proposition implies that M is a convex set.
Proposition 2.119. Let $P$ and $Q$ be probability measures on a measurable
space $(\mathcal{Y},\mathcal{C})$, and let $R = \lambda P + (1-\lambda)Q$ with $0 < \lambda < 1$. Let $\mathcal{V}$ be a
sub-$\sigma$-field of $\mathcal{C}$ such that $P(\cdot|\mathcal{V}) = Q(\cdot|\mathcal{V})$. Then $R(\cdot|\mathcal{V}) = P(\cdot|\mathcal{V})$.
Next, we prove that the conditional distribution of $X_n$ given $\{T_i\}_{i=n}^{\infty}$ is the
same as the conditional distribution given $T_n$ alone.19
Lemma 2.120. For each $n$ and each $P \in \mathcal{M}$, $X_n$ is conditionally independent of $\{T_{n+i}\}_{i=1}^{\infty}$ given $T_n$.
PROOF. We will prove this by showing that the conditional distribution
of $X_n$ given $T_n,\ldots,T_m$ is the same as the conditional distribution of $X_n$
given $T_n$ for all $m$. It will follow from the result in Problem 13 on page 663
that this is also the conditional distribution of $X_n$ given $\{T_i\}_{i=n}^{\infty}$.
For each $P \in \mathcal{M}$, $r_{n+1}(\cdot,t)$ is the conditional distribution of $X_{n+1}$ given
$T_{n+1} = t$. Condition S says that $r_n(\cdot,s)$ is the conditional distribution of
$X_n$ given $T_n = s$ and $T_{n+1} = t$. But $r_n$ is the conditional distribution of $X_n$
given $T_n$, so the result is true if $m = n+1$. We finish the proof by induction
on $m$. Suppose that the conditional distribution of $X_n$ given $T_n,\ldots,T_m$
is $r_n$. Now, find the conditional distribution of $X_n$ given $T_n,\ldots,T_{m+1}$.
The conditional distribution of $X_{m+1}$ given $T_{m+1}$ is $r_{m+1}$, so condition
S says that the conditional distribution of $X_m$ given $(T_m, T_{m+1})$ is $r_m$,
+This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
19This is a stronger statement of the fact that $T_n$ is sufficient. If we think of
$\{T_i\}_{i=n}^{\infty}$ as the parameter, then Lemma 2.120 is the usual classical concept of
sufficiency.
130 Chapter 2. Sufficient Statistics
for all $P \in \mathcal{M}$.
for each $k$ and each bounded measurable $f$. Then both $g_{k-1}(x;f)$ and
$g_k(x; f(P_{k-1,k}))$ are versions of $E(Y_{k-1}|\mathcal{E})$. Let $H_{k,f}$ be the set of $x \in L$
such that the two versions are equal, and let
$$C = \bigcap_{k=1}^{\infty}\bigcap_{f \in G_0} H_{k,f}.$$
$$\int_{p_n^{-1}(T_n^{-1}(B))} r_n(A, T_n(p_n(y)))\,dQ(y,x) = Q(\{T_n(X_n) \in B,\, X_n \in A\}, x). \qquad (2.125)$$
Both sides of (2.125) are $\mathcal{E}$ measurable functions of $x$, so the set of $x$ for
which (2.125) holds for fixed $n$, $A$, and $B$ is in $\mathcal{E}$. Since $\mathcal{X}_n$ and $\mathcal{T}_n$ are Borel
spaces, there exist countable fields of sets which generate their respective
Borel $\sigma$-fields (Proposition B.43). The set of all $x$ such that (2.125) holds
for all $A$ and $B$ in those countable collections and all $n$ simultaneously is
therefore in $\mathcal{E}$. But it is easy to see that if (2.125) holds for all $A$ in a field
that generates $\mathcal{B}^n$ (for fixed $B$ and fixed $n$), then it holds for all $A \in \mathcal{B}^n$,
and similarly for all $B$. Hence $V \in \mathcal{E}$. To show that $P(V) = 1$ for all
$P \in \mathcal{M}$, let $P \in \mathcal{M}$ and let $G \in \mathcal{E}$. Since $Q(\cdot,x)$ is a version of $P(\cdot|\mathcal{E})$, it
follows that the integral of the right-hand side of (2.125) over $G$ is
$$\int_G \int_{p_n^{-1}(T_n^{-1}(B))} r_n(A, T_n(p_n(y)))\,dQ(y,x)\,dP(x).$$
$$P(A' \setminus A) = \int_{A'} [1 - Q(A,x)]\,dP(x) = 0. \qquad \Box$$
$$E = \bigcap_{A \in \mathcal{V}} \{x \in V : Q(A,x) = I_A(x)\}, \qquad (2.130)$$
$$P(Q(A,X) = I_A(X)) = 1.$$
Now use (2.130) again, to conclude $P(E) = 1$. $\Box$
Since $E \subseteq C$, we have now established parts 1, 2, and 3 of Theorem 2.111.
Next, we establish part 4.
Lemma 2.131. If $x \in E$ and $A \in \mathcal{E}$, then $Q(A,x) \in \{0,1\}$.
PROOF. Let $x \in E$ and $A \in \mathcal{E}$. Then $Q(\cdot,x) \in \mathcal{M}$ by Lemma 2.124.
By Lemma 2.127, there is $A' \in \mathcal{E}'$ such that $Q(A,x) = Q(A',x)$. But
$Q(A',x) = I_{A'}(x)$ by Lemma 2.128. $\Box$
Now, we are ready to prove parts 5 and 6 of Theorem 2.111.
Lemma 2.132. Let $\mathcal{E}^*$ be the $\sigma$-field of subsets of $E$ defined by
$$\mathcal{E}^* = \{A \cap E : A \in \mathcal{E}'\}.$$
Corollary 2.133. If $P$ and $P'$ are in $\mathcal{M}$ and they agree on $\mathcal{E}^*$, then they
are the same.
Finally, we can prove part 7 of Theorem 2.111.
Lemma 2.134. A probability $P \in \mathcal{M}$ is extreme if and only if it is a
zero-one measure on $\mathcal{E}$. Also, $P \in \mathcal{M}$ is extreme if and only if
bimeasurable mapping $w$. We will prove that the IID distributions are zero-one
on $\mathcal{A}_{\infty}$. We do this by proving that IID distributions are conditional
distributions given $\mathcal{A}_{\infty}$. Let $P$ stand for the distribution of the data which
says that the $Y_i$ are IID with distribution equal to the limit of the empirical
CDFs. Since the empirical CDF based on $Y_1,\ldots,Y_n$ is $\mathcal{A}_n$ measurable, and
since $\mathcal{A}_n \supseteq \mathcal{A}_{n+1}$ for all $n$, it follows that $P$ is $\mathcal{A}_n$ measurable for every $n$,
hence it is $\mathcal{A}_{\infty}$ measurable. To see that $P(B) = \Pr((Y_{i_1},\ldots,Y_{i_k}) \in B \mid \mathcal{A}_{\infty})$,
we need to prove that, for every $C \in \mathcal{A}_{\infty}$,
Next, we want to prove that the IID distributions are the extremal distributions.
Lemma 2.137 says that the IID distributions in $\mathcal{M}^*$ are contained
in the extremal family. We now prove that all extremal distributions are IID. It follows from
DeFinetti's representation theorem 1.49 that every distribution $\eta$ in $\mathcal{M}^*$
is a mixture of IID distributions. We show next that if $\eta$ is extremal, then the
mixture must be trivial. Let $\eta$ be extremal, and represent
(2.138)
Then $f^{(2)}$ is the density of $Y_1 + Y_2$, since $Y_1$ and $Y_2$ are IID in the extremal
family. Since $f$ leads to the same conditional distributions as $h$, we get
$$f(y) = \int \frac{h(y)h(t-y)}{h^{(2)}(t)}\, f^{(2)}(t)\,dt,$$
hence $f(0) > 0$, since $h(0) > 0$. It also follows that
$$\frac{f(t-y)f(y)}{f^{(2)}(t)} = \frac{h(t-y)h(y)}{h^{(2)}(t)} \quad \text{a.e.} \qquad (2.139)$$
By taking the log of both sides of (2.139), we get $\lambda(t-y) + \lambda(y) = \psi(t)$.
Now, set $y = t$ and note that $\lambda(0) = 0$, so that $\lambda(t) = \psi(t)$. It follows that
2.5 Problems
Section 2.1.2:
1. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $N(\theta,1)$, for $\theta \in \mathbb{R}$. Let
$X = (X_1,\ldots,X_n)$ and find a one-dimensional sufficient statistic $T$. Also,
find the conditional distribution of $X$ given $T = t$.
2. Refer to the definition of $P_{\theta,T}$ on page 84. If $\{P_\theta : \theta \in \Omega\}$ is a regular
conditional distribution on $(\mathcal{X},\mathcal{B})$ given $\Theta$, then prove that $\{P_{\theta,T} : \theta \in \Omega\}$
is a regular conditional distribution on $(\mathcal{T},\mathcal{C})$ given $\Theta$.
3. Suppose that $X_1,\ldots,X_n$ are conditionally IID with $N(\mu,\sigma^2)$ distribution
given $\Theta = (\mu,\sigma)$. Find a two-dimensional sufficient statistic.
4. Let $X_1,\ldots,X_n$ be conditionally independent given $P = p$ with $X_i$ having
conditional density (with respect to counting measure on the nonnegative
integers)
$$f_{X_i|P}(x|p) = \frac{\Gamma(\alpha_i + x)}{\Gamma(\alpha_i)\,x!}\, p^x (1-p)^{\alpha_i},$$
where $\alpha_1,\ldots,\alpha_n$ are known strictly positive numbers. (These are generalized
negative binomial random variables.) Define $T = \sum_{i=1}^{n} X_i$. Find the
conditional distribution of $(X_1,\ldots,X_n)$ given $T = t$ and $P = p$.
5. Let $P_\theta$ say that $X_1,\ldots,X_n$ are IID $\mathrm{Poi}(\theta)$. Show that $T = \sum_{i=1}^{n} X_i$ is
sufficient by both Definitions 2.4 and 2.8.
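Problem 5 can be explored numerically. This sketch is mine, not the requested proof; it assumes the standard fact that, given the total $T = t$, IID Poisson counts have a multinomial$(t, 1/n)$ conditional law, so the conditional mean of $X_1$ given $T = t$ should be $t/n$ for every $\theta$:

```python
import random

# Simulation sketch for Problem 5 (mine, not the requested proof): given
# T = X_1 + ... + X_n = t, the conditional law of IID Poi(theta) counts is
# multinomial(t, 1/n) -- a standard fact assumed here -- so the conditional
# mean of X_1 given T = t should be t/n regardless of theta.
random.seed(2)

def poisson(lam):
    # Knuth's multiplication method; adequate for small lam.
    limit, k, prod = pow(2.718281828459045, -lam), 0, 1.0
    while True:
        prod *= random.random()
        if prod <= limit:
            return k
        k += 1

def cond_mean_x1(theta, n=3, t=6, trials=100000):
    """Monte Carlo mean of X_1 among samples whose total is exactly t."""
    total, count = 0, 0
    for _ in range(trials):
        x = [poisson(theta) for _ in range(n)]
        if sum(x) == t:
            total += x[0]
            count += 1
    return total / count

m1, m3 = cond_mean_x1(1.0), cond_mean_x1(3.0)
print(abs(m1 - 2.0) < 0.1 and abs(m3 - 2.0) < 0.1)   # True: t/n = 2 either way
```

That the conditional law is free of $\theta$ is exactly the sufficiency of $T$ in Definition 2.8.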
6. Prove Proposition 2.12 on page 86 and find the conditional distribution of
$(X_1,\ldots,X_n)$ given the order statistics.
7. Prove Proposition 2.23 on page 90.
8. For the experiment in Example 2.53 on page 102, find the conditional
distribution of $X$ given $X_1$ and $\Theta$ and show that $X_1$ is not sufficient. Show
that $X$ is minimal sufficient.
9. *Consider the experiment described in Example 2.54 on page 102. Let $N$ be
the number of observed $Y_i$, $\mathcal{X} = \bigcup_{m=2}^{\infty}\{0,1\}^m$, and $X = (Z, Y_1,\ldots,Y_N)$.
(a) Find the density of $X$ given $\Theta = \theta$ with respect to counting measure
on $(\mathcal{X}, 2^{\mathcal{X}})$.
2.5. Problems 139
(b) Let $M$ be the number of observed successes among the $Y_i$, that is,
$M = \sum_{i=1}^{N} Y_i$. Show that $(N,M)$ is sufficient. In particular, we do
not need to keep track of $Z$.
(c) Find the conditional distribution of $Z$ given $(N, M, \Theta)$, and show that
it does not depend on $\Theta$.
10. (Nonexchangeable example) Let $P_\theta$ say that $\{X_n\}_{n=1}^{\infty}$ are Bernoulli random
variables with the following joint distribution: $P_\theta(X_1 = 1) = \theta_1$, and for $i > 1$,
$$P_\theta(X_i = 1 \mid X_1,\ldots,X_{i-1}) = \begin{cases} \theta_{11} & \text{if } X_{i-1} = 1, \\ \theta_{10} & \text{if } X_{i-1} = 0,\end{cases}$$
where $\theta = (\theta_1, \theta_{11}, \theta_{10})$.
(a) Let $X = (X_1,\ldots,X_n)$. Find a four-dimensional sufficient statistic.
(b) Suppose that $X = (X_1,\ldots,X_N)$, where $N$ is the number of observations
until $k$ successes (1s) have been observed, where $k$ is known.
Find a three-dimensional sufficient statistic.
Section 2.1.3:
11. Prove that the sufficient statistic T found in Problem 1 on page 138 is
minimal sufficient.
12. Show that T is a complete sufficient statistic in Problem 4 on page 138.
13. (Logistic regression) Let $\{Y_i\}_{i=1}^{n}$ be Bernoulli random variables, but assume
that each $Y_i$ comes with a known vector $x_i$ of $k$ covariates. Conditional on
$\Theta = \theta$ (a vector of length $k$), the $Y_i$ are independent with
$$\log\left\{\frac{P_\theta(Y_i = 1)}{P_\theta(Y_i = 0)}\right\} = \theta^{\top} x_i.$$
(b) Consider the relation $\sim$ on $\mathcal{F}$ defined by $f \sim g$ if there exists $c > 0$
such that $g(\theta) = c f(\theta)$ for all $\theta$. Prove that $\sim$ is an equivalence
relation. (That is, prove that (i) $f \sim f$, (ii) $f \sim g$ implies $g \sim f$, and
(iii) ($f \sim g$ and $g \sim h$) implies $f \sim h$.)
(c) Let $\mathcal{T}$ be the set of all equivalence classes $[f] = \{g : g \sim f\}$. Let the
$\sigma$-field of subsets of $\mathcal{T}$ be the smallest $\sigma$-field containing sets of the
form $A_{\theta,\psi,c} = \{[f] : f(\theta) \le c f(\psi)\}$. Prove that $[f] \in A_{\theta,\psi,c}$ if and
only if $[g] \in A_{\theta,\psi,c}$ for all $g \in [f]$.
(d) Let $T_2 : \mathcal{F} \to \mathcal{T}$ be defined by $T_2(f) = [f]$. Prove that $T_2$ is measurable.
(e) Prove that $T = T_2(T_1)$ satisfies $T(x) = T(y)$ if and only if $y \in V(x)$
in the notation of Theorem 2.29.
16. Suppose that $\{(X_n, Y_n)\}_{n=1}^{\infty}$ are conditionally IID given $\Theta = \theta$ with distribution
uniform on the disk of radius $\theta$ centered at $(0,0)$, that is,
23. *Let $\Omega = \{1,2,3\}$, and let $X = (X_1,\ldots,X_n)$ be conditionally IID given $\Theta =
\theta$, with density $f_\theta(\cdot)$. Suppose that $P_1$, $P_2$, and $P_3$ have density functions
(with respect to Lebesgue measure): $f_1(x) = I_{(-1,0)}(x)$, $f_2(x) = I_{(0,1)}(x)$,
and $f_3(x) = 2x\,I_{(0,1)}(x)$. Thus, the model has only three members. Let
$S(x) = \{(j,k) : \prod_i f_j(x_i) + \prod_i f_k(x_i) > 0\}$. Let
Section 2.1.4:
25. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^{\infty}$ are IID $U(\theta - 1/2, \theta + 1/2)$. Let $X =
(X_1,\ldots,X_n)$. Find minimal sufficient statistics and find a nonconstant
function of the sufficient statistic which is ancillary.
26. Prove that if $S$ is ancillary, then $S$ and $\Theta$ are independent no matter what
prior one uses for $\Theta$.
27. (A vector example) Suppose that $P_\theta$ says
with $\Omega = (-1,1)$.
(a) Find a two-dimensional minimal sufficient statistic.
(b) Prove that the minimal sufficient statistic found above is not com-
plete.
(c) Prove that $Z_1 = \sum_{i=1}^{n} X_i^2$ and $Z_2 = \sum_{i=1}^{n} Y_i^2$ are both ancillary but
that $(Z_1, Z_2)$ is not ancillary.
28. *Consider the situation in Example 2.51 on page 100. Prove that the distribution
of $U$ is uniform on the sphere centered at the vector $0$ of $n$ 0s in
the hyperplane defined by $\mathbf{1}^{\top} U = 0$, where $\mathbf{1}$ is the vector of $n$ 1s. (Hint:
Let $A$ be an orthogonal $n \times n$ matrix that maps the hyperplane to itself.
Prove that $AU$ has the same distribution as $U$.)
29. Suppose that $\Pr(X > 0) = 1$ and that the conditional distribution of $Y$
given $X = x$ is $U(0,x)$. Let $Z = X - Y$ and suppose that $Y$ and $Z$ are
independent. Let $f_X(x)$, $f_Y(y)$, and $f_Z(z)$ be differentiable.
(a) Prove that $\Pr(X > c) > 0$ for all $c > 0$.
(b) Prove $f_X(x) = a^2 x \exp(-ax)$, for $x > 0$.
30. Let $X_1,\ldots,X_n$ be conditionally IID given $\Theta = \theta$ each with density $g(x-\theta)$
for some function $g$. Prove that $\max\{X_1,\ldots,X_n\} - \min\{X_1,\ldots,X_n\}$ is
ancillary, but not maximal ancillary if $n > 2$.
31. Consider the situation in Example 2.46 on page 97. Suppose that we wish
to condition on $N_0$ if
I
Varo ( 1 - 3 r:;oo No) ~ Varo ( 1 - 2 ~: IMo) ,
and we wish to condition on $M_0$ otherwise. For which data sets would we
choose $N_0$, and for which would we choose $M_0$?
32. Consider the situation in Example 2.46 on page 97. Suppose that we need
to choose upon which ancillary to condition before we see the data. Suppose
that we decide to condition on $N_0$ if
Section 2.2:
34. Express the family of Poisson distributions in exponential family form. Find
the natural parameter, natural parameter space, and sufficient statistic. Use
Theorem 2.64 to find the mean and the variance of the sufficient statistic.
35. Express the family of Beta distributions in exponential family form. Find
the natural parameter, natural parameter space, and sufficient statistic(s).
Use Theorem 2.64 to find the mean and the variance of the sufficient statistics.
(Hint: The derivative of the log of the gamma function is called the
digamma function $\psi$. The second derivative is called the trigamma function
$\psi'$.)
36. In Problem 9 on page 138, show that the family of distributions for ob-
served data is an exponential family but that the sufficient statistic is not
complete. How do you reconcile this with Theorem 2.74?
37. Prove Proposition 2.70 on page 107.
38. Prove Proposition 2.72 on page 107.
Section 2.3:
39. Suppose that $X$ and $Y$ are conditionally independent given $\Theta$ with conditional
densities $f_{X|\Theta}(x|\theta)$ and $f_{Y|\Theta}(y|\theta)$, respectively. Suppose that $\Theta$ is
$k$-dimensional. Prove that $\mathcal{I}_{X,Y}(\theta) = \mathcal{I}_X(\theta) + \mathcal{I}_Y(\theta)$.
40. Prove Proposition 2.84 on page 112.
41. Prove Proposition 2.92 on page 116.
42. Suppose that $P_\theta$ says that $X \sim \mathrm{Poi}(\theta)$.
(a) Find the Fisher information $\mathcal{I}_X(\theta)$.
(b) Find Jeffreys' prior.
43. Suppose that the FI regularity conditions hold and that two derivatives
can be passed under integral signs. Let $T$ be an ancillary statistic. Prove
that $\mathcal{I}_{X|T}(\theta|t)$ has $(i,j)$ entry equal to $-E_\theta\left(\partial^2 \log f_{X|T,\Theta}(X|t,\theta)/\partial\theta_i\,\partial\theta_j\right)$.
44. Let $\Omega = \{(p_1,p_2,p_3) : p_i \ge 0, \sum_{i=1}^{3} p_i = 1\}$ and
$$f_{X|\Theta}(x|p_1,p_2,p_3) = \begin{cases} p_1 & \text{if } x = 1, \\ p_2 & \text{if } x = 2, \\ p_3 & \text{if } x = 3, \\ 0 & \text{otherwise.}\end{cases}$$
Let $\theta_0 = (1/3,1/3,1/3)$, and find the value of $\theta$ such that $E_\theta(X) = 2.5$ and
$\mathcal{I}_X(\theta_0;\theta)$ is minimized.
45. Suppose that $X \sim U(0,\theta)$ given $\Theta = \theta$. Find the Kullback-Leibler information
$\mathcal{I}_X(\theta_1;\theta_2)$ for all pairs $(\theta_1,\theta_2)$.
46. Suppose that person 1 believes $\Pr(\Theta = 1/3) = \pi_0$ and $\Pr(\Theta = 1/2) = 1-\pi_0$
and person 2 believes $\Pr(\Theta = q) = 1$. Both persons believe that $\{X_n\}_{n=1}^{\infty}$
are IID $\mathrm{Ber}(\theta)$ given $\Theta = \theta$. Let $Y_n = \sum_{i=1}^{n} X_i/n$.
(a) Find $\Pr(\Theta = 1/3 \mid Y_n = q)$ for person 1.
(b) For each possible value of $q$, describe person 2's beliefs about how
the value of $\Pr(\Theta = 1/3 \mid Y_n)$ (calculated by person 1) will behave as
$n \to \infty$.
47. Let $Y$ be the number of patients (out of $n$) who survive for one year after
an operation. Let $Z$ be the number of patients who survive for five years.
Let $\Theta = (P,Q)$, and suppose that we model $Y \sim \mathrm{Bin}(n,p)$ given $\Theta = (p,q)$
and $Z \sim \mathrm{Bin}(y,q)$ given $Y = y$ and $\Theta = (p,q)$. Let $X = (Y,Z)$. Find $\mathcal{I}_X(\theta)$
and Jeffreys' prior.
Section 2.4:
48. *Let $T_n(x_1,\ldots,x_n) = \sum_{i=1}^{n} x_i$, and suppose that $(X_1,\ldots,X_n)$ given $T_n = t$
is distributed uniformly on the portion of the hyperplane $\sum_{i=1}^{n} x_i = t$ with
all coordinates nonnegative. Find the extremal family of distributions.
49. *Let $\mathcal{X} = \{0,1\}$. Let $T_n(x_1,\ldots,x_n) = \sum_{i=1}^{n} x_i$, and suppose that the distribution
of $X_1,\ldots,X_n$ given $T_n = t$ is that of draws without replacement
from an urn containing $t$ 1s and $n-t$ 0s. Find the extremal family of
distributions.
CHAPTER 3
Decision Theory
Example 3.3. Let $V \sim N(1,1)$, and suppose that $\aleph = \mathbb{R}$ and $L(v,a) = (v-a)^2$.
In this case, the amount I lose when I choose action $a$ is the squared distance
between $a$ and the unknown $V$. Alternatively, we might have $L(v,a) = 3|v-a|$.
In this case, I lose three times the distance between $V$ and $a$.
Let $Y = \sum_{i=1}^{n} X_i$ and $y = \sum_{i=1}^{n} x_i$. Here is a plausible randomized decision rule:
$$\delta(x) = \begin{cases} \text{probability 1 on } a_0 & \text{if } y < \frac{n}{2}, \\ \text{probability 1 on } a_1 & \text{if } y > \frac{n}{2}, \\ \text{probability } \frac{1}{2} \text{ on each} & \text{if } y = \frac{n}{2}.\end{cases}$$
146 Chapter 3. Decision Theory
If Y = n/2 is observed, one could flip a fair coin to decide between the two
actions.
$$L(v,\delta(x)) = \int_{\aleph} L(v,a)\,d\delta(x)(a).$$
This then allows us to talk about the loss incurred by either a random-
ized or a nonrandomized rule without regard to the result of the auxiliary
randomization in the randomized rule.
Example 3.6 (Continuation of Example 3.5; see page 145). If $y = n/2$, then
one can easily show that $L(v,\delta(x)) = 1/2$ for all $v$.
for each decision rule and chooses the one with the smallest posterior risk.
Here, $\mu_{V|X}$ denotes the conditional distribution of $V$ given $X$. If we do
this for every $x$ and the posterior risk is never $+\infty$, the resulting rule is
called a formal Bayes rule.
Definition 3.7. If $\delta_0$ is such that $r(\delta_0|x) < \infty$ for all $x$ and $r(\delta_0|x) \le r(\delta|x)$
for all $x$ and all decision rules $\delta$, then $\delta_0$ is called a formal Bayes rule.
The use of formal Bayes rules is based on the following principle.
The Expected Loss Principle: When one compares two rules
after observing data, the better rule is the one with the smaller
posterior risk.
A justification for this principle will be given in Section 3.3. One feature
of that justification, which we do not use here, however, is that the loss
function needs to be bounded.
Example 3.8. Let $\mathcal{V} \subseteq \mathbb{R}$, $\aleph \subseteq \mathbb{R}$, and $L(v,a) = (v-a)^2$. Then
$$r(\delta|x) = E\{(V-\delta(x))^2 \mid X = x\}.$$
Assuming that $E(V^2|X = x) < \infty$, we can easily minimize the posterior risk by
setting $\delta(x) = E(V|X = x)$. This result is very general. So long as the posterior
variance of $V$ is finite, a formal Bayes rule with squared-error loss is the posterior
mean of $V$.
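The claim that the posterior mean minimizes the posterior risk under squared-error loss is easy to check numerically. In this sketch of mine, the discrete posterior is hypothetical:

```python
# Numeric check (mine; the posterior below is hypothetical): the posterior
# mean minimizes the posterior risk E{(V - a)^2 | X = x} over a grid of actions.
posterior = {0.0: 0.2, 1.0: 0.5, 3.0: 0.3}

def posterior_risk(a):
    return sum(p * (v - a) ** 2 for v, p in posterior.items())

post_mean = sum(v * p for v, p in posterior.items())
grid = [i / 100 for i in range(-100, 400)]
best = min(grid, key=posterior_risk)
print(abs(best - post_mean) < 0.011)   # True: the grid minimizer sits at the mean
```

The quadratic posterior risk is strictly convex in $a$, so the minimizer is unique and equals the posterior mean.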
3.1. Decision Problems 147
It is possible that there exist x values such that the posterior risk given
X = x is +00 for every decision rule. Also, it is possible that although
there exist rules with posterior risk < 00 given X = x, there is no rule
that achieves the minimum of the posterior risk. In these cases, there is no
formal Bayes rule as we have defined it, although there may exist x values
such that, conditional on X = x, the posterior risk can be minimized at a
value < 00. In this latter case, we call a rule that minimizes the posterior
risk at all values of x for which a minimum < 00 can be achieved a partial
Bayes rule.
Example 3.9. Suppose that $\{Y_n\}_{n=1}^{\infty}$ are conditionally IID with Cauchy distribution
$\mathrm{Cau}(\theta,1)$ given $\Theta = \theta$, where $\Omega = \mathbb{R} = \aleph$, $V = \Theta$, and $L(\theta,a) = (a-\theta)^2$.
Let the prior distribution of $\Theta$ be $\mathrm{Cau}(0,1)$. Let $t > 0$ and let $X_i = \min\{t, Y_i\}$.
Define $X = (X_1,X_2,X_3)$. If at least one of the $X_i$ is strictly less than $t$, then
the posterior risk will be finite for some decision rule. But if all three $X_i = t$,
the posterior risk is infinite for all decision rules. In this example, a partial Bayes
rule is any rule that chooses the action minimizing the posterior risk for those
data in which at least one $X_i < t$. As we saw in Example 3.8, the action to
choose in those cases is the posterior mean of $\Theta$. For those data $x$ such that the
posterior risk is infinite (all $X_i = t$), it might still make sense to choose $\delta(x)$ in
such a way that the posterior distribution of $L(V,\delta(x))$ is stochastically small.
That is, if we define $Z_\delta = L(V,\delta(x))$, we should prefer $\delta_1$ to $\delta_2$ if the CDF of $Z_{\delta_1}$
is everywhere larger than the CDF of $Z_{\delta_2}$.
Example 3.10. Let $\aleph = (0,1) = \Omega$, and let $X = (X_1,\ldots,X_{10})$ where the $X_i$ are
IID with $\mathrm{Ber}(\theta)$ distribution given $\Theta = \theta$. Let $V = \Theta$ and $L(\theta,a) = (\theta-0.1-a)^2$.
Let the prior distribution of $\Theta$ be $\mathrm{Beta}(1,1)$ so that the posterior given $X = x$ is
$\mathrm{Beta}(x+1, 11-x)$. If $X = x > 0$ is observed, then the posterior risk is minimized
at $\delta_0(x) = (x+1)/12 - 0.1$. However, if $X = 0$ is observed, the posterior risk is
an increasing function of $a$ for $a \in \aleph$, so we would like to choose $\delta_0(0)$ as small as
possible. But the action space is not closed, so there is no smallest possible value.
Any decision rule $\delta$ such that $\delta(x) = \delta_0(x)$ for $x > 0$ is a partial Bayes rule.
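A small sketch of the computation in this example (the function name is mine): the posterior mean of a $\mathrm{Beta}(x+1, 11-x)$ distribution is $(x+1)/12$, so the unconstrained minimizer of the posterior risk is $(x+1)/12 - 0.1$, which falls outside the action space $(0,1)$ when $x = 0$.

```python
# Sketch of the computation in Example 3.10 (function name is mine): the
# posterior mean of Beta(x+1, 11-x) is (x+1)/12, so the unconstrained
# minimizer of E{(Theta - 0.1 - a)^2 | X = x} is (x+1)/12 - 0.1, which
# leaves the action space (0, 1) when x = 0.
def optimal_action(x):
    post_mean = (x + 1) / 12        # mean of Beta(x+1, 11-x)
    return post_mean - 0.1

print(round(optimal_action(5), 3))  # 0.4
print(optimal_action(0) < 0)        # True: no minimizer inside (0, 1)
```

This is exactly why only a partial Bayes rule, and no formal Bayes rule, exists here.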
PROOF. Before proving the first part, let $h_x(a) = \int_{\mathcal{V}} L(v,a)\,dF_{V|X}(v|x)$,
the posterior risk of a nonrandomized rule with $\delta(x) = a$. Then, if a formal
Bayes rule exists with finite posterior risk, $h_x(a)$ (as a function of $a$) must
be bounded below. Furthermore, if for some $x$, $h_x(a) = \infty$ for all $a$, then
every decision rule has infinite posterior risk at $x$. Hence, we can assume
that $c(x) = \inf_{a \in \aleph} h_x(a) < \infty$ for all $x$.
We will prove the contrapositive of the first part of the theorem, namely
that if, at some value of $x$, there is no nonrandomized formal Bayes rule,
then the formal Bayes rule does not exist. Suppose that there is no nonrandomized
formal Bayes rule at a value $x$. Then there exists no $b \in \aleph$ such
that $h_x(b) = c(x)$. Suppose that $\delta$ is a randomized rule with finite posterior
risk (so $c(x) > -\infty$ also). Let $A_n = \{a : h_x(a) \le c(x) + 1/n\}$. Since
$\{a : h_x(a) = c(x)\} = \emptyset$, we can choose $n$ large enough so that $\delta(x)(A_n) > 0$.
It follows that
$$L(\theta,a) = \begin{cases} 0 & \text{if } (\theta \le \frac{1}{2} \text{ and } a = a_0) \text{ or if } (\theta > \frac{1}{2} \text{ and } a = a_1), \\ 1 & \text{otherwise.}\end{cases}$$
Let $X = (X_1,\ldots,X_n)$ and suppose that $n$ is even. Let $Y = \sum_{i=1}^{n} X_i$.
Suppose that the prior is $\eta$ equal to Lebesgue measure. If $Y = y$ successes are
observed in $n$ trials, the posterior is $\mathrm{Beta}(y+1, n-y+1)$. The posterior risk
for choosing $a = a_0$ is $r(y) = \Pr(\Theta > 1/2 \mid Y = y)$, and the posterior risk for
choosing $a = a_1$ is $\Pr(\Theta \le 1/2 \mid Y = y) = 1 - r(y)$. The formal Bayes rule will
be to choose $a_0$ if $r(y) < 1 - r(y)$ (that is, if $r(y) < 1/2$) and to choose $a_1$ if
$r(y) > 1/2$. If, however, $y = n/2$, the posterior will be $\mathrm{Beta}(n/2+1, n/2+1)$,
which is symmetric about $1/2$, so $r(y) = 1/2$. Randomized rules of the following
form are formal Bayes rules:
$$\delta(x) = \begin{cases} \text{probability 1 on } a_0 & \text{if } Y < \frac{n}{2}, \\ \text{probability 1 on } a_1 & \text{if } Y > \frac{n}{2}, \\ \text{probability } \frac{1}{2} \text{ on each} & \text{if } Y = \frac{n}{2}.\end{cases}$$
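The threshold behavior of $r(y)$ can be computed numerically. This sketch is mine; it evaluates the $\mathrm{Beta}(y+1, n-y+1)$ upper-tail probability with a midpoint rule:

```python
import math

# Numerical check (mine): r(y) = Pr(Theta > 1/2 | Y = y) under the
# Beta(y+1, n-y+1) posterior, computed by a midpoint rule; the formal Bayes
# rule picks a0 exactly when r(y) < 1/2, i.e., when y < n/2.
def prob_above_half(a, b, steps=100000):
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    total = 0.0
    for i in range(steps // 2, steps):
        t = (i + 0.5) / steps
        total += math.exp(logc + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))
    return total / steps

n = 10
r = [prob_above_half(y + 1, n - y + 1) for y in range(n + 1)]
print(round(r[n // 2], 3))   # 0.5: the Beta(6, 6) posterior is symmetric about 1/2
```

For $y < n/2$ every $r(y) < 1/2$ and for $y > n/2$ every $r(y) > 1/2$, so the randomization is needed only at $y = n/2$.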
$$R(\theta,\delta) = \int_{\mathcal{X}}\int_{\mathcal{V}} L(v,\delta(x))\,dP_{\theta,V}(v)\,dP_\theta(x).$$
There is usually no way to choose $\delta$ to make $R(\theta,\delta)$ as small as possible
for all $\theta$ simultaneously. One possibility is to choose a probability measure
$\eta$ over $\Omega$ and try to minimize
$$r(\eta,\delta) = \int_\Omega R(\theta,\delta)\,d\eta(\theta),$$
which is called the Bayes risk. Let $\mu_X$ denote the marginal probability measure
over $\mathcal{X}$, namely $\mu_X(A) = \int_\Omega P_\theta(A)\,d\eta(\theta)$. Suppose that $P_\theta \ll \nu$ for every
$\theta$. If $L \ge 0$, we can use Tonelli's theorem A.69, or if $L(\theta,\delta(x))f_{X|\Theta}(x|\theta)$
1Notice that a predictive decision problem has been replaced, in the classical
setting, by a parametric decision problem with loss $L(\theta,a)$, which does not depend
on the future observable $V$.
is integrable, we can use Fubini's theorem A.70, to conclude that
$$r(\eta,\delta) = \int_{\mathcal{X}} r(\delta|x)\,d\mu_X(x).$$
Each $\delta$ that minimizes $r(\eta,\delta)$ is called a Bayes rule, assuming that $r(\eta,\delta)$
is finite. Otherwise no Bayes rule exists. So, we can prove the following.
Proposition 3.16. If a Bayes rule $\delta$ exists, then there is a partial Bayes
rule that equals $\delta$ a.s. $[\mu_X]$.
3.1.4 Summary
We now summarize the last several definitions in the case where $V = \Theta$.
Definition 3.17. Suppose that we have a decision problem with action
space $\aleph$, parameter space $\Omega$, sample space $\mathcal{X}$, and loss function $L : \Omega \times \aleph \to
\mathbb{R}$. Let $\delta$ be a randomized rule. Then $L(\theta,\delta(x)) = \int_{\aleph} L(\theta,a)\,d\delta(x)(a)$. The
posterior risk of $\delta$ is
$$r(\delta|x) = \int_\Omega L(\theta,\delta(x))\,d\mu_{\Theta|X}(\theta|x).$$
Let $A$ be the set of all $x$ for which there exists $a_x$ that achieves a finite
minimum posterior risk. Then a decision rule such that $\delta_0(x) = a_x$ for all
$x \in A$ is called a partial Bayes rule. If $A = \mathcal{X}$, then a partial Bayes rule
is called a formal Bayes rule. The risk function of a rule $\delta$ is $R(\theta,\delta) =
\int_{\mathcal{X}} L(\theta,\delta(x))\,dP_\theta(x)$. If $\eta$ is a prior distribution for $\Theta$, the Bayes risk of $\delta$
with respect to $\eta$ is $r(\eta,\delta) = \int_\Omega R(\theta,\delta)\,d\eta(\theta)$. If there is a $\delta$ that minimizes
this quantity at a finite value, then that rule is called the Bayes rule with
respect to $\eta$.
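The quantities in Definition 3.17 can be tied together in a toy discrete problem (all numbers below are hypothetical, chosen only for illustration):

```python
# Toy discrete decision problem (all numbers hypothetical) tying together the
# quantities of Definition 3.17: posterior risk, risk function, and Bayes risk.
thetas = [0.25, 0.75]
prior = {0.25: 0.5, 0.75: 0.5}
xs = [0, 1]

def lik(x, theta):                  # P_theta(X = x) for one Bernoulli trial
    return theta if x == 1 else 1 - theta

def loss(theta, a):
    return (theta - a) ** 2

def posterior(x):
    w = {t: prior[t] * lik(x, t) for t in thetas}
    z = sum(w.values())
    return {t: wt / z for t, wt in w.items()}

def risk(theta, rule):              # R(theta, delta)
    return sum(lik(x, theta) * loss(theta, rule(x)) for x in xs)

def bayes_risk(rule):               # r(eta, delta): risk integrated against the prior
    return sum(prior[t] * risk(t, rule) for t in thetas)

post_mean_rule = lambda x: sum(t * p for t, p in posterior(x).items())
print(bayes_risk(post_mean_rule) < bayes_risk(lambda x: 0.5))   # True
```

The rule that reports the posterior mean beats the constant rule in Bayes risk, as the formal-Bayes analysis of Example 3.8 predicts for squared-error loss.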
(Just check the equation for indicators, simple functions, nonnegative functions,
then integrable functions.) Then,
$$R(\theta,\delta_1) = \int_{\mathcal{X}} L(\theta,\delta_1(T(x)))\,dP_\theta(x) = \int_{\mathcal{X}}\int_{\aleph} L(\theta,a)\,d\delta_1(T(x))(a)\,dP_\theta(x),$$
$B \subseteq \mathcal{X}$ be the set of all $x$ such that $\int_{\aleph} |a|\,d\delta(x)(a) < \infty$. Let the mean of
the distribution $\delta(x)$, considered as a nonrandomized rule, be
$$\delta_0(x) = \int_{\aleph} a\,d\delta(x)(a).$$
Then $L(\theta,\delta_0(x)) \le L(\theta,\delta(x))$ for all $x \in B$ and for all $\theta$.
PROOF. Since $\aleph$ is convex, Theorem B.17 says that $\delta_0(x) \in \aleph$ for all $x \in B$.
It follows that
This is like flipping a coin between the proportion of successes and the posterior
mean from a $U(0,1)$ prior. Then
$$\delta_0(y) = \frac{y}{2n} + \frac{y+1}{2n+4},$$
$$L(\theta,\delta(y)) = \frac{1}{2}\left(\theta - \frac{y}{n}\right)^2 + \frac{1}{2}\left(\theta - \frac{y+1}{n+2}\right)^2 = \theta^2 - \theta\left(\frac{y}{n} + \frac{y+1}{n+2}\right) + \frac{1}{2}\left(\frac{y^2}{n^2} + \frac{(y+1)^2}{(n+2)^2}\right),$$
$$L(\theta,\delta_0(y)) = \left(\theta - \frac{y}{2n} - \frac{y+1}{2n+4}\right)^2 = \theta^2 - \theta\left(\frac{y}{n} + \frac{y+1}{n+2}\right) + \frac{1}{4}\left(\frac{y}{n} + \frac{y+1}{n+2}\right)^2.$$
Since $(x+z)^2/2 < x^2 + z^2$ for $x \ne z$, it follows that $L(\theta,\delta_0(y)) < L(\theta,\delta(y))$.
The theory of hypothesis testing (see Chapter 4) is one in which $\aleph$ is
not a convex set, and indeed, randomized rules figure prominently in the
classical theory of hypothesis testing.
Theorem 3.22 (Rao-Blackwell theorem).4 Suppose that $\aleph$ is a convex
subset of $\mathbb{R}^m$ and that for all $\theta \in \Omega$, $L(\theta,a)$ is a convex function
of $a$.
Example 3.23. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $N(\theta,1)$. Let $X =
(X_1,\ldots,X_n)$. Let $\aleph = [0,1]$ and
$$L(\theta,a) = (a - \Phi(c-\theta))^2,$$
for some fixed $c \in \mathbb{R}$. A naive decision rule is $\delta_0(x) = \sum_{i=1}^{n} I_{(-\infty,c]}(x_i)/n$. But
$T = \overline{X}$ is sufficient and $\delta_0$ is not a function of $T$. Since $\aleph$ is convex and the loss
function is a convex function of $a$, we should calculate
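The improvement promised by the Rao-Blackwell theorem can be seen by simulation. This sketch is mine; it uses the standard fact that $X_i$ given $\overline{X} = t$ is $N(t, 1-1/n)$, so that $E(\delta_0(X)\mid \overline{X} = t) = \Phi\left((c-t)/\sqrt{1-1/n}\right)$:

```python
import math, random

# Simulation sketch of Example 3.23 (mine): delta0 is the fraction of
# observations at most c; conditioning on the sufficient statistic uses the
# standard fact that X_i given Xbar = t is N(t, 1 - 1/n), giving
# delta1(t) = Phi((c - t) / sqrt(1 - 1/n)).
random.seed(3)
Phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
n, c, theta = 10, 0.0, 0.3
target = Phi(c - theta)          # the quantity being estimated

def one_trial():
    x = [random.gauss(theta, 1) for _ in range(n)]
    xbar = sum(x) / n
    d0 = sum(1 for v in x if v <= c) / n
    d1 = Phi((c - xbar) / math.sqrt(1 - 1 / n))
    return (d0 - target) ** 2, (d1 - target) ** 2

trials = [one_trial() for _ in range(30000)]
risk0 = sum(t[0] for t in trials) / len(trials)
risk1 = sum(t[1] for t in trials) / len(trials)
print(risk1 < risk0)   # the Rao-Blackwellized rule has smaller risk
```

The strict improvement reflects that $\delta_0$ is not a function of the sufficient statistic and the loss is strictly convex.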
3.2.2 Admissibility
The Rao-Blackwell theorem 3.22 tells us that under some conditions, the
risk function of one decision rule is no larger than that of another. Similarly,
Theorem 3.20 tells us that under some conditions, the loss incurred from
one decision rule is no larger than that from another. These theorems have
a common theme. That is, sometimes we can tell that one decision rule is
better than another no matter what $\Theta$ equals.
Definition 3.24. A decision rule $\delta$ is inadmissible if there is another decision
rule $\delta_1$ such that $R(\theta,\delta_1) \le R(\theta,\delta)$ for all $\theta$ with strict inequality for
some $\theta$. If there is such a $\delta_1$, we say that $\delta_1$ dominates $\delta$. If there is no such
$\delta_1$, then we say $\delta$ is admissible.
Example 3.25. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $N(\mu,\sigma^2)$, where
$\Theta = (\mu,\sigma)$. Let $X = (X_1,\ldots,X_n)$, $\aleph = [0,\infty)$, and $L(\theta,a) = (a-\sigma^2)^2$. Define
Theorem 3.29. If every Bayes rule with respect to a prior $\lambda$ has the same
risk function, then they are all admissible.
PROOF. Let $\delta$ be a Bayes rule with respect to the prior $\lambda$, and let $g(\theta)$ be
the risk function of every such Bayes rule. Suppose that $\delta_0$ dominates $\delta$.
Then $R(\theta,\delta_0) \le g(\theta)$ for all $\theta$ with strict inequality for some $\theta$. But then
$\int_\Omega R(\theta,\delta_0)\,d\lambda(\theta) \le \int_\Omega g(\theta)\,d\lambda(\theta)$. Since $\delta$ is a Bayes rule, the inequality
must be an equality. This means that $\delta_0$ is also a Bayes rule, hence it has
risk function $g(\theta)$, which is a contradiction. $\Box$
Here is an example in which the condition of Theorem 3.29 does not
hold.
Example 3.30. Let $\Omega = (0,\infty)$ and $\aleph = [0,\infty)$. Let $L(\theta,a) = (\theta-a)^2$. Let
$X \sim U(0,\theta)$ given $\Theta = \theta$, and let $\lambda$ be the $U(0,c)$ distribution for $c > 0$. Then
$\Theta \mid X = x$ has density $[\theta\log(c/x)]^{-1}I_{(x,c)}(\theta)$. The formal Bayes rules are of the form
Theorem 3.32. Suppose that $\aleph$ is a convex subset of $\mathbb{R}^m$ and that all $P_\theta$
are absolutely continuous with respect to each other. If $L(\theta,\cdot)$ is strictly
convex for all $\theta$ and $\delta_0$ is $\lambda$-admissible for some $\lambda$, then $\delta_0$ is admissible.
PROOF. If $\delta_0$ were inadmissible, then there would be $\delta_1$ such that $R(\theta,\delta_1) \le
R(\theta,\delta_0)$ for all $\theta$ with strict inequality for some $\theta_0$. Define $\delta_2(x) = [\delta_0(x) +
\delta_1(x)]/2$. Then, for every $\theta$,
The first inequality above will be strict unless $P_\theta(\delta_1(X) = \delta_0(X)) = 1$.
Since all $P_\theta$ are absolutely continuous with respect to each other, it follows
that $P_\theta(\delta_1(X) = \delta_0(X)) = 1$ for one $\theta$ if and only if $P_\theta(\delta_1(X) = \delta_0(X)) = 1$
for all $\theta$. Hence the first inequality will be strict unless the distribution of
$\delta_1(X)$ is the same as the distribution of $\delta_0(X)$ given $\Theta = \theta$ for all $\theta$. In this
case, $\delta_1$ could not dominate $\delta_0$, hence the inequality must be strict for all
$\theta$. This would imply that $\delta_0$ is not $\lambda$-admissible, no matter what $\lambda$ is. $\Box$
Example 3.33. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $\mathrm{Ber}(\theta)$, and let
$X = (X_1,\ldots,X_n)$. Suppose that $\aleph = [0,1]$ and that the loss is $L(\theta,a) = (\theta-a)^2/[\theta(1-\theta)]$. Define $Y = \sum_{i=1}^{n} X_i$, and let the prior be Lebesgue measure
on $[0,1]$. The posterior given $X = x$ would be $\mathrm{Beta}(y+1, n-y+1)$, where
$y = \sum_{i=1}^{n} x_i$. Then
$$E(L(\Theta,a)\mid X = x) = \frac{\Gamma(n+2)}{\Gamma(y+1)\Gamma(n-y+1)}\int_0^1 (\theta-a)^2\,\theta^{y-1}(1-\theta)^{n-y-1}\,d\theta$$
is minimized at $a = y/n$ for all $x$ and all $n > 0$. So, $\delta_0(x) = y/n$ is a Bayes rule
with respect to $\lambda$ and it is admissible by Theorem 3.32.
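The minimization in Example 3.33 can be checked numerically (a sketch of mine, using a midpoint rule): the minimizer of the quadratic $\int_0^1 (\theta-a)^2\theta^{y-1}(1-\theta)^{n-y-1}\,d\theta$ is the corresponding weighted mean, which should equal $y/n$.

```python
# Numeric sketch for Example 3.33 (mine): the minimizer of the quadratic
# integral of (theta - a)^2 against the kernel theta^{y-1}(1-theta)^{n-y-1}
# is the weighted mean, i.e., the mean of Beta(y, n-y), which is y/n.
y, n, steps = 3, 10, 200000
ts = [(i + 0.5) / steps for i in range(steps)]
w = [t ** (y - 1) * (1 - t) ** (n - y - 1) for t in ts]  # Beta(y, n-y) kernel
a_star = sum(wi * ti for wi, ti in zip(w, ts)) / sum(w)  # minimizer of the quadratic
print(round(a_star, 4))   # 0.3 = y/n
```

The weighting by $1/[\theta(1-\theta)]$ in the loss is exactly what shifts the answer from the posterior mean $(y+1)/(n+2)$ down to the maximum likelihood estimate $y/n$.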
Theorem 3.32 even applies to $\lambda$'s which put 0 mass on large portions of
the parameter space. See Problem 17 on page 210 for examples.
The concept of $\lambda$-admissibility did not require that $\lambda$ be a probability
measure. It is common to try to find "Bayes" rules with respect to non-probability
measures.
Definition 3.34. Let $dP_\theta/d\nu(x) = f_{X|\Theta}(x|\theta)$. Suppose that $\lambda$ is a measure
on $(\Omega,\tau)$ and that for every $x$ there exists $\delta(x)$ such that
If
$$0 < c = \int_\Omega f_{X|\Theta}(x|\theta)\,d\lambda(\theta) < \infty, \qquad (3.36)$$
then, after observing $x$, one can pretend that the "prior" distribution of $\Theta$
has density $f_{X|\Theta}(x|\theta)/c$ with respect to $\lambda$ and that there are no data. If a
formal Bayes rule exists in this problem, it is a generalized Bayes rule. For
this reason, generalized Bayes rules with respect to $\lambda$ are also called formal
Bayes rules with respect to $\lambda$.
Example 3.37. Suppose that $P_\theta$ says that $X_1,\ldots,X_n$ are IID $U(0,\theta)$. Let $X =
(X_1,\ldots,X_n)$, $\Omega = (0,\infty)$, and $\aleph = [0,\infty)$. Let $\lambda$ be Lebesgue measure on $(0,\infty)$
and $L(\theta,a) = (\theta-a)^2$. Then
$$f_{X|\Theta}(x|\theta) = \theta^{-n}I_{(0,\theta)}(\max x_i) = \theta^{-n}I_{(\max x_i,\infty)}(\theta).$$
We get $c$ in (3.36) equal to $[(n-1)(\max x_i)^{n-1}]^{-1}$. So, we could invent the "prior"
density
$$\frac{(n-1)(\max x_i)^{n-1}}{\theta^n}I_{(\max x_i,\infty)}(\theta).$$
The Bayes rule with respect to this prior is the mean, which is $\delta(x) = (n-1)\max x_i/(n-2)$. This is a generalized Bayes rule.
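A quick numerical check of this generalized Bayes rule (my own sketch; the truncation point for the infinite range is an arbitrary choice of mine): the mean of the invented prior density should match $(n-1)\max x_i/(n-2)$.

```python
# Numeric sketch of Example 3.37 (mine; the truncation `upper` is arbitrary):
# the mean of the "prior" density proportional to theta^{-n} on (m, infinity),
# where m = max x_i, should equal (n-1) m / (n-2).
n, m = 5, 2.0                      # sample size and observed max x_i (hypothetical)
steps, upper = 400000, 500.0       # truncation of the infinite range
h = (upper - m) / steps
num = den = 0.0
for i in range(steps):
    t = m + (i + 0.5) * h
    dens = t ** (-n)               # unnormalized prior density theta^{-n}
    num += t * dens * h
    den += dens * h
estimate = num / den
print(round(estimate, 3), round((n - 1) * m / (n - 2), 3))
```

The truncation error is negligible here because the integrands decay like $\theta^{-4}$ and $\theta^{-5}$.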
$$= \int_{\mathcal{X}}\int_\Omega L(\theta,\delta(x))f_{X|\Theta}(x|\theta)\,d\lambda(\theta)\,d\nu(x),$$
where the last equality follows from Fubini's theorem A.70. If $\delta_1$ is any
other rule, then
$$\int_\Omega R(\theta,\delta_0)\,d\lambda(\theta) \le \int_\Omega R(\theta,\delta_1)\,d\lambda(\theta).$$
So, it cannot be the case that $R(\theta,\delta_1) \le R(\theta,\delta_0)$ for all $\theta$ with strict
inequality for $\theta \in A$ with $\lambda(A) > 0$. $\Box$
Example 3.39. Suppose that $X_1,\ldots,X_n$ are IID $\mathrm{Ber}(\theta)$ given $\Theta = \theta$, and let
$X = (X_1,\ldots,X_n)$. Let $\aleph = [0,1]$ and let the loss be $L(\theta,a) = (\theta-a)^2$. Define
$Y = \sum_{i=1}^{n} X_i$, and let $\lambda$ have Radon-Nikodym derivative $1/[\theta(1-\theta)]$ with respect
to Lebesgue measure on $(0,1)$. The posterior given $X = x$ would be $\mathrm{Beta}(y, n-y)$,
where $y = \sum_{i=1}^{n} x_i$, unless $y = 0$ or $y = n$. For $1 \le y \le n-1$, the generalized
Bayes rule is $\delta(x) = y/n$. For $y = 0$ or $y = n$, the only values of $\delta(x)$ which make
(3.35) finite are $\delta(x) = y/n$. So $\delta$ is a generalized Bayes rule with respect to $\lambda$.
Now $L(\theta,\delta(x))f_{X|\Theta}(x|\theta) = (\theta - y/n)^2\theta^y(1-\theta)^{n-y}$, which has integral $1/n$ with
respect to counting measure times $\lambda$. Hence, $\delta$ is $\lambda$-admissible and admissible.
6This theorem is used in Example 3.43 and in the proof of Theorem 3.44.
3.2. Classical Decision Theory 159
each other, we have $P_\theta(\delta(X) = \delta'(X)) < 1$ for all $\theta$ and $R(\theta,\delta'') < R(\theta,\delta)$
for all $\theta$. So, for each $n$,
$$\sqrt{n}\,\sigma_0^2 - \frac{n\sqrt{n}\,\sigma_0^2}{n+1} = \frac{\sqrt{n}\,\sigma_0^2}{n+1},$$
which goes to 0 as $n \to \infty$. If $C$ is an open subset of $\Omega_0$, then $\lambda_n(C) \ge \lambda_1(C)$
for all $n$. Since $L$ is strictly convex in $a$ for all $\theta$, the conditions of Theorem 3.40
apply in the parameter space $\Omega_0$, so $\delta$ is admissible with this parameter space.
By Lemma 3.42, it is admissible for the entire parameter space.
Use the fact that $\nabla h_n(\theta) = 2\sqrt{h_n(\theta)}\,\nabla\sqrt{h_n(\theta)}$ and the Cauchy-Schwarz
inequality B.19 to conclude that
Example 3.46. Suppose that $X \sim N_2(\mu,\sigma)$ given $\Theta = (\mu,\sigma)$ where $\sigma$ is a $2\times 2$
positive definite diagonal matrix and $\mu$ is two-dimensional.8 Let $\aleph = \mathbb{R}^2$, let
$L(\theta,a) = (a_1-\mu_1)^2 + (a_2-\mu_2)^2$, and let $\delta(x) = x$. For each value
$$\sigma = \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix},$$
$$h_n(\psi) = \begin{cases} 1 & \text{if } \|\psi\| \le 1, \\ \left(1 - \frac{\log\|\psi\|}{\log n}\right)^2 & \text{if } 1 \le \|\psi\| \le n, \\ 0 & \text{if } \|\psi\| \ge n.\end{cases}$$
8This example was compiled from material in Brown and Hwang (1982) and
Section 8.9 of Berger (1985).
$$\int_{\{\psi : \|\psi\| \ge 2\}} \left(\|\psi\|\log\|\psi\|\right)^{-2}\,d\psi < \infty.$$
In polar coordinates, the integral is finite because
$$\int_2^{\infty} \frac{1}{r\log^2(r)}\,dr = \int_{\ln(2)}^{\infty} \frac{dz}{z^2} < \infty.$$
9This proposition is used in Examples 3.48 and 3.59. It is also used to simplify
the class of loss functions in hypothesis testing in Chapter 4.
3.2. Classical Decision Theory 163
$$\phi(x-\mu) = \int_x^{\infty} (z-\mu)\phi(z-\mu)\,dz = -\int_{-\infty}^{x} (z-\mu)\phi(z-\mu)\,dz.$$
We will use these facts in what follows.
$$\int_{-\infty}^{\infty} (z-\mu)\phi(z-\mu)g(z)\,dz = \int_x^{\infty} (z-\mu)\phi(z-\mu)[g(z)-g(0)]\,dz - \int_{-\infty}^{x} (z-\mu)\phi(z-\mu)[g(0)-g(z)]\,dz$$
This can be bounded by three times the expected value of one over a $\chi^2_n$
random variable. The expected value of one over a $\chi^2_n$ random variable is
$1/(n-2)$ if $n > 2$, hence the expectation in question is finite. Lemma 3.52 says that the risk
function for $\delta_1$ is
We can write
$$\sum_{i=1}^{n} \frac{\partial}{\partial x_i} g_i(x) = -(n-2)\left(\frac{n}{\sum_{j=1}^{n} x_j^2} - \frac{2\sum_{i=1}^{n} x_i^2}{\left[\sum_{j=1}^{n} x_j^2\right]^2}\right) = -\frac{(n-2)^2}{\sum_{j=1}^{n} x_j^2},$$
so that
$$2\sum_{i=1}^{n} \frac{\partial}{\partial x_i} g_i(x) + \sum_{i=1}^{n} g_i(x)^2 = -\frac{2(n-2)^2}{\sum_{j=1}^{n} x_j^2} + \frac{(n-2)^2}{\sum_{j=1}^{n} x_j^2} = \frac{-(n-2)^2}{\sum_{j=1}^{n} x_j^2} < 0, \qquad (3.54)$$
for all $x$. It follows that the risk function is less than $n$ for all $\theta$. $\Box$
From (3.54), in order to calculate the risk for $\delta_1$, we need the mean
of $1/\sum_{j=1}^{n} X_j^2$. Note that $Z = \sum_{j=1}^{n} X_j^2$ has noncentral $\chi^2$ distribution,
$\mathrm{NC}\chi^2_n(\lambda)$ with $\lambda = \sum_{j=1}^{n} \mu_j^2$. From the form of the $\mathrm{NC}\chi^2$ density, it is
clear that $Z$ has the same distribution as $Y$, where $Y \sim \chi^2_{n+2k}$ given $K = k$
and $K \sim \mathrm{Poi}(\lambda/2)$. The mean of $1/Z$ is
$$E\left(\frac{1}{Z}\right) = E\left(\frac{1}{n+2K-2}\right). \qquad (3.55)$$
    δ₃(X) = (X − X̄1)(1 − (n − 3)/Σ_{i=1}^n (X_i − X̄)²) + X̄1,

FIGURE 3.56. Risk Function of James–Stein Estimator for n = 6 (horizontal axis: λ).
There is a way to derive the James–Stein estimator from an empirical Bayes argument.¹⁰ This was done by Efron and Morris (1975). Suppose that Θ ∼ N_n(θ₀, τ²I). The Bayes estimate for Θ is

    θ₀ + (X − θ₀) τ²/(τ² + 1).

The empirical Bayes approach tries to estimate τ from the marginal distribution of X. The marginal distribution of X is N_n(θ₀, (1 + τ²)I). So, we could estimate 1 + τ² by Σ_{i=1}^n (X_i − θ_{0i})²/c for some c. An estimate of τ²/(τ² + 1) is

    1 − c/Σ_{i=1}^n (X_i − θ_{0i})².

The empirical Bayes estimator is then δ₂(X) if c = n − 2. If we take the empirical Bayes approach one step further and also try to estimate θ₀, we could use X̄1 as an estimate, and the estimate of 1 + τ² would be Σ_{i=1}^n (X_i − X̄)²/c. With c = n − 3, we get δ₃(X).
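The three estimators can be compared in a short simulation. This is only an illustrative sketch (the θ, sample size, and seed are arbitrary choices, not from the text): δ₂ shrinks toward a known point θ₀ with c = n − 2, and δ₃ shrinks toward X̄1 with c = n − 3.

```python
import random

def delta0(x):
    """The usual estimator: delta_0(X) = X."""
    return list(x)

def delta2(x, theta0):
    """Empirical Bayes shrinkage toward a known theta0, with c = n - 2."""
    n = len(x)
    s = sum((xi - t) ** 2 for xi, t in zip(x, theta0))
    shrink = 1.0 - (n - 2) / s
    return [t + shrink * (xi - t) for xi, t in zip(x, theta0)]

def delta3(x):
    """Shrinkage toward the sample mean, with c = n - 3."""
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar) ** 2 for xi in x)
    shrink = 1.0 - (n - 3) / s
    return [xbar + shrink * (xi - xbar) for xi in x]

def risk(estimator, theta, draws=20_000, seed=1):
    """Monte Carlo estimate of E||estimator(X) - theta||^2, X ~ N_n(theta, I)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        x = [rng.gauss(t, 1.0) for t in theta]
        total += sum((e - t) ** 2 for e, t in zip(estimator(x), theta))
    return total / draws

theta = [0.5, -0.3, 0.1, 0.0, 0.4, -0.2]     # an arbitrary theta near 0, n = 6
r0 = risk(delta0, theta)
r2 = risk(lambda x: delta2(x, [0.0] * 6), theta)
r3 = risk(delta3, theta)
print(r0, r2, r3)     # r0 is near n = 6; r2 and r3 are much smaller here
```

For θ far from the shrinkage target the advantage fades, but the risk of δ₂ stays below n for n > 2, in line with (3.54).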
Another way to arrive at estimators like these is through hierarchical models (to be discussed in more detail in Chapter 8). For example, Θ₁, …, Θ_n could be modeled as conditionally IID N(μ, τ²) given M = μ and T = τ. Then M and T could have some distribution, rather than merely being estimated as in the empirical Bayes approach. Strawderman (1971) finds a class of Bayes rules that dominate δ₀ when p ≥ 5 and are admissible by Theorems 3.27 and 3.32.
for all θ. Choose n₀ so that r(Λ_n, δ_n) ≥ c − ε/2 for all n ≥ n₀. Then, for n ≥ n₀,

    r(Λ_n, δ′) = ∫ R(θ, δ′) dΛ_n(θ) ≤ (c − ε) ∫ dΛ_n(θ)
               = c − ε < c − ε/2 ≤ r(Λ_n, δ_n),

which contradicts that δ_n is the Bayes rule with respect to Λ_n. □
Example 3.61. Suppose that P_θ says that X₁, …, X_m are independent with X_i ∼ N(θ_i, 1), where θ = (θ₁, …, θ_m). Let δ₀(X) = X, ℵ = ℝ^m, and L(θ, a) = Σ_{i=1}^m (θ_i − a_i)². Let Λ_n be the probability measure that says Θ has distribution N_m(0, nI). The Bayes rules are δ_n(X) = nX/(n + 1), and the Bayes risks are r(Λ_n, δ_n) = mn/(n + 1), which go to m as n → ∞. Also, R(θ, δ₀) = m, so δ₀ is minimax. We see that minimax rules need not be admissible.
Example 3.62. Suppose that P_θ says that X ∼ Bin(n, θ) and that L(θ, a) = (θ − a)²/[θ(1 − θ)], where Ω = (0, 1) and ℵ = [0, 1]. A rule with constant risk is δ₀(X) = X/n. The risk is R(θ, δ₀) = 1/n. We saw earlier that δ₀ is admissible, so it is minimax.
Now, suppose that we use the loss L′(θ, a) = (θ − a)². We will see that no analog to Proposition 3.47 operates here. If the prior for Θ is Beta(α, β), then the Bayes rule is δ(x) = (α + x)/(α + β + n). The risk function for this rule is
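The risk function can be recovered from a bias–variance decomposition: R(θ, δ) = [nθ(1 − θ) + (α − (α + β)θ)²]/(α + β + n)². The sketch below (the closed form and the choice n = 25 are for illustration, not quoted from the text) checks numerically that α = β = √n/2 makes this risk constant in θ:

```python
import math

def bayes_rule_risk(theta, n, a, b):
    """Risk E[(delta(X) - theta)^2] of delta(x) = (a + x)/(a + b + n)
    under X ~ Bin(n, theta), from the bias-variance decomposition."""
    denom = a + b + n
    bias = (a + n * theta) / denom - theta
    var = n * theta * (1.0 - theta) / denom ** 2
    return var + bias ** 2

n = 25
a = b = math.sqrt(n) / 2.0        # the Beta(sqrt(n)/2, sqrt(n)/2) prior
risks = [bayes_rule_risk(t / 100.0, n, a, b) for t in range(1, 100)]
print(min(risks), max(risks))     # equal up to rounding: the risk is constant
```

A constant-risk Bayes rule is minimax by Theorem 3.65, which is how the Beta(√n/2, √n/2) prior enters Example 3.66 below.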
Since minimax rules are designed to prepare for the worst, we should see if there are prior distributions that make the worst θ just likely enough so that the corresponding Bayes rules are minimax.¹¹

Definition 3.63. A prior distribution Λ₀ for Θ is least favorable if

    inf_δ r(Λ₀, δ) = sup_Λ inf_δ r(Λ, δ).

Definition 3.64. The number V̲ = sup_Λ inf_δ r(Λ, δ) above is called the maximin value of the decision problem, and the number V̄ = inf_δ sup_θ R(θ, δ) is called the minimax value.

Theorem 3.65. If δ₀ is Bayes with respect to Λ₀ and R(θ, δ₀) ≤ r(Λ₀, δ₀) for all θ, then δ₀ is minimax and Λ₀ is least favorable.
¹¹There is a delicate balance between how likely the bad values are and how likely the rest of the parameter space is. In the second part of Example 3.62, the worst values, in some sense, are those near 1/2 because the data are most variable given Θ = 1/2. The prior Beta(√n/2, √n/2) puts enough mass near 1/2 to force us to take seriously the possibility that the data will be highly variable. However, it still spreads enough mass around the remainder of the parameter space so that we cannot ignore other θ values. If the prior put probability 1 on Θ = 1/2, for example, then the Bayes rule would be δ(x) = 1/2 for all x.
Example 3.66 (Continuation of Example 3.62; see page 168). We saw that the minimax rule with loss (θ − a)² was Bayes with respect to the Beta(√n/2, √n/2) prior, Λ₀. Since the risk function is constant, R(θ, δ) = r(Λ₀, δ) for all θ. It follows that Λ₀ is least favorable. The reason it is least favorable is that it puts a lot of mass on the θ values (near 1/2) that have high variance for X. On the other hand, it does not put so much mass there that the estimator is drawn too close to 1/2.
Definition 3.67. A rule δ₀ is extended Bayes if for every ε > 0 there exists a prior λ_ε such that r(λ_ε, δ₀) ≤ ε + inf_δ r(λ_ε, δ).

Example 3.68. Suppose that P_θ says that X ∼ N(θ, 1). Let L(θ, a) = (θ − a)², and let Λ_ε be the N(0, (1 − ε)/ε) prior for Θ. Let δ₀(x) = x. The Bayes rule with respect to Λ_ε is

    δ_ε(x) = x/(1 + ε/(1 − ε)) = (1 − ε)x.
Example 3.70. Suppose that P_θ says that X ∼ Poi(θ), where Ω = (0, ∞) and ℵ = [0, ∞). Let the loss function be L(θ, a) = (θ − a)²/θ, and let δ₀(x) = x. The risk function is R(θ, δ₀) = 1, constant. Let λ_ε be the prior that says Θ has Γ(1, ε/[1 − ε]) distribution. The posterior distribution is Γ(x + 1, 1/(1 − ε)). The Bayes rule is δ_ε(x) = x(1 − ε), with Bayes risk 1 − ε. Since the Bayes risk of δ₀ is 1, δ₀ is extended Bayes, hence minimax.
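Both claims in Example 3.70 are easy to check by simulation. This sketch (arbitrary ε, seed, and θ grid; not from the text) estimates the weighted risk of δ₀(x) = x at several θ, and the Bayes risk of δ_ε(x) = (1 − ε)x under the Γ(1, ε/(1 − ε)) prior, simulated here as an exponential with that rate:

```python
import math
import random

def poisson_draw(rng, mean):
    """Knuth's inversion-by-multiplication method; fine for modest means."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def weighted_risk(rng, theta, rule, draws=5000):
    """Monte Carlo estimate of E[(theta - rule(X))^2 / theta], X ~ Poi(theta)."""
    return sum((theta - rule(poisson_draw(rng, theta))) ** 2 / theta
               for _ in range(draws)) / draws

rng = random.Random(7)
eps = 0.2

# delta_0(x) = x has constant risk 1:
risks0 = [weighted_risk(rng, t, lambda x: x) for t in (0.5, 2.0, 6.0)]
print([round(r, 2) for r in risks0])

# Bayes risk of delta_eps(x) = (1 - eps) x, averaging over theta from the prior:
rate = eps / (1.0 - eps)
outer = 2000
bayes_risk = sum(weighted_risk(rng, rng.expovariate(rate),
                               lambda x: (1 - eps) * x, draws=200)
                 for _ in range(outer)) / outer
print(bayes_risk)                 # approximately 1 - eps = 0.8
```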
There are certain situations in which minimax rules are known to exist. These involve finite parameter spaces. When Ω is finite, the risk function is just a vector in some Euclidean space. The set of all risk functions of all decision rules is just a convex set of vectors.¹²
Definition 3.71. Suppose that Ω = {θ₁, …, θ_k}. Let

    R = {z ∈ ℝᵏ : z_i = R(θ_i, δ), i = 1, …, k, for some decision rule δ}.

We call R the risk set. The lower boundary of a set C ⊆ ℝᵏ is the set

    {z ∈ C : x_i ≤ z_i for all i and x_i < z_i for some i implies x ∉ C}.

The lower boundary of the risk set is denoted ∂_L. The risk set is closed from below if ∂_L ⊆ R.
Example 3.72. Consider a situation with Ω = {0, 1} and ℵ = {1, 2, 3}. Let the loss function be

    L(θ, a)         a
        θ      1     2     3
        0      0     1     0.5
        1      1     0     0.2

Supposing that no data are available, the class of randomized rules consists of the set of all probability distributions over the action space ℵ. The risk function for a randomized rule with probabilities (p₁, p₂, p₃) for the three actions (1, 2, 3) is just the point (p₂ + 0.5p₃, p₁ + 0.2p₃). The set of all such points is the shaded region in Figure 3.73.
We can locate the minimax rule in Figure 3.73 by looking at all orthants of the form O_s = {(x₀, x₁) : x₀ ≤ s, x₁ ≤ s} and finding the one with the smallest s that intersects the risk set. These orthants are shown in Figure 3.73. For all s < 0.3846, the orthant O_s fails to intersect the risk set. But O_{0.3846} does intersect the risk set at the point (0.3846, 0.3846). This point corresponds to the randomized rule with probabilities (0.2308, 0, 0.7692) on the three available actions. It is interesting to note that one is required to randomize in order to achieve the minimax rule. This is somewhat disconcerting for the following reason (among others). After performing the randomization, one will then either choose action a = 1 or action a = 3. In either case, one is no longer using the minimax rule, and the risk point for the chosen decision is either (0, 1) or (0.5, 0.2), but not (0.3846, 0.3846) as hoped.
Two lines are added to Figure 3.73 to show the Bayes rules with respect to two different priors. The line 0.5x₀ + 0.5x₁ = 0.35 passes through the point (0.5, 0.2) to indicate that the action a = 3 is Bayes with respect to the prior that puts equal probability on each parameter value. (The action a = 3 is also Bayes with respect to many other priors.) The prior with Pr(Θ = 0) = 0.6154 is least favorable, and the line 0.6154x₀ + 0.3846x₁ = 0.3846 passes through all of the points corresponding to Bayes rules with respect to the least favorable distribution. (See Problem 25 on page 211 to see how the minimax principle is actually in conflict with the expected loss principle in this example.)
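The geometric argument can be reproduced with exact arithmetic. In this sketch (an illustration of the example's computation, using Python's fractions), the minimax randomized rule mixes actions 1 and 3 so that the two risk coordinates are equal, giving the point (5/13, 5/13) ≈ (0.3846, 0.3846), and a grid search confirms that no randomized rule has a smaller maximum risk:

```python
from fractions import Fraction

# Loss table of Example 3.72: rows theta = 0, 1; columns actions 1, 2, 3.
loss = {0: [Fraction(0), Fraction(1), Fraction(1, 2)],
        1: [Fraction(1), Fraction(0), Fraction(1, 5)]}

def risk_point(p):
    """Risk function (R(0, delta), R(1, delta)) of the randomized rule p."""
    return tuple(sum(pi * loss[t][i] for i, pi in enumerate(p)) for t in (0, 1))

# Equalize the coordinates with p2 = 0: 0.5 p3 = p1 + 0.2 p3, p1 + p3 = 1,
# which gives p3 = 10/13 and p1 = 3/13.
p_minimax = (Fraction(3, 13), Fraction(0), Fraction(10, 13))
r = risk_point(p_minimax)
print(p_minimax, r, float(r[0]))   # risk (5/13, 5/13), i.e. about 0.3846

# Grid search over randomized rules: none beats the max-risk value 5/13.
best = min(max(risk_point((Fraction(i, 100), Fraction(j, 100),
                           Fraction(100 - i - j, 100))))
           for i in range(101) for j in range(101 - i))
print(best, best >= Fraction(5, 13))
```

The probabilities 3/13 ≈ 0.2308 and 10/13 ≈ 0.7692 match the decimals quoted in the example.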
Notice that the risk set in Figure 3.73 is convex. This is true in general.
Lemma 3.74. The risk set is convex.
PROOF. Let z_i be a point in the risk set that corresponds to a decision rule δ_i for i = 1, 2, and let 0 ≤ α ≤ 1. Then αz₁ + (1 − α)z₂ corresponds to the randomized decision rule αδ₁ + (1 − α)δ₂. □
There is a common misconception that a minimax rule can be located
by finding that point in the risk set with all coordinates equal which lies
closest to the origin. Here is an example in which the unique minimax rule
corresponds to a point with distinct coordinates.
Example 3.75. Consider a situation with Ω = {0, 1} and ℵ = {1, 2, 3}. Let the loss function be

    L(θ, a)         a
        θ      1     2      3
        0      0     0.25   1
        1      1     0.5    0.75

The class of randomized rules consists of the set of all probability distributions over the action space ℵ. The risk function for a randomized rule with probabilities (p₁, p₂, p₃) for the three actions (1, 2, 3) is just the point (0.25p₂ + p₃, p₁ + 0.5p₂ + 0.75p₃). The risk set is illustrated in Figure 3.76 together with the point corresponding to the unique minimax rule, choose action 2. The point (0.625, 0.625) is the closest point to the origin which has equal coordinates.
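Again this can be checked directly. The sketch below (illustrative, exact arithmetic) verifies that every randomized rule has maximum risk at least 0.5, attained only by the pure action 2, while the equal-coordinate point of the risk set nearest the origin, an equal mixture of actions 2 and 3, sits at (0.625, 0.625):

```python
from fractions import Fraction

# Loss table of Example 3.75: rows theta = 0, 1; columns actions 1, 2, 3.
loss = {0: [Fraction(0), Fraction(1, 4), Fraction(1)],
        1: [Fraction(1), Fraction(1, 2), Fraction(3, 4)]}

def risk_point(p):
    return tuple(sum(pi * loss[t][i] for i, pi in enumerate(p)) for t in (0, 1))

# Pure action 2: risk point (1/4, 1/2), so its maximum risk is 1/2.
r2 = risk_point((0, 1, 0))

# R(1, delta) = p1 + p2/2 + 3 p3/4 >= 1/2, with equality only at p2 = 1,
# so action 2 is the unique minimax rule; a grid search agrees.
grid_ok = all(max(risk_point((Fraction(i, 50), Fraction(j, 50),
                              Fraction(50 - i - j, 50)))) >= Fraction(1, 2)
              for i in range(51) for j in range(51 - i))

# Equal mixture of actions 2 and 3: the equal-coordinate point (5/8, 5/8).
r_eq = risk_point((0, Fraction(1, 2), Fraction(1, 2)))
print(r2, grid_ok, r_eq)
```

This illustrates the warning in the text: the equal-coordinate point closest to the origin is not the minimax point.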
The following theorem gives conditions under which minimax rules and
least favorable distributions exist.
c_i = inf{z_i : z ∈ G}. Since G is closed, the point c = (c₁, …, c_k) ∈ G. Suppose that the lower boundary is empty. Then, there is a point x ∈ G such that x_i ≤ c_i for all i with at least one inequality strict. This is clearly a contradiction to the way that c was constructed. □
Lemma 3.80. Suppose that the loss function is bounded below. If there is a minimax rule, then there is a point on ∂_L whose maximum coordinate is the same as the minimax risk.

PROOF. Let z be the risk function for a minimax rule, and let s be the minimax risk max{z₁, …, z_k}. Let G = R ∩ {x ∈ ℝᵏ : x_i ≤ s for all i}. Since the loss function is bounded below, Lemma 3.78 says the risk set is bounded below, so G is bounded below. Lemma 3.79 says that the lower boundary of G is nonempty. Clearly, the lower boundary of G is a subset of the lower boundary of R. Since each point in G is the risk function of a minimax rule, the result is proven. □
PROOF OF MINIMAX THEOREM 3.77.¹³ Let R̄ denote the closure of R. For each real s, let

    A_s = {z ∈ ℝᵏ : z_i ≤ s for all i}.

Then A_s is closed and convex for each s. Let s₀ = inf{s : A_s ∩ R̄ ≠ ∅}. We will prove first that there is a least favorable distribution. It is easy to see that

    s₀ = inf_δ sup_θ R(θ, δ).        (3.81)

Note that the interior of A_{s₀} is convex and does not intersect R̄. By the separating hyperplane theorem C.5, there is a vector v and number c such that vᵀz ≥ c for all z ∈ R̄ and vᵀx ≤ c for all x in the interior of A_{s₀}. It is clear that each coordinate of v is nonnegative since, if v_i < 0, a sequence of points {xⁿ}_{n=1}^∞ in the interior of A_{s₀} exists with lim_{n→∞} xⁿ_i = −∞ and all other xⁿ_j = s₀ − ε and lim_{n→∞} vᵀxⁿ = ∞ > c. So, let λ₀ be the probability that puts mass λ_{0,i} = v_i/Σ_{j=1}^k v_j on θ_i for i = 1, …, k. Since (s₀, …, s₀) is in the closure of the interior of A_{s₀}, it follows that c ≥ s₀ Σ_{j=1}^k v_j. We can now calculate
    φ_{k,γ}(x) =  1      if f₁(x) > kf₀(x),
                  γ(x)   if f₁(x) = kf₀(x),        (3.88)
                  0      if f₁(x) < kf₀(x).

For k = 0,

    φ₀(x) =  1   if f₁(x) > 0,
             0   if f₁(x) = 0.

For k = ∞,

    φ_∞(x) =  1   if f₀(x) = 0,
              0   if f₀(x) > 0.

Then C is a minimal complete class.
Before giving the proof of this theorem, we will give an outline of the proof because there are so many steps. We need to prove that if δ is a rule not in C, then there is a rule in C that dominates δ. For a rule δ not in C, we find a rule δ* ∈ C that has the same value of the risk function at θ = 0. Half of the proof is devoted to this step. We show that the risk functions of rules in C at θ = 0 are decreasing in k, but they may not be continuous. However, by defining γ(x) appropriately, we can find a rule δ* ∈ C such that R(0, δ*) = R(0, δ). We then show that R(1, δ*) < R(1, δ).

PROOF OF THEOREM 3.87. Let C′ be C together with all rules whose test functions are of the form φ_{0,γ} in (3.88). Let δ ∈ C′\C. Then the test function for δ is φ_{0,γ} for some γ such that P₀(γ(X) > 0) > 0. Let δ₀ be the rule whose test function is φ₀. Since f₁(x) = 0 for all x ∈ A = {x : φ_{0,γ}(x) ≠ φ₀(x)}, it follows that R(1, δ) = R(1, δ₀). But

    R(0, δ) = k₀[E₀(γ(X)I_A(X)) + P₀(f₁(X) > 0)]
            = k₀E₀(γ(X)I_A(X)) + R(0, δ₀) > R(0, δ₀).

Hence δ is inadmissible and is dominated by δ₀. We will now proceed to prove that C′ is a complete class. It will then follow from what we just proved that C is a complete class.

Next, let φ be a test function corresponding to a rule δ not in C′. Let

    α = R(0, δ) = ∫ k₀φ(x)f₀(x) dν(x).
Note that α ≤ k₀. We will now try to find a rule δ* ∈ C′ such that R(0, δ*) = α and R(1, δ*) < R(1, δ). To that end, we define the function

    g(k) = ∫_{f₁(x) ≥ kf₀(x)} k₀f₀(x) dν(x).

Note that if γ(x) = 1 for all x and δ* has test function φ_{k,γ}, then g(k) = R(0, δ*). Since {x : f₁(x) ≥ kf₀(x)} becomes smaller as k increases, and f₁(x) < ∞ a.e. [ν], it is easy to see that g(k) decreases to 0 as k → ∞. Also, it is easy to see that g(0) = k₀ ≥ α. We now prove that g is continuous from the left and we find the limit from the right. First, note that
    ⋂_{k<m} {x : f₁(x) ≥ kf₀(x)} = {x : f₁(x) ≥ mf₀(x)},        (3.89)
    ⋃_{k>m} {x : f₁(x) ≥ kf₀(x)} = {x : f₁(x) > mf₀(x)} ∪ {x : f₀(x) = 0}.

It follows that

    lim_{k↓m} g(k) = ∫_{x : f₁(x) > mf₀(x)} k₀f₀(x) dν(x),        (3.90)

and hence that g is continuous from the left. Note that if γ(x) = 0 for all x and δ* has test function φ_{m,γ}, then R(0, δ*) = lim_{k↓m} g(k). Since g is continuous from the left, either there is a largest k such that g(k) > α or there is a smallest k such that g(k) = α. The first case occurs if g has a jump discontinuity and it jumps from a value greater than α down to a value at most α. The second case occurs if g drops continuously to α. In any case, let the guaranteed value of k be denoted k*. If α = 0, it may be that k* = ∞. If α > 0, we must have k* < ∞ because g decreases to 0.
We will construct a decision rule called δ* whose test function has the form of φ_{k*,γ}. We consider the three possible cases:

1. α = 0 and k* < ∞,
2. α = 0 and k* = ∞,
3. α > 0 and k* < ∞.

We begin by proving that by appropriate choice of the function γ, we can make R(0, δ*) = R(0, δ) = α.

In the first case, we can use (3.89) and (3.90) with m = k* to show that γ(x) = 0 makes R(0, δ*) = 0 = α. In the second case,
In the third case, if g(k*) = α, set γ(x) = 1 to make R(0, δ*) = g(k*) = α. Otherwise, g(k*) > α. Set the right-hand side of (3.90) with m = k* equal to v ≤ α. Because g has a jump discontinuity at k*, it must be that k₀P₀(f₁(X) = k*f₀(X)) = g(k*) − v > 0. It follows that, with γ(x) = (α − v)/(g(k*) − v),

    R(0, δ*) = ∫ k₀φ_{k*,γ}(x)f₀(x) dν(x)
             = v + ∫_{f₁(x) = k*f₀(x)} k₀ (α − v)/(g(k*) − v) f₀(x) dν(x)
             = v + ((α − v)/(g(k*) − v)) k₀P₀(f₁(X) = k*f₀(X)) = α.
We know that φ_{k*,γ}(x) = 1 ≥ φ(x) for all x such that f₁(x) − k*f₀(x) > 0 and φ_{k*,γ}(x) = 0 ≤ φ(x) for all x such that f₁(x) − k*f₀(x) < 0. Since φ is not of the form of some φ_{k,γ}, there must be a set B such that ν(B) > 0 and [φ_{k*,γ}(x) − φ(x)][f₁(x) − k*f₀(x)] > 0 for all x ∈ B. Since ν = P₀ + P₁, we get that f₀(x) + f₁(x) = 1 a.e. [ν]. So,

    ∫ [φ_{k*,γ}(x) − φ(x)]f₁(x) dν(x)
        = ∫ [φ_{k*,γ}(x) − φ(x)][f₁(x) − k*f₀(x)] dν(x) + (k*/k₀)(α − α) > 0,

since the integrand in the last integral is nonnegative everywhere and strictly positive on B; hence R(1, δ*) < R(1, δ). When k* = ∞,

    ∫_{x : f₀(x) = 0} [1 − φ(x)]f₁(x) dν(x) = 0

implies that φ(x) = φ_∞(x), a.e. [ν]. Since δ was assumed not to be in C′, this cannot happen.
What we have shown is that for every δ ∉ C′, there is δ* ∈ C′ such that δ* dominates δ. Hence C′ is complete. As claimed earlier, it now follows that C is complete.

It is easy to check (see Problem 29 on page 212) that no element of C dominates any other element of C, so nothing can be removed without destroying the completeness of C. Hence, C is minimal complete. □

Notice that C consists of all Bayes rules with respect to those priors with positive mass on each θ (see Proposition 3.91), plus only one Bayes rule with respect to each of the priors that put mass 0 on one of the θ values.
Proposition 3.91. In the decision problem described in Theorem 3.87, each rule φ_{k,γ} is a Bayes rule with respect to a prior that assigns positive probability to each parameter value. The only admissible Bayes rule with respect to the prior that says Pr(Θ = 0) = 1 is φ_∞, and the only admissible Bayes rule with respect to the prior that says Pr(Θ = 1) = 1 is φ₀.
Example 3.92. Let θ₁ > θ₀, and let f₀ and f₁ be the N(θ₀, 1) and N(θ₁, 1) densities, respectively. Then, for any k, f₁(x)/f₀(x) > k if and only if

    x > (θ₁ + θ₀)/2 + log k/(θ₁ − θ₀).

for some 1 > p₁ > p₀ > 0. Then, for any k, f₁(x)/f₀(x) > k if and only if

    x > [n log((1 − p₀)/(1 − p₁)) + log k] / log(p₁(1 − p₀)/(p₀(1 − p₁))).
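Both monotone-threshold claims are mechanical to verify numerically. In this sketch (illustrative parameter values only), the likelihood ratio for a pair of normal densities and for a pair of binomial mass functions is compared against the corresponding cutoff over many values of x and k:

```python
import math
import random

def norm_pdf(x, mean):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def binom_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

rng = random.Random(3)

# Normal case (Example 3.92): f1/f0 > k iff x > (th1 + th0)/2 + log(k)/(th1 - th0).
th0, th1 = 0.3, 1.7
for _ in range(1000):
    x = rng.uniform(-5.0, 5.0)
    k = rng.uniform(0.05, 20.0)
    cut = (th1 + th0) / 2 + math.log(k) / (th1 - th0)
    assert (norm_pdf(x, th1) / norm_pdf(x, th0) > k) == (x > cut)

# Binomial case: f1/f0 > k iff
# x > [n log((1-p0)/(1-p1)) + log k] / log(p1 (1-p0) / (p0 (1-p1))).
n, p0, p1 = 20, 0.3, 0.6
for x in range(n + 1):
    for k in (0.1, 0.5, 1.0, 2.0, 10.0):
        cut = ((n * math.log((1 - p0) / (1 - p1)) + math.log(k))
               / math.log(p1 * (1 - p0) / (p0 * (1 - p1))))
        assert (binom_pmf(x, n, p1) / binom_pmf(x, n, p0) > k) == (x > cut)

print("both threshold formulas check out")
```

In both families the likelihood ratio is increasing in x, which is why the tests in C reduce to one-sided cutoffs.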
    f₀(x) = (12 choose x)(1/2)¹²,

These are Bin(12, 1/2) and U(0, 12) distributions. The measure μ is Lebesgue measure plus counting measure on the integers. To make both distributions absolutely continuous, we must change f₁(x) to equal 1/12 for 0 < x < 12 with x not an integer, and 0 otherwise. Then,

    f₁(x)/f₀(x) =  0          if x = 0, 1, …, 12,
                   ∞          if 0 < x < 12, x not an integer,
                   undefined  otherwise.

There is only one admissible rule, namely, do what is obvious. If the observed x is an integer, choose the binomial distribution; otherwise, choose the uniform distribution.
¹⁶This lemma is used in the proofs of Theorem 3.95 and Lemma 4.43. It is also used in the discussion of testing simple hypotheses against simple alternatives in Chapter 4.
Then A and R are disjoint convex sets, so the separating hyperplane theorem C.5 says that there exists a constant c and a vector v such that for all x ∈ A and all y ∈ R, vᵀy ≥ c ≥ vᵀx. If a coordinate v_i < 0, we can find a point x ∈ A such that x_j = z_j − ε for j ≠ i and x_i is sufficiently negative so that vᵀx > c, a contradiction. So we know that all coordinates of v are nonnegative. Set

    λ_i = v_i / Σ_{j=1}^k v_j.

Since z is a limit point of A, there exists a point x ∈ A such that vᵀx is arbitrarily close to vᵀz. Hence c = vᵀz, and λᵀy ≥ λᵀz for all y ∈ R. So, if z ∈ R, then z is the risk function of a Bayes rule with respect to λ. If z ∉ R, then it is still true that λᵀz = inf_{y∈R} λᵀy. □
PROOF OF THEOREM 3.95. From the definition of the lower boundary of the risk set, it is clear that every point on ∂_L corresponds to an admissible rule. It is also clear that to every point in R not on ∂_L there corresponds a point in R that dominates it. Hence the lower boundary contains the risk functions of all and only admissible rules.

Next, we show that the rules whose risk functions are on ∂_L form a minimal complete class. For each z ∈ R, define A_z = {x ∈ ℝᵏ : x_i ≤ z_i for all i}. Let z not be on ∂_L. Then there exists z′ ∈ R such that z′ dominates z and A_{z′} ⊂ A_z. If z′ ∈ ∂_L, we are done; if not, apply Lemma 3.79 with C = A_{z′} ∩ R̄ to conclude that there exists a point in ∂_L that is at least as good as z′ and hence dominates z. This makes the admissible rules a complete class. Since no admissible rule can be dominated, it is also a minimal complete class.

Lemma 3.96 shows that ∂_L consists of the risk functions of all admissible Bayes rules, so these rules form a minimal complete class. Since the set of Bayes rules contains the set of admissible Bayes rules, the set of Bayes rules is a complete class also. □
Notice that the complete class theorem 3.95 says that the rules δ_{k,γ}, δ₀, and δ_∞ in the Neyman–Pearson fundamental lemma 3.87 are the rules whose risk functions are on ∂_L in the risk set.
3.3. Axiomatic Derivation of Decision Theory 181
set of prizes.
Definition 3.97. A Von Neumann–Morgenstern lottery (NM-lottery) is a probability on the space (P, A₂) that is concentrated on finitely many prizes.¹⁷ Call the set of all NM-lotteries C.

For convenience, if L ∈ C gives probability 1 to a prize p, then we will often denote L by p. If the NM-lottery L awards prize p_i with probability α_i for i = 1, …, k, we will denote it by L = α₁p₁ + ⋯ + α_kp_k, where α_i ≥ 0 for all i, Σ_{i=1}^k α_i = 1. This NM-lottery is to be interpreted as meaning that the results of the experiment T are to be partitioned into k events E₁, …, E_k with probabilities α₁, …, α_k, respectively, and prize p_i is awarded if event E_i occurs. We assume that the details of the partitioning of the results of T are irrelevant to us. We only care about the probabilities of the various prizes.
The choices we need to make will be among NM-lotteries and some more general gambles. In order to make choices among gambles, we need to be able to say which ones we like better than others.¹⁸ We will use the symbol ≼ to indicate weak preference, that is, L ≼ L′ means that we like L′ at least as much as L. Assuming that a weak preference ≼ is defined on C, we can now define the more general class of gambles to which we will extend ≼. These gambles will be defined as functions from R to C.
me that prize with some probability to a gamble that gives me the other
prize with the same probability, all other things being equal. We make this
precise with another axiom.
Axiom 2 (Sure-thing principle). For each convex set ℋ₀ of horse lotteries, for every three horse lotteries H₁, H₂, H ∈ ℋ₀, and for every 0 < α ≤ 1, H₁ ≼ H₂ if and only if αH₁ + (1 − α)H ≼ αH₂ + (1 − α)H.
Two other axioms are often assumed for mathematical reasons. The first
assures that no horse lottery is worth infinitely more than another. A def-
inition is required first.
Definition 3.100. Let L be an NM-lottery and let {L_k}_{k=1}^∞ be a sequence of NM-lotteries. We say that L_k converges pointwise to L (denoted L_k → L) if and only if, for every A ∈ A₂, lim_{k→∞} L_k(A) = L(A).

Since each NM-lottery is a probability distribution over (P, A₂), it is a function from A₂ to [0, 1]. The above definition of pointwise convergence agrees with the usual concept of pointwise convergence of functions.
Axiom 3 (Continuity). Let H be a horse lottery, and let {H_k}_{k=1}^∞ be a sequence of horse lotteries such that H_k(r) converges pointwise to H(r) for all r. Let H′ be another horse lottery. If H_k ≼ H′ for all k, then H ≼ H′. If H′ ≼ H_k for all k, then H′ ≼ H.
The next axiom assures that the relative values of prizes do not vary from
state to state. It is not obvious that such an axiom should be adopted. In
fact, this axiom appears to be nothing more than a mathematical tool for
ensuring that probability and utility can be separated. In Section 3.3.6, we
will consider what happens if we do not assume Axiom 4. A definition is
required before we can state Axiom 4.
Definition 3.101. If E ∈ A₁ and H₁, H₂ ∈ ℋ, the preference between H₁ and H₂ is said to be called-off when event E occurs if H₁(r) = H₂(r) for all r ∈ E. A set B ∈ A₁ is called null if whenever the preference between H₁ and H₂ is called-off when B^c occurs, we have H₁ ∼ H₂. A subset is called nonnull if it is not null. Similarly, a state s is nonnull if the singleton set {s} is nonnull.

Axiom 4 (State independence). Suppose that there is a nonnull set B such that the preference between H₁ and H₂ is called-off if B^c occurs. Suppose also that H₁(r) = L₁ and H₂(r) = L₂ for r ∈ B. Then H₁ ≼ H₂ if and only if L₁ ≼ L₂.
An interesting discussion of state independence is given by Schervish,
Seidenfeld, and Kadane (1990).
Next we introduce an axiom that is only needed when there are infinitely many states of Nature. When there are infinitely many possible prizes and states, Seidenfeld and Schervish (1983) show that it is possible for H₁ to
3.3.2 Examples
Here are a few examples to illustrate how the axioms can be satisfied or
violated.
Example 3.104. Suppose that there are only two states of Nature r₁ and r₂. Suppose also that there are only two prizes, p₁ and p₂. Then horse lotteries are characterized by the pair of numbers (q₁, q₂), where q_i is the probability of p₂ in state r_i for i = 1, 2. We give two examples of weak preference, one which satisfies the axioms and one which does not.

First, suppose that we claim that (q₁, q₂) ≼ (q₁′, q₂′) if and only if q₁ + q₂ ≤ q₁′ + q₂′. It is straightforward to check that this satisfies all of the axioms. The representation of preference according to Theorem 3.108 below will be Pr(r₁) = Pr(r₂) = 1/2 and U(p₂) > U(p₁). Consider two horse lotteries H₁ = (q₁, q₂) and H₂ = (q₁, q₃). The preference between H₁ and H₂ is called-off if {r₁} occurs. It is easy to see in this example, and it can be shown in general, that if r₂ is nonnull (see Lemma 3.148), then H₁ ≼ H₂ if and only if H₃ ≼ H₄ for all H₃, H₄ of the form H₃ = (q₄, q₂), H₄ = (q₄, q₃). That is, a pair of horse lotteries that differ only on a nonnull state are ranked the same as every other pair that differ the same way on the same state.

Second, suppose that we claim that (q₁, q₂) ≼ (q₁′, q₂′) if and only if q₁ < q₁′ or (q₁ = q₁′ and q₂ ≤ q₂′). This fails Axiom 3, and there is no expected utility representation for the preferences.
Example 3.105. Let P = {p₁, …, p_m} and R = {r₁, …, r_n}. Let q₁, …, q_n be nonnegative numbers that add to 1. Let u₁, …, u_m be real numbers. For each NM-lottery L = α₁p₁ + ⋯ + α_mp_m, define U(L) = Σ_{i=1}^m α_iu_i. For each horse lottery H = (L₁, …, L_n), define U(H) = Σ_{j=1}^n q_jU(L_j). Say that H₁ ≼ H₂ if and only if U(H₁) ≤ U(H₂). It is easy to see that this is a weak order. Since U(αL₁ + (1 − α)L₂) = αU(L₁) + (1 − α)U(L₂), it is easy to verify that Axiom 2 holds. Continuity follows since L_k → L implies U(L_k) → U(L). State independence follows easily from the definition of U(H). Theorem 3.108 says that all examples in the finite case will be like this.

Let X : R → 𝒳 be a random quantity. Clearly, X can take on only finitely many values. For each value x of X, let X⁻¹({x}) = {r_{k₁(x)}, …, r_{k_{m(x)}(x)}}. If v_x = Σ_{j=1}^{m(x)} q_{k_j(x)} = 0, then X⁻¹({x}) is null. Otherwise, define U_x(H) = Σ_{j=1}^{m(x)} q_{k_j(x)}U(L_{k_j(x)}), and say that H₁ ≼_x H₂ if and only if U_x(H₁) ≤ U_x(H₂). It is easy to verify that this is a conditional preference and that it satisfies Axiom 6. Theorem 3.110 says that all conditional preferences must be of this form in the finite case.
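The finite construction of Example 3.105 is easy to code. The sketch below (arbitrary q and u, purely illustrative) builds the expected-utility preference and spot-checks the sure-thing principle, Axiom 2, on random horse lotteries:

```python
import random

q = [0.2, 0.5, 0.3]        # state probabilities q_1, ..., q_n (n = 3)
u = [0.0, 1.0, 4.0]        # prize utilities u_1, ..., u_m (m = 3)

def U_lottery(L):
    """U(L) for an NM-lottery given as prize probabilities (alpha_1, ..., alpha_m)."""
    return sum(a * ui for a, ui in zip(L, u))

def U_horse(H):
    """U(H) = sum_j q_j U(L_j) for a horse lottery H = (L_1, ..., L_n)."""
    return sum(qj * U_lottery(Lj) for qj, Lj in zip(q, H))

def weakly_prefers(H1, H2):
    """H1 <= H2 in the expected-utility ordering."""
    return U_horse(H1) <= U_horse(H2)

def mix(H1, H2, a):
    """The horse lottery a H1 + (1 - a) H2, mixed state by state."""
    return [[a * x + (1 - a) * y for x, y in zip(L1, L2)]
            for L1, L2 in zip(H1, H2)]

rng = random.Random(0)

def random_horse():
    H = []
    for _ in range(len(q)):
        w = [rng.random() for _ in range(len(u))]
        s = sum(w)
        H.append([wi / s for wi in w])
    return H

# Axiom 2 (sure-thing principle): mixing both sides with a common H
# never reverses the ranking.
for _ in range(500):
    H1, H2, H = random_horse(), random_horse(), random_horse()
    a = rng.uniform(0.1, 1.0)
    assert weakly_prefers(H1, H2) == weakly_prefers(mix(H1, H, a), mix(H2, H, a))
print("Axiom 2 holds for the expected-utility preference")
```

Because U is linear in the mixture weight, the same check works for any q and u, which is the content of Theorem 3.108 in the finite case.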
to see that W(L) = 0 for all L ∈ C. Define U(H) = V(H) + W(H) and say that H₁ ≼ H₂ if and only if U(H₁) ≤ U(H₂). Axiom 1 is clearly satisfied. To see that Axiom 2 is satisfied, note that W_{αH₁+[1−α]H₂}(p) = αW_{H₁}(p) + (1 − α)W_{H₂}(p) for all p, so W(αH₁ + [1 − α]H₂) = αW(H₁) + (1 − α)W(H₂). Now, use the fact that V(αH₁ + [1 − α]H₂) = αV(H₁) + (1 − α)V(H₂) as shown in Example 3.106 to see that U(αH₁ + [1 − α]H₂) = αU(H₁) + (1 − α)U(H₂). If H_n(r) → H(r) for all r, then lim_{n→∞} W_{H_n}(p, r) = W_H(p, r) for all p, r. Let {p_i}_{i=1}^∞ be the prizes such that either W_{H_n}(p_i) > 0 for some n or W_H(p_i) > 0. Define f(r) = Σ_{i=1}^∞ w_H(p_i, r) and f_n(r) = Σ_{i=1}^∞ w_{H_n}(p_i, r). Then W(H) = ∫ f(r) dQ(r) and W(H_n) = ∫ f_n(r) dQ(r). Since 0 ≤ f_n ≤ 1 and lim_{n→∞} f_n(r) = f(r) for all r, it follows from the dominated convergence theorem A.57 that lim_{n→∞} W(H_n) = W(H). As shown in Example 3.106, lim_{n→∞} V(H_n) = V(H), so lim_{n→∞} U(H_n) = U(H) and Axiom 3 holds. To see that Axiom 4 is satisfied, let the preference between H₁ and H₂ be called-off when B^c occurs, B be nonnull, and H₁(r) = L₁ and H₂(r) = L₂ for all r ∈ B. Then W(H₁) = W(H₂) = Σ_{all p} ∫_{B^c} W_{H₁}(p, r) dQ(r), so U(H₁) − U(H₂) = Q(B)[V(L₁) − V(L₂)]. We see that H₁ ≼ H₂ if and only if L₁ ≼ L₂. To see that Axiom 5 is violated, let H₁(r) = 1_r, that is, the NM-lottery that gives prize p = r with probability 1. Let H₂ ≡ 1, that is, the constant horse lottery that gives the prize 1 with probability 1 for all r. It is easy to calculate V(H₁) = 1/2, W(H₁) = 1, V(H₂) = 1, and W(H₂) = 0. So U(H₁) = 3/2 > 1 = U(H₂). But U(H₁(r)) = V(H₁(r)) = r for all r and U(H₂(r)) = V(1) = 1 for all r. So H₁(r) ≼ H₂(r) for all r, but H₂ ≺ H₁. Note that the ranking of horse lotteries by U is not an expected utility representation because of the added function W.
Theorem 3.108. Assume Axioms 1–5, and assume that preference is nondegenerate. Then, there exists a bounded function U : ℋ → ℝ such that U(H(r)) is a measurable function of r for all H ∈ ℋ and that satisfies

    U(αH₁ + (1 − α)H₂) = αU(H₁) + (1 − α)U(H₂)        (3.109)

for all α ∈ [0, 1] and all H₁, H₂. Also, there exists a probability Q on (R, A₁) such that for every H₁, H₂ ∈ ℋ, H₁ ≼ H₂ if and only if ∫ U(H₁(r)) dQ(r) ≤ ∫ U(H₂(r)) dQ(r). The probability Q is unique, and U is unique up to positive affine transformation.
The function U in Theorem 3.108 is called a utility function.
We also prove a theorem linking preference and conditional preference.
Theorem 3.110. Assume the conditions of Theorem 3.108. Let (𝒳, ℬ) be a Borel space, and let X : R → 𝒳 be a random quantity. Let Q be the probability from Theorem 3.108, and let {Q(·|x) : x ∈ 𝒳} be a regular conditional distribution given X. Let {≼_x : x ∈ 𝒳} be a conditional preference relation given X which satisfies Axiom 6. Then there exists a set B such that X⁻¹(B) is null and for all x ∉ B, H₁ ≼_x H₂ if and only if ∫ U(H₁(r)) dQ(r|x) ≤ ∫ U(H₂(r)) dQ(r|x).
In Theorem 3.110, if Y : R → 𝒴 is a function of X, and {≼_y : y ∈ 𝒴} is a conditional preference relation given Y, then {≼_y : y ∈ 𝒴} is consistent with {≼_x : x ∈ 𝒳} because Theorem B.75 and Corollary B.74 say that conditioning on Y and then X is the same as conditioning on X alone.
Axiomatic developments like the one given in this section have two main
consequences. The obvious one is that, as stated in the main theorems,
if one satisfies the axioms, then the preferences have an expected utility
representation. The contrapositive is also true. If preferences do not have
an expected utility representation, then at least one axiom must be violated.
Example 3.112 (Continuation of Example 3.72; see page 170). The minimax
principle is often in conflict with Axiom 2. In this example, the minimax rule
corresponds to the convex combination
Corollary 3.115. Assume Axioms 1 and 2. Let H₁, H₂ ∈ ℋ₀, and assume H₁ ∼ H₂. Then for every 0 < α ≤ 1, every H₃ ∈ ℋ₀, and every H₄, αH₁ + (1 − α)H₃ ≼ H₄ if and only if αH₂ + (1 − α)H₃ ≼ H₄, and H₄ ≼ αH₁ + (1 − α)H₃ if and only if H₄ ≼ αH₂ + (1 − α)H₃.
The next lemma says that two different mixtures of the same two horse
lotteries are ranked according to how much probability they give to the
better of the two horse lotteries.
Lemma 3.116. Assume Axioms 1 and 2. Let H₁ ≺ H₂ ∈ ℋ₀. Suppose that

    H₃ ∼ αH₂ + (1 − α)H₁,    H₄ ∼ βH₂ + (1 − β)H₁.

Then α ≤ β if and only if H₃ ≼ H₄.

PROOF. Suppose that H₃ ≼ H₄ but β < α. Then

    αH₂ + (1 − α)H₁ ≼ βH₂ + (1 − β)H₁

by Corollary 3.115. Since α > β, use Lemma 3.113 to conclude that H₂ ≼ (β/α)H₂ + ([α − β]/α)H₁. Use Lemma 3.113 once again to conclude that H₂ ≼ H₁, which is a contradiction.

Next, suppose that α ≤ β but H₄ ≺ H₃. Then βH₂ + (1 − β)H₁ ≺ αH₂ + (1 − α)H₁, by Corollary 3.115. A contradiction follows just as before. □
A useful consequence of the first three axioms is what is often called an Archimedean condition.²¹

Lemma 3.117 (Archimedean condition). Assume Axioms 1–3, and assume that H₁, H₃ ∈ ℋ₀. If H₁ ≺ H₂ ≺ H₃, then there exists a unique 0 < α < 1 so that

    αH₃ + (1 − α)H₁ ∼ H₂.

²¹In the proofs of results in the finite case, we do not explicitly use Axiom 3, but rather we only use the Archimedean condition. This fact would allow us to prove a converse to Lemma 3.117 in the finite case.
For the rest of the proof, assume that there exist H_*, H^* ∈ ℋ0 such that
H_* ≺ H^*. We will use the Archimedean condition in Lemma 3.117 to help
define U. For each H ∈ ℋ such that H_* ≼ H ≼ H^*, define U(H) equal
to the value of α such that αH^* + (1 − α)H_* ∼ H. Note that U(H_*) = 0
and U(H^*) = 1.²² For each H such that H ≼ H_*, define U(H) equal to
−α/(1 − α) for that α such that αH^* + (1 − α)H ∼ H_*. For each H such
that H^* ≼ H, define U(H) equal to 1/α for that value of α such that
αH + (1 − α)H_* ∼ H^*. Next, we prove that (3.120) is true.
H^* ∼ (1/U(H1))H1 + ([U(H1) − 1]/U(H1))H_*,   (3.121)
H^* ∼ (1/U(H2))H2 + ([U(H2) − 1]/U(H2))H_*.   (3.122)
Lemma 3.116 and (3.122) say that U(H1) ≤ U(H2) if and only if
H^* ≼ (1/U(H1))H2 + ([U(H1) − 1]/U(H1))H_*.
²²In the finite case, one can prove that there exist two NM-lotteries H_* and
H^* such that H_* ≼ H ≼ H^* for all H ∈ ℋ. (See Problem 33 on page 212.) For
this reason, one can skip to the next paragraph in the finite case.
²³The proof ends here in the finite case because we can choose H_* and H^* so
that H_* ≼ H ≼ H^* for all H ∈ ℋ, by Problem 33 on page 212.
194 Chapter 3. Decision Theory
This and (3.121) are true if and only if H1 ≼ H2 by Lemma 3.113. Case
(ii) is similar. □
αH1 + (1 − α)H2
∼ (αU(H1) + [1 − α]U(H2))H^* + (1 − αU(H1) − [1 − α]U(H2))H_*,
αH1 + (1 − α)H2
∼ αH1 + (1 − α)U(H2)H^* + (1 − α)[1 − U(H2)]H_*
= β[(−U(H1)/[1 − U(H1)])H^* + (1/[1 − U(H1)])H1]
+ (1 − β)[((αU(H1) + (1 − α)U(H2))/[1 − α + αU(H1)])H^* + ((1 − α)[1 − U(H2)]/[1 − α + αU(H1)])H_*],
where β = α[1 − U(H1)], which is less than 1 because c > 0. Use (3.125)
and Lemma 3.113 to see that the last expression is ∼ cH^* + (1 − c)H_*. This
implies that (3.109) is true.
Case 3. H1 ≼ H_* ≼ H2 ≼ H^* and c < 0. In this case, (3.124) and (3.125)
are still true. This time let β = (α − αU(H1))/(1 − αU(H1)); mix the left-
hand side of (3.124) with weight 1 − β with the right-hand side of (3.125)
with weight β, and mix the other sides also. The result is
Now use Lemma 3.113 to remove the common H^* from both sides (there
is more on the left than on the right) to get
(1/[1 − c])[αH1 + (1 − α)H2] + (−c/[1 − c])H^* ∼ H_*.
It follows that (3.109) is true.
Case 4. H1 ≼ H_* ≼ H^* ≼ H2 and c ∈ [0, 1). In this case,
H_* ∼ (−U(H1)/[1 − U(H1)])H^* + (1/[1 − U(H1)])H1,   (3.126)
H^* ∼ (1/U(H2))H2 + ([U(H2) − 1]/U(H2))H_*.   (3.127)
∼ (−αU(H1)/[α(1 − U(H1)) + (1 − α)U(H2)])H^* + (α/[α(1 − U(H1)) + (1 − α)U(H2)])H1
+ ((1 − α)/[α(1 − U(H1)) + (1 − α)U(H2)])H2 + ((1 − α)(U(H2) − 1)/[α(1 − U(H1)) + (1 − α)U(H2)])H_*.
One can now use Lemma 3.113 to remove a common component consisting
of the two terms involving H^* and H_* on the right-hand side with weight
(−αU(H1) + (1 − α)[U(H2) − 1])/(α(1 − U(H1)) + (1 − α)U(H2)).
The result says that (3.109) is true.
The result says that (3.109) is true.
Case 5. H1 ≼ H_* ≼ H^* ≼ H2 and c > 1. In this case, (3.126) and (3.127)
are still true. This time, let β = α(1 − U(H1))/[α(1 − U(H1)) + (1 − α)U(H2)],
and mix (3.126) with weight β with (3.127) with weight 1 − β to get
γH^* + (1 − γ)L ∼ γ((1/c)[αH1 + (1 − α)H2] + ([c − 1]/c)H_*) + (1 − γ)L,
where
γ = c/[α(1 − U(H1)) + (1 − α)U(H2)],
L = (−αU(H1)/[α(1 − 2U(H1))])H_* + (−α[U(H1) − 1]/[α(1 − 2U(H1))])H^*.
Use Lemma 3.113 to remove the common component of L from both sides,
and the result is (3.109).
Case 6. H1, H2 ≺ H_*. In this case, c < 0 is clear, and we have (3.126)
together with
H_* ∼ (−U(H2)/[1 − U(H2)])H^* + (1/[1 − U(H2)])H2.   (3.128)
Let β = α(1 − U(H1))/[α(1 − U(H1)) + (1 − α)(1 − U(H2))], and mix (3.126)
with weight β with (3.128) with weight 1 − β to get
H_* ∼ (−c/[1 − c])H^* + (1/[1 − c])[αH1 + (1 − α)H2].
This implies that (3.109) is true.
Case 7. H^* ≺ H1, H2. This is analogous to Case 6.
Case 8. H1 ≺ H_* ≺ H^* ≺ H2 and c < 0. This is analogous to Case 5.
Case 9. H_* ≼ H1 ≼ H^* ≺ H2 and c > 1. This is analogous to Case 3.
Case 10. H_* ≼ H1 ≼ H^* ≺ H2 and c ≤ 1. This is analogous to Case 2. □
Now, we can prove that U is bounded on ℋ0.²⁵
Lemma 3.129. Assume Axioms 1–3. Then U is bounded on ℋ0.
PROOF. If H1 ∼ H2 for all H1, H2 ∈ ℋ0, the result is trivial, so assume
that there exist H_* ≺ H^* ∈ ℋ0. Without loss of generality we can assume
that U(H_*) = 0 and U(H^*) = 1 (otherwise, just replace U by
[U − U(H_*)]/[U(H^*) − U(H_*)]; the preferences and boundedness are not
changed). Suppose, to the contrary, that U is unbounded above. (A similar
construction works if U is unbounded below.) Let {Hn}n=1^∞ be such that
U(Hn) > n. Let H′n = (1 − 1/n)H_* + (1/n)Hn for each n. Then H^* ≺ H′n
for all n because U(H^*) = 1 and U(H′n) > 1 for all n, but H′n → H_* and
H_* ≺ H^*. This contradicts Axiom 3. □
If ℋ0 ⊆ ℋ contains H_* and H^* with H_* ≺ H^*, define²⁶
β1(ℋ0) = sup_{H∈ℋ0} U(H),  β0(ℋ0) = inf_{H∈ℋ0} U(H).
Lemma 3.130. Assume Axioms 1–3. For each β ∈ (β0(ℋ0), β1(ℋ0)), there
exists an NM-lottery L with U(L) = β.
PROOF. If H1 ∼ H2 for all H1, H2 ∈ ℋ0, then β0(ℋ0) = β1(ℋ0), and the
result is vacuous, so assume that there exist H_* ≺ H^* ∈ ℋ0 with U(H_*) =
0 and U(H^*) = 1. Assume, to the contrary, that α0 = inf_{L∈𝓛} U(L) >
β0(ℋ0).²⁸ We know that α0 ≤ 0, since U(H_*) = 0. Let H be a horse lottery
such that U(H) < α0, which must exist since β0(ℋ0) is the infimum of all
utilities of horse lotteries. Let ε = α0 − U(H), so that U(H) = α0 − ε < 0.
Let α = α0/[2(α0 − ε)], which is easily seen to be between 0 and 1/2.
Let L be an NM-lottery such that U(L) = α0(1/2 + α)/2, which is in
the open interval (α0/2, αα0). Define H′ = αH + (1 − α)H_*. This means
that U(H′) = α0/2 < U(L); hence H′ ≺ L. But H′(r) = αH(r) + (1 −
α)H_*. We have assumed that U(H(r)) ≥ α0 for all r, since H(r) ∈ 𝓛. So
U(H′(r)) ≥ αα0 > U(L). This implies that L ≼ H′(r) for all r. Axiom 5
implies L ≼ H′, a contradiction. A similar contradiction holds if we assume
that sup_{L∈𝓛} U(L) < β1(ℋ0). □
We are now in a position to prove that ℋ is itself convex.
Lemma 3.131.²⁹ Assume Axioms 1–3. Let ℋ0 be the set of all constant
horse lotteries.
For each horse lottery H ∈ ℋ, the function g : R → ℝ defined by
g(r) = U(H(r)) is measurable.
If H1, H2 ∈ ℋ and 0 ≤ α ≤ 1, then αH1 + (1 − α)H2 ∈ ℋ.
PROOF. For the first part, let H ∈ ℋ and let g(r) = U(H(r)). We know
that β0(ℋ0) ≤ g(r) ≤ β1(ℋ0) for all r. To prove that g is measurable, we
need to show that for every c ∈ (β0(ℋ0), β1(ℋ0)), {r : g(r) ≤ c} ∈ 𝒜1. For
each such c, let Lc be an NM-lottery with U(Lc) = c, as guaranteed by
Lemma 3.130. Then
But the first part of this lemma shows that both U(H1(·)) and U(H2(·))
are measurable functions. Hence the convex combination is measurable. It
follows that the set on the right-hand side of (3.132) is in 𝒜1. □
From now on, so long as we assume Axioms 1–3, we can assume that ℋ
is closed under convex combination.
Lemma 3.133. Assume Axiom 5 and that preference is nondegenerate.
Then there exist two NM-lotteries L_* and L^* such that L_* ≺ L^*.
PROOF. Since the preference is nondegenerate, there exist horse lotteries
H_* ≺ H^*. If, to the contrary, H_*(r) ∼ H^*(r) for all r, then Axiom 5 says
H^* ≼ H_*, a contradiction. □
Lemma 3.134. Assume Axioms 1–3. Let H_* and H^* be horse lotteries
such that U(H_*) = 0 and U(H^*) = 1. For each B ∈ 𝒜1, define HB by
HB(r) = H^* if r ∈ B and HB(r) = H_* if r ∉ B.
A = ∪_{i=1}^∞ Ai,  Bn = ∪_{i=1}^n Ai.
³⁰The proof ends here in the finite case, since there do not exist infinitely many
disjoint subsets of R. Note that in the finite case Axiom 3 is not used; only the
Archimedean condition of Lemma 3.117 is used.
It follows from Axiom 3 (with H′ = H″ = ½H^* + ½HA) that H ∼ ½H^* +
½HA. But H = (1 − a)[½H^* + ½HA] + aH^*. It follows from Axiom 2 that
either a = 0 or H^* ∼ ½H^* + ½HA. Since this latter is clearly false, it must
be that a = 0 and an → 0. □
PROOF. The fact that Q(B) = 0 if and only if B is null follows easily from
Axiom 4 and is left to the reader (as Problem 37 on page 213). By Axiom 5,
L_* ≼ HB ≼ L^*.³¹ □
U(H) = ∫ U(H(r)) dQ(r).
PROOF. Let L′1, ..., L′n be the different NM-lotteries that H takes on. Let
³¹In the finite case, the fact that L_* ≼ HB ≼ L^* was already known without
appeal to Axiom 5.
Also, 0 ≤ U(H″(r)) ≤ 1 follows from (3.109) and simple algebra. Since
c1 ≠ 0 and
it is sufficient to prove the result for H″. Since H″(r) is the same mixture
of H(r) with L_* and L^* for all r, H″ takes on only finitely many different
NM-lotteries also. Let H″(r) = L′i for r ∈ Bi for i = 1, ..., n, where the
Bi ∈ 𝒜1 form a finite partition of R. For each i, define Hi by
(1/n)H″ + ([n − 1]/n)L_* = (1/n)H1 + ⋯ + (1/n)Hn.
∫ U(H″(r)) dQ(r) = Σ_{i=1}^n U(L′i)Q(Bi),
PROOF. First, suppose that U(H(r)) ≥ 0 for all r. Let H″ = ½H + ½L^*.
Since U(H″) = ½U(H) + ½ and ∫ U(H″(r))dQ(r) = ½∫ U(H(r))dQ(r) + ½,
it suffices to prove the result for H″. Let b1 = sup_r U(H″(r)). It follows
from Lemma 3.130 that for all x ≤ b1 there exists an NM-lottery L with
U(L) = x.
For each n and k = 0, 1, ..., n2^n, define
Lemma 3.137 says that the integrals on the left-hand side of (3.140) are
U(Hn). Since Axiom 5 says that U(Hn) ≤ U(H″) for all n,
∫ U(H″(r)) dQ(r) ≤ U(H″).
Since U is bounded above, we can choose Mn,k to be NM-lotteries such
that U(Mn,k) = min{b1, k/2^n}. Just as above, let Hn(r) = Mn,k for r ∈
Bn,k, so that U(H″(r)) ≤ U(Hn(r)) ≤ b1 for all r, n. The dominated
convergence theorem A.57 says that
that H1 ≼ L1, but Axiom 4 says that L1 ≼ H1 if X^{−1}(B) is nonnull. It
follows that X^{−1}(B) must be null. A similar proof works if L1 ≺ L2. □
Before giving the proof of Theorem 3.110, we give a brief outline. We use
Theorem 3.108 to represent conditional preference by expected utility sepa-
rately for each value of x. We then use Lemma 3.144 to show that the utility
function in the conditional preference representation must equal the utility
function for unconditional preference except on a null set. We prove that
the probability measure for the conditional preference representation must
equal conditional probability calculated from the unconditional preference
by showing that if it were not, we could construct a pair of horse lotteries
that are conditionally ordered one way for all x, but that are marginally
ordered the opposite way, contradicting Axiom 6.
PROOF OF THEOREM 3.110. Let L_* and L^* be as in Lemma 3.133. According
to Theorem 3.108, since ≼x satisfies Axioms 1–5 for each x, there is, for
each x, a probability Px on (R, 𝒜1) and a utility Ux such that H1 ≼x H2
if and only if ∫ Ux(H1(r))dPx(r) ≤ ∫ Ux(H2(r))dPx(r). Let QX denote the
distribution of X induced from Q. Lemma 3.144 says that each pair of NM-
lotteries is ranked the same by ≼x except possibly for x in a set with null
inverse image. Since Lemma 3.136 says that a set C is null if and only if
Q(C) = 0, we can assume that there is B0 such that Q(X^{−1}(B0)) = 0
and Ux(L_*) < Ux(L^*) for all x ∉ B0. We can certainly assume that
Ux(L_*) = 0 = U(L_*) and Ux(L^*) = 1 = U(L^*) for all x ∉ B0. Let B−
be the set of all x such that there exists Lx with Ux(Lx) < U(Lx), and
let B+ be the set of all x such that there exists Lx with Ux(Lx) > U(Lx).
We will show next that for each x ∈ B+ ∪ B−, we could choose Lx so that
U(Lx) = ½. For each x ∈ B+ ∪ B−, let
L′x = (1/[b1(1 − b0)])Lx + ([b1 − 1]/[b1(1 − b0)])L_* + (−b0/[1 − b0])L^*,
where b0 = min{0, U(Lx)} and b1 = max{1, U(Lx)}. Then Ux(L′x) ≠
U(L′x), but now 0 ≤ U(L′x) ≤ 1. By mixing L′x with either L_* or L^* to
create L″x, we can have U(L″x) = ½ and either Ux(L″x) > ½ or Ux(L″x) < ½.
That is, we can assume that U(Lx) = ½. Let L1/2 = ½L_* + ½L^*. Define
horse lotteries H+ and H− by
is null. By a similar argument, we can show that X^{−1}(B+) is null. Let B′
equal B0 ∪ B− ∪ B+. Then X^{−1}(B′) is null and, for all x ∉ B′, Ux(L) = U(L)
for all L ∈ 𝓛.
Next,³⁵ we prove that Px(A) is a measurable function of x for all A ∈ 𝒜1.
For each A ∈ 𝒜1, let HA(r) = L^* if r ∈ A and HA(r) = L_* if r ∉ A. For
each c ∈ [0, 1],
D = ∪_{n=1}^∞ {x : Px(An) ≠ Q(An | x)}.   (3.145)
Since both Px(An) and Q(An|x) are measurable functions, each of the sets
in the union is in ℬ, so D ∈ ℬ.
If QX(D) > 0, then one of the sets in the union (3.145) must have strictly
positive QX measure. Let D′ = {x : Px(A1) < Q(A1|x)}, and suppose
that QX(D′) > 0. For each rational q ∈ [0, 1], let Dq = {x : Px(A1) <
q < Q(A1|x)}. Then D′ is the union of all the Dq. Since this union is
countable, there exists q such that QX(Dq) > 0. Define H1(r) = L^* for
r ∈ A1 ∩ X^{−1}(Dq) and H1(r) = L_* otherwise. Also, define H2(r) = L^*
for r ∈ A1 and H2(r) = L_* otherwise. Then Ux(H1) = Ux(H2) according
to the definition of conditional preference because H1(r) = H2(r) for all
r ∈ X^{−1}(Dq). But Px(A1) = Ux(H1) by the uniqueness of probability and
Lemma 3.134. Define H3(r) = qL^* + (1 − q)L_* for all r ∈ X^{−1}(Dq) and
H3(r) = L_* otherwise. Then the definition of conditional preference implies
that Ux(H3) = Ux(qL^* + (1 − q)L_*) = q. Since Ux(H3) = q > Ux(H1) for
all x ∈ Dq, we have H1 ≺x H3 for all x ∈ Dq. But H1 ≼y H3 for all y ∉ Dq,
since H1(r) = H3(r) for all r ∉ X^{−1}(Dq). It follows from Axiom 6 that
H1 ≺ H3. Now, note the following contradiction:
where the first and last equalities follow from Lemma 3.137, the inequality
follows from the definition of Dq, and the other equality follows from the
definition of conditional probability. □
for all (L1, ..., Ln). If (q1, ..., qn) and U are as guaranteed by Theorem 3.108
and (t1, ..., tn) is as above, then Vi = qiU/ti will satisfy
Σ_{i=1}^n qiU(Li) = Σ_{i=1}^n tiVi(Li).
(L1, ..., Lj−1, L′j, Lj+1, ..., Ln) ≼ (L1, ..., Lj−1, L″j, Lj+1, ..., Ln)
if and only if Uj(L′j) ≤ Uj(L″j), no matter what one chooses for the Li's.
For each j such that Tj is nonnull, there are prizes p^*_j and p_{*j} such that
Uj(p^*_j) = 1 and Uj(p_{*j}) = 0. (If Tj is null, Uj(pj) can be arbitrary, since
there are no preferences among the horse lotteries.) It is easy to see that
the best and worst horse lotteries are respectively H^* and H_*. Define
H^*_i(rj) = p^*_j if j = i, and H^*_i(rj) = p_{*j} if j ≠ i.
Define qi = U(H^*_i), where U is constructed by Lemma 3.119 based on all
of ℋ with U(H^*) = 1 and U(H_*) = 0. Clearly, qi ≥ 0 for all i. To see that
Σ_{i=1}^n qi = 1, note that the equal mixture of all the H^*_i's is
(1/n)H^*_1 + ⋯ + (1/n)H^*_n = (1/n)H^* + ([n − 1]/n)H_*.
Evaluating U at both sides of this expression gives (1/n)Σ_{i=1}^n qi = 1/n.
Finally, we prove that if H = (L1, ..., Ln), then U(H) = Σ_{i=1}^n qiUi(Li).
This will complete the proof. Construct n horse lotteries
Hi(rj) = Li if i = j, and Hi(rj) = p_{*j} if i ≠ j.
By taking an equal mixture of all n of these, we get
(1/n)H + ([n − 1]/n)H_* = (1/n)H1 + ⋯ + (1/n)Hn.
Evaluating U at both sides of this gives U(H)/n = (1/n)Σ_{i=1}^n U(Hi). So,
we need only prove that U(Hi) = qiUi(Li) for each i. From the definition
of Ui, we see that Hi ∼ H′i, where
H′i(rj) = Ui(Li)p^*_i + (1 − Ui(Li))p_{*i} if i = j, and H′i(rj) = p_{*j} if i ≠ j.
Since H′i = Ui(Li)H^*_i + (1 − Ui(Li))H_*,
U(Hi) = U(H′i) = (1 − Ui(Li))U(H_*) + Ui(Li)U(H^*_i) = qiUi(Li). □
3.4 Problems
Section 3.1:
for fixed known values of m and r > 1. After n person-years have elapsed
in this factory, s injuries are observed.
(a) Using fΘ(θ) above, show that, with the loss function
L(θ, d) = (d − θ)²/θ,
the Bayes rule is d*(s) = (r + s − 1)/(m + n).
(b) Now, assume that n (the number of person-years) is fixed and treat
S (the number of injuries) as random (not yet observed). Find the
risk function for d*. Also find the Bayes risk for d* with respect to
the prior fΘ and the posterior risk for d*(s) given S = s.
7. Suppose that Pθ says that X1, ..., Xn are IID N(θ, 1). Let δ0(X) be the
median of the sample, and let T = δ1(X) be the sample average. Find a
randomized rule based on the mean, δ(T), which has the same risk func-
tion as δ0 no matter what the loss function is. (You may wish to solve
Problem 12 on page 663 or Problem 1 on page 138 first. You probably can-
not write a closed-form solution for the randomized rule. You may either
describe the probability distribution in words sufficiently precise to define
it or give an algorithm for actually performing a randomization that will
have the appropriate distribution.)
8. Let δ : 𝒳 → 𝒩 be a nonrandomized rule, and let T : 𝒳 → 𝒯 be a sufficient
statistic. Let δ1 be the rule constructed in Theorem 3.18 on page 151. Show
that for each t, the distribution δ1(t)(·) on 𝒩 is the probability measure
induced by δ from the conditional distribution of X given T = t.
9.* Let Ω = (0, ∞) × (0, ∞), 𝒳 = ℝ³, and 𝒩 = ℝ⁺. Suppose that Pθ says
X1, X2, X3 are IID U(α, β), where θ = (α, β). Let
L(θ, a) = ([α + β]/2 − a)².
Let δ0(X) = X̄.
(a) Find a two-dimensional sufficient statistic, T.
(b) Use the Rao–Blackwell theorem 3.22 to find a rule δ1(T) whose risk
function is at least as good as that of δ0(X).
(c) Find the risk functions R(θ, δ0) and R(θ, δ1), and show that there is
at least one θ such that R(θ, δ1) < R(θ, δ0).
10.* Let {Xn}n=1^∞ be conditionally IID Ber(θ) given Θ = θ, and let X =
(X1, ..., Xn). Let 𝒩 = Ω = (0, 1) and L(θ, a) = (θ − a)². Let the prior
distribution Λ of Θ be U(0, 1). Let δ0(x) be the sample median, that is
11. In Example 3.25 on page 154, find the risk function for both δ and δ1 and
show that δ1 dominates δ.
12. Suppose that Pθ says that X ∼ Bin(n, θ). Let Ω = (0, 1) and 𝒩 = [0, 1].
Let L(θ, a) = (θ − a)² and
… with probability ½,
… with probability ½.
Find a nonrandomized rule that dominates δ.
13. Find an example of a decision problem with a decision rule δ0 and a prob-
ability Λ on the parameter space such that δ0 is Λ-admissible but δ0 is not
a Bayes rule with respect to Λ.
14. Suppose that X '" Exp(I/0) given 9 = (J. Let the action space be [0, (0),
and let the loss function be L(O, a) = (0 - a)2.
(a) Prove that 6(x) = x is inadmissible.
(b) Find a nonconstant admissible rule.
15. Let X = (Xl, ... , X n ), where the Xi are conditionally lID N(p" (12) given
9 = (p"o-). Let Cl > and C2 be constants. Let N = lR and L(O, a) =
(p, - a)2. Define 6(x) = (nx + Clc2)/(n + cd. Show that 6 is admissible.
16. *Prove Proposition 3.47 on page 162.
17. Assume that L(O, a) = (0 - a)2 in each of the following questions.
(a) Suppose that X '" N(J, 1) given e= (J. Show that for each constant
c, 6(x) == c is admissible.
(b) Suppose that X '" U(O,O) given 9 = O. Show that for each constant
c, 6(x) == c is inadmissible.
18. Let Ω = (0, 1), 𝒩 = [0, 1], and L(θ, a) = (θ − a)². Suppose that Pθ says that
X ∼ Geo(θ), that is,
Suppose that the parameter space and the action space are both (−∞, ∞).
Let L(θ, a) = 0 if a ≥ θ and L(θ, a) = 1 if a < θ.
(a) Show that there is no Bayes rule.
(b) Show that every decision rule is inadmissible.
(c) Show that if the action space is [−∞, ∞], then there is a Bayes rule
and that it is the only admissible rule.
Section 3.2.3:
20.* Prove that the modified James–Stein estimator δ3(X) has smaller risk func-
tion than δ(X) if n ≥ 4. (Hint: Let Γ be an orthogonal transformation with
first row proportional to 1ᵀ, and let Z be the last n − 1 coordinates of ΓX.
What does Theorem 3.50 say about estimating ΓΘ by Γδ3(X)?)
21.* Say that a function g : ℝ → ℝ is absolutely continuous if there exists a
function g′ such that for all x1 < x2, g(x2) = g(x1) + ∫_{x1}^{x2} g′(y)dy.³⁸
(a) Prove that the conclusion to Lemma 3.51 continues to hold if the
assumption that 9 is differentiable is replaced by the assumption that
9 is absolutely continuous.
(b) Prove that the conclusion to Lemma 3.52 continues to hold if the
assumption that the coordinates of 9 are differentiable is replaced by
the assumption that hi is absolutely continuous for every i.
22. Let g(x) = −x min{c, (n − 2)/Σ_{i=1}^n x_i²} be a function from ℝⁿ to ℝⁿ. Let
δ*(x) = x + g(x).
(a) Using Problem 21 above, find all values of c > 0 such that δ*(x) has
smaller risk than δ(x) = x in the setting of Theorem 3.50.
(b) Prove that for c > (n − 2)/(n + 2), δ*(x) has smaller risk than δ1(x)
in the setting of Theorem 3.50.
Section 3.2.4:
Section 3.2.5:
³⁸Such functions are called absolutely continuous because they have a prop-
erty similar to measures that are absolutely continuous with respect to Lebesgue
measure. In particular, if g is nondecreasing, then η((a, b]) = g(b) − g(a) defines
a measure that is absolutely continuous with respect to Lebesgue measure.
27. Suppose that P0 says X ∼ U(0, 1) and P1 says X ∼ U(0, 7), and that the
loss function is as in Theorem 3.87. Find all of the admissible rules under
the conditions of that theorem. Express each rule by saying which intervals
of X values lead to making each decision.
28. Suppose that an observation X is to be made and it is believed that X has
one of two densities:
31. Suppose that there are k ≥ 2 horses in a race and that a gambler believes
that pi is the probability that horse i will win (Σ_{i=1}^k pi = 1). Suppose that
the gambler has decided to wager an amount x to be divided among the
k horses. If he or she wagers xi on horse i and that horse wins, the utility
of the gambler is log(ci xi), where c1, ..., ck are known positive numbers.
Find values x1, ..., xk to maximize expected utility.
32. Suppose that two agents have a common strictly increasing utility function
U for their fortunes in dollar amounts and that their current fortunes are
the same, x0. (So, for example, the utility of receiving an additional x
dollars would be U(x0 + x).)
(a) Let R be a random dollar amount that is strictly greater than −x0.
If one of our agents contemplates selling R, what would be the lowest
price at which the agent would be willing to sell it? What would be
the highest price that an agent who did not own R would be willing
to pay for it?
(b) Suppose that one agent receives a gift consisting of a lottery ticket
that will pay T > 0 dollars with probability 1/2 and pays nothing with
probability 1/2 and that both agents agree on these probabilities.
Construct a utility function U having the property that, as soon as
an agent receives this gift, he or she is willing to sell it at some price
less than T/2 and the other agent is willing to buy it at that same
price.
33. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117.
Let R = {r1, ..., rn} and P = {p1, ..., pm}. Consider the set ℋ′ of all horse
lotteries of the form (p_{i1}, ..., p_{in}). (These are all horse lotteries whose NM-
lotteries assign probability 1 to a single prize.) Let H_*, H^* ∈ ℋ′ be such
that H_* ≼ H ≼ H^* for all H ∈ ℋ′. Prove that H_* ≼ H ≼ H^* for all
H ∈ ℋ.
34. Prove Proposition 3.118 on page 193. (Hint: You can use Theorem 3.147 if
you wish.)
35. Let γ1 < γ2 < 1, and suppose that H1 and H2 are horse lotteries such that
4.1 Introduction
4.1.1 A Special Kind of Decision Problem
Recall the setup used at the beginning of Chapter 3. We had a probability
space (S, 𝒜, μ) and a function V : S → 𝒱. One example of V is the param-
eter Θ. Other examples are measurable functions of Θ. Other V functions,
which are not functions of Θ, are possible but are rarely seen in classical
statistics. This is true to a greater extent in hypothesis testing for rea-
sons that will become more apparent once we study the criteria used for
selecting tests in classical statistics.
Definition 4.1. Suppose that we can partition 𝒱 into 𝒱 = 𝒱H ∪ 𝒱A, where
𝒱H ∩ 𝒱A = ∅. The statement that V ∈ 𝒱H is a hypothesis and is labeled
H. The corresponding alternative is labeled A and is the statement that
V ∈ 𝒱A. If V = Θ, we have Ω = ΩH ∪ ΩA with ΩH ∩ ΩA = ∅ and V ∈ 𝒱H
if and only if Θ ∈ ΩH. In this case, we write H : Θ ∈ ΩH and A : Θ ∈ ΩA.
A decision problem is called hypothesis testing if 𝒩 = {0, 1} and L(v, a)
satisfies L(v, 1) > L(v, 0) for v ∈ 𝒱H and L(v, 1) < L(v, 0) for v ∈ 𝒱A.
The action a = 1 is called rejecting the hypothesis, and the action a = 0 is
called accepting the hypothesis.¹ If we reject H but H is true, we made a
type I error. If we accept H and it is false, we made a type II error.
where c1 > c0 and b0 > b1. It is easy to see (see Problem 1 on page 285)
that (4.2) is equivalent to a loss function of the same form with c0 = b1 = 0,
b0 = 1, and c1 = c > 0. Such a loss function is called a 0-1-c loss function.
If, in addition, c = 1, it is called a 0-1 loss function. More general loss
functions than the 0-1-c loss might often seem appropriate for the type of
problems in which hypothesis testing is used. For example, if the parameter
is real, the hypothesis is that Θ ≤ θ0, and c > 0, an appropriate loss might
be
L(θ, a) = θ − θ0 if θ > θ0, a = 0;  (θ0 − θ)c if θ ≤ θ0, a = 1.   (4.3)
This loss provides for penalties for choosing the wrong decision that are
commensurate with the inaccuracy of the decision. But this loss can be
written as |θ − θ0| times the 0-1-c loss. By Proposition 3.47, so long as the
risk functions of all decision rules are continuous from the left (or all are
continuous from the right) at θ = θ0, rules admissible under the 0-1-c loss
will be admissible under this loss. One could begin the study of hypothesis
testing by concentrating solely on which decision rules are admissible. For
this purpose, the 0-1 loss is sufficient. The focus of hypothesis testing,
however, is on finding tests that meet certain ad hoc criteria to be defined
later.
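The loss (4.3) can be written directly as a function; the following sketch is an illustrative transcription (the argument names are ours):

```python
# Loss function (4.3): a linear penalty commensurate with the error,
# scaled by c when we reject (a = 1) although theta <= theta0.
def loss(theta, a, theta0, c):
    if a == 0:
        # accepting H: Theta <= theta0; loss accrues only if theta > theta0
        return max(theta - theta0, 0.0)
    # rejecting H: loss accrues only if theta <= theta0
    return c * max(theta0 - theta, 0.0)
```

Note that `loss(theta, a, theta0, c)` is zero exactly when the chosen action is the correct one.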
A randomized decision rule δ in a hypothesis testing problem can be
described by its test function, which is the measurable function φ : 𝒳 →
[0, 1] given by
²See Definition 3.99 on page 183. Basically, the binary relation ≼ must be
reflexive and transitive, and all pairs of data values must be compared.
Example 4.6. Let H state that X ∼ N(0, 1). We can say that x ≼ y
if |x| ≤ |y|. We are quite free to define ≼ however we wish, so long as it is
a weak order.²
Example 4.7. Let H state that X ∼ N(0, 1). We could define x ≼ y by
|x| ≥ |y|.
The most common way to define ≼ is in terms of a statistic T : 𝒳 → ℝ.
We would say that x ≼ y if and only if T(x) ≤ T(y). In Example 4.6,
T(x) = |x|. A pure significance test is obtained by calculating the signif-
icance probability pH(x) (Definition 4.8) and rejecting H if pH(x) is too
small.
Definition 4.8. Let the hypothesis H be a set of distributions on (𝒳, ℬ).
Suppose that the quantity pQ(x) = Q({y : x ≼ y}) is the same (or approx-
imately the same) for all Q in H. Then the common value pH(x) is called
the significance probability of the data x relative to the weak order ≼, and
the test that rejects H when pH(x) is small is called a pure significance
test.
Example 4.9 (Continuation of Example 4.6). It is easy to see that
pH(x) = ∫_{−∞}^{−|x|} φ(y)dy + ∫_{|x|}^{∞} φ(y)dy = 2Φ(−|x|),
where φ and Φ are the standard normal density and CDF, respectively. This pure
significance test would be the same as the usual test of the hypothesis that the
mean of a normal distribution with variance 1 is 0 versus the alternative that the
mean is not 0.
For the case of Example 4.7, we have
pH(x) = ∫_{−|x|}^{|x|} φ(y)dy = 2Φ(|x|) − 1.
This test would lead to rejecting H if the data are too consistent with H. This
is similar to what Fisher (1936) did when considering how closely the data of
Mendel (1866) matched a theory that Fisher later showed to be inaccurate.
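The quantity 2Φ(−|x|) from Example 4.9 is easy to evaluate numerically. This sketch uses only the standard library (the helper names are ours; `scipy.stats.norm.cdf` would serve equally well):

```python
# Significance probability for the ordering of Example 4.6:
# p_H(x) = P(|Y| >= |x|) = 2*Phi(-|x|) when Y ~ N(0, 1) under H.
from math import erf, sqrt

def std_normal_cdf(z):
    # Phi(z) expressed through the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def significance_probability(x):
    return 2.0 * std_normal_cdf(-abs(x))

print(significance_probability(1.96))  # approximately 0.05
```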
Example 4.10. Suppose that H is the set of distributions that say that X =
(X1, ..., Xn) are conditionally IID with N(0, σ²) distribution given Σ = σ. Let
T(x) be the usual t statistic for testing the hypothesis that the mean of a normal
sample is 0, namely T(x) = √n|x̄|/s, where x̄ = Σ_{i=1}^n xi/n and s² = Σ_{i=1}^n (xi −
x̄)²/(n − 1). Then Fσ({y : T(x) ≤ T(y)}) is the same for all σ. In fact pH(x) =
2T_{n−1}(−|T(x)|), where T_{n−1} is the CDF of the t_{n−1}(0, 1) distribution. The usual
t-test is a pure significance test.
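The statistic of Example 4.10 can be sketched as follows (stdlib only, with names of our choosing; the p-value 2T_{n−1}(−|T(x)|) would then require a t CDF such as `scipy.stats.t.cdf`, which we do not reimplement here):

```python
# t statistic of Example 4.10: T(x) = sqrt(n) * |xbar| / s,
# with s^2 the usual unbiased sample variance.
from math import sqrt

def t_statistic(xs):
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((xi - xbar) ** 2 for xi in xs) / (n - 1)
    return sqrt(n) * abs(xbar) / sqrt(s2)
```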
The advantages to pure significance tests over general hypothesis tests are
that one need not explicitly state the alternatives and one is free to choose
the weak order ≼ however one sees fit. Of course, one would normally choose
≼ with some alternative in mind, but one need not say what the alternative
is, nor need one calculate any probabilities conditional on the alternative.
218 Chapter 4. Hypothesis Testing
A serious disadvantage is that one never knows, until one considers explicit
alternatives, whether one should continue calculating probabilities as if the
hypothesis were true or not. Just because PH (x) is large does not mean that
H is a better probability model for the data than some other plausible
distribution not part of H. Similarly, if PH(X) is quite small, it may be
the case that many other distributions not part of H also give very small
probability to the set {y : x ≼ y}. Berkson (1942) forcefully argues this
point, but not forcefully enough for Fisher (1943).
We will not discuss pure significance tests any further in this book except
to mention a few points. 3 First, all of the hypothesis tests developed in this
chapter can be interpreted as pure significance tests if one feels compelled
to do so, although the hypotheses may need to be modified in order to
satisfy the definition of pure significance test. Second, the goodness of fit
tests described in Section 7.5.2 were originally intended to be interpreted
as pure significance tests. Third, pure significance tests have no role to play
in the Bayesian framework as described in various parts of this text. If the
hypothesis H describes all of the probability distributions one is willing to
entertain, then one cannot reject H without rejecting probability models
altogether. If one is willing to entertain models not in H, then one needs
to take them into account, as well as their merits relative to H, before
deciding whether or not to reject H.
³Cox and Hinkley (1974, Chapters 3–5) discuss pure significance tests and
related topics in great detail. A nice review is contained in Cox (1977).
4.2. Bayesian Solutions 219
Example 4.12. Suppose that Pθ says that {Xn}n=1^∞ are IID N(μ, σ²), where
θ = (μ, σ) and X = (X1, ..., Xn). Let V = Θ and ΩH = {(μ, σ) : μ ≤ μ0},
and let L be a 0-1-c loss function. If we use the measure with Radon–Nikodym
derivative 1/σ with respect to Lebesgue measure as an improper prior, then the
posterior distribution of M is t_{n−1}(x̄, s/√n). The formal Bayes rule is
where t = √n(x̄ − μ0)/s is the usual t statistic and T_{n−1} is the CDF of the
t_{n−1}(0, 1) distribution. Note that this is the usual size 1/(1 + c) t-test of H from
every elementary statistics course.
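Assuming the posterior stated in Example 4.12, the formal Bayes rule can be sketched numerically. Everything below is illustrative: the t CDF is obtained by crude numerical integration of the t density (in practice one would call `scipy.stats.t.cdf`), and the function names are ours.

```python
# Formal Bayes rule of Example 4.12 under 0-1-c loss:
# reject H: mu <= mu0 when the posterior probability of H,
# T_{n-1}(-t) with t = sqrt(n)*(xbar - mu0)/s, falls below 1/(1+c).
from math import sqrt, gamma, pi

def t_pdf(y, df):
    c = gamma((df + 1) / 2.0) / (sqrt(df * pi) * gamma(df / 2.0))
    return c * (1.0 + y * y / df) ** (-(df + 1) / 2.0)

def t_cdf(y, df, steps=20000, lo=-50.0):
    # trapezoidal integration of the density from lo up to y
    if y <= lo:
        return 0.0
    h = (y - lo) / steps
    total = 0.5 * (t_pdf(lo, df) + t_pdf(y, df))
    total += sum(t_pdf(lo + i * h, df) for i in range(1, steps))
    return total * h

def reject_H(xbar, s, n, mu0, c):
    t = sqrt(n) * (xbar - mu0) / s
    return t_cdf(-t, n - 1) < 1.0 / (1.0 + c)
```

With c = 19 (size 1/(1 + c) = 0.05), a sample of size 9 with x̄ = 2, s = 1, and μ0 = 0 gives t = 6 and rejects H, matching the usual one-sided t-test.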
L(v, δ(x)) = c I_{𝒱H}(v)φ(x) + I_{𝒱A}(v)[1 − φ(x)],
L(θ, δ(x)) = c Pθ,V(𝒱H)φ(x) + [1 − Pθ,V(𝒱H)][1 − φ(x)]
= φ(x)[(c + 1)Pθ,V(𝒱H) − 1] + 1 − Pθ,V(𝒱H),
R(θ, φ) = βφ(θ)[(c + 1)Pθ,V(𝒱H) − 1] + 1 − Pθ,V(𝒱H).   (4.14)
Now, define
⁴The interested reader should try to extend the classical definitions of level,
power, and so forth to the case of predictive hypothesis testing and see what
happens. The problem arises because one usually assumes that the future data
are independent of the past data conditional on the parameters, and all classical
inferences are conditional on the parameters. Hence, the past tells us nothing
about the future, and vice versa.
If power functions are continuous at that θ such that Pθ,V(𝒱H) = 1/(c + 1),
then the predictive testing problem has been converted into a parametric testing
problem. In words, we have replaced a test concerning the observable V with a
test concerning the conditional distribution of V given e.
Another area in which Bayesian and classical hypothesis testing differ
dramatically is their treatment of more general loss functions. When the
focus of classical testing is on admissible tests, then it does not matter
which of several equivalent loss functions one uses. A Bayesian solution to
a testing problem will depend on which loss one uses because one is trying
to minimize the posterior risk. For example, with the loss function in (4.3),
the posterior risk for choosing a = 0 is ∫ I_{(θ0,∞)}(θ)(θ − θ0) dF_{Θ|X}(θ|x), while
the posterior risk for choosing a = 1 is c ∫ I_{(−∞,θ0)}(θ)(θ0 − θ) dF_{Θ|X}(θ|x).
A little algebra shows that the formal Bayes rule is to choose a = 1 if
The expression on the left is increasing in x and the expression on the right is
decreasing in x, so the formal Bayes rule is to choose action a = 1 if x > k
for some number k. This rule has the same form as the formal Bayes rules with
respect to 0-1-c' loss functions.
In classical hypothesis testing, it is not common to recommend different
tests depending on whether the loss is 0-1-c or of the form of (4.3) or
anything else. In fact, very little attention is paid to what the loss function
might be in classical testing. Were the focus solely on finding all admissible
rules, this might not be a problem. However, once we advance beyond
the simplest types of testing situations, the classical theory will tend to
abandon the goal of finding all admissible rules and concentrate instead on
finding all tests that satisfy certain ad hoc criteria.
    f_{X,Θ}(x, θ) = { p0 f_{X|Θ}(x|θ0)        if θ = θ0,
                      (1 − p0) f_{X|Θ}(x|θ)    if θ ≠ θ0.

The marginal density of the data is

    f_X(x) = p0 f_{X|Θ}(x|θ0) + (1 − p0) ∫ f_{X|Θ}(x|θ) dλ(θ).

The posterior distribution of Θ has density (with respect to the sum of λ
and a point mass at θ0)

    f_{Θ|X}(θ|x) = { p1                               if θ = θ0,
                     (1 − p0) f_{X|Θ}(x|θ) / f_X(x)   if θ ≠ θ0,

where

    p1 = p0 f_{X|Θ}(x|θ0) / f_X(x)

is the posterior probability of Θ = θ0. It is easy to see that

    p1/(1 − p1) = [p0/(1 − p0)] × f_{X|Θ}(x|θ0) / ∫ f_{X|Θ}(x|θ) dλ(θ).   (4.16)
The second factor on the right-hand side of (4.16) is the Bayes factor. It
would be the posterior odds in favor of Θ = θ0 if p0 = .5. For other values
of p0, one needs to multiply the prior odds times the Bayes factor to calculate
the posterior odds. The advantage of calculating a Bayes factor over the
posterior odds (p1/[1 − p1]) is that one need not state a prior odds in favor
of the hypothesis. This might be useful if one is reporting the results of an
experiment rather than trying to make a decision. One must still, however,
state a prior distribution over the alternative given that the hypothesis is
false.
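The relation (4.16) between prior odds, Bayes factor, and posterior odds can be checked numerically. The sketch below is a hypothetical setup of my own (not from the text): a N(θ, 1) model with point mass p0 at θ0 and a N(θ0, τ²) prior under the alternative, for which the integral in the denominator of (4.16) has the closed form of a N(θ0, 1 + τ²) density.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bayes_factor(x, theta0, tau2):
    """Bayes factor in favor of H: Theta = theta0 for one N(theta, 1) observation.

    Under the alternative, Theta ~ N(theta0, tau2), so the predictive
    density of X is N(theta0, 1 + tau2).
    """
    return normal_pdf(x, theta0, 1.0) / normal_pdf(x, theta0, 1.0 + tau2)

def posterior_prob_null(x, theta0, tau2, p0):
    """Posterior probability p1 of Theta = theta0, as in (4.16)."""
    f0 = p0 * normal_pdf(x, theta0, 1.0)
    fa = (1 - p0) * normal_pdf(x, theta0, 1.0 + tau2)
    return f0 / (f0 + fa)

# Posterior odds should equal prior odds times the Bayes factor.
x, theta0, tau2, p0 = 1.96, 0.0, 4.0, 0.3
p1 = posterior_prob_null(x, theta0, tau2, p0)
posterior_odds = p1 / (1 - p1)
prior_odds_times_bf = (p0 / (1 - p0)) * bayes_factor(x, theta0, tau2)
```

Because the Bayes factor does not involve p0, a reader can update the reported factor with whatever prior odds he or she prefers.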
If λ puts positive mass on both sides of θ0, then it is easily verified that ∫ exp(x[θ −
θ0] − θ²/2) dλ(θ) is convex as a function of x and goes to ∞ as x → ±∞. So all
formal Bayes rules will be of the form "reject H if x is outside of some bounded
interval."
The prior in this class that leads to the smallest Bayes factor can easily be shown
(see Problem 7 on page 286) to be the one with

    τ² = { (x − θ0)² − 1   if |x − θ0| > 1,
           0               otherwise.
5 For more discussion of this technique, see Edwards, Lindman, and Savage (1963).
4.2. Bayesian Solutions 223
The value of c that minimizes the Bayes factor satisfies

    (x − θ0 + c)/(x − θ0 − c) = exp(2c[x − θ0]).

For |x − θ0| > 1.5, the solution c is very nearly equal to |x − θ0| (although
it is always strictly smaller than |x − θ0|). If x = θ0 + 1.96, for example, then
c = 1.958. The value of f_X(x), when c = |x − θ0|, is [1 + exp(−2|x − θ0|²)]/[2√(2π)].
If x = θ0 + 1.96, for example, then the lower bound on the Bayes factor is 0.2928,
approximately twice the global lower bound. This is not surprising, since the
two-point distribution puts half of its probability very nearly at the same point
as does the one-point distribution that led to the global lower bound. The
other half of the probability is on a point that contributes nearly nothing because
it is so far from x.
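These numbers are easy to reproduce. The sketch below is my own check (not part of the text): it solves the displayed equation for c by bisection at x = θ0 + 1.96 and evaluates the two-point-prior Bayes factor f_{X|Θ}(x|θ0)/f_X(x), recovering c ≈ 1.958, the bound ≈ 0.2928, and the global lower bound exp(−[x − θ0]²/2) ≈ 0.1465.

```python
import math

z = 1.96  # x - theta0

def g(c):
    """Root of g gives the optimal c: (z+c)/(z-c) - exp(2*c*z)."""
    return (z + c) / (z - c) - math.exp(2 * c * z)

# Bisection: g < 0 at c = 1.5 and g > 0 just below c = z.
lo, hi = 1.5, z - 1e-9
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if g(mid) < 0:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)

def phi(u):
    """Standard normal density."""
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

# Two-point prior at theta0 +/- c: marginal density and Bayes factor.
f_x = 0.5 * (phi(z - c) + phi(z + c))
bf_two_point = phi(z) / f_x
bf_global = math.exp(-z * z / 2)  # one-point prior at theta0 + z
```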
The global lower bound on the Bayes factor, namely

    f_{X|Θ}(x|θ0) / sup_{θ≠θ0} f_{X|Θ}(x|θ),   (4.20)

is closely related to the likelihood ratio test statistic, which is discussed in
Section 4.5.5.
Upper bounds on Bayes factors are usually harder to come by. This is
due to the fact that there are often priors (even conjugate priors) that
place such high probability on the data being very far from what was
observed that the hypothesis will be highly favored if such a prior is used
for the alternative. For example, in Example 4.17, if the alternative prior
is N(θ0, τ²), the Bayes factor goes to ∞ as τ² goes to ∞. In this regard,
it is important to note that improper priors are particularly inappropriate
for the conditional distribution of Θ given Θ ≠ θ0. The limit as τ² goes
to ∞ in Example 4.17 leads to an improper prior. As we just noted, the
Bayes factor goes to ∞ because the improper prior for the alternative says
that Θ has probability 1 of being outside of every bounded interval. Since
the data will surely be inside some bounded interval, it will appear to be
much more consistent with the hypothesis than the alternative. There are
ways, however, to use limits of proper priors in Bayes factors.
Example 4.21 (Continuation of Example 4.18; see page 222). Suppose that we
wish to let τ² go to ∞ in the N(θ0, τ²) prior for Θ given the alternative. In order
to use an improper prior to approximate a proper prior in this problem, we would
have to let the prior on the hypothesis be improper also. This could be done by
letting p0 go to zero in such a way that p0τ → k. In this case, p0/[1 − p0] times
the Bayes factor converges to k exp(−[x − θ0]²/2). In this way, k acts like the
prior odds ratio, and exp(−[x − θ0]²/2) acts like the Bayes factor. In fact, k is
the limit (as τ → ∞) of the ratio of p0 to the prior probability that Θ is in the
interval [θ0 − √(π/2), θ0 + √(π/2)] given the alternative. (See Problem 8 on page 287.)
By restricting the class of prior distributions, one can obtain useful upper
bounds on Bayes factors. For example, in Example 4.17, one could restrict
attention to priors with τ² ≤ c. Since the Bayes factor is increasing as a
function of τ², the maximum occurs at τ² = c. For large c,
one can easily compute the upper bound to be approximately √c times the
global minimum Bayes factor.
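This approximation is easy to check numerically. The sketch below is my own check; it assumes Example 4.17 is the N(θ, 1) model with a N(θ0, τ²) alternative prior, under which the Bayes factor at τ² = c is √(1 + c) exp(−(z²/2)·c/(1 + c)) with z = x − θ0, and compares it with √c times the global minimum exp(−z²/2).

```python
import math

z = 1.96    # x - theta0
c = 1e6     # large bound on tau^2

def bayes_factor(tau2):
    """BF for H: Theta = theta0 with one N(theta, 1) observation and a
    N(theta0, tau2) prior under the alternative."""
    return math.sqrt(1 + tau2) * math.exp(-z * z / 2 * (tau2 / (1 + tau2)))

global_min = math.exp(-z * z / 2)   # global lower bound on the Bayes factor
upper = bayes_factor(c)             # maximum over priors with tau^2 <= c
ratio = upper / global_min          # should be close to sqrt(c) for large c
```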
Bayes factors can also be calculated in cases in which the hypothesis is of
the form H : g(Θ) = g(θ0) versus A : g(Θ) ≠ g(θ0) for some function g. For
example, the hypothesis might concern only one of several coordinates of
Θ. In this case, global lower bounds on the Bayes factor are not particularly
useful.
Example 4.22. Let Θ = (M, Σ), and suppose that X1, …, Xn are conditionally
IID given Θ = (μ, σ) with N(μ, σ²) distribution. Suppose that H : M = μ0
is the hypothesis. Given M ≠ μ0, we suppose that Σ² ~ Γ⁻¹(a0/2, b0/2) and
that M given Σ = σ has N(μ0, σ²/λ0) distribution. This is the usual conjugate
prior distribution. Conditional on M = μ0, we still need a prior distribution for
Σ². We will use the conditional distribution given M = μ0 obtained from the
joint distribution given M ≠ μ0. Conditional on M = μ0, Σ² has Γ⁻¹(a0*/2, b0*/2)
distribution, where a0* = a0 + 1 and b0* = b0 + λ0(μ0 − μ0)² = b0. The conditional
density of (X, Σ) given M = μ0, f_{X,Σ|M}(x, σ|μ0), equals

    [2 (b0*/2)^{a0*/2} / ((2π)^{n/2} Γ(a0*/2))] σ^{−(n+a0*+1)}
        × exp{ −[b0* + w + n(x̄_n − μ0)²] / (2σ²) },

and the joint conditional density of (X, M, Σ) given M ≠ μ0 equals

    [2 (b0/2)^{a0/2} √λ0 / ((2π)^{(n+1)/2} Γ(a0/2))] σ^{−n−a0−2}
        × exp{ −[b0 + w + λ1(μ − μ̄)² + (nλ0/λ1)(x̄_n − μ0)²] / (2σ²) },

where

    w = Σ_{i=1}^n (x_i − x̄_n)²,   λ1 = λ0 + n,   μ̄ = (n x̄_n + λ0 μ0)/λ1.
If we integrate the parameters out of the two densities above and take the ratio,
we get the Bayes factor:

    √(λ1/λ0) × [Γ(a0/2) (b0*/2)^{a0*/2} Γ(a1*/2) (b1/2)^{a1/2}]
              / [Γ(a0*/2) (b0/2)^{a0/2} Γ(a1/2) (b1*/2)^{a1*/2}],   (4.23)

where

    a1 = a0 + n,   b1 = b0 + w + (nλ0/λ1)(x̄_n − μ0)²,
    a1* = a0* + n,  b1* = b0* + w + n(x̄_n − μ0)².
To put a lower bound on the Bayes factor, we first note that the conditional
distribution of M given Σ and M ≠ μ0 which will lead to the largest marginal
density for X given M ≠ μ0 is the one that says M = x̄ with probability 1. We
are then left with the problem of finding distributions for Σ given M = μ0 and
given M ≠ μ0. It is easy to see that if we let the distribution of Σ be concentrated
at the same value c for both the hypothesis and the alternative, then the Bayes
factor is exp(−n(x̄ − μ0)²/[2c²]), which goes to 0 as c goes to 0, unless x̄ = μ0.
If x̄ = μ0 (a probability 0 event given Θ), the lower bound on the Bayes factor
is still 0, but one achieves this by letting the priors for Σ be different under the
hypothesis and alternative.
If one wished to use improper priors, one would have to let λ0 go to 0 while
p0/√λ0 converges to some finite strictly positive number k.6 In this case a0* = a0
instead of a0 + 1 because Σ and M are independent in the improper prior. To
convert (4.23) to the case of the improper prior, we set a0 = −1 and b0 = 0. The
product of the prior odds and the Bayes factor becomes

    k √n (1 + t²/(n − 1))^{−(n−1)/2},

where t = √(n(n − 1)/w) (x̄_n − μ0) is the usual t statistic.
In general, minimizing a Bayes factor for a problem like the one in Example 4.22
would require choosing the prior for the alternative to maximize
the predictive density and choosing the prior for the hypothesis to minimize
the predictive density. But this latter problem was already seen to lead to
the minimum being 0 in most cases. In short, the global lower bound on
the Bayes factor, when the hypothesis concerns only a function of the parameter,
will most likely be 0 and so is not useful. An alternative to the
global lower bound is an approximate Bayes factor formed by maximizing
the marginal density of X separately under the hypothesis and alternative:

    sup_{θ∈Ω_H} f_{X|Θ}(x|θ) / sup_{θ∈Ω_A} f_{X|Θ}(x|θ).   (4.24)
Let σ̂_θ = −B⁻¹. If, in addition, the prior densities are relatively flat in the
regions where the likelihoods attain their largest values, we can write
7 We will discuss the large sample properties of the method of Laplace in
Section 7.4.3. (In particular, see Theorem 7.116 and the ensuing discussion.)
Here, we give only a description of the method without any rigorous justification.
The derivation presented here is based on Kass and Raftery (1995).
8 The matrix −A is sometimes called the observed Fisher information.
    σ̂_ψ = [w + n(x̄ − μ0)²]/(2n²),   σ̂_θ = (w/n²) ( 1   0
                                                      0  1/2 ).

The approximate Bayes factor is 1/√(2π) times the factor f_ψ(ψ̂)/f_a(θ̂) times the
expression in (4.26) times the ratio of the square roots of the determinants of the
two matrices above. The result is

    [n f_ψ(ψ̂) / (f_a(θ̂) √(2πw))] (1 + t²/(n − 1))^{−(n−1)/2}.   (4.29)

The two prior densities are

    f_ψ(σ) = [2 (b0*/2)^{a0*/2} / Γ(a0*/2)] σ^{−(a0*+1)} exp(−b0*/(2σ²)),
    f_a(μ, σ) = [2 (b0/2)^{a0/2} √λ0 / (√(2π) Γ(a0/2))] σ^{−(a0+2)}
                    × exp(−[b0 + λ0(μ − μ0)²]/(2σ²)).

Plugging σ = √([w + n(x̄_n − μ0)²]/n) into the first of these and (μ, σ) = (x̄_n, √(w/n))
into the second and taking the ratio give
    = [n Γ(a0/2) (b0*/2)^{a0*/2} (b1/2)^{a1/2}]
      / [√(2λ0) Γ(a0*/2) (b0/2)^{a0/2} (b1*/2)^{a1*/2}],

where b1, b1*, a1, and a1* are defined after (4.23). If λ0 and a0 are small relative
to n, we can approximate n/√2 by √(λ0 + n) Γ(a1*/2)/Γ(a1/2). With this approximation,
the expression above becomes exactly (4.23). Although f_ψ(ψ̂)/f_a(θ̂)
depends on the data, one could calculate values of the ratio for a range of plausible
priors to see how much it could reasonably vary.
As an example, suppose that μ0 = 1.5 and that n = 14, x̄ = 2.7, and w = 41
are observed. Then ψ̂ = 2.0901 and θ̂ = (2.7, 1.7113). That portion of (4.29)
that does not depend on the prior is

    [n/√(2πw)] (1 + t²/(n − 1))^{−(n−1)/2} = 0.0648.
Next, we let μ0 = 1.5 and let the other hyperparameters a0, b0, λ0 be elements
of the set {0.1, 1, 5, 10, 20}. Figure 4.30 shows the 125 different values of the
logarithm of the ratio f_ψ(ψ̂)/f_a(θ̂) with λ0 varying most rapidly and a0 varying
most slowly. Since log(0.0648) = −2.736, those priors corresponding to values on
the vertical axis greater than 2.736 (horizontal line) will lead to Bayes factors
greater than 1, while the others lead to Bayes factors less than 1. Examining
Figure 4.30, we see that many reasonable priors (those with small to moderate
values of a0 and λ0 and values of b0/a0 in the vicinity of the observed sample
variance w/n = 2.93) give values for the log of the ratio near 2.736. This suggests
that the data will not dramatically alter anyone's opinion very much as to whether
or not M = 1.5. The other approximate Bayes factor (4.26) is 0.0608, which
suggests a significant reduction to the odds in favor of the hypothesis. The t
statistic for the usual classical test would be 2.53, and the hypothesis would be
rejected at level 0.05.
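The numerical claims in this example are easy to verify. The sketch below is my own check; it assumes w denotes the sum of squared deviations Σ(x_i − x̄)², which is consistent with all of the numbers in the text. It reproduces ψ̂, θ̂, the t statistic, and the prior-free portion of (4.29).

```python
import math

n, xbar, w, mu0 = 14, 2.7, 41.0, 1.5

# MLE of sigma under the hypothesis M = mu0 (this is psi-hat).
psi_hat = math.sqrt((w + n * (xbar - mu0) ** 2) / n)

# MLE of (mu, sigma) under the alternative (this is theta-hat).
theta_hat = (xbar, math.sqrt(w / n))

# Usual t statistic for testing M = mu0.
t = (xbar - mu0) / math.sqrt(w / (n * (n - 1)))

# Portion of (4.29) that does not depend on the prior.
portion = n / math.sqrt(2 * math.pi * w) * (1 + t ** 2 / (n - 1)) ** (-(n - 1) / 2)
```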
[Figure 4.30. The logarithm of the ratio f_ψ(ψ̂)/f_a(θ̂) for each of the 125 priors, plotted against the prior sequence (0 to 120).]
then make detailed comparisons with Bayes factors. At this point, we only
mention that the results can often be in stark contrast. In particular, data
with very small significance probability (supposedly suggesting that the
data do not support the hypothesis) can have relatively large values for the
lower bounds on the Bayes factor (suggesting that the data do not conflict
with the hypothesis to a great extent). This conflict is sometimes called
"Lindley's paradox" [see Lindley (1957) and Jeffreys (1961)].
If one believes that the probability is 0 that the parameter lies in a low-
dimensional subset of the parameter space, then it is not appropriate to
test the types of hypotheses we have considered in this section. As an alter-
native, one can calculate a measure of how far the parameter of interest is
from the hypothesized low-dimensional subset. For example, if the parameter
is Θ = (M, Σ) and the hypothesis is that M is near 0, then the posterior
distribution of |M| contains a great deal of information about how far M is
from 0. Also, the posterior distribution of |M|/Σ contains information of a
similar sort.
Example 4.31. Suppose that X ~ N(θ, 1), given Θ = θ, and that our hypothesis
is that Θ is near θ0. If we consider all prior distributions of the form Θ ~ N(θ0, τ²),
then the posterior distribution of Θ is N([θ0 + τ²x]/[1 + τ²], τ²/[1 + τ²]). For each
δ > 0, we can calculate Pr(|Θ − θ0| ≤ δ | X = x). This probability is smallest in the
limit as τ²/[1 + τ²] → 1, which means τ² → ∞. This corresponds to the usual improper prior. After
observing X = x, one could plot Pr(|Θ − θ0| ≤ δ) (using the improper prior) as
a function of δ to describe how far Θ is likely to be from θ0. For example, if
x = θ0 + 1.96, then Pr(|Θ − θ0| ≤ δ) = 0.05 for δ = 0.3983.
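The final number can be checked directly: under the improper prior the posterior of Θ given X = x is N(x, 1), so Pr(|Θ − θ0| ≤ δ) = Φ(θ0 + δ − x) − Φ(θ0 − δ − x). The sketch below is my own verification of δ = 0.3983 when x = θ0 + 1.96.

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_within(delta, x_minus_theta0):
    """Pr(|Theta - theta0| <= delta) under the N(x, 1) posterior."""
    z = x_minus_theta0
    return Phi(delta - z) - Phi(-delta - z)

p = prob_within(0.3983, 1.96)
```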
For multiparameter problems, there may be many possible summaries of
the parameters that measure the extent to which the parameter differs from
the hypothesis. We will consider a very general class of such summaries in
Section 8.2.3.
test φ is uniformly most cautious (UMC) floor α if, for every other floor α
test ψ, β_ψ(θ) ≥ β_φ(θ) for all θ ∈ Ω_H.
In some cases, both criteria (UMP and UMC) lead to the same optimal
tests. In some cases, they do not. (See Problem 31 on page 289.) Either
way, there is asymmetry in these definitions. A different criterion is used for
protecting against one type of error than that used for protecting against
the other. One argument given for the particular choice is that type I error
is more costly than type II error, so we arrange for the maximum type
I error probability to be small. However, what often happens is that the
probability of type II error can become even smaller for most values of the
parameter. Here is a simple example.
Example 4.35. Suppose that X ~ Poi(θ) given Θ = θ, and that Ω = {1, 10}.
We are interested in testing H : Θ = 1 versus A : Θ = 10. The MP level 0.05 test
4.3. Most Powerful Tests 231
    f_1(x) = 0,
    f_0(x) > 0.
All of these tests are MP of their respective levels and MC of their respective
floors.
Note that φ_∞ will have size 0 because it never rejects H when f_0 > 0. On
the other hand, φ_0 will have the largest possible size for an admissible test,
equal to 1 in many problems but not always.
The following result gives conditions under which MP tests are essentially
unique.
Lemma 4.38. Let Ω = {θ0, θ1} and let P_θ ≪ ν, for some measure ν and
both values of θ. Let f_i(x) = dP_{θ_i}/dν(x) for i = 0, 1, and let

    B_k = {x : f_1(x) = k f_0(x)}.

Suppose that for all k ∈ [0, ∞], P_{θ_i}(B_k) = 0 for i = 0, 1. Let φ be a test of
H : Θ = θ0 versus A : Θ = θ1 of the form given in Proposition 4.37,
and let ψ be another test such that β_ψ(θ0) = β_φ(θ0). Then either φ = ψ,
a.s. [P_{θ_i}] for i = 0, 1, or β_φ(θ1) > β_ψ(θ1).
PROOF. Let φ and ψ be as stated in the lemma. Define A_> = {x : ψ(x) >
φ(x)} and A_< = {x : ψ(x) < φ(x)}. Clearly, φ is in the form of a test from
Proposition 4.37, and so it is MP of its size. Also, since φ only takes on the
values 0 and 1, we have
It follows that
Figure 4.44 has several features that are typical of all risk sets.
Lemma 4.43. The risk set for a simple-simple hypothesis-testing problem
is closed, convex, and symmetric about the point (1/2, 1/2). It also contains
that portion of the line a1 = 1 − a0 lying in the unit square.
PROOF. The rule φ(x) ≡ a0 corresponds to the point (a0, 1 − a0) for each a0
between 0 and 1, so the risk set contains the portion of the line a1 = 1 − a0
which lies in the unit square.
Suppose that (a, b) is in the risk set. The symmetrically placed point
about (.5, .5) is (1 − a, 1 − b). If φ produces the first point, then 1 − φ
produces the second point. Hence the risk set is symmetric about (.5, .5).
Lemma 3.74 shows that the risk set is convex.
To show that the risk set is closed, we need only show that it contains
its lower boundary. The rest of the boundary is included by symmetry and
convexity (and the fact that the line a1 = 1 − a0 is in). By Proposition 3.91,
a Bayes rule exists for every prior. By Lemma 3.96, every point on ∂_L is
the risk function for one of these Bayes rules; hence ∂_L is in the risk set. □
The definition of the lower boundary ∂_L of the risk set (see Definition
3.71) is designed so that exactly those tests which produce points
on ∂_L are admissible. The lower boundary also consists solely of Bayes
rules. (See Lemma 3.96.) Proposition 3.91 tells us that if a Bayes rule with
respect to some prior is not in ∂_L, then one of the two prior probabilities is
0 and there is another Bayes rule with respect to that prior that is in ∂_L.
These considerations lead to the following result.
Lemma 4.45.9 If a test φ is MP level α for testing H : Θ = θ0 versus
A : Θ = θ1, then either β_φ(θ1) = 1 or β_φ(θ0) = α.
PROOF. We will prove the contrapositive. Suppose that both β_φ(θ1) < 1
and β_φ(θ0) < α. Create another test φ′ as follows. Let A = {x : φ(x) < 1}.
Note that P_{θ1}(A) > 0, since β_φ(θ1) < 1. Let g_c(x) = min{c, 1 − φ(x)}. Then,
for c ≥ 0, h_i(c) = E_{θ_i} g_c(X) ≥ 0 and h_i(c) is nondecreasing in c. Also, it is
easy to see that h_i(c) is continuous in c, since |h_i(c) − h_i(d)| ≤ |c − d|. It
follows that there exists c such that h_0(c) = α − β_φ(θ0). Let φ′ = φ + g_c. It
follows that β_{φ′}(θ0) = α. Also, φ′(x) > φ(x) for all x ∈ A. Since P_{θ1}(A) > 0,
it follows that h_1(c) > 0 and β_{φ′}(θ1) > β_φ(θ1). So φ is not MP level α. □
Lemma 4.45 says that a test that is MP level a must have size a unless
all tests with size a are inadmissible. This result allows us to say when the
two optimality criteria (MP and MC) are equivalent in the simple-simple
testing situation.
Proposition 4.46. Suppose that a0 and a1 are both strictly between 0 and
1. Suppose that a test φ of H : Θ = θ0 versus A : Θ = θ1 corresponds to
the point (a0, a1) in the risk set. Then φ is MC floor 1 − a1 if and only if
it is MP level a0.
The reason for the restriction that a0 and a1 be strictly between 0 and
1 is that many tests with a0 ∈ {0, 1} or a1 ∈ {0, 1} are inadmissible even
though they may satisfy one of the two optimality criteria.
The following lemma allows us to conclude that if φ is an MP level α test
and we switch the names of hypothesis and alternative, then 1 − φ becomes
an MC floor 1 − α test in the new problem. (See Problem 30 on page 289
for a more general version.)
Lemma 4.47. If φ is MP level α for testing H : Θ = θ0 versus A : Θ = θ1,
then 1 − φ has the smallest power at θ1 among all tests with size at least
1 − α.
PROOF. First, note that 1 − φ has size at least 1 − α. Next, suppose that
β_φ(θ1) = 1. Then 1 − φ has power 0 at θ1 and is clearly the least powerful
of any class to which it belongs. By Lemma 4.45, the only other case to
consider is that in which β_φ(θ0) = α. In this case, 1 − φ has level 1 − α.
Suppose, to the contrary, that β_ψ(θ0) ≥ 1 − α and β_ψ(θ1) < β_{1−φ}(θ1). Then
β_{1−ψ}(θ0) ≤ α and β_{1−ψ}(θ1) > β_φ(θ1), which contradicts the assumption
that φ is MP level α. □
The following lemma says that in the comparison of two MP Neyman–
Pearson tests, the one with the smaller level will also have the smaller
power.
Lemma 4.48.10 Let {P_θ : θ ∈ Ω} be a parametric family. If φ1 is a level
α1 test of the form of the Neyman–Pearson fundamental lemma 4.37 for
testing H : Θ = θ0 versus A : Θ = θ1, and if φ2 is a level α2 test of that
form with α1 < α2, then β_{φ1}(θ1) < β_{φ2}(θ1).
PROOF. By the Neyman–Pearson fundamental lemma 3.87, both φ1 and
φ2 are admissible. If β_{φ1}(θ1) ≥ β_{φ2}(θ1), then φ2 is inadmissible. □
The Neyman–Pearson fundamental lemma 4.37 tells us all of the admissible
MP and MC tests. We also saw (Theorem 3.95 and Lemma 3.96) that
these are the tests corresponding to points on ∂_L, the lower boundary of
the risk set, and they are the Bayes rules with respect to positive priors and
one Bayes rule for each of the priors that assign 0 probability to one of the
parameter values. The usual classical approach to choosing one of the admissible
tests is not to choose a prior distribution and then take the Bayes
rule, but rather to choose a value of α and then choose the MP level α test.
In cases with simple hypotheses and simple alternatives, the classical and
Bayesian procedures will agree. That is, for every prior distribution, there
is a formal Bayes rule and an α such that the formal Bayes rule is MP level
α. Similarly, for every α, there is a prior such that the MP level α test is
a formal Bayes rule. Only in cases more complicated than those described
so far can we distinguish these two approaches. Example 4.49 is one such
case.
Example 4.49. Suppose that X1 and X2 are conditionally independent U(0, θ)
random variables given Θ = θ and Ω = {1, 2}. Let Ω_H = {1} and Ω_A = {2}.
Suppose that Z is independent of X1 and X2 given Θ with distribution Ber(1/2).
Hence Z is ancillary. Suppose that we observe Z and X = max_{1≤i≤n} X_i, where
n = 1 if Z = 0 and n = 2 if Z = 1. That is, the sample size is random (n = Z + 1)
but ancillary. The marginal densities of X (given Θ = 1, 2) are
11 This test is not the MP level α test based on the joint distribution of the
data (X, Z).
12 These Bayes rules will be the MP level α tests based on the joint distribution
of (X, Z) according to the Neyman–Pearson fundamental lemma 4.37.
We always have φ_{π1}(x) = 1 for x ≥ 1. For 0 < x < 1, the value of φ_{π1}(x) is

    π1                      Z = 0        Z = 1
    < 1/5                   1            1
    = 1/5                   1            arbitrary
    between 1/5 and 1/3     1            0
    = 1/3                   arbitrary    0
    > 1/3                   0            0

Notice that the Bayes rule is never the same as either the unconditional level α
test or the conditional level α test when 0 < α < 1. That is, there is no value of
π1 strictly between 0 and 1 such that the Bayes rule rejects H for some (but not
all) 0 < x < 1 for both Z = 0 and Z = 1. The Bayes rule either rejects H for all
values of 0 < x < 1 for at least one of the Z values or it rejects H for no values
of 0 < x < 1 for at least one of the Z values. The power functions of the Bayes
rules are
    π1                      β(1)         β(2)
    < 1/5                   1            1
    = 1/5                   (1+a)/2      (7+a)/8
    between 1/5 and 1/3     1/2          7/8
    = 1/3                   a/2          (5+2a)/8
    > 1/3                   0            5/8

Here a is any number between 0 and 1 corresponding to the "arbitrary" parts of
some of the Bayes rules. Note that for each α between 0 and 1, there is a Bayes
rule with size α. (For example, to get α = 0.05, let a = 0.1 in the fourth row of
the table. One such test is the following. If Z = 1 is observed, φ_{1/3}(x) = 1 for
x ≥ 1, and if Z = 0 is observed, φ_{1/3}(x) = 1 for x > 0.9.) Since the Bayes rule
with size α is the MP level α test, it has higher power than the unconditional
size α test. (See Problem 15 on page 287.) On the other hand, it does not have
conditional level α given the ancillary.
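The entries in the power table can be reproduced by direct computation. The sketch below is my own check of the fourth row (π1 = 1/3, rejection probability a on 0 < x < 1 when Z = 0); it uses the fact that, given Θ = θ and n uniforms on (0, θ), P(X < 1) = min(1, 1/θ)^n.

```python
def power(theta, r0, r1):
    """Power at Theta = theta of a rule that always rejects for x >= 1 and
    rejects with probability r_z for 0 < x < 1, where Z ~ Ber(1/2),
    n = Z + 1, and X is the max of n U(0, theta) variables."""
    total = 0.0
    for z, r in ((0, r0), (1, r1)):
        n = z + 1
        p_below_1 = min(1.0, 1.0 / theta) ** n  # P(X < 1 | theta, n)
        total += 0.5 * ((1 - p_below_1) + r * p_below_1)
    return total

a = 0.1
size = power(1, a, 0.0)   # beta(1) = a/2 = 0.05
pwr = power(2, a, 0.0)    # beta(2) = (5 + 2a)/8 = 0.65
```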
This example illustrates a conflict between two principles of classical
statistics. The principle of conditioning on ancillaries together with the
principle of choosing MP level α tests leads to the MP conditional size
α test. This test is dominated by the MP unconditional size α test that
ignores the ancillary. This test, in turn, is dominated by the unconditional
size α test that makes use of the ancillary. The natural conclusion is to use
this last test. But if we are to make use of the ancillary, aren't we supposed
to condition on it? And if we condition on the ancillary, aren't we supposed
to use the conditional size a test? The reason that it is difficult to justify
(in the classical framework) using the size a test based on the whole data is
that once the ancillary is observed, the conditional size of the test changes
depending on the value of the ancillary. Why should an ancillary affect
my choice of the size of the test? This begs the more important question,
"How should the size of a test be chosen in a particular problem?" There
are no general decision theoretic principles that lead one to be able to
choose the size of a test based on a loss function or a prior distribution.
There are cases in which one can find a simple correspondence between the
238 Chapter 4. Hypothesis Testing
size of a test and a loss function, but these seem to depend on additional
structure not present in all problems, or they seem to be isolated instances
not easily generalized. (See Theorem 6.74 on page 376 for a description of
some additional structure and Example 4.61 on page 241 for an isolated
example.)
Since θ > θ0 in the exponent inside the integral, the Bayes factor is a decreasing
function of x no matter what λ is. Hence, the formal Bayes rule (with 0-1-c
13By dealing with this case next, we postpone the issue that arises when the
size of the test is the supremum of the power function over the hypothesis rather
than just the value of the power function at the hypothesis. This subtle point is
actually at the root of the asymmetry between hypotheses and alternatives.
The dual concept of UMC test would apply if the hypothesis were composite
and the alternative were simple. It would then be the case that, if
we switched the names of hypothesis and alternative, 1 minus the UMP
level α test would become the UMC floor 1 − α test. (See Problem 30
on page 289.) This case is interesting in that it is never discussed in the
hypothesis-testing literature because the classical theory is not equipped
to deal with it. If the power function is continuous (as it is in most of
the examples that we can calculate), the power of an MP level α test of
H : Θ < θ0 versus A : Θ = θ0 will be α. The optimality criterion gives
us no way to choose between level α tests. They all have the same power
on the alternative. Clearly, though, some are better than others. In fact,
those that have smaller power functions for θ < θ0 are better. But the MP
criterion does not accommodate such comparisons.
    f_{X|Θ}(x|θ2)/f_{X|Θ}(x|θ1) = [c(θ2)/c(θ1)] exp{x(θ2 − θ1)},

    f_{X|Θ}(x|θ2)/f_{X|Θ}(x|θ1) = [1 + (x − θ1)²]/[1 + (x − θ2)²].
Example 4.54. Suppose that f_{X|Θ}(x|θ) = 1/θ for 0 < x < θ. For θ2 > θ1, write
the likelihood ratio as

    f_{X|Θ}(x|θ2)/f_{X|Θ}(x|θ1) = { undefined   if x ≤ 0,
                                    θ1/θ2       if 0 < x < θ1,
                                    ∞           if θ1 ≤ x < θ2,
                                    undefined   if x ≥ θ2.

The two "undefined" regions can be ignored because the ratio need only be
monotone in x a.e. [P_{θ1} + P_{θ2}]. The "undefined" regions have 0 probability under
both P_{θ1} and P_{θ2}. The likelihood ratio is then seen to be monotone increasing for
every θ1 < θ2, although it is not strictly increasing. It only takes on two different
values.
[Figure 4.59. CDF of X, with the point x0 marked on the horizontal axis and the levels 1 − α and α − α* indicated on the vertical axis (which runs to 1.0).]
Then α* = P_{θ0}((x0, ∞)) ≤ α and P_{θ0}({x0}) ≥ α − α*. (See Figure 4.59.) Let
which increases in x. Hence, this family has MLR increasing. The UMP level 0.05
test of H : Θ ≤ 1 versus A : Θ > 1 is

    φ(x) = { 1   if x > ν,
             γ   if x = ν,
             0   if x < ν,

where γ and ν satisfy

    γ exp(−1)/ν! + Σ_{x=ν+1}^∞ exp(−1)/x! = 0.05.
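Reading the exp(−1)/x! terms as Poisson(1) probabilities, ν and γ can be found by accumulating tail probabilities until the level 0.05 is bracketed. The sketch below is my own computation under that reading; it gives ν = 3 and γ ≈ 0.506.

```python
import math

def poisson_pmf(x, lam=1.0):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

alpha = 0.05

# Find the smallest nu with P(X > nu) <= alpha under Theta = 1.
nu = 0
while 1 - sum(poisson_pmf(x) for x in range(nu + 1)) > alpha:
    nu += 1

tail = 1 - sum(poisson_pmf(x) for x in range(nu + 1))  # P(X > nu)
gamma = (alpha - tail) / poisson_pmf(nu)               # randomization at nu
```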
There are also versions of Theorem 4.56 for decreasing MLR and for
hypotheses of the form H : Θ ≥ θ0.
Proposition 4.62. If {P_θ : θ ∈ Ω} is a parametric family with decreasing
MLR, then any test of the form

    φ(x) = { 1   if x < x0,
             γ   if x = x0,      (4.63)
             0   if x > x0

has a nondecreasing (in θ) power function. Furthermore, it is UMP of its
size for testing H : Θ ≤ θ0 versus A : Θ > θ0, no matter what θ0 is.
Finally, for each α ∈ [0, 1] and each θ0 ∈ Ω, there exists x0 ∈ ℝ ∪ {∞}
and γ ∈ [0, 1] such that the test is UMP level α for testing H versus A.
Proposition 4.64. If {P_θ : θ ∈ Ω} is a parametric family with increasing
MLR, then any test of the form (4.63) has a nonincreasing (in θ) power
function. Furthermore, it is UMP of its size for testing H : Θ ≥ θ0 versus
A : Θ < θ0, no matter what θ0 is. Finally, for each α ∈ [0, 1] and each
θ0 ∈ Ω, there exists x0 ∈ ℝ ∪ {∞} and γ ∈ [0, 1] such that the test is
UMP level α for testing H versus A.
The proofs of these propositions are all very similar to the proof of The-
orem 4.56. The tests in these propositions are also called one-sided tests.
The following simple result shows that one-sided tests also minimize the
power function on the hypothesis among tests with the same size.
Corollary 4.66.14 Let the family of distributions have MLR, and suppose
that the hypothesis is either H : Θ ≤ θ0 or H : Θ ≥ θ0. A one-sided UMP
level α test φ satisfies β_φ(θ) ≤ β_ψ(θ) for all θ ∈ Ω_H and all tests ψ such
that β_ψ(θ0) ≥ α.
PROOF. Let ψ satisfy β_ψ(θ0) ≥ α, and let θ ∈ Ω_H. As in the proof of
Theorem 4.56, we can show that an MP level 1 − α test of H′ : Θ = θ0
versus A′ : Θ = θ is any one-sided test with size 1 − α. Since 1 − φ is
a one-sided test with size 1 − α and 1 − ψ has level 1 − α, it follows that
β_{1−φ}(θ) ≥ β_{1−ψ}(θ), hence β_φ(θ) ≤ β_ψ(θ). □
The following result is a simple consequence of Lemma 4.38, and it gives
conditions under which UMP tests are essentially unique. It will also be
useful in showing that there are situations in which there are no UMP tests
of H : Θ = θ0 versus A : Θ ≠ θ0.
Proposition 4.67. Suppose that {P_θ : θ ∈ Ω} is a parametric family with
P_θ ≪ ν for all θ and with increasing MLR. Let f_{X|Θ}(·|θ) denote dP_θ/dν.
Let θ0 ∈ Ω, and define
Theorem 4.68. Let the action space be ℵ = {0, 1}. Suppose that a parametric
family has MLR and that there exists θ0 ∈ Ω such that

    [L(θ, 1) − L(θ, 0)](θ0 − θ)   (4.69)

has the same sign for all θ ≠ θ0. Then the appropriate family of one-sided
tests is an essentially complete class.
PROOF. Consider the case in which (4.69) is always positive and the family
has increasing MLR. Let the hypothesis be H : Θ ≤ θ0. Let φ be any test.
Then
Now let ψ be an arbitrary test. We must show that there is a one-sided test
which is at least as good as ψ. Let φ be a one-sided test with β_φ(θ0) =
β_ψ(θ0). Then Theorem 4.56 and Corollary 4.66 give us that β_ψ(θ) − β_φ(θ)
has the same sign as θ0 − θ. This implies that R(θ, ψ) ≥ R(θ, φ) for all θ.
For the other cases (decreasing MLR and/or (4.69) always negative),
use one of Propositions 4.62–4.65 together with Corollary 4.66 to obtain a
similar result. □
We can use Theorem 4.68 to help prove that in an MLR family with
a one-sided hypothesis, one-sided tests are UMC as well as UMP. (See
Problems 23 and 24 on page 288.)
Proposition 4.70.15 Let φ be a one-sided test as in Theorem 4.56 or in
Propositions 4.62–4.65 for a one-sided hypothesis versus the corresponding
one-sided alternative in an MLR family. Suppose that the base of φ is γ.
Then φ is UMC floor γ.
Suppose that we have a prior distribution μ_Θ on the parameter space.
If there is a test with finite Bayes risk and the loss function is bounded
below, then Theorem 4.68 allows us to conclude that one-sided tests are
formal Bayes rules. (See Problem 33 on page 289.) For the case of 0-1-c
15 This is not precisely the same as Corollary 4.66. Corollary 4.66 says that the
power function is minimized on the hypothesis subject to the power function at
θ0 being at least α. If a power function is monotone but not continuous, the base
of the test might be different from its size.
loss, the result follows in a simpler fashion from the fact that the posterior
probability of a semi-infinite interval is a monotone function of the data in
MLR families.
Theorem 4.71. Suppose that {P_θ : θ ∈ Ω} is a parametric family of distributions
for X with MLR and that μ_Θ is an arbitrary prior distribution
on (Ω, τ). Then the posterior probability given X = x that Θ is in a semi-infinite
interval is a monotone function of x.
PROOF. We will prove that if the family has increasing MLR, then the
posterior probability that Θ ≥ θ0 is a nondecreasing function of x. The
other cases are all similar. Let x1 < x2. The posterior probability is nondecreasing
if and only if the posterior odds in favor of Θ ≥ θ0 are, and the odds given
X = x2 minus the odds given X = x1 equal

    (∫_{(−∞,θ0)} f_{X|Θ}(x2|θ) dμ_Θ(θ) ∫_{(−∞,θ0)} f_{X|Θ}(x1|θ) dμ_Θ(θ))^{−1}
    × ∫_{[θ0,∞)} ∫_{(−∞,θ0)} [f_{X|Θ}(x2|θ2) f_{X|Θ}(x1|θ1)
        − f_{X|Θ}(x1|θ2) f_{X|Θ}(x2|θ1)] dμ_Θ(θ1) dμ_Θ(θ2).   (4.72)

Increasing MLR implies that f_{X|Θ}(x2|θ2) f_{X|Θ}(x1|θ1) ≥ f_{X|Θ}(x1|θ2) f_{X|Θ}(x2|θ1)
for all x1 < x2 and all θ1 < θ2. This makes the last expression in the last
line of (4.72) nonnegative, and the result is proven. □
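Theorem 4.71 can be illustrated numerically. The sketch below is a toy example of mine (not from the text): the N(θ, 1) family has increasing MLR, so with an arbitrary discrete prior on three points, Pr(Θ ≥ 0 | X = x) should be nondecreasing over a grid of x values.

```python
import math

thetas = [-1.0, 0.0, 2.0]
prior = [0.5, 0.3, 0.2]  # arbitrary discrete prior weights

def likelihood(x, theta):
    """N(theta, 1) likelihood, up to a constant."""
    return math.exp(-(x - theta) ** 2 / 2)

def posterior_prob_geq(x, theta0=0.0):
    """Posterior probability that Theta >= theta0 given X = x."""
    weights = [p * likelihood(x, t) for p, t in zip(prior, thetas)]
    num = sum(w for w, t in zip(weights, thetas) if t >= theta0)
    return num / sum(weights)

xs = [i / 10 for i in range(-50, 51)]
probs = [posterior_prob_geq(x) for x in xs]
monotone = all(p2 >= p1 - 1e-12 for p1, p2 in zip(probs, probs[1:]))
```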
Corollary 4.77.17 Let T be any set, let f : T → ℝ and g_i : T → ℝ,
for i = 1, …, n, be functions, and let λ1, …, λn be real numbers. If t0
maximizes f(t) + Σ_{i=1}^n λ_i g_i(t) and satisfies g_i(t0) = c_i for i = 1, …, n,
then t0 maximizes f(t) subject to g_i(t) ≥ c_i for each λ_i > 0 and g_i(t) ≤ c_i
for each λ_i < 0.
The following lemma is sometimes called the generalized Neyman-Pearson
lemma due to the resemblance it bears to Proposition 4.37.
Lemma 4.78.18 Let p_0, p_1, ..., p_n be integrable (and not a.e. 0) functions
(with respect to a measure ν), and let

  φ_0(x) = { 1     if p_0(x) > Σ_{i=1}^n k_i p_i(x),
             γ(x)  if p_0(x) = Σ_{i=1}^n k_i p_i(x),
             0     if p_0(x) < Σ_{i=1}^n k_i p_i(x),

where 0 ≤ γ(x) ≤ 1 and the k_i are constants. Then φ_0 minimizes ∫[1 −
φ(x)]p_0(x)dν(x) subject to

  ∫ φ(x)p_j(x)dν(x) ≤ ∫ φ_0(x)p_j(x)dν(x), for those j such that k_j > 0,
  ∫ φ(x)p_j(x)dν(x) ≥ ∫ φ_0(x)p_j(x)dν(x), for those j such that k_j < 0.
PROOF. Let φ be an arbitrary measurable function taking values between
0 and 1 which satisfies the preceding inequality constraints. Since φ(x) ≤
φ_0(x) whenever p_0(x) − Σ_{i=1}^n k_i p_i(x) > 0 and φ(x) ≥ φ_0(x) whenever
p_0(x) − Σ_{i=1}^n k_i p_i(x) < 0, it is clear that

  ∫ [φ_0(x) − φ(x)][p_0(x) − Σ_{i=1}^n k_i p_i(x)] dν(x) ≥ 0.
Corollary 4.80.19 Let p_0, p_1, ..., p_n be integrable (and not a.e. 0) func-
tions (with respect to a measure ν), and let

  φ_0(x) = { 0     if p_0(x) > Σ_{i=1}^n k_i p_i(x),
             γ(x)  if p_0(x) = Σ_{i=1}^n k_i p_i(x),
             1     if p_0(x) < Σ_{i=1}^n k_i p_i(x),

where 0 ≤ γ(x) ≤ 1 and the k_i are constants. Then φ_0 maximizes ∫[1 −
φ(x)]p_0(x)dν(x) subject to

  ∫ φ(x)p_j(x)dν(x) ≥ ∫ φ_0(x)p_j(x)dν(x), for those j such that k_j > 0,
  ∫ φ(x)p_j(x)dν(x) ≤ ∫ φ_0(x)p_j(x)dν(x), for those j such that k_j < 0.
The theorems we can prove assume that we are dealing with a one-
parameter exponential family.
Lemma 4.81.20 Assume that the parametric family has monotone likeli-
hood ratio. If φ is an arbitrary test and θ1 < θ2, define α_i = β_φ(θ_i) for
i = 1, 2. Then there is a test of the form

  ψ(x) = { 1    if c1 < x < c2,
           γ_i  if x = c_i,
           0    if x < c1 or x > c2,

with c1 ≤ c2 such that β_ψ(θ_i) = α_i for i = 1, 2.
PROOF. Define φ_w to be the UMP level w test of H : Θ ≤ θ1 versus
A : Θ > θ1, and for each 0 ≤ u ≤ 1 − α1, set

  φ'_u(x) = φ_{α1+u}(x) − φ_u(x).

First, note that since α1 + u ≥ u for all u, 0 ≤ φ'_u(x) ≤ 1 for all u and
all x. This means that φ'_u is really a test. By design, β_{φ'_u}(θ1) = α1. Also,
φ'_u has the form of ψ for each u (with c1 or c2 possibly infinite). This is
true since all φ_w(x) are 0 for small x and 1 for large x. By construction,
φ'_0 = φ_{α1} is the MP level α1 test of H′ : Θ = θ1 versus A′ : Θ = θ2, and
φ'_{1−α1} = 1 − φ_{1−α1} is the least powerful such test. Since φ is also a level α1
test of H′ versus A′, it follows that

  β_{φ'_{1−α1}}(θ2) ≤ α2 ≤ β_{φ'_0}(θ2).

If we can show that β_{φ'_u}(θ2) is continuous in u, we can conclude that there
exists u such that β_{φ'_u}(θ2) = α2. The proof of this is left to the reader.
(See Problem 40 on page 290.) It follows that φ'_u has the form of ψ and
β_{φ'_u}(θ_i) = α_i for i = 1, 2. □
  1 = a1 exp(b1 c1) + a2 exp(b2 c1),
  1 = a1 exp(b1 c2) + a2 exp(b2 c2),        (4.83)

where c1 and c2 are from the definition of φ_0. The reader can easily verify
that this system of linear equations has nonzero determinant and hence has
a solution. The solution must have both a1, a2 > 0. Set k_i = a_i c(θ0)/c(θ_i).
With these values of k1, k2, apply Lemma 4.78. Note that minimizing
∫[1 − φ(x)]p_0(x)dν(x) is equivalent to maximizing β_φ(θ0). The test that
maximizes β_φ(θ0) subject to β_φ(θ_i) ≤ β_{φ_0}(θ_i) for i = 1, 2 rejects when

  1 > a1 exp(b1 x) + a2 exp(b2 x),

where a1 and a2 are the solutions to the two linear equations above. Since
a1 exp(b1 x) + a2 exp(b2 x) goes to infinity as x → ±∞, it follows that φ(x) =
1 for c1 < x < c2, which leads to φ(x) = φ_0(x). This same argument applies
for all θ0 between θ1 and θ2, hence the same φ_0 maximizes β_φ(θ) for all
θ1 < θ < θ2.
Next, try to minimize β_φ(θ0) for θ0 < θ1. (An identical argument works
for θ0 > θ2.) This time, we will use Corollary 4.80. Set b_i = θ_i − θ0 for
i = 1, 2. Next, note that the function a1 exp(b1 x) + a2 exp(b2 x) is strictly
monotone if a1 and a2 have the same signs. If a1 < 0 < a2, the derivative
250 Chapter 4. Hypothesis Testing
has only one zero and the function goes to 0 as x → −∞ and to ∞ as
x → ∞. This means that the function equals 1 for only one value of x.
Hence, the solution to the equations in (4.83) must have a1 > 0 > a2.
Solve the equations and set k_i = a_i c(θ0)/c(θ_i). Since maximizing ∫[1 −
φ(x)]p_0(x)dν(x) is the same as minimizing β_φ(θ0), the test that minimizes
β_φ(θ0) subject to β_φ(θ1) ≥ α1 and β_φ(θ2) ≤ α2 rejects when

  1 < a1 exp(b1 x) + a2 exp(b2 x),

with a1 > 0 > a2 and b2 > b1 > 0. Since a1 exp(b1 x) + a2 exp(b2 x) goes
to 0 as x → −∞ and goes to −∞ as x → ∞, it follows that φ(x) = 1 for
c1 < x < c2. Once again, we get the same test for all θ0 and the same test
as before.
Finally, consider the test φ_α(x) ≡ α, and now suppose that α1 = α2 = α.
Lemma 4.81 guarantees that c1, c2, γ1, and γ2 can be chosen so that φ_0 has
the stated form with α1 = α2 = α. The power function of φ_α is the constant
α. It must be that β_{φ_0}(θ) ≤ α for every θ ∈ Ω_H. Hence φ_0 has level α. Since
every level α test ψ must satisfy β_ψ(θ_i) ≤ α (i = 1, 2), and φ_0 maximizes
the power on the alternative subject to these constraints, it follows that φ_0
is UMP level α. □
  φ(y) = { 1 if 0.02133 < y < 0.83685,
           0 otherwise.
Theorem 4.68 says that the class of UMP level α tests is essentially com-
plete for decision problems that include hypothesis-testing loss functions
for one-sided hypotheses. The first part of Example 4.87 shows that for
two-sided hypotheses, the class of UMP level α tests is not essentially com-
plete. The formal Bayes rule given there is admissible and the risk function
is not the same as that of any UMP level α test. This, then, is the first point
The way to circumvent the lack of UMP level α tests in cases like Ex-
ample 4.90 is to create a new criterion that one-sided tests fail to satisfy
when the alternative is two-sided.23 The rationale is that even though the
power function of a one-sided test is high in one part of the alternative, it
is very low in the other part. The new optimality criterion requires that
the power function be higher on the alternative than on the hypothesis.
22Since all N(θ, 1) distributions are mutually absolutely continuous, a.s. with
respect to one of them means a.s. with respect to all of them.
23When the conditions of Proposition 4.67 fail, there may be UMP level α tests
for two-sided alternatives. See Problem 27 on page 288 for an example that even
has MLR.
Definition 4.91. A test φ is unbiased level α if it has level α and if β_φ(θ) ≥
α for all θ ∈ Ω_A. If Ω ⊆ ℝ^k, a test φ is called α-similar if β_φ(θ) = α for
each θ in the intersection of the closures of Ω_H and Ω_A. More generally,
φ is α-similar on B ⊆ Ω if β_φ(θ) = α for each θ ∈ B. If φ is UMP among
all unbiased level α tests, then φ is uniformly most powerful unbiased
(UMPU) level α.
The concepts of unbiased level α and α-similar are closely related.
Proposition 4.92.24 If a test φ is unbiased level α and β_φ(·) is continuous,
then φ is α-similar.
Since being unbiased level α implies that φ has floor α, the dual concept
to unbiased level α is simply unbiased floor α.
Definition 4.94. A test φ is unbiased floor α if it has floor α and if β_φ(θ) ≤
α for all θ ∈ Ω_H. If φ is UMC among all unbiased floor α tests, then φ is
uniformly most cautious unbiased (UMCU) floor α.
It is interesting to note that the collection of unbiased tests may not be
essentially complete. The test in the first part of Example 4.87 on page 251
is admissible with 0-1-0.5 loss, but it is not unbiased and it does not have
the same risk function as an unbiased test. The restriction to unbiased
tests, just like the restriction to UMP level α tests in the previous section,
does not follow from considerations of admissibility. It is true that the
restriction to unbiased tests rules out the use of one-sided tests in problems
like Example 4.90 on page 253, but it also rules out many admissible tests.
Example 4.95. Suppose that Y ~ Exp(θ) given Θ = θ with Ω_H = [1, 2) and
Ω_A = (0, 1) ∪ [2, ∞). Suppose that the loss function is asymmetric in the
following way:

  L(θ, a) = { 3    if θ ∈ Ω_H and a = 1,
              1/2  if θ ≥ 2 and a = 0,
              1    if θ < 1 and a = 0,
              0    otherwise.

We will use the usual improper prior with Radon-Nikodym derivative 1/θ with
respect to Lebesgue measure, so that the posterior distribution is Exp(y). The
formal Bayes rule will minimize the posterior risk. The posterior risks for the two
possible decisions are

  a = 0: exp(−2y)/2 + 1 − exp(−y),        a = 1: 3(exp(−y) − exp(−2y)).

Solving to see when the risk for a = 1 is smaller, we see that this occurs when
y < 0.2569 or y > 0.9959. The test that rejects when one of these conditions
holds has power function 0.5959 at θ = 1 and 0.5382 at θ = 2. Since it is more
24This proposition is used in the proofs of Lemma 4.96 and of Theorems 4.123
and 4.124.
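The cutoffs and power values in Example 4.95 can be checked numerically. The sketch below assumes the loss values 3 (rejecting when θ ∈ Ω_H), 1/2 (accepting when θ ≥ 2), and 1 (accepting when θ < 1), which reproduce both cutoffs; it is an illustration, not part of the text.

```python
import math

def risk_reject(y):   # a = 1: loss 3 when Theta is in [1, 2)
    return 3.0 * (math.exp(-y) - math.exp(-2.0 * y))

def risk_accept(y):   # a = 0: loss 1/2 for Theta >= 2, loss 1 for Theta < 1
    return 0.5 * math.exp(-2.0 * y) + 1.0 - math.exp(-y)

def root(lo, hi):
    # bisection for risk_reject(y) = risk_accept(y)
    g = lambda y: risk_reject(y) - risk_accept(y)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

y1, y2 = root(0.1, 0.6), root(0.7, 1.5)
assert abs(y1 - 0.2569) < 1e-3 and abs(y2 - 0.9959) < 1e-3

def power(theta):     # P(Y < y1) + P(Y > y2) when Y ~ Exp(theta)
    return (1.0 - math.exp(-theta * y1)) + math.exp(-theta * y2)

assert abs(power(1.0) - 0.596) < 1e-3
assert abs(power(2.0) - 0.5382) < 1e-3
```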
4.4. Unbiased Tests 255
We are now in a position to begin to prove that the UMPU level α test for
the case of two-sided alternatives in a one-parameter exponential family is
just 1 minus the UMP level 1 − α test for the two-sided hypothesis.
Lemma 4.99.29 In a one-parameter exponential family with natural pa-
rameter, if φ is any test of H : Θ ∈ [θ1, θ2] versus A : Θ ∉ [θ1, θ2] with
θ1 < θ2, then there is a test of the form

  ψ(x) = { 1    if x < c1 or x > c2,
           γ_i  if x = c_i,
           0    if c1 < x < c2,

such that β_ψ(θ_i) = β_φ(θ_i) for i = 1, 2.
PROOF. Lemma 4.81 says that 1 − ψ can be chosen to have β_{1−ψ}(θ_i) =
β_{1−φ}(θ_i) and thus that ψ is in the desired form. Theorem 4.82 then shows
that 1 − ψ minimizes and maximizes power in just the opposite regions
from where we want ψ to minimize and maximize power under the same
conditions. □
The tests in Lemma 4.99 are called two-sided tests. It is easy to see that
when the conditions of Lemma 4.99 hold, the class of two-sided tests is
essentially complete for hypothesis-testing loss functions. (See Problem 48
on page 292.) The next theorem says that the UMPU level α tests are a
subset of this essentially complete class. We could show, as in Example 4.87
on page 251, however, that this subset is not essentially complete.
Theorem 4.100. Assume the same conditions as Lemma 4.99. Also sup-
pose that β_ψ(θ_i) = α for i = 1, 2. A test of the form ψ is UMPU level α
and UMCU floor α.
PROOF. By comparing ψ with φ_α(x) ≡ α and using Lemma 4.99, we see
that ψ is unbiased level α and unbiased floor α. Also, Lemma 4.99 shows
that ψ is UMP α-similar and UMC α-similar. Lemma 4.96 and Proposi-
tion 4.97 can be applied since the power functions are continuous in an
exponential family. □
29This lemma is used in the proof of Theorem 4.100 and to show that the class
of two-sided tests is essentially complete.
Example 4.101. Suppose that X ~ N(μ, 1) given Θ = μ. Let Ω_H = [−1, 1] and
α = 0.1. Set c2 = 2.286 and c1 = −2.286, with θ1 = −1 and θ2 = 1. So

  ψ(x) = { 1 if x < −2.286 or x > 2.286,
           0 otherwise.
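The constant 2.286 can be recovered numerically: the power of the test "reject iff |X| > c" must equal 0.1 at θ = ±1, i.e. Φ(−c − 1) + 1 − Φ(c − 1) = 0.1. A sketch (the bisection bracket is an illustrative choice):

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_at_1(c):
    # power of "reject iff |X| > c" at theta = 1 (same at theta = -1 by symmetry)
    return Phi(-c - 1.0) + 1.0 - Phi(c - 1.0)

lo, hi = 1.0, 4.0          # power_at_1 is decreasing in c on this bracket
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if power_at_1(mid) > 0.1:
        lo = mid
    else:
        hi = mid
c = 0.5 * (lo + hi)
assert abs(c - 2.286) < 5e-3           # matches the book's rounded value
assert abs(power_at_1(c) - 0.1) < 1e-9
```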
PROOF. We prove only the first part, since the second is very similar. By
Lemma 4.45, it follows that β_φ(θ0) = α if φ is UMP level α. Let ψ be
another size α test. Since φ is UMP level α, for every ε > 0,

  β_φ(θ0 − ε) ≥ β_ψ(θ0 − ε),
  α − β_φ(θ0 − ε) ≤ α − β_ψ(θ0 − ε),
  [β_φ(θ0) − β_φ(θ0 − ε)]/ε ≤ [β_ψ(θ0) − β_ψ(θ0 − ε)]/ε.

Since the derivatives are the limits of the quantities in the last inequality
as ε goes to 0, the result follows. □
and, for every θ ≠ θ0, β_ψ(θ) is maximized among all tests ψ satisfying the
two equalities above.
PROOF. Let α = β_φ(θ0) and γ = dβ_φ(θ)/dθ|_{θ=θ0}. Let φ_w be the UMP level
w test of H : Θ ≥ θ0 versus A : Θ < θ0, and for each 0 ≤ u ≤ α, set

  φ'_u(x) = φ_u(x) + 1 − φ_{1−α+u}(x).

By design, β_{φ'_u}(θ0) = α for all u. Also, φ'_u has the form of ψ for every u
(with c1 or c2 possibly infinite). By construction, φ'_0 = 1 − φ_{1−α} is the UMP
level α test of H′ : Θ = θ0 versus A′ : Θ > θ0 and φ'_α = φ_α is the least
powerful such test. It follows from Lemma 4.103 that the derivatives of the
power functions of φ'_α and φ'_0 at θ0 are respectively the smallest possible
and the largest possible among all tests with power function α at θ0. Hence

  dβ_{φ'_α}(θ)/dθ|_{θ=θ0} ≤ γ ≤ dβ_{φ'_0}(θ)/dθ|_{θ=θ0}.
To prove that there is a ψ satisfying (4.105), we need only show that
dβ_{φ'_w}(θ)/dθ|_{θ=θ0} is continuous in w.32 Recall that

  φ_w(x) = { 1    if x < c_w,                    (4.106)
             γ_w  if x = c_w,
             0    if x > c_w.
31This theorem is used in the proofs of Corollary 4.109 and Theorem 4.124. It
is also used to show that two-sided tests form an essentially complete class.
32The proof follows part of the proof of Theorem 2 on pp. 220-221 of Ferguson
(1967).
Define h(x, g) = P_{θ0}(X < x) + gP_{θ0}(X = x), and define the random
variable V = h(X, G), where G has U(0, 1) distribution and is independent
of X and Θ. For 0 < w ≤ 1, we note that

  c_w = inf{u : F_{X|Θ}(u|θ0) ≥ w}

and

  γ_w P_{θ0}(X = c_w) = w − P_{θ0}(X < c_w).

For w = 0, c_0 = sup{u : F_{X|Θ}(u|θ0) = 0} and γ_0 = 0. It follows that for all
t, h(x, g) ≤ t if and only if either x < c_t or x = c_t and g ≤ γ_t. For t ≥ 0,
we have

  F_{V|Θ}(t|θ0) = ∫∫ I_{[0,t]}(h(x, g)) f_{X|Θ}(x|θ0) dg dν(x) = t,

so that V has U(0, 1) distribution given Θ = θ0. The derivative in question
can be written as a constant times

  E_{θ0}{X I_{[0,w]}(V)} − w E_{θ0}(X).

Since V has a continuous distribution, it follows that
the above expression is continuous in w.
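The fact that V = h(X, G) has a U(0, 1) distribution (a randomized probability integral transform) can be checked exactly for a small discrete distribution; the Bin(5, 0.3) choice below is an arbitrary illustration.

```python
from math import comb

n, p = 5, 0.3
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
cdf_minus = [sum(pmf[:x]) for x in range(n + 1)]   # P(X < x)

def prob_V_le(t):
    # V = P(X < X_obs) + G * P(X = X_obs), with G ~ U(0,1) independent of X;
    # given X = x, V is uniform on [P(X < x), P(X <= x)]
    total = 0.0
    for x in range(n + 1):
        lo = cdf_minus[x]
        total += pmf[x] * min(max((t - lo) / pmf[x], 0.0), 1.0)
    return total

for t in [0.0, 0.1337, 0.5, 0.832, 1.0]:
    assert abs(prob_V_le(t) - t) < 1e-12   # exactly uniform
```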
is ψ_0(x) = 1 if

  f_X|Θ(x|θ1) > k1 f_X|Θ(x|θ0) + k2 (∂/∂θ) f_X|Θ(x|θ)|_{θ=θ0},        (4.108)

where the signs of k1 and k2 depend on which inequalities we use. The
inequality (4.108) simplifies to exp([θ1 − θ0]x) > k1 + k2 x. This inequality
is satisfied for x outside of a bounded interval or for x in a semi-infinite
interval. We already know (Theorem 4.56 and Propositions 4.62-4.64) that
tests with ψ_0(x) = 1 for x in a semi-infinite interval are one-sided and they
minimize the power function on one side of the hypothesis. Hence, we need
ψ_0(x) = 1 for x outside of a bounded interval, and ψ_0 has the form of ψ.
Furthermore, the same ψ_0 works whether θ1 > θ0 or θ1 < θ0, by choosing
k1 and k2 correctly. □
When the conditions of Theorem 4.104 hold and the loss function is
of the hypothesis-testing type, then it follows that the class of two-sided
tests is essentially complete. Corollary 4.109 says that the class of UMPU
level α tests is a subset of this essentially complete class. At the end of
Example 4.111 on page 261, we will see that the class of UMPU tests is
not essentially complete.
Corollary 4.109.33 In a one-parameter exponential family with natural
parameter, let Ω_H = {θ0}, where θ0 is in the interior of Ω. If ψ is a size α
test of the form of Lemma 4.99 with

  dβ_ψ(θ)/dθ|_{θ=θ0} = 0,

then ψ is UMPU level α.
PROOF. Since the test φ_α(x) ≡ α has size α and 0 derivative at θ0, Theo-
rem 4.104 tells us that ψ is unbiased level α. In light of Theorem 4.104, all
we need to show is that all unbiased level α tests must have power func-
tions with 0 derivative at θ0. Any test with β_φ(θ0) = α but with nonzero
derivative will have power strictly less than α on one side or the other of
θ0 because the power function is differentiable. Such a test could not be
unbiased level α. □
Example 4.110. Suppose that X ~ N(θ, 1) given Θ = θ and that we wish to
test H : Θ = θ0 versus A : Θ ≠ θ0. To make the test ψ unbiased, we need

  0 = dβ_ψ(θ)/dθ|_{θ=θ0}
    = (d/dθ)[ 1 − ∫_{c1}^{c2} (1/√(2π)) exp(−(x − θ)²/2) dx ]|_{θ=θ0}
    = −∫_{c1}^{c2} ((x − θ0)/√(2π)) exp(−(x − θ0)²/2) dx
    = −∫_{c1−θ0}^{c2−θ0} (x/√(2π)) exp(−x²/2) dx,

which is true if and only if −(c1 − θ0) = c2 − θ0 = c. In this case β_ψ(θ0) =
2[1 − Φ(c)] = α if and only if c = Φ^{−1}(1 − α/2). This gives the usual equal-tailed,
two-sided test, which is UMPU level α.
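The equal-tailed construction can be checked numerically for, say, α = 0.05 and θ0 = 0 (illustrative values):

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def Phi_inv(p):
    # quantile by bisection (Phi is increasing)
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, theta0 = 0.05, 0.0
c = Phi_inv(1.0 - alpha / 2.0)
assert abs(c - 1.95996) < 1e-4

def beta(theta):
    # power of "reject iff |X - theta0| > c" when X ~ N(theta, 1)
    return Phi(-c - (theta - theta0)) + 1.0 - Phi(c - (theta - theta0))

assert abs(beta(theta0) - alpha) < 1e-9
h = 1e-5
deriv = (beta(theta0 + h) - beta(theta0 - h)) / (2.0 * h)
assert abs(deriv) < 1e-8   # derivative of the power function is 0 at theta0
```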
If we use a 0-1-c loss function, the Bayes rule is to reject H' if its posterior
probability is less than 1/(1 + c). The posterior probability of H' is
So, the Bayes rule is to reject H′ if |x − θ0| > d for some d. This has the same
form as the UMPU level α test.
Alternatively, suppose that Pr(Θ = θ0) = p0 > 0. Conditional on Θ ≠ θ0,
suppose that Θ ~ N(θ0, τ²). We computed the Bayes factor for this case in
Example 4.18 on page 222. The Bayes factor was given in (4.19) as

  √(1 + τ²) exp( −(x − θ0)² τ² / [2(1 + τ²)] ).

(See Problem 6 on page 286 for the entire posterior distribution of Θ.) If we use
a 0-1-c loss function, then the Bayes rule is to reject H if the probability that
it is true is less than 1/(1 + c). This corresponds to the Bayes factor being less
than some number, which in turn is easily seen to correspond to |x − θ0| > d for
some d. This is in the same form as the UMPU level α test.
Finally, suppose that we continue to let Pr(Θ = θ0) = p0 > 0, but that the
conditional prior given Θ ≠ θ0 is N(θ′, τ²) with θ′ ≠ θ0. Then the same kind
of calculation as above leads to the Bayes factor being small when x is far from
[(1 + τ²)θ0 − θ′]/τ². Such a test is two-sided but is not UMPU of its level. In fact,
the test is biased, even though it is admissible. We see once again that the class
of UMPU tests is not essentially complete.
In Example 4.111, two different types of prior distributions both led to
Bayes rules that were of the same form as the UMPU level α test. Unfor-
tunately, there did not appear to be any transparent connection between
the size α and the loss function or the prior distribution. The reason for
this is related to the inadmissibility of incoherent tests as illustrated in
Problem 42 on page 291. We will discuss this matter more in Section 4.6.
Example 4.112. Suppose that X ~ Bin(n, p) given Θ = log(p/(1 − p)). Then
the density of X with respect to counting measure on {0, ..., n} is

  f_X|Θ(x|θ) = (n choose x) exp(θx) / (1 + exp(θ))^n.

It follows that

  c1 = 0, γ1 = 0.52804,
  c2 = 5, γ2 = 0.00918.
Most people who want a level 0.05 test of this hypothesis would not bother to
compute the UMPU level 0.05 test but rather would perform what is called an
equal-tailed test. Since Theorem 4.104 says that the two-sided tests are admissible,
we could try to find a two-sided test of the form (4.113) such that the probability
of rejecting for small X equals the probability of rejecting for large X (both equal
to 0.025). In this case, the test would have

  c1 = 0, γ1 = 0.44394,
  c2 = 5, γ2 = 0.09028.

This test is biased because the derivative of the power function is 0.0236 at θ0. In
other words, the probability of rejecting the hypothesis will be slightly less than
0.05 given Θ = θ for a short interval of θ values below θ0.
One possible Bayesian solution would be to set Pr(P = p0) = q0 and let
P ~ Beta(α0, β0) otherwise, where P = exp(Θ)/(1 + exp(Θ)). Then, the Bayes
factor will be

  p0^x (1 − p0)^{n−x} ∏_{i=0}^{n−1}(α0 + β0 + i) / [ ∏_{i=0}^{x−1}(α0 + i) ∏_{j=0}^{n−x−1}(β0 + j) ].    (4.114)
In the special case with α0 = β0 = 1, the Bayes factor is

  (n + 1)(n choose x) p0^x (1 − p0)^{n−x}.

These values have been calculated for n = 10 and p0 = 1/4 in Table 4.116
together with the posterior probability when q0 = 1/2. Note that if we used a
0-1-c loss function with c = 19 (so that 1/(1 + c) = 0.05), we would still accept
H even when X = 6 was observed.
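A sketch reproducing the Bayes factor of (4.114) in the special case α0 = β0 = 1, where it reduces to (n + 1)(n choose x)p0^x(1 − p0)^{n−x}; n = 10 and p0 = 1/4 as in the example:

```python
from math import comb

n, p0, q0 = 10, 0.25, 0.5

def bayes_factor(x):
    # BF of H : P = p0 against the uniform (Beta(1,1)) alternative
    return (n + 1) * comb(n, x) * p0**x * (1 - p0) ** (n - x)

def post_prob(x):
    # posterior probability of H when Pr(P = p0) = q0 = 1/2
    bf = bayes_factor(x)
    return bf / (1.0 + bf)

# with 0-1-c loss and c = 19, H is rejected only if post_prob < 0.05;
# even X = 6 does not reject
assert post_prob(6) > 0.05
assert abs(post_prob(6) - 0.1514) < 1e-3
```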
As we noted in Section 4.2.2, we would run into trouble if we naively tried to
use an improper prior (with α0 = β0 = 0) for the alternative. If the Bayes factor
were computed with a Beta(λ, λ) prior and λ were allowed to go to 0,
then the product of the prior odds ratio q0/(1 − q0) and the Bayes factor would
converge to
Notice that the condition that the derivative of the power function be 0
in Example 4.112 was equivalent to
(4.117)
[Figure: curves labeled by x (including x = 10) plotted over the range 0.0 to 0.5; the graphic itself is not recoverable from the source.]
When UMPU level α tests do not exist, one can try to find LMPU
(locally most powerful unbiased) level α tests.35 When power functions
are continuously differentiable and φ is the unique unbiased level α test of
H : Θ = θ0 with maximum second derivative for the power function, then φ
is LMPU level α relative to d(θ) = |θ − θ0|. (See Problem 50 on page 292.)
where T_{n−1} is the CDF of the t_{n−1}(0, 1) distribution and S² = Σ_{i=1}^n (X_i − X̄)²/(n −
1). Here, the intersection of the closure of Ω_H with the closure of Ω_A is
Ω_H = {(μ, σ) : μ = μ0}. It is easy to see that ψ is α-similar, as follows. Given
(M, Σ) = (μ0, σ) ∈ Ω_H, the conditional distribution of T = (X̄ − μ0)/(S/√n)
is t_{n−1}(0, 1) for all σ, hence β_ψ((μ0, σ)) = α for all σ. Let

  U = Σ_{i=1}^n (X_i − μ0)² = (n − 1)S² + n(X̄ − μ0)².
We can write

  W = (X̄ − μ0)/√U = sign(T)/√(n[(n − 1)/T² + 1]),

so that W is a one-to-one increasing function of T and ψ(X) is a function of W.
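The identity W = sign(T)/√(n[(n − 1)/T² + 1]) and the decomposition U = (n − 1)S² + n(X̄ − μ0)² can be verified on arbitrary made-up data (the sample and μ0 below are illustrative):

```python
import math

data = [1.3, 0.7, 2.1, 1.9, 0.4, 1.6]
mu0 = 1.0
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)   # S^2
s = math.sqrt(s2)

U = sum((x - mu0) ** 2 for x in data)
# U = (n-1)S^2 + n(Xbar - mu0)^2
assert abs(U - ((n - 1) * s2 + n * (xbar - mu0) ** 2)) < 1e-12

T = (xbar - mu0) / (s / math.sqrt(n))
W = (xbar - mu0) / math.sqrt(U)
# equivalent closed form, valid for T of either sign
assert abs(W - T / math.sqrt(n * (n - 1 + T ** 2))) < 1e-12

# W is an increasing function of T
g = lambda t: t / math.sqrt(n * (n - 1 + t ** 2))
ts = [-3.0, -1.0, 0.0, 0.5, 2.0]
assert all(g(a) < g(b) for a, b in zip(ts, ts[1:]))
```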
We need to show that the conditional distribution of W given (M, Σ) = γ and
U = u is the same for all γ ∈ Ω_H and all u. If this is true, then W is independent
of U given (M, Σ) = γ ∈ Ω_H and

  E_γ[ψ(X)|U = u] = E_γ[ψ(X)] = α,

for all γ ∈ Ω_H, and ψ would have Neyman structure relative to Ω_H and U.
Since the distribution of (X_1 − μ0, ..., X_n − μ0) is spherically symmetric, we
showed in Examples B.56 and B.60 (see pages 627 and 628) that the conditional
distribution of

  ( (X_1 − μ0)/√U, ..., (X_n − μ0)/√U )
4.5. Nuisance Parameters 267
Example 4.127 (Continuation of Example 4.121; see page 266). The usual two-
sided t-test is an example of Theorem 4.124. To see this, write the joint density
of (X̄, S²) as

  [√n ((n−1)/2)^{(n−1)/2} (s²)^{(n−3)/2} / (√(2π) Γ((n−1)/2) σ^n)]
    × exp( −[n(x̄ − μ0 − [μ − μ0])² + (n − 1)s²] / (2σ²) ),

and set

  θ1 = (μ − μ0)/σ²,    θ2 = −1/(2σ²),
  v = n(x̄ − μ0),       u = n(x̄ − μ0)² + (n − 1)s².
(Note that u is the observed value of the statistic U from Example 4.121.) The-
orem 4.124 says that the UMPU level α test of H : Θ1 = 0 versus A : Θ1 ≠ 0 is
the conditional UMPU level α test of H versus A given U. Note that Θ1 = 0 is
equivalent to M = μ0. Since V has a one-parameter exponential family distribu-
tion with natural parameter Θ1 given U, the test will be a two-sided test of the
form

  φ(v|u) = { 1 if v < d1(u) or if v > d2(u),
             0 if d1(u) ≤ v ≤ d2(u),

where d1(u) and d2(u) are chosen so that the conditional power function equals
α at θ1 = 0 for all u and so that the derivative of the conditional power function
equals 0 at θ1 = 0 for all u. As we saw in Example 4.121, the two-sided t-
test has the above form with d1(u) = −c√u and d2(u) = c√u, where c > 0 is a
constant. We also saw that the t-test has conditional level α given U. The fact
that the derivative of the conditional power function is 0 follows, as in the proof
of Theorem 4.124, from the fact that the partial derivative of the marginal power
function is 0. Hence, the usual two-sided t-test is UMPU level α.
One possible Bayesian approach is to put positive probability on Ω_H. This was
done in Example 4.22 on page 224, where Bayes factors were computed. It is not
the case that the usual two-sided t-test is equivalent to rejecting H when the
Bayes factor is less than a constant. However, with a special type of improper
prior, the two tests are equivalent. In Example 4.22, we showed that if λ0 → 0
and p0 → 0 in such a way that the ratio p0/√λ0 → k, a constant, then the
posterior odds in favor of the hypothesis converge to
for t = 0, 1, .... The conditional distribution of (X, Y) given T and Γ is one-
dimensional and can be represented by X alone. It is

  f_{X|T,Γ}(x|t, λ, μ) = ( λ^x μ^{t−x} / [x!(t − x)!] ) ( Σ_{a=0}^t λ^a μ^{t−a} / [a!(t − a)!] )^{−1}.

The joint density of (X, Y) given (Θ, M) = (θ, μ) is

  f_{X,Y|Θ,M}(x, y|θ, μ) = exp(−μ[exp(θ) + 1]) exp(xθ) μ^{x+y} / (x! y!).
Every unbiased α-similar level α test must have partial derivative of the power
function with respect to θ equal to 0 at every point (log(2), μ) in G, otherwise
the power function would dip below α on the alternative. The partial derivative
of the power function of a test φ with respect to θ is

  Σ_{x=0}^∞ Σ_{y=0}^∞ φ(x, y) exp(−μ[exp(θ) + 1]) (exp(xθ) μ^{x+y} / [x! y!]) [x − μ exp(θ)]
    = E_{θ,μ}(Xφ(X, Y)) − μ exp(θ) β_φ(θ, μ).

Now, plug in θ = log(2) and set this equal to 0. Note that 2μ is the mean of X
and the power function at (log(2), μ) is α for every μ for an α-similar test. Hence,
every α-similar level α test φ must satisfy

  E_{log(2),μ}(Xφ(X, Y)) = 2μα

for all μ. Let h(t) = E(Xφ(X, Y) − Xα|T = t). Since T is complete for the sub-
parameter space, E_{log(2),μ}(h(T)) = 0 for all μ implies h(T) = 0, a.s. [P_{log(2),μ}] for
all μ. By Proposition 4.118, this is equivalent to the derivative of the conditional
power function being 0 at θ = log(2).
  { 1     if x1 > c(u),
    γ(u)  if x1 = c(u),        (4.129)
    0     if x1 < c(u),
which has maximum conditional power function for θ1 > θ1⁰ and minimum
conditional power function for θ1 < θ1⁰ subject to the power being β_φ(θ1⁰) at
θ1 = θ1⁰. Since the test in (4.129), call it ψ, minimizes and maximizes the power
in precisely the right places uniformly in u among all tests in a class that
contains φ, it follows that R(θ, ψ) ≤ R(θ, φ) for all θ if a hypothesis-testing
loss function is being used. It follows that tests of the form (4.129) form an
essentially complete class. Other hypotheses can be handled in a similar way.
arise.
Example 4.132. Suppose that X1, ..., Xn are conditionally IID N(μ, σ²) given
(M, Σ) = (μ, σ). This is an exponential family with natural parameters Θ1 =
M/Σ² and Θ2 = −1/[2Σ²]. Suppose that we wish to test H : a ≤ M ≤ b versus
A : not H. We can rewrite H in terms of the natural parameters as
The formal Bayes rule with a 0-1-c loss would be to reject H if P < 1/(1 + c).
To see what this test looks like, note that for each value of s_n, P is a decreasing
function of |t|, where

  t = √n (x̄_n − (a + b)/2) / s_n.        (4.133)

In fact,

  P = T_{n−1}( √n(b − a)/(2s_n) − t ) − T_{n−1}( −√n(b − a)/(2s_n) − t ).

So the formal Bayes rule is to reject H if |t| > d(s_n), where it is also easy to see
that d(s_n) is a decreasing function of s_n.
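A sketch of the computation of P (the t CDF is evaluated by Simpson's rule to keep the code self-contained; n, a, b, and s_n are illustrative values):

```python
import math

def t_cdf(x, df):
    # CDF of the t distribution with df degrees of freedom,
    # by Simpson's rule on [0, |x|] plus the symmetric half
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    f = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2.0)
    m = 2000
    h = abs(x) / m if x != 0 else 0.0
    area = 0.0
    for i in range(m):
        u0, u1 = i * h, (i + 1) * h
        area += (h / 6.0) * (f(u0) + 4.0 * f((u0 + u1) / 2.0) + f(u1))
    return 0.5 + area if x >= 0 else 0.5 - area

n, a, b, sn = 10, -1.0, 1.0, 2.0

def P(t):
    d = math.sqrt(n) * (b - a) / (2.0 * sn)
    return t_cdf(d - t, n - 1) - t_cdf(-d - t, n - 1)

assert abs(P(1.0) - P(-1.0)) < 1e-9         # P depends on t only through |t|
assert P(0.0) > P(1.0) > P(2.0) > P(3.0)    # and is decreasing in |t|
```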
One possible classical solution is to abandon the UMPU criterion and just use
the usual t-test for testing that M = (a + b)/2. That is, reject H if |t| > d, where
t is defined in (4.133) and d is determined to make the level α. Unfortunately,
the conditional distribution of √n(X̄_n − [a + b]/2)/S_n given (M, Σ) = (μ, σ) is
noncentral t, NCt_{n−1}(√n[μ − (a + b)/2]/σ). The power function of the usual t-test
at (μ, σ) for μ ≠ (a + b)/2 goes to 1 as σ goes to 0 for fixed μ. Hence the usual
t-test has level 1 as a test of H versus A. To get a test with level α < 1, one
could let d depend on S_n as in the formal Bayes rule (although one should note
that the formal Bayes rule also has level 1; the problem occurs as σ → ∞ for
the formal Bayes rule). Calculating the power function of the resulting test would
require a separate two-dimensional integration over the space of (X̄_n, S_n) values
for each (μ, σ) pair.
Another classical solution would be to add together the UMPU level α/2 test
of H′ : M ≤ b versus A′ : M > b and the UMPU level α/2 test of H″ : M ≥ a
versus A″ : M < a. It can be shown (see Problem 54 on page 292) that this
test has size α.37 The power function is easy to calculate. It is just the sum of
the two power functions for the two one-sided tests. If a < b, this test is biased,
since there exists θ ∈ Ω_A such that the power function at θ is close to α/2. Note,
however, that if a = b, then this test is exactly the same as the UMPU level α
test of M = a versus M ≠ a.
One could change the hypothesis in Example 4.132 to make it more like
Example 4.127.
Example 4.134 (Continuation of Example 4.127; see page 270). Suppose that
we wish to test H : M ∈ [μ0 + aΣ², μ0 + bΣ²] versus A : M ∉ [μ0 + aΣ², μ0 + bΣ²].
This is a test about a linear combination of natural parameters, namely Ψ1 =
Θ1 + 2μ0Θ2. The hypothesis is H : Ψ1 ∈ [a, b] versus A : Ψ1 ∉ [a, b]. We can
let Ψ2 = Θ2 = −1/[2Σ²]. We would now need to work with the conditional
distribution of V = n(X̄ − μ0) given U = n(X̄ − μ0)² + (n − 1)S². This conditional
distribution will have a density equal to a constant (function of u and ψ1) times
exp(vψ1)(u − v²/n)^{(n−1)/2−1} for |v| < √(nu). For each value of u, one would
have to find d1(u) and d2(u) so that the conditional power function given U = u
equals α for ψ1 = a and for ψ1 = b. Of course, one could wait until the data were
observed and then do it only for the observed value of U.
  LR = sup_{θ∈Ω_H} f_X|Θ(x|θ) / sup_{θ∈Ω} f_X|Θ(x|θ).
Example 4.135. Suppose that X1, ..., Xn are conditionally IID with Ber(θ)
distribution given Θ = θ. Let the hypothesis be H : Θ = θ0 versus A : Θ ≠ θ0.
Let Y = Σ_{i=1}^n X_i. Then

  LR = { θ0^Y (1 − θ0)^{n−Y} / [(Y/n)^Y (1 − Y/n)^{n−Y}]  if Y ∉ {0, n},
         θ0^n                                              if Y = n,
         (1 − θ0)^n                                        if Y = 0,

since θ^Y (1 − θ)^{n−Y} is largest if θ = Y/n. The LR test would be to reject H if
LR is smaller than some specified value. As a function of Y, LR increases until
Y/n reaches θ0 and then decreases. For example, if θ0 = 1/4 and n = 10, we
have a case similar to Example 4.112 on page 262. The UMPU level α = 0.05
test of H : Θ = 1/4 was found there to be a test that rejected H if Y ≥ 6 and
randomized if Y ∈ {0, 5}. If Y = 0 is observed, then LR = 0.0563. If Y = 6 is
observed, then LR = 0.0647. It follows that the level α = 0.05 LR test is not
the same as the UMPU level 0.05 test. The reason is that no LR test can reject
for Y = 6 without rejecting also for Y = 0, since LR is smaller at Y = 0 than
at Y = 6. The level 0.05 LR test will reject H for Y ≥ 7 and will randomize if
Y = 0 with probability 0.8259 of rejecting H. Note that the LR test is of the
form of Theorem 4.104, so that it is admissible, but it is not UMPU.
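The two LR values quoted in the example are easy to reproduce (θ0 = 1/4, n = 10):

```python
theta0, n = 0.25, 10

def LR(y):
    # sup over H divided by sup over all theta of theta^y (1-theta)^(n-y)
    num = theta0**y * (1 - theta0) ** (n - y)
    if y == 0 or y == n:
        den = 1.0        # the sup is attained at theta = 0 or theta = 1
    else:
        phat = y / n
        den = phat**y * (1 - phat) ** (n - y)
    return num / den

assert abs(LR(0) - 0.0563) < 1e-4
assert abs(LR(6) - 0.0647) < 1e-4
assert LR(0) < LR(6)   # why no LR test can reject at Y = 6 but not at Y = 0
```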
Example 4.136. Suppose that X1, ..., Xn are conditionally IID given Θ =
(M, Σ) = (μ, σ) with N(μ, σ²) distribution. Suppose that H : M = μ0 is the
hypothesis. The formula in (4.26) gives the observed value of LR, namely (1 +
t²/[n − 1])^{−n/2}, where t is the test statistic for the usual t-test. Since LR is a
decreasing function of |t|, the level α LR test will be the same as the UMPU level
α test for all α.
Example 4.137. Suppose that X1, ..., Xn are conditionally IID N(μ, σ²) given
(M, Σ) = (μ, σ), and H : a ≤ M ≤ b versus A : not H. We can easily calculate
the LR criterion:

  LR = { 1                                          if X̄ ∈ [a, b],
         (1 + n(X̄ − a)²/[(n − 1)S²])^{−n/2}         if X̄ < a,
         (1 + n(X̄ − b)²/[(n − 1)S²])^{−n/2}         if X̄ > b.

We can easily see that the level α LR test will be the sum of two one-sided tests.
Since Problem 54 on page 292 shows that the level of the sum of two one-sided
tests in this problem is the sum of the levels, and since LR decreases equally fast
as X̄ drops below a or rises above b, it follows that the two tests should each have
level α/2. The LR test becomes the test described at the end of Example 4.132.
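A sketch of the LR criterion from this example (a, b, n, and the data summaries are illustrative), confirming that LR is 1 on [a, b] and falls off at the same rate on either side:

```python
n, a, b = 10, -1.0, 1.0

def LR(xbar, s2):
    # s2 is the usual unbiased sample variance S^2
    if a <= xbar <= b:
        return 1.0
    edge = a if xbar < a else b
    return (1.0 + n * (xbar - edge) ** 2 / ((n - 1) * s2)) ** (-n / 2.0)

s2 = 2.0
assert LR(0.0, s2) == 1.0
assert abs(LR(a - 0.7, s2) - LR(b + 0.7, s2)) < 1e-12   # symmetric fall-off
assert LR(a - 0.5, s2) > LR(a - 1.5, s2)                # decreasing away from [a, b]
```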
(4.138)
Example 4.139. Suppose that X_{1,i}, i = 1, ..., n1, are IID N(μ1, σ²) independent
of X_{2,i}, i = 1, ..., n2, with N(μ2, σ²) distribution given M1 = μ1, M2 = μ2, Σ = σ.
A popular hypothesis is H : M1 = M2. One can write this as H : M1 − M2 = 0. In
the above notation, r = 1, Y is the difference between the two sample averages
times √(n1 n2/(n1 + n2)), and M = M1 − M2. Also, s = 1, U is the sum of all the
observations divided by √(n1 + n2), and Ψ1 = (n1 M1 + n2 M2)/√(n1 + n2). Finally,
d = n1 + n2 − 2 and W is the pooled sum of squared deviations.
Example 4.140. Suppose that Y1, ..., Yn are conditionally independent given
B = β, Σ = σ with Y_i ~ N(x_iᵀβ, σ²) distribution, where the x_i are known
k-dimensional vectors. This is the standard linear regression model. A typical
hypothesis is of the form H : CB = c versus A : CB ≠ c, where C is an r × k
matrix of rank r < k and c is an r-dimensional vector. Define the matrix G =
Σ_{i=1}^n x_i x_iᵀ, and assume that G is nonsingular. The usual least-squares estimator
is B̂ = G^{−1} Σ_{i=1}^n x_i Y_i. Set

  Y = (CG^{−1}Cᵀ)^{−1/2} CB̂,    W = Σ_{i=1}^n (Y_i − x_iᵀB̂)²,    U = (DG^{−1}Dᵀ)^{−1/2} DB̂.

With Ψ = (DG^{−1}Dᵀ)^{−1/2} DB, M = (CG^{−1}Cᵀ)^{−1/2} CB, s = k − r, and d = n − k,
we see that (Y, W, U) have the distributions given in (4.138).
For the general situation, construct new random variables B > 0, Γ ∈ ℝʳ,
and H ∈ {0, 1}, which are conditionally independent of (Y, W, U) given Θ.
The distribution of Θ and the new parameters is as follows. Given Ψ = ψ,
Σ = σ, B = β, Γ = γ, and H = h, M = hγβ/(1 + γᵀγ) with probability 1.
Given Σ = σ, B = β, Γ = γ, and H = h,
  if h = 0,
  if h = 1.
Finally, B and H are independent, with B having some density f(β) strictly
positive on all of (0, ∞), and Pr(H = 0) = p0.
The Bayes rule with respect to 0-1-c loss is to reject H = 0 if Pr(H =
1|Y = y, U = u, W = w) is large. We have

  Pr(H = 1|Y = y, U = u, W = w) / Pr(H = 0|Y = y, U = u, W = w)    (4.141)
    = ∫∫∫∫ f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) dP_σ(ψ) dQ_{1,γ,β}(σ, μ) π_{1,β}(γ) dγ f(β) dβ
      / ∫∫∫∫ f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) dP_σ(ψ) dQ_{0,γ}(σ, μ) π_{0,β}(γ) dγ f(β) dβ,

where Q_{1,γ,β} is the distribution for (M, Σ) that puts all of its mass on
μ = γβ/(1 + γᵀγ) and σ = 1/√(1 + γᵀγ); Q_{0,γ} is the distribution of
(M, Σ) that puts all of its mass on μ = 0 and σ = 1/√(1 + γᵀγ); and
f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) is proportional to
To find the Bayes rule, we begin with the innermost integration (over 'I/J)
in both numerator and denominator (since they are the same). The integral
to be performed is
where
278 Chapter 4. Hypothesis Testing
It follows that σ(1 − σ²)^{1/2} is a scale factor for each coordinate of ψ. Hence the integral is proportional to exp(−uᵀu/2). Since this depends on the data alone, it cancels out of the numerator and the denominator along with w^{d/2−1} and the constant in the data density. What remains of (4.141) is

∫ c₀ exp{ −[(1 + γᵀγ)/2](yᵀy + w) } dγ
4.6 P-Values
4.6.1 Definitions and Examples
A common criticism of hypothesis-testing methodology is that the decision
to "reject" or "accept" a hypothesis is not informative enough. One should
also provide a measure of the strength of evidence in favor of (or against)
the hypothesis. The posterior probability of the hypothesis is an obvious
candidate to provide the strength of evidence in favor of the hypothesis,
but the posterior is not available in a classical analysis. In fact, there is
no theory for strength of evidence or degree of support in the classical
theory. Instead, some alternatives to testing hypotheses are available. The
alternative considered here is to give the set of all levels for which a specific
hypothesis would be rejected.⁴⁰ For most of the tests that we will consider in this book, the set of α values such that the level α test would reject H
will be an interval starting at some lower endpoint and extending to 1. The
lower endpoint will be called the P-value of the observed data relative to
the collection of tests.
Definition 4.142. Let H be a hypothesis, and let Γ be a set indexing nonrandomized tests of H (i.e., {φ_γ : γ ∈ Γ} is a set of nonrandomized tests of H). For each γ ∈ Γ, let α(γ) be the size of the test φ_γ. Define

the hypothesis. In these cases, p_H(x) is called the P-value without reference to the set of tests or the hypothesis.
Example 4.143. Suppose that X ∼ N(θ, 1) given Θ = θ and H₁ : Θ ∈ [−0.5, 0.5]. The UMPU level α test of H₁ is φ_α(x) = 1 if |x| > c_α, for some number c_α. If X = 2.18 is observed, φ_α will reject H₁ if and only if 2.18 > c_α. Since c_α increases as α decreases, the P-value is that α such that c_α = 2.18. If c_α = 2.18, then the test is φ_α(x) = 1 if |x| > 2.18, so α = Φ(−2.68) + 1 − Φ(1.68) = 0.0502 is the P-value.
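This computation can be checked numerically. The following sketch (assuming the SciPy library is available) evaluates the size of the test that rejects when |X| > 2.18; for H₁ : Θ ∈ [−0.5, 0.5] that size is attained at the boundary point θ = 0.5.

```python
# Numerical check of Example 4.143: the size of the test rejecting H1
# when |X| > 2.18 is sup over theta in [-0.5, 0.5] of Pr(|X| > 2.18),
# attained at theta = 0.5.
from scipy.stats import norm

x_obs = 2.18
a, b = -0.5, 0.5
# Pr(X < -2.18 | theta = 0.5) + Pr(X > 2.18 | theta = 0.5)
p_value = norm.cdf(a - x_obs) + 1 - norm.cdf(x_obs - b)
print(round(p_value, 4))  # 0.0502
```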
The reader will note that we used the same notation PH (x) to denote the
P-value as we used for the significance probability in Definition 4.8. The
reason is that they are almost always the same thing.
Proposition 4.144. Let {φ_γ : γ ∈ Γ} be a collection of tests. Suppose that Γ ⊆ [0, 1] and γ₁ > γ₂ implies that, for all x, φ_{γ₁}(x) ≥ φ_{γ₂}(x). Define the binary relation ⪯ on the sample space by x ⪯ y if and only if the P-value for x is at least as large as the P-value for y. Then ⪯ is a weak order and the P-value always equals the significance probability.
The conditions of Proposition 4.144 say that for every possible observa-
tion x, if x leads to rejection of H at one level, then it leads to rejection at
every higher level. This is just a precise way of saying what we said earlier
about the set of levels at which a hypothesis would be rejected being an in-
terval running from some lower bound up to 1. Although the conditions of
Proposition 4.144 are met in most situations, the following example [from Lehmann (1986)⁴¹] is a case in which they are not.
Example 4.145. Suppose that Ω = {1, 2, 3} and X = {1, 2, 3, 4}. Consider the following conditional distribution for X given Θ:

                              x
  f_{X|Θ}(x|θ)     1       2       3       4
  θ = 1           2/13    4/13    3/13    4/13
  θ = 2           4/13    2/13    1/13    6/13
  θ = 3           4/13    3/13    2/13    4/13

Consider the hypothesis H : Θ ≤ 2 versus A : Θ = 3. One can show that the MP level 5/13 test of H is φ_{5/13}(x) = 1 if x ∈ {1, 3} and that the MP level 6/13 test is φ_{6/13}(x) = 1 if x ∈ {1, 2}. For α = 1, φ_α(x) = 1 for x ∈ {1, 2, 3, 4}. So X = 3 leads us to reject H at some high values of α and at some low values of α, but not at certain values in between. The infimum of the set of all α such that φ_α(3) = 1 no longer tells us all of the levels at which we would reject H. In particular, one of the conditions of Proposition 4.144 is violated.
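The two MP tests in this example can be verified by brute force. The sketch below (plain Python, with the pmf transcribed from the table above) searches all nonrandomized rejection regions for the one maximizing power under θ = 3 subject to the size constraint.

```python
# Brute-force check of Example 4.145: find the most powerful
# nonrandomized test of H: Theta <= 2 vs A: Theta = 3 at a given level
# by searching every rejection region R subset of {1, 2, 3, 4}.
from fractions import Fraction as F
from itertools import combinations

pmf = {1: {1: F(2, 13), 2: F(4, 13), 3: F(3, 13), 4: F(4, 13)},
       2: {1: F(4, 13), 2: F(2, 13), 3: F(1, 13), 4: F(6, 13)},
       3: {1: F(4, 13), 2: F(3, 13), 3: F(2, 13), 4: F(4, 13)}}

def mp_rejection_region(level):
    best, best_power = set(), F(-1)
    for r in range(5):
        for region in combinations([1, 2, 3, 4], r):
            size = max(sum(pmf[t][x] for x in region) for t in (1, 2))
            power = sum(pmf[3][x] for x in region)
            if size <= level and power > best_power:
                best, best_power = set(region), power
    return best

print(mp_rejection_region(F(5, 13)), mp_rejection_region(F(6, 13)))
```

Note that x = 3 lies in the level-5/13 rejection region but not in the larger-level 6/13 region, which is exactly the non-monotonicity described above.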
Because P-values are between 0 and 1 and because the smaller the P-
value is the smaller a would have to be before one could accept the hypoth-
esis, people like to think of the P-value as if it were the probability that the
⁴¹The example appears in Problem 34 on page 121 of that text and in Problem 29 on page 116 of the 1959 edition.
hypothesis is true. Those who are more careful with their terminology will
still suggest that it is the degree to which the data support the hypothesis.
Sometimes this is approximately true, as in the next two examples.⁴²
Example 4.146. Suppose that X ∼ Bin(n, p) given P = p, and let H : P ≤ p₀. The UMP level α test rejects H when X > c_α, where c_α increases as α decreases. The P-value of an observed value x is the value of α such that c_α = x − 1 unless x = 0, in which case the P-value is 1. The P-value can then be calculated as

p_H(x) = Σ_{i=x}^n C(n, i) p₀ⁱ (1 − p₀)^{n−i}.

If the prior for P is the improper Beta(0, 1) distribution, then the posterior given X = x (for x > 0) is the proper Beta(x, n − x + 1) distribution, and

Pr(P ≤ p₀|X = x) = Σ_{i=x}^n C(n, i) p₀ⁱ (1 − p₀)^{n−i} = p_H(x).

So, the P-value is equal to the posterior probability that the hypothesis is true (using an improper prior), at least when the posterior is proper. If x = 0, then the posterior is still the improper Beta(0, n + 1).⁴³
Consider next what happens if H : P ≥ p₀. It turns out that the improper prior must change to Beta(1, 0). (See Problem 64 on page 294.) Because two different priors are needed to obtain the "degree of support" for the two different hypotheses, we get the following anomaly. If we take the two hypotheses together, {P ≤ p₀} ∪ {P ≥ p₀}, the total degree of support is

1 + C(n, x) p₀ˣ (1 − p₀)^{n−x}.

One can easily check that this is not due to the fact that {P = p₀} is included in both hypotheses. One could leave it out of either one and the results would be the same.
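The identity between the one-sided binomial P-value and the Beta posterior probability can be checked numerically. A sketch assuming SciPy; the values n = 10, p₀ = 0.3, x = 4 are arbitrary choices for illustration:

```python
# Check that the P-value for H: P <= p0 equals Pr(P <= p0 | X = x)
# under the improper Beta(0,1) prior, whose posterior is Beta(x, n-x+1).
from scipy.stats import binom, beta

n, p0, x = 10, 0.3, 4
p_value = sum(binom.pmf(i, n, p0) for i in range(x, n + 1))
posterior_prob = beta.cdf(p0, x, n - x + 1)
print(p_value, posterior_prob)
```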
where Y '" r(x, 1) and assuming that the "prior" for S is the improper dfJlfJ. So,
the P-value is the posterior probability that H is true if the prior is improper.
This is actually true in general for Poisson distributions and hypotheses of the
form H : S ~ fJo. The implications of this result include the following. If a
Bayesian uses the improper prior dfJlfJ and has a G-l-c loss function, then he or
she will reject H if the P-value is less than 1/(1 + c). This is the UMP level 0
test if 0 = 1/(1 + c).
Next, suppose that H : Θ ≥ 1 and A : Θ < 1. The UMP level α test is

φ(x) = { 1 if x < c,   γ if x = c,   0 if x > c,

where c and γ are chosen so that φ has size α. The P-value of an observed data value x is Pr(X ≤ x|Θ = 1). As we did earlier, we can write this as

p_H(x) = Pr(at most x events in one time unit of a rate 1 Poisson process)
       = Pr(time until event x + 1 is > 1) = Pr(Y > 1) = Pr(Θ > 1|X = x),

where Y ∼ Γ(x + 1, 1) and assuming that the "prior" for Θ is the improper dθ.
If we modify Example 4.143 slightly, we discover a case in which it is simply impossible to use P-values for measuring degree of support. [See also Schervish (1996).]

Example 4.148 (Continuation of Example 4.143; see page 280). Let H₂ : Θ ∈ [−0.82, 0.52]. The UMPU level α test is φ_α(x) = 1 if |x + 0.15| > d_α. If X = 2.18 is observed, then d_α = 2.33 and
In Example 4.148, we saw that the P-value of a data value relative to the
class of UMPU tests behaved strangely as the hypothesis varied. This exam-
ple is closely related to the incoherent tests discovered in Example 4.88 on
page 252 and Example 4.102 on page 257. Problems 61 and 62 on page 294
show that for one-sided hypotheses with known or unknown variance, the
P-value always equals the posterior probability of the hypothesis calcu-
lated from the usual improper prior. In the case of interval hypotheses
with unknown variance, the situation is somewhat different.
Example 4.149. Suppose that X₁, ..., Xₙ are conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). For a hypothesis of the form H : a ≤ M ≤ b with a < b, we have not found UMPU tests. We do, however, have the collection of likelihood ratio (LR) tests. (See Examples 4.132 and 4.137.) The UMPU tests of one-sided hypotheses (like M ≤ b or M ≥ a) and point hypotheses (like M = c versus M ≠ c) are also LR tests. So, we might try to compare the P-values for various hypotheses relative to the families of LR tests. If X̄ = x̄ > b and
the data offer, it seems natural to compare the two. In one-sided cases,
we found that posterior probabilities (when using improper priors) often
corresponded to P-values. In this section we will only compare Bayes factors
to P-values for testing hypotheses of the form H : Θ = θ₀ versus A : Θ ≠ θ₀.
Edwards, Lindman, and Savage (1963) and later Berger and Sellke (1987)
made comparisons of P-values with lower bounds on Bayes factors, and the
following two examples are inspired by the presentations in those sources.
Example 4.150. Suppose that X₁, ..., Xₙ are conditionally IID with N(θ, 1) distribution given Θ = θ, and we are interested in testing H : Θ = θ₀. Let p₀ = Pr(Θ = θ₀) > 0. If we let the prior distribution λ of Θ given that Θ ≠ θ₀ be unrestricted, then the global lower bound on the Bayes factor is easily calculated to be exp(−n[x̄ − θ₀]²/2). The lower bound on the Bayes factor for λ being a normal distribution centered at θ₀ is 1 if |x̄ − θ₀| ≤ 1/√n, and it equals
ᵃThis is the largest possible value of p₀ which is consistent with the posterior probability being equal to the P-value.
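The contrast between P-values and Bayes factors can be illustrated with the global lower bound exp(−n[x̄ − θ₀]²/2) from Example 4.150 (a sketch assuming SciPy; n and x̄ − θ₀ are chosen so that the usual z statistic equals 1.96):

```python
import math
from scipy.stats import norm

n, diff = 25, 0.392          # diff = xbar - theta0; z = sqrt(n)*|diff| = 1.96
z = math.sqrt(n) * abs(diff)
bf_lower_bound = math.exp(-n * diff ** 2 / 2)  # global lower bound on BF
p_value = 2 * norm.sf(z)                       # two-sided P-value
print(p_value, bf_lower_bound)
```

Even the prior most favorable to the alternative leaves a Bayes factor near 0.15, while the P-value is 0.05; the P-value overstates the evidence against H.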
is conjugate as in Example 4.22 on page 224. The Bayes factor for the hypothesis H : M = μ₀ was given in (4.23). Consider what happens to this expression as n → ∞. First, suppose that the usual t statistic converges to a constant t₀. That is, assume that √n(X̄ₙ − μ₀)/Sₙ converges to t₀, where Sₙ² = W/(n − 1). In this case, the formula in (4.23) behaves asymptotically (as n → ∞) like √n times a constant. This means that the Bayes factor goes to ∞ as n increases, hence the posterior probability of the hypothesis goes to 1. What happens to the P-value for this same sequence of data sets? Since the t statistic is converging to a constant t₀, the P-value is converging to 2[1 − Φ(t₀)]. For example, if t₀ = 1.96, the P-value will go to 0.05, while the posterior probability of the hypothesis goes to 1. This is an extreme example of Lindley's paradox. Once again, the suggestion of Lehmann (1958) to let α decrease as n increases would seem appropriate.

The situation is much the same if one uses the approximate Bayes factor as calculated in (4.29). This formula does not require that the prior be natural conjugate, but merely smooth in some sense. Since the statistics involved converge to finite values almost surely given Θ (by the strong law of large numbers 1.62), the expression in (4.29) also behaves asymptotically like √n times a constant if the t statistic converges to t₀.

Of course, the t statistic will converge to a finite value with positive probability only if the hypothesis is true. So, it is comforting that virtually any smooth prior distribution will lead to eventually discovering that the hypothesis is true, if it is indeed true.⁴⁴ On the other hand, it is a bit disconcerting that the P-value will stay bounded away from 1 with positive probability no matter how much data we observe. (See Problem 65 on page 294 to see how to prove that the P-value has U(0, 1) distribution given that the hypothesis is true.) If the hypothesis is false, it is easy to check that the P-value will go to 0, as will the Bayes factor.
4.7 Problems
Section 4.1:
1. Prove that the loss function in (4.2) is equivalent to a 0-1-c loss function. (By "equivalent" we mean that both the posterior risk and the risk function will rank all decision rules the same way regardless of which loss is used.)

2. Prove that the general form of the hypothesis-testing loss function (in Definition 4.1) can be written as d(θ) times a 0-1 loss function for some function d > 0 if the loss is 0 whenever a correct decision is made.
44In Section 7.4, we will prove some results that make more precise this limiting
ability of posterior distributions to identify the value of a parameter.
3. Let X = (X₁, ..., Xₙ) be such that the Xᵢ are conditionally IID with N(0, σ²) distribution given Σ = σ under the hypothesis H. Let T(x) = √n x̄/s, where x̄ = Σ_{i=1}^n xᵢ/n and s² = Σ_{i=1}^n (xᵢ − x̄)²/(n − 1). Define x ⪯ y if T(x) ≤ T(y). Find p_H(x), showing that it is the same for all σ.
Section 4.2:
4. Suppose that X ∼ N(θ, 1) given Θ = θ. Suppose that L(θ, 1) < L(θ, 0) for all θ < θ₀ and that L(θ, 1) > L(θ, 0) for all θ > θ₀. Prove that, for every prior, there exists k such that the formal Bayes rule will be to choose action a = 1 if X < k.
5. Suppose that X₁, ..., Xₙ are conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Use the improper prior having Radon–Nikodym derivative 1/σ with respect to Lebesgue measure on (0, ∞) × ℝ. Let μ₀ and d be known values, and suppose that the loss function is

Prove that the formal Bayes rule will be of the following form: Choose a = 1 if |T| > k for some constant k, where T = √n(X̄ₙ − μ₀)/Sₙ and

X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ,   Sₙ² = [1/(n − 1)] Σ_{i=1}^n (Xᵢ − X̄ₙ)².
where

(xτ² + θ₀)/(1 + τ²),

and

p₁/(1 − p₁) = [p₀/(1 − p₀)] (1 + τ²)^{1/2} exp{ −(1/2)[τ²/(1 + τ²)](x − θ₀)² }.

(b) Show that the minimum Bayes factor is |x − θ₀| exp({−[x − θ₀]² + 1}/2) if |x − θ₀| > 1, and is 1 if |x − θ₀| ≤ 1.
8. In Example 4.21 on page 224, prove that
k r Po
= T~~~(~) -~(-fZ2)'
Section 4.3.1:
9. Let (α₀, α₁) be a point on the lower boundary of the risk set for a simple-simple hypothesis-testing problem. Prove that α₀ + α₁ ≤ 1.

10. In a simple-simple hypothesis-testing problem, prove that the minimax rule for a 0-1 loss function is any test that corresponds to the point where the lower boundary of the risk set intersects the line y = x.
11. Let Ω = {0, 1}. Suppose that P₀ says that X ∼ U(−√3, √3) and P₁ says that X ∼ N(0, 1). Let Ω_H = {0}. Draw the risk set for a 0-1 loss function, and find the minimax rule.
12. In a simple-simple hypothesis-testing problem with 0-1 loss, show that a MP level α test has size α unless all tests with size α are inadmissible.

13. Return to the situation in Problem 28 on page 212. Consider the hypothesis H : X ∼ f₀ versus A : X ∼ f₁. Find all α such that the MP level α test is of the form "Reject H if −d ≤ X ≤ d," and write d as a function of α.
14. *Prove Proposition 4.46 on page 235.
15. In Example 4.49 on page 236, prove that the Bayes rule with size α has higher power than the unconditional size α test.
Section 4.3.2:
16. Suppose that the loss function is 0-1-c, that Ω_H = {θ₀}, and that Ω = {θ₀} ∪ Ω_A. Prove that a UMP level α test has no larger Bayes risk than any other size α test, no matter what prior we use.
Section 4.3.3:
17. Let Y = |X|, where f_{X|Θ}(x|θ) = 1/(πθ[1 + (x/θ)²]). Suppose that Θ > 0 for sure. Prove that the family of distributions for Y has MLR.

18. Let X have Cauchy distribution Cau(θ, 1) given Θ = θ.

(a) Prove that the MP level α test of H : Θ = θ₀ versus A : Θ = θ₁ for θ₁ > θ₀ is essentially unique. That is, if φ and ψ are both MP level α tests, then P_θ(φ(X) = ψ(X)) = 1 for all θ.

(b) Prove that there is no UMP level α test of H : Θ = θ₀ versus A : Θ > θ₀ for 0 < α < 1.
19. Let the parameter space be the open interval Ω = (0, 100). Let X₁ and X₂ be conditionally independent given Θ = θ with X₁ ∼ Poi(θ) and X₂ ∼ Poi(100 − θ). We are interested in the hypothesis H : Θ ≤ c versus A : Θ > c.
N(0, 1). Let H : Θ ∈ {θ₁, θ₂} and A : Θ = θ₃. Show that each test φ satisfying β_φ(θ₁) = β_φ(θ₂) = α is inadmissible if 0 < α < 1 and the loss function is 0-1.
23. Suppose that the parametric family has MLR increasing and that H : Θ ≤ θ₀ is the hypothesis of interest. Let the alternative be A : Θ > θ₀. Suppose that the power function of every test is continuous. Prove that the UMP level α test is the UMC floor α test.

24. Suppose that the parametric family has MLR increasing and that H : Θ ≤ θ₀ is the hypothesis of interest. Let the alternative be A : Θ > θ₀. Suppose that the UMP level α test has size γ. Prove that the UMP level α test is the UMC floor γ test.
25. Show that the family of U(0, θ) distributions for θ > 0 does not satisfy the conditions of Proposition 4.67. Find two UMP level α tests of H : Θ = 1 versus A : Θ > 1 which are not almost surely equal.

26. Show that the family of Poi(θ) distributions for θ > 0 does not satisfy the conditions of Proposition 4.67. Nevertheless, prove that the collection of one-sided tests of the form (4.57) for H : Θ ≤ θ₀ versus A : Θ > θ₀ forms a minimal complete class when the loss function is of hypothesis-testing type (see Definition 4.1).
27. *Suppose that Ω = (0, ∞) and that X₁, ..., Xₙ are IID U(0, θ) given Θ = θ. For 0 ≤ α < 1, find a UMP level α test for each of the hypothesis-alternative pairs below:

(a) H : Θ ≤ θ₀ versus A : Θ > θ₀.
(b) H : Θ ≥ θ₀ versus A : Θ < θ₀.
(c) H : Θ = θ₀ versus A : Θ ≠ θ₀.
(d) In part (a), find a second UMP level α test that differs from the one you found for part (a) with positive probability given Θ = θ for θ in a set of positive Lebesgue measure.

28. (a) Suppose that X ∼ U(0, θ) given Θ = θ. Find the conditional distri-
29. *The density of the NCB(α, β, ψ) distribution is given in Appendix D.

(a) Prove that, for fixed α and β, this family of distributions has increasing MLR in x where ψ is the parameter. (Feel free to pass derivatives under summations.)
(b) Use the result of the previous part to show that the noncentral F distribution has increasing MLR also.
(c) Also show that the noncentral t distribution has increasing MLR.

30. Prove that if φ is UMP level α for testing H : Θ ∈ Ω_H versus A : Θ ∈ Ω_A, then 1 − φ is UMC floor 1 − α for testing H′ : Θ ∈ Ω_A versus A′ : Θ ∈ Ω_H.
31. *Let X₁, ..., Xₙ be IID U(θ − 1/2, θ + 1/2) given Θ = θ. Let H : Θ ≥ θ₀ and A : Θ < θ₀.

(a) Find the UMP level α test of H versus A.
(b) Find the UMC floor α test of H versus A.
(c) Suppose that we begin with a uniform improper prior for Θ. Let the loss be 0-1-c with 1/(1 + c) = α. Find the formal Bayes rule.
(d) Calculate the power functions of the three tests above. Compare them and explain the differences intuitively.
(e) Find the UMP level α test of H versus A conditional on the ancillary U = max Xᵢ − min Xᵢ.
(f) Find the UMC floor α test of H versus A conditional on the ancillary U = max Xᵢ − min Xᵢ, and show that this is the same as the UMP level α test conditional on the ancillary.
32. Let Ω = (−∞, θ₀], and let X₁, ..., Xₙ be IID U(θ − 1/2, θ + 1/2) given Θ = θ. Let H : Θ = θ₀ and A : Θ < θ₀. Suppose that we have a prior distribution such that Pr(Θ = θ₀) = p₀ > 0 and that the conditional distribution of Θ given Θ < θ₀ has strictly positive density g(θ) for θ < θ₀. Assume that the loss is 0-1-c. Find the formal Bayes rule for data values such that max{x₁, ..., xₙ} < θ₀ + 1/2. (This condition assures that the data are consistent with the parameter space.) Does this test match any of the tests in Problem 31 above?
33. Suppose that μ_Θ is a probability measure on (Ω, τ). Suppose that the conditions of Theorem 4.68 are satisfied and that there is a test with finite Bayes risk with respect to μ_Θ. Suppose that the loss function is bounded below. Show that there is a one-sided test that is a formal Bayes rule.

34. Let Ω ⊆ ℝ and H : Θ ≤ θ₀. If power functions are continuously differentiable and φ is the unique level α test with maximum derivative for the power function at θ₀, show that φ is LMP level α relative to d(θ) = θ − θ₀ for θ > θ₀.

35. Let X₁, ..., Xₙ be conditionally IID with Cau(θ, 1) distribution given Θ = θ.

(a) Find the LMP level α test of H : Θ ≤ θ₀ versus A : Θ > θ₀ for 0 < α < 1. (Hint: Pass derivatives under integral signs.)
(b) Prove that the power function of this test goes to 0 as θ → ∞.
Section 4.3.4:
37. Suppose that the conditions of Theorem 4.82 are met. Let the loss function be of the hypothesis-testing type for a two-sided hypothesis. Show that the class of tests of the form given in Theorem 4.82 is essentially complete.

38. Suppose that μ_Θ is a probability measure on (Ω, τ). Let the loss function be of the hypothesis-testing type for a two-sided hypothesis. Suppose that the conditions of Theorem 4.82 are satisfied and that there is a test with finite Bayes risk with respect to μ_Θ. Suppose that the loss function is bounded below. Show that there is a test of the form given in Theorem 4.82 which is a formal Bayes rule.

39. Suppose that X ∼ N(θ, 1) given Θ = θ. Let Ω_H be the set of rational numbers, and let Ω_A be the set of irrational numbers. Prove that the UMP level α test of H : Θ ∈ Ω_H versus A : Θ ∈ Ω_A is the trivial test φ(x) ≡ α.
40. *Let X be a random variable, and define

φ(x) = { 1 if x < c_w,   γ if x = c_w,   0 if x > c_w.
(a) Find the UMP level 0.05 tests of the two hypotheses.

(b) Find the set of all x values such that the UMP level 0.05 test of H₂ rejects H₂ but the UMP level 0.05 test of H₁ accepts H₁.
42. Let Ω₁ ⊂ Ω₂ be strictly nested subsets of the parameter space. Let Hᵢ : Θ ∈ Ωᵢ and Aᵢ : Θ ∉ Ωᵢ for i = 1, 2. Suppose that Lᵢ is 0-1-c loss for testing Hᵢ versus Aᵢ for i = 1, 2 (with the same c for both cases). Consider the problem of simultaneously testing both hypotheses with action space {0, 1}², the first coordinate being the action for the first hypothesis, and the second coordinate being the action for the second hypothesis. (That is, for example, a = (a₁, a₂) = (0, 1) means to reject H₂ but accept H₁.) Suppose that the loss function for the simultaneous tests is L(θ, a) = L₁(θ, a₁) + L₂(θ, a₂). A pair of tests (φ₁, φ₂) can be thought of as a randomized decision rule in this problem. In this case, φᵢ(x) = Pr(reject Hᵢ|X = x). We can say that a pair of tests (φ₁, φ₂) is incoherent if there exists θ ∈ Ω₂ \ Ω₁ such that
Section 4.4:
43. For each k = 1, 2, ..., let γ_{k,α} denote the 1 − α quantile of the χ²_k distribution. Let Y_k have NCχ²_k(c²) distribution. Prove that Pr(Y_k > γ_{k,α}) ≤ Pr(Y₁ > γ_{1,α}) for all k ≥ 1. (Hint: Let X₁, ..., X_k be IID N(θ, 1) given Θ = θ, and consider two tests of H : Θ = 0 based on Σ_{i=1}^k Xᵢ² and (Σ_{i=1}^k Xᵢ)².)
44. Prove Proposition 4.92 on page 254.
45. Prove Proposition 4.93 on page 254.
46. Let X₁, ..., Xₙ be IID U(θ − 1/2, θ + 1/2) given Θ = θ. Let H : Θ = θ₀ and A : Θ ≠ θ₀.

(a) Let φ be a test with size α. Prove that φ is unbiased level α if φ(x) = 1 for all x which satisfy f_{X|Θ}(x|θ₀) = 0. (Hint: Use the fact that if f_{X|Θ}(x|θ₀)f_{X|Θ}(x|θ₁) > 0 for all x ∈ B, then P_{θ₀}(B) = P_{θ₁}(B).)

(b) Prove that there does not exist a UMPU level α test for α > 0. (Hint: You can slightly modify the UMP and UMC tests for the one-sided case in Problem 31 on page 289 to produce unbiased level α tests with maximum power for Θ < θ₀ and Θ > θ₀, respectively. Then prove that it is impossible for a single test to achieve both maxima.)
47. Prove Proposition 4.97 on page 255.
48. Suppose that the conditions of Lemma 4.99 hold and that the loss function is of the hypothesis-testing type. Prove that the class of two-sided tests is essentially complete.

49. In a one-parameter exponential family with natural parameter Θ, prove that the condition

(d/dθ) β_φ(θ)|_{θ=θ₀} = 0

can be written as (4.117) if β_φ(θ₀) = α.

50. Let Ω ⊆ ℝ and H : Θ = θ₀ versus A : Θ ≠ θ₀. If power functions are twice continuously differentiable and φ is the unique unbiased level α test with maximum second derivative for the power function at θ₀, show that φ is LMPU level α relative to d(θ) = |θ − θ₀| for θ ≠ θ₀.
Section 4.5:
where φ and Φ are respectively the standard normal density and CDF. Find the UMPU level α test of H : Θ₂ ≤ c versus A : Θ₂ > c.

52. Let the parameter space be Ω = (0, 1) × ℝ. Conditional on (P, Θ) = (p, θ), N ∼ Bin(10, p), and conditional on N = n, X₁, ..., X_{n+1} are IID N(θ, 1). You get to observe N, X₁, ..., X_{N+1}. We are interested in the hypothesis H : Θ ≤ 0 versus A : Θ > 0.

(a) Find the UMPU level α test of H versus A.
(b) Suppose that P = p is actually known. Show that there is no UMPU level α test of H versus A.

53. In the framework of Example 4.121 on page 266, let H : M ≤ μ₀ and A : M > μ₀. Prove that the usual size α one-sided t-test is UMPU level α.
54. *Suppose that X₁, ..., Xₙ are conditionally IID N(μ, σ²) given Θ = (μ, σ). Define

X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ,   Sₙ² = [1/(n − 1)] Σ_{i=1}^n (Xᵢ − X̄ₙ)²,
t₁ = √n (X̄ₙ − a)/Sₙ,   t₂ = √n (X̄ₙ − b)/Sₙ,

and

φ(x) = { 1 if t₁ < T⁻¹_{n−1}(α₁) or if t₂ > T⁻¹_{n−1}(1 − α₂),
         0 otherwise,

where T_{n−1} is the CDF of the t-distribution with n − 1 degrees of freedom.

(a) Prove that φ has size α as a test of H : a ≤ M ≤ b versus A : not H. (Hint: For each μ ∈ [a, b] find the distributions of t₁ and t₂ given Θ = (μ, σ) and see what happens as σ → ∞.)

(b) Prove that φ is not an unbiased level α test if both α₁ and α₂ are strictly positive. (Hint: Show that the limit of the power function at (a, σ) is α₁ as σ → 0. Since the power function is continuous, what does this say about the power function at (a + ε, σ) for σ and ε small?)
55. Suppose that X₁ ∼ Bin(n₁, p₁) independent of X₂ ∼ Bin(n₂, p₂) conditional on (P₁, P₂) = (p₁, p₂). Find the UMPU level α test of H : P₁ = P₂ versus A : P₁ ≠ P₂.
56. *Let X₁, ..., Xₙ be IID given Θ = (Θ₁, Θ₂) with density

f_{X|Θ}(x|θ₁, θ₂) = exp{−ψ(θ₁) + θ₁ cos(x − θ₂)} I_{[0,2π]}(x),

where

ψ(y) = log ∫₀^{2π} exp(y cos(t)) dt.

Let Ω = {(θ₁, θ₂) : θ₁ ≥ 0, 0 ≤ θ₂ < 2π}.

(a) Let H₁ = Θ₁ cos Θ₂ and H₂ = Θ₁ sin Θ₂. Let H : Θ₂ = 0 versus A : Θ₂ ≠ 0, and let H′ : H₂ = 0 versus A′ : H₂ ≠ 0. Prove that the UMPU level α test of H′ is also UMPU level α for H. (You need not actually find the test to prove this.) (Hint: Use the fact that if f and g are analytic functions of one variable and f(x) = g(x) for all x on some smooth curve, then f = g.)

(b) Find the form of the UMPU level α test of H′ as closely as you can. (You will not be able to find the cutoffs in closed form.)

(c) Why is this a reasonable test of H′ but not of H?
57. *Let X and Y have joint density given (M, Λ) = (μ, λ),
60. Consider the proof in Section 4.5.6 that the F-test is a proper Bayes rule.

(a) Prove that the prior distribution given in Section 4.5.6 puts positive probability on every open subset of the original parameter space for Θ.
(b) Prove that the Bayes rule is admissible.
Section 4.6:
61. Suppose that Xᵢ are IID N(μ, 1) given M = μ for i = 1, ..., n and that we use Lebesgue measure as an improper prior for M. If H : M ≤ c, show that the posterior probability that H is true equals the P-value associated with the family of one-sided tests.

62. Suppose that Xᵢ are IID N(μ, σ²) given (M, Σ) = (μ, σ) for i = 1, ..., n and that we use the usual improper prior with Radon–Nikodym derivative 1/σ as an improper prior. If H : M ≤ c, show that the posterior probability that H is true equals the P-value associated with the family of one-sided t-tests.
63. Prove Proposition 4.144 on page 280.
64. In Example 4.146 on page 281, suppose that we change H to H : P ≥ p₀. Prove that the P-value, associated with the family of UMP level α tests, of an observed X < n is the posterior probability that the hypothesis is true based on an improper prior Beta(1, 0).
65. Let {P_θ : θ ∈ Ω} be a parametric family. For each α, let φ_α(x) = I_{S_α}(x) be a size α test of H : Θ = θ₀ versus some alternative such that, for 0 < α < 1, S_α = ∩_{β>α} S_β. Suppose that S₀ = ∅ and S₁ = X.

(a) Prove that if α < β, then S_α ⊆ S_β.
(b) Let p_H(x) be the P-value of an observed x. Show that, given Θ = θ₀, p_H(X) ∼ U(0, 1).
(c) Suppose that φ_α is unbiased for each α. Prove that

P_θ(p_H(X) ≤ α) ≥ P_{θ₀}(p_H(X) ≤ α).
66. *Let X '" N(J,I) given e = 9. Let g(J, x) be the P-value for testing H6 :
e = (J versus A6 : e "" (J for data X = x.
(a) If the hypothesis is H : e E la, bj versus A : e f1. la, bj, and X = x >
(a + b)/2 is observed, prove that the P-value is 41(a - x) + 41(b - x),
where ~ is the standard normal CDF.
E_θ(φ(X)) = ∫₀^∞ φ(x) θ exp(−θx) dx = θ,

for all θ. This happens if and only if ∫₀^∞ φ(x) exp(−θx) dx = 1, for all θ. By Theorem 2.64, we can differentiate the left-hand side with respect to θ under the integral and get ∫₀^∞ xφ(x) exp(−θx) dx = 0 for all θ. This means that E_θ(Xφ(X)) = 0 for all θ. Since X is a complete sufficient statistic, φ(x) = 0, a.s. [P_θ] for all θ. This contradicts φ(X) being unbiased. Hence, there are no unbiased estimators of Θ.
Example 5.6. Suppose that {Xₙ}ₙ₌₁^∞ are IID N(μ, σ²) given Θ = (M, Σ) = (μ, σ). Let X = (X₁, ..., Xₙ). Then

X̄ = (1/n) Σ_{i=1}^n Xᵢ   and   S² = [1/(n − 1)] Σ_{i=1}^n (Xᵢ − X̄)²

are complete sufficient statistics. Since they are unbiased, they are UMVUE of M and Σ², respectively. Notice that S² does not minimize mean squared error, even among estimators of the form c Σ_{i=1}^n (Xᵢ − X̄)². (See Example 3.25 on page 154 and Problem 11 on page 210.)
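The remark about mean squared error can be checked by simulation (a sketch; n = 10, σ = 1, and the seed are arbitrary choices). The choice c = 1/(n + 1) minimizes the MSE among estimators c Σ(Xᵢ − X̄)²:

```python
import random

random.seed(1)
n, reps = 10, 100000
se_unbiased = se_shrunk = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)
    se_unbiased += (ss / (n - 1) - 1.0) ** 2   # S^2, unbiased for sigma^2 = 1
    se_shrunk += (ss / (n + 1) - 1.0) ** 2     # c = 1/(n+1), smaller MSE
mse_unbiased, mse_shrunk = se_unbiased / reps, se_shrunk / reps
print(mse_unbiased, mse_shrunk)  # theoretical values are 2/9 and 2/11
```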
Example 5.7. Suppose that P_θ says that X has Poi(θ) distribution. We know that X is a complete sufficient statistic. Let g(θ) = exp(−3θ). We will find the UMVUE of g(θ). The required condition is

Σ_{x=0}^∞ φ(x) exp(−θ) θˣ/x! = exp(−3θ),

for all θ. It follows from the uniqueness of Taylor series expansions for analytic functions that φ is unbiased if and only if φ(x) = (−2)ˣ for x = 0, 1, 2, .... This φ, although UMVUE, is an abominable estimator of g(θ). We will see some better estimators later in this chapter. (See Examples 5.29 and 5.32.)
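The unbiasedness of φ(x) = (−2)ˣ and its absurd behavior can both be seen numerically (plain-Python sketch; θ = 1.3 is an arbitrary choice):

```python
import math

def expect_phi(theta, terms=100):
    # E_theta[(-2)^X] for X ~ Poi(theta), by direct summation of the series.
    return sum((-2.0) ** x * math.exp(-theta) * theta ** x / math.factorial(x)
               for x in range(terms))

theta = 1.3
print(expect_phi(theta), math.exp(-3 * theta))  # the sum equals exp(-3 theta)
# The estimator itself only takes the values 1, -2, 4, -8, ..., none of
# which is a sensible guess at a quantity lying in (0, 1].
```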
PROOF. For the "only if" part, suppose that δ is UMVUE. It is clear that if Var_θ U(X) = 0 for all θ, then Cov_θ(δ(X), U(X)) = 0. So, let U ∈ U be such that Var_θ U(X) > 0 for some θ. Let λ ∈ ℝ, and define δ_λ = δ + λU. Then δ_λ is unbiased also, and for every λ,

which, in turn, is true for all λ and θ if and only if Cov_θ(δ(X), U(X)) = 0 for all θ.

For the "if" part, assume that for all U ∈ U, Cov_θ(δ(X), U(X)) = 0 for all θ. Now, let δ₁(X) be an unbiased estimator of E_θ δ(X). Then there exists U ∈ U such that δ₁ = δ + U. It follows that
X = { 1                                 if V₁ = 1,
      # of trials until 2nd failure     otherwise,

if x = 1,
if x = 2, 3, ....

Define the estimator δ₀ to be δ₀(x) = 1 if x = 1 and δ₀(x) = 0 if not. Then δ₀ is an unbiased estimator of Θ. We will now try to find a UMVUE. Assume E_θ U(X) = 0, for all θ. Then
the expectation, written as a sum over x = 1 and x = 2, 3, ..., vanishes for all θ if and only if U(2) = 0 and U(k) = −(k − 2)U(1) for all k ≥ 3. This characterizes all functions in U according to the value t = U(1). That is,

s(1 − t) / (1 − θ)² = Σ_{x=2}^∞ t s θ^{x−2} (x − 1)²,

that is,

s(1 − t) Σ_{k=1}^∞ k θ^{k−1} = t s Σ_{k=1}^∞ k² θ^{k−1}.

Since these two series in this last equation are analytic functions of θ, it must be that s(1 − t)k = t s k² for all s, k. This is not possible, hence there is no t such that these equations hold. And there is no UMVUE.
0 = ∫_X (∂/∂θ) f_{X|Θ}(x|θ) dν(x)
  = ∫_X [ (∂/∂θ) f_{X|Θ}(x|θ) / f_{X|Θ}(x|θ) ] f_{X|Θ}(x|θ) dν(x)
¹The first lower bound is due to Rao (1945) and Cramér (1945, Chapter 32; 1946).
since the term being added on has zero mean. Now take the absolute value and use the Cauchy–Schwarz inequality B.19:

|dE_θφ(X)/dθ| ≤ √(Var_θ φ(X)) √(I_X(θ)).

Now square the extreme ends of this string and divide by I_X(θ). □
For an unbiased estimator φ(X) of Θ, the smallest possible variance is 1/I_X(θ), since dE_θφ(X)/dθ = 1. A necessary and sufficient condition for the lower bound to be achieved is that the ≤ become an = in Theorem 5.13. The ≤ in Theorem 5.13 was introduced by the Cauchy–Schwarz inequality B.19, which provides for equality if and only if the two factors are linearly related. (That is, if E(X) = 0 and E(Y) = 0, then |E(XY)| = √E(X²) √E(Y²) if and only if there exist a and b such that aX + bY = 0, a.s. and ab ≠ 0.) So, the Cramér–Rao lower bound is an equality if and only if φ(X) and the score function ∂ log f_{X|Θ}(X|θ)/∂θ are linearly related, that is,

(∂/∂θ) log f_{X|Θ}(x|θ) = a(θ)φ(x) + d(θ), a.s. [P_θ].

This means that the Cramér–Rao lower bound can be sharp only in a one-parameter exponential family with φ(X) being a sufficient statistic.
Example 5.16. Suppose that X ∼ Exp(λ) given Λ = λ. Set Θ = 1/Λ. Then

f_{X|Θ}(x|θ) = (1/θ) exp{−x/θ} I_{(0,∞)}(x),   (∂/∂θ) log f_{X|Θ}(x|θ) = −1/θ + x/θ².

Since the score function is a linear function of x, it follows that φ(x) must also be
a linear function of x if φ(X) is to achieve the lower bound. If φ(x) = a + bx, then
E_θφ(X) = a + bθ, so a = 0 and b = 1 gives an unbiased estimator that achieves
the Cramér–Rao lower bound. The reader should verify that this is indeed the
case.
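The claim at the end of the example can be checked numerically. The following sketch (ours, not the text's; the function name is illustrative) integrates the squared score against the Exp(θ) density and compares the result with 1/θ², which equals the variance θ² of the unbiased estimator X turned upside down:

```python
import math

def fisher_info_exp(theta, n=200000):
    # Numerically integrate score^2 times the density for the family
    # f(x|theta) = (1/theta) exp(-x/theta), x > 0.
    xmax = 50 * theta                            # tail beyond this is negligible
    h = xmax / n
    total = 0.0
    for i in range(1, n):
        x = i * h
        score = -1.0 / theta + x / theta**2      # (d/dtheta) log f(x|theta)
        dens = math.exp(-x / theta) / theta
        total += score**2 * dens * h
    return total

theta = 2.0
print(fisher_info_exp(theta), 1.0 / theta**2)    # both close to 0.25
```

Since Var_θ(X) = θ² = 1/I_X(θ), the estimator X attains the Cramér–Rao bound, as asserted.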
5.1. Point Estimation 303
Example 5.17. For the location family of t_a distributions,

f_{X|Θ}(x|θ) = [Γ((a+1)/2) / (Γ(a/2)√(aπ))] (1 + (x − θ)²/a)^{−(a+1)/2},

and using I_X(θ) = −E_θ[∂² log f_{X|Θ}(X|θ)/∂θ²],

I_X(θ) = ∫ [Γ((a+1)/2) / (Γ(a/2)√(aπ))] · [(a+1)/a²] · [1 − (x − θ)²/a] / (1 + (x − θ)²/a)^{(a+5)/2} dx.

Since the denominator of the integrand looks like part of the t_{a+4} density, we will
perform the following transformation:

z − θ = √((a+4)/a) (x − θ),   x = θ + √(a/(a+4)) (z − θ),   dx = √(a/(a+4)) dz.

The integral becomes

[(a+1)/a²] √(a/(a+4)) [Γ((a+1)/2) / (Γ(a/2)√(aπ))] ∫ [1 − (z − θ)²/(a+4)] / (1 + (z − θ)²/(a+4))^{(a+5)/2} dz.

Except for the constant, the integral is E(1 − U²/[a+4]), where U ∼ t_{a+4}. The
correct constant to make the integral equal to this expected value is
Γ((a+5)/2)/(Γ((a+4)/2)√((a+4)π)). Since E(U²) = (a+4)/(a+2), we get
E(1 − U²/[a+4]) = (a+1)/(a+2), and collecting the constants gives
I_X(θ) = (a+1)/(a(a+3)).
There is another lower bound that applies in more general cases, such as
when the set of possible values for X depends on Θ. This next lower bound
is due to Chapman and Robbins (1951).
Theorem 5.18 (Chapman–Robbins lower bound). Let m(θ) = E_θφ(X). Then

Var_θ φ(X) ≥ sup_{{θ′ : supp(θ′) ⊆ supp(θ)}} [m(θ′) − m(θ)]² / E_θ[f_{X|Θ}(X|θ′)/f_{X|Θ}(X|θ) − 1]².

PROOF. For each θ′ with supp(θ′) ⊆ supp(θ), let

U(X) = f_{X|Θ}(X|θ′)/f_{X|Θ}(X|θ) − 1.

Then E_θU(X) = 1 − 1 = 0, and

Cov_θ(φ(X), U(X)) = E_θ[φ(X)U(X)] = m(θ′) − m(θ),

since supp(θ′) ⊆ supp(θ). By the Cauchy–Schwarz inequality B.19, the
square of the covariance is at most the product of the variances, so the
bound follows. □
Example 5.19. This is a case in which the Cramér–Rao lower bound does not
apply. Let P_θ say that {X_n}_{n=1}^∞ are IID with density f_{X|Θ}(x|θ) = exp(θ−x)I_{[θ,∞)}(x).
Let X = (X_1, …, X_n). Then supp(θ) = [θ, ∞) and supp(θ′) ⊆ supp(θ) so long
as θ′ ≥ θ. From the proof of Theorem 5.18,

U(X) = exp{n(θ′ − θ)}I_{[θ′,∞)}(min X_i) − 1.

If φ(X) is an unbiased estimator of Θ, then [m(θ) − m(θ′)]² = (θ − θ′)², and

E_θU(X)² = exp{2n(θ′ − θ)}P_θ(min X_i ≥ θ′) − 2 exp{n(θ′ − θ)}P_θ(min X_i ≥ θ′) + 1.

Since P_θ(min X_i ≥ θ′) = (P_θ(X_1 ≥ θ′))ⁿ = exp{−n(θ′ − θ)}, this equals
exp{n(θ′ − θ)} − 1.
A simple unbiased estimator is φ(X) = min X_i − 1/n, which has variance 1/n².
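The supremum in the Chapman–Robbins bound for this example can be evaluated numerically (a sketch of ours, with an illustrative function name): writing Δ = θ′ − θ, the bound is sup_{Δ>0} Δ²/(e^{nΔ} − 1), which turns out to be about 0.648/n², strictly below the variance 1/n² of min X_i − 1/n.

```python
import math

def chapman_robbins_bound(n, grid=200000, dmax=5.0):
    # sup over Delta = theta' - theta > 0 of Delta^2 / (exp(n*Delta) - 1),
    # evaluated by a dense grid search.
    best = 0.0
    for i in range(1, grid + 1):
        d = dmax * i / grid
        best = max(best, d * d / math.expm1(n * d))
    return best

n = 10
bound = chapman_robbins_bound(n)
print(bound, 1.0 / n**2)   # the bound is below 1/n^2
```

So the Chapman–Robbins bound is not attained by min X_i − 1/n, but at least it exists, unlike the Cramér–Rao bound here.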
so that Var_θφ(X) ≥ γ(θ)^T C^{−1} γ(θ).
The inequality follows from the fact that the covariance matrix is positive
semidefinite and C is nonsingular. □
Lemma 5.20 has two corollaries, one of which is an improvement on
the Cramer-Rao lower bound and the other of which is a multiparameter
version of the Cramer-Rao lower bound.
The first corollary to Lemma 5.20 is one that attempts to improve the
Cramer-Rao lower bound by making use of the fact that the inequality
can fail to be an equality because the score function is not linearly related
to the estimator. To put this another way, if the residual from the linear
regression of the estimator on the score function has nonzero variance, then
it might be possible to get the residual variance down by regression on more
than just the score function. This is the approach taken in Corollary 5.21.
Corollary 5.21 (Bhattacharyya system of lower bounds). Assume
the conditions of Theorem 5.13, assume that k partial derivatives with respect
to θ can be passed under the integral sign, and assume that J(θ)
(defined below) is nonsingular. Then Var_θφ(X) ≥ γ^T(θ)J^{−1}(θ)γ(θ), where

ψ_i(x, θ) = [1/f_{X|Θ}(x|θ)] (∂^i/∂θ^i) f_{X|Θ}(x|θ).
PROOF. All we need to do, in order to apply Lemma 5.20, is note that
which follows for higher derivatives in just the same way that it did for the
first derivative in the proof of Theorem 5.13. 0
The Cramér–Rao lower bound is the special case with k = 1.
Example 5.22. Suppose that X ∼ Exp(λ) given Λ = λ. Set Θ = 1/Λ. Then
So, γ(θ)^T J^{−1}(θ)γ(θ) = 2θ⁴, and the Bhattacharyya lower bound is achieved.
Corollary 5.23 (Multiparameter Cramér–Rao lower bound). Assume
that the FI regularity conditions on page 111 hold. Let I_X(θ) =
(I_{ij}(θ)) be the Fisher information matrix, and suppose that it is positive
definite. Suppose that ∫φ(x)f_{X|Θ}(x|θ)dν(x) can be twice differentiated
under the integral sign with respect to coordinates of θ and that
PROOF. All we need to do, in order to apply Lemma 5.20, is note that
C(θ) = I_X(θ). □
The following example shows that the MLE may exist but not be unique.
Example 5.27. Suppose that X₁, …, X_n are conditionally independent given
Θ = θ with U(θ − 1/2, θ + 1/2) distribution. Then,
²This observation has led some people to try to base inference on the likelihood
function alone, rather than the posterior distribution. See Barndorff-Nielsen
(1988) for an in-depth study of likelihood.
∫₀^∞ exp(−3θ) [(b+1)^{a+x}/Γ(a+x)] θ^{a+x−1} exp(−(b+1)θ) dθ = ((b+1)/(b+4))^{a+x}.
Another popular loss function is L(θ, a) = |θ − a|. The formal Bayes rule
in this case is a special case of the following result.
Theorem 5.33. Suppose that Θ has finite posterior mean. For the loss

L(θ, a) = { c(a − θ)        if a ≥ θ,        (5.34)
          { (1 − c)(θ − a)  if a < θ,

every 1 − c quantile of the posterior distribution of Θ is a formal Bayes rule.
PROOF. Let a′ be a 1 − c quantile of the posterior distribution, and let r(a|x)
denote the posterior risk. If a > a′, then r(a|x) − r(a′|x) ≥ (a − a′)[c − Pr(Θ > a′|X = x)].
Since Pr(Θ > a′|X = x) ≤ c, it follows that r(a|x) ≥ r(a′|x). Similarly, if
a < a′, then

L(θ, a) − L(θ, a′) = c(a − a′) + { 0       if a ≥ θ,
                                { θ − a   if a′ ≥ θ > a,
                                { a′ − a  if θ > a′.

It follows that r(a|x) − r(a′|x) ≥ (a′ − a)[Pr(Θ ≥ a′|X = x) − c]. Since
Pr(Θ ≥ a′|X = x) ≥ c, it follows that r(a|x) ≥ r(a′|x),
so a′ provides the minimum posterior risk. □
Notice that Theorem 5.33 remains true if the loss in (5.34) is replaced by
IF(x; T, P) = lim_{t↓0} [T(P_{x,t}) − T(P)] / t.

For the median functional,

T(P_{x,t}) = { F^{−1}(1/[2(1−t)])       if x > F^{−1}(1/2),
             { F^{−1}((1−2t)/[2(1−t)])  if x < F^{−1}(1/2),

so that

IF(x; T, P) = {  [2f(F^{−1}(1/2))]^{−1}  if x > F^{−1}(1/2),
             { −[2f(F^{−1}(1/2))]^{−1}  if x < F^{−1}(1/2).
For all x less than the median, the effect of a contamination at x is essentially the
same. This is similar for all x greater than the median. In a finite sample setting,
suppose that we obtain n observations X₁, …, X_n and we contemplate one further
observation X_{n+1}. We can think of the empirical CDF of the first n observations
as a probability measure P, and then with t = 1/(n+1), P_{X_{n+1},t} will be the empirical
CDF of all n+1 observations. In this case, the difference between the sample
medians T(P_{X_{n+1},t}) − T(P) has slightly different forms depending on whether n is
odd or even. For the case of n odd, T(P) = X_{([n+1]/2)}, the [n+1]/2 order statistic.
If X_{n+1} > X_{([n+1]/2+1)}, then T(P_{X_{n+1},t}) = [X_{([n+1]/2)} + X_{([n+1]/2+1)}]/2, which is
independent of the value of X_{n+1} (so long as it is larger than X_{([n+1]/2+1)}). A similar
expression holds for X_{n+1} < X_{([n+1]/2−1)}. So T(P_{X_{n+1},t}) − T(P) is approximately
the difference between X_{([n+1]/2)} and the next observation either above or below
it. If the distribution is continuous, then this difference is approximately one
over two times the density at the median times 1/n, that is, 1/[2nf(F^{−1}(1/2))], which
is approximately t·IF(X_{n+1}; T, P).
One way to summarize the influence function is by the gross error sensitivity.
This is defined as γ*(T, P) = sup_x |IF(x; T, P)|. If γ*(T, P) is
infinite, then there is no bound on how much T can change when P is
contaminated by even a small amount of mass at an arbitrary point.
Example 5.38. For the mean functional T, γ*(T, P) is the largest absolute deviation
possible from the mean. For distributions with unbounded support, γ* = ∞.
For the median, on the other hand, γ*(T, P) = [2f(F^{−1}(1/2))]^{−1}, which is finite.
For this reason, we say that the median is more robust with respect to gross
errors than the mean.
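The limit defining the influence function of the median can be checked numerically at the standard normal distribution (a Python sketch of ours; the helper functions are illustrative, with the normal quantile computed by bisection):

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def std_normal_quantile(p, lo=-10.0, hi=10.0):
    # invert the CDF by bisection
    for _ in range(200):
        mid = (lo + hi) / 2
        if std_normal_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t = 1e-6
# contamination at a point x above the median of N(0, 1):
shift = std_normal_quantile(1 / (2 * (1 - t)))    # T(P_{x,t}) - T(P)
numeric_if = shift / t                             # difference quotient
exact_if = math.sqrt(2 * math.pi) / 2              # 1/(2 f(F^{-1}(1/2)))
print(numeric_if, exact_if)                        # both about 1.2533
```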
= t [ψ(x, T(P_{x,t})) − ∫ψ(y, T(P_{x,t})) dP(y)].   (5.41)

Suppose that we can differentiate ∫ψ(y, T(P_{x,t})) dP(y) with respect to t
by differentiating under the integral sign. Dividing both sides of (5.41) by
t and taking the limit as t → 0 gives the derivative at t = 0. Since ψ is
continuous in its second argument, and since T(P_{x,t}) is continuous at t = 0
for all x at which the influence function exists,

lim_{t→0} ∫ψ(y, T(P_{x,t})) dP(y) = 0,

and

ψ(x, T(P)) = −∫ Σ_j [∂ψ_i(y, θ)/∂θ_j]|_{θ=T(P)} IF_j(x; T, P) dP(y).   (5.42)
ψ(x, θ) = (1/σ²) ( x − μ, [(x − μ)² − σ²]/σ ),   I_X(θ) = (1/σ²) diag(1, 2),
if P is the N(μ, σ²) distribution. In fact, one can verify directly that if P is
any distribution with finite variance and T′(P) is the standard deviation, then
IF(x; T′, P) is the second coordinate in (5.45). (See Problem 32 on page 342.)
Example 5.46. For an example that does not meet the smoothness criteria,
consider X₁, …, X_n conditionally IID with U(0, θ) distribution given Θ = θ. Let
P₀ be the class of distributions on (ℝ, B¹) with bounded support, and let T(P)
be the supremum of the support. The MLE is the maximum of the sample, which
is the supremum of the support of the empirical CDF, so the MLE is T of the
empirical CDF. The influence function for T is IF(x; T, P) = 0 if x ≤ T(P) and
∞ if x > T(P). (See Problem 33 on page 342.)
h(t) = { −b  if t < −b,
       { t   if −b ≤ t ≤ b,
       { b   if t > b,

and the influence function is

IF(x; T, P) = ψ(x, T(P)) / P([T(P) − b, T(P) + b]).
Notice that as b → 0, the influence function approaches the influence function
of the median. The finite sample version is the estimator T_n which
solves Σ_{i=1}^n ψ(X_i, T_n) = 0. As with all M-estimators, one can rewrite this
equation as Σ_{i=1}^n w_{n,i}(X_i − T_n) = 0, where w_{n,i} = ψ(X_i, T_n)/(X_i − T_n). It
is not difficult to see that T_n = Σ_{i=1}^n w_{n,i}X_i / Σ_{i=1}^n w_{n,i} solves the equation.
Since T_n appears on both sides of this equation, a solution must be
found iteratively. For example, make an initial guess T_n^{(0)} and define

T_n^{(k)} = Σ_{i=1}^n w_{n,i}^{(k)} X_i / Σ_{i=1}^n w_{n,i}^{(k)},  where  w_{n,i}^{(k)} = ψ(X_i, T_n^{(k−1)}) / (X_i − T_n^{(k−1)}).
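The iteratively reweighted scheme can be sketched in a few lines (our illustration, not the text's; the data and tuning constant are hypothetical). Note how the single gross outlier receives a small weight and barely moves the estimate:

```python
def huber_psi(t, b=1.5):
    # psi(x, theta) = h(x - theta) with h as displayed above
    return max(-b, min(b, t))

def m_estimate(xs, b=1.5, iters=100):
    t = sorted(xs)[len(xs) // 2]          # initial guess: a sample median
    for _ in range(iters):
        ws = [huber_psi(x - t, b) / (x - t) if x != t else 1.0 for x in xs]
        t = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return t

data = [0.1, 0.3, -0.2, 0.4, 0.0, 25.0]   # one gross outlier
t_n = m_estimate(data)
resid = sum(huber_psi(x - t_n) for x in data)
print(t_n, resid)   # t_n stays near the bulk of the data; resid is near 0
```

At convergence the defining equation Σψ(X_i, T_n) = 0 holds, which is what `resid` verifies.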
T(P) = [1/(1 − 2α)] ∫_{F^{−1}(α)}^{F^{−1}(1−α)} x dP(x)

for α < 1/2. For α = 1/2, the tradition is to call the median the 50%
trimmed mean. The influence function of a trimmed mean at a continuous
distribution is bounded, and it has a shape similar to that of the previous
estimator. (See Problem 34 on page 342.)
In Section 7.3.6, we will give some results concerning large sample properties
of M-estimators. More detailed discussion of robust estimators can
be found in the books by Huber (1977, 1981) and Hampel et al. (1986).
In Section 8.6.3, we discuss robustness considerations that are peculiar to
the Bayesian perspective.
The accuracy of a confidence set against B is its probability of not covering
parameter values in B(g(θ)) given Θ = θ. The set B(g(θ)) is the set of
values you wish not to have in your confidence set if Θ = θ. It is not
exactly analogous to the alternative in hypothesis testing, but it is related,
as we will see in Theorem 5.54.
We note that
(5.58)
if t₁ > θ₁ − 1/2 or if t₂ > θ₀ + 1/2 (see Figure 5.59). If k = 1, then (5.58) changes
to equality on the shaded set in Figure 5.59. If θ₁ ≥ θ₀ + 1, then (5.58) holds for
t₁ ≥ θ₁ − 1/2 for every k ≥ 0. To make the test have size α and to make it be
the same for all θ₁ > θ₀, we must set φ = 1 in the upper corner (shaded region in
Figure 5.59), filling in a large enough area to have probability α. This would be

φ(t₁, t₂) = { 1  if t₂ > θ₀ + 1/2 or t₁ > θ₀ + 1/2 − α^{1/n},
            { 0  if t₂ ≤ θ₀ + 1/2 and t₁ ≤ θ₀ + 1/2 − α^{1/n}.
4Note that the hypothesis and alternative need to be switched in order for the
same confidence set to correspond to both a UMP and a UMC test.
[Figure 5.59: the (t₁, t₂) plane, with boundary lines at θ₀ + 1/2 and θ₀ − 1/2 and the shaded region in the upper corner.]
(To see that this is MP for each θ₁ > θ₀, note that we choose k = 1 in (5.58) if
θ₁ − 1/2 < θ₀ + 1/2 − α^{1/n}, and we choose k = 0 if θ₁ − 1/2 > θ₀ + 1/2 − α^{1/n}.)
The UMA coefficient 1 − α confidence set against B is [T*, ∞), where T* =
max{T₁ − 1/2 + α^{1/n}, T₂ − 1/2}. This means that P_θ(θ′ ≥ T*) is minimized for all
θ′ < θ among all coefficient 1 − α confidence sets. Note, however, that Θ ≥ T₂ − 1/2
for sure, and T* ≤ T₂ − 1/2 whenever T₁ − 1/2 + α^{1/n} ≤ T₂ − 1/2, that is, when
T₂ − T₁ ≥ α^{1/n}. So, we are 100% confident that Θ ≥ T* whenever T₂ − T₁ ≥ α^{1/n},
rather than 100(1 − α)% confident. (The probability that T₂ − T₁ ≥ α^{1/n} is
1 − [nα^{(n−1)/n} − (n−1)α]. If α = 0.05 and n = 10, then P_θ(T₂ − T₁ ≥ α^{1/n}) = 0.77
for all θ. So the understatement of confidence will occur with probability 0.77 no
matter what Θ is.)
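The quoted probability can be reproduced exactly (a Python sketch of ours), using the fact that the range R = T₂ − T₁ of n iid U(θ − 1/2, θ + 1/2) variables has CDF P(R ≤ r) = nr^{n−1} − (n−1)rⁿ:

```python
def prob_overconfident(alpha, n):
    # P(T2 - T1 >= alpha^(1/n)) = 1 - [n alpha^((n-1)/n) - (n-1) alpha]
    r = alpha ** (1.0 / n)
    return 1 - (n * r ** (n - 1) - (n - 1) * r ** n)

print(prob_overconfident(0.05, 10))   # about 0.776, the text's 0.77
```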
But there is more. Suppose that we switch to B′(θ) = (θ, ∞), so that the
corresponding hypothesis and alternative are H′ : Θ ≥ θ₀ versus A′ : Θ < θ₀. The
UMA coefficient 1 − α confidence set against B′ is (−∞, T], where T = min{T₁ +
1/2, T₂ + 1/2 − α^{1/n}}. (See Problem 31 on page 289.) If T₂ − T₁ < 2α^{1/n} − 1, then
the two intervals do not even overlap. For example, if α = 0.05 and n = 10, the
probability of this occurrence is 0.008. As an example, if T₁ = 1 and T₂ = 1.3,
then T* = 1.24 and T = 1.06. So we are 95% confident that Θ < 1.06 and we
are 95% confident that Θ ≥ 1.24. That makes us 190% confident, and we haven't
even covered all the possible values of Θ. The most straightforward way out of
these dilemmas in the classical framework is to condition on the ancillary T₂ − T₁.
This will produce more sensible results, which will also be more closely in line
with a Bayesian analysis. The confidence set produced will not be UMA, however.
Another alternative is to use the UMA confidence interval but assert a confidence
coefficient α(t₂ − t₁) that depends on the observed value of the ancillary.⁵
The following example shows that the phenomenon of Example 5.57 can
occur in exponential families with UMA unbiased confidence sets. This
example is due to Fieller (1954).
Example 5.62. Let Θ = (B₀, B₁, Σ), and suppose that Y₁, …, Y_n are conditionally
independent given Θ = (β₀, β₁, σ) with Y_i ∼ N(β₀ + β₁x_i, σ²), where
x₁, …, x_n are known numbers. Sufficient statistics are
Ȳ = (1/n) Σ_{i=1}^n Y_i,   β̂₁ = [Σ_{i=1}^n (Y_i − Ȳ)(x_i − x̄)] / [Σ_{i=1}^n (x_i − x̄)²],   W = Σ_{i=1}^n (Y_i − Ȳ − β̂₁(x_i − x̄))².
If we let S_xx = Σ_{i=1}^n (x_i − x̄)², then the joint density of the sufficient statistics is

[1/(2πσ²)] √(nS_xx) · w^{(n−4)/2} / [Γ((n−2)/2)(2σ²)^{(n−2)/2}] · exp{−[w + n(ȳ − β₀ − β₁x̄)² + S_xx(β̂₁ − β₁)²]/(2σ²)}.
_
Y=-,
Ul /3 - U2 + (xo - X)Ul
n 1 - Sxx '
and the Jacobian is 1/S_xx. The joint density of the U_i's given Ψ = ψ is
[ U3 - -Ul - U2
2 [
+ ( xoS - -)
X Ul 12]~-2 I (0,00) ()h(
W U2,U3,)
n xx
where h is a function that will not concern us. If we expand out the formula for
w and complete the square in this function as a function of u₁, we get
[ w + (β̂₀ + β̂₁x₀ − β₀ − β₁x₀)² / (1/n + (x₀ − x̄)²/S_xx) ].   (5.64)
{ z : (β̂₀ + β̂₁z)² / [ (W/(n − 2)) (1/n + (z − x̄)²/S_xx) ] ≤ F_{1,n−2}^{−1}(1 − α) },

where F_{1,n−2} is the CDF of the F distribution with 1 and n − 2 degrees of freedom.
We will now find all z that satisfy this inequality. Define
v = √n(z − x̄)/√S_xx,   v* = −√n Ȳ/(β̂₁√S_xx),   F = (n − 2)β̂₁²S_xx/W,   c = F_{1,n−2}^{−1}(1 − α).

Clearly v is a one-to-one function of z and v* is its naive estimator. The usual F
statistic for testing B₁ = 0 is F. We can write

(W/(n − 2)) (1/n + (z − x̄)²/S_xx) = (W/(n − 2)) (1 + v²)/n.

So, the confidence set is {z : F(v − v*)²/(1 + v²) ≤ c}. Now, F(v − v*)²/(1 + v²) ≤ c
if and only if

v²(F − c) − 2vFv* + (Fv*² − c) ≤ 0.   (5.65)

We have three cases depending on the sign of F − c.
where the first equality follows from the fact that V and X are conditionally
independent given Θ. This probability can be calculated using the noncentral t
distribution. For μ = ν₀, the probability is 1 − T_{n−1}(T_{n−1}^{−1}(1 − α)√(1 + n/m)) <
α. The test is easily seen to be a UMPU test (level strictly less than α) for
M ≤ ν₀, but it is not clear why it should be considered a test of V ≤ ν₀.
The number δ is called the tolerance coefficient. The tolerance set R is exact
if P_θ(P_θ[V ∈ R(X)|X] ≥ δ) = γ for all θ ∈ Ω. One might wish to require
that P_θ(R(X) = ∅) = 0 for all θ and/or inf_{θ∈Ω} P_θ(P_θ[V ∈ R(X)|X] ≥ δ) =
γ. If this last condition fails, the tolerance set would be called conservative.
Rather than making a single probability statement concerning the joint
distribution of the data X and the future observable V (as is done in a
prediction set), a tolerance set tries to separate the probability statements
about X and V. A conditional probability statement about V is made
given X (the tolerance coefficient) and then a statement is made about the
distribution of this conditional probability (the confidence coefficient).
Example 5.70 (Continuation of Example 5.67; see page 324). Here the data are
conditionally IID N(μ, σ²) given Θ = (μ, σ). Suppose that we want R(x) to have
the same form as the prediction set, namely R_d(x) = [X̄_n − dS_n, X̄_n + dS_n],
where d is yet to be determined. Now,

P_θ(V ∈ R_d(X)|X) = Φ(√m(X̄_n + dS_n − μ)/σ) − Φ(√m(X̄_n − dS_n − μ)/σ).   (5.71)

Call the right-hand side of (5.71) H_d(X, μ, σ). It is easy to see that the distribution
of H_d(X, μ, σ) given Θ = (μ, σ) is the same as the distribution of H_d(X, 0, 1) given
Θ = (0, 1). Let y_{d,γ} be the 1 − γ quantile of H_d(X, 0, 1). Since the distributions of
H_d(X, 0, 1) are stochastically increasing in d, it is clear that y_{d,γ} is an increasing
continuous function of d. Also, y_{0,γ} = 0 and lim_{d→∞} y_{d,γ} = 1. Let d be such that
y_{d,γ} = δ. With this choice of d, we have P_θ(H_d(X, μ, σ) ≥ δ) = γ. It follows that
R_d is a δ tolerance set with confidence coefficient γ for V.
There are also one-sided tolerance sets. For example,
(5.72)
One might think that tolerance sets could be used to construct predictive
tests in the classical framework where prediction sets failed. In the sense
of (3.15), they can.
Example 5.73 (Continuation of Example 5.67; see page 324). Consider the hypothesis
H : V ≤ ν₀ and the tolerance set R′_d in (5.72). (Recall that V is the
average of m future observations.) Here d is chosen so that δ is the 1 − γ quantile
of the distribution of 1 − Φ(√m[X̄_n − dS_n]) given Θ = (0, 1). We could reject H
if ν₀ ∉ R′_d(X). We now calculate
Note that the left-hand side of (5.74) is 1 − P_θ(V ≤ ν₀). So, we have replaced
the hypothesis H : V ≤ ν₀ by the hypothesis H′ : P_θ(V ≤ ν₀) > 1 − δ. (For
δ = e/(1 + e), H′ turns out to be the same as the hypothesis constructed in
Example 4.13 on page 219.)
8Eberhardt, Mee, and Reeve (1989) give a program to compute the number d
in examples like this.
5.2. Set Estimation 327
9Sets with specified posterior probability are often called credible sets in the
statistical literature.
If c < 1 − c, this loss penalizes overly long intervals that contain the parameter
less (per unit of length) than it penalizes short intervals that miss
the parameter. If the posterior distribution of Θ has density f_{Θ|X}(θ|x) with
respect to a measure λ, and the posterior mean of Θ is finite, then Theorem
5.33 says that the optimal a is any 1 − c quantile of the posterior
distribution of Θ.
For bounded intervals, the action space can be considered to be the set
of ordered pairs (a₁, a₂) in which a₁ ≤ a₂. Consider a loss like (5.76) that
penalizes excessive length above, below, and around Θ differently:
(5.77)
The optimal interval is the interval between two quantiles of the posterior
distribution.
Theorem 5.78. Suppose that the posterior mean of Θ is finite and the
loss is as in (5.77) with c₁, c₂ > 1. The formal Bayes rule is the interval
between the 1/c₁ and 1 − 1/c₂ quantiles of the posterior distribution of Θ.
PROOF. We can rewrite the loss in (5.77) as L₁(θ, a₁) + L₂(θ, a₂), where
L₁(θ, a₁) takes one form if a₁ > θ and another if a₁ ≤ θ, and L₂(θ, a₂)
takes one form if a₂ ≥ θ and another if a₂ < θ.
Since each of these loss functions depends on only one action, the posterior
means can be minimized separately. If we divide L₁ by c₁, then Theorem
5.33 says that the posterior mean of L₁(Θ, a₁)/c₁ is minimized at a₁
The existence of optimal rules in these cases actually requires weaker assumptions
than those already considered, because the posterior mean of
the loss is finite in all cases. That is, we need not assume that Θ has finite
posterior mean to find the formal Bayes rules. If the posterior distribution
of Θ has a continuous density with respect to Lebesgue measure, then calculus
can be used to minimize the posterior risk. For example, the following
result is easy to prove.
Proposition 5.79. Suppose that Ω ⊆ ℝᵐ and that the action space is the
Borel σ-field 𝒯 of subsets of Ω. Suppose that the posterior distribution of
Θ has a density f_{Θ|X} with respect to Lebesgue measure λ and that the loss
is L(θ, B) = λ(B) + c(1 − I_B(θ)). Then, the formal Bayes rule is an HPD
region of the form B(x) = {θ : f_{Θ|X}(θ|x) ≥ 1/c}.
For the case in which the density f_{Θ|X} is strongly unimodal (that is,
{θ : f_{Θ|X}(θ|x) > a} is an interval for all a), the formal Bayes rule for loss
L₂ is an HPD region. In the strongly unimodal case, one can also find the
formal Bayes rule for loss L₁. (See Problem 40 on page 343.)
CDF of the data. For the parametric bootstrap, one assumes a parametric
model (with each X_i having CDF F_{X_i|Θ}(·|θ)), and F̂_n is F_{X_i|Θ}(·|θ̂_n) for
some estimate θ̂_n of Θ. To be more precise, we follow Efron (1979, 1982).
Let X = (X₁, …, X_n) and let ℱ be a space of CDFs in which we suppose
that F lies. Let R : 𝒳 × ℱ → ℝ be some function of interest. For example,
R might be the difference between the sample median of X₁, …, X_n and
the median of F:
where F^{−1}(q) is understood to mean inf{x : F(x) ≥ q}, and X_{(k)} is the kth
order statistic. The bootstrap replaces R(X, F) by R(X*, F̂_n), where X*
is an IID sample of size n from F̂_n. If we are interested in the conditional
mean of R(X, F) given P = P where P has CDF F, we try to calculate
the mean of R(X*, F̂_n). The success or failure of the bootstrap will depend
upon the extent to which F̂_n is "like" F for the purposes of calculating the
distribution of R.
Example 5.80. Suppose that X₁, …, X_n are conditionally IID U(0, θ) given
Θ = θ. Here, F is the U(0, θ) CDF. First, take R(X, F) = Σ_{i=1}^n X_i/n − ∫x dF(x).
The mean of this quantity is 0. In fact, the mean of R(X, F) is 0 if F is any
distribution with finite mean and X is an IID sample from F. In particular, the
mean of R(X*, F̂_n) given F̂_n is 0. The parametric and nonparametric bootstraps
do just fine here.
Next, take R(X, F) = n(F^{−1}(1) − X_{(n)})/F^{−1}(1), where X_{(n)} is the largest
coordinate of X. The distribution of X_{(n)}/F^{−1}(1) is Beta(n, 1) with CDF tⁿ
for 0 ≤ t ≤ 1. So, the CDF of R(X, F) is 1 − (1 − t/n)ⁿ. For large n, this
is approximately 1 − exp(−t). For example, if t = 0.1, then P_θ(R(X, F) ≥
0.1) ≈ exp(−0.1) = 0.905. On the other hand, for the nonparametric bootstrap,
R(X*, F̂_n) = n(X_{(n)} − X*_{(n)})/X_{(n)}, and

P_θ(R(X*, F̂_n) = 0 | F̂_n) = 1 − (1 − 1/n)ⁿ,
llThe second part of this example was given by Bickel and Freedman (1981).
See Problem 41 on page 343 to see how the parametric bootstrap performs in
this example.
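The failure in the second half of Example 5.80 is easy to quantify (a quick check of ours): the bootstrap puts an atom of size 1 − (1 − 1/n)ⁿ at R = 0, approaching 1 − 1/e, whereas under a continuous F the true probability that R(X, F) = 0 is zero.

```python
import math

# P(bootstrap maximum equals the sample maximum) = 1 - (1 - 1/n)^n
for n in [10, 100, 1000]:
    print(n, 1 - (1 - 1 / n) ** n)
print(1 - math.exp(-1))   # limiting value, about 0.632
```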
5.3. The Bootstrap 331
Pr(Y_{1/2} = X_{(k)}) = (1/nⁿ) Σ_{i=0}^{(n−1)/2} Σ_{q=(n+1)/2−i}^{n−i} [n!/(i! q! (n−i−q)!)] (k−1)^i (n−k)^{n−i−q},   (5.82)
(5.82)
12Bickel and Freedman (1981) claim that even if one were to smooth the boot-
strap by sampling from a continuous approximation to the empirical CDF, there
would still be a problem in the second half of Example 5.80. Singh (1981) and
Bickel and Freedman (1981) prove some large sample properties of the nonpara-
metric bootstrap as it pertains to estimating central (not extreme) quantiles of
a distribution.
13See Section B.7 for some ideas on how to simulate in general.
We simulated 100 data sets of size n = 51 and another 100 data sets of size n =
101 from the Lap(0, 1) distribution. Then we repeated the exercise with data from
N(0, 1) and U(0, 1) distributions. We approximated the true mean of R(X, F) for
each of the cases using a simulation of 100,000 data sets (except for the uniform
case in which the true value is just n times the variance of the Beta([n+1]/2, [n+
1]/2) distribution). The results are summarized in Table 5.83. The last column of
Table 5.83 gives the square root of the average squared difference between the true
value of ER(X, F) (third column) and the 100 simulated averages of R(X*, F̂_n).
The fourth column gives the average of those 100 simulated averages. It appears
that the bootstrap estimate (fourth column) is slightly high in all cases, but less
so for larger sample sizes. The root mean squared error (last column) is very large
compared to the true value, indicating that the bootstrap estimate of ER(X, F)
based on a single data set may not be particularly useful.
(5.85)
where T_{<k} = Σ_{i=1}^{k−1} T_i and T_{>k} = Σ_{i=k+1}^{n} T_i. Using the same data sets as in
Example 5.81 on page 331, we obtained the results summarized in Table 5.86.
The Bayesian bootstrap seems to estimate ER(X, F) to be even higher than does
the bootstrap. This seems natural due to the additional variance introduced in
the Bayesian bootstrap.
R(X, F) = [ (1/n) Σ_{i=1}^n X_i ]² − [ ∫x dF(x) ]²,

and suppose that we are interested in the mean of R. This is the bias of the
square of the sample average as an estimator of the square of the mean. If the
observed sample average is X̄_n and we use s_n² = Σ_{i=1}^n (X_i − X̄_n)²/n as an estimate
of variance, then

R(X*, F̂_n) = [ (1/n) Σ_{i=1}^n X_i* ]² − (X̄_n)².
The mean of this, given the data, is s_n²/n. The mean of R(X, F) given P = P
is σ²/n, where σ² = ∫(x − μ)²dF(x). Since s_n² is supposed to be close to σ² for
large n, the bootstrap is thought to behave well in this case. If, instead, we had
used

R′(X, F) = [ (Σ_{i=1}^n X_i/n)² − (∫x dF(x))² ] / ∫(x − μ)²dF(x),

then the mean of R′(X*, F̂_n) would have been 1/n, which is exactly the mean of
R′(X, F).
How would one use the fact that R′(X, F) has mean 1/n to "correct" X̄_n² as
an estimator of (∫x dF(x))²? Presumably, one would subtract s_n²/n. Although
this would usually do well, if s_n²/n happens to be larger than X̄_n², one would get
a very silly result.
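The silly result is easy to exhibit (our hypothetical data, not the text's): when the sample mean is near zero and the sample variance is not, the "corrected" estimate of a nonnegative quantity comes out negative.

```python
# Bias-correcting xbar^2 by subtracting s^2/n can go negative, even though
# the target (integral of x dF)^2 is nonnegative.
data = [-1.0, 1.1, -0.9, 0.8]          # hypothetical small sample
n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / n
corrected = xbar ** 2 - s2 / n
print(xbar ** 2, s2 / n, corrected)    # the corrected value is negative here
```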
Similarly, if one were interested in the standard deviation of X̄_n², one could
assume that ∫x⁴ dF(x) < ∞, and apply the bootstrap to

The bootstrap estimate of standard deviation would be the square root of the
average of the values of R(X*, F̂_n). Unfortunately, the term after the minus sign
in (5.90) is not easy to evaluate even with F = F̂_n. An obvious alternative is to
use the average of the values of (Σ_{i=1}^n X_i*/n)².
(5.91)
This makes y′ equal to the γ quantile of R(X, F), where R is the function
on the left of the inequality in (5.91). What is commonly called the
percentile-t bootstrap confidence interval for h(F) would be
(5.92)
¹⁵The reason for switching from Y to σ(F̂_n)y′ is that one would suspect that
y′ depends less on the underlying distribution than does Y.
where F̂*_i is the empirical CDF of the bootstrap sample X*_i. The sample
γ quantile of the R(X*_i, F̂_n) values could then serve for y′ in (5.92). Hall
(1992) examines bootstrap confidence intervals in detail and finds asymptotic
expressions for their actual coverage probabilities.
Example 5.93. Consider the same data used in Example 1.132 on page 71. The
data were a sample of size n = 50 from a Laplace distribution Lap(1, 1). We are
interested in h(F) = ∫x dF(x), the mean of F. We simulated 10,000 bootstrap
samples, X*₁, …, X*₁₀₀₀₀, and for each one, we calculated R(X*_i, F̂₅₀), where F̂₅₀
is the empirical CDF of the 50 observations. The curve labeled "Bootstrap" in
Figure 5.94 shows the 10,000 values of X̄₅₀ + R(X*_i, F̂₅₀)σ̂₅₀, where σ̂₅₀ = σ(F̂₅₀),
the MLE of the standard deviation of X̄₅₀. As an example of confidence intervals,
one-sided 95% lower and upper bound confidence intervals for h(F) based on the
bootstrap are [0.5735, ∞) and (−∞, 1.3199], respectively.
Using the same prior distribution as in Example 1.132, we also used a tailfree
process of Polya tree type to simulate 10,000 values of the mean of the distribution
P. The empirical CDF of these values is plotted as the curve labeled
"Tailfree" in Figure 5.94. Not surprisingly, we see that the extreme quantiles of
the tailfree sample are farther from the sample mean than those of the bootstrap
sample. This is due to the fact that the bootstrap procedure ignores the uncertainty
from not knowing F when calculating R values. That is, we must pretend
[Figure 5.94: empirical CDFs of the 10,000 "Bootstrap" and "Tailfree" values.]
that F = F̂₅₀ when calculating the R values. The Bayesian solution takes into account
the additional uncertainty from not knowing F. The 0.95 upper and lower
bound posterior probability intervals corresponding to the confidence intervals
calculated above are [0.2574, ∞) and (−∞, 1.7628], respectively.
Hall (1992) shows that the percentile-t confidence interval has good frequency
properties in terms of the conditional probability that the interval
covers h(F) given F. With a single sample, as in Example 5.93, frequency
properties are neither apparent nor relevant. Suppose, however, that one
were to use bootstrap confidence intervals in many applications (with many
different Fs). From the classical perspective, one might actually be interested
in the proportion of times that the interval covers h(F) and how this
compares to the nominal confidence coefficient.
Example 5.95. Suppose that we will sample data from several different Lap(μ, σ)
distributions on several different occasions but we do not model the data this way.
Rather, suppose that we use the bootstrap and a tailfree prior of Polya tree type
as in Example 5.93 on page 337. As an example, 1000 data sets of size 50 each
were simulated with many different Laplace distributions. The values of σ were
generated as Γ⁻¹(1, 1) random variables and the locations were σ times N(0, 1)
random variables. Location and scale changes do not affect the calculation of R
in the bootstrap, but they do affect the variance of X̄ and S. For each data set,
1000 bootstrap samples were formed and 1000 observations from the posterior
distribution of the mean ∫x dP(x) were simulated. We counted how many times μ was below
each of the 1000 sample quantiles of the simulated values. Figure 5.96 shows
these proportions for both the bootstrap and Polya tree samples. As expected,
the bootstrap proportions match the nominal significance levels well.
[Figure 5.96: coverage proportions for the Bootstrap and Tailfree samples plotted against the Nominal levels.]
One final note is in order concerning the bootstrap. All you ever learn
about by using the bootstrap, without further modeling assumptions, are
properties of F̂_n. Unless you have a way of saying how much and/or in
what ways knowledge of F̂_n can be transformed into knowledge of P, the
bootstrap can only tell you about F̂_n, not about P.
5.4 Problems
Section 5.1.1:
if Z_i = X_i,
if Z_i = Y_i.
(a) Prove that Z_i is conditionally independent of U_i given Θ.
Section 5.1.2:
14. Suppose that P_θ says that X ∼ Poi(θ) and that we are trying to estimate
exp(−3θ).
(a) Find the Cramér–Rao lower bound for unbiased estimators.
(b) If φ(X) = (−2)^X, find Var_θφ(X).
(c) Find both the Cramér–Rao lower bound and the Chapman–Robbins
lower bound for the variance of unbiased estimators of θ. Which is
larger?
15. Suppose that X₁, …, X_n are conditionally IID Poi(θ) given Θ = θ.
(a) Let r be a known integer. Find the UMVUE of exp(−Θ)Θ^r.
(b) Let n = 1 in part (a). Find the variance of the estimator and the
Cramér–Rao lower bound.
16. Suppose that Y has Exp(1) distribution and is independent of Θ. Suppose
that X₁, …, X_n are IID conditional on Y = y and Θ = θ with N(θ, 1/y)
distribution. We get to observe Y and X₁, …, X_n.
(a) Find the Cramér–Rao lower bound for the variance of unbiased estimators
of Θ.
(b) Show that no unbiased estimator achieves the Cramér–Rao lower
bound.
(c) Explain why the Cramér–Rao lower bound should not be taken seriously
in the problem described here.
17. For the location family of t_a distributions in Example 5.17 on page 303,
prove that the Bhattacharyya lower bound with k = 2 is the same as the
Cramér–Rao lower bound.
18. Let X ∼ U(0, θ) given Θ = θ.
(a) Find the Chapman–Robbins lower bound on the variance of an unbiased
estimator of Θ.
(b) Find an unbiased estimator and find by how much its variance exceeds
the lower bound.
19. In Example 5.19 on page 304, prove that min X_i − 1/n is the UMVUE.
20. Refer to the situation in Example 2.83 on page 112.
(a) Prove that Cov_θ(ψ₁, ψ₂) = 0.
(b) Prove that Var_θφ(X) = 2σ⁴ + 4μ²σ².
Section 5.1.3:
21. Consider the situation in Problem 6 on page 339. Find the MLE of Θ.
22. Find the maximum likelihood estimator of Θ if X₁, …, X_n are conditionally
IID with distribution Exp(θ) given Θ = θ.
23. Suppose that X ∼ Cau(θ, 1) given Θ = θ. Find the MLE of Θ.
24. Suppose that X₁, …, X_n are IID with Laplace distribution Lap(μ, σ) given
Θ = (μ, σ).
(a) Prove that the value of μ that minimizes Σ_{i=1}^n |X_i − μ| is the median
of the numbers X₁, …, X_n.
(b) Find the MLE of Θ.
25. Let X have a one-parameter exponential family distribution given Θ. Suppose
that the MLE of Θ is interior to the parameter space. Prove that the
MLE equals a method of moments estimator. (See Problem 13 on page
340.)
342 Chapter 5. Estimation
26. Consider the situation in Example 5.30 on page 308. Let squared error be the loss, that is, L(θ, a) = (μ^2 − a)^2. Show that the UMVUE dominates the MLE if n ≥ 2. Find a formula for the difference in the risk functions.
Also, find an estimator that dominates both the MLE and the UMVUE.
Section 5.1.4:
27. Consider the situation in Problem 6 on page 339. Find the likelihood func-
tion and show that a Bayesian would take the data at face value (that is, a
Bayesian would calculate the same posterior as if the sample size had been
fixed in advance at whatever value N turns out to be, no matter what the
prior is).
28. Consider the situation in Problem 7 on page 339. If Λ and M are independent a priori with Λ ~ Γ(a, b) and M ~ Γ(c, d), find the posterior mean of Λ given the data.
29. In Example 5.10 on page 299, find the posterior mean of Θ given X = x for all x, assuming that the prior for Θ is Beta(α_0, β_0).
30. Return to the situation of Problem 16 on page 140. If Θ has a prior density f_Θ(θ) = a c^a I_(c,∞)(θ)/θ^{a+1}, find the posterior mean of Θ.
31. *Suppose that, conditional on N, {Xi}_{i=1}^∞ are independent with the first N of them having Ber(1/3) distribution and the rest having Ber(2/3) distribution. The prior for N is f_N(n) = 2^{−n} for n = 1, 2, ....
(a) Find the posterior distribution of N given a finite sample X1, ..., Xn, for known n.
(b) If X1 = 0, ..., Xn = 0 is the observed finite sample, find the posterior mean of N.
Section 5.1.5:
32. Let P0 be the set of distributions on (ℝ, B^1) with finite variance. Let T(P) be the standard deviation of the distribution P. Show that IF(x; T, P) = (x − μ)^2/[2σ] − σ/2, where μ is the mean of P.
33. Let P0 be the class of distributions on (ℝ, B^1) with bounded support, and let T(P) be the supremum of the support. Prove that the influence function for T is IF(x; T, P) = 0 if x ≤ T(P) and ∞ if x > T(P).
34. Find the influence function for the 100α% trimmed mean at a continuous distribution P.
Section 5.2:
is in T for all x. Let Θ have a prior distribution μ_Θ, and let μ_X denote the prior predictive distribution of X. Let C : X → T be another set function such that μ_Θ(C(x)) ≤ μ_Θ(L(x)), a.s. [μ_X]. Prove that Pr(Θ ∈ C(x)|X = x) ≤ Pr(Θ ∈ L(x)|X = x), a.s. [μ_X].
37. Prove Proposition 5.56 on page 319.
38. Prove Proposition 5.61 on page 321.
39. Prove Proposition 5.79 on page 329.
40. Suppose that Ω = ℝ and that the posterior density of Θ given X is strongly unimodal. Let the action space be the set of all closed and bounded intervals [a_1, a_2] in ℝ.
(a) Let the loss function be L_1(θ, [a_1, a_2]) = a_2 − a_1 + c(1 − I_{[a_1,a_2]}(θ)). Prove that the formal Bayes rule is an HPD region.
(b) Let the loss function be L_q(θ, [a_1, a_2]) = (a_2 − a_1)^2 + c(1 − I_{[a_1,a_2]}(θ)). Find the formal Bayes rule.
Section 5.3:
Show how you would use this to find bootstrap confidence intervals for −β_0/β_1.
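The full problem statement is not reproduced above, but for a simple linear regression y_i = β_0 + β_1 x_i + ε_i the quantity −β_0/β_1 is the x-intercept, and a percentile-bootstrap answer might be sketched as follows (all data, sample sizes, and tuning choices here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 2.0 - 0.5 * x + rng.normal(scale=1.0, size=n)  # true -beta0/beta1 = 4

def ratio(xs, ys):
    b1, b0 = np.polyfit(xs, ys, 1)   # slope, intercept
    return -b0 / b1

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)          # resample (x, y) pairs with replacement
    boot.append(ratio(x[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])     # 95% percentile interval
print(lo, hi)
```

Other bootstrap intervals (basic, BCa) would reuse the same resampled ratios with different endpoint rules.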
CHAPTER 6
Equivariance*
Y = MX, where the first n − 1 rows of M have a 1 in column i and a −1 in column n (so that Y_i = X_i − X_n for i = 1, ..., n − 1). Given Θ = θ, the vector of differences (Y_1, ..., Y_{n−1}) has an N_{n−1}(0, Σ) distribution, where Σ has 2 on the diagonal and 1 everywhere off the diagonal. Also,
X_n | Y = y, Θ = θ ~ N(θ + [−1, ..., −1](I_{n−1} − (1/n)J)y, c),
for some number c which we do not need, since the minimum risk equivariant estimator depends only on the mean of this distribution when θ = 0. We can rewrite the mean as
[−1, ..., −1](I_{n−1} − (1/n)J)y = −(1/n) Σ_{i=1}^{n−1} y_i = x_n − (1/n) Σ_{i=1}^n x_i.
F = {f ≥ 0 : ∫ f(x)dx = 1, ∫ xf(x)dx = 0, ∫ x^2 f(x)dx = 1}.
Suppose that X1, ..., Xn are IID with conditional density f(x − θ) given Θ = θ for some f ∈ F. Define r_n(f) to be the greatest lower bound of the risk over the set of all equivariant estimators. Then sup_{f∈F} r_n(f) = r_n(f_0), where f_0 is the standard normal density.
PROOF. If f_0 is the standard normal density, then X̄ is MRE and r_n(f_0) = 1/n. Since X̄ is always equivariant, it must be that r_n(f) ≤ 1/n, since 1/n is the risk of X̄ for all f ∈ F.^3 □
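The key step of the proof, that X̄ has risk exactly 1/n for every f ∈ F, is easy to illustrate by simulation. Below f is a Laplace density rescaled to variance 1 (an arbitrary member of F; the sample size and replication count are also arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 200_000
# A Laplace density with scale 1/sqrt(2) has mean 0 and variance 1, so it lies in F.
x = rng.laplace(loc=0.0, scale=1 / np.sqrt(2), size=(reps, n))
risk = np.mean(x.mean(axis=1) ** 2)   # squared-error risk of X-bar at theta = 0
print(risk, 1 / n)
```

The estimated risk matches 1/n because Var(X̄) = Var(X_1)/n = 1/n for every f ∈ F.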
Example 6.14. Suppose that X1, ..., Xn are IID with conditional density f(x − θ) given Θ = θ, where f(x) = exp[−(x + 1)] for x ≥ −1. This is a family of shifted exponential distributions, rigged so that θ = 0 has mean 0 also. The first order statistic X_(1) = min Xi is equivariant and is a complete sufficient statistic. Since Y is ancillary, Theorem 2.48 says that X_(1) and Y are independent given Θ. So
^3There is a theorem of Kagan, Linnik, and Rao (1965) which shows that, for n ≥ 3, E_0(X̄|Y) = 0 if and only if f = f_0. This would imply that r_n(f) < 1/n if n ≥ 3 and f ≠ f_0.
The risk of δ is its variance plus the squared bias, which is minimized by choosing c = 0. Hence the MRE estimator has c = 0 and is unbiased. □
The posterior mean of log Θ is
∫_x^∞ (log θ) n x^n/θ^{n+1} dθ = log x + 1/n.
δ(X) = δ_0(X) E_1[δ_0(X)|Y]/E_1[δ_0^2(X)|Y].
PROOF. Let δ_0 be an arbitrary equivariant estimator with finite risk. By the scale analog to Lemma 6.6, all other equivariant estimators have the form δ_0(X)/v(Y), where v is scale invariant. Since the risk function is constant for an equivariant δ,
R(θ, δ) = R(1, δ) = E_1[δ_0(X)/v(Y) − 1]^2 = E_1{E_1[(δ_0(X)/v(Y) − 1)^2 | Y]},
which is minimized by minimizing E_1[(δ_0(X)/v(Y) − 1)^2 | Y = y] uniformly in y. To do this, choose v(y) = E_1[δ_0^2(X)|Y = y]/E_1[δ_0(X)|Y = y]. □
It can be shown that the MRE estimator is also the formal Bayes rule with respect to the improper prior 1/θ.
Theorem 6.19. Under the same conditions as in Theorem 6.18, the MRE estimator can be written as the formal Bayes rule with respect to a prior having Radon-Nikodym derivative 1/θ with respect to Lebesgue measure, if it has finite risk.
PROOF. We begin with the equivariant estimator |X_n|. Let Y be as in Theorem 6.18. The transformation from x to (y^T, x_n)^T has Jacobian |x_n|^{n−1}, so
f_{Y,X_n|Θ}(y, x_n|1) = f(|x_n|y)|x_n|^{n−1},
both for y_n = +1 and y_n = −1. So, the conditional density of X_n given Y = y is
It follows that
To see that this is the formal Bayes rule with respect to the "prior" 1/θ, note that the posterior f_{Θ|X}(θ|x) is proportional to f(x/θ)/θ^{n+1}. The expected loss is (aside from the proportionality constant)
∫_0^∞ (θ − a)^2 (1/θ^{n+3}) f(x/θ) dθ.
The left-hand side of this last equation can be rewritten using the associative property as
(h ∘ g^{−1}) ∘ (g ∘ g^{−1}) = g ∘ g^{−1}.
Hence g ∘ g^{−1} = e. It follows that g satisfies the property required to be called the inverse of g^{−1}. Next, note that e ∘ e = e and g^{−1} ∘ g = e, so
g ∘ e = g ∘ (e ∘ e) = g ∘ (e ∘ (g^{−1} ∘ g)) = g ∘ ((e ∘ g^{−1}) ∘ g) = g ∘ (g^{−1} ∘ g) = g.
For the uniqueness claims, first, suppose that h ∘ g = e. Then, using what we have just proved,
which means that the inverse of g is unique. It follows from what we proved above that g is the unique inverse of g^{−1}. Finally, let h ∘ g = g for all g. Then
h = h ∘ (g ∘ g^{−1}) = (h ∘ g) ∘ g^{−1} = g ∘ g^{−1} = e.
Hence, the identity is unique. □
Sometimes two seemingly different groups are essentially the same.
Definition 6.22. Let G_1 and G_2 be groups with compositions ∘_1 and ∘_2, respectively. Let φ : G_1 → G_2 be a one-to-one onto function such that, for all g, h ∈ G_1, φ(g ∘_1 h) = φ(g) ∘_2 φ(h). Then φ is called a group isomorphism.
The following proposition is straightforward.
Proposition 6.23. Let φ be a group isomorphism between G_1 and G_2. Then φ^{−1} is a group isomorphism between G_2 and G_1. Also, φ maps the identity in G_1 to the identity in G_2, and φ maps the inverse of each g ∈ G_1 to the inverse of φ(g) ∈ G_2.
The groups that will most interest us are groups of transformations. The set U in the following definition can be the sample space X, the parameter space Ω, or the action space N.
Definition 6.24. A measurable function f : U → U is called a transformation of U. The function e(u) = u for all u ∈ U is called the identity transformation.
Proposition 6.25. Suppose that G is a set of transformations of a set U with e ∈ G being e(u) = u for all u ∈ U. Let the composition ∘ be the composition of functions. If G is a group with identity e, then every element of G is one-to-one.
Example 6.26. Here are some examples of groups of transformations. When we refer to these examples in the future, we will call the ith one "Group i."
1. U = ℝ^n, g_c(x_1, ..., x_n) = (x_1 + c, ..., x_n + c) for each c ∈ ℝ. The identity is g_0, the inverse of g_c is g_{−c}, and the composition is g_a ∘ g_b = g_{a+b}. This group is abelian.
2. U = ℝ^n, g_c(x_1, ..., x_n) = (cx_1, ..., cx_n) for each c > 0. The identity is g_1, the inverse of g_c is g_{1/c}, and the composition is g_a ∘ g_b = g_{ab}. This group is abelian.
3. U = ℝ^n, g_{(a,b)}(x_1, ..., x_n) = (bx_1 + a, ..., bx_n + a) for each b > 0, a ∈ ℝ. The identity is g_{(0,1)}, the inverse of g_{(a,b)} is g_{(−a/b,1/b)}, and the composition is g_{(a,b)} ∘ g_{(c,d)} = g_{(bc+a,bd)}. This group is not abelian.
4. U = ℝ^n, g_A(x) = Ax, where A ∈ GL(n), the set of nonsingular n × n matrices. The identity is g_I, the inverse of g_A is g_{A^{−1}}, and the composition is matrix multiplication, g_A ∘ g_B = g_{AB}. This group is not abelian. This is called the general linear group of dimension n.
5. U = ℝ^n, g_p(x_1, ..., x_n) = (x_{p(1)}, ..., x_{p(n)}), where p is any permutation of (1, ..., n). The identity is g_{(1,...,n)}, the inverse of g_p is g_{p^{−1}}, and the composition is composition of permutations, g_{p_1} ∘ g_{p_2} = g_{p_1∘p_2}. This group is not abelian.
6. U can be any measurable set and G can be the set of all one-to-one measurable functions whose inverses are also measurable.
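The group laws stated for Group 3 can be verified mechanically. The sketch below encodes the composition and inverse formulas from item 3 and checks them on arbitrary numbers:

```python
import numpy as np

# Elements of Group 3 are pairs (a, b) with b > 0, acting on points by x -> b*x + a.
def compose(p, q):            # (a, b) o (c, d) = (b*c + a, b*d)
    (a, b), (c, d) = p, q
    return (b * c + a, b * d)

def inverse(p):               # inverse of (a, b) is (-a/b, 1/b)
    a, b = p
    return (-a / b, 1 / b)

e = (0.0, 1.0)                # identity element g_(0,1)
g, h = (2.0, 3.0), (-1.0, 0.5)
x = np.array([1.0, 4.0, -2.0])

# the composition law matches acting twice: g(h(x)) = (g o h)(x)
gh = compose(g, h)
assert np.allclose(g[1] * (h[1] * x + h[0]) + g[0], gh[1] * x + gh[0])
# inverse and identity laws
assert np.allclose(compose(g, inverse(g)), e)
assert np.allclose(compose(inverse(g), g), e)
# the group is not abelian
assert compose(g, h) != compose(h, g)
```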
If A ⊂ U, we will use the shorthand notation gA to denote the set
gA = {u ∈ U : u = gy for some y ∈ A}.
Definition 6.27. Let P_0 be a parametric family with parameter space Ω and sample space (X, B). Let G be a group of transformations of X. We say that G leaves P_0 invariant if for each g ∈ G and each θ ∈ Ω there exists θ* ∈ Ω such that P_θ(A) = P_{θ*}(gA) for every A ∈ B.
It is easy to see that the θ* in Definition 6.27 is unique.
Lemma 6.28. Suppose that G leaves P_0 invariant. Then, for each g ∈ G and θ ∈ Ω, the θ* in Definition 6.27 is unique.
PROOF. Let g ∈ G and θ ∈ Ω be given. Suppose that both θ* and θ′ satisfy, for every A ∈ B,
P_θ(A) = P_{θ*}(gA) = P_{θ′}(gA).
It follows that for every A ∈ B, P_θ(g^{−1}A) = P_{θ*}(A) = P_{θ′}(A). Since distinct elements of the parameter space have to provide distinct probability measures, this last equation implies that θ* = θ′. □
We will call the unique value θ* by the name ḡθ to indicate its connection to both g and θ. We can try to understand intuitively what it means to say that G leaves P_0 invariant. Suppose that we believed that X had distribution P_θ given Θ = θ. We already know what the conditional distribution of gX is:
Pr(gX ∈ A | Θ = θ) = Pr(X ∈ g^{−1}A | Θ = θ).
This has nothing to do with equivariance yet. It is a simple consequence of what we already know about the induced distribution of a function of a random quantity. What invariance of distributions means is that the second equation below holds (the others are all consequences of probability theory and group theory):
Pr(X ∈ g^{−1}A | Θ = θ) = P_θ(g^{−1}A) = P_{ḡθ}(gg^{−1}A) = P_{ḡθ}(A) = Pr(X ∈ A | Θ = ḡθ).
So, we see that the conditional distribution of gX given Θ = θ is P_{ḡθ}, which is the conditional distribution of X given Θ = ḡθ.
Proposition 6.29. Suppose that G leaves P_0 invariant. Then, for each g ∈ G, the transformation ḡ : Ω → Ω is one-to-one and onto. Also, Ḡ = {ḡ : g ∈ G} is a group, and the inverse of ḡ is the transformation induced by g^{−1}.
Example 6.32. Consider Group 3. Suppose that X1, ..., Xn are conditionally IID N(μ, σ^2) given Θ = (μ, σ). Let
P_{ḡθ}(gA) = ∫_{bA+a} (1/(bσ√(2π))^n) exp{−(1/(2b^2σ^2)) Σ_{i=1}^n (x_i − a − bμ)^2} dx.
We might ask, "What is the set of all invariant loss functions?" That is, what is the set of all L such that, for every σ, μ, d, a, b,
Since this equation must hold for every a, μ, d, σ, b, it must hold for b = 1/σ and a = −μ/σ no matter what μ and σ are. It then follows that
L((μ, σ), d) = L((0, 1), (d − μ)/σ),
for all d, μ, and σ. That is, L((μ, σ), d) = ρ([d − μ]/σ) for some arbitrary function ρ. It is clear that any L of this form is invariant, so we have found all invariant loss functions.
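The conclusion, that every invariant loss has the form L((μ, σ), d) = ρ([d − μ]/σ), is easy to spot-check numerically. The sketch below takes ρ(t) = t^2 (any function would do) and verifies invariance under randomly drawn group elements:

```python
import numpy as np

def loss(mu, sigma, d, rho=lambda t: t ** 2):
    # a loss of the form rho((d - mu)/sigma); rho is an arbitrary function
    return rho((d - mu) / sigma)

rng = np.random.default_rng(7)
for _ in range(1000):
    mu, d, a = rng.normal(size=3)
    sigma, b = rng.gamma(2.0, size=2)      # positive scale and group parameter
    # group action: parameter (mu, sigma) -> (b*mu + a, b*sigma), action d -> b*d + a
    assert np.isclose(loss(b * mu + a, b * sigma, b * d + a), loss(mu, sigma, d))
```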
The method used at the end of Example 6.32 is actually a very general
method for finding all invariant functions. The first step was to find a
necessary condition for the function to be invariant. The second step is to
check that the condition is also sufficient. A similar method works when
trying to find all equivariant functions. (See Example 6.34 on page 357 for
an illustration.)
Definition 6.33. A decision problem is invariant under G if P_0 and the loss are invariant. In such a case, a nonrandomized decision rule δ(x) is equivariant if δ(gx) = ḡδ(x) for all g ∈ G and all x ∈ X. A randomized rule δ*(x) is equivariant if δ*(gx)(ḡA) = δ*(x)(A) for all measurable sets A of actions, x ∈ X, and g ∈ G. A function v is invariant if v(gx) = v(x) for all x ∈ X and all g ∈ G.
δ(x) = x̄ + s δ((x_1 − x̄)/s, ..., (x_n − x̄)/s).
Since every function of the above form is equivariant, we have found all equivariant rules.
Note that the function v(x) = δ((x_1 − x̄)/s, ..., (x_n − x̄)/s) is invariant and is an element of N, while the function h(x) = (x̄, s) is equivariant and can be thought of as an element of G. With this notation, we have written δ(x) = h(x)v(x).
and δ is equivariant.
For the "only if" part, assume that δ is equivariant. Let v = h^{−1}δ. Then
L((μ, σ), d) = (d_1 − μ)^2/σ^2 + |log(d_2/σ)|.
Then L is invariant and δ(x) is two-dimensional, say (δ_1(x), δ_2(x)). To say δ(gx) = ḡδ(x) means that, for every a, b, x, δ(bx_1 + a, ..., bx_n + a) = (bδ_1(x) + a, bδ_2(x)). Now, we have already seen that δ_0(x) = (x̄, s) is equivariant, so the most general equivariant estimator is δ(x) = δ_0(x)v(x), where v(x) ∈ N is invariant. That is, let v_2(x) be an arbitrary positive invariant function and let v_1(x) be an arbitrary real-valued invariant function. Then δ(x) = (sv_1(x) + x̄, sv_2(x)) is the general form of an equivariant estimator.
the orbit of x.
It is clear that an invariant function is always constant on orbits. In the
statement of Theorem 6.8, Y is maximal invariant. Also, in the statement
of Theorem 6.18, Y is maximal invariant. In invariant decision problems,
the risk function of an equivariant decision rule is constant on orbits. This
follows trivially from part 4 of the following lemma.
Lemma 6.39.^6 In the notation of Definition 6.37,
1. y ∈ O(x) if and only if x ∈ O(y);
2. orbits are equivalence classes;
3. a maximal invariant assumes distinct values on different orbits;
4. suppose that m(x, θ) is invariant under the group actions, that is, m(gx, ḡθ) = m(x, θ) for all x, g, θ. Then the distribution of m(X, θ) given Θ = θ is a function of θ through the maximal invariant in Ω.
PROOF. Parts 1 and 2 are trivial.^7 For part 3, suppose that v is maximal invariant. Consider the function O : X → 2^X defined in (6.38). Clearly, O is invariant, hence it is a function of v. This means that if O(x) ≠ O(y), then v(x) ≠ v(y). That is, v assigns different values on different orbits. For part 4, let r(θ) = Pr(m(X, θ) ∈ B | Θ = θ) for an arbitrary set B. Then,
where c = max{1, x}. The Bayes rule is the posterior mean E(Θ|X = x) = 2c. Suppose that we reparameterize to Θ* = Θ^2. If there is going to be a connection between the two decision problems, the loss function had better transform in such a way that we are essentially estimating the same thing. That is, the loss had better be L*(θ*, a*) = (√θ* − √a*)^2. The prior for Θ* is
(This makes the situation look more like the equivariance setup.) Now, we can find the posterior of Θ*,
L(θ, a) = { (θ − a)^2 if θ ≥ 0, 2(θ − a)^2 if θ < 0.
This says that an error of a certain magnitude is twice as costly if the true temperature is below freezing than if it is freezing or above. If we try to use the same loss function in °K, then no such distinction is made in the costs of errors of the same magnitude. It is ludicrous to claim that these two decision problems are essentially the same problem with different units. Of course, the loss L here is not invariant. The transformed loss
∫_A h(x, y) dx dy = ∫_{(a,b)A} h(x, y) dx dy.
Transform the left-hand side by x = (z − a)/b and y = w/b. Then the Jacobian is 1/b^2, and we get
∫_A h(x, y) dx dy = ∫_{(a,b)A} (1/b^2) h((z − a)/b, w/b) dz dw = ∫_{(a,b)A} h(z, w) dz dw.
(1/b^2) h((z − a)/b, w/b) = h(z, w),
for all a, b and a.e. [dz dw]. So, if a = z and b = w, we must have h(z, w) = h(0, 1)/w^2. It follows that LHM has Radon-Nikodym derivative 1/y^2 with respect to Lebesgue measure. Next, suppose that h(x, y) is the Radon-Nikodym derivative for RHM. Then
∫_A h(x, y) dx dy = ∫_{A(c,d)} h(x, y) dx dy.
Transform the left-hand side by y = w/d and x = z − wc/d. Then the Jacobian is 1/d, and we get
∫_A h(x, y) dx dy = ∫_{A(c,d)} (1/d) h(z − wc/d, w/d) dz dw = ∫_{A(c,d)} h(z, w) dz dw,
for all c, d and a.e. [dz dw]. So, if c = z and d = w, we must have h(z, w) = h(0, 1)/w. It follows that RHM has Radon-Nikodym derivative 1/y with respect to Lebesgue measure. This is not the same as LHM.
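The two densities just derived can be checked symbolically. The sketch below (using sympy) verifies the invariance identities obtained above, b^2 h(bx + a, by) = h(x, y) for the left measure and d·h(yc + x, yd) = h(x, y) for the right one; since both candidate densities depend only on y, only the y substitution matters:

```python
import sympy as sp

y = sp.symbols('y', positive=True)
b, d = sp.symbols('b d', positive=True)

# Candidate LHM density h(x, y) = 1/y**2 (it does not involve x).
# Left translation by (a, b) maps (x, y) -> (b*x + a, b*y) with Jacobian b**2,
# so left invariance requires b**2 * h(b*x + a, b*y) = h(x, y).
h_left = 1 / y ** 2
assert sp.simplify(b ** 2 * h_left.subs(y, b * y) - h_left) == 0

# Candidate RHM density h(x, y) = 1/y.
# Right translation by (c, d) maps (x, y) -> (y*c + x, y*d) with Jacobian d,
# so right invariance requires d * h(y*c + x, y*d) = h(x, y).
h_right = 1 / y
assert sp.simplify(d * h_right.subs(y, d * y) - h_right) == 0
```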
Not only are measures of sets invariant under group operations when
using LHM, but certain integrals are also invariant.
Lemma 6.47.^{10} If λ is LHM on G and f is integrable over G, then for all g ∈ G,
∫_G f(g ∘ h) dλ(h) = ∫_G f(h) dλ(h).
PROOF. First let f(h) = I_B(h) for some B ∈ Γ. Then
∫_G f(g ∘ h) dλ(h) = ∫ I_B(g ∘ h) dλ(h) = ∫ I_{g^{−1}B}(h) dλ(h) = λ(g^{−1}B) = λ(B) = ∫_G f(h) dλ(h).
Lemma 6.48.^{11} Let (G, ∘) be a group, and let (G, Γ) be a topological space with the Borel σ-field. Suppose that λ is σ-finite and not identically 0 LHM on (G, Γ). Suppose that the function f : G × G → G defined by f(g, h) = g^{−1} ∘ h is continuous. If λ′ is also σ-finite and not identically 0 LHM on (G, Γ), then there exists a finite positive scalar c such that λ′ = cλ.
PROOF. The first step is to prove that r(g) = I_{B^{−1}}(g)/λ(Ag) is a measurable function of g for each A, B ∈ Γ. Since f(g, h) is continuous, it is continuous in g for fixed h, hence f*(g) = f(g, e) = g^{−1} is continuous, hence measurable. If B ∈ Γ, then f*^{−1}(B) = B^{−1}, hence B^{−1} ∈ Γ. It follows that I_{B^{−1}}(g) is measurable. The function f′(g, h) = h ∘ g = f(f*(h), g) is also continuous, hence measurable. It follows that v(g, h) = (g^{−1}, h ∘ g) is continuous and measurable. It is easy to see that v^{−1} = v, so if A ∈ Γ, then G × A ∈ Γ ⊗ Γ, hence v(G × A) ∈ Γ ⊗ Γ, and m(g, h) = I_{v(G×A)}(g^{−1}, h) is a measurable function. Define Λ(g) = ∫ m(g, h) dλ(h). By Lemma A.67, this is a measurable function. Now notice that I_{v(G×A)}(g^{−1}, h) = I_{Ag}(h) and calculate
Λ(g) = ∫ I_{Ag}(h) dλ(h) = λ(Ag).
It follows that r is measurable.
Next, we prove that the following two one-to-one bicontinuous functions preserve measure in the product space (G × G, Γ ⊗ Γ, λ′ × λ):
The proofs are similar, and we only prove that T_2 preserves measure. Note that E ∈ Γ ⊗ Γ implies that, for every h ∈ G,
^{11}This lemma is used to prove Corollary 6.52. The proof is adapted from Halmos (1950, Theorem 60.C).
λ′(A) = ∫ r(h) dλ(h),
where the second-to-last equation follows from r(g^{−1})λ(Ag^{−1}) = I_B(g). Next, apply (6.50) with λ′ = λ to get
The following result, whose proof is based on the same concept as the
proof of Lemma 6.47, is useful for converting between integrals with respect
to LHM and RHM.
Proposition 6.53.^{13} If λ is LHM, ρ is the related RHM, and f is integrable with respect to ρ, then ∫ f(g) dρ(g) = ∫ f(g^{−1}) dλ(g). If f is integrable with respect to λ, then ∫ f(g) dλ(g) = ∫ f(g^{−1}) dρ(g).
The following result gives a method for converting one LHM or RHM
into many others.
Lemma 6.54.^{14} Let ρ_g(B) be defined as ρ(gB), and let λ_g(B) = λ(Bg). Then ρ_g is RHM and λ_g is LHM for each g ∈ G.
PROOF. Since λ_g(hB) = λ(hBg) = λ(Bg) = λ_g(B), we have that λ_g is LHM, and a similar argument works for ρ_g. □
By Lemma 6.48, λ_g is a multiple of λ. Let the multiple be c_g. Similarly, by Corollary 6.52, ρ_g is a multiple of ρ, so define c′_g by ρ_g(B) = c′_g ρ(B). In abelian groups, c_g = c′_g = 1 for all g, if λ and ρ are related. We introduce ρ_g because it will play an important role in the proof that Pitman's estimator is the formal Bayes rule with respect to an invariant measure.
We obtain interesting results if we replace λ by ρ in Lemma 6.47 and replace ρ by λ in Proposition 6.53.
Lemma 6.55.^{15} If ρ is RHM and g ∈ G and f is integrable with respect to ρ, then
∫ f(g ∘ h) dρ(h) = c′_{g^{−1}} ∫ f(h) dρ(h).
If λ is LHM and g ∈ G and f is integrable with respect to λ, then
∫ f(h ∘ g) dλ(h) = c_{g^{−1}} ∫ f(h) dλ(h).
PROOF. In a manner similar to the proof of Lemma 6.47, we can prove that
The proof of the other part is virtually identical, using Proposition 6.53. 0
Actually, the numbers Cg and c~ are related.
13This proposition is used in the proofs of Lemmas 6.55, 6.56, 6.65, 6.66, and
Theorem 6.74.
14This lemma is used in the proof of Lemma 6.68.
15This lemma is used in the proofs of Lemmas 6.56, 6.65, and 6.66 and Theo-
rem 6.74.
Lemma 6.56.^{16} If λ and ρ are related, then c_g = 1/c′_g = c′_{g^{−1}} for all g ∈ G. Also, c′_g = c_{g^{−1}}.
PROOF. Let f be integrable with respect to ρ. Use Lemma 6.55 twice to write
= c′_g c′_{g^{−1}} ∫ f(h) dρ(h),
from which it follows that c′_{g^{−1}} = 1/c′_g. Next, use what we just proved and Proposition 6.53 and Lemma 6.55 to show that
ρ_g(B) = ∫_{gB} (1/y) dx dy = ∫_B (g_2^2/(g_2 w)) dz dw = g_2 ∫_B (1/w) dz dw = g_2 ρ(B),
so c′_g = g_2.
where π is the permutation required to return the order statistic to the original data. Here Y = ℝ^{n−2} × Π, where Π is the set of permutations, and G = ℝ × ℝ^+. Then t(gx) = (g(x_(n), x_(n) − x_(1)), y). There are several invariant losses. Here are three:
L_1(θ, a) = (θ_1 − a)^2/θ_2^2, with N = ℝ;
L_2(θ, a) = (θ_2 − a)^2/θ_2^2, with N = ℝ^+;
L_3(θ, a) = (θ_1 − a_1)^2/θ_2^2 + (θ_2 − a_2)^2/θ_2^2, with N = ℝ × ℝ^+.
The first is for location estimation, the second is for scale estimation, and the third is for simultaneous estimation of both. RHM has Radon-Nikodym derivative 1/σ with respect to Lebesgue measure. We can explicitly work through the normal distribution case. The likelihood function based on n observations is
where w = Σ_{i=1}^n (x_i − x̄)^2. To find the posterior, we multiply by 1/σ and find the appropriate constant. If we let τ = σ^{−2}, then σ = τ^{−1/2} and dσ = −τ^{−3/2} dτ/2. The posterior is, for some constant c,
This has the form of the product of an N(x̄, 1/(nτ)) density times a Γ([n − 1]/2, w/2) density. The posterior distribution of
√n(X̄ − Θ_1)/Θ_2   (6.61)
is N(0, 1) given Θ_2. But since this distribution does not depend on Θ_2, it is also the marginal posterior. Also, the posterior distribution of w/Θ_2^2 is χ^2_{n−1}, where w = Σ_{i=1}^n (x_i − x̄)^2. These distributions parallel the prior conditional distributions of the sufficient statistics given Θ. That is, prior to seeing the data and conditional on Θ, (6.61) has N(0, 1) distribution and W/Θ_2^2 ~ χ^2_{n−1}. The posterior distributions were named fiducial distributions by Fisher (1935), because they seem to fall right out of the conditional distributions given Θ without any need for a prior on Θ. The quantity in (6.61) and w/Θ_2^2 are called pivotal quantities. These will be special cases of a more general result that will come later (Corollary 6.67).
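The posterior computations in this example can be confirmed by simulation: draw from the posterior just derived (τ = Θ_2^{−2} given the data is Γ([n − 1]/2, w/2), and Θ_1 given τ is N(x̄, 1/(nτ))) and check the two pivotal quantities. The data set below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(5.0, 2.0, size=10)            # any fixed data set
n, xbar = len(x), x.mean()
w = ((x - xbar) ** 2).sum()

# Posterior under the RHM prior 1/sigma:
# tau ~ Gamma((n-1)/2, rate w/2), and Theta_1 | tau ~ N(xbar, 1/(n*tau)).
draws = 200_000
tau = rng.gamma((n - 1) / 2, 2 / w, size=draws)   # numpy uses shape/scale, so scale = 2/w
mu = rng.normal(xbar, 1 / np.sqrt(n * tau))

z = np.sqrt(n) * (xbar - mu) * np.sqrt(tau)  # the quantity in (6.61)
chi = w * tau                                # w / Theta_2^2

print(z.mean(), z.var(), chi.mean())         # near 0, 1, and n - 1, respectively
```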
∫∫ I_{gB}(h, y) f_{X|Θ}(h, y|ḡθ) dλ(h) dν(y) = ∫∫ I_B(h, y) f_{X|Θ}(g ∘ h, y|ḡθ) dλ(h) dν(y),
where the last equality follows from Lemma 6.47. Since this is true for all B ∈ A, the integrands of the first and last lines must be equal a.e. [λ × ν]. This immediately implies (6.63). □
A simple corollary to this result is obtained by letting g = φ^{−1}(η(θ)^{−1}), where φ and η are defined in Assumption 6.58.
Corollary 6.64. Assume Assumption 6.58. There exists a function r : G × Y → ℝ such that, for every θ ∈ Ω,
where the second factor on the right is the conditional density of H given Y and Θ.
PROOF. Since Ω = Ḡ, there is only one orbit in the parameter space, hence the maximal invariant is constant. Since Y is invariant, Lemma 6.39, part 4, shows that Y is ancillary.
To calculate the posterior density of Θ given the data, we need the marginal "density" of the data:
= ∫ r(ψ ∘ h, y) dλ(ψ) = ∫ c_{h^{−1}} r(ψ, y) dλ(ψ) = c_{h^{−1}} ∫ f_{X|Θ}(x, y|e) dλ(x) = f_Y(y) c_{h^{−1}},
17This lemma is used in the proofs of Lemma 6.66 and Theorem 6.74.
where the second and fifth equalities follow from Corollary 6.64, the third
follows from Proposition 6.53, the fourth follows from Lemma 6.55, and the
sixth follows from the fact that Y is ancillary and from Lemma 6.56. The
posterior density of Θ given X = (h, y) with respect to ρ is calculated via Bayes' theorem 1.31:
∫ L(θ, η(h, y)) f_{H|Y,Θ}(h|y, θ) dλ(h) = R(θ, η|y),
where the first equality follows from Lemma 6.65 and equivariance of TJ,
the second and eighth follow from invariance of L and the definition of
conditional density, the third and seventh follow from Corollary 6.64, the
fourth is elementary group theory, the fifth follows from Lemma 6.55 and
Lemma 6.56, the sixth follows from Proposition 6.53, and the ninth follows
from the definition of conditional risk function. 0
6.2. Equivariant Decision Theory 373
where the second equality uses Corollary 6.64 and the invariance of the loss function, the fourth equality follows from Lemma 6.54, the fifth follows from Corollary 6.64, and the sixth uses the fact that c_e = 1. Using the definition of d, the last integral above is minimized when h^{−1}a = d(y), that is, when a = hd(y) = δ(h, y). □
It may be the case that all equivariant rules are inadmissible. For example, with Group 4 and one n-dimensional normal observation X, the MRE rule is to estimate Θ by X. But we saw in Section 3.2.3 that this is inadmissible if n ≥ 3.
In later sections we will see Theorems 6.74 and 6.78, which are like Theorem 6.59. The conclusions to those theorems say that certain formal Bayes inferences with respect to RHM priors agree with classical inferences conditional on the ancillary Y. This is why, in Theorem 6.59, we also showed
6.3. Testing and Confidence Intervals 375
that the MRE decision rule is MRE conditional on Y. Theorems 6.59, 6.74,
and 6.78 parallel each other more this way.
Sometimes, the conclusions of Theorem 6.59 hold even when its condi-
tions are not strictly met. For example, suppose that there is a nuisance
parameter. It may be the case that for each fixed value of the nuisance
parameter, the conditions of Theorem 6.59 apply to the problem with the
appropriate subparameter space.
Example 6.71. Suppose that X1, ..., Xn are conditionally independent with N(μ, σ^2) distribution given Θ = (μ, σ). Let N = ℝ and L(θ, a) = (μ − a)^2. This loss is not invariant under Group 3; however, it is invariant under Group 1. But the parameter space is not isomorphic to Group 1. For each value of σ, consider the subparameter space Ω_σ = {(μ, σ) : μ ∈ ℝ}. The formal Bayes rule with respect to RHM on Ω_σ is δ(x) = x̄ for each σ. Since δ is the MRE rule for each σ, it is the MRE rule under Group 1 for the original problem.
φ_{θ,g}(h, y) = { 1 if h ∈ θgΩ_H, 0 if not.
tests^{18} and each class may lead to a different P-value. However, there is only one posterior probability that Θ ∈ Ω_H with respect to RHM. Hence, we needed to identify exactly which class of tests has P-value equal to that posterior probability. This point will become clearer after Theorem 6.78.
As in the proof of Theorem 6.59, we will assume that X = t(X) and that Ḡ = G = Ω, to make the notation simpler. The following lemma is also useful.
Lemma 6.76. Under Assumption 6.58, the conditional distributions of the
H part of X given Y are invariant.
PROOF. Let B be a measurable subset of G, let g ∈ G, and let μ be the probability measure that gives the marginal distribution of Y. Define
This shows that the conditional size of the test φ_{θ,g} (given Y) as a test of H_0 is equal to its conditional power function at θ. For each g ∈ G, define
^{18}We put the word "equivariant" in quotes because it is not the test function itself that is equivariant, but rather the combination of the hypothesis and the test function. That is, if ψ_θ is a test of Ω_H, then ψ_θ(h, y) = ψ_{ḡθ}(gh, y).
where the first equality follows from Lemma 6.65, the second equality fol-
lows from Corollary 6.64, the third follows from Proposition 6.53, the fourth
follows from Lemma 6.55, the fifth is just algebra, and the sixth follows from
Corollary 6.64. 0
Example 6.77. Let X1, ..., Xn be conditionally IID with N(μ, σ^2) distribution given Θ = (μ, σ). The group is location and scale (Group 3). Consider the hypotheses Ω_H = {(a, b) ∈ Ω : a ≤ μ} for θ = (μ, σ). The corresponding tests are the usual one-sided t-tests. (The reader should check that the conditions of Theorem 6.74 are satisfied.) The associated P-values equal the posterior probabilities that the hypotheses are true if the prior is RHM, the measure with Radon-Nikodym derivative 1/σ with respect to Lebesgue measure.
As a less familiar example, let
Ω_H = {(a, b) ∈ Ω : a ≥ μ, b ≤ σ}.
This is a simultaneous test of H_{μ,σ} : M ≥ μ and Σ ≤ σ. We will check condition 4 only. Since e = (0, 1), h = (m, s) ∈ Ω_e satisfies s ≤ 1 and m ≥ 0. For such h,
since σs ≤ σ and μ + σm/s ≥ μ. Suppose that data (x̄_n, s_n) are observed, with x̄_n = Σ_{i=1}^n x_i/n and s_n = √(Σ_{i=1}^n (x_i − x̄_n)^2/(n − 1)). The test φ_{θ,g} rejects H_{μ,σ} if x̄_n ≥ μ + s_n g_1/g_2 and s_n ≥ σg_2. The P-value is the size of the test φ_{θ,((x̄_n − μ)/σ, s_n/σ)}.
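For the one-sided hypotheses in the first paragraph of this example, the claimed agreement between P-values and RHM posterior probabilities can be verified directly with standard t distribution functions: under the prior with density 1/σ, the marginal posterior of M is x̄ + (s/√n)T with T ~ t_{n−1}. (The data below are simulated purely for illustration.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=12)
n, mu0 = len(x), 0.5
xbar, s = x.mean(), x.std(ddof=1)

t_obs = np.sqrt(n) * (xbar - mu0) / s
# P-value of the one-sided t-test of H: mu <= mu0 (reject for large t)
p_value = stats.t.sf(t_obs, df=n - 1)
# posterior probability that M <= mu0 under the RHM prior 1/sigma
post_prob = stats.t.cdf(np.sqrt(n) * (mu0 - xbar) / s, df=n - 1)
print(p_value, post_prob)
```

The two numbers agree because the t density is symmetric, so P(T ≥ t) = P(T ≤ −t).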
PROOF. As in the proofs of Theorems 6.59 and 6.74, we will assume that X = t(X) and Ḡ = G = Ω for ease of notation. Hence, we will write x = (h, y). Now, write B(h, y) = hB(e, y), and use Corollary 6.67 to say that
Pr(Θ^{−1}H ∈ B(e, Y) | Y = y) = Pr(Θ^{−1}H ∈ B(e, Y) | (H, Y) = (h, y)),
Example 6.80. Let X1, ..., Xn be conditionally IID with N(μ, σ^2) distribution given Θ = (μ, σ). Here (X̄_n, S_n) is a complete sufficient statistic, where X̄_n = Σ_{i=1}^n X_i/n and S_n = √(Σ_{i=1}^n (X_i − X̄_n)^2/(n − 1)). So, we will ignore the Y part of the problem, since Y is independent of the complete sufficient statistic. Let
where w = Σ (x_i − x̄)^2. Nobody would ever base a test on this alone, because it is ancillary in location-scale problems. In the normal distribution case, it is not even a function of the sufficient statistic.
If we first consider the sufficient statistic (X̄, W), we see that the maximal invariant is constant, hence only constant functions of the sufficient statistic are invariant.
This example raises the question of whether reduction of the set of tests by invariance is compatible with reduction by sufficiency. That is, suppose that we first reduce to the set of invariant tests, then find a sufficient statistic for the maximal invariant parameter, and further reduce by considering only invariant tests that are a function of the sufficient function of the maximal invariant. Will we get the same tests as we would if we first reduced to only those tests that depend on the sufficient statistic and then reduced to only those that depend on the maximal invariant in the space of sufficient statistics? In Example 6.81 on page 381, the answer is yes, but only because both methods produce degenerate results. Hall, Wijsman, and Ghosh (1965) find conditions for this compatibility.
The following assumption is an obvious preliminary. It requires that the group operation is inherited by the sufficient statistic space.
Assum ption 6.82. If T(X) is sufficient and, for each 9 E G, we define
Tg(x) = T(gx), then Tg(x) depends on x only through T(x).
Example 6.83. Suppose that X = ℝ^n and T(x) = (x̄, w), where w = Σ_{i=1}^n (x_i − x̄)^2. Then g_{a,b}x = (bx_1 + a, ..., bx_n + a) and T(g_{a,b}x) = (bx̄ + a, b^2 w). This function
The last factor equals Pr(X ∈ A | T = t_u) by (b), so it factors out of the sum. Also,
V(x) = ((x_1 − x_n)/(x_{n−1} − x_n), ..., (x_{n−2} − x_n)/(x_{n−1} − x_n), sign(x_{n−1} − x_n)),
The proof of this is very similar to the proof of part 4 of Lemma 6.39.
Definition 6.88. A test is UMPU almost invariant (UMPUAI) level α
PROOF. Let U_α be the class of all unbiased level α tests. First, we show that φ ∈ U_α if and only if φ_g ∈ U_α for each g, where φ_g(x) = φ(gx):
E_θ φ*(gX) = E_{ḡθ} φ*(X) = sup_{φ∈U_α} E_{ḡθ} φ(X) = sup_{φ∈U_α} E_θ φ(gX) = sup_{φ∈U_α} E_θ φ_g(X) = sup_{φ∈U_α} E_θ φ(X) = E_θ φ*(X),
since φ* is UMP in U_α. So, φ*_g = φ* a.e. [μ] for each g by the uniqueness of φ*. This makes φ* almost invariant. Since φ_0 is UMPUAI level α, β_{φ_0}(θ) ≥ β_{φ*}(θ) for all θ ∈ Ω_A, so φ_0 is also UMPU level α and φ_0 = φ* a.e. [μ] also, and so it is also unique a.e. [μ]. □
This theorem does not guarantee that the UMPUAI level α test is UMPU, but it provides insurance that if there is a unique UMPU level α test, we can find it by finding the UMPUAI level α test.
with γ^T = (γ_1^T, γ_2^T) and γ_1 the first r coordinates, preserves the hypothesis. So the testing problem is invariant.
The maximal invariant in X is determined by f(g(x, w)) = f(x, w) for all g, x, and w. So, for fixed x and w, let A have first row proportional to x_1^T, b = −x_2, and let c = 1/√w. Then

f(x, w) = f( ((x_1^T x_1 / w)^{1/2}, 0, …, 0)^T, 1 ).

So x_1^T x_1 / w is maximal invariant, since it is clearly invariant. The usual F statistic for testing H is just d/r times the maximal invariant, and it has noncentral F distribution NCF(r, d, δ), where δ = Σ_{i=1}^r γ_i²/σ² conditional on the parameters. The hypothesis H is equivalent to δ = 0. Since the noncentral F distribution has MLR in the noncentrality parameter (see Problem 29 on page 289), the F-test is UMPUAI level α.
G_1 = {g¹_A : A is a p × (k − q) matrix},
G_2 = {g²_D : D is (n − k) × (n − k) orthogonal},
G_3 = {g³_C : C is q × q orthogonal},
G_4 = {g⁴_E : E is p × p nonsingular}.
In particular, suppose that A = −M_2; then f(M) = f([EM_1C | 0 | EM_3D]), where 0 is a matrix of all zeros. Now, consider the following lemma.
Lemma 6.90. Two a × b matrices R and T satisfy RR^T = TT^T if and only if T = RQ for some b × b orthogonal matrix Q.
W^{−1/2} B W^{−1/2} = Γ diag(λ_1, …, λ_s, 0, …, 0) Γ^T = ΓΛΓ^T,
We might wish to choose values of δ and α and then require that for all θ ∈ Ω_δ, P_θ(reject H) ≤ α. This means that we are trying to test H′ : Θ ∈ Ω_δ at level α. We will use a version of the group described in Problem 11 on page 389. An element g_a of the group acts on X by g_a(x_1, …, x_n) = (c + a(x_1 − c), …, c + a(x_n − c)). The maximal invariant in the sufficient statistic space is T = √n(X̄ − c)/S.²² The maximal invariant in the parameter space is B = (M − c)/Σ. We know that V ≤ c if and only if √m(V − M)/Σ ≤ −√mB, and so P_θ(V ≤ c) ≥ δ if and only if B ≤ Φ^{−1}(1 − δ)/√m. So, Ω_δ = {θ : β ≤ β_0}, where β = (μ − c)/σ and β_0 = Φ^{−1}(1 − δ)/√m. So the test we seek is equivalent to H : B ≤ β_0. The conditional distribution of T given B = β is noncentral t, NCt_{n−1}(√nβ). This distribution has increasing MLR in the noncentrality parameter β (see Problem 29 on page 289). The UMP invariant level α test is to reject H if T is greater than the 1 − α quantile of the NCt_{n−1}(√nβ_0) distribution. Let this quantile be denoted d. Then T > d is equivalent to c ∈ [X̄ − dS/√n, ∞), which in turn is equivalent to the test found in Example 5.73 on page 326.
6.4 Problems
Section 6.1.1:
22The reader might wish to prove this in solving Problem 11 on page 389.
Section 6.1.2:
7. For each vector x = (x_1, …, x_n), let k(x) denote the subscript of the last nonzero coordinate with k(0, …, 0) = 0. Let x_0 = 1. Prove that a function u is scale invariant if and only if it is a function of x only through y(x) = (x_1/|x_{k(x)}|, …, x_n/|x_{k(x)}|).
8. Suppose that δ_0 is scale equivariant and not identically 0. Prove that δ_1 is scale equivariant if and only if δ_1 = uδ_0 for some scale invariant u.
Section 6.2.1:
(a) Prove that the formal Bayes rule with respect to the improper prior with Radon–Nikodym derivative 1/σ with respect to Lebesgue measure is the usual level 1/(1 + R) t-test.
(b) Let G be a group that acts on X as follows:
Section 6.2.2:
12. Prove Proposition 6.43 on page 361. (Hint: The proof is very much like
Example 6.42. There is no need to transform the data.)
Section 6.2.3:
13.* Let Ω = (0, ∞). Suppose that X_1, …, X_n are IID given Θ = θ each with density

f_{X_1|Θ}(x|θ) = (1/θ) f(x/θ),

for some density function f. Let G be Group 2 and N = [0, ∞). Let L(θ, a) = (θ^r − a)²/θ^{2r}, for some r > 0.
(a) Find the groups acting on the parameter and action spaces so that this problem is invariant.
(b) Characterize all equivariant rules.
(c) Write a formula for the MRE rule.
(d) If f(x) = I_{[0,1]}(x), find the MRE rule.
14. Let Ω = (0, ∞). Suppose that X_1, …, X_n are IID given Θ = θ each with density

f_{X_1|Θ}(x|θ) = (1/θ) f(x/θ),

for some density function f. Let G be Group 2 and N = [0, ∞). Let L(θ, a) = (k log(θ) − r log(a))², for some k, r > 0.
(a) Find the groups acting on the parameter and action spaces so that this problem is invariant.
(b) Characterize all equivariant rules.
(c) Write a formula for the MRE rule.
15. Let X_1, …, X_n be IID U(0, θ) random variables conditional on Θ = θ, and let the action space be N = [0, ∞). Let the loss function be L(θ, a) = (1 − a/θ)².
(a) Show that this problem is invariant under the one-dimensional scale
group, Group 2.
(b) Find the MRE decision rule.
16. Prove Corollary 6.52 on page 366.
17. Prove Proposition 6.53 on page 367.
18. Suppose that X_1, …, X_n are IID U(θ_1, θ_2 + θ_1) given Θ = (θ_1, θ_2), where Ω = ℝ × ℝ⁺. Let N = Ω and

L(θ, a) = ((θ_1 − a_1)/θ_2)² + ((θ_2 − a_2)/θ_2)².

Show that this problem is invariant under Group 3, and find the MRE decision rule.
19. Let f : ℝ → [0, ∞) be a function such that ∫ |x| f(x) dx < ∞. Suppose that X_1, …, X_n are conditionally IID given Θ = θ each with density f(x − θ). Let the prior density of Θ be proportional to f(c − θ). Suppose that the loss function is L(θ, a) = ρ(θ − a) for some function ρ. If the formal Bayes rule exists, show that it is the same as the MRE decision based on a sample containing one extra observation X_{n+1} = c.
20. Let X_1, …, X_n (n ≥ 2) be IID with Exp(1/θ) distribution given Θ = θ. Use the one-dimensional scale group, Group 2. Let the action space be the same as the parameter space, and let the loss be L(θ, a) = (θ² + a²)/(aθ).
(a) Find groups to act on the parameter and action spaces so that the
decision problem is invariant.
(b) Find the best equivariant rule.
21. Suppose that X_1, …, X_n are conditionally IID given Θ = θ each with conditional density

f_{X_1|Θ}(x|θ) = (αθ^α / x^{α+1}) I_{[θ,∞)}(x),

where α is known and the parameter space is Ω = (0, ∞). Let Group 2 (the one-dimensional scale group) act on the data. Let the action space be the same as the parameter space.
(a) Find groups acting on the parameter and action spaces so that the decision problem with loss L(θ, a) = (θ − a)²/θ² is invariant.
(b) Find the MRE decision rule.
Section 6.3.1:
22. Show that Theorem 6.74 applies, and state the conclusions of the theorem
in the situation described in Problem 31 on page 289.
Section 6.3.2:
23. *Each part of this question assumes the hypotheses of the preceding parts.
(a) Let P and Q be probability measures on (ℝ, B), where B is the Borel σ-field. Suppose that X = (X_1, …, X_n) is an IID sample from a distribution with probability measure P. Let Y be another real-valued random variable independent of X with distribution Q. Let C = C(X) be a measurable subset of ℝ. Define the content of C by Q(C). Prove that the expected value of the content equals the probability that C(X) contains Y. You may assume all necessary measurability conditions.
(b) Let Ω be a parameter space, and suppose now that P is only known to be an element of the parametric family {P_θ : θ ∈ Ω} and that Q is only known to be an element of the parametric family {Q_θ : θ ∈ Ω} (same parameter space). Let E_θ represent expectation with respect to the conditional distribution of X given Θ = θ. Suppose that we wish
the conditional distribution of X given e = (J. Suppose that we wish
Section 6.3.3:
26.* Return to Problem 56 on page 293. Find a group of rotations and a loss function for estimating θ_2 that make the decision problem invariant. Show that the hypothesis and alternative H : θ_1 = 0 and A : θ_1 > 0 are invariant, and find the form of the UMPUAI level α test as closely as you can. (I do not think you can find the cutoffs in closed form.)
27.* Suppose that X is distributed like N_k(μ, Σ) given Θ = (μ, Σ). Let the group be Group 4 on page 354. Only one vector observation will be available.
(a) Show that the family of distributions is invariant and show how a group element acts on the parameter space.
(b) Suppose that we wish to test the hypothesis H : M = 0 versus A : M ≠ 0. Show that the hypothesis-testing problem is invariant, and find the maximal invariant in the data space. Why are invariant tests useless in this case?
(c) Suppose that we wish to estimate M. Our action space is N = ℝ^k, and our loss function is L(θ, a) = (μ − a)^T Σ^{−1} (μ − a). Find a group G operating on N so that the loss is invariant.
(d) For the estimation problem, show that all equivariant rules are of the form δ(x) = cx for some scalar c. (Hint: First, prove that for i = 1, …, k, if x has 0 in coordinate i, then δ(x) has zero in coordinate i also. Finally, write δ(x) = α(x)x + β(x)y(x), where y(x) is orthogonal to x for all x and the representation is unique unless β(x) = 0. Then let A be an orthogonal matrix with first row proportional to x^T and second row proportional to y(x)^T.)
28. Suppose that X_1, …, X_n are conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Let G be the one-dimensional location group g_c x = x + c1.
(a) Show that Assumption 6.82 holds.
(b) For what kinds of hypotheses can we find UMPUAI tests?
(c) Will these tests be UMPU?
29. Suppose that X_{i,1}, …, X_{i,n_i} are conditionally distributed as N_p(μ_i, Σ) given M_i = μ_i and Σ̃ = Σ for i = 1, …, k and all X_{i,j} are conditionally independent. (Here Σ is a p × p positive definite matrix.) Suppose that the hypothesis to test is H : MAC = 0, where M is the p × k matrix whose ith column is M_i, A is a k × k diagonal matrix with √n_i in the ith diagonal element, C is a k × r matrix that equals the first r columns of an orthogonal matrix, and 0 is a p × r matrix of all zeros. (Compare to the one-way
analysis of variance on page 384.) Transform the data in order to put this
problem into the form of the multivariate analysis of variance, and find the
matrices Wand B in the discussion that begins on page 386.
CHAPTER 7
Large Sample Theory
¹The norm of x is denoted by ‖x‖. Note that a normed linear space is a metric space with metric d(x, y) = ‖x − y‖.
7.1. Convergence Concepts 395
Note that in the definition of O_P, the c is allowed to vary with ε, so there is no obvious analog to Proposition 7.4 for O_P. We will usually leave the subscript n off of the norm ‖·‖_n, since there is seldom any chance of confusing one norm with another.
Example 7.5. Let {Z_n}_{n=1}^∞ be IID random variables with mean μ and variance σ². Let X_n = √n(Z̄_n − μ)/σ. So, X_n = ℝ for every n, and P_n, which is the distribution of X_n, is a probability measure on the Borel subsets of the real line. The central limit theorem B.97 (together with Problem 25 on page 664) says that lim_{n→∞} P_n((−∞, t]) = Φ(t) for all t, where Φ is the standard normal CDF. For each ε > 0, there exists t such that Φ(t) − Φ(−t) ≥ 1 − ε/2. Choose N such that for each n ≥ N,

P_n((−∞, −t]) ≤ Φ(−t) + ε/4,  P_n((−∞, t]) ≥ Φ(t) − ε/4.

It follows that Pr(|X_n| ≤ t) equals

P_n((−∞, t]) − P_n((−∞, −t]) ≥ Φ(t) − Φ(−t) − ε/2 ≥ 1 − ε.
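The boundedness-in-probability claim can be checked numerically. The sketch below is illustrative only; the sample sizes, the Uniform(0,1) choice for the Z_i, and the helper name are our own, not the book's. It verifies that Pr(|X_n| ≤ 1.96) stays near 0.95 for every n, so the sequence is bounded in probability:

```python
import random, math

def standardized_mean(n, rng):
    # X_n = sqrt(n) * (Zbar_n - mu) / sigma for one Uniform(0,1) sample of size n
    mu, sigma = 0.5, (1.0 / 12.0) ** 0.5
    zbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (zbar - mu) / sigma

# Phi(1.96) - Phi(-1.96) is about 0.95, so Pr(|X_n| <= 1.96) should stay
# near 0.95 for every large n: the same t works for all n.
rng = random.Random(0)
for n in (10, 100, 1000):
    frac = sum(abs(standardized_mean(n, rng)) <= 1.96 for _ in range(2000)) / 2000
    print(n, round(frac, 2))
```

The same t = 1.96 works uniformly in n, which is exactly what O_P(1) requires.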
For the "only if" part, assume that P(T) = 1 and let T_n(ε) be as in Definition 7.11. Since f_n(x_n) = o(r_n) for (x_1, x_2, …) ∈ T, it follows that

Z_n ≤ ‖f_n(x_n)‖/|r_n| + 1/n → 0, as n → ∞,

so lim_{n→∞} Z_n = 0. For each ε > 0 and c > 0, choose N such that if n ≥ N, then Z_n ≤ c. It follows that, if n ≥ N, then Pr(‖Y_n‖/|r_n| ≤ c) ≥ 1 − ε. Hence, Y_n = o_P(r_n). □
where g^{(k)} denotes the kth derivative of g. Taylor's theorem C.1 says (among other things) that

lim_{x→c} [w(x) − T_k(x, c)] / (x − c)^k = 0.

Suppose that X_n − c = O(r_n), where r_n = o(1). Then X_n − c = o(1), and we conclude that w(X_n) − T_k(X_n, c) = o((X_n − c)^k), hence w(X_n) = T_k(X_n, c) + o(r_n^k). Similarly, we can write w(X_n) = T_{k−1}(X_n, c) + O(r_n^k). Now, suppose that X_n − c = O_P(r_n). In the notation of Theorem 7.15, let X_n = ℝ for all n and let Y_0 = Y_{1,1} = ℝ. For each n, let f_n^{(1)}(x) = x and h_n(·) = w(·) − T_k(·, c) or w(·) − T_{k−1}(·, c). Suppose that there are no g functions. Then Theorem 7.15 says that
The portmanteau theorem B.83 gives several criteria that are equivalent to convergence in distribution. These can be used to derive a connection between convergence in distribution and o_P.
Lemma 7.19.⁶ Suppose that X is a metric space with metric d. If X_n →^D X and d(X_n, Y_n) = o_P(1), then Y_n →^D X.
Then

{Y_n ∈ B} ⊆ {d(X_n, B) ≤ ε} ∪ {d(X_n, Y_n) > ε}.

Define C_ε = {x : d(x, B) ≤ ε}, which is a closed set. So,

R_n(B) = Pr(Y_n ∈ B) ≤ Pr(d(X_n, B) ≤ ε) + Pr(d(X_n, Y_n) > ε) = P_n(C_ε) + Pr(d(X_n, Y_n) > ε),

hence Y_n →^D X. □
Lemma 7.19 says that if X_n →^D X, then so too does anything close to X_n, that is, anything that differs from X_n by o_P(1).
Theorem 7.20. If the σ-field on X × Y is the product σ-field, if X_n →^D X and Y_n →^D Y, and if X_n is independent of Y_n for all n, then (X_n, Y_n) →^D (X, Y), where X and Y are independent.
PROOF. Since Xn and Yn are independent for each n, their joint charac-
teristic function is
6This lemma is used in the proofs of Theorems 7.22, 7.25, 7.35, and 7.63 and
to help develop the delta method.
The result in the example above suggests a valuable use for the delta method. If the variance of the asymptotic distribution of √n(Y_n − μ) is an undesirable quantity in the application for which it is intended, then a transformation of Y_n will have a different variance that may be more suitable. For example, suppose that nY_n has Bin(n, p) distribution given P = p. The asymptotic distribution of √n(Y_n − p) is N(0, p(1 − p)) given P = p. For comparing several possible values of P, it might be nice if the only dependence of the random variable on P were through the mean. This can be arranged asymptotically by choosing a function g such that

g′(p) = 1/√(p(1 − p)).

More generally, if the asymptotic variance is h(μ), take

g(t) = ∫_c^t h(μ)^{−1/2} dμ,

where c is any constant such that the integral exists. The asymptotic distribution of √n(g(Y_n) − g(μ)) will be N(0, 1). It is common, when √n(Y_n − μ) →^D N(0, σ²), to say that the asymptotic distribution of Y_n is N(μ, σ²/n). In symbols, we may write Y_n ∼ AN(μ, σ²/n). In such cases we will call σ²/n the asymptotic variance of Y_n.
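The variance-stabilizing recipe can be tried out directly. For the binomial case, g′(p) = 1/√(p(1 − p)) integrates to g(t) = 2 arcsin(√t) (taking c = 0). A small simulation, an illustrative sketch whose names and sample sizes are our own choices, shows that the variance of g(Y_n) is close to 1/n regardless of p:

```python
import random, math

def stabilized(n, p, rng):
    # Y_n = (# successes)/n; g(t) = 2*asin(sqrt(t)) satisfies g'(p) = 1/sqrt(p(1-p))
    y = sum(rng.random() < p for _ in range(n)) / n
    return 2.0 * math.asin(math.sqrt(y))

rng = random.Random(42)
n = 400
for p in (0.1, 0.3, 0.5, 0.8):
    g = [stabilized(n, p, rng) for _ in range(3000)]
    m = sum(g) / len(g)
    var = sum((x - m) ** 2 for x in g) / len(g)
    # n * Var(g(Y_n)) should be near 1 for every p
    print(p, round(n * var, 2))
```

The untransformed proportion has variance p(1 − p)/n, which varies by a factor of more than 2 across these p values; after the transformation the variance is essentially constant.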
Here are some multivariate applications of the delta method.
Example 7.23. Importance sampling (see Section B.7) is a means of approximating the ratio of integrals of the form ∫v(θ)h(θ)dθ / ∫h(θ)dθ. Let {X_n}_{n=1}^∞ be an IID sequence of pseudorandom numbers with density f, and let W_i = h(X_i)/f(X_i) and Z_i = v(X_i)W_i. If these have finite variance, then the sample averages (W̄_n, Z̄_n) will, by the multivariate central limit theorem B.99, be approximately bivariate normal with mean (w, z) = (∫h(θ)dθ, ∫v(θ)h(θ)dθ) and covariance matrix equal to 1/n times the covariance matrix Σ = (σ_{i,j}) of the (W_i, Z_i) pairs. Now, apply the delta method to find the asymptotic distribution of the ratio of the sample averages. The asymptotic mean is z/w, the ratio we want to approximate, and the asymptotic variance is
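(Carrying out the delta method with the gradient (−z/w², 1/w) gives asymptotic variance (σ_ZZ − 2rσ_WZ + r²σ_WW)/(nw²), where r = z/w.) A minimal illustration follows; the particular h, v, and proposal f below are toy choices of ours, not from the text. It estimates E[T²] = 1 under a standard normal target using a uniform proposal:

```python
import random, math

def is_ratio(n, rng):
    # toy target: h = unnormalized N(0,1) density, v(t) = t**2,
    # proposal f = Uniform(-5, 5); the true ratio is E[T**2] = 1
    h = lambda t: math.exp(-t * t / 2.0)
    v = lambda t: t * t
    f = 1.0 / 10.0
    xs = [rng.uniform(-5.0, 5.0) for _ in range(n)]
    ws = [h(x) / f for x in xs]              # W_i = h(X_i)/f(X_i)
    zs = [v(x) * w for x, w in zip(xs, ws)]  # Z_i = v(X_i) * W_i
    return sum(zs) / sum(ws)                 # Zbar_n / Wbar_n

print(round(is_ratio(100_000, random.Random(7)), 2))  # close to 1
```

Note that the normalizing constant of h is never needed, which is the usual reason for estimating a ratio rather than a single integral.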
The following example uses the reasoning behind the delta method without using the delta method itself.
Example 7.24. Suppose that we wish to find the asymptotic distribution of the roots of polynomials with random coefficients. Let Y_n ∼ AN_{k+1}(μ, Σ/n), where Y_n = (Y_{n0}, …, Y_{nk})^T. Define the polynomial

P_n(u) = Σ_{j=0}^k Y_{nj} u^j.

from Theorem 7.15 that Û_n →^P u_0. To find the asymptotic distribution of Û_n, write P_n(Û_n) as

where V̂_n is between u_0 and Û_n. So, S ⊆ {V̂_n → u_0}, and V̂_n →^P u_0 also. Furthermore,

√n(Û_n − u_0)
¹⁰If F(c) = 0 and F(x) > 0 for x > c, then we only need F_n to be strictly increasing on [c, x_(n)].
7.2. Sample Quantiles 405
Theorem 7.25. Suppose that {X_n}_{n=1}^∞ are conditionally IID with distribution P given P = P and suppose that P has CDF F with derivative f in a neighborhood of x_p, where F(x_p) = p, 0 < f(x_p) < ∞, and 0 < p < 1. Define Y_p^(n) = F_n^{−1}(p), where F_n is the empirical CDF of (X_1, …, X_n). Then

√n(Y_p^(n) − x_p) →^D N(0, p(1 − p)/f²(x_p)).

A_n(z) = F_n(x_p + z/√n) − F_n(x_p) − (z/√n) f(x_p) = B_n/n − (z/√n) f(x_p) + U_n,   (7.28)

where B_n is the number of observations in the interval (x_p, x_p + z/√n], and U_n satisfies Pr(|U_n| ≤ 2/n) = 1. In particular, U_n = O_P(1/n). The conditional distribution of B_n given P = P is Bin(n, θ_n), where

= exp{−itz f(x_p)} E exp(itB_n/√n)
= (1 − θ_n + θ_n exp{it/√n})^n exp{−itz f(x_p)}.
We can write

exp(it/√n) = 1 + it/√n − t²/(2n) + o(1/n).

It follows that

(1 − θ_n + θ_n exp{it/√n})^n = (1 + itθ_n/√n + o(1/n))^n = (1 + (itz/n) f(x_p) + o(1/n))^n → exp{itz f(x_p)},
as n → ∞. So the characteristic function converges for all t. So, √n(A_n(z) − U_n) →^D 0 by the continuity theorem B.93, and √n|A_n(z) − U_n| →^P 0 by Theorem B.90. Hence A_n(z) = U_n + o_P(1/√n). Since U_n = O_P(1/n) = o_P(1/√n), it follows that A_n(z) = o_P(1/√n).
Finally, we prove (7.27). The following inequalities are all equivalent:
We will prove that the right-hand side of this equation converges to the necessary normal probability. We have that F_n(x_p) = C_n/n + D_n, where C_n is the number of observations less than or equal to x_p and D_n = O_P(1/n). Also, A_n(z) = o_P(1/√n), so

(√n/f(x_p))(p − F_n(x_p)) − (√n/f(x_p)) A_n(z) = −(√n/f(x_p))(C_n/n − p) + o_P(1).   (7.29)

The central limit theorem B.97 tells us that √n(C_n/n − p) →^D N(0, p(1 − p)). This, together with Lemma 7.19 applied to (7.29), completes the proof. □
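Theorem 7.25 is easy to check by simulation. The sketch below (sample sizes and names are our own illustrative choices) computes the sample median of standard normal data and compares n times its variance with p(1 − p)/f(x_p)² = (1/4)(2π) = π/2:

```python
import random, math

def sample_median(n, rng):
    xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    return xs[n // 2]       # Y_{1/2}^{(n)} for odd n

rng = random.Random(0)
n = 501
meds = [sample_median(n, rng) for _ in range(3000)]
m = sum(meds) / len(meds)
var = sum((x - m) ** 2 for x in meds) / len(meds)
# Theorem 7.25 predicts n * Var close to p(1-p)/f(x_p)**2 = pi/2
print(round(n * var, 2), round(math.pi / 2, 2))
```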
Example 7.30. Suppose that F has derivative

f(x) = σ / (π[σ² + (x − μ)²]),

where σ > 0 and μ are some numbers. If p = 1/2, x_p = μ and f(x_p) = (σπ)^{−1}. It follows that the sample median Y_{1/2}^(n) has asymptotic distribution (given P = P) N(μ, σ²π²/[4n]).
Example 7.31. Suppose that F has derivative

f(x) = exp(−[x − μ]²/[2σ²]) / (σ√(2π)),

where σ > 0 and μ are some numbers. If p = 1/2, x_p = μ and f(x_p) = (σ√(2π))^{−1}. It follows that the sample median Y_{1/2}^(n) has asymptotic distribution (given P = P) N(μ, σ²π/[2n]).
Theorem 7.32. Suppose that t ∈ ℝ, α > 0, and lim_{x↑t}(t − x)^{−α}[1 − F(x)] = c > 0. Let {X_n}_{n=1}^∞ be IID with CDF F and let X_(n) = max{X_1, …, X_n}. Then n^{1/α}(t − X_(n)) converges in distribution to a distribution with CDF G(x) = 1 − exp(−cx^α), for x > 0.
PROOF. Write

Pr(n^{1/α}[t − X_(n)] ≥ x) = Pr(X_(n) ≤ t − x/n^{1/α}) = F(t − x/n^{1/α})^n = [1 − {1 − F(t − x/n^{1/α})}]^n.

Since

1 − F(t − x/n^{1/α}) ∼ c(x/n^{1/α})^α = cx^α/n,

it follows that

Pr(n^{1/α}[t − X_(n)] ≥ x) → exp(−cx^α). □
408 Chapter 7. Large Sample Theory
Example 7.33. Suppose that {X_n}_{n=1}^∞ are conditionally IID U(0, θ) given Θ = θ. The CDF of X_i (given Θ = θ) is x/θ for 0 < x < θ and 1 for x ≥ θ. With t = θ we get lim_{x↑t}(t − x)^{−1}[1 − F(x)] = 1/θ. So Theorem 7.32 says that n(θ − X_(n)) →^D Exp(1/θ).
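Theorem 7.32 can likewise be illustrated numerically for Example 7.33. The following sketch (the parameter values and names are arbitrary choices of ours) checks that n(θ − X_(n)) behaves like a variable with CDF 1 − exp(−x/θ), i.e., an exponential law with mean θ:

```python
import random

def scaled_gap(n, theta, rng):
    # Theorem 7.32 with alpha = 1 and c = 1/theta for U(0, theta) data
    mx = max(rng.uniform(0.0, theta) for _ in range(n))
    return n * (theta - mx)

rng = random.Random(5)
theta, n = 2.0, 1000
gaps = [scaled_gap(n, theta, rng) for _ in range(3000)]
# limiting CDF 1 - exp(-x/theta): exponential with mean theta = 2
print(round(sum(gaps) / len(gaps), 2))
```

Note the rate n here, much faster than the √n rate of central order statistics.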
A similar theorem can be proven for distributions bounded below.
Proposition 7.34. Suppose that t ∈ ℝ, α > 0, and lim_{x↓t}(x − t)^{−α}F(x) = c > 0. Let {X_n}_{n=1}^∞ be IID with CDF F and let X_(1) = min{X_1, …, X_n}. Then n^{1/α}(X_(1) − t) converges in distribution to a distribution with CDF G(x) = 1 − exp(−cx^α), for x > 0.
Krem (1963) proves that extreme order statistics (like the min and max)
are asymptotically independent of the central order statistics (like the quan-
tiles).
Let z_1, …, z_k be real numbers and let A_{i,n}(z_i) equal (7.28) with p = p_i for i = 1, …, k. Then,
Since (A_{1,n}(z_1), …, A_{k,n}(z_k)) = o_P(1/√n), it follows from Lemmas 7.19 and 7.26 that the two vectors Z_n and W_n converge in distribution to the same thing if either one of them converges. It is easier to find what W_n converges to, so that is what we will do.
We can write the covariance matrix of the cell proportions as Σ = diag(q_1, …, q_{k+1}) − qq^T, where q = (q_1, …, q_{k+1})^T.
Next, note that

(F_n(x_{p_1}), …, F_n(x_{p_k}))^T = A G_n + o_P(1/√n),   (7.36)

where A is the k × (k + 1) matrix with 1 on and below the diagonal and 0 above the diagonal:

A = [ 1 0 0 … 0 0 ; 1 1 0 … 0 0 ; … ; 1 1 1 … 1 0 ].
Call the vector on the left of (7.36) R. Then the conditional mean of R given P = P is p = Aq = (p_1, …, p_k)^T, and √n(p − R) →^D N_k(0, AΣA^T). All that remains is to compute AΣA^T. This can be seen to equal

AΣA^T = [ p_1 p_1 … p_1 ; p_1 p_2 … p_2 ; … ; p_1 p_2 … p_k ] − (p_1, …, p_k)^T (p_1, …, p_k),

that is, the matrix with (i, j) element p_{min{i,j}} − p_i p_j. □
Let {X_n}_{n=1}^∞ be IID with CDF F, and let X_(1) = min{X_1, …, X_n} and X_(n) = max{X_1, …, X_n}. Then the asymptotic joint CDF of n^{1/α_1}(X_(1) − t_1) and n^{1/α_2}(t_2 − X_(n)) is (1 − exp(−c_1 x_1^{α_1}))(1 − exp(−c_2 x_2^{α_2})).
If the goal is to estimate g(θ), it might be good if the asymptotic mean were g(θ) given Θ = θ. The asymptotic conditional mean of a_1 Y_p^(n) + a_2 Y_{1/2}^(n) + a_3 Y_{1−p}^(n) is (a_1 + a_2 + a_3)g(θ) + (a_3 − a_1)E W_{1−p}^(n), since h is symmetric around 0. For p < 1/2, this will equal g(θ) for all θ if and only if a_3 = a_1 and a_1 + a_2 + a_3 = 1. Hence our estimator must be

a Y_p^(n) + (1 − 2a) Y_{1/2}^(n) + a Y_{1−p}^(n)   (7.38)

for some a.
Example 7.39 (Continuation of Example 7.30; see page 406). As an example, consider the case of Cauchy distributions with a location parameter Θ. Then Z_i = X_i − θ, h(x) = (π[1 + x²])^{−1}, and z_{1−p} = tan[π(1/2 − p)]. So, for example, if p = 1/3, then z_p = −1/√3, z_{1−p} = 1/√3, and h(z_p) = 3/(4π) = h(z_{1−p}). The asymptotic covariance matrix of the three sample quantiles is then

Σ = (π²/n) [ 32/81  2/9  16/81 ; 2/9  1/4  2/9 ; 16/81  2/9  32/81 ],
say, where c(p) = sin²(πp), so that h(z_p) = c(p)/π and c(1/2) = 1. The asymptotic covariance matrix of the three quantiles is

Σ = (π²/n) [ p(1−p)/c(p)²  p/(2c(p))  p²/c(p)² ; p/(2c(p))  1/4  p/(2c(p)) ; p²/c(p)²  p/(2c(p))  p(1−p)/c(p)² ].

Minimizing the asymptotic variance of (7.38) over a gives

a = a(p) = [c(p)² − 2p c(p)] / (2[c(p)² − 4p c(p) + 2p]),

and the minimum variance is

(π²/[4n]) [1 − (c(p) − 2p)² / (c(p)² − 4p c(p) + 2p)] = s(p).

We can numerically minimize s(p) and find the minima occur at p = 0.42085 and at p = 0.07915. The minimum s(p) is 2.302/n, which is only slightly better than using p = 1/3 (which gives 2.393/n).
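The numerical minimization is easy to reproduce. The sketch below codes n·s(p) with c(p) = πh(z_p) = sin²(πp) for the Cauchy case (the function and grid names are ours) and searches a grid on (0.2, 0.5):

```python
import math

def s(p):
    # n times the minimum asymptotic variance of a*Y_p + (1-2a)*Y_{1/2} + a*Y_{1-p}
    # for the Cauchy location family, with c(p) = pi * h(z_p) = sin(pi*p)**2
    c = math.sin(math.pi * p) ** 2
    return (math.pi ** 2 / 4.0) * (
        1.0 - (c - 2.0 * p) ** 2 / (c * c - 4.0 * p * c + 2.0 * p))

# grid search over p in (0.2, 0.5); the text reports a minimizer near 0.42085
best_p = min((k / 100000.0 for k in range(20000, 50000)), key=s)
print(round(best_p, 5), round(s(best_p), 3), round(s(1.0 / 3.0), 3))
```

The second local minimum at p = 0.07915 (the mirror image in the sense that both give the same minimum value) lies outside this grid; widening the search to (0, 0.5) finds it as well.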
Example 7.41. Suppose that the distributions are double exponential (also known as the Laplace distribution). That is, h(x) = exp(−|x|)/2. Then
which has a minimum at a = 0 and the minimum value is 1. This means that, no
matter what p is, it is better to use just the median.
Corollary 5.23 says that the smallest possible variance for an unbiased estimator of g(θ) is c_θ^T I_{X_1}(θ)^{−1} c_θ. Since g(Θ̂_n) is asymptotically unbiased, the ratio of these two variances might be used as a measure of how good a consistent estimator is.
Definition 7.43. If G_n is an estimator of g(θ) for each n and √n(G_n − g(θ)) →^D N(0, v_θ) given Θ = θ, then the ratio c_θ^T I_{X_1}(θ)^{−1} c_θ / v_θ is called the asymptotic efficiency of G_n at θ. If the ratio is 1, the sequence {G_n}_{n=1}^∞ is called asymptotically efficient.
Suppose that {G_n}_{n=1}^∞ and {G′_n}_{n=1}^∞ are sequences of estimators of g(θ), and we have a specific criterion that we require of our estimator, such as variance equal to ε. Suppose that G_{n_0} and G′_{n_0′} satisfy this criterion. Then the relative efficiency of {G_n}_{n=1}^∞ to {G′_n}_{n=1}^∞ for the specific criterion is n_0′/n_0. Suppose that the criterion is allowed to change in such a way that the sample sizes required to satisfy it go to ∞, for example, variance equal to ε with ε going to 0. If the ratio n_0′/n_0 converges to a value r, then r is called the asymptotic relative efficiency (ARE)¹¹ of {G_n}_{n=1}^∞ to {G′_n}_{n=1}^∞.
Example 7.44. Let {X_n}_{n=1}^∞ be conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Let g(θ) = μ. Let G_n = X̄_n, the sample average, and let G′_n be the sample median. Let our specific criterion be that the asymptotic variance of the estimator must equal ε. Since the central limit theorem B.97 says that √n(G_n − μ) →^D N(0, σ²), and Example 7.31 on page 407 shows that √n(G′_n − μ) →^D N(0, σ²π/2), we have the relative efficiency equal to √(2/π) = 0.798 for all ε. If we let ε → 0, the ARE of the sample median to the sample mean is 0.798 as well.
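The comparison in Example 7.44 can be checked by simulation. The sketch below (sample sizes and names are our own choices) estimates n times the variance of the sample mean and sample median for N(0, 1) data; the ratio of asymptotic variances is 2/π ≈ 0.637, whose square root is the 0.798 figure quoted above:

```python
import random

def mean_and_median_nvars(n, reps, rng):
    means, medians = [], []
    for _ in range(reps):
        xs = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
        means.append(sum(xs) / n)
        medians.append(xs[n // 2])
    def nvar(v):
        m = sum(v) / len(v)
        return n * sum((x - m) ** 2 for x in v) / len(v)
    return nvar(means), nvar(medians)

vm, vmed = mean_and_median_nvars(301, 3000, random.Random(2))
print(round(vm, 2), round(vmed, 2), round(vm / vmed, 2))  # near 1, pi/2, 2/pi
```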
¹¹This definition of ARE is taken from Serfling (1980, pp. 50-52). Serfling's definition actually applies to more types of inference than estimators, but we will not pursue that generality here.
12Solve Problem 22 on page 470 to show that the relative rate of convergence
is uniquely defined. Relative rate of convergence is not an example of a criterion
for ARE, but it has a similar nature.
Example 7.47.¹⁴ Suppose that {X_n}_{n=1}^∞ are conditionally IID N(θ, 1) given Θ = θ. We already know that I_{X_1}(θ) = 1 and √n(X̄_n − θ) ∼ N(0, 1), so X̄_n is asymptotically efficient. (Actually, it is efficient in finite samples.) Let θ_0 be arbitrary, and define a new estimator of Θ:

δ_n = X̄_n if |X̄_n − θ_0| ≥ n^{−1/4},  δ_n = θ_0 + a(X̄_n − θ_0) if |X̄_n − θ_0| < n^{−1/4},

where 0 < a < 1. This is like using X̄_n when X̄_n is not close to θ_0, but using the posterior mean of Θ from a prior centered at θ_0 when X̄_n is close to θ_0.
We will now calculate the efficiency of δ_n. Suppose that θ ≠ θ_0. Then
where Z = √n(X̄_n − θ) has N(0, 1) distribution given Θ = θ. This last probability goes to 0 as n goes to infinity because both of the endpoints either go to +∞ or −∞. Hence, if θ ≠ θ_0, δ_n = X̄_n + o_P(1/√n).
Now, suppose that θ = θ_0. Then
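The superefficiency phenomenon can be simulated, assuming the thresholding form with cutoff n^{−1/4} sketched above (that cutoff and all names here are our assumptions, not necessarily the book's exact construction). At θ = θ_0 the estimator beats the information bound (n times the MSE is about a² < 1), while at θ ≠ θ_0 it matches X̄_n:

```python
import random

def shrunk_estimator(n, theta, a, theta0, rng):
    # assumed form: use the sample mean unless it is within n**(-1/4) of theta0,
    # in which case shrink toward theta0 by the factor a
    xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
    if abs(xbar - theta0) < n ** (-0.25):
        return theta0 + a * (xbar - theta0)
    return xbar

rng = random.Random(9)
n, a = 400, 0.5
for theta in (0.0, 1.0):             # theta0 = 0
    est = [shrunk_estimator(n, theta, a, 0.0, rng) for _ in range(2000)]
    n_mse = n * sum((e - theta) ** 2 for e in est) / len(est)
    print(theta, round(n_mse, 2))    # about a**2 = 0.25 at theta0, about 1 elsewhere
```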
PROOF. With P_{θ_0} measure 1, Π_{i=1}^n f_{X_1|Θ}(x_i|θ_0) > Π_{i=1}^n f_{X_1|Θ}(x_i|θ) if and only if

R(x) = (1/n) Σ_{i=1}^n log [ f_{X_1|Θ}(x_i|θ) / f_{X_1|Θ}(x_i|θ_0) ] < 0.

By the weak law of large numbers B.95, under P_{θ_0},

R(X) →^P E_{θ_0} log [ f_{X_1|Θ}(X|θ) / f_{X_1|Θ}(X|θ_0) ] = −I_{X_1}(θ_0; θ),
Z(M, x) = inf_{ψ∈M} log [ f_{X_1|Θ}(x|θ_0) / f_{X_1|Θ}(x|ψ) ].

Assume that for each θ ≠ θ_0 there is an open set N_θ such that θ ∈ N_θ and E_{θ_0}Z(N_θ, X_i) > 0. If Ω is not compact, assume further that there is a compact C ⊆ Ω such that θ_0 ∈ C and E_{θ_0}Z(Ω \ C, X_i) > 0. Then, lim Θ̂_n = θ_0, a.s. [P_{θ_0}].
Let ε > 0 and let N_0 be the open ball of radius ε around θ_0. Since C \ N_0 is a compact set, and {N_θ : θ ∈ C \ N_0} is an open cover, we may extract a finite subcover, N_{θ_1}, …, N_{θ_l}. Rename these sets and C^c to O_1, …, O_m, so that Ω = N_0 ∪ (∪_{j=1}^m O_j), and E_{θ_0}Z(O_j, X_i) > 0.
Let X^∞ be the infinite product space of copies of X_1. Let x ∈ X^∞ denote a generic sequence of possible data values. Let E_{θ_0}Z(O_j, X_i) = c_j. By the strong law of large numbers 1.63, Σ_{i=1}^n Z(O_j, X_i)/n → c_j, a.s. [P_{θ_0}]. Let B_j ⊆ X^∞ be the set of data sequences such that convergence holds, and let B = ∩_{j=1}^m B_j. Then P_{θ_0}(B) = 1 and Σ_{i=1}^n Z(O_j, x_i)/n → c_j > 0 for each x = (x_1, x_2, …) ∈ B. Now, notice that

{x : Θ̂_n(x_1, …, x_n) does not converge to θ_0}
 ⊆ ∪_{j=1}^m {x : Θ̂_n(x_1, …, x_n) ∈ O_j, infinitely often}
 ⊆ ∪_{j=1}^m {x : inf_{ψ∈O_j} (1/n) Σ_{i=1}^n log [f_{X_1|Θ}(x_i|θ_0)/f_{X_1|Θ}(x_i|ψ)] ≤ 0, infinitely often}
 = ∪_{j=1}^m {x : (1/n) Σ_{i=1}^n Z(O_j, x_i) ≤ 0, infinitely often}.
log [ f_{X_1|Θ}(x|θ_0) / f_{X_1|Θ}(x|ψ) ] = { log(ψ/θ_0) if x ≤ min{θ_0, ψ}; ∞ if ψ < x ≤ θ_0; −∞ if θ_0 < x ≤ ψ; undefined otherwise. }

Since the last two cases have 0 probability under P_{θ_0}, we can choose N_θ = ([θ + θ_0]/2, ∞) when θ > θ_0. In this case, Z(N_θ, x) = log([θ + θ_0]/[2θ_0]) > 0, a.s. [P_{θ_0}]. If θ < θ_0, choose N_θ = (θ/2, [θ + θ_0]/2). In this case Z(N_θ, x) = ∞ if x > [θ + θ_0]/2. Hence, E_{θ_0}Z(N_θ, X_i) > 0 in either case.
We also need a compact set C such that E_{θ_0}Z(Ω \ C, X_i) > 0. Let C = [θ_0/a, aθ_0], for some a > 1. Then

Z(Ω \ C, X_i) = { log(X_i/θ_0) if X_i < θ_0/a; log a if X_i ≥ θ_0/a, }
7.3. Large Sample Estimation 417
so that

E_{θ_0}Z(Ω \ C, X_i) = (1/θ_0) ∫_0^{θ_0/a} log(x/θ_0) dx + (1/θ_0) ∫_{θ_0/a}^{θ_0} log a dx.

The first integral goes to 0 and the second goes to ∞ as a → ∞. This means that there is some a > 1 such that the mean is positive. It follows from Theorem 7.49 that the MLE is consistent.
In this example, it would have been easier to find the distribution of Θ̂_n and prove directly that it was consistent, but we will need the above calculation in Example 7.82 on page 432.
Example 7.52. Suppose that {X_n}_{n=1}^∞ given Θ = θ are IID with N(θ, 1) distribution. It is easy to calculate
(7.53)
The minimum of this over any set occurs at θ equal to the value in the set closest to x. So, if N_θ = (θ − ε, θ + ε), then E_{θ_0}Z(N_θ, x) = I_{X_1}(θ_0; θ) + E_{θ_0}(R), where

R = { ε(x − θ) + ε²/2 if x < θ − ε; x(θ − x) + (x² − θ²)/2 if θ − ε ≤ x ≤ θ + ε; ε(θ − x) + ε²/2 if x > θ + ε. }

Clearly, E_{θ_0}(R) can be made arbitrarily small by choosing ε small. Similarly, if C = [θ_0 − u, θ_0 + u], for large u, then
¹⁶A slightly more general result can be proved by assuming that f_{X_1|Θ} is upper semicontinuous (USC). A function f : Ω → ℝ is upper semicontinuous if limsup_{n→∞} f(θ_n) ≤ f(θ) whenever θ_n → θ. USC functions possess two properties that are needed in the proof of Theorem 7.49. The sum of two USC functions is USC, and the maximum of a USC function is attained on a compact set.
Since N_θ^(k) ⊆ N_θ, it follows that Z(N_θ^(k), x) ≥ Z(N_θ, x). If E_{θ_0}Z(N_θ, X_i) = ∞, then we have E_{θ_0}Z(N_θ^(k), X_i) = ∞, for all k. If E_{θ_0}Z(N_θ, X_i) is finite, then apply Fatou's lemma A.50 to {Z(N_θ^(k), x) − Z(N_θ, x)}_{k=1}^∞ and use (7.55) to get

liminf_{k→∞} E_{θ_0}Z(N_θ^(k), X_i) ≥ E_{θ_0} lim_{k→∞} Z(N_θ^(k), X_i) = I_{X_1}(θ_0; θ) > 0,   (7.56)
Suppose that the natural parameter space Ω is an open subset of ℝ^k. Let Θ̂_n be the MLE of Θ based on X_1, …, X_n if it exists. Then
lim_{n→∞} P_θ(Θ̂_n exists) = 1,
¹⁷If f_{X_1|Θ} is USC, Θ̂_k still exists. One must change the lim to liminf wherever it appears in front of Z, and one must change the equality to ≥ in (7.55) and (7.56) to make the rest of the proof work.
7.3. Large Sample Estimation 419
under P_θ, √n(Θ̂_n − θ) →^D N_k(0, I_{X_1}(θ)^{−1}), where I_{X_1}(θ) is the Fisher information matrix.
PROOF. If the MLE exists, it will satisfy the equation that says that the partial derivatives of the log-likelihood function with respect to each coordinate of θ are 0, since the parameter space is open. Since

(∂/∂θ_i) log f_{X|Θ}(x|θ) = n x̄_i + n (∂/∂θ_i) log c(θ),

the resulting equation is x̄_n = v(θ), where the ith coordinate of v(θ) is −∂ log c(θ)/∂θ_i. It follows from Proposition 2.70 that

Cov_θ(X_i, X_j) = −(∂²/∂θ_i ∂θ_j) log c(θ) = σ_{ij}.

Since a covariance matrix is positive semidefinite, it follows that −log c(θ) is a convex function. Since each X_i has nondegenerate exponential family distribution, their coordinates are not linearly dependent, hence the matrix Σ = (σ_{ij}) will be positive definite. It follows that v(·) has a differentiable inverse h(·) in the sense that for each θ there is a neighborhood N of θ such that h(v(ψ)) = ψ for ψ ∈ N, and the derivatives of h are continuous. (See the inverse function theorem C.2.) If x̄_n is in the image of v, then, for at least one such function h, the MLE equals h(x̄_n). By the weak law of large numbers B.95, X̄_n →^P E_θX under P_θ and E_θX = v(θ) by Proposition 2.70. It follows that X̄_n will be in the range of v with probability tending to 1 as n → ∞. It follows that the MLE exists with probability tending to 1.
The multivariate central limit theorem B.99 says that under P_θ, √n(X̄_n − v(θ)) →^D N(0, Σ). Using the delta method, we get that under P_θ, √n(Θ̂_n − θ) →^D N_k(0, AΣA^T), where A = (a_{ij}) with a_{ij} = ∂h_i(t)/∂t_j evaluated at t = v(θ), which is the (i, j) element of Σ^{−1}. So, A = Σ^{−1}. It is also easy to see that Σ is the Fisher information matrix I_{X_1}(θ), so √n(Θ̂_n − θ) →^D N(0, I_{X_1}(θ)^{−1}). □
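Theorem 7.57 can be illustrated in a one-parameter exponential family. For Exp(θ) data in the rate parametrization, I_{X_1}(θ) = 1/θ², so √n(Θ̂_n − θ) should be approximately N(0, θ²). The sketch below (parameter choices and names are our own) checks the asymptotic variance:

```python
import random

def mle_rate(n, theta, rng):
    # Exp(theta) in the rate parametrization; the MLE is n / sum(X_i)
    xs = [rng.expovariate(theta) for _ in range(n)]
    return n / sum(xs)

rng = random.Random(11)
theta, n = 2.0, 1000
errs = [mle_rate(n, theta, rng) - theta for _ in range(2000)]
m = sum(errs) / len(errs)
var = sum((e - m) ** 2 for e in errs) / len(errs)
# I(theta) = 1/theta**2, so n * Var should be near theta**2 = 4
print(round(n * var, 2))
```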
The following corollary is trivial.
Corollary 7.58. Under the conditions of Theorem 7.57, the MLE of Θ is consistent.
Another corollary says that differentiable functions of MLEs are asymp-
totically efficient estimators.
Corollary 7.59. Assume the conditions of Theorem 7.57. Suppose that g : Ω → ℝ has continuous partial derivatives. Then g(Θ̂_n) is an asymptotically efficient estimator of g(θ).
PROOF. Let c_θ be defined in (7.42). Using the delta method, we get
−2n log σ − (1/[2σ²]) { 2 Σ_{i=1}^n ([X_i + Y_i]/2 − μ_i)² + (1/2) Σ_{i=1}^n (X_i − Y_i)² }.

The MLEs are easily calculated as M̂_{i,n} = (X_i + Y_i)/2 and Σ̂_n² = Σ_{i=1}^n (X_i − Y_i)²/[4n]. Since the conditional distribution of X_i − Y_i given Θ = θ is N(0, 2σ²), it follows that Σ̂_n² →^P σ²/2 under P_θ. The MLE of Σ² is inconsistent.
Barnard (1970) suggests an empirical Bayes approach, which is to choose a distribution for the M_i with a fixed finite number of parameters, call them Ψ. Then treat (Ψ, Σ) as the parameters and integrate the M_i out of the problem.
See Section 8.4 for a discussion of empirical Bayes methods. Kiefer and Wolfowitz
(1956) let the distribution of the Mi be more general, but they assume that the
distribution lies in a compact set of distributions so that methods like those of
Theorem 7.49 and Lemma 7.54 can be used.
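The inconsistency in the example above is easy to see in a simulation. The following sketch (the nuisance means and all names are arbitrary choices of ours) computes the MLE of σ² from paired observations with a fresh nuisance mean per pair and watches it settle at σ²/2 rather than σ²:

```python
import random

def sigma2_mle(n, sigma, rng):
    # X_i, Y_i ~ N(mu_i, sigma**2) with a fresh nuisance mean mu_i for each pair
    total = 0.0
    for _ in range(n):
        mu = rng.uniform(-10.0, 10.0)
        x, y = rng.gauss(mu, sigma), rng.gauss(mu, sigma)
        total += (x - y) ** 2
    return total / (4.0 * n)     # the MLE of sigma**2

print(round(sigma2_mle(200_000, 2.0, random.Random(13)), 2))  # settles near 2, not 4
```

The number of parameters grows with n (one μ_i per pair), which is exactly what breaks the usual consistency argument.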
Example 7.61. Let Ω = (1/2, 1] and suppose that for x = 0, 1, 2, …,
if θ ≠ 1,
if θ = 1.
and we note that for every compact subset C of Ω, the infimum of (7.62) over θ ∈ C^c is 0 for x > 0 and is negative for x = 0. So the conditions of Theorem 7.49
18Barnard claims that Neyman presented him with the example in a taxicab
in Paris in 1946. Barnard had just met Neyman for the first time and "was
arguing for the broad general validity of the method of maximum likelihood"
when Neyman asked him what he would do in this example.
fail as well. Let X̄_n be the average of the first n observations. The MLE based on the first n observations is
Under H, √n(X̄_n − 1) →^D N(0, 2), since the mean of each X_i is 1 and the variance is 2. It follows that lim_{n→∞} Pr(X̄_n ≥ 1) = 1/2 and 1/(1 + X̄_n) →^P 1/2 under H.
(∂²/∂θ²) log f_{X_1|Θ}(x|θ) = −2 [1 − (x − θ)²] / [1 + (x − θ)²]².

This is differentiable and the derivative has finite mean. Hence H_r exists as in Theorem 7.63.
422 Chapter 7. Large Sample Theory
The idea of the proof of Theorem 7.63 is the following. We work with the vector $\ell_\theta'(X)$ of partial derivatives of the logarithm of the likelihood function divided by $n$. We evaluate a Taylor expansion of $\ell_{\hat\theta_n}'(X)$ around $\theta_0$ at the point $\hat\theta_n$. Since $\ell_{\hat\theta_n}'(X)$ should be 0, we get that $\ell_{\theta_0}'(X)$ is essentially the matrix $B_n$ of second partial derivatives of the logarithm of the likelihood times $\hat\theta_n - \theta_0$ divided by $n$. Since $\ell_{\theta_0}'(X)$ is the average of IID random vectors with mean 0 and covariance matrix $I_{X_1}(\theta_0)$, $\sqrt{n}\,\ell_{\theta_0}'(X)$ is asymptotically normal with covariance $I_{X_1}(\theta_0)$. Similarly, $B_n$ is nearly the average of IID random matrices, so $B_n \xrightarrow{P} -I_{X_1}(\theta_0)$. Setting the two sides of the Taylor expansion equal, we get that $\sqrt{n}\,I_{X_1}(\theta_0)(\hat\theta_n - \theta_0)$ is asymptotically normal with covariance matrix $I_{X_1}(\theta_0)$. Multiplying by $I_{X_1}(\theta_0)^{-1}$ gives the desired result.
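The conclusion of this sketch is easy to check numerically in a one-parameter case. A small check of my own (Poisson model, where the MLE is the sample mean and $I_{X_1}(\theta) = 1/\theta$):

```python
import numpy as np

# sqrt(n)(theta_hat - theta_0) should be approximately N(0, I^{-1}) = N(0, theta_0).
rng = np.random.default_rng(1)
theta0, n, reps = 3.0, 400, 20_000
x = rng.poisson(theta0, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - theta0)    # one value per replication
print(z.mean(), z.var())                      # mean near 0, variance near theta0 = 3
```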
PROOF OF THEOREM 7.63. Let
$$\ell_\theta(X) = \frac{1}{n}\sum_{i=1}^n \log f_{X_1|\Theta}(X_i|\theta).$$
The $j$th coordinate of the gradient $\ell_\theta'(x)$ is $\left(\sum_{i=1}^n \partial \log f_{X_1|\Theta}(X_i|\theta)/\partial\theta_j\right)/n$.
Since $\theta_0 \in \operatorname{int}(\Omega)$, there is an open neighborhood of $\theta_0$ in the interior of $\Omega$. Since $\hat\theta_n \xrightarrow{P} \theta_0$ under $P_{\theta_0}$, it follows that $Z_n 1_{\operatorname{int}(\Omega)^c}(\hat\theta_n) = o_P(1/\sqrt{n})$ as $n \to \infty$ for every sequence $\{Z_n\}_{n=1}^\infty$ of random variables.¹⁹ Note that $\ell_{\hat\theta_n}'(X) = 0$ for $\hat\theta_n \in \operatorname{int}(\Omega)$. It follows that
$$\ell_{\hat\theta_n}'(X) = \ell_{\hat\theta_n}'(X)\,1_{\operatorname{int}(\Omega)^c}(\hat\theta_n) = o_P\!\left(\frac{1}{\sqrt{n}}\right).$$
A Taylor expansion of $\ell_{\hat\theta_n}'(X)$ around $\theta_0$ gives
$$\ell_{\hat\theta_n}'(X) = \ell_{\theta_0}'(X) + \left[\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta_j\,\partial\theta_l}\log f_{X_1|\Theta}\!\left(X_i\,\middle|\,\theta_{n,j}^*\right)\right]_{j,l}\left(\hat\theta_n - \theta_0\right), \tag{7.66}$$
where $\theta_{n,j}^*$ is between $\theta_0$ and $\hat\theta_n$ for each $j$. Since $\hat\theta_n \xrightarrow{P} \theta_0$ under $P_{\theta_0}$, $\theta_{n,j}^* \xrightarrow{P} \theta_0$ for each $j$. Set $B_n$ equal to the matrix in (7.66). Then
$$\ell_{\theta_0}'(X) + B_n(\hat\theta_n - \theta_0) = o_P\!\left(\frac{1}{\sqrt{n}}\right). \tag{7.67}$$
Since
$$0 = \frac{\partial}{\partial\theta_j}\int f_{X_1|\Theta}(x|\theta)\,d\nu(x),$$
¹⁹In fact, $Z_n 1_{\operatorname{int}(\Omega)^c}(\hat\theta_n) = O_P(T_n)$ for every $T_n$. The reason is that it equals 0 with probability tending to 1, and it does not matter what it equals when it is not 0.
we see that $E_{\theta_0}\ell_{\theta_0}'(X) = 0$. Similarly, we get that the conditional covariance matrix given $\Theta = \theta_0$ of $\ell_{\theta_0}'(X)$ is $I_{X_1}(\theta_0)$. The multivariate central limit theorem B.99 gives us that $\sqrt{n}\,\ell_{\theta_0}'(X) \xrightarrow{\mathcal D} N(0, I_{X_1}(\theta_0))$. So $\sqrt{n}\,\ell_{\theta_0}'(X) = O_P(1)$. It follows from (7.67) that
$$\sqrt{n}\,B_n(\hat\theta_n - \theta_0) = O_P(1). \tag{7.68}$$
Next, note that $B_n(j,k) = \left(\sum_{i=1}^n \partial^2\log f_{X_1|\Theta}(X_i|\theta_0)/\partial\theta_j\,\partial\theta_k\right)/n + \Delta_n$, and (7.64) ensures that $|\Delta_n| \le \sum_{i=1}^n H_r(X_i,\theta_0)/n$ when $\|\theta_0 - \theta_n^*\| \le r$. The weak law of large numbers B.95 says that
$$\frac{1}{n}\sum_{i=1}^n H_r(X_i,\theta_0) \xrightarrow{P} E_{\theta_0}H_r(X_i,\theta_0).$$
Let $\epsilon > 0$ and let $r$ be small enough so that $E_{\theta_0}H_r(X_i,\theta_0) < \epsilon/2$. Then
in place of $I_{X_1}(\hat\theta_n)$ when $X = x$ is observed and one wishes to use the MLE to make inference about $\Theta$. The quantity in (7.69) is called the observed
Fisher information. We will see later (in Section 7.4.2) that the observed
Fisher information is indeed the appropriate matrix to use when the goal
is to approximate the posterior distribution of a parameter by a normal
distribution. Efron and Hinkley (1978) say that the reason for preferring
observed over expected information is that the inverse of observed infor-
mation is closer to the conditional variance of the MLE given an ancillary.
Example 7.70.²⁰ Assume that $(X_1, Z_1), \dots, (X_n, Z_n)$ are conditionally IID with $Z_i$ having $\text{Ber}(1/2)$ distribution and $X_i|Z_i = z$ having $N(\theta, 1/[z+1])$ distribution given $\Theta = \theta$. That is, we flip a fair coin before observing each $X_i$, and if the coin comes up tails, we get an $N(\theta, 1)$ observation. If the coin comes up heads, we get an $N(\theta, 1/2)$ observation. The log-likelihood function is a constant plus
$$-\frac{1}{2}\sum_{i=1}^n (Z_i + 1)(X_i - \theta)^2.$$
The MLE is the weighted average $\hat\theta_n = \sum_{i=1}^n (Z_i + 1)X_i\big/\sum_{i=1}^n (Z_i + 1)$. The Fisher information is $I_{X^n}(\theta) = 3n/2$, which is also the expected Fisher information. The approximation to the distribution of $\hat\theta_n$ given $\Theta = \theta$ using the expected Fisher information is $N(\theta, 2/[3n])$. On the other hand, the observed Fisher information is $J = \sum_{i=1}^n (Z_i + 1)$. A natural ancillary upon which to condition is $Z = (Z_1, \dots, Z_n)$. The conditional distribution of $\hat\theta_n$ given $Z$ and $\Theta = \theta$ is $N(\theta, 1/J)$, which is the same as the approximation based on the observed Fisher information.
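A quick simulation of this example shows why conditioning on $Z$ matters. The sample size, seed, and replication count below are mine:

```python
import numpy as np

# Fix one realized ancillary Z and replicate X given Z: the conditional variance
# of the MLE matches 1/J (observed information), not 2/(3n) (expected information).
rng = np.random.default_rng(2)
theta, n, reps = 0.0, 20, 100_000
z = rng.integers(0, 2, size=n)                # one realized coin-flip sequence
w = z + 1                                     # conditional precisions Z_i + 1
J = w.sum()                                   # observed Fisher information
x = rng.normal(theta, 1.0 / np.sqrt(w), size=(reps, n))
theta_hat = x @ w / J                         # weighted-average MLE
print(theta_hat.var(), 1.0 / J, 2.0 / (3 * n))
```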
LeCam (1970) proves asymptotic normality of MLEs under ostensibly
weaker conditions than Theorem 7.63. The conditions do not require the
existence of continuous second derivatives. They do, however, require the
existence of functions that behave very much like second derivatives. Also,
a condition very much like (7.64) is required, where the second derivatives
are replaced by these other functions that behave like second derivatives. 21
2. For each $\theta_0$ and each $\theta \ne \theta_0$, there exists an open set $N_\theta$ containing $\theta$ such that $E_{\theta_0}\inf_{\theta'\in N_\theta}[\rho(X_i,\theta_0) - \rho(X_i,\theta')] > -\infty$.
3. For each $\theta_0$, there exists a compact set $C$ containing $\theta_0$ such that $E_{\theta_0}\inf_{\theta\notin C}[\rho(X_i,\theta_0) - \rho(X_i,\theta)] > 0$.
Condition 1 says that $\rho$ allows one to distinguish possible values of $\Theta$ from each other and that $\rho(X_i,\theta_0)$ tends to be larger when $\Theta = \theta_0$ than when $\Theta$ equals some other value. Condition 2 says that it cannot be the case that, even when $\Theta = \theta_0$, there are some other possible values $\theta'$ of $\Theta$ that lead to $\rho(X_i,\theta')$ being much larger than $\rho(X_i,\theta_0)$. Condition 3 says that there is a region around $\theta_0$ such that all values of $\rho(X_i,\theta)$, for $\theta$ not in that region, tend to be less than $\rho(X_i,\theta_0)$ simultaneously. If $\rho(a,b) = \log f_{X_1|\Theta}(a|b)$, then conditions 2 and 3 are two of the conditions of Lemma 7.54. In fact, the method of proof for that lemma can be applied to prove the following proposition.
Proposition 7.71. Suppose that $\rho(X_i,\theta)$ is continuous in $\theta$, a.s. $[P_{\theta_0}]$. Also assume conditions 1–3. If $\hat\theta_n$ is the value of $\theta$ that maximizes $\sum_{i=1}^n \rho(X_i,\theta)$, then $\hat\theta_n \to \theta_0$, a.s. $[P_{\theta_0}]$.
$\psi(x,\theta)$ is continuous in $\theta$,
for each $\theta_0$, there exists $h > 0$ such that $E_{\theta_0}\psi(X_i,\theta)$ is strictly decreasing as a function of $\theta$ for $|\theta - \theta_0| < h$.
Then, for each $\epsilon > 0$,
$\sum_{i=1}^n \psi(X_i,\theta_2)/n \xrightarrow{P} E_{\theta_0}\psi(X_i,\theta_2) < 0$. We now note that the probability that there is a solution equals
which has several solutions in general. To check the conditions of Theorem 7.72, we note that $\psi(x,\theta) = (x-\theta)/[1+(x-\theta)^2]$. Clearly, $E_\theta\psi(X,\theta) = 0$ for all $\theta$. The derivative of $E_{\theta_0}\psi(X,\theta)$ evaluated at $\theta = \theta_0$ is
$$= E_{\theta_0}\frac{\partial}{\partial\theta_t}\psi_j(X_i,\theta_0),$$
Assume that $M$ is nonsingular. Also, assume that there exists $H_r(x)$ such that
PROOF. Let $\bar\rho(\theta) = \sum_{i=1}^n \psi(X_i,\theta)/n$, so that $\theta_n^* = \bar\Theta_n - M^{-1}(\bar\Theta_n)\bar\rho(\bar\Theta_n)$.
We can write
where $\theta^*$ is between $\theta_0$ and $\bar\theta_n$. It follows that, under $P_{\theta_0}$, $\theta^* \xrightarrow{P} \theta_0$. By hypothesis,
It follows that
Also, the multivariate central limit theorem B.99 says that, under $P_{\theta_0}$,
The result now follows easily from (7.76) and Theorem 7.22. □
Example 7.77. Consider the Cauchy distribution with a location parameter. The likelihood equation is very difficult to solve, but there is a simple $\sqrt{n}$-consistent estimator, namely, the sample median $\bar\Theta_n$. Set
$$\psi(x,\theta) = \frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta) = \frac{2(x-\theta)}{1+(x-\theta)^2}.$$
We now calculate
$$-2E_{\theta_0}\frac{1-(X-\theta_0)^2}{[1+(X-\theta_0)^2]^2} = -2E_0\frac{1-X^2}{(1+X^2)^2} = -\frac{2}{\pi}\int_{-\infty}^{\infty}\frac{1-x^2}{(1+x^2)^3}\,dx = -.5 \ne 0.$$
The other conditions of Theorem 7.75 can also be verified. The following estimator
is asymptotically efficient:
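The construction can be sketched in code: start from the sample median and take one Newton-type step on the likelihood equation. This sketch uses the limiting value $M = -1/2$ in place of an estimate of $M$, and the simulation settings (sample size, seed) are mine:

```python
import numpy as np

def one_step_cauchy(x):
    """One-step efficient estimator for the Cauchy location parameter."""
    t0 = np.median(x)                          # sqrt(n)-consistent starting point
    u = x - t0
    psi_bar = np.mean(2 * u / (1 + u ** 2))    # average of psi(x_i, t0)
    return t0 + 2 * psi_bar                    # step -M^{-1} psi_bar with M = -1/2

rng = np.random.default_rng(3)
n, reps = 400, 5_000
samples = rng.standard_cauchy((reps, n))       # true theta = 0
est = np.array([one_step_cauchy(s) for s in samples])
# n * variance should approach 1/I = 2; the median alone gives pi^2/4, about 2.47
print(n * est.var())
```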
†This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
²²The proof relies on martingale theory. The proof in Doob (1949) does not require that a consistent estimator exists, but it has a slightly stronger assumption which implies that a consistent estimator exists.
For $\epsilon > 0$, define $C_\epsilon = \{\theta : I_{X_1}(\theta_0;\theta) < \epsilon\}$. Let $\mu_\Theta$ be a prior distribution such that $\mu_\Theta(C_\epsilon) > 0$ for every $\epsilon > 0$. Then, for every $\epsilon > 0$ and open set $N_0$ containing $C_\epsilon$, the posterior satisfies $\lim_{n\to\infty}\mu_{\Theta|X^n}(N_0|x^n) = 1$, a.s. $[P_{\theta_0}]$, where $x^n = (x_1,\dots,x_n)$ are the data.
PROOF. For each $x \in \mathcal X^\infty$, the infinite product space of copies of $\mathcal X_1$, define
7.4. Large Sample Properties of Posterior Distributions 431
The idea behind the remainder of the proof is the following. For each $x$ in a set with probability 1, we find a lower bound on the numerator of the last expression in (7.81) and an upper bound on the denominator such that the ratio of these bounds goes to $\infty$.
First, look at the denominator of the last expression in (7.81). Just as in the proof of Theorem 7.49, construct the sets $\Omega_1, \dots, \Omega_m$ so that $\Omega = N_0 \cup (\cup_{j=1}^m \Omega_j)$ and $E_{\theta_0}Z(\Omega_j, X_i) = c_j > 0$. It is easy to see that for $M \subseteq \Omega$,
$$\inf_{\theta\in M} D_n(\theta, x) \ge \frac{1}{n}\sum_{i=1}^n Z(M, X_i).$$
So the denominator of the last expression in (7.81) is at most
$$\sum_{j=1}^m \int_{\Omega_j}\exp(-nD_n(\theta,x))\,d\mu_\Theta(\theta) \le \sum_{j=1}^m \sup_{\theta\in\Omega_j}\exp(-nD_n(\theta,x))\,\mu_\Theta(\Omega_j)$$
$$= \lim_{n\to\infty}\int_{C_\delta} P_{\theta_0}(V_n(\theta))\,d\mu_\Theta(\theta) = \lim_{n\to\infty}\int_{C_\delta}\int_{\mathcal X^\infty} I_{W_n(x)}(\theta)\,dP_{\theta_0}(x)\,d\mu_\Theta(\theta)$$
$$= \lim_{n\to\infty}\int_{\mathcal X^\infty}\int_{C_\delta} I_{W_n(x)}(\theta)\,d\mu_\Theta(\theta)\,dP_{\theta_0}(x) = \lim_{n\to\infty}\int_{\mathcal X^\infty}\mu_\Theta(C_\delta\cap W_n(x))\,dP_{\theta_0}(x)$$
$$= \int_{\mathcal X^\infty}\lim_{n\to\infty}\mu_\Theta(C_\delta\cap W_n(x))\,dP_{\theta_0}(x).$$
Since $\mu_\Theta(C_\delta\cap W_n(x)) \le \mu_\Theta(C_\delta)$ for all $x$ and $n$, we have $\lim_{n\to\infty}\mu_\Theta(C_\delta\cap W_n(x)) = \mu_\Theta(C_\delta)$, a.s. $[P_{\theta_0}]$, because strict inequality with positive probability would contradict the above string of equalities. So, there is a set $B' \subseteq \mathcal X^\infty$ with $P_{\theta_0}(B') = 1$ such that for every $x \in B'$ there exists $N'(x)$ such that $n \ge N'(x)$ implies $\mu_\Theta(C_\delta\cap W_n(x)) > \mu_\Theta(C_\delta)/2$. So, if $x \in B'$ and $n \ge N'(x)$, the numerator of the last expression in (7.81) is at least
$$\frac{1}{2}\exp(-2n\delta)\,\mu_\Theta(C_\delta) > \frac{1}{2}\exp\left(-\frac{nc}{2}\right)\mu_\Theta(C_\delta),$$
since $I_{X_1}(\theta_0;\theta) \le \delta$ for $\theta \in C_\delta$. It follows that if $x \in B\cap B'$ and $n \ge \max\{N(x), N'(x)\}$, then the ratio in (7.81) is at least $\mu_\Theta(C_\delta)\exp(nc/4)/2$, which goes to $\infty$ with $n$. □
Example 7.82 (Continuation of Example 7.51; see page 416). Suppose that $\{X_n\}_{n=1}^\infty$ given $\Theta = \theta$ are IID with $U(0,\theta)$ distribution. We saw earlier that the conditions of Theorem 7.49 are satisfied. The Kullback-Leibler information is
$$I_{X_1}(\theta_0;\theta) = \begin{cases}\log\dfrac{\theta}{\theta_0} & \text{if }\theta \ge \theta_0,\\ \infty & \text{if }\theta < \theta_0.\end{cases}$$
The set $C_\epsilon$ is the interval $[\theta_0, \exp(\epsilon)\theta_0)$. An open set $N_0$ containing this interval will need to contain an open interval $(\theta_0 - \delta, \theta_0)$ for some $\delta > 0$. So long as the prior distribution assigns positive mass to every open interval, then, for every $\theta_0$, every open interval around $\theta_0$ will have posterior probability going to 1, a.s. $[P_{\theta_0}]$. This is a much stronger claim than one could infer from Theorem 7.78.²³
²³See Problem 48 on page 474 to see why the posterior probability of $C_\epsilon$ does not go to 1 almost surely.
(7.84)
If we add both sides of this inequality for all $\alpha \in N$ and divide by the number of such $\alpha$, we get
²⁴The proof of Lemma 7.83 involves martingale theory. Lemma 7.83 also provides a weaker condition under which the MLE converges a.s. $[P_{\theta_0}]$.
Example 7.85 (Continuation of Example 7.52; see page 417). Suppose that $\{X_n\}_{n=1}^\infty$ given $\Theta = (\mu,\sigma)$ are IID with $N(\mu,\sigma^2)$ distribution. If $\theta_0 = (\mu_0,\sigma_0)$, it is easy to calculate
Problem 47 on page 474.) Theorem 7.80 suggests that the type of closeness that implies consistency in Bayesian problems is much stronger.²⁶
$$\Sigma_n = \begin{cases}\left[-\ell_n''(\hat\theta_n)\right]^{-1} & \text{if the inverse and } \hat\theta_n \text{ exist},\\ n^{-1}I_k & \text{if not.}\end{cases}\tag{7.88}$$
²⁶Barron (1988) gives necessary and sufficient conditions for the posterior distribution to concentrate on sets "close" to the distribution that generates the data. His results even apply in nonparametric settings.
²⁷What Heyde and Johnstone (1979) did (whether intentionally or not) was to take the conclusions Walker (1969) derived from the assumption that the data were conditionally IID, and use them as assumptions. To use Heyde and Johnstone's result for the conditionally IID case, one need only repeat the portion of Walker's proof in which the assumptions of Heyde and Johnstone's theorem are proven directly. Alternatively, one could prove the assumptions independently.
²⁸Notice that $\Sigma_n^{-1}$ is the observed Fisher information matrix.
6. For $\delta > 0$, define $N_0(\delta)$ to be the open ball of radius $\delta$ around $\theta_0$. Let $\lambda_n$ be the smallest eigenvalue of $\Sigma_n$. If $N_0(\delta) \subseteq \Omega$, then there exists $K(\delta) > 0$ such that
$$\lim_{n\to\infty} P_{\theta_0}^n\left(\sup_{\theta\in N_0(\delta(\epsilon)),\,\|\gamma\|=1}\left|1 + \gamma^\top\Sigma_n^{1/2}\,\ell_n''(\theta)\,\Sigma_n^{1/2}\gamma\right| < \epsilon\right) = 1.$$
$$\lim_{n\to\infty} P_{\theta_0}^n\left(\sup_{\psi\in B}\left|f_{\Psi_n|X^n}(\psi|X^n) - \phi(\psi)\right| > \epsilon\right) = 0,$$
where
$$f_{X^n}(x^n) = \int_\Omega f_\Theta(\theta)\,f_{X^n|\Theta}(x^n|\theta)\,d\theta.$$
The posterior density of $\Psi_n$, $f_{\Psi_n|X^n}(\psi|X^n)$, can be written as
Our first step is to see how the first factor in (7.91) behaves as $n \to \infty$. Choose $0 < \epsilon < 1$ and let $\eta$ be such that
Since the prior is continuous at $\theta_0$, there exists $\delta_1 > 0$ such that $\|\theta - \theta_0\| < \delta_1$ implies $|f_\Theta(\theta) - f_\Theta(\theta_0)| < \eta f_\Theta(\theta_0)$. By general regularity condition 7, there exists $\delta_2 > 0$ such that
(7.93)
where
$$\cdots + \Delta_n\bigr\}\,d\theta.$$
$$\lim_{n\to\infty} P_{\theta_0}^n\left[\frac{(2\pi)^{k/2}|\Sigma_n|^{1/2}}{(1+\eta)^{k/2}} < J_3 < \frac{(2\pi)^{k/2}|\Sigma_n|^{1/2}}{(1-\eta)^{k/2}}\right] = 1.\tag{7.94}$$
By the way we chose $\eta$ related to $\epsilon$, we get from (7.93) and (7.94) that
In other words,
$$\frac{J_1}{|\Sigma_n|^{1/2}\,f_{X^n|\Theta}(X^n|\hat\theta_n)} \xrightarrow{P} (2\pi)^{k/2}\,f_\Theta(\theta_0).\tag{7.95}$$
Next, we show that
$$\frac{J_2}{|\Sigma_n|^{1/2}\,f_{X^n|\Theta}(X^n|\hat\theta_n)} \xrightarrow{P} 0.\tag{7.96}$$
$$\frac{\exp\left[-|\Sigma_n|^{-1}K(\delta)\right]}{|\Sigma_n|^{1/2}} \xrightarrow{P} 0.$$
Let $\eta, \epsilon > 0$, and let $B$ be a compact subset of $\mathbb R^k$. Let $b$ be a bound on $\|\psi\|^2$ for $\psi \in B$. By general regularity condition 7, there exist $\delta$ and $M$ such that $n \ge M$ implies
Then, if $n \ge N$,
$$P_{\theta_0}^n\left(\left|\psi^\top\left(I_k - R_n(\Sigma_n^{1/2}\psi + \hat\theta_n, X^n)\right)\psi - \|\psi\|^2\right| < \eta \text{ for all } \psi \in B\right) > 1 - \epsilon.$$
Since $P_{\theta_0}^n(\Delta_n = 0 \text{ for all } \psi) \to 1$, it follows that the second fraction in (7.91) is between $\exp(-\eta)\exp(-\|\psi\|^2/2)$ and $\exp(\eta)\exp(-\|\psi\|^2/2)$ with probability tending to 1, uniformly on compact sets. Since $\eta$ is arbitrary, the desired result follows. □
We now give two examples in which Xn does not consist of conditionally
IID coordinates, but the general regularity conditions still hold.
Example 7.99. Let $\Omega$ be the interval $(-1,1)$. Let $\{Z_n\}_{n=1}^\infty$ be IID $N(0,1)$ and let $Y_0 = 0$. Define $Y_n = \theta Y_{n-1} + Z_n$ for $n = 1, 2, \dots$. The sequence $\{Y_n\}_{n=1}^\infty$ is called a first-order autoregressive process. The $Y_i$ are clearly not conditionally IID given $\Theta = \theta$ except in the case $\theta = 0$. Let $X_n = (Y_1,\dots,Y_n)$. Then $\ell_n(\theta)$ is a constant plus
$$-\left(Y_n^2 + (1+\theta^2)\sum_{i=1}^{n-1}Y_i^2 - 2\theta\sum_{i=1}^n Y_iY_{i-1}\right)\Big/2.$$
The MLE is easily calculated as $\hat\theta_n = \sum_{i=1}^n Y_iY_{i-1}\big/\sum_{i=1}^n Y_{i-1}^2$. The first four general regularity conditions are trivially satisfied if the prior has a continuous density. Also, $\ell_n''(\theta) = -\sum_{i=1}^n Y_{i-1}^2$. Since $\ell_n''$ does not depend on $\theta$, general regularity condition 7 is satisfied. Since
$$\operatorname{Cov}_\theta\!\left(Y_i^2, Y_{i-k}^2\right) = \theta^{2k}\operatorname{Var}_\theta\!\left(Y_{i-k}^2\right) \le \frac{2\theta^{2k}}{(1-\theta^2)^2},$$
$$\ell_n(\theta) - \ell_n(\theta_0) = -\frac{\theta - \theta_0}{2}\left((\theta+\theta_0)\sum_{i=1}^{n-1}Y_i^2 - 2\sum_{i=1}^n Y_iY_{i-1}\right),$$
and $\hat\theta_n$ is approximately distributed as
$$N\!\left(\theta_0,\ \left[\sum_{i=1}^n Y_{i-1}^2\right]^{-1}\right).$$
So, if $n \ge N$, (7.103) follows. □
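The AR(1) computations in Example 7.99 are easy to reproduce numerically (the values of $\theta$ and $n$ below are mine):

```python
import numpy as np

# Simulate Y_i = theta*Y_{i-1} + Z_i with Y_0 = 0, then compute the MLE and the
# inverse-observed-information standard error, as in Example 7.99.
rng = np.random.default_rng(5)
theta, n = 0.6, 5_000
y = np.zeros(n + 1)
for i in range(1, n + 1):
    y[i] = theta * y[i - 1] + rng.normal()

theta_hat = np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)   # MLE
se = np.sum(y[:-1] ** 2) ** -0.5                           # observed-information SE
print(theta_hat, se)
```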
and these means are all positive for $j = 1,\dots,m$, it follows that if
where $\lambda$ is the smallest eigenvalue of $I_{X_1}(\theta_0)$, then general regularity condition 6 holds. For general regularity condition 7, let $\epsilon > 0$ and let $\delta$ be small enough so that $E_{\theta_0}H_\delta(Y_i,\theta_0) < \epsilon/(\mu+\epsilon)$, where $H_\delta$ comes from (7.64) and $\mu$ is the largest eigenvalue of $I_{X_1}(\theta_0)$. Let $\mu_n$ stand for the largest eigenvalue of $\Sigma_n$. For $\theta \in N_0(\delta)$, we have
If $\hat\theta_n \in N_0(\delta)$ and $|\mu - n\mu_n| < \epsilon$, it follows from (7.64) that the last expression above is no greater than
$$(\mu+\epsilon)\,\frac{2}{n}\sum_{i=1}^n H_\delta(Y_i,\theta_0).$$
By the weak law of large numbers B.95 and our choice of $\delta$, this last expression converges in probability to something no greater than $\epsilon$. This implies general regularity condition 7. □
To make the theorems of this section apply to prior probabilities not conditional on the parameter, suppose that the prior distribution satisfies $\Pr(\Theta \in \operatorname{int}(\Omega)) = 1$. We can now apply the result from Problem 6 on page 468 to conclude that the prior probability that the posterior after $n$ observations will be within $\epsilon$ of the normal approximation goes to 1 as $n$ goes to infinity.
Example 7.104. Suppose that $X_1,\dots,X_{10}$ are conditionally IID given $\Theta = \theta$ with $\text{Cau}(\theta,1)$ distribution. Suppose that the observations are
$$-5,\ -3,\ 0,\ 2,\ 4,\ 5,\ 7,\ 9,\ 11,\ 14.$$
Then the MLE of $\Theta$ is $\hat\theta_{10} = 4.531$, and $\ell_{10}''(4.531) = -1.23116$. So, Theorems 7.101 and 7.102 suggest that $N(4.531, 0.813)$ is approximately the distribution of $\Theta$. To see how good this approximation is, look at Figure 7.105. The solid line is an approximation to the posterior obtained by numerically integrating the likelihood times the prior (trapezoidal rule from $\theta = -1$ to $\theta = 11$). The dotted line is the normal approximation. The functions are not particularly similar. For example, the normal approximation to $\Pr(\Theta \le 5|X = x)$ is 0.3015, while the numerical integral under the posterior curve is 0.3560.
[Figure 7.105. The posterior approximated by numerical integration (solid) versus the normal approximation (dotted), plotted over $0 \le \theta \le 10$.]
$$\log f_{X^n|\Theta}(X^n|\theta_0) = \log f_{X^n|\Theta}(X^n|\hat\theta_n)$$
for every prior $f_\Theta$, with equality only when $f_\Theta$ is Jeffreys' prior. It follows that if Jeffreys' prior describes someone's beliefs, then that person believes that his or her predictive density for the data, $f_{X^n}(X^n)$, will (asymptotically) be smaller, relative to $f_{X^n|\Theta}(X^n|\theta_0)$, than that of someone who believes any other prior. An alternative way to say this is that a person believing Jeffreys' prior thinks that he or she has more to learn about $\Theta$ from the data than does someone believing a different prior. This informal description can be made more rigorous, as Clarke and Barron (1994) do. They consider a decision problem in which the action space is the set of continuous prior densities, and the loss is
Note that this loss is precisely the Kullback-Leibler information for comparing the distribution of $X^n$ given $\Theta = \theta$ to the prior predictive distribution. They show that Jeffreys' prior is asymptotically least favorable in this decision problem.
we can write
$$E(g(\Theta)|X = x) = \frac{\int g(\theta)f_\Theta(\theta)f_{X|\Theta}(x|\theta)\,d\theta}{\int f_\Theta(\theta)f_{X|\Theta}(x|\theta)\,d\theta}.$$
The method of Laplace provides approximations to each of these integrals
for specific values of x. Some conditions and notation are needed to state
the approximations precisely.
Theorem 7.108. For each $n$, let $(\mathcal X_n, \mathcal B_n)$ be a Borel space, and let $X_n$ be a random quantity taking values in $\mathcal X_n$. Let $x^n = (x_1,\dots,x_n)$ and let $(\mathcal X^n, \mathcal B^n)$ be the product space of $\mathcal X_1,\dots,\mathcal X_n$. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family of distributions for $\{X_n\}_{n=1}^\infty$ with $\Omega \subseteq \mathbb R$. Suppose that the distribution of $X^n$ given $\Theta = \theta$ is absolutely continuous with respect to a measure $\nu_n$ on $(\mathcal X^n, \mathcal B^n)$ for all $n$ with density $f_{X^n|\Theta}(\cdot|\theta)$. Let $g : \Omega \to \mathbb R^+$ be a function. Let $f_\Theta(\theta)$ be the prior density of $\Theta$ with respect to Lebesgue measure. Assume that $f_{X^n|\Theta}(x^n|\theta)$ for all $n$ and $x^n \in \mathcal X^n$, $g(\theta)$, and $f_\Theta(\theta)$ are all continuously differentiable with respect to $\theta$ six times. Assume that $\int g(\theta)f_\Theta(\theta)\,d\theta < \infty$. Define
$\ell_n(\theta; x^n)$
$H_n(\theta; x^n)$
Now let $\mathcal Y = \prod_{n=1}^\infty \mathcal X_n$ and define the set $A \subseteq \mathcal Y$ as the set of all $x = (x_1, x_2, x_3, \dots) \in \mathcal Y$ with the following properties:
The integrals $\int g(\theta)f_\Theta(\theta)f_{X^n|\Theta}(x^n|\theta)\,d\theta$ and $\int f_\Theta(\theta)f_{X^n|\Theta}(x^n|\theta)\,d\theta$ are finite for all $n$.
$\ell_n$ achieves its maximum at a point $\hat\theta_n(x^n)$ for each $n$.
There exist $\delta_0, N_0, M > 0$ such that for all $n \ge N_0$, the absolute values of $H_n$, $H_n^*$ and their first six derivatives are all bounded by $M$ for $|\theta - \hat\theta_n(x^n)| \le 2\delta_0$.
$$\begin{aligned}H(\theta) ={}& H(\hat\theta) + (\theta-\hat\theta)H'(\hat\theta) + \tfrac{1}{2}(\theta-\hat\theta)^2H''(\hat\theta) + \tfrac{1}{6}(\theta-\hat\theta)^3H'''(\hat\theta)\\ &+ \tfrac{1}{24}(\theta-\hat\theta)^4H^{(iv)}(\hat\theta) + \tfrac{1}{120}(\theta-\hat\theta)^5H^{(v)}(\hat\theta) + \tfrac{1}{720}(\theta-\hat\theta)^6H^{(vi)}(\bar\theta),\end{aligned}$$
where $\bar\theta$ is between $\theta$ and $\hat\theta$, and $H^{(iv)}$, $H^{(v)}$, and $H^{(vi)}$ respectively stand for the fourth, fifth, and sixth derivatives of $H$. Use the Taylor series of $\exp(x)$ around $x = 0$ and the fact that $H'(\hat\theta) = 0$ to write
$$\begin{aligned}\exp(nH(\theta)) ={}& \exp(nH(\hat\theta))\exp\left(\frac{n}{2}(\theta-\hat\theta)^2H''(\hat\theta)\right)\\ &\times\left[1 + \frac{n}{6}(\theta-\hat\theta)^3H'''(\hat\theta) + \frac{n}{24}(\theta-\hat\theta)^4H^{(iv)}(\hat\theta)\right.\\ &\quad\left.+ \frac{n}{120}(\theta-\hat\theta)^5H^{(v)}(\hat\theta) + \frac{n^2}{72}(\theta-\hat\theta)^6H'''(\hat\theta)^2 + R_n(\theta)\right],\end{aligned}$$
where
$$\int_{\Omega'} R_n(\theta)\exp\left(\frac{n}{2}(\theta-\hat\theta)^2H''(\hat\theta)\right)d\theta = O(n^{-2})$$
as $n \to \infty$, because $R_n(\theta)$ is bounded on the bounded set $\Omega'$. We can also show that
for all odd $k$. (In fact, these last integrals are exponentially small.) This implies that
$$\times\left[1 + \frac{\sigma^{*4}}{8n}H^{*(iv)}(\theta^*) + \frac{5\sigma^{*6}}{24n}H^{*\prime\prime\prime}(\theta^*)^2 + O(n^{-2})\right].$$
Next, we prove that $\hat\theta$ and $\theta^*$ differ by $O(n^{-1})$. Since $H^*$ has zero derivative at $\theta^*$, we can write
$$0 = H^{*\prime}(\theta^*) = H'(\theta^*) + O(n^{-1}) = H'(\hat\theta) + (\theta^* - \hat\theta)H''(\hat\theta) + O(n^{-1}) + o(\hat\theta - \theta^*) = -\frac{\theta^* - \hat\theta}{\sigma^2} + O(n^{-1}) + o(\hat\theta - \theta^*).$$
$$= \frac{\sigma^*}{\sigma}\exp\left(n[H^*(\theta^*) - H(\hat\theta)]\right)\left[1 + O(n^{-2})\right]. \qquad\Box$$
$$\left(b_0 + w + \frac{n\lambda}{n+\lambda}(\bar x - \mu_0)^2\right)$$
Suppose that we want to calculate the predictive density of a future observation $Y$. Conditional on $\Lambda = \lambda$,
$$Y \sim t_{a_0+n}\!\left(\frac{\lambda\mu_0 + n\bar x}{\lambda + n},\ \frac{b_0 + w + \frac{n\lambda}{n+\lambda}(\bar x - \mu_0)^2}{a_0 + n}\left[1 + \frac{1}{\lambda}\right]\right).$$
So, for each value of $y$, we can let $g(\lambda)$ be the density of $Y$ at $y$ and apply Laplace's method for many values of $y$.
As an example, suppose that we observe $n = 10$ observations with $\bar x = 14.7$ and $w = 52.2$. Suppose that the prior had $a_0 = 1 = b_0 = c_0$ and $\mu_0 = 10$. Then $\hat\lambda = 0.1598$ provides the maximum of the function $H$. For each value of $y$ between 0 and 30, say, we can let $g(\lambda)$ be the $t$ density as described above, and we get the plot in Figure 7.114. For comparison, Figure 7.114 includes the predictive density that would have been obtained had $\Pr(\Lambda = 1) = 1$ been assumed. (The prior mean of $\Lambda$ is 1.)
A naive alternative to the Laplace approximation is to use the MLE plug-in $g(\hat\theta)$ to approximate the posterior mean. First, note that $\exp(nH^*(\theta^*)) = g(\theta^*)\exp(nH(\theta^*))$. Since $\hat\theta - \theta^* = O(n^{-1})$, we get that $g(\theta^*) = g(\hat\theta) + O(n^{-1})$ and (as in the proof) $\sigma^{*2} = \sigma^2 + O(n^{-1})$. Combining these facts together with the fact that $H(\theta^*) = H(\hat\theta) + O([\theta^* - \hat\theta]^2)$, we get that the difference between $g(\hat\theta)$ and the Laplace approximation is $O(n^{-1})$. Since the Laplace approximation differs from $E(g(\Theta)|X^n = x^n)$ by $O(n^{-2})$, we get that $g(\hat\theta)$ differs from $E(g(\Theta)|X^n = x^n)$ by $O(n^{-1})$. So, the Laplace approximation can be thought of as a higher-order correction to the use of the MLE as an approximation to $E(g(\Theta)|X^n = x^n)$.
[Figure 7.114. The Laplace approximation to the predictive density ("Laplace") and the predictive density obtained assuming $\Lambda = 1$, plotted for $0 \le y \le 30$.]
Example 7.115 (Continuation of Example 7.104; see page 444). We have observed ten random variables with $\text{Cau}(\theta,1)$ distribution given $\Theta = \theta$. Suppose that our prior distribution for $\Theta$ was very flat, say $N(0, 1000)$. We are now in a position to approximate the mean of any positive function of $\Theta$. For example, for each $t$, we can approximate the mean of $\exp(t\Theta)$. This would be the moment generating function of $\Theta$. For each $x$ we could approximate the mean of $[\pi(1 + (x - \Theta)^2)]^{-1}$. This would be the predictive density of a future observation at $x$. A serious problem arises with this data set. For some values of $t$ or $x$, the associated function $H^*$ is not unimodal. This makes the use of the Laplace approximation unsatisfactory.
Nevertheless, we are able to approximate the moment generating function for small values of $t$ at least. This is done by using the function $g(\theta) = \exp(t\theta)$ for several different small values of $t$. Kass, Tierney, and Kadane (1988) suggest using numerical derivatives of the moment generating function for approximating moments of the parameter. For example, if we use $t = k \times 10^{-5}$ for $k = 0, 1, 2, 3, 4$, we can approximate two derivatives of the moment generating function at 0. Laplace's method uses $g(\theta) = \exp(t\theta)$ for the values of $t$ listed above and gives

    t                  E(exp(t Theta)) - 1.0
    0.0                0.0
    1.0 x 10^-5        4.49078 x 10^-5
    2.0 x 10^-5        8.98176 x 10^-5
    3.0 x 10^-5        13.47294 x 10^-5
    4.0 x 10^-5        17.96432 x 10^-5
We can fit a quartic polynomial to these values, and its first two derivatives at 0 will approximate the first two moments. The fitted quartic is $-31677440x^4 + 2972.6x^3 + 9.85075x^2 + 4.49068x + 1.0$. The first derivative at 0 is 4.49068 and the second is 19.7015. Unfortunately, the estimated variance would be negative, so these moment estimates are not very good. The problem here is that, for some values of $t$, the function $H^*$ is not far from being multimodal.
Using numerical integration (trapezoidal rule for $-20 < \theta < 40$ with 6000 intervals), we approximate the mean to be 4.58994 and the variance to be 2.20456. The moment generating function can also be approximated by numerical integration. It is

    t                  E(exp(t Theta)) - 1.0
    0.0                0.0
    1.0 x 10^-5        4.59005 x 10^-5
    2.0 x 10^-5        9.18034 x 10^-5
    3.0 x 10^-5        13.77086 x 10^-5
    4.0 x 10^-5        18.36161 x 10^-5
If we fit a quartic to this, we get
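The numerical-integration figures quoted above are straightforward to recompute. The grid and prior follow the text; the implementation details are mine:

```python
import numpy as np

# Posterior for the Cauchy data of Example 7.104 with the N(0, 1000) prior of
# Example 7.115, on a grid of 6000 intervals over (-20, 40).
data = np.array([-5, -3, 0, 2, 4, 5, 7, 9, 11, 14.0])
theta = np.linspace(-20.0, 40.0, 6001)
logpost = (-np.sum(np.log1p((data[:, None] - theta[None, :]) ** 2), axis=0)
           - theta ** 2 / 2000.0)              # log-likelihood plus log-prior
w = np.exp(logpost - logpost.max())
mean = np.sum(w * theta) / np.sum(w)
var = np.sum(w * (theta - mean) ** 2) / np.sum(w)
print(mean, var)   # the text reports 4.58994 and 2.20456
```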
There exists $\delta_0$ such that the absolute values of $H_n$ and its first four derivatives are all uniformly bounded for $\|\theta - \hat\theta_n(x^n)\| \le 2\delta_0$. There exists $\delta_1$ such that for all $\gamma$ the absolute values of $H_n^*$ and its first four derivatives are all uniformly bounded for $\|\psi - \psi^*(x^n, \gamma)\| \le 2\delta_1$.
For each $(x_1, x_2, x_3, \dots) \in A$, the marginal posterior density of $\Gamma$ (the first $p$ coordinates of $\Theta$) given $X^n = x^n$ is $f_{\Gamma|X^n}(\gamma|x^n)$ equal to
$$\frac{n^{p/2}\,|\sigma_n^*(x^n;\gamma)|^{1/2}}{(2\pi)^{p/2}\,|\sigma_n(x^n)|^{1/2}}\times\exp\left(n\left[H_n^*(\psi^*(x^n);x^n,\gamma) - H_n(\hat\theta_n(x^n);x^n)\right]\right)\times\left[1 + O(n^{-1})\right].$$
PROOF. As in the proof of Theorem 7.108, we will suppress the dependence on $n$ and $x^n$. Let $(x_1, x_2, \dots) \in A$, and write
$$f_{\Gamma|X^n}(\gamma|x) = \frac{\int \exp(nH^*(\psi;\gamma))\,d\psi}{\int_\Omega \exp(nH(\theta))\,d\theta}.\tag{7.118}$$
The proof of Theorem 7.116 can be adapted to show that the approximate Bayes factor in (4.27) on page 227 is an $O(n^{-1})$ approximation to the true Bayes factor when the hypothesis is $H : \Gamma = \gamma_0$. In this case, the parameter under the hypothesis is $\Psi$, the last $k-p$ coordinates of $\Theta$. One must replace $\theta^*$ by $\hat\theta$ and $\psi'(\gamma)$ by $\psi^*(\gamma)$ in the approximation of Theorem 7.116, but this does not alter the order of the approximation. One can also show that the $O(n^{-1})$ term in Theorem 7.116 is uniform for $\gamma$ in compact sets.
(and similarly for $Q^n$). For arbitrary probability measures $S$ and $T$ on the same $\sigma$-field $\mathcal C$, let
Before we state and prove the main theorem, we will give conditions under which the hypotheses of the theorem hold.
Lemma 7.119. Let $\pi_2 \ll \pi_1$ be probability measures on a parameter space $(\Omega, \tau)$ with parametric family $\{P_\theta : \theta \in \Omega\}$. Suppose that for every $B \in \mathcal B$,
†This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
³¹The proof relies on martingale theory.
Then $Q \ll P$.
PROOF. If $Q(B) > 0$, then there exists a set $C \subseteq \Omega$ such that $P_\theta(B) > 0$ for all $\theta \in C$ and $\pi_2(C) > 0$. It follows that $\pi_1(C) > 0$ and then that $P(B) > 0$. □
The importance of Lemma 7.119 is that it applies to the popular model in which data are conditionally IID given some parameter $\Theta$ with a distribution in a parametric family. Lemma 7.119 says that if two Bayesians agree on the parametric family but disagree on the prior distribution, then Theorem 7.120 will apply to them, so long as one of the prior distributions is absolutely continuous with respect to the other.
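The merging phenomenon is easy to visualize in the Bernoulli case. The two Beta priors, the true success probability, and the grid below are mine:

```python
import numpy as np
from math import lgamma

# Two Bayesians share the Bernoulli likelihood but hold Beta(1,1) and Beta(5,2)
# priors.  Their posteriors' total-variation distance shrinks as data accumulate.
def log_beta_pdf(t, a, b):
    return ((a - 1) * np.log(t) + (b - 1) * np.log(1 - t)
            + lgamma(a + b) - lgamma(a) - lgamma(b))

rng = np.random.default_rng(6)
x = (rng.random(10_000) < 0.3).astype(int)     # Bernoulli(0.3) data
grid = np.linspace(1e-6, 1 - 1e-6, 20_001)
dt = grid[1] - grid[0]

tvs = []
for n in (10, 100, 10_000):
    s = int(x[:n].sum())
    p1 = np.exp(log_beta_pdf(grid, 1 + s, 1 + n - s))
    p2 = np.exp(log_beta_pdf(grid, 5 + s, 2 + n - s))
    tvs.append(0.5 * np.sum(np.abs(p1 - p2)) * dt)
print(tvs)          # decreasing toward 0 as n grows
```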
Theorem 7.120. If $Q \ll P$, then for each $P^n$ there exists a version of $Q^n$ such that
$$Q\left[(x_1, x_2, \dots) : \lim_{n\to\infty}\rho\bigl(P^n(\cdot|x_1,\dots,x_n),\ Q^n(\cdot|x_1,\dots,x_n)\bigr) = 0\right] = 1.$$
The proof of this theorem requires some lemmas and corollaries.
Lemma 7.121. Let $Q$ be a probability measure, and let $E$ denote expectation with respect to $Q$. Let $\{Y_n\}_{n=1}^\infty$ be a sequence of random variables such that $\lim_{n\to\infty} Y_n = Y$ a.s. $[Q]$ and $|Y_n| \le m$ for all $n$ and some nonnegative $m$. Let $\{U_j\}_{j=1}^\infty$ be an increasing sequence of $\sigma$-fields. Let $U$ be the smallest $\sigma$-field containing all of the $U_j$. Then
$$Z = \lim_{j\to\infty}\ \sup_{i\ge j,\ n\ge j} E(Y_n|U_i).$$
Then
as $k \to \infty$. The first equality holds because the supremum decreases as $j$ increases. The limit follows from the martingale convergence theorem B.117. Similarly, we can show that
PROOF. Let $E_k = \{|T_k| > \epsilon\}$, so that $\{\sup_{k\ge n}|T_k| > \epsilon\} = \cup_{k=n}^\infty E_k$. Now apply Corollary 7.122. □
Lemma 7.124. Let $Q \ll P$ and let $q = dQ/dP$. Define
$$d_n = \begin{cases} q/q_n & \text{if } q_n > 0,\\ 1 & \text{if } q_n = 0.\end{cases}$$
Then $q_n = dQ^n/dP^n$, and for each $\epsilon > 0$,
$$\lim_{n\to\infty} Q^n\bigl(|d_n - 1| > \epsilon \,\big|\, X_1,\dots,X_n\bigr) = 0,\ \text{a.s. } [Q].$$
PROOF. Since, for every B E Bn,
$$A(u) = \{v : d_n(u,v) > 1\},\qquad A(u,\epsilon) = \{v : d_n(u,v) > 1 + \epsilon\}.$$
Then
$$\begin{aligned}\rho\bigl(P^n(\cdot|u), Q^n(\cdot|u)\bigr) &= \int_{A(u)}[d_n(u,v) - 1]\,dP^n(v|u)\\ &\le \epsilon + \int_{A(u,\epsilon)}[d_n(u,v) - 1]\,dP^n(v|u)\\ &\le \epsilon + \int_{A(u,\epsilon)} d_n(u,v)\,dP^n(v|u)\\ &= \epsilon + Q^n(A(u,\epsilon)|u).\end{aligned}$$
Now, write $\{\lim_{n\to\infty}\rho(P^n(\cdot|u), Q^n(\cdot|u)) = 0\}$ as
$$\bigcap_{\epsilon>0}\bigcup_{N=1}^\infty\bigcap_{n=N}^\infty\{\cdots\}\ \supseteq\ \cdots\ \supseteq\ \bigcap_{\epsilon>0}\bigcup_{N=1}^\infty\bigcap_{n=N}^\infty\{\cdots\}.$$
The first containment is what we just proved, and the second is trivial. Lemma 7.124 says that $Q$ of the last of these sets is 1, hence $Q$ of the first of these sets is 1. □
Now, suppose that $\sqrt{n}(\hat C_n - c) \xrightarrow{\mathcal D} N(0, I_{X_1}(c)^{-1})$ under $P_c$. Then
$$\hat\theta_{n,H} = \begin{pmatrix} c\\ \hat\psi_{n,H}\end{pmatrix}.$$
Then
$$\sup_{\theta\in\Omega_H} f_{X|\Theta}(x|\theta) = f_{X|\Theta}\!\left(x\,\middle|\,\begin{pmatrix} c\\ \hat\psi_{n,H}\end{pmatrix}\right).$$
(7.127)
Hence,
$$\hat\psi_{n,H} = \hat\psi_n + D_0^{-1}B_0^\top(c - \hat c_n) + o_P\!\left(\frac{1}{\sqrt{n}}\right).\tag{7.129}$$
$$\begin{aligned}\ell_n\!\begin{pmatrix} c\\ \hat\psi_{n,H}\end{pmatrix} - \ell_n(\hat\theta_n) &= -\frac{n}{2}\begin{bmatrix} c - \hat c_n\\ D_0^{-1}B_0^\top(c - \hat c_n)\end{bmatrix}^\top I_{X_1}(\theta_0)\begin{bmatrix} c - \hat c_n\\ D_0^{-1}B_0^\top(c - \hat c_n)\end{bmatrix} + o_P(1)\\ &= -\frac{n}{2}(c - \hat c_n)^\top\bigl[A_0 - B_0D_0^{-1}B_0^\top\bigr](c - \hat c_n) + o_P(1).\end{aligned}$$
The matrix $A_0 - B_0D_0^{-1}B_0^\top$ is the inverse of the upper-left $k\times k$ corner of $I_{X_1}(\theta_0)^{-1}$, which, in turn, is the asymptotic covariance matrix of $\hat C_n$. Since
it follows that $-2\log\Lambda_n \xrightarrow{\mathcal D} \chi^2_k$. Note that the choice of $\psi_0$ is irrelevant. □
When appealing to the asymptotic distribution of the LR criterion, the tradition is to choose $\alpha$ and reject $H$ if $-2\log\Lambda_n$ is greater than the $1-\alpha$ quantile of the $\chi^2_k$ distribution.
Example 7.130 (Continuation of Example 7.104; see page 444). Using the same data as in the previous Cauchy example, suppose that we wish to test $H : \Theta = 5$. The two values of the likelihood function are
$$\ell_{10}(4.531) = -10\log(\pi) - 27.36\quad\text{and}\quad \ell_{10}(5) = -10\log(\pi) - 27.50.$$
So $-2\log\Lambda_n = 0.28$, which is too small to reject $H$ at any popular level.
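This computation is easy to verify by grid search for the MLE, using the data of Example 7.104:

```python
import numpy as np

# -2 log(likelihood ratio) for H: theta = 5 in the Cauchy location model.
data = np.array([-5, -3, 0, 2, 4, 5, 7, 9, 11, 14.0])

def loglik(t):
    return -10 * np.log(np.pi) - np.sum(np.log1p((data - t) ** 2))

grid = np.linspace(-1.0, 11.0, 120_001)
ll = -np.sum(np.log1p((data[:, None] - grid[None, :]) ** 2), axis=0)
mle = grid[np.argmax(ll)]
stat = 2 * (loglik(mle) - loglik(5.0))
print(mle, stat)   # about 4.531 and 0.28, well below the chi-squared(1) 0.95 quantile 3.84
```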
$$C_n = \sum_{i=1}^p \frac{(Y_i - nq_i)^2}{nq_i}.$$
PROOF. The distribution of $Y = (Y_1, \dots, Y_p)^\top$ is $\text{Mult}(n; q_1, \dots, q_p)$, and we know that
$$\sqrt{n}\begin{pmatrix}\frac{Y_1}{n} - q_1\\ \vdots\\ \frac{Y_p}{n} - q_p\end{pmatrix}\xrightarrow{\mathcal D} N_p(0, \Sigma),$$
where $\Sigma = ((\sigma_{i,j}))$ with
$$\sigma_{i,j} = \begin{cases}-q_iq_j & \text{if } i \ne j,\\ q_i(1-q_i) & \text{if } i = j.\end{cases}$$
$$D_n = \frac{1}{n}\,(Y_1 - nq_{1,0},\ \dots,\ Y_{p-1} - nq_{p-1,0})\,\Sigma_*^{-1}\begin{pmatrix}Y_1 - nq_{1,0}\\ \vdots\\ Y_{p-1} - nq_{p-1,0}\end{pmatrix},$$
where $\Sigma_* = \operatorname{diag}(q_{1,0}, \dots, q_{p-1,0}) - q_0q_0^\top$ is the upper-left $(p-1)\times(p-1)$ corner of $\Sigma$ evaluated at the hypothesized probabilities.
The traditional $\chi^2$ goodness of fit test is to reject the hypothesis that the distribution of the data is $P$ if $C_n$ is greater than the $1-\alpha$ quantile of the $\chi^2_{p-1}$ distribution.
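As a sketch of the recipe (the fair-die null and the sample size are my choices):

```python
import numpy as np

# Chi-squared goodness-of-fit statistic C_n = sum (Y_i - n q_i)^2 / (n q_i),
# compared with the chi-squared(p-1) critical value.
rng = np.random.default_rng(7)
p, n = 6, 600
q = np.full(p, 1.0 / p)                        # hypothesized cell probabilities
y = np.bincount(rng.integers(0, p, size=n), minlength=p)
c_n = np.sum((y - n * q) ** 2 / (n * q))
print(c_n)   # under the null this exceeds 11.07 (chi-squared(5) 0.95 quantile) only 5% of the time
```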
Example 7.132. Bortkiewicz (1898) reports data on the number of men killed by horsekick in the Prussian army.³² The data were collected from 14 army units for 20 years.
³²See Bishop, Fienberg, and Holland (1975) for a more complete analysis of these data.
7.5. Large Sample Tests 463
$$s_{i,j} = \begin{cases}-Q_iQ_j & \text{if } i \ne j,\\ Q_i(1 - Q_i) & \text{if } i = j.\end{cases}$$
We could then examine the distribution of $Q - q$, where $q = (q_1, \dots, q_p)$, or specifically of $\|Q - q\|$, or whatever. For example, let $S_*$ be the upper-left $(p-1)\times(p-1)$ corner of $S$ and consider
$$n\,(Q_1 - q_1,\ \dots,\ Q_{p-1} - q_{p-1})\,S_*^{-1}\begin{pmatrix}Q_1 - q_1\\ \vdots\\ Q_{p-1} - q_{p-1}\end{pmatrix}.$$
This quantity would have approximately a noncentral $\chi^2_{p-1}\!\left(\sum_{i=1}^p (Y_i - nq_i)^2/Y_i\right)$ distribution.
A different type of hypothesis might be $H : \Theta \in P_0$, where $P_0$ is a parametric family with $k$-dimensional parameter space $\Gamma$ with $k < p-1$. This case was considered by Fisher (1924).
Theorem 7.133. Let $\Gamma$ be a $k$-dimensional parameter space with parameter $\Psi$ and $k < p-1$. Let $R_1, \dots, R_p$ be a partition of $\mathcal X$. Let $Y_i$ be the number of observations in $R_i$ for $i = 1, \dots, p$. Call $Y = (Y_1, \dots, Y_p)$ the reduced data. Let $S_\psi$ for $\psi \in \Gamma$ stand for the conditional distribution of the reduced data given $\Psi = \psi$. Define $q_i(\psi) = S_\psi(R_i)$ and $q(\psi) = (q_1(\psi), \dots, q_p(\psi))$. Assume that $q$ has at least two derivatives and is one-to-one. Let $\hat\Psi_n$ be the MLE based on the reduced data, and let $I_{X_1}(\psi)$ be the Fisher information matrix. Assume that $\hat\Psi_n$ is asymptotically normal $N_k(\psi, I_{X_1}(\psi)^{-1}/n)$. Define $\hat q_{i,n} = q_i(\hat\Psi_n)$ and
PROOF. The likelihood function for the reduced data is $l(\psi) = \prod_{i=1}^p q_i^{Y_i}(\psi)$. Setting the partial derivatives of the log of the likelihood equal to 0 gives the equations
$$0 = \sum_{i=1}^p \frac{Y_i}{q_i(\psi)}\,\frac{\partial}{\partial\psi_j}q_i(\psi),$$
for $j = 1, \dots, k$. Since $\hat\Psi_n$ is $\sqrt{n}$-consistent and $q$ is continuous, it follows that $\hat q_{i,n}$ is a $\sqrt{n}$-consistent estimator of $q_i(\Psi)$. Since $Y_i/[n\hat q_{i,n}] \xrightarrow{P} 1$ for each $i$, it follows from the likelihood equations (and Problem 7 on page 468) that
$$\sum_{i=1}^p \frac{Y_i^2}{n^2\hat q_{i,n}^2}\,\frac{\partial}{\partial\psi_j}q_i(\hat\Psi_n) = o_P(1).\tag{7.134}$$
The argument we just finished for the case in which the hypothesis is simple shows that for every $\psi \in \Gamma$,
$$C(\psi) - C_n = \sum_{i=1}^p\left\{\cdots - \frac{1}{2\hat q_{i,n}}\sum_{j=1}^k\sum_{t=1}^k(\psi_j - \hat\Psi_{n,j})(\psi_t - \hat\Psi_{n,t})\,\frac{\partial^2}{\partial\psi_j\,\partial\psi_t}q_i(\hat\Psi_n)\right\} + o_P(1).$$
We can rearrange the sum of the first set of terms inside the large brace and use (7.134) to remove these terms from the sum. Then
$$C(\psi) - C_n = -\sum_{i=1}^p \frac{1}{2\hat q_{i,n}}\sum_{j=1}^k\sum_{t=1}^k(\psi_j - \hat\Psi_{n,j})(\psi_t - \hat\Psi_{n,t})\,\frac{\partial^2}{\partial\psi_j\,\partial\psi_t}q_i(\hat\Psi_n) + o_P(1).$$
Since $Y_i = O_P(n)$ and the inner summations are both $O_P(1/n)$, we can use the fact that $Y_i/[n\hat q_{i,n}] \xrightarrow{P} 1$ for every $i$ (and Problem 7 on page 468) to rewrite $C(\psi) - C_n$ as
Since I_{X_1}(\psi)^{-1}/n is the asymptotic covariance matrix of \hat\Psi_n, we have that C(\psi) - C_n \to^D \chi^2_k. Next we prove that C(\psi) - C_n is asymptotically independent of C_n. This will make the asymptotic characteristic function of C_n equal to the ratio of the asymptotic characteristic functions of C(\psi) and C(\psi) - C_n. Since the former is that of \chi^2_{p-1} and the latter is that of \chi^2_k, the ratio is that of \chi^2_{p-k-1}.
Define \hat q_n = (\hat q_{1,n}, \ldots, \hat q_{p,n})^\top. Use the delta method to write

\sqrt{n}(\hat q_n - q(\psi)) = V\sqrt{n}[\hat\Psi_n - \psi] + o_P(1),

where V = ((V_{i,j})) is the p \times k matrix with V_{i,j} = \partial q_i(\psi)/\partial\psi_j. It follows that \sqrt{n}(\hat q_n - q(\psi)) is asymptotically N_p(0, V I_{X_1}(\psi)^{-1} V^\top). Since q is one-to-one, V has rank k and V^- = (V^\top V)^{-1}V^\top exists. It is easy to see that V^- V is a k \times k identity matrix. Hence,

C(\psi) - C_n = n[V(\psi - \hat\Psi_n)]^\top V^{-\top} I_{X_1}(\psi) V^- V(\psi - \hat\Psi_n) + o_P(1)
= n(\hat q_n - q(\psi))^\top V^{-\top} I_{X_1}(\psi) V^- (\hat q_n - q(\psi)) + o_P(1).
Also, since (Y_i - n\hat q_{i,n})^2/n = O_P(1) for each i and 1/\hat q_{i,n} \to^P 1/q_i(\psi), we can use Problem 7 on page 468 to conclude

C_n = \sum_{i=1}^{p} \frac{(Y_i - n\hat q_{i,n})^2}{n\,q_i(\psi)} + o_P(1).
The proof will be complete if we can show that Y/n - \hat q_n and \hat q_n - q(\psi) are asymptotically independent. Since they are jointly asymptotically multivariate normal, it suffices to show that they are asymptotically uncorrelated. Since Y/n - q(\psi) = Y/n - \hat q_n + (\hat q_n - q(\psi)), we need only show that the asymptotic covariance matrix of \hat q_n (namely, V I_{X_1}(\psi)^{-1} V^\top) is the same as the asymptotic covariance between Y/n and \hat q_n. We find the latter as follows. First, note that (following some tedious algebra)
\mathrm{E}\left[(Y_t - nq_t(\psi))\sum_{i=1}^{p}\frac{Y_i - nq_i(\psi)}{n\,q_i(\psi)}\frac{\partial}{\partial\psi_j}q_i(\psi)\right] = \frac{\partial}{\partial\psi_j}q_t(\psi).

Hence the asymptotic covariance between Y and the vector D(\psi) of partial derivatives of \log L(\psi) is V. Next, use the delta method to write
0 = \frac{1}{\sqrt{n}}\sum_{i=1}^{p}\frac{Y_i}{q_i(\hat\Psi_n)}\frac{\partial}{\partial\psi_j}q_i(\hat\Psi_n)
= \frac{1}{\sqrt{n}}\sum_{i=1}^{p}\frac{Y_i}{q_i(\psi)}\frac{\partial}{\partial\psi_j}q_i(\psi) + \sum_{t=1}^{k} m_{n,t,j}\sqrt{n}(\hat\Psi_{n,t} - \psi_t) + o_P(1),
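The conclusion of Theorem 7.133 can be checked by simulation. The sketch below (in Python, using NumPy and SciPy) uses a hypothetical Bin(3, θ) cell-probability family of our own choosing, so p = 4 cells and k = 1 parameter; θ is refitted by maximum likelihood from the cell counts on each replication, and the resulting Pearson statistics should be approximately \chi^2_{p-k-1} = \chi^2_2.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def pearson_with_mle(n, theta=0.3, reps=2000):
    """Pearson statistics for the 4 cells of a Bin(3, theta) model,
    with theta re-estimated from the cell counts each replication."""
    out = np.empty(reps)
    for r in range(reps):
        x = rng.binomial(3, theta, size=n)
        y = np.bincount(x, minlength=4)          # reduced data: cell counts
        theta_hat = y @ np.arange(4) / (3 * n)   # MLE from the reduced data
        q = stats.binom.pmf(np.arange(4), 3, theta_hat)
        out[r] = np.sum((y - n * q) ** 2 / (n * q))
    return out

c = pearson_with_mle(500)
# p - k - 1 = 4 - 1 - 1 = 2 degrees of freedom; chi-squared(2) has mean 2
print(c.mean())
```

The sample mean of the simulated statistics should be close to 2, the mean of \chi^2_2, in line with the theorem.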
7.6 Problems
Section 7.1:
12. Suppose that F(t) = 1 and F(t - \epsilon) < 1 for all \epsilon > 0, that F is differentiable at all values less than t with derivative f such that \lim_{x\uparrow t} f(x) = c with 0 < c < \infty, and that F is continuous at t. Let \{X_n\}_{n=1}^\infty be IID with CDF F, and let X_{(n)} = \max\{X_1, \ldots, X_n\}. Prove that n(t - X_{(n)}) \to^D \mathrm{Exp}(c).
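A quick numerical illustration of the claim in Problem 12 (our own example, not part of the problem): for F uniform on (0, 1) we have t = 1 and c = 1, so n(t - X_{(n)}) should behave like Exp(1).

```python
import numpy as np

rng = np.random.default_rng(1)

# F = Uniform(0,1): here t = 1 and f(x) -> c = 1 as x increases to 1,
# so n(t - X_(n)) should be approximately Exp(1).
n, reps = 1000, 2000
samples = rng.uniform(size=(reps, n))
scaled = n * (1.0 - samples.max(axis=1))
print(scaled.mean(), scaled.var())   # both near 1 for Exp(1)
```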
13. Prove Proposition 7.34 on page 408.
14. Prove Proposition 7.37 on page 410.
15. Suppose that \{X_n\}_{n=1}^\infty are conditionally IID with Cauchy distribution having median \theta given \Theta = \theta. Calculate the asymptotic efficiency of the sample median and of the best linear combination of three symmetrically placed sample quantiles.
16. Let F(x) = [1 + \exp(-x)]^{-1}. Assume that \{X_n\}_{n=1}^\infty are conditionally IID given \Theta = \theta with CDF F(x - \theta).
(a) Prove that the density is symmetric about \theta.
(b) If we wish to use the L-estimator based on the p, 1/2, and 1 - p sample quantiles, with p < 1/2, find the best p and the best coefficients.
17.* Let \{X_n\}_{n=1}^\infty be conditionally IID given \Theta = \theta with density equal to
if \theta - 1/2 < x < \theta,
if \theta \le x < \theta + 1/2.
(a) Find the asymptotic joint distribution of the p, 1/2, and 1 - p sample quantiles of a sample of size n as n \to \infty.
(b) Find the best linear combination of the three sample quantiles p, 1/2, and 1 - p for estimating \Theta.
(c) Try to find the best p if one wishes to estimate \Theta using a linear combination of the three sample quantiles p, 1/2, and 1 - p, and show that the usual analysis fails.
18. In Problem 17 above, find the asymptotic joint distribution of the largest
and smallest order statistics from a sample of size n.
19. Let the conditional distribution of \{X_n\}_{n=1}^\infty given \Theta = \theta be IID U(0, \theta). Let X^{(n)}_k denote the kth order statistic based on X_1, \ldots, X_n. Find a_n and b_n such that a_n(X^{(n)}_k - b_n) converges in distribution to a nondegenerate distribution as n \to \infty for fixed k.
Section 7.2:
(b) Compute the Fisher information I_{X_1}(\theta) and the efficiency of the estimator found in part (b) of Problem 16.
(c) Compute the efficiency of \overline{X}_n = \sum_{i=1}^n X_i/n as an estimator of \Theta.
21. Let \{X_n\}_{n=1}^\infty be conditionally IID given \Theta = \theta with distribution U(\theta^2, \theta), where the parameter space is the interval (0, 1).
(a) Find the MLE of \Theta.
(b) Find a nondegenerate asymptotic distribution for the MLE.
22. Prove that the relative rate of convergence is unique by first showing the following. Let a_n, a'_n > 0, and let H, H' be CDFs. If a_n(G_n - g(\theta)) \to^D H and a'_n(G_n - g(\theta)) \to^D H', then \lim_{n\to\infty} a'_n/a_n = c \in (0, \infty) and H'(x) = H(x/c).
23. Prove the claim at the end of Example 7.46 on page 413 about the rela-
tive rate of convergence being the square root of the ARE for asymptotic
variance.
24. Let \{X_n\}_{n=1}^\infty be conditionally IID with N(\theta, 1) distribution given \Theta = \theta. Let \overline{X}_n = n^{-1}\sum_{i=1}^n X_i and S_n = \sum_{i=1}^n (X_i - \overline{X}_n)^2. Let a_n be such that \Pr(S_n > a_n) = 1/n. Let k_n be the largest integer less than or equal to \sqrt{n}. Consider the following two estimators:

T_n = \begin{cases} \overline{X}_n & \text{if } S_n \le a_n, \\ n & \text{if } S_n > a_n, \end{cases} \qquad U_n = \begin{cases} \overline{X}_n & \text{if } S_n \le a_n, \\ k_n & \text{if } S_n > a_n. \end{cases}

(a) Show that the ARE of U_n to T_n is using the criterion of rate of convergence from Example 7.46 on page 413.
(b) Show that for any fixed \epsilon > 0,
N(0, a(\theta)).
(b) Using the same criteria as in Example 7.44 on page 413, find the ARE of \log(2)/\overline{X}_n to 1/\overline{X}_n as estimators of \Theta.
29. Verify the conditions of Theorem 7.49 in the case in which the observations are IID given \Theta = \theta with density f_{X|\Theta}(x|\theta) = \theta e^{-\theta x}, for x > 0.
30. Assume that \Omega = \{\theta_1, \ldots, \theta_m\} is finite, and let the prior distribution be (\pi_1, \ldots, \pi_m), with \pi_i = \Pr(\Theta = \theta_i). Let \mu be a measure such that P_\theta \ll \mu for each \theta \in \Omega, and let

f_i(x) = \frac{dP_{\theta_i}}{d\mu}(x).

Assume that \{X_n\}_{n=1}^\infty are conditionally IID given \Theta = \theta_i with density f_i(x) for each i. Let \hat\Theta_n be the MLE after n observations. Prove that

\lim_{n\to\infty} \Pr(\hat\Theta_n = \theta_i) = \pi_i.
(a) Prove that there exist numbers \beta_1, \beta_2, \ldots such that

\Pr(X_{1,j} = x,\ X_{2,j} = y \mid \Theta = \theta) = \left(\frac{\alpha\beta_j}{1+\theta\beta_j}\right)^{y}\frac{\beta_j - 1}{\beta_j + 1}. \qquad (7.138)
45. Assume that \{X_n\}_{n=1}^\infty are conditionally IID with N(\mu, \sigma^2) distribution given \Theta = (\mu, \sigma). (See the end of Example 7.52 on page 417.) Prove that for every \theta_0 and every compact set C \subseteq \Omega, \mathrm{E}_{\theta_0} Z(C^c, X_i) < 0 in the notation of Theorem 7.49.
Section 7.4:
46. Use the problem description in Problem 16 on page 75. Show that the posterior distribution of M given (X_1, \ldots, X_n) is not consistent in the sense of Theorem 7.78.
47. Assume the conditions of Theorem 7.78. Prove that there exists a subset A \subseteq \Omega with \mu_\Theta(A) = 1 such that for every \theta \in A,

P'_\theta\left(\lim_{n\to\infty} \mu_{\Theta|X_1,\ldots,X_n}(A|X_1, \ldots, X_n) = I_A(\theta)\right) = 1.
48. Return to the situation in Example 7.82 on page 432. Prove that the posterior probability of C_\epsilon does not almost surely converge to 1 given \Theta = \theta_0. (Hint: Rewrite the posterior probability of (0, \theta_0) in terms of the random variable n(\theta_0 - X_{(n)}), where X_{(n)} is the largest observation. Then use the result in Example 7.33 on page 408.)
49. Suppose that X_1, \ldots, X_n are conditionally IID with exponential distribution \mathrm{Exp}(\theta) given \Theta = \theta. Let \Theta have a \Gamma(a, b) prior distribution. Use Laplace's method to construct a formula for the approximation to the posterior mean of \Theta. How does this compare to the exact posterior mean?
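For Problem 49, the Gamma prior is conjugate, so the exact posterior mean is available for comparison. The sketch below implements one version of the Laplace approach (a Tierney-Kadane-style ratio of two Laplace approximations); the numerical mode-finding, bounds, and step size are our own choices, not prescribed by the problem.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
a, b = 2.0, 1.0                       # Gamma(a, b) prior (shape, rate)
x = rng.exponential(1 / 1.5, size=50) # Exp(theta) data with theta = 1.5
n, S = len(x), x.sum()

def log_post(theta):
    # log of the (unnormalized) posterior density: Gamma(a + n, b + S) kernel
    return (a + n - 1) * np.log(theta) - (b + S) * theta

def laplace_integral(h):
    """Laplace approximation to the integral of exp(h) over (0, inf)."""
    res = minimize_scalar(lambda t: -h(t), bounds=(1e-8, 100), method="bounded")
    t_hat = res.x
    eps = 1e-5                        # numerical second derivative at the mode
    h2 = (h(t_hat + eps) - 2 * h(t_hat) + h(t_hat - eps)) / eps**2
    return np.exp(h(t_hat)) * np.sqrt(2 * np.pi / -h2)

# Posterior mean as a ratio of two Laplace approximations
num = laplace_integral(lambda t: log_post(t) + np.log(t))
den = laplace_integral(lambda t: log_post(t))
approx_mean = num / den

exact_mean = (a + n) / (b + S)        # exact Gamma(a + n, b + S) mean
print(approx_mean, exact_mean)
```

With a sample of size 50 the two values agree to well under one percent, which is the kind of comparison the problem asks for.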
50. Suppose that X_1, \ldots, X_n are conditionally IID with Laplace distribution \mathrm{Lap}(\theta, 1) given \Theta = \theta. Let the prior distribution of \Theta be \mathrm{Lap}(0, 1). We wish to approximate the predictive density of a future observation, namely \int \frac{1}{2}\exp(-|x - \theta|) f_{\Theta|X}(\theta|x)\, d\theta, for various values of x.
(a) Use Laplace's method to construct a formula for the approximation.
(b) Describe how to use importance sampling to do the approximation.
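Part (b) can be sketched as follows, using the prior as the importance distribution; the data values, sample size, and grid here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Conditionally IID Lap(theta, 1) data with a Lap(0, 1) prior on theta.
data = rng.laplace(loc=1.0, scale=1.0, size=20)

# Draw from the prior and weight each draw by the likelihood
# (importance sampling with the prior as the proposal distribution).
theta = rng.laplace(loc=0.0, scale=1.0, size=100_000)
log_w = -np.abs(data[:, None] - theta[None, :]).sum(axis=0)
log_w -= log_w.max()                  # stabilize before exponentiating
w = np.exp(log_w)
w /= w.sum()

def predictive_density(x):
    """Approximate predictive density of a future observation at x."""
    return np.sum(w * 0.5 * np.exp(-np.abs(x - theta)))

grid = np.linspace(-6, 8, 561)
dens = np.array([predictive_density(x) for x in grid])
val = np.trapz(dens, grid)
print(val)
```

Integrating the approximate density over a wide grid gives a value near 1, a simple sanity check on the weighted-mixture form of the estimator.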
51. Let \Theta = (\Gamma, \Psi). Suppose that one wishes to test the hypothesis H : \Gamma = \gamma_0. Let the prior probability of H be positive, and suppose that the prior for \Psi given that H is true is the conditional prior of \Psi given \Gamma = \gamma_0 calculated from the prior on \Theta given that H is false. Prove that the approximate Bayes factor in (4.27) is the same as the Laplace approximation of Theorem 7.116 divided by f_\Gamma(\gamma_0) when the hypothesis is H : \Gamma = \gamma_0.
52. Let \{X_n\}_{n=1}^\infty be conditionally IID given (P_1, P_2) = (p_1, p_2) with \mathrm{Ber}(p_1 + p_2) distribution. Let the prior distribution be (P_1, P_2) \sim \mathrm{Dir}_3(\alpha_1, \alpha_2, \alpha_3).
(a) Find the posterior distribution of (P_1, P_2) given (X_1, \ldots, X_n).
(b) Conditional on (P_1, P_2) = (p_1, p_2), say what happens to the posterior distribution found in part (a) as n \to \infty.
Section 7.5:
53. Let the parameter be 8 = (Ml' M2, E), and suppose that conditional on
8 = (p,l, p,2, 0") Xl,l, ... ,Xl ,n1,X2,1, ... ,X2,n2 are independent with X;,j
having N(p,i' 0"2) distribution for j = 1, ... ,ni and i = 1,2. Prove that the
size a likelihood ratio test of H : Ml = M2 versus A : Ml =I- M2 is also the
UMPU level o.
54.* Let \alpha be a known number strictly between 0 and 1/2. The parameter space is \Omega = \{(\theta_1, \theta_2) : 0 \le \theta_1 \le 1,\ 0 \le \theta_2 \le 1\}. Let D be a discrete random variable with conditional density given (\Theta_1, \Theta_2) = (\theta_1, \theta_2)
When a model has many parameters, it may be the case that we can con-
sider them as a sample from some distribution. In this way we model the
parameters with another set of parameters and build a model with different
levels of hierarchy. In this chapter we will discuss situations in which it is
natural to model in this way.
8.1 Introduction
8.1.1 General Hierarchical Models
We turn our attention now to a situation in which the observations are
not exchangeable. Suppose, for example, that several treatments are be-
ing administered in a clinical trial. From each treatment group, we will
make some observations. It may be plausible to model the observations
within each treatment group as exchangeable, but it would seem strange
to model all observations as exchangeable. For each treatment group, we
might develop a parametric model as we have done elsewhere in this text. A
hierarchical model for this example involves treating the set of parameters
corresponding to the different treatment groups as a sample from another
population. Prior to seeing any observations, we can model the parameters
as exchangeable. 1 This would mean that we could introduce another set of
parameters to model their joint distribution. These second-level parameters
2We set all variances equal to 1 for simplicity in this first example. In any real
application, the variances would be unknown parameters as well.
478 Chapter 8. Hierarchical Models
f_{\Psi|X}(\psi|x) = \frac{f_{X|\Psi}(x|\psi)\, f_\Psi(\psi)}{f_X(x)},
Finally, the predictive distribution of future data Y is found from the posterior density of the parameters:

f_{Y|X}(y|x) = \int\!\!\int f_{Y|\Theta,\Psi}(y|\theta, \psi)\, f_{\Theta|X,\Psi}(\theta|x, \psi)\, f_{\Psi|X}(\psi|x)\, d\theta\, d\psi.
Hierarchical models were first popularized by Lindley and Smith (1972)
and Smith (1973) for the special cases of multivariate normal observations.
Hierarchical models are special cases of partial exchangeability, which we
consider in more detail in the remainder of this section.
all n. The j_n sequence tells us from which group the nth observation comes. Let k_n = \sum_{i=1}^n j_i. Let T_n(\cdot, t) be the uniform distribution on the surface of the sphere of radius \sqrt{t_3 - t_1^2/k_n - t_2^2/(n - k_n)} around the point whose ith coordinate is j_i t_1/k_n + (1 - j_i)t_2/(n - k_n). One can check that the conditions of Theorem 2.111 are met.

We would like to proceed as in Example 2.117 on page 128, but we cannot assume that the coordinates are IID in the limit distributions. So, we will construct the joint distribution of s_0 observations from group 0 and s_1 observations from group 1 (for fixed s_0 and s_1) given T_n = t, and see what happens as n \to \infty. Call these observations Z = (Z_1, \ldots, Z_{s_0}) and W = (W_1, \ldots, W_{s_1}). Let

\sigma_n^2 = \frac{1}{n}\left(t_3 - \frac{t_1^2}{k_n} - \frac{t_2^2}{n - k_n}\right), \qquad \mu_n = \frac{t_1}{k_n}, \qquad \nu_n = \frac{t_2}{n - k_n}.

Then one can write the conditional joint density of (Z, W) given T_n = t. If \sigma_n converges to \sigma \in (0, \infty) and \mu_n and \nu_n converge to \mu and \nu, respectively, this function converges uniformly on compact sets to the density of s_0 N(\mu, \sigma^2) and s_1 N(\nu, \sigma^2) random variables, all independent. If \sigma^2 goes to \infty, there is no limit distribution. If \sigma^2 goes to 0, and \mu_n \to \mu and \nu_n \to \nu, the function converges to 0 uniformly outside of every open neighborhood of the point with ith coordinate j_i\mu + (1 - j_i)\nu. In this case the limit distribution is point masses at \mu and \nu, depending on whether j_i = 1 or 0. Finally, if \sigma_n goes to a finite value and either \mu_n or \nu_n diverges to \pm\infty, there is no limit distribution. The extreme distributions either have all coordinates degenerate or have the coordinates being independent normal random variables with common variance. In either case, there are two different means, depending on the values of the sequence \{j_n\}_{n=1}^\infty.
As an illustration, take X_n to be a small matrix of integers, and let T'_n be the array obtained from X_n by replacing each entry with the sorted list of the other values that share its row and its column.
Basically, you throw away the information about which row and which column each number was in, but you keep the information about which other numbers were in the same row and column with each number. Each of the r_n! c_n! matrices that can be obtained from X_n by permuting the rows and then the columns will have the same value of T_n. Similarly, all of those r_n! c_n! arrays can be constructed from T_n by a somewhat more tedious algorithm. Clearly, T_n(\cdot, t) must be uniform over those r_n! c_n! arrays to preserve row and column exchangeability.
Finding all of the Q(\lambda) distributions is no small task. Aldous (1981) proves the following result. An array X is row and column exchangeable if and only if there exists a measurable function f : [0,1]^4 \to \mathbb{R} such that X has the same distribution as Y = ((Y_{i,j})), with Y_{i,j} = f(M, A_i, B_j, C_{i,j}), where M, A_1, A_2, \ldots, B_1, B_2, \ldots, C_{1,1}, \ldots are all IID U(0,1) random variables.
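Aldous's representation is easy to simulate. In the sketch below, the particular function f is an arbitrary choice of ours (any measurable f works); only the quadruple structure (M, A_i, B_j, C_{i,j}) comes from the result above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

def rce_array(f, n_rows, n_cols, rng):
    """Y[i, j] = f(M, A[i], B[j], C[i, j]) with M, A's, B's, C's IID U(0, 1)."""
    m = rng.uniform()
    a = rng.uniform(size=(n_rows, 1))
    b = rng.uniform(size=(1, n_cols))
    c = rng.uniform(size=(n_rows, n_cols))
    return f(m, a, b, c)

# One (arbitrary) choice of f: additive row and column effects plus noise,
# built by pushing the uniforms through the standard normal quantile function.
f = lambda m, a, b, c: m + norm.ppf(a) + norm.ppf(b) + 0.5 * norm.ppf(c)
y = rce_array(f, 4, 5, rng)
print(y.shape)
```

Permuting the rows or the columns of an array generated this way leaves its distribution unchanged, which is exactly the row-and-column exchangeability being characterized.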
(8.8)

where

\mu_i(\psi, \sigma, \tau) = \frac{n_i\bar{x}_i\tau^2 + \psi\sigma^2}{n_i\tau^2 + \sigma^2}.
The conditional joint distribution of the data given \Psi = \psi, \Sigma = \sigma, and T = \tau is
(8.10)
To find the posterior distribution of (\Sigma, T), let \bar{X} stand for the vector with coordinates \bar{X}_1, \ldots, \bar{X}_k. Then, the conditional distribution of the data given \Sigma = \sigma and T = \tau is

\bar{X} \sim N_k(\psi_0 1, W(\sigma, \tau)), \qquad s_i^2 \sim \frac{\sigma^2}{n_i - 1}\,\chi^2_{n_i - 1}. \qquad (8.12)

It follows that f_{\Sigma,T|\bar{X},s_1^2,\ldots,s_k^2}(\sigma, \tau \mid \bar{x}, s_1^2, \ldots, s_k^2) is proportional to
There is one special case in which the above formulas simplify tremendously. For fixed \lambda, suppose that \tau^2 = \sigma^2/\lambda. Then W(\sigma, \tau) is \sigma^2 times a matrix whose entries are determined by \gamma_0^{-1} and \gamma_0 + \sum_{i=1}^k \gamma_i, where

\lambda_i = \lambda + n_i, \qquad \gamma_i = \frac{n_i}{\lambda_i}.
Note how \mu_i and \psi_1 no longer depend on \sigma and \tau. In fact, we can use Proposition 8.13 below to show that
5 This proposition is also used in the analysis of the two-way ANOVA in Section 8.2.2.
E(M_j | X = x) = \int_0^\infty \frac{n_j\bar{x}_j + \lambda\hat\psi_0}{\lambda_j}\, f_{\Lambda|X}(\lambda|x)\, d\lambda.

Since the integration on the right-hand side is over \lambda, we should see explicitly where \lambda appears. So, we rewrite the formula as

E(M_j | X = x) = \int_0^\infty \frac{n_j\bar{x}_j + \lambda\hat\psi_0(\lambda)}{\lambda_j}\, f_{\Lambda|X}(\lambda|x)\, d\lambda,

where \hat\psi_0(\lambda) = (\gamma_0\lambda\psi_0 + \sum_{i=1}^k \gamma_i\bar{x}_i)/(\gamma_0 + \sum_{i=1}^k \gamma_i), and b_i(\lambda) is the same as b_i with b_0 replaced by b_0(\lambda).
For importance sampling, we sampled 1000 values from the prior distribution of \Lambda and used these to approximate the integrals that equal the posterior densities at various \mu values. We also used the delta method to calculate standard deviations for the density values and found these to be at most 0.09 times the density values in all cases (less than 0.05 times the density in 80% of the cases).

The three posterior means were calculated from the posterior densities and were found to equal 26.08, 18.86, and 19.86. We see that some shrinkage has occurred. The numerically evaluated densities are shown in Figure 8.15 together with the results of an empirical Bayes analysis to be described in Section 8.4 and a successive substitution sampling analysis to be described in Section 8.5.
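The importance-sampling-plus-delta-method recipe just described can be sketched in a toy conjugate setting (our own construction, not the example's data): prior draws of a hyperparameter are weighted by the marginal likelihood, each posterior-density ordinate is a ratio estimate, and the delta method gives its standard error.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy setup: X_i | mu ~ N(mu, 1); mu | lam ~ N(0, 1/lam); lam ~ Exp(1).
# The posterior density of mu at a point m is the average of f(m | x, lam)
# under f(lam | x), approximated by weighting prior draws of lam.
x = rng.normal(1.0, 1.0, size=15)
n, xbar = len(x), x.mean()

lam = rng.exponential(1.0, size=10_000)   # draws from the prior of lam
post_prec = n + lam                       # posterior precision of mu given lam
post_mean = n * xbar / post_prec
# marginal likelihood of the data given lam (up to a lam-free constant)
log_w = 0.5 * np.log(lam / post_prec) + 0.5 * post_mean**2 * post_prec
w = np.exp(log_w - log_w.max())

def density_with_se(m):
    """Ratio estimate of f(m | x) and its delta-method standard error."""
    a = np.sqrt(post_prec / (2 * np.pi)) * np.exp(-0.5 * post_prec * (m - post_mean) ** 2)
    r = np.sum(w * a) / np.sum(w)
    resid = w * (a - r)                   # delta method for a ratio of means
    se = np.sqrt(np.sum(resid**2)) / np.sum(w)
    return r, se

r, se = density_with_se(xbar)
print(r, se)
```

As in the example, the standard error at each ordinate can be reported relative to the ordinate itself to judge the quality of the approximation.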
[Figure 8.15: posterior densities of the three group means, plotted against \mu from 10 to 35, with curves labeled Imp. Samp., Laplace, SSS, and Emp. Bayes.]
One could generalize this model to the case in which the variance of X_{i,j} conditional on the parameters is \Sigma_i^2. In this case, one can only obtain closed-form posteriors conditional on all of the variance parameters, \Sigma_1^2, \ldots, \Sigma_k^2, T^2. Numerical integration over all k + 1 variance parameters would then be needed. We postpone illustration of this until Section 8.5, at which time we introduce an alternative method of solution that is better suited to this type of problem.
where A stands for the random effect and B stands for the fixed effect and

\sum_{j=1}^b B_j = 0 = \sum_{j=1}^b (AB)_{i,j}, \quad \text{for all } i, \qquad (8.17)

for i = 1, \ldots, a, j = 1, \ldots, b, and k = 1, \ldots, m. We suppose that the \epsilon_{i,j,k} are conditionally IID N(0, \sigma_e^2) given \Sigma_e^2 = \sigma_e^2 (and all other parameters). It is also traditional to assume that, conditional on other parameters, the A_i are independent of each other and of the B_j and the (AB)_{i,j}, that the B_j are independent of the (AB)_{i,j}, and that the (AB)_{i,j} for different i are independent of each other. We can let
independent of each other. We can let
and put these into vectors Mi = (M i ,l, . , . , Mi,b) T. We can then express the
model described above by saying that the Mi are conditionally independent
Nb((), E) vectors given e = () and E = a. Here E is a b x b matrix and
e = (M + Bl, ... M + Bb) T. In order to ensure that (8.17) is reflected in
the conditional distribution of Mi given M, we assume that E has the form
Table 8.18. To do this, we first note that sufficient statistics are

RSS = \sum_{i=1}^a\sum_{j=1}^b\sum_{k=1}^m (Y_{i,j,k} - \bar{Y}_{i,j})^2, \qquad \bar{Y}_{i,j} = \frac{1}{m}\sum_{k=1}^m Y_{i,j,k},

together with the vectors \bar{Y}_i = (\bar{Y}_{i,1}, \ldots, \bar{Y}_{i,b})^\top, which have conditional distribution

N_b\left(\theta,\ \frac{\sigma_e^2}{m} I_b + \sigma_A^2\, 1 1^\top + \sigma_{AB}^2\left[I_b - \frac{1}{b}\, 1 1^\top\right]\right), \qquad (8.19)
and they are still conditionally independent. This means that we can reduce the sufficient statistic even further. We will still need RSS, and of course we will need \bar{Y}_\cdot = \sum_{i=1}^a \bar{Y}_i/a. Because of the special form of the covariance matrix of the \bar{Y}_i, we do not need the whole matrix \sum_{i=1}^a (\bar{Y}_i - \bar{Y}_\cdot)(\bar{Y}_i - \bar{Y}_\cdot)^\top, which would be required if the covariance matrix were unconstrained. Instead, one can use the fact that

SSI = \sum_{i=1}^a m(\bar{Y}_i - \bar{Y}_\cdot)^\top(\bar{Y}_i - \bar{Y}_\cdot) - SSA. \qquad (8.20)
conditional density of \bar{Y}_\cdot in terms of \bar{Y}_{\cdot,\cdot} and SSB = \sum_{j=1}^b ma(\bar{Y}_{\cdot,j} - \bar{Y}_{\cdot,\cdot})^2, where \bar{Y}_{\cdot,j} is coordinate j of \bar{Y}_\cdot, and SSB is the usual sum of squares for the fixed effect. In summary, the sufficient statistics for the model that involves only the parameters \Sigma_e, \Sigma_A, \Sigma_{AB}, \Sigma_B, and M are RSS, SSA, SSI, SSB, and \bar{Y}_{\cdot,\cdot}. The conditional distributions of these quantities given the parameters are easily calculated using the fact that they are functions of orthogonal transformations of the original data. Hence, they are all conditionally independent given the parameters, and their distributions are
f_{Z|Y}(z|y) = \sum_{i=0}^\infty \frac{e^{-y/2}(y/2)^i}{i!}\; \frac{z^{q/2+i-1}\exp(-z/2)}{2^{q/2+i}\,\Gamma(q/2+i)},

and

f_Y(y) = \frac{(b_1/2)^{a_1/2}}{\Gamma(a_1/2)}\, y^{a_1/2-1}\exp\left(-\frac{b_1 y}{2}\right).

Multiplying these gives the joint density f_{Z,Y}(z, y). Integrating y out of this gives the density of Z as a series of gamma densities in z, with coefficients involving \Gamma(a_1/2 + i), \Gamma(q/2 + i), and powers of 2 and b_1. Use the formula for \gamma to complete the proof that the distribution is ANC\chi^2. The mean of Z given Y = y is q + y, and the mean of Y is c\,a_1/b_1. \square
Now, make the change of variables from z to u = z/(z + b_1 + c). The inverse is z = (b_1 + c)u/(1 - u). The derivative is (b_1 + c)/(1 - u)^2. The density of U = Z/(Z + b_1 + c) is

f_U(u) = \sum_{i=0}^\infty \frac{\Gamma\left(\frac{a_1}{2} + \frac{q}{2} + 2i\right) b_2^i}{(b_1 + c)^{a_1/2 + i}\, i!\, \Gamma(q/2 + i)\, \Gamma(a_1/2 + i)}\; u^{q/2+i-1}(1 - u)^{a_1/2+i-1}.

Setting \gamma = c/(c + b_1) and rearranging \Gamma function values produces the ANCB(q, a_1, \gamma) density. We know that

A = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}, \qquad \psi_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
8.3. Nonnormal Models 495
= \exp(-24\theta)\,\frac{(24\theta)^n}{n!},

f_{\Theta|X,A,B}(\theta|x, \alpha, \beta) = \frac{(\beta + 27)^{\alpha+3}}{\Gamma(\alpha + 3)}\,\theta^{\alpha+2}\exp[-\theta(\beta + 27)],

f_{\Theta|A,B}(\theta|\alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,\theta^{\alpha-1}\exp(-\theta\beta),

f_{M_2|X,A,B}(n|x, \alpha, \beta) = \frac{24^n(\beta + 27)^{\alpha+3}\,\Gamma(\alpha + 3 + n)}{n!\,\Gamma(\alpha + 3)\,(\beta + 51)^{\alpha+3+n}},

f_{M_1|X,A,B}(n|x, \alpha, \beta) = \frac{24^n\beta^\alpha\,\Gamma(\alpha + n)}{n!\,\Gamma(\alpha)\,(\beta + 24)^{\alpha+n}}.
Therefore, we need to be able to integrate these last two expressions times the expression in (8.26) renormalized to be a density. The normalization constant is f_X(x), the integral of (8.26) over \alpha and \beta.

First, we used Laplace's method from Section 7.4.3, since all the functions being integrated are positive. The "\Theta" in this example is (A, B), and g(\theta) is one of the several functions obtained by fixing n in either f_{M_2|X,A,B}(n|x, \alpha, \beta) or f_{M_1|X,A,B}(n|x, \alpha, \beta) from above. Due to the form of the prior, it seemed sensible to transform to (A, B/A) before applying Laplace's method.
Second, we used importance sampling (see Section B.7) to integrate numerically. We used a single set of 100,000 pseudorandom pairs drawn from the prior distribution of (A, B) to perform all of the integrals. We also calculated variances using the delta method for each ordinate. The results for M_2 are shown in Figure 8.28, and those for M_1 are shown in Figure 8.29. The standard deviations of the importance sample ordinates were all at least two orders of magnitude smaller than the ordinates themselves. As we can see in the figures, the two methods produce nearly the same results.
[Figure 8.28: approximate values of the posterior probabilities for M_2 at m = 0 through 10, computed by Laplace's method and by importance sampling.]

[Figure 8.29: approximate values of the posterior probabilities for M_1 at m = 0 through 10, computed by the same two methods.]
f_{\Theta,R}(\theta, r) \propto \frac{\Gamma(r)^k}{\Gamma(\theta r)^k\,\Gamma([1-\theta]r)^k}\prod_{i=1}^k \frac{\Gamma(\theta r + x_i)\,\Gamma([1-\theta]r + n_i - x_i)}{\Gamma(r + n_i)}.

\psi_i(\mu) = \frac{\mu r + n_i\bar{y}_i}{r + n_i}.
The posterior of M given T = \tau is N(\mu^{(1)}(\tau), 1/\lambda(\tau)), where

\mu^{(1)}(\tau) = \frac{\lambda\mu^{(0)} + \sum_{i=1}^k \frac{n_i\tau}{n_i + \tau}\,\bar{y}_i}{\lambda(\tau)}, \qquad \lambda(\tau) = \lambda + \sum_{i=1}^k \frac{n_i\tau}{n_i + \tau}.

The posterior for T cannot be given in closed form, but the density is proportional to f_T(\tau) times an expression involving \lambda(\tau) and the \bar{y}_i.

If T has a \Gamma(a^{(0)}/2, b^{(0)}/2) prior, then the approximate posterior of T is \Gamma(a^{(1)}/2, b^{(1)}/2), where

a^{(1)} = a^{(0)} + k, \qquad b^{(1)} = b^{(0)} + w + \frac{k\lambda}{k + \lambda}\left(\mu^{(0)} - \bar{y}\right)^2.

Of course, using these same approximations, the conditional distribution of M_i given M and T would be N(\bar{y}_i, 1/k), which is independent of M and T anyway.
where n = \sum_{i=1}^k n_i. For fixed \sigma^2 and \lambda, this is maximized over \psi by choosing

\hat\psi(\lambda) = \left[\sum_{i=1}^k \left(\frac{1}{n_i} + \frac{1}{\lambda}\right)^{-1}\right]^{-1} \sum_{i=1}^k \frac{\bar{x}_i}{\frac{1}{n_i} + \frac{1}{\lambda}}.

If we plug this value for \psi into the likelihood and maximize over \sigma^2 for fixed \lambda, we get \hat\sigma^2(\lambda). If we plug this value for \sigma^2 into the likelihood, we get the following function of \lambda to maximize:

(8.30)
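The maximization over \psi can be sketched numerically. The weighting below is our reading of the display above (precision weights 1/(1/n_i + 1/\lambda)); the group means and sample sizes are invented for illustration.

```python
import numpy as np

def psi_hat(lam, xbar, n):
    """Weighted average of the group means; weights 1/(1/n_i + 1/lambda),
    i.e., the precision of each group mean under the marginal model."""
    w = 1.0 / (1.0 / n + 1.0 / lam)
    return np.sum(w * xbar) / np.sum(w)

xbar = np.array([26.0, 19.0, 20.0])   # invented group means
n = np.array([10.0, 12.0, 15.0])      # invented group sizes

for lam in (0.1, 1.0, 10.0, 1e6):
    print(lam, psi_hat(lam, xbar, n))
```

As \lambda grows the weights approach n_i (the ordinary pooled mean), and as \lambda shrinks to 0 the weights become equal, so the estimate moves toward the unweighted average of the group means.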
N\left(\frac{n_i\bar{x}_i + \lambda\hat\psi}{\lambda + n_i},\ \frac{\hat\sigma^2}{n_i + \lambda}\right).

For the three groups, these distributions are respectively
Example 8.33 (Continuation of Example 7.60; see page 420). In this example, each observation is a pair (X_i, Y_i) that are conditionally IID N(\mu_i, \sigma^2), and the pairs are conditionally independent given (\Sigma, M_1, M_2, \ldots). Suppose that we model (M_1, M_2, \ldots) as conditionally IID N(\mu, \sigma^2/\lambda) given (M, \Lambda) = (\mu, \lambda). The empirical Bayes approach might treat (\Sigma, M, \Lambda) as the parameter to be estimated by maximum likelihood. The likelihood function for these parameters is
+ \frac{\lambda}{\lambda + 2}\sum_{i=1}^n\left(\frac{x_i + y_i}{2} - \frac{\bar{x} + \bar{y}}{2}\right)^2 + \frac{2\lambda n}{\lambda + 2}\left(\frac{\bar{x} + \bar{y}}{2} - \mu\right)^2\right]\Bigg)

\frac{\hat\lambda}{\hat\lambda + 2} = \min\left\{1,\ \frac{\sum_{i=1}^n (x_i - y_i)^2/2}{\sum_{i=1}^n \left(\frac{x_i + y_i}{2} - \frac{\bar{x} + \bar{y}}{2}\right)^2}\right\}.
Since the "observations" (X_i + Y_i)/2 are conditionally IID N(\mu, \sigma^2[(1/2) + (1/\lambda)]) given the parameters, it follows that

\frac{1}{n}\sum_{i=1}^n\left(\frac{x_i + y_i}{2} - \frac{\bar{x} + \bar{y}}{2}\right)^2 \to^P \sigma^2\left(\frac{1}{2} + \frac{1}{\lambda}\right).

This implies that \hat\Lambda is consistent and, in turn, that \hat\sigma^2(\hat\Lambda) is consistent. The extra terms added due to the empirical Bayes analysis make \hat\sigma^2(\hat\Lambda) consistent (relative to the empirical Bayes model).
\left[\sum_{i=1}^k \frac{n_i}{\hat\Sigma^2 + n_i\hat{T}^2}\right]^{-1}.

The value of this estimate would depend on how we estimated T and \Sigma, of course. We should also increase the variance of M_i to reflect the fact that \Sigma and T were estimated. An easy way to do this is to replace the normal distribution in the posterior by a t distribution with appropriate degrees of freedom. Morris (1983) chooses, instead, to replace the naive variance expression \hat\Sigma^2/(n_i + \hat\Lambda) by

\frac{\hat\Sigma^2}{n_i}\left(1 - \frac{k - 1}{k}\,\frac{\hat\Lambda}{\hat\Lambda + n_i}\right). \qquad (8.34)

This amounts to estimating the shrinkage factor \hat\Lambda/(\hat\Lambda + n_i) by a smaller value.
Example 8.35 (Continuation of Example 8.14; see page 487). We can estimate \mathrm{Var}(\hat\Psi) by

\left(\frac{10}{38.37032 + 10 \times 14.67878} + \frac{12}{38.37032 + 12 \times 14.67878} + \frac{15}{38.37032 + 15 \times 14.67878}\right)^{-1} = 5.95368.
The additional variance terms for the three groups are 0.25568, 0.19048, and 0.13112, respectively. The adjustments specified by (8.34) are 3.30693, 2.81623, and 2.30494, respectively. Adding these together gives the adjusted variances as 3.56261, 3.00671, and 2.43606, respectively, all somewhat larger than the naive variances.
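The arithmetic in this example can be reproduced directly, assuming \hat\sigma^2 = 38.37032, \hat\tau^2 = 14.67878, and \hat\Lambda = \hat\sigma^2/\hat\tau^2; the "additional variance terms" then appear to be \mathrm{Var}(\hat\Psi) times the squared shrinkage factor, which matches the reported values.

```python
import numpy as np

# Inputs read off the example: sigma^2, tau^2, group sizes, and
# lambda-hat = sigma^2 / tau^2 (our assumption, validated by the numbers).
sigma2, tau2 = 38.37032, 14.67878
n = np.array([10.0, 12.0, 15.0])
k = len(n)

var_psi = 1.0 / np.sum(n / (sigma2 + n * tau2))       # estimate of Var(Psi-hat)

lam = sigma2 / tau2
shrink = lam / (lam + n)                              # shrinkage factors
extra = var_psi * shrink**2                           # additional variance terms
adjusted = (sigma2 / n) * (1 - (k - 1) / k * shrink)  # the expression in (8.34)

print(var_psi)            # text reports 5.95368
print(extra)              # text reports 0.25568, 0.19048, 0.13112
print(adjusted + extra)   # text reports 3.56261, 3.00671, 2.43606
```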
We might now ask how the adjusted empirical Bayes posteriors compare to the posteriors calculated from a hierarchical model with prior distributions for all parameters. Such a model exists in the original description on page 487. Plots of the posterior densities from these models were drawn in Figure 8.15 together with the adjusted empirical Bayes distributions. Two of the M_i have empirical Bayes distributions that are very close to the posteriors, but M_1 has a noticeably smaller variance in the empirical Bayes analysis than in the Bayesian analysis.
TABLE 8.36. Hierarchical Model for One-Way ANOVA with Unequal Variances

Stage | Density
Data | \prod_{i=1}^k (2\pi\sigma_i^2)^{-n_i/2}\exp\{-[n_i(\bar{x}_i - \mu_i)^2 + (n_i - 1)s_i^2]/(2\sigma_i^2)\}
Parameter | (2\pi\tau^2)^{-k/2}\exp\{-\frac{1}{2\tau^2}\sum_{i=1}^k(\mu_i - \psi)^2\}
Hyperparameter | \sqrt{\lambda_0}\,(2\pi\tau^2)^{-1/2}\exp\{-\frac{\lambda_0}{2\tau^2}(\psi - \psi_0)^2\}
Variance | f_{\Sigma_1,\ldots,\Sigma_k,T}(\sigma_1, \ldots, \sigma_k, \tau)
(8.37)

The posterior mean of \Psi for fixed values of the variance parameters is

(8.38)

(8.39)

where (8.38) and (8.39) must be solved iteratively. One can choose a starting \hat{T} value and plug it into (8.38) (together with \hat\Sigma_1^2, \ldots, \hat\Sigma_k^2) to produce a \hat\Psi to plug into (8.39) to produce a new \hat{T}, and so on, until the estimates converge.8

8.5. Successive Substitution Sampling 505

Morris (1983) also suggests replacing \hat{M}_i by (1 - B_i)\bar{X}_i + B_i\hat\Psi(\hat\Sigma_1, \ldots, \hat\Sigma_k, \hat{T}), where

B_i = \frac{k - 3}{k - 2}\,\frac{\hat\Sigma_i^2}{\hat\Sigma_i^2 + n_i\hat{T}^2},

causes there to be less shrinkage toward a common mean.9 The recommended variance for M_i is given as

\cdots + \left(\bar{x}_i - \hat\Psi(\hat\Sigma_1, \ldots, \hat\Sigma_k, \hat{T})\right)^2 \frac{2}{k - 3}\, B_i\, \frac{k\,n_i}{(\hat\Sigma_i^2 + n_i\hat{T}^2)\sum_{j=1}^k n_j/(\hat\Sigma_j^2 + n_j\hat{T}^2)}.
Kass and Steffey (1989) present an alternative treatment of this case from a Bayesian viewpoint. They find a normal approximation to the posterior distribution of the parameters V = (\Sigma_1, \ldots, \Sigma_k, \Psi, T) (in a manner similar to the method of Laplace) and then use the delta method to approximate the mean and variance of (8.37), thought of as a function of V. The posterior variance of M_i is E(\Sigma_i^2)/n_i plus the variance of (8.37).
We can define the operator T from the set of densities with respect to \lambda to itself by

It is easy to see that T(f_Y) = f_Y, so the joint density of Y is a fixed point of T.
The method of successive substitution applied to the fixed-point problem just described would be to pick an initial density f_0, say, and then let f_n = T(f_{n-1}) for n = 1, 2, \ldots. This would require the calculation of a great many integrals that may not have closed-form expressions. An alternative is to draw samples from the various conditional distributions instead of calculating the integrals. In the notation just used, suppose that X' = (x'_1, \ldots, x'_k) is generated from the distribution with density f_{X'}. Then suppose that X = (X_1, \ldots, X_k) is generated as follows. Generate X_1 from the distribution with density f_{Y_1|Y_{\setminus 1}}(\cdot|x'_2, \ldots, x'_k). Let x_1 be the generated value. Generate X_2 from the distribution with density f_{Y_2|Y_{\setminus 2}}(\cdot|x_1, x'_3, \ldots, x'_k). Continue until we generate X_k from the distribution with density f_{Y_k|Y_{\setminus k}}(\cdot|x_1, \ldots, x_{k-1}). The joint density of X is T(f_{X'}).
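The cycle just described is easy to code for a toy target of our own choosing: a bivariate normal with correlation \rho, whose full conditionals are univariate normals.

```python
import numpy as np

rng = np.random.default_rng(6)

# Target: (Y1, Y2) bivariate normal, zero means, unit variances,
# correlation rho. Each full conditional is N(rho * other, 1 - rho^2).
rho = 0.9
sd = np.sqrt(1 - rho**2)
n_iter = 20_000
y1, y2 = 0.0, 0.0                  # arbitrary starting values
draws = np.empty((n_iter, 2))
for s in range(n_iter):
    y1 = rng.normal(rho * y2, sd)  # generate Y1 given the current Y2
    y2 = rng.normal(rho * y1, sd)  # generate Y2 given the new Y1
    draws[s] = (y1, y2)

burned = draws[1000:]
print(burned.mean(axis=0))         # both near 0
print(np.corrcoef(burned.T)[0, 1]) # near rho
```

After discarding an initial stretch, the empirical means and correlation of the draws match the target distribution, which is the fixed-point property in action.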
So, we can take a starting density f_0 and generate X^0 from the distribution with this density. Then, using the method just described for n = 1, 2, \ldots, generate X^n from the distribution with density T(f_{n-1}). This method has
8.5. Successive Substitution Sampling 507
Assume that

and that K > 0 almost everywhere with respect to \lambda \times \lambda. Let \mathcal{H} be the set of functions f such that^{13} \|f\|^2 = \int |f(x)|^2/f_Y(x)\, d\lambda(x) < \infty. There exists a number c \in [0, 1) such that for every density f_0 \in \mathcal{H}, the sequence of functions f_n = T(f_{n-1}) = T^n(f_0) for n = 1, 2, \ldots satisfies \|f_n - f_Y\| \le \|f_0\| c^n for all n.
10 Many authors call this method Gibbs sampling. This is actually a misnomer. Geman and Geman (1984) described this method as a way to generate a sample from a Gibbs distribution, and they called their particular implementation the Gibbs sampler. Gelfand and Smith (1990) generalized the method to arbitrary distributions but continued to call it Gibbs sampling, even though they were no longer sampling Gibbs distributions. The SSS algorithm is a special case of the broad class of Markov chain Monte Carlo methods. Note that the sequence \{X^n\}_{n=0}^\infty is a Markov chain (see Definition B.125). A good survey of general Markov chain Monte Carlo methods is given by Tierney (1994).

11 An alternative is to notice that the sequence X^1, X^2, \ldots is a Markov chain (see Definition B.125 on page 650). One then applies a theorem like the one given by Doob (1953, Section V.5). The conditions of such theorems are often difficult, if not impossible, to verify in specific applications.

12 Some good treatments of operator theory can be found in Berberian (1961) and Dunford and Schwartz (1963).

13 We use the symbol \|f\| for the norm of an element of a Hilbert space. The norm \|T\| of an operator T is the supremum of \|T(f)\|/\|f\|. Dunford and Schwartz (1963) use the symbols |f| and |T| for these norms. They use the symbol \|T\| for the Hilbert-Schmidt norm or double norm of a Hilbert-Schmidt-type operator. We only mention this here in case the reader decides to refer to Dunford and Schwartz (1963) for some of the proofs of auxiliary results.
PROOF. We will use Hilbert space notation and define the inner product

(g, h) = \int g(x)h(x)\, d\mu(x),

where \mu(A) = \int_A [1/f_Y(x)]\, d\lambda(x). It follows that \mathcal{H} is the Hilbert space L^2(\mu). The norm in this space is \|g\| = \sqrt{(g, g)}. If we let K_0(x', x) = K(x', x)f_Y(x'), then T(f)(x) = \int K_0(x', x)f(x')\, d\mu(x') is the operator that takes a density for observations at one iteration of SSS to the density of observations at the next iteration, and (8.41) becomes

\int g(x)\, d\lambda(x) = \int T(g)(x)\, d\lambda(x), \qquad \int g(x)\, d\lambda(x) = \int T^*(g)(x)\, d\lambda(x),

\int T(g)(x)h(x)\, d\mu(x) = \int g(x)T^*(h)(x)\, d\mu(x).

The last equation is the definition of what it means to say that T^* is the adjoint of the operator T. It also follows from this equation that the adjoint of the composition U = T(T^*) is itself, U. That is to say, U is self-adjoint. Since U is two applications of successive substitution, it follows that
Since K_0 > 0, it follows that U(g^+)(x) > 0 and U(g^-)(x) > 0 for all x. Hence,

It follows that U(g^+ + g^-) > g^+ + g^- for all x. In other words, U(|g|) > |g|, which would imply \int U(|g|)\, d\lambda > \int |g|\, d\lambda, which contradicts (8.42). Hence, |r| < 1, and the largest eigenvalue of U has absolute value |r| < 1. It follows that \|U\| = |r| < 1.
Now, we know that \|W\| = |r|^{1/2} = c < 1. If f is a density, then (f_Y, f) = 1 and

W(f) = T(f) - f_Y = T(f - f_Y).
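A discrete analogue makes the geometric convergence concrete: replace the kernel by an ergodic transition matrix P with stationary distribution \pi, iterate f \mapsto fP, and measure distance to \pi in the weighted norm \|g\| = (\sum_x g(x)^2/\pi(x))^{1/2}. The matrix below is an arbitrary example of ours.

```python
import numpy as np

# An ergodic 3-state transition matrix (rows sum to 1).
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.4, 0.5]])

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi = pi / pi.sum()

def norm(g):
    """Weighted L2 norm, the discrete analogue of ||g||^2 = sum g^2 / pi."""
    return np.sqrt(np.sum(g**2 / pi))

f = np.array([1.0, 0.0, 0.0])      # point-mass starting density
errs = []
for _ in range(20):
    f = f @ P
    errs.append(norm(f - pi))
ratios = [errs[i + 1] / errs[i] for i in range(len(errs) - 1)]
print(ratios[-1])   # settles near the second-largest eigenvalue modulus
```

The error shrinks by a nearly constant factor per iteration, the discrete counterpart of the bound \|f_n - f_Y\| \le \|f_0\| c^n.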
-\frac{1}{2x_3}\left\{4 + x_1^2 + \frac{(x_2 - 0.9x_1)^2}{0.19}\right\}.

By collecting terms here, it is not difficult to show that this function is integrable over the six variables x_1, x_2, x_3, x'_1, x'_2, x'_3.
After one stops the iteration, one has a vector Y from approximately the correct distribution. One can repeat the process and produce Y^1, \ldots, Y^m for some value m. If one wants the marginal density of Y_i, one can let Y^s_{\setminus i} stand for the (k - 1)-dimensional vector formed from Y^s by removing Y^s_i, and then calculate

(8.44)

(8.45)

assuming that the conditional mean of Y_i given the others is easily available. Equation (8.45) should be better than \sum_{s=1}^m Y^s_i/m, since the variance of the simple average is the variance of (8.45) plus 1/m times the mean of the conditional variance of Y_i given Y_{\setminus i}. Similarly, the variance of Y_i can be approximated by

(8.46)

which should be a better estimate than the sample variance of the Y^s_i.
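Estimators in the spirit of (8.45) and (8.46) can be sketched in the bivariate normal setting (our toy example, where E(Y_1 | Y_2) = \rho Y_2 and the conditional variance is 1 - \rho^2): average the conditional means across draws rather than the draws themselves.

```python
import numpy as np

rng = np.random.default_rng(7)

rho, m = 0.8, 30_000
sd = np.sqrt(1 - rho**2)
y1, y2 = 0.0, 0.0
cond_means = np.empty(m)
for s in range(m):
    y1 = rng.normal(rho * y2, sd)
    y2 = rng.normal(rho * y1, sd)
    cond_means[s] = rho * y2        # E(Y1 | Y2) at the current draw

mean_rb = cond_means.mean()                  # in the spirit of (8.45)
# mean conditional variance plus variance of the conditional means,
# in the spirit of (8.46)
var_rb = (1 - rho**2) + cond_means.var()
print(mean_rb, var_rb)   # targets: E(Y1) = 0 and Var(Y1) = 1
```

Because the conditional means vary less than the raw draws, these "Rao-Blackwellized" estimates are typically less variable than the simple sample mean and variance.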
The SSS algorithm, as described, assumes that the random quantities
Y1 , ... ,Yk are in a fixed order for every iteration. This is not actually
required for convergence of the algorithm. The proof of convergence is sim-
plified by making this assumption however. Note also that each Y; need not
be a single random variable. Some of them might themselves be vectors.
The question of how to arrange the coordinates is important for the rate
of convergence. The more dependence that exists between successive itera-
tions, the slower the convergence will be. One can understand why this is
true intuitively by realizing that convergence "occurs" when an iteration is
"independent" of the starting iteration. The more dependence lingers from
one iteration to the next, the longer it takes to get an iteration that is
essentially independent of the start. Example 8.43 can be used to illustrate
how the choice of coordinate arrangement affects the dependence between
iterates.
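The Rao-Blackwellized estimates discussed above can be illustrated on a small artificial target. The sketch below (not from the text) runs successive substitution sampling on a bivariate normal with correlation 0.9, where both full conditionals are known, and forms estimates in the spirit of (8.45) and (8.46); the target, correlation, seed, and run lengths are all assumptions made for this example.

```python
import numpy as np

# Successive substitution (Gibbs) sampling for a N2(0, [[1, RHO], [RHO, 1]])
# target, with Rao-Blackwellized mean and variance estimates. All constants
# here are illustrative assumptions, not from the text.
RHO = 0.9
rng = np.random.default_rng(0)

def sss(m, burn=500):
    y1, y2 = 0.0, 0.0
    draws, cond_means, cond_vars = [], [], []
    for t in range(burn + m):
        # Full conditionals of the bivariate normal target:
        y1 = rng.normal(RHO * y2, np.sqrt(1 - RHO**2))
        y2 = rng.normal(RHO * y1, np.sqrt(1 - RHO**2))
        if t >= burn:
            draws.append(y1)
            cond_means.append(RHO * y2)    # E(Y1 | Y2) at the current state
            cond_vars.append(1 - RHO**2)   # Var(Y1 | Y2), constant here
    mean_rb = np.mean(cond_means)          # (8.45)-style estimate of E(Y1)
    # (8.46)-style estimate of Var(Y1): mean conditional variance plus
    # the variance of the conditional means
    var_rb = np.mean(cond_vars) + np.var(cond_means)
    return np.mean(draws), mean_rb, var_rb

print(sss(5000))
```

For this target the true marginal mean and variance of $Y_1$ are 0 and 1, and the Rao-Blackwellized estimates typically show less Monte Carlo noise than the raw averages, as the discussion above suggests.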
Example 8.47 (Continuation of Example 8.43; see page 510). It is not difficult
to see that every order of the three coordinates is essentially equivalent in this
example. Instead, let us compare the natural order $Y_1, Y_2, Y_3$ to the alternative
arrangement $X_1 = (Y_1, Y_2)^T$, $X_2 = Y_3$. That is, let the first random quantity be
a two-dimensional vector consisting of both $Y_1$ and $Y_2$. To illustrate the effect of
this change on the amount of dependence between iterations, we will calculate
the conditional distribution of $Y_1$ at the next iteration given the variables at the
current iteration for both arrangements.
In the natural order, we generate $Y_1$ with $N(0.9y_2', 0.19y_3')$ distribution given
$Y_1' = y_1'$, $Y_2' = y_2'$, and $Y_3' = y_3'$. In the vector arrangement, we generate the whole
vector $(Y_1, Y_2)$ at once with distribution $N_2(0, y_3' A)$, where
$$A = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$
$$\Gamma^{-1}\left(\frac{a_0 + n_1 + \cdots + n_k}{2},\ \cdots\right),$$
of the three population means. These densities are plotted in Figure 8.15. The
respective posterior means were calculated to be 26.30, 18.78, and 19.82 using
(8.45). The posterior variances were calculated using (8.46) to be 4.525, 2.998,
and 2.387, respectively.
The SSS algorithm is also well suited to handle the case in which each
population has its own variance, $\Sigma_i^2$. Suppose that the $\Sigma_i^2$ are independent
with $\Gamma^{-1}(a_0/2, b_0/2)$ distribution in the prior. In this case, we replace two
of the above distributions with
$$N\left(\frac{\psi/\tau^2 + n_i\bar x_i/\sigma_i^2}{1/\tau^2 + n_i/\sigma_i^2},\ \frac{1}{1/\tau^2 + n_i/\sigma_i^2}\right),$$
Example 8.49. Suppose that we use the same data as in Example 8.14 on
page 487, but we use the hierarchical model for the population variances. We
continue to use $a_0 = 1$, $c_0 = 1$, $d_0 = 1$, $\psi_0 = 10$, and $\zeta_0 = 0.1$, but we include
$f_0 = 1$ and $g_0 = 0.1$. The resulting posterior densities are plotted in Figure 8.50.
The posterior means and variances from (8.45) and (8.46) are
1 2 3
mean 27.1392 18.7672 19.7399
variance 3.1634 4.6745 2.2667
We could also do a naive empirical Bayes analysis. (Recall that the adjusted
empirical Bayes analysis requires $k > 3$.) We will use $\hat\Sigma_i^2 = s_i^2$. We will also adjust
the variance of $M_i$ by adding on $\hat\tau^2$ times the square of the coefficient of $\psi$ in
(8.37). This results in $M_i$ having variance
$$\hat\Sigma_i^2\left(\frac{1}{n_i} + \frac{\hat\tau^2\hat\Sigma_i^2}{(\hat\Sigma_i^2 + n_i\hat\tau^2)^2}\right).$$
We need to iterate between (8.38) and (8.39). Starting with $\tau = 1$, it took four iterations
to get no difference between the iterations. The results were $\hat\psi = 21.9809$
FIGURE 8.50. Numerical Approximations to Posterior Densities
and $\hat\tau = 23.4547$. This leads to the following naive empirical Bayes posteriors:
$N(27.3786, 2.5817)$, $N(18.8116, 5.4845)$, and $N(19.7526, 2.3257)$. These three densities
are also plotted in Figure 8.50. Notice that the hierarchical model for the
variances brings the estimated variance for the second population (the largest
of the three) down quite a bit from the empirical Bayes value while it brings
the other two variances up. Although the hierarchical model variance for $M_3$ is
slightly larger than the empirical Bayes variance, the density (in Figure 8.50) is
more peaked. The additional variance comes from heavier tails.
¹⁵In fact, the condition that the function $K$ be strictly positive in Theorem 8.40
is violated.
Note that the prior distributions of the constrained parameters are the
conditional distributions of independent random variables given that the
constraint holds. That is, for example, if $B_1, \ldots, B_b$ were IID $N(0, \sigma_1^2)$ and
we found the conditional distribution of $B$ given that $\sum_{j=1}^{b} B_j = 0$, that
conditional distribution would be the distribution given above for $B$. (See
Problem 13 on page 535.) Also, if $T = b$, then this model is the same as
saying that the $B_j$ are IID $N(0, \sigma_1^2)$ and $M = \sum_{j=1}^{b} B_j/b$.
The posterior distributions of the parameters conditional on the other
parameters are now easily calculated after we introduce some notation.
Suppose that there are ni,j observations with the A factor at level i and
the B factor at level j. We do not need to assume that the cells all have
the same sample size as we did in Section 8.2.2. In fact, there can even be
empty cells in this analysis. Define
$$\begin{aligned}
\bar\alpha_j &= \frac{1}{n_{\cdot,j}}\sum_{i=1}^{a} n_{i,j}\alpha_i, &
\overline{(\alpha\beta)}_{\cdot,\cdot} &= \frac{1}{n_{\cdot,\cdot}}\sum_{i=1}^{a}\sum_{j=1}^{b} n_{i,j}(\alpha\beta)_{i,j},\\
\bar\beta &= \frac{1}{n_{\cdot,\cdot}}\sum_{j=1}^{b} n_{\cdot,j}\beta_j, &
\overline{(\alpha\beta)}_{\cdot,j} &= \frac{1}{n_{\cdot,j}}\sum_{i=1}^{a} n_{i,j}(\alpha\beta)_{i,j},\\
\bar\beta_i &= \frac{1}{n_{i,\cdot}}\sum_{j=1}^{b} n_{i,j}\beta_j, &
\overline{(\alpha\beta)}_{i,\cdot} &= \frac{1}{n_{i,\cdot}}\sum_{j=1}^{b} n_{i,j}(\alpha\beta)_{i,j},\\
W &= \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{n_{i,j}}\left(Y_{i,j,k} - \bar Y_{i,j,\cdot}\right)^2.
\end{aligned}$$
Let $\mathrm{diag}(v^i)$ stand for the diagonal matrix with diagonal entries equal to the
coordinates of $v^i$. Then the conditional posterior of $(AB)_i$ is $N_b(C_i z_i, C_i)$,
where $C_i = \mathrm{diag}(v^i) - v^i v^{iT}/(\mathbf{1}^T v^i)$. A similar analysis works for the $B$
vector. In this case, define the vectors
$$\Gamma^{-1}\left(\frac{a_e + n_{\cdot,\cdot}}{2},\ \frac{b_e + W + \sum_{i=1}^{a}\sum_{j=1}^{b} n_{i,j}\left(\bar y_{i,j,\cdot} - \mu - \alpha_i - \beta_j - (\alpha\beta)_{i,j}\right)^2}{2}\right).$$
The only remaining problem for implementing SSS in this example is
how to simulate from a multivariate normal distribution $N_b(Cz, C)$ with a
singular covariance matrix $C = \mathrm{diag}(v) - vv^T/(\mathbf{1}^T v)$. The most straightforward
way is to find a $b \times (b-1)$ matrix $D$ such that $DD^T = C$, then
generate an $N_{b-1}(0, I)$ vector $V$, and use $D(V + D^T z)$. Let $v^*$ be the first
$b-1$ coordinates of $v$, let $\sqrt{v^*}$ be the vector with $j$th coordinate $\sqrt{v_j}$, let
$\mathrm{diag}(\sqrt{v^*})$ be the diagonal matrix with $(j,j)$ element equal to $\sqrt{v_j}$, and let
$$h = 1 - \sqrt{\frac{v_b}{\mathbf{1}^T v}}.$$
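The simulation scheme just described can be sketched numerically. Instead of the explicit matrix $D$ built from $v^*$ in the text, the sketch below obtains a factor $D$ with $DD^T = C$ from an eigendecomposition of $C$; this is a substitute construction, and the vectors $v$ and $z$ are arbitrary test inputs.

```python
import numpy as np

# Simulating N_b(Cz, C) when C = diag(v) - v v^T / (1^T v) is singular of
# rank b - 1. The factor D here comes from an eigendecomposition of C,
# a substitute for the book's closed-form construction.
rng = np.random.default_rng(1)
b = 4
v = np.array([1.0, 2.0, 0.5, 1.5])
z = np.array([0.3, -0.1, 0.2, 0.0])

C = np.diag(v) - np.outer(v, v) / v.sum()
w, U = np.linalg.eigh(C)              # eigenvalues ascending; w[0] is ~0
D = U[:, 1:] * np.sqrt(w[1:])         # b x (b-1) matrix with D @ D.T == C

V = rng.standard_normal(b - 1)
X = D @ (V + D.T @ z)                 # one draw from N_b(Cz, C)

# Every draw satisfies the linear constraint 1^T X = 0 (up to rounding),
# because the vector of ones is the null vector of C.
print(X, X.sum())
```

The identity $E[D(V + D^Tz)] = DD^Tz = Cz$ shows why the draw has the required mean, which is the point of the $D(V + D^Tz)$ recipe in the text.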
where one would not normally think. In Example 7.104 on page 444, the
observables $X_i$ were modeled as having $\mathrm{Cau}(\theta, 1)$ distribution given $\Theta = \theta$.
Suppose now that we introduce an extra parameter $Y_i$ for each observation
and say that $X_i$ given $Y_i = y$ and $\Theta = \theta$ has $N(\theta, y)$ distribution and the $Y_i$
are independent of $\Theta$ and of each other with $\Gamma^{-1}(1/2, 1/2)$ distribution.¹⁶
It follows that $X_i \sim \mathrm{Cau}(\theta, 1)$ given $\Theta = \theta$; hence this model is equivalent
to the original model. However, this new model is easily handled via
SSS. In Example 7.104, we supposed that $\Theta$ had $N(0, 1000)$ as a prior. The
conditional posterior of $\Theta$ given the $Y_i$ is
$$\Theta \sim N\left(\frac{\sum_{i=1}^{n} x_i/y_i}{0.001 + \sum_{i=1}^{n} 1/y_i},\ \frac{1}{0.001 + \sum_{i=1}^{n} 1/y_i}\right).$$
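This scale-mixture scheme can be sketched directly. The full conditional of each $Y_i$ works out to $\Gamma^{-1}\bigl(1, ((x_i - \theta)^2 + 1)/2\bigr)$, and $\Theta$ has the normal conditional displayed above. The data vector below is made up for illustration; only the model structure comes from the text.

```python
import numpy as np

# SSS for Cauchy data via the normal scale mixture: X_i | Y_i = y, Theta =
# theta ~ N(theta, y), Y_i ~ InvGamma(1/2, 1/2), Theta ~ N(0, 1000) a priori.
# The data below are illustrative, not from the text.
rng = np.random.default_rng(2)
x = np.array([-1.2, 0.4, 0.9, 1.6, 12.0])   # last point mimics a Cauchy tail
n = len(x)

def sss_cauchy(m, burn=1000):
    theta = 0.0
    out = []
    for t in range(burn + m):
        # Y_i | theta, x_i ~ InvGamma(1, ((x_i - theta)^2 + 1)/2):
        scale = ((x - theta) ** 2 + 1.0) / 2.0
        ys = scale / rng.gamma(1.0, 1.0, size=n)   # scale / Gamma(1, rate 1)
        # Theta | y, x: normal with precision 0.001 + sum(1/y_i)
        prec = 0.001 + np.sum(1.0 / ys)
        mean = np.sum(x / ys) / prec
        theta = rng.normal(mean, np.sqrt(1.0 / prec))
        if t >= burn:
            out.append(theta)
    return np.array(out)

draws = sss_cauchy(5000)
print(draws.mean(), draws.std())
```

Because the Cauchy likelihood has heavy tails, the posterior for $\Theta$ stays near the bulk of the data rather than being dragged toward the extreme observation, which is exactly the behavior the equivalence $X_i \sim \mathrm{Cau}(\theta,1)$ guarantees.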
¹⁷This model and some generalizations of it are discussed by Albert and Chib
(1993).
$$f_{Y,X|\Theta_1}(y, x|\theta_1) = f_{Y,X|\Theta_1,\Psi}(y, x|\theta_1, 1) = f_{Y,X|\Theta_0,\Theta_1,\Psi}(y, x|\theta_0, \theta_1, 1),$$
so that $(Y, X)$ is conditionally independent of $\Theta_{1-i}$ given $\Theta_i$ and $\Psi = i$ for
$i = 0, 1$. The predictive density of $(Y, X)$ given $\Psi$ is
$$f_{Y,X|\Psi}(y, x|\psi) = \int_{\Omega_\psi} f_{Y,X|\Theta_\psi}(y, x|\theta_\psi) f_{\Theta_\psi|\Psi}(\theta_\psi|\psi)\,d\theta_\psi,$$
where $\Omega_\psi$ is the parameter space given $\Psi = \psi$, for $\psi = 0, 1$. The conditional
posteriors are
If there are future data $(Y', X')$ that are conditionally independent of
$(Y, X)$ given the parameters, then predictive inference is available.
$$f_{Y',X'|Y,X}(y', x'|y, x) = \sum_{\psi=0}^{1} f_{Y',X'|\Psi}(y', x'|\psi) f_{\Psi|X,Y}(\psi|x, y).$$
8.6. Mixtures of Models 521
8.6.2 Outliers
One popular use of mixtures of models is to allow for the possibility of
outliers in a data set. An outlier is an observation whose distribution (before
seeing the data) is not like that of the other observations. This "definition"
of outlier is intentionally vague. Consider an example to help to clarify the
concept.
Example 8.52. Suppose that $X_1, \ldots, X_n$ are potential observations, but we believe
that some of them may not have the same distributions as the others. Box
and Tiao (1968) describe a model similar to the following. Let $\Theta = (M, \Sigma)$ and
suppose that the conditional distribution of each $X_i$ given $\Theta = (\mu, \sigma)$ is $N(\mu, \sigma^2)$
with probability $1 - \alpha$ and is $N(\mu, c\sigma^2)$ with probability $\alpha$, where $c > 1$ and $\alpha$ are
constants chosen a priori. Suppose that the conditional distribution of $M$ given
$\Sigma = \sigma$ is $N(\mu_0, \sigma^2/\lambda_0)$ and $\Sigma^2$ has $\Gamma^{-1}(a_0/2, b_0/2)$ distribution. There is missing
data here, namely the indicators of whether each observation has variance $c\sigma^2$ or
not. Let $\Psi$ stand for the subset of $\{1, \ldots, n\}$ such that the observations with subscripts
in $\Psi$ have larger variance. For each possible value $\psi$ of $\Psi$, let $n_\psi$ indicate
¹⁸The predictive density has been used by many authors as a means for selecting
and comparing models. See Geisser and Eddy (1979) and Dawid (1984) for
different perspectives.
$$\cdots\ \begin{cases} \cdots & \text{if } i \notin \psi, \\ \cdots & \text{if } i \in \psi, \end{cases} \qquad (8.53)$$
where
$$\lambda_\psi = \lambda_0 + \sum_{i=1}^{n} Z_i, \qquad \bar x_\psi = \frac{\sum_{i=1}^{n} Z_i x_i}{\sum_{i=1}^{n} Z_i}.$$
If we model the $Z_i$ as independent, then the prior probability of $\Psi = \psi$ is
$\alpha^{n_\psi}(1-\alpha)^{n-n_\psi}$, so we can calculate the posterior probability of each possible
subset of outliers.
For example, suppose that $n = 15$ and our prior has $a_0 = 1$, $b_0 = 100$, $\mu_0 = 0$,
$\lambda_0 = 1$, $\alpha = 0.02$, and $c = 25$. (For computational simplicity, we truncate the
distribution of $\Psi$ to a maximum of six elements.) The data are the infamous
Darwin data [see Fisher (1966), p. 37] shown below:
−67, −48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75.
The possible values of $\psi$ with the highest posterior probability are given in Table
8.54 under the column "Model 1." The set of size three with the highest
posterior probability of being outliers is $\{1, 2, 15\}$ with probability 0.0030. We
can also add the probabilities of all subsets that contain a specific observation to
get the marginal probabilities that each observation is an outlier. See Table 8.55.
The observations not listed in Table 8.55 each have probability less than 0.005
of being an outlier.
Of course, one need not choose a single value of $c$ or a single value of $\alpha$. One
could treat these as further mixing parameters like $\Psi$ and compute a posterior
distribution for them. For example, (8.53) is now $f_{X|\Psi,C,\Lambda}(x|\psi, c, \lambda)$, which does
not depend on $\lambda$. This is because $X$ is conditionally independent of $\Lambda$ given $\Psi$
and $C$. Suppose that we let $\Lambda$ have a beta distribution. It is difficult to have a
beta distribution with mean 0.02 which is neither extremely concentrated near
its mean nor extremely concentrated near 0. Suppose that we choose $\mathrm{Beta}(1, 49)$,
which has $\Pr(\Lambda \le 0.02) = 0.6358$. Also, let $C$ have probability 0.05 of being one
of the numbers $5, 10, \ldots, 100$. The posterior distribution of $C$ is almost the same
as the prior, meaning that the different values of $C$ do not lead to much difference
in the predictive density of the data, although $C = 10$ has the highest posterior
probability. The posterior distribution of $\Lambda$ has mean 0.0204. The posterior probabilities
of the various $\Psi$ sets are given in Table 8.54 under the column "Model 2."
The probability of the set $\{1, 2, 15\}$ is now 0.0058. The probabilities that each
of the observations is an outlier are in Table 8.55. Although the probability that
$\Psi = \{1\}$ is smaller in Model 2, the probability that observation 1 is an outlier is
still quite large, because there are many other sets $\psi$ containing 1 which now have
higher probability of equaling $\Psi$. For example, the probability of three outliers
is twice as high in Model 2 as in Model 1, and the probability of four outliers is
six times as high.
Before we leave this example, we offer another variation. Suppose that we
give $C$ a continuous prior distribution, say $\Gamma^{-1}(e_0/2, d_0/2)$ truncated below at
$c = 1$ and independent of $(\Sigma, M, \Lambda)$. We could find the posterior distributions of
whatever we wanted by using successive substitution sampling (see Section 8.5).
The following conditional posterior distributions are easy to find and are easy to
simulate:
$$\Sigma^2 \sim \Gamma^{-1}\left(\frac{a_0 + n + 1}{2},\ \frac{b_0 + \frac{1}{c}\sum_{i\in\psi}(x_i - \mu)^2 + \sum_{i\notin\psi}(x_i - \mu)^2 + \lambda_0(\mu - \mu_0)^2}{2}\right),$$
$$M \sim N\left(\frac{\lambda_0\mu_0 + \frac{1}{c}\sum_{i\in\psi} x_i + \sum_{i\notin\psi} x_i}{\lambda_0 + n - n_\psi + n_\psi/c},\ \frac{\sigma^2}{\lambda_0 + n - n_\psi + n_\psi/c}\right),$$
gral transform. Notice that the $Z_i$ are independent of each other given the other
parameters and the data. An analysis like the one just described is developed by
Verdinelli and Wasserman (1991).
Notice that in the last variation, the model no longer resembles a mixture of
models. In fact, it is just a more highly parameterized model with parameter
$(\Sigma, M, \Psi, C, \Lambda)$. Alternatively, the parameter could be taken as $(\Sigma, M, C, \Lambda)$ with
$Z$ being considered as missing data.
bounds on the posterior probabilities of such sets as the prior ranges over $\mathcal{C}_\epsilon$.
Theorem 8.57.¹⁹ Suppose that $X$ has conditional density $f_{X|\Theta}(x|\theta)$ with
respect to $\nu$ given $\Theta = \theta$. Let $\mathcal{C}$ be the set of all distributions on $(\Omega, \tau)$,
and let $\mathcal{C}_\epsilon$ be as in (8.56). For each $\pi \in \mathcal{C}_\epsilon$, let $\pi(\cdot|x)$ denote the posterior
distribution of $\Theta$ given $X = x$ calculated as if $\pi$ were the prior distribution.
Similarly, let $\mu_{\Theta|X}(\cdot|x)$ denote the posterior calculated as if $\mu_\Theta$ were the
prior. For each $C \in \tau$,
$$\inf_{\pi \in \mathcal{C}_\epsilon} \pi(C|x) =$$
which increases for $\theta < 1/x$ and decreases thereafter. So, we have
$$\sup_{\theta \in C} f_{X|\Theta}(x|\theta) = \begin{cases} e^{-1}/x & \text{if } 1/x \le c(x), \\ c(x)\exp(-c(x)x) & \text{if } 1/x > c(x), \end{cases}$$
¹⁹This theorem appears in Berger (1985).
$$\sup_{\theta \in C^c} f_{X|\Theta}(x|\theta) = \begin{cases} e^{-1}/x & \text{if } 1/x \ge c(x), \\ c(x)\exp(-c(x)x) & \text{if } 1/x < c(x). \end{cases}$$
The value of $\mu_{\Theta|X}(C|x) = \gamma$ by design. The bounds given by Theorem 8.57 for
the $\epsilon$-contamination class using $\mathcal{C}$ equal to all distributions are, for $1/x \ge c(x)$,
For example, with $\gamma = 0.5$, $a = b = 1$, and $\epsilon = 0.1$, we have $c(x) = 1.678/(1 +
x)$, which is greater than or equal to $1/x$ for $x \ge 1.474$. Figure 8.61 shows a
plot of the lower and upper bounds on the posterior probabilities of the interval
$[0, 1.678/(1 + x)]$ as a function of $x$. Notice how the degree of robustness depends
on the observed data. When $x$ is very small, the likelihood function is quite large
for large values of $\theta$ outside of the interval $(0, c(x))$, since $c(x)$ never gets bigger
than 1.678. A prior that assigned probability 1 to such a large $\theta$ value would be
consistent with a very small observed $x$ and would give low probability to every
subinterval of $[0, 1.678)$. If such priors seem unreasonable, then perhaps the class
$\mathcal{C}_\epsilon$ is too large.
FIGURE 8.61. Lower and Upper Bounds on the Posterior Probability of $[0, 1.678/(1+x)]$ as a Function of $x$
Theorem 8.62. Let $\Gamma$ be a class of prior distributions on $(\Omega, \tau)$, and let $g :
\Omega \to \mathbb{R}$ be a measurable function. Suppose that $\inf_{\pi\in\Gamma} \int f_{X|\Theta}(x|\theta)\,d\pi(\theta) >
0$. For each $\pi \in \Gamma$, define $s_\pi(\lambda) = \int f_{X|\Theta}(x|\theta)[g(\theta) - \lambda]\,d\pi(\theta)$, and let
$$s(\lambda) = \sup_{\pi\in\Gamma} s_\pi(\lambda).$$
Then for finite $\lambda$, the least upper bound on the posterior means of $g(\Theta)$ is
$\lambda$ if and only if $s(\lambda) = 0$.
PROOF. Let
$$\lambda_0 = \sup_{\pi\in\Gamma} \frac{\int f_{X|\Theta}(x|\theta) g(\theta)\,d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)},$$
and assume that $\lambda_0$ is finite. For the "if" direction, suppose that $s(\lambda) = 0$.
We need to prove that $\lambda = \lambda_0$. Since $s(\lambda) = 0$, we know that $s_\pi(\lambda) \le 0$
for all $\pi \in \Gamma$ and that there exists a sequence $\{\pi_n\}_{n=1}^{\infty}$ of elements of $\Gamma$
such that, for each $n$, $s_{\pi_n}(\lambda) > -1/n$. This last claim can be written as
$\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta) > \lambda \int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta) - 1/n$; since the denominators
are bounded below by $\inf_{\pi\in\Gamma}\int f_{X|\Theta}(x|\theta)\,d\pi(\theta) > 0$, letting $n \to \infty$
shows that $\lambda_0 \ge \lambda$. Similarly, $s_\pi(\lambda) \le 0$ implies
$$\frac{\int f_{X|\Theta}(x|\theta) g(\theta)\,d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)} \le \lambda.$$
Since this is true for all $\pi \in \Gamma$, it follows that $\lambda_0 \le \lambda$, and we conclude
$\lambda_0 = \lambda$.
For the "only if" part, we must show that $s(\lambda_0) = 0$. From the fact that
$\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi(\theta) \le \lambda_0 \int f_{X|\Theta}(x|\theta)\,d\pi(\theta)$ for all $\pi \in \Gamma$, it easily
follows that $s(\lambda_0) \le 0$. Suppose that $s(\lambda_0) = -\epsilon_0$ for some $\epsilon_0 > 0$. We will
derive a contradiction. We know that there exists a sequence $\{\pi_n\}_{n=1}^{\infty}$ of
elements of $\Gamma$ such that, for each $n$,
$$\frac{(1 - \epsilon)\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) + \epsilon f_{X|\Theta}(x|\theta_0) g(\theta_0)}{(1 - \epsilon)\int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) + \epsilon f_{X|\Theta}(x|\theta_0)}.$$
One can usually find the supremum and infimum of this expression as
a function of $\theta_0$ using standard numerical methods. The two integrals,
$\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)$ and $\int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)$, are constants in these
numerical problems.
Example 8.64 (Continuation of Example 8.60; see page 525). We have $X$ with
conditional density $f_{X|\Theta}(x|\theta) = \theta e^{-\theta x}$ given $\Theta = \theta$ and $\mu_\Theta = \Gamma(a, b)$, so that
$$\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) = \frac{a(a+1)b^a}{(b+x)^{a+2}}, \qquad
\int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) = \frac{a b^a}{(b+x)^{a+1}}.$$
If we let $a = b = 1$ and $\epsilon = 0.1$ as before, we can find the extremes of this ratio
as a function of $\theta_0$ for every possible $x$. Figure 8.65 shows the lower and upper
bounds on $\mathrm{E}(\Theta|X = x)$ for $x$ between 0.1 and 10. As $x \to 0$, the upper bound
goes to $\infty$ because $x$ close to 0 is most consistent with very large values for $\Theta$.
The bounds get very close together and small as $x \to \infty$ because large $x$ values
are most consistent with very small values of $\Theta$.
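The optimization over $\theta_0$ can be sketched numerically. The sketch below evaluates the contamination-posterior-mean ratio on a grid of $\theta_0$ values, using the closed-form integrals from Example 8.64; the grid limits and resolution are assumptions made for illustration.

```python
import numpy as np

# Epsilon-contamination bounds on E(Theta | X = x): the posterior mean under
# prior (1 - eps) * mu + eps * delta_{theta0} is the displayed ratio, and the
# bounds are its extremes over theta0. We take f(x|theta) = theta*exp(-theta*x)
# with mu = Gamma(a, b), matching the closed forms in Example 8.64.
a, b, eps = 1.0, 1.0, 0.1

def bounds(x, grid=np.linspace(1e-4, 200.0, 400_000)):
    num0 = a * (a + 1) * b**a / (b + x) ** (a + 2)   # int g f dmu
    den0 = a * b**a / (b + x) ** (a + 1)             # int f dmu
    like = grid * np.exp(-grid * x)                  # f(x | theta0)
    post_mean = ((1 - eps) * num0 + eps * like * grid) / \
                ((1 - eps) * den0 + eps * like)
    return post_mean.min(), post_mean.max()

lo, hi = bounds(1.0)
print(lo, hi)
```

At $x = 1$ the uncontaminated posterior mean is exactly 1, and the grid search brackets it from both sides, illustrating how the contamination point $\theta_0$ can pull the posterior mean in either direction.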
In addition to sensitivity analysis, one can try to find prior distributions
such that resulting inferences exhibit some degree of robustness to changes
in the prior. Consider the case in which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ given $\Theta =
(\mu, \sigma)$. The natural conjugate prior is one of the form $M \sim N(\mu_0, \sigma^2/\lambda_0)$
given $\Sigma = \sigma$ and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$. Such priors have the property that
the posterior mean of $M$ is $\mu_1 = (n\bar x + \lambda_0\mu_0)/(n + \lambda_0)$, which has a component
$\lambda_0\mu_0/(n + \lambda_0)$ that remains the same no matter what the data values
FIGURE 8.65. Lower and Upper Bounds on $\mathrm{E}(\Theta|X = x)$
are. It is sometimes desirable to have the influence of the prior become less
pronounced as the data move away from what would be predicted by the
prior. Alternatives to natural conjugate priors, which are less influential,
are ones in which $M$ is independent of $\Sigma^2$ with $t_{c_0}(\mu_0, \tau_0^2)$ distribution. We
will assume that $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ in this prior also. At first, it may
seem difficult to work with such a prior because the posterior cannot be
written in closed form. However, we can use the following trick to make the
problem more tractable: Invent a random variable $Y$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$
distribution independent of $\Sigma$, and pretend as if $M \sim N(\mu_0, Y)$ given $Y$.
$$\mu_1(y, \sigma) = \frac{\mu_0/y + n\bar x/\sigma^2}{1/y + n/\sigma^2}, \qquad \tau_1(y, \sigma) = \left(\frac{1}{y} + \frac{n}{\sigma^2}\right)^{-1}.$$
$$b_1(\mu, y) = b_0 + \sum_{i=1}^{n}(x_i - \bar x)^2 + n(\bar x - \mu)^2,$$
In fact, since the prior density for M will tend to be very flat relative to
the likelihood function with even a small amount of data, this prior is a lot
like using an improper prior such as Lebesgue measure.
Of course, Bayesians can be interested in the same aspects of robustness
in which classical statisticians are interested, namely robustness with re-
spect to unexpected observations. In the classical framework, we introduced
M-estimators (see Section 5.1.5) to be less sensitive to extreme observations.
In the Bayesian framework, this would correspond to using alternative
conditional distributions for the data given $\Theta$ to reflect the opinion
that occasional extreme observations might arise. Consider the case in
which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$, most of the time but in
which an observation with higher variance is occasionally observed. Alternatively,
suppose that each observation $X_i$ comes with its own variance $\Sigma_i^2$
and that the $\Sigma_i^2$ are exchangeable. If the conditional distribution of $\Sigma_i^2$ were
$\Gamma^{-1}(a_0/2, a_0\tau^2/2)$ given $T = \tau$, this would be equivalent to saying that the $X_i$
had $t_{a_0}(\mu, \tau^2)$ distribution given $M = \mu$ and $T = \tau$. That is, we would have
changed the likelihood from normal to $t_{a_0}$. Once again, it may seem difficult
to work with such a likelihood because the posterior cannot be written
in closed form. However, we can again use SSS to make the problem more
tractable. Suppose that we model $M$ as $N(\mu_0, Y)$ given $Y, \Sigma_1, \ldots, \Sigma_n, T$,
and $Y$ as independent of the $\Sigma_i$ and $T$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$ distribution,
and $T \sim \Gamma(d_0/2, f_0/2)$. The conditional posterior distributions needed are
$$M \sim N\left(\tau_1(\sigma_1, \ldots, \sigma_n, y)\left(\frac{\mu_0}{y} + \sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right),\ \tau_1(\sigma_1, \ldots, \sigma_n, y)\right), \qquad T \sim \cdots,$$
where
$$\tau_1(\sigma_1, \ldots, \sigma_n, y) = \left(\frac{1}{y} + \sum_{i=1}^{n}\frac{1}{\sigma_i^2}\right)^{-1}.$$
density of the data given the parameters has thinner tails than the prior density
of M. This is because the degrees of freedom is 5 for the t distribution of the data
given the parameters, but the degrees of freedom is only 1 for the t distribution
of M. This will allow the posterior to resemble the likelihood to a large extent.
Consider the following 10 observations:
1.66, 1.07, 0.640, 0.310, 0.295, −0.070, −0.107, −1.67, −1.90, −1.97.
The posterior mean of $M$ is −0.077, and the posterior standard deviation of $M$ is
0.4202. (The sample average is −0.173, and the sample standard deviation over
$\sqrt{10}$ is 0.4017.) We could now consider what changes if one of the observations
moves off to $\infty$. For example, suppose that we take the smallest observation,
−1.97, and let it move to 0 and then to $+\infty$. Figure 8.67 shows a plot of the
posterior mean of M as a function of the moving observation. Notice that the
mean of M increases almost linearly with the moving observation for some time
and then begins to decrease again. The decrease is due to the moving observation's
having reached a level beyond which it is more likely to be coming from the tail
of the distribution than from a large value of M.
The other curves in Figure 8.67 correspond to models with different degrees of
freedom. All four of the cases with $a_0, c_0 \in \{1, 5\}$ are illustrated. The value of $c_0$
does not have nearly as much influence as the value of $a_0$. When $a_0$ changes to 1
(with $c_0 = 1$ and the original data), the posterior mean and standard deviation
of $M$ become 0.1258 and 0.1388, respectively. (The MLE of $M$ would be 0.2064,
but the likelihood is not very peaked.) In this case, as one observation changes,
the posterior mean of M is affected most by the average of those observations in
the middle of the data set. When the moving observation enters the middle of
the data set, the posterior mean of M varies linearly with the observation. But
when it moves out of the middle, the posterior mean moves back down again.
FIGURE 8.67. Posterior Mean of M as a Function of One Observation
TABLE 8.68. Relative Prior Predictive Densities

               x_10
a_0        −1.970    0.531    3.030    8.030
1           0.252    0.918    0.367    1.501
2           0.553    1.048    0.663    1.412
5           1.000    1.000    1.000    1.000
10          1.187    1.001    1.091    0.945
20          1.283    1.003    1.110    0.623
60          1.348    1.003    1.105    0.221
∞           1.381    1.003    1.097    0.096
average     1.001    0.997    0.911    0.828
predictive density will surface as the ones that contribute most to the pos-
terior predictive distribution of future data.
Example 8.69 (Continuation of Example 8.66; see page 530). Suppose that we
are not sure which degrees of freedom to use for the conditional distribution of $X_i$
given $(M, T)$. We might try a mixture of models with model $i$ having $a_0 = i$ for
$i$ in some set, where $a_0 = \infty$ means that the conditional distribution is $N(M, T^2)$
rather than $t_{a_0}(M, T^2)$. Table 8.68 lists the relative values of the prior predictive
densities of the data for a few values of $a_0$ with $a_0 = 5$ taken as 1.0. Four
different data sets are used; they differ only in the value of the last observation,
which is listed in the column heading. The $t_5$ distribution is relatively robust
as one observation increases, and the equal mixture of the seven models (the
row labeled "average") has predictive density remarkably close to that of the $t_5$
model. One might argue that the equal mixture of the seven other models is not
itself sensible. Putting 3/7 of the probability on $\{20, 60, \infty\}$ degrees of freedom
is saying that one is somewhat confident that the data will be approximately
normal. On the other hand, putting 3/7 of the probability on $\{1, 2, 5\}$ degrees of
freedom is saying that one is equally confident that the data will likely have an
occasional "outlier."
8.7 Problems
Section 8.1:
Section 8.2:
Section 8.5:
11. Suppose that $X \sim \Gamma^{-1}(a, b)$ and $Y \sim \Gamma^{-1}(c, d)$ are independent. Let $Z =
X/Y$. Prove that the conditional distribution of $X$ given $Z = z$ is $\Gamma^{-1}(a +
c, b + dz)$.
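A Monte Carlo sketch can make the claim in Problem 11 concrete. Writing $U = 1/X \sim \Gamma(a, b)$ and $V = 1/Y \sim \Gamma(c, d)$ gives $Z = V/U$, so the claim is equivalent to $U \mid Z = z \sim \Gamma(a + c, b + dz)$; the sketch below checks the conditional mean by binning draws with $Z$ near $z$. The parameter values are arbitrary test inputs.

```python
import numpy as np

# Monte Carlo check of Problem 11: X ~ InvGamma(a, b), Y ~ InvGamma(c, d)
# independent, Z = X/Y; then X | Z = z ~ InvGamma(a + c, b + d z),
# equivalently U = 1/X satisfies U | Z = z ~ Gamma(a + c, rate b + d z).
rng = np.random.default_rng(3)
a, b, c, d, z = 2.0, 3.0, 2.0, 1.0, 1.0

U = rng.gamma(a, 1.0 / b, size=2_000_000)   # shape a, rate b
V = rng.gamma(c, 1.0 / d, size=2_000_000)   # shape c, rate d
Z = V / U                                   # equals X / Y

near = np.abs(Z - z) < 0.02                 # crude conditioning on Z ~ z
est = U[near].mean()                        # approximates E(U | Z = z)
exact = (a + c) / (b + d * z)               # mean of Gamma(a+c, rate b+dz)
print(est, exact)
```

The binned average agrees with the closed-form conditional mean to well within Monte Carlo error, which is consistent with (though of course no substitute for) the requested proof.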
12. Using the notation of Problem 9 above, suppose that $\Theta$ has a prior distribution,
which is $\Gamma(a, b)$, with $a$ and $b$ known constants.
(a) Find the posterior density of $\Theta$ except for the normalizing constant.
(b) Set up a successive substitution sampling scheme to generate a sample
of $\Theta$ and $M_1, \ldots, M_n$ from the joint posterior distribution.
(c) Write a formula for an approximation to the posterior density of $\Theta$
and of $M_1, \ldots, M_n$ based on the successive substitution sample.
8.7. Problems 535
(d) Describe the similarities and differences between the above approxi-
mations to the joint density for the M j and the approximation found
via the empirical Bayes approach.
13. Let $v$ and $z$ be $k$-dimensional vectors with $v_i > 0$ for $i = 1, \ldots, k$. Suppose
that $X \sim N_k(\mathrm{diag}(v)z, \mathrm{diag}(v))$, where $\mathrm{diag}(v)$ is a diagonal matrix with
$(i, i)$ element equal to $v_i$. Prove that the conditional distribution of $X$ given
$\mathbf{1}^T X = c$ is
$$N_k\left(\left[\mathrm{diag}(v) - \frac{1}{\mathbf{1}^T v} v v^T\right] z + \frac{c}{\mathbf{1}^T v} v,\ \mathrm{diag}(v) - \frac{1}{\mathbf{1}^T v} v v^T\right).$$
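The conditional distribution in Problem 13 can be checked by direct multivariate-normal conditioning on the linear functional $S = \mathbf{1}^T X$, using $\mathrm{Cov}(X, S) = v$ and $\mathrm{Var}(S) = \mathbf{1}^T v$. The vectors $v$, $z$ and the value $c$ below are arbitrary test inputs.

```python
import numpy as np

# Numerical check of the Problem 13 conditional distribution: condition
# X ~ N_k(diag(v) z, diag(v)) on S = 1^T X = c via the standard Gaussian
# conditioning formulas.
v = np.array([1.0, 2.0, 0.5, 1.5])
z = np.array([0.3, -0.1, 0.2, 0.4])
c = 2.0
k = len(v)

m = v * z                      # mean of X, i.e., diag(v) z
one = np.ones(k)

# Cov(X, S) = diag(v) @ 1 = v and Var(S) = 1^T v, so:
cond_mean = m + v * (c - one @ m) / v.sum()
cond_cov = np.diag(v) - np.outer(v, v) / v.sum()

# The conditional covariance is diag(v) - v v^T / (1^T v), it annihilates
# the vector of ones (rank k - 1), and the conditional mean satisfies the
# constraint 1^T E[X | S = c] = c.
print(cond_mean, np.linalg.matrix_rank(cond_cov))
```

This is the same singular covariance matrix that appears in the SSS implementation for the constrained two-way model earlier in the section, which is why Problem 13 is cited there.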
14. *Prove that condition (8.41) holds if $Y \sim N_k(\mu, \Sigma)$ and SSS is applied to
the coordinates in their natural order. (Hint: First, prove that the conditional
distribution of the next iterate $X$ given the current iterate $X'$ is
multivariate normal with constant covariance matrix and mean that is a
linear function of $X'$. You can now integrate over $x'$ analytically using facts
from the theory of the multivariate normal distribution. The integral over
$x$ becomes the integral of the ratio of two normal densities. You can use
Problems 7 and 8 in this chapter to show that this integral is a constant
times the integral of a normal density.)
Section 8.6:
15. The SSS algorithm allows one to approximate posterior distributions without
calculating the marginal density of the data $f_X(x)$. When fitting mixture
models, it is important to compute $f_{X|\Psi}(x|\psi)$, where $\Psi$ is the parameter
indexing models. Consider a single model with parameter $\Theta =
(\Theta_1, \ldots, \Theta_p)$. Each iteration of SSS starts with a simulated vector $\Theta^{(i)} =
(\Theta_1^{(i)}, \ldots, \Theta_p^{(i)})$ and then simulates $\Theta^{(i+1)}$ one coordinate at a time using
the conditional posterior distribution of each coordinate given the others:
$$f_{\Theta_j|\Theta_{\setminus j}, X}\left(\theta_j \,\middle|\, \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_p^{(i)}, x\right). \qquad (8.71)$$
Call the expression in (8.71) $V_j^{(i+1)}$. (Note that $V_1^{(i+1)}$ and $V_p^{(i+1)}$ have
slightly different formulas due to the effect of being at the ends of the
vector.) Prove that
where the $\mathrm{E}(\cdot)$ refers to the joint distribution of $(\Theta^{(i)}, \Theta^{(i+1)})$ in the simulation.
16. Suppose that one wishes to fit a mixture of $k$ models, but that one is
required to use SSS to fit each model. Let $\Psi \in \{1, \ldots, k\}$ index the different
models, and let $\Theta^i$ be the parameters of model $i$, for $i = 1, \ldots, k$. Explain
how one could use Problem 15 above to help estimate $f_{\Psi|X}(\psi|x)$ for all
values of $\psi$.
CHAPTER 9
Sequential Analysis
Definition 9.2. Let $(S, \mathcal{A}, \mu)$ be a probability space, let $(V, \mathcal{V})$ be a measurable
space, and let $V : S \to V$ be a random quantity. For $i = 1, 2, \ldots$,
let $(\mathcal{X}_i, \mathcal{B}_i)$ be measurable spaces and let $(\mathcal{X}_0, \mathcal{B}_0) = (\{0\}, \{\emptyset, \mathcal{X}_0\})$ be a
trivial space. For $i = 0, 1, \ldots$, let $X_i : S \to \mathcal{X}_i$ be random quantities. Let
$\mathcal{X} = \prod_{i=0}^{\infty} \mathcal{X}_i$ with product $\sigma$-field $\mathcal{B}_\infty$. Let $\mathcal{B}^n$ be the sub-$\sigma$-field generated
by the first $n + 1$ coordinates (including coordinate 0). That is, $B \in \mathcal{B}^n$ if and only if
$B = C \times \prod_{i=n+1}^{\infty} \mathcal{X}_i$, where $C \in \mathcal{B}_0 \otimes \cdots \otimes \mathcal{B}_n$. Let $X = (X_0, X_1, X_2, \ldots)$.
A stopping time is a nonnegative, extended integer-valued function $N$ (i.e.,
$N \in \mathcal{N} = \{0, 1, \ldots\} \cup \{\infty\}$) defined on $\mathcal{X}$ such that, for every finite
$n$, $\{x : N(x) = n\}$ is measurable with respect to $\mathcal{B}^n$. Let the action
space be $\aleph = \aleph' \times \mathcal{N}$, and let the loss be $L : V \times \aleph \to \mathbb{R}$ such that
$L(v, (a, n)) = \sum_{i=0}^{n} c_i + L'(v, a)$, where $c_i \ge 0$ for all $i$ ($c_0 = 0$). Let $\mathfrak{A}$
be a $\sigma$-field of subsets of $\aleph'$, and let $\mathcal{P}_A$ be the collection of probability
measures on $(\aleph', \mathfrak{A})$. A randomized sequential decision rule is a pair
$\delta = (\delta^*, N)$, where $\delta^* : \mathcal{X} \to \mathcal{P}_A$ and $N$ is a stopping time. The function $\delta^*$
is called the terminal decision rule. A nonrandomized sequential decision
rule is a randomized sequential decision rule such that, for each $x \in \mathcal{X}$,
there is $\delta'(x) \in \aleph'$ such that for each $A \in \mathfrak{A}$, $\delta^*(x)(A) = 1$ if $\delta'(x) \in A$
and $\delta^*(x)(A) = 0$ if $\delta'(x) \notin A$. If $\delta$ is nonrandomized, then $\delta'$ is called the
terminal decision rule.
A convenient notation will be to let $X^n = (X_0, \ldots, X_n)$ for finite $n$
and $X^\infty = X$ if necessary. Also, $x^n = (x_0, x_1, \ldots, x_n)$ and $x^\infty = x$ for
$x \in \mathcal{X}$. Note that we have assumed that a stopping time might be infinite.
If there exists $x$ such that $N(x) = \infty$, then $\delta^*(x)$ must still be defined, even
though it is hardly a "terminal" decision. For convenience, we will often
write summations like $\sum_{n=1}^{\infty+}$ to indicate that one extra term for $n = \infty$
is to be included in the usual sum $\sum_{n=1}^{\infty}$. One way to prevent decision
rules from taking infinite samples with positive probability is to require
that $\sum_{i=1}^{\infty} c_i = \infty$ so that any rule that takes infinite samples with positive
probability must have infinite risk.
If we can restrict attention to decision rules $(\delta', N)$ such that $N \le n$,
then there is an intuitively simple method of finding the optimal sequential
decision rule. The idea is to decide what would be the optimal decision and
its risk after observing $x^n$, and then compare this to what the risk would be
if we only observed $x^{n-1}$. Whether $N(x) = n$ or $N(x) = n - 1$ is decided
based on which is smaller. We now know what the optimal procedure is
after observing $n - 1$ observations and we know its risk. Compare this to
what would be optimal if we stopped after $x^{n-2}$, and so on. This procedure
is called backward induction. Consider an illustration.
Example 9.3. Suppose that $\{X_n\}_{n=1}^{\infty}$ are conditionally IID $\mathrm{Ber}(\theta)$ given $\Theta = \theta$
and $\Theta$ has $U(0, 1)$ distribution. Suppose that we can take at most four observations.
The action space has $\aleph' = \{0, 1\}$, and the loss function is $L(\theta, (a, n)) =
0.01n + L'(\theta, a)$, where $L'(\theta, a)$ equals 0 if the terminal decision $a$ is correct for
testing $H : \Theta \le 0.4$ (with $a = 1$ meaning reject $H$) and equals 1 otherwise.
538 Chapter 9. Sequential Analysis
y_3                      0          1          2          3
Posterior                Beta(1,4)  Beta(2,3)  Beta(3,2)  Beta(4,1)
Pr(Θ ≤ 0.4|X = x)        0.8704     0.5248     0.1792     0.0256
a                        0          0          1          1
Risk(stop)               0.1596     0.5052     0.2092     0.0556
Pr(X_4 = 1)              0.2        0.4        0.6        0.8
Risk(continue)           0.1696     0.3692     0.2192     0.0656
Stop                     yes        no         yes        yes
Risk                     0.1596     0.3692     0.2092     0.0556
So, only if $Y_3 = 1$ would we continue to observe $X_4$. Next, suppose that we only
observe $X^2 = x^2$, and let $Y_2 = X_1 + X_2$.
y_2                      0          1          2
Posterior                Beta(1,3)  Beta(2,2)  Beta(3,1)
Pr(Θ ≤ 0.4|X = x)        0.7840     0.3520     0.0640
a                        0          1          1
Risk(stop)               0.2360     0.3720     0.0840
Pr(X_3 = 1)              0.25       0.50       0.75
Risk(continue)           0.2120     0.2892     0.0940
Stop                     no         no         yes
Risk                     0.2120     0.2892     0.0840
We would continue if $Y_2 \in \{0, 1\}$. Next, suppose that we only observe $X_1 = x_1$.
9.1. Sequential Decision Problems 539
y_1                      0          1
Posterior                Beta(1,2)  Beta(2,1)
Pr(Θ ≤ 0.4|X = x)        0.6400     0.1600
a                        0          1
Risk(stop)               0.3700     0.1700
Pr(X_2 = 1)              1/3        2/3
Risk(continue)           0.2377     0.1524
Stop                     no         no
Risk                     0.2377     0.1524
If we take one observation, we will take two. Finally, before we take any observations,
$\Pr(\Theta \le 0.4) = 0.4$, so the terminal decision would be $a = 1$ and the risk
would be 0.4. On the other hand, $\Pr(X_1 = 1) = 0.5$, so the risk of continuing is
$0.5 \times 0.1524 + 0.5 \times 0.2377 = 0.1951$. Hence, we should take the first observation.
To summarize, the optimal procedure is

Data  (1,1,·,·)  (0,0,0,·)  (1,0,1,·)  (0,1,1,·)  (1,0,0,0)
N     2          3          3          3          4
a     1          0          1          1          0

Data  (0,1,0,0)  (0,0,1,0)  (1,0,0,1)  (0,1,0,1)  (0,0,1,1)
N     4          4          4          4          4
a     0          0          1          1          1

where the dots stand for observations that do not need to be taken.
To compare with other procedures, there is the fixed sample size procedure
with $n = 4$, which has risk 0.2239. This risk is the average of the five possible
risks after four observations because each of the five possibilities has probability
1/5. This procedure rejects $H$ if $Y_4 \in \{2, 3, 4\}$. The optimal procedure that
takes at most three observations has risk 0.2232 (see Problem 1 on page 567).
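The backward induction of Example 9.3 can be reproduced in a few lines. The per-observation cost (0.01) and the 0-1 terminal loss for testing $H : \Theta \le 0.4$ used below are assumptions inferred from the printed risks, since the excerpt does not display the loss function explicitly.

```python
from math import comb

# Backward induction for at most four Bernoulli observations with a U(0,1)
# prior. Cost per observation (0.01) and the 0-1 terminal loss are inferred
# assumptions, not stated explicitly in this excerpt.
COST = 0.01
CUT = 0.4
MAX_N = 4

def beta_cdf(a, b, x):
    """P(T <= x) for T ~ Beta(a, b), integer a, b (binomial-tail identity)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def risk_stop(n, s):
    """Sampling cost so far plus probability of a wrong terminal decision."""
    p = beta_cdf(1 + s, 1 + n - s, CUT)   # posterior Pr(Theta <= 0.4)
    return COST * n + min(p, 1 - p)

def value(n, s):
    """Minimal risk after s successes in n trials, by backward induction."""
    if n == MAX_N:
        return risk_stop(n, s)
    q = (1 + s) / (n + 2)                 # predictive Pr(X_{n+1} = 1)
    cont = q * value(n + 1, s + 1) + (1 - q) * value(n + 1, s)
    return min(risk_stop(n, s), cont)

print(value(0, 0))   # overall risk of the optimal procedure, about 0.1951
```

Under these assumptions the recursion reproduces the risks tabulated in the example, e.g. 0.1596 after observing $y_3 = 0$ and 0.2377 after $y_1 = 0$.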
$$\rho_0(Q) = \min_{a \in \aleph'} \int_{V} L'(v, a)\,dQ(v),$$
the minimum risk possible without taking any observations if the prior is
$Q$. If $Q$ denotes a prior distribution, then for each $n$, let $Q_n(\cdot|x)$ denote
the conditional distribution obtained from $Q$ by conditioning on $X_0 =
x_0, \ldots, X_n = x_n$. (For $n = \infty$, $Q_\infty(\cdot|x)$ denotes conditional probability
given $X = x$.) In particular, $Q_0(\cdot|x) = Q$. If $N$ is a stopping time, then
²It may be that $Q$ is already the conditional distribution obtained from some
other probability $P$ after conditioning on some observations.
In words, a decision rule is regular if, whenever the stopping time has not
yet occurred, the risk of stopping is larger than the risk of continuing. The
rule in Example 9.3 on page 537 is regular, as is every backward induction
rule. (See Problem 4 on page 568.)
Theorem 9.7. If $\delta$ decides optimally after stopping, then there is a regular
$\delta_1$ such that $\rho(Q, \delta_1) \le \rho(Q, \delta)$.
PROOF. Define $\delta_1$ as follows. The terminal decision rule for $\delta_1$ is to decide
optimally after stopping (just like $\delta$). The stopping time $N_1$ for $\delta_1$ is the
smaller of $N$ (the stopping time for $\delta$) and the first time at which (9.6)
fails. Clearly, this is finite and is a stopping time since both sides of (9.6)
are $\mathcal{B}^n$ measurable. If $\delta$ is regular, then (9.6) never fails and $\delta_1 = \delta$. Next,
9.1. Sequential Decision Problems 541
note that both sides of (9.6) are equal for each x such that N(x) = n. So, we can compute ρ(Q, δ1) as

ρ(Q, δ1) = Σ_{n=0}^∞ E[ρ(Q, δ)|N1 = n] Pr(N1 = n) = ρ(Q, δ).    □
Regular decision rules do not sample too many observations, but they
may not sample enough. That is, whenever a regular decision rule contin-
ues sampling, the risk for continuing is smaller than the risk for stopping.
However, when a regular decision rule stops, the risk for continuing may
still be smaller than the risk for stopping. For example, the optimal rule
from a class of rules whose stopping times are all bounded by the same n
is regular.
Proposition 9.8. The optimal rule from the class of sequential decision
rules that sample no more than n observations is regular.
Theorem 9.10. Let Q be a prior on Ω. If δ1, ..., δk are regular with finite risk, then δ0 = max{δ1, ..., δk} is regular and ρ(Q, δ0) ≤ ρ(Q, δi), for i = 1, ..., k.
PROOF. We need only prove this for k = 2 because the general case follows easily by induction. It is clear that
The first equality is true because δ0 and δ2 agree for all x such that N2(x) = N0(x). The inequality follows since δ2 is regular and N2(x) > n. The last equality follows since N1(x) = n. Next, suppose that N1(x) = N0(x) and n = N2(x) ≤ N0(x):
The reasons for each line are the same as before except that the inequality is only strict if N1(x) > n. (Note that (9.12) holds even if n = ∞.) Together, (9.11) and (9.12) show that δ0 satisfies (9.6). In both of (9.11) and (9.12), n = min{N1(x), N2(x)}. Write

C^j = ∪_{n=0}^∞ C^j_n,

for j = 0, 1, 2. Together, (9.11) and (9.12) say that the integrand in the second line of (9.13) for j = 0 is no greater than for either j = 1 or j = 2. The inequalities in the conclusion to the theorem follow.    □
If r = inf_δ ρ(Q, δ), then there is a sequence {δi}_{i=1}^∞ such that r = lim_{i→∞} ρ(Q, δi). Finding such a sequence is not as difficult as it may seem.
Definition 9.14. Let δ = (δ*, N) be a regular sequential decision rule. Let N' be a stopping time. The truncation of δ at N' is the decision rule with stopping time min{N, N'} and terminal decision optimal after stopping.
Lemma 9.15.3 Let δ0 be the optimal rule in a sequential decision problem, and suppose that δ0 has finite risk. For each n = 1, 2, ..., let δn be the truncation of δ0 at n. Then

{x : Nn(x) = n} = {x : N0(x) = n} ∪ {x : N0(x) > n}.

So rk(δn) = rk(δ0) for k = 1, ..., n − 1 and

rn(δn) = rn(δ0) + ρn + Pr(N0 > n) Σ_{i=1}^n ci.
Lemma 9.16.4 Suppose that L' ≥ 0 and lim_{n→∞} E[ρ0(Qn(·|X))] = 0. Then lim_{n→∞} ρn = 0, where ρn is defined in Lemma 9.15.
The prior mean of this is b0/[(a0 − 2)(λ0 + n)], which goes to 0. It follows that the risk of the optimal procedure that takes at most n observations converges to the optimal risk. If a0 ≤ 2, then ρ0(Q) = ∞, and it pays to take one or two observations until an > 2. At this point, pretend that the problem starts over and use the above reasoning.
If we modify the problem to have loss L'((μ, σ), a, n) = cn + (μ − a)²/σ², then ρ0(Qn(·|x)) = 1/λn, which depends on the data only through n. Hence, it is easy to see that the optimal rule has N = n with probability 1, where n provides a minimum to cn + 1/(λ0 + n).
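Since the risk of the rule that always takes exactly n observations is cn + 1/(λ0 + n), the optimal fixed sample size can be found by a direct search. A sketch (the values of c and λ0 used below are illustrative):

```python
def optimal_fixed_n(c, lam0, n_max=100000):
    # Minimize c*n + 1/(lam0 + n) over n = 0, 1, ..., n_max.
    return min(range(n_max + 1), key=lambda n: c * n + 1.0 / (lam0 + n))
```

The continuous relaxation gives n ≈ 1/√c − λ0; for example, c = 0.01 and λ0 = 1 yield n = 9.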
Proposition 9.20. If there exists a finite n such that ρ0(Qn(·|x)) < c_{n+1} for all x, then the optimal procedure takes no more than n observations. The optimal procedure is a fixed sample size procedure if ρ0(Qn(·|x)) depends on the data only through n.
In general, it is quite difficult to specify the optimal sequential decision
procedure. The first part of Example 9.19 is one such case. To find or
approximate the optimal rule in general, we will suppose that the cost
of each observation is the same and that the available observations are
exchangeable. That is, assume that cn = c for all n and that {Xn}_{n=1}^∞ are exchangeable. If we let ρ*(Q) = inf_δ ρ(Q, δ) denote the risk of the optimal rule, then it is not difficult to see that
since the second term is just the mean of the optimal risk of continuing after the first observation, given the first observation. If this is smaller than the optimal risk for no data, then it is the optimal risk. Otherwise, the optimal decision is to take no data and ρ0(Q) is the optimal risk. Clearly, the optimal sequential decision rule is to stop sampling at N(x) = n, where n is the first time that ρ0(Qn(·|x)) = ρ*(Qn(·|x)). This prescription is only useful if we know ρ*. As we will demonstrate in the next theorem, we can approximate ρ* by using successive substitution (see Section 8.5).
Theorem 9.22. Let Q be a probability measure, and suppose that L' ≥ 0 and lim_{n→∞} E[ρ0(Qn(·|X))] = 0. Define

ρ_{n+1}(Q) = min{ρ0(Q), c + E[ρn(Q1(·|X1))]},

for n = 0, 1, .... Then lim_{n→∞} ρn(Q) = ρ*(Q), and ρn(Q) is the risk of the optimal rule among those that take at most n observations.
PROOF. Clearly, in light of Corollary 9.17, we need only prove that ρn is the optimal risk for rules that take at most n observations. We will use induction. We know that ρ0 is the optimal risk among rules that take no observations. Suppose that ρk is the optimal risk among rules that take at most k observations for some k ≥ 0. Then

c + E[ρk(Q1(·|X1))]    (9.23)

is the risk for taking at least one observation and then using the optimal rule that takes at most k more observations. The optimal rule that takes at most k + 1 observations must either take at least one observation or take no observations. Hence, the risk of the optimal rule that takes at most k + 1 observations is the smaller of (9.23) and ρ0(Q). That is,

ρ_{k+1}(Q) = min{ρ0(Q), c + E[ρk(Q1(·|X1))]}.    □
Theorem 9.22 can be applied to Qk(·|x) to produce the following corollary.
Corollary 9.24. Let Q be a probability measure, and suppose that L' ≥ 0 and lim_{n→∞} E[ρ0(Qn(·|X))] = 0. For each n and k, the conditional mean of the risk of the optimal rule among those that take at most n + k observations, given the first k observations and given that the optimal rule takes at least k observations, is ρn(Qk(·|X)) + ck.
Corollary 9.24 can be used to define an alternative decision rule.
Definition 9.25. The decision rule that continues to sample until the first n such that ρ0(Qn(·|x)) = ρk(Qn(·|x)) is called the k-step look-ahead rule.
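For a hypothetical two-point Bernoulli testing problem, the k-step look-ahead rule can be sketched directly: ρk is computed by the recursion of Theorem 9.22, and sampling stops at the first n with ρ0(γn) = ρk(γn). The densities Ber(0.3) and Ber(0.7), the cost, and the data below are illustrative assumptions:

```python
def rho(gamma, k, c=0.05, p0=0.3, p1=0.7):
    # rho_k(gamma): optimal risk among rules taking at most k more observations,
    # where gamma is the current posterior probability that Theta = p0.
    stop = min(gamma, 1 - gamma)
    if k == 0:
        return stop
    m1 = gamma * p0 + (1 - gamma) * p1       # predictive Pr(next X = 1)
    g1 = gamma * p0 / m1                     # posterior after observing 1
    g0 = gamma * (1 - p0) / (1 - m1)         # posterior after observing 0
    return min(stop, c + m1 * rho(g1, k - 1, c, p0, p1)
                       + (1 - m1) * rho(g0, k - 1, c, p0, p1))

def k_step_look_ahead(data, gamma, k=2, c=0.05, p0=0.3, p1=0.7):
    # Sample until the first n at which stopping is as good as looking k ahead.
    for n, x in enumerate(data):
        if rho(gamma, k, c, p0, p1) >= min(gamma, 1 - gamma) - 1e-12:
            return n, gamma
        l0 = p0 if x == 1 else 1 - p0
        l1 = p1 if x == 1 else 1 - p1
        gamma = gamma * l0 / (gamma * l0 + (1 - gamma) * l1)
    return len(data), gamma
```

With a uniform prior and successive successes, the two-step look-ahead rule stops after two observations in this toy problem.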
Example 9.26 (Continuation of Example 9.19; see page 544). It is incredibly difficult to calculate ρn for n > 2. We illustrate here how to calculate ρn for n = 1, 2. The posterior distribution is determined by four hyperparameters (a, b, μ, λ), and

ρ0(a, b, μ, λ) = b/[λ(a − 2)].

After observing X1 = x, let the posterior hyperparameters be

(a, b, μ, λ)(x) = (a + 1, b + [λ/(λ + 1)](x − μ)², (λμ + x)/(λ + 1), λ + 1)
               = (a + 1, b(x), μ(x), λ + 1).

We can write b(X1) = (1 + Y²)b, where

Y = (X1 − μ) √(λ/[(λ + 1)b]) ~ t_a(0, 1/a).    (9.27)
So,

ρ0((a, b, μ, λ)(X1)) = b(X1)/[(λ + 1)(a − 1)].

It follows that E[ρ1((a, b, μ, λ)(X1))] equals

b/[(λ + 2)(a − 2)] + c(1 − p) + {b/[(λ + 1)(λ + 2)(a − 1)]} ∫_{−r}^{r} (1 + y²) fY(y) dy
   = b/[(λ + 2)(a − 2)] + c(1 − p) + {b/[(λ + 1)(λ + 2)(a − 1)]} ∫_{−r}^{r} [Γ((a + 1)/2)/(Γ(a/2)√π)] (1 + y²)^{−(a−1)/2} dy
   = b/[(λ + 2)(a − 2)] + c(1 − p) + {b/[(λ + 1)(λ + 2)(a − 2)]} q,

where p = Pr(|Y| ≤ r) and q = Pr(|Z| ≤ r), where Z ~ t_{a−2}(0, 1/[a − 2]).
We could now calculate ρ2(a, b, μ, λ) after each observation. If ρ0(a, b, μ, λ) is greater than ρ2, we should continue to sample. If ρ0(a, b, μ, λ) equals ρ2, the two-step look-ahead rule would stop. We could, however, try to achieve a better approximation to ρ*. One way to do this might be to numerically integrate ρ2((a, b, μ, λ)(x)) times the predictive density of the next observation in order to approximate ρ3(a, b, μ, λ).
Consider the results in Table 9.30. We used a prior with a0 = 3, b0 = 8, μ0 = 0, and λ0 = 1. The cost per observation was c = 0.1. After the fourth observation, we do not know whether or not ρ0 = ρ*. If we numerically integrate ρ2, we get ρ3 = ρ2. This means that we would have to consider at least four more observations before there was any chance that the optimal rule would continue sampling. But four more observations would cost 0.4 more, without taking into account the loss from squared error. Since the mean of ρ0 with four more observations is just 5/9 times the current ρ0, which equals 0.337076, it seems unlikely that four more observations would bring the risk down enough to justify continuing. In fact, the lowest possible posterior risk we could obtain from sampling four more observations would occur if all four of them were equal to the current posterior mean, and then the risk would be 0.587264, which is barely less than ρ0.
TABLE 9.30.

n   Xn          ρ0         ρ1
0               8.000000   2.866667
1   -0.129354   2.002092   1.201046
2   -2.158607   1.214599   0.928760
3    1.558454   0.935753   0.915571
4   -0.677818   0.606737   0.606737
This means that the optimal risk for continuing from this point is 0.064 and we should stop now.
Suppose that we observe X1 = X4 = 1 and X2 = X3 = 0. The optimal rule that takes at most four observations has to stop at this point, and the terminal risk is 0.3174. The posterior is Beta(3, 3). Treating this as the prior, we could calculate ρn and rn for many n. At n = 100, they are both 0.2274. This means that the optimal rule would continue sampling and that the optimal risk for continuing (not counting the cost of current observations) is 0.2274.
Ln(x) = ∏_{i=1}^n f1(xi) / ∏_{i=1}^n f0(xi),

which tells us how much more likely the data are under P1 than under P0.
The sequential probability ratio test [see Wald (1947)] SPRT(B, A) is, for each n, to reject H if Ln(x) ≥ A, accept H if Ln(x) ≤ B, and continue sampling if B < Ln(x) < A, where 0 < B < 1 < A. Another way to write this is to let Sn = Σ_{i=1}^n Zi and continue sampling so long as b < Sn < a. When we apply Theorem 9.33, we will let Zi = log[f1(Xi)/f0(Xi)], a = log A, and b = log B.
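On the log scale, the SPRT is a random walk Sn with increments Zi, absorbed at a = log A or b = log B. A minimal sketch; the Bernoulli densities and the boundaries used below are illustrative assumptions:

```python
import math

def sprt(data, f0, f1, A, B):
    # SPRT(B, A): continue sampling while B < L_n < A, i.e. while b < S_n < a
    # on the log scale, with Z_i = log f1(x_i) - log f0(x_i).
    a, b = math.log(A), math.log(B)
    s = 0.0
    for n, x in enumerate(data, start=1):
        s += math.log(f1(x)) - math.log(f0(x))
        if s >= a:
            return n, "reject H"
        if s <= b:
            return n, "accept H"
    return len(data), "continue sampling"

# Illustrative densities: f0 = Ber(0.25), f1 = Ber(0.75).
f0 = lambda x: 0.25 if x == 1 else 0.75
f1 = lambda x: 0.75 if x == 1 else 0.25
```

Each observation moves Sn by ±log 3 here, so with boundaries A = 8 and B = 1/8 two successes in a row already cross the upper boundary.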
α = Σ_{n=1}^∞ P0(N = n, Ln ≥ A)
  = Σ_{n=1}^∞ ∫_{{N=n, Ln≥A}} (1/Ln) ∏_{i=1}^n f1(xi) dν(x1) ⋯ dν(xn)
  ≤ (1/A) Σ_{n=1}^∞ ∫_{{N=n, Ln≥A}} ∏_{i=1}^n f1(xi) dν(x1) ⋯ dν(xn)
  = (1/A) P1(LN ≥ A) = (1/A)(1 − β).

A similar argument shows that β ≤ B(1 − α). If the boundaries are hit exactly (no overshoot), these inequalities become equalities, and solving them gives

α ≈ (1 − B)/(A − B),    β ≈ B(A − 1)/(A − B),

or, equivalently,

A ≈ (1 − β)/α,    B ≈ β/(1 − α).
Theorem 9.35. Let α* and β* be strictly between 0 and 1. The SPRT with A = (1 − β*)/α* and B = β*/(1 − α*) has operating characteristics α = P0(LN ≥ A) and β = P1(LN ≤ B), which satisfy α + β ≤ α* + β*.
PROOF. If α ≤ α* and β ≤ β*, the result is clearly true. So, suppose that either β > β* or α > α*. (We will see shortly that both inequalities cannot occur simultaneously.) If β > β*, then 1 − β < 1 − β* and

α ≤ (1/A)(1 − β) = [α*/(1 − β*)](1 − β) < α*.

Hence,

0 < β − β* ≤ β*[(1 − α)/(1 − α*) − 1] = β*(α* − α)/(1 − α*).

It follows that

α* + β* − α − β = (α* − α) + (β* − β)
                ≥ (α* − α) − β*(α* − α)/(1 − α*)
                = (α* − α)(1 − B) > 0.
9.2. The Sequential Probability Ratio Test 551
Zi = log[f1(Xi)/f0(Xi)] = { −log 3   if Xi = 0,
                             log 3   if Xi = 1.
There will be no overshoot of the boundaries if a = k1 log 3 and b = −k2 log 3 for k1 and k2 integers. Here are some examples:

k1  k2  α      β      A   B
1   1   0.25   0.25   3   0.3333
2   2   0.1    0.1    9   0.1111
2   1   0.077  0.308  9   0.3333
1   2   0.308  0.077  3   0.1111
3   3   0.036  0.036  27  0.0370
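With no overshoot, the relations derived earlier in this section hold with equality, so the α and β columns above can be reproduced from A = 3^{k1} and B = 3^{−k2}. A quick check:

```python
def oc_no_overshoot(A, B):
    # Exact error probabilities when the boundaries are hit with no overshoot:
    # alpha = (1 - B)/(A - B), beta = B(A - 1)/(A - B).
    return (1 - B) / (A - B), B * (A - 1) / (A - B)

for k1, k2 in [(1, 1), (2, 2), (2, 1), (1, 2), (3, 3)]:
    A, B = 3.0 ** k1, 3.0 ** (-k2)
    alpha, beta = oc_no_overshoot(A, B)
    print(k1, k2, round(alpha, 3), round(beta, 3))
```

For k1 = k2 = 2 this gives α = β = 0.1, matching the table.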
Suppose that we choose the level 0.1 test with k1 = k2 = 2. We could calculate the mean of N, the expected number of observations needed. It is clear that N is even and that N = 2k if and only if the first 2k − 2 observations come in pairs 0,1 or 1,0 and the last two are 1,1 or 0,0. So, for a sequence with x ones among n observations, the marginal probability is

0.5 × 0.25^x × 0.75^{n−x} + 0.5 × 0.75^x × 0.25^{n−x}.
Since I_{{n,n+1,...}}(N) = 1 − I_{{0,1,...,n−1}}(N) is a function of Z1, ..., Z_{n−1}, it is independent of Zn. Hence,

S_N ≈ { a   if we reject H,
        b   if we accept H.
h ≠ 0 such that

∫ [f1(x)/f0(x)]^h dP(x) = 1,
≈ A^{−h} · (B^{−h} − 1)/(B^{−h} − A^{−h}) = (1 − B^h)/(A^h − B^h).    □
Example 9.39 (Continuation of Example 9.36; see page 551). Suppose that {Xn}_{n=1}^∞ are IID Ber(0.6), but we are testing H : Θ = 0.25 versus A : Θ = 0.75. We have

f1(x)/f0(x) = { 1/3   if x = 0,
                3     if x = 1.

The solution to the equation in Lemma 9.38 is h = −0.36907, so

E(S_N) ≈ log(1/9) + log(81) · [1 − (1/9)^{−0.36907}]/[9^{−0.36907} − (1/9)^{−0.36907}] = 0.8451,

E(Zi) = 0.6 × log(3) − 0.4 × log(3) = 0.2197,

and E(N) ≈ 0.8451/0.2197 = 3.846.
Notice that the mean stopping time is longer when θ is between the hypothesis and the alternative.
If {Xn}_{n=1}^∞ are IID Ber(0.5), then h = 0 is the only value that works in the equation in Lemma 9.38. Hence, Lemma 9.38 has nothing to say about this case.
The following result has a proof similar to that of Wald's lemma, but applies to the case not yet handled in Example 9.36.
Proposition 9.40. Suppose that {Zn}_{n=1}^∞ are IID, with E(Zi) = 0 and E(Zi²) = σ². Suppose that N is a stopping time such that E(N) < ∞. Then E(S_N²) = σ²E(N).
Example 9.41 (Continuation of Example 9.36; see page 551). If {Xn}_{n=1}^∞ are IID Ber(0.5), then E(Zi) = 0 and E(Zi²) = (log 3)² = 1.2069. Also, E(S_N²) = 4(log 3)² = 4.8278. It follows that E(N) = 4. Of course, this example is simple enough that we could calculate E_θ(N) for all θ without any of these theorems. See Problem 9 on page 568.
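The value E(N) = 4 can also be verified from the distribution of N given in Example 9.36: for Ber(0.5) data, each successive pair of observations is 1,1 or 0,0 with probability 1/2, so Pr(N = 2k) = (1/2)^k. A sketch of the series computation:

```python
def mean_stopping_time(p_double=0.5, terms=200):
    # E(N) = sum over k >= 1 of 2k * Pr(N = 2k), where
    # Pr(N = 2k) = (1 - p_double)^(k - 1) * p_double.
    return sum(2 * k * (1 - p_double) ** (k - 1) * p_double
               for k in range(1, terms + 1))
```

mean_stopping_time() returns 4.0 up to series truncation, agreeing with E(S_N²)/σ² = 4.8278/1.2069 = 4.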
with loss wi if i ≠ j and 0 otherwise. Define

U(γ) = inf_δ ρ(γ, δ).
Since N{x) ? 1 for all x, it follows that U("'() > 0 for all "'I. Since p("'(,6) is
a positive linear function of "'I for each 6, it follows that U is the infimum of
a collection of positive linear functionsj hence, it is concave and continuous
on (O, 1) and positive at the two endpoints. Define
n
An(x) = II fi(Xj),
;=1
for i = 0,1 and n = 1,2, ... , where x = (X}'X2' ). The posterior proba-
bility of {8 = O} given Xl = XI. ... ,Xn = Xn is
The posterior mean of the loss to be incurred if N > n is at least U(γn(x)) + cn. Hence, the Bayes rule is to continue sampling so long as h(γn(x)) > U(γn(x)) and to stop at N(x) equal to the first n such that h(γn(x)) ≤ U(γn(x)). Note that h(γ) is continuous, has a graph shaped like a triangle, and satisfies h(0) = h(1) = 0. Since U is concave, it follows that h(γ) > U(γ) for γ in some interval (g1, g2). Figure 9.43 shows the U and h functions for a typical example.6 (If h(γ) ≤ U(γ) for all γ, define g1 = g2 to be the value of γ at which h is maximized.) Hence, the Bayes rule continues sampling so long as

B* = [γ/(1 − γ)] · [(1 − g2)/g2] < Ln(x) < [γ/(1 − γ)] · [(1 − g1)/g1] = A*;

it rejects H if L_N(x) ≥ A*; and it accepts H if L_N(x) ≤ B*. Therefore, the Bayes rule is SPRT(B*, A*).
The g1 and g2 found above depend on the particular decision problem only through w and c, where we assume that w1 = w and w0 = 1 − w. This is true because the functions h and U depend on the decision problem only

6The example in Figure 9.43 has f0 being the Ber(0.6) distribution and f1 being the Ber(0.8) distribution. Also, w0 = 0.4 and c = 0.02. In this example, g1 = 0.31 and g2 = 0.48.
through these values. So we call the two values g1(w, c) and g2(w, c). To finish the proof, we need only find c and w so that γi = gi(w, c) for both i = 1, 2. Define

β(w, c) = g2(w, c)/[1 − g2(w, c)],
λ(w, c) = g1(w, c)/[β(w, c)(1 − g1(w, c))].

It is easy to see that g1 and g2 are also functions of β and λ. Set

λ0 = [γ1(1 − γ2)]/[(1 − γ1)γ2],    β0 = γ2/(1 − γ2).
Then, we need to find w and c so that λ(w, c) = λ0 and β(w, c) = β0.
As c ↓ 0, U(0) and U(1) both approach 0; hence g1(w, c) tends to 0 and g2(w, c) tends to 1 as c ↓ 0 for fixed w. Hence, for fixed w, lim_{c↓0} λ(w, c) = 0. Since ρ(γ, δ) increases as c increases for every γ and δ, it follows that U(γ) increases as c increases for every γ. Hence g1(w, c) is an increasing function of c and g2(w, c) is decreasing in c for fixed w. It follows that λ is strictly increasing in c for fixed w. As c → ∞, eventually h(γ) ≤ U(γ) for all γ. Let c0(w) = inf{c : g1(w, c) = g2(w, c)}. Then lim_{c↑c0(w)} λ(w, c) = 1. Since 0 < λ0 < 1, there exists a unique c = c(w) such that λ(w, c) = λ0. As w approaches 0 for fixed c, U approaches the constant function c and h approaches the constant function 0 while the peak of the triangle in the graph of h moves toward γ = 0. Hence g1(w, c) and g2(w, c) approach 0.

lim_{w↓0} g2(w, c(w)) = 0,    lim_{w↓0} β(w, c(w)) = 0.
L'(fi(·), aj, n) = cn + { wi   if i ≠ j,
                          0    otherwise.

Since α0 ≥ α1, β0 ≥ β1, c > 0, w0 > 0, and w1 > 0, it follows that

γ(E0N0 − E0N) + (1 − γ)(E1N0 − E1N) ≤ 0.
or

L'(θ, (a, b)) = k(b − a)² + 1 − I_{[a,b]}(g(θ)),

or

L'(θ, (a, b)) = k(b − a) + { a − g(θ)   if g(θ) < a,
                             g(θ) − b   if g(θ) > b,
                             0          otherwise,

where k > 0. The first two loss functions above lead to intervals with equal posterior density at a and b. The third one leads to intervals with equal posterior probability below a and above b. Another alternative is to let ℵ = ℝ and set

L'(θ, a) = 1 − I_{[a−d, a+d]}(g(θ)),

where d > 0 is some fixed half-width for the interval. In this case, the interval has a fixed width, and the coverage probability is determined (along with the center of the interval) by the data.
Example 9.45. Suppose that {Xn}_{n=1}^∞ are IID N(μ, σ²) given Θ = (μ, σ). Let g(θ) = μ. Suppose that we want an interval of half-width d and the cost of each observation is c. Suppose that the prior is conjugate. The posterior distribution of M after n observations will be t_{a_n}(μn, bn/[an λn]), where the posterior hyperparameters an, bn, μn, and λn are as in Example 9.19 on page 544. The optimal decision upon stopping is μn. The terminal risk (not counting cost of observation) is
where T_{a_n} stands for the CDF of the t_{a_n}(0, 1) distribution. To implement the one-step look-ahead rule, we need to calculate
where

Y = (X_{n+1} − μn) √(λn/[bn(λn + 1)]) ~ t_{a_n}(0, 1/an).

Even this formula requires numerical integration to compute. For example, suppose that we use a prior with a0 = 2, b0 = 7, μ0 = 1, and λ0 = 1. The cost of each observation is 0.005 and the half-width of the interval is d = 1.96. Table 9.47 contains some data along with ρ0 and ρ1. The terminal decision is μ9 = −1.011531, and so the interval is [−2.9715, 0.9485]. The posterior probability in the interval is 0.9810. The probability is so high because the cost of each observation is so low. If the cost had been 0.01 instead, the one-step look-ahead rule would have stopped after seven observations with the interval [−3.0214, 0.8986] and posterior probability 0.9640.
TABLE 9.47.

n   Xn          ρ0          ρ1
0               0.4047363   0.2816774
1   -1.745003   0.2396304   0.1774288
2    0.793758   0.1178335   0.0881298
3   -4.385832   0.1475386   0.1190706
4   -4.708225   0.1268197   0.1053043
5   -0.363233   0.0797290   0.0675238
6    1.162189   0.0599002   0.0521162
7   -0.244931   0.0360019   0.0330767
8    1.455418   0.0265989   0.0258814
9   -3.079450   0.0189839   0.0189839
P0(X̄_{N1} − d ≤ μ ≤ X̄_{N1} + d) = Σ_{n=2}^∞ P0(X̄n − d ≤ μ ≤ X̄n + d, N1 = n).    (9.48)
First, recall that N1 is the first n ≥ 2 such that S²_n ≤ kn for some sequence {kn}_{n=1}^∞. We will prove that the event {N1 = n} is independent of X̄n. This will allow some simplification in (9.48). It is sufficient to show that S²_2, ..., S²_n are all independent of X̄n. For each k = 1, 2, ..., consider the k × k matrix Γk whose rows are all unit vectors and whose ith row (for i < k) is proportional to the vector with 1 in the first i places and −i in the (i + 1)st place. The kth row is proportional to the vector of all 1s. It is easy to see that these rows are orthogonal to each other, so Γk is orthogonal for each k. For i < n, define Wi to be the inner product of the ith row of Γn with (X1, ..., Xn).7 Note that the inner product of the last row of Γk and the vector Xk = (X1, ..., Xk) is √k X̄k. So, X̄n is independent of W1, ..., W_{n−1}. Also, since ||Γk Xk||² = Σ_{i=1}^k Xi², it follows that S²_k = Σ_{i=1}^{k−1} Wi² for k = 2, 3, .... Hence X̄n is independent of S²_2, ..., S²_n. This means that we can write

P0(X̄n − d ≤ μ ≤ X̄n + d | N1 = n) = P0(X̄n − d ≤ μ ≤ X̄n + d) = 1 − 2[1 − Φ(√n d/σ)].
Hence,
7Note that, for fixed i, Wi is the same no matter which n > i is chosen for its
definition.
9.3. Interval Estimation 561
the posterior distribution of a parameter given the data does not depend
on whether or not any more data will be taken. Hence, the Bayesian can
declare, after every observation, what is the posterior probability that the
parameter lies in any set he or she wishes. If, after n observations, there is
an interval of half-width d such that the posterior probability is 1 - a that
M is in that interval, the Bayesian can declare that and stop sampling.
There are some classical procedures that do actually have the desired coverage probability. We will present one such procedure here. It is a two-stage sampling procedure. One chooses an initial sample size n0 and estimates σ² by S²_{n0}. Then one collects a second sample whose size depends on S²_{n0}. Define

C = [d/T^{−1}_{n0−1}(1 − α/2)]²,    N2 = max{ n0, ⌊S²_{n0}/C⌋ + 1 },

where ⌊z⌋ denotes the largest integer less than or equal to z. Use the interval centered at X̄_{N2} with half-width d.
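The two-stage rule is mechanical once the quantile T^{−1}_{n0−1}(1 − α/2) is available. A sketch in which the quantile is supplied by the caller (e.g., from tables or scipy.stats.t.ppf):

```python
import math

def second_stage_size(s2, n0, d, t_quantile):
    # C = (d / t_quantile)^2, N2 = max{n0, floor(s2 / C) + 1}, where s2 is the
    # variance estimate from the first n0 observations and t_quantile is the
    # 1 - alpha/2 quantile of the t distribution with n0 - 1 degrees of freedom.
    C = (d / t_quantile) ** 2
    return max(n0, math.floor(s2 / C) + 1)
```

With n0 = 2, S² = 3.222653, d = 1.96, and T^{−1}_1(0.975) = 12.706, this gives N2 = 136, the first row of Table 9.51.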
Lemma 9.49. With the above notation, the conditional distribution of
(9.50)
TABLE 9.51. Classical Fixed-Width Confidence Interval Sample Sizes for Example 9.52

n0  S²_{n0}    C        N2
2   3.222653   0.0238   136
3   6.707906   0.2075   33
4   6.616990   0.3793   18
5   5.885602   0.4983   12
6   6.462292   0.5814   12
7   5.625235   0.6416   9
8   5.809566   0.6870   9
9   5.561759   0.7224   9
Example 9.52 (Continuation of Example 9.45; see page 558). Suppose that we
desire a classical sequential confidence interval with half-width 1.96 and coefficient
0.95. We will use the same sequence of data as we had earlier. To correspond more
closely with a classical analysis, we should change to an improper prior. In this
case, the interval based on the first nine observations is the first one to have
posterior probability greater than 0.95 and so this would be the naive sequential
interval. For various values of n0, we can implement the classical procedure; the results are summarized in Table 9.51. If n0 = 7, 8, or 9, we will get essentially the same result as the naive interval.
For a simple example, suppose that {Xn}_{n=1}^∞ are IID N(θ, 1) given Θ = θ. We will consider the problem of testing the hypothesis H : Θ = θ0 versus A : Θ ≠ θ0. Let X̄n be the average of the first n of the Xi. Given Θ = θ0, √n(X̄n − θ0) has a N(0, 1) distribution for every n. Hence, for every c,
It follows that
However, the event {lim sup_{n→∞} √n(X̄n − θ0) > c} is in the tail σ-field C, so the Kolmogorov Zero-One Law B.68 says
Hence, given Θ = θ0, for every c, the probability is 1 that there will exist n such that |√n(X̄n − θ0)| > c. Let
of view, so long as the stopping time is a function of the observed data, and
one conditions on the observed data, the terminal decision rule should be
whatever would be optimal if that data were observed from a fixed sample
size rule. If this is so, can a Bayesian be tricked into sampling to a foregone
conclusion? The answer is no, if the Bayesian uses a proper prior.8
To see why a Bayesian cannot sample to a foregone conclusion,9 suppose that N is a strictly positive,10 integer-valued random variable that might equal ∞. Suppose also that N is a function of observable data {Xn}_{n=1}^∞ in such a way that, for every finite n, I_{n}(N) is a function of X1, ..., Xn. Let

Bn = {(x1, ..., xn) : N = n}.

Let Z be a random variable of interest (perhaps the indicator of some subset of the parameter space or anything else) with finite mean c. Suppose that, for every n and every (x1, ..., xn) ∈ Bn, E(Z|X1 = x1, ..., Xn = xn) > d ≥ c. (A similar argument works for d ≤ c.) It follows from the law of total probability B.70 that
E(Z) = Σ_{n=1}^∞ E(E(Z|X1, ..., Xn)|N = n) Pr(N = n)
     + E(E(Z|X1, X2, ...)|N = ∞) Pr(N = ∞).    (9.53)
If we suppose that Pr(N = ∞) = 0, then the right-hand side of (9.53) is greater than d and the left-hand side equals c ≤ d, which is a contradiction. Hence, Pr(N = ∞) > 0. This means that, if a Bayesian does not believe a priori the conclusion (in this case, that the mean of Z is greater than c), then it cannot be guaranteed that he or she will believe it after sequential sampling.
A more useful result is available if Z = I_B(Y) for some random variable Y. In this case, we can calculate a bound on the conditional probability of stopping given Y ∉ B. First, note that E(Z) = Pr(Y ∈ B) = c, and rewrite (9.53) as

Pr(Y ∈ B) ≥ Σ_{n=1}^∞ E(E(Z|X1, ..., Xn)|N = n) Pr(N = n).

The right-hand side of this is greater than d Pr(N < ∞). It follows that
8The reason that a proper prior is needed is subtle. The argument given below depends on the law of total probability B.70. Kadane, Schervish, and Seidenfeld (1996) show that when improper priors are viewed as finitely additive probabilities, sampling to a foregone conclusion is possible because finitely additive probabilities do not satisfy the law of total probability.
9This argument is like one given by Kerridge (1963).
10We saw in Example 9.3 on page 537 that if Pr(N = 0) > 0, then Pr(N = 0) = 1.
9.4. The Relevance of Stopping Rules 565
Pr(N < ∞|Y ∉ B) = Pr(Y ∉ B|N < ∞) Pr(N < ∞)/Pr(Y ∉ B),

so that

Pr(N < ∞|Y ∉ B) < c(1 − d)/[d(1 − c)].    (9.54)
The claim of (9.54) was made without proof by Savage (1962) with reference to a specific example. Cornfield (1966) proves the result in another specific example. A similar claim, without an explicit bound, was made by Good (1956).
It should be noted that the proof that a Bayesian cannot sample to a foregone conclusion involves probabilities calculated under the joint distribution of all random quantities. For example, if Θ has a continuous distribution and Z = I_B(Θ) for some set B, it might be the case that for some θ ∈ B, Pθ(N < ∞) = 1 (see Problem 12 on page 569, for example), but the set of all such θ must have small prior probability.
This discussion is not meant to say that Bayesians can ignore all stop-
ping rules. All it says is that so long as the stopping rule is a function of
the observed data 11 and the Bayesian conditions on the observed data, no
further account need be taken of the stopping rule. An example is given
by Berger and Berry (1988) of a situation in which one or the other of the
two criteria is not met. We give a modified version here. Other examples
are given by Roberts (1967).
Example 9.55. Suppose that {Xn}_{n=1}^∞ are IID Ber(θ) conditional on Θ = θ, and Θ ∈ {0.49, 0.51}. The observations cannot be taken at will; rather, they arrive according to a Poisson process with rate λ(θ) conditional on Θ = θ, but the Xi are independent of the arrival times conditional on Θ. Suppose that λ(0.49) is one observation per second and λ(0.51) is one observation per hour. The stopping rule will be to sample for one minute and then stop when the next observation arrives.
Suppose that we observe N = 60. If we accidentally consider X1, ..., X60 to be the observed data and condition on it alone, we will be ignoring valuable information. Specifically, the time it took to observe the 60 values contains valuable information about Θ. Furthermore, the very fact that 60 observations were observed contains information about Θ, even if we don't know how long it took to
11We mean that, for each n, the event {N = n} must be measurable with respect to the σ-field generated by X1, ..., Xn. In a trivial sense, the stopping time is always a function of the observed data if the observed data are defined to include all and only those observations with subscripts up to and including N. If you know that the observed data are X1, ..., Xn, then you know that N = n. What is required is that, for each n, if you are merely told the values of X1, ..., Xn (but not N), you would be able to figure out whether or not N > n.
get them. Also, the criterion that the stopping rule be a function of the observed data would not be met in this case, since one cannot tell from looking at the first 60 Xi values that the experiment would stop at N = 60. One also needs to look at the clock (and possibly the calendar).
This is not to say that one could not make inference based solely on X = (X1, ..., XN) in this case. To obtain the density of X given Θ, we first introduce {Yn}_{n=1}^∞, the interarrival times of the Poisson process. Now, {N = n} is a function of Y1, ..., Yn, and we can write (assuming that λ(θ) has units of "observations per second") the conditional joint density of X and Y = (Y1, ..., YN) given Θ = θ as
where t = Σ_{i=1}^n yi, n is the observed value of N, and k = Σ_{i=1}^n xi. We can integrate Y out of this to obtain the conditional density of X given Θ.12 To integrate out Y for fixed n, transform from y1, ..., yn to t, y_{n−1}, ..., y1. The Jacobian is 1, and the ranges of integration are (with t innermost and y1 outermost)

t > 60,    0 < yi < 60 − Σ_{j=1}^{i−1} yj   for i = 1, ..., n − 1.

So the likelihood is
For example, suppose that the prior for Θ puts probability q on Θ = 0.51. Then the posterior probabilities of the two values of Θ are in the ratio:
That is, N is just one plus a Poi(60λ(θ)) random variable. So the likelihood function for observing X alone would be as given in (9.56).
9.5. Problems 567
|γ_{n+m}(Q) − γn(Q)| ≤ E(|γm(Q(·|X1, ..., Xn)) − γ0(Q(·|X1, ..., Xn))|).

(c) Prove that γn(Q) converges to some quantity γ*(Q) that satisfies

γ*(Q) = min{ρ0(Q), E(γ*(Q(·|X1))) + c}.    (9.57)

(d) Suppose that γ*1 and γ*2 both satisfy (9.57) for all probabilities Q. Show that
7. Stein (1946) proves that if Pr(f1(Xi) ≠ f0(Xi)) > 0, then there exist c and p < 1 such that Pr(N > n) ≤ cp^n for the SPRT. Define Zi = log[f1(Xi)/f0(Xi)].
(a) Prove that Var(Zi) > 0 implies Pr(f1(Xi) ≠ f0(Xi)) > 0.
(b) Find an example in which Pr(f1(Xi) ≠ f0(Xi)) > 0 but Var(Zi) = 0.
(c) Under the conditions of Theorem 9.33, prove that there is a subsequence {nk}_{k=1}^∞ and a c and p < 1 such that Pr(N > nk) ≤ cp^{nk}.
8. Prove Proposition 9.40 on page 554.
9. Suppose that {Xn}_{n=1}^∞ are IID Ber(θ) given Θ = θ. Let N be the first n such that |Σ_{i=1}^n Xi − n/2| ≥ 2. Prove that E_θ(N) = 2/[θ² + (1 − θ)²].
Section 9.3:
10. Let {(Xn, Bn)}_{n=1}^∞ be a sequence of sample spaces, and let fn, gn be two densities on (Xn, Bn) for every n. Let Xn : S → Xn be a random quantity for every n. Define Zn = gn(Xn)/fn(Xn). Let P be the probability that says that Xn has density fn for every n. Let k > 0 and N = inf{n : Zn ≥ k}. Prove that P(N < ∞) ≤ 1/k.
Section 9.4:
12. Suppose that {Xn}_{n=1}^∞ are conditionally IID N(θ, 1) given Θ = θ and that Θ ~ N(0, 1). Show that, given Θ = 0, for every α the probability is 1 that there will exist n such that Pr(Θ ≤ 0|X1, ..., Xn) > α.
13. *Suppose that {Xn}_{n=1}^∞ are conditionally IID N(θ, 1) given Θ = θ and that Θ ~ N(θ0, 1/λ). In this problem, we will prove that, given Θ = θ > θ0, the probability is less than 1 that there will exist n such that Pr(Θ ≤ θ0|X1, ..., Xn) > α > 1/2, so long as Pr(Θ ≤ θ0) < α. In what follows, let θ > θ0 and α > 1/2.
(a) Define Zi = Xi − θ0 and ψ(h) = E_θ exp(hZi). Show that there exists h < 0 such that ψ(h) = 1.
(b) Let p = Pr(Θ ≤ θ0) < α, and let Sn = Σ_{i=1}^n Zi. Prove that Pr(Θ ≤ θ0|X1, ..., Xn) > α is equivalent to Sn ≤ cn, where
(c) Define
A.1 Overview
A.1.1 Definitions
In many introductory statistics and probability courses, one encounters discrete
and continuous random variables and vectors. These are all special cases of a
more general type of random quantity that we will study in this text. Before we
can introduce the more general type of random quantity, we need to generalize
the sums and integrals that figure so prominently in the distributions of discrete
and continuous random variables and vectors. The generalization is through the
concept of a measure (to be defined shortly), which is a way of assigning numerical
values to the "sizes" of sets.
Example A.1. Let S be a nonempty set, and let A ⊆ S. Define μ(A) to be the number of elements of A. Then μ(S) > 0, μ(∅) = 0, and if A1 ∩ A2 = ∅, then μ(A1 ∪ A2) = μ(A1) + μ(A2). Note that μ(A) = ∞ is possible if S has infinitely many elements. The measure μ described here is called counting measure on S.
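For finite sets, counting measure and its additivity over disjoint sets can be checked directly (a trivial sketch):

```python
def counting_measure(A):
    # mu(A) = number of elements of A; here A is a finite Python set.
    return len(A)

A1, A2 = {1, 2}, {3, 4, 5}
assert A1.isdisjoint(A2)
# Additivity: mu(A1 u A2) = mu(A1) + mu(A2) for disjoint A1, A2.
assert counting_measure(A1 | A2) == counting_measure(A1) + counting_measure(A2)
```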
Example A.2. Let A be an interval of real numbers. If A is bounded, let μ(A) be the length of A. If A is unbounded, let μ(A) = ∞. It is easy to see that μ(ℝ) = ∞,1 μ(∅) = 0, and if A1 ∩ A2 = ∅ and A1 ∪ A2 is an interval, then μ(A1 ∪ A2) = μ(A1) + μ(A2). The measure μ described here is called Lebesgue measure.
Example A.3. Let f : ℝ → ℝ+ be a continuous function.2 Define, for each interval A, μ(A) = ∫_A f(x)dx. Then μ(ℝ) > 0, μ(∅) = 0, and if A1 ∩ A2 = ∅ and A1 ∪ A2 is an interval, then μ(A1 ∪ A2) = μ(A1) + μ(A2).
Since measure will be used to give sizes to sets, the domain of a measure will
be a collection of sets. In general, we cannot assign sizes to all sets, but we need
enough sets so that we can take unions and complements. A collection of sets
that is closed under taking complements and finite unions is called a field. A field
that is closed under taking countable unions is called a σ-field.
Example A.4. Let S be any set. Let 𝒜 = {S, ∅}. This σ-field is called the trivial
σ-field. As a second example, let A ⊂ S, and let 𝒜 = {S, A, Aᶜ, ∅}. Let B be another
subset of S, and let 𝒜 = {S, A, B, Aᶜ, Bᶜ, A ∩ B, A ∩ Bᶜ, ...}. Such examples
grow rapidly. The largest σ-field is the collection of all subsets of S, called the
power set of S and denoted 2^S.
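On a finite set, the σ-field generated by a collection can be computed by brute-force closure under complements and pairwise unions (on a finite set a field is automatically a σ-field). A Python sketch, with illustrative names:

```python
from itertools import combinations

# Smallest sigma-field on a finite set S containing a collection C: close
# under complements and pairwise unions until nothing new appears. (On a
# finite S, a field is automatically a sigma-field.)
def generated_sigma_field(S, C):
    field = {frozenset(), frozenset(S)} | {frozenset(A) for A in C}
    changed = True
    while changed:
        changed = False
        for A in list(field):
            if frozenset(S) - A not in field:
                field.add(frozenset(S) - A)
                changed = True
        for A, B in combinations(list(field), 2):
            if A | B not in field:
                field.add(A | B)
                changed = True
    return field

S = {1, 2, 3, 4}
A = {1, 2}
# The sigma-field generated by a single set A is {S, A, A^c, empty}:
assert generated_sigma_field(S, [A]) == {
    frozenset(), frozenset(S), frozenset(A), frozenset(S - A)}
# Generating by all singletons yields the power set, with 2**4 elements:
assert len(generated_sigma_field(S, [{x} for x in S])) == 16
```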
Example A.5. One field of subsets of ℝ is the collection of all unions of finitely
many disjoint intervals (unbounded intervals are allowed). This collection is not
a σ-field, however.
Infinite measures are difficult to deal with unless they behave like finite measures
in certain important ways. If there exists a countable partition of the set S
such that each element of the partition has finite μ measure, then we say that μ
is σ-finite. When an abstract measure is mentioned in this text, it will generally
be safe to assume that it is σ-finite unless the contrary is clear from context.
A.1.3 Integration
The integral of a function with respect to a measure is a way to generalize the
Riemann integral. The interested readers should be able to convince themselves
that the integral as defined here is an extension of the Riemann integral. That
is, if the Riemann integral of a function over a closed and bounded interval
exists, then so does the integral as defined here, and the two are equal. We
define the integral in stages. We start with nonnegative simple functions. If f is
a nonnegative simple function represented as f(s) = Σᵢ₌₁ⁿ aᵢ I_{Aᵢ}(s), with the aᵢ
distinct and the Aᵢ mutually disjoint, then the integral of f with respect to μ is
∫ f(s)dμ(s) = Σᵢ₌₁ⁿ aᵢ μ(Aᵢ). If 0 times ∞ occurs in such a sum, the result is 0
by convention. The integral of a nonnegative simple function is allowed to be ∞.
For general nonnegative measurable functions, we define the integral of f with
respect to μ as ∫ f(s)dμ(s) = sup_{g ≤ f, g simple} ∫ g(s)dμ(s). For general functions f,
let f⁺(s) = max{f(s), 0} and f⁻(s) = −min{f(s), 0} (the positive and negative
parts of f, respectively). Then f(s) = f⁺(s) − f⁻(s). The integral of f with
respect to μ is

∫ f(s)dμ(s) = ∫ f⁺(s)dμ(s) − ∫ f⁻(s)dμ(s),

if at least one of the two integrals on the right is finite. If both are infinite, the
integral is undefined. We say that f is integrable if the integral of f is defined and
is finite. The integral is defined above in terms of its values at all points in S.
Sometimes we wish to consider only a subset A ⊆ S. The integral of f over A
with respect to μ is

∫_A f(s)dμ(s) = ∫ I_A(s) f(s)dμ(s).
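The first stage of the definition can be mimicked numerically. The sketch below computes Σᵢ aᵢ μ(Aᵢ) for a simple function with respect to counting measure; the function and its inputs are illustrative choices:

```python
# Integral of a nonnegative simple function, sum_i a_i * mu(A_i), computed
# with respect to counting measure on S = {0,...,9}.
S = set(range(10))
mu = len                    # counting measure: mu(A) = number of elements

# f takes the value 2.0 on {0,1,2}, the value 5.0 on {3,4}, and 0 elsewhere.
simple_f = [(2.0, {0, 1, 2}), (5.0, {3, 4})]

def integral_simple(f, measure):
    """The integral of a simple function: sum of a_i * mu(A_i)."""
    return sum(a * measure(A) for a, A in f)

assert integral_simple(simple_f, mu) == 2.0 * 3 + 5.0 * 2   # = 16.0
```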
Among the useful properties of integrals are that smaller functions have smaller integrals, and that two integrable functions that
have the same integral over every set are equal almost everywhere. Another useful
property, given in Theorem A.54, is that a nonnegative integrable function leads
to a new measure ν by means of the equation ν(A) = ∫_A f(s)dμ(s).
The most important theorems concern the interchange of limits with integration.
Let {fₙ}ₙ₌₁^∞ be a sequence of measurable functions such that fₙ(x) → f(x)
a.e. [μ]. The monotone convergence theorem A.52 says that if the fₙ are nonnegative
and fₙ(x) ≤ f(x) a.e. [μ], then

lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).   (A.7)

The dominated convergence theorem A.57 says that if there exists an integrable
function g such that |fₙ(x)| ≤ g(x), a.e. [μ], then (A.7) holds.
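The monotone convergence theorem can be illustrated numerically with the discrete measure μ({k}) = 2⁻ᵏ (truncated for computation) and the truncations fₙ = min(f, n), which increase to f. A sketch under these assumed choices:

```python
# Monotone convergence, numerically: with mu({k}) = 2**-k on k = 1, 2, ...
# (truncated at K terms) and f(k) = k, the truncations f_n = min(f, n)
# increase to f, and their integrals increase to the integral of f
# (which is sum k * 2**-k = 2). All names here are illustrative.
K = 60
points = range(1, K + 1)
mu = {k: 2.0 ** -k for k in points}

def integral(g):
    return sum(g(k) * mu[k] for k in points)

f = lambda k: k
limit = integral(f)                                  # close to 2

approx = [integral(lambda k, n=n: min(f(k), n)) for n in (1, 2, 5, 40)]
assert all(a <= b for a, b in zip(approx, approx[1:]))   # monotone in n
assert abs(approx[-1] - limit) < 1e-9                    # convergence
```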
Part 1 of Theorem A.38 says that measurable functions into each of two measurable
spaces combine into a jointly measurable function. Measures and integration
can also be extended from several spaces into the product space. For
example, suppose that μᵢ is a measure on the space (Sᵢ, 𝒜ᵢ) for i = 1, 2. To define
a measure on (S₁ × S₂, 𝒜₁ ⊗ 𝒜₂), we can proceed as follows. For each product
set A = A₁ × A₂, define μ₁ × μ₂(A) = μ₁(A₁)μ₂(A₂). The Carathéodory extension
theorem A.22 allows us to extend this definition to all of the product space.
Lebesgue measure on ℝ², denoted dx dy, is such a product measure. Not every
measure on a product space is a product measure. Product probability measures
will correspond to independent random variables.
Extending integration to product spaces proceeds through two famous theorems.
Tonelli's theorem A.69 says that a nonnegative function f satisfies

∫ f(x, y) d(μ₁ × μ₂)(x, y) = ∫ [∫ f(x, y)dμ₂(y)] dμ₁(x) = ∫ [∫ f(x, y)dμ₁(x)] dμ₂(y).

Fubini's theorem A.70 says that the same equations hold if f is integrable with
respect to μ₁ × μ₂. These results also extend to finite product spaces S₁ × ⋯ × Sₙ.
Let μ₁, μ₂, ... be a collection of measures on the same space (S, 𝒜). Let
a₁, a₂, ... be a collection of positive numbers. Then μ = Σᵢ₌₁^∞ aᵢμᵢ is a
measure and μᵢ ≪ μ for all i.
The last example above is important because it tells us that for every countable
collection of measures, there is a single measure such that all measures in the
collection are absolutely continuous with respect to it.
The Radon–Nikodym theorem A.74 says that the first part of Example A.8 is
the most general form of absolute continuity with respect to σ-finite measures.
That is, if μ₁ is σ-finite and μ₂ ≪ μ₁, then there exists an extended real-valued
measurable function f such that μ₂(A) = ∫_A f(x)dμ₁(x). In addition, if g is
μ₂-integrable, then ∫ g(x)dμ₂(x) = ∫ g(x)f(x)dμ₁(x). The function f is called
the Radon–Nikodym derivative of μ₂ with respect to μ₁ and is usually denoted
(dμ₂/dμ₁)(s).
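For measures concentrated on countably many points, the Radon–Nikodym derivative is simply the ratio of point masses wherever μ₁ is positive, which makes the two identities above easy to verify. A hypothetical discrete sketch (all values illustrative):

```python
# A discrete sketch of the Radon-Nikodym theorem: when mu1 and mu2 sit on
# countably many points and mu2 << mu1, the derivative is the ratio of point
# masses, f(x) = mu2({x}) / mu1({x}).
points = [0, 1, 2, 3]
mu1 = {0: 0.5, 1: 0.25, 2: 0.25, 3: 0.0}
mu2 = {0: 0.1, 1: 0.6, 2: 0.3, 3: 0.0}   # mu1({x}) = 0 forces mu2({x}) = 0

f = {x: (mu2[x] / mu1[x] if mu1[x] > 0 else 0.0) for x in points}

# mu2(A) equals the integral of f over A with respect to mu1:
A = {1, 2}
assert abs(sum(mu2[x] for x in A) - sum(f[x] * mu1[x] for x in A)) < 1e-12

# Change of measure for integrals: int g dmu2 = int g*f dmu1.
g = {x: x ** 2 for x in points}
lhs = sum(g[x] * mu2[x] for x in points)
rhs = sum(g[x] * f[x] * mu1[x] for x in points)
assert abs(lhs - rhs) < 1e-12
```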
A similar theorem, A.81, relates integrals with respect to measures on two
different spaces. It says that a function f : S₁ → S₂ induces a measure on the
range S₂. If μ₁ is a measure on S₁, then define μ₂(A) = μ₁(f⁻¹(A)). Integrals
with respect to μ₂ can be written as integrals with respect to μ₁ in the following
way: ∫ g(y)dμ₂(y) = ∫ g(f(x))dμ₁(x). The measure μ₂ is called the measure
induced on S₂ by f from μ₁.
A.2 Measures
A measure is a way of assigning numerical values to the "sizes" of sets. The
collection of sets whose sizes are given by a measure is a σ-field. (See Examples A.4
and A.5 on page 571.)
Definition A.9. A nonempty collection of subsets 𝒜 of a set S is called a field
if
A ∈ 𝒜 implies Aᶜ ∈ 𝒜,
A₁, A₂ ∈ 𝒜 implies A₁ ∪ A₂ ∈ 𝒜.
A field 𝒜 is called a σ-field if {Aₙ}ₙ₌₁^∞ ∈ 𝒜 implies ⋃ₙ₌₁^∞ Aₙ ∈ 𝒜.
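On a finite set, Definition A.9 can be checked mechanically. The following sketch tests closure under complements and pairwise unions (for finite S this also settles the σ-field condition, since only finitely many unions exist); the helper name is an illustrative choice:

```python
from itertools import combinations

# Check the field axioms on a finite set S: nonempty, closed under
# complementation, closed under pairwise union.
def is_field(S, collection):
    sets = {frozenset(A) for A in collection}
    if not sets:
        return False
    if any(frozenset(S) - A not in sets for A in sets):
        return False
    return all(A | B in sets for A, B in combinations(sets, 2))

S = {1, 2, 3, 4}
assert is_field(S, [set(), S])                        # the trivial field
assert is_field(S, [set(), {1, 2}, {3, 4}, S])        # generated by {1,2}
assert not is_field(S, [set(), {1, 2}, S])            # missing {3,4}
```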
Proposition A.10. Let N be an arbitrary set of indices, and let 𝒴 = {𝒜_α : α ∈
N} be an arbitrary collection of σ-fields of subsets of a set S. Then ⋂_{α∈N} 𝒜_α is
also a σ-field of subsets of S.
Because of Proposition A.10 and the fact that 2^S is a σ-field, it is easy to
see that, for every collection of subsets 𝒞 of S, there is a smallest σ-field 𝒜 that
contains 𝒞, namely the intersection of all σ-fields that contain 𝒞.
Definition A.11. Let 𝒞 be the collection of intervals in ℝ. The smallest σ-field
containing 𝒞 is called the Borel σ-field. In general, if S is a topological space, and
ℬ is the smallest σ-field that contains all of the open sets, then ℬ is called the
Borel σ-field.
In addition to the Borel σ-field, the product σ-field is also generated by a simple
collection of sets.
Definition A.12.
Let N be an index set, and let {S_α}_{α∈N} be a collection of sets. Define
S = ∏_{α∈N} S_α. We call S a product space.
For each α ∈ N, let 𝒜_α be a σ-field of subsets of S_α. Define the product
σ-field ⊗_{α∈N} 𝒜_α as the smallest σ-field that contains all sets of
the form ∏_{α∈N} A_α, where A_α ∈ 𝒜_α for all α and all but finitely many A_α
are equal to S_α.
In the special case in which N = {1, 2}, we use the notation S = S₁ × S₂, and the
product σ-field is denoted 𝒜₁ ⊗ 𝒜₂.
Proposition A.13.⁵ The Borel σ-field ℬᵏ of ℝᵏ is the same as the product
σ-field of k copies of (ℝ, ℬ¹).
There are other types of collections of sets that are related to σ-fields. Sometimes
it is easier to prove results about these other collections and then use the
theorems that follow to infer similar results about σ-fields.
Definition A.14. Let S be a set. A collection Π of subsets of S is called a
π-system if A, B ∈ Π implies A ∩ B ∈ Π. A collection 𝒜 is called a λ-system if
S ∈ 𝒜, A ∈ 𝒜 implies Aᶜ ∈ 𝒜, and {Aₙ}ₙ₌₁^∞ ∈ 𝒜 with Aᵢ ∩ Aⱼ = ∅ for i ≠ j
implies ⋃ₙ₌₁^∞ Aₙ ∈ 𝒜.
As in Proposition A.10, the intersection of arbitrarily many π-systems is a
π-system, and so too with λ-systems. The following propositions are also easy to
prove.
Proposition A.15. If S is a set and 𝒞 is a collection of subsets of S such that
𝒞 is a π-system and a λ-system, then 𝒞 is a σ-field.
Proposition A.16. If S is a set and 𝒜 is a λ-system of subsets, then A, A ∩ B ∈ 𝒜
implies A ∩ Bᶜ ∈ 𝒜.
A₁ = lim_{k→∞} A_k ∪ (⋃ᵢ₌₁^∞ Bᵢ),

and all of the sets on the right-hand side are disjoint. It follows that

A_k = A₁ \ ⋃ᵢ₌₁^{k−1} Bᵢ,

so that μ(A_k) = μ(A₁) − Σᵢ₌₁^{k−1} μ(Bᵢ) for each k, and

lim_{k→∞} μ(A_k) = μ(A₁) − Σᵢ₌₁^∞ μ(Bᵢ) = μ(lim_{k→∞} A_k). □
¹¹This theorem is used in the proofs of Lemma A.72 and Theorems B.90
and 1.61. There is a second Borel–Cantelli lemma, which involves probability
measures, but we will not use it in this text. See Problem 20 on page 663. The
set whose measure is the subject of this theorem is sometimes called "Aₙ infinitely
often" because it is the set of points that are in infinitely many of the Aₙ.
¹²This theorem is used to prove the existence of many common measures
(including product measure) and in the proofs of Lemma A.24 and of Theorems
B.118, B.131, and B.133.
real-valued, and countably additive and satisfies μ(∅) = 0. Then there is a unique
extension of μ to a measure on a measure space¹³ (S, 𝒜, μ*). (That is, 𝒞 ⊆ 𝒜
and μ(A) = μ*(A) for all A ∈ 𝒞.)
PROOF. The proof will proceed as follows. First, we will define μ* and 𝒜. Then we
will show that μ* is monotone and subadditive, that 𝒞 ⊆ 𝒜, that 𝒜 is a σ-field,
that μ* is countably additive on 𝒜, that μ* extends μ, and finally that μ* is the
unique extension.
For each B ∈ 2^S, define

μ*(B) = inf Σᵢ₌₁^∞ μ(Aᵢ),   (A.23)

where the inf is taken over all {Aᵢ}ᵢ₌₁^∞ such that B ⊆ ⋃ᵢ₌₁^∞ Aᵢ and Aᵢ ∈ 𝒞 for all
i. Let

𝒜 = {A : μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ) for all C ∈ 2^S}.
First, we show that μ* is monotone and subadditive. Clearly, μ*(A) ≤ μ(A)
for all A ∈ 𝒞, and B₁ ⊆ B₂ implies μ*(B₁) ≤ μ*(B₂). It is also easy to see that
μ*(B₁ ∪ B₂) ≤ μ*(B₁) + μ*(B₂) for all B₁, B₂ ∈ 2^S. In fact, if {Bₙ}ₙ₌₁^∞ ∈ 2^S,
then μ*(⋃ᵢ₌₁^∞ Bᵢ) ≤ Σᵢ₌₁^∞ μ*(Bᵢ). The proof is to notice that the collection of
numbers whose inf is μ* of the union includes all of the sums of the numbers
whose infima are the μ* values being added together.
Next, we show that 𝒞 ⊆ 𝒜. Let A ∈ 𝒞 and C ∈ 2^S. Since μ* is subadditive, we
only need to show that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ). If μ*(C) = ∞, this is
clearly true. So let μ*(C) < ∞. From the definition of μ*, for every ε > 0, there
exists a collection {Aᵢ}ᵢ₌₁^∞ of elements of 𝒞 such that Σᵢ₌₁^∞ μ(Aᵢ) < μ*(C) + ε.
Since μ(Aᵢ) = μ(Aᵢ ∩ A) + μ(Aᵢ ∩ Aᶜ) for every i, we have

μ*(C) + ε > Σᵢ₌₁^∞ μ(Aᵢ ∩ A) + Σᵢ₌₁^∞ μ(Aᵢ ∩ Aᶜ) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ).

Since this is true for every ε > 0, it must be that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ),
hence A ∈ 𝒜.
Next, we show that 𝒜 is a σ-field. It is clear that ∅ ∈ 𝒜 and that A ∈ 𝒜 implies
Aᶜ ∈ 𝒜 by the symmetry in the definition of 𝒜. Let A₁, A₂ ∈ 𝒜 and C ∈ 2^S. We
can write

μ*(C) = μ*(C ∩ A₁) + μ*(C ∩ A₁ᶜ)
= μ*(C ∩ A₁) + μ*(C ∩ A₁ᶜ ∩ A₂) + μ*(C ∩ A₁ᶜ ∩ A₂ᶜ)
≥ μ*(C ∩ [A₁ ∪ A₂]) + μ*(C ∩ [A₁ ∪ A₂]ᶜ),

where the first two equalities follow from A₁, A₂ ∈ 𝒜, and the last follows from
the subadditivity of μ*. So, A₁ ∪ A₂ ∈ 𝒜. Let {Aₙ}ₙ₌₁^∞ ∈ 𝒜; then we can write
¹³The usual statement of this theorem includes the additional claim that the
measure space (S, 𝒜, μ*) is complete. A measure space is complete if every subset
of every set with measure 0 is in the σ-field.
A = ⋃ᵢ₌₁^∞ Aᵢ = ⋃ᵢ₌₁^∞ Bᵢ, where each Bᵢ ∈ 𝒜 and the Bᵢ are disjoint. (This just
makes use of complements and finite unions of elements of 𝒜 being in 𝒜.) Let
Dₙ = ⋃ᵢ₌₁ⁿ Bᵢ and C ∈ 2^S. Since Aᶜ ⊆ Dₙᶜ and Dₙ ∈ 𝒜 for each n, we have

μ*(C) ≥ μ*(C ∩ Dₙ) + μ*(C ∩ Dₙᶜ) ≥ Σᵢ₌₁ⁿ μ*(C ∩ Bᵢ) + μ*(C ∩ Aᶜ).
intervals of the form (a, b] with a = −∞ and/or b = ∞ possible.¹⁴ The collection
𝒞 of all unions of finitely many disjoint intervals of this form is easily seen to be
a field. If (a₁, b₁], ..., (aₙ, bₙ] are mutually disjoint, set

μ(⋃ᵢ₌₁ⁿ (aᵢ, bᵢ]) = Σᵢ₌₁ⁿ μ((aᵢ, bᵢ]).

It is not hard to see that this extension of μ to 𝒞 is well defined. This means that if
⋃ᵢ₌₁ⁿ (aᵢ, bᵢ] = ⋃ⱼ₌₁ᵐ (cⱼ, dⱼ], where (c₁, d₁], ..., (cₘ, dₘ] are also mutually disjoint,
then Σᵢ₌₁ⁿ μ((aᵢ, bᵢ]) = Σⱼ₌₁ᵐ μ((cⱼ, dⱼ]). If μ is finite for every interval, then it is
σ-finite. To see that μ is countably additive on 𝒞, suppose that μ((a, b]) = F(b) −
F(a), where F is nondecreasing and continuous from the right. If {(aₙ, bₙ]}ₙ₌₁^∞ is
a sequence of disjoint intervals and (a, b] is an interval such that ⋃ₙ₌₁^∞ (aₙ, bₙ] ⊆
(a, b], then it is not difficult to see that Σₙ₌₁^∞ μ((aₙ, bₙ]) ≤ μ((a, b]). If (a, b] ⊆
⋃ₙ₌₁^∞ (aₙ, bₙ], we can also prove that Σₙ₌₁^∞ μ((aₙ, bₙ]) ≥ μ((a, b]) (see Problem 7
on page 603). Together these facts will imply that μ is countably additive on 𝒞.
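The assignment μ((a, b]) = F(b) − F(a) and its finite additivity over disjoint intervals can be checked numerically; the sketch below uses the standard normal cdf as one concrete choice of F (an illustrative assumption, not one made in the text):

```python
import math

# mu((a, b]) = F(b) - F(a) for a nondecreasing, right-continuous F, extended
# additively to finite disjoint unions of such intervals. Here F is the
# standard normal cdf, so mu is a probability measure.
def F(x):
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mu(intervals):
    """Measure of a disjoint union of half-open intervals (a, b]."""
    return sum(F(b) - F(a) for a, b in intervals)

# Finite additivity: (-1, 0] and (0, 1] are disjoint with union (-1, 1].
assert abs(mu([(-1, 0), (0, 1)]) - mu([(-1, 1)])) < 1e-12
# The whole line gets measure F(inf) - F(-inf) = 1.
assert abs(mu([(-math.inf, math.inf)]) - 1.0) < 1e-12
```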
The proof of Theorem A.22 leads us to the following useful result. Its proof is
adapted from Halmos (1950).
Lemma A.24.¹⁵ Let (S, 𝒜, μ) be a σ-finite measure space. Suppose that 𝒞 is a
field such that 𝒜 is the smallest σ-field containing 𝒞. Then, for every A ∈ 𝒜 and
ε > 0, there is C ∈ 𝒞 such that μ(C △ A) < ε.¹⁶
PROOF. Clearly, μ and 𝒞 satisfy the conditions of Theorem A.22, so that μ is equal
to the μ* in the proof of that theorem. Let A ∈ 𝒜 and ε > 0 be given. It follows
from (A.23) that there exists a sequence {Aᵢ}ᵢ₌₁^∞ in 𝒞 such that A ⊆ ⋃ᵢ₌₁^∞ Aᵢ and

Σᵢ₌₁^∞ μ(Aᵢ) ≤ μ(A) + ε.

¹⁴If b = ∞, we mean (a, ∞) by (a, b]. That is, we do not intend ∞ to be a
point in the space S.
¹⁵This lemma is used in the proof of the Kolmogorov zero-one law B.68.
¹⁶The symbol △ here refers to the symmetric difference operator on pairs of
sets. We define C △ A to be (C ∩ Aᶜ) ∪ (Cᶜ ∩ A).
Similarly, μ₂(A) = lim_{n→∞} μ₂(⋃ᵢ₌₁ⁿ [Cᵢ ∩ A]).
Since μⱼ(⋃ᵢ₌₁ⁿ [Cᵢ ∩ A]) can be written as a linear combination of values of μⱼ
at sets of the form A ∩ C, where C ∈ Π is the intersection of finitely many of
C₁, ..., Cₙ, it follows from A ∈ 𝒢 that μ₁(⋃ᵢ₌₁ⁿ [Cᵢ ∩ A]) = μ₂(⋃ᵢ₌₁ⁿ [Cᵢ ∩ A])
for all n, hence μ₁(A) = μ₂(A). □
¹⁷This theorem is used in the proofs of Theorems B.32, B.46, B.118, B.131,
and 1.115, Lemma A.64, and Corollary B.44.
Definition A.27. Suppose that S is a set with a σ-field 𝒜 of subsets, and let T
be another set with a σ-field 𝒞 of subsets. Suppose that f : S → T is a function.
We say f is measurable if for every B ∈ 𝒞, f⁻¹(B) ∈ 𝒜. If f is measurable,
one-to-one, and onto and f⁻¹ is measurable, we say that f is bimeasurable. If
T = ℝ, the real numbers, and 𝒞 = ℬ, the Borel σ-field, then if f is measurable,
we say that f is Borel measurable.
Proposition A.28. Suppose that (S, 𝒜) and (T, 𝒞) are measurable spaces. Suppose
that f : S → T is a function.
If 𝒜 = 2^S, then f is measurable.
If 𝒞 = {T, ∅}, then f is measurable.
If 𝒜 = {S, ∅}, {y} ∈ 𝒞 for every y ∈ T, and f is measurable, then f is
constant.
As examples, if S = T = ℝ and 𝒜 = ℬ is the Borel σ-field, then all continuous
functions are measurable. But many discontinuous functions are also measurable.
For example, step functions are measurable. All monotone functions are measurable.
In fact, it is very difficult to describe a nonmeasurable function without
using some heavy mathematics.
The following theorems make it easier to show that a function is measurable.
Theorem A.29.¹⁸ Let N, S, and T be arbitrary sets. Let {A_α : α ∈ N} be a
collection of subsets of T, and let A be an arbitrary subset of T. Let f : S → T
be a function. Then

f⁻¹(⋃_{α∈N} A_α) = ⋃_{α∈N} f⁻¹(A_α),
f⁻¹(⋂_{α∈N} A_α) = ⋂_{α∈N} f⁻¹(A_α),
f⁻¹(Aᶜ) = f⁻¹(A)ᶜ.
PROOF. For the union, if s ∈ f⁻¹(⋃_{α∈N} A_α), then f(s) ∈ ⋃_{α∈N} A_α, hence there
exists α such that f(s) ∈ A_α, so s ∈ f⁻¹(A_α) and s ∈ ⋃_{α∈N} f⁻¹(A_α). If s ∈
⋃_{α∈N} f⁻¹(A_α), then there exists α such that s ∈ f⁻¹(A_α), hence f(s) ∈ A_α,
hence f(s) ∈ ⋃_{α∈N} A_α, hence s ∈ f⁻¹(⋃_{α∈N} A_α). This proves the first equality.
The second is almost identical in that "there exists α" is merely replaced by "for
all α" in the above proof. For the complement, if s ∈ f⁻¹(Aᶜ), then f(s) ∈ Aᶜ
and f(s) ∉ A. Hence, s ∉ f⁻¹(A) and s ∈ f⁻¹(A)ᶜ. If s ∈ f⁻¹(A)ᶜ, then
s ∉ f⁻¹(A) and f(s) ∉ A. So, f(s) ∈ Aᶜ and s ∈ f⁻¹(Aᶜ). □
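The three identities of Theorem A.29 can be verified exhaustively on small finite sets; the sets and the map s → s mod 3 below are arbitrary illustrative choices:

```python
# The identities of Theorem A.29 checked exhaustively for one function on
# small finite sets.
S = {0, 1, 2, 3, 4, 5}
T = {0, 1, 2}
f = lambda s: s % 3

def preimage(A):
    return {s for s in S if f(s) in A}

A1, A2 = {0, 1}, {1, 2}
assert preimage(A1 | A2) == preimage(A1) | preimage(A2)   # unions
assert preimage(A1 & A2) == preimage(A1) & preimage(A2)   # intersections
assert preimage(T - A1) == S - preimage(A1)               # complements
```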
Definition A.31. The σ-field f⁻¹(𝒞) in Corollary A.30 is called the σ-field generated
by f.
A measurable function also generates a σ-field of subsets of its image.
Proposition A.32. Let (T, 𝒞) be a measurable space. Let U ⊆ T be arbitrary
(possibly not even in 𝒞). Define 𝒞_U = {U ∩ B : B ∈ 𝒞}. Then 𝒞_U is a σ-field of
subsets of U.
f⁻¹(⋃ᵢ₌₁^∞ Aᵢ) = ⋃ᵢ₌₁^∞ f⁻¹(Aᵢ) ∈ 𝒜,
²⁰This theorem is used in the proofs of Lemma A.35, Proposition A.36, Corollary
A.37, Theorems A.38, B.75, and B.133, and to prove that stochastic processes
are measurable.
²¹This lemma is used in the proofs of Theorems A.38 and A.74.
²²This proposition is used in the proof of Theorem A.38.
Another example of the use of Theorem A.34 is the proof that all continuous
functions are measurable. The result follows because the Borel σ-field is the
smallest σ-field containing open sets.
Corollary A.37. Let (S, 𝒜) and (T, ℬ) be topological spaces with their Borel
σ-fields. If f : S → T is continuous, then f is measurable.
Here are some properties of measurable functions that will prove useful.
Theorem A.38. Let (S, 𝒜) be a measurable space.
1. Let N be an index set, and let {(T_α, 𝒞_α)}_{α∈N} be a collection of measurable
spaces. For each α ∈ N, let f_α : S → T_α be a function. Define f : S →
∏_{α∈N} T_α by f(s) = {f_α(s)}_{α∈N}. Then f is measurable (with respect to the
product σ-field) if and only if each f_α is measurable.
2. If (V, 𝒞₁) and (U, 𝒞₂) are measurable spaces and f : S → V and g : V → U
are measurable, then g(f) : S → U is measurable.
3. Let f and g be measurable functions from S to ℝⁿ, let a be a constant
scalar, and let b ∈ ℝⁿ be constant. Then the following functions are also
measurable: f + g and af + b. If n = 1, then fg and f/g are also measurable,
where f/g can be set equal to an arbitrary constant when g = 0.
4. If, for each n, fₙ is a measurable, extended real-valued function, then
supₙ fₙ, infₙ fₙ, lim supₙ fₙ, and lim infₙ fₙ are all measurable.
5. Let (T, 𝒞) be a metric space with Borel σ-field. If f_k : S → T is a measurable
function for each k = 1, 2, ... and lim_{k→∞} f_k(s) = f(s) for all s, then f is
measurable.
6. Let (T, 𝒞) be a metric space with Borel σ-field, and let μ be a measure on
(S, 𝒜). If f_k : S → T is a measurable function for each k = 1, 2, ... and
lim_{k→∞} f_k(s) exists a.e. [μ], then there is a measurable f : S → T such
that lim_{k→∞} f_k(s) = f(s), a.e. [μ].
PROOF. (1) Suppose that f is measurable. To show that f_α is measurable, let
B_α ∈ 𝒞_α and let B_β = T_β for β ≠ α. Set C = ∏_{β∈N} B_β, which is in the
product σ-field, because all but finitely many B_β equal the entire space T_β. Then
f_α⁻¹(B_α) = f⁻¹(C). Since f is measurable, f⁻¹(C) ∈ 𝒜. Now, suppose that each
f_α is measurable, and let B = ∏_{α∈N} B_α, with B_α ∈ 𝒞_α for all α and all but finitely
many B_α (say B_{α₁}, ..., B_{αₙ}) equal to T_α. Then f⁻¹(B) = ⋂ᵢ₌₁ⁿ f_{αᵢ}⁻¹(B_{αᵢ}) ∈ 𝒜.
Since the sets of the form B generate the product σ-field, f⁻¹(B) ∈ 𝒜 for all B
in the product σ-field according to Theorem A.34.
(2) Let A ∈ 𝒞₂. We need to prove that g(f)⁻¹(A) ∈ 𝒜. First, note that
g(f)⁻¹ = f⁻¹(g⁻¹). Since g is measurable, g⁻¹(A) ∈ 𝒞₁. Since f is measurable,
f⁻¹(g⁻¹(A)) ∈ 𝒜. So g(f)⁻¹(A) ∈ 𝒜.
(3) The arithmetic parts of the theorem are all similar. They all follow from
parts 2 and 1. For example, h(x, y) = x + y is a measurable function from ℝ²
to ℝ, so h(f, g) = f + g is measurable. For the quotient, a little more care is
needed. Let h(x, y) = x/y when y ≠ 0 and let it be an arbitrary constant when
y = 0. Then h is measurable since {(x, y) : y = 0} is in ℬ². It follows that h(f, g)
is measurable.
(4) Let f = supₙ fₙ. Then, for each finite b, {s : f(s) ≤ b} = ⋂ₙ₌₁^∞ {s :
fₙ(s) ≤ b} ∈ 𝒜. Also {s : f(s) = −∞} = ⋂ₙ₌₁^∞ {s : fₙ(s) = −∞} ∈ 𝒜, and
{s : f(s) = ∞} = ⋂ᵢ₌₁^∞ ⋃ₙ₌₁^∞ {s : fₙ(s) > i} ∈ 𝒜. Similar arguments work for
inf. Since lim supₙ fₙ = inf_k sup_{n≥k} fₙ and lim infₙ fₙ = sup_k inf_{n≥k} fₙ, these
are also measurable.
(5) Let d be the metric in T. For each closed set C ∈ 𝒞, and each m, let
C_m = {t : d(t, C) < 1/m}. For each closed C, define

A*(C) = ⋂_{m=1}^∞ ⋃_{n=1}^∞ ⋂_{k=n}^∞ f_k⁻¹(C_m).   (A.39)

It is easy to see that A*(C) ∈ 𝒜 is the set of all s such that lim_{n→∞} fₙ(s) ∈
C. Obviously, f⁻¹(C) consists of those s such that lim_{n→∞} fₙ(s) ∈ C. Hence,
f⁻¹(C) = A*(C) ∈ 𝒜, and Proposition A.36 says that f is measurable.
(6) Let G = {s : lim_{k→∞} f_k(s) does not exist}, and let G ⊆ C with μ(C) = 0.
Let t ∈ T, and define f(s) = t for s ∈ C and f(s) = lim_{k→∞} f_k(s) for s ∈ Cᶜ.
Apply part 5 to the restrictions of the functions {f_k}_{k=1}^∞ to Cᶜ to conclude that
f restricted to Cᶜ (call the restriction g) is measurable. If A ∈ 𝒞, f⁻¹(A) =
g⁻¹(A) ∈ 𝒜 if t ∉ A and f⁻¹(A) = g⁻¹(A) ∪ C ∈ 𝒜 if t ∈ A. So f is measurable.
□
Part 6 is particularly useful in that it allows us to treat the limit of a sequence
of measurable functions as a measurable function even if the limit only exists
almost everywhere. This is only useful, however, if we can show that functions
that are equal almost everywhere have similar properties.
Many theorems about measurable functions are proven first for a special class of
measurable functions called simple functions and then extended to all measurable
functions using some limit theorems.
Definition A.40. A measurable function f is called simple if it assumes only
finitely many distinct values.
A simple function is often expressed in terms of its values. Let f be a simple
function taking values in ℝⁿ for some n. Suppose that {a₁, ..., a_k} are the distinct
values assumed by f, and let Aᵢ = f⁻¹({aᵢ}). Then f(s) = Σᵢ₌₁ᵏ aᵢ I_{Aᵢ}(s).
The most fundamental limit theorem is the following.
Theorem A.41. If f is a nonnegative measurable function, then there exists a
sequence of simple functions {fᵢ}ᵢ₌₁^∞ such that for all s ∈ S, fᵢ(s) ↑ f(s).
PROOF. For k = 1, ..., i2ⁱ, let A_{k,i} = {s : (k − 1)/2ⁱ ≤ f(s) < k/2ⁱ}. Define
A_{0,i} = {s : f(s) ≥ i}. Then A_{0,i}, A_{1,i}, ..., A_{i2ⁱ,i} are disjoint and their union is S.
Define

fᵢ(s) = (k − 1)/2ⁱ if s ∈ A_{k,i} for k > 0, and fᵢ(s) = i if s ∈ A_{0,i}.

It is clear that fᵢ(s) ≤ f(s) for all i and s, and each fᵢ is a simple function. Since,
for k > 0, A_{k,i} = A_{2k−1,i+1} ∪ A_{2k,i+1}, and A_{0,i} = A_{0,i+1} ∪ A_{i2^{i+1}+1,i+1} ∪ ⋯ ∪
A_{(i+1)2^{i+1},i+1}, it is easy to see that fᵢ(s) ≤ f_{i+1}(s) for all i and all s. It is also
easy to see that for each s, there exists n such that for i ≥ n, |f(s) − fᵢ(s)| ≤ 2⁻ⁱ.
Hence fᵢ(s) ↑ f(s). □
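The dyadic construction in the proof of Theorem A.41 is easy to implement. The sketch below (names illustrative) builds fᵢ and checks pointwise monotonicity and the 2⁻ⁱ error bound:

```python
import math

# The dyadic approximation from the proof of Theorem A.41: for nonnegative f,
# f_i equals (k-1)/2**i on {s : (k-1)/2**i <= f(s) < k/2**i}, k = 1,...,i*2**i,
# and equals i where f(s) >= i. Then f_i increases pointwise to f.
def dyadic_approx(f, i):
    def f_i(s):
        v = f(s)
        if v >= i:
            return float(i)
        return math.floor(v * 2 ** i) / 2 ** i   # floor to the dyadic grid
    return f_i

f = lambda s: s * s                  # a nonnegative function, as an example
points = [0.0, 0.3, 0.77, 1.5, 3.2]
for i in range(1, 12):
    fi, fi1 = dyadic_approx(f, i), dyadic_approx(f, i + 1)
    for s in points:
        assert fi(s) <= fi1(s) <= f(s)           # increasing, and below f
# once f(s) < i, the error is at most 2**-i
assert all(f(s) - dyadic_approx(f, 12)(s) <= 2 ** -12 for s in points)
```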
The following theorem will be very useful throughout the study of statistics. It
says that one function g is a function of another f if and only if g is measurable
with respect to the σ-field generated by f.
A.4 Integration
The integral of a function with respect to a measure is a way to generalize the
notion of weighted average. We define the integral in stages. We start with
nonnegative simple functions.
Definition A.44. Let f be a nonnegative simple function represented as f(s) =
Σᵢ₌₁ⁿ aᵢ I_{Aᵢ}(s), with the aᵢ distinct and the Aᵢ mutually disjoint. Then, the
integral of f with respect to μ is ∫ f(s)dμ(s) = Σᵢ₌₁ⁿ aᵢ μ(Aᵢ). If 0 times ∞ occurs
in such a sum, the result is 0 by convention.
The integral of a nonnegative simple function is allowed to be ∞. It turns out
that the formula for the integral of a nonnegative simple function is more general
than in Definition A.44.
The integral of f with respect to μ is

∫ f(s)dμ(s) = ∫ f⁺(s)dμ(s) − ∫ f⁻(s)dμ(s),

if at least one of the two integrals on the right is finite. If both are infinite, the
integral is undefined. We say that f is integrable if the integral of f is defined
and is finite.
The integral is defined above in terms of its values at all points in S. Sometimes
we wish to consider only a subset of S.
Definition A.48. If A ⊆ S and f is measurable, the integral of f over A with
respect to μ is

∫_A f(s)dμ(s) = ∫ I_A(s) f(s)dμ(s).
∫ af(s)dμ(s) = a ∫ f(s)dμ(s).
3. If f and g are integrable with respect to μ, and f ≤ g, a.e. [μ], then
∫ f(s)dμ(s) ≤ ∫ g(s)dμ(s).
The proofs of the next few theorems are essentially borrowed from Royden
(1968).
Theorem A.50 (Fatou's lemma).²⁴ Let {fₙ}ₙ₌₁^∞ be a sequence of nonnegative
measurable functions. Then

∫ lim inf_{n→∞} fₙ(s)dμ(s) ≤ lim inf_{n→∞} ∫ fₙ(s)dμ(s).

PROOF. Let f = lim inf_{n→∞} fₙ. Since

∫ f(s)dμ(s) = sup_{simple φ ≤ f} ∫ φ(s)dμ(s),

it suffices to show that lim inf_{n→∞} ∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s) for each simple
φ ≤ f. Since this is clearly true if φ(s) = 0, a.s. [μ], we will assume that μ(A) > 0, where
A = {s : φ(s) > 0}. Let φ ≤ f be simple, let ε > 0, and let δ and M be the
smallest and largest positive values that φ assumes. For each n, define

Aₙ = {s ∈ A : f_k(s) > (1 − ε)φ(s) for all k ≥ n}.

Since (1 − ε)φ(s) < f(s) for all s ∈ A, ⋃ₙ₌₁^∞ Aₙ = A and Aₙ ⊆ Aₙ₊₁ for all n.
Let Bₙ = A ∩ Aₙᶜ. Then

∫ fₙ(s)dμ(s) ≥ ∫_{Aₙ} fₙ(s)dμ(s) ≥ (1 − ε) ∫_{Aₙ} φ(s)dμ(s).   (A.51)

If μ(Bₙ) = ∞ for n = n₀, then μ(A) = ∞ and ∫ φ(s)dμ(s) = ∞, since φ takes
on only finitely many different values. The rightmost integral in (A.51) is at least
(1 − ε)δμ(Aₙ), which goes to ∞ as n increases, hence lim inf_{n→∞} ∫ fₙ(s)dμ(s) = ∞ and
the result is true. So, assume μ(Bₙ) < ∞ for all n. Since ⋂ₙ₌₁^∞ Bₙ = ∅, it follows
from Theorem A.19 that lim_{n→∞} μ(Bₙ) = 0. So, there exists N such that n > N
implies μ(Bₙ) < ε. Since

∫_A φ(s)dμ(s) ≤ ∫_{Aₙ} φ(s)dμ(s) + Mμ(Bₙ),

(A.51) implies that, for n ≥ N,

∫ fₙ(s)dμ(s) ≥ (1 − ε) [∫ φ(s)dμ(s) − Mε].

If ∫ φ(s)dμ(s) = ∞, the result is true again. If ∫ φ(s)dμ(s) = K < ∞, then for
every n ≥ N,

∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s) − ε[(1 − ε)M + K],

hence

lim inf_{n→∞} ∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s) − ε[(1 − ε)M + K].

Since this is true for every ε > 0,

lim inf_{n→∞} ∫ fₙ(s)dμ(s) ≥ ∫ φ(s)dμ(s).

□
Theorem A.52 (Monotone convergence theorem). Let {fₙ}ₙ₌₁^∞ be a sequence
of measurable nonnegative functions, and let f be a measurable function
such that fₙ(x) ≤ f(x) a.e. [μ] and fₙ(x) → f(x) a.e. [μ]. Then,

lim_{n→∞} ∫ fₙ(x)dμ(x) = ∫ f(x)dμ(x).
Rearranging the terms in the first and last expressions gives the desired result. If
both f and g have infinite integral of the same sign, then it follows easily, using
Proposition A.49, that f + g has infinite integral of the same sign. Finally, if only
one of f and g has infinite integral, it also follows easily from Proposition A.49
that f + g has infinite integral of the same sign. □
A nonnegative function can be used to create a new measure.
Theorem A.54. Let (S, 𝒜, μ) be a measure space, and let f : S → ℝ be nonnegative
and measurable. Then ν(A) = ∫_A f(s)dμ(s) is a measure on (S, 𝒜).
PROOF. Clearly, ν is nonnegative and ν(∅) = 0, since f(s)I_∅(s) = 0, a.e. [μ].
Let {Aₙ}ₙ₌₁^∞ be disjoint. For each n, define gₙ(s) = f(s)I_{Aₙ}(s) and fₙ(s) =
Σᵢ₌₁ⁿ gᵢ(s). Define A = ⋃ₙ₌₁^∞ Aₙ. Then 0 ≤ fₙ ≤ f I_A, a.e. [μ] and fₙ converges
to f I_A, a.e. [μ]. So, the monotone convergence theorem A.52 says that

lim_{n→∞} ∫ fₙ(s)dμ(s) = ν(A).   (A.55)

Also, ν(Aᵢ) = ∫ gᵢ(s)dμ(s), for each i. It follows from Theorem A.53 that

∫ fₙ(s)dμ(s) = Σᵢ₌₁ⁿ ν(Aᵢ).   (A.56)

Take the limit as n → ∞ of the second and last terms in (A.56) and compare to
(A.55) to see that ν is countably additive. □
∫ f(x)dμ(x) ≤ lim inf_{n→∞} ∫ fₙ(x)dμ(x).
PROOF. Let fₙ⁺, fₙ⁻, f⁺, and f⁻ be the positive and negative parts of fₙ and
f. We will prove that the result holds for nonnegative functions and take the
difference to get the general result. Let ε > 0 and let c be large enough so that
supₙ ∫_{{x : fₙ(x) > c}} fₙ(x)dμ(x) < ε. The functions

gₙ(x) = fₙ(x) if fₙ(x) ≤ c, and gₙ(x) = c if fₙ(x) > c

satisfy

∫ f(x)dμ(x) ≥ ∫ lim_{n→∞} gₙ(x)dμ(x)
= lim_{n→∞} ∫ gₙ(x)dμ(x)
≥ lim sup_{n→∞} ∫ fₙ(x)dμ(x) − ε,

where the second line follows from the dominated convergence theorem A.57 and
the third from our choice of c. Since this is true for every ε, we have ∫ f(x)dμ(x) ≥
lim sup_{n→∞} ∫ fₙ(x)dμ(x). Combining this with Fatou's lemma A.50 gives
So, B_x ∈ 𝒜₂. Let 𝒞 be the collection of all sets B ⊆ S₁ × S₂ such that B_x ∈ 𝒜₂.
If B ∈ 𝒞, then (Bᶜ)_x = {y : (x, y) ∉ B} = (B_x)ᶜ, so Bᶜ ∈ 𝒞. Let {Bₙ}ₙ₌₁^∞ ∈ 𝒞.
Then it is easy to see that

(⋃ₙ₌₁^∞ Bₙ)_x = {y : (x, y) ∈ ⋃ₙ₌₁^∞ Bₙ} = ⋃ₙ₌₁^∞ {y : (x, y) ∈ Bₙ} = ⋃ₙ₌₁^∞ (Bₙ)_x ∈ 𝒜₂.
(A.62)

Clearly, S₁ × S₂ ∈ 𝒞, so 𝒞 is a σ-field containing all product sets; hence it contains
𝒜₁ ⊗ 𝒜₂. Next, let f_B(x) = μ₂(B_x) for B ∈ 𝒜₁ ⊗ 𝒜₂. Write S₁ × S₂ = ⋃ₙ₌₁^∞ Eₙ with
Eₙ = A_{1n} × A_{2n} and μᵢ(A_{in}) < ∞ for all n and i = 1, 2 and with the Eₙ disjoint.
Then let f_{B,n}(x) = μ₂((B ∩ Eₙ)_x). It follows that f_B = Σₙ₌₁^∞ f_{B,n}. If we can show
that f_{B,n} is measurable for each n, then so is f_B, since they are nonnegative, and
the sum is well defined. If B = B₁ × B₂, then f_{B,n}(x) = I_{A_{1n} ∩ B₁}(x) μ₂(A_{2n} ∩ B₂),
which is a measurable function. Let 𝒟 be the collection of all sets D ⊆ S₁ × S₂
²⁸This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems
A.69 and B.46.
= Σₘ₌₁^∞ f_{D_m,n}(x),

where the first equality follows from the definition of ν₁, the fact that μ₂ is
countably additive, and (A.62); the second equality follows from the monotone
convergence theorem A.52 and the fact that Σₙ₌₁ᵐ μ₂((Bₙ)_x) ≤ Σₙ₌₁^∞ μ₂((Bₙ)_x)
for all m; and the last equality follows from the definition of ν₁. This proves that ν₁
(and so too ν₂) is a measure. Note that if B = A₁ × A₂, then ν₁(B) =
μ₁(A₁)μ₂(A₂) = ν₂(B). The space S₁ × S₂ can be partitioned into product sets
on which ν₁ and ν₂ agree and such that each one has finite ν₁ measure. By Theorem A.26, ν₁ agrees
with ν₂ on all of 𝒜₁ ⊗ 𝒜₂.
□
Definition A.65. Let (Sᵢ, 𝒜ᵢ, μᵢ) for i = 1, 2 be σ-finite measure spaces. Define
the product measure μ₁ × μ₂ on (S₁ × S₂, 𝒜₁ ⊗ 𝒜₂) as the common value of the
two measures ν₁ and ν₂ in Lemma A.64.
Lebesgue measure on ℝ², denoted dx dy, is a product measure. Not every
measure on a product space is a product measure. Product probability measures
will correspond to independent random variables. (See Theorem B.66.)
Proposition A.66. Let μ be a measure on a product space (S₁ × S₂, 𝒜₁ ⊗ 𝒜₂).
Then μ is a product measure if and only if there exist set functions μᵢ : 𝒜ᵢ →
ℝ for i = 1, 2 such that, for every A₁ ∈ 𝒜₁ and A₂ ∈ 𝒜₂, μ(A₁ × A₂) =
μ₁(A₁)μ₂(A₂).
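For measures on finite sets, the product measure is just the pointwise product of the marginal point masses, and the rectangle formula of Proposition A.66 can be checked directly; a measure concentrated on a diagonal shows that not every measure on the product space is a product measure. A minimal sketch with illustrative numbers:

```python
# Product measure on a finite product space: mu1 x mu2 gives a rectangle
# A1 x A2 the mass mu1(A1) * mu2(A2).
mu1 = {0: 0.2, 1: 0.8}
mu2 = {0: 0.5, 1: 0.3, 2: 0.2}
prod = {(x, y): mu1[x] * mu2[y] for x in mu1 for y in mu2}

def measure(m, A):
    return sum(m[p] for p in A)

A1, A2 = {0, 1}, {1, 2}
rect = {(x, y) for x in A1 for y in A2}
assert abs(measure(prod, rect) - measure(mu1, A1) * measure(mu2, A2)) < 1e-12

# A measure concentrated on the diagonal is not a product measure: its mass
# at (0, 0) differs from the product of its marginals there.
diag = {(x, y): (0.5 if x == y else 0.0) for x in mu1 for y in mu2}
m1 = {x: sum(diag[(x, y)] for y in mu2) for x in mu1}
m2 = {y: sum(diag[(x, y)] for x in mu1) for y in mu2}
assert abs(diag[(0, 0)] - m1[0] * m2[0]) > 0.2
```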
Lemma A.67.³⁰ Let f be a measurable function from S₁ × S₂ to ℝ such that
either {x ∈ S₁ : ∫ |f(x, y)|dμ₂(y) = ∞} ⊆ A ∈ 𝒜₁, where μ₁(A) = 0, or f ≥ 0.
Then, there is a measurable (possibly extended real-valued) function g : S₁ →
ℝ ∪ {∞} such that g(x) = ∫ f(x, y)dμ₂(y), a.e. [μ₁]. If f is the indicator of a
measurable set B, then g(x) = μ₂(B_x), a.e. [μ₁].
by (A.68). Since 0 ≤ ∫ fₙ(x, y)dμ₂(y) ≤ ∫ f(x, y)dμ₂(y) for all x and n, and
lim_{n→∞} ∫ fₙ(x, y)dμ₂(y) = ∫ f(x, y)dμ₂(y) as in the proof of Lemma A.67, it
follows from the monotone convergence theorem A.52 that

∫ f(x, y)d(μ₁ × μ₂)(x, y) = ∫ [∫ f(x, y)dμ₂(y)] dμ₁(x).

The proof that the iterated integrals can be calculated in the other order is
similar. □
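When both measures are supported on finitely many points, the integrals in Tonelli's theorem become finite sums, and the equality of the two iterated integrals can be confirmed directly. A sketch with illustrative measures and integrand:

```python
# Tonelli for discrete measures: the double integral of a nonnegative f over
# the product equals either iterated integral; integrals are finite sums here.
mu1 = {0: 1.0, 1: 2.0}
mu2 = {0: 0.5, 1: 1.5}
f = lambda x, y: (x + 1) * (y + 2)   # nonnegative, so Tonelli applies

double = sum(f(x, y) * mu1[x] * mu2[y] for x in mu1 for y in mu2)
iter_12 = sum(mu1[x] * sum(f(x, y) * mu2[y] for y in mu2) for x in mu1)
iter_21 = sum(mu2[y] * sum(f(x, y) * mu1[x] for x in mu1) for y in mu2)

assert abs(double - iter_12) < 1e-12
assert abs(double - iter_21) < 1e-12
```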
For every ε > 0, there is δ such that μ₁(A) < δ implies μ₂(A) < ε.   (A.73)
This is a contradiction. □
The following theorem says that the first part of Example A.8 on page 574 is
the most general form of absolute continuity with respect to σ-finite measures.
The proof is mostly borrowed from Royden (1968).
Theorem A.74 (Radon–Nikodym theorem). Let μ₁ and μ₂ be measures on
(S, 𝒜) such that μ₂ ≪ μ₁ and μ₁ is σ-finite. Then there exists an extended real-valued
measurable function f : S → [0, ∞] such that for every A ∈ 𝒜,

μ₂(A) = ∫_A f(x)dμ₁(x).   (A.75)

The function f is called the Radon–Nikodym derivative of μ₂ with respect to μ₁,
and it is unique a.e. [μ₁]. The Radon–Nikodym derivative is sometimes denoted
(dμ₂/dμ₁)(s). If μ₂ is σ-finite, then f is finite a.e. [μ₁].
PROOF. First, we prove uniqueness a.e. [μ₁]. Suppose that such an f exists.
Let g be another function such that f and g are not a.e. [μ₁] equal. Let Aₙ =
{x : f(x) > g(x) + 1/n} and Bₙ = {x : f(x) < g(x) − 1/n}. Since f and
g are not equal a.e. [μ₁], there exists n such that either μ₁(Aₙ) > 0 or
μ₁(Bₙ) > 0. Let A be a subset of either Aₙ or Bₙ with finite positive measure.
Then ∫_A f(x)dμ₁(x) ≠ ∫_A g(x)dμ₁(x). Hence g ≠ dμ₂/dμ₁.
The proof of existence proceeds as follows. First, we show that we can reduce to the case in which μ₁ is finite. Then, we create a collection of signed measures ν_α indexed by a real number α. For each α we find a set A^α such that every subset of A^α has nonnegative ν_α measure and every subset of the complement B^α has nonpositive ν_α measure. We then show that B^β ⊆ B^α for β ≥ α, which allows us to define f(x) = sup{α : x ∈ B^α}. Finally, we show that f satisfies (A.75) and (A.76).
Now, we prove that we need only consider finite μ₁. Since μ₁ is σ-finite, let {Aᵢ}_{i=1}^∞ be disjoint elements of A such that μ₁(Aᵢ) < ∞ and S = ∪_{i=1}^∞ Aᵢ. Let μ_{j,i} be μⱼ restricted to Aᵢ for j = 1, 2 and each i. Then μ_{2,i} ≪ μ_{1,i} for each i, and each μ_{1,i} is finite. Suppose that for each i we can find fᵢ as in the theorem with μⱼ replaced by μ_{j,i} for j = 1, 2. Then f(x) = Σ_{i=1}^∞ I_{Aᵢ}(x) fᵢ(x) is the function required by the theorem as stated. Hence, we prove the theorem only for the case in which μ₁ is finite.
Suppose that μ₁ is finite, and define the signed measure ν_α = αμ₁ − μ₂ for each nonnegative rational number α. (Note that ν_α(A) never equals ∞, although it may equal −∞.) For each α, define

P_α = {A ∈ A : ν_α(B) ≥ 0 for all measurable B ⊆ A},  λ_α = sup{ν_α(A) : A ∈ P_α}.

That is, λ_α is the supremum of the signed measures of sets all of whose subsets have nonnegative signed measure.³² Since ∅ ∈ P_α, λ_α ≥ 0. Let {Aᵢ}_{i=1}^∞ ⊆ P_α be such that λ_α = lim_{i→∞} ν_α(Aᵢ), and let A^α = ∪_{i=1}^∞ Aᵢ. Since every subset of A^α can be written as a union of subsets of the Aᵢ, it follows that A^α ∈ P_α, hence λ_α ≥ ν_α(A^α). Since A^α \ Aᵢ ⊆ A^α, it follows that ν_α(A^α \ Aᵢ) ≥ 0 for all i, and ν_α(A^α) = ν_α(A^α \ Aᵢ) + ν_α(Aᵢ) ≥ ν_α(Aᵢ) for all i. It follows that λ_α ≤ ν_α(A^α). Hence λ_α = ν_α(A^α) < ∞. Define B^α = (A^α)^c.
Next, we prove that every subset of B^α has nonpositive measure.³³ If not, let B ⊆ B^α be such that ν_α(B) > 0. If B has no subsets with negative signed measure,

³²The sets in P_α are often called the positive sets relative to the signed measure ν_α.
³³Such sets are called negative sets relative to the signed measure ν_α.
A.6. Absolute Continuity 599
then B ∪ A^α ∈ P_α and ν_α(A^α ∪ B) > λ_α, a contradiction. So, let n₁ be the smallest positive integer such that there is a subset B₁ ⊆ B with ν_α(B₁) < −1/n₁. For each k > 1, let nₖ be the smallest positive integer such that there exists a subset Bₖ ⊆ B \ ∪_{i=1}^{k−1} Bᵢ with ν_α(Bₖ) < −1/nₖ. Now, let C = B \ ∪_{k=1}^∞ Bₖ. Clearly ν_α(C) > 0. If we prove that C has no subsets with negative signed measure, then C ∈ P_α and we have another contradiction. So, suppose that D ⊆ C has ν_α(D) = −ε < 0. Since ν_α(B) > 0, it must be that Σ_{k=1}^∞ ν_α(Bₖ) > −∞. Hence lim_{k→∞} nₖ = ∞. So, there is k such that 1/(n_{k+1} − 1) < ε. Notice that D ⊆ C ⊆ B \ ∪_{i=1}^k Bᵢ. Since ν_α(D) < −1/(n_{k+1} − 1), this contradicts the definition of n_{k+1}.
If β > α, we have

ν_α(A^α ∩ B^β) ≥ 0 and ν_β(A^α ∩ B^β) ≤ 0.

Subtract the first inequality from the second to get (β − α)μ₁(A^α ∩ B^β) ≤ 0, from which it follows that μ₁(A^α ∩ B^β) = 0. Since ν_β(A) ≥ ν_α(A) for β ≥ α, we can assume that A^α ⊆ A^β if β ≥ α. It follows that B^β ⊆ B^α for β ≥ α, and we can define f(x) = sup{α : x ∈ B^α}. Since B⁰ = S, f(x) ≥ 0 for all x. It is easy to see that f(x) ≥ α if x ∈ B^α and f(x) ≤ α if x ∈ A^α. It is also easy to see that {x : f(x) > b} = ∪_{α>b} B^α. Since this is a countable union of measurable sets, it is measurable. By Lemma A.35, f is measurable.
Next, we prove that (A.75) holds for every A ∈ A. Let A ∈ A be arbitrary, and let ε > 0 be given. Let N > μ₁(A)/ε be a positive integer. Define Eₖ = A ∩ B^{k/N} ∩ A^{(k+1)/N} and E_∞ = A \ ∪_{k=1}^∞ A^{k/N}. Then A = ∪_{k=0}^∞ Eₖ ∪ E_∞, and the Eⱼ are all disjoint. So μ₂(A) = μ₂(E_∞) + Σ_{k=0}^∞ μ₂(Eₖ). By construction, f(x) ∈ [k/N, (k+1)/N] for all x ∈ Eₖ and f(x) = ∞ for all x ∈ E_∞. Since ν_{k/N}(Eₖ) ≤ 0 and ν_{(k+1)/N}(Eₖ) ≥ 0, we have, for finite k,

|μ₂(Eₖ) − ∫_{Eₖ} f(x) dμ₁(x)| ≤ (1/N) μ₁(Eₖ). (A.77)

If μ₁(E_∞) > 0, then μ₂(E_∞) = ∞, since ν_α(E_∞) ≤ 0 for all α. If μ₁(E_∞) = 0, then μ₂(E_∞) = 0 by absolute continuity. Either way, μ₂(E_∞) = ∫_{E_∞} f(x) dμ₁(x). Adding this into the sum of (A.77) over all finite k gives
{x ∈ C₀ : (dQ₀/dλ)(x) = 0} ⊆ ∪_{n=1}^∞ {x ∈ Cₙ : (dQₙ/dλ)(x) = 0},

which implies that C₀ ∈ V and λ(C₀) = c.
Since Q₀ ∈ Q, we now need only prove that ν ≪ Q₀ for all ν ∈ N to finish the proof. Suppose that Q₀(A) = 0 and ν ∈ N. We must prove ν(A) = 0. Since Q₀(A ∩ C₀) = 0 and (dQ₀/dλ)(x) > 0 for all x ∈ C₀, it follows that λ(A ∩ C₀) = 0 and hence ν(A ∩ C₀) = 0. Let C = {x : (dν/dλ)(x) > 0}. Then ν(A ∩ C₀^c ∩ C^c) = 0, since (dν/dλ)(x) = 0 for x ∈ C^c. Let D = A ∩ C₀^c ∩ C, which is disjoint from C₀. If λ(D) > 0, then λ(C₀ ∪ D) > λ(C₀) and D ∈ V. It follows easily that C₀ ∪ D ∈ V, and λ(C₀ ∪ D) > λ(C₀) contradicts λ(C₀) = c. Hence λ(D) = 0 and ν(D) = 0, which implies ν(A) = ν(A ∩ C₀) + ν(A ∩ C₀^c ∩ C^c) + ν(D) = 0. □
There is a chain rule for Radon–Nikodym derivatives.

Theorem A.79 (Chain rule).³⁵ Let μ, ν, and η be σ-finite measures and suppose that μ ≪ ν ≪ η. Then, for every A ∈ A, it follows from (A.76) that

μ(A) = ∫_A (dμ/dν)(s) dν(s) = ∫_A (dμ/dν)(s) (dν/dη)(s) dη(s).
³⁴This theorem is used in the proofs of Lemmas 2.15 and 2.24. It appears as Theorem 2 in Appendix 3 of Lehmann (1986) and is attributed to Halmos and Savage (1949).
³⁵This theorem is used in the proof of Lemma 2.15.
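For measures given by point masses on a finite set, the chain rule is pointwise multiplication of the two derivative functions. A sketch (all weights are illustrative):

```python
# Chain rule for Radon-Nikodym derivatives on a finite set with
# mu << nu << eta: dmu/deta = (dmu/dnu) * (dnu/deta) pointwise.
S = [0, 1, 2]
eta = {0: 1.0, 1: 2.0, 2: 4.0}
nu  = {0: 0.5, 1: 1.0, 2: 1.0}   # nu << eta
mu  = {0: 0.25, 1: 0.5, 2: 0.0}  # mu << nu

dnu_deta = {x: nu[x] / eta[x] for x in S}
dmu_dnu  = {x: mu[x] / nu[x] for x in S}
dmu_deta = {x: mu[x] / eta[x] for x in S}

for x in S:
    assert abs(dmu_deta[x] - dmu_dnu[x] * dnu_deta[x]) < 1e-12
```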
If g is integrable, then

∫ g(y) dμ₂(y) = ∫ g(f(x)) dμ₁(x). (A.82)

That (A.82) is true for all nonnegative simple functions follows by adding the far ends of this equation (multiplied by positive constants). The monotone convergence theorem A.52 allows us to extend the equality to all nonnegative integrable functions. By subtraction, we can extend to all integrable functions. □

Definition A.83. The measure μ₂ in Theorem A.81 is called the measure induced on (S₂, A₂) by f from μ₁.

If the measure μ₁ in Theorem A.81 is not finite, and the function f is not one-to-one, the measure μ₂ may not be very interesting.
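For a discrete μ₁, the induced measure of Definition A.83 is μ₂(B) = μ₁(f⁻¹(B)), and (A.82) becomes a change of variables in a sum. A sketch with illustrative point masses:

```python
# Induced (pushforward) measure: mu2(B) = mu1(f^{-1}(B)), together with the
# change-of-variables identity: integral of g d(mu2) = integral of g(f) d(mu1).
mu1 = {-2: 0.25, -1: 0.125, 1: 0.125, 2: 0.5}  # point masses on S1
f = lambda x: x * x                             # f : S1 -> S2

mu2 = {}
for x, p in mu1.items():
    mu2[f(x)] = mu2.get(f(x), 0.0) + p
assert mu2 == {4: 0.75, 1: 0.25}                # 0.25+0.5 and 0.125+0.125

g = lambda y: 3 * y + 1
lhs = sum(g(y) * p for y, p in mu2.items())     # integral of g w.r.t. mu2
rhs = sum(g(f(x)) * p for x, p in mu1.items())  # integral of g(f) w.r.t. mu1
assert abs(lhs - rhs) < 1e-12
```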
Example A.84. Let S₁ = ℝ², S₂ = ℝ, μ₁ equal Lebesgue measure on ℝ², and f(x, y) = x. Let the two σ-fields be Borel σ-fields. The measure μ₂ that f induces on (S₂, A₂) from μ₁ is the following. If A ∈ A₂ and the Lebesgue measure of A is 0, then μ₂(A) = 0. Otherwise, μ₂(A) = ∞. Although μ₂ is absolutely continuous with respect to Lebesgue measure, it is not σ-finite. The only functions g that are integrable with respect to μ₂ are those that are almost everywhere 0.
If μ₁ is σ-finite, there is a way to avoid the problem in Example A.84 by making use of the following result.

Theorem A.85.³⁶ A measure μ on a space (S, A) is σ-finite if and only if there exists an integrable function f : S → ℝ such that f > 0, a.e. [μ].

PROOF. For the "if" part, let f be as in the statement of the theorem. Let 0 < ∫ f(s) dμ(s) = c < ∞. Let Aₙ = {s : 1/n ≤ f(s) < 1/(n − 1)}, for n = 1, 2, .... We see that A₁ = {s : f(s) ≥ 1} and S = ∪_{n=1}^∞ Aₙ. We can write
Example A.86 (Continuation of Example A.84; see page 601). Let h(x, y) = exp(−[x² + y²]/2). It is known that h is integrable with respect to μ₁, and h is everywhere strictly positive. Let μ₁*(C) = ∫_C h(x, y) dμ₁(x, y). Then μ₁* ≪ μ₁ and μ₁ ≪ μ₁*. The measure μ₂* induced on (S₂, A₂) from μ₁* by f(x, y) = x is μ₂*(B) = √(2π) ∫_B exp(−x²/2) dx. A function g : S₂ → ℝ is integrable with respect to μ₂* if and only if exp(−x²/2) g(x) is integrable with respect to Lebesgue measure.
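The "only if" direction of Theorem A.85 can be illustrated for counting measure on {0, 1, ..., N} with Aₙ = {n}: a function such as f(s) = 2⁻ˢ/(1 + μ(Aₛ)) is strictly positive with bounded integral. A sketch (the truncation at N is for illustration only):

```python
# "Only if" direction of Theorem A.85, sketched for counting measure on
# {0,1,...,N}: with A_n = {n}, f(s) = 2^(-s) / (1 + mu(A_s)) is strictly
# positive and its integral is bounded by sum_n 2^(-n) <= 2.
N = 50
mu_An = 1.0                      # mu(A_n) = 1 for counting measure, A_n = {n}
f = [2.0 ** (-n) / (1.0 + mu_An) for n in range(N + 1)]

assert all(v > 0 for v in f)     # f > 0 everywhere
assert sum(f) < 2.0              # integrable: partial sums are bounded
```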
A.7 Problems
Section A.2:
1. Let S be a set and let A be the collection of all subsets of S that either are countable or have countable complement. Prove that A is a σ-field.
2. Prove Proposition A.10 on page 575.
A.7. Problems 603
3. Prove Proposition A.13 on page 576. (Hint: First, show that every open ball in ℝᵏ is the union of countably many open rectangles. Then prove that the smallest σ-field containing open balls must be the same as the smallest σ-field containing open rectangles.)
4. Prove that B⁺ defined on page 571 is a σ-field of subsets of the extended real numbers.
5. Prove Proposition A.15 on page 576.
6. Prove Proposition A.16 on page 576.
7. *Let F : ℝ → ℝ be a nondecreasing function that is continuous from the right. For each interval (a, b], define μ((a, b]) = F(b) − F(a).
(a) Suppose that {(aₙ, bₙ]}_{n=1}^∞ is a sequence of disjoint intervals such that ∪_{n=1}^∞ (aₙ, bₙ] ⊆ (a, b]. Prove that Σ_{n=1}^∞ μ((aₙ, bₙ]) ≤ μ((a, b]). (Hint: Prove it for finite collections and take a limit.)
(b) Suppose that {(aₙ, bₙ]}_{n=1}^∞ is a sequence of disjoint intervals such that (a, b] ⊆ ∪_{n=1}^∞ (aₙ, bₙ]. Prove that Σ_{n=1}^∞ μ((aₙ, bₙ]) ≥ μ((a, b]). (Hint: First, prove it for finite collections by induction. For infinite collections, let μ((a, b]) > ε > 0. Cover a compact interval [a + δ, b] with finitely many open intervals (aₙ, bₙ + δₙ) such that |μ((a, b]) − μ((a + δ, b])| < ε/2 and |Σ_{n=1}^∞ μ((aₙ, bₙ]) − Σ_{n=1}^∞ μ((aₙ, bₙ + δₙ])| < ε/2. This can be done by using continuity from the right.)
(c) Prove that μ is countably additive on the smallest field containing intervals of the form (a, b]. (Hint: Deal separately with finite and semi-infinite intervals.)
8. A measure space (S, A, μ) is complete if A ⊆ B ∈ A and μ(B) = 0 implies A ∈ A. Let (S, C, μ) be a measure space, and let D = {D : ∃A, C ∈ C with D △ A ⊆ C and μ(C) = 0}. For each D ∈ D, define μ*(D) = μ(A), where D △ A ⊆ C and μ(C) = 0. Show that μ* is well defined and that (S, D, μ*) is a complete measure space.
Section A.3:
Section A.4:
(b) Let ε > 0. Prove that there exists a simple function g such that for all measures μ satisfying μ(S) = 1, |∫ f(x) dμ(x) − ∫ g(x) dμ(x)| < ε.
20. Prove the following alternative type of monotone convergence theorem: Let {fₙ}_{n=1}^∞ be a sequence of integrable functions such that fₙ(x) converges monotonically to f(x) a.e. [μ]. Then ∫ f(x) dμ(x) is defined and ∫ f(x) dμ(x) = lim_{n→∞} ∫ fₙ(x) dμ(x). (Hint: Use the dominated convergence theorem A.57 on the positive parts of fₙ and the monotone convergence theorem A.52 on the negative parts, or vice versa, depending on whether the convergence is from above or below.)
21. Let (S, A, μ) be a measure space, let {gₙ}_{n=1}^∞ be a sequence of integrable functions that converges a.e. [μ], and let g be another integrable function. Suppose that for all C ∈ A,

lim_{n→∞} ∫_C gₙ(s) dμ(s) = ∫_C g(s) dμ(s).
Section A.5:
Section A.6:
Probability Theory
B.1 Overview
B.1.1 Mathematical Probability
The measure theoretic definition of probability is that a measure space (S, A, μ) is called a probability space and μ is called a probability if μ(S) = 1. Each element of A is called an event. A measurable function X from S to some other space (X, B) is called a random quantity. The most popular type of random quantity is a random variable, which occurs when X is ℝ with the Borel σ-field. The probability measure μ_X induced on (X, B) by X from μ is called the distribution of X.

Example B.1. Let S = X = ℝ with Borel σ-field. Let f be a nonnegative function such that ∫ f(x) dx = 1. Define μ(A) = ∫_A f(x) dx and X(s) = s. Then X is a continuous random variable with density f, and μ_X = μ. If we let ν denote Lebesgue measure, then μ_X ≪ ν with dμ_X/dν = f.

Example B.2. Let S = ℝ with Borel σ-field. Let X = {x₁, x₂, ...} be a countable set. Let f be a nonnegative function defined on X such that Σ_{i=1}^∞ f(xᵢ) = 1. Define μ(A) = Σ_{i:xᵢ∈A} f(xᵢ). Then X is a discrete random variable with probability mass function f, and μ_X = μ. If we let ν denote counting measure on X, then μ_X ≪ ν with dμ_X/dν = f.
B.1. Overview 607
In both of these examples, we will say that f is the density of X with respect to ν.

When there is one probability space (S, A, μ) from which all other probabilities are induced by way of random quantities, the probability in that one space will be denoted Pr. So, for example, if μ_X is the distribution of a random quantity X and if B ∈ B, then Pr(X ∈ B) = μ(X⁻¹(B)) = μ_X(B).

The expected value or mean or expectation of a random variable X is defined (and denoted) as E(X) = ∫ x dμ_X(x), if the integral exists, where μ_X is the distribution of X. If X is a vector of random variables (called a random vector), then E(X) will stand for the vector with coordinates equal to the means of the coordinates of X.

The (in)famous law of the unconscious statistician, B.12, is very useful for calculating means of functions of random quantities. It says that E[f(X)] = ∫ f(x) dμ_X(x). For example, the variance of a random variable X with mean c is Var(X) = E([X − c]²), which can be calculated as ∫ (x − c)² dμ_X(x). The covariance between two random variables X and Y with means c_X and c_Y, respectively, is Cov(X, Y) = E([X − c_X][Y − c_Y]).
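These definitions are finite sums for a discrete distribution. A sketch checking Var(X) = E([X − c]²) = E(X²) − c² with c = E(X) (the joint distribution is an illustrative choice):

```python
# Mean, variance, and covariance computed directly from a discrete joint
# distribution, illustrating E[f(X)] = sum_x f(x) mu_X({x}).
joint = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.375}

EX    = sum(x * p for (x, y), p in joint.items())
EY    = sum(y * p for (x, y), p in joint.items())
VarX  = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
EX2   = sum(x * x * p for (x, y), p in joint.items())
CovXY = sum((x - EX) * (y - EY) * p for (x, y), p in joint.items())

assert abs(sum(joint.values()) - 1.0) < 1e-12   # a genuine probability
assert abs(VarX - (EX2 - EX ** 2)) < 1e-12      # Var(X) = E(X^2) - (EX)^2
```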
B.1.2 Conditioning
We begin with a heuristic derivation of the important concepts using the special
case of discrete random quantities. Afterwards, we define the important terms in
a more rigorous way.
Consider the case of two random quantities X and Y, each of which assumes at most countably many distinct values, X ∈ X = {x₁, ...} and Y ∈ Y = {y₁, ...}. Let p_{ij} = Pr(X = xᵢ, Y = yⱼ), with p_{i|j} = p_{ij}/p_{·j} and p_{·j} = Σ_{i=1}^∞ p_{ij}. Then

E(f(X)|Y = yⱼ) = Σ_{i=1}^∞ f(xᵢ) p_{i|j}.

From the conditional distribution, we could define a measure on (X, 2^X) by

μ_{X|Y}(A|yⱼ) = Σ_{xᵢ∈A} p_{i|j}.
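The displayed formulas can be computed directly from a joint pmf. A sketch (the pmf is illustrative, and `cond_exp` is a hypothetical helper name):

```python
# E(f(X) | Y = y_j) = sum_i f(x_i) p_{i|j}, where p_{i|j} = p_ij / p_{.j}.
p = {(1, 'a'): 0.2, (2, 'a'): 0.2, (1, 'b'): 0.1, (2, 'b'): 0.5}
f = lambda x: x * x

def cond_exp(yj):
    p_dot_j = sum(q for (x, y), q in p.items() if y == yj)  # Pr(Y = y_j)
    return sum(f(x) * q / p_dot_j for (x, y), q in p.items() if y == yj)

assert abs(cond_exp('a') - 2.5) < 1e-12        # (1*0.2 + 4*0.2) / 0.4

# Law of total probability: E[E(f(X)|Y)] = E[f(X)].
total = sum(cond_exp(yj) * sum(q for (x, y), q in p.items() if y == yj)
            for yj in {'a', 'b'})
assert abs(total - sum(f(x) * q for (x, y), q in p.items())) < 1e-12
```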
608 Appendix B. Probability Theory
∫_B g(y) dμ_Y(y) = Σ_{yⱼ∈B} Σ_{i=1}^∞ f(xᵢ) p_{i|j} p_{·j} = E(f(X) I_B(Y))

in general. This property will be used as the property that defines conditional expectation. Through the definition of conditional expectation, we will define conditional probability and conditional distributions in general.
Theorem B.21 says that, in general, if a random variable X has finite mean and if C is a sub-σ-field of A, then a function g : S → ℝ exists which is measurable with respect to the σ-field C and such that

∫_B X(s) dμ(s) = ∫_B g(s) dμ(s), for all B ∈ C. (B.3)

This is the general version of what we worked out above for discrete random variables, in which C was the σ-field generated by Y. We will use the symbol E(X|C) to stand for the function g. The two important features that E(X|C) possesses are that it is measurable with respect to the σ-field C and that it satisfies (B.3). Any function that equals E(X|C) a.s. [μ] will also satisfy (B.3), so there may be many functions that satisfy the definition of conditional expectation. All such functions are called versions of the conditional expectation. When we say that a random variable equals E(X|C), we will mean that it is a version of E(X|C).

Notice that we can set B = S in (B.3), and the equation becomes E(X) = E[E(X|C)]. This result is called the law of total probability. A useful generalization is given in Theorem B.70.

If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, some special notation is introduced. We saw in Theorem A.42 that a function is measurable with respect to the σ-field generated by Y if and
μ(A) = ∫_A (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₁ ds₂.

Suppose that X(s) = s₁ and Y(s) = s₂ when s = (s₁, s₂). Now E(|X|) = √(2/π) < ∞. We claim that g(s) = s₂/2 and h(t) = t/2 satisfy the conditions required to be E(X|Y)(s) and E(X|Y = t), respectively. First, note that the σ-field generated by Y is A_Y = {ℝ × C : C is Borel measurable}, and μ_Y is the measure with density exp(−t²/2)/√(2π). It is clear that any measurable function of s₂ alone is A_Y measurable. Let B = ℝ × C, so that E(X I_B(Y)) equals

∫_{−∞}^∞ ∫_C (1/(√3 π)) s₁ exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₂ ds₁
= ∫_C ∫_{−∞}^∞ s₁ (√2/√(3π)) exp{−(2/3)(s₁ − s₂/2)²} (1/√(2π)) exp{−s₂²/2} ds₁ ds₂
= ∫_C (s₂/2) (1/√(2π)) exp{−s₂²/2} ds₂
= ∫_C (s₂/2) dμ_Y(s₂).
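Assuming the bivariate normal density above (unit variances, correlation 1/2), the claim E(X|Y = t) = t/2 can be checked by numerical integration over s₁; the grid and tolerance below are ad hoc choices:

```python
import math

# For the bivariate normal density with unit variances and correlation 1/2,
# check numerically that the conditional mean of X given Y = t is t/2.
def joint(s1, s2):
    return math.exp(-2.0 * (s1 * s1 + s2 * s2 - s1 * s2) / 3.0) / (math.sqrt(3.0) * math.pi)

def cond_mean(t, lo=-10.0, hi=10.0, n=4000):
    h = (hi - lo) / n
    num = sum((lo + i * h) * joint(lo + i * h, t) for i in range(n)) * h
    den = sum(joint(lo + i * h, t) for i in range(n)) * h
    return num / den    # normalizing constant cancels in the ratio

for t in [-1.0, 0.0, 0.5, 2.0]:
    assert abs(cond_mean(t) - t / 2.0) < 1e-6
```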
where σ² = σ₁² + σ₂² + 2ρσ₁σ₂. The pair (X, Y) does not have a joint density with respect to Lebesgue measure on ℝ³, but it does have a joint density with respect to the measure ν on ℝ³ defined as follows. For each A ⊆ ℝ³, let A′ = {(x₁, x₂) : (x₁, x₂, x₁ + x₂) ∈ A}. Let ν(A) = λ₂(A′), where λₖ is Lebesgue measure on ℝᵏ for k = 1, 2. Then f_{X,Y}(x, y) = f_X(x) is the joint density of (X, Y) with respect to ν, and

f_X(x)/f_Y(y) = (1/(√(2π) σ*)) exp(−(1/(2σ*²)) (x₁ − μ₁ − [σ₁² + ρσ₁σ₂](y − μ₁ − μ₂)/σ²)²),

¹See Problem 25 on page 664. If X = ℝᵏ, the same idea can be used.
That is, Xₙ →ᴰ X if and only if the joint CDFs Fₙ of Xₙ converge to the joint CDF F of X at all points at which F is continuous. Since we will not need to use this characterization, we will not prove it.
³This theorem is used in making sense of the notation E_θ when introducing parametric models.
μ = ∫ x dF(x) ≥ ∫_{[c,∞)} x dF(x) ≥ c ∫_{[c,∞)} dF(x) = c Pr(X ≥ c).

Divide the extreme parts by c to get the result. □
The following well-known inequality follows trivially from the Markov inequality B.15.
Corollary B.16 (Tchebychev's inequality).⁵ Suppose that X is a random variable with finite variance σ² and finite mean μ. Then, for all c > 0,

Pr(|X − μ| ≥ c) ≤ σ²/c².
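Both the Markov and Tchebychev bounds can be checked on a small discrete distribution (the values and probabilities are illustrative):

```python
# Check Markov's and Tchebychev's inequalities on a discrete distribution.
dist = {0.0: 0.5, 1.0: 0.25, 4.0: 0.25}   # nonnegative random variable X

mean = sum(x * p for x, p in dist.items())
var  = sum((x - mean) ** 2 * p for x, p in dist.items())

for c in [0.5, 1.0, 2.0, 3.0]:
    pr_tail = sum(p for x, p in dist.items() if x >= c)
    assert pr_tail <= mean / c + 1e-12                 # Markov
    pr_dev = sum(p for x, p in dist.items() if abs(x - mean) >= c)
    assert pr_dev <= var / c ** 2 + 1e-12              # Tchebychev
```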
Another well-known inequality involves convex functions.⁶ The proof of this theorem resembles the proofs in Ferguson (1967) and Berger (1985).

Theorem B.17 (Jensen's inequality).⁷ Let g be a convex function defined on a convex subset X of ℝᵏ, and suppose that Pr(X ∈ X) = 1. If E(X) is finite, then E(X) ∈ X and g(E(X)) ≤ E(g(X)).
PROOF. First, we prove that E(X) ∈ X by induction on the dimension of X. Without loss of generality, we can assume that E(X) = 0, since we can subtract E(X) from X and from every element of X, and E(X) ∈ X if and only if 0 ∈ X − E(X). If k = 0, then X = {0} and E(X) = 0. Suppose that 0 ∈ X for all X with dimension strictly less than m ≤ k. Now suppose that X and X have dimension m and 0 ∉ X. Since X and {0} are disjoint convex sets, the separating hyperplane theorem C.5 says that there is a nonzero vector v and a constant c such that, for every x ∈ X, vᵀx ≤ c and 0 ≥ c.⁸ If we let Y = vᵀX, then we have Pr(Y ≤ c) = 1 and E(Y) = 0 ≥ c. It follows that Pr(Y = c) = 1 and c = 0. Hence, X lies in the (m − 1)-dimensional convex set Z = X ∩ {x : vᵀx = 0}. It follows that 0 ∈ Z ⊆ X.

Next, we prove the inequality by induction on k. For k = 0, E(g(X)) = g(E(X)), since X is degenerate. Suppose that the inequality holds for all dimensions up to m − 1 < k. Let X have dimension m. Define the subset of ℝ^{m+1},

X′ = {(x, z) : x ∈ X, z ∈ ℝ, and g(x) ≤ z}.
Let (x₁, z₁), (x₂, z₂) ∈ X′, let α ∈ [0, 1], and set y = αx₁ + (1 − α)x₂ and w = αz₁ + (1 − α)z₂. Since αg(x₁) + (1 − α)g(x₂) ≥ g(y) and w ≥ αg(x₁) + (1 − α)g(x₂), it follows that (y, w) ∈ X′, so X′ is convex. It is also clear that (E(X), g(E(X))) is a boundary point of X′. The supporting hyperplane theorem C.4 says that there is a vector v = (v_x, v_z) such that, for all (x, z) ∈ X′, v_xᵀx + v_z z ≥ v_xᵀE(X) + v_z g(E(X)). Since (x, z₁) ∈ X′ implies (x, z₂) ∈ X′ for all z₂ > z₁, it cannot be that v_z < 0, since then lim_{z→∞} v_xᵀx + v_z z = −∞, a contradiction. Since (x, g(x)) ∈ X′ for all x ∈ X, it follows that v_xᵀX + v_z g(X) ≥ v_xᵀE(X) + v_z g(E(X)), from which we conclude

v_z g(E(X)) ≤ v_xᵀ[X − E(X)] + v_z g(X). (B.18)

Taking expectations of both sides of this gives v_z g(E(X)) ≤ v_z E(g(X)). If v_z > 0, the proof is complete. If v_z = 0, then (B.18) becomes 0 ≤ v_xᵀ[X − E(X)], which implies v_xᵀ[X − E(X)] = 0 with probability 1. Hence X lies in an (m − 1)-dimensional space, and the induction hypothesis finishes the proof. □
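Jensen's inequality g(E(X)) ≤ E(g(X)) can be checked numerically for a few convex functions (the distribution is an illustrative choice):

```python
# Jensen's inequality g(E X) <= E g(X) for convex g, checked on a
# discrete distribution.
dist = {-1.0: 0.25, 0.0: 0.25, 2.0: 0.5}

def expect(h):
    return sum(h(x) * p for x, p in dist.items())

EX = expect(lambda x: x)
for g in [lambda x: x * x, lambda x: abs(x), lambda x: 2.718281828 ** x]:
    assert g(EX) <= expect(g) + 1e-12
```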
The famous Cauchy–Schwarz inequality for vectors⁹ has a probabilistic version.

Theorem B.19 (Cauchy–Schwarz inequality).¹⁰ Let X₁ and X₂ be two random vectors of the same dimension such that E(‖Xᵢ‖²) is finite for i = 1, 2. Then

[E(|X₁ᵀX₂|)]² ≤ E(‖X₁‖²) E(‖X₂‖²). (B.20)

PROOF. Let Z = 1 if X₁ᵀX₂ ≥ 0 and Z = −1 if X₁ᵀX₂ < 0. Let Y = ‖X₁ + cZX₂‖², where c = −√(E‖X₁‖²/E‖X₂‖²). Then Y ≥ 0 and Z² = 1. So
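The inequality (B.20) can be checked numerically for random vectors taking finitely many values (the outcomes and probabilities are illustrative):

```python
# Probabilistic Cauchy-Schwarz: (E|X1'X2|)^2 <= E||X1||^2 * E||X2||^2,
# checked for a pair of random vectors with finitely many outcomes.
outcomes = [  # (X1, X2, probability)
    ((1.0, 0.0), (2.0, 1.0), 0.25),
    ((-1.0, 2.0), (0.5, -1.0), 0.25),
    ((0.0, 1.0), (1.0, 3.0), 0.5),
]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

lhs = sum(abs(dot(x1, x2)) * p for x1, x2, p in outcomes) ** 2
rhs = (sum(dot(x1, x1) * p for x1, x2, p in outcomes)
       * sum(dot(x2, x2) * p for x1, x2, p in outcomes))
assert lhs <= rhs + 1e-12
```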
B.3 Conditioning
B.3.1 Conditional Expectations
Section B.1.2 contains a heuristic derivation of the important concepts in condi-
tioning using the special case of discrete random quantities. We now turn to a
more general presentation.
Theorem B.21.¹¹ Let (S, A, μ) be a probability space, and suppose that X : S → ℝ is a measurable function with E(|X|) < ∞. Let C be a sub-σ-field of A. Then there exists a C measurable function g : S → ℝ which satisfies

∫_B X(s) dμ(s) = ∫_B g(s) dμ(s), for all B ∈ C. (B.3)

PROOF. Use Theorem A.54 to construct two measures μ₊ and μ₋ on (S, C):

μ₊(B) = ∫_B X⁺(s) dμ(s) and μ₋(B) = ∫_B X⁻(s) dμ(s), for B ∈ C.

It is clear that μ₊ ≪ μ and μ₋ ≪ μ. The Radon–Nikodym theorem A.74 tells us that there are C measurable functions g₊ and g₋ such that

μ₊(B) = ∫_B g₊(s) dμ(s) and μ₋(B) = ∫_B g₋(s) dμ(s), for B ∈ C.

The function g = g₊ − g₋ is C measurable and satisfies (B.3). □

Each such function is called a version of the conditional mean. If Y : S → Y and C is the sub-σ-field generated by Y, then E(X|C) is also called the conditional mean of X given Y, denoted E(X|Y). If, in addition, the σ-field of subsets of Y contains singletons, let h : Y → ℝ be the function such that g = h(Y). Then h(t) is denoted by E(X|Y = t).
When we say that a random variable equals E(X|Y), we will mean that it is a version of E(X|Y). The following propositions are immediate consequences of the above definitions.
Proposition B.24. Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space such that C contains singletons. Let X : S → ℝ and Y : S → Y be measurable. Let μ_Y be the measure on Y induced from μ by Y. A function g : Y → ℝ is a version of E(X|Y = t) if and only if, for all B ∈ C, ∫_B g(t) dμ_Y(t) = E(X I_B(Y)).
B.3. Conditioning 617
Proposition B.25. If Z and W are both versions of E(X|C), then Z = W, a.s. If X is C measurable, then E(X|C) = X, a.s.

Proposition B.26. If C = {S, ∅}, the trivial σ-field, then E(X|C) = E(X).

Proposition B.28. Let (S, A, μ) be a probability space and let X : S → ℝ, Y : S → (Y, B₁), and Z : S → (Z, B₂) be measurable functions. Let μ_Y and μ_Z be the measures induced on Y and Z by Y and Z, respectively, from μ. Suppose that E(|X|) < ∞ and that Z is a one-to-one function of Y, that is, there exists a bimeasurable h : Y → Z such that Z = h(Y). Then E(X|Y = y) = E(X|Z = h(y)), a.s. [μ_Y].
Conditional probability is the special case of conditional expectation in which X = I_A.

Definition B.29. Let (S, A, μ) be a probability space. For each A ∈ A, the conditional probability of A given C (or given Y if C is the σ-field generated by Y) is Pr(A|C) = E(I_A|C). If Pr(·|C)(s) is a probability on (S, A) for all s ∈ S, then the conditional probabilities given C are called regular conditional probabilities.

It turns out that under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C) in such a way that they are regular conditional probabilities. In the future, we will assume that this is done in all such cases. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y) as in the discussion following Corollary B.22.
If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X.

Definition B.30. Let (S, A, μ) be a probability space and let (X, B) be a measurable space. Suppose that X : S → X is a measurable function. Let P be the probability on (X, B) induced by X from μ. Let C be a sub-σ-field of A. For each B ∈ B, let P(B|C) = Pr(A|C), where A = X⁻¹(B). We say that any set of functions from S to [0, 1] of the form
We can use Problem 3 on page 662 once again to prove that μ(N_q) = 0 for all q, hence μ(N) = 0. Similarly, we can show that μ(L) = 0, where L is the set

L = {s : lim_{r→−∞, r rational} Pr(Z ≤ r|C)(s) ≠ 0} ∪ {s : lim_{r→∞, r rational} Pr(Z ≤ r|C)(s) ≠ 1}.
We would like to prove that all Polish spaces are Borel spaces. First, we prove that ℝ^∞ is a Borel space (Lemma B.36). Then we prove that there exist bimeasurable maps between Polish spaces and measurable subsets of ℝ^∞ (Lemma B.40). The following simple proposition pieces these results together.

Proposition B.35. If X is a Borel space and there exists a bimeasurable function f : Y → X, then Y is a Borel space.

Lemma B.36. The infinite product space ℝ^∞ is a Borel space.

PROOF. The idea of the proof¹⁴ is the following. We start by transforming each coordinate to the interval (0,1) using a continuous function with continuous inverse. For each number in (0,1) we find a base 2 expansion, which is a sequence of 0s and 1s. We then take these sequences (one for each coordinate) and merge them into a single sequence, which we then interpret as the base 2 expansion of a number in (0,1). If this sequence of transformations is bimeasurable, we have our function φ.
Let ψ : ℝ^∞ → (0,1)^∞ be defined coordinatewise by a continuous, strictly increasing function from ℝ onto (0,1) with continuous inverse. For x ∈ (0,1), let Y₀(x) = x and, for j ≥ 1,

Zⱼ(x) = 1 if 2Y_{j−1}(x) ≥ 1, and 0 if not;  Yⱼ(x) = 2Y_{j−1}(x) − Zⱼ(x).

For each j, Zⱼ is a measurable function. It is easy to see that Zⱼ(x) is the jth digit in a base 2 expansion of x with infinitely many 0s. Note also that Yⱼ(x) ∈ [0, 1) for all j and x.

Create the following triangular array of integers:

1
2 3
4 5 6
7 8 9 10
11 12 13 14 15

Let the jth integer from the top of the ith column be ℓ(i, j). Then ℓ(i, j) = (i + j − 1)(i + j − 2)/2 + i. Clearly, each integer t appears once and only once as ℓ(i, j) for some i and j.¹⁵
Define

h(x₁, x₂, ...) = Σ_{i=1}^∞ Σ_{j=1}^∞ Zⱼ(xᵢ) 2^{−ℓ(i,j)}. (B.37)

Then h is clearly a measurable function from (0,1)^∞ to a subset R of (0,1). There is a countable subset of (0,1) which is not in the image of h. These are the numbers with only finitely many 0s in one or more of the subsequences {ℓ(i, j)}_{j=1}^∞ of their base 2 expansion, for i = 1, 2, .... For example, the number c = Σ_{i=0}^∞ 2^{−i(i+1)/2−1} is not in R.¹⁶ Since the complement of a countable set is measurable, the set R is measurable.

We define φ = h ∘ ψ. If we can show that h has a measurable inverse, the proof is complete. For each x ∈ R, define

φᵢ(x) = Σ_{j=1}^∞ Z_{ℓ(i,j)}(x) 2^{−j}. (B.38)

Combining (B.37), (B.38), (B.39), and the fact that every integer appears once and only once as ℓ(i, j) for some i and j, we see that h(φ₁(x), φ₂(x), ...) = x, so that (φ₁, φ₂, ...) is the inverse of h and it is measurable. □
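The digit-merging step can be sketched with finitely many binary digits. The toy version below interleaves digits round-robin rather than by the triangular-array ordering ℓ(i, j) used in the proof, which suffices to illustrate why merging and splitting are inverse to one another:

```python
# Toy version of the digit-interleaving argument: merge the binary digit
# sequences of several numbers in (0,1) into one sequence and split it back.
def digits(x, d):
    out = []
    for _ in range(d):
        x *= 2
        bit = int(x >= 1)
        out.append(bit)
        x -= bit
    return out

def undigits(bits):
    return sum(b * 2.0 ** (-(i + 1)) for i, b in enumerate(bits))

xs = [0.5, 0.3125, 0.75]   # dyadic rationals, so float arithmetic is exact
d = 20
merged = []
for j in range(d):          # round-robin interleave of the digit sequences
    for x in xs:
        merged.append(digits(x, d)[j])

split = [[merged[j * len(xs) + i] for j in range(d)] for i in range(len(xs))]
recovered = [undigits(b) for b in split]
assert recovered == xs
```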
Lemma B.40.¹⁷ If (X, B) is a Polish space with the Borel σ-field and metric d, then it is a Borel space.

PROOF. All we need to prove is that there exists a bimeasurable f : X → G, where G is a measurable subset of ℝ^∞. We then use Lemma B.36 and Proposition B.35. Let {xₙ}_{n=1}^∞ be a countable dense subset of X, and let d be the metric on X. Define the function f : X → ℝ^∞ by f(x) = (d(x, x₁), d(x, x₂), ...).
¹⁵It is easy to check the following. For each integer t, let k = inf{n : t ≤ n(n+1)/2}. Then r(t) = 1 + k(k+1)/2 − t and s(t) = k + 1 − r(t) have the property that ℓ(s(t), r(t)) = t, s(ℓ(i, j)) = i, and r(ℓ(i, j)) = j.
¹⁶This number corresponds to having 1s in the first column of the triangular array but nowhere else. Clearly, 0 < c < 1, but it is impossible to have 1s in the entire first column, since this would require x₁ = 1. Even if x₁ = 1 had been allowed, its base 2 expansion would have ended in infinitely many 0s rather than infinitely many 1s.
¹⁷This proof is adapted from p. 219 of Billingsley (1968) and Theorem 15.8 of Royden (1968).
all n. Since {xₙ}_{n=1}^∞ is dense, there exists a subsequence {x_{n_j}}_{j=1}^∞ such that lim_{j→∞} x_{n_j} = x. It follows that 0 = lim_{j→∞} d(x, x_{n_j}) = lim_{j→∞} d(y, x_{n_j}); hence lim_{j→∞} x_{n_j} = y, and y = x.

Next, we prove that f⁻¹ : f(X) → X is continuous. Suppose that a sequence of points {f(yₙ)}_{n=1}^∞ converges to f(y). Let lim_{j→∞} x_{n_j} = y. Then lim_{j→∞} d(y, x_{n_j}) = 0. But d(y, x_{n_j}) is the n_j coordinate of f(y), which in turn is the limit (as n → ∞) of the n_j coordinate of f(yₙ). For each j, d(yₙ, y) ≤ d(yₙ, x_{n_j}) + d(y, x_{n_j}). Let ε > 0 and let j be large enough so that d(y, x_{n_j}) < ε/2. Now, let N be large enough so that n ≥ N implies d(yₙ, x_{n_j}) < d(y, x_{n_j}) + ε/2. It follows that, if n ≥ N, d(yₙ, y) < ε. Hence lim_{n→∞} yₙ = y, and f⁻¹ is continuous, hence measurable.

Finally, we will prove that the image G of f is a measurable subset of ℝ^∞. We will do this by proving that G is the intersection of countably many open subsets of its closure Ḡ.¹⁸ Let Gₙ be the following set:

{x ∈ ℝ^∞ : there exists a neighborhood O_x of x with d(a, b) ≤ 1/n for all a, b ∈ f⁻¹(O_x)}.

Since O_x ⊆ Gₙ for each x ∈ Gₙ, Gₙ is open. Also, since f and f⁻¹ are continuous, it is easy to see that G ⊆ Gₙ for all n. Let G′ = Ḡ ∩ ∩_{n=1}^∞ Gₙ. For each x ∈ G′, let O_{x,n} ⊆ Gₙ be such that O_{x,1} ⊇ O_{x,2} ⊇ ··· and that d(a, b) ≤ 1/n for all a, b ∈ f⁻¹(O_{x,n}). Note that f⁻¹(O_{x,n}) ⊇ f⁻¹(O_{x,n+1}) for all n. If yₙ ∈ f⁻¹(O_{x,n}) for every n, then {yₙ}_{n=1}^∞ is a Cauchy sequence, since n, m ≥ N implies d(yₙ, yₘ) ≤ 1/N. Hence, there is a limit y to the sequence. It is easy to see that if there were two such sequences with limits y and y′, then d(y, y′) < ε for all ε > 0; hence y = y′. So we can define a function h : G′ → X by h(x) = y. If x ∈ G, then clearly h(x) = f⁻¹(x). If x′ ∈ O_{x,n}, then d(h(x), h(x′)) ≤ 1/n, so h is continuous. We now prove that G′ ⊆ G, which implies that G = G′ and the proof will be complete. Let x ∈ G′, and let xₙ ∈ G be such that xₙ → x. (This is possible since G′ ⊆ Ḡ.) Since h is continuous, f⁻¹(xₙ) → h(x). If yₙ = f⁻¹(xₙ) and y = h(x), then yₙ → y and f(yₙ) → f(y) ∈ G, since f is continuous. But f(yₙ) = xₙ, so f(y) = x, and the proof is complete. □
Next, we show that products of Borel spaces are Borel spaces.
Lemma B.41. Let (Xₙ, Bₙ) be a Borel space for each n. The product spaces ∏_{i=1}^n Xᵢ for all finite n and ∏_{n=1}^∞ Xₙ with product σ-fields are Borel spaces.

PROOF. We will prove the result for the infinite product. The proofs for finite products are similar. If Xₙ = ℝ for all n, the result is true by Lemma B.36. For general Xₙ, let φₙ : Xₙ → Rₙ and φ* : ℝ^∞ → R* be bimeasurable, where Rₙ and R* are measurable subsets of ℝ. Then, it is easy to see that

φ : ∏_{n=1}^∞ Xₙ → φ*(∏_{n=1}^∞ Rₙ), φ(x₁, x₂, ...) = φ*(φ₁(x₁), φ₂(x₂), ...),

is bimeasurable.

¹⁸We use the symbol Ḡ to stand for the closure of the set G. The closure of a subset G of a topological space is the smallest closed set containing G. A set is closed if and only if its complement is open.
Lemma B.42.¹⁹ Let C[0,1] be the set of all bounded continuous functions from [0,1] to ℝ. Let ρ(f, g) = sup_{x∈[0,1]} |f(x) − g(x)|. Then ρ is a metric on C[0,1], and C[0,1] is a Polish space.

PROOF. That ρ is a metric is easy to see. To see that C[0,1] is separable, let Dₖ be the set of functions that take on rational values at the points 0, 1/k, ..., (k−1)/k, 1 and are linear between these values. Let D = ∪_{k=1}^∞ Dₖ. The set D is countable. Every continuous function on a compact set is uniformly continuous, so let f ∈ C[0,1] and ε > 0. Let δ be small enough so that |x − y| < δ implies |f(x) − f(y)| < ε/4. Let k be larger than 1/δ. There exists g ∈ Dₖ such that |g(i/k) − f(i/k)| < ε/4 for each i = 0, ..., k. For i/k < x < (i+1)/k, |f(x) − f(i/k)| < ε/4 and |g(x) − g(i/k)| < ε/2, so |f(x) − g(x)| < ε. To see that C[0,1] is complete, let {fₙ}_{n=1}^∞ be a Cauchy sequence. Then, for all x, {fₙ(x)}_{n=1}^∞ is a Cauchy sequence of real numbers that converges to some number f(x). We need to show that the convergence of fₙ to f is uniform. To the contrary, assume that there exists ε such that, for each n, there is xₙ such that |fₙ(xₙ) − f(xₙ)| > ε. We know that there exists n such that m > n implies |fₙ(x) − fₘ(x)| < ε/2 for all x. In particular, |fₙ(xₙ) − fₘ(xₙ)| < ε/2 for all m > n. Since lim_{m→∞} fₘ(xₙ) = f(xₙ), it follows that there exists m such that |fₘ(xₙ) − f(xₙ)| < ε/2, a contradiction. □
Because Borel spaces have σ-fields that look just like the Borel σ-field of the real numbers, their σ-fields are generated by countably many sets. The countable field that generates the Borel σ-field of ℝ is the collection of all sets that are unions of finitely many disjoint intervals (including degenerate ones and infinite ones) with rational endpoints.

Proposition B.43.²⁰ Let (X, B) be a Borel space. Then there exists a countable field C such that B is the smallest σ-field containing C.

Because a field is a π-system, Theorem A.26 and Proposition B.43 imply the following.

Corollary B.44. Let (X, B) be a Borel space, and let C be a countable field that generates B. If μ₁ and μ₂ are σ-finite measures on B that agree on C, then they agree on B.
f_Y(y) = ∫ f_{X,Y}(x, y) dν_X(x),

Note that

∫ I_{A×B}(x, y) dμ_{X|Y}(x|y) = I_B(y) μ_{X|Y}(A|y).  (B.48)

It follows from (B.48) that η and μ agree on the collection of all product sets (a π-system that generates B₁ ⊗ B₂). Theorem A.26 implies that they agree on B₁ ⊗ B₂.
BI @ B2. By linearity of integrals and the monotone convergence theorem
A.52,
if 9 is nonnegative, then
1 h(x,y)d v(x,y) = I
h(X,y)
f(x,y/( x,y)dv (x,y)= Ih(X,y ) (
f(x,y)d JLX,y
)
(B.50)
where the second equality follows from the fact that dJLldv = I,
the third fol-
lows from the fact that JL and 1/ are the same measure, and the
fourth follows
from (B.49). If h is integrable with respect to v, then (B.50) applies
to h+,
h-, and Ihl, and all three results are finite. Also, f Ih(x, y)ldvxl y(xly)
is mea-
surable and vy({y : flh(x,y )ldvxly (xly) = oo}) = o. So f h+(x,y)
dvxly(x l)
and f h-(x,y) dvXly( xly) are both finite almost surely, and their
difference is
f hex, y)dvXly(xly), a measurable function. It now follows that (B.47) holds. 0
The measures ν_Y and ν_{X|Y} in Theorem B.46 are not unique. In the proof, we could easily have defined ν_Y several ways, such as ν_Y(A) = ∫_A g(y) dμ_Y(y) for any strictly positive function g with finite μ_Y integral. A corresponding adjustment would have to be made to the definition of ν_{X|Y}:

μ_{X|Y}(A|y) = ∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y).
²¹The condition that the joint distribution have a density with respect to a measure ν in Theorem B.52 is always met, since ν can be taken equal to the joint distribution. The theorem applies even if ν is not the joint distribution, however.
B.3. Conditioning 627
measure the conditional distributions are all dominated by the same measure. (See Problem 15 on page 663.) In general, however, the conditional distribution of X given Y = y is dominated by a measure that depends on y. For example, if Y = g(X), the joint distribution of (X, Y) is not dominated by a product measure even if the distribution of X is dominated. (See also Problem 7 on page 662.) Nevertheless, we have the following result.
Corollary B.55.²² Let (S, A, μ) be a probability space, let (Y, B₂) be a measurable space such that B₂ contains all singletons, and let (X, B₁) be a Borel space with ν_X a σ-finite measure on (X, B₁). Let X : S → X and g : X → Y be measurable functions. Let Y = g(X). Suppose that the distribution of X has density f_X with respect to ν_X. Define ν on (X × Y, B₁ ⊗ B₂) by ν(C) = ν_X({x : (x, g(x)) ∈ C}). Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Let the probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_{X,Y} ≪ ν with Radon–Nikodym derivative f_{X,Y}(x, y) = f_X(x) I_{{g(x)}}(y). Also, the conditions of Theorem B.46 hold, and we can write

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫ I_{{g(x)}}(y) f_X(x) dν_{X|Y}(x|y),

f_{X|Y}(x|y) = { f_X(x)/f_Y(y)   if y = g(x),
                0               otherwise.
X₁ = r cos(θ₁),
X₂ = r sin(θ₁) cos(θ₂),
⋮
X_{n−1} = r sin(θ₁) ⋯ cos(θ_{n−1}),
X_n = r sin(θ₁) ⋯ sin(θ_{n−1}).

The Jacobian is r^{n−1} j(θ), where j is some function of θ alone. The Jacobian for the transformation to v and θ is v^{(n/2)−1} j(θ)/2. The integral of j(θ) over all θ

²²This corollary is used in the proof of Theorem 2.86 and in Example 3.106.
²³The calculation in this example is used again in Example 4.121.
628 Appendix B. Probability Theory
f_V(v) = [π^{n/2} v^{(n/2)−1} h(v)] / [2Γ(n/2)].

The conditional density of X given V = v is then

f_{X|V}(x|v) = [2Γ(n/2) v^{1−(n/2)} / π^{n/2}] I_{{v}}(xᵀx)

with respect to the measure ν_{X|V}(C|v) = ∫_{C′} v^{(n/2)−1} j(θ) dλ_{n−1}(θ)/2, where

μ_{X|V}(C|v) = [Γ(n/2) / π^{n/2}] ∫_{C′} j(θ) dλ_{n−1}(θ).

It is easy to see that μ_{X|V}(·|v) is the uniform distribution over the sphere of radius √v in n dimensions.
every n and every collection of sets A₁ ∈ A_{i₁}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{j=1}^n A_j | Y) = ∏_{j=1}^n Pr(A_j | Y), a.s.  (B.58)

If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are independent. Under the same conditions as above, if all of the conditional distributions of the X_i given Y are the same, then we say {X_i}_{i∈N} are conditionally IID given Y. If, in addition, Y is constant almost surely, we say {X_i}_{i∈N} are IID.
Example B.59. Let F be a joint CDF of n random variables X₁, ..., X_n, and let μ be the corresponding measure on ℝⁿ. Then μ is a product measure if and only if X₁, ..., X_n are independent (see Proposition B.66).
Example B.60 (Continuation of Example B.56; see page 627).²⁴ Transform to (Y, V), where Y = X/V^{1/2}. Then, the conditional distribution of Y given V is given by

μ_{Y|V}(D|v) = [Γ(n/2) / π^{n/2}] ∫_{D′} j(θ) dλ_{n−1}(θ),

where D′ = {θ : (cos(θ₁), ..., sin(θ₁) ⋯ sin(θ_{n−1})) ∈ D}. We note that this formula does not depend on v; hence Y is independent of V. In addition, it is easy to see that μ_{Y|V}(·|v) is just the uniform distribution over the sphere of radius 1 in n dimensions.
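The conclusion of this example can be illustrated by simulation. The sketch below assumes a standard normal vector X (a convenient spherically symmetric density, i.e., one depending on x only through xᵀx) and checks that Y = X/V^{1/2} lies on the unit sphere and is empirically uncorrelated with V = XᵀX.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
X = rng.standard_normal((100_000, n))   # density depends on x only through x^T x
V = (X ** 2).sum(axis=1)                # V = X^T X
Y = X / np.sqrt(V)[:, None]             # Y = X / V^(1/2), a point on the unit sphere

print(np.allclose((Y ** 2).sum(axis=1), 1.0))   # every Y has norm 1
# Independence check (necessary condition): correlations between V and the
# coordinates of Y should all be near 0.
print(max(abs(np.corrcoef(V, Y[:, j])[0, 1]) for j in range(n)))
```

Correlation near zero is of course only a symptom of the full independence proved in the example, but it is the part that is cheap to check numerically.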
The use of conditional independence in predictive inference is based on the
following theorem.
Theorem B.61.²⁶ Let N be an index set, let Y and {X_i}_{i∈N} be a collection of random quantities, and let A_i be the σ-field generated by X_i. Then {X_i}_{i∈N} are conditionally independent given Y if and only if for every n and m and every set of distinct indices i₁, ..., i_n, j₁, ..., j_m and every collection of sets A₁ ∈ A_{i₁}, ..., A_n ∈ A_{i_n}, we have

Pr(∩_{i=1}^n A_i | Y, X_{j₁}, ..., X_{j_m}) = ∏_{i=1}^n Pr(A_i | Y), a.s.  (B.62)
PROOF. For the "if" part, we will assume (B.62) and prove (B.58) by induction on n. For n = 1, there is nothing to prove. Assuming (B.58) is true for all n ≤ k, we now prove it for n = k + 1. Let A_j ∈ A_{i_j} for j = 1, ..., k + 1. According to (B.62) and (B.58) for n = k, we have, for B ∈ A_Y,

Pr(B ∩_{i=1}^{k+1} A_i) = Pr(B ∩ A_{k+1} ∩_{i=1}^k A_i)
  = ∫_B Pr(A_{k+1}|Y)(s) ∏_{i=1}^k Pr(A_i|Y)(s) dμ(s) = ∫_B ∏_{i=1}^{k+1} Pr(A_i|Y)(s) dμ(s).

The equality of the first and last terms above for all B ∈ A_Y means that ∏_{i=1}^{k+1} Pr(A_i|Y) = Pr(∩_{i=1}^{k+1} A_i | Y), a.s., which is what we need to complete the induction.
For the "only if" part, we will assume (B.58) and prove (B.62). For a function g to be the left-hand side of (B.62), it must be measurable with respect to the σ-field A_{Y,m} generated by Y, X_{j₁}, ..., X_{j_m}, and satisfy

∫_C g(s) dμ(s) = Pr(C ∩_{i=1}^n A_i)  (B.63)

for all C ∈ A_{Y,m}. Clearly, the right-hand side of (B.62) is measurable with respect to A_{Y,m}. If C = C_Y ∩ C_X, where C_Y ∈ A_Y and C_X is in the σ-field generated by X_{j₁}, ..., X_{j_m}, then

∫_C g(s) dμ(s) = ⋯

Combining these gives that |∫_C g(s) dμ(s) − Pr(C ∩_{i=1}^n A_i)| < ε. Since ε is arbitrary, (B.63) holds for all C ∈ A_{Y,m}. □
A particular case of interest involves three random quantities. Theorem B.64 says that when there are only two Xs in Theorem B.61, we can check conditional independence by checking only one of the equations of the form (B.62).

Theorem B.64.²⁶ Let X, Y, and Z be three random quantities, and let A_X, A_Y, and A_Z be the σ-fields generated by each of them. Suppose that for all A ∈ A_X, Pr(A|Y, Z) = Pr(A|Y). Then X and Z are conditionally independent given Y.

PROOF. We need to check that for every A ∈ A_X and B ∈ A_Z, Pr(A ∩ B|Y) = Pr(A|Y) Pr(B|Y). Equivalently, for all such A and B, and all C ∈ A_Y, we must show
(B.69)
Since A ∈ C, it follows that A ∈ C_{n+1}; hence A and C_k are independent for every k. It follows that μ(C_k ∩ A) = μ(A) μ(C_k). It follows from (B.69) that μ(A) = μ(A)², and hence either μ(A) = 0 or μ(A) = 1. □

The σ-field C in Theorem B.68 is often called the tail σ-field of the sequence {X_n}_{n=1}^∞. An interesting feature of the tail σ-field is that limits are measurable with respect to it.²⁸ (See Problem 21 on page 663.)
where the last equality follows since T = E(Z|B) and C ∈ B. Since E(T|C) is C measurable, equating the first and last entries of the above string of equations means that E(T|C) satisfies the condition required for it to equal E(Z|C). □

When B and C are the σ-fields generated by two random quantities X and Y, respectively, C ⊆ B means Y is a function of X. So, Theorem B.70 can be rewritten in this case.

Corollary B.71. Let X : S → U₁, Y : S → U₂, and Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Suppose that Y is a function of X. Then, E(Z|Y) = E{E(Z|X)|Y}, a.s. [μ].

The most popular special case of this corollary occurs when Y is constant.

Corollary B.72.²⁹ Let (S, A, μ) be a probability space. Let X : S → U₁ and Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Then, E(Z) = E{E(Z|X)}.

This is the special case of Theorem B.70 in which C is the trivial σ-field.
The following theorem implies that if a conditional mean given X depends on X only through h(X), then it is also the conditional mean given h(X).

Theorem B.73.³⁰ Let (S, A, μ) be a probability space and let B and C be sub-σ-fields of A with C ⊆ B. Let Z : S → ℝ be measurable such that E(|Z|) <

²⁸The tail σ-field will play a role in the proofs of Corollary 1.63 and Theorem 1.49.
²⁹This corollary is used in the proof of Theorem B.75.
³⁰This theorem is used in the proofs of Theorems 1.49 and 2.6.
To prove this we first note that f(s) = h(X₁(s), X₂(s)) is measurable with respect to the σ-field generated by (X₁, X₂), A_{X₁,X₂}. All that remains is to show that it satisfies the integral condition required to be E(Z|X₁, X₂). That is, for all C ∈ A_{X₁,X₂},

E(Z I_C) = ∫_C f(s) dμ(s).  (B.76)

Let μ₂ be the measure on (X₂, B₂) induced by X₂ from μ. First, suppose that C = A ∩ B, where A ∈ A_{X₁} and B ∈ A_{X₂}. The last hypothesis of the theorem says that for all A ∈ A_{X₁}, E(Z I_A | X₂ = y) = ∫_A h(X₁(s), y) dμ_{(2)}(s|y). If μ_{1|2}(·|y) is the probability on (X₁, B₁) induced by X₁ from μ_{(2)}(·|y), then μ_{1|2}(·|y) is also the conditional distribution of X₁ given X₂ = y as in Theorem B.46. Suppose
B.4 Limit Theorems

There are several types of convergence that will be of interest to us. They involve sequences of random quantities or sequences of distributions.
lim_{n→∞} ∫_B p_n(x) dν(x) = ∫_B p(x) dν(x), for all B ∈ B.
PROOF. Let δ_n(x) = p_n(x) − p(x), and let δ_n⁺ and δ_n⁻ be its positive and negative parts. Clearly, both lim_{n→∞} δ_n⁺ = 0 and lim_{n→∞} δ_n⁻ = 0, a.e. [ν]. Since 0 ≤ δ_n⁻ ≤ p is true, it follows from the dominated convergence theorem A.57 that lim_{n→∞} ∫_B δ_n⁻(x) dν(x) = 0 for all B. Since both p_n and p are densities, ∫_X δ_n(x) dν(x) = 0 for all n. It follows that lim_{n→∞} ∫_X δ_n⁺(x) dν(x) = 0. Since I_B(x) δ_n⁺(x) ≤ δ_n⁺(x) for all x, it follows from Proposition A.58 that

lim_{n→∞} ∫_B δ_n⁺(x) dν(x) = 0.

So, lim_{n→∞} ∫_B [p_n(x) − p(x)] dν(x) = 0 for all B. □
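This theorem (Scheffé's) can be watched in action numerically. The sketch below uses the standard identity that sup_B |∫_B p_n dν − ∫_B p dν| equals half the L1 distance between the densities; the particular densities (N(1/n, 1) converging pointwise to N(0, 1)) are an assumption for illustration.

```python
import numpy as np

# Densities p_n of N(1/n, 1) converge pointwise to the density p of N(0, 1).
xs = np.linspace(-10, 10, 20001)
dx = xs[1] - xs[0]
p = np.exp(-xs**2 / 2) / np.sqrt(2 * np.pi)

def sup_over_sets(n):
    """sup over B of |∫_B p_n - ∫_B p|, computed as half the L1 distance."""
    pn = np.exp(-(xs - 1 / n)**2 / 2) / np.sqrt(2 * np.pi)
    return 0.5 * np.sum(np.abs(pn - p)) * dx

for n in [1, 4, 16, 64]:
    print(n, round(sup_over_sets(n), 5))
```

The printed bound shrinks toward 0, so the convergence of ∫_B p_n to ∫_B p is in fact uniform over measurable B, which is the content of the proof.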
Since defining convergence requires a topology, the following definitions require that the random quantities lie in various types of topological spaces.

Definition B.80. Let {X_n}_{n=1}^∞ be a sequence of random quantities and let X be another random quantity, all taking values in the same topological space X. If lim_{n→∞} E(f(X_n)) = E(f(X)) for every bounded continuous function f : X → ℝ, then we say that X_n converges in distribution to X, which is written X_n →^D X.

Convergence in distribution is sometimes defined in terms of probability measures. The reason is that if X_n →^D X, the actual values of X_n and of X do not play any role in the convergence. All that matters is the distributions of X_n and of X.
Definition B.81. Let {P_n}_{n=1}^∞ be a sequence of probability measures on a topological space (X, B) where B contains all open sets. Let P be another probability on (X, B). We say that P_n converges weakly³⁵ to P (denoted P_n →^w P) if, for each bounded continuous function g : X → ℝ, lim_{n→∞} ∫ g(x) dP_n(x) = ∫ g(x) dP(x).
³⁵This is not exactly the same as the concept of weak convergence in normed linear spaces [see, for example, Dunford and Schwartz (1957), p. 419]. The collection of all probability measures on a space (X, B) can be considered a subset of a normed linear space C consisting of all finite signed measures ν (see Definition A.15) with the norm being sup_{B∈B} |ν(B)|. Weak convergence of a sequence {ν_n}_{n=1}^∞ in this space would require the convergence of L(ν_n) for every bounded linear functional L on C. Every bounded measurable function g on (X, B) determines a bounded linear functional L_g on C by L_g(ν) = ∫ g(x) dν(x), where the integral with respect to a signed measure can be defined as in Problem 27 on page 605. Hence, weak convergence of a sequence of probability measures would require convergence of the means of all bounded measurable functions. In particular, lim_{n→∞} P_n(B) = P(B) for all measurable sets B, not just those for which P assigns 0 probability to the boundary (see the portmanteau theorem B.83 on page 636). Alternatively, we can consider the set of bounded continuous functions f : X → ℝ as a normed linear space N with ‖f‖ = sup_x |f(x)|. Then the set of finite signed measures C is a set of bounded linear functionals on N using the definition ν(f) = ∫ f(x) dν(x). Weak* convergence of a sequence {ν_n}_{n=1}^∞ in C to ν is defined as the convergence of ν_n(f) to ν(f) for all f ∈ N. This is precisely convergence in distribution. Hence, it would make more sense to call convergence in distribution weak* convergence rather than weak convergence. Since the tradition in probability theory is to call it weak convergence, we will continue to do so.
It is easy to see that these two types of convergence are the same.

Proposition B.82. Let P_n be the distribution of X_n, and let P be the distribution of X. Then, X_n →^D X if and only if P_n →^w P.

Since we will usually be dealing with spaces X that are metric spaces, there are some equivalent ways to define convergence in distribution or weak convergence. The proofs of Theorems B.83 and B.88 are adapted from Billingsley (1968).
Theorem B.83 (Portmanteau theorem).³⁶ The following are all equivalent in a metric space:

1. P_n →^w P;
2. limsup_{n→∞} P_n(B) ≤ P(B) for each closed B;
3. liminf_{n→∞} P_n(A) ≥ P(A) for each open A;
4. lim_{n→∞} P_n(C) = P(C) for each C with P(∂C) = 0.³⁷
PROOF. Let d be the metric in the metric space. First, assume (1) and let B be a closed set. Let δ > 0 be given. For each ε > 0, define C_ε = {x : d(x, B) ≤ ε}, where d(x, B) = inf_{y∈B} d(x, y). Since |d(x, B) − d(y, B)| ≤ d(x, y), we see that d(x, B) is continuous in x. Each C_ε is closed and ∩_{ε>0} C_ε = B. Let ε be small enough so that P(C_ε) ≤ P(B) + δ. Let f : ℝ → ℝ be

f(t) = { 1       if t ≤ 0,
         1 − t   if 0 < t < 1,
         0       if t ≥ 1,

and define g_ε(x) = f(d(x, B)/ε). Then g_ε is bounded and continuous. So,

lim_{n→∞} ∫ g_ε(x) dP_n(x) = ∫ g_ε(x) dP(x).

It is easy to see that 0 ≤ g_ε(x) ≤ 1, g_ε(x) = 1 for all x ∈ B, and g_ε(x) = 0 for all x ∉ C_ε. Hence, for every δ > 0,
³⁶This theorem is used in the proofs of Theorem B.88 and Lemma 7.19.
³⁷We use the symbol ∂ in front of the name of a subset of a topological space to refer to the boundary of the set. The boundary of a set C in a topological space is the intersection of the closure of the set with the closure of the complement.
exists a sequence {ε_k}_{k=1}^∞ converging to 0 such that P(d(X, B) = ε_k) = 0 for all k. It follows that lim_{n→∞} P_n(C_{ε_k}) = P(C_{ε_k}) for all k. Since P_n(B) ≤ P_n(C_{ε_k}) for every n and k, we have, for every k, limsup_{n→∞} P_n(B) ≤ P(C_{ε_k}). Since P(B) = lim_{k→∞} P(C_{ε_k}), we have (2). So, (2), (3), and (4) are equivalent and (1) implies (2).
All that remains is to prove that (2) implies (1). Assume (2), and let f be a bounded continuous function. Let m < f(x) < M for all x. For each k, let F_{i,k} = {x : f(x) ≤ m + (M − m)i/k} for i = 1, ..., k. Let F_{0,k} = ∅. Each F_{i,k} is closed, since f is continuous. Let G_{i,k} = F_{i,k} \ F_{i−1,k} for i = 1, ..., k. It is easy to see that for every probability Q,

m + (M − m) Σ_{i=1}^k [(i − 1)/k] Q(G_{i,k}) < ∫ f(x) dQ(x) ≤ m + (M − m) Σ_{i=1}^k (i/k) Q(G_{i,k}).

Since Q(G_{i,k}) = Q(F_{i,k}) − Q(F_{i−1,k}) and Q(F_{k,k}) = 1, summation by parts turns this into

M − [(M − m)/k] Σ_{i=1}^k Q(F_{i,k}) < ∫ f(x) dQ(x) ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k Q(F_{i,k}).  (B.84)

For each i,

limsup_{n→∞} P_n(F_{i,k}) ≤ P(F_{i,k}).  (B.85)

Then

∫ f(x) dP(x) ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k P(F_{i,k})
  ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k limsup_{n→∞} P_n(F_{i,k})
  ≤ (M − m)/k + liminf_{n→∞} ∫ f(x) dP_n(x),

where the first inequality follows from the second inequality in (B.84) with Q = P, the second inequality follows from (B.85), and the third inequality follows from the first inequality in (B.84) with Q = P_n. Letting k be arbitrarily large, we get
Since this set must have probability 1, then so too must ∪_{N=1}^∞ (∩_{n=N}^∞ A_{n,ε}) for all ε. By Theorem A.19, it follows that for every ε, lim_{N→∞} Pr(∩_{n=N}^∞ A_{n,ε}) = 1. Hence, for each ε > 0, lim_{n→∞} Pr(A_{n,ε}^c) = 0, which is precisely what it means to say that X_n →^P X.

Next assume that X_n →^P X. Let g : X → ℝ be bounded and continuous with |g(x)| ≤ K for all x. Let ε > 0, and let A be a compact set with Pr(X ∈ A) > 1 − ε/[6K]. A continuous function (like g) on a compact set is uniformly continuous. So let δ > 0 be such that x ∈ A and d(x, y) < δ implies |g(x) − g(y)| < ε/3. Since X_n →^P X, there exists N such that n ≥ N implies Pr(d(X_n, X) < δ) > 1 − ε/[6K]. Let B = {X ∈ A, d(X_n, X) < δ}. It follows that |g(X)I_B − g(X_n)I_B| < ε/3 and, for all n ≥ N, Pr(B) > 1 − ε/[3K]. Also, note that n ≥ N implies

|E g(X) − E[g(X) I_B]| < ε/3,  |E g(X_n) − E[g(X_n) I_B]| < ε/3.
φ_X(t) = ∫ exp(itx) (1/√(2π)) exp(−x²/2) dx = (1/√(2π)) ∫ exp(−([x − it]² + t²)/2) dx = exp(−t²/2).

Similarly, for other normal distributions, N(μ, σ²), the characteristic functions are φ_X(t) = exp(−σ²t²/2 + itμ).
By Theorem B.12, if X has CDF F, then φ_X = φ_F. It is easy to see that the characteristic function exists for every random vector and it has complex absolute value at most 1 for all t. Other facts that follow directly from the definition are the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(itb). If X and Y are independent, φ_{X+Y} = φ_X φ_Y.

The reason that characteristic functions are so useful for proving convergence in distribution is twofold. First, for each characteristic function φ, there is only one CDF F such that φ_F = φ. (See the uniqueness theorem B.106.) Second, characteristic functions are "continuous" as a function of the distribution in the sense of convergence in distribution. That is, X_n →^D X if and only if lim_{n→∞} φ_{X_n}(t) = φ_X(t) for all t.⁴⁰ (See the continuity theorem B.93.)
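The calculation φ_X(t) = exp(−t²/2) for a standard normal X can be checked against an empirical characteristic function. The sketch below is an illustration under the assumption that a Monte Carlo sample stands in for the distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200_000)
ts = np.linspace(-3, 3, 13)
# Empirical characteristic function: the sample mean of exp(i t X).
ecf = np.array([np.mean(np.exp(1j * t * x)) for t in ts])
exact = np.exp(-ts**2 / 2)                 # phi_X(t) for N(0, 1)
print(np.abs(ecf - exact).max())
```

The maximum discrepancy is of order 1/√(sample size), consistent with φ_X being the mean of the bounded function exp(itX).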
f_{a,b,δ}(x) = { 1               if a < x < b,
                1 − (a − x)/δ   if a − δ < x ≤ a,
                1 − (x − b)/δ   if b ≤ x < b + δ,
                0               otherwise.   (B.94)
Note that this function has equal values at a − δ and b + δ. Consider the interval [a − δ, b + δ] as a circle identifying the two endpoints. Now, use the Stone–Weierstrass theorem C.3 to approximate f_{a,b,δ} uniformly to within ε on the circle by f*_{a,b,δ,ε}(x) = Σ_{j=−N}^N b_j exp(2πijx/c), where c = b − a + 2δ. If Y is a random variable, then E f*_{a,b,δ,ε}(Y) is a linear combination of values of the characteristic function of Y. So, we have lim_{n→∞} E f*_{a,b,δ,ε}(X_n^ℓ) = E f*_{a,b,δ,ε}(X^ℓ). Let q > 0, and let a and b be continuity points of F_{X^ℓ} such that F_{X^ℓ}(b) − F_{X^ℓ}(a) = v > q. Let w = v − q. Let δ > 0 be arbitrary, and define a′ = a − δ and b′ = b + δ. Let N be large enough so that n ≥ N implies |E f*_{a,b,δ,w/3}(X_n^ℓ) − E f*_{a,b,δ,w/3}(X^ℓ)| < w/3. If n ≥ N, then

F_{X_n^ℓ}(b′) − F_{X_n^ℓ}(a′) ≥ E f_{a,b,δ}(X_n^ℓ) > E f*_{a,b,δ,w/3}(X_n^ℓ) − w/3
  > E f*_{a,b,δ,w/3}(X^ℓ) − 2w/3 ≥ E f_{a,b,δ}(X^ℓ) − w
  ≥ F_{X^ℓ}(b) − F_{X^ℓ}(a) − w = q.
Now, let g be a bounded continuous function, and suppose that |g(x)| < K for all x. Let ε > 0. For each coordinate X^ℓ of X, let a_ℓ and b_ℓ be continuity points of F_{X^ℓ} such that F_{X^ℓ}(b_ℓ) − F_{X^ℓ}(a_ℓ) > 1 − ε/(7[K + ε/7]k). Let δ > 0 be arbitrary, and define a*_ℓ = a_ℓ − δ, b*_ℓ = b_ℓ + δ, and g*(x) = g(x) ∏_{ℓ=1}^k f_{a_ℓ,b_ℓ,δ}(x^ℓ). Use the Stone–Weierstrass theorem C.3 to approximate g* uniformly to within ε/7 on the rectangle {x : a_ℓ − δ ≤ x^ℓ ≤ b_ℓ + δ} by

Σ_{m₁} ⋯ Σ_{m_k} ⋯

Let N₁ be large enough so that n ≥ N₁ implies F_{X_n^ℓ}(b_ℓ) − F_{X_n^ℓ}(a_ℓ) ≥ 1 − ε/(7[K + ε/7]k) for all ℓ. Let N₂ be large enough so that n ≥ N₂ implies |E g*(X_n) − E g*(X)| < ε/7. Let R be the rectangle R = {x : a_ℓ < x^ℓ ≤ b_ℓ}. Since g* is periodic in every coordinate, it is bounded by K + ε/7 on all of ℝᵏ. If n ≥ max{N₁, N₂}, then |E g(X_n) − E g(X)| is no greater than
We will prove two more limit theorems that make use of the continuity theorem B.93. Suppose that X has finite mean. Since |exp(itx) − 1| ≤ min{|tx|, 2} for all t, x,⁴² and

lim_{t→0} [exp(itx) − 1]/t = ix,  (B.98)

It follows that lim_{n→∞} φ_{Y_n}(t) = exp(−t²σ²/2), and the continuity theorem B.93 finishes the proof. □
There is also a multivariate version of the central limit theorem.

Theorem B.99 (Multivariate central limit theorem).⁴³ Let {X_n}_{n=1}^∞ be a sequence of IID random vectors in ℝᵖ with mean μ and covariance matrix Σ. Then √n(X̄_n − μ) →^D N_p(0, Σ), a multivariate normal distribution.

PROOF. Let Y_n = √n(X̄_n − μ) and let Y ~ N_p(0, Σ). Then Y_n →^D Y if and only if the characteristic function of Y_n converges to that of Y. That is, if and only if, for each λ ∈ ℝᵖ, E exp{iλᵀY_n} → E exp{iλᵀY}. This occurs if and only if, for each λ, λᵀY_n →^D λᵀY. The distribution of λᵀY is N(0, λᵀΣλ), and λᵀY_n is √n times the average of the λᵀ(X_n − μ). By the univariate central limit theorem B.97, λᵀY_n →^D λᵀY. □
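The Cramér–Wold device in this proof can be illustrated by simulation. The sketch below assumes a particular non-normal IID sequence (correlated shifted exponentials, with μ, Σ, and λ chosen arbitrarily for illustration) and checks that the variance of the projection λᵀY_n matches λᵀΣλ.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, reps = 2, 400, 5_000
mu = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0], [0.5, 1.0]])
Sigma = A @ A.T                                      # covariance of each X_i
Z = rng.exponential(1.0, size=(reps, n, p)) - 1.0    # mean 0, identity covariance
X = mu + Z @ A.T                                     # IID vectors, mean mu, cov Sigma
Yn = np.sqrt(n) * (X.mean(axis=1) - mu)              # sqrt(n) (Xbar_n - mu)
lam = np.array([0.3, -0.7])
proj = Yn @ lam                                      # the Cramér–Wold projection
print(round(proj.mean(), 3), round(proj.var(), 3), round(lam @ Sigma @ lam, 3))
```

The projection has mean near 0 and variance near λᵀΣλ, as N(0, λᵀΣλ) requires; a histogram of `proj` would also look normal for n this large.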
There are inversion formulas for characteristic functions which allow us to obtain or approximate the original distributions from the characteristic functions.

Example B.100 (Continuation of Example B.92; see page 640). Let X have distribution N(0, σ²). Then ∫ |φ_X(t)| dt < ∞. In fact,

(1/2π) ∫ exp(−ixt) φ_X(t) dt = (1/2π) ∫ exp(−(σ²/2)[t + ix/σ²]² − x²/(2σ²)) dt
PROOF. Clearly, the function in (B.102) is bounded since φ_X is integrable. Let Y_σ have N_k(0, σ²I_k) distribution. The characteristic function of X + Y_σ is φ_X φ_{Y_σ}.

where the second equality follows from the fact that (B.102) applies to normal distributions. Now suppose that we let σ go to zero. Since φ_X is integrable and φ_{Y_σ}(t) goes to 1 for all t, it follows that the left-hand side of (B.103) converges to the right-hand side of (B.102). It also follows that f_{X+Y_σ} is bounded uniformly in σ and x. Let B be a hypercube such that the probability is 0 that X is in the boundary of B. Then

where the first equality follows from the boundedness of f_{X+Y_σ}, and the second is proven as follows. The difference between ∫_B f_{X+Y_σ}(x) dx and ∫_B f_X(x) dx is the sum over the 2ᵏ corners of the hypercube B of terms like

Σ_{i=1}^k Pr(b_i − Y_{σ,i} < X_i ≤ b_i, Y_{σ,i} > 0) + Pr(b_i < X_i ≤ b_i − Y_{σ,i}, Y_{σ,i} < 0),
Lemma B.105.⁴⁶ Let Y be a random variable such that φ_Y is integrable. Let X be an arbitrary random variable independent of Y. For all finite a < b and c,

Pr(a < X + cY ≤ b) = (1/2π) ∫ [(exp(−ibt) − exp(−iat))/(−it)] φ_X(t) φ_Y(ct) dt.

PROOF. Since φ_Y is integrable and φ_{X+cY}(t) = φ_X(t) φ_Y(ct), it follows that X + cY has an integrable characteristic function. Lemma B.101 says that (B.102) applies to X + cY, hence

f_{X+cY}(x) = (1/2π) ∫ φ_X(t) φ_Y(ct) exp(−itx) dt.

Therefore,

Pr(a < X + cY ≤ b) = (1/2π) ∫_a^b ∫ φ_X(t) φ_Y(ct) exp(−itx) dt dx
  = (1/2π) ∫ φ_Y(ct) φ_X(t) ∫_a^b exp(−itx) dx dt
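The inversion formula of Lemma B.105 can be verified numerically. The sketch below is an illustration with X and Y standard normal and c = 0.5 (chosen so that X + cY ~ N(0, 1 + c²) gives an exact answer to compare against); the integral is approximated by a midpoint rule on a truncated range, which is adequate because φ decays rapidly.

```python
import cmath, math

def phi_normal(t):                     # characteristic function of N(0, 1)
    return cmath.exp(-t * t / 2)

def prob_interval(a, b, c, m=20_000, T=40.0):
    """Pr(a < X + cY <= b) via the inversion integral of Lemma B.105."""
    h = 2 * T / m
    total = 0.0
    for j in range(m):
        t = -T + (j + 0.5) * h         # midpoint rule; t = 0 is never hit
        kernel = (cmath.exp(-1j * a * t) - cmath.exp(-1j * b * t)) / (1j * t)
        total += (kernel * phi_normal(t) * phi_normal(c * t)).real * h
    return total / (2 * math.pi)

a, b, c = -1.0, 2.0, 0.5
s = math.sqrt(1 + c * c)               # X + cY ~ N(0, 1 + c^2)
exact = 0.5 * (math.erf(b / (s * math.sqrt(2))) - math.erf(a / (s * math.sqrt(2))))
print(round(prob_interval(a, b, c), 4), round(exact, 4))
```

The two printed probabilities agree to several decimal places, confirming the formula for this case.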
{x : x_{i₁} ≤ c₁, ..., x_{i_n} ≤ c_n, for some n and some integers i₁, ..., i_n}.

It is clear that X⁻¹(B) ∈ A since it is the intersection of finitely many sets in A. By Theorem A.34, it follows that X⁻¹(B^∞) ⊆ A, so X is measurable with respect to this σ-field.
B.5.2 Martingales⁺

A particular type of stochastic process that is sometimes of interest is a martingale. [For more discussion of martingales, see Doob (1953), Chapter VII.]

Definition B.108. Let (S, A, μ) be a probability space. Let N be a set of consecutive integers. For each n ∈ N, let F_n be a sub-σ-field of A such that F_n ⊆ F_{n+1} for all n such that n and n + 1 are in N. Let {X_n}_{n∈N} be a sequence of random variables such that X_n is measurable with respect to F_n for all n. The sequence of pairs {(X_n, F_n)}_{n∈N} is called a martingale if, for all n such that n and n + 1 are in N, E(X_{n+1}|F_n) = X_n. It is called a submartingale if, for every n, E(X_{n+1}|F_n) ≥ X_n.

Note that a martingale is also a submartingale.

Example B.109. A simple example of a martingale is the following. Let N = {1, 2, ...}, and let {Y_n}_{n=1}^∞ be independent random variables with mean 0. Let X_n = Σ_{i=1}^n Y_i. Let F_n be the σ-field generated by Y₁, ..., Y_n. Then,

E(X_{n+1}|F_n) = X_n + E(Y_{n+1}|F_n) = X_n + E(Y_{n+1}) = X_n.
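The martingale property of this random walk can be checked by simulation. The sketch below assumes ±1 steps (a particular mean-0 choice of the Y_i) and verifies the consequence E(X₆ | X₅ = v) = v by averaging X₆ over paths grouped by their value of X₅:

```python
import numpy as np

rng = np.random.default_rng(3)
steps = rng.choice([-1.0, 1.0], size=(200_000, 6))   # Y_i, independent, mean 0
X = steps.cumsum(axis=1)                             # X_n = Y_1 + ... + Y_n
# E(X_6 | F_5) = X_5 implies E(X_6 | X_5 = v) = v; check by conditioning on X_5.
for v in [-3, -1, 1, 3]:
    grp = X[X[:, 4] == v]                            # paths with X_5 = v
    print(v, round(grp[:, 5].mean(), 2))
```

Each group average of X₆ sits at (approximately) the conditioning value v, as the display in Example B.109 predicts.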
The following result is proven using the same argument as in Example B.111.

Proposition B.113.⁴⁹ If {(X_n, F_n)}_{n∈N} is a martingale, then E|X_n| is nondecreasing in n.

The reader should note that if {(X_n, F_n)}_{n∈N} is a submartingale and if M ⊆ N is a string of consecutive integers, then {(X_n, F_n)}_{n∈M} is also a submartingale. Similarly, if k is an integer (positive or negative) and M = {n : n + k ∈ N}, then {(X′_n, F′_n)}_{n∈M} is a submartingale, where X′_n = X_{n+k} and F′_n = F_{n+k}. This latter is just a shifting of the index set.
There are important convergence theorems that apply to many martingales and submartingales. They say that if the set N is infinite, then limit random variables exist. A lemma is needed to prove these theorems.⁵⁰ It puts a bound on how often a submartingale can cross an interval between two numbers. It is used to show that such crossings cannot occur infinitely often with high probability. (Infinitely many crossings of a nondegenerate interval would imply divergence of the submartingale.)

Lemma B.114 (Upcrossing lemma).⁵¹ Let N = {1, ..., N}, and suppose that {(X_n, F_n)}_{n=1}^N is a submartingale. Let r < q, and define V to be the number of times that the sequence X₁, ..., X_N crosses from below r to above q. Then

E(V) ≤ [E(|X_N|) + |r|] / (q − r).  (B.115)
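The upcrossing count V is a simple path functional, and it may help to see it computed. The sketch below counts completed crossings of a finite sequence from below r to above q (the example path and interval are arbitrary illustrations):

```python
def upcrossings(xs, r, q):
    """Count completed crossings of the sequence xs from below r to above q."""
    count, below = 0, False
    for x in xs:
        if not below and x < r:
            below = True               # the path has dipped below r ...
        elif below and x > q:
            count += 1                 # ... and has now risen above q
            below = False
    return count

path = [0.5, -0.2, 0.1, 1.3, 0.8, -0.5, 2.0, 1.5]
print(upcrossings(path, r=0.0, q=1.0))
```

For this path the count is 2: the excursions −0.2 → 1.3 and −0.5 → 2.0 each complete a crossing of the interval (0, 1).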
Now V(s) is one-half of the largest even m such that T_m(s) ≤ N. Define, for i = 1, ..., N,

Σ_{i=1}^N ∫_{{s : R_i(s)=1}} [Y_i(s) − Y_{i−1}(s)] dμ(s) = Σ_{i=1}^N ∫_{{s : R_i(s)=1}} [E(Y_i|F_{i−1})(s) − Y_{i−1}(s)] dμ(s)
  ≤ Σ_{i=1}^N ∫ [E(Y_i|F_{i−1})(s) − Y_{i−1}(s)] dμ(s)
  = Σ_{i=1}^N [E(Y_i) − E(Y_{i−1})] = E(Y_N),

where the first equality follows from (B.116) and the inequality follows from the fact that {(Y_n, F_n)}_{n=1}^N is a submartingale. It follows that (q − r)E(V) ≤ E(Y_N). Since E(Y_N) ≤ |r| + E(|X_N|), it follows that (B.115) holds. □
The proof of the following convergence theorem is adapted from Chow, Robbins, and Siegmund (1971).

B = ∪_{r<q, r,q rational} {s : X*(s) > q > r ≥ X_*(s)}.

Now, X*(s) > q > r ≥ X_*(s) if and only if the values of X_n(s) cross from being below r to being above q infinitely often. For fixed r and q, we now prove that this has probability 0; hence μ(B) = 0. Let V_n equal the number of times that X₁, ..., X_n cross from below r to above q. According to Lemma B.114,

E(V_n) ≤ [E(|X_n|) + |r|] / (q − r).

The number of times the values of {X_n(s)}_{n=1}^∞ cross from below r to above q equals lim_{n→∞} V_n(s). By the monotone convergence theorem A.52,
By Lemma A.72, we have that for every ε > 0 there exists δ such that μ(A) < δ implies η(A) < ε. By the Markov inequality B.15,

μ(A_{c,n}) ≤ (1/c) E(X_n) = (1/c) E(X),

for all n. Let C = 2E(X)/δ. Then c ≥ C implies μ(A_{c,n}) < δ for all n, so η(A_{c,n}) < ε for all n. □
PROOF OF THEOREM B.118. By Lemma B.119, {X_n}_{n=1}^∞ is a uniformly integrable sequence. Let Y be the limit of the martingale guaranteed by Theorem B.117. Since Y is a limit of functions of the X_n, it is measurable with respect to F_∞. It follows from Theorem A.60 that for every event A, lim_{n→∞} E(X_n I_A) = E(Y I_A). Next, note that, for every A ∈ F_n,

∫_A Y(s) dμ(s) = lim_{n→∞} ∫_A E(X|F_n)(s) dμ(s) = ∫_A X(s) dμ(s),

where the last equality follows from the definition of conditional expectation. Since this is true for every n and every A ∈ F_n, it is true for all A in the field F = ∪_{n=1}^∞ F_n. Since |X| is integrable, we can apply Theorem A.26 to conclude that the equality holds for all A ∈ F_∞, the smallest σ-field containing F. The equality E(X I_A) = E(Y I_A) for all A ∈ F_∞ together with the fact that Y is F_∞ measurable is precisely what it means to say that Y = E(X|F_∞) = X_∞. □
For negatively indexed martingales, there is also a convergence theorem. Some authors refer to negatively indexed martingales in a different fashion, which is often more convenient.

Definition B.120. Let (S, A, μ) be a probability space. For each n = 1, 2, ..., let F_n be a sub-σ-field of A such that F_{n+1} ⊆ F_n for all n. Let {X_n}_{n=1}^∞ be a sequence of random variables such that X_n is measurable with respect to F_n for all n. The sequence of pairs {(X_n, F_n)}_{n=1}^∞ is called a reversed martingale if, for all n, E(X_n|F_{n+1}) = X_{n+1}.

Example B.121. As in Example B.110, we can let {F_n}_{n=1}^∞ be a decreasing sequence of σ-fields, and let E(|X|) < ∞. Define X_n = E(X|F_n). It follows from the law of total probability B.70 that {(X_n, F_n)}_{n=1}^∞ is a reversed martingale.
The following theorem is proven by Doob (1953, Theorem VII 4.2).

Theorem B.122 (Martingale convergence theorem: part II).⁵⁵ Suppose that {(X_n, F_n)}_{n<0} is a martingale. Then X = lim_{n→−∞} X_n exists a.s. and has finite mean.

PROOF. Just as in the proof of Theorem B.117, we let V_n be the number of times that the finite sequence X_n, X_{n+1}, ..., X_{−1} crosses from below a rational r to above another rational q (for n < 0). The upcrossing lemma B.114 says that

E(V_n) ≤ [1/(q − r)] [E(|X_{−1}|) + |r|] < ∞.

lim_{n→−∞} E(|X_n|) = E(|X|).

By Proposition B.113, it follows that E(|X|) < ∞, and so X has finite mean. □
It is usually more convenient to express Theorem B.122 in terms of reversed martingales.

Corollary B.123.⁵⁶ If {(X_n, F_n)}_{n=1}^∞ is a reversed martingale, then lim_{n→∞} X_n exists a.s. and has finite mean.

There is also a version of Lévy's theorem B.118 for reversed martingales.

Theorem B.124 (Lévy's theorem: part II).⁵⁷ Let {F_n}_{n=1}^∞ be a decreasing sequence of σ-fields. Let F_∞ be the intersection ∩_{n=1}^∞ F_n. Let E(|X|) < ∞. Define X_n = E(X|F_n) and X_∞ = E(X|F_∞). Then lim_{n→∞} X_n = X_∞ a.s.

PROOF. It is easy to see that {(X_n, F_n)}_{n=1}^∞ is a reversed martingale and that E(|X₁|) < ∞. By Theorem B.122, it follows that lim_{n→∞} X_n = Y exists and is finite a.s. To prove that Y = X_∞ a.s., note that X_∞ = E(X₁|F_∞) since F_∞ ⊆ F₁. So, we must show that Y = E(X₁|F_∞). Let A ∈ F_∞. Then

∫_A X_n(s) dμ(s) = ∫_A X₁(s) dμ(s),

since A ∈ F_n and X_n = E(X₁|F_n). Once again, using (B.112) and Lemma B.119, it follows that ∫_A Y(s) dμ(s) = ∫_A X₁(s) dμ(s); hence Y = E(X₁|F_∞). □
PROOF. Define X = ∏_{r∈R} X_r and let B be the product σ-field. Say that a set C ∈ B is a finite-dimensional cylinder set if there exists k and r₁, ..., r_k ∈ R and a measurable D ⊆ ∏_{i=1}^k X_{r_i} such that

⁶¹This theorem is used in the proofs of Theorem B.133 and DeFinetti's representation theorem 1.49.
B.5. Stochastic Processes 653
For each permutation π of k items, P_{i₁,...,i_k}(A) = P_{i_{π(1)},...,i_{π(k)}}(B), where

B = {(x_{π(1)}, ..., x_{π(k)}) : (x₁, ..., x_k) ∈ A}.

For each ℓ ∈ R \ {i₁, ..., i_k}, P_{i₁,...,i_k}(A) = P_{i₁,...,i_k,ℓ}(B), where
⁶²In Section 3.3, we give a much more elaborate motivation for the entire apparatus of Bayesian decision theory, which includes mathematical probability as one of its components. An alternative derivation of mathematical probability from operational considerations is given in Chapter 6 of DeGroot (1970).
⁶³There are a few major differences between the approach in this section and DeFinetti's approach, which DeFinetti, were he alive, would be quick to point out. Out of respect for his memory and his followers, we will also try to point out these differences as we encounter them.
B.6. Subjective Probability 655
bounded set, c can be made small enough for this to hold, so long as the agent has some funds available.

Definition B.134. The fair price p of a random quantity X is called its prevision and is denoted P(X). It is assumed, for a bounded random quantity X, that the agent is indifferent between all gambles whose net gain (loss if negative) to the agent is c(X − P(X)) for all c in some symmetric interval around 0.

The symmetric interval around 0 mentioned in the definition of prevision may be different for different random variables. For example, it might stand to reason that the interval corresponding to the random variable 2X would be half as wide as the interval corresponding to X.
Another assumption we make is that if an agent is willing to accept each of a
countable collection of gambles, then the agent is willing to accept all of them
at once, so long as the maximum possible loss is small enough for the agent to
pay.64 An example of countably many gambles, each of which is acceptable but
cannot be accepted together, is the famous St. Petersburg paradox.
Example B.135. Suppose that a fair coin is tossed until the first head appears.
Let N be the number of tosses until the first head appears. For n = 1,2, ... ,
define
if N = n,
otherwise.
Suppose that our agent says that P(X_n) = 1 for all n. For each n, there is c_n < 0
such that the agent is willing to accept c_n(X_n − 1). If −∑_{n=1}^∞ c_n 2^n is too big,
however, the agent cannot accept all of the gambles at once. Similarly, there are
c_n > 0 such that the agent is willing to accept c_n(X_n − 1). If ∑_{n=1}^∞ c_n is too
big, the agent cannot accept all of these gambles. The St. Petersburg paradox
corresponds to the case in which c_n = 1 for all n. In this case, the agent pays ∞
and only receives 2^N in return. We have ruled out this possibility by requiring
that the agent be able to afford the worst possible loss.
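The bookkeeping behind the paradox is easy to check numerically. The sketch below (the function names are ours, not the book's) confirms that each prevision is 1, so taking c_n = 1 for the first K gambles commits the agent to a total stake of K, which grows without bound:

```python
# St. Petersburg payoffs: X_n = 2**n if the first head occurs on toss n,
# and X_n = 0 otherwise, so Pr(N = n) = 2**(-n).
def prevision_Xn(n):
    # Expected value of X_n: 2**n * Pr(N = n) = 1 for every n.
    return 2**n * 2.0**(-n)

def total_stake(K):
    # With c_n = 1 for n = 1..K, the agent's total stake is the sum of the
    # previsions, i.e., K, which is unbounded as K grows.
    return sum(prevision_Xn(n) for n in range(1, K + 1))

print(total_stake(10), total_stake(1000))  # 10.0 1000.0
```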
For each integer x, define
I_x = 1 if X = x, and I_x = 0 if not.
Suppose that our agent is indifferent between all gambles of the form c(I_x − 2^{−x})
for all −1 ≤ c ≤ 1 and all integers x. Then, we assume that the agent is also
indifferent between all gambles of the form ∑_{x=1}^∞ c_x(I_x − 2^{−x}), so long as
−1 ≤ c_x ≤ 1 for all x. (Note that the largest possible loss is no more than 1.) Let
Y = ∑_{x=1}^∞ c_x I_x with −1 ≤ c_x ≤ 1 for all x. Note that Y is a bounded random
64DeFinetti would not require an agent to accept countably many gambles at
once, but rather only finitely many. We introduce this stronger requirement to
avoid mathematical problems that arise when the weaker assumption holds but
the stronger one does not. Schervish, Seidenfeld, and Kadane (1984) describe one
such problem in detail.
quantity, and that the agent has implicitly agreed to accept all gambles of the
form c(Y − μ) for −1 ≤ c ≤ 1, where μ = ∑_{x=1}^∞ c_x 2^{−x}. If the agent were
foolish enough to be indifferent between all gambles of the form d(Y − p) for
−a ≤ d ≤ a where p ≠ μ, then a clever opponent could make money with no risk.
For example, if p > μ, let ε = min{1, a}. The opponent would ask the agent to
accept the gamble ε(Y − p) as well as the gambles −εc_x(I_x − 2^{−x}) for x = 1, 2, ....
The net effect to the agent of these gambles is −ε(p − μ) < 0, no matter what
value X takes! A similar situation arises if p < μ. Only p = μ protects the agent
from this sort of problem, which is known as Dutch book.
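The sure-loss arithmetic can be verified directly. In this sketch (ours, with the sums truncated to finitely many x so that they are computable), the combined gamble pays the same amount ε(μ − p) no matter which value X takes:

```python
# Sure-loss check for a Dutch book of this kind. Take c_x nonzero only for
# x = 1..K (a truncation made here so the sums are finite), a prevision
# p != mu, and verify that the combined gamble pays eps*(mu - p) for every
# possible value of X.
K = 8
c = {x: 0.5 for x in range(1, K + 1)}    # any coefficients in [-1, 1]
mu = sum(c[x] * 2.0**(-x) for x in c)    # prevision of Y
p = mu + 0.1                             # agent foolishly uses p > mu
eps = 1.0

def net_payoff(x0):
    # Evaluate both accepted gambles when X takes the value x0.
    Y = c.get(x0, 0.0)                   # Y = sum_x c_x * I_x at X = x0
    gamble1 = eps * (Y - p)
    gamble2 = sum(-eps * c[x] * ((1.0 if x == x0 else 0.0) - 2.0**(-x))
                  for x in c)
    return gamble1 + gamble2

losses = {round(net_payoff(x0), 12) for x0 in range(1, 2 * K)}
print(losses)  # one value: the sure loss eps*(mu - p)
```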
X = ∑_{n=1}^∞ c_n(I_{C_n} − μ(C_n)),
where the maximum losses from X and from −X are small enough for the agent to
afford. Since this makes X bounded, it follows from Fubini's theorem A.70 that
E(X) = 0; hence it is impossible that X < 0 under all circumstances, and the
previsions are coherent.
For the "only if" part, assume that the previsions are coherent. Clearly, μ(∅) =
0, since I_∅ = 0 and −cμ(∅) ≥ 0 must hold for both positive and negative c. It is
also easy to see that μ(A) ≥ 0 for all A: if μ(A) < 0, then for all negative c,
c(I_A − μ(A)) < 0, and we have incoherence. Countable additivity follows in a
similar fashion. Let {A_n}_{n=1}^∞ be mutually disjoint, and let A = ∪_{n=1}^∞ A_n.
If μ(A) < ∑_{n=1}^∞ μ(A_n), then the gamble
−(I_A − μ(A)) + ∑_{n=1}^∞ (I_{A_n} − μ(A_n)) = μ(A) − ∑_{n=1}^∞ μ(A_n)
is always negative. If μ(A) > ∑_{n=1}^∞ μ(A_n), then the negative of the above
gamble is always negative. Either way there is incoherence.
Theorem B.138 says that if an agent insists on dealing with a σ-field of subsets
of some set S, then expressing coherent previsions for gambles on events is
equivalent to choosing probabilities.67 Similar claims can be made about bounded
random variables.
Theorem B.139. Let C be the collection of all bounded measurable functions
from a measurable space (S, A) to ℝ. Suppose that, for each X ∈ C, an agent
assigns a prevision P(X). The previsions are coherent if and only if there exists
a probability μ on (S, A) such that P(X) = E(X) for all X ∈ C.
PROOF. Suppose that the agent is indifferent between all gambles of the form
c(X − P(X)) for −d_X ≤ c ≤ d_X. For the "if" direction, the proof is virtually
identical to the corresponding part of the proof of Theorem B.138. For the "only
if" part, note that I_A ∈ C for every A ∈ A. It follows from Theorem B.138 that a
probability μ exists such that μ(A) = P(I_A) for all A ∈ A. Hence P(X) = E(X)
for all simple functions X. Let X > 0 and let X_1 ≤ X_2 ≤ ··· be simple functions
less than or equal to X such that lim_{n→∞} X_n = X. Then (taking X_1 = 0)
X = ∑_{n=1}^∞ (X_{n+1} − X_n),
so
67In the theory of DeFinetti (1974), one obtains finitely additive probabilities
without assuming that probabilities have been assigned to all elements of a
σ-field.
68DeFinetti (1974) would only require that such conditional gambles be con-
sidered one at a time rather than a σ-field at a time.
this case, there are only two sets of conditional gambles (other than the "uncondi-
tional" gambles c[X − P(X)]), namely cI_A(X − P(X|A)) and cI_{A^c}(X − P(X|A^c)).
Here, Q = P(X|A)I_A + P(X|A^c)I_{A^c}. Note that the previsions P(X|A) and
P(I_A) = μ(A) are already expressed. It is easy to see that
cI_A(X − P(X|A)) = c(XI_A − E(XI_A)) − cP(X|A)(I_A − μ(A)) + c[E(XI_A) − P(X|A)μ(A)].
Clearly, the only coherent choices of P(X|A) satisfy P(X|A)μ(A) = E(XI_A). If
μ(A) > 0, then P(X|A) = E(XI_A)/μ(A), the usual conditional mean of X given
A. Similarly, P(X|A^c)μ(A^c) = E(XI_{A^c}) must hold.
The general situation is not much different from Example B.140.
Theorem B.141. Suppose that an agent must choose a function Q that is mea-
surable with respect to a sub-σ-field C so that, for each nonempty A ∈ C, he or
she is indifferent between all gambles of the form cI_A(X − Q). The choice of Q
is coherent if and only if E(QI_A) = E(XI_A) for all A ∈ C.
PROOF. As in Example B.140, note that
cI_A(X − Q) = c(XI_A − E(XI_A)) − c(QI_A − E(QI_A)) + c[E(XI_A) − E(QI_A)].
The choice of Q can be coherent if and only if E(QI_A) = E(XI_A). □
The reader should note the similarity between the conditions in Theorem B.141
and Definition B.23. The function Q must be a version of the conditional mean
of X given C.
Example B.142. Let (X, Y) be random variables with a joint density f_{X,Y}
with respect to Lebesgue measure. That is, for every Borel set C ⊆ ℝ²,
Pr((X, Y) ∈ C) = ∫_C f_{X,Y}(x, y) dx dy.
B.7 Simulation*
Several times in this text, we will want to generate observations that have a
desired distribution. Such observations will be called pseudorandom numbers be-
cause samples appear to have the properties of random variables, but they are
actually generated by a complicated deterministic process. We will not go into
detail on how pseudorandom numbers with uniform U(0, 1) distribution are gen-
erated. In this section, we wish to prove a couple of useful theorems about how to
generate pseudorandom numbers with other distributions under the assumption
that pseudorandom numbers with U(0, 1) distribution can be generated.
Theorem B.144. Let F be a CDF, and define the inverse of F by
F^{−1}(u) = inf{x : F(x) ≥ u}.
N = min{ i : U_i ≤ f(Y_i)/[k g(Y_i)] }.
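Assuming the usual quantile-transform conclusion that a result like Theorem B.144 is after (if U ~ U(0, 1), then F^{−1}(U) has CDF F), here is a minimal sketch using the Exp(θ) distribution of Appendix D, whose inverse CDF has the closed form F^{−1}(u) = −log(1 − u)/θ (the function names are ours):

```python
import math
import random

def inverse_cdf_sample(F_inv):
    # Quantile transform: if U ~ U(0, 1), then F_inv(U) has CDF F.
    return F_inv(random.random())

# Exp(theta) has F(x) = 1 - exp(-theta*x), so F^{-1}(u) = -log(1 - u)/theta.
theta = 2.0
exp_inv = lambda u: -math.log(1.0 - u) / theta

random.seed(0)
sample = [inverse_cdf_sample(exp_inv) for _ in range(100_000)]
print(sum(sample) / len(sample))  # close to the Exp(theta) mean 1/theta = 0.5
```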
*This section may be skipped without interrupting the flow of ideas.
Pr(Z ≤ t) = Pr( Y_i ≤ t | U_i ≤ f(Y_i)/[k g(Y_i)] )
= Pr( Y_i ≤ t, U_i ≤ f(Y_i)/[k g(Y_i)] ) / Pr( U_i ≤ f(Y_i)/[k g(Y_i)] ).
The numerator equals
∫_{−∞}^t [f(y)/(k g(y))] g(y) dy = (1/k) ∫_{−∞}^t f(y) dy,
since Y_i has PDF g(·). Similarly, the denominator conditional probability can be
written as
Pr( U_i ≤ f(Y_i)/[k g(Y_i)] | Y_i ) = f(Y_i)/[k g(Y_i)].
The mean of this is likewise seen to be ∫ f(y) dy / k. The ratio of these is
Pr(Z ≤ t) = ∫_{−∞}^t f(y) dy / ∫ f(y) dy,
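The acceptance criterion U_i ≤ f(Y_i)/[k g(Y_i)] in this argument translates directly into code. A sketch for the Beta(2, 2) density f(x) = 6x(1 − x) with a U(0, 1) proposal and envelope constant k = 1.5 (this choice of target and proposal is ours, used only for illustration):

```python
import random

def rejection_sample(f, g_sample, g_pdf, k):
    # Draw Y ~ g and U ~ U(0, 1) until U <= f(Y)/(k*g(Y)); return that Y.
    while True:
        y = g_sample()
        if random.random() <= f(y) / (k * g_pdf(y)):
            return y

# Target: Beta(2, 2) density f(x) = 6x(1-x) on [0, 1]; proposal: U(0, 1).
# Since max f = 1.5, the constant k = 1.5 satisfies f <= k*g.
f = lambda x: 6.0 * x * (1.0 - x)
random.seed(1)
draws = [rejection_sample(f, random.random, lambda x: 1.0, 1.5)
         for _ in range(50_000)]
print(sum(draws) / len(draws))  # close to the Beta(2, 2) mean 1/2
```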
If (U, V) has uniform distribution over the set A, then V/U has density propor-
tional to f.
PROOF. Let (U, V) be uniformly distributed on the set A. Then f_{U,V}(u, v) =
I_A(u, v)/c, where c is the area of A. Define X = U and Y = V/U. The Jacobian
for the transformation is x, and the joint density of (X, Y) is
f_{X,Y}(x, y) = (x/c) I_A(x, xy) = (x/c) I_{[0, √f(y)]}(x).
It follows that f_Y(y) = ∫_0^{√f(y)} (x/c) dx = f(y)/(2c). □
If both √f(x) ≤ b and a ≤ x√f(x) ≤ c for all x, then A is contained in the
rectangle with opposite corners (0, a) and (b, c). We can then generate U ~ U(0, b)
and V ~ U(a, c). We set X = V/U, and if U² ≤ f(X), we take X as our desired
random variable. If U² > f(X), we try again.
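The rectangle method just described can be sketched for the unnormalized standard normal f(x) = exp(−x²/2), for which √f(x) ≤ 1 and |x √f(x)| ≤ √(2/e); this particular target is our choice of example:

```python
import math
import random

def ratio_of_uniforms_normal():
    # Rectangle bounds for f(x) = exp(-x**2/2): sqrt(f) <= 1 = b and
    # |x*sqrt(f(x))| <= sqrt(2/e) = c, so A fits in [0, 1] x [-c, c].
    c = math.sqrt(2.0 / math.e)
    while True:
        u = random.uniform(0.0, 1.0)
        v = random.uniform(-c, c)
        # Accept when (u, v) lands in A, i.e., u**2 <= f(v/u).
        if u > 0.0 and u * u <= math.exp(-((v / u) ** 2) / 2.0):
            return v / u  # X = V/U has density proportional to f

random.seed(2)
xs = [ratio_of_uniforms_normal() for _ in range(50_000)]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean**2
print(mean, var)  # close to 0 and 1 for a standard normal
```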
An important application of simulation is the numerical integration tech-
nique called importance sampling. Suppose that we wish to know the value of the
ratio of two integrals
∫ v(θ)h(θ) dθ / ∫ h(θ) dθ,      (B.148)
where θ can be a vector. Suppose that f is a density function such that h/f is
nearly constant and it is easy to generate pseudorandom numbers with density
f. Let {X_i}_{i=1}^∞ be an IID sequence of pseudorandom numbers with density f.
Then
∫ h(θ) dθ = E( h(X_i)/f(X_i) ),
∫ v(θ)h(θ) dθ = E( v(X_i) h(X_i)/f(X_i) ),
where the expectations are with respect to the pseudo-distribution of X_i. If we let
W_i = h(X_i)/f(X_i) and Z_i = v(X_i)W_i, then the weak law of large numbers B.95
says that Z̄_n/W̄_n converges in probability to (B.148).69 The reason that we want
h/f to be nearly constant is so that the variance of W_i is small. In Section 7.1.3,
we will show how to approximate the variance of Z̄_n/W̄_n as an estimate of
(B.148).
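A sketch of the Z̄_n/W̄_n estimator, with v(θ) = θ and h an unnormalized N(1, 1) density so that the true value of the ratio (B.148) is the mean 1; the example functions and the N(0, 2) importance density are our choices, not the book's:

```python
import math
import random

def importance_ratio(v, h, f_pdf, f_sample, n):
    # Estimate (integral of v*h) / (integral of h) by Z_bar/W_bar with
    # weights W_i = h(X_i)/f(X_i) and Z_i = v(X_i)*W_i, X_i ~ f.
    xs = [f_sample() for _ in range(n)]
    w = [h(x) / f_pdf(x) for x in xs]
    z = [v(x) * wi for x, wi in zip(xs, w)]
    return sum(z) / sum(w)

# h: unnormalized N(1, 1) density; v(theta) = theta, so (B.148) equals 1.
# Importance density f: N(0, 2), wide enough that h/f is bounded.
h = lambda t: math.exp(-((t - 1.0) ** 2) / 2.0)
f_pdf = lambda t: math.exp(-t * t / 8.0) / (2.0 * math.sqrt(2.0 * math.pi))
random.seed(3)
est = importance_ratio(lambda t: t, h, f_pdf,
                       lambda: random.gauss(0.0, 2.0), 100_000)
print(est)  # close to 1
```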
B.8 Problems
Section B.2:
69The strong law of large numbers 1.63 says that Z̄_n/W̄_n converges a.s. to
(B.148).
7. Let <l> denote the standard normal CDF, and let the joint CDF of random
variables (X, Y) be
< y + 1,
Fx,Y(x, y) = { ~~y)
~ if Y - 1 ~ x
if x ~ Y + 1,
otherwise.
12. Suppose that X_1, ..., X_n are independent, each with distribution N(θ, 1).
Find the conditional distribution of X_1, ..., X_n given X̄_n = x̄, where X̄_n =
∑_{i=1}^n X_i/n.
13. Let B_1 ⊆ B_2 ⊆ ··· be a sequence of σ-fields, and let X ≥ 0. Suppose that
E(X|B_n) = Y for all n. Let B be the smallest σ-field containing all of the
B_n. Show that E(X|B) = Y, a.s. (Hint: Show that the union of the B_n is a
π-system, and use Theorem A.26.)
14. Prove Proposition B.43 on page 623.
15. Assume the conditions of Theorem B.46. Also, suppose that (X, B_1, ν_1)
and (Y, B_2, ν_2) are σ-finite measure spaces and ν = ν_1 × ν_2. Prove that ν_1
can play the role of ν_{X|Y}(·|y) for all y and that ν_2 can play the role of ν_Y
in the statement of Theorem B.46.
16. Prove Proposition B.51 on page 625. (Hint: Notice that I_A(V^{−1}(y, w)) =
I_{A_y}(w).)
17. Prove Proposition B.66 on page 631. (Hint: Prove the result for product
sets first, and then use Theorem A.26.)
18. Prove Corollary B.67 on page 631.
19. Prove Corollary B.74 on page 633.
20. Prove the second Borel-Cantelli lemma: If {A_n}_{n=1}^∞ are mutually inde-
pendent and ∑_{n=1}^∞ Pr(A_n) = ∞, then Pr(∩_{i=1}^∞ ∪_{n=i}^∞ A_n) = 1. (This set
is sometimes called A_n infinitely often.) (Hint: Find the probability of the
complement by using the fact that 1 − x ≤ exp(−x) for 0 ≤ x ≤ 1.)
21. *Suppose that (S, A, μ) is a measure space. Let {f_n}_{n=1}^∞ be a sequence of
measurable functions f_n : S → T, where T is a metric space with Borel
σ-field. Let C be the tail σ-field of {f_n}_{n=1}^∞. If lim_{n→∞} f_n(s) = f(s), for all
s, then prove that f is measurable with respect to C. (Hint: Refer to the
proof of part 5 of Theorem A.38. Show that the set A ∈ C by showing
that the union in (A.39) does not need to start at 1.)
22. Let (S, A, μ) be a probability space, and let C be the tail σ-field of a se-
quence of random quantities {X_n}_{n=1}^∞, where X_n : S → 𝒳 for all n. Let
D be the σ-field generated by {X_n}_{n=1}^∞. Let X = (X_1, X_2, ...) ∈ 𝒳^∞.
If π is a permutation of a finite set of integers {1, ..., n}, let πX =
(X_{π(1)}, ..., X_{π(n)}, X_{n+1}, ...). We say that A ∈ D is symmetric if A =
X^{−1}(B) and, for every permutation π of finitely many coordinates, A =
(πX)^{−1}(B) as well.
(a) Prove that every C E C is symmetric.
(b) Show that there can be symmetric events that are not in C.
23. Prove Proposition B.78 on page 634.
Section B.4:
29. Let {i_n}_{n=1}^∞ be a sequence of numbers in {0, 1}. Suppose that {X_n}_{n=1}^∞ is
a sequence of Bernoulli random variables such that
where x = ∑_{j=1}^n i_j. Show that this specifies a consistent set of joint distri-
butions for n = 1, 2, ....
30. Let μ be a finite measure on (ℝ, B), where B is the Borel σ-field. Suppose
that {X(t) : −∞ < t < ∞} is a stochastic process such that X(t) has
Beta(μ((−∞, t]), μ((t, ∞))) distribution for each t, X(t) > X(s) if t > s, and
X(·) is continuous from the right.
(a) Prove that Pr(lim_{t→∞} X(t) = 1) = 1.
(b) Let U = inf{t : X(t) ≥ 1/2}. Prove that the median of U is inf{t :
μ((−∞, t]) ≥ μ((t, ∞))}. (Hint: Write {U ≤ s} in terms of X(·).)
31. Let R be a set, and let (𝒳_r, B_r) be a Borel space for every r ∈ R. Let 𝒳 =
∏_{r∈R} 𝒳_r and let B be the product σ-field. For each r ∈ R, let X_r : 𝒳 → 𝒳_r
be the projection function X_r(x) = x_r. Prove that B is the union of all of
the σ-fields generated by all of the countable collections of X_r functions.
That is, let Q be the set of all countable subsets of R, and for each q ∈ Q
let X^q = {X_r}_{r∈q} and let B_q be the σ-field generated by X^q. Then show
that B = ∪_{q∈Q} B_q.
Section B.7:
D^{(i)}(f, x, y) = ∑_{i_1=1}^k ··· ∑_{i_i=1}^k ( ∂^i f(z)/∂z_{i_1} ··· ∂z_{i_i} |_{z=x} ) ∏_{s=1}^i y_{i_s},
where we allow notation like ∂³/∂z_1∂z_1∂z_4 to stand for ∂³/∂z_1²∂z_4. Then, for
x ∈ D,
1This theorem is used in the proofs of Theorems 7.63, 7.89, 7.108, and 7.125.
For a proof (with m = 2), see Buck (1965), Theorem 16 on page 260.
666 Appendix C. Mathematical Theorems Not Proven Here
2This theorem is used in the proof of Theorem 7.57. For a proof, see Rudin
(1964), Theorem 9.17.
3This theorem is used in the proofs of DeFinetti's representation theorem 1.49
and 1.47 and Theorem B.93. For a proof, see Rudin (1964), Theorem 7.31.
4This theorem is used in the proof of Theorem B.17. For a proof, see Berger
(1985), Theorem 12 on page 341, or Ferguson (1967), Theorem 1 on page 73.
5This theorem is used in the proof of Theorems B.17, 3.77, and 3.95. For a
proof, see Berger (1985), Theorem 13 on page 342, or Ferguson (1967), Theorem 2
on page 73.
6This theorem is used in the proof of Theorem 3.77. For a proof, see Dugundji
(1966), Theorems 3.2 and 4.3 of Chapter XI.
7This theorem is used to show that certain estimators are UMVUE, and in
the proof of Theorem 2.74. For a proof, see Churchill (1960), Sections 52 and 56.
C.3. Functional Analysis 667
only if
∬ |K(x′, x)|² dμ(x′) dμ(x) < ∞.
8This theorem is used in the proof of Theorem 2.64. For a proof, see Churchill
(1960), Section 54, or Ahlfors (1966), Theorem 12' on page 134.
9This theorem is used in the proof of Theorem 2.114. For a proof, see Diaconis
and Freedman (1990), Theorem 2.1.
10This theorem is used in the proof of Theorem 8.40. For a proof, see Sec-
tion XI.6 of Dunford and Schwartz (1963). By L²(μ) we mean {f : ∫ f²(x) dμ(x) <
∞}.
11This theorem is used in the proof of Theorem 8.40. For a proof, see Theo-
rem 6 of Section XI.6 of Dunford and Schwartz (1963). The reader should note
that Dunford and Schwartz (1963) use the term compact instead of completely
continuous.
12This theorem is used in the proof of Theorem 8.40. For a proof, see Lemma 1
in Section VIII.3 of Berberian (1961).
13This theorem is used in the proof of Theorem 8.40. For a proof, see part (5)
of Theorem 2 on p. 132 of Berberian (1961).
APPENDIX D
Summary of Distributions
The distributions used in this book are listed here. We give the name and sym-
bol used to describe each distribution. Each distribution is absolutely continuous
with respect to some measure or other. In most cases the mean and variance are
given. In some cases, the symbol for the CDF is given.
Variance: 2 [q + a1'12~1'1')]
IThis distribution was derived without a name by Geisser (1967). It was named
L2 by Lecoutre and Rouanet (1981).
D.1. Univariate Continuous Distributions 669
Alternate noncentral F
Symbol: ANCF(q, a, ψ)
Beta
Symbol: Beta(α, β)
Density: f_X(x) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]
Mean: α/(α + β)
Variance: αβ/[(α + β)²(α + β + 1)]
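As a quick numeric sanity check on entries of this kind, the Beta(α, β) mean and variance above can be recovered by integrating the density; the midpoint-rule code and the choice Beta(2, 3) below are ours, used only for illustration:

```python
import math

# Numeric check of the Beta(alpha, beta) mean and variance entries above,
# by midpoint-rule integration of the density over [0, 1].
def beta_pdf(x, a, b):
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * x ** (a - 1) * (1.0 - x) ** (b - 1)

def moments(a, b, n=20_000):
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    mean = sum(x * beta_pdf(x, a, b) for x in xs) * h
    ex2 = sum(x * x * beta_pdf(x, a, b) for x in xs) * h
    return mean, ex2 - mean**2

a, b = 2.0, 3.0
mean, var = moments(a, b)
print(mean, var)  # close to a/(a+b) = 0.4 and a*b/((a+b)**2*(a+b+1)) = 0.04
```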
Cauchy
Symbol: Cau(μ, σ²)
Chi-squared
Symbol: χ²_a
Exponential
Symbol: Exp(θ)
Density: f_X(x) = θ exp(−xθ)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: 1/θ
Variance: 1/θ²
F
Symbol: F_{q,a}
Density: f_X(x) = [Γ((q + a)/2) q^{q/2} a^{a/2} / (Γ(q/2)Γ(a/2))] x^{q/2−1} (a + qx)^{−(q+a)/2}
Dominating measure: Lebesgue measure on [0, ∞)
Mean: a/(a − 2), if a > 2
Variance: 2a²(q + a − 2)/[q(a − 4)(a − 2)²], if a > 4
Gamma
Symbol: Γ(α, β)
Density: f_X(x) = [β^α/Γ(α)] x^{α−1} exp(−βx)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: α/β
Variance: α/β²
Inverse gamma
Symbol: Γ^{−1}(α, β)
Density: f_X(x) = [β^α/Γ(α)] x^{−α−1} exp(−β/x)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: β/(α − 1), if α > 1
Variance: β²/[(α − 1)²(α − 2)], if α > 2
Laplace
Symbol: Lap(μ, σ)
Density: f_X(x) = [1/(2σ)] exp(−|x − μ|/σ)
Dominating measure: Lebesgue measure on ℝ
Mean: μ
Variance: 2σ²
Noncentral beta
Symbol: NCB(α, β, ψ)
Density: f_X(x) = ∑_{k=0}^∞ [(ψ/2)^k exp(−ψ/2)/k!] [Γ(α + β + k)/(Γ(α + k)Γ(β))] x^{α+k−1} (1 − x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]
Noncentral chi-squared
Symbol: NCχ²_a(ψ)
Density: f_X(x) = ∑_{k=0}^∞ [(ψ/2)^k exp(−ψ/2)/k!] x^{a/2+k−1} exp(−x/2)/[2^{a/2+k} Γ(a/2 + k)]
Noncentral F
Symbol: NCF(q, a, ψ)
Noncentral t
Symbol: NCt_a(δ)
CDF: NCT_a(·; δ)
Normal
Symbol: N(μ, σ²)
Density: f_X(x) = (√(2π) σ)^{−1} exp(−(x − μ)²/(2σ²))
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: μ
Variance: σ²
Pareto
Symbol: Par(α, c)
Density: f_X(x) = αc^α/x^{α+1}
Dominating measure: Lebesgue measure on [c, ∞)
Mean: cα/(α − 1), if α > 1
Variance: c²α/[(α − 2)(α − 1)²], if α > 2
t
Symbol: t_a(μ, σ²)
Uniform
Symbol: U(a, b)
Density: f_X(x) = (b − a)^{−1}
Dominating measure: Lebesgue measure on [a, b]
Mean: (a + b)/2
Variance: (b − a)²/12
Binomial
Symbol: Bin(n, p)
Density: f_X(x) = (n choose x) p^x (1 − p)^{n−x}
Dominating measure: Counting measure on {0, ..., n}
Mean: np
Variance: np(1 − p)
Geometric
Symbol: Geo(p)
Density: f_X(x) = p(1 − p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: (1 − p)/p
Variance: (1 − p)/p²
Hypergeometric
Symbol: Hyp(N, n, k)
Density: f_X(x) = (k choose x)(N − k choose n − x)/(N choose n)
Dominating measure: Counting measure on
{max{0, n − N + k}, ..., min{n, k}}
Mean: nk/N
Variance: n(k/N)((N − k)/N)((N − n)/(N − 1))
Negative binomial
Symbol: Negbin(α, p)
Density: f_X(x) = (α + x − 1 choose x) p^α (1 − p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: α(1 − p)/p
Variance: α(1 − p)/p²
Poisson
Symbol: Poi(λ)
Density: f_X(x) = exp(−λ) λ^x/x!
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: λ
Variance: λ
Multinomial
Symbol: Mult_k(n, p_1, ..., p_k)
Density: f_{X_1,...,X_k}(x_1, ..., x_k) = [n!/(x_1! ··· x_k!)] p_1^{x_1} ··· p_k^{x_k}
Multivariate Normal
Symbol: N_p(μ, Σ)
Density: f_X(x) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(1/2)(x − μ)ᵀ Σ^{−1}(x − μ))
Dominating measure: Lebesgue measure on ℝ^p
Mean: E(X_i) = μ_i
Variance: Var(X_i) = Σ_{ii}
Covariance: Cov(X_i, X_j) = Σ_{ij}
References
RAO, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.).
New York: Wiley.
ROBBINS, H. (1951). Asymptotically subminimax solutions of compound sta-
tistical decision problems. In J. NEYMAN (Ed.), Proceedings of the Second
Berkeley Symposium on Mathematical Statistics and Probability (pp. 131-
148). Berkeley: University of California.
ROBBINS, H. (1955). An empirical Bayes approach to statistics. In J. NEY-
MAN (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 157-164). Berkeley: University of
California.
ROBBINS, H. (1964). The empirical Bayes approach to statistical decision prob-
lems. Annals of Mathematical Statistics, 35, 1-20.
ROBERT, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3,
601-608.
ROBERTS, H. V. (1967). Informative stopping rules and inferences about popu-
lation size. Journal of the American Statistical Association, 62, 763-775.
ROUANET, H. and LECOUTRE, B. (1983). Specific inference in ANOVA: From
significance tests to Bayesian procedures. British Journal of Mathematical
and Statistical Psychology, 36, 252-268.
ROYDEN, H. L. (1968). Real Analysis. London: Macmillan.
RUBIN, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.
RUDIN, W. (1964). Principles of Mathematical Analysis (2nd ed.). New York:
McGraw-Hill.
SAVAGE, L. J. (1954). The Foundations of Statistics. New York: Wiley.
SAVAGE, L. J. (1962). The Foundations of Statistical Inference. London:
Methuen.
SCHEFFE, H. (1947). A useful convergence theorem for probability distributions.
Annals of Mathematical Statistics, 18, 434-438.
SCHERVISH, M. J. (1983). User-oriented inference. Journal of the American
Statistical Association, 78, 611-615.
SCHERVISH, M. J. (1992). Bayesian analysis of linear models (with discussion).
In J. M. BERNARDO, J. O. BERGER, A. P. DAWID, and A. F. M. SMITH
(Eds.), Bayesian Statistics 4: Proceedings of the Second Valencia Interna-
tional Meeting (pp. 419--434). Oxford: Clarendon Press.
SCHERVISH, M. J. (1994). Discussion of "Bootstrap: More than a stab in the
dark?" by G. A. Young. Statistical Science, 9, 408-410.
SCHERVISH, M. J. (1996). P-values: What they are and what they are not.
American Statistician, 50, to appear.
SCHERVISH, M. J. and CARLIN, B. P. (1992). On the convergence of successive
substitution sampling. Journal of Computational and Graphical Statistics,
1, 111-127.
SCHERVISH, M. J. and SEIDENFELD, T. (1990). An approach to consensus and
certainty with increasing evidence. Journal of Statistical Planning and In-
ference, 25,401-414.
Ferguson, T., 52, 56, 61, 173, 179, Johnstone, I., 435, 682
181,248,258,614,666,680
Ferrandiz, J., 669, 680 Kadane, J., 21, 24,183-184,446,564,
Fieller, E., 321, 680 655, 682-683, 687-688
Fienberg, S., 462, 676 Kagan, A., 349, 683
Fishburn, P., 181, 680 Kahneman, D., 23, 683
Fisher, R., 89, 96, 217-218, 307, 370, Kass, R., ix, 226, 446, 505, 683
373, 522, 680 Kerridge, D., 564, 683
Fraser, D., 435, 677, 680 Kiefer, J., 417, 420, 683
Freedman, D., 15, 28, 40-41, 46, 61, Kinderman, A., 660, 683
70,123,126,330-331,426, Kingman, J., 36, 683
434,479-480,667,676,678- Knuth, D., x, 683
680 Kraft, C., 66, 683
Freedman, L., 24, 680 Krasker, W., 56, 683
Freeman, P., 524, 681 Krem, A., 408, 683
Kshirsagar, A., 386, 684
Gabriel, K., 252, 681 Kullback, S., 116, 684
Garthwaite, P., 24, 681
Gavasakar, U., 24, 681 Lamport, L., x, 684
Geisser, S., 521, 668, 681 Lane, D., 9, 682
Gelfand, A., 507,681 Lauritzen, S., 28, 123, 481, 684
Geman, D., 507, 681 Lavine, M., 69,526,684
Geman, S., 507, 681 LeCam, L., 414, 437, 684
Ghosh, J., 381-382, 681 Lecoutre, B., 668-669, 684, 686
Gnanadesikan, R., 22, 681 Lehmann, E., 231,280, 285,298,350,
Good, I., 565, 681 684
Levy, P., 648, 650
Hadjicostas, P., ix Lindley, D., 6, 229, 284, 479, 684
Hall, P., 337-338, 681 Lindman, H., 222,284,679
Hall, W., 381-382, 681 Linnik, Y., 349, 683
Halmos, P., 364, 600, 681 Loève, M., 34, 653, 685
Hampel, F., 315, 681 Louis, T., 24, 677
Hartigan, J., 20-21, 33, 681
Matts, J., 24, 677
Heath, D., 21, 46, 681-682
Mauldin, R., 66, 69, 685
Hewitt, E., 46, 682
McDunnough, P., 435, 677, 680
Heyde, C., 435, 682
Mee, R., 326, 679
Hill, B., 9, 484, 682
Mendel, G., 217, 685
Hinkley, D., 218, 423, 678-679
Metivier, M., 66,685
Hodges, J., 414
Monahan, J., 660, 683
Hoel, P., 640, 682
Morgenstern, O., 181-182, 688
Hogarth, R., 24, 682 Morris, C., 166, 500, 679, 685
Holland, P., 462, 676
Huber, P., 310, 315, 428, 682 Nachbin, L., 364, 685
Hwang, J., 160, 677 Neyman, J., 89, 175,231,247,420,
685
James, W., 163,682 Nobile, A., ix
Jaynes, E., 379, 682 Novick, M., ix, 6, 684
Jeffreys, H., 122, 229, 284, 682
Jiang, T., ix Oue, S., ix
Name Index 693
Patterson, R., 33, 688 Smith, A., 479, 507, 681, 684, 687
Pearson, E., 175, 231, 247, 685 Smith, W., 24, 682
Pearson, K., 216, 685 Spiegelhalter, D., 24, 680
Perlman, M., 430, 685 Spjøtvoll, E., 283, 687
Peters, S., 24, 682 Stahel, W., 315, 681
Phillips, L., 6, 684 Steffey, D., 505, 683
Pierce, D., 99, 685 Stein, C., 163, 379, 382, 568, 682, 687
Pitman, E., 347, 685 Stigler, S., 8, 687
Port, S., 640, 682 Stone, C., 640, 682
Portnoy, S., x Stone, M., 21, 678, 687
Pratt, J., 56, 98, 683, 685 Strasser, H., 430, 688
Strawderman, W., ix, 166,688
Raftery, A., 226, 683 Sudderth, W., 9, 21, 46, 66, 69,
Ramamoorthi, R., 86, 676 681-682, 685
Rao, C., 152, 301, 349, 683, 685-686
Reeve, C., 326, 679 Taylor, R., 33, 688
Regazzini, E., 21, 676 Tiao, G., 521, 677
Rigo, P., 21, 676 Tibshirani, R., 336, 679
Robbins, H., 303, 647, 677, 686 Tierney, L., 225, 446, 507, 683, 688
Robert, C., 225, 686 Tversky, A., 23, 683
Roberts, H., 565,686
Ronchetti, E., 315, 681 Venn, J., 8,688
Rouanet, H., 668-669, 684, 686 Verdinelli, I., 524, 688
Rousseeuw, P., 315, 681 Villegas, C., 379, 677
Royden, H., 578, 589, 597, 621, 686 Von Mises, R., 10, 688
Rubin, D., 332, 686 Von Neumann, J., 181-182,688
Rudin, W., 666, 686
Wald, A., 415, 549, 552, 557,688
Savage, L., 46, 181, 222, 284, 565, Walker, A., 435, 442, 688
600, 679, 681-682, 686 Wallace, D., 99, 688
Scheffe, H., 298, 634, 684, 686 Wasserman, L., ix, 524, 526, 684, 688
Schervish, M., v Welch, B., 320, 688
Schwartz, J., 507, 635, 667, 679 West, M., 524, 688
Schwartz, L., 429, 687 Wijsman, R., x, 381-382, 681
Scott, E., 420, 685 Wilks, A., x, 676
Seidenfeld, T., ix, 21, 183-184, 187, Wilks, S., 325, 688
429, 564, 655, 682-683, Williams, S., 66, 69, 685
686-687 Winkler, R., 24, 682
Sellke, T., 284, 676 Wolfowitz, J., 417, 420, 557, 683, 688
Serfling, R., 413, 687 Wolpert, R., 526, 684
Sethuraman, J., 56, 687
Short, T., ix Ylvisaker, D., 108, 679
Shurlow, N., v Young, G., 329, 688
Siegmund, D., 647, 677
Singh, K., 331, 687 Zellner, A., 16, 688
Slovic, P., 23, 683 Zidek, J., 21, 678
Subject Index*
Scale invariant loss, 350-351 Stopping time, 537, 548, 552, 554
Scale parameter, 345 Strict preference, 183
Scheffe's theorem, 634 Strong law of large numbers, 34-36
Score function, 111, 122,302, 305 Strongly unimodal, 329
conditional, 111 Submartingale, 646
Second-order efficiency, 414 Successive substitution, 505-506, 545
Sensitivity analysis, 524 Successive substitution sampling, 507
Separable space, 619 Sufficient statistic, 84-85-86, 99, 103,
Separating hyperplane theorem, 666 109, 150-151, 298
Sequential decision rule, 537 conditionally, 95
Sequential probability ratio test, 549 minimal,92
Sequential test, 548 natural, 103
Set estimation, 296 Superefficiency, 414
Shrinkage estimator, 163 Supporting hyperplane theorem, 666
σ-field, 575 Sure-thing principle, 184
Borel, 571, 575
generated, 571-572, 584 t distribution, 672
image, 584 Tail a-field, 632
restriction, 584 Tailfree process, 60
tail, 632 Taylor's theorem, 665
σ-finite measure, 572, 578, 601 Tchebychev's inequality, 614
Signed measure, 577, 597, 605, 635 Terminal decision rule, 537
Significance probability, 217, 228, 280 Test:
Significance test, 217 goodness of fit, 218, 461
Simple alternative, 215 one-sided, 239, 243
Simple function, 586 two-sided, 256, 273
Simple hypothesis, 215 Test function, 175, 215
Size of test, 2, 215-216 Theorem:
Small order, 394 Bahadur,94
stochastic, 396 Basu, 99
SPRT,549 Bayes, 4,16
Squared-error loss, 146, 297 Bhattacharyya lower bounds,
√n-consistent, 401 Bolzano-Weierstrass, 666
SSS, 507 Bolzano-Weierstrass, 666
St. Petersburg paradox, 655 Caratheodory extension, 578
State independence, 184, 205 Cauchy's equation, 667
State-dependent utility, 205-206 central limit, 642
States of Nature, 181, 189, 205 multivariate, 643
Statistic, 83 chain rule, 600
ancillary, 95,99, 119 Chapman-Robbins bound, 304
boundedly complete, 94 complete class, 179
complete, 94, 298 continuity, 640
sufficient, 84-85-86, 99, 103, continuous mapping, 638
150-151,298 Cramer-Rao lower bound, 301
Stein estimator (see James-Stein DeFinetti, 27-28
estimator), 163 dominated convergence, 591
Stochastic large order, 396 Fatou's lemma, 589
Stochastic small order, 396 Fisher-Neyman, 89
Stone-Weierstrass theorem, 666 Fubini,596