
Springer Series in Statistics

Advisors:
P. Diggle, S. Fienberg, K. Krickeberg,
I. Olkin, N. Wermuth

Springer
New York
Berlin
Heidelberg
Barcelona
Budapest
Hong Kong
London
Milan
Paris
Santa Clara
Singapore
Tokyo
Springer Series in Statistics
Andersen/Borgan/Gill/Keiding: Statistical Models Based on Counting Processes.
Andrews/Herzberg: Data: A Collection of Problems from Many Fields for the Student
and Research Worker.
Anscombe: Computing in Statistical Science through APL.
Berger: Statistical Decision Theory and Bayesian Analysis, 2nd edition.
Bolfarine/Zacks: Prediction Theory for Finite Populations.
Borg/Groenen: Modern Multidimensional Scaling: Theory and Applications.
Bremaud: Point Processes and Queues: Martingale Dynamics.
Brockwell/Davis: Time Series: Theory and Methods, 2nd edition.
Daley/Vere-Jones: An Introduction to the Theory of Point Processes.
Dzhaparidze: Parameter Estimation and Hypothesis Testing in Spectral Analysis of
Stationary Time Series.
Fahrmeir/Tutz: Multivariate Statistical Modelling Based on Generalized Linear
Models.
Farrell: Multivariate Calculation.
Federer: Statistical Design and Analysis for Intercropping Experiments.
Fienberg/Hoaglin/Kruskal/Tanur (Eds.): A Statistical Model: Frederick Mosteller's
Contributions to Statistics, Science and Public Policy.
Fisher/Sen: The Collected Works of Wassily Hoeffding.
Good: Permutation Tests: A Practical Guide to Resampling Methods for Testing
Hypotheses.
Goodman/Kruskal: Measures of Association for Cross Classifications.
Grandell: Aspects of Risk Theory.
Haberman: Advanced Statistics, Volume I: Description of Populations.
Hall: The Bootstrap and Edgeworth Expansion.
Härdle: Smoothing Techniques: With Implementation in S.
Hartigan: Bayes Theory.
Heyer: Theory of Statistical Experiments.
Huet/Bouvier/Gruet/Jolivet: Statistical Tools for Nonlinear Regression: A Practical
Guide with S-PLUS Examples.
Jolliffe: Principal Component Analysis.
Kolen/Brennan: Test Equating: Methods and Practices.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume I.
Kotz/Johnson (Eds.): Breakthroughs in Statistics Volume II.
Kres: Statistical Tables for Multivariate Analysis.
Le Cam: Asymptotic Methods in Statistical Decision Theory.
Le Cam/Yang: Asymptotics in Statistics: Some Basic Concepts.
Longford: Models for Uncertainty in Educational Testing.
Manoukian: Modern Concepts and Theorems of Mathematical Statistics.
Miller, Jr.: Simultaneous Statistical Inference, 2nd edition.
Mosteller/Wallace: Applied Bayesian and Classical Inference: The Case of The
Federalist Papers.

(continued after index)


Mark J. Schervish

Theory of Statistics

With 26 Illustrations

Springer
Mark J. Schervish
Department of Statistics
Carnegie Mellon University
Pittsburgh, PA 15213
USA

Library of Congress Cataloging-in-Publication Data


Schervish, Mark J.
Theory of Statistics / Mark J. Schervish
p. cm. - (Springer series in statistics)
Includes bibliographical references (p. ) and index.
ISBN-13: 978-1-4612-8708-7
1. Mathematical statistics. I. Title. II. Series.
QA276.S346 1995
519.5--dc20 95-11235

Printed on acid-free paper.

© 1995 Springer-Verlag New York, Inc.


Softcover reprint of the hardcover 1st edition 1995
All rights reserved. This work may not be translated or copied in whole or in
part without the written permission of the publisher (Springer-Verlag New York,
Inc., 175 Fifth Avenue, New York, NY 10010, USA), except for brief excerpts in
connection with reviews or scholarly analysis. Use in connection with any form of
information storage and retrieval, electronic adaptation, computer software, or by
similar or dissimilar methodology now known or hereafter developed is forbidden.
The use of general descriptive names, trade names, trademarks, etc., in this pub-
lication, even if the former are not especially identified, is not to be taken as a
sign that such names, as understood by the Trade Marks and Merchandise Marks
Act, may accordingly be used freely by anyone.

Production managed by Laura Carlson; manufacturing supervised by Joe Quatela.


Photocomposed pages prepared from the author's LaTeX files.
Printed and bound by Edwards Brothers, Inc., Ann Arbor, MI.
Printed in the United States of America.

9 8 7 6 5 4 3 2 (Corrected second printing, 1997)

ISBN-13: 978-1-4612-8708-7    e-ISBN-13: 978-1-4612-4250-5


DOI: 10.1007/978-1-4612-4250-5
To Nancy, Margaret, and Meredith
Preface

This text has grown out of notes used for lectures in a course entitled Ad-
vanced Statistical Theory at Carnegie Mellon University over several years.
The course (when taught by the author) has attempted to cover, in one
academic year, those topics in estimation, testing, and large sample theory
that are commonly taught to second year graduate students in a math-
ematically rigorous fashion. Most texts at this level fall into one of two
categories. They either ignore the Bayesian point of view altogether or
they cover Bayesian topics almost exclusively. This book covers topics in
both classical¹ and Bayesian inference in a great deal of generality. My own
point of view is Bayesian, but I believe that students need to learn both
types of theory in order to achieve a fuller appreciation of the subject mat-
ter. Although many comparisons are made between classical and Bayesian
methods, it is not a goal of the text to present a formal comparison of the
two approaches as was done by Barnett (1982). Rather, the goal has been
to prepare Ph.D. students to be able to understand and contribute to the
literature of theoretical statistics with a broader perspective than would be
achieved from a purely Bayesian or a purely classical course.
After a brief review of elementary statistical theory, the coverage of the
subject matter begins with a detailed treatment of parametric statistical
models as motivated by DeFinetti's representation theorem for exchangeable
random variables (Chapter 1). In addition, Dirichlet processes and other
tailfree processes are presented as examples of infinite-dimensional param-
eters. Chapter 2 introduces sufficient statistics from both Bayesian and
non-Bayesian viewpoints. Exponential families are discussed here because
of the important role sufficiency plays in these models. Also, the concept
of information is introduced together with its relationship to sufficiency.
A representation theorem is given for general distributions based on suffi-
cient statistics. Decision theory is the subject of Chapter 3, which includes
discussions of admissibility and minimaxity. Section 3.3 presents an ax-
iomatic derivation of Bayesian decision theory, including the use of condi-
tional probability. Chapter 4 covers hypothesis testing, including unbiased
tests, P-values, and Bayes factors. We highlight the contrasts between the
traditional "uniformly most powerful" (UMP) approach to testing and de-
cision theoretic approaches (both Bayesian and classical). In particular, we

¹What I call classical inference is called frequentist inference by some other
authors.

see how the asymmetric treatment of hypotheses and alternatives in the
UMP approach accounts for much of the difference. Point and set estima-
tion are the topics of Chapter 5. This includes unbiased and maximum like-
lihood estimation as well as confidence, prediction, and tolerance sets. We
also introduce robust estimation and the bootstrap. Equivariant decision
rules are covered in Chapter 6. In Section 6.2.2, we debunk the common
misconception of equivariant rules as means for preserving decisions un-
der changes of measurement scale. Large sample theory is the subject of
Chapter 7. This includes asymptotic properties of sample quantiles, maxi-
mum likelihood estimators, robust estimators, and posterior distributions.
The last two chapters cover situations in which the random variables are
not modeled as being exchangeable. Hierarchical models (Chapter 8) are
useful for data arrays. Here, the parameters of the model can be modeled
as exchangeable while the observables are only partially exchangeable. We
introduce the popular computational tool known as Markov chain Monte
Carlo, Gibbs sampling, or successive substitution sampling, which is very
useful for fitting hierarchical models. Some topics in sequential analysis are
presented in Chapter 9. These include classical tests, Bayesian decisions,
confidence sets, and the issue of sampling to a foregone conclusion.
The presentation of material is intended to be very general and very pre-
cise. One of the goals of this book was to be the place where the proofs could
be found for many of those theorems whose proofs were "beyond the scope
of the course" in elementary or intermediate courses. For this reason, it is
useful to rely on measure theoretic probability. Since many students have
not studied measure theory and probability recently or at all, I have in-
cluded appendices on measure theory (Appendix A) and probability theory
(Appendix B).² Even those who have measure theory in their background
can benefit from seeing these topics discussed briefly and working through
some problems. At the beginnings of these two appendices, I have given
overviews of the important definitions and results. These should serve as
reminders for those who already know the material and as groundbreaking
for those who do not. There are, however, some topics covered in Ap-
pendix B that are not part of traditional probability courses. In particular,
there is the material in Section B.3.3 on conditional densities with respect
to nonproduct measures. Also, there is Section B.6, which attempts to use
the ideas of gambling to motivate the mathematical definition of proba-
bility. Since conditional independence and the law of total probability are
so central to Bayesian predictive inference, readers may want to study the
material in Sections B.3.4 and B.3.5 also.
Appendix C lists purely mathematical theorems that are used in the text

²These two appendices contain sufficient detail to serve as the basis for a full-
semester (or more) course in measure and probability. They are included in this
book to make it more self-contained for students who do not have a background
in measure theory.

without proof, and Appendix D gives a brief summary of the distributions
that are used throughout the text. An index is provided for notation and
abbreviations that are used at a considerable distance from where they are
defined. Throughout the book, I have added footnotes to those results that
are of interest mainly through their value in proving other results. These
footnotes indicate where the results are used explicitly elsewhere in the
book. This is intended as an aid to instructors who wish to select which
results to prove in detail and which to mention only in passing. A single
numbering system is used within each chapter and includes theorems, lem-
mas, definitions, corollaries, propositions, assumptions, examples, tables,
figures, and equations in order to make them easier to locate when needed.
I was reluctant to mark sections to indicate which ones could be skipped
without interrupting the flow of the text because I was afraid that readers
would interpret such markings as signs that the material was not impor-
tant. However, because there may be too much material to cover, especially
if the measure theory and probability appendices are covered, I have de-
cided to mark two different kinds of sections whose material is used at most
sparingly in other parts of the text. Those sections marked with a plus sign
(+) make use of the theory of martingales. A lot of the material in some
of these sections is used in other such sections, but the remainder of the
text is relatively free of martingales. Martingales are particularly useful in
proving limit theorems for conditional probabilities. The remaining sections
that can be skipped or covered out of order without seriously interrupting
the flow of material are marked with an asterisk (*). No such system is
foolproof, however. For example, even though essentially all of the material
dealing with equivariance is isolated in Chapter 6, there is one example in
Chapter 7 and one exercise that make reference to the material. Similarly,
the material from other sections marked with the asterisk may occasion-
ally appear in examples later in the text. But these occurrences should be
inconsequential. Of course, any instructor who feels that equivariance is an
important topic should not be put off by the asterisk. In that same vein,
students really ought to be made aware of what the main theorems in Sec-
tion 3.3 say (Theorems 3.108 and 3.110), even though the section could be
skipped without interrupting the flow of the material.
I would like to thank many people who helped me to write this book or
who read early drafts. Many people have provided corrections and guidance
for clarifying some of the discussions (not to mention corrections to some
proofs). In particular, thanks are due to Chris Andrews, Bogdan Doytchi-
nov, Petros Hadjicostas, Tao Jiang, Rob Kass, Agostino Nobile, Shingo
Oue, and Thomas Short. Morris DeGroot helped me to understand what
is really going on with equivariance. Teddy Seidenfeld introduced me to
the axiomatic foundations of decision theory. Mel Novick introduced me
to the writings of DeFinetti. Persi Diaconis and Bill Strawderman made
valuable suggestions after reading drafts of the book, and those suggestions
are incorporated here. Special thanks go to Larry Wasserman, who taught

from two early drafts of the text and provided invaluable feedback on the
(lack of) clarity in various sections.
As a student at the University of Illinois at Urbana-Champaign, I learned
statistical theory from Stephen Portnoy, Robert Wijsman, and Robert
Bohrer (although some of these people may deny that fact after reading this
book). Many of the proofs and results in this text bear startling resemblance
to my notes taken as a student. Many, in turn, undoubtedly resemble works
recorded in other places. Whenever I have essentially lifted, or cosmetically
modified, or even only been deeply inspired by a published source, I have
cited that source in the text. If results copied from my notes as a student
or produced independently also resemble published results, I can only apol-
ogize for not having taken enough time to seek out the earliest published
reference for every result and proof in the text. Similarly, the problems at
the ends of each chapter have come from many sources. One source used
often was the file of old qualifying exams from the Department of Statistics
at Carnegie Mellon University. These problems, in turn, came from various
sources unknown to me (even the ones I wrote). If I have used a problem
without giving proper credit, please take it as a compliment. Some of the
more challenging problems have been identified with an asterisk (*) after
the problem number. Many of the plots in the text were produced using
The New S Language and S-Plus [see Becker, Chambers, and Wilks (1988)
and StatSci (1992)]. The original text processing was done using LaTeX,
which was written by Lamport (1986) and was based on TeX by Knuth
(1984).
Pittsburgh, Pennsylvania MARK J. SCHERVISH
May 1995
Several corrections needed to be made between the first and second print-
ings of this book. During that time, I created a world-wide web page
http://www.stat.cmu.edu/~mark/advt/
on which readers may find up-to-date lists of any corrections that have
been required. The most significant individual corrections made between
the first and second printings are listed here:
The discussion of the famous M-estimator on page 314 has been
corrected.
Theorems 7.108 and 7.116 each needed an additional condition con-
cerning uniform boundedness of the derivatives of the Hn and Hn*
functions on a compact set. Only small changes were made to the
proofs.
The proofs of Theorems B.83 and B.133 were corrected, and small
changes were made to Example 2.81 and Definition B.137.
Contents

Preface vii

Chapter 1: Probability Models 1


1.1 Background........ 1
1.1.1 General Concepts . 1
1.1.2 Classical Statistics 2
1.1.3 Bayesian Statistics 4
1.2 Exchangeability...... 5
1.2.1 Distributional Symmetry 5
1.2.2 Frequency and Exchangeability 10
1.3 Parametric Models . . . . . . . . . . . 12
1.3.1 Prior, Posterior, and Predictive Distributions 13
1.3.2 Improper Prior Distributions . . . 19
1.3.3 Choosing Probability Distributions 21
1.4 DeFinetti's Representation Theorem . 24
1.4.1 Understanding the Theorems . 24
1.4.2 The Mathematical Statements 26
1.4.3 Some Examples . . . . . . . . . 28
1.5 Proofs of DeFinetti's Theorem and Related Results* 33
1.5.1 Strong Law of Large Numbers 33
1.5.2 The Bernoulli Case . . . . 36
1.5.3 The General Finite Case* . . . 38
1.5.4 The General Infinite Case . . . 45
1.5.5 Formal Introduction to Parametric Models* 49
1.6 Infinite-Dimensional Parameters* 52
1.6.1 Dirichlet Processes 52
1.6.2 Tailfree Processes+ 60
1.7 Problems . . . . . . . . . 73

Chapter 2: Sufficient Statistics 82


2.1 Definitions . . . . . . . . . . 82
2.1.1 Notational Overview 82
2.1.2 Sufficiency . . . . . . 83
2.1.3 Minimal and Complete Sufficiency 92
2.1.4 Ancillarity............ 95
2.2 Exponential Families of Distributions. . . .102

*Sections and chapters marked with an asterisk may be skipped or covered


out of order without interrupting the flow of ideas.
+Sections marked with a plus sign include results which rely on the theory of
martingales. They may be skipped without interrupting the flow of ideas.

2.2.1 Basic Properties . . . . . . . .102


2.2.2 Smoothness Properties. . . . .105
2.2.3 A Characterization Theorem .109
2.3 Information . . . . . . . . . . . . . . .110
2.3.1 Fisher Information . . . . . . .111
2.3.2 Kullback-Leibler Information .115
2.3.3 Conditional Information .118
2.3.4 Jeffreys' Prior .. . 121
2.4 Extremal Families . . . . . .123
2.4.1 The Main Results .124
2.4.2 Examples .127
2.4.3 Proofs+ .129
2.5 Problems . . . . .138

Chapter 3: Decision Theory 144


3.1 Decision Problems .. . .144
3.1.1 Framework .. . .144
3.1.2 Elements of Bayesian Decision Theory .146
3.1.3 Elements of Classical Decision Theory .149
3.1.4 Summary............. .150
3.2 Classical Decision Theory . . . . . . . . .150
3.2.1 The Role of Sufficient Statistics. .150
3.2.2 Admissibility . . . . . . .153
3.2.3 James-Stein Estimators .163
3.2.4 Minimax Rules . . . . . .167
3.2.5 Complete Classes . . . . .174
3.3 Axiomatic Derivation of Decision Theory . 181
3.3.1 Definitions and Axioms . 181
3.3.2 Examples........... .186
3.3.3 The Main Theorems . . . . . .188
3.3.4 Relation to Decision Theory . .189
3.3.5 Proofs of the Main Theorems .190
3.3.6 State-Dependent Utility .205
3.4 Problems . . . . . . . . . . . . . . . . .208

Chapter 4: Hypothesis Testing 214


4.1 Introduction.................. .214
4.1.1 A Special Kind of Decision Problem .214
4.1.2 Pure Significance Tests .216
4.2 Bayesian Solutions . . . . .218
4.2.1 Testing in General .218
4.2.2 Bayes Factors . . . .220
4.3 Most Powerful Tests . . . .230
4.3.1 Simple Hypotheses and Alternatives .233
4.3.2 Simple Hypotheses, Composite Alternatives .238
4.3.3 One-Sided Tests . . . . .239
4.3.4 Two-Sided Hypotheses . .246
4.4 Unbiased Tests . . . . . .253
4.4.1 General Results. . . . . .253

4.4.2 Interval Hypotheses . 255


4.4.3 Point Hypotheses. . 257
4.5 Nuisance Parameters . . . . . 265
4.5.1 Neyman Structure . . 265
4.5.2 Tests about Natural Parameters . 268
4.5.3 Linear Combinations of Natural Parameters . . 272
4.5.4 Other Two-Sided Cases* . . . . . . . . . . 272
4.5.5 Likelihood Ratio Tests . . . . . . . . . . . 274
4.5.6 The Standard F-Test as a Bayes Rule* . . 276
4.6 P-Values . . . . . . . . . . . . . . . . . 279
4.6.1 Definitions and Examples . . . 279
4.6.2 P-Values and Bayes Factors. . 283
4.7 Problems . . . . . . . . . . . . . . . . 285

Chapter 5: Estimation 296


5.1 Point Estimation . . . . . . . . . . . . . . . . . . . . . . . . .. . 296
5.1.1 Minimum Variance Unbiased Estimation. . . . . . . .. . 297
5.1.2 Lower Bounds on the Variance of Unbiased Estimators. . 301
5.1.3 Maximum Likelihood Estimation .307
5.1.4 Bayesian Estimation . 309
5.1.5 Robust Estimation* . 310
5.2 Set Estimation . . . . . . 315
5.2.1 Confidence Sets. . 315
5.2.2 Prediction Sets . 324
5.2.3 Tolerance Sets* . . 325
5.2.4 Bayesian Set Estimation . . 327
5.2.5 Decision Theoretic Set Estimation . . 328
5.3 The Bootstrap . . . . . . . . . . . . . . 329
5.3.1 The General Concept . . . . . . 329
5.3.2 Standard Deviations and Bias . . 335
5.3.3 Bootstrap Confidence Intervals . 336
5.4 Problems . . . . . . . . . . . . . . . . . 339

Chapter 6: Equivariance* 344


6.1 Common Examples . . 344
6.1.1 Location Problems . 344
6.1.2 Scale Problems* . . 350
6.2 Equivariant Decision Theory . 353
6.2.1 Groups of Transformations . 353
6.2.2 Equivariance and Changes of Units . . 359
6.2.3 Minimum Risk Equivariant Decisions. .363
6.3 Testing and Confidence Intervals . . . . . 375
6.3.1 P-Values in Invariant Problems. .375
6.3.2 Equivariant Confidence Sets. . 379
6.3.3 Invariant Tests* . 380
6.4 Problems . . . . . . . . . . . . . . . . 388

Chapter 7: Large Sample Theory 394


7.1 Convergence Concepts . . . . .394
7.1.1 Deterministic Convergence .394
7.1.2 Stochastic Convergence .395
7.1.3 The Delta Method .401
7.2 Sample Quantiles . . . . . .404
7.2.1 A Single Quantile .404
7.2.2 Several Quantiles . .408
7.2.3 Linear Combinations of Quantiles .410
7.3 Large Sample Estimation . . . . . . . . . .412
7.3.1 Some Principles of Large Sample Estimation .412
7.3.2 Maximum Likelihood Estimators .415
7.3.3 MLEs in Exponential Families . .418
7.3.4 Examples of Inconsistent MLEs . .420
7.3.5 Asymptotic Normality of MLEs . .421
7.3.6 Asymptotic Properties of M-Estimators .424
7.4 Large Sample Properties of Posterior Distributions .428
7.4.1 Consistency of Posterior Distributions+ .. .429
7.4.2 Asymptotic Normality of Posterior Distributions .435
7.4.3 Laplace Approximations to Posterior Distributions .446
7.4.4 Asymptotic Agreement of Predictive Distributions+ .455
7.5 Large Sample Tests . . . . . . . . . . . . . . .458
7.5.1 Likelihood Ratio Tests . . . . . . . . .458
7.5.2 Chi-Squared Goodness of Fit Tests .461
7.6 Problems . . . . . . . . . . . . . . . . . . . .467

Chapter 8: Hierarchical Models 476


8.1 Introduction......... .476
8.1.1 General Hierarchical Models. .476
8.1.2 Partial Exchangeability . . . .479
8.1.3 Examples of the Representation Theorem* . .480
8.2 Normal Linear Models . . . . . . . . . . . .483
8.2.1 One-Way ANOVA . . . . . . . . . .483
8.2.2 Two-Way Mixed Model ANOVA . .488
8.2.3 Hypothesis Testing . . .491
8.3 Nonnormal Models . . . . . . .495
8.3.1 Poisson Process Data . .495
8.3.2 Bernoulli Process Data. .497
8.4 Empirical Bayes Analysis . . . .500
8.4.1 Naive Empirical Bayes . .500
8.4.2 Adjusted Empirical Bayes .503
8.4.3 Unequal Variance Case . .504
8.5 Successive Substitution Sampling .505
8.5.1 The General Algorithm . .505
8.5.2 Normal Hierarchical Models. .512
8.5.3 Nonnormal Models . . .. .517
8.6 Mixtures of Models . . . . . . . . .519
8.6.1 General Mixture Models . .519
8.6.2 Outliers........ .. .521

8.6.3 Bayesian Robustness . .524


8.7 Problems . . . . . . . . . . .532

Chapter 9: Sequential Analysis 536


9.1 Sequential Decision Problems .536
9.2 The Sequential Probability Ratio Test .548
9.3 Interval Estimation . . . . . . . .558
9.4 The Relevance of Stopping Rules .562
9.5 Problems . . . . . . . . . . . . . .567

Appendix A: Measure and Integration Theory 570


A.1 Overview . . . . . . . . . . . .570
A.1.1 Definitions . . . . . . .570
A.1.2 Measurable Functions .572
A.1.3 Integration . . . . . .573
A.1.4 Absolute Continuity .574
A.2 Measures . . . . . . . .575
A.3 Measurable Functions .582
A.4 Integration . . . . . .587
A.5 Product Spaces . . . .593
A.6 Absolute Continuity .597
A.7 Problems . . . . . . .602

Appendix B: Probability Theory 606


B.1 Overview . . . . . . . . . . . .606
B.1.1 Mathematical Probability .606
B.1.2 Conditioning . . . .607
B.1.3 Limit Theorems . . . . . .611
B.2 Mathematical Probability . . . . .612
B.2.1 Random Quantities and Distributions .612
B.2.2 Some Useful Inequalities . .613
B.3 Conditioning . . . . . . . . . . . .615
B.3.1 Conditional Expectations .615
B.3.2 Borel Spaces . . . . . . . .619
B.3.3 Conditional Densities .. .623
B.3.4 Conditional Independence . . .628
B.3.5 The Law of Total Probability .632
B.4 Limit Theorems. . . . . . . . . . . . .634
B.4.1 Convergence in Distribution and in Probability .634
B.4.2 Characteristic Functions . .639
B.5 Stochastic Processes .645
B.5.1 Introduction .. .645
B.5.2 Martingales+ .. .645
B.5.3 Markov Chains .650
B.5.4 General Stochastic Processes .651
B.6 Subjective Probability .654
B.7 Simulation .659
B.8 Problems . . . . . . . .661

Appendix C: Mathematical Theorems Not Proven Here 665


C.1 Real Analysis . . . . .665
C.2 Complex Analysis . .666
C.3 Functional Analysis. .667

Appendix D: Summary of Distributions 668


D.1 Univariate Continuous Distributions .668
D.2 Univariate Discrete Distributions .672
D.3 Multivariate Distributions . . . . . . .674

References 675

Notation and Abbreviation Index 689

Name Index 691

Subject Index 694


CHAPTER 1
Probability Models

1.1 Background
The purpose of this book is to cover important topics in the theory of statis-
tics in a very thorough and general fashion. In this section, we will briefly
review some of the basic theory of statistics with which many students are
familiar. All that we do here will be repeated in a more precise manner at
the appropriate place in the text.

1.1.1 General Concepts


Most paradigms for statistical inference make at least some use of the fol-
lowing structure. We suppose that some random variables Xl, ... ,Xn all
have the same distribution, but we may be unwilling to say what that distri-
bution is. Instead, we create a collection of distributions called a parametric
family and denoted P0. For example, P0 might consist of all normal distri-
butions, or just those normal distributions with variance 1, or all binomial
distributions, or all Poisson distributions, and so forth. Each of these cases
has the property that the collection of distributions can be indexed by a
finite-dimensional real quantity, which is commonly called a parameter. For
example, if the parametric family is all normal distributions, then the pa-
rameter can be denoted Θ = (M, Σ), where M stands for the mean and
Σ stands for the standard deviation. The set of all possible values of the
parameter is called the parameter space and is often denoted by Ω. When
Θ = θ, the distribution of the observations is denoted by Pθ. Expected
values are denoted as Eθ(·).
We will denote the observed data by X. It might be that X is a vector of ob-

servations that are mutually independent and identically distributed (IID),


or X might be some general quantity. The set of possible values for X is
the sample space and is often denoted as X. The members Pθ of the para-
metric family will be distributions over this space X. If X is continuous or
discrete, then densities or probability mass functions¹ exist. We will denote
the density or mass function for Pθ by fX|Θ(·|θ). For example, if X is a
single random variable with continuous distribution, then

Pr(X ∈ A | Θ = θ) = ∫_A fX|Θ(x|θ) dx.

If X = (X1, ..., Xn), where the Xi are IID each with density (or mass
function) fX1|Θ(·|θ) when Θ = θ, then

fX|Θ(x|θ) = ∏_{i=1}^n fXi|Θ(xi|θ),     (1.1)

where x = (x1, ..., xn). After observing the data X1 = x1, ..., Xn = xn,
the function in (1.1), as a function of θ for fixed x, is called the likelihood
function, denoted by L(θ). Section 1.3 is devoted to a motivation of the
above structure based on the concept of exchangeability and DeFinetti's
representation theorem 1.49. Exchangeability is discussed in detail in Sec-
tion 1.2, and DeFinetti's theorem is the subject of Section 1.4.
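The product in (1.1) can be sketched numerically. The following Python fragment is not part of the text; the N(θ, 1) family and the data values are illustrative assumptions chosen only to show L(θ) being evaluated as a product of individual densities.

```python
import math

def normal_density(x, theta):
    # Density of the N(theta, 1) distribution evaluated at x.
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def likelihood(theta, data):
    # L(theta): the product in (1.1) for IID observations.
    value = 1.0
    for x in data:
        value *= normal_density(x, theta)
    return value

data = [1.2, 0.7, 1.9, 1.1]      # illustrative observed values
xbar = sum(data) / len(data)
# For this family the likelihood peaks at the sample mean xbar.
```

Evaluating `likelihood` over a range of θ and comparing the values at the sample mean and elsewhere makes the "function of θ for fixed x" reading of (1.1) concrete.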

1.1.2 Classical Statistics


Classical inferential techniques include tests of hypotheses, unbiased esti-
mates, maximum likelihood estimates, confidence intervals and many other
things. These will be covered in great detail in the text, but we remind the
reader of a few of them here. Suppose that we are interested in whether or
not the parameter lies in one portion ΩH of the parameter space. We could
then set up a hypothesis H : Θ ∈ ΩH with the corresponding alternative
A : Θ ∉ ΩH. The simplest sort of test of this hypothesis would be to choose
a subset R ⊆ X, and then reject H if x ∈ R is observed. The set R would
be called the rejection region for the test. If x ∉ R, we would say that we
do not reject H. Tests are compared based on their power functions. The
power function of a test with rejection region R is β(θ) = Pθ(X ∈ R). The
size of a test is sup_{θ∈ΩH} β(θ). Chapter 4 covers hypothesis testing in depth.
Example 1.2. Suppose that X = (X1, ..., Xn) and the Xi are IID with N(θ, 1)
distribution under Pθ. The usual size α test of H : Θ = θ0 versus A : Θ ≠ θ0 is

¹Using the theory of measures (see Appendix A) we will be able to dispense
with the distinction between densities and probability mass functions. They will
both be special cases of a more general type of "density."

to reject H if X̄ ∈ R, where X̄ is the sample average,

R = (−∞, θ0 + (1/√n) Φ^{−1}(α/2)] ∪ [θ0 + (1/√n) Φ^{−1}(1 − α/2), ∞),

and Φ is the standard normal cumulative distribution function (CDF).
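The rejection region of Example 1.2 is easy to compute with Python's standard library; this sketch is not from the text, and α = 0.05, n = 25, and θ0 = 0 are illustrative choices.

```python
import math
from statistics import NormalDist

alpha, n, theta0 = 0.05, 25, 0.0     # illustrative values, not from the text
phi_inv = NormalDist().inv_cdf       # standard normal quantile function

lower = theta0 + phi_inv(alpha / 2) / math.sqrt(n)
upper = theta0 + phi_inv(1 - alpha / 2) / math.sqrt(n)

def reject(xbar):
    # Reject H: Theta = theta0 when the sample mean lands in R.
    return xbar <= lower or xbar >= upper
```

With these choices the cutoffs are θ0 ± 1.96/√25, so a sample mean of 0.5 rejects H while 0.2 does not.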
The notation and terminology in Chapter 4 are different from the above
because we consider a more general class of tests called randomized tests.
These are special cases of randomized decision rules, which are introduced
in Chapter 3. The following example illustrates the reason that randomized
decisions are introduced.
Example 1.3. Let X ~ Bin(5, θ) given Θ = θ. Suppose that we wish to test
H : Θ ≤ 1/2 versus A : Θ > 1/2. It might seem that the best test would be
to reject H if X > c, where c is chosen to make the test have the desired level.
Unfortunately, only six different levels are available for tests of this form. For
example, if c E [4,5), the test has level 1/32. If c E [3,4), the test has level 3/16,
and so on. If you desire a level such as 0.05, you must use a more complicated
test.
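The attainable levels in Example 1.3 can be enumerated directly. The following sketch (not from the text) tabulates P(X > c | Θ = 1/2) for X ~ Bin(5, 1/2) over the integer cutoffs that matter.

```python
from math import comb

def level(c):
    # P(X > c) under Bin(5, 1/2): the size of the test "reject if X > c".
    return sum(comb(5, k) for k in range(c + 1, 6)) / 2 ** 5

# c = 4 gives level 1/32 and c = 3 gives 3/16, as stated in the example.
attainable = sorted({level(c) for c in range(0, 6)})
```

Running this shows exactly six attainable levels for tests of this form, which is why a level such as 0.05 forces a more complicated (randomized) test.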

A function of the data which takes values in the parameter space is called
a (point) estimator of Θ. Section 5.1 considers point estimation in depth.
Example 1.4. Suppose that X = (X1, ..., Xn) and the Xi are IID with N(θ, 1)
distribution under Pθ. Then φ(x) = ∑_{i=1}^n xi/n = x̄ takes values in the parameter
space and can be considered an estimator of Θ.
Sometimes we wish to estimate a function g of Θ. An estimator φ of
g(Θ) is unbiased if Eθ[φ(X)] = g(θ) for all θ ∈ Ω. An estimator θ̂ of Θ is a
maximum likelihood estimator (MLE) if

sup_{θ∈Ω} L(θ) = L(θ̂(x)),

for all x ∈ X. An estimator ψ of g(Θ) is an MLE if ψ(x) = g(θ̂(x)),
where θ̂ is an MLE of Θ. The reader should verify that the estimator in
Example 1.4 is both an unbiased estimator and an MLE of Θ.
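The MLE claim about Example 1.4 can also be checked numerically. In the sketch below (not from the text; the data values and grid are illustrative assumptions) a grid search over θ recovers the sample mean as the maximizer of the log likelihood.

```python
def log_likelihood(theta, data):
    # log L(theta) for IID N(theta, 1) data, up to an additive constant.
    return sum(-0.5 * (x - theta) ** 2 for x in data)

data = [0.3, -1.1, 0.8, 0.4, 0.1]    # illustrative observed values
xbar = sum(data) / len(data)         # the conjectured MLE

grid = [i / 1000.0 for i in range(-3000, 3001)]
mle = max(grid, key=lambda t: log_likelihood(t, data))
# The grid maximizer agrees with the sample mean.
```

This is only a check, not a proof; the analytic verification asked of the reader is a one-line calculus exercise.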
If the parameter Θ is real-valued, it is common to provide interval es-
timates of Θ. If (A, B) is a pair of random variables with A ≤ B, and
if

Pθ(A ≤ θ ≤ B) ≥ γ,

for all θ ∈ Ω, then [A, B] is called a coefficient γ confidence interval for Θ.
Section 5.2 covers the theory of set estimation, which includes confidence
intervals, prediction intervals, and tolerance intervals as special cases.
Example 1.5 (Continuation of Example 1.4). Suppose that X = (X1, ..., Xn)
and the Xi are IID with N(θ, 1) distribution under Pθ, and let

A = X̄ − c/√n,    B = X̄ + c/√n,

where c > 0. Then [A, B] is a coefficient 1 − 2Φ(−c) confidence interval for Θ, where
Φ is the standard normal CDF.
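A Monte Carlo check of the coverage coefficient in Example 1.5 can be sketched as follows; this fragment is not from the text, and θ, n, c, and the simulation size are illustrative choices.

```python
import math
import random
from statistics import NormalDist

theta, n, c = 2.0, 10, 1.96          # illustrative parameter and interval choices
rng = random.Random(0)               # fixed seed for reproducibility

trials, hits = 20000, 0
for _ in range(trials):
    xbar = sum(rng.gauss(theta, 1.0) for _ in range(n)) / n
    if xbar - c / math.sqrt(n) <= theta <= xbar + c / math.sqrt(n):
        hits += 1

nominal = 1.0 - 2.0 * NormalDist().cdf(-c)   # the coefficient 1 - 2*Phi(-c)
# The empirical coverage hits/trials should sit near the nominal value,
# which is about 0.95 for c = 1.96.
```

Because the coverage probability does not depend on θ here, rerunning with a different θ gives essentially the same empirical coverage.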

1.1.3 Bayesian Statistics


In the Bayesian paradigm, one treats all unknown quantities as random
variables and constructs a joint probability distribution for all of them.
Using the same setup as in Section 1.1.1, this would require that one con-
struct a distribution for the parameter Θ in addition to the conditional
distribution of X given Θ = θ, which was denoted by P_θ. The distribution
of Θ is called the prior distribution. Together, the prior distribution and
{P_θ : θ ∈ Ω} determine a joint distribution on the space X × Ω. For exam-
ple, suppose that the prior distribution has a density f_Θ, suppose that X
is continuous, and let B ⊆ X × Ω. Then

Pr((X, Θ) ∈ B) = ∫∫ I_B(x, θ) f_{X|Θ}(x|θ) f_Θ(θ) dx dθ,

where I_B is the indicator function of the set B. It will often be possible
(although not necessary) to think of the space X × Ω as if it were the
underlying probability space S which is introduced in Appendix B. In this
way, X and Θ are both easily recognized as functions from S to their
respective ranges. That is, if s = (x, θ), then X(s) = x and Θ(s) = θ.
After observing the data X = x, one constructs the conditional distri-
bution of Θ given X = x, which is called the posterior distribution, using
Bayes' theorem:

f_{Θ|X}(θ|x) = f_{X|Θ}(x|θ) f_Θ(θ) / ∫_Ω f_{X|Θ}(x|t) f_Θ(t) dt.  (1.6)

A popular method of finding the posterior distribution is to note that the


denominator of (1.6) is not a function of θ. (In fact, the denominator in
(1.6) is called the prior predictive density of the data X, f_X(x).) This
means that we can find f_{Θ|X}(θ|x) by calculating the numerator of (1.6)
and then dividing it by whatever constant is required to make it a density
as a function of θ.
Example 1.7 (Continuation of Example 1.4; see page 3). Suppose that X =
(X₁,…,X_n) and the X_i are conditionally IID with N(θ,1) distribution given
Θ = θ. Suppose that the prior distribution of Θ is N(θ₀, 1/λ), where θ₀ and λ
are known constants. The likelihood function is

f_{X|Θ}(x|θ) = (2π)^{−n/2} exp( −(n/2)[θ − x̄]² − (1/2) Σ_{i=1}^n [x_i − x̄]² ),

and the prior density is f_Θ(θ) = √λ (2π)^{−1/2} exp(−λ[θ − θ₀]²/2). Multiplying
these together and simplifying yield the following expression for the numerator
of (1.6):

k(x) exp( −((n + λ)/2) [θ − θ₁]² ),  (1.8)

where θ₁ = (λθ₀ + nx̄)/(λ + n), and k(x) does not depend on θ. The expression in
(1.8) is easily recognized as being proportional to the N(θ₁, 1/[λ + n]) density as a
function of θ. So, the posterior distribution of Θ given X = x is N(θ₁, 1/[λ + n]).
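The update θ₁ = (λθ₀ + nx̄)/(λ + n) is easy to verify numerically. In the sketch below (the data and the choices θ₀ = 0, λ = 1 are our own, not the text's), the ratio of the unnormalized posterior (likelihood times prior) to the claimed N(θ₁, 1/(λ+n)) density should be constant in θ.

```python
import math

def posterior(xs, theta0=0.0, lam=1.0):
    """Posterior mean and variance for N(theta, 1) data with N(theta0, 1/lam) prior."""
    n, xbar = len(xs), sum(xs) / len(xs)
    theta1 = (lam * theta0 + n * xbar) / (lam + n)
    return theta1, 1.0 / (lam + n)

def unnorm(theta, xs, theta0=0.0, lam=1.0):
    """Likelihood times prior, up to factors not involving theta."""
    loglik = -0.5 * sum((x - theta) ** 2 for x in xs)
    logpri = -0.5 * lam * (theta - theta0) ** 2
    return math.exp(loglik + logpri)

xs = [1.2, 0.7, 1.9, 0.4, 1.3]          # invented data
theta1, var1 = posterior(xs)
dens = lambda t: math.exp(-(t - theta1) ** 2 / (2 * var1))
r1 = unnorm(0.5, xs) / dens(0.5)
r2 = unnorm(1.5, xs) / dens(1.5)
print(theta1, var1, r1, r2)             # r1 and r2 agree, as (1.8) predicts
```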
1.2. Exchangeability 5

Inferences about Θ in the Bayesian paradigm are based on the posterior
distribution. For example, one might use the posterior mean or median of
Θ as an estimate of Θ. In Example 1.7 on page 4, the posterior mean and
median are both θ₁. The Bayesian paradigm also accommodates inference
about future observables. If Y denotes some future observations that are
conditionally independent of X given Θ, such as Y = (X_{n+1},…,X_{n+m}),
then the posterior predictive density of Y is

f_{Y|X}(y|x) = ∫_Ω f_{Y|Θ}(y|θ) f_{Θ|X}(θ|x) dθ.

Example 1.9 (Continuation of Example 1.7; see page 4). Let Y = X_{n+1}, the
next observation. The posterior predictive density of Y is

∫ (1/√(2π)) exp( −[y − θ]²/2 ) · (√(n + λ)/√(2π)) exp( −((n + λ)/2)[θ − θ₁]² ) dθ
  = (√(n + λ)/√(2π(n + λ + 1))) exp( −((n + λ)/(2(n + λ + 1)))[y − θ₁]² ),

which is the density of the N(θ₁, 1 + 1/[n + λ]) distribution.
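The predictive distribution can be checked by Monte Carlo: draw θ from the posterior and then Y ~ N(θ, 1). The sketch below (the constants θ₁, n, λ are arbitrary illustrative choices) confirms that the draws of Y have mean θ₁ and variance 1 + 1/(n + λ).

```python
import random

rng = random.Random(0)
theta1, n, lam = 0.9, 10, 2.0   # illustrative values
ys = []
for _ in range(200000):
    theta = rng.gauss(theta1, (1.0 / (n + lam)) ** 0.5)  # posterior draw
    ys.append(rng.gauss(theta, 1.0))                     # next observation

mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
print(mean, var)   # near theta1 = 0.9 and 1 + 1/(n + lam) = 1 + 1/12
```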

The theory of prior, posterior, and predictive distributions is introduced


in Section 1.3.1. Many Bayesian inferential techniques tend to be decision-
theoretic, so the theory of decisions is introduced in Chapter 3. In the text,
Bayesian techniques are usually introduced near the locations at which the
corresponding classical techniques are introduced.

1.2 Exchangeability
1.2.1 Distributional Symmetry
When one performs a statistical analysis, there are usually several quanti-
ties about which one is uncertain. For example, when conducting a political
poll, one never knows in advance which of several answers each respondent
will provide. In addition, even after the responses are in, one does not know
the answers that would have been supplied by all of the people who were
not polled. If one is interested in the proportions of the population who
would provide each of the available responses, then all of the would-be re-
sponses of all members of the population are potentially of interest. The
most complete specification of a probability distribution would give the
joint distribution of all of these responses. From this joint distribution, the
distributions of the various proportions of interest could also be calculated.
The quantities of interest can be more complicated than counts and proportions
without changing the basic considerations. For example, a company
may keep track of the total amounts of a sample of its sales to a sample
of its customers at a sample of its stores on a sample of days. It may be

interested in various average sales amounts across different stores in a single
department, or across different departments in a single store, or across
different days, and so on. Once again, the joint distribution of all vectors of
total sale, register, store, and day would facilitate answering the questions
of interest.
How does (or should) one construct the probability distributions needed
in such examples, and how does one draw inferences from the various types
of data? Some of the more common ways to draw inferences were described
briefly in Section 1.1. In order better to understand probability and in-
ference, let us take a very simplistic example, which should not be too
encumbered by considerations of available scientific knowledge. Consider
an old-fashioned thumbtack 2 (one of the metal ones with a round, curved
head, not the colored plastic ones). We will toss this thumbtack onto a soft
surface3 and keep track of whether it comes to stop with the point up or
with the point down. In the absence of any information to distinguish the
tosses or to suggest that tosses occurring close together in time are any
more or less likely to be similar to or different from each other than those
that are far apart in time, it seems reasonable to treat the different tosses
symmetrically. We might also believe that although we might only toss the
thumbtack a few times, if we were to toss it many more times, the same
judgment of symmetry would continue to apply to the future tosses. Under
these conditions, it is traditional to model the outcomes of the tosses as in-
dependent and identically distributed (IID) random variables with X_i = 1
meaning that toss i is point up and X_i = 0 meaning that toss i is point
down. In the classical framework, one invents a parameter, say θ, which is
assumed to be a fixed value not yet known to us.⁴ Then one says that the
X_i are IID with Pr(X_i = 1) = θ. Within a Bayesian framework, one might

2This example is described in detail by Lindley and Phillips (1976). Other


interesting examples of how exchangeability aids in the understanding of inference
problems were given by Lindley and Novick (1981). This example is used, in
preference to tossing of coins, because most readers will not have particularly
strong prior opinions about how a thumbtack will land. On the other hand,
most people believe that the typical coin selected from one's pocket or purse has
probability pretty near 1/2 of landing head up.
3This is done to avoid damaging the thumbtack. This is the last scientific
consideration we will make.
4 A great deal of controversy in statistics arises out of the question of the mean-
ing of such quantities. DeFinetti (1974) argues persuasively that one need not
assume the existence of such things. Sometimes they are just assumed to be un-
defined properties of the experimental setup which magically make the outcomes
behave according to our probability models. Sometimes they are defined in terms
of the sequence of observations themselves (such as limits of relative frequencies).
This last is particularly troublesome because the sequence of observations does
not yet exist and hence the limit of relative frequency cannot be a fixed value
yet.

construct a probability distribution μ for this unknown θ and say that

Pr(X₁ = x₁,…,X_n = x_n) = ∫ θ^{x₁+⋯+x_n} (1 − θ)^{n−x₁−⋯−x_n} dμ(θ).  (1.10)

It seems unfortunate that so much machinery as assumptions of mutual


independence and the existence of a mysterious fixed but unknown () must
be introduced to describe what seems, on the surface, to be a relatively
simple situation. One purpose of this chapter is to show how to replace the
heavy probabilistic assumptions of IID and "fixed but unknown θ" with
a minimal assumption that reflects nothing more than the symmetry ex-
pressed in the problem. At the same time, we will be able to understand
when models like that of (1.10) are appropriate and why relative frequency
is such a popular device for thinking about probabilities. For example, when
considering the tosses of the thumbtack, we said that we would treat the
information to be obtained from anyone toss in exactly the same way as we
would treat the information from any other toss. Similarly, we would treat
the information to be obtained from any two tosses in exactly the same way
as we would treat the information from any other two tosses regardless of
where they appear in the sequence of tosses, and so on for three or more
tosses. This may seem like a heavy probabilistic assumption in itself. But it
really is nothing more than an explicit expression of the symmetry amongst
the tosses. Anything less would imply asymmetric treatment of the obser-
vations. Note that assuming the tosses to be IID assumes this symmetry
and more. The symmetry is quite explicit in formula (1.10). Every permu-
tation of the numbers x₁,…,x_n leads to the same value of the right-hand
side of (1.10). If we assume nothing more than this permutation symmetry
for a potentially infinite sequence of possible tosses of the thumbtack, then
Theorem 1.49⁵ will imply that there exists μ such that (1.10) holds. In a
sense, the quantity θ is given an implicit meaning as a random variable
Θ, rather than a fixed value, without having to explicitly give it meaning
in advance. (See Example 1.45 on page 25.) Furthermore, the observations
are not necessarily mutually independent, but they will be conditionally
independent given Θ.
The minimal assumption of symmetry is known as exchangeability, and it
is no more complicated than the permutation symmetry noticed in (1.10).
Definition 1.11. A finite set X₁,…,X_n of random quantities is said to be
exchangeable if every permutation of (X₁,…,X_n) has the same joint dis-
tribution as every other permutation. An infinite collection is exchangeable
if every finite subcollection is exchangeable.
For example, suppose that X₁,…,X₁₀₀ are exchangeable. It follows eas-
ily from the definition that they all have the same marginal distribution.

5Theorem 1.47 is a simpler version that applies only to Bernoulli random


variables.

Also, (X₁, X₂) has the same joint distribution as (X₉₉, X₁), (X₅, X₂, X₄₈)
has the same joint distribution as (X₁₃, X₁₀₀, X₃), and so on. The following
fact is easy to prove.
Proposition 1.12. A collection C of random quantities is exchangeable if
and only if, for every finite n less than or equal to the size of the collection
C, every n-tuple of distinct elements of C has the same joint distribution
as every other such n-tuple.
As an example, we stated earlier that the assumption of IID random
variables entailed symmetry and more.
Example 1.13. Consider a collection X₁, X₂, … (finite or infinite) of IID random
variables. Clearly, (X_{i₁},…,X_{i_n}) has the same distribution as (X_{j₁},…,X_{j_n}) so
long as i₁,…,i_n are all distinct and j₁,…,j_n are all distinct. Hence, every
collection of IID random variables is exchangeable.

The motivation for the definition of exchangeability is to express sym-


metry of beliefs about the random quantities in the weakest possible way.
The definition, as stated, does not require any judgment of independence
or that any limit of relative frequencies will exist. It merely says that the
labeling of the random quantities is immaterial. There are many situations
in which this assumption is deemed reasonable, and many where it is not.
For example, consider the company that sampled sales on various days at
various stores. It might seem reasonable to declare that the sales at a par-
ticular store on a particular day are exchangeable. But the collection of
all sales on all days at all stores might be modeled less symmetrically. In
Chapter 8, we will discuss in more detail cases with less symmetry.
Back in the old days, before probability theory was overrun by σ-fields
and the like, the concept of symmetry was central to most calculations of
probabilities. Consider, for example, the first paragraph of the book by
DeMoivre (1756):
The Probability of an Event is greater or less according to the
number of Chances by which it may happen, compared with
the whole number of Chances by which it may either happen
or fail.
DeMoivre was describing a judgment of symmetry amongst the possible
outcomes of some experiment. But other authors, such as Venn (1876), rely
on symmetry amongst a collection of random quantities to define probabili-
ties as frequencies. 6 Although we now realize that symmetry is not essential
to the definition of probability, it nevertheless is a widely used assumption
that can help facilitate the construction of distributions. In addition, Theo-
rem 1.49 helps to explain why frequencies are relevant to the calculation of

6The reader interested in an in-depth study of the early days of statistics and
statistical reasoning should read Stigler (1986).

probabilities even though probabilities are not defined as frequencies. (See


the discussion in Section 1.2.2.)
In Example 1.13 on page 8, we saw that IID random variables are ex-
changeable. Exchangeability is more general than IID, however. A very
common case of exchangeable random quantities is the following. Suppose
that X₁, X₂, … are conditionally IID given Y. Then the X_i are exchange-
able. (See Problem 4 on page 73.)
Example 1.14. Suppose that {X_n}_{n=1}^∞ are conditionally independent with den-
sity f(x|y) given Y = y and that Y has density g(y). Then the joint density of
any ordered n-tuple (X_{i₁},…,X_{i_n}) is

f_{X_{i₁},…,X_{i_n}}(x₁,…,x_n) = ∫ ∏_{j=1}^n f(x_j|y) g(y) dy.

Note that the right-hand side does not depend on i₁,…,i_n.

The case of conditionally IID random quantities will turn out to be one
of only two general forms of exchangeability. Theorem 1.49 will say that
infinitely many random quantities are exchangeable if and only if they are
conditionally IID given something.
Although an infinite sequence of exchangeable random variables is condi-
tionally IID, sometimes the description of their joint distribution does not
make this fact transparent. Example 1.15 is the famous Polya urn scheme.
It is not obvious from the example that the random variables constructed
are conditionally IID. Theorem 1.49, however, says that they are condi-
tionally IID because they are exchangeable.⁷
Example 1.15. Let X = {1,…,k}, and let u₁,…,u_k be nonnegative integers
such that u = Σ_{i=1}^k u_i > 0. Suppose that an urn contains u_i balls labeled i for
i = 1,…,k. We draw a ball at random⁸ and record X₁ equal to the label. We
then replace the ball and toss in one more ball with the same label. We then
draw a ball at random again to get X₂ and repeat the process indefinitely. To
prove that the sequence {X_i}_{i=1}^∞ is exchangeable, let n > 0 be an integer and let
j₁,…,j_n be elements of X. For i = 1,…,k, let c_i(j₁,…,j_n) be the number of
times that i appears among j₁,…,j_n. That is,⁹ c_i(j₁,…,j_n) = Σ_{t=1}^n I_{{i}}(j_t).
Define the notation

(a)_b = a(a − 1)⋯(a − b + 1),

7Hill, Lane, and Sudderth (1987) prove that for k = 2, the Polya urn process
is the only exchangeable urn process aside from IID processes and deterministic
ones. (An urn process is deterministic if all balls drawn are the same. The common
label for all balls can still be random.)
8What we mean by this is that every ball in the urn has the same probability
of being drawn.
9We will often use the symbol I_A(x) to stand for the indicator function of the
set A. That is, I_A(x) = 1 if x ∈ A and I_A(x) = 0 if x ∉ A.

where (a)₀ = 1 by convention. Then, we claim that

Pr(X₁ = j₁,…,X_n = j_n) = [∏_{i=1}^k (u_i + c_i(j₁,…,j_n) − 1)_{c_i(j₁,…,j_n)}] / (u + n − 1)_n.  (1.16)

For n = 1, this reduces to Pr(X₁ = j₁) = u_{j₁}/u, which is true. If we suppose that
(1.16) is true for n = 1,…,m, then Pr(X₁ = j₁,…,X_{m+1} = j_{m+1}) equals

Pr(X₁ = j₁,…,X_m = j_m) Pr(X_{m+1} = j_{m+1}|X₁ = j₁,…,X_m = j_m)
  = Pr(X₁ = j₁,…,X_m = j_m) [u_{j_{m+1}} + c_{j_{m+1}}(j₁,…,j_{m+1}) − 1] / (u + m).  (1.17)

In replacing Pr(X₁ = j₁,…,X_m = j_m) by (1.16) in (1.17), we note that

c_i(j₁,…,j_{m+1}) = c_i(j₁,…,j_m) if i ≠ j_{m+1},
c_{j_{m+1}}(j₁,…,j_m) = c_{j_{m+1}}(j₁,…,j_{m+1}) − 1.

The result now follows immediately.
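Formula (1.16) is easy to test numerically. The sketch below (the urn counts u₁ = 2, u₂ = 3 are our own choice) implements the right-hand side of (1.16) with (a)_b = a(a−1)⋯(a−b+1), simulates the urn directly, and checks permutation symmetry.

```python
import random
from math import prod

def polya_prob(js, u_counts):
    """Right-hand side of (1.16) for the outcome sequence js."""
    falling = lambda a, b: prod(a - t for t in range(b))   # (a)_b
    u, n = sum(u_counts.values()), len(js)
    num = 1
    for i, ui in u_counts.items():
        ci = js.count(i)
        num *= falling(ui + ci - 1, ci)
    return num / falling(u + n - 1, n)

def draw(u_counts, n, rng):
    """Simulate n Polya urn draws: replace each ball and add one of the same label."""
    urn = [i for i, cnt in u_counts.items() for _ in range(cnt)]
    out = []
    for _ in range(n):
        ball = rng.choice(urn)
        out.append(ball)
        urn.append(ball)
    return tuple(out)

u0 = {1: 2, 2: 3}                      # u_1 = 2, u_2 = 3, so u = 5
p_theory = polya_prob((1, 1, 2), u0)   # 18/210 = 3/35
rng = random.Random(42)
reps = 100000
p_sim = sum(draw(u0, 3, rng) == (1, 1, 2) for _ in range(reps)) / reps
print(p_theory, p_sim)
# exchangeability: every ordering of two 1s and one 2 has the same probability
print(polya_prob((1, 2, 1), u0), polya_prob((2, 1, 1), u0))
```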

The only other form of exchangeability, besides conditionally IID random
quantities, is illustrated in a problem as simple as drawing balls without
replacement from an urn.
Example 1.18. Suppose that an urn has 20 balls, 14 of which are red and 6 of
which are blue. Suppose that we draw balls without replacement. Let X_i be 1 if
the ith ball is red and 0 if it is blue. If we assume that all 20! possible ordered
draws of the balls are equally likely, then it is not difficult to see that the X_i
are exchangeable. To see that the draws are not conditionally IID, suppose that
there were a random quantity Y such that the X_i were conditionally IID given Y.
Since 0 < Pr(X₁ = 0) = E(Pr(X₁ = 0|Y)) (by the law of total probability B.70),
it follows that Pr(X₁ = 0|Y) = 0 a.s. is impossible. Hence Pr(Pr(X₁ = 0|Y) >
0) > 0, from which it follows that

Pr(Pr(X₁ = 0, X₂ = 0,…,X₇ = 0|Y) > 0) = Pr(Pr(X₁ = 0|Y)⁷ > 0)
  = Pr(Pr(X₁ = 0|Y) > 0) > 0.

Hence, Pr(X₁ = 0,…,X₇ = 0) = E(Pr(X₁ = 0,…,X₇ = 0|Y)) > 0. But this
is absurd, since there are only 6 blue balls. It must be the case that the X_i,
although exchangeable, are not conditionally IID.
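The two features of this example, identical marginals at every position but zero probability of seven blues in a row, show up immediately in simulation. This sketch (our own construction) shuffles the 20 balls and reads off the draws.

```python
import random

rng = random.Random(7)
reps = 50000
first_red = seventh_red = seven_blue = 0
for _ in range(reps):
    balls = [1] * 14 + [0] * 6     # 14 red (1), 6 blue (0)
    rng.shuffle(balls)             # a uniformly random ordered draw
    first_red += balls[0]
    seventh_red += balls[6]
    seven_blue += all(b == 0 for b in balls[:7])

# both marginals are near 14/20 = 0.7; seven blues in a row never happens
print(first_red / reps, seventh_red / reps, seven_blue / reps)
```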
Theorem 1.48 will say that a finite collection of random quantities is
exchangeable if and only if they are like draws from an urn without re-
placement.

1.2.2 Frequency and Exchangeability


There was a time when people thought that probabilities had to be frequen-
cies, and as such, we could not know what they were before collecting an
infinite amount of data. [See Von Mises (1957) for an example.] Although
it is still true that we cannot know frequencies (such as the limit of the
proportion of successes in a sequence of exchangeable Bernoulli random
variables) without collecting an infinite amount of data, DeFinetti's rep-
resentation theorem for Bernoulli random variables (Theorem 1.47) tells us
that such a limit of frequencies Θ is only a conditional probability given
information that we do not yet have. The probabilities themselves are
calculated based on subjective judgments. The possibly surprising fact is
that even though different people might calculate different probabilities for
the same sequence of Bernoulli random variables, if they all believe the
sequence to be exchangeable, then they all believe that there exists Θ such
that, conditional on Θ = θ, the random variables are IID Ber(θ). That is, the
subjective judgment of exchangeability for a sequence of random variables
entails certain consequences that are common to every specific instance of
the judgment, even when the specific instances differ in other ways.
Example 1.19. Let {X_n}_{n=1}^∞ be Bernoulli random variables. Suppose that two
different people give them the following joint distributions. Let i₁, i₂, … stand
for numbers in {0, 1}. One person believes

Pr(X₁ = i₁,…,X_n = i_n) = 12 / [(x + 2) C(n+4, x+2)],

where x stands for Σ_{j=1}^n i_j and C(a, b) is the binomial coefficient, and the
other believes this probability to be ([n + 1] C(n, x))⁻¹. The first person believes
that Pr(X₁ = 1) = 0.4, while the second believes Pr(X₁ = 1) = 0.5. On the
other hand, both of these distributions are exchangeable, and so Theorem 1.47
says that both persons believe that Θ = lim_{N→∞} Σ_{i=1}^N X_i/N exists with
probability 1, and that Pr(X₁ = 1|Θ = θ) = θ. They must disagree on the
distribution of Θ. For example, the law of total probability B.70 says
Pr(X₁ = 1) = E(Θ), hence they must have different values of E(Θ).

If probabilities are not frequencies, then why are frequencies thought
to be so important in calculating probabilities? The answer lies in careful
examination of the implications of DeFinetti's representation theorem.
Example 1.20 (Continuation of Example 1.19). Suppose that the two people in
this example both observe X = (X₁,…,X₂₀) = y, and suppose that y consists of
14 1s and 6 0s. It is not difficult to calculate the conditional distribution of X₂₁
given this data. For example, to get Pr(X₂₁ = 1|X = y), we just divide the joint
probability of (X, X₂₁) = (y, 1) by the probability of X = y. The first person
believes

Pr(X₂₁ = 1|X = y) = [12/(17 C(25,17))] / [12/(16 C(24,16))] = 16/25 = 0.64,

while the second person believes Pr(X₂₁ = 1|X = y) = 15/22 ≈ 0.68. Notice how
much closer these probabilities are to each other than were the prior probabilities
of 0.4 and 0.5. Also, notice how close each of them is to the proportion of successes,
0.7.
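The arithmetic in Examples 1.19 and 1.20 can be reproduced directly from the two stated joint formulas; nothing beyond them is needed. A short sketch:

```python
from math import comb

def p_first(n, x):
    """First person's joint probability: 12 / ((x+2) * C(n+4, x+2))."""
    return 12 / ((x + 2) * comb(n + 4, x + 2))

def p_second(n, x):
    """Second person's joint probability: 1 / ((n+1) * C(n, x))."""
    return 1 / ((n + 1) * comb(n, x))

# prior probabilities of X1 = 1 (n = 1, x = 1)
print(p_first(1, 1), p_second(1, 1))             # 0.4 and 0.5
# predictive probability of X21 = 1 after 14 ones and 6 zeros in 20 trials
print(p_first(21, 15) / p_first(20, 14))         # 16/25 = 0.64
print(p_second(21, 15) / p_second(20, 14))       # 15/22, about 0.68
```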

In Example 1.57 on page 31, we will see the general method for finding
the conditional distribution of Θ after observing some Bernoulli trials. But

Example 1.20 gives us some hint of what happens. In Example 1.20, after
observing 20 Bernoulli trials, the mean of Θ changed to a number closer
to the proportion of successes, regardless of what the prior mean of Θ
was. The conditional mean of Θ given the observed data is the probability
of a success on a future trial given the data. If we believe a sequence of
Bernoulli random variables to be exchangeable, and we are not already
certain about the limit of the proportion of successes, then after we observe
some data, we will modify our opinion about future observations so that the
probability of success is now closer to the observed proportion of successes.
This phenomenon has nothing to do with frequencies being probabilities.
It is merely a consequence of exchangeability.

1.3 Parametric Models


DeFinetti's representation theorem 1.49 says that infinitely many random
quantities {X_n}_{n=1}^∞ are exchangeable if and only if they are conditionally
IID given the limit of their empirical probability measures. The empirical
probability measure (or empirical distribution) of X₁,…,X_n is the random
probability measure

P_n(A) = (1/n) Σ_{i=1}^n I_A(X_i), for each set A.  (1.21)

For the case of random variables, the empirical distribution is equivalent
to the empirical distribution function, F_n(t) = Σ_{i=1}^n I_{(−∞,t]}(X_i)/n, the
function which is 0 at −∞ and has jumps of size 1/n at each observation.
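The empirical distribution function is simple to compute; the sketch below (the data are invented) builds F_n as a step function with jumps of size 1/n at each observation, with repeated observations stacking their jumps.

```python
def ecdf(xs):
    """Return the empirical distribution function F_n of the sample xs."""
    xs = sorted(xs)
    n = len(xs)
    return lambda t: sum(x <= t for x in xs) / n

F = ecdf([2.0, 0.5, 1.0, 1.0])   # n = 4, so each jump has size 1/4
print(F(0.0), F(0.5), F(1.0), F(5.0))
```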
If we are considering a sequence of exchangeable random quantities, let
Θ be some one-to-one function of the limit of the empirical distributions,
and let Ω be the set of possible values for Θ. Let P_θ denote the conditional
distribution of X_n given Θ = θ. Then P₀ = {P_θ : θ ∈ Ω} looks like a
typical parametric family with which we are already familiar. Also, Θ is a
measurable function of the entire sequence {X_n}_{n=1}^∞, hence its distribution
is induced (see Theorem A.81) from the distribution of the sequence. For
this reason, it is natural to think of Θ as a random quantity in this situation.
Although DeFinetti's representation theorem 1.49 is central to motivat-
ing parametric models, it is not actually used in their implementation.
Furthermore, the concept of parametric models extends to more general
situations, albeit without the same justification. For this reason, we will
postpone formal treatment of DeFinetti's theorem until Section 1.4. In Sec-
tion 1.3.1, we introduce the framework for the use of parametric families in
general situations. Most familiar examples will be of exchangeable random
variables, but other examples will be given as well. In all cases, however,
we will treat the parameter as a random quantity, just as we would if the
data were exchangeable.

1.3.1 Prior, Posterior, and Predictive Distributions


We begin by making explicit the general concept of parameter and para-
metric family.
Definition 1.22. Let (S, A, μ) be a probability space, and let (X, B) and
(Ω, τ) be Borel spaces. Let X : S → X and Θ : S → Ω be measurable. Then
Θ is called a parameter and Ω is called a parameter space. The conditional
distribution for X given Θ is called a parametric family of distributions of
X. The parametric family is denoted by

P₀ = {P_θ : for all A ∈ B, P_θ(A) = Pr(X ∈ A|Θ = θ), for θ ∈ Ω}.

We also use the symbol P′_θ(X ∈ A) to stand for P_θ(A).¹⁰ The prior distri-
bution of Θ is the probability measure μ_Θ over (Ω, τ) induced by Θ from
μ.
Suppose that each P_θ, when considered as a measure on (X, B), is abso-
lutely continuous with respect to a measure ν on (X, B). Let

f_{X|Θ}(x|θ) = dP_θ/dν (x).

(It will be common in this text to denote the conditional density function
of one random quantity X given another Y by f_{X|Y}.) We can assume that
f_{X|Θ} is measurable with respect to the product σ-field B ⊗ τ.¹¹ This will
allow us to integrate this function with respect to measures on both X and
Ω. The function f_{X|Θ}(x|θ), considered as a function of θ after X = x is
observed, is often called the likelihood function L(θ).
For each θ ∈ Ω, the function f_{X|Θ}(·|θ) is the conditional density with
respect to ν of X given Θ = θ. That is, for each A ∈ B,

Pr(X ∈ A|Θ = θ) = ∫_A f_{X|Θ}(x|θ) dν(x).

We let μ_X denote the marginal distribution of X (μ_X(A) = Pr(X ∈ A)).
Using Tonelli's theorem A.69, we can write

μ_X(A) = ∫_Ω ∫_A f_{X|Θ}(x|θ) dν(x) dμ_Θ(θ) = ∫_A ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ) dν(x).

It follows that μ_X is absolutely continuous with respect to ν with density

f_X(x) = ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ).  (1.23)

10In this manner, P′_θ is a probability measure on the space (S, A) and P_θ is
a probability measure on the space (X, B). This fine mathematical point could
usually be ignored without causing much confusion, but we will try to be as
precise as possible for the sake of those few cases where it matters.
11See Problem 9 on page 74 for a way to prove this.

This density is often called the (prior) predictive density of X or the
marginal density of X.
For example, suppose that X = (X₁,…,X_n), where the X_i are ex-
changeable and conditionally independent given Θ, each with conditional
density f_{X_i|Θ}(·|θ) with respect to a measure ν. Then the conditional joint
density of X₁,…,X_n given Θ = θ (the likelihood in this case) with respect
to the n-fold product measure ν^n can be written as

f_{X₁,…,X_n|Θ}(x₁,…,x_n|θ) = ∏_{i=1}^n f_{X_i|Θ}(x_i|θ).

The unconditional joint (prior predictive) density of X₁,…,X_n is

f_{X₁,…,X_n}(x₁,…,x_n) = ∫_Ω ∏_{i=1}^n f_{X_i|Θ}(x_i|θ) dμ_Θ(θ).

Example 1.24. Let X = (X₁,…,X_n), where the X_i are conditionally IID
with N(μ, σ²) distribution given (M, Σ) = (μ, σ). (Here the parameter is Θ =
(M, Σ).) Let the prior distribution be that Σ² has inverse gamma distribution
Γ⁻¹(a₀/2, b₀/2) and M given Σ = σ has N(μ₀, σ²/λ₀) distribution, with a₀, b₀,
μ₀, and λ₀ constants. The likelihood function in this case can be written as

f_{X|Θ}(x|μ, σ) = (2πσ²)^{−n/2} exp( −(1/(2σ²))[n(μ − x̄)² + w] ),

where x̄ = Σ_{i=1}^n x_i/n and w = Σ_{i=1}^n (x_i − x̄)². The prior density with respect to
Lebesgue measure is

f_Θ(μ, σ) = [2(b₀/2)^{a₀/2}√λ₀ / (Γ(a₀/2)√(2π))] σ^{−(a₀+2)} exp( −(1/(2σ²))[λ₀(μ − μ₀)² + b₀] ), for σ > 0.
(1.25)
The prior predictive distribution of the observations can be calculated by multi-
plying together the two functions above and integrating out the parameter. After
completing the square in the exponent, the product can be written as

[2(b₀/2)^{a₀/2}√λ₀ / (Γ(a₀/2)√(2π)(2π)^{n/2})] σ^{−(a₁+2)} exp( −(1/(2σ²))[λ₁(μ − μ₁)² + b₁] ),  (1.26)

where

a₁ = a₀ + n,  λ₁ = λ₀ + n,
b₁ = b₀ + w + nλ₀(x̄ − μ₀)²/(λ₀ + n),  μ₁ = (λ₀μ₀ + nx̄)/(λ₀ + n).

Note that, as a function of (μ, σ), this is in the same form as the prior density
(1.25) with the four numbers a₀, b₀, μ₀, λ₀ replaced by a₁, b₁, μ₁, λ₁. Hence, the
integral over (μ, σ) is just the constant factor that appears in (1.26) divided by

the result of changing a₀, b₀, μ₀, λ₀ to a₁, b₁, μ₁, λ₁, respectively, in the constant
factor in (1.25). That is,

f_X(x) = (b₀/2)^{a₀/2}√λ₀ Γ(a₁/2) / [(2π)^{n/2} Γ(a₀/2) (b₁/2)^{a₁/2}√λ₁].  (1.27)

A specialized calculation of the preceding sort is often of interest in this exam-
ple. Let Ȳ_n be the average of the n observations. The conditional distribution of
Ȳ_n given Θ = (μ, σ) is N(μ, σ²/n). The prior predictive density of Ȳ_n
can be calculated by integrating the N(μ, σ²/n) density times (1.25) with respect
to μ and σ. Alternatively, one can argue as follows. Using well-known features of
the normal distribution, we can conclude that the conditional distribution of Ȳ_n
given Σ = σ is N(μ₀, σ²[1/n + 1/λ₀]). If we multiply the corresponding normal
density times the marginal density of Σ and integrate over σ, we get
the density of the t_{a₀}(μ₀, √((1/n + 1/λ₀)b₀/a₀)) distribution.
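The hyperparameter update a₀, b₀, μ₀, λ₀ → a₁, b₁, μ₁, λ₁ is mechanical and easy to check. The sketch below (the data and prior constants are invented) implements exactly the formulas displayed above.

```python
def ni_gamma_update(xs, a0, b0, mu0, lam0):
    """Update the normal-inverse-gamma hyperparameters after observing xs."""
    n = len(xs)
    xbar = sum(xs) / n
    w = sum((x - xbar) ** 2 for x in xs)
    a1 = a0 + n
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    b1 = b0 + w + n * lam0 * (xbar - mu0) ** 2 / lam1
    return a1, b1, mu1, lam1

# with xbar = 2 and w = 2: a1 = 4, b1 = 6, mu1 = 1.5, lam1 = 4
print(ni_gamma_update([1.0, 2.0, 3.0], a0=1.0, b0=1.0, mu0=0.0, lam0=1.0))
```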


As we mentioned earlier, the use of parametric families does not require
that the data be a collection of exchangeable random quantities. Here are
some examples of nonexchangeable random quantities whose distributions
could usefully be modeled using finite-dimensional parametric families.
Example 1.28. Let {X_n}_{n=1}^∞ be a sequence of Bernoulli random variables that
are not exchangeable. Instead, let P₀ be the set of joint distributions for infinitely
many Bernoulli random variables that form a Markov chain. For P ∈ P₀, define

Θ(P) = (Pr(X₁ = 1|P), Pr(X_{i+1} = 1|X_i = 1, P), Pr(X_{i+1} = 1|X_i = 0, P))
     = (p₁, p₁₁, p₀₁).

Let λ be any probability over [0,1]³, and set

Pr(X₁ = i₁,…,X_n = i_n)  (1.29)
  = ∫∫∫ p₁^{i₁}(1 − p₁)^{1−i₁} p₁₁^{k₁,₁}(1 − p₁₁)^{k₀,₁} p₀₁^{k₁,₀}(1 − p₀₁)^{k₀,₀} dλ(p₁, p₁₁, p₀₁),

where k_{s,t} is the number of times that s follows t in the sequence i₁,…,i_n.
Diaconis and Freedman (1980c) prove that, aside from pathological cases, if all
finite sequences of 0s and 1s that have the same first element and the same values
of k_{s,t} have the same probability, then (1.29) must hold.
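The quantities that matter in (1.29) are the first element and the transition counts k_{s,t}. A small sketch (the sequence is invented) computes them:

```python
def transition_counts(seq):
    """k[(s, t)] = number of times s follows t in the 0-1 sequence seq."""
    k = {(s, t): 0 for s in (0, 1) for t in (0, 1)}
    for prev, cur in zip(seq, seq[1:]):
        k[(cur, prev)] += 1
    return k

seq = [1, 1, 0, 1, 0, 0, 1]
# (1.29) depends on seq only through seq[0] and these four counts
print(seq[0], transition_counts(seq))
```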
Example 1.30. This example is the simple linear regression problem. Suppose
that x₁, x₂, … are fixed known numbers and E₁, E₂, … are exchangeable random
variables that are conditionally independent given Σ = σ with density f_{E_i|Σ}(e|σ).
(Think of the E_i as the error or noise term in a regression model.) Define Y_i =
E_i + B x_i, where B is a random variable such that B and Σ have joint distribution
μ_{B,Σ}. The parameter now consists of Θ = (B, Σ). The random variables Y₁, Y₂, …
are not exchangeable even though E₁, E₂, … are exchangeable. The reader should
see Zellner (1971) for an in-depth discussion of Bayesian analysis of regression
models.

The conditional distribution of Θ given X = x is called the posterior
distribution of Θ. The next theorem shows us how to calculate the posterior
distribution of a parameter in the case in which there is a measure ν such
that each P_θ ≪ ν.
Theorem 1.31 (Bayes' theorem).12 Suppose that X has a parametric
family P₀ of distributions with parameter space Ω. Suppose that P_θ ≪ ν
for all θ ∈ Ω, and let f_{X|Θ}(x|θ) be the conditional density (with respect to
ν) of X given Θ = θ. Let μ_Θ be the prior distribution of Θ. Let μ_{Θ|X}(·|x)
denote the conditional distribution of Θ given X = x. Then μ_{Θ|X} ≪ μ_Θ,
a.s. with respect to the marginal of X, and the Radon-Nikodym derivative
is

dμ_{Θ|X}/dμ_Θ (θ|x) = f_{X|Θ}(x|θ) / ∫_Ω f_{X|Θ}(x|t) dμ_Θ(t),

for those x such that the denominator is neither 0 nor infinite. The prior
predictive probability of the set of x values such that the denominator is
0 or infinite is 0, hence the posterior can be defined arbitrarily for such x
values.
PROOF. First, we prove the claims about the denominator. Let

C₀ = {x : ∫_Ω f_{X|Θ}(x|t) dμ_Θ(t) = 0},
C∞ = {x : ∫_Ω f_{X|Θ}(x|t) dμ_Θ(t) = ∞}.

Let μ_X be the marginal distribution of X,

μ_X(A) = ∫_A ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ) dν(x).

It follows that

∫_{C₀} ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ) dν(x) = 0,
∫_{C∞} ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ) dν(x) = ∫_{C∞} ∞ dν(x).

12 Theorem 1.31 applies equally well to infinite-dimensional parameters as to
finite-dimensional parameters. In infinite-dimensional cases, however, the condition
P_θ ≪ ν for all θ often fails. In fact, the proof applies even if (Ω, τ) is not
a Borel space. In this last case, a regular conditional distribution is explicitly
constructed without knowing in advance that one will exist.

This last integral will equal ∞ if ν(C_∞) > 0. Since this is impossible, it
must be that ν(C_∞) = 0, hence μ_X(C_∞) = 0.
The posterior distribution μ_{Θ|X} must satisfy the following. For all sets
A ∈ B and all B ∈ τ,

    Pr(Θ ∈ B, X ∈ A) = ∫_A μ_{Θ|X}(B|x) dμ_X(x).   (1.32)

Using Tonelli's theorem A.69, we can write

    Pr(Θ ∈ B, X ∈ A) = ∫_B ∫_A f_{X|Θ}(x|θ) dν(x) dμ_Θ(θ)
                     = ∫_A ∫_B f_{X|Θ}(x|θ) dμ_Θ(θ) dν(x).   (1.33)

Next, write

    ∫_A μ_{Θ|X}(B|x) dμ_X(x) = ∫_A [μ_{Θ|X}(B|x) ∫_Ω f_{X|Θ}(x|θ) dμ_Θ(θ)] dν(x).

Combining this with (1.33) shows that (1.32) is satisfied for all A and B if
and only if

    μ_{Θ|X}(B|x) = ∫_B f_{X|Θ}(x|θ) dμ_Θ(θ) / ∫_Ω f_{X|Θ}(x|t) dμ_Θ(t),

a.s. [μ_X]. It follows that μ_{Θ|X} ≪ μ_Θ and that dμ_{Θ|X}/dμ_Θ(·|x) is as speci-
fied. □
Example 1.34. Suppose that X has a Bin(n, θ) distribution given Θ = θ and that
the prior distribution of Θ is Beta(a0, b0). The marginal density of X with respect to
counting measure on the integers is

    f_X(x) = \binom{n}{x} Γ(a0 + b0) Γ(a0 + x) Γ(b0 + n - x) / [Γ(a0) Γ(b0) Γ(a0 + b0 + n)],  for x = 0, ..., n.

The posterior density of Θ with respect to the prior distribution of Θ is the ratio
of \binom{n}{x} θ^x (1 - θ)^{n-x} to f_X(x). The posterior density of Θ with respect to Lebesgue
measure is

    f_{Θ|X}(θ|x) = [Γ(a0 + b0 + n) / (Γ(a0 + x) Γ(b0 + n - x))] θ^{a0+x-1} (1 - θ)^{b0+n-x-1},  for 0 < θ < 1,

which is easily seen to be the Beta(a0 + x, b0 + n - x) density.
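The conjugacy in Example 1.34 is easy to check numerically. The following sketch (the values of a0, b0, n, and x are arbitrary illustrative choices) evaluates prior × likelihood / marginal at several values of θ and compares the result with the Beta(a0 + x, b0 + n - x) density:

```python
import math

# Illustrative values; a0, b0, n, x are chosen arbitrarily.
a0, b0, n, x = 2.0, 3.0, 10, 4

def beta_pdf(t, a, b):
    """Beta(a, b) density with respect to Lebesgue measure on (0, 1)."""
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * t ** (a - 1) * (1 - t) ** (b - 1)

def likelihood(t):
    """Bin(n, t) probability of observing x successes."""
    return math.comb(n, x) * t ** x * (1 - t) ** (n - x)

# Marginal f_X(x) from the closed form in Example 1.34.
fx = (math.comb(n, x)
      * math.gamma(a0 + b0) * math.gamma(a0 + x) * math.gamma(b0 + n - x)
      / (math.gamma(a0) * math.gamma(b0) * math.gamma(a0 + b0 + n)))

# Bayes' theorem: posterior density = prior * likelihood / marginal.
# It should agree with the Beta(a0 + x, b0 + n - x) density everywhere.
for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
    post = beta_pdf(t, a0, b0) * likelihood(t) / fx
    conj = beta_pdf(t, a0 + x, b0 + n - x)
    assert abs(post - conj) < 1e-9
```

The agreement is exact up to rounding, since the two expressions are algebraically identical.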
Example 1.35 (Continuation of Example 1.24; see page 14). For the case of
conditionally IID N(μ, σ²) random variables, the posterior density with respect
to Lebesgue measure can be calculated by dividing the product of prior and like-
lihood (1.26) by the prior predictive density (1.27). The result is easily seen to be
in the same form as the prior density (1.25) with the four constants a0, b0, μ0, λ0
replaced by a1, b1, μ1, λ1. In other words, the posterior distribution of Σ is
Γ^{-1}(a1/2, b1/2), and the conditional posterior of M given Σ = σ is N(μ1, σ²/λ1).

Example 1.36. As an example in which Bayes' theorem does not apply, consider
the case in which the conditional distribution of X given Θ = θ is discrete with
P_θ({θ - 1}) = P_θ({θ + 1}) = 1/2. Suppose that Θ has a density f_Θ with respect
to Lebesgue measure. The P_θ distributions are not all absolutely continuous with
respect to a single σ-finite measure. It is still possible to verify that the posterior
distribution of Θ given X = x is the discrete distribution with

    Pr(Θ = x - 1 | X = x) = f_Θ(x - 1) / [f_Θ(x - 1) + f_Θ(x + 1)],

and Pr(Θ = x + 1 | X = x) = 1 - Pr(Θ = x - 1 | X = x). Note that the posterior
is not absolutely continuous with respect to the prior.13
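Although no dominating measure is available in Example 1.36, the two-point posterior is trivial to compute. A small numeric sketch, assuming (arbitrarily, for illustration) a N(0, 1) prior density for Θ:

```python
import math

def prior_pdf(t):
    # N(0, 1) prior density for Theta (an arbitrary illustrative choice).
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def posterior(x):
    """Posterior of Theta given X = x in Example 1.36.

    Given Theta = theta, X equals theta - 1 or theta + 1 with probability
    1/2 each, so the posterior puts mass on only the two points x - 1 and
    x + 1, weighted by the prior density at each point.
    """
    w_lo, w_hi = prior_pdf(x - 1), prior_pdf(x + 1)
    total = w_lo + w_hi
    return {x - 1: w_lo / total, x + 1: w_hi / total}

post = posterior(0.5)
# The two posterior probabilities sum to 1, and the point nearer the
# prior mode (0) gets the larger mass.
assert abs(sum(post.values()) - 1) < 1e-12
assert post[-0.5] > post[1.5]
```

The posterior is discrete while the prior has a Lebesgue density, which is exactly why it cannot be absolutely continuous with respect to the prior.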

The (posterior) predictive distribution of future data is defined in the
same way as the prior predictive distribution except that the posterior
distribution of Θ is used instead of the prior distribution of Θ. For the case
of conditionally IID random variables with conditional density f_{X_i|Θ}, we
have

    f_{X_{n+1},...,X_{n+k}|X_1,...,X_n}(x_{n+1}, ..., x_{n+k}|x_1, ..., x_n)   (1.37)
        = ∫_Ω ∏_{i=1}^{k} f_{X_1|Θ}(x_{n+i}|θ) dμ_{Θ|X_1,...,X_n}(θ|x_1, ..., x_n).
Example 1.38 (Continuation of Example 1.35; see page 17). The posterior pre-
dictive distribution of future observations can be calculated after observing a
sample of conditionally IID normal random variables. Let Ȳ_m be the average of
m future observations. Since the posterior distribution of Θ is in the same form
as the prior (1.25) with a0, b0, μ0, λ0 replaced by a1, b1, μ1, λ1, it follows that the
posterior predictive distribution of Ȳ_m is of the same form as the prior predictive
distribution. Using the result from the end of Example 1.24 on page 14, we get
that the posterior predictive distribution of Ȳ_m is t_{a1}(μ1, √([1/m + 1/λ1] b1/a1)).
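The update formulas of Example 1.24 are not reproduced in this section. The sketch below therefore uses the textbook-standard conjugate updates for this (a, b, μ, λ) parameterization, with Σ² ~ Γ^{-1}(a0/2, b0/2) and M | Σ = σ ~ N(μ0, σ²/λ0); treating these as the intended formulas is an assumption here, and `normal_update` is a hypothetical helper name:

```python
import statistics

def normal_update(a0, b0, mu0, lam0, data):
    """Standard conjugate update for conditionally IID N(mu, sigma^2) data.

    Assumed parameterization (not shown in this section): given Sigma = sigma,
    M ~ N(mu0, sigma^2/lam0), and Sigma^2 has the inverse-gamma law
    Gamma^{-1}(a0/2, b0/2).  The formulas below are the usual textbook ones
    for that setup.
    """
    n = len(data)
    xbar = statistics.fmean(data)
    ssd = sum((x - xbar) ** 2 for x in data)   # within-sample sum of squares
    lam1 = lam0 + n
    mu1 = (lam0 * mu0 + n * xbar) / lam1
    a1 = a0 + n
    b1 = b0 + ssd + (n * lam0 / lam1) * (xbar - mu0) ** 2
    return a1, b1, mu1, lam1

a1, b1, mu1, lam1 = normal_update(2.0, 1.0, 0.0, 1.0, [1.2, 0.8, 1.5, 0.9])
# The posterior mean mu1 is a precision-weighted average of the prior mean
# (0.0) and the sample mean (1.1), so it lies between them.
assert 0.0 < mu1 < 1.1
```

Plugging a1, b1, μ1, λ1 into the t form above then gives the posterior predictive distribution of Ȳ_m.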

To see how Bayes' theorem 1.31 applies to arbitrary random quantities
whose distributions are modeled using parametric families, consider the
following example.
Example 1.39. Consider two sequences {X_n}_{n=1}^∞ and {Y_n}_{n=1}^∞ of random vari-
ables that are each separately exchangeable. We can model them so that the
parameters are related. For example, suppose that the X_i are IID Exp(θ) given
Θ = θ and the Y_i are IID U(0, θ) given Θ = θ, and we model the X_i and Y_i
as conditionally independent given Θ. We may learn X_1 = x_1, ..., X_n = x_n and
then wish to make inference about the Y_i. Let the prior for Θ be μ_Θ. The posterior
is

    dμ_{Θ|X_1,...,X_n}(θ|x_1, ..., x_n) = θ^n e^{-θx} dμ_Θ(θ) / ∫_0^∞ t^n e^{-tx} dμ_Θ(t),

13 Another example of this situation occurs in Problem 47 on page 80. In that
example, Θ is an infinite-dimensional parameter, however.

where x = Σ_{i=1}^n x_i. The predictive density of (Y_1, ..., Y_m), for y_1, ..., y_m > 0, is

    f_{Y_1,...,Y_m|X_1,...,X_n}(y_1, ..., y_m|x_1, ..., x_n) = ∫_{max_i y_i}^∞ θ^{-m} dμ_{Θ|X_1,...,X_n}(θ|x_1, ..., x_n).

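Under a particular choice of prior, the posterior and predictive in Example 1.39 become concrete. The sketch below assumes a Gamma(α0, β0) prior (a hypothetical choice, conjugate to the Exp(θ) likelihood, so the posterior is Gamma(n + α0, x + β0)) and made-up data; it computes the predictive density of a single future Y by numerical integration:

```python
import math

# Hypothetical choices: Gamma(alpha0, beta0) prior and invented data values.
alpha0, beta0 = 2.0, 1.0
data = [0.7, 1.9, 0.4, 1.1, 0.6]
n, x = len(data), sum(data)

def gamma_pdf(t, a, b):
    # Gamma density with shape a and rate b.
    return b ** a * t ** (a - 1) * math.exp(-b * t) / math.gamma(a)

def posterior_pdf(t):
    # theta^n e^{-theta x} dmu(theta), renormalized: Gamma(n + alpha0, x + beta0).
    return gamma_pdf(t, n + alpha0, x + beta0)

def predictive_pdf(y, grid=10000, hi=50.0):
    # f(y | data) = integral over theta > y of theta^{-m} * posterior, m = 1,
    # approximated by the midpoint rule on [y, hi].
    h = (hi - y) / grid
    ts = [y + (i + 0.5) * h for i in range(grid)]
    return sum(posterior_pdf(t) / t for t in ts) * h

# The predictive density is decreasing in y: a larger future observation
# requires a larger theta, which the posterior makes progressively rarer.
assert predictive_pdf(0.5) > predictive_pdf(1.5) > predictive_pdf(3.0)
```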
Since P_θ is a conditional distribution given another random variable Θ =
θ, there exist conditional expectations given Θ = θ. Let E_θ stand for the
expectation operator under P_θ. That is, if Z is a random variable with
finite absolute expectation, then E_θ(Z) means E(Z|Θ)(s) for all s such
that Θ(s) = θ. By Theorem B.12, if f : X → ℝ and Z = f(X),

    E_θ(Z) = ∫ f(x) dP_θ(x).

Similarly, let Var_θ(X) and Cov_θ(X, Y) stand for the conditional variance
of X given Θ = θ and the conditional covariance between X and Y given
Θ = θ, respectively.
There will be times when we wish to condition on other random variables
in addition to Θ. Recall that for two random quantities Z : S → ℝ and
Y : S → T, the conditional expectation of Z given Y was defined to be an
A_Y-measurable function E(Z|Y)(s) satisfying

    E(Z I_B) = ∫_B E(Z|Y)(s) dμ(s),

for all B ∈ A_Y. The conditional expectation of Z given Y and Θ will be
an A_{(Y,Θ)}-measurable function E(Z|Y, Θ) satisfying

    E(Z I_B) = ∫_B E(Z|Y, Θ)(s) dμ(s),

for all B ∈ A_{(Y,Θ)}. It follows from Theorem B.75 that E(Z|Y = y, Θ = θ) =
E_θ(Z|Y = y), where E_θ(·|Y = y) is conditional expectation calculated from
P_θ. It follows from the law of total probability B.70 that

    E(Z|Y = y) = ∫_Ω E(Z|Y = y, Θ = θ) dμ_{Θ|Y}(θ|y),

where μ_{Θ|Y}(·|y) specifies the conditional distribution of Θ given Y = y.

1.3.2 Improper Prior Distributions


Two components are required to specify the distribution of a random quan-
tity X by means of a parametric family. One is the choice of parametric
family, and the other is the prior distribution over the parameter space.
Both of these must be specified if one is to have a marginal distribution for
X. Some people seem to think that choosing a prior distribution introduces
subjectivity into the analysis of data but choosing a parametric family does
not. These people are mistaken. Each choice one makes introduces subjec-
tivity.
Philosophy aside, suppose that one finds it difficult to specify a prior
distribution because one does not have much idea where the parameter is
likely to be located. In such cases, one may wish to do calculations based
on a prior distribution that spreads the probability very thinly over the
parameter space. A problem that often arises is that, if we take the limit
as the probability is spread more thinly, the prior distribution ceases to
satisfy the axioms of probability theory.
Example 1.40. Suppose that we choose the parametric family of normal dis-
tributions with variance 1 and parameter Θ equal to the mean. The parameter
space is the real line ℝ. Suppose that we want a normal prior distribution for Θ,
but one with very high variance to indicate that we are not willing to say where
we think Θ is with much certainty. The distribution N(a, n) for large n has this
property. But how can we choose n? If we let n → ∞, there is no countably
additive limit to the sequence of probability distributions. There is no normal
distribution with infinite variance.
What has become common in problems like Example 1.40 is to choose a
measure λ on (Ω, τ) which may not be a probability but still pretend that
it is the prior distribution of Θ. That is, use λ in place of μ_Θ in Bayes'
theorem 1.31. The "posterior" after observing X = x, if it exists, will have
density with respect to λ,

    f_{X|Θ}(x|θ) / ∫_Ω f_{X|Θ}(x|t) dλ(t).   (1.41)

The key is whether or not the denominator in (1.41) is finite and nonzero.
If so, we can pretend that (1.41) is the posterior density of Θ given X = x
and then proceed with whatever analysis we want to perform. In this case,
we call λ an improper prior distribution. If the denominator in (1.41) is 0
or infinite, one may need to choose another prior distribution.
Example 1.42. Suppose that X ~ N(θ, 1) given Θ = θ. We can use λ equal
to Lebesgue measure as an improper prior. Suppose that we observe only one
observation X. Since f_{X|Θ}(x|θ) = (2π)^{-1/2} exp(-[x - θ]²/2), it follows that the
posterior density with respect to Lebesgue measure derived from Bayes' theo-
rem 1.31 is equal to f_{X|Θ}(x|θ) as a function of θ. In other words, given X = x,
Θ has a N(x, 1) distribution.
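The normalization in Example 1.42 can be checked directly: under the flat prior, the denominator of (1.41) is the integral of the normal kernel over θ, which is 1, so the formal posterior is exactly the N(x, 1) density. A sketch (the observed value x is arbitrary):

```python
import math

def normal_pdf(t, mean, var=1.0):
    return math.exp(-(t - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x = 1.7   # an arbitrary observed value

# Improper flat prior: the denominator in (1.41) is the integral of the
# likelihood over theta with respect to Lebesgue measure.  Approximate it
# on a wide grid; by symmetry of the normal kernel it should be 1.
h, lo = 0.001, x - 10.0
denom = sum(normal_pdf(x, lo + (i + 0.5) * h) for i in range(20000)) * h
assert abs(denom - 1.0) < 1e-6

# Hence the formal posterior density at theta is just the likelihood,
# i.e., the N(x, 1) density evaluated at theta.
theta = 0.9
assert abs(normal_pdf(x, theta) / denom - normal_pdf(theta, x)) < 1e-6
```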
The above discussion of improper priors is not particularly precise math-
ematically. There are two traditional ways to make the concept of improper
prior mathematically precise. Each of them opens its own particular can
of worms, so we will only describe each very briefly and point the reader
to relevant literature. First, one may remove the restriction that the prob-
ability of a set must be at most 1. Hartigan (1983) takes this approach
and allows sets to have infinite probability. This makes improper priors
"proper," but now many traditional theorems of probability theory which
make implicit use of the upper bound on probabilities either must be re-
proved or fail to apply to infinite probabilities. The second approach is that
of DeFinetti (1974), in which the requirement that probabilities be count-
ably additive is relaxed. That is, probability is only required to be finitely
additive.14 Needless to say, most of the traditional results of probability
theory need to be reproved or scrapped in this theory also.15 The improper
prior in Example 1.42, when thought of as a finitely additive prior, gives 0
probability to every compact set and still gives probability 1 to the whole
real line. Hartigan (1983, Theorem 3.5) gives a version of Bayes' theorem
for possibly infinite probabilities. Berti, Regazzini, and Rigo (1991) prove a
Bayes' theorem for finitely additive probabilities, as do Heath and Sudderth
(1989). An alternative to using improper priors is to do a robust Bayesian
analysis, as described in Section 8.6.3.

1.3.3 Choosing Probability Distributions

We have assumed that probability distributions represent our (or someone
else's) opinion about unknown quantities. At least a little thought should
be given to how those probability distributions are chosen. The most com-
mon method for choosing a probability distribution might be called "avail-
ability." Most people who study statistics formally for no more than one
academic year will only be able to describe one parametric family of dis-
tributions suitable for use with continuous data. The family of normal
distributions is both computationally tractable and remarkably versatile
as a model for many natural phenomena. Its versatility is due in part to
the fact that many other distributions can easily be transformed to normal
distributions so that the computational tractability of the normal distribu-
tion can be widely extended. The family of transformations introduced by
Box and Cox (1964) is a classic example. Other methods for choosing prob-
ability distributions are based on data analytic techniques. Either the very
data on which inference will be based or other seemingly relevant data are
analyzed by various graphical techniques, hypothesis tests, or other pro-
cedures in order to try to select an appropriate probability model to use
as a description of the uncertainty surrounding the data. The most direct
methods for selecting distributions are those based on elicitation. In such
such

14 A finitely additive probability is a function μ from a field F of subsets of a
set S to [0,1] which satisfies μ(∅) = 0 and μ(A ∪ B) = μ(A) + μ(B) if A ∩ B = ∅.
Kadane, Schervish, and Seidenfeld (1985) explore some of the implications of
finitely additive probability for statistical inference. One well-known consequence
of using improper priors is the famous marginalization paradox reported by Stone
and Dawid (1972) and Dawid, Stone, and Zidek (1973).
15 Schervish, Seidenfeld, and Kadane (1984) show how the law of total proba-
bility B.70 fails in the finitely additive theory. Stone (1976) gives an interesting
example of this failure.

methods, an expert (a term to be left undefined) is questioned about his or
her beliefs concerning relevant random quantities, and a probability model
for those beliefs is inferred from the responses.
Each of the three types of methods for choosing probability distributions
has its advantages and disadvantages. The availability method may seem
silly as described above, but one usually does have a limited number of
families of distributions that one is willing to consider. The methods described
in Section 8.6 on mixtures of models can be useful in sorting out uncertainties
amongst alternative models for a given data set. In particular, robust
methods (Sections 5.1.5 and 8.6.3) are designed to assess or even limit the
sensitivity of inferences to specifications of distributions.16 In short, one
may not be forced actually to choose a single probability distribution to
represent his or her uncertainty. Comparing the effects of various possible
choices may be sufficient for assessing the information content of the data.
When one is determined to use a particular parametric family of distribu-
tions, and only the prior distribution for the parameter needs to be chosen,
it may be the case that various alternatives make little difference and a
choice by convenience (like an improper prior) will be sufficient. Whether
or not this is true, considerations of Bayesian robustness will clearly be in
order when such a choice is made.
Data-based techniques are particularly appealing when one is forced to
analyze someone else's data without access to subject matter expertise.
Also, if one must use one of the popular computer packages, which tend
to be built exclusively around only one distribution for each type of data,
it pays to be able to transform the data into something better suited to
be modeled by that one distribution. Quantile plots and various graphical
techniques [see, for example, Gnanadesikan (1977)] are very useful for help-
ing to select such a transformation. Likelihood-based methods for choosing
a distribution can be described as follows. Suppose that there is an index
set N, and for each a ∈ N, there is a possible distribution for the available
data X. Let f_{X,a} denote the predictive density of the data X as calculated
in (1.23) under the assumption that the distribution being used is the one
corresponding to index a. One could then base a choice amongst the dif-
ferent values of a on the values of g(a) = f_{X,a}(x) once X = x is observed.
(Typically, one chooses a to maximize g(a).) This is similar to what is done
in empirical Bayes analysis (see Section 8.4 for more details). An obvious
drawback to all such data-based methods is that they tend to understate
the amount of uncertainty that remains about interesting unknown quan-
tities. The reason is that one pretends to be sure of something (e.g., which
parametric family or which value of a) of which one really is not sure.
Example 1.43. Suppose that X is a vector of 20 random variables and that we
cannot decide whether to model them as IID Lap(μ, σ/√2) or IID N(μ, σ²) given
(M, Σ) = (μ, σ). (In this way, μ and σ are the mean and standard deviation in
both cases.) We could let N be the set of all triples (i, μ, σ), where i = 1 means
Laplace and i = 2 means normal. Consider the following data values:

    -0.0820,  1.3312, -1.3518, -1.4930,  0.0850,  0.7022,  1.735,
    -0.3164,  2.1948, -0.0371,  0.3377, -0.3124,  0.6087,  0.7339,
    -0.4632,  0.3398, -0.0352,  0.1597, -0.6344, -0.4435.

The value of a that leads to the largest f_{X,a}(x) is (1, 0.0249, 0.9473). The largest
value is 5.943 × 10^{-12}, which is only slightly larger than the value achieved at
a = (2, 0.1530, 0.8909), namely 4.772 × 10^{-12}. If we decide to use the Laplace
distribution model, we will be pretending that we were sure from the start that
the data would be Lap(μ, σ/√2), rather than taking into account the sizable
amount of uncertainty that still remains about the underlying distribution.

In a classical setting, one might look at quantile plots to see whether the
data looked more normal or more like a Laplace distribution. Figure 1.44 shows
quantile plots for both Laplace and normal distributions. The two plots are about
equally straight, although the Laplace plot is a little bit straighter. Choosing
either distribution would surely be acting as if we knew something that was
quite uncertain.
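The maximizing triples reported in Example 1.43 can be reproduced directly: for fixed i, g(a) is maximized by the maximum-likelihood fit, which for the Laplace model is the sample median and the mean absolute deviation from it, and for the normal model the sample mean and root mean squared deviation. A sketch:

```python
import math

data = [-0.0820, 1.3312, -1.3518, -1.4930, 0.0850, 0.7022, 1.735,
        -0.3164, 2.1948, -0.0371, 0.3377, -0.3124, 0.6087, 0.7339,
        -0.4632, 0.3398, -0.0352, 0.1597, -0.6344, -0.4435]
n = len(data)

# Laplace model: the MLE of mu is the sample median, and the MLE of the
# Laplace scale b = sigma/sqrt(2) is the mean absolute deviation from it.
srt = sorted(data)
mu_lap = (srt[n // 2 - 1] + srt[n // 2]) / 2
b = sum(abs(v - mu_lap) for v in data) / n
lik_lap = math.prod(math.exp(-abs(v - mu_lap) / b) / (2 * b) for v in data)

# Normal model: the MLEs are the sample mean and the root mean squared
# deviation from it.
mu_norm = sum(data) / n
s = math.sqrt(sum((v - mu_norm) ** 2 for v in data) / n)
lik_norm = math.prod(
    math.exp(-((v - mu_norm) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    for v in data)

# The Laplace fit is slightly better, as the text reports, but only slightly.
assert lik_lap > lik_norm
```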

Elicitation techniques tend to lie on the interface between statistical the-
ory and psychology. A series of questions must be designed for interrogation
of the expert. The responses to these questions must then be reconciled
with the axioms of probability theory, keeping in mind the expert's limited
motivation and/or ability to answer accurately. Much has been written in
the psychological literature about the ability of people to assess probabili-
ties subjectively. Kahneman, Slovic, and Tversky (1982) have compiled an
interesting collection of articles that, among other things, illustrate and
[Figure 1.44 appears here: quantile plots of the data against the quantiles of the
fitted Laplace and Normal distributions; the horizontal axis is labeled "Quantiles
of Distribution."]

FIGURE 1.44. Quantile Plots for Laplace and Normal Distributions



describe various problems people have in quantifying uncertainty. Hogarth
(1975) surveys the early literature on the assessment of subjective distribu-
tions. He concludes that humans are "ill-equipped for assessing subjective
probability distributions." This conclusion might help to explain the num-
ber of tools that have emerged since that time to better equip statisticians
and experts to choose distributions. These tools are directed primarily to-
ward choosing a prior distribution for the parameters of a prespecified
parametric family. For example, Kadane et al. (1980) and Garthwaite and
Dickey (1988) give algorithms for specifying conjugate prior distributions
for the parameters of normal linear models. Garthwaite and Dickey (1992)
extend their method to deal with the selection of variables in multiple re-
gression. Freedman and Spiegelhalter (1983) and Chaloner et al. (1993)
describe methods for eliciting prior information for use in clinical trials.
A common feature of most prior elicitation schemes is their reliance on
the predictive distribution (1.23) to infer the prior. The reason for this is
that experts are more likely to be comfortable thinking about the actual
observables of their study rather than parameters of statistical models.
Problems 18 and 19 at the end of this chapter give some simple examples
of how this might be done. In order to take account of the fact that experts
may not accurately respond to the elicitation inquiries, Dickey (1980) and
later Gavasakar (1984) described probability models for the elicitation pro-
cess itself. In these models, the responses U which the expert will give to
elicitation questions are modeled as data with a distribution that depends
on the subjective distribution P being elicited. One then tries to infer P
from U using Bayesian or related methods. One must be careful not only to
consider the possible errors in U as answers to the elicitation questions, but
also to consider how sensitive the inference from U to P is. For example,
does a small change in U produce a small or a large change in P? (See the
last parts of Problem 18 on page 76.)
In the remainder of this text, as in most texts on the theory of statistics,
little attention will be given to how the probability distributions are chosen.
When a prior distribution is used for a specified parametric family, one can
assume either that the prior was elicited by some method or other, or that
the prior was chosen by convenience (a popular device), or that the prior
is just one of many that will later be compared in a robustness study.

1.4 DeFinetti's Representation Theorem


1.4.1 Understanding the Theorems
In this section we will state some representation theorems for exchange-
able random quantities and give a number of examples. The proofs of
these theorems (and some related results of interest) are deferred to Sec-

tion 1.5.17 These theorems characterize all of the possible joint distribu-
tions for exchangeable random quantities which take values in a Borel space
(think finite-dimensional Euclidean space), and they are essentially due to
DeFinetti (1937). They can be summarized here as follows. If there is an
infinite sequence of exchangeable random quantities {X_n}_{n=1}^∞, then there
must be some random quantity P such that the X_i are conditionally IID
given P. If the random quantities have Bernoulli distribution, then P can
be taken to be the limit of the proportion of successes in the first n ob-
servations. In general, P will turn out to be the limit of the empirical
distributions P_n of X_1, ..., X_n, which were defined in (1.21). This explains
how DeFinetti's theorem helps to motivate models like (1.10).
Example 1.45. Consider the case of Bernoulli random variables {X_i}_{i=1}^∞. Here,
X = {0, 1} and a random probability measure P on X is equivalent to a random
variable Θ ∈ [0, 1], where Θ = P({1}). The empirical distribution P_n is equivalent
to X̄_n, the average of the first n observations, since P_n({1}) = X̄_n. Theorem 1.47
(a special case of Theorem 1.49) will say that X̄_n converges to Θ a.s., and that,
conditional on Θ = θ, the X_i are IID Ber(θ) random variables. This is what we
meant on page 7 when we said that "θ is given an implicit meaning as a random
variable Θ, rather than a fixed value." Also, the random variable Θ will have a
distribution, which is the measure μ in (1.10).
The heavy mathematics in the proof of Theorem 1.49 is required to make
precise what it means to have a random probability measure P and what it
means to condition on such a thing. For random quantities that assume only
finitely many different values, random probability measures are equivalent
to finite-dimensional random vectors. For more general random quantities,
random probability measures can be more complicated. For this reason, we
prove Theorem 1.47 first, even though it is a special case of Theorem 1.49.
The proof of Theorem 1.47 contains the essential ideas of the more com-
plicated proof without being encumbered by so much mathematics.
If there are only finitely many exchangeable quantities X_1, ..., X_N, then
all that we can prove is the following. Conditional on the empirical distri-
bution P_N of X_1, ..., X_N, every ordered n-tuple (for n ≤ N) of the X_i has
the distribution of n draws without replacement from a finite population
with distribution P_N. It is the "without replacement" qualifier that pre-
vents us from proving that X_1, ..., X_N are conditionally independent. (See
Example 1.18 on page 10.) It is possible for a finite collection of exchange-
able random variables to be conditionally independent; however, it is not
necessary. Looking at the Bernoulli case first might aid in understanding

17 In most of this text, proofs are given immediately or almost immediately
after the statements of results. Because DeFinetti's representation theorem 1.49
is so important for motivating statistical modeling, and because its proof involves
some rather heavy mathematics, many readers may wish to forego reading the
proofs on a first pass through this material. However, every reader should at least
try to understand what Theorem 1.49 says.

the finite case theorem.


Example 1.46. Let X_1, ..., X_N be exchangeable Bernoulli random variables.
Let the word "success" stand for X_i = 1, and let the word "failure" stand for
X_i = 0. Let the word "trial" stand for one of the X_i. Since there are only
finitely many (2^N) possible values for the vector (X_1, ..., X_N), the entire joint
distribution can be specified by giving probabilities to all of those 2^N vectors
of 0s and 1s. If X_1, ..., X_N are exchangeable, however, many of the vectors will
have the same probability. For example, the N vectors with exactly one success
and N - 1 failures all have the same probability. Similarly, the \binom{N}{2} vectors with
exactly two successes and N - 2 failures all have the same probability. In fact,
for each m = 0, ..., N, all \binom{N}{m} vectors with exactly m successes and N - m
failures have the same probability. Since the total number of successes in all
N trials plays such an important role in the distribution, we give it a name,
M. Let p_m = Pr(M = m) for m = 0, ..., N. Then, the probability of each
vector with exactly m successes and N - m failures is p_m/\binom{N}{m}. All probabilities
associated with the joint distribution of X_1, ..., X_N can be calculated from these
values. For example, suppose that we let K equal the number of successes in
a particular collection of n trials (for example, the first n, or the last n, or
every other one from the first 2n, etc.). Then Pr(K = k) can be calculated
by adding up all the probabilities of the vectors for which the particular n trials
include exactly k successes. This is nothing more than a straightforward counting
argument, if we first partition the vectors according to the value of M. For each
m = k, ..., N - n + k, there are \binom{n}{k}\binom{N-n}{m-k} vectors with M = m and with exactly
k successes on the particular n trials of interest. It follows that

    Pr(K = k) = Σ_{m=k}^{N-n+k} [\binom{n}{k}\binom{N-n}{m-k} / \binom{N}{m}] p_m.

This last expression is easily recognized as a mixture of hypergeometric prob-
abilities Hyp(N, n, m) with mixing weights p_m. That is, it appears as if K
has the Hyp(N, n, m) distribution conditional on M = m and M has distribution
(p_0, ..., p_N). Note that the Hyp(N, n, m) distribution is the distribution of the
number of successes in n draws without replacement from an urn containing m
successes and N - m failures. Also, the random variable M is equivalent to the
empirical distribution P_N.
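The counting argument of Example 1.46 can be verified by brute force for a small case. The sketch below (the values of N, n, the chosen trials, and the p_m are all made up) enumerates every 0-1 vector and compares the direct probability of K = k with the mixture-of-hypergeometrics formula:

```python
import math
from itertools import product

# Small illustrative case: N, n, the particular trials, and the weights
# p_m = Pr(M = m) are arbitrary choices (the p_m must sum to 1).
N, n = 6, 3
trials = (0, 2, 4)          # a particular collection of n trials
p = [0.05, 0.10, 0.20, 0.30, 0.20, 0.10, 0.05]

def vector_prob(v):
    # Exchangeability: each vector with m successes has probability
    # p_m / binom(N, m).
    m = sum(v)
    return p[m] / math.comb(N, m)

for k in range(n + 1):
    # Direct enumeration of Pr(K = k) over all 2^N vectors.
    direct = sum(vector_prob(v) for v in product([0, 1], repeat=N)
                 if sum(v[t] for t in trials) == k)
    # Mixture-of-hypergeometrics formula from Example 1.46.
    formula = sum(math.comb(n, k) * math.comb(N - n, m - k) / math.comb(N, m)
                  * p[m] for m in range(k, N - n + k + 1))
    assert abs(direct - formula) < 1e-12
```

By exchangeability the answer does not depend on which n trials are chosen, so any other `trials` tuple of size n gives the same result.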
Example 1.46 above is the proof of the finite version of DeFinetti's theo-
rem for Bernoulli random variables. It is also illustrative of the most general
form of the finite version. The total number of successes must be replaced
by the empirical distribution P_N, and then we have that X_1, ..., X_N are
exchangeable if and only if the conditional distribution, given P_N, of ev-
ery finite subcollection X_{i_1}, ..., X_{i_n} is the distribution of n random draws
without replacement from a population with distribution P_N.

1.4.2 The Mathematical Statements


The Bernoulli case is simple enough to state without introduction.
Theorem 1.47 (DeFinetti's representation theorem for Bernoulli
random variables). An infinite sequence {X_n}_{n=1}^∞ of Bernoulli random
variables is exchangeable if and only if there is a random variable Θ taking
values in [0, 1] such that, conditional on Θ = θ, {X_n}_{n=1}^∞ are IID Ber(θ).
Furthermore, if the sequence is exchangeable, then the distribution of Θ is
unique and Σ_{i=1}^n X_i/n converges to Θ almost surely.
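The representation in Theorem 1.47 is easy to see in simulation: draw Θ once from a prior (Uniform(0, 1) here, an arbitrary choice), then generate IID Ber(Θ) variables; the running proportion of successes settles near the drawn value of Θ.

```python
import random

random.seed(0)

# Draw Theta once, then generate conditionally IID Bernoulli(Theta) trials.
theta = random.random()
n = 200000
successes = sum(random.random() < theta for _ in range(n))

# The proportion of successes converges to the drawn Theta (a.s. as n grows);
# with n this large it is already within a small tolerance.
assert abs(successes / n - theta) < 0.01
```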
For the more general cases, we need some more notation. Let (X, B) be
a Borel space, and let P be the set of all probability measures on (X, B).
The theorems stated below will give the conditional distributions of certain
random quantities taking values in X given certain probability measures.
To be mathematically precise, these probability measures must themselves
be random quantities. That is, we will need a σ-field C_P of subsets of
P such that the appropriate probability measures can be thought of as
measurable functions from some probability space (S, A, μ) to (P, C_P). Let
C_P be the smallest σ-field of subsets of P containing all sets of the form
A_{B,t} = {P ∈ P : P(B) ≤ t}, for B ∈ B and t ∈ [0, 1]. This is the smallest
σ-field for which the evaluation functions g_B : P → ℝ are measurable,18
where g_B(P) = P(B). It is easy to show that P_n, defined in (1.21), is a
measurable function from the n-fold product space (X^n, B^n) to (P, C_P) (see
Problem 24 on page 77). If (S, A, μ) is a probability space, a measurable
function P : S → P is called a random probability measure. In this way, P_n
is a random probability measure for every n.

1.4.2.1 The Finite Version


In order to state the finite version of DeFinetti's theorem, we will find it
convenient to refer to random samples from the empirical distribution of a
collection of random variables X_1, ..., X_N. What we mean by this is the
following. Suppose that X_i = x_i for i = 1, ..., N. Create an urn with N
balls labeled x_1, ..., x_N. A simple random sample of size n with/without
replacement from the empirical distribution P_N of X_1, ..., X_N is n draws
with/without replacement from this urn such that, on each draw, every
ball in the urn has equal probability of being drawn.

Theorem 1.48. Suppose that X_1, ..., X_N are random quantities taking
values in a Borel space (X, B). Let X = (X_1, ..., X_N), and for each B ∈
B, let P_N(B) = Σ_{i=1}^N I_B(X_i)/N be the empirical distribution of X. The
random quantities are exchangeable if and only if, for every ordered n-tuple

18 Those familiar with topological concepts will recognize the sets A_{B,t} as a
subbase for the topology of pointwise convergence of functions from B to ℝ,
which is also the product topology when that set of functions is considered as the
product space ℝ^B. As such, (P, C_P) is not a Borel space. This inconvenient cir-
cumstance will not cause problems for us, however. One of the steps in the proof
of Theorem 1.49 is to show that the subset of P in which P lies is the image
of a Borel space (X^∞, B^∞) under a measurable function. Hence regular condi-
tional distributions are induced on (P, C_P) by the corresponding distributions in
(X^∞, B^∞).

(i_1, ..., i_n) of distinct elements of {1, ..., N}, the joint distribution of
(X_{i_1}, ..., X_{i_n}), conditional on P_N = P, is that of a simple random sample
without replacement from the distribution P.

1.4.2.2 The Infinite Version


The infinite version of DeFinetti's theorem is the following.
Theorem 1.49 (DeFinetti's representation theorem). Let (S, A, μ)
be a probability space, and let (X, B) be a Borel space. For each n, let
X_n : S → X be measurable. The sequence {X_n}_{n=1}^∞ is exchangeable if
and only if there is a random probability measure P on (X, B) such that,
conditional on P = P, {X_n}_{n=1}^∞ are IID with distribution P. Furthermore,
if the sequence is exchangeable, then the distribution of P is unique, and
P_n(B) converges to P(B) almost surely for each B ∈ B.
In Section 2.4, we present a more general theorem of Diaconis and Freedman
(1984) and Lauritzen (1984, 1988), which applies to sequences of random
quantities that are not exchangeable. This Theorem 2.111 will actually
imply Theorem 1.49 as a special case, but its proof is far more complicated
than the proof of Theorem 1.49, which we give in Section 1.5.

1.4.3 Some Examples


Here, we present more examples of exchangeable sequences and the impli-
cations of DeFinetti's theorem.
Example 1.50. Suppose that X_1, ..., X_N are IID Ber(r) random variables. If
M = Σ_{i=1}^N X_i, then Pr(M = m) = C(N,m) r^m (1 − r)^{N−m} for m = 0, ..., N. That
is, M has a Bin(N, r) distribution. Theorem 1.48 says that each specific sequence
of n trials containing k ones has probability

r^k (1 − r)^{n−k},

which corresponds to the X_i being IID Ber(r). Hence, the probability of observing
k ones in n trials is C(n,k) r^k (1 − r)^{n−k}, the binomial probability.
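The claim of Example 1.50 can be checked by exhaustive enumeration with exact rational arithmetic. The following Python sketch is illustrative only; the values r = 1/3 and n = 5 are arbitrary choices. It verifies that every sequence with k ones has the same probability (the exchangeability property) and that the k-ones event has the binomial probability.

```python
from fractions import Fraction
from itertools import product
from math import comb

def seq_prob(seq, r):
    """Probability of a specific 0/1 sequence under IID Ber(r)."""
    k = sum(seq)
    return r**k * (1 - r)**(len(seq) - k)

r = Fraction(1, 3)   # arbitrary illustrative parameter value
n = 5
for k in range(n + 1):
    seqs = [s for s in product((0, 1), repeat=n) if sum(s) == k]
    # every sequence with k ones has probability r^k (1-r)^(n-k) ...
    assert {seq_prob(s, r) for s in seqs} == {r**k * (1 - r)**(n - k)}
    # ... and the k-ones event has the binomial probability
    assert sum(seq_prob(s, r) for s in seqs) == comb(n, k) * r**k * (1 - r)**(n - k)
print("binomial check passed")
```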

The following example helps to explain why Theorem 1.48 is not used
very often with random variables having continuous distributions.

Example 1.51. Suppose that X_1, ..., X_N are exchangeable with continuous
joint CDF F_{X_1,...,X_N}. The conditional distribution of X_1, ..., X_n given P_N is
that of a simple random sample without replacement from P_N, but the distri-
bution of P_N is not simple. If we let B_1, ..., B_k be a partition of X, the joint
distribution of V = (P_N(B_1), ..., P_N(B_k)) can be expressed formally as follows.
For each vector (i_1, ..., i_k) such that Σ_{j=1}^k i_j = N,

Pr(V = (i_1/N, ..., i_k/N)) = Pr((X_1, ..., X_N) ∈ A),

where A is the union of the N!/(i_1! ··· i_k!) product sets of the form B_{s_1} × ... × B_{s_N},
where the subscripts s_1, ..., s_N are integers from 1 to k with j appearing i_j times
for each j. Needless to say, this formulation will not get us very far in general.
Example 1.52. This example is due to Bayes (1764). Suppose that {X_n}_{n=1}^∞ are
exchangeable Bernoulli random variables, and we set

Pr(k successes in n trials) = 1/(n + 1), for k = 0, ..., n and n = 1, 2, ....

To check that this gives a consistent set of probabilities, we must show that, for
every n and every n-tuple (x_1, ..., x_n) of elements of {0, 1},

Pr(X_1 = x_1, ..., X_n = x_n)
= Pr(X_1 = x_1, ..., X_n = x_n, X_{n+1} = 0) + Pr(X_1 = x_1, ..., X_n = x_n, X_{n+1} = 1).

To show this, let k = Σ_{i=1}^n x_i. Then, the left-hand side equals 1/[(n + 1) C(n,k)].
The right-hand side equals

1/[(n + 2) C(n+1,k)] + 1/[(n + 2) C(n+1,k+1)] = 1/[(n + 1) C(n,k)].

To figure out what P is, recall that P_n({1}) = X̄_n, the proportion of successes
in the first n trials, and lim_{n→∞} P_n({1}) = P({1}). Since X̄_n converges a.s. to
Θ, it converges in distribution to Θ by Theorem B.90. Let F_n(t) = Pr(X̄_n ≤ t)
be the CDF of X̄_n. Write

F_n(t) = Pr(at most nt successes in n trials) = (⌊nt⌋ + 1)/(n + 1),

where ⌊x⌋ denotes the greatest integer less than or equal to x. It is trivial to see
that lim_{n→∞} (⌊nt⌋ + 1)/(n + 1) = t, for all 0 ≤ t ≤ 1. Hence, F(t) = t is the CDF
of Θ = lim_{n→∞} X̄_n. That is, the X_i are conditionally IID Ber(θ) given Θ = θ,
and Θ has a U(0, 1) distribution.
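Both the consistency identity and the limiting uniform CDF in Example 1.52 can be verified mechanically with exact rational arithmetic; the following sketch uses an arbitrary test point t = 37/100 and a crude 2/n error bound that is easy to justify from ⌊nt⌋ ≤ nt < ⌊nt⌋ + 1.

```python
from fractions import Fraction
from math import comb

def p_seq(n, k):
    """Pr of a specific 0/1 sequence of length n with k ones, when
    Pr(k successes in n trials) = 1/(n+1) for every k."""
    return Fraction(1, (n + 1) * comb(n, k))

# consistency: a length-n sequence probability equals the sum of the
# probabilities of its two length-(n+1) extensions
for n in range(1, 8):
    for k in range(n + 1):
        assert p_seq(n, k) == p_seq(n + 1, k) + p_seq(n + 1, k + 1)

# F_n(t) = (floor(nt) + 1)/(n + 1) tends to t, the U(0,1) CDF
t = Fraction(37, 100)
for n in (10, 1000, 100000):
    Fn = Fraction(int(n * t) + 1, n + 1)
    assert abs(Fn - t) <= Fraction(2, n)
print("consistency and limit checks passed")
```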
Example 1.53. Suppose that, for each n, the joint density of X_1, ..., X_n is

f_{X_1,...,X_n}(x_1, ..., x_n) = Γ(a + n) b^a / [Γ(a) (b + Σ_{i=1}^n x_i)^{a+n}], for all x_i > 0.

Clearly, such random variables are exchangeable for each n. It can be seen that
these densities are consistent also. (See Problem 20 on page 76.) Let us try to
find the distribution of the limit of the empirical probability measures, P_n.


Pr(P_n((−∞, c]) ≤ t) = Pr(at most ⌊tn⌋ of the X_i are ≤ c) = Σ_{k=0}^ℓ Pr(exactly k of the X_i are ≤ c),

where ℓ = ⌊tn⌋. The probability that the first k X_i are at most c while the rest
are greater is

∫_0^c ··· ∫_0^c ∫_c^∞ ··· ∫_c^∞ Γ(a + n) b^a / [Γ(a) (b + Σ_{i=1}^n x_i)^{a+n}] dx_n ··· dx_1

= Σ_{j=0}^k (−1)^j C(k,j) b^a (b + (n − k + j)c)^{−a}

= Σ_{j=0}^k (−1)^j C(k,j) ∫_0^∞ [b^a/Γ(a)] z^{a−1} exp(−z[b + (n − k + j)c]) dz

= ∫_0^∞ [b^a/Γ(a)] z^{a−1} exp(−z[b + cn]) [exp(cz) − 1]^k dz.

Multiplying this last expression by C(n,k) and summing over k = 0, ..., ℓ gives

Pr(P_n((−∞, c]) ≤ t) = ∫_0^∞ [b^a/Γ(a)] z^{a−1} exp(−bz) Pr(Y_{n,z} ≤ ℓ) dz,

where Y_{n,z} is a random variable with Bin(n, 1 − exp[−cz]) distribution. For each
z,

lim_{n→∞} Pr(Y_{n,z} ≤ ℓ) = { 1 if 1 − exp(−cz) < t; 0 if 1 − exp(−cz) > t } = I_{(0, −log(1−t)/c]}(z).

So, the CDF of the limit P((−∞, c]) of P_n((−∞, c]) is

∫_0^{−log(1−t)/c} [b^a/Γ(a)] z^{a−1} exp(−bz) dz,

which is the CDF of 1 − exp(−cΘ), where Θ ~ Γ(a, b). That is, it is as if there
were a random variable Θ ~ Γ(a, b) and P((−∞, c]) = 1 − exp(−cΘ). Put another
way, it is as if the X_i were conditionally IID with Exp(θ) distribution given Θ = θ
and Θ ~ Γ(a, b). That this is indeed the case can be proven. (See Problem 22 on
page 77.)
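The conclusion of Example 1.53 can be spot-checked numerically in the n = 1 case: mixing Exp(θ) densities against a Γ(a, b) distribution for Θ must reproduce the stated joint density, which for n = 1 reduces to Γ(a+1) b^a / [Γ(a)(b + x)^{a+1}]. The values of a, b, and x below are arbitrary, and the midpoint-rule quadrature is only a crude sketch.

```python
from math import gamma, exp

def mixture_density(x, a, b, grid=200000, zmax=50.0):
    """Numerically integrate the Gamma(a,b) mixture of Exp(z) densities:
    f(x) = integral of z*exp(-z*x) * (b^a/Gamma(a)) z^(a-1) exp(-b*z) dz."""
    dz = zmax / grid
    total = 0.0
    for i in range(grid):
        z = (i + 0.5) * dz          # midpoint rule
        total += z * exp(-z * x) * b**a / gamma(a) * z**(a - 1) * exp(-b * z) * dz
    return total

a, b = 2.0, 3.0                      # arbitrary illustrative values
for x in (0.5, 1.0, 2.0):
    closed_form = gamma(a + 1) * b**a / (gamma(a) * (b + x)**(a + 1))
    assert abs(mixture_density(x, a, b) - closed_form) < 1e-5
print("Gamma-mixture-of-exponentials check passed")
```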

On page 18, we showed how to calculate conditional distributions for
future observations given the ones that have already been observed using
parametric families. For exchangeable random variables, we can often
find these posterior predictive distributions without even introducing the
parameter.

Example 1.54 (Continuation of Example 1.52; see page 29). In the case of
Bernoulli random variables, it is not difficult to calculate conditional probabil-
ities. Suppose that we observe k* successes in the first n* trials, and we are
interested in the probability of k successes in the next n trials. It is straightfor-
ward to calculate the probability of k successes in the next n trials given k* successes
in the first n* trials as

C(n,k) C(n*,k*) (n* + 1) / [C(n*+n, k*+k) (n* + n + 1)].

For example, we get that the probability of k successes in the next n trials given
two successes in the first five trials is

60 C(n,k) / [C(n+5, k+2) (n + 6)].    (1.55)

It is easy to see that the future trials are still exchangeable given the past, and one
could use the distribution in (1.55) to find the distribution of Θ given the observed
trials, just as we found the original distribution of Θ to be U(0, 1). Alternatively,
we have a theorem that applies to all exchangeable Bernoulli sequences.

Theorem 1.56. Suppose that {X_i}_{i=1}^∞ is an infinite sequence of exchange-
able Bernoulli random variables. Let Θ = lim_{n→∞} Σ_{i=1}^n X_i/n, and let μ_Θ
be the distribution of Θ. Conditional on seeing k* successes in n* trials,
the distribution of Θ has CDF

F*(t) = ∫_{[0,t]} θ^{k*} (1 − θ)^{n*−k*} dμ_Θ(θ) / ∫ ψ^{k*} (1 − ψ)^{n*−k*} dμ_Θ(ψ).

PROOF. We already know that the X_i are conditionally IID with Ber(θ)
distribution given Θ = θ. It follows from the definition of conditional prob-
ability that, for every Borel subset B of [0, 1] and every n* and 0 ≤ k* ≤ n*,

Pr(k* successes in n* trials and Θ ∈ B) = ∫_B C(n*,k*) θ^{k*} (1 − θ)^{n*−k*} dμ_Θ(θ).

Dividing this by

Pr(k* successes in n* trials) = C(n*,k*) ∫ ψ^{k*} (1 − ψ)^{n*−k*} dμ_Θ(ψ)

completes the proof. □


Example 1.57 (Continuation of Example 1.54). After observing k* successes in
n* trials, we can find the conditional distribution of Θ from Theorem 1.56. For
example, if n* = 5 and k* = 2, Θ has conditional density

f(θ) = 60 θ² (1 − θ)³

with respect to Lebesgue measure. The probability of a success on the sixth trial
given that we observed two successes in the first five trials is then

∫_0^1 θ · 60 θ² (1 − θ)³ dθ = 3/7,

which agrees with (1.55) with n = 1 and k = 1. After observing k* successes in
n* trials, the distribution of Θ has density

[(n* + 1)! / (k*! (n* − k*)!)] θ^{k*} (1 − θ)^{n*−k*},

which is the density of a Beta(k* + 1, n* − k* + 1) random variable. Given the
data, the probability of success on the next trial is the conditional mean of Θ,
namely (k* + 1)/(n* + 2), which is approximately k*/n* if n* is large.
This example helps to illustrate why observed frequencies are relevant for calcu-
lating probabilities, if the observations are thought to be exchangeable. Suppose
that the distribution μ_Θ has a density f with respect to some measure on [0, 1].
Then the conditional density of Θ given k* successes observed in n* trials will
be a constant times θ^{k*} (1 − θ)^{n*−k*} f(θ). This density is higher for θ values near
the observed k*/n* than is f. As n* gets larger, this becomes more pronounced,
to the point of the density resembling a huge spike near k*/n*, if f is strictly
positive in the vicinity of this value. This argument is the heuristic justification
for the common practice of estimating Θ by k*/n*. The justification only applies
when we believe the trials to be exchangeable. We do not have to believe that
there exists a "fixed value" θ such that the trials are IID Ber(θ). We are just
trying to estimate (or predict) what the limit of the relative frequencies will be.
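Under the uniform mixing distribution of Example 1.52, the posterior predictive probabilities can be computed directly with exact rational arithmetic. The sketch below encodes the predictive formula of Example 1.54 and confirms that the next-trial success probability equals the Beta posterior mean (k* + 1)/(n* + 2), including the 3/7 value found above.

```python
from fractions import Fraction
from math import comb

def pred_prob(k, n, k_star, n_star):
    """Pr(k successes in the next n trials | k* in the first n*) under the
    uniform mixing distribution of Example 1.52."""
    return Fraction(comb(n, k) * comb(n_star, k_star) * (n_star + 1),
                    comb(n_star + n, k_star + k) * (n_star + n + 1))

# next-trial success probability after k* in n* equals the
# Beta(k*+1, n*-k*+1) posterior mean (k*+1)/(n*+2)
for n_star in range(1, 10):
    for k_star in range(n_star + 1):
        assert pred_prob(1, 1, k_star, n_star) == Fraction(k_star + 1, n_star + 2)

# the n* = 5, k* = 2 case: success probability 3/7 on the sixth trial
assert pred_prob(1, 1, 2, 5) == Fraction(3, 7)
print("posterior predictive checks passed")
```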
Example 1.58 (Continuation of Example 1.53; see page 29). In the case of
Bernoulli random variables, we saw that learning more data helped us to learn
the value of Θ more precisely. If learning more data is supposed to help us to learn
P in the present example, then the conditional distribution of X_{n+1} given the
first n observations should approach P. The conditional density of X_{n+1} given
the first n observations is

Γ(a + n + 1)(b + Σ_{i=1}^n x_i)^{a+n} / [Γ(a + n)(b + x + Σ_{i=1}^n x_i)^{a+n+1}]
= (a*/b*) (1 + x/b*)^{−(a*+1)},

where a* = a + n and b* = b + Σ_{i=1}^n x_i. Suppose that a*/b* converges to θ as
n → ∞. Then, the conditional density above converges to θ exp(−xθ), an Exp(θ)
density. Once again, it is as if P is the CDF of the Exp(Θ) distribution for some
random variable Θ. This is like saying that the distribution of P is concentrated
on the set

E = {P : P((−∞, x]) = 1 − exp(−xθ) for all x > 0, for some θ > 0}.

In Problem 22 on page 77, you can prove that the probability is distributed over
E as follows. Consider the mapping V : ℝ⁺ → E defined by V(θ) = F_θ, where
F_θ(x) = 1 − exp(−xθ) for all x > 0. To determine the appropriate probability
measure on ℝ⁺, we first note that the conditional distribution of the future
given the past depends on the past only through Σ_{i=1}^n X_i. We suspect that some
function of this converges in distribution to Θ. Solve Problem 21 on page 76 to
determine the appropriate function and the limiting distribution. The measure
induced on E by V is the distribution of P, μ_P. Integration in the space P is
performed by integrating over ℝ⁺:

∫_S g(P) dμ_P(P) = ∫_{V⁻¹(S)} g(V(θ)) dμ_Θ(θ).

The function V⁻¹ gives us a way of dealing with P as if it were the real num-
ber θ = V⁻¹(P). This is a special case of a parametric index, to be defined in
Definition 1.85.
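The limiting behavior of the predictive density in Example 1.58 can be checked numerically. The sketch below evaluates (a*/b*)(1 + x/b*)^{−(a*+1)} along a sequence where a*/b* is held exactly at θ (the values θ = 0.5, x = 1.3, and a = 2 are arbitrary) and confirms that the error against θ exp(−θx) shrinks as n grows.

```python
from math import exp

def pred_density(x, a_star, b_star):
    """Conditional density of X_{n+1} given the first n observations,
    written in terms of a* = a + n and b* = b + sum of the x_i."""
    return (a_star / b_star) * (1 + x / b_star) ** (-(a_star + 1))

# if a*/b* -> theta, the predictive density approaches theta*exp(-theta*x)
theta = 0.5
x = 1.3                      # arbitrary evaluation point
errs = []
for n in (10, 100, 1000, 10000):
    a_star = 2.0 + n                       # a = 2 (arbitrary)
    b_star = a_star / theta                # force a*/b* = theta exactly
    errs.append(abs(pred_density(x, a_star, b_star) - theta * exp(-theta * x)))
assert errs == sorted(errs, reverse=True)  # error shrinks as n grows
assert errs[-1] < 1e-4
print("predictive density limit check passed")
```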

1.5 Proofs of DeFinetti's Theorem and Related Results*
1.5.1 Strong Law of Large Numbers
Because the infinite forms of DeFinetti's theorem state that certain pro-
portions converge almost surely, we will need to prove a strong law of large
numbers for exchangeable random variables.^19 The strong law of large num-
bers for IID random variables 1.63 says that, for a sequence of IID random
variables with finite mean, the sequence of averages of the first n of them
converges almost surely to the mean. As stated, this result is clearly false
for exchangeable random variables. For example, let X have finite mean
but nondegenerate distribution, and suppose that X_i = X for all i. Then
{X_n}_{n=1}^∞ is clearly an exchangeable sequence, and the average of the first
n is equal to X for every n. Hence, the averages converge almost surely to
X, not the mean. For exchangeable random variables, we can only prove
that the averages converge to some random variable, which might not be
constant.
We will prove two versions of the strong law for two different sets of
readers. The first version, Theorem 1.59, is based solely on elementary
probability theory, but it concludes only that a subsequence of the sequence
of sample averages converges almost surely,^20 and then only under the
assumption of finite variance. But the restricted result of Theorem 1.59
is all that is needed in the proof of Theorem 1.49. The second version,
Theorem 1.62, is based on the theory of reversed martingales and is a more
complete statement of the strong law of large numbers than Theorem 1.59.

* This section may be skipped without interrupting the flow of ideas.


^19 One consequence of DeFinetti's representation theorem 1.49 will be that
many theorems that apply to IID random quantities can be adapted to apply
to exchangeable random quantities. Taylor, Daffer, and Patterson (1985) prove
some examples of such theorems. See also Problem 31 on page 78 and Problem 39
on page 79. One such result would be the strong law of large numbers 1.62.
Unfortunately, one of the steps in the proofs of Theorems 1.47 and 1.49 makes
use of a strong law of large numbers for exchangeable random variables.
^20 Hartigan (1983, Theorem 4.6) simplifies his proof of DeFinetti's theorem
using this fact. The statement of Theorem 1.59 can be extended to show that the
entire sequence of sample averages converges almost surely, as in Problem 38 on
page 78.

1.5.1.1 An Elementary Version


We state and prove here an elementary version of the strong law of large
numbers for exchangeable random variables, which is only general enough
to allow us to prove Theorem 1.49. Those who desire a more complete
statement of the strong law of large numbers and who have some familiarity
with martingales may safely skip this theorem and proceed to the
martingale version in Section 1.5.1.2.^21
Theorem 1.59.^22 Let (S, A, μ) be a probability space and, for each n,
let X_n : S → ℝ be exchangeable random variables. Assume that −∞ <
E(X_iX_j) = m_2 < ∞ for all i ≠ j and E(X_i²) = μ_2 < ∞ for all i. Let
Y_n = Σ_{i=1}^n X_i/n. Then the subsequence {Y_{8^k}}_{k=1}^∞ converges almost surely.
PROOF. We will prove that the subsequence converges almost surely by
proving that it is a Cauchy sequence^23 almost surely. Use Tchebychev's
inequality B.16 to write, for m > n,

Pr(|Y_m − Y_n| ≥ ε) ≤ E(Y_m − Y_n)²/ε²
= (E(Y_n²) + E(Y_m²) − 2E(Y_nY_m))/ε²
= ( (1/n²)[nμ_2 + n(n−1)m_2] + (1/m²)[mμ_2 + m(m−1)m_2]
  − (2/(mn))[nμ_2 + n(n−1)m_2 + n(m−n)m_2] ) / ε²
= (1/n − 1/m)(μ_2 − m_2)/ε² ≤ (μ_2 − m_2)/(nε²).    (1.60)

Now, let Z_k = Y_{8^k} and A_k = {s : |Z_{k+1}(s) − Z_k(s)| ≥ 2^{−k}}. It follows easily from
(1.60) (with ε = 2^{−k}) that, for every k, Pr(A_k) ≤ (μ_2 − m_2)2^{−k}. Now, let
A = ∩_{n=1}^∞ ∪_{k=n}^∞ A_k. It follows from the first Borel–Cantelli lemma A.20 that
Pr(A) = 0. We finish the proof by showing that, for every s ∈ A^c and every
ε > 0, there exists N_s such that n, m ≥ N_s implies |Z_n(s) − Z_m(s)| < ε.

Write A^c = ∪_{n=1}^∞ ∩_{k=n}^∞ A_k^c. For every s ∈ A^c, there exists c_s such that
s ∈ ∩_{k=c_s}^∞ A_k^c. If m > n ≥ c_s, it follows that

|Z_m(s) − Z_n(s)| ≤ Σ_{i=n}^{m−1} |Z_{i+1}(s) − Z_i(s)| < 2^{−n+1} ≤ 2^{−c_s+1}.

So, let N_s > 1 + max{c_s, −log_2 ε} to finish the proof. □

^21 Two theorems (7.49 and 7.80) make use of the strong law of large num-
bers for IID random variables 1.63. This result is available as a consequence of
Theorem 1.62, but not from Theorem 1.59.
^22 This theorem is used in the proof of Theorem 1.49. Its proof resembles the
proof of a similar claim in Loève (1977, Section 6.3).
^23 A sequence of real numbers {x_n}_{n=1}^∞ is Cauchy if, for every ε > 0, there exists
N such that n, m ≥ N implies |x_n − x_m| < ε. Since ℝ is a complete metric space,
every Cauchy sequence converges.
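The variance identity E(Y_m − Y_n)² = (1/n − 1/m)(μ_2 − m_2) used in (1.60) can be verified by exact enumeration for a small exchangeable model. The two-point mixing distribution below is an arbitrary choice (conditional on T = t, the X_i are IID Ber(t)); everything is computed with rational arithmetic, so the comparison is exact.

```python
from fractions import Fraction
from itertools import product

# two-point mixing distribution: conditional on T = t, the X_i are IID Ber(t)
weights = {Fraction(1, 5): Fraction(1, 2), Fraction(4, 5): Fraction(1, 2)}

mu2 = sum(w * t for t, w in weights.items())       # E(X_i^2) = E(X_i) for Bernoulli
m2 = sum(w * t * t for t, w in weights.items())    # E(X_i X_j), i != j

def seq_prob(seq):
    """Probability of a 0/1 sequence under the mixture."""
    k = sum(seq)
    return sum(w * t**k * (1 - t)**(len(seq) - k) for t, w in weights.items())

def E_YnYm(n, m):
    """E(Y_n Y_m) by exact enumeration over all 0/1 sequences of length m."""
    return sum(seq_prob(s) * Fraction(sum(s[:n]), n) * Fraction(sum(s), m)
               for s in product((0, 1), repeat=m))

n, m = 2, 4
lhs = E_YnYm(n, n) + E_YnYm(m, m) - 2 * E_YnYm(n, m)   # E[(Y_m - Y_n)^2]
assert lhs == (Fraction(1, n) - Fraction(1, m)) * (mu2 - m2)
print("variance identity in (1.60) verified")
```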


There are strong laws of large numbers for IID random variables also.
Here is one which will help in the proof of Theorem 1.49.

Lemma 1.61 (Strong law of large numbers: bounded condition-
ally IID case). Let {X_n}_{n=1}^∞ be a sequence of bounded random variables,
and let Θ be an arbitrary random quantity such that, conditional on Θ,
the X_n are IID with mean c(Θ). Then Σ_{i=1}^n X_i/n = Y_n converges almost
surely to c(Θ).

PROOF. Since Y_n converges almost surely to c(Θ) if and only if Y_n − c(Θ)
converges almost surely to 0, and since Y_n − c(Θ) is the sample average of
the first n of X_i − c(Θ), assume that c(Θ) = 0 without loss of generality.
Now, write

E(Y_n⁴ | Θ) = (1/n⁴) Σ_{i_1=1}^n Σ_{i_2=1}^n Σ_{i_3=1}^n Σ_{i_4=1}^n E(X_{i_1}X_{i_2}X_{i_3}X_{i_4} | Θ).

Each of the terms above, for which at least one of i_1, i_2, i_3, i_4 is not repeated,
has mean 0 because the random variables are conditionally independent
with mean 0. Let M be a bound for |X_n|. It follows that

E(Y_n⁴ | Θ) ≤ (1/n⁴)[3n(n−1) + n] M⁴ ≤ 4M⁴/n².

So, E(Y_n⁴) ≤ 4M⁴/n² by the law of total probability B.70. It follows from
the Markov inequality B.15 that, for each ε > 0,

Pr(|Y_n| > ε) = Pr(Y_n⁴ > ε⁴) ≤ 4M⁴/(n²ε⁴).

So, Σ_{n=1}^∞ Pr(|Y_n| > ε) < ∞. The first Borel–Cantelli lemma A.20 implies
that Pr(|Y_n| > ε infinitely often) = 0. Since the event that Y_n converges to
0 is ∩_{k=1}^∞ {|Y_n| > 1/k infinitely often}^c, it follows that Y_n converges to 0
almost surely. □

1.5.1.2 A Martingale Version+

A more complete proof of the strong law of large numbers for exchangeable
random variables is borrowed from Kingman (1978).^24

Theorem 1.62 (Strong law of large numbers).^25 Let (S, A, μ) be a
probability space, and let X_i : S → ℝ be measurable for all i such that the
X_i are exchangeable with E(|X_i|) < ∞ for all i. Then there exists a σ-field
A_∞ such that Y_n = Σ_{i=1}^n X_i/n converges almost surely to E(X_1|A_∞).

PROOF. Define X = (X_1, X_2, ...). For n > 0, let C_n be the collection of all
Borel subsets A of ℝ^∞ which satisfy x ∈ A if and only if y ∈ A for all y
that agree with x after coordinate n and such that the first n coordinates
of y are a permutation of the first n coordinates of x. It is not difficult to
show that C_n is a σ-field, and it is trivial to see that f(x) = Σ_{i=1}^n x_i is
measurable with respect to C_n (Problem 36 on page 78). Let A_n = X⁻¹(C_n)
and Z_n = E(X_1|A_n). Since E(|X_1|) < ∞ and {A_n}_{n=1}^∞ is a decreasing
sequence of σ-fields, it follows from Part II of Lévy's theorem B.124 that
lim_{n→∞} Z_n = E(X_1|A_∞) and is finite a.s. We now show that Y_n = Z_n.
Since f(x) = Σ_{i=1}^n x_i is measurable with respect to C_n, we need only prove
that, for A ∈ A_n, E(I_A Y_n) = E(I_A X_1). But, I_A X_i has the same distribution
as I_A X_j for all i, j = 1, ..., n by the assumption of exchangeability and the
permutation symmetry of the set A. Hence E(I_A X_1) = Σ_{i=1}^n E(I_A X_i)/n =
E(I_A Y_n). □

As a corollary, we also mention the usual strong law of large numbers for
IID random variables.

Corollary 1.63 (Strong law of large numbers: IID case).^26 Suppose
that {X_n}_{n=1}^∞ is a sequence of IID random variables with E(X_i) = μ. Then
Σ_{i=1}^n X_i/n = Y_n converges almost surely to μ.

1.5.2 The Bernoulli Case

The proof of DeFinetti's representation theorem for finitely many Bernoulli
random variables X_1, ..., X_N was given in Example 1.46. There, we saw
that the conditional distribution of the number K of successes in n trials
given M = m was hypergeometric Hyp(N, n, m), where M = Σ_{i=1}^N X_i. So,

+ This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
^24 This proof is also similar to one given for the case of IID random variables by
Doob (1953, Section VII.6). Those who are unfamiliar with martingale theory
may safely skip this section and study the elementary version given earlier. But
these readers should be aware that two theorems (7.49 and 7.80) do make use of
Corollary 1.63.
^25 This theorem can be used in the proof of Theorem 1.49.
^26 This corollary is used in the proofs of Theorems 7.49 and 7.80.

for example,

Pr(K = k | M = m) = C(m,k) C(N−m, n−k) / C(N,n).    (1.64)

Suppose that N → ∞ in such a way that M/N → θ. For fixed n and k,
we can take limits in (1.64) as N → ∞ and m/N → θ. Formally, we would
get

Pr(K = k | Θ = θ) = C(n,k) θ^k (1 − θ)^{n−k},

which is the model for K ~ Bin(n, θ). In fact, this is what Theorem 1.47
says is the case. The precise proof is a bit more complicated than the
heuristic argument above, but the idea is the same.

PROOF OF THEOREM 1.47. The "if" direction is simple and is left to the
reader. For the "only if" direction, assume that {X_n}_{n=1}^∞ is an exchangeable
Bernoulli sequence. Let Y_i = Σ_{ℓ=1}^i X_ℓ/i for i = 1, 2, .... By the strong law
of large numbers 1.62 or 1.59, we know that Y_{8^n} converges almost surely.
Let Θ denote the limit when the limit exists, and let Θ = 1/2 when the
limit does not exist. Let μ_Θ denote the distribution of Θ (a probability
measure on [0,1]).

The main step in the proof is to show that for every integer k and every
j_1, ..., j_k ∈ {0, 1} and every Borel subset C of [0,1],

Pr(X_1 = j_1, ..., X_k = j_k, Θ ∈ C) = ∫_C θ^y (1 − θ)^{k−y} dμ_Θ(θ),    (1.65)

where y = j_1 + ··· + j_k. To show this, let Z_n = I_C(Θ) Y_{8^n}^y (1 − Y_{8^n})^{k−y}
and let Z = I_C(Θ) Θ^y (1 − Θ)^{k−y}. It is easy to see that Z_n → Z a.s., hence
Z_n converges in distribution to Z by Theorem B.90. Since Z_n is uniformly
bounded, E(Z_n) → E(Z). The right-hand side of (1.65) is just E(Z). So, we
need only show that E(Z_n) converges to the left-hand side of (1.65). Let
m = 8^n, and define W_{ℓ,t} = I_{{j_t}}(X_ℓ) for each integer t = 1, ..., k. Then

(1/m) Σ_{ℓ=1}^m W_{ℓ,t} = { Y_m if j_t = 1; 1 − Y_m if j_t = 0 }.

With this notation, we can write

Z_n = (1/m^k) I_C(Θ) Σ_{i_1=1}^m ··· Σ_{i_k=1}^m Π_{t=1}^k W_{i_t,t}
= (I_C(Θ)/m^k) Σ_{all i_t distinct} Π_{t=1}^k W_{i_t,t}
+ (I_C(Θ)/m^k) Σ_{at least two i_t equal} Π_{t=1}^k W_{i_t,t}.    (1.66)

The first sum on the right-hand side of (1.66) has m!/(m − k)! terms.
Since E[I_C(Θ) Π_{t=1}^k W_{i_t,t}] equals the left-hand side of (1.65) when all i_t
are distinct, and since m!/[(m − k)! m^k] converges to 1, the mean of the
first term on the right of (1.66) converges to the left-hand side of (1.65).
The second sum has m^k − m!/(m − k)! terms, each of which is bounded
between 0 and 1. Since 1 − m!/[(m − k)! m^k] converges to 0, so does the
mean of the last expression in (1.66). This completes the proof of (1.65).

Equation (1.65) is exactly what it means to say that X_1, ..., X_k are
conditionally IID Ber(θ) given Θ = θ. To see that the distribution μ_Θ is
unique, let C = [0,1] in (1.65) and note that this equation determines the
means of all polynomial functions of Θ. Since polynomials are dense in the
set of all bounded continuous functions on [0,1] by the Stone–Weierstrass
theorem C.3, it follows that (1.65) determines the means of all bounded
continuous functions of Θ, and Corollary B.107 says that the means of
all bounded continuous functions determine the distribution. To finish the
proof, we note that since {X_n}_{n=1}^∞ are bounded and conditionally IID,
Lemma 1.61 says that {Y_i}_{i=1}^∞ converges a.s. Obviously, the limit must be
Θ, a.s. □
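The uniqueness step identifies μ_Θ through the moments determined by (1.65) with C = [0,1]. For the Bayes/uniform model of Example 1.52, this identification can be checked with exact arithmetic: a specific sequence with y ones among k trials has probability 1/[(k+1) C(k,y)], and the corresponding U(0,1) moment is the Beta integral y!(k−y)!/(k+1)!; the two must agree.

```python
from fractions import Fraction
from math import comb, factorial

# Equation (1.65) with C = [0,1] says Pr(X_1 = j_1, ..., X_k = j_k) equals
# E[Theta^y (1 - Theta)^(k-y)].  For mu_Theta = U(0,1) that expectation is
# the Beta integral y!(k-y)!/(k+1)!.
for k in range(1, 12):
    for y in range(k + 1):
        seq_probability = Fraction(1, (k + 1) * comb(k, y))
        beta_integral = Fraction(factorial(y) * factorial(k - y), factorial(k + 1))
        assert seq_probability == beta_integral
print("moment identification check for (1.65) passed")
```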

1.5.3 The General Finite Case*

1.5.3.1 Proof of Theorem 1.48

Define a function h : X^N → P by h(x) = P_x, where

P_x(B) = (1/N) Σ_{i=1}^N I_B(x_i)

if x = (x_1, ..., x_N) and B ∈ B. We refer to P_x as the empirical distribution
of x. It is easy to check that the function h is measurable. (See Problem 24
on page 77.) Simple random samples with/without replacement from P_x
can be defined in exactly the same way as they were for samples from P_N
in Section 1.4.2.1. In fact, if X = (X_1, ..., X_N), then P_X = P_N.

Even though the space (P, C_P) is quite complicated, the subset of P in
which P_N lies is relatively simple. P_N concentrates all of its mass on at
most N different points in X. Hence, it is not nearly as complicated an
object as it may appear. In fact, it is really nothing more than two vectors
of equal length (at most N), where the coordinates of one of the vectors are
elements of X and the coordinates of the other are nonnegative multiples
of 1/N adding to 1.

The following lemma will be used to help prove Theorem 1.48 and an
approximation in Theorem 1.70.
Lemma 1.67. Suppose that X = (X_1, ..., X_N) are exchangeable random
quantities. Let Q_X be the probability measure giving their joint distribu-
tion on (X^N, B^N). For n ≤ N, x = (x_1, ..., x_N), and B ∈ B^n, let H_n(B|x)
stand for the probability that n draws without replacement from an urn con-
taining balls labeled x_1, ..., x_N form a point in B. Also, let M_n(B|x) stand
for the probability that n draws with replacement from an urn containing
balls labeled x_1, ..., x_N form a point in B. Let Y_1, ..., Y_N be conditionally
IID given X = x with distribution M_1(·|x). Then

Pr((X_{i_1}, ..., X_{i_n}) ∈ B, X ∈ C) = ∫_C H_n(B|x) dQ_X(x),
Pr((Y_{i_1}, ..., Y_{i_n}) ∈ B, X ∈ C) = ∫_C M_n(B|x) dQ_X(x),

for each B ∈ B^n and each C ∈ B^N.

* This section may be skipped without interrupting the flow of ideas.

PROOF. The second equation above is immediate from the definition of
conditional distribution. Next, notice that H_n(B|x) can be written

H_n(B|x) = [(N−n)!/N!] Σ_{distinct (j_1,...,j_n)} I_B(x_{j_1}, ..., x_{j_n}).

By the exchangeability of X_1, ..., X_N, we have

Pr((X_{i_1}, ..., X_{i_n}) ∈ B, X ∈ C)
= [(N−n)!/N!] Σ_{distinct (j_1,...,j_n)} Pr((X_{j_1}, ..., X_{j_n}) ∈ B, X ∈ C)
= ∫_C H_n(B|x) dQ_X(x). □

We are now in position to prove Theorem 1.48.

PROOF OF THEOREM 1.48. The "if" part is fairly straightforward and left
to the reader. Only the n = N case is needed, since the others follow from
it by taking marginal distributions.

For the "only if" part, assume that X_1, ..., X_N are exchangeable, and
let P_N be as defined. For each probability P with support on at most N
points and probabilities of the form k/N, let H'_n(B|P) be the probability
that n draws without replacement from P form a point in B. What
we need to prove is that for all n and all distinct i_1, ..., i_n and all B ∈ B^n
and C ∈ C_P,

Pr((X_{i_1}, ..., X_{i_n}) ∈ B, P_N ∈ C) = ∫_C H'_n(B|P) dμ_N(P),    (1.68)

where μ_N is the probability measure giving the distribution of P_N. Let
h : X^N → P be the function that maps a point x = (x_1, ..., x_N) to the
empirical distribution of x. That is, h(x) = P_x and h(X) = P_N. Let Q_X be
the distribution of X. It follows that μ_N is the measure induced on (P, C_P)
from Q_X by h. Also, it follows that H'_n(B|P) = H_n(B|x) for all x such
that h(x) = P. This means that H_n(B|x) is a function of h(x), namely
H'_n(B|P_x). Now, write the left-hand side of (1.68) as

Pr((X_{i_1}, ..., X_{i_n}) ∈ B, X ∈ h⁻¹(C)) = ∫_{h⁻¹(C)} H_n(B|x) dQ_X(x)
= ∫_C H'_n(B|P) dμ_N(P),

where the first equality follows from Lemma 1.67, and the second follows
from Theorem A.81. This proves (1.68). □

1.5.3.2 An Approximation Theorem

Some questions naturally arise when comparing the cases of finitely many
and infinitely many exchangeable random variables. First, what happens
when N becomes infinite? Second, is there any sense in which the finite
or infinite N cases approximate each other? The following lemma bounds
the difference between probabilities calculated under sampling with and
without replacement from a finite set and is useful in addressing these
approximation questions.

Lemma 1.69. Suppose that we have an urn with N balls labeled y_1, ..., y_N.
Let Y be the set of distinct items in the set of labels. Let X = (X_1, ..., X_n)
be the values of the labels for n draws without replacement from the urn. Let
Y = (Y_1, ..., Y_n) be the values of the labels for n draws with replacement from
the urn. Let P be the distribution of Y and let Q be the distribution of X.
Then sup_{A ⊆ Y^n} |P(A) − Q(A)| ≤ n(n−1)/(2N).

PROOF.^27 First, suppose that there are no duplicate labels. Then both
P and Q are constant on the set A = {x : Q({x}) > 0} and on the
complement of this set. It follows that this set and its complement are
where the supremum of the difference occurs. The difference between Q(A)
and P(A) is easily seen to be 1 − P(A) = 1 − n! C(N,n)/N^n. If x_i ∈ (0, 1] for all
i = 1, ..., n, then it is easy to show (say, by induction) that 1 − Σ_{i=1}^n x_i ≤
Π_{i=1}^n (1 − x_i). It is clear that n! C(N,n)/N^n = Π_{i=1}^{n−1} (1 − i/N), hence

1 − Σ_{i=1}^{n−1} i/N = 1 − n(n−1)/(2N) ≤ P(A).

The result now follows by subtracting both sides from 1.

For the general case, suppose that for each i, ball i has two labels (i, y_i),
and let Y_0 denote the set of these labels. Let X' and Y' record both labels,
so that the first part of the proof applies to their distributions. Call these
distributions Q' and P'. Assume that X and Y still only record the second
parts of the labels. For each A ⊆ Y^n there exists a set A' ⊆ Y_0^n such that
X ∈ A if and only if X' ∈ A', and Y ∈ A if and only if Y' ∈ A'. In fact,

A' = ∪_{x ∈ A} Π_{j=1}^n {(i, y_i) : y_i = x_j}.

It now follows that

sup_{A ⊆ Y^n} |P(A) − Q(A)| ≤ sup_{A' ⊆ Y_0^n} |P'(A') − Q'(A')| ≤ n(n−1)/(2N). □

^27 Freedman (1977) gives the first part of this proof.
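The bound in Lemma 1.69 can be checked by brute-force enumeration for small urns. The supremum of |P(A) − Q(A)| over all events A is the total variation distance, which equals half the L1 distance between the two probability mass functions; the urn sizes below are arbitrary small examples.

```python
from fractions import Fraction
from itertools import product, permutations

def tv_with_without(labels, n):
    """Total variation distance between n ordered draws with and without
    replacement from an urn with the given (distinct) labels."""
    labels = list(labels)
    N = len(labels)
    P = {}                                   # with replacement
    for tup in product(labels, repeat=n):
        P[tup] = P.get(tup, Fraction(0)) + Fraction(1, N**n)
    denom = 1
    for i in range(n):
        denom *= N - i
    Q = {}                                   # without replacement
    for tup in permutations(labels, n):
        Q[tup] = Q.get(tup, Fraction(0)) + Fraction(1, denom)
    keys = set(P) | set(Q)
    return sum(abs(P.get(t, 0) - Q.get(t, 0)) for t in keys) / 2

for N, n in [(5, 2), (5, 3), (8, 3), (8, 4)]:
    assert tv_with_without(range(N), n) <= Fraction(n * (n - 1), 2 * N)
print("sampling bound checks passed")
```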

Lemma 1.69 allows us to prove some approximation theorems for fi-
nite exchangeable sequences. The next theorem is borrowed from Diaconis
and Freedman (1980a). It says that the joint distribution of finitely many
exchangeable random quantities is uniformly approximated by the joint
distribution of conditionally IID random quantities.

Theorem 1.70. Suppose that X_1, ..., X_N are exchangeable random quan-
tities taking values in a Borel space (X, B). For each P ∈ P and each n
and each B ∈ B^n, let P^n(B) stand for the probability that a vector of n
IID random variables with distribution P lies in B. Let μ_N be the distri-
bution of P_N. Then, for all n and all distinct i_1, ..., i_n in {1, ..., N} and
all B ∈ B^n,

|Pr((X_{i_1}, ..., X_{i_n}) ∈ B) − ∫ P^n(B) dμ_N(P)| ≤ n(n−1)/(2N).    (1.71)

PROOF. Let Q_X stand for the joint distribution of X_1, ..., X_N. For x =
(x_1, ..., x_N) and B ∈ B^n, let H_n(B|x) and M_n(B|x) be as in Lemma 1.67.
(If P_x is the empirical distribution of x, then M_n(B|x) = P_x^n(B).) By
Lemma 1.69, we have |H_n(B|x) − M_n(B|x)| ≤ n(n−1)/(2N) for all x. From
Lemma 1.67, we have

|Pr((X_{i_1}, ..., X_{i_n}) ∈ B) − ∫ M_n(B|x) dQ_X(x)|
= |∫ H_n(B|x) dQ_X(x) − ∫ M_n(B|x) dQ_X(x)| ≤ n(n−1)/(2N).

All that remains is to show that the distribution μ_N of P_N satisfies

∫ M_n(B|x) dQ_X(x) = ∫ P^n(B) dμ_N(P).    (1.72)

Consider, once again, the function h : X^N → P which maps a point x ∈ X^N
to P_x. Note that h(x) = M_1(·|x) also. Since h(X) = P_N, Theorem A.81
says that

∫ P_x^n(B) dQ_X(x) = ∫ P^n(B) dμ_N(P).    (1.73)

Since P_x^n(B) = M_n(B|x), it follows that (1.73) is (1.72). □

Theorem 1.49 says that infinitely many exchangeable random quantities
are conditionally IID. Theorem 1.70 says that there is continuity in passing
from the finite to the infinite case.

1.5.3.3 Conditional Distributions*

There is a general form for the conditional distributions of finitely many
exchangeable random quantities. If X_1, ..., X_N are exchangeable, it is easy
to prove that X_{k+1}, ..., X_N are exchangeable conditional on X_1, ..., X_k.

Proposition 1.74. If X_1, ..., X_N are exchangeable, then X_{k+1}, ..., X_N
are exchangeable conditional on X_1, ..., X_k.

If one uses the conditional distribution of X_{k+1}, ..., X_N given X_1, ..., X_k in
place of the distribution of X in Theorem 1.48, one obtains the conditional
distribution of each subset of X_{k+1}, ..., X_N given X_1, ..., X_k.

Proposition 1.75. Suppose that X = (X_1, ..., X_N) are exchangeable, that
k + n ≤ N, and that j_1, ..., j_n ∈ {k+1, ..., N} are distinct. The conditional
joint distribution of X_{j_1}, ..., X_{j_n} given (X_1, ..., X_k) = (x_1, ..., x_k)
and P_N = P is that of a simple random sample of size n without replace-
ment from the distribution P with one ball for each of x_1, ..., x_k removed
first.
Example 1.76. Suppose that X_1, ..., X_N are exchangeable Bernoulli random
variables. For simplicity, suppose that we are interested in the first k + n X_i's. If we
will observe X_1, ..., X_k, then it suffices to be able to calculate Pr(Y = ℓ | Y' = ℓ*)
for each ℓ, ℓ*, where Y = Σ_{i=k+1}^{k+n} X_i and Y' = Σ_{i=1}^k X_i. It follows from the
exchangeability that

Pr(Y = ℓ | Y' = ℓ*) = Σ_{m*} [C(m*, ℓ) C(N* − m*, n − ℓ) / C(N*, n)] p*_{m*},

where we have made the substitutions

N* = N − k,  m* = m − ℓ*,  p*_{m*} = Pr(M* = m* | Y' = ℓ*).^28

The conditional probability that Y = ℓ given Y' = ℓ* is in precisely the same
form as the marginal probability of Y = ℓ, except that the distribution of M has
been replaced by the conditional distribution of M* = M − ℓ* given Y' = ℓ*.

* This section may be skipped without interrupting the flow of ideas.

For example, suppose that p_m = 1/(N + 1) for m = 0, ..., N. Then, the
probability of one success in one trial is

Σ_{m=0}^N [C(m,1) C(N−m,0) / C(N,1)] · 1/(N + 1) = (1/(N + 1)) Σ_{m=0}^N m/N
= N(N + 1)/[2(N + 1)N] = 1/2.

After seeing one observation X_1 = 1, we calculate the conditional distribution of
the number of remaining successes:

p*_{m*} = [C(m*+1,1)/C(N,1)] · [1/(N + 1)] / (1/2) = 2(m* + 1)/[N(N + 1)]
{ < 1/N if m* < (N − 1)/2;  > 1/N if m* > (N − 1)/2 }.

The probability has been shifted to higher values of m* after seeing one success.
Suppose now that we see two observations, X_1 = 1 and X_2 = 0. The probability
of one success in the first two trials is

Σ_{m=0}^{N−1} [C(m,1) C(N−m,1) / C(N,2)] · 1/(N + 1) = (2/(N + 1)) Σ_{m=0}^{N−1} m(N − m)/[N(N − 1)]
= [2/((N − 1)N(N + 1))] [N²(N − 1)/2 − N(N − 1)(2N − 1)/6] = 1/3,

and the conditional distribution of the future is

p*_{m*} = [C(m*+1,1) C(N−m*−1,1) / C(N,2)] · [1/(N + 1)] / (1/3)
= 6(m* + 1)(N − m* − 1)/[N(N − 1)(N + 1)].

The maximum of this probability occurs at the value of m* closest to (N − 2)/2,
and the probability drops off as m* moves away from (N − 2)/2.
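The urn updating in Example 1.76 is easy to reproduce with exact rational arithmetic. The sketch below uses the uniform prior p_m = 1/(N+1) (with the arbitrary choice N = 10), recovers Pr(X_1 = 1) = 1/2, and confirms the stated posterior p*_{m*} = 2(m*+1)/[N(N+1)] after seeing one success.

```python
from fractions import Fraction

N = 10                                        # urn size (arbitrary)
p = [Fraction(1, N + 1)] * (N + 1)            # uniform prior on M = 0, ..., N

# Pr(X_1 = 1) = sum over m of (m/N) p_m = 1/2
pr_one = sum(Fraction(m, N) * p[m] for m in range(N + 1))
assert pr_one == Fraction(1, 2)

# posterior on M* = M - 1 after X_1 = 1: p*_{m*} = 2(m*+1)/(N(N+1))
post1 = [Fraction(m_star + 1, N) * p[m_star + 1] / pr_one
         for m_star in range(N)]
assert post1 == [Fraction(2 * (m_star + 1), N * (N + 1)) for m_star in range(N)]
assert sum(post1) == 1                        # it is a probability distribution
print("urn updating checks passed")
```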

We can also prove an approximation theorem that applies to the cal-


culation of conditional probabilities. The theorem says that if we pretend
that Xi are conditionally lID given P N, and the probability of repeats is 0,
then the conditional probabilities we calculate for future observations are
uniformly close to the correct conditional probabilities.
Theorem 1.77. Suppose that X = (Xl, ... , X N) are exchangeable and
that Pr(Xi = X j ) = 0 for i I- j. Let P N be the empirical distribution of

28The readers should convince themselves that P;'" is indeed equal to Pr(M" =
m"IY" = eO).
44 Chapter 1. Probability Models

X. Let Y_1, …, Y_N be conditionally IID with distribution P given P_N = P. Then, for n ≤ N − k, and B ∈ B^n,

$$\bigl|\Pr((X_{k+1},\ldots,X_{k+n}) \in B \mid X_1 = x_1,\ldots,X_k = x_k) - \Pr((Y_{k+1},\ldots,Y_{k+n}) \in B \mid Y_1 = x_1,\ldots,Y_k = x_k)\bigr| \le \frac{n(n-1)}{N-k} + 1 - \left(1-\frac{k}{N}\right)^{n}, \quad \text{a.s.}$$

with respect to the distribution of X.


PROOF. First, we prove that the conditional distribution of X given Y_1, …, Y_k is the same as that of X given X_1, …, X_k. Call the latter Q_{X|X_1,…,X_k}. Let M_n and H_n be as in Lemma 1.67. Let B ∈ B^k be such that all points in B have k distinct coordinates. Then

$$M_k(B|x) = \frac{k!\binom{N}{k}}{N^k}\,H_k(B|x)$$

for all x with distinct coordinates. For each such B, Lemma 1.67 says that

$$\Pr((Y_1,\ldots,Y_k) \in B,\ X \in C) = \frac{k!\binom{N}{k}}{N^k}\,\Pr((X_1,\ldots,X_k) \in B,\ X \in C).$$

In particular, if C is the set of all x with distinct coordinates, then Pr(X ∈ C) = 1 and

$$\Pr((Y_1,\ldots,Y_k) \in B) = \frac{k!\binom{N}{k}}{N^k}\,\Pr((X_1,\ldots,X_k) \in B).$$

Let Q_{X_1,…,X_k} and Q_{Y_1,…,Y_k} stand for the joint distributions of (X_1, …, X_k) and (Y_1, …, Y_k), respectively. It follows that for all integrable functions f and all B ∈ B^k such that all points have distinct coordinates:

$$\int_B f(x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k) \tag{1.78}$$
$$= \frac{k!\binom{N}{k}}{N^k}\int_B f(x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k). \tag{1.79}$$

From the definition of conditional distribution, we have

$$\Pr((X_1,\ldots,X_k) \in B,\ X \in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{X_1,\ldots,X_k}(x_1,\ldots,x_k).$$

If we set f(x_1, …, x_k) = Q_{X|X_1,…,X_k}(C|x_1, …, x_k) in (1.78), we get

$$\Pr((Y_1,\ldots,Y_k) \in B,\ X \in C) = \int_B Q_{X|X_1,\ldots,X_k}(C|x_1,\ldots,x_k)\,dQ_{Y_1,\ldots,Y_k}(x_1,\ldots,x_k),$$

and the two conditional distributions are indeed the same for (x_1, …, x_k) vectors with distinct coordinates. Since such vectors have probability 1 under the distribution of X, the two conditional distributions are the same a.s.
Now, we apply Lemma 1.67 to both the conditional distributions given X_1, …, X_k and given Y_1, …, Y_k. Let B ∈ B^n have all distinct coordinates, and let C be the set of all x ∈ X^N with distinct coordinates. Then, we get

$$\Pr((X_{k+1},\ldots,X_{k+n}) \in B \mid X_1 = x_1,\ldots,X_k = x_k) = \int_C H_n(B|x_{k+1},\ldots,x_N)\,dQ_{X|X_1,\ldots,X_k}(x|x_1,\ldots,x_k),$$

$$\Pr((Y_{k+1},\ldots,Y_{k+n}) \in B \mid Y_1 = x_1,\ldots,Y_k = x_k) = \int_C M_n(B|x)\,dQ_{X|X_1,\ldots,X_k}(x|x_1,\ldots,x_k).$$

Lemma 1.69 says that

$$\bigl|H_n(B|x_{k+1},\ldots,x_N) - M_n(B|x_{k+1},\ldots,x_N)\bigr| \le \frac{n(n-1)}{N-k}.$$
We must now bound the difference |M_n(B|x_{k+1}, …, x_N) − M_n(B|x)|. As in the proof of Lemma 1.69, one of these probabilities is constant on {x_1, …, x_N}^n and the other is constant on a subset and is 0 elsewhere. The sets B with the largest difference will be the set A where the second probability is positive and its complement. Since M_n(A|x_{k+1}, …, x_N) = 1 and M_n(A|x) = (1 − k/N)^n, we get

$$\bigl|M_n(B|x) - M_n(B|x_{k+1},\ldots,x_N)\bigr| \le 1 - \left(1 - \frac{k}{N}\right)^{n}.$$

The conclusion to the theorem now follows. □


Example 1.80. If N = 1,000,000, n = 100, and k = 100, then the bound in Theorem 1.77 is 0.0199, or about 2%. On the other hand, if N = 1,000,000, n = 1000, and k = 1000, then the bound is a useless 1.632.
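The arithmetic of Example 1.80 can be reproduced directly; the helper below is our own sketch, using our reading of the bound in Theorem 1.77 as n(n−1)/(N−k) + 1 − (1 − k/N)^n.

```python
# Evaluating the bound of Theorem 1.77 at the values of Example 1.80
# (a sketch; the helper function and the closed form are our reading).
def theorem_1_77_bound(N, n, k):
    return n * (n - 1) / (N - k) + 1.0 - (1.0 - k / N) ** n

b_small = theorem_1_77_bound(1_000_000, 100, 100)
b_large = theorem_1_77_bound(1_000_000, 1000, 1000)
assert abs(b_small - 0.0199) < 1e-3   # about 2%
assert abs(b_large - 1.632) < 1e-2    # exceeds 1, hence useless
```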

1.5.4 The General Infinite Case


1.5.4.1 Approximation by the Finite Case*
Theorem 1.70 says that the probabilities concerning n random quantities calculated under the finite exchangeable distribution of X_1, …, X_N are uniformly approximated by those calculated under a conditionally IID distribution. As N → ∞, one would expect that the joint distribution of




X_1, …, X_N would actually become that of conditionally IID random quantities. In examining the statement of Theorem 1.70, we note that the first term inside the absolute value in (1.71) does not depend on N, but μ_N clearly does depend on N. If we could show that there exists μ_P and a subsequence {N_i}_{i=1}^∞ such that, for all n and all B ∈ B^n,

$$\lim_{i\to\infty}\int P^n(B)\,d\mu_{N_i}(P) = \int P^n(B)\,d\mu_P(P), \tag{1.81}$$

then we would have a representation theorem for infinite exchangeable sequences.29 We will prove that this indeed is true in this section. In fact, (1.81) would follow from the continuous mapping theorem B.88 if we could prove that P^n(B) is a bounded continuous function of P and that P_{N_i} converges in distribution to a random quantity with distribution μ_P.30 In the case of an infinite sequence of exchangeable Bernoulli random variables, we could easily prove these facts.31
Example 1.82. Let {X_n}_{n=1}^∞ be a sequence of exchangeable Bernoulli random variables. Then P_N is nothing more than the proportion of successes, X̄_N, in the first N trials. (That is, P_N({1}) = X̄_N and P_N({0}) = 1 − X̄_N.) The strong law of large numbers 1.59 or 1.62 will say that a subsequence X̄_{n_k} converges a.s. (hence in distribution by Theorem B.90) to something, call it Θ. Then P({1}) = Θ, and P^n(B) is a bounded continuous function of Θ (P^n(B) = ∑_{y∈B} Θ^{t(y)}(1 − Θ)^{n−t(y)}, where t(y) = ∑_{i=1}^n y_i). It follows that {X_n}_{n=1}^∞ is a sequence of exchangeable Bernoulli random variables if and only if there exists a distribution μ_Θ such that for all n, all distinct i_1, …, i_n, and all x_1, …, x_n ∈ {0, 1},

$$\Pr(X_{i_1} = x_1, \ldots, X_{i_n} = x_n) = \int_0^1 \theta^k(1-\theta)^{n-k}\,d\mu_\Theta(\theta),$$

where k = ∑_{i=1}^n x_i. This is the representation portion of Theorem 1.47.
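The representation can be illustrated numerically. In the sketch below (our own, not from the text), μ_Θ is taken to be uniform on [0, 1], in which case the mixture integral gives k!(n − k)!/(n + 1)! for every 0-1 pattern with k ones, so the probability depends on the pattern only through k, as exchangeability requires.

```python
# A numerical illustration of the representation with mu_Theta uniform on
# [0, 1] (our own sketch): the mixture probability of any 0-1 pattern with
# k ones among n trials is k!(n-k)!/(n+1)!, independent of the ordering.
from math import factorial
from itertools import product

def pattern_prob(xs, grid=20_000):
    # midpoint-rule approximation of the integral of theta^k (1-theta)^(n-k)
    n, k = len(xs), sum(xs)
    h = 1.0 / grid
    return h * sum(((i + 0.5) * h) ** k * (1.0 - (i + 0.5) * h) ** (n - k)
                   for i in range(grid))

n = 4
for xs in product([0, 1], repeat=n):
    k = sum(xs)
    exact = factorial(k) * factorial(n - k) / factorial(n + 1)
    assert abs(pattern_prob(xs) - exact) < 1e-6
```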

In general, if we wish to show that the X_i are actually conditionally IID given some random probability measure P, we will need to prove more than

29Since the first term on the left-hand side of (1.71) does not depend on N, the limit in (1.81) must be the same for all convergent subsequences.
30Diaconis and Freedman (1980a) offer a sketch of an abstract proof showing directly that P_N converges in distribution for general random quantities. We will not actually prove that P_N converges in distribution to P. Rather, we prove that the finite-dimensional joint distributions of P_N converge to those of P. Billingsley (1968), which contains an in-depth discussion of convergence in distribution, shows that convergence in distribution requires a condition called tightness in addition to convergence of finite-dimensional joint distributions. The work of the tightness condition is done in that part of the proof of Theorem 1.49 in which we prove equation (1.84). An alternative proof is given by Aldous (1985, Section 7). A very general theorem is proven by Hewitt and Savage (1955).
31Heath and Sudderth (1976) give an alternative proof in the Bernoulli case which relies on a different subsequence argument.

just (1.81). We will need to show that for every measurable subset C of P,

$$\Pr((X_{i_1},\ldots,X_{i_n}) \in B,\ P \in C) = \int_C P^n(B)\,d\mu_P(P). \tag{1.83}$$

In effect, we need to prove that P exists and that

$$P_N(B)I_C(P_N) \xrightarrow{\;D\;} P(B)I_C(P)$$

for all B and C.

The distribution μ_N in Theorem 1.70 was seen to be the distribution of the empirical probability measure P_N of X = (X_1, …, X_N). If (1.83) is going to hold, it would stand to reason that μ_P would equal the distribution of the limit of P_N as N → ∞, if the limit exists. That this limit exists will follow from the strong law of large numbers 1.59 or 1.62. We will use this fact to prove that (1.83) holds.

1.5.4.2 Proof of Theorem 1.49


The proof of Theorem 1.49 is a bit complicated, so a brief outline will be given first. We use the strong law of large numbers to conclude that (at least a subsequence of) the empirical probability measures at each set B, {P_n(B)}_{n=1}^∞, converges to something, which we call P(B). To show that P(·) is a random probability measure, we show that P(B) = Pr(X_1 ∈ B|A_∞) for some σ-field A_∞. Since X is a Borel space, there is a regular conditional distribution of which P(B) is a version for all B. It is easy to show that P : S → P is a measurable function. The same calculation that lets us prove that P(B) = Pr(X_1 ∈ B|A_∞), namely equation (1.84) in the proof, also leads to the conclusion that the X_i are conditionally IID.

PROOF OF THEOREM 1.49. The "if" direction is straightforward, and its proof is left to the reader. (See Problem 4 on page 73.) For the "only if" direction, assume that {X_n}_{n=1}^∞ are exchangeable. Let P_n be the empirical distribution of X_1, …, X_n. For each B ∈ B, lim_{n→∞} P_{s_n}(B) = P(B) exists a.s. along a subsequence {s_n}, as either Theorem 1.59 or 1.62 says.

For each B ∈ B and each integer i, P(B) = lim_{n→∞} ∑_{m=i}^{s_n} I_B(X_m)/s_n, a.s. It follows that for each i, P(B) is measurable with respect to the σ-field generated by {X_n}_{n=i}^∞. Hence, it is measurable with respect to the intersection (over i) of all of these σ-fields, which is the tail σ-field of {X_n}_{n=1}^∞. Call the tail σ-field A_∞.
We next prove that for every k, all distinct i_1, …, i_k, all B_1, …, B_k ∈ B, and every C ∈ A_∞,

$$\Pr(\{X_{i_1} \in B_1, \ldots, X_{i_k} \in B_k\} \cap C) = \int_C \prod_{j=1}^{k} P(B_j)(s)\,d\mu(s). \tag{1.84}$$

To do this, let Z_n = I_C ∏_{j=1}^k P_{s_n}(B_j) and Z = I_C ∏_{j=1}^k P(B_j). Note that Z_n → Z, a.s. as n → ∞; hence Z_n converges in distribution to Z by Theorem B.90. Since Z_n

is uniformly bounded, E(Z_n) → E(Z). Since P(B_j) is measurable with respect to A_∞, the integral on the right-hand side of (1.84) is E(Z). All that remains to the proof of (1.84) is to show that E(Z_n) converges to the left-hand side of (1.84). To do this, let m = s_n, and write

$$Z_n = \frac{1}{m^k}\sum_{j_1=1}^{m}\cdots\sum_{j_k=1}^{m} I_C \prod_{t=1}^{k} I_{B_t}(X_{j_t}) = \frac{1}{m^k}\Biggl(\sum_{\text{distinct}} I_C \prod_{t=1}^{k} I_{B_t}(X_{j_t}) + \sum_{\text{repeats}} I_C \prod_{t=1}^{k} I_{B_t}(X_{j_t})\Biggr),$$

where the first sum runs over k-tuples (j_1, …, j_k) of distinct indices and the second over the remaining tuples. In the last expression above, the first sum has m!/(m − k)! terms, each of which has mean equal to the left-hand side of (1.84). Since m!/[(m − k)!m^k] converges to 1, the mean of 1/m^k times this first sum converges to the left-hand side of (1.84). The second sum has m^k − m!/(m − k)! terms, each of which is bounded between 0 and 1. Since 1 − m!/[(m − k)!m^k] converges to 0, so does the mean of the second sum. This completes the proof of (1.84).

Apply (1.84) with k = 1, i_1 = 1, and B_1 = B to get that E(I_B(X_1)I_C) = E(P(B)I_C). This means that P(B) is a version of Pr(X_1 ∈ B|A_∞) for every B. Since (X, B) is a Borel space, this can be assumed to be part of a regular conditional distribution, and we can assume that P(B) = Pr(X_1 ∈ B|A_∞).
In this way P becomes a random probability measure so long as we can prove that it is a measurable function from (S, A) to (P, C_P). The σ-field C_P was set up so that P is measurable if and only if P(B) is measurable for all B. Since A_∞ ⊆ A, P(B) is measurable for all B. Also, since Pr(X_1 ∈ B|A_∞) = P(B) is a function of P for each B and since P is A_∞ measurable, it follows from Theorem B.73 that Pr(X_1 ∈ B|P) = P(B).
Now, let μ_P denote the distribution of P. To prove that the X_i are conditionally independent given P = P with distribution P, apply (1.84) with C = {P ∈ A} for arbitrary A ∈ C_P. The result is

$$\Pr(X_{i_1} \in B_1, \ldots, X_{i_k} \in B_k,\ P \in A) = \int_{\{P\in A\}} \prod_{j=1}^{k} P(B_j)(s)\,d\mu(s) = \int_A \prod_{j=1}^{k} P(B_j)\,d\mu_P(P) = \int_A \prod_{j=1}^{k} \Pr(X_{i_j} \in B_j \mid P = P)\,d\mu_P(P),$$

where the first equation is immediate from (1.84), the second follows from Theorem A.81, and the third follows from the fact that Pr(X_1 ∈ B|P = P) = P(B). This completes the proof of conditional independence given P.
For the uniqueness, suppose that μ_1 and μ_2 are possible distributions for a random probability measure P such that the X_i are conditionally IID with distribution P given P = P. We will prove that the finite-dimensional

distributions of μ_1 and μ_2 agree, and then Theorem B.131 says that μ_1 = μ_2. Let B_1, …, B_n ∈ B, and let k_1, …, k_n be positive integers. We have already proven that

$$\Pr(X_1, \ldots, X_{k_1} \in B_1, \ldots, X_{k_1+\cdots+k_{n-1}+1}, \ldots, X_{k_1+\cdots+k_n} \in B_n) = \int \prod_{i=1}^{n} P(B_i)^{k_i}\,d\mu_1(P) = \int \prod_{i=1}^{n} P(B_i)^{k_i}\,d\mu_2(P).$$

Hence, the means of all polynomial functions of (P(B_1), …, P(B_n)) are the same according to μ_1 and μ_2. By the Stone-Weierstrass theorem C.3 the means of all bounded continuous functions of finitely many P(B) values are determined by the means of polynomial functions. Hence, the means of all bounded continuous functions of (P(B_1), …, P(B_n)) are the same according to μ_1 and μ_2. Corollary B.107 says that μ_1 and μ_2 give the same joint distribution to (P(B_1), …, P(B_n)), and the proof of uniqueness is complete.

The convergence claim follows from Theorem 1.62 or from Lemma 1.61 and the fact that the bounded random variables I_B(X_i) are conditionally IID. □

1.5.5 Formal Introduction to Parametric Models*


The infinite version of DeFinetti's representation theorem 1.49 says that if an infinite sequence of random quantities is exchangeable, then specifying the joint distribution of all of them can be done by specifying a distribution for the limit of the empirical probability measures. Every probability measure is a limit of empirical probability measures, and so the space P, on which the distribution must be placed, is quite large. There are (at least) two problems involved in specifying a distribution over P:

1. How do you perform the general integration ∫ h(P) dμ_P(P)?
2. How do you get the conditional distribution of P given data?32

These two problems are related, and the usual method of solving them is to say that μ_P assigns probability 1 to a relatively small subset of P. In Example 1.53 on page 29, we saw a case in which μ_P assigned all of its probability to the set of exponential distributions. An alternative is to assign all probability to normal distributions or t distributions, and so forth.



32We have not tried to prove that (P, C_P) is a Borel space. However, since (X^∞, B^∞) is a Borel space and P : X^∞ → P is measurable, regular conditional distributions exist on X^∞ and they induce regular conditional distributions on P.

These cases all have something in common, namely that the set of distributions is finitely parameterized. That is, there exists a one-to-one mapping between the set of distributions and a subset of a finite-dimensional Euclidean space. In the normal example, the mapping associates the N(m, s²) distribution with (m, s) ∈ ℝ × ℝ⁺. With such a parameter mapping, we can switch the problem of integration over subsets of P to integration over Euclidean space. The problem of finding conditional distributions is resolved the same way. The conditional distribution in Euclidean space induces the appropriate conditional distribution in P. (See Theorem B.28 on page 617.) There are cases (see Sections 1.6.1 and 1.6.2) in which we want the range of the parameter mapping to be an infinite-dimensional space. In such cases, we will need to develop special methods for calculating integrals.
Now, let P_0 be a subset of P and let Θ′ : P_0 → Ω be a bimeasurable function, where Ω is a set with σ-field τ. The σ-field of subsets of P which we need to consider is C_0 = {A ∩ P_0 : A ∈ C_P}. Let μ_P be a probability measure on (P_0, C_0) and let μ_Θ be the probability on (Ω, τ) induced by Θ′ from μ_P. That is, for each A ∈ C_0, μ_P(A) = μ_Θ(Θ′(A)), and for each B ∈ τ, μ_Θ(B) = μ_P(Θ′⁻¹(B)). To integrate a measurable function h : P_0 → ℝ, we note that

$$\int h(P)\,d\mu_P(P) = \int h(\Theta'^{-1}(\theta))\,d\mu_\Theta(\theta),$$

where θ is used to stand for an arbitrary element of Ω and P = Θ′⁻¹(θ) ∈ P_0.

For example, if X = ℝ and P is the N(m, s²) distribution, and Θ′(P) = (m, s), then for θ = (m, s), Θ′⁻¹(θ) = P. As another example, let X = {0, 1, 2}. In this case P is already a finite-dimensional set. We can let

$$\Omega = \{(p_0, p_1, p_2) : p_i \ge 0 \text{ for all } i,\ p_0 + p_1 + p_2 = 1\}.$$

If P is the distribution that says P({i}) = p_i for i = 0, 1, 2 with p_0 + p_1 + p_2 = 1 and p_i ≥ 0 for all i, then we can let Θ′(P) = (p_0, p_1, p_2). In this case P_0 = P.
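The change-of-variables formula above can be exercised with a small Monte Carlo computation. The sketch below is our own illustration: h(P) = P((−∞, 0]) on the normal family, with an arbitrary illustrative choice of prior μ_Θ on (m, s).

```python
# A sketch of computing the integral of h(P) = P((-inf, 0]) over the
# normal family by integrating over the parameter (m, s) instead.
# The prior mu_Theta used here is an arbitrary illustrative choice.
import random
from statistics import NormalDist

random.seed(0)

def h_at_theta(m, s):
    # h(Theta'^{-1}(m, s)): the N(m, s^2) probability of (-inf, 0]
    return NormalDist(m, s).cdf(0.0)

# mu_Theta: m ~ N(0, 1) and s ~ Uniform(0.5, 2), independent
n_draws = 100_000
est = sum(h_at_theta(random.gauss(0.0, 1.0), random.uniform(0.5, 2.0))
          for _ in range(n_draws)) / n_draws
# By the symmetry of the prior on m, the exact answer is 1/2.
assert abs(est - 0.5) < 0.02
```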
In general, we can make the above discussion precise as follows. Let (S, A, μ) be a probability space and let (X, B) be a Borel space. Suppose that {X_n}_{n=1}^∞ is an infinite sequence of exchangeable random quantities, X_n : S → X. Let X^n and X^∞ be finite and infinite products of X with σ-fields B^n and B^∞, respectively. Let X^∞ denote the function mapping S to X^∞ by X^∞(s) = (X_1(s), X_2(s), …). Similarly, let X^n : S → X^n be X^n(s) = (X_1(s), …, X_n(s)). Let P : X^∞ → P denote the "limit of empirical probabilities" function. Next we introduce a general definition.

Definition 1.85. A bimeasurable mapping Θ′ from a subset P_0 of P to a subset Ω of some Borel space with σ-field τ is called a parametric index. The parametric index is denoted by Θ′ : P_0 → Ω. The set Ω is called the parameter space, and the set P_0 is called the parametric family.

Let Θ′ : P_0 → Ω denote a parametric index. We have constructed the following sequence of functions:

$$S \xrightarrow{\ X^\infty\ } X^\infty \xrightarrow{\ P\ } \mathcal{P} \xrightarrow{\ \Theta'\ } \Omega.$$

Let the function Θ : S → Ω be defined by Θ(s) = Θ′(P(X^∞(s))). (Note that the value of Θ is the same as the value of Θ′, hence we will often find it convenient to use the symbol Θ to refer to both Θ and Θ′.) We call Θ the parameter. Let μ_Θ be the probability induced on (Ω, τ) by Θ from μ. Let A_X be the sub-σ-field of A generated by X^∞. Since (X^∞, B^∞) is also a Borel space, regular conditional distributions given Θ exist. For each A ∈ A_X, let P′_θ(A) = Pr(A|Θ)(s) for all s such that Θ(s) = θ. For each B ∈ B^∞, let P_θ(B) = P′_θ(X^{∞−1}(B)). In words, {P_θ : θ ∈ Ω} specifies the conditional distribution of X^∞ given Θ.
Example 1.86. Let X = ℝ and let P_0 be the set of all normal distributions. Assume that μ_P assigns probability 1 to the set P_0. We can let Θ′(P) be the vector consisting of the mean and standard deviation of the normal distribution P. Then Θ(s) is the vector consisting of the mean and standard deviation of the limit of the empirical distribution of a sequence {X_n}_{n=1}^∞ of exchangeable random variables. By the strong law of large numbers 1.63 and the fact that the X_n are conditionally IID given P, Θ(s) is also the limit (a.s.) of the sample average X̄_n = ∑_{i=1}^n X_i/n and the sample standard deviation √(∑_{i=1}^n (X_i − X̄_n)²/(n − 1)) of the data sequence. If θ = (μ, σ), then P_θ is the distribution that says that {X_n}_{n=1}^∞ are IID N(μ, σ²) random variables. The notation P′_θ stands for the probability measure on (S, A_X) defined by P′_θ(X^{∞−1}(B)) = P_θ(B) for B ∈ B^∞.
The probability measures P_θ for θ ∈ Ω are on the space (X^∞, B^∞). They induce probabilities on all of the spaces (X^n, B^n), for n = 1, 2, … via the obvious projections. It will prove convenient to refer to all of these induced probabilities by the same name, P_θ. That is, if A ∈ B^n, let P_θ(A) denote P_θ(A × X × X × ⋯). This will be very convenient without causing any confusion. If it becomes important to know over which space, (X^n, B^n) or (X^∞, B^∞), P_θ is defined, we will be explicit.
Sometimes the parameter Θ can be expressed as a meaningful function of the distribution P, say H(P), which is also defined for distributions outside of the parametric family. For example, H(P) = ∫ x dP(x), the mean of the distribution, is defined for every distribution with finite mean whether or not that distribution is a member of a parametric family of interest. When this occurs, it may be that H is continuous in the sense that H(P_n) → H(P) if lim_{n→∞} P_n = P. The distribution of Θ can then be considered as an approximation to the distribution of H(P_n), where P_n is the empirical probability measure of the first n observations.
Example 1.87 (Continuation of Example 1.53; see page 29). In the Exp(θ) distribution, θ is one over the mean of the distribution. So, H(P) = (∫ x dP(x))⁻¹ and H(P_n) = (∑_{i=1}^n X_i/n)⁻¹ = 1/X̄_n. Indeed, 1/X̄_n converges to Θ, so that we can take

the distribution of Θ to be an approximation to the distribution of 1/X̄_n. (See Problem 21 on page 76.)
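A quick simulation makes the point of Example 1.87 concrete; the sketch below (our own, with an arbitrary choice of θ) checks that H(P_n) = 1/X̄_n settles near θ for IID Exp(θ) data.

```python
# A quick simulation of Example 1.87 (our own sketch): for IID Exp(theta)
# data, H(P_n) = 1/X-bar_n settles near theta as n grows.
import random
random.seed(1)

theta = 2.5
n = 200_000
total = sum(random.expovariate(theta) for _ in range(n))
h_pn = n / total                     # 1 / X-bar_n
assert abs(h_pn - theta) < 0.05
```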
Example 1.88 (Continuation of Example 1.38; see page 18). The marginal posterior for M can be calculated by integrating σ out of (1.25) (after changing the 0 subscripts to 1), or by using the fact that M is the limit as m goes to ∞ of Y_m, so the distribution of M is the limit of the distributions of the Y_m. Since the t_{a_1}(μ_1, √([1/m + 1/λ_1]b_1/a_1)) densities converge to the t_{a_1}(μ_1, √(b_1/[a_1λ_1])) density as m → ∞, Scheffé's theorem B.79 says that this limit is the distribution of M.

1.6 Infinite-Dimensional Parameters*


An alternative to the use of finite-dimensional parameter spaces is to attempt to place a probability distribution on an infinite-dimensional space P. It is common to call such models nonparametric. We will consider two types of probability measures on infinite-dimensional parameter spaces, Dirichlet processes and tailfree processes.

1.6.1 Dirichlet Processes


Ferguson (1973) gives a probability measure on an infinite-dimensional space P for which certain calculations are simple. We can think of P as a stochastic process as in Section B.5 with index set B, a σ-field of subsets of X. To specify a distribution for P, we need to specify the joint distribution of (P(B_1), …, P(B_n)) for all n and all B_1, …, B_n ∈ B in such a way that the distributions are consistent according to Definition B.132. One way to do this is as follows. Let α be a finite measure on (X, B). For each integer n > 0 and partition B_1, …, B_n of X, define the joint distribution of (P(B_1), …, P(B_n)) to be the Dirichlet distribution Dir_n(α_1, …, α_n), where α_i = α(B_i) for i = 1, …, n. This is a distribution for a vector (Y_1, …, Y_n) such that ∑_{i=1}^n Y_i = 1 and such that (Y_1, …, Y_{n−1}) have joint density

$$\frac{\Gamma(\alpha_1+\cdots+\alpha_n)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_n)}\, y_1^{\alpha_1-1}\cdots y_{n-1}^{\alpha_{n-1}-1}\,(1-y_1-\cdots-y_{n-1})^{\alpha_n-1}.$$

The Dirichlet distribution is a multivariate generalization of the Beta distribution. To avoid having to deal separately with the cases in which some of the sets B_i have α(B_i) = 0, we will extend the definition of the Dirichlet distribution to allow some of the α_i to be 0. In this case, those coordinates corresponding to α_i = 0 are equal to 0 with probability 1 and the rest of the coordinates have the usual Dirichlet distribution.
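A Dirichlet vector can be simulated by normalizing independent Gamma variables, and merging cells adds the corresponding α's, which is the aggregation property behind the consistency argument proved next. The sketch below is our own Monte Carlo check of the means.

```python
# A sketch of Dir_n(a_1,...,a_n) via normalized Gamma variables, plus a
# Monte Carlo check that merging cells sums the corresponding alpha's:
# under Dir(1,2,3,4), Y_1 + Y_2 has the same mean as Z_1 under Dir(3,7).
import random
random.seed(2)

def dirichlet(alphas):
    gs = [random.gammavariate(a, 1.0) if a > 0 else 0.0 for a in alphas]
    total = sum(gs)
    return [g / total for g in gs]

reps = 50_000
merged = sum(sum(dirichlet([1.0, 2.0, 3.0, 4.0])[:2]) for _ in range(reps)) / reps
direct = sum(dirichlet([3.0, 7.0])[0] for _ in range(reps)) / reps
assert abs(merged - 0.3) < 0.01      # E[Y_1 + Y_2] = (1+2)/10
assert abs(direct - 0.3) < 0.01      # E[Z_1] = 3/10
```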
We prove next that this specification of the distribution of P is consistent.




Theorem 1.89. Let P be a random probability measure on a Borel space (X, B), and let B_1, …, B_n ∈ B partition X. Let α be a finite measure on (X, B) with α(X) > 0, and let α_i = α(B_i) for all i. To say that

$$(P(B_1), \ldots, P(B_n)) \sim \mathrm{Dir}_n(\alpha_1, \ldots, \alpha_n)$$

specifies a consistent set of distributions in the sense of Definition B.132.

PROOF. Let A_1, …, A_p be elements of B. Set B_1, …, B_n equal to the partition consisting of the constituents of A_1, …, A_p. That is, each B_i equals one of the 2^p sets D_1 ∩ ⋯ ∩ D_p, where each D_i is either A_i or A_i^c for i = 1, …, p (e.g., A_1 ∩ A_2^c ∩ A_3 ∩ ⋯ ∩ A_p^c). Let G_i = P(A_i) for i = 1, …, p. We need to show that for each i = 1, …, p and each set of numbers t_1, …, t_{i−1}, t_{i+1}, …, t_p,

$$\lim_{t_i\to\infty} F_{G_1,\ldots,G_p}(t_1,\ldots,t_p) = F_{G_1,\ldots,G_{i-1},G_{i+1},\ldots,G_p}(t_1,\ldots,t_{i-1},t_{i+1},\ldots,t_p). \tag{1.90}$$
Let Y_j = P(B_j) for j = 1, …, n. Let C_i = {j : B_j ⊆ A_i}. Then, the expression whose limit is being taken on the left-hand side of (1.90) is

$$\Pr\Bigl(\sum_{j\in C_i} Y_j \le t_i,\ \text{for } i = 1,\ldots,p\Bigr) = \Pr((Y_1,\ldots,Y_n) \in G(t)), \tag{1.91}$$

where

$$G(t) = \Bigl\{(y_1,\ldots,y_n) : \sum_{j\in C_i} y_j \le t_i,\ \text{for } i = 1,\ldots,p\Bigr\}.$$

Now, fix an i, and let B′_1, …, B′_m be the constituents of {A_j : j ≠ i}. Each B′_s is the union of two of the B_j. (For example, if i = 1 and p = 3, then (A_1 ∩ A_2^c ∩ A_3) ∪ (A_1^c ∩ A_2^c ∩ A_3) is one of the B′_s.) Let Z_s = P(B′_s) for s = 1, …, m. The proposed distribution implies that (Z_1, …, Z_m) has Dir_m(β_1, …, β_m) distribution, where β_s = α_{j_1} + α_{j_2} when B′_s = B_{j_1} ∪ B_{j_2}. For j ≠ i, let d_j = {s : B′_s ⊆ A_j}. The limit as t_i → ∞ of the expression in (1.91) is

$$\Pr\Bigl(\sum_{s\in d_j} Z_s \le t_j,\ \text{for } j \ne i\Bigr) = \Pr((Z_1,\ldots,Z_m) \in G'(t)), \tag{1.92}$$

where

$$G'(t) = \Bigl\{(z_1,\ldots,z_m) : \sum_{s\in d_j} z_s \le t_j,\ \text{for } j \ne i\Bigr\}.$$

It is easy to see that (1.92) is the same as the right-hand side of (1.90). □
Since the distributions specified are consistent, we can use them for the
distribution of P.

Definition 1.93. If α is a finite (not identically 0) measure on (X, B) and P is a random distribution such that, for each n and each partition {B_1, …, B_n} of X,

$$(P(B_1), \ldots, P(B_n)) \sim \mathrm{Dir}_n(\alpha_1, \ldots, \alpha_n),$$

where α_i = α(B_i) for i = 1, …, n, then we say that P has Dirichlet process distribution with base measure α, denoted by Dir(α).
The Dirichlet process is useful only if we can do the necessary calculations
for making inference. The most crucial is updating in the light of data.
Theorem 1.94. Suppose that {X_n}_{n=1}^∞ is a sequence of exchangeable random quantities, that they are conditionally independent with distribution P given P = P, and that P has Dir(α) distribution. Then the marginal distribution of each X_i is the probability measure α/α(X) and the conditional distribution of P given X_1 = x_1, …, X_k = x_k is Dir(β), where β is the measure defined by β(C) = α(C) + ∑_{i=1}^{k} I_C(x_i) for each C.
PROOF. First, we prove the claim about the marginal distribution of X_i. For B ∈ B,

$$\Pr(X_i \in B) = \mathrm{E}(\Pr(X_i \in B \mid P)) = \mathrm{E}(P(B)) = \frac{\alpha(B)}{\alpha(X)},$$

where the first equality follows from the law of total probability B.70, and the last follows from the fact that each coordinate of a Dirichlet distribution has a Beta distribution.
By the form of the purported posterior, it is clear that if we can prove the result for k = 1, we can extend it to arbitrary k by induction. Let μ_P denote the Dir(α) measure on the space of probability measures P. For arbitrary n and partition B_1, …, B_n, and arbitrary B ∈ B and t_1, …, t_n, let A = {P : P(B_i) ≤ t_i, for i = 1, …, n}. Assume that α(B) < α(X).33 Define, for i = 1, …, n, B_i^0 = B_i ∩ B and B_i^1 = B_i ∩ B^c. Then B_1^0, …, B_n^0, B_1^1, …, B_n^1 form a partition of at most 2n nonempty sets.34 In particular, we can write P(B) = ∑_{i=1}^n P(B_i^0). Define

$$A_B = \{(z_1, \ldots, z_{2n}) : z_i + z_{n+i} \le t_i,\ \text{for } i = 1,\ldots,n\},$$

and note that P ∈ A if and only if (P(B_1^0), …, P(B_n^0), P(B_1^1), …, P(B_n^1)) ∈ A_B. Let β_i = α(B_i^0) for i = 1, …, n, and let β_{n+i} = α(B_i^1) for i = 1, …, n. Let c = {i :

33If α(B) = α(X), it is trivial to prove that ∫_B μ_{P|X_1}(A|x) dμ_{X_1}(x) = Pr(P ∈ A, X_1 ∈ B), where μ_{P|X_1} is the conditional distribution of P given X_1 to be defined later in this proof, and μ_{X_1} is the marginal distribution of X_1 already determined.
34Recall the extended definition of the Dirichlet distribution in which α_i = 0 means that the ith coordinate is 0 with probability 1.

(3i =f:. O}, and let k be the highest number in c. Let c' = {i : (3i =f:. O} \ {k}.
If (3j = 0, let Zj = 0 in the following equations. Then we can write
PreP E A,X l E B)
Pr(P(Bl ) ~ tl,., P(Bn ) ~ tn, Xl E B)
= i P(B)dj.Lp(P) (1.95)

1 tz
AB j=1 J
r(a(X)) IIZt3.-l(l_LZ)t3k-lIIdZ'
TIiEC f((3i) iEc' t iEc' t iEc' t

LA 1 r(a(X) +1) II zt3f-l


n ( 1- L Z f3i- II dz
)
l

j=l a(X) AB IliEc f(3f) iEc' t iEc' t iEc' t,

where (3; = (3j + 1 and (3{ = (3i for i =f:. j.


For each x E X, let ax denote the measure defined by ax(C) = a(C) +
Ic(x) for each C E B. Let j.LPlx 1 ('Ix) denote the Dir(a x ) measure on P.
It is easy to see that for x E B, j.LPlx 1 ('Ix) says that the joint distribution
of {P(B{), for i = 1, ... ,n and j = 0, I} is Dir2n(3{, ... ,(3~n)' where j is
such that x E BJ. Hence, j.LPlx 1 (Alx) equals

It follows that

$$\int_B \mu_{P|X_1}(A|x)\,d\mu_{X_1}(x) = \frac{1}{\alpha(X)}\int_B \mu_{P|X_1}(A|x)\,d\alpha(x) = \sum_{j=1}^{n} \frac{\beta_j}{\alpha(X)} \int_{A_B} \frac{\Gamma(\alpha(X)+1)}{\prod_{i\in c}\Gamma(\beta_i^j)} \prod_{i\in c'} z_i^{\beta_i^j-1}\Bigl(1-\sum_{i\in c'} z_i\Bigr)^{\beta_k^j-1}\prod_{i\in c'} dz_i,$$

which is the same as the last expression in (1.95). That is,

$$\int_B \mu_{P|X_1}(A|x)\,d\mu_{X_1}(x) = \Pr(P \in A,\ X_1 \in B),$$

which is what it means to say that the conditional distribution of P given X_1 = x is Dir(α_x). □
By combining Theorems 1.94 and 1.89, we see that the posterior distribution found in Theorem 1.94 is a regular conditional distribution.

Example 1.96. If α is a continuous measure (i.e., every singleton has 0 measure), then the posterior measure β is a mixture of discrete and continuous parts. There is mass 1 at every observed data value, but no other values have positive measure.
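On a finite space, the conjugate update of Theorem 1.94 is elementary to carry out. The sketch below is our own illustration, representing α by its values on singletons of a three-point space.

```python
# A sketch of the conjugate update of Theorem 1.94 on the finite space
# X = {0, 1, 2}: beta adds a unit point mass at each observation.
alpha = {0: 1.0, 1: 2.0, 2: 1.0}     # base measure, alpha(X) = 4
data = [1, 1, 2, 0, 1]

beta = dict(alpha)
for x in data:
    beta[x] += 1.0                   # beta(C) = alpha(C) + sum_i I_C(x_i)

assert beta == {0: 2.0, 1: 5.0, 2: 2.0}
# Predictive probabilities are beta normalized by alpha(X) + k
predictive = {x: b / (4.0 + len(data)) for x, b in beta.items()}
assert abs(sum(predictive.values()) - 1.0) < 1e-12
```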

Ferguson (1973) and Blackwell (1973) prove that there is a set of discrete distributions P_0 ⊆ P such that the Dir(α) distribution assigns probability 1 to P_0. Sethuraman (1994) proves an alternative theorem, which not only shows that the Dirichlet process is a probability on discrete distributions, but also gives an algorithm for approximately simulating a CDF with Dir(α) distribution. The result of Sethuraman (1994) is that the set of points on which the Dir(α) distribution concentrates its mass is an infinite IID sample Y_1, Y_2, … from the probability α/α(X), and the probability assigned to Y_n is P_n, where P_1 = Q_1, and for n > 1, P_n = Q_n ∏_{i=1}^{n−1}(1 − Q_i), where the Q_i are IID with Beta(1, α(X)) distribution. What we prove here is a very simple theorem of Krasker and Pratt (1986) which implies that the Dir(α) distribution assigns probability 1 to a set of discrete distributions.
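Sethuraman's stick-breaking construction can be sketched in a few lines. The code below is our own illustration (the truncation tolerance and the N(0, 1) base sampler are arbitrary choices).

```python
# A sketch of Sethuraman's stick-breaking construction: atoms drawn IID
# from the base probability, weights P_1 = Q_1, P_n = Q_n * prod_{i<n}(1-Q_i),
# with the Q_i IID Beta(1, alpha(X)).  The loop truncates once the leftover
# stick is negligible.
import random
random.seed(3)

def stick_breaking(mass, base_sampler, tol=1e-8):
    atoms, weights, remaining = [], [], 1.0
    while remaining > tol:
        q = random.betavariate(1.0, mass)
        atoms.append(base_sampler())
        weights.append(remaining * q)
        remaining *= 1.0 - q
    return atoms, weights

atoms, weights = stick_breaking(2.0, lambda: random.gauss(0.0, 1.0))
assert abs(sum(weights) - 1.0) < 1e-6    # almost all mass is captured
assert len(atoms) == len(weights)
```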
Theorem 1.97. Let {X_n}_{n=1}^∞ be conditionally IID with distribution P given P = P. For n > 1, define

$$a_n = \Pr(X_n \text{ is distinct from } X_1, \ldots, X_{n-1}).$$

If lim_{n→∞} a_n = 0, then P is a discrete distribution with probability 1.

PROOF. Define

$$B_\epsilon = \{P : \exists A \in B \text{ such that } P(A) > \epsilon \text{ and } P(\{x\}) = 0 \text{ for all } x \in A\}.$$

It suffices to prove that Pr(B_ε) = 0 for all ε > 0. The conditional probability, given P = P and X_1, …, X_{n−1}, that X_n is distinct from X_1, …, X_{n−1} is at least ε for all P ∈ B_ε. It follows that

$$a_n = \mathrm{E}[\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1} \mid X_1,\ldots,X_{n-1}, P)] \ge \mathrm{E}[\Pr(X_n \text{ is distinct from } X_1,\ldots,X_{n-1} \mid X_1,\ldots,X_{n-1}, P)\, I_{B_\epsilon}(P)] \ge \epsilon \Pr(B_\epsilon).$$

Since lim_{n→∞} a_n = 0, Pr(B_ε) = 0 for all ε > 0 is necessary. □

For the Dir(α) distribution, it is easy to calculate a_n = α(X)/[α(X) + n − 1].
The posterior predictive distribution of a future observation is a weighted average of the prior measure α/α(X) and the empirical probability measure.

Proposition 1.98. Assume that {X_n}_{n=1}^∞ are conditionally IID with distribution P given P = P and that P has Dir(α) distribution. The posterior predictive distribution of a future X_i given X_1 = x_1, …, X_n = x_n is β/[α(X) + n], where β(C) = α(C) + ∑_{i=1}^{n} I_C(x_i).
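Applying Proposition 1.98 repeatedly yields the familiar Pólya-urn sampling scheme. The sketch below is our own illustration (the mass, base sampler, and sample size are arbitrary choices).

```python
# A sketch of generating X_1, X_2, ... by repeated use of Proposition 1.98
# (the Polya-urn scheme): a fresh draw from alpha/alpha(X) is taken with
# probability alpha(X)/(alpha(X) + i), otherwise a past value is repeated.
import random
random.seed(4)

def polya_urn(mass, base_sampler, n):
    xs = []
    for i in range(n):
        if random.random() < mass / (mass + i):
            xs.append(base_sampler())        # fresh draw from the base measure
        else:
            xs.append(random.choice(xs))     # repeat a past observation
    return xs

xs = polya_urn(1.0, lambda: random.gauss(0.0, 1.0), 1000)
# Repeats occur even though the base measure is continuous, reflecting
# the discreteness of P (Theorem 1.97).
assert len(xs) == 1000 and len(set(xs)) < len(xs)
```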
The predictive joint distribution of several future observations can be obtained by applying Proposition 1.98 several times, each time after conditioning on one more random variable. This gives a straightforward way to generate a sample whose conditional distribution is P, which itself has a Dirichlet process distribution. The joint distribution can also be described as follows.

Lemma 1.99.35 Assume that {X_n}_{n=1}^∞ are conditionally IID with distribution P given P = P and that P has Dir(α) distribution. Let n > 0. If p is a partition of {1, …, n}, let g(p) be the number of nonempty sets in p, and let k_1(p), …, k_{g(p)}(p) be the numbers of elements of the g(p) sets. (Note that ∑_{i=1}^{g(p)} k_i(p) = n for all p.) For each x ∈ X^n, let R(x) be the partition of {1, …, n} which matches x. (That is, x has g(R(x)) distinct coordinates, and for each set A in the partition R(x), those coordinates of x whose subscripts are in A are all equal to each other.) For each x ∈ X^n, define Z(x) ∈ X^{g(R(x))} to be the vector of distinct coordinates such that Z(x)_i is repeated k_i(R(x)) times in x. For each p and each subset B of X^n, define B_p to be that subset of X^{g(p)} which consists of the set of distinct coordinates of points in B ∩ R⁻¹(p). (That is, B_p = Z(B ∩ R⁻¹(p)).) Define the measure ν on X^n by

$$\nu(B) = \sum_{\text{all } p} \alpha^{g(p)}(B_p).$$

The joint distribution of X_1, …, X_n has the following density with respect to the measure ν:

$$f_X(x) = \prod_{i=1}^{n}(\alpha(X) + i - 1)^{-1} \sum_{\text{all } p} I_{R^{-1}(p)}(x) \prod_{i=1}^{g(p)} \prod_{j=2}^{k_i(p)} \bigl(\alpha(\{Z(x)_i\}) + j - 1\bigr),$$

where an empty product is taken to be 1.


PROOF. Let X = (X_1, …, X_n). We need to show that Pr(X ∈ B) = ∫_B f_X(x) dν(x) for all B ⊆ X^n. Let B ⊆ X^n. We will show that

$$\Pr(X \in B \cap R^{-1}(p)) = \int_{B\cap R^{-1}(p)} f_X(x)\,d\nu(x)$$

for every partition p, and the result will then follow by adding up finitely many terms. It is easy to see that ν(C) = α^{g(p)}(C_p) for each p and each subset C of R⁻¹(p), and that f_X is a function of Z. It follows that

$$\int_{B\cap R^{-1}(p)} f_X(x)\,d\nu(x) = \int_{B_p} f_X(Z^{-1}(z))\,d\alpha^{g(p)}(z) \tag{1.100}$$
$$= \prod_{i=1}^{n}(\alpha(X)+i-1)^{-1} \int_{B_p} \prod_{i=1}^{g(p)} \prod_{j=2}^{k_i(p)} \bigl(\alpha(\{z_i\}) + j - 1\bigr)\,d\alpha^{g(p)}(z).$$

35This lemma is used in Examples 1.102 and 1.103.



Fix p and write B_p = B_1 ∪ B_2, where every coordinate of every point in B_1
has 0 α measure. The points in B_2 have at least one coordinate with pos-
itive α measure. There are at most countably many values of y such that
α({y}) > 0; say they are y_1, y_2, .... For k = 1, ..., g(p), i_1, ..., i_k distinct el-
ements of {1, ..., g(p)}, and ℓ_1, ..., ℓ_k distinct integers, let B_{2;i_1,...,i_k;ℓ_1,...,ℓ_k}
be the subset of B_2 in which z_{i_t} = y_{ℓ_t} for t = 1, ..., k and all other coordi-
nates of z are distinct points with 0 α measure. These sets are disjoint, and
their union is B_2. On each of these sets, and on B_1, the integrand in the
far right-hand side of (1.100) is constant. Hence, the far right-hand side of
(1.100) can be written as

    ∏_{i=1}^{n} (α(X) + i − 1)^{-1} { α^{g(p)}(B_1) ∏_{i=1}^{g(p)} ∏_{j=2}^{k_i(p)} (j − 1)      (1.101)

        + Σ α^{g(p)}(B_{2;i_1,...,i_k;ℓ_1,...,ℓ_k}) ( ∏_{t=1}^{k} ∏_{j=2}^{k_{i_t}(p)} (α({y_{ℓ_t}}) + j − 1) )

        × ( ∏_{i ∉ {i_1,...,i_k}} ∏_{j=2}^{k_i(p)} (j − 1) ) },

where the sum runs over k, the i_t, and the ℓ_t.

Now, we will show that (1.101) is the probability that X ∈ B ∩ R^{-1}(p). Let
j_1, ..., j_{g(p)} be g(p) coordinates that are distinct for all x in R^{-1}(p). Let

    W = Z(X) = (X_{j_1}, ..., X_{j_{g(p)}}) ∈ B_p.

The first term in (1.101) is the probability that W ∈ B_1 and that the other
coordinates of X all match the coordinates they need to match in order for
R(X) = p. Also, each of the summands in the second term in (1.101) is the
probability that W ∈ B_{2;i_1,...,i_k;ℓ_1,...,ℓ_k} and that the other coordinates
of X all match the coordinates they need to match. The sum is then the
probability that X ∈ B ∩ R^{-1}(p). □
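The joint distribution in Lemma 1.99 also has a well-known sampling description, the Blackwell–MacQueen (Pólya urn) scheme: given X_1, ..., X_m, the next observation is a fresh draw from α/α(X) with probability α(X)/(α(X) + m), and it repeats a previously seen X_j with probability 1/(α(X) + m) for each j. The sketch below is our own illustration of that scheme (the function names and the choice of base measure are ours, not the text's); the ties it produces are exactly the partition R(x) of the lemma.

```python
# A sketch of sampling X_1, ..., X_n from the joint distribution in Lemma 1.99
# via the standard Polya-urn predictive rule for a Dirichlet process prior.
import random

def polya_urn_sample(n, alpha_mass, base_draw, rng):
    """Draw X_1, ..., X_n; alpha_mass = alpha(X), base_draw samples alpha/alpha(X)."""
    xs = []
    for m in range(n):
        if rng.random() < alpha_mass / (alpha_mass + m):
            xs.append(base_draw(rng))      # new value from the base measure
        else:
            xs.append(rng.choice(xs))      # repeat an old value, uniformly at random
    return xs

rng = random.Random(0)
xs = polya_urn_sample(50, 2.0, lambda r: r.uniform(-1, 1), rng)
print(len(set(xs)))   # number of distinct values g(R(x)); ties occur with positive probability
```

Even with a continuous base measure, repeated values appear through the urn; with α(X) = 2 and n = 50, the expected number of distinct values is Σ_{m=0}^{49} 2/(2 + m) ≈ 7.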

Example 1.102. As a simple example of Lemma 1.99, suppose that X = ℝ and
α is some finite continuous (no point masses) measure. The measure ν is then
the sum of the various k-dimensional product measures of α for k = 1, ..., n over
the sets where there are exactly k distinct coordinates. For example, if n = 3,
then the partitions are

    p_1 = {{1},{2},{3}},  p_2 = {{1,2},{3}},  p_3 = {{1},{2,3}},
    p_4 = {{1,3},{2}},    p_5 = {{1,2,3}}.

So, g(p_1) = 3, and k_i(p_1) = 1, while g(p_2) = g(p_3) = g(p_4) = 2, and so on. Also,

    R^{-1}(p_1) = {(x_1,x_2,x_3) : x_1 ≠ x_2, x_1 ≠ x_3, x_2 ≠ x_3},
    R^{-1}(p_2) = {(x_1,x_2,x_3) : x_1 = x_2, x_1 ≠ x_3},
    R^{-1}(p_3) = {(x_1,x_2,x_3) : x_2 = x_3, x_1 ≠ x_3},
    R^{-1}(p_4) = {(x_1,x_2,x_3) : x_1 = x_3, x_1 ≠ x_2},
    R^{-1}(p_5) = {(x_1,x_2,x_3) : x_1 = x_2 = x_3}.
1.6. Infinite-Dimensional Parameters 59

The measure ν is α³ on R^{-1}(p_1), plus α² on R^{-1}(p_2) ∪ R^{-1}(p_3) ∪ R^{-1}(p_4), plus α
on R^{-1}(p_5). Also,

    f_X(x) = 1 / (α(X)[α(X) + 1][α(X) + 2]) × { 2  if x ∈ R^{-1}(p_5),
                                                1  otherwise.

To calculate the probability that X is in the unit cube B, say, we must add up
five integrals, one for each partition:

    Pr(0 ≤ X_i ≤ 1, for i = 1, 2, 3)
        = 1 / (α(X)[α(X) + 1][α(X) + 2]) (α³(B_{p_1}) + α²(B_{p_2}) + α²(B_{p_3}) + α²(B_{p_4}) + 2α(B_{p_5})).

For concreteness, suppose that X is [−1,1] and α is Lebesgue measure. Then
α(X) = 2 and α^{g(p)}(B_p) = 1 for all p. So, Pr(X ∈ B) = 0.25, substantially above
the product probability α³(B)/α(X)³ = 0.125. The negative unit cube (all X_i
between −1 and 0) also has probability 0.25, while the six other subcubes each
has probability 1/12.
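The arithmetic in this example extends to any product subcube and is easy to check numerically. In the sketch below (our own illustration, not from the text), the term for each partition p is the α^{g(p)}-measure of B_p, which for a cube C_1 × C_2 × C_3 factors into interval lengths and pairwise or triple overlaps:

```python
# Probabilities from Example 1.102: X = [-1, 1], alpha = Lebesgue measure,
# so alpha(X) = 2 and the normalizing constant is alpha(X)(alpha(X)+1)(alpha(X)+2).

def length(a, b):
    return max(0.0, b - a)

def overlap2(i1, i2):
    return length(max(i1[0], i2[0]), min(i1[1], i2[1]))

def overlap3(i1, i2, i3):
    return length(max(i1[0], i2[0], i3[0]), min(i1[1], i2[1], i3[1]))

def prob_cube(c1, c2, c3, mass=2.0):
    """Pr(X in C1 x C2 x C3) for (X1, X2, X3) as in Lemma 1.99 with n = 3."""
    norm = mass * (mass + 1) * (mass + 2)
    terms = (
        length(*c1) * length(*c2) * length(*c3),  # p1: all coordinates distinct
        overlap2(c1, c2) * length(*c3),           # p2: x1 = x2 != x3
        length(*c1) * overlap2(c2, c3),           # p3: x2 = x3 != x1
        overlap2(c1, c3) * length(*c2),           # p4: x1 = x3 != x2
        2 * overlap3(c1, c2, c3),                 # p5: x1 = x2 = x3 (density value 2)
    )
    return sum(terms) / norm

unit, neg = (0.0, 1.0), (-1.0, 0.0)
print(prob_cube(unit, unit, unit))   # 0.25, versus the product probability 0.125
print(prob_cube(unit, unit, neg))    # 1/12 for a mixed subcube
```

Summing `prob_cube` over all eight sign-pattern subcubes gives 2 × 0.25 + 6 × 1/12 = 1, as it must.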

Straightforward applications of Dirichlet process priors to one-sample
problems are singularly uninteresting, except in cases in which one might
use the bootstrap technique (see Section 5.3). There are, however, ways to
make use of Dirichlet process priors in less straightforward fashion.
Example 1.103. Suppose that {X_n}_{n=1}^∞ are conditionally IID with distribution
P given P = P, and we are sure that there exists a finite number θ such that
P((−∞, θ]) = 1. Unless θ is known, it is not possible that P has a Dirichlet process
distribution. If we let Θ be the unknown least upper bound on the support of the
X_i's, we can suppose that P given Θ = θ has Dir(α_θ) distribution, where α_θ is
a finite measure on (−∞, θ]. Let Θ have prior density f_Θ. Let c_θ = α_θ((−∞, θ]).
Suppose also that α_θ is absolutely continuous with respect to Lebesgue measure
with Radon–Nikodym derivative a_θ. Using Lemma 1.99, the likelihood function
for Θ after observing X_1, ..., X_n to obtain g distinct values y_1, ..., y_g with k_i
repetitions of y_i is

    L(θ) = ∏_{i=1}^{n} (c_θ + i − 1)^{-1} ∏_{i=1}^{g} [ a_θ(y_i) ∏_{j=2}^{k_i} (j − 1) ].

Hence, the posterior distribution for Θ can be found. Conditional on Θ = θ,
the posterior for P is a Dirichlet process with measure β_θ equal to α_θ plus
point masses at the observed values. The marginal posterior of P is a mixture of
Dirichlet processes. Antoniak (1974) studied mixtures of Dirichlet processes and
described many of their properties.

Example 1.103 can be somewhat deceiving if one is really trying to model
data from a continuous distribution. If g = n in that example, then all of
the k_i = 1. If c_θ is the same for all θ, then the likelihood function is the
same as one would obtain by modeling the data as conditionally IID given
Θ = θ with density a_θ(·)/c_θ. This is probably not the effect one thought
one was achieving by using a Dirichlet process. That is, there is nothing
the least bit nonparametric about the analysis one ends up performing in
this situation. In fact, this phenomenon is quite general.

Lemma 1.104.³⁶ Suppose that person 1 believes that {X_n}_{n=1}^∞ are IID
with a continuous distribution. For each θ ∈ Ω, let α_θ be a continuous
finite measure with α_θ(X) = c for all θ. Suppose that person 2 models the
data as conditionally IID given P = P and Θ = θ with distribution P and
that P given Θ = θ has Dir(α_θ) distribution. Suppose that person 3 models
the data as conditionally IID given Θ = θ with distribution α_θ/c. Assume
that α_θ ≪ λ for all θ. Suppose that person 2 and person 3 use the same
prior distribution for Θ. Then person 1 believes that, with probability 1, for
every n, person 2 and person 3 will calculate exactly the same posterior
distributions for Θ given X_1, ..., X_n.

PROOF. First, note that the density f_X in Lemma 1.99 is constant in θ for
every data set that has no observed values at points where α_θ puts positive
mass. Such a data set will occur with probability 1 according to person 1.
Let a_θ = dα_θ/dλ. With probability 1 (according to person 1), person 2 will
then have likelihood function proportional to ∏_{i=1}^{n} a_θ(x_i). This is the same
as the likelihood function that person 3 will have. Hence, persons 2 and 3
will calculate the same posterior. □
Example 1.105. Since the Dirichlet process assigns probability 1 to discrete
CDFs, it may not be considered suitable for cases in which one really wants a
continuous CDF. One possibility is to model the observable data {X_n}_{n=1}^∞ as
X_i = Y_i + Z_i, where {Y_n}_{n=1}^∞ are conditionally IID with CDF G given G =
G, where G has Dir(α) distribution, and {Z_n}_{n=1}^∞ are independent of {Y_n}_{n=1}^∞
and of G and of each other with a distribution having density f. The posterior
distribution of G is not easy to obtain in this case, but a method that can be used
to approximate it will be given in Section 8.5. Escobar (1988) gives an algorithm
for implementing this method.

1.6.2 Tailfree Processes+


In this section, we introduce a second class of distributions over an infinite-
dimensional space of probabilities. This time it will be possible for the
random probability measure P not to be discrete.

Definition 1.106. Let (X, B) be a Borel space. For each integer n > 0, let
π_n be a countable partition of X whose elements are in B. Suppose that
π_{n+1} is a refinement of π_n for each n. Let the trivial partition be π_0 = {X}.
Let C = ∪_{n=1}^∞ π_n. Suppose that B is the smallest σ-field containing C. For
each n, let {V_{n;B} : B ∈ π_n} be a collection of nonnegative random variables
such that the collections are mutually independent. For each n ≥ 1 and each
B_1 ⊇ ... ⊇ B_n with B_i ∈ π_i, define

    P(B_n) = ∏_{i=1}^{n} V_{i;B_i}.

Then we say that the stochastic process P = {P(B) : B ∈ C} is tailfree
with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}). For each n ≥ 1 and
B ∈ π_n, define ps(B) = C, where C ∈ π_{n−1} and B ⊆ C. Call this the most
recent superset of B. For each x ∈ X and each n, define

    C_n(x) = that B ∈ π_n such that x ∈ B,
    V_n(x) = V_{n;C_n(x)}.

Note that the random variables in {V_{n;B} : B ∈ π_n} do not have to be
independent of each other, but they must be independent of those in {V_{m;B} :
B ∈ π_m} for m ≠ n. The class of tailfree processes was introduced by
Freedman (1963) and Fabius (1964). Also, see Ferguson (1974).

A necessary condition for P to be a random probability measure is that

    Σ_{all C such that ps(C) = B} V_{n;C} = 1.                    (1.107)

Another necessary condition is that if B_n is a union of elements of π_n for
each n, B_1 ⊇ B_2 ⊇ ..., and ∩_{n=1}^∞ B_n = ∅, then

    ∏_{n=1}^∞ ( Σ_{B : B ∈ π_n, B ⊆ B_n} V_{n;B} ) = 0, a.s.      (1.108)

These two conditions are also sufficient. (See Problem 51 on page 81.) For
the remainder of this book, when we refer to a tailfree process, we will
assume that it is a random probability measure.

As an example, we can show that Dirichlet processes are tailfree with
respect to every sequence of partitions.

³⁶This lemma is used to show why it may not be sensible to use a Dirichlet process for the prior if there will also be an additional finite-dimensional parameter of interest.

⁺This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
Example 1.109. Let P have Dir(α) distribution. Let {π_n}_{n=1}^∞ be a sequence
of countable partitions such that π_{n+1} is a refinement of π_n for all n. We can
prove that P is tailfree with respect to {π_n}_{n=1}^∞. For each n and each B ∈ π_n,
set V_{n;B} = P(B)/P(ps(B)). The fact that the collections {V_{n;B} : B ∈ π_n} are
independent follows from a well-known fact about Dirichlet distributions. (See
Problem 52 on page 81.)

As another example, we can place a tailfree process distribution on the
class of distributions symmetric around a point.

Example 1.110. Let X = ℝ, and let π_1 = {(−∞,0), {0}, (0,∞)}. Let {π_n^*}_{n=2}^∞
be a sequence of nested partitions of (−∞,0). For each n > 1, let π_n^{**} be the
partition of (0,∞) formed by the negatives of the sets in π_n^*. Let π_n = π_n^* ∪ π_n^{**} ∪
{{0}}. So long as V_{n;B} = V_{n;C} whenever B = −C, P will be symmetric around 0.

When X ~ P given P and P is a tailfree process, the predictive distri-
bution of X can be computed.


Proposition 1.111. Let P be a random probability measure that is tailfree
with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}) and X ~ P given P. Let
A ∈ π_m, and let A = ∩_{i=1}^{m} B_i, where B_i ∈ π_i for each i ≤ m. Then the
predictive probability that X ∈ A is

    μ_X(A) = E P(A) = ∏_{i=1}^{m} E(V_{i;B_i}).                    (1.112)
It is sometimes possible to find a density for the predictive distribution of
X with respect to some measure ν of interest (like Lebesgue measure).

Lemma 1.113.³⁷ Let (X, B, ν) be a σ-finite measure space. Assume that
P is tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}). Assume
that each element of each π_n has positive ν measure. For each x ∈ X, let

    f_n(x) = (1 / ν(C_n(x))) ∏_{i=1}^{n} E[V_i(x)].

If lim_{n→∞} f_n(x) = f(x), a.e. [ν], and ∫ f(x) dν(x) = 1, then f = dμ_X/dν.
PROOF. We need to prove that for each B ∈ C, μ_X(B) = ∫_B f(x) dν(x).
The extensions to the smallest field containing C and the smallest σ-field
containing C are straightforward. Let B ∈ π_n, and let B ⊆ B_i ∈ π_i for
i = 1, ..., n. Then V_i(x) = V_{i;B_i} for all x ∈ B and i = 1, ..., n. By (1.112),
we have, for each x_0 ∈ B,

    μ_X(B) = ∏_{i=1}^{n} E(V_{i;B_i}) = ν(B) f_n(x_0) = ∫_B f_n(x) dν(x),

since f_n(x) = f_n(x_0) for all x ∈ B and B = C_n(x_0). For k > n, write
B = ∪_{a∈A} D_a as the partition of B by elements of π_k. Since f_k is constant
on each D_a (call the value f_k(x_a) for x_a ∈ D_a), we can write

    ∫_B f_k(x) dν(x) = Σ_{a∈A} ∫_{D_a} f_k(x) dν(x) = Σ_{a∈A} ν(D_a) ∏_{i=1}^{k} E(V_i(x_a))

        = Σ_{a∈A} μ_X(D_a) = μ_X(B).

37This lemma is used in Example 1.114.



Hence, we have ∫_B f_k(x) dν(x) = μ_X(B) for all k ≥ n. So,

    lim_{k→∞} ∫_B f_k(x) dν(x) = μ_X(B).

It follows from Scheffé's theorem B.79 that

    lim_{k→∞} ∫_B f_k(x) dν(x) = ∫_B f(x) dν(x).                   □

Example 1.114. Suppose that ν is a finite measure and, for each n and B,
E(V_{n;B}) = ν(B)/ν(ps(B)). In the notation of Lemma 1.113, f_n(x) = 1/ν(X) for
all n and x. In this case, μ_X = ν/ν(X), and the density is constant. In fact, this
gives a convenient way to force a tailfree process to have a desired predictive
distribution for X.
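A small numerical sketch of this device (our own; the Beta(2,2) target and the dyadic partitions of [0,1] are arbitrary choices): setting E(V_{n;B}) = ν(B)/ν(ps(B)) makes the product in (1.112) telescope to ν(C_n(x)). Computed with respect to Lebesgue measure rather than ν itself, f_n therefore converges to the density dν/dx instead of a constant.

```python
# Canonical tailfree construction on dyadic partitions of [0, 1] with
# E(V_{n;B}) = nu(B)/nu(ps(B)), where nu is the Beta(2,2) distribution.
# Then f_n(x) = prod_i E(V_i(x)) / leb(C_n(x)) approaches the density 6x(1-x).

def nu_cdf(x):
    return 3 * x**2 - 2 * x**3      # Beta(2,2) CDF

def nu(a, b):
    return nu_cdf(b) - nu_cdf(a)    # nu measure of [a, b)

def f_n(x, n):
    """Predictive density approximation after n dyadic refinement levels."""
    prod, a, b = 1.0, 0.0, 1.0
    for _ in range(n):
        m = (a + b) / 2
        child = (a, m) if x < m else (m, b)
        prod *= nu(*child) / nu(a, b)   # E(V) = nu(child)/nu(parent)
        a, b = child
    return prod / (b - a)               # divide by the Lebesgue measure of C_n(x)

print(f_n(0.3, 20))   # close to the Beta(2,2) density 6*0.3*0.7 = 1.26
```

At depth 1 the approximation is piecewise constant (f_1 ≡ 1 on [0, 1/2), since ν([0, 1/2)) = 1/2); refining the partition recovers the fine detail of ν.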

Tailfree processes are conjugate in the sense that the posterior is tailfree
if the prior is tailfree.

Theorem 1.115. Let P be a random probability measure that is tailfree
with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}), and let X ~ P given P.
Then P given X is tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}).
PROOF. Fix k and n_1 < ... < n_k. Let V_i, for i = 1, ..., k, be a finite (say,
with size s_i) collection of the random variables corresponding to elements
of π_{n_i}, and let F_{V_i} denote their joint CDF. Let F_{V_i|X}(·|x) denote the
conditional CDF of V_i given X = x. Let A ∈ B, and let G_i ⊆ ℝ^{s_i}, for
i = 1, ..., k. We must show that

    Pr(X ∈ A, V_i ∈ G_i, for i = 1, ..., k) = ∫_A ∏_{i=1}^{k} F_{V_i|X}(G_i|x) dμ_X(x),   (1.116)

where μ_X is given by (1.112). If we can prove (1.116) for all A ∈ C, it will
be true for all A ∈ B by Theorem A.26.

First, we find F_{V_i|X}. By definition, for C ⊆ ℝ^{s_i}, F_{V_i|X}(C|·) is any mea-
surable function h such that

    ∫_A h(x) dμ_X(x) = Pr(X ∈ A, V_i ∈ C),

for all A ∈ B. Once again, the equation need only hold for all A ∈ C. We
propose the following function:

    h(x) = E( I_C(V_i) V_{n_i}(x) ) / E( V_{n_i}(x) ),             (1.117)

where V_m(x) is defined in Definition 1.106 to be the random variable corre-
sponding to that element of partition π_m which contains x. Note that h is
constant on each element of π_{n_i}. We find it convenient to let h(B) stand for

that constant value if x ∈ B ∈ π_{n_i}. Let A = B_n ∈ π_n, let m = max{n_i, n},
and define, for j = 1, ..., m,

    U_j = V_{j;B_j}                 if j ≤ n, j ≠ n_i, A ⊆ B_j ∈ π_j,
    U_j = 1                         if j > n, j ≠ n_i,
    U_j = (V_{n_i;B_{n_i}}, V_i)    if j = n_i ≤ n, A ⊆ B_{n_i} ∈ π_{n_i},
    U_j = (V_{n_i;B*}, V_i)         if j = n_i > n,

and

    D_j = [0, 1]                    if j ≤ n, j ≠ n_i, A ⊆ B_j ∈ π_j,
    D_j = [0, 1]                    if j > n, j ≠ n_i,
    D_j = [0, 1] × C                if j = n_i,

where B* is the union of all elements of π_{n_i} which are subsets of A. If
j = n_i ≤ n, it is possible that the first coordinate of U_j is repeated later in
the vector. With these definitions, it is clear that
    Pr(X ∈ A, V_i ∈ C) = Pr(X ∈ A, U_j ∈ D_j, j = 1, ..., m)

        = ∫_{D_1} ··· ∫_{D_m} u_{n_i,1} ∏_{j≠n_i} u_j dF_{U_m}(u_m) ··· dF_{U_1}(u_1)

        = E(U_{n_i,1} I_C(V_i)) ∏_{j≠n_i} E(U_j)

        = { E(V_{n_i;B_{n_i}} I_C(V_i)) ∏_{j≠n_i} E(V_{j;B_j})                            if n_i ≤ n,
          { Σ_{B such that B ⊆ A} E(I_C(V_i) V_{n_i;B}) ∏_{j=1}^{n_i−1} E(V_{j;B_j(B)})   if n_i > n.
If we let h(x) be as in (1.117), then

    ∫_A h(x) dμ_X(x) = Σ_{B ∈ π_{n_i}} ∫_{A∩B} h(x) dμ_X(x).       (1.118)

If n_i ≤ n, there is only one term in the sum in (1.118). It is the term with
B = B_{n_i} such that A ⊆ B_{n_i}, and the integral equals

    h(B_{n_i}) μ_X(A) = [ E(I_C(V_i) V_{n_i;B_{n_i}}) / E(V_{n_i;B_{n_i}}) ] ∏_{j=1}^{n} E(V_{j;B_j})

        = E(I_C(V_i) V_{n_i;B_{n_i}}) ∏_{j≠n_i} E(V_{j;B_j}),
which is what we needed to show. Similarly, if n_i > n, the only terms in
the sum in (1.118) which appear are those for which B ⊆ A, and A is the
union of these sets. The sum becomes

    Σ_{B such that B ⊆ A} h(B) μ_X(B),                             (1.119)

where B_j(B) ∈ π_j and B ⊆ B_j(B) for j = 1, ..., n_i. The right-hand side
of (1.119) can be written as

    Σ_{B such that B ⊆ A} E(I_C(V_i) V_{n_i;B}) ∏_{j=1}^{n_i−1} E(V_{j;B_j(B)}),

and it follows that h is a version of F_{V_i|X}(C|·).


To prove (1.116), let A ∈ π_n. If n < n_k, break A into its intersections
with all elements of π_{n_k} and then add up the sides of (1.116) over the
disjoint intersections to see that (1.116) holds. Hence we need only prove
(1.116) if n ≥ n_k. As before, let B_j be the element of partition π_j such that
A ⊆ B_j for j = 1, ..., n. Since n ≥ n_k, A is a subset of one and only one
of the partition elements for each π_{n_i} partition. Define, for j = 1, ..., n,

    W_j = V_{j;B_j}            if j ∉ {n_1, ..., n_k},
    W_j = (V_{j;B_j}, V_i)     if j = n_i,

and

    D_j = [0, 1]               if j ∉ {n_1, ..., n_k},
    D_j = [0, 1] × G_i         if j = n_i.

In what follows, W_{j,1} will be the first coordinate of W_j. It is easy to see
that

    Pr(X ∈ A, V_i ∈ G_i, for i = 1, ..., k) = Pr(X ∈ A, W_j ∈ D_j, for j = 1, ..., n)

        = ∫_{D_1} ··· ∫_{D_n} ∏_{j=1}^{n} w_{j,1} dF_{W_n}(w_n) ··· dF_{W_1}(w_1)

        = ∏_{j=1}^{n} E(W_{j,1} I_{D_j}(W_j))

        = ∏_{j ∉ {n_1,...,n_k}} E(V_{j;B_j}) ∏_{i=1}^{k} E(V_{n_i;B_{n_i}} I_{G_i}(V_i))

        = ∏_{j=1}^{n} E(V_{j;B_j}) ∏_{i=1}^{k} [ E(V_{n_i;B_{n_i}} I_{G_i}(V_i)) / E(V_{n_i;B_{n_i}}) ]

        = ∫_A ∏_{i=1}^{k} F_{V_i|X}(G_i|x) dμ_X(x).                □

It is not difficult to check that the posterior distribution found in Theo-
rem 1.115 still satisfies the two conditions (1.107) and (1.108). Hence, it is
a regular conditional distribution.

Example 1.120. Suppose that X = ℝ and each set B is an interval. In the
proof of Theorem 1.115, suppose that one of the V_i is just the single coordinate
V_{n_i}(x). Also, suppose that this V_{n_i}(x) has a prior density f(v) with respect
to some measure (like Lebesgue measure). Then (1.117) says that the posterior
density of V_{n_i}(x) (aside from a normalizing constant) is v f(v). If V′ is another
random variable from partition n_i, and if (V′, V_{n_i}(x)) have joint prior density
g(v′, v), then their joint posterior is proportional to v g(v′, v).

Tailfree processes are more general than Dirichlet processes. In fact,
they can be continuous or even absolutely continuous with respect to non-
discrete measures with probability one. Mauldin, Sudderth, and Williams
(1992) prove a result giving conditions under which P is continuous with
probability 1. (See Problem 57 on page 81.) Kraft (1964) and Métivier
(1971) proved theorems saying that if X = (0,1] and π_n has 2^n sets,
and if a few other conditions hold, then P has a density with respect to
Lebesgue measure with probability 1. We generalize these latter theorems
to arbitrary tailfree processes.
Theorem 1.121. Let (X, B, ν) be a probability space. Assume that P is
tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}). Assume that
each element of each π_n has positive ν measure and that, for all n and
B ∈ π_n, E(V_{n;B}) = ν(B)/ν(ps(B)). For each n and each x ∈ X, define

    f_n(x) = (1 / ν(C_n(x))) ∏_{i=1}^{n} V_i(x).

If sup_n ∫_X E[f_n²(x)] dν(x) < ∞, then, with probability 1,

1. lim_{n→∞} f_n(x) = f(x) exists and is finite a.e. [ν], and
2. P(A) = ∫_A f(x) dν(x), for all A ∈ B.

Before proving Theorem 1.121, we should say a little about its statement.
The condition that E(V_{n;B}) = ν(B)/ν(ps(B)) is equivalent to saying that
ν = μ_X in (1.112). (See Problem 50 on page 81.) The formula for f_n(x)
is nothing more than the formula for P(C_n(x))/ν(C_n(x)). Hence, f_n is the
density (with respect to ν) corresponding to an approximation to P which
ignores the fine detail on sets in partitions past n. In other words, f_n is
constant on all sets in π_n and is a density for P restricted to the smallest
σ-field containing all sets in partitions up to n. Since E(f_n(x)) = 1 for all
n and x, the theorem says that if there is not too much variation in the
approximate densities, then the approximate densities converge to a density
for P.
PROOF OF THEOREM 1.121.³⁸ Consider the probability space (S × X, A ⊗
B, μ × ν). For part (1), let F_n be the product σ-field of the σ-field generated

³⁸The proof makes use of martingale theory.


by {V_{i;B} : B ∈ π_i, i ≤ n} with the σ-field B. Then, as a function
from S × X to ℝ, f_n is measurable with respect to F_n. Also, since the V_i(x)
are independent for fixed x,

    E( f_{n+1}(x) | F_n ) = f_n(x).

So, the stochastic process {(f_n(x), F_n)}_{n=1}^∞ is a martingale. Since

    ∫ f_n(x) dμ × ν(s, x) = 1 for all n,

the martingale convergence theorem B.117 implies that lim_{n→∞} f_n = f
exists and is finite a.s. [μ × ν], which means (in terms of the original prob-
ability space) that with probability 1, lim_{n→∞} f_n(x) = f(x) exists and is
finite a.e. [ν].
For part (2), we first show that the sequence {f_n}_{n=1}^∞ is uniformly inte-
grable with respect to μ × ν:

    ∫_{{(s,x) : f_n(x) > m}} f_n(x) dμ × ν(s, x) = ∫_X ∫_S f_n(x) I_{(m,∞)}(f_n(x)) dμ(s) dν(x)

        ≤ ∫_X ∫_S (1/m) f_n²(x) dμ(s) dν(x) = (1/m) ∫_X E[f_n²(x)] dν(x),   (1.122)

where the first equation follows from Tonelli's theorem A.69, and the in-
equality follows since I_{(m,∞)}(f_n(x)) ≤ f_n(x)/m. By assumption, the supre-
mum over n of the last expression in (1.122) is a finite number divided by
m, which goes to 0 with m, so the sequence is uniformly integrable.
Next, we prove that f is a density with respect to ν with probability 1.
By Theorem A.60, we have

    lim_{n→∞} ∫ f_n(x) dμ × ν(s, x) = ∫ f(x) dμ × ν(s, x).        (1.123)

Since the left-hand side is 1, we have that f is integrable. It follows that
if A ∈ A ⊗ B, then I_A(s, x)|f_n(x) − f(x)| is uniformly integrable. Let A =
B × X, where

    B = { s : ∫_X f(x) dν(x) > 1 }.

Then lim_{n→∞} ∫_A f_n(x) dμ × ν(s, x) = ∫_A f(x) dμ × ν(s, x). The left-hand
side is just μ(B) = Pr(B) since f_n is a density. But the right-hand side
is greater than μ(B) if μ(B) > 0, so the integral of f is at most 1 with
probability 1. But ∫ f(x) dμ × ν(s, x) = 1 from (1.123), so ∫_X f(x) dν(x) = 1
with probability 1. It follows from Scheffé's theorem B.79 that part (2) is
true. □
There is a convenient way to check the condition sup_n ∫_X E[f_n²(x)] dν(x) <
∞ in Theorem 1.121.

Lemma 1.124. If Σ_{n=1}^∞ sup_{B∈π_n} Var(V_{n;B})/(E V_{n;B})² < ∞, then

    sup_n ∫_X E[f_n²(x)] dν(x) < ∞.

PROOF. Since the set in π_n to which x belongs is not random and E(V_n(x))
= E(V_{n;B}) for all x ∈ B, we have

    0 ≤ Σ_{n=1}^∞ sup_x { E[V_n²(x)] / (E V_n(x))² − 1 } = Σ_{n=1}^∞ sup_x Var V_n(x) / (E V_n(x))²

      ≤ Σ_{n=1}^∞ sup_{B∈π_n} Var(V_{n;B}) / (E V_{n;B})² < ∞.

Note that log(y) ≤ y − 1 for all y > 0. With y = sup_x E[V_n²(x)]/(E V_n(x))²,
we get, for each n,

    sup_x log( E[V_n²(x)] / (E V_n(x))² ) ≤ sup_x { E[V_n²(x)] / (E V_n(x))² − 1 };

hence

    Σ_{n=1}^∞ sup_x log( E[V_n²(x)] / (E V_n(x))² ) ≤ Σ_{n=1}^∞ sup_x { E[V_n²(x)] / (E V_n(x))² − 1 } < ∞.

Since the V_n(x) are independent, we get that ∏_{i=1}^{n} E[V_i²(x)]/(E V_i(x))²
equals E[f_n²(x)]. Hence, sup_n sup_x E[f_n²(x)] < ∞, and the result follows
since ν is a probability measure. □
The following simple corollaries follow from Tonelli's theorem A.69.
Corollary 1.125. Let X rv P given P. If P v with probability 1, then
the predictive distribution of X has density with respect to 1/ equal to the
mean ofdP/dl/.
Corollary 1.126. If {Xn}~=l are conditionally lID with distribution P
given P and P 1/ with probability 1, then conditional on Xl, ... , Xn
P 1/ with probability 1, a.s. with respect to the joint distribution of
Xl, ,Xn .

At this point, we will introduce one special class of tailfree processes
which includes Dirichlet processes as special cases but also includes cases
that satisfy the conditions of Theorem 1.121. The class is called Polya
tree distributions. A good introduction to these processes is contained in
the papers by Mauldin, Sudderth, and Williams (1992) and Lavine (1992).
[See also Mauldin and Williams (1990).]

Definition 1.127. Let P be tailfree with respect to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥
1, B ∈ π_n}) and suppose that

- for each n ≥ 0 and each B ∈ π_n, there are exactly k sets in π_{n+1},
  B_1(B), ..., B_k(B), such that B = ps(B_i(B));
- for each n ≥ 0 and each B ∈ π_n, the joint distribution of {V_{n+1;B_i(B)} :
  i = 1, ..., k} is Dirichlet, and they are independent for different B ∈ π_n;

then P has a Polya tree distribution.

Note that each partition π_n has k^n elements. It is possible to allow some
of the partition elements to be ∅ so that there are fewer than k^n non-
empty elements of π_n, but then the Dirichlet distributions would have to
be partially degenerate in the sense that some coordinates would have to
be 0 with probability 1.
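A Polya tree is easy to simulate level by level. The sketch below is our own construction (dyadic partitions of [0,1], k = 2, and the parameters α_{n,i} = n²/2 that reappear in Example 1.132); with k = 2 the Dirichlet splits reduce to Beta random variables, and one realization of P is determined down to any finite depth by the leaf probabilities.

```python
# Simulate one random P from a Polya tree on [0, 1]: each set B in pi_n splits
# into two halves whose conditional probabilities (V for the left child, 1 - V
# for the right) are Beta(n^2/2, n^2/2), independently across sets and levels.
import random

def polya_tree_leaf_probs(depth, rng):
    """Return P(B) for the 2**depth dyadic leaves of one simulated P."""
    probs = [1.0]                      # level 0: P(X) = 1
    for n in range(1, depth + 1):
        a = n * n / 2.0                # parameters a_{n,1} = a_{n,2} = n^2/2
        nxt = []
        for p in probs:
            v = rng.betavariate(a, a)  # V for the left child; the right gets 1 - v
            nxt.extend([p * v, p * (1.0 - v)])
        probs = nxt
    return probs

rng = random.Random(0)
probs = polya_tree_leaf_probs(10, rng)
print(sum(probs))   # the leaf probabilities of the simulated P sum to 1
```

Because the splitting variables at each set sum to one, conditions (1.107) and (1.108) hold automatically for this construction.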
The posterior distribution of a Polya tree process P given an observation
X can be determined by examining the step in the proof of Theorem 1.115
in which the posterior is given, namely (1.117). For each n and each x ∈ X,
let

    W_n(x) = (V_{n;B_1(C_{n−1}(x))}, ..., V_{n;B_k(C_{n−1}(x))}),  (1.128)

in the notation of Definitions 1.127 and 1.106. This is the vector of ran-
dom variables for partition π_n corresponding to the subsets of C_{n−1}(x).
According to (1.117), the posterior distribution of W_n(x) given X = x has
a density with respect to the prior distribution equal to v_x/E(V_n(x)), where
v_x is the dummy variable for the coordinate corresponding to V_n(x), which
is a coordinate of W_n(x). If Dir_k(α_{n,1}(x), ..., α_{n,k}(x)) is the prior distribu-
tion of W_n(x), then the posterior is Dirichlet with the ith parameter equal
to

    α_{n,i}(x) + I_{B_i(C_{n−1}(x))}(x).                           (1.129)

For all of the other V random variables corresponding to partition π_n, the
posterior distributions are the same as the prior distributions. In summary,
the posterior distributions of the V random variables are the same as the
priors for all Vs corresponding to sets such that x is not in the most re-
cent superset. For those sets such that x is in the most recent superset,
the distribution of the vector of V random variables is Dirichlet with the
same parameters as in the prior except for the set in which x lies, whose
parameter is one higher than in the prior. Note that this is the same thing

that happens in the Dirichlet process. The difference is that, for a Dirichlet
process, the above argument applies to every set, and every set is in the
first partition π_1 for the Dirichlet process. Given this description of the
posterior, we can use Corollary 1.125 to construct the predictive density of
a future observation X_{n+1} given the observed values of X_1, ..., X_n.
Example 1.130. Let P have a Polya tree distribution. Suppose that the condi-
tions of Theorem 1.121 hold and that P ≪ ν with probability 1. Let x_1, ..., x_m
be the observed values of the first m random quantities. For each x ∈ X and each
n such that x is not in the same element of π_{n−1} as one of the x_i,

    E[ V_n(x) | X_1 = x_1, ..., X_m = x_m ] = E[V_n(x)] = ν(C_n(x)) / ν(ps(C_n(x))).

It follows that for all such n and x,

    E f_n(x) = (1 / ν(C_r(x))) ∏_{i=1}^{r} g_i(x),                 (1.131)

where r is the first integer such that x is not in the same element of π_{r−1} with
any of the x_i, and g_i(x) is the posterior mean of V_i(x) given the observed data.
It follows from Tonelli's theorem A.69 that E f(x) equals (1.131).

We can actually find an explicit formula for g_i(x). Using the same notation
as above, suppose that V_n(x) is coordinate i_n(x) of W_n(x) and that W_n(x) has
Dir_k(α_{n,1}(x), ..., α_{n,k}(x)) prior distribution. Then, the posterior distribution of
V_n(x) is Beta(a, b) with

    a = α_{n,i_n(x)}(x) + Σ_{j=1}^{m} I_{C_n(x)}(x_j),

    b = Σ_{ℓ≠i_n(x)} α_{n,ℓ}(x) + Σ_{j=1}^{m} I_{C_{n−1}(x)\C_n(x)}(x_j).

That is, the first parameter of the posterior beta distribution equals the prior
parameter plus the number of observations that are in the same partition set
as x. The second parameter of the posterior beta distribution equals the prior
parameter plus the number of observations that were in the same partition set
as x in the most recent partition but now are not in the same partition set as x.
It follows that g_n(x) = a/(a + b).
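For concreteness, here is a sketch of this counting rule (our code, assuming dyadic partitions of [0,1] and the running choice α_{n,i}(x) = n²/2): a adds the number of observations in C_n(x), and b adds the number in C_{n−1}(x) that are not in C_n(x).

```python
# Posterior Beta(a, b) parameters of V_n(x) as in Example 1.130, for dyadic
# partitions of [0, 1), k = 2, and prior parameters a_{n,i}(x) = n^2/2.

def cell(x, n):
    """Endpoints of C_n(x), the level-n dyadic interval containing x."""
    w = 2.0 ** -n
    i = min(int(x / w), 2 ** n - 1)
    return (i * w, (i + 1) * w)

def posterior_beta(x, n, data):
    lo, hi = cell(x, n)
    plo, phi = cell(x, n - 1)
    in_cell = sum(1 for d in data if lo <= d < hi)
    in_parent = sum(1 for d in data if plo <= d < phi)
    a = n * n / 2.0 + in_cell
    b = n * n / 2.0 + (in_parent - in_cell)
    return a, b          # posterior of V_n(x) is Beta(a, b); g_n(x) = a/(a+b)

data = [0.10, 0.30, 0.49, 0.80]
print(posterior_beta(0.3, 2, data))
```

For these data, the level-2 cell [0.25, 0.5) containing x = 0.3 holds two observations and its parent [0, 0.5) holds three, so the posterior of V_2(0.3) is Beta(2 + 2, 2 + 1) = Beta(4, 3) and g_2(0.3) = 4/7.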

For the special case of X = [0,1], k = 2, and α_{n,1}(x) = c_1, α_{n,2}(x) = c_2 for
all n and x, Dubins and Freedman (1963) show that P, although continuous,
has no density. That is, the distribution is not absolutely continuous with
respect to Lebesgue measure. For Polya trees with α_{n,i}(x) = c_n for all n,
i, and x, Var(V_n(x)) = (k − 1)/[k²(k c_n + 1)]. Lemma 1.124 says that if
Σ_{n=1}^∞ 1/c_n converges, then P is absolutely continuous with respect to ν
with probability 1. If ν is absolutely continuous with respect to Lebesgue
measure, this gives us an easy way to construct Polya tree processes that
have densities with respect to Lebesgue measure.

Example 1.132. Let {X_n}_{n=1}^∞ be an exchangeable sequence such that the prior
marginal distribution of each X_i is N(0,100). Let Y_i = Φ(X_i/10), where Φ is
the N(0,1) CDF. The prior marginal distribution of the Y_i is U(0,1). Suppose
that we model the Y_i as conditionally IID given P and that P has a Polya
tree process distribution on [0,1] with k = 2 and α_{n,i}(x) = n²/2 for all n,
i, and x. This is a special case of Example 1.114 on page 63; hence each Y_i
has marginal distribution U(0,1). Fifty observations X_1, ..., X_50 were simulated
from a Laplace distribution Lap(1,1), which does not look much like the prior
marginal distribution N(0,100). The posterior mean of the density of the distri-
bution of the X_i (that of P transformed back through x = 10Φ^{-1}(y)) was
computed and is plotted in Figure 1.133 together with the prior marginal
mean of the density and a histogram of the data values. This posterior mean
density is high where the data values are close together, as is
to be expected. The posterior mean smoothes out some of the ups and downs
in the histogram, especially those in the tails. The reason it smoothes the tails
a bit more than the center of the distribution is that the partition sets in the
tail which do and do not contain observations only belong to partitions π_n for
relatively large n. A few observations do not have much impact on the posterior
distribution of V_{n;B} for large n because the prior is Beta(n²/2, n²/2). In the center
of the distribution, however, the different partition sets belong to the same π_n
for smaller values of n.

It is interesting to note that there is a similarity between the posterior
distributions from Polya tree processes and Dirichlet processes. Let P_1
have Dir(α) distribution, and let P_2 have a Polya tree process with α_{1,i} =
α(B_{1,i}) for i = 1, ..., k. Then, it is easy to check that for every data
set, the posterior distributions of P_j(B_{1,1}), ..., P_j(B_{1,k}) are the same for
j = 1, 2. That is, for sets in the first partition π_1, the Polya tree process
looks just like a Dirichlet process.

[FIGURE 1.133. Posterior Mean of Polya Tree Density. The plot shows a
histogram of the data together with curves for the prior mean density, the
posterior mean density, and the true density, on a horizontal axis running
from −4 to 8.]

For Dirichlet processes, two disjoint sets
have posterior distributions that depend in no way on where the two sets
are located relative to each other. The same is true for elements of the first
partition in a Polya tree process. For Polya tree processes, two sets in the
nth partition will have their probabilities more closely related when they
share more superset partition sets. For example, two subsets of B_{1,1} will
be more closely related than a subset of B_{1,1} and a subset of B_{1,2}. Two
subsets of B_{2,1} ⊆ B_{1,1} will be more closely related than a subset of B_{2,1}
and a subset of B_{2,2} ⊆ B_{1,1}, even though both are subsets of B_{1,1}.

One potential problem with tailfree process priors is the dependence
on the sequence of partitions. One consequence of this dependence is easily
seen in Figure 1.133. The tall vertical lines in the posterior mean plot occur
at boundaries of sets in early partitions. The following example explores
this in more detail.
Example 1.134. Suppose that X = [0,1] and we use a Polya tree prior with
k = 2 and α_{n,i}(x) = n²/2 for all n, i, and x. Suppose that X_1 = 0.49. The
predictive density of X_2 at the value x = 0.51 is calculated as in Example 1.130
on page 70. It is

    f_{X_2|X_1}(0.51 | 0.49) = 0.25 / 0.5 = 0.5.

On the other hand, the predictive density of X_2 at x = 0.47 is

    f_{X_2|X_1}(0.47 | 0.49) = 2^6 × (18/37) × ∏_{n=1}^{5} (n²/2 + 1)/(n² + 1) ≈ 2.1183.

Note that in each of these cases, the proposed value of X_2 differs from the observed
X_1 by 0.02 and they are all in the vicinity of 0.5, and yet the first predictive
density is so much smaller than the second. The reason is the following. In the
first case, the two data values share no partition sets in common, not even in π_1.
In the second case, the two data values share the same partition set for the first
five partitions. In symbols, C_1(0.49) ≠ C_1(0.51), while C_n(0.47) = C_n(0.49) for
n = 1, ..., 5. Sharing partition sets is what makes predictive densities large.

One way to reduce the effect of the problem illustrated in Example 1.134
is to use a mixture of tailfree priors with partitions that have no common
boundaries.

Example 1.135 (Continuation of Example 1.134). Suppose that we use a half-
and-half mixture of two Polya tree priors with k = 2, 3 and α_{n,i}(x) = n²/k. After
some tedious algebra, one calculates the two predictive densities as

    f_{X_2|X_1}(0.51 | 0.49) = 1.8312,
    f_{X_2|X_1}(0.47 | 0.49) = 2.3191.

The reason that the first density is now almost as high as the second is that 0.49
and 0.51 appear together in one more partition set in the k = 3 prior than do
0.49 and 0.47. The densities are higher than with k = 2 alone because a prior
with larger k tends to let the density track the data more. With values so close
together as the ones in this example, the prior with k = 3 has very high posterior
mean of f(x) for x near 0.49 and very low mean for x not near 0.49. (Using the
k = 3 prior alone, the two predictive density values would have been 3.1624 and
2.5200, respectively.)
1.7. Problems 73

1.7 Problems

Throughout this text problems are given and the following type of expression
is often used: "Suppose that (some random quantities) are conditionally
independent given Θ = θ." This will mean that, for all θ in some
parameter space (implicit or explicit), the random quantities are conditionally
independent given Θ = θ with some distributions to be specified
later in the problem. Some of the more challenging problems throughout
the text have been identified with an asterisk (*) after the problem number.
Section 1.2:

1. Let X_1, X_2, X_3 be random variables whose joint distribution is given by

    Pr(X_1 = 1, X_2 = 1, X_3 = 0) = Pr(X_1 = 1, X_2 = 0, X_3 = 1)
        = Pr(X_1 = 0, X_2 = 1, X_3 = 1) = 1/3.

(a) Prove that X_1, X_2, X_3 are exchangeable.
(b) Prove that if X_4 ∈ {0,1} is another random variable, then it cannot
be that X_1, X_2, X_3, X_4 are exchangeable.
2. For each positive integer n, let F_n be the joint CDF of n random variables.
Suppose that the following two conditions hold:
• The sequence of n-dimensional CDFs {F_n}_{n=1}^∞ is consistent. (See
Definition B.132 on page 652.)
• For each n, each n-tuple (x_1, ..., x_n), and each permutation (y_1, ...,
y_n) of (x_1, ..., x_n), F_n(x_1, ..., x_n) = F_n(y_1, ..., y_n).
Prove that there is a sequence of random variables {X_n}_{n=1}^∞ that are
exchangeable and such that F_n is the joint CDF of X_1, ..., X_n for every n.
3. Suppos e that {Xn}~l are exchangeable. Let Vi == X n + for
i i = 1,2, ....
Show that {Yn}~l are exchangeable conditio nal on Xl, ... , X .
n
4. Suppose that {Xn}~l are conditionally IID given Y. Prove that
they are
exchangeable.
5. Suppose that {X_n}_{n=1}^∞ are IID N(μ,1) conditional on M = μ, and M ~
N(0,1). Find the joint distribution of every subset of size k of the X_i and
show that the X_i are exchangeable. Also, find the conditional distribution
of {X_{n+k}}_{k=1}^∞ given X_1 = x_1, ..., X_n = x_n.
6. In Example 1.15 on page 9, prove that the limit (as n → ∞) of the
empirical probability measures of X_1, ..., X_n is the Dirichlet distribution
Dir_k(α_1, ..., α_k).
7. Let {Y_n}_{n=1}^∞ be IID random variables with CDF F. Let Z be independent
of {Y_n}_{n=1}^∞ with CDF G, and let X_n = Y_n + Z for every n.
(a) Prove that {X_n}_{n=1}^∞ are exchangeable.
(b) Write the joint CDF of (X_1, ..., X_n) in terms of F and G.
74 Chapter 1. Probability Models

Section 1.3:

8. Let X_1, ..., X_m be numerical characteristics of m individuals in a finite
population. Suppose that we are interested in S = ∑_{i=1}^m X_i, the population
total. We model the X_i as exchangeable as follows. Let Θ be a parameter
such that conditional on Θ = θ, the X_i are IID with Exp(θ) distribution
and Θ has Γ(a,b) distribution. Suppose that we observe X_1 = x_1, ..., X_n =
x_n for some n < m. Find the predictive distribution of S given this data.
Also, find the mean of this predictive distribution.
9. Let Θ be a parameter with parameter space (Ω, τ), and let f_{X|Θ}(x|θ) =
dP_θ/dν for every θ. Let μ_Θ be a prior on (Ω, τ). Let Q be the joint distribution
of (X, Θ). Show that f_{X|Θ} = dQ/d(ν × μ_Θ), a.s. [ν × μ_Θ]. This says
that, for every prior distribution, we can find a version of f_{X|Θ} which is
jointly measurable in (x, θ).
10. Prove that the formula on the right-hand side of (1.37) on page 18 is the
same as

    f_{X_1,...,X_{n+k}}(x_1, ..., x_{n+k}) / f_{X_1,...,X_n}(x_1, ..., x_n).
11. Suppose that for every m = 1, 2, ..., f_{X_1,...,X_m}(x_1, ..., x_m) equals

    { (1/10^m) ∑_{i=0}^{10} a_i i^x (10 − i)^{m−x}   if all x_i ∈ {0,1},
    { 0                                              otherwise,

where x = ∑_{i=1}^m x_i, and the numbers a_i are nonnegative and add to 1.
Let Θ = lim_{n→∞} ∑_{i=1}^n X_i/n. Prove that the prior distribution of Θ is
Pr(Θ = i/10) = a_i for i = 0, ..., 10.
12. Suppose that for every m = 1, 2, ...,

    f_{X_1,...,X_m}(x_1, ..., x_m) = 2 / ( (m + 1) c_m(x_1, ..., x_m)^{m+1} )   if all x_i ≥ 0,

where c_m(x_1, ..., x_m) = max{2, x_1, ..., x_m}.

(a) Prove that the X_i are exchangeable and that these distributions are
consistent.
(b) Let Y_n = c_n(X_1, ..., X_n). Find the distribution of Y_n and the limit
of this distribution as n → ∞.
(c) Find the conditional density of X_{n+1} given X_1 = x_1, ..., X_n = x_n,
and assume that lim_{n→∞} c_n(x_1, ..., x_n) = c. Find the limit of the
conditional density as n → ∞.
(d) Use DeFinetti's representation theorem to show that the prior (the
answer to part (b)) and likelihood (the answer to part (c)) combine
to give the original joint distribution.
13. Let Θ be a random variable. Suppose that {X_n}_{n=1}^∞ are IID N(0,1). Let
T(0) = 0. For each i > 0, let T(i) be the first j > T(i − 1) such that
X_j ≥ Θ. Let Y_i = X_{T(i)} for i = 1, 2, .... If we use Lebesgue measure as an
improper prior distribution for Θ, how many observations must we observe
before the posterior becomes proper?

14. An observation X is to be made in hopes of learning something about a
parameter Θ. The prior distribution of Θ has some density f_Θ, but we
are not certain what is the appropriate distribution to use for X given
Θ. Suppose that we have k choices with densities f_1(·|θ), ..., f_k(·|θ). Let
π_1, ..., π_k be nonnegative numbers adding to 1, and set

    f_{X|Θ}(x|θ) = ∑_{i=1}^k π_i f_i(x|θ).

Show that there are numbers π_1(x), ..., π_k(x) adding to 1 such that

    f_{Θ|X}(θ|x) = ∑_{i=1}^k π_i(x) p_i(θ|x),

where p_i(·|x) is the posterior density of Θ that we would have calculated
if we had used f_i(·|θ) as the conditional density for X given Θ = θ.
15. Let {X_n}_{n=1}^∞ be IID Ber(θ) random variables given Θ = θ. Define a prior
μ_Θ for Θ by μ_Θ(B) = [δ(B) + μ(B)]/2, where μ is Lebesgue measure and
δ(B) = I_B(1/2).

(a) Find the marginal distribution of X_{i_1}, ..., X_{i_n} for distinct integers
i_1, ..., i_n.
(b) Find the posterior distribution of Θ given X_1 = x_1, ..., X_n = x_n,
that is, find Pr(Θ ≤ θ | X_1 = x_1, ..., X_n = x_n).
16. Suppose that {X_n}_{n=1}^∞ are conditionally independent random variables
with X_i ~ N(μ_i, 1) given (M, M_1, ..., M_n, ...) = (μ, μ_1, ..., μ_n, ...). Suppose
also that, given M = μ, M_n = M_1 for each n, where

    Pr(M_1 = μ + 1 | M = μ) = Pr(M_1 = μ − 1 | M = μ) = 1/2,

and M ~ N(μ_0, 1).
(a) Prove that {X_n}_{n=1}^∞ are exchangeable.
(b) Find a one-dimensional random variable Θ such that the X_i are conditionally
IID given Θ = θ, and find the distribution of Θ.
(c) Show that X̄_n = ∑_{i=1}^n X_i/n does not converge in probability to M.
17. Let c be a constant, and let X, Y be conditionally independent given Θ = θ
with X ~ Poi(θ), Y ~ Poi(cθ). Let Θ ~ Γ(α_0, β_0).

(a) Find the posterior distribution of Θ given X = x.
(b) Find the posterior predictive distribution of Y given X = x, and show
that it is a member of the negative binomial family.

18. Suppose that an expert believes that {X_n}_{n=1}^∞ are exchangeable Bernoulli
random variables. Let Θ = lim_{n→∞} ∑_{i=1}^n X_i/n, and assume that a statistician
wishes to model Θ ~ Beta(a,b). The statistician tries to elicit the
values of a and b from the expert by asking questions like, "What is the
probability that X_1 = 1?" and "How many X_i = 1 in a row would you have
to observe before you would raise the probability that the next X_j = 1 up
to q?" Suppose that the answer to the first question is p, and suppose that
q in the second question is chosen to be (1 + p)/2. Let the answer to the
second question be m.
(a) Find values for a and b which are consistent with the model.
(b) Find the partial derivatives of a and b with respect to p, and find the
effects of a change of 1 in m on both a and b.
(c) Suppose that the second question above was changed to "If you were
to observe X_1 = ··· = X_10 = 1, what would you give as the Pr(X_11 =
1)?" Let the answer to this question be r. Find values of a and b
consistent with the values of p and r, and find the partial derivatives
of a and b with respect to p and r.
19. Suppose that an expert believes that {X_n}_{n=1}^∞ are conditionally IID with
N(μ, σ²) distribution given (M, Σ) = (μ, σ). A statistician wishes to model
M ~ N(μ_0, σ²/λ_0) given Σ = σ and Σ² ~ Γ⁻¹(a_0/2, b_0/2). The statistician
tries to elicit the values of μ_0, λ_0, a_0, and b_0 from the expert by asking a
sequence of questions such as:
• What is the median of the distribution of X_1? (Suppose that the
answer is u_1.)
• Given that X_1 ≥ u_1, what is the conditional median of the distribution
of X_1? (Suppose that the answer is u_2.)
• Given that X_1 ≥ u_2, what is the conditional median of the distribution
of X_1? (Suppose that the answer is u_3.)
• If X_1 = u_2 is observed, what would be the conditional median of the
distribution of X_2? (Suppose that the answer is u_4.)

(a) Prove that the following constraints on u_1, u_2, u_3, and u_4 are sufficient
for there to exist a prior distribution of the desired form consistent
with the responses: u_1 < u_4 < u_2 and (u_3 − u_1)/(u_2 − u_1) > 1.705511.
(The constraints are actually necessary as well.)
(b) Suppose that the following answers are given: u_1 = 14.56, u_2 = 21.34,
u_3 = 29.47, and u_4 = 19.25. Find values of μ_0, λ_0, a_0, and b_0 which
are consistent with these answers.

Section 1.4:

20. For the joint density in Example 1.53 on page 29, prove that the distributions
are consistent (as n changes).
21. Consider the joint density in Example 1.53 on page 29, and define Y_n =
n/∑_{i=1}^n X_i. Find the distribution of Y_n. Also, let n → ∞ and prove that
the limit of the distribution of Y_n is Γ(a,b).

22. For the joint density in Example 1.53, prove that the conditional distribution
of the X_i given Θ = θ, namely Exp(θ), and the marginal distribution
of Θ, found in Problem 21, namely Γ(a,b), do indeed induce the joint distribution
of X_1, ..., X_n for all n. How do we know that no other combination
of distributions will induce the joint distribution of the X_i?
23. Let X_1, ..., X_14 be exchangeable Bernoulli random variables, and let M =
∑_{i=1}^{14} X_i. Let the distribution of M be given by the mass function (density
with respect to counting measure)

    f_M(m) = { 0.3  if m = 2,
             { 0.2  if m = 8,
             { 0.5  if m = 13.

(a) Find the probability that in four specific trials, we observe three successes
and one failure (without regard to which of the trials is a failure
and which are successes).
(b) Suppose that we observe three successes in the first four trials. Find
all the probabilities of k successes in n future trials for n = 1, ..., 10
and k = 0, ..., n. (Give a formula.)
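Calculations of the kind in Problem 23(a) reduce to averaging a hypergeometric probability over the distribution of M: given M = m, the 14 trials are an exchangeable arrangement of m ones and 14 − m zeros, so any four specific trials behave like four draws without replacement. A quick numerical sketch (our own illustration, not a required part of the exercise):

```python
from math import comb

# Given M = m, the number of successes among n specific trials out of a
# population of N exchangeable Bernoullis is hypergeometric; average the
# hypergeometric probability of k successes over the mass function of M.
f_M = {2: 0.3, 8: 0.2, 13: 0.5}
N, n, k = 14, 4, 3                 # population size, specific trials, successes

p = sum(pm * comb(m, k) * comb(N - m, n - k) / comb(N, n)
        for m, pm in f_M.items())
print(round(p, 3))  # 0.21
```

Note that `math.comb(2, 3)` is 0, so the m = 2 term drops out automatically.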
24. Suppose that X_1, ..., X_n are exchangeable and take values in the Borel
space (X, B). Prove that the empirical probability measure P_n is a measurable
function from the n-fold product space (X^n, B^n) to (P, C_P).
25.* Refer to Problem 29 on page 664.
(a) Find the distribution of Θ = lim_{n→∞} ∑_{i=1}^n X_i/n.
(b) Assume that we observe X_1 = 1 and X_2 = 0. Find the conditional
distribution of X_3, ..., X_n given this data for all n = 3, 4, ....
(c) Using the same data, find the conditional distribution of Θ.
26. Suppose that {X_n}_{n=1}^∞ is an infinite sequence of exchangeable random variables
with finite variance.
(a) Prove that the covariance of X_i with X_j is nonnegative for i ≠ j.
(b) Give an example of such a sequence in which Cov(X_i, X_j) = 0 but
the random variables are not mutually independent.
27.* Let X_1, X_2, X_3 be IID U(0,1) random variables. After observing X_1 =
x_1, X_2 = x_2, X_3 = x_3, define Y_1, Y_2 to be the results of drawing two numbers
at random without replacement from the set {x_1, x_2, x_3}. Prove that
Y_1 and Y_2 are IID U(0,1).
28. Let X_1, ..., X_n be IID with some distribution P on a Borel space (X, B).
Let the conditional distribution of Y_1, ..., Y_k (for k < n) given X_1 =
x_1, ..., X_n = x_n be that of k draws without replacement from the set
{x_1, ..., x_n}. Prove that the joint distribution of Y_1, ..., Y_k is that of IID
random quantities with distribution P.
29. Let (X, B) be a Borel space, and let X_i take values in X for i = 1, ..., n.
Suppose that X_1, ..., X_n are exchangeable. Let the conditional distribution
of Y_1, ..., Y_k (for k < n) given X_1 = x_1, ..., X_n = x_n be that of k
draws without replacement from the set {x_1, ..., x_n}. Prove that the joint
distribution of Y_1, ..., Y_k is the same as the joint distribution of X_1, ..., X_k.

30. In the setup of Problem 3 on page 73, let P be the limit of the empirical
probability measures of X_1, ..., X_n as n → ∞. Show that P is also the
limit of the empirical probability measures of the Y_i.
31. State and prove a central limit theorem for exchangeable random variables.
You may use Theorems B.97 and 1.49.

Section 1.5:

32. Prove Corollary 1.63 on page 36 using Theorem 1.62. (Hint: Prove that
lim Y_n is measurable with respect to the tail σ-field of {X_n}_{n=1}^∞. Then
apply the Kolmogorov zero-one law B.68.)
33. Refer to Example 1.76 on page 42. Let M* = M − X_1. Take the conditional
distribution of M* given X_1 = x_1 as a prior distribution for M* after
learning that X_1 = x_1.
(a) Find the probability that X_2 = 0 (conditional on X_1 = 1) using this
new prior distribution.
(b) Find the posterior distribution of M* given X_2 = 0 (and X_1 = 1).
34. Let {X_n}_{n=1}^∞ be exchangeable Bernoulli random variables, and let

    Y = min{ n : ∑_{i=1}^n X_i ≥ 2 },

that is, Y is the time until the second success (e.g., if X_1 = 1, X_2 = 0, X_3 =
1, then Y = 3).
(a) Find the distribution of Y using the form of DeFinetti's representation
theorem in Example 1.82 on page 46.
(b) Find the conditional distribution of {X_{n+k}}_{k=1}^∞ given Y = n.
(c) Show that the distribution in part (b) is the same as the conditional
distribution of {X_{n+k}}_{k=1}^∞ given ∑_{i=1}^n X_i = 2.
35. Suppose that {X_n}_{n=1}^∞ are bounded, exchangeable random variables. Let
Θ = lim_{n→∞} ∑_{i=1}^n X_i/n, a.s. Prove that Var(Θ) = Cov(X_1, X_2).
36. Prove that the collection C_n in the proof of Theorem 1.62 is a σ-field. Also
prove that f : ℝ^∞ → ℝ is measurable with respect to C_n if and only if
f(y) = f(x) for all y that agree with x after coordinate n and such that
the first n coordinates of y are a permutation of the first n coordinates of
x.
37. Let {X_n}_{n=1}^∞ be IID nonnegative random variables with E(X_i) = ∞. Show
that ∑_{i=1}^N X_i/N = Y_N diverges to ∞ almost surely.
38.* Under the conditions of Theorem 1.59 it is possible to prove that Y_n converges
almost surely, rather than just the subsequence {Y_{n_k}}_{k=1}^∞.
(a) Let v = ∑_{i=1}^∞ i^{−3/2} and ε_{i,k} = 1/(k v i^{3/2}) for all i and k. Define
V_{i,k} = {s : |Y_{(i+1)^4}(s) − Y_{i^4}(s)| < ε_{i,k}}. Use the second to last equation
in (1.60) to prove that ∑_{i=1}^∞ Pr(V_{i,k}^c) < ∞.

(b) Let A_{k,n} = ∩_{i=n}^∞ V_{i,k}. Show that for each ε > 0 and k, there exists n_k
such that Pr(A_{k,n_k}^c) < ε/2^k.
(c) For each i, j, k with i^4 ≤ j < (i + 1)^4, define G_{i,j,k} = {s : |Y_j(s) −
Y_{i^4}(s)| > 1/k}. Use the second to last equation in (1.60) to prove
that Pr(G_{i,j,k}) is at most a fixed multiple of k²/i⁵.
(d) Define H_{i,k} = ∪_{j=i^4}^{(i+1)^4 − 1} G_{i,j,k}. Prove that Pr(H_{i,k}) is at most a fixed
multiple of k²/i².
(e) Define J_{k,n} = ∪_{i=n}^∞ H_{i,k}. Prove that for each k and ε > 0, there exists
m_k such that Pr(J_{k,m_k}) < ε/2^k.
(f) Prove that for every pair of sequences {n_k}_{k=1}^∞ and {m_k}_{k=1}^∞,

    ∩_{k=1}^∞ ( A_{k,n_k} ∩ J_{k,m_k}^c ) ⊆ {s : {Y_n(s)}_{n=1}^∞ is a Cauchy sequence}.

(g) Prove that for every 0 < c < 1, the probability that {Y_n}_{n=1}^∞ is a
Cauchy sequence is at least c, hence it must be 1.
39. State and prove a weak law of large numbers for exchangeable random
variables. You may use Theorems B.95 and 1.49. (Don't use Theorem 1.62.)
40.* Suppose that {X_n}_{n=1}^∞ is a sequence of exchangeable random variables
with finite mean. Let {n_k}_{k=1}^∞ be a subsequence of {1, 2, ...}. Prove that

    lim_{n→∞} (1/n) ∑_{i=1}^n X_i = lim_{k→∞} (1/k) ∑_{i=1}^k X_{n_i},  a.s.

(Hint: Use DeFinetti's representation theorem and the strong law of large
numbers 1.62 to express the two limits as the same function of P.)
41.* In this problem, you will prove the following generalization of a theorem of
Aldous (1981): An infinite sequence of random quantities {X_n}_{n=1}^∞ taking
values in a Borel space X is exchangeable if and only if there exists a
measurable function f : [0,1]² → X such that X = (X_1, X_2, ...) has the
same distribution as (f(Z_0, Z_1), f(Z_0, Z_2), ...), where Z_0, Z_1, ... are IID
U(0,1).
(a) Let P be the limit of the empirical distributions of {X_n}_{n=1}^∞, and
let P* be the limit of the empirical distributions of {X_{2n}}_{n=1}^∞. Use
Problem 40 on this page and Problem 31 on page 664 to show that
P* = P, a.s.
(b) Note that P is a measurable function on X^∞. Use Proposition B.145
and Lemma B.41 to show that there exists a measurable function
g : [0,1] → P such that g(Z_0) has the same distribution as P.
(c) Map X to ℝ and for each z map g(z) to a CDF F_z. Then define

    F_z^{−1}(q) = { inf{x : g(z)(x) ≥ q}   if q > 0,
                 { sup{x : g(z)(x) = 0}   if q = 0.

Find a function ψ : ℝ → X such that {ψ(F_{Z_0}^{−1}(Z_i))}_{i=1}^∞ has the same
joint distribution as {X_{2i+1}}_{i=1}^∞.

(d) Show that f(z, w) = ψ(F_z^{−1}(w)) is the desired function.

Section 1.6.1:

42. Prove Proposition 1.98 on page 56.
43. Suppose that {X_n}_{n=1}^∞ are conditionally independent with distribution P
given P = P, and P has Dir(α) distribution where α is a finite measure
on (X, B). Let K_n be the number of distinct values amongst X_1, ..., X_n.
Prove that lim_{n→∞} E(K_n)/log(n) = α(X).
44. Consider the situation described in Example 1.103 on page 59. Consider
two different choices for α_θ. The first is α_θ equal to the U(0,θ) density.
The second is Lebesgue measure on [0, θ].
(a) If no repeats occur in the data, show that the likelihood function for
Θ is the same for both choices of α_θ.
(b) Explain the differences between the two likelihood functions in the
case in which repeats do occur in the data.
45. Suppose that P has Dir(α) distribution and that X_1 and X_2 are random
variables that are conditionally IID with distribution P given P = P.
Suppose that α is absolutely continuous with respect to Lebesgue measure
with Radon–Nikodym derivative α(·). Find the joint density of (X_1, X_2)
with respect to the measure, which is two-dimensional Lebesgue measure
on A = {(x_1, x_2) : x_1 ≠ x_2} plus one-dimensional Lebesgue measure on
B = {(x_1, x_2) : x_1 = x_2}. (For definiteness, calculate one-dimensional
Lebesgue measure of a subset C ⊆ B by calculating the Lebesgue measure
of the set C_1 = {x : (x, x) ∈ C}. The most natural alternative would be to
multiply this measure by √2 so that it equaled length for line segments.)
46. Suppose that P has Dir(α) distribution, where α is a finite measure on
(ℝ, B). Let Z be the median of P, namely Z = inf{x : P((−∞, x]) ≥ 1/2}.
Show that the median of Z is the median of the distribution α/α(ℝ). (Hint:
Use the result of Problem 30 on page 664.)
47. Let P ~ Dir(α), where α is continuous (no point masses). Prove that the
posterior distribution of P is not absolutely continuous with respect to
the prior. In fact they are mutually singular (that is, the posterior assigns
probability 1 to a set to which the prior assigns probability 0).

Section 1.6.2:

48. Prove Proposition 1.111 on page 62.
49. Let P be a random probability measure on (X, B). Let X_i : S → X be
random quantities that are conditionally IID given P with distribution P.
Let μ_X be the marginal distribution of X_i. We will say that a probability
measure P is continuous if P(X_1 = X_2) = 0. Prove that P is continuous
with probability 1 if and only if Pr(X_2 = x | X_1 = x) = 0, a.s. [μ_X].

50. Let (X, B, ν) be a probability space. Assume that P is tailfree with respect
to ({π_n}_{n=1}^∞, {V_{n;B} : n ≥ 1, B ∈ π_n}). Assume that each element of each
π_n has positive ν measure. Show that E(V_{n;B}) = ν(B)/ν(p_s(B)) for all
n and B if and only if μ_X(B) = ν(B) for all B, where μ_X is defined in
(1.112).
51.* Prove that (1.107) and (1.108) are necessary and sufficient in order that a
tailfree process be a probability measure with probability 1. (Hint: That
the two conditions are necessary is straightforward once one realizes that
countable additivity of a measure μ is equivalent to lim_{n→∞} μ(D_n) = 0
when D_1 ⊇ D_2 ⊇ ··· and ∩_{n=1}^∞ D_n = ∅. Next prove that (1.107) is sufficient
for the process to be countably additive on the smallest σ-field containing
all sets in π_n for each n. Then show that the union of all these σ-fields is a
field and that (1.108) is sufficient for the process to be countably additive
on this field.)
52. Let (X_1, ..., X_{n+1}) ~ Dir_{n+1}(α_1, ..., α_{n+1}). Define Y_i = ∑_{j=1}^i X_j for i =
1, ..., n + 1, and set Z_i = Y_i/Y_{i+1} for i = 1, ..., n. Prove that the Z_i are
mutually independent with Z_i ~ Beta(α_1 + ··· + α_i, α_{i+1}).
53. Prove Corollary 1.125 on page 68.
54. Prove Corollary 1.126 on page 68.
55.* Assume that the conditions of Theorem 1.121 hold. Prove that the posterior
distribution of P as found in Theorem 1.115 is the same thing that Bayes'
theorem 1.31 gives, where we let the Θ in Bayes' theorem 1.31 be P.
56. Let X ~ P given P. Suppose that E|g(X)| < ∞, where g : X → ℝ. Prove
that E(|g(X)| | P) < ∞, a.s.
57.* Let P have a Polya tree process distribution with each set partitioned into
k sets at the next level. For each x and n, let i_n(x) be such that C_n(x) =
B_{i_n(x)}(C_{n−1}(x)). Let Dir_k(a_{n,1}(x), ..., a_{n,k}(x)) be the prior distribution
of W_n(x) in (1.128), and define S_n(x) = ∑_{i=1}^k a_{n,i}(x).
(a) Prove that Pr(X = x) = ∏_{n=1}^∞ (a_{n,i_n(x)}(x)/S_n(x)).
(b) If X_1, X_2 are conditionally IID with distribution P given P = P, prove
that

    Pr(X_2 = x | X_1 = x) = ∏_{n=1}^∞ b_{n,i_n(x)}(x) / (S_n(x) + 1),

where b_{n,i} is defined in (1.129).
(c) If inf_{x,n} S_n(x) > 0 and sup_{x,n,i} a_{n,i}(x)/S_n(x) < 1, then P is continuous
with probability 1.
(Hint: Use Problem 49 on page 80.)
58. Consider a Dirichlet process Dir(α) as a Polya tree with k = 2 and
a_{n,i}(x) = c_n for all n, i, and x.
(a) If a = α(X), show that c_n = a/2^n for all n. (Hint: Use the result
of Problem 52 above, which implies that the product of a Beta(b, c)
times an independent Beta(b + c, d) is Beta(b, c + d).)
(b) Show that a condition for P to be continuous in Problem 57 above is
violated.
CHAPTER 2
Sufficient Statistics

We now turn our attention to the broad area of statistics. This will concern
the manner in which one learns from data. In this chapter, we will study
some of the basic properties of probability models and how data can be
used to help us learn about the parameters of those models.

2.1 Definitions
2.1.1 Notational Overview
We assume that there is a probability space (S, A, μ) underlying all probability
calculations. It will be common to refer to probabilities calculated
under μ using the symbol Pr(·). Conditional probabilities will be denoted
Pr(·|·). We also assume that there is a random quantity X : S → X, where
X is called the sample space (with σ-field B), which will usually be some
subset of a Euclidean space, but will be a Borel space in any event. We
will often refer to X as the data. Let A_X stand for the sub-σ-field of A
generated by X (that is, A_X = X^{−1}(B)). Since (X, B) is a Borel space, B
contains all singletons. This will allow us to claim that random quantities
are functions of X if and only if they are measurable with respect to A_X
by Theorem A.42. Generic elements of X will usually be denoted by x, y,
or z, or x_1, x_2, ..., depending on how many we need at once.
Assume that there is a parametric family P_0 of distributions for X, and
let the parameter be Θ : S → Ω, where Ω is a parameter space with σ-field
τ. Usually, Ω will be a subset of some finite-dimensional Euclidean space,
but not always. X will usually be a vector of exchangeable coordinates, but
this is not required. When the coordinates of X are exchangeable, then the
elements of P_0 will usually be distributions that say that the coordinates
of X are IID each with distribution P_θ for some θ ∈ Ω. As mentioned
in Section 1.5.5, we will use the symbol P_θ to stand for the conditional
distribution of X given Θ = θ (a distribution on (X, B)) as well as the
conditional distribution of each coordinate of X in the case in which X has
exchangeable coordinates. We use the symbol P'_θ(·) to stand for Pr(·|Θ = θ),
a probability on (S, A_X). That is, for A ∈ A_X with A = X^{−1}(B) for some
B ∈ B,

    P'_θ(A) = P'_θ(X ∈ B) = Pr(X ∈ B | Θ = θ) = P_θ(B).

Example 2.1. Let {X_n}_{n=1}^∞ be conditionally IID random variables with N(θ,1)
distribution given Θ = θ. Let X = (X_1, ..., X_n). If B ∈ B¹, the one-dimensional
Borel σ-field, then

    P_θ(B) = ∫_B (1/√(2π)) exp(−(x − θ)²/2) dx = P'_θ(X_i ∈ B) = Pr(X_i ∈ B | Θ = θ).

Similarly, if C ∈ B^n, the n-dimensional Borel σ-field, then

    P_θ(C) = ∫_C (2π)^{−n/2} exp(−(1/2) ∑_{i=1}^n (x_i − θ)²) dx_1 ··· dx_n
           = P'_θ(X ∈ C) = Pr(X ∈ C | Θ = θ).

Let μ_Θ be the distribution of Θ, and let D ∈ τ and B ∈ B. Let A =
X^{−1}(B) and E = Θ^{−1}(D). Then we have

    Pr(X ∈ B, Θ ∈ D) = μ(A ∩ E) = ∫_E Pr(A | Θ)(s) dμ(s) = ∫_D P_θ(B) dμ_Θ(θ),

where the last equality follows from Theorem A.81.


Example 2.2 (Continuation of Example 2.1). Suppose that e '" N(O, 1). Then,
for each D E T and B E 8, Pr(e E D,X E B) equals

2.1.2 Sufficiency
A statistic is virtually any measurable function of the data, X.
Definition 2.3. Let (T, C) be a measurable space such that C contains all
singletons. If T : X → T is measurable, then T is called a statistic.
It appears that almost anything can be a statistic. The only requirement
is that it be a measurable function of X to a space in which singletons
are measurable sets, such as a Borel space. It will prove convenient, when
T : X → T, to refer to T as a random quantity. When we do this, we will
mean the random quantity T(X) : S → T. When we need to refer to the
specific value that T assumes when X = x, we write T(x). We will let P_{θ,T}
stand for the probability measure induced on the space (T, C) from P_θ by
the function T. In this way P'_θ(T(X) ∈ C) = P_{θ,T}(C). We will often write
P'_θ(T ∈ C) to stand for this quantity as well.
There is a special class of statistics that are very useful in statistical
inference. These are statistics that provide a summary of the data sufficient
for performing all inferences of interest.
Definition 2.4. Let P_0 be a parametric family of distributions on (X, B).
Let (Ω, τ) be a parameter space and Θ : P_0 → Ω be the parameter. Let
T : X → T be a statistic. We say that T is a sufficient statistic for Θ (in the
Bayesian sense) if, for every prior μ_Θ, there exist versions of the posteriors
μ_{Θ|X} and μ_{Θ|T} such that, for every B ∈ τ, μ_{Θ|X}(B|x) = μ_{Θ|T}(B|T(x)), a.s.
[μ_X], where μ_X is the marginal distribution of X.
It appears that once one has settled on a parametric family of distributions
for the data X, one need only calculate a sufficient statistic, because the
posterior distribution of Θ given the sufficient statistic is the same as given
X, no matter which prior distribution one uses. So long as one sticks with
the chosen parametric family, the sufficient statistic is sufficient for making
inference about Θ, and thereby about future observations (conditionally
independent of X given Θ) through (1.37) on page 18.¹
Example 2.5. Let {X_n}_{n=1}^∞ be exchangeable Bernoulli random variables, and
let P_0 be the set of all IID distributions (the largest parametric family available).
Let X = (X_1, ..., X_n), and let P_θ be the distribution that says the coordinates of
X are IID Ber(θ) random variables. We have already seen (Theorem 1.56) that
if the prior is μ_Θ, then the posterior for Θ has Radon–Nikodym derivative

    dμ_{Θ|X}/dμ_Θ(θ|x) = θ^t (1 − θ)^{n−t} / ∫ ψ^t (1 − ψ)^{n−t} dμ_Θ(ψ),   where t = ∑_{i=1}^n x_i.

Next, treat T(X) = ∑_{i=1}^n X_i as the data. The density of T given Θ = θ (with
respect to counting measure on the nonnegative integers) is f_{T|Θ}(t|θ) = (n choose t) θ^t (1 −
θ)^{n−t} for t = 0, ..., n. It follows from Bayes' theorem 1.31 that the posterior given
T = t = ∑_{i=1}^n x_i has derivative

    dμ_{Θ|T}/dμ_Θ(θ|t) = (n choose t) θ^t (1 − θ)^{n−t} / ∫ (n choose t) ψ^t (1 − ψ)^{n−t} dμ_Θ(ψ)
                       = θ^t (1 − θ)^{n−t} / ∫ ψ^t (1 − ψ)^{n−t} dμ_Θ(ψ).

This is the same as the other posterior, hence T is sufficient according to Definition 2.4.

¹See Problem 24 on page 141 for an example of observations not conditionally
independent given Θ for which a sufficient statistic is not sufficient for making
predictive inference.

In Example 2.5, for every prior the posterior distribution of Θ given
X = x was a function of T(x). The following lemma says that this fact is
enough to conclude that T is sufficient.
Lemma 2.6.² Let T be a statistic and let B_T be the sub-σ-field of B generated
by T. Then T is sufficient in the Bayesian sense if and only if, for
every prior distribution μ_Θ, there exists a version of the posterior distribution
given X, μ_{Θ|X}, such that for all B ∈ τ, μ_{Θ|X}(B|·) is measurable with
respect to B_T.
PROOF. This result is an immediate consequence of Theorem B.73 if we
make the appropriate correspondences between the quantities appearing in
Theorem B.73 and those appearing in Lemma 2.6. □
Example 2.7. Let P_0 be the set of all IID exponential distributions, and let P_θ
say that {X_n}_{n=1}^∞ are IID Exp(θ) random variables. If μ_Θ is the prior distribution
and X = (X_1, ..., X_n), then the posterior has density

    dμ_{Θ|X}/dμ_Θ(θ|x) = θ^n exp(−θ ∑_{i=1}^n x_i) / ∫ ψ^n exp(−ψ ∑_{i=1}^n x_i) dμ_Θ(ψ)

with respect to the prior. Notice that this is a function of T(x) = ∑_{i=1}^n x_i, hence
Lemma 2.6 says that T is sufficient.
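A small numerical illustration of Lemma 2.6 in the setting of Example 2.7 (the grid discretization and the uniform "prior" over the grid are our own choices, made only so the posterior is computable in a few lines): two data sets with the same value of T(x) = ∑ x_i produce identical posteriors.

```python
from math import exp

def posterior_on_grid(data, grid):
    """Posterior probabilities of Θ over a finite grid of θ values, using
    a uniform prior on the grid.  The Exp(θ) likelihood
    θ**n * exp(-θ * Σ x_i) depends on the data only through n and
    T(x) = Σ x_i, which is exactly why T is sufficient."""
    n, s = len(data), sum(data)
    weights = [t ** n * exp(-t * s) for t in grid]
    total = sum(weights)
    return [w / total for w in weights]

grid = [0.05 * j for j in range(1, 200)]
d1 = [1.0, 2.0, 3.0]   # T(x) = 6
d2 = [0.5, 0.5, 5.0]   # a very different data set, but T(x) = 6 as well
p1 = posterior_on_grid(d1, grid)
p2 = posterior_on_grid(d2, grid)
print(p1 == p2)  # True
```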
There is a more commonly used definition of sufficient statistic, which
does not refer to prior distributions. Loosely speaking, this definition says
that T is sufficient if the conditional distribution of X given Θ = θ and T
does not depend on θ.
Definition 2.8. Let P_0 be a parametric family of distributions on (X, B).
Let (Ω, τ) be a parameter space and Θ : P_0 → Ω be the parameter. Let
T : X → T be a statistic. Suppose that there exist versions of P_θ(·|T) and
a function r : B × T → [0,1] such that r(·, t) is a probability on (X, B) for
every t ∈ T, r(A, ·) is measurable for every A ∈ B, and for every θ ∈ Ω and
every B ∈ B,

    P_θ(B | T = t) = r(B, t),  a.e. [P_{θ,T}].

Then we say that T is a sufficient statistic for Θ (in the classical sense).
This definition says that a statistic T is sufficient if and only if, after one
observes the value T(X) = t, one can generate data X' with conditional
distribution r(·, t), and then the conditional distribution of X' given Θ is
the same as the conditional distribution of X given Θ. It will be common
to use the symbol E(·|T) in place of E_θ(·|T) when T is sufficient. In such a
case, if E_θ|g(X)| < ∞ for all θ, then E(g(X) | T = t) = ∫ g(x) dr(x, t).

²This lemma is used in the proofs of Lemma 2.15 and Theorem 2.29.

Example 2.9 (Continuation of Example 2.5; see page 84). The X_i are IID
Ber(θ) given Θ = θ, and X = (X_1, ..., X_n). Let T(x) = ∑_{i=1}^n x_i. We need
to compute P'_θ(X = x | T(X) = t) for all θ and all x such that t = T(x). Since
both X and T are discrete random variables,

    P'_θ(X = x | T(X) = t) = θ^t (1 − θ)^{n−t} / [ (n choose t) θ^t (1 − θ)^{n−t} ] = (n choose t)^{−1}.

Set r(·, t) to be the distribution that is uniform on the set of all x such that
∑_{i=1}^n x_i = t (probability (n choose t)^{−1} for each such x). We now see that T is sufficient
according to Definition 2.8.
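The computation in Example 2.9 can be checked by brute-force enumeration (a sketch with our own function names):

```python
from itertools import product
from math import comb

def conditional_given_T(n, t, theta):
    """P'_theta(X = x | T(X) = t) for X = (X_1,...,X_n) IID Ber(theta):
    the joint probability of each x with sum t, divided by the
    probability that the sum equals t."""
    joint = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
             for x in product((0, 1), repeat=n)}
    p_t = sum(p for x, p in joint.items() if sum(x) == t)
    return {x: p / p_t for x, p in joint.items() if sum(x) == t}

# The conditional distribution is uniform with probability 1/C(4,2) = 1/6
# on the six arrangements, whatever the value of theta:
for theta in (0.2, 0.5, 0.9):
    d = conditional_given_T(4, 2, theta)
    assert len(d) == comb(4, 2)
    assert all(abs(v - 1 / comb(4, 2)) < 1e-12 for v in d.values())
print("uniform for every theta")
```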
Example 2.10 (Continuation of Example 2.7; see page 85). If X = (X_1, ..., X_n)
with the X_i having Exp(θ) distribution given Θ = θ, let T(x) = ∑_{i=1}^n x_i. We
need to find the conditional distribution of X given T = t. By Corollary B.55,
the conditional distribution of X given T = t and Θ = θ has density

    f_{X|Θ}(x_1, ..., x_n | θ) / f_{T|Θ}(t | θ)
        = θ^n exp(−θt) / [ (1/(n−1)!) θ^n t^{n−1} exp(−tθ) ] = (n − 1)! / t^{n−1}

with respect to a measure ν_{X|T}(·|t), which does not depend on θ. Since this
distribution is the same for all θ, T is sufficient in the classical sense.
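A quick numerical check of Example 2.10 (our own sketch): the ratio f_{X|Θ}/f_{T|Θ} equals (n − 1)!/t^{n−1}, which is free of θ.

```python
from math import exp, factorial

def density_ratio(x, theta):
    """f_{X|Θ}(x | θ) / f_{T|Θ}(t | θ) for IID Exp(θ) coordinates, where
    T = Σ x_i has the Gamma(n, θ) density θ^n t^(n-1) e^(-θt)/(n-1)!."""
    n, t = len(x), sum(x)
    f_x = theta ** n * exp(-theta * t)
    f_t = theta ** n * t ** (n - 1) * exp(-theta * t) / factorial(n - 1)
    return f_x / f_t

x = [0.7, 1.1, 0.4, 2.3]
vals = [density_ratio(x, th) for th in (0.1, 1.0, 5.0)]
print(all(abs(v - vals[0]) < 1e-12 for v in vals))  # True
```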

In general, sufficient statistics need not be much simpler than the entire
data set.
Definition 2.11. Let Xl.' .. , Xn be random variables. Define X(l) to be'
min{Xl. ... ,Xn}' and for k > 1, define

The vector (X(l),'" ,X(n is called the order statistics of Xl, .. ,Xn.

Proposition 2.12. Let X = (Xl. ... , X n ), and suppose that Xl!' .. , Xn


are exchangeable random variables. Then the order statistics are sufficient.
Example 2.13. Let Xl, ... , Xn be conditionally lID with Cauchy distribution
Cau(8, 1) given e = 8. In this case, one can show that all sufficient statistics are
at least as complicated as the order statistics. To do this, however, some theorems
from Section 2.1.3 will prove useful.

In all cases of interest to us, the two definitions of sufficient statistic are
equivalent.
Theorem 2.14.³ Let (𝒯, 𝒞) be a Borel space, and let T : 𝒳 → 𝒯 be a
statistic. The following are both true:

³The proof of part 1 is reminiscent of the proof of Theorem 1 of Halmos and
Savage (1949). The proof of part 2 is due to Blackwell and Ramamoorthi (1982).
2.1. Definitions 87

1. If there is a σ-finite measure ν such that for all θ, P_θ ≪ ν, and T is
sufficient in the Bayesian sense, then T is sufficient in the classical
sense.
2. If T is sufficient in the classical sense, then T is sufficient in the
Bayesian sense.
The proof of part 1 of Theorem 2.14 requires a lemma that will also be
used in the proof of Theorem 2.21.
Lemma 2.15. Let ν be a σ-finite measure such that P_θ ≪ ν for all θ. If T
is sufficient in the Bayesian sense, then there exists a probability measure
ν* such that P_θ ≪ ν* for all θ and dP_θ/dν*(x) is a function h(θ, T(x)).
Also, ν* ≪ ν.
PROOF. Let dP_θ/dν(x) = f_{X|Θ}(x|θ). Since each P_θ ≪ ν, Theorem A.78
says that there exist countable sequences {θ_i}_{i=1}^∞ and {c_i}_{i=1}^∞ such that
c_i ≥ 0, Σ_{i=1}^∞ c_i = 1, and P_θ ≪ ν* for every θ ∈ Ω, where ν* = Σ_{i=1}^∞ c_i P_{θ_i}.
Note that ν* ≪ ν. For θ ∈ Ω such that θ is not one of the θ_i, specify the
following prior distribution over Ω: Pr(Θ = θ) = 1/2, and Pr(Θ = θ_i) =
c_i/2, for i = 1, 2, …. The posterior probability of Θ = θ given X = x is

Pr(Θ = θ|X = x) = f_{X|Θ}(x|θ) / [ f_{X|Θ}(x|θ) + Σ_{i=1}^∞ c_i f_{X|Θ}(x|θ_i) ].

According to Lemma 2.6, for each θ, this is a function of T(x). That is,
there exists a function h such that, for each θ,

f_{X|Θ}(x|θ) / Σ_{i=1}^∞ c_i f_{X|Θ}(x|θ_i) = h(θ, T(x)).   (2.16)

By the chain rule A.79, it can be seen that the left-hand side of (2.16) is
equal to dP_θ/dν*(x).
For θ ∈ {θ_i}_{i=1}^∞, replace the prior above by one that has Pr(Θ = θ_i) = c_i
for all i. We still have that (2.16) is dP_θ/dν*(x). Also, Lemma 2.6 says that
Pr(Θ = θ_j|X = x) is still a function of T(x) for all j. But Pr(Θ = θ_j|X = x)
is just c_j times the left-hand side of (2.16). □
PROOF OF THEOREM 2.14. For part 1, define the function r to be the
conditional probability function on 𝒳 given T = t calculated from the
probability ν* in Lemma 2.15. That is, for every C ∈ 𝒞 and every B ∈ ℬ,

ν*(T^{−1}(C) ∩ B) = ∫_C r(B, t) dν*_T(t),

where ν*_T is the probability on (𝒯, 𝒞) induced by T from ν*. It is easy to
see that this implies that for every integrable g : 𝒯 → ℝ and B ∈ ℬ,

∫ g(T(x)) I_B(x) dν*(x) = ∫ g(t) r(B, t) dν*_T(t).   (2.17)

We now wish to show that this function r can serve as the conditional
distribution of X given T = t and Θ = θ for all θ. To see that this is
true, note that for all B ∈ ℬ, P'_θ(B|T = t) is any function m : 𝒯 → [0, 1]
satisfying

P'_θ(X ∈ B, T(X) ∈ C) = ∫_C m(t) dP_{θ,T}(t), for all C ∈ 𝒞,   (2.18)

where P_{θ,T} is the probability on (𝒯, 𝒞) induced by T from P_θ. According
to Lemma 2.15, we have that

dP_{θ,T}/dν*_T(t) = h(θ, t).   (2.19)

The left-hand side of (2.18) can be written as

∫ I_B(x) I_C(T(x)) h(θ, T(x)) dν*(x) = ∫ I_C(t) r(B, t) h(θ, t) dν*_T(t)
= ∫_C r(B, t) dP_{θ,T}(t),

where the first equality follows from (2.17) and the second follows from
(2.19). It follows that r(B, t) can play the role of m(t) in (2.18), and the
proof of part 1 is complete.
To prove part 2, let r be as in Definition 2.8, and let μ_Θ be a prior for
Θ. By the law of total probability B.70 (conditional on T), the conditional
distribution of X given T = t, μ_{X|T}(·|t), is given for every B ∈ ℬ by

μ_{X|T}(B|T = t) = ∫_Ω P_θ(B|T = t) dμ_{Θ|T}(θ|t)
= ∫_Ω r(B, t) dμ_{Θ|T}(θ|t) = r(B, t),

where μ_{Θ|T} is the posterior distribution of Θ given T. Hence, we have
that the conditional distribution of X given T and Θ is the conditional
distribution of X given T. According to Theorem B.64, this means that X
and Θ are conditionally independent given T. According to Theorem B.61,
we have that the posterior given T and X is the same as the posterior given
T. Corollary B.74 says that the posterior given T and X is the same as the
posterior given X. □
Blackwell and Ramamoorthi (1982) give an example in which the extra
condition in part 1 of Theorem 2.14 fails and there exists a statistic suffi-
cient in the Bayesian sense but not sufficient in the classical sense. There
is, however, the following result.

Theorem 2.20. If T is sufficient in the Bayesian sense, then for every
prior distribution μ_Θ, there exists a version of the conditional distribution
of X given T, μ_{X|T}, such that for every B ∈ ℬ,

μ_Θ({θ : P_θ(B|T = t) = μ_{X|T}(B|t), a.e. [μ_T]}) = 1,

where μ_T is the marginal distribution of T.


PROOF. Let μ_Θ be a prior distribution for Θ. Let μ_{X|T}(B|t) be the con-
ditional probability of {X ∈ B} given T = t calculated from the marginal
distribution of X and T (not conditional on Θ). Since T is sufficient in
the Bayesian sense, the conditional distribution of Θ given T is the same
as the conditional distribution of Θ given X (and T). This means that Θ
is independent of X given T according to Theorem B.64. Theorem B.61
then says that the conditional distribution of X given Θ and T is the same
as the conditional distribution given T, which means that for all B ∈ ℬ,
P_θ(B|T = t) = μ_{X|T}(B|t), a.s. with respect to the joint distribution of Θ
and T. The result now follows. □
Theorem 2.20 says that every prior distribution assigns probability 1 to a
subset of the parameter space which, if it were the entire parameter space,
would allow us to conclude that sufficiency in the Bayesian sense implied
sufficiency in the classical sense without the added condition of absolute
continuity.
There is an easier way to characterize sufficiency in the case in which all
conditional distributions given Θ are absolutely continuous with respect to
a single σ-finite measure.
Theorem 2.21 (Fisher–Neyman factorization theorem).⁴ Assume
that {P_θ : θ ∈ Ω} is a parametric family such that P_θ ≪ ν (σ-finite) for all
θ and dP_θ/dν(x) = f_{X|Θ}(x|θ). Then T(X) is sufficient for Θ if and only if
there are functions m_1 and m_2 such that

f_{X|Θ}(x|θ) = m_1(x) m_2(T(x), θ), for all θ.

PROOF. First, we do the "if" part. Let f_{X|Θ}(x|θ) = m_1(x) m_2(T(x), θ) for
all θ, and let μ_Θ be an arbitrary prior for Θ. Bayes' theorem 1.31 says that
the posterior distribution of Θ given X = x is absolutely continuous with
respect to the prior, and the Radon–Nikodym derivative is

dμ_{Θ|X}/dμ_Θ(θ|x) = m_1(x) m_2(T(x), θ) / ∫_Ω m_1(x) m_2(T(x), ψ) dμ_Θ(ψ)
= m_2(T(x), θ) / ∫_Ω m_2(T(x), ψ) dμ_Θ(ψ),

which is a function of T(x). It follows from Lemma 2.6 that T(X) is suffi-
cient.

4Versions of this theorem originated with Fisher (1922, 1925) and Neyman
(1935).

For the "only if" part, assume that T(X) is sufficient. According to
Lemma 2.15, there is a measure ν* such that P_θ ≪ ν* for all θ, dP_θ/dν*(x)
is a function h of θ and T(x), and ν* ≪ ν. It follows that

f_{X|Θ}(x|θ) = h(θ, T(x)) dν*/dν(x).

If we set m_1(x) equal to the second factor on the right and m_2(T(x), θ)
equal to the first factor on the right, we are done. □

Example 2.22. Let P_θ say that {X_n}_{n=1}^∞ are IID U(0, θ), Ω = (0, ∞), and let
X = (X_1, …, X_n). Then

f_{X|Θ}(x|θ) = (1/θ^n) I_{[0,∞)}(min x_i) I_{[0,θ]}(max x_i).

By Theorem 2.21, T(X) = max_i X_i is sufficient. If μ_Θ is a prior, then the posterior
has derivative (dμ_{Θ|T}/dμ_Θ)(θ|t) proportional to I_{[t,∞)}(θ)/θ^n.
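As a numerical sketch of this sufficiency (our illustration; the grid prior and data values are hypothetical), a posterior computed over a grid of θ values from the full U(0, θ) likelihood depends on the sample only through the sample size and max x_i:

```python
def posterior(data, grid):
    # Unnormalized posterior over a grid of theta values under a flat grid
    # prior; the U(0, theta) likelihood is theta^-n when theta >= max(data).
    n, m = len(data), max(data)
    w = [(th ** -n if th >= m else 0.0) for th in grid]
    s = sum(w)
    return [x / s for x in w]

grid = [0.5 + 0.1 * k for k in range(30)]
x = [0.3, 1.7, 0.9, 1.2]
y = [1.7, 1.7, 1.7, 0.1]  # different sample, same size and same maximum
assert posterior(x, grid) == posterior(y, grid)
```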
The case not covered by Theorem 2.21, in which not all P_θ are absolutely
continuous with respect to the same σ-finite measure, is more complicated.
One case in which the conclusion to Theorem 2.21 still applies is that of
discrete random variables.
Proposition 2.23. If {P_θ : θ ∈ Ω} is a parametric family such that each
P_θ is a discrete distribution, then T(X) is sufficient in the classical sense
for Θ if and only if there are functions m_1 and m_2 such that

Pr(X = x|Θ = θ) = m_1(x) m_2(T(x), θ), for all θ.

This proposition is needed only to handle cases in which C = ∪_{θ∈Ω} {x :
P_θ(X = x) > 0} is an uncountable set; otherwise all P_θ are absolutely continuous
with respect to counting measure on C.
The following lemma tells us that when a statistic T is sufficient and the
distributions of X are all dominated by a common σ-finite measure, then
we can replace X by T and the distributions of T are still dominated by a
σ-finite measure. In fact, we can give a formula for the density of T.
Lemma 2.24.⁵ Assume the conditions of Theorem 2.21, and assume that
T : 𝒳 → 𝒯 is sufficient. Then there exists a measure ν_T on (𝒯, 𝒞) such
that P_{θ,T} ≪ ν_T and dP_{θ,T}/dν_T(t) = m_2(t, θ).
PROOF. Apply Theorem A.78 to find a probability ν* = Σ_{i=1}^∞ c_i P_{θ_i} such
that P_θ ≪ ν* for all θ. Then

dP_θ/dν*(x) = f_{X|Θ}(x|θ) / Σ_{i=1}^∞ c_i f_{X|Θ}(x|θ_i) = m_2(T(x), θ) / Σ_{i=1}^∞ c_i m_2(T(x), θ_i).
5This lemma is used in the proof of Lemma 2.58.

Since this density is a function of T(x), we can write

P_{θ,T}(B) = ∫_{T^{−1}(B)} dP_θ/dν*(x) dν*(x)
= ∫_{T^{−1}(B)} [ m_2(T(x), θ) / Σ_{i=1}^∞ c_i m_2(T(x), θ_i) ] dν*(x)
= ∫_B [ m_2(t, θ) / Σ_{i=1}^∞ c_i m_2(t, θ_i) ] dν*_T(t),

where ν*_T is the measure on (𝒯, 𝒞) induced by T from ν*. Define ν_T by
dν_T/dν*_T(t) = [ Σ_{i=1}^∞ c_i m_2(t, θ_i) ]^{−1} to complete the proof. □
In many of the examples that we have considered and will consider, there
exists a sufficient statistic whose dimension is the same for all sample sizes.
In such cases, there might exist a particularly convenient family of prior
distributions available for the parameter.
Theorem 2.25. Suppose that there exists a sufficient statistic of fixed di-
mension k for all sample sizes. That is, suppose that there exist functions
T_n (with image 𝒯 ⊆ ℝ^k for all n), m_{1,n}, and m_{2,n} such that

f_{X_1,…,X_n|Θ}(x_1, …, x_n|θ) = m_{1,n}(x_1, …, x_n) m_{2,n}(T_n(x_1, …, x_n), θ).

Suppose also that for all n and all t ∈ 𝒯,

0 < c(t, n) = ∫_Ω m_{2,n}(t, θ) dλ(θ) < ∞,

for some measure λ. Then the family of densities with respect to λ

𝒫 = { m_{2,n}(t, ·)/c(t, n) : t ∈ 𝒯, n = 1, 2, … }

forms a conjugate family in the sense that the posterior density with respect
to λ is a member of this class if the prior is a member of the class.
PROOF.⁶ Let f_Θ(θ) = m_{2,ℓ}(t, θ)/c(t, ℓ) for some t ∈ 𝒯 and some ℓ. Let
(y_1, …, y_ℓ) be such that T_ℓ(y_1, …, y_ℓ) = t. Suppose that the data are
X_1 = x_1, …, X_n = x_n for some n. We note that

f_{X_1,…,X_{ℓ+n}|Θ}(x_1, …, x_n, y_1, …, y_ℓ|θ)
= m_{1,n}(x_1, …, x_n) m_{2,n}(T_n(x_1, …, x_n), θ)
× m_{1,ℓ}(y_1, …, y_ℓ) m_{2,ℓ}(t, θ)   (2.26)
= m_{1,n+ℓ}(x_1, …, x_n, y_1, …, y_ℓ) m_{2,n+ℓ}(t′, θ),

where t′ = T_{n+ℓ}(x_1, …, x_n, y_1, …, y_ℓ). The posterior density of Θ with
respect to the measure λ would be

f_{Θ|X_1,…,X_n}(θ|x_1, …, x_n)
= m_{2,ℓ}(t, θ) m_{1,n}(x_1, …, x_n) m_{2,n}(T_n(x_1, …, x_n), θ)
/ ∫_Ω m_{2,ℓ}(t, ψ) m_{1,n}(x_1, …, x_n) m_{2,n}(T_n(x_1, …, x_n), ψ) dλ(ψ)
= m_{2,n+ℓ}(t′, θ) / c(t′, n + ℓ),

by (2.26). □

⁶This proof follows the presentation of Section 9.3 of DeGroot (1970).
The family of prior densities 𝒫 and the corresponding distributions is
called a natural conjugate family of priors.
Example 2.27. Let {X_n}_{n=1}^∞ be a sequence of conditionally IID Ber(θ) random
variables given Θ = θ. Let T_n = Σ_{i=1}^n X_i. Then m_{2,n}(t, θ) = θ^t(1 − θ)^{n−t} and
c(t, n) = t!(n − t)!/(n + 1)!. The family of natural conjugate priors is a subset of
the family of Beta distributions. In particular, m_{2,n}(t, θ)/c(t, n) is the Beta(t +
1, n − t + 1) density as a function of θ. Actually, the entire collection of Beta
distributions has the property that if the prior is in the Beta family, then the
posterior is as well. Theorem 2.25 only tells us that Beta distributions with integer
parameters are natural conjugate.
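In computational terms, the Beta–Bernoulli update reduces to bookkeeping on the sufficient statistic: a Beta(a, b) prior and t successes in n trials give a Beta(a + t, b + n − t) posterior. A minimal sketch (function name ours):

```python
def beta_update(a, b, data):
    """Posterior Beta parameters after observing 0/1 Bernoulli data."""
    t = sum(data)      # sufficient statistic T_n
    n = len(data)
    return a + t, b + n - t

# Beta(1, 1) (uniform) prior; the update depends on the data only through (t, n)
assert beta_update(1, 1, [1, 0, 1, 1]) == (4, 2)
assert beta_update(1, 1, [1, 1, 1, 0]) == (4, 2)  # same t and n, same posterior
```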

2.1.3 Minimal and Complete Sufficiency


The entire data set is always sufficient, so there is not always a savings in
using a sufficient statistic. However, there are often times when a simpler
statistic than the entire data set is also sufficient. There is a sense in which
a sufficient statistic can be as simple as possible.
Definition 2.28. A sufficient statistic T : 𝒳 → 𝒯 is called minimal suf-
ficient if for every sufficient statistic U : 𝒳 → 𝒰, there is a measurable
function g : 𝒰 → 𝒯 such that T = g(U), a.s. [P_θ] for all θ.
Clearly, a bimeasurable function of a minimal sufficient statistic is also
minimal sufficient. The following theorem says that the mapping from data
values to the likelihood function is minimal sufficient.
Theorem 2.29. Suppose that there exist versions of f_{X|Θ}(·|θ) for every θ
and a measurable function⁷ T : 𝒳 → 𝒯 such that T(y) = T(x) if and only
if y ∈ D(x), where D(x) is the set

{y ∈ 𝒳 : f_{X|Θ}(y|θ) = f_{X|Θ}(x|θ) h(x, y), ∀θ and some h(x, y) > 0};

then T(X) is a minimal sufficient statistic.


PROOF. First, we show that the distinct sets D(x) form a partition. If
y ∈ D(x), then we can set h(y, x) = 1/h(x, y) and we get that f_{X|Θ}(x|θ) =
f_{X|Θ}(y|θ) h(y, x) for all θ. So x ∈ D(y). With h(x, x) = 1 for all x, we see
that x ∈ D(x), so the distinct D(x) form a partition.

7In most examples it is relatively easy to construct the function T, but you
can actually prove that such a function exists in general. See Problem 15 on
page 139.

Next, we show that T(X) is sufficient. If μ_Θ is an arbitrary prior, the
posterior after learning X = x is absolutely continuous with respect to the
prior with Radon–Nikodym derivative

dμ_{Θ|X}/dμ_Θ(θ|x) = f_{X|Θ}(x|θ) / ∫ f_{X|Θ}(x|ψ) dμ_Θ(ψ),

according to Bayes' theorem 1.31. If y ∈ D(x), the posterior after learning
X = y has Radon–Nikodym derivative

dμ_{Θ|X}/dμ_Θ(θ|y) = f_{X|Θ}(x|θ) h(x, y) / ∫ f_{X|Θ}(x|ψ) h(x, y) dμ_Θ(ψ)
= f_{X|Θ}(x|θ) / ∫ f_{X|Θ}(x|ψ) dμ_Θ(ψ),

which is the same as the posterior after learning X = x. Hence, the posterior
is a function of T(x). Lemma 2.6 says that T(X) is sufficient.
Finally, we prove that T(X) is minimal. Let U(X) be another sufficient
statistic. Use the Fisher–Neyman factorization theorem 2.21 to write

f_{X|Θ}(x|θ) = m_1(x) m_2(U(x), θ).   (2.30)

Since P_θ({x : m_1(x) = 0}) = 0 for all θ, we can safely assume that m_1(x) >
0 for all x. We need to show that if U(x) = U(y) for some x, y ∈ 𝒳, then
y ∈ D(x). It would then follow that T(y) = T(x), and this would make T
a function of U. Suppose that U(x) = U(y). Use (2.30) to write

f_{X|Θ}(y|θ) = m_1(y) m_2(U(y), θ) = [m_1(y)/m_1(x)] m_1(x) m_2(U(x), θ)
= [m_1(y)/m_1(x)] f_{X|Θ}(x|θ),

for all θ. With h(x, y) = m_1(y)/m_1(x), we see that y ∈ D(x). □


Example 2.31. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID Ber(θ) random
variables. Let X = (X_1, …, X_n). Then

f_{X|Θ}(x|θ) = θ^{Σ_{i=1}^n x_i} (1 − θ)^{n − Σ_{i=1}^n x_i},

for all x_i ∈ {0, 1}. So the ratio

f_{X|Θ}(x|θ)/f_{X|Θ}(y|θ) = θ^{Σ_{i=1}^n x_i − Σ_{i=1}^n y_i} (1 − θ)^{Σ_{i=1}^n y_i − Σ_{i=1}^n x_i}

is the same for all θ if and only if Σ_{i=1}^n x_i = Σ_{i=1}^n y_i, in which case h(x, y) = 1
and D(x) = {y : Σ_{i=1}^n y_i = Σ_{i=1}^n x_i}. Then T(X) = Σ_{i=1}^n X_i is the minimal
sufficient statistic.

Example 2.32. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID U(0, θ) random
variables. Let X = (X_1, …, X_n) and suppose that the sample space is ℝ^{+n}.
Then f_{X|Θ}(x|θ) = θ^{−n} I_{[0,θ]}(max x_i). Now suppose that

θ^{−n} I_{[0,θ]}(max x_i) = h(x, y) θ^{−n} I_{[0,θ]}(max y_i),

for all θ. This is true if and only if

I_{[0,θ]}(max x_i) = h(x, y) I_{[0,θ]}(max y_i),

for all θ, which in turn is true if and only if max x_i = max y_i, in which case
h(x, y) = 1 and D(x) = {y : max y_i = max x_i}. Then T(X) = max X_i is the
minimal sufficient statistic.
Example 2.33. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID with density

f_{X_1|Θ}(y|θ) = (θ^y/y!) / Σ_{t=0}^θ (θ^t/t!), for y = 0, …, θ,

with respect to counting measure on the positive integers. Here Ω is the set of
positive integers. If X = (X_1, …, X_n), then

f_{X|Θ}(x|θ) = θ^x / [ ∏_{i=1}^n (x_i!) (Σ_{t=0}^θ θ^t/t!)^n ], if max x_i ≤ θ,

where x = Σ_{i=1}^n x_i. Set h(x, y) = ∏_{i=1}^n (y_i!)/∏_{i=1}^n (x_i!) for all x and y. It follows
that f_{X|Θ}(x|θ) = h(x, y) f_{X|Θ}(y|θ) for all θ if and only if Σ_{i=1}^n x_i = Σ_{i=1}^n y_i and
max x_i = max y_i. Set T(x) = (Σ_{i=1}^n x_i, max x_i) and note that x ∈ D(y) if and
only if T(x) = T(y). So T(X) is minimal sufficient.

There are some cases in which we need a sufficient statistic to satisfy an


additional property. These situations are difficult to describe at the present
time, but they arise in several places later in this text (in Chapters 3, 4
and 5, in particular).
Definition 2.34. A statistic T is complete if for every measurable, real-
valued function g, E_θ(g(T)) = 0 for all θ ∈ Ω implies g(T) = 0, a.s. [P_θ] for
all θ.
A statistic T is boundedly complete if for every bounded, measurable,
real-valued function g, E_θ(g(T)) = 0 for all θ ∈ Ω implies g(T) = 0, a.s.
[P_θ] for all θ.
Example 2.35. Suppose that P_θ says that T ~ Poi(θ). Suppose also that

E_θ(g(T)) = Σ_{t=0}^∞ g(t) θ^t exp(−θ)/t! = 0

for all θ. Then Σ_{t=0}^∞ g(t)θ^t/t! = 0 for all θ. This expression is a power series
representation of the analytic function h(θ) = 0. Since power series for analytic
functions are unique, it must be that g(t)/t! = 0 for all nonnegative integers t.
There are many functions g with this property, such as g(t) = sin(2πt). All such
g satisfy P_θ(g(T) = 0) = 1 for all θ. So T is complete.
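The point that g need not be identically zero, only zero on the support, is easy to check numerically for the g above (a small sketch of ours, not from the text):

```python
import math

# g(t) = sin(2*pi*t) is not the zero function, yet it vanishes at every
# nonnegative integer, which is the entire support of a Poisson variable;
# hence g(T) = 0 almost surely under every P_theta.
g = lambda t: math.sin(2 * math.pi * t)
assert all(abs(g(t)) < 1e-9 for t in range(1000))
assert abs(g(0.25) - 1.0) < 1e-12  # nonzero off the integers
```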
Theorem 2.36 (Bahadur's theorem).⁸ If U is a boundedly complete
sufficient statistic and finite-dimensional, then it is minimal sufficient.

8See Bahadur (1957).



PROOF. Let T be another sufficient statistic. We need to show that U is a
function of T. Express U = (U_1(X), …, U_k(X)), where k is the dimension
of U. Let V_i(U) = (1 + exp(U_i))^{−1}, so that V = (V_1, …, V_k) is a one-to-one
measurable function of U and each V_i is bounded. Define

H_i(t) = E_θ(V_i(U)|T = t),   L_i(u) = E_θ(H_i(T)|U = u).

Since U and T are sufficient, these conditional means given Θ = θ do not
depend on θ. Since the V_i are bounded, so are the H_i and L_i. Note that

E_θ(V_i(U)) = E_θ(E_θ(V_i(U)|T)) = E_θ(H_i(T))
= E_θ(E_θ(H_i(T)|U)) = E_θ(L_i(U)).

It follows that E_θ(V_i(U) − L_i(U)) = 0 for all θ. Since U is boundedly com-
plete, it follows that P_θ(V_i = L_i) = 1 for all θ. So, E_θ(L_i(U)|T) = H_i(T).
Since L_i and H_i are bounded, they have finite variance, and Proposi-
tion B.78 says that

Var_θ(L_i(U)) = E_θ Var_θ(L_i(U)|T) + Var_θ(H_i(T)),
Var_θ(H_i(T)) = E_θ Var_θ(H_i(T)|U) + Var_θ(L_i(U)).

It follows easily from these equations that Var_θ(L_i(U)|T) = 0, a.s. [P_θ],
hence Var_θ(V_i(U)|T) = 0, a.s. [P_θ]. So V_i(U) = E(V_i(U)|T) = H_i(T), a.s.
[P_θ]. Since V_i is one-to-one, we get U_i = V_i^{−1}(H_i(T)) for each i, and U is a
function of T, as needed. □

2.1.4 Ancillarity
At the other extreme from sufficiency lie statistics that are independent of
the parameter.
Definition 2.37. A statistic U is called ancillary if the conditional distri-
bution of U given Θ = θ is the same for all θ.
Example 2.38. Let X_1, X_2 be conditionally independent given Θ = θ, each with
conditional distribution N(θ, 1). Let U = X_2 − X_1. The conditional distribution of U
given Θ = θ is N(0, 2). Since this distribution is the same for all θ, U is ancillary.
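A quick simulation (ours, not from the text) illustrates the ancillarity: the sampling distribution of U stays N(0, 2) no matter which θ generated the data.

```python
import random
import statistics

random.seed(0)

def diff_sample(theta, n=100_000):
    # U = X2 - X1 with X1, X2 independent N(theta, 1)
    return [random.gauss(theta, 1) - random.gauss(theta, 1) for _ in range(n)]

for theta in (-5.0, 0.0, 3.0):
    u = diff_sample(theta)
    # mean near 0 and variance near 2, whatever theta is
    assert abs(statistics.fmean(u)) < 0.05
    assert abs(statistics.variance(u) - 2.0) < 0.1
```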
Sometimes the two extremes meet and a minimal sufficient statistic con-
tains a coordinate that is ancillary.
Definition 2.39. If a minimal sufficient statistic is T = (T_1, T_2) and T_2 is
ancillary, then T_1 is called conditionally sufficient given T_2.
Example 2.40. Suppose that X_1, …, X_n are conditionally IID given Θ = θ with
U(θ − 1/2, θ + 1/2) distribution. Let X = (X_1, …, X_n). Then

f_{X|Θ}(x|θ) = I_{[θ−1/2,∞)}(min x_i) I_{(−∞,θ+1/2]}(max x_i).

Let T_1 = max X_i and T_2 = max X_i − min X_i. Then T = (T_1, T_2) is minimal
sufficient and T_2 is ancillary (see Problem 25 on page 141). So T_1 is conditionally
sufficient given T_2. In particular, if n = 2, the density of T_2 with respect to
Lebesgue measure is f_{T_2}(t_2) = 2(1 − t_2) I_{[0,1]}(t_2).
When a statistic is ancillary, it does not mean that you should ignore it.
It only means that if you learned nothing but the ancillary, you would not
change your mind about Θ. You might, however, change your mind about
everything else, including the conditional distribution of other data given
Θ.
Example 2.41 (Continuation of Example 2.40; see page 95). The joint density
of T given Θ = θ is

f_{T|Θ}(t_1, t_2|θ) = n(n − 1) t_2^{n−2} I_{[0,1]}(t_2) I_{[θ−1/2+t_2, θ+1/2]}(t_1).

Since T_1 is the maximum of n IID uniform random variables given Θ, it follows
that the distribution of T_1 given Θ = θ is Beta(n, 1) shifted by θ − 1/2. It is
not hard to show that the conditional distribution of T_1 given (Θ, T_2) = (θ, t_2) is
U(θ − 1/2 + t_2, θ + 1/2). The bigger t_2 is, the more concentrated the distribution
of T_1 given Θ is. Even though T_2 is ancillary and (by itself) tells us nothing about
Θ, it tells us something about the conditional distribution of T_1 given Θ.
A common (but not universal) suggestion, in classical inference, is to
perform inference conditional on ancillaries. The reason for this is that
when one performs classical inference conditional on a statistic, the statis-
tic does not count as data in the inference; it merely counts as background
information that we supposedly knew before we collected the data. Since
the ancillary does not contain (in itself) any information about Θ, no in-
formation is being lost by not treating it as part of the data. This allows
the classical statistician a convenient way to condition on at least some of
the data. In the Bayesian framework, one could construct a joint distribu-
tion for Θ and the data, then condition on all of the data, and whether
a statistic is ancillary becomes irrelevant. In fact, whether a statistic is
sufficient becomes irrelevant. Fisher (1934) proposed inference conditional
on ancillaries because he claimed that it made better use of the informa-
tion available in the actual sample obtained. Example 2.52 on page 100
illustrates Fisher's point, as does the first half of the following example.
Example 2.42. Suppose that X_1, X_2 are conditionally IID with U(θ − 1/2, θ +
1/2) distribution given Θ = θ. Let T_1 = max{X_1, X_2} and T_2 = |X_1 − X_2|. It is
traditional, in the classical literature, to interpret the statement

P_θ(T_1 − T_2 ≤ Θ ≤ T_1) = 1/2 for all θ

as meaning that one is 50 percent confident that the random interval

[T_1 − T_2, T_1] = [min X_i, max X_i]   (2.43)

2.1. Definitions 97

will contain Θ.⁹ However, we already saw that the conditional distribution of T_1
given T_2 = t_2 and Θ = θ is U(θ − 1/2 + t_2, θ + 1/2). It follows that

P_θ(T_1 − T_2 ≤ Θ ≤ T_1 | T_2 = t) = t/(1 − t), if t < 1/2,
= 1, if t ≥ 1/2.

If, for example, T_2 ≥ 1/2 is observed, then we know that Θ is in the interval
between min X_i and max X_i. It would seem that knowledge of the ancillary gives
us a better idea of how much "confidence" we should have that Θ is in the interval.
Alternatively, we could choose our interval using the conditional distribution
given the ancillary T_2. For example, we can easily show that

P_θ(T_1 − (1 + T_2)/4 ≤ Θ ≤ T_1 + 1/4 − 3T_2/4 | T_2 = t_2) = 1/2, for all t_2.   (2.44)

In the classical theory, one would be 50 percent confident that the random interval
[T_1 − (1 + T_2)/4, T_1 + 1/4 − 3T_2/4] covers Θ conditional on T_2. In fact, since
the probability in (2.44) is the same for all T_2 values, one would be 50 percent
confident that the random interval covers Θ marginally. If one desires an interval
in which one can place 50 percent confidence after seeing the data, then the
interval in (2.44) makes far more sense than the one in (2.43). If T_2 is observed to
be small, then we have not learned much about Θ, and the conditional interval is
wide to reflect the uncertainty. The unconditional interval in (2.43) is very short,
however, which is counterintuitive. Similarly, when T_2 is observed to be large, we
have learned a lot about Θ and the second interval is short, while the first one is
wide.
Suppose that we have a prior distribution for Θ with density f_Θ(θ). Then the
posterior density of Θ is a constant times f_Θ(θ) I_{[t_1−1/2+t_2, t_1+1/2]}(θ). If f_Θ(θ) is
almost constant over the interval [t_1 − 1/2 + t_2, t_1 + 1/2], then the posterior is
approximately

f_{Θ|X}(θ|x) ≈ (1/(1 − t_2)) I_{[t_1−1/2+t_2, t_1+1/2]}(θ).

The posterior probability that Θ is in the interval in (2.44) is nearly 1/2. If
one uses the improper prior with constant density, then the posterior for Θ is
U(t_1 − 1/2 + t_2, t_1 + 1/2), and the fact that the posterior probability is 1/2 that
Θ is in the interval (2.44) will turn out to be a special case of Theorem 6.78.
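The contrast between (2.43) and (2.44) can be seen in a short simulation (our sketch; variable names are ours). Both intervals cover Θ half the time marginally, but the coverage of (2.43) varies with the ancillary: whenever T_2 ≥ 1/2, the interval (2.43) is certain to cover.

```python
import random

random.seed(1)
theta = 0.0
trials = 100_000
cov43 = cov44 = 0
cov43_wide = wide = 0  # trials with T2 >= 1/2
for _ in range(trials):
    x1 = random.uniform(theta - 0.5, theta + 0.5)
    x2 = random.uniform(theta - 0.5, theta + 0.5)
    t1, t2 = max(x1, x2), abs(x1 - x2)
    in43 = t1 - t2 <= theta <= t1
    in44 = t1 - (1 + t2) / 4 <= theta <= t1 + 1 / 4 - 3 * t2 / 4
    cov43 += in43
    cov44 += in44
    if t2 >= 0.5:
        wide += 1
        cov43_wide += in43

assert abs(cov43 / trials - 0.5) < 0.01  # marginal coverage 1/2
assert abs(cov44 / trials - 0.5) < 0.01  # marginal coverage 1/2
assert cov43_wide == wide                # (2.43) always covers when T2 >= 1/2
```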
Sometimes there is more than one ancillary statistic available. Some prin-
ciple is needed to choose between them.
Definition 2.45. An ancillary U is maximal if every other ancillary is a
function of U.
Example 2.46. Let P_θ say that {Y_n}_{n=1}^∞ are IID with density (with respect to
counting measure on the set {(0,0), (0,1), (1,0), (1,1)})

f_{Y_1|Θ}(y|θ) = (1/6)(1 − θ) if y = (0,0),
(1/6)(1 + θ) if y = (0,1),
(1/6)(2 + θ) if y = (1,0),
(1/6)(2 − θ) if y = (1,1).

9This is an example of a confidence interval statement. We will discuss confi-


dence intervals in more depth in Section 5.2.1.

Here, Ω = [0, 1]. Now, let X = (Y_1, …, Y_N). Let the observable counts be N_{ij},
equal to the number of Y s with the first coordinate i and the second coordinate
j. Let M_i be the number of vectors with the first coordinate i, and let N_j be the
number with the second coordinate j.

                    First Coordinate
                      0       1
Second        0     N_00    N_10
Coordinate    1     N_01    N_11

Then N = N_00 + N_10 + N_01 + N_11 and

f_{X|Θ}(x|θ) = (1/6)^N (1 − θ)^{N_00} (1 + θ)^{N_01} (2 + θ)^{N_10} (2 − θ)^{N_11}.

Any three of the N_{ij} are minimal sufficient. We also see that

M_0 ~ Bin(N, 1/3), given Θ = θ,
N_0 ~ Bin(N, 1/2), given Θ = θ.

Both M_0 and N_0 are ancillary, but neither is maximal. The conditional inference
will depend on which ancillary one chooses.
For example, E_θ(1 − 3N_00/N_0 | N_0) = θ, and E_θ(1 − 2N_00/M_0 | M_0) = θ. If
one wanted to estimate Θ in the classical framework, there would seem to be
two natural estimators available depending on which ancillary one chooses. (See
Problems 31 and 32 on page 142.)

Sometimes we need to condition on a statistic even if it is not ancillary.
The following example was given by Morris DeGroot (personal communi-
cation). A similar example can be found in Pratt (1962).
Example 2.47. Consider a meter that is trying to measure a quantity Θ. Sup-
pose that the meter gives a reading Z, which has N(θ, 1) distribution given Θ = θ
if Z < 2, but if Z ≥ 2, the reading is always 2. Let X = min{Z, 2} be the reading.
Then P_θ(X = 2) = 1 − Φ(2 − θ), where Φ is the standard normal distribution
function. For x < 2, f_{X|Θ}(x|θ) is the N(θ, 1) density. The event {X = 2} is not
ancillary but is obviously important for what inference to perform. For example,
trying to construct an unbiased estimator is difficult since

E_θ(X) = Φ(2 − θ)θ + 2[1 − Φ(2 − θ)] − (1/√(2π)) exp{−(2 − θ)²/2}.

On the other hand, if X < 2 is observed, the inference should be the same as
if we had merely observed Z, since we actually did observe Z and the fact that
Z ≥ 2 could have occurred but didn't is irrelevant. If X = 2 is observed, the
inference should be based on the fact that all we know is Z ≥ 2, since the fact
that X < 2 had been possible is now irrelevant.
A possible Bayesian solution to this problem would be to let Θ have a conju-
gate prior distribution, say Θ ~ N(θ_0, σ_0²) for known values of θ_0 and σ_0². The
conditional distribution of Θ given X = x is N(θ_1, σ_1²) if x < 2, where

θ_1 = (θ_0 + σ_0² x)/(1 + σ_0²),   σ_1² = σ_0²/(1 + σ_0²).

Inference would then proceed as if no truncation had been possible. On the other
hand, if X = 2 is observed, the conditional distribution of Θ given X = 2 has
density proportional to [1 − Φ(2 − θ)] times the N(θ_0, σ_0²) prior density.
If we want the posterior mean of Θ, we can integrate to get

E(Θ|X = 2) = θ_0 + σ_0² exp{−(2 − θ_0)²/[2(1 + σ_0²)]}
/ ( √(2π) √(1 + σ_0²) [1 − Φ((2 − θ_0)/√(1 + σ_0²))] ).
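The posterior-mean formula for the censored reading can be checked by simulation (a sketch of ours under the stated prior; function names are illustrative):

```python
import math
import random

def phi_cdf(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def posterior_mean_censored(theta0, s0sq, c=2.0):
    # E(Theta | X = c) for the censored meter, per the formula above
    s = math.sqrt(1 + s0sq)
    z = (c - theta0) / s
    return theta0 + s0sq * math.exp(-z * z / 2) / (
        math.sqrt(2 * math.pi) * s * (1 - phi_cdf(z)))

# Monte Carlo check: Theta ~ N(0, 1), Z ~ N(Theta, 1), keep draws with Z >= 2
random.seed(2)
kept = []
for _ in range(500_000):
    th = random.gauss(0.0, 1.0)
    if random.gauss(th, 1.0) >= 2.0:
        kept.append(th)
mc = sum(kept) / len(kept)
assert abs(mc - posterior_mean_censored(0.0, 1.0)) < 0.02
```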

Brown (1967) and Buehler and Fedderson (1963) prove that there are
other statistics that are not ancillary but on which it might pay to condition
when making inferences. In particular, they consider the case in which
X_1, …, X_n are conditionally IID with N(μ, σ²) distribution, conditional
on Θ = θ = (μ, σ). Let X̄ = Σ_{i=1}^n X_i/n and S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1).
It is well known that P_θ(|X̄ − μ|/S > k) depends only on k and n; call it
α(k, n). What these authors show is that there is a set C and a number
a < α(k, n) such that P_θ(|X̄ − μ|/S > k | (X̄, S) ∈ C) ≤ a for all θ. Pierce
(1973), Wallace (1959), and Buehler (1959) give conditions under which
such examples can and cannot arise.
Ancillaries are only useful if there is no boundedly complete sufficient
statistic.
Theorem 2.48 (Basu's theorem).¹⁰ If T is a boundedly complete suf-
ficient statistic and U is ancillary, then U and T are independent given
Θ = θ, and they are marginally independent no matter what prior one
uses.
PROOF. Let A be some measurable set of possible values of U. Since U is
ancillary, P'_θ(U ∈ A) = Pr(U ∈ A) for all θ. But, P'_θ(U ∈ A) = ∫ P'_θ(U ∈
A|T = t) dP_{θ,T}(t). So

∫ [Pr(U ∈ A) − Pr(U ∈ A|T = t)] dP_{θ,T}(t) = 0,   (2.49)

for all θ, since T is sufficient. Let g(t) = Pr(U ∈ A) − Pr(U ∈ A|T =
t), which is a bounded measurable function. Equation (2.49) says that
E_θ(g(T)) = 0 for all θ. Since T is boundedly complete, we have P_θ(g(T) =
0) = 1 for all θ. This means that P'_θ(U ∈ A) = P'_θ(U ∈ A|T = t), a.s. [P_{θ,T}]
for all θ, which implies that U and T are conditionally independent given
Θ = θ, for all θ.
lOSee Basu (1955, 1958).

Let μ_Θ be an arbitrary prior, and let B be a measurable set of possible
values of T.

Pr(U ∈ A, T ∈ B) = ∫_Ω ∫_B Pr(U ∈ A|T = t) dP_{θ,T}(t) dμ_Θ(θ)
= ∫_Ω Pr(U ∈ A) P_{θ,T}(B) dμ_Θ(θ) = Pr(U ∈ A) Pr(T ∈ B),

which says that U and T are marginally independent. □
Basu's theorem 2.48 says that if T is a boundedly complete sufficient
statistic, conditioning on an ancillary is not going to change the joint dis-
tribution of T and e. Both Bayesian and classical statisticians would ignore
the ancillary in such a case.
Example 2.50. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID N(θ, 1). Let X =
(X_1, …, X_n), T = X̄, and U = Σ_{i=1}^n (X_i − T)²/(n − 1). Then T is a complete
sufficient statistic and U is ancillary. They are independent given Θ = θ and are
marginally independent no matter what prior we use.
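A simulation sketch (ours, not from the text) of this independence: the empirical correlation between T and U stays near zero for every θ, as Basu's theorem predicts.

```python
import random
from statistics import fmean, variance

random.seed(3)

def corr(xs, ys):
    # Pearson correlation, written out to stay within the standard library
    mx, my = fmean(xs), fmean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) *
           sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

def mean_var_corr(theta, n=5, reps=20_000):
    # correlation between T = sample mean and U = sample variance
    means, vars_ = [], []
    for _ in range(reps):
        x = [random.gauss(theta, 1) for _ in range(n)]
        means.append(fmean(x))
        vars_.append(variance(x))
    return corr(means, vars_)

for theta in (0.0, 10.0):
    # T and U are independent (Basu), so the correlation is near zero
    assert abs(mean_var_corr(theta)) < 0.03
```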
Example 2.51. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID N(μ, σ²), where
θ = (μ, σ). Let X = (X_1, …, X_n), T_1 = X̄, and T_2 = √(Σ_{i=1}^n (X_i − T_1)²/(n − 1)).
The fact that T = (T_1, T_2) is a complete sufficient statistic will follow most easily
from Theorem 2.74, to be proven later. Let

U = (X_1 − T_1, …, X_n − T_1) / (√(n − 1) T_2).

Then, U is ancillary and independent of T given Θ = θ. Also T and U are
marginally independent no matter what prior we use. The distribution of U is
uniform on a sphere of radius 1 in an (n − 1)-dimensional hyperplane. (See Prob-
lem 28 on page 141.)
One reason that some people give for conditioning on an ancillary is
that they get a better measure of the precision of the inference. Here is an
example due to Basu.
Example 2.52. Let θ = (θ_1, …, θ_N), where N is the (known) size of a pop-
ulation and θ_i is some characteristic of unit i in the population. Select a set of
labels i_1, …, i_n from {1, …, N} with replacement, with n ≤ N. Let X_j = θ_{i_j} be
observed for j = 1, …, n. Let X = (X_1, …, X_n). If the selection is random, then
f_{X|Θ}(x|θ) = 1/N^n for all x compatible with θ. (Notice that the distribution of X
is dependent on Θ even though the distribution of the labels is not.) Let M be the
number of distinct labels drawn. Then P_θ(M = m) is the same for all θ, so M is
ancillary. Let X*_1, …, X*_M be the distinct observed values. One possible estimator
of the population average is X̄* = Σ_{i=1}^M X*_i/M. The conditional variance of X̄*
given Θ = θ and M = m is

Var(X̄* | Θ = θ, M = m) = [(N − m)/(N − 1)] σ²/m,

where σ² = Σ_{i=1}^N (θ_i − θ̄)²/N is the population variance and θ̄ is the population
average.

To see that this is a better measure of the variance of X* than is the marginal
variance, consider the simple case n = 3. The distribution of M is
ifm = 1,
ifm = 2,
ifm=3,
otherwise.

Since E(X*18 = fJ, M = m) = (j for all m, it follows that

_* 18 = 8) = E (N-Ma-
Var(X
2
- - - 8 = fJ
N-1M
1)
= ~ ( (N _ 2) (N - 2)(N - 3)) = (12 N2 - 2N + 3 .
N2 1 + + 3 3 N2
If M = 3, the marginal variance is larger than the conditional variance, while if
M = 1, the marginal variance is too small.
To execute a Bayesian solution, we need a distribution for Θ. Suppose that
we model the Θ_i as exchangeable random variables with Θ_i conditionally inde-
pendent N(ψ, σ²) given (Ψ, Σ) = (ψ, σ). The distribution of Ψ given Σ = σ is
N(ψ_0, σ²/λ_0), while the distribution of Σ² is Γ^{−1}(a_0/2, b_0/2).¹¹ The data consist
of observing M = m and Θ_{i_j} = x*_j, for j = 1, …, m. The unobserved Θ_i are still
exchangeable, and their conditional distribution given (Ψ, Σ) = (ψ, σ) is as in the
prior. The distribution of Ψ given Σ = σ and X = x is Ψ ~ N(ψ_1, σ²/λ_1), and
the distribution of Σ² given X = x is Γ^{−1}(a_1/2, b_1/2), where

λ_1 = λ_0 + m,   ψ_1 = (λ_0 ψ_0 + m x̄*)/λ_1,
a_1 = a_0 + m,   b_1 = b_0 + Σ_{i=1}^m (x*_i − x̄*)² + [m λ_0/(m + λ_0)] (ψ_0 − x̄*)².
The posterior distribution of the population average Θ̄ is obtained in stages. First, conditional on (Ψ, Σ) = (ψ, σ),

Θ̄ ~ N( [m x̄* + (N − m)ψ]/N , σ² (N − m)/N² ).

Integrating ψ out of this, we get that conditional on Σ = σ,

Θ̄ ~ N( [m x̄* + (N − m)ψ₁]/N , σ² [(N − m)/N²] [1 + (N − m)/λ₁] ).

Finally, integrating σ out, we get that Θ̄ has a t-distribution with a₁ degrees of freedom, location [m x̄* + (N − m)ψ₁]/N, and squared scale factor (b₁/a₁) [(N − m)/N²] [1 + (N − m)/λ₁].

¹¹This is an example of a hierarchical model, which will be discussed in more detail in Chapter 8.
If we use an improper prior (λ₀ = 0, b₀ = 0, a₀ = −1), then the location becomes x̄* and the squared scale factor becomes

[(N − m)/(Nm)] [1/(m − 1)] Σ_{i=1}^m (x_i* − x̄*)².

The latter is very close to the traditional finite population sampling theory variance estimate.

We conclude this section with two examples that are similar on the surface, but in one example the ancillary is part of the sufficient statistic and in the other it is not.

Example 2.53. Let Z ~ Ber(1/2) (independent of Θ), and let Y and W be conditionally independent given Θ = θ (and independent of Z) with Y ~ N(θ, 1) and W ~ N(θ, 2). If Z = 0, we will observe X = (Y, Z). If Z = 1, we will observe X = (W, Z). Let X₁ stand for the first coordinate of X. The likelihood function is

f_{X|Θ}(x|θ) = (1/2) (2π · 2^z)^{−1/2} exp( −(x₁ − θ)²/(2 · 2^z) ).
It is possible to show that X₁ is not a sufficient statistic. (See Problem 8 on page 138.) In this case, it makes perfect sense to perform inference conditional on the ancillary Z.

Example 2.54. Let Z ~ Ber(1/2) (independent of Θ), and let {Y_n}_{n=1}^∞ be conditionally IID Ber(θ) random variables given Θ = θ (and independent of Z). If Z = 1, we will observe Y₁, ..., Y_n for fixed n. If Z = 0, we will observe the Y_i until we see k successes, with k < n. In Problem 9 on page 138, we will see that a sufficient statistic is (N, M), where N is the number of observed Y_i and M is the number of successes among the observed Y_i. Clearly, Z is ancillary, but it is difficult to justify conditioning on Z, since it is not part of the sufficient statistic. That is, if we observe n of the Y_i and there are k successes among them, then it does not matter to us whether Z = 0 or 1. (Of course, in all other cases we can figure out what Z was from the rest of the data.)

2.2 Exponential Families of Distributions


2.2.1 Basic Properties
There is a special class of distributions for which complete sufficient statistics with fixed dimension always exist. This class includes some, but not all, of the commonly used distributions.

Definition 2.55. A parametric family with parameter space Ω and density f_{X|Θ}(x|θ) with respect to a measure ν on (X, B) is called an exponential family if

f_{X|Θ}(x|θ) = c(θ) h(x) exp( Σ_{i=1}^k π_i(θ) t_i(x) ),

for some measurable functions π₁, ..., π_k, t₁, ..., t_k and some integer k.
Example 2.56. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID N(μ, σ²), where θ = (μ, σ). Let X = (X₁, ..., X_n).

f_{X|Θ}(x|θ) = (2π)^{−n/2} σ^{−n} exp{ −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² }
            = (2π)^{−n/2} σ^{−n} exp{ −nμ²/(2σ²) } exp{ −(1/(2σ²)) Σ_{i=1}^n x_i² + (μ/σ²) n x̄ }.

In this form, we see that k = 2 and

h(x) = (2π)^{−n/2},   t₁(x) = n x̄,   t₂(x) = Σ_{i=1}^n x_i²,
c(θ) = σ^{−n} exp{ −nμ²/(2σ²) },   π₁(θ) = μ/σ²,   π₂(θ) = −1/(2σ²).


The function c(θ) in Definition 2.55 can be written as

c(θ) = [ ∫ h(x) exp( Σ_{i=1}^k π_i(θ) t_i(x) ) dν(x) ]^{−1},

so that the dependence on θ is through the vector π = (π₁(θ), ..., π_k(θ)) ∈ ℝ^k. We might as well let π be the parameter.

Definition 2.57. In an exponential family, the natural parameter is the vector Π = (π₁(Θ), ..., π_k(Θ)), and

Γ = { π ∈ ℝ^k : ∫ h(x) exp( Σ_{i=1}^k π_i t_i(x) ) dν(x) < ∞ }

is called the natural parameter space.

The mapping Π : Ω → Γ need not be one-to-one, nor need it be onto. It is common, however, to use the symbol Θ for the natural parameter and to assume that Ω = Γ. It is obvious from the form of the exponential family density and the Fisher-Neyman factorization theorem 2.21 that T(X) = (t₁(X), ..., t_k(X)) is a sufficient statistic. This statistic is sometimes called the natural sufficient statistic.
The sufficient statistic from an exponential family sample also has an exponential family distribution.

Lemma 2.58.¹² If X has an exponential family distribution, then so does the natural sufficient statistic T(X), and the natural parameter for T is the same as for X. In particular, there is a measure ν_T such that P_{θ,T} ≪ ν_T for all θ and dP_{θ,T}/dν_T(t) = c(θ) exp(θᵀt).

PROOF. Apply Lemma 2.24 with m₂(T(x), θ) = c(θ) exp( Σ_{i=1}^k θ_i t_i(x) ) and m₁(x) = h(x). □

¹²This lemma is used in the proof of Theorem 2.62.
Example 2.59 (Continuation of Example 2.56; see page 103). In the case of n conditionally IID N(μ, σ²) random variables, the natural sufficient statistics are T₁ = Σ_{i=1}^n X_i and T₂ = Σ_{i=1}^n X_i². It is well known that T₁ and W = T₂ − T₁²/n are independent, with T₁ having N(nμ, nσ²) distribution and W having Γ([n − 1]/2, 1/[2σ²]) distribution. It follows that the joint density of (T₁, T₂) is

[ √(πn) σⁿ Γ((n−1)/2) 2^{n/2} ]^{−1} (t₂ − t₁²/n)^{(n−3)/2} exp( −(1/(2σ²)) [ (t₁ − nμ)²/n + t₂ − t₁²/n ] ).

This can be simplified to c(θ)h(t₁, t₂) exp(θ₁t₁ + θ₂t₂), with c(θ) the same as in Example 2.56 and h(t₁, t₂) a constant times (t₂ − t₁²/n)^{(n−3)/2}.
There are degenerate exponential families. That is, it is possible for some linear function of X to be constant (the same constant for all θ) with probability 1 given Θ = θ for all θ. For example, let Y₁ be k₁-dimensional and have conditional density given Θ = θ (with respect to a measure ν₁) c(θ) exp(y₁ᵀθ). Define Γᵀ = (Θᵀ, Ψᵀ), where Ψ is k₂-dimensional. Let ν₂ be the measure that puts a mass of 1 on the point r ∈ ℝ^{k₂} and puts 0 mass on the rest of ℝ^{k₂}. Define ν = ν₁ × ν₂, and let Yᵀ = (Y₁ᵀ, rᵀ). Then Y has conditional density given Γ = (θ, ψ) = γ with respect to ν,

c*(γ) exp(yᵀγ) = c(θ) exp(y₁ᵀθ),

where c*(γ) = c(θ) exp(−rᵀψ). The natural parameter space of Γ values is Ω × ℝ^{k₂}, where Ω is the natural parameter space of Θ values. For this reason, we introduce a definition.

Definition 2.60. An exponential family of distributions for X is degenerate if there exists at least one vector a and a scalar r such that P_θ(aᵀX = r) = 1 for all θ. If the exponential family is not degenerate, it is called nondegenerate.

Example 2.61. Let X ~ Mult_k(n; p₁, ..., p_k) given P = (p₁, ..., p_k). The natural parameter is Θ = (log p₁, ..., log p_k). We know that P_θ(1ᵀX = n) = 1 for all θ, where 1 is a vector of k 1s.
When the exponential family is degenerate, the natural parameter space is a subset of a (k − 1)-dimensional linear manifold in ℝ^k; hence it has empty interior. Some theorems in Section 2.2.2 will require that the natural parameter space have a nonempty interior. However, degenerate families of distributions are easily converted into nondegenerate families by means of linear transformations. For example, with the multinomial distribution, we could just delete the last coordinate of both X and the natural parameter. For this nondegenerate family, the natural parameter space does contain an open subset of ℝ^{k−1}.
2.2.2 Smoothness Properties


The means of functions of exponential family random variables tend to be
smooth functions of the natural parameter. In fact, the natural parameter
space is itself a nice subset of Euclidean space.
Theorem 2.62. The natural parameter space Ω of an exponential family is convex, and 1/c(θ) is a convex function.

PROOF. We will work in the sufficient statistic space. Write 1/c(θ) = ∫ exp{tᵀθ} dν_T(t), where ν_T is the measure from Lemma 2.58. Since exp(·) is a convex function, we get, for θ₁, θ₂ ∈ Ω and 0 < α < 1,

1/c(αθ₁ + (1 − α)θ₂) = ∫ exp{ tᵀ[αθ₁ + (1 − α)θ₂] } dν_T(t)
   ≤ ∫ ( α exp{tᵀθ₁} + (1 − α) exp{tᵀθ₂} ) dν_T(t)
   = α ∫ exp{tᵀθ₁} dν_T(t) + (1 − α) ∫ exp{tᵀθ₂} dν_T(t)
   = α [1/c(θ₁)] + (1 − α) [1/c(θ₂)] < ∞.

This proves that αθ₁ + (1 − α)θ₂ ∈ Ω, so Ω is convex. It also proves that 1/c is convex. □


Example 2.63. The family of exponential distributions, Exp(ψ), with densities f_{X|Ψ}(x|ψ) = ψ exp(−ψx) for x > 0, has h(x) = I_{(0,∞)}(x) and natural parameter Θ = −Ψ. So c(θ) = −θ, and 1/c(θ) = −1/θ is convex. The natural parameter space is (−∞, 0), a convex set.

The following theorem is used in several places in the remainder of the text to establish smoothness properties of various conditional means, given the natural parameters, of functions of random variables with exponential family distributions.

Theorem 2.64. Let the density of T(X) with respect to a measure ν_T be c(θ) exp{tᵀθ}. If φ : 𝒯 → ℝ is measurable and ∫ |φ(t)| exp{tᵀθ} dν_T(t) < ∞ for θ in the interior of the natural parameter space, then

f(z) = ∫ φ(t) exp{tᵀz} dν_T(t)

is an analytic function¹³ of z in the region where the real part of z is interior to the natural parameter space, and

∂f(z)/∂z_i = ∫ t_i φ(t) exp{tᵀz} dν_T(t).

¹³By analytic function, we mean a complex-valued function of a complex (vector) argument that is differentiable with respect to that complex argument.
PROOF. We will do the k = 1 case, as the others follow by induction. Let z₀ = a + ib and δ = δ₁ + iδ₂ for some a in the interior of Ω. Then

[f(z₀ + δ) − f(z₀)]/δ = ∫ φ(t) exp(tz₀) [exp(tδ) − 1]/δ dν_T(t).    (2.65)

The maximum modulus theorem C.8 says that an analytic function on a closed bounded set achieves its maximum on the boundary of the set. For 0 < γ < ε, consider the set C(γ, ε) = {z : γ ≤ |z| ≤ ε}. For fixed ε and every 0 < γ < ε,

max_{z ∈ C(γ,ε)} | [exp(tz) − 1]/z | ≤ max{ [exp(|t|ε) − 1]/ε , [exp(|t|γ) − 1]/γ }.

Since the limit as γ → 0 of the last term above is |t| and exp(|t|ε) − 1 > |t|ε, it follows that |(exp(tz) − 1)/z| ≤ exp(|t|ε)/ε for all |z| ≤ ε. Thus, we have that if |δ| ≤ ε, the absolute value of the integrand in (2.65) is no more than |φ(t)| exp(at) exp(|t|ε)/ε. Thus, the integral of the absolute value is at most ∫ |φ(t)| ( exp{t(a + ε)} + exp{t(a − ε)} ) dν_T(t)/ε. Choose ε small enough so that a ± ε are in the interior of Ω. By the dominated convergence theorem,

lim_{δ→0} [f(z₀ + δ) − f(z₀)]/δ = ∫ φ(t) t exp(tz₀) dν_T(t). □

Theorem 2.64 allows us to calculate moments of the sufficient statistics in exponential families by taking derivatives of the function log c(θ).

Example 2.66. Let φ(t) = 1. Then we can calculate

E_θ(T_i) = ∫ c(θ) t_i exp(tᵀθ) dν_T(t) = c(θ) ∂/∂θ_i ∫ exp(tᵀθ) dν_T(t)
        = c(θ) ∂/∂θ_i [1/c(θ)] = −[1/c(θ)] ∂c(θ)/∂θ_i = −∂/∂θ_i log c(θ).
Example 2.67 (Continuation of Example 2.63; see page 105). Consider the Exp(ψ) distribution with θ = −ψ. Here log c(θ) = log(−θ), so the partial derivative with respect to θ is 1/θ = −E_θ(T); that is, E_θ(T) = −1/θ = 1/ψ.
Example 2.68 (Continuation of Example 2.56; see page 103). Consider the case of N(μ, σ²) distributions. Here, the natural parameter is (θ₁, θ₂) = (μ/σ², −1/[2σ²]), and the natural sufficient statistic is (T₁, T₂) = (nX̄, Σ_{i=1}^n X_i²). So,

log c(θ) = (n/2) log(−2θ₂) + (n/4) θ₁²/θ₂.    (2.69)

The partial derivative with respect to θ₁ is

(n/2) θ₁/θ₂ = −nμ = −E_θ(T₁).

The partial derivative with respect to θ₂ is

n/(2θ₂) − nθ₁²/(4θ₂²) = −n(σ² + μ²) = −E_θ(T₂).
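The two partial derivatives can be checked numerically with central differences. A minimal sketch; the values of n, μ, and σ are arbitrary assumptions:

```python
import math

# Arbitrary illustrative values.
n, mu, sigma = 4, 1.5, 2.0
th1, th2 = mu / sigma**2, -1.0 / (2 * sigma**2)  # natural parameters

def log_c(t1, t2):
    # log c(theta) from (2.69)
    return (n / 2) * math.log(-2 * t2) + (n / 4) * t1**2 / t2

h = 1e-6
d1 = (log_c(th1 + h, th2) - log_c(th1 - h, th2)) / (2 * h)
d2 = (log_c(th1, th2 + h) - log_c(th1, th2 - h)) / (2 * h)

e_t1 = n * mu                   # E(T1) = n*mu
e_t2 = n * (sigma**2 + mu**2)   # E(T2) = n*(sigma^2 + mu^2)
```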
The method illustrated in Example 2.66 is actually quite general.

Proposition 2.70.¹⁴ Let T = (T₁, ..., T_k). Suppose that the conditional density of T given Θ = θ is f_{T|Θ}(t|θ) = c(θ) exp(θᵀt). Let ℓ₁, ..., ℓ_k ≥ 0 be such that ℓ = ℓ₁ + ··· + ℓ_k. Then

E_θ( ∏_{i=1}^k T_i^{ℓ_i} ) = c(θ) ∂^ℓ/(∂θ₁^{ℓ₁} ··· ∂θ_k^{ℓ_k}) [1/c(θ)].

In particular, E_θ(T_i) = −∂/∂θ_i log c(θ), and

Cov_θ(T_i, T_j) = −∂²/(∂θ_i ∂θ_j) log c(θ).

Example 2.71 (Continuation of Example 2.68; see page 106). For the N(μ, σ²) case, log c(θ) is given in (2.69). The covariance of T₁ = nX̄ and T₂ = Σ_{i=1}^n X_i² is

Cov_θ(T₁, T₂) = −∂²/(∂θ₁ ∂θ₂) log c(θ) = (n/2) θ₁/θ₂² = 2nμσ²,

as can be verified directly.

A similar result holds for the posterior means of polynomial functions of Θ, if we use a conjugate prior.

Proposition 2.72. Let X = (X₁, ..., X_n), where the X_i are conditionally IID, given Θ = θ, with density equal to c(θ) exp(θᵀT(x)), where Θ is a k-dimensional parameter. Suppose that the prior for Θ is proportional to c(θ)^a exp(θᵀb), where a > 0 and b is a k-dimensional vector (a natural conjugate prior). Suppose that ℓ₁, ..., ℓ_k ≥ 0 and ℓ = ℓ₁ + ··· + ℓ_k. Write the predictive density of X as f_X(x) = g(t₁, ..., t_k), where t_i = Σ_{j=1}^n T_i(x_j). Then

E( ∏_{i=1}^k Θ_i^{ℓ_i} | X = x ) = [1/g(t₁, ..., t_k)] ∂^ℓ/(∂t₁^{ℓ₁} ··· ∂t_k^{ℓ_k}) g(t₁, ..., t_k).

Example 2.73. Suppose that X₁, ..., X_n are conditionally IID with N(μ, σ²) distribution given M = μ and Σ = σ. Let the prior be natural conjugate as in Example 1.24 on page 14. The marginal density of the data is given in (1.27) in that example. Rewriting (1.27) in terms of the natural sufficient statistics T₁ = nX̄ and T₂ = W + nX̄², we get

g(t₁, t₂) = constant × ( b₀ + t₂ − t₁²/n + [nλ₀/(λ₀ + n)] [t₁/n − μ₀]² )^{−a₁/2}.
¹⁴This proposition is used in the proofs of Theorems 3.44 and 7.57.
The partial derivative of this with respect to t₁, divided by g(t₁, t₂), equals

(a₁/2) ( b₀ + t₂ − t₁²/n + [nλ₀/(λ₀ + n)] [t₁/n − μ₀]² )^{−1} ( 2t₁/n − [2λ₀/(λ₀ + n)] [t₁/n − μ₀] ),

which simplifies to μ₁a₁/b₁, the posterior mean of Θ₁ = M/Σ².
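This simplification can be checked numerically by differentiating log g; the sample and prior hyperparameters below are illustrative assumptions, and the exponent −a₁/2 on g follows the power form displayed above:

```python
import math

# Assumed sample and prior hyperparameters (illustrative only).
xs = [0.8, 1.9, 3.1, 2.4]
n = len(xs)
lam0, mu0, a0, b0 = 1.5, 0.5, 2.0, 3.0

t1, t2 = sum(xs), sum(x * x for x in xs)
a1, lam1 = a0 + n, lam0 + n
xbar = t1 / n
mu1 = (lam0 * mu0 + n * xbar) / lam1
b1 = (b0 + sum((x - xbar) ** 2 for x in xs)
      + n * lam0 / (n + lam0) * (xbar - mu0) ** 2)

def log_g(u1, u2):
    # log of the predictive density, up to an additive constant
    inner = b0 + u2 - u1 ** 2 / n + n * lam0 / (lam0 + n) * (u1 / n - mu0) ** 2
    return -(a1 / 2) * math.log(inner)

h = 1e-6
deriv = (log_g(t1 + h, t2) - log_g(t1 - h, t2)) / (2 * h)
posterior_mean = mu1 * a1 / b1   # posterior mean of M/Sigma^2
```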


Diaconis and Ylvisaker (1979) prove other interesting results about posterior means of parameters when conjugate priors are used for exponential families.

The following theorem is used to show that certain estimators and hypothesis tests have classical optimality properties when the data come from an exponential family.

Theorem 2.74. If the natural parameter space Ω of an exponential family contains an open set in ℝ^k, then T(X) is a complete sufficient statistic.

PROOF. We will prove the k = 1 case, and the others follow by induction. Let T(X) have density c(θ) exp{tθ} with respect to a measure ν_T. Let g be a function such that E_θ(g(T)) = 0 for all θ. Then
∫ g(t) c(θ) exp{tθ} dν_T(t) = 0

for all θ. This says that

∫ g⁺(t) exp{tθ} dν_T(t) = ∫ g⁻(t) exp{tθ} dν_T(t),    (2.75)

where g⁺ and g⁻ are respectively the positive and negative parts of g. Since E_θ(g(T)) exists for all θ, both sides of (2.75) are finite for all θ. Let θ₀ be interior to Ω, and let the common value of both sides of (2.75) be r when θ = θ₀. Define two probability measures:

P(A) = (1/r) ∫_A g⁺(t) exp{tθ₀} dν_T(t),
Q(A) = (1/r) ∫_A g⁻(t) exp{tθ₀} dν_T(t).

The two sides of (2.75) are

∫ exp(t[θ − θ₀]) dP(t) = ∫ exp(t[θ − θ₀]) dQ(t).

By Theorem 2.64, these are analytic functions of ψ = θ − θ₀. According to Theorem C.7, these functions equal their power series expansions in a neighborhood, say (−ψ₀, ψ₀), of ψ = 0. The fact that they agree for all real values near 0 implies that they have the same derivatives at 0, hence they have the same power series expansion around 0, hence they are also equal at imaginary values of ψ near 0, hence they are also equal at all imaginary values because they are analytic in the region where the real part of ψ is in (−ψ₀, ψ₀). For ψ = iu, we get that the characteristic function of P equals the characteristic function of Q in a neighborhood of 0. By Corollary B.106, it follows that P = Q, hence g⁺(t) = g⁻(t) a.e. [ν_T]. This ensures that P_θ(g(T) = 0) = 1 for all θ. □
As examples, the sufficient statistics from normal, exponential, Poisson,
and Bernoulli distributions are complete.

2.2.3 A Characterization Theorem*


The following theorem characterizes one-parameter exponential families essentially as those families of distributions with smooth densities on a common set with one-dimensional sufficient statistics for all sample sizes.
Theorem 2.76. Suppose that X₁, ..., X_n are conditionally IID given Θ = θ, each with density f_{X₁|Θ}(·|θ). Let T be a one-dimensional sufficient statistic. Write

∏_{i=1}^n f_{X₁|Θ}(x_i|θ) = m₁(x) m₂(t, θ).

Define

K_θ(t) = ∂/∂θ log m₂(t, θ).

Assume the following conditions:
1. The set of y such that f_{X₁|Θ}(y|θ) > 0 is the same for all θ.
2. f_{X₁|Θ}(y|θ) is differentiable with respect to θ for each y.
3. f_{X₁|Θ}(y|θ) is differentiable with respect to y for each θ.
4. There exists θ₀ such that K_{θ₀}(t) has an inverse.

Then, X has an exponential family distribution with a one-dimensional natural parameter.

PROOF. Write

Σ_{i=1}^n log f_{X₁|Θ}(x_i|θ) = log m₂(t, θ) + log m₁(x),

and define

q_θ(r) = K_θ[K_{θ₀}⁻¹(r)],   r(x) = K_{θ₀}(t(x)).

*This section may be skipped without interrupting the flow of ideas.

Since K_θ(t) = q_θ(K_{θ₀}(t)) = q_θ(r(x)), it follows that

∂²/(∂θ ∂x_i) log f_{X₁|Θ}(x_i|θ) = [∂q_θ(r)/∂r] ∂r(x)/∂x_i.

Taking θ = θ₀, where q_{θ₀} is the identity, shows that ∂r(x)/∂x_i depends on x through x_i alone; write ∂r(x)/∂x_i = V'(x_i), so that r(x) = Σ_{i=1}^n V(x_i) up to an additive constant. Thus,

[1/V'(x_i)] ∂²/(∂θ ∂x_i) log f_{X|Θ}(x|θ) = ∂q_θ(r)/∂r.    (2.77)

Since r is invariant under permutations of the coordinates of x and the left-hand side of (2.77) depends on x through x_i alone, both sides must depend only on θ. So, we get

∂q_θ(r)/∂r = c₁(θ),
K_θ(t) = K_{θ₀}(t) c₁(θ) + c₂(θ) = ∂/∂θ log m₂(t, θ),
log m₂(t, θ) = K_{θ₀}(t) φ₁(θ) + φ₂(θ) + s(t),

where φ_i(θ) = ∫_{θ₀}^θ c_i(u) du for i = 1, 2, and s(t) is determined by boundary conditions. It follows that

f_{X|Θ}(x|θ) = m₁(x) exp{s(t)} exp{φ₂(θ)} exp{φ₁(θ) K_{θ₀}(t)}.

Thus, we see that the density is in the form of an exponential family with k = 1 and

h(x) = m₁(x) exp{s(t)},   c(θ) = exp{φ₂(θ)},   t₁(x) = K_{θ₀}(t),   π₁(θ) = φ₁(θ). □
There are similar theorems in multiparameter cases, but they have even more conditions. We give a different type of theorem characterizing exponential families by their sufficient statistics in Theorem 2.114. The importance of the existence of a fixed-dimensional sufficient statistic is twofold. First, it means that there is a fixed amount of information that must be stored for making inference about Θ, regardless of the sample size. Second, there is the possibility of using natural conjugate prior distributions, as in Theorem 2.25.

2.3 Information

It seems intuitively sensible to expect more data to provide more information about a parameter or a distribution. Similarly, if a statistic is sufficient, it should contain all of the information about the parameter, and vice versa. To make these ideas precise, we need to define information. There are two popular definitions of information: Fisher information and Kullback-Leibler information.
2.3.1 Fisher Information

Fisher information is designed to provide a measure of how much information a data set provides about a parameter in a parametric family with some smoothness properties.

Definition 2.78. Suppose that Θ is k-dimensional and that f_{X|Θ}(x|θ) is the density of X with respect to ν. The following conditions will be known as the FI regularity conditions:
1. There exists B with ν(B) = 0 such that for all θ, ∂f_{X|Θ}(x|θ)/∂θ_i exists for x ∉ B and each i.
2. ∫ f_{X|Θ}(x|θ) dν(x) can be differentiated under the integral sign with respect to each coordinate of θ.
3. The set C = {x : f_{X|Θ}(x|θ) > 0} is the same for all θ.

Definition 2.79. Assume that the three FI regularity conditions above hold. Then the matrix I_X(θ) = ((I_{X,i,j}(θ))) with elements

I_{X,i,j}(θ) = Cov_θ( ∂/∂θ_i log f_{X|Θ}(X|θ), ∂/∂θ_j log f_{X|Θ}(X|θ) )

is called the Fisher information matrix about Θ based on X. The random vector with coordinates ∂ log f_{X|Θ}(X|θ)/∂θ_i is called the score function. If T is a statistic, the conditional score function is the vector whose ith coordinate is ∂ log f_{X|T,Θ}(X|t, θ)/∂θ_i. The conditional Fisher information given T = t, denoted by I_{X|T}(θ|t), is the conditional covariance matrix of the conditional score function.
Here are some examples.

Example 2.80. Let b be known, and suppose that X ~ N(θ, b) given Θ = θ. Then the FI regularity conditions are satisfied and

f_{X|Θ}(x|θ) = (2πb)^{−1/2} exp{ −(x − θ)²/(2b) },
∂/∂θ log f_{X|Θ}(x|θ) = (x − θ)/b,

and I_X(θ) = 1/b. Here we see that the smaller the known variance is, the more information there is in the data about Θ. This is intuitively sensible.

Example 2.81. Suppose that X ~ U(0, θ) given Θ = θ. That is, f_{X|Θ}(x|θ) = θ⁻¹ I_{(0,θ)}(x). In this case FI regularity conditions 1 and 3 fail, but we can still calculate ∂ log f_{X|Θ}(x|θ)/∂θ = −1/θ. We could then try to define the Fisher information to be the mean of the derivative of this, namely I_X(θ) = 1/θ². But this function will not have the properties that Fisher information has under all three FI regularity conditions.

Example 2.81 should actually appear in the text after Example 2.85 on page 113.
Example 2.82. Suppose that X ~ Bin(n, p) given P = p. Then

f_{X|P}(x|p) = \binom{n}{x} p^x (1 − p)^{n−x}, for x = 0, ..., n,
log f_{X|P}(x|p) = log \binom{n}{x} + x log(p) + (n − x) log(1 − p),
∂/∂p log f_{X|P}(x|p) = x/p − (n − x)/(1 − p) = (x − np)/(p(1 − p)),
I_X(p) = n/(p(1 − p)).

The more extreme P is, the more information an observation has about P.
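The binomial calculation can be verified exactly by summing over the full support; the values of n and p below are arbitrary choices:

```python
from math import comb

n, p = 7, 0.3   # arbitrary illustrative values
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
score = [(x - n * p) / (p * (1 - p)) for x in range(n + 1)]

# The score has mean 0 and variance n/(p(1-p)).
mean_score = sum(w * s for w, s in zip(pmf, score))
var_score = sum(w * (s - mean_score) ** 2 for w, s in zip(pmf, score))
fisher = n / (p * (1 - p))
```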
Example 2.83. Suppose that P_θ says X ~ N(μ, σ²), where θ = (μ, σ). Then

f_{X|Θ}(x|θ) = [1/(σ√(2π))] exp{ −(x − μ)²/(2σ²) },
∂/∂μ log f_{X|Θ}(x|θ) = (x − μ)/σ²,
∂/∂σ log f_{X|Θ}(x|θ) = −1/σ + (x − μ)²/σ³,
Var_θ( (X − μ)/σ² ) = 1/σ²,
Var_θ( −1/σ + (X − μ)²/σ³ ) = 2/σ²,
Cov_θ( −1/σ + (X − μ)²/σ³ , (X − μ)/σ² ) = 0,

I_X(θ) = ( 1/σ²   0
           0      2/σ² ).
A useful result about the score function is the following.

Proposition 2.84. When the FI regularity conditions hold, the mean of the score function is 0. If, in addition, T is a statistic, then the conditional mean given T of the conditional score function is 0, a.s. [P_{θ,T}].

If we can differentiate twice under the integral signs (as in exponential families), we obtain

0 = ∫ ∂²/(∂θ_i ∂θ_j) f_{X|Θ}(x|θ) dν(x) = E_θ[ (∂²/(∂θ_i ∂θ_j) f_{X|Θ}(X|θ)) / f_{X|Θ}(X|θ) ].

Now, use the fact that

∂²/(∂θ_i ∂θ_j) log f_{X|Θ}(x|θ)
 = [ (∂²/(∂θ_i ∂θ_j) f_{X|Θ}(x|θ)) f_{X|Θ}(x|θ) − (∂/∂θ_i f_{X|Θ}(x|θ)) (∂/∂θ_j f_{X|Θ}(x|θ)) ] / f²_{X|Θ}(x|θ)

in order to conclude

E_θ[ ∂²/(∂θ_i ∂θ_j) log f_{X|Θ}(X|θ) ]
 = 0 − Cov_θ( ∂/∂θ_i log f_{X|Θ}(X|θ), ∂/∂θ_j log f_{X|Θ}(X|θ) ) = −I_{X,i,j}(θ).

This gives an alternative method for calculating I_X(θ) when we can differentiate twice under the integral sign. In exponential families with the natural parameterization, the situation is even simpler, since the second derivative of the logarithm of the density does not depend on the data, hence no expectation need be calculated. In this case,

I_X(θ) = −(( ∂²/(∂θ_i ∂θ_j) log c(θ) )).


Example 2.85 (Continuation of Example 2.82; see page 112). The derivative of the score function is

∂²/∂p² log f_{X|P}(x|p) = −x/(p²(1 − p)) + (x − np)/(p(1 − p)²).

The mean of this is −n/(p(1 − p)) = −I_X(p).
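This identity can also be checked by exact summation over the support (again with arbitrary n and p):

```python
from math import comb

n, p = 7, 0.3   # arbitrary illustrative values
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# Second derivative of the log likelihood, as displayed above.
d2 = [-x / (p**2 * (1 - p)) + (x - n * p) / (p * (1 - p) ** 2)
      for x in range(n + 1)]
mean_d2 = sum(w * d for w, d in zip(pmf, d2))
```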


Suppose that X₁, ..., X_n are conditionally IID given Θ = θ with density f_{X₁|Θ}(x|θ). Let X = (X₁, ..., X_n). In this case, log f_{X|Θ}(X|θ) = Σ_{i=1}^n log f_{X₁|Θ}(X_i|θ), a sum of IID random variables conditional on Θ = θ. It follows that the covariance matrix for the sum, namely I_X(θ), is n times the covariance matrix of one of them, namely I_{X₁}(θ). That is, I_X(θ) = nI_{X₁}(θ), so Fisher information adds up over IID observations. In fact, it is additive over any finite collection of conditionally independent data sets (see Problem 39 on page 143). In this sense it measures how much information we have in a data set. Also, the more information the data provide, the better we should be able to estimate functions of Θ. Two such results, which will be proven later, are Theorems 5.13 and 7.57.

There is another sense in which Fisher information measures the information in a data set. Let Y = g(X) be an arbitrary statistic. We will see that I_X(θ) is at least as large as I_Y(θ).
Theorem 2.86. Let Y = g(X). Suppose that Θ is k-dimensional and P_θ ≪ ν_X for all θ. Then I_X(θ) − I_Y(θ) is positive semidefinite. The matrix is all 0s if and only if Y is sufficient.

PROOF. Define Q_θ(C) = P_θ[(X, Y) ∈ C]. By Corollary B.55, Q_θ ≪ ν, where ν(C) = ν_X({x : (x, g(x)) ∈ C}), with Radon-Nikodym derivative

f_{X,Y|Θ}(x, y|θ) = f_{X|Θ}(x|θ) = f_{Y|Θ}(y|θ) f_{X|Y,Θ}(x|y, θ).

It follows that

∂/∂θ_i log f_{X|Θ}(x|θ) = ∂/∂θ_i log f_{Y|Θ}(y|θ) + ∂/∂θ_i log f_{X|Y,Θ}(x|y, θ), a.s. [Q_θ],    (2.87)

for all θ. We will prove that the two terms on the right-hand side of (2.87) are uncorrelated and that the last term is 0 a.s. if and only if Y is sufficient. Proposition 2.84 says that the first two expressions in (2.87) have mean 0 and that the last one has 0 conditional mean given Y, a.s. [P_{θ,Y}]. It follows from the law of total probability B.70 that for all i and j,

Cov_θ( ∂/∂θ_i log f_{Y|Θ}(Y|θ), ∂/∂θ_j log f_{X|Y,Θ}(X|Y, θ) )
 = E_θ{ ∂/∂θ_i log f_{Y|Θ}(Y|θ) ∂/∂θ_j log f_{X|Y,Θ}(X|Y, θ) }
 = E_θ{ E_θ[ ∂/∂θ_i log f_{Y|Θ}(Y|θ) ∂/∂θ_j log f_{X|Y,Θ}(X|Y, θ) | Y ] }
 = E_θ{ ∂/∂θ_i log f_{Y|Θ}(Y|θ) E_θ( ∂/∂θ_j log f_{X|Y,Θ}(X|Y, θ) | Y ) } = 0.

Hence, the two terms on the right-hand side of (2.87) are uncorrelated. Since the conditional mean of the conditional score function is 0, a.s., Proposition B.78 says that

I_X(θ) = I_Y(θ) + E_θ[ I_{X|Y}(θ|Y) ].

It follows that I_X(θ) − I_Y(θ) is positive semidefinite. The difference is all 0s if and only if, for all i, the conditional score function equals 0 a.s. [Q_θ]. This happens if and only if f_{X|Y,Θ}(x|y, θ) is constant in θ, which means if and only if Y is sufficient. □
One feature of Fisher information, which is worth noting, is that it depends on which of several equivalent parameterizations one chooses.

Example 2.88. Suppose that P_θ says X ~ N(μ, σ²), where θ = (μ, σ). The Fisher information matrix was seen in Example 2.83 on page 112 to be

I_X(θ) = ( 1/σ²   0
           0      2/σ² ).

Now suppose that we choose the natural parameterization of the exponential family, namely

η₁ = μ/σ² = θ₁/θ₂²,    η₂ = −1/(2σ²) = −1/(2θ₂²).

The function c is

c(η) = (−2η₂)^{1/2} exp( η₁²/(4η₂) ).

Taking the negative of the matrix of second partial derivatives of log c, we get

I*(η) = ( −1/(2η₂)      η₁/(2η₂²)
          η₁/(2η₂²)     1/(2η₂²) − η₁²/(2η₂³) ).

This is clearly not the same as I_X(g⁻¹(η)).

In general, when changing parameters to H = g(Θ), we can use the chain rule as follows. If f(θ) is a function of k variables and g is one-to-one, then

∂/∂η_i f(g⁻¹(η)) = Σ_{j=1}^k [∂/∂θ_j f(θ)] [∂/∂η_i g_j⁻¹(η)],

where θ_j = g_j⁻¹(η) is the jth coordinate of g⁻¹. In our case, we need to consider f(θ) = log f_{X|Θ}(X|θ). It follows that the Fisher information about H is the matrix

I_X(η) = Δ(η) I_X(g⁻¹(η)) Δᵀ(η),

where Δ(η) is a matrix whose (i, j) entry is ∂g_j⁻¹(η)/∂η_i. The reader can verify that this method also works in Example 2.88 above.
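The verification for Example 2.88 can be sketched numerically, forming Δ(η) by central differences; the values of μ and σ are arbitrary assumptions:

```python
import math

mu, sigma = 1.3, 0.7   # arbitrary illustrative values
eta1, eta2 = mu / sigma**2, -1.0 / (2 * sigma**2)

def g_inv(e1, e2):
    # maps the natural parameter back to (mu, sigma)
    s2 = -1.0 / (2 * e2)
    return (e1 * s2, math.sqrt(s2))

h = 1e-6
# delta[i][j] = partial of g_inv_j with respect to eta_i
delta = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    up, dn = [eta1, eta2], [eta1, eta2]
    up[i] += h
    dn[i] -= h
    gu, gd = g_inv(*up), g_inv(*dn)
    for j in range(2):
        delta[i][j] = (gu[j] - gd[j]) / (2 * h)

i_theta = [[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]  # info about (mu, sigma)

# i_star = delta * i_theta * delta^T
i_star = [[sum(delta[i][k] * i_theta[k][l] * delta[j][l]
               for k in range(2) for l in range(2)) for j in range(2)]
          for i in range(2)]

expected = [[-1 / (2 * eta2), eta1 / (2 * eta2**2)],
            [eta1 / (2 * eta2**2), 1 / (2 * eta2**2) - eta1**2 / (2 * eta2**3)]]
```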

2.3.2 Kullback-Leibler Information

There is another measure of information that has similar properties to Fisher information. This measure of information is designed to measure how far apart two distributions are in the sense of likelihood. That is, if an observation were to come from one of the distributions, how likely is it that you could tell that the observation did not come from the other distribution?

Definition 2.89. Let P and Q be probability measures on the same space. Let p and q be their densities with respect to a common measure ν on that space, for example, P + Q. The Kullback-Leibler information in X is defined as

I_X(P; Q) = ∫ log[ p(x)/q(x) ] p(x) dν(x).

In the case of parametric families, let θ and ψ be two elements of Ω. The Kullback-Leibler information is then

I_X(θ; ψ) = E_θ{ log[ f_{X|Θ}(X|θ)/f_{X|Θ}(X|ψ) ] }.

If T is a statistic, let p_t and q_t denote conditional densities for P and Q given T = t with respect to a measure ν_t. Then the conditional Kullback-Leibler information is

I_{X|T}(P; Q|t) = ∫ log[ p_t(x)/q_t(x) ] p_t(x) dν_t(x).
In general, I_X(P; Q) ≠ I_X(Q; P), so Kullback-Leibler information is not a metric. The sum I_X(P; Q) + I_X(Q; P) is sometimes called the Kullback-Leibler divergence [see Kullback (1959)]. Even the divergence fails the triangle inequality in general, so it is not a metric.

Example 2.90. Suppose that X ~ N(θ, 1) given Θ = θ. Then

log[ f_{X|Θ}(x|θ)/f_{X|Θ}(x|ψ) ] = (1/2)[ (x − ψ)² − (x − θ)² ].

It follows that I_X(θ; ψ) = (ψ − θ)²/2. This time I_X(θ; ψ) = I_X(ψ; θ).

Example 2.91. Suppose that X ~ Ber(θ) given Θ = θ. Then

log[ f_{X|Θ}(x|θ)/f_{X|Θ}(x|ψ) ] = x log(θ/ψ) + (1 − x) log[(1 − θ)/(1 − ψ)].

It follows that

I_X(θ; ψ) = θ log(θ/ψ) + (1 − θ) log[(1 − θ)/(1 − ψ)].

This time I_X(θ; ψ) ≠ I_X(ψ; θ).


Kullback-Leibler information measures the information in a data set in
some of the same ways that Fisher information does.
Proposition 2.92. The Kullback-Leibler information Ix(PjQ) 2:: 0, and
it equals 0 if and only if P = Q. The conditional Kullback-Leibler infor-
mation IXIT(P; Qlt) 2:: 0, a.s. [PTI, and it equals 0 a.s. [PTI if and only
if Pt(x) = qt{x), a.s. [Pl. (See Definition 2.89.) Also, if X and Y are
e
conditionally independent given and 0, 1/J E 0, then
Ix,y(O; '1/1) = Ix(O; 1/J) + Iy(O; 1/J).
Theorem 2.93. If Y = g(X), then Ix(O; 1/J) 2:: Iy(O; 1/J) with equality for
all 0 and 1/J if and only if Y is sufficient.
PROOF. Use the same setup as in Theorem 2.86.
!xls{XIO)
Ix{f); 1/J) = E6 log fXls{XI1/J)
fYls(YIf)) !xIY,s(XIY, f))
= E6 log IYls(YI1/J) + E6 log !xIY,s(XIY, 1/J)
= Iy(O; 1/J) + E6 [IxlY(O; 1/JIY)] 2:: Iy(O; 1/J),
where the last line follows from Proposition 2.92. To make the inequality
into equality, Proposition 2,92 says that we must have
fXIY,s(XIY,O) = !xIY,s(XIY, 1/J), a.s. (P61
But this is true for all 0 and 1/J if and only if Y is sufficient. 0
The Kullback-Leibler information tells us how far one distribution is from
another in terms of likelihood.
Example 2.94. Let Ω = (0, 1), and suppose that P_θ says that {X_n}_{n=1}^∞ are IID Ber(θ). Let X = (X₁, ..., X_n). Let ψ > θ, and let Θ be discrete with

f_Θ(y) = π₀       if y = θ,
         1 − π₀   if y = ψ.

Then

Pr(Θ = θ|X = x) = π₀ θ^x (1 − θ)^{n−x} / [ π₀ θ^x (1 − θ)^{n−x} + (1 − π₀) ψ^x (1 − ψ)^{n−x} ]
 = ( 1 + [(1 − π₀)/π₀] (ψ/θ)^x [(1 − ψ)/(1 − θ)]^{n−x} )⁻¹,

where x = Σ_{i=1}^n x_i. Let p̂_n = x/n. Then

Pr(Θ = θ|X = x) = ( 1 + [(1 − π₀)/π₀] [ (ψ/θ)^{p̂_n} ((1 − ψ)/(1 − θ))^{1−p̂_n} ]^n )⁻¹
 = ( 1 + [(1 − π₀)/π₀] exp{ n[ I_{X₁}(p̂_n; θ) − I_{X₁}(p̂_n; ψ) ] } )⁻¹,

with I_{X₁}(·; ·) the single-observation information of Example 2.91. So, the probability of either θ or ψ increases with more data, depending on which one p̂_n is closer to in Kullback-Leibler information.
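The last display can be checked against a direct Bayes computation; the numbers below are arbitrary choices:

```python
import math

def kl_bern(a, b):
    # Kullback-Leibler information between Ber(a) and Ber(b), one observation
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

theta, psi, pi0 = 0.4, 0.7, 0.3   # arbitrary illustrative values
n, x = 20, 11                     # x successes in n trials
phat = x / n

# Direct posterior probability of theta.
num = pi0 * theta**x * (1 - theta) ** (n - x)
den = num + (1 - pi0) * psi**x * (1 - psi) ** (n - x)
direct = num / den

# Same quantity via the Kullback-Leibler expression.
via_kl = 1.0 / (1.0 + (1 - pi0) / pi0
                * math.exp(n * (kl_bern(phat, theta) - kl_bern(phat, psi))))
```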

One advantage Kullback-Leibler information has over Fisher information is that it is not affected by changes in parameterization. Another advantage is that Kullback-Leibler information can be used even if the distributions under consideration are not all members of a parametric family.

Example 2.95. Suppose that P is the standard normal N(0, 1) distribution and Q is the Laplace distribution Lap(0, 1). Then

p(x) = (1/√(2π)) exp(−x²/2),   E_P(X²) = 1,   E_P(|X|) = √(2/π),
q(x) = (1/2) exp(−|x|),        E_Q(X²) = 2,   E_Q(|X|) = 1.

It follows that

log[ p(x)/q(x) ] = (1/2) log(2/π) − x²/2 + |x|,
I_X(P; Q) = (1/2) log(2/π) − 1/2 + √(2/π) = 0.07209,
I_X(Q; P) = −(1/2) log(2/π) + 2/2 − 1 = 0.22579.

If data come from a Laplace distribution, it is easier to tell that they don't come from a normal distribution than vice versa.
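The two numerical values can be reproduced directly from the closed forms above:

```python
import math

# I_X(P;Q) and I_X(Q;P) for P = N(0,1), Q = Lap(0,1), using the moments
# E_P(X^2)=1, E_P|X|=sqrt(2/pi), E_Q(X^2)=2, E_Q|X|=1.
ix_pq = 0.5 * math.log(2 / math.pi) - 0.5 + math.sqrt(2 / math.pi)
ix_qp = -0.5 * math.log(2 / math.pi) + 2 / 2 - 1
```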
Another advantage to Kullback-Leibler information is that no smooth-
ness conditions on the densities (like the FI regularity conditions) are
needed.
Example 2.96. Suppose that P_θ says that X has a uniform U(0, θ) distribution. For δ > 0,

I_X(θ; θ + δ) = ∫₀^θ log[ (θ + δ)/θ ] (1/θ) dx = log( 1 + δ/θ ),
I_X(θ + δ; θ) = ∫₀^θ log[ θ/(θ + δ) ] [1/(θ + δ)] dx + ∫_θ^{θ+δ} ∞ [1/(θ + δ)] dx = ∞.

If an observation has a U(0, θ) distribution, there is some information to distinguish the observation from one with a U(0, θ + δ) distribution. On the other hand, if an observation has a U(0, θ + δ) distribution, then there is infinite information to distinguish the observation from one with a U(0, θ) distribution. The reason for this is that there is positive probability that the U(0, θ) distribution can be ruled out entirely. In a sense, this is the most powerful kind of information possible for distinguishing distributions.
There is at least one connection between Kullback-Leibler information and Fisher information when they both exist and when two derivatives can be passed under the integral sign. In this case,

∂²/(∂θ_i ∂θ_j) I_X(θ₀; θ) |_{θ=θ₀}
 = ∂²/(∂θ_i ∂θ_j) ∫ log[ f_{X|Θ}(x|θ₀)/f_{X|Θ}(x|θ) ] f_{X|Θ}(x|θ₀) dν(x) |_{θ=θ₀}
 = ∫ −∂²/(∂θ_i ∂θ_j) log f_{X|Θ}(x|θ) |_{θ=θ₀} f_{X|Θ}(x|θ₀) dν(x)
 = −E_{θ₀}( ∂²/(∂θ_i ∂θ_j) log f_{X|Θ}(X|θ) |_{θ=θ₀} ) = I_{X,i,j}(θ₀),

the (i, j) element of the Fisher information matrix.

Example 2.97 (Continuation of Example 2.91; see page 116). The second partial derivative of the Kullback-Leibler information with respect to ψ is

∂²/∂ψ² I_X(θ; ψ) = θ/ψ² + (1 − θ)/(1 − ψ)².

If one plugs in ψ = θ, one gets 1/[θ(1 − θ)] = I_X(θ), the Fisher information.
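The relationship can be checked with a central second difference in ψ; the value of θ is an arbitrary choice:

```python
import math

def kl_bern(theta, psi):
    # Bernoulli Kullback-Leibler information from Example 2.91
    return (theta * math.log(theta / psi)
            + (1 - theta) * math.log((1 - theta) / (1 - psi)))

theta, h = 0.35, 1e-4
# Second difference approximates the second psi-derivative at psi = theta.
second = (kl_bern(theta, theta + h) - 2 * kl_bern(theta, theta)
          + kl_bern(theta, theta - h)) / h**2
fisher = 1 / (theta * (1 - theta))
```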

2.3.3 Conditional Information*


We defined conditional Fisher information in Definition 2.79 as the condi-
tional covariance matrix of the conditional score function. We also defined
conditional Kullback-Leibler information in Definition 2.89 as the condi-
tional mean of the logarithm of the ratio of the conditional densities. We
used these conditional information measures in the proofs of Theorems 2.86
and 2.93 to show that sufficient statistics contain all of the information in a

*This section may be skipped without interrupting the flow of ideas.



sample. However, in Section 2.1.4, it was suggested that performing inference
conditional on ancillary statistics makes better use of the information
available in the actual sample obtained. We can make this idea more precise
by considering conditional Fisher and Kullback-Leibler information given
an ancillary.
Theorem 2.98. Let U be an ancillary statistic. Both Fisher and Kullback-
Leibler information have the property that the information is the mean of
the conditional information given U.
PROOF. Suppose that X has a density f_{X|Θ} given Θ with respect to a
measure ν. If u = U(x), then we can write f_{X|Θ}(x|θ) = f_U(u) f_{X|U,Θ}(x|u,θ),
since U is independent of Θ. If the FI regularity conditions hold, then

    ∂/∂θ_i log f_{X|Θ}(x|θ) = ∂/∂θ_i log f_{X|U,Θ}(x|u,θ).

Since the mean of the conditional score function is 0, a.s., the mean of
the conditional covariance matrix equals the marginal covariance matrix
by Proposition B.78. In symbols, I_X(θ) = E_θ I_{X|U}(θ|U). Similarly, for
Kullback-Leibler information,

    f_{X|Θ}(x|θ) / f_{X|Θ}(x|ψ) = f_{X|U,Θ}(x|u,θ) / f_{X|U,Θ}(x|u,ψ),

so that I_X(θ; ψ) = E I_{X|U}(θ; ψ|U). □
Some data sets have more information and some have less depending
on the value of the ancillary U. Theorem 2.98 says that the amount of
information averages out to the marginal information over the distribution
of U, but we can make use of the observed value of U to tell us whether
we have one of the data sets with more or less information.
Example 2.99 (Continuation of Example 2.38; see page 95). In this example,
X = (X_1, X_2) with the X_i being IID N(θ, 1) given Θ = θ. We had U = X_2 − X_1.
The conditional distribution of X given U can be obtained from the conditional
distribution of X_1 given U, which is N(θ + u/2, 1/2). The conditional score
function is 2(X_1 − θ − u/2), which has conditional variance equal to 2 for all
u. Similarly, the conditional Kullback-Leibler information is
I_{X|U}(θ; ψ|u) = (θ − ψ)² for all u. Hence, this ancillary does not help
distinguish data sets from each other.
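A quick simulation sketch, using the conditional distribution stated in the example (the variance of the score is the same whatever value u we fix):

```python
import random

# Simulate the conditional score 2(X1 - theta - u/2), where, as stated in
# the example, X1 | U = u has the N(theta + u/2, 1/2) distribution.  Its
# conditional variance is 4 * (1/2) = 2 no matter which u is observed.
rng = random.Random(1)
theta, u, n = 0.5, 3.0, 100_000
scores = [2.0 * (rng.gauss(theta + u / 2, 0.5 ** 0.5) - theta - u / 2)
          for _ in range(n)]
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
print(var)   # close to 2 for any choice of u
```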
Example 2.100 (Continuation of Example 2.52; see page 100). In this problem,
there were two ancillaries, M_0 and N_0. We can write the second derivative of
the logarithm of the density as

    ∂²/∂θ² log f_{X|Θ}(X|θ)
        = − N_{00}/(1−θ)² − N_{01}/(1+θ)² − N_{10}/(2+θ)² − N_{11}/(2−θ)².   (2.101)

The Fisher information is I_X(θ) = N(2 − θ²)/[(1 − θ²)(4 − θ²)]. According to
Problem 43 on page 143, we can find the conditional Fisher information by
calculating minus the conditional mean of (2.101). Conditional on M_0 = m_0,
we get

    I_{X|M_0}(θ|m_0) = [3m_0 + N(1 − θ²)] / [(1 − θ²)(4 − θ²)].

It is clear that this is an increasing function of m_0, so the more
observations we get with first coordinate equal to 0, the more conditional
information we have given M_0. Conditional on N_0 = n_0, we get

    I_{X|N_0}(θ|n_0) = [2n_0 θ + N(2 − θ − θ²)] / [(1 − θ²)(4 − θ²)].

This is an increasing function of n_0. It is easy to verify that the means of
these two are both equal to the marginal information, since E(M_0) = N/3 and
E(N_0) = N/2.
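Theorem 2.98 can be checked directly here: the conditional information is linear in m_0, so substituting E(M_0) = N/3 recovers the marginal Fisher information. A small numerical sketch (our own function names):

```python
def marginal_info(theta, N):
    # I_X(theta) = N(2 - theta^2) / [(1 - theta^2)(4 - theta^2)]
    return N * (2 - theta**2) / ((1 - theta**2) * (4 - theta**2))

def cond_info_M0(theta, N, m0):
    # I_{X|M0}(theta | m0) = [3 m0 + N(1 - theta^2)] / [(1 - theta^2)(4 - theta^2)]
    return (3 * m0 + N * (1 - theta**2)) / ((1 - theta**2) * (4 - theta**2))

theta, N = 0.1, 50
# The conditional information is linear in m0, so plugging in E(M0) = N/3
# gives its mean, which matches the marginal information (Theorem 2.98).
print(cond_info_M0(theta, N, N / 3.0), marginal_info(theta, N))
```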

In Example 2.100, one might ask which of the ancillaries does a better
job of distinguishing data sets from each other in terms of information.
This might be answered by looking at how spread out is the distribution
of the conditional information.
Example 2.102 (Continuation of Example 2.100; see page 119). We can compute
Var(M_0) = 2N/9 and Var(N_0) = N/4, so that the variance of I_{X|M_0}(θ|M_0)
is 2/θ² (> 1) times as large as the variance of I_{X|N_0}(θ|N_0). Suppose that
we are interested in the statistic N_{00}. We can calculate any aspect we wish
of the conditional distribution of N_{00} given either M_0 or N_0 (and Θ). To
see how much more M_0 distinguishes data sets than does N_0, Figure 2.103
shows the distribution of the conditional mean of N_{00} given M_0 and N_0 for
θ = 0.1 and N = 50. It is easy to see how much more the distribution is spread
out conditional on M_0 than on N_0. Since the variance of the conditional mean
of N_{00} is greater given M_0 than given N_0 (2.25 versus 1.125), it follows
from Proposition B.78 that the mean of the conditional variance must be
smaller (by the same amount) given M_0 than given N_0. In fact, the values are
4.125 and 5.25, respectively. Because this example is sufficiently simple, one
can even calculate the probability that the conditional variance of N_{00}
given M_0 will be smaller than the conditional variance given N_0. The
probability is 0.8346.

FIGURE 2.103. Distribution of Conditional Means of N_{00} given M_0 and N_0
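The factor 2/θ² follows from the linearity of the two conditional informations in their ancillaries; the slopes below are read off the formulas in Example 2.100 (a sketch, with our own variable names):

```python
theta, N = 0.1, 50
D = (1 - theta**2) * (4 - theta**2)
# I_{X|M0} is linear in M0 with slope 3/D, and I_{X|N0} is linear in N0
# with slope 2*theta/D.  With Var(M0) = 2N/9 and Var(N0) = N/4, the
# variance ratio is (9 * 2N/9) / (4*theta^2 * N/4) = 2/theta^2.
var_M0, var_N0 = 2 * N / 9, N / 4
ratio = ((3 / D) ** 2 * var_M0) / (((2 * theta) / D) ** 2 * var_N0)
print(ratio, 2 / theta**2)   # both 200.0 at theta = 0.1
```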

2.3.4 Jeffreys' Prior*


Fisher information turns out to have a role to play in one popular method
for choosing prior distributions. Suppose that one desires a method for
choosing a prior density with the following property. If the parameter Θ
were to be transformed by a one-to-one differentiable function g with
differentiable inverse, then the prior for Ψ = g(Θ) obtained by the method
would be the same as the usual transformation of the prior obtained for Θ by
the method. For example, suppose that Θ is a positive parameter and that
the method produces a prior with density f_Θ with respect to Lebesgue
measure. Let Ψ = Θ². The usual method of transformations would make
the prior for Ψ equal to

    f_Ψ(ψ) = f_Θ(√ψ) · 1/(2√ψ).

We want the method, when applied directly to Ψ, to produce this same
prior.
A class of methods that have this property is the following. Let
h : X × Ω → ℝ be a function, and define

    f_Θ(θ) = [ Var_θ( ∂h(X, θ)/∂θ ) ]^{1/2}.    (2.104)

To see that this works, let Ψ = g(Θ) and let the new parameter space be
Ω′. Note that h must be modified to h′ : X × Ω′ → ℝ by

    h′(x, ψ) = h(x, g^{-1}(ψ)),

or else the expression on the right of (2.104) makes no sense. Now

    ∂h′(x, ψ)/∂ψ = ∂h(x, θ)/∂θ |_{θ=g^{-1}(ψ)} · d g^{-1}(ψ)/dψ,

so that

    f_Ψ(ψ) = [ Var_ψ( ∂h′(X, ψ)/∂ψ ) ]^{1/2} = f_Θ(g^{-1}(ψ)) | d g^{-1}(ψ)/dψ |,

*This section may be skipped without interrupting the flow of ideas.



which is just what transformation of variables would give.


The most popular function h to use for such a method is the logarithm
of the conditional density, h(x, θ) = log f_{X|Θ}(x|θ). In this case,

    ∂h(x, θ)/∂θ = ∂/∂θ log f_{X|Θ}(x|θ),

which is the score function. We have already seen that under the FI
regularity conditions, Var_θ(∂h(X, θ)/∂θ) = I_X(θ). So, the method says to use
f_Θ(θ) = c √(I_X(θ)) as the prior density, where c is chosen to make the
integral of f_Θ(θ) equal to 1, if possible. If no such c exists, then
f_Θ(θ) = √(I_X(θ)) is often used as an improper prior. This type of prior is
called Jeffreys' prior after Harold Jeffreys, who proposed it in Jeffreys
(1961, p. 181).
Example 2.105. Suppose that X ~ Bin(n, p) given P = p. Then we saw in
Example 2.82 on page 112 that the Fisher information is I_X(p) = n(p[1 − p])^{-1}.
This makes the Jeffreys' prior proportional to p^{-1/2}(1 − p)^{-1/2}. This is
the Beta(1/2, 1/2) distribution, which is a proper prior.
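As a numerical sketch (not part of the text), one can check that the square root of the Fisher information integrates to the Beta(1/2, 1/2) normalizer B(1/2, 1/2) = π, so the prior is indeed proper:

```python
import math

def jeffreys_kernel(p, n=1):
    # sqrt of I_X(p) = n/[p(1-p)] for Bin(n, p); n contributes only a constant.
    return math.sqrt(n / (p * (1 - p)))

# Midpoint rule avoids the (integrable) endpoint singularities; the exact
# value of the integral for n = 1 is B(1/2, 1/2) = pi.
m = 200_000
h = 1.0 / m
approx = sum(jeffreys_kernel((i + 0.5) * h) * h for i in range(m))
print(approx)   # close to pi
```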

Example 2.106. Suppose that X ~ Negbin(a, p) given P = p. It is not difficult
to show that the Fisher information is I_X(p) = a(p²[1 − p])^{-1}. This makes
the Jeffreys' prior proportional to p^{-1}(1 − p)^{-1/2}. This is not a proper
prior.
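The claimed information can be verified from the score. A sketch, assuming the parametrization in which X counts failures before the a-th success:

```python
# d/dp log f(x|p) = a/p - x/(1-p), so Var_p(score) = Var(X)/(1-p)^2.
# With Var(X) = a(1-p)/p^2, this gives I_X(p) = a/(p^2 (1-p)).
a, p = 5, 0.4
var_X = a * (1 - p) / p**2
fisher_from_score = var_X / (1 - p)**2
print(fisher_from_score, a / (p**2 * (1 - p)))   # both 52.08...
```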
Interestingly also, Jeffreys' prior in this example is not the same as the prior
for the case of binomial sampling (Example 2.105). This means that choosing a
prior by Jeffreys' method has the unfortunate characteristic that it depends on
something that would normally not be taken into account in a Bayesian analysis.
For example, suppose that one were to be exposed to a sequence of exchangeable
Bernoulli random variables one at a time. If one were asked to calculate the
predictive distribution of each observation before it is observed, one would have
to ask whether sampling was to continue to a fixed size or to a fixed number
of successes (or failures) before one could even choose a prior distribution. This
stopping criterion should be irrelevant before the first observation arrives,15 but
the method of Jeffreys' prior must take it into account.
Example 2.107. Suppose that the density with respect to Lebesgue measure of
X given Θ = θ is f(x − θ), where f is differentiable. Then

    ∂/∂θ log f(X − θ) = − f′(X − θ) / f(X − θ).

15 One could imagine situations in which the stopping criterion is chosen by
someone who has information not available to us. In such a situation, it is
possible that when we learn what this other person has chosen for the stopping
criterion, we believe that the choice tells us some additional information
that we would like to incorporate into our model. For example, suppose that
this other person decides to stop as soon as five successes are observed. We
might then say, "Aha! We will use the prior p^{-1}(1 − p)^{-1/2} to reflect
this information." But then, we discover that we only have time to collect
four observations. Jeffreys' rule says that we have to change back to the
prior p^{-1/2}(1 − p)^{-1/2} because we will have a fixed sample size, even if
we believe that the reason the sample size is being fixed has nothing to do
with P.

The distribution of this quantity given Θ = θ is the same as the distribution
of −f′(X)/f(X) given Θ = 0; hence the variance is constant as a function of θ.
This means that Jeffreys' prior would be constant. If Ω is an unbounded set,
Jeffreys' prior is improper.
For multiparameter problems, a similar derivation is possible. Let f_Θ(θ)
be proportional to the square root of the determinant of the covariance
matrix of the gradient vector of h(X, θ) with respect to θ. It is easy to
check that the gradient of h(X, g^{-1}(ψ)) with respect to ψ is equal to the
matrix whose determinant is the Jacobian times the gradient of h(x, θ)
with respect to θ evaluated at θ = g^{-1}(ψ). In the special case of Jeffreys'
prior, h is the log of the density and f_Θ(θ) becomes the square root of the
determinant of the Fisher information matrix, I_X(θ).
Example 2.108. In Example 2.83 on page 112, we found that if X ~ N(μ, σ²)
given Θ = (μ, σ), the Fisher information matrix was diagonal with entries 1/σ²
and 2/σ². The determinant is 2/σ⁴, and Jeffreys' prior is a constant over σ²,
an improper prior. The usual improper prior in this problem is a constant
over σ.
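A minimal sketch of the multiparameter recipe for this example (function name ours):

```python
import math

def jeffreys_normal(sigma):
    # Fisher information matrix for X ~ N(mu, sigma^2) with theta = (mu, sigma)
    # is diag(1/sigma^2, 2/sigma^2) (Example 2.83); its determinant is 2/sigma^4.
    det = (1.0 / sigma**2) * (2.0 / sigma**2)
    return math.sqrt(det)            # proportional to 1/sigma^2; free of mu

# Doubling sigma divides the prior by 4, i.e., the prior scales like 1/sigma^2.
print(jeffreys_normal(1.0) / jeffreys_normal(2.0))   # 4.0
```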

One interesting feature of Jeffreys' prior is that its definition did not
depend on the parameter space (except that we were able to take
derivatives). That is, if the parameter space is actually an open subset of
the set {θ : f_{X|Θ}(x|θ) is a density}, then Jeffreys' prior has the same
form. Obviously, a different normalizing constant will be required if the
prior is proper.
Example 2.109 (Continuation of Example 2.107; see page 122). Suppose that
the parameter space is actually only the open interval (a, b), but the
conditional density of X given Θ = θ is still f(x − θ). Then Jeffreys' prior
is the U(a, b) distribution, which is proper.

2.4 Extremal Families*


In this chapter we have shown how one can determine a sufficient statistic
once one has chosen a family of models indexed by a parameter. Lau-
ritzen (1984, 1988) has developed a theory in which a family of probability
models is determined once one chooses a sequence of sufficient statistics.
Lauritzen's theory is general enough to apply to collections of random
quantities that are not exchangeable and to collections more general than
sequences. We will consider only the case of sequences here. Diaconis and
Freedman (1984) also prove results of a similar nature, and Theorem 2.111
below is based on their work.

*This section may be skipped without interrupting the flow of ideas.



2.4.1 The Main Results


Obviously, it takes more than just a sufficient statistic to identify an
interesting class of probability models. For example, T_n = Σ_{i=1}^n X_i is
sufficient whether the X_i are conditionally IID N(θ, 1) or Ber(θ) given
Θ = θ. But these two models would not both be considered appropriate for the
same data. The conditional distribution of X_1, …, X_n given T_n would also be
useful in identifying a class of models. This conditional distribution can be
described by a transition kernel.
Definition 2.110. Let X and T be topological spaces and let B and C
be the Borel σ-fields. A function r : B × T → [0, 1] is called a transition
kernel if r(·, t) is a probability on (X, B) for every t ∈ T, and r(A, ·) is
measurable for every A ∈ B.

A transition kernel r is like a regular conditional distribution, except
that it need not satisfy an equation like ∫_C r(B, t) dμ_T(t) = Pr(X ∈ B,
T ∈ C), because there is no mention of a marginal distribution for T. Our
goal, in this section, is to prove a representation theorem for the joint
distribution of {X_n}_{n=1}^∞ under the assumptions that a particular
sequence of statistics {T_n}_{n=1}^∞ are sufficient and that the conditional
distributions given the sufficient statistics are a particular collection of
transition kernels.
The basic structure we will consider here is the following. Let (S, A) be
a measurable space. For each n, let (X_n, B_n) and (T_n, C_n) be Borel spaces.
The set X_n is the space in which all data available at time n lie, and T_n is
the space in which the sufficient statistic at time n lies. Let T_n : X_n → T_n
be measurable, and let p_{n-1,n} : X_n → X_{n-1} be onto and measurable. Then
T_n is the sufficient statistic at time n, and p_{n-1,n} is the function that
extracts the data available at time n − 1 from the data available at time n.
Let X be the following subset of ∏_{n=1}^∞ X_n:

    X = { x = (x_1, x_2, …) ∈ ∏_{n=1}^∞ X_n : p_{n-1,n}(x_n) = x_{n-1}, for all n > 1 }.

(It is easy to see that X is in the product σ-field.) The set X is the set of
sequences of possible data which are consistent in the sense that the data
at time n are an extension of the data at time n − 1 for all n > 1. Define,
for k < n,

    p_{k,n} = p_{k,k+1}(p_{k+1,k+2}(⋯(p_{n-1,n}(·)))).

Let B be the Borel σ-field of X. Let p_n : X → X_n be the nth coordinate
projection function, p_n(x_1, x_2, …) = x_n. Let X : S → X be measurable,
and define X_n = p_n(X). The definition of X makes it clear that, for all
x ∈ X, all n, and all k < n, p_{k,n}(p_n(x)) = x_k. That is, X_k = p_{k,n}(X_n)
for all n and all k < n. Let E_n be the sub-σ-field of B generated by
{T_l(p_l)}_{l=n}^∞, and set

    E = ∩_{n=1}^∞ E_n.

For brevity, we will use the symbol T_n to stand for T_n(p_n(X)) or for
T_n(p_n). This makes E the tail σ-field of the sequence of statistics
{T_n}_{n=1}^∞.
Theorem 2.111. For each n, let r_n : B_n × T_n → [0, 1] be a transition
kernel such that

    r_n(T_n^{-1}({t}), t) = 1, for all t ∈ T_n.    (2.112)

Suppose that the following is true for each n and t ∈ T_{n+1}:16

    Condition S: Assume that the distribution of X_{n+1} is r_{n+1}(·, t).
    Then r_n(·, s) is a regular conditional distribution for X_n given
    T_n = s for all s ∈ T_n(p_{n,n+1}(T_{n+1}^{-1}({t}))).

Let M be the set of all distributions on (X, B) such that r_n is a version of
the conditional distribution of X_n given T_n for all n. Then M is a convex
set. Let ℰ be the set of extreme points of M. Suppose that M is nonempty.
Then, there exists a set E in the tail σ-field E and a transition kernel
Q : B × E → [0, 1] such that

1. P(E) = 1 for all P ∈ M,
2. for each x ∈ E, r_n(·, T_n(x_n)) converges weakly to Q(·, x),
3. for each P ∈ M, Q is a regular conditional distribution for X given E,
4. for each x ∈ E and A ∈ E, Q(A, x) ∈ {0, 1},
5. for each P ∈ M, there is a unique probability R on (X, E) such that

       P = ∫ Q(·, x) dR(x),

6. the R in part 5 is the restriction of P to E,
7. for each P ∈ M, P ∈ ℰ if and only if P({x : P = Q(·, x)}) = 1.


If the distribution of X is in the class M, we say that {X_n}_{n=1}^∞ is
partially exchangeable17 relative to the sequences {T_n}_{n=1}^∞ and
{r_n}_{n=1}^∞. The family of distributions in ℰ is called the extremal
family. It is helpful to comment on the conditions in Theorem 2.111.
Equation (2.112) says that r_n(·, t) puts

16 Condition S is a way of saying that X_n is conditionally independent of
T_{n+1} given T_n. The problem with saying it this way is that such a
statement requires an explicit distribution for X_{n+1}, and such a
distribution has not yet been defined.
17 Dawid (1982) refers to the type of models considered here as
intersubjective models. The reason is that all of the distributions in M have
a common conditional distribution for the data X_n given T_n, and they only
disagree on the marginal distribution of T_n.

all of its mass on the set of points where T_n = t, so that it really looks
like a conditional distribution given T_n = t. Condition S is a way of
expressing the idea that T_n is sufficient without introducing parameters. It
says that conditioning X_{n+1} on T_{n+1} and then looking at the conditional
distribution of X_n given T_n is the same as conditioning X_n on T_n from the
start.

The most common situation in which the above conditions hold is that
in which X_n = Y^n and B_n = D^n for some Borel space (Y, D). In this case
p_{n,n+1}(y_1, …, y_{n+1}) = (y_1, …, y_n), and there is a bimeasurable
function w : X → Y^∞ given by w((y_1, (y_1, y_2), …)) = (y_1, y_2, …). If, in
addition, the Y_i are exchangeable, we have a special version of Theorem
2.111.
Theorem 2.113. Let (Y, D) be a Borel space and let X_n = Y^n, B_n = D^n.
Let Y_k : S → Y be measurable for all k. Define X as above and define
X : S → X by18

    X = (Y_1, (Y_1, Y_2), (Y_1, Y_2, Y_3), …).

Let w : X → Y^∞ be defined by

    w((y_1, (y_1, y_2), …)) = (y_1, y_2, …).

Assume the conditions of Theorem 2.111, but replace Condition S by the
stronger

    Condition T: Assume that the distribution of X_{n+1} is r_{n+1}(·, t).
    Then r_n(·, s) is a regular conditional distribution for X_n given
    Y_{n+1} and T_n = s for all s ∈ T_n(p_{n,n+1}(T_{n+1}^{-1}({t}))).

Suppose that for every n and t, r_n(·, t) is the distribution of n
exchangeable coordinates, and that T_n is a symmetric (with respect to
permutations of the arguments) function of Y_1, …, Y_n. Let M* and ℰ* be the
sets of distributions on (Y^∞, D^∞) induced by w from those in M and ℰ,
respectively. Then all elements of M* are distributions of exchangeable
random quantities. Also, ℰ* is the set of all IID distributions in M* for
which the coordinates have distribution equal to limits (as n → ∞) of
r_n(p_{1,n}(·), T_n(x_n)) for (x_1, x_2, …) ∈ E.
Exponential families are a special case which can be characterized by
their transition kernels. A generalization of the following theorem was
proven by Diaconis and Freedman (1990). What this theorem says is that
if an extremal family has the same sufficient statistics and conditional dis-
tributions as an exponential family, then those members of the extremal
family with densities are the exponential family distributions. There may
also be degenerate distributions in the extremal family which would not be
part of the exponential family. (See Example 2.117 on page 128.)

18 In this way X_n = (Y_1, …, Y_n).



Theorem 2.114. Let h : ℝ^k → [0, ∞) be strictly positive on a set of
positive Lebesgue measure. Suppose that there exists θ such that

    c(θ) = ∫ h(x) exp(θ^T x) dx < ∞.    (2.115)

Let Y = ℝ^k = T_n for all n. Suppose that T_n(y_1, …, y_n) = Σ_{i=1}^n y_i.
Let h^{(1)} = h and

    h^{(n)}(t) = ∫ h^{(n-1)}(t − y) h(y) dy,

for n > 1. For B ⊆ X_n and t ∈ ℝ^k, let

    B(t) = { (y_1, …, y_{n-1}) : (y_1, …, y_{n-1}, t − Σ_{i=1}^{n-1} y_i) ∈ B }.

Let

    r_n(B, t) = [1/h^{(n)}(t)] ∫_{B(t)} h(y_1) ⋯ h(y_{n-1}) h(t − Σ_{i=1}^{n-1} y_i) dy_1 ⋯ dy_{n-1}.

Then the conditions of Theorem 2.113 are satisfied. Also, the members of
the extremal family with distributions absolutely continuous with respect to
Lebesgue measure are the members of the exponential family with the Y_i
being IID with density h(y) exp(θ^T y)/c(θ), for some θ satisfying (2.115).
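To make (2.115) concrete, take k = 1 and the base function h(y) = exp(−y²/2) (our choice, for illustration); then c(θ) = √(2π) e^{θ²/2}, and the exponential family of the theorem is exactly N(θ, 1). A sketch:

```python
import math

# Completing the square: h(y) exp(theta*y)/c(theta)
#   = exp(-y^2/2 + theta*y - theta^2/2) / sqrt(2*pi)
#   = exp(-(y - theta)^2 / 2) / sqrt(2*pi),  the N(theta, 1) density.
def c(theta):
    return math.sqrt(2 * math.pi) * math.exp(theta**2 / 2)

def tilted_density(y, theta):
    return math.exp(-y**2 / 2 + theta * y) / c(theta)

def normal_pdf(y, mu):
    return math.exp(-(y - mu)**2 / 2) / math.sqrt(2 * math.pi)

print(tilted_density(0.7, 1.3), normal_pdf(0.7, 1.3))   # equal
```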

2.4.2 Examples
In this section, we present examples of the above theorems for exchangeable
sequences. For more general distributions, some examples are given in
Section 8.1.3. The theorems can be summarized as follows. Suppose that
we specify a sequence of sufficient statistics and conditional distributions
for {Y_n}_{n=1}^∞ given the sufficient statistics in such a way that the
sequence is exchangeable. Then the Y_n are conditionally IID with distribution
being one of the limits of r_n(p_{1,n}(·), T_n(x_n)). Examination of these
limits should reveal the collection of extremal distributions.

The most straightforward example of Theorem 2.113 is to show that it
implies DeFinetti's representation theorem 1.49 for random variables.
Example 2.116. Let X_n = ℝ^n. Let T_n be the subset of ℝ^n with the
coordinates in nondecreasing order. Let r_n(A, t) equal 1/n! times the number
of permutations of the coordinates of t which are elements of A. Let
p_{n-1,n}(x_1, …, x_n) = (x_1, …, x_{n-1}). Clearly, every IID distribution is
in M, so M is nonempty. Now, suppose that X_{n+1} has distribution
r_{n+1}(·, t). The set Y(t) = p_{n,n+1}(T_{n+1}^{-1}({t})) for t ∈ T_{n+1} is
just the set of vectors of length n whose coordinates are n draws without
replacement from the coordinates of t, or equivalently, the set of vectors
consisting of the first n coordinates of the permutations of the coordinates
of t. The distribution of X_n is uniform over these (n+1)! points (with
repeats counted as more than one point), and the distribution of T_n is
uniform on T_n(Y(t)), which consists of the n+1 vectors obtained by removing
one coordinate from t. The conditional distribution of X_n given T_n = s is
clearly r_n(·, s) for each s ∈ T_n(Y(t)). Hence the conditions of Theorem
2.111 are satisfied.

A combinatorial argument like the one used in the proof of Theorem 1.49
shows that each limit point of a sequence {r_n(p_{k,n}(·), T_n(x_n))}_{n=1}^∞
of probabilities (for fixed k and x) has IID coordinates. Hence, Q(·, x) is a
distribution of IID random variables for each x. We can determine which IID
distribution by looking at the first coordinate. Since the
r_n(p_{1,n}(·), T_n(x_n)) distributions are just the empirical probability
measures of the first n coordinates of x, Q(·, x) is the limit of the
empirical probability measures for those x such that the limit exists. Since
all CDFs on ℝ are such limits, we get DeFinetti's representation theorem 1.49
out of Theorem 2.111.
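The key step, that r_n(p_{1,n}(·), T_n(x_n)) is the empirical measure of the first n coordinates, can be simulated: drawing a uniformly random permutation of t and keeping its first coordinate is the same as drawing uniformly from the coordinates of t. A sketch (the point t below is ours, for illustration):

```python
import random
from collections import Counter

# r_n(., t) is uniform over the permutations of the coordinates of t, so
# the induced distribution of the first coordinate is the empirical
# measure of the coordinates of t.
t = (1, 1, 2, 5)                  # a point of T_4: coordinates nondecreasing
rng = random.Random(42)
draws = Counter()
for _ in range(20_000):
    perm = list(t)
    rng.shuffle(perm)
    draws[perm[0]] += 1           # first coordinate of a random permutation

for v in sorted(set(t)):
    print(v, draws[v] / 20_000, t.count(v) / len(t))
```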

When (Tn,Cn ) is the same space (T,C) for all n, it may be possible to
identify the extremal distributions with elements of T.
Example 2.117. Suppose that Y = ℝ and T_n = ℝ × [0, ∞) with
T_n(x_1, …, x_n) = (Σ_{i=1}^n x_i, Σ_{i=1}^n x_i²), and r_n(·, (t_1, t_2)) the
uniform distribution on the surface of the sphere of radius √(t_2 − t_1²/n)
around (t_1/n, …, t_1/n). If an n-dimensional vector Y is uniformly
distributed on the sphere of radius 1 around 0, then r_n is the distribution
of (t_1/n, …, t_1/n) + √(t_2 − t_1²/n) Y. So, we will find the distribution of
Y_1. The conditional distribution of (Y_2, …, Y_n) given Y_1 = y_1 is uniform
on the sphere in the (n−1)-dimensional space in which the first coordinate is
y_1 with radius √(1 − y_1²) around the point (y_1, 0, …, 0). The marginal
density of Y_1 is then the ratio of the surface areas of these two spheres.
The surface area of a sphere of radius r in n > 1 dimensions is
2π^{n/2} r^{n-1}/Γ(n/2). So

    f_{Y_1}(y_1) = [ Γ(n/2) / (Γ((n−1)/2) √π) ] (1 − y_1²)^{(n−3)/2}.

Let σ_n² = (t_2 − t_1²/n)/n. Then, the density of X_1 given T_n = (t_1, t_2) is

    f(x) = [ Γ(n/2) / (Γ((n−1)/2) √π √n σ_n) ] (1 − (x − t_1/n)²/(nσ_n²))^{(n−3)/2}.

Since

    lim_{n→∞} Γ(n/2) / (Γ((n−1)/2) √n) = 1/√2,

we have that, for large n,

    f(x) ≈ [ 1/(σ_n √(2π)) ] (1 − (x − t_1/n)²/(nσ_n²))^{(n−3)/2}.

If σ_n converges to σ and t_1/n converges to μ, this function converges
uniformly on compact sets to the N(μ, σ²) density. If σ_n goes to ∞, the limit
is 0 and does not correspond to a probability distribution. If σ_n converges
to 0 and t_1/n converges to μ, the density goes to 0 uniformly outside of
every open interval around μ; hence the limit distribution is concentrated at
μ. If σ_n converges to 0 but t_1/n goes to ±∞, the limit is not a probability
distribution. Hence the set ℰ consists of all IID distributions in which the
coordinates are either normally distributed or constant.
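The convergence to the normal density can be checked numerically. The sketch below evaluates the exact density of √n Y_1 (from the formula for f_{Y_1} above, via log-gamma to avoid overflow) and compares it with the N(0, 1) density:

```python
import math

def density_sqrt_n_Y1(z, n):
    # Density of sqrt(n) * Y1 when Y is uniform on the unit sphere in R^n:
    # Gamma(n/2) / (Gamma((n-1)/2) sqrt(pi n)) * (1 - z^2/n)^((n-3)/2).
    log_c = (math.lgamma(n / 2) - math.lgamma((n - 1) / 2)
             - 0.5 * math.log(math.pi * n))
    return math.exp(log_c + ((n - 3) / 2) * math.log1p(-z * z / n))

def std_normal_pdf(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

for n in (10, 100, 10_000):
    print(n, density_sqrt_n_Y1(1.0, n), std_normal_pdf(1.0))
```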

Theorem 2.111 is actually so general that it applies to all joint
distributions.

Example 2.118. Let {Y_n}_{n=1}^∞ be a sequence of arbitrary Borel spaces. Let
X_n = T_n = ∏_{i=1}^n Y_i, and let T_n be the identity transformation on X_n.
Let r_n(A, t) = I_A(t). Let p_{n-1,n}(y_1, …, y_n) = (y_1, …, y_{n-1}). Then
the conditions of Theorem 2.111 are satisfied, and the extreme points of M are
the point mass distributions Q(A, x) = I_A(x) for all A ∈ B. The tail σ-field
is the whole σ-field B, and the representation probability for P is P itself.
Needless to say, this is not an interesting example of the representation, but
it is an example.

2.4.3 Proofs+
The proof of Theorem 2.111 will proceed by means of a sequence of lemmas.
The following simple proposition implies that M is a convex set.
Proposition 2.119. Let P and Q be probability measures on a measurable
space (Y, C), and let R = λP + (1 − λ)Q with 0 < λ < 1. Let D be a
sub-σ-field of C such that P(·|D) = Q(·|D). Then R(·|D) = P(·|D).
Next, we prove that the conditional distribution of X_n given {T_l}_{l=n}^∞
is the same as the conditional distribution given T_n alone.19

Lemma 2.120. For each n and each P ∈ M, X_n is conditionally independent
of {T_{n+i}}_{i=1}^∞ given T_n.
PROOF. We will prove this by showing that the conditional distribution
of X_n given T_n, …, T_m is the same as the conditional distribution of X_n
given T_n for all m. It will follow from the result in Problem 13 on page 663
that this is also the conditional distribution of X_n given {T_l}_{l=n}^∞.

For each P ∈ M, r_{n+1}(·, t) is the conditional distribution of X_{n+1}
given T_{n+1} = t. Condition S says that r_n(·, s) is the conditional
distribution of X_n given T_n = s and T_{n+1} = t. But r_n is the conditional
distribution of X_n given T_n, so the result is true if m = n + 1. We finish
the proof by induction on m. Suppose that the conditional distribution of X_n
given T_n, …, T_m is r_n. Now, find the conditional distribution of X_n given
T_n, …, T_{m+1}. The conditional distribution of X_{m+1} given T_{m+1} is
r_{m+1}, so Condition S says that the conditional distribution of X_m given
(T_m, T_{m+1}) is r_m,
+This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
19 This is a stronger statement of the fact that T_n is sufficient. If we
think of {T_l}_{l=n}^∞ as the parameter, then Lemma 2.120 is the usual
classical concept of sufficiency.

the conditional distribution of X_m given T_m.20 Since (X_n, T_n, …, T_m) is a
function of X_m, its conditional distribution given (T_m, T_{m+1}) is the same
as its conditional distribution given T_m. According to Theorem B.75, we can
use this last conditional distribution to find the conditional distribution of
X_n given T_n, …, T_{m+1} by conditioning X_n on T_n, …, T_m. By the induction
hypothesis, this just produces r_n, and the proof is complete. □
It follows from Lemma 2.120 that for every n, r_n(·, t) can be used as a
version of the conditional distribution of X_n given T_n = t, T_{n+1} = u, ….
Next, we find the conditional distributions of X_k given E for each k.
These distributions are limits of the conditional distributions given T_n as
n → ∞.
Lemma 2.121. For each x ∈ X, and each n and k < n, let R_{k,n,x} be the
probability on (X_k, B_k) induced by p_{k,n} from r_n(·, T_n(p_n(x))).21
Define L to be the set of all x ∈ X such that R_{k,n,x} converges in
distribution (as n → ∞) for all k (denote the limit by R_{k,x}). Then L ∈ E,
and P(L) = 1 for all P ∈ M. Also, the function f(x) = R_{k,x}(A) is
measurable for all A ∈ B_k and is a version of P(p_k^{-1}(A)|E) for all
P ∈ M.
PROOF. For each k, let φ_k : X_k → [0, 1] be a bimeasurable function. (See
Definition B.31.) Let Y_k = φ_k(X_k). Let Q_{k,n,x} be the probability induced
on [0, 1] from R_{k,n,x} by φ_k. Let f : [0, 1] → ℝ be a bounded continuous
function. By Lemma 2.120,

    E(f(Y_k)|T_n, T_{n+1}, …) = E(f(φ_k(p_{k,n}(X_n)))|T_n, T_{n+1}, …)
                              = E(f(Y_k)|T_n).

Now, define g_{k,n}(x; f) = ∫ f(y) dQ_{k,n,x}(y). It follows that
E(f(Y_k)|T_n = t) = g_{k,n}(x; f) if t = T_n(p_n(x)). According to part II of
Lévy's theorem B.124, E(f(Y_k)|T_n, T_{n+1}, …) converges almost surely to
E(f(Y_k)|E). In terms of the points x ∈ X, we say this as follows. First, we
note that, as in Lemma 2.120, a version of the conditional distribution of X_n
given {T_l}_{l=n}^∞ is r_n for all P ∈ M. Hence, versions of the conditional
distribution can be chosen so that the set of x for which g_{k,n}(x; f)
converges does not depend on P. Let G_{k,f} be the set of x for which
g_{k,n}(x; f) converges, and call the limit λ_k(x; f). (Hence, λ_k does not
depend on P either.) Then G_{k,f} ∈ E and P(G_{k,f}) = 1 for all P ∈ M. Also,
λ_k(·; f) is measurable (with respect to E), and it is a version of
E(f(Y_k)|E); that is, for all A ∈ E,

    E(f(Y_k)I_A) = ∫_A λ_k(x; f) dP(x),

for all P ∈ M.

20 Note that we now have that X_m is conditionally independent of T_{m+1}
given T_m because a marginal distribution of X_{m+1} has been identified.
21 In symbols, R_{k,n,x}(B) = r_n(p_{k,n}^{-1}(B), T_n(p_n(x)))
= r_n(p_{k,n}^{-1}(B), T_n(x_n)).

Let C_0 be a countable dense subset of the set of bounded continuous
functions from [0, 1] to ℝ (see Lemma B.42) using the uniform metric on
functions. Let

    G_k = ∩_{f∈C_0} G_{k,f},

so that G_k ∈ E and P(G_k) = 1 for all P ∈ M. Since the elements of C_0 are
dense in the uniform metric, we have that for every bounded continuous
f : [0, 1] → ℝ and each x ∈ G_k, lim_{n→∞} g_{k,n}(x; f) = λ_k(x; f). This
can also be written as

    lim_{n→∞} ∫ f(y) dQ_{k,n,x}(y) = ∫ f(y) dQ_{k,x}(y),

where Q_{k,x} is the conditional distribution of Y_k given E calculated from
the probability space (X, B, P), which is the same for every P ∈ M. In short,
Q_{k,n,x} converges weakly to Q_{k,x}. Also, the function
g_k(x; f) = ∫ f(y) dQ_{k,x}(y) is measurable with respect to E and is a
version of E(f(Y_k)|E). Because Q_{k,x} is a probability on [0, 1] rather than
on X_k, we need to show that Q_{k,x}(φ_k(X_k)) = 1, a.s. Since g_k(x; f) is
measurable even if f is only bounded and measurable (i.e., not necessarily
continuous), it follows that g_k(x; f) is also the conditional mean of f(Y_k)
given E for all bounded measurable f. Set f = I_{φ_k(X_k)} to get

    g_k(x; I_{φ_k(X_k)}) = E(I_{φ_k(X_k)}(Y_k)|E) = 1, a.s.,

from which it follows that Q_{k,x}(φ_k(X_k)) = 1, a.s. To complete the proof,
set

    H_k = {x ∈ G_k : Q_{k,x}(φ_k(X_k)) = 1}

and L = ∩_{k=1}^∞ H_k, and let R_{k,x} be the probability induced on
(X_k, B_k) by φ_k^{-1} from Q_{k,x}. □
The next lemma says that we can arrange for the Q_{k,x} probabilities to
be a consistent set of distributions as k varies.

Lemma 2.122. In the notation of Lemma 2.121, let

    C = { x ∈ L : R_{k,x}(p_{k-1,k}^{-1}(·)) = R_{k-1,x}(·), for all k }.

Then C ∈ E and P(C) = 1 for all P ∈ M.

PROOF. As in the proof of Lemma 2.121, let C_0 be a countable dense set
of bounded continuous functions from [0, 1] to ℝ, and let

    g_k(x; f) = ∫ f(y) dR_{k,x}(y),

for each k and each bounded measurable f. Then both g_{k-1}(x; f) and
g_k(x; f(p_{k-1,k})) are versions of E(f(X_{k-1})|E). Let H_{k,f} be the set
of x ∈ L such that the two versions are equal, and let

    C = ∩_{k=1}^∞ ∩_{f∈C_0} H_{k,f}.

Each H_{k,f} ∈ E and P(H_{k,f}) = 1 for all P ∈ M. Since C_0 is a dense set,
x ∈ C implies g_{k-1}(x; f) = g_k(x; f(p_{k-1,k})) for all bounded continuous
f. □
Next, we combine the consistent conditional distributions on the X_k
spaces into a conditional distribution on the space X given E.

Lemma 2.123. There exists a transition kernel Q : B × C → [0, 1] such
that Q(A, ·) is a version of P(A|E) for all A ∈ B and all P ∈ M.

PROOF. Lemma 2.122 says that the finite-dimensional distributions R_{k,x}
are consistent for each fixed x ∈ C. Theorem B.133 says that for each
x ∈ C, there is a unique probability Q(·, x) on (X, B) with R_{k,x} as the
k-dimensional marginal for every k. But surely, for each P ∈ M, P(·|E) is
such a probability, so Q(·, x) is a version of P(·|E) for every P ∈ M. □
If E S;; C, then we have established parts 2 and 3 of Theorem 2.111
(see Lemma 2.128 below.) Next we show that the probabilities Q(', x) are
mostly in M.
Lemma 2.124. Let V = {x E C : Q(', x) EM}. Then VEE and
P(V) = 1 for all P E M.
PROOF. A point x E V if and only if, for all n, all A E Bn, and all BEen,

r
i ;/(T;;'(B
p
rn(A, Tn(Pn(y)))dQ(y, x) = Q( {Tn(Xn) E B, Xn E A}, x).
(2.125)
Both sides of (2.125) are Σ measurable functions of x, so the set of x for
which (2.125) holds for fixed n, A, and B is in Σ. Since X_n and T_n are Borel
spaces, there exist countable fields of sets which generate their respective
Borel σ-fields (Proposition B.43). The set of all x such that (2.125) holds
for all A and B in those countable collections and all n simultaneously is
therefore in Σ. But it is easy to see that if (2.125) holds for all A in a field
that generates B_n (for fixed B and fixed n), then it holds for all A ∈ B_n,
and similarly for all B. Hence V ∈ Σ. To show that P(V) = 1 for all
P ∈ M, let P ∈ M and let G ∈ Σ. Since Q(·, x) is a version of P(·|Σ), it
follows that the integral of the right-hand side of (2.125) over G is
P(G, T_n(X_n) ∈ B, X_n ∈ A).
2.4. Extremal Families 133

Similarly, the integral of the left-hand side of (2.125) over G is

∫_G ∫_{p_n^{-1}(T_n^{-1}(B))} r_n(A, T_n(p_n(y))) dQ(y, x) dP(x)

= E(I_G I_B(T_n(X_n)) r_n(A, T_n(X_n))) = P(G, T_n(X_n) ∈ B, X_n ∈ A),


where the last equality follows from the fact that G and {T_n(X_n) ∈ B} are
both in Σ_n and r_n is a conditional distribution for X_n given Σ_n. Now, let G
be the set of x such that the left-hand side of (2.125) is strictly greater than
the right-hand side. If P(G) > 0, we have a contradiction, and similarly
if G is the set of x such that the left-hand side is strictly less than the
right-hand side. □
There may be many x for which the Q(·, x) are the same, and it would be
useful not to distinguish these if we want the representation to be unique.
Lemma 2.126. Let Σ' be the smallest σ-field of subsets of V such that all
of the functions f_A(x) = Q(A, x) (as functions of x) are measurable. The
σ-field Σ' is countably generated.
PROOF. The σ-field Σ' is generated by all sets of the form {x ∈ V :
Q(A, x) > q}, where q is a rational and A is an element of a countable
field that generates B (Proposition B.43). □
Next, we show that for each P ∈ M, Σ and Σ' differ only by probability
zero sets.
Lemma 2.127. For each A ∈ Σ, let A' = {x ∈ V : Q(A, x) = 1}. Then
A' ∈ Σ' and P(A Δ A') = 0 for all P ∈ M.
PROOF. Since Q(A, ·) is measurable with respect to Σ', A' ∈ Σ'. Now
A Δ A' = (A \ A') ∪ (A' \ A). Since P(Q(A, X) = I_A(X)) = 1 for all A ∈ Σ,
Q(A, x) = 0 a.s. for x ∈ A'^c, and we get

P(A \ A') = ∫_{A'^c} Q(A, x) dP(x) = 0.

Since Q(A, x) = 1 for all x ∈ A',

P(A' \ A) = ∫_{A'} [1 − Q(A, x)] dP(x) = 0. □

Next, we identify the set E.

Lemma 2.128. For each x ∈ V, let S(x) be the atom in Σ' containing x,
that is,

S(x) = {y ∈ V : Q(A, y) = Q(A, x), for all A ∈ Σ'}.

Let E = {x ∈ V : Q(S(x), x) = 1}. Then E ∈ Σ' and P(E) = 1 for all
P ∈ M. Also,
E = {x ∈ V : Q(A, x) = I_A(x), for all A ∈ Σ'}. (2.129)

PROOF. First, we prove (2.129). Suppose that x ∈ V and Q(A, x) = I_A(x)
for all A ∈ Σ'. Since S(x) ∈ Σ', we have Q(S(x), x) = I_{S(x)}(x) = 1, since
x ∈ S(x). So x ∈ E, and the right-hand side of (2.129) is contained in E.
If x ∈ E and A ∈ Σ', then S(x) ⊆ A if and only if x ∈ A. It follows that
Q(A, x) = I_A(x), and E is a subset of the right-hand side of (2.129).
Next, we prove that E ∈ Σ'. Note that Q(A, x) = I_A(x) for all A ∈ Σ'
if and only if Q(A, x) = I_A(x) for all A ∈ D, where D is a countable field
generating Σ'. So,

E = ∩_{A∈D} {x ∈ V : Q(A, x) = I_A(x)}, (2.130)

which is in Σ' because each of the sets in the intersection is in Σ'.
Finally, we prove P(E) = 1 for all P ∈ M. Since Q(A, x) is a version of
P(A|Σ) for all x ∈ C by Lemma 2.123, we have that for all A ∈ Σ,

P(Q(A, X) = I_A(X)) = 1.

Now use (2.130) again to conclude P(E) = 1. □
Since E ⊆ C, we have now established parts 1, 2, and 3 of Theorem 2.111.
Next, we establish part 4.
Lemma 2.131. If x ∈ E and A ∈ Σ, then Q(A, x) ∈ {0,1}.
PROOF. Let x ∈ E and A ∈ Σ. Then Q(·, x) ∈ M by Lemma 2.124.
By Lemma 2.127, there is A' ∈ Σ' such that Q(A, x) = Q(A', x). But
Q(A', x) = I_{A'}(x) by Lemma 2.128. □
Now, we are ready to prove parts 5 and 6 of Theorem 2.111.
Lemma 2.132. Let Σ* be the σ-field of subsets of E defined by

Σ* = {A ∩ E : A ∈ Σ'}.

For each P ∈ M, there is a unique probability R on (E, Σ*) such that
P = ∫_E Q(·, x) dR(x), and R is the restriction of P to Σ* as well as the
restriction of P to Σ.
PROOF. Since Q(A, x) is a version of P(A|Σ), it follows that R equal to
the restriction of P to Σ* satisfies the representation. To show uniqueness,
let R be a probability on (E, Σ*) which satisfies the representation and let
A ∈ Σ*. Then

R(A) = ∫_E I_A(x) dR(x) = ∫_E Q(A, x) dR(x) = P(A),

where the second equality follows from Lemma 2.128, and the third follows
from the representation. Since Σ* ⊆ Σ, the restriction of P to Σ agrees
with the restriction of P to Σ* on Σ*. □
The following result follows easily from Lemma 2.132.

Corollary 2.133. If P and P' are in M and they agree on Σ*, then they
are the same.
Finally, we can prove part 7 of Theorem 2.111.
Lemma 2.134. A probability P ∈ M is extreme if and only if it is a
zero-one measure on Σ. Also, P ∈ M is extreme if and only if

P({x ∈ X : Q(·, x) = P}) = 1. (2.135)
PROOF. According to Lemma 2.127, P ∈ M is a zero-one measure on
Σ if and only if its restriction to Σ* is a zero-one measure. P is a zero-
one measure on Σ* if and only if it is concentrated on one of the atoms,
which are sets of the form {x ∈ E : Q(·, x) = R}, for some R. But the
representation in Lemma 2.132 implies that R = P. So, P is a zero-one
measure on Σ* if and only if (2.135) holds.
Next, we prove that if P is extreme, then P is a zero-one measure on Σ*.
Suppose, to the contrary, that there is A ∈ Σ* such that 0 < P(A) = α < 1.
Let

P_1 = (1/α) ∫_A Q(·, x) dR(x),   P_2 = (1/(1−α)) ∫_{E\A} Q(·, x) dR(x),

where R is the restriction of P to Σ*. Clearly, P_1(A) = 1 and P_2(A) = 0,
so P_1 ≠ P_2. But P = αP_1 + (1 − α)P_2, so P is not extreme.
Finally, we prove that if P is a zero-one measure on Σ*, then P is
extreme. Suppose, to the contrary, that P = αP_1 + (1 − α)P_2 for some
0 < α < 1 and P_1 ≠ P_2 in M. Since P_i ≪ P on Σ* for i = 1, 2, it follows
that P, P_1, and P_2 all concentrate on the same atom in Σ*, hence they
agree on Σ*. By Corollary 2.133, P = P_1 = P_2, a contradiction. □
In particular, Lemma 2.134 says that we can locate the extreme points
by finding all of the Q(·, x) measures for x ∈ E.
Lemma 2.136. A probability P ∈ M is in the extremal family if and only
if it is Q(·, x) for some x ∈ E.
PROOF. Lemma 2.134 says that x ∈ E implies Q(·, x) is extreme. (Check
the definition of E in Lemma 2.128.) Conversely, if P is extreme, then P is
a zero-one measure on Σ* and it equals Q(·, x) for x in the atom on which
P concentrates, according to (2.135). □
The proofs of Theorems 2.113 and 2.114 require a lemma first.
Lemma 2.137. Let (Y, D) be a Borel space, and assume the conditions of
Theorem 2.111 hold with X_n = Y^n. Let X_n be (Y_1, …, Y_n). Then all IID
distributions in M* are in the extremal family.
PROOF. Let A_n be the σ-field generated by all functions from X_n to T_n
which are symmetric with respect to permutations of the coordinates. Let
A_∞ = ∩_{n=1}^∞ A_n. Then Σ̄* ⊆ A_∞, where Σ̄* is the image of Σ* under the
bimeasurable mapping w. We will prove that the IID distributions are zero-
one on A_∞. We do this by proving that IID distributions are conditional
distributions given A_∞. Let P stand for the distribution of the data which
says that the Y_i are IID with distribution equal to the limit of the empirical
CDFs. Since the empirical CDF based on Y_1, …, Y_n is A_n measurable, and
since A_n ⊇ A_{n+1} for all n, it follows that P is A_n measurable for every n,
hence it is A_∞ measurable. To see that P(B) = Pr((Y_{i_1}, …, Y_{i_k}) ∈ B | A_∞),
we need to prove that, for every C ∈ A_∞,

Pr({(Y_{i_1}, …, Y_{i_k}) ∈ B} ∩ C) = ∫_C P(B) dPr.

The proof of this is virtually identical to the proof of (1.84) on page 47
and will not be repeated here. This means that the IID distributions are
conditional distributions given A_∞, hence they are 0-1 on Σ̄* and in the
extremal family. □
PROOF OF THEOREM 2.113. To see that each distribution in M* is the
distribution of exchangeable random quantities, let i_1, …, i_k be distinct,
let n = max{i_1, …, i_k}, let B ∈ B_k, and let η ∈ M*. Let

A = {(x_1, …, x_n) : (x_{i_1}, …, x_{i_k}) ∈ B},

A' = {(x_1, …, x_n) : (x_1, …, x_k) ∈ B}.

Since r_n(·, t) is the distribution of exchangeable random quantities for all n
and t, we have r_n(A, t) = r_n(A', t) for all n and t. Let η_{T_n} be the distribution
of T_n. Then

η((X_{i_1}, …, X_{i_k}) ∈ B) = ∫ r_n(A, t) dη_{T_n}(t)

= ∫ r_n(A', t) dη_{T_n}(t) = η((X_1, …, X_k) ∈ B).

Next, we want to prove that the IID distributions are the extremal distri-
butions. Lemma 2.137 says that the IID distributions in M* are contained
in the extremal family. We now prove that all extremal distributions are
IID. It follows from DeFinetti's representation theorem 1.49 that every
distribution η in M* is a mixture of IID distributions. We show next that
if η is extremal, then the mixture must be trivial. Let η be extremal, and
represent

η_{X_n} = ∫ P_n dμ(P), for all n, (2.138)

as in DeFinetti's representation theorem 1.49, where P_n is the distribution
that says that X_n = (Y_1, …, Y_n) are IID with distribution P, and μ is the
mixing measure. Let q(P) be the joint distribution on (Y^∞, D^∞) which
says that {Y_n}_{n=1}^∞ is IID with distribution P. Since condition T says
that X_n is conditionally independent of {Y_{n+i}}_{i=1}^∞ given T_n, and since
P is a function of {Y_{n+i}}_{i=1}^∞, it follows that X_n is conditionally
independent of P given T_n. This means that the conditional distribution of
X_n given T_n = t and P is r_n(·, t). Hence, q(P) ∈ M* with probability 1.
This makes (2.138) a representation of η as a mixture of elements of M*.
Since η is extremal, the mixture must be trivial, that is, q(P) = η with
probability one. So, we have that all distributions in the extremal family
are IID distributions.
The last claim in the theorem follows from the fact that each element of
the extremal family is the limit of r_n probabilities. □
PROOF OF THEOREM 2.114. Since r_n(ℝ^n, t) = 1 for all t, it is clear that
r_n is a transition kernel. Since B(t) = B ∩ T_n^{-1}({t}), r_n satisfies condi-
tion 2.112. Since every member of the exponential family in question has
the conditional distribution of (Y_1, …, Y_n) given T_n equal to r_n, it follows
that condition S holds and that every member of the exponential family
is also in M. Since Y_{n+1} = T_{n+1} − T_n, Proposition B.28 says that the
conditional distribution of X_n given (T_n, Y_{n+1}) = (t, y) is the same as that
given (T_n, T_{n+1}) = (t, t + y), so condition T holds. So the conditions of
Theorem 2.113 hold. Lemma 2.137 says that every member of the expo-
nential family is in the extremal family. We now prove that all extremal
distributions are in the exponential family. We may assume, without loss
of generality, that h(0) > 0. (If not, find c such that h(c) > 0 and subtract
c from all X_i. Then replace h(y) by v(y) = h(y + c) and note that r_n has
the same form in terms of v as it does in terms of h.) Let f be the density
of a distribution in the extremal family, and let

f^(2)(t) = ∫ f(t − y) f(y) dy.

Then f^(2) is the density of Y_1 + Y_2, since Y_1 and Y_2 are IID in the extremal
family. Since f leads to the same conditional distributions as h, we get

f(y) = ∫ [h(y) h(t − y) / h^(2)(t)] f^(2)(t) dt,

hence f(0) > 0, since h(0) > 0. It also follows that

f(t − y) f(y) / f^(2)(t) = h(t − y) h(y) / h^(2)(t)   a.e., (2.139)

since both sides give the conditional distribution of Y_1 given Y_1 + Y_2 = t.
Define

λ(y) = log[f(y)/h(y)] − log[f(0)/h(0)],
ψ(t) = log[f^(2)(t)/h^(2)(t)] − 2 log[f(0)/h(0)].

By taking the log of both sides of (2.139), we get λ(t − y) + λ(y) = ψ(t).
Now, set y = t and note that λ(0) = 0, so that λ(t) = ψ(t). It follows that

λ(t − y) + λ(y) = λ(t). According to Theorem C.9, λ(y) = a + b^T y for some
scalar a and vector b. Hence, f(y) is a constant times h(y) exp(b^T y) for all
y such that f(y) > 0. To see that f(y) > 0 whenever h(y) > 0, note that
(2.139) implies that f(y) = 0 and h(y) > 0 means that h(t − y)/h^(2)(t) = 0
for all t. But this would contradict

h(y) = ∫ [h(y) h(t − y) / h^(2)(t)] h^(2)(t) dt. □
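The invariance used in this proof, that the conditional distribution of Y_1 given Y_1 + Y_2 = t is the same for every member of the exponential family generated by h, can be illustrated in a discrete analogue. The sketch below is an illustration only, not part of the proof: taking h(y) = 1/y!, every member of the family is a Poisson distribution, and the conditional distribution of Y_1 given Y_1 + Y_2 = t is Bin(t, 1/2) no matter which member is used, exactly as (2.139) requires.

```python
from math import comb, exp, factorial

def poisson_pmf(y, lam):
    # f(y) = h(y) * exp(b*y) / normalizer, with h(y) = 1/y! and b = log(lam)
    return exp(-lam) * lam**y / factorial(y)

def conditional_y1_given_sum(t, lam):
    # P(Y1 = y | Y1 + Y2 = t) for Y1, Y2 IID Poisson(lam)
    joint = [poisson_pmf(y, lam) * poisson_pmf(t - y, lam) for y in range(t + 1)]
    total = sum(joint)
    return [p / total for p in joint]

t = 7
for lam in (0.5, 3.0):
    cond = conditional_y1_given_sum(t, lam)
    # The conditional law is Bin(t, 1/2): it depends only on h, not on lam.
    for y in range(t + 1):
        assert abs(cond[y] - comb(t, y) / 2**t) < 1e-12
```

The same computation with any other choice of lam gives the identical conditional distribution, which is the sense in which the conditional distributions "lead back" to h alone.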

2.5 Problems
Section 2.1.2:

1. Suppose that P_θ says that X_1, …, X_n are IID N(θ, 1), for θ ∈ ℝ. Let
X = (X_1, …, X_n) and find a one-dimensional sufficient statistic T. Also,
find the conditional distribution of X given T = t.
2. Refer to the definition of P_{θ,T} on page 84. If {P_θ : θ ∈ Ω} is a regular
conditional distribution on (X, B) given Θ, then prove that {P_{θ,T} : θ ∈ Ω}
is a regular conditional distribution on (T, C) given Θ.
3. Suppose that X_1, …, X_n are conditionally IID with N(μ, σ²) distribution
given Θ = (μ, σ). Find a two-dimensional sufficient statistic.
4. Let X_1, …, X_n be conditionally independent given P = p with X_i having
conditional density (with respect to counting measure on the nonnegative
integers)

f_{X_i|P}(x|p) = [Γ(α_i + x) / (Γ(α_i) x!)] p^x (1 − p)^{α_i},

where α_1, …, α_n are known strictly positive numbers. (These are general-
ized negative binomial random variables.) Define T = Σ_{i=1}^n X_i. Find the
conditional distribution of (X_1, …, X_n) given T = t and P = p.
5. Let P_θ say that X_1, …, X_n are IID Poi(θ). Show that T = Σ_{i=1}^n X_i is
sufficient by both Definitions 2.4 and 2.8.
6. Prove Proposition 2.12 on page 86 and find the conditional distribution of
(X_1, …, X_n) given the order statistics.
7. Prove Proposition 2.23 on page 90.
8. For the experiment in Example 2.53 on page 102, find the conditional
distribution of X given X_1 and Θ and show that X_1 is not sufficient. Show
that X is minimal sufficient.
9. *Consider the experiment described in Example 2.54 on page 102. Let N be
the number of observed Y_i, X = ∪_{m=2}^∞ {0, 1}^m, and X = (Z, Y_1, …, Y_N).
(a) Find the density of X given Θ = θ with respect to counting measure
on (X, 2^X).

(b) Let M be the number of observed successes among the Y_i, that is,
M = Σ_{i=1}^N Y_i. Show that (N, M) is sufficient. In particular, we do
not need to keep track of Z.
(c) Find the conditional distribution of Z given (N, M, Θ), and show that
it does not depend on Θ.
10. (Nonexchangeable example) Let P_θ say that {X_n}_{n=1}^∞ are Bernoulli random
variables with the following joint distribution:

P_θ(X_1 = 1) = θ_1,
P_θ(X_i = 1 | X_1, …, X_{i−1}) = θ_{11} if X_{i−1} = 1, and θ_{10} if X_{i−1} = 0,

where θ = (θ_1, θ_{11}, θ_{10}).
(a) Let X = (X_1, …, X_n). Find a four-dimensional sufficient statistic.
(b) Suppose that X = (X_1, …, X_N), where N is the number of obser-
vations until k successes (1s) have been observed, where k is known.
Find a three-dimensional sufficient statistic.

Section 2.1.3:

11. Prove that the sufficient statistic T found in Problem 1 on page 138 is
minimal sufficient.
12. Show that T is a complete sufficient statistic in Problem 4 on page 138.
13. (Logistic regression) Let {Y_i}_{i=1}^n be Bernoulli random variables, but assume
that each Y_i comes with a known vector x_i of k covariates. Conditional on
Θ = θ (a vector of length k), the Y_i are independent with

log{ P_θ(Y_i = 1) / P_θ(Y_i = 0) } = θ^T x_i.

Let X = (Y_1, …, Y_n). Find a minimal sufficient statistic (vector).


14. *Let (X_1, Y_1), …, (X_n, Y_n) be conditionally IID with uniform distribution
on the disk of radius r centered at (θ_1, θ_2) in ℝ² given (Θ_1, Θ_2, R) =
(θ_1, θ_2, r).
(a) If (Θ_1, Θ_2) is known, find a minimal sufficient statistic for R.
(b) If all parameters are unknown, show that the convex hull of the sample
points is a sufficient statistic.
15. *Here, we will construct a function T as needed in Theorem 2.29 for the
general case. The function will turn out to be essentially the likelihood
function.
(a) Let ℱ be the space of functions f : Ω → [0, ∞) with the product
σ-field. Prove that the function T_1 : X → ℱ defined by T_1(x) =
f_{X|Θ}(x|·) is measurable.

(b) Consider the relation ∼ on ℱ defined by f ∼ g if there exists c > 0
such that g(θ) = cf(θ) for all θ. Prove that ∼ is an equivalence
relation. (That is, prove that (i) f ∼ f, (ii) f ∼ g implies g ∼ f, and
(iii) (f ∼ g and g ∼ h) implies f ∼ h.)
(c) Let T be the set of all equivalence classes [f] = {g : g ∼ f}. Let the
σ-field of subsets of T be the smallest σ-field containing sets of the
form A_{θ,ψ,c} = {[f] : f(θ) ≤ cf(ψ)}. Prove that [f] ∈ A_{θ,ψ,c} if and
only if [g] ∈ A_{θ,ψ,c} for all g ∈ [f].
(d) Let T_2 : ℱ → T be defined by T_2(f) = [f]. Prove that T_2 is measur-
able.
(e) Prove that T = T_2 ∘ T_1 satisfies T(x) = T(y) if and only if y ∈ V(x)
in the notation of Theorem 2.29.
16. Suppose that {(X_n, Y_n)}_{n=1}^∞ are conditionally IID given Θ = θ with distri-
bution uniform on the disk of radius θ centered at (0, 0), that is,

f_{X_i,Y_i|Θ}(x_i, y_i|θ) = (1/(πθ²)) I_{[0,θ]}(√(x_i² + y_i²)).

Let X = [(X_1, Y_1), …, (X_n, Y_n)]. Find a complete sufficient statistic and
its distribution.
17. Suppose that Ω = {(θ_1, θ_2) : θ_2 > θ_1} and P_{θ_1,θ_2} says that X_1, …, X_n are
IID with U(θ_1, θ_2) distribution. Find minimal sufficient statistics.
18. *Let X be a discrete random variable, and let

f_{X|Θ}(x|θ) = θ if x = 0, (1 − θ)² θ^{x−1} if x = 1, 2, …, and 0 otherwise

be the density of X (conditional on Θ = θ) with respect to counting mea-
sure on the integers. Let Ω = (0, 1). Prove that X is boundedly complete,
but not complete.
19. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID Ber(θ). Let X = (X_1, …, X_n)
and T = Σ_{i=1}^n X_i. Prove that T is a complete sufficient statistic without
using Theorem 2.74.
20. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID U(0, θ). Let X = (X_1, …, X_n)
and T = max_{i=1,…,n} X_i. Prove that T is a complete sufficient statistic.
21. *Suppose that X_1, …, X_n are IID given Θ = θ with conditional distribution
uniform on the set [0, θ] ∪ [2θ, 3θ]. That is,

f_{X_i|Θ}(x|θ) = (1/(2θ)) (I_{[0,θ]}(x) + I_{[2θ,3θ]}(x)).

Find a minimal sufficient statistic (dimension at most 3).
22. Let Z = (X_1, …, X_n, Y_1, …, Y_n), where the X_i and Y_i are all condition-
ally independent with X_i ∼ N(μ, σ_X²) and Y_i ∼ N(μ, σ_Y²) given Θ =
(μ, σ_X, σ_Y). Let T(Z) = (X̄, Ȳ, S_X², S_Y²), the usual sample means and vari-
ances. Show that T is minimal sufficient but that T is not boundedly com-
plete.

23. *Let Ω = {1, 2, 3}, and let X = (X_1, …, X_n) be conditionally IID given Θ =
θ, with density f_θ(·). Suppose that P_1, P_2, and P_3 have density functions
(with respect to Lebesgue measure): f_1(x) = I_{(−1,0)}(x), f_2(x) = I_{(0,1)}(x),
and f_3(x) = 2x I_{(0,1)}(x). Thus, the model has only three members. Let
S(x) = {(j, k) : Π_i f_j(x_i) + Π_i f_k(x_i) > 0}. Let

T(x) = { Π_{i=1}^n f_k(x_i) / Π_{i=1}^n f_j(x_i) : j < k, (j, k) ∈ S(x) }.

Show that T is minimal sufficient.


24. (Nonexchangeable example) Suppose that {X_n}_{n=1}^∞ is a sequence of ran-
dom variables, that Θ ∈ (−1, 1) is a parameter such that the conditional
distribution of X_1 given Θ = θ is N(0, 1/(1 − θ²)), and that, for i > 1, the con-
ditional distribution of X_i given Θ = θ and (X_1, …, X_{i−1}) = (x_1, …, x_{i−1})
is N(θx_{i−1}, 1).22
(a) If X = (X_1, …, X_n), find a three-dimensional minimal sufficient
statistic.
(b) Find the conditional distribution of X_{n+1} given X = x, and show
that it depends on more of the data X than the minimal sufficient
statistic.

Section 2.1.4:

25. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID U(θ − 1/2, θ + 1/2). Let X =
(X_1, …, X_n). Find minimal sufficient statistics and find a nonconstant
function of the sufficient statistic which is ancillary.
26. Prove that if S is ancillary, then S and Θ are independent no matter what
prior one uses for Θ.
27. (A vector example) Suppose that P_θ says that (X_1, Y_1), …, (X_n, Y_n) are
IID bivariate normal random pairs with means 0, variances 1, and
correlation θ, with Ω = (−1, 1).
(a) Find a two-dimensional minimal sufficient statistic.
(b) Prove that the minimal sufficient statistic found above is not com-
plete.
(c) Prove that Z_1 = Σ_{i=1}^n X_i² and Z_2 = Σ_{i=1}^n Y_i² are both ancillary but
that (Z_1, Z_2) is not ancillary.
28. *Consider the situation in Example 2.51 on page 100. Prove that the dis-
tribution of U is uniform on the sphere centered at the vector 0 of n 0s in
the hyperplane defined by 1^T u = 0, where 1 is the vector of n 1s. (Hint:
Let A be an orthogonal n × n matrix that maps the hyperplane to itself.
Prove that AU has the same distribution as U.)

22Such a sequence is often called an autoregression of order 1.



29. Suppose that Pr(X > 0) = 1 and that the conditional distribution of Y
given X = x is U(0, x). Let Z = X − Y and suppose that Y and Z are
independent. Let f_X(x), f_Y(y), and f_Z(z) be differentiable.
(a) Prove that Pr(X > c) > 0 for all c > 0.
(b) Prove f_X(x) = a²x exp(−ax), for x > 0.
30. Let X_1, …, X_n be conditionally IID given Θ = θ, each with density g(x − θ)
for some function g. Prove that max{X_1, …, X_n} − min{X_1, …, X_n} is
ancillary, but not maximal ancillary if n > 2.
31. Consider the situation in Example 2.46 on page 97. Suppose that we wish
to condition on N_0 if

Var_θ(1 − 3M_0/N_0 | N_0) ≤ Var_θ(1 − 2N_0/M_0 | M_0),

and we wish to condition on M_0 otherwise. For which data sets would we
choose N_0, and for which would we choose M_0?
32. Consider the situation in Example 2.46 on page 97. Suppose that we need
to choose upon which ancillary to condition before we see the data. Suppose
that we decide to condition on N_0 if

Var_θ(1 − 3M_0/N_0) ≤ Var_θ(1 − 2N_0/M_0),

and we will condition on M_0 otherwise. On which ancillary will we condi-
tion?
33. Call a statistic U ignorable if there exists a sufficient statistic T such that
T and U are conditionally independent given Θ. Prove that an ignorable
statistic is ancillary.

Section 2.2:

34. Express the family of Poisson distributions in exponential family form. Find
the natural parameter, natural parameter space, and sufficient statistic. Use
Theorem 2.64 to find the mean and the variance of the sufficient statistic.
35. Express the family of Beta distributions in exponential family form. Find
the natural parameter, natural parameter space, and sufficient statistic(s).
Use Theorem 2.64 to find the mean and the variance of the sufficient statis-
tics. (Hint: The derivative of the log of the gamma function is called the
digamma function ψ. The second derivative is called the trigamma function
ψ'.)
36. In Problem 9 on page 138, show that the family of distributions for ob-
served data is an exponential family but that the sufficient statistic is not
complete. How do you reconcile this with Theorem 2.74?
37. Prove Proposition 2.70 on page 107.
38. Prove Proposition 2.72 on page 107.

Section 2.3:

39. Suppose that X and Y are conditionally independent given Θ with con-
ditional densities f_{X|Θ}(x|θ) and f_{Y|Θ}(y|θ), respectively. Suppose that Θ is
k-dimensional. Prove that I_{X,Y}(θ) = I_X(θ) + I_Y(θ).
40. Prove Proposition 2.84 on page 112.
41. Prove Proposition 2.92 on page 116.
42. Suppose that P_θ says that X ∼ Poi(θ).
(a) Find the Fisher information I_X(θ).
(b) Find Jeffreys' prior.
43. Suppose that the FI regularity conditions hold and that two derivatives
can be passed under integral signs. Let T be an ancillary statistic. Prove
that I_{X|T}(θ|t) has (i, j) entry equal to −E_θ(∂² log f_{X|T,Θ}(X|t, θ)/∂θ_i ∂θ_j).
44. Let Ω = {(p_1, p_2, p_3) : p_i ≥ 0, Σ_{i=1}^3 p_i = 1} and

f_{X|Θ}(x|p_1, p_2, p_3) = p_1 if x = 1, p_2 if x = 2, p_3 if x = 3, and 0 otherwise.

Let θ_0 = (1/3, 1/3, 1/3), and find the value of θ such that E_θ(X) = 2.5 and
I_X(θ_0; θ) is minimized.
45. Suppose that X '" U(O,O) given e = O. Find the Kullback-Leibler infor-
mation Ix (01; (2) for all pairs (01, (2)'
46. Suppose that person 1 believes Pr(Θ = 1/3) = π_0 and Pr(Θ = 1/2) = 1 − π_0
and person 2 believes Pr(Θ = q) = 1. Both persons believe that {X_n}_{n=1}^∞
are IID Ber(θ) given Θ = θ. Let Y_n = Σ_{i=1}^n X_i/n.
(a) Find Pr(Θ = 1/3 | Y_n = q) for person 1.
(b) For each possible value of q, describe person 2's beliefs about how
the value of Pr(Θ = 1/3 | Y_n) (calculated by person 1) will behave as
n → ∞.
47. Let Y be the number of patients (out of n) who survive for one year after
an operation. Let Z be the number of patients who survive for five years.
Let Θ = (P, Q), and suppose that we model Y ∼ Bin(n, p) given Θ = (p, q)
and Z ∼ Bin(y, q) given Y = y and Θ = (p, q). Let X = (Y, Z). Find I_X(θ)
and Jeffreys' prior.
Section 2.4:

48. *Let T_n(x_1, …, x_n) = Σ_{i=1}^n x_i, and suppose that (X_1, …, X_n) given T_n = t
is distributed uniformly on the portion of the hyperplane Σ_{i=1}^n x_i = t with
all coordinates nonnegative. Find the extremal family of distributions.
49. *Let X_1 = {0, 1}. Let T_n(x_1, …, x_n) = Σ_{i=1}^n x_i, and suppose that the dis-
tribution of X_1, …, X_n given T_n = t is that of draws without replacement
from an urn containing t 1s and n − t 0s. Find the extremal family of
distributions.
CHAPTER 3
Decision Theory

A major use of statistical inference is its application to decision making


under uncertainty. When the costs and/or benefits of our actions depend
on quantities we will not know until after we make our decisions, we need
to be able to weigh the costs against the uncertainties intelligently.

3.1 Decision Problems


3.1.1 Framework
Suppose that one can determine ahead of time a set of actions from which
one will have to choose. We name this set ℵ and call it the action space. This
set will contain all of the actions under consideration. We will occasionally
need to introduce a measure over this set, so let 𝒢 be a σ-field of subsets
of ℵ. In the most general type of decision problem we will consider, we
suppose that there is a not yet observed quantity V (taking values in a set
𝒱) on which the amount we lose, as a function of our action, depends.
Example 3.1. Suppose that we are trying to decide whether to keep a store
open for an extra hour during a busy shopping season. We might be able to
determine the extra costs of overhead and payroll associated with staying open,
but the amount of additional net sales V is as yet unknown. The final profit or
loss associated with the decision depends on V.

Definition 3.2. Let (S, 𝒜, μ) be a probability space, ℵ an action space,
and V : S → 𝒱 a function. A loss function is a function L : 𝒱 × ℵ → ℝ.
L(v, a) measures "how much" we lose by choosing action a when V = v.
Consider the following simple example.

Example 3.3. Let V ∼ N(1, 1), and suppose that ℵ = ℝ and L(v, a) = (v − a)².
In this case, the amount I lose when I choose action a is the squared distance
between a and the unknown V. Alternatively, we might have L(v, a) = 3|v − a|.
In this case, I lose three times the distance between V and a.

The conditional distribution of V given Θ = θ will be denoted by P_{θ,V}.
For convenience we will assume that V and X are conditionally independent
given Θ. Most often, in discussions of statistical theory, the function V is
Θ, so that P_{θ,V}(B) = I_B(θ), and L : Ω × ℵ → ℝ. But this is not actually
necessary. The goal of decision theory is to make L(V, a) as small as possible
by choice of a ∈ ℵ. Unfortunately, this would normally require that we know
the value V. For example, in Example 3.3, both losses are smallest if a = V.
If V is a function that depends on the future (coordinates later than those
observed or the parameter), then we will not know V at the time a decision
will need to be made. When V = Θ, we will not even know V after the
decision is made.
The tools we use to make decisions are called decision rules.
Definition 3.4. A randomized decision rule δ is a mapping from X to
probability measures on (ℵ, 𝒢) such that for every A ∈ 𝒢, δ(·)(A) is mea-
surable. A nonrandomized decision rule δ is a randomized decision rule that
for each x assigns probability 1 to a single action, denoted by δ(x). That
is, a randomized decision rule δ is a nonrandomized rule if, for each x ∈ X,
there exists a_x ∈ ℵ such that δ(x)(A) = I_A(a_x), and in such a case a_x is
denoted by δ(x).
Note that the definition of a randomized decision rule makes it a regular
conditional distribution over (ℵ, 𝒢) given X. Of course, if one were actually
to use a randomized decision rule, one would need to choose an action in
ℵ, not just a probability measure over (ℵ, 𝒢). To do this, one takes the
observed x and simulates (see Section B.7) a pseudorandom element of ℵ
according to the probability measure δ(x)(·). Hence, an alternative method
for specifying a randomized rule is to specify, for each possible x, the way
in which one will simulate the action from ℵ.
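For a finite action space, the simulation step just described is elementary. The sketch below is an illustration only (the dictionary representation of δ(x) is a choice made here, not notation from the text): it draws a pseudorandom action from the measure δ(x)(·).

```python
import random

def simulate_action(delta_x, rng=random):
    # delta_x: dict mapping actions to probabilities, representing delta(x)(.)
    actions = list(delta_x)
    weights = [delta_x[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# A rule that puts probability 1/2 on each of two actions:
delta_x = {"a0": 0.5, "a1": 0.5}
draws = [simulate_action(delta_x) for _ in range(1000)]
assert set(draws) <= {"a0", "a1"}
```

For continuous action spaces one would instead simulate from δ(x)(·) by whatever method Section B.7 prescribes for that distribution (e.g., inversion of its CDF).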
Example 3.5. Suppose that m and n are even integers. Suppose that P_θ says
that X_1, …, X_{n+m} are IID Ber(θ) random variables. Let X = (X_1, …, X_n) and
V = Σ_{i=1}^m X_{n+i}. Let the action space be ℵ = {a_0, a_1}, and suppose that the loss
function is

L(v, a) = 1 if (v < m/2 and a = a_1) or if (v > m/2 and a = a_0),
          1/2 if v = m/2,
          0 otherwise.

Let Y = Σ_{i=1}^n X_i and y = Σ_{i=1}^n x_i. Here is a plausible randomized decision rule:

δ(x) = probability 1 on a_0 if y < n/2,
       probability 1 on a_1 if y > n/2,
       probability 1/2 on each if y = n/2.

If Y = n/2 is observed, one could flip a fair coin to decide between the two
actions.

If δ is a randomized rule, set

L(v, δ(x)) = ∫_ℵ L(v, a) dδ(x)(a).

This then allows us to talk about the loss incurred by either a random-
ized or a nonrandomized rule without regard to the result of the auxiliary
randomization in the randomized rule.
randomization in the randomized rule.
Example 3.6 (Continuation of Example 3.5; see page 145). If y = n/2, then
one can easily show that L(v, δ(x)) = 1/2 for all v.
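The claim in Example 3.6 can be checked directly. The sketch below uses the loss function in the form given in Example 3.5 (as rendered here): when y = n/2 the rule puts probability 1/2 on each action, and the two actions' losses sum to 1 for every v, so L(v, δ(x)) = 1/2 regardless of v.

```python
def loss(v, a, m):
    # Loss from Example 3.5: a wrong guess about the side of m/2 costs 1,
    # and the boundary case v = m/2 costs 1/2 no matter which action is taken.
    if v == m / 2:
        return 0.5
    if (v < m / 2 and a == "a1") or (v > m / 2 and a == "a0"):
        return 1.0
    return 0.0

def randomized_loss(v, m):
    # L(v, delta(x)) when y = n/2: delta(x) puts probability 1/2 on each action.
    return 0.5 * loss(v, "a0", m) + 0.5 * loss(v, "a1", m)

m = 6
assert all(randomized_loss(v, m) == 0.5 for v in range(m + 1))
```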

3.1.2 Elements of Bayesian Decision Theory


In the Bayesian paradigm, one calculates the posterior risk

r(δ|x) = ∫_𝒱 L(v, δ(x)) dμ_{V|X}(v|x)

for each decision rule and chooses the one with the smallest posterior risk.
Here, μ_{V|X} denotes the conditional distribution of V given X. If we do
this for every x and the posterior risk is never +∞, the resulting rule is
called a formal Bayes rule.
Definition 3.7. If δ_0 is such that r(δ_0|x) < ∞ for all x and r(δ_0|x) ≤ r(δ|x)
for all x and all decision rules δ, then δ_0 is called a formal Bayes rule.
The use of formal Bayes rules is based on the following principle.
The Expected Loss Principle: When one compares two rules
after observing data, the better rule is the one with the smaller
posterior risk.
A justification for this principle will be given in Section 3.3. One feature
of that justification, which we do not use here, however, is that the loss
function needs to be bounded.
Example 3.8. Let 𝒱 ⊆ ℝ, ℵ ⊆ ℝ, and L(v, a) = (v − a)². Then

r(δ|x) = ∫_𝒱 (v − δ(x))² dμ_{V|X}(v|x).

Assuming that E(V²|X = x) < ∞, we can easily minimize the posterior risk by
setting δ(x) = E(V|X = x). This result is very general. So long as the posterior
variance of V is finite, a formal Bayes rule with squared-error loss is the posterior
mean of V.
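The claim in Example 3.8 is easy to confirm numerically. The sketch below uses a hypothetical three-point posterior for V (the numbers are invented for illustration) and verifies by grid search that the posterior risk E[(V − a)² | X = x] is minimized at the posterior mean.

```python
# Hypothetical discrete posterior mu_{V|X}(.|x) for V
posterior = {0.0: 0.2, 1.0: 0.5, 2.0: 0.3}

def posterior_risk(a):
    # E[(V - a)^2 | X = x] under the posterior above
    return sum(p * (v - a) ** 2 for v, p in posterior.items())

post_mean = sum(v * p for v, p in posterior.items())

# Grid search over candidate actions confirms the minimizer is the posterior mean.
grid = [i / 1000 for i in range(2001)]
best = min(grid, key=posterior_risk)
assert abs(best - post_mean) < 1e-3
```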

It is possible that there exist x values such that the posterior risk given
X = x is +∞ for every decision rule. Also, it is possible that although
there exist rules with posterior risk < ∞ given X = x, there is no rule
that achieves the minimum of the posterior risk. In these cases, there is no
formal Bayes rule as we have defined it, although there may exist x values
such that, conditional on X = x, the posterior risk can be minimized at a
value < ∞. In this latter case, we call a rule that minimizes the posterior
risk at all values of x for which a minimum < ∞ can be achieved a partial
Bayes rule.
Example 3.9. Suppose that {Y_n}_{n=1}^∞ are conditionally IID with Cauchy distri-
bution Cau(θ, 1) given Θ = θ, where Ω = ℝ = ℵ, V = Θ, and L(θ, a) = (a − θ)².
Let the prior distribution of Θ be Cau(0, 1). Let t > 0 and let X_i = min{t, Y_i}.
Define X = (X_1, X_2, X_3). If at least one of the X_i is strictly less than t, then
the posterior risk will be finite for some decision rule. But if all three X_i = t,
the posterior risk is infinite for all decision rules. In this example, a partial Bayes
rule is any rule that chooses the action minimizing the posterior risk for those
data in which at least one X_i < t. As we saw in Example 3.8, the action to
choose in those cases is the posterior mean of Θ. For those data x such that the
posterior risk is infinite (all X_i = t), it might still make sense to choose δ(x) in
such a way that the posterior distribution of L(V, δ(x)) is stochastically small.
That is, if we define Z_δ = L(V, δ(x)), we should prefer δ_1 to δ_2 if the CDF of Z_{δ_1}
is everywhere larger than the CDF of Z_{δ_2}.
Example 3.10. Let ℵ = (0, 1) = Ω, and let X = (X_1, …, X_{10}), where the X_i are
IID with Ber(θ) distribution given Θ = θ. Let V = Θ and L(θ, a) = (θ − 0.1 − a)².
Let the prior distribution of Θ be Beta(1, 1), so that the posterior given X = x is
Beta(x + 1, 11 − x), where x denotes the number of successes. If x > 0 is observed,
then the posterior risk is minimized at δ_0(x) = (x + 1)/12 − 0.1. However, if x = 0
is observed, the posterior risk is an increasing function of a for a ∈ ℵ, so we would
like to choose δ_0(0) as small as possible. But the action space is not closed, so
there is no smallest possible value. Any decision rule δ such that δ(x) = δ_0(x) for
x > 0 is a partial Bayes rule.

If the posterior risk of a randomized rule is finite, or the loss function is nonnegative, we can write the posterior risk as

r(δ|x) = ∫_N ∫_V L(v, a) dF_{V|X}(v|x) dδ(x)(a).   (3.11)

In this case, if the inner integral, h_x(a) = ∫_V L(v, a) dF_{V|X}(v|x), considered as a function of a for fixed x, does not achieve a minimum at some value of a, then it is easy to see that (3.11) does not achieve its minimum at any probability δ(x). This leads us to state the following result.
Theorem 3.12. If a formal Bayes rule exists with finite posterior risk, then there is a nonrandomized formal Bayes rule. If, at a particular value of x, the posterior risks of nonrandomized rules are unbounded below, then there exists a randomized rule with posterior risk −∞ at that x (whether or not there is such a nonrandomized rule).
148 Chapter 3. Decision Theory

PROOF. Before proving the first part, let h_x(a) = ∫_V L(v, a) dF_{V|X}(v|x), the posterior risk of a nonrandomized rule with δ(x) = a. Then, if a formal Bayes rule exists with finite posterior risk, h_x(a) (as a function of a) must be bounded below. Furthermore, if for some x, h_x(a) = ∞ for all a, then every decision rule has infinite posterior risk at x. Hence, we can assume that c(x) = inf_{a∈N} h_x(a) < ∞ for all x.

We will prove the contrapositive of the first part of the theorem, namely that if, at some value of x, there is no nonrandomized formal Bayes rule, then the formal Bayes rule does not exist. Suppose that there is no nonrandomized formal Bayes rule at a value x. Then there exists no b ∈ N such that h_x(b) = c(x). Suppose that δ is a randomized rule with finite posterior risk (so c(x) > −∞ also). Let A_n = {a : h_x(a) ≥ c(x) + 1/n}. Since {a : h_x(a) = c(x)} = ∅, we can choose n large enough so that δ(x)(A_n) > 0. It follows that

∫ h_x(a) dδ(x)(a) ≥ c(x)δ(x)(A_nᶜ) + ∫_{A_n} h_x(a) dδ(x)(a)
  ≥ c(x)δ(x)(A_nᶜ) + δ(x)(A_n)(c(x) + 1/n)
  = c(x) + δ(x)(A_n)/n > c(x).

Since c(x) = inf_{a∈N} h_x(a), there exists a such that

c(x) ≤ h_x(a) < ∫ h_x(a) dδ(x)(a).

It follows that δ is not a formal Bayes rule.

For the second part, suppose that c(x) = inf_{a∈N} h_x(a) = −∞. For each k = 1, 2, …, let a_k be such that h_x(a_k) ≤ −2^k and, for k > 1, h_x(a_k) < h_x(a_{k−1}). Let δ(x) assign probability 2^{−k} to a_k for each k. Then it is easy to see that δ has posterior risk −∞ even if h_x(a) > −∞ for all a. □
Although Theorem 3.12 says that there are cases in which one need only
consider nonrandomized rules in order to find formal Bayes rules, it may
be that there are still some randomized formal Bayes rules as well.
Example 3.13. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID Ber(θ). Let N = {a₀, a₁} and let

L(θ, a) = 0 if (θ ≤ 1/2 and a = a₀) or if (θ > 1/2 and a = a₁), and L(θ, a) = 1 otherwise.

Let X = (X₁, …, X_n) and suppose that n is even. Let Y = ∑_{i=1}^n X_i. Suppose that the prior η is equal to Lebesgue measure. If Y = y successes are observed in n trials, the posterior is Beta(y + 1, n − y + 1). The posterior risk for choosing a = a₀ is r(y) = Pr(Θ > 1/2 | Y = y), and the posterior risk for choosing a = a₁ is Pr(Θ ≤ 1/2 | Y = y) = 1 − r(y). The formal Bayes rule will be to choose a₀ if r(y) < 1 − r(y) (that is, if r(y) < 1/2) and to choose a₁ if

r(y) > 1/2. If, however, y = n/2, the posterior will be Beta(n/2 + 1, n/2 + 1), which is symmetric about 1/2, so r(y) = 1/2. Randomized rules of the following form are formal Bayes rules:

δ(x) = { probability 1 on a₀    if Y < n/2,
       { probability 1 on a₁    if Y > n/2,
       { probability 1/2 on each if Y = n/2.

Next, we illustrate the case in which losses are unbounded below.

Example 3.14. Suppose that N = Ω = ℝ and that L(θ, a) = −(θ − a)². In words, we are trying to choose a as far away from Θ as possible. Suppose that the posterior distribution of Θ has finite variance σ² and mean μ. Then the posterior risk for a nonrandomized rule δ is −(μ − δ(x))² − σ². So, every nonrandomized rule has finite posterior risk. If, for each x, δ(x)(·) is a randomized rule such that the distribution over N has infinite variance, then δ will have posterior risk −∞.

3.1.3 Elements of Classical Decision Theory

In the classical paradigm, one conditions on Θ = θ (assuming that there was a parametric family specified) and calculates the risk function

R(θ, δ) = ∫_X ∫_V L(v, δ(x)) dP_{θ,V}(v) dP_θ(x).

For the case in which V is not Θ, we can define¹

L(θ, a) = ∫_V L(v, a) dP_{θ,V}(v).   (3.15)

In either case (V = Θ or not), the risk function becomes

R(θ, δ) = ∫_X L(θ, δ(x)) dP_θ(x).

There is usually no way to choose δ to make R(θ, δ) as small as possible for all θ simultaneously. One possibility is to choose a probability measure η over Ω and try to minimize

r(η, δ) = ∫_Ω R(θ, δ) dη(θ),

which is called the Bayes risk. Let μ_X denote the marginal probability measure over X, namely μ_X(A) = ∫_Ω P_θ(A) dη(θ). Suppose that P_θ ≪ ν for every θ. If L ≥ 0, we can use Tonelli's theorem A.69, or if L(θ, δ(x))f_{X|Θ}(x|θ)

¹Notice that a predictive decision problem has been replaced, in the classical setting, by a parametric decision problem with loss L(θ, a), which does not depend on the future observable V.

is integrable with respect to ν × η, we can use Fubini's theorem A.70 to conclude that

r(η, δ) = ∫_X r(δ|x) dμ_X(x).

Each δ that minimizes r(η, δ) is called a Bayes rule, assuming that r(η, δ) is finite. Otherwise no Bayes rule exists. So, we can prove the following.

Proposition 3.16. If a Bayes rule δ exists, then there is a partial Bayes rule that equals δ a.s. [μ_X].

3.1.4 Summary

We now summarize the last several definitions in the case where V = Θ.

Definition 3.17. Suppose that we have a decision problem with action space N, parameter space Ω, sample space X, and loss function L : Ω × N → ℝ. Let δ be a randomized rule. Then L(θ, δ(x)) = ∫_N L(θ, a) dδ(x)(a). The posterior risk of δ is

r(δ|x) = ∫_Ω L(θ, δ(x)) dμ_{Θ|X}(θ|x).

Let A be the set of all x for which there exists a_x that achieves a finite minimum posterior risk. Then a decision rule such that δ₀(x) = a_x for all x ∈ A is called a partial Bayes rule. If A = X, then a partial Bayes rule is called a formal Bayes rule. The risk function of a rule δ is R(θ, δ) = ∫_X L(θ, δ(x)) dP_θ(x). If η is a prior distribution for Θ, the Bayes risk of δ with respect to η is r(η, δ) = ∫_Ω R(θ, δ) dη(θ). If there is a δ that minimizes this quantity at a finite value, then that rule is called the Bayes rule with respect to η.

3.2 Classical Decision Theory


3.2.1 The Role of Sufficient Statistics
As defined, decision rules can be arbitrary functions of the data. We learned
in Chapter 2 that all we needed from the data were sufficient statistics,
so it should be the case that decision rules should only be functions of
sufficient statistics. Of course, formal Bayes rules will only be functions of
the sufficient statistic, since the posterior distribution is a function of the
sufficient statistic. The next theorem says that if a choice of decision rules
will be based solely on the risk functions, then even in classical statistics,
decision rules need only be functions of sufficient statistics.

Theorem 3.18.² Let δ₀ be a randomized rule and let T be a sufficient statistic. Then there exists a rule δ₁ that is a function of the sufficient statistic and has the same risk function.

PROOF. For each measurable A ⊆ N, define

δ₁(t)(A) = E(δ₀(X)(A) | T = t).

(Since T is sufficient, this expectation does not depend on θ.) It follows easily that for any δ₀(x)-integrable function h : N → ℝ,

E(∫ h(a) dδ₀(X)(a) | T = t) = ∫ h(a) dδ₁(t)(a).   (3.19)

(Just check the equation for indicators, simple functions, nonnegative functions, then integrable functions.) Then,

R(θ, δ₁) = ∫_X L(θ, δ₁(T(x))) dP_θ(x) = ∫_X ∫_N L(θ, a) dδ₁(T(x))(a) dP_θ(x),
R(θ, δ₀) = ∫_X ∫_N L(θ, a) dδ₀(x)(a) dP_θ(x).

It follows from (3.19) that

∫_N L(θ, a) dδ₁(T(x))(a) = E{∫_N L(θ, a) dδ₀(X)(a) | T = T(x)}.

Now, use the law of total probability B.70 to write R(θ, δ₁) as

∫_X ∫_N L(θ, a) dδ₁(T(x))(a) dP_θ(x) = E_θ(E{∫_N L(θ, a) dδ₀(X)(a) | T}) = E_θ(∫_N L(θ, a) dδ₀(X)(a)) = ∫_X ∫_N L(θ, a) dδ₀(x)(a) dP_θ(x) = R(θ, δ₀). □

Note that if δ₀ in Theorem 3.18 is nonrandomized, then δ₁ will still be randomized if T is not one-to-one. (See Problem 8 on page 209.)
There are cases in which nonrandomized rules are all we need.
Theorem 3.20.³ Suppose that N is a convex subset of ℝ^m and that, for all θ ∈ Ω, L(θ, a) is a convex function of a. Let δ be a randomized rule and let
2This theorem is used in the proof of the Rao-Blackwell theorem 3.22.


3This theorem is used in the proof of Theorem 3.22.

B ⊆ X be the set of all x such that ∫_N ‖a‖ dδ(x)(a) < ∞. Let the mean of the distribution δ(x), considered as a nonrandomized rule, be

δ₀(x) = ∫_N a dδ(x)(a),  for x ∈ B.

Then L(θ, δ₀(x)) ≤ L(θ, δ(x)) for all x ∈ B and for all θ.

PROOF. Since N is convex, Theorem B.17 says that δ₀(x) ∈ N for all x ∈ B. It follows that

L(θ, δ₀(x)) = L(θ, ∫_N a dδ(x)(a)) ≤ ∫_N L(θ, a) dδ(x)(a) = L(θ, δ(x)),

for all x ∈ B. The inequality follows from Jensen's inequality B.17. □


If B = X in Theorem 3.20, then the posterior risk for the nonrandomized rule δ₀ will be no larger than that for the randomized rule δ. If P_θ(B) = 1 for all θ, then the risk function of the nonrandomized rule will be no larger than that of the randomized rule.

Example 3.21. Suppose that P_θ says that X₁, …, X_n are IID Ber(θ). Let X = (X₁, …, X_n), N = Ω = [0, 1], and L(θ, a) = (θ − a)². Let y = ∑_{i=1}^n x_i, and set

δ(y) = { y/n              with probability 1/2,
       { (y + 1)/(n + 2)  with probability 1/2.

This is like flipping a coin between the proportion of successes and the posterior mean from a U(0, 1) prior. Then

δ₀(y) = y/(2n) + (y + 1)/(2n + 4),

L(θ, δ(y)) = (1/2)(θ − y/n)² + (1/2)(θ − (y + 1)/(n + 2))²
  = θ² − θ(y/n + (y + 1)/(n + 2)) + (1/2)(y²/n² + (y + 1)²/(n + 2)²),

L(θ, δ₀(y)) = (θ − y/(2n) − (y + 1)/(2n + 4))²
  = θ² − θ(y/n + (y + 1)/(n + 2)) + (1/4)(y/n + (y + 1)/(n + 2))².

Since (x + z)²/2 < x² + z² for x ≠ z, it follows that L(θ, δ₀(y)) < L(θ, δ(y)).
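The inequality in Example 3.21 is easy to check numerically. The sketch below is our own illustration (the helper name `losses` is not from the text); it compares the posterior-averaged loss of the randomized rule δ with that of its mean δ₀ over many random choices of θ, y, and n:

```python
import random

def losses(theta, y, n):
    """Loss of the randomized rule delta (fair coin between y/n and
    (y+1)/(n+2)) versus its mean delta_0, under squared-error loss."""
    a1, a2 = y / n, (y + 1) / (n + 2)
    L_rand = 0.5 * (theta - a1) ** 2 + 0.5 * (theta - a2) ** 2
    a0 = 0.5 * a1 + 0.5 * a2          # mean of the randomized rule
    L_mean = (theta - a0) ** 2
    return L_rand, L_mean

# Jensen's inequality: the nonrandomized mean never does worse, and is
# strictly better whenever y/n != (y+1)/(n+2), i.e. whenever 2y != n.
for _ in range(1000):
    n = random.randint(1, 50)
    y = random.randint(0, n)
    theta = random.random()
    L_rand, L_mean = losses(theta, y, n)
    assert L_mean <= L_rand + 1e-12
```

The check mirrors the algebra above: both losses share the same linear term in θ, and only the constant term differs by the strict convexity of the square.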
The theory of hypothesis testing (see Chapter 4) is one in which N is
not a convex set, and indeed, randomized rules figure prominently in the
classical theory of hypothesis testing.


Theorem 3.22 (Rao–Blackwell theorem).⁴ Suppose that N is a convex subset of ℝ^m and that, for all θ ∈ Ω, L(θ, a) is a convex function

4This theorem originated with Rao (1945) and Blackwell (1947).



of a. Suppose also that T is sufficient and δ₀ is nonrandomized such that E_θ(‖δ₀(X)‖) < ∞. Define

δ₁(t) = E(δ₀(X) | T = t).

Then R(θ, δ₁) ≤ R(θ, δ₀) for all θ.

PROOF. Consider δ₀ as the randomized rule δ₃(x)(A) = I_A(δ₀(x)), for measurable A ⊆ N. For such A, let

δ₄(t)(A) = E[δ₃(X)(A) | T = t],  δ₂(t) = ∫_N a dδ₄(t)(a).

By Theorems 3.20 and 3.18,

R(θ, δ₂) ≤ R(θ, δ₄) = R(θ, δ₃) = R(θ, δ₀).

All that remains is to show that δ₂ = δ₁. Using the law of total probability B.70, we can write

δ₂(t) = ∫_N a dδ₄(t)(a) = ∫_X ∫_N a dδ₃(x)(a) dF_{X|T}(x|t),

where F_{X|T} is the conditional distribution function of X given T. Since δ₃(x)(·) is a point mass at δ₀(x), we get

δ₂(t) = ∫_X δ₀(x) dF_{X|T}(x|t) = E(δ₀(X) | T = t) = δ₁(t). □

Example 3.23. Suppose that P_θ says that X₁, …, X_n are IID N(θ, 1). Let X = (X₁, …, X_n). Let N = [0, 1] and

L(θ, a) = (a − Φ(c − θ))²,

for some fixed c ∈ ℝ. A naive decision rule is δ₀(x) = ∑_{i=1}^n I_{(−∞,c]}(x_i)/n. But T = X̄ is sufficient and δ₀ is not a function of T. Since N is convex and the loss function is a convex function of a, we should calculate

E(δ₀(X) | T = t) = (1/n) ∑_{i=1}^n E(I_{(−∞,c]}(X_i) | T = t) = Pr(X₁ ≤ c | T = t) = Φ((c − t)/√((n − 1)/n)),

since the distribution of X₁ given T = t is N(t, (n − 1)/n).
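A small Monte Carlo study confirms the risk reduction in Example 3.23. This is our own sketch (the function names are illustrative): it estimates the mean squared error of the naive rule δ₀ and of its Rao–Blackwellization δ₁ at a fixed θ and c.

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def simulate(theta=0.0, c=1.0, n=10, reps=20000, seed=0):
    """Monte Carlo MSEs of the naive rule and the Rao-Blackwellized rule
    as estimators of Phi(c - theta)."""
    rng = random.Random(seed)
    target = phi(c - theta)
    se0 = se1 = 0.0
    for _ in range(reps):
        x = [rng.gauss(theta, 1) for _ in range(n)]
        d0 = sum(xi <= c for xi in x) / n               # naive rule delta_0
        xbar = sum(x) / n
        d1 = phi((c - xbar) / math.sqrt((n - 1) / n))   # conditioned on T = xbar
        se0 += (d0 - target) ** 2
        se1 += (d1 - target) ** 2
    return se0 / reps, se1 / reps

mse0, mse1 = simulate()
assert mse1 < mse0   # conditioning on the sufficient statistic reduces risk
```

The improvement holds for every θ, as the theorem guarantees; the simulation merely illustrates its size at one parameter value.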

3.2.2 Admissibility

The Rao–Blackwell theorem 3.22 tells us that under some conditions, the risk function of one decision rule is no larger than that of another. Similarly, Theorem 3.20 tells us that under some conditions, the loss incurred from one decision rule is no larger than that from another. These theorems have a common theme. That is, sometimes we can tell that one decision rule is better than another no matter what θ equals.


Definition 3.24. A decision rule δ is inadmissible if there is another decision rule δ₁ such that R(θ, δ₁) ≤ R(θ, δ) for all θ, with strict inequality for some θ. If there is such a δ₁, we say that δ₁ dominates δ. If there is no such δ₁, then we say δ is admissible.

Example 3.25. Suppose that P_θ says that X₁, …, X_n are IID N(μ, σ²), where θ = (μ, σ). Let X = (X₁, …, X_n), N = [0, ∞), and L(θ, a) = (a − σ²)². Define

δ(x) = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)²,  δ₁(x) = (1/(n + 1)) ∑_{i=1}^n (x_i − x̄)².

Then it can be shown that δ₁ dominates δ (see Problem 11 on page 210).
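The domination in Example 3.25 can be verified with exact risk formulas: under normality, S = ∑(Xᵢ − X̄)² is distributed as σ² times a χ²_{n−1} variable, so the MSE of S/c follows from the mean and variance of that distribution. A sketch of the computation (the helper name is ours):

```python
def mse_scaled_var(c, n, sigma2=1.0):
    """Exact MSE of S/c as an estimator of sigma^2, where
    S = sum (X_i - Xbar)^2 ~ sigma^2 * chi^2_{n-1} under normality."""
    var_S = 2 * (n - 1) * sigma2 ** 2          # Var(chi^2_{n-1}) = 2(n-1)
    bias = (n - 1) * sigma2 / c - sigma2       # E(S/c) - sigma^2
    return var_S / c ** 2 + bias ** 2

n = 10
mse_unbiased = mse_scaled_var(n - 1, n)   # divisor n-1 gives MSE 2/(n-1)
mse_dominant = mse_scaled_var(n + 1, n)   # divisor n+1 gives MSE 2/(n+1)
assert abs(mse_unbiased - 2 / (n - 1)) < 1e-12
assert abs(mse_dominant - 2 / (n + 1)) < 1e-12
assert mse_dominant < mse_unbiased
```

Since 2σ⁴/(n + 1) < 2σ⁴/(n − 1) for every (μ, σ), the divisor-(n+1) rule dominates, which is the content of the example.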


The criterion of admissibility may seem too severe if some values of θ are deemed to be virtually impossible.

Definition 3.26. Let λ be a measure on (Ω, τ) and let δ be a decision rule. For every decision rule δ₁, let A_{δ₁} = {θ : R(θ, δ₁) < R(θ, δ)}. Suppose that for every decision rule δ₁, if R(θ, δ₁) ≤ R(θ, δ) a.e. [λ], then λ(A_{δ₁}) = 0. Then δ is λ-admissible.
A Bayes rule with respect to a probability measure λ is λ-admissible.

Theorem 3.27.⁵ Suppose that λ is a probability and δ is a Bayes rule with respect to λ. Then δ is λ-admissible.

PROOF. Let δ₁ be a decision rule. If R(θ, δ₁) ≤ R(θ, δ) a.s. [λ] with strict inequality for all θ ∈ A with λ(A) > 0, then

∫_Ω R(θ, δ₁) dλ(θ) < ∫_Ω R(θ, δ) dλ(θ),

which contradicts δ being a Bayes rule with respect to λ. □
A λ-admissible rule will be admissible if λ is a probability that is spread out appropriately. Theorems 3.28, 3.29, 3.31, and 3.32 say this in different ways.

Theorem 3.28. If Ω is discrete, λ is a probability that gives positive probability to each element of Ω, and δ is Bayes with respect to λ, then δ is admissible.

PROOF. Suppose that δ₁ dominates δ. Then R(θ, δ₁) ≤ R(θ, δ) for all θ, and for some θ₀ we have R(θ₀, δ₁) < R(θ₀, δ). It follows that

r(λ, δ₁) = ∑_{all θ} λ({θ}) R(θ, δ₁) < ∑_{all θ} λ({θ}) R(θ, δ) = r(λ, δ),

since λ({θ₀}) > 0. This contradicts that δ is Bayes. □

5This theorem is used in the proof of Theorem 3.31.



Theorem 3.29. If every Bayes rule with respect to a prior λ has the same risk function, then they are all admissible.

PROOF. Let δ be a Bayes rule with respect to the prior λ, and let g(θ) be the risk function of every such Bayes rule. Suppose that δ₀ dominates δ. Then R(θ, δ₀) ≤ g(θ) for all θ, with strict inequality for some θ. But then ∫_Ω R(θ, δ₀) dλ(θ) ≤ ∫_Ω g(θ) dλ(θ). Since δ is a Bayes rule, the inequality must be an equality. This means that δ₀ is also a Bayes rule, hence it has risk function g(θ), which is a contradiction. □

Here is an example in which the condition of Theorem 3.29 does not hold.
Example 3.30. Let Ω = (0, ∞) and N = [0, ∞). Let L(θ, a) = (θ − a)². Let X ~ U(0, θ) given Θ = θ, and let λ be the U(0, c) distribution for c > 0. Then Θ | X = x has density [θ log(c/x)]⁻¹ I_{(x,c)}(θ). The formal Bayes rules are of the form

δ(x) = (c − x)/log(c/x)  if x < c,  and δ(x) arbitrary if x ≥ c.

Clearly, R(θ, δ) for θ > c will depend on the arbitrary part of the definition of δ. For example, δ₀(x) = c for x ≥ c will have a larger risk function than δ₁(x) = x for x ≥ c, even if δ₁(x) = δ₀(x) for x < c.
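The formal Bayes rule in Example 3.30 is just the posterior mean, which can be checked by numerical integration against the posterior density [θ log(c/x)]⁻¹ on (x, c). A sketch (the function name is ours):

```python
import math

def posterior_mean_numeric(x, c, m=100000):
    """Midpoint-rule integral of theta * posterior density on (x, c).
    The integrand theta * 1/(theta*log(c/x)) is constant, so the rule
    reproduces (c - x)/log(c/x) essentially exactly."""
    h = (c - x) / m
    total = 0.0
    for i in range(m):
        theta = x + (i + 0.5) * h
        total += theta * (1.0 / (theta * math.log(c / x))) * h
    return total

x, c = 0.5, 2.0
closed_form = (c - x) / math.log(c / x)
assert abs(posterior_mean_numeric(x, c) - closed_form) < 1e-6
```

The cancellation of θ in the integrand is exactly why the posterior mean has the simple closed form (c − x)/log(c/x).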
The following theorem may apply when the parameter space is an open subset of ℝ^k or when it is the closure of an open set.

Theorem 3.31. Let Ω be a subset of ℝ^k such that every neighborhood of every point in Ω intersects the interior of Ω. Let λ be a measure on (Ω, τ) such that Lebesgue measure on Ω is absolutely continuous with respect to λ. Suppose that δ₀ is λ-admissible and that it has finite risk function. Suppose that R(θ, δ) is continuous in θ for all δ with finite risk function. Then δ₀ is admissible.

PROOF. If δ₀ were inadmissible, then there would be δ₁ such that R(θ, δ₁) ≤ R(θ, δ₀) for all θ with strict inequality for some θ₀. By continuity of risk functions, R(θ, δ₁) < R(θ, δ₀) for all θ in some neighborhood N of θ₀, which intersects the interior of Ω. Since Lebesgue measure is absolutely continuous with respect to λ, λ(N) > 0. Hence δ₀ is not λ-admissible. This contradiction proves the result. □
Consider an exponential family with natural parameter space Ω containing an open set. Let the loss be L(θ, a) = (a − g(θ))² for g a continuous function. The risk function of each decision rule with finite variance for all θ will be continuous in θ according to Theorem 2.64. If δ is a Bayes rule with respect to a prior λ that has a strictly positive density with respect to Lebesgue measure, then δ is λ-admissible by Theorem 3.27. Since the natural parameter space is convex, it satisfies the conditions of Theorem 3.31 and δ is admissible.

The following theorem says that with a strictly convex loss function, every Bayes rule is admissible.

Theorem 3.32. Suppose that N is a convex subset of ℝ^m and that all P_θ are absolutely continuous with respect to each other. If L(θ, ·) is strictly convex for all θ and δ₀ is λ-admissible for some λ, then δ₀ is admissible.

PROOF. If δ₀ were inadmissible, then there would be δ₁ such that R(θ, δ₁) ≤ R(θ, δ₀) for all θ with strict inequality for some θ₀. Define δ₂(x) = [δ₀(x) + δ₁(x)]/2. Then, for every θ,

R(θ, δ₂) = ∫_X L(θ, [δ₀(x) + δ₁(x)]/2) dP_θ(x)
  ≤ (1/2) ∫_X {L(θ, δ₀(x)) + L(θ, δ₁(x))} dP_θ(x)
  = (1/2) R(θ, δ₀) + (1/2) R(θ, δ₁) ≤ R(θ, δ₀).

The first inequality above will be strict unless P_θ(δ₁(X) = δ₀(X)) = 1. Since all P_θ are absolutely continuous with respect to each other, it follows that P_θ(δ₁(X) = δ₀(X)) = 1 for one θ if and only if P_θ(δ₁(X) = δ₀(X)) = 1 for all θ. Hence the first inequality will be strict for all θ unless δ₁(X) = δ₀(X) a.s. given Θ = θ for all θ. In that case, δ₁ could not dominate δ₀; hence the inequality must be strict for all θ. This would imply that δ₀ is not λ-admissible, no matter what λ is. □
Example 3.33. Suppose that P_θ says that X₁, …, X_n are IID Ber(θ), and let X = (X₁, …, X_n). Suppose that N = [0, 1] and that the loss is L(θ, a) = (θ − a)²/[θ(1 − θ)]. Define Y = ∑_{i=1}^n X_i, and let the prior λ be Lebesgue measure on [0, 1]. The posterior given X = x would be Beta(y + 1, n − y + 1), where y = ∑_{i=1}^n x_i. Then

E(L(Θ, a) | X = x) = [Γ(n + 2)/(Γ(y + 1)Γ(n − y + 1))] ∫₀¹ (θ − a)² θ^{y−1} (1 − θ)^{n−y−1} dθ

is minimized at a = y/n for all x and all n > 0. So, δ₀(x) = y/n is a Bayes rule with respect to λ and it is admissible by Theorem 3.32.
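The minimizer a = y/n can be seen directly: the weight θ^{y−1}(1 − θ)^{n−y−1} in the integral is proportional to a Beta(y, n − y) density, and a quadratic ∫(θ − a)² w(θ) dθ is minimized at the mean of w, which here is y/n. A numerical sketch of that computation (helper name ours, for 1 ≤ y ≤ n − 1):

```python
def beta_weight_moments(y, n, m=100000):
    """Zeroth and first moments of w(t) = t^(y-1) (1-t)^(n-y-1) on (0, 1),
    by the midpoint rule. The weighted quadratic m2 - 2a*m1 + a^2*m0 is
    minimized at a = m1/m0, the mean of Beta(y, n-y)."""
    h = 1.0 / m
    m0 = m1 = 0.0
    for i in range(m):
        t = (i + 0.5) * h
        w = t ** (y - 1) * (1 - t) ** (n - y - 1)
        m0 += w * h
        m1 += t * w * h
    return m0, m1

y, n = 3, 10
m0, m1 = beta_weight_moments(y, n)
assert abs(m1 / m0 - y / n) < 1e-4   # minimizer equals y/n
```

This confirms numerically what the Beta-mean argument gives in closed form.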
Theorem 3.32 even applies to measures λ that put 0 mass on large portions of the parameter space. See Problem 17 on page 210 for examples.

The concept of λ-admissibility did not require that λ be a probability measure. It is common to try to find "Bayes" rules with respect to non-probability measures.

Definition 3.34. Let dP_θ/dν(x) = f_{X|Θ}(x|θ). Suppose that λ is a measure on (Ω, τ) and that for every x there exists δ(x) such that

∫_Ω L(θ, δ(x)) f_{X|Θ}(x|θ) dλ(θ) = min_{a∈N} ∫_Ω L(θ, a) f_{X|Θ}(x|θ) dλ(θ).   (3.35)

The rule δ is called a generalized Bayes rule with respect to λ.



If

0 < c = ∫_Ω f_{X|Θ}(x|θ) dλ(θ) < ∞,   (3.36)

then, after observing x, one can pretend that the "prior" distribution of Θ has density f_{X|Θ}(x|θ)/c with respect to λ and that there are no data. If a formal Bayes rule exists in this problem, it is a generalized Bayes rule. For this reason, generalized Bayes rules with respect to λ are also called formal Bayes rules with respect to λ.

Example 3.37. Suppose that P_θ says that X₁, …, X_n are IID U(0, θ). Let X = (X₁, …, X_n), Ω = (0, ∞), and N = [0, ∞). Let λ be Lebesgue measure on (0, ∞) and L(θ, a) = (θ − a)². Then

f_{X|Θ}(x|θ) = θ^{−n} I_{(0,θ)}(max x_i) = θ^{−n} I_{(max x_i, ∞)}(θ).

We get c in (3.36) equal to [(n − 1)(max x_i)^{n−1}]⁻¹. So, we could invent the "prior" density

(n − 1)(max x_i)^{n−1} θ^{−n} I_{(max x_i, ∞)}(θ).

The Bayes rule with respect to this prior is the mean, which is δ(x) = (n − 1) max x_i/(n − 2). This is a generalized Bayes rule.
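The posterior-mean computation in Example 3.37 can be verified numerically; substituting u = max xᵢ/θ turns the improper integral over (max xᵢ, ∞) into one over (0, 1). A sketch (function name ours, assuming n > 2 so the mean exists):

```python
def gen_bayes_estimate(max_x, n, m=100000):
    """Mean of the "prior" (n-1) max_x^(n-1) theta^(-n) I(theta > max_x),
    computed with the substitution u = max_x/theta, which reduces the
    integral to (n-1) * max_x * integral of u^(n-3) over (0, 1)."""
    h = 1.0 / m
    total = 0.0
    for i in range(m):
        u = (i + 0.5) * h
        total += (n - 1) * max_x * u ** (n - 3) * h
    return total

n, M = 5, 2.0
assert abs(gen_bayes_estimate(M, n) - (n - 1) * M / (n - 2)) < 1e-3
```

The closed form (n − 1) max xᵢ/(n − 2) drops out of the same substitution analytically, since ∫₀¹ u^{n−3} du = 1/(n − 2).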

It sometimes happens that the integral with respect to λ of the risk function of a generalized Bayes rule with respect to λ is finite. In such cases, there is an analog to Theorem 3.31.

Theorem 3.38. If Ω is a subset of ℝ^k such that every neighborhood of every point in Ω intersects the interior of Ω, R(θ, δ) is continuous in θ for all δ, Lebesgue measure on Ω is absolutely continuous with respect to λ, δ₀ is a generalized Bayes rule with respect to λ, and L(θ, δ₀(x))f_{X|Θ}(x|θ) is ν × λ integrable, then δ₀ is λ-admissible and admissible.

PROOF. All we need to show is that δ₀ is λ-admissible and then apply Theorem 3.31. For each decision rule δ, R(θ, δ) = ∫_X L(θ, δ(x)) f_{X|Θ}(x|θ) dν(x). If L(θ, δ(x)) f_{X|Θ}(x|θ) is ν × λ integrable, then

∫_Ω R(θ, δ) dλ(θ) = ∫_Ω ∫_X L(θ, δ(x)) f_{X|Θ}(x|θ) dν(x) dλ(θ) = ∫_X ∫_Ω L(θ, δ(x)) f_{X|Θ}(x|θ) dλ(θ) dν(x),

where the last equality follows from Fubini's theorem A.70. If δ₁ is any other rule, then

∫_Ω L(θ, δ₀(x)) f_{X|Θ}(x|θ) dλ(θ) ≤ ∫_Ω L(θ, δ₁(x)) f_{X|Θ}(x|θ) dλ(θ),

for all x, since δ₀ is a generalized Bayes rule with respect to λ. Hence,

∫_Ω R(θ, δ₀) dλ(θ) ≤ ∫_Ω R(θ, δ₁) dλ(θ).

So, it cannot be the case that R(θ, δ₁) ≤ R(θ, δ₀) for all θ with strict inequality for θ ∈ A with λ(A) > 0. □
Example 3.39. Suppose that X₁, …, X_n are IID Ber(θ) given Θ = θ, and let X = (X₁, …, X_n). Let N = [0, 1] and let the loss be L(θ, a) = (θ − a)². Define Y = ∑_{i=1}^n X_i, and let λ have Radon–Nikodym derivative 1/[θ(1 − θ)] with respect to Lebesgue measure on (0, 1). The posterior given X = x would be Beta(y, n − y), where y = ∑_{i=1}^n x_i, unless y = 0 or y = n. For 1 ≤ y ≤ n − 1, the generalized Bayes rule is δ(x) = y/n. For y = 0 or y = n, the only values of δ(x) which make (3.35) finite are δ(x) = y/n. So δ is a generalized Bayes rule with respect to λ. Now L(θ, δ(x))f_{X|Θ}(x|θ) = (θ − y/n)² θ^y (1 − θ)^{n−y}, which has integral 1/n with respect to counting measure times λ. Hence, δ is λ-admissible and admissible.

Sometimes the integral of the risk function with respect to an infinite measure is not finite. For this reason, Blyth (1951) proved the following theorem, which makes use of a sequence of generalized Bayes rules to conclude that a rule is admissible.

Theorem 3.40.⁶ Let δ be a decision rule. Let {λ_n}_{n=1}^∞ be a sequence of measures on (Ω, τ) such that a generalized Bayes rule δ_n with respect to λ_n exists for every n with

r(λ_n, δ_n) = ∫_Ω R(θ, δ_n) dλ_n(θ),

lim_{n→∞} [r(λ_n, δ) − r(λ_n, δ_n)] = 0.   (3.41)

Suppose that either of the following conditions holds:

All P_θ are absolutely continuous with respect to each other; N is a convex set; L(θ, a) is strictly convex in a for all θ; and there exist c, a set C, and a measure λ such that λ_n ≪ λ and dλ_n/dλ(θ) ≥ c for θ ∈ C with λ(C) > 0.

Every neighborhood of every point in Ω intersects the interior of Ω; for every open subset C ⊆ Ω there exists a number c > 0 such that λ_n(C) ≥ c for all n; and the risk function of every decision rule is continuous in θ.

Then δ is admissible.
PROOF. Suppose that δ is inadmissible. Then there is δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ₀, δ′) < R(θ₀, δ) for some θ₀.

If the first condition holds, set δ″ = (δ + δ′)/2 and get that L(θ, δ″(x)) < [L(θ, δ(x)) + L(θ, δ′(x))]/2 for all θ and all x for which δ(x) ≠ δ′(x). Since P_{θ₀}(δ(X) = δ′(X)) < 1 and all P_θ are absolutely continuous with respect to
6This theorem is used in Example 3.43 and in the proof of Theorem 3.44.

each other, we have P_θ(δ(X) = δ′(X)) < 1 for all θ and R(θ, δ″) < R(θ, δ) for all θ. So, for each n,

r(λ_n, δ) − r(λ_n, δ_n) ≥ r(λ_n, δ) − r(λ_n, δ″) ≥ ∫_C [R(θ, δ) − R(θ, δ″)] dλ_n(θ)
  ≥ c ∫_C [R(θ, δ) − R(θ, δ″)] dλ(θ) > 0.

This contradicts (3.41).

If the second condition holds, there exist ε > 0 and an open C ⊆ Ω such that R(θ, δ′) < R(θ, δ) − ε for all θ ∈ C. Now note that, for each n,

r(λ_n, δ) − r(λ_n, δ_n) ≥ r(λ_n, δ) − r(λ_n, δ′) ≥ ∫_C [R(θ, δ) − R(θ, δ′)] dλ_n(θ) ≥ ελ_n(C) ≥ εc,

where c is the number guaranteed in the second condition. This contradicts (3.41). □
We can use Theorem 3.40 together with the following lemma to prove that some common estimators are admissible.

Lemma 3.42.⁷ Suppose that Θ = (Θ₁, Θ₂). Also, suppose that, for each possible value θ₂,₀ of Θ₂, δ is admissible when the parameter space is Ω₀ = {θ = (θ₁, θ₂,₀) ∈ Ω}. Then δ is admissible.

PROOF. Suppose that δ were inadmissible. Then there exists δ* such that R(θ, δ*) ≤ R(θ, δ) for all θ and R(θ₀, δ*) < R(θ₀, δ) for some θ₀ ∈ Ω. Let θ₀ = (θ₁,₀, θ₂,₀), and let Ω₀ = {θ = (θ₁, θ₂,₀) ∈ Ω}. We now have a contradiction to δ's being admissible when the parameter space is Ω₀. □
Example 3.43. Suppose that P_θ says that X has N(μ, σ²) distribution, where θ = (μ, σ). Let the loss be L(θ, a) = (μ − a)². We now prove that δ(x) = x is admissible.

Denote Θ = (M, Σ). It is easy to calculate R(θ, δ) = σ². For each value σ₀ of Σ, we will show that δ is admissible for the parameter space Ω₀ = {(μ, σ₀) : μ ∈ ℝ}. Let λ_n be the measure on Ω₀ having density √n times the N(0, σ₀²n) density. The generalized Bayes rule with respect to λ_n is δ_n(x) = nx/(n + 1). The integral of the risk function of δ_n with respect to λ_n is r_n = n^{3/2}σ₀²/(n + 1). The integral of the risk function of δ with respect to λ_n is √n σ₀². Note that

√n σ₀² − n^{3/2}σ₀²/(n + 1) = √n σ₀²/(n + 1),

which goes to 0 as n → ∞. If C is an open subset of Ω₀, then λ_n(C) ≥ λ₁(C) for all n. Since L is strictly convex in a for all θ, the conditions of Theorem 3.40 apply in the parameter space Ω₀, so δ is admissible with this parameter space. By Lemma 3.42, it is admissible for the entire parameter space.
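The convergence needed for Blyth's method in Example 3.43 is elementary: the gap between the integrated risks is √n σ₀²/(n + 1), which vanishes as n grows. A quick numeric check (function name ours):

```python
import math

def blyth_gap(n, sigma0=1.0):
    """Integrated risk of delta(x) = x minus that of the generalized
    Bayes rule nx/(n+1), both with respect to lambda_n (Example 3.43)."""
    return math.sqrt(n) * sigma0 ** 2 - n ** 1.5 * sigma0 ** 2 / (n + 1)

for n in (1, 10, 100, 10000):
    # algebraic simplification: gap = sqrt(n) * sigma0^2 / (n + 1)
    assert abs(blyth_gap(n) - math.sqrt(n) / (n + 1)) < 1e-9

assert blyth_gap(10000) < 0.01   # the gap goes to 0, as (3.41) requires
```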

7This lemma is used in Exampl e 3.43.



The following theorem is a simplified version of a theorem of Brown and Hwang (1982). It will allow us to extend Example 3.43 to two dimensions.

Theorem 3.44. Suppose that X has a k-dimensional exponential family distribution given Θ = θ with density f_{X|Θ}(x|θ) = c(θ) exp(θᵀx) with respect to a measure ν. Let the natural parameter space Ω be a rectangular region I₁ × ⋯ × I_k. Let I_i = (a_{1,i}, a_{2,i}) with a_{j,i} possibly infinite. Suppose that lim_{θ_i→a_{j,i}} f_{X|Θ}(x|θ) = 0 for all i, j, x. Suppose that there exist a set S ⊆ Ω with positive Lebesgue measure and a sequence of almost everywhere differentiable functions {h_n}_{n=1}^∞ such that

h_n : Ω → [0, 1],
h_n(θ) = 1 if θ ∈ S,
lim_{n→∞} ‖∇√h_n(θ)‖ = 0 for all θ (where ∇ denotes the gradient),
∫_Ω sup_n ‖∇√h_n(θ)‖² dθ < ∞.

Then δ(x) = x is admissible as an estimator of g(θ) = E_θ(X) with loss L(θ, a) = ∑_{i=1}^k (g_i(θ) − a_i)².

PROOF. Let λ_n be the measure on (Ω, τ) with Radon–Nikodym derivative h_n(θ) with respect to Lebesgue measure (λ). Then dλ_n/dλ(θ) = 1 for all θ ∈ S with λ(S) > 0 and the loss is strictly convex in a, so the first of the two alternative conditions of Theorem 3.40 is met. We need to find the generalized Bayes rule δ_n with respect to λ_n and show that lim_{n→∞} [r(λ_n, δ) − r(λ_n, δ_n)] = 0. Since g(θ) = −∇ log c(θ) by Proposition 2.70, the generalized Bayes rule with respect to λ_n will be

δ_n(x) = ∫_Ω (−∇ log c(θ)) c(θ) exp(θᵀx) h_n(θ) dθ / ∫_Ω c(θ) exp(θᵀx) h_n(θ) dθ
  = ∫_Ω (−∇c(θ)) exp(θᵀx) h_n(θ) dθ / ∫_Ω c(θ) exp(θᵀx) h_n(θ) dθ
  = ∫_Ω c(θ) exp(θᵀx) [x h_n(θ) + ∇h_n(θ)] dθ / ∫_Ω c(θ) exp(θᵀx) h_n(θ) dθ
  = x + ∫_Ω (∇h_n(θ)) f_{X|Θ}(x|θ) dθ / ∫_Ω f_{X|Θ}(x|θ) h_n(θ) dθ,   (3.45)

where the third equality follows by doing integration by parts with respect to θ_i for the ith coordinate of the integral (with u = exp(θᵀx) h_n(θ) and dv = −[∂c(θ)/∂θ_i] dθ_i) and using lim_{θ_i→a_{j,i}} f_{X|Θ}(x|θ) = 0 to drop the integrated term uv. Now write

r(λ_n, δ) − r(λ_n, δ_n)
  = ∫_X ∫_Ω (‖δ(x) − g(θ)‖² − ‖δ_n(x) − g(θ)‖²) f_{X|Θ}(x|θ) h_n(θ) dλ(θ) dν(x)
  = ∫_X ∫_Ω (x − δ_n(x))ᵀ (x + δ_n(x) − 2g(θ)) f_{X|Θ}(x|θ) h_n(θ) dλ(θ) dν(x).

According to (3.45), we have that

∫_Ω g(θ) f_{X|Θ}(x|θ) h_n(θ) dλ(θ) = δ_n(x) ∫_Ω f_{X|Θ}(x|θ) h_n(θ) dθ.

For convenience, define

H_n(x) = ∫_Ω f_{X|Θ}(x|θ) h_n(θ) dθ,  J_n(x) = ∫_Ω (∇h_n(θ)) f_{X|Θ}(x|θ) dθ.

Then x − δ_n(x) = −J_n(x)/H_n(x) and

r(λ_n, δ) − r(λ_n, δ_n) = ∫_X [‖J_n(x)‖²/H_n(x)] dν(x).

Use the fact that ∇h_n(θ) = 2√h_n(θ) ∇√h_n(θ) and the Cauchy–Schwarz inequality B.19 to conclude that

‖J_n(x)‖² ≤ 4 H_n(x) ∫_Ω ‖∇√h_n(θ)‖² f_{X|Θ}(x|θ) dθ.

This means that

r(λ_n, δ) − r(λ_n, δ_n) ≤ 4 ∫_X ∫_Ω ‖∇√h_n(θ)‖² f_{X|Θ}(x|θ) dθ dν(x) = 4 ∫_Ω ‖∇√h_n(θ)‖² dθ,

which goes to 0 as n → ∞ because of the last two conditions in the theorem and the dominated convergence theorem A.57. □

Example 3.46. Suppose that X ~ N₂(μ, Σ) given Θ = (μ, Σ), where Σ is a 2 × 2 positive definite diagonal matrix and μ is two-dimensional.⁸ Let N = ℝ², let L(θ, a) = (a₁ − μ₁)² + (a₂ − μ₂)², and let δ(x) = x. For each value Σ₀ with diagonal elements σ₀,₁² and σ₀,₂², let Ω₀ = {θ = (μ, Σ₀) : μ ∈ ℝ²} be a subparameter space. The natural parameter of this exponential family is ψ = (μ₁/σ₀,₁², μ₂/σ₀,₂²). Consider the following sequence of functions:

h_n(ψ) = 1                          if ‖ψ‖ ≤ 1,
h_n(ψ) = (1 − log‖ψ‖/log n)²        if 1 ≤ ‖ψ‖ ≤ n,
h_n(ψ) = 0                          if ‖ψ‖ ≥ n.
8This example was compiled from material in Brown and Hwang (1982) and
Section 8.9 of Berger (1985).

Here S = {ψ : ‖ψ‖ ≤ 1} in Theorem 3.44. It is easy to see that

‖∇√h_n(ψ)‖² = (‖ψ‖ log n)⁻² I_{[1,n]}(‖ψ‖) ≤ (‖ψ‖ log(max{‖ψ‖, 2}))⁻² I_{[1,∞)}(‖ψ‖).

It follows that lim_{n→∞} ‖∇√h_n(ψ)‖ = 0 for all ψ. To verify the last condition of Theorem 3.44, we need only show that

∫_{ψ:‖ψ‖≥2} (‖ψ‖ log‖ψ‖)⁻² dψ < ∞.

By transforming to polar coordinates, we see that this integral is a finite constant plus a constant times

∫₂^∞ dr/(r log²r) = ∫_{ln 2}^∞ dz/z² < ∞.

The following result allows us to translate admissibility in one decision problem to admissibility in a different decision problem if the loss functions are related.

Proposition 3.47.⁹

Let Ω be an open subset of ℝ^k. If c(θ) > 0 for all θ ∈ Ω, then δ is admissible with loss L(θ, a) if and only if δ is admissible with loss c(θ)L(θ, a).

Let Ω ⊆ ℝ be an interval. Suppose that c(θ) ≥ 0 for all θ and is strictly positive except for θ ∈ A, where A consists solely of points in Ω which are isolated from each other. Suppose also that, with loss function L(θ, a), the risk function of every decision rule is continuous from the left at every θ ∈ A. (Alternatively, suppose that all risk functions are continuous from the right at every θ ∈ A.) Then δ is admissible with loss L(θ, a) if and only if δ is admissible with loss c(θ)L(θ, a).

Let d(θ) be a real-valued function of θ. Then δ is admissible with loss L(θ, a) if and only if it is admissible with loss L(θ, a) + d(θ).

The proof of this proposition is simple and is left for the reader.
Example 3.48 (Continuation of Example 3.33; see page 156). Let c(θ) = θ(1 − θ) and Ω = (0, 1). By Proposition 3.47, δ₀(x) = y/n is admissible with loss L(θ, a) = (θ − a)². Since c and all risk functions are continuous, even at the endpoints, it is easy to show that δ₀ is also admissible if Ω = [0, 1].

Proposition 3.49. Suppose that δ₀ is λ-admissible with loss L(θ, a). Then δ₀ is λ-admissible with loss c(θ)L(θ, a) if c(θ) > 0 a.e. [λ].

9This proposition is used in Examples 3.48 and 3.59. It is also used to simplify
the class of loss functions in hypothesis testing in Chapter 4.

3.2.3 James-Stein Estimators

We have seen some examples of simple decision rules that are admissible, but there is a notorious example of a simple decision rule that is inadmissible. This example has spawned a great amount of study. It begins with Stein (1956) and James and Stein (1960), who showed the following.

Theorem 3.50. Suppose that the conditional distribution of X₁, …, X_n given Θ = (μ₁, …, μ_n) is that they are independent with X_i ~ N(μ_i, 1). Let X = (X₁, …, X_n), N = Ω = ℝⁿ, and let the loss be L(θ, a) = ∑_{i=1}^n (μ_i − a_i)². Then, if n > 2, δ(x) = x is inadmissible. In fact, a rule that dominates δ is

δ₁(x) = δ(x) [1 − (n − 2)/∑_{i=1}^n x_i²].
The proof we give here requires a few lemmas. The first is due to Stein
(1981).
Lemma 3.51. Let g : ℝ → ℝ be a differentiable function with derivative g'. Suppose that X has N(μ, 1) distribution and E(|g'(X)|) < ∞. Then E(g'(X)) = Cov(X, g(X)).
PROOF. Let φ(t) = exp(−t²/2)/√(2π) be the standard normal density function. Use integration by parts to show that
\[ \phi(x-\mu) = \int_x^\infty (z-\mu)\phi(z-\mu)\,dz = -\int_{-\infty}^x (z-\mu)\phi(z-\mu)\,dz. \]
We will use these facts in what follows.
\begin{align*}
E(g'(X)) &= \int_{-\infty}^\infty g'(x)\phi(x-\mu)\,dx \\
&= \int_0^\infty g'(x)\phi(x-\mu)\,dx + \int_{-\infty}^0 g'(x)\phi(x-\mu)\,dx \\
&= \int_0^\infty g'(x)\int_x^\infty (z-\mu)\phi(z-\mu)\,dz\,dx - \int_{-\infty}^0 g'(x)\int_{-\infty}^x (z-\mu)\phi(z-\mu)\,dz\,dx \\
&= \int_0^\infty (z-\mu)\phi(z-\mu)\int_0^z g'(x)\,dx\,dz - \int_{-\infty}^0 (z-\mu)\phi(z-\mu)\int_z^0 g'(x)\,dx\,dz \\
&= \int_0^\infty (z-\mu)\phi(z-\mu)[g(z)-g(0)]\,dz - \int_{-\infty}^0 (z-\mu)\phi(z-\mu)[g(0)-g(z)]\,dz \\
&= \int_{-\infty}^\infty [g(z)-g(0)](z-\mu)\phi(z-\mu)\,dz \\
&= \int_{-\infty}^\infty g(z)(z-\mu)\phi(z-\mu)\,dz = \mathrm{Cov}(X, g(X)). \qquad\Box
\end{align*}
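Lemma 3.51 lends itself to a quick numerical sanity check. The sketch below (a simple midpoint-rule quadrature; the choices g(x) = sin(x) and μ = 0.7 are arbitrary) approximates both sides of E(g'(X)) = Cov(X, g(X)):

```python
import math

def normal_pdf(x, mu):
    # N(mu, 1) density
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def expect(f, mu, lo=-12.0, hi=12.0, steps=100000):
    # Midpoint-rule approximation of E[f(X)] for X ~ N(mu, 1).
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) * normal_pdf(lo + (i + 0.5) * h, mu)
               for i in range(steps)) * h

mu = 0.7
lhs = expect(math.cos, mu)                          # E[g'(X)] with g = sin
rhs = expect(lambda x: (x - mu) * math.sin(x), mu)  # Cov(X, g(X)) = E[(X - mu) g(X)]
print(lhs, rhs)  # the two sides agree
```

(Here Cov(X, g(X)) = E[(X − μ)g(X)] because E(X) = μ.)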


164 Chapter 3. Decision Theory

Lemma 3.52. Let g : ℝⁿ → ℝⁿ be a vector of differentiable functions, (g₁,...,gₙ). Let X have Nₙ(θ, I) distribution. For each i, define hᵢ(y) to be the expected value of gᵢ(X₁,...,X_{i−1}, y, X_{i+1},...,Xₙ) and
\[ h_i'(y) = E\left(\frac{d}{dy}\,g_i(X_1,\dots,X_{i-1},y,X_{i+1},\dots,X_n)\right). \tag{3.53} \]
Suppose that, for all i, E(|h'ᵢ(Xᵢ)|) < ∞. Then
\[ E\|X + g(X) - \theta\|^2 = n + E\left(\|g(X)\|^2 + 2\sum_{i=1}^n \frac{\partial}{\partial x_i}\,g_i(x)\Big|_{x=X}\right). \]


PROOF. Write
\begin{align*}
E\|X + g(X) - \theta\|^2 &= E\|X-\theta\|^2 + E\|g(X)\|^2 + 2E[(X-\theta)^\top g(X)] \\
&= n + E\|g(X)\|^2 + 2\sum_{i=1}^n E[(X_i-\theta_i)g_i(X)].
\end{align*}
All we need to prove is that, for each i,
\[ E[(X_i-\theta_i)g_i(X)] = E\left(\frac{\partial}{\partial x_i}\,g_i(x)\Big|_{x=X}\right). \]
By integrating out the Xⱼ for j ≠ i, the left-hand side can be written as
\[ E[(X_i-\theta_i)g_i(X)] = E[(X_i-\theta_i)h_i(X_i)] = \mathrm{Cov}(X_i, h_i(X_i)) = E(h_i'(X_i)) = E\left(\frac{\partial}{\partial x_i}\,g_i(x)\Big|_{x=X}\right), \]
where the first equality follows from the definition of hᵢ, the third follows from Lemma 3.51, and the fourth follows from (3.53) and then integrating out Xᵢ. □
PROOF OF THEOREM 3.50. Now, let g(x) = −x(n−2)/Σⱼ₌₁ⁿxⱼ², and use the notation of Lemma 3.52. This makes gᵢ(x) = −(n−2)xᵢ/Σⱼ₌₁ⁿxⱼ². For each x ≠ 0, the second partial derivative of gᵢ with respect to the ith coordinate of x is uniformly bounded in a neighborhood of xᵢ for each set of values of xⱼ for j ≠ i. A simple application of Taylor's theorem C.1 with remainder shows that hᵢ can be differentiated under the expectation. We can write
\[ E_\theta(|h_i'(X_i)|) \le (n-2)\int\cdots\int \frac{\left|\sum_{j=1}^n x_j^2 - 2x_i^2\right|}{\left[\sum_{j=1}^n x_j^2\right]^2}\, f_{X|\Theta}(x|\theta)\,dx_1\cdots dx_n. \]
Since |Σⱼ₌₁ⁿxⱼ² − 2xᵢ²| ≤ 3Σⱼ₌₁ⁿxⱼ², this can be bounded by 3(n−2) times the expected value of one over a χ²ₙ random variable (the noncentral case is no larger). The expected value of one over a χ²ₙ random variable is 1/(n−2) if n > 2; hence E_θ(|h'ᵢ(Xᵢ)|) < ∞. Lemma 3.52 says that the risk function for δ₁ is
\[ R(\theta,\delta_1) = n + E_\theta\left(\|g(X)\|^2 + 2\sum_{i=1}^n \frac{\partial}{\partial x_i}\,g_i(x)\Big|_{x=X}\right). \]
We can write
\begin{align*}
\|g(x)\|^2 &= (n-2)^2\,\frac{\sum_{i=1}^n x_i^2}{\left[\sum_{j=1}^n x_j^2\right]^2} = \frac{(n-2)^2}{\sum_{j=1}^n x_j^2}, \\
\sum_{i=1}^n \frac{\partial}{\partial x_i}\,g_i(x) &= -(n-2)\,\frac{n\sum_{j=1}^n x_j^2 - 2\sum_{i=1}^n x_i^2}{\left[\sum_{j=1}^n x_j^2\right]^2} = -\frac{(n-2)^2}{\sum_{j=1}^n x_j^2},
\end{align*}
so that
\[ \|g(x)\|^2 + 2\sum_{i=1}^n \frac{\partial}{\partial x_i}\,g_i(x) = -\frac{(n-2)^2}{\sum_{j=1}^n x_j^2} < 0, \tag{3.54} \]
for all x. It follows that the risk function is less than n for all θ. □
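The dominance claim in Theorem 3.50 is easy to probe by simulation. The sketch below (the dimension, true mean vector, and simulation size are arbitrary choices) compares average losses of δ(x) = x and δ₁ at a fixed θ; the first average should be near n and the second strictly smaller:

```python
import random

def james_stein(x):
    # delta_1(x) = [1 - (n-2)/sum(x_i^2)] x
    s = sum(xi * xi for xi in x)
    return [(1.0 - (len(x) - 2) / s) * xi for xi in x]

random.seed(0)
n, sims = 10, 20000
theta = [0.5] * n                     # arbitrary true mean vector
loss_mle = loss_js = 0.0
for _ in range(sims):
    x = [random.gauss(t, 1.0) for t in theta]
    d = james_stein(x)
    loss_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta))
    loss_js += sum((di - t) ** 2 for di, t in zip(d, theta))
print(loss_mle / sims, loss_js / sims)  # about n = 10 versus something smaller
```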
From (3.54), in order to calculate the risk for δ₁, we need the mean of 1/Σⱼ₌₁ⁿXⱼ². Note that Z = Σⱼ₌₁ⁿXⱼ² has noncentral χ² distribution, NCχ²ₙ(λ) with λ = Σⱼ₌₁ⁿμⱼ². From the form of the NCχ² density, it is clear that Z has the same distribution as Y, where Y ~ χ²_{n+2k} given K = k and K ~ Poi(λ/2). The mean of 1/Z is
\[ E\left(\frac{1}{Z}\right) = E\left(\frac{1}{n+2K-2}\right). \]
Notice that when λ = 0, K = 0 a.s. and R(0, δ₁) = 2. This is where the risk function is smallest. A plot of R(θ, δ₁) as a function of λ for n = 6 is given in Figure 3.56. There is no reason why the smallest value of the risk function must occur when θ = 0. We could subtract a vector θ₀ from X and then add θ₀ back on to δ₁ to get an estimator that has the minimum of its risk function at θ₀. This would give the decision rule
\[ \delta_2(x) = \theta_0 + \left[1 - \frac{n-2}{\sum_{i=1}^n (x_i-\theta_{0i})^2}\right](x-\theta_0). \tag{3.55} \]

It may be that we cannot decide which vector θ₀ to subtract. It is possible to choose one based on the data. If n ≥ 4, then we could use the decision rule
\[ \delta_3(x) = \left(1 - \frac{n-3}{\sum_{i=1}^n (x_i-\bar{x})^2}\right)(x-\bar{x}\mathbf{1}) + \bar{x}\mathbf{1}, \]
where 1 denotes a vector whose coordinates are all 1. (See Problem 20 on page 211.)

FIGURE 3.56. Risk Function of James-Stein Estimator for n = 6
There is a way to derive the James-Stein estimator from an empirical Bayes argument.10 This was done by Efron and Morris (1975). Suppose that Θ ~ Nₙ(θ₀, τ²I). The Bayes estimate of Θ is
\[ \theta_0 + (X-\theta_0)\frac{\tau^2}{\tau^2+1}. \]
The empirical Bayes approach tries to estimate τ from the marginal distribution of X. The marginal distribution of X is Nₙ(θ₀, (1+τ²)I). So, we could estimate 1+τ² by Σᵢ₌₁ⁿ(Xᵢ − θ₀ᵢ)²/c for some c. An estimate of τ²/(τ²+1) is
\[ 1 - \frac{c}{\sum_{i=1}^n (X_i-\theta_{0i})^2}. \]
The empirical Bayes estimator is then δ₂(X) if c = n−2. If we take the empirical Bayes approach one step further and also try to estimate θ₀, we could use X̄1 as an estimate, and the estimate of 1+τ² would be Σᵢ₌₁ⁿ(Xᵢ − X̄)²/c. With c = n−3, we get δ₃(X).
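The two empirical Bayes estimators can be coded directly from these formulas; a minimal sketch (the function names are ours):

```python
def delta2(x, center):
    # Shrink x toward the fixed vector `center`, using c = n - 2.
    n = len(x)
    s = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    w = 1.0 - (n - 2) / s           # estimate of tau^2 / (tau^2 + 1)
    return [ci + w * (xi - ci) for xi, ci in zip(x, center)]

def delta3(x):
    # Also estimate the center by the sample mean, using c = n - 3.
    n = len(x)
    xbar = sum(x) / n
    s = sum((xi - xbar) ** 2 for xi in x)
    w = 1.0 - (n - 3) / s
    return [xbar + w * (xi - xbar) for xi in x]

x = [2.1, -0.3, 1.4, 0.8, 3.0, -1.2]
print(delta2(x, [0.0] * 6))
print(delta3(x))
```

Note that δ₃ leaves the sample mean of the coordinates unchanged; only the deviations from x̄ are shrunk.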
Another way to arrive at estimators like these is through hierarchical models (to be discussed in more detail in Chapter 8). For example, Θ₁,...,Θₙ could be modeled as conditionally IID N(μ, τ²) given M = μ and T = τ. Then M and T could have some distribution, rather than merely being estimated as in the empirical Bayes approach. Strawderman (1971) finds a class of Bayes rules that dominate δ₀ when n ≥ 5 and are admissible by Theorems 3.27 and 3.32.

10See Section 8.4 for more detail on empirical Bayes analysis.



The estimator δ₁(X) is actually inadmissible, as can be shown in Problem 22 on page 211. Brown (1971) considers the problem of finding necessary and sufficient conditions for an estimator to be admissible in this setting.

3.2.4 Minimax Rules


There are usually lots of admissible rules. Unless one is willing to choose one by choosing a Bayes rule with respect to some prior distribution, one needs some other criterion by which to choose a rule. One such criterion is the minimax principle.

The Minimax Principle: In comparing rules, the rule δ with the smallest value of sup_θ R(θ, δ) is best.

The minimax principle says to prepare for the worst possible value θ of Θ. When playing a game against an opponent who is trying to make things bad for you, there may be good reason to prepare for the worst. When it makes sense to consider how likely the various alternative values of Θ are, the worst value may turn out not to be of such concern.

Definition 3.57. A rule δ₀ is called minimax if, for all δ, sup_{θ∈Ω} R(θ, δ₀) ≤ sup_{θ∈Ω} R(θ, δ); alternatively, sup_{θ∈Ω} R(θ, δ₀) = inf_δ sup_{θ∈Ω} R(θ, δ).
Proposition 3.58. If δ has constant risk and it is admissible, then it is minimax.
Example 3.59. We saw earlier that when X has N(μ, σ²/n) distribution given Θ = (μ, σ), δ(x) = x is admissible when the loss function is L(θ, a) = (μ − a)². By Proposition 3.47, it is also admissible with loss L'(θ, a) = (μ − a)²/σ². The risk function for this new loss is constant, R(θ, δ) = 1/n. Hence δ is minimax with loss L'. Every other decision rule will have to have a risk function that approaches or surpasses 1/n for some θ values.
Theorem 3.60. Let {Λₙ}ₙ₌₁^∞ be a sequence of probability measures on the parameter space (Ω, τ), with δₙ being the Bayes rule with respect to Λₙ. Suppose that limₙ→∞ r(Λₙ, δₙ) = c < ∞. If there is δ₀ such that R(θ, δ₀) ≤ c for all θ, then δ₀ is minimax.
PROOF. Assume δ₀ is not minimax. Then there are δ' and ε > 0 such that
\[ R(\theta,\delta') \le \sup_{\phi\in\Omega} R(\phi,\delta_0) - \epsilon \le c - \epsilon, \]
for all θ. Choose n₀ so that r(Λₙ, δₙ) ≥ c − ε/2 for all n ≥ n₀. Then, for n ≥ n₀,
\[ r(\Lambda_n,\delta') = \int R(\theta,\delta')\,d\Lambda_n(\theta) \le (c-\epsilon)\int d\Lambda_n(\theta) = c-\epsilon < c-\frac{\epsilon}{2} \le r(\Lambda_n,\delta_n), \]
which contradicts the fact that δₙ is the Bayes rule with respect to Λₙ. □

Example 3.61. Suppose that P_θ says that X₁,...,X_m are independent with Xᵢ ~ N(θᵢ, 1), where θ = (θ₁,...,θ_m). Let δ₀(X) = X, let the action space be ℝᵐ, and let L(θ, a) = Σᵢ₌₁ᵐ(θᵢ − aᵢ)². Let Λₙ be the probability measure that says Θ has distribution N_m(0, nI). The Bayes rules are δₙ(X) = nX/(n+1), and the Bayes risks are r(Λₙ, δₙ) = mn/(n+1), which go to m as n → ∞. Also, R(θ, δ₀) = m, so δ₀ is minimax. We see that minimax rules need not be admissible.
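The Bayes risks in Example 3.61 can be checked by simulation; a sketch with m = 3 and n = 4 (the simulation size is arbitrary), drawing Θ from Λₙ and then X given Θ:

```python
import random

random.seed(1)
m, n, sims = 3, 4, 100000
total = 0.0
for _ in range(sims):
    theta = [random.gauss(0.0, n ** 0.5) for _ in range(m)]  # Theta ~ N_m(0, nI)
    x = [random.gauss(t, 1.0) for t in theta]
    d = [n * xi / (n + 1) for xi in x]                       # Bayes rule delta_n
    total += sum((di - t) ** 2 for di, t in zip(d, theta))
print(total / sims, m * n / (n + 1))  # both close to 2.4
```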
Example 3.62. Suppose that P_θ says that X ~ Bin(n, θ) and that L(θ, a) = (θ − a)²/[θ(1−θ)], where Ω = (0,1) and the action space is [0,1]. A rule with constant risk is δ₀(X) = X/n. The risk is R(θ, δ₀) = 1/n. We saw earlier that δ₀ is admissible, so it is minimax.
Now, suppose that we use the loss L'(θ, a) = (θ − a)². We will see that no analog to Proposition 3.47 operates here. If the prior for Θ is Beta(α, β), then the Bayes rule is δ(x) = (α + x)/(α + β + n). The risk function for this rule is
\[ R(\theta,\delta) = \frac{n\theta(1-\theta) + [\alpha - \theta(\alpha+\beta)]^2}{(\alpha+\beta+n)^2}, \]
which is constant if α = β = √n/2. The constant risk is 1/(4 + 8√n + 4n). So δ is minimax. The rule can be expressed as δ(x) = (x + √n/2)/(n + √n). Notice that this is like changing your prior distribution as the sample size changes.
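The constant-risk claim can be verified numerically; a minimal sketch using the risk formula above (the choice n = 25 is arbitrary):

```python
import math

def bayes_rule_risk(theta, n, alpha, beta):
    # Risk of delta(x) = (alpha + x)/(alpha + beta + n) under squared-error loss:
    # variance of the estimator plus squared bias.
    var = n * theta * (1 - theta) / (alpha + beta + n) ** 2
    bias = (alpha - theta * (alpha + beta)) / (alpha + beta + n)
    return var + bias ** 2

n = 25
a = b = math.sqrt(n) / 2
risks = [bayes_rule_risk(t / 100, n, a, b) for t in range(1, 100)]
target = 1 / (4 + 8 * math.sqrt(n) + 4 * n)
print(max(risks) - min(risks), abs(risks[50] - target))  # both essentially zero
```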

Since minimax rules are designed to prepare for the worst, we should see if there are prior distributions that make the worst θ just likely enough so that the corresponding Bayes rules are minimax.11
Definition 3.63. A prior distribution Λ₀ for Θ is least favorable if
\[ \inf_\delta r(\Lambda_0,\delta) = \sup_\Lambda \inf_\delta r(\Lambda,\delta). \]
Such a Λ₀ is sometimes called a maximin strategy for nature.


Let Λ₀ and δ₀ be a fixed prior probability and decision rule, respectively. It is true that inf_δ r(Λ₀, δ) ≤ sup_Λ r(Λ, δ₀). So, it follows that
\[ \underline{V} \equiv \sup_\Lambda \inf_\delta r(\Lambda,\delta) \le \inf_\delta \sup_\Lambda r(\Lambda,\delta) = \inf_\delta \sup_\theta R(\theta,\delta) \equiv \overline{V}. \]
Definition 3.64. The number V̲ above is called the maximin value of the decision problem, and the number V̄ is called the minimax value.
Theorem 3.65. If δ₀ is Bayes with respect to Λ₀ and R(θ, δ₀) ≤ r(Λ₀, δ₀) for all θ, then δ₀ is minimax and Λ₀ is least favorable.


11There is a delicate balance between how likely the bad values are and how likely the rest of the parameter space is. In the second part of Example 3.62, the worst values, in some sense, are those near 1/2 because the data are most variable given Θ = 1/2. The prior Beta(√n/2, √n/2) puts enough mass near 1/2 to force us to take seriously the possibility that the data will be highly variable. However, it still spreads enough mass around the remainder of the parameter space so that we cannot ignore other θ values. If the prior put probability 1 on Θ = 1/2, for example, then the Bayes rule would be δ(x) = 1/2 for all x.

PROOF. Since V̄ ≤ sup_θ R(θ, δ₀) ≤ r(Λ₀, δ₀) = inf_δ r(Λ₀, δ) ≤ V̲ ≤ V̄, it follows that all of the inequalities are equalities. □

Example 3.66 (Continuation of Example 3.62; see page 168). We saw that the minimax rule with loss (θ − a)² was Bayes with respect to the Beta(√n/2, √n/2) prior, Λ₀. Since the risk function is constant, R(θ, δ) = r(Λ₀, δ) for all θ. It follows that Λ₀ is least favorable. The reason it is least favorable is that it puts a lot of mass on the θ values (near 1/2) that have high variance for X. On the other hand, it does not put so much mass there that the estimator is drawn too close to 1/2.

Definition 3.67. A rule δ₀ is extended Bayes if for every ε > 0 there exists a prior Λ_ε such that r(Λ_ε, δ₀) ≤ ε + inf_δ r(Λ_ε, δ).

Example 3.68. Suppose that P_θ says that X ~ N(θ, 1). Let L(θ, a) = (θ − a)², and let Λ_ε be the N(0, (1−ε)/ε) prior for Θ. Let δ₀(x) = x. The Bayes rule with respect to Λ_ε is
\[ \delta_\epsilon(x) = \frac{x}{1+\frac{\epsilon}{1-\epsilon}} = (1-\epsilon)x. \]
The Bayes risk is r(Λ_ε, δ_ε) = 1 − ε, while the Bayes risk of δ₀ is r(Λ_ε, δ₀) = 1, which is no greater than ε + (1 − ε). So δ₀ is extended Bayes.
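The computation in Example 3.68 is ordinary normal-normal conjugacy and can be checked directly; a minimal sketch:

```python
def posterior_weight(tau2):
    # With Theta ~ N(0, tau2) and X | Theta = theta ~ N(theta, 1), the posterior
    # mean of Theta given X = x is x * tau2/(tau2 + 1), and the Bayes risk
    # (the posterior variance) is tau2/(tau2 + 1).
    return tau2 / (tau2 + 1.0)

for eps in (0.5, 0.1, 0.01):
    tau2 = (1 - eps) / eps  # prior variance used in Example 3.68
    print(eps, posterior_weight(tau2))  # weight is 1 - eps, so delta_eps(x) = (1-eps) x
```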

Theorem 3.69. A constant-risk extended Bayes rule is minimax.

PROOF. Let δ₀ be a constant-risk extended Bayes rule, with R(θ, δ₀) = c for all θ. Suppose that δ₀ is not minimax, but rather that there is a rule δ₁ such that sup_θ R(θ, δ₁) = c − ε, for some ε > 0. Let Λ_{ε/2} be as in Definition 3.67. That is,
\[ r(\Lambda_{\epsilon/2},\delta_0) \le \frac{\epsilon}{2} + \inf_\delta r(\Lambda_{\epsilon/2},\delta). \]
Since δ₀ has constant risk c, its Bayes risk is also c. So
\[ c \le \frac{\epsilon}{2} + \inf_\delta r(\Lambda_{\epsilon/2},\delta). \]
The Bayes risk of δ₁ can be no greater than c − ε, so inf_δ r(Λ_{ε/2}, δ) ≤ c − ε. It follows that c ≤ ε/2 + c − ε = c − ε/2, which is a contradiction. □

Example 3.70. Suppose that P_θ says that X ~ Poi(θ), where Ω = (0, ∞) and the action space is [0, ∞). Let the loss function be L(θ, a) = (θ − a)²/θ, and let δ₀(x) = x. The risk function is R(θ, δ₀) = 1, constant. Let Λ_ε be the prior that says Θ has Γ(1, ε/[1−ε]) distribution. The posterior distribution of Θ given X = x is Γ(x + 1, 1/(1−ε)). The Bayes rule is δ_ε(x) = x(1−ε), with Bayes risk 1 − ε. Since the Bayes risk of δ₀ is 1, δ₀ is extended Bayes, hence minimax.

There are certain situations in which minimax rules are known to exist. These involve finite parameter spaces. When Ω is finite, the risk function is just a vector in some Euclidean space. The set of all risk functions of all decision rules is just a convex set of vectors.12
Definition 3.71. Suppose that Ω = {θ₁,...,θₖ}. Let
\[ R = \{z\in\mathbb{R}^k : z_i = R(\theta_i,\delta), \text{ for some decision rule } \delta \text{ and } i=1,\dots,k\}. \]
We call R the risk set. The lower boundary of a set C ⊆ ℝᵏ is the set
\[ \{z\in C : x_i\le z_i \text{ for all } i \text{ and } x_i<z_i \text{ for some } i \text{ implies } x\notin C\}. \]
The lower boundary of the risk set is denoted ∂_L. The risk set is closed from below if ∂_L ⊆ R.
Example 3.72. Consider a situation with Ω = {0, 1} and action space {1, 2, 3}. Let the loss function L(θ, a) be

           a = 1   a = 2   a = 3
  θ = 0      0       1      0.5
  θ = 1      1       0      0.2

Supposing that no data are available, the class of randomized rules consists of the set of all probability distributions over the action space. The risk function for a randomized rule with probabilities (p₁, p₂, p₃) for the three actions (1, 2, 3) is just the point (p₂ + 0.5p₃, p₁ + 0.2p₃). The set of all such points is the shaded region in Figure 3.73.
We can locate the minimax rule in Figure 3.73 by looking at all orthants of the form O_s = {(x₀, x₁) : x₀ ≤ s, x₁ ≤ s} and finding the one with the smallest s that intersects the risk set. These orthants are shown in Figure 3.73. For all s < 0.3846, the orthant O_s fails to intersect the risk set. But O_{0.3846} does intersect the risk set at the point (0.3846, 0.3846). This point corresponds to the randomized rule with probabilities (0.2308, 0, 0.7692) on the three available actions. It is interesting to note that one is required to randomize in order to achieve the minimax rule. This is somewhat disconcerting for the following reason (among others). After performing the randomization, one will then either choose action a = 1 or action a = 3. In either case, one is no longer using the minimax rule, and the risk point for the chosen decision is either (0, 1) or (0.5, 0.2), but not (0.3846, 0.3846) as hoped.
Two lines are added to Figure 3.73 to show the Bayes rules with respect to two different priors. The line 0.5x₀ + 0.5x₁ = 0.35 passes through the point (0.5, 0.2) to indicate that the action a = 3 is Bayes with respect to the prior that puts equal probability on each parameter value. (The action a = 3 is also Bayes with respect to many other priors.) The prior with Pr(Θ = 0) = 0.6154 is least favorable, and the line 0.6154x₀ + 0.3846x₁ = 0.3846 passes through all of the points corresponding to Bayes rules with respect to the least favorable distribution. (See Problem 25 on page 211 to see how the minimax principle is actually in conflict with the expected loss principle in this example.)
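The minimax point in Example 3.72 can be found numerically by brute force over the simplex of randomized rules; a sketch (the grid resolution is an arbitrary choice):

```python
# Grid search over randomized rules (p1, p2, p3): the risk point from
# Example 3.72 is (p2 + 0.5*p3, p1 + 0.2*p3), and we minimize its maximum.
best = (float("inf"), None)
steps = 1000
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p1, p2 = i / steps, j / steps
        p3 = 1.0 - p1 - p2
        worst = max(p2 + 0.5 * p3, p1 + 0.2 * p3)
        if worst < best[0]:
            best = (worst, (p1, p2, p3))
print(best)  # value about 0.3846 at about (0.2308, 0, 0.7692)
```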

12The remainder of this section is devoted to proving the minimax theorem 3.77. The discussion of risk sets is used in the proofs of the minimax theorem 3.77 and the complete class theorem 3.95. It is also used briefly in the discussion of simple hypotheses and simple alternatives in Chapter 4. All of this material can be skipped without disrupting the flow of the remaining material.

FIGURE 3.73. Risk Set for Example 3.72

Notice that the risk set in Figure 3.73 is convex. This is true in general.
Lemma 3.74. The risk set is convex.
PROOF. Let zᵢ be a point in the risk set that corresponds to a decision rule δᵢ for i = 1, 2, and let 0 ≤ α ≤ 1. Then αz₁ + (1−α)z₂ corresponds to the randomized decision rule αδ₁ + (1−α)δ₂. □
There is a common misconception that a minimax rule can be located
by finding that point in the risk set with all coordinates equal which lies
closest to the origin. Here is an example in which the unique minimax rule
corresponds to a point with distinct coordinates.
Example 3.75. Consider a situation with Ω = {0, 1} and action space {1, 2, 3}. Let the loss function L(θ, a) be

           a = 1   a = 2   a = 3
  θ = 0      0      0.25     1
  θ = 1      1      0.5     0.75

The class of randomized rules consists of the set of all probability distributions over the action space. The risk function for a randomized rule with probabilities (p₁, p₂, p₃) for the three actions (1, 2, 3) is just the point (0.25p₂ + p₃, p₁ + 0.5p₂ + 0.75p₃). The risk set is illustrated in Figure 3.76 together with the point corresponding to the unique minimax rule: choose action 2. The point (0.625, 0.625) is the closest point to the origin which has equal coordinates.
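A short check of the geometry in Example 3.75 (the risk points of the pure actions are read off the loss table):

```python
# Risk points (R(0, a), R(1, a)) of the three pure actions in Example 3.75.
pure = {1: (0.0, 1.0), 2: (0.25, 0.5), 3: (1.0, 0.75)}
worst = {a: max(r) for a, r in pure.items()}
print(worst)  # action 2 has the smallest worst-case risk among pure actions, 0.5

# The equal-coordinate point lies on the segment between the risk points of
# actions 2 and 3: solve 0.25 + 0.75*t = 0.5 + 0.25*t for t.
t = (0.5 - 0.25) / (0.75 - 0.25)
point = (0.25 + 0.75 * t, 0.5 + 0.25 * t)
print(point)  # (0.625, 0.625), worse than the minimax risk 0.5
```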

The following theorem gives conditions under which minimax rules and
least favorable distributions exist.

FIGURE 3.76. Minimax Rule with Unequal Risks

Theorem 3.77 (Minimax theorem). Suppose that the loss function is bounded below and Ω is finite. Then sup_Λ inf_δ r(Λ, δ) = inf_δ sup_θ R(θ, δ), and there exists a least favorable distribution Λ₀. If R is closed from below, then there is a minimax rule that is a Bayes rule with respect to Λ₀.
The proof requires a few lemmas.
Lemma 3.78. Suppose that Ω is finite. The loss function is bounded below if and only if the risk set is bounded below.
PROOF. Suppose that the loss function is bounded below. That is, there exists c such that L(θ, a) ≥ c for all θ and a. Then R(θ, δ) ≥ c for all θ and δ, since the risk function is just the integral of the loss function with respect to a probability measure.
Suppose that the loss function is unbounded below, that is, there exists a sequence {(θₙ, aₙ)}ₙ₌₁^∞ such that L(θₙ, aₙ) < −n for each n. Since Ω is finite, there exists a single θ and a sequence {bₙ}ₙ₌₁^∞ such that L(θ, bₙ) < −n for all n. With δₙ(x) = bₙ for all x, we have R(θ, δₙ) < −n, hence the risk set is not bounded below. □

Lemma 3.79. If a set C ⊆ ℝᵏ is bounded below, then its lower boundary is nonempty.
PROOF. First, note that the lower boundary of C is the same as the lower boundary of C̄, the closure of C. Next, let c₁ = inf{z₁ : z ∈ C̄}, and for i = 2,...,k, let
\[ C_i = \{z\in \bar C : z_j = c_j, \text{ for } j=1,\dots,i-1\}, \qquad c_i = \inf\{z_i : z\in C_i\}. \]
Since C̄ is closed, the point c = (c₁,...,cₖ) ∈ C̄. Suppose that the lower boundary is empty. Then there is a point x such that xᵢ ≤ cᵢ for all i with at least one inequality strict. This is clearly a contradiction to the way that c was constructed. □

Lemma 3.80. Suppose that the loss function is bounded below. If there is a minimax rule, then there is a point on ∂_L whose maximum coordinate is the same as the minimax risk.
PROOF. Let z be the risk function of a minimax rule, and let s be the minimax risk max{z₁,...,zₖ}. Let C = R ∩ {x ∈ ℝᵏ : xᵢ ≤ s, for all i}. Since the loss function is bounded below, Lemma 3.78 says the risk set is bounded below, so C is bounded below. Lemma 3.79 says that the lower boundary of C is nonempty. Clearly, the lower boundary of C is a subset of the lower boundary of R. Since each point in C is the risk function of a minimax rule, the result is proven. □
PROOF OF MINIMAX THEOREM 3.77.13 Let R̄ denote the closure of R. For each real s, let
\[ A_s = \{z\in\mathbb{R}^k : z_i \le s \text{ for all } i\}. \]
Then A_s is closed and convex for each s. Let s₀ = inf{s : A_s ∩ R̄ ≠ ∅}. We will prove first that there is a least favorable distribution. It is easy to see that
\[ s_0 = \inf_\delta \sup_\theta R(\theta,\delta). \tag{3.81} \]
Note that the interior of A_{s₀} is convex and does not intersect R̄. By the separating hyperplane theorem C.5, there are a vector v and a number c such that vᵀz ≥ c for all z ∈ R̄ and vᵀx ≤ c for all x in the interior of A_{s₀}. It is clear that each coordinate of v is nonnegative since, if vᵢ < 0, a sequence of points {xₙ}ₙ₌₁^∞ in the interior of A_{s₀} exists with limₙ→∞ xₙᵢ = −∞, all other coordinates equal to s₀ − ε, and limₙ→∞ vᵀxₙ = ∞ > c. So, let Λ₀ be the probability that puts mass λ₀,ᵢ = vᵢ/Σⱼ₌₁ᵏvⱼ on θᵢ for i = 1,...,k. Since (s₀,...,s₀) is in the closure of the interior of A_{s₀}, it follows that c ≥ s₀Σⱼ₌₁ᵏvⱼ. We can now calculate
\[ \inf_\delta r(\Lambda_0,\delta) = \inf_{z\in\bar R}\lambda_0^\top z \ge \frac{c}{\sum_{j=1}^k v_j} \ge s_0 = \inf_\delta\sup_\theta R(\theta,\delta). \tag{3.82} \]
It follows that Λ₀ is a least favorable distribution.
Next, we prove that if R is closed from below, there is a minimax rule. Let {sₙ}ₙ₌₁^∞ be a decreasing sequence converging to s₀. Note the following facts:

13This proof is similar to proofs of Ferguson (1967) and Berger (1985).


• For each n, R̄ ∩ A_{sₙ} ≠ ∅ and is closed and bounded.
• A_{s₀} = ∩ₙ₌₁^∞ A_{sₙ} ≠ ∅.
It follows from the Bolzano-Weierstrass theorem C.6 that R̄ ∩ A_{s₀} is closed and nonempty. It follows from (3.81) that every point in R̄ ∩ A_{s₀} is the risk function of a minimax rule. Now, apply Lemma 3.79 with C = R̄ ∩ A_{s₀} to see that there is a point of ∂_L contained in R̄ ∩ A_{s₀}. Since ∂_L ⊆ R, we have a point in R that is the risk function of a minimax rule.
Finally, we prove that a minimax rule δ whose risk function is on ∂_L is a Bayes rule with respect to Λ₀. Since R(θ, δ) ≤ s₀ for all θ, it follows that r(Λ₀, δ) ≤ s₀. Combining this with (3.82) completes the proof. □

3.2.5 Complete Classes


Sometimes minimax rules are too hard to find, or they are not good rules.
It might be worthwhile just to find all admissible rules. Or, one could find
a set of rules such that every other rule is dominated by one in the set.
Definition 3.83. A class of decision rules C is complete if, for every δ ∉ C, there is δ₀ ∈ C such that δ₀ dominates δ. A class C is essentially complete if, for every δ ∉ C, there is δ₀ ∈ C such that R(θ, δ₀) ≤ R(θ, δ) for all θ. A class C is minimal (essentially) complete if no proper subclass is also (essentially) complete.
Lemma 3.84.14 If C is a complete class and A is the set of all admissible rules, then A ⊆ C.
PROOF. If δ₀ ∉ C, then there is δ ∈ C such that δ dominates δ₀, hence δ₀ is inadmissible, hence δ₀ ∉ A. □

Proposition 3.85. If C is an essentially complete class and there is an admissible δ ∉ C, then there is δ₀ ∈ C such that R(θ, δ₀) = R(θ, δ) for all θ.
Lemma 3.86. If a minimal complete class exists, it consists of exactly the admissible rules.
PROOF. Let C be a minimal complete class and let A be the set of admissible rules. By Lemma 3.84, A ⊆ C. We need to prove C ⊆ A. Assume the contrary, that is, assume that there is δ₀ ∈ C but δ₀ ∉ A. Then there is δ₁ that dominates δ₀. Either δ₁ ∈ C or not. If δ₁ ∈ C, set δ₂ = δ₁. If δ₁ ∉ C, there exists δ₂ ∈ C such that δ₂ dominates δ₁. In either case, δ₂ ∈ C dominates δ₀. If δ₀ dominates some other rule δ, then δ₂ also dominates δ, so C\{δ₀} is a complete class. But this contradicts the fact that C was minimal complete. □
There is one famous case in which we can find a minimal complete class. 15

14This lemma is used in the proof of Lemma 3.86.


15This theorem originated with Neyman and Pearson (1933).

Theorem 3.87 (Neyman-Pearson fundamental lemma). Let Ω and the action space both be {0, 1}, L(0,0) = L(1,1) = 0, and
\[ L(1,0) = k_1 > 0, \qquad L(0,1) = k_0 > 0. \]
Let fᵢ(x) = dPᵢ/dν(x) for i = 0, 1, where ν = P₀ + P₁. Let δ be a decision rule. Define φ(x) = δ(x)({1}). (This function φ is called the test function corresponding to δ.)
Let C denote the class of all rules with test functions of the following forms:
For each k ∈ (0, ∞) and each function γ : 𝒳 → [0, 1],
\[ \phi_{k,\gamma}(x) = \begin{cases} 1 & \text{if } f_1(x) > kf_0(x), \\ \gamma(x) & \text{if } f_1(x) = kf_0(x), \\ 0 & \text{if } f_1(x) < kf_0(x). \end{cases} \tag{3.88} \]
For k = 0,
\[ \phi_0(x) = \begin{cases} 1 & \text{if } f_1(x) > 0, \\ 0 & \text{if } f_1(x) = 0. \end{cases} \]
For k = ∞,
\[ \phi_\infty(x) = \begin{cases} 1 & \text{if } f_0(x) = 0, \\ 0 & \text{if } f_0(x) > 0. \end{cases} \]
Then C is a minimal complete class.
Then C is a minimal complete class.
Before giving the proof of this theorem, we will give an outline of the proof because there are so many steps. We need to prove that if δ is a rule not in C, then there is a rule in C that dominates δ. For a rule δ not in C, we find a rule δ* ∈ C that has the same value of the risk function at θ = 0. Half of the proof is devoted to this step. We show that the risk functions of rules in C at θ = 0 are decreasing in k, but they may not be continuous. However, by defining γ(x) appropriately, we can find a rule δ* ∈ C such that R(0, δ*) = R(0, δ). We then show that R(1, δ*) < R(1, δ).
PROOF OF THEOREM 3.87. Let C' be C together with all rules whose test functions are of the form φ_{0,γ} in (3.88). Let δ ∈ C'\C. Then the test function for δ is φ_{0,γ} for some γ such that P₀(γ(X) > 0) > 0. Let δ₀ be the rule whose test function is φ₀. Since f₁(x) = 0 for all x ∈ A = {x : φ_{0,γ}(x) ≠ φ₀(x)}, it follows that R(1, δ) = R(1, δ₀). But
\begin{align*}
R(0,\delta) &= k_0[E_0(\gamma(X)I_A(X)) + P_0(f_1(X)>0)] \\
&= k_0E_0(\gamma(X)I_A(X)) + R(0,\delta_0) > R(0,\delta_0).
\end{align*}
Hence δ is inadmissible and is dominated by δ₀. We will now proceed to prove that C' is a complete class. It will then follow from what we just proved that C is a complete class.
Next, let φ be the test function corresponding to a rule δ not in C'. Let
\[ \alpha = R(0,\delta) = \int k_0\phi(x)f_0(x)\,d\nu(x). \]

Note that α ≤ k₀. We will now try to find a rule δ* ∈ C' such that R(0, δ*) = α and R(1, δ*) < R(1, δ). To that end, we define the function
\[ g(k) = \int_{\{f_1(x)\ge kf_0(x)\}} k_0 f_0(x)\,d\nu(x). \]
Note that if γ(x) = 1 for all x and δ* has test function φ_{k,γ}, then g(k) = R(0, δ*). Since {x : f₁(x) ≥ kf₀(x)} becomes smaller as k increases, and f₁(x) < ∞ a.e. [ν], it is easy to see that g(k) decreases to 0 as k → ∞. Also, it is easy to see that g(0) = k₀ ≥ α. We now prove that g is continuous from the left, and we find the limit from the right. First, note that
\begin{align*}
\bigcap_{k<m}\{x : f_1(x)\ge kf_0(x)\} &= \{x : f_1(x)\ge mf_0(x)\}, \tag{3.89} \\
\bigcup_{k>m}\{x : f_1(x)\ge kf_0(x)\} &= \{x : f_1(x)> mf_0(x)\}\cup\{x : f_0(x)=0\}.
\end{align*}
Because g is bounded, the monotone convergence theorem A.52 gives that
\[ \lim_{k\uparrow m} g(k) = g(m), \qquad \lim_{k\downarrow m} g(k) = \int_{\{x: f_1(x)>mf_0(x)\}} k_0f_0(x)\,d\nu(x), \tag{3.90} \]
hence g is continuous from the left. Note that if γ(x) = 0 for all x and δ* has test function φ_{m,γ}, then R(0, δ*) = lim_{k↓m} g(k). Since g is continuous from the left, either there is a largest k such that g(k) > α or there is a smallest k such that g(k) = α. The first case occurs if g has a jump discontinuity and it jumps from a value greater than α down to a value at most α. The second case occurs if g drops continuously to α. In any case, let the guaranteed value of k be denoted k*. If α = 0, it may be that k* = ∞. If α > 0, we must have k* < ∞ because g decreases to 0.
We will construct a decision rule called δ* whose test function has the form of φ_{k*,γ}. We consider the three possible cases:
1. α = 0 and k* < ∞,
2. α = 0 and k* = ∞,
3. α > 0 and k* < ∞.
We begin by proving that, by appropriate choice of the function γ, we can make R(0, δ*) = R(0, δ) = α.
In the first case, we can use (3.89) and (3.90) with m = k* to show that γ(x) = 0 makes R(0, δ*) = 0 = α. In the second case,
\[ R(0,\delta^*) = \int k_0\phi_\infty(x)f_0(x)\,d\nu(x) = 0 = \alpha. \]


In the third case, if g(k*) = α, set γ(x) = 1 to make R(0, δ*) = g(k*) = α. Otherwise, g(k*) > α. Set the right-hand side of (3.90) with m = k* equal to v ≤ α. Because g has a jump discontinuity at k*, it must be that
\[ k_0P_0(f_1(X) = k^*f_0(X)) = g(k^*) - v > \alpha - v \ge 0. \]
For those x such that f₁(x) = k*f₀(x), define
\[ 0 \le \gamma(x) = \frac{\alpha-v}{g(k^*)-v} < 1. \]
It follows that
\begin{align*}
R(0,\delta^*) &= \int k_0\phi_{k^*,\gamma}(x)f_0(x)\,d\nu(x) \\
&= v + \int_{\{f_1(x)=k^*f_0(x)\}} k_0\,\frac{\alpha-v}{g(k^*)-v}\,f_0(x)\,d\nu(x) \\
&= v + \frac{\alpha-v}{g(k^*)-v}\,k_0P_0(f_1(X)=k^*f_0(X)) = \alpha.
\end{align*}
If k* < ∞, define
\[ h(x) = [\phi_{k^*,\gamma}(x)-\phi(x)][f_1(x)-k^*f_0(x)]. \]
We know that φ_{k*,γ}(x) = 1 ≥ φ(x) for all x such that f₁(x) − k*f₀(x) > 0 and φ_{k*,γ}(x) = 0 ≤ φ(x) for all x such that f₁(x) − k*f₀(x) < 0, so h(x) ≥ 0. Since φ is not of the form of some φ_{k,γ}, there must be a set B such that ν(B) > 0 and h(x) > 0 for all x ∈ B. Since ν = P₀ + P₁, we get that f₀(x) + f₁(x) = 1 a.e. [ν]. So,
\begin{align*}
0 &< \int_B h(x)\,d\nu(x) \le \int h(x)\,d\nu(x) \\
&= \int[\phi_{k^*,\gamma}(x)-\phi(x)]f_1(x)\,d\nu(x) - k^*\int[\phi_{k^*,\gamma}(x)-\phi(x)]f_0(x)\,d\nu(x) \\
&= \int[\phi_{k^*,\gamma}(x)-\phi(x)]f_1(x)\,d\nu(x) + \frac{k^*}{k_0}(\alpha-\alpha) \\
&= \frac{1}{k_1}\left[R(1,\delta)-R(1,\delta^*)\right].
\end{align*}
Hence R(1, δ) > R(1, δ*).
If k* = ∞, then R(0, δ) = 0, and hence φ(x) = 0 for almost all x such that f₀(x) > 0. So
\begin{align*}
R(1,\delta) &= k_1P_1(f_0(X)>0) + k_1\int_{\{x:f_0(x)=0\}}[1-\phi(x)]f_1(x)\,d\nu(x) \\
&> k_1P_1(f_0(X)>0) = R(1,\delta^*),
\end{align*}

where the inequality follows from the fact that
\[ \int_{\{x:f_0(x)=0\}}[1-\phi(x)]f_1(x)\,d\nu(x) = 0 \]
implies that φ(x) = φ_∞(x), a.e. [ν]. Since δ was assumed not to be in C', this cannot happen.
What we have shown is that for every δ ∉ C', there is δ* ∈ C' such that δ* dominates δ. Hence C' is complete. As claimed earlier, it now follows that C is complete.
It is easy to check (see Problem 29 on page 212) that no element of C dominates any other element of C, so nothing can be removed without destroying the completeness of C. Hence, C is minimal complete. □
Notice that C consists of all Bayes rules with respect to those priors with positive mass on each θ (see Proposition 3.91), plus only one Bayes rule with respect to each of the priors that put mass 0 on one of the θ values.
Proposition 3.91. In the decision problem described in Theorem 3.87, each rule φ_{k,γ} is a Bayes rule with respect to a prior that assigns positive probability to each parameter value. The only admissible Bayes rule with respect to the prior that says Pr(Θ = 0) = 1 is φ_∞, and the only admissible Bayes rule with respect to the prior that says Pr(Θ = 1) = 1 is φ₀.
Example 3.92. Let θ₁ > θ₀, and let f₀ and f₁ be the N(θ₀, 1) and N(θ₁, 1) densities, respectively. Then, for any k, f₁(x)/f₀(x) > k if and only if
\[ x > \frac{\theta_1+\theta_0}{2} + \frac{\log k}{\theta_1-\theta_0}. \]
There is no need to introduce γ_k(x), since equality has zero probability.

Example 3.93. Let
\[ f_i(x) = \binom{n}{x}p_i^x(1-p_i)^{n-x}, \quad i = 0, 1, \]
for some 1 > p₁ > p₀ > 0. Then, for any k, f₁(x)/f₀(x) > k if and only if
\[ x > \frac{n\log\left(\frac{1-p_0}{1-p_1}\right) + \log k}{\log\left(\frac{p_1(1-p_0)}{p_0(1-p_1)}\right)}. \]
For example, if p₁ = 0.9, p₀ = 0.5, and n = 10, we get
\[ x > \frac{16.09 + \log(k)}{2.197}. \]
If k = 4.408, for example, x = 8 is the cutoff, and γ(8) must be chosen.
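The cutoff in this example can be computed directly; a minimal sketch:

```python
import math

def lr_cutoff(n, p0, p1, k):
    # x must exceed this value for f1(x)/f0(x) > k in the binomial example.
    num = n * math.log((1 - p0) / (1 - p1)) + math.log(k)
    den = math.log(p1 * (1 - p0) / (p0 * (1 - p1)))
    return num / den

c = lr_cutoff(10, 0.5, 0.9, 4.408)
print(c)  # approximately 8, so x = 8 is the boundary point where gamma(8) matters
```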



Example 3.94. Let
\[ f_0(x) = \binom{12}{x}\left(\frac{1}{2}\right)^{12}, \quad x = 0,1,\dots,12, \qquad f_1(x) = \frac{1}{12}, \quad 0 < x < 12. \]
These are the Bin(12, 1/2) and U(0, 12) distributions. The measure ν is Lebesgue measure plus counting measure on the integers. To make both distributions absolutely continuous with respect to ν, we must change f₁(x) to equal 0 at the integers (the uniform distribution puts no mass there, while ν does). Then,
\[ \frac{f_1(x)}{f_0(x)} = \begin{cases} \infty & \text{if } 0<x<12,\ x \text{ not an integer}, \\ 0 & \text{if } x = 0,1,\dots,12, \\ \text{undefined} & \text{otherwise}. \end{cases} \]
There is only one admissible rule, namely, do what is obvious. If the observed x is an integer, choose the binomial distribution; otherwise, choose the uniform distribution.

There is an analog to the minimax theorem for complete classes. The proof is similar also and is adapted from Ferguson (1967). This theorem can also be thought of as a generalization of the Neyman-Pearson fundamental lemma to larger, but still finite, parameter spaces and more general action spaces. However, explicit forms for the decision rules are not given because the action space is unspecified.
Theorem 3.95 (Complete class theorem). Suppose that Ω has only k points, the loss function is bounded below, and the risk set is closed from below. Then the set of all Bayes rules is a complete class, and the set of admissible Bayes rules is a minimal complete class. These are also the rules whose risk functions are on the lower boundary of the risk set.
First, we need a lemma.
Lemma 3.96.16 Suppose that Ω has only k points and the loss function is bounded below. Let the risk set be R. Then:
• every admissible rule is a Bayes rule, and
• for every point z ∈ ∂_L, there exists a prior λᵀ = (λ₁,...,λₖ) such that λᵀz = inf_{y∈R} λᵀy.

16This lemma is used in the proofs of Theorem 3.95 and Lemma 4.43. It is also
used in the discussion of testing simple hypotheses against simple alternatives in
Chapter 4.

PROOF. If a rule is admissible, it is clear that its risk function is a point in
∂L. So, let z ∈ ∂L, and define

    A = {x : x_i < z_i, for all i}.

Then A and R are disjoint convex sets, so the separating hyperplane theorem C.5 says that there exists a constant c and a vector v such that for
all x ∈ A and all y ∈ R, v^T y ≥ c ≥ v^T x. If a coordinate v_i < 0, we can
find a point x ∈ A such that x_j = z_j − ε for some small ε > 0 and all j ≠ i, and x_i is sufficiently negative
so that v^T x > c, a contradiction. So we know that all coordinates of v are
nonnegative. Set

    λ_i = v_i / Σ_{j=1}^k v_j.

Since z is a limit point of A, there exists a point x ∈ A such that v^T x is
arbitrarily close to v^T z. Hence c = v^T z, and λ^T y ≥ λ^T z for all y ∈ R. So,
if z ∈ R, then z is the risk function of a Bayes rule with respect to λ. If
z ∉ R, then it is still true that λ^T z = inf_{y∈R} λ^T y. □
PROOF OF THEOREM 3.95. From the definition of the lower boundary of
the risk set, it is clear that every point on ∂L corresponds to an admissible
rule. It is also clear that to every point in R not on ∂L there corresponds a
point in R that dominates it. Hence the lower boundary contains the risk
functions of all and only admissible rules.
Next, we show that the rules whose risk functions are on ∂L form a
minimal complete class. For each z ∈ R, define

    A_z = {x : x_i ≤ z_i, for all i}.

Let z not be on ∂L. Then there exists z′ ∈ R such that z′ dominates z
and A_{z′} ⊂ A_z. If z′ ∈ ∂L, we are done; if not, apply Lemma 3.79 with
C = A_{z′} ∩ R to conclude that there exists a point in ∂L that is at least
as good as z′ and hence dominates z. This makes the admissible rules a
complete class. Since no admissible rule can be dominated, it is also a
minimal complete class.
Lemma 3.96 shows that ∂L consists of the risk functions of all admissible
Bayes rules, so these rules form a minimal complete class. Since the set
of Bayes rules contains the set of admissible Bayes rules, the set of Bayes
rules is a complete class also. □
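For a finite Ω the theorem can be illustrated numerically. The sketch below uses a toy risk set invented here (five nonrandomized rules, two states): sweeping priors λ = (t, 1−t) and collecting the Bayes rules recovers exactly the risk vectors on the lower boundary of the risk set.

```python
# Toy risk vectors (R(theta_1, delta), R(theta_2, delta)) for five rules.
risks = [(0.5, 0.2), (0.0, 1.0), (0.3, 0.6), (0.6, 0.6), (0.2, 0.9)]

def bayes_rules(risks, grid=101):
    """Return the risk vectors that minimize the Bayes risk t*R1 + (1-t)*R2
    for some prior (t, 1-t) on a grid; these lie on the lower boundary of
    the risk set (randomized mixtures of them are not listed)."""
    found = set()
    for i in range(grid):
        t = i / (grid - 1)
        best = min(t * r1 + (1 - t) * r2 for r1, r2 in risks)
        for r in risks:
            if abs(t * r[0] + (1 - t) * r[1] - best) < 1e-12:
                found.add(r)
    return found

print(sorted(bayes_rules(risks)))  # only (0.0, 1.0) and (0.5, 0.2) are ever Bayes
```

The other three risk vectors lie above the lower boundary, so no prior ever selects them, in agreement with the complete class theorem.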
Notice that the complete class theorem 3.95 says that the rules δ_{k,γ},
δ_0, and δ_∞ in the Neyman-Pearson fundamental lemma 3.87 are the rules
whose risk functions are on ∂L in the risk set.

3.3 Axiomatic Derivation of Decision Theory*


In Section 3.1, we claimed that a Bayesian would choose that decision rule
which minimized the expected loss with respect to the posterior distribu-
tion of Θ. (Recall the expected loss principle.) This may seem reasonable
or ad hoc on the surface, but there is some justification for such a princi-
ple. Von Neumann and Morgenstern (1947) and Anscombe and Aumann
(1963) set up a system of axioms for preferences among decisions, which
lead to the conclusion that one should choose the decision that minimizes
the expected loss. In this section, we present those axioms together with
a proof of the important conclusion. For alternative derivations, see
DeGroot (1970), Savage (1954), Fishburn (1970), or Ferguson (1967). In this
section, we prove Theorem 3.108, which says that so long as preferences
satisfy some axioms, there is a probability distribution and a utility func-
tion (like the negative of a loss function) such that, in any comparison of
decisions, the one with higher expected utility is the more preferred de-
cision. We also prove Theorem 3.110, which says that if data are to be
observed before making decisions, then the comparison should be based on
conditional expected utility given the data.
We begin with background information in the form of definitions and
axioms. Then we present some examples of the axioms, the statements of
the main results, and the proofs.

3.3.1 Definitions and Axioms


The setup we will consider is one in which there is uncertainty about which
of several possibilities will occur. Each possibility will be called a state (of
Nature). Let R be the set of states of Nature. Let A1 be a σ-field of subsets
of R. Assume that (R, A1) is a Borel space. We will also assume that the
final outcomes of our choices will be that we will receive some consequence
or prize. In general, let the set of prizes be an arbitrary set P with σ-field
of subsets A2 which contains all singletons. We assume that all prizes are
available in all states.
In addition to states of Nature, we will assume that there is also un-
certainty about the outcome of another experiment about which we are
willing to specify probabilities. This experiment is assumed to be capable
of producing events with arbitrary probabilities, and we are completely
indifferent between the possible outcomes prior to choosing between the
options available to us. We also assume that this experiment is indepen-
dent of the state of Nature. One can think of this experiment as a spinner
with arbitrarily fine precision which we believe to be fair. The purpose of
this experiment is to allow us to refer to probability distributions over the

*This section may be skipped without interrupting the flow of ideas.



set of prizes.
Definition 3.97. A Von Neumann-Morgenstern lottery (NM-lottery) is
a probability on the space (P, A2) that is concentrated on finitely many
prizes.17 Call the set of all NM-lotteries 𝓛.
For convenience, if L ∈ 𝓛 gives probability 1 to a prize p, then we will often
denote L by p. If the NM-lottery L awards prize p_i with probability α_i for
i = 1, ..., k, we will denote it by L = α1 p1 + ... + αk pk, where α_i ≥ 0 for all
i and Σ_{i=1}^k α_i = 1. This NM-lottery is to be interpreted as meaning that the
results of the experiment T are to be partitioned into k events E1, ..., Ek
with probabilities α1, ..., αk, respectively, and prize p_i is awarded if event
E_i occurs. We assume that the details of the partitioning of the results of
T are irrelevant to us. We only care about the probabilities of the various
prizes.
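The mixture notation α1 p1 + ... + αk pk is easy to make concrete. Below is a minimal sketch (the dict representation and function name are mine, not the book's) of an NM-lottery as a finitely supported probability and of the mixture operation used throughout this section:

```python
def mix(alpha, L1, L2):
    """The mixture alpha*L1 + (1 - alpha)*L2 of two NM-lotteries, each coded
    as a dict mapping finitely many prizes to their probabilities."""
    out = {}
    for L, w in ((L1, alpha), (L2, 1 - alpha)):
        for prize, prob in L.items():
            out[prize] = out.get(prize, 0.0) + w * prob
    return out

L1 = {"p1": 1.0}                # the lottery abbreviated as the prize p1
L2 = {"p1": 0.25, "p2": 0.75}
L = mix(0.4, L1, L2)
print(L)                        # p1 gets 0.4*1 + 0.6*0.25 = 0.55, p2 gets 0.45
assert abs(sum(L.values()) - 1.0) < 1e-12
```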
The choices we need to make will be among NM-Iotteries and some more
general gambles. In order to make choices among gambles, we need to be
able to say which ones we like better than others.18 We will use the symbol
≼ to indicate weak preference; that is, L ≼ L′ means that we like L′ at
least as much as L. Assuming that a weak preference ≼ is defined on 𝓛, we
can now define the more general class of gambles to which we will extend
≼. These gambles will be defined as functions from R to 𝓛.

Definition 3.98. A function H : R → 𝓛 is called a horse lottery if, for
each NM-lottery L, {r : H(r) ≼ L} ∈ A1. Let 𝓗 denote the set of all horse
lotteries.
If H(r) = L for all r, we will often denote H by L. A horse lottery is a
gamble that pays off the result of a state-specific NM-lottery depending on
which state occurs.

17The reason that we choose this restricted set of probability distributions
rather than the set of all probability distributions on (P, A2) is threefold. First,
since our axioms will require that various relations hold for all NM-lotteries, the
smaller the set of NM-lotteries is, the less restrictive are our axioms. Second,
there is one useful result (Corollary 3.143) that fails without further assumptions
if we allow all probabilities to be NM-lotteries. Third, since we will only consider
measures on (P, A2) which concentrate on finitely many points, the σ-field A2
can be quite arbitrary.
18There is one assumption implicit in all of this discussion which is not of
a mathematical nature. The assumption is that our choices do not affect our
opinions. As an example that violates this assumption, suppose that I am trying
to decide whether or not to offer accidental death insurance to an individual. If I
sell the insurance, the individual may be more inclined to act in a risky manner
(such as mountain climbing or bungee jumping). If the individual does not have
insurance, he or she may be less inclined toward the risky behavior. The theory
described in this section assumes that these considerations are absent from our
choices.

It is useful to have notation for strict preference and indifference also.
If H1 ≼ H2 but it is not the case that H2 ≼ H1, then we say H1 ≺ H2
and that H2 is strictly preferred to H1. If both H1 ≼ H2 and H2 ≼ H1, we
say H1 ∼ H2 and we are indifferent between H1 and H2, or H1 and H2 are
indifferent.
The first axiom says that the relation ~ is a weak order, which we define
now.
Definition 3.99. A binary relation ≼ on a set A is a weak order if it
satisfies the following conditions:
• For every a ∈ A, a ≼ a.
• For every a, b ∈ A, either a ≼ b or b ≼ a, or both.
• If a ≼ b and b ≼ c, then a ≼ c.
We say that ≼ is degenerate if a ≼ b for all a, b ∈ A. If ≼ is not degenerate,
we say it is nondegenerate.
Note that a preference is degenerate if and only if all horse lotteries are
indifferent.
Axiom 1 (Weak order). The relation of weak preference ≼ is a weak
order on the set 𝓗 of horse lotteries.
The first axiom does not put very many constraints on the possible pref-
erences we could express. There is the constraint that preferences be tran-
sitive, but this is not a very controversial requirement. There is also the
constraint that every pair of horse lotteries be affirmatively compared. That
is, for every H1 and H2, either H1 ≼ H2 or H2 ≼ H1 (or both). It is not
allowed that I refuse (or am unable) to compare H1 with H2. Seidenfeld,
Schervish, and Kadane (1995) study, in depth, the problems that arise if
one relaxes Axiom 1 to allow that certain horse lotteries are not compared.
Since the set 𝓛 is convex, the set of all functions from R to 𝓛 is convex,
with H = αH1 + (1−α)H2 defined by H(r) = αH1(r) + (1−α)H2(r) for all
r. Unfortunately, the requirement that {r : H(r) ≼ L} ∈ A1 for all L ∈ 𝓛
prevents us from concluding immediately that αH1 + (1−α)H2 ∈ 𝓗, even
if both H1, H2 ∈ 𝓗. However, all NM-lotteries satisfy this condition. That
is, if H1(r) = L1 for all r and H2(r) = L2 for all r, then αH1 + (1−α)H2 =
αL1 + (1−α)L2 ∈ 𝓗. Eventually, we will be able to prove that 𝓗 is convex.
(See Lemma 3.131.) Until we do so, however, many results will have to be
stated in such a way that they still apply without 𝓗 being convex. This
will be done by adding a condition that the result hold not necessarily for
all horse lotteries, but only for every convex set of horse lotteries. The set
of constant horse lotteries (identified with 𝓛 as above) is convex, so the
condition is not vacuous.
Another requirement of the same sort as transitivity is one that says that
if I prefer one prize to another, then I should prefer a gamble that gives

me that prize with some probability to a gamble that gives me the other
prize with the same probability, all other things being equal. We make this
precise with another axiom.
Axiom 2 (Sure-thing principle). For each convex set 𝓗₀ of horse lotteries; for every three horse lotteries H1, H2, H ∈ 𝓗₀; and for every 0 <
α ≤ 1, H1 ≼ H2 if and only if αH1 + (1 − α)H ≼ αH2 + (1 − α)H.
Two other axioms are often assumed for mathematical reasons. The first
assures that no horse lottery is worth infinitely more than another. A def-
inition is required first.
Definition 3.100. Let L be an NM-lottery and let {Lk}_{k=1}^∞ be a sequence
of NM-lotteries. We say that Lk converges pointwise to L (denoted Lk → L)
if and only if, for every A ∈ A2, lim_{k→∞} Lk(A) = L(A).
Since each NM-lottery is a probability distribution over (P, A2), it is a
function from A2 to [0, 1]. The above definition of pointwise convergence
agrees with the usual concept of pointwise convergence of functions.
Axiom 3 (Continuity). Let H be a horse lottery, and let {Hk}_{k=1}^∞ be a
sequence of horse lotteries such that Hk(r) converges pointwise to H(r) for
all r. Let H′ be another horse lottery. If Hk ≼ H′ for all k, then H ≼ H′.
If H′ ≼ Hk for all k, then H′ ≼ H.
The next axiom assures that the relative values of prizes do not vary from
state to state. It is not obvious that such an axiom should be adopted. In
fact, this axiom appears to be nothing more than a mathematical tool for
ensuring that probability and utility can be separated. In Section 3.3.6, we
will consider what happens if we do not assume Axiom 4. A definition is
required before we can state Axiom 4.
Definition 3.101. If E ∈ A1 and H1, H2 ∈ 𝓗, the preference between H1
and H2 is said to be called-off when event E occurs if H1(r) = H2(r) for all
r ∈ E. A set B ∈ A1 is called null if, whenever the preference between H1
and H2 is called-off when B^c occurs, we have H1 ∼ H2. A subset is called
nonnull if it is not null. Similarly, a state s is nonnull if the singleton set
{s} is nonnull.
Axiom 4 (State independence). Suppose that there is a nonnull set
B such that the preference between H1 and H2 is called-off if B^c occurs.
Suppose also that H1(r) = L1 and H2(r) = L2 for r ∈ B. Then H1 ≼ H2
if and only if L1 ≼ L2.
An interesting discussion of state independence is given by Schervish,
Seidenfeld, and Kadane (1990).
Next we introduce an axiom that is only needed when there are infinitely
many states of Nature. When there are infinitely many possible prizes and
states, Seidenfeld and Schervish (1983) show that it is possible for HI to

be preferred to H2 merely because HI offers more possible prizes than H2,


even though H2 offers more valuable prizes than HI. To avoid this problem,
we introduce an axiom.
Axiom 5 (Dominance). If H1(r) ≼ H2(r) for all r, then H1 ≼ H2.
The dominance axiom ties the values of horse lotteries to the values of
the NM-lotteries which they assume. It can be shown (see Problem 38 on
page 213) that Axioms 1-4 imply dominance when R is finite. An example
in which Axioms 1-4 hold but Axiom 5 does not is in Example 3.107.
Next, we define conditional preference and introduce an axiom to link
preferences with conditional preferences.
Definition 3.102. Let (𝒳, B) be a measurable space, and let X : R → 𝒳
be a random quantity. We define a conditional preference relation given X
to be a set of binary relations on 𝓗, {≼x : x ∈ 𝒳}, where H1 ≼x H2 is read
as "we would like H2 at least as much as H1 if we were to learn that X = x
and nothing else of relevance," and
• for every x ∈ 𝒳, H ∈ 𝓗, L ∈ 𝓛, {r : H(r) ≼x L} ∈ A1;
• H1 ≼x H2 if and only if, for every pair (H1′, H2′) of horse lotteries
  satisfying Hi(r) = Hi′(r) for i = 1, 2 and all r ∈ X^{-1}({x}), H1′ ≼x H2′;
  and
• for all H1, H2 ∈ 𝓗, {x : H1 ≼x H2} ∈ B.
The first condition in the list is to ensure that the same horse lotteries
are compared conditionally as unconditionally. The second condition guar-
antees that conditional on X = x, it does not matter what would have
happened if X = y ≠ x had occurred. This makes conditional preference
truly conditional. The measurability condition is added for mathematical
convenience. None of the axioms stated for unconditional preference says
anything about conditional preference. We now suppose that the following
axiom holds.
Axiom 6 (Conditional preference). If (𝒳, B) is a measurable space,
X : R → 𝒳 is a random quantity, and {≼x : x ∈ 𝒳} is a conditional
preference relation given X, then
• for all x, ≼x satisfies Axioms 1-5;
• H1 ≼x H2 for all x implies H1 ≼ H2;
• if H1 ≼x H2 for all x and H1 ≺x H2 for all x ∈ B, where X^{-1}(B) is
  nonnull, then H1 ≺ H2.
This axiom says that if we know that we will prefer H2 to H1 after we observe X, no matter what value we observe for X, then we should prefer H2 to
H1 now. In the case in which there are only finitely many states of Nature,

there is a means of deriving conditional preferences from unconditional


preferences, but one still needs an axiom to say that the derived prefer-
ences should be used conditionally. The method is to make use of called-off
preferences. If X : R → 𝒳 is a random quantity and X^{-1}({x}) = E, then
one can require that conditional preferences given E agree with preferences
that are called-off when E^c occurs.
If we must condition on more than one quantity, there is an issue of
consistency. For example, if Y is a function of X and we first condition on
Y and then on X, do we get the same conditional preferences as if we had
conditioned on X alone?
Definition 3.103. Let X : R → 𝒳 and Y : R → 𝒴 be random quantities
such that Y is a function of X, Y = h(X). We say that conditional preference relations {≼y : y ∈ 𝒴} and {≼x : x ∈ 𝒳} given Y and X are consistent
if there exists a set B ⊆ 𝒴 such that Y^{-1}(B) is null and, for every y ∉ B,
{≼x : x ∈ h^{-1}(y)} is a conditional preference relation relative to ≼y which
satisfies Axiom 6.

3.3.2 Examples
Here are a few examples to illustrate how the axioms can be satisfied or
violated.
Example 3.104. Suppose that there are only two states of Nature r1 and r2.
Suppose also that there are only two prizes, p1 and p2. Then horse lotteries are
characterized by the pair of numbers (q1, q2), where qi is the probability of p2 in
state ri for i = 1, 2. We give two examples of weak preference, one which satisfies
the axioms and one which does not.
First, suppose that we claim that (q1, q2) ≼ (q1′, q2′) if and only if q1 + q2 ≤
q1′ + q2′. It is straightforward to check that this satisfies all of the axioms. The
representation of preference according to Theorem 3.108 below will be Pr(r1) =
Pr(r2) = 1/2 and U(p2) > U(p1). Consider two horse lotteries H1 = (q1, q2) and
H2 = (q1, q3). The preference between H1 and H2 is called-off if {r1} occurs. It
is easy to see in this example, and it can be shown in general, that if r2 is nonnull
(see Lemma 3.148), then H1 ≼ H2 if and only if H3 ≼ H4 for all H3, H4 of the
form H3 = (q4, q2), H4 = (q4, q3). That is, a pair of horse lotteries that differ
only on a nonnull state are ranked the same as every other pair that differ the
same way on the same state.
Second, suppose that we claim that (q1, q2) ≼ (q1′, q2′) if and only if q1 < q1′
or (q1 = q1′ and q2 ≤ q2′). This fails Axiom 3, and there is no expected utility
representation for the preferences.
Example 3.105. Let P = {p1, ..., pm} and R = {r1, ..., rn}. Let q1, ..., qn
be nonnegative numbers that add to 1. Let u1, ..., um be real numbers. For
each NM-lottery L = α1 p1 + ... + αm pm, define U(L) = Σ_{i=1}^m αi ui. For each
horse lottery H = (L1, ..., Ln), define U(H) = Σ_{j=1}^n qj U(Lj). Say that H1 ≼
H2 if and only if U(H1) ≤ U(H2). It is easy to see that this is a weak order.
Since U(αL1 + (1−α)L2) = αU(L1) + (1−α)U(L2), it is easy to verify that
Axiom 2 holds. Continuity follows since Lk → L implies U(Lk) → U(L). State
independence follows easily from the definition of U(H). Theorem 3.108 says that
all examples in the finite case will be like this.
Let X : R → 𝒳 be a random quantity. Clearly, X can take on only finitely
many values. For each value x of X, let X^{-1}({x}) = {r_{k_1(x)}, ..., r_{k_{m(x)}(x)}}. If
v_x = Σ_{j=1}^{m(x)} q_{k_j(x)} = 0, then X^{-1}({x}) is null. Otherwise, define U_x(H) =
Σ_{j=1}^{m(x)} q_{k_j(x)} U(L_{k_j(x)}), and say that H1 ≼x H2 if and only if U_x(H1) ≤ U_x(H2). It
is easy to verify that this is a conditional preference and that it satisfies Axiom 6.
Theorem 3.110 says that all conditional preferences must be of this form in the
finite case.
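The finite-case construction is easy to compute directly. A minimal sketch (the particular values of q_j and u_i below are mine, not the text's) of U(L), U(H), and the induced ranking:

```python
q = [0.2, 0.3, 0.5]           # state probabilities q_1, q_2, q_3
u = {"p1": 0.0, "p2": 1.0}    # prize utilities u_i

def U_lottery(L):
    """U(L) = sum_i alpha_i u_i for an NM-lottery L coded as {prize: alpha}."""
    return sum(alpha * u[p] for p, alpha in L.items())

def U_horse(H):
    """U(H) = sum_j q_j U(L_j) for a horse lottery H = (L_1, ..., L_n)."""
    return sum(qj * U_lottery(Lj) for qj, Lj in zip(q, H))

H1 = [{"p2": 1.0}, {"p1": 1.0}, {"p1": 0.5, "p2": 0.5}]
H2 = [{"p1": 1.0}, {"p2": 1.0}, {"p2": 1.0}]
print(U_horse(H1), U_horse(H2))  # 0.45 versus 0.8, so H1 is less preferred
```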

Example 3.106. Let (R, A1) be an arbitrary Borel space with a probability
Q, and let P be an arbitrary set. Let U : P → ℝ be a bounded function.
For each NM-lottery L = α1 p1 + ... + αk pk, define U(L) = Σ_{i=1}^k αi U(pi). Let
𝓗 be the set of all functions H : R → 𝓛 such that U(H(r)) is a measurable
function of r. For H ∈ 𝓗, define U(H) = ∫ U(H(r)) dQ(r). Say that H1 ≼
H2 if and only if U(H1) ≤ U(H2). For each H ∈ 𝓗 and each L ∈ 𝓛, {r :
H(r) ≼ L} = {r : U(H(r)) ≤ U(L)} ∈ A1 because we assumed that U(H(·))
is measurable. Axiom 1 clearly holds. To see that Axiom 2 holds, note that
U(αH1 + [1 − α]H2) = αU(H1) + (1 − α)U(H2). If Hn(r) → H(r) for all r,
then lim_{n→∞} U(Hn(r)) = U(H(r)) for all r. Since U is bounded, the dominated
convergence theorem A.57 says that lim_{n→∞} U(Hn) = U(H). This implies that
Axiom 3 holds. Note that B ∈ A1 is null if and only if Q(B) = 0. To see that
Axiom 4 is satisfied, let the preference between H1 and H2 be called-off when
B^c occurs, B is nonnull, and H1(r) = L1 and H2(r) = L2 for all r ∈ B. Then
U(H1) − U(H2) = Q(B)[U(L1) − U(L2)], so H1 ≼ H2 if and only if L1 ≼ L2.
Axiom 5 follows easily from part 3 of Proposition A.49. Theorem 3.108 says that
when R and/or P is infinite, this example describes all preference relations that
satisfy the axioms.
Let X : R → 𝒳 be a random quantity. Let {Q(·|x) : x ∈ 𝒳} be a regular
conditional distribution given X. (Use Corollary B.55 to choose a version of
Q(·|x) that gives probability 1 to X^{-1}({x}).) For each x ∈ 𝒳 and H ∈ 𝓗, define
U_x(H) = ∫ U(H(r)) dQ(r|x). If H1, H2 ∈ 𝓗, Theorem B.46 can be used to show
that U_x(Hi) is a measurable function of x for each i, hence {x : H1 ≼x H2} ∈ B,
and {≼x : x ∈ 𝒳} is a conditional preference. Axiom 6 follows from the law of
total probability B.70. Theorem 3.110 says that except for differences on null
sets, this example describes all conditional preferences that satisfy Axiom 6.

Example 3.107. This example is based on one of Seidenfeld and Schervish
(1983), and it is designed to show why Axiom 5 is needed in the infinite case.
Let R = [0, 1] and let A1 be the Borel σ-field. Let Q be Lebesgue measure,
and let P = [0, 1]. Let V : P → ℝ be defined by V(p) = p. For each NM-lottery L = α1 p1 + ... + αk pk, define V(L) = Σ_{i=1}^k αi V(pi). For each function
H : R → 𝓛, define w_H(p, r) = H(r)({p}), that is, the probability that H(r)
assigns to the prize p. Let 𝓗 be the set of all H such that w_H(p, r) is a measurable function of r for every p and V(H(r)) is a measurable function of r. For
H ∈ 𝓗, define V(H) = ∫ V(H(r)) dQ(r). Note that V is the same as the U
in Example 3.106. Define W_H(p) = ∫ w_H(p, r) dQ(r). Since Σ_{all p} w_H(p, r) = 1
for all r, there can be at most countably many p such that W_H(p) > 0. Define
W(H) = 1 − Σ_{all p} W_H(p). The value W(H) measures the extent to which more
than countably many different prizes are assigned by H. For example, if the set
of all prizes assigned by H is countable, then W(H) = 0. In particular, it is easy
to see that W(L) = 0 for all L ∈ 𝓛. Define U(H) = V(H) + W(H) and say
that H1 ≼ H2 if and only if U(H1) ≤ U(H2). Axiom 1 is clearly satisfied. To see
that Axiom 2 is satisfied, note that w_{αH1+[1−α]H2}(p, r) = α w_{H1}(p, r) + (1 − α) w_{H2}(p, r)
for all p and r, so W(αH1 + [1 − α]H2) = αW(H1) + (1 − α)W(H2). Now, use the
fact that V(αH1 + [1 − α]H2) = αV(H1) + (1 − α)V(H2), as shown in Example 3.106, to see that U(αH1 + [1 − α]H2) = αU(H1) + (1 − α)U(H2). If
Hn(r) → H(r) for all r, then lim_{n→∞} w_{Hn}(p, r) = w_H(p, r) for all p, r. Let
{pi}_{i=1}^∞ be the prizes such that either W_{Hn}(pi) > 0 for some n or W_H(pi) > 0.
Define f(r) = Σ_{i=1}^∞ w_H(pi, r) and fn(r) = Σ_{i=1}^∞ w_{Hn}(pi, r). Then W(H) =
1 − ∫ f(r) dQ(r) and W(Hn) = 1 − ∫ fn(r) dQ(r). Since 0 ≤ fn ≤ 1 and lim_{n→∞} fn(r) =
f(r) for all r, it follows from the dominated convergence theorem A.57 that
lim_{n→∞} W(Hn) = W(H). As shown in Example 3.106, lim_{n→∞} V(Hn) = V(H),
so lim_{n→∞} U(Hn) = U(H) and Axiom 3 holds. To see that Axiom 4 is satisfied,
let the preference between H1 and H2 be called-off when B^c occurs, B is nonnull, and H1(r) = L1 and H2(r) = L2 for all r ∈ B. Then W(H1) = W(H2),
since H1 and H2 agree on B^c and each assigns only finitely many prizes on B,
so U(H1) − U(H2) = Q(B)[V(L1) − V(L2)]. We see
that H1 ≼ H2 if and only if L1 ≼ L2. To see that Axiom 5 is violated, let
H1(r) = r, that is, the NM-lottery that gives prize p = r with probability 1. Let
H2 = 1, that is, the constant horse lottery that gives the prize 1 with probability
1 for all r. It is easy to calculate V(H1) = 1/2, W(H1) = 1, V(H2) = 1, and
W(H2) = 0. So U(H1) = 3/2 > 1 = U(H2). But U(H1(r)) = V(H1(r)) = r
for all r and U(H2(r)) = V(1) = 1 for all r. So H1(r) ≼ H2(r) for all r, but
H2 ≺ H1. Note that the ranking of horse lotteries by U is not an expected utility
representation because of the added function W.

3.3.3 The Main Theorems


Since the proofs of the major theorems are very long and not particularly
straightforward, we state here, for the interested reader, the main results.
The proofs will be given in Section 3.3.5.

Theorem 3.108. Assume Axioms 1-5, and assume that preference is nondegenerate. Then there exists a bounded function U : 𝓗 → ℝ such that
U(H(r)) is a measurable function of r for all H ∈ 𝓗 and that satisfies

    U(αH1 + (1−α)H2) = αU(H1) + (1−α)U(H2)                (3.109)

for all α ∈ [0, 1] and all H1, H2. Also, there exists a probability Q on (R, A1)
such that for every H1, H2 ∈ 𝓗, H1 ≼ H2 if and only if ∫ U(H1(r)) dQ(r) ≤
∫ U(H2(r)) dQ(r). The probability Q is unique, and U is unique up to positive affine transformation.
The function U in Theorem 3.108 is called a utility function.
We also prove a theorem linking preference and conditional preference.
Theorem 3.110. Assume the conditions of Theorem 3.108. Let (𝒳, B) be
a Borel space, and let X : R → 𝒳 be a random quantity. Let Q be the
probability from Theorem 3.108, and let {Q(·|x) : x ∈ 𝒳} be a regular
conditional distribution given X. Let {≼x : x ∈ 𝒳} be a conditional preference relation given X which satisfies Axiom 6. Then there exists a set
B such that X^{-1}(B) is null and, for all x ∉ B, H1 ≼x H2 if and only if
∫ U(H1(r)) dQ(r|x) ≤ ∫ U(H2(r)) dQ(r|x).
In Theorem 3.110, if Y : R → 𝒴 is a function of X, and {≼y : y ∈ 𝒴} is
a conditional preference relation given Y, then {≼y : y ∈ 𝒴} is consistent
with {≼x : x ∈ 𝒳} because Theorem B.75 and Corollary B.74 say that
conditioning on Y and then X is the same as conditioning on X alone.

3.3.4 Relation to Decision Theory


Earlier in this chapter we set up decision theory using action spaces and
loss functions. There is a natural connection between these concepts and
the concepts introduced in this section. Let the states of Nature be possible
values of the parameter (R = Ω), or of some future observable (R = V),
or possibly data and parameter together (R = 𝒳 × Ω). Let the actions
(elements of ℵ) index functions from states to prizes (or, more generally,
from states to NM-lotteries). That is, for each a ∈ ℵ, there exists a horse
lottery Ha : R → 𝓛 such that Ha(r) is the prize (NM-lottery) we get in
state of Nature r. For example, if R = Ω, then we can consider L(θ, a) =
c − U(Ha(θ)) for arbitrary c. In this way bounded loss functions are like
the negatives of utility functions. Unbounded loss functions, however, do
not correspond to utilities that satisfy the axioms stated in this chapter.
Example 3.111. Suppose that R = Ω is the interval [c0, c1]. Let P, the set of
prizes, be a bounded interval of monetary units containing my current fortune y
and having half-width at least (c1 − c0)^2. Let U(p) = p for each p ∈ P. If ℵ = Ω,
then for each action a ∈ ℵ we construct the horse lottery Ha(θ) = y − (a − θ)^2, a
function from Ω to P. Then y − U(Ha(θ)) = (a − θ)^2 is squared-error loss. Note
that we used the bounded intervals to ensure that utility is bounded.
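A two-line sketch of the correspondence in Example 3.111 (the numerical values below are mine, chosen only for illustration):

```python
y = 10.0              # current fortune
U = lambda p: p       # utility of a monetary prize

def H(a, theta):
    """Prize delivered by the horse lottery H_a in state theta."""
    return y - (a - theta) ** 2

def loss(theta, a):
    """L(theta, a) = y - U(H_a(theta)), which is squared-error loss."""
    return y - U(H(a, theta))

print(loss(2.0, 3.5))  # (3.5 - 2.0)**2 = 2.25
```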

Axiomatic developments like the one given in this section have two main
consequences. The obvious one is that, as stated in the main theorems,
if one satisfies the axioms, then the preferences have an expected utility
representation. The contrapositive is also true. If preferences do not have
an expected utility representation, then at least one axiom must be violated.
Example 3.112 (Continuation of Example 3.72; see page 170). The minimax
principle is often in conflict with Axiom 2. In this example, the minimax rule
corresponds to the convex combination

    (0.3846, 0.3846) = 0.7692(0.5, 0.2) + 0.2308(0, 1)

in Figure 3.73. According to the minimax principle, (0, 1) ≺ (0.5, 0.2) because
0.5 < 1. If Axiom 2 were satisfied, then

    (0.3846, 0.3846) = 0.7692(0.5, 0.2) + 0.2308(0, 1)
                     ≺ 0.7692(0.5, 0.2) + 0.2308(0.5, 0.2)
                     = (0.5, 0.2).

But (0.5, 0.2) ≺ (0.3846, 0.3846) according to the minimax principle, because
0.3846 < 0.5. Hence, the minimax principle violates Axiom 2 in this example.19
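The numbers in Example 3.112 can be verified directly (the variable names below are mine):

```python
def mix(alpha, x, y):
    """Convex combination alpha*x + (1 - alpha)*y of two risk vectors."""
    return tuple(alpha * a + (1 - alpha) * b for a, b in zip(x, y))

r1, r2 = (0.5, 0.2), (0.0, 1.0)   # risk functions of the two rules
m = mix(0.7692, r1, r2)           # the minimax rule, about (0.3846, 0.3846)

# Minimax ranks by maximum risk: r2 is strictly worse than r1 ...
assert max(r1) < max(r2)
# ... yet mixing the worse rule in beats r1, which Axiom 2 forbids.
assert max(m) < max(r1)
print(m, max(r1), max(r2))
```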

3.3.5 Proofs of the Main Theorems


The theorems we prove pertain to the general case in which (R, A1) is
an arbitrary Borel space and P is an arbitrary set. Some readers may
wish to focus on the finite case in which R = {r1, ..., rn}, A1 = 2^R,
P = {p1, ..., pm}, and A2 = 2^P. In each of the results, we will point out
how the proofs can be simplified (usually by skipping major portions) in
the finite case. In the finite case, if H(ri) = Li for i = 1, ..., n, then we
will denote H = (L1, ..., Ln). Since A1 = 2^R, it follows that 𝓗 is convex
in the finite case. In the finite case Axiom 5 follows from Axioms 1-4. (See
Problem 38 on page 213.)
Until we prove that 𝓗 is convex in general (Lemma 3.131), all of the
lemmas we prove will need to contain conditions like the one that appeared
in Axiom 2 concerning an arbitrary convex set 𝓗₀ of horse lotteries. Once
we prove Lemma 3.131, then 𝓗₀ can be taken equal to 𝓗 in all of these
results. For this reason, in this section we will assume that 𝓗₀ is a convex
set of horse lotteries. The theorems in this section will apply to every
such set 𝓗₀.20 Because some of the lemmas in this section are also useful
in Section 3.3.6, the hypotheses often include which axioms are assumed
explicitly.
Axiom 2 has an "if and only if" clause preceded by a quantification. The
implication in one direction is straightforward, namely that if H1 ≼ H2,
then αH1 + (1−α)H ≼ αH2 + (1−α)H for all H and all α ∈ (0, 1] (assuming
that the mixtures are horse lotteries). The other direction of implication
has more striking consequences. In words, if a horse lottery appears in two
mixtures on both sides of a preference, then the smaller amount can be
"removed" from each mixture without changing the preference.
Lemma 3.113. Assume Axioms 1 and 2. Let H1, H2, H ∈ 𝓗₀, and let
α, β ∈ (0, 1).
• Suppose that αH + (1−α)H1 ≼ βH + (1−β)H2. If α > β, then

      [(α−β)/(1−β)]H + [(1−α)/(1−β)]H1 ≼ H2.

  If α < β, then

      H1 ≼ [(β−α)/(1−α)]H + [(1−β)/(1−α)]H2.

• Suppose that αH + (1−α)H1 ≺ βH + (1−β)H2. If α > β, then

      [(α−β)/(1−β)]H + [(1−α)/(1−β)]H1 ≺ H2.

  If α < β, then

      H1 ≺ [(β−α)/(1−α)]H + [(1−β)/(1−α)]H2.

19See also Problem 25 on page 211.
20In the finite case, 𝓗₀ can be taken equal to 𝓗, since 𝓗 is known to be convex
in that case.
PROOF. The first two statements are proved in almost identical fashion.
We will prove only the first one. Axiom 2 says that, for arbitrary 0 ≤ η < 1,

    ηH + (1−η)[γH + (1−γ)H1] ≼ ηH + (1−η)H2

implies that γH + (1−γ)H1 ≼ H2. Let η = β and γ = (α−β)/(1−β).
The last two statements are proved in almost identical fashion. We will
only prove the third one. Axiom 2 and the definition of ≺ say that, for
arbitrary 0 ≤ η < 1,

    ηH + (1−η)[γH + (1−γ)H1] ≺ ηH + (1−η)H2

implies that γH + (1−γ)H1 ≼ H2. In fact, we can conclude γH + (1−γ)H1 ≺
H2 because, if not, then H2 ≼ γH + (1−γ)H1 and Axiom 2 implies

    ηH + (1−η)H2 ≼ ηH + (1−η)[γH + (1−γ)H1],

a contradiction. Now, let η = β and γ = (α−β)/(1−β). □
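The substitution at the end of the proof is pure mixture algebra and can be checked numerically; in the sketch below (the numbers are mine) horse lotteries are coded as probability vectors over two prizes:

```python
def mix(a, x, y):
    """a*x + (1 - a)*y, coordinatewise."""
    return [a * xi + (1 - a) * yi for xi, yi in zip(x, y)]

H, H1 = [0.1, 0.9], [0.6, 0.4]
alpha, beta = 0.7, 0.3
gamma = (alpha - beta) / (1 - beta)

# eta*H + (1 - eta)*[gamma*H + (1 - gamma)*H1] with eta = beta ...
lhs = mix(beta, H, mix(gamma, H, H1))
# ... collapses to alpha*H + (1 - alpha)*H1, as the proof requires.
rhs = mix(alpha, H, H1)
assert all(abs(a - b) < 1e-12 for a, b in zip(lhs, rhs))
print(lhs, rhs)
```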
The next lemma says that a less preferred gamble can be substituted for
a more preferred one on the left side of any ≼ relation. Similarly, a more
preferred gamble can be substituted for a less preferred one on the right
side of any ≼ relation.
Lemma 3.114. Assume Axioms 1 and 2.
• Assume that H1, H2 ∈ 𝓗₀ and that H1 ≼ H2. Then for every 0 <
  α ≤ 1, every H3 ∈ 𝓗₀, and every H4, if αH2 + (1−α)H3 ≼ H4,
  then αH1 + (1−α)H3 ≼ H4, and if H4 ≼ αH1 + (1−α)H3, then
  H4 ≼ αH2 + (1−α)H3.
• Assume that H1, H2 ∈ 𝓗₀ and that H1 ≺ H2. Then for every 0 <
  α ≤ 1, every H3 ∈ 𝓗₀, and every H4, if αH2 + (1−α)H3 ≼ H4, then
  αH1 + (1−α)H3 ≺ H4, and if H4 ≼ αH1 + (1−α)H3, then H4 ≺
  αH2 + (1−α)H3.
PROOF. Suppose that αH2 + (1−α)H3 ≼ H4 and H1 ≼ H2. Axiom 2 says
that

    αH1 + (1−α)H3 ≼ αH2 + (1−α)H3.

It follows from the transitivity of ≼ that αH1 + (1−α)H3 ≼ H4. The
remaining cases are all similar. □
It follows easily that two indifferent horse lotteries can be substituted for
each other in all comparisons.
192 Chapter 3. Decision Theory


Corollary 3.115. Assume Axioms 1 and 2. Let H₁, H₂ ∈ ℋ₀, and assume H₁ ∼ H₂. Then for every 0 < α ≤ 1, every H₃ ∈ ℋ₀, and every H₄, αH₁ + (1−α)H₃ ⪯ H₄ if and only if αH₂ + (1−α)H₃ ⪯ H₄, and H₄ ⪯ αH₁ + (1−α)H₃ if and only if H₄ ⪯ αH₂ + (1−α)H₃.
The next lemma says that two different mixtures of the same two horse
lotteries are ranked according to how much probability they give to the
better of the two horse lotteries.
Lemma 3.116. Assume Axioms 1 and 2. Let H₁ ≺ H₂ ∈ ℋ₀. Suppose that

    H₃ ∼ αH₂ + (1−α)H₁,    H₄ ∼ βH₂ + (1−β)H₁.

Then α ≤ β if and only if H₃ ⪯ H₄.
PROOF. Suppose that H₃ ⪯ H₄ but β < α. Then

    αH₂ + (1−α)H₁ ⪯ βH₂ + (1−β)H₁

by Corollary 3.115. Since α > β, use Lemma 3.113 to conclude that H₂ ⪯ (β/α)H₂ + ([α−β]/α)H₁. Use Lemma 3.113 once again to conclude that H₂ ⪯ H₁, which is a contradiction.
Next, suppose that α ≤ β but H₄ ≺ H₃. Then βH₂ + (1−β)H₁ ≺ αH₂ + (1−α)H₁ by Corollary 3.115. A contradiction follows just as before. □
A useful consequence of the first three axioms is what is often called an Archimedean condition.²¹

Lemma 3.117 (Archimedean condition). Assume Axioms 1–3, and assume that H₁, H₃ ∈ ℋ₀. If H₁ ≺ H₂ ≺ H₃, then there exists a unique 0 < α < 1 so that

    αH₃ + (1−α)H₁ ∼ H₂.

PROOF. Suppose that H₁ ≺ H₂ ≺ H₃. Let N₀ = {α : αH₃ + (1−α)H₁ ⪯ H₂}, and define β₀ = sup{α : α ∈ N₀}. Since N₀ contains 0, it is nonempty and β₀ is well defined. Define H = β₀H₃ + (1−β₀)H₁. For k = 1, 2, ..., let αₖ ∈ N₀ be such that lim_{k→∞} αₖ = β₀ and define Gₖ = αₖH₃ + (1−αₖ)H₁. We have, for each r, Gₖ(r) → H(r) and, for each k, Gₖ ⪯ H₂. By Axiom 3, H ⪯ H₂. Next, let N₁ = {α : H₂ ⪯ αH₃ + (1−α)H₁}, and define β₁ = inf{α : α ∈ N₁}. Since N₁ contains 1, it is nonempty and β₁ is well defined. Define G = β₁H₃ + (1−β₁)H₁. For k = 1, 2, ..., let γₖ ∈ N₁ be such that lim_{k→∞} γₖ = β₁ and define Rₖ = γₖH₃ + (1−γₖ)H₁. We have, for each r, Rₖ(r) → G(r) and, for each k, H₂ ⪯ Rₖ. By Axiom 3, H₂ ⪯ G. By Lemma 3.116, β₀ ≤ β₁. If β₀ < β < β₁, then neither H₂ ⪯ βH₃ + (1−β)H₁ nor βH₃ + (1−β)H₁ ⪯ H₂, which contradicts Axiom 1. It follows that β₀ = β₁ and α is the common value. Clearly, any value other than α is either in N₀ or N₁ but not in both, so α is unique. □

²¹In the proofs of results in the finite case, we do not explicitly use Axiom 3, but rather we only use the Archimedean condition. This fact would allow us to prove a converse to Lemma 3.117 in the finite case.
As an aside, some people prefer to take the Archimedean condition in Lemma 3.117 as an axiom instead of Axiom 3. In the finite case, they are equivalent.

Proposition 3.118. Assume Axioms 1 and 2 and assume that P and R are finite. If the Archimedean condition from Lemma 3.117 holds, then continuity (Axiom 3) holds.
Lemma 3.119. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117. There exists a function U : ℋ₀ → ℝ such that

    H₁ ⪯ H₂ if and only if U(H₁) ≤ U(H₂).    (3.120)

PROOF. If H₁ ∼ H₂ for all H₁, H₂ ∈ ℋ₀, then just set U(H) = 0 for all H. For the rest of the proof, assume that there exist H_*, H* ∈ ℋ₀ such that H_* ≺ H*. We will use the Archimedean condition in Lemma 3.117 to help define U. For each H ∈ ℋ₀ such that H_* ⪯ H ⪯ H*, define U(H) equal to the value of α such that αH* + (1−α)H_* ∼ H. Note that U(H_*) = 0 and U(H*) = 1.²² For each H such that H ⪯ H_*, define U(H) equal to −α/(1−α) for that α such that αH* + (1−α)H ∼ H_*. For each H such that H* ⪯ H, define U(H) equal to 1/α for that value of α such that αH + (1−α)H_* ∼ H*. Next, we prove that (3.120) is true.

There are six possible arrangements of H₁ and H₂ relative to H_* and H* (ignoring the permutations of H₁ and H₂ themselves). Lemma 3.116 shows that (3.120) is true if both H₁ and H₂ are between H_* and H*.²³ It is easy to see that if only one of H₁ and H₂ is between H_* and H* that (3.120) is true, since one value of U is between 0 and 1 and the other is not. Also, (3.120) is true if Hᵢ ⪯ H_* ≺ H* ⪯ H₃₋ᵢ, for i = 1 or 2. The only cases that remain are (i) that in which both H₁ and H₂ are preferred to H* and (ii) that in which H_* is preferred to both H₁ and H₂. For case (i), we have

    H* ∼ [1/U(H₁)] H₁ + [(U(H₁)−1)/U(H₁)] H_*,    (3.121)
    H* ∼ [1/U(H₂)] H₂ + [(U(H₂)−1)/U(H₂)] H_*.    (3.122)

Lemma 3.116 and (3.122) say that U(H₁) ≤ U(H₂) if and only if

    H* ⪯ [1/U(H₁)] H₂ + [(U(H₁)−1)/U(H₁)] H_*.

This and (3.121) are true if and only if H₁ ⪯ H₂ by Lemma 3.113. Case (ii) is similar. □

²²In the finite case, one can prove that there exist two NM-lotteries H_* and H* such that H_* ⪯ H ⪯ H* for all H ∈ ℋ. (See Problem 33 on page 212.) For this reason, one can skip to the next paragraph in the finite case.
²³The proof ends here in the finite case because we can choose H_* and H* so that H_* ⪯ H ⪯ H* for all H ∈ ℋ, by Problem 33 on page 212.

Lemma 3.123. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117. The function U constructed in the proof of Lemma 3.119 satisfies (3.109) for all H₁, H₂ ∈ ℋ₀.
PROOF. If H₁ ∼ H₂ for all H₁, H₂ ∈ ℋ₀, the result is trivial. So, assume that there exist H_* ≺ H* ∈ ℋ₀. There are 10 cases to handle, depending both on how H₁ and H₂ compare to H_* and H* and on the value of c = αU(H₁) + (1−α)U(H₂). Without loss of generality, assume H₁ ⪯ H₂, since they are arbitrary.²⁴

Case 1. H_* ⪯ H₁, H₂ ⪯ H*. Since

    H₁ ∼ U(H₁)H* + [1 − U(H₁)]H_*,
    H₂ ∼ U(H₂)H* + [1 − U(H₂)]H_*,

we can use Corollary 3.115 to conclude that

    αH₁ + (1−α)H₂ ∼ (αU(H₁) + [1−α]U(H₂))H* + (1 − αU(H₁) − [1−α]U(H₂))H_*,

so that (3.109) holds.

Case 2. H₁ ≺ H_* ⪯ H₂ ⪯ H* and c ≥ 0. In this case, c ≤ 1 is clear, and

    H₂ ∼ U(H₂)H* + [1 − U(H₂)]H_*,    (3.124)
    H_* ∼ [−U(H₁)/(1 − U(H₁))] H* + [1/(1 − U(H₁))] H₁.    (3.125)

Mix H₁ with weight α with both sides of (3.124) to obtain

    αH₁ + (1−α)H₂
    ∼ αH₁ + (1−α)U(H₂)H* + (1−α)[1 − U(H₂)]H_*
    = β [ (−U(H₁)/(1 − U(H₁))) H* + (1/(1 − U(H₁))) H₁ ]
      + (1−β) [ ((αU(H₁) + (1−α)U(H₂))/(1 − α + αU(H₁))) H* + ((1−α)[1 − U(H₂)]/(1 − α + αU(H₁))) H_* ],

where β = α[1 − U(H₁)], which is less than 1 because c > 0. Use (3.125) and Lemma 3.113 to see that the last expression is ∼ cH* + (1−c)H_*. This implies that (3.109) is true.

Case 3. H₁ ≺ H_* ⪯ H₂ ⪯ H* and c < 0. In this case, (3.124) and (3.125) are still true. This time let β = (α − αU(H₁))/(1 − αU(H₁)); mix the left-hand side of (3.124) with weight 1−β with the right-hand side of (3.125) with weight β, and mix the other sides also. The result is

    [1/(1 − αU(H₁))][αH₁ + (1−α)H₂] + [−αU(H₁)/(1 − αU(H₁))] H*
    ∼ [(1−α)U(H₂)/(1 − αU(H₁))] H* + [(1−c)/(1 − αU(H₁))] H_*.

Now use Lemma 3.113 to remove the common H* from both sides (there is more on the left than on the right) to get

    [1/(1−c)][αH₁ + (1−α)H₂] + [−c/(1−c)] H* ∼ H_*.

It follows that (3.109) is true.

²⁴Only case 1 is needed in the finite case.
Case 4. H₁ ≺ H_* ⪯ H* ≺ H₂ and c ∈ [0,1]. In this case,

    H_* ∼ [−U(H₁)/(1 − U(H₁))] H* + [1/(1 − U(H₁))] H₁,    (3.126)
    H* ∼ [1/U(H₂)] H₂ + [(U(H₂) − 1)/U(H₂)] H_*.    (3.127)

Let β = α(1 − U(H₁))/[α(1 − U(H₁)) + (1−α)U(H₂)], and take the mixture of (3.126) with weight β and (3.127) with weight 1−β. The result is

    [α(1 − U(H₁))/D] H_* + [(1−α)U(H₂)/D] H*
    ∼ [−αU(H₁)/D] H* + [α/D] H₁ + [(1−α)/D] H₂ + [(1−α)(U(H₂) − 1)/D] H_*,

where D = α(1 − U(H₁)) + (1−α)U(H₂). One can now use Lemma 3.113 to remove a common component consisting of the two terms involving H* and H_* on the right-hand side with weight

    [−αU(H₁) + (1−α)(U(H₂) − 1)]/D.

The result says that (3.109) is true.

Case 5. H₁ ≺ H_* ⪯ H* ≺ H₂ and c > 1. In this case, (3.126) and (3.127) are still true. This time, let β = α(1 − U(H₁))/[α(1 − U(H₁)) + (1−α)U(H₂)], and mix (3.126) with weight β with (3.127) with weight 1−β to get

    γH* + (1−γ)L ∼ γ( [1/c][αH₁ + (1−α)H₂] + [(c−1)/c] H_* ) + (1−γ)L,

where

    γ = c/[α(1 − U(H₁)) + (1−α)U(H₂)],
    L = [−αU(H₁)/(α(1 − 2U(H₁)))] H* + [α(1 − U(H₁))/(α(1 − 2U(H₁)))] H_*.

Use Lemma 3.113 to remove the common component of L from both sides and the result is (3.109).
Case 6. H₁, H₂ ≺ H_*. In this case, c < 0 is clear and we have (3.126) together with

    H_* ∼ [−U(H₂)/(1 − U(H₂))] H* + [1/(1 − U(H₂))] H₂.    (3.128)

Let β = α(1 − U(H₁))/[α(1 − U(H₁)) + (1−α)(1 − U(H₂))], and mix (3.126) with weight β with (3.128) with weight 1−β to get

    H_* ∼ [−c/(1−c)] H* + [1/(1−c)][αH₁ + (1−α)H₂].

This implies that (3.109) is true.
Case 7. H* ≺ H₁, H₂. This is analogous to case 6.
Case 8. H₁ ≺ H_* ⪯ H* ≺ H₂ and c < 0. This is analogous to case 5.
Case 9. H_* ⪯ H₁ ⪯ H* ≺ H₂ and c > 1. This is analogous to case 3.
Case 10. H_* ⪯ H₁ ⪯ H* ≺ H₂ and c ≤ 1. This is analogous to case 2. □
Now, we can prove that U is bounded on ℋ₀.²⁵

Lemma 3.129. Assume Axioms 1–3. Then U is bounded on ℋ₀.
PROOF. If H₁ ∼ H₂ for all H₁, H₂ ∈ ℋ₀, the result is trivial, so assume that there exist H_* ≺ H* ∈ ℋ₀. Without loss of generality we can assume that U(H_*) = 0 and U(H*) = 1 (otherwise, just replace U by [U − U(H_*)]/[U(H*) − U(H_*)] and the preferences and boundedness are not changed). Suppose, to the contrary, that U is unbounded above. (A similar construction works if U is unbounded below.) Let {Hₙ}_{n=1}^∞ be such that U(Hₙ) > n. Let Hₙ′ = (1 − 1/n)H_* + (1/n)Hₙ for each n. Then H* ≺ Hₙ′ for all n because U(H*) = 1 and U(Hₙ′) > 1 for all n, but Hₙ′ → H_* and H_* ≺ H*. This contradicts Axiom 3. □
If ℋ₀ ⊆ ℋ contains H_* and H* with H_* ≺ H*, define²⁶

    β₁(ℋ₀) = sup_{H ∈ ℋ₀} U(H),    β₀(ℋ₀) = inf_{H ∈ ℋ₀} U(H).

The following lemma is useful in allowing us to find NM-lotteries with arbitrary utilities.²⁷

²⁵The conclusions of Lemma 3.129 are obvious in the finite case.
²⁶In the finite case, we can arrange for β₁(ℋ) = 1 and β₀(ℋ) = 0.
²⁷In the finite case, Lemma 3.130 follows trivially from the fact that there exist NM-lotteries L* and L_* that achieve the maximum (1) and minimum (0) values of U and the fact that U(αL* + (1−α)L_*) = α for every α ∈ [0,1].

Lemma 3.130. Assume Axioms 1–3. For each β ∈ (β₀(ℋ₀), β₁(ℋ₀)), there exists an NM-lottery L with U(L) = β.
PROOF. If H₁ ∼ H₂ for all H₁, H₂ ∈ ℋ₀, then β₀(ℋ₀) = β₁(ℋ₀), and the result is vacuous, so assume that there exist H_* ≺ H* ∈ ℋ₀ with U(H_*) = 0 and U(H*) = 1. Assume, to the contrary, that α₀ = inf_{L ∈ 𝓛} U(L) > β₀(ℋ₀).²⁸ We know that α₀ ≤ 0, since U(H_*) = 0. Let H be a horse lottery such that U(H) < α₀, which must exist since β₀(ℋ₀) is the infimum of all utilities of horse lotteries. Let ε = α₀ − U(H), so that U(H) = α₀ − ε < 0. Let a = α₀/[2(α₀ − ε)], which is easily seen to be between 0 and 1/2. Let L be an NM-lottery such that U(L) = α₀(1/2 + a)/2, which is in the open interval (α₀/2, aα₀). Define H′ = aH + (1 − a)H_*. This means that U(H′) = α₀/2 < U(L), hence H′ ≺ L. But H′(r) = aH(r) + (1 − a)H_*. We have assumed that U(H(r)) ≥ α₀ for all r, since H(r) ∈ 𝓛. So U(H′(r)) ≥ aα₀ > U(L). This implies that L ≺ H′(r) for all r. Axiom 5 implies L ⪯ H′, a contradiction. A similar contradiction holds if we assume that sup_L U(L) < β₁(ℋ₀). □
We are now in position to prove that ℋ is itself convex.

Lemma 3.131.²⁹ Assume Axioms 1–3. Let ℋ₀ be the set of all constant horse lotteries.

For each horse lottery H ∈ ℋ, the function g : R → ℝ defined by g(r) = U(H(r)) is measurable.

If H₁, H₂ ∈ ℋ and 0 ≤ α ≤ 1, then αH₁ + (1−α)H₂ ∈ ℋ.

PROOF. For the first part, let H ∈ ℋ and let g(r) = U(H(r)). We know that β₀(ℋ₀) ≤ g(r) ≤ β₁(ℋ₀) for all r. To prove that g is measurable, we need to show that for every c ∈ (β₀(ℋ₀), β₁(ℋ₀)), {r : g(r) ≤ c} ∈ 𝒜₁. For each such c, let L_c be an NM-lottery with U(L_c) = c, as guaranteed by Lemma 3.130. Then

    {r : g(r) ≤ c} = {r : U(H(r)) ≤ U(L_c)} = {r : H(r) ⪯ L_c} ∈ 𝒜₁,

where the second equality follows from Lemma 3.119, and the inclusion follows from the definition of a horse lottery.
For the second part, let H₁, H₂ ∈ ℋ and 0 ≤ α ≤ 1. We need to prove that for all L ∈ 𝓛, {r : αH₁(r) + (1−α)H₂(r) ⪯ L} ∈ 𝒜₁. Let L ∈ 𝓛. Lemma 3.123 says that

    {r : αH₁(r) + (1−α)H₂(r) ⪯ L} = {r : αU(H₁(r)) + (1−α)U(H₂(r)) ≤ U(L)}.    (3.132)

But the first part of this lemma shows that both U(H₁(·)) and U(H₂(·)) are measurable functions. Hence the convex combination is measurable. It follows that the set on the right-hand side of (3.132) is in 𝒜₁. □

²⁸This can only happen if β₀(ℋ₀) < 0. If β₀(ℋ₀) = 0, then α₀ = β₀(ℋ₀) must occur.
²⁹The conclusions of Lemma 3.131 are already known in the finite case.
From now on, so long as we assume Axioms 1–3, we can assume that ℋ is closed under convex combination.

Lemma 3.133. Assume Axiom 5 and that preference is nondegenerate. Then there exist two NM-lotteries L_* and L* such that L_* ≺ L*.
PROOF. Since the preference is nondegenerate, there exist horse lotteries H_* ≺ H*. If, to the contrary, H_*(r) ∼ H*(r) for all r, then Axiom 5 says H* ⪯ H_*, a contradiction. □
Lemma 3.134. Assume Axioms 1–3. Let H_* and H* be horse lotteries such that U(H_*) = 0 and U(H*) = 1. For each B ∈ 𝒜₁, define H_B by

    H_B(r) = H*(r) if r ∈ B,  H_*(r) if not.    (3.135)

Let Q(B) = U(H_B). Suppose that H_* ⪯ H_B for all B. Then Q is a probability.
PROOF. It is easy to see that H_B is a horse lottery. It follows from H_* ⪯ H_B that Q(B) ≥ 0. It is easy to see that Q(∅) = 0 and Q(R) = 1. If C and D are disjoint, define H = ½H_C + ½H_D, which equals ½H_{C∪D} + ½H_*. According to (3.109),

    ½[Q(C) + Q(D)] = ½[U(H_C) + U(H_D)] = U(½H_C + ½H_D)
    = U(½H_{C∪D} + ½H_*) = ½[U(H_{C∪D}) + U(H_*)] = ½Q(C∪D),

from which it follows that Q(C∪D) = Q(C) + Q(D).³⁰ Next, let {Aₙ}_{n=1}^∞ be mutually disjoint subsets of R, and let

    A = ∪_{i=1}^∞ Aᵢ,    Bₙ = ∪_{i=1}^n Aᵢ.

For every n, we have

    ½H_* + ½H_{Bₙ} ⪯ ½H_* + ½H_A ⪯ H*,

hence, we can choose aₙ ∈ [0,1] so that, with Hₙ = (1−aₙ)[½H_* + ½H_{Bₙ}] + aₙH*,

    Hₙ ∼ ½H_* + ½H_A.

Since we just showed that Q is finitely additive, we have

    ½Q(A) = ½U(H_A) = ½U(H_*) + ½U(H_A) = U(Hₙ)
    = ½(1−aₙ)U(H_{Bₙ}) + ½(1−aₙ)U(H_*) + aₙU(H*)
    = ½(1−aₙ) ∑_{i=1}^n Q(Aᵢ) + aₙ.

It follows that for all n, Q(A) = (1−aₙ) ∑_{i=1}^n Q(Aᵢ) + 2aₙ. If we can show that lim_{n→∞} aₙ = 0, then we have that Q is countably additive. Let {aₙₖ}_{k=1}^∞ be a convergent subsequence of {aₙ}_{n=1}^∞ with limit a. Then, for each r,

    Hₙₖ(r) → H(r),  where H = (1−a)[½H_* + ½H_A] + aH*.

It follows from Axiom 3 (with H′ = H″ = ½H_* + ½H_A) that H ∼ ½H_* + ½H_A. But H = (1−a)[½H_* + ½H_A] + aH*. It follows from Axiom 2 that either a = 0 or H* ∼ ½H_* + ½H_A. Since this latter is clearly false, it must be that a = 0 and aₙ → 0. □

³⁰The proof ends here in the finite case, since there do not exist infinitely many disjoint subsets of R. Note that in the finite case Axiom 3 is not used, only the Archimedean condition of Lemma 3.117 is used.

Lemma 3.136. Assume the conditions of Theorem 3.108. In Lemma 3.134, let H_* = L_* and H* = L* from Lemma 3.133. Then H_* ⪯ H_B for all B ∈ 𝒜₁. For all B ∈ 𝒜₁, Q(B) = 0 if and only if B is null.
PROOF. The fact that Q(B) = 0 if and only if B is null follows easily from Axiom 4 and is left to the reader (as Problem 37 on page 213). By Axiom 5, L_* ⪯ H_B ⪯ L*.³¹ □

Lemma 3.137. Assume the conditions of Theorem 3.108. Let H be a horse lottery that takes on only finitely many different NM-lotteries. Then

    U(H) = ∫ U(H(r)) dQ(r).

PROOF. Let L₁′, ..., Lₙ′ be the different NM-lotteries that H takes on. Let

    b₁ = max{1, U(L₁′), ..., U(Lₙ′)},
    b₀ = min{0, U(L₁′), ..., U(Lₙ′)}.

³¹In the finite case, the fact that L_* ⪯ H_B ⪯ L* was already known without appeal to Axiom 5.

Define c₁ = [b₁(1 − b₀)]⁻¹ and c₂ = −b₀/(1 − b₀). Clearly, c₁ > 0, c₂ ≥ 0, and c₁ + c₂ ≤ 1. Let H″ = c₁H + c₂L* + (1 − c₁ − c₂)L_*.³² Then

    U(H″) = c₁U(H) + c₂,
    U(H″(r)) = c₁U(H(r)) + c₂, for all r.    (3.138)

Also, 0 ≤ U(H″(r)) ≤ 1 follows from (3.109) and simple algebra. Since c₁ ≠ 0 and

    ∫ U(H″(r)) dQ(r) = c₁ ∫ U(H(r)) dQ(r) + c₂,

it is sufficient to prove the result for H″. Since H″(r) is the same mixture of H(r) and L_* and L* for all r, H″ takes on only finitely many different NM-lotteries also. Let H″(r) = Lᵢ for r ∈ Bᵢ for i = 1, ..., n, where the Bᵢ ∈ 𝒜₁ form a finite partition of R. For each i, define Hᵢ by

    Hᵢ(r) = Lᵢ if r ∈ Bᵢ,  L_* if not.

It is easy to see that Hᵢ is a horse lottery and that

    (1/n)H″ + ((n−1)/n)L_* = (1/n)H₁ + ⋯ + (1/n)Hₙ.

Hence, U(H″) = ∑_{i=1}^n U(Hᵢ). Since

    ∫ U(H″(r)) dQ(r) = ∑_{i=1}^n U(Lᵢ)Q(Bᵢ),

we complete the proof by showing that U(Hᵢ) = U(Lᵢ)Q(Bᵢ) for each i.
Since 0 ≤ U(Lᵢ) ≤ 1, we know that Lᵢ ∼ U(Lᵢ)L* + [1 − U(Lᵢ)]L_*. By Axiom 4, we can substitute the right-hand side of this expression for Lᵢ in the definition of Hᵢ and conclude that Hᵢ ∼ Hᵢ′, where

    Hᵢ′(r) = U(Lᵢ)L* + [1 − U(Lᵢ)]L_* if r ∈ Bᵢ,  L_* if not.

Hence, U(Hᵢ) = U(Hᵢ′). For each i, define the horse lottery H_{Bᵢ} as in Lemma 3.134. So, Hᵢ′ = U(Lᵢ)H_{Bᵢ} + [1 − U(Lᵢ)]L_*. It follows that U(Hᵢ′) = U(Lᵢ)Q(Bᵢ), as desired. □

Lemma 3.139.³³ Assume the conditions of Theorem 3.108. Let H be an arbitrary horse lottery. Then U(H) = ∫ U(H(r)) dQ(r).

³²In the finite case, c₁ = 1, c₂ = 0, and H″ = H.
³³This lemma is not needed in the finite case.

PROOF. First, suppose that U(H(r)) ≥ 0 for all r. Let H″ = ½H + ½L*. Since U(H″) = ½U(H) + ½ and ∫ U(H″(r)) dQ(r) = ½ ∫ U(H(r)) dQ(r) + ½, it suffices to prove the result for H″. Let b₁ = sup_r U(H″(r)). It follows from Lemma 3.130 that for all x ≤ b₁ there exists an NM-lottery L with U(L) = x.
For each n and k = 0, 1, ..., n2ⁿ, define

    B_{n,k} = {r : (k−1)/2ⁿ < U(H″(r)) ≤ k/2ⁿ}.

Define the horse lotteries Hₙ for each n by

    Hₙ(r) = L_{n,k} for all r ∈ B_{n,k}, k = 0, 1, ..., n2ⁿ,

where the L_{n,k} are chosen (see Lemma 3.130) so that U(L_{n,k}) = min{b₁, (k−1)/2ⁿ} for k ≥ 1 and L_{n,0} = L_*. It follows from Axiom 5 that L_* ⪯ L_{n,k} ⪯ H″(r) for all n, k and all r ∈ B_{n,k}, hence 0 ≤ U(Hₙ(r)) ≤ U(H″(r)) for all r, n. Since U(Hₙ(r)) converges to U(H″(r)) for all r, the monotone convergence theorem A.52 implies

    lim_{n→∞} ∫ U(Hₙ(r)) dQ(r) = ∫ U(H″(r)) dQ(r).    (3.140)

Lemma 3.137 says that the integrals on the left-hand side of (3.140) are U(Hₙ). Since Axiom 5 says that U(Hₙ) ≤ U(H″) for all n,

    ∫ U(H″(r)) dQ(r) ≤ U(H″).

Since U is bounded above, we can choose M_{n,k} to be NM-lotteries such that U(M_{n,k}) = min{b₁, k/2ⁿ}. Just as above, let Hₙ(r) = M_{n,k} for r ∈ B_{n,k}, so that U(H″(r)) ≤ U(Hₙ(r)) ≤ b₁ for all r, n. The dominated convergence theorem A.57 says that

    U(H″) ≤ lim_{n→∞} ∫ U(Hₙ(r)) dQ(r) = ∫ U(H″(r)) dQ(r).

It follows that ∫ U(H″(r)) dQ(r) = U(H″), and the result is proven when U(H(r)) ≥ 0 for all r.
A similar argument works if U(H(r)) ≤ 0 for all r. For arbitrary H, let H⁺(r) = H(r) if U(H(r)) ≥ 0 and H⁺(r) = L_* otherwise. Let H⁻(r) = H(r) if U(H(r)) < 0 and H⁻(r) = L_* otherwise. Then ½H⁺ + ½H⁻ = ½H + ½L_*, and U(H) = U(H⁺) + U(H⁻). The result now follows. □
The last two lemmas prove the essential uniqueness of U and the uniqueness of Q.

Lemma 3.141. Assume the conditions of Theorem 3.108. The utility U from Lemma 3.119 is unique up to positive affine transformation.
PROOF. Let U₁′ and U₂′ be two utilities. If preference is degenerate, then both U₁′ and U₂′ are constant and the result is trivial. So, suppose that there exist H_* and H* with H_* ≺ H*. For i = 1, 2, define Uᵢ(H) = [Uᵢ′(H) − Uᵢ′(H_*)]/[Uᵢ′(H*) − Uᵢ′(H_*)]. This makes Uᵢ(H_*) = 0 and Uᵢ(H*) = 1 for i = 1, 2 without affecting the other properties of each Uᵢ. Now, suppose that there exists H such that U₁(H) ≠ U₂(H). Without loss of generality, assume U₁(H) < U₂(H). There are five cases to consider.³⁴
Case 1. 0 ≤ U₁(H) < U₂(H) ≤ 1. Let U₁(H) < α < U₂(H) and let H′ = αH* + (1−α)H_*. Then Uᵢ(H′) = α for i = 1, 2. Now U₁(H) < U₁(H′), meaning H ≺ H′, and U₂(H′) < U₂(H), meaning H′ ≺ H, a contradiction.
Case 2. U₁(H) < U₂(H) < 0. Let U₁(H) < c < U₂(H), and define H′ = [c/(c−1)]H* + [(−1)/(c−1)]H. Then U₁(H′) < 0, so H′ ≺ H_*, and U₂(H′) > 0, so H_* ≺ H′, a contradiction.
Case 3. 1 < U₁(H) < U₂(H). Let U₁(H) < c < U₂(H), and define H′ = (1/c)H + [(c−1)/c]H_*. Then U₁(H′) < 1, so H′ ≺ H*, and U₂(H′) > 1, so H* ≺ H′, a contradiction.
Case 4. U₁(H) < 0 < U₂(H). Then H ≺ H_* ≺ H, a contradiction.
Case 5. U₁(H) < 1 < U₂(H). Then H ≺ H* ≺ H, a contradiction.
Finally, note that if U₁ = U₂, then U₁′ and U₂′ are positive affine transformations of each other. □

Lemma 3.142. Under the conditions of Theorem 3.108, the probability Q is unique.
PROOF. Lemma 3.141 shows that the utility is unique up to positive affine transformation, so suppose that there are two different probabilities Q₁ and Q₂ such that for both i = 1 and i = 2, H₁ ⪯ H₂ if and only if ∫ U(H₁(r)) dQᵢ(r) ≤ ∫ U(H₂(r)) dQᵢ(r). Pick two NM-lotteries L* and L_* such that L_* ≺ L*. Let B be an arbitrary subset of R and define H_B as in (3.135). It follows that U(H_B) = Qᵢ(B) for i = 1, 2, so that Q₁(B) = Q₂(B). □

Since NM-lotteries are concentrated on only finitely many prizes, the following is a simple consequence of (3.109).

Corollary 3.143. Under the conditions of Theorem 3.108, if L = α₁P₁ + ⋯ + αₖPₖ, then U(L) = ∑_{i=1}^k αᵢU(Pᵢ).
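Corollary 3.143 says that the utility of an NM-lottery is just the α-weighted average of the prize utilities. A minimal numerical illustration (the prize utilities below are made up for the example):

```python
# U(L) = sum_i alpha_i * U(p_i) for an NM-lottery L = alpha_1 p_1 + ... + alpha_k p_k.
def lottery_utility(alphas, prize_utils):
    assert abs(sum(alphas) - 1.0) < 1e-12 and all(a >= 0 for a in alphas)
    return sum(a * u for a, u in zip(alphas, prize_utils))

# Example: three prizes with hypothetical utilities 0, 0.5, 1.
u = lottery_utility([0.2, 0.3, 0.5], [0.0, 0.5, 1.0])
assert abs(u - 0.65) < 1e-12
```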
The proof of Theorem 3.110 requires a lemma first.

Lemma 3.144. Under the conditions of Theorem 3.110, if L₁, L₂ are NM-lotteries such that L₁ ⪯ (≺) L₂, then there exists a set B such that X⁻¹(B) is null and, for all x ∉ B, L₁ ⪯ₓ (≺ₓ) L₂.
PROOF. Let L₁ ⪯ L₂, and let B = {x : L₂ ≺ₓ L₁}. Define H₁(r) = L₂ for all r ∈ X⁻¹(B) and H₁(r) = L₁ for all r ∉ X⁻¹(B). Then Axiom 6 says that H₁ ≺ L₁ if X⁻¹(B) is nonnull, but Axiom 4 says that L₁ ⪯ H₁. It follows that X⁻¹(B) must be null. A similar proof works if L₁ ≺ L₂. □

³⁴Only case 1 is needed in the finite case.
Before giving the proof of Theorem 3.110, we give a brief outline. We use
Theorem 3.108 to represent conditional preference by expected utility sepa-
rately for each value of x. We then use Lemma 3.144 to show that the utility
function in the conditional preference representation must equal the utility
function for unconditional preference except on a null set. We prove that
the probability measure for the conditional preference representation must
equal conditional probability calculated from the unconditional preference
by showing that if it were not, we could construct a pair of horse lotteries
that are conditionally ordered one way for all x, but that are marginally
ordered the opposite way, contradicting Axiom 6.
PROOF OF THEOREM 3.110. Let L_* and L* be as in Lemma 3.133. According to Theorem 3.108, since ⪯ₓ satisfies Axioms 1–5 for each x, there is, for each x, a probability Pₓ on (R, 𝒜₁) and a utility Uₓ such that H₁ ⪯ₓ H₂ if and only if ∫ Uₓ(H₁(r)) dPₓ(r) ≤ ∫ Uₓ(H₂(r)) dPₓ(r). Let Q_X denote the distribution of X induced from Q. Lemma 3.144 says that each pair of NM-lotteries is ranked the same by ⪯ₓ except possibly for x in a set with null inverse image. Since Lemma 3.136 says that a set C is null if and only if Q(C) = 0, we can assume that there is B₀ such that Q(X⁻¹(B₀)) = 0 and Uₓ(L_*) < Uₓ(L*), for all x ∉ B₀. We can certainly assume that Uₓ(L_*) = 0 = U(L_*) and Uₓ(L*) = 1 = U(L*) for all x ∉ B₀. Let B₋ be the set of all x such that there exists Lₓ with Uₓ(Lₓ) < U(Lₓ), and let B₊ be the set of all x such that there exists Lₓ with Uₓ(Lₓ) > U(Lₓ). We will show next that for each x ∈ B₊ ∪ B₋, we could choose Lₓ so that U(Lₓ) = ½. For each x ∈ B₊ ∪ B₋, let

    Lₓ′ = [1/(b₁(1 − b₀))] Lₓ + [(b₁ − 1)/(b₁(1 − b₀))] L_* + [−b₀/(1 − b₀)] L*,

where b₀ = min{0, U(Lₓ)} and b₁ = max{1, U(Lₓ)}. Then Uₓ(Lₓ′) ≠ U(Lₓ′), but now 0 ≤ U(Lₓ′) ≤ 1. By mixing Lₓ′ with either L_* or L* to create Lₓ″, we can have U(Lₓ″) = ½ and either Uₓ(Lₓ″) > ½ or Uₓ(Lₓ″) < ½. That is, we can assume that U(Lₓ) = ½. Let L_{1/2} = ½L* + ½L_*. Define horse lotteries H₊ and H₋ by

    H₊(r) = Lₓ if X(r) = x and x ∈ B₊,  L_{1/2} otherwise,
    H₋(r) = Lₓ if X(r) = x and x ∈ B₋,  L_{1/2} otherwise.

Since {r : H₊(r) ⪯ L} is either R or ∅ depending on whether or not U(L) ≥ ½, H₊ is a horse lottery, and similarly for H₋. By construction H₋ ⪯ₓ L_{1/2} for all x and H₋ ≺ₓ L_{1/2} for all x ∈ B₋. Also, by the measurability condition in the definition of conditional preference, B₋ = {x : L_{1/2} ⪯ₓ H₋}ᶜ, so B₋ ∈ ℬ. It follows from Axiom 6 that X⁻¹(B₋) is null. By a similar argument, we can show that X⁻¹(B₊) is null. Let B′ equal B₀ ∪ B₋ ∪ B₊. Then X⁻¹(B′) is null and, for all x ∉ B′, Uₓ(L) = U(L) for all L ∈ 𝓛.
Next,³⁵ we prove that Pₓ(A) is a measurable function of x for all A ∈ 𝒜₁. For each A ∈ 𝒜₁, let H_A(r) = L* if r ∈ A and H_A(r) = L_* if r ∉ A. For each c ∈ [0,1],

    {x : Pₓ(A) ≤ c} = {x : H_A ⪯ₓ cL* + (1−c)L_*} ∈ ℬ

follows from the measurability assumption on conditional preference.
Finally, we prove that Q(·|x) and Pₓ(·) agree almost surely. Let D = {x : Pₓ(·) ≠ Q(·|x)} and B = B′ ∪ D. If we can prove that D ∈ ℬ and Q_X(D) = 0, the proof is complete.³⁶ Since (R, 𝒜₁) is a Borel space, 𝒜₁ is countably generated (Proposition B.43). Let {Aₙ}_{n=1}^∞ generate 𝒜₁. Then

    D = ∪_{n=1}^∞ ({x : Pₓ(Aₙ) < Q(Aₙ|x)} ∪ {x : Pₓ(Aₙ) > Q(Aₙ|x)}).    (3.145)

Since both Pₓ(Aₙ) and Q(Aₙ|x) are measurable functions, each of the sets in the union is in ℬ, so D ∈ ℬ.
If Q_X(D) > 0, then one of the sets in the union (3.145) must have strictly positive Q_X measure. Let D′ = {x : Pₓ(A₁) < Q(A₁|x)}, and suppose that Q_X(D′) > 0. For each rational q ∈ [0,1], let D_q = {x : Pₓ(A₁) < q < Q(A₁|x)}. Then D′ is the union of all the D_q. Since this union is countable, there exists q such that Q_X(D_q) > 0. Define H₁(r) = L* for r ∈ A₁ ∩ X⁻¹(D_q) and H₁(r) = L_* otherwise. Also, define H₂(r) = L* for r ∈ A₁ and H₂(r) = L_* otherwise. Then Uₓ(H₁) = Uₓ(H₂) according to the definition of conditional preference because H₁(r) = H₂(r) for all r ∈ X⁻¹(D_q). But Pₓ(A₁) = Uₓ(H₁) by the uniqueness of probability and Lemma 3.134. Define H₃(r) = qL* + (1−q)L_* for all r ∈ X⁻¹(D_q) and H₃(r) = L_* otherwise. Then the definition of conditional preference implies that Uₓ(H₃) = Uₓ(qL* + (1−q)L_*) = q. Since Uₓ(H₃) = q > Uₓ(H₁) for all x ∈ D_q, we have H₁ ≺ₓ H₃ for all x ∈ D_q. But H₁ ⪯_y H₃ for all y ∉ D_q, since H₁(r) = H₃(r) for all r ∉ X⁻¹(D_q). It follows from Axiom 6 that H₁ ≺ H₃. Now, note the following contradiction:

    U(H₃) = qQ_X(D_q) < ∫_{D_q} Q(A₁|x) dQ_X(x) = Q({X ∈ D_q} ∩ A₁) = U(H₁),

where the first and last equalities follow from Lemma 3.137, the inequality follows from the definition of D_q, and the other equality follows from the definition of conditional probability. □

³⁵This paragraph is not needed in the finite case.
³⁶Since D ∈ ℬ is obvious in the finite case, the rest of this paragraph is not needed in the finite case.

3.3.6 State-Dependent Utility*

We mentioned earlier that Axiom 4 may not be reasonable to assume. It may be the case that when the state of Nature changes, the relative values of various prizes also change. For example, if the states of Nature involve different exchange rates between two currencies, then the relative values of fixed amounts of the two currencies will change according to the state of Nature. For this reason, we prove a theorem that does not assume Axiom 4. If Axiom 4 fails, then Axiom 5 may not even be desirable, as the next example shows.
Example 3.146. In this example, we will have the relative values of the prizes change drastically from one state to the next. Let R = {r₁, r₂} and P = {p₁, p₂}. Let Uᵢ(pᵢ) = 1 and Uᵢ(p₃₋ᵢ) = 0 for i = 1, 2. For NM-lottery L = αp₁ + (1−α)p₂, define Uᵢ(L) = αUᵢ(p₁) + (1−α)Uᵢ(p₂). For horse lottery H = (L₁, L₂), define U(H) = 0.4U₁(L₁) + 0.6U₂(L₂). Consider the following two horse lotteries H₁ = (L_{1,1}, L_{1,2}) and H₂ = (L_{2,1}, L_{2,2}), where

    L_{1,1} = p₂,  L_{1,2} = p₂,
    L_{2,1} = p₁,  L_{2,2} = ½p₁ + ½p₂.

One can easily calculate U(L_{1,1}) = U(L_{1,2}) = 0.6, while U(L_{2,1}) = 0.4 and U(L_{2,2}) = ½. So H₂(rᵢ) ≺ H₁(rᵢ) for i = 1, 2, but U(H₁) = 0.6 and U(H₂) = 0.7, thus H₁ ≺ H₂. Even though each of the NM-lotteries awarded by H₁ is marginally preferred to the corresponding NM-lottery awarded by H₂, U₁(H₂(r₁)) is sufficiently higher than U₁(H₁(r₁)) to make up for the fact that U₂(H₂(r₂)) is a little lower than U₂(H₁(r₂)).
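The arithmetic in Example 3.146 is easy to verify numerically. In the sketch below (an illustrative encoding, not from the text), a lottery αp₁ + (1−α)p₂ is represented by its weight α on p₁:

```python
# State-dependent utilities from Example 3.146: U_i(p_i) = 1, U_i(p_{3-i}) = 0.
U1 = lambda a: a          # U_1 of the lottery a*p1 + (1-a)*p2
U2 = lambda a: 1.0 - a    # U_2 of the same lottery

def U(H):                 # marginal utility of a horse lottery H = (L1, L2)
    L1, L2 = H
    return 0.4 * U1(L1) + 0.6 * U2(L2)

H1 = (0.0, 0.0)           # (p2, p2)
H2 = (1.0, 0.5)           # (p1, (1/2)p1 + (1/2)p2)

# Marginal utilities of the constant horse lotteries built from each award:
assert abs(U((0.0, 0.0)) - 0.6) < 1e-12   # U(L_{1,1}) = U(L_{1,2}) = 0.6
assert abs(U((1.0, 1.0)) - 0.4) < 1e-12   # U(L_{2,1}) = 0.4
assert abs(U((0.5, 0.5)) - 0.5) < 1e-12   # U(L_{2,2}) = 1/2

# Statewise, H2's award is worse; marginally, H2 is strictly better.
assert abs(U(H1) - 0.6) < 1e-12 and abs(U(H2) - 0.7) < 1e-12
```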
The functions Uᵢ in Example 3.146 are called a state-dependent utility. Since we will have to abandon Axiom 5 (at least in its current form) in order to abandon Axiom 4, and since some version of dominance is essential for the infinite case, we will only deal with the case in which R and P are finite in this section.³⁷
Theorem 3.147. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117. Assume that preference is nondegenerate. Then, there exist a probability Q = (q₁, ..., qₙ) over R = {r₁, ..., rₙ} and a state-dependent utility function (U₁, ..., Uₙ) such that for every H₁ = (L_{1,1}, ..., L_{1,n}) and H₂ = (L_{2,1}, ..., L_{2,n}), H₁ ⪯ H₂ if and only if

    ∑_{i=1}^n Uᵢ(L_{1,i})qᵢ ≤ ∑_{i=1}^n Uᵢ(L_{2,i})qᵢ.

The Uᵢ functions are unique up to positive affine transformation (one for each i). The only property of Q determined by the preferences is that nonnull states must have positive probability.

*This section may be skipped without interrupting the flow of ideas.
³⁷One possible approach to dealing with the infinite case would be to assume the existence of a conditional preference relation {⪯ᵣ : r ∈ R} that satisfied Axiom 6. The type of dominance that we need in the state-dependent case is built into Axiom 6.
The reader will note that Theorem 3.147 makes no claim of uniqueness for Q. It is easy to see why not. Suppose that (q₁, ..., qₙ) is a probability over the states and (U₁, ..., Uₙ) is a state-dependent utility. Let (t₁, ..., tₙ) be another probability such that tᵢ = 0 if and only if qᵢ = 0. For each i such that tᵢ > 0, define Vᵢ = qᵢUᵢ/tᵢ. If tᵢ = 0, set Vᵢ = Uᵢ. Then

    ∑_{i=1}^n qᵢUᵢ(Lᵢ) = ∑_{i=1}^n tᵢVᵢ(Lᵢ)

for all (L₁, ..., Lₙ). If (q₁, ..., qₙ) and U are as guaranteed by Theorem 3.108 and (t₁, ..., tₙ) is as above, then Vᵢ = qᵢU/tᵢ will satisfy

    ∑_{i=1}^n qᵢU(Lᵢ) = ∑_{i=1}^n tᵢVᵢ(Lᵢ).

This same construction can be applied whether or not Axiom 4 holds. What Axiom 4 achieves is the ability to identify a unique probability and state-independent utility. It does not preclude the existence of alternative state-dependent representations of preference.
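The reweighting that destroys uniqueness of Q is easy to check numerically (the probabilities and utility values below are arbitrary illustrative numbers):

```python
# A state-dependent representation (q, U) and a reweighted one (t, V) with
# V_i = q_i * U_i / t_i give identical expected utilities for every act.
q = [0.2, 0.5, 0.3]                        # original probability over states
t = [0.4, 0.4, 0.2]                        # another probability, same support
U = [[0.0, 1.0], [0.3, 0.9], [1.0, 0.2]]   # U_i(L) for two candidate lotteries

V = [[q[i] / t[i] * u for u in U[i]] for i in range(3)]   # all t_i > 0 here

for L in range(2):                 # each column: one assignment of lotteries
    eu_q = sum(q[i] * U[i][L] for i in range(3))
    eu_t = sum(t[i] * V[i][L] for i in range(3))
    assert abs(eu_q - eu_t) < 1e-12
```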
The proof of Theorem 3.147 resembles those parts of the proof of Theorem 3.108 that were relevant for the finite case. The first thing we do is define the state-dependent utility by means of called-off comparisons. Then we define a particular Q that makes U(H) = ∑_{i=1}^n qᵢUᵢ(H(rᵢ)). We need a lemma first.
Lemma 3.148. Assume Axioms 1 and 2. For each state rⱼ, each pair (L₁, L₂) of NM-lotteries, and each four horse lotteries H₁, H₂, H₃, H₄ satisfying the following conditions:

    the preference between H₁ and H₂ is called-off when {rⱼ}ᶜ occurs,
    the preference between H₃ and H₄ is called-off when {rⱼ}ᶜ occurs, and
    H₁(rⱼ) = H₃(rⱼ) = L₁, H₂(rⱼ) = H₄(rⱼ) = L₂,

we have H₁ ⪯ H₂ if and only if H₃ ⪯ H₄.
PROOF. First, note that ½H₁ + ½H₄ = ½H₂ + ½H₃. Use Lemma 3.114 to see that H₁ ⪯ H₂ implies ½H₁ + ½H₃ ⪯ ½H₁ + ½H₄, which implies H₃ ⪯ H₄ by Axiom 2. Similarly, H₃ ⪯ H₄ implies H₁ ⪯ H₂. □
PROOF OF THEOREM 3.147. For each state rj and each (n - 1)-tuple of
NM-lotteries (L l , . . , L j - 1 , L j +1,' .. , Ln), consider the set of horse lotteries
of the form
3.3. Axiomatic Derivation of Decision Theory 207

where L is an arbitrary element of C. According to Lemma 3.148, the


ranking of these horse lotteries will be the same no matter what one chooses
for the LiS. Hence, we can treat the set of these horse lotteries as the
entire set of interest and apply Lemma 3.119 to obtain a utility function
Uj : L -+ [0,1] satisfying

(Ll, ... ,Lj - 1 , L~,Li+l, ... ,Ln) ~ (L 1 , .. , L j _l, L~, Lj +1,'" ,Ln)
if and only if Uj(LD ::; Uj(L;), no matter what one chooses for the LiS.
For each j such that r_j is nonnull, there are prizes p*_j and p_{*j} such that U_j(p*_j) = 1 and U_j(p_{*j}) = 0. (If r_j is null, U_j can be arbitrary, since there are no preferences among the horse lotteries.) It is easy to see that the best and worst horse lotteries are, respectively,

H* = (p*_1, ..., p*_n),  H_* = (p_{*1}, ..., p_{*n}).

Now, set up the following horse lotteries:

H*_i(r_j) = { p*_j    if j = i,
              p_{*j}  if j ≠ i.

Define q_i = U(H*_i), where U is constructed by Lemma 3.119 based on all of 𝓗 with U(H*) = 1 and U(H_*) = 0. Clearly, q_i ≥ 0 for all i. To see that ∑_{i=1}^n q_i = 1, note that the equal mixture of all the H*_i's is

(1/n)H*_1 + ... + (1/n)H*_n = (1/n)H* + ((n − 1)/n)H_*.

Evaluating U at both sides of this expression gives (1/n)∑_{i=1}^n q_i = 1/n.
Finally, we prove that if H = (L_1, ..., L_n), then U(H) = ∑_{i=1}^n q_i U_i(L_i). This will complete the proof. Construct n horse lotteries

H_i(r_j) = { L_i     if j = i,
             p_{*j}  if j ≠ i.

By taking an equal mixture of all n of these, we get

(1/n)H + ((n − 1)/n)H_* = (1/n)H_1 + ... + (1/n)H_n.

Evaluating U at both sides of this gives U(H)/n = (1/n)∑_{i=1}^n U(H_i). So, we need only prove that U(H_i) = q_i U_i(L_i) for each i. From the definition of U_i, we see that H_i ∼ H′_i, where

H′_i(r_j) = { U_i(L_i)p*_i + (1 − U_i(L_i))p_{*i}  if j = i,
              p_{*j}                                if j ≠ i.

Since H′_i = U_i(L_i)H*_i + (1 − U_i(L_i))H_*,

U(H_i) = U(H′_i) = (1 − U_i(L_i))U(H_*) + U_i(L_i)U(H*_i) = q_i U_i(L_i). □
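The representation just proved, U(H) = ∑_{i=1}^n q_i U_i(H(r_i)), is simply a probability-weighted sum of state-dependent utilities. A minimal numerical sketch (the states, the probabilities q_i, and the utility values are invented for illustration; they are not from the text):

```python
# Hypothetical numbers illustrating U(H) = sum_i q_i * U_i(H(r_i)).
# Three states r_1, r_2, r_3 with probabilities q_i (nonnegative, summing to 1).
q = [0.5, 0.3, 0.2]

# State-dependent utilities U_i(L_i) of the NM-lottery that H awards in state i,
# each scaled to [0, 1] as in the proof (U_i(p*_i) = 1, U_i(p_{*i}) = 0).
u = [0.9, 0.4, 0.7]

# The utility of the horse lottery H is the q-weighted sum of state utilities.
U_H = sum(qi * ui for qi, ui in zip(q, u))
print(U_H)  # approximately 0.71
```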

3.4 Problems
Section 3.1:

1.* Consider the rule δ in Example 3.13 on page 148.


(a) Find a formula for the risk function.
(b) Find a formula for the Bayes risk with respect to Lebesgue measure.
(c) Prove that the Bayes risk is strictly less than 1/2 for all even n.
(d) Find the exact value of the Bayes risk if n = 2.
2. Prove Proposition 3.16 on page 150.
3. Two firms are planning to make competing secret bids on the price at which
they will supply a computer system to a government agency. The firm with
the lower bid will get the job. (No cost overruns will be allowed.) One firm
believes that its actual cost of supplying the system is sure to be c and it
has a prior distribution on the bid Θ of the other firm. Let h be the cost
of preparing and submitting the bid. For each situation below, find the bid
this firm should make to maximize expected profit:
(a) h = 0 and f_Θ(θ) = exp(−θ/μ)/μ, for θ > 0 and μ known.
(b) h = 0 and f_Θ is arbitrary.
(c) h > 0 and f_Θ(θ) = exp(−θ/μ)/μ, for θ > 0 and μ known.
(d) h > 0 and f_Θ is arbitrary.
4. An actuary wants to estimate the mean number of claims for industrial
injuries in a newly opened factory, in order to determine the premium for
insurance. The actuary believes that, to a good approximation, the number
of claims in any year by any one person is Poisson with mean θ conditional
on a parameter Θ = θ. Different persons are assumed to be independent
given Θ. Past experience with similar factories gives a prior density for Θ:

f_Θ(θ) = m(mθ)^{r−1} e^{−mθ} / Γ(r),

for fixed known values of m and r > 1. After n person-years have elapsed
in this factory, s injuries are observed.
(a) Using f_Θ(θ) above, show that with a loss function

L(θ, d) = (d − θ)²/θ,

the best choice of the premium d is

d*(s) = (r + s − 1)/(m + n).

(b) Now, assume that n (the number of person-years) is fixed and treat
S (the number of injuries) as random (not yet observed). Find the
risk function for d*. Also find the Bayes risk for d* with respect to
the prior f_Θ and the posterior risk for d*(s) given S = s.

5. Let X_1, ..., X_n be IID Ber(θ) random variables conditional on Θ = θ.
Suppose that we have a loss function L(θ, a) = (θ − a)²/[θ²(1 − θ)⁵], where
the action space is N = [0, 1]. The prior distribution of Θ is Beta(α, β).
Find conditions on α and β such that both of the following are true:
• The formal Bayes rule exists and has finite posterior risk for all possible samples.
• The Bayes rule exists and has finite risk.
6. Suppose that the conditional density of X given Θ = θ is exp(−|x − θ|)/2
and that Θ has prior density exp(−|θ − ψ|)/2 for some number ψ. Let N = ℝ
and L(θ, a) = (θ − a)². Find the formal Bayes rule.
Section 3.2.1:

7. Suppose that P_θ says that X_1, ..., X_n are IID N(θ, 1). Let δ_0(X) be the
median of the sample, and let T = δ_1(X) be the sample average. Find a
randomized rule based on the mean, δ(T), which has the same risk function
as δ_0 no matter what the loss function is. (You may wish to solve
Problem 12 on page 663 or Problem 1 on page 138 first. You probably cannot
write a closed-form solution for the randomized rule. You may either
describe the probability distribution in words sufficiently precise to define
it or give an algorithm for actually performing a randomization that will
have the appropriate distribution.)
8. Let δ : X → N be a nonrandomized rule, and let T : X → 𝒯 be a sufficient
statistic. Let δ_1 be the rule constructed in Theorem 3.18 on page 151. Show
that for each t, the distribution δ_1(t)(·) on N is the probability measure
induced by δ from the conditional distribution of X given T = t.
9.* Let Ω = (0, ∞) × (0, ∞), X = ℝ³, and N = ℝ⁺. Suppose that P_θ says
X_1, X_2, X_3 are IID U(α, β), where θ = (α, β). Let

L(θ, a) = ((α + β)/2 − a)².

Let δ_0(X) = X̄.
(a) Find a two-dimensional sufficient statistic, T.
(b) Use the Rao-Blackwell theorem 3.22 to find a rule δ_1(T) whose risk
function is at least as good as that of δ_0(X).
(c) Find the risk functions R(θ, δ_0) and R(θ, δ_1), and show that there is
at least one θ such that R(θ, δ_1) < R(θ, δ_0).
10.* Let {X_n}_{n=1}^∞ be conditionally IID Ber(θ) given Θ = θ, and let X =
(X_1, ..., X_n). Let N = Ω = (0, 1) and L(θ, a) = (θ − a)². Let the prior
distribution Λ of Θ be U(0, 1). Let δ_0(x) be the sample median, that is,

δ_0(x) = { 0    if more than half of the observations are 0,
           1    if more than half of the observations are 1,
           1/2  if exactly half of the observations are 0.

(a) Find R(θ, δ_0) and r(Λ, δ_0).
(b) Let T = ∑_{i=1}^n X_i. Find the rule δ_1 guaranteed by the Rao-Blackwell
theorem 3.22.
Section 3.2.2:

11. In Example 3.25 on page 154, find the risk function for both δ and δ_1 and
show that δ_1 dominates δ.
12. Suppose that P_θ says that X ∼ Bin(n, θ). Let Ω = (0, 1) and N = [0, 1].
Let L(θ, a) = (θ − a)² and

δ(x) = {  with probability ½,
          with probability ½.

Find a nonrandomized rule that dominates δ.
13. Find an example of a decision problem with a decision rule δ_0 and a probability
λ on the parameter space such that δ_0 is λ-admissible but δ_0 is not
a Bayes rule with respect to λ.
14. Suppose that X ∼ Exp(1/θ) given Θ = θ. Let the action space be [0, ∞),
and let the loss function be L(θ, a) = (θ − a)².
(a) Prove that δ(x) = x is inadmissible.
(b) Find a nonconstant admissible rule.


15. Let X = (X_1, ..., X_n), where the X_i are conditionally IID N(μ, σ²) given
Θ = (μ, σ). Let c_1 > 0 and c_2 be constants. Let N = ℝ and L(θ, a) =
(μ − a)². Define δ(x) = (nx̄ + c_1c_2)/(n + c_1). Show that δ is admissible.
16.* Prove Proposition 3.47 on page 162.
17. Assume that L(θ, a) = (θ − a)² in each of the following questions.
(a) Suppose that X ∼ N(θ, 1) given Θ = θ. Show that for each constant
c, δ(x) ≡ c is admissible.
(b) Suppose that X ∼ U(0, θ) given Θ = θ. Show that for each constant
c, δ(x) ≡ c is inadmissible.
18. Let Ω = (0, 1), N = [0, 1], and L(θ, a) = (θ − a)². Suppose that P_θ says that
X ∼ Geo(θ), that is,

f_{X|Θ}(x|θ) = { θ(1 − θ)^x  for x = 0, 1, ...,
                 0            otherwise,

is the density of X with respect to counting measure given Θ = θ. Show
that δ(x) = x/(x + 1) is admissible.
19. Suppose that X ∼ N(θ, 1) given Θ = θ, and let Θ have an N(0, 1) prior.
Suppose that the parameter space and the action space are both (−∞, ∞).
Let L(θ, a) = 0 if a ≥ θ and L(θ, a) = 1 if a < θ.
(a) Show that there is no Bayes rule.
(b) Show that every decision rule is inadmissible.

(c) Show that if the action space is [−∞, ∞], then there is a Bayes rule
and that it is the only admissible rule.

Section 3.2.3:

20.* Prove that the modified James-Stein estimator δ_3(X) has smaller risk function
than δ(X) if n ≥ 4. (Hint: Let Γ be an orthogonal transformation with
first row proportional to 1^⊤, and let Z be the last n − 1 coordinates of ΓX.
What does Theorem 3.50 say about estimating ΓΘ by Γδ_3(X)?)
21.* Say that a function g : ℝ → ℝ is absolutely continuous if there exists a
function g′ such that for all x_1 < x_2, g(x_2) = g(x_1) + ∫_{x_1}^{x_2} g′(y) dy.³⁸
(a) Prove that the conclusion to Lemma 3.51 continues to hold if the
assumption that g is differentiable is replaced by the assumption that
g is absolutely continuous.
(b) Prove that the conclusion to Lemma 3.52 continues to hold if the
assumption that the coordinates of g are differentiable is replaced by
the assumption that g_i is absolutely continuous for every i.
22. Let g(x) = −x min{c, (n − 2)/∑_{i=1}^n x_i²} be a function from ℝⁿ to ℝⁿ. Let
δ*(x) = x + g(x).
(a) Using Problem 21 above, find all values of c > 0 such that δ*(x) has
smaller risk than δ(x) = x in the setting of Theorem 3.50.
(b) Prove that for c > (n − 2)/(n + 2), δ*(x) has smaller risk than δ_1(x)
in the setting of Theorem 3.50.

Section 3.2.4:

23. Prove Proposition 3.58 on page 167.


24. Let X ∼ Geo(θ) given Θ = θ. Let L(θ, a) = (θ − a)²/[θ(1 − θ)]. Prove that
δ(x) = I_{{0}}(x) is minimax.
25. In Example 3.72 (see page 170), let p_i = Pr(Θ = i) for i = 0, 1 be a prior
distribution. Prove that it is impossible for the Bayes risk of the minimax
rule to be simultaneously strictly less than the Bayes risks of both action
3 and action 1. This example shows how the minimax principle can be in
very serious conflict with the expected loss principle.

Section 3.2.5:

26. Prove Proposition 3.85 on page 174.

³⁸Such functions are called absolutely continuous because they have a property
similar to measures that are absolutely continuous with respect to Lebesgue
measure. In particular, if g is nondecreasing, then η((a, b]) = g(b) − g(a) defines
a measure that is absolutely continuous with respect to Lebesgue measure.

27. Suppose that P_0 says X ∼ U(0, 1) and P_1 says X ∼ U(0, 7), and that the
loss function is as in Theorem 3.87. Find all of the admissible rules under
the conditions of that theorem. Express each rule by saying which intervals
of X values lead to making each decision.
28. Suppose that an observation X is to be made and it is believed that X has
one of two densities:

f_0(x) = ½ exp(−|x|),   f_1(x) = (1/√(2π)) exp(−½x²).

Find all of the admissible procedures according to the Neyman-Pearson
fundamental lemma 3.87 (using the loss function stated there). Express
the rules in terms of intervals in which each decision is taken.
29. Prove the claim at the end of the proof of the Neyman-Pearson fundamental
lemma 3.87 that no element of C dominates any other element of C.
30. Prove Proposition 3.91 on page 178.
Section 3.3:

31. Suppose that there are k ≥ 2 horses in a race and that a gambler believes
that p_i is the probability that horse i will win (∑_{i=1}^k p_i = 1). Suppose that
the gambler has decided to wager an amount x to be divided among the
k horses. If he or she wagers x_i on horse i and that horse wins, the utility
of the gambler is log(c_i x_i), where c_1, ..., c_k are known positive numbers.
Find values x_1, ..., x_k to maximize expected utility.
32. Suppose that two agents have a common strictly increasing utility function
U for their fortunes in dollar amounts and that their current fortunes are
the same, x_0. (So, for example, the utility of receiving an additional x
dollars would be U(x_0 + x).)
(a) Let R be a random dollar amount that is strictly greater than −x_0.
If one of our agents contemplates selling R, what would be the lowest
price at which the agent would be willing to sell it? What would be
the highest price at which an agent who did not own R would be willing
to buy it?
(b) Suppose that one agent receives a gift consisting of a lottery ticket
that will pay T > 0 dollars with probability 1/2 and pays nothing with
probability 1/2 and that both agents agree on these probabilities.
Construct a utility function U having the property that, as soon as
an agent receives this gift, he or she is willing to sell it at some price
less than T/2 and the other agent is willing to buy it at that same
price.
33. Assume Axioms 1 and 2 and the Archimedean condition of Lemma 3.117.
Let R = {r_1, ..., r_n} and P = {p_1, ..., p_m}. Consider the set 𝓗′ of all horse
lotteries of the form (p_{i_1}, ..., p_{i_n}). (These are all horse lotteries whose NM-lotteries
assign probability 1 to a single prize.) Let H_*, H* ∈ 𝓗′ be such
that H_* ≼ H ≼ H* for all H ∈ 𝓗′. Prove that H_* ≼ H ≼ H* for all
H ∈ 𝓗.

34. Prove Proposition 3.118 on page 193. (Hint: You can use Theorem 3.147 if
you wish.)
35. Let γ_1 < γ_2 < 1, and suppose that H_1 and H_2 are horse lotteries such that

Assume Axioms 1-3 and prove that H_1 ∼ H_2.
36. Assume all of the axioms, including Axiom 6. Show that conditional preference
given R is the same as unconditional preference.
37. Prove the part of Lemma 3.134 that says that Q(B) = 0 if and only if B
is null.
38.* Assume Axioms 1-4. Let R be finite. Prove that H_1(r) ≼ H_2(r) for all
r ∈ R implies H_1 ≼ H_2. (Hint: Create a comparison between H_1 and
H_2 that is called-off when {r_1}^c occurs. Use induction on the number of
states.)
CHAPTER 4
Hypothesis Testing

4.1 Introduction
4.1.1 A Special Kind of Decision Problem
Recall the setup used at the beginning of Chapter 3. We had a probability
space (S, 𝒜, μ) and a function V : S → 𝒱. One example of V is the parameter
Θ. Other examples are measurable functions of Θ. Other V functions,
which are not functions of Θ, are possible but are rarely seen in classical
statistics. This is true to a greater extent in hypothesis testing for reasons
that will become more apparent once we study the criteria used for
selecting tests in classical statistics.
Definition 4.1. Suppose that we can partition 𝒱 into 𝒱 = 𝒱_H ∪ 𝒱_A, where
𝒱_H ∩ 𝒱_A = ∅. The statement that V ∈ 𝒱_H is a hypothesis and is labeled
H. The corresponding alternative is labeled A and is the statement that
V ∈ 𝒱_A. If V = Θ, we have Ω = Ω_H ∪ Ω_A with Ω_H ∩ Ω_A = ∅, and V ∈ 𝒱_H
if and only if Θ ∈ Ω_H. In this case, we write H : Θ ∈ Ω_H and A : Θ ∈ Ω_A.
A decision problem is called hypothesis testing if N = {0, 1} and L(v, a)
satisfies L(v, 1) > L(v, 0) for v ∈ 𝒱_H and L(v, 1) < L(v, 0) for v ∈ 𝒱_A.
The action a = 1 is called rejecting the hypothesis, and the action a = 0 is
called accepting the hypothesis.¹ If we reject H but H is true, we made a
type I error. If we accept H and it is false, we made a type II error.

¹Some authors prefer to call action a = 0 not rejecting the hypothesis.



A simple type of hypothesis testing loss function is

L(v, a) = { c_a  if v ∈ 𝒱_H,
            b_a  if v ∈ 𝒱_A,      (4.2)

where c_1 > c_0 and b_0 > b_1. It is easy to see (see Problem 1 on page 285)
that (4.2) is equivalent to a loss function of the same form with c_0 = b_1 = 0,
b_0 = 1, and c_1 = c > 0. Such a loss function is called a 0-1-c loss function.
If, in addition, c = 1, it is called a 0-1 loss function. More general loss
functions than the 0-1-c loss might often seem appropriate for the type of
problems in which hypothesis testing is used. For example, if the parameter
is real, the hypothesis is that Θ ≤ θ_0, and c > 0, an appropriate loss might
be

L(θ, a) = { θ − θ_0     if θ > θ_0, a = 0,      (4.3)
            (θ_0 − θ)c  if θ ≤ θ_0, a = 1,

with L(θ, a) = 0 otherwise.
This loss provides for penalties for choosing the wrong decision that are
commensurate with the inaccuracy of the decision. But this loss can be
written as |θ − θ_0| times the 0-1-c loss. By Proposition 3.47, so long as the
risk functions of all decision rules are continuous from the left (or all are
continuous from the right) at θ = θ_0, rules admissible under the 0-1-c loss
will be admissible under this loss. One could begin the study of hypothesis
testing by concentrating solely on which decision rules are admissible. For
this purpose, the 0-1 loss is sufficient. The focus of hypothesis testing,
however, is on finding tests that meet certain ad hoc criteria to be defined
later.
A randomized decision rule δ in a hypothesis testing problem can be
described by its test function, which is the measurable function φ : X →
[0, 1] given by

φ(x) = δ(x)({1}) = Pr(choose a = 1 | X = x).

One should think of a randomized test φ as follows. First, observe X =
x, and then flip a coin with probability of heads equal to φ(x). If the
coin comes up heads, reject the hypothesis. Because of this interpretation,
randomized tests are seldom used in practice.
Definition 4.4. Suppose that V = Θ. The power function of a test φ is
β_φ(θ) = E_θ φ(X). The operating characteristic curve is ρ_φ = 1 − β_φ. The size
of φ is sup_{θ∈Ω_H} β_φ(θ). A test is called level α, for some number 0 ≤ α ≤ 1,
if its size is at most α. A hypothesis is simple if Ω_H is a singleton. Similarly,
the alternative is simple if Ω_A is a singleton. The hypothesis (alternative)
is composite if it is not simple. For symmetry, we also define the base of
the test to be inf_{θ∈Ω_A} β_φ(θ). A test is said to have floor γ if the base is at
least γ.
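To make Definition 4.4 concrete, here is a small numerical sketch (the model, cutoff, and hypothesis are my own choices, not the text's): for X ∼ Bin(n, θ) and the nonrandomized test that rejects when X ≥ k, the power function is β_φ(θ) = P_θ(X ≥ k), and for H : θ ≤ 1/2 the size is the supremum of β_φ over Ω_H.

```python
import math

# Power function of the test phi(x) = 1{x >= k} for X ~ Bin(n, theta).
def power(theta, n=10, k=8):
    """beta_phi(theta) = E_theta[phi(X)] = P_theta(X >= k)."""
    return sum(math.comb(n, x) * theta**x * (1 - theta)**(n - x)
               for x in range(k, n + 1))

# For H : theta <= 1/2, beta_phi is increasing in theta, so the size
# sup_{theta in Omega_H} beta_phi(theta) is attained at theta = 1/2.
size = power(0.5)
print(round(size, 4))  # 0.0547; the test then has level alpha for any alpha >= 0.0547
```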
The definitions of power function, size, level, and operating characteristic
are all standard in classical theory, but the definitions of base and floor are

not. Some elaboration is in order. There is a duality between hypotheses
and alternatives which is not respected in most of the classical hypothesis-testing
literature. The definitions of base and floor are introduced to complete
the duality among the concepts usually defined. For example, suppose
that we decide to switch the names of alternative and hypothesis, so
that Ω_H becomes Ω_A, and vice versa. Then we can switch tests from φ to
ψ = 1 − φ, and the "actions" accept and reject become switched. The power
function of φ is the operating characteristic of ψ, and vice versa. The size
of φ is one minus the base of ψ, and vice versa. The test φ has level α if and
only if ψ has floor 1 − α. The classical optimality criteria for tests do not
respect this duality. That is, a test φ may satisfy the appropriate classical
optimality criterion for a specified hypothesis-alternative pair, but when
the names of hypothesis and alternative are switched and the same optimality
criterion is appropriate, 1 − φ does not satisfy the same optimality
criterion. (See Problem 31 on page 289 for an example.) For this reason,
when appropriate, we will introduce new optimality criteria that are dual
to the existing ones.
It is easy to see that the risk function for a hypothesis-testing problem
is closely related to the power function. If the loss function is 0-1-c, then
the risk function is

R(θ, φ) = { cβ_φ(θ)      if θ ∈ Ω_H,      (4.5)
            1 − β_φ(θ)   if θ ∈ Ω_A.

Now suppose that we let Ω′_H = Ω_A and Ω′_A = Ω_H, so that hypothesis
and alternative are switched. Also, switch the names of the actions, set
ψ = 1 − φ, and let the loss be c times the 0-1-1/c loss function. Then the
risk function of ψ in this new problem is

R′(θ, ψ) = { β_ψ(θ)          if θ ∈ Ω′_H,
             c(1 − β_ψ(θ))   if θ ∈ Ω′_A,

which is easily seen to equal R(θ, φ). So, the risk function respects the
duality between hypotheses and alternatives, as will considerations of admissibility.
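The duality claim can be checked numerically. A minimal sketch (the binomial test and all numbers are my own, chosen only for illustration): with ψ = 1 − φ, hypothesis and alternative switched, and loss c times the 0-1-1/c loss, the new risk R′(θ, ψ) reproduces R(θ, φ) from (4.5).

```python
import math

# Power of the illustrative test "reject when X >= 8" for X ~ Bin(10, theta).
def beta_phi(theta, n=10, k=8):
    return sum(math.comb(n, x) * theta**x * (1 - theta)**(n - x)
               for x in range(k, n + 1))

c = 3.0
for theta, theta_in_H in [(0.3, True), (0.7, False)]:
    b = beta_phi(theta)
    R = c * b if theta_in_H else 1 - b     # risk of phi under 0-1-c loss, (4.5)
    b_psi = 1 - b                          # power of the dual test psi = 1 - phi
    in_new_H = not theta_in_H              # Omega'_H = Omega_A, Omega'_A = Omega_H
    R_new = b_psi if in_new_H else c * (1 - b_psi)   # c times the 0-1-(1/c) loss
    assert abs(R - R_new) < 1e-12
print("R'(theta, psi) matches R(theta, phi) at the tested points")
```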

4.1.2 Pure Significance Tests


A simpler framework for hypothesis testing dates back at least to Pearson
(1900). In this simpler framework, one need only explicitly state the hypothesis
(call it H as before), which is either a single distribution for the
data or a class of distributions. One then creates a weak order ≼ on the
sample space X, where x ≼ y is intended to mean that y is more at odds
with H than x is.²

²See Definition 3.99 on page 183. Basically, the binary relation ≼ must be
reflexive and transitive, and all pairs of data values must be compared.

Example 4.6. Let H state that X ∼ N(0, 1). We can say that x ≼ y if |x| ≤ |y|.

We are quite free to define ≼ however we wish, so long as it is a weak order.

Example 4.7. Let H state that X ∼ N(0, 1). We could define x ≼ y by |x| ≥ |y|.

The most common way to define ≼ is in terms of a statistic T : X → ℝ.
We would say that x ≼ y if and only if T(x) ≤ T(y). In Example 4.6,
T(x) = |x|. A pure significance test is obtained by calculating the significance
probability P_H(x) (Definition 4.8) and rejecting H if P_H(x) is too
small.
Definition 4.8. Let the hypothesis H be a set of distributions on (X, ℬ).
Suppose that the quantity P_Q(x) = Q({y : x ≼ y}) is the same (or approximately
the same) for all Q in H. Then the common value P_H(x) is called
the significance probability of the data x relative to the weak order ≼, and
the test that rejects H when P_H(x) is small is called a pure significance
test.
Example 4.9 (Continuation of Example 4.6). It is easy to see that

P_H(x) = ∫_{−∞}^{−|x|} φ(y) dy + ∫_{|x|}^{∞} φ(y) dy = 2Φ(−|x|),

where φ and Φ are the standard normal density and CDF, respectively. This pure
significance test would be the same as the usual test of the hypothesis that the
mean of a normal distribution with variance 1 is 0 versus the alternative that the
mean is not 0.
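The significance probability of Example 4.9 is a one-line computation; the following sketch uses only the Python standard library and is not from the text:

```python
from statistics import NormalDist

# P_H(x) = 2 * Phi(-|x|), with Phi the standard normal CDF.
def p_H(x):
    return 2 * NormalDist().cdf(-abs(x))

print(round(p_H(1.96), 4))  # 0.05, the familiar two-sided p-value
```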
For the case of Example 4.7, we have

P_H(x) = ∫_{−|x|}^{|x|} φ(y) dy = 2Φ(|x|) − 1.

This test would lead to rejecting H if the data are too consistent with H. This
is similar to what Fisher (1936) did when considering how closely the data of
Mendel (1866) matched a theory that Fisher later showed to be inaccurate.
Example 4.10. Suppose that H is the set of distributions that say that X =
(X_1, ..., X_n) are conditionally IID with N(0, σ²) distribution given Σ = σ. Let
T(x) be the usual t statistic for testing the hypothesis that the mean of a normal
sample is 0, namely T(x) = √n |x̄|/s, where x̄ = ∑_{i=1}^n x_i/n and s² = ∑_{i=1}^n (x_i −
x̄)²/(n − 1). Then F_σ({y : T(x) ≤ T(y)}) is the same for all σ. In fact P_H(x) =
2T_{n−1}(−|T(x)|), where T_{n−1} is the CDF of the t_{n−1}(0, 1) distribution. The usual
t-test is a pure significance test.
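The statistic of Example 4.10 is easy to compute; the data below are invented. The Python standard library has no t CDF, so the final step P_H(x) = 2T_{n−1}(−|T(x)|) is left as a comment (scipy.stats.t.cdf would supply it). This is a sketch, not the text's code:

```python
import math
import statistics

# T(x) = sqrt(n) * |x_bar| / s, with s^2 the usual unbiased sample variance.
def t_stat(xs):
    n = len(xs)
    return math.sqrt(n) * abs(statistics.fmean(xs)) / statistics.stdev(xs)

x = [0.3, -0.1, 0.8, 0.4, 0.2]   # hypothetical data
T = t_stat(x)
print(round(T, 3))
# The significance probability would then be 2 * t_cdf(-T, df=len(x) - 1),
# e.g. scipy.stats.t.cdf(-T, df=4) if scipy is available.
```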

The advantages to pure significance tests over general hypothesis tests are
that one need not explicitly state the alternatives and one is free to choose
the weak order ≼ however one sees fit. Of course, one would normally choose
≼ with some alternative in mind, but one need not say what the alternative
is, nor need one calculate any probabilities conditional on the alternative.

A serious disadvantage is that one never knows, until one considers explicit
alternatives, whether one should continue calculating probabilities as if the
hypothesis were true or not. Just because P_H(x) is large does not mean that
H is a better probability model for the data than some other plausible
distribution not part of H. Similarly, if P_H(x) is quite small, it may be
the case that many other distributions not part of H also give very small
probability to the set {y : x ≼ y}. Berkson (1942) forcefully argues this
point, but not forcefully enough for Fisher (1943).
We will not discuss pure significance tests any further in this book except
to mention a few points.³ First, all of the hypothesis tests developed in this
to mention a few points. 3 First, all of the hypothesis tests developed in this
chapter can be interpreted as pure significance tests if one feels compelled
to do so, although the hypotheses may need to be modified in order to
satisfy the definition of pure significance test. Second, the goodness of fit
tests described in Section 7.5.2 were originally intended to be interpreted
as pure significance tests. Third, pure significance tests have no role to play
in the Bayesian framework as described in various parts of this text. If the
hypothesis H describes all of the probability distributions one is willing to
entertain, then one cannot reject H without rejecting probability models
altogether. If one is willing to entertain models not in H, then one needs
to take them into account, as well as their merits relative to H, before
deciding whether or not to reject H.

4.2 Bayesian Solutions


4.2.1 Testing in General
The Bayesian solution to a hypothesis-testing problem with 0-1-c loss is
straightforward theoretically. The posterior risk from choosing action a = 1
is c Pr(V ∈ 𝒱_H | X = x), and the posterior risk of choosing action a = 0 is
Pr(V ∈ 𝒱_A | X = x). The optimal decision is to choose a = 1 if

c Pr(V ∈ 𝒱_H | X = x) < Pr(V ∈ 𝒱_A | X = x),

which is equivalent to

Pr(V ∈ 𝒱_H | X = x) < 1/(1 + c).      (4.11)

So, the Bayesian solution is to reject the hypothesis if its posterior probability
is too small, that is, smaller than 1/(1 + c). Theoretically, that is all there
is to Bayesian hypothesis testing with 0-1-c loss. In practical problems, it
may be computationally difficult to calculate the posterior probability that
V ∈ 𝒱_H, but this is a numerical analysis problem.
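The rule in (4.11) is one line of code once the posterior probability is in hand. A minimal sketch (the function name is mine):

```python
# Bayes test for 0-1-c loss, per (4.11): reject H (action 1) exactly when the
# posterior probability of the hypothesis falls below 1/(1+c).
def bayes_test(posterior_prob_H, c):
    return 1 if posterior_prob_H < 1 / (1 + c) else 0

# With c = 19 the rejection threshold is 1/20 = 0.05.
print(bayes_test(0.03, c=19), bayes_test(0.40, c=19))  # 1 0
```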

³Cox and Hinkley (1974, Chapters 3-5) discuss pure significance tests and
related topics in great detail. A nice review is contained in Cox (1977).

Example 4.12. Suppose that P_θ says that {X_n}_{n=1}^∞ are IID N(μ, σ²), where
θ = (μ, σ) and X = (X_1, ..., X_n). Let V = Θ and Ω_H = {(μ, σ) : μ ≤ μ_0},
and let L be a 0-1-c loss function. If we use the measure with Radon-Nikodym
derivative 1/σ with respect to Lebesgue measure as an improper prior, then the
posterior distribution of M is t_{n−1}(x̄, s/√n). The formal Bayes rule is

φ(x) = { 1          if t > T^{−1}_{n−1}(c/(1 + c)),
         0          if t < T^{−1}_{n−1}(c/(1 + c)),
         arbitrary  if t = T^{−1}_{n−1}(c/(1 + c)),

where t = √n(x̄ − μ_0)/s is the usual t statistic and T_{n−1} is the CDF of the
t_{n−1}(0, 1) distribution. Note that this is the usual size 1/(1 + c) t-test of H from
every elementary statistics course.

The Bayesian solution, as stated above, applies to predictive hypothesis
testing as well as to parametric testing. For example, note that (4.11) is formulated
in terms of predictive probabilities. Classical theory is not as well
equipped as Bayesian theory to deal with predictive hypothesis tests.⁴ The closest
the classical theory comes to dealing with predictive testing is as a predictive
decision problem. The type of hypothesis constructed in Example 4.13
is closely related to tolerance sets as described in Section 5.2.3.
Example 4.13. In the classical setting (see (3.15) on page 149) the predictive loss
function is first converted to a parametric loss function and then the parametric
decision problem is solved. For the hypothesis-testing case with 0-1-c loss,

L(v, δ(x)) = c I_{𝒱_H}(v) φ(x) + I_{𝒱_A}(v)[1 − φ(x)],
L(θ, δ(x)) = c P_{θ,V}(𝒱_H) φ(x) + [1 − P_{θ,V}(𝒱_H)][1 − φ(x)]
           = φ(x)[(c + 1)P_{θ,V}(𝒱_H) − 1] + 1 − P_{θ,V}(𝒱_H),
R(θ, φ) = β_φ(θ)[(c + 1)P_{θ,V}(𝒱_H) − 1] + 1 − P_{θ,V}(𝒱_H).      (4.14)

Now, define

Ω_H = {θ : P_{θ,V}(𝒱_H) ≥ 1/(1 + c)},   Ω_A = Ω_H^c,

e(θ) = { 1 − P_{θ,V}(𝒱_H)  if θ ∈ Ω_H,
         c P_{θ,V}(𝒱_H)    if θ ∈ Ω_A,

d(θ) = |(c + 1)P_{θ,V}(𝒱_H) − 1|.

Now note that R(θ, φ) in (4.14) is exactly equal to e(θ) plus d(θ) times the
risk function from a 0-1 loss as given in (4.5) for the hypothesis H : Θ ∈ Ω_H.

⁴The interested reader should try to extend the classical definitions of level,
power, and so forth to the case of predictive hypothesis testing and see what
happens. The problem arises because one usually assumes that the future data
are independent of the past data conditional on the parameters, and all classical
inferences are conditional on the parameters. Hence, the past tells us nothing
about the future, and vice versa.

If power functions are continuous at that θ such that P_{θ,V}(𝒱_H) = 1/(c + 1),
then the predictive testing problem has been converted into a parametric testing
problem. In words, we have replaced a test concerning the observable V with a
test concerning the conditional distribution of V given Θ.
Another area in which Bayesian and classical hypothesis testing differ
dramatically is their treatment of more general loss functions. When the
focus of classical testing is on admissible tests, then it does not matter
which of several equivalent loss functions one uses. A Bayesian solution to
a testing problem will depend on which loss one uses because one is trying
to minimize the posterior risk. For example, with the loss function in (4.3),
the posterior risk for choosing a = 0 is ∫ I_{(θ_0,∞)}(θ)(θ − θ_0) dF_{Θ|X}(θ|x), while
the posterior risk for choosing a = 1 is c ∫ I_{(−∞,θ_0]}(θ)(θ_0 − θ) dF_{Θ|X}(θ|x).
A little algebra shows that the formal Bayes rule is to choose a = 1 if

E(Θ|X = x) − θ_0 > (c − 1) Pr(Θ ≤ θ_0|X = x){θ_0 − E(Θ|Θ ≤ θ_0, X = x)}.

It may turn out that this is the same decision rule as is optimal with a
0-1-c′ loss for some number c′.
Example 4.15. Suppose that X ∼ N(θ, 1) given Θ = θ and that the hypothesis
is H : Θ ≤ θ_0 with loss (4.3). It is easy to see that the formal Bayes rule with
respect to a prior μ_Θ is to choose action a = 1 if

∫_{(θ_0,∞)} (θ − θ_0) exp(x[θ − θ_0] − θ²/2) dμ_Θ(θ)
  > c ∫_{(−∞,θ_0]} (θ_0 − θ) exp(x[θ − θ_0] − θ²/2) dμ_Θ(θ).

The expression on the left is increasing in x and the expression on the right is
decreasing in x, so the formal Bayes rule is to choose action a = 1 if x > k
for some number k. This rule has the same form as the formal Bayes rules with
respect to 0-1-c′ loss functions.
In classical hypothesis testing, it is not common to recommend different
tests depending on whether the loss is 0-1-c or of the form of (4.3) or
anything else. In fact, very little attention is paid to what the loss function
might be in classical testing. Were the focus solely on finding all admissible
rules, this might not be a problem. However, once we advance beyond
the simplest types of testing situations, the classical theory will tend to
abandon the goal of finding all admissible rules and concentrate instead on
finding all tests that satisfy certain ad hoc criteria.

4.2.2 Bayes Factors

The most striking difference between classical and Bayesian hypothesis testing
arises in the treatment of point hypotheses of the form H : Θ = θ_0

versus A : e 1= 00 When the parame ter space is uncountable, prior


dis-
tributio ns are typically continuous. This means that the prior (and
pos-
terior) probability of e = 80 is o. In order to take seriously the
prob-
lem of testing a point hypothesis, one must use a prior distribution
in
which Pr(e = ( 0 ) > O. Alternatively, one can replace the hypoth
esis
with (what might be more reasonable) an interval hypothesis of the
form
H' : e E [0 0 - f., 80 + 6]. This latter case is no different from anythin
g
considered already. The case of a point hypothesis has some interes
ting
features, which we will explore in the remainder of this section.
Jeffreys (1961) suggests the use of what are now called Bayes factors
for
comparing a point hypothesis to a continuous alternative. Let Po
1/ for
all 8, and suppose that one assigns probability Po to {e = 80 } and
uses
a prior distribution A on n \ {Oo} for the conditional prior given e
1= 00 .
Then the joint density of the data and e (with respect to 1/ times the
sum
of A and a point mass at (0) is

fx e(x, 0) = { pofxle(xIOo) if 0 = 00 ,
, (1- po)fxle(x\O) if 01= 00 .
The marginal density of the data is

fx(x) = pofxle(x\Oo) + (1- Po) J fXle(xIO)dA(O).

The posterior distribu tion of e has density (with respect to the sum of A
and a point mass at ( 0 )

PI if 0 = 00 ,
felx(Olx) = { (1-p )fx1e(xI O)
I fx(x) if 0 1= 00 ,
where
po!xle(xIOo)
PI = fx(x)
is the posterior probability of e = 00 . It is easy to see that

~ _ Po fXle(xIOo)
1 - PI - 1 - Po J fXls(x\O)dA(O) . (4.16)

The second factor on the right-hand side of (4.16) is the Bayes factor. It would be the posterior odds in favor of $\Theta = \theta_0$ if $p_0 = 0.5$. For other values of $p_0$, one needs to multiply the prior odds times the Bayes factor to calculate the posterior odds. The advantage of calculating a Bayes factor over the posterior odds ($p_1/[1 - p_1]$) is that one need not state a prior odds in favor of the hypothesis. This might be useful if one is reporting the results of an experiment rather than trying to make a decision. One must still, however, state a prior distribution over the alternative given that the hypothesis is false.
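The bookkeeping in (4.16) is easy to check numerically. The sketch below uses an assumed two-point conditional prior $\lambda$ and assumed values of $x$ and $p_0$ (none of these come from the text); it verifies that multiplying the prior odds by the Bayes factor reproduces the posterior probability $p_1$ computed directly from the marginal density.

```python
import math

def norm_pdf(x, mean):
    # N(mean, 1) density
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

# Point null theta0 = 0; assumed conditional prior lambda puts mass 1/2 on +/-1,
# and assumed values x = 1.5, p0 = 0.3 (illustrative only).
theta0, x, p0 = 0.0, 1.5, 0.3
f_null = norm_pdf(x, theta0)
f_alt = 0.5 * norm_pdf(x, 1.0) + 0.5 * norm_pdf(x, -1.0)  # integral wrt lambda

bayes_factor = f_null / f_alt
post_odds = (p0 / (1 - p0)) * bayes_factor       # prior odds times Bayes factor, as in (4.16)
p1_from_odds = post_odds / (1 + post_odds)

# direct computation of p1 through the marginal density f_X(x)
f_x = p0 * f_null + (1 - p0) * f_alt
p1_direct = p0 * f_null / f_x
assert abs(p1_from_odds - p1_direct) < 1e-12
```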
222 Chapter 4. Hypothesis Testing

Example 4.17. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Omega_H = \{\theta_0\}$, $\Omega_A = (-\infty, \theta_0) \cup (\theta_0, \infty)$. Let the prior probability of the hypothesis be $\Pr(\Theta = \theta_0) = p_0 > 0$. Suppose that the conditional prior distribution of $\Theta$ given $\Theta \neq \theta_0$ is a measure $\lambda$. It is not difficult to show that

$$\frac{p_1}{1 - p_1} = \frac{p_0}{1 - p_0}\, \exp\left(-\frac{\theta_0^2}{2}\right) \left[\int \exp\left(x[\theta - \theta_0] - \frac{\theta^2}{2}\right) d\lambda(\theta)\right]^{-1}.$$

If $\lambda$ puts positive mass on both sides of $\theta_0$, then it is easily verified that $\int \exp(x[\theta - \theta_0] - \theta^2/2)\, d\lambda(\theta)$ is convex as a function of $x$ and goes to $\infty$ as $x \to \pm\infty$. So all formal Bayes rules will be of the form "reject $H$ if $x$ is outside of some bounded interval."

When testing hypotheses of the form $H : \Theta = \theta_0$, the formal Bayes rule can be written in the form "reject $H$ if the Bayes factor is less than something." It is possible to bound the Bayes factor from below when the likelihood function is bounded above. That is, we might be able to find a distribution $\lambda$ that would lead to the smallest possible Bayes factor.^5 This lower bound would give a bound on how strongly the data conflict with the hypothesis.
Example 4.18 (Continuation of Example 4.17). The Bayes factor in this example is

$$\frac{\exp\left(-\frac{[x - \theta_0]^2}{2}\right)}{\int \exp\left(-\frac{[x - \theta]^2}{2}\right) d\lambda(\theta)},$$

which is minimized (over $\lambda$) by the distribution that puts probability 1 on that value of $\theta$ which maximizes the integrand, which is the likelihood function. In this case, that would be $\theta = x$, and the lower bound on the Bayes factor is $\exp(-[x - \theta_0]^2/2)$. For example, if $x = \theta_0 + 1.96$, which is the critical value for the usual two-sided level 0.05 test of $H$, we get a lower bound of 0.1465. This says that a data value that would just barely lead to rejecting $H$ at level 0.05 could not possibly change one's odds against the hypothesis by more than a factor of 7, and then only in the extremely unlikely case that one believed before seeing the data that $\Theta$ was sure to equal $\theta_0 + 1.96$ if it was not $\theta_0$. Put another way, in order for the posterior probability of $H$ to be as low as 0.05, the prior probability $p_0$ would have to be lower than 0.2643, and much lower if a more reasonable prior on the alternative were used.
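The two numbers quoted here (the bound 0.1465 and the threshold 0.2643) can be reproduced in a few lines:

```python
import math

z = 1.96  # x - theta0
min_bf = math.exp(-z * z / 2)        # global lower bound on the Bayes factor
assert abs(min_bf - 0.1465) < 5e-4
assert abs(1 / min_bf - 7) < 0.2     # odds change by at most a factor of about 7

# Largest p0 for which the posterior probability of H can reach 0.05:
# posterior odds 0.05/0.95 = prior odds * Bayes factor, with the factor at its minimum.
prior_odds = (0.05 / 0.95) / min_bf
p0_max = prior_odds / (1 + prior_odds)
assert abs(p0_max - 0.2643) < 5e-4
```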
A more realistic expression of prior opinion might be that the prior, given $\Theta \neq \theta_0$, is a normal distribution with mean $\theta_0$ and some variance $\tau^2$. In this case, the Bayes factor is

$$\sqrt{1 + \tau^2}\, \exp\left(-\frac{\tau^2 (x - \theta_0)^2}{2(1 + \tau^2)}\right). \tag{4.19}$$

The prior in this class that leads to the smallest Bayes factor can easily be shown (see Problem 7 on page 286) to be the one with

$$\tau^2 = \begin{cases} (x - \theta_0)^2 - 1 & \text{if } |x - \theta_0| > 1, \\ 0 & \text{otherwise.} \end{cases}$$

^5 For more discussion of this technique, see Edwards, Lindman, and Savage (1963).
4.2. Bayesian Solutions 223

The minimum Bayes factor is $|x - \theta_0| \exp(\{-[x - \theta_0]^2 + 1\}/2)$ if $|x - \theta_0| > 1$. The minimum is 1 if $|x - \theta_0| \le 1$. At $x = \theta_0 + 1.96$, the minimum Bayes factor is 0.4734. This time, the prior probability $p_0$ would have to be lower than 0.1 in order for the posterior probability to be as low as 0.05.
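A quick numerical check of the minimizing variance and the bound quoted above, with the normal-prior Bayes factor written out explicitly:

```python
import math

def bf_normal_prior(z, tau2):
    # Bayes factor for a N(theta0, tau2) prior on the alternative; z = x - theta0
    return math.sqrt(1 + tau2) * math.exp(-tau2 * z * z / (2 * (1 + tau2)))

z = 1.96
tau2_star = z * z - 1                 # minimizing variance when |z| > 1
min_bf = bf_normal_prior(z, tau2_star)

# closed form |z| exp({-z^2 + 1}/2) and the quoted value 0.4734
assert abs(min_bf - abs(z) * math.exp((1 - z * z) / 2)) < 1e-12
assert abs(min_bf - 0.4734) < 5e-4

# nearby variances do no better
assert all(bf_normal_prior(z, t2) >= min_bf - 1e-9
           for t2 in [0.5, 1.0, 2.0, 5.0, 50.0])

# p0 giving posterior probability 0.05 is about 0.1
prior_odds = (0.05 / 0.95) / min_bf
p0_max = prior_odds / (1 + prior_odds)
assert abs(p0_max - 0.1) < 1e-3
```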
Intermediate to the two bounds above is the bound obtained by supposing that $\Theta$ has a distribution symmetric around $\theta_0$, but not necessarily normal. Since a prior that is symmetric around $\theta_0$ is a mixture of priors that put probability 1/2 on two points symmetrically located around $\theta_0$, the smallest value of the Bayes factor among all symmetric priors can be obtained by maximizing over priors that put probability 1/2 on $\theta_0 \pm c$ for $c \ge 0$. For such a prior, the density of the data given that the hypothesis is false equals

$$\frac{1}{2\sqrt{2\pi}}\left[\exp\left(-\frac{[x - \theta_0 + c]^2}{2}\right) + \exp\left(-\frac{[x - \theta_0 - c]^2}{2}\right)\right].$$

Maximizing this as a function of $c$ leads to $c = 0$ if $|x - \theta_0| \le 1$. If $|x - \theta_0| > 1$, the maximum occurs at the solution to the equation

$$\frac{x - \theta_0 + c}{x - \theta_0 - c} = \exp(2c[x - \theta_0]).$$

For $|x - \theta_0| > 1.5$, the solution $c$ is very nearly equal to $|x - \theta_0|$ (although it is always strictly smaller than $|x - \theta_0|$). If $x = \theta_0 + 1.96$, for example, then $c = 1.958$. The value of $f_X(x)$, when $c = |x - \theta_0|$, is $[1 + \exp(-2|x - \theta_0|^2)]/[2\sqrt{2\pi}]$. If $x = \theta_0 + 1.96$, for example, then the lower bound on the Bayes factor is approximately 0.2928, twice the global lower bound. This is not surprising, since the two-point distribution puts half of its probability very nearly at the same point as does the one-point distribution that led to the global lower bound. The other half of the probability is on a point that contributes nearly nothing because it is so far from $x$.
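The fixed-point equation above is easy to solve numerically; the sketch below uses simple bisection to recover $c$ and the resulting bound for $x = \theta_0 + 1.96$:

```python
import math

z = 1.96  # x - theta0

def g(c):
    # a zero of g locates the maximizing two-point prior theta0 +/- c
    return (z + c) / (z - c) - math.exp(2 * c * z)

lo, hi = 1.0, z - 1e-12   # g(lo) < 0 < g(hi)
for _ in range(200):
    mid = (lo + hi) / 2
    if g(mid) < 0:
        lo = mid
    else:
        hi = mid
c = (lo + hi) / 2
assert abs(c - 1.958) < 1e-3

def norm_pdf(u):
    return math.exp(-u * u / 2) / math.sqrt(2 * math.pi)

# predictive density under the two-point prior, and the resulting bound
f_alt = 0.5 * (norm_pdf(z + c) + norm_pdf(z - c))
bf = norm_pdf(z) / f_alt
assert abs(bf - 0.2928) < 5e-4
```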
The global lower bound on the Bayes factor, namely

$$\frac{f_{X|\Theta}(x|\theta_0)}{\sup_{\theta \neq \theta_0} f_{X|\Theta}(x|\theta)}, \tag{4.20}$$

is closely related to the likelihood ratio test statistic, which is discussed in Section 4.5.5.
Upper bounds on Bayes factors are usually harder to come by. This is due to the fact that there are often priors (even conjugate priors) that place such high probability on the data being very far from what was observed that the hypothesis will be highly favored if such a prior is used for the alternative. For example, in Example 4.17, if the alternative prior is $N(\theta_0, \tau^2)$, the Bayes factor goes to $\infty$ as $\tau^2$ goes to $\infty$. In this regard, it is important to note that improper priors are particularly inappropriate for the conditional distribution of $\Theta$ given $\Theta \neq \theta_0$. The limit as $\tau^2$ goes to $\infty$ in Example 4.17 leads to an improper prior. As we just noted, the Bayes factor goes to $\infty$ because the improper prior for the alternative says that $\Theta$ has probability 1 of being outside of every bounded interval. Since the data will surely be inside some bounded interval, it will appear to be much more consistent with the hypothesis than the alternative. There are ways, however, to use limits of proper priors in Bayes factors.

Example 4.21 (Continuation of Example 4.18; see page 222). Suppose that we wish to let $\tau^2$ go to $\infty$ in the $N(\theta_0, \tau^2)$ prior for $\Theta$ given the alternative. In order to use an improper prior to approximate a proper prior in this problem, we would have to let the prior on the hypothesis be improper also. This could be done by letting $p_0$ go to zero in such a way that $p_0 \tau \to k$. In this case, $p_0/[1 - p_0]$ times the Bayes factor converges to $k \exp(-[x - \theta_0]^2/2)$. In this way, $k$ acts like the prior odds ratio, and $\exp(-[x - \theta_0]^2/2)$ acts like the Bayes factor. In fact, $k$ is the limit (as $\tau \to \infty$) of the ratio of $p_0$ to the prior probability that $\Theta$ is in the interval $[\theta_0 - \sqrt{\pi/2},\ \theta_0 + \sqrt{\pi/2}]$ given the alternative. (See Problem 8 on page 287.)

By restricting the class of prior distributions, one can obtain useful upper bounds on Bayes factors. For example, in Example 4.17, one could restrict attention to priors with $\tau^2 \le c$. Since the Bayes factor (4.19) is eventually increasing as a function of $\tau^2$, the maximum occurs at $\tau^2 = c$. For large $c$, one can easily compute the upper bound to be approximately $\sqrt{c}$ times the global minimum Bayes factor.
Bayes factors can also be calculated in cases in which the hypothesis is of the form $H : g(\Theta) = g(\theta_0)$ versus $A : g(\Theta) \neq g(\theta_0)$ for some function $g$. For example, the hypothesis might concern only one of several coordinates of $\Theta$. In this case, global lower bounds on the Bayes factor are not particularly useful.
Example 4.22. Let $\Theta = (M, \Sigma)$, and suppose that $X_1, \ldots, X_n$ are conditionally IID given $\Theta = (\mu, \sigma)$ with $N(\mu, \sigma^2)$ distribution. Suppose that $H : M = \mu_0$ is the hypothesis. Given $M \neq \mu_0$, we suppose that $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ and that $M$ given $\Sigma = \sigma$ has $N(\mu_0, \sigma^2/\lambda_0)$ distribution. This is the usual conjugate prior distribution. Conditional on $M = \mu_0$, we still need a prior distribution for $\Sigma^2$. We will use the conditional distribution given $M = \mu_0$ obtained from the joint distribution given $M \neq \mu_0$. Conditional on $M = \mu_0$, $\Sigma^2$ has $\Gamma^{-1}(a_0^*/2, b_0^*/2)$ distribution, where $a_0^* = a_0 + 1$ and $b_0^* = b_0$ (conditioning on $M = \mu_0$ contributes a factor $\sigma^{-1}$, which changes the shape, but since the conditional prior mean of $M$ is $\mu_0$ itself, the scale is unchanged). The conditional density of $(X, \Sigma)$ given $M = \mu_0$, $f_{X,\Sigma|M}(x, \sigma|\mu_0)$, equals

$$\frac{2\left(\frac{b_0^*}{2}\right)^{a_0^*/2}}{(2\pi)^{n/2}\,\Gamma\!\left(\frac{a_0^*}{2}\right)}\,\sigma^{-(n + a_0^* + 1)} \exp\left\{-\frac{1}{2\sigma^2}\left[b_0^* + w + n(\overline{x}_n - \mu_0)^2\right]\right\}.$$

Given $M \neq \mu_0$, the joint density of $(X, M, \Sigma)$ is

$$\frac{2\left(\frac{b_0}{2}\right)^{a_0/2}\sqrt{\lambda_0}}{(2\pi)^{(n+1)/2}\,\Gamma\!\left(\frac{a_0}{2}\right)}\,\sigma^{-n - a_0 - 2} \exp\left\{-\frac{1}{2\sigma^2}\left[b_0 + w + \lambda_1(\mu - \mu_1)^2 + \frac{n\lambda_0}{\lambda_1}(\overline{x}_n - \mu_0)^2\right]\right\},$$

where

$$\overline{x}_n = \frac{1}{n}\sum_{i=1}^n x_i, \qquad w = \sum_{i=1}^n (x_i - \overline{x}_n)^2, \qquad \lambda_1 = \lambda_0 + n, \qquad \mu_1 = \frac{n\overline{x}_n + \lambda_0 \mu_0}{\lambda_1}.$$

If we integrate the parameters out of the two densities above and take the ratio, we get the Bayes factor:

$$\frac{\Gamma(a_1^*/2)\,\Gamma(a_0/2)}{\Gamma(a_0^*/2)\,\Gamma(a_1/2)}\, \sqrt{\frac{\lambda_1}{\lambda_0}}\; \frac{(b_0^*)^{a_0^*/2}\; b_1^{a_1/2}}{b_0^{a_0/2}\; (b_1^*)^{a_1^*/2}}, \tag{4.23}$$

where

$$a_1 = a_0 + n, \qquad b_1 = b_0 + w + \frac{n\lambda_0}{\lambda_1}(\overline{x}_n - \mu_0)^2,$$
$$a_1^* = a_0^* + n, \qquad b_1^* = b_0^* + w + n(\overline{x}_n - \mu_0)^2.$$

To put a lower bound on the Bayes factor, we first note that the conditional distribution of $M$ given $\Sigma$ and $M \neq \mu_0$ which will lead to the largest marginal density for $X$ given $M \neq \mu_0$ is the one that says $M = \overline{x}$ with probability 1. We are then left with the problem of finding distributions for $\Sigma$ given $M = \mu_0$ and given $M \neq \mu_0$. It is easy to see that if we let the distribution of $\Sigma$ be concentrated at the same value $c$ for both the hypothesis and the alternative, then the Bayes factor is $\exp(-n(\overline{x} - \mu_0)^2/[2c^2])$, which goes to 0 as $c$ goes to 0, unless $\overline{x} = \mu_0$. If $\overline{x} = \mu_0$ (a probability 0 event given $\Theta$), the lower bound on the Bayes factor is still 0, but one achieves it by letting the priors for $\Sigma$ be different under the hypothesis and alternative.
If one wished to use improper priors, one would have to let $\lambda_0$ go to 0 while $p_0/\sqrt{\lambda_0}$ converges to some finite strictly positive number $k$.^6 In this case $a_0^* = a_0$ instead of $a_0 + 1$ because $\Sigma$ and $M$ are independent in the improper prior. To convert (4.23) to the case of the improper prior, we set $a_0 = -1$ and $b_0 = 0$. The product of the prior odds and the Bayes factor becomes

$$k\sqrt{n}\left(1 + \frac{t^2}{n - 1}\right)^{-(n-1)/2},$$

where $t = \sqrt{n}(\overline{x} - \mu_0)/\sqrt{w/[n-1]}$ is the usual $t$ statistic used to test $H : M = \mu_0$.

In general, minimizing a Bayes factor for a problem like the one in Example 4.22 would require choosing the prior for the alternative to maximize the predictive density and choosing the prior for the hypothesis to minimize the predictive density. But this latter problem was already seen to lead to the minimum being 0 in most cases. In short, the global lower bound on the Bayes factor, when the hypothesis concerns only a function of the parameter, will most likely be 0 and so is not useful. An alternative to the global lower bound is an approximate Bayes factor formed by maximizing the marginal density of $X$ separately under the hypothesis and alternative:

$$\frac{\sup_{\theta \in \Omega_H} f_{X|\Theta}(x|\theta)}{\sup_{\theta \in \Omega_A} f_{X|\Theta}(x|\theta)}. \tag{4.24}$$

^6 This approach was suggested in personal communication with Luke Tierney. It is also the approach taken by Robert (1993).

This approximate Bayes factor is also closely related to likelihood ratio tests (see Section 4.5.5).
Example 4.25 (Continuation of Example 4.22; see page 224). To maximize the marginal density of the data under the alternative, we choose the prior distribution to concentrate all of its probability on the values for $\sigma$ and $\mu$ which provide a maximum for the likelihood function. These are clearly $\mu = \overline{x}$ and $\sigma = \sqrt{w/n}$. Under the hypothesis, we must choose $\sigma$ to maximize the likelihood, and the appropriate value is $\sigma = \sqrt{w/n + (\overline{x} - \mu_0)^2}$. This approximation corresponds to letting $\lambda_0 = 0$, $b_0 = b_0^* = 0$, and $a_0 = a_0^* = 0$ in the analysis with the conjugate prior. The approximate Bayes factor would then equal

$$\left(\frac{w}{w + n(\overline{x} - \mu_0)^2}\right)^{n/2} = \left(1 + \frac{t^2}{n - 1}\right)^{-n/2}, \tag{4.26}$$

where $t = \sqrt{n}(\overline{x} - \mu_0)/\sqrt{w/[n-1]}$ is the usual $t$ statistic used to test $H : M = \mu_0$.
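It may be worth confirming numerically that the maximized-likelihood ratio (4.24) really collapses to the closed form in (4.26); the sketch below does so on simulated placeholder data (the sample itself is not from the text):

```python
import math
import random

random.seed(0)
n, mu0 = 14, 1.5
xs = [random.gauss(2.0, 1.6) for _ in range(n)]  # assumed illustrative sample

xbar = sum(xs) / n
w = sum((xi - xbar) ** 2 for xi in xs)

def normal_loglik(mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (xi - mu) ** 2 / (2 * sigma ** 2) for xi in xs)

# maximized likelihoods under A (mu, sigma free) and under H (mu = mu0)
sig_a = math.sqrt(w / n)
sig_h = math.sqrt(w / n + (xbar - mu0) ** 2)
approx_bf = math.exp(normal_loglik(mu0, sig_h) - normal_loglik(xbar, sig_a))

# closed form (4.26) via the t statistic
t = math.sqrt(n) * (xbar - mu0) / math.sqrt(w / (n - 1))
closed = (1 + t * t / (n - 1)) ** (-n / 2)
assert abs(approx_bf - closed) < 1e-10
```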

An alternative approximation to the Bayes factor is available by approximating the marginal densities of the data under the alternative and hypothesis using the method of Laplace.^7 That is, approximate the product of likelihood times the prior by a multivariate normal density with the wrong normalizing constant. Then approximate the integral over the parameter space by the integral of a normal density. For example, suppose that under the hypothesis, the parameter is $\Psi$ with prior density $f_\Psi(\psi)$ and that under the alternative, the parameter is $\Theta$ with prior density $f_\Theta(\theta)$. The likelihoods are $f_{X|\Psi}(x|\psi)$ and $f_{X|\Theta}(x|\theta)$, respectively. Assume that $\hat\psi$ and $\hat\theta$ provide the largest values of the two likelihood functions. If the maxima occur at points where the partial derivatives of the likelihoods are 0, and if the likelihoods have continuous second partial derivatives, then we can write

$$\log f_{X|\Psi}(x|\psi) \approx \log f_{X|\Psi}(x|\hat\psi) + \frac{1}{2}[\psi - \hat\psi]^\top A\, [\psi - \hat\psi],$$

where $A$ is the matrix of second partial derivatives of the logarithm of the likelihood evaluated at $\hat\psi$.^8 This matrix will typically be negative definite. Let $\Sigma_\psi = -A^{-1}$. A similar expression is obtained for $\Theta$:

$$\log f_{X|\Theta}(x|\theta) \approx \log f_{X|\Theta}(x|\hat\theta) + \frac{1}{2}[\theta - \hat\theta]^\top B\, [\theta - \hat\theta].$$

Let $\Sigma_\theta = -B^{-1}$. If, in addition, the prior densities are relatively flat in the regions where the likelihoods attain their largest values, we can write

$$\int f_{X|\Psi}(x|\psi)\, f_\Psi(\psi)\, d\psi \approx f_\Psi(\hat\psi)\, f_{X|\Psi}(x|\hat\psi) \int \exp\left(-\frac{1}{2}[\psi - \hat\psi]^\top \Sigma_\psi^{-1} [\psi - \hat\psi]\right) d\psi = f_\Psi(\hat\psi)\, f_{X|\Psi}(x|\hat\psi)\, (2\pi)^{k/2}\, |\Sigma_\psi|^{1/2},$$

where $k$ is the dimension of the vector $\Psi$. A similar expression, with $p$ the dimension of $\Theta$, is obtained for the integral over $\theta$. The Bayes factor is

$$\frac{\int f_{X|\Psi}(x|\psi)\, f_\Psi(\psi)\, d\psi}{\int f_{X|\Theta}(x|\theta)\, f_\Theta(\theta)\, d\theta} \approx (2\pi)^{(k-p)/2}\, \frac{f_\Psi(\hat\psi)\, f_{X|\Psi}(x|\hat\psi)\, |\Sigma_\psi|^{1/2}}{f_\Theta(\hat\theta)\, f_{X|\Theta}(x|\hat\theta)\, |\Sigma_\theta|^{1/2}}. \tag{4.27}$$

The factor $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ can be removed and multiplied times the prior odds $p_0/[1 - p_0]$ to capture the prior input required. The rest of the approximate Bayes factor does not require the specification of any prior distributions. The removed factor, however, is not entirely prior-dependent. It also depends on the observed data.

^7 We will discuss the large sample properties of the method of Laplace in Section 7.4.3. (In particular, see Theorem 7.116 and the ensuing discussion.) Here, we give only a description of the method without any rigorous justification. The derivation presented here is based on Kass and Raftery (1995).
^8 The matrix $-A$ is sometimes called the observed Fisher information.
Example 4.28 (Continuation of Example 4.22; see page 224). In the case of testing $H : M = \mu_0$, we have $k = 1$ and $p = 2$ in (4.27) because $\Theta = (M, \Sigma)$ and $\Psi = \Sigma$. The likelihood functions have their maxima at $\hat\psi = \sqrt{[w + n(\overline{x} - \mu_0)^2]/n}$ and $\hat\theta = (\overline{x}, \sqrt{w/n})$. The matrices $\Sigma_\psi$ and $\Sigma_\theta$ are

$$\Sigma_\psi = \frac{w + n(\overline{x} - \mu_0)^2}{2n^2}, \qquad \Sigma_\theta = \frac{w}{n^2}\begin{pmatrix} 1 & 0 \\ 0 & \frac{1}{2} \end{pmatrix}.$$

The approximate Bayes factor is $1/\sqrt{2\pi}$ times the factor $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ times the expression in (4.26) times the ratio of the square roots of the determinants of the two matrices above. The result is

$$\frac{f_\Psi(\hat\psi)}{f_\Theta(\hat\theta)}\, \frac{n}{\sqrt{2\pi w}}\left(1 + \frac{t^2}{n - 1}\right)^{-(n-1)/2}, \tag{4.29}$$

where $t = \sqrt{n}(\overline{x} - \mu_0)/\sqrt{w/[n-1]}$ is the usual $t$ statistic for testing $H$.


To see how the approximation compares with the actual Bayes factor, suppose that the prior distributions are conjugate, as on page 224, and we let $f_\Psi$ be the conditional prior calculated from $f_\Theta$ given that $H$ is true. Then

$$f_\Psi(\sigma) = \frac{2\left(\frac{b_0^*}{2}\right)^{a_0^*/2}}{\Gamma\!\left(\frac{a_0^*}{2}\right)}\, \sigma^{-(a_0^* + 1)} \exp\left(-\frac{b_0^*}{2\sigma^2}\right),$$

$$f_\Theta(\mu, \sigma) = \frac{2\left(\frac{b_0}{2}\right)^{a_0/2}\sqrt{\lambda_0}}{\sqrt{2\pi}\,\Gamma\!\left(\frac{a_0}{2}\right)}\, \sigma^{-(a_0 + 2)} \exp\left(-\frac{b_0 + \lambda_0(\mu - \mu_0)^2}{2\sigma^2}\right).$$

Plugging $\sigma = \hat\psi$ into the first of these and $(\mu, \sigma) = (\overline{x}, \sqrt{w/n})$ into the second and taking the ratio give

$$\frac{f_\Psi(\hat\psi)}{f_\Theta(\hat\theta)} = \sqrt{\frac{2\pi}{\lambda_0}}\; \frac{\Gamma(a_0/2)\left(\frac{b_0^*}{2}\right)^{a_0^*/2}}{\Gamma(a_0^*/2)\left(\frac{b_0}{2}\right)^{a_0/2}}\; \hat\psi^{-(a_0^* + 1)}\left(\frac{w}{n}\right)^{(a_0 + 2)/2} \exp\left(\frac{n b_0}{2w} - \frac{b_0^*}{2\hat\psi^2}\right).$$

If $n$ is large, the exponential term above can be approximated by

$$\exp\left(\frac{n b_0}{2w} - \frac{b_0^*}{2\hat\psi^2}\right) \approx \left(\frac{b_1}{w}\right)^{a_1/2}\left(\frac{n\hat\psi^2}{b_1^*}\right)^{a_1^*/2}.$$

If we substitute this into (4.29) and notice that $1 + t^2/[n - 1] = \hat\psi^2/(w/n)$, we get

$$\frac{n\,\Gamma(a_0/2)\,(b_0^*)^{a_0^*/2}\, b_1^{a_1/2}}{\sqrt{2\lambda_0}\,\Gamma(a_0^*/2)\,(b_0)^{a_0/2}\,(b_1^*)^{a_1^*/2}},$$

where $b_1$, $b_1^*$, $a_1$, and $a_1^*$ are defined after (4.23). If $\lambda_0$ and $a_0$ are small relative to $n$, we can approximate $n/\sqrt{2}$ by $\sqrt{\lambda_0 + n}\,\Gamma(a_1^*/2)/\Gamma(a_1/2)$. With this approximation, the expression above becomes exactly (4.23). Although $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$ depends on the data, one could calculate values of the ratio for a range of plausible priors to see how much it could reasonably vary.
As an example, suppose that $\mu_0 = 1.5$ and that $n = 14$, $\overline{x} = 2.7$, and $w = 41$ are observed. Then $\hat\psi = 2.0901$ and $\hat\theta = (2.7, 1.7113)$. That portion of (4.29) that does not depend on the prior is

$$\frac{n}{\sqrt{2\pi w}}\left(1 + \frac{t^2}{n - 1}\right)^{-(n-1)/2} = 0.0648.$$

Next, we let $\mu_0 = 1.5$ and let the other hyperparameters $a_0$, $b_0$, $\lambda_0$ be elements of the set $\{0.1, 1, 5, 10, 20\}$. Figure 4.30 shows the 125 different values of the logarithm of the ratio $f_\Psi(\hat\psi)/f_\Theta(\hat\theta)$, with $\lambda_0$ varying most rapidly and $a_0$ varying most slowly. Since $\log(0.0648) = -2.736$, those priors corresponding to values on the vertical axis greater than 2.736 (horizontal line) will lead to Bayes factors greater than 1, while the others lead to Bayes factors less than 1. Examining Figure 4.30, we see that many reasonable priors (those with small to moderate values of $a_0$ and $\lambda_0$ and values of $b_0/a_0$ in the vicinity of the observed sample variance $w/n = 2.93$) give values for the log of the ratio near 2.736. This suggests that the data will not alter anyone's opinion very much as to whether or not $M = 1.5$. The other approximate Bayes factor (4.26) is 0.0608, which suggests a significant reduction in the odds in favor of the hypothesis. The $t$ statistic for the usual classical test would be 2.53, and the hypothesis would be rejected at level 0.05.
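The numerical claims in this example ($\hat\psi$, $\hat\theta$, the prior-free factor 0.0648, its logarithm $-2.736$, the alternative approximation 0.0608, and $t \approx 2.53$) can all be reproduced in a few lines:

```python
import math

n, mu0, xbar, w = 14, 1.5, 2.7, 41.0

psi_hat = math.sqrt((w + n * (xbar - mu0) ** 2) / n)  # MLE of sigma under H
sig_hat = math.sqrt(w / n)                             # MLE of sigma under A
assert abs(psi_hat - 2.0901) < 5e-4
assert abs(sig_hat - 1.7113) < 5e-4

t = math.sqrt(n) * (xbar - mu0) / math.sqrt(w / (n - 1))
assert abs(t - 2.53) < 5e-3

# prior-free portion of the Laplace approximation (4.29)
laplace_part = n / math.sqrt(2 * math.pi * w) * (1 + t * t / (n - 1)) ** (-(n - 1) / 2)
assert abs(laplace_part - 0.0648) < 5e-4
assert abs(math.log(laplace_part) + 2.736) < 5e-3

# approximate Bayes factor (4.26)
assert abs((1 + t * t / (n - 1)) ** (-n / 2) - 0.0608) < 5e-4
```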

FIGURE 4.30. Logarithms of Ratios of Prior Densities (horizontal axis: the sequence of 125 prior settings).

An interesting difference between Bayes factors and the results of classical hypothesis tests arises from the comparison of the various lower bounds on the Bayes factor to the significance probability (see Definition 4.8). We will generalize the concept of significance probability in Section 4.6 and then make detailed comparisons with Bayes factors. At this point, we only
mention that the results can often be in stark contrast. In particular, data with very small significance probability (supposedly suggesting that the data do not support the hypothesis) can have relatively large values for the lower bounds on the Bayes factor (suggesting that the data do not conflict with the hypothesis to a great extent). This conflict is sometimes called "Lindley's paradox" [see Lindley (1957) and Jeffreys (1961)].
If one believes that the probability is 0 that the parameter lies in a low-dimensional subset of the parameter space, then it is not appropriate to test the types of hypotheses we have considered in this section. As an alternative, one can calculate a measure of how far the parameter of interest is from the hypothesized low-dimensional subset. For example, if the parameter is $\Theta = (M, \Sigma)$ and the hypothesis is that $M$ is near 0, then the posterior distribution of $|M|$ contains a great deal of information about how far $M$ is from 0. Also, the posterior distribution of $|M|/\Sigma$ contains information of a similar sort.
Example 4.31. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$, and that our hypothesis is that $\Theta$ is near $\theta_0$. If we consider all prior distributions of the form $\Theta \sim N(\theta_0, \tau^2)$, then the posterior distribution of $\Theta$ is $N([\theta_0 + \tau^2 x]/[1 + \tau^2],\ \tau^2/[1 + \tau^2])$. For each $\delta > 0$, we can calculate

$$\Pr(|\Theta - \theta_0| \le \delta) = \Phi\left(\lambda[\theta_0 - x] + \frac{\delta}{\lambda}\right) - \Phi\left(\lambda[\theta_0 - x] - \frac{\delta}{\lambda}\right),$$

where $\lambda = \tau/\sqrt{1 + \tau^2}$, and $\Phi$ is the standard normal CDF. There is no useful upper bound on this probability, but a lower bound is obtained by letting $\lambda \to 1$, which means $\tau^2 \to \infty$. This corresponds to the usual improper prior. After observing $X = x$, one could plot $\Pr(|\Theta - \theta_0| \le \delta)$ (using an improper prior) as a function of $\delta$ to describe how far $\Theta$ is likely to be from $\theta_0$. For example, if $x = \theta_0 + 1.96$, then $\Pr(|\Theta - \theta_0| \le \delta) = 0.05$ for $\delta = 0.3983$.
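The value $\delta = 0.3983$ can be confirmed numerically, and one can also check that the improper-prior limit ($\lambda \to 1$) really is the smallest of these probabilities over a few proper priors; a small sketch:

```python
import math

def std_normal_cdf(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

z = 1.96  # so theta0 - x = -1.96, i.e. x = theta0 + 1.96

def prob_within(delta, lam=1.0):
    # Pr(|Theta - theta0| <= delta | x); lam = 1 is the improper-prior limit
    return (std_normal_cdf(lam * (-z) + delta / lam)
            - std_normal_cdf(lam * (-z) - delta / lam))

assert abs(prob_within(0.3983) - 0.05) < 5e-4

# proper priors (lam < 1) give larger values, so lam -> 1 is a lower bound here
for lam in [0.5, 0.8, 0.95]:
    assert prob_within(0.3983, lam) > prob_within(0.3983)
```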
For multiparameter problems, there may be many possible summaries of
the parameters that measure the extent to which the parameter differs from
the hypothesis. We will consider a very general class of such summaries in
Section 8.2.3.

4.3 Most Powerful Tests


As we saw earlier, the power function of a test is closely associated with the risk function for a 0-1-$c$ loss. It makes sense, then, that most attention in classical hypothesis testing focuses on the power function. The following definitions begin to introduce the criteria by which tests are evaluated in the classical framework.
Definition 4.32. Suppose that $\Omega = \Omega_H \cup \{\theta_1\}$, where $\theta_1 \notin \Omega_H$. A level $\alpha$ test $\phi$ of $H : \Theta \in \Omega_H$ versus $A : \Theta = \theta_1$ is called most powerful (MP) level $\alpha$ if, for every level $\alpha$ test $\psi$, $\beta_\psi(\theta_1) \le \beta_\phi(\theta_1)$.

The corresponding dual criterion is the following.
Definition 4.33. Suppose that $\Omega = \Omega_A \cup \{\theta_0\}$, where $\theta_0 \notin \Omega_A$. A floor $\alpha$ test $\phi$ of $H : \Theta = \theta_0$ versus $A : \Theta \in \Omega_A$ is called most cautious (MC) floor $\alpha$ if, for every floor $\alpha$ test $\psi$, $\beta_\psi(\theta_0) \ge \beta_\phi(\theta_0)$.

For more general cases, we have the following definitions.
Definition 4.34. A level $\alpha$ test $\phi$ is uniformly most powerful (UMP) level $\alpha$ if, for every other level $\alpha$ test $\psi$, $\beta_\psi(\theta) \le \beta_\phi(\theta)$ for all $\theta \in \Omega_A$. A floor $\alpha$ test $\phi$ is uniformly most cautious (UMC) floor $\alpha$ if, for every other floor $\alpha$ test $\psi$, $\beta_\psi(\theta) \ge \beta_\phi(\theta)$ for all $\theta \in \Omega_H$.
In some cases, both criteria (UMP and UMC) lead to the same optimal tests. In some cases, they do not. (See Problem 31 on page 289.) Either way, there is asymmetry in these definitions. A different criterion is used for protecting against one type of error than that used for protecting against the other. One argument given for the particular choice is that type I error is more costly than type II error, so we arrange for the maximum type I error probability to be small. However, what often happens is that the probability of type II error can become even smaller for most values of the parameter. Here is a simple example.
Example 4.35. Suppose that $X \sim \mathrm{Poi}(\theta)$ given $\Theta = \theta$, and that $\Omega = \{1, 10\}$. We are interested in testing $H : \Theta = 1$ versus $A : \Theta = 10$. The MP level 0.05 test is (see Proposition 4.37 ahead)

$$\phi(x) = \begin{cases} 0 & \text{if } x \le 2, \\ 0.5058 & \text{if } x = 3, \\ 1 & \text{if } x \ge 4. \end{cases}$$

The probability of type II error for this test is 0.0065, which is much smaller than the probability of type I error.
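The randomization constant 0.5058 and the type II error probability 0.0065 follow directly from the Poisson probabilities; a minimal check:

```python
from math import exp, factorial

def poi_pmf(x, theta):
    return exp(-theta) * theta ** x / factorial(x)

def poi_cdf(x, theta):
    return sum(poi_pmf(i, theta) for i in range(x + 1))

# MP level 0.05 test of H: theta = 1 vs A: theta = 10:
# reject for x >= 4, randomize at x = 3 so the size is exactly 0.05.
alpha = 0.05
gamma = (alpha - (1 - poi_cdf(3, 1))) / poi_pmf(3, 1)
assert abs(gamma - 0.5058) < 5e-4

# type II error: probability of accepting H when theta = 10
type2 = poi_cdf(2, 10) + (1 - gamma) * poi_pmf(3, 10)
assert abs(type2 - 0.0065) < 5e-4
```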
In Example 4.35, we protect ourselves more against the less costly error than against the more costly error. If type I error is more costly, it might make sense to minimize it using the UMC criterion and let the probability of type II error be a bit larger.
Example 4.36 (Continuation of Example 4.35; see page 230). The MC floor 0.95 test of $H : \Theta = 1$ versus $A : \Theta = 10$ is

$$\psi(x) = \begin{cases} 0 & \text{if } x \le 4, \\ 0.4516 & \text{if } x = 5, \\ 1 & \text{if } x \ge 6. \end{cases}$$

The probability of type I error for this test is 0.00198. This test provides more protection against the more costly error, while keeping the probability of the other error at a low level.
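The same computation, run in the other direction, recovers this test's randomization constant and type I error probability (here the power at $\theta = 10$ is pinned to 0.95, i.e., type II error 0.05):

```python
from math import exp, factorial

def poi_pmf(x, theta):
    return exp(-theta) * theta ** x / factorial(x)

def poi_cdf(x, theta):
    return sum(poi_pmf(i, theta) for i in range(x + 1))

# accept for x <= 4, reject for x >= 6, randomize at x = 5
# so that the power at theta = 10 is exactly 0.95
floor_power = 0.95
gamma = (floor_power - (1 - poi_cdf(5, 10))) / poi_pmf(5, 10)
assert abs(gamma - 0.4516) < 1e-3

# type I error: probability of rejecting H when theta = 1
type1 = gamma * poi_pmf(5, 1) + (1 - poi_cdf(5, 1))
assert abs(type1 - 0.00198) < 5e-5
```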
Another alternative is to try to balance the costs of the two types of error in a deliberate fashion. For example, Lehmann (1958) offered the suggestion that one decrease the required level as the sample size increases, so that the probability of type II error would decrease along with it. Schervish (1983) suggested that the size of the test be matched to the power function at an alternative chosen based on substantive grounds.
Not many theorems can be proven about MP and UMP tests in general
without some assumptions of additional structure. One general result is al-
ready familiar to us. The Neyman-Pearson lemma 3.87 provided a minimal
complete class for this decision problem. For convenience, we restate that
result here using the language of hypothesis testing.
Proposition 4.37 (Neyman-Pearson fundamental lemma). Let $\Omega = \{\theta_0, \theta_1\}$ and let $P_\theta \ll \nu$ for some measure $\nu$ and both values of $\theta$. Let $f_i(x) = dP_{\theta_i}/d\nu(x)$ for $i = 0, 1$. Let $H : \Theta = \theta_0$ and $A : \Theta = \theta_1$. For each $k \in (0, \infty)$ and each function $\gamma : \mathcal{X} \to [0, 1]$, define the test

$$\phi_{k,\gamma}(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x), \\ \gamma(x) & \text{if } f_1(x) = k f_0(x), \\ 0 & \text{if } f_1(x) < k f_0(x). \end{cases}$$

Also define the two tests

$$\phi_0(x) = \begin{cases} 1 & \text{if } f_1(x) > 0, \\ 0 & \text{if } f_1(x) = 0, \end{cases} \qquad \phi_\infty(x) = \begin{cases} 1 & \text{if } f_0(x) = 0, \\ 0 & \text{if } f_0(x) > 0. \end{cases}$$

All of these tests are MP of their respective levels and MC of their respective floors.
Note that $\phi_\infty$ will have size 0 because it never rejects $H$ when $f_0 > 0$. On the other hand, $\phi_0$ will have the largest possible size for an admissible test, equal to 1 in many problems but not always.
The following result gives conditions under which MP tests are essentially unique.
Lemma 4.38. Let $\Omega = \{\theta_0, \theta_1\}$ and let $P_\theta \ll \nu$ for some measure $\nu$ and both values of $\theta$. Let $f_i(x) = dP_{\theta_i}/d\nu(x)$ for $i = 0, 1$, and let

$$B_k = \{x : f_1(x) = k f_0(x)\}.$$

Suppose that for all $k \in [0, \infty]$, $P_{\theta_i}(B_k) = 0$ for $i = 0, 1$. Let $\phi$ be a test of $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$ of the form

$$\phi(x) = \begin{cases} 1 & \text{if } f_1(x) > k f_0(x), \\ 0 & \text{otherwise}, \end{cases}$$

and let $\psi$ be another test such that $\beta_\psi(\theta_0) = \beta_\phi(\theta_0)$. Then either $\phi = \psi$ a.s. $[P_{\theta_i}]$ for $i = 0, 1$, or $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$.
PROOF. Let $\phi$ and $\psi$ be as stated in the lemma. Define $A_> = \{x : \psi(x) > \phi(x)\}$ and $A_< = \{x : \psi(x) < \phi(x)\}$. Clearly, $\phi$ is in the form of a test from Proposition 4.37, and so it is MP of its size. Also, since $\phi$ only takes on the values 0 and 1, we have

$$A_> \subseteq \{x : \phi(x) = 0\} = \{x : f_1(x) \le k f_0(x)\}, \qquad A_< \subseteq \{x : \phi(x) = 1\} = \{x : f_1(x) > k f_0(x)\}.$$

It follows that

$$\int_{A_>} [\phi(x) - \psi(x)]\, f_1(x)\, d\nu(x) \ge k \int_{A_>} [\phi(x) - \psi(x)]\, f_0(x)\, d\nu(x),$$
$$\int_{A_<} [\phi(x) - \psi(x)]\, f_1(x)\, d\nu(x) \ge k \int_{A_<} [\phi(x) - \psi(x)]\, f_0(x)\, d\nu(x). \tag{4.39}$$

Because of the way $A_>$ and $A_<$ are defined, the sum of the two left-hand sides in (4.39) is $\beta_\phi(\theta_1) - \beta_\psi(\theta_1)$, and the sum of the two right-hand sides is $k[\beta_\phi(\theta_0) - \beta_\psi(\theta_0)] = 0$. Now, assume that $P_{\theta_1}(\phi(X) = \psi(X)) < 1$. It follows that $P_{\theta_1}(A_> \cup A_<) > 0$. Since $P_{\theta_1}(B_k) = 0$, we have that at least one of the inequalities in (4.39) is strict. In this case $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$. Finally, assume that $P_{\theta_0}(\phi(X) = \psi(X)) < 1$. It follows that either $P_{\theta_0}(A_>) > 0$ or $P_{\theta_0}(A_<) > 0$. In the latter case, $P_{\theta_1}(A_<) > 0$, and we have just proven that $\beta_\phi(\theta_1) > \beta_\psi(\theta_1)$. If $P_{\theta_0}(A_>) > 0$ and $P_{\theta_0}(A_<) = 0$, then $\psi \ge \phi$ with strict inequality on a set of positive $P_{\theta_0}$ probability, which would contradict $\beta_\psi(\theta_0) = \beta_\phi(\theta_0)$. $\square$

Example 4.40. Let $X \sim N(\theta, 1)$ given $\Theta = \theta$, and let $\Omega = \{\theta_0, \theta_1\}$. Then $B_k$ is a singleton set for every $k$, and $P_{\theta_i}(B_k) = 0$ for every $k$ and $i = 0, 1$.
Example 4.41. Let $X \sim U(0, \theta)$ given $\Theta = \theta$, and let $\Omega = \{\theta_0, \theta_1\}$. If $k = \theta_0/\theta_1$, then $B_k$ is a set with positive probability under both $P_{\theta_0}$ and $P_{\theta_1}$. So, the conditions of Lemma 4.38 are not met in this example.
4.3.1 Simple Hypotheses and Alternatives

In a simple-simple testing problem, the parameter space has only two points in it, and so the risk function has only two values. This makes it particularly easy to compare the risk functions of all tests at once. Each test corresponds to a point in two-dimensional space. Each coordinate is the risk function evaluated at one of the parameter values. For definiteness, let $\Omega = \{\theta_0, \theta_1\}$ and let $\Omega_H = \{\theta_0\}$, so that $\Omega_A = \{\theta_1\}$. Let $\alpha_0$ stand for $\beta_\phi(\theta_0)$ and $\alpha_1$ for $1 - \beta_\phi(\theta_1)$ for an arbitrary test $\phi$. Then the risk function of $\phi$ is represented by the point $(\alpha_0, \alpha_1) \in [0, 1]^2$. The risk set, as defined in Definition 3.71, is the set of all possible $(\alpha_0, \alpha_1)$ points.
Example 4.42. Suppose that $X \sim N(\theta, 1)$ given $\Theta = \theta$ and $\Omega = \{0, 1\}$. According to the Neyman-Pearson fundamental lemma 4.37, the MP tests of $H : \Theta = 0$ versus $A : \Theta = 1$ are those that reject $H$ when $\exp(-[x - 1]^2/2) > k \exp(-x^2/2)$ for various values of $k \in (0, \infty]$. This inequality simplifies to $x > c$ for arbitrary $c \in [-\infty, \infty]$. For each $c$, we get a point in the risk set with

$$\alpha_0 = P_0(X > c) = 1 - \Phi(c) = \Phi(-c), \qquad \alpha_1 = P_1(X \le c) = \Phi(c - 1).$$

A plot of these points is given in Figure 4.44.
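The curve of risk points is straightforward to trace; a sketch (the cutoff 1.645 is an illustrative choice, not from the text):

```python
import math

def std_normal_cdf(u):
    return 0.5 * (1 + math.erf(u / math.sqrt(2)))

def risk_point(c):
    # risk point (alpha0, alpha1) for the test "reject when X > c"
    alpha0 = std_normal_cdf(-c)        # P_0(X > c), type I error
    alpha1 = std_normal_cdf(c - 1.0)   # P_1(X <= c), type II error
    return alpha0, alpha1

a0, a1 = risk_point(1.645)             # roughly the level-0.05 MP test
assert abs(a0 - 0.05) < 1e-3

# the curve lies strictly below the diagonal alpha1 = 1 - alpha0
# traced out by the trivial randomized tests phi(x) = const
for c in [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]:
    a0, a1 = risk_point(c)
    assert a1 < 1 - a0
```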

Figure 4.44 has several features that are typical of all risk sets.
Lemma 4.43. The risk set for a simple-simple hypothesis-testing problem is closed, convex, and symmetric about the point (1/2, 1/2). It also contains that portion of the line $\alpha_1 = 1 - \alpha_0$ lying in the unit square.
PROOF. The rule $\phi(x) \equiv \alpha_0$ corresponds to the point $(\alpha_0, 1 - \alpha_0)$ for each $\alpha_0$ between 0 and 1, so the risk set contains the portion of the line $\alpha_1 = 1 - \alpha_0$ which lies in the unit square.
Suppose that $(a, b)$ is in the risk set. The symmetrically placed point about $(.5, .5)$ is $(1 - a, 1 - b)$. If $\phi$ produces the first point, then $1 - \phi$ produces the second point. Hence the risk set is symmetric about $(.5, .5)$.
Lemma 3.74 shows that the risk set is convex.
To show that the risk set is closed, we need only show that it contains its lower boundary $\partial_L$. The rest of the boundary is included by symmetry and convexity (and the fact that the line $\alpha_1 = 1 - \alpha_0$ is in). By Proposition 3.91, a Bayes rule exists for every prior. By Lemma 3.96, every point on $\partial_L$ is the risk function for one of these Bayes rules; hence $\partial_L$ is in the risk set. $\square$

FIGURE 4.44. Risk Set for Testing $H : \Theta = 0$ versus $A : \Theta = 1$ with $X \sim N(\theta, 1)$

The definition of the lower boundary $\partial_L$ of the risk set (see Definition 3.71) is designed so that exactly those tests which produce points on $\partial_L$ are admissible. The lower boundary also consists solely of Bayes rules. (See Lemma 3.96.) Proposition 3.91 tells us that if a Bayes rule with respect to some prior is not in $\partial_L$, then one of the two prior probabilities is 0 and there is another Bayes rule with respect to that prior that is in $\partial_L$. These considerations lead to the following result.
Lemma 4.45.^9 If a test $\phi$ is MP level $\alpha$ for testing $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$, then either $\beta_\phi(\theta_1) = 1$ or $\beta_\phi(\theta_0) = \alpha$.
PROOF. We will prove the contrapositive. Suppose that both $\beta_\phi(\theta_1) < 1$ and $\beta_\phi(\theta_0) < \alpha$. Create another test $\phi'$ as follows. Let $A = \{x : \phi(x) < 1\}$. Note that $P_{\theta_1}(A) > 0$, since $\beta_\phi(\theta_1) < 1$. Let $g_c(x) = \min\{c, 1 - \phi(x)\}$. Then, for $c \ge 0$, $h_i(c) = \mathrm{E}_{\theta_i} g_c(X) \ge 0$ and $h_i(c)$ is nondecreasing in $c$. Also, it is easy to see that $h_i(c)$ is continuous in $c$, since $|h_i(c) - h_i(d)| \le |c - d|$. It follows that there exists $c$ such that $h_0(c) = \alpha - \beta_\phi(\theta_0)$. Let $\phi' = \phi + g_c$. It follows that $\beta_{\phi'}(\theta_0) = \alpha$. Also, $\phi'(x) > \phi(x)$ for all $x \in A$. Since $P_{\theta_1}(A) > 0$, it follows that $h_1(c) > 0$ and $\beta_{\phi'}(\theta_1) > \beta_\phi(\theta_1)$. So $\phi$ is not MP level $\alpha$. $\square$
Lemma 4.45 says that a test that is MP level $\alpha$ must have size $\alpha$ unless all tests with size $\alpha$ are inadmissible. This result allows us to say when the two optimality criteria (MP and MC) are equivalent in the simple-simple testing situation.

^9 This lemma is used in the proofs of Lemmas 4.47 and 4.103.



Proposition 4.46. Suppose that $\alpha_0$ and $\alpha_1$ are both strictly between 0 and 1. Suppose that a test $\phi$ of $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$ corresponds to the point $(\alpha_0, \alpha_1)$ in the risk set. Then $\phi$ is MC floor $1 - \alpha_1$ if and only if it is MP level $\alpha_0$.
The reason for the restriction that $\alpha_0$ and $\alpha_1$ be strictly between 0 and 1 is that many tests with $\alpha_0 \in \{0, 1\}$ or $\alpha_1 \in \{0, 1\}$ are inadmissible even though they may satisfy one of the two optimality criteria.
The following lemma allows us to conclude that if $\phi$ is an MP level $\alpha$ test and we switch the names of hypothesis and alternative, then $1 - \phi$ becomes an MC floor $1 - \alpha$ test in the new problem. (See Problem 30 on page 289 for a more general version.)
Lemma 4.47. If $\phi$ is MP level $\alpha$ for testing $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$, then $1 - \phi$ has the smallest power at $\theta_1$ among all tests with size at least $1 - \alpha$.
PROOF. First, note that $1 - \phi$ has size at least $1 - \alpha$. Next, suppose that $\beta_\phi(\theta_1) = 1$. Then $1 - \phi$ has power 0 at $\theta_1$ and is clearly the least powerful of any class to which it belongs. By Lemma 4.45, the only other case to consider is that in which $\beta_\phi(\theta_0) = \alpha$. In this case, $1 - \phi$ has size $1 - \alpha$. Suppose, to the contrary, that $\beta_\psi(\theta_0) \ge 1 - \alpha$ and $\beta_\psi(\theta_1) < \beta_{1-\phi}(\theta_1)$. Then $\beta_{1-\psi}(\theta_0) \le \alpha$ and $\beta_{1-\psi}(\theta_1) > \beta_\phi(\theta_1)$, which contradicts the assumption that $\phi$ is MP level $\alpha$. $\square$
The following lemma says that in the comparison of two MP Neyman-Pearson tests, the one with the smaller level will also have the smaller power.
Lemma 4.48.^10 Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family. If $\phi_1$ is a level $\alpha_1$ test of the form of the Neyman-Pearson fundamental lemma 4.37 for testing $H : \Theta = \theta_0$ versus $A : \Theta = \theta_1$, and if $\phi_2$ is a level $\alpha_2$ test of that form with $\alpha_1 < \alpha_2$, then $\beta_{\phi_1}(\theta_1) < \beta_{\phi_2}(\theta_1)$.
PROOF. By the Neyman-Pearson fundamental lemma 3.87, both $\phi_1$ and $\phi_2$ are admissible. If $\beta_{\phi_1}(\theta_1) \ge \beta_{\phi_2}(\theta_1)$, then $\phi_2$ is inadmissible. $\square$
The Neyman–Pearson fundamental lemma 4.37 tells us all of the admissible MP and MC tests. We also saw (Theorem 3.95 and Lemma 3.96) that these are the tests corresponding to points on the lower boundary of the risk set, and they are the Bayes rules with respect to positive priors and one Bayes rule for each of the priors that assign 0 probability to one of the parameter values. The usual classical approach to choosing one of the admissible tests is not to choose a prior distribution and then take the Bayes rule, but rather to choose a value of α and then choose the MP level α test. In cases with simple hypotheses and simple alternatives, the classical and Bayesian procedures will agree. That is, for every prior distribution, there

10This lemma is used in the proof of Theorem 4.56.


236 Chapter 4. Hypothesis Testing

is a formal Bayes rule and an α such that the formal Bayes rule is MP level α. Similarly, for every α, there is a prior such that the MP level α test is a formal Bayes rule. Only in cases more complicated than those described so far can we distinguish these two approaches. Example 4.49 is one such case.

Example 4.49. Suppose that X₁ and X₂ are conditionally independent U(0, θ) random variables given Θ = θ and Ω = {1, 2}. Let Ω_H = {1} and Ω_A = {2}. Suppose that Z is independent of X₁ and X₂ given Θ with distribution Ber(1/2). Hence Z is ancillary. Suppose that we observe Z and X = max_{1≤i≤n} X_i, where n = 1 if Z = 0 and n = 2 if Z = 1. That is, the sample size is random (n = Z + 1) but ancillary. The marginal densities of X (given Θ = 1, 2) are

    f₁(x) = 1/2 + x,      if 0 < x < 1,
    f₂(x) = 1/4 + x/4,    if 0 < x < 2.

The MP level α test based solely on X is¹¹ φ(x) = 1 if f₂(x)/f₁(x) > c for some c. This becomes

    (1/4 + x/4)/(1/2 + x) > c,    or x ≥ 1.

For 1/3 ≤ c ≤ 1/2, the first inequality is x < (1/4 − c/2)/(c − 1/4). For c < 1/3, the inequality is x > 0, and for c > 1/2, the inequality is x ≥ 1. So, the MP level α test for 0 < α < 1 is φ(x) = 1 if x ≥ 1 or x < (√(1 + 8α) − 1)/2. The power of this test is β_φ(2) = (9 + 4α + √(1 + 8α))/16.
This test may seem odd because it picks θ = 2 for small values of x. An alternative test is formed by conditioning on the ancillary Z. If Z = 0 (n = 1), the MP level α conditional test is ψ(x) = 1 if x > 1 − α. Actually, we could have chosen x ≥ 1 together with any interval of length α, but the test is easier to write if we put the interval next to x ≥ 1. Similarly, if Z = 1 (n = 2), then the MP level α conditional test is ψ(x) = 1 if x > √(1 − α). (Once again, the ratio of densities is constant for all x < 1, so we could have chosen any set with probability α given Θ = 1.) The power of this conditional test can be calculated to equal (5 + 3α)/8, which is always smaller than the power of the test φ. So the test conditional on the ancillary is inadmissible.
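The two power functions can be checked numerically. The following sketch (our own Python, not part of the text; the function names are ours) evaluates the closed-form size and power derived above and confirms that the unconditional test dominates the conditional one for 0 < α < 1.

```python
import math

def size_unconditional(alpha):
    # Under theta = 1, P(X >= 1) = 0 and P(X < t) = t/2 + t^2/2
    # with t = (sqrt(1 + 8*alpha) - 1)/2, so the size is alpha.
    t = (math.sqrt(1 + 8 * alpha) - 1) / 2
    return t / 2 + t * t / 2

def power_unconditional(alpha):
    # beta_phi(2) = (9 + 4*alpha + sqrt(1 + 8*alpha)) / 16
    return (9 + 4 * alpha + math.sqrt(1 + 8 * alpha)) / 16

def power_conditional(alpha):
    # power of the test with conditional size alpha given Z: (5 + 3*alpha)/8
    return (5 + 3 * alpha) / 8

for alpha in [0.01, 0.05, 0.1, 0.5, 0.9]:
    assert abs(size_unconditional(alpha) - alpha) < 1e-12
    assert power_unconditional(alpha) > power_conditional(alpha)
```

The two powers agree only at the endpoints α = 0 and α = 1; strictly between, the unconditional test is strictly more powerful.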
The fact that the MP level α test conditional on the ancillary is inadmissible has led some [e.g., Bondar (1988)] to conclude that we should not condition on the ancillary in such problems. This mistaken conclusion is due to the fact that the conditional test being compared has conditional size α given all values of the ancillary. The lesson should be that we should not fix the size to be the same for all values of the ancillary, but we should continue to condition on the ancillary. If we have a prior (π₁, π₂) with π₁ + π₂ = 1, then the Bayes rule will be drastically different depending on whether Z = 0 or Z = 1 is observed.¹² A complete characterization of the Bayes rules φ_{π₁} for all values of π₁ is as follows.

11This test is not the MP level α test based on the joint distribution of the data (X, Z).
12These Bayes rules will be the MP level α tests based on the joint distribution of (X, Z) according to the Neyman–Pearson fundamental lemma 4.37.
4.3. Most Powerful Tests 237

We always have φ_{π₁}(x) = 1 for x ≥ 1. For 0 < x < 1, the value of φ_{π₁}(x) is

    π₁                       Z = 0        Z = 1
    < 1/5                    1            1
    = 1/5                    1            arbitrary
    between 1/5 and 1/3      1            0
    = 1/3                    arbitrary    0
    > 1/3                    0            0

Notice that the Bayes rule is never the same as either the unconditional level α test or the conditional level α test when 0 < α < 1. That is, there is no value of π₁ strictly between 0 and 1 such that the Bayes rule rejects H for some (but not all) 0 < x < 1 for both Z = 0 and Z = 1. The Bayes rule either rejects H for all values of 0 < x < 1 for at least one of the Z values or it rejects H for no values of 0 < x < 1 for at least one of the Z values. The power functions of the Bayes rules are
    π₁                       β(1)         β(2)
    < 1/5                    1            1
    = 1/5                    (1 + a)/2    (7 + a)/8
    between 1/5 and 1/3      1/2          7/8
    = 1/3                    a/2          (5 + 2a)/8
    > 1/3                    0            5/8
Here a is any number between 0 and 1 corresponding to the "arbitrary" parts of some of the Bayes rules. Note that for each α between 0 and 1, there is a Bayes rule with size α. (For example, to get α = 0.05, let a = 0.1 in the fourth row of the table. One such test is the following. If Z = 1 is observed, φ_{1/3}(x) = 1 for x ≥ 1, and if Z = 0 is observed, φ_{1/3}(x) = 1 for x > 0.9.) Since the Bayes rule with size α is the MP level α test, it has higher power than the unconditional size α test. (See Problem 15 on page 287.) On the other hand, it does not have conditional level α given the ancillary.
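A small Monte Carlo check of this comparison (our own sketch; the setup and numbers come from the example, but the code is not from the text) simulates the Bayes rule φ_{1/3} with a = 0.1 and compares its exact power 0.65 against the power ≈ 0.649 of the unconditional size-0.05 test based on X alone.

```python
import math
import random

random.seed(0)

def bayes_reject(x, z):
    # Bayes rule phi_{1/3} with a = 0.1: reject if x >= 1,
    # or, when Z = 0, also for x > 0.9 (an interval of probability 0.1).
    return x >= 1 or (z == 0 and x > 0.9)

def simulate(theta, n_sims=200_000):
    hits = 0
    for _ in range(n_sims):
        z = random.randint(0, 1)   # ancillary, Ber(1/2)
        n = z + 1                  # random but ancillary sample size
        x = max(random.uniform(0, theta) for _ in range(n))
        hits += bayes_reject(x, z)
    return hits / n_sims

size = simulate(1)    # exact value: a/2 = 0.05
power = simulate(2)   # exact value: (5 + 2*0.1)/8 = 0.65
uncond = (9 + 4 * 0.05 + math.sqrt(1 + 8 * 0.05)) / 16  # about 0.649
assert abs(size - 0.05) < 0.01
assert abs(power - 0.65) < 0.01
assert 0.65 > uncond  # Bayes rule beats the unconditional size-0.05 test
```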
This example illustrates a conflict between two principles of classical statistics. The principle of conditioning on ancillaries together with the principle of choosing MP level α tests leads to the MP conditional size α test. This test is dominated by the MP unconditional size α test that ignores the ancillary. This test, in turn, is dominated by the unconditional size α test that makes use of the ancillary. The natural conclusion is to use this last test. But if we are to make use of the ancillary, aren't we supposed to condition on it? And if we condition on the ancillary, aren't we supposed to use the conditional size α test? The reason that it is difficult to justify (in the classical framework) using the size α test based on the whole data is that once the ancillary is observed, the conditional size of the test changes depending on the value of the ancillary. Why should an ancillary affect my choice of the size of the test? This begs the more important question, "How should the size of a test be chosen in a particular problem?" There are no general decision theoretic principles that lead one to be able to choose the size of a test based on a loss function or a prior distribution. There are cases in which one can find a simple correspondence between the size of a test and a loss function, but these seem to depend on additional structure not present in all problems, or they seem to be isolated instances not easily generalized. (See Theorem 6.74 on page 376 for a description of some additional structure and Example 4.61 on page 241 for an isolated example.)

4.3.2 Simple Hypotheses, Composite Alternatives


The next most complicated testing situation taken up by the classical theory is that of a simple hypothesis versus a composite alternative.¹³ It is clear that, even from a decision theoretic perspective, a UMP level α test will have no larger Bayes risk than any other test whose size is α, no matter what prior distribution we use, so long as Ω_H = {θ₀} and Ω = {θ₀} ∪ Ω_A. (See Problem 16 on page 287.)
Example 4.50. Suppose that X ~ N(θ, 1) given Θ = θ and Ω = [θ₀, ∞) with Ω_H = {θ₀}. For each θ₁ ∈ Ω_A, the MP level α test is φ(x) = 1 if f_{X|Θ}(x|θ₁)/f_{X|Θ}(x|θ₀) > k. We can calculate the ratio

    f_{X|Θ}(x|θ₁)/f_{X|Θ}(x|θ₀) = exp{x(θ₁ − θ₀) − (θ₁² − θ₀²)/2}.

This is greater than k if and only if

    x > t = [log k + (θ₁² − θ₀²)/2]/(θ₁ − θ₀) = (θ₁ + θ₀)/2 + log k/(θ₁ − θ₀).

(If we had a 0–1–c loss and the prior probability of θ_i were π_i, we would obtain this same test as long as k = cπ₀/π₁.) The size of the test φ_t(x) = 1 if x > t is α_t = 1 − Φ(t − θ₀). So, t = θ₀ + Φ⁻¹(1 − α_t). For fixed α, the MP level α test of H versus A₁ : Θ = θ₁ is φ(x) = 1 if x > θ₀ + Φ⁻¹(1 − α). Notice that this is the same test for every θ₁. Hence this test is UMP level α for testing H versus A. Also notice that the conditions of Lemma 4.38 are met in this example, so that the UMP level α test is also the unique MP size α test for each θ₁ ∈ Ω_A and hence the unique UMP level α test.
Now, consider a Bayesian approach in which the prior distribution satisfies Pr(Θ = θ₀) = p₀ > 0. Suppose that the conditional prior distribution of Θ given Θ > θ₀ is a measure λ. It is not difficult to calculate the Bayes factor (see Section 4.2.2) in this case as

    (1 − p₀) Pr(Θ = θ₀|X = x) / [p₀ Pr(Θ ≠ θ₀|X = x)]
        = exp(−θ₀²/2) [∫_{(θ₀,∞)} exp(x[θ − θ₀] − θ²/2) dλ(θ)]⁻¹.

Since θ > θ₀ in the exponent inside the integral, the Bayes factor is a decreasing function of x no matter what λ is. Hence, the formal Bayes rule (with 0–1–c

13By dealing with this case next, we postpone the issue that arises when the
size of the test is the supremum of the power function over the hypothesis rather
than just the value of the power function at the hypothesis. This subtle point is
actually at the root of the asymmetry between hypotheses and alternatives.

loss) will be to reject H if x > t is observed, where t is the largest x so that Pr(Θ = θ₀|X = x) ≥ 1/(1 + c). In this case the prior distribution determines t, which in turn determines the size of the Bayes rule when thought of as a size α test. Each Bayes rule is a UMP level α test for some α, but the particular α (even for fixed c) depends on the prior. Alternatively, for a fixed prior, each value of c will lead to a different α, but the correspondence between α and c (although monotone for each prior) will be different for different priors.
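The fact that every θ₁ yields the same rejection region can be checked numerically. The sketch below (our own Python, using the standard library's NormalDist; not from the text) recovers the common threshold t = θ₀ + Φ⁻¹(1 − α) from the likelihood-ratio form for several values of θ₁.

```python
import math
from statistics import NormalDist

def lr_threshold(theta0, theta1, k):
    # The likelihood ratio exceeds k iff x > t with
    # t = log(k)/(theta1 - theta0) + (theta1 + theta0)/2.
    return math.log(k) / (theta1 - theta0) + (theta1 + theta0) / 2

theta0, alpha = 0.0, 0.05
t = theta0 + NormalDist().inv_cdf(1 - alpha)  # ~1.6449, independent of theta1
assert abs((1 - NormalDist().cdf(t - theta0)) - alpha) < 1e-9  # size is alpha

# For each theta1 > theta0 there is a k producing the same threshold t:
for theta1 in [0.5, 1.0, 2.0]:
    k = math.exp((theta1 - theta0) * t - (theta1 ** 2 - theta0 ** 2) / 2)
    assert abs(lr_threshold(theta0, theta1, k) - t) < 1e-9
```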

The dual concept of UMC test would apply if the hypothesis were composite and the alternative were simple. It would then be the case that, if we switched the names of hypothesis and alternative, 1 minus the UMP level α test would become the UMC floor 1 − α test. (See Problem 30 on page 289.) This case is interesting in that it is never discussed in the hypothesis-testing literature because the classical theory is not equipped to deal with it. If the power function is continuous (as it is in most of the examples that we can calculate) the power of an MP level α test of H : Θ < θ₀ versus A : Θ = θ₀ will be α. The optimality criterion gives us no way to choose between level α tests. They all have the same power on the alternative. Clearly, though, some are better than others. In fact, those that have smaller power functions for θ < θ₀ are better. But the MP criterion does not accommodate such comparisons.

4.3.3 One-Sided Tests


Next we consider the case in which both H and A are composite. For now, we will proceed along the classical UMP level α lines. We begin by introducing a concept that makes UMP and UMC tests come out the same.
Definition 4.51. If Ω ⊆ ℝ, X ⊆ ℝ, and dP_θ/dν(x) = f_{X|Θ}(x|θ) for some measure ν, then the parametric family is said to have monotone likelihood ratio (MLR) if, whenever θ₁ < θ₂, f_{X|Θ}(x|θ₂)/f_{X|Θ}(x|θ₁) is a monotone function of x a.e. [P_{θ₁} + P_{θ₂}] in the same direction for all pairs θ₁ and θ₂. A parametric family has increasing MLR if the ratio is increasing for all θ₁ < θ₂, and it has decreasing MLR if the ratio is decreasing.
Example 4.52. Suppose that X has a one-parameter exponential family distribution with Θ being the natural parameter, f_{X|Θ}(x|θ) = c(θ)h(x)exp(θx). Then

    f_{X|Θ}(x|θ₂)/f_{X|Θ}(x|θ₁) = [c(θ₂)/c(θ₁)] exp{x(θ₂ − θ₁)},

which is increasing in x for all θ₁ < θ₂.


Example 4.53. Suppose that f_{X|Θ}(x|θ) = (π[1 + (x − θ)²])⁻¹, the Cauchy distribution with location shifts. Then

    f_{X|Θ}(x|θ₂)/f_{X|Θ}(x|θ₁) = [1 + (x − θ₁)²]/[1 + (x − θ₂)²].

This ratio goes to 1 as x approaches either ∞ or −∞, but it is not constantly 1 and hence is not monotone. The same problem occurs with Cauchy distributions having only a scale parameter. However, the absolute value of a Cauchy random variable with a scale parameter does have MLR. (See Problem 17 on page 287.)
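The failure of monotonicity for the location family can be seen numerically. The following sketch (our own Python, not from the text) evaluates the ratio on a grid and confirms that it is neither nondecreasing nor nonincreasing, while approaching 1 in both tails.

```python
import math

def cauchy_pdf(x, theta):
    # Cauchy location density 1 / (pi * (1 + (x - theta)^2))
    return 1.0 / (math.pi * (1.0 + (x - theta) ** 2))

xs = [i / 10 for i in range(-100, 101)]
# Ratio f(x | theta2) / f(x | theta1) for theta1 = 0 < theta2 = 1:
ratio = [cauchy_pdf(x, 1.0) / cauchy_pdf(x, 0.0) for x in xs]

def nondecreasing(v):
    return all(b >= a for a, b in zip(v, v[1:]))

assert not nondecreasing(ratio)        # not increasing everywhere
assert not nondecreasing(ratio[::-1])  # not decreasing everywhere
assert abs(ratio[0] - 1) < 0.2 and abs(ratio[-1] - 1) < 0.25  # tails near 1
```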

Example 4.54. Suppose that f_{X|Θ}(x|θ) = 1/θ for 0 < x < θ. For θ₂ > θ₁, write the likelihood ratio as

    f_{X|Θ}(x|θ₂)/f_{X|Θ}(x|θ₁) =
        undefined    if x ≤ 0,
        θ₁/θ₂        if 0 < x < θ₁,
        ∞            if θ₁ ≤ x < θ₂,
        undefined    if x ≥ θ₂.

The two "undefined" regions can be ignored because the ratio need only be monotone in x a.e. [P_{θ₁} + P_{θ₂}]. The "undefined" regions have 0 probability under both P_{θ₁} and P_{θ₂}. The likelihood ratio is then seen to be monotone increasing for every θ₁ < θ₂, although it is not strictly increasing. It only takes on two different values.

Proposition 4.55. If g is measurable, monotonic, and one-to-one, and Y = g(X), and the family of distributions of X has MLR, then so does the family of distributions of Y.
Theorem 4.56. If {P_θ : θ ∈ Ω} is a parametric family with increasing MLR, then every test of the form

    φ(x) =
        1    if x > x₀,
        γ    if x = x₀,        (4.57)
        0    if x < x₀

has nondecreasing (in θ) power function. Furthermore, each such test is UMP of its size for testing H : Θ ≤ θ₀ versus A : Θ > θ₀, no matter what θ₀ is. Finally, for each α ∈ [0, 1] and each θ₀ ∈ Ω, there exists x₀ ∈ ℝ ∪ {∞} and γ ∈ [0, 1] such that the test is UMP level α for testing H versus A.
PROOF. Let θ₁ < θ₂ ∈ Ω. The Neyman–Pearson fundamental lemma 4.37 says that the MP test of H₁ : Θ = θ₁ versus A₂ : Θ = θ₂ is

    φ(x) =
        1       if f_{X|Θ}(x|θ₂) > k f_{X|Θ}(x|θ₁),
        γ(x)    if f_{X|Θ}(x|θ₂) = k f_{X|Θ}(x|θ₁),
        0       if f_{X|Θ}(x|θ₂) < k f_{X|Θ}(x|θ₁).

Because the parametric family has increasing MLR, we can write φ as

    φ(x) =
        1       if x > t₂,
        γ(x)    if t₁ ≤ x ≤ t₂,        (4.58)
        0       if x < t₁.

(Note that we have put x = t₁ and x = t₂ into the γ(x) category. It may be required to set γ(t₂) = 0 and/or γ(t₁) = 1, depending on the values of f_{X|Θ}(t₂|θ₂)/f_{X|Θ}(t₂|θ₁) and f_{X|Θ}(t₁|θ₂)/f_{X|Θ}(t₁|θ₁) if X does not have a continuous distribution.) Let φ be of the form (4.58) and set α′ = β_φ(θ₁). Let φ_{α′}(x) ≡ α′, and since φ is MP, it follows that β_φ(θ₂) ≥ α′. Hence, we have shown that φ has nondecreasing power function.

[FIGURE 4.59 appeared here: the CDF of X under θ₀, with the point x₀ marked, a possible jump of height at least α − α* at x₀, and the level 1 − α indicated. Caption: FIGURE 4.59. Step in Proof of Theorem 4.56.]
Now, let α ∈ [0, 1] be given, and let

    x₀ =
        inf{x : P_{θ₀}(−∞, x] ≥ 1 − α}    if α < 1,
        inf{x : P_{θ₀}(−∞, x] > 0}        if α = 1.

Then α* = P_{θ₀}(x₀, ∞) ≤ α and P_{θ₀}({x₀}) ≥ α − α*. (See Figure 4.59.) Let

    γ* =
        (α − α*)/P_{θ₀}({x₀})    if P_{θ₀}({x₀}) > 0,
        0                        if P_{θ₀}({x₀}) = 0.

(Note that x₀ = −∞ is possible for α = 1 and an unbounded distribution. In this case, P_{θ₀}({x₀}) = 0, α* = 1, and γ* = 0.) Let φ be of the form (4.58) with t₁ = x₀ = t₂ and γ(x₀) = γ*. Then β_φ(θ₀) = α and φ is MP level α for testing H₀ : Θ = θ₀ versus A₁ : Θ = θ₁ for every θ₁ > θ₀, since it is the same test for all θ₁. Hence, φ is UMP level α for testing H₀ versus A. Finally, let ψ be any test having level α for H. Then ψ has level at most α for H₀, and by Lemma 4.48, φ is at least as powerful as ψ at all θ ∈ Ω_A. Since β_φ(θ) is nondecreasing, φ has level α for H, so it is UMP level α for testing H versus A. □

Definition 4.60. A test such as in Theorem 4.56 is called a one-sided test. A hypothesis of the form H : Θ ≤ θ₀ or H : Θ ≥ θ₀ is called a one-sided hypothesis.
Example 4.61. Suppose that X ~ Poi(θ) given Θ = θ. Then f_{X|Θ}(x|θ) = exp(−θ)θˣ/x! for x = 0, 1, ... is the conditional density of X with respect to counting measure. For θ₁ < θ₂,

    f_{X|Θ}(x|θ₂)/f_{X|Θ}(x|θ₁) = exp(θ₁ − θ₂)(θ₂/θ₁)ˣ,

which increases in x. Hence, this family has increasing MLR. The UMP level 0.05 test of H : Θ ≤ 1 versus A : Θ > 1 is

    φ(x) =
        1    if x > v,
        γ    if x = v,
        0    if x < v,

where v and γ are chosen so that

    γ exp(−1)/v! + Σ_{x=v+1}^{∞} exp(−1)/x! = 0.05.

A little algebra shows us that v = 3 and γ = 0.506.

As a possible Bayesian solution to this problem, suppose that we have a 0–1–c loss and the prior for Θ is Γ(a, b). Then the posterior given X = x is Γ(a + x, b + 1). Since gamma distributions are stochastically larger the larger the first parameter is (for fixed second parameter), we conclude that Pr(Θ ≤ 1|X = x) < 1/(1 + c) if and only if x ≥ x₀ for some x₀. For the improper prior with a = b = 0 (corresponding to density dθ/θ) and with c = 19 (so 1/(1 + c) = 0.05), the value of x₀ is 4. This is the same as the UMP level 0.05 test except for the randomization at x = 3.
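Both numbers can be reproduced with a few lines of arithmetic. The sketch below (our own Python; function names are ours) finds v and γ for the UMP level 0.05 test and the Bayesian cutoff x₀ for the improper prior.

```python
import math

def pois_pmf(x, theta=1.0):
    return math.exp(-theta) * theta ** x / math.factorial(x)

alpha = 0.05
# v is the smallest integer with P(X > v | theta = 1) <= alpha:
v = 0
while 1 - sum(pois_pmf(x) for x in range(v + 1)) > alpha:
    v += 1
tail = 1 - sum(pois_pmf(x) for x in range(v + 1))  # P(X > v)
gamma = (alpha - tail) / pois_pmf(v)               # randomization at x = v
assert v == 3 and abs(gamma - 0.506) < 1e-3

# Bayesian part: with a = b = 0 the posterior is Gamma(x, 1), and
# Pr(Theta <= 1 | X = x) = 1 - exp(-1) * sum_{j < x} 1/j!  for x >= 1.
def post_prob_le_one(x):
    return 1 - math.exp(-1) * sum(1 / math.factorial(j) for j in range(x))

x0 = min(x for x in range(1, 20) if post_prob_le_one(x) < 0.05)
assert x0 == 4
```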

There are also versions of Theorem 4.56 for decreasing MLR and for hypotheses of the form H : Θ ≥ θ₀.
Proposition 4.62. If {P_θ : θ ∈ Ω} is a parametric family with decreasing MLR, then any test of the form

    φ(x) =
        1    if x < x₀,
        γ    if x = x₀,        (4.63)
        0    if x > x₀

has a nondecreasing (in θ) power function. Furthermore, it is UMP of its size for testing H : Θ ≤ θ₀ versus A : Θ > θ₀, no matter what θ₀ is. Finally, for each α ∈ [0, 1] and each θ₀ ∈ Ω, there exists x₀ ∈ ℝ ∪ {∞} and γ ∈ [0, 1] such that the test is UMP level α for testing H versus A.
Proposition 4.64. If {P_θ : θ ∈ Ω} is a parametric family with increasing MLR, then any test of the form (4.63) has a nonincreasing (in θ) power function. Furthermore, it is UMP of its size for testing H : Θ ≥ θ₀ versus A : Θ < θ₀, no matter what θ₀ is. Finally, for each α ∈ [0, 1] and each θ₀ ∈ Ω, there exists x₀ ∈ ℝ ∪ {∞} and γ ∈ [0, 1] such that the test is UMP level α for testing H versus A.

Proposition 4.65. If {P_θ : θ ∈ Ω} is a parametric family with decreasing MLR, then any test of the form (4.57) has a nonincreasing (in θ) power function. Furthermore, it is UMP of its size for testing H : Θ ≥ θ₀ versus A : Θ < θ₀, no matter what θ₀ is. Finally, for each α ∈ [0, 1] and each θ₀ ∈ Ω, there exists x₀ ∈ ℝ ∪ {∞} and γ ∈ [0, 1] such that the test is UMP level α for testing H versus A.

The proofs of these propositions are all very similar to the proof of Theorem 4.56. The tests in these propositions are also called one-sided tests. The following simple result shows that one-sided tests also minimize the power function on the hypothesis among tests with the same size.
Corollary 4.66.¹⁴ Let the family of distributions have MLR, and suppose that the hypothesis is either H : Θ ≤ θ₀ or H : Θ ≥ θ₀. A one-sided UMP level α test φ satisfies β_φ(θ) ≤ β_ψ(θ) for all θ ∈ Ω_H and all tests ψ such that β_ψ(θ₀) ≥ α.
PROOF. Let ψ satisfy β_ψ(θ₀) ≥ α, and let θ ∈ Ω_H. As in the proof of Theorem 4.56, we can show that an MP level 1 − α test of H′ : Θ = θ₀ versus A′ : Θ = θ is any one-sided test with size 1 − α. Since 1 − φ is a one-sided test with size 1 − α and 1 − ψ has level 1 − α, it follows that β_{1−φ}(θ) ≥ β_{1−ψ}(θ), hence β_φ(θ) ≤ β_ψ(θ). □
The following result is a simple consequence of Lemma 4.38, and it gives conditions under which UMP tests are essentially unique. It will also be useful in showing that there are situations in which there are no UMP tests of H : Θ = θ₀ versus A : Θ ≠ θ₀.
Proposition 4.67. Suppose that {P_θ : θ ∈ Ω} is a parametric family with P_θ ≪ ν for all θ and with increasing MLR. Let f_{X|Θ}(·|θ) denote dP_θ/dν. Let θ₀ ∈ Ω, and define

    B_{θ,k} = {x : f_{X|Θ}(x|θ) = k f_{X|Θ}(x|θ₀)}.

Suppose that for all θ > θ₀ and all k ∈ [0, ∞], P_θ(B_{θ,k}) = 0. Let φ be a test of H : Θ = θ₀ versus A : Θ > θ₀ of the form (4.57), and let ψ be another test such that β_φ(θ₀) = β_ψ(θ₀). Then either φ = ψ a.s. [P_θ] for all θ ≥ θ₀, or there exists θ > θ₀ such that β_φ(θ) > β_ψ(θ).
There are versions of Proposition 4.67 for decreasing MLR and for alternatives of the form A : Θ < θ₀, but we will not state them here. An example of one of these propositions is the first part of Example 4.50 on page 238.
The following is a complete class theorem for the case of MLR.

14This corollary is used in the proof of Theorem 4.68.



Theorem 4.68. Let the action space be N = {0, 1}. Suppose that a parametric family has MLR and that there exists θ₀ ∈ Ω such that

    [L(θ, 1) − L(θ, 0)](θ₀ − θ)        (4.69)

has the same sign for all θ ≠ θ₀. Then the appropriate family of one-sided tests is an essentially complete class.
PROOF. Consider the case in which (4.69) is always positive and the family has increasing MLR. Let the hypothesis be H : Θ ≤ θ₀. Let φ be any test. Then

    R(θ, φ) = ∫ {φ(x)L(θ, 1) + [1 − φ(x)]L(θ, 0)} f_{X|Θ}(x|θ) dν(x)
            = L(θ, 0) + [L(θ, 1) − L(θ, 0)] ∫ φ(x) f_{X|Θ}(x|θ) dν(x).

So, if φ and ψ are any two tests, then

    R(θ, ψ) − R(θ, φ) = [L(θ, 1) − L(θ, 0)] ∫ [ψ(x) − φ(x)] f_{X|Θ}(x|θ) dν(x)
                      = [L(θ, 1) − L(θ, 0)][β_ψ(θ) − β_φ(θ)].

Now let ψ be an arbitrary test. We must show that there is a one-sided test φ which is at least as good as ψ. Let φ be a one-sided test with β_φ(θ₀) = β_ψ(θ₀). Then Theorem 4.56 and Corollary 4.66 give us that β_ψ(θ) − β_φ(θ) has the same sign as θ₀ − θ. This implies that R(θ, ψ) ≥ R(θ, φ) for all θ.
For the other cases (decreasing MLR and/or (4.69) always negative), use one of Propositions 4.62–4.65 together with Corollary 4.66 to obtain a similar result. □
We can use Theorem 4.68 to help prove that in an MLR family with a one-sided hypothesis, one-sided tests are UMC as well as UMP. (See Problems 23 and 24 on page 288.)
Proposition 4.70.¹⁵ Let φ be a one-sided test as in Theorem 4.56 or in Propositions 4.62–4.65 for a one-sided hypothesis versus the corresponding one-sided alternative in an MLR family. Suppose that the base of φ is γ. Then φ is UMC floor γ.
Suppose that we have a prior distribution μ_Θ on the parameter space. If there is a test with finite Bayes risk and the loss function is bounded below, then Theorem 4.68 allows us to conclude that one-sided tests are formal Bayes rules. (See Problem 33 on page 289.) For the case of 0–1–c

15This is not precisely the same as Corollary 4.66. Corollary 4.66 says that the power function is minimized on the hypothesis subject to the power function at θ₀ being at least α. If a power function is monotone but not continuous, the base of the test might be different from its size.

loss, the result follows in a simpler fashion from the fact that the posterior
probability of a semi-infinite interval is a monotone function of the data in
MLR families.
Theorem 4.71. Suppose that {P_θ : θ ∈ Ω} is a parametric family of distributions for X with MLR and that μ_Θ is an arbitrary prior distribution on (Ω, τ). Then the posterior probability given X = x that Θ is in a semi-infinite interval is a monotone function of x.
PROOF. We will prove that if the family has increasing MLR then the posterior probability that Θ ≥ θ₀ is a nondecreasing function of x. The other cases are all similar. Let x₁ < x₂. Then

    Pr(Θ ≥ θ₀|X = x₂)/Pr(Θ < θ₀|X = x₂) − Pr(Θ ≥ θ₀|X = x₁)/Pr(Θ < θ₀|X = x₁)
      = ∫_{[θ₀,∞)} f_{X|Θ}(x₂|θ) dμ_Θ(θ) / ∫_{(−∞,θ₀)} f_{X|Θ}(x₂|θ) dμ_Θ(θ)
        − ∫_{[θ₀,∞)} f_{X|Θ}(x₁|θ) dμ_Θ(θ) / ∫_{(−∞,θ₀)} f_{X|Θ}(x₁|θ) dμ_Θ(θ)
      = ( ∫_{(−∞,θ₀)} f_{X|Θ}(x₂|θ) dμ_Θ(θ) ∫_{(−∞,θ₀)} f_{X|Θ}(x₁|θ) dμ_Θ(θ) )⁻¹
        × ∫_{[θ₀,∞)} ∫_{(−∞,θ₀)} [f_{X|Θ}(x₂|θ₂) f_{X|Θ}(x₁|θ₁)
            − f_{X|Θ}(x₂|θ₁) f_{X|Θ}(x₁|θ₂)] dμ_Θ(θ₁) dμ_Θ(θ₂).        (4.72)

Since the family of distributions has increasing MLR, it follows that

    f_{X|Θ}(x₂|θ₂) f_{X|Θ}(x₁|θ₁) ≥ f_{X|Θ}(x₂|θ₁) f_{X|Θ}(x₁|θ₂)

for all x₁ < x₂ and all θ₁ < θ₂. This makes the last expression in the last line of (4.72) nonnegative, and the result is proven. □

Corollary 4.73. Suppose that {P_θ : θ ∈ Ω} is a parametric family of distributions for X with MLR and that μ_Θ is an arbitrary prior distribution on (Ω, τ). Suppose that we are testing a one-sided hypothesis against the corresponding one-sided alternative with a 0–1–c loss. Then one-sided tests are formal Bayes rules.
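Theorem 4.71 is easy to illustrate numerically. The following sketch (our own Python; the two-point prior and the N(θ, 1) family are our assumptions, not from the text) checks that Pr(Θ ≥ 1/2 | X = x) is nondecreasing in x for an increasing-MLR family.

```python
import math

PRIOR = {0.0: 0.7, 1.0: 0.3}  # arbitrary two-point prior on theta

def norm_pdf(x, theta):
    # N(theta, 1) density; this family has increasing MLR in x.
    return math.exp(-((x - theta) ** 2) / 2) / math.sqrt(2 * math.pi)

def post_prob_ge(theta0, x):
    # Pr(Theta >= theta0 | X = x) under the discrete prior
    num = sum(p * norm_pdf(x, t) for t, p in PRIOR.items() if t >= theta0)
    den = sum(p * norm_pdf(x, t) for t, p in PRIOR.items())
    return num / den

xs = [i / 4 for i in range(-20, 21)]
probs = [post_prob_ge(0.5, x) for x in xs]
assert all(b >= a for a, b in zip(probs, probs[1:]))  # monotone in x
```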
When no UMP level α test exists, one option that has been suggested is to find a locally most powerful test.
Definition 4.74. Let d be a strictly positive function on Ω_A. We say that a level α test φ is locally most powerful level α relative to d (LMP) if, for every level α test ψ, there exists c such that β_φ(θ) ≥ β_ψ(θ) for all θ such that 0 < d(θ) < c.

For a one-dimensional parameter Θ and H : Θ ≤ θ₀, if power functions are continuously differentiable and φ is the unique level α test with maximum derivative of the power function at θ₀, then φ is LMP level α relative to d(θ) = θ − θ₀ for θ > θ₀. (See Problem 34 on page 289.)

4.3.4 Two-Sided Hypotheses


Two-sided testing situations come in two forms.
Definition 4.75. If H : θ₁ ≤ Θ ≤ θ₂ and A : Θ > θ₂ or Θ < θ₁, then the alternative is two-sided. If H : Θ ≥ θ₂ or Θ ≤ θ₁ and A : θ₁ < Θ < θ₂, then the hypothesis is two-sided.
The only difference, from a decision theoretic viewpoint, between two-sided alternatives and two-sided hypotheses is where the endpoints go. The tradition in classical hypothesis testing is to put the endpoints into the hypothesis. That is, the hypothesis is closed and the alternative is open. There is no need to require this, especially in cases in which the power functions are continuous. However, the treatments of these two cases are drastically different in the classical framework. The difference has nothing to do with where the endpoints go, but rather with the asymmetric treatment of hypotheses and alternatives in the optimality criteria.
In this section, we consider the case of two-sided hypotheses. We put off the case of two-sided alternatives until Section 4.4. Some mathematical lemmas are needed first.
Lemma 4.76 (Lagrange multipliers).¹⁶ Let T be any set, let f : T → ℝ and g_i : T → ℝ, for i = 1, ..., n, be functions, and let λ₁, ..., λ_n be real numbers. If t₀ minimizes f(t) + Σ_{i=1}^n λ_i g_i(t) and satisfies g_i(t₀) = c_i for i = 1, ..., n, then t₀ minimizes f(t) subject to g_i(t) ≤ c_i for each λ_i > 0 and g_i(t) ≥ c_i for each λ_i < 0.
PROOF. Suppose, to the contrary, that there exists t₁ such that f(t₁) < f(t₀), while g_i(t₁) ≤ c_i for each λ_i > 0 and g_i(t₁) ≥ c_i for each λ_i < 0. Then

    f(t₁) + Σ_{i=1}^n λ_i g_i(t₁) < f(t₀) + Σ_{i=1}^n λ_i c_i = f(t₀) + Σ_{i=1}^n λ_i g_i(t₀).

This contradicts the assumption that t₀ minimizes f(t) + Σ_{i=1}^n λ_i g_i(t). □

Corollary 4.77.¹⁷ Let T be any set, let f : T → ℝ and g_i : T → ℝ, for i = 1, ..., n, be functions, and let λ₁, ..., λ_n be real numbers. If t₀ maximizes f(t) + Σ_{i=1}^n λ_i g_i(t) and satisfies g_i(t₀) = c_i for i = 1, ..., n, then t₀ maximizes f(t) subject to g_i(t) ≥ c_i for each λ_i > 0 and g_i(t) ≤ c_i for each λ_i < 0.

16This lemma is used in the proof of Lemma 4.78.
17This corollary is used in the proof of Lemma 4.78.
The following lemma is sometimes called the generalized Neyman–Pearson lemma due to the resemblance it bears to the Neyman–Pearson fundamental lemma 4.37.
Lemma 4.78.¹⁸ Let p₀, p₁, ..., p_n be integrable (and not a.e. 0) functions (with respect to a measure ν), and let

    φ₀(x) =
        1       if p₀(x) > Σ_{i=1}^n k_i p_i(x),
        γ(x)    if p₀(x) = Σ_{i=1}^n k_i p_i(x),
        0       if p₀(x) < Σ_{i=1}^n k_i p_i(x),

where 0 ≤ γ(x) ≤ 1 and the k_i are constants. Then φ₀ minimizes ∫[1 − φ(x)]p₀(x) dν(x) subject to

    ∫ φ(x)p_j(x) dν(x) ≤ ∫ φ₀(x)p_j(x) dν(x), for those j such that k_j > 0,
    ∫ φ(x)p_j(x) dν(x) ≥ ∫ φ₀(x)p_j(x) dν(x), for those j such that k_j < 0.

PROOF. Let φ be an arbitrary measurable function taking values between 0 and 1 which satisfies the preceding inequality constraints. Since φ(x) ≤ φ₀(x) whenever p₀(x) − Σ_{i=1}^n k_i p_i(x) > 0 and φ(x) ≥ φ₀(x) whenever p₀(x) − Σ_{i=1}^n k_i p_i(x) < 0, it is clear that

    ∫ [φ(x) − φ₀(x)] [p₀(x) − Σ_{i=1}^n k_i p_i(x)] dν(x) ≤ 0.

It follows from this that

    ∫[1 − φ₀(x)]p₀(x) dν(x) + Σ_{i=1}^n k_i ∫ φ₀(x)p_i(x) dν(x)        (4.79)
      ≤ ∫[1 − φ(x)]p₀(x) dν(x) + Σ_{i=1}^n k_i ∫ φ(x)p_i(x) dν(x).

Now, let T be the set of all measurable functions from X to [0, 1], and let

    f(t) = ∫[1 − t(x)]p₀(x) dν(x),
    g_i(t) = ∫ t(x)p_i(x) dν(x), for i = 1, ..., n.

Equation (4.79) says that φ₀ minimizes f(φ) + Σ_{i=1}^n k_i g_i(φ). Lemma 4.76 implies that φ₀ minimizes f(φ) subject to the constraints. □
There is also a version of this lemma corresponding to Corollary 4.77.
There is also a version of this lemma corresponding to Corollary 4.77.

18This lemma is used in the proofs of Theorems 4.82 and 4.104.



Corollary 4.80.¹⁹ Let p₀, p₁, ..., p_n be integrable (and not a.e. 0) functions (with respect to a measure ν), and let

    φ₀(x) =
        0       if p₀(x) > Σ_{i=1}^n k_i p_i(x),
        γ(x)    if p₀(x) = Σ_{i=1}^n k_i p_i(x),
        1       if p₀(x) < Σ_{i=1}^n k_i p_i(x),

where 0 ≤ γ(x) ≤ 1 and the k_i are constants. Then φ₀ maximizes ∫[1 − φ(x)]p₀(x) dν(x) subject to

    ∫ φ(x)p_j(x) dν(x) ≥ ∫ φ₀(x)p_j(x) dν(x), for those j such that k_j > 0,
    ∫ φ(x)p_j(x) dν(x) ≤ ∫ φ₀(x)p_j(x) dν(x), for those j such that k_j < 0.

The theorems we can prove assume that we are dealing with a one-parameter exponential family.
Lemma 4.81.²⁰ Assume that the parametric family has monotone likelihood ratio. If φ is an arbitrary test and θ₁ < θ₂, define α_i = β_φ(θ_i) for i = 1, 2. Then there is a test ψ₀ of the form

    ψ(x) =
        1      if c₁ < x < c₂,
        γ_i    if x = c_i,
        0      if x < c₁ or x > c₂,

with c₁ ≤ c₂ such that β_{ψ₀}(θ_i) = α_i for i = 1, 2.
PROOF. Define φ_w to be the UMP level w test of H : Θ ≤ θ₁ versus A : Θ > θ₁, and for each 0 ≤ u ≤ 1 − α₁, set

    φ′_u(x) = φ_{α₁+u}(x) − φ_u(x).

First, note that since α₁ + u ≥ u for all u, 0 ≤ φ′_u(x) ≤ 1 for all u and all x. This means that φ′_u is really a test. By design, β_{φ′_u}(θ₁) = α₁. Also, φ′_u has the form of ψ for each u (with c₁ or c₂ possibly infinite). This is true since all φ_w(x) are 0 for small x and 1 for large x. By construction, φ′_0 = φ_{α₁} is the MP level α₁ test of H′ : Θ = θ₁ versus A′ : Θ = θ₂, and φ′_{1−α₁} = 1 − φ_{1−α₁} is the least powerful such test. Since φ is also a level α₁ test of H′ versus A′, it follows that

    β_{φ′_{1−α₁}}(θ₂) ≤ α₂ ≤ β_{φ′_0}(θ₂).

If we can show that β_{φ′_u}(θ₂) is continuous in u, we can conclude that there exists u such that β_{φ′_u}(θ₂) = α₂. The proof of this is left to the reader. (See Problem 40 on page 290.) It follows that φ′_u has the form of ψ and β_{φ′_u}(θ_i) = α_i for i = 1, 2. □

19This corollary is used in the proof of Theorem 4.82.


20This lemma is used in the proofs of Theorem 4.82 and Lemma 4.99. It resembles part of Theorem 1 on p. 217 of Ferguson (1967).

Theorem 4.82. In a one-parameter exponential family with natural parameter, if Ω_H = (−∞, θ₁] ∪ [θ₂, ∞) and Ω_A = (θ₁, θ₂), with θ₁ < θ₂, a test of the form

    φ₀(x) =
        1      if c₁ < x < c₂,
        γ_i    if x = c_i,
        0      if x < c₁ or x > c₂,

with c₁ ≤ c₂ minimizes β_φ(θ) for all θ < θ₁ and for all θ > θ₂, and it maximizes β_φ(θ) for all θ₁ < θ < θ₂ subject to β_φ(θ_i) = α_i for i = 1, 2, where α_i = β_{φ₀}(θ_i) for i = 1, 2. If c₁, c₂, γ₁, γ₂ are chosen so that α₁ = α₂ = α, then φ₀ is UMP level α.
PROOF. Suppose that φ₀ is of the form stated in the theorem, and let f_{X|Θ}(x|θ) = c(θ)exp(θx), so that h(x) is incorporated into the measure ν. Let θ₁ and θ₂ be as in the statement of the theorem, and let θ₀ be another element of Ω. Define p_i(x) = c(θ_i)exp(θ_i x), for i = 0, 1, 2.
First, suppose that θ₁ < θ₀ < θ₂. Set b_i = θ_i − θ₀ for i = 1, 2. Next, note that the function a₁exp(b₁x) + a₂exp(b₂x) is strictly monotone if a₁ and a₂ have opposite signs, and it is always negative if both a₁ and a₂ are negative. Suppose that we try to solve the following two equations for a₁, a₂:

    1 = a₁exp(b₁c₁) + a₂exp(b₂c₁),
    1 = a₁exp(b₁c₂) + a₂exp(b₂c₂),        (4.83)

where c₁ and c₂ are from the definition of φ₀. The reader can easily verify that this system of linear equations has nonzero determinant and hence has a solution. The solution must have both a₁, a₂ > 0. Set k_i = a_i c(θ₀)/c(θ_i). With these values of k₁, k₂, apply Lemma 4.78. Note that minimizing ∫[1 − φ(x)]p₀(x) dν(x) is equivalent to maximizing β_φ(θ₀). The test that maximizes β_φ(θ₀) subject to β_φ(θ_i) ≤ β_{φ₀}(θ_i) for i = 1, 2 is

    φ(x) = 1, if c(θ₀)exp(θ₀x) > k₁c(θ₁)exp(θ₁x) + k₂c(θ₂)exp(θ₂x)        (4.84)

if this test satisfies the constraints. The test in (4.84) can be rewritten as

    φ(x) = 1, if 1 > a₁exp(b₁x) + a₂exp(b₂x),

where a₁ and a₂ are the solutions to the two linear equations above. Since a₁exp(b₁x) + a₂exp(b₂x) goes to infinity as x → ±∞, it follows that φ(x) = 1 for c₁ < x < c₂, which leads to φ(x) = φ₀(x). This same argument applies for all θ₀ between θ₁ and θ₂; hence the same φ₀ maximizes β_φ(θ) for all θ₁ < θ < θ₂.
Next, try to minimize β_φ(θ₀) for θ₀ < θ₁. (An identical argument works for θ₀ > θ₂.) This time, we will use Corollary 4.80. Set b_i = θ_i − θ₀ for i = 1, 2. Next, note that the function a₁exp(b₁x) + a₂exp(b₂x) is strictly monotone if a₁ and a₂ have the same sign. If a₁ < 0 < a₂, the derivative has only one zero and the function goes to 0 as x → −∞ and to ∞ as x → ∞. This means that the function equals 1 for only one value of x. Hence, the solution to the equations in (4.83) must have a₁ > 0 > a₂. Solve the equations and set k_i = a_i c(θ₀)/c(θ_i). Since maximizing ∫[1 − φ(x)]p₀(x) dν(x) is the same as minimizing β_φ(θ₀), the test that minimizes β_φ(θ₀) subject to β_φ(θ₁) ≥ α₁ and β_φ(θ₂) ≤ α₂ is

    φ(x) = 1, if c(θ₀)exp(θ₀x) < k₁c(θ₁)exp(θ₁x) + k₂c(θ₂)exp(θ₂x),

with k₁ > 0 and k₂ < 0. This can be rewritten as

    φ(x) = 1, if 1 < a₁exp(b₁x) + a₂exp(b₂x),

with a₁ > 0 > a₂ and b₂ > b₁ > 0. Since a₁exp(b₁x) + a₂exp(b₂x) goes to 0 as x → −∞ and goes to −∞ as x → ∞, it follows that φ(x) = 1 for c₁ < x < c₂. Once again, we get the same test for all θ₀ and the same test as before.
Finally, consider the test φ_α(x) ≡ α, and now suppose that α₁ = α₂ = α. Lemma 4.81 guarantees that c₁, c₂, γ₁, and γ₂ can be chosen so that φ₀ has the stated form with α₁ = α₂ = α. The power function of φ_α is the constant α. It must be that β_{φ₀}(θ) ≤ α for every θ ∈ Ω_H. Hence φ₀ has level α. Since every level α test ψ must satisfy β_ψ(θ_i) ≤ α (i = 1, 2), and φ₀ maximizes the power on the alternative subject to these constraints, it follows that φ₀ is UMP level α. □

Example 4.85. Suppose that Y ~ Exp(θ) given Θ = θ. Let X = −Y, so that Θ
is the natural parameter. Let Ω_H = (0, 1] ∪ [2, ∞), Ω_A = (1, 2), and α = 0.1.
We must solve the equations

    exp(c2) − exp(c1) = 0.1,
    exp(2c2) − exp(2c1) = 0.1.

If we let a = exp(c2) and b = exp(c1), these equations simplify to a − b = 0.1
and a² − b² = 0.1, respectively. Since a² − b² = (a − b)(a + b), the second
equation forces a + b = 1, and the solution is easily calculated to be a = 0.55
and b = 0.45. So the solution to the original equations is c1 = −0.7985 and
c2 = −0.5978. Since the distribution is continuous, γ1 = γ2 = 0. We reject H
if 0.5978 < Y < 0.7985.
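The arithmetic in Example 4.85 is easy to check numerically. The following sketch (plain Python, using only the example's own numbers) recovers c1 and c2 and confirms that the resulting test has power 0.1 at both endpoints θ = 1 and θ = 2.

```python
import math

# Example 4.85: with a = exp(c2) and b = exp(c1), the equations
#   exp(c2) - exp(c1) = 0.1  and  exp(2 c2) - exp(2 c1) = 0.1
# reduce to a - b = 0.1 and (a - b)(a + b) = 0.1, which forces a + b = 1.
a, b = 0.55, 0.45
c1, c2 = math.log(b), math.log(a)

def power(theta):
    # P(c1 < X < c2) when X = -Y and Y ~ Exp(theta) (rate parameterization)
    return math.exp(theta * c2) - math.exp(theta * c1)

print(round(c1, 4), round(c2, 4))                  # -0.7985 -0.5978
print(round(power(1.0), 4), round(power(2.0), 4))  # 0.1 0.1
```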
Example 4.86. Suppose that X ~ Bin(n, p) given P = p. Let Θ = log(P/(1 −
P)), the natural parameter. Then f_{X|Θ}(x|θ) = c(θ) exp(θx), where c(θ) =
(1 + exp(θ))^{−n}, and ν is C(n, x) times counting measure on {0, ..., n}. The
hypothesis H : P ≤ 1/4 or P ≥ 3/4 corresponds to Ω_H = (−∞, −1.0986] ∪
[1.0986, ∞) in Θ space. If n = 10 and α = 0.1, we get the UMP level α test
by choosing c1 = 4 and c2 = 6 with γ1 = γ2 = 0.2565.
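A quick numeric check of Example 4.86 (a sketch; only the standard library is used): the randomization constant γ solves β(1/4) = P(X = 5) + γ[P(X = 4) + P(X = 6)] = 0.1, and the symmetry x ↔ n − x gives the same size at p = 3/4.

```python
import math

n, alpha = 10, 0.1
pmf = lambda x, p: math.comb(n, x) * p**x * (1 - p)**(n - x)

p = 0.25
gamma = (alpha - pmf(5, p)) / (pmf(4, p) + pmf(6, p))
print(round(gamma, 3))   # about 0.2565, matching the text

# Same size at the other endpoint p = 3/4, by symmetry:
beta_upper = pmf(5, 0.75) + gamma * (pmf(4, 0.75) + pmf(6, 0.75))
print(round(beta_upper, 6))   # 0.1
```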
Suppose that we have a prior distribution μ_Θ on the parameter space.
If there is a test with finite Bayes risk and the loss function is bounded
below, then Theorem 4.82 allows us to conclude that tests of the form
given in that theorem are formal Bayes rules. (See Problem 38 on page 290.)
4.3. Most Powerful Tests 251

Of course, if we switch the names of hypothesis and alternative, then the
formal Bayes rule will be 1 minus the test in Theorem 4.82. This will be
even more apparent after we see Lemma 4.99. One interesting difference
between Bayes rules and UMP level α tests is that not all Bayes rules for
testing two-sided hypotheses need to have the same value for the power
function at the two endpoints of the alternative.
Example 4.87 (Continuation of Example 4.85; see page 250). Once again,
suppose that Y ~ Exp(θ) given Θ = θ. Suppose that the prior distribution
for Θ is Exp(1). The posterior distribution will be Γ(2, 1 + y). Let the loss
function be 0-1-c with c = 0.5. Solving numerically for the formal Bayes rule,
we get

    φ(y) = { 1  if 0.02133 < y < 0.83685,
             0  otherwise.

The power function of this test is 0.454 at θ = 1 and 0.229 at θ = 2. The
level of the test is 0.454, but it is not UMP level 0.454. The UMP level 0.454
test would have power function 0.454 at θ = 2, but would not be a Bayes rule
for the stated decision problem. The intuitive reason for the lopsided power
function of the formal Bayes rule is that the prior puts so much more mass
below 1 than above 2 (0.3679 versus 0.1353). It makes sense that the test
should protect more against alternatives with small θ than those with large θ.
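One way to check the cutoffs 0.02133 and 0.83685 (a sketch; it assumes the 0-1-c rule is indifferent exactly where the posterior probability of H equals 1/(1 + c) = 2/3): with the Γ(2, 1 + y) posterior, both cutoffs should make the posterior probability of the alternative (1, 2) equal to 1/3.

```python
import math

def post_prob_alt(y):
    # P(1 < Theta < 2 | y) under Gamma(shape 2, rate s = 1 + y);
    # the Gamma(2, s) CDF is F(t) = 1 - exp(-s t)(1 + s t).
    s = 1.0 + y
    return math.exp(-s) * (1 + s) - math.exp(-2 * s) * (1 + 2 * s)

for y in (0.02133, 0.83685):
    print(round(post_prob_alt(y), 4))   # both approximately 1/3
```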
Curiously, however, if we use the improper prior with Radon–Nikodym
derivative 1/θ with respect to Lebesgue measure, the formal Bayes rule will
be a UMP level α test for all 0-1-c losses. To see this, note that the posterior
distribution of Θ is Exp(y). The posterior probability that the hypothesis is
true is now 1 − exp(−y) + exp(−2y). To find the formal Bayes rule with 0-1-c
loss, we set this expression equal to 1/(1 + c) and solve for y. There will be
two solutions²¹ c1 < c2 (the endpoints of the rejection region for the test),
and they will satisfy

    1 − exp(−c1) + exp(−2c1) = 1 − exp(−c2) + exp(−2c2) = 1/(1 + c).

Rearranging terms in this expression leads to

    exp(−c1) − exp(−c2) = exp(−2c1) − exp(−2c2).

The left-hand side of this last equation is the power function of the test at
θ = 1, and the right-hand side is the power function at θ = 2. If α is the
common value of the two sides, then the test is UMP level α.
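A numerical illustration of this UMP property (a sketch; the choice c = 0.2 is mine, made so that 1/(1 + c) lies above 3/4, the minimum of the posterior probability of H): substituting u = exp(−y) turns the defining equation into a quadratic, and the two endpoint powers agree because the roots satisfy u1 + u2 = 1.

```python
import math

c = 0.2                          # illustrative; any c < 1/3 works here
t = 1.0 / (1.0 + c)
# Solve 1 - u + u^2 = t for u = exp(-y):
disc = math.sqrt(1.0 - 4.0 * (1.0 - t))
u1, u2 = (1 + disc) / 2, (1 - disc) / 2
c1, c2 = -math.log(u1), -math.log(u2)     # rejection region is (c1, c2)

beta1 = math.exp(-c1) - math.exp(-c2)          # power at theta = 1
beta2 = math.exp(-2 * c1) - math.exp(-2 * c2)  # power at theta = 2
print(round(beta1, 6), round(beta2, 6))   # equal: their common value is the level
```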

Theorem 4.68 says that the class of UMP level α tests is essentially com-
plete for decision problems that include hypothesis-testing loss functions
for one-sided hypotheses. The first part of Example 4.87 shows that for
two-sided hypotheses, the class of UMP level α tests is not essentially com-
plete. The formal Bayes rule given there is admissible and the risk function
is not the same as that of any UMP level α test. This, then, is the first point

²¹There may actually be only one solution or no solutions, because the
posterior probability of the hypothesis is bounded below. In these cases, the
formal Bayes rule always accepts H.

at which classical hypothesis-testing theory has departed from the decision-
theoretic approach to hypothesis testing. When we had simple hypotheses
and alternatives or one-sided hypotheses and alternatives with MLR, the
class of (U)MP level α tests included (essentially) all admissible procedures.
When we move to two-sided hypotheses, we lose that property. One can
prove, however (see Problem 37 on page 290), that the class of tests of the
form given in Theorem 4.82 is essentially complete. The restriction to tests
that are UMP for their level does not follow from considerations of admis-
sibility. The argument to justify such a restriction might be, "In comparing
two tests with the same level, I want to choose, if possible, the test with the
higher power function on the alternative." This would make perfect sense
if the hypothesis consisted of a single point. (See Problem 16 on page 287.)
However, the hypothesis is not a single point in general. The classical theory
treats the hypothesis as if it were a single point and does not distinguish
between tests based on their power functions on the hypothesis so long as
they have the same level. To put the case more succinctly, the level of a
test does not completely describe the power function on the hypothesis,
but the classical theory pretends that it does. That is why the formal Bayes
rule in the first part of Example 4.87 on page 251 is lumped together with
all level 0.454 tests even though it has advantages over some other level
0.454 tests. These advantages are simply ignored when the level of a test
is taken as the entire summary of the power function on the hypothesis.
The restriction of attention to UMP level α tests, rather than all tests
of the form of Theorem 4.82, has another consequence that is even more
surprising, perhaps, than the fact that the tests do not form an essentially
complete class. A simple example will illustrate the general situation.
Example 4.88. Let X ~ N(μ, 1) given M = μ. Suppose that we are considering
two different hypotheses about M with Ω_{H1} = (−∞, −0.5] ∪ [0.5, ∞) and
Ω_{H2} = (−∞, −0.7] ∪ [0.51, ∞). The UMP level 0.05 test of H1 versus
A1 : M ∈ (−0.5, 0.5) is to reject H1 if X ∈ (−0.071, 0.071). The UMP level
0.05 test of H2 versus A2 : M ∈ (−0.7, 0.51) is to reject H2 if X ∈
(−0.167, −0.017). Since Ω_{H2} ⊂ Ω_{H1}, it makes sense that if we reject H1,
then a fortiori we should be able to reject H2. However, if X ∈ [−0.017, 0.071),
we would reject H1 at level 0.05 but accept H2 at the same level.
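The rejection limits in Example 4.88 can be checked with nothing more than the normal CDF (a sketch using math.erf): each UMP test should have probability 0.05 of rejecting at the endpoints of its own hypothesis.

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def reject_prob(lo, hi, mu):
    # P(X in (lo, hi)) for X ~ N(mu, 1): the chance the test rejects
    return Phi(hi - mu) - Phi(lo - mu)

for lo, hi, mu in [(-0.071, 0.071, -0.5), (-0.071, 0.071, 0.5),
                   (-0.167, -0.017, -0.7), (-0.167, -0.017, 0.51)]:
    print(round(reject_prob(lo, hi, mu), 3))   # 0.05 each time
```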

The type of contradictory conclusions we were able to obtain in Exam-
ple 4.88 actually occurs quite generally in level α testing, once we leave
the one-sided situation. (See also Problem 41 on page 290.) We will see
them again in Section 4.4.2 and in Section 4.6. Gabriel (1969) introduced
the concept of coherent tests of several hypotheses. A collection of tests of
various hypotheses is coherent if rejecting one hypothesis H always leads
to rejecting every hypothesis that implies H. Testing several hypotheses at
the same level is not always coherent, as Example 4.88 shows. The problem
lies in choosing tests based on their level rather than on decision-theoretic
criteria. For example, if one were to reject hypotheses whose posterior
probabilities were less than some number γ, then rejecting H1 would always
lead to rejecting H2 when H2 implies H1 (as in Example 4.88).


Example 4.89 (Continuation of Example 4.88; see page 252). Suppose that we
use the usual improper prior (Lebesgue measure). Then the posterior distribution
of M is N(x, 1). The level 0.05 test of H1 corresponds to rejecting H1 if the
posterior probability of H1 is less than 0.618. The posterior probability of H2 is
less than 0.618 whenever x ∈ (−0.72, 0.535). Notice that this last interval strictly
contains the rejection region for H1, (−0.071, 0.071), so that rejection of H1 will
always imply rejection of H2 when rejection means "posterior probability less
than 0.618."
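These posterior computations are also easy to reproduce (a sketch): with M ~ N(x, 1) a posteriori, the posterior probability of each hypothesis is a sum of two normal tail areas.

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

pH1 = lambda x: Phi(-0.5 - x) + 1 - Phi(0.5 - x)    # P(H1 | x)
pH2 = lambda x: Phi(-0.7 - x) + 1 - Phi(0.51 - x)   # P(H2 | x)

print(round(pH1(0.071), 3))                          # 0.618 at the H1 cutoff
print(round(pH2(-0.72), 3), round(pH2(0.535), 3))    # both near 0.618
```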

There is a natural sense in which incoherent tests are inadmissible. See
Problem 42 on page 291.

4.4 Unbiased Tests


4.4.1 General Results
For the cases not previously considered, there generally do not exist UMP
level α tests. This is not to say that there are no good tests in other
situations, but rather that the criterion of UMP level α needs to be relaxed
if we are going to find the good tests. Consider the following example, which
is typical of what happens when the alternative is two-sided.

Example 4.90. Suppose that X ~ N(θ, 1) given Θ = θ and that we wish to test
H : Θ = θ0 versus A : Θ ≠ θ0 at some level α ∈ (0, 1). The UMP level α test
φ′ of H versus A′ : Θ < θ0 and the UMP level α test φ″ of H versus A″ : Θ > θ0
are both level α tests of H versus A. If there is a UMP level α test φ of H versus
A, then its power function must be at least as large as that of φ′ for θ < θ0 and
at least as large as that of φ″ for θ > θ0. But such a φ would also be a level α
test of H versus A′, so Proposition 4.67 says that either φ = φ′ a.s.²² or there is
θ < θ0 such that β_φ(θ) < β_{φ′}(θ). Since we are assuming that the latter is false,
we must have φ = φ′ a.s. The same argument applied to A″ and φ″ implies that
φ = φ″ a.s. [P_θ] for all θ. But this is impossible, since φ′(x) = I_{(−∞,c′]}(x) for
some finite number c′, and φ″(x) = I_{[c″,∞)}(x) for some finite number c″. It
follows that no test φ is UMP level α for testing H versus A.

The way to circumvent the lack of UMP level α tests in cases like Ex-
ample 4.90 is to create a new criterion that one-sided tests fail to satisfy
when the alternative is two-sided.²³ The rationale is that even though the
power function of a one-sided test is high in one part of the alternative, it
is very low in the other part. The new optimality criterion requires that
the power function be higher on the alternative than on the hypothesis.

²²Since all N(θ, 1) distributions are mutually absolutely continuous, a.s. with
respect to one of them means a.s. with respect to all of them.
²³When the conditions of Proposition 4.67 fail, there may be UMP level α tests
for two-sided alternatives. See Problem 27 on page 288 for an example that even
has MLR.

Definition 4.91. A test φ is unbiased level α if it has level α and if β_φ(θ) ≥
α for all θ ∈ Ω_A. If Ω ⊆ ℝ^k, a test φ is called α-similar if β_φ(θ) = α for
each θ ∈ Ω̄_H ∩ Ω̄_A. More generally, φ is α-similar on B ⊆ Ω if β_φ(θ) = α
for each θ ∈ B. If φ is UMP among all unbiased level α tests, then φ is
uniformly most powerful unbiased (UMPU) level α.

The concepts of unbiased level α and α-similar are closely related.

Proposition 4.92.²⁴ If a test φ is unbiased level α and β_φ(·) is continuous,
then φ is α-similar.

Proposition 4.93. If φ is a UMP level α test, then φ is unbiased level α.

Since being unbiased level α implies that φ has floor α, the dual concept
to unbiased level α is simply unbiased floor α.

Definition 4.94. A test φ is unbiased floor α if it has floor α and if β_φ(θ) ≤
α for all θ ∈ Ω_H. If φ is UMC among all unbiased floor α tests, then φ is
uniformly most cautious unbiased (UMCU) floor α.
It is interesting to note that the collection of unbiased tests may not be
essentially complete. The test in the first part of Example 4.87 on page 251
is admissible with 0-1-0.5 loss, but it is not unbiased and it does not have
the same risk function as an unbiased test. The restriction to unbiased
tests, just like the restriction to UMP level α tests in the previous section,
does not follow from considerations of admissibility. It is true that the
restriction to unbiased tests rules out the use of one-sided tests in problems
like Example 4.90 on page 253, but it also rules out many admissible tests.

Example 4.95. Suppose that Y ~ Exp(θ) given Θ = θ with Ω_H = [1, 2] and
Ω_A = (−∞, 1) ∪ (2, ∞). Suppose that the loss function is asymmetric in the
following way:

    L(θ, a) = { 3    if θ ∈ Ω_H and a = 1,
                1/2  if θ > 2 and a = 0,
                1    if θ < 1 and a = 0,
                0    otherwise.

We will use the usual improper prior with Radon–Nikodym derivative 1/θ
with respect to Lebesgue measure so that the posterior distribution is Exp(y).
The formal Bayes rule will minimize the posterior risk. The posterior risks for
the two possible decisions are

    a = 0:  (1/2) exp(−2y) + 1 − exp(−y),
    a = 1:  3 (exp(−y) − exp(−2y)).

Solving to see when the risk for a = 1 is smaller, we see that this occurs when
y < 0.2569 or y > 0.9959. The test that rejects when one of these conditions
holds has power function 0.5959 at θ = 1 and 0.5382 at θ = 2. Since it is more

²⁴This proposition is used in the proofs of Lemma 4.96 and of Theorems 4.123
and 4.124.
important to reject H when θ is small, the power is higher for small θ values.
The test has level 0.5959, but it is biased.
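The numbers in Example 4.95 can be reproduced as follows (a sketch; with u = exp(−y), the indifference condition 3(u − u²) = u²/2 + 1 − u reduces to the quadratic 7u² − 8u + 2 = 0):

```python
import math

# Roots of 7u^2 - 8u + 2 = 0, where u = exp(-y):
disc = math.sqrt(64 - 56)
u1, u2 = (8 + disc) / 14, (8 - disc) / 14
y1, y2 = -math.log(u1), -math.log(u2)
print(round(y1, 4), round(y2, 4))   # 0.2569 0.9959

# The test rejects when y < y1 or y > y2; for Y ~ Exp(theta):
power = lambda th: 1 - (math.exp(-th * y1) - math.exp(-th * y2))
print(round(power(1), 4), round(power(2), 4))   # 0.5959 0.5382
```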
One technique for finding UMPU tests will be to restrict attention first
to α-similar tests. The following lemma shows why this will work in many
cases.

Lemma 4.96.²⁵ Suppose that β_φ(·) is continuous in θ for every φ. If φ0
is UMP among α-similar tests and has level α, then φ0 is UMPU level α.

PROOF. Since ψ(x) ≡ α is α-similar and φ0 is UMP α-similar, it follows
that β_{φ0}(θ) ≥ α for every θ ∈ Ω_A. Since φ0 has level α, it is unbiased level
α. Every unbiased level α test is α-similar by Proposition 4.92. So, the test
that is UMP among α-similar tests, namely φ0, has power function at least
as high (on the alternative) as the test that is UMPU level α. But φ0 is
also unbiased level α. Hence φ0 is UMPU level α.  □

Proposition 4.97.²⁶ Suppose that β_φ(·) is continuous in θ for every φ. If
φ0 is UMC among α-similar tests and φ0 has floor α, then φ0 is UMCU
floor α.

4.4.2 Interval Hypotheses

In this section, we will consider the case in which the alternative is two-
sided and the hypothesis is a nondegenerate compact interval. That is, the
case H : Θ ∈ [θ1, θ2] versus A : Θ ∉ [θ1, θ2], with θ1 < θ2. It turns out that
there is no UMP level α test of H versus A in one-parameter exponential
families (for α > 0). One would suspect that if φ were the optimal level
1 − α test of A versus H as derived in Theorem 4.82, then 1 − φ would be
the appropriate level α test for H versus A.²⁷ This will, in fact, turn out to
be the case, but the test is no longer UMP level α, but only UMPU level α.²⁸

Example 4.98 (Continuation of Example 4.87; see page 251). In this example,
Y ~ Exp(θ) given Θ = θ. For consistency with the classical approach, suppose
that we use the improper prior with Radon–Nikodym derivative 1/θ with respect
to Lebesgue measure. The posterior distribution of Θ given Y = y is Exp(y).
At the end of Example 4.87 on page 251, we saw that the formal Bayes rule
for 0-1-c loss would be a UMP level α test. Suppose now that we switch the
hypothesis and alternative, so that Ω_H = [1, 2] and Ω_A = (−∞, 1) ∪ (2, ∞).
Suppose that we use the 0-1-c loss with c = 3.04. The posterior probability

²⁵This lemma is used in the proof of Theorem 4.100.
²⁶This proposition is used in the proof of Theorem 4.100.
²⁷One can prove (see Problem 30 on page 289) that 1 − φ is UMC floor α for
testing H versus A. But this is not the most popular optimality criterion.
²⁸One should also be aware that there is no UMC floor α test in the case of two-
sided hypotheses in exponential families. The concepts of UMP and UMC really
are dual to each other. Neither of them is the unique best optimality criterion.

of Ω_H is exp(−y) − exp(−2y). We then reject H, that is, we choose a = 1, if
exp(−y) − exp(−2y) < 1/4.04, which is true if y > 0.7985 or y < 0.5978. This is
the same (a.s.) as 1 minus the UMP level 0.1 test in the earlier example. Note
that the conditions of Proposition 4.67 are met in this example, and so no test
will be UMP level 0.9. The test we have just constructed will be UMPU level 0.9,
however.
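A quick check of the cutoffs in Example 4.98 (a sketch): the posterior probability exp(−y) − exp(−2y) of Ω_H = [1, 2] equals 1/4.04 at y = −log 0.55 ≈ 0.5978 and y = −log 0.45 ≈ 0.7985, the same two cutoffs found in Example 4.85.

```python
import math

pH = lambda y: math.exp(-y) - math.exp(-2 * y)   # P(Theta in [1,2] | y)
y_lo, y_hi = -math.log(0.55), -math.log(0.45)    # 0.5978 and 0.7985
print(round(y_lo, 4), round(y_hi, 4))
print(round(pH(y_lo), 4), round(pH(y_hi), 4), round(1 / 4.04, 4))  # all 0.2475
```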

We are now in position to begin to prove that the UMPU level α test for
the case of two-sided alternatives in a one-parameter exponential family is
just 1 minus the UMP level 1 − α test for the two-sided hypothesis.

Lemma 4.99.²⁹ In a one-parameter exponential family with natural pa-
rameter, if φ is any test of H : Θ ∈ [θ1, θ2] versus A : Θ ∉ [θ1, θ2] with
θ1 < θ2, then there is a test ψ of the form

    ψ(x) = { 1    if x < c1 or x > c2,
             γ_i  if x = c_i,
             0    if c1 < x < c2,

with β_ψ(θ_i) = β_φ(θ_i) for i = 1, 2 and

    β_ψ(θ) { ≤ β_φ(θ)  if θ ∈ Ω_H,
             ≥ β_φ(θ)  if θ ∈ Ω_A.

PROOF. Lemma 4.81 says that 1 − ψ can be chosen to have β_{1−ψ}(θ_i) =
β_{1−φ}(θ_i) and thus that ψ is in the desired form. Theorem 4.82 then shows
that 1 − ψ minimizes and maximizes power in just the opposite regions
from where we want ψ to minimize and maximize power under the same
conditions.  □

The tests in Lemma 4.99 are called two-sided tests. It is easy to see that
when the conditions of Lemma 4.99 hold, the class of two-sided tests is
essentially complete for hypothesis-testing loss functions. (See Problem 48
on page 292.) The next theorem says that the UMPU level α tests are a
subset of this essentially complete class. We could show, as in Example 4.87
on page 251, however, that this subset is not essentially complete.

Theorem 4.100. Assume the same conditions as Lemma 4.99. Also sup-
pose that β_ψ(θ_i) = α for i = 1, 2. A test of the form ψ is UMPU level α
and UMCU floor α.

PROOF. By comparing ψ with φ_α(x) ≡ α and using Lemma 4.99, we see
that ψ is unbiased level α and unbiased floor α. Also, Lemma 4.99 shows
that ψ is UMP α-similar and UMC α-similar. Lemma 4.96 and Proposi-
tion 4.97 can be applied since the power functions are continuous in an
exponential family.  □

²⁹This lemma is used in the proof of Theorem 4.100 and to show that the class
of two-sided tests is essentially complete.

Example 4.101. Suppose that X ~ N(μ, 1) given Θ = μ. Let Ω_H = [−1, 1] and
α = 0.1. Set c2 = 2.286 and c1 = −2.286. Here θ1 = −1 and θ2 = 1. So

    ψ(x) = { 1  if x < −2.286 or x > 2.286,
             0  otherwise;

    β_ψ(θ2) = Pr(N(1, 1) ∉ [−2.286, 2.286])
            = 1 − (Φ(1.286) − Φ(−3.286)) = 0.1
            = 1 − (Φ(3.286) − Φ(−1.286))
            = Pr(N(−1, 1) ∉ [−2.286, 2.286]) = β_ψ(θ1).

Hence, ψ is UMPU level 0.1 and UMCU floor 0.1.
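The constant 2.286 in Example 4.101 can be verified directly (a sketch using math.erf for the normal CDF Φ):

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Power of the test rejecting for |X| > 2.286, evaluated at theta:
beta = lambda th: 1 - (Phi(2.286 - th) - Phi(-2.286 - th))
print(round(beta(1.0), 3), round(beta(-1.0), 3))   # 0.1 0.1
```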
Lest the reader think that UMPU and UMCU tests are always the same,
note that the UMP and UMC tests in Problem 31 on page 289 are both
unbiased, but they are not the same. Those tests are one-sided, however.
We should also note that the type of contradictory conclusions drawn in
Example 4.88 on page 252 also arises for interval hypotheses. The following
example is adapted from Schervish (1996).
Example 4.102. Suppose that X ~ N(μ, 1) given M = μ. We wish to consider
two different hypotheses, H1 : M ∈ [−0.5, 0.5] versus A1 : M ∉ [−0.5, 0.5] and
H2 : M ∈ [−0.82, 0.52] versus A2 : M ∉ [−0.82, 0.52]. The UMPU level 0.05 test
of H1 is to reject H1 if X ∉ [−2.185, 2.185]. The UMPU level 0.05 test of H2 is
to reject H2 if X ∉ [−2.475, 2.175]. So, if X ∈ (2.175, 2.185], we would reject H2
and accept H1, even though H1 implies H2.

If we had used Lebesgue measure for an improper prior, the posterior proba-
bility of H1 given X = x is less than 0.0424 if x ∉ [−2.185, 2.185], the rejection
region for the UMPU level 0.05 test. The posterior probability of H2 given X = x
is less than 0.0424 if x ∉ [−2.531, 2.231]. So, if the decision rule is to reject the
hypothesis if the posterior probability is less than 0.0424, then we would reject
H1 whenever we reject H2.
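Both halves of Example 4.102 can be verified numerically (a sketch): the UMPU tests have power approximately 0.05 at the endpoints of their interval hypotheses, and the posterior probability of H1 at the edge of its rejection region is approximately 0.0424.

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

power = lambda lo, hi, mu: 1 - (Phi(hi - mu) - Phi(lo - mu))
print(round(power(-2.185, 2.185, 0.5), 3),     # H1 test at mu = 0.5
      round(power(-2.475, 2.175, 0.52), 3),    # H2 test at mu = 0.52
      round(power(-2.475, 2.175, -0.82), 3))   # H2 test at mu = -0.82

post_H1 = lambda x: Phi(0.5 - x) - Phi(-0.5 - x)   # posterior M ~ N(x, 1)
print(round(post_H1(2.185), 4))                    # 0.0424
```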

4.4.3 Point Hypotheses

In this section, we will deal with the case in which Ω_H = {θ0} and Ω_A =
Ω \ {θ0}. This is like the case of an interval hypothesis with a two-sided
alternative, except that the interval is degenerate. The proofs of some of the
results for two-sided alternatives relied on the fact that the two endpoints
of the hypothesis were distinct. When the endpoints are the same, some
changes are required.

Lemma 4.103.³⁰ Suppose that the power functions of all tests are differ-
entiable. If φ is the UMP level α test of H : Θ = θ0 versus A : Θ < θ0, then
the derivative of the power function of φ at θ0 is smallest among all tests
with size α. Similarly, if φ is the UMP level α test of H : Θ = θ0 versus
A : Θ > θ0, then the derivative of the power function of φ at θ0 is largest
among all tests with size α.

PROOF. We prove only the first part, since the second is very similar. By
Lemma 4.45, it follows that β_φ(θ0) = α if φ is UMP level α. Let ψ be
another size α test. Since φ is UMP level α, for every ε > 0,

    β_φ(θ0 − ε) ≥ β_ψ(θ0 − ε),
    α − β_φ(θ0 − ε) ≤ α − β_ψ(θ0 − ε),
    [β_φ(θ0) − β_φ(θ0 − ε)]/ε ≤ [β_ψ(θ0) − β_ψ(θ0 − ε)]/ε.

Since the derivatives are the limits of the quantities in the last inequality
as ε goes to 0, the result follows.  □

³⁰This lemma is used in the proof of Theorem 4.104.

Theorem 4.104.³¹ In a one-parameter exponential family with natural
parameter, let Ω_H = {θ0}, where θ0 is in the interior of Ω. Let φ be any
test of H versus A : Θ ≠ θ0. Then there is a test ψ of the form in
Lemma 4.99 such that

    β_ψ(θ0) = β_φ(θ0),
    (d/dθ) β_ψ(θ)|_{θ=θ0} = (d/dθ) β_φ(θ)|_{θ=θ0},                 (4.105)

and, for every θ ≠ θ0, β_ψ(θ) is maximized among all tests ψ satisfying the
two equalities above.

PROOF. Let α = β_φ(θ0) and γ = dβ_φ(θ)/dθ|_{θ=θ0}. Let φ_w be the UMP
level w test of H′ : Θ ≥ θ0 versus A′ : Θ < θ0, and for each 0 ≤ u ≤ α, set

    φ′_u(x) = φ_u(x) + 1 − φ_{1−α+u}(x).

By design, β_{φ′_u}(θ0) = α for all u. Also, φ′_u has the form of ψ for every u
(with c1 or c2 possibly infinite). By construction, φ′_0 = 1 − φ_{1−α} is the UMP
level α test of H′ : Θ = θ0 versus A′ : Θ > θ0, and φ′_α = φ_α is the least
powerful such test. It follows from Lemma 4.103 that the derivatives of the
power functions of φ′_α and φ′_0 at θ0 are respectively the smallest possible
and the largest possible among all tests with power function α at θ0. Hence

    (d/dθ) β_{φ′_α}(θ)|_{θ=θ0} ≤ γ ≤ (d/dθ) β_{φ′_0}(θ)|_{θ=θ0}.

To prove that there is a ψ satisfying (4.105), we need only show that
dβ_{φ′_w}(θ)/dθ|_{θ=θ0} is continuous in w.³² Recall that

    φ_w(x) = { 1    if x < c_w,
               γ_w  if x = c_w,
               0    if x > c_w,

³¹This theorem is used in the proofs of Corollary 4.109 and Theorem 4.124. It
is also used to show that two-sided tests form an essentially complete class.
³²The proof follows part of the proof of Theorem 2 on pp. 220–221 of Ferguson
(1967).

for some numbers c_w and γ_w such that

    β_{φ_w}(θ0) = w.                                               (4.106)

Define h(x, g) = P_{θ0}(X < x) + g P_{θ0}(X = x), and define the random
variable V = h(X, G), where G has U(0, 1) distribution and is independent
of X and Θ. For 0 < w ≤ 1, we note that

    c_w = inf{u : F_{X|Θ}(u|θ0) ≥ w}

and

    γ_w P_{θ0}(X = c_w) = w − P_{θ0}(X < c_w).

For w = 0, c_0 = sup{u : F_{X|Θ}(u|θ0) = 0} and γ_0 = 0. It follows that for
all t, h(x, g) ≤ t if and only if either x < c_t or x = c_t and g ≤ γ_t. For
t ≥ 0, we have

    F_{V|Θ}(t|θ0) = ∫∫ I_{[0,t]}(h(x, g)) f_{X|Θ}(x|θ0) dg dν(x)
                 = ∫∫ [I_{(−∞,c_t)}(x) + I_{{c_t}}(x) I_{[0,γ_t]}(g)] dg f_{X|Θ}(x|θ0) dν(x)
                 = ∫ [I_{(−∞,c_t)}(x) + γ_t I_{{c_t}}(x)] f_{X|Θ}(x|θ0) dν(x)       (4.107)
                 = ∫ φ_t(x) f_{X|Θ}(x|θ0) dν(x) = β_{φ_t}(θ0) = t,

by (4.106). Hence V has U(0, 1) distribution given Θ = θ0. From (4.107),
we can write φ_w(x) = E{I_{[0,w]}(V)|X = x}. It follows from Theorem 2.64
that for every test η,

    (d/dθ) β_η(θ)|_{θ=θ0} = (d/dθ) ∫ η(x) c(θ) exp(θx) dν(x)|_{θ=θ0}
                         = ∫ η(x) (d/dθ)[c(θ) exp(θx)]|_{θ=θ0} dν(x)
                         = ∫ η(x) [x c(θ0) + c′(θ0)] exp(θ0 x) dν(x)
                         = E_{θ0}[X η(X)] − β_η(θ0) E_{θ0}(X),

where the last equality uses c′(θ0)/c(θ0) = −E_{θ0}(X). It follows that

    (d/dθ) β_{φ_w}(θ)|_{θ=θ0} = E_{θ0}{X φ_w(X)} − w E_{θ0}(X)
                             = E_{θ0}{X I_{[0,w]}(V)} − w E_{θ0}(X).

Since V has a continuous distribution, it follows that the above expression
is continuous in w.

What remains to be proven is that ψ maximizes the power function
among all tests with a fixed size and a fixed value of the derivative of
the power function. As in the proof of Theorem 4.82, let f_{X|Θ}(x|θ) =
c(θ) exp(θx), so that h(x) is incorporated into the measure ν. For a test
η with β_η(θ0) = α, the derivative of the power function at θ0 will equal
γ if and only if E_{θ0}[X η(X)] = γ + α E_{θ0}(X). Let θ1 ≠ θ0. We now apply
Lemma 4.78 with

    p0(x) = c(θ1) exp(θ1 x),
    p1(x) = c(θ0) exp(θ0 x),
    p2(x) = x c(θ0) exp(θ0 x).

The test η with the largest power at θ1 subject to

    β_η(θ0) ≤ (≥) α,
    (d/dθ) β_η(θ)|_{θ=θ0} ≤ (≥) γ

is η(x) = 1 if

    c(θ1) exp(θ1 x) > k1 c(θ0) exp(θ0 x) + k2 x c(θ0) exp(θ0 x),   (4.108)

where the signs of k1 and k2 depend on which inequalities we use. The
inequality (4.108) simplifies to exp([θ1 − θ0]x) > k1 + k2 x. This inequality
is satisfied for x outside of a bounded interval or for x in a semi-infinite
interval. We already know (Theorem 4.56 and Propositions 4.62–4.64) that
tests with ψ0(x) = 1 for x in a semi-infinite interval are one-sided and they
minimize the power function on one side of the hypothesis. Hence, we need
ψ0(x) = 1 for x outside of a bounded interval, and ψ0 has the form of ψ.
Furthermore, the same ψ0 works whether θ1 > θ0 or θ1 < θ0, by choosing
k1 and k2 correctly.  □
When the conditions of Theorem 4.104 hold and the loss function is
of the hypothesis-testing type, it follows that the class of two-sided
tests is essentially complete. Corollary 4.109 says that the class of UMPU
level α tests is a subset of this essentially complete class. At the end of
Example 4.111 on page 261, we will see that the class of UMPU tests is
not essentially complete.

Corollary 4.109.³³ In a one-parameter exponential family with natural
parameter, let Ω_H = {θ0}, where θ0 is in the interior of Ω. If ψ is a size α
test of the form of Lemma 4.99 with

    (d/dθ) β_ψ(θ)|_{θ=θ0} = 0,

then it is UMPU level α.

³³This corollary is used in the proof of Theorem 4.124.

PROOF. Since the test φ_α(x) ≡ α has size α and derivative 0 at θ0, Theo-
rem 4.104 tells us that ψ is unbiased level α. In light of Theorem 4.104, all
we need to show is that all unbiased level α tests must have power func-
tions with derivative 0 at θ0. Any test with β_φ(θ0) = α but with nonzero
derivative will have power strictly less than α on one side or the other of
θ0, because the power function is differentiable. Such a test could not be
unbiased level α.  □
Example 4.110. Suppose that X ~ N(θ, 1) given Θ = θ and that we wish to
test H : Θ = θ0 versus A : Θ ≠ θ0. To make the test ψ unbiased, we need

    0 = (d/dθ) β_ψ(θ)|_{θ=θ0}
      = (d/dθ) [ 1 − ∫_{c1}^{c2} (1/√(2π)) exp(−(x − θ)²/2) dx ]|_{θ=θ0}
      = − ∫_{c1}^{c2} [(x − θ0)/√(2π)] exp(−(x − θ0)²/2) dx
      = − ∫_{c1−θ0}^{c2−θ0} (x/√(2π)) exp(−x²/2) dx,

which is true if and only if −(c1 − θ0) = c2 − θ0 = c. In this case, β_ψ(θ0) =
2[1 − Φ(c)] = α if and only if c = Φ⁻¹(1 − α/2). This gives the usual equal-
tailed, two-sided test, which is UMPU level α.
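For the equal-tailed cutoff c = Φ⁻¹(1 − α/2), a tiny bisection suffices (a sketch; the Python standard library has erf but no normal quantile function):

```python
import math

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_quantile(p):
    # Bisection for Phi(c) = p on a wide bracket.
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = 0.05
c = norm_quantile(1 - alpha / 2)
print(round(c, 3))                 # 1.96
print(round(2 * (1 - Phi(c)), 4))  # 0.05, the size of the equal-tailed test
```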

In Example 4.110, if a Bayesian used a proper continuous prior distribu-
tion, then Pr(Θ = θ0) = 0 both before and after observing X. There are at
least two ways to treat point hypotheses from a Bayesian perspective. One
is to treat them as surrogates for interval hypotheses in which the length of
the interval has not been stated. Another is to assign positive probability
to the point hypothesis.
Example 4.111 (Continuation of Example 4.110). Suppose that I really want
to test H′ : |Θ − θ0| ≤ δ versus A′ : |Θ − θ0| > δ. Suppose that the prior
distribution of Θ is N(θ0, τ²). Then the posterior distribution of Θ given
X = x is N(θ1, τ²/(1 + τ²)), where

    θ1 = (θ0 + xτ²)/(1 + τ²).

If we use a 0-1-c loss function, the Bayes rule is to reject H′ if its posterior
probability is less than 1/(1 + c). The posterior probability of H′ is

    Pr(H′|X = x) = Φ([θ0 + δ − θ1]√(1 + τ²)/τ) − Φ([θ0 − δ − θ1]√(1 + τ²)/τ),

which is clearly a decreasing function of |θ1 − θ0|, and hence of |x − θ0|.
So, the Bayes rule is to reject H′ if |x − θ0| > d for some d. This has the
same form as the UMPU level α test.

Alternatively, suppose that Pr(Θ = θ0) = p0 > 0. Conditional on Θ ≠ θ0,
suppose that Θ ~ N(θ0, τ²). We computed the Bayes factor for this case in
Example 4.18 on page 222. The Bayes factor was given in (4.19) as

    √(1 + τ²) exp( −(x − θ0)² τ²/(2(1 + τ²)) ).

(See Problem 6 on page 286 for the entire posterior distribution of Θ.) If we
use a 0-1-c loss function, then the Bayes rule is to reject H if the probability
that it is true is less than 1/(1 + c). This corresponds to the Bayes factor
being less than some number, which in turn is easily seen to correspond to
|x − θ0| > d for some d. This is in the same form as the UMPU level α test.

Finally, suppose that we continue to let Pr(Θ = θ0) = p0 > 0, but that the
conditional prior given Θ ≠ θ0 is N(θ′, τ²) with θ′ ≠ θ0. Then the same kind
of calculation as above leads to the Bayes factor being small when x is far from
[(1 + τ²)θ0 − θ′]/τ². Such a test is two-sided but is not UMPU of its level. In
fact, the test is biased, even though it is admissible. We see once again that
the class of UMPU tests is not essentially complete.
In Example 4.111, two different types of prior distributions both led to
Bayes rules that were of the same form as the UMPU level α test. Unfor-
tunately, there did not appear to be any transparent connection between
the size α and the loss function or the prior distribution. The reason for
this is related to the inadmissibility of incoherent tests as illustrated in
Problem 42 on page 291. We will discuss this matter more in Section 4.6.
Example 4.112. Suppose that X rv Bin(n,p) given 8 = log(p/(1- p)). Then
the density of X with respect to counting measure on {O, ... , n} is

fXle(xI8) = (:) [1 + exp(8W n exp(x8).


Let QH = {80} and QA = 1R\{80}. The UMPU level a test is

1 ~f x < Cl or x > C2,


<J>(x) ={ 'Yi If x = Ci, (4.113)
o otherwise,
where Cl :5 C2. Supposing that Cl < C2, we have

(3",(8) = 1 - [1 + exp(J)-n [(1 - 'Yt) (~) exp(cI6)


+ (1 - "') (~) ",p(",O) + .f, (:) exp(.O)].
4.4. Unbiased Tests 263

It follows that

:e(3t/>(6) = n [(1 - 'Yt) ( : ) exp(cI6) + (1 - '12) (~) exp(c26)


+ "'~1 (:) eXP(X6)] exp(6)[1 +exp(6)]-n-l
- [1 + exp(6wn [(1 - 'Yt}Cl ( : ) exp(cI6)

+(I-'Y2)C2(~)eXp(C26)+ "'~1 (:)xeXP(X6)].


Once Cl and C2 are determined, solving for '11 and '12 amounts to solving two
linear equations. Now, suppose that 60 = log(0.25/0.75). Then, with 0 = 0.05
and n = 10, we get (after some numerical calculation)

Cl = 0, '11 = 0.52804,
C2 = 5, '12 = 0.00918.
Most people who want a level 0.05 test of this hypothesis would not bother to
compute the UMPU level 0.05 test but rather would perform what is called an
equal-tailed test. Since Theorem 4.104 says that the two-sided tests are admissible,
we could try to find a two-sided test of the form (4.113) such that the probability
of rejecting for small X equals the probability of rejecting for large X (both equal
to 0.025). In this case, the test would have

c_1 = 0, γ_1 = 0.44394,
c_2 = 5, γ_2 = 0.09028.

This test is biased because the derivative of the power function is 0.0236 at θ_0. In
other words, the probability of rejecting the hypothesis will be slightly less than
0.05 given Θ = θ for a short interval of θ values below θ_0.
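The constants quoted above can be checked numerically. The following sketch (a check written for this discussion, not taken from the book) evaluates the power function of the two-sided test (4.113) and a finite-difference approximation to its derivative at θ_0 = log(1/3), for both the UMPU and the equal-tailed choices of (γ_1, γ_2):

```python
import math

def power(theta, n, c1, g1, c2, g2):
    # beta_phi(theta) = P_theta(reject H) for the two-sided test (4.113)
    p = math.exp(theta) / (1 + math.exp(theta))
    def pmf(x):
        return math.comb(n, x) * p**x * (1 - p)**(n - x)
    reject = g1 * pmf(c1) + g2 * pmf(c2)
    reject += sum(pmf(x) for x in range(c2 + 1, n + 1))
    reject += sum(pmf(x) for x in range(c1))
    return reject

def deriv(theta, n, c1, g1, c2, g2, h=1e-6):
    # central finite-difference approximation to the derivative of the power
    return (power(theta + h, n, c1, g1, c2, g2)
            - power(theta - h, n, c1, g1, c2, g2)) / (2 * h)

theta0 = math.log(0.25 / 0.75)
umpu = (0, 0.52804, 5, 0.00918)    # c1, gamma1, c2, gamma2 from the text
equal = (0, 0.44394, 5, 0.09028)   # equal-tailed version

print(power(theta0, 10, *umpu), deriv(theta0, 10, *umpu))    # ~0.05, ~0
print(power(theta0, 10, *equal), deriv(theta0, 10, *equal))  # ~0.05, ~0.0236
```

Both tests have size 0.05; only the UMPU version has (essentially) zero slope at θ_0, while the equal-tailed slope reproduces the 0.0236 quoted above.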
One possible Bayesian solution would be to set Pr(P = p_0) = q_0 and let
P ~ Beta(α_0, β_0) otherwise, where P = exp(Θ)/(1 + exp(Θ)). Then, the Bayes
factor will be

p_0^x (1 − p_0)^{n−x} ∏_{i=0}^{n−1} (α_0 + β_0 + i) / [ ∏_{i=0}^{x−1} (α_0 + i) ∏_{j=0}^{n−x−1} (β_0 + j) ].    (4.114)

In the special case with α_0 = β_0 = 1, the Bayes factor is

(n + 1) C(n, x) p_0^x (1 − p_0)^{n−x}.    (4.115)

These values have been calculated for n = 10 and p_0 = 1/4 in Table 4.116,
together with the posterior probability when q_0 = 1/2. Note that if we used a
0-1-c loss function with c = 19 (so that 1/(1 + c) = 0.05), we would still accept
H even when X = 6 was observed.
As we noted in Section 4.2.2, we would run into trouble if we naively tried to
use an improper prior (with α_0 = β_0 = 0) for the alternative. The Bayes factor

TABLE 4.116. Bayes Factor and Posterior Probability in Binomial Example

 x    Bayes Factor    Posterior Prob.
 0       0.619            0.3825
 1       2.064            0.6737
 2       3.097            0.7559
 3       2.753            0.7335
 4       1.606            0.6163
 5       0.642            0.3911
 6       0.178            0.1514
 7       0.034            0.0329
 8       0.004            0.0042
 9     3 × 10⁻⁴           0.0003
10     1 × 10⁻⁵          1 × 10⁻⁵
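The entries of Table 4.116 follow directly from (4.115); the short script below (an illustrative check, not part of the book) reproduces them, using the fact that with q_0 = 1/2 the posterior odds equal the Bayes factor:

```python
import math

n, p0, q0 = 10, 0.25, 0.5
for x in range(n + 1):
    # Bayes factor (4.115): (n+1) * C(n,x) * p0^x * (1-p0)^(n-x)
    bf = (n + 1) * math.comb(n, x) * p0**x * (1 - p0)**(n - x)
    # with prior odds q0/(1-q0) = 1, posterior probability = bf/(1+bf)
    post = bf / (1 + bf)
    print(x, round(bf, 3), round(post, 4))
```

At X = 6 the posterior probability is about 0.151, which exceeds 1/(1+c) = 0.05, so H is still accepted, as claimed in the text.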

in (4.114) would become ∞ in this case. On the other hand, if we let q_0 go to
zero at a rate such that

lim q_0 (α_0 + β_0)/(α_0 β_0) = k,

then the product of the prior odds ratio q_0/(1 − q_0) and the Bayes factor would
converge to

k (n − 1) C(n − 2, x − 1) p_0^x (1 − p_0)^{n−x}

if both x > 0 and n − x > 0. This has a form similar to (4.115).


Another Bayesian solution would be to replace H and A by H′ : |Θ − θ_0| ≤ δ
and A′ : |Θ − θ_0| > δ. Suppose that P ~ Beta(α_0, β_0). The posterior distribution
of P given X = x is Beta(α_0 + x, β_0 + n − x). It is an easy matter to calculate
Pr(H′ true | X = x) for various values of δ and x. Figure 4.119 gives plots of
Pr(|P − 1/4| ≤ δ | X = x) for α_0 = β_0 = 1 for all values of x = 0, …, 10 when
n = 10. For example, suppose that δ = 0.1. We see that for x = 0, …, 5, the
posterior probability of the hypothesis is greater than 0.05. So, if c = 19 and
we use the 0-1-c loss function, we would accept H′ if X ≤ 5 and would reject
otherwise.
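The cutoff between x = 5 and x = 6 can be verified with a small numerical integration of the Beta posterior density (a check sketched for this discussion; the trapezoid rule is used to avoid any dependence on a statistics library):

```python
import math

def beta_pdf_unnorm(p, a, b):
    # unnormalized Beta(a, b) density, p**(a-1) * (1-p)**(b-1)
    return p**(a - 1) * (1 - p)**(b - 1)

def prob_in_interval(a, b, lo, hi, m=20000):
    # Pr(lo <= P <= hi) for P ~ Beta(a, b), via trapezoid-rule integration
    def integral(u, v):
        h = (v - u) / m
        s = 0.5 * (beta_pdf_unnorm(u, a, b) + beta_pdf_unnorm(v, a, b))
        s += sum(beta_pdf_unnorm(u + i * h, a, b) for i in range(1, m))
        return s * h
    return integral(max(lo, 0.0), min(hi, 1.0)) / integral(0.0, 1.0)

n, p0, delta = 10, 0.25, 0.1
for x in range(n + 1):
    # posterior of P given X = x under the uniform (Beta(1,1)) prior
    pr = prob_in_interval(1 + x, 1 + n - x, p0 - delta, p0 + delta)
    print(x, round(pr, 4))
```

The computed posterior probabilities exceed 0.05 exactly for x = 0, …, 5, matching the acceptance rule stated above.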

Notice that the condition that the derivative of the power function be 0
in Example 4.112 was equivalent to

E_{θ_0}[Xφ(X)] = α E_{θ_0}(X)    (4.117)

if the size is α. This is true in general in exponential families.


Proposition 4.118.34 If X has a one-parameter exponential family dis-
tribution with natural parameter Θ and φ is a test of H : Θ = θ_0 with size
α, then β_φ has 0 derivative at θ_0 if and only if (4.117) holds.

34This proposition is used in the proof of Theorem 4.124.



FIGURE 4.119. Pr(|P − 1/4| ≤ δ | X = x) for all δ ≤ 1/2 and all x = 0, …, 10

When UMPU level α tests do not exist, one can try to find LMPU
(locally most powerful unbiased) level α tests.35 When power functions
are continuously differentiable and φ is the unique unbiased level α test of
H : Θ = θ_0 with maximum second derivative for the power function, then φ
is LMPU level α relative to d(θ) = |θ − θ_0|. (See Problem 50 on page 292.)

4.5 Nuisance Parameters


When the parameter is multidimensional and Ω_H is a smaller-dimensional
space, the remaining dimensions are often called nuisance parameters for
reasons that will become apparent shortly. In a Bayesian analysis, one must
integrate nuisance parameters out of the joint posterior distribution of the
parameters and base inference on the marginal distribution of the param-
eters of interest. This can be a nuisance also.

4.5.1 Neyman Structure


The approach that we will take to finding UMPU tests in the presence
of nuisance parameters is to find a statistic T such that the conditional
distribution of the data given T has a one-dimensional parameter. In many
cases, it will then turn out that the UMPU test among all tests that have

35We leave it to the interested reader to write a formal definition of LMPU.



level α conditional on T will also be UMPU unconditionally. If a test is
α-similar conditional on T, we say that it has Neyman structure.
Definition 4.120. Let G ⊆ Ω be a subparameter space corresponding to
a subfamily Q_0 of P_0, and let Ψ : Q_0 → G be the subparameter. If T is a
sufficient statistic for Ψ in the classical sense, then a test φ has Neyman
structure relative to G and T if E_θ[φ(X)|T = t] is constant in t a.s. [P_θ] for
all θ ∈ G.
It is easy to see that if Q_0 = {P_θ : θ ∈ Ω̄_H ∩ Ω̄_A} and if φ has Neyman
structure, then φ is α-similar. We will prove later (Lemma 4.122) that under
certain conditions all α-similar tests have Neyman structure. In these cases,
one can find UMP α-similar (hence UMPU level α by Proposition 4.92)
tests by restricting attention to tests with Neyman structure. Consider the
following example.
Example 4.121. Suppose that X_1, …, X_n are IID N(μ, σ²) random variables
conditional on (M, Σ) = (μ, σ). The usual two-sided t-test of H : M = μ_0 versus
A : M ≠ μ_0 is

φ(x) = { 1 if |x̄ − μ_0| > (s/√n) T_{n−1}^{−1}(1 − α/2),
         0 otherwise,

where T_{n−1} is the CDF of the t_{n−1}(0, 1) distribution and s² = ∑_{i=1}^n (x_i − x̄)²/(n −
1). Here, the intersection of Ω̄_H with Ω̄_A is Ω_H = {(μ, σ) : μ = μ_0}. It is easy to
see that φ is α-similar, as follows. Given (M, Σ) = (μ_0, σ) ∈ Ω_H, the conditional
distribution of T = (X̄ − μ_0)/(S/√n) is t_{n−1}(0, 1) for all σ. Hence

β_φ(γ) = P_γ[ √n |X̄ − μ_0|/S > T_{n−1}^{−1}(1 − α/2) ] = α.


Let the subparameter space be Ω_H itself. A sufficient statistic for the subparam-
eter Σ is

U = ∑_{i=1}^n (X_i − μ_0)² = (n − 1)S² + n(X̄ − μ_0)².

We can write

W = (X̄ − μ_0)/√U = sign(T) / [ √n √((n − 1)/T² + 1) ],

so that W is a one-to-one increasing function of T and φ(X) is a function of W.
We need to show that the conditional distribution of W given (M, Σ) = γ and
U = u is the same for all γ ∈ Ω_H and all u. If this is true, then W is independent
of U given (M, Σ) = γ ∈ Ω_H and

E_γ[φ(X)|U = u] = E_γ[φ(X)] = α,

for all γ ∈ Ω_H, and φ would have Neyman structure relative to Ω_H and U.
Since the distribution of (X_1 − μ_0, …, X_n − μ_0) is spherically symmetric, we
showed in Examples B.56 and B.60 (see pages 627 and 628) that the conditional
distribution of

( (X_1 − μ_0)/√U, …, (X_n − μ_0)/√U )

given U is uniform on the sphere of radius 1 and is independent of U. Hence W
is independent of U given (M, Σ) = γ ∈ Ω_H.

Lemma 4.122.36 If T is a boundedly complete sufficient statistic for the
subparameter space G ⊆ Ω, then every α-similar test on G has Neyman
structure relative to G and T.

PROOF. By α-similarity, E_θ{E_θ[φ(X)|T] − α} = 0 for all θ ∈ G. Since T
is boundedly complete, it must be that E_θ[φ(X)|T] = α, a.s. [P_θ] for all
θ ∈ G. □
We can now combine this result with Proposition 4.92 to conclude a
useful result for identifying cases in which UMPU tests exist.
Theorem 4.123. Let G = Ω̄_H ∩ Ω̄_A. Let I be an index set, and suppose
that G = ∪_{i∈I} G_i, where the subsets G_i form a partition of G. Suppose that
there exists a statistic T that is a boundedly complete sufficient statistic for
each subparameter space G_i, i ∈ I. Also, assume that the power function
of every test is continuous. If there is a test that is UMPU level α among those
which have Neyman structure relative to G_i and T for all i ∈ I, then that
test will be UMPU level α.
PROOF. Because the power functions are continuous, Proposition 4.92 says
that all unbiased level α tests are α-similar. Lemma 4.122 says that because
there is a boundedly complete sufficient statistic T for each subparameter
space G_i, every α-similar test has Neyman structure relative to G_i and T.
The result now follows. □
The way that Theorem 4.123 is generally used is the following. We sup-
pose that power functions are continuous and that there exists a partition
of G = Ω̄_H ∩ Ω̄_A into one or two sets (G = G_0 or G = G_1 ∪ G_2) and
a statistic T that is boundedly complete sufficient for each G_i. We also
suppose that the conditional distribution of X given T is a one-parameter
family with parameter g(Θ) for some function g. We also suppose that Ω_H
can be written as {θ : g(θ) ≤ b_0} or {θ : g(θ) ∈ [b_0, b_1]} or one of the other
forms with which we are already familiar. So, for example, if Θ = (M, Σ)
and Ω_H = {(μ, σ) : b_0 ≤ μ ≤ b_1}, then g(μ, σ) = μ, G_1 = {(μ, σ) : μ = b_0},
and G_2 = {(μ, σ) : μ = b_1}. For the one-sided cases, we assume that the
family of distributions of X given T has MLR, but for the other cases,
the conditional distribution of X given T should be a one-parameter ex-
ponential family with natural parameter g(θ). We then find the UMP or
UMPU level α test of H conditional on T. For all of the cases, except for
the case in which the hypothesis is H : g(Θ) = b_0, the UMP or UMPU
level α test conditional on T will also be UMP or UMPU among tests
with Neyman structure. The reason is that these tests are derived as UMP
or UMPU among all tests that satisfy conditions that are equivalent to
36This lemma is used in the proofs of Theorems 4.123 and 4.124.



having Neyman structure. For example, when H : g(Θ) ∈ [b_0, b_1], the
two-sided test is conditionally UMPU level α among all tests that have
level α and have conditional power function equal to α at g(θ) = b_0 and
g(θ) = b_1. This is exactly what it means to have Neyman structure relative
to G_1 = {θ : g(θ) = b_0} and relative to G_2 = {θ : g(θ) = b_1}. For the
case of H : g(Θ) = b_0, the two-sided test is conditionally UMPU level α
among those tests with conditional power function α at g(θ) = b_0 and with
derivative of the conditional power function equal to 0 at g(θ) = b_0. This
last condition is not part of the definition of Neyman structure. Hence,
in these cases, we need to prove that every Neyman structure test also
satisfies this last condition. Problem 58 on page 293 is an example of this
situation. In multiparameter exponential families, when g(Θ) is just one of
the coordinates of Θ, every α-similar level α test will have zero derivative
for the conditional power function, as we prove in Theorem 4.124.

4.5.2 Tests about Natural Parameters


The case in which we can prove the most complete result is that of an
exponential family in which the hypothesis concerns one of the parameters.
Theorem 4.124. Let (X_1, …, X_k) have a k-parameter exponential fam-
ily distribution with natural parameter Θ = (Θ_1, …, Θ_k), and let U =
(X_2, …, X_k).
1. Suppose that the hypothesis is one-sided or two-sided concerning only
Θ_1. Then there is a UMP level α test conditional on U, and it is
UMPU level α.
2. If the hypothesis concerns only Θ_1 and the alternative is two-sided,
then there is a UMPU level α test conditional on U, and it is UMPU
level α.
PROOF. Suppose that the density of X with respect to a measure ν is
written

f_{X|Θ}(x|θ) = c(θ) h(x) exp( ∑_{i=1}^k θ_i x_i ).

Let G = Ω̄_H ∩ Ω̄_A, the intersection of the closures of the hypothesis
and alternative sets. The conditional density of X_1 given (X_2, …, X_k) =
(x_2, …, x_k) with respect to the measure dν_{X|U}(x_1|u) (from Theorem B.46
with X = ℝ and U = ℝ^{k−1}) is

f_{X_1|Θ,U}(x_1|θ, u) = exp(θ_1 x_1) h(x) / ∫ h(x) exp(θ_1 x_1) dν_{X|U}(x_1|u),

which can be seen to depend on θ only through θ_1. So, for each vector u, the
conditional distribution of X_1 is a one-parameter exponential family with

natural parameter Θ_1. For the hypotheses considered in this theorem, the
subparameter space G is either a set G_0 = {θ : θ_1 = θ_1^0} for some θ_1^0 or
the union of two such sets, G_1 = {θ : θ_1 = θ_1^1} and G_2 = {θ : θ_1 = θ_1^2}. For
each such subset of Ω, the subparameter Ψ = (Θ_2, …, Θ_k) has complete
sufficient statistic U = (X_2, …, X_k). Let η be an unbiased level α test. It
follows from Proposition 4.92 that η is α-similar on G_0, or on G_1 and on G_2,
whichever is appropriate. By Lemma 4.122, η has Neyman structure. Also,
for every test η, β_η(θ) = E_θ(E_θ[η(X)|U]), so that a test that maximizes the
conditional power function uniformly for θ ∈ Ω_A subject to constraints also
maximizes the marginal power function subject to the same constraints.
For part (1), in the conditional problem given U = u, there is a level
α test φ that maximizes the conditional power function uniformly on Ω_A
subject to having Neyman structure (see Theorem 4.56, Propositions 4.62–
4.65, and Theorem 4.82). Since every unbiased level α test has Neyman
structure, and the power function is the expectation of the conditional
power function, φ is UMPU level α.
For part (2), we consider two cases. First, suppose that Ω_H = {θ : c_1 ≤
θ_1 ≤ c_2} with c_2 > c_1. Then, Lemma 4.99 shows that there is a test φ
whose conditional power function is maximized uniformly on Ω_A subject
to having Neyman structure. It follows as before that φ is UMPU level α.
Finally, suppose that Ω_H = {θ : θ_1 = θ_1^0}. If η is unbiased level α, then
β_η must have zero partial derivative with respect to θ_1 evaluated at every
point in G. Using Theorem 2.64, just as in the proof of Theorem 4.104, we
get, for every θ_* ∈ G,

0 = (∂/∂θ_1) β_η(θ) |_{θ=θ_*} = E_{θ_*}[X_1 η(X) − αX_1].

By the law of total probability B.70, E_{θ_*}(E_{θ_*}[X_1 η(X) − αX_1 | U]) = 0 for
every θ_* ∈ G. Since U is complete for Θ ∈ G, it follows that

E_θ[X_1 η(X) − αX_1 | U] = 0, a.s. [P_θ], for all θ ∈ G.    (4.125)

So every unbiased level α test η must satisfy (4.125). Proposition 4.118
says that (4.125) is equivalent to the derivative of the conditional power
function with respect to θ_1 at θ_1 = θ_1^0 being zero. Subject to this condition
and having Neyman structure, there is a test that maximizes the conditional
power function at all θ ∈ Ω_A according to Corollary 4.109. This test is then
UMPU level α. □
Example 4.126. Suppose that (X_1, U) has a multiparameter exponential family
distribution, and Ω_H = {θ : θ_1 ≤ θ_1^0} with Ω_A = {θ : θ_1 > θ_1^0}. The
conditional UMP level α test is

φ(x_1|u) = { 1     if x_1 > d(u),
             γ(u)  if x_1 = d(u),
             0     if x_1 < d(u),

where d and γ are chosen so that the conditional size is α a.s. This test is UMPU
level α by Theorem 4.124.

Example 4.127 (Continuation of Example 4.121; see page 266). The usual two-
sided t-test is an example of Theorem 4.124. To see this, write the joint density
of (X̄, S²) as

f(x̄, s²|μ, σ) = [ √n (n − 1)^{(n−1)/2} / ( √(2π) 2^{(n−1)/2} Γ((n − 1)/2) ) ] σ^{−n} (s²)^{(n−3)/2}
    × exp( −(1/(2σ²)) [ n(x̄ − μ_0 − [μ − μ_0])² + (n − 1)s² ] )
  = r(μ, σ) h(x̄, s²) exp(θ_1 v + θ_2 u),

where r(μ, σ) depends solely on the parameters and

θ_1 = (μ − μ_0)/σ²,  θ_2 = −1/(2σ²),
v = n(x̄ − μ_0),  u = n(x̄ − μ_0)² + (n − 1)s².

(Note that u is the observed value of the statistic U from Example 4.121.) The-
orem 4.124 says that the UMPU level α test of H : Θ_1 = 0 versus A : Θ_1 ≠ 0 is
the conditional UMPU level α test of H versus A given U. Note that Θ_1 = 0 is
equivalent to M = μ_0. Since V has a one-parameter exponential family distribu-
tion with natural parameter Θ_1 given U, the test will be a two-sided test of the
form

φ(v|u) = { 1 if v < d_1(u) or v > d_2(u),
           0 if d_1(u) ≤ v ≤ d_2(u),

where d_1(u) and d_2(u) are chosen so that the conditional power function equals
α at θ_1 = 0 for all u and so that the derivative of the conditional power function
equals 0 at θ_1 = 0 for all u. As we saw in Example 4.121, the two-sided t-
test has the above form with d_1(u) = −c√u and d_2(u) = c√u, where c > 0 is a
constant. We also saw that the t-test has conditional level α given U. The fact
that the derivative of the conditional power function is 0 follows, as in the proof
of Theorem 4.124, from the fact that the partial derivative of the marginal power
function is 0. Hence, the usual two-sided t-test is UMPU level α.
One possible Bayesian approach is to put positive probability on Ω_H. This was
done in Example 4.22 on page 224, where Bayes factors were computed. It is not
the case that the usual two-sided t-test is equivalent to rejecting H when the
Bayes factor is less than a constant. However, with a special type of improper
prior, the two tests are equivalent. In Example 4.22, we showed that if λ_0 → 0
and p_0 → 0 in such a way that the ratio p_0/√λ_0 → k, a constant, then the
posterior odds in favor of the hypothesis converge to a quantity that is a decreasing
function of |t|, where t is the usual t statistic.


Example 4.128. Suppose that X and Y are conditionally independent given
Γ = (Λ, M) = (λ, μ) with X ~ Poi(λ) and Y ~ Poi(μ). We wish to test H : Λ =
2M versus A : Λ ≠ 2M. Set G = {(λ, μ) : λ = 2μ}. Set Ψ = Λ = 2M for the
subparameter space G, and note that

f_{X,Y|Ψ}(x, y|ψ) = exp(−3ψ/2) ψ^{x+y} 2^{−y} / (x! y!),

for x, y = 0, 1, …. It follows that T = X + Y is a complete sufficient statistic for
the subparameter space. The distribution of T is

f_{T|Ψ}(t|ψ) = exp(−3ψ/2) (3ψ/2)^t / t!,

for t = 0, 1, …. The conditional distribution of (X, Y) given T and Γ is one-
dimensional and can be represented by X alone. It is

f_{X|T,Γ}(x|t, λ, μ) = [ (λ/μ)^x / (x!(t − x)!) ] [ ∑_{a=0}^t (λ/μ)^a / (a!(t − a)!) ]^{−1},

for x = 0, …, t. This is easily seen to be a one-parameter exponential family
with natural parameter Θ = g(Γ) = log(Λ/M). The hypothesis can be written
as H : Θ = log(2) versus A : Θ ≠ log(2). All we need to show is that every
α-similar test with level α has zero derivative for the conditional power function
at θ = log(2). To do this, first we reparameterize to Θ and M. Then

f_{X,Y|Θ,M}(x, y|θ, μ) = exp(−μ[exp(θ) + 1]) exp(xθ) μ^{x+y} / (x! y!).

Every unbiased α-similar level α test must have partial derivative of the power
function with respect to θ equal to 0 at every point (log(2), μ) in G; otherwise
the power function would dip below α on the alternative. The partial derivative
of the power function of a test φ with respect to θ is

(∂/∂θ) β_φ(θ, μ) = ∑_{x=0}^∞ ∑_{y=0}^∞ φ(x, y) exp(−μ[exp(θ) + 1]) [ exp(xθ) μ^{x+y} / (x! y!) ] [x − μ exp(θ)]
  = E_{θ,μ}(Xφ(X, Y)) − μ exp(θ) β_φ(θ, μ).

Now, plug in θ = log(2) and set this equal to 0. Note that 2μ is the mean of X
and the power function at (log(2), μ) is α for every μ for an α-similar test. Hence,
every α-similar level α test φ must satisfy

0 = E_{log(2),μ}(Xφ(X, Y) − Xα) = E_{log(2),μ}( E[Xφ(X, Y) − Xα | T] ),

for all μ. Let h(t) = E(Xφ(X, Y) − Xα | T = t). Since T is complete for the sub-
parameter space, E_{log(2),μ}(h(T)) = 0 for all μ implies h(T) = 0, a.s. [P_{log(2),μ}] for
all μ. By Proposition 4.118, this is equivalent to the derivative of the conditional
power function being 0 at θ = log(2).
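The conditional distribution in this example is in fact Bin(t, λ/(λ + μ)), and it depends on (λ, μ) only through λ/μ, i.e., through Θ = log(Λ/M). A quick numerical check (illustrative values; not part of the book):

```python
import math

def pois(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

def cond_pmf(x, t, lam, mu):
    # P(X = x | T = t) computed directly from the joint Poisson pmfs
    return pois(x, lam) * pois(t - x, mu) / sum(
        pois(a, lam) * pois(t - a, mu) for a in range(t + 1))

def binom_pmf(x, t, p):
    return math.comb(t, x) * p**x * (1 - p)**(t - x)

lam, mu, t = 3.0, 1.5, 7     # lam = 2*mu, i.e. theta = log(lam/mu) = log 2
for x in range(t + 1):
    direct = cond_pmf(x, t, lam, mu)
    binom = binom_pmf(x, t, lam / (lam + mu))
    scaled = cond_pmf(x, t, 10 * lam, 10 * mu)  # depends only on the ratio lam/mu
    print(x, round(direct, 6), round(binom, 6), round(scaled, 6))
```

All three columns agree, confirming both the binomial form and the invariance to the nuisance scale.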

None of the theorems of this section provides an essentially complete
class of tests. To find an essentially complete class, we could piece together
conditional tests given U = (X_2, …, X_k). For example, let ψ be a test of
H : Θ_1 ≤ θ_1^0 versus A : Θ_1 > θ_1^0. Let β_ψ(θ_1|u) be the conditional power
function of ψ given U = u. For each u, there is a one-sided test of the form

φ(x_1|u) = { 1     if x_1 > c(u),
             γ(u)  if x_1 = c(u),                  (4.129)
             0     if x_1 < c(u),

which has maximum conditional power function for θ_1 > θ_1^0 and minimum
conditional power function for θ_1 < θ_1^0 subject to the power being β_ψ(θ_1^0) at
θ_1 = θ_1^0. Since φ in (4.129) minimizes and maximizes the power in precisely
the right places uniformly in u among all tests in a class that contains ψ, it
follows that R(θ, φ) ≤ R(θ, ψ) for all θ if a hypothesis-testing loss function
is being used. It follows that tests of the form (4.129) form an essentially
complete class. Other hypotheses can be handled in a similar way.

4.5.3 Linear Combinations of Natural Parameters


Most of the popular tests in the theory of normal distributions and linear
models involve linear combinations of the natural parameters of an expo-
nential family. In a k-parameter exponential family with natural parameter
Θ, let Ψ_1 = ∑_{i=1}^k c_i Θ_i with c_1 ≠ 0. Let Ψ_i = Θ_i for i = 2, …, k, and set
Y_1 = X_1/c_1 and Y_i = X_i − c_i X_1/c_1 for i = 2, …, k. Then Y = (Y_1, …, Y_k)
has an exponential family distribution with natural parameter Ψ. If we want
to test a hypothesis concerning Ψ_1, we can proceed as above.
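The reparameterization works because it preserves the exponent of the exponential family: ∑_i θ_i x_i = ∑_i ψ_i y_i identically. A small numerical sketch of this identity (arbitrary randomly generated values, for illustration only):

```python
import random

k = 4
random.seed(0)
c = [random.uniform(0.5, 2.0) for _ in range(k)]       # c_1 != 0
theta = [random.gauss(0, 1) for _ in range(k)]
x = [random.gauss(0, 1) for _ in range(k)]

# psi_1 = sum_i c_i theta_i; psi_i = theta_i for i >= 2
psi = [sum(ci * ti for ci, ti in zip(c, theta))] + theta[1:]
# y_1 = x_1/c_1; y_i = x_i - c_i x_1/c_1 for i >= 2
y = [x[0] / c[0]] + [x[i] - c[i] * x[0] / c[0] for i in range(1, k)]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

print(dot(theta, x), dot(psi, y))   # the two exponents coincide
```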
Example 4.130 (Continuation of Example 4.127; see page 270). The natural
parameters are Θ_1 = M/Σ² and Θ_2 = −1/[2Σ²]. Testing M = μ_0 is then equiv-
alent to testing Θ_1 + 2μ_0Θ_2 = 0. So, we set Ψ_1 = Θ_1 + 2μ_0Θ_2, and the usual
two-sided t-test will result exactly as in Example 4.127.
Example 4.131. Suppose that X and Y are conditionally independent given
(Λ, M) = (λ, μ) with X ~ Poi(λ) and Y ~ Poi(μ). Suppose that H : Λ = M.
In natural parameter form, Θ_1 = log Λ and Θ_2 = log M. So H : Θ_1 − Θ_2 =
0. Set Ψ_1 = Θ_1 − Θ_2 and Ψ_2 = Θ_2. Let Z_1 = X and Z_2 = X + Y. Then
(Z_1, Z_2) has an exponential family distribution with natural parameter Ψ. We need
the conditional distribution of Z_1 given Z_2. For z_1 = 0, …, z_2 and z_2 = 0, 1, …,
we have

f_{Z_1,Z_2|Ψ}(z_1, z_2|ψ) = c(ψ) [ 1/(z_1!(z_2 − z_1)!) ] exp(z_1ψ_1 + z_2ψ_2),

f_{Z_2|Ψ}(z_2|ψ) = d(ψ) [ (1 + exp(ψ_1))^{z_2} / z_2! ] exp(z_2ψ_2),

f_{Z_1|Z_2,Ψ}(z_1|z_2, ψ) = r(ψ) k(z_1, z_2) exp(z_1ψ_1)
  = C(z_2, z_1) (λ/(λ + μ))^{z_1} (μ/(λ + μ))^{z_2−z_1},

for z_1 = 0, …, z_2. So Z_1 given Z_2 = z_2 and (Θ_1, Θ_2) = (log λ, log μ) has a
Bin(z_2, λ/(λ + μ)) distribution. The UMPU level α test for the binomial pa-
rameter equal to 1/2, conditional on Z_2, is the UMPU level α test of Λ = M.
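The conditional binomial distribution can be confirmed directly from the joint law of (X, X + Y), using the fact that X + Y ~ Poi(λ + μ) (illustrative parameter values, chosen for this check):

```python
import math

def pois(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, mu, z2 = 2.0, 3.0, 6
for z1 in range(z2 + 1):
    # P(Z1 = z1 | Z2 = z2) from the joint distribution of (X, X + Y)
    direct = pois(z1, lam) * pois(z2 - z1, mu) / pois(z2, lam + mu)
    binom = math.comb(z2, z1) * (lam / (lam + mu))**z1 * (mu / (lam + mu))**(z2 - z1)
    print(z1, round(direct, 6), round(binom, 6))
```

Under H : Λ = M the success probability λ/(λ + μ) is 1/2, which is why the conditional test reduces to a test about a binomial parameter equal to 1/2.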

4.5.4 Other Two-Sided Cases*


In Theorem 4.124, we saw how to deal with cases in which the hypothesis or
alternative is two-sided in a natural parameter. But other two-sided cases

*This section may be skipped without interrupting the flow of ideas.



arise.
Example 4.132. Suppose that X_1, …, X_n are conditionally IID N(μ, σ²) given
(M, Σ) = (μ, σ). This is an exponential family with natural parameters Θ_1 =
M/Σ² and Θ_2 = −1/[2Σ²]. Suppose that we wish to test H : a ≤ M ≤ b versus
A : not H. We can rewrite H in terms of the natural parameters as

H : Θ_1 + 2bΘ_2 ≤ 0 and Θ_1 + 2aΘ_2 ≥ 0.

We can reparameterize to Ψ_1 = Θ_1 + 2bΘ_2 and Ψ_2 = Θ_1 + 2aΘ_2, and the
hypothesis becomes H : Ψ_1 ≤ 0 and Ψ_2 ≥ 0. (If a < b, the parameter space is
{(ψ_1, ψ_2) : ψ_1 < ψ_2}.) This is not of any of the forms we have studied so far.
Suppose that φ is an unbiased level α test of H versus A. Then φ must be
α-similar. This means that β_φ(0, ψ_2) = β_φ(ψ_1, 0) = α for all ψ_1 < 0 < ψ_2. It is not
easy to construct nontrivial tests with these properties. (Of course, the trivial
test φ(x) = α for all x has these properties and is unbiased.)
We can address this problem simply within the Bayesian framework. To stay as
close to the classical solutions as possible, suppose that we use the usual improper
prior, so that the posterior distribution of M is M ~ t_{n−1}(x̄_n, s_n²/n), where x̄_n =
∑_{i=1}^n x_i/n and s_n² = ∑_{i=1}^n (x_i − x̄_n)²/(n − 1). The posterior probability that H
is true is

P = T_{n−1}( √n (b − x̄_n)/s_n ) − T_{n−1}( √n (a − x̄_n)/s_n ).

The formal Bayes rule with a 0-1-c loss would be to reject H if P < 1/(1 + c).
To see what this test looks like, note that for each value of s_n, P is a decreasing
function of |t|, where

t = √n ( x̄_n − (a + b)/2 ) / s_n.    (4.133)

In fact,

P = T_{n−1}( √n(b − a)/(2s_n) − t ) − T_{n−1}( −√n(b − a)/(2s_n) − t ).

So the formal Bayes rule is to reject H if |t| > d(s_n), where it is also easy to see
that d(s_n) is a decreasing function of s_n.
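The algebraic identity between the two expressions for P can be checked numerically; the sketch below (illustrative data values, with a stdlib-only t CDF computed by trapezoid integration of the t density) evaluates both forms:

```python
import math

def t_cdf(x, df, m=4000):
    # CDF of the t_df distribution via trapezoid integration of its density
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    def f(u):
        return c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = x / m
    s = 0.5 * (f(0.0) + f(x)) + sum(f(i * h) for i in range(1, m))
    return 0.5 + s * h

n, a, b = 12, -1.0, 2.0
xbar, s = 0.8, 1.3                      # invented sample summaries

P1 = (t_cdf(math.sqrt(n) * (b - xbar) / s, n - 1)
      - t_cdf(math.sqrt(n) * (a - xbar) / s, n - 1))

t = math.sqrt(n) * (xbar - (a + b) / 2) / s
c0 = math.sqrt(n) * (b - a) / (2 * s)
P2 = t_cdf(c0 - t, n - 1) - t_cdf(-c0 - t, n - 1)
print(P1, P2)   # the two expressions for P agree
```

Since T_{n−1} is the CDF of a symmetric distribution, the second form also makes the symmetry and monotone decrease of P in |t| easy to see.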
One possible classical solution is to abandon the UMPU criterion and just use
the usual t-test for testing that M = (a + b)/2. That is, reject H if |t| > d, where
t is defined in (4.133) and d is determined to make the level α. Unfortunately,
the conditional distribution of √n(X̄_n − [a + b]/2)/S_n given (M, Σ) = (μ, σ) is
noncentral t, NCt_{n−1}(√n[μ − (a + b)/2]/σ). The power function of the usual t-test
at (μ, σ) for μ ≠ (a + b)/2 goes to 1 as σ goes to 0 for fixed μ. Hence the usual
t-test has level 1 as a test of H versus A. To get a test with level α < 1, one
could let d depend on s_n as in the formal Bayes rule (although one should note
that the formal Bayes rule also has level 1; the problem occurs as σ → ∞ for
the formal Bayes rule). Calculating the power function of the resulting test would
require a separate two-dimensional integration over the space of (x̄_n, s_n) values
for each (μ, σ) pair.
Another classical solution would be to add together the UMPU level α/2 test
of H′ : M ≤ b versus A′ : M > b and the UMPU level α/2 test of H″ : M ≥ a
versus A″ : M < a. It can be shown (see Problem 54 on page 292) that this
test has size α.37 The power function is easy to calculate. It is just the sum of
the two power functions for the two one-sided tests. If a < b, this test is biased,
since there exists θ ∈ Ω_A such that the power function at θ is close to α/2. Note,
however, that if a = b, then this test is exactly the same as the UMPU level α
test of M = a versus M ≠ a.

37This test is the likelihood ratio test, to be defined in Section 4.5.5.

One could change the hypothesis in Example 4.132 to make it more like
Example 4.127.
Example 4.134 (Continuation of Example 4.127; see page 270). Suppose that
we wish to test H : M ∈ [μ_0 + aΣ², μ_0 + bΣ²] versus A : M ∉ [μ_0 + aΣ², μ_0 + bΣ²].
This is a test about a linear combination of natural parameters, namely Ψ_1 =
Θ_1 + 2μ_0Θ_2. The hypothesis is H : Ψ_1 ∈ [a, b] versus A : Ψ_1 ∉ [a, b]. We can
let Ψ_2 = Θ_2 = −1/[2Σ²]. We would now need to work with the conditional
distribution of V = n(X̄ − μ_0) given U = n(X̄ − μ_0)² + (n − 1)S². This conditional
distribution will have a density equal to a constant (function of u and ψ_1) times
exp(−vψ_1)(u − v²/n)^{(n−1)/2−1} for |v| < √(nu). For each value of u, one would
have to find d_1(u) and d_2(u) so that

Pr( V < d_1(u) or V > d_2(u) | U = u, Ψ_1 = ψ_1 ) = α

for ψ_1 = a and for ψ_1 = b. Of course, one could wait until the data were observed
and then do it only for the observed value of U.

4.5.5 Likelihood Ratio Tests


A popular approach to forming tests, when no obvious UMP or UMPU test
is available, is to use the likelihood ratio criterion (LR). The idea is to start
with the Neyman–Pearson concept of likelihood ratios. In the Neyman–
Pearson setup, the likelihood under H is just a single number, as it is under
A. In general, however, the likelihoods are functions of θ. In the Bayesian
approach, we integrate those functions with respect to a distribution. In
LR tests, we maximize those functions over θ. The LR criterion is

LR = sup_{θ∈Ω_H} f_{X|Θ}(x|θ) / sup_{θ∈Ω} f_{X|Θ}(x|θ).

To test a hypothesis H using the LR criterion, choose a number c and
reject H if LR < c. The number c is usually chosen to make the level of the
test equal to some prespecified value. Sometimes the distribution of LR is
recognizable, and sometimes it is not. If the distribution is not recognizable,
then an approximate distribution is provided in Section 7.5.1.
If sup_{θ∈Ω_A} f_{X|Θ}(x|θ) = sup_{θ∈Ω} f_{X|Θ}(x|θ), then LR is the same as the
approximate Bayes factor in (4.24) when X = x is observed.38 If, in addi-
tion, Ω_H is a single point, then LR is the same as the global lower bound
on the Bayes factor (4.20).

38For example, if every point in Ω_H is the limit of a sequence of points in Ω_A
and f_{X|Θ}(x|θ) is continuous in θ, this condition will hold.

Example 4.135. Suppose that X_1, …, X_n are conditionally IID with Ber(θ)
distribution given Θ = θ. Let the hypothesis be H : Θ = θ_0 versus A : Θ ≠ θ_0.
Let Y = ∑_{i=1}^n X_i. Then

LR = { θ_0^Y(1 − θ_0)^{n−Y} / [ (Y/n)^Y (1 − Y/n)^{n−Y} ]   if Y ∉ {0, n},
       θ_0^n                                                if Y = n,
       (1 − θ_0)^n                                          if Y = 0,

since θ^Y(1 − θ)^{n−Y} is largest if θ = Y/n. The LR test would be to reject H if
LR is smaller than some specified value. As a function of Y, LR increases until
Y/n reaches θ_0 and then decreases. For example, if θ_0 = 1/4 and n = 10, we
have a case similar to Example 4.112 on page 262. The UMPU level α = 0.05
test of H : Θ = 1/4 was found there to be a test that rejected H if Y ≥ 6 and
randomized if Y ∈ {0, 5}. If Y = 0 is observed, then LR = 0.0563. If Y = 6 is
observed, then LR = 0.0647. It follows that the level α = 0.05 LR test is not
the same as the UMPU level 0.05 test. The reason is that no LR test can reject
for Y = 6 without rejecting also for Y = 0, since LR is smaller at Y = 0 than
at Y = 6. The level 0.05 LR test will reject H for Y ≥ 7 and will randomize if
Y = 0 with probability 0.8259 of rejecting H. Note that the LR test is of the
form of Theorem 4.104, so that it is admissible, but it is not UMPU.
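The LR values and the structure of the level-0.05 LR test can be reproduced with a few lines (a check for this discussion, not from the book; note that Python's `0.0**0 == 1.0` makes the boundary cases y ∈ {0, n} fall out of the general formula):

```python
import math

def lr(y, n, t0):
    # LR statistic for H: theta = t0 in the Bernoulli model
    num = t0**y * (1 - t0)**(n - y)
    den = (y / n)**y * (1 - y / n)**(n - y)   # 0**0 == 1 handles y in {0, n}
    return num / den

n, t0, alpha = 10, 0.25, 0.05
print(round(lr(0, n, t0), 4), round(lr(6, n, t0), 4))  # 0.0563, 0.0647

# level-0.05 LR test: reject for Y >= 7, then randomize at the point with the
# next-smallest LR value, which is Y = 0
def pmf(y):
    return math.comb(n, y) * t0**y * (1 - t0)**(n - y)

tail = sum(pmf(y) for y in range(7, n + 1))
gamma = (alpha - tail) / pmf(0)
print(round(gamma, 4))
```

The randomization probability at Y = 0 comes out near 0.826, consistent with the value quoted in the example.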
Example 4.136. Suppose that X_1, …, X_n are conditionally IID given Θ =
(M, Σ) = (μ, σ) with N(μ, σ²) distribution. Suppose that H : M = μ_0 is the
hypothesis. The formula in (4.26) gives the observed value of LR, namely (1 +
t²/[n − 1])^{−n/2}, where t is the test statistic for the usual t-test. Since LR is a
decreasing function of |t|, the level α LR test will be the same as the UMPU level
α test for all α.
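The closed form (1 + t²/[n − 1])^{−n/2} can be verified against a direct maximization of the normal likelihood, profiling out σ under both the hypothesis and the full model (the sample values are invented for illustration):

```python
import math

xs = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2]   # arbitrary sample
n, mu0 = len(xs), 4.0
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)

def max_loglik(mu_hat):
    # profile out sigma: sigma_hat^2 = mean squared deviation about mu_hat
    sig2 = sum((x - mu_hat) ** 2 for x in xs) / n
    return -n / 2 * (math.log(2 * math.pi * sig2) + 1)

lr_direct = math.exp(max_loglik(mu0) - max_loglik(xbar))
t = math.sqrt(n) * (xbar - mu0) / math.sqrt(s2)
lr_formula = (1 + t * t / (n - 1)) ** (-n / 2)
print(lr_direct, lr_formula)   # agree
```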

As mentioned earlier, the main reason for introducing LR tests is that
they can be used in situations in which UMPU tests do not exist. In the
following example, sup_{θ∈Ω} f_{X|Θ}(x|θ) is not equal to sup_{θ∈Ω_A} f_{X|Θ}(x|θ) for
all possible data values x, so that there will exist data sets for which LR
is not equal to the approximate Bayes factor (4.24).

Example 4.137. Suppose that X_1, …, X_n are conditionally IID N(μ, σ²) given
(M, Σ) = (μ, σ), and H : a ≤ M ≤ b versus A : not H. We can easily calculate
the LR criterion:

LR = { 1                                             if x̄ ∈ [a, b],
       ( 1 + n(x̄ − a)²/[(n − 1)s²] )^{−n/2}         if x̄ < a,
       ( 1 + n(x̄ − b)²/[(n − 1)s²] )^{−n/2}         if x̄ > b.

We can easily see that the level α LR test will be the sum of two one-sided tests.
Since Problem 54 on page 292 shows that the level of the sum of two one-sided
tests in this problem is the sum of the levels, and since LR decreases equally fast
as x̄ drops below a or rises above b, it follows that the two tests should each have
level α/2. The LR test becomes the test described at the end of Example 4.132.
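The three-branch form of LR comes from clamping the restricted MLE of M to [a, b]; a short check of this (invented sample values, for illustration):

```python
import math

def lr_interval(xs, a, b):
    # LR for H: a <= M <= b; the restricted MLE of M clamps xbar to [a, b]
    n = len(xs)
    xbar = sum(xs) / n
    muH = min(max(xbar, a), b)
    ssx = sum((x - xbar) ** 2 for x in xs)
    return (ssx / (ssx + n * (xbar - muH) ** 2)) ** (n / 2)

xs = [4.2, 5.1, 3.8, 6.0, 4.9, 5.5, 4.4, 5.2]   # arbitrary sample
n = len(xs)
xbar = sum(xs) / n
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)

print(lr_interval(xs, 4.0, 6.0))   # xbar inside [a, b] -> LR = 1
t_a = math.sqrt(n) * (xbar - 6.0) / math.sqrt(s2)
print(lr_interval(xs, 6.0, 7.0), (1 + t_a**2 / (n - 1)) ** (-n / 2))  # agree
```

When x̄ falls below a (here, below 6.0), the clamped-MLE computation reproduces the one-sided branch of the displayed formula.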

In Section 7.5.1, we will prove some large sample properties of LR tests.



4.5.6 The Standard F-Test as a Bayes Rule*

In many of the examples in this chapter, we compared various classical tests
to Bayesian procedures. In particular, we found that by using improper
priors, we could make many classical tests into formal Bayes rules. For the
case of normal distributions, it is actually possible to find a proper prior
distribution that leads to a similar result. Consider one of the general linear
models, such as analysis of variance or regression. In this section, we will
find a proper prior distribution for the parameters such that the standard
F-test emerges as a Bayes rule and is seen to be admissible with 0-1-c loss.
General linear models can be transformed into the following. The pa-
rameters are Θ = (M, Ψ, Σ), with M an r-dimensional vector and Ψ an
s-dimensional vector. Let k = s + r. The sufficient statistics are (Y, W, U),
which are conditionally independent given the parameters with distribu-
tions

Y ~ N_r(μ, σ²I_r),  U ~ N_s(ψ, σ²I_s),  W/σ² ~ χ²_d.    (4.138)

Example 4.139. Suppose that X_{1,i}, i = 1, …, n_1 are IID N(μ_1, σ²) independent
of X_{2,i}, i = 1, …, n_2 with N(μ_2, σ²) distribution given M_1 = μ_1, M_2 = μ_2, Σ = σ.
A popular hypothesis is H : M_1 = M_2. One can write this as H : M_1 − M_2 = 0. In
the above notation, r = 1, Y is the difference between the two sample averages
times √(n_1n_2/(n_1 + n_2)), and M = (M_1 − M_2)√(n_1n_2/(n_1 + n_2)). Also, s = 1,
U is the sum of all the observations divided by √(n_1 + n_2), and Ψ = (n_1M_1 +
n_2M_2)/√(n_1 + n_2). Finally, d = n_1 + n_2 − 2 and W is the pooled sum of squared
deviations.
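The scaling factors in this example are chosen precisely so that Y and U each have variance σ², as (4.138) requires; a quick arithmetic check (illustrative values of n_1, n_2, σ²):

```python
n1, n2, sigma2 = 5, 8, 2.3

# Var(Y) for Y = (xbar1 - xbar2) * sqrt(n1*n2/(n1+n2)):
# Var(xbar1 - xbar2) = sigma^2 (1/n1 + 1/n2), then rescale
var_diff = sigma2 * (1 / n1 + 1 / n2)
var_Y = var_diff * n1 * n2 / (n1 + n2)

# Var(U) for U = (sum of all n1+n2 observations) / sqrt(n1 + n2)
var_U = (n1 + n2) * sigma2 / (n1 + n2)

print(var_Y, var_U)   # both equal sigma^2
```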

Example 4.140. Suppose that Y₁, ..., Yₙ are conditionally independent given
B = β, Σ = σ with Yᵢ ~ N(x_i^T β, σ²) distribution, where the xᵢ are known
k-dimensional vectors. This is the standard linear regression model. A typical
hypothesis is of the form H : CB = c versus A : CB ≠ c, where C is an r × k
matrix of rank r < k and c is an r-dimensional vector. Define the matrix
Ĉ = Σ_{i=1}^n x_i x_i^T, and assume that Ĉ is nonsingular. The usual least-squares
estimator of B is B̂ = Ĉ^{-1} Σ_{i=1}^n x_i Y_i. Its conditional distribution
given the parameters is N_k(β, σ²Ĉ^{-1}). The conditional distribution of CB̂
given the parameters is N_r(Cβ, σ²CĈ^{-1}C^T). Let D be a (k − r) × k matrix
whose k − r rows are all orthogonal to the r rows of CĈ^{-1}.³⁹ The sufficient
statistics can then be written as

    Y = (CĈ^{-1}C^T)^{-1/2} CB̂,    W = Σ_{i=1}^n (Y_i − x_i^T B̂)²,
    U = (DĈ^{-1}D^T)^{-1/2} DB̂.

With Ψ = (DĈ^{-1}D^T)^{-1/2}Dβ, M = (CĈ^{-1}C^T)^{-1/2}Cβ, s = k − r, and
d = n − k, we see that (Y, W, U) have the distributions given in (4.138).
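The same check works for the regression reduction of Example 4.140. Below is a sketch with k = 2, r = 1, and made-up data, testing that the slope is zero (so the hypothesis matrix is the row (0, 1) and c = 0, both hypothetical choices). It confirms that Y²/(Y² + W) equals rF/(rF + d), where F here is the classical F statistic for dropping the covariate, i.e., the square of the slope's t statistic.

```python
import math

# Made-up data: x_i = (1, covariate), so k = 2; we test H: slope = 0.
xs = [(1.0, 0.5), (1.0, 1.5), (1.0, 2.0), (1.0, 3.0), (1.0, 4.5), (1.0, 5.0)]
ys = [1.0, 2.2, 2.1, 3.3, 4.1, 4.6]
n, k, r = len(xs), 2, 1

# The 2x2 matrix sum_i x_i x_i^T and its inverse by the adjugate formula.
c11 = sum(a * a for a, b in xs)
c12 = sum(a * b for a, b in xs)
c22 = sum(b * b for a, b in xs)
det = c11 * c22 - c12 * c12
inv = ((c22 / det, -c12 / det), (-c12 / det, c11 / det))

# Least-squares estimate B-hat = Chat^{-1} sum_i x_i y_i.
v1 = sum(a * y for (a, b), y in zip(xs, ys))
v2 = sum(b * y for (a, b), y in zip(xs, ys))
b1 = inv[0][0] * v1 + inv[0][1] * v2
b2 = inv[1][0] * v1 + inv[1][1] * v2

# With the hypothesis row (0, 1): C Chat^{-1} C^T is just inv[1][1],
# so Y is the scaled slope estimate.
Y = b2 / math.sqrt(inv[1][1])
W = sum((y - b1 * a - b2 * b) ** 2 for (a, b), y in zip(xs, ys))
d = n - k

F = (d / r) * Y ** 2 / W                   # F statistic for dropping the covariate
t = b2 / math.sqrt((W / d) * inv[1][1])    # classical t statistic for the slope
print(F, t ** 2)                           # identical
print(Y ** 2 / (Y ** 2 + W), r * F / (r * F + d))  # identical: the monotone map
```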

For the general situation, construct new random variables B > 0, Γ ∈ ℝ^r,
and H ∈ {0, 1}, which are conditionally independent of (Y, W, U) given Θ.

*This section may be skipped without interrupting the flow of ideas.


³⁹If k = r, we do not need the U vector.
4.5. Nuisance Parameters 277

The distribution of Θ and the new parameters is as follows. Given Ψ = ψ,
Σ = σ, B = β, Γ = γ, and H = h, M = hγβ/(1 + γ^Tγ) with probability 1.
Given Σ = σ, B = β, Γ = γ, and H = h, the conditional distribution of Ψ is
N_s(0, (1 − σ²)I), which we shall call P_σ. Given B = β, Γ = γ, and H = h,
Σ = 1/√(1 + γ^Tγ) with probability 1. Given B = β and H = h, the density of Γ
is π₀(γ) if h = 0 and π_{1,β}(γ) if h = 1.
Finally, B and H are independent, with B having some density f(β) strictly
positive on all of (0, ∞), and Pr(H = 0) = p₀.
The Bayes rule with respect to 0-1-c loss is to reject H = 0 if
Pr(H = 1|Y = y, U = u, W = w) is large.

  Pr(H = 1|Y = y, U = u, W = w) / Pr(H = 0|Y = y, U = u, W = w)        (4.141)

    = [∫∫∫∫ f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) dP_σ(ψ) dQ_{1,γ,β}(σ, μ) π_{1,β}(γ) dγ f(β) dβ]
      / [∫∫∫∫ f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) dP_σ(ψ) dQ_{0,γ}(σ, μ) π₀(γ) dγ f(β) dβ],

where Q_{1,γ,β} is the distribution for (M, Σ) that puts all of its mass on
μ = γβ/(1 + γ^Tγ) and σ = 1/√(1 + γ^Tγ); Q_{0,γ} is the distribution of
(M, Σ) that puts all of its mass on μ = 0 and σ = 1/√(1 + γ^Tγ); and
f_{Y,U,W|Θ}(y, u, w|μ, ψ, σ) is proportional to

    σ^{−(r+s+d)} w^{d/2−1} exp{ −[w + ||y − μ||² + ||u − ψ||²]/(2σ²) }.

To find the Bayes rule, we begin with the innermost integration (over ψ)
in both numerator and denominator (since they are the same). The integral
to be performed is proportional to

    ∫_{ℝ^s} exp{ −||u − ψ||²/(2σ²) − ||ψ||²/(2[1 − σ²]) } dψ.

It follows that σ(1 − σ²)^{1/2} is a scale factor for each coordinate of ψ. Hence
the integral is proportional to exp(−u^Tu/2). Since this depends on the
data alone, it cancels out of the numerator and the denominator along
with w^{d/2−1} and the constant in the data density. What remains of (4.141)
is

  ∫∫∫ σ^{−r−d} exp( −(1/(2σ²))[w + Σ_{i=1}^r (y_i − μ_i)²] ) dQ_{1,γ,β}(σ, μ) π_{1,β}(γ) dγ f(β) dβ
  / ∫∫∫ σ^{−r−d} exp( −(1/(2σ²))[w + Σ_{i=1}^r (y_i − μ_i)²] ) dQ_{0,γ}(σ, μ) π₀(γ) dγ f(β) dβ.
The next innermost integrals are with respect to point mass probabilities,
so they merely evaluate the integrands at the points where they put their
mass. The result is

  ∫∫ (1 + γ^Tγ)^{(r+d)/2} exp{ −[(1 + γ^Tγ)/2][w + ||y − μ̃||²] } π_{1,β}(γ) dγ f(β) dβ
  / ∫∫ (1 + γ^Tγ)^{(r+d)/2} exp{ −[(1 + γ^Tγ)/2](w + y^Ty) } π₀(γ) dγ f(β) dβ,

where μ̃ = γβ/(1 + γ^Tγ). Next, we integrate over γ. In the denominator, with
π₀(γ) = c₀(1 + γ^Tγ)^{−(r+d)/2}, we get

  ∫ c₀ exp{ −[(1 + γ^Tγ)/2](y^Ty + w) } dγ
    = c₀ exp{ −(y^Ty + w)/2 } (y^Ty + w)^{−r/2} (2π)^{r/2}.

Call this last expression K. In the numerator, we get K times
c₁(β) exp{ (β²/2) y^Ty/(y^Ty + w) }, apart from a constant. So, the ratio is

  constant × ∫ c₁(β) f(β) exp{ (β²/2) · y^Ty/(y^Ty + w) } dβ / ∫ f(β) dβ.

This is a one-to-one increasing function of

  y^Ty/(y^Ty + w) = rF/(rF + d),

where F is the classical F statistic. Hence the Bayes rule is to reject H


when F > c for some number c, which is the classical F-test. Because of
the way the prior distribution was constructed, we can show that the Bayes
rule is admissible (see Problem 60 on page 294), so the F-test is admissible.
Notice that the prior distribution depends on the sample size, so it could
not be used as a real prior distribution unless we knew for sure what the
sample size would be before observing the data.

4.6 P-Values
4.6.1 Definitions and Examples
A common criticism of hypothesis-testing methodology is that the decision
to "reject" or "accept" a hypothesis is not informative enough. One should
also provide a measure of the strength of evidence in favor of (or against)
the hypothesis. The posterior probability of the hypothesis is an obvious
candidate to provide the strength of evidence in favor of the hypothesis,
but the posterior is not available in a classical analysis. In fact, there is
no theory for strength of evidence or degree of support in the classical
theory. Instead, some alternatives to testing hypotheses are available. The
alternative considered here is to give the set of all levels for which a specific
hypothesis would be rejected.⁴⁰ For most of the tests that we will consider
in this book, the set of α values such that the level α test would reject H
will be an interval starting at some lower endpoint and extending to 1. The
lower endpoint will be called the P-value of the observed data relative to
the collection of tests.
Definition 4.142. Let H be a hypothesis, and let Γ be a set indexing
nonrandomized tests of H (i.e., {φ_γ : γ ∈ Γ} is a set of nonrandomized
tests of H). For each γ ∈ Γ, let α(γ) be the size of the test φ_γ. Define

    p_H(x) = inf{ α(γ) : φ_γ(x) = 1 }.

We call p_H(x) the P-value of the observed data x relative to the set of tests
for the hypothesis H.
Usually, when the data have continuous distribution, we can arrange for
Γ = [0, 1] and α(γ) = γ. That is, there is one and only one size γ test in the
set for each γ ∈ [0, 1]. If the data have a discrete distribution, it may be
impossible to achieve certain sizes with nonrandomized tests. Often, it is
understood implicitly which set of tests is under consideration and what is

⁴⁰Another alternative is to provide interval (or set) estimates for parameters
(see Section 5.2). For example, a coefficient 1 − α confidence set (Definition 5.47)
is defined in such a way that it contains all of the values of θ such that the
hypothesis H : Θ = θ would be accepted at level α (see Proposition 5.48).

the hypothesis. In these cases, p_H(x) is called the P-value without reference
to the set of tests or the hypothesis.

Example 4.143. Suppose that X ~ N(θ, 1) given Θ = θ and H₁ : Θ ∈
[−0.5, 0.5]. The UMPU level α test of H₁ is φ_α(x) = 1 if |x| > c_α, for some
number c_α. If X = 2.18 is observed, φ_α will reject H₁ if and only if 2.18 > c_α.
Since c_α increases as α decreases, the P-value is that α such that c_α = 2.18.
If c_α = 2.18, then the test is φ_α(x) = 1 if |x| > 2.18, so
α = Φ(−2.68) + 1 − Φ(1.68) = 0.0502 is the P-value.
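The arithmetic of Example 4.143 can be reproduced with only the standard normal CDF, written via the error function (a sketch using just the constants from the example):

```python
import math

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def size(c):
    """Size of the test that rejects H1: Theta in [-0.5, 0.5] when |X| > c.
    The supremum of the rejection probability is attained at theta = +/-0.5."""
    return Phi(-c - 0.5) + 1.0 - Phi(c - 0.5)

p = size(2.18)      # P-value when X = 2.18 is observed
print(round(p, 4))  # 0.0502
```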

The reader will note that we used the same notation p_H(x) to denote the
P-value as we used for the significance probability in Definition 4.8. The
reason is that they are almost always the same thing.

Proposition 4.144. Let {φ_γ : γ ∈ Γ} be a collection of tests. Suppose that
Γ ⊆ [0, 1] and that γ₁ > γ₂ implies φ_{γ₁}(x) ≥ φ_{γ₂}(x) for all x. Define the
binary relation ⪯ on the sample space by x ⪯ y if and only if the P-value
for x is at least as large as the P-value for y. Then ⪯ is a weak order and
the P-value always equals the significance probability.
The conditions of Proposition 4.144 say that for every possible observa-
tion x, if x leads to rejection of H at one level, then it leads to rejection at
every higher level. This is just a precise way of saying what we said earlier
about the set of levels at which a hypothesis would be rejected being an in-
terval running from some lower bound up to 1. Although the conditions of
Proposition 4.144 are met in most situations, the following example [from
Lehmann (1986)⁴¹] is a case in which they are not.
Example 4.145. Suppose that Ω = {1, 2, 3} and X = {1, 2, 3, 4}. Consider the
following conditional distribution for X given Θ:

    f_{X|Θ}(x|θ)    x = 1    x = 2    x = 3    x = 4
    θ = 1            2/13     4/13     3/13     4/13
    θ = 2            4/13     2/13     1/13     6/13
    θ = 3            4/13     3/13     2/13     4/13

Consider the hypothesis H : Θ ≤ 2 versus A : Θ = 3. One can show that the MP
level 5/13 test of H is φ_{5/13}(x) = 1 if x ∈ {1, 3} and that the MP level 6/13 test
is φ_{6/13}(x) = 1 if x ∈ {1, 2}. For α = 1, φ_α(x) = 1 for x ∈ {1, 2, 3, 4}. So X = 3
leads us to reject H at some high values of α and at some low values of α, but not
at certain values in between. The infimum of the set of all α such that φ_α(3) = 1
no longer tells us all of the levels at which we would reject H. In particular, one
of the conditions of Proposition 4.144 is violated.
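With only four sample points, the claims of Example 4.145 can be verified by brute force: enumerate every nonrandomized rejection region, record its size under H (the larger of the two null probabilities) and its power at θ = 3, and read off the best power at each attainable size. A sketch using exact rational arithmetic:

```python
from fractions import Fraction
from itertools import combinations

# f(x|theta), in units of 1/13; rows theta = 1, 2, 3, columns x = 1, 2, 3, 4.
f = {1: [2, 4, 3, 4], 2: [4, 2, 1, 6], 3: [4, 3, 2, 4]}
points = [1, 2, 3, 4]

def mass(theta, S):
    return Fraction(sum(f[theta][x - 1] for x in S), 13)

# Enumerate every nonrandomized rejection region S; its size under H: Theta <= 2
# is the max over theta in {1, 2}, and its power is the mass at theta = 3.
best = {}
for r in range(5):
    for S in combinations(points, r):
        alpha = max(mass(1, S), mass(2, S))
        best[alpha] = max(best.get(alpha, Fraction(0)), mass(3, S))

print(best[Fraction(5, 13)])   # 6/13, attained only by S = {1, 3}, which contains 3
print(best[Fraction(6, 13)])   # 7/13, attained by S = {1, 2}, which does not
```

The point x = 3 is in the level-5/13 region but not in the level-6/13 one, which is exactly the non-monotonicity described above.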
Because P-values are between 0 and 1 and because the smaller the P-value is,
the smaller α would have to be before one could accept the hypothesis, people
like to think of the P-value as if it were the probability that the
hypothesis is true. Those who are more careful with their terminology will
still suggest that it is the degree to which the data support the hypothesis.
Sometimes this is approximately true, as in the next two examples.⁴²

⁴¹The example appears in Problem 34 on page 121 of that text and in
Problem 29 on page 116 of the 1959 edition.
Example 4.146. Suppose that X ~ Bin(n, p) given P = p, and let H : P ≤ p₀.
The UMP level α test rejects H when X > c_α, where c_α increases as α decreases.
The P-value of an observed value x is the value of α such that c_α = x − 1 unless
x = 0, in which case the P-value is 1. The P-value can then be calculated as

    p_H(x) = Σ_{i=x}^n (n choose i) p₀^i (1 − p₀)^{n−i}.

This formula is also correct when x = 0.

Next, suppose that we used an improper prior for P of the form Beta(0, 1).
The posterior distribution of P would be Beta(x, n + 1 − x). If x > 0, the posterior
probability that H is true is Pr(Y ≤ p₀), where Y ~ Beta(x, n + 1 − x). This is
the probability that at least x out of n IID U(0, 1) random variables are less than
or equal to p₀, because Y has the distribution of the xth order statistic from a
sample of n IID U(0, 1) random variables. The probability that a single U(0, 1)
is less than or equal to p₀ is p₀, and the n of them are IID, so the probability
that at least x of them are less than or equal to p₀ is

    Σ_{i=x}^n (n choose i) p₀^i (1 − p₀)^{n−i} = p_H(x).

So, the P-value is equal to the posterior probability that the hypothesis is true
(using an improper prior), at least when the posterior is proper. If x = 0, then
the posterior is still improper Beta(0, n + 1).⁴³

Consider next what happens if H : P ≥ p₀. It turns out that the improper
prior must change to Beta(1, 0). (See Problem 64 on page 294.) Because two
different priors are needed to obtain the "degree of support" for the two different
hypotheses, we get the following anomaly. If we take the two hypotheses together,
{P ≤ p₀} ∪ {P ≥ p₀}, the total degree of support is

    1 + (n choose x) p₀^x (1 − p₀)^{n−x}.

One can easily check that this is not due to the fact that {P = p₀} is included in
both hypotheses. One could leave it out of either one and the results would be
the same.
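Both claims of Example 4.146 can be checked numerically. In the sketch below, n, x, and p₀ are made-up values, and the Beta CDF is computed by crude numerical integration so that the check does not simply reuse the binomial sum:

```python
import math

def binom_tail(n, x, p0):
    """Pr(X >= x) for X ~ Bin(n, p0): the P-value for H: P <= p0."""
    return sum(math.comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(x, n + 1))

def beta_cdf(a, b, t, steps=100000):
    """Pr(Y <= t) for Y ~ Beta(a, b), by midpoint integration of the density."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = t / steps
    return const * h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                           for i in range(steps))

n, x, p0 = 10, 4, 0.3             # hypothetical values for illustration
pval = binom_tail(n, x, p0)
post = beta_cdf(x, n + 1 - x, p0)  # posterior Pr(P <= p0) under the Beta(0,1) prior
print(pval, post)                  # agree to integration accuracy

# The anomaly: the "support" for P >= p0 (Beta(1,0) prior) is Pr(X <= x),
# so the two supports add to more than 1.
support_ge = 1.0 - binom_tail(n, x + 1, p0)
print(pval + support_ge, 1 + math.comb(n, x) * p0 ** x * (1 - p0) ** (n - x))
```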

A similar situation occurs with Poisson data.


Example 4.147 (Continuation of Example 4.61; see page 241). The P-value of
an observed data value x is Pr(X ≥ x|Θ = 1). This can also be written as

    Pr(at least x events in one time unit of a rate 1 Poisson process)
      = Pr(time until the xth event is ≤ 1) = Pr(Y ≤ 1) = Pr(Θ ≤ 1|X = x),

⁴²Berkson (1942) carefully examines the use of significance probabilities as
evidence in favor of hypotheses.
⁴³There is a sense in which Pr(P ≤ p₀|X = 0) = 1 even in this case, but it
requires the notion of finitely additive probability.
where Y ~ Γ(x, 1) and assuming that the "prior" for Θ is the improper dθ/θ. So,
the P-value is the posterior probability that H is true if the prior is improper.
This is actually true in general for Poisson distributions and hypotheses of the
form H : Θ ≤ θ₀. The implications of this result include the following. If a
Bayesian uses the improper prior dθ/θ and has a 0-1-c loss function, then he or
she will reject H if the P-value is less than 1/(1 + c). This is the UMP level α
test if α = 1/(1 + c).

Next, suppose that H : Θ ≥ 1 and A : Θ < 1. The UMP level α test is

            1   if x < c,
    φ(x) =  γ   if x = c,
            0   if x > c,

where c and γ are chosen so that φ has size α. The P-value of an observed data
value x is Pr(X ≤ x|Θ = 1). As we did earlier, we can write this as

    p_H(x) = Pr(at most x events in one time unit of a rate 1 Poisson process)
           = Pr(time until event x + 1 is > 1) = Pr(Y > 1) = Pr(Θ > 1|X = x),

where Y ~ Γ(x + 1, 1) and assuming that the "prior" for Θ is the improper dθ.
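The Poisson-Gamma identities used in Example 4.147 can be checked the same way (a sketch; x is an arbitrary made-up count, and the Gamma CDF is computed by crude numerical integration):

```python
import math

def poisson_tail(x, lam=1.0):
    """Pr(X >= x) for X ~ Poi(lam)."""
    return 1.0 - sum(math.exp(-lam) * lam ** k / math.factorial(k)
                     for k in range(x))

def gamma_cdf(shape, t, steps=100000):
    """Pr(Y <= t) for Y ~ Gamma(shape, rate 1), by midpoint integration."""
    h = t / steps
    return h * sum(((i + 0.5) * h) ** (shape - 1) * math.exp(-(i + 0.5) * h)
                   for i in range(steps)) / math.gamma(shape)

x = 3   # an arbitrary observed count, for illustration
# Pr(at least x events in one time unit) = Pr(time of the xth event <= 1):
print(poisson_tail(x), gamma_cdf(x, 1.0))
# Pr(at most x events in one time unit) = Pr(time of event x + 1 is > 1):
print(1.0 - poisson_tail(x + 1), 1.0 - gamma_cdf(x + 1, 1.0))
```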
If we modify Example 4.143 slightly, we discover a case in which it is
simply impossible to use P-values for measuring degree of support. [See
also Schervish (1996).]

Example 4.148 (Continuation of Example 4.143; see page 280). Let H₂ : Θ ∈
[−0.82, 0.52]. The UMPU level α test is ψ_α(x) = 1 if |x + 0.15| > d_α. If X = 2.18
is observed, then d_α = 2.33 and

    α = Φ(−3) + 1 − Φ(1.66) = 0.0498.

This is smaller than the "degree of support" for the smaller hypothesis H₁. It
does not make any sense to have a concept of degree of support that gives more
support to a smaller hypothesis than it gives to a larger one. In the one-sided
testing case, this does not happen. (See Problem 62 on page 294.)
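A sketch reproducing the anomaly of Examples 4.143 and 4.148: the same observation x = 2.18 yields a smaller P-value for the larger interval hypothesis.

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interval_pvalue(lo, hi, x):
    """P-value at x for H: Theta in [lo, hi], using the test that rejects when
    |x - midpoint| is large; the size is attained at the interval's endpoints."""
    mid = 0.5 * (lo + hi)
    dist = abs(x - mid)
    half = 0.5 * (hi - lo)
    return Phi(-dist - half) + 1.0 - Phi(dist - half)

x = 2.18
p1 = interval_pvalue(-0.5, 0.5, x)     # H1, the smaller hypothesis
p2 = interval_pvalue(-0.82, 0.52, x)   # H2, which strictly contains H1
print(round(p1, 4), round(p2, 4))      # 0.0502 and 0.0498
```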

In Example 4.148, we saw that the P-value of a data value relative to the
class of UMPU tests behaved strangely as the hypothesis varied. This exam-
ple is closely related to the incoherent tests discovered in Example 4.88 on
page 252 and Example 4.102 on page 257. Problems 61 and 62 on page 294
show that for one-sided hypotheses with known or unknown variance, the
P-value always equals the posterior probability of the hypothesis calcu-
lated from the usual improper prior. In the case of interval hypotheses
with unknown variance, the situation is somewhat different.
Example 4.149. Suppose that X₁, ..., Xₙ are conditionally IID with N(μ, σ²)
distribution given Θ = (μ, σ). For a hypothesis of the form H : a ≤ M ≤ b with
a < b, we have not found UMPU tests. We do, however, have the collection of
likelihood ratio (LR) tests. (See Examples 4.132 and 4.137.) The UMPU tests of
one-sided hypotheses (like M ≤ b or M ≥ a) and point hypotheses (like M = c
versus M ≠ c) are also LR tests. So, we might try to compare the P-values
for various hypotheses relative to the families of LR tests. If X̄ = x̄ > b and
S² = Σ_{i=1}^n (X_i − X̄)²/(n − 1) = s² are observed, the P-value for H : a ≤ M ≤ b
will be the level of the LR test that rejects H when √n(X̄ − b)/S > √n(x̄ − b)/s
or when √n(X̄ − a)/S < −√n(x̄ − b)/s. Call this P-value p_H(x). It is easy to see
that p_H(x) is precisely the same as the P-value for the hypothesis H_b : M = b
versus A_b : M ≠ b relative to the collection of UMPU (two-sided) tests. Also,
p_H(x) is precisely twice the P-value for the hypothesis H'_b : M ≤ b
versus A'_b : M > b relative to the collection of UMPU (one-sided) tests.

Since the one-sided P-values equal posterior probabilities of the hypotheses
when using improper priors (see Problem 62 on page 294), we find that the one-sided
P-value for the hypothesis H'_c : M ≤ c is a continuous function of c. It
follows that there exists c > b such that the one-sided P-value for H'_c satisfies
p_H(x) > p_{H'_c}(x) > p_H(x)/2. If we are to interpret the P-values relative to the
collections of LR tests as degrees of support for the respective hypotheses, then
if x̄ > b, the degree of support for every hypothesis of the form H_a : M ∈ [a, b]
(for varying a but fixed b) is the same number p_H(x) (even if a = b or if a is
much less than b). But the degree of support for the hypothesis H'_c : M ∈ (−∞, c]
(where c > b is chosen as above) is p_{H'_c}(x) < p_H(x). In words, the data offer more
support for every hypothesis of the form M ∈ [a, b] than they do for M ∈ (−∞, c],
even though [a, b] ⊆ (−∞, c] for every a.
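The ordering claimed in Example 4.149 can be illustrated numerically. In the sketch below the summary statistics are made up, the Student t CDF is computed by crude numerical integration, and p(c) denotes the one-sided P-value for H'_c : M ≤ c.

```python
import math

def t_cdf(t, d, steps=20000, lim=60.0):
    """Student t CDF with d degrees of freedom, by crude midpoint integration."""
    c = math.gamma((d + 1) / 2) / (math.sqrt(d * math.pi) * math.gamma(d / 2))
    h = (t + lim) / steps
    return c * h * sum((1 + (-lim + (i + 0.5) * h) ** 2 / d) ** (-(d + 1) / 2)
                       for i in range(steps))

# Made-up summary statistics: n = 10, xbar = 1.5, s = 1, and b = 1 (so xbar > b).
n, xbar, s, b = 10, 1.5, 1.0, 1.0
d = n - 1

p_H = 2.0 * (1.0 - t_cdf(math.sqrt(n) * (xbar - b) / s, d))  # P-value for a <= M <= b
one_sided = lambda c: 1.0 - t_cdf(math.sqrt(n) * (xbar - c) / s, d)

# p(c) equals p_H/2 at c = b and rises continuously in c, so some c > b has
# p_H/2 < p(c) < p_H, even though every [a, b] sits inside (-inf, c].
print(p_H, one_sided(b), one_sided(b + 0.1))
```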
We see that there are cases (usually one-sided testing) in which P-values
can correspond to a degree of support for the hypothesis, but there are
other cases (e.g., two-sided alternatives) when they cannot. It is possible,
for example, with normal data, to express certain P-values as weighted
averages over the corresponding hypotheses of P-values for testing point
hypotheses. (See Problem 66 on page 294.) To generalize this idea beyond
normal distributions (or symmetric location families), one needs to con-
sider tests that may not be UMPU. Spjøtvoll (1983) defines a measure of
"acceptability" of point hypotheses, which has the property that for many
distributions and certain hypotheses, the weighted average of the accept-
ability over the hypothesis equals something closely related to the P-value.
In Section 6.3.1, we give some general conditions under which P-values
are equal to the posterior probabilities of hypotheses. Casella and Berger
(1987) study the problem of testing a one-sided hypothesis-alternative pair
and find that in many cases, the P-value is approximately a limit of pos-
terior probabilities. Examples 4.143 and 4.148 point out that the P-value
cannot be taken as a method for providing a "degree of support" for general
hypotheses, however.

4.6.2 P-Values and Bayes Factors


In Section 4.2.2, we introduced Bayes factors as ways to quantify the degree
of support for a hypothesis in a data set. In particular, there are lower
bounds on Bayes factors which indicate the smallest amount of support
one could coherently say that the data supply to the hypothesis. When
the lower bound is not particularly small, one would be hard pressed to
argue that the data are highly inconsistent with the hypothesis. Since P-
values have also been suggested as measures of support for the hypothesis

the data offer, it seems natural to compare the two. In one-sided cases,
we found that posterior probabilities (when using improper priors) often
corresponded to P-values. In this section we will only compare Bayes factors
to P-values for testing hypotheses of the form H : e = eo versus A : e f:. eo
Edwards, Lindman, and Savage (1963) and later Berger and Sellke (1987)
made comparisons of P-values with lower bounds on Bayes factors, and the
following two examples are inspired by the presentations in those sources.
Example 4.150. Suppose that X₁, ..., Xₙ are conditionally IID with N(θ, 1)
distribution given Θ = θ, and we are interested in testing H : Θ = θ₀. Let
p₀ = Pr(Θ = θ₀) > 0. If we let the prior distribution λ of Θ given that Θ ≠ θ₀ be
unrestricted, then the global lower bound on the Bayes factor is easily calculated
to be exp(−n[x̄ − θ₀]²/2). The lower bound on the Bayes factor for λ being a
normal distribution centered at θ₀ is 1 if |x̄ − θ₀| ≤ 1/√n, and it equals

    √n |x̄ − θ₀| exp( −[n(x̄ − θ₀)² − 1]/2 )

if |x̄ − θ₀| > 1/√n. The UMPU level α test of H is to reject H if √n|x̄ −
θ₀| > Φ⁻¹(1 − α/2), where Φ is the standard normal CDF, so the P-value for an
observation x is 2[1 − Φ(√n|x̄ − θ₀|)]. All of these, the P-value and the two lower
bounds, are monotone decreasing functions of √n|x̄ − θ₀|. Table 4.152 compares
the two lower bounds with the P-value and lists what the prior probability of
the hypothesis would have to be in order for the posterior probability to be as
low as the P-value. Notice how small the prior probability would have to be in
order for there to exist even a prior distribution on the alternative which would
allow the posterior probability to equal the P-value. For the normal distribution
priors and small P-values, the required p₀ is quite small.
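The entries of Table 4.152 can be reproduced directly (a sketch; the normal quantile is obtained by bisection):

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_from_pvalue(p):
    """Solve 2*(1 - Phi(z)) = p for z by bisection."""
    lo, hi = 0.0, 10.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if 2.0 * (1.0 - Phi(mid)) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_prior(p, bound):
    """Largest p0 with p0*B/(p0*B + 1 - p0) = p when B equals the lower bound."""
    return p / (p + bound * (1.0 - p))

for p in [0.1, 0.05, 0.01, 0.001, 0.0001]:
    z = z_from_pvalue(p)                   # z = sqrt(n)|xbar - theta0|
    g = math.exp(-z * z / 2.0)             # global lower bound
    m = z * math.exp((1.0 - z * z) / 2.0)  # normal-prior lower bound
    print(p, round(g, 4), round(max_prior(p, g), 4),
          round(m, 4), round(max_prior(p, m), 4))
```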

The discrepancy between the P-value and posterior probability of the


hypothesis, as described in Example 4.150, is sometimes called "Lindley's
paradox" [see Lindley (1957) and Jeffreys (1961)]. The contrast between P-
values and posterior probabilities is even more striking when one considers
more reasonable prior distributions on the alternative rather than the lower
bounds.
Example 4.151. Suppose that X₁, ..., Xₙ are conditionally IID with N(μ, σ²)
distribution given (M, Σ) = (μ, σ). Suppose that we use a prior distribution that
TABLE 4.152. Comparison of P-Values and Lower Bounds in Example 4.150

              Global                Normal
  P-value     Bound     Prior^a    Bound     Prior^a
  0.1         0.2585    0.3006     0.7011    0.1368
  0.05        0.1465    0.2643     0.4734    0.1001
  0.01        0.0362    0.2179     0.1539    0.0616
  0.001       0.0045    0.1835     0.0242    0.0398
  0.0001      0.0005    0.1622     0.0033    0.0293

  ^a This is the largest possible value of p₀ which is consistent with the
  posterior probability being equal to the P-value.

is conjugate as in Example 4.22 on page 224. The Bayes factor for the hypothesis
H : M = μ₀ was given in (4.23). Consider what happens to this expression as
n → ∞. First, suppose that the usual t statistic converges to a constant t₀. That
is, assume that √n(X̄ₙ − μ₀)/Sₙ converges to t₀, where S²ₙ = W/(n − 1). In this
case, the formula in (4.23) behaves asymptotically (as n → ∞) like √n times
a constant. This means that the Bayes factor goes to ∞ as n increases, hence
the posterior probability of the hypothesis goes to 1. What happens to the P-value
for this same sequence of data sets? Since the t statistic is converging to a
constant t₀, the P-value is converging to 2[1 − Φ(|t₀|)]. For example, if t₀ = 1.96,
the P-value will go to 0.05, while the posterior probability of the hypothesis goes to
1. This is an extreme example of Lindley's paradox. Once again, the suggestion
of Lehmann (1958) to let α decrease as n increases would seem appropriate.
The situation is much the same if one uses the approximate Bayes factor as
calculated in (4.29). This formula does not require that the prior be natural
conjugate, but merely smooth in some sense. Since both X̄ₙ and S²ₙ converge to
finite values almost surely given Θ (by the strong law of large numbers 1.62), the
expression in (4.29) also behaves asymptotically like √n times a constant if the
t statistic converges to t₀.

Of course, the t statistic will converge to a finite value with positive probability
only if the hypothesis is true. So, it is comforting that virtually any smooth prior
distribution will lead to eventually discovering that the hypothesis is true, if it is
indeed true.⁴⁴ On the other hand, it is a bit disconcerting that the P-value will
stay bounded away from 1 with positive probability no matter how much data
we observe. (See Problem 65 on page 294 to see how to prove that the P-value
has U(0, 1) distribution given that the hypothesis is true.) If the hypothesis is
false, it is easy to check that the P-value will go to 0, as will the Bayes factor.
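Lindley's paradox is easy to exhibit in a simpler known-variance analogue of Example 4.151 (an assumption made here so that the Bayes factor has a closed form; the text's formula (4.23) covers the conjugate unknown-variance case):

```python
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Known-variance analogue: X_i ~ N(theta, 1), prior N(theta0, tau^2) on the
# alternative.  Holding t0 = sqrt(n)*(xbar - theta0) fixed, the Bayes factor
# in favor of H: Theta = theta0 is
#     B_n = sqrt(1 + n*tau^2) * exp(-(t0^2/2) * n*tau^2 / (1 + n*tau^2)),
# which grows like sqrt(n), while the P-value stays at 2*(1 - Phi(t0)).
t0, tau2 = 1.96, 1.0          # hypothetical values
pval = 2.0 * (1.0 - Phi(t0))
for n in [10, 100, 1000, 100000]:
    B = math.sqrt(1 + n * tau2) * math.exp(-(t0 ** 2 / 2) * n * tau2 / (1 + n * tau2))
    post = B / (B + 1.0)      # posterior probability of H when p0 = 1/2
    print(n, round(B, 2), round(post, 3), round(pval, 3))
```

The posterior probability of H climbs toward 1 as n grows, while the P-value never moves off 0.05.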

The irreconcilability of P-values and posterior probabilities as illustrated
in Examples 4.150 and 4.151 is quite typical of cases in which Ω_H is a
lower-dimensional set than Ω. [See Schervish (1996) for some examples
with distributions other than normal.] Together with the strange behavior
of P-values in Examples 4.148 and 4.149, it becomes difficult to justify their
use to measure strength of evidence in favor of the hypothesis for two-sided
problems.

4.7 Problems
Section 4.1:

1. Prove that the loss function in (4.2) is equivalent to a 0-1-c loss function.
(By "equivalent" we mean that both the posterior risk and the risk function
will rank all decision rules the same way regardless of which loss is used.)
2. Prove that the general form of the hypothesis-testing loss function (in Definition
4.1) can be written as d(θ) times a 0-1 loss function for some function
d > 0 if the loss is 0 whenever a correct decision is made.

⁴⁴In Section 7.4, we will prove some results that make more precise this limiting
ability of posterior distributions to identify the value of a parameter.

3. Let X = (X₁, ..., Xₙ) be such that the Xᵢ are conditionally IID with
N(0, σ²) distribution given Σ = σ under the hypothesis H. Let T(x) =
√n x̄/s, where x̄ = Σ_{i=1}^n x_i/n and s² = Σ_{i=1}^n (x_i − x̄)²/(n − 1). Define
x ⪯ y if T(x) ≤ T(y). Find p_H(x), showing that it is the same for all σ.

Section 4.2:

4. Suppose that X ~ N(θ, 1) given Θ = θ. Suppose that L(θ, 1) < L(θ, 0) for
all θ < θ₀ and that L(θ, 1) > L(θ, 0) for all θ > θ₀. Prove that, for every
prior, there exists k such that the formal Bayes rule will be to choose action
a = 1 if X < k.
5. Suppose that X₁, ..., Xₙ are conditionally IID with N(μ, σ²) distribution
given Θ = (μ, σ). Use the improper prior having Radon-Nikodym derivative
1/σ with respect to Lebesgue measure on (0, ∞) × ℝ. Let μ₀ and d
be known values, and suppose that the loss function is

               c   if a = 1 and |μ − μ₀| ≤ dσ,
    L(θ, a) =  1   if a = 0 and |μ − μ₀| > dσ,
               0   otherwise.

Prove that the formal Bayes rule will be of the following form: Choose
a = 1 if |T| > k for some constant k, where T = √n(X̄ₙ − μ₀)/Sₙ and

    X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ,    S²ₙ = (1/(n − 1)) Σ_{i=1}^n (Xᵢ − X̄ₙ)².

6. Suppose that X ~ N(θ, 1) given Θ = θ, and Θ has Pr(Θ = θ₀) = p₀ and,
given Θ ≠ θ₀, Θ ~ N(θ₀, τ²). Prove that the posterior density of Θ with
respect to the measure ν(A) = I_A(θ₀) + λ(A), where λ is Lebesgue measure,
is given by

    f_{Θ|X}(θ|x) = p₁ I_{θ₀}(θ)
        + (1 − p₁) √((1 + τ²)/(2πτ²)) exp( −[(1 + τ²)/(2τ²)](θ − θ₁)² ) I_{θ ≠ θ₀},

where

    θ₁ = (xτ² + θ₀)/(1 + τ²)

and

    p₁/(1 − p₁) = [p₀/(1 − p₀)] √(1 + τ²) exp{ −(1/2)[τ²/(1 + τ²)](x − θ₀)² }.
7. Suppose that X ~ N(θ, 1) given Θ = θ. Let H : Θ = θ₀ and A : Θ ≠ θ₀.
Let the conditional prior given Θ ≠ θ₀ be N(θ₀, τ²).
(a) Prove that the Bayes factor is minimized if τ² = (x − θ₀)² − 1 when
|x − θ₀| > 1, and τ² = 0 otherwise.
(b) Show that the minimum Bayes factor is |x − θ₀| exp({−[x − θ₀]² + 1}/2)
if |x − θ₀| > 1, and is 1 if |x − θ₀| ≤ 1.
8. In Example 4.21 on page 224, derive the expression for the constant k.
Section 4.3.1:

9. Let (α₀, α₁) be a point on the lower boundary of the risk set for a simple-simple
hypothesis-testing problem. Prove that α₀ + α₁ ≤ 1.
10. In a simple-simple hypothesis-testing problem, prove that the minimax
rule for a 0-1 loss function is any test that corresponds to the point where
the lower boundary ∂L intersects the line y = x.
11. Let Ω = {0, 1}. Suppose that P₀ says that X ~ U(−√3, √3) and P₁ says
that X ~ N(0, 1). Let Ω_H = {0}. Draw the risk set for a 0-1 loss function,
and find the minimax rule.
12. In a simple-simple hypothesis-testing problem with 0-1 loss, show that a
MP level α test has size α unless all tests with size α are inadmissible.
13. Return to the situation in Problem 28 on page 212. Consider the hypothesis
H : X ~ f₀ versus A : X ~ f₁. Find all α such that the MP level α test is
of the form "Reject H if −d ≤ X ≤ d," and write d as a function of α.
14. *Prove Proposition 4.46 on page 235.
15. In Example 4.49 on page 236, prove that the Bayes rule with size α has
higher power than the unconditional size α test.

Section 4.3.2:

16. Suppose that the loss function is 0-1-c, that Ω_H = {θ₀}, and that Ω =
{θ₀} ∪ Ω_A. Prove that a UMP level α test has no larger Bayes risk than
any other size α test, no matter what prior we use.

Section 4.3.3:

17. Let Y = |X|, where f_{X|Θ}(x|θ) = 1/(πθ[1 + (x/θ)²]). Suppose that Θ > 0
for sure. Prove that the family of distributions for Y has MLR.
18. Let X have Cauchy distribution Cau(θ, 1) given Θ = θ.
(a) Prove that the MP level α test of H : Θ = θ₀ versus A : Θ = θ₁ for
θ₁ > θ₀ is essentially unique. That is, if φ and ψ are both MP level α
tests, then P_θ(φ(X) = ψ(X)) = 1 for all θ.
(b) Prove that there is no UMP level α test of H : Θ = θ₀ versus A : Θ >
θ₀ for 0 < α < 1.

19. Let the parameter space be the open interval Ω = (0, 100). Let X₁ and
X₂ be conditionally independent given Θ = θ with X₁ ~ Poi(θ) and
X₂ ~ Poi(100 − θ). We are interested in the hypothesis H : Θ ≤ c versus
A : Θ > c.
(a) Show that there is no UMP level α test of H versus A.
(b) Show that T = X₁ + X₂ is ancillary.
(c) Find the conditional UMP level α test given T.
(d) Find a prior distribution for Θ such that the conditional UMP level
α test given T is to reject H if Pr(H is true|X₁ = x₁, X₂ = x₂) < α.
20. Let {P_θ : θ ∈ Ω} be a parametric family, and let p(x, θ) = (dP_θ/dx)(x) be
the density of a member of the family with respect to Lebesgue measure.
Assume that ∂² log p(x, θ)/∂x∂θ exists for all x and θ. Prove that the family
has increasing MLR if and only if ∂² log p(x, θ)/∂x∂θ ≥ 0 for all x and θ.
21. Prove Proposition 4.55 on page 240.
22. Let Ω = {θ₁, θ₂, θ₃} with θ₁ < θ₂ < θ₃. Suppose that given Θ = θ, X ~
N(θ, 1). Let H : Θ ∈ {θ₁, θ₂} and A : Θ = θ₃. Show that each test φ
satisfying β_φ(θ₁) = β_φ(θ₂) = α is inadmissible if 0 < α < 1 and the loss
function is 0-1.
23. Suppose that the parametric family has MLR increasing and that H : Θ ≤
θ₀ is the hypothesis of interest. Let the alternative be A : Θ > θ₀. Suppose
that the power function of every test is continuous. Prove that the UMP
level α test is the UMC floor α test.
24. Suppose that the parametric family has MLR increasing and that H : Θ ≤
θ₀ is the hypothesis of interest. Let the alternative be A : Θ > θ₀. Suppose
that the UMP level α test has size γ. Prove that the UMP level α test is
the UMC floor γ test.
25. Show that the family of U(0, θ) distributions for θ > 0 does not satisfy the
conditions of Proposition 4.67. Find two UMP level α tests of H : Θ = 1
versus A : Θ > 1 which are not almost surely equal.
26. Show that the family of Poi(θ) distributions for θ > 0 does not satisfy the
conditions of Proposition 4.67. Nevertheless, prove that the collection of
one-sided tests of the form (4.57) for H : Θ ≤ θ₀ versus A : Θ > θ₀ forms a
minimal complete class when the loss function is of hypothesis-testing type
(see Definition 4.1).
27. *Suppose that Ω = (0, ∞) and that X₁, ..., Xₙ are IID U(0, θ) given Θ =
θ. For 0 ≤ α < 1, find a UMP level α test for each of the hypothesis-alternative
pairs below:
(a) H : Θ ≤ θ₀ versus A : Θ > θ₀.
(b) H : Θ ≥ θ₀ versus A : Θ < θ₀.
(c) H : Θ = θ₀ versus A : Θ ≠ θ₀.
(d) In part (a), find a second UMP level α test that differs from the one
you found for part (a) with positive probability given Θ = θ for θ in
a set of positive Lebesgue measure.
28. (a) Suppose that X ~ U(0, θ) given Θ = θ. Find the conditional distribution
of Y = X^{−a} given Θ = θ.
(b) Let a > 0 be fixed, and suppose that X ~ Par(a, θ) given Θ = θ.
Find UMP level α tests for each of the three hypothesis-alternative
pairs in Problem 27 above.

29. *The density of the NCB(α, β, ψ) distribution is given in Appendix D.
(a) Prove that, for fixed α and β, this family of distributions has increasing
MLR in x, where ψ is the parameter. (Feel free to pass derivatives
under summations.)
(b) Use the result of the previous part to show that the noncentral F
distribution has increasing MLR also.
(c) Also show that the noncentral t distribution has increasing MLR.
30. Prove that if φ is UMP level α for testing H : Θ ∈ Ω_H versus A : Θ ∈ Ω_A,
then 1 − φ is UMC floor 1 − α for testing H' : Θ ∈ Ω_A versus A' : Θ ∈ Ω_H.
31. *Let X₁, ..., Xₙ be IID U(θ − 1/2, θ + 1/2) given Θ = θ. Let H : Θ ≥ θ₀
and A : Θ < θ₀.
(a) Find the UMP level α test of H versus A.
(b) Find the UMC floor α test of H versus A.
(c) Suppose that we begin with a uniform improper prior for Θ. Let the
loss be 0-1-c with 1/(1 + c) = α. Find the formal Bayes rule.
(d) Calculate the power functions of the three tests above. Compare them
and explain the differences intuitively.
(e) Find the UMP level α test of H versus A conditional on the ancillary
U = max Xᵢ − min Xᵢ.
(f) Find the UMC floor α test of H versus A conditional on the ancillary
U = max Xᵢ − min Xᵢ, and show that this is the same as the UMP
level α test conditional on the ancillary.
32. Let Ω = (−∞, θ₀], and let X₁, ..., Xₙ be IID U(θ − 1/2, θ + 1/2) given
Θ = θ. Let H : Θ = θ₀ and A : Θ < θ₀. Suppose that we have a prior
distribution such that Pr(Θ = θ₀) = p₀ > 0 and that the conditional
distribution of Θ given Θ < θ₀ has strictly positive density g(θ) for θ < θ₀.
Assume that the loss is 0-1-c. Find the formal Bayes rule for data values
such that max{x₁, ..., xₙ} < θ₀ + 1/2. (This condition assures that the
data are consistent with the parameter space.) Does this test match any of
the tests in Problem 31 above?
33. Suppose that μ_Θ is a probability measure on (Ω, τ). Suppose that the
conditions of Theorem 4.68 are satisfied and that there is a test with finite
Bayes risk with respect to μ_Θ. Suppose that the loss function is bounded
below. Show that there is a one-sided test that is a formal Bayes rule.
34. Let Ω ⊆ ℝ and H : Θ ≤ θ₀. If power functions are continuously differentiable
and φ is the unique level α test with maximum derivative of the
power function at θ₀, show that φ is LMP level α relative to d(θ) = θ − θ₀
for θ > θ₀.
35. Let $X_1,\dots,X_n$ be conditionally IID with $\mathrm{Cau}(\theta,1)$ distribution given $\Theta=\theta$.

(a) Find the LMP level $\alpha$ test of $H:\Theta\le\theta_0$ versus $A:\Theta>\theta_0$ for
$0<\alpha<1$. (Hint: Pass derivatives under integral signs.)
290 Chapter 4. Hypothesis Testing

(b) Prove that the power function of this test goes to 0 as $\theta\to\infty$.

36. Let $\Omega=(0,\infty)$, and let


if $x\le\theta$,
if $x>\theta$.

(a) Show that this family of distributions has MLR.

(b) Show that the conditions of Proposition 4.67 are not met.
(c) Show that the UMP level $\alpha$ test of $H:\Theta=\theta_0$ versus $A:\Theta>\theta_0$ is
unique if $\alpha<1/2$.
(d) Show that the UMP level $\alpha$ test of $H:\Theta=\theta_0$ versus $A:\Theta<\theta_0$ is
unique if $\alpha<1/2$.
(e) If $\alpha>1/2$, find two UMP level $\alpha$ tests of $H:\Theta=\theta_0$ versus $A:\Theta>\theta_0$
that are not almost surely equal.

Section 4.3.4:

37. Suppose that the conditions of Theorem 4.82 are met. Let the loss function
be of the hypothesis-testing type for a two-sided hypothesis. Show that the
class of tests of the form given in Theorem 4.82 is essentially complete.
38. Suppose that $\mu_\Theta$ is a probability measure on $(\Omega,\tau)$. Let the loss function be
of the hypothesis-testing type for a two-sided hypothesis. Suppose that the
conditions of Theorem 4.82 are satisfied and that there is a test with finite
Bayes risk with respect to $\mu_\Theta$. Suppose that the loss function is bounded
below. Show that there is a test of the form given in Theorem 4.82 that
is a formal Bayes rule.
39. Suppose that $X\sim N(\theta,1)$ given $\Theta=\theta$. Let $\Omega_H$ be the set of rational
numbers, and let $\Omega_A$ be the set of irrational numbers. Prove that the UMP
level $\alpha$ test of $H:\Theta\in\Omega_H$ versus $A:\Theta\in\Omega_A$ is the trivial test $\phi(x)\equiv\alpha$.
40. *Let $X$ be a random variable, and define
$$\phi_w(x)=\begin{cases}1&\text{if }x<c_w,\\\gamma_w&\text{if }x=c_w,\\0&\text{if }x>c_w.\end{cases}$$
Let $P_1$ and $P_2$ be two different possible distributions of $X$ with corresponding
expectations $E_1$ and $E_2$ such that $P_1\ll P_2$ and $P_2\ll P_1$. Suppose that
$E_1\phi_w(X)=w$ for all $w$. Prove that $g(w)=E_2\phi_w(X)$ is continuous. (Hint:
To prove that $g$ is continuous at $w$, consider three cases. First, suppose
that $1>\gamma_w>0$, and prove that so long as $z$ is close enough to $w$ so that
$1>\gamma_z>0$ and $c_z=c_w$, then $g(z)$ is close to $g(w)$. Second, suppose that
$\gamma_w=0$. When $w$ increases, either $\gamma_w$ or $c_w$ must increase (or both). Either
way, the increase can be made small enough so that $g(w)$ does not change
much. This is similar if $w$ decreases. The third case, $\gamma_w=1$, is similar to
the second case.)
41. Let X '" Exp(J) given e = 6. Consider the two hypotheses HI : e :::; 1
versus Al : e > 1 and H2 : e ~ (1,2) versus A2 : e E (1,2).
4.7. Problems 291

(a) Find the UMP level 0.05 tests of the two hypotheses.
(b) Find the set of all $x$ values such that the UMP level 0.05 test of $H_2$
rejects $H_2$ but the UMP level 0.05 test of $H_1$ accepts $H_1$.
42. Let $\Omega_1\subset\Omega_2$ be strictly nested subsets of the parameter space. Let $H_i:
\Theta\in\Omega_i$ and $A_i:\Theta\notin\Omega_i$ for $i=1,2$. Suppose that $L_i$ is 0-1-$c$ loss for
testing $H_i$ versus $A_i$ for $i=1,2$ (with the same $c$ for both cases). Consider
the problem of simultaneously testing both hypotheses with action space
$\{0,1\}^2$, the first coordinate being the action for the first hypothesis, and the
second coordinate being the action for the second hypothesis. (That is, for
example, $a=(a_1,a_2)=(0,1)$ means to reject $H_2$ but accept $H_1$.) Suppose
that the loss function for the simultaneous tests is $L(\theta,a)=L_1(\theta,a_1)+
L_2(\theta,a_2)$. A pair of tests $(\phi_1,\phi_2)$ can be thought of as a randomized decision
rule in this problem. In this case, $\phi_i(x)=\Pr(\text{reject }H_i\mid X=x)$. We can
say that a pair of tests $(\phi_1,\phi_2)$ is incoherent if there exists $\theta\in\Omega_2\setminus\Omega_1$ such
that

(a) Prove that an incoherent pair of tests is inadmissible. (Hint: Switch
the two tests for all $x$ such that $\phi_1(x)<\phi_2(x)$.)
(b) Consider the special case in which $X\sim N(\theta,1)$ given $\Theta=\theta$, $\Omega_1=\{\theta_0\}$,
and $\Omega_2=(-\infty,\theta_0]$. Let $\phi_1$ and $\phi_2$ be the UMPU level $\alpha$ tests of their
respective hypotheses. Find a pair of tests that dominates $(\phi_1,\phi_2)$ in
the decision problem defined above.

Section 4.4:

43. For each $k=1,2,\dots$, let $\gamma_{k,\alpha}$ denote the $1-\alpha$ quantile of the $\chi^2_k$ distribution.
Let $Y_k$ have $\mathrm{NC}\chi^2_k(c^2)$ distribution. Prove that $\Pr(Y_k>\gamma_{k,\alpha})\le
\Pr(Y_1>\gamma_{1,\alpha})$ for all $k\ge1$. (Hint: Let $X_1,\dots,X_k$ be IID $N(\theta,1)$ given
$\Theta=\theta$, and consider two tests of $H:\Theta=0$ based on $\sum_{i=1}^kX_i^2$ and
$\left(\sum_{i=1}^kX_i\right)^2$.)
44. Prove Proposition 4.92 on page 254.
45. Prove Proposition 4.93 on page 254.
46. Let $X_1,\dots,X_n$ be IID $U(\theta-1/2,\theta+1/2)$ given $\Theta=\theta$. Let $H:\Theta=\theta_0$
and $A:\Theta\ne\theta_0$.
(a) Let $\phi$ be a test with size $\alpha$. Prove that $\phi$ is unbiased level $\alpha$ if $\phi(x)=1$
for all $x$ that satisfy $f_{X|\Theta}(x|\theta_0)=0$. (Hint: Use the fact that if
$f_{X|\Theta}(x|\theta_0)f_{X|\Theta}(x|\theta_1)>0$ for all $x\in B$, then $P_{\theta_0}(B)=P_{\theta_1}(B)$.)
(b) Prove that there does not exist a UMPU level $\alpha$ test for $\alpha>0$. (Hint:
You can slightly modify the UMP and UMC tests for the one-sided
case in Problem 31 on page 289 to produce unbiased level $\alpha$ tests with
maximum power for $\theta<\theta_0$ and $\theta>\theta_0$, respectively. Then prove that
it is impossible for a single test to achieve both maxima.)
47. Prove Proposition 4.97 on page 255.

48. Suppose that the conditions of Lemma 4.99 hold and that the loss function
is of the hypothesis-testing type. Prove that the class of two-sided tests is
essentially complete.
49. In a one-parameter exponential family with natural parameter $\Theta$, prove
that the condition
$$\frac{d}{d\theta}\beta_\phi(\theta)\Big|_{\theta=\theta_0}=0$$
can be written as (4.117) if $\beta_\phi(\theta_0)=\alpha$.
50. Let $\Omega\subseteq\mathbb{R}$ and $H:\Theta=\theta_0$ versus $A:\Theta\ne\theta_0$. If power functions are twice
continuously differentiable and $\phi$ is the unique unbiased level $\alpha$ test with
maximum second derivative for the power function at $\theta_0$, show that $\phi$ is
LMPU level $\alpha$ relative to $d(\theta)=|\theta-\theta_0|$ for $\theta\ne\theta_0$.
Section 4.5:

51. *Let the joint density of $(X_1,X_2)$ given $(\Theta_1,\Theta_2)=(\theta_1,\theta_2)$ be
$$f_{X_1,X_2|\Theta_1,\Theta_2}(x_1,x_2|\theta_1,\theta_2)=\frac{\phi(x_2-\theta_2)\phi(x_1-\theta_1)}{\Phi(x_1-\theta_2)}I_{(-\infty,x_1)}(x_2),$$
where $\phi$ and $\Phi$ are respectively the standard normal density and CDF.
Find the UMPU level $\alpha$ test of $H:\Theta_2\le c$ versus $A:\Theta_2>c$.
52. Let the parameter space be $\Omega=(0,1)\times\mathbb{R}$. Conditional on $(P,\Theta)=(p,\theta)$,
$N\sim\mathrm{Bin}(10,p)$, and conditional on $N=n$, $X_1,\dots,X_{n+1}$ are IID $N(\theta,1)$.
You get to observe $N,X_1,\dots,X_{N+1}$. We are interested in the hypothesis
$H:\Theta\le0$ versus $A:\Theta>0$.
(a) Find the UMPU level $\alpha$ test of $H$ versus $A$.
(b) Suppose that $P=p$ is actually known. Show that there is no UMPU
level $\alpha$ test of $H$ versus $A$.
53. In the framework of Example 4.121 on page 266, let $H:M\le\mu_0$ and
$A:M>\mu_0$. Prove that the usual size $\alpha$ one-sided $t$-test is UMPU level $\alpha$.
54. *Suppose that $X_1,\dots,X_n$ are conditionally IID $N(\mu,\sigma^2)$ given $\Theta=(\mu,\sigma)$.
Define
$$\bar X_n=\frac1n\sum_{i=1}^nX_i,\qquad S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2,$$
$$t_1=\sqrt n\,\frac{\bar X_n-a}{S_n},\qquad t_2=\sqrt n\,\frac{\bar X_n-b}{S_n}.$$
Let $\alpha_1,\alpha_2\ge0$ and $\alpha=\alpha_1+\alpha_2\le1$. Define
$$\phi(x)=\begin{cases}1&\text{if }t_1<T_{n-1}^{-1}(\alpha_1)\text{ or if }t_2>T_{n-1}^{-1}(1-\alpha_2),\\0&\text{otherwise},\end{cases}$$
where $T_{n-1}^{-1}$ denotes the quantile function of the $t_{n-1}$ distribution.
(a) Prove that $\phi$ has size $\alpha$ as a test of $H:a\le M\le b$ versus $A$: not $H$.
(Hint: For each $\mu\in[a,b]$ find the distributions of $t_1$ and $t_2$ given
$\Theta=(\mu,\sigma)$ and see what happens as $\sigma\to\infty$.)

(b) Prove that $\phi$ is not an unbiased level $\alpha$ test if both $\alpha_1$ and $\alpha_2$ are
strictly positive. (Hint: Show that the limit of the power function at
$(a,\sigma)$ is $\alpha_1$ as $\sigma\to0$. Since the power function is continuous, what
does this say about the power function at $(a+\epsilon,\sigma)$ for $\sigma$ and $\epsilon$ small?)
55. Suppose that $X_1\sim\mathrm{Bin}(n_1,p_1)$ independent of $X_2\sim\mathrm{Bin}(n_2,p_2)$ conditional
on $(P_1,P_2)=(p_1,p_2)$. Find the UMPU level $\alpha$ test of $H:P_1=P_2$
versus $A:P_1\ne P_2$.
56. *Let $X_1,\dots,X_n$ be IID given $\Theta=(\Theta_1,\Theta_2)$ with density
$$f_{X|\Theta}(x|\theta_1,\theta_2)=\exp\{-\psi(\theta_1)+\theta_1\cos(x-\theta_2)\}I_{[0,2\pi]}(x),$$
where
$$\psi(y)=\log\int_0^{2\pi}\exp(y\cos t)\,dt.$$
Let $\Omega=\{(\theta_1,\theta_2):\theta_1\ge0,\ 0\le\theta_2<2\pi\}$.
(a) Let $H_1=\Theta_1\cos\Theta_2$ and $H_2=\Theta_1\sin\Theta_2$. Let $H:\Theta_2=0$ versus
$A:\Theta_2\ne0$, and let $\bar H:H_2=0$ versus $\bar A:H_2\ne0$. Prove that the
UMPU level $\alpha$ test of $\bar H$ is also UMPU level $\alpha$ for $H$. (You need not
actually find the test to prove this.) (Hint: Use the fact that if $f$ and
$g$ are analytic functions of one variable and $f(x)=g(x)$ for all $x$ on
some smooth curve, then $f=g$.)
(b) Find the form of the UMPU level $\alpha$ test of $\bar H$ as closely as you can.
(You will not be able to find the cutoffs in closed form.)
(c) Why is this a reasonable test of $\bar H$ but not of $H$?
57. *Let $X$ and $Y$ have joint density given $(M,\Lambda)=(\mu,\lambda)$,
$$f_{X,Y|M,\Lambda}(x,y|\mu,\lambda)=\mu\lambda\exp(-\mu x-\lambda y)I_{[0,\infty)}(x)I_{[0,\infty)}(y).$$
Find UMPU level $\alpha=0.2$ tests for the following hypotheses:
(a) $H:\Lambda\le M+1$ versus $A:\Lambda>M+1$.
(b) $H:\Lambda=M$ versus $A:\Lambda\ne M$.
58. *Consider a breeding experiment in which each observation is a classification
into one of three groups. Suppose that the observations are conditionally
independent given $(P_1,P_2,P_3)=(p_1,p_2,p_3)$ with conditional probability
$p_i$ of being classified as group $i$.
(a) Find the form of the UMPU level $\alpha$ test of $H:P_2=3P_1$ versus $A:
P_2\ne3P_1$ based on $n$ observations, and say how you would determine
the exact rejection region.
(b) For the case $n=2$, find the precise form of the UMPU level 0.1 test.
(c) Suppose that $(P_1,P_2,P_3)$ has a prior distribution of the Dirichlet form
with density
$$f_{P_1,P_2}(p_1,p_2)=\frac{\Gamma(\alpha_1+\alpha_2+\alpha_3)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\Gamma(\alpha_3)}\,p_1^{\alpha_1-1}p_2^{\alpha_2-1}(1-p_1-p_2)^{\alpha_3-1},$$
for all $p_i\ge0$, $p_1+p_2+p_3=1$, where all $\alpha_i>1$. Find the posterior
mean of $P_2/P_1$.

59. We will observe $Y_1,\dots,Y_k$, where the conditional distribution of $Y_i$ given
$(B_0,B_1)=(\beta_0,\beta_1)$ is $\mathrm{Bin}(n_i,p_i)$, where $p_i=[1+\exp(\beta_0+\beta_1x_i)]^{-1}$, the
$Y_i$ are conditionally independent, and the $n_i$ and $x_i$ are all known.
(a) Find minimal sufficient statistics.
(b) Find the form of the UMPU level $\alpha$ test of $H:B_1\le c$ versus $A:B_1>c$.
(c) For the special case $k=2$, $n_1=3$, $n_2=2$, $x_1=1$, $x_2=2$, $c=0$,
$\alpha=0.1$, find the exact test for all possible data values.

60. Consider the proof in Section 4.5.6 that the $F$-test is a proper Bayes rule.
(a) Prove that the prior distribution given in Section 4.5.6 puts positive
probability on every open subset of the original parameter space for $\Theta$.
(b) Prove that the Bayes rule is admissible.
Section 4.6:

61. Suppose that $X_i$ are IID $N(\mu,1)$ given $M=\mu$ for $i=1,\dots,n$ and that we
use Lebesgue measure as an improper prior for $M$. If $H:M\le c$, show that
the posterior probability that $H$ is true equals the $P$-value associated with
the family of one-sided tests.
62. Suppose that $X_i$ are IID $N(\mu,\sigma^2)$ given $(M,\Sigma)=(\mu,\sigma)$ for $i=1,\dots,n$
and that we use the usual improper prior with Radon-Nikodym derivative
$1/\sigma$ as an improper prior. If $H:M\le c$, show that the posterior probability
that $H$ is true equals the $P$-value associated with the family of one-sided
$t$-tests.
63. Prove Proposition 4.144 on page 280.
64. In Example 4.146 on page 281, suppose that we change $H$ to $H:P\ge p_0$.
Prove that the $P$-value, associated with the family of UMP level $\alpha$ tests, of
an observed $X<n$ is the posterior probability that the hypothesis is true
based on an improper prior $\mathrm{Beta}(1,0)$.
65. Let $\{P_\theta:\theta\in\Omega\}$ be a parametric family. For each $\alpha$, let $\phi_\alpha(x)=I_{S_\alpha}(x)$ be
a size $\alpha$ test of $H:\Theta=\theta_0$ versus some alternative such that, for $0<\alpha<1$,
$S_\alpha=\bigcap_{\beta>\alpha}S_\beta$. Suppose that $S_0=\emptyset$ and $S_1=\mathcal{X}$.
(a) Prove that if $\alpha<\beta$, then $S_\alpha\subseteq S_\beta$.
(b) Let $p_H(x)$ be the $P$-value of an observed $x$. Show that, given $\Theta=\theta_0$,
$p_H(X)\sim U(0,1)$.
(c) Suppose that $\phi_\alpha$ is unbiased for each $\alpha$. Prove that
$$P_\theta(p_H(X)\le\alpha)\ge P_{\theta_0}(p_H(X)\le\alpha).$$

66. *Let X '" N(J,I) given e = 9. Let g(J, x) be the P-value for testing H6 :
e = (J versus A6 : e "" (J for data X = x.
(a) If the hypothesis is H : e E la, bj versus A : e f1. la, bj, and X = x >
(a + b)/2 is observed, prove that the P-value is 41(a - x) + 41(b - x),
where ~ is the standard normal CDF.

(b) Assume that Lebesgue measure is used as an improper prior for $\Theta$
and that data $X=x>b$ are observed. Prove that for all hypotheses
of the form $\Theta\le b$, $\Theta=b$, and $\Theta\in[a,b]$, the $P$-value equals
$E(g(\Theta,x)\mid H\text{ is true},X=x)$.
67. Let X '" N(e, lin) given e = e, where n is known. Let the prior for e
be a mixture of a point mass at 0 and an N(O, 1) distribution. Consider
the hypothesis H : e = 0 versus A : e :f. O. Draw a graph of the Bayes
factor as a function of x and a graph of the P-value as a function of x for
n = 1,10,100.
68. Return to Problem 31 on page 289.
(a) Find the three P-values relative to the three sets of tests in parts (a),
(b), and (c).
(b) Find the posterior probability that H is true in part (c). Does it equal
any of the three P-values?
69. Let X have N(/-" 1) distribution given M = /-" and let Lebesgue measure
be an improper prior for M. For hypotheses of the form M ::; c, M ~ c,
or M = c, we will show that the P-value for the usual family of tests is
the posterior probability that M is farther from X (in the direction of the
alternative) than X is from c.
(a) Let H : M ::; c versus A : M > c. Prove that the P-value equals the
posterior probability that M - X > X-c. State and prove a similar
result for H : M ~ c.
(b) Let H : M = c versus A : M :f. c. Prove that the P-value equals the
posterior probability that 1M - XI > IX - cI-
(c) Can you extend this interpretation to the case of H : a ::; M ::; b
versus A : M [a, b]?
CHAPTER 5
Estimation

In Chapter 3, we discussed methods for choosing decision rules in problems
with specified loss functions. In Section 3.3, we gave an axiomatic derivation
of some of those methods. This derivation led to the conclusion that there
is a probability and a loss function, and one should minimize the expected
loss. There are decision problems in which the action space and the parameter
space are the same (or nearly the same) space and the loss function $L(\theta,a)$
is an increasing function of some measure of distance between $\theta$ and $a$.
Such problems are often called point estimation problems. The classical
framework makes no use of the probability over the parameter space provided
by the axiomatic derivation. One can also try to ignore the loss function as
well. To estimate $\Theta$ without a specific loss function, one can adopt ad hoc
criteria to decide if an estimator is good. In this chapter, we will study some
of these criteria as well as some criteria for the problem of set estimation.
In set estimation, the action space is a collection of subsets of the parameter
space (or the closure of the parameter space). The idea is to find a set that
is likely to contain the parameter without being "too big" in some sense.

5.1 Point Estimation


A point estimator of a function $g$ of a parameter $\Theta$ is a statistic that takes
its values in the same set (or at least a similar set) as does $g(\Theta)$. One
popular type of point estimator is an unbiased estimator.
Definition 5.1. Let $\Omega$ be the parameter space for a parametric family
with $P_\theta$ and $E_\theta$ specifying the conditional distribution of $X$ given $\Theta=\theta$.
Let $g:\Omega\to G$ be some measurable function. Let $G'\supseteq G$. A measurable

function $\phi:\mathcal{X}\to G'$ is called an estimator of $g(\Theta)$. An estimator $\phi$ of $g(\Theta)$
is called unbiased if $E_\theta(\phi(X))=g(\theta)$, for all $\theta\in\Omega$. The bias of $\phi$ is defined
as
$$b_\phi(\theta)=E_\theta(\phi(X))-g(\theta).$$
The next example is one of several that led some early researchers to
believe that unbiased estimators may not be bad to use.
Example 5.2. Suppose that $P_\theta$ says that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu,\sigma^2)$, where
$\theta=(\mu,\sigma)$. If $X=(X_1,\dots,X_n)$, then we define $\bar X=\sum_{i=1}^nX_i/n$. It is easy to
see that $E_\theta(\bar X)=\mu$. So, if $g(\theta)=\mu$, we see that $\phi(X)=\bar X$ is unbiased.
The following example shows that restricting attention to unbiased estimators
may lead to an impasse.
Example 5.3. Suppose that $P_\theta$ says that $X\sim\mathrm{Exp}(\theta)$. If $\phi(X)$ is to be an
unbiased estimator of $\Theta$, then
$$E_\theta\phi(X)=\int_0^\infty\phi(x)\theta\exp(-\theta x)\,dx=\theta,$$
for all $\theta$. This happens if and only if $\int_0^\infty\phi(x)\exp(-\theta x)\,dx=1$, for all $\theta$. By Theorem
2.64, we can differentiate the left-hand side with respect to $\theta$ under the integral
and get $\int_0^\infty x\phi(x)\exp(-\theta x)\,dx=0$ for all $\theta$. This means that $E_\theta(X\phi(X))=0$
for all $\theta$. Since $X$ is a complete sufficient statistic, $\phi(x)=0$, a.s. $[P_\theta]$ for all $\theta$.
This contradicts $\phi(X)$ being unbiased. Hence, there are no unbiased estimators
of $\Theta$.

5.1.1 Minimum Variance Unbiased Estimation

It is natural to check how unbiased estimators fare under certain loss
functions. The most common one to use is squared-error loss $L(\theta,a)=
(g(\theta)-a)^2$. The risk function of an estimator $\phi$ is
$$R(\theta,\phi)=E_\theta\{(g(\theta)-\phi(X))^2\}=b_\phi^2(\theta)+\mathrm{Var}_\theta\phi(X).$$
If an estimator is unbiased, the risk function is just the variance. This
suggests the following "optimality" criterion for unbiased estimators.
Definition 5.4. An unbiased estimator $\phi$ is a uniformly minimum variance
unbiased estimator (UMVUE) if $\phi$ has finite variance and, for every
unbiased estimator $\psi$, $\mathrm{Var}_\theta\phi(X)\le\mathrm{Var}_\theta\psi(X)$ for all $\theta\in\Omega$.
UMVUEs are not necessarily good, as we will see later. The criterion of
unbiasedness only means that the average of $\phi(X)$ with respect to $P_\theta$ is
$g(\theta)$ for all $\theta$. It does not mean that you expect $\phi(X)$ to be near $g(\theta)$, nor
does it mean that you expect $g(\Theta)$ to be near $\phi(x)$ after you have seen
$X=x$.
We mentioned earlier that the concept of complete sufficient statistic
would play a role in unbiased estimation. The following theorem is due to
Lehmann and Scheffé (1955).

Theorem 5.5 (Lehmann-Scheffé theorem). If $T$ is a complete statistic,
then all unbiased estimators of $g(\Theta)$ that are functions of $T$ alone
are equal, a.s. $[P_\theta]$ for all $\theta$. If there exists an unbiased estimator that is a
function of a complete sufficient statistic, then it is a UMVUE.
PROOF. Suppose that $\phi_1(T)$ and $\phi_2(T)$ are unbiased estimators of $g(\Theta)$.
Then $E_\theta[\phi_1(T)-\phi_2(T)]=0$ for all $\theta$. Since $T$ is a complete statistic,
it follows that $\phi_1(T)=\phi_2(T)$, a.s. $[P_\theta]$. Now, suppose that there is an
unbiased estimator $\phi(X)$ with finite variance. Define $\phi_3(T)=E(\phi(X)|T)$.
Then $\phi_3(T)$ is unbiased by the law of total probability B.70. Using squared-error
loss, the Rao-Blackwell theorem 3.22 says $R(\theta,\phi_3)\le R(\theta,\phi)$ for all $\theta$.
Since the risk function is the variance for unbiased estimators, this makes
$\phi_3$ a UMVUE. $\Box$

Example 5.6. Suppose that $\{X_n\}_{n=1}^\infty$ are IID $N(\mu,\sigma^2)$ given $\Theta=(M,\Sigma)=
(\mu,\sigma)$. Let $X=(X_1,\dots,X_n)$. Then
$$\bar X=\frac1n\sum_{i=1}^nX_i,\quad\text{and}\quad S^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X)^2$$
are complete sufficient statistics. Since they are unbiased, they are UMVUE of $M$
and $\Sigma^2$, respectively. Notice that $S^2$ does not minimize mean squared error, even
among estimators of the form $c\sum_{i=1}^n(X_i-\bar X)^2$. (See Example 3.25 on page 154
and Problem 11 on page 210.)
Example 5.7. Suppose that $P_\theta$ says that $X$ has $\mathrm{Poi}(\theta)$ distribution. We know
that $X$ is a complete sufficient statistic. Let $g(\Theta)=\exp(-3\Theta)$. We will find the
UMVUE of $g(\Theta)$. The required condition is
$$E_\theta\phi(X)=\sum_{x=0}^\infty\phi(x)\exp(-\theta)\frac{\theta^x}{x!}=\exp(-3\theta),$$
for all $\theta$. It follows from the uniqueness of Taylor series expansions for analytic
functions that $\phi$ is unbiased if and only if $\phi(x)=(-2)^x$ for $x=0,1,2,\dots$. This $\phi$,
although UMVUE, is an abominable estimator of $g(\Theta)$. We will see some better
estimators later in this chapter. (See Examples 5.29 and 5.32.)

The following results are useful when there is no complete sufficient
statistic.
Proposition 5.8. Let $\delta_0$ be an unbiased estimator of $g(\Theta)$, and let
$$\mathcal{U}=\{U:E_\theta U(X)=0,\text{ for all }\theta\}.$$
Then, the set of all unbiased estimators of $g(\Theta)$ is $\{\delta_0+U:U\in\mathcal{U}\}$.

Theorem 5.9. An estimator $\delta$ is UMVUE of $E_\theta\delta(X)$ if and only if, for
every $U\in\mathcal{U}$, $\mathrm{Cov}_\theta(\delta(X),U(X))=0$.

PROOF. For the "only if" part, suppose that $\delta$ is UMVUE. It is clear that
if $\mathrm{Var}_\theta U(X)=0$ for all $\theta$, then $\mathrm{Cov}_\theta(\delta(X),U(X))=0$. So, let $U\in\mathcal{U}$ be
such that $\mathrm{Var}_\theta U(X)>0$ for some $\theta$. Let $\lambda\in\mathbb{R}$, and define $\delta_\lambda=\delta+\lambda U$.
Then $\delta_\lambda$ is unbiased also, and for every $\lambda$,
$$\mathrm{Var}_\theta\delta(X)\le\mathrm{Var}_\theta\delta_\lambda(X)=\mathrm{Var}_\theta\delta(X)+2\lambda\mathrm{Cov}_\theta(\delta(X),U(X))+\lambda^2\mathrm{Var}_\theta U(X).$$
This is true for all $\lambda$ and all $\theta$ if and only if
$$\lambda^2\mathrm{Var}_\theta U(X)\ge-2\lambda\mathrm{Cov}_\theta(\delta(X),U(X)),$$
which, in turn, is true for all $\lambda$ and $\theta$ if and only if $\mathrm{Cov}_\theta(\delta(X),U(X))=0$
for all $\theta$.
For the "if" part, assume that for all $U\in\mathcal{U}$, $\mathrm{Cov}_\theta(\delta(X),U(X))=0$
for all $\theta$. Now, let $\delta_1(X)$ be an unbiased estimator of $E_\theta\delta(X)$. Then there
exists $U\in\mathcal{U}$ such that $\delta_1=\delta+U$. It follows that
$$\mathrm{Var}_\theta(\delta_1(X))=\mathrm{Var}_\theta(\delta(X))+2\mathrm{Cov}_\theta(\delta(X),U(X))+\mathrm{Var}_\theta(U(X))
=\mathrm{Var}_\theta(\delta(X))+\mathrm{Var}_\theta(U(X))\ge\mathrm{Var}_\theta(\delta(X)),$$
hence $\delta(X)$ is UMVUE. $\Box$
Sometimes unbiased estimators exist, but none is UMVUE.
Sometimes unbiased estimators exist, but none is UMVUE.
Example 5.10. Suppose that $P_\theta$ says that $Y_1,Y_2,\dots$ are IID $\mathrm{Ber}(\theta)$. Set
$$X=\begin{cases}1&\text{if }Y_1=1,\\\#\text{ of trials until 2nd failure}&\text{otherwise},\end{cases}$$
and suppose that we observe only $X$. Then
$$f_{X|\Theta}(x|\theta)=\begin{cases}\theta&\text{if }x=1,\\(1-\theta)^2\theta^{x-2}&\text{if }x=2,3,\dots.\end{cases}$$
Define the estimator $\delta_0$ to be $\delta_0(x)=1$ if $x=1$ and $\delta_0(x)=0$ if not. Then
$\delta_0$ is an unbiased estimator of $\Theta$. We will now try to find a UMVUE. Assume
$E_\theta U(X)=0$, for all $\theta$. Then
$$0=\theta U(1)+\sum_{x=2}^\infty(1-\theta)^2\theta^{x-2}U(x)=U(2)+\sum_{k=1}^\infty\theta^k\left[U(k)-2U(k+1)+U(k+2)\right],$$
which holds for all $\theta$ if and only if $U(2)=0$ and $U(k)=-(k-2)U(1)$ for all $k\ge3$. This characterizes
all functions in $\mathcal{U}$ according to a single value $t$. That is,
$$\mathcal{U}=\{U_t:U_t(x)=(x-2)t,\text{ for all }x\}.$$



Every unbiased estimator of $\Theta$ is $\delta_t(x)=\delta_0(x)+(x-2)t$, for some $t$. In order
for $\delta_t$ to be UMVUE, it must have 0 covariance with every $U_s\in\mathcal{U}$. That is, for
all $s$ and all $\theta$,
$$0=\sum_{x=1}^\infty f_{X|\Theta}(x|\theta)\delta_t(x)U_s(x)=\theta(-s)(1-t)+\sum_{x=2}^\infty(1-\theta)^2\theta^{x-2}ts(x-2)^2.$$
Divide both sides by $(1-\theta)^2$ to get
$$s(1-t)\frac{\theta}{(1-\theta)^2}=\sum_{x=2}^\infty ts\,\theta^{x-2}(x-2)^2.$$
By rewriting $\theta/[1-\theta]^2$ as an infinite sum, we get
$$s(1-t)\sum_{k=1}^\infty k\theta^k=\sum_{k=1}^\infty tsk^2\theta^k.$$
Since these two series in this last equation are analytic functions of $\theta$, it must be
that $s(1-t)k=tsk^2$ for all $s,k$. This is not possible, hence there is no $t$ such
that these equations hold. And there is no UMVUE.
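A small numerical check (a sketch, not from the text) of the characterization of $\mathcal{U}$: under the pmf implied by the setup, $f(1|\theta)=\theta$ and $f(x|\theta)=(1-\theta)^2\theta^{x-2}$ for $x\ge2$, every $U_t(x)=(x-2)t$ integrates to zero; the function name and the values of $\theta$ and $t$ are arbitrary.

```python
# Illustrative check (not part of the text): with f(1|theta) = theta and
# f(x|theta) = (1-theta)^2 * theta^(x-2) for x >= 2, every U_t(x) = (x-2)*t
# has E_theta[U_t(X)] = 0, so it belongs to the class U of the example.
def expected_U(theta, t, terms=10_000):
    total = theta * (1 - 2) * t              # x = 1 term: U_t(1) = -t
    for x in range(2, terms):
        total += (1 - theta) ** 2 * theta ** (x - 2) * (x - 2) * t
    return total

print(abs(expected_U(0.6, 3.0)) < 1e-9)  # True
```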

Oddly enough, there is a locally minimum variance unbiased estimator
in Example 5.10.
Definition 5.11. An unbiased estimator $\delta_0(X)$ is locally minimum variance
unbiased, LMVUE, at $\theta_0$ if for every other unbiased estimator $\delta(X)$,
$\mathrm{Var}_{\theta_0}\delta_0(X)\le\mathrm{Var}_{\theta_0}\delta(X)$.
Example 5.12 (Continuation of Example 5.10; see page 299). First, note that
$\mathrm{Var}_{\theta_0}\delta_t(X)=E_{\theta_0}\delta_t(X)^2-\theta_0^2$, so a LMVUE at $\theta_0$ can be found by minimizing
$$\theta_0(1-t)^2+\sum_{x=2}^\infty\theta_0^{x-2}(1-\theta_0)^2(x-2)^2t^2.$$
This expression is quadratic in $t$ and can be minimized by choosing $t=(1-\theta_0)/2$,
which is different for each $\theta_0$.

A LMVUE at $\theta_0$ is not the "best" estimator if $\Theta=\theta_0$; it is merely
an unbiased estimator such that the conditional variance given $\Theta=\theta_0$ is
smaller than that of any other unbiased estimator.

5.1.2 Lower Bounds on the Variance of Unbiased Estimators

Suppose that one is interested in unbiased estimators. It would be nice
to know how low the variances of such estimators can be. Under some
regularity conditions, there exist lower bounds for the variances of unbiased
estimators. The Fisher information plays an important role in these lower
bounds.¹
Theorem 5.13 (Cramér-Rao lower bound). Suppose that the three FI
regularity conditions hold (see Definition 2.78 on page 111), and let $I_X(\theta)$
be the Fisher information. Suppose that $I_X(\theta)>0$, for all $\theta$. Let $\phi(X)$
be a one-dimensional statistic with $E_\theta|\phi(X)|<\infty$ for all $\theta$. Suppose also
that $\int\phi(x)f_{X|\Theta}(x|\theta)\,d\nu(x)$ can be differentiated under the integral sign with
respect to $\theta$. Then
$$\mathrm{Var}_\theta\phi(X)\ge\frac{\left(\frac{d}{d\theta}E_\theta\phi(X)\right)^2}{I_X(\theta)}.$$
Before proving this theorem, we should look at some examples.
Before proving this theorem, we should look at some examples.
Example 5.14. Suppose that $X\sim N(\theta,b)$ given $\Theta=\theta$ and $\phi(x)=x$. Then,
the conditions of Theorem 5.13 are satisfied, and we calculated $I_X(\theta)=1/b$ in
Example 2.80 on page 111. So
$$\mathrm{Var}_\theta\phi(X)=b=\frac{1}{I_X(\theta)}.$$
In this case, the Cramér-Rao lower bound is met exactly.

Example 5.15. Suppose that $X\sim U(0,\theta)$ given $\Theta=\theta$. In this case $f_{X|\Theta}(x|\theta)=
\theta^{-1}I_{(0,\theta)}(x)$. We saw in Example 2.81 on page 111 that the conditions of Theorem
5.13 are not met. Nevertheless, we were able to calculate something that
could have been called Fisher information, namely $I_X(\theta)=1/\theta^2$. Let $\phi(x)=x$.
Then $E_\theta\phi(X)=\theta/2$ and $\mathrm{Var}_\theta\phi(X)=\theta^2/12$. We can calculate
$$\frac{d}{d\theta}E_\theta\phi(X)=\frac12.$$
If the Cramér-Rao lower bound held here, it would say that $\mathrm{Var}_\theta\phi(X)\ge\theta^2/4$,
which is clearly false.
PROOF OF THEOREM 5.13. Let $B$ and $C$ be the sets described in the FI
regularity conditions. (See Definition 2.78 on page 111.) Let $D=C\cap B^c$, so
that, for all $\theta$, $P_\theta(D)=1$ and $\int_Df_{X|\Theta}(x|\theta)\,d\nu(x)=1$. Taking the derivative
with respect to $\theta$ of this integral, we get
$$0=\int_D\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)\,d\nu(x)=\int_D\frac{\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)}{f_{X|\Theta}(x|\theta)}f_{X|\Theta}(x|\theta)\,d\nu(x)=E_\theta\left[\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right].$$

¹The first lower bound is due to Rao (1945) and Cramér (1945, Chapter 32;
1946).

Also, we can differentiate to obtain
$$\frac{d}{d\theta}E_\theta\phi(X)=\int\phi(x)\frac{\partial}{\partial\theta}f_{X|\Theta}(x|\theta)\,d\nu(x)=E_\theta\left[\phi(X)\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right]
=E_\theta\left\{[\phi(X)-E_\theta\phi(X)]\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right\},$$
since the term being added on has zero mean. Now take the absolute value
and use the Cauchy-Schwarz inequality B.19:
$$\left|\frac{d}{d\theta}E_\theta\phi(X)\right|\le\sqrt{E_\theta[\phi(X)-E_\theta\phi(X)]^2}\,\sqrt{E_\theta\left[\frac{\partial}{\partial\theta}\log f_{X|\Theta}(X|\theta)\right]^2}
=\sqrt{\mathrm{Var}_\theta\phi(X)}\,\sqrt{I_X(\theta)}.$$
Now square the extreme ends of this string and divide by $I_X(\theta)$. $\Box$
For an unbiased estimator $\phi(X)$ of $\Theta$, the smallest possible variance
is $1/I_X(\theta)$, since $dE_\theta\phi(X)/d\theta=1$. A necessary and sufficient condition
for the lower bound to be achieved is that the $\ge$ become an $=$ in Theorem
5.13. The $\ge$ in Theorem 5.13 was introduced by the Cauchy-Schwarz
inequality B.19, which provides for equality if and only if the two factors
are linearly related. (That is, if $E(X)=0$ and $E(Y)=0$, then
$|E(XY)|=\sqrt{E(X^2)}\sqrt{E(Y^2)}$ if and only if there exist $a$ and $b$ such that
$aX+bY=0$, a.s. and $ab\ne0$.) So, the Cramér-Rao lower bound is an
equality if and only if $\phi(X)$ and the score function $\partial\log f_{X|\Theta}(X|\theta)/\partial\theta$ are
linearly related, that is,
$$\frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta)=a(\theta)\phi(x)+d(\theta),\quad\text{a.s. }[P_\theta],$$
for all $\theta$. If we solve this differential equation, we get
$$f_{X|\Theta}(x|\theta)=c(\theta)h(x)\exp\{\pi(\theta)\phi(x)\}.$$
This means that the Cramér-Rao lower bound can be sharp only in a
one-parameter exponential family with $\phi(X)$ being a sufficient statistic.
Example 5.16. Suppose that $X\sim\mathrm{Exp}(\lambda)$ given $\Lambda=\lambda$. Set $\Theta=1/\Lambda$. Then
$$f_{X|\Theta}(x|\theta)=\frac{1}{\theta}\exp\left\{-\frac{x}{\theta}\right\}I_{(0,\infty)}(x),\qquad
\frac{\partial}{\partial\theta}\log f_{X|\Theta}(x|\theta)=-\frac{1}{\theta}+\frac{x}{\theta^2}.$$
Since the score function is a linear function of $x$, it follows that $\phi(x)$ must also be
a linear function of $x$ if $\phi(X)$ is to achieve the lower bound. If $\phi(x)=a+bx$, then
$E_\theta\phi(X)=a+b\theta$, so $a=0$ and $b=1$ gives an unbiased estimator that achieves
the Cramér-Rao lower bound. The reader should verify that this is indeed the
case.
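The verification the text invites can also be done numerically (a sketch under the stated parameterization; the function name and grid sizes are arbitrary choices): integrating the squared score against the density recovers $I_X(\theta)=1/\theta^2$, so the bound for an unbiased estimator of $\Theta$ is $\theta^2$, which equals $\mathrm{Var}_\theta(X)$.

```python
import math

# Illustrative numerical check (not part of the text): for
# f(x|theta) = (1/theta) exp(-x/theta), integrate score^2 * f by a
# midpoint rule to recover the Fisher information I_X(theta) = 1/theta^2.
def fisher_info_exp_mean(theta, upper=100.0, n=200_000):
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        dens = math.exp(-x / theta) / theta
        score = -1.0 / theta + x / theta ** 2   # d/dtheta log f(x|theta)
        total += score ** 2 * dens * h
    return total

theta = 2.0
info = fisher_info_exp_mean(theta)
print(abs(1.0 / info - theta ** 2) < 1e-2)  # True: bound = theta^2 = Var(X)
```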

Example 5.17. Outside of exponential families, the Cramér-Rao lower bound
cannot be achieved. For example, suppose that
$$f_{X|\Theta}(x|\theta)=\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\left(1+\frac{1}{a}(x-\theta)^2\right)^{-\frac{a+1}{2}}.$$
This is the family of $t_a(\theta,1)$ distributions. In order for the variance to exist,
suppose that $a\ge3$. Then
$$\frac{\partial^2}{\partial\theta^2}\log f_{X|\Theta}(x|\theta)=-\frac{a+1}{a}\cdot\frac{1-\frac{1}{a}(x-\theta)^2}{\left[1+\frac{1}{a}(x-\theta)^2\right]^2}.$$
Call this $g(x)$. Then $I_X(\theta)=-E_\theta g(X)$:
$$I_X(\theta)=\int\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\cdot\frac{a+1}{a}\cdot\frac{1-\frac{1}{a}(x-\theta)^2}{\left(1+\frac{1}{a}(x-\theta)^2\right)^{\frac{a+5}{2}}}\,dx.$$
Since the denominator of the integrand looks like part of the $t_{a+4}$ density, we will
perform the following transformation:
$$z-\theta=\sqrt{\frac{a+4}{a}}\,(x-\theta),\qquad x=\theta+\sqrt{\frac{a}{a+4}}\,(z-\theta),\qquad dx=\sqrt{\frac{a}{a+4}}\,dz.$$
The integral becomes
$$\frac{a+1}{a}\sqrt{\frac{a}{a+4}}\,\frac{\Gamma\left(\frac{a+1}{2}\right)}{\Gamma\left(\frac{a}{2}\right)\sqrt{a\pi}}\int\frac{1-\frac{1}{a+4}(z-\theta)^2}{\left(1+\frac{1}{a+4}(z-\theta)^2\right)^{\frac{a+5}{2}}}\,dz.$$
Except for the constant, the integral is $E(1-U^2/[a+4])$, where $U\sim t_{a+4}$. The
correct constant to make the integral equal to this expected value is
$$\frac{\Gamma\left(\frac{a+5}{2}\right)}{\Gamma\left(\frac{a+4}{2}\right)\sqrt{(a+4)\pi}},$$
and the expected value is $(a+1)/(a+2)$, so the result is
$$I_X(\theta)=\frac{a+1}{a+3}.$$
This means the Cramér-Rao lower bound is $(a+3)/(a+1)$. We know that
$\phi(X)=X$ is UMVUE because $X$ is a complete sufficient statistic, and $\mathrm{Var}_\theta(X)=
a/(a-2)$, which is always larger than the Cramér-Rao lower bound.
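The closed form can be checked numerically (a sketch, not from the text; the score formula used below follows from differentiating the log of the $t_a(\theta,1)$ density, and the grid sizes are arbitrary). The text's conclusion that the lower bound is $(a+3)/(a+1)$ means $I_X(\theta)=(a+1)/(a+3)$:

```python
import math

# Illustrative numerical check (not part of the text): the Fisher information
# of the t_a location family should equal (a+1)/(a+3), the reciprocal of the
# Cramér-Rao bound quoted in the text. Midpoint rule in u = x - theta.
def t_fisher_info(a, half_width=400.0, n=400_000):
    const = math.gamma((a + 1) / 2) / (math.gamma(a / 2) * math.sqrt(a * math.pi))
    h = 2 * half_width / n
    total = 0.0
    for i in range(n):
        u = -half_width + (i + 0.5) * h
        dens = const * (1 + u * u / a) ** (-(a + 1) / 2)
        score = (a + 1) * u / (a + u * u)   # d/dtheta log f at x - theta = u
        total += score * score * dens * h
    return total

a = 5.0
print(abs(t_fisher_info(a) - (a + 1) / (a + 3)) < 1e-3)  # True: (a+1)/(a+3) = 0.75
```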

There is another lower bound that applies in more general cases, such as
when the set of possible values for $X$ depends on $\Theta$. This next lower bound
is due to Chapman and Robbins (1951).

Theorem 5.18 (Chapman-Robbins lower bound). Let $\phi(X)$ be a statistic, and let
$$m(\theta)=E_\theta\phi(X),\qquad\mathrm{supp}(\theta)=\text{closure of }\{x:f_{X|\Theta}(x|\theta)>0\}.$$
Assume that for each $\theta\in\Omega$, there is $\theta'\ne\theta$ such that $\mathrm{supp}(\theta')\subseteq\mathrm{supp}(\theta)$.
Then,
$$\mathrm{Var}_\theta\phi(X)\ge\sup_{\{\theta':\,\mathrm{supp}(\theta')\subseteq\mathrm{supp}(\theta)\}}\frac{[m(\theta)-m(\theta')]^2}{E_\theta\left[\dfrac{f_{X|\Theta}(X|\theta')}{f_{X|\Theta}(X|\theta)}-1\right]^2}.$$
PROOF. Let $\theta'$ be such that $\mathrm{supp}(\theta')\subseteq\mathrm{supp}(\theta)$. Let
$$U(X)=\frac{f_{X|\Theta}(X|\theta')}{f_{X|\Theta}(X|\theta)}-1.$$
Then $E_\theta U(X)=1-1=0$, and
$$\mathrm{Cov}_\theta(U(X),\phi(X))=\int_{\mathrm{supp}(\theta)}\left[\phi(x)f_{X|\Theta}(x|\theta')-\phi(x)f_{X|\Theta}(x|\theta)\right]d\nu(x)=m(\theta')-m(\theta),$$
since $\mathrm{supp}(\theta')\subseteq\mathrm{supp}(\theta)$. By the Cauchy-Schwarz inequality B.19, the
square of the covariance is at most the product of the variances, so
$$[m(\theta')-m(\theta)]^2\le\mathrm{Var}_\theta\phi(X)\,\mathrm{Var}_\theta U(X).\qquad\Box$$

Example 5.19. This is a case in which the Cramér-Rao lower bound does not
apply. Let $P_\theta$ say that $\{X_n\}_{n=1}^\infty$ are IID with density $f_{X|\Theta}(x|\theta)=\exp(\theta-x)I_{[\theta,\infty)}(x)$.
Let $X=(X_1,\dots,X_n)$. Then $\mathrm{supp}(\theta)=[\theta,\infty)$ and $\mathrm{supp}(\theta')\subseteq\mathrm{supp}(\theta)$ so long
as $\theta'\ge\theta$. From the proof of Theorem 5.18,
$$U(X)=\exp\{n(\theta'-\theta)\}I_{[\theta',\infty)}(\min X_i)-1.$$
If $\phi(X)$ is an unbiased estimator of $\Theta$, then $[m(\theta)-m(\theta')]^2=(\theta-\theta')^2$, and
$$E_\theta U(X)^2=\exp\{2n(\theta'-\theta)\}P_\theta(\min X_i\ge\theta')-2\exp\{n(\theta'-\theta)\}P_\theta(\min X_i\ge\theta')+1,$$
$$P_\theta(\min X_i\ge\theta')=\left(P_\theta(X_1\ge\theta')\right)^n=\left(\int_{\theta'}^\infty\exp(\theta-x)\,dx\right)^n=\exp\{-n(\theta'-\theta)\},$$
$$E_\theta U(X)^2=\exp\{n(\theta'-\theta)\}-1.$$
The Chapman-Robbins lower bound is
$$\mathrm{Var}_\theta\phi(X)\ge\sup_{\theta'\ge\theta}\frac{(\theta-\theta')^2}{\exp\{n(\theta'-\theta)\}-1}\approx\frac{0.1619}{n^2}.$$
A simple unbiased estimator is $\phi(X)=\min X_i-1/n$, which has variance $1/n^2$.

Another way to improve the Cramér-Rao lower bound is to raise it if it
is unattainable. We know that it is attained if $\phi(X)$ is perfectly correlated
with the score function $\partial\log f_{X|\Theta}(X|\theta)/\partial\theta$, that is, if the regression of $\phi(X)$
on the score function has 0 residual. If this is not possible, the residual might
be made smaller by regressing $\phi(X)$ on more than just the score function.
Lemma 5.20. Let $\phi(X)$ be an unbiased estimator of $g(\Theta)$ and let $\psi_i(X,\theta)$,
$i=1,\dots,k$, be functions that are not linearly related and such that
$$\gamma^T=(\gamma_1,\dots,\gamma_k),\qquad\gamma_i=\mathrm{Cov}_\theta(\phi(X),\psi_i(X,\theta)),$$
$$C=((c_{ij})),\qquad c_{ij}=\mathrm{Cov}_\theta(\psi_i(X,\theta),\psi_j(X,\theta)).$$
Then $\mathrm{Var}_\theta\phi(X)\ge\gamma^TC^{-1}\gamma$.

PROOF. The covariance matrix of $(\phi(X),\psi_1(X,\theta),\dots,\psi_k(X,\theta))^T$ is
$$\begin{pmatrix}\mathrm{Var}_\theta\phi(X)&\gamma^T\\\gamma&C\end{pmatrix}.$$
The inequality follows from the fact that the covariance matrix is positive
semidefinite and $C$ is nonsingular. $\Box$
Lemma 5.20 has two corollaries, one of which is an improvement on
the Cramér-Rao lower bound and the other of which is a multiparameter
version of the Cramér-Rao lower bound.
The first corollary to Lemma 5.20 is one that attempts to improve the
Cramér-Rao lower bound by making use of the fact that the inequality
can fail to be an equality because the score function is not linearly related
to the estimator. To put this another way, if the residual from the linear
regression of the estimator on the score function has nonzero variance, then
it might be possible to get the residual variance down by regression on more
than just the score function. This is the approach taken in Corollary 5.21.
Corollary 5.21 (Bhattacharyya system of lower bounds). Assume
the conditions of Theorem 5.13, assume that $k$ partial derivatives with respect
to $\theta$ can be passed under the integral sign, and assume that $J(\theta)$
(defined below) is nonsingular. Then $\mathrm{Var}_\theta\phi(X)\ge\gamma^T(\theta)J^{-1}(\theta)\gamma(\theta)$, where
$$\gamma(\theta)=\begin{pmatrix}\gamma_1(\theta)\\\vdots\\\gamma_k(\theta)\end{pmatrix},\qquad\gamma_i(\theta)=\frac{d^i}{d\theta^i}E_\theta\phi(X),$$
$$J(\theta)=((J_{ij}(\theta))),\qquad J_{ij}(\theta)=\mathrm{Cov}(\psi_i(X,\theta),\psi_j(X,\theta)),$$
$$\psi_i(x,\theta)=\frac{1}{f_{X|\Theta}(x|\theta)}\,\frac{\partial^i}{\partial\theta^i}f_{X|\Theta}(x|\theta).$$
PROOF. All we need to do, in order to apply Lemma 5.20, is note that
$$\gamma_i(\theta)=\mathrm{Cov}_\theta(\phi(X),\psi_i(X,\theta)),$$

which follows for higher derivatives in just the same way that it did for the
first derivative in the proof of Theorem 5.13. $\Box$
The Cramér-Rao lower bound is the special case with $k=1$.
Example 5.22. Suppose that X ∼ Exp(λ) given Λ = λ. Set Θ = 1/Λ. Then

    f_{X|Θ}(x|θ) = (1/θ) exp{−x/θ} I_{(0,∞)}(x),

    (∂/∂θ) f_{X|Θ}(x|θ) = (−1/θ + x/θ²) f_{X|Θ}(x|θ),

    (∂²/∂θ²) f_{X|Θ}(x|θ) = (2/θ² − 4x/θ³ + x²/θ⁴) f_{X|Θ}(x|θ).

Let φ(x) = x². We can easily calculate E_θ(φ(X)) = 2θ² and Var_θ(φ(X)) = 20θ⁴.
Since I_X(θ) = 1/θ², and dE_θ(φ(X))/dθ = 4θ, the Cramér-Rao lower bound on
the variance of φ(X) is 16θ⁴. A little bit of calculation yields

    J(θ) = ( 1/θ²   0
             0      4/θ⁴ ),    γ(θ) = ( 4θ
                                        4  ).

So γ(θ)ᵀ J⁻¹(θ) γ(θ) = 16θ⁴ + 4θ⁴ = 20θ⁴, and the Bhattacharyya lower bound is achieved.
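As a numerical sanity check on Example 5.22 (a sketch only, using the Python standard library; the parameter value, sample size, and seed are arbitrary choices), the simulated variance of X² should sit near the Bhattacharyya bound 20θ⁴, strictly above the Cramér-Rao bound 16θ⁴:

```python
import random
import statistics

def mc_var_x2(theta, n=200_000, seed=1):
    """Monte Carlo estimate of Var(X^2) when X is exponential with mean theta."""
    rng = random.Random(seed)
    # random.expovariate takes the rate; rate 1/theta gives mean theta
    return statistics.pvariance([rng.expovariate(1.0 / theta) ** 2 for _ in range(n)])

theta = 2.0
mc = mc_var_x2(theta)
cr_bound = 16 * theta ** 4      # Cramer-Rao bound: (4 theta)^2 * theta^2
bhatt_bound = 20 * theta ** 4   # Bhattacharyya bound with k = 2, which equals Var(X^2)
```

With θ = 2 the two bounds are 256 and 320, and the Monte Carlo estimate lands near the latter, illustrating that the k = 2 bound is sharp here while the k = 1 bound is not.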
Corollary 5.23 (Multiparameter Cramér-Rao lower bound). Assume
that the FI regularity conditions on page 111 hold. Let I_X(θ) =
(I_{ij}(θ)) be the Fisher information matrix, and suppose that it is
positive definite. Suppose that ∫ φ(x) f_{X|Θ}(x|θ) dν(x) can be twice differentiated
under the integral sign with respect to the coordinates of θ and that

    E_θ|φ(X)| < ∞  and  γ(θ)ᵀ = ( (∂/∂θ₁) E_θ φ(X), …, (∂/∂θ_k) E_θ φ(X) ).

Then Var_θ φ(X) ≥ γ(θ)ᵀ I_X(θ)⁻¹ γ(θ).

PROOF. All we need to do, in order to apply Lemma 5.20, is note that

    (∂/∂θ_i) E_θ φ(X) = Cov_θ( φ(X), (∂/∂θ_i) log f_{X|Θ}(X|θ) ),

just as in the proof of Theorem 5.13. □
Example 5.24. Suppose that P_θ says X ∼ N(μ, σ²), where θ = (μ, σ). This is
the same as Example 2.83 on page 112. There we calculated

    I_X(θ) = ( 1/σ²   0
               0      2/σ² ).

Now, set φ(X) = X². Then E_θ φ(X) = μ² + σ² and

    γ₁(θ) = 2μ,    γ₂(θ) = 2σ,    Var_θ φ(X) = 2σ⁴ + 4μ²σ²,

    γ(θ)ᵀ I_X(θ)⁻¹ γ(θ) = 4μ²σ² + 2σ⁴.

The Cramér-Rao lower bound is met exactly.
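The same kind of check works for Example 5.24. This sketch (standard library only; μ, σ, and the simulation size are arbitrary) compares the simulated variance of X² with γᵀ I⁻¹ γ = 4μ²σ² + 2σ⁴:

```python
import random
import statistics

# Monte Carlo check: for X ~ N(mu, sigma^2), Var(X^2) should match the
# multiparameter Cramer-Rao bound 4*mu^2*sigma^2 + 2*sigma^4.
rng = random.Random(7)
mu, sigma = 1.5, 2.0
mc = statistics.pvariance([rng.gauss(mu, sigma) ** 2 for _ in range(200_000)])
bound = 4 * mu ** 2 * sigma ** 2 + 2 * sigma ** 4
```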



5.1.3 Maximum Likelihood Estimation


If the posterior density of Θ is very high at some value, say θ₀, and relatively
low everywhere else, it means that we are quite sure that Θ is near θ₀. If
the prior density of Θ is fairly flat near θ₀ and is not orders of magnitude
larger at other values of θ than it is at θ₀, then the posterior density will
differ from the likelihood function only by a constant factor near θ₀.²
Definition 5.25. Let X be a random quantity with conditional density
f_{X|Θ}(x|θ) given Θ = θ. If X = x is observed, then the function L(θ) =
f_{X|Θ}(x|θ), considered as a function of θ for fixed x, is called the likelihood
function. Any random quantity Θ̂ such that

    max_{θ∈Ω} f_{X|Θ}(X|θ) = f_{X|Θ}(X|Θ̂)

is called a maximum likelihood estimator (MLE) of Θ. If Θ = (Θ₁, Θ₂) and
Θ̂ = (Θ̂₁, Θ̂₂), then Θ̂₁ is called an MLE of Θ₁.
The idea of maximizing the likelihood function in order to estimate a
parameter dates back to Fisher (1922).
Example 5.26. Suppose that X₁, …, X_n are conditionally independent given
Θ = θ with U(0, θ) distribution. Then

    f_{X|Θ}(x|θ) = θ⁻ⁿ I_{[0,θ]}(max_i x_i) I_{[0, max_i x_i]}(min_i x_i).

As a function of θ, the maximum of this function is at θ = max_i x_i. Hence the
MLE is Θ̂ = max_i X_i.

Suppose that we had defined the density of each X_i to be I_{(0,θ)}(x) instead of
using the closed interval. In this case there is no value of θ at which the maximum
is achieved. At first, it would seem that this could easily be fixed by replacing
max by sup in the definition of MLE. This would also require some continuity
condition on the likelihood function. It turns out to be very inconvenient to do
this. Rather, we should use the closed interval for the density.

The following example shows that the MLE may exist but not be unique.

Example 5.27. Suppose that X₁, …, X_n are conditionally independent given
Θ = θ with U(θ − 1/2, θ + 1/2) distribution. Then

    f_{X|Θ}(x|θ) = I_{[θ−1/2, θ+1/2]}(min_i x_i) I_{[min_i x_i, θ+1/2]}(max_i x_i).

As a function of θ, this is constant for max_i x_i − 1/2 ≤ θ ≤ min_i x_i + 1/2. Any
random variable Θ̂ between max_i X_i − 1/2 and min_i X_i + 1/2 is an
MLE.
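Both uniform examples are easy to probe by simulation. In this sketch (standard library only; θ, n, and the seed are arbitrary), the MLE of Example 5.26 never exceeds θ, and the interval of MLEs in Example 5.27 always contains θ:

```python
import random

rng = random.Random(0)
theta, n = 3.0, 50

# Example 5.26: X_i ~ U(0, theta); the MLE is max_i X_i, never above theta.
x = [rng.uniform(0, theta) for _ in range(n)]
mle = max(x)

# Example 5.27: X_i ~ U(theta - 1/2, theta + 1/2); every point of
# [max_i X_i - 1/2, min_i X_i + 1/2] maximizes the likelihood, so the MLE
# is not unique.  Since the sample range is at most 1, the interval is
# nonempty, and it always contains theta.
y = [rng.uniform(theta - 0.5, theta + 0.5) for _ in range(n)]
lo, hi = max(y) - 0.5, min(y) + 0.5
```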

²This observation has led some people to try to base inference on the likelihood
function alone, rather than the posterior distribution. See Barndorff-Nielsen
(1988) for an in-depth study of likelihood.

Theorem 5.28. Let g be a measurable function from Ω to some space G.
Suppose that there exists another space U and a one-to-one measurable
function h : Ω → G × U such that h(θ) = (g(θ), g*(θ)) for some function
g*. If Θ̂ is an MLE of Θ, then g(Θ̂) is an MLE of g(Θ).

PROOF. Since h is one-to-one, the parameter might just as well be Ψ =
h(Θ). The likelihood for Ψ is f_{X|Ψ}(x|ψ) = f_{X|Θ}(x|h⁻¹(ψ)). For fixed x,
if the maximum of f_{X|Θ}(x|θ) occurs at θ = θ̂, define ψ̂ = h(θ̂). Then
the maximum of f_{X|Θ}(x|θ) occurs at θ = h⁻¹(ψ̂). Now, suppose that the
maximum of f_{X|Ψ}(x|ψ) occurs at ψ = ψ₀. If f_{X|Ψ}(x|ψ₀) > f_{X|Ψ}(x|ψ̂),
then θ = h⁻¹(ψ₀) would provide a higher value for f_{X|Θ}(x|θ) than θ̂. This
would be a contradiction. It follows that ψ = ψ̂ provides a maximum for
f_{X|Ψ}(x|ψ). It follows that the MLE of Ψ is h(Θ̂) and the MLE of the first
coordinates of Ψ, namely g(Θ), is g(Θ̂), the first coordinates of h(Θ̂). □
Example 5.29 (Continuation of Example 5.7; see page 298). Suppose that X
given Θ = θ has Poi(θ) distribution and g(θ) = exp(−3θ). Since the MLE of
Θ is X, and g is one-to-one, the MLE of g(Θ) is exp(−3X). This is far more
reasonable than the UMVUE (−2)^X.

If the loss function is L(θ, a) = (θ + (1/3) log a)² and the (improper) prior
distribution has Radon-Nikodym derivative 1/θ with respect to Lebesgue measure,
then the formal Bayes rule is also exp(−3X).
Example 5.30. Let X₁, …, X_n be IID N(μ, σ²) given Θ = (M, Σ) = (μ, σ). It
is not difficult to see that the MLE of Θ is (X̄, √(W/n)), where X̄ = Σ_{i=1}^n X_i/n
and W = Σ_{i=1}^n (X_i − X̄)². Suppose that we want the MLE of M². The function
g(μ, σ) = μ² is not a one-to-one function of either coordinate, but g*(μ, σ) =
(σ, sign(μ)) will satisfy the conditions of Theorem 5.28. So X̄² is the MLE of M².
The UMVUE of M² is X̄² − W/[n(n − 1)], which is negative with positive probability.
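A small simulation makes the last point concrete. This sketch (standard library only; the parameter values are arbitrary) counts how often the unbiased estimator X̄² − W/[n(n−1)], which corrects X̄² for its bias E[X̄²] = μ² + σ²/n, goes negative, while the MLE X̄² of course cannot:

```python
import random

# The MLE of mu^2 is Xbar^2 >= 0, while the unbiased estimator
# Xbar^2 - W/[n(n-1)] can go negative when mu is near zero.
rng = random.Random(3)
n, mu, sigma = 5, 0.1, 1.0
reps = 2000
negatives = 0
for _ in range(reps):
    x = [rng.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(x) / n
    w = sum((xi - xbar) ** 2 for xi in x)
    assert xbar ** 2 >= 0.0                      # the MLE is never negative
    if xbar ** 2 - w / (n * (n - 1)) < 0:
        negatives += 1
```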

In exponential families, there is a simple method for finding MLEs in
most cases. The logarithm of the likelihood function will be log L(θ) =
log c(θ) + xᵀθ. If the MLE occurs in the interior of the parameter space,
it occurs where the partial derivatives of log L(θ) are 0. That is, x_i =
−∂ log c(θ)/∂θ_i. By using the method of Example 2.66, we see that the
MLE is that θ such that x = E_θ X.
Example 5.31 (Continuation of Example 2.68; see page 106). Let X₁, …, X_n
be IID N(μ, σ²) given Ψ = (μ, σ). The natural parameter of this exponential
family is θ₁ = μ/σ² and θ₂ = −1/[2σ²]. The natural sufficient statistic is
X = (nX̄, Σ_{i=1}^n X_i²). Now log c(θ) = n log(−2θ₂)/2 + nθ₁²/[4θ₂]. The partial
derivative with respect to θ₁ is nθ₁/[2θ₂] and the partial with respect to θ₂
is n/[2θ₂] − nθ₁²/[4θ₂²]. Setting these equal to the negatives of the two coordinates
of X and solving for θ₁ and θ₂ gives θ̂₁ = nX̄/Σ_{i=1}^n (X_i − X̄)² and θ̂₂ =
−n/[2Σ_{i=1}^n (X_i − X̄)²]. In terms of the usual parameterization μ = −θ₁/[2θ₂]
and σ² = −1/[2θ₂], we get μ̂ = X̄ and σ̂² = Σ_{i=1}^n (X_i − X̄)²/n by Theorem 5.28.
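The characterization "the MLE is that θ such that x = E_θ X" can be checked numerically for the normal model of Example 5.31. In this sketch (standard library only, made-up data), the sample means of the natural sufficient statistics X and X² match their expectations under (μ̂, σ̂²) up to rounding error:

```python
import random

# At the MLE of an exponential family, observed natural statistics equal
# their expectations.  For the normal model: E X = mu_hat and
# E X^2 = mu_hat^2 + sigma2_hat.
rng = random.Random(5)
x = [rng.gauss(2.0, 1.5) for _ in range(1000)]
n = len(x)
xbar = sum(x) / n
w = sum((xi - xbar) ** 2 for xi in x)
mu_hat, sigma2_hat = xbar, w / n                 # the MLE from Example 5.31
mean_of_squares = sum(xi ** 2 for xi in x) / n
```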

5.1.4 Bayesian Estimation


Bayesian estimation tends to be somewhat more decision theoretic than
classical estimation. If g is a function on the parameter space, N is the
closure of g(Ω), and the loss function L(θ, a) increases as a moves away
from g(θ), then one could reasonably be said to be estimating g(Θ). In
Example 3.8 on page 146, we saw that if Θ is one-dimensional and if
L(θ, a) = (θ − a)², then the formal Bayes rule is to use the posterior mean
of Θ (so long as the posterior variance is finite).
Example 5.32 (Continuation of Example 5.7; see page 298). Suppose that P_θ
says that X has Poi(θ) distribution, and g(θ) = exp(−3θ). If the prior distribution
for Θ is Γ(a, b), then the posterior after learning X = x is Γ(a + x, b + 1).
The posterior mean of exp(−3Θ) is

    ∫₀^∞ exp(−3θ) [(b + 1)^{a+x}/Γ(a + x)] θ^{a+x−1} exp(−(b + 1)θ) dθ = ((b + 1)/(b + 4))^{a+x}.
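The closed form in Example 5.32 can be checked by simulating from the posterior. This sketch (standard library only; a, b, x, and the simulation size are arbitrary choices) compares a Monte Carlo average of exp(−3Θ) under the Γ(a + x, b + 1) posterior with ((b+1)/(b+4))^{a+x}:

```python
import math
import random

a, b, x = 2.0, 1.0, 3
closed_form = ((b + 1) / (b + 4)) ** (a + x)

rng = random.Random(11)
n = 200_000
# random.gammavariate takes (shape, scale); the posterior has rate b + 1,
# so the scale is 1/(b + 1).
mc = sum(math.exp(-3 * rng.gammavariate(a + x, 1 / (b + 1))) for _ in range(n)) / n
```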

Another popular loss function is L(θ, a) = |θ − a|. The formal Bayes rule
in this case is a special case of the following result.

Theorem 5.33. Suppose that Θ has finite posterior mean. For the loss

    L(θ, a) = { c(a − θ)        if a ≥ θ,
                (1 − c)(θ − a)  if a < θ,                         (5.34)

a formal Bayes rule is any 1 − c quantile of the posterior distribution of Θ.

PROOF. Suppose that a′ is chosen to be a 1 − c quantile of the posterior
distribution of Θ. Then

    Pr(Θ ≤ a′|X = x) ≥ 1 − c,    Pr(Θ ≥ a′|X = x) ≥ c.

If a > a′, then

    L(θ, a) − L(θ, a′) = { c(a − a′)               if a′ ≥ θ,
                           c(a − a′) − (θ − a′)    if a ≥ θ > a′,
                           (1 − c)(a′ − a)         if θ > a,

                        = c(a − a′) + { 0        if a′ ≥ θ,
                                        a′ − θ   if a ≥ θ > a′,
                                        a′ − a   if θ > a.

It follows that the difference in the posterior risks is

    r(a|x) − r(a′|x) = c(a − a′) + ∫_{(a′,a]} (a′ − θ) f_{Θ|X}(θ|x) dλ(θ)
                           + (a′ − a) Pr(Θ > a|X = x)
                     ≥ c(a − a′) + (a′ − a) Pr(Θ > a′|X = x)
                     = (a − a′)[c − Pr(Θ > a′|X = x)].

Since Pr(Θ > a′|X = x) ≤ c, it follows that r(a|x) ≥ r(a′|x). Similarly, if
a < a′, then

    L(θ, a) − L(θ, a′) = c(a − a′) + { 0        if a ≥ θ,
                                       θ − a    if a′ ≥ θ > a,
                                       a′ − a   if θ > a′.

It follows that r(a|x) − r(a′|x) ≥ (a′ − a)[Pr(Θ ≥ a′|X = x) − c]. Since
Pr(Θ ≥ a′|X = x) ≥ c, it follows that r(a|x) ≥ r(a′|x),
so a′ provides the minimum posterior risk. □
Notice that Theorem 5.33 remains true if the loss in (5.34) is replaced by

    L(θ, a) = { c(a − θ)        if a > θ,
                (1 − c)(θ − a)  if a ≤ θ,

even if Θ has a discrete distribution. The reason is that the loss is 0 when
θ = a for both loss functions. As a corollary, we can let c = 1/2 in (5.34),
and we get that the median is the formal Bayes rule for absolute error loss.
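Theorem 5.33 is easy to visualize empirically. In this sketch (standard library only; the distribution and grid spacing are arbitrary choices), the "posterior" is replaced by a large sample of draws, and the empirical posterior risk under the loss (5.34) with c = 1/4 is minimized at the empirical 0.75 quantile:

```python
import random

def loss(theta, a, c):
    # the loss (5.34): c*(a - theta) if a >= theta, else (1 - c)*(theta - a)
    return c * (a - theta) if a >= theta else (1 - c) * (theta - a)

rng = random.Random(2)
draws = sorted(rng.gauss(0.0, 1.0) for _ in range(20_001))
c = 0.25
q = draws[int((1 - c) * (len(draws) - 1))]   # empirical 1 - c = 0.75 quantile

def risk(a):
    """Average loss over the draws: an empirical posterior risk."""
    return sum(loss(t, a, c) for t in draws) / len(draws)

grid = [q + 0.05 * k for k in range(-20, 21)]
best = min(grid, key=risk)
```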

5.1.5 Robust Estimation*


A pragmatic approach to statistical inference will often allow for the pos-
sibility that probability distributions used in modeling are not to be taken
too seriously. For example, when we model data as conditionally IID with
N(μ, σ²) distribution given Θ = (μ, σ), we might not be saying that this
description is a precise specification of our beliefs, but rather it is an approx-
imation that we hope will be sufficient for most purposes. Occasionally, the
approximation is not sufficient. For example, if there is some small chance
that one observation will be generated by a process much different from
the others, we might wish to use a model that makes this belief explicit.
Alternatively, we might want to use a procedure for estimating a parameter
that is not sensitive to the occasional observation that comes from the dif-
ferent process. This latter is the approach that leads to robust estimation.
Of course, robust estimation is not concerned solely with occasional aberrant
observations, but it is also concerned with general misspecification of
distributions. The approach to robust estimation outlined here originated
with Huber (1964).
Suppose that we will be estimating some functional T of the distribution
P of the data. For example, if P is a distribution with finite mean,
then T(P) = ∫ x dP(x) is the mean expressed as a function of P. Similarly,
the median of a one-dimensional distribution P with continuous, strictly
increasing CDF F is T(P) = F⁻¹(1/2). The influence function of a
functional T is a means of assessing the sensitivity of the functional to small
changes in the distribution P.

*This section may be skipped without interrupting the flow of ideas.



Definition 5.35. Let P₀ be a collection of distributions on a Borel space
(X, B) and let T : P₀ → ℝᵏ be a functional. For each x ∈ X, P ∈ P₀,
t ∈ [0, 1), and B ∈ B, we define P_{x,t}(B) = (1 − t)P(B) + t I_B(x). The
influence function of T at P is the following function of x:

    IF(x; T, P) = lim_{t↓0} [T(P_{x,t}) − T(P)] / t,

for those x such that the limit exists.

In particular, the influence function of T at P gives, for each x, the rate
of change in T when P is contaminated by an infinitesimal mass at x. In
other words, it is the right-hand derivative of T(P_{x,t}) with respect to t at
t = 0.
Example 5.36. If T is the mean functional, then T(P_{x,t}) = (1 − t)T(P) + tx.
It follows that IF(x; T, P) = x − T(P) for all x and P with finite mean T(P).
Clearly, if P is contaminated by some mass at x = T(P), the mean will not
change. Otherwise, the mean changes proportionally to how far x is from T(P).
In a finite sample setting, suppose that we obtain n observations X₁, …, X_n and
we contemplate one further observation X_{n+1}. We can think of the empirical
CDF of the first n observations as a probability measure P and then, with t =
1/(n + 1), P_{X_{n+1},t} will be the empirical CDF of all n + 1 observations. In this
case, the difference between the sample averages T(P_{X_{n+1},t}) − T(P) is exactly
[X_{n+1} − T(P)]/(n + 1).
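The finite-sample computation at the end of Example 5.36 is a one-liner to verify; this sketch uses made-up numbers:

```python
# Contaminating the empirical CDF of x with mass t = 1/(n+1) at a new point
# shifts the mean by exactly (x_new - mean)/(n + 1), i.e. t times the
# influence function x - T(P) of the mean functional.
x = [1.0, 4.0, 2.5, 3.0, 0.5]
x_new = 10.0
n = len(x)
t_p = sum(x) / n                        # mean of the first n observations
t_contam = (sum(x) + x_new) / (n + 1)   # mean after adding x_new
shift = t_contam - t_p
```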

Example 5.37. For the median functional, where F is the CDF of P,

    T(P_{x,t}) = { F⁻¹(1/[2(1 − t)])            if x > F⁻¹(1/[2(1 − t)]),
                   x                            if F⁻¹([1 − 2t]/[2(1 − t)]) ≤ x ≤ F⁻¹(1/[2(1 − t)]),
                   F⁻¹([1 − 2t]/[2(1 − t)])     if x < F⁻¹([1 − 2t]/[2(1 − t)]).

If we let P₀ be the class of distributions that have CDF with derivative at the
median, then the influence function exists at all of ℝ except the median of P. For
each P (with derivative of the CDF being f) and all x not equal to the median,

    IF(x; T, P) = { [2f(F⁻¹(1/2))]⁻¹     if x > F⁻¹(1/2),
                    −[2f(F⁻¹(1/2))]⁻¹    if x < F⁻¹(1/2).

For all x less than the median, the effect of a contamination at x is essentially the
same. This is similar for all x greater than the median. In a finite sample setting,
suppose that we obtain n observations X₁, …, X_n and we contemplate one further
observation X_{n+1}. We can think of the empirical CDF of the first n observations
as a probability measure P and then, with t = 1/(n + 1), P_{X_{n+1},t} will be the
empirical CDF of all n + 1 observations. In this case, the difference between the sample
medians T(P_{X_{n+1},t}) − T(P) has slightly different forms depending on whether n is
odd or even. For the case of n odd, T(P) = X_{([n+1]/2)}, the [n + 1]/2 order statistic.
If X_{n+1} > X_{([n+1]/2+1)}, then T(P_{X_{n+1},t}) = [X_{([n+1]/2)} + X_{([n+1]/2+1)}]/2, which is
independent of the value of X_{n+1} (so long as it is larger than X_{([n+1]/2+1)}). A similar
expression holds for X_{n+1} < X_{([n+1]/2−1)}. So T(P_{X_{n+1},t}) − T(P) is approximately
the difference between X_{([n+1]/2)} and the next observation either above or below
it. If the distribution is continuous, then this difference is approximately one
over two times the density at the median times 1/n, 1/[2nf(F⁻¹(1/2))], which
is approximately t·IF(X_{n+1}; T, P).

One way to summarize the influence function is by the gross error sensitivity.
This is defined as γ*(T, P) = sup_x |IF(x; T, P)|. If γ*(T, P) is
infinite, then there is no bound on how much T can change when P is
contaminated by even a small amount of mass at an arbitrary point.

Example 5.38. For the mean functional T, γ*(T, P) is the largest absolute deviation
possible from the mean. For distributions with unbounded support, γ* = ∞.
For the median, on the other hand, γ*(T, P) = [2f(F⁻¹(1/2))]⁻¹, which is finite.
For this reason, we say that the median is more robust with respect to gross
errors than the mean.

There is a way to derive estimators with specified influence functions.
Let P₀ be a class of distributions on (X, B), and let T : P₀ → ℝᵏ be a
k-dimensional functional of interest. Let ψ : X × ℝᵏ → ℝᵏ be a vector-valued
function. Assume that ψ(x, θ) is differentiable with respect to θ at
θ = T(P) a.s. [P] for each P ∈ P₀. Suppose that the mean of ψ(X, T(P))
is 0 for all distributions P ∈ P₀. That is,

    ∫ ψ(y, T(P)) dP(y) = 0,                                      (5.39)

for all P ∈ P₀. For each x ∈ X, P ∈ P₀, t ∈ [0, 1), and B ∈ B, we
define P_{x,t}(B) = (1 − t)P(B) + t I_B(x), and we assume that the mean of
ψ(X, T(P_{x,t})) is 0 also. That is,

    (1 − t) ∫ ψ(y, T(P_{x,t})) dP(y) + t ψ(x, T(P_{x,t})) = 0.    (5.40)

Subtracting (5.40) from (5.39) gives

    ∫ [ψ(y, T(P)) − ψ(y, T(P_{x,t}))] dP(y)                       (5.41)
        = t [ ψ(x, T(P_{x,t})) − ∫ ψ(y, T(P_{x,t})) dP(y) ].

Suppose that we can differentiate ∫ ψ(y, T(P_{x,t})) dP(y) with respect to t
by differentiating under the integral sign. Dividing both sides of (5.41) by
t and taking the limit as t → 0 gives the derivative at t = 0. Since ψ is
continuous in its second argument, and since T(P_{x,t}) is continuous at t = 0
for all x at which the influence function exists,

    lim_{t→0} ∫ ψ(y, T(P_{x,t})) dP(y) = 0,
    lim_{t→0} ψ(x, T(P_{x,t})) = ψ(x, T(P)),

if the influence function of T exists at x. So, the limit as t → 0 of 1/t
times the right-hand side of (5.41) is ψ(x, T(P)). Since T is k-dimensional,
the influence function will be a vector with coordinates IF(x; T, P)_j, for
j = 1, …, k. The limit as t → 0 of 1/t times the ith coordinate of the
left-hand side of (5.41) is

    − ∫ Σ_{j=1}^k [ (∂/∂θ_j) ψ_i(y, θ) |_{θ=T(P)} ] IF(x; T, P)_j dP(y).   (5.42)

Define the matrix M = (m_{i,j}), where

    m_{i,j} = ∫ (∂/∂θ_j) ψ_i(y, θ) |_{θ=T(P)} dP(y).

If we assume that the matrix M is finite and nonsingular, we can set (5.42)
equal to ψ_i(x, T(P)) for each i and collect the resulting equations into a
vector equation −M[IF(x; T, P)] = ψ(x, T(P)), so that

    IF(x; T, P) = −M⁻¹ ψ(x, T(P)).

For an empirical distribution P_n, T(P_n) = T_n, where T_n solves the equation

    (1/n) Σ_{i=1}^n ψ(X_i, T_n) = 0.                              (5.43)

Estimators that solve equations like (5.43) are called M-estimators because
they are generalizations of maximum likelihood estimators in the following
sense. If ρ is a function such that ∂ρ(x, θ)/∂θ_i = ψ_i(x, θ), then to maximize
Σ_{k=1}^n ρ(X_k, T_n) it is necessary (but not sufficient in general) that (5.43)
hold (if the maximum does not occur at a boundary point). One can think
of ρ(x, θ) as a replacement for log f_{X|Θ}(x|θ). In this way, an M-estimator
is a generalization of an MLE.
Example 5.44. Let X₁, …, X_n be conditionally IID with density f_{X₁|Θ}(x|θ)
given Θ = θ. Let P₀ = {P_θ : θ ∈ Ω} and let T(P) = θ if P is P_θ. Suppose
that Ω ⊆ ℝᵏ. Let ψ_i(x, θ) = ∂ log f_{X₁|Θ}(x|θ)/∂θ_i, the score function. If f_{X₁|Θ}
is sufficiently smooth, we have that the M-estimator corresponding to ψ is the
MLE. The matrix −M is the Fisher information matrix, so the influence function
for the MLE, in the smooth case, is the inverse of the information matrix times
the score function.

As an example, if Θ = (M, Σ) and f_{X₁|Θ}(·|μ, σ) is the N(μ, σ²) distribution,
then the score function is

    ψ(x, θ) = (1/σ²) ( x − μ
                       [(x − μ)² − σ²]/σ ).

(See Example 2.83 on page 112.) The Fisher information matrix is

    I_X(θ) = ( 1/σ²   0
               0      2/σ² ).

So, the influence function of the MLE T is

    IF(x; T, P) = ( x − μ
                    [(x − μ)² − σ²]/(2σ) )                        (5.45)

if P is the N(μ, σ²) distribution. In fact, one can verify directly that if P is
any distribution with finite variance and T′(P) is the standard deviation, then
IF(x; T′, P) is the second coordinate in (5.45). (See Problem 32 on page 342.)
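That claim can at least be checked numerically. This sketch (standard library only; the discrete distribution, contamination point, and step size t are arbitrary choices) compares a difference quotient of the standard-deviation functional with ((x − μ)² − σ²)/(2σ):

```python
import math

def sd(weighted):
    """Standard deviation of a discrete distribution given as (weight, point) pairs."""
    tot = sum(w for w, _ in weighted)
    m = sum(w * p for w, p in weighted) / tot
    return math.sqrt(sum(w * (p - m) ** 2 for w, p in weighted) / tot)

pts = [-1.0, 0.0, 1.0, 2.0]
base = [(0.25, p) for p in pts]            # a discrete P with equal mass
mu = sum(pts) / len(pts)
sigma = sd(base)

x, t = 5.0, 1e-6                           # contaminate with mass t at x
contam = [((1 - t) * w, p) for w, p in base] + [(t, x)]
numeric_if = (sd(contam) - sigma) / t      # difference quotient at small t
closed_form = ((x - mu) ** 2 - sigma ** 2) / (2 * sigma)
```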

Example 5.46. For an example that does not meet the smoothness criteria,
consider X₁, …, X_n conditionally IID with U(0, θ) distribution given Θ = θ. Let
P₀ be the class of distributions on (ℝ, B¹) with bounded support, and let T(P)
be the supremum of the support. The MLE is the maximum of the sample, which
is the supremum of the support of the empirical CDF, so the MLE is T of the
empirical CDF. The influence function for T is IF(x; T, P) = 0 if x ≤ T(P) and
∞ if x > T(P). (See Problem 33 on page 342.)

One famous M-estimator of a one-dimensional location parameter is
based on the function ψ(x, θ) = h(x − θ), where

    h(t) = { −b   if t < −b,
             t    if −b ≤ t ≤ b,
             b    if t > b.

If P is a continuous distribution, then ψ(y, θ) is differentiable at θ = T(P)
with probability 1. The influence function will be

    IF(x; T, P) = ψ(x, T(P)) / P([T(P) − b, T(P) + b]).

Notice that as b → 0, the influence function approaches the influence function
of the median. The finite sample version is the estimator T_n which
solves Σ_{i=1}^n ψ(X_i, T_n) = 0. As with all M-estimators, one can rewrite this
equation as Σ_{i=1}^n W_{n,i}(X_i − T_n) = 0, where W_{n,i} = ψ(X_i, T_n)/(X_i − T_n). It
is not difficult to see that T_n = Σ_{i=1}^n W_{n,i} X_i / Σ_{i=1}^n W_{n,i} solves the equation.
Since T_n appears on both sides of this equation, a solution must be
found iteratively. For example, make an initial guess T_n^{(0)} and define

    W_{n,i}^{(k)} = ψ(X_i, T_n^{(k−1)}) / (X_i − T_n^{(k−1)}),
    T_n^{(k)} = Σ_{i=1}^n W_{n,i}^{(k)} X_i / Σ_{i=1}^n W_{n,i}^{(k)},

for k ≥ 1 until convergence occurs. In the special case we are considering,
the weights W_{n,i} have a nice form: W_{n,i} = 1 if |X_i − T_n| ≤ b and W_{n,i} =
b/|X_i − T_n| otherwise.
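The reweighting iteration just described can be sketched in a few lines. The tuning constant b = 1.345 below is a conventional choice, not one taken from the text, and the data are made up; the point is only that the gross outlier receives weight b/|X_i − T_n| and barely moves the estimate:

```python
def huber_m_estimate(x, b=1.345, tol=1e-10, max_iter=100):
    """Location M-estimator for psi(x, theta) = h(x - theta), computed by the
    iteratively reweighted mean described in the text."""
    t = sorted(x)[len(x) // 2]          # start from an empirical median
    for _ in range(max_iter):
        # W = 1 inside [t - b, t + b], W = b/|x - t| outside
        w = [1.0 if abs(xi - t) <= b else b / abs(xi - t) for xi in x]
        t_new = sum(wi * xi for wi, xi in zip(w, x)) / sum(w)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 50.0]   # one gross outlier
est = huber_m_estimate(data)
```

The estimate stays near 2, whereas the sample mean is dragged above 8 by the single outlier.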
Other types of robust estimators include trimmed means. The 100α%
trimmed mean of a distribution is the conditional mean given that the
observation falls between the α and 1 − α quantiles of the distribution.
That is, if F is the CDF corresponding to P,

    T(P) = [ ∫_{F⁻¹(α)}^{F⁻¹(1−α)} x dP(x) ] / (1 − 2α),

for α < 1/2. For α = 1/2, the tradition is to call the median the 50%
trimmed mean. The influence function of a trimmed mean at a continuous
distribution is bounded and it has a shape similar to that of the previous
estimator. (See Problem 34 on page 342.)
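At an empirical distribution the trimmed mean is just an average of the central order statistics. A sketch (the trimming rule int(αn) used here is one reasonable convention; implementations differ at the boundaries):

```python
def trimmed_mean(x, alpha):
    """100*alpha% trimmed mean of a sample, alpha < 1/2: the average of the
    observations left after dropping int(alpha*n) points from each end."""
    xs = sorted(x)
    k = int(alpha * len(xs))
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)

data = [2.1, 1.9, 2.0, 2.2, 1.8, 2.05, 50.0, -40.0]
tm = trimmed_mean(data, 0.25)    # drops the two smallest and two largest points
```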
In Section 7.3.6, we will give some results concerning large sample properties
of M-estimators. More detailed discussion of robust estimators can
be found in the books by Huber (1977, 1981) and Hampel et al. (1986).
In Section 8.6.3, we discuss robustness considerations that are peculiar to
the Bayesian perspective.

5.2 Set Estimation


A set estimator of a function g of a parameter Θ is a function from the
data space X to a collection of subsets of the space in which g(Θ) lies.

5.2.1 Confidence Sets


In Section 4.6, we introduced the P-value as an alternative to testing a
hypothesis at a fixed level. In nice problems, the P-value gave us the set of
all levels at which we could accept the hypothesis. Another alternative is
to fix the level and ask for the set of all hypotheses that we could accept
at that level. This leads to the concept of a confidence set.
Definition 5.47. Let g : Ω → G be a function, let H be the collection of
all subsets of G, and let R : X → H be a function. The function R is a
coefficient γ confidence set for g(Θ) if for every θ ∈ Ω,

    {x : g(θ) ∈ R(x)} is measurable, and P_θ(g(θ) ∈ R(X)) ≥ γ.

The confidence set R is exact if P_θ(g(θ) ∈ R(X)) = γ for all θ ∈ Ω. If
inf_{θ∈Ω} P_θ(g(θ) ∈ R(X)) > γ, the confidence set is called conservative.³

The following result shows how confidence sets relate to nonrandomized
tests in general. Its proof is left to the reader.

³Some authors require that P_θ(R(X) = ∅) = 0 for all θ before calling R a
confidence set. Some require that inf_{θ∈Ω} P_θ(g(θ) ∈ R(X)) = γ before saying that
the coefficient is γ.

Proposition 5.48. Let g : Ω → G be a function.

• For each y ∈ G, let φ_y be a level α nonrandomized test of H : g(Θ) =
y. Let R(x) = {y : φ_y(x) = 0}. Then R is a coefficient 1 − α confidence
set for g(Θ). The confidence set R is exact if and only if φ_y is
α-similar for all y.

• Let R be a coefficient 1 − α confidence set for g(Θ). For each y ∈ G,
define

    φ_y(x) = { 0  if y ∈ R(x),
               1  otherwise.

Then, for each y, φ_y has level α as a test of H : g(Θ) = y. The test
φ_y is α-similar for all y if and only if R is exact.
Example 5.49. Let X₁, …, X_n be conditionally IID with N(μ, σ²) distribution
given (M, Σ) = (μ, σ). Let X = (X₁, …, X_n). The usual UMPU level α test of
H : M = y is φ_y(x) = 1 if √n|x̄ − y|/s > T_{n−1}⁻¹(1 − α/2), where T_{n−1} is the
CDF of the t_{n−1}(0, 1) distribution. This translates into the confidence interval
[x̄ − T_{n−1}⁻¹(1 − α/2)s/√n, x̄ + T_{n−1}⁻¹(1 − α/2)s/√n].
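Example 5.49 can be checked by simulation. In this sketch (standard library only; the parameter values and replication count are arbitrary), the critical value t₁₅⁻¹(0.975) ≈ 2.131 is hard-coded from tables rather than computed, and the interval's coverage of the true μ is estimated over repeated samples:

```python
import random
import statistics

def t_interval(x, tcrit):
    """Two-sided interval xbar +/- tcrit * s / sqrt(n)."""
    n = len(x)
    xbar = statistics.mean(x)
    s = statistics.stdev(x)          # sample standard deviation, divisor n - 1
    half = tcrit * s / n ** 0.5
    return xbar - half, xbar + half

rng = random.Random(8)
mu, sigma, n = 5.0, 2.0, 16
tcrit = 2.131                        # approximate 0.975 quantile of t_15
reps = 5000
cover = 0
for _ in range(reps):
    lo, hi = t_interval([rng.gauss(mu, sigma) for _ in range(n)], tcrit)
    if lo <= mu <= hi:
        cover += 1
```

The empirical coverage comes out close to the nominal 95%.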
Example 5.49 is typical of the most popular way to form confidence sets,
namely the use of pivotal quantities. A pivotal is a function h : X × Ω → ℝ
whose distribution does not depend on the parameter. That is, for all c,
P_θ(h(X, θ) ≤ c) is constant as a function of θ. In Example 5.49, the pivotal
is √n(X̄ − M)/S, which has t_{n−1}(0, 1) distribution given Θ. The general
method of using a pivotal h(X, Θ) to form a confidence set is to set R(x) =
{θ : h(x, θ) ≤ F_h⁻¹(γ)}, where F_h is the CDF of h(X, Θ).
We can define randomized confidence sets if we want a correspondence
to randomized tests.
Definition 5.50. Let g : Ω → G be a function, and let R* : X × G → [0, 1]
be a function such that

    R*(·, y) : X → [0, 1] is measurable for all y ∈ G, and
    E_θ[R*(X, g(θ))] ≥ γ for all θ ∈ Ω.

Then R* is called a coefficient γ randomized confidence set for g(Θ).
The number R*(x, y) is to be thought of as the probability that y is included
in the confidence set given that X = x is observed.
Example 5.51. Suppose that X ∼ Bin(2, θ) given Θ = θ. Let g(θ) = θ for all θ.
Define

    R*(x, θ) = { max{0, 1 − 0.05/(1 − θ)²}                    if x = 0,
                 1                                            if x = 1 and θ ≤ 1 − √0.05,
                 max{0, 1 − [0.05 − (1 − θ)²]/[2θ(1 − θ)]}    if x = 1 and θ > 1 − √0.05,
                 min{1, 1 − (θ² − 0.95)/θ²}                   if x = 2.

It is easy to check that

    E_θ[R*(X, θ)] = 0.95,

for all θ. So, if X = 1 is observed, the confidence set consists of the interval
[0, 1 − √0.05] together with possibly some θ values between 1 − √0.05 and √0.95,
chosen with decreasing probability as θ increases. For convenience, we could select
a single U ∼ U(0, 1) independent of X and, if X = 1 is observed, include θ in the
confidence set if

    1 − [0.05 − (1 − θ)²]/[2θ(1 − θ)] > U.

For example, if U = 0.5 is observed, the confidence set becomes [0, 0.95]. This
example is a special case of Proposition 5.52 below.
Randomized tests correspond to randomized confidence sets in a manner
similar to Proposition 5.48. The randomized confidence set in Example 5.51
was constructed according to Proposition 5.52 using the UMP level 0.05
tests of H : Θ = θ versus A : Θ < θ.

Proposition 5.52. Let g : Ω → G be a function.

• For each y ∈ G, let φ_y be a level α test of H : g(Θ) = y. Let
R*(x, y) = 1 − φ_y(x). Then R* is a coefficient 1 − α randomized
confidence set for g(Θ). The randomized confidence set R* is exact if
and only if φ_y is α-similar for all y.

• Let R* be a coefficient 1 − α randomized confidence set for g(Θ). For
each y ∈ G, define φ_y(x) = 1 − R*(x, y). Then, for each y, φ_y has
level α as a test of H : g(Θ) = y. The test φ_y is α-similar for all y if
and only if R* is exact.

The concept of UMP test corresponds to the concept of uniformly most
accurate confidence set.

Definition 5.53. Let g : Ω → G be a function, and let R be a coefficient γ
confidence set for g(Θ). Let H be the collection of all subsets of G, and let B :
G → H be a function such that y ∉ B(y). Then R is uniformly most accurate
(UMA) coefficient γ against B if for each θ ∈ Ω and each y ∈ B(g(θ)) and
each coefficient γ confidence set T for g(Θ), P_θ(y ∈ R(X)) ≤ P_θ(y ∈ T(X)).
If R* is a coefficient γ randomized confidence set for g(Θ), then R* is
UMA coefficient γ randomized against B if for every coefficient γ randomized
confidence set T* and every θ ∈ Ω and each y ∈ B(g(θ)),

    E_θ[R*(X, y)] ≤ E_θ[T*(X, y)].

The accuracy of a confidence set against B is its probability of not covering
parameter values in B(g(θ)) given Θ = θ. The set B(g(θ)) is the set of
values you wish not to have in your confidence set if Θ = θ. It is not
exactly analogous to the alternative in hypothesis testing, but it is related,
as we will see in Theorem 5.54.

One can consider a confidence set R as a randomized confidence set by
setting

    R*(x, y) = { 1  if y ∈ R(x),
                 0  if not.

For this reason, we state the following result in terms of randomized
confidence sets.

Theorem 5.54. Let g(θ) = θ for all θ and let B : Ω → H be as in Definition 5.53.
Define

    B⁻¹(θ) = {θ′ : θ ∈ B(θ′)}.

Suppose that B⁻¹(θ) is nonempty for every θ. For each θ ∈ Ω, let φ_θ be
a test. Define R*(x, θ) = 1 − φ_θ(x). Then φ_θ is UMP level α for testing
H : Θ = θ versus A : Θ ∈ B⁻¹(θ) for all θ if and only if R* is UMA
coefficient 1 − α randomized against B.
PROOF. For the "only if" part, suppose that for each θ, φ_θ is UMP level
α for testing H_θ : Θ = θ versus A_θ : Θ ∈ B⁻¹(θ). Let T* be another
coefficient 1 − α randomized confidence set. Let θ ∈ Ω and θ′ ∈ B(θ). All
that remains is to show that E_θ[R*(X, θ′)] ≤ E_θ[T*(X, θ′)]. First, note that
θ ∈ B⁻¹(θ′). Now, define a test ψ(x) = 1 − T*(x, θ′). This test ψ has level
α as a test of H_{θ′}, according to Proposition 5.52. Since φ_{θ′} is UMP as a
test of H_{θ′} against the alternative A_{θ′} : Θ ∈ B⁻¹(θ′), and θ ∈ B⁻¹(θ′), it
follows that β_ψ(θ) ≤ β_{φ_{θ′}}(θ). We can rewrite this as

    β_ψ(θ) = E_θ[ψ(X)] = 1 − E_θ[T*(X, θ′)] ≤ β_{φ_{θ′}}(θ) = E_θ[φ_{θ′}(X)]
           = 1 − E_θ[R*(X, θ′)],

which establishes the result.

For the "if" part, suppose that R* is a UMA coefficient 1 − α randomized
confidence set against B. For each θ ∈ Ω, let ψ_θ be a level α test of H_θ :
Θ = θ and define T*(x, θ) = 1 − ψ_θ(x). Then Proposition 5.52 shows that
T* is a coefficient 1 − α randomized confidence set. Let

    Ω′ = {(θ′, θ) : θ′ ∈ Ω, θ ∈ B(θ′)}
       = {(θ′, θ) : θ ∈ Ω, θ′ ∈ B⁻¹(θ)},

where the second equality follows since B⁻¹(θ) is nonempty for all θ ∈ Ω.
For each (θ′, θ) ∈ Ω′, we know that E_{θ′}[R*(X, θ)] ≤ E_{θ′}[T*(X, θ)]. This is
the same as β_{φ_θ}(θ′) ≥ β_{ψ_θ}(θ′), for all θ ∈ Ω and all θ′ ∈ B⁻¹(θ). The last
claim means that φ_θ is UMP level α for testing H_θ versus A_θ : Θ ∈ B⁻¹(θ).
□
Example 5.55. Suppose that X₁, …, X_n are IID N(μ, 1) given M = μ. Due to
the continuity of this distribution, we will not need to consider randomized tests
and confidence sets. Let

    R(x) = (−∞, x̄ + Φ⁻¹(1 − α)/√n].

We note that

    P_μ(μ ∈ R(X)) = P_μ(μ ≤ X̄ + Φ⁻¹(1 − α)/√n) = 1 − α,

so that R is an exact coefficient 1 − α confidence set. Consider the test φ_μ(x) = 1
if x̄ < μ − Φ⁻¹(1 − α)/√n. Then R(x) = {μ : φ_μ(x) = 0}, and φ_μ is the UMP level
α test of H : M = μ versus A : M < μ. So B⁻¹(μ) = (−∞, μ), B(μ) = (μ, ∞),
and R is UMA coefficient 1 − α against B. That is, if μ < μ′, then R has a
smaller chance of covering μ′ than does any other coefficient 1 − α confidence set,
conditional on M = μ.
The following proposition (whose proof is left to the reader) illustrates
why we do not need to introduce a dual concept to UMA confidence sets
corresponding to UMC tests.

Proposition 5.56. For each θ ∈ Ω, let φ_θ be a floor γ test of H : Θ ∈ Ω₀
versus A : Θ = θ, where θ ∉ Ω₀. Let R*(x, θ) = 1 − φ_θ(x). If φ_θ is UMC
floor γ for testing H : Θ ∈ B⁻¹(θ) versus A : Θ = θ for all θ ∈ Ω, then R*
is UMA coefficient γ randomized against B.

In other words, UMA confidence sets correspond to both UMP and UMC
tests, just in different ways.⁴
The following example is due to Pratt (1961) and it illustrates an inad-
equacy in the theory of confidence intervals as described above.
Example 5.57. Suppose that Xl, ... , Xn are lID U(0-1/2,(}+1/2) given e = O.
Minimal sufficient statistics are Tl = min Xi and T2 = max Xi.

h 1 ,T2Is(tl, t210) = n(n - l)(t2 - td n - 2, for 0 - ~ S tl S t2 SO + ~.


Suppose that B(O) = (-00,0) and that we want the UMA coefficient 1 - a
confidence set against B. If, for each 00, we find the UMP level a test of nH =
(-00,00 1versus nA = (60 ,00) = B- 1 (00 ), we can use these tests to construct the
UMA coefficient 1 - Q confidence set. This is not an exponential family and it
does not have MLR, since the sufficient statistic is two-dimensional. To find the
UMP level Q test (and hence the UMA coefficient 1 - a confidence set), we use
the Neyman-Pearson lemma. First, let 91 > 90. For k < 1 and 01 < 00 + 1,

(5.58)

if tl > 01 - 1/2 or if t2 > 00 + 1/2 (see Figure 5.59). If k = 1, then (5.58) changes
to equality on the shaded set in Figure 5.59. If 0 1 2: 00 + 1, then (5.58) holds for
tl 2: 01 - 1/2 for every k 2: O. To make the test have size Q and to make it be
the same for all 0 1 > 00 we must set <P = 1 in the upper corner (shaded region in
Figure 5.59), filling in a large enough area to have probability a. This would be

-,(tl, )_{I
'I' t2 -
if t2 > 9+ ~
0 or it > 00 + ~ - a *,
.1.
o if t2 S 90 +~ and it S 90 +~ - an.

4Note that the hypothesis and alternative need to be switched in order for the
same confidence set to correspond to both a UMP and a UMC test.
320 Chapter 5. Estimation

00 +~

00 - ~

FIGURE 5.59. Construction of UMP Test in Uniform Example

(To see that this is MP for each θ1 > θ0, note that we choose k = 1 in (5.58) if
θ1 − 1/2 < θ0 + 1/2 − α^{1/n}, and we choose k = 0 if θ1 − 1/2 > θ0 + 1/2 − α^{1/n}.)
The UMA coefficient 1 − α confidence set against B is [T*, ∞), where T* =
max{T1 − 1/2 + α^{1/n}, T2 − 1/2}. This means that Pθ(θ′ ≥ T*) is minimized for all
θ′ < θ among all coefficient 1 − α confidence sets. Note, however, that Θ ≥ T2 − 1/2
for sure, and T* ≤ T2 − 1/2 whenever T1 − 1/2 + α^{1/n} ≤ T2 − 1/2, that is, when
T2 − T1 ≥ α^{1/n}. So, we are 100% confident that Θ ≥ T* whenever T2 − T1 ≥ α^{1/n},
rather than 100(1 − α)% confident. (The probability that T2 − T1 ≥ α^{1/n} is
1 − α[n/α^{1/n} − n + 1]. If α = 0.05 and n = 10, then Pθ(T2 − T1 ≥ α^{1/n}) = 0.77
for all θ. So the understatement of confidence will occur with probability 0.77 no
matter what θ is.)
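The quoted 0.77 is easy to check numerically. Below is a small sketch (helper names are mine, not from the text) that exploits the ancillarity of the range T2 − T1, so θ = 0 can be used without loss of generality, and compares a Monte Carlo estimate with the closed-form probability P(T2 − T1 ≥ r) = 1 − nr^{n−1} + (n − 1)r^n evaluated at r = α^{1/n}:

```python
import random

def range_exceeds_prob(n=10, alpha=0.05, reps=100_000, seed=1):
    """Monte Carlo estimate of P(T2 - T1 >= alpha**(1/n)) for n IID
    U(theta - 1/2, theta + 1/2) draws; theta = 0 w.l.o.g. (range is ancillary)."""
    rng = random.Random(seed)
    cut = alpha ** (1.0 / n)
    hits = 0
    for _ in range(reps):
        xs = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        hits += (max(xs) - min(xs)) >= cut
    return hits / reps

def range_exceeds_exact(n=10, alpha=0.05):
    """Closed form: the range of n uniforms on a unit interval has
    CDF n*r**(n-1) - (n-1)*r**n, so P(range >= r) is 1 minus that."""
    r = alpha ** (1.0 / n)
    return 1.0 - n * r ** (n - 1) + (n - 1) * r ** n
```

For α = 0.05 and n = 10 both return about 0.775, matching the probability quoted above.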
But there is more. Suppose that we switch to B′(θ) = (θ, ∞), so that the
corresponding hypothesis and alternative are H′ : Θ ≥ θ0 versus A′ : Θ < θ0. The
UMA coefficient 1 − α confidence set against B′ is (−∞, T], where T = min{T1 +
1/2, T2 + 1/2 − α^{1/n}}. (See Problem 31 on page 289.) If T2 − T1 < 2α^{1/n} − 1, then
the two intervals do not even overlap. For example, if α = 0.05 and n = 10, the
probability of this occurrence is 0.008. As an example, if T1 = 1 and T2 = 1.3,
then T* = 1.24 and T = 1.06. So we are 95% confident that Θ < 1.06 and we
are 95% confident that Θ ≥ 1.24. That makes us 190% confident, and we haven't
even covered all the possible values of Θ. The most straightforward way out of
these dilemmas in the classical framework is to condition on the ancillary T2 − T1.
This will produce more sensible results, which will also be more closely in line
with a Bayesian analysis. The confidence set produced will not be UMA, however.
Another alternative is to use the UMA confidence interval but assert a confidence
coefficient that depends on the observed value t2 − t1 of the ancillary.5

5 A similar approach was suggested by Barnard (1976) in response to the claim
by Welch (1939) that conditional confidence intervals are inefficient. Welch was
pointing out (as we noted above) that conditional intervals, with the same confidence
coefficient for every value of the ancillary, are less efficient (e.g., not UMA,
or longer on average) than intervals based on the marginal distribution of the
data. What Barnard showed was that one can fix the marginal confidence coefficient
to be 1 − α and then choose the conditional confidence coefficient in a way
to optimize whatever criterion one desires.
5.2. Set Estimation 321

Example 5.57 is a situation in which the distributions satisfy a condition
called invariance, which will be discussed in Chapter 6. In Section 6.3.2,
we will prove a theorem that says that in such cases, the posterior proba-
bility that a parameter lies in a confidence set is equal to the conditional
confidence coefficient given the ancillary.
For two-sided or multiparameter confidence sets, we need to extend the
concept of unbiasedness.
Definition 5.60. Let g : Ω → G be a function, and let R be a coefficient
γ confidence set for g(Θ). Let η be the collection of all subsets of G, and
let B : G → η be a function such that y ∉ B(y). We say that R is unbiased
against B if, for each θ ∈ Ω, Pθ(y ∈ R(X)) ≤ γ for all y ∈ B(g(θ)). We say
R is a uniformly most accurate unbiased (UMAU) coefficient γ confidence
set for g(Θ) against B if it is UMA against B among unbiased coefficient
γ confidence sets.

Proposition 5.61. For each θ ∈ Ω, let B(θ) be a subset of Ω such that
θ ∉ B(θ), and let φ_θ be a level α test of H : Θ = θ versus A : Θ ∈ B^{-1}(θ)
such that φ_θ is nonrandomized. Let R(x) = {θ : φ_θ(x) = 0}. Then R is a
UMAU coefficient 1 − α confidence set against B if each φ_θ is UMPU level α
for testing H versus A.

The following example shows that the phenomenon of Example 5.57 can
occur in exponential families with UMA unbiased confidence sets. This
example is due to Fieller (1954).
Example 5.62. Let Θ = (B0, B1, Σ), and suppose that Y1, ..., Yn are conditionally
independent given Θ = (β0, β1, σ) with Yi ∼ N(β0 + β1 xi, σ²), where
x1, ..., xn are known numbers. Sufficient statistics are

Ȳ = (1/n) Σ_{i=1}^n Yi,    β̂1 = Σ_{i=1}^n (Yi − Ȳ)(xi − x̄) / Σ_{i=1}^n (xi − x̄)²,

W = Σ_{i=1}^n (Yi − Ȳ − β̂1(xi − x̄))².

If we let Sxx = Σ_{i=1}^n (xi − x̄)², then the joint density of the sufficient statistics is

( 1/(2σ²) )^{n/2} ( √(nSxx) / (π Γ(n/2 − 1)) ) w^{n/2 − 2}
    × exp( −(1/(2σ²)) { n(ȳ − β0 − β1x̄)² + Sxx(β̂1 − β1)² + w } ).


Suppose that we want a confidence interval for the value of x that makes
B0 + B1x = 0. Let Ψ = −B0/B1 be that value. We will test H : Ψ = x0 by
testing (B0 + B1x0)/Σ² = 0. The natural parameters and corresponding sufficient
statistics are

Natural Parameter           Sufficient Statistic
Θ1 = (B0 + B1x̄)/Σ²          nȲ
Θ2 = B1/Σ²                  Sxx β̂1
Θ3 = −1/(2Σ²)               W + Sxx β̂1² + nȲ²
Now, set Ψ2 = Θ2, Ψ3 = Θ3, and Ψ1 = (B0 + B1x0)/Σ² = Θ1 + Θ2(x0 − x̄). The
new sufficient statistics are

U1 = nȲ,    U2 = Sxx β̂1 − (x0 − x̄)nȲ,    U3 = W + Sxx β̂1² + nȲ².

The inverse of the transformation from (Ȳ, β̂1, W) to (U1, U2, U3) is

Ȳ = U1/n,    β̂1 = [U2 + (x0 − x̄)U1]/Sxx,    W = U3 − U1²/n − [U2 + (x0 − x̄)U1]²/Sxx,

and the Jacobian is 1/(nSxx). The joint density of the Ui's given Ψ = ψ is

( 1/(2σ²) )^{n/2} ( 1 / (π √(nSxx) Γ(n/2 − 1)) ) [ u3 − u1²/n − (u2 + (x0 − x̄)u1)²/Sxx ]^{n/2 − 2}
    × exp(u1 ψ1) g(u2, u3, ψ) I_(0,∞)(w),

where g is some function that will not concern us. The conditional density of U1
given Ψ1 = 0 and (U2, U3) = (u2, u3) is

[ u3 − u1²/n − (u2 + (x0 − x̄)u1)²/Sxx ]^{n/2 − 2} I_(0,∞)(w) h(u2, u3),

where h is a function that will not concern us. If we expand out the formula for
w and complete the square in this function as a function of u1, we see that the
conditional distribution of U1 given U2 and U3 is a location and scale family, and

[ (β̂0 + β̂1x0) / √(1/n + (x0 − x̄)²/Sxx) ] / [ w + (β̂0 + β̂1x0)²/(1/n + (x0 − x̄)²/Sxx) ]^{1/2}

is independent of U2 and U3. The numerator of this expression is

(β̂0 + β̂1x0) / √(1/n + (x0 − x̄)²/Sxx),   (5.63)

where β̂0 = Ȳ − β̂1x̄. The denominator is

[ w + (β̂0 + β̂1x0)²/(1/n + (x0 − x̄)²/Sxx) ]^{1/2}.   (5.64)

The usual t statistic is

t = (β̂0 + β̂1x0) / [ √(W/(n − 2)) √(1/n + (x0 − x̄)²/Sxx) ].

The ratio of (5.63) to (5.64) is [t/√(n − 2)] / √(1 + t²/(n − 2)), whose absolute
value is an increasing function of |t|. Hence, the usual t-test of H : B0 + B1x0 = 0
is the UMPU level α test of Ψ1 = 0. Hence a UMAU coefficient 1 − α confidence
set for Ψ is

{ z : (β̂0 + β̂1z)² / [ (W/(n − 2))(1/n + (z − x̄)²/Sxx) ] ≤ F^{-1}_{1,n−2}(1 − α) },

where F_{1,n−2} is the CDF of the F distribution with 1 and n − 2 degrees of freedom.
We will now find all z that satisfy this inequality. Define

v = √n(z − x̄)/√Sxx,    v* = −Ȳ√n/(β̂1√Sxx),    F = (n − 2)β̂1²Sxx/W,    c = F^{-1}_{1,n−2}(1 − α).

Clearly v is a one-to-one function of z and v* is its naive estimator. The usual F
statistic for testing B1 = 0 is F. We can write

β̂0 + β̂1z = Ȳ + β̂1(z − x̄) = √(Sxx/n) β̂1(v − v*),

(W/(n − 2))(1/n + (z − x̄)²/Sxx) = (W/(n − 2))(1 + v²)/n.

So, the confidence set is {z : F(v − v*)²/(1 + v²) ≤ c}. Now, F(v − v*)²/(1 + v²) ≤ c
if and only if

v²(F − c) − 2vFv* + (Fv*² − c) ≤ 0.   (5.65)
We have three cases depending on the sign of F − c.

1. F = c. Then β̂1 is just barely significant at level α. In this case (5.65) holds
if v ≥ (v*² − 1)/(2v*) and v* > 0, or if v ≤ (v*² − 1)/(2v*) and v* < 0. The
confidence set will be a semi-infinite interval.

2. F < c. Then β̂1 is insignificant at level α. In this case the quadratic has a
negative coefficient for v². The maximum occurs at v = Fv*/(F − c), and
the maximum value is c[c − F(v*² + 1)]/(F − c). If this maximum value is
positive, then the confidence set is the exterior of a bounded open interval.
If the maximum value is negative (equivalently, if F < c/(1 + v*²), which
can happen), the confidence set is (−∞, ∞), which is absurd.

3. F > c. Then β̂1 is significantly different from 0 at level α. The minimum
of the quadratic is always negative and the coefficient of v² is positive, so
the confidence set will be a closed and bounded interval.
In this example, even though the confidence set satisfies the conditions of being
UMAU, it would not be sensible to use it after observing data.
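The three cases can be explored numerically. The sketch below (function and variable names are mine; the critical value c = F^{-1}_{1,n−2}(1 − α) must be supplied by the caller, e.g. about 5.32 for n = 10 and α = 0.05) fits the regression, builds the quadratic (5.65), and reports which kind of set results; the boundary case F = c is not handled:

```python
import math

def fieller_set(x, y, c):
    """Sketch of Fieller's confidence set for the z solving beta0 + beta1*z = 0.
    `c` is the critical value F^{-1}_{1,n-2}(1 - alpha), supplied by the caller.
    Returns (kind, endpoints), kind in {"interval", "exterior", "whole line"}."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y)) / sxx
    w = sum((yi - ybar - b1 * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
    F = (n - 2) * b1 ** 2 * sxx / w                 # usual F statistic for beta1 = 0
    vstar = -ybar * math.sqrt(n) / (b1 * math.sqrt(sxx))
    # quadratic (5.65): (F - c) v^2 - 2 F vstar v + (F vstar^2 - c) <= 0
    A, B, C = F - c, -2.0 * F * vstar, F * vstar ** 2 - c
    disc = B ** 2 - 4.0 * A * C

    def z_of(v):                                    # invert v = sqrt(n)(z - xbar)/sqrt(sxx)
        return xbar + v * math.sqrt(sxx / n)

    if A > 0:                                       # case 3: closed, bounded interval
        r1, r2 = sorted([(-B - math.sqrt(disc)) / (2 * A), (-B + math.sqrt(disc)) / (2 * A)])
        return "interval", (z_of(r1), z_of(r2))
    if A < 0 and disc > 0:                          # case 2 with positive maximum
        r1, r2 = sorted([(-B - math.sqrt(disc)) / (2 * A), (-B + math.sqrt(disc)) / (2 * A)])
        return "exterior", (z_of(r1), z_of(r2))
    return "whole line", None                       # case 2 with negative maximum
```

For data tightly clustered around a clearly nonzero slope, the answer is a short interval around the naive estimate −β̂0/β̂1; for pure noise with a large critical value, it degenerates to the whole line.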

5.2.2 Prediction Sets


One attempt to do predictive inference in the classical setting is the construction of prediction sets.
Definition 5.66. Let V : S → V be a random quantity. Let η be the
collection of all subsets of V, and let R : X → η be a function. This
function is a coefficient γ prediction set for V if
{(x, v) : v ∈ R(x)} is measurable, and
Pθ(V ∈ R(X)) ≥ γ, for every θ ∈ Ω.
The prediction set R is exact if Pθ(V ∈ R(X)) = γ for all θ ∈ Ω. If
inf_{θ∈Ω} Pθ(V ∈ R(X)) > γ, the prediction set is called conservative.6
Example 5.67. Suppose that {Xn}_{n=1}^∞ are IID N(μ, σ²) given (M, Σ) = (μ, σ).
Suppose that we will observe X = (X1, ..., Xn) and that we are interested in
V = Σ_{i=n+1}^{n+m} Xi/m. Let X̄n = Σ_{i=1}^n Xi/n and Sn² = Σ_{i=1}^n (Xi − X̄n)²/(n − 1).
Since

V − X̄n ∼ N(0, σ²[1/n + 1/m])

and is independent of Sn² ∼ Γ([n − 1]/2, [n − 1]/[2σ²]), it follows that

(V − X̄n) / (Sn √(1/n + 1/m)) ∼ t_{n−1}(0, 1).

Define the set function R(X) to be the interval

[X̄n − T^{-1}_{n−1}(1 − α/2) Sn √(1/n + 1/m),  X̄n + T^{-1}_{n−1}(1 − α/2) Sn √(1/n + 1/m)].

It is easy to see now that Pθ(V ∈ R(X)) = 1 − α for all θ ∈ Ω, so R is an exact
coefficient 1 − α prediction set for V.
There are also one-sided prediction sets. For example,

R′(X) = [X̄n − T^{-1}_{n−1}(1 − α) Sn √(1/n + 1/m), ∞)

also satisfies Pθ(V ∈ R′(X)) = 1 − α for all θ ∈ Ω.
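A minimal sketch of the two-sided prediction interval of Example 5.67, with the t quantile passed in by hand (e.g., 2.262 ≈ T^{-1}_9(0.975)) so that only the standard library is needed; the helper names are mine, and the second function is a Monte Carlo check of exactness:

```python
import math
import random
from statistics import mean, stdev

def prediction_interval(xs, m, tq):
    """Two-sided prediction set R(X) of Example 5.67:
    Xbar_n +/- tq * S_n * sqrt(1/n + 1/m), where tq = T^{-1}_{n-1}(1 - alpha/2)
    is supplied by the caller to keep the sketch standard-library-only."""
    n = len(xs)
    xbar, s = mean(xs), stdev(xs)       # stdev divides by n - 1, matching S_n
    half = tq * s * math.sqrt(1.0 / n + 1.0 / m)
    return xbar - half, xbar + half

def estimate_coverage(n=10, m=5, tq=2.262, reps=5000, seed=0):
    """Monte Carlo check of exactness: with tq = 2.262 (the 0.975 quantile of
    t_9), the interval should cover the mean V of m future N(0,1) draws
    about 95% of the time."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        v = sum(rng.gauss(0.0, 1.0) for _ in range(m)) / m
        lo, hi = prediction_interval(xs, m, tq)
        hits += lo <= v <= hi
    return hits / reps
```

The coverage does not depend on (μ, σ), so simulating under (0, 1) is enough.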

*This section may be skipped without interrupting the flow of ideas.


6 Some authors require that Pθ(R(X) = ∅) = 0 for all θ before calling R a
prediction set. Also, some authors require that inf_{θ∈Ω} Pθ(V ∈ R(X)) = γ before
calling γ the coefficient.

Since tests and confidence sets correspond in a natural way, we might
expect prediction sets to correspond to predictive tests in a similar way.
Example 5.68 (Continuation of Example 5.67; see page 324). Suppose that we
set up the predictive hypothesis H : V ≤ v0 versus A : V > v0. To parallel the
relationship between parametric tests and confidence intervals, we should reject
H if v0 ∉ R′(X). What properties does this predictive test have? If we try to
generalize the type I error probability, we might try to calculate

Pθ(Reject H | V = v0) = Pθ(Reject H) = Pθ( v0 < X̄n − T^{-1}_{n−1}(1 − α) Sn √(1/n + 1/m) ),

where the first equality follows from the fact that V and X are conditionally
independent given Θ. This probability can be calculated using the noncentral t
distribution. For μ = v0, the probability is 1 − T_{n−1}(T^{-1}_{n−1}(1 − α)√(1 + n/m)) <
α. The test is easily seen to be a UMPU test (level strictly less than α) for
H : M ≤ v0, but it is not clear why it should be considered a test of V ≤ v0.

5.2.3 Tolerance Sets


A classical alternative to a prediction set is a tolerance set as introduced
by Wilks (1941).7
Definition 5.69. Let V : S → V be a random quantity. Let η be the
collection of all subsets of V, and let R : X → η be a function. This
function is a δ tolerance set with confidence coefficient γ for V if
{(x, v) : v ∈ R(x)} is measurable, and
Pθ(Pθ[V ∈ R(X)|X] ≥ δ) ≥ γ, for every θ ∈ Ω.

The number δ is called the tolerance coefficient. The tolerance set R is exact
if Pθ(Pθ[V ∈ R(X)|X] ≥ δ) = γ for all θ ∈ Ω. One might wish to require
that Pθ(R(X) = ∅) = 0 for all θ and/or inf_{θ∈Ω} Pθ(Pθ[V ∈ R(X)|X] ≥ δ) =
γ. If this last condition fails, the tolerance set would be called conservative.
Rather than making a single probability statement concerning the joint
distribution of the data X and the future observable V (as is done in a
prediction set), a tolerance set tries to separate the probability statements
about X and V. A conditional probability statement about V is made
given X (the tolerance coefficient) and then a statement is made about the
distribution of this conditional probability (the confidence coefficient).
Example 5.70 (Continuation of Example 5.67; see page 324). Here the data are
conditionally IID N(μ, σ²) given Θ = (μ, σ). Suppose that we want R(x) to have
the same form as the prediction set, namely Rd(X) = [X̄n − dSn, X̄n + dSn],
where d is yet to be determined. Now,

Pθ(V ∈ Rd(X)|X) = Φ( √m (X̄n + dSn − μ)/σ ) − Φ( √m (X̄n − dSn − μ)/σ ).   (5.71)

Call the right-hand side of (5.71) Hd(X, μ, σ). It is easy to see that the distribution
of Hd(X, μ, σ) given Θ = (μ, σ) is the same as the distribution of Hd(X, 0, 1) given
Θ = (0, 1). Let y_{d,γ} be the 1 − γ quantile of Hd(X, 0, 1). Since the distributions of
Hd(X, 0, 1) are stochastically increasing in d, it is clear that y_{d,γ} is an increasing
continuous function of d. Also, y_{0,γ} = 0 and lim_{d→∞} y_{d,γ} = 1. Let d be such that
y_{d,γ} = δ. With this choice of d, we have Pθ(Hd(X, μ, σ) ≥ δ) = γ. It follows that
Rd is a δ tolerance set with confidence coefficient γ for V.8

*This section may be skipped without interrupting the flow of ideas.
7 For more detail on tolerance sets, see Aitchison and Dunsmore (1975).
There are also one-sided tolerance sets. For example,

R′d(X) = [X̄n − dSn, ∞)   (5.72)

satisfies Pθ(Pθ[V ∈ R′d(X)|X] ≥ δ) = γ for all θ ∈ Ω if d is chosen so that δ is
the 1 − γ quantile of 1 − Φ(√m[X̄n − dSn]) given Θ = (0, 1).
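When no table of tolerance factors is at hand, the d of (5.72) can be found by simulation. The sketch below (helper names are mine) uses the equivalence implicit in the construction: the content 1 − Φ(√m(X̄n − dSn)) is at least δ exactly when X̄n − dSn ≤ Φ^{-1}(1 − δ)/√m, and d is tuned by bisection so that this event has probability γ under θ = (0, 1):

```python
import random
from statistics import NormalDist

def simulate_xbar_s(n, reps, rng):
    """Draw (Xbar_n, S_n) pairs under theta = (0, 1)."""
    out = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(xs) / n
        s = (sum((x - xbar) ** 2 for x in xs) / (n - 1)) ** 0.5
        out.append((xbar, s))
    return out

def tolerance_d(n, m, delta, gamma, reps=20_000, seed=42):
    """Monte Carlo bisection sketch for the d of (5.72): the content is
    >= delta iff Xbar_n - d*S_n <= Phi^{-1}(1 - delta)/sqrt(m), and d is
    chosen so that this happens with probability gamma under (0, 1)."""
    rng = random.Random(seed)
    target = NormalDist().inv_cdf(1.0 - delta) / m ** 0.5
    sims = simulate_xbar_s(n, reps, rng)
    lo, hi = 0.0, 20.0
    for _ in range(60):
        d = (lo + hi) / 2.0
        p = sum(xbar - d * s <= target for xbar, s in sims) / reps
        lo, hi = (lo, d) if p >= gamma else (d, hi)
    return (lo + hi) / 2.0

def attained_confidence(d, n, m, delta, reps=20_000, seed=7):
    """Fresh-sample check of P(content >= delta) for a given d."""
    rng = random.Random(seed)
    nd = NormalDist()
    hits = 0
    for xbar, s in simulate_xbar_s(n, reps, rng):
        content = 1.0 - nd.cdf(m ** 0.5 * (xbar - d * s))
        hits += content >= delta
    return hits / reps
```

For n = 20, m = 1, δ = 0.9, γ = 0.95 the returned factor is near the tabled one-sided normal tolerance factor of about 2.2.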

One might think that tolerance sets could be used to construct predictive
tests in the classical framework where prediction sets failed. In the sense
of (3.15), they can.
Example 5.73 (Continuation of Example 5.67; see page 324). Consider the hypothesis
H : V ≤ v0 and the tolerance set R′d in (5.72). (Recall that V is the
average of m future observations.) Here d is chosen so that δ is the 1 − γ quantile
of the distribution of 1 − Φ(√m[X̄n − dSn]) given Θ = (0, 1). We could reject H
if v0 ∉ R′d(X). We now calculate

Pθ(v0 < X̄n − dSn) = Pθ( (v0 − μ)/σ < (X̄n − μ)/σ − d Sn/σ )
                  = P_{(0,1)}( (v0 − μ)/σ < X̄n − dSn )
                  = P_{(0,1)}( 1 − Φ(√m[X̄n − dSn]) < 1 − Φ(√m (v0 − μ)/σ) ).

This, in turn, is less than or equal to 1 − γ if and only if

1 − Φ( √m (v0 − μ)/σ ) ≤ δ.   (5.74)

Note that the left-hand side of (5.74) is 1 − Pθ(V ≤ v0). So, we have replaced
the hypothesis H : V ≤ v0 by the hypothesis H′ : Pθ(V ≤ v0) ≥ 1 − δ. (For
δ = c/(1 + c), H′ turns out to be the same as the hypothesis constructed in
Example 4.13 on page 219.)

8Eberhardt, Mee, and Reeve (1989) give a program to compute the number d
in examples like this.

5.2.4 Bayesian Set Estimation


In a Bayesian framework, set estimation is a type of inverse to the problem
of computing posterior and/or predictive probabilities.9 That is, rather
than specifying a set and determining its posterior or predictive probability,
you specify a probability and determine a set that has that probability. The
problem is that there are usually many sets with the same probability. For
example, suppose that the predictive distribution of V given X = x is
continuous with CDF F_{V|X}(·|x). Suppose that you want an interval T such
that Pr(V ∈ T|X = x) = γ. One such interval is (−∞, F^{-1}_{V|X}(γ|x)], and
another is [F^{-1}_{V|X}(1 − γ|x), ∞), and there are many bounded intervals.
To choose between the many possible sets, it might make sense to have a
loss function and then choose the set with the smallest posterior expected
loss. This approach is discussed in Section 5.2.5. More commonly, one of
the following approaches is taken:
• If V has a density f_{V|X}(·|x), determine a number t such that T =
{v : f_{V|X}(v|x) ≥ t} satisfies Pr(V ∈ T|X = x) = γ. This choice is
called the highest posterior density region, or HPD region.
• For the case in which V is real-valued and one desires a bounded
interval, choose the endpoints of the interval to be the (1 − γ)/2 and
(1 + γ)/2 quantiles of the distribution of V.
HPD regions are sensitive to the dominating measure. That is, if the
conditional distribution of V given X = x is absolutely continuous with
respect to two different measures, the HPD regions constructed from the
two different densities might be different.
Example 5.75. Suppose that V given X = x has a N(x, 1) distribution. This distribution
is absolutely continuous with respect to Lebesgue measure with density
(2π)^{-1/2} exp(−(v − x)²/2). The corresponding HPD regions will be symmetric
intervals around x because the density is a decreasing function of |v − x|. The
N(x, 1) distribution is also absolutely continuous with respect to the N(O, 1) dis-
tribution. The density is exp(xv - x2/2), which is an increasing function of v
for x > 0 and is decreasing for x < O. The HPD region would be a semi-infinite
interval in either of these cases. If x = 0, then the dominating measure is the
same as the distribution of V and the density is the constant 1. In this case, every
set is an HPD region.
Even if there is no issue of which dominating measure to choose, HPD
regions can be strange if the density of V is multimodal. In particular, they
can be the union of several disconnected subsets. In such cases, one might
prefer just to choose a reasonable shape for the region T and choose the
particular region of that shape so that it is convenient to demonstrate that
Pr(V ∈ T|X = x) = γ.

9Sets with specified posterior probability are often called credible sets in the
statistical literature.

5.2.5 Decision Theoretic Set Estimation*


Just as we could choose point estimates to minimize a loss function, we
could also choose set estimates to minimize a loss function. The problem
is that there are many possible loss functions and one rarely suffers a loss
according to one of the tractable ones. Nevertheless, we will derive the
optimal rules for some simple loss functions.
For a simple situation, suppose that we will form a semi-infinite interval
of the form (−∞, a] for a one-dimensional parameter Θ with the loss
function being

L(θ, a) = c(a − θ) if a ≥ θ,
        = (1 − c)(θ − a) if a < θ.   (5.76)

If c < 1 − c, this loss penalizes overly long intervals that contain the parameter
less (per unit of length) than it penalizes short intervals that miss
the parameter. If the posterior distribution of Θ has density f_{Θ|X}(θ|x) with
respect to a measure λ, and the posterior mean of Θ is finite, then Theorem
5.33 says that the optimal a is any 1 − c quantile of the posterior
distribution of Θ.
For bounded intervals, the action space can be considered to be the set
of ordered pairs (a1, a2) in which a1 ≤ a2. Consider a loss like (5.76) that
penalizes excessive length above, below, and around Θ differently:

L(θ, (a1, a2)) = a2 − a1 + c1(a1 − θ)I_{(−∞,a1)}(θ) + c2(θ − a2)I_{(a2,∞)}(θ).   (5.77)

The optimal interval is the interval between two quantiles of the posterior
distribution.
Theorem 5.78. Suppose that the posterior mean of Θ is finite and the
loss is as in (5.77) with c1, c2 > 1. The formal Bayes rule is the interval
between the 1/c1 and 1 − 1/c2 quantiles of the posterior distribution of Θ.
PROOF. We can rewrite the loss in (5.77) as L1(θ, a1) + L2(θ, a2), where

L1(θ, a1) = (c1 − 1)(a1 − θ) if a1 > θ,
          = θ − a1 if a1 ≤ θ,

L2(θ, a2) = a2 − θ if a2 ≥ θ,
          = (c2 − 1)(θ − a2) if a2 < θ.

Since each of these loss functions depends on only one action, the posterior
means can be minimized separately. If we divide L1 by c1, then Theorem
5.33 says that the posterior mean of L1(Θ, a1)/c1 is minimized at a1
equal to the 1/c1 quantile of the posterior. Similarly, if we divide L2 by c2,
Theorem 5.33 says that the posterior mean of L2(Θ, a2)/c2 is minimized at
the (c2 − 1)/c2 quantile of the posterior. □

*This section may be skipped without interrupting the flow of ideas.
In the special case in which c1 = c2 = 2/α > 1, the optimal interval runs
from the α/2 quantile of the posterior to the 1 − α/2 quantile. This would
be the usual equal-tailed, two-sided posterior probability interval for Θ.
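When the posterior is represented by Monte Carlo draws, the rule of Theorem 5.78 reduces to two empirical quantiles; a minimal sketch (helper name is mine):

```python
def bayes_interval(posterior_draws, c1, c2):
    """Formal Bayes rule of Theorem 5.78 computed from posterior draws:
    the interval between the empirical 1/c1 and 1 - 1/c2 quantiles
    (c1, c2 > 1 as in the theorem)."""
    xs = sorted(posterior_draws)
    n = len(xs)

    def quantile(p):
        k = min(n - 1, max(0, int(p * n)))  # simple empirical quantile
        return xs[k]

    return quantile(1.0 / c1), quantile(1.0 - 1.0 / c2)
```

With c1 = c2 = 2/α this is exactly the equal-tailed interval described above.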
There are other loss functions that do not penalize differently for how
short the interval is when it misses the parameter. For example,

L_l(θ, [a1, a2]) = a2 − a1 + c(1 − I_{[a1,a2]}(θ)),

L_q(θ, [a1, a2]) = (a2 − a1)² + c(1 − I_{[a1,a2]}(θ)).

The existence of optimal rules in these cases actually requires weaker assumptions
than those already considered, because the posterior mean of
the loss is finite in all cases. That is, we need not assume that Θ has finite
posterior mean to find the formal Bayes rules. If the posterior distribution
of Θ has a continuous density with respect to Lebesgue measure, then calculus
can be used to minimize the posterior risk. For example, the following
result is easy to prove.
Proposition 5.79. Suppose that Ω ⊆ ℝ and that the action space is the
Borel σ-field T of subsets of Ω. Suppose that the posterior distribution of
Θ has a density f_{Θ|X} with respect to Lebesgue measure λ and that the loss
is L(θ, B) = λ(B) + c(1 − I_B(θ)). Then, the formal Bayes rule is an HPD
region of the form B(x) = {θ : f_{Θ|X}(θ|x) ≥ 1/c}.
For the case in which the density f_{Θ|X} is strongly unimodal (that is,
{θ : f_{Θ|X}(θ|x) > a} is an interval for all a), the formal Bayes rule for loss
L_l is an HPD region. In the strongly unimodal case, one can also find the
formal Bayes rule for loss L_q. (See Problem 40 on page 343.)

5.3 The Bootstrap*


5.3.1 The General Concept
There are many situations in which it is very difficult to work out analytically
some feature of the distribution of some statistic in which we are
interested. The idea of the bootstrap10 is to suppose that a CDF Fn calculated
from an observed sample X1, ..., Xn is sufficiently like the unknown
CDF F so that one can use a calculation performed using Fn as an estimate
of the calculation that we would like to perform using F. Two types of Fn
are commonly used. For the nonparametric bootstrap, Fn is the empirical
CDF of the data. For the parametric bootstrap, one assumes a parametric
model (with each Xi having CDF F_{Xi|Θ}(·|θ)) and Fn is F_{Xi|Θ}(·|θ̂n) for
some estimate θ̂n of Θ. To be more precise, we follow Efron (1979, 1982).

*This section may be skipped without interrupting the flow of ideas.
10 For a good overview of bootstrap methodology, see Young (1994).
Let X = (X1, ..., Xn) and let F be a space of CDFs in which we suppose
that F lies. Let R : X × F → ℝ be some function of interest. For example,
R might be the difference between the sample median of X1, ..., Xn and
the median of F:

R(X, F) = | (1/2)[X_(⌊(n+1)/2⌋) + X_(⌈(n+1)/2⌉)] − F^{-1}(1/2) |,

where F^{-1}(q) is understood to mean inf{x : F(x) ≥ q}, and X_(k) is the kth
order statistic. The bootstrap replaces R(X, F) by R(X*, Fn), where X*
is an IID sample of size n from Fn. If we are interested in the conditional
mean of R(X, F) given P = P, where P has CDF F, we try to calculate
the mean of R(X*, Fn). The success or failure of the bootstrap will depend
upon the extent to which Fn is "like" F for the purposes of calculating the
distribution of R.
Example 5.80. Suppose that X1, ..., Xn are conditionally IID U(0, θ) given
Θ = θ. Here, F is the U(0, θ) CDF. First, take R(X, F) = Σ_{i=1}^n Xi/n − ∫ x dF(x).
The mean of this quantity is 0. In fact, the mean of R(X, F) is 0 if F is any
distribution with finite mean and X is an IID sample from F. In particular, the
mean of R(X*, Fn) given Fn is 0. The parametric and nonparametric bootstraps
do just fine here.
Next, take R(X, F) = n(F^{-1}(1) − X_(n))/F^{-1}(1), where X_(n) is the largest
coordinate of X. The distribution of X_(n)/F^{-1}(1) is Beta(n, 1) with CDF t^n
for 0 ≤ t ≤ 1. So, the CDF of R(X, F) is 1 − (1 − t/n)^n. For large n, this
is approximately 1 − exp(−t). For example, if t = 0.1, then Pθ(R(X, F) ≥
0.1) ≈ exp(−0.1) = 0.905. On the other hand, for the nonparametric bootstrap,
R(X*, Fn) = n(X_(n) − X*_(n))/X_(n) and

Pθ(R(X*, Fn) = 0 | Fn) = 1 − (1 − 1/n)^n,

which is approximately 1 − exp(−1) = 0.632 for large n. The nonparametric
bootstrap will perform poorly here.11
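The 1 − (1 − 1/n)^n failure probability is easy to confirm by simulation. Since R(X*, Fn) = 0 exactly when the index of the sample maximum appears in the resample, it suffices to resample indices (sketch; helper name is mine):

```python
import random

def prob_resample_max_equals_max(n, reps=100_000, seed=3):
    """Monte Carlo check of the failure above: the chance that a bootstrap
    resample of size n repeats the sample maximum (so R(X*, F_n) = 0).
    Only indices are resampled; we ask whether index n-1 ever appears."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        hits += any(rng.randrange(n) == n - 1 for _ in range(n))
    return hits / reps
```

The estimate agrees with 1 − (1 − 1/n)^n, about 0.632 for large n.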
Why is the bootstrap good for the first half of Example 5.80 but not
for the second half? In the first half, we are only interested in the mean
of R(X, F), which is 0 no matter what F is. In the second half, virtually
everything about the distribution of R(X, F) depends very much on F.
Even the mean of R(X, F) is not the same for all F. For example, if F
has a density that drops to 0 at F^{-1}(1) like a power of F^{-1}(1) − x, then

11 The second part of this example was given by Bickel and Freedman (1981).
See Problem 41 on page 343 to see how the parametric bootstrap performs in
this example.

R(X, F) goes to ∞ with n. (See Theorem 7.32.) At the other extreme, if F
has positive mass at F^{-1}(1) (as does the empirical CDF), then R(X, F) has
positive probability of equaling 0.12 The success or failure of the bootstrap
in individual problems depends on the degree to which Fn approximates F
for the specific purpose of calculating R(X, F). For some Rs, Fn may be
a wonderful approximation, while for other Rs (even with the same data)
Fn is a miserable approximation. One needs to be careful, when using the
bootstrap, not to assume automatically that it will be suitable for one's
specific purpose without doing some checking first.
In the nonparametric bootstrap, the distribution of R(X*, Fn) will usually
be a combinatorial nightmare, and Efron (1979) suggests using simulations
to approximate features of its distribution. For example, if one is
interested in the probability that |R(X, F)| ≤ 2, one can generate many
samples X*,1, ..., X*,m from Fn and calculate the proportion of times that
|R(X*,j, Fn)| ≤ 2. Similarly, if one is interested in the mean of R(X, F),
one can generate many X*,j and calculate the average of R(X*,j, Fn). For
the parametric bootstrap, the distribution of R(X*, Fn) is generally of the
same form as the distribution of R(X, F). Once again, simulation may be
useful when this distribution is intractable.13
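Efron's simulation recipe can be sketched as follows (helper names are mine; the observed data vector stands in for Fn, and R is any function of the resample and the data):

```python
import random

def mean_diff(star, data):
    """R(X*, F_n) for the first R of Example 5.80: mean of the resample
    minus the mean of F_n (i.e., of the data)."""
    return sum(star) / len(star) - sum(data) / len(data)

def bootstrap_mean_of_R(data, R, n_boot=1000, seed=0):
    """Draw n_boot resamples X*^j from the empirical CDF and average
    R(X*^j, F_n); `data` stands in for F_n."""
    rng = random.Random(seed)
    n = len(data)
    total = 0.0
    for _ in range(n_boot):
        star = [data[rng.randrange(n)] for _ in range(n)]
        total += R(star, data)
    return total / n_boot
```

For mean_diff the bootstrap average should be near 0, in line with the first half of Example 5.80.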
Example 5.81. Suppose that we are interested in the mean of R(X, F) =
n(Y_{1/2} − F^{-1}(1/2))², where Y_{1/2} is the median of a sample of size n from a
distribution with CDF F. In Section 7.2 we will work out the theory for the
asymptotic distributions of sample quantiles conditional on the CDF P = P
when P has CDF F. But what if one does not know F? According to the assumptions
of the bootstrap procedure, if Fn is sufficiently like F, we could sample
n observations X* from the distribution Fn and calculate the sample median Y*_{1/2}.
We could then subtract the original observed sample median (equal to Fn^{-1}(1/2))
from Y*_{1/2}, square the result, and multiply by n. We could then repeat this many
times and calculate the average of the squared values. In the case of the nonparametric
bootstrap, if n is only moderate in size (a few hundred or less), exact
calculation of the bootstrap distribution of R(X*, Fn) is possible using simple
combinatorial arguments. In particular, if X_(k) denotes the kth order statistic of
the original sample, then for odd n,

Pr(Y*_{1/2} = X_(k)) = (1/n^n) Σ_{i=0}^{(n−1)/2} Σ_{q=(n+1)/2−i}^{n−i} [ n!/(i! q! (n−i−q)!) ] (k − 1)^i (n − k)^{n−i−q},   (5.82)

12 Bickel and Freedman (1981) claim that even if one were to smooth the bootstrap
by sampling from a continuous approximation to the empirical CDF, there
would still be a problem in the second half of Example 5.80. Singh (1981) and
Bickel and Freedman (1981) prove some large sample properties of the nonparametric
bootstrap as it pertains to estimating central (not extreme) quantiles of
a distribution.
13See Section B.7 for some ideas on how to simulate in general.

TABLE 5.83. Summary of Bootstrap Results for Median

Data Type  Sample Size  ER(X, F)  Bootstrap Average  RMS Error
Laplace    51           1.242     1.578              0.949
Laplace    101          1.162     1.423              0.637
Normal     51           1.551     1.898              1.045
Normal     101          1.561     1.702              0.924
Uniform    51           0.241     0.272              0.156
Uniform    101          0.245     0.272              0.115

for k = 2, ..., n − 1. For k = 1 and k = n, we have

Pr(Y*_{1/2} = X_(k)) = (1/n^n) Σ_{q=(n+1)/2}^{n} [ n!/(q! (n−q)!) ] (n − 1)^{n−q}.
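Formula (5.82) and its k = 1, n companion can be implemented directly with integer arithmetic. Since the events {Y*_{1/2} = X_(k)} partition the resamples (for distinct data values), the probabilities must sum to 1, which gives a convenient check (sketch; helper name is mine):

```python
from math import factorial

def bootstrap_median_pmf(n):
    """Exact nonparametric-bootstrap distribution of the sample median for
    odd n with distinct data values, following (5.82):
    pmf[k-1] = Pr(Y*_{1/2} = X_(k))."""
    assert n % 2 == 1
    pmf = []
    for k in range(1, n + 1):
        total = 0
        for i in range((n - 1) // 2 + 1):                 # resampled values below X_(k)
            for q in range((n + 1) // 2 - i, n - i + 1):  # values equal to X_(k)
                j = n - i - q                             # values above X_(k)
                coef = factorial(n) // (factorial(i) * factorial(q) * factorial(j))
                total += coef * (k - 1) ** i * (n - k) ** j
        pmf.append(total / n ** n)
    return pmf
```

The k = 1, n case needs no special handling here: when k = 1 only the i = 0 terms survive, reproducing the displayed binomial sum.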

We simulated 100 data sets of size n = 51 and another 100 data sets of size n =
101 from the Lap(0, 1) distribution. Then we repeated the exercise with data from
N(0, 1) and U(0, 1) distributions. We approximated the true mean of R(X, F) for
each of the cases using a simulation of 100,000 data sets (except for the uniform
case, in which the true value is just n times the variance of the Beta([n + 1]/2, [n +
1]/2) distribution). The results are summarized in Table 5.83. The last column of
Table 5.83 gives the square root of the average squared difference between the true
value of ER(X, F) (third column) and the 100 simulated averages of R(X*, Fn).
The fourth column gives the average of those 100 simulated averages. It appears
that the bootstrap estimate (fourth column) is slightly high in all cases, but less
so for larger sample sizes. The root mean squared error (last column) is very large
compared to the true value, indicating that the bootstrap estimate of ER(X, F)
based on a single data set may not be particularly useful.

A Bayesian, faced with a difficult analytical problem, might also wish
to resort to some form of computational procedure to replace the analysis.
Rubin (1981) introduced a Bayesian bootstrap. This can be described as follows.
First, simulate a CDF F with Dirichlet process distribution Dir(Fn)
(see Section 1.6.1). Second, simulate an IID sample from the CDF F and
compute the observed value of whatever function is of interest. Repeat this
pair of simulations as many times as desired to obtain a sample of values
for the function of interest. This procedure [as Rubin (1981) noted] suffers
from a flaw it has in common with the nonparametric bootstrap. The flaw
is that the only data values ever simulated are the same ones that were
originally observed. One never simulates from a distribution with support
larger than the observed sample. Unless the sample is incredibly large, this
can make quite unrealistic the assumption that the CDF from which the
bootstrap samples are drawn is like the one from which the original data
were generated. Who would ever argue that the observed values were the
only ones that could have been observed, unless the distribution has known
finite support? It might make sense instead to use one of the tailfree process
priors from Section 1.6.2 concentrated on continuous distributions.
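Rubin's scheme can be sketched with Dirichlet_n(1, ..., 1) weights over the observed points, generated as normalized standard exponentials (helper names are mine; any statistic of the weighted empirical distribution can be plugged in):

```python
import random

def weighted_mean(data, weights):
    """Example statistic of the weighted empirical distribution."""
    return sum(w * x for w, x in zip(weights, data))

def bayesian_bootstrap_replicates(data, stat, n_rep=1000, seed=0):
    """Rubin (1981) sketch: draw Dirichlet_n(1, ..., 1) weights over the
    observed points (via normalized standard exponentials) and evaluate
    a weighted statistic for each weight vector."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_rep):
        g = [rng.expovariate(1.0) for _ in data]
        s = sum(g)
        reps.append(stat(data, [gi / s for gi in g]))
    return reps
```

Note that every replicate is a function of the observed values only, which is exactly the support limitation criticized above.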

Example 5.84 (Continuation of Example 5.81; see page 331). To be as nonparametric
as possible, suppose that we model the data {Xn}_{n=1}^∞ as IID with
distribution P conditional on P = P, and we give P a tailfree process prior (see
Section 1.6.2). We now observe X1 = x1, ..., Xn = xn. Suppose that we are
interested in the mean of R(X*, F) = n(Y*_{1/2} − F^{-1}(1/2))², where X* is a future
sample of size n from P, F is the CDF of P, and Y*_{1/2} is the median of the future
sample. We could simulate a collection of distributions P1, ..., Pm from
the posterior distribution of P. For each Pj (having CDF Fj), we could simulate
an IID sample X*j of size n and find the sample median Y*j_{1/2}. We also would need
to find the median of Pj (call it Fj^{-1}(1/2)) and then average n(Y*j_{1/2} − Fj^{-1}(1/2))².
As an example, we might use the Bayesian bootstrap, which corresponds to a
tailfree prior with improper prior distribution as well as to a Dirichlet process
with improper prior. To simulate F, we need only simulate a vector T with
Dir_n(1, ..., 1) distribution, and let F(x) be the sum of the Ti for those i such that
x_(i) ≤ x. Then, the exact distribution of the median of a sample of size n drawn
from the distribution F can be computed as in (5.82). For instance, for k =
2, ..., n − 1 and odd n, we have

Pr(Y*_{1/2} = x_(k) | T) = Σ_{i=0}^{(n−1)/2} Σ_{q=(n+1)/2−i}^{n−i} [ n!/(i! q! (n−i−q)!) ] T_{<k}^i T_k^q T_{>k}^{n−i−q},   (5.85)

where T_{<k} = Σ_{i=1}^{k−1} Ti and T_{>k} = Σ_{i=k+1}^{n} Ti. Using the same data sets as in
Example 5.81 on page 331, we obtained the results summarized in Table 5.86.
The Bayesian bootstrap seems to estimate ER(X, F) to be even higher than does
the bootstrap. This seems natural due to the additional variance introduced in
the Bayesian bootstrap.

The particular choice of R(X, F) in Example 5.84 was chosen to match
the choice from Example 5.81. From a Bayesian viewpoint, however, a more
interesting use of the bootstrap technology might be to try to predict the
median of a future sample. In this case, we could use the bootstrap (or
Bayesian bootstrap) distribution of the median of the sample X* as a predictive
distribution for the median of a future sample. In fact, the Bayesian
bootstrap distribution of the median of the sample X* is precisely the predictive
distribution of the median of a future sample if F is modeled as
a Dirichlet process with improper prior. Similarly, following the bootstrap
logic, if Fn is sufficiently like F, then the median of a sample drawn from

TABLE 5.86. Summary of Bayesian Bootstrap Results for Median


Data Type Sample Size ER(X, F) Bootstrap Average RMS Error
Laplace 51 1.242 1.988 1.151
Laplace 101 1.162 1.638 0.735
Normal 51 1.551 2.183 1.117
Normal 101 1.561 1.916 0.864
Uniform 51 0.241 0.308 0.153
Uniform 101 0.245 0.306 0.121
FIGURE 5.87. Bootstrap and Bayesian Bootstrap Distributions of Sample Median

Fn should have a distribution like that of the median of a sample drawn
from F.
Example 5.88 (Continuation of Example 5.84; see page 333). We can use the
bootstrap distribution of the median in (5.82) or the mean of the conditional
Bayesian bootstrap distribution given T in (5.85) as predictive distributions for
the medians of future samples. The mean of (5.85) is easy to compute in closed
form since (T_{<k}, T_k, T_{>k}) has Dir_3(k − 1, 1, n − k) distribution. To see how
closely these distributions approximate the distribution of the median of a future
sample, we note that (for odd n) if data X come from a distribution F, then
the median has the same distribution as F^{−1}(U), where U ∼ Beta([n+1]/2, [n+1]/2).
Hence, we need only simulate data from the U(0, 1) distribution in order to
compare the two bootstrap distributions to the true distribution of the sample
median. We will do this by calculating (for k = 1, ..., n) the probability that the
median of a future sample lies below X_(k) and comparing this to (5.82) and the
mean of (5.85). Figure 5.87 shows plots of the bootstrap and Bayesian bootstrap
distributions of a future median against the true distribution for sample sizes
of 51 and 101. The bootstrap consistently assigns too little probability to both
the upper and lower tails of the distribution. The Bayesian bootstrap, however,
reproduces the true distribution (the diagonal line in Figure 5.87) remarkably well
due to the additional variance it adds to the predictive distribution compared to
the bootstrap.

One might be tempted to use tailfree priors or the Bayesian bootstrap
in the second part of Example 5.80 to try to overcome the problems the
nonparametric bootstrap had.¹⁴ It is clear that the Bayesian bootstrap
will fare no better than the nonparametric bootstrap for nearly the same
reason. Trying to use a tailfree process (like a Polya tree distribution)

¹⁴This example is examined in detail by Schervish (1994).

5.3. The Bootstrap 335

quickly leads to the realization that the problem as described cannot be
solved satisfactorily without further modeling assumptions. For example,
the typical P with Polya tree distribution on an interval [0, θ] will, with
high posterior probability, have a density very close to zero on an interval
[a, θ] if all of the observed data are less than or equal to a. The distribution
of n(θ − X_(n))/θ is likely to be concentrated on very large values because
the probability that X_(n) ≤ a will be quite high. This suggests that one
might wish to restrict attention to distributions whose densities stay above
a certain level near θ. Alternatively, one might wish to replace F^{−1}(1) with
F^{−1}(1 − ε) for some small ε. There are any number of possible alternative
formulations of the problem. One should give serious consideration to what
one really wants to know before choosing a procedure that may not solve
the problem of interest.

5.3.2 Standard Deviations and Bias


The bootstrap was originally [see Efron (1979)] designed as a tool for esti-
mating the bias and standard error of a statistic.
Example 5.89. Suppose that conditional on P = P, X_1, ..., X_n are condition-
ally IID with CDF F, where F is the CDF of the distribution P. We assume only
that ∫ x² dF(x) < ∞. Let

    R(X, F) = ( (1/n) Σ_{i=1}^{n} X_i )² − ( ∫ x dF(x) )²,

and suppose that we are interested in the mean of R. This is the bias of the
square of the sample average as an estimator of the square of the mean. If the
observed sample average is X̄_n and we use s_n² = Σ_{i=1}^{n} (X_i − X̄_n)²/n as an estimate
of variance, then

    R(X*, F̂_n) = ( (1/n) Σ_{i=1}^{n} X_i* )² − (X̄_n)².

The mean of this, given the data, is s_n²/n. The mean of R(X, F) given P = P
is σ²/n, where σ² = ∫ (x − μ)² dF(x). Since s_n² is supposed to be close to σ² for
large n, the bootstrap is thought to behave well in this case. If, instead, we had
used

    R'(X, F) = [ ( (1/n) Σ_{i=1}^{n} X_i )² − ( ∫ x dF(x) )² ] / ∫ (x − μ)² dF(x),

then the mean of R'(X*, F̂_n) would have been 1/n, which is exactly the mean of
R'(X, F).
    How would one use the fact that R'(X, F) has mean 1/n to "correct" X̄_n² as
an estimator of (∫ x dF(x))²? Presumably, one would subtract s_n²/n. Although
this would usually do well, if s_n²/n happens to be larger than X̄_n², one would get
a very silly result.
336 Chapter 5. Estimation

Similarly, if one were interested in the standard deviation of X̄_n², one could
assume that ∫ x⁴ dF(x) < ∞, and apply the bootstrap to

    R(X, F) = [ ( (1/n) Σ_{i=1}^{n} X_i )² − E_F( ( (1/n) Σ_{i=1}^{n} X_i )² ) ]².     (5.90)

The bootstrap estimate of standard deviation would be the square root of the
average of the values of R(X*, F̂_n). Unfortunately, the term after the minus sign
in (5.90) is not easy to evaluate even with F = F̂_n. An obvious alternative is to
use, in its place, the average of the values of ( Σ_{i=1}^{n} X_i*/n )².
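In practice both the bias and the standard deviation are approximated by Monte Carlo averages over resamples. A minimal sketch (the data set, resample count b, and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, size=100)   # arbitrary illustrative data
n, b = len(x), 2000

xbar2 = x.mean() ** 2
# Bootstrap replicates of (sample mean)^2, drawn under F_n
boot = np.array([rng.choice(x, size=n).mean() ** 2 for _ in range(b)])

bias_est = boot.mean() - xbar2      # Monte Carlo version of s_n^2 / n
sd_est = boot.std()                 # the "obvious alternative" SD estimate
print(bias_est, sd_est)
```

Given the data, the exact mean of the replicates is X̄_n² + s_n²/n, so `bias_est` approximates s_n²/n, and `sd_est` uses the average of the (Σ X_i*/n)² values in place of the awkward expectation in (5.90).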

The bootstrap can be applied to all types of statistics whose means
and/or variances are difficult to calculate analytically. Efron and Tibshirani
(1993) give many examples of such statistics, like correlations, regression
coefficients, and nonlinear functions of such things, whose sampling distri-
butions are nontrivial but whose bootstrap distributions are very straight-
forward. (See some of the problems at the end of this chapter.)

5.3.3 Bootstrap Confidence Intervals


Suppose that we desire a confidence interval for some function g of a
parameter Θ. From the bootstrap perspective, it is preferable to write
g(Θ) as h(F). We might then desire a confidence interval of the form
(−∞, h(F̂_n) + Y] or [h(F̂_n) − Y_1, h(F̂_n) + Y_2]. The problem is to find Y, or
Y_1 and Y_2. For the one-sided case, in order for the interval to be a coef-
ficient γ confidence interval, it must be that P_θ(h(F̂_n) + Y ≥ h(F)) = γ.
Equivalently, we need P_θ(h(F) − h(F̂_n) ≤ Y) = γ. There might be avail-
able a formula σ²(F) for the approximate variance of h(F̂_n). For example,
if h(F) = ∫ x dF(x), then σ²(F) = ∫ [x − h(F)]² dF(x)/n. In this case,
one might replace Y by σ(F)Y' or by σ(F̂_n)Y'. Suppose that we do the
latter.¹⁵ Then we want Y' to satisfy

    P_θ( [h(F) − h(F̂_n)] / σ(F̂_n) ≤ Y' ) = γ.     (5.91)

This makes Y' equal to the γ quantile of R(X, F), where R is the func-
tion on the left of the inequality in (5.91). What is commonly called the
percentile-t bootstrap confidence interval for h(F) would be

    ( −∞, h(F̂_n) + σ(F̂_n) Ŷ' ],     (5.92)

¹⁵The reason for switching from Y to σ(F̂_n)Y' is that one would suspect that
Y' depends less on the underlying distribution than does Y.

where Ŷ' is an estimate of Y' obtained by simulation. One could simulate
bootstrap samples X*_1, ..., X*_b, each with distribution F̂_n, and calculate, for
i = 1, ..., b,

    R(X*_i, F̂_n) = [ h(F̂_n) − h(F̂*_i) ] / σ(F̂*_i),

where F̂*_i is the empirical CDF of the bootstrap sample X*_i. The sample
γ quantile of the R(X*_i, F̂_n) values could then serve for Ŷ' in (5.92). Hall
(1992) examines bootstrap confidence intervals in detail and finds asymp-
totic expressions for their actual coverage probabilities.
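As a concrete sketch of this recipe, take h(F) to be the mean and approximate Y' by simulation. The data set, resample count, and seed below are arbitrary; note that the 1/√n factors in σ(F̂_n) and σ(F̂*_i) cancel, so plain standard deviations can be used.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(loc=1.0, size=50)   # arbitrary illustrative data
n, b, gamma = len(x), 10_000, 0.95

xbar, shat = x.mean(), x.std()      # h(F_n) and sigma(F_n), up to a common 1/sqrt(n)
r = np.empty(b)
for i in range(b):
    xs = rng.choice(x, size=n)      # bootstrap sample with distribution F_n
    r[i] = (xbar - xs.mean()) / xs.std()   # [h(F_n) - h(F*_i)] / sigma(F*_i)

y_hat = np.quantile(r, gamma)       # sample gamma quantile: the estimate of Y'
upper = xbar + shat * y_hat         # one-sided interval (-inf, upper]
print(upper)
```

The same replicates, with the 1 − γ quantile, give the one-sided lower bound.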
Example 5.93. Consider the same data used in Example 1.132 on page 71. The
data were a sample of size n = 50 from a Laplace distribution Lap(1, 1). We are
interested in h(F) = ∫ x dF(x), the mean of F. We simulated 10,000 bootstrap
samples, X*_1, ..., X*_{10000}, and for each one, we calculated R(X*_i, F̂_50), where F̂_50
is the empirical CDF of the 50 observations. The curve labeled "Bootstrap" in
Figure 5.94 shows the 10,000 values of X̄_50 + R(X*_i, F̂_50)σ̂_50, where σ̂_50 = σ(F̂_50),
the MLE of the standard deviation of X̄_50. As an example of confidence intervals,
one-sided 95% lower and upper bound confidence intervals for h(F) based on the
bootstrap are [0.5735, ∞) and (−∞, 1.3199], respectively.
    Using the same prior distribution as in Example 1.132, we also used a tailfree
process of Polya tree type to simulate 10,000 values of the mean of the distri-
bution P. The empirical CDF of these values is plotted as the curve labeled
"Tailfree" in Figure 5.94. Not surprisingly, we see that the extreme quantiles of
the tailfree sample are farther from the sample mean than those of the bootstrap
sample. This is due to the fact that the bootstrap procedure ignores the uncer-
tainty from not knowing F when calculating R values. That is, we must pretend

FIGURE 5.94. Distributions of Sampled Bootstrap and Tailfree Quantiles



that F = F̂_50 when calculating the R values. The Bayesian solution takes into ac-
count the additional uncertainty from not knowing F. The 0.95 upper and lower
bound posterior probability intervals corresponding to the confidence intervals
calculated above are [0.2574, ∞) and (−∞, 1.7628], respectively.

Hall (1992) shows that the percentile-t confidence interval has good fre-
quency properties in terms of the conditional probability that the interval
covers h(F) given F. With a single sample, as in Example 5.93, frequency
properties are neither apparent nor relevant. Suppose, however, that one
were to use bootstrap confidence intervals in many applications (with many
different Fs). From the classical perspective, one might actually be inter-
ested in the proportion of times that the interval covers h( F) and how this
compares to the nominal confidence coefficient.
Example 5.95. Suppose that we will sample data from several different Lap(μ, σ)
distributions on several different occasions, but we do not model the data this way.
Rather, suppose that we use the bootstrap and a tailfree prior of Polya tree type
as in Example 5.93 on page 337. As an example, 1000 data sets of size 50 each
were simulated with many different Laplace distributions. The values of σ were
generated as Γ^{−1}(1, 1) random variables, and the locations were σ times N(0, 1)
random variables. Location and scale changes do not affect the calculation of R
in the bootstrap, but they do affect the variance of X̄ and S. For each data set,
1000 bootstrap samples were formed and 1000 observations from the posterior
distribution of the mean ∫ x dP(x) were simulated. We counted how many times μ was below
each of the 1000 sample quantiles of the simulated values. Figure 5.96 shows
these proportions for both the bootstrap and Polya tree samples. As expected,
the bootstrap proportions match the nominal significance levels well.
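A scaled-down version of such a frequency check can be sketched as follows, using one-sided 95% percentile-t intervals for the mean. The replication counts are reduced from the text's 1000 for speed, and the constants are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(4)
reps, n, b, gamma = 200, 50, 200, 0.95
cover = 0
for _ in range(reps):
    sigma = 1.0 / rng.gamma(1.0)                 # sigma ~ Gamma^{-1}(1, 1)
    mu = sigma * rng.normal()                    # location is sigma times N(0, 1)
    x = rng.laplace(loc=mu, scale=sigma, size=n)
    xbar, shat = x.mean(), x.std()
    r = np.empty(b)
    for i in range(b):
        xs = rng.choice(x, size=n)
        r[i] = (xbar - xs.mean()) / xs.std()
    upper = xbar + shat * np.quantile(r, gamma)
    cover += (mu <= upper)
print(cover / reps)                              # should be near the nominal 0.95
```

Repeating this for every quantile level, rather than only γ = 0.95, reproduces curves like those in Figure 5.96.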

FIGURE 5.96. Empirical Coverage Probabilities for 1000 Samples


5.4. Problems 339

One final note is in order concerning the bootstrap. All you ever learn
about by using the bootstrap, without further modeling assumptions, are
properties of F̂_n. Unless you have a way of saying how much and/or in
what ways knowledge of F̂_n can be transformed into knowledge of P, the
bootstrap can only tell you about F̂_n, not about P.

5.4 Problems
Section 5.1.1:

1. Prove Proposition 5.8 on page 298.

2. Let X_1, X_2, X_3 be IID given Θ = θ with Exp(θ) distribution (mean = 1/θ).
Find the UMVUE of g(θ) = 1 − exp(−xθ). (Hint: Use the Rao-Blackwell
theorem 3.22.)

3. Suppose that X_1, ..., X_n are IID given Θ = θ with Exp(1/θ) distribution.
Find the UMVUE of Θ and show that it is inadmissible with squared-error
loss.

4. Suppose that X ∼ N(θ, 1) given Θ = θ.
   (a) Find the UMVUE of Θ². What is wrong with this estimator?
   (b) Suppose that we have a decision problem with loss function L(θ, a) =
   (θ² − a)². Find the generalized Bayes rule with respect to Lebesgue
   measure. Show that this estimator is inadmissible.

5. Suppose that X_1, ..., X_n are conditionally IID with Ber(θ) distribution
given Θ = θ. Find the UMVUE of Θ(1 − Θ).

6. Suppose that X_1, ..., X_20 are conditionally IID given Θ = θ with N(θ, 1)
distribution. We first collect X_1, ..., X_10 and compute Y = Σ_{i=1}^{10} X_i/10.
If Y < 5, we set Z = Y, N = 10 and stop sampling. If Y ≥ 5, we collect the
other 10 observations and compute Z = Σ_{i=1}^{20} X_i/20. We then set N = 20.
The data we report are (N, Y, Z).
   (a) Prove that Y is an unbiased estimator of Θ.
   (b) Prove that Z is biased and find its bias.
   (c) Show that N is not ancillary.
7. Let V_i = (X_i, Y_i), i = 1, ..., n, be pairs of random variables, and let Θ =
(Λ, M). Suppose that

    f_{V_i|Θ}(x, y|λ, μ) = λμ exp(−λx − μy),  for x > 0, y > 0.

All we get to observe, for i = 1, ..., n, are Z_i = min(X_i, Y_i) and

    U_i = { 1  if Z_i = X_i,
            0  if Z_i = Y_i.

   (a) Prove that Z_i is conditionally independent of U_i given Θ.

   (b) Find a complete sufficient statistic.
   (c) Find a UMVUE of Λ.

8. Return to the situation of Problem 16 on page 140. Find the UMVUE of Θ.

9. Return to the situation of Problem 9 on page 138.
   (a) Find the conditional UMVUE of Θ given X_1.
   (b) Find an estimator with smaller unconditional variance than the con-
   ditional UMVUE.
   (c) Show that (N, M) is not a complete sufficient statistic.

10. Let X_1, ..., X_n be conditionally IID with N(θ, 1) distribution given Θ = θ.
Determine the UMVUE of g(θ) = Pr(|X_i| ≤ c|θ), where c > 0 is fixed.

11. Suppose that X_1, ..., X_n are conditionally IID given Θ = θ with condi-
tional density

    f_{X|Θ}(x|θ) = (α θ^α / x^{α+1}) I_{[θ,∞)}(x),

where α is known and the parameter space is Ω = (0, ∞).
   (a) Find a complete sufficient statistic.
   (b) Find a UMVUE of Θ.
   (c) Prove that the UMVUE is inadmissible if the loss function is squared
   error.

12. Let Ω be the set of all integers, and suppose that X given Θ = θ has
discrete uniform distribution on the set {θ − 1, θ, θ + 1}. Let g : Ω → ℝ be
a nonconstant function. Show that there is no UMVUE of g(Θ).

13. An alternative mode of estimation is the method of moments. Let X =
(X_1, ..., X_n) with the X_i being IID given Θ. Suppose that μ_k = E_θ(X_i^k)
is finite for k = 1, ..., m. Also suppose that there is a function h such
that g(θ) = h(μ_1, ..., μ_m). Let Y_k = Σ_{i=1}^{n} X_i^k/n. Then h(Y_1, ..., Y_m) is a
method of moments estimator of g(Θ). Find method of moments estimators
for each of the following situations:
   (a) X_i ∼ Exp(θ) given Θ = θ, g(θ) = θ;
   (b) X_i ∼ N(μ, σ²) given Θ = (μ, σ), g(θ) = σ²;
   (c) X_i ∼ Ber(θ) given Θ = θ, g(θ) = θ(1 − θ).

Section 5.1.2:

14. Suppose that P_θ says that X ∼ Poi(θ) and that we are trying to estimate
exp(−3θ).
   (a) Find the Cramér-Rao lower bound for unbiased estimators.
   (b) If δ(X) = (−2)^X, find Var_θ(δ(X)).
   (c) Find both the Cramér-Rao lower bound and the Chapman-Robbins
   lower bound for the variance of unbiased estimators of Θ. Which is
   larger?

15. Suppose that X_1, ..., X_n are conditionally IID Poi(θ) given Θ = θ.
   (a) Let r be a known integer. Find the UMVUE of exp(−Θ)Θ^r.
   (b) Let n = 1 in part (a). Find the variance of the estimator and the
   Cramér-Rao lower bound.

16. Suppose that Y has Exp(1) distribution and is independent of Θ. Suppose
that X_1, ..., X_n are IID conditional on Y = y and Θ = θ with N(θ, 1/y)
distribution. We get to observe Y and X_1, ..., X_n.
   (a) Find the Cramér-Rao lower bound for the variance of unbiased esti-
   mators of Θ.
   (b) Show that no unbiased estimator achieves the Cramér-Rao lower
   bound.
   (c) Explain why the Cramér-Rao lower bound should not be taken seri-
   ously in the problem described here.

17. For the location family of t_α distributions in Example 5.17 on page 303,
prove that the Bhattacharyya lower bound with k = 2 is the same as the
Cramér-Rao lower bound.

18. Let X ∼ U(0, θ) given Θ = θ.
   (a) Find the Chapman-Robbins lower bound on the variance of an unbi-
   ased estimator of Θ.
   (b) Find an unbiased estimator and find by how much its variance exceeds
   the lower bound.

19. In Example 5.19 on page 304, prove that min X_i − 1/n is the UMVUE.

20. Refer to the situation in Example 2.83 on page 112.
   (a) Prove that Cov_θ(ψ_1, ψ_2) = 0.
   (b) Prove that Var_θ φ(X) = 2σ⁴ + 4μ²σ².

Section 5.1.3:

21. Consider the situation in Problem 6 on page 339. Find the MLE of Θ.

22. Find the maximum likelihood estimator of Θ if X_1, ..., X_n are condition-
ally IID with distribution Exp(θ) given Θ = θ.

23. Suppose that X ∼ Cau(θ, 1) given Θ = θ. Find the MLE of Θ.

24. Suppose that X_1, ..., X_n are IID with Laplace distribution Lap(μ, σ) given
Θ = (μ, σ).
   (a) Prove that the value of μ that minimizes Σ_{i=1}^{n} |x_i − μ| is the median
   of the numbers x_1, ..., x_n.
   (b) Find the MLE of Θ.

25. Let X have a one-parameter exponential family distribution given Θ. Sup-
pose that the MLE of Θ is interior to the parameter space. Prove that the
MLE equals a method of moments estimator. (See Problem 13 on page 340.)

26. Consider the situation in Example 5.30 on page 308. Let squared error be
the loss, that is, L(θ, a) = (μ² − a)². Show that the UMVUE dominates
the MLE if n ≥ 2. Find a formula for the difference in the risk functions.
Also, find an estimator that dominates both the MLE and the UMVUE.

Section 5.1.4:

27. Consider the situation in Problem 6 on page 339. Find the likelihood func-
tion and show that a Bayesian would take the data at face value (that is, a
Bayesian would calculate the same posterior as if the sample size had been
fixed in advance at whatever value N turns out to be, no matter what the
prior is).

28. Consider the situation in Problem 7 on page 339. If Λ and M are indepen-
dent a priori with Λ ∼ Γ(a, b) and M ∼ Γ(c, d), find the posterior mean of
Λ given the data.

29. In Example 5.10 on page 299, find the posterior mean of Θ given X = x
for all x, assuming that the prior for Θ is Beta(α₀, β₀).

30. Return to the situation of Problem 16 on page 140. If Θ has a prior density
f_Θ(θ) = a c^a I_{(c,∞)}(θ)/θ^{a+1}, find the posterior mean of Θ.

31. *Suppose that, conditional on N, {X_i}_{i=1}^∞ are independent with the first
N of them having Ber(1/3) distribution and the rest having Ber(2/3)
distribution. The prior for N is f_N(n) = 2^{−n} for n = 1, 2, ....
   (a) Find the posterior distribution of N given a finite sample X_1, ..., X_n,
   for known n.
   (b) If X_1 = 0, ..., X_n = 0 is the observed finite sample, find the posterior
   mean of N.

Section 5.1.5:

32. Let P₀ be the set of distributions on (ℝ, B¹) with finite variance. Let T(P)
be the standard deviation of the distribution P. Show that IF(x; T, P) =
(x − μ)²/(2σ) − σ/2, where μ is the mean of P.

33. Let P₀ be the class of distributions on (ℝ, B¹) with bounded support, and
let T(P) be the supremum of the support. Prove that the influence function
for T is IF(x; T, P) = 0 if x ≤ T(P) and ∞ if x > T(P).

34. Find the influence function for the 100α% trimmed mean at a continuous
distribution P.

Section 5.2:

35. Prove Proposition 5.48 on page 316.


36. Let the parameter space be Ω with σ-field of subsets τ. Let X be a random
quantity taking values in a set X, and let X have conditional density f_{X|Θ}
given Θ. Let v : X → [0, ∞) be a measurable function such that the set

    L(x) = {θ ∈ Ω : f_{X|Θ}(x|θ) ≥ v(x)}

is in τ for all x. Let Θ have a prior distribution μ_Θ, and let μ_X denote the
prior predictive distribution of X. Let C : X → τ be another set function
such that μ_Θ(C(x)) ≤ μ_Θ(L(x)), a.s. [μ_X]. Prove that Pr(Θ ∈ C(x)|X =
x) ≤ Pr(Θ ∈ L(x)|X = x), a.s. [μ_X].

37. Prove Proposition 5.56 on page 319.

38. Prove Proposition 5.61 on page 321.

39. Prove Proposition 5.79 on page 329.

40. Suppose that Ω = ℝ and that the posterior density of Θ given X is strongly
unimodal. Let the action space be the set of all closed and bounded intervals
[a_1, a_2] in ℝ.
   (a) Let the loss function be L_1(θ, [a_1, a_2]) = a_2 − a_1 + c(1 − I_{[a_1,a_2]}(θ)).
   Prove that the formal Bayes rule is an HPD region.
   (b) Let the loss function be L_2(θ, [a_1, a_2]) = (a_2 − a_1)² + c(1 − I_{[a_1,a_2]}(θ)).
   Find the formal Bayes rule.

Section 5.3:

41. Suppose that one wished to construct a parametric bootstrap estimate in
the second part of Example 5.80 on page 330.
   (a) Explain how to construct the parametric bootstrap estimate using
   the U(0, θ) parametric family.
   (b) Find the distribution of R(X*, F̂_n) for the parametric bootstrap es-
   timate.
   (c) Will the parametric bootstrap estimate have the same problem that
   the nonparametric bootstrap estimate has?

42. How would one use the nonparametric bootstrap to find the bias and stan-
dard deviation for the sample correlation coefficient from a sample of n
pairs (X_1, Y_1), ..., (X_n, Y_n)?

43. Let (x_1, y_1), ..., (x_n, y_n) be data pairs, and suppose that we entertain a
regression model in which E(Y_i|B_0 = β_0, B_1 = β_1) = β_0 + β_1 x_i. The x-
intercept of the regression line is x = −β_0/β_1. Let (B̂_0, B̂_1) be the usual
least-squares regression estimator.
   (a) How would one use the bootstrap to find the bias and standard de-
   viation of the ratio −B̂_0/B̂_1?
   (b) Suppose that one used the following formula for the approximate
   variance of the ratio of two random variables Z_0/Z_1:

    Var(Z_0/Z_1) ≈ (E[Z_0]/E[Z_1])² ( Var(Z_0)/E[Z_0]² − 2 Cov(Z_0, Z_1)/(E[Z_0]E[Z_1]) + Var(Z_1)/E[Z_1]² ).

   Show how you would use this to find bootstrap confidence intervals
   for −B̂_0/B̂_1.
CHAPTER 6
Equivariance*

In Chapter 3, we introduced a few principles of classical decision theory


(e.g., minimaxity, ancillarity) to help to choose among admissible rules.
In Chapter 5, we introduced another principle called unbiasedness which
could be used to select a subset of the class of all estimators. As we saw,
sometimes none of the unbiased estimators was admissible. In this chapter,
we introduce another ad hoc principle called equivariance,¹ which can also
be used to select a subset of the class of all estimators. The principle of
equivariance, in its most general form, relies on the algebraic theory of
groups. However, the basic concept can be understood by means of a simple
class of problems in which the principle can apply.

6.1 Common Examples


6.1.1 Location Problems
We will consider only parametric aspects of equivariance, since it is of inter-
est primarily in the classical paradigm. Suppose that we have constructed
a parametric family P₀ with a parameter Θ.
Definition 6.1. First, let X and Θ be scalar random variables. If the
conditional distribution of X − Θ given Θ = θ is the same for all θ, then
*This chapter may be skipped without interrupting the flow of ideas.


¹Some authors call the principle invariance rather than equivariance. We will
see later why the term invariance is better used to mean something different, but
related.
6.1. Common Examples 345

Θ is called a location parameter for X. If Θ > 0, and the conditional
distribution of X/Θ given Θ = θ is the same for all θ, then Θ is called a
scale parameter for X. If Θ = (Θ_1, Θ_2), where both Θ_i are scalar, Θ_2 > 0,
and the conditional distribution of (X − Θ_1)/Θ_2 given Θ = θ is the same
for all θ, then Θ is called a location-scale parameter for X.
    Next, let X be a vector, and let Θ be a scalar. Let 1 denote the vector
of the same length as X with every coordinate equal to 1. Then Θ is a
location parameter for X if the conditional distribution of X − Θ1 given
Θ = θ is the same for all θ. If Θ > 0 and the conditional distribution of
X/Θ given Θ = θ is the same for all θ, then Θ is a scale parameter for X.
If Θ = (Θ_1, Θ_2), where both Θ_i are scalar, Θ_2 > 0, and the conditional
distribution of (X − Θ_1 1)/Θ_2 given Θ = θ is the same for all θ, then Θ is
called a location-scale parameter for X.
    Next, let X be a vector, and let Θ be a vector of the same dimension.
Then Θ is a location parameter for X if the conditional distribution of
X − Θ given Θ = θ is the same for all θ. If Θ is a nonsingular matrix
parameter and the conditional distribution of Θ^{−1}X given Θ = θ is the
same for all θ, then Θ is a scale parameter for X. If Θ = (Θ_1, Θ_2), where
Θ_1 is a vector of the same dimension as X, Θ_2 is a nonsingular matrix, and
the conditional distribution of Θ_2^{−1}(X − Θ_1) given Θ = θ is the same for
all θ, then Θ is called a location-scale parameter.
    We will deal only with location parameters in this section. In fact, the
only cases of location parameters we will consider are those in which X is
a vector of exchangeable random variables that are conditionally IID given
Θ, and Θ is scalar. That is, the conditional distribution of X − Θ1 given
Θ = θ is the same for all θ.
Theorem 6.2. If Θ is a location parameter for X and f_{X|Θ}(x|θ) is the
Radon-Nikodym derivative of P_θ with respect to Lebesgue measure, then
f_{X|Θ}(x|θ) = g(x − θ1) for some density function g.
PROOF. The conditional joint CDF of X − Θ1 given Θ = θ is

    Pr_θ(X_i − θ ≤ c_i, for i = 1, ..., n)
        = ∫_{−∞}^{c_1+θ} ··· ∫_{−∞}^{c_n+θ} f_{X|Θ}(x|θ) dx_n ··· dx_1
        = ∫_{−∞}^{c_1} ··· ∫_{−∞}^{c_n} f_{X|Θ}(y + θ1|θ) dy_n ··· dy_1,

which, for each c_1, ..., c_n, is the same for all θ if and only if f_{X|Θ}(y + θ1|θ) =
g(y) for some density g. This implies that f_{X|Θ}(x|θ) = g(x − θ1). □
Next, imagine two possible data vectors x and y = x + c1. If θ + c ∈ Ω,
then f_{X|Θ}(x|θ) = f_{X|Θ}(y|θ + c). This says that if the data values were all
shifted by the same amount, then the likelihood function would be trans-
lated by that same amount. The goal of equivariance is to take advantage
of this "double shift." We make this idea more precise in a proposition.
346 Chapter 6. Equivariance

Proposition 6.3. If Θ is a location parameter for X, then the conditional
distribution of X + c1 given Θ = θ is the same as the conditional distribution
of X given Θ = θ + c.
    The word "equivariant" means that two (or more) things change in the
same way. Proposition 6.3 says that two different changes to a problem
produce the same change to a distribution. That is, the conditional distri-
bution of X given Θ changes the same way whether we change X to X + c1
or we change Θ to Θ + c.
Definition 6.4. Consider a decision problem with parameter space ℝ and
action space ℝ. A loss function L is called location invariant if L(θ, a) =
ρ(θ − a), where ρ is some function. If ρ increases as its argument moves
away from 0, such a decision problem is called location estimation.
    A decision rule δ(x) is location equivariant if δ(x + c1) = δ(x) + c, for all
c and all x. A function g(x) is location invariant if g(x + c1) = g(x), for all
c and all x.
The word "invariant" means "does not change." Functions that satisfy
g(x + c1) = g(x) have the property that their value does not change when
their argument changes. Functions that satisfy δ(x + c1) = δ(x) + c have
the property that their value changes the same way whether we change x
to x + c1 or we change δ(·) to δ(·) + c.
Proposition 6.5. If δ is location equivariant and Θ is a location parame-
ter, then the conditional distribution of δ(X) − Θ given Θ = θ is the same
for all θ.
    Note that Proposition 6.5 implies that the risk function of an equivariant
estimator is constant if the loss is location invariant. (See Problem 3 on
page 388.)
Lemma 6.6.² Suppose that δ_0 is location equivariant. Then δ_1 is location
equivariant if and only if there exists a location invariant function u such
that δ_1 = δ_0 + u.
PROOF. Clearly, if there is such a u, then δ_0 + u is equivariant. Equally
clearly, if δ_1 is equivariant, then u = δ_1 − δ_0 is invariant. □

Lemma 6.7. A function u is location invariant if and only if u is a function
of x only through (x_1 − x_n, ..., x_{n−1} − x_n).
PROOF. The "if" part is trivial. For the "only if" part, let c = −x_n. Then

    u(x + c1) = u(x_1 − x_n, ..., x_{n−1} − x_n, 0) = u(x),

by invariance. Note that when n = 1, only constants are invariant. □

²This lemma is used in the proofs of Theorems 6.8 and 6.18.



Theorem 6.8. Suppose that Y = (X_1 − X_n, ..., X_{n−1} − X_n) and L(θ, a) =
(θ − a)². Suppose that δ_0 is a location equivariant estimator with finite risk.
Then, the equivariant estimator with smallest risk is δ_0(X) − E_0[δ_0(X)|Y].
PROOF. Let δ_0 be an arbitrary equivariant estimator with finite risk. By
Lemma 6.6, all other equivariant estimators have the form δ_0(X) − v(Y).
Since the risk function is constant for an equivariant δ,

    R(θ, δ) = R(0, δ) = E_0[δ_0(X) − v(Y)]²
            = E_0{ E_0[ (δ_0(X) − v(Y))² | Y ] },

which is minimized by minimizing E_0[(δ_0(X) − v(y))²|Y = y] uniformly in
y. This is accomplished by choosing v(y) = E_0[δ_0(X)|Y = y]. □
    The estimator in Theorem 6.8 is due to Pitman (1939). It is often called
the minimum risk equivariant (MRE) estimator or Pitman's estimator.
Throughout the rest of this section, we will use the symbol Y to stand
for the vector defined in Theorem 6.8.
Example 6.9. Let X_1, ..., X_n be IID N(θ, 1) given Θ = θ. Let

    Y_*^T = (Y^T, X_n) = (Y_1, ..., Y_{n−1}, Y_n).

Let δ_0(X) = X_n = Y_n. We can write

    Y_* = ( 1  0  ···  0  −1
            0  1  ···  0  −1
            ⋮              ⋮
            0  0  ···  1  −1
            0  0  ···  0   1 ) X.

Hence, given Θ = θ, X ∼ N_n(θ1, I_n) and

    Y_* ∼ N_n( (0, ..., 0, θ)^T, ( I_{n−1} + J_{n−1}   −1
                                      −1^T             1 ) ),

where J_{n−1} is the (n−1) × (n−1) matrix of all 1s; that is, each Y_i has variance
2, distinct Y_i and Y_j have covariance 1, and each Y_i has covariance −1 with X_n.
To get the conditional distribution of X_n given Y, we need the inverse of the
upper left-hand corner of the covariance matrix of Y_*. This inverse is I_{n−1} − J_{n−1}/n.
So

    X_n | Y = y, Θ = θ ∼ N( θ + [−1, ..., −1](I_{n−1} − J_{n−1}/n) y, c ),

for some number c which we do not need, since the minimum risk equivariant
estimator depends only on the mean of this distribution when θ = 0. We can
rewrite the mean (with θ = 0) as

    −Σ_{i=1}^{n−1} y_i + ((n−1)/n) Σ_{i=1}^{n−1} y_i = −(1/n) Σ_{i=1}^{n−1} y_i = −X̄ + X_n.

Pitman's estimator is then X_n + X̄ − X_n = X̄, which is not surprising.



Pitman's estimator is often expressed in a different form.


Theorem 6.10. In a location problem with one-dimensional parameter,
Pitman's estimator can be written as E(Θ|X = x), where we use a "uni-
form prior" for Θ. That is, the MRE estimator for squared-error loss is the
formal Bayes rule with respect to Lebesgue measure, if it has finite risk.
PROOF. Suppose that Y = (X_1 − X_n, ..., X_{n−1} − X_n) and f_{X|Θ}(x|θ) =
g(x − θ1). Transform X to (Y^T, X_n)^T. The Jacobian of this transformation
is 1, and we get

    f_{Y,X_n|Θ}(y, x_n|θ) = g(y + (x_n − θ)1, x_n − θ).

The marginal density of Y is

    f_{Y|Θ}(y|θ) = ∫ g(y + (x_n − θ)1, x_n − θ) dx_n = ∫ g(y + u1, u) du.

This does not depend on θ because Y is ancillary (see Problem 4 on
page 389). So, we can write the conditional density

    f_{X_n|Y,Θ}(x_n|y, 0) = g(y + x_n 1, x_n) / ∫ g(y + u1, u) du.

Let δ_0(x) = x_n in Theorem 6.8. Then

    v(y) = E_0(X_n|Y = y) = ∫ u g(y + u1, u) du / ∫ g(y + u1, u) du.

Now change variables from u to z = x_n − u, so u = x_n − z. Then y_i + u =
x_i − z, and

    δ(X) = X_n − v(Y) = ∫ (x_n − u) g(y + u1, u) du / ∫ g(y + u1, u) du
         = ∫ z g(x − z1) dz / ∫ g(x − z1) dz
         = ∫ θ f_{X|Θ}(x|θ) dθ / ∫ f_{X|Θ}(x|θ) dθ = E(Θ|X = x),

if the prior for Θ is Lebesgue measure. □
The "uniform prior," or Lebesgue measure, is closely related to location
equivariance. The relation is that the Lebesgue measure of a set is invariant
under location shifts. When we deal with more general types of equivari-
ance, a generalization of Theorem 6.10 will emerge in that the MRE esti-
mator will be the formal Bayes rule with respect to a "prior" distribution
that is invariant in the appropriate sense.
Example 6.11 (Continuation of Example 6.9; see page 347). Let X_1, ..., X_n be
IID N(θ, 1) given Θ = θ. If X = (X_1, ..., X_n) and δ_0(X) = X_n, the posterior
from a uniform prior is Θ ∼ N(x̄, 1/n), and the MRE is X̄, as we saw earlier.

Example 6.12. Let X have Cauchy distribution Cau(θ, 1) given Θ = θ. The
posterior from a uniform prior is Θ ∼ Cau(x, 1), and there is no formal Bayes
rule. In fact, it is easy to show that all equivariant estimators have infinite risk.
Note that constant estimators (like δ(x) = c ∈ Ω for all x) have finite risk but
are not equivariant (and they have infinite posterior risk, but since the prior is
improper, this is not surprising).

A maximin-style theorem can be proven about equivariant estimation.
Suppose that Nature knows that you will use an equivariant estimator and
the loss function is squared error. Which location family should Nature
choose to make your risk as large as possible? The answer is the family of
normal distributions.
Theorem 6.13. Suppose that L(θ, a) = (θ − a)² and

    ℱ = { f ≥ 0 : ∫ f(x) dx = 1, ∫ x f(x) dx = 0, ∫ x² f(x) dx = 1 }.

Suppose that X_1, ..., X_n are IID with conditional density f(x − θ) given
Θ = θ for some f ∈ ℱ. Define r_n(f) to be the greatest lower bound of the
risk over the set of all equivariant estimators. Then sup_{f∈ℱ} r_n(f) = r_n(f_0),
where f_0 is the standard normal density.
PROOF. If f_0 is the standard normal density, then X̄ is MRE and r_n(f_0) =
1/n. Since X̄ is always equivariant, it must be that r_n(f) ≤ 1/n, since 1/n
is the risk of X̄ for all f ∈ ℱ.³ □

Example 6.14. Suppose that X₁, ..., Xₙ are IID with conditional density f(x −
θ) given Θ = θ, where f(x) = exp[−(x+1)] for x ≥ −1. This is a family of shifted
exponential distributions, rigged so that the distribution with θ = 0 has mean 0
also. The first-order statistic X₍₁₎ = min_i X_i is equivariant and is a complete
sufficient statistic. Since Y = (X₁ − Xₙ, ..., X_{n−1} − Xₙ) is ancillary, Theorem
2.48 says that X₍₁₎ and Y are independent given Θ. So

    E₀[X₍₁₎ | Y] = E₀ X₍₁₎ = −(n − 1)/n,

and X₍₁₎ + (n − 1)/n is MRE and its risk is 1/n².
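As a quick numerical sanity check of this example, the following Python sketch (an illustration, not part of the text) simulates the shifted exponential model and estimates the risk of X₍₁₎ + (n − 1)/n, which should be close to 1/n²:

```python
import numpy as np

# Monte Carlo check that delta(X) = X_(1) + (n-1)/n has risk 1/n^2.
# Since the rule is equivariant, the risk is the same for every theta,
# so we simulate at theta = 0.
rng = np.random.default_rng(0)
n, reps = 5, 200_000

# X_i has density f(x) = exp(-(x + 1)) for x >= -1, i.e. Exponential(1) - 1
x = rng.exponential(size=(reps, n)) - 1.0
delta = x.min(axis=1) + (n - 1) / n
risk = np.mean(delta ** 2)       # squared error about theta = 0

print(risk)                      # should be close to 1/n^2 = 0.04
```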

For more general loss functions, we have the following lemma.


Lemma 6.15. If ρ is strictly convex and not monotone and L(θ, a) =
ρ(a − θ), then the MRE estimator of Θ exists if and only if there is some
equivariant estimator with finite risk. The MRE is unique. The MRE is
unbiased if ρ(t) = t².

³There is a theorem of Kagan, Linnik, and Rao (1965) which shows that, for
n ≥ 3, E₀(X̄|Y) = 0 if and only if f = f₀. This would imply that rₙ(f) < 1/n if
n ≥ 3 and f ≠ f₀.

PROOF.⁴ If no equivariant estimator has finite risk, then it makes no sense
to talk about the MRE estimator. If ρ is strictly convex and δ₀ is equivariant
with finite risk, let Y = (X₁ − Xₙ, ..., X_{n−1} − Xₙ). Write

    φ(t; y) = E₀[ρ(δ₀(X) − t) | Y = y].

We first show that φ is strictly convex as a function of t for fixed y:

    φ(αt + (1−α)u; y) = E₀[ρ(αδ₀(X) − αt + (1−α)δ₀(X) − (1−α)u) | Y = y]
                      < E₀[αρ(δ₀(X) − t) + (1−α)ρ(δ₀(X) − u) | Y = y]
                      = αφ(t; y) + (1−α)φ(u; y),

where the strict inequality holds because δ₀(X) − t cannot equal δ₀(X) − u
a.s. [P₀] if t ≠ u. We can also show that φ is not monotone in t a.s. This
follows from the fact that, since ρ is strictly convex and not monotone,
ρ(x) → ∞ as x → ∞ and as x → −∞, and δ₀(X) − t converges in
probability to −∞ or ∞ as t → ∞ or t → −∞, respectively. Since convex
functions are continuous on the interiors of their domains, φ(t; y) has a
minimum at t = v(y) for each y, and the minimum is unique by strict
convexity. It follows that δ(X) = δ₀(X) − v(Y) is the MRE estimator,
since it minimizes E₀[ρ(δ(X) − 0)].
If ρ(t) = t², let δ(X) be any equivariant estimator:

    E_θ δ(X) = E₀ δ(X + θ1) = θ + E₀ δ(X) = θ + c.

The risk of δ is its variance plus the bias squared, which is minimized by
choosing c = 0. Hence the MRE estimator has c = 0 and is unbiased. □

6.1.2 Scale Problems*


It turns out that a scale problem with positive random variables and a
positive parameter is identical to a location problem. All one needs to do
is replace Θ by log Θ and X_i by log X_i. The general scale problem can be
defined in a fashion similar to the general location problem.
Definition 6.16. Consider a decision problem with parameter space ℝ⁺
and action space ℝ⁺ ∪ {0}. A loss function L(θ, a) is called scale invariant
if L(θ, a) = ρ(a/θ) for some function ρ. If ρ increases as its argument moves
away from 1, such a decision problem is called scale estimation.
A decision rule δ(x) is scale equivariant if δ(cx) = cδ(x) for all positive c
and all x. A function g(x) is scale invariant if g(cx) = g(x) for all positive
c and all x.
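For concreteness, here is a small Python illustration (not from the text) of Definition 6.16: the geometric mean is one scale equivariant rule, and ratios to the sample maximum form a scale invariant function:

```python
import numpy as np

def delta(x):
    # geometric mean: a scale equivariant estimator, delta(c*x) = c*delta(x)
    return np.exp(np.mean(np.log(x)))

def g(x):
    # ratios to the sample maximum: scale invariant, g(c*x) = g(x)
    return x / x.max()

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 4.0, size=6)   # positive data, as in a scale problem
c = 7.3

equivariant_ok = np.isclose(delta(c * x), c * delta(x))
invariant_ok = np.allclose(g(c * x), g(x))
print(equivariant_ok, invariant_ok)
```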

⁴This proof is based on the proof of Theorem 6.8 in Chapter 1 of Lehmann
(1983).
*This section may be skipped without interrupting the flow of ideas.

A scale analog of Theorem 6.2 would say that f_{X|Θ}(x|θ) = g(x/θ)/θ. All
of the other results concerning location problems have their counterparts
in the case of scale problems with positive random variables. We will not
restate them all here. It should be noted, however, that squared error must
be changed to (log(a/θ))². Also, if one finds an estimator for log Θ, one
should remember to exponentiate the result to produce an estimator of Θ.
Example 6.17. Suppose that {Xₙ}_{n=1}^∞ are conditionally IID U(0,θ) given
Θ = θ. Let X = (X₁, ..., Xₙ). Then Θ is a scale parameter. Using the loss
L(θ,a) = [log(a/θ)]², we can find Pitman's estimator of log Θ by finding the
posterior mean of log Θ based on a uniform prior for log Θ. This "prior" translates
into the "prior" 1/θ for Θ, since ψ = log θ means dψ = dθ/θ. The posterior
density for Θ is then

    f_{Θ|X}(θ|x) = n x₍ₙ₎ⁿ / θ^{n+1} I_{(x₍ₙ₎,∞)}(θ),

where x₍ₙ₎ = max_i x_i. (This is known as a Pareto distribution.) The mean of
log Θ is

    ∫_{x₍ₙ₎}^∞ log θ · n x₍ₙ₎ⁿ / θ^{n+1} dθ = log x₍ₙ₎ + ∫_0^∞ n t exp(−nt) dt
                                           = log x₍ₙ₎ + 1/n,

by making the transformation t = log(θ/x₍ₙ₎). The MRE estimator of Θ becomes
X₍ₙ₎ exp(1/n).
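The posterior-mean calculation above can be checked numerically. The Python sketch below (illustrative only; the sample size and observed maximum are arbitrary choices) integrates log θ against the Pareto posterior and compares the result with log x₍ₙ₎ + 1/n:

```python
import numpy as np

n, x_n = 4, 2.0         # sample size and the observed maximum x_(n)

# Midpoint-rule integral of log(theta) against the Pareto posterior
# f(theta|x) = n x_(n)^n / theta^(n+1) on (x_(n), inf), truncated far out
m = 2_000_000
theta = np.linspace(x_n, 500.0 * x_n, m + 1)
mid = (theta[:-1] + theta[1:]) / 2
h = theta[1] - theta[0]
dens = n * x_n ** n / mid ** (n + 1)

post_mean_log = np.sum(np.log(mid) * dens) * h
print(post_mean_log, np.log(x_n) + 1 / n)   # the two should nearly agree
```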

An alternative invariant loss function, which is more like squared-error
loss, is L(θ, a) = (θ − a)²/θ². Here, we need not assume that the random
variables are positive, because we will not have to take logarithms. An
analog to Theorem 6.8 is proven next.
Theorem 6.18.⁵ Let Θ be a scale parameter, let L(θ, a) = (θ − a)²/θ²,
and let Y = (X₁/|Xₙ|, ..., Xₙ/|Xₙ|). Let δ₀ be an equivariant estimator with
finite risk. Then the equivariant estimator with smallest risk is

    δ(X) = δ₀(X) E₁[δ₀(X)|Y] / E₁[δ₀²(X)|Y].

PROOF. Let δ₀ be an arbitrary equivariant estimator with finite risk. By the
scale analog to Lemma 6.6, all other equivariant estimators have the form
δ₀(X)/v(Y), where v is scale invariant. Since the risk function is constant
for an equivariant δ,

    R(θ,δ) = R(1,δ) = E₁[δ₀(X)/v(Y) − 1]² = E₁{E₁[(δ₀(X)/v(Y) − 1)² | Y]},

which is minimized by minimizing E₁[(δ₀(X)/v(Y) − 1)² | Y = y] uniformly
in y. To do this, choose v(y) = E₁[δ₀²(X)|Y = y]/E₁[δ₀(X)|Y = y]. □
It can be shown that the MRE estimator is also the formal Bayes rule
with respect to the improper prior 1/θ.

⁵This theorem is used in Example 6.60.



Theorem 6.19. Under the same conditions as in Theorem 6.18, the MRE
estimator can be written as the formal Bayes rule with respect to a prior
having Radon-Nikodym derivative 1/θ with respect to Lebesgue measure, if
it has finite risk.
PROOF. We begin with the equivariant estimator δ₀(X) = |Xₙ|. Let Y be as
in Theorem 6.18. The transformation from x to (yᵀ, xₙ)ᵀ has Jacobian |xₙ|^{n−1},
so

    f_{Y,Xₙ|Θ}(y, xₙ|1) = f(|xₙ|y)|xₙ|^{n−1},

both for yₙ = +1 and yₙ = −1. So, the conditional density of Xₙ given
Y = y is

    f_{Xₙ|Y,Θ}(xₙ|y, 1) = f(|xₙ|y)|xₙ|^{n−1} / ∫ f(|u|y)|u|^{n−1} du.

It follows that

    E₁[|Xₙ|^k | Y = y] = ∫_0^∞ u^{n−1+k} f(uy) du / ∫_0^∞ u^{n−1} f(uy) du,   k = 1, 2,

because both integrands are symmetric around 0 as functions of u. Now
make the change of variables u = |xₙ|/z with inverse z = |xₙ|/u. Then
du = −|xₙ|dz/z². It follows that (writing y = x/|xₙ|, so that uy = x/z)

    E₁[|Xₙ| | Y = y] = |xₙ| ∫_0^∞ z^{−(n+2)} f(x/z) dz / ∫_0^∞ z^{−(n+1)} f(x/z) dz,
    E₁[Xₙ² | Y = y] = xₙ² ∫_0^∞ z^{−(n+3)} f(x/z) dz / ∫_0^∞ z^{−(n+1)} f(x/z) dz.

Hence, the MRE estimator is δ(X), where

    δ(x) = ∫_0^∞ z^{−(n+2)} f(x/z) dz / ∫_0^∞ z^{−(n+3)} f(x/z) dz.

To see that this is the formal Bayes rule with respect to the "prior" 1/θ, note
that the posterior f_{Θ|X}(θ|x) is proportional to f(x/θ)/θ^{n+1}. The expected
loss is (aside from the proportionality constant)

    ∫_0^∞ (θ − a)² θ^{−(n+3)} f(x/θ) dθ.

By expanding this as a function of a and taking the derivative, we find that
the minimum occurs at a = δ(x). □
The reader should note that the measure λ(A) = ∫_A dθ/θ is invariant
under scale changes. That is, the measure of a set A of positive numbers is
the same as the measure of |c|A for all real c ≠ 0.

6.2 Equivariant Decision Theory


6.2.1 Groups of Transformations
Equivariance occurs in more general situations than just location or scale
problems. For example, it can occur in combined location-scale problems
with a two-dimensional parameter. In fact, it can occur whenever there is
a group of transformations that acts on the sample space, the parameter
space, and the action space in the "same way." We will now make this
notion more precise.
Definition 6.20. A group is a nonempty set G together with a binary
operation ∘ called composition such that

    for each g₁, g₂ ∈ G, g₁ ∘ g₂ ∈ G;

    there exists e ∈ G such that e ∘ g = g for all g ∈ G;

    for each g ∈ G there is g⁻¹ ∈ G such that g⁻¹ ∘ g = e;

    for each g₁, g₂, g₃ ∈ G, g₁ ∘ (g₂ ∘ g₃) = (g₁ ∘ g₂) ∘ g₃.

The element e is called the identity and, for each g, g⁻¹ is called the inverse
of g. A group is abelian if ∘ is commutative, that is, if g₁ ∘ g₂ = g₂ ∘ g₁ for
all g₁ and g₂.
There appears to be some asymmetry in the definition of the identity
and inverses. This is illusory, however.
Lemma 6.21. Let G be a group. For all g ∈ G, g ∘ e = g and g ∘ g⁻¹ = e.
There is only one identity element, and for each g ∈ G, there is only one
inverse of g. The inverse of g⁻¹ is g.
PROOF. Let g ∈ G. Let h be the inverse of g⁻¹. Since g⁻¹ ∘ g = e, (g⁻¹ ∘
g) ∘ g⁻¹ = e ∘ g⁻¹ = g⁻¹. It follows that

    h ∘ ((g⁻¹ ∘ g) ∘ g⁻¹) = h ∘ g⁻¹ = e.

The left-hand side of this last equation can be rewritten using the associa-
tive property as

    (h ∘ g⁻¹) ∘ (g ∘ g⁻¹) = g ∘ g⁻¹.

Hence g ∘ g⁻¹ = e. It follows that g satisfies the property required to be
called the inverse of g⁻¹. Next, note that e ∘ e = e and g⁻¹ ∘ g = e, so

    g ∘ e = g ∘ (e ∘ e) = g ∘ (e ∘ (g⁻¹ ∘ g)) = g ∘ ((e ∘ g⁻¹) ∘ g) = g ∘ (g⁻¹ ∘ g) = g.

For the uniqueness claims, first suppose that h ∘ g = e. Then, using what
we have just proved,

    h = h ∘ e = h ∘ (g ∘ g⁻¹) = (h ∘ g) ∘ g⁻¹ = e ∘ g⁻¹ = g⁻¹,

which means that the inverse of g is unique. It follows from what we proved
above that g is the unique inverse of g⁻¹. Finally, let h ∘ g = g for all g.
Then

    h = h ∘ (g ∘ g⁻¹) = (h ∘ g) ∘ g⁻¹ = g ∘ g⁻¹ = e.

Hence, the identity is unique. □
Sometimes two seemingly different groups are essentially the same.
Definition 6.22. Let G₁ and G₂ be groups with compositions ∘₁ and ∘₂,
respectively. Let φ : G₁ → G₂ be a one-to-one onto function such that, for
all g, h ∈ G₁, φ(g ∘₁ h) = φ(g) ∘₂ φ(h). Then φ is called a group isomorphism.
The following proposition is straightforward.
Proposition 6.23. Let φ be a group isomorphism between G₁ and G₂.
Then φ⁻¹ is a group isomorphism between G₂ and G₁. Also, φ maps the
identity in G₁ to the identity in G₂, and φ maps the inverse of each g ∈ G₁
to the inverse of φ(g) ∈ G₂.
The groups that will most interest us are groups of transformations. The
set U in the following definition can be the sample space X or the parameter
space Ω or the action space ℵ.
Definition 6.24. A measurable function f : U → U is called a transfor-
mation of U. The function e(u) = u for all u ∈ U is called the identity
transformation.
Proposition 6.25. Suppose that G is a set of transformations of a set
U with e ∈ G being e(u) = u for all u ∈ U. Let composition ∘ be the
composition of functions. If G is a group with identity e, then every element
of G is one-to-one.
Example 6.26. Here are some examples of groups of transformations. When we
refer to these examples in the future, we will call the ith one "Group i."
1. U = ℝⁿ, g_c(x₁, ..., xₙ) = (x₁ + c, ..., xₙ + c) for each c ∈ ℝ. The identity
is g₀, the inverse of g_c is g₋c, and the composition is g_a ∘ g_b = g_{a+b}. This
group is abelian.
2. U = ℝⁿ, g_c(x₁, ..., xₙ) = (cx₁, ..., cxₙ) for each c > 0. The identity is g₁,
the inverse of g_c is g_{1/c}, and the composition is g_a ∘ g_b = g_{ab}. This group
is abelian.
3. U = ℝⁿ, g_{(a,b)}(x₁, ..., xₙ) = (bx₁ + a, ..., bxₙ + a) for each b > 0, a ∈ ℝ.
The identity is g_{(0,1)}, the inverse of g_{(a,b)} is g_{(−a/b,1/b)}, and the composition
is g_{(a,b)} ∘ g_{(c,d)} = g_{(bc+a,bd)}. This group is not abelian.
4. U = ℝⁿ, g_A(x) = Ax, where A ∈ GL(n), the set of nonsingular n × n
matrices. The identity is g_I, the inverse of g_A is g_{A⁻¹}, and the composition
is matrix multiplication g_A ∘ g_B = g_{AB}. This group is not abelian. This is
called the general linear group of dimension n.
5. U = ℝⁿ, g_p(x₁, ..., xₙ) = (x_{p(1)}, ..., x_{p(n)}), where p is any permutation
of (1, ..., n). The identity is g_{(1,...,n)}, the inverse of g_p is g_{p⁻¹}, and the
composition is composition of permutations, g_{p₁} ∘ g_{p₂} = g_{p₁∘p₂}. This group
is not abelian.
6. U can be any measurable set and G can be the set of all one-to-one mea-
surable functions whose inverses are also measurable.
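Group 3 (the location-scale group) can be checked mechanically. The Python snippet below (an illustration, not part of the text) encodes the composition, inverse, and identity given above and verifies the group axioms on random elements:

```python
import random

def act(g, x):
    # g = (a, b) acts on a point by x -> b*x + a
    a, b = g
    return b * x + a

def comp(g, h):
    # composition from the text: (a,b) o (c,d) = (bc + a, bd)
    (a, b), (c, d) = g, h
    return (b * c + a, b * d)

def inv(g):
    # inverse of (a,b) is (-a/b, 1/b)
    a, b = g
    return (-a / b, 1 / b)

identity = (0.0, 1.0)
rng = random.Random(0)
for _ in range(1000):
    g = (rng.uniform(-5, 5), rng.uniform(0.1, 5))
    h = (rng.uniform(-5, 5), rng.uniform(0.1, 5))
    x = rng.uniform(-10, 10)
    # composition matches "apply h, then g", and inv(g) o g = identity
    assert abs(act(comp(g, h), x) - act(g, act(h, x))) < 1e-9
    assert all(abs(u - v) < 1e-9 for u, v in zip(comp(inv(g), g), identity))

# the group is not abelian: the order of composition matters
print(comp((1.0, 2.0), (0.0, 3.0)), comp((0.0, 3.0), (1.0, 2.0)))
```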

If A ⊆ U, we will use the shorthand notation gA to denote the set

    gA = {u ∈ U : u = gy for some y ∈ A}.

Definition 6.27. Let 𝒫₀ be a parametric family with parameter space Ω
and sample space (X, ℬ). Let G be a group of transformations of X. We
say that G leaves 𝒫₀ invariant if for each g ∈ G and each θ ∈ Ω there exists
θ* ∈ Ω such that P_θ(A) = P_{θ*}(gA) for every A ∈ ℬ.
It is easy to see that the θ* in Definition 6.27 is unique.
Lemma 6.28. Suppose that G leaves 𝒫₀ invariant. Then, for each g ∈ G
and θ ∈ Ω, the θ* in Definition 6.27 is unique.
PROOF. Let g ∈ G and θ ∈ Ω be given. Suppose that both θ* and θ′ satisfy,
for every A ∈ ℬ,

    P_θ(A) = P_{θ*}(gA) = P_{θ′}(gA).

It follows that for every A ∈ ℬ, P_θ(g⁻¹A) = P_{θ*}(A) = P_{θ′}(A). Since
distinct elements of the parameter space have to provide distinct probability
measures, this last equation implies that θ* = θ′. □
We will call the unique value θ* by the name ḡθ to indicate its connection
to both g and θ. We can try to understand intuitively what it means to say
that G leaves 𝒫₀ invariant. Suppose that we believed that X had distribu-
tion P_θ given Θ = θ. We already know what the conditional distribution
of gX is:

    P′_θ(gX ∈ A) = P′_θ(X ∈ g⁻¹A).

This has nothing to do with equivariance yet. It is a simple consequence
of what we already know about the induced distribution of a function of a
random quantity. What invariance of distributions means is that the second
equation below holds (the others are all consequences of probability theory
and group theory):

    P′_θ(X ∈ g⁻¹A) = P_θ(g⁻¹A) = P_{ḡθ}(gg⁻¹A) = P_{ḡθ}(A) = P′_{ḡθ}(X ∈ A).

So, we see that the conditional distribution of gX given Θ = θ is P_{ḡθ}, which
is the conditional distribution of X given Θ = ḡθ.
Proposition 6.29. Suppose that G leaves 𝒫₀ invariant. Then, for each g ∈
G, the transformation ḡ : Ω → Ω is one-to-one and onto. Also, Ḡ = {ḡ : g ∈
G} is a group, and \overline{g^{-1}} = ḡ^{-1}.

The proof of this proposition is straightforward and is good practice for


those readers who have become rusty in group theory. See the proof of
Lemma 6.31 below for some guidance.
Definition 6.30. If G leaves 𝒫₀ invariant and L(θ, a) is a loss function on
Ω × ℵ, we say that the loss is invariant under G if, for each g ∈ G and
each a ∈ ℵ, there exists a unique a* ∈ ℵ such that L(ḡθ, a*) = L(θ, a) for
all θ ∈ Ω. We will denote a* by g̃a.
Lemma 6.31. If the loss is invariant under G, then, for each g ∈ G, the
transformation g̃ : ℵ → ℵ is one-to-one and onto. Also, G̃ = {g̃ : g ∈ G} is
a group, and \widetilde{g^{-1}} = g̃^{-1}.

PROOF. First, we show that g̃ is onto for all g ∈ G. Let a ∈ ℵ and g ∈
G. Recall (Proposition 6.29) that \overline{g^{-1}} = ḡ^{-1}. For all θ ∈ Ω, L(θ, a) =
L(ḡ^{-1}θ, \widetilde{g^{-1}}a). Since this is true for all θ and ḡ^{-1} is one-to-one, it follows
that L(ψ, \widetilde{g^{-1}}a) = L(ḡψ, a) for all ψ ∈ Ω (just let ψ = ḡ^{-1}θ). By the
definition of g̃, it follows that g̃\widetilde{g^{-1}}a = a, hence g̃ is onto. Applying this
same argument to g⁻¹ gives that \widetilde{g^{-1}}g̃a = a. This also shows that g̃ is one-
to-one and \widetilde{g^{-1}} = g̃^{-1}. Clearly, the identity transformation on ℵ is ẽ. The
composition of g̃ and h̃ is clearly \widetilde{gh}, and the associative property follows
directly from the associative property in G. □

Example 6.32. Consider Group 3. Suppose that X₁, ..., Xₙ are conditionally
IID N(μ,σ²) given Θ = (μ,σ). Let

    G = {(a, b) : b > 0, a ∈ ℝ}.

If g = (a, b) and θ = (μ,σ), then ḡθ = (bμ + a, bσ) and

    P_{ḡθ}(gA) = ∫_{bA+a} 1/(bσ√(2π))ⁿ exp{ −(1/(2b²σ²)) Σᵢ₌₁ⁿ (xᵢ − a − bμ)² } dx

              = ∫_{bA} 1/(bσ√(2π))ⁿ exp{ −(1/(2b²σ²)) Σᵢ₌₁ⁿ (yᵢ − bμ)² } dy

              = ∫_A 1/(σ√(2π))ⁿ exp{ −(1/(2σ²)) Σᵢ₌₁ⁿ (zᵢ − μ)² } dz = P_θ(A).

Suppose that the loss function is L(θ, d) = (d − μ)²/σ². Then, if g̃d = bd + a, we
get

    L(ḡθ, g̃d) = (bd + a − bμ − a)²/(b²σ²) = L(θ, d).

We might ask, "What is the set of all invariant loss functions?" That is, what
is the set of all L such that, for every σ, μ, d, a, b,

    L((μ,σ), d) = L((bμ + a, bσ), bd + a)?

Since this equation must hold for every σ, μ, d, a, b, it must hold for b = 1/σ and
a = −μ/σ no matter what μ and σ are. It then follows that

    L((μ,σ), d) = L( (0,1), (d − μ)/σ ),

for all d, μ, and σ. That is, L((μ,σ), d) = ρ([d − μ]/σ) for some arbitrary function
ρ. It is clear that any L of this form is invariant, so we have found all invariant
loss functions.
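The conclusion of Example 6.32 can be spot-checked numerically. The Python sketch below (illustrative; the particular ρ is an arbitrary choice) confirms that any loss of the form ρ([d − μ]/σ) is unchanged when (μ, σ, d) is replaced by (bμ + a, bσ, bd + a):

```python
import random

def rho(t):
    # an arbitrary function of one variable; any choice works here
    return t ** 2 / (1 + abs(t))

def loss(mu, sigma, d):
    # the general invariant form from the text: rho((d - mu)/sigma)
    return rho((d - mu) / sigma)

rng = random.Random(1)
checks = []
for _ in range(1000):
    mu, d = rng.uniform(-5, 5), rng.uniform(-5, 5)
    sigma = rng.uniform(0.1, 5)
    a, b = rng.uniform(-5, 5), rng.uniform(0.1, 5)
    # transform parameter and action: (mu,sigma) -> (b*mu + a, b*sigma), d -> b*d + a
    checks.append(abs(loss(b * mu + a, b * sigma, b * d + a) - loss(mu, sigma, d)) < 1e-9)
print(all(checks))
```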

The method used at the end of Example 6.32 is actually a very general
method for finding all invariant functions. The first step was to find a
necessary condition for the function to be invariant. The second step is to
check that the condition is also sufficient. A similar method works when
trying to find all equivariant functions. (See Example 6.34 on page 357 for
an illustration.)
Definition 6.33. A decision problem is invariant under G if 𝒫₀ and the
loss are invariant. In such a case, a nonrandomized decision rule δ(x) is
equivariant if δ(gx) = g̃δ(x) for all g ∈ G and all x ∈ X. A randomized
rule δ*(x) is equivariant if δ*(gx)(g̃A) = δ*(x)(A) for all measurable A ⊆ ℵ,
x ∈ X, and g ∈ G. A function v is invariant if v(gx) = v(x) for all x ∈ X
and all g ∈ G.

We will rarely use randomized equivariant rules, except in discussions of


invariant tests (c.f. Section 6.3.3).
Example 6.34. Consider Group 3. Let ℵ = ℝ and suppose that we are only
estimating the location parameter. For example, L(θ, d) = (θ₁ − d)²/θ₂². Here
δ(gx) = g̃δ(x) means that δ(bx₁ + a, ..., bxₙ + a) = bδ(x) + a. Suppose that
a = −bx̄ and b = 1/s, where s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1). Then

    δ(x) = x̄ + s δ( (x₁ − x̄)/s, ..., (xₙ − x̄)/s ).

Since every function of the above form is equivariant, we have found all equivari-
ant rules.
Note that the function v(x) = δ([x₁ − x̄]/s, ..., [xₙ − x̄]/s) is invariant and is an
element of ℵ, while the function h(x) = (x̄, s) is equivariant and can be thought
of as an element of G. With this notation, we have written δ(x) = h(x)v(x).
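The general form derived in Example 6.34 is easy to test numerically. In the Python sketch below (illustrative; the function w is an arbitrary choice), any rule of the form δ(x) = x̄ + s·w((x − x̄)/s) satisfies δ(bx + a) = bδ(x) + a:

```python
import numpy as np

def w(u):
    # an arbitrary function of the standardized residuals (an invariant input)
    return np.median(u) + 0.1 * np.tanh(u).sum()

def delta(x):
    # the general equivariant form from the text: x_bar + s * w((x - x_bar)/s)
    xbar = x.mean()
    s = x.std(ddof=1)
    return xbar + s * w((x - xbar) / s)

rng = np.random.default_rng(2)
x = rng.normal(size=10)
a, b = 3.7, 2.5      # an arbitrary element of Group 3

lhs = delta(b * x + a)
rhs = b * delta(x) + a
print(lhs, rhs)      # equal up to rounding: delta(bx + a) = b*delta(x) + a
```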

Example 6.34 suggests a generalization of Lemma 6.6.


Lemma 6.35. Let h : X → G be equivariant. Then δ : X → ℵ is equiv-
ariant if and only if h⁻¹δ is invariant. (Here h⁻¹ means the element of G
that is the inverse of h, not the inverse of the function h, which might not
even exist.)
PROOF. For the "if" part, assume that h⁻¹δ = v is invariant. Then v(x) ∈ ℵ
and δ(x) = h(x)v(x). So

    δ(gx) = h(gx)v(gx) = gh(x)v(x) = g̃δ(x),

and δ is equivariant.
For the "only if" part, assume that δ is equivariant. Let v = h⁻¹δ. Then

    v(gx) = h(gx)⁻¹δ(gx) = (gh(x))⁻¹g̃δ(x) = h(x)⁻¹δ(x) = v(x),

so v is invariant. □
Example 6.36. Consider Group 1. Lemma 6.6 already says that δ is equivariant
if and only if −δ₀ + δ is invariant, where δ₀ is an arbitrary equivariant function.
Consider Group 3. If ℵ = ℝ, Example 6.34 showed how δ₀⁻¹δ is invariant,
where δ₀(x) = (x̄, s). Now suppose that ℵ = Ω = G and

    L((μ,σ), d) = (d₁ − μ)²/σ² + |log(d₂/σ)|.

Then L is invariant and δ(x) is two-dimensional, say (δ₁(x), δ₂(x)). To say δ(gx) =
g̃δ(x) means that, for every a, b, x, δ(bx₁ + a, ..., bxₙ + a) = (bδ₁(x) + a, bδ₂(x)).
Now, we have already seen that δ₀(x) = (x̄, s) is equivariant, so the most general
equivariant estimator is δ(x) = δ₀(x)v(x), where v(x) ∈ ℵ is invariant. That is,
let v₂(x) be an arbitrary positive invariant function and let v₁(x) be an arbitrary
real-valued invariant function. Then δ(x) = (s v₁(x) + x̄, s v₂(x)) is the general
form of an equivariant estimator.

Definition 6.37. An invariant function v(x) is called maximal invariant
if, for every invariant function u(x), v(x₁) = v(x₂) implies u(x₁) = u(x₂)
(i.e., u is a function of v). For each x ∈ X, we call

    O(x) = {y : y = gx for some g ∈ G}     (6.38)

the orbit of x.
It is clear that an invariant function is always constant on orbits. In the
statement of Theorem 6.8, Y is maximal invariant. Also, in the statement
of Theorem 6.18, Y is maximal invariant. In invariant decision problems,
the risk function of an equivariant decision rule is constant on orbits. This
follows trivially from part 4 of the following lemma.
Lemma 6.39.⁶ In the notation of Definition 6.37,
1. y ∈ O(x) if and only if x ∈ O(y);
2. orbits are equivalence classes;
3. a maximal invariant assumes distinct values on different orbits;
4. suppose that m(x,θ) is invariant under the group actions, that is,
m(gx, ḡθ) = m(x,θ) for all x, g, θ. Then the distribution of m(X,θ)
given Θ = θ is a function of θ through the maximal invariant in Ω.

⁶This lemma is used in the proofs of Lemmas 6.65 and 6.66.

PROOF. Parts 1 and 2 are trivial.⁷ For part 3, suppose that v is maximal
invariant. Consider the function O : X → 2^X defined in (6.38). Clearly, O
is invariant, hence it is a function of v. This means that if O(x) ≠ O(y),
then v(x) ≠ v(y). That is, v assigns different values on different orbits. For
part 4, let r(θ) = P′_θ[m(X,θ) ∈ B] for an arbitrary set B. Then,

    P_{ḡθ}[m(X, ḡθ) ∈ B] = P_θ[m(gX, ḡθ) ∈ B] = P_θ[m(X,θ) ∈ B],

by invariance of m. So, r(ḡθ) = r(θ), and r is invariant, hence it is a function
of the maximal invariant in Ω. □

Corollary 6.40. In an invariant decision problem, the risk function of an
equivariant decision rule is constant on orbits in the parameter space.
This corollary gives a plausible justification for restricting attention to
equivariant decision rules. Since the risk function is constant on orbits in
the parameter space when the loss is invariant, this makes it easier to
compare equivariant rules by means of their risk functions. In particular, if
the group acts transitively on the parameter space (i.e., there is only one
orbit), then the problem of noncomparability of risk functions disappears
altogether.

6.2.2 Equivariance and Changes of Units


One popular justification for the use of equivariant rules is that the result
one obtains should not depend on the units in which the variables and
parameters are measured. For example, suppose that we are estimating
a length in feet and our measurements are in feet. If we were then to
change our measurements (and the length) to inches, our estimate should
be 12 times as large. This sounds like a requirement that the estimator be
scale equivariant. Similarly, if we are estimating a temperature in °C and
we convert all measurements to °K, then we should just add 273 to the
°C estimate to get the °K estimate. This sounds like a requirement that
the estimate be location equivariant. However, neither of these examples
has anything to do with equivariance. First, we show that estimators need
not be equivariant in order for the units of measurement to be converted
correctly. Afterwards, we show why changes of measurement scale have
nothing to do with equivariance.
Example 6.41. Suppose that Θ is tomorrow's temperature in °C, and suppose
that I have a prior distribution for Θ which is N(−12,100).⁸ I will observe X ~
N(θ,25) (in °C) given Θ = θ, and the loss is L(θ, a) = (θ − a)². Ignoring absolute
zero, this problem is a location problem. The only location equivariant rules are
δ_c(x) = x + c. The Bayes rule (in °C) is

    E(Θ|X = x) = (x/25 − 12/100) / (1/25 + 1/100) = (4x − 12)/5.

This is clearly not equivariant. Does that mean that it will violate the rule of
changing units? Of course not!
Suppose that we change all units to °K. The new parameter is Θ*, equal to
tomorrow's temperature in °K, and Θ* = 273 + Θ, so the prior for Θ* (in °K) is
N(261,100). The datum we will observe is X* = X + 273 ~ N(θ*,25) (in °K)
given Θ* = θ*. The loss function is still assumed to be L*(θ*, a*) = (θ* − a*)².
Everything is now ready for finding the Bayes rule, which is the posterior mean
of Θ* (not the mean of Θ, which would clearly be too small by 273):

    E(Θ*|X* = x*) = (4x* + 261)/5 = (4x − 12)/5 + 273,

which is just what we would like.

⁷See Problem 15 on page 139 for the definition of equivalence class.
⁸This example was written during the winter.
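The arithmetic of Example 6.41 is easy to reproduce. The Python sketch below (illustrative; the observed value is an arbitrary choice) applies the standard normal-normal posterior-mean formula in both unit systems and confirms the two answers differ by exactly 273:

```python
def post_mean(x, prior_mean, prior_var=100.0, obs_var=25.0):
    # posterior mean for a N(prior_mean, prior_var) prior and N(theta, obs_var) datum
    return (x / obs_var + prior_mean / prior_var) / (1 / obs_var + 1 / prior_var)

x = 10.0                                            # an arbitrary observation in deg C
celsius = post_mean(x, prior_mean=-12.0)            # (4x - 12)/5
kelvin = post_mean(x + 273.0, prior_mean=261.0)     # same analysis after the shift of units

print(celsius, kelvin)     # the Kelvin answer is the Celsius answer plus 273
```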

Notice that, in Example 6.41, we have treated Θ, X, Θ*, and X* as
pure numbers, but we were careful to say what the numbers stood for. A
great deal of the confusion about changing units is caused by ignoring this
simple, but vital, procedure. We now turn to a careful discussion of this
point.
Changing units of measurement has nothing to do with equivariance. It
is simply a reparameterization. When you reparameterize the problem, you
must reparameterize the loss function, the prior, and the likelihood. Who
would ever dream of using the same proper prior for °K as for °C? The
same applies to any change of units. Surprisingly, the same applies even to
more general transformations that do not correspond to changes of units.
Example 6.42. Let X ~ U(0,θ) given Θ = θ. Suppose that the loss function is
L(θ, a) = (θ − a)² and the prior is f_Θ(θ) = θ⁻² I_{[1,∞)}(θ). Then the posterior is

    f_{Θ|X}(θ|x) = 2c²/θ³ I_{[c,∞)}(θ),

where c = max{1, x}. The Bayes rule is the posterior mean E(Θ|X = x) = 2c.
Suppose that we reparameterize to Θ* = Θ². If there is going to be a connection
between the two decision problems, the loss function had better transform in such
a way that we are essentially estimating the same thing. That is, the loss had
better be L*(θ*, a*) = (√θ* − √a*)². The prior for Θ* is

    f_{Θ*}(θ*) = 1/(2θ*^{3/2}) I_{[1,∞)}(θ*),

and the conditional distribution of X given Θ* = θ* is U(0, √θ*). As a red
herring, we could also transform the data to X* = X², and then

    f_{X*|Θ*}(x*|θ*) = 1/(2√(x*θ*)) I_{(0,θ*)}(x*).

(This makes the situation look more like the equivariance setup.) Now, we can
find the posterior of Θ*,

    f_{Θ*|X*}(θ*|x*) = c*/θ*² I_{[c*,∞)}(θ*),

where c* = max{x*, 1} = c². The Bayes rule for the new loss function is not the
posterior mean; rather it is the a* that minimizes

    ∫_{c*}^∞ (√θ* − √a*)² (c*/θ*²) dθ*.

By expanding the square and differentiating with respect to a*, we find that the
posterior expected loss is minimized if a* = 4c*, which is precisely the square of
2c.
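The claim that a* = 4c* can be checked numerically. Differentiating the expected loss in √a* shows that the minimizer satisfies √a* = E(√Θ*|X* = x*); the Python sketch below (illustrative; the value of c is an arbitrary choice) evaluates that posterior expectation by substituting u = c*/θ* and compares its square with 4c*:

```python
import numpy as np

c = 1.5              # c = max{1, x}; then c* = c^2 in the reparameterized problem
cstar = c ** 2

# E(sqrt(Theta*) | x*) under the posterior c*/theta*^2 on [c*, infinity).
# Substituting u = c*/theta* turns it into the integral of sqrt(c*/u) over (0, 1],
# evaluated here by the midpoint rule.
m = 1_000_000
u = (np.arange(m) + 0.5) / m
root_mean = np.mean(np.sqrt(cstar / u))

# first-order condition: sqrt(a*) = E(sqrt(Theta*) | x*), so a* = root_mean^2
print(root_mean ** 2, 4 * cstar)      # both should be close to 9
```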
In each of the above examples, the Bayes rule in the reparameterized
problem is the reparameterization of the original Bayes rule. This is actually
true in general.
Proposition 6.43. Suppose that we reparameterize to Θ′ = g(Θ), where
g : Ω → Ω′ is bimeasurable and ℵ′ = Ω′. If the loss function changes
from L(θ, a) to L′(θ′, a′) = L(g⁻¹(θ′), g⁻¹(a′)) and δ(x) is the formal
Bayes rule for prior f_Θ(θ) (based on data X = x with conditional den-
sity f_{X|Θ}(x|θ)), then the formal Bayes rule in the reparameterized problem
is δ′(x) = g(δ(x)).
A note about transformations and loss functions is in order here. In
Proposition 6.43, we transformed the loss function L to L′. In the usual
equivariance setup, the loss function L is assumed to be invariant. That is,
L(g⁻¹(θ′), g⁻¹(a′)) = L(θ′, a′), as was the case in Example 6.41. But this
was a mere coincidence and had nothing to do with changing units. Even
if the loss function had not been location invariant in Example 6.41, the
Bayes rule would have respected the change of units, so long as the correct
loss function were used after the change of units. Consider the following
modification of Example 6.41.
Example 6.44 (Continuation of Example 6.41; see page 359). Suppose that the
loss function for the °C problem is

    L(θ, a) = (θ − a)²    if θ ≥ 0,
              2(θ − a)²   if θ < 0.

This says that an error of a certain magnitude is twice as costly if the true
temperature is below freezing than if it is freezing or above. If we try to use the
same loss function in °K, then no such distinction is made in the costs of errors
of the same magnitude. It is ludicrous to claim that these two decision problems
are essentially the same problem with different units. Of course, the loss L here
is not invariant. The transformed loss

    L′(θ′, a′) = (θ′ − a′)²    if θ′ ≥ 273,
                 2(θ′ − a′)²   if θ′ < 273

is the appropriate one to use in the °K scale.

Note that no transformation of the data is needed in Proposition 6.43.


If one wishes to parallel the equivariance situation more completely, one
can feel free to transform the data also, but it makes no difference to the
conclusion of the proposition, so long as the transformation is bimeasurable,
since the posterior distribution will be unchanged. The conclusion here is
that there is no need for decision rules to be equivariant in order to obey
conversion of units. Bayesian decision rules obey conversion of units without
being equivariant.
Finally, we consider the root of the misconception that equivariant esti-
mators are required in order to obey conversion of units.⁹ Imagine that I
will measure the length of a table with an inaccurate device. Let the mea-
surement I hope to get in feet be denoted X, and let the sample space be
X = ℝ⁺. That is, X is the set of possible measurements, in feet, which I
might obtain. Let Θ denote the "true" length of the table (whatever that
means) in feet, and suppose that the conditional distribution of X given
Θ = θ is denoted P_θ, where θ can be any positive number. That is, Ω = ℝ⁺
is the set of possible values of the "true" length of the table in feet. (Ob-
viously some values of θ are far less likely than others as candidates for Θ,
but we will ignore the Bayesian aspects of this problem for the time be-
ing.) If we convert our observed measurement to inches, we get X′ = 12X.
This situation resembles Group 2, scalar multiplication. Suppose that the
parametric family 𝒫₀ = {P_θ : θ ∈ Ω} is invariant under Group 2. We could
mistakenly think of X′ as g₁₂X, and then we could construct ḡ₁₂θ = 12θ.
As we noted earlier, it is now perfectly correct to say that the distribution
of 12X is P₁₂θ. But here is where things get confused. The transformed θ,
namely ḡ₁₂θ, is supposed to be an element of Ω, which consists of the pos-
sible values of the "true" length of the table in feet, not inches! Although it
is perfectly permissible to think of 12θ as the number of inches representing
the true length of the table in inches, it is absolutely forbidden to think of
ḡ₁₂θ as anything other than a possible length of the table in feet. In like
manner, the sample space X is the set of possible measurements in feet, not
inches. The transformed measurement g₁₂X is 12 times as many feet as X,
not the number of inches in X feet. Otherwise, how would we ever distin-
guish whether the number 12 ∈ X stood for 12 feet or 1 foot converted to
12 inches? It cannot be both ways. We made it perfectly clear that x ∈ X
stands for x feet and hence 12x ∈ X stands for 12x feet, not x feet con-
verted to inches. Hence g₁₂X is not the converted measurement of the table
to inches, but rather a measurement of the table in feet, which is 12 times
as large as the measurement X. There is no other mathematically correct
way to interpret these transformations. Hence, the invariance of the dis-
tributions has absolutely nothing to do with conversion of units from feet
⁹The author is indebted to Morris H. DeGroot for personal discussions about
equivariance and invariance which helped immensely in clarifying these concepts.

to inches. That conversion is handled in a straightforward manner as a


reparameterization, the way we did in Example 6.41 and Proposition 6.43.

6.2.3 Minimum Risk Equivariant Decisions


In location estimation with squared-error loss, Pitman's estimator was
MRE and it was the generalized Bayes rule with respect to a uniform
"prior" distribution. The feature of that prior distribution which made it
the one to use in location estimation was the fact that it is invariant with
respect to location shifts. Similarly, in scale estimation we discovered that
the MRE, with a particular invariant loss, was also the Bayes rule with
respect to an improper prior μ_Θ with Radon-Nikodym derivative 1/θ with
respect to Lebesgue measure. This measure is invariant with respect to
scale changes. The pattern emerging here can be extended to more general
groups, once we know how to find invariant measures.
Definition 6.45. Let G be a group with a σ-field Γ of subsets of G. Suppose
that for each g ∈ G and A ∈ Γ, gA ∈ Γ and Ag ∈ Γ. A measure λ on Γ
is called left Haar measure (LHM) (or left invariant measure) if, for every
g ∈ G and every A ∈ Γ, λ(gA) = λ(A). Similarly, ρ is called right Haar
measure (RHM) (or right invariant measure) if ρ(Ag) = ρ(A) for every A ∈ Γ
and every g ∈ G.
It should be noted that every positive multiple of an invariant measure
is another invariant measure, so we may introduce an arbitrary constant
multiple if we wish. It should also be easy to see that if G is abelian, then
RHM = LHM.
Example 6.46. Group 1 is abelian, so LHM = RHM. Note that ∫_A dx = ∫_{A+c} dx,
for all measurable A and all real c, so Lebesgue measure is invariant.
Group 2 is abelian, so LHM = RHM. Note that

    ∫_A dx/x = ∫_{cA} dx/x,

so the measure with Radon-Nikodym derivative 1/x with respect to Lebesgue
measure is invariant.
Group 3 is not abelian, so we need to find LHM and RHM separately. The
group action is (a, b) ∘ (c, d) = (bc + a, bd). Suppose that h(x, y) is a Radon–Nikodym derivative for LHM with respect to Lebesgue measure. Then

∫_A h(x, y) dx dy = ∫_{(a,b)A} h(x, y) dx dy.

Transform the left-hand side by y = w/b and x = (z − a)/b. The Jacobian is
J = 1/b², and we get

∫_A h(x, y) dx dy = ∫_{(a,b)A} (1/b²) h((z − a)/b, w/b) dz dw = ∫_{(a,b)A} h(z, w) dz dw.

Since this must hold for all a, b, A, we have

(1/b²) h((z − a)/b, w/b) = h(z, w),

for all a, b and a.e. [dz dw]. So, if a = z and b = w, we must have h(z, w) =
h(0, 1)/w². It follows that LHM has Radon–Nikodym derivative 1/y² with respect to Lebesgue measure. Next, suppose that h(x, y) is the Radon–Nikodym
derivative for RHM. Then

∫_A h(x, y) dx dy = ∫_{A(c,d)} h(x, y) dx dy.

Transform the left-hand side by y = w/d and x = z − wc/d. Then the Jacobian
is J = 1/d, and we get

∫_A h(x, y) dx dy = ∫_{A(c,d)} (1/d) h(z − wc/d, w/d) dz dw = ∫_{A(c,d)} h(z, w) dz dw.

Since this must hold for all c, d, A, we have

(1/d) h(z − wc/d, w/d) = h(z, w),

for all c, d and a.e. [dz dw]. So, if c = z and d = w, we must have h(z, w) =
h(0, 1)/w. It follows that RHM has Radon–Nikodym derivative 1/y with respect
to Lebesgue measure. This is not the same as LHM.
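Because the translates of a rectangle under Group 3 have simple closed descriptions, the two invariance claims can be checked numerically. The sketch below is our own illustration (the function and variable names are not from the text): it integrates 1/y² over a rectangle A and over its left translate gA, and 1/y over A and over its right translate A(c, d), by midpoint quadrature, and checks equality.

```python
import numpy as np

def integrate(density, w_lo, w_hi, z_lo, z_hi, n=1000):
    """Midpoint quadrature of density(z, w) over the region
    w in [w_lo, w_hi], z in [z_lo(w), z_hi(w)]."""
    w_edges = np.linspace(w_lo, w_hi, n + 1)
    w_mid = 0.5 * (w_edges[:-1] + w_edges[1:])
    dw = w_edges[1] - w_edges[0]
    total = 0.0
    for wm in w_mid:
        z_edges = np.linspace(z_lo(wm), z_hi(wm), n + 1)
        z_mid = 0.5 * (z_edges[:-1] + z_edges[1:])
        vals = np.broadcast_to(density(z_mid, wm), z_mid.shape)
        total += vals.sum() * (z_edges[1] - z_edges[0]) * dw
    return total

lhm = lambda z, w: 1.0 / w**2            # candidate LHM density 1/y^2
rhm = lambda z, w: 1.0 / w               # candidate RHM density 1/y
x0, x1, y0, y1 = -1.0, 3.0, 1.0, 2.0     # rectangle A = [x0,x1] x [y0,y1]
a, b = 0.7, 2.5                          # group element (a, b), b > 0

# lambda(A) versus lambda(gA); gA = {(b*x + a, b*y)} is again a rectangle.
lam_A = integrate(lhm, y0, y1, lambda w: x0, lambda w: x1)
lam_gA = integrate(lhm, b * y0, b * y1,
                   lambda w: b * x0 + a, lambda w: b * x1 + a)

# rho(A) versus rho(A(c,d)); A(c,d) = {(y*c + x, y*d)} is a sheared strip.
c, d = a, b
rho_A = integrate(rhm, y0, y1, lambda w: x0, lambda w: x1)
rho_Ad = integrate(rhm, d * y0, d * y1,
                   lambda w: (w / d) * c + x0, lambda w: (w / d) * c + x1)

assert abs(lam_A - lam_gA) < 1e-5 * lam_A    # left invariance of 1/y^2
assert abs(rho_A - rho_Ad) < 1e-5 * rho_A    # right invariance of 1/y
print(lam_A, lam_gA, rho_A, rho_Ad)
```

For this rectangle, λ(A) = (x₁ − x₀)(1/y₀ − 1/y₁) = 2 exactly, so the quadrature can also be checked against the closed form.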

Not only are measures of sets invariant under group operations when
using LHM, but certain integrals are also invariant.

Lemma 6.47.10 If λ is LHM on G and f is integrable over G, then for all
g ∈ G,

∫_G f(g ∘ h) dλ(h) = ∫_G f(h) dλ(h).

PROOF. First let f(h) = I_B(h) for some B ∈ Γ. Then

∫_G f(g ∘ h) dλ(h) = ∫ I_B(g ∘ h) dλ(h) = ∫_{g⁻¹B} dλ(h)
= λ(g⁻¹B) = λ(B) = ∫_G f(h) dλ(h).

By adding, we can extend to simple functions f. By the monotone convergence theorem A.52, we can extend to all nonnegative measurable functions,
and by subtraction to all integrable functions. □
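For a concrete instance of Lemma 6.47, take Group 2, where LHM is dλ = dx/x on (0, ∞). The sketch below (our own illustration, with a made-up test function) evaluates both sides of the lemma by substituting u = log h, under which dλ becomes du and the group action becomes a shift.

```python
import numpy as np

def haar_integral(f, lo=-30.0, hi=30.0, n=120_000):
    """Approximate the integral of f(h) dλ(h) with dλ = dh/h on (0, ∞).
    Substituting u = log h turns it into the integral of f(e^u) du,
    truncated to [lo, hi] and evaluated by a Riemann sum."""
    u = np.linspace(lo, hi, n)
    du = u[1] - u[0]
    return f(np.exp(u)).sum() * du

f = lambda h: np.exp(-np.log(h) ** 2)   # integrable test function
g = 7.3                                 # group element: multiplication by g

lhs = haar_integral(lambda h: f(g * h)) # integral of f(g∘h) dλ(h)
rhs = haar_integral(f)                  # integral of f(h) dλ(h)
assert abs(lhs - rhs) < 1e-9 * rhs
print(lhs, rhs)
```

Both sides equal the Gaussian integral √π here, since f(e^u) = exp(−u²); multiplication by g merely shifts the integrand in u, which leaves the integral unchanged.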
For a detailed discussion of Haar measure, see Nachbin (1965) or Halmos
(1950, Chapter XI). For example, there are results giving conditions under
which LHM exists. Since we will only use Haar measure explicitly when it
does exist, we will not prove its existence. However, we will need to know
that Haar measure is essentially unique.

10 This lemma is used in the proofs of Lemmas 6.55 and 6.62.



Lemma 6.48.11 Let (G, ∘) be a group, and let (G, 𝒯) be a topological space
with the Borel σ-field Γ. Suppose that λ is σ-finite and not identically 0 LHM
on (G, Γ). Suppose that the function f* : G × G → G defined by f*(g, h) =
g⁻¹ ∘ h is continuous. If λ′ is also σ-finite and not identically 0 LHM on
(G, Γ), then there exists a finite positive scalar c such that λ′ = cλ.

PROOF. The first step is to prove that r(g) = I_{B⁻¹}(g)/λ(Ag) is a measurable function of g for each A, B ∈ Γ. Since f*(g, h) is continuous, it is
continuous in g for fixed h, hence the map g ↦ f*(g, e) = g⁻¹ is continuous,
hence measurable. If B ∈ Γ, the inverse image of B under this map is B⁻¹,
hence B⁻¹ ∈ Γ. It follows that I_{B⁻¹}(g) is measurable. The function
f′(g, h) = h ∘ g = f*(f*(h, e), g) is also continuous, hence measurable. It
follows that v(g, h) = (g⁻¹, h ∘ g) is continuous and measurable. It is easy
to see that v⁻¹ = v, so if A ∈ Γ, G × A ∈ Γ ⊗ Γ, hence v(G × A) ∈ Γ ⊗ Γ
and m(g, h) = I_{v(G×A)}(g⁻¹, h) is a measurable function. Define
f̄(g) = ∫ m(g, h) dλ(h). By Lemma A.67, this is a measurable function.
Now notice that I_{v(G×A)}(g⁻¹, h) = I_{Ag}(h) and calculate

f̄(g) = ∫ I_{Ag}(h) dλ(h) = λ(Ag).

It follows that r is measurable.
Next, we prove that the following two one-to-one bicontinuous functions
preserve measure in the product space (G × G, Γ ⊗ Γ, λ′ × λ):

T₁(g, h) = (g, g ∘ h),    T₂(g, h) = (h ∘ g, h).

The proofs are similar, and we only prove that T₂ preserves measure. Note
that E ∈ Γ ⊗ Γ implies that, for every h ∈ G,

{g : (g, h) ∈ T₂(E)} = h{g : (g, h) ∈ E} = hE_h,

where E_h = {g : (g, h) ∈ E}. It follows from Tonelli's theorem A.69 that

λ′ × λ(T₂(E)) = ∫ I_{T₂(E)}(g, h) dλ′ × λ(g, h) = ∫ λ′(hE_h) dλ(h)
= ∫ λ′(E_h) dλ(h) = λ′ × λ(E).

So T₂ preserves measure. Also, T₁⁻¹T₂(g, h) = (h ∘ g, g⁻¹) preserves measure.
Hence, for every nonnegative measurable function v : G × G → ℝ,

∫ v(g, h) dλ′ × λ(g, h) = ∫ v(h ∘ g, g⁻¹) dλ′ × λ(g, h).    (6.49)

11 This lemma is used to prove Corollary 6.52. The proof is adapted from Halmos
(1950, Theorem 60.C).

This is proven by noting that it is true for indicators of events, hence
for simple functions, hence for nonnegative measurable functions by the
monotone convergence theorem A.52.
Let A ∈ Γ have 0 < λ(A) < ∞ and let B ∈ Γ. Define

r(h) = I_{B⁻¹}(h)/λ(Ah),

which we have already shown to be measurable. We prove next that

λ′(B) = λ′(A) ∫ r(h) dλ(h).    (6.50)

Use Tonelli's theorem A.69 and (6.49) to write

λ′(A) ∫ r(h) dλ(h) = ∫ I_A(g) r(h) dλ′ × λ(g, h) = ∫ I_A(h ∘ g) r(g⁻¹) dλ′ × λ(g, h)
= ∫ λ(Ag⁻¹) r(g⁻¹) dλ′(g) = ∫ I_B(g) dλ′(g) = λ′(B),

where the second to last equation follows from r(g⁻¹)λ(Ag⁻¹) = I_B(g).
Next, apply (6.50) with λ′ = λ to get

λ(B) = λ(A) ∫ r(h) dλ(h).

Multiply both sides of this equation by λ′(A) and apply (6.50) again to
get λ′(A)λ(B) = λ(A)λ′(B). Let c = λ′(A)/λ(A). Since (6.50) is true for
all B, it is true if 0 < λ′(B) < ∞. It follows that 0 < λ′(A) < ∞, hence
0 < c < ∞, and the proof is complete. □
For the rest of this text, whenever LHM or RHM and groups are discussed,
we will assume that the group satisfies the conditions of Lemma 6.48 and
that the measures are σ-finite and not identically 0.

Lemma 6.51.12 If λ is LHM on a group G, then ρ(A) = λ(A⁻¹) is RHM,
and we call ρ the RHM related to λ.

PROOF. Note that ρ(Ag) = λ(g⁻¹A⁻¹) = λ(A⁻¹) = ρ(A). □

The following corollary to Lemmas 6.48 and 6.51 now follows easily.

Corollary 6.52. Assume the conditions of Lemma 6.48. If ρ and ρ′ are
both σ-finite and not identically 0 RHM on (G, Γ), then there exists a finite
positive scalar c′ such that ρ′ = c′ρ.

12This lemma is used to prove Corollary 6.52 and Lemma 6.54.



The following result, whose proof is based on the same concept as the
proof of Lemma 6.47, is useful for converting between integrals with respect
to LHM and RHM.

Proposition 6.53.13 If λ is LHM, ρ is the related RHM, and f is integrable with respect to ρ, then ∫ f(g) dρ(g) = ∫ f(g⁻¹) dλ(g). If f is integrable
with respect to λ, then ∫ f(g) dλ(g) = ∫ f(g⁻¹) dρ(g).

The following result gives a method for converting one LHM or RHM
into many others.

Lemma 6.54.14 Let ρ_g(B) be defined as ρ(gB), and let λ_g(B) = λ(Bg).
Then ρ_g is RHM and λ_g is LHM for each g ∈ G.

PROOF. Since λ_g(hB) = λ(hBg) = λ(Bg) = λ_g(B), we have that λ_g is
LHM, and a similar argument works for ρ_g. □

By Lemma 6.48, λ_g is a multiple of λ. Let the multiple be c_g. Similarly,
by Corollary 6.52, ρ_g is a multiple of ρ, so define c′_g by ρ_g(B) = c′_g ρ(B). In
abelian groups, c_g = c′_g = 1 for all g, if λ and ρ are related. We introduce ρ_g
because it will play an important role in the proof that Pitman's estimator
is the formal Bayes rule with respect to an invariant measure.
We obtain interesting results if we replace λ by ρ in Lemma 6.47 and
replace ρ by λ in Proposition 6.53.

Lemma 6.55.15 If ρ is RHM and g ∈ G and f is integrable with respect
to ρ, then

∫ f(g ∘ h) dρ(h) = c′_{g⁻¹} ∫ f(h) dρ(h).

If λ is LHM and g ∈ G and f is integrable with respect to λ, then

∫ f(h ∘ g) dλ(h) = c_{g⁻¹} ∫ f(h) dλ(h).

PROOF. In a manner similar to the proof of Lemma 6.47, we can prove that

∫ f(h) dρ_g(h) = ∫ f(g⁻¹ ∘ h) dρ(h),

for all g ∈ G. It follows that

∫ f(g ∘ h) dρ(h) = ∫ f(h) dρ_{g⁻¹}(h) = c′_{g⁻¹} ∫ f(h) dρ(h).

The proof of the other part is virtually identical, using Proposition 6.53. □

Actually, the numbers c_g and c′_g are related.

13 This proposition is used in the proofs of Lemmas 6.55, 6.56, 6.65, 6.66, and Theorem 6.74.
14 This lemma is used in the proof of Lemma 6.68.
15 This lemma is used in the proofs of Lemmas 6.56, 6.65, and 6.66 and Theorem 6.74.

Lemma 6.56.16 If λ and ρ are related, then c_g = 1/c′_g = c′_{g⁻¹} for all
g ∈ G. Also c′_g = c_{g⁻¹}.

PROOF. Let f be integrable with respect to ρ. Use Lemma 6.55 twice to
write

∫ f(h) dρ(h) = ∫ f(g ∘ g⁻¹ ∘ h) dρ(h) = c′_g ∫ f(g ∘ h) dρ(h)
= c′_g c′_{g⁻¹} ∫ f(h) dρ(h),

from which it follows that c′_{g⁻¹} = 1/c′_g. Next, use what we just proved and
Proposition 6.53 and Lemma 6.55 to show that

∫ f(h⁻¹) dλ(h) = ∫ f(h) dρ(h) = c′_g ∫ f(g ∘ h) dρ(h)
= c′_g ∫ f(g ∘ h⁻¹) dλ(h) = c′_g ∫ f([h ∘ g⁻¹]⁻¹) dλ(h)
= c′_g c_g ∫ f(h⁻¹) dλ(h),

from which it follows that c_g = 1/c′_g. Then c′_g = c_{g⁻¹} follows trivially. □

Example 6.57. Consider Group 3, so that ρ(gB) = ∫_{gB} (1/y) dx dy. Let g =
(g₁, g₂) and transform by y = g₂w and x = g₂z + g₁. Then J = g₂², and

∫_{gB} (1/y) dx dy = ∫_B (g₂²/(g₂w)) dz dw = g₂ ∫_B (1/w) dz dw = g₂ ρ(B),

so c′_g = g₂.
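Both constants for Group 3 can be computed numerically and checked against Lemma 6.56 and this example. The sketch below is our own illustration (names are not from the text): it measures a rectangle B, its left translate gB under the RHM density 1/y, and its right translate Bg under the LHM density 1/y², then checks that c′_g = g₂ and c_g·c′_g = 1.

```python
import numpy as np

def integrate(density, w_lo, w_hi, z_lo, z_hi, n=1000):
    # Midpoint quadrature over {(z, w): w in [w_lo, w_hi], z in [z_lo(w), z_hi(w)]}.
    w_edges = np.linspace(w_lo, w_hi, n + 1)
    w_mid = 0.5 * (w_edges[:-1] + w_edges[1:])
    dw = w_edges[1] - w_edges[0]
    total = 0.0
    for wm in w_mid:
        z_edges = np.linspace(z_lo(wm), z_hi(wm), n + 1)
        z_mid = 0.5 * (z_edges[:-1] + z_edges[1:])
        vals = np.broadcast_to(density(z_mid, wm), z_mid.shape)
        total += vals.sum() * (z_edges[1] - z_edges[0]) * dw
    return total

lhm = lambda z, w: 1.0 / w**2
rhm = lambda z, w: 1.0 / w
x0, x1, y0, y1 = 0.0, 2.0, 1.0, 3.0      # rectangle B
g1, g2 = 1.3, 0.6                        # group element g = (g1, g2)

rho_B = integrate(rhm, y0, y1, lambda w: x0, lambda w: x1)
lam_B = integrate(lhm, y0, y1, lambda w: x0, lambda w: x1)

# gB = {(g2*x + g1, g2*y)} is a rectangle; rho_g(B) = rho(gB) = c'_g rho(B).
rho_gB = integrate(rhm, g2 * y0, g2 * y1,
                   lambda w: g2 * x0 + g1, lambda w: g2 * x1 + g1)

# Bg = {(y*g1 + x, y*g2)} is sheared; lambda_g(B) = lambda(Bg) = c_g lambda(B).
lam_Bg = integrate(lhm, g2 * y0, g2 * y1,
                   lambda w: (w / g2) * g1 + x0, lambda w: (w / g2) * g1 + x1)

c_prime = rho_gB / rho_B                 # should be g2 (this example)
c = lam_Bg / lam_B                       # should be 1/g2 (Lemma 6.56)
assert abs(c_prime - g2) < 1e-4
assert abs(c * c_prime - 1.0) < 1e-4
print(c, c_prime)
```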

There is a large class of examples in which the two groups of transformations
G and Ḡ are isomorphic and are similar to both the parameter space
and part of the sample space. We make this precise with the following
condition.

Assumption 6.58. Assume the following conditions:
The distributions {P′_θ : θ ∈ Ω} are invariant under the actions of
groups G and Ḡ.
LHM λ and related RHM ρ on G exist.
The conditions of Lemma 6.48 hold for G, so that LHM and RHM
are essentially unique.
The mapping φ : G → Ḡ defined by φ(g) = ḡ is a group isomorphism.
There is a bimeasurable mapping η : Ω → Ḡ which satisfies ḡ ∘ η(θ) =
η(ḡθ), for every ḡ ∈ Ḡ and θ ∈ Ω.
There exists a bimeasurable function t : X → G × Y for some space
Y (where we write t(X) = (H, Y)), such that, for every g ∈ G and
x ∈ X, t(gx) = (g ∘ h, y) if t(x) = (h, y).
For every θ, the distribution on G × Y induced from P′_θ by t has a
density with respect to λ × ν, where ν is some measure on Y.
Note that the Y part of t(X) = (H, Y) is invariant when Assumption 6.58
holds. Also, since the function t : X → G × Y is bimeasurable, it will be
convenient to assume that X = G × Y and X = t(X). Since t is one-to-one,
the posterior of Θ given t(X) = (h, y) will be the same as the posterior given
X = t⁻¹(h, y). Similarly, we will let P′_θ stand for the induced distribution
on G × Y, and we will let f_{X|Θ}(h, y|θ) be the Radon–Nikodym derivative
of P′_θ with respect to λ × ν.
Theorem 6.59. Assume Assumption 6.58. Let ℵ be an action space and
L : Ω × ℵ → ℝ be a loss function. Let G̃ be a group of transformations of
ℵ such that L is invariant. Then, if the formal Bayes rule with respect to ρ
exists, it is the MRE rule, it is MRE conditional on Y, and Y is ancillary.
Before proving Theorem 6.59, here are some examples.
Example 6.60. As we mentioned earlier, Pitman's estimator is MRE and it is
the formal Bayes rule with respect to RHM on the location Group 1. Here G = ℝ
and we can map X to G × Y if we let Y = ℝⁿ⁻¹ and t(x₁, …, xₙ) = (xₙ, y), where
y = (x₁ − xₙ, …, xₙ₋₁ − xₙ). Then t(gx) = (gxₙ, y). The loss L(θ, a) = (θ − a)²
is invariant.
In the scale version, Group 2, G = ℝ⁺ and we can let Y = ℝⁿ⁻¹ × {−1, 1}
and H = |xₙ|, so that t(x) = (|xₙ|, y), where y = (x₁/|xₙ|, …, xₙ/|xₙ|). Then
t(gx) = (g|xₙ|, y). RHM is dx/x. If we use the invariant loss L(θ, a) = (θ −
a)²/θ², then Theorem 6.18 says that the MRE decision is indeed the formal
Bayes rule with respect to RHM. Theorem 6.59 will also apply if the loss is
L(θ, a) = [log(θ/a)]².
With Group 3, we can write t(x) = ((x₍ₙ₎, x₍ₙ₎ − x₍₁₎), y), where x₍ᵢ₎ is the ith
ordered element of x and

y = ((x₍₂₎ − x₍₁₎)/(x₍ₙ₎ − x₍₁₎), …, (x₍ₙ₋₁₎ − x₍₁₎)/(x₍ₙ₎ − x₍₁₎), π),

where π is the permutation required to return the order statistic to the original
data. Here Y = ℝⁿ⁻² × Π, where Π is the set of permutations, and G = ℝ × ℝ⁺.
Then t(gx) = (g ∘ (x₍ₙ₎, x₍ₙ₎ − x₍₁₎), y). There are several invariant losses. Here are
three:

L₁(θ, a) = (θ₁ − a)²/θ₂²,    ℵ = ℝ;
L₂(θ, a) = (θ₂ − a)²/θ₂²,    ℵ = ℝ⁺;
L₃(θ, a) = (θ₁ − a₁)²/θ₂² + (θ₂ − a₂)²/θ₂²,    ℵ = ℝ × ℝ⁺.

The first is for location estimation, the second is for scale estimation, and the third
is for simultaneous estimation of both. RHM has Radon–Nikodym derivative 1/σ
with respect to Lebesgue measure. We can explicitly work through the normal
distribution case. The likelihood function based on n observations is

(2πσ²)^{−n/2} exp(−[w + n(μ − x̄)²]/(2σ²)),

where w = Σ_{i=1}^n (xᵢ − x̄)². To find the posterior, we multiply by 1/σ and find the
appropriate constant. If we let τ = σ⁻², then σ = τ^{−1/2} and dσ = −τ^{−3/2} dτ/2.
The posterior is, for some constant c,

c τ^{(n−2)/2} exp(−(τ/2)[w + n(μ − x̄)²]).

This has the form of the product of an N(x̄, 1/(nτ)) density times a Γ([n −
1]/2, w/2) density. The posterior distribution of

√n (M − x̄)/Σ    (6.61)

is N(0, 1) given Σ². But since this distribution does not depend on Σ², it is also
the marginal posterior. Also, the posterior distribution of w/Σ² is χ²_{n−1}, where
w = Σ_{i=1}^n (xᵢ − x̄)². These distributions parallel the prior conditional distributions of the sufficient statistics given Θ. That is, prior to seeing the data and
conditional on Θ, (6.61) has N(0, 1) distribution and W/Σ² ~ χ²_{n−1}. The posterior distributions were named fiducial distributions by Fisher (1935), because
they seem to fall right out of the conditional distributions given Θ without any
need for a prior on Θ. The quantity in (6.61) and w/Σ² are called pivotal quantities. These will be special cases of a more general result that will come later
(Corollary 6.67).
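The two pivotal distributions just described are easy to confirm by simulation. Below is a sketch with made-up summary statistics (the values of n, x̄, and w are ours): it draws (M, Σ) from the posterior derived above and checks that √n(M − x̄)/Σ behaves like a standard normal and w/Σ² like a χ²_{n−1}.

```python
import numpy as np

rng = np.random.default_rng(0)
n, xbar, w = 10, 1.3, 4.2          # made-up summary statistics
m = 1_000_000

# Posterior under the RHM prior (1/σ) dμ dσ:
# τ = σ^{-2} ~ Gamma((n-1)/2, rate w/2), and μ | τ ~ N(xbar, 1/(n τ)).
tau = rng.gamma((n - 1) / 2, 2.0 / w, size=m)
mu = rng.normal(xbar, 1.0 / np.sqrt(n * tau))

pivot = np.sqrt(n) * (mu - xbar) * np.sqrt(tau)   # the quantity (6.61)
chisq = w * tau                                   # w / σ²

assert abs(pivot.mean()) < 0.01                   # N(0,1): mean 0
assert abs(pivot.var() - 1.0) < 0.01              # N(0,1): variance 1
assert abs(chisq.mean() - (n - 1)) < 0.05         # χ²_{n-1}: mean n-1
assert abs(chisq.var() - 2 * (n - 1)) < 0.5       # χ²_{n-1}: variance 2(n-1)
print(pivot.mean(), chisq.mean())
```

Note that NumPy's `gamma` takes a scale parameter, so rate w/2 becomes scale 2/w.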

The proof of Theorem 6.59 will proceed through a series of lemmas.

Lemma 6.62. Assume Assumption 6.58. For every θ ∈ Ω and every g ∈ G,

f_{X|Θ}(h, y|θ) = f_{X|Θ}(g ∘ h, y|ḡθ), a.e. [λ × ν].    (6.63)

PROOF. Let B be an arbitrary measurable set. Since P′_θ(X ∈ B) = P′_{ḡθ}(X ∈ gB), we
have, for every g ∈ G and every θ ∈ Ω,

∫∫ I_B(h, y) f_{X|Θ}(h, y|θ) dλ(h) dν(y)
= ∫∫ I_{gB}(h, y) f_{X|Θ}(h, y|ḡθ) dλ(h) dν(y)
= ∫∫ I_B(g⁻¹ ∘ h, y) f_{X|Θ}(h, y|ḡθ) dλ(h) dν(y)
= ∫∫ I_B(h, y) f_{X|Θ}(g ∘ h, y|ḡθ) dλ(h) dν(y),

where the last equality follows from Lemma 6.47. Since this is true for all
measurable B, the integrands of the first and last lines must be equal a.e. [λ × ν].
This immediately implies (6.63). □
A simple corollary to this result is obtained by letting g = φ⁻¹(η(θ)⁻¹),
where φ and η are defined in Assumption 6.58.

Corollary 6.64. Assume Assumption 6.58. There exists a function r :
G × Y → ℝ such that, for every θ ∈ Ω,

f_{X|Θ}(h, y|θ) = r(φ⁻¹(η(θ)⁻¹) ∘ h, y), a.e. [λ × ν].

The formula given in the statement of Corollary 6.64 is particularly cumbersome due to the use of the notation φ⁻¹(η(·)). In fact, some of the
proofs below would be almost unreadable if we continued to use this notation for the sake of mathematical precision. For this reason, we will take
the following liberty with the notation for the remainder of the proof of
Theorem 6.59. We will pretend that Ω = Ḡ = G so that φ and η are just
identity transformations, and we will not have to put the bar over elements
of Ḡ. This should not cause any confusion, since the sets really do behave
virtually identically. For example, Corollary 6.64 now says

f_{X|Θ}(h, y|θ) = r(θ⁻¹ ∘ h, y), a.e. [λ × ν].

The following lemma will be useful both here and later.

Lemma 6.65.17 Under Assumption 6.58, Y is ancillary and the posterior
density of Θ with respect to RHM is

f_{Θ|X}(ψ|h, y) = c_h f_{H|Y,Θ}(h|y, ψ),

where the second factor on the right is the conditional density of H given
Y and Θ.

PROOF. Since Ω = G, there is only one orbit in the parameter space, hence
the maximal invariant is constant. Since Y is invariant, Lemma 6.39, part
4 shows that Y is ancillary.
To calculate the posterior density of Θ given the data, we need the
marginal "density" of the data:

f_X(h, y) = ∫ f_{X|Θ}(h, y|ψ) dρ(ψ) = ∫ r(ψ⁻¹ ∘ h, y) dρ(ψ)
= ∫ r(ψ ∘ h, y) dλ(ψ) = c_{h⁻¹} ∫ r(ψ, y) dλ(ψ)
= c_{h⁻¹} ∫ f_{X|Θ}(ψ, y|e) dλ(ψ) = f_Y(y) c_{h⁻¹},

where the second and fifth equalities follow from Corollary 6.64, the third
follows from Proposition 6.53, the fourth follows from Lemma 6.55, and the
sixth follows from the fact that Y is ancillary and from Lemma 6.56. The
posterior density of Θ given X = (h, y) with respect to ρ is calculated via
Bayes' theorem 1.31:

f_{Θ|X}(ψ|h, y) = f_{X|Θ}(h, y|ψ)/f_X(h, y) = r(ψ⁻¹ ∘ h, y)/(c_{h⁻¹} f_Y(y))
= c_h f_{H|Y,Θ}(h|y, ψ). □

17 This lemma is used in the proofs of Lemma 6.66 and Theorem 6.74.

Lemma 6.66. Assume the conditions of Theorem 6.59. If η is an equivariant rule, then the conditional risk function given Y = y (constant as
a function of θ) equals the posterior risk given X = (h, y) (constant as a
function of h).

PROOF. Since Ω is isomorphic to G, there is only one orbit and the risk
function will be constant for every equivariant rule by Lemma 6.39, part
4. Also, the conditional risk function given Y will be constant in θ. The
posterior risk given X = (h′, y) is

∫ L(θ, η(h′, y)) f_{Θ|X}(θ|h′, y) dρ(θ)
= c_{h′} ∫ L(θ, h′η(e, y)) f_{H|Θ,Y}(h′|θ, y) dρ(θ)
= (c_{h′}/f_Y(y)) ∫ L(h′⁻¹θ, η(e, y)) f_{X|Θ}(h′, y|θ) dρ(θ)
= (c_{h′}/f_Y(y)) ∫ L(h′⁻¹θ, η(e, y)) r(θ⁻¹ ∘ h′, y) dρ(θ)
= (c_{h′}/f_Y(y)) ∫ L(h′⁻¹ ∘ θ, η(e, y)) r([h′⁻¹ ∘ θ]⁻¹, y) dρ(θ)
= (1/f_Y(y)) ∫ L(θ, η(e, y)) r(θ⁻¹, y) dρ(θ)
= (1/f_Y(y)) ∫ L(θ⁻¹, η(e, y)) r(θ, y) dλ(θ)
= (1/f_Y(y)) ∫ L(h⁻¹, η(e, y)) f_{X|Θ}(h, y|e) dλ(h)
= ∫ L(e, η(h, y)) f_{H|Y,Θ}(h|y, e) dλ(h) = R(e, η|y),

where the first equality follows from Lemma 6.65 and equivariance of η,
the second and eighth follow from invariance of L and the definition of
conditional density, the third and seventh follow from Corollary 6.64, the
fourth is elementary group theory, the fifth follows from Lemma 6.55 and
Lemma 6.56, the sixth follows from Proposition 6.53, and the ninth follows
from the definition of conditional risk function. □

There is a useful corollary to Lemma 6.66.

Corollary 6.67. Under Assumption 6.58, the conditional distribution of
Θ⁻¹H given Y = y is the same as the posterior distribution of Θ⁻¹H.

PROOF. Let ℵ = G, and for each B ∈ Γ, let L(θ, a) = I_B(θ⁻¹a) in
Lemma 6.66. The conclusion is that P′_θ(Θ⁻¹H ∈ B|Y = y) = Pr(Θ⁻¹H ∈
B|(H, Y) = (h, y)). □

The quantity Θ⁻¹H is called a pivotal quantity because we can switch
back and forth between thinking of H or Θ as being the random variable
and the other as fixed without changing the distribution. The common
distribution is called the fiducial distribution by Fisher (1935).
Lemma 6.68. Assume the conditions of Theorem 6.59. Assume that the
formal Bayes rule with respect to ρ exists. Let d(y) minimize the posterior
risk if (e, y) is observed, where e is the identity in G. That is,
min_a ∫_Ω L(θ, a) f_{Θ|X}(θ|e, y) dρ(θ) occurs at a = d(y). Define δ(h, y) = hd(y).
Then δ is the formal Bayes rule, and it is equivariant.

PROOF. First, note that δ is equivariant since, for g ∈ G,

δ(g(h, y)) = δ(g ∘ h, y) = (g ∘ h)d(y) = g(hd(y)) = gδ(h, y).

To see that δ is the formal Bayes rule, assume that (h, y) is observed.
We must show that min_a ∫_Ω L(θ, a) f_{Θ|X}(θ|h, y) dρ(θ) occurs at a = δ(h, y).
We can write

∫_Ω L(θ, a) f_{Θ|X}(θ|h, y) dρ(θ)
= (c_h/f_Y(y)) ∫_Ω L(θ, a) f_{X|Θ}(h, y|θ) dρ(θ)
= (c_h/f_Y(y)) ∫_Ω L(h⁻¹θ, h⁻¹a) r(θ⁻¹ ∘ h, y) dρ(θ)
= (c_h/f_Y(y)) ∫_Ω L(h⁻¹θ, h⁻¹a) r([h⁻¹ ∘ θ]⁻¹, y) dρ(θ)
= (1/f_Y(y)) ∫_Ω L(θ, h⁻¹a) r(θ⁻¹, y) dρ(θ)
= (1/f_Y(y)) ∫_Ω L(θ, h⁻¹a) f_{X|Θ}(e, y|θ) dρ(θ)
= ∫_Ω L(θ, h⁻¹a) f_{Θ|X}(θ|e, y) dρ(θ),

where the second equality uses Corollary 6.64 and the invariance of the loss
function, the fourth equality follows from Lemma 6.54, the fifth follows from
Corollary 6.64, and the sixth uses the fact that c_e = 1. Using the definition
of d, the last integral above is minimized when h⁻¹a = d(y), that is, when
a = hd(y) = δ(h, y). □

Lemma 6.69. Assume the conditions of Theorem 6.59. The δ defined in
Lemma 6.68 is MRE and MRE conditional on Y.

PROOF. From Lemma 6.68, we know that the posterior risk given X = (e, y)
is minimized for each y at the action d(y). By equivariance, and the fact
that the posterior risk given X = (h, y) is constant in h (Lemma 6.66),
it follows that the posterior risk is minimized at the action δ(x). Since
Lemma 6.66 also shows that the risk function equals the posterior risk, the
risk function is also minimized at δ; hence, δ is the MRE rule conditional
on Y = y. The unconditional risk function of a rule η at θ = e is

R(e, η) = ∫ R(e, η|y) f_Y(y) dν(y).

Since δ has minimum conditional risk function uniformly in y, the unconditional risk function of δ is clearly the minimum also. Hence δ is also the
MRE rule. □
The conditions of Theorem 6.59 are often met when Y is the space of
maximal invariants.

Example 6.70 (Continuation of Example 6.60; see page 369). Suppose that
X₁, …, Xₙ are IID given Θ = θ, each with density f([x − θ₁]/θ₂)/θ₂, for some
density f(·). These distributions are invariant under Group 3. There are many
possible invariant losses, as we saw earlier. The Y we calculated earlier was the
space of maximal invariants. The MRE for loss L₁ is

δ₁(x) = ∫₀^∞ ∫ θ₁ θ₂^{−n−3} Π_{i=1}^n f([xᵢ − θ₁]/θ₂) dθ₁ dθ₂ / ∫₀^∞ ∫ θ₂^{−n−3} Π_{i=1}^n f([xᵢ − θ₁]/θ₂) dθ₁ dθ₂.

The MRE for loss L₂ is

δ₂(x) = ∫₀^∞ ∫ θ₂^{−n−2} Π_{i=1}^n f([xᵢ − θ₁]/θ₂) dθ₁ dθ₂ / ∫₀^∞ ∫ θ₂^{−n−3} Π_{i=1}^n f([xᵢ − θ₁]/θ₂) dθ₁ dθ₂.

If f is the standard normal density, then δ₁(x) = x̄ and

δ₂(x) = [Γ(n/2)/Γ((n + 1)/2)] √(w/2).
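The closed form for δ₂ in the normal case can be checked by Monte Carlo: under the RHM prior, the posterior of τ = θ₂⁻² is Γ([n − 1]/2, w/2), and minimizing the posterior expectation of L₂ gives a = E(1/θ₂)/E(1/θ₂²). The sketch below uses made-up values of n and w.

```python
import numpy as np
from math import gamma, sqrt

rng = np.random.default_rng(1)
n, w = 8, 5.0                           # made-up sample size and w = Σ(x_i - x̄)²

# Posterior of τ = σ^{-2} is Gamma((n-1)/2, rate w/2); σ = τ^{-1/2}.
tau = rng.gamma((n - 1) / 2, 2.0 / w, size=2_000_000)
sigma = 1.0 / np.sqrt(tau)

# L₂(θ, a) = (θ₂ - a)²/θ₂² is minimized at a = E(1/σ)/E(1/σ²).
a_mc = np.mean(1.0 / sigma) / np.mean(1.0 / sigma**2)
a_formula = gamma(n / 2) / gamma((n + 1) / 2) * sqrt(w / 2)
assert abs(a_mc - a_formula) < 0.01 * a_formula
print(a_mc, a_formula)
```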

It may be the case that all equivariant rules are inadmissible. For example,
with Group 4 and one n-dimensional normal observation X, the
MRE rule is to estimate Θ by X. But we saw in Section 3.2.3 that this is
inadmissible if n ≥ 3.
In later sections we will see Theorems 6.74 and 6.78, which are like
Theorem 6.59. The conclusions to those theorems say that certain formal
Bayes inferences with respect to RHM priors agree with classical inferences
conditional on the ancillary Y. This is why, in Theorem 6.59, we also showed
that the MRE decision rule is MRE conditional on Y. Theorems 6.59, 6.74,
and 6.78 parallel each other more this way.
Sometimes, the conclusions of Theorem 6.59 hold even when its condi-
tions are not strictly met. For example, suppose that there is a nuisance
parameter. It may be the case that for each fixed value of the nuisance
parameter, the conditions of Theorem 6.59 apply to the problem with the
appropriate subparameter space.
Example 6.71. Suppose that X₁, …, Xₙ are conditionally independent with
N(μ, σ²) distribution given Θ = (μ, σ). Let ℵ = ℝ and L(θ, a) = (μ − a)². This
loss is not invariant under Group 3; however, it is invariant under Group 1. But
the parameter space is not isomorphic to Group 1. For each value of σ, consider
the subparameter space Ω_σ = {(μ, σ) : μ ∈ ℝ}. The formal Bayes rule with
respect to RHM on Ω_σ is δ(x) = x̄ for each σ. Since δ is the MRE rule for each
σ, it is the MRE rule under Group 1 for the original problem.

It is not difficult to show that the situation of Example 6.71 generalizes
to the following result.

Proposition 6.72. Suppose that the parameter space is Ω = Ω₁ × Ω₂.
Suppose that, for each θ₂ ∈ Ω₂, the conditions of Theorem 6.59 hold when
Ω₁ × {θ₂} is taken as the parameter space. Then δ is the MRE rule if and
only if it is MRE for each of the subproblems with fixed values of θ₂.

There are situations in which there is no MRE.

Example 6.73 (Continuation of Example 6.71). This time, let ℵ = [0, ∞) and
L(θ, a) = (σ² − a)²/σ⁴. Let the group be Group 2. For each value of μ, consider
the subparameter space Ω_μ = {(μ, σ) : σ > 0}. The formal Bayes rule with
respect to RHM on Ω_μ is δ(x) = Σ_{i=1}^n (xᵢ − μ)²/(n + 2). No single equivariant
rule achieves the minimum risk for each μ.
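The formal Bayes rule in Example 6.73 can be verified numerically: with μ fixed and prior dσ/σ, the posterior of τ = σ⁻² is Γ(n/2, S/2) with S = Σ(xᵢ − μ)², and the loss (σ² − a)²/σ⁴ = (1 − aτ)² is minimized at a = E(τ)/E(τ²) = S/(n + 2). A sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, mu = 6, 0.4
x = rng.normal(mu, 1.7, size=n)              # any made-up data set will do
S = float(np.sum((x - mu) ** 2))

# Posterior of τ = σ^{-2} given μ: Gamma(n/2, rate S/2).
tau = rng.gamma(n / 2, 2.0 / S, size=2_000_000)

# (σ² - a)²/σ⁴ = (1 - a τ)², minimized at a = E(τ)/E(τ²).
a_mc = tau.mean() / np.mean(tau ** 2)
a_rule = S / (n + 2)
assert abs(a_mc - a_rule) < 0.01 * a_rule
print(a_mc, a_rule)
```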

6.3 Testing and Confidence Intervals*


6.3.1 P-Values in Invariant Problems
In Section 4.6, we introduced P-values as an alternative to testing hypothe-
ses at preassigned levels. In Examples 4.146 (page 281) and 4.61 (page 241)
we saw that sometimes the P-value relative to a collection of tests is the
same as the posterior probability that the hypothesis is true based on an
improper prior. A more general situation in which P-values correspond
to posterior probabilities with improper priors arises when there is equiv-
ariance with respect to some group operating on the data and parameter
spaces. The structure of the problem will need to be very much like that
of Theorem 6.59. In addition, we will need to say something about the
hypotheses of interest and how they interact with the group operation. We
also need to choose an appropriate set of tests with respect to which we
calculate the P-value.

*This section may be skipped without interrupting the flow of ideas.
Theorem 6.74. Assume Assumption 6.58 (see page 368). For each θ ∈ Ω,
let Ω_θ be a subset of Ω such that the following conditions hold:

1. θ ∈ Ω_θ;

2. for all g ∈ G and all θ ∈ Ω, gΩ_θ = Ω_{gθ};

3. for all θ ∈ Ω and all ψ ∈ Ω_θ, Ω_ψ ⊆ Ω_θ;

4. for all θ ∈ Ω and all h ∈ Ω_e (where e is the identity in G), Ω_θh ⊆ Ω_θ.

For each θ ∈ Ω, let G index a set of tests {φ_{θ,g} : g ∈ G} of the hypothesis
H_θ : Θ ∈ Ω_θ versus A_θ : Θ ∉ Ω_θ defined by

φ_{θ,g}(h, y) = 1 if h ∈ θΩ_{g⁻¹}, and φ_{θ,g}(h, y) = 0 if not.

Suppose that we use ρ as a (possibly improper) prior for Θ. The posterior
probability that H_θ is true given t(X) = (h, y) is equal to the conditional
P-value given Y = y relative to the set of tests {φ_{θ,g} : g ∈ G}.
It should be noted, in the statement of Theorem 6.74, that the P-values
must be calculated conditional on Y. But, Lemma 6.65 says that Y is ancillary. So, those who believe in conditioning on ancillaries would then want
to calculate P-values conditional on Y anyway. Theorem 2.48 says that if
there is a boundedly complete sufficient statistic, it will be independent of
the ancillary. This leads to a simpler version of Theorem 6.74.

Corollary 6.75. Under the conditions of Theorem 6.74, if H (the "group"
part of t(X)) is a boundedly complete sufficient statistic, then the posterior
probability that H_θ is true equals the P-value relative to the set of tests
{φ_{θ,g} : g ∈ G}.
Before proving Theorem 6.74, some explanation of the four conditions
on Ω_θ is in order. The first condition is simply to connect θ with the
corresponding hypothesis in a sensible way. The second condition ensures
that the hypotheses are "equivariant" in some sense. The third condition
is to ensure that the P-value is the size of the test φ_{θ,g} when H = g. The
fourth condition guarantees that the size of φ_{θ,g} as a test of H_θ is equal to its
power at θ. These last two conditions also capture the "one-sided" nature
of the types of hypotheses to which this theorem applies. It will not apply
to point hypotheses or to hypotheses such that Ω_θ has smaller dimension
than Ω. The reason that the form of the test must be tied so closely to the
form of the hypotheses is that there may be many classes of "equivariant"
tests18 and each class may lead to a different P-value. However, there is
only one posterior probability that Θ ∈ Ω_θ with respect to RHM. Hence,
we needed to identify exactly which class of tests has P-value equal to that
posterior probability. This point will become clearer after Theorem 6.78.
As in the proof of Theorem 6.59, we will assume that X = t(X) and that
G = Ḡ = Ω, to make the notation simpler. The following lemma is also
useful.

Lemma 6.76. Under Assumption 6.58, the conditional distributions of the
H part of X given Y are invariant.

PROOF. Let B be a measurable subset of G, let g ∈ G, and let μ̄ be the
probability measure that gives the marginal distribution of Y. Define

v(θ, g, y) = P′_θ(gH ∈ B|Y = y).

We want to prove that P′_{gθ}(H ∈ B|Y = y) = v(θ, g, y), a.s. [μ̄] (for fixed θ
and g). For every measurable subset A of Y,

∫_A v(θ, g, y) dμ̄(y) = P′_θ(Y ∈ A, gH ∈ B)
= P′_{gθ}(Y ∈ A, H ∈ B)
= ∫_A P′_{gθ}(H ∈ B|Y = y) dμ̄(y). □

PROOF OF THEOREM 6.74. Let θ ∈ Ω and let ψ ∈ Ω_θ. Then θ⁻¹ψ ∈ Ω_e
by condition 2. Also, we use conditions 2 and 4 to show that

P′_ψ(φ_{θ,g}(H, Y) = 1|Y = y) = P′_ψ(H ∈ θΩ_{g⁻¹}|Y = y)
= P′_e(ψH ∈ θΩ_{g⁻¹}|Y = y) = P′_e(H ∈ ψ⁻¹θΩ_{g⁻¹}|Y = y)
= P′_e(H⁻¹ ∈ Ω_{g⁻¹}⁻¹θ⁻¹ψ|Y = y) ≤ P′_e(H⁻¹ ∈ Ω_{g⁻¹}⁻¹|Y = y)
= P′_e(H ∈ Ω_{g⁻¹}|Y = y) = P′_θ(H ∈ θΩ_{g⁻¹}|Y = y)
= P′_θ(φ_{θ,g}(H, Y) = 1|Y = y).

This shows that the conditional size of the test φ_{θ,g} (given Y) as a test of
H_θ is equal to its conditional power function at θ. For each g ∈ G, define

Q(g, y) = P′_e(H ∈ Ω_{g⁻¹}|Y = y).

Then P′_θ(H ∈ θΩ_{g⁻¹}|Y = y) = Q(g, y), and it follows from what we just
proved that the conditional size, given Y = y, of φ_{θ,g} as a test of H_θ equals
18 We put the word "equivariant" in quotes because it is not the test function
itself that is equivariant, but rather the combination of the hypothesis and the
test function. That is, if ψ_θ is a test of Ω_θ, then ψ_θ(h, y) = ψ_{gθ}(gh, y).

Q(g, y). The conditional P-value given Y = y can then be calculated as

p(h, y) = inf_g {Q(g, y) : φ_{θ,g}(h, y) = 1}.

It is easy to see that φ_{θ,θ⁻¹h}(h, y) = 1 by condition 1. It follows that
p(h, y) ≤ Q(θ⁻¹h, y). Next, suppose that φ_{θ,g}(h, y) = 1. It follows that h ∈
θΩ_{g⁻¹}, hence h⁻¹θ ∈ Ω_{g⁻¹}. Condition 3 implies that Ω_{h⁻¹θ} ⊆ Ω_{g⁻¹}, from
which it follows that Q(θ⁻¹h, y) ≤ Q(g, y). It follows that Q(θ⁻¹h, y) ≤
p(h, y), hence Q(θ⁻¹h, y) = p(h, y).
To complete the proof, we calculate the posterior probability that H_θ is
true given X = (h, y) and show that it equals Q(θ⁻¹h, y). Lemma 6.65 tells
us that

f_{Θ|X}(ψ|h, y) = c_h f_{H|Y,Θ}(h|y, ψ).

In the following equalities, let (H′, Y′) have the same conditional distribution
given Θ that X had before it was observed:

Pr(Θ ∈ Ω_θ|X = (h, y)) = c_h ∫ I_{Ω_θ}(ψ) f_{H|Y,Θ}(h|y, ψ) dρ(ψ)
= (c_h/f_Y(y)) ∫ I_{Ω_θ}(ψ) r(ψ⁻¹ ∘ h, y) dρ(ψ)
= (c_h/f_Y(y)) ∫ I_{Ω_θ}(ψ⁻¹) r(ψ ∘ h, y) dλ(ψ)
= (1/f_Y(y)) ∫ I_{Ω_θ}(h ∘ ψ⁻¹) r(ψ, y) dλ(ψ)
= (1/f_Y(y)) ∫ I_{Ω_{h⁻¹θ}⁻¹}(ψ) r(ψ, y) dλ(ψ)
= ∫ I_{Ω_{h⁻¹θ}⁻¹}(g) f_{H|Y,Θ}(g|y, e) dλ(g) = P′_e(H′ ∈ Ω_{h⁻¹θ}⁻¹|Y′ = y)
= Q(θ⁻¹h, y),

where the first equality follows from Lemma 6.65, the second equality follows
from Corollary 6.64, the third follows from Proposition 6.53, the fourth
follows from Lemma 6.55, the fifth is just algebra, and the sixth follows from
Corollary 6.64. □
Example 6.77. Let X₁, …, Xₙ be conditionally IID with N(μ, σ²) distribution
given Θ = (μ, σ). The group is location and scale (Group 3). Consider the hypotheses
Ω_θ = {(a, b) ∈ Ω : a ≤ μ} for θ = (μ, σ). The corresponding tests
are the usual one-sided t-tests. (The reader should check that the conditions of
Theorem 6.74 are satisfied.) The associated P-values equal the posterior probabilities that the hypotheses are true if the prior is RHM, the measure with
Radon–Nikodym derivative 1/σ with respect to Lebesgue measure.
As a less familiar example, let

Ω_θ = {(a, b) ∈ Ω : a ≥ μ, b ≤ σ}.

This is a simultaneous test of H_{μ,σ} : M ≥ μ and Σ ≤ σ. We will check condition
4 only. Since e = (0, 1), h = (m, s) ∈ Ω_e satisfies s ≤ 1 and m ≥ 0. For such h,

Ω_θh = {(a, b) ∈ Ω : b ≤ σs, a ≥ μ + bm/s} ⊆ Ω_θ,

since σs ≤ σ and μ + bm/s ≥ μ. Suppose that data (x̄ₙ, sₙ) are observed with
x̄ₙ = Σ_{i=1}^n xᵢ/n and sₙ = √(Σ_{i=1}^n (xᵢ − x̄ₙ)²/(n − 1)). The test φ_{θ,g} rejects
H_{μ,σ} if x̄ₙ ≥ μ + sₙg₁/g₂ and sₙ ≥ σg₂. The P-value is the size of the test
φ_{θ,((x̄ₙ−μ)/σ, sₙ/σ)}.
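The first claim of this example (the one-sided t-test) can be demonstrated by simulation: the posterior probability that M ≤ μ₀ under the prior (1/σ) dμ dσ matches the sampling-theory P-value P(T_{n−1} ≥ t_obs). The sketch below uses made-up summary data.

```python
import numpy as np

rng = np.random.default_rng(3)
n, xbar, s, mu0 = 12, 0.9, 1.4, 0.5        # made-up summary data
t_obs = np.sqrt(n) * (xbar - mu0) / s
m = 2_000_000

# Posterior draws of (Σ, M) under the RHM prior (1/σ) dμ dσ.
w = (n - 1) * s**2
tau = rng.gamma((n - 1) / 2, 2.0 / w, size=m)
mu = rng.normal(xbar, 1.0 / np.sqrt(n * tau))
post = np.mean(mu <= mu0)                  # Pr(M ≤ μ0 | data)

# Sampling-theory P-value: simulate the t statistic with n-1 df.
z = rng.normal(size=m)
v = rng.chisquare(n - 1, size=m)
pval = np.mean(z / np.sqrt(v / (n - 1)) >= t_obs)

assert abs(post - pval) < 0.002
print(post, pval)
```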

6.3.2 Equivariant Confidence Sets

In Section 5.2.1, we introduced confidence sets as an alternative to testing
a single hypothesis about a parameter. In Example 5.57 on page 319, we
saw that the confidence coefficient may not adequately express our degree
of confidence that the parameter is in the set after seeing the data. That
example is one in which the distributions are invariant under the action of
the location group on the real numbers and the group is isomorphic to the
parameter space. In addition, the sufficient statistic (T₁, T₂) can be transformed to (T₁, T₂ − T₁) so that the group acts on T₁ and leaves T₂ − T₁
invariant. This is the same situation that arose in Theorems 6.59 and 6.74.
In Theorem 6.74, we saw that posterior probabilities agreed with P-values
conditional on the ancillary (invariant). A similar thing happens in Example 5.57, namely posterior probabilities (with respect to an improper
prior) agree with conditional confidence coefficients. This is a special case
of another theorem with conditions similar to the other two. This theorem
is similar to one proved by Stein (1965). Chang and Villegas (1986)
prove a similar theorem with slightly different conditions. Berger (1985,
Section 6.6.3) also proves this theorem in a different way. Jaynes (1976,
p. 181) gives a proof for the case of a location parameter.
Theorem 6.78. Assume Assumption 6.58 (see page 368). For each x ∈ 𝒳, let B_x be a measurable subset of Ω satisfying B_{gx} = ḡB_x for all g ∈ G. Let C_θ = {x : θ ∈ B_x}. Suppose that we use ρ as a (possibly improper) prior for Θ. Then, for all x ∈ 𝒳 and all θ ∈ Ω,

Pr(Θ ∈ B_x | X = x) = P'_θ(X ∈ C_θ | Y = y).   (6.79)

PROOF. As in the proofs of Theorems 6.59 and 6.74, we will assume that 𝒳 = t(𝒳) and G = Ḡ = Ω for ease of notation. Hence, we will write x = (h, y). Now, write B_{(h,y)} = hB_{(e,y)}, and use Corollary 6.67 to say that

P'_θ(Θ⁻¹H ∈ B⁻¹_{(e,y)} | Y = y) = Pr(Θ⁻¹H ∈ B⁻¹_{(e,y)} | (H, Y) = (h, y)),

where B⁻¹_{(e,y)} = {g : g⁻¹ ∈ B_{(e,y)}}. Since θ⁻¹h ∈ B⁻¹_{(e,y)} if and only if θ ∈ B_{(h,y)} if and only if (h, y) ∈ C_θ, the result follows. □
If we think of S(X) = B_X as a confidence set, then the left-hand side of (6.79) is the posterior probability that Θ is in the confidence set and the right-hand side is the conditional confidence coefficient given the ancillary.
At this point, we should examine the connection between Theorems 6.74
and 6.78. Since confidence sets and tests are equivalent, one would expect
380 Chapter 6. Equivariance

there to be some sort of equivalence between these two theorems. The problem is that Theorem 6.78 applies to all equivariant confidence sets. All such collections of confidence sets correspond to collections of tests. All such tests satisfy the "equivariance" condition ψ_θ(h, y) = ψ_{gθ}(gh, y) if ψ_θ denotes the corresponding test of Ω_θ. Furthermore, every such collection of tests leads to a P-value. Each such P-value will be the posterior probability of some set in the parameter space. That set may not equal Ω_θ, however.
Here is how it works. For each α, suppose that we choose our B_{x,α} so that C_{θ,α} = {x : θ ∈ B_{x,α}} satisfies P'_θ(X ∈ C_{θ,α} | Y = y) = 1 − α. This makes B_{x,α} a conditional coefficient 1 − α confidence set given Y. Now define tests ψ_{θ,α} to be 1 minus the indicator functions of the sets C_{θ,α}. Then the power function of ψ_{θ,α} satisfies β_{ψ_{θ,α}}(θ) = α, and ψ_{θ,α} is a level α test of the hypothesis

Ω_θ = {θ' ∈ Ω : β_{ψ_{θ,α}}(θ') ≤ α, for all α}.

The conditional P-value relative to the set of tests B_θ = {ψ_{θ,α} : α ∈ [0, 1]} is

p(x) = inf{α : x ∈ C^c_{θ,α}}.

It follows that P'_θ(X ∈ C^c_{θ,p(x)} | Y = y) = p(x). From (6.79), we conclude that

Pr(Θ ∈ B^c_{x,p(x)} | X = x) = p(x).

In general, it might happen that B^c_{x,p(x)} ≠ Ω_θ. Here is an example.

Example 6.80. Let X₁, …, Xₙ be conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Here (X̄_n, S_n) is a complete sufficient statistic, where X̄_n = Σ_{i=1}^n X_i/n and S_n = √(Σ_{i=1}^n (X_i − X̄_n)²/(n − 1)). So, we will ignore the Y part of the problem, since Y is independent of the complete sufficient statistic. Let

B_{x,α} = [X̄_n − (S_n/√n) T_{n−1}^{−1}(1 − α/2), X̄_n + (S_n/√n) T_{n−1}^{−1}(1 − α/2)],

where T_{n−1}^{−1} is the inverse of the CDF of the t_{n−1}(0, 1) distribution. Then ψ_{θ,α} is the usual two-sided size α t-test of H : Θ = θ. The P-value is the α value p_θ(x) such that one of the endpoints of the interval B_{x,α} equals θ. This makes B_{x,p_θ(x)} equal to the interval centered at X̄_n and having half-width equal to |X̄_n − θ|. On the other hand, Ω_θ = {θ}. The P-value is the posterior probability of some hypothesis, but not the hypothesis you thought you were testing.19
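As a numerical sketch of Example 6.80 (made-up data, with the hypothesized value θ taken to be 0; numpy and scipy are assumed available), the α at which an endpoint of the interval B_{x,α} touches θ agrees with the usual two-sided t-test P-value:

```python
import numpy as np
from scipy import stats, optimize

x = np.array([1.2, 0.7, 1.9, 1.1, 0.4, 1.6])  # made-up sample
theta = 0.0                                    # hypothesized value
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)

# Usual two-sided t-test P-value of H : Theta = theta.
t_stat = np.sqrt(n) * (xbar - theta) / s
p_standard = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Lower endpoint of B_{x,alpha}; find the alpha at which it crosses
# theta (here xbar > theta, so it is the lower endpoint that crosses).
def lower_endpoint(alpha):
    return xbar - s / np.sqrt(n) * stats.t.ppf(1 - alpha / 2, df=n - 1)

p_endpoint = optimize.brentq(lambda a: lower_endpoint(a) - theta,
                             1e-10, 1 - 1e-10)

assert np.isclose(p_standard, p_endpoint, atol=1e-8)
```

The two computations coincide because solving "endpoint = θ" for α is algebraically the same as inverting the t CDF at the observed statistic.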

6.3.3 Invariant Tests


In multiple parameter problems with hypotheses concerning several pa-
rameters at once, there may be many competing tests, none of which is

19See Problem 24 on page 392.


*This section may be skipped without interrupting the flow of ideas.

UMPU. Just as we used equivariance to reduce the collection of estimators to consider, we can try to reduce the number of tests to consider also. In hypothesis testing, the action space is ℵ = {0, 1}. There are only two groups that act on this set. One contains only an identity, while the other contains an identity and a "switch" operator, g(i) = 1 − i. If we were to construct groups such that G̃ were this second group, then there would have to be conditions under which we were willing to switch the hypothesis and the alternative. Due to the asymmetric treatment of hypotheses and alternatives in classical testing theory, this would not be advisable. Hence, we will only discuss cases in which G̃ consists of one element, namely an identity. Then a decision rule is equivariant if and only if it is invariant. That is, since tests are randomized rules,
That is, since tests are randomized rules,

δ*(gx)(g̃A) = δ*(gx)(A) = δ*(x)(A),


making δ*(gx)(·) the same probability measure as δ*(x)(·). So, each equivariant (invariant) test must be a function of the maximal invariant.
Example 6.81. Consider Group 3, namely the one-dimensional location-scale group. The maximal invariant is

((x₁ − x̄)/√w, …, (xₙ − x̄)/√w),

where w = Σᵢ (xᵢ − x̄)². Nobody would ever base a test on this alone, because it is ancillary in location-scale problems. In the normal distribution case, it is not even a function of the sufficient statistic.
If we first consider the sufficient statistic (X̄, W), we see that the maximal invariant is constant; hence only constant functions of the sufficient statistic are invariant.
This example raises the question of whether reduction of the set of tests by invariance is compatible with reduction by sufficiency. That is, suppose that we first reduce to the set of invariant tests and then find a sufficient statistic for the maximal invariant parameter and further reduce by considering only invariant tests that are a function of the sufficient function of the maximal invariant. Will we get the same tests as we would if we first reduced to only those tests that depend on the sufficient statistic and then reduced to only those that depend on the maximal invariant in the space of sufficient statistics? In Example 6.81 on page 381, the answer is yes, but only because both methods produce degenerate results. Hall, Wijsman, and Ghosh (1965) find conditions for this compatibility.
The following assumption is an obvious preliminary. It requires that the group operation is inherited by the sufficient statistic space.
Assumption 6.82. If T(X) is sufficient and, for each g ∈ G, we define T_g(x) = T(gx), then T_g(x) depends on x only through T(x).
Example 6.83. Suppose that 𝒳 = ℝⁿ and T(x) = (x̄, w), where w = Σ_{i=1}^n (xᵢ − x̄)². Then g_{a,b}x = (bx₁ + a, …, bxₙ + a) and T(g_{a,b}x) = (bx̄ + a, b²w). This function satisfies Assumption 6.82, assuming it is sufficient. A function that does not satisfy the assumption is H(x) = x₁x₂.
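The content of Assumption 6.82 in Example 6.83 can be checked numerically (a sketch with simulated data; numpy assumed): two data sets with the same value of T(x) = (x̄, w) still agree in T after any g_{a,b} is applied.

```python
import numpy as np

def T(x):
    # sufficient statistic of Example 6.83: (sample mean, sum of squares)
    xbar = x.mean()
    return xbar, ((x - xbar) ** 2).sum()

def g(a, b, x):
    # one-dimensional location-scale action on the data
    return b * x + a

rng = np.random.default_rng(0)
x = rng.normal(size=5)
y = 2 * x.mean() - x          # a different data set with the same T value
assert np.allclose(T(x), T(y))

a, b = 1.5, 2.0
xbar, w = T(x)
# T(g x) = (b*xbar + a, b^2 * w) depends on x only through T(x) ...
assert np.allclose(T(g(a, b, x)), (b * xbar + a, b ** 2 * w))
# ... so the two data sets still agree after the transformation
assert np.allclose(T(g(a, b, x)), T(g(a, b, y)))
```

The reflected sample y has the same mean and the same sum of squared deviations as x, which is what makes it a useful probe of the assumption.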

If Assumption 6.82 is satisfied, define g* to be the transformation on 𝒯 given by g*t = T_g(x) for any x such that T(x) = t. The set G* of all such transformations is a group. Let U : 𝒯 → 𝒰 be the maximal invariant in the sufficient statistic space, and let V : 𝒳 → 𝒱 be the maximal invariant in the original data space. Then

U(T(gx)) = U(Tg(x)) = U(g*t) = U(t) = U(T(x)).


So, U(T(·)) is an invariant function in the original data space; hence it is a function of V. That is, there exists H : 𝒱 → 𝒰 such that U(T(x)) = H(V(x)) for all x ∈ 𝒳.
Theorem 6.84 (Stein's Theorem).20 Let T be sufficient and satisfy Assumption 6.82. Suppose that T has a discrete distribution. Let U and V be maximal invariants in 𝒯 and 𝒳, respectively. Let R(Θ) be the maximal invariant in Ω. Then U(T(X)) is sufficient for R(Θ).
PROOF. The proof proceeds through a series of claims.
(a) A = V⁻¹(B) for some B ⊆ 𝒱 if and only if gA = A for all g.
(Proof of a): Let A = V⁻¹(B). Then

gA = {gx : V(x) ∈ B} = {x : V(g⁻¹x) ∈ B} = {x : V(x) ∈ B} = A,

since V is invariant. Now, let gA = A for all g. Then x ∈ A if and only if gx ∈ gA = A, so I_A(gx) = I_A(x) for all g and x. Thus I_A(·) is invariant and it must be a function of the maximal invariant, namely I_A(x) = f(V(x)). Let B = f⁻¹({1}). Then I_A(x) = I_B(V(x)), so A = {x : V(x) ∈ B}. A set A that satisfies the conditions of this claim is called an invariant set.
(b) Pr(X ∈ A | T = t) is an invariant function of t if A satisfies gA = A for all g ∈ G.
(Proof of b): Choose any θ and t such that Pr(T(X) = t | Θ = θ) > 0. Let g ∈ G. Then

Pr(X ∈ A | T(X) = t) = Pr(X ∈ A, T(X) = t | Θ = θ) / Pr(T(X) = t | Θ = θ)
= Pr(X ∈ gA, T(X) = g*t | Θ = gθ) / Pr(T(X) = g*t | Θ = gθ)
= Pr(X ∈ gA | T(X) = g*t)
= Pr(X ∈ A | T(X) = g*t),

20 Hall, Wijsman, and Ghosh (1965) attribute this theorem to Stein.



where the second-to-last equality holds by sufficiency of T.


(c) If A is an invariant set, then P_θ(A | U(T(X)) = u) is constant in θ for each u.
(Proof of c): Write P_θ(A | U(T(X)) = u) as

Σ_{t ∈ U⁻¹(u)} Pr(X ∈ A | T(X) = t) Pr(T(X) = t | U(T(X)) = u, Θ = θ).

Since U is maximal invariant, U⁻¹(u) is an orbit in 𝒯. So, U⁻¹(u) = {t : t = g*t_u for some g* ∈ G*} for some t_u ∈ 𝒯. It follows that P_θ(A | U = u) equals

Σ_{g* ∈ G*} Pr(T(X) = g*t_u | U = u, Θ = θ) Pr(X ∈ A | T(X) = g*t_u).

The last factor equals Pr(X ∈ A | T = t_u) by (b), so it factors out of the sum. Also,

{x : U(T(x)) = u} = ∪_{g* ∈ G*} {x : T(x) = g*t_u},

so the remaining sum equals 1 and

P_θ(A | U = u) = Pr(X ∈ A | T(X) = t_u),

which is the same for all θ.
(d) Part (a) says that the invariant sets constitute the σ-field generated by V. Part (c) says that for each A in that σ-field, P'_θ(X ∈ A | U = u) is constant in θ. Hence U is sufficient. □
Hall, Wijsman, and Ghosh discuss conditions under which the Stein Theorem 6.84 holds for continuous distributions.
Example 6.85. Consider Group 3 again. The maximal invariant is

V(X) = ((X₁ − Xₙ)/(Xₙ₋₁ − Xₙ), …, (Xₙ₋₂ − Xₙ)/(Xₙ₋₁ − Xₙ), sign(Xₙ₋₁ − Xₙ)),

which is independent of Θ. So the sufficient statistic in the maximal invariant space is constant. If the sufficient statistic is T(x) = (x̄, w), as with normal distributions, then the maximal invariant in the sufficient statistic space is also constant. Since there is only one orbit in the parameter space, the maximal invariant in the parameter space is also constant. In simple English, Group 3 equivariance is useless in hypothesis testing.
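A quick numerical check of the invariance of V(X) in Example 6.85 (a sketch; numpy assumed, and the scale change is taken positive as in the location-scale group):

```python
import numpy as np

def V(x):
    # maximal invariant of Example 6.85 for the location-scale group
    ratios = (x[:-2] - x[-1]) / (x[-2] - x[-1])
    return np.append(ratios, np.sign(x[-2] - x[-1]))

rng = np.random.default_rng(1)
x = rng.normal(size=6)
a, b = -3.0, 2.5            # arbitrary location shift and positive rescaling
assert np.allclose(V(b * x + a), V(x))
```

Both the location shift a and the positive scale b cancel in every ratio, and a positive b does not change the sign of Xₙ₋₁ − Xₙ.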

Definition 6.86. A function f on 𝒳 is almost invariant with respect to μ if, for each g ∈ G, there exists B_g ∈ 𝓑 such that μ(B_g) = 0 and f(x) = f(gx) for all x ∉ B_g.
Proposition 6.87. If P_θ ≪ μ for each θ, if v(θ) is maximal invariant in Ω, and if f is almost invariant with respect to μ, then the distribution of f(X) given Θ = θ depends on θ only through v(θ).

The proof of this is very similar to the proof of part 4 of Lemma 6.39.
Definition 6.88. A test is UMPU almost invariant (UMPUAI) level α if it is UMPU among all almost invariant level α tests.
Theorem 6.89. Suppose that P_θ ≪ μ for each θ and a hypothesis-testing problem is invariant under G and G̃. Suppose that there exists φ*, which is UMPU level α and φ* is unique a.e. [μ]. Suppose also that there exists φ₀, which is UMPUAI level α. Then φ₀ is also unique a.e. [μ] and φ₀ = φ* a.e. [μ].

PROOF. Let U_α be the class of all unbiased level α tests. First, we show that φ ∈ U_α if and only if φ_g ∈ U_α for each g, where φ_g(x) = φ(gx):

E_θ φ_g(X) = E_θ φ(gX) = E_{gθ} φ(X),

which is less than or equal to α or greater than or equal to α, respectively, according as gθ ∈ Ω_H or gθ ∈ Ω_A, that is, according as θ ∈ Ω_H or θ ∈ Ω_A, by invariance. This makes φ_g unbiased level α.
Next, we show that φ*_g is UMP in U_α. Since φ* ∈ U_α, we have that φ*_g ∈ U_α by the first result. Let θ ∈ Ω_A. Then

E_θ φ*(gX) = E_{gθ} φ*(X) = sup_{φ ∈ U_α} E_{gθ} φ(X) = sup_{φ ∈ U_α} E_θ φ(gX)
= sup_{φ ∈ U_α} E_θ φ_g(X) = sup_{φ ∈ U_α} E_θ φ(X) = E_θ φ*(X),

since φ* is UMP in U_α. So, φ*_g = φ* a.e. [μ] for each g by the uniqueness of φ*. This makes φ* almost invariant. Since φ₀ is UMPUAI level α, β_{φ₀}(θ) ≥ β_{φ*}(θ) for all θ ∈ Ω_A, so φ₀ is also UMPU level α and φ₀ = φ* a.e. [μ] also, and so it is also unique a.e. [μ]. □
This theorem does not guarantee that the UMPUAI level α test is UMPU, but it provides insurance that if there is a unique UMPU level α test, we can find it by finding the UMPUAI level α test.

One- Way Analysis of Variance


Consider the one-way ANOVA (analysis of variance). That is, Y_{ij} are conditionally independent with Y_{ij} ∼ N(μᵢ, σ²) given Mᵢ = μᵢ for j = 1, …, nᵢ and i = 1, …, k and Σ = σ. First, reduce by sufficiency to

Ȳᵢ = (1/nᵢ) Σ_{j=1}^{nᵢ} Y_{ij}, for i = 1, …, k, and W = Σ_{i=1}^{k} Σ_{j=1}^{nᵢ} (Y_{ij} − Ȳᵢ)².

Suppose that Ω_H = {(σ, μ) : BAμ = 0}, where B is an r × k matrix of rank r with r ≤ k, A is the diagonal matrix

A = diag(√n₁, …, √n_k),

and 0 is the vector all of whose coordinates are 0. Without loss of generality, we can assume that B is the first r rows of an orthogonal matrix Γ. Let Ȳ be the vector whose ith coordinate is Ȳᵢ for each i. Make the one-to-one transformation of the data to X = ΓAȲ, and W. Now, given the parameters, X is independent of W with

X ∼ N_k(γ, σ²I),   W ∼ σ²χ²_d,

where d = n − k and γ = ΓAμ. We can write

Ω_H = {(γ, σ) : γ₁ = ⋯ = γ_r = 0}.
Let the group G consist of triples (A, b, c), where A is r × r orthogonal, b is (k − r)-dimensional, and c > 0. Define

g_{A,b,c}(x, w) = [c(Ax₁, x₂ + b), c²w],

where we write xᵀ = (x₁ᵀ, x₂ᵀ) and x₁ is r-dimensional and x₂ is (k − r)-dimensional. In the parameter space,

ḡ_{A,b,c}(γ, σ) = [c(Aγ₁, γ₂ + b), cσ],

with γᵀ = (γ₁ᵀ, γ₂ᵀ) and γ₁ the first r coordinates; this preserves the hypothesis. So the testing problem is invariant.
The maximal invariant in 𝒳 is determined by f(g(x, w)) = f(x, w) for all g, x, and w. So, for fixed x and w, let A have first row proportional to x₁ᵀ, b = −x₂, and let c = 1/√w. Then

f((√(x₁ᵀx₁/w), 0, …, 0), 1) = f(x, w).

So x₁ᵀx₁/w is maximal invariant, since it is clearly invariant. The usual F statistic for testing H is just d/r times the maximal invariant, and it has noncentral F distribution NCF(r, d, δ), where δ = Σ_{i=1}^r γᵢ²/σ², conditional on the parameters. The hypothesis H is equivalent to δ = 0. Since the noncentral F distribution has MLR in the noncentrality parameter (see Problem 29 on page 289), the F-test is UMPUAI level α.
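For the classical equal-means special case (r = k − 1), the F statistic can be computed directly from the between- and within-group sums of squares; this sketch (made-up data, scipy assumed) checks it against scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

groups = [np.array([4.1, 3.8, 5.0]),
          np.array([5.5, 6.1, 5.8, 6.4]),
          np.array([4.9, 5.2, 5.0])]     # made-up data
k = len(groups)
n = sum(len(g) for g in groups)
grand = np.concatenate(groups).mean()

# Between-groups and within-groups sums of squares
B = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
W = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (B / (k - 1)) / (W / (n - k))
F_scipy, p_value = stats.f_oneway(*groups)
assert np.isclose(F, F_scipy)
```

Under the hypothesis of equal means, F has the central F(k − 1, n − k) distribution, which is NCF(r, d, 0) in the notation above.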

Multivariate Analysis of Variance


We now present an example of a case in which the number of tests available is so large that even a reduction by invariance still leaves too many tests to consider.21 Imagine that the data consist of exchangeable p-dimensional observations X₁, …, Xₙ. We will write the data matrix as M = (M₁|M₂|M₃), where

M₁ = (X₁, …, X_q), M₂ = (X_{q+1}, …, X_k), M₃ = (X_{k+1}, …, Xₙ),

and n > k ≥ q. The parameter is Θ = (M₁, …, M_k, Σ), where each Mᵢ is a p-dimensional vector and Σ is a p × p positive definite matrix. The conditional distribution of the Xᵢ given Θ = (μ₁, …, μ_k, Σ) is that the Xᵢ are independent with Xᵢ having the N_p(μᵢ, Σ) distribution for i ≤ k and Xᵢ having the N_p(0, Σ) distribution for i > k. The hypothesis of interest is M₁ = ⋯ = M_q = 0.
The group we choose for this problem comes in four parts:

G₁ = {g_A : A is a p × (k − q) matrix},
G₂ = {g_D : D is (n − k) × (n − k) orthogonal},
G₃ = {g_C : C is q × q orthogonal},
G₄ = {g_E : E is p × p nonsingular}.

These groups are applied in sequence as follows:

g_{A,D,C,E}M = [EM₁C | E(M₂ + A) | EM₃D].

The action on the parameter is ḡ_{A,D,C,E}Θ equal to

(E[μ₁| ⋯ |μ_q]C, E([μ_{q+1}| ⋯ |μ_k] + A), EΣEᵀ).

Note that the hypothesis is not altered by the action of g. That is,

M₁ = ⋯ = M_q = 0 if and only if E[M₁| ⋯ |M_q]C = [0| ⋯ |0].


To find the maximal invariant, we set

f(M) = f(g_{A,D,C,E}M), for all A, D, C, E, M.

In particular, suppose that A = −M₂; then f(M) = f([EM₁C | 0 | EM₃D]), where 0 is the matrix of all zeros. Now, consider the following lemma.
Lemma 6.90. Two a × b matrices R and T satisfy RRᵀ = TTᵀ if and only if T = RQ for some b × b orthogonal matrix Q.

21 For a good introduction to invariant tests in multivariate problems, see Anderson (1984, Chapter 8) or Kshirsagar (1972, Chapters 7-10).

PROOF. First, suppose that T = RQ; then TTᵀ = (RQ)(RQ)ᵀ = RRᵀ. Next, suppose RRᵀ = TTᵀ. Write the singular-value decompositions of R and T as

R = Γ_R Λ_R Ω_Rᵀ,   T = Γ_T Λ_T Ω_Tᵀ,

where Γ_R, Γ_T, Ω_R, and Ω_T are orthogonal and Λ_R and Λ_T are "diagonal" matrices arranged so that the absolute values of the diagonal entries increase as you read down the diagonal. (The Λ matrices are not really diagonal because they are not square. Their only nonzero entries are (1,1), (2,2), etc., however.) Then

RRᵀ = Γ_R Λ_R Λ_Rᵀ Γ_Rᵀ = Γ_T Λ_T Λ_Tᵀ Γ_Tᵀ = TTᵀ.

Since these are two representations of the eigenvalue decomposition of the same matrix, it follows that Γ_T = Γ_R and Λ_T = Λ_R J, where J is a diagonal matrix with ±1 in each diagonal entry. (If RRᵀ has eigenvalues with nonunit multiplicity, a permutation of the columns of Γ_T may be required to make it equal to Γ_R.) So, T = R Ω_R J Ω_Tᵀ. Since Ω_R J Ω_Tᵀ is orthogonal, it follows that T = RQ, where Q is orthogonal. □
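The "if" direction of Lemma 6.90 is easy to check numerically (a sketch; numpy assumed, with a random orthogonal Q obtained from a QR decomposition):

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.normal(size=(3, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random 5x5 orthogonal matrix
T = R @ Q

# T = RQ with Q orthogonal implies T T' = R Q Q' R' = R R'
assert np.allclose(T @ T.T, R @ R.T)
```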
Now, let M₁* be a p × q matrix such that M₁M₁ᵀ = M₁*M₁*ᵀ, and let C be orthogonal such that M₁* = M₁C. It follows that f([M₁|M₂|M₃]) = f([M₁*|M₂|M₃]). Similarly, if M₃* is such that M₃M₃ᵀ = M₃*M₃*ᵀ, then f([M₁|M₂|M₃]) = f([M₁|M₂|M₃*]). It follows that f is a function of M₁ and M₃ through M₁M₁ᵀ and M₃M₃ᵀ only. Define g(B, W) = f([M₁|0|M₃]), where B = M₁M₁ᵀ and W = M₃M₃ᵀ. It follows that f(M) = g(B, W) if f is invariant. Also, f(g_{A,C,D,E}M) = g(EBEᵀ, EWEᵀ). Finally, write the eigenvalue decomposition of

W^{-1/2} B W^{-1/2} = Γ diag(λ₁, …, λ_s, 0, …, 0) Γᵀ = Γ Λ Γᵀ,

where s is the rank of B. Note that s = min{p, q} with probability 1. Then set E = Γᵀ W^{-1/2}. It follows that f(M) = g(EBEᵀ, EWEᵀ) = g(Λ, I_p). Note that Λ is invariant; hence it is maximal invariant.
What we have just proven is that every invariant test in MANOVA must be a function of the nonzero eigenvalues of W^{-1/2}BW^{-1/2}, which are the same as the nonzero eigenvalues of W^{-1}B. A similar argument shows that the maximal invariant in the parameter space is the set of nonzero eigenvalues of Σ^{-1}M, where M = [M₁| ⋯ |M_q][M₁| ⋯ |M_q]ᵀ. The two special cases in which s = 1 are of interest. If p = 1, we have univariate ANOVA, and the only nonzero eigenvalue of W^{-1}B is q/(n − k) times F, where F is the F statistic for testing H. If q = 1, then the only eigenvalue of W^{-1}B is Hotelling's T². For cases in which s > 1, there is no UMPUAI test, but there are several well-known invariant tests based on the eigenvalues of W^{-1}B. One is based on the largest eigenvalue, another on the sum of the eigenvalues, and a third on the product of the nonzero eigenvalues.
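A sketch (made-up dimensions and simulated data; numpy assumed) of the three kinds of invariant statistics built from the eigenvalues of W⁻¹B; the eigenvalues beyond s = min{p, q} are numerically zero, as the rank argument predicts:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, nerr = 3, 2, 10                 # made-up dimensions
M1 = rng.normal(size=(p, q))          # "hypothesis" block
M3 = rng.normal(size=(p, nerr))       # "error" block
B = M1 @ M1.T                         # B = M1 M1'
W = M3 @ M3.T                         # W = M3 M3'

# eigenvalues of W^{-1} B, sorted in decreasing order
eig = np.sort(np.linalg.eigvals(np.linalg.solve(W, B)).real)[::-1]
s = min(p, q)                         # number of nonzero eigenvalues

largest = eig[0]                      # test based on the largest eigenvalue
total = eig[:s].sum()                 # test based on the sum
prod = np.prod(eig[:s])               # test based on the product

assert np.allclose(eig[s:], 0.0, atol=1e-8)
```

Because B has rank s = min{p, q} while W is nonsingular, W⁻¹B carries exactly s nonzero eigenvalues, and any invariant test is a function of them.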
A Test Based on Tolerance Sets

Let Θ = (M, Σ), and suppose that {Xₙ}_{n=1}^∞ are conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Let X = (X₁, …, Xₙ), and let V = Σ_{i=n+1}^{n+m} Xᵢ/m. Suppose that we want to try to develop a test of the hypothesis H : V ≤ c. First, we convert the hypothesis into a parametric hypothesis as in (3.15). For each δ ∈ (0, 1), let

Ω_δ = {θ = (μ, σ) : P'_θ(V ≤ c) ≥ δ}.

We might wish to choose values of δ and α and then require that, for all θ ∈ Ω_δ, P_θ(reject H) ≤ α. This means that we are trying to test H' : Θ ∈ Ω_δ at level α. We will use a version of the group described in Problem 11 on page 389. An element g_a of the group acts on X by g_a(x₁, …, xₙ) = (c + a(x₁ − c), …, c + a(xₙ − c)). The maximal invariant in the sufficient statistic space is T = √n(X̄ − c)/S.22 The maximal invariant in the parameter space is B = (M − c)/Σ. We know that V ≤ c if and only if √m(V − M)/Σ ≤ −√m B, and so P_θ(V ≤ c) ≥ δ if and only if B ≤ Φ⁻¹(1 − δ)/√m. So, Ω_δ = {θ : β ≤ β₀}, where β = (μ − c)/σ and β₀ = Φ⁻¹(1 − δ)/√m. So the test we seek is equivalent to a test of H : B ≤ β₀. The conditional distribution of T given B = β is noncentral t, NCt_{n−1}(√n β). This distribution has increasing MLR in the noncentrality parameter β (see Problem 29 on page 289). The UMP invariant level α test is to reject H if T is greater than the 1 − α quantile of the NCt_{n−1}(√n β₀) distribution. Let this quantile be denoted d. Then T > d is equivalent to c ∉ [X̄ − dS/√n, ∞), which in turn is equivalent to the test found in Example 5.73 on page 326.
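The cutoff d can be computed directly from the noncentral t distribution (a sketch with made-up values of n, m, δ, α, and data; scipy assumed):

```python
import numpy as np
from scipy import stats

n, m = 15, 5                 # made-up sample sizes
alpha, delta, c = 0.05, 0.9, 0.0

# beta0 = Phi^{-1}(1 - delta) / sqrt(m); d is the 1 - alpha quantile
# of the noncentral t distribution with n - 1 df and nc = sqrt(n)*beta0.
beta0 = stats.norm.ppf(1 - delta) / np.sqrt(m)
d = stats.nct.ppf(1 - alpha, df=n - 1, nc=np.sqrt(n) * beta0)

x = stats.norm.rvs(loc=0.2, scale=1.0, size=n, random_state=4)  # made-up data
T = np.sqrt(n) * (x.mean() - c) / x.std(ddof=1)
reject = T > d               # the UMP invariant level-alpha test
```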

6.4 Problems
Section 6.1.1:

1. Prove Proposition 6.3 on page 346.


2. Prove Proposition 6.5 on page 346.
3. Let Θ be a location parameter for X, let ℵ = Ω, and suppose that L(θ, a) is a function of θ − a. Prove that the risk function of a location equivariant rule δ is constant.

22The reader might wish to prove this in solving Problem 11 on page 389.

4. If Θ is a location parameter and Y = g(X) is location invariant, then prove that Y is ancillary.
5. Suppose that X₁, …, Xₙ are IID given Θ = θ each with density

f_{X₁|Θ}(x|θ) = √(2/π) exp{−(x − θ)²/2} I_{(θ,∞)}(x).

(This is called the half-normal distribution.) Let L(θ, a) = (θ − a)² and ℵ = Ω = ℝ. Let G be the one-dimensional location group, g_c(x₁, …, xₙ) = (x₁ + c, …, xₙ + c). Find the MRE estimator.
6. A function g : ℝⁿ → ℝ is even if g(−x₁, …, −xₙ) = g(x₁, …, xₙ). A function g is odd if g(−x₁, …, −xₙ) = −g(x₁, …, xₙ). Suppose that S is odd and location equivariant and that T is even and location invariant. Suppose that X₁, …, Xₙ are IID with density f with respect to Lebesgue measure such that f(c − x) = f(c + x) for some c and all x. (Such a density is called symmetric about c.) Suppose that the variances of S(X₁, …, Xₙ) and T(X₁, …, Xₙ) are both finite. Prove that the covariance between them is 0.

Section 6.1.2:

7. For each vector x = (x₁, …, xₙ), let k(x) denote the subscript of the last nonzero coordinate, with k(0, …, 0) = 0. Let x₀ = 1. Prove that a function u is scale invariant if and only if it is a function of x only through y(x) = (x₁/|x_{k(x)}|, …, xₙ/|x_{k(x)}|).
8. Suppose that δ₀ is scale equivariant and not identically 0. Prove that δ₁ is scale equivariant if and only if δ₁ = uδ₀ for some scale invariant u.

Section 6.2.1:

9. Prove Proposition 6.25 on page 354.


10. Prove Proposition 6.29 on page 355.
11.* Let X₁, …, Xₙ be IID N(μ, σ²) given Θ = (μ, σ). Let ℵ = {0, 1} and

L(θ, a) = R if μ ≥ μ₀ and a = 1,
          1 if μ < μ₀ and a = 0,
          0 otherwise.

(a) Prove that the formal Bayes rule with respect to the improper prior with Radon-Nikodym derivative 1/σ with respect to Lebesgue measure is the usual level 1/(1 + R) t-test.
(b) Let G be a group that acts on 𝒳 as follows:

g_c(x₁, …, xₙ) = (c(x₁ − μ₀) + μ₀, …, c(xₙ − μ₀) + μ₀),

for c > 0. Find Ḡ and G̃ so that this problem is invariant, and show that the t-test is equivariant.

Section 6.2.2:

12. Prove Proposition 6.43 on page 361. (Hint: The proof is very much like
Example 6.42. There is no need to transform the data.)

Section 6.2.3:

13.* Let Ω = (0, ∞). Suppose that X₁, …, Xₙ are IID given Θ = θ each with density

f_{X₁|Θ}(x|θ) = (1/θ) f(x/θ),

for some density function f. Let G be Group 2 and ℵ = [0, ∞). Let L(θ, a) = (θʳ − a)²/θ²ʳ, for some r > 0.
(a) Find Ḡ and G̃ so that this problem is invariant.
(b) Characterize all equivariant rules.
(c) Write a formula for the MRE rule.
(d) If f(x) = I_{[0,1)}(x), find the MRE rule.
14. Let Ω = (0, ∞). Suppose that X₁, …, Xₙ are IID given Θ = θ each with density

f_{X₁|Θ}(x|θ) = (1/θ) f(x/θ),

for some density function f. Let G be Group 2 and ℵ = [0, ∞). Let L(θ, a) = (k log(θ) − r log(a))², for some k, r > 0.
(a) Find Ḡ and G̃ so that this problem is invariant.
(b) Characterize all equivariant rules.
(c) Write a formula for the MRE rule.
15. Let X₁, …, Xₙ be IID U(0, θ) random variables conditional on Θ = θ, and let the action space be ℵ = [0, ∞). Let the loss function be L(θ, a) = (1 − a/θ)².
(a) Show that this problem is invariant under the one-dimensional scale group, Group 2.
(b) Find the MRE decision rule.
16. Prove Corollary 6.52 on page 366.
17. Prove Proposition 6.53 on page 367.
18. Suppose that X₁, …, Xₙ are IID U(θ₁, θ₂ + θ₁) given Θ = (θ₁, θ₂), where Ω = ℝ × ℝ⁺. Let ℵ = Ω and

L(θ, a) = ((θ₁ − a₁)/θ₂)² + ((θ₂ − a₂)/θ₂)².

Show that this problem is invariant under Group 3, and find the MRE decision rule.

19. Let f : ℝ → [0, ∞) be a function such that ∫ |x| f(x) dx < ∞. Suppose that X₁, …, Xₙ are conditionally IID given Θ = θ each with density f(x − θ). Let the prior density of Θ be proportional to f(c − θ). Suppose that the loss function is L(θ, a) = ρ(θ − a) for some function ρ. If the formal Bayes rule exists, show that it is the same as the MRE decision based on a sample containing one extra observation X_{n+1} = c.
20. Let X₁, …, Xₙ (n ≥ 2) be IID with Exp(1/θ) distribution given Θ = θ. Use the one-dimensional scale group, Group 2. Let the action space be the same as the parameter space, and let the loss be L(θ, a) = (θ² + a²)/(aθ).
(a) Find groups to act on the parameter and action spaces so that the decision problem is invariant.
(b) Find the best equivariant rule.
21. Suppose that X₁, …, Xₙ are conditionally IID given Θ = θ each with conditional density

f_{X₁|Θ}(x|θ) = aθᵃ x^{−(a+1)} I_{[θ,∞)}(x),

where a is known and the parameter space is Ω = (0, ∞). Let Group 2 (the one-dimensional scale group) act on the data. Let the action space be the same as the parameter space.
(a) Find groups acting on the parameter and action spaces so that the decision problem with loss L(θ, a) = (θ − a)²/θ² is invariant.
(b) Find the MRE decision rule.
Section 6.3.1:

22. Show that Theorem 6.74 applies, and state the conclusions of the theorem
in the situation described in Problem 31 on page 289.
Section 6.3.2:

23.* Each part of this question assumes the hypotheses of the preceding parts.
(a) Let P and Q be probability measures on (ℝ, 𝓑), where 𝓑 is the Borel σ-field. Suppose that X = (X₁, …, Xₙ) is an IID sample from a distribution with probability measure P. Let Y be another real-valued random variable independent of X with distribution Q. Let C = C(X) be a measurable subset of ℝ. Define the content of C by Q(C). Prove that the expected value of the content equals the probability that C(X) contains Y. You may assume all necessary measurability conditions.
(b) Let Ω be a parameter space, and suppose now that P is only known to be an element of the parametric family {P_θ : θ ∈ Ω} and that Q is only known to be an element of the parametric family {Q_θ : θ ∈ Ω} (same parameter space). Let E_θ represent expectation with respect to the conditional distribution of X given Θ = θ. Suppose that we wish

to choose C in order to maximize E_θ[Q_θ(C)] uniformly in θ subject to E_θ[P_θ(C)] ≤ β for all θ. Prove that this is equivalent to finding a uniformly most powerful size β critical region for the hypothesis-testing problem:

H: X₁, …, Xₙ, Y are an IID sample from P_θ for some θ ∈ Ω,
A: X₁, …, Xₙ are an IID sample from P_θ independent of Y, which has distribution Q_θ, for some θ ∈ Ω.

(c) Suppose that θ = (μ, σ) ∈ ℝ × ℝ⁺, P_θ is the normal N(μ, σ²) distribution, and Q_θ is the N(μ, aσ²) distribution for some known a ∈ (0, 1). Show that the hypothesis-testing problem from (b) is invariant under the location-scale group.
(d) Let S² = Σ_{i=1}^n (Xᵢ − X̄)², and show that (Y, X̄, S²) is a sufficient statistic for this problem. Also, find a maximal invariant in the sufficient statistic space under the action of the location-scale group.
(e) Among all sets C as described in part (a) which are also equivariant under the action of the location-scale group on X, find the one that uniformly maximizes E_θ[Q_θ(C)] subject to E_θ[P_θ(C)] ≤ β for all θ. (Hint: You may wish to use the form of the t density given on page 672.)
24. In Example 6.80 on page 380, prove that p_θ(x) equals the posterior probability that Θ is not in the interval B_{x,p_θ(x)}.
25. Prove that Theorem 6.78 applies to the situation in Example 5.57 on page 319. For the case α = 0.05 and n = 10, find the conditional confidence coefficients for the two intervals (−∞, T₁] and [T₁, ∞) given the ancillary if the sufficient statistic is (T₁, T₂) = (1, 1.3).

Section 6.3.3:

26.* Return to Problem 56 on page 293. Find a group of rotations and a loss function for estimating θ₂ that make the decision problem invariant. Show that the hypothesis and alternative H : θ₁ = 0 and A : θ₁ > 0 are invariant, and find the form of the UMPUAI level α test as closely as you can. (I do not think you can find the cutoffs in closed form.)
27.* Suppose that X is distributed like N_k(μ, Σ) given Θ = (μ, Σ). Let the group be Group 4 on page 354. Only one vector observation will be available.
(a) Show that the family of distributions is invariant, and show how a group element acts on the parameter space.
(b) Suppose that we wish to test the hypothesis H : M = 0 versus A : M ≠ 0. Show that the hypothesis-testing problem is invariant, and find the maximal invariant in the data space. Why are invariant tests useless in this case?
(c) Suppose that we wish to estimate M. Our action space is ℵ = ℝᵏ, and our loss function is L(θ, a) = (μ − a)ᵀΣ⁻¹(μ − a). Find a group G̃ operating on ℵ so that the loss is invariant.

(d) For the estimation problem, show that all equivariant rules are of the form δ(x) = cx for some scalar c. (Hint: First, prove that for i = 1, …, k, if x has 0 in coordinate i, then δ(x) has zero in coordinate i also. Finally, write δ(x) = α(x)x + β(x)y(x), where y(x) is orthogonal to x for all x and the representation is unique unless β(x) = 0. Then let A be an orthogonal matrix with first row proportional to xᵀ and second row proportional to y(x)ᵀ.)
28. Suppose that X₁, …, Xₙ are conditionally IID with N(μ, σ²) distribution given Θ = (μ, σ). Let G be the one-dimensional location group, g_c x = x + c1.
(a) Show that Assumption 6.82 holds.
(b) For what kinds of hypotheses can we find UMPUAI tests?
(c) Will these tests be UMPU?
29. Suppose that Y_{i,1}, …, Y_{i,nᵢ} are conditionally distributed as N_p(μᵢ, Σ) given Mᵢ = μᵢ and Σ̃ = Σ for i = 1, …, k, and all Y_{i,j} are conditionally independent. (Here Σ is a p × p positive definite matrix.) Suppose that the hypothesis to test is H : MAC = 0, where M is the p × k matrix whose ith column is Mᵢ, A is a k × k diagonal matrix with √nᵢ in the ith diagonal element, C is a k × r matrix that equals the first r columns of an orthogonal matrix, and 0 is a p × r matrix of all zeros. (Compare to the one-way analysis of variance on page 384.) Transform the data in order to put this problem into the form of the multivariate analysis of variance, and find the matrices W and B in the discussion that begins on page 386.
CHAPTER 7
Large Sample Theory

7.1 Convergence Concepts


In calculus courses, the concept of convergence of sequences is introduced.
In this section, we will generalize that concept to include different types of
stochastic convergence.

7.1.1 Deterministic Convergence


We begin by defining types of deterministic convergence.
Definition 7.1. Let {xₙ}_{n=1}^∞ be a sequence in a normed linear space,1 and let {rₙ}_{n=1}^∞ be a sequence of real numbers. We say that xₙ is small order of rₙ (as n → ∞), denoted xₙ = o(rₙ), if for each c > 0 there exists N such that ‖xₙ‖ ≤ c|rₙ| for each n ≥ N. We say that xₙ is large order of rₙ (as n → ∞), denoted xₙ = O(rₙ), if there exist c > 0 and N such that ‖xₙ‖ ≤ c|rₙ| for each n ≥ N. If {yₙ}_{n=1}^∞ is a sequence of vectors and xₙ − yₙ = o(rₙ) (or O(rₙ)), then we write xₙ = yₙ + o(rₙ) (or yₙ + O(rₙ)).
What large order and small order allow us to do is to discuss limits of ratios without being explicit about the ratios as long as they stay bounded or go to zero. Large order means that the ratio of the quantities remains bounded. Small order means that the ratio goes to 0.
Example 7.2. Since lim_{n→∞} log(n)/n = 0, we have log(n) = o(n). Also, nʳ = o(nᵖ) if p > r. It is easy to prove that (n choose k) = O(nᵏ) for fixed k.

1 The norm of x is denoted by ‖x‖. Note that a normed linear space is a metric space with metric d(x, y) = ‖x − y‖.
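The order relations of Example 7.2 can be illustrated numerically (a sketch; the elementary bound C(n, k) ≤ nᵏ/k! for fixed k is what gives the O(nᵏ) claim):

```python
import math

# log(n)/n -> 0, so log(n) = o(n): the ratios decrease toward zero.
vals = [math.log(n) / n for n in (10, 10**3, 10**6)]
assert vals[0] > vals[1] > vals[2]

# C(n, k)/n^k is bounded by 1/k! for fixed k, so C(n, k) = O(n^k).
k = 3
ratios = [math.comb(n, k) / n**k for n in (10, 100, 1000)]
assert all(r <= 1 / math.factorial(k) for r in ratios)
```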
7.1. Convergence Concepts 395

Here are some simple consequences of the definitions:

- If x_n = o(r_n), then x_n = O(r_n).
- If c is real and nonzero, then x_n = O(r_n) if and only if x_n = O(c r_n). Similarly, x_n = o(r_n) if and only if x_n = o(c r_n).
- Suppose that y_n = o(r_n). If x_n = O(y_n), then x_n = o(r_n); likewise, if x_n = o(y_n), then x_n = o(r_n).
- If x_n = o(r_n) and y_n = o(s_n), then x_n + y_n = o(|r_n| + |s_n|). Similarly, if x_n = O(r_n) and y_n = O(s_n), then x_n + y_n = O(|r_n| + |s_n|).
- If x_n = o(r_n) and y_n = O(s_n), then x_n + y_n = O(|r_n| + |s_n|).
- If x_n = o(r_n) and y_n = o(s_n), then x_n y_n = o(r_n s_n). Similarly, if x_n = O(r_n) and y_n = O(s_n), then x_n y_n = O(r_n s_n).
- If x_n = o(r_n) and y_n = O(s_n), then x_n y_n = o(r_n s_n).
There will be several situations in which we need to use the concepts of small order and large order. Let {r_n}_{n=1}^∞ be a sequence of real numbers.
1. If lim sup ||x_n||/|r_n| < ∞, then x_n = O(r_n).
2. If lim sup ||x_n||/|r_n| = 0, then x_n = o(r_n).
3. x_n = o(1) if and only if lim_{n→∞} x_n = 0.
4. If r_n = o(1) and m is fixed, then (1 + r_n)^m = 1 + o(1).
5. If x_{n,k} = o(r_n) as n → ∞ for each k = 1, ..., m, then Σ_{k=1}^m x_{n,k} = o(r_n) if m is fixed.
This last example requires that m be fixed as n → ∞. To see that it is false otherwise, consider x_{n,k} = 2^k/n = o(1) as n → ∞. But Σ_{k=1}^n x_{n,k} → ∞ as n → ∞.
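A quick numerical illustration of this failure (a sketch, not from the text): each term vanishes as n grows with k fixed, but the row sums with m = n terms diverge.

```python
# Each x_{n,k} = 2^k / n is o(1) as n -> infinity for fixed k,
# but summing over k = 1, ..., n gives (2^{n+1} - 2)/n, which blows up.

def x(n, k):
    return 2.0 ** k / n

# For fixed k, the terms vanish as n grows.
print(x(10 ** 6, 5))   # tiny

# The row sums with m = n terms grow without bound.
print([sum(x(n, k) for k in range(1, n + 1)) for n in (10, 20, 30)])
```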

7.1.2 Stochastic Convergence


Next, we define stochastic versions of small order and large order. The setup requires a sequence of probability spaces {(𝒳_n, ℬ_n, P_n)}_{n=1}^∞. Here, we assume that each space 𝒳_n is a normed linear space with norm ||·||_n and that there are functions X_n : S → 𝒳_n, where (S, 𝒜, μ) is an underlying probability space. (As before, μ(A) for A ∈ 𝒜 will often be denoted Pr(A), and conditional probabilities derived from μ denoted Pr(·|·).) In this case, P_n is the probability induced on (𝒳_n, ℬ_n) by X_n from μ. A common example is the one in which S = ℝ^∞, 𝒳_n = ℝ^n, and X_n is the first n coordinates. All of the results in this section and Section 7.2 apply equally well to cases in which the probabilities P_n are already conditional probabilities given some parameter Θ. Of course, in such cases, P_n would actually be P_{θ,n} and Pr would be denoted Pr_θ. Problems 5 and 6 (see page 468) show how to convert certain limit theorems that are conditional on Θ into marginal limit theorems.
Definition 7.3. Let {X_n}_{n=1}^∞ be a sequence of random quantities as above, and let {r_n}_{n=1}^∞ be a sequence of numbers. We say that X_n is stochastically small order of r_n (as n → ∞), denoted X_n = o_P(r_n), if, for each c > 0 and each ε > 0, there exists N such that Pr(||X_n||_n ≤ c|r_n|) ≥ 1 − ε for all n ≥ N. We say X_n is stochastically large order of r_n (as n → ∞), denoted X_n = O_P(r_n), if, for each ε > 0, there exist c > 0 and N such that Pr(||X_n||_n ≤ c|r_n|) ≥ 1 − ε for all n ≥ N. If {Y_n}_{n=1}^∞ is a sequence of random vectors and X_n − Y_n = o_P(r_n) (or O_P(r_n)), then we write X_n = Y_n + o_P(r_n) (or Y_n + O_P(r_n)).
Proposition 7.4. X_n = o_P(r_n) if and only if, for each c > 0,
    lim_{n→∞} Pr(||X_n||_n ≤ c|r_n|) = 1.

Note that in the definition of O_P, the c is allowed to vary with ε, so there is no obvious analog to Proposition 7.4 for O_P. We will usually leave the subscript n off of the norm ||·||_n, since there is seldom any chance of confusing one norm with another.
Example 7.5. Let {Z_n}_{n=1}^∞ be IID random variables with mean μ and variance σ². Let X_n = √n(Z̄_n − μ)/σ. So 𝒳_n = ℝ for every n, and P_n, which is the distribution of X_n, is a probability measure on the Borel subsets of the real line. The central limit theorem B.97 (together with Problem 25 on page 664) says that lim_{n→∞} P_n((−∞, t]) = Φ(t) for all t, where Φ is the standard normal CDF. For each ε > 0, there exists t such that Φ(t) − Φ(−t) ≥ 1 − ε/2. Choose N such that for each n ≥ N,
    P_n((−∞, −t]) ≤ Φ(−t) + ε/4,   P_n((−∞, t]) ≥ Φ(t) − ε/4.
It follows that Pr(|X_n| ≤ t) equals
    P_n((−∞, t]) − P_n((−∞, −t]) ≥ Φ(t) − Φ(−t) − ε/2 ≥ 1 − ε.
Hence, X_n = O_P(1).² Also, Z̄_n − μ = O_P(1/√n). If 0 ≤ α < 1/2, then Z̄_n − μ = o_P(n^{−α}). In particular (α = 0), Z̄_n − μ = o_P(1).
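A small simulation of the O_P(1/√n) behavior in Example 7.5 (a sketch, not from the text; the constants μ and σ are arbitrary choices): the rescaled deviations √n(Z̄_n − μ)/σ have spread near 1 no matter how large n is.

```python
import random
import statistics

random.seed(0)
mu, sigma = 2.0, 3.0

def zbar(n):
    # Sample average of n IID N(mu, sigma^2) draws.
    return statistics.fmean(random.gauss(mu, sigma) for _ in range(n))

for n in (100, 2500):
    devs = [(zbar(n) - mu) * n ** 0.5 / sigma for _ in range(200)]
    # The spread of the rescaled deviations stays near 1 as n grows.
    print(n, round(statistics.stdev(devs), 1))
```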
Stochastic convergence is closely related to the concept of convergence in probability. We restate Definition B.89 in the present context.
Definition 7.6. If {X_n}_{n=1}^∞ and X are random quantities in a normed linear space, and if, for every ε > 0, lim_{n→∞} Pr(||X_n − X|| > ε) = 0, then we say that X_n converges in probability to X, which is written X_n →^P X.

²This phenomenon is quite general. See Problem 3 on page 467.



Proposition 7.7. Suppose that Y_n = f_n(X_n) for each n, where f_n : 𝒳_n → ℛ, and ℛ is a normed linear space with Borel σ-field. Assume that each f_n is measurable. Let Y : S → ℛ be another random quantity. Then ||Y_n − Y|| = o_P(1) if and only if Y_n converges in probability to Y.
Example 7.8. Suppose that lim_{n→∞} E(Y_n − c)² = 0. Then Tchebychev's inequality can be used to prove that Y_n →^P c.

Definition 7.9. Let {P_θ : θ ∈ Ω} be a parametric family of distributions on a sequence space 𝒳^∞, and let g : Ω → G be a measurable function to a metric space G with Borel σ-field. Let 𝒳_n = 𝒳^n, and let Y_n : 𝒳_n → G be measurable. We say that Y_n is consistent for g(Θ) if Y_n →^P g(θ) conditional on Θ = θ, for all θ ∈ Ω.
Example 7.10. Let {X_n}_{n=1}^∞ be conditionally IID N(μ, σ²) given Θ = (σ, μ). Let Y_n = Σ_{i=1}^n X_i/n and g(θ) = μ. Then Y_n is consistent for g(Θ) according to the weak law of large numbers B.95.

The following is a more general definition of "in probability."
Definition 7.11. Suppose that {(𝒳_n, ℬ_n, P_n)}_{n=1}^∞ is a sequence of probability spaces. Define 𝒴 = Π_{n=1}^∞ 𝒳_n. Let T ⊆ 𝒴. We say that T occurs in probability, denoted P(T), if, for each ε > 0, there exists T_n(ε) ∈ ℬ_n for n = 1, 2, ... such that P_n(T_n(ε)) ≥ 1 − ε for each n and Π_{n=1}^∞ T_n(ε) ⊆ T.
The following lemma essentially says that a sequence of random quantities {Y_n}_{n=1}^∞ is O_P(r_n) or o_P(r_n) if and only if the set of possible values for (Y_1, Y_2, ...) which are O(r_n) or o(r_n) occurs in probability.
Lemma 7.12. Use the notation from Definition 7.11. Let Y_n = f_n(X_n) and let
    T = {(x_1, x_2, ...) ∈ 𝒴 : f_n(x_n) = o(r_n)}.
Then P(T) if and only if Y_n = o_P(r_n). Similarly, if
    T = {(x_1, x_2, ...) ∈ 𝒴 : f_n(x_n) = O(r_n)},
then P(T) if and only if Y_n = O_P(r_n).


PROOF. We will do only the o_P part, since the O_P part is similar. First, for the "if" part, assume Y_n = o_P(r_n) and let ε > 0. Let c_1 > c_2 > ... decrease to 0. For each i > 1, let N(ε, c_i) ≥ N(ε, c_{i−1}) be such that for n ≥ N(ε, c_i), Pr(||Y_n|| ≤ c_i|r_n|) ≥ 1 − ε. Define T_n(ε) = 𝒳_n for n = 1, ..., N(ε, c_1). For N(ε, c_{i−1}) < n ≤ N(ε, c_i), define
    T_n(ε) = {x_n : ||f_n(x_n)|| ≤ c_{i−1}|r_n|}.
By construction, we have P_n(T_n(ε)) ≥ 1 − ε for every n. If (x_1, x_2, ...) ∈ Π_{n=1}^∞ T_n(ε), then lim_{n→∞} ||f_n(x_n)||/|r_n| = 0 by construction. It follows that (x_1, x_2, ...) ∈ T, and we have proven P(T).
For the "only if" part, assume that P(T) and let T_n(ε) be as in Definition 7.11. Since f_n(x_n) = o(r_n) for (x_1, x_2, ...) ∈ T, it follows that
    Z_n = sup_{x_n ∈ T_n(ε)} ||f_n(x_n)||/|r_n| < ∞,
for all but finitely many n. Hence,
    P_n({x_n : ||f_n(x_n)||/|r_n| ≤ Z_n}) ≥ P_n(T_n(ε)) ≥ 1 − ε.
Now, choose x′_n ∈ T_n(ε) such that
    Z_n ≤ ||f_n(x′_n)||/|r_n| + 1/n.
Since (x′_1, x′_2, ...) ∈ Π_{n=1}^∞ T_n(ε) ⊆ T, we have ||f_n(x′_n)||/|r_n| → 0 as n → ∞, so lim_{n→∞} Z_n = 0. For each ε > 0 and c > 0, choose N such that if n ≥ N, then Z_n ≤ c. It follows that, if n ≥ N, then Pr(||Y_n||/|r_n| ≤ c) ≥ 1 − ε. Hence, Y_n = o_P(r_n). □

Example 7.13. Let f_n : 𝒳_n → ℝ and Y_n = f_n(X_n). Define
    T = {(x_1, x_2, ...) ∈ 𝒴 : f_n(x_n) = o(1)}.
Then Y_n = o_P(1) if and only if P(T), according to Lemma 7.12.

If countably many things occur in probability, then they simultaneously occur in probability.
Proposition 7.14.³ If P(S_i) for i = 1, 2, ..., then P(∩_{i=1}^∞ S_i). If T ⊆ S, then P(T) implies P(S).
We are now in position to prove a theorem that says (in a more precise manner) that if you can prove a result involving O and o, then you can replace O by O_P and o by o_P and prove a corresponding result.
Theorem 7.15.⁴ Let 𝒴_0, 𝒴_{1,1}, 𝒴_{1,2}, ..., 𝒴_{2,1}, 𝒴_{2,2}, ... be metric spaces. Let h_n : 𝒳_n → 𝒴_0, f_n^{(j)} : 𝒳_n → 𝒴_{1,j} for j = 1, 2, ..., and g_n^{(k)} : 𝒳_n → 𝒴_{2,k} for k = 1, 2, .... Suppose that f_n^{(j)}(X_n) = O_P(r_n^{(j)}) and g_n^{(k)}(X_n) = o_P(s_n^{(k)}) for all j and k. Also, suppose that it is known that
    (f_n^{(j)}(x_n) = O(r_n^{(j)}) and g_n^{(k)}(x_n) = o(s_n^{(k)}) for all j and k) implies h_n(x_n) = O(t_n) (or h_n(x_n) = o(t_n));
then h_n(X_n) = O_P(t_n) (or h_n(X_n) = o_P(t_n)).

³This proposition is used in the proof of Theorem 7.15.
⁴This theorem is used to help develop the delta method.

PROOF. We will only prove the O_P part. The o_P part is virtually identical. Let S^{(2j−1)} = {x : f_n^{(j)}(x_n) = O(r_n^{(j)})} for all j and S^{(2k)} = {x : g_n^{(k)}(x_n) = o(s_n^{(k)})} for all k. (If there are only finitely many f_n^{(j)} or g_n^{(k)}, then just let S^{(i)} = Π_{n=1}^∞ 𝒳_n after you run out of functions.) Define T = {x : h_n(x_n) = O(t_n)}. The stated conditions imply that ∩_{i=1}^∞ S^{(i)} ⊆ T. Also, we have assumed that P(S^{(i)}) for all i, so P(T) by Proposition 7.14. □
Example 7.16. Suppose that w : ℝ → ℝ has k + 1 continuous derivatives at c. Define
    T_k(x, c) = w(c) + (x − c)w′(c) + ... + (1/k!)(x − c)^k w^{(k)}(c),
where g^{(k)} denotes the kth derivative of g. Taylor's theorem C.1 says (among other things) that
    lim_{x→c} [w(x) − T_k(x, c)]/(x − c)^k = 0.
Suppose that x_n − c = O(r_n), where r_n = o(1). Then x_n − c = o(1), and we conclude that w(x_n) − T_k(x_n, c) = o((x_n − c)^k), hence w(x_n) = T_k(x_n, c) + o(r_n^k). Similarly, we can write w(x_n) = T_{k−1}(x_n, c) + O(r_n^k). Now, suppose that X_n − c = O_P(r_n). In the notation of Theorem 7.15, let 𝒳_n = ℝ for all n and let 𝒴_0 = 𝒴_{1,1} = ℝ. For each n, let f_n^{(1)}(x) = x and h_n(·) = w(·) − T_k(·, c) or w(·) − T_{k−1}(·, c). Suppose that there are no g functions. Then Theorem 7.15 says that
    w(X_n) = T_k(X_n, c) + o_P(r_n^k) = T_{k−1}(X_n, c) + O_P(r_n^k).
Furthermore, if w has k + 1 continuous derivatives everywhere, then if X_n − X′_n = O_P(r_n), then w(X_n) = T_k(X_n, X′_n) + o_P(r_n^k) = T_{k−1}(X_n, X′_n) + O_P(r_n^k).
Corollary 7.17.⁵ Let 𝒴 and 𝒵 be metric spaces. If Y_n = f_n(X_n) ∈ 𝒴 and Y_n →^P c ∈ 𝒴 and g : 𝒴 → 𝒵 is continuous at c, then g(Y_n) →^P g(c).
Another type of stochastic convergence is convergence in distribution. We restate Definition B.80 here.
Definition 7.18. Let {X_n}_{n=1}^∞ be a sequence of random quantities and let X be another random quantity, all taking values in the same topological space 𝒳. Suppose that
    lim_{n→∞} E(f(X_n)) = E(f(X))
for every bounded continuous function f : 𝒳 → ℝ; then we say that X_n converges in distribution to X, which is written X_n →^D X or ℒ(X_n) → ℒ(X). If X_n →^D X, we call the distribution of X the asymptotic distribution of X_n. If X_n →^D X, and if R_n and R are the distributions of X_n and X, respectively, then we say that R_n converges weakly to R, denoted R_n →^w R.
⁵This corollary is used to help prove that posterior distributions are asymptotically normal.

The portmanteau theorem B.83 gives several criteria that are equivalent to convergence in distribution. These can be used to derive a connection between convergence in distribution and o_P.
Lemma 7.19.⁶ Suppose that 𝒳 is a metric space with metric d. If X_n →^D X and d(X_n, Y_n) = o_P(1), then Y_n →^D X.
PROOF. Let R_n be the distribution of Y_n, and let P be the distribution of X. We must show that R_n →^w P. (See Definition 7.18.) Let B be an arbitrary closed set. According to the portmanteau theorem B.83, it suffices to show that lim sup R_n(B) ≤ P(B). Define, for a set C,
    d(x, C) = inf_{y ∈ C} d(x, y).
Then
    {Y_n ∈ B} ⊆ {d(X_n, B) ≤ ε} ∪ {d(X_n, Y_n) > ε}.
Define C_ε = {x : d(x, B) ≤ ε}, which is a closed set. So,
    R_n(B) = Pr(Y_n ∈ B) ≤ Pr(d(X_n, B) ≤ ε) + Pr(d(X_n, Y_n) > ε) = P_n(C_ε) + Pr(d(X_n, Y_n) > ε).
We have assumed that lim_{n→∞} Pr(d(X_n, Y_n) > ε) = 0 and that X_n →^D X, so we conclude lim sup_{n→∞} R_n(B) ≤ lim sup_{n→∞} P_n(C_ε) ≤ P(C_ε). Since B is closed, lim_{ε→0} P(C_ε) = P(B). It follows then that
    lim sup_{n→∞} R_n(B) ≤ P(B),
hence Y_n →^D X. □
Lemma 7.19 says that if X_n →^D X, then so too does anything close to X_n, that is, anything that differs from X_n by o_P(1).
Theorem 7.20. If the σ-field on 𝒳 × 𝒴 is the product σ-field, if X_n →^D X and Y_n →^D Y, and if X_n is independent of Y_n for all n, then (X_n, Y_n) →^D (X, Y), where X and Y are independent.
PROOF. Since X_n and Y_n are independent for each n, their joint characteristic function is
    E exp{i t^⊤ X_n} E exp{i s^⊤ Y_n}.
This product converges to E exp{i t^⊤ X} E exp{i s^⊤ Y}, which is the characteristic function of independent X and Y. Now apply Theorem B.93. □

⁶This lemma is used in the proofs of Theorems 7.22, 7.25, 7.35, and 7.63 and to help develop the delta method.

Using the fact that a constant is independent of everything, we have the following simple corollary to Theorem 7.20.
Corollary 7.21.⁷ Suppose that {X_n}_{n=1}^∞ take values in a metric space 𝒳. If X_n →^D X and b ∈ 𝒴 is a constant, then (X_n, b) →^D (X, b).
The conclusions of the following theorem are taken for granted in many calculations of asymptotic distributions.
Theorem 7.22. 1. Suppose that {X_n}_{n=1}^∞ take values in a topological space 𝒳 and that {Y_n}_{n=1}^∞ take values in a topological space 𝒴. If (X_n, Y_n) →^D (X, Y), then X_n →^D X.
2. Suppose that {X_n}_{n=1}^∞ take values in a metric space 𝒳 and that {Y_n}_{n=1}^∞ take values in a metric space 𝒴. Let b ∈ 𝒴. If X_n →^D X and Y_n →^P b, then (X_n, Y_n) →^D (X, b).
PROOF. For part 1, let g : 𝒳 × 𝒴 → 𝒳 be defined by g(x, y) = x. Then g is continuous, and the continuous mapping theorem B.88 says that X_n = g(X_n, Y_n) →^D g(X, Y) = X.
For part 2, let d_1 be the metric in 𝒳 and let d_2 be the metric in 𝒴. Then
    d((x_1, y_1), (x_2, y_2)) = d_1(x_1, x_2) + d_2(y_1, y_2)
is a metric in 𝒳 × 𝒴, and the product σ-field is the Borel σ-field. By Corollary 7.21, we have that (X_n, b) →^D (X, b). We have assumed that d((X_n, Y_n), (X_n, b)) = d_2(Y_n, b) = o_P(1). So, by Lemma 7.19, (X_n, Y_n) →^D (X, b). □

7.1.3 The Delta Method


A method for finding the asymptotic distribution of a function of a random vector is based on Lemma 7.19 and is called the delta method. [See Rao (1973), Chapter 6.] As an example, let Y_n be the average of n IID random variables with mean μ and variance σ². The central limit theorem B.97 says that √n(Y_n − μ) →^D N(0, σ²).⁸ Now, let g be a function with continuous derivative. We can write
    g(t) = g(μ) + (t − μ)g′(μ) + o(t − μ).

⁷This corollary is used in the proof of Theorem 7.22.

⁸It is common to call an estimator Z_n, with the property that √n(Z_n − θ) converges in distribution to a nondegenerate distribution, √n-consistent.

If we are interested in g(Y_n), we can write
    g(Y_n) = g(μ) + (Y_n − μ)g′(μ) + o(Y_n − μ),
    √n(g(Y_n) − g(μ)) = √n(Y_n − μ)g′(μ) + √n o(Y_n − μ).
Since √n(Y_n − μ) converges in distribution, Y_n − μ = O_P(1/√n). Hence √n o(Y_n − μ) = o_P(1) by Theorem 7.15. So,
    √n(g(Y_n) − g(μ)) = √n(Y_n − μ)g′(μ) + o_P(1).
By Lemma 7.19, we get a useful result when g′(μ) ≠ 0:
    √n(g(Y_n) − g(μ)) →^D N(0, σ²[g′(μ)]²).
The result in the example above suggests a valuable use for the delta method. If the variance of the asymptotic distribution of √n(Y_n − μ) is an undesirable quantity in the application for which it is intended, then a transformation of Y_n will have a different variance that may be more suitable. For example, suppose that nY_n has Bin(n, p) distribution given P = p. The asymptotic distribution of √n(Y_n − p) is N(0, p(1 − p)) given P = p. For comparing several possible values of P, it might be nice if the only dependence of the random variable on P were through the mean. This can be arranged asymptotically by choosing a function g such that
    g′(p) = 1/√(p(1 − p)).
This is a simple differential equation to solve, and the solution is g(t) = 2 arcsin(√t). The asymptotic distribution of
    √n[2 arcsin(√Y_n) − 2 arcsin(√p)]
is N(0, 1) given P = p. This is a special case of what is called a variance stabilizing transformation. The general method for constructing a variance stabilizing transformation is as follows. Suppose that √n(Y_n − μ) has asymptotic distribution N(0, h(μ)). Then, choose a function g such that g′(μ) = 1/√h(μ). That is,
    g(t) = ∫_c^t [1/√h(μ)] dμ,
where c is any constant such that the integral exists. The asymptotic distribution of √n(g(Y_n) − g(μ)) will be N(0, 1). It is common, when √n(Y_n − μ) →^D N(0, σ²), to say that the asymptotic distribution of Y_n is N(μ, σ²/n). In symbols, we may write Y_n ~ AN(μ, σ²/n). In such cases we will call σ²/n the asymptotic variance of Y_n.
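A simulation sketch of the arcsine transform (the sample sizes and the grid of p values are arbitrary choices, not from the text): n·Var(Y_n) tracks p(1 − p), while n·Var(2 arcsin √Y_n) stays near 1 for every p.

```python
import math
import random
import statistics

random.seed(0)
n, reps = 300, 1000

for p in (0.1, 0.5, 0.9):
    # Binomial proportions Y_n and their arcsine transforms.
    ybar = [sum(random.random() < p for _ in range(n)) / n for _ in range(reps)]
    g = [2.0 * math.asin(math.sqrt(y)) for y in ybar]
    print(p,
          round(n * statistics.variance(ybar), 2),   # tracks p(1 - p)
          round(n * statistics.variance(g), 2))      # stays near 1
```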

There is also a multivariate delta method. If g : ℝ^k → ℝ has continuous first partial derivatives, let ∇g(μ) be the gradient (vector of first partial derivatives) at μ. Then g(t) = g(μ) + (t − μ)^⊤∇g(μ) + o(t − μ). If √n(Y_n − μ) →^D N_k(0, Σ), then
    √n(g(Y_n) − g(μ)) →^D N(0, ∇g(μ)^⊤ Σ ∇g(μ)).
Here are some multivariate applications of the delta method.
Example 7.23. Importance sampling (see Section B.7) is a means of approximating the ratio of integrals of the form ∫v(θ)h(θ)dθ / ∫h(θ)dθ. Let {X_n}_{n=1}^∞ be an IID sequence of pseudorandom numbers with density f, and let W_i = h(X_i)/f(X_i) and Z_i = v(X_i)W_i. If these have finite variance, then the sample averages (W̄_n, Z̄_n) will, by the multivariate central limit theorem B.99, be approximately bivariate normal with mean (w, z) = (∫h(θ)dθ, ∫v(θ)h(θ)dθ) and covariance matrix equal to 1/n times the covariance matrix Σ = ((σ_{i,j})) of the (W_i, Z_i) pairs. Now, apply the delta method to find the asymptotic distribution of the ratio Z̄_n/W̄_n of the sample averages. The asymptotic mean is z/w, the ratio we want to approximate, and the asymptotic variance is
    (1/n)[z²σ_{1,1}/w⁴ − 2zσ_{1,2}/w³ + σ_{2,2}/w²].
In practice, it is common to approximate Σ by the sample covariance matrix of the (W_i, Z_i) pairs.
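A sketch of the ratio estimator just described, with hypothetical choices of v, h, and f (none of them from the text): h(t) = e^{−t²/2} is an unnormalized N(0, 1) density, v(t) = t², and the proposal f is uniform on [−5, 5], so the true ratio ∫vh/∫h is the N(0, 1) second moment, 1.

```python
import math
import random

random.seed(0)
n = 100_000

def h(t):
    return math.exp(-t * t / 2.0)   # unnormalized N(0, 1) density

def v(t):
    return t * t

ws, zs = [], []
for _ in range(n):
    x = random.uniform(-5.0, 5.0)
    w = h(x) / 0.1                  # importance weight h(x)/f(x); f = 1/10 on [-5, 5]
    ws.append(w)
    zs.append(v(x) * w)

ratio = sum(zs) / sum(ws)
print(round(ratio, 2))              # close to the true value 1
```

As the example's final sentence suggests, the asymptotic variance of this ratio could then be estimated by plugging the sample covariance matrix of the (W_i, Z_i) pairs into the delta-method formula.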

The following example uses the reasoning behind the delta method without using the delta method itself.
Example 7.24. Suppose that we wish to find the asymptotic distribution of the roots of polynomials with random coefficients. Let Y_n ~ AN_{k+1}(μ, Σ/n), where Y_n = (Y_{n0}, ..., Y_{nk})^⊤. Define the polynomial
    P_n(u) = Σ_{j=0}^k Y_{nj} u^j.
Let U_n* be the smallest root of P_n(u). Define p(u) = Σ_{j=0}^k μ_j u^j, and suppose that its smallest root is u_0 and this root has multiplicity one. That is, p(u_0) = 0 but p′(u_0) ≠ 0. It is not difficult to show that the smallest root with odd multiplicity of a polynomial is a continuous function of the coefficients.⁹ It follows

⁹The main reason is that a polynomial changes sign as the variable passes a root of odd multiplicity. There will be points arbitrarily close to the root at which the polynomial has opposite signs. If the coefficients don't change much, the signs will remain the same at these points, hence a root will be between them. If the root had even multiplicity, the polynomial would have constant sign in a neighborhood of the root, and small changes in the coefficients could remove all roots from the neighborhood.

from Theorem 7.15 that U_n* →^P u_0. To find the asymptotic distribution of U_n*, write P_n(U_n*) as
    0 = P_n(U_n*) = P_n(u_0) + (U_n* − u_0)P_n′(V_n*),
where V_n* is between u_0 and U_n*. So |V_n* − u_0| ≤ |U_n* − u_0|, and V_n* →^P u_0 also. Furthermore,
    P_n′(V_n*) = Σ_{j=1}^k j(V_n*)^{j−1} Y_{nj} →^P Σ_{j=1}^k j u_0^{j−1} μ_j = p′(u_0) ≠ 0.
So, U_n* − u_0 = [p(u_0) − P_n(u_0)]/W_n*, where W_n* = P_n′(V_n*) →^P p′(u_0). Now, let u = (1, u_0, ..., u_0^k)^⊤ and write
    √n(U_n* − u_0) = −√n u^⊤(Y_n − μ)/W_n*,
which converges in distribution to N(0, u^⊤Σu/[p′(u_0)]²) by Theorem 7.22.

7.2 Sample Quantiles


The reader interested in a thorough treatment of sample quantiles should
read the book by David (1970). In this section, we present some of the more
commonly used asymptotic results on the distribution of sample quantiles.

7.2.1 A Single Quantile


Suppose that {X_n}_{n=1}^∞ are conditionally IID random variables with distribution P given P = P, and suppose that P has a CDF F with derivative f (at least in a neighborhood of x_p, where F(x_p) = p) and ∞ > f(x_p) > 0. If the observed values of the first n X_i, when ordered from smallest to largest, are x_{(1)}, ..., x_{(n)}, define the empirical CDF by F_n(x_{(i)}) = i/n and interpolate linearly in between. (Do something arbitrary, but continuous and strictly increasing, below x_{(1)}.) Now, F_n is continuous and strictly increasing on (−∞, x_{(n)}].¹⁰ Define the sample p quantile by Y_p^{(n)} = F_n^{−1}(p), for 0 < p < 1.
The goal of this section is to prove a theorem specifying the asymptotic distribution of a sample quantile.

¹⁰If F(c) = 0 and F(x) > 0 for x > c, then we only need F_n to be strictly increasing on [c, x_{(n)}].

Theorem 7.25. Suppose that {X_n}_{n=1}^∞ are conditionally IID with distribution P given P = P, and suppose that P has CDF F with derivative f in a neighborhood of x_p, where F(x_p) = p, 0 < f(x_p) < ∞, and 0 < p < 1. Define Y_p^{(n)} = F_n^{−1}(p), where F_n is the empirical CDF of (X_1, ..., X_n). Then
    √n(Y_p^{(n)} − x_p) →^D N(0, p(1 − p)/f²(x_p)).
The proof relies heavily on the following lemma.
Lemma 7.26. For each z ∈ ℝ there exists a sequence of random variables {A_n(z)}_{n=1}^∞ such that A_n(z) = o_P(1/√n) and
    √n(Y_p^{(n)} − x_p) ≤ z if and only if [√n/f(x_p)](p − F_n(x_p)) ≤ z + √n A_n(z)/f(x_p).  (7.27)
PROOF. Define
    A_n(z) = F_n(x_p + z/√n) − F_n(x_p) − z f(x_p)/√n = B_n/n − z f(x_p)/√n + U_n,  (7.28)
where B_n is the number of observations in the interval (x_p, x_p + z/√n], and U_n satisfies Pr(|U_n| ≤ 2/n) = 1. In particular, U_n = O_P(1/n). The conditional distribution of B_n given P = P is Bin(n, θ_n), where
    θ_n = F(x_p + z/√n) − F(x_p) = (z/√n)f(x_p) + o(1/√n).
The characteristic function of √n(A_n(z) − U_n) is
    E exp{i√n t(A_n(z) − U_n)} = exp{−itz f(x_p)} E exp(itB_n/√n) = (1 − θ_n + θ_n exp{it/√n})^n exp{−itz f(x_p)}.
We can write
    exp(it/√n) = 1 + it/√n − t²/(2n) + o(1/n).
It follows that
    (1 − θ_n + θ_n exp{it/√n})^n = (1 + itθ_n/√n + o(1/n))^n = (1 + itz f(x_p)/n + o(1/n))^n → exp{itz f(x_p)},
as n → ∞. So
    lim_{n→∞} E exp{i√n t(A_n(z) − U_n)} = 1,
for all t. So, √n[A_n(z) − U_n] →^D 0 by the continuity theorem B.93, and √n[A_n(z) − U_n] →^P 0 by Theorem B.90. Hence, A_n(z) = U_n + o_P(1/√n). Since U_n = O_P(1/n) = o_P(1/√n), it follows that A_n(z) = o_P(1/√n).
Finally, we prove (7.27). The following inequalities are all equivalent:
    √n(Y_p^{(n)} − x_p) ≤ z,
    Y_p^{(n)} ≤ x_p + z/√n,
    F_n(Y_p^{(n)}) ≤ F_n(x_p + z/√n),
    p ≤ F_n(x_p + z/√n),
    p ≤ A_n(z) + F_n(x_p) + z f(x_p)/√n,
    [√n/f(x_p)](p − F_n(x_p)) ≤ z + √n A_n(z)/f(x_p).
The equivalence of the first and last of these is (7.27). □
Now, we are ready to prove Theorem 7.25.
PROOF OF THEOREM 7.25. From Lemma 7.26, we know that
    Pr(√n(Y_p^{(n)} − x_p) ≤ z) = Pr([√n/f(x_p)](p − F_n(x_p)) ≤ z + √n A_n(z)/f(x_p)).
We will prove that the right-hand side of this equation converges to the necessary normal probability. We have that F_n(x_p) = C_n/n + D_n, where C_n is the number of observations less than or equal to x_p and D_n = O_P(1/n). Also, A_n(z) = o_P(1/√n), so
    [√n/f(x_p)](p − F_n(x_p)) − √n A_n(z)/f(x_p) = −[√n/f(x_p)](C_n/n − p) + o_P(1).  (7.29)
The central limit theorem B.97 tells us that √n(C_n/n − p) →^D N(0, p(1 − p)). This, together with Lemma 7.19 applied to (7.29), completes the proof. □
Example 7.30. Suppose that F has derivative
    f(x) = [σπ(1 + ((x − μ)/σ)²)]^{−1},
where σ > 0 and μ are some numbers. If p = 1/2, x_p = μ and f(x_p) = (σπ)^{−1}. It follows that the sample median Y_{1/2}^{(n)} has asymptotic distribution (given P = P)
    √n(Y_{1/2}^{(n)} − μ) →^D N(0, σ²π²/4).
The asymptotic variance of Y_{1/2}^{(n)} is 2.467σ²/n.
Example 7.31. Suppose that F has derivative
    f(x) = (σ√(2π))^{−1} exp{−(x − μ)²/(2σ²)},
where σ > 0 and μ are some numbers. If p = 1/2, x_p = μ and f(x_p) = (σ√(2π))^{−1}. It follows that the sample median Y_{1/2}^{(n)} has asymptotic distribution (given P = P)
    √n(Y_{1/2}^{(n)} − μ) →^D N(0, σ²π/2).
The asymptotic variance of Y_{1/2}^{(n)} is 1.571σ²/n.
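A simulation sketch of Example 7.31 (the sample sizes are arbitrary choices, not from the text): for standard normal data, n times the variance of the sample median should be near π/2 ≈ 1.571.

```python
import random
import statistics

random.seed(0)
n, reps = 201, 1500

def sample_median(n):
    xs = sorted(random.gauss(0.0, 1.0) for _ in range(n))
    return xs[n // 2]       # n is odd, so this is the middle order statistic

meds = [sample_median(n) for _ in range(reps)]
print(round(n * statistics.variance(meds), 2))   # near pi/2 = 1.57
```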
For distributions that are bounded above or below, a different sort of result holds for the p = 1 or p = 0 quantile. The following theorems are examples.
Theorem 7.32. Suppose that t ∈ ℝ, α > 0, and
    lim_{x↑t} (t − x)^{−α}[1 − F(x)] = c > 0.
Let {X_n}_{n=1}^∞ be IID with CDF F and let X_{(n)} = max{X_1, ..., X_n}. Then n^{1/α}(t − X_{(n)}) converges in distribution to a distribution with CDF G(x) = 1 − exp(−cx^α), for x > 0.
PROOF. Write
    Pr(n^{1/α}[t − X_{(n)}] ≥ x) = Pr(X_{(n)} ≤ t − x/n^{1/α}) = F(t − x/n^{1/α})^n = [1 − (1 − F(t − x/n^{1/α}))]^n.
Since
    1 − F(t − x/n^{1/α}) = cx^α/n + o(1/n),
it follows that
    Pr(n^{1/α}[t − X_{(n)}] ≥ x) → exp(−cx^α). □

Example 7.33. Suppose that {X_n}_{n=1}^∞ are conditionally IID U(0, θ) given Θ = θ. The CDF of X_i (given Θ = θ) is x/θ for 0 < x < θ and 1 for x ≥ θ. With t = θ, we get lim_{x↑t}(t − x)^{−1}[1 − F(x)] = 1/θ. So Theorem 7.32 says that n(θ − X_{(n)}) →^D Exp(1/θ).
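A simulation sketch of Example 7.33 (θ and the sample sizes are arbitrary choices, not from the text): the rescaled gap n(θ − X_{(n)}) behaves like an exponential with mean θ.

```python
import random

random.seed(0)
theta, n, reps = 2.0, 500, 2000

gaps = [n * (theta - max(random.uniform(0.0, theta) for _ in range(n)))
        for _ in range(reps)]

print(round(sum(gaps) / reps, 1))            # near theta = 2.0
below = sum(g <= theta for g in gaps) / reps
print(round(below, 2))                       # near 1 - exp(-1) = 0.63
```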
A similar theorem can be proven for distributions bounded below.
Proposition 7.34. Suppose that t ∈ ℝ, α > 0, and lim_{x↓t}(x − t)^{−α}F(x) = c > 0. Let {X_n}_{n=1}^∞ be IID with CDF F and let X_{(1)} = min{X_1, ..., X_n}. Then n^{1/α}(X_{(1)} − t) converges in distribution to a distribution with CDF G(x) = 1 − exp(−cx^α), for x > 0.
Krem (1963) proves that extreme order statistics (like the min and max) are asymptotically independent of the central order statistics (like the quantiles).

7.2.2 Several Quantiles


We can prove a theorem similar to Theorem 7.25 for several sample quantiles simultaneously.
Theorem 7.35. Let 0 < p_1 < ... < p_k < 1. Suppose that {X_n}_{n=1}^∞ are conditionally IID with distribution P given P = P, and suppose that P has CDF F with derivative f in a neighborhood of each x_{p_i} (i = 1, ..., k), where F(x_{p_i}) = p_i and 0 < f(x_{p_i}) < ∞. Define Y_{p_i}^{(n)} = F_n^{−1}(p_i), where F_n is the empirical CDF of (X_1, ..., X_n). Then
    √n(Y_{p_1}^{(n)} − x_{p_1}, ..., Y_{p_k}^{(n)} − x_{p_k})^⊤ →^D N_k(0, Ψ),
where Ψ = ((ψ_{i,j})) and ψ_{i,j} = [p_{min{i,j}} − p_i p_j]/[f(x_{p_i})f(x_{p_j})].
PROOF. Define
    Z_{i,n} = √n(Y_{p_i}^{(n)} − x_{p_i}),   W_{i,n} = [√n/f(x_{p_i})](p_i − F_n(x_{p_i})).
Let z_1, ..., z_k be real numbers and let A_{i,n}(z_i) equal (7.28) with p = p_i for i = 1, ..., k. Since (A_{1,n}(z_1), ..., A_{k,n}(z_k)) = o_P(1/√n), it follows from Lemmas 7.19 and 7.26 that the two vectors Z_n and W_n converge in distribution to the same thing if either one of them converges. It is easier to find what W_n converges to, so that is what we will do.
We can write
    F_n(x_{p_j}) = (1/n)Σ_{i=1}^j M_i + O_P(1/n),
where M_j is the number of observed values in the interval (x_{p_{j−1}}, x_{p_j}]. For convenience, set p_0 = 0, p_{k+1} = 1, x_{p_0} = −∞, and x_{p_{k+1}} = ∞. It is clear that the conditional distribution of (M_1, ..., M_{k+1}) given P = P is multinomial, Mult(n; q_1, ..., q_{k+1}), where q_i = p_i − p_{i−1} for i = 1, ..., k + 1. Set G_n = (M_1, ..., M_{k+1})^⊤ and q = (q_1, ..., q_{k+1})^⊤. The multivariate central limit theorem B.99 implies that, conditional on P = P,
    √n[(1/n)G_n − q] →^D N_{k+1}(0, Σ)
(a multivariate normal distribution), where Σ = diag(q_1, ..., q_{k+1}) − qq^⊤.
Next, note that
    (F_n(x_{p_1}), ..., F_n(x_{p_k}))^⊤ = (1/n)AG_n + O_P(1/n),  (7.36)
where A is the k × (k + 1) matrix with 1 on and below the diagonal and 0 above the diagonal. Call the vector on the left of (7.36) R. Then the conditional mean of R given P = P is p = Aq = (p_1, ..., p_k)^⊤, and √n(p − R) →^D N_k(0, AΣA^⊤). All that remains is to compute AΣA^⊤. This can be seen to equal
    AΣA^⊤ = M − pp^⊤,   where M_{i,j} = p_{min{i,j}};
that is, the (i, j) entry of AΣA^⊤ is p_{min{i,j}} − p_i p_j. Dividing the (i, j) entry by f(x_{p_i})f(x_{p_j}), the scaling in W_{i,n}, gives Ψ. □

Since the definition of F_n is arbitrary between observed values, the asymptotic distribution in Theorem 7.35 applies to every vector of random variables whose ith coordinate is between X_{(j−1)} and X_{(j)} when (j − 1)/n < p_i < j/n.
An analogue to Theorem 7.32 and Proposition 7.34 can be proven for the joint distribution of the smallest and largest order statistics.
Proposition 7.37. Suppose that t_1, t_2 ∈ ℝ, α_1, α_2 > 0, and
    lim_{x↓t_1} (x − t_1)^{−α_1} F(x) = c_1 > 0,
    lim_{x↑t_2} (t_2 − x)^{−α_2}[1 − F(x)] = c_2 > 0.
Let {X_n}_{n=1}^∞ be IID with CDF F, and let X_{(1)} = min{X_1, ..., X_n} and X_{(n)} = max{X_1, ..., X_n}. Then the asymptotic joint CDF of n^{1/α_1}(X_{(1)} − t_1) and n^{1/α_2}(t_2 − X_{(n)}) is (1 − exp(−c_1 x_1^{α_1}))(1 − exp(−c_2 x_2^{α_2})).

7.2.3 Linear Combinations of Quantiles*


A linear combination of sample quantiles is called an L-estimator. Suppose that f is the derivative of the conditional CDF of the X_i given Θ = θ and that f is symmetric about g(θ); that is, f(x) = h(x − g(θ)) with h symmetric about 0. Let Z_i = X_i − g(θ). Then the conditional density of the Z_i is h(·). If we choose to sample quantiles symmetric about the median, say p and 1 − p, then x_p = 2g(θ) − x_{1−p} by symmetry. Let z_p = x_p − g(θ) for all p, so that z_p = −z_{1−p}. Let W_p^{(n)} = Y_p^{(n)} − g(θ), so that W_p^{(n)} − z_p = Y_p^{(n)} − x_p. By Theorem 7.35, the asymptotic covariance matrix of √n(W_p^{(n)}, W_{1/2}^{(n)}, W_{1−p}^{(n)}) (for p < 1/2) has entries such as
    ψ_{1,1} = p(1 − p)/h²(z_p),   ψ_{1,2} = p/[2h(z_p)h(0)],   ψ_{2,2} = 1/[4h²(0)],   ψ_{1,3} = p²/h²(z_p),
using h(z_p) = h(z_{1−p}).
If the goal is to estimate g(Θ), it might be good if the asymptotic mean were g(θ) given Θ = θ. The asymptotic conditional mean of a_1 Y_p^{(n)} + a_2 Y_{1/2}^{(n)} + a_3 Y_{1−p}^{(n)} is (a_1 + a_2 + a_3)g(θ) + (a_3 − a_1)z_{1−p}, since h is symmetric around 0. For p < 1/2, this will equal g(θ) for all θ if and only if a_3 = a_1 and a_1 + a_2 + a_3 = 1. Hence our estimator must be
    a Y_p^{(n)} + (1 − 2a) Y_{1/2}^{(n)} + a Y_{1−p}^{(n)}  (7.38)
for some a.
for some a.
Example 7.39 (Continuation of Example 7.30; see page 406). As an example, consider the case of Cauchy distributions with a location parameter Θ. Then Z_i = X_i − θ, h(x) = (π[1 + x²])^{−1}, and z_{1−p} = tan[π(1/2 − p)]. So, for example, if p = 1/3, then z_p = −1/√3, z_{1−p} = 1/√3, and h(z_p) = 3/(4π) = h(z_{1−p}). The asymptotic covariance matrix of the three sample quantiles is then
    Σ = (π²/n) [ 32/81  2/9   16/81
                 2/9    1/4   2/9
                 16/81  2/9   32/81 ].

*This section may be skipped without interrupting the flow of ideas.



The asymptotic variance of the estimator in (7.38) is
    (a, 1 − 2a, a) Σ (a, 1 − 2a, a)^⊤ = (π²/n)(1/4 − a/9 + 11a²/27).
This variance is minimized at a = 3/22. The minimum asymptotic variance is 8π²/(33n) = 2.39/n, which is not much better than we got with the median alone (2.467/n).

Perhaps improvement can be made in Example 7.30 by altering p. The general method for doing this is illustrated by continuing the example.
Example 7.40 (Continuation of Example 7.39; see page 410). For general p < 1/2,
    h(z_p) = (1/π) cos²π(p − 1/2) = (1/π)c(p),
say, where c(1/2) = 1. The asymptotic covariance matrix of the three quantiles is
    Σ = (π²/n) [ p(1−p)/c(p)²  p/[2c(p)]  p²/c(p)²
                 p/[2c(p)]     1/4        p/[2c(p)]
                 p²/c(p)²      p/[2c(p)]  p(1−p)/c(p)² ].
The asymptotic variance of the estimator in (7.38) is
    (a, 1 − 2a, a) Σ (a, 1 − 2a, a)^⊤ = (π²/n)[1/4 − 2a(1/2 − p/c(p)) + a²(1 + 2p/c(p)² − 4p/c(p))].
The variance is minimized at
    a = a(p) = [c(p)² − 2p c(p)] / (2[c(p)² − 4p c(p) + 2p]),
and the minimum variance is
    (π²/4n)[1 − (c(p) − 2p)²/(c(p)² − 4p c(p) + 2p)] = s(p).
We can numerically minimize s(p) and find the minima occur at p = 0.42085 and at p = 0.07915. The minimum s(p) is 2.302/n, which is only slightly better than using p = 1/3.
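A numerical check of Example 7.40 (a sketch; the grid resolution is an arbitrary choice): evaluate n·s(p) with c(p) = cos²π(p − 1/2) and minimize over a grid.

```python
import math

def n_s(p):
    # n * s(p) from Example 7.40, with c(p) = cos^2(pi*(p - 1/2)).
    c = math.cos(math.pi * (p - 0.5)) ** 2
    return (math.pi ** 2 / 4.0) * (
        1.0 - (c - 2.0 * p) ** 2 / (c * c - 4.0 * p * c + 2.0 * p))

print(round(n_s(1.0 / 3.0), 3))     # 2.393, i.e. 8*pi^2/33 from Example 7.39

grid = [k / 100000.0 for k in range(1000, 49001)]
best = min(grid, key=n_s)
print(round(best, 5), round(n_s(best), 3))  # one of the two minima near 0.07915 / 0.42085
```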
Example 7.41. Suppose that the distributions are double exponential (also known as the Laplace distribution). That is, h(x) = exp(−|x|)/2. Then
    z_p = log 2p and h(z_p) = p if p ≤ 1/2;   z_p = −log 2(1 − p) and h(z_p) = 1 − p if p > 1/2.
The asymptotic covariance matrix of three symmetric sample quantiles (p, 1/2, 1 − p, with p < 1/2) is
    Σ = (1/n) [ (1−p)/p  1  1
                1        1  1
                1        1  (1−p)/p ].
The asymptotic variance of the estimator in (7.38) is
    (1/n)[1 + 2a²(1 − 2p)/p],
which has a minimum at a = 0, and the minimum value is 1/n. This means that, no matter what p is, it is better to use just the median.

7.3 Large Sample Estimation


7.3.1 Some Principles of Large Sample Estimation
One would hope that if a large sample were available, then better knowledge of P would be available, and we would be close to the situation of having independent observations. Since predictive inference is usually not the goal for classical statistics, the issue becomes how well we have estimated Θ. There is the belief that an estimator ought to get Θ correct eventually. That is, the estimator should be consistent (see Definition 7.9). If more than one estimator is consistent, then one might ask, "Which is better?" Without a loss function or some indication of how we plan to use the estimator, this question is not interesting. There are, nonetheless, answers to the question.
Let Θ be k-dimensional, and suppose that the FI regularity conditions (see Definition 2.78) hold. Then the Fisher information matrix (based on a single observation) I_{X_1}(θ) can be calculated. Suppose also that an estimator Θ̂_n of Θ converges in distribution, say √n(Θ̂_n − θ) →^D N_k(0, V_θ) given Θ = θ. If we wish to estimate g(Θ), with g continuous, then the delta method tells us that √n[g(Θ̂_n) − g(θ)] →^D N(0, c_θ^⊤ V_θ c_θ), where
    c_θ^⊤ = (∂g(θ)/∂θ_1, ..., ∂g(θ)/∂θ_k).  (7.42)
Corollary 5.23 says that the smallest possible variance for an unbiased estimator of g(Θ) is c_θ^⊤ I_{X_1}(θ)^{−1} c_θ. Since g(Θ̂_n) is asymptotically unbiased, the ratio of these two variances might be used as a measure of how good a consistent estimator is.
Definition 7.43. If G_n is an estimator of g(Θ) for each n and
    √n(G_n − g(θ)) →^D N(0, v_θ), for all θ,


then the ratio c_θ^⊤ I_{X_1}(θ)^{−1} c_θ / v_θ is called the asymptotic efficiency of G_n at θ. If the ratio is 1, the sequence {G_n}_{n=1}^∞ is called asymptotically efficient.
Suppose that {G_n}_{n=1}^∞ and {G'_n}_{n=1}^∞ are sequences of estimators of g(Θ),
and we have a specific criterion that we require of our estimator, such as
variance equal to ε. Suppose that G_{n_0} and G'_{n'_0} satisfy this criterion. Then
the relative efficiency of {G_n}_{n=1}^∞ to {G'_n}_{n=1}^∞ for the specific criterion is
n'_0/n_0. Suppose that the criterion is allowed to change in such a way that
the sample sizes required to satisfy it go to ∞, for example, variance equal
to ε with ε going to 0. If the ratio n'_0/n_0 converges to a value r, then r is
called the asymptotic relative efficiency (ARE)¹¹ of {G_n}_{n=1}^∞ to {G'_n}_{n=1}^∞.
Example 7.44. Let {X_n}_{n=1}^∞ be conditionally IID with N(μ, σ²) distribution
given Θ = (μ, σ). Let g(Θ) = μ. Let G_n = X̄_n, the sample average, and let G'_n
be the sample median. Let our specific criterion be that the asymptotic variance
of the estimator must equal ε. Since the central limit theorem B.97 says that
√n(G_n − μ) →ᴰ N(0, σ²), and Example 7.31 on page 407 shows that √n(G'_n − μ) →ᴰ
N(0, σ²π/2), we have the relative efficiency equal to √(2/π) = 0.798 for all ε. If
we let ε → 0, the ARE of the sample median to the sample mean is 0.798 as well.
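The variance comparison behind this example is easy to check by simulation. The following sketch (my illustration, not from the text) estimates the variances of the sample mean and the sample median for standard normal data; their ratio should be close to π/2 ≈ 1.571.

```python
import numpy as np

# Monte Carlo check of Example 7.44 (an illustration, not from the text):
# for N(0, 1) data the sample mean has variance 1/n, while the sample
# median has variance close to (pi/2)/n, so the ratio of the two
# variances should be near pi/2 ~ 1.571.
rng = np.random.default_rng(0)
n, reps = 200, 20000
samples = rng.normal(size=(reps, n))

var_mean = samples.mean(axis=1).var()       # about 1/n = 0.005
var_median = np.median(samples, axis=1).var()
ratio = var_median / var_mean
print(ratio)  # near pi/2
```

The ratio of required sample sizes for a fixed variance criterion is this same variance ratio, which is why the criterion drops out of the asymptotic relative efficiency.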

The idea of ARE is to compare the sizes of samples needed to make
comparable inferences from the two sequences.
Example 7.45. Suppose that {X_n}_{n=1}^∞ are conditionally IID U(0, θ) given Θ =
θ. The MLE is Θ̂_n = max_i X_i. Another estimator is twice the sample average,
2X̄_n. Suppose that our criterion is that the actual variance of the estimator
must equal θ²ε. Since Θ̂_n/θ has Beta(n, 1) distribution, the variance of Θ̂_n is
θ²n/[(n + 1)²(n + 2)]. The variance of 2X̄_n is θ²/(3n). Let n_0 be the sample size
at which Θ̂_n has variance θ²ε, and let n'_0 be the sample size at which 2X̄_n has
variance θ²ε. It is easy to see that we must have n'_0 = (n_0 + 1)²(n_0 + 2)/(3n_0).
So, n'_0/n_0 = (n_0 + 1)²(n_0 + 2)/(3n_0²) for all ε. As ε → 0, n_0 → ∞ and the ratio n'_0/n_0
goes to ∞. That is, the ARE of Θ̂_n to 2X̄_n is ∞.
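The sample-size comparison can be tabulated directly from the two variance formulas. The sketch below (mine, with θ = 1 assumed) finds the smallest m at which 2X̄_m matches the variance of the MLE at sample size n; the ratio m/n grows without bound.

```python
# Numerical look at Example 7.45 with theta = 1 (my illustration, not
# from the text): Var(max X_i) = n/((n+1)^2 (n+2)), while
# Var(2*Xbar_m) = 1/(3m). Matching the MLE at sample size n requires
# m ~ (n+1)^2 (n+2)/(3n), so m/n -> infinity: the ARE is infinite.

def var_mle(n):
    return n / ((n + 1) ** 2 * (n + 2))

def var_2xbar(m):
    return 1.0 / (3 * m)

def matching_m(n):
    """Smallest m with Var(2*Xbar_m) <= Var(max of n observations)."""
    m = 1
    while var_2xbar(m) > var_mle(n):
        m += 1
    return m

for n in (10, 100, 1000):
    print(n, matching_m(n), matching_m(n) / n)
```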
Example 7.46. Let H and H' be nondegenerate distributions that have some
common scale feature (like the same finite standard deviation or the same
interquartile range). Suppose that a_n(G_n − g(θ)) →ᴰ H and b_n(G'_n − g(θ)) →ᴰ H'.
Suppose also that lim_{n→∞} a_n/b_n = r. Then r is the relative rate of convergence
of {G_n}_{n=1}^∞ to {G'_n}_{n=1}^∞.¹² Note that when H and H' are both normal and a_n
and b_n are both O(√n), the relative rate of convergence is the square root of the
ARE for asymptotic variance.

In Section 7.3.2, we show that the class of maximum likelihood estimators
(see Section 5.1.3) are efficient under quite general conditions. At first, it
might seem that achieving asymptotic efficiency of 1 would be the best
possible, but sometimes efficiency greater than 1 is possible.¹³

¹¹This definition of ARE is taken from Serfling (1980, pp. 50-52). Serfling's
definition actually applies to more types of inference than estimators, but we will
not pursue that generality here.
¹²Solve Problem 22 on page 470 to show that the relative rate of convergence
is uniquely defined. Relative rate of convergence is not an example of a criterion
for ARE, but it has a similar nature.

Example 7.47.¹⁴ Suppose that {X_n}_{n=1}^∞ are conditionally IID N(θ, 1) given
Θ = θ. We already know that I_{X_1}(θ) = 1 and √n(X̄_n − θ) ~ N(0, 1), so X̄_n
is asymptotically efficient. (Actually, it is efficient in finite samples.) Let θ0 be
arbitrary, and define a new estimator of Θ:

    Θ̂_n = X̄_n                    if |X̄_n − θ0| > n^(−1/4),
    Θ̂_n = θ0 + a(X̄_n − θ0)   if |X̄_n − θ0| ≤ n^(−1/4),

where 0 < a < 1. This is like using X̄_n when X̄_n is not close to θ0, but using the
posterior mean of Θ from a prior centered at θ0 when X̄_n is close to θ0.
We will now calculate the efficiency of Θ̂_n. Suppose that θ ≠ θ0. Then

    √n |X̄_n − Θ̂_n| = (1 − a)√n |X̄_n − θ0| I_[0, n^(−1/4)](|X̄_n − θ0|).

Hence, for ε > 0, P_θ(√n |X̄_n − Θ̂_n| > ε) is at most

    P_θ(θ0 − n^(−1/4) ≤ X̄_n ≤ θ0 + n^(−1/4))
        = P_θ((θ0 − θ)√n − n^(1/4) ≤ Z ≤ (θ0 − θ)√n + n^(1/4)),

where Z = √n(X̄_n − θ) has N(0, 1) distribution given Θ = θ. This last probability
goes to 0 as n goes to infinity because both of the endpoints either go to +∞ or
−∞. Hence, if θ ≠ θ0, Θ̂_n = X̄_n + o_P(1/√n).
Now, suppose that θ = θ0. Then

    √n |a(X̄_n − θ0) + θ0 − Θ̂_n| = (1 − a)√n |X̄_n − θ0| I_(n^(−1/4), ∞)(|X̄_n − θ0|).

Hence, for ε > 0, P_{θ0}(√n |a(X̄_n − θ0) + θ0 − Θ̂_n| > ε) is at most
P_{θ0}(|X̄_n − θ0| > n^(−1/4)), which goes to 0 as n → ∞. So, if θ = θ0, Θ̂_n = θ0 +
a(X̄_n − θ0) + o_P(1/√n). It follows that √n(Θ̂_n − θ) →ᴰ N(0, v_θ), where v_{θ0} = a²
and v_θ = 1 for all other θ. Efficiency is 1/a² > 1 at θ = θ0.
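The superefficiency is visible in a small simulation (a sketch I added; θ0 = 0 and a = 1/2 are my choices, not from the text). Since X̄_n ~ N(θ, 1/n), we can simulate the sample mean directly and apply the shrinkage rule.

```python
import numpy as np

# Simulating Hodges' estimator of Example 7.47 (a sketch; theta0 = 0 and
# a = 0.5 are assumed choices). Xbar_n ~ N(theta, 1/n), and the estimator
# shrinks Xbar_n toward theta0 whenever |Xbar_n - theta0| <= n^(-1/4).
rng = np.random.default_rng(1)
a, n, reps = 0.5, 400, 20000
cut = n ** -0.25

def hodges(xbar):
    return np.where(np.abs(xbar) > cut, xbar, a * xbar)

xbar_at_0 = rng.normal(0.0, 1.0 / np.sqrt(n), size=reps)  # theta = theta0
xbar_at_1 = rng.normal(1.0, 1.0 / np.sqrt(n), size=reps)  # theta far away

v0 = n * hodges(xbar_at_0).var()  # near a^2 = 0.25: beats the bound 1
v1 = n * hodges(xbar_at_1).var()  # near 1: ordinary efficiency elsewhere
print(v0, v1)
```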

The phenomenon of Example 7.47 is called superefficiency. It is easy to
see how one could arrange for an estimator to be superefficient at several
different possible θ. LeCam (1953) proved that, under conditions a little
stronger than the FI regularity conditions, superefficiency can only occur
at a set of θ of zero Lebesgue measure.

¹³When an estimator is efficient, or when two estimators have ARE equal to 1,
more detailed comparisons are often made in a study of second-order efficiency.
We will not study second-order efficiency in this text.
¹⁴This example is due to Hodges; see LeCam (1953).

7.3.2 Maximum Likelihood Estimators

In Section 5.1.3, we defined maximum likelihood estimators (MLE) to be
estimators that maximize the likelihood function L(θ) = f_{X|Θ}(x|θ). That
is, an MLE of Θ after observing X = x is any θ at which L(θ) achieves
its maximum, if there are any such θ. In this section, we prove some large
sample properties of these estimators.
Theorem 7.48.¹⁵ Assume that {X_n}_{n=1}^∞ are conditionally IID given Θ =
θ, each with density f_{X_1|Θ}(x|θ). Then, for each θ0 and each θ ≠ θ0,

    lim_{n→∞} P_{θ0}[ ∏_{i=1}^n f_{X_1|Θ}(X_i|θ0) > ∏_{i=1}^n f_{X_1|Θ}(X_i|θ) ] = 1.

PROOF. With P_{θ0} measure 1, ∏_{i=1}^n f_{X_1|Θ}(X_i|θ0) > ∏_{i=1}^n f_{X_1|Θ}(X_i|θ) if and
only if

    R(X) = (1/n) ∑_{i=1}^n log [ f_{X_1|Θ}(X_i|θ) / f_{X_1|Θ}(X_i|θ0) ] < 0.

By the weak law of large numbers B.95, under P_{θ0},

    R(X) →ᴾ E_{θ0} log [ f_{X_1|Θ}(X_1|θ) / f_{X_1|Θ}(X_1|θ0) ] = −I_{X_1}(θ0; θ),

where I_{X_1}(θ0; θ) is the Kullback-Leibler information from Definition 2.89.
By Proposition 2.92, we know that −I_{X_1}(θ0; θ) < 0 if θ ≠ θ0. It follows
that lim_{n→∞} P_{θ0}(R(X) < 0) = 1.  □
Theorem 7.48 suggests that the MLE should be consistent, since the P_{θ0}
probability goes to 1 that the likelihood function is higher at θ0 than at
any one other parameter value. Some further conditions are required to prove
consistency. Wald (1949) proved almost sure convergence of the MLE under
the assumption that the likelihood function was continuous. Theorems 7.49
and 7.54 are very much like Wald's result.
Theorem 7.49. Let {X_n}_{n=1}^∞ be conditionally IID given Θ = θ with den-
sity f_{X_1|Θ}(x|θ) with respect to a measure ν on a space (X_1, B_1). Fix θ0 ∈ Ω,
and define, for each M ⊆ Ω and x ∈ X_1,

    Z(M, x) = inf_{ψ∈M} log [ f_{X_1|Θ}(x|θ0) / f_{X_1|Θ}(x|ψ) ].

Assume that for each θ ≠ θ0 there is an open set N_θ such that θ ∈ N_θ
and E_{θ0} Z(N_θ, X_i) > 0. If Ω is not compact, assume further that there is
a compact C ⊆ Ω such that θ0 ∈ C and E_{θ0} Z(Ω \ C, X_i) > 0. Then,
lim_{n→∞} Θ̂_n = θ0, a.s. [P_{θ0}].

¹⁵This theorem can be strengthened to an almost sure result. See Problem 28
on page 471.

PROOF. If Ω is compact, let C = Ω. It suffices to prove that for every ε > 0,

    P_{θ0}( limsup_{n→∞} ||Θ̂_n − θ0|| ≥ ε ) = 0.    (7.50)

Let ε > 0 and let N_0 be the open ball of radius ε around θ0. Since C \ N_0
is a compact set, and {N_θ : θ ∈ C \ N_0} is an open cover, we may extract
a finite subcover, N_{θ1}, ..., N_{θt}. Rename these sets and Cᶜ as O_1, ..., O_m,
so that Ω = N_0 ∪ (∪_{j=1}^m O_j), and E_{θ0} Z(O_j, X_i) > 0.
Let X^∞ be the infinite product space of copies of X_1. Let x ∈ X^∞ denote
a generic sequence of possible data values. Let E_{θ0} Z(O_j, X_i) = c_j. By the
strong law of large numbers 1.63, ∑_{i=1}^n Z(O_j, X_i)/n → c_j, a.s. [P_{θ0}]. Let
B_j ⊆ X^∞ be the set of data sequences for which this convergence holds, and let
B = ∩_{j=1}^m B_j. Then P_{θ0}(B) = 1 and ∑_{i=1}^n Z(O_j, x_i)/n → c_j > 0 for each
x = (x_1, x_2, ...) ∈ B. Now, notice that

    { x : limsup_{n→∞} ||Θ̂_n(x_1, ..., x_n) − θ0|| ≥ ε }
      ⊆ ∪_{j=1}^m { x : Θ̂_n(x_1, ..., x_n) ∈ O_j, infinitely often }
      ⊆ ∪_{j=1}^m { x : inf_{ψ∈O_j} (1/n) ∑_{i=1}^n log [ f_{X_1|Θ}(x_i|θ0) / f_{X_1|Θ}(x_i|ψ) ] ≤ 0, infinitely often }
      ⊆ ∪_{j=1}^m { x : (1/n) ∑_{i=1}^n Z(O_j, x_i) ≤ 0, infinitely often } ⊆ ∪_{j=1}^m B_jᶜ.

Since this last set is Bᶜ and P_{θ0}(Bᶜ) = 0, (7.50) follows.  □


The hard part of using this theorem is verifying the conditions.
Example 7.51. Suppose that {X_n}_{n=1}^∞ given Θ = θ are IID with U(0, θ) distri-
bution. Then f_{X_1|Θ}(x|θ) = 1/θ for 0 ≤ x ≤ θ. We need E_{θ0} inf_{ψ∈N_θ} g(X_i) > 0,
where g(x) is the function

    g(x) = log [ f_{X_1|Θ}(x|θ0) / f_{X_1|Θ}(x|ψ) ] = log(ψ/θ0)   if x ≤ min{θ0, ψ},
                                                   = ∞              if ψ < x ≤ θ0,
                                                   = −∞             if θ0 < x ≤ ψ,
                                                   = undefined    otherwise.

Since the last two cases have 0 probability under P_{θ0}, we can choose N_θ =
([θ + θ0]/2, ∞) when θ > θ0. In this case, Z(N_θ, x) = log([θ + θ0]/[2θ0]) > 0,
a.s. [P_{θ0}]. If θ < θ0, choose N_θ = (θ/2, [θ + θ0]/2). In this case Z(N_θ, x) = ∞ if
x > [θ + θ0]/2. Hence, E_{θ0} Z(N_θ, X_i) > 0 in either case.
We also need a compact set C such that E_{θ0} Z(Ω \ C, X_i) > 0. Let C =
[θ0/a, aθ0], for some a > 1. Then

    Z(Ω \ C, x) = log(x/θ0)   if x < θ0/a,
    Z(Ω \ C, x) = log a         if x ≥ θ0/a.

The conditional mean of this given Θ = θ0 is

    (1/θ0) ∫_0^{θ0/a} log(x/θ0) dx + (1/θ0) ∫_{θ0/a}^{θ0} log a dx.

The first integral goes to 0 and the second goes to ∞ as a → ∞. This means that
there is some a > 1 such that the mean is positive. It follows from Theorem 7.49
that the MLE is consistent.
In this example, it would have been easier to find the distribution of Θ̂_n and
prove directly that it was consistent, but we will need the above calculation in
Example 7.82 on page 432.
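The consistency that the theorem guarantees here can also be seen by direct simulation (my sketch; the value θ0 = 2 is an assumed choice).

```python
import numpy as np

# The U(0, theta) MLE max X_i of Example 7.51, simulated at theta0 = 2
# (an assumed choice): the error theta0 - max X_i is always positive and
# shrinks quickly (at rate 1/n) as the sample size grows.
rng = np.random.default_rng(2)
theta0 = 2.0
errors = {n: theta0 - rng.uniform(0.0, theta0, size=n).max()
          for n in (10, 100, 10000)}
print(errors)
```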
Example 7.52. Suppose that {X_n}_{n=1}^∞ given Θ = θ are IID with N(θ, 1) distri-
bution. It is easy to calculate

    log [ f_{X_1|Θ}(x|θ0) / f_{X_1|Θ}(x|ψ) ] = [ (x − ψ)² − (x − θ0)² ] / 2.    (7.53)

The minimum of this over any set occurs at ψ equal to the value in the set closest
to x. So, if N_θ = (θ − ε, θ + ε), then E_{θ0} Z(N_θ, X) = I_{X_1}(θ0; θ) + E_{θ0}(R), where

    R = ε(X − θ) + ε²/2                 if X < θ − ε,
    R = X(θ − X) + (X² − θ²)/2       if θ − ε ≤ X ≤ θ + ε,
    R = ε(θ − X) + ε²/2                 if X > θ + ε.

Clearly, E_{θ0}(R) can be made arbitrarily small by choosing ε small. Similarly, if
C = [θ0 − u, θ0 + u], for large u, then

    Z(Cᶜ, x) = x(θ0 − x) + (x² − θ0²)/2   if x < θ0 − u,
    Z(Cᶜ, x) = u(x − θ0) + u²/2              if θ0 − u ≤ x ≤ θ0,
    Z(Cᶜ, x) = u(θ0 − x) + u²/2              if θ0 < x ≤ θ0 + u,
    Z(Cᶜ, x) = x(θ0 − x) + (x² − θ0²)/2   if x > θ0 + u.

We can make the integrals over the first and last portions of this arbitrarily small
by choosing u large enough. The integral over the two middle portions is at least
u²(2Φ(u) − 1)/2 − u√(2/π), which is positive for large u.

Unfortunately, if the parameter is Θ = (M, Σ) and X_i ~ N(μ, σ²) given Θ =
(μ, σ), it is not possible to find a compact set C such that E_{θ0} Z(Cᶜ, X_i) > 0.
Berk (1966) replaces this condition with a weaker condition that first appeared
in Kiefer and Wolfowitz (1956). The proof that this weaker condition suffices for
convergence of the MLE involves martingales and is deferred to Lemma 7.83.
(Also, see Problem 45 on page 474 and Example 7.85 on page 434.)

One of the conditions of Theorem 7.49 can be weakened if f_{X_1|Θ}(x|·) is
continuous.¹⁶

¹⁶A slightly more general result can be proved by assuming that f_{X_1|Θ} is up-
per semicontinuous (USC). A function f : Ω → ℝ is upper semicontinuous if
limsup_{n→∞} f(θ_n) ≤ f(θ) whenever θ_n → θ. USC functions possess two properties
that are needed in the proof of Theorem 7.49. The sum of two USC functions is
USC, and the maximum of a USC function is attained on a compact set.

Lemma 7.54. Assume the same conditions as in Theorem 7.49, except
that we now only require that E_{θ0} Z(N_θ, X_i) > −∞. Assume further that
f_{X_1|Θ}(x|·) is continuous in θ for every x, a.s. [P_{θ0}]. Then, lim_{n→∞} Θ̂_n = θ0, a.s.
[P_{θ0}].
PROOF. If Ω is compact, let C = Ω. For each θ ≠ θ0 in C, let N_θ^(k) be
a closed ball centered at θ with radius at most 1/k such that, for each
k, N_θ^(k+1) ⊆ N_θ^(k) ⊆ N_θ. This ensures that ∩_{k=1}^∞ N_θ^(k) = {θ}. So, for each
x, Z(N_θ^(k), x) increases with k. For each x such that f_{X_1|Θ} is continuous,
log[f_{X_1|Θ}(x|θ0)/f_{X_1|Θ}(x|ψ)] is continuous in ψ. So, for each k, there ex-
ists θ_k ∈ N_θ^(k) ({θ_k}_{k=1}^∞ might depend on x) such that¹⁷ Z(N_θ^(k), x) =
log[f_{X_1|Θ}(x|θ0)/f_{X_1|Θ}(x|θ_k)]. Since θ_k → θ,

    lim_{k→∞} Z(N_θ^(k), x) = log [ f_{X_1|Θ}(x|θ0) / f_{X_1|Θ}(x|θ) ].    (7.55)

Since N_θ^(k) ⊆ N_θ, it follows that Z(N_θ^(k), x) ≥ Z(N_θ, x). If E_{θ0} Z(N_θ, X_i) =
∞, then we have E_{θ0} Z(N_θ^(k), X_i) = ∞, for all k. If E_{θ0} Z(N_θ, X_i) is finite,
then apply Fatou's lemma A.50 to {Z(N_θ^(k), x) − Z(N_θ, x)}_{k=1}^∞ and use
(7.55) to get

    liminf_{k→∞} E_{θ0} Z(N_θ^(k), X_i) ≥ E_{θ0} lim_{k→∞} Z(N_θ^(k), X_i) = I_{X_1}(θ0; θ) > 0,    (7.56)

where I_{X_1} is the Kullback-Leibler information. Either way, we can now
choose k*(θ) so that E_{θ0} Z(N_θ^(k*), X_i) > 0, and apply Theorem 7.49.  □

7.3.3 MLEs in Exponential Families

In exponential families, MLEs exist (with probability tending to 1 as n →
∞) and are asymptotically normally distributed, and differentiable func-
tions of the MLE are asymptotically efficient estimators.
Theorem 7.57. Suppose that {X_n}_{n=1}^∞ are conditionally IID given Θ =
θ with nondegenerate exponential family distribution whose density with
respect to a measure ν is

    f_{X_1|Θ}(x|θ) = c(θ) exp(θᵀx).

Suppose that the natural parameter space Ω is an open subset of ℝᵏ. Let
Θ̂_n be the MLE of Θ based on X_1, ..., X_n if it exists. Then
lim_{n→∞} P_θ(Θ̂_n exists) = 1, and,
under P_θ, √n(Θ̂_n − θ) →ᴰ N_k(0, I_{X_1}(θ)⁻¹), where I_{X_1}(θ) is the Fisher
information matrix.

¹⁷If f_{X_1|Θ} is USC, θ_k still exists. One must change the lim to liminf wherever
it appears in front of Z, and one must change the equality to ≥ in (7.55) and
(7.56) to make the rest of the proof work.
PROOF. If the MLE exists, it will satisfy the equation that says that the
partial derivatives of the log-likelihood function with respect to each coor-
dinate of θ are 0, since the parameter space is open. Since

    (∂/∂θ_i) log f_{X|Θ}(x|θ) = n x̄_i + n (∂/∂θ_i) log c(θ),

the resulting equation is x̄_n = v(θ), where the ith coordinate of v(θ) is
−∂ log c(θ)/∂θ_i. It follows from Proposition 2.70 that

    Cov_θ(X_i, X_j) = −(∂²/∂θ_i∂θ_j) log c(θ) = σ_{ij}.

Since a covariance matrix is positive semidefinite, it follows that −log c(θ)
is a convex function. Since each X_i has nondegenerate exponential family
distribution, their coordinates are not linearly dependent, hence the matrix
Σ = ((σ_{ij})) will be positive definite. It follows that v(·) has a differentiable
inverse h(·) in the sense that for each θ there is a neighborhood N of θ such
that h(v(ψ)) = ψ for ψ ∈ N, and the derivatives of h are continuous. (See
the inverse function theorem C.2.) If x̄_n is in the image of v, then, for at
least one such function h, the MLE equals h(x̄_n). By the weak law of large
numbers B.95, X̄_n →ᴾ E_θX under P_θ, and E_θX = v(θ) by Proposition 2.70.
It follows that X̄_n will be in the range of v with probability tending to 1
as n → ∞. It follows that the MLE exists with probability tending to 1.
The multivariate central limit theorem B.99 says that under P_θ, √n(X̄_n −
v(θ)) →ᴰ N(0, Σ). Using the delta method, we get that under P_θ, √n(Θ̂_n −
θ) →ᴰ N_k(0, AΣAᵀ), where A = ((a_{ij})) with a_{ij} = ∂h_i(t)/∂t_j evaluated at
t = v(θ), which is the (i, j) element of Σ⁻¹. So, A = Σ⁻¹. It is also easy
to see that Σ is the Fisher information matrix I_{X_1}(θ), so √n(Θ̂_n − θ) →ᴰ
N(0, I_{X_1}(θ)⁻¹).  □
The following corollary is trivial.
Corollary 7.58. Under the conditions of Theorem 7.57, the MLE of Θ is
consistent.
Another corollary says that differentiable functions of MLEs are asymp-
totically efficient estimators.
Corollary 7.59. Assume the conditions of Theorem 7.57. Suppose that g :
Ω → ℝ has continuous partial derivatives. Then g(Θ̂_n) is an asymptotically
efficient estimator of g(Θ).
PROOF. Let c_θ be defined as in (7.42). Using the delta method, we get

    √n(g(Θ̂_n) − g(θ)) →ᴰ N(0, c_θᵀ I_{X_1}(θ)⁻¹ c_θ).

It follows that g(Θ̂_n) is an asymptotically efficient estimator of g(Θ).  □
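Theorem 7.57 is easy to illustrate numerically. The sketch below (my example, not from the text) uses the Poisson family written in its natural parametrization θ = log λ, where the MLE is log X̄_n and the Fisher information is I_{X_1}(θ) = e^θ = λ.

```python
import numpy as np

# Checking Theorem 7.57 for Poisson(lambda) as an exponential family with
# natural parameter theta = log(lambda) (an example I chose): the MLE is
# log(Xbar_n), and sqrt(n)(theta_hat - theta) should be approximately
# N(0, 1/I(theta)) with I(theta) = lambda.
rng = np.random.default_rng(6)
lam, n, reps = 4.0, 400, 20000
theta = np.log(lam)

xbar = rng.poisson(lam, size=(reps, n)).mean(axis=1)
theta_hat = np.log(xbar)            # the MLE exists whenever Xbar_n > 0
z = np.sqrt(n) * (theta_hat - theta)
print(z.mean(), z.var(), 1.0 / lam)  # variance near 1/lambda = 0.25
```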

7.3.4 Examples of Inconsistent MLEs

There are some curious examples of MLEs that are inconsistent. Each of
these examples fails one or more of the conditions of the theorems on con-
sistency that are proved in this chapter.
Example 7.60. This example was introduced by Neyman and Scott (1948) and
discussed by Barnard (1970).¹⁸ Suppose that (X_i, Y_i) ~ N_2(μ_i 1, σ² I), given Θ =
(σ, μ_1, μ_2, ...), and the individual vectors are conditionally independent. These
observations are not conditionally IID, so that none of our theorems applies as
stated. Nonetheless, we can write a likelihood function. The logarithm of this is
(aside from an additive constant)

    −2n log σ − (1/(2σ²)) { 2 ∑_{i=1}^n ( [X_i + Y_i]/2 − μ_i )² + (1/2) ∑_{i=1}^n (X_i − Y_i)² }.

The MLEs are easily calculated as M̂_{i,n} = (X_i + Y_i)/2 and Σ̂_n² = ∑_{i=1}^n (X_i −
Y_i)²/[4n]. Since the conditional distribution of X_i − Y_i given Θ = θ is N(0, 2σ²),
it follows that Σ̂_n² →ᴾ σ²/2 under P_θ. The MLE of Σ² is inconsistent.
Barnard (1970) suggests an empirical Bayes approach, which is to choose a
distribution for the M_i with a fixed finite number of parameters, call them Ψ.
Then treat (Ψ, Σ) as the parameters and integrate the M_i out of the problem.
See Section 8.4 for a discussion of empirical Bayes methods. Kiefer and Wolfowitz
(1956) let the distribution of the M_i be more general, but they assume that the
distribution lies in a compact set of distributions so that methods like those of
Theorem 7.49 and Lemma 7.54 can be used.
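The inconsistency is easy to reproduce by simulation (a sketch I added, with σ = 1 and arbitrarily chosen nuisance means):

```python
import numpy as np

# Simulation of the Neyman-Scott problem in Example 7.60 (a sketch, not
# from the text): with sigma = 1, the MLE of the variance,
# sum (X_i - Y_i)^2 / (4n), converges to sigma^2/2 = 0.5 instead of 1,
# because the number of nuisance means mu_i grows with n.
rng = np.random.default_rng(3)
sigma, n = 1.0, 100000
mu = rng.normal(0.0, 5.0, size=n)   # arbitrary per-pair nuisance means
x = rng.normal(mu, sigma)
y = rng.normal(mu, sigma)
sigma2_hat = ((x - y) ** 2).sum() / (4 * n)
print(sigma2_hat)  # near 0.5, not 1.0
```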
Example 7.61. Let Ω = (1/2, 1] and suppose that for x = 0, 1, 2, ...,

    f_{X_1|Θ}(x|θ) = θ(1 − θ)^x    if θ ≠ 1,
                   = (1/2)^{x+1}    if θ = 1.

This is a family of geometric distributions with θ = 1/2 renamed to θ = 1. The
density is neither continuous nor USC at θ = 1. So, Lemma 7.54 will not apply.
We can write

    log [ f_{X_1|Θ}(x|1) / f_{X_1|Θ}(x|θ) ] = −log(2θ) − x log[2(1 − θ)],    (7.62)

and we note that for every compact subset C of Ω, the infimum of (7.62) over
θ ∈ Cᶜ is 0 for x > 0 and is negative for x = 0. So the conditions of Theorem 7.49
fail as well. Let X̄_n be the average of the first n observations. The MLE based
on the first n observations is

    Θ̂_n = 1/(1 + X̄_n)   if X̄_n < 1,
        = 1                 if X̄_n ≥ 1.

Under P_1, √n(X̄_n − 1) →ᴰ N(0, 2), since the mean of each X_i is 1 and the variance
is 2. It follows that lim_{n→∞} P_1(X̄_n ≥ 1) = 1/2 and 1/(1 + X̄_n) →ᴾ 1/2 under P_1.
(This is not surprising, since θ = 1 should have been θ = 1/2.) So, Θ̂_n equals 1
with probability tending to 1/2 and is near 1/2 with probability tending to 1/2,
and Θ̂_n is inconsistent because it does not converge appropriately for θ = 1.

¹⁸Barnard claims that Neyman presented him with the example in a taxicab
in Paris in 1946. Barnard had just met Neyman for the first time and "was
arguing for the broad general validity of the method of maximum likelihood"
when Neyman asked him what he would do in this example.


7.3.5 Asymptotic Normality of MLEs

Outside of exponential families, the proof that MLEs are asymptotically
normal is a bit more complicated than the proof of Theorem 7.57. The
following theorem gives conditions under which the MLE is asymptotically
normal in general parametric families.
Theorem 7.63. Let Ω be a subset of ℝᵖ, and let {X_n}_{n=1}^∞ be conditionally
IID given Θ = θ, each with density f_{X_1|Θ}(·|θ). Let Θ̂_n be an MLE. Assume
that Θ̂_n →ᴾ θ under P_θ for all θ. Assume that f_{X_1|Θ}(x|θ) has continuous
second partial derivatives with respect to θ and that differentiation can be
passed under the integral sign. Assume that there exists H_r(x, θ) such that,
for each θ0 ∈ int(Ω) and each k, j,

    sup_{||θ−θ0||≤r} | (∂²/∂θ_k∂θ_j) log f_{X_1|Θ}(x|θ0) − (∂²/∂θ_k∂θ_j) log f_{X_1|Θ}(x|θ) | ≤ H_r(x, θ0),    (7.64)

with lim_{r→0} E_{θ0} H_r(X, θ0) = 0. Assume that the Fisher information matrix
I_{X_1}(θ) is finite and nonsingular. Then, under P_{θ0},

    √n(Θ̂_n − θ0) →ᴰ N_p(0, I_{X_1}(θ0)⁻¹).

Before we prove Theorem 7.63, here is an example in which condition
(7.64) is met.
Example 7.65. Suppose that f_{X_1|Θ}(x|θ) = (π[1 + (x − θ)²])⁻¹. Then

    (∂²/∂θ²) log f_{X_1|Θ}(x|θ) = −2 [1 − (x − θ)²] / [1 + (x − θ)²]².

This is differentiable and the derivative has finite mean. Hence H_r exists as in
Theorem 7.63.

The idea of the proof of Theorem 7.63 is the following. We work with
the vector ℓ'_θ(X) of partial derivatives of the logarithm of the likelihood
function divided by n. We evaluate a Taylor expansion of ℓ'_θ(X) around θ0
at the point Θ̂_n. Since ℓ'_{Θ̂_n}(X) should be 0, we get that ℓ'_{θ0}(X) is essen-
tially the matrix B_n of second partial derivatives of the logarithm of the
likelihood, divided by n, times Θ̂_n − θ0. Since ℓ'_{θ0}(X) is the average of IID
random vectors with mean 0 and covariance matrix I_{X_1}(θ0), √n ℓ'_{θ0}(X) is
asymptotically normal with covariance I_{X_1}(θ0). Similarly, B_n is nearly the
average of IID random matrices, so B_n →ᴾ −I_{X_1}(θ0). Setting the two sides
of the Taylor expansion equal, we get that √n I_{X_1}(θ0)(Θ̂_n − θ0) is asymp-
totically normal with covariance matrix I_{X_1}(θ0). Multiplying by I_{X_1}(θ0)⁻¹
gives the desired result.
PROOF OF THEOREM 7.63. Let

    ℓ_θ(X) = (1/n) ∑_{i=1}^n log f_{X_1|Θ}(X_i|θ).

The jth coordinate of the gradient ℓ'_θ(x) is (∑_{i=1}^n ∂ log f_{X_1|Θ}(x_i|θ)/∂θ_j)/n.
Since θ0 ∈ int(Ω), there is an open neighborhood of θ0 in the interior of
Ω. Since Θ̂_n →ᴾ θ0 under P_{θ0}, it follows that Z_n 1_{int(Ω)ᶜ}(Θ̂_n) = o_P(1/√n)
as n → ∞ for every sequence {Z_n}_{n=1}^∞ of random variables.¹⁹ Note that
ℓ'_{Θ̂_n}(X) = 0 for Θ̂_n ∈ int(Ω). It follows that

    ℓ'_{Θ̂_n}(X) = ℓ'_{Θ̂_n}(X) 1_{int(Ω)ᶜ}(Θ̂_n) = o_P(1/√n).

Using a one-term Taylor expansion (see Theorem C.1) of each coordinate
of ℓ'_{Θ̂_n}(X) around θ0, we get

    ℓ'_{θ0}(X) + (( ∂²ℓ_θ(X)/∂θ_k∂θ_j |_{θ=θ*_{n,j}} )) (Θ̂_n − θ0) = o_P(1/√n),    (7.66)

where θ*_{n,j} is between θ0 and Θ̂_n for each j. Since Θ̂_n →ᴾ θ0 under P_{θ0},
θ*_{n,j} →ᴾ θ0 for each j. Set B_n equal to the matrix in (7.66). Then

    ℓ'_{θ0}(X) + B_n(Θ̂_n − θ0) = o_P(1/√n).    (7.67)

By passing derivatives under the integral sign in the equation

    0 = (∂/∂θ_j) ∫ f_{X_1|Θ}(x|θ) dν(x),

we see that E_{θ0} ℓ'_{θ0}(X) = 0. Similarly, we get that the conditional covariance
matrix given Θ = θ0 of each term in ℓ'_{θ0}(X) is I_{X_1}(θ0). The multivariate
central limit theorem B.99 gives us that √n ℓ'_{θ0}(X) →ᴰ N(0, I_{X_1}(θ0)). So
√n ℓ'_{θ0}(X) = O_P(1). It follows from (7.67) that

    √n B_n(Θ̂_n − θ0) = O_P(1).    (7.68)

Next, note that B_n(j,k) = (∑_{i=1}^n ∂² log f_{X_1|Θ}(X_i|θ0)/∂θ_j∂θ_k)/n + Δ_n, and
(7.64) ensures that |Δ_n| ≤ ∑_{i=1}^n H_r(X_i, θ0)/n, when ||θ0 − θ*_n|| ≤ r. The
weak law of large numbers B.95 says that

    (1/n) ∑_{i=1}^n H_r(X_i, θ0) →ᴾ E_{θ0} H_r(X_i, θ0).

Let η > 0 and let r be small enough so that E_{θ0} H_r(X_i, θ0) < η/2. Then

    P_{θ0}(|Δ_n| > η) ≤ P_{θ0}( (1/n) ∑_{i=1}^n H_r(X_i, θ0) > η ) + P_{θ0}(||θ0 − θ*_n|| ≥ r)
                     ≤ P_{θ0}( |(1/n) ∑_{i=1}^n H_r(X_i, θ0) − E_{θ0} H_r(X_i, θ0)| > η/2 )
                        + P_{θ0}(||θ0 − θ*_n|| ≥ r).

The last two probabilities go to 0 as n → ∞, hence it follows that Δ_n =
o_P(1). Hence B_n →ᴾ −I_{X_1}(θ0), and B_n = O_P(1) but B_n ≠ o_P(1). It follows
from (7.68) that √n(Θ̂_n − θ0) = O_P(1). Now, write B_n = −I_{X_1}(θ0) + C_n,
where C_n = o_P(1). Then C_n(Θ̂_n − θ0) = o_P(1/√n), and we can rewrite
(7.67) as

    √n ℓ'_{θ0}(X) − I_{X_1}(θ0)√n(Θ̂_n − θ0) = o_P(1).

By Lemma 7.19, we get that −I_{X_1}(θ0)√n(Θ̂_n − θ0) →ᴰ N(0, I_{X_1}(θ0)). Since
multiplication by a matrix is a continuous function, the result is proven.  □

¹⁹In fact, Z_n 1_{int(Ω)ᶜ}(Θ̂_n) = o_P(r_n) for every r_n. The reason is that it equals 0
with probability tending to 1, and it doesn't matter what it equals when it isn't
0.
When applying Theorems 7.57 and 7.63 with observed data, it is common
to replace I_{X_1}(θ0) by a matrix that does not depend on the unknown
parameter. One possibility is I_{X_1}(Θ̂_n), which is often called the expected
Fisher information. In the proof of Theorem 7.63, we saw that I_{X_1}(Θ̂_n) →ᴾ
I_{X_1}(θ0) given Θ = θ0. We also saw, however, that I_{X_1}(θ0) arose in the
theorem as an approximation to −1/n times the matrix of second partial
derivatives of the log-likelihood function at a point near Θ̂_n (and near θ0).
It has been suggested [see, for example, Efron and Hinkley (1978)] that one
use 1/n times

    −(( (∂²/∂θ_j∂θ_k) log f_{X|Θ}(x|θ) |_{θ=θ̂_n} ))    (7.69)

in place of I_{X_1}(Θ̂_n) when X = x is observed and one wishes to use the MLE
to make inference about Θ. The quantity in (7.69) is called the observed
Fisher information. We will see later (in Section 7.4.2) that the observed
Fisher information is indeed the appropriate matrix to use when the goal
is to approximate the posterior distribution of a parameter by a normal
distribution. Efron and Hinkley (1978) say that the reason for preferring
observed over expected information is that the inverse of observed infor-
mation is closer to the conditional variance of the MLE given an ancillary.
Example 7.70.²⁰ Assume that (X_1, Z_1), ..., (X_n, Z_n) are conditionally IID with
Z_i having Ber(1/2) distribution and X_i|Z_i = z having N(θ, 1/[z + 1]) distribution
given Θ = θ. That is, we flip a fair coin before observing each X_i, and if the coin
comes up tails, we get an N(θ, 1) observation. If the coin comes up heads, we get
an N(θ, 1/2) observation. The log-likelihood function is a constant plus

    (1/2) ∑_{i=1}^n log(z_i + 1) − (1/2) ∑_{i=1}^n (z_i + 1)(x_i − θ)².

The MLE is the weighted average Θ̂_n = ∑_{i=1}^n (Z_i + 1)X_i / ∑_{i=1}^n (Z_i + 1). The
Fisher information is I_X(θ) = 3n/2, which is also the expected Fisher informa-
tion. The approximation to the distribution of Θ̂_n given Θ = θ using the expected
Fisher information is N(θ, 2/[3n]). On the other hand, the observed Fisher in-
formation is J = ∑_{i=1}^n (Z_i + 1). A natural ancillary upon which to condition
is Z = (Z_1, ..., Z_n). The conditional distribution of Θ̂_n given Z and Θ = θ is
N(θ, 1/J), which is the same as the approximation based on the observed Fisher
information.
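The advantage of conditioning on the observed information can be seen in a simulation (my sketch; θ = 0 is an assumed choice, not from the text). Among replications whose coins come up mostly tails, the conditional variance of the MLE tracks 1/J, while the expected-information value 2/(3n) understates it.

```python
import numpy as np

# Simulation of Example 7.70 (a sketch; theta = 0 assumed). Conditional
# on the coins Z, theta_hat is N(theta, 1/J) with J = sum(Z_i + 1), the
# observed Fisher information. The expected information 3n/2 misstates
# the conditional variance when J is atypically small.
rng = np.random.default_rng(4)
theta, n, reps = 0.0, 50, 40000

z = rng.integers(0, 2, size=(reps, n))           # fair coins
x = rng.normal(theta, 1.0 / np.sqrt(z + 1.0))    # variance 1 or 1/2
w = z + 1.0
theta_hat = (w * x).sum(axis=1) / w.sum(axis=1)
j = w.sum(axis=1)                                # observed information

low = j < 1.45 * n                  # replications with mostly tails
v_low = theta_hat[low].var()        # conditional variance estimate
print(v_low, (1.0 / j[low]).mean(), 2.0 / (3 * n))
```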
LeCam (1970) proves asymptotic normality of MLEs under ostensibly
weaker conditions than Theorem 7.63. The conditions do not require the
existence of continuous second derivatives. They do, however, require the
existence of functions that behave very much like second derivatives. Also,
a condition very much like (7.64) is required, where the second derivatives
are replaced by these other functions that behave like second derivatives.²¹

7.3.6 Asymptotic Properties of M-Estimators*

In Section 5.1.5, we introduced a class of estimators called M-estimators.
These estimators can be thought of as being chosen to maximize the log
of some alternative likelihood or some (nearly) arbitrary function rather
than the likelihood function. For example, suppose that we choose some
function ρ(a, b) and maximize ∑_{i=1}^n ρ(X_i, θ) as a function of θ. If ρ(a, b) =
log f_{X_1|Θ}(a|b), then we get maximum likelihood as a special case.
Not every function ρ will be appropriate, however. The following condi-
tions will be assumed throughout this section:
1. For each θ0 and for all θ ≠ θ0, E_{θ0}[ρ(X_i, θ0) − ρ(X_i, θ)] > 0.

²⁰This is a modification of an example of Cox (1958).
²¹The paper is worth locating if only to read the author's footnote.
*This section may be skipped without interrupting the flow of ideas.

2. For each θ0 and each θ ≠ θ0, there exists an open set N_θ containing
θ such that E_{θ0} inf_{θ'∈N_θ}[ρ(X_i, θ0) − ρ(X_i, θ')] > −∞.
3. For each θ0, there exists a compact set C containing θ0 such that
E_{θ0} inf_{θ∉C}[ρ(X_i, θ0) − ρ(X_i, θ)] > 0.
Condition 1 says that ρ allows one to distinguish possible values of Θ from
each other and that ρ(X_i, θ0) tends to be larger when Θ = θ0 than when Θ
equals some other value. Condition 2 says that it cannot be the case that,
even when Θ = θ0, there are some other possible values θ' of Θ that lead to
ρ(X_i, θ') being much larger than ρ(X_i, θ0). Condition 3 says that there is
a region around θ0 such that all values of ρ(X_i, θ), for θ not in that region,
tend to be less than ρ(X_i, θ0) simultaneously. If ρ(a, b) = log f_{X_1|Θ}(a|b),
then conditions 2 and 3 are two of the conditions of Lemma 7.54. In fact,
the method of proof for that lemma can be applied to prove the following
proposition.
Proposition 7.71. Suppose that ρ(x, ·) is continuous, a.s. [P_{θ0}]. Also, as-
sume conditions 1-3. If Θ̂_n is the value of θ that maximizes ∑_{i=1}^n ρ(X_i, θ),
then Θ̂_n → θ0, a.s. [P_{θ0}].

Next, suppose that ρ is differentiable with respect to the second coordi-
nate. Then, set ψ(X_i, θ) = ∂ρ(X_i, θ)/∂θ, and assume that E_θ ψ(X_i, θ) = 0
for all θ. (This is the same as condition (5.39) on page 312.) Also, suppose
that ψ is continuous. Now, we can try to solve ∑_{i=1}^n ψ(X_i, θ) = 0. If there
is more than one solution, we can choose the one closest to some reasonable
(with any luck, consistent) estimator of Θ. The next theorem says that as
n increases, the probability that there is a solution to this equation near θ
goes to 1, given Θ = θ.
Theorem 7.72. Assume that Ω ⊆ ℝ. Let ψ : X × Ω → ℝ be such that
ψ(x, θ) is continuous in θ, and,
for each θ0, there exists δ > 0 such that E_{θ0} ψ(X_i, θ) is strictly de-
creasing as a function of θ for |θ − θ0| < δ.
Then, for each ε > 0,

    lim_{n→∞} P_{θ0}( there is a solution of ∑_{i=1}^n ψ(X_i, θ) = 0 in (θ0 − ε, θ0 + ε) ) = 1.

PROOF. If θ1 ∈ (θ0 − δ, θ0) ∩ (θ0 − ε, θ0) and θ2 ∈ (θ0, θ0 + δ) ∩ (θ0, θ0 + ε),
then E_{θ0} ψ(X_i, θ1) > 0 and E_{θ0} ψ(X_i, θ2) < 0. By the weak law of large
numbers B.95, under P_{θ0}, ∑_{i=1}^n ψ(X_i, θ1)/n →ᴾ E_{θ0} ψ(X_i, θ1) > 0 and
∑_{i=1}^n ψ(X_i, θ2)/n →ᴾ E_{θ0} ψ(X_i, θ2) < 0. Since ψ is continuous in θ, the
probability that there is a solution in (θ1, θ2) is at least

    P_{θ0}( ∑_{i=1}^n ψ(X_i, θ1) > 0 and ∑_{i=1}^n ψ(X_i, θ2) < 0 ),

and the last of these goes to 1 as n → ∞.  □


A similar result can be proven if E/Jo1/J(Xi , ()) is nondecreasing.
Corollary 7.73. If Gn is the closest solution to a consistent estimator,
then Gn is consistent.
Example 7.74. Suppose that f_{X_1|Θ}(x|θ) = {π[1 + (x − θ)²]}⁻¹. A consistent
estimator is Y_{1/2}, the sample median. The likelihood equation is

    ∑_{i=1}^n (X_i − θ) / [1 + (X_i − θ)²] = 0,

which has several solutions in general. To check the conditions of Theorem 7.72,
we note that ψ(x, θ) = (x − θ)/[1 + (x − θ)²]. Clearly, E_θ ψ(X, θ) = 0 for all θ.
The derivative of E_{θ0} ψ(X, θ) evaluated at θ = θ0 is

    −(1/π) ∫ (1 − x²)/(1 + x²)³ dx = −1/4.

It follows that E_{θ0} ψ(X, θ) is strictly decreasing in θ for θ near θ0. If we choose
the solution to the likelihood equation closest to the median, we have another
consistent estimator.
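The selection rule can be tried out numerically (my sketch; the true θ = 0 is an assumed choice). Bisection started from a bracket around the sample median picks out the nearby root of the likelihood equation.

```python
import numpy as np

# Sketch of Example 7.74 (truth theta = 0 assumed): find a root of the
# Cauchy likelihood equation sum (x_i - t)/(1 + (x_i - t)^2) = 0 near
# the sample median, by bisection on a bracket around the median.
rng = np.random.default_rng(7)
x = rng.standard_cauchy(size=500)

def score(t):
    d = x - t
    return (d / (1.0 + d ** 2)).sum()

med = np.median(x)
lo, hi = med - 1.0, med + 1.0
while score(lo) < 0:          # expand until the bracket straddles a root
    lo -= 1.0
while score(hi) > 0:
    hi += 1.0
for _ in range(60):           # bisection keeps score(lo) > 0 > score(hi)
    mid = 0.5 * (lo + hi)
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
root = 0.5 * (lo + hi)
print(med, root)
```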

Freedman and Diaconis (1982) give examples in which M-estimators are
inconsistent. Basically, the distribution from which the data arise has a
density that is designed to be particularly incompatible with the function
ρ (or ψ) that one uses for the M-estimation.
The next theorem says that we can get efficient estimators without actu-
ally finding MLEs by starting with any √n-consistent estimator and then
using one step of Newton's method to try to solve the likelihood equation.
The theorem is stated in terms of general M-estimators.
Theorem 7.75. Let Ω ⊆ ℝᵏ be open. Let Θ̂_n − θ0 = O_P(1/√n) under
P_{θ0}. Let ψ : X × Ω → ℝᵏ be such that E_{θ0} ψ(X, θ0) = 0, and ∂²ψ(x, θ)/∂θ²
is continuous in θ. Define two matrices J_i = ((J_{i;j,t})) for i = 1, 2, where

    J_{1;j,t} = E_{θ0} (∂/∂θ_t) ψ_j(X_i, θ0),
    J_{2;j,t} = Cov_{θ0}( ψ_j(X_i, θ0), ψ_t(X_i, θ0) ).

Assume that J_1 is nonsingular. Also, assume that there exists H_r(x) such
that, for each j and t,

    sup_{||θ−θ0||≤r} | (∂/∂θ_t) ψ_j(x, θ) − (∂/∂θ_t) ψ_j(x, θ0) | ≤ H_r(x),

with lim_{r→0} E_{θ0} H_r(X) = 0. Let ψ̄(θ) = ∑_{i=1}^n ψ(X_i, θ)/n, let M̄(θ) be the
matrix with (j, t) entry m_{j,t}(θ) = ∑_{i=1}^n (∂/∂θ_t) ψ_j(X_i, θ)/n, and define

    Θ*_n = Θ̂_n − M̄(Θ̂_n)⁻¹ ψ̄(Θ̂_n).

Then, under P_{θ0}, √n(Θ*_n − θ0) →ᴰ N_k(0, J_1⁻¹ J_2 (J_1⁻¹)ᵀ).
PROOF. Since Θ̂_n − θ0 = o_P(1), the condition on H_r and the weak law of
large numbers imply that |m_{j,t}(Θ̂_n) − m_{j,t}(θ0)| = o_P(1) for each j, t, so
M̄(Θ̂_n)⁻¹ = M̄(θ0)⁻¹ + o_P(1). Since (Θ̂_n − θ0)² = O_P(1/n) and ∂²ψ(x, θ)/∂θ²
is continuous, a Taylor expansion gives ψ̄(Θ̂_n) = ψ̄(θ0) + M̄(θ0)(Θ̂_n − θ0) +
O_P(1/n). It follows that

    Θ*_n = Θ̂_n − M̄(θ0)⁻¹[ψ̄(θ0) + M̄(θ0)(Θ̂_n − θ0)] + o_P(1/√n)
        = Θ̂_n − M̄(θ0)⁻¹ ψ̄(θ0) − (Θ̂_n − θ0) + o_P(1/√n)
        = θ0 − M̄(θ0)⁻¹ ψ̄(θ0) + o_P(1/√n).    (7.76)

The errors from replacing M̄(Θ̂_n)⁻¹ by M̄(θ0)⁻¹ are o_P(1/√n) since
E_{θ0} ψ(X_i, θ0) = 0 implies that ψ̄(θ0) = o_P(1), and because Θ̂_n − θ0 = o_P(1).
The weak law of large numbers B.95 says that under P_{θ0},

    M̄(θ0) →ᴾ (( E_{θ0} (∂/∂θ_t) ψ_j(X, θ0) )) = J_1.

Also, the multivariate central limit theorem B.99 says that under P_{θ0},

    √n ψ̄(θ0) →ᴰ N_k(0, J_2).

The result now follows easily from (7.76) and Theorem 7.22.  □
Example 7.77. Consider the Cauchy distribution with a location parameter.
The likelihood equation is very difficult to solve, but there is a simple √n-
consistent estimator, namely, the sample median Θ̂_n. Set

    ψ(x, θ) = (∂/∂θ) log f_{X_1|Θ}(x|θ) = 2(x − θ) / [1 + (x − θ)²].

We now calculate

    J_1 = −2 E_{θ0} [1 − (X − θ0)²] / [1 + (X − θ0)²]² = −2 E_0 (1 − X²)/(1 + X²)²
        = −(2/π) ∫ (1 − x²)/(1 + x²)³ dx = −1/2 ≠ 0.

The other conditions of Theorem 7.75 can also be verified. The following estimator
is asymptotically efficient:

    Θ*_n = Θ̂_n − M̄(Θ̂_n)⁻¹ ψ̄(Θ̂_n).
A theorem like Theorem 7.63 can be proven for consistent M-estimators.
The asymptotic distribution is the same as the asymptotic distribution
of $\Theta_n^*$ in Theorem 7.75. The proof proceeds either by showing that the
M-estimator differs from $\Theta_n^*$ by $o_P(1/\sqrt{n})$ or by rewriting the proof of
Theorem 7.63 using $\psi$ in place of $\ell_n'$. The details are left to the interested
reader. Huber (1967) also proves theorems of this sort.
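The one-step construction of Theorem 7.75, specialized to the Cauchy location model of Example 7.77, can be sketched numerically. The following is an illustrative sketch, not part of the text: it starts from the $\sqrt{n}$-consistent sample median and applies a single Newton-style correction using the empirical versions of $\rho$ and $M$.

```python
import math
import random
from statistics import median

def one_step_cauchy(xs):
    """One Newton-style step for the Cauchy location parameter,
    starting from the root-n-consistent sample median."""
    theta = median(xs)
    # psi(x, theta) = 2(x - theta) / (1 + (x - theta)^2), the location score
    psi = [2 * (x - theta) / (1 + (x - theta) ** 2) for x in xs]
    # d psi / d theta = -2(1 - (x - theta)^2) / (1 + (x - theta)^2)^2
    dpsi = [-2 * (1 - (x - theta) ** 2) / (1 + (x - theta) ** 2) ** 2
            for x in xs]
    rho = sum(psi) / len(xs)   # rho(theta_n)
    m = sum(dpsi) / len(xs)    # M(theta_n); near J_1 = -0.5 for large n
    return theta - rho / m

random.seed(0)
# Cauchy(3, 1) draws via the inverse-CDF transform (true location 3)
sample = [3.0 + math.tan(math.pi * (random.random() - 0.5))
          for _ in range(2000)]
estimate = one_step_cauchy(sample)
```

With a large sample the one-step estimate falls close to the true location, while the bounded score keeps the correction stable even in the presence of the extreme observations a Cauchy sample produces.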

7.4 Large Sample Properties of Posterior Distributions
There are three kinds of large sample properties of posterior distributions
which we explore in this section. One kind is classical properties, such
as consistency and asymptotic normality conditional on a parameter. Ex-
amples include Theorem 7.80 and the asymptotic normality of posterior
distributions as presented in Section 7.4.2. Another kind is prior proper-
ties, where the probability statements concern the prior joint distributions
7.4. Large Sample Properties of Posterior Distributions 429

of all random quantities of interest. Some examples include Theorems 7.78


and 7.120. A third kind is pointwise properties. These concern limits along
certain data sequences, and an example is in Section 7.4.3.
One thing that must be kept in mind about the prior properties is that,
without further effort and/or conditions, these properties do not usually
imply corresponding non-Bayesian properties. For example, in Section 7.4.1
we will prove that, under certain conditions, the posterior distribution con-
centrates around the actual value of the parameter as n increases, with
probability 1 under the prior joint distribution of the data and the pa-
rameter. But if the prior distribution of the parameter is concentrated on
a small portion of the parameter space, one cannot expect the posterior
distribution to concentrate near $\theta$ given $\Theta = \theta$ for values of $\theta$ not in that
small portion. Some examples are given in Section 7.4.1.

7.4.1 Consistency of Posterior Distributions+


Doob (1949) proved a theorem that says that if there exists a consistent
estimator of a parameter $\Theta$, then the posterior distribution of $\Theta$ is consistent
in the following sense. Let $\mu_{\Theta|X_1,\dots,X_n}(\cdot|x_1,\dots,x_n)$ denote the posterior
probability measure over $(\Omega,\tau)$ given $(X_1,\dots,X_n) = (x_1,\dots,x_n)$. Let
$A \in \tau$ and let $I_A$ be the indicator function of $A$. The theorem says that
$\mu_{\Theta|X_1,\dots,X_n}(A|x_1,\dots,x_n)$ converges almost surely, as $n \to \infty$, to $I_A(\Theta)$.
The proof given below is adapted from Schwartz (1965, Theorem 3.2) and
is similar to the proof of Theorem 2 of Schervish and Seidenfeld (1990).
Theorem 7.78.²² Let $(S,\mathcal{A},\mu)$ be a probability space. Let $(\mathcal{X}_1,\mathcal{B}_1)$ be a
Borel space, and let $(\Omega,\tau)$ be a finite-dimensional parameter space with
Borel $\sigma$-field. Let $\Theta : S \to \Omega$ and $X_n : S \to \mathcal{X}_1$, for $n = 1,2,\dots$, be
measurable functions. Suppose that there exists a sequence of functions
$h_n : \mathcal{X}^n \to \Omega$ such that $h_n(X_1,\dots,X_n)$ converges in probability to $\Theta$. Let
$\mu_{\Theta|X_1,\dots,X_n}(\cdot|x_1,\dots,x_n)$ denote the posterior probability measure on $(\Omega,\tau)$
given $(X_1,\dots,X_n) = (x_1,\dots,x_n)$. For each $A \in \tau$,
$$ \lim_{n\to\infty} \mu_{\Theta|X_1,\dots,X_n}(A|X_1,\dots,X_n) = I_A(\Theta), \quad \text{a.s.} $$
PROOF. According to Theorem B.90, there is a subsequence $\{n_k\}_{k=1}^\infty$ such
that $Z_k = h_{n_k}(X_1,\dots,X_{n_k})$ converges to $\Theta$, a.s. Let $A \in \tau$, and let
$Z = \lim_{k\to\infty} Z_k$ when the limit exists and $Z = \theta_0$ otherwise, where $\theta_0 \in \Omega$.
Then $I_A(\Theta) = I_A(Z)$ a.s. Since $Z$ is measurable with respect to the $\sigma$-field
generated by $\{X_n\}_{n=1}^\infty$, part I of Lévy's theorem B.118 says that

+This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.
22The proof relies on martingale theory. The proof in Doob (1949) does not
require that a consistent estimator exists, but it has a slightly stronger assumption
which implies that a consistent estimator exists.

$\mathrm{E}(I_A(Z)|X_1,\dots,X_n)$ converges almost surely to $I_A(Z)$. Since $I_A(Z) =
I_A(\Theta)$, a.s., $\mathrm{E}(I_A(Z)|X_1,\dots,X_n) = \mu_{\Theta|X_1,\dots,X_n}(A|X_1,\dots,X_n)$, a.s., and
the result is proven. $\Box$
The intuitive meaning of this theorem is that when a consistent estimator
of $\Theta$ exists, the posterior distribution of $\Theta$ will tend to concentrate
near the true value of $\Theta$, with probability 1 under the joint distribution of
the data and the parameter. A consistent estimator of $\Theta$ fails to exist if
the parameter is not identifiable through the sequence of data values. The
parameter M in Problem 16 on page 75 is an example of such a noniden-
tifiable parameter. Note that there was no explicit mention of the prior
distribution of the parameter in Theorem 7.78. Since the theorem makes
claims only about probabilities calculated under the joint distribution of
the data and the parameter, it holds for all prior distributions. If the prior
"misses the true value," then the conclusion to the theorem is not very
interesting.
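The concentration described by Theorem 7.78 is easy to see in a conjugate setting. The following sketch, an illustration not found in the text, tracks the posterior mass that a Beta(2, 2) prior puts near the true value of a Bernoulli parameter as the sample grows; a grid sum over the unnormalized Beta posterior density stands in for the exact integral.

```python
import math
import random

random.seed(1)
THETA_TRUE = 0.3
A0 = B0 = 2.0   # Beta(2, 2) prior: positive density on all of (0, 1)
data = [1 if random.random() < THETA_TRUE else 0 for _ in range(5000)]

def posterior_mass(n, lo, hi, grid=2000):
    """Posterior probability of (lo, hi) after n observations, computed
    by normalizing the Beta(A0 + s, B0 + n - s) density on a grid."""
    s = sum(data[:n])
    a, b = A0 + s, B0 + n - s
    ts = [(i + 0.5) / grid for i in range(grid)]
    logf = [(a - 1) * math.log(t) + (b - 1) * math.log(1 - t) for t in ts]
    top = max(logf)
    w = [math.exp(v - top) for v in logf]   # stabilized unnormalized density
    return sum(wi for t, wi in zip(ts, w) if lo < t < hi) / sum(w)

for n in (10, 100, 5000):
    print(n, round(posterior_mass(n, THETA_TRUE - 0.05, THETA_TRUE + 0.05), 3))
```

As $n$ grows, the posterior mass of the fixed neighborhood $(0.25, 0.35)$ of the true value tends to 1, exactly the almost-sure concentration that the theorem describes.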
Example 7.79. Suppose that person A has a prior for $\Theta$ that is concentrated
on a set $C \subseteq \Omega$ and person B believes that $\Theta$ is concentrated on a set $D \subseteq \Omega$.
Suppose that $E$ is an open set such that $C \subseteq E$ and that the closures of $D$ and
$E$ are disjoint. Person B believes that $I_E(\Theta) = 0$ with probability 1, and person
A believes that $I_E(\Theta) = 1$ with probability 1. Oddly enough, they both believe
that the limit of the posterior probability of $E$ is $I_E(\Theta)$ with probability 1; they
just don't agree on which of the two possible values $I_E(\Theta)$ will equal.

Example 7.79 raises an interesting question. If two people (A and B) have


different models for data, what does person A believe about the asymptotic
behavior of the posterior distribution calculated by person B? Berk (1966)
proved a theorem like Theorem 7.80 for the finite-dimensional parameter
case. Strasser (1981) used a result of Perlman (1972) to prove a theorem for
more general parameters. Theorem 7.80 says that under conditions similar
to those that guarantee consistency of MLEs, if person B uses a parametric
family with parameter $\Theta$ and person A believes that the data are IID
with the distribution corresponding to $\Theta = \theta_0$ in the parametric family,
then person A believes that person B's posterior will asymptotically be
concentrated on the set of $\theta$ values such that $I_{X_1}(\theta_0;\theta)$ is small, where
$I_{X_1}(\theta_0;\theta)$ is the Kullback-Leibler information.

Theorem 7.80. Assume the conditions of Theorem 7.49 or of Lemma 7.54.
For $\epsilon > 0$, define $C_\epsilon = \{\theta : I_{X_1}(\theta_0;\theta) < \epsilon\}$. Let $\mu_\Theta$ be a prior distribution
such that $\mu_\Theta(C_\epsilon) > 0$, for every $\epsilon > 0$. Then, for every $\epsilon > 0$ and open set
$N_0$ containing $C_\epsilon$, the posterior satisfies $\lim_{n\to\infty} \mu_{\Theta|X^n}(N_0|x^n) = 1$, a.s.
$[P_{\theta_0}]$, where $x^n = (x_1,\dots,x_n)$ are the data.
PROOF. For each $x \in \mathcal{X}^\infty$, the infinite product space of copies of $\mathcal{X}_1$, define
$$ D_n(\theta,x) = \frac1n\sum_{i=1}^n \log\frac{f_{X_1|\Theta}(x_i|\theta_0)}{f_{X_1|\Theta}(x_i|\theta)}. $$

Write the posterior odds of $N_0$ as
$$ \frac{\int_{N_0} d\mu_{\Theta|X^n}(\theta|x^n)}{\int_{N_0^C} d\mu_{\Theta|X^n}(\theta|x^n)}
= \frac{\int_{N_0} \prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta)\,d\mu_\Theta(\theta)}{\int_{N_0^C} \prod_{i=1}^n f_{X_1|\Theta}(x_i|\theta)\,d\mu_\Theta(\theta)}
= \frac{\int_{N_0} \exp(-nD_n(\theta,x))\,d\mu_\Theta(\theta)}{\int_{N_0^C} \exp(-nD_n(\theta,x))\,d\mu_\Theta(\theta)}. \tag{7.81} $$

The idea behind the remainder of the proof is the following. For each $x$ in a
set with probability 1, we find a lower bound on the numerator of the last
expression in (7.81) and an upper bound on the denominator such that the
ratio of these bounds goes to $\infty$.
First, look at the denominator of the last expression in (7.81). Just as
in the proof of Theorem 7.49, construct the sets $\Omega_1,\dots,\Omega_m$ so that $\Omega =
N_0 \cup (\cup_{j=1}^m \Omega_j)$, and $\mathrm{E}_{\theta_0}Z(\Omega_j,X_i) = c_j > 0$. It is easy to see that for
$M \subseteq \Omega$,
$$ \inf_{\theta\in M} D_n(\theta,x) \ge \frac1n\sum_{i=1}^n Z(M,x_i). $$
So the denominator of the last expression in (7.81) is at most
$$ \sum_{j=1}^m \int_{\Omega_j} \exp(-nD_n(\theta,x))\,d\mu_\Theta(\theta)
\le \sum_{j=1}^m \sup_{\theta\in\Omega_j} \exp(-nD_n(\theta,x))\,\mu_\Theta(\Omega_j)
\le \sum_{j=1}^m \exp\left(-\sum_{i=1}^n Z(\Omega_j,x_i)\right)\mu_\Theta(\Omega_j). $$


For each $j$, the strong law of large numbers 1.63 says there exists $B_j \subseteq \mathcal{X}^\infty$
such that $P_{\theta_0}(B_j) = 1$ and such that, for every $x \in B_j$, there exists an
integer $K_j(x)$ such that $n \ge K_j(x)$ implies $\sum_{i=1}^n Z(\Omega_j,x_i)/n > c_j/2 > 0$.
Let $c = \min\{c_1,\dots,c_m\}$, $B = \cap_{j=1}^m B_j$, and $N(x) = \max\{K_1(x),\dots,K_m(x)\}$.
For each $x \in B$ and $n \ge N(x)$, the denominator of the last expression in
(7.81) is at most $\exp(-nc/2)$.
For the numerator, let $0 < \delta < \min\{\epsilon, c/2\}/4$. For each $x \in \mathcal{X}^\infty$ and $\theta \in \Omega$,
let
$$ W_n(x) = \{\theta : D_\ell(\theta,x) \le I_{X_1}(\theta_0;\theta) + \delta, \text{ for all } \ell \ge n\}, $$
$$ V_n(\theta) = \{x : D_\ell(\theta,x) \le I_{X_1}(\theta_0;\theta) + \delta, \text{ for all } \ell \ge n\}. $$
For each $\theta$, the strong law of large numbers 1.63 says that $D_n(\theta,x) \to
I_{X_1}(\theta_0;\theta)$, a.s. $[P_{\theta_0}]$, so $P_{\theta_0}(\cup_{n=1}^\infty V_n(\theta)) = 1$. Now use this fact together
with the fact that the sets $V_n(\theta)$ are increasing and the fact that $x \in V_n(\theta)$
if and only if $\theta \in W_n(x)$ to write

$$ \mu_\Theta(C_\delta) = \lim_{n\to\infty} \int_{C_\delta} P_{\theta_0}(V_n(\theta))\,d\mu_\Theta(\theta)
= \lim_{n\to\infty} \int_{C_\delta} \int_{\mathcal{X}^\infty} I_{V_n(\theta)}(x)\,dP_{\theta_0}(x)\,d\mu_\Theta(\theta) $$
$$ = \lim_{n\to\infty} \int_{\mathcal{X}^\infty} \int_{C_\delta} I_{V_n(\theta)}(x)\,d\mu_\Theta(\theta)\,dP_{\theta_0}(x)
= \lim_{n\to\infty} \int_{\mathcal{X}^\infty} \int_{C_\delta} I_{W_n(x)}(\theta)\,d\mu_\Theta(\theta)\,dP_{\theta_0}(x) $$
$$ = \lim_{n\to\infty} \int_{\mathcal{X}^\infty} \mu_\Theta(C_\delta \cap W_n(x))\,dP_{\theta_0}(x)
= \int_{\mathcal{X}^\infty} \lim_{n\to\infty} \mu_\Theta(C_\delta \cap W_n(x))\,dP_{\theta_0}(x). $$

Since $\mu_\Theta(C_\delta \cap W_n(x)) \le \mu_\Theta(C_\delta)$ for all $x$ and $n$, we have $\lim_{n\to\infty} \mu_\Theta(C_\delta \cap
W_n(x)) = \mu_\Theta(C_\delta)$, a.s. $[P_{\theta_0}]$, because strict inequality with positive probability
would contradict the above string of equalities. So, there is a set
$B' \subseteq \mathcal{X}^\infty$ with $P_{\theta_0}(B') = 1$ and for every $x \in B'$, there exists $N'(x)$ such
that $n \ge N'(x)$ implies $\mu_\Theta(C_\delta \cap W_n(x)) > \mu_\Theta(C_\delta)/2$. So, if $x \in B'$ and
$n \ge N'(x)$, the numerator of the last expression in (7.81) is at least
$$ \frac12\exp(-2n\delta)\,\mu_\Theta(C_\delta) \ge \frac12\exp\left(-\frac{nc}{4}\right)\mu_\Theta(C_\delta), $$
since $I_{X_1}(\theta_0;\theta) \le \delta$ for $\theta \in C_\delta$. It follows that if $x \in B \cap B'$ and $n \ge
\max\{N(x), N'(x)\}$, then the ratio in (7.81) is at least $\mu_\Theta(C_\delta)\exp(nc/4)/2$,
which goes to $\infty$ with $n$. $\Box$

Example 1.82 (Continuation of Example 7.51; see page 416). Suppose that
{Xn}~=l given e = B are lID with U(O, B) distribution. We saw earlier that
the conditions of Theorem 7.49 are satisfied. The Kullback-Leibler information
is
T (B B) = { log!o if (J 2: (Jo,
Xl 0, 00 if (J < Bo.
The set C. is the interval [Bo,exp(E)Bo). An open set No containing this interval
will need to contain an open interval (Bo - 6, ( 0 ) for 6 > O. So long as the prior
distribution assigns postive mass to every open interval, then, for every 80 , every
open interval around 80 will have posterior probability going to 1, a.s. [Peol. This
is a much stronger claim than one could infer from Theorem 7.78. 23
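The concentration claimed in Example 7.82 can be illustrated with a small computation (an illustration, not part of the text). It assumes an Exponential(1) prior, which puts positive mass on every open interval; the posterior for the $U(0,\theta)$ model is then proportional to $e^{-\theta}\theta^{-n}$ on $\theta \ge \max_i x_i$, and its mass near the true $\theta_0$ approaches 1.

```python
import math
import random

random.seed(4)
THETA0 = 2.0

def posterior_mass_near(data, delta, grid=4000, hi=10.0):
    """Posterior mass of (THETA0 - delta, THETA0 + delta) under an
    assumed Exponential(1) prior for the U(0, theta) model, via a grid
    sum on the support [max(data), hi)."""
    n, m = len(data), max(data)
    ts = [m + (hi - m) * (i + 0.5) / grid for i in range(grid)]
    logf = [-t - n * math.log(t) for t in ts]   # log prior + log likelihood
    top = max(logf)
    w = [math.exp(v - top) for v in logf]
    near = sum(wi for t, wi in zip(ts, w) if abs(t - THETA0) < delta)
    return near / sum(w)

data = [random.uniform(0, THETA0) for _ in range(1000)]
mass_large = posterior_mass_near(data, 0.05)
mass_small = posterior_mass_near(data[:20], 0.05)
```

With 1000 observations essentially all posterior mass sits within $0.05$ of $\theta_0 = 2$, while with 20 observations a substantial fraction remains outside that window.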

As we noted earlier, the conditions of Theorem 7.49 fail in some mul-


tiparameter problems. Berk (1966) proves the following lemma, giving a

²³See Problem 48 on page 474 to see why the posterior probability of $C_\epsilon$ does
not go to 1 almost surely.

slightly weaker condition that holds in more cases. 24


Lemma 7.83. In the notation of Theorems 7.49 and 7.80, instead of assuming
that there is a compact set $C$ such that $\mathrm{E}_{\theta_0}Z(\Omega\setminus C, X_i) > 0$, assume
that there exist an integer $p$ and a compact set $C$ such that
$$ \mathrm{E}_{\theta_0}\left[ \inf_{\psi\in C^C} \frac1p\sum_{i=1}^p \log\frac{f_{X_1|\Theta}(X_i|\theta_0)}{f_{X_1|\Theta}(X_i|\psi)} \right] > 0. \tag{7.84} $$
Then $D_n(\theta,x)$ is bounded below, uniformly in $\theta$, by a random variable that
converges almost surely to a positive value.
PROOF. Let $N_{p,n}$ be the collection of all subsets of size $p$ of distinct elements
of the set $\{1,\dots,n\}$. Denote such subsets $\alpha = \{\alpha_1,\dots,\alpha_p\}$. For $y \in \mathcal{X}_1$, let
$g(\theta,y)$ stand for $\log f_{X_1|\Theta}(y|\theta_0)/f_{X_1|\Theta}(y|\theta)$. It is clear that for every $\alpha \in N_{p,n}$
and each $\theta \in C^C$,
$$ \frac1p\sum_{i=1}^p g(\theta, x_{\alpha_i}) \ge \inf_{\psi\in C^C} \frac1p\sum_{i=1}^p g(\psi, x_{\alpha_i}). $$
If we add both sides of this inequality for all $\alpha \in N_{p,n}$ and divide by $\binom{n}{p}$, we
get
$$ D_n(\theta,x) \ge \binom{n}{p}^{-1} \sum_{\alpha\in N_{p,n}} \inf_{\psi\in C^C} \frac1p\sum_{i=1}^p g(\psi, x_{\alpha_i}). $$
Call the right-hand side of this expression $G_{n,p}(x)$. (Note that $G_{p,p}(X)$ is
the random variable in (7.84) and that $G_{n,p}(x)$ does not depend on $\theta$, so
that it is a uniform lower bound.) Due to the symmetry with respect to
permutations of coordinates, it is clear that $X_1,\dots,X_n$ are conditionally
exchangeable given $\mathcal{F}_n$, the $\sigma$-field generated by $\{G_{n+i,p}\}_{i=0}^\infty$. It follows
that for every $\alpha \in N_{p,n}$,
$$ \mathrm{E}_{\theta_0}\big(G_{p,p}(X)\,\big|\,\mathcal{F}_n\big) = G_{n,p}(X). $$
Now, apply part II of Lévy's theorem B.124 to conclude that $G_{n,p}(X)$
converges almost surely to $\mathrm{E}_{\theta_0}(G_{p,p}(X)|\mathcal{F}_\infty)$, where $\mathcal{F}_\infty = \cap_{n=p}^\infty \mathcal{F}_n$. Since
$\mathcal{F}_\infty$ is a sub-$\sigma$-field of the tail $\sigma$-field of the sequence $\{X_n\}_{n=1}^\infty$, the Kolmogorov
zero-one law B.68 says that $\mathrm{E}_{\theta_0}(G_{p,p}(X)|\mathcal{F}_\infty)$ is constant a.s.
$[P_{\theta_0}]$. The constant must be $\mathrm{E}_{\theta_0}(G_{p,p}(X)) > 0$. So, $\lim_{n\to\infty} G_{n,p}(X) =
\mathrm{E}_{\theta_0}(G_{p,p}(X)) > 0$, a.s. $[P_{\theta_0}]$. $\Box$

²⁴The proof of Lemma 7.83 involves martingale theory. Lemma 7.83 also provides
a weaker condition under which the MLE converges a.s. $[P_{\theta_0}]$.

Example 7.85 (Continuation of Example 7.52; see page 417). Suppose that
$\{X_n\}_{n=1}^\infty$ given $\Theta = (\mu,\sigma)$ are IID with $N(\mu,\sigma^2)$ distribution. If $\theta_0 = (\mu_0,\sigma_0)$, it
is easy to calculate
$$ \log\frac{f_{X_1,X_2|\Theta}(x_1,x_2|\theta_0)}{f_{X_1,X_2|\Theta}(x_1,x_2|\theta)}
= 2\log\frac{\sigma}{\sigma_0} + s^2\left(\frac{1}{\sigma^2} - \frac{1}{\sigma_0^2}\right)
- \frac{(\bar x - \mu_0)^2}{\sigma_0^2} + \frac{(\bar x - \mu)^2}{\sigma^2}, \tag{7.86} $$
where $\bar x = (x_1+x_2)/2$ and $s^2 = (x_1-x_2)^2/4$. (Without loss of generality, assume
$\mu_0 = 0$ and $\sigma_0 = 1$ for the rest of this example.) Let $C$ be the rectangle where
$\mu \in [-u,u]$ and $\sigma^2 \in [1/v,v]$. The integral of (7.86) for $(\bar x, s^2) \notin C$ can be made
negligible by choosing $v$ and $u$ large. For $(\bar x, s^2) \in C$, the minimum of (7.86) will
occur at one of the points (i) $(\bar x, v)$, (ii) $(\bar x, 1/v)$, (iii) $(u, s^2 + (\bar x - u)^2)$, or (iv)
$(-u, s^2 + (\bar x + u)^2)$. By choosing $v$ large enough, one can check that case (ii) can
be ignored and that (7.86) is very large in case (i). In case (iii), (7.86) equals
$1 + \log(s^2 + (\bar x - u)^2) - s^2 - \bar x^2$. The integral of the last two terms (over the entire
sample space) is $-1.5$. By choosing $u$ sufficiently large, the integral of the first
two terms over the region where they add to less than 1.5 can be made negligibly
small. This is similar for case (iv). So we can ensure that the minimum of (7.86)
over $C^C$ has positive integral.
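The algebraic identity (7.86) is easy to verify numerically. The following sketch, an illustration not part of the text, compares the directly computed two-observation log-likelihood ratio with its $(\bar x, s^2)$ decomposition at $\mu_0 = 0$, $\sigma_0 = 1$.

```python
import math
import random

def log_normal_pdf(x, mu, sigma):
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def logratio_direct(x1, x2, mu, sigma, mu0=0.0, sigma0=1.0):
    """log f(x1, x2 | theta0) - log f(x1, x2 | theta), computed directly."""
    return (log_normal_pdf(x1, mu0, sigma0) + log_normal_pdf(x2, mu0, sigma0)
            - log_normal_pdf(x1, mu, sigma) - log_normal_pdf(x2, mu, sigma))

def logratio_786(x1, x2, mu, sigma, mu0=0.0, sigma0=1.0):
    """The same quantity via the (xbar, s^2) decomposition in (7.86)."""
    xbar = (x1 + x2) / 2
    s2 = (x1 - x2) ** 2 / 4
    return (2 * math.log(sigma / sigma0)
            + s2 * (1 / sigma ** 2 - 1 / sigma0 ** 2)
            - (xbar - mu0) ** 2 / sigma0 ** 2
            + (xbar - mu) ** 2 / sigma ** 2)

random.seed(5)
for _ in range(100):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    mu, sigma = random.uniform(-3, 3), random.uniform(0.2, 4)
    assert math.isclose(logratio_direct(x1, x2, mu, sigma),
                        logratio_786(x1, x2, mu, sigma),
                        rel_tol=1e-9, abs_tol=1e-9)
```

The two functions agree to floating-point accuracy over randomly drawn data and parameter values, confirming that (7.86) depends on the data only through $(\bar x, s^2)$.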

A famous example, in which the conditions of Theorem 7.80 fail, was


given by Diaconis and Freedman (1986a, 1986b). It concerns an infinite-
dimensional parameter space and a prior distribution constructed from the
Dirichlet process described in Section 1.6.1. It is shown that, conditional
on the data arising from a cleverly chosen continuous distribution, the
posterior mean of a particular function Y of the parameter is not consis-
tent. Initially, what is surprising about this example is the following. Since
continuous distribution functions can be approximated arbitrarily closely
by discrete distribution functions, one might think that continuous dis-
tributions are "close" in some sense to those distributions on which the
Dirichlet process concentrates. The problem is that the sense of closeness
is not Kullback-Leibler information. Rather, it is based on convergence in
distribution. 25 Although Theorem 7.78 can be used to show that the pos-
terior mean of Y converges in probability to Y given distributions in a set
C with prior probability 1, the convergence does not extend to parameter
values that are "close" to C in the sense of convergence in distribution. (See

25There is a simple way to understand why inconsistency arises in the Diaconis


and Freedman (1986a, 1986b) example. As Barron (1986) pointed out, when
the data come from a continuous distribution, the posterior for Y is the same
(with probability I) as what one would get if one assumed that the data were
conditionally lID given Y with the distribution given by the normalized base
measure of the Dirichlet process. (See Lemma 1.104.) Since the normalized base
measure that Diaconis and Freedman (1986a, 1986b) use looks absolutely nothing
like the distribution that actually generates the data, it is not surprising that the
posterior mean of Y is not consistent. In fact the distribution that generates
the data is chosen to be particularly incompatible with the base measure of the
Dirichlet process in much the same way that the examples of inconsistent M-
estimators were constructed by Freedman and Diaconis (1982).

Problem 47 on page 474.) Theorem 7.80 suggests that the type of closeness
that implies consistency in Bayesian problems is much stronger. 26

7.4.2 Asymptotic Normality of Posterior Distributions


Walker (1969) first proved that, under some conditions, the posterior distribution
of a one-dimensional parameter would look more and more like a
normal distribution as more conditionally IID data were collected. Dawid
(1970) proved a similar result under weaker conditions. Heyde and Johnstone
(1979) later extracted the essence of Walker's proof to show that it
could extend to sequences of data that were not necessarily conditionally
IID.²⁷ Johnstone (1978) contains a multiparameter version of the theorem
of Heyde and Johnstone (1979). Still others-Brenner, Fraser, and McDun-
nough (1982), Fraser and McDunnough (1984), and Chen (1985)-prove
that, under certain conditions, the likelihood function (or the posterior
density) converges (in probability or almost surely) to a normal density.
The type of asymptotic normality proven by Walker (1969) and by Heyde
and Johnstone (1979) follows from the convergence of the posterior density.
In this section, we present a hybrid of the various theorems mentioned
above. First, we prove that the posterior density of a suitable transforma-
tion of the parameter vector converges to a normal density in probability.
We will then use this to conclude that posterior probabilities converge in
probability to multivariate normal probabilities. The general situation in-
volves a sequence of random quantities $X_n : S \to \mathcal{X}_n$, for $n = 1,2,\dots$, and a
parameter $\Theta : S \to \mathbb{R}^k$ such that the conditional distribution of $X_n$ given
$\Theta = \theta$ has a density $f_{X_n|\Theta}(\cdot|\theta)$ with respect to a $\sigma$-finite measure $\nu_n$ on
$\mathcal{X}_n$. We use the notation
$$ \ell_n(\theta) = \log f_{X_n|\Theta}(X_n|\theta), \qquad
\ell_n''(\theta) = \left(\left(\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\,\ell_n(\theta)\right)\right). \tag{7.87} $$
Let $\hat\Theta_n$ stand for the MLE of $\Theta$ if it exists, and let²⁸
$$ \Sigma_n = \begin{cases} \big[-\ell_n''(\hat\Theta_n)\big]^{-1} & \text{if the inverse and } \hat\Theta_n \text{ exist}, \\ I_k & \text{if not}. \end{cases} \tag{7.88} $$
The following regularity conditions are used in the general theorems.

26Barron (1988) gives necessary and sufficient conditions for the posterior dis-
tribution to concentrate on sets "close" to the distribution that generates the
data. His results even apply in nonparametric settings.
²⁷What Heyde and Johnstone (1979) did (whether intentionally or not) was
to take the conclusions Walker (1969) derived from the assumption that the
data were conditionally IID, and use them as assumptions. To use Heyde and
Johnstone's result for the conditionally IID case, one need only repeat the portion
of Walker's proof in which the assumptions of Heyde and Johnstone's theorem are
proven directly. Alternatively, one could prove the assumptions independently.
²⁸Notice that $\Sigma_n^{-1}$ is the observed Fisher information matrix.

General Regularity Conditions:

1. The parameter space is $\Omega \subseteq \mathbb{R}^k$ for some finite $k$.
2. $\theta_0$ is a point interior to $\Omega$.
3. The prior distribution of $\Theta$ has a density with respect to Lebesgue
measure that is positive and continuous at $\theta_0$.
4. There exists a neighborhood $N_0 \subseteq \Omega$ of $\theta_0$ on which $\ell_n(\theta)$ is twice
continuously differentiable with respect to all coordinates of $\theta$, a.s.
$[P_{\theta_0}]$.
5. The largest eigenvalue of $\Sigma_n$ goes to 0 in probability.

6. For $\delta > 0$, define $N_0(\delta)$ to be the open ball of radius $\delta$ around $\theta_0$. Let
$\lambda_n$ be the smallest eigenvalue of $\Sigma_n$. If $N_0(\delta) \subseteq \Omega$, then there exists
$K(\delta) > 0$ such that
$$ \lim_{n\to\infty} P_{\theta_0}^n\left( \sup_{\theta\in\Omega\setminus N_0(\delta)} \big[\ell_n(\theta) - \ell_n(\theta_0)\big] < -\frac{K(\delta)}{\lambda_n} \right) = 1. $$

7. For each $\epsilon > 0$, there exists $\delta(\epsilon) > 0$ such that
$$ \lim_{n\to\infty} P_{\theta_0}^n\left( \sup_{\theta\in N_0(\delta(\epsilon)),\,\|\psi\|=1} \Big|1 + \psi^\top \Sigma_n^{\frac12}\,\ell_n''(\theta)\,\Sigma_n^{\frac12}\,\psi\Big| < \epsilon \right) = 1. $$

A few words of explanation of these conditions are in order. The first speaks
for itself. The second avoids having likelihood functions that are largest near
the boundary of $\Omega$ and hence cannot look like normal densities. The third
ensures that the prior density doesn't destroy the asymptotic normality
of the likelihood function. The fourth is one of two smoothness conditions,
which also rules out distributions for which the support of the distribution
depends on $\theta$. Condition 5 ensures that the amount of information in the
data about all aspects of $\Theta$ increases without bound. Condition 6 ensures
that the MLE is consistent and that the likelihood function can be ignored
for values not near $\theta_0$. Condition 7 is a smoothness condition on the amount
of information in the data about $\Theta$.
To be specific about what we mean by saying that the posterior distribution
will look more like a normal distribution as more data are collected,
consider the posterior probability that $\Sigma_n^{-1/2}(\Theta - \hat\Theta_n) \in B$ as a statistic
$T_n$ (function of the data) prior to observing the data. Then $T_n$ converges
in probability (under $P_{\theta_0}$) to the multivariate normal probability of $B$. We
can also make corresponding claims about the posterior density of $\Theta$.

7.4.2.1 Posterior Densities


First, we prove that the posterior densities of a sequence of transformations
of the parameter converge in probability uniformly on compact sets
to a multivariate normal density. Since we expect that the posterior density
of $\Theta$ will become more and more concentrated around $\theta_0$ given $\Theta = \theta_0$,
the posterior density of $\Theta$ itself should not have an interesting asymptotic
behavior. Rather, if we rescale $\Theta$ so that its variance is approximately constant
(as a function of sample size), then perhaps the transformed random
variable will have an interesting posterior density asymptotically.²⁹
Theorem 7.89. Assume the general regularity conditions, and let $\hat\Theta_n$ be
an MLE of $\Theta$. Define $\ell_n''$ as in (7.87), and let $\Sigma_n$ be defined by (7.88). Let
$\Psi_n = \Sigma_n^{-1/2}(\Theta - \hat\Theta_n)$. Then the posterior density of $\Psi_n$ given $X_n$ converges
in probability uniformly on compact sets to the $N_k(0, I_k)$ density, given
$\Theta = \theta_0$. That is, for each compact subset $B$ of $\mathbb{R}^k$, and each $\epsilon > 0$,
$$ \lim_{n\to\infty} P_{\theta_0}^n\left( \sup_{\psi\in B} \big| f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi) \big| > \epsilon \right) = 0, $$
where $\phi$ is the $N_k(0, I_k)$ density.

PROOF. First, note that general regularity condition 6 guarantees that $\hat\Theta_n$
is consistent, since, for each $\delta$, the probability goes to 1 that $\hat\Theta_n$ is inside of
$N_0(\delta)$. Use Taylor's theorem C.1 to write
$$ f_{X_n|\Theta}(X_n|\theta) = f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp\{\ell_n(\theta) - \ell_n(\hat\Theta_n)\} $$
$$ = f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp\Big\{-\frac12(\theta - \hat\Theta_n)^\top \Sigma_n^{-\frac12}
\big(I_k - R_n(\theta, X_n)\big)\Sigma_n^{-\frac12}(\theta - \hat\Theta_n) + \Delta_n\Big\}, \tag{7.90} $$
where $R_n(\theta, X_n) = I_k + \Sigma_n^{\frac12}\ell_n''(\theta_n^*)\Sigma_n^{\frac12}$,
with $\theta_n^*$ between $\theta$ and $\hat\Theta_n$, and $\Delta_n$ is a remainder that equals 0 when $\hat\Theta_n$
exists and the expansion applies. Since $\theta_0 \in \operatorname{int}(\Omega)$ and $\hat\Theta_n$ is consistent, it follows
that $\lim_{n\to\infty} P_{\theta_0}^n(\Delta_n = 0, \text{ for all } \theta) = 1$. Now we can write the posterior
density of $\Theta$ as

²⁹It is interesting to note that LeCam (1970) proves asymptotic normality of
MLEs by first showing that the logarithm of the likelihood function (as a function
of $t = \sqrt{n}(\theta - \theta_0)$) is asymptotically quadratic with the same distribution as
$-t^\top \mathcal{I}_{X_1}(\theta_0)t/2 + t^\top Y$, where $Y \sim N_k(0, \mathcal{I}_{X_1}(\theta_0))$. The maximum of this function
is at $t = \mathcal{I}_{X_1}(\theta_0)^{-1}Y$, which has $N_k(0, \mathcal{I}_{X_1}(\theta_0)^{-1})$ distribution.

where
$$ f_{X_n}(x) = \int_\Omega f_\Theta(\theta)\,f_{X_n|\Theta}(x|\theta)\,d\theta. $$
The posterior density of $\Psi_n$, $f_{\Psi_n|X_n}(\psi|X_n)$, can be written as
$$ |\Sigma_n|^{\frac12}\,\frac{f_{X_n|\Theta}(X_n|\hat\Theta_n)\,f_\Theta(\Sigma_n^{\frac12}\psi + \hat\Theta_n)}{f_{X_n}(X_n)}\;
\frac{f_{X_n|\Theta}(X_n|\Sigma_n^{\frac12}\psi + \hat\Theta_n)}{f_{X_n|\Theta}(X_n|\hat\Theta_n)}. \tag{7.91} $$

Our first step is to see how the first factor in (7.91) behaves as $n \to \infty$.
Choose $0 < \epsilon < 1$ and let $\eta$ be such that
$$ 1 - \epsilon < \frac{1-\eta}{(1+\eta)^{\frac k2}}, \qquad 1 + \epsilon > \frac{1+\eta}{(1-\eta)^{\frac k2}}. $$

Since the prior is continuous at $\theta_0$, there exists $\delta_1 > 0$ such that $\|\theta - \theta_0\| <
\delta_1$ implies $|f_\Theta(\theta) - f_\Theta(\theta_0)| < \eta f_\Theta(\theta_0)$. By general regularity condition 7,
there exists $\delta_2 > 0$ such that
$$ \lim_{n\to\infty} P_{\theta_0}^n\left( \sup_{\theta\in N_0(\delta_2),\,\|\psi\|=1} \Big|1 + \psi^\top\Sigma_n^{\frac12}\ell_n''(\theta)\Sigma_n^{\frac12}\psi\Big| < \eta \right) = 1. \tag{7.92} $$
Let $\delta = \min\{\delta_1, \delta_2\}$. Write $f_{X_n}(X_n) = J_1 + J_2$, where
$$ J_1 = \int_{N_0(\delta)} f_\Theta(\theta)f_{X_n|\Theta}(X_n|\theta)\,d\theta, \qquad
J_2 = \int_{\Omega\setminus N_0(\delta)} f_\Theta(\theta)f_{X_n|\Theta}(X_n|\theta)\,d\theta. $$
Use (7.90) to write
$$ J_1 = f_{X_n|\Theta}(X_n|\hat\Theta_n) \int_{N_0(\delta)} f_\Theta(\theta)\exp\Big\{-\frac12(\theta - \hat\Theta_n)^\top\Sigma_n^{-\frac12}
\big(I_k - R_n(\theta,X_n)\big)\Sigma_n^{-\frac12}(\theta - \hat\Theta_n) + \Delta_n\Big\}\,d\theta. $$
Because $\delta \le \delta_1$, it follows that
$$ (1-\eta)f_\Theta(\theta_0)\,J_3 \le \frac{J_1}{f_{X_n|\Theta}(X_n|\hat\Theta_n)} \le (1+\eta)f_\Theta(\theta_0)\,J_3, \tag{7.93} $$
where
$$ J_3 = \int_{N_0(\delta)} \exp\Big\{-\frac12(\theta - \hat\Theta_n)^\top\Sigma_n^{-\frac12}\big(I_k - R_n(\theta,X_n)\big)\Sigma_n^{-\frac12}(\theta - \hat\Theta_n) + \Delta_n\Big\}\,d\theta. $$

It follows from (7.92) and the consistency of $\hat\Theta_n$ that the limit as $n \to \infty$
of $P_{\theta_0}^n$ of the intersection of $\{\Delta_n = 0\}$ with the following event is 1:
$$ \left\{ \int_{N_0(\delta)} \exp\left[-\frac{1+\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right]d\theta \le J_3 \right. $$
$$ \left. \le \int_{N_0(\delta)} \exp\left[-\frac{1-\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right]d\theta \right\}. $$


We can write the two integrals that bound $J_3$ above as
$$ \int_{N_0(\delta)} \exp\left\{-\frac{1\pm\eta}{2}(\theta - \hat\Theta_n)^\top\Sigma_n^{-1}(\theta - \hat\Theta_n)\right\}d\theta
= (2\pi)^{\frac k2}(1\pm\eta)^{-\frac k2}|\Sigma_n|^{\frac12}\,\Phi(C_n), $$
where $\Phi(C_n)$ is the probability that an $N_k(0, I_k)$ vector is in $C_n$, and
$$ C_n = \big\{t : \hat\Theta_n + (1\pm\eta)^{-\frac12}\Sigma_n^{\frac12}t \in N_0(\delta)\big\}. $$

By general regularity condition 5, $\Sigma_n^{\frac12}t = o_P(1)$ for all $t$, so $\Phi(C_n) \xrightarrow{P} 1$.
Hence,
$$ \lim_{n\to\infty} P_{\theta_0}^n\left[ \frac{(2\pi)^{\frac k2}|\Sigma_n|^{\frac12}}{(1+\eta)^{\frac k2}} \le J_3 \le \frac{(2\pi)^{\frac k2}|\Sigma_n|^{\frac12}}{(1-\eta)^{\frac k2}} \right] = 1. \tag{7.94} $$

By the way we chose $\eta$ related to $\epsilon$, we get from (7.93) and (7.94) that
$$ \lim_{n\to\infty} P_{\theta_0}^n\left[ (1-\epsilon)(2\pi)^{\frac k2}f_\Theta(\theta_0) \le \frac{J_1}{|\Sigma_n|^{\frac12}f_{X_n|\Theta}(X_n|\hat\Theta_n)} \le (1+\epsilon)(2\pi)^{\frac k2}f_\Theta(\theta_0) \right] = 1. $$
In other words,
$$ \frac{J_1}{|\Sigma_n|^{\frac12}f_{X_n|\Theta}(X_n|\hat\Theta_n)} \xrightarrow{P} (2\pi)^{\frac k2}f_\Theta(\theta_0). \tag{7.95} $$
Next, we show that
$$ \frac{J_2}{|\Sigma_n|^{\frac12}f_{X_n|\Theta}(X_n|\hat\Theta_n)} \xrightarrow{P} 0. \tag{7.96} $$

Using (7.90), we can write
$$ J_2 = f_{X_n|\Theta}(X_n|\hat\Theta_n)\exp[\ell_n(\theta_0) - \ell_n(\hat\Theta_n)]
\int_{\Omega\setminus N_0(\delta)} f_\Theta(\theta)\exp[\ell_n(\theta) - \ell_n(\theta_0)]\,d\theta. \tag{7.97} $$

Now, refer to general regularity condition 6. Since $\lambda_n \le |\Sigma_n|^{1/k}$, if $\theta \notin
N_0(\delta)$, then $\ell_n(\theta) - \ell_n(\theta_0) < -|\Sigma_n|^{-1/k}K(\delta)$ with probability tending to
1. Hence, the integral on the right-hand side of (7.97) is less than
$$ \exp\big[-|\Sigma_n|^{-\frac1k}K(\delta)\big] $$
with probability tending to 1. Since $\hat\Theta_n$ is an MLE, $\exp[\ell_n(\theta_0) - \ell_n(\hat\Theta_n)] \le
1$, and general regularity condition 5 says
$$ \frac{\exp\big[-|\Sigma_n|^{-\frac1k}K(\delta)\big]}{|\Sigma_n|^{\frac12}} \xrightarrow{P} 0. $$

So (7.96) holds. Combining (7.95) with (7.96), we get
$$ \frac{f_{X_n}(X_n)}{|\Sigma_n|^{\frac12}f_{X_n|\Theta}(X_n|\hat\Theta_n)} \xrightarrow{P} (2\pi)^{\frac k2}f_\Theta(\theta_0). \tag{7.98} $$
Since $\hat\Theta_n$ is consistent, and the prior is continuous at $\theta_0$, we have that
$f_\Theta(\Sigma_n^{\frac12}\psi + \hat\Theta_n) \xrightarrow{P} f_\Theta(\theta_0)$ uniformly for $\psi$ in a compact set. It follows that
$$ |\Sigma_n|^{\frac12}\,\frac{f_{X_n|\Theta}(X_n|\hat\Theta_n)\,f_\Theta(\Sigma_n^{\frac12}\psi + \hat\Theta_n)}{f_{X_n}(X_n)} \xrightarrow{P} (2\pi)^{-\frac k2} $$
uniformly on compact sets.


To complete the proof, we need to show that the second fraction in (7.91)
converges in probability to $\exp(-\|\psi\|^2/2)$ uniformly on compact sets. From
(7.90), we get
$$ \frac{f_{X_n|\Theta}(X_n|\Sigma_n^{\frac12}\psi + \hat\Theta_n)}{f_{X_n|\Theta}(X_n|\hat\Theta_n)}
= \exp\Big\{-\frac12\psi^\top\big(I_k - R_n(\Sigma_n^{\frac12}\psi + \hat\Theta_n, X_n)\big)\psi + \Delta_n\Big\}. $$
Let $\eta, \epsilon > 0$, and let $B$ be a compact subset of $\mathbb{R}^k$. Let $b$ be a bound on
$\|\psi\|^2$ for $\psi \in B$. By general regularity condition 7, there exist $\delta$ and $M$
such that $n \ge M$ implies
$$ P_{\theta_0}^n\left( \sup_{\theta\in N_0(\delta),\,\|\psi\|=1} \Big|1 + \psi^\top\Sigma_n^{\frac12}\ell_n''(\theta)\Sigma_n^{\frac12}\psi\Big| < \frac{\eta}{b} \right) > 1 - \frac\epsilon2. $$
Let $N \ge M$ be large enough so that $n \ge N$ implies
$$ P_{\theta_0}^n\left( \Sigma_n^{\frac12}\psi + \hat\Theta_n \in N_0(\delta), \text{ for all } \psi \in B \right) > 1 - \frac\epsilon2. $$

Then, if $n \ge N$,
$$ P_{\theta_0}^n\left( \Big|\psi^\top\big(I_k - R_n(\Sigma_n^{\frac12}\psi + \hat\Theta_n, X_n)\big)\psi - \|\psi\|^2\Big| < \eta \text{ for all } \psi \in B \right) > 1 - \epsilon. $$
Since $P_{\theta_0}^n(\Delta_n = 0, \text{ for all } \psi) \to 1$, it follows that the second fraction in
(7.91) is between $\exp(-\eta)\exp(-\|\psi\|^2/2)$ and $\exp(\eta)\exp(-\|\psi\|^2/2)$ with
probability tending to 1, uniformly on compact sets. Since $\eta$ is arbitrary,
the desired result follows. $\Box$
We now give two examples in which $X_n$ does not consist of conditionally
IID coordinates, but the general regularity conditions still hold.
Example 7.99. Let $\Omega$ be the interval $(-1,1)$. Let $\{Z_n\}_{n=1}^\infty$ be IID $N(0,1)$ and
let $Y_0 = 0$. Define $Y_n = \theta Y_{n-1} + Z_n$ for $n = 1,2,\dots$. The sequence $\{Y_n\}_{n=1}^\infty$
is called a first-order autoregressive process. The $Y_i$ are clearly not conditionally
IID given $\Theta = \theta$ except for the case $\theta = 0$. Let $X_n = (Y_1,\dots,Y_n)$. Then $\ell_n(\theta)$
is a constant plus $-\big(Y_n^2 + (1+\theta^2)\sum_{i=1}^{n-1}Y_i^2 - 2\theta\sum_{i=1}^n Y_iY_{i-1}\big)/2$. The MLE is
easily calculated as $\hat\Theta_n = \sum_{i=1}^n Y_iY_{i-1}\big/\sum_{i=1}^n Y_{i-1}^2$. The first four general regularity
conditions are trivially satisfied if the prior has a continuous density. Also,
$\ell_n''(\theta) = -\sum_{i=1}^n Y_{i-1}^2$. Since $\ell_n''$ does not depend on $\theta$, general regularity condition
7 is satisfied. Since
$$ \mathrm{Cov}_\theta\big(Y_i^2, Y_{i-k}^2\big) = 2\theta^{2k}\,\mathrm{Var}_\theta(Y_{i-k})^2 \le \frac{2\theta^{2k}}{(1-\theta^2)^2}, $$
it is easy to show that $\lim_{n\to\infty}\mathrm{Var}_\theta\big(\sum_{i=1}^n Y_{i-1}^2/n\big) = 0$. Hence $\Sigma_n \xrightarrow{P} 0$ and $n\Sigma_n$
converges in probability. Thus, general regularity condition 5 holds. Finally, note
that, given $\Theta = \theta_0$, $\sum_{i=1}^n Y_iY_{i-1}/n$ converges in probability. Since
$$ \ell_n(\theta) - \ell_n(\theta_0) = -\frac{\theta - \theta_0}{2}\left( (\theta + \theta_0)\sum_{i=1}^{n-1} Y_i^2 - 2\sum_{i=1}^n Y_iY_{i-1} \right), $$
it follows that general regularity condition 6 holds.
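The quantities in Example 7.99 can be computed directly. The following is an illustrative sketch, not part of the text: it simulates a first-order autoregressive process and evaluates the MLE $\hat\Theta_n = \sum Y_iY_{i-1}/\sum Y_{i-1}^2$ together with the observed information $\Sigma_n^{-1} = \sum Y_{i-1}^2$.

```python
import random

random.seed(2)
THETA = 0.6    # true autoregressive coefficient (an assumed value)
N = 2000

y_prev = 0.0   # Y_0 = 0
num = den = 0.0
for _ in range(N):
    y = THETA * y_prev + random.gauss(0.0, 1.0)
    num += y * y_prev    # accumulates sum Y_i * Y_{i-1}
    den += y_prev ** 2   # accumulates sum Y_{i-1}^2, the observed information
    y_prev = y

mle = num / den
sigma_n = 1.0 / den      # Sigma_n shrinks like 1/n, as condition 5 requires
```

The asymptotic normality results then suggest $N(\mathrm{mle}, \Sigma_n)$ as an approximate posterior for $\Theta$ under a smooth prior.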


Example 7.100. Let $Y_1, Y_2, \dots$ be conditionally independent given $\Theta = \theta$ with
$Y_i \sim N(\theta, i)$. Let $X_n = (Y_1,\dots,Y_n)$. The logarithm of the likelihood is a constant
plus $\ell_n(\theta) = -\big(\sum_{i=1}^n[\log(i) + (Y_i - \theta)^2/i]\big)/2$. The MLE is easily seen to be
$\hat\Theta_n = \big(\sum_{i=1}^n Y_i/i\big)\big/\big(\sum_{i=1}^n 1/i\big)$. The second derivative of $\ell_n(\theta)$ is $-\sum_{i=1}^n 1/i$,
which does not depend on $\theta$, so general regularity condition 7 holds. Since $\Sigma_n =
1/\big[\sum_{i=1}^n 1/i\big] = O(1/\log(n))$, general regularity condition 5 holds. Since
$$ \lambda_n\big[\ell_n(\theta) - \ell_n(\theta_0)\big] = \frac{\theta - \theta_0}{\sum_{i=1}^n \frac1i}\left[\sum_{i=1}^n \frac1i\left(Y_i - \frac{\theta + \theta_0}{2}\right)\right], $$
and
$$ \hat\Theta_n = \sum_{i=1}^n \frac{Y_i}{i}\Big/\sum_{i=1}^n \frac1i \sim N\left(\theta_0, \Big[\sum_{i=1}^n \frac1i\Big]^{-1}\right), $$
it follows that $\lambda_n[\ell_n(\theta) - \ell_n(\theta_0)] \xrightarrow{P} -(\theta - \theta_0)^2/2$, so general regularity condition 6
holds. The first four general regularity conditions hold if the prior is continuous.
Note that, in this example, the MLE is not $\sqrt{n}$-consistent, but the posterior
distribution is still asymptotically normal.
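Example 7.100 can likewise be sketched numerically (an illustration, not part of the text). Note how slowly the information $\sum_{i=1}^n 1/i \approx \log n$ accumulates: even with $10^5$ observations the posterior variance is close to $1/\log(10^5) \approx 0.09$.

```python
import math
import random

random.seed(3)
THETA = 1.5    # true mean (an assumed value)
N = 100_000

num = den = 0.0
for i in range(1, N + 1):
    y = random.gauss(THETA, math.sqrt(i))   # Y_i ~ N(theta, i)
    num += y / i
    den += 1.0 / i                          # harmonic sum, roughly log N

mle = num / den          # the precision-weighted mean
post_var = 1.0 / den     # Sigma_n = O(1 / log n): very slow concentration
```

The estimator is consistent but far from $\sqrt{n}$-consistent, matching the closing remark of the example.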

7.4.2.2 Posterior Probabilities


We have proven that the sequence of posterior densities of $\Psi_n$ converges in
probability uniformly on compact sets to the $N_k(0, I_k)$ density. This makes
it easy to conclude that posterior probabilities converge in probability as
well.
Theorem 7.101. Assume the general regularity conditions, and let $\hat\Theta_n$ be
an MLE of $\Theta$. Let $B \subseteq \mathbb{R}^k$ be a Borel set. Define $\ell_n''$ as in (7.87), let $\Sigma_n$ be
defined by (7.88), and let $\Psi_n = \Sigma_n^{-1/2}(\Theta - \hat\Theta_n)$. Then $\Pr(\Psi_n \in B|X_n) \xrightarrow{P}
\Phi(B)$, under $P_{\theta_0}$, where $\Phi(B)$ stands for the probability that an $N_k(0, I_k)$
vector lies in $B$.
PROOF. First, suppose that $B$ is a subset of a compact set, and let $c$ be
the Lebesgue measure of $B$. Then
$$ |\Pr(\Psi_n \in B|X_n) - \Phi(B)| \le \int_B \big|f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi)\big|\,d\psi
\le c\,\sup_{\psi\in B}\big|f_{\Psi_n|X_n}(\psi|X_n) - \phi(\psi)\big|. $$
This goes to 0 in probability by Theorem 7.89.


Next, let $B$ be an arbitrary Borel set, and let $\epsilon > 0$ be given. Let $B_\epsilon$ be
a compact set such that $\Phi(B_\epsilon) > 1 - \epsilon/3$. Let $N$ be large enough so that
$n \ge N$ implies both of the following:
$$ P_{\theta_0}\left( |\Pr(\Psi_n \in B_\epsilon|X_n) - \Phi(B_\epsilon)| > \frac\epsilon3 \right) < \frac\epsilon2, $$
$$ P_{\theta_0}\left( |\Pr(\Psi_n \in B\cap B_\epsilon|X_n) - \Phi(B\cap B_\epsilon)| > \frac\epsilon3 \right) < \frac\epsilon2. $$
We know that
$$ |\Pr(\Psi_n \in B|X_n) - \Phi(B)| \le |\Pr(\Psi_n \in B\cap B_\epsilon|X_n) - \Phi(B\cap B_\epsilon)|
+ \Pr(\Psi_n \in B_\epsilon^C|X_n) + \Phi(B_\epsilon^C). $$
So, if $n \ge N$, the right-hand side is less than $4\epsilon/3$ with probability at least
$1 - \epsilon$, and since $\epsilon$ was arbitrary, the result follows. $\Box$

7.4.2.3 Conditionally lID Random Quantities


Walker (1969) proves a result like Theorem 7.101 in the case of conditionally
IID random quantities. He gives a long list of regularity conditions and then
proves that they imply the general regularity conditions. These conditions
are very similar to those of the Cramér-Rao inequality and those used to
prove asymptotic normality of MLEs. Rather than repeat those conditions
here, we will use the conditions already stated elsewhere in this book.

Theorem 7.102. Let $\{Y_n\}_{n=1}^\infty$ be conditionally IID given $\Theta$, and let $X_n =
(Y_1,\dots,Y_n)$. Suppose that (7.64) and the conditions of either Theorem 7.49
or Lemma 7.54³⁰ hold for $\{Y_n\}_{n=1}^\infty$, and that the first four general regularity
conditions hold. Also, suppose that the Fisher information $\mathcal{I}_{X_1}(\theta_0)$ is
positive definite. Then the remaining general regularity conditions hold.
PROOF. That general regularity condition 5 holds follows from the fact
that $n\Sigma_n \xrightarrow{P} \mathcal{I}_{X_1}(\theta_0)^{-1}$. For general regularity condition 6, let $\delta > 0$ and
let $Z(\cdot,\cdot)$ be as in Theorem 7.49. Then
$$ \sup_{\theta\notin N_0(\delta)}\big[\ell_n(\theta) - \ell_n(\theta_0)\big]
= -\min\Big\{ \inf_{\theta\in\Omega_1}\big[\ell_n(\theta_0) - \ell_n(\theta)\big], \dots, \inf_{\theta\in\Omega_m}\big[\ell_n(\theta_0) - \ell_n(\theta)\big] \Big\} \tag{7.103} $$
$$ \le -\min\Big\{ \sum_{i=1}^n Z(\Omega_1,X_i), \dots, \sum_{i=1}^n Z(\Omega_m,X_i) \Big\}, $$
where $\Omega_1,\dots,\Omega_m$ are as in the proof of Theorem 7.49. Since
$$ \frac1n\sum_{i=1}^n Z(\Omega_j,X_i) \xrightarrow{\text{a.s.}} \mathrm{E}_{\theta_0}Z(\Omega_j,X_i) = c_j, $$
and these means are all positive for $j = 1,\dots,m$, while $1/\lambda_n = O_P(n)$
(since $n\Sigma_n$ converges in probability to the positive definite matrix $\mathcal{I}_{X_1}(\theta_0)^{-1}$),
it follows that a sufficiently small choice of $K(\delta)$ makes general regularity
condition 6 hold. For general regularity condition 7, let $\epsilon > 0$ and let $\delta$ be small
enough so that $\mathrm{E}_{\theta_0}H_\delta(Y_i,\theta_0) < \epsilon/(\mu + \epsilon)$, where $H_\delta$ comes from (7.64)
and $\mu$ is the largest eigenvalue of $\mathcal{I}_{X_1}(\theta_0)$. Let $\mu_n$ stand for the largest
eigenvalue of $\Sigma_n$. For $\theta \in N_0(\delta)$, we have
$$ \sup_{\|\psi\|=1}\Big|\psi^\top\Sigma_n^{\frac12}\big[\Sigma_n^{-1} + \ell_n''(\theta)\big]\Sigma_n^{\frac12}\psi\Big|
\le \mu_n \sup_{\|\psi\|=1}\Big|\psi^\top\big[\Sigma_n^{-1} + \ell_n''(\theta)\big]\psi\Big|. $$
³⁰Note that the condition of Lemma 7.83 could be used instead.



If $\hat\Theta_n \in N_0(\delta)$ and $|\mu - n\mu_n| < \epsilon$, it follows from (7.64) that the last
expression above is no greater than
$$ (\mu + \epsilon)\,\frac1n\sum_{i=1}^n H_\delta(Y_i,\theta_0). $$
By the weak law of large numbers B.95 and our choice of $\delta$, this last expression
converges in probability to something no greater than $\epsilon$. This implies
general regularity condition 7. $\Box$
To make the theorems of this section apply to prior probabilities not
conditional on the parameter, suppose that the prior distribution satisfies
$\Pr(\Theta \in \operatorname{int}(\Omega)) = 1$. We can now apply the result from Problem 6 on
page 468 to conclude that the prior probability that the posterior after $n$
observations will be within $\epsilon$ of the normal approximation goes to 1 as $n$
goes to infinity.
Example 7.104. Suppose that $X_1,\ldots,X_{10}$ are conditionally IID given $\Theta = \theta$ with $\mathrm{Cau}(\theta,1)$ distribution. Suppose that the observations are
$$-5,\ -3,\ 0,\ 2,\ 4,\ 5,\ 7,\ 9,\ 11,\ 14.$$
Then the MLE of $\Theta$ is $\hat{\theta}_{10} = 4.531$, and $\ell_{10}''(4.531) = -1.23116$. So, Theorems 7.101 and 7.102 suggest that $N(4.531, 0.813)$ is approximately the distribution of $\Theta$. To see how good this approximation is, look at Figure 7.105. The solid line is an approximation to the posterior obtained by numerically integrating the likelihood times the prior (trapezoidal rule from $\theta = -1$ to $\theta = 11$). The dotted line is the normal approximation. The functions are not particularly similar. For example, the normal approximation to $\Pr(\Theta \geq 5|X = x)$ is 0.3015, while the numerical integral under the posterior curve is 0.3560.
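The numbers quoted in this example can be reproduced directly. The sketch below (data and target values taken from the example; the bisection root-finder and finite-difference step sizes are our own choices) solves the likelihood equation for the Cauchy location MLE and evaluates the normal approximation to $\Pr(\Theta \geq 5|X=x)$.

```python
# Reproducing Example 7.104: Cauchy location MLE and its normal approximation.
import math

data = [-5, -3, 0, 2, 4, 5, 7, 9, 11, 14]

def score(theta):
    # derivative of the Cau(theta, 1) log-likelihood
    return sum(2 * (x - theta) / (1 + (x - theta) ** 2) for x in data)

# Bisection for a root of the score; the score is positive at 4 and
# negative at 5, so a root lies between them.
lo, hi = 4.0, 5.0
for _ in range(60):
    mid = (lo + hi) / 2
    if score(mid) > 0:
        lo = mid
    else:
        hi = mid
mle = (lo + hi) / 2            # approximately 4.531

def obs_info(theta):
    # minus the second derivative of the log-likelihood (finite differences)
    h = 1e-5
    return -(score(theta + h) - score(theta - h)) / (2 * h)

var = 1 / obs_info(mle)        # approximately 0.813

# Normal approximation to Pr(Theta >= 5 | X = x).
z = (5 - mle) / math.sqrt(var)
tail = 0.5 * math.erfc(z / math.sqrt(2))   # approximately 0.3015
print(mle, var, tail)
```

Bisection suffices here because the score changes sign exactly once between 4 and 5 for these data; in general the Cauchy likelihood equation can have several roots and each would need to be checked.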

FIGURE 7.105. Posterior Density for $\Theta$ in Cauchy Example. (Solid line: numerical integration; dotted line: normal approximation.)


7.4. Large Sample Properties of Posterior Distributions 445

7.4.2.4 Loss of Information*

For the case of conditionally IID random quantities, we can try to answer the question "How much information do we lose by not knowing $\Theta$?" The Kullback-Leibler information for comparing the distribution of $X^n$ given $\Theta = \theta_0$ to the prior predictive distribution is $\mathrm{E}_{\theta_0}\log[f_{X^n|\Theta}(X^n|\theta_0)/f_{X^n}(X^n)]$. Alternatively, we can examine $\log[f_{X^n|\Theta}(X^n|\theta_0)/f_{X^n}(X^n)]$ and see how it behaves for large $n$.

Theorem 7.106. Assume the conditions of Theorem 7.102. Then, given $\Theta = \theta_0$,

$$-2\log\frac{f_{X^n|\Theta}(X^n|\theta_0)}{f_{X^n}(X^n)} + k\log\left(\frac{n}{2\pi}\right) - 2\log f_\Theta(\theta_0) + \log\left|\mathcal{I}_{X_1}(\theta_0)\right| \stackrel{\mathcal{D}}{\rightarrow} \chi^2_k, \quad (7.107)$$

where $\mathcal{I}_{X_1}(\theta_0)$ is the Fisher information matrix for one observation.
PROOF. We have assumed enough conditions to conclude that the general regularity conditions hold and that $\Sigma_n^{-1/2}(\hat{\theta}_n - \theta_0) \stackrel{\mathcal{D}}{\rightarrow} N_k(0, I_k)$, given $\Theta = \theta_0$. Hence, we can use whatever steps we wish from the proof of Theorem 7.89. As in (7.90), we can write

$$\log f_{X^n|\Theta}(X^n|\theta_0) = \log f_{X^n|\Theta}(X^n|\hat{\theta}_n) - \frac{1}{2}(\theta_0 - \hat{\theta}_n)^{\top}\Sigma_n^{-1/2}\left(I_k - R_n(\theta_0, X^n)\right)\Sigma_n^{-1/2}(\theta_0 - \hat{\theta}_n).$$

Using (7.92) and the consistency of the MLE, we conclude that the ratio of $\log[f_{X^n|\Theta}(X^n|\theta_0)/f_{X^n|\Theta}(X^n|\hat{\theta}_n)]$ to $-\frac{1}{2}(\theta_0-\hat{\theta}_n)^{\top}\Sigma_n^{-1}(\theta_0-\hat{\theta}_n)$ converges in probability to 1, given $\Theta = \theta_0$. Since the denominator of this expression converges in distribution to $-.5$ times $\chi^2_k$, it follows that the numerator does also. It follows from (7.98) that

$$\frac{f_{X^n}(X^n)}{(2\pi)^{k/2}|\Sigma_n|^{1/2} f_\Theta(\theta_0) f_{X^n|\Theta}(X^n|\hat{\theta}_n)} \stackrel{P}{\rightarrow} 1.$$

Since the determinant is a continuous function of a matrix, and $n\Sigma_n \stackrel{P}{\rightarrow} \mathcal{I}_{X_1}(\theta_0)^{-1}$, we get

$$\log|\Sigma_n| + k\log n \stackrel{P}{\rightarrow} -\log\left|\mathcal{I}_{X_1}(\theta_0)\right|.$$

The conclusion now follows. $\Box$

*This section may be skipped without interrupting the flow of ideas.


Theorem 7.106 says that the amount of Kullback-Leibler information lost by not knowing $\Theta$, when observing $n$ conditionally IID random variables, tends to be about $k\log(n)/2$, for all continuous priors. The effect of the prior distribution is of a lower order of magnitude. Notice that choosing $f_\Theta(\theta) = c|\mathcal{I}_{X_1}(\theta)|^{1/2}$ makes the last two terms on the left-hand side of (7.107) cancel. Suppose that $|\mathcal{I}_{X_1}(\theta)|^{1/2}$ is integrable with respect to Lebesgue measure. In that case, $c|\mathcal{I}_{X_1}(\theta)|^{1/2}$ would then be Jeffreys' prior (for some $c$) as described in Section 2.3.4. Note also that

$$\int \log\left[\frac{f_\Theta(\theta)}{c|\mathcal{I}_{X_1}(\theta)|^{1/2}}\right] f_\Theta(\theta)\,d\theta \geq 0,$$

for every prior $f_\Theta$, with equality only when $f_\Theta$ is Jeffreys' prior. It follows that if Jeffreys' prior describes someone's beliefs, then that person believes that his or her predictive density for the data $f_{X^n}(X^n)$ will (asymptotically) be smaller, relative to $f_{X^n|\Theta}(X^n|\theta_0)$, than that of someone who believes any other prior. An alternative way to say this is that a person believing Jeffreys' prior thinks that he or she has more to learn about $\Theta$ from the data than does someone believing a different prior. This informal description can be made more rigorous, as Clarke and Barron (1994) do. They consider a decision problem in which the action space is the set of continuous prior densities, and the loss is

$$L(\theta, f) = \int f_{X^n|\Theta}(x^n|\theta)\log\frac{f_{X^n|\Theta}(x^n|\theta)}{f_{X^n}(x^n)}\,d\nu_n(x^n),$$

where $f_{X^n}$ is the prior predictive density computed from the prior $f$. Note that this loss is precisely the Kullback-Leibler information for comparing the distribution of $X^n$ given $\Theta = \theta$ to the prior predictive distribution. They show that Jeffreys' prior is asymptotically least favorable in this decision problem.

7.4.3 Laplace Approximations to Posterior Distributions


In Section 7.4.2, we calculated the asymptotic distribution, given $\Theta = \theta$, of the integral of the posterior density over some set. Sometimes, one is only interested in the value of such an integral. The method of Laplace gives us a way to calculate approximations to such integrals together with an order of magnitude of the error. This discussion is a hybrid of the papers by Tierney and Kadane (1986) and Kass, Tierney, and Kadane (1990).

Suppose that we are interested in the posterior mean of a positive function $g$ of $\Theta$. For example, if $g(\theta) = f_{X_1|\Theta}(y|\theta)$ for some fixed value $y$, then $\mathrm{E}(g(\Theta)|X = x)$ is the predictive density of a future observation. In general, we can write

$$\mathrm{E}(g(\Theta)|X = x) = \frac{\int g(\theta) f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta}{\int f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta}.$$

The method of Laplace provides approximations to each of these integrals for specific values of $x$. Some conditions and notation are needed to state the approximations precisely.

*This section may be skipped without interrupting the flow of ideas.
Theorem 7.108. For each $n$, let $(\mathcal{X}_n, \mathcal{B}_n)$ be a Borel space, and let $X_n$ be a random quantity taking values in $\mathcal{X}_n$. Let $X^n = (X_1,\ldots,X_n)$ and let $(\mathcal{X}^n, \mathcal{B}^n)$ be the product space of $\mathcal{X}_1,\ldots,\mathcal{X}_n$. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family of distributions for $\{X_n\}_{n=1}^\infty$ with $\Omega \subseteq \mathbb{R}$. Suppose that the distribution of $X^n$ given $\Theta = \theta$ is absolutely continuous with respect to a measure $\nu_n$ on $(\mathcal{X}^n, \mathcal{B}^n)$ for all $n$ with density $f_{X^n|\Theta}(\cdot|\theta)$. Let $g : \Omega \rightarrow \mathbb{R}^+$ be a function. Let $f_\Theta(\theta)$ be the prior density of $\Theta$ with respect to Lebesgue measure. Assume that $f_{X^n|\Theta}(x^n|\theta)$ for all $n$ and $x^n \in \mathcal{X}^n$, $g(\theta)$, and $f_\Theta(\theta)$ are all continuously differentiable with respect to $\theta$ six times. Assume that $\int g(\theta)f_\Theta(\theta)\,d\theta < \infty$. Define

$$\ell_n(\theta; x^n) = \log f_{X^n|\Theta}(x^n|\theta), \qquad H_n(\theta; x^n) = \frac{1}{n}\left[\ell_n(\theta; x^n) + \log f_\Theta(\theta)\right],$$
$$H_n^*(\theta; x^n) = H_n(\theta; x^n) + \frac{1}{n}\log g(\theta).$$

Now let $\mathcal{Y} = \prod_{n=1}^\infty \mathcal{X}_n$ and define the set $A \subseteq \mathcal{Y}$ as the set of all $x = (x_1, x_2, x_3, \ldots) \in \mathcal{Y}$ with the following properties:

The integrals $\int g(\theta)f_\Theta(\theta)f_{X^n|\Theta}(x^n|\theta)\,d\theta$ and $\int f_\Theta(\theta)f_{X^n|\Theta}(x^n|\theta)\,d\theta$ are finite for all $n$.

$\ell_n$ achieves its maximum at a point $\theta_n'(x^n)$ for each $n$.

For each $n$, $H_n$ and $H_n^*$ achieve their maxima at points $\hat{\theta}_n(x^n)$ and $\theta_n^*(x^n)$, respectively, where the first derivatives are zero.

$\hat{\theta}_n(x^n)$ and $\theta_n^*(x^n)$ converge as $n \rightarrow \infty$.

The second derivatives of $H_n$ and $H_n^*$ at their maxima converge to negative numbers.

There exist $\delta_0, N_0, M > 0$ such that for all $n \geq N_0$, the absolute values of $H_n$, $H_n^*$ and their first six derivatives are all bounded by $M$ for $|\theta - \hat{\theta}_n(x^n)| \leq 2\delta_0$.

For every $\delta > 0$,

$$\limsup_{n\rightarrow\infty}\frac{1}{n}\sup_{|\theta - \theta_n'(x^n)| > \delta}\left[\ell_n(\theta; x^n) - \ell_n(\theta_n'(x^n); x^n)\right] < 0. \quad (7.109)$$

For each $(x_1, x_2, x_3, \ldots) \in A$, define

$$\sigma_n^2(x^n) = -\frac{1}{H_n''(\hat{\theta}_n(x^n); x^n)}, \qquad \sigma_n^{*2}(x^n) = -\frac{1}{H_n^{*\prime\prime}(\theta_n^*(x^n); x^n)}.$$

For each $(x_1, x_2, x_3, \ldots) \in A$, $\mathrm{E}(g(\Theta)|X^n = x^n)$ equals

$$\frac{\sigma_n^*(x^n)}{\sigma_n(x^n)}\exp\left(n\left[H_n^*(\theta_n^*(x^n); x^n) - H_n(\hat{\theta}_n(x^n); x^n)\right]\right)\left[1 + O(n^{-2})\right].$$

PROOF. Since virtually everything depends on $n$ and $x^n$ in the statement of the theorem, we will simplify the notation by not explicitly expressing that dependence. For example, $H(\hat{\theta})$ will stand for $H_n(\hat{\theta}_n(x^n); x^n)$. Now, let $(x_1, x_2, \ldots) \in A$ and write

$$\mathrm{E}(g(\Theta)|X^n = x^n) = \frac{\int_\Omega \exp(nH^*(\theta))\,d\theta}{\int_\Omega \exp(nH(\theta))\,d\theta}. \quad (7.110)$$

We assumed that $\hat{\theta}$ and $\theta^*$ both converge. We now show that they converge to the same thing and that so does $\theta'$. Suppose that $\hat{\theta}$ converges to $\theta_0$ and $\theta^*$ converges to $\theta_1$. If there exists $\delta > 0$ such that $|\theta' - \hat{\theta}| > \delta$ for infinitely many $n$, then (7.109) says that there is $\eta > 0$ such that $\frac{1}{n}\ell(\hat{\theta}) < \frac{1}{n}\ell(\theta') - \eta$ infinitely often. Since $H(\theta) = \frac{1}{n}\ell(\theta) + O(n^{-1})$, and similarly at $\theta'$, it follows that $H(\hat{\theta}) < H(\theta') - \eta/2$ infinitely often. This contradicts the definition of $\hat{\theta}$ as the location of the maximum of $H$ for all $n$. It follows that, for each $\delta > 0$, there exists $N$ such that $n \geq N$ implies $|\theta' - \hat{\theta}| < \delta$. Hence $\theta'$ converges to $\theta_0$ also. A similar argument shows that $\theta'$ converges to $\theta_1$, hence $\theta_0 = \theta_1$. Let $\Omega' = \Omega \cap (\theta_0 - \delta_0, \theta_0 + \delta_0)$. Since $\exp(nH(\theta)) = \exp(\ell(\theta))f_\Theta(\theta)$, condition (7.109) implies that $\int_{\Omega\setminus\Omega'}\exp(nH(\theta))\,d\theta$ and $\int_{\Omega\setminus\Omega'}\exp(nH^*(\theta))\,d\theta$ are exponentially small. For this reason, we can replace $\Omega$ by $\Omega'$ in (7.110) without incurring an error larger than $O(n^{-2})$.

If we expand $H(\theta)$ in a Taylor series (see Theorem C.1) around $\theta = \hat{\theta}$, we get

$$H(\theta) = H(\hat{\theta}) + (\theta-\hat{\theta})H'(\hat{\theta}) + \frac{1}{2}(\theta-\hat{\theta})^2 H''(\hat{\theta}) + \frac{1}{6}(\theta-\hat{\theta})^3 H'''(\hat{\theta}) + \frac{1}{24}(\theta-\hat{\theta})^4 H^{(iv)}(\hat{\theta}) + \frac{1}{120}(\theta-\hat{\theta})^5 H^{(v)}(\hat{\theta}) + \frac{1}{720}(\theta-\hat{\theta})^6 H^{(vi)}(\bar{\theta}),$$

where $\bar{\theta}$ is between $\theta$ and $\hat{\theta}$, and $H^{(iv)}$, $H^{(v)}$, and $H^{(vi)}$ respectively stand for the fourth, fifth, and sixth derivatives of $H$. Use the Taylor series of $\exp(x)$ around $x = 0$ and the fact that $H'(\hat{\theta}) = 0$ to write

$$\exp(nH(\theta)) = \exp(nH(\hat{\theta}))\exp\left(\frac{n}{2}(\theta-\hat{\theta})^2 H''(\hat{\theta})\right)\times\left[1 + \frac{n}{6}(\theta-\hat{\theta})^3 H'''(\hat{\theta}) + \frac{n}{24}(\theta-\hat{\theta})^4 H^{(iv)}(\hat{\theta}) + \frac{n}{120}(\theta-\hat{\theta})^5 H^{(v)}(\hat{\theta}) + \frac{n^2}{72}(\theta-\hat{\theta})^6 H'''(\hat{\theta})^2 + R_n(\theta)\right],$$

where

$$\int_{\Omega'} R_n(\theta)\exp\left(\frac{n}{2}(\theta-\hat{\theta})^2 H''(\hat{\theta})\right)d\theta = O(n^{-2})$$

as $n \rightarrow \infty$, because $R_n(\theta)$ is bounded on the bounded set $\Omega'$. We can also show that

$$\int_{\Omega'}(\theta-\hat{\theta})^k\exp\left(-\frac{n}{2\sigma^2}(\theta-\hat{\theta})^2\right)d\theta = O(n^{-2}),$$

for all odd $k$. (In fact, these last integrals are exponentially small.) This implies that

$$\int_\Omega \exp(nH(\theta))\,d\theta = \exp(nH(\hat{\theta}))\int_{\Omega'}\exp\left(-\frac{n}{2\sigma^2}(\theta-\hat{\theta})^2\right)\left[1 + \frac{n}{24}(\theta-\hat{\theta})^4 H^{(iv)}(\hat{\theta}) + \frac{n^2}{72}(\theta-\hat{\theta})^6 H'''(\hat{\theta})^2\right]d\theta + O(n^{-2})$$
$$= \sqrt{2\pi}\frac{\sigma}{\sqrt{n}}\exp(nH(\hat{\theta}))\left[1 + \frac{\sigma^4}{8n}H^{(iv)}(\hat{\theta}) + \frac{5\sigma^6}{24n}H'''(\hat{\theta})^2 + O(n^{-2})\right]. \quad (7.111)$$

A similar argument shows that

$$\int_\Omega \exp(nH^*(\theta))\,d\theta = \sqrt{2\pi}\frac{\sigma^*}{\sqrt{n}}\exp(nH^*(\theta^*))\left[1 + \frac{\sigma^{*4}}{8n}H^{*(iv)}(\theta^*) + \frac{5\sigma^{*6}}{24n}H^{*\prime\prime\prime}(\theta^*)^2 + O(n^{-2})\right]. \quad (7.112)$$

Next, we prove that $\hat{\theta}$ and $\theta^*$ differ by $O(n^{-1})$. Since $H^*$ has 0 derivative at $\theta^*$, we can write

$$0 = H^{*\prime}(\theta^*) = H'(\theta^*) + O(n^{-1}) = H'(\hat{\theta}) + (\theta^*-\hat{\theta})H''(\hat{\theta}) + O(n^{-1}) + o(\hat{\theta}-\theta^*) = -\frac{1}{\sigma^2}(\theta^*-\hat{\theta}) + O(n^{-1}) + o(\hat{\theta}-\theta^*).$$

It follows that $\theta^* - \hat{\theta} = O(n^{-1}) + o(\hat{\theta}-\theta^*)$, and hence that $\theta^* - \hat{\theta} = O(n^{-1})$. It also follows that the $k$th derivative of $H^*$ at $\theta^*$ differs from the $k$th derivative of $H$ at $\hat{\theta}$ by $O(n^{-1})$. In particular, $\sigma^{*2} = \sigma^2 + O(n^{-1})$. Now, take the ratio of (7.112) to (7.111) to get that $\mathrm{E}(g(\Theta)|X^n = x^n)$ equals

$$\frac{\sigma^*}{\sigma}\exp\left(n\left[H^*(\theta^*) - H(\hat{\theta})\right]\right)\frac{1 + \frac{\sigma^{*4}}{8n}H^{*(iv)}(\theta^*) + \frac{5\sigma^{*6}}{24n}H^{*\prime\prime\prime}(\theta^*)^2 + O(n^{-2})}{1 + \frac{\sigma^4}{8n}H^{(iv)}(\hat{\theta}) + \frac{5\sigma^6}{24n}H'''(\hat{\theta})^2 + O(n^{-2})} = \frac{\sigma^*}{\sigma}\exp\left(n\left[H^*(\theta^*) - H(\hat{\theta})\right]\right)\left[1 + O(n^{-2})\right]. \qquad \Box$$

Theorem 7.108 makes claims about the conditional distribution of $\Theta$ given $X^n = x^n$ for a sequence of $x^n$ values having certain properties (namely, being in the set $A$). Of course, we will only ever get to observe the beginning of such a sequence. If our model says that the set $A$ has probability 1, then we might feel comfortable believing that the unobserved tail of the sequence will continue to produce a point in $A$. All that is required (in addition to the conditions of Theorem 7.108) is that the Fisher information from one observation be positive and finite for all $\theta$ and that the MLE be interior to $\Omega$ with probability tending to 1.
Example 7.113. Let $\{X_n\}_{n=1}^\infty$ be a sequence of IID $N(\mu,\sigma^2)$ random variables conditional on $(M,\Sigma,\Lambda) = (\mu,\sigma,\lambda)$. Let the prior for $M$ given $\Sigma = \sigma$, $\Lambda = \lambda$ be $N(\mu_0, \sigma^2/\lambda)$. Let $\Sigma$ and $\Lambda$ be independent with $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ and $\Lambda \sim \mathrm{Exp}(c_0)$. Conditional on $\Lambda = \lambda_0$, this is precisely the same as the natural conjugate prior distribution, which was given in Example 1.24 on page 14. Hence, the conditional density of $X = (X_1,\ldots,X_n)$ given $\Lambda = \lambda$ is the same as the marginal density of the data in that example. As a function of $\lambda$, it is proportional to

$$\left(\frac{\lambda}{n+\lambda}\right)^{1/2}\left(b_0 + w + \frac{n\lambda}{n+\lambda}(\bar{x}-\mu_0)^2\right)^{-(a_0+n)/2},$$

where $w = \sum_{i=1}^n (x_i - \bar{x})^2$. Suppose that we want to calculate the predictive density of a future observation $Y$. Conditional on $\Lambda = \lambda$,

$$Y \sim t_{a_0+n}\left(\frac{\lambda\mu_0 + n\bar{x}}{\lambda+n},\ \frac{b_0 + w + \frac{n\lambda}{n+\lambda}(\bar{x}-\mu_0)^2}{a_0+n}\left[1 + \frac{1}{\lambda+n}\right]\right).$$

So, for each value of $y$, we can let $g(\lambda)$ be the density of $Y$ at $y$ and apply Laplace's method for many values of $y$.

As an example, suppose that we observe $n = 10$ observations with $\bar{x} = 14.7$ and $w = 52.2$. Suppose that the prior had $a_0 = 1 = b_0 = c_0$ and $\mu_0 = 10$. Then $\hat{\lambda} = 0.1598$ provides the maximum of the function $H$. For each value of $y$ between 0 and 30, say, we can let $g(\lambda)$ be the $t_{a_0+n}$ density as described above, and we get the plot in Figure 7.114. For comparison, Figure 7.114 includes the predictive density that would have been obtained had $\Pr(\Lambda = 1) = 1$ been assumed. (The prior mean of $\Lambda$ is 1.)
A naive alternative to the Laplace approximation is to use the MLE plug-in value $g(\hat{\theta})$ to approximate the posterior mean. First, note that $\exp(nH^*(\theta^*)) = g(\theta^*)\exp(nH(\theta^*))$. Since $\hat{\theta} - \theta^* = O(n^{-1})$, we get that $g(\theta^*) = g(\hat{\theta}) + O(n^{-1})$ and (as in the proof) $\sigma^{*2} = \sigma^2 + O(n^{-1})$. Combining these facts with the fact that $H(\theta^*) = H(\hat{\theta}) + O([\theta^* - \hat{\theta}]^2)$, we get that the difference between $g(\hat{\theta})$ and the Laplace approximation is $O(n^{-1})$. Since the Laplace approximation differs from $\mathrm{E}(g(\Theta)|X^n = x^n)$ by $O(n^{-2})$, we get that $g(\hat{\theta})$ differs from $\mathrm{E}(g(\Theta)|X^n = x^n)$ by $O(n^{-1})$. So, the Laplace approximation can be thought of as a higher-order correction to the use of the MLE as an approximation to $\mathrm{E}(g(\Theta)|X^n = x^n)$.
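A minimal sketch of the ratio approximation of Theorem 7.108, using the Cauchy data of Example 7.104 with the flat $N(0, 1000)$ prior of Example 7.115 and $g(\theta) = \exp(t\theta)$ for a small $t$. The grid limits, optimizer, and step sizes below are illustrative assumptions, not part of the text; the trapezoidal integral serves as a benchmark.

```python
# Laplace approximation to E(g(Theta) | X = x) versus numerical integration.
import math

data = [-5, -3, 0, 2, 4, 5, 7, 9, 11, 14]
t = 1.0e-5

def log_post(theta):          # log[likelihood * prior]; this is n*H(theta)
    ll = -sum(math.log(1 + (x - theta) ** 2) for x in data) \
         - len(data) * math.log(math.pi)
    lp = -0.5 * math.log(2 * math.pi * 1000) - theta ** 2 / 2000
    return ll + lp

def log_post_star(theta):     # n*H*(theta): adds log g(theta) = t*theta
    return log_post(theta) + t * theta

def argmax(f, lo=-20.0, hi=40.0):
    # crude grid search followed by ternary refinement
    grid = [lo + i * (hi - lo) / 2000 for i in range(2001)]
    a = max(grid, key=f)
    lo, hi = a - 0.1, a + 0.1
    for _ in range(200):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def second_deriv(f, x, h=1e-3):
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

th_hat, th_star = argmax(log_post), argmax(log_post_star)
# sigma*/sigma * exp(n[H*(theta*) - H(theta-hat)])
laplace = math.sqrt(second_deriv(log_post, th_hat) /
                    second_deriv(log_post_star, th_star)) \
          * math.exp(log_post_star(th_star) - log_post(th_hat))

# Trapezoidal-rule benchmark on (-20, 40), as in Example 7.115.
m = 6000
xs = [-20 + 60 * i / m for i in range(m + 1)]
ws = [math.exp(log_post(x)) for x in xs]
num = sum(0.5 * (ws[i] * math.exp(t * xs[i]) + ws[i+1] * math.exp(t * xs[i+1]))
          for i in range(m))
den = sum(0.5 * (ws[i] + ws[i+1]) for i in range(m))
numeric = num / den
print(laplace, numeric)
```

Both values are close to 1 for so small a $t$; the point of the sketch is that the two approximations agree to several decimal places even on a data set where the Laplace method is known to be delicate.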

FIGURE 7.114. Predictive Density for $Y$ in Example 7.113. (Legend: Laplace approximation; $\Lambda = 1$.)

The Laplace method is most useful in hierarchical models, which will be


discussed in more detail in Chapter 8. An example in which the Laplace
method does not work so well is Example 7.104.

Example 7.115 (Continuation of Example 7.104; see page 444). We have observed ten random variables with $\mathrm{Cau}(\theta,1)$ distribution given $\Theta = \theta$. Suppose that our prior distribution for $\Theta$ was very flat, say $N(0, 1000)$. We are now in position to approximate the mean of any positive function of $\Theta$. For example, for each $t$, we can approximate the mean of $\exp(t\Theta)$. This would be the moment generating function of $\Theta$. For each $x$ we could approximate the mean of $[\pi(1 + (x - \Theta)^2)]^{-1}$. This would be the predictive density of a future observation at $x$. A serious problem arises with this data set. For some values of $t$ or $x$, the associated function $H^*$ is not unimodal. This makes the use of the Laplace approximation unsatisfactory.

Nevertheless, we are able to approximate the moment generating function for small values of $t$ at least. This is done by using the function $g(\theta) = \exp(t\theta)$ for several different small values of $t$. Kass, Tierney, and Kadane (1988) suggest using numerical derivatives of the moment generating function for approximating moments of the parameter. For example, if we use $t = k \times 10^{-5}$ for $k = 0, 1, 2, 3, 4$, we can approximate two derivatives of the moment generating function at 0. Laplace's method uses $g(\theta) = \exp(t\theta)$ for the values of $t$ listed above and gives

t                    E(exp(tΘ)) − 1.0
0.0                  0.0
1.0 × 10^{-5}        4.49078 × 10^{-5}
2.0 × 10^{-5}        8.98176 × 10^{-5}
3.0 × 10^{-5}        13.47294 × 10^{-5}
4.0 × 10^{-5}        17.96432 × 10^{-5}

We can fit a quartic polynomial to these values, and its first two derivatives at 0 will approximate the first two moments. The fitted quartic is $-31677440x^4 + 2972.6x^3 + 9.85075x^2 + 4.49068x + 1.0$. The first derivative is 4.49068 and the second is 19.7015. Unfortunately, the estimated variance would be negative, so these moment estimates are not very good. The problem here is that, for some values of $t$, the function $H^*$ is not far from being multimodal.
Using numerical integration (trapezoidal rule for $-20 < \theta < 40$ with 6000 intervals), we approximate the mean to be 4.58994 and the variance to be 2.20456. The moment generating function can also be approximated by numerical integration. It is

t                    E(exp(tΘ)) − 1.0
0.0                  0.0
1.0 × 10^{-5}        4.59005 × 10^{-5}
2.0 × 10^{-5}        9.18034 × 10^{-5}
3.0 × 10^{-5}        13.77086 × 10^{-5}
4.0 × 10^{-5}        18.36161 × 10^{-5}

If we fit a quartic to this, we get

$$-226671x^4 + 38.17317x^3 + 11.63566x^2 + 4.58994x + 1.0,$$

from which we approximate the mean as 4.58994 and the variance as 2.20380.
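The numerical-integration values for the posterior mean and variance can be reproduced with a few lines of code, using the same trapezoidal grid quoted in the example (the implementation details are ours):

```python
# Posterior mean and variance of Theta for the Cauchy data with a
# N(0, 1000) prior, trapezoidal rule on (-20, 40) with 6000 intervals.
import math

data = [-5, -3, 0, 2, 4, 5, 7, 9, 11, 14]

def post_kernel(theta):
    # Cauchy likelihood times the N(0, 1000) prior kernel (constants dropped)
    ll = -sum(math.log(1 + (theta - x) ** 2) for x in data)
    return math.exp(ll - theta ** 2 / 2000)

m = 6000
xs = [-20 + 60 * i / m for i in range(m + 1)]
ws = [post_kernel(x) for x in xs]
ws[0] *= 0.5
ws[-1] *= 0.5                 # trapezoid end-point weights

norm = sum(ws)
mean = sum(w * x for w, x in zip(ws, xs)) / norm
var = sum(w * (x - mean) ** 2 for w, x in zip(ws, xs)) / norm
print(mean, var)
```

Normalizing constants cancel in the ratio, so only the posterior kernel is needed.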
For multiparameter problems, the situation is much the same except that second and higher derivatives are more complicated objects. For a function $f$ of $k$ variables with $p$ derivatives and for points $x$ and $y$ in $\mathbb{R}^k$, we define

$$D^{(p)}(f; x, y) = \sum_{j_1=1}^k \cdots \sum_{j_p=1}^k \prod_{s=1}^p y_{j_s}\left.\frac{\partial^p}{\partial z_{j_1}\cdots\partial z_{j_p}} f(z)\right|_{z=x}.$$

This is the analogue of the $p$th derivative evaluated at $x$ times $y$ to the power $p$. In particular, $D^{(2)}(f; x, y) = y^{\top}My$, where $M$ is the matrix of second partial derivatives of $f$ evaluated at $x$. All of the above reasoning applies as well in the case of $k$-dimensional $\theta$. For example,

$$\exp(nH(\theta)) = \exp(nH(\hat{\theta}))\exp\left(-\frac{n}{2}(\theta-\hat{\theta})^{\top}\Sigma^{-1}(\theta-\hat{\theta})\right)\Big[R_n(\theta) + 1 + \frac{n}{6}D^{(3)}(H; \hat{\theta}, \theta-\hat{\theta}) + \frac{n}{24}D^{(4)}(H; \hat{\theta}, \theta-\hat{\theta}) + \frac{n}{120}D^{(5)}(H; \hat{\theta}, \theta-\hat{\theta}) + \frac{n^2}{72}\left[D^{(3)}(H; \hat{\theta}, \theta-\hat{\theta})\right]^2\Big],$$

where $\Sigma$ is minus the inverse of the matrix of second partials of $H$ evaluated at $\hat{\theta}$ and

$$\int_{\Omega'} R_n(\theta)\exp\left(-\frac{n}{2}(\theta-\hat{\theta})^{\top}\Sigma^{-1}(\theta-\hat{\theta})\right)d\theta = O(n^{-2}),$$

as $n \rightarrow \infty$. The net effect of these modifications is that

$$\mathrm{E}(g(\Theta)|X = x) = \frac{|\Sigma^*|^{1/2}}{|\Sigma|^{1/2}}\exp\left(n\left[H^*(\theta^*) - H(\hat{\theta})\right]\right)\left[1 + O(n^{-2})\right],$$

where $\Sigma^*$ is minus the inverse of the matrix of second partials of $H^*$ evaluated at $\theta^*$.
One additional thing that we can do in the multiparameter case is approximate marginal densities of subsets of the parameter vector. Such densities are ratios of integrals of different dimensions, and we will not obtain $O(n^{-2})$ approximations in this case.
Theorem 7.116. For each $n$, let $(\mathcal{X}_n, \mathcal{B}_n)$ be a Borel space, and let $X_n$ be a random quantity taking values in $\mathcal{X}_n$. Let $X^n = (X_1,\ldots,X_n)$ and let $(\mathcal{X}^n, \mathcal{B}^n)$ be the product space of $\mathcal{X}_1,\ldots,\mathcal{X}_n$. Let $\{P_\theta : \theta \in \Omega\}$ be a parametric family of distributions for $\{X_n\}_{n=1}^\infty$ with $\Omega \subseteq \mathbb{R}^k$ with $k > 1$. Suppose that the distribution of $X^n$ given $\Theta = \theta$ is absolutely continuous with respect to a measure $\nu_n$ on $(\mathcal{X}^n, \mathcal{B}^n)$ for all $n$ with density $f_{X^n|\Theta}(\cdot|\theta)$. Let $f_\Theta(\theta)$ be the prior density of $\Theta$ with respect to Lebesgue measure. Assume that $f_{X^n|\Theta}(x^n|\theta)$ for all $n$ and $x^n \in \mathcal{X}^n$, and $f_\Theta(\theta)$ are all continuously differentiable with respect to $\theta$ four times. Write a typical point $\theta \in \Omega$ as $\theta = (\gamma, \psi)$, where $\gamma \in \mathbb{R}^p$ and $\psi \in \mathbb{R}^{k-p}$ with $1 \leq p < k$. For each $\gamma$, let $\Omega(\gamma) = \{\psi : (\gamma, \psi) \in \Omega\}$. Define

$$\ell_n(\theta; x^n) = \log f_{X^n|\Theta}(x^n|\theta), \qquad H_n(\theta; x^n) = \frac{1}{n}\left[\ell_n(\theta; x^n) + \log f_\Theta(\theta)\right],$$
$$H_n^*(\psi; x^n, \gamma) = H_n((\gamma, \psi); x^n).$$

Now let $\mathcal{Y} = \prod_{n=1}^\infty \mathcal{X}_n$, and define the set $A \subseteq \mathcal{Y}$ as the set of all $x = (x_1, x_2, x_3, \ldots) \in \mathcal{Y}$ with the following properties:

The integrals $\int f_\Theta(\theta)f_{X^n|\Theta}(x^n|\theta)\,d\theta$ and $\int f_\Theta(\gamma, \psi)f_{X^n|\Theta}(x^n|\gamma, \psi)\,d\psi$ are finite for all $n$ and $\gamma$.

$\ell_n$ achieves its maximum at a point $\theta_n'(x^n)$ for each $n$.

For each $n$ and $\gamma$, $\ell_n(\gamma, \psi)$ achieves its maximum at a point that we will call $\psi_n'(x^n; \gamma)$.

For each $n$, $H_n$ achieves its maximum at a point $\hat{\theta}_n(x^n)$ where the first derivative is zero.

For each $n$ and $\gamma$, $H_n^*(\cdot; x^n, \gamma)$ achieves its maximum at $\psi_n^*(x^n; \gamma)$, a point where the first derivative is zero.

$\hat{\theta}_n(x^n)$ and $\psi_n^*(x^n; \gamma)$ converge as $n \rightarrow \infty$.

The second derivatives of $H_n$ and $H_n^*$ at their maxima converge to negative definite matrices.

There exists $\delta_0$ such that the absolute values of $H_n$ and its first four derivatives are all uniformly bounded for $\|\theta - \hat{\theta}_n(x^n)\| \leq 2\delta_0$. There exists $\delta_1$ such that for all $\gamma$ the absolute values of $H_n^*$ and its first four derivatives are all uniformly bounded for $\|\psi - \psi_n^*(x^n; \gamma)\| \leq 2\delta_1$.

$$\limsup_{n\rightarrow\infty}\frac{1}{n}\sup_{\|\theta - \theta_n'(x^n)\| > \delta}\left[\ell_n(\theta; x^n) - \ell_n(\theta_n'(x^n); x^n)\right] < 0, \quad \text{for all } \delta > 0. \quad (7.117)$$

For each $(x_1, x_2, x_3, \ldots) \in A$, define $\Sigma_n(x^n)$ to be minus the inverse of the matrix of second partials of $H_n$ at $\hat{\theta}_n(x^n)$, and define $\Sigma_n^*(x^n; \gamma)$ to be minus the inverse of the matrix of second partials of $H_n^*(\cdot; x^n, \gamma)$ at $\psi_n^*(x^n; \gamma)$.

For each $(x_1, x_2, x_3, \ldots) \in A$, the marginal posterior density of $\Gamma$ (the first $p$ coordinates of $\Theta$) given $X^n = x^n$ is $f_{\Gamma|X^n}(\gamma|x^n)$ equal to

$$\frac{n^{p/2}\left|\Sigma_n^*(x^n; \gamma)\right|^{1/2}}{(2\pi)^{p/2}\left|\Sigma_n(x^n)\right|^{1/2}}\exp\left(n\left[H_n^*(\psi_n^*(x^n; \gamma); x^n, \gamma) - H_n(\hat{\theta}_n(x^n); x^n)\right]\right)\times\left[1 + O(n^{-1})\right].$$
PROOF. As in the proof of Theorem 7.108, we will suppress the dependence on $n$ and $x^n$. Let $(x_1, x_2, \ldots) \in A$, and write

$$f_{\Gamma|X^n}(\gamma|x^n) = \frac{\int_{\Omega(\gamma)}\exp(nH^*(\psi; \gamma))\,d\psi}{\int_\Omega \exp(nH(\theta))\,d\theta}. \quad (7.118)$$

As in the proof of Theorem 7.108, $\theta'$ converges to the same thing to which $\hat{\theta}$ converges, and $\psi'(\gamma)$ converges to the same thing to which $\psi^*(\gamma)$ converges. Let $\Omega'$ be that part of $\Omega$ inside the ball of radius $\delta_0$ around $\theta_0$, the limit of $\hat{\theta}$. Since $\exp(nH(\theta)) = \exp(\ell_n(\theta))f_\Theta(\theta)$, condition (7.117) implies that $\int_{\Omega\setminus\Omega'}\exp(nH(\theta))\,d\theta$ is exponentially small. For this reason, we can replace $\Omega$ by $\Omega'$. We can also replace $\Omega(\gamma)$ by $\Omega'(\gamma)$, the part of $\Omega(\gamma)$ inside a ball of radius $\delta_1$ around the limit of $\psi^*(\gamma)$. The error in (7.118) for doing this is no larger than $O(n^{-1})$.

By expanding $H(\theta)$ in a Taylor series (see Theorem C.1) around $\theta = \hat{\theta}$, as in the proof of Theorem 7.108, we can obtain

$$\int_\Omega \exp(nH(\theta))\,d\theta = \frac{(2\pi)^{k/2}|\Sigma|^{1/2}}{n^{k/2}}\exp(nH(\hat{\theta}))\times\left[1 + O(n^{-1})\right].$$

Similarly,

$$\int_{\Omega(\gamma)}\exp(nH^*(\psi; \gamma))\,d\psi = \frac{(2\pi)^{(k-p)/2}|\Sigma^*|^{1/2}}{n^{(k-p)/2}}\exp(nH^*(\psi^*(\gamma); \gamma))\times\left[1 + O(n^{-1})\right].$$

Taking the ratio of these two gives the result. $\Box$



The proof of Theorem 7.116 can be adapted to show that the approximate Bayes factor in (4.27) on page 227 is an $O(n^{-1})$ approximation to the true Bayes factor when the hypothesis is $H : \Gamma = \gamma_0$. In this case, the parameter under the hypothesis is $\Psi$, the last $k - p$ coordinates of $\Theta$. One must replace $\theta'$ by $\hat{\theta}$ and $\psi'(\gamma)$ by $\psi^*(\gamma)$ in the approximation of Theorem 7.116, but this does not alter the order of the approximation. One can also show that the $O(n^{-1})$ term in Theorem 7.116 is uniform for $\gamma$ in compact sets.

7.4.4 Asymptotic Agreement of Predictive Distributions+


The next theorem is due to Blackwell and Dubins (1962). It concerns the difference between posterior predictive distributions calculated under two different models.³¹ If the two models are somewhat similar (in the sense that they at least assign positive probabilities to the same events), then the posterior probabilities (calculated from the two models) of every event will become uniformly closer as the amount of data increases.

To be precise, we will need to set up some notation. Let $(\mathcal{X}_n, \mathcal{B}_n)$ be measurable spaces, let $\mathcal{X} = \prod_{i=1}^\infty \mathcal{X}_i$, and let $\mathcal{B}$ be the product $\sigma$-field, $\mathcal{B} = \mathcal{B}_1 \otimes \mathcal{B}_2 \otimes \cdots$. Suppose that $P$ and $Q$ are probability measures on $(\mathcal{X}, \mathcal{B})$. Let $P_n$ and $Q_n$ be the respective marginal distributions on $(\mathcal{Y}_n, \mathcal{C}_n)$, where $\mathcal{Y}_n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$ and $\mathcal{C}_n = \mathcal{B}_1 \otimes \cdots \otimes \mathcal{B}_n$. Let $P^n$ and $Q^n$ stand for versions of the conditional distributions on $(\mathcal{Y}^n, \mathcal{C}^n)$ given the first $n$ coordinates, where $\mathcal{Y}^n = \mathcal{X}_{n+1} \times \mathcal{X}_{n+2} \times \cdots$ and $\mathcal{C}^n = \mathcal{B}_{n+1} \otimes \mathcal{B}_{n+2} \otimes \cdots$. That is, for each $B \in \mathcal{C}^n$, there is a $\mathcal{C}_n$-measurable function $P^n(B|x_1,\ldots,x_n)$ such that, for each $(x_1,\ldots,x_n) \in \mathcal{Y}_n$, $P^n(\cdot|x_1,\ldots,x_n)$ is a probability measure over $\mathcal{C}^n$ and for each bounded measurable $f : \mathcal{X} \rightarrow \mathbb{R}$,

$$\int f\,dP = \int_{\mathcal{Y}_n}\int_{\mathcal{Y}^n} f(x_1,\ldots,x_n,v)\,dP^n(v|x_1,\ldots,x_n)\,dP_n(x_1,\ldots,x_n)$$

(and similarly for $Q^n$). For arbitrary probability measures $S$ and $T$ on the same $\sigma$-field $\mathcal{C}$, let

$$\rho(S, T) = \sup_{A\in\mathcal{C}}|S(A) - T(A)|.$$

Before we state and prove the main theorem, we will give conditions under which the hypotheses of the theorem hold.
Lemma 7.119. Let $\pi_2 \ll \pi_1$ be probability measures on a parameter space $(\Omega, \tau)$ with parametric family $\{P_\theta : \theta \in \Omega\}$. Suppose that, for every $B \in \mathcal{B}$,

$$P(B) = \int_\Omega P_\theta(B)\,d\pi_1(\theta) \quad\text{and}\quad Q(B) = \int_\Omega P_\theta(B)\,d\pi_2(\theta).$$

Then $Q \ll P$.

⁺This section contains results that rely on the theory of martingales. It may be skipped without interrupting the flow of ideas.
³¹The proof relies on martingale theory.

PROOF. If $Q(B) > 0$, then there exists a set $C \subseteq \Omega$ such that $P_\theta(B) > 0$ for all $\theta \in C$ and $\pi_2(C) > 0$. It follows that $\pi_1(C) > 0$ and then that $P(B) > 0$. $\Box$

The importance of Lemma 7.119 is that it applies to the popular model in which data are conditionally IID given some parameter $\Theta$ with a distribution in a parametric family. Lemma 7.119 says that if two Bayesians agree on the parametric family but disagree on the prior distribution, then Theorem 7.120 will apply to them, so long as one of the prior distributions is absolutely continuous with respect to the other.
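A toy numerical illustration of this situation (not from the text): two Bayesians share a Bernoulli model but hold different Beta priors, and their posterior predictive probabilities for the next observation merge as the data accumulate, in line with Theorem 7.120.

```python
# Two Beta priors on a common Bernoulli model: predictive merging.
def predictive(a, b, heads, n):
    # posterior predictive Pr(next observation = 1) under a Beta(a, b) prior
    return (a + heads) / (a + b + n)

# A fixed deterministic record with 60% ones, standing in for data.
seq = [1, 1, 1, 0, 0] * 2000

for n in [5, 50, 500, 5000]:
    heads = sum(seq[:n])
    p1 = predictive(1, 1, heads, n)      # uniform Beta(1, 1) prior
    p2 = predictive(10, 2, heads, n)     # optimistic Beta(10, 2) prior
    print(n, abs(p1 - p2))
```

The gap between the two predictive probabilities shrinks roughly like $1/n$, which is the merging phenomenon the theorem guarantees for any pair of mutually absolutely continuous priors.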
Theorem 7.120. If $Q \ll P$, then for each $P^n$ there exists a version of $Q^n$ such that

$$Q\left[(x_1, x_2, \ldots) : \lim_{n\rightarrow\infty}\rho\left(P^n(\cdot|x_1,\ldots,x_n),\,Q^n(\cdot|x_1,\ldots,x_n)\right) = 0\right] = 1.$$

The proof of this theorem requires some lemmas and corollaries.
Lemma 7.121. Let $Q$ be a probability measure, and let $\mathrm{E}$ denote expectation with respect to $Q$. Let $\{Y_n\}_{n=1}^\infty$ be a sequence of random variables such that $\lim_{n\rightarrow\infty} Y_n = Y$ a.s. $[Q]$ and $|Y_n| \leq m$ for all $n$ and some nonnegative $m$. Let $\{\mathcal{U}_j\}_{j=1}^\infty$ be an increasing sequence of $\sigma$-fields. Let $\mathcal{U}$ be the smallest $\sigma$-field containing all of the $\mathcal{U}_j$. Then

$$\lim_{j\rightarrow\infty,\,n\rightarrow\infty}\mathrm{E}(Y_n|\mathcal{U}_j) = \mathrm{E}(Y|\mathcal{U}).$$

PROOF. Let $G_k = \sup_{n\geq k} Y_n$. For fixed $k$ and $n \geq k$, $Y_n \leq G_k$ and $\mathrm{E}(Y_n|\mathcal{U}_i) \leq \mathrm{E}(G_k|\mathcal{U}_i)$ a.s. for each $i$. Define

$$Z = \lim_{j\rightarrow\infty}\sup_{i\geq j,\,n\geq j}\mathrm{E}(Y_n|\mathcal{U}_i).$$

Then

$$Z \leq \lim_{j\rightarrow\infty}\sup_{i\geq j}\mathrm{E}(G_k|\mathcal{U}_i) = \lim_{i\rightarrow\infty}\mathrm{E}(G_k|\mathcal{U}_i) = \mathrm{E}(G_k|\mathcal{U}) \rightarrow \mathrm{E}(Y|\mathcal{U})$$

as $k \rightarrow \infty$. The first equality holds because the supremum decreases as $j$ increases. The limit follows from the martingale convergence theorem B.117. Similarly, with $G_k$ replaced by $\inf_{n\geq k} Y_n$, we can show that

$$\lim_{j\rightarrow\infty}\inf_{i\geq j,\,n\geq j}\mathrm{E}(Y_n|\mathcal{U}_i) \geq \mathrm{E}(Y|\mathcal{U}).$$

Together these imply the lemma. $\Box$


Corollary 7.122. If the probability is 1 that only finitely many of $\{E_n\}_{n=1}^\infty$ occur, then

$$\lim_{n\rightarrow\infty,\,j\rightarrow\infty} Q\left(\bigcup_{k=n}^\infty E_k\,\Big|\,\mathcal{U}_j\right) = 0, \quad\text{a.s.}$$

PROOF. The condition in the statement of the corollary is $Q(\bigcup_{n=1}^\infty\bigcap_{k=n}^\infty E_k^C) = 1$. This is equivalent to $Q(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty E_k) = 0$. Let $Y_n$ be the indicator of the event $\bigcup_{k=n}^\infty E_k$. Then $Y = \lim_{n\rightarrow\infty} Y_n$ is the indicator of $\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty E_k$, and $\mathrm{E}(Y|\mathcal{U}) = 0$. Now apply Lemma 7.121. $\Box$
Corollary 7.123. If $\lim_{n\rightarrow\infty} T_n = 0$ a.s., then for each $\epsilon > 0$,

$$\lim_{n\rightarrow\infty,\,j\rightarrow\infty} Q\left(\sup_{k\geq n}|T_k| > \epsilon\,\Big|\,\mathcal{U}_j\right) = 0, \quad\text{a.s.}$$

PROOF. Let $E_k = \{|T_k| > \epsilon\}$, so that $\{\sup_{k\geq n}|T_k| > \epsilon\} = \bigcup_{k=n}^\infty E_k$. Now apply Corollary 7.122. $\Box$
Lemma 7.124. Let $Q \ll P$ and let $q = dQ/dP$. Define

$$q_n(x_1,\ldots,x_n) = \int q(x_1, x_2, \ldots)\,dP^n(x_{n+1},\ldots|x_1,\ldots,x_n),$$
$$d_n(x_1, x_2, \ldots) = \begin{cases} q(x_1, x_2, \ldots)/q_n(x_1,\ldots,x_n) & \text{if } q_n > 0, \\ 1 & \text{if } q_n = 0. \end{cases}$$

Then $q_n = dQ_n/dP_n$, and for each $\epsilon > 0$,

$$\lim_{n\rightarrow\infty} Q^n\left(|d_n - 1| > \epsilon\,\middle|\,x_1,\ldots,x_n\right) = 0, \quad\text{a.s. } [Q].$$

PROOF. For every $B \in \mathcal{C}_n$,

$$\int_B q_n(x_1,\ldots,x_n)\,dP_n(x_1,\ldots,x_n) = \int_B\int q(x_1, x_2, \ldots)\,dP^n(x_{n+1},\ldots|x_1,\ldots,x_n)\,dP_n(x_1,\ldots,x_n) = \int_{B\times\mathcal{Y}^n} q(x_1, x_2, \ldots)\,dP(x_1, x_2, \ldots) = \int_{B\times\mathcal{Y}^n} dQ(x_1, x_2, \ldots),$$

which equals $Q_n(B)$. Hence, $q_n = dQ_n/dP_n$.

Under probability $P$, $\mathrm{E}[q(X_1, X_2, \ldots)|X_1,\ldots,X_n] = q_n(X_1,\ldots,X_n)$, so $\{q_n(X_1,\ldots,X_n)\}_{n=1}^\infty$ is a martingale. By part 1 of Lévy's theorem B.118, we conclude $q_n \rightarrow q$ a.s. $[P]$. Since $q_n > 0$ a.s. $[Q_n]$, this implies that $d_n \rightarrow 1$ a.s. $[Q]$. Now apply Corollary 7.123 with $\mathcal{U}_j = \mathcal{C}_j$. $\Box$
PROOF OF THEOREM 7.120. For convenience, let $u$ denote $(x_1,\ldots,x_n)$ and let $v$ denote $(x_{n+1},\ldots)$. For each $P^n$, let

$$Q^n(C|u) = \int_C d_n(u, v)\,dP^n(v|u),$$

which is a version of the conditional distribution of $Q$ given $\mathcal{C}_n$ since, for each $A \in \mathcal{C}_n$ and $C \in \mathcal{C}^n$,

$$\int_A\int_C \frac{q(u, v)}{q_n(u)}\,dP^n(v|u)\,q_n(u)\,dP_n(u) = \int_{A\times C} q(u, v)\,dP(u, v) = Q(A \times C).$$

For each $u$ and $\epsilon > 0$, let

$$A(u) = \{v : d_n(u, v) > 1\}, \qquad A(u, \epsilon) = \{v : d_n(u, v) > 1 + \epsilon\}.$$

Then

$$\rho\left(P^n(\cdot|u), Q^n(\cdot|u)\right) = \int_{A(u)}\left[d_n(u, v) - 1\right]dP^n(v|u) \leq \epsilon + \int_{A(u,\epsilon)}\left[d_n(u, v) - 1\right]dP^n(v|u) \leq \epsilon + \int_{A(u,\epsilon)} d_n(u, v)\,dP^n(v|u) = \epsilon + Q^n(A(u, \epsilon)|u).$$

Now, write $\{\lim_{n\rightarrow\infty}\rho(P^n(\cdot|u_n), Q^n(\cdot|u_n)) = 0\}$ as

$$\bigcap_{\epsilon>0}\bigcup_{N=1}^\infty\bigcap_{n=N}^\infty\left\{\rho\left(P^n(\cdot|u_n), Q^n(\cdot|u_n)\right) \leq 2\epsilon\right\} \supseteq \bigcap_{\epsilon>0}\bigcup_{N=1}^\infty\bigcap_{n=N}^\infty\left\{Q^n(A(u_n, \epsilon)|u_n) \leq \epsilon\right\} \supseteq \bigcap_{\epsilon>0}\bigcup_{N=1}^\infty\bigcap_{n=N}^\infty\left\{Q^n\left(|d_n - 1| > \epsilon\,\middle|\,u_n\right) \leq \epsilon\right\}.$$

The first containment is what we just proved, and the second is trivial. Lemma 7.124 says that $Q$ of the last of these sets is 1, hence $Q$ of the first of these sets is 1. $\Box$

7.5 Large Sample Tests


7.5.1 Likelihood Ratio Tests
Asymptotic theory can provide approximate tests in complicated situations. Let $\Omega \subseteq \mathbb{R}^p$, and assume that $\Omega_H = \{\theta : g(\theta) = c\}$. Reparameterize, if necessary, so that the first $k$ coordinates of $\theta$ are $g(\theta)$, for $k \leq p$. Let $\hat{\Theta}_{n,H}$ be the MLE of $\Theta$ assuming $\Omega_H$ is the parameter space, and let $\hat{\Theta}_n$ be the unrestricted MLE. Then the likelihood ratio (LR) criterion (as introduced in Section 4.5.5) is

$$\Lambda_n = \frac{\sup_{\theta\in\Omega_H} f_{X|\Theta}(X|\theta)}{\sup_{\theta\in\Omega} f_{X|\Theta}(X|\theta)} = \frac{f_{X|\Theta}(X|\hat{\Theta}_{n,H})}{f_{X|\Theta}(X|\hat{\Theta}_n)}.$$

We will first consider the special case in which $p = k = 1$ and $g(\theta) = \theta$. Then

$$-2\log\Lambda_n = -2\log f_{X|\Theta}(X|c) + 2\log f_{X|\Theta}(X|\hat{\Theta}_n) = -2\ell_n(c) + 2\ell_n(\hat{\Theta}_n).$$

Suppose that $\ell_n$ has two continuous derivatives. Then

$$\ell_n(c) = \ell_n(\hat{\Theta}_n) + (c - \hat{\Theta}_n)\ell_n'(\hat{\Theta}_n) + \frac{1}{2}(c - \hat{\Theta}_n)^2\ell_n''(\Theta_n^*),$$

where $\Theta_n^*$ is between $c$ and $\hat{\Theta}_n$. Also, $\ell_n'(\hat{\Theta}_n) = 0$. So

$$-2\log\Lambda_n = -(c - \hat{\Theta}_n)^2\ell_n''(\Theta_n^*) = n(c - \hat{\Theta}_n)^2\left[-\frac{1}{n}\ell_n''(\Theta_n^*)\right].$$

Now, suppose that $\sqrt{n}(c - \hat{\Theta}_n) \stackrel{\mathcal{D}}{\rightarrow} N(0, 1/\mathcal{I}_{X_1}(c))$ under $P_c$ and that $-\ell_n''(\Theta_n^*)/n$ converges in probability to $\mathcal{I}_{X_1}(c)$. Then

$$-2\log\Lambda_n \stackrel{\mathcal{D}}{\rightarrow} \chi^2_1$$

under $P_c$, that is, under $H$.
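The one-dimensional case can be checked by simulation. For $X_i$ IID $N(\theta, 1)$ with known variance, $-2\log\Lambda_n = n(\bar{X} - c)^2$ exactly, and this statistic has a $\chi^2_1$ distribution under $H : \theta = c$. The following sketch (our own toy setup, not from the text) verifies the mean and the 5% rejection rate.

```python
# Simulated distribution of -2 log Lambda_n for a normal mean with
# known variance; the statistic is exactly chi-squared_1 under H.
import random
import statistics

random.seed(1)
c, n, reps = 0.0, 25, 20000
stats = []
for _ in range(reps):
    xbar = statistics.fmean(random.gauss(c, 1) for _ in range(n))
    stats.append(n * (xbar - c) ** 2)

mean_stat = statistics.fmean(stats)              # E(chi-squared_1) = 1
reject = sum(s > 3.841 for s in stats) / reps    # 0.95 quantile test, ~0.05
print(mean_stat, reject)
```

Here 3.841 is the 0.95 quantile of the $\chi^2_1$ distribution, so the empirical rejection rate should be near the nominal level 0.05.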


For the more general (higher-dimensional) cases, we have the following
theorem.
Theorem 7.125. Assume the conditions of Theorem 7.63. Let $\Lambda_n$ be the LR criterion for testing $H : \Theta_i = c_i$ for $i = 1,\ldots,k$. Then $-2\log\Lambda_n \stackrel{\mathcal{D}}{\rightarrow} \chi^2_k$ under $H$.

PROOF. Let $c^{\top} = (c_1,\ldots,c_k)$, and let $\theta_0 \in \Omega$ be of the form $\theta_0^{\top} = (c^{\top}, \psi_0^{\top})$, where $\psi_0$ has dimension $p - k$. Under $H$, the parameter is $\psi^{\top} = (\theta_{k+1},\ldots,\theta_p)$, and the conditions of Theorem 7.63 hold in this smaller problem.

We will find the asymptotic distribution of $-2\log\Lambda_n$ under $P_{\theta_0}$ and see that it does not depend on $\psi_0$. Let $\hat{\Theta}_{n,H}$ be the MLE assuming that $\Omega_H$ is the parameter space. Then

$$\hat{\Theta}_{n,H} = \begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix}.$$

We will also write the overall MLE in partitioned form as

$$\hat{\Theta}_n = \begin{pmatrix} \hat{C}_n \\ \hat{\Psi}_n \end{pmatrix}.$$

Then

$$\sup_{\theta\in\Omega_H} f_{X|\Theta}(X|\theta) = f_{X|\Theta}\left(X\,\middle|\,\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix}\right).$$

Use Taylor's theorem C.1 to write

$$\ell_n\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix} = \ell_n(\hat{\Theta}_n) + \left[\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix} - \hat{\Theta}_n\right]^{\top}\nabla\ell_n(\hat{\Theta}_n) + \frac{1}{2}\left[\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix} - \hat{\Theta}_n\right]^{\top}\left(\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell_n(\theta_n^*)\right)\right)\left[\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix} - \hat{\Theta}_n\right], \quad (7.126)$$

with $\theta_n^*$ coordinatewise between $\hat{\Theta}_n$ and $\hat{\Theta}_{n,H}$. Next, we use Taylor's theorem C.1 to expand the gradient vector of $\ell_n$ at both $\hat{\Theta}_n$ and $\hat{\Theta}_{n,H}$ around $\theta_0$. The vector of partial derivatives at $\hat{\Theta}_n$ is the $p$-dimensional vector

$$0 = \nabla\ell_n(\hat{\Theta}_n) = \nabla\ell_n(\theta_0) + \left(\left(\frac{\partial^2}{\partial\theta_i\partial\theta_j}\ell_n(\theta_n^{\dagger})\right)\right)(\hat{\Theta}_n - \theta_0), \quad (7.127)$$

where $\theta_n^{\dagger}$ is coordinatewise between $\theta_0$ and $\hat{\Theta}_n$. The vector of partial derivatives with respect to $\psi$ at $\hat{\Theta}_{n,H}$ is the $(p-k)$-dimensional vector

$$0 = \nabla_\psi\ell_n(\hat{\Theta}_{n,H}) = \nabla_\psi\ell_n(\theta_0) + \left(\left(\frac{\partial^2}{\partial\psi_i\partial\theta_j}\ell_n(\theta_n^{\dagger\dagger})\right)\right)(\hat{\Theta}_{n,H} - \theta_0), \quad (7.128)$$

where $\theta_n^{\dagger\dagger}$ is coordinatewise between $\theta_0$ and $\hat{\Theta}_{n,H}$. It follows from (7.64) that minus $1/n$ times each of these matrices of second partials converges in probability to the corresponding block of $\mathcal{I}_{X_1}(\theta_0)$, which we write in partitioned form as

$$\mathcal{I}_{X_1}(\theta_0) = \begin{pmatrix} A_0 & B_0 \\ B_0^{\top} & D_0 \end{pmatrix},$$

with sample versions $A_n$, $B_n$, $D_n$ from (7.127) and $B_{n,H}$, $D_{n,H}$ from (7.128). Equating the last $p - k$ coordinates of the two 0 vectors in (7.127) and (7.128), we get

$$D_{n,H}(\hat{\Psi}_{n,H} - \psi_0) = B_n^{\top}(\hat{C}_n - c) + D_n(\hat{\Psi}_n - \psi_0).$$

We know from Theorem 7.63 that $\hat{\Psi}_{n,H} - \psi_0 = O_P(1/\sqrt{n})$, $\hat{\Psi}_n - \psi_0 = O_P(1/\sqrt{n})$, and $\hat{C}_n - c = O_P(1/\sqrt{n})$. Also, we just proved that $D_n - D_0 = o_P(1)$, $B_n - B_0 = o_P(1)$, and $D_{n,H} - D_0 = o_P(1)$. It follows that

$$D_0(\hat{\Psi}_{n,H} - \psi_0) = B_0^{\top}(\hat{C}_n - c) + D_0(\hat{\Psi}_n - \psi_0) + o_P\left(\frac{1}{\sqrt{n}}\right).$$

Hence,

$$\hat{\Psi}_{n,H} = \hat{\Psi}_n + D_0^{-1}B_0^{\top}(\hat{C}_n - c) + o_P\left(\frac{1}{\sqrt{n}}\right). \quad (7.129)$$

Now, combine (7.126) and (7.129) to conclude that

$$\ell_n\begin{pmatrix} c \\ \hat{\Psi}_{n,H} \end{pmatrix} - \ell_n(\hat{\Theta}_n) = -\frac{n}{2}\begin{bmatrix} c - \hat{C}_n \\ -D_0^{-1}B_0^{\top}(c - \hat{C}_n) \end{bmatrix}^{\top}\mathcal{I}_{X_1}(\theta_0)\begin{bmatrix} c - \hat{C}_n \\ -D_0^{-1}B_0^{\top}(c - \hat{C}_n) \end{bmatrix} + o_P(1) = -\frac{n}{2}(c - \hat{C}_n)^{\top}\left[A_0 - B_0D_0^{-1}B_0^{\top}\right](c - \hat{C}_n) + o_P(1).$$

The matrix $A_0 - B_0D_0^{-1}B_0^{\top}$ is the inverse of the upper-left $k\times k$ corner of $\mathcal{I}_{X_1}(\theta_0)^{-1}$, which, in turn, is the asymptotic covariance matrix of $\sqrt{n}(\hat{C}_n - c)$. Since

$$\sqrt{n}(\hat{C}_n - c) \stackrel{\mathcal{D}}{\rightarrow} N_k\left(0, \left[A_0 - B_0D_0^{-1}B_0^{\top}\right]^{-1}\right),$$

it follows that $-2\log\Lambda_n \stackrel{\mathcal{D}}{\rightarrow} \chi^2_k$. Note that the choice of $\psi_0$ is irrelevant. $\Box$
When appealing to the asymptotic distribution of the LR criterion, the tradition is to choose $\alpha$ and reject $H$ if $-2\log\Lambda_n$ is greater than the $1 - \alpha$ quantile of the $\chi^2_k$ distribution.
Example 7.130 (Continuation of Example 7.104; see page 444). Using the same data as in the previous Cauchy example, suppose that we wish to test $H : \Theta = 5$. The two values of the log-likelihood function are

$$\ell_{10}(4.531) = -10\log(\pi) - 27.36 \quad\text{and}\quad \ell_{10}(5) = -10\log(\pi) - 27.50.$$

So $-2\log\Lambda_n = 0.28$, which is too small to reject $H$ at any popular level.
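These two log-likelihood values, and the resulting statistic, are easy to verify (data taken from Example 7.104):

```python
# Verifying Example 7.130: the Cauchy log-likelihood at the MLE and
# at the hypothesized value, and the LR statistic -2 log Lambda_n.
import math

data = [-5, -3, 0, 2, 4, 5, 7, 9, 11, 14]

def loglik(theta):
    return -len(data) * math.log(math.pi) \
           - sum(math.log(1 + (x - theta) ** 2) for x in data)

stat = -2 * (loglik(5.0) - loglik(4.531))
print(stat)    # about 0.28, far below the chi-squared_1 0.95 quantile 3.841
```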

7.5.2 Chi-Squared Goodness of Fit Tests


Another large sample test is the chi-squared ($\chi^2$) goodness of fit test, motivated as asymptotically UMP invariant. If $\Omega$ is the set of all distributions and $\theta_0$ is one element of $\Omega$, then we can test $H : \Theta = \theta_0$ asymptotically as follows. Choose a fixed dimension $p$, and divide $\mathcal{X}$ into $p$ disjoint regions $R_1,\ldots,R_p$. Let $Q_i = P_\Theta(R_i)$ and $q_{i,0} = P_{\theta_0}(R_i)$. We replace $H : \Theta = \theta_0$ by $H^* : Q = q_0$. $H$ implies $H^*$, but $\Omega_{H^*}$ is bigger than $\Omega_H$. The general result is the following.

Theorem 7.131. Suppose that $\{X_n\}_{n=1}^\infty$ are IID with distribution $P$. Let $(R_1,\ldots,R_p)$ be a partition of $\mathcal{X}$. Define $Y_j$ for $j = 1,\ldots,p$ to be the number of the first $n$ $X_i$ that are in $R_j$, and define $q_j = P(R_j)$. If

$$C_n = \sum_{i=1}^p\frac{(Y_i - nq_i)^2}{nq_i},$$

then $C_n \stackrel{\mathcal{D}}{\rightarrow} \chi^2_{p-1}$ as $n \rightarrow \infty$.

PROOF. The distribution of $Y = (Y_1,\ldots,Y_p)^{\top}$ is $\mathrm{Mult}(n; q_1,\ldots,q_p)$, and we know that

$$\sqrt{n}\begin{pmatrix} \frac{Y_1}{n} - q_1 \\ \vdots \\ \frac{Y_p}{n} - q_p \end{pmatrix} \stackrel{\mathcal{D}}{\rightarrow} N_p(0, \Sigma),$$

where $\Sigma = ((\sigma_{i,j}))$, with

$$\sigma_{i,j} = \begin{cases} -q_iq_j & \text{if } i \neq j, \\ q_i(1 - q_i) & \text{if } i = j. \end{cases}$$

Let $\Sigma_*$ be the upper-left $(p-1)\times(p-1)$ corner of $\Sigma$. Define

$$D_n = \frac{1}{n}(Y_1 - nq_1,\ldots,Y_{p-1} - nq_{p-1})\Sigma_*^{-1}\begin{pmatrix} Y_1 - nq_1 \\ \vdots \\ Y_{p-1} - nq_{p-1} \end{pmatrix},$$

and note that $D_n \stackrel{\mathcal{D}}{\rightarrow} \chi^2_{p-1}$. We can rewrite $D_n$ by using

$$\Sigma_* = \mathrm{diag}(q_1,\ldots,q_{p-1}) - \begin{pmatrix} q_1 \\ \vdots \\ q_{p-1} \end{pmatrix}(q_1,\ldots,q_{p-1}).$$

The inverse of a matrix of the form $A - bb^{\top}$ is

$$(A - bb^{\top})^{-1} = A^{-1} + \frac{A^{-1}bb^{\top}A^{-1}}{1 - b^{\top}A^{-1}b}.$$

With $A = \mathrm{diag}(q_1,\ldots,q_{p-1})$ and $b = (q_1,\ldots,q_{p-1})^{\top}$, we have $A^{-1}b = (1,\ldots,1)^{\top}$ and $b^{\top}A^{-1}b = 1 - q_p$, so $\Sigma_*^{-1} = \mathrm{diag}(1/q_1,\ldots,1/q_{p-1}) + q_p^{-1}\mathbf{1}\mathbf{1}^{\top}$. This means that $D_n$ can be written as

$$D_n = \frac{1}{n}\left[\sum_{i=1}^{p-1}\frac{(Y_i - nq_i)^2}{q_i} + \frac{1}{q_p}\left(\sum_{i=1}^{p-1}(Y_i - nq_i)\right)^2\right] = \sum_{i=1}^p\frac{(Y_i - nq_i)^2}{nq_i} = C_n,$$

since $\sum_{i=1}^p (Y_i - nq_i) = 0$. $\Box$

The traditional $\chi^2$ goodness of fit test is to reject the hypothesis that the distribution of the data is $P$ if $C_n$ is greater than the $1 - \alpha$ quantile of the $\chi^2_{p-1}$ distribution.
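The algebraic identity at the end of the proof, that the quadratic form $D_n$ built from $\Sigma_*^{-1}$ equals the familiar sum $C_n$, is easy to check numerically. The cell probabilities and counts below are hypothetical values chosen only for illustration.

```python
# Checking D_n = C_n: Sigma*^{-1} = diag(1/q_i) + (1/q_p) * ones-matrix,
# so the (p-1)-dimensional quadratic form reproduces the p-term sum.
q = [0.2, 0.3, 0.1, 0.4]      # hypothetical cell probabilities (sum to 1)
y = [25, 30, 8, 37]           # hypothetical counts; n = 100
n = sum(y)

c_n = sum((yi - n * qi) ** 2 / (n * qi) for yi, qi in zip(y, q))

z = [yi - n * qi for yi, qi in zip(y[:-1], q[:-1])]
# quadratic form with Sigma*^{-1} written via the Sherman-Morrison formula
d_n = (sum(zi ** 2 / qi for zi, qi in zip(z, q[:-1]))
       + sum(z) ** 2 / q[-1]) / n
print(c_n, d_n)
```

The two values agree because the omitted cell's deviation is minus the sum of the others, so the rank-one correction term supplies exactly the missing $p$th summand.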

Example 7.132. Bortkiewicz (1898) reports data on the number of men killed
by horsekick in the Prussian army.32 The data were collected from 14 army units
for 20 years.

32See Bishop, Fienberg, and Holland (1975) for a more complete analysis of
this data.
7.5. Large Sample Tests 463

Number killed    0    1    2    3    4   ≥5
Count          144   91   32   11    2    0

These data are clearly not uniformly distributed over the six categories, but we
illustrate the χ² test with each q_i = 1/6. The value of C₂₈₀ is 366.4, which far
exceeds the 0.9999 quantile of the χ²₅ distribution.
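The arithmetic of this example can be checked directly from Theorem 7.131; a minimal sketch:

```python
counts = [144, 91, 32, 11, 2, 0]   # observed Y_i for the six categories
n = sum(counts)                    # 280 unit-years of observation
q = [1.0 / 6] * 6                  # hypothesized uniform cell probabilities

# C_n = sum_i (Y_i - n q_i)^2 / (n q_i) from Theorem 7.131
C = sum((y - n * qi) ** 2 / (n * qi) for y, qi in zip(counts, q))
# C is about 366.4, far beyond any reasonable chi-squared_5 quantile
```

Each expected count is n q_i = 280/6 ≈ 46.7, so the empty "≥5" cell alone contributes about 46.7 to the statistic.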

A possible Bayesian approach to this problem is to try to measure how
close the distribution P is to the hypothesized distribution. Of course, there are many measures of close-
ness. We could let Q = (Q₁, …, Q_p), where Q_i = P(R_i), and find a large
sample approximation to the posterior distribution of Q based on just the
data (Y₁, …, Y_p). Theorems 7.102 and 7.101 give one such approximation
as

√n (Q − Y/n) ≈ N_p(0, S),

where S = ((s_{i,j})), with

s_{i,j} = { −(Y_i/n)(Y_j/n)       if i ≠ j,
            (Y_i/n)(1 − Y_i/n)   if i = j.

We could then examine the distribution of Q − q, where q = (q₁, …, q_p), or
specifically of ‖Q − q‖, or whatever. For example, let S* be the upper-left
(p − 1) × (p − 1) corner of S and consider

n (Q₁ − q₁, …, Q_{p−1} − q_{p−1}) S*⁻¹ (Q₁ − q₁, …, Q_{p−1} − q_{p−1})ᵀ.

This quantity would have approximately an NCχ²_{p−1}(Σ_{i=1}^p (Y_i − nq_i)²/Y_i)
distribution.
A different type of hypothesis might be H : Θ ∈ P₀, where P₀ is a
parametric family with k-dimensional parameter space Γ with k < p − 1.
This case was considered by Fisher (1924).
Theorem 7.133. Let Γ be a k-dimensional parameter space with parame-
ter Ψ and k < p − 1. Let R₁, …, R_p be a partition of X. Let Y_i be the number
of observations in R_i for i = 1, …, p. Call Y = (Y₁, …, Y_p) the reduced
data. Let S_ψ for ψ ∈ Γ stand for the conditional distribution of the reduced
data given Ψ = ψ. Define q_i(ψ) = S_ψ(R_i) and q(ψ) = (q₁(ψ), …, q_p(ψ)).
Assume that q has at least two derivatives and is one-to-one. Let Ψ̂_n be
the MLE based on the reduced data, and let I_{X₁}(ψ) be the Fisher informa-
tion matrix. Assume that Ψ̂_n is asymptotically normal N_k(ψ, I_{X₁}(ψ)⁻¹/n).
Define q̂_{i,n} = q_i(Ψ̂_n) and

C_n = Σ_{i=1}^p (Y_i − n q̂_{i,n})² / (n q̂_{i,n}).

Then C_n →^D χ²_{p−k−1} as n → ∞.

PROOF. The likelihood function for the reduced data is L(ψ) = Π_{i=1}^p q_i(ψ)^{Y_i}.
Setting the partial derivatives of the log of the likelihood equal to 0 gives
the equations

0 = Σ_{i=1}^p (Y_i / q_i(ψ)) ∂q_i(ψ)/∂ψ_j,

for j = 1, …, k. Since Ψ̂_n is √n-consistent and q is continuous, it follows
that q̂_{i,n} is a √n-consistent estimator of q_i(Ψ). Since Y_i/[n q̂_{i,n}] →^P 1 for each
i, it follows from the likelihood equations (and Problem 7 on page 468) that

Σ_{i=1}^p (Y_i² / (n q̂_{i,n}²)) ∂q_i(Ψ̂_n)/∂ψ_j = o_p(1).    (7.134)

The argument we just finished for the case in which the hypothesis is
simple shows that for every ψ ∈ Γ,

C(ψ) = Σ_{i=1}^p (Y_i − n q_i(ψ))² / (n q_i(ψ)) →^D χ²_{p−1}

under S_ψ. Then

C(ψ) − C_n = Σ_{i=1}^p (Y_i²/n) [1/q_i(ψ) − 1/q̂_{i,n}].

Use the delta method to write

C(ψ) − C_n = Σ_{i=1}^p (Y_i²/n) { −(1/q̂_{i,n}²) Σ_{j=1}^k (ψ_j − Ψ̂_{n,j}) ∂q_i(Ψ̂_n)/∂ψ_j
    + (1/q̂_{i,n}³) [Σ_{j=1}^k (ψ_j − Ψ̂_{n,j}) ∂q_i(Ψ̂_n)/∂ψ_j]²
    − (1/(2 q̂_{i,n}²)) Σ_{j=1}^k Σ_{t=1}^k (ψ_j − Ψ̂_{n,j})(ψ_t − Ψ̂_{n,t}) ∂²q_i(Ψ̂_n)/(∂ψ_j ∂ψ_t) } + o_p(1).

We can rearrange the sum of the first set of terms inside the large brace
and use (7.134) to remove these terms from the sum. Then

C(ψ) − C_n = Σ_{i=1}^p (Y_i²/n) { (1/q̂_{i,n}³) [Σ_{j=1}^k (ψ_j − Ψ̂_{n,j}) ∂q_i(Ψ̂_n)/∂ψ_j]²
    − (1/(2 q̂_{i,n}²)) Σ_{j=1}^k Σ_{t=1}^k (ψ_j − Ψ̂_{n,j})(ψ_t − Ψ̂_{n,t}) ∂²q_i(Ψ̂_n)/(∂ψ_j ∂ψ_t) } + o_p(1).

Since Y_i = O_p(n) and the inner summations are both O_p(1/n), we can
use the fact that Y_i/[n q̂_{i,n}] →^P 1 for every i (and Problem 7 on page 468)
to rewrite C(ψ) − C_n as

n Σ_{i=1}^p { (1/q_i(ψ)) [Σ_{j=1}^k (ψ_j − Ψ̂_{n,j}) ∂q_i(Ψ̂_n)/∂ψ_j]²
    − (1/2) Σ_{j=1}^k Σ_{t=1}^k (ψ_j − Ψ̂_{n,j})(ψ_t − Ψ̂_{n,t}) ∂²q_i(Ψ̂_n)/(∂ψ_j ∂ψ_t) } + o_p(1).

Next, notice that

Σ_{i=1}^p ∂²q_i(Ψ̂_n)/(∂ψ_j ∂ψ_t) = 0 (because Σ_{i=1}^p q_i(ψ) ≡ 1), while
Σ_{i=1}^p (1/q_i(ψ)) (∂q_i(ψ)/∂ψ_j)(∂q_i(ψ)/∂ψ_t) = (I_{X₁}(ψ))_{j,t},

so that C(ψ) − C_n = n (ψ − Ψ̂_n)ᵀ I_{X₁}(ψ) (ψ − Ψ̂_n) + o_p(1).

Since I_{X₁}(ψ)⁻¹/n is the asymptotic covariance matrix of Ψ̂_n, we have that
C(ψ) − C_n →^D χ²_k. Next we prove that C(ψ) − C_n is asymptotically inde-
pendent of C_n. This will make the asymptotic characteristic function of C_n
equal to the ratio of the asymptotic characteristic functions of C(ψ) and
C(ψ) − C_n. Since the former is that of χ²_{p−1} and the latter is that of χ²_k,
the ratio is that of χ²_{p−k−1}.
Define q̂_n = (q̂_{1,n}, …, q̂_{p,n})ᵀ. Use the delta method to write

√n (q̂_n − q(ψ)) = √n V [Ψ̂_n − ψ] + o_p(1),

where V = ((v_{i,j})) is the p × k matrix with v_{i,j} = ∂q_i(ψ)/∂ψ_j. It follows
that √n (q̂_n − q(ψ)) is asymptotically N_p(0, V I_{X₁}(ψ)⁻¹ Vᵀ). Since q is one-
to-one, V has rank k and V⁻ = (VᵀV)⁻¹Vᵀ exists. It is easy to see that
V⁻V is a k × k identity matrix. Hence,

C(ψ) − C_n = n [V(ψ − Ψ̂_n)]ᵀ V⁻ᵀ I_{X₁}(ψ) V⁻ V(ψ − Ψ̂_n) + o_p(1)
           = n (q̂_n − q(ψ))ᵀ V⁻ᵀ I_{X₁}(ψ) V⁻ (q̂_n − q(ψ)) + o_p(1).

Also, since (Y_i − n q̂_{i,n})²/n = O_p(1) for each i and 1/q̂_{i,n} →^P 1/q_i(ψ), we
can use Problem 7 on page 468 to conclude

C_n = Σ_{i=1}^p (Y_i − n q̂_{i,n})² / (n q_i(ψ)) + o_p(1).
The proof will be complete if we can show that Y/n − q̂_n and q̂_n − q(ψ) are
asymptotically independent. Since they are jointly asymptotically multi-
variate normal, it suffices to show that they are asymptotically uncorre-
lated. Since Y/n − q(ψ) = Y/n − q̂_n + (q̂_n − q(ψ)), we need only show that the
asymptotic covariance matrix of q̂_n (namely, V I_{X₁}(ψ)⁻¹ Vᵀ) is the same as
the asymptotic covariance between Y and q̂_n. We find the latter as follows.
First, note that (following some tedious algebra)

E[(Y_t − n q_t(ψ)) Σ_{i=1}^p (Y_i / (n q_i(ψ))) ∂q_i(ψ)/∂ψ_j] = ∂q_t(ψ)/∂ψ_j.

Hence the asymptotic covariance between Y and the vector D(ψ) of partial
derivatives of log L(ψ) is V. Next, use the delta method to write

0 = (1/√n) Σ_{i=1}^p (Y_i / q_i(Ψ̂_n)) ∂q_i(Ψ̂_n)/∂ψ_j
  = (1/√n) Σ_{i=1}^p (Y_i / q_i(ψ)) ∂q_i(ψ)/∂ψ_j − Σ_{t=1}^k m_{n,j,t} √n (Ψ̂_{n,t} − ψ_t) + o_p(1),


where m_{n,j,t} is the coefficient of √n(Ψ̂_{n,t} − ψ_t) in this expansion. Set
M_n = ((m_{n,j,t})), and note that M_n →^P I_{X₁}(ψ). It follows that

(1/√n) D(ψ) = √n I_{X₁}(ψ) (Ψ̂_n − ψ) + o_p(1).

Hence, √n(Ψ̂_n − ψ) = I_{X₁}(ψ)⁻¹ D(ψ)/√n + o_p(1). So, the asymptotic covari-
ance between Y and Ψ̂_n is V I_{X₁}(ψ)⁻¹. Since √n(q̂_n − q(ψ)) − √n V(Ψ̂_n − ψ) = o_p(1), it follows
that the asymptotic covariance between Y and q̂_n is V I_{X₁}(ψ)⁻¹ Vᵀ, and
the proof is complete.
□
In applying Theorem 7.133, one must be careful to calculate the MLE of
Ψ based on the reduced data Y, not on the original data X.
Example 7.135 (Continuation of Example 7.132; see page 462). A more rea-
sonable hypothesis to test in the horsekick data example is that the distribution
of horse kicks is a member of the Poisson family. Because there are no data in the
"≥ 5" category, the likelihood function for the reduced data is the same as for
the original data, and the MLE is the sample average Ψ̂₂₈₀ = 0.7. The six val-
ues of 280 q̂_{i,280} corresponding to i = 0, 1, 2, 3, 4, ≥5 are (139.0, 97.3, 34.1, 7.9, 1.4, 0.2),
respectively, and C₂₈₀ = 2.346, which is the 0.3276 quantile of the χ²₄
distribution.
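The fit can be reproduced in a few lines. A sketch (the statistic comes out near 2.37 here, against the 2.346 reported in the example; the small gap is consistent with rounding in the tabled expected counts):

```python
import math

counts = [144, 91, 32, 11, 2, 0]   # observed Y_i for kills 0, 1, 2, 3, 4, >= 5
n = sum(counts)                    # 280

# MLE from the reduced data; it equals the sample average here only
# because the ">= 5" cell is empty (see the discussion in the example).
psi_hat = sum(k * y for k, y in enumerate(counts)) / n   # 196/280 = 0.7

# Fitted cell probabilities q_i(psi_hat) under the Poisson family;
# the last cell collects all mass at 5 and above.
pmf = [math.exp(-psi_hat) * psi_hat ** k / math.factorial(k) for k in range(5)]
q_hat = pmf + [1.0 - sum(pmf)]

C = sum((y - n * q) ** 2 / (n * q) for y, q in zip(counts, q_hat))

# Reference distribution: chi-squared with p - k - 1 = 6 - 1 - 1 = 4 degrees
# of freedom, whose CDF has the closed form 1 - exp(-x/2)(1 + x/2).
cdf_at_C = 1.0 - math.exp(-C / 2.0) * (1.0 + C / 2.0)
```

Because C sits near the middle of the χ²₄ distribution, there is no evidence against the Poisson hypothesis.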
Example 7.136. Let {X_n}_{n=1}^∞ be exchangeable. Suppose that we want to test
the hypothesis that they have a normal distribution. Let R₁ = (−∞, r₁], R_i =
(r_{i−1}, r_i] for i = 2, …, p − 1, and R_p = (r_{p−1}, ∞). (For convenience, define
r₀ = −∞ and r_p = ∞.) Then q_i(ψ) = Φ([r_i − μ]/σ) − Φ([r_{i−1} − μ]/σ) if ψ =
(μ, σ). The likelihood function to maximize is L(ψ) = Π_{i=1}^p q_i(ψ)^{Y_i}, where Y_i =
Σ_{j=1}^n I_{R_i}(X_j). The MLE will not equal the sample average and sample standard
deviation in general.
Example 7.137. The usual χ² test of independence in a two-way (r × c) con-
tingency table is an example of Theorem 7.133. In this case, the data are not
reduced. That is, each R_i contains only one element of X. In fact, the R_i are the
cells themselves, which would be better denoted R_{i,j} for the cell in row i and
column j (i = 1, …, r, j = 1, …, c). The parameter Ψ consists of two marginal
probability vectors, one for the rows ψ^R and one for the columns ψ^C. Then
q_{i,j}(ψ) = ψ_i^R ψ_j^C. The MLE Ψ̂_n is easily seen to consist of ψ̂_i^R equal to
the row i total divided by n and ψ̂_j^C equal to the column j total divided by
n. One easily verifies that C_n is the usual χ² statistic, and the appropriate degrees
of freedom are (r − 1)(c − 1).
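For a concrete (hypothetical) 2 × 3 table, the statistic of Theorem 7.133 reduces to the familiar observed-versus-expected form, because n q̂_{i,j} = (row i total)(column j total)/n:

```python
table = [[20, 30, 10],
         [30, 40, 30]]                 # hypothetical 2x3 contingency table
n = sum(sum(row) for row in table)
r, c = len(table), len(table[0])
row_tot = [sum(row) for row in table]
col_tot = [sum(table[i][j] for i in range(r)) for j in range(c)]

# MLE under independence: psi_hat_i^R = row_i/n and psi_hat_j^C = col_j/n,
# so the expected count in cell (i, j) is row_i * col_j / n.
C = sum((table[i][j] - row_tot[i] * col_tot[j] / n) ** 2
        / (row_tot[i] * col_tot[j] / n)
        for i in range(r) for j in range(c))

df = (r - 1) * (c - 1)   # p - k - 1 = rc - (r - 1) - (c - 1) - 1
```

The degrees-of-freedom line simply evaluates p − k − 1 with p = rc cells and k = (r − 1) + (c − 1) free marginal parameters.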

7.6 Problems
Section 7.1:

1. Prove Proposition 7.4 on page 396.

2. Prove Proposition 7.14 on page 398.
3. Let X and {X_n}_{n=1}^∞ be random variables, and suppose that X_n →^D X.
Prove that X_n = O_p(1).

4. Let the conditional distribution of {X_n}_{n=1}^∞ be that of IID N(μ, σ²) random
variables given Θ = (μ, σ). Define X̄_n and S_n to be the sample average and
sample standard deviation. Find the asymptotic distribution of X̄_n + 1.96 S_n. That is, find a_n and b_n
so that a_n[X̄_n + 1.96 S_n − b_n] converges in distribution to a nondegenerate
distribution.
5. Suppose that for each θ ∈ Ω, (X_n, B_n, P_{θ,n}) is a sequence of probability
spaces. Let Y_n : X_n → ℝ^k be random vectors for each n. Suppose that
Y_n = o_p(1) for each θ. Let T be a σ-field of subsets of Ω such that for
every n and every A ∈ B_n, P_{θ,n}(A) is T measurable as a function of θ. Let
Q be a probability measure over (Ω, T). Define Q_n(·) = ∫_Ω P_{θ,n}(·) dQ(θ) for
each n so that (X_n, B_n, Q_n) is a probability space. Show that Y_n = o_p(1)
with respect to the sequence {Q_n}_{n=1}^∞.
6. Consider the setup in Problem 5 above. Let X_n = X for all n. If P_{θ,n} →^D P_θ
as n → ∞ for each θ, let Q₀(·) = ∫_Ω P_θ(·) dQ(θ). Prove that Q_n →^D Q₀.
7. Suppose that Z_{i,n} →^P c_i and X_{i,n} = O_p(1) for i = 1, …, p as n → ∞. Show
that Σ_{i=1}^p Z_{i,n} X_{i,n} − Σ_{i=1}^p c_i X_{i,n} = o_p(1).
8. Suppose that ρ(t) ≥ 0 is an even function of t with ρ(t) > 0 for t ≠ 0 and
ρ(t) a strictly increasing function of |t|. If
lim_{n→∞} E_θ ρ(X_n − g(θ)) = 0
for all θ, then show that X_n is consistent for g(θ).
9. *Suppose that √n(Y_n − μ) →^D N_{k+1}(0, Σ). Let U_n be the smallest root of
the polynomial P_n(u) = Σ_{i=0}^k Y_{n,i} u^i, where Y_nᵀ = (Y_{n,0}, …, Y_{n,k}). Let u₀
be the smallest root of p(u) = Σ_{i=0}^k μ_i u^i. Assume that u₀ has multiplicity
exactly 3 (i.e., p(u₀) = p′(u₀) = p″(u₀) = 0, but p‴(u₀) ≠ 0). Find a_n so
that a_n(U_n − u₀) converges in distribution to a nondegenerate distribution,
and find the asymptotic distribution.
10. Let {X_n}_{n=1}^∞ be conditionally IID with N(θ, 1) distribution given Θ = θ. Let Y_n =
Φ((c − X̄_n)/√(1 − 1/n)), where Φ is the standard normal distribution func-
tion and c is a constant. Find a_n and b_n such that a_n(Y_n − b_n) has a
nondegenerate limiting distribution, and find the distribution.
11. Let {X_n}_{n=1}^∞ be conditionally IID Ber(θ) given Θ = θ. Let Y_n = n⁻¹ Σ X_i,
and let g(y) = 2 sin⁻¹ √y (i.e., g⁻¹(z) = sin²(z/2)). Suppose the prior for
Θ has density given by
f_Θ(θ) = c θ^{−1/2} (1 − θ)^{−1/2} exp(−(1/(2σ²)) (g(θ) − μ)²),
where 0 < θ < 1, μ and σ are constants, and c is the appropriate normalizing
constant.
Let Z_n = g(Y_n). Find a_n and b_n such that a_n Z_n + b_n converges in dis-
tribution to a nondegenerate distribution with respect to the marginal
distribution of the data. (Hint: Recall that d sin⁻¹(u)/du = (1 − u²)^{−1/2}.)
Section 7.2:

12. Suppose that F(t) = 1 and F(t − ε) < 1 for all ε > 0, and that F is differ-
entiable at all values less than t with derivative f such that lim_{x↑t} f(x) = c
with 0 < c < ∞, and F is continuous at t. Let {X_n}_{n=1}^∞ be IID with CDF
F and let X_(n) = max{X₁, …, X_n}. Prove that n(t − X_(n)) →^D Exp(c).
13. Prove Proposition 7.34 on page 408.
14. Prove Proposition 7.37 on page 410.
15. Suppose that {X_n}_{n=1}^∞ are conditionally IID with Cauchy distribution hav-
ing median θ given Θ = θ. Calculate the asymptotic efficiency of the sample
median and of the best linear combination of three symmetrically placed
sample quantiles.
16. Let F(x) = [1 + exp(−x)]⁻¹. Assume that {X_n}_{n=1}^∞ are conditionally IID
given Θ = θ with CDF F(x − θ).
(a) Prove that the density is symmetric about θ.
(b) If we wish to use the L-estimator based on the p, 1/2, and 1 − p sample
quantiles, with p < 1/2, find the best p and the best coefficients.
17. *Let {X_n}_{n=1}^∞ be conditionally IID given Θ = θ with density equal to
    if θ − 1/2 < x < θ,
    if θ ≤ x < θ + 1/2.

(a) Find the asymptotic joint distribution of the p, 1/2, and 1 − p sample
quantiles of a sample of size n as n → ∞.
(b) Find the best linear combination of the three sample quantiles p, 1/2,
and 1 − p for estimating Θ.
(c) Try to find the best p if one wishes to estimate Θ using a linear
combination of the three sample quantiles p, 1/2, 1 − p, and show
that the usual analysis fails.
18. In Problem 17 above, find the asymptotic joint distribution of the largest
and smallest order statistics from a sample of size n.
19. Let the conditional distribution of {X_n}_{n=1}^∞ given Θ = θ be IID U(0, θ).
Let X_{(k)}^{(n)} denote the kth order statistic based on X₁, …, X_n. Find a_n and
b_n such that a_n(X_{(k)}^{(n)} − b_n) converges in distribution to a nondegenerate
distribution as n → ∞ for fixed k.
Section 7.3:

20. Return to Problem 16 above.

(a) Find the asymptotic variance of the estimator found in part (b).
(b) Compute the Fisher information I_{X₁}(θ) and the efficiency of the es-
timator found in part (b) of Problem 16.
(c) Compute the efficiency of X̄_n = Σ_{i=1}^n X_i/n as an estimator of Θ.
21. Let {X_n}_{n=1}^∞ be conditionally IID given Θ = θ with distribution U(θ², θ),
where the parameter space is the interval (0, 1).
(a) Find the MLE of Θ.
(b) Find a nondegenerate asymptotic distribution for the MLE.


22. Prove that the relative rate of convergence is unique by first showing the
following. Let a_n, a_n′ > 0 and let H, H′ be CDFs. If a_n(G_n − g(θ)) →^D H
and a_n′(G_n − g(θ)) →^D H′, then lim_{n→∞} a_n′/a_n = c ∈ (0, ∞) and H′(x) =
H(x/c).
23. Prove the claim at the end of Example 7.46 on page 413 about the rela-
tive rate of convergence being the square root of the ARE for asymptotic
variance.
24. Let {X_n}_{n=1}^∞ be conditionally IID with N(θ, 1) distribution given Θ = θ.
Let X̄_n = n⁻¹ Σ_{i=1}^n X_i and S_n = Σ_{i=1}^n (X_i − X̄_n)². Let a_n be such that
Pr(S_n > a_n) = 1/n. Let k_n be the largest integer less than or equal to √n.
Consider the following two estimators: U_n = X̄_{k_n}, the average of the first
k_n observations, and

T_n = { X̄_n  if S_n ≤ a_n,
        n    if S_n > a_n.

(a) Show that the ARE of U_n to T_n is 0 using the criterion of rate of
convergence from Example 7.46 on page 413.
(b) Show that for any fixed ε > 0,

P_θ(|T_n − θ| > ε) = 1/n + o(1/n),

P_θ(|U_n − θ| > ε) = o(1/n).
Comment on this in light of part (a). (Hint: If X ∼ N(0, 1), then
Pr(|X| ≥ c) ≤ 2φ(c)/c.³³)
(c) What happens if we replace ε by a/√n in part (b)?

25. Let Θ > 0 be a parameter, and {X_n}_{n=1}^∞ be a conditionally IID sample
(given Θ = θ) with exponential distribution Exp(θ). Let X̃_n be the sample
median, and let X̄_n be the sample average.
(a) Find a(θ) such that conditional on Θ = θ, √n(log(2)/X̃_n − θ) →^D
N(0, a(θ)).
(b) Using the same criteria as in Example 7.44 on page 413, find the ARE
of log(2)/X̃_n to 1/X̄_n as estimators of Θ.

33This inequality is equivalent to Mill's ratio.



26. Let {X_n}_{n=1}^∞ be conditionally IID with N(θ, 1) distribution given Θ = θ.
We observe Y₁, …, Y_n, where

Y_i = { 0    if X_i ≤ 0,
        X_i  if 0 < X_i < 1,
        1    if X_i ≥ 1.

(a) Find a minimal sufficient statistic.
(b) Construct two different (i.e., different by more than o_p(1/√n)) con-
sistent, asymptotically normal estimates of θ, and compute their ARE
using the same criterion as in Example 7.44 on page 413.
27. Let the parameter space be two-dimensional. Suppose that {T_n}_{n=1}^∞ are
conditionally IID given Θ = (θ₁, θ₂) with density f_{T₁|Θ}(t|θ₁, θ₂). Suppose
that the conditions of Theorem 7.63 hold. Let Θ̂₁ be the MLE of the first
coordinate of Θ. Let Θ̃₁(θ₂) be the MLE of the first coordinate of Θ if
it is assumed that the second coordinate is known to equal θ₂. Use the
same criteria as in Example 7.44 on page 413 to find the ARE of these two
estimators when it is assumed that the second coordinate of Θ is known to
equal θ₂. Express your answer in terms of the Fisher information matrix.
28. Under the conditions of Theorem 7.48, prove that

P_{θ₀}′ [ Π_{i=1}^n f_{X₁|Θ}(X_i|θ₀) ≤ Π_{i=1}^n f_{X₁|Θ}(X_i|θ) infinitely often ] = 0.

29. Verify the conditions of Theorem 7.49 in the case in which the observations
are IID given Θ = θ with density f_{X|Θ}(x|θ) = θe^{−θx}, for x > 0.
30. Assume that Ω = {θ₁, …, θ_m} is finite, and let the prior distribution be
(π₁, …, π_m), with π_i = Pr(Θ = θ_i). Let μ be a measure such that P_θ ≪ μ
for each θ ∈ Ω, and let
f_i(x) = (dP_{θ_i}/dμ)(x).
Assume that {X_n}_{n=1}^∞ are conditionally IID given Θ = θ_i with density
f_i(x) for each i. Let Θ̂_n be the MLE after n observations. Prove
lim_{n→∞} Pr(Θ̂_n = θ_i) = π_i.

31. Let {X_n}_{n=1}^∞ be conditionally IID given Θ = θ with N(θ, 1) distribution.
Let Ω be the set of all integers.
(a) Find the MLE Θ̂_n of Θ and prove that it is unbiased.
(b) Show that there exist positive constants a and b such that, for all
sufficiently large n, Var_θ(Θ̂_n) ≤ a exp(−bn) for every integer θ. (Hint:
Use Mill's ratio from Problem 24(b) on page 470.)
32. Return to the situation in Problem 16 on page 140. Find the MLE of Θ
and its nondegenerate asymptotic distribution.
33. Consider Example 7.60 (page 420) once again. This time, assume that we
observe k ≥ 2 observations with conditional mean μ_i for every i. Find the
MLE of τ² and what it converges to in probability.

34. Prove Proposition 7.71 on page 425.


35. Suppose that {X_n}_{n=1}^∞ are conditionally IID given Θ = θ with the following
discrete logistic conditional density with respect to counting measure on
the integers:

f_{X₁|Θ}(x|θ) = exp(xθ) / (1 + exp(θ) + exp(2θ)),  for x = 0, 1, 2.

(a) Find a √n-consistent estimator of Θ.
(b) Find an explicit form for an asymptotically efficient, asymptotically
normal estimator of Θ based on X₁, …, X_n.
36. Let

ψ(x, θ) = { −1                  if x ≤ θ − 1,
            sin[(π/2)(x − θ)]   if θ − 1 < x < θ + 1,
            1                   if x ≥ θ + 1.

(a) Prove that there is always a solution to Σ_{i=1}^n ψ(X_i, θ) = 0.
(b) Assume that the X_i are conditionally IID N(θ, 1) given Θ = θ. Now
prove that for each ε > 0,

lim_{n→∞} P_{θ₀}(∃ a solution to Σ_{i=1}^n ψ(X_i, θ) = 0 in [θ₀ − ε, θ₀ + ε]) = 1.
37. Suppose that {X_n}_{n=1}^∞ are conditionally IID given Θ = θ with U(−θ, θ)
distribution and Ω = (0, ∞). Find the MLE Θ̂_n of Θ based on n observations, and find
a_n so that a_n(Θ̂_n − θ) has a nondegenerate asymptotic distribution given
Θ = θ. Also find that distribution.
38. Suppose that n arrows are fired at a circular target of radius a whose
center is at the point (0, 0, 0) ∈ ℝ³. The target lies in the plane where
the third coordinate is 0. Suppose that arrow i passes through the point
(X_i, Y_i, 0). Let R_i = √(X_i² + Y_i²). Suppose that (X_i, Y_i) are conditionally
IID N₂(0, θI₂) given Θ = θ. The data we observe are all (X_i, Y_i) pairs for
those arrows that hit the target. We also know n.
(a) Find the distribution of R_i²/θ for an arbitrary arrow (whether or not
it will hit the target).
(b) Find the conditional probability P_θ(R_i ≤ a) that arrow i hits the
target.
(c) Find the MLE Θ̂_n of Θ.
(d) Find the asymptotic distribution of Θ̂_n as n → ∞.
39. *Suppose that Y₁, Y₂ are conditionally independent with Y_i ∼ Bin(n_i, p_i)
given P₁ = p₁, P₂ = p₂, where n₁ and n₂ are known sample sizes. The
parameter space is Ω = {(p₁, p₂) : p₁ ≤ p₂}.
(a) Find the MLE (P̂₁, P̂₂) of (P₁, P₂).
(b) Find the asymptotic distributions of P̂₁ and of P̂₂ as n₁ and n₂ go
to infinity. (Hint: Consider the case p₁ = p₂ separate from the case
p₁ < p₂.)

40. Let {X_n}_{n=1}^∞ be conditionally IID with N(θ, 1) distribution given Θ = θ.
Let Θ̂_n be any estimator whatsoever of Θ. Let ψ be the derivative (with
respect to θ) of the log of the conditional density of each X_i given Θ = θ.
Find the asymptotically efficient estimator of Theorem 7.75.
41. *Suppose that {X_n}_{n=1}^∞ are conditionally IID U(0, θ) random variables given
Θ = θ. The parameter space is the interval [1, 2]. Suppose that we only get
to observe Y_i = I_{[0,1]}(X_i) for each i. That is, we only see whether or not
each observation is between 0 and 1.
(a) Find the MLE of Θ based on Y₁, …, Y_n.
(b) Find the asymptotic (as n → ∞) distribution of the MLE found
above.
(c) In terms of asymptotic efficiency, how does the MLE found above
compare to the MLE based on observing the actual X_i values?
42. Suppose that {X_n}_{n=1}^∞ are conditionally IID given Θ = θ with conditional
density

f_{X₁|Θ}(x|θ) = θ^α α / x^{α+1} I_{[θ,∞)}(x),

where α is known and the parameter space is Ω = (0, ∞).
(a) Find the MLE Θ̂_n of Θ based on X₁, …, X_n.
(b) Prove that Θ̂_n is inadmissible if the loss is squared error.
(c) Find a_n and b_n such that a_n Θ̂_n + b_n has a nondegenerate asymptotic
distribution, and find that distribution.
43. Let {X_n}_{n=1}^∞ be conditionally IID given Θ, a one-dimensional parameter.
Assume the conditions of Theorem 7.63 and that there are no superefficient
estimators. Let Θ̂_n be the MLE of Θ, and let {T_n}_{n=1}^∞ be another sequence
of estimators with √n(T_n − θ) →^D N(0, v(θ)) given Θ = θ for all θ and T_n
a function of (X₁, …, X_n). Consider the joint asymptotic distribution of
√n([Θ̂_n, T_n]ᵀ − θ1) given Θ = θ. Prove that the asymptotic covariance is
1/I_{X₁}(θ). (Hint: Look at the proof of Theorem 5.9 on page 298.)
44. *A psychologist is studying paired subjects. Each person in each pair is asked
a yes-no question. Let X_{i,j} = 1 if person i in pair j answers yes (X_{i,j} = 0
otherwise) for i = 1, 2 and j = 1, 2, …. The psychologist wants to assume
that there are parameters Θ such that all of the X_{i,j} are conditionally
independent given Θ. Suppose that the psychologist believes that there is
a number α such that

Pr(X_{2,j} = 1|Θ = θ) / Pr(X_{2,j} = 0|Θ = θ) = α · Pr(X_{1,j} = 1|Θ = θ) / Pr(X_{1,j} = 0|Θ = θ).

(a) Prove that there exist numbers β₁, β₂, … such that

Pr(X_{1,j} = x, X_{2,j} = y|Θ = θ) = (αβ_j)^y β_j^x / [(1 + αβ_j)(1 + β_j)].   (7.138)

(b) The psychologist decides to let Θ = (A, B₁, B₂, …) so that (7.138)
gives the conditional probability of observing (x, y) for pair j given
Θ = (α, β₁, β₂, …). Observations are then made for j = 1, …, m. Let
Z_j = X_{1,j} + X_{2,j}, T = #{j : Z_j = 1}, and S = Σ_{j=1}^m X_{2,j} I_{{1}}(Z_j) =
Σ_{j:Z_j=1} X_{2,j}. Write the likelihood function.
(c) Find the MLEs of A and B₁, …, B_m. (Hint: First, find the MLEs of
the B_i for fixed value of A, and then find the MLE of A.)
(d) Since the MLE of A depends only on the pairs with Z_j = 1, find
the conditional density of the data given (Z₁, …, Z_m). Also find the
conditional MLE of A based on this distribution.
(e) Show that the conditional MLE found in part (d) is consistent as
m → ∞.

45. Assume that {X_n}_{n=1}^∞ are conditionally IID with N(μ, σ²) distribution
given Θ = (μ, σ). (See the end of Example 7.52 on page 417.) Prove that
for every θ₀ and every compact set C ⊆ Ω, E_{θ₀} Z(Cᶜ, X_i) > 0 in the
notation of Theorem 7.49.

Section 7.4:

46. Use the problem description in Problem 16 on page 75. Show that the
posterior distribution of M given (X₁, …, X_n) is not consistent in the sense
of Theorem 7.78.
47. Assume the conditions of Theorem 7.78. Prove that there exists a subset
A ⊆ Ω with μ_Θ(A) = 1 such that for every θ ∈ A,

P_θ′( lim_{n→∞} μ_{Θ|X₁,…,X_n}(A|X₁, …, X_n) = I_A(θ) ) = 1.

48. Return to the situation in Example 7.82 on page 432. Prove that the pos-
terior probability of C_ε does not almost surely converge to 1 given Θ = θ₀.
(Hint: Rewrite the posterior probability of (0, θ₀) in terms of the random
variable n(θ₀ − X_(n)), where X_(n) is the largest observation. Then use the
result in Example 7.33 on page 408.)
49. Suppose that X₁, …, X_n are conditionally IID with exponential distribu-
tion Exp(θ) given Θ = θ. Let Θ have a Γ(a, b) prior distribution. Use
Laplace's method to construct a formula for the approximation to the pos-
terior mean of Θ. How does this compare to the exact posterior mean?
50. Suppose that X₁, …, X_n are conditionally IID with Laplace distribution
Lap(θ, 1) given Θ = θ. Let the prior distribution of Θ be Lap(0, 1). We
wish to approximate the predictive density of a future observation, namely
∫ [exp(−|x − θ|)/2] f_{Θ|X}(θ|x) dθ for various values of x.
(a) Use Laplace's method to construct a formula for the approximation.
(b) Describe how to use importance sampling to do the approximation.

51. Let Θ = (Γ, Ψ). Suppose that one wishes to test the hypothesis H : Γ = γ₀.
Let the prior probability of H be positive, and suppose that the prior for
Ψ given that H is true is the conditional prior of Ψ given Γ = γ₀ calculated
from the prior on Θ given that H is false. Prove that the approximate Bayes
factor in (4.27) is the same as the Laplace approximation of Theorem 7.116
divided by f_Γ(γ₀) when the hypothesis is H : Γ = γ₀.
52. Let {X_n}_{n=1}^∞ be conditionally IID given (P₁, P₂) = (p₁, p₂) with Ber(p₁ +
p₂) distribution. Let the prior distribution be (P₁, P₂) ∼ Dir₃(α₁, α₂, α₃).
(a) Find the posterior distribution of P₁ given (X₁, …, X_n).
(b) Conditional on (P₁, P₂) = (p₁, p₂), say what happens to the posterior
distribution found in part (a) as n → ∞.

Section 7.5:

53. Let the parameter be Θ = (M₁, M₂, Σ), and suppose that conditional on
Θ = (μ₁, μ₂, σ), X_{1,1}, …, X_{1,n₁}, X_{2,1}, …, X_{2,n₂} are independent with X_{i,j}
having N(μ_i, σ²) distribution for j = 1, …, n_i and i = 1, 2. Prove that the
size α likelihood ratio test of H : M₁ = M₂ versus A : M₁ ≠ M₂ is also the
UMPU level α test.
54. *Let α be a known number strictly between 0 and 1/2. The parameter space
is Ω = {(θ₁, θ₂) : 0 ≤ θ₁ ≤ α, 0 ≤ θ₂ ≤ 1}. Let D be a discrete random
variable with conditional density given (Θ₁, Θ₂) = (θ₁, θ₂) equal to

(h(1 - (2) if d = -2,


(!-o)
2
1- 9 1
1-0
ifd=-I,
if d = 0,
if d = 1,
if d = 2.
The hypothesis of interest is H : Θ₁ = 0, Θ₂ = 1/2 versus A : Θ₁ >
0, Θ₂ ≠ 1/2. We will observe only one D value, and α will be the level for
every test below.
(a) Find the likelihood ratio test and its power function.
(b) Consider the group consisting of two transformations g₊D = D and
g₋D = −D. Show that the testing problem is invariant under the
action of this group.
(c) Show that the likelihood ratio test is invariant.
(d) Find a uniformly most powerful invariant test and compare its power
function to that of the likelihood ratio test.
(e) Find the least powerful invariant test.
55. Suppose that X ∼ N(θ, 1) given Θ = θ. Let Ω_H be the set of rational num-
bers, and let Ω_A be the set of irrational numbers. Prove that the likelihood
ratio test with level α of H : Θ ∈ Ω_H versus A : Θ ∈ Ω_A is the trivial test
φ(x) ≡ 0.
CHAPTER 8
Hierarchical Models

When a model has many parameters, it may be the case that we can con-
sider them as a sample from some distribution. In this way we model the
parameters with another set of parameters and build a model with different
levels of hierarchy. In this chapter we will discuss situations in which it is
natural to model in this way.

8.1 Introduction
8.1.1 General Hierarchical Models
We turn our attention now to a situation in which the observations are
not exchangeable. Suppose, for example, that several treatments are be-
ing administered in a clinical trial. From each treatment group, we will
make some observations. It may be plausible to model the observations
within each treatment group as exchangeable, but it would seem strange
to model all observations as exchangeable. For each treatment group, we
might develop a parametric model as we have done elsewhere in this text. A
hierarchical model for this example involves treating the set of parameters
corresponding to the different treatment groups as a sample from another
population. Prior to seeing any observations, we can model the parameters
as exchangeable. 1 This would mean that we could introduce another set of
parameters to model their joint distribution. These second-level parameters

1 It is not essential that we model the parameters as exchangeable a priori,


but it is mathematically convenient. Such a model corresponds to treating the
different groups symmetrically prior to observing any data.
8.1. Introduction 477

are called hyperparameters. We would then need to specify a distribution
for these hyperparameters. Here are some examples.
Example 8.1. Suppose that there are k treatment groups. Let X_{i,j} stand for
the observed response of subject j in treatment group i. We might invent pa-
rameters M₁, …, M_k and model the X_{i,j} as conditionally independent given
(M₁, …, M_k) = (μ₁, …, μ_k) with X_{i,j} ∼ N(μ_i, 1). We might then model M₁, …,
M_k as a priori exchangeable with N(θ, 1) distribution given Θ = θ. Here,
Θ is a hyperparameter. We should also specify a distribution for Θ.² Note that
we have only one Θ regardless of what k is.
Example 8.2. A survey is conducted in three different cities. Each person sur-
veyed is asked a yes-no question. Treat "yes" as X = 1 and "no" as X = 0.
Then, to each person i in city j, there corresponds a Bernoulli random variable
X_{i,j}. It might seem plausible to treat the X_{i,j} observations from a single city j as
exchangeable. Suppose that we invent three parameters P₁, P₂, P₃. Then we can
model the X_{i,j} for fixed j as conditionally IID Ber(p) given P_j = p. We would
then need to construct a joint distribution for (P₁, P₂, P₃). For instance, we could
model the P_j as exchangeable with Beta(α, β) distribution conditional on A = α
and B = β. Here, A and B are the hyperparameters. We would then need a joint
distribution for (A, B). Note that we only use a single pair (A, B) no matter how
many P_j we have in this simple model.
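The three levels of this model are easy to simulate forward, which makes the structure concrete. The specific hyperprior on (A, B) below is an arbitrary choice made for illustration; nothing in the example dictates it.

```python
import random

random.seed(1)

def simulate_survey(n_cities=3, n_people=50):
    # Level 1 (hyperparameters): draw (A, B); the shifted-exponential
    # hyperprior used here is an illustrative assumption, not from the text.
    a = 0.5 + random.expovariate(1.0)
    b = 0.5 + random.expovariate(1.0)
    # Level 2 (parameters): P_1, ..., P_k are IID Beta(a, b) given (A, B),
    # hence exchangeable a priori.
    ps = [random.betavariate(a, b) for _ in range(n_cities)]
    # Level 3 (data): within city j, responses are IID Bernoulli(p_j).
    data = [[1 if random.random() < p else 0 for _ in range(n_people)]
            for p in ps]
    return (a, b), ps, data

(a, b), ps, data = simulate_survey()
```

Reversing the arrows of this simulation (data to parameters to hyperparameters) is exactly the updating scheme described next.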

The intuitive concept of how hierarchical models work is the following.


Suppose that the data comprise several groups, each of which we consider
to be a collection of exchangeable random variables. From the data in each
group, we obtain direct information about the corresponding parameters.
Thinking of the hyperparameters as known for the time being, we then
update the distributions of the parameters using the data, to get posterior
distributions for the parameters via Bayes' theorem 1.31. Future data (in
each group) are still exchangeable with the same conditional distributions
given the parameters, but the distributions of the parameters have changed.
In fact, the distribution of each parameter (given the hyperparameters) has
now been updated using only the data from its corresponding group. Hence
the parameters are no longer exchangeable. Now, we can also update the
distribution of the hyperparameters. To do this, we first find the conditional
distribution of the data given the hyperparameters. Then, we can use Bayes'
theorem 1.31 again to find the posterior distribution of the hyperparameters
given the data. The marginal posterior of the parameters given the data
is found by integrating the hyperparameters out of the joint posterior of
the parameters and hyperparameters. This is how the data from all groups
combine to provide information about all of the parameters, not just the
ones corresponding to their own group. It is the common dependence of all
parameters on the hyperparameters that allows us to make use of common
information in updating the distributions of all parameters. A diagram of
the directions of influence is given in Figure 8.3. A random variable at the

2We set all variances equal to 1 for simplicity in this first example. In any real
application, the variances would be unknown parameters as well.
478 Chapter 8. Hierarchical Models

FIGURE 8.3. Schematic Diagram of Hierarchical Model

pointed end of an arrow has its conditional distribution calculated given


the random variable at the other end of the arrow. Double-headed arrows
indicate places where Bayes' theorem 1.31 is used. Future data are included
in the diagram to indicate how observed data in all groups affect predictive
distributions of all future data through the hyperparameters.
In theory, the updating can be performed as follows. We can denote the
data to be observed as X, the parameters as Θ, the hyperparameters as Ψ,
and future data as Y. Assume that X and Y are conditionally independent
given Θ and Ψ. Then the conditional posterior density of the parameters
given the hyperparameters is

f_{Θ|X,Ψ}(θ|x, ψ) = f_{X|Θ,Ψ}(x|θ, ψ) f_{Θ|Ψ}(θ|ψ) / f_{X|Ψ}(x|ψ),

where the density of the data given the hyperparameters alone is

f_{X|Ψ}(x|ψ) = ∫ f_{X|Θ,Ψ}(x|θ, ψ) f_{Θ|Ψ}(θ|ψ) dθ.

The marginal posterior distributions of the parameters can be found from

f_{Θ|X}(θ|x) = ∫ f_{Θ|X,Ψ}(θ|x, ψ) f_{Ψ|X}(ψ|x) dψ,

where the posterior density of Ψ given X = x is

f_{Ψ|X}(ψ|x) = f_{X|Ψ}(x|ψ) f_Ψ(ψ) / f_X(x),

and the marginal density of X is

f_X(x) = ∫ f_{X|Ψ}(x|ψ) f_Ψ(ψ) dψ.

Finally, the predictive distribution of future data Y is found from the pos-
terior density of the parameters:

f_{Y|X}(y|x) = ∫∫ f_{Y|Θ,Ψ}(y|θ, ψ) f_{Θ|X,Ψ}(θ|x, ψ) f_{Ψ|X}(ψ|x) dθ dψ.
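For the normal model of Example 8.1 (all variances fixed at 1), every step of this updating is available in closed form. The sketch below computes the posterior of the hyperparameter Θ from the group means and then the posterior means of the M_i; the N(m₀, v₀) hyperprior on Θ and the numerical inputs are illustrative assumptions.

```python
def posterior_theta(xbars, ns, m0=0.0, v0=100.0):
    # Marginally, Xbar_i | Theta ~ N(Theta, 1 + 1/n_i), so the posterior of
    # Theta given the group means is normal with a precision-weighted mean.
    prec = 1.0 / v0
    num = m0 / v0
    for xb, n in zip(xbars, ns):
        w = 1.0 / (1.0 + 1.0 / n)   # precision contributed by group i
        prec += w
        num += w * xb
    return num / prec, 1.0 / prec   # posterior mean and variance of Theta

def posterior_group_means(xbars, ns, m0=0.0, v0=100.0):
    # Conditionally, M_i | Theta = t, data ~ N((n_i*xbar_i + t)/(n_i+1), 1/(n_i+1)),
    # so integrating Theta out gives E[M_i | data] = (n_i*xbar_i + E[Theta|data])/(n_i+1).
    t_mean, _ = posterior_theta(xbars, ns, m0, v0)
    return [(n * xb + t_mean) / (n + 1) for xb, n in zip(xbars, ns)]

xbars, ns = [1.2, 0.8, 2.0], [10, 10, 10]     # hypothetical group means and sizes
theta_mean, theta_var = posterior_theta(xbars, ns)
shrunk = posterior_group_means(xbars, ns)      # each mean pulled toward theta_mean
```

The qualitative effect is the borrowing of strength described above: each group's estimate is shrunk toward the common hyperparameter estimate, which pools information across all groups.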
Hierarchical models were first popularized by Lindley and Smith (1972)
and Smith (1973) for the special case of multivariate normal observations.
Hierarchical models are special cases of partial exchangeability, which we
consider in more detail in the remainder of this section.

8.1.2 Partial Exchangeability


A natural generalization of both hierarchical models and exchangeability
is the concept of partial exchangeability. There are several types of partial
exchangeability. Diaconis and Freedman (1980b) present a good overview
with some examples.
There are two ways to think about what exchangeability means, and each
of them leads to a way of extending the concept to partial exchangeability.
Let X₁, X₂, … be a (possibly finite) sequence of random quantities.
1. They are exchangeable if, for each n, every permutation of n of them
has the same joint distribution as every other permutation of n of
them.
2. They are exchangeable if, for each n, each sequence {z_i}_{i=1}^n of n possible outcomes has the same probability of being the observed value of {X_i}_{i=1}^n as every permutation of {z_i}_{i=1}^n.
The second description has the drawback that it only makes sense, as
stated, for discrete random variables. There are ways to make it precise
for more general cases, but they lose some intuitive appeal in the trans-
lation. Oddly enough, it is this second description that has the greatest
potential for generalizing the concept of exchangeability.
Based on the first description, we have the following restrictive extension
of exchangeability.
Definition 8.4. A sequence X₁, X₂, … is marginally partially exchangeable if it can be partitioned into subsequences X₁^(k), X₂^(k), …, for k = 1, 2, …, such that the random quantities in each subsequence are exchangeable. To which subsequence each X_i belongs must be known in advance, and the subsequences (as well as the original sequence) may be finite or infinite.

This section may be skipped without interrupting the flow of ideas.



If the subsequences are infinite, DeFinetti's representation theorem 1.49 can be applied to each of the subsequences to conclude that there must
exist probability measures corresponding to each subsequence (with some
unspecified joint distribution) such that the random variables in each subse-
quence are conditionally independent given their corresponding probability
measures. Similarly, we can introduce finite-dimensional parametric fam-
ilies for each subsequence and hence reduce the problem of finding joint
distributions for the probability measures to finite-dimensional joint distri-
butions. This is basically what hierarchical models do. (See Examples 8.1
and 8.2 on page 477.)
As an example of an attempt to extend the second description of ex-
changeability, consider a Markov chain {Xn}~=l of Bernoulli random vari-
ables. Clearly, not every permutation of a sequence of possible outcomes
has the same probability. But there are some permutations that do have
the same probability. In particular, two sequences with the same first value
and the same numbers of the four types of transitions (0 to 1, to 0, 1 to
1, and 1 to 0) will have the same probability. For example:
(1,1,0,1,0,0,0,1,0,0,1,1) and (1,1,1,0,0,0,0,1,0,1,0,1).
This is certainly not a case of marginal partial exchangeability. It does,
however, have the intuitive appearance of being a generalization of the sec-
ond description of exchangeability. Diaconis and Freedman (1980c) give an
in-depth treatment of the type of partial exchangeability that characterizes
Markov chains.
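The claim about the two displayed sequences can be checked directly: they share the same first value and the same number of each transition type, so any Markov chain law assigns them the same probability. A small sketch:

```python
# Both sequences from the text start with 1 and have identical counts of
# each transition type, so any Markov chain law gives them equal probability.

def transition_counts(seq):
    """Count consecutive (from, to) pairs in a binary sequence."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(seq, seq[1:]):
        counts[(a, b)] += 1
    return counts

s1 = (1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1)
s2 = (1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1)
```

Both sequences have three 0-to-0, three 0-to-1, three 1-to-0, and two 1-to-1 transitions.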
The most general type of partial exchangeability is described by Diaconis
and Freedman (1984). In fact, it is so general that it is satisfied by arbi-
trary joint distributions. (See Example 2.118 on page 129.) Nevertheless,
each specific instance of partial exchangeability leads to a representation
theorem of the type of Theorem 2.111. In Example 2.116 on page 127 we
saw that Theorem 2.111 contains a reformulation of DeFinetti's represen-
tation theorem 1.49 as a special case. In Section 8.1.3, we give examples of
Theorem 2.111 for partially exchangeable random quantities that are not
exchangeable.

8.1.3 Examples of the Representation Theorem*


In this section, we apply the representation theorem 2.111 to cases of partially exchangeable random quantities.
Example 8.5. This example is the one-way analysis of variance with only two groups and equal variances. To handle more groups is a simple matter, but the notation gets in the way of an initial understanding. Let 𝒳_n = ℝⁿ and 𝒯_n = ℝ² × ℝ⁺. Suppose that there is a deterministic sequence {j_n}_{n=1}^∞ with j_n ∈ {0, 1} for all n. The j_n sequence tells us from which group the nth observation comes. Let

\[ T_n(x) = \left( \sum_{i=1}^n j_i x_i,\; \sum_{i=1}^n (1-j_i) x_i,\; \sum_{i=1}^n x_i^2 \right). \]

Let k_n = Σ_{i=1}^n j_i. Let T_n(·, t) be the uniform distribution on the surface of the sphere of radius \( \sqrt{t_3 - t_1^2/k_n - t_2^2/(n-k_n)} \) around the point whose ith coordinate is j_i t₁/k_n + (1 − j_i) t₂/(n − k_n). One can check that the conditions of Theorem 2.111 are met.
We would like to proceed as in Example 2.117 on page 128, but we cannot assume that the coordinates are IID in the limit distributions. So, we will construct the joint distribution of s₀ observations from group 0 and s₁ observations from group 1 (for fixed s₀ and s₁) given T_n = t, and see what happens as n → ∞. Call these observations Z = (Z₁, …, Z_{s₀}) and W = (W₁, …, W_{s₁}). Let

\[ \mu_n = \frac{t_2}{n-k_n}, \qquad \nu_n = \frac{t_1}{k_n}, \qquad \sigma_n^2 = \frac{1}{n}\left( t_3 - \frac{t_1^2}{k_n} - \frac{t_2^2}{n-k_n} \right). \]

Then

\[ f_{Z,W|T_n}(z,w|t) = \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-s_0-s_1}{2}\right)\, (n\pi\sigma_n^2)^{\frac{s_0+s_1}{2}}} \left( 1 - \frac{\sum_{i=1}^{s_0}(z_i-\mu_n)^2 + \sum_{i=1}^{s_1}(w_i-\nu_n)^2}{n\sigma_n^2} \right)^{\frac{n-s_0-s_1-2}{2}}. \]

Since

\[ \lim_{n\to\infty} \frac{\Gamma\!\left(\frac{n}{2}\right)}{\Gamma\!\left(\frac{n-s_0-s_1}{2}\right) n^{\frac{s_0+s_1}{2}}} = 2^{-\frac{s_0+s_1}{2}}, \]

we have that \( f_{Z,W|T_n} \) is asymptotically equivalent to

\[ (2\pi\sigma_n^2)^{-\frac{s_0+s_1}{2}} \exp\!\left( -\frac{\sum_{i=1}^{s_0}(z_i-\mu_n)^2 + \sum_{i=1}^{s_1}(w_i-\nu_n)^2}{2\sigma_n^2} \right). \]

If σ_n converges to σ ∈ (0, ∞) and μ_n and ν_n converge to μ and ν, respectively, this function converges uniformly on compact sets to the density of s₀ N(μ, σ²) and s₁ N(ν, σ²) random variables, all independent. If σ_n² goes to ∞, there is no limit distribution. If σ_n² goes to 0, and μ_n → μ and ν_n → ν, the function converges to 0 uniformly outside of every open neighborhood of the point with ith coordinate j_i ν + (1 − j_i) μ. In this case the limit distribution is point masses at ν and μ depending on whether j_i = 1 or 0. Finally, if σ_n goes to a finite value and either μ_n or ν_n diverges to ±∞, there is no limit distribution. The extreme distributions either have all coordinates degenerate or have the coordinates being independent normal random variables with common variance. In either case, there are two different means depending on the values of the sequence {j_n}_{n=1}^∞.

Lauritzen (1984, 1988) shows how to characterize much more general


normal linear models using the representation of Theorem 2.111. See Prob-
lem 1 on page 532 for an example of the characterization of a conditionally
partially exchangeable sequence by means of Theorem 2.111.

Aldous (1981) introduces a special kind of partial exchangeability that


arises in the two-way analysis of variance.
Definition 8.6. Let X = ((X_{i,j}))_{i,j=1}^∞ be an array of random variables. Let R_i = (X_{i,1}, X_{i,2}, …) and C_j = (X_{1,j}, X_{2,j}, …). We say that X is row and column exchangeable if both {R_n}_{n=1}^∞ and {C_n}_{n=1}^∞ are exchangeable sequences.
Row and column exchangeability is a special case of the conditions of Theorem 2.111.
Example 8.7. Let X be a row and column exchangeable array. Let {(r_n, c_n)}_{n=1}^∞ be a sequence of pairs of integers such that r_{n+1} ≥ r_n and c_{n+1} ≥ c_n with at least one inequality strict for each n. Let 𝒳_n be ℝ^{r_n c_n}, and let X_n be the first r_n rows and c_n columns of X. That is, we add at least one row and/or at least one column each time we increase n.³ Let T_n = (T_n¹, T_n²), where T_n¹ and T_n² are defined as follows. For each row of X_n, construct the order statistic (smallest to largest) of the numbers in that row and then arrange these order statistics according to the smallest value in the row. Call the result T_n¹. Define T_n² by doing the same thing to the columns. For example, suppose that

    X_n = ( -1   3   2   0
             4  -2   1   3
             1   0  -1   2 ).

Then

    T_n¹ = ( (-2,1,3,4), (-1,0,1,2), (-1,0,2,3) ),
    T_n² = ( (-2,0,3), (-1,1,2), (-1,1,4), (0,2,3) ).

Basically, you throw away the information about in which row and in which column each number was, but you keep the information about which other numbers were in the same row and column with each number. Each of the r_n!c_n! matrices that can be obtained from X_n by permuting the rows and then the columns will have the same value of T_n. Similarly, all of those r_n!c_n! arrays can be constructed from T_n by a somewhat more tedious algorithm. Clearly, T_n(·, t) must be uniform over those r_n!c_n! arrays to preserve row and column exchangeability.
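The statistic T_n described above can be computed mechanically. The sketch below uses its own small illustrative matrix, and breaks ties between equal smallest values lexicographically (the text does not specify a tie rule):

```python
# A sketch of the statistic T_n: the order statistic of each row, arranged
# by smallest value, and the same for the columns.  The matrix and the
# tie-breaking rule (lexicographic) are illustrative choices.

def row_col_summary(x):
    """Return (T1, T2): sorted rows and sorted columns, each collection
    arranged by its smallest element (ties broken lexicographically)."""
    t1 = sorted(tuple(sorted(row)) for row in x)
    t2 = sorted(tuple(sorted(col)) for col in zip(*x))
    return t1, t2

x = [[-1, 3, 2, 0],
     [4, -2, 1, 3],
     [1, 0, -1, 2]]
t1, t2 = row_col_summary(x)

# Permuting rows and columns (here: reversing both) leaves the summary alone.
x_perm = [list(reversed(row)) for row in reversed(x)]
```

Invariance under row and column permutations is exactly why every one of the r_n!c_n! rearrangements of X_n produces the same value of T_n.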
Finding all of the Q(·, λ) distributions is no small task. Aldous (1981) proves the following result. An array X is row and column exchangeable if and only if there exists a measurable function f : [0,1]⁴ → ℝ such that X has the same distribution as Y = ((Y_{i,j})), with Y_{i,j} = f(M, A_i, B_j, C_{i,j}), where M, A₁, A₂, …, B₁, B₂, …, C_{1,1}, … are all IID U(0,1) random variables.
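Aldous's representation also gives a recipe for simulating row-and-column exchangeable arrays. The sketch below uses an arbitrary additive choice of f, purely for illustration:

```python
import random

# A sketch of Aldous's representation: a row-and-column exchangeable array
# generated as Y[i][j] = f(M, A[i], B[j], C[i][j]) with all arguments iid
# U(0,1).  The particular f used below is an arbitrary illustrative choice.

def rce_array(n_rows, n_cols, f, rng):
    """Generate an n_rows x n_cols array via the representation theorem."""
    m = rng.random()
    a = [rng.random() for _ in range(n_rows)]
    b = [rng.random() for _ in range(n_cols)]
    return [[f(m, a[i], b[j], rng.random()) for j in range(n_cols)]
            for i in range(n_rows)]

def f(m, a, b, c):
    # an additive example of a measurable f : [0,1]^4 -> R
    return m + a + b + 0.1 * c

rng = random.Random(0)
y = rce_array(3, 4, f, rng)
```

Because the row effects A_i and column effects B_j enter symmetrically across rows and columns, permuting rows or columns of Y does not change its distribution.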

³Technically, there is a way to write Theorem 2.111 so that it applies to partially ordered sets like the set of all (r, c) pairs, but the author thought that the proof of Theorem 2.111 was complicated enough without introducing this added level of mathematical detail.

8.2 Normal Linear Models


A particularly simple case with which to work is that of linear models in
which the observables are modeled as having normal distributions given
parameters and the parameters are also modeled as jointly normal (except
for the scale parameters).

8.2.1 One-Way ANOVA


Suppose that we will observe data from k different treatment groups. Let the jth observation in the ith group be X_{i,j}. Suppose that we model the X_{i,j} as conditionally independent N(μ_i, σ²) random variables given M = (μ₁, …, μ_k) and Σ = σ, for j = 1, …, n_i and i = 1, …, k. Next, suppose that we model M = (M₁, …, M_k) as a vector of IID N(ψ, τ²) random variables given Ψ = ψ and T = τ. To be precise, we should have said that the X_{i,j} have N(μ_i, σ²) distribution given M = (μ₁, …, μ_k), Σ = σ, Ψ = ψ, and T = τ. Next, we model Ψ as N(ψ₀, τ²/ζ₀) given T = τ and Σ = σ. Finally, we need a joint distribution for (Σ, T), which will remain unspecified for now.

The joint distribution of all quantities can be summarized as in Table 8.9. Future observations have a distribution like the first stage. The posterior distribution has only the last three stages. Let X stand for the entire data vector, and let x be the observed value.

Conditional on Σ = σ, T = τ, and Ψ = ψ, the posterior of the M_i is found from simple normal distribution updating. The M_i, given Σ = σ, T = τ, and Ψ = ψ, are independent with M_i having distribution

\[ M_i \sim N\!\left( \mu_i(\psi,\sigma,\tau),\; \frac{\sigma^2\tau^2}{n_i\tau^2+\sigma^2} \right), \tag{8.8} \]

where

\[ \mu_i(\psi,\sigma,\tau) = \frac{n_i \bar{x}_i \tau^2 + \psi\sigma^2}{n_i\tau^2 + \sigma^2}. \]
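The normal-normal update in (8.8) can be written out directly; the sketch below follows the notation of this section, with illustrative inputs:

```python
# The conditional posterior (8.8) written out directly: a normal-normal
# update for each group mean, assuming the notation of this section.

def posterior_M_i(xbar_i, n_i, psi, sigma2, tau2):
    """Mean and variance of M_i given the data, Psi = psi, Sigma^2 = sigma2,
    and T^2 = tau2; the M_i are conditionally independent across groups."""
    denom = n_i * tau2 + sigma2
    mean = (n_i * xbar_i * tau2 + psi * sigma2) / denom
    var = sigma2 * tau2 / denom
    return mean, var

m, v = posterior_M_i(xbar_i=5.0, n_i=4, psi=1.0, sigma2=2.0, tau2=1.0)
```

The posterior mean is a precision-weighted compromise between the group average x̄_i and the hyperparameter ψ: as n_i grows it moves toward x̄_i, and as τ² → 0 it collapses to ψ.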
The conditional joint distribution of the data given Ψ = ψ, Σ = σ, and

TABLE 8.9. Hierarchical Model for One-Way ANOVA

  Stage            Density
  Data             (2πσ²)^{-n/2} exp{ −(1/(2σ²)) Σ_{i=1}^k [n_i(x̄_i − μ_i)² + (n_i − 1)s_i²] },  with n = Σ_{i=1}^k n_i
  Parameter        (2πτ²)^{-k/2} exp{ −(1/(2τ²)) Σ_{i=1}^k (μ_i − ψ)² }
  Hyperparameter   √ζ₀ (2πτ²)^{-1/2} exp{ −(ζ₀/(2τ²)) (ψ − ψ₀)² }
  Variance         f_{Σ,T}(σ, τ)

T = τ, is that the X̄_i and S_i² are independent with⁴

\[ \bar{X}_i \sim N\!\left(\psi,\; \tau^2 + \frac{\sigma^2}{n_i}\right), \qquad S_i^2 \sim \frac{\sigma^2}{n_i-1}\,\chi^2_{n_i-1}. \tag{8.10} \]

It follows that the posterior of Ψ conditional on Σ = σ and T = τ is

\[ \Psi \mid X=x, \Sigma=\sigma, T=\tau \;\sim\; N\!\left( \psi_1(\sigma,\tau),\; \left[ \frac{\zeta_0}{\tau^2} + \sum_{i=1}^k \frac{n_i}{\sigma^2+n_i\tau^2} \right]^{-1} \right), \tag{8.11} \]

where

\[ \psi_1(\sigma,\tau) = \left[ \frac{\zeta_0}{\tau^2} + \sum_{i=1}^k \frac{n_i}{\sigma^2+n_i\tau^2} \right]^{-1} \left[ \frac{\zeta_0\psi_0}{\tau^2} + \sum_{i=1}^k \frac{n_i \bar{x}_i}{\sigma^2+n_i\tau^2} \right]. \]

To find the posterior distribution of (Σ, T), let X̄ stand for the vector with coordinates X̄₁, …, X̄_k. Then, the conditional distribution of the data given Σ = σ and T = τ is

\[ \bar{X} \sim N_k(\psi_0 1,\, W(\sigma,\tau)), \qquad S_i^2 \sim \frac{\sigma^2}{n_i-1}\,\chi^2_{n_i-1}, \tag{8.12} \]

where

\[ W(\sigma,\tau) = \mathrm{diag}\!\left( \frac{\sigma^2}{n_1}+\tau^2, \ldots, \frac{\sigma^2}{n_k}+\tau^2 \right) + \frac{\tau^2}{\zeta_0}\, 1 1^\top. \]

It follows that \( f_{\Sigma,T|\bar X,S_1^2,\ldots,S_k^2}(\sigma,\tau|\bar x, s_1^2,\ldots,s_k^2) \) is proportional to

\[ f_{\Sigma,T}(\sigma,\tau)\, \sigma^{-[n_1+\cdots+n_k-k]}\, |W(\sigma,\tau)|^{-1/2} \exp\!\left( -\frac{1}{2} \sum_{i=1}^k \frac{(n_i-1)s_i^2}{\sigma^2} - \frac{1}{2} (\bar x - \psi_0 1)^\top W^{-1}(\sigma,\tau)(\bar x - \psi_0 1) \right). \]

There is one special case in which the above formulas simplify tremendously. For fixed λ, suppose that

\[ T = \Sigma/\sqrt{\lambda}. \]

⁴The reduced model in (8.10) is sometimes called a variance components model. In this model, the vector M is not of interest, but rather only the variance T of its coordinates. The two terms involving T and σ are components of the variance of the observations. Hill (1965) gives a Bayesian analysis of such models.

That is, T is just a known scalar multiple of Σ. In this case, it is convenient to define γ₀ = λζ₀ so that

\[ \psi_1(\sigma,\tau) = \frac{\gamma_0\psi_0 + \sum_{i=1}^k \gamma_i \bar{x}_i}{\gamma_0 + \sum_{i=1}^k \gamma_i}, \qquad W(\sigma,\tau) = \sigma^2\left[ \mathrm{diag}(\gamma_1^{-1},\ldots,\gamma_k^{-1}) + \frac{1}{\gamma_0}\, 1 1^\top \right], \]

where

\[ \lambda_i = \lambda + n_i, \qquad \gamma_i = \frac{n_i\lambda}{\lambda_i}. \]

Note how μ_i and ψ₁ no longer depend on σ and τ. In fact, we can use Proposition 8.13 below to show that

\[ |W(\sigma,\tau)| = \sigma^{2k} \prod_{i=1}^k \gamma_i^{-1} \left( 1 + \frac{1}{\gamma_0}\sum_{i=1}^k \gamma_i \right). \]

Proposition 8.13.⁵ Let B be a positive definite k × k matrix, and let x be a vector of dimension k. Define A = B + c x xᵀ; then

\[ A^{-1} = B^{-1} - \frac{c}{1 + c\, x^\top B^{-1} x}\, B^{-1} x x^\top B^{-1}, \]

and |A| = |B| (1 + c xᵀB⁻¹x).
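Proposition 8.13 is the familiar rank-one (Sherman-Morrison) update; it is easy to verify numerically on an arbitrary positive definite B:

```python
import numpy as np

# Proposition 8.13 (rank-one update): a quick numerical check.

def rank_one_inverse(B_inv, x, c):
    """Inverse of A = B + c x x^T, computed from B^{-1} as in the proposition."""
    Bx = B_inv @ x
    return B_inv - (c / (1.0 + c * (x @ Bx))) * np.outer(Bx, Bx)

def rank_one_det(det_B, B_inv, x, c):
    """Determinant of A = B + c x x^T, computed from |B| as in the proposition."""
    return det_B * (1.0 + c * (x @ B_inv @ x))

B = np.array([[4.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])
c = 0.5
A = B + c * np.outer(x, x)
```

This is exactly how |W(σ, τ)| and W⁻¹(σ, τ) are obtained from the diagonal-plus-rank-one form of W.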
In this case, there is a simple conjugate prior for Σ which makes the posterior of Σ of the same form. That would be that Σ² has inverse gamma distribution Γ⁻¹(a₀/2, b₀/2). The posterior of Σ² would be Γ⁻¹(a₁/2, b₁/2), where a₁ = a₀ + Σ_{i=1}^k n_i, γ_* = Σ_{i=1}^k γ_i, ū = Σ_{i=1}^k γ_i x̄_i/γ_*, and

\[ b_1 = b_0 + \sum_{i=1}^k (n_i-1)s_i^2 + \sigma^2 (\bar x - \psi_0 1)^\top W^{-1}(\sigma,\tau)(\bar x - \psi_0 1) \]
\[ \phantom{b_1} = b_0 + \sum_{i=1}^k \left\{ (n_i-1)s_i^2 + \gamma_i(\bar{x}_i - \bar{u})^2 \right\} + \frac{\gamma_*\gamma_0}{\gamma_0+\gamma_*}\, (\bar{u} - \psi_0)^2. \]

5This proposition is also used in the analysis of the two-way ANOVA in Sec-
tion 8.2.2.

Posterior distributions for linear functions of location parameters are now t distributions, as are predictive distributions of future observations. For example, if Y is the average of m future observations from population i, then some of the various posterior and predictive distributions are (in location and squared-scale form)

\[ M_i \mid X=x \sim t_{a_1}\!\left( \frac{n_i\bar{x}_i + \lambda\psi_1}{\lambda_i},\; \frac{b_1}{a_1}\left[ \frac{1}{\lambda_i} + \frac{\lambda^2}{\lambda_i^2(\gamma_0+\gamma_*)} \right] \right), \]
\[ Y \mid X=x \sim t_{a_1}\!\left( \frac{n_i\bar{x}_i + \lambda\psi_1}{\lambda_i},\; \frac{b_1}{a_1}\left[ \frac{1}{m} + \frac{1}{\lambda_i} + \frac{\lambda^2}{\lambda_i^2(\gamma_0+\gamma_*)} \right] \right). \]

When T is not a known scalar multiple of Σ, there is still a way to simplify the formulas slightly. That is, introduce Λ = Σ²/T² as a replacement for T in the hyperparameter. Then the simplified formulas are still correct as long as they are understood to represent conditional distributions given Λ = λ. In this case, it is also possible to let the values a₀, b₀, ζ₀, and ψ₀ depend on λ, if one wishes. The posterior for Λ is not particularly simple, but it is the only part of the posterior that is not simple. It is proportional to f_Λ(λ) times

\[ \left( \prod_{i=1}^k \gamma_i \right)^{1/2} \left( \frac{\gamma_0}{\gamma_0+\gamma_*} \right)^{1/2} \frac{\Gamma(a_1/2)\, b_0(\lambda)^{a_0/2}}{\Gamma(a_0/2)\, b_1(\lambda)^{a_1/2}}. \]

Numerical integration is required to make any marginal (not conditional on λ) or predictive inferences.
The model just described can be used to find a solution to the problem that gave rise to the James-Stein estimator in Section 3.2.3. In that problem Pr(Σ = 1) = 1, but otherwise it is the same. Assuming Σ to be unknown, the above model gives the posterior mean of M_j to be

\[ \mathrm{E}(M_j|X=x) = \int_0^\infty \frac{n_j\bar{x}_j + \lambda\, \dfrac{\sum_{i=1}^k \gamma_i\bar{x}_i + \zeta_0\lambda\psi_0}{\zeta_0\lambda + \sum_{i=1}^k \gamma_i}}{\lambda_j}\, f_{\Lambda|X}(\lambda|x)\, d\lambda. \]

Since the integration on the right-hand side is over λ, we should see explicitly where λ is. So, we rewrite the formula as

\[ \mathrm{E}(M_j|X=x) = \int_0^\infty \frac{n_j\bar{x}_j + \lambda\, \dfrac{\sum_{i=1}^k \frac{n_i\bar{x}_i}{n_i+\lambda} + \zeta_0\psi_0}{\sum_{i=1}^k \frac{n_i}{n_i+\lambda} + \zeta_0}}{n_j+\lambda}\, f_{\Lambda|X}(\lambda|x)\, d\lambda \]
\[ \phantom{\mathrm{E}(M_j|X=x)} = \mathrm{E}[Q_j(\Lambda)|X=x]\,\bar{x}_j + \mathrm{E}[\{1-Q_j(\Lambda)\}v(\Lambda)|X=x], \]

where Q_j(λ) = n_j/(n_j + λ) and v(λ) = ψ₁(σ, τ) is itself a weighted average of ψ₀ and a weighted average of all of the sample averages. This is similar

to the empirical Bayes modification to the James-Stein estimator, namely (3.55) on page 165. The way it behaves can be understood as follows. Λ is a measure of how much more spread there is within each group relative to the spread between groups. Now, suppose that all n_i = 1. Then v(λ) is k/(1 + λ) times x̄ plus ζ₀ times ψ₀, all divided by k/(1 + λ) + ζ₀. If the posterior distribution of Λ is concentrated near 0 (that is, there is far less variation within groups than between), then v(Λ) will get very little weight, since Q_j(Λ) will be close to 1. This makes sense because the large spread between the means suggests that the information from x̄_j is much more valuable than that from the other x̄_i. If, however, Λ has lots of mass at large values, then there will be a great deal of shrinkage, and v(Λ) will be near ψ₀. For distributions of Λ concentrated on intermediate values, x̄_j, x̄, and ψ₀ all receive moderate weight.
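The shrinkage behavior just described can be sketched directly; the data values below are made up, and the function follows the Q_j, v, ζ₀, ψ₀ notation of this section:

```python
# A sketch of the shrinkage behavior: small Lambda leaves the group average
# nearly untouched, large Lambda pulls the estimate toward psi0.

def shrinkage_mean(xbar, n, j, lam, zeta0, psi0):
    """E(M_j | x, Lambda = lam) = Q_j(lam) xbar_j + (1 - Q_j(lam)) v(lam)."""
    q_j = n[j] / (n[j] + lam)
    num = sum(ni * xi / (ni + lam) for ni, xi in zip(n, xbar)) + zeta0 * psi0
    den = sum(ni / (ni + lam) for ni in n) + zeta0
    v = num / den
    return q_j * xbar[j] + (1.0 - q_j) * v

xbar = [1.0, 2.0, 6.0]
n = [1, 1, 1]
little = shrinkage_mean(xbar, n, 0, lam=1e-8, zeta0=0.1, psi0=0.0)
heavy = shrinkage_mean(xbar, n, 0, lam=1e10, zeta0=0.1, psi0=0.0)
```

With λ near 0 the estimate is essentially x̄_j itself; with λ enormous it collapses to ψ₀, matching the discussion above.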
Example 8.14. Consider the following data gathered from three groups:

  Group i      1         2         3
  n_i          10        12        15
  x̄_i         27.9268   18.1622   19.5475
  s_i²         23.8227   57.6736   32.3858

Suppose that we want to have Σ² and T² be independent in the prior distribution. Suppose that the prior for Σ² is Γ⁻¹(a₀/2, b₀/2) and the prior for T² is Γ⁻¹(c₀/2, d₀/2). Then Λ has the distribution of b₀c₀/(a₀d₀) times an F_{c₀,a₀} random variable. The conditional distribution of Σ² given Λ = λ can be shown (see Problem 11 on page 534) to be Γ⁻¹(ā₀, b̄₀(λ)), where ā₀ = [a₀ + c₀]/2 and b̄₀(λ) = [b₀ + λd₀]/2. Suppose that the rest of the prior distribution is specified by ψ₀ = 10, ζ₀ = 0.1, a₀ = 1, b₀ = 10, c₀ = 1, and d₀ = 1. The posterior distribution of Λ can be found approximately; its mode is around 1.07, and it has probability of about 0.94 of Λ ≤ 10. Hence Q_j(λ) = n_j/[λ + n_j] is close to 1 with high probability for all j, and there will be little shrinkage toward the overall mean.

We can numerically calculate the posterior distributions of the three M_i using either Laplace's method (Section 7.4.3) or importance sampling (Section 8.7). For Laplace's method, the "θ" is λ, and the function g(λ) is one of the posterior densities of the M_i given Λ = λ evaluated at various values of μ. These densities are t_{a₁} with location and scale

\[ \frac{n_i\bar{x}_i + \psi_1(\lambda)\lambda}{\lambda+n_i} \quad\text{and}\quad \sqrt{ \frac{\bar{b}_1(\lambda)}{a_1} }\, \sqrt{ \frac{1}{\lambda+n_i} + \frac{\lambda^2}{(\lambda+n_i)^2} \left( \zeta_0\lambda + \sum_{i=1}^3 \gamma_i \right)^{-1} }, \]

where ψ₁(λ) = (ζ₀λψ₀ + Σ_{i=1}^3 γ_i x̄_i)/(ζ₀λ + Σ_{i=1}^3 γ_i), and b̄₁(λ) is the same as b₁ with b₀ replaced by b̄₀(λ).

For importance sampling, we sampled 1000 values from the prior distribution of Λ and used these to approximate the integrals that equal the posterior densities at various μ values. We also used the delta method to calculate standard deviations for the density values and found these to be at most 0.09 times the density values in all cases (less than 0.05 times the density in 80% of the cases).

The three posterior means were calculated from the posterior densities and were found to equal 26.08, 18.86, and 19.86. We see that some shrinkage has occurred. The numerically evaluated densities are shown in Figure 8.15 together with the results of an empirical Bayes analysis to be described in Section 8.4 and a successive substitution sampling analysis to be described in Section 8.5.
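The importance-sampling strategy used in the example — draw Λ from its prior and weight by the likelihood — can be sketched in a few lines. The prior and likelihood below are deliberately simplified stand-ins, not those of Example 8.14:

```python
import math, random

# A sketch of importance sampling with the prior as proposal: each weight
# is simply the likelihood evaluated at the prior draw.

def importance_mean(log_lik, draws):
    """Approximate the posterior mean of Lambda from prior draws."""
    w = [math.exp(log_lik(lam)) for lam in draws]
    total = sum(w)
    return sum(wi * lam for wi, lam in zip(w, draws)) / total

rng = random.Random(1)
draws = [rng.expovariate(1.0) for _ in range(20000)]     # Exp(1) "prior"
log_lik = lambda lam: -0.5 * (lam - 1.0) ** 2 / 0.01     # peaked near 1
est = importance_mean(log_lik, draws)
```

The same weighted-average device approximates the posterior densities of the M_i at each μ value, and the delta method applied to ratio estimators of this kind gives the standard deviations quoted in the example.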

[Figure 8.15. Numerical Approximations to Posterior Densities: posterior densities of the M_i plotted against μ (roughly 10 to 35), with curves for importance sampling, Laplace's method, SSS, and empirical Bayes.]

One could generalize this model to the case in which the variance of X_{i,j} conditional on the parameters is Σ_i². In this case, one can only obtain closed-form posteriors conditional on all of the variance parameters, Σ₁², …, Σ_k², T². Numerical integration over all k + 1 variance parameters would then be needed. We postpone illustration of this until Section 8.5, at which time we introduce an alternative method of solution that is better suited to this type of problem.

8.2.2 Two-Way Mixed Model ANOVA*


In this section, we examine a two-way analysis of variance with one random effect and one fixed effect and equal numbers of observations per cell. The recommended analysis of this model will be described in Section 8.5. The analysis given here is mainly motivational as well as illustrative.
Suppose that

\[ Y_{i,j,k} = M + A_i + B_j + (AB)_{i,j} + \epsilon_{i,j,k}, \tag{8.16} \]

where A stands for the random effect and B stands for the fixed effect and

\[ \sum_{j=1}^b B_j = 0 = \sum_{j=1}^b (AB)_{i,j}, \quad\text{for all } i, \tag{8.17} \]

for i = 1, …, a, j = 1, …, b, and k = 1, …, m. We suppose that the ε_{i,j,k}

This section may be skipped without interrupting the flow of ideas.



are conditionally IID N(0, σ_e²) given Σ_e = σ_e (and all other parameters). It is also traditional to assume that, conditional on other parameters, the A_i are independent of each other and of the B_j and the (AB)_{i,j}, that the B_j are independent of the (AB)_{i,j}, and that the (AB)_{i,j} for different i are independent of each other. We can let

\[ M_{i,j} = M + A_i + B_j + (AB)_{i,j}, \]

and put these into vectors M_i = (M_{i,1}, …, M_{i,b})ᵀ. We can then express the model described above by saying that the M_i are conditionally independent N_b(θ, Σ) vectors given Θ = θ and the variance parameters. Here Σ is a b × b matrix and θ = (M + B₁, …, M + B_b)ᵀ. In order to ensure that (8.17) is reflected in the conditional distribution of M_i given M, we assume that Σ has the form

\[ \Sigma = \sigma_A^2\, 1 1^\top + \sigma_{AB}^2 \left[ I - \frac{1}{b}\, 1 1^\top \right], \]

where 1 is b-dimensional. At the next stage of the hierarchy, assume that the coordinates of Θ, namely Θ₁, …, Θ_b, are conditionally IID N(μ, σ_B²) given M = μ and Σ_B = σ_B (and other parameters), since Θ_j = M + B_j in the notation of (8.16). At the next stage, model M ~ N(μ₀, σ_B²/ζ₀) given Σ_B = σ_B (and other parameters). Finally, Σ_e, Σ_A, Σ_AB, and Σ_B have some joint distribution. In summary, we have Table 8.18.
One way to proceed, after collecting data, would be to march through the levels of the hierarchy, finding all of the posterior distributions. This is done in much the same way as in the simpler model of Section 8.2.1, but with an extra level in the hierarchy. Alternatively, we could take an approach that is typically used in the classical analysis of this model. That approach is to pretend that some of the parameters are not of interest and integrate them out of the model. In particular, the M_i and Θ_j are usually integrated out of the classical analysis. This is easy to do in the model of
TABLE 8.18. Hierarchical Model for Two-Way ANOVA

  Data stage: Y_{i,j,k}, for all i, j, k, independent N(μ_{i,j}, σ_e²), conditional on all M_{i,j} = μ_{i,j}, M = μ, Θ = θ, Σ_e = σ_e, Σ_A = σ_A, Σ_B = σ_B, Σ_AB = σ_AB.
  Parameter stage: M_i, for all i, independent N_b(θ, σ_A² 11ᵀ + σ_AB²[I − (1/b)11ᵀ]), conditional on M = μ, Θ = θ, Σ_e = σ_e, Σ_A = σ_A, Σ_B = σ_B, Σ_AB = σ_AB.
  Hyperparameter stage: Θ_j, j = 1, …, b, independent N(μ, σ_B²), conditional on M = μ, Σ_e = σ_e, Σ_A = σ_A, Σ_B = σ_B, Σ_AB = σ_AB.
  Hyperhyperparameter stage: M, distributed N(μ₀, σ_B²/ζ₀), conditional on Σ_e = σ_e, Σ_A = σ_A, Σ_B = σ_B, Σ_AB = σ_AB.
  Variance stage: Σ_e, Σ_B, Σ_A, Σ_AB, with whatever joint distribution is chosen.

Table 8.18. To do this, we first note that sufficient statistics are

\[ RSS = \sum_{i=1}^a \sum_{j=1}^b \sum_{k=1}^m (Y_{i,j,k} - \bar{Y}_{i,j})^2, \qquad \bar{Y}_{i,j} = \frac{1}{m}\sum_{k=1}^m Y_{i,j,k}, \]

together with the vectors \( \bar{Y}_i = (\bar{Y}_{i,1}, \ldots, \bar{Y}_{i,b})^\top \). The distribution of the \( \bar{Y}_i \) is that of independent b-variate normals, with distribution \( N_b(M_i, \frac{\sigma_e^2}{m} I) \). To integrate the M_i out of the model conditional on everything else, we note that the distribution of RSS depends only on Σ_e, so only the distribution of the \( \bar{Y}_i \) changes to

\[ N_b\!\left( \theta,\; \frac{\sigma_e^2}{m}\, I + \sigma_A^2\, 1 1^\top + \sigma_{AB}^2 \left[ I - \frac{1}{b}\, 1 1^\top \right] \right), \tag{8.19} \]

and they are still conditionally independent. This means that we can reduce the sufficient statistic even further. We will still need RSS, and of course we will need \( \bar{Y}_. = \sum_{i=1}^a \bar{Y}_i/a \). Because of the special form of the covariance matrix of the \( \bar{Y}_i \), we do not need the whole matrix \( \sum_{i=1}^a (\bar{Y}_i - \bar{Y}_.)(\bar{Y}_i - \bar{Y}_.)^\top \), which would be required if the covariance matrix were unconstrained. Instead, one can use the fact that

\[ y^\top (c_1 I + c_2\, 1 1^\top)^{-1} y = \frac{1}{c_1} \sum_{j=1}^b (y_j - \bar{y})^2 + \frac{b\,\bar{y}^2}{c_1 + b c_2}, \qquad \bar{y} = \frac{1}{b}\sum_{j=1}^b y_j, \tag{8.20} \]

to write the conditional density of the \( \bar{Y}_i \) in terms of \( \bar{Y}_. \) and

\[ SS_A = \sum_{i=1}^a bm\, (\bar{Y}_{i,.} - \bar{Y}_{.,.})^2, \]
\[ SS_I = \sum_{i=1}^a \sum_{j=1}^b m\, (\bar{Y}_{i,j} - \bar{Y}_{i,.} - \bar{Y}_{.,j} + \bar{Y}_{.,.})^2 = \sum_{i=1}^a m\, (\bar{Y}_i - \bar{Y}_.)^\top (\bar{Y}_i - \bar{Y}_.) - SS_A, \]

where \( \bar{Y}_{i,.} \) is the average of the coordinates of \( \bar{Y}_i \) and \( \bar{Y}_{.,.} \) is the average of the coordinates of \( \bar{Y}_. \). These two sums of squares are the usual sums of squares for the random effect and for interaction, respectively. The conditional distribution of \( \bar{Y}_. \) given the parameters is the same as (8.19) except that the covariance matrix must be divided by a because we averaged a independent vectors with the same distribution.
To integrate Θ out of the distribution, we note that SS_A and SS_I depend only on Σ_e, Σ_A, and Σ_AB. Using (8.20) once again, we can write the

conditional density of \( \bar{Y}_. \) in terms of \( \bar{Y}_{.,.} \) and \( SS_B = \sum_{j=1}^b ma\,(\bar{Y}_{.,j} - \bar{Y}_{.,.})^2 \), where \( \bar{Y}_{.,j} \) is coordinate j of \( \bar{Y}_. \), and SS_B is the usual sum of squares for the fixed effect. In summary, the sufficient statistics for the model that involves only the parameters Σ_e, Σ_A, Σ_AB, Σ_B, and M are RSS, SS_A, SS_I, SS_B, and \( \bar{Y}_{.,.} \). The conditional distributions of these quantities given the parameters are easily calculated using the fact that they are functions of orthogonal transformations of the original data. Hence, they are all conditionally independent given the parameters, and their distributions are

\[ \bar{Y}_{.,.} \sim N\!\left( \mu,\; \frac{\sigma_e^2}{abm} + \frac{\sigma_A^2}{a} + \frac{\sigma_B^2}{b} \right), \qquad RSS \sim \sigma_e^2\, \chi^2_{ab(m-1)}, \]
\[ SS_B \sim (\sigma_e^2 + m\sigma_{AB}^2 + am\sigma_B^2)\, \chi^2_{b-1}, \qquad SS_A \sim (\sigma_e^2 + bm\sigma_A^2)\, \chi^2_{a-1}, \]
\[ SS_I \sim (\sigma_e^2 + m\sigma_{AB}^2)\, \chi^2_{[a-1][b-1]}. \]
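The sums of squares above can be computed directly from an a × b × m data array. The pure-Python sketch below uses made-up data; in a balanced layout, RSS, SS_A, SS_B, and SS_I add up to the total sum of squares about the grand mean:

```python
# Sums of squares for a balanced two-way layout, computed directly.

def anova_sums(y):
    """Return (RSS, SS_A, SS_B, SS_I) for data y[i][j][k], i<a, j<b, k<m."""
    a, b, m = len(y), len(y[0]), len(y[0][0])
    cell = [[sum(y[i][j]) / m for j in range(b)] for i in range(a)]
    row = [sum(cell[i]) / b for i in range(a)]
    col = [sum(cell[i][j] for i in range(a)) / a for j in range(b)]
    grand = sum(row) / a
    rss = sum((y[i][j][k] - cell[i][j]) ** 2
              for i in range(a) for j in range(b) for k in range(m))
    ss_a = sum(b * m * (row[i] - grand) ** 2 for i in range(a))
    ss_b = sum(a * m * (col[j] - grand) ** 2 for j in range(b))
    ss_i = sum(m * (cell[i][j] - row[i] - col[j] + grand) ** 2
               for i in range(a) for j in range(b))
    return rss, ss_a, ss_b, ss_i

y = [[[1.0, 2.0], [3.0, 5.0], [2.0, 2.0]],
     [[4.0, 3.0], [6.0, 7.0], [5.0, 4.0]]]
rss, ss_a, ss_b, ss_i = anova_sums(y)
```

The decomposition works because the row, column, and interaction contrasts are orthogonal in the balanced case, which is also why the four statistics are conditionally independent given the parameters.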
At this point, the classical analysis differs from any further Bayesian analysis. The classical analysis usually ignores \( \bar{Y}_{.,.} \) and makes inference based on the sums of squares. Since the distribution of \( \bar{Y}_{.,.} \) depends on some of the variance parameters, a Bayesian would still make use of it, even if interest were solely in the variance parameters. In particular, we could integrate M out of the problem and see that the conditional distribution of \( \bar{Y}_{.,.} \) given the variance parameters alone is

\[ \bar{Y}_{.,.} \sim N\!\left( \mu_0,\; \frac{\sigma_e^2}{abm} + \frac{\sigma_A^2}{a} + \frac{\sigma_B^2}{b} + \frac{\sigma_B^2}{\zeta_0} \right). \]

8.2.3 Hypothesis Testing


On page 384, we found a UMPUI⁶ test for the hypothesis of equal means in a one-way analysis of variance. This test was the usual F-test. In Section 4.5.6, we illustrated how the usual F-test was a Bayes rule in a decision problem. This decision problem had the property that the prior probability was positive that all the means were equal. It may be that we do not feel that exact equality between the means has positive probability, but we are still interested in how far apart they are. In fixed-effects models, there is a straightforward way to measure the differences between the means which resembles the F-test but uses a prior in which the probability is zero that any two means are equal.
Suppose that X has N_n(gβ, σ²v) distribution given (B, Σ) = (β, σ), where g and v are known matrices, with g being n × p and v being n × n, respectively. Let the prior distribution be that B ~ N_p(β₀, σ²w₀⁻¹) given Σ = σ and Σ² ~ Γ⁻¹(a₀/2, b₀/2), where w₀ is a known, nonsingular p × p matrix. The sufficient statistics from a sample X = x are

\[ \hat\beta = (g^\top v^{-1} g)^{-1} g^\top v^{-1} x, \qquad RSS = (x - g\hat\beta)^\top v^{-1} (x - g\hat\beta). \]

The posterior distribution given X = x has the form B ~ N_p(β₁, σ²w₁⁻¹) given Σ = σ and Σ² ~ Γ⁻¹(a₁/2, b₁/2), where

\[ a_1 = a_0 + n, \qquad w_1 = w_0 + g^\top v^{-1} g, \]
\[ \beta_1 = w_1^{-1}\left( w_0\beta_0 + (g^\top v^{-1} g)\hat\beta \right), \]
\[ b_1 = b_0 + RSS + (\beta_0 - \hat\beta)^\top w_0 w_1^{-1} (g^\top v^{-1} g)(\beta_0 - \hat\beta). \]
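These update formulas translate directly into code. The sketch below assumes the conjugate normal-inverse-gamma prior of the text; function and variable names are illustrative:

```python
import numpy as np

# The normal linear model posterior update, implemented from the formulas.

def posterior_hyperparameters(x, g, v, beta0, w0, a0, b0):
    """Return (a1, b1, beta1, w1) for the normal linear model posterior."""
    v_inv = np.linalg.inv(v)
    gtg = g.T @ v_inv @ g                       # g' v^{-1} g
    beta_hat = np.linalg.solve(gtg, g.T @ v_inv @ x)
    resid = x - g @ beta_hat
    rss = resid @ v_inv @ resid
    w1 = w0 + gtg
    beta1 = np.linalg.solve(w1, w0 @ beta0 + gtg @ beta_hat)
    d = beta0 - beta_hat
    b1 = b0 + rss + d @ w0 @ np.linalg.solve(w1, gtg @ d)
    return a0 + len(x), b1, beta1, w1

a1, b1, beta1, w1 = posterior_hyperparameters(
    x=np.array([1.0, 2.0]), g=np.eye(2), v=np.eye(2),
    beta0=np.zeros(2), w0=np.eye(2), a0=1.0, b0=1.0)
```

With g = v = w₀ = I and β₀ = 0 the posterior mean β₁ is simply halfway between the prior mean and the least-squares estimate β̂ = x.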
Let B₀ = {β : aβ = ψ₀}, where a is a q × p matrix of rank q ≤ p, and ψ₀ is some q-dimensional vector in the column space of a. Suppose that we are interested in how far B is from B₀. Let β₀ ∈ B₀, and define H = (B − β₀)ᵀh(B − β₀), where h = aᵀ(aw₁⁻¹aᵀ)⁻¹a. Note that

\[ H = (aB - \psi_0)^\top (a w_1^{-1} a^\top)^{-1} (aB - \psi_0) \]

is the same no matter which β₀ ∈ B₀ is used in its definition. There are two natural ways to measure the distance between B and B₀. One is by ρ(B, B₀) = H/trace(h), and the other is by ρ(B, B₀)/Σ². The reason for the trace(h) in the denominator is twofold. First, there is a sense in which h measures the precision of that part of B that lies in B₀. Because of this, there are two factors that contribute to (B − β₀)ᵀh(B − β₀) being large. One is how far B is from β₀, and the other is how precisely we know B. Only the former should contribute to the distance between B and B₀. The latter can be used to judge how well we know the distance between B and B₀, but not to increase the actual distance. This means that we must adjust H somehow to remove the effect of the precision. The trace of h is a natural way to do that. The second reason for the trace of h is that it is invariant under alternative representations of the set B₀. That is, if B₀ = {β : cβ = ψ₀} also, then trace(cᵀ(cw₁⁻¹cᵀ)⁻¹c) = trace(h).⁷
To express our uncertainty about the distance between B and B₀, we need the distribution of H and/or the distribution of H/Σ².

Theorem 8.21. Suppose that, conditional on Y = y, the distribution of Z is NCχ²_q(y). Suppose also that Y has Γ(a₁/2, b₁/[2c]) distribution; then the marginal distribution of Z is ANCχ²(q, a₁, γ), where γ = c/(c + b₁). (See page 668.) The mean of Z is E(Z) = q + ca₁/b₁ = q + a₁γ/(1 − γ).

PROOF. The conditional density of Z given Y = y is

\[ f_{Z|Y}(z|y) = \sum_{i=0}^\infty \exp\!\left(-\frac{y}{2}\right) \frac{(y/2)^i}{i!}\, \frac{2^{-(\frac{q}{2}+i)}}{\Gamma(\frac{q}{2}+i)}\, z^{\frac{q}{2}+i-1} \exp\!\left(-\frac{z}{2}\right). \]

⁷For those with a background in multivariate analysis, it is possible to show that H/trace(h) is the weighted average of the squares of the principal components of the projection of w^{1/2}(B − β₀) into the space w^{1/2}B₀.

The marginal density of Y is

\[ f_Y(y) = \frac{\left(\frac{b_1}{2c}\right)^{\frac{a_1}{2}}}{\Gamma\!\left(\frac{a_1}{2}\right)}\, y^{\frac{a_1}{2}-1} \exp\!\left( -\frac{b_1 y}{2c} \right). \]

The joint density f_{Z,Y}(z, y) is the product of these two. Integrating out y gives the marginal density of Z:

\[ f_Z(z) = \sum_{i=0}^\infty \frac{\left(\frac{b_1}{c}\right)^{\frac{a_1}{2}} \Gamma\!\left(\frac{a_1}{2}+i\right) 2^{-(\frac{q}{2}+i)}}{i!\, \Gamma\!\left(\frac{a_1}{2}\right)\left(1+\frac{b_1}{c}\right)^{\frac{a_1}{2}+i} \Gamma\!\left(\frac{q}{2}+i\right)}\, z^{\frac{q}{2}+i-1} \exp\!\left(-\frac{z}{2}\right). \]

Use the formula for γ to complete the proof that the distribution is ANCχ². The mean of Z given Y = y is q + y, and the mean of Y is ca₁/b₁. □
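The mean formula in Theorem 8.21 is easy to sanity-check by Monte Carlo: draw Y from the stated gamma distribution, then Z | Y = y as a noncentral chi-squared, and compare the average of Z with q + ca₁/b₁. This sketch assumes Python's `random.gammavariate` shape/scale parametrization:

```python
import math, random

# Monte Carlo check of E(Z) = q + c*a1/b1 in Theorem 8.21.

def sample_z(q, a1, b1, c, rng):
    """One draw of Z: Y ~ Gamma(a1/2, scale 2c/b1); Z | Y = y ~ NCX2_q(y),
    built as N(sqrt(y), 1)^2 plus an independent chi^2_{q-1}."""
    y = rng.gammavariate(a1 / 2.0, 2.0 * c / b1)
    z = rng.gauss(math.sqrt(y), 1.0) ** 2
    z += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(q - 1))
    return z

rng = random.Random(0)
q, a1, b1, c = 2, 10.0, 5.0, 3.0
mean_z = sum(sample_z(q, a1, b1, c, rng) for _ in range(40000)) / 40000
expected = q + c * a1 / b1   # the theorem's mean formula
```

The scale 2c/b₁ corresponds to the rate b₁/(2c) in the theorem, so E(Y) = ca₁/b₁, giving E(Z) = q + E(Y).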

Theorem 8.22. Suppose that the conditional distribution of ZY given Y = y is noncentral χ², NCχ²_q(cy). Also, suppose that the distribution of Y is Γ(a₁/2, b₁/2); then the marginal distribution of a₁Z/[q(b₁ + c)] is ANCF(q, a₁, γ), where γ = c/(c + b₁). (See page 669.) If a₁ > 2, the mean of Z is E(Z) = c + qb₁/(a₁ − 2).

PROOF. The conditional density of Z given Y = y is

\[ f_{Z|Y}(z|y) = \sum_{i=0}^\infty \exp\!\left(-\frac{cy}{2}\right) \frac{(cy/2)^i}{i!}\, \frac{y^{\frac{q}{2}+i}\, z^{\frac{q}{2}+i-1}}{2^{\frac{q}{2}+i}\,\Gamma\!\left(\frac{q}{2}+i\right)} \exp\!\left(-\frac{zy}{2}\right). \]

The marginal density of Y is

\[ f_Y(y) = \frac{\left(\frac{b_1}{2}\right)^{\frac{a_1}{2}}}{\Gamma\!\left(\frac{a_1}{2}\right)}\, y^{\frac{a_1}{2}-1} \exp\!\left(-\frac{b_1 y}{2}\right). \]

The joint density f_{Z,Y}(z, y) equals exp(−y[b₁ + c + z]/2) times

\[ \sum_{i=0}^\infty \frac{\left(\frac{c}{2}\right)^i \left(\frac{b_1}{2}\right)^{\frac{a_1}{2}} y^{\frac{a_1}{2}+\frac{q}{2}+2i-1}\, z^{\frac{q}{2}+i-1}}{i!\, 2^{\frac{q}{2}+i}\, \Gamma\!\left(\frac{q}{2}+i\right) \Gamma\!\left(\frac{a_1}{2}\right)}. \]

Integrating y out of this gives the density of Z:

\[ f_Z(z) = \sum_{i=0}^\infty \frac{c^i\, b_1^{\frac{a_1}{2}}\, \Gamma\!\left(\frac{a_1}{2}+\frac{q}{2}+2i\right) z^{\frac{q}{2}+i-1}}{i!\, (b_1+c+z)^{\frac{a_1}{2}+\frac{q}{2}+2i}\, \Gamma\!\left(\frac{q}{2}+i\right) \Gamma\!\left(\frac{a_1}{2}\right)}. \]

Now, make the change of variables from z to u = z/(z + b₁ + c). The inverse is z = (b₁ + c)u/(1 − u). The derivative is (b₁ + c)/(1 − u)². The density of U = Z/(Z + b₁ + c) is

\[ f_U(u) = \sum_{i=0}^\infty \frac{b_1^{\frac{a_1}{2}}\, c^i\, \Gamma\!\left(\frac{a_1}{2}+\frac{q}{2}+2i\right)}{(b_1+c)^{\frac{a_1}{2}+i}\, i!\, \Gamma\!\left(\frac{q}{2}+i\right) \Gamma\!\left(\frac{a_1}{2}\right)}\, u^{\frac{q}{2}+i-1}(1-u)^{\frac{a_1}{2}+i-1}. \]

Setting γ = c/(c + b₁) and rearranging Γ function values produces the ANCB(q, a₁, γ) density. We know that

\[ \frac{a_1 Z}{q(b_1+c)} = \frac{a_1}{q}\, \frac{U}{1-U}, \]

which must have ANCF(q, a₁, γ) distribution. The mean is obtained by noting that E(Z|Y = y) = c + q/y and E(1/Y) = b₁/(a₁ − 2) if a₁ > 2. □
In Theorem 8.21, let Z = H/Σ². Since aB − ψ₀ has multivariate normal distribution N_q(aβ₁ − ψ₀, σ² a w₁⁻¹ aᵀ) given Σ = σ, it follows that Z has noncentral χ² distribution with q degrees of freedom and noncentrality parameter y = (aβ₁ − ψ₀)ᵀ(aw₁⁻¹aᵀ)⁻¹(aβ₁ − ψ₀)/σ² given Σ = σ. Since

\[ c = (a\beta_1 - \psi_0)^\top (a w_1^{-1} a^\top)^{-1} (a\beta_1 - \psi_0) \tag{8.23} \]

is a constant in the posterior distribution, we can let Y = c/Σ², which has Γ(a₁/2, b₁/[2c]) distribution. It follows that the distribution of H/Σ² is ANCχ²(q, a₁, γ).

In Theorem 8.22, let Z = H and Y = 1/Σ². Now, ZY has noncentral χ² distribution with q degrees of freedom and noncentrality parameter cy given Y = y. Also, Y ~ Γ(a₁/2, b₁/2). It follows that the distribution of a₁H/[q(b₁ + c)] is ANCF(q, a₁, γ).
Example 8.24. We will use the same data as in Example 8.14 on page 487, but we will use a conjugate prior for the parameters Σ and B = (M₁, M₂, M₃)ᵀ. The design matrix g is particularly simple, and v is the identity matrix. We get gᵀv⁻¹g to be the 3 × 3 diagonal matrix with 10, 12, and 15 on the diagonal. Suppose that the prior has hyperparameters

\[ a_0 = 1, \quad b_0 = 10, \quad \beta_0 = \begin{pmatrix} 10 \\ 10 \\ 10 \end{pmatrix}, \quad w_0 = \begin{pmatrix} 6.7742 & -3.2258 & -3.2258 \\ -3.2258 & 6.7742 & -3.2258 \\ -3.2258 & -3.2258 & 6.7742 \end{pmatrix}. \]

The posterior distribution has hyperparameters

\[ a_1 = 38, \quad b_1 = 1729.7, \quad \beta_1 = \begin{pmatrix} 24.44734 \\ 19.43751 \\ 20.11565 \end{pmatrix}, \quad w_1 = \begin{pmatrix} 16.7742 & -3.2258 & -3.2258 \\ -3.2258 & 18.7742 & -3.2258 \\ -3.2258 & -3.2258 & 21.7742 \end{pmatrix}. \]

Now, suppose that B₀ = {β : β₁ = β₂ = β₃}. This can be represented by the matrix and vector

\[ a = \begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}, \qquad \psi_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}. \]
8.3. Nonnormal Models 495

::!

~
>
0
u..
C
(.)
~

0.0 0.2 0.4 0.6 0.8 1.0


v

FIGURE 8.25. CDF of V = VH/[44.333U:;2j

The noncentrality parameter of the alternate noncentral distributions can be


calculated to equal "( = 307.5124/(1729.7 +307.5124) = 0.1509. The trace of his
44.3331.
If we want to describe our uncertainty about how far apart the Mi are, we could
look at the CDF of some function of H or of H/E2. For example, Figure 8.25 gives
the graph of the CDF of V = VH/[44.3331E2j. We see that it is almost certain
that the average distance between the Mi is less than E and there is a 95% chance
that the average distance is at least 0.18E.
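These probabilities can be checked by simulation. The following is a minimal Monte Carlo sketch (Python here is our own choice, not the book's), assuming the mixture representation above: 1/Σ² has a Γ(a₁/2, b₁/2) posterior with a₁ = 38 and b₁ = 1729.7, and given Σ², H/Σ² is noncentral χ² with q = 2 degrees of freedom and noncentrality c/Σ², where c = 307.5124.

```python
import math, random

random.seed(0)

A1, B1 = 38.0, 1729.7    # posterior hyperparameters from Example 8.24
C = 307.5124             # the constant c of (8.23)
TRACE = 44.3331
N = 20000

draws = []
for _ in range(N):
    # 1/Sigma^2 ~ Gamma(shape a1/2, rate b1/2)
    prec = random.gammavariate(A1 / 2, 1.0) / (B1 / 2)
    lam = C * prec                       # noncentrality parameter c/Sigma^2
    # noncentral chi^2 with 2 degrees of freedom and noncentrality lam
    z = random.gauss(math.sqrt(lam), 1.0) ** 2 + random.gauss(0.0, 1.0) ** 2
    draws.append(math.sqrt(z / TRACE))   # V = sqrt(H / (44.3331 Sigma^2))

p_small = sum(v < 1.0 for v in draws) / N
p_big = sum(v >= 0.18 for v in draws) / N
print(p_small, p_big)
```

With these inputs, p_small should be nearly 1 and p_big close to 0.95, in line with the statements in the text.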

8.3 Nonnormal Models*


Hierarchical models are useful for problems in which data have any sort of
distribution. We will give two examples in this section.

8.3.1 Poisson Process Data


Suppose that several stochastic processes are being compared. For example,
each process may be registering the occurrence of defects produced by one
of several machines. Or, each process may be registering the times at which
a criminal is arrested. Suppose that we model the processes as Poisson
processes conditional on parameters Θ₁, ..., Θₖ, so that process i has rate
θᵢ given Θᵢ = θᵢ. We could then model the Θᵢ as a priori exchangeable

*This section may be skipped without interrupting the flow of ideas.


496 Chapter 8. Hierarchical Models

random variables with Γ(α, β) distribution given A = α and B = β. We
would then need a distribution for (A, B). Suppose that the data for process
i consist of Tᵢ units of time and Nᵢ occurrences. The posterior distributions
of the Θᵢ given A = α, B = β, Tᵢ = tᵢ, and Nᵢ = nᵢ are those of independent
random variables with Θᵢ having Γ(α + nᵢ, β + tᵢ) distribution. The posterior
density of (A, B) is proportional to

$$
f_{A,B}(\alpha,\beta)\,\frac{\beta^{k\alpha}}{\Gamma(\alpha)^k}\prod_{i=1}^{k}\frac{\Gamma(\alpha+n_i)}{(\beta+t_i)^{\alpha+n_i}}. \tag{8.26}
$$

This would require numerical integration or approximations in order to


make use of it.
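To make "numerical integration" concrete, here is a hedged sketch of a crude grid evaluation of (8.26). The data are hypothetical, and a flat stand-in is used for the prior factor f_{A,B}, so this is only an illustration of the computation, not the book's analysis.

```python
import math

def log_post(alpha, beta, n, t):
    """Log of (8.26) with a flat stand-in for the prior factor f_{A,B}."""
    k = len(n)
    val = k * alpha * math.log(beta) - k * math.lgamma(alpha)
    for ni, ti in zip(n, t):
        val += math.lgamma(alpha + ni) - (alpha + ni) * math.log(beta + ti)
    return val

# hypothetical data: counts and exposure times for k = 3 processes
n, t = [2, 5, 3], [10.0, 12.0, 8.0]

# evaluate on a grid and normalize to a discrete approximation of the posterior
grid = [(a / 10, b / 10) for a in range(1, 60) for b in range(1, 120)]
logs = [log_post(a, b, n, t) for a, b in grid]
m = max(logs)                      # subtract the max for numerical stability
weights = [math.exp(l - m) for l in logs]
total = sum(weights)
probs = [w / total for w in weights]
print(sum(probs))
```

Posterior expectations of functions of (A, B) can then be approximated by sums over the same grid.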
Example 8.27. Suppose that Nᵢ is the number of times an individual is arrested
in Tᵢ units of time (months). We will assume that Tᵢ is independent of the param-
eters, and that conditional on Tᵢ = tᵢ and Θ₁ = θ₁, ..., Θₖ = θₖ, A = α, B = β,
the Nᵢ are independent Poi(tᵢθᵢ). The Θᵢ are modeled as IID Γ(α, β) given
A = α, B = β. We will use the following prior distribution for (A, B):

$$
B \mid A = a \sim \Gamma\!\left(b^{(0)}, \frac{c^{(0)}}{a}\right), \qquad A \sim \Gamma\!\left(a^{(0)}, d^{(0)}\right).
$$
In this prior, B/A is independent of A. Suppose that we use the prior hyperpa-
rameters a⁽⁰⁾ = 1/2, b⁽⁰⁾ = 1, c⁽⁰⁾ = 13, and d⁽⁰⁾ = 1. The data consist of k = 6
individuals with the following observations:

Subject (i)    1    2    3    4    5    6
Time (tᵢ)     36   27   14    6   20   30
Number (nᵢ)    2    3    1    1    2    2
We will illustrate two numerical techniques for drawing inferences from this data
and model. Suppose that we want the predictive distributions of the numbers of
arrests in a future 24-month period for two different individuals. One of them
is the second observed individual in the data set, and the other is an individual
not in the data set but deemed to be a priori exchangeable with them. Denote
these individuals by i = 2 and i = 7, respectively, and denote the numbers of
arrests by M₂ and M₇ to distinguish them from the observed data. What we seek
is f_{Mᵢ|X}(n|x) for i = 2, 7 and n = 0, 1, ..., where X = x is the observed data. We
can write

$$
f_{M_i|X}(n|x) = \iint f_{M_i|X,A,B}(n|x,\alpha,\beta)\, f_{A,B|X}(\alpha,\beta|x)\, d\alpha\, d\beta,
$$
$$
f_{M_i|X,A,B}(n|x,\alpha,\beta) = \int f_{M_i|\Theta_i}(n|\theta)\, f_{\Theta_i|X,A,B}(\theta|x,\alpha,\beta)\, d\theta,
$$
$$
f_{M_i|\Theta_i}(n|\theta) = \frac{\exp(-24\theta)(24\theta)^n}{n!},
$$
$$
f_{\Theta_2|X,A,B}(\theta|x,\alpha,\beta) = \frac{(\beta+27)^{\alpha+3}}{\Gamma(\alpha+3)}\,\theta^{\alpha+2}\exp[-\theta(\beta+27)],
$$
$$
f_{\Theta_7|X,A,B}(\theta|x,\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\,\theta^{\alpha-1}\exp(-\theta\beta),
$$

so that

$$
f_{M_2|X,A,B}(n|x,\alpha,\beta) = \frac{24^n(\beta+27)^{\alpha+3}\,\Gamma(\alpha+3+n)}{n!\,\Gamma(\alpha+3)\,(\beta+51)^{\alpha+3+n}},
$$
$$
f_{M_7|X,A,B}(n|x,\alpha,\beta) = \frac{24^n\beta^{\alpha}\,\Gamma(\alpha+n)}{n!\,\Gamma(\alpha)\,(\beta+24)^{\alpha+n}}.
$$

Therefore, we need to be able to integrate these last two expressions times the
expression in (8.26) renormalized to be a density. The normalization constant is
fx(x), the integral of (8.26) over a and ,B.
First, we used Laplace's method from Section 7.4.3, since all the functions
being integrated are positive. The "θ" in this example is (A, B), and g(θ) is
one of the several functions obtained by fixing n in either f_{M₂|X,A,B}(n|x, α, β) or
f_{M₇|X,A,B}(n|x, α, β) from above. Due to the form of the prior, it seemed sensible
to transform to (A, B/A) before applying Laplace's method.
Second, we used importance sampling (see Section B.7) to integrate numeri-
cally. We used a single set of 100,000 pseudorandom pairs drawn from the prior
distribution of (A, B) to perform all of the integrals. We also calculated variances
using the delta method for each ordinate. The results for M₂ are shown in Fig-
ure 8.28, and those for M₇ are shown in Figure 8.29. The standard deviations
of the importance sample ordinates were all at least two orders of magnitude
smaller than the ordinates themselves. As we can see in the figures, the two
methods produce nearly the same results.
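A sketch of the importance-sampling computation follows. Because our reconstruction of the book's prior is uncertain, a simple stand-in prior (α ~ Exp(1), β ~ Exp(1/12)) is used as the proposal, so that the importance weights are just the likelihood factor of (8.26); the predictive formula for M₇ is the one displayed above. This is an illustration of the technique, not a reproduction of the book's numbers.

```python
import math, random

random.seed(1)

t = [36.0, 27.0, 14.0, 6.0, 20.0, 30.0]   # months observed
n = [2, 3, 1, 1, 2, 2]                     # arrests observed

def log_lik(a, b):
    # log of the likelihood factor of (8.26)
    k = len(n)
    s = k * a * math.log(b) - k * math.lgamma(a)
    for ni, ti in zip(n, t):
        s += math.lgamma(a + ni) - (a + ni) * math.log(b + ti)
    return s

def log_pred7(m, a, b):
    # log predictive pmf of M7 (24 future months) given (a, b)
    return (m * math.log(24.0) + a * math.log(b) + math.lgamma(a + m)
            - math.lgamma(m + 1) - math.lgamma(a)
            - (a + m) * math.log(b + 24.0))

# draw (A, B) from the stand-in prior; weight each draw by the likelihood
draws, logw = [], []
for _ in range(5000):
    a = random.expovariate(1.0)
    b = random.expovariate(1.0 / 12.0)
    draws.append((a, b))
    logw.append(log_lik(a, b))

mx = max(logw)
w = [math.exp(l - mx) for l in logw]
tot = sum(w)

pmf = []
for m in range(61):
    num = sum(wi * math.exp(log_pred7(m, a, b))
              for wi, (a, b) in zip(w, draws))
    pmf.append(num / tot)
print(pmf[:5], sum(pmf))
```

The self-normalized estimate of f_{M₇|X}(n|x) sums to one over n (up to the truncation at n = 60), which is a useful internal check on the weights.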

FIGURE 8.28. Numerical Approximations to Density of M₂ (Laplace and importance sampling)

FIGURE 8.29. Numerical Approximations to Density of M₇ (Laplace and importance sampling)

8.3.2 Bernoulli Process Data

Suppose that we can collect counts from several different sources. For ex-
ample, we might be administering several treatments and we count how
many recoveries occur in each treatment group. The data from group i will
consist of nᵢ, the number of subjects in the group, and Xᵢ, the number of
successes, for i = 1, ..., k. We model the successes as Bernoulli processes
conditional on parameters Pᵢ, with the probability of success in group i
being pᵢ given Pᵢ = pᵢ. We can model the Pᵢ as exchangeable random vari-
ables with Beta(θr, [1 − θ]r) distribution conditional on Θ = θ and R = r. Here, Θ
is like the average probability and R is like a measure of similarity. The
larger R is, the more similar the Pᵢ are. The posterior distribution of the Pᵢ,
given Θ = θ, R = r, and Xᵢ = xᵢ is that of independent random variables
with Pᵢ having Beta(θr + xᵢ, [1 − θ]r + nᵢ − xᵢ) distribution. The posterior
density of (Θ, R) would be proportional to

$$
f_{\Theta,R}(\theta,r)\,\frac{\Gamma(r)^k}{\Gamma(\theta r)^k\,\Gamma([1-\theta]r)^k}\prod_{i=1}^{k}\frac{\Gamma(\theta r + x_i)\,\Gamma([1-\theta]r + n_i - x_i)}{\Gamma(r+n_i)}.
$$

This would require numerical integration or approximations in order to


make use of it.
One possible approximation that is available puts this problem into the
normal model framework. If the nᵢ will be large, we can model Yᵢ =
2 arcsin√(Xᵢ/nᵢ) as approximately N(2 arcsin√pᵢ, 1/nᵢ) random variables
given Pᵢ = pᵢ. We could then use the same transformation on the Pᵢ to
model the Mᵢ = 2 arcsin√Pᵢ as approximately N(μ, 1/τ) given M = μ and
T = τ. Then M can be modeled as N(μ⁽⁰⁾, 1/(λτ)) given T = τ, and T
can be given some distribution. Here, M plays the role of 2 arcsin√Θ and
T plays the role of R from the earlier model. The posterior distribution of
the Mᵢ given M = μ and T = τ is that of independent random variables
with Mᵢ having N(ψᵢ(μ), 1/(τ + nᵢ)) distribution, where

$$
\psi_i(\mu) = \frac{\mu\tau + n_i y_i}{\tau + n_i}.
$$
The posterior of M given T = τ is N(μ⁽¹⁾(τ), 1/[τ(λ + λ(τ))]), where

$$
\mu^{(1)}(\tau) = \frac{\lambda\mu^{(0)} + \sum_{i=1}^k \frac{n_i y_i}{n_i+\tau}}{\lambda + \lambda(\tau)}.
$$

The posterior for T cannot be given in closed form, but the density is
proportional to f_T(τ) times a factor depending on τ through λ(τ) and ȳ(τ),
where

$$
\lambda(\tau) = \sum_{i=1}^{k}\frac{n_i}{n_i+\tau}, \qquad \bar y(\tau) = \frac{1}{\lambda(\tau)}\sum_{i=1}^{k}\frac{n_i y_i}{n_i+\tau}.
$$

Once again, if nᵢ is large for each i, then nᵢ/(nᵢ + τ) ≈ 1 and nᵢ + τ ≈
nᵢ for each i. It follows that λ(τ) is approximately k and that ȳ(τ) is
approximately the average of the yᵢ, say ȳ. Hence, the posterior density of
T is approximately proportional to

$$
f_T(\tau)\,\tau^{k/2}\exp\left(-\frac{\tau}{2}\left[w + \frac{k\lambda}{k+\lambda}(\mu^{(0)} - \bar y)^2\right]\right),
$$
where w = Σᵢ₌₁ᵏ(yᵢ − ȳ)². Also, the conditional posterior for M given
T = τ is approximately

$$
N\left(\frac{\lambda\mu^{(0)} + \sum_{i=1}^k y_i}{\lambda + k},\ \frac{1}{\tau(\lambda+k)}\right).
$$

If T has a Γ(a⁽⁰⁾/2, b⁽⁰⁾/2) prior, then the approximate posterior of T is
Γ(a⁽¹⁾/2, b⁽¹⁾/2), where

$$
a^{(1)} = a^{(0)} + k, \qquad b^{(1)} = b^{(0)} + w + \frac{k\lambda}{k+\lambda}(\mu^{(0)} - \bar y)^2.
$$

Of course, using these same approximations, the conditional distribution
of Mᵢ given M and T would be N(yᵢ, 1/nᵢ), which is independent of M and
T anyway.
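The whole approximate recipe can be sketched numerically. The counts and hyperparameters below are hypothetical, chosen only to walk through the formulas above (transform the counts, compute w, update to a⁽¹⁾ and b⁽¹⁾, and form the approximate posterior mean of M).

```python
import math

# hypothetical group data: successes x_i out of n_i trials
x = [18, 22, 9, 30]
n = [50, 60, 25, 80]
k = len(x)

# variance-stabilizing transform: y_i approx N(2 arcsin sqrt(p_i), 1/n_i)
y = [2 * math.asin(math.sqrt(xi / ni)) for xi, ni in zip(x, n)]

# assumed hyperparameters for the normal-model approximation
mu0, lam = 2 * math.asin(math.sqrt(0.4)), 1.0
a0, b0 = 1.0, 1.0

ybar = sum(y) / k
w = sum((yi - ybar) ** 2 for yi in y)

# approximate posterior of T is Gamma(a1/2, b1/2)
a1 = a0 + k
b1 = b0 + w + (k * lam / (k + lam)) * (mu0 - ybar) ** 2

# approximate posterior mean of M (on the transformed scale)
m_post = (lam * mu0 + sum(y)) / (lam + k)

# back-transform to a probability for interpretability
p_post = math.sin(m_post / 2) ** 2
print(a1, b1, m_post, p_post)
```

The back-transformed p_post lands near the pooled success fraction, pulled slightly toward the prior guess of 0.4, as the shrinkage formulas suggest.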

8.4 Empirical Bayes Analysis*


Classical statisticians try to make use of hierarchical models either by leav-
ing the hyperparameters at various stages of the hierarchy unspecified or
by not specifying a distribution for the hyperparameters at certain stages.
This allows them to treat these values as "unknown parameters" in much
the same way that they treat parameters in other models. For example,
the hierarchical model in Table 8.9 could be altered by letting Ψ, T, and Σ
be unknown parameters to be estimated without specifying distributions.
In Table 8.18, we could let M, Σ², Σ²_A, Σ²_B, and Σ²_{AB} be the parameters
by integrating the other parameters out the way we did in Section 8.2.2. A
good introduction to empirical Bayes analysis was given by Morris (1983).
Robbins (1951, 1955, 1964) first introduced the term "empirical Bayes" and
the general methodology.

8.4.1 Naive Empirical Bayes


The naive approach to empirical Bayes analysis is to estimate the hyper-
parameters at some level of the hierarchical model and then pretend as if
these were known a priori and use the resulting posterior distributions for
parameters at lower levels in the hierarchy. For example, in the one-way
ANOVA (see Table 8.9), we could use (8.12) to specify the joint density of
the data given the parameters Ψ, T, and Σ. Then we could let Λ = Σ²/T²
so that the likelihood of Ψ, Σ², and Λ is

$$
\prod_{i=1}^{k}\left(\frac{1}{n_i}+\frac{1}{\lambda}\right)^{-1/2}(\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{k}\left[\frac{(\bar x_i-\psi)^2}{\frac{1}{n_i}+\frac{1}{\lambda}}+(n_i-1)s_i^2\right]\right),
$$

where n = Σᵢ₌₁ᵏ nᵢ. For fixed σ² and λ, this is maximized over ψ by
choosing
$$
\hat\psi(\lambda) = \frac{\sum_{i=1}^{k}\bar x_i\Big/\left(\frac{1}{n_i}+\frac{1}{\lambda}\right)}{\sum_{i=1}^{k}1\Big/\left(\frac{1}{n_i}+\frac{1}{\lambda}\right)}.
$$

If we plug this value for ψ into the likelihood and maximize over σ² for
fixed λ, we obtain the MLE σ̂²(λ) of σ² for fixed λ.

"This section may be skipped without interrupting the flow of ideas.



If we plug this value for σ² into the likelihood, we get a function of λ alone
to maximize:
(8.30)

This would produce the MLE of Λ, call it λ̂. Then set Σ̂² = Σ̂²(λ̂) and
Ψ̂ = Ψ̂(λ̂) to get the overall MLEs. Then, we can make inference about the
Mᵢ by using the conditional distribution in (8.8).
In the special case in which all nᵢ = m, Ψ̂(λ) = Σᵢ₌₁ᵏ x̄ᵢ/k and is not a
function of λ. Also Σ̂²(λ) simplifies, and the derivative of (8.30) can actually
be set equal to zero to solve for the maximum. If we let g = 1/m + 1/λ,
then the derivative of the log of (8.30) becomes

$$
-\frac{k}{2g} + \frac{\sum_{i=1}^{k}(\bar x_i - \hat\psi)^2}{2\hat\Sigma^2(\lambda)\,g^2}. \tag{8.31}
$$

Setting this equal to 0 gives g = Σᵢ₌₁ᵏ(x̄ᵢ − Ψ̂)²/[kΣ̂²(λ)]. Solving for g
yields g equal to a multiple of the usual F statistic for testing the hypothesis
of no difference between groups:

$$
g = \frac{(n-k)\sum_{i=1}^{k}(\bar x_i - \hat\psi)^2}{k(m-1)\sum_{i=1}^{k}s_i^2} = \frac{k-1}{km}F.
$$

Of course, g ≥ 1/m is required. If F < k/(k − 1), the derivative in (8.31)
is negative at g = 1/m, so the maximum occurs at g = 1/m. Hence, the
MLE of Λ is

$$
\hat\lambda = \begin{cases} \dfrac{km}{(k-1)F - k} & \text{if } F > \dfrac{k}{k-1},\\[1ex] \infty & \text{otherwise.} \end{cases}
$$

This means that T̂² = 0 if F ≤ k/(k − 1).
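In the balanced case, the whole naive empirical Bayes computation is a few lines. The group summaries below are hypothetical, used only to exercise the closed form just derived.

```python
import math

# hypothetical balanced one-way ANOVA summaries: k groups, m obs each
xbar = [26.1, 19.2, 20.3]    # group means
s2 = [35.0, 41.0, 39.0]      # within-group sample variances
k, m = len(xbar), 12
n = k * m

psi = sum(xbar) / k
between = sum((xb - psi) ** 2 for xb in xbar)

# usual F statistic for no difference between groups
F = (m * between / (k - 1)) / (sum((m - 1) * s for s in s2) / (n - k))

# MLE of lambda = sigma^2 / tau^2
if F > k / (k - 1):
    lam = k * m / ((k - 1) * F - k)
else:
    lam = math.inf    # i.e., tau-squared-hat = 0: complete pooling
print(F, lam)
```

A small F (little between-group spread relative to within-group spread) sends λ̂ to infinity, which corresponds to pooling all groups to the common mean.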
Example 8.32 (Continuation of Example 8.14; see page 487). Using the data
in this example, we can calculate the likelihood function for Λ and maximize it.
The maximum occurs at λ̂ = 2.614. The other MLEs are Ψ̂ = 21.78441 and
Σ̂² = 38.37032. This makes T̂² = 14.67878. Now, we could use (8.8) to say that
the approximate distribution of Mᵢ is

$$
N\left(\frac{n_i\bar x_i + \hat\psi\hat\lambda}{n_i + \hat\lambda},\ \frac{\hat\Sigma^2}{n_i + \hat\lambda}\right).
$$

For the three groups, these distributions are respectively

N(26.6539, 3.04188), N(18.8101, 2.62559), N(19.8795, 2.17840).



Example 8.33 (Continuation of Example 7.60; see page 420). In this exam-
ple, each observation is a pair (Xᵢ, Yᵢ) that are conditionally IID N(μᵢ, σ²), and
the pairs are conditionally independent given (Σ, M₁, M₂, ...). Suppose that we
model (M₁, M₂, ...) as conditionally IID N(μ, σ²/λ) given (M, Λ) = (μ, λ). The
empirical Bayes approach might treat (Σ, M, Λ) as the parameter to be estimated
by maximum likelihood. The likelihood function for these parameters is

$$
(2\pi\sigma^2)^{-n}\left(\frac{\lambda}{\lambda+2}\right)^{n/2}\exp\left(-\frac{1}{2\sigma^2}\left[\frac{1}{2}\sum_{i=1}^{n}(x_i-y_i)^2 + \frac{2\lambda}{\lambda+2}\sum_{i=1}^{n}\left(\frac{x_i+y_i}{2}-\frac{\bar x+\bar y}{2}\right)^2 + \frac{2\lambda n}{\lambda+2}\left(\frac{\bar x+\bar y}{2}-\mu\right)^2\right]\right).
$$

The MLE for M is M̂ = (X̄ + Ȳ)/2. The MLE for Σ² as a function of λ is

$$
\hat\Sigma^2(\lambda) = \frac{1}{n}\left[\sum_{i=1}^{n}\left(\frac{X_i-Y_i}{2}\right)^2 + \frac{\lambda}{\lambda+2}\sum_{i=1}^{n}\left(\frac{X_i+Y_i}{2}-\frac{\bar X+\bar Y}{2}\right)^2\right].
$$

The MLE of Λ can be found from

$$
\frac{\hat\lambda}{\hat\lambda+2} = \min\left\{1,\ \frac{\sum_{i=1}^{n}\left(\frac{X_i-Y_i}{2}\right)^2}{\sum_{i=1}^{n}\left(\frac{X_i+Y_i}{2}-\frac{\bar X+\bar Y}{2}\right)^2}\right\}.
$$

Since the "observations" (Xi + Yi)/2 are conditionally lID N(J1., u 2[(1/2)+ (1/ >.)])
given the parameters, it follows that

~~ (X. + y. _ X + Y) 2 !:.. (>. + 2)u 2

nL..t 2 2 2>'
t=l

This implies that A is consistent and, in turn, that f;2(A) is consistent. The extra
terms added due to the empirical Bayes analysis make f;2(A) consistent (relative
to the empirical Bayes model).
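A quick simulation check of this consistency claim, with hypothetical true values σ = 2, λ = 3, μ = 5, using the closed-form MLEs just displayed:

```python
import math, random

random.seed(4)

sigma, lam, mu = 2.0, 3.0, 5.0   # hypothetical true parameter values
n = 20000
pairs = []
for _ in range(n):
    m_i = random.gauss(mu, sigma / math.sqrt(lam))   # M_i ~ N(mu, sigma^2/lambda)
    pairs.append((random.gauss(m_i, sigma), random.gauss(m_i, sigma)))

avg = sum((x + y) / 2 for x, y in pairs) / n
Sd = sum(((x - y) / 2) ** 2 for x, y in pairs)
Ss = sum(((x + y) / 2 - avg) ** 2 for x, y in pairs)

r = min(1.0, Sd / Ss)             # estimate of lambda/(lambda + 2)
lam_hat = 2 * r / (1 - r)
sig2_hat = (Sd + r * Ss) / n      # MLE of sigma^2 at lambda-hat
print(lam_hat, sig2_hat)
```

With n this large, λ̂ should land near 3 and Σ̂²(λ̂) near σ² = 4, consistent with the convergence statement above.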

It is not required that one use maximum likelihood estimates in a naive
empirical Bayes analysis. For example, in the one-way ANOVA example,
we could use estimates of T² and Σ² that are based on unbiased estimators.



8.4.2 Adjusted Empirical Bayes


It is generally recognized that naive empirical Bayes analyses underestimate
the variances of parameters because they do not take into account the
fact that estimated hyperparameters were not really known a priori. For
example, in the empirical Bayes version of Example 8.14, we treat Ψ̂ as if
it were Ψ and were known a priori. To reflect the fact that we really do not
know Ψ a priori, the posterior variance of Mᵢ should be increased by

$$
\frac{\Sigma^4}{(n_i T^2 + \Sigma^2)^2}\operatorname{Var}(\Psi) = \frac{\Lambda^2}{(n_i + \Lambda)^2}\operatorname{Var}(\Psi).
$$

We would already have an estimate of Λ from the naive analysis. We could
use (8.11) with ζ₀ = 0 and estimate Var(Ψ) by

$$
\left[\sum_{i=1}^{k}\frac{n_i}{\Sigma^2 + n_i T^2}\right]^{-1}.
$$

The value of this estimate would depend on how we estimated T and Σ, of
course. We should also increase the variance of Mᵢ to reflect the fact that
Σ and T were estimated. An easy way to do this is to replace the normal
distribution in the posterior by a t distribution with appropriate degrees
of freedom. Morris (1983) chooses, instead, to replace the naive variance
expression Σ²/(nᵢ + Λ) by

$$
\frac{\Sigma^2}{n_i}\left(1 - \frac{k-1}{k}\cdot\frac{\Lambda}{\Lambda + n_i}\right). \tag{8.34}
$$

This amounts to estimating the shrinkage factor Λ/(Λ + nᵢ) by a smaller
value.
Example 8.35 (Continuation of Example 8.14; see page 487). We can estimate
Var(Ψ) by

$$
\left(\frac{10}{38.37032 + 10\times 14.67878} + \frac{12}{38.37032 + 12\times 14.67878} + \frac{15}{38.37032 + 15\times 14.67878}\right)^{-1} = 5.95368.
$$

The additional variance terms for the three groups are 0.25568, 0.19048, and
0.13112, respectively. The adjustments specified by (8.34) are 3.30693, 2.81623,
and 2.30494, respectively. Adding these together gives the adjusted variances to
be 3.56261, 3.00671, and 2.43606, respectively, all somewhat larger than the naive
variances.
We might now ask how the adjusted empirical Bayes posteriors compare to
the posteriors calculated from a hierarchical model with prior distributions for
all parameters. Such a model exists in the original description on page 487. Plots
of the posterior densities from these models were drawn in Figure 8.15 together
with the adjusted empirical Bayes distributions. Two of the Mi have empirical
Bayes distributions that are very close to the posteriors, but Ml has a noticeably
smaller variance in the empirical Bayes analysis than in the Bayesian analysis.
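The arithmetic of Example 8.35 is easy to reproduce. The sketch below recovers the adjusted variances from the stated MLEs (λ̂ = 2.614, Σ̂² = 38.37032, T̂² = 14.67878) using the formulas above.

```python
# Reproduce the adjusted empirical Bayes variances of Example 8.35.
sigma2 = 38.37032     # MLE of Sigma^2
tau2 = 14.67878       # implied MLE of T^2
lam = 2.614           # MLE of Lambda = Sigma^2 / T^2
n = [10, 12, 15]
k = len(n)

# estimate of Var(Psi)
var_psi = 1.0 / sum(ni / (sigma2 + ni * tau2) for ni in n)

# extra variance from not knowing Psi, plus the Morris adjustment (8.34)
extra = [lam ** 2 / (ni + lam) ** 2 * var_psi for ni in n]
morris = [sigma2 / ni * (1 - (k - 1) / k * lam / (lam + ni)) for ni in n]
adjusted = [e + m for e, m in zip(extra, morris)]
print(round(var_psi, 5), [round(a, 5) for a in adjusted])
```

Up to rounding of λ̂, this reproduces 5.95368 and the adjusted variances 3.56261, 3.00671, and 2.43606.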

TABLE 8.36. Hierarchical Model for One-Way ANOVA with Unequal Variances

Stage           Density
Data            $\prod_{i=1}^{k}(2\pi\sigma_i^2)^{-n_i/2}\exp\{-\sum_{i=1}^{k}[n_i(\bar x_i-\mu_i)^2+(n_i-1)s_i^2]/(2\sigma_i^2)\}$
Parameter       $(2\pi\tau^2)^{-k/2}\exp\{-\frac{1}{2\tau^2}\sum_{i=1}^{k}(\mu_i-\psi)^2\}$
Hyperparameter  $\sqrt{\zeta_0}\,(2\pi\tau^2)^{-1/2}\exp\{-\frac{\zeta_0}{2\tau^2}(\psi-\psi_0)^2\}$
Variance        $f_{\Sigma_1,\dots,\Sigma_k,T}(\sigma_1,\dots,\sigma_k,\tau)$

8.4.3 Unequal Variance Case


The case of a one-way ANOVA with unequal variances can also be handled
by empirical Bayes analysis. Suppose that we begin with the model in
Table 8.36, which is a generalization of the model of Section 8.2.1. The
posterior mean of Mᵢ for fixed values of the variance parameters and Ψ is

$$
\frac{n_i\tau^2\bar x_i + \sigma_i^2\psi}{n_i\tau^2 + \sigma_i^2}. \tag{8.37}
$$

The posterior mean of Ψ for fixed values of the variance parameters is

$$
\hat\psi(\Sigma_1,\dots,\Sigma_k,T) = \frac{\frac{\zeta_0\psi_0}{\tau^2} + \sum_{i=1}^{k}\frac{\bar x_i}{\tau^2+\sigma_i^2/n_i}}{\frac{\zeta_0}{\tau^2} + \sum_{i=1}^{k}\frac{1}{\tau^2+\sigma_i^2/n_i}}. \tag{8.38}
$$
The resulting likelihood function for T, Σ₁, ..., Σₖ is τ⁻ᵏ times a factor with
no convenient closed-form maximum. The MLEs of the variance parameters
must be either found numerically or approximated. Morris (1983) suggests
using approximately unbiased estimates instead. For example, Σ̂ᵢ² =
(nᵢ − 1)sᵢ²/nᵢ, and T̂² is given by (8.39),
where (8.38) and (8.39) must be solved iteratively. One can choose a start-
ing T value and plug it into (8.38) (together with Σ̂₁², ..., Σ̂ₖ²) to produce
a Ψ̂ to plug into (8.39) to produce a new T, and so on, until the esti-
mates converge.⁸ Morris (1983) also suggests replacing M̂ᵢ by (1 − B̂ᵢ)x̄ᵢ +
B̂ᵢΨ̂(Σ̂₁, ..., Σ̂ₖ, T̂), where

$$
\hat B_i = \frac{k-3}{k-2}\cdot\frac{\hat\Sigma_i^2}{\hat\Sigma_i^2 + n_i\hat T^2}
$$

causes there to be less shrinkage toward a common mean.⁹ The recom-
mended variance for Mᵢ is given as

$$
\frac{\hat\Sigma_i^2}{n_i}\left(1-\frac{k-1}{k}\hat B_i\right) + \left(\bar x_i - \hat\psi(\hat\Sigma_1,\dots,\hat\Sigma_k,\hat T)\right)^2\frac{2}{k-3}\,\hat B_i^2\,\frac{k\,n_i/(\hat\Sigma_i^2+n_i\hat T^2)}{\sum_{j=1}^{k} n_j/(\hat\Sigma_j^2+n_j\hat T^2)}.
$$
Kass and Steffey (1989) present an alternative treatment of this case from
a Bayesian viewpoint. They find a normal approximation to the posterior
distribution of the parameters V = (Σ₁, ..., Σₖ, Ψ, T) (in a manner similar
to the method of Laplace) and then use the delta method to approximate
the mean and variance of (8.37), thought of as a function of V. The posterior
variance of Mᵢ is approximately E(Σᵢ²)/nᵢ plus the variance of (8.37).

8.5 Successive Substitution Sampling


The model analyzed in Section 8.2.2 is an example of one that got out of
hand very quickly, even though it started out in a fairly straightforward
manner. Another method for finding posterior distributions can be used
for such models without getting bogged down in such messy calculation.
The method is a simulation version of the method of successive substitution
used to solve fixed-point problems.

8.5.1 The General Algorithm


In general, if g : A → A, and we are interested in finding an x such that
g(x) = x, we could proceed as follows. Pick x₀ ∈ A. For n = 1, 2, ..., define
xₙ = g(x_{n−1}). If {xₙ}ₙ₌₁^∞ converges and g is continuous, then the limit x
is a fixed point of g, that is, g(x) = x.
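For instance, here is a sketch of successive substitution for the scalar fixed-point problem g(x) = cos(x):

```python
import math

def fixed_point(g, x0, tol=1e-12, max_iter=10_000):
    """Iterate x_n = g(x_{n-1}) until successive values agree to tolerance."""
    x = x0
    for _ in range(max_iter):
        x_new = g(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

root = fixed_point(math.cos, 1.0)
print(root)   # about 0.739085, the unique solution of cos(x) = x
```

The same idea, with densities in place of numbers and the operator T in place of g, is what the sampling algorithm below implements.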

⁸The algorithm described here is an example of successive substitution, which
will be described in Section 8.5.
9In the case k = 3 there is no shrinkage, and when k = 2, I don't know what is
recommended, although it seems clear from the formula for the adjusted variance
that k > 3 is required for this analysis.

The type of fixed-point problem we will study is the following. Suppose
that Y₁, ..., Yₖ are random quantities and that we know the conditional
distribution of Yᵢ given the others, for each i. Suppose that the conditional
distribution of Yᵢ given the others has density f_{Yᵢ|{Yⱼ:j≠i}} with respect to
a measure λᵢ. (It will prove convenient to use the notation Y_{\i} to stand for
{Yⱼ : j ≠ i}, so that this last density can be written f_{Yᵢ|Y_{\i}}.) We wish to find
the joint distribution of (Y₁, ..., Yₖ). Suppose that the joint distribution has
density f_Y with respect to the product measure λ = λ₁ × ⋯ × λₖ. Let X′ =
(X′₁, ..., X′ₖ) have a distribution with density f_{X′} with respect to λ. Define
the distribution of a new random quantity X = (X₁, ..., Xₖ) as follows.
Suppose that X′ = (x′₁, ..., x′ₖ) is observed. The density of X₁ with respect
to λ₁ is f_{Y₁|Y_{\1}}(·|x′₂, ..., x′ₖ). The conditional density of X₂ given X₁ = x₁
is f_{Y₂|Y_{\2}}(·|x₁, x′₃, ..., x′ₖ). Continue until we get the conditional density of
Xₖ given X₁ = x₁, ..., X_{k−1} = x_{k−1} to be f_{Yₖ|Y_{\k}}(·|x₁, ..., x_{k−1}). In words,
when we derive the conditional distribution of Xⱼ given X₁, ..., X_{j−1}, we
use the observed values of x′_{j+1}, ..., x′ₖ in the conditional distributions.
When we get to j = k, we are using only the xᵢ values. If we define
z¹ = (x′₂, ..., x′ₖ), zⁱ = (x₁, ..., x_{i−1}, x′_{i+1}, ..., x′ₖ) for i = 2, ..., k − 1, and
zᵏ = (x₁, ..., x_{k−1}), then the following equation is satisfied:

$$
f_{X|X'}(x|x') = \prod_{i=1}^{k} f_{Y_i|Y_{\setminus i}}(x_i|z^i).
$$
We can define the operator T from the set of densities with respect to λ to
itself by

$$
T(f)(x) = \int\left[\prod_{i=1}^{k} f_{Y_i|Y_{\setminus i}}(x_i|z^i)\right]f(x')\,d\lambda(x').
$$

It is easy to see that T(f_Y) = f_Y, so the joint density of Y is a fixed point
of T.
The method of successive substitution applied to the fixed-point prob-
lem just described would be to pick an initial density f₀, say, and then
let fₙ = T(f_{n−1}) for n = 1, 2, .... This would require the calculation of
a great many integrals that may not have closed-form expressions. An
alternative is to draw samples from the various conditional distributions
instead of calculating the integrals. In the notation just used, suppose
that X′ = (x′₁, ..., x′ₖ) is generated from the distribution with density
f_{X′}. Then suppose that X = (X₁, ..., Xₖ) is generated as follows. Gen-
erate X₁ from the distribution with density f_{Y₁|Y_{\1}}(·|x′₂, ..., x′ₖ). Let x₁
be the generated value. Generate X₂ from the distribution with density
f_{Y₂|Y_{\2}}(·|x₁, x′₃, ..., x′ₖ). Continue until we generate Xₖ from the distribu-
tion with density f_{Yₖ|Y_{\k}}(·|x₁, ..., x_{k−1}). The joint density of X is T(f_{X′}).
So, we can take a starting density f₀ and generate X⁰ from the distribution
with this density. Then, using the method just described for n = 1, 2, ...,
generate Xⁿ from the distribution with density T(f_{n−1}). This method has

been called successive substitution sampling (abbreviated SSS) because it
is just a sampling version of successive substitution.¹⁰
One must, of course, stop the iteration at some point, using the sample
with density T(fₙ) in lieu of a sample with density f_Y. There are several
ways to prove that SSS converges as n goes to infinity. The following theo-
rem is proven by Schervish and Carlin (1992). Its proof, which is given for
completeness, relies heavily on operator theory in Hilbert space.¹¹ Readers
unfamiliar with this theory can safely skip over the proof. The necessary
theorems from operator theory are stated in Appendix C.¹²

Theorem 8.40. In the notation of this section, let

$$
K(x',x) = \prod_{i=1}^{k} f_{Y_i|Y_{\setminus i}}(x_i|z^i).
$$

Assume that

$$
\iint |K(x',x)|^2\,\frac{f_Y(x')}{f_Y(x)}\,d\lambda(x')\,d\lambda(x) < \infty \tag{8.41}
$$

and that K > 0 almost everywhere with respect to λ × λ. Let ℋ be the set
of functions f such that¹³ ||f||² = ∫ |f(x)|²/f_Y(x) dλ(x) < ∞. There exists
a number c ∈ [0, 1) such that for every density f₀ ∈ ℋ, the sequence of
functions fₙ = T(f_{n−1}) = Tⁿ(f₀) for n = 1, 2, ... satisfies ||fₙ − f_Y|| ≤
||f₀||cⁿ for all n.
¹⁰Many authors call this method Gibbs sampling. This is actually a misnomer.
Geman and Geman (1984) described this method as a way to generate a sample
from a Gibbs distribution, and they called their particular implementation the
Gibbs sampler. Gelfand and Smith (1990) generalized the method to arbitrary
distributions but continued to call it Gibbs sampling, even though they were
no longer sampling Gibbs distributions. The SSS algorithm is a special case of
the broad class of Markov chain Monte Carlo methods. Note that the sequence
{xⁿ}ₙ₌₀^∞ is a Markov chain (see Definition B.125). A good survey of general
Markov chain Monte Carlo methods is given by Tierney (1994).
¹¹An alternative is to notice that the sequence x¹, x², ... is a Markov chain
(see Definition B.125 on page 650). One then applies a theorem like the one given
by Doob (1953, Section V.5). The conditions of such theorems are often difficult,
if not impossible, to verify in specific applications.
¹²Some good treatments of operator theory can be found in Berberian (1961)
and Dunford and Schwartz (1963).
¹³We use the symbol ||f|| for the norm of an element of a Hilbert space.
The norm ||T|| of an operator T is the supremum of ||T(f)||/||f||. Dunford and
Schwartz (1963) use the symbols |f| and |T| for these norms. They use the sym-
bol ||T|| for the Hilbert–Schmidt norm or double norm of a Hilbert–Schmidt-type
operator. We only mention this here in case the reader decides to refer to Dunford
and Schwartz (1963) for some of the proofs of auxiliary results.

PROOF. We will use Hilbert space notation and define the inner product

$$
\langle g, h\rangle = \int g(x)h(x)\,d\mu(x),
$$

where μ(A) = ∫_A [1/f_Y(x)] dλ(x). It follows that ℋ is the Hilbert space
L²(μ). The norm in this space is ||g|| = √⟨g, g⟩. If we let K₀(x′, x) =
K(x′, x)f_Y(x′), then T(f)(x) = ∫ K₀(x′, x)f(x′) dμ(x′) is the operator that takes
a density for observations at one iteration of SSS to the density of obser-
vations at the next iteration, and (8.41) becomes

$$
\iint |K_0(x',x)|^2\,d\mu(x')\,d\mu(x) < \infty.
$$

In fact, it is clear that K₀(x′, x) is a joint density of two successive it-
erations of SSS, x′ and x, if the first iteration has the solution density
f_Y. Furthermore, by writing each of the conditional density factors in
K₀ as the ratio of the joint density f_Y to a joint density for all but one
of the observations, and then rearranging the factors, one can show that
T*(f)(x) = ∫ K₀(x, x′)f(x′) dμ(x′) is the operator that takes a density for ob-
servations at one iteration of SSS to the density of observations at the next
iteration if the order of updating coordinates is reversed. For this reason,
it is easy to see that for each g and h in ℋ that are integrable with respect
to λ,

$$
\int g(x)\,d\lambda(x) = \int T(g)(x)\,d\lambda(x),
$$
$$
\int g(x)\,d\lambda(x) = \int T^*(g)(x)\,d\lambda(x),
$$
$$
\int T(g)(x)h(x)\,d\mu(x) = \int g(x)T^*(h)(x)\,d\mu(x).
$$

The last equation is the definition of what it means to say that T* is the
adjoint of the operator T. It also follows from this equation that the adjoint
of the composition U = T*T is U itself. That is to say, U is self-adjoint.
Since U is two applications of successive substitution, it follows that

$$
\int f(x)\,d\lambda(x) = \int U(f)(x)\,d\lambda(x). \tag{8.42}
$$

According to Theorem C.10, the operator T is of Hilbert–Schmidt type
because (8.41) holds. Theorem C.11 says that such an operator is com-
pletely continuous.¹⁴ It follows then that the adjoint operator T* is also

¹⁴An operator T is completely continuous if every bounded set B ⊆ ℋ is
mapped by T to a set whose closure is sequentially compact. (That is, every
sequence in T(B) has a convergent subsequence.)

completely continuous, as is U. Since U is self-adjoint and completely con-
tinuous, ℋ has an orthonormal basis of eigenfunctions of U. Also, Theo-
rem C.12 says that a self-adjoint completely continuous operator has an
eigenvalue whose absolute value is equal to the norm of the operator.
Let V be the operator defined by V(f) = U(f) − f_Y⟨f_Y, f⟩. In particular,
V(f_Y) = 0 because T(f_Y) = T*(f_Y) = f_Y and ⟨f_Y, f_Y⟩ = 1. It is easy to
see that V = W*W, where W(f) = T(f) − f_Y⟨f_Y, f⟩, and W* is the
adjoint of W. It follows from Theorem C.13 that ||V|| = ||W*W|| = ||W||².
The remainder of the proof will be to show that ||V|| and hence ||W|| are
strictly less than 1, and then to show that this implies the conclusion to
the theorem.
Since V is self-adjoint and completely continuous, we can show that
||V|| < 1 by showing that the absolute value of its largest eigenvalue is
strictly less than 1. Let r be the largest eigenvalue of V, which is real since
V is self-adjoint. Let V(g) = rg. If r = 0, the result holds, so suppose that
r ≠ 0. Then

$$
\langle g, f_Y\rangle = \frac{1}{r}\langle V(g), f_Y\rangle = \frac{1}{r}\langle g, V(f_Y)\rangle = 0,
$$
since V(f_Y) = 0. Since g is not identically 0, we can write g = g⁺ − g⁻,
where g⁺ and g⁻ are respectively the positive and negative parts of g. Let
B be the set of x such that g(x) > 0, and let C be the set of x such
that g(x) < 0. Then λ(B) > 0 and λ(C) > 0, since ⟨f_Y, g⟩ = 0 but g is
not identically 0. We will show that |r| < 1 by means of contradiction. If
|r| = 1, then U(g) = rg = ±g. Since K₀ > 0, it follows that U(g⁺)(x) > 0
and U(g⁻)(x) > 0 for all x. Hence,

$$
g^+(x) < \begin{cases} U(g^+)(x) & \text{if } r = 1,\\ U(g^-)(x) & \text{if } r = -1, \end{cases} \quad \text{for } x \in B,
$$
$$
g^-(x) < \begin{cases} U(g^-)(x) & \text{if } r = 1,\\ U(g^+)(x) & \text{if } r = -1, \end{cases} \quad \text{for } x \in C.
$$

It follows that U(g⁺ + g⁻) > g⁺ + g⁻ for all x. In other words, U(|g|) > |g|,
which would imply ∫ U(|g|) dλ > ∫ |g| dλ, which contradicts (8.42). Hence,
|r| < 1, and the largest eigenvalue of V has absolute value |r| < 1. It follows
that ||V|| = |r| < 1.
Now, we know that ||W|| = |r|^{1/2} = c < 1. If f is a density, then
⟨f_Y, f⟩ = 1 and

$$
W(f) = T(f) - f_Y = T(f - f_Y).
$$

Similarly, if ⟨f_Y, g⟩ = 0, then W(g) = T(g) and ⟨f_Y, W(g)⟩ = 0, from
which it follows that Wⁿ(g) = Tⁿ(g) for all n. Since ⟨f_Y, f − f_Y⟩ = 0 for
every density f, it follows that, for all n,

$$
T^n(f) - f_Y = T^n(f - f_Y) = W^n(f - f_Y).
$$

So, for all n,

$$
\|f_n - f_Y\| = \|W^n(f_0 - f_Y)\| \le c^n\|f_0 - f_Y\| \le c^n\|f_0\|,
$$

because f₀ − f_Y is orthogonal to f_Y and ||f_Y|| = 1. □
Although it appears that one needs to know the solution f_Y in order to
check the conditions of this theorem, one often knows the function f_Y up
to a multiplicative constant. Hence, one could, at least in principle, check
the finiteness of the various integrals in Theorem 8.40.
Example 8.43. Suppose that the posterior density of (Y₁, Y₂, Y₃) is proportional
to

$$
f(y) = y_3^{-4}\exp\left(-\frac{1}{2y_3}\left\{\frac{(y_2-0.9y_1)^2}{0.19} + y_1^2 + 4\right\}\right).
$$

It is not difficult to see that the three conditional distributions are

$$
Y_1 \mid Y_2 = y_2, Y_3 = y_3 \sim N(0.9y_2,\ 0.19y_3),
$$
$$
Y_2 \mid Y_1 = y_1, Y_3 = y_3 \sim N(0.9y_1,\ 0.19y_3),
$$
$$
Y_3 \mid Y_1 = y_1, Y_2 = y_2 \sim \Gamma^{-1}\left(3,\ \frac{1}{2}\left[4 + y_1^2 + \frac{(y_2-0.9y_1)^2}{0.19}\right]\right).
$$

The integrand in (8.41) is a constant times (x₃′)⁻⁶x₃⁻⁵ times e to the power

$$
-\frac{1}{2x_3'}\left\{4 + x_1'^2 + \frac{(x_2'-0.9x_1')^2}{0.19} + \frac{2(x_1-0.9x_2')^2}{0.19} + \frac{2(x_2-0.9x_1)^2}{0.19}\right\} - \frac{1}{2x_3}\left\{4 + x_1^2 + \frac{(x_2-0.9x_1)^2}{0.19}\right\}.
$$

By collecting terms here, it is not difficult to show that this function is integrable
over the six variables x₁, x₂, x₃, x₁′, x₂′, x₃′.
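These three conditionals make the example easy to run. The following is a sketch of the SSS iteration in Python; the check on the correlation of (Y₁, Y₂) uses the fact, noted in Example 8.47, that (Y₁, Y₂) given y₃ is bivariate normal with correlation 0.9.

```python
import math, random

random.seed(2)

def sweep(y1, y2, y3):
    """One systematic scan of the three conditional distributions."""
    y1 = random.gauss(0.9 * y2, math.sqrt(0.19 * y3))
    y2 = random.gauss(0.9 * y1, math.sqrt(0.19 * y3))
    b = (4 + y1 ** 2 + (y2 - 0.9 * y1) ** 2 / 0.19) / 2
    y3 = b / random.gammavariate(3.0, 1.0)   # inverse-Gamma(3, b) draw
    return y1, y2, y3

state = (0.0, 0.0, 1.0)
sample = []
for i in range(6000):
    state = sweep(*state)
    if i >= 1000:          # discard burn-in iterations
        sample.append(state)

m1 = sum(s[0] for s in sample) / len(sample)
m2 = sum(s[1] for s in sample) / len(sample)
c12 = sum((s[0] - m1) * (s[1] - m2) for s in sample) / len(sample)
v1 = sum((s[0] - m1) ** 2 for s in sample) / len(sample)
v2 = sum((s[1] - m2) ** 2 for s in sample) / len(sample)
corr = c12 / math.sqrt(v1 * v2)
print(corr)   # should be near 0.9
```

The sample correlation of Y₁ and Y₂ settling near 0.9 is a simple diagnostic that the chain is visiting the right joint distribution.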
After one stops the iteration, one has a vector Y from approximately the
correct distribution. One can repeat the process and produce Y¹, ..., Yᵐ
for some value m. If one wants the marginal density of Yᵢ, one can let y_{\i}ˢ
stand for the (k − 1)-dimensional vector formed from yˢ by removing yᵢˢ,
and then calculate

$$
\hat f_{Y_i}(y) = \frac{1}{m}\sum_{s=1}^{m} f_{Y_i|Y_{\setminus i}}(y|y_{\setminus i}^s). \tag{8.44}
$$

This estimator is based on the simple fact that, for each s,

$$
f_{Y_i}(y) = \mathrm{E}\left[f_{Y_i|Y_{\setminus i}}(y|Y_{\setminus i})\right].
$$

If one wants the mean of Yᵢ, one can calculate

$$
\frac{1}{m}\sum_{s=1}^{m}\mathrm{E}(Y_i \mid Y_{\setminus i} = y_{\setminus i}^s), \tag{8.45}
$$

assuming that the conditional mean of Yᵢ given the others is easily available.
Equation (8.45) should be better than Σₛ₌₁ᵐ yᵢˢ/m, since the variance of
the simple average is the variance of (8.45) plus 1/m times the mean of
the conditional variance of Yᵢ given Y_{\i}. Similarly, the variance of Yᵢ can
be approximated by

$$
\frac{1}{m}\sum_{s=1}^{m}\mathrm{Var}(Y_i \mid Y_{\setminus i} = y_{\setminus i}^s) + \frac{1}{m}\sum_{s=1}^{m}\left[\mathrm{E}(Y_i \mid Y_{\setminus i} = y_{\setminus i}^s) - \hat\mu_i\right]^2, \tag{8.46}
$$

where μ̂ᵢ denotes (8.45), which should be a better estimate than the sample
variance of the yᵢˢ.
The SSS algorithm, as described, assumes that the random quantities
Y₁, ..., Yₖ are in a fixed order for every iteration. This is not actually
required for convergence of the algorithm. The proof of convergence is sim-
plified by making this assumption, however. Note also that each Yᵢ need not
be a single random variable. Some of them might themselves be vectors.
The question of how to arrange the coordinates is important for the rate
of convergence. The more dependence that exists between successive itera-
tions, the slower the convergence will be. One can understand why this is
true intuitively by realizing that convergence "occurs" when an iteration is
"independent" of the starting iteration. The more dependence lingers from
one iteration to the next, the longer it takes to get an iteration that is
essentially independent of the start. Example 8.43 can be used to illustrate
how the choice of coordinate arrangement affects the dependence between
iterates.
Example 8.47 (Continuation of Example 8.43; see page 510). It is not difficult
to see that every order of the three coordinates is essentially equivalent in this
example. Instead, let us compare the natural order Y₁, Y₂, Y₃ to the alternative
arrangement X₁ = (Y₁, Y₂)ᵀ, X₂ = Y₃. That is, let the first random quantity be
a two-dimensional vector consisting of both Y₁ and Y₂. To illustrate the effect of
this change on the amount of dependence between iterations, we will calculate
the conditional distribution of Y₁ at the next iteration given the variables at the
current iteration for both arrangements.
In the natural order, we generate Y₁ with N(0.9y₂′, 0.19y₃′) distribution given
Y₁′ = y₁′, Y₂′ = y₂′, and Y₃′ = y₃′. In the vector arrangement, we generate the whole
vector (Y₁, Y₂) at once with distribution N₂(0, y₃′A), where

$$
A = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.
$$

The conditional distribution of Y_1 given Y_1′ = y_1′, Y_2′ = y_2′, and Y_3′ = y_3′ is N(0, y_3′)
in this case. The dependence on the previous iteration is greatly reduced in the
vector arrangement. In the natural order, Y_1 is much more constrained by the
values y_2′ and y_3′ than in the vector arrangement. A similar calculation shows
that, in the natural order, the conditional distribution of Y_2 at the next iteration
given Y_1′ = y_1′, Y_2′ = y_2′, and Y_3′ = y_3′ is N(0.81y_2′, 0.3439y_3′), while in the vector
arrangement it is N(0, y_3′). Although Y_2 is less dependent on the previous iteration
than is Y_1, it is still more dependent in the natural order than in the vector
arrangement.
As a rule of thumb, if one knows that several random variables are highly
dependent, it will be better, if possible, to treat them as a single random
quantity in the SSS algorithm rather than to treat each one as a separate
coordinate.
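The speedup from blocking can be seen in a small simulation. The sketch below is a toy illustration (not Example 8.43 itself): for a bivariate normal target with correlation ρ, a coordinate-at-a-time Gibbs sampler produces a chain whose lag-one autocorrelation is ρ², whereas drawing the whole vector at once would produce independent iterations.

```python
import numpy as np

rng = np.random.default_rng(4)
rho = 0.95  # strong dependence between the two coordinates

# Coordinate-at-a-time Gibbs for a N2(0, [[1, rho], [rho, 1]]) target:
# each full conditional is N(rho * (other coordinate), 1 - rho^2).
x, y = 0.0, 0.0
xs = []
for _ in range(20000):
    x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))
    y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))
    xs.append(x)
xs = np.array(xs)

# For this sampler the x-chain satisfies x_{t+1} = rho^2 x_t + noise, so its
# lag-one autocorrelation is rho^2 = 0.9025; a blocked sampler drawing
# (x, y) jointly from the bivariate normal would have autocorrelation 0.
lag1 = np.corrcoef(xs[:-1], xs[1:])[0, 1]
```

The larger ρ is, the longer the single-coordinate chain takes to forget its starting point, which is exactly the intuition described above.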

8.5.2 Normal Hierarchical Models


Take the model in Section 8.2.1 as an example. The vector Y in the dis-
cussion of this section will be the collection of all parameters of the model,
namely M_1, ..., M_k, Ψ, Σ², and T². We will use the prior in which Σ² and T²
are independent with inverse gamma distributions Γ⁻¹(a₀/2, b₀/2) and
Γ⁻¹(c₀/2, d₀/2), respectively. All distributions will be conditional on the
data. It is easy to calculate the various conditional distributions we need.
In the following list, each distribution is to be understood as conditional
on both the data and on all of the other parameters.

    M_i ~ N( (ψ/τ² + n_i x̄_i/σ²) / (1/τ² + n_i/σ²) ,  1 / (1/τ² + n_i/σ²) ),

    Ψ ~ N( (Σ_{i=1}^k μ_i + ζ₀ψ₀) / (k + ζ₀) ,  τ² / (k + ζ₀) ),

    T² ~ Γ⁻¹( (c₀ + k + 1)/2 ,  (d₀ + Σ_{i=1}^k (μ_i − ψ)² + ζ₀(ψ − ψ₀)²)/2 ),

    Σ² ~ Γ⁻¹( (a₀ + n₁ + ⋯ + n_k)/2 ,  (b₀ + Σ_{i=1}^k {(n_i − 1)s_i² + n_i(x̄_i − μ_i)²})/2 ).

It is easy to generate pseudorandom numbers from each of the above dis-
tributions, so that SSS could be implemented without much trouble.
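A minimal sketch of such an implementation follows. The data and hyperparameter values below are simulated and hypothetical (they are not the data of Example 8.14); the four update steps follow the conditional distributions just listed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated one-way data: k groups, n observations each, common variance 1.
k, n = 3, 50
true_means = np.array([0.0, 5.0, 10.0])
x = true_means[:, None] + rng.normal(size=(k, n))
xbar, s2 = x.mean(axis=1), x.var(axis=1, ddof=1)

# Hypothetical hyperparameter choices.
a0 = b0 = c0 = d0 = 1.0
psi0, zeta0 = 0.0, 0.1

def inv_gamma(rng, shape, rate):
    # A Gamma^{-1}(shape, rate) draw is the reciprocal of a Gamma(shape) draw
    # with scale 1/rate.
    return 1.0 / rng.gamma(shape, 1.0 / rate)

mu, psi, sig2, tau2 = xbar.copy(), xbar.mean(), 1.0, 1.0
draws = []
for it in range(3000):
    # M_i | rest: precision-weighted average of psi and the group mean.
    prec = 1.0 / tau2 + n / sig2
    mu = rng.normal((psi / tau2 + n * xbar / sig2) / prec, np.sqrt(1.0 / prec))
    # Psi | rest.
    psi = rng.normal((mu.sum() + zeta0 * psi0) / (k + zeta0),
                     np.sqrt(tau2 / (k + zeta0)))
    # T^2 | rest.
    tau2 = inv_gamma(rng, (c0 + k + 1) / 2,
                     (d0 + ((mu - psi) ** 2).sum() + zeta0 * (psi - psi0) ** 2) / 2)
    # Sigma^2 | rest.
    sse = ((n - 1) * s2 + n * (xbar - mu) ** 2).sum()
    sig2 = inv_gamma(rng, (a0 + k * n) / 2, (b0 + sse) / 2)
    if it >= 1000:  # discard burn-in
        draws.append(mu.copy())

post_mean = np.mean(draws, axis=0)
```

With this much data the posterior means of the M_i sit close to the group sample means, with only mild shrinkage toward ψ.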
Example 8.48 (Continuation of Example 8.14; see page 487). We used naive
empirical Bayes estimates of the parameters as starting values and then ran
twenty thousand iterations, taking every twentieth iteration as a sampled value.
We also ran forty thousand iterations where we took every fortieth iteration as a
sampled value. The differences were negligible. Then we calculated (8.44) for each

of the three population means. These densities are plotted in Figure 8.15. The
respective posterior means were calculated to be 26.30, 18.78, and 19.82 using
(8.45). The posterior variances were calculated using (8.46) to be 4.525, 2.998,
and 2.387, respectively.

The SSS algorithm is also well suited to handle the case in which each
population has its own variance, Σ_i². Suppose that the Σ_i² are independent
with Γ⁻¹(a₀/2, b₀/2) distribution in the prior. In this case, we replace two
of the above distributions with

    M_i ~ N( (ψ/τ² + n_i x̄_i/σ_i²) / (1/τ² + n_i/σ_i²) ,  1 / (1/τ² + n_i/σ_i²) ),

    Σ_i² ~ Γ⁻¹( (a₀ + n_i)/2 ,  (b₀ + (n_i − 1)s_i² + n_i(x̄_i − μ_i)²)/2 ).
This model has k - 1 more parameters than the equal variance model.
An intermediate case between equal variance and independent variances
is to have a hierarchical model for the variances. Suppose that the Σ_i²
are conditionally independent with Γ⁻¹(a₀/2, a₀σ²/2) distribution given
Σ² = σ². Then suppose that the prior for Σ² is Γ(f₀/2, g₀/2). This model
has k more parameters than the equal variance model, and the conditional
distributions required for SSS must change to include the M_i distributions
just given for the other unequal variance model together with

    Σ_i² ~ Γ⁻¹( (a₀ + n_i)/2 ,  (a₀σ² + (n_i − 1)s_i² + n_i(x̄_i − μ_i)²)/2 ),

    Σ² ~ Γ( (f₀ + k a₀)/2 ,  (g₀ + a₀ Σ_{i=1}^k 1/σ_i²)/2 ).
Example 8.49. Suppose that we use the same data as in Example 8.14 on
page 487, but we use the hierarchical model for the population variances. We
continue to use a₀ = 1, c₀ = 1, d₀ = 1, ψ₀ = 10, and ζ₀ = 0.1, but we include
f₀ = 1 and g₀ = 0.1. The resulting posterior densities are plotted in Figure 8.50.
The posterior means and variances from (8.45) and (8.46) are

               1        2        3
    mean     27.1392  18.7672  19.7399
    variance  3.1634   4.6745   2.2667

We could also do a naive empirical Bayes analysis. (Recall that the adjusted
empirical Bayes analysis requires k > 3.) We will use σ̂_i² = s_i². We will also adjust
the variance of M_i by adding on τ̂² times the square of the coefficient of ψ̂ in
(8.37). This results in M_i having variance

    σ̂_i²τ̂² / (σ̂_i² + n_i τ̂²)  +  τ̂² σ̂_i⁴ / (σ̂_i² + n_i τ̂²)².

We need to iterate between (8.38) and (8.39). Starting with τ̂ = 1, it took four it-
erations to get no difference between the iterations. The results were ψ̂ = 21.9809

FIGURE 8.50. Numerical Approximations to Posterior Densities

and τ̂ = 23.4547. This leads to the following naive empirical Bayes posteriors:
N(27.3786, 2.5817), N(18.8116, 5.4845), and N(19.7526, 2.3257). These three den-
sities are also plotted in Figure 8.50. Notice that the hierarchical model for the
variances brings the estimated variance for the second population (the largest
of the three) down quite a bit from the empirical Bayes value while it brings
the other two variances up. Although the hierarchical model variance for M_3 is
slightly larger than the empirical Bayes variance, the density (in Figure 8.50) is
more peaked. The additional variance comes from heavier tails.

The more complicated two-way ANOVA described in Section 8.2.2 can


be handled using SSS in a much simpler fashion than the analytical dis-
cussion in Section 8.2.2. Let the parameters be M, A_i, B_j, (AB)_{i,j} (for
i = 1, ..., a and j = 1, ..., b), Σ_e², Σ_A², Σ_B², and Σ_AB². We also have the
constraints (8.17). Because of the constraints, we cannot apply SSS in the
most naive manner. For example, the random variables B_1, ..., B_b have the
property that the conditional distribution of each one given the others is
concentrated on a single value (namely, minus the sum of the others) with
probability one. This is an extreme case of dependence. No matter what
starting values one generates for B_1, ..., B_b, one will never change them,
no matter how many iterations one performs! Clearly, convergence cannot
occur in this case.¹⁵ There are two ways to circumvent this problem. One
is to drop one of the parameters from the algorithm and just calculate it
when needed. That is, just calculate B_b = −Σ_{j=1}^{b−1} B_j when it appears in

¹⁵In fact, the condition that the function K be strictly positive in Theorem 8.40
is violated.

some conditional distribution, but treat B_1, ..., B_{b−1} as the parameters.
Another approach is to treat the entire vector B = (B_1, ..., B_b) as one
parameter (one of the Y_i) as in the vector arrangement in Example 8.47
on page 511. Also, (AB)_i = ((AB)_{i,1}, ..., (AB)_{i,b}) could be treated as one
parameter for each i. We choose the vector approach here because the con-
straints introduce dependence among the coordinates, which will slow down
convergence.
Suppose that our model says that M, A_1, ..., A_a, B, (AB)_1, ..., (AB)_a
are all conditionally independent given Σ_A², Σ_B², Σ_AB², and Σ_e², with

    M ~ N(ψ₀, σ_B²/τ),    A_i ~ N(0, σ_A²),

    B ~ N_b(0, σ_B²[I − (1/b)11ᵀ]),    (AB)_i ~ N_b(0, σ_AB²[I − (1/b)11ᵀ]).
This produces the same model as in Section 8.2.2. Next, assume that the
variance parameters are independent with inverse gamma distributions

    Σ_A² ~ Γ⁻¹(a_A/2, b_A/2),    Σ_B² ~ Γ⁻¹(a_B/2, b_B/2),

    Σ_AB² ~ Γ⁻¹(a_AB/2, b_AB/2),    Σ_e² ~ Γ⁻¹(a_e/2, b_e/2).

Note that the prior distributions of the constrained parameters are the
conditional distributions of independent random variables given that the
constraint holds. That is, for example, if B_1, ..., B_b were IID N(0, σ_B²) and
we found the conditional distribution of B given that Σ_{j=1}^b B_j = 0, that
conditional distribution would be the distribution given above for B. (See
Problem 13 on page 535.) Also, if τ = b, then this model is the same as
saying that the B_j are IID N(0, σ_B²) and M = Σ_{j=1}^b B_j / b.
The posterior distributions of the parameters conditional on the other
parameters are now easily calculated after we introduce some notation.
Suppose that there are ni,j observations with the A factor at level i and
the B factor at level j. We do not need to assume that the cells all have
the same sample size as we did in Section 8.2.2. In fact, there can even be
empty cells in this analysis. Define

    n_{.,j} = Σ_{i=1}^a n_{i,j},            ȳ_{.,j,.} = (1/n_{.,j}) Σ_{i=1}^a Σ_{k=1}^{n_{i,j}} y_{i,j,k},
    n_{i,.} = Σ_{j=1}^b n_{i,j},            ȳ_{i,.,.} = (1/n_{i,.}) Σ_{j=1}^b Σ_{k=1}^{n_{i,j}} y_{i,j,k},
    n_{.,.} = Σ_{i=1}^a Σ_{j=1}^b n_{i,j},   ȳ_{.,.,.} = (1/n_{.,.}) Σ_{j=1}^b Σ_{i=1}^a Σ_{k=1}^{n_{i,j}} y_{i,j,k},
    ȳ_{i,j,.} = (1/n_{i,j}) Σ_{k=1}^{n_{i,j}} y_{i,j,k},
    ᾱ_. = (1/n_{.,.}) Σ_{i=1}^a n_{i,.} α_i,      (ᾱβ)_{.,.} = (1/n_{.,.}) Σ_{i=1}^a Σ_{j=1}^b n_{i,j} (αβ)_{i,j},
    ᾱ_j = (1/n_{.,j}) Σ_{i=1}^a n_{i,j} α_i,      (ᾱβ)_{.,j} = (1/n_{.,j}) Σ_{i=1}^a n_{i,j} (αβ)_{i,j},
    β̄_. = (1/n_{.,.}) Σ_{j=1}^b n_{.,j} β_j,      (ᾱβ)_{i,.} = (1/n_{i,.}) Σ_{j=1}^b n_{i,j} (αβ)_{i,j},
    β̄_i = (1/n_{i,.}) Σ_{j=1}^b n_{i,j} β_j,
    W = Σ_{i=1}^a Σ_{j=1}^b Σ_{k=1}^{n_{i,j}} (y_{i,j,k} − ȳ_{i,j,.})².

By our seeking the conditional posterior distribution of one parameter given
the others, the data can be viewed as having a simple structure. For exam-
ple, if we want the conditional posterior distribution of M given the other
parameters, we construct U*_{i,j,k} = y_{i,j,k} − α_i − β_j − (αβ)_{i,j}. Then the U*_{i,j,k}
are IID N(μ, σ_e²) given M = μ. The conditional posterior distribution of M
can easily be shown to be

    M ~ N( (τψ₀/σ_B² + n_{.,.}(ȳ_{.,.,.} − ᾱ_. − β̄_. − (ᾱβ)_{.,.})/σ_e²) / (τ/σ_B² + n_{.,.}/σ_e²) ,
           1 / (τ/σ_B² + n_{.,.}/σ_e²) ).
A similar analysis works for each A_i. The result is

    A_i ~ N( (n_{i,.}(ȳ_{i,.,.} − μ − β̄_i − (ᾱβ)_{i,.})/σ_e²) / (1/σ_A² + n_{i,.}/σ_e²) ,
             1 / (1/σ_A² + n_{i,.}/σ_e²) ).

Similarly, if we want the conditional posterior of the vector (AB)_i given
the other parameters, we construct Ũ_{i,j,k} = y_{i,j,k} − M − A_i − B_j. Then
the Ũ_{i,j,k} are independent with Ũ_{i,j,k} having N((αβ)_{i,j}, σ_e²) distribution
given (AB)_i = ((αβ)_{i,1}, ..., (αβ)_{i,b})ᵀ. The posterior can be found by first
calculating the posterior as if the parameters were unconstrained and then
conditioning on the constraint as in Problem 13 on page 535. Define the
b-dimensional vectors

    z_i = (1/σ_e²) ( n_{i,1}(ȳ_{i,1,.} − μ − α_i − β_1), ..., n_{i,b}(ȳ_{i,b,.} − μ − α_i − β_b) )ᵀ,
    v_i = ( σ_e²σ_AB²/(σ_e² + n_{i,1}σ_AB²), ..., σ_e²σ_AB²/(σ_e² + n_{i,b}σ_AB²) )ᵀ.

Let diag(v_i) stand for the diagonal matrix with diagonal entries equal to the
coordinates of v_i. Then the conditional posterior of (AB)_i is N_b(C_i z_i, C_i),
where C_i = diag(v_i) − v_i v_iᵀ/(1ᵀv_i). A similar analysis works for the B
vector. In this case, define the vectors

    z = (1/σ_e²) ( n_{.,1}(ȳ_{.,1,.} − μ − ᾱ_1 − (ᾱβ)_{.,1}), ..., n_{.,b}(ȳ_{.,b,.} − μ − ᾱ_b − (ᾱβ)_{.,b}) )ᵀ,
    v = ( σ_e²σ_B²/(σ_e² + n_{.,1}σ_B²), ..., σ_e²σ_B²/(σ_e² + n_{.,b}σ_B²) )ᵀ.

The posterior of B is N_b(Cz, C), where C = diag(v) − vvᵀ/(1ᵀv).


Since the variance parameters (except for Σ_e²) are conditionally indepen-
dent of the data given the other parameters, their conditional posteriors
will not depend on the data. We can calculate the conditional posteriors

making use of Proposition 8.13:

    Σ_A² ~ Γ⁻¹( (a_A + a)/2 ,  (b_A + Σ_{i=1}^a α_i²)/2 ),

    Σ_B² ~ Γ⁻¹( (a_B + b)/2 ,  (b_B + τ(μ − ψ₀)² + Σ_{j=1}^b β_j²)/2 ),

    Σ_AB² ~ Γ⁻¹( (a_AB + ab − a)/2 ,  (b_AB + Σ_{i=1}^a Σ_{j=1}^b (αβ)_{i,j}²)/2 ),

and Σ_e² has distribution

    Γ⁻¹( (a_e + n_{.,.})/2 ,  (b_e + W + Σ_{i=1}^a Σ_{j=1}^b n_{i,j}(ȳ_{i,j,.} − μ − α_i − β_j − (αβ)_{i,j})²)/2 ).
The only remaining problem for implementing SSS in this example is
how to simulate from a multivariate normal distribution N_b(Cz, C) with a
singular covariance matrix C = diag(v) − vvᵀ/(1ᵀv). The most straight-
forward way is to find a b × (b − 1) matrix D such that DDᵀ = C, then
generate an N_{b−1}(0, I) vector V, and use D(V + Dᵀz). Let v* be the first
b − 1 coordinates of v, let √v* be the vector with jth coordinate √v_j, let
diag(√v*) be the diagonal matrix with (j, j) element equal to √v_j, and let

    h = (1 − √(v_b/(1ᵀv))) / (1ᵀv*).

Then the following matrix satisfies DDᵀ = C:

    D = ( diag(√v*) − h v* (√v*)ᵀ )
        ( −√(v_b/(1ᵀv)) (√v*)ᵀ   ).
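The construction can be checked numerically. The sketch below (with hypothetical entries for v) builds D as described, verifies DDᵀ = C, and draws a vector from N_b(Cz, C) via D(V + Dᵀz); the draw automatically satisfies the sum-to-zero constraint because the rows of C and the columns of D sum to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def factor_singular_cov(v):
    # Build the b x (b-1) matrix D of the text, satisfying
    # D @ D.T = C = diag(v) - v v^T / (1^T v).
    v = np.asarray(v, dtype=float)
    s = v.sum()
    vstar = v[:-1]            # first b - 1 coordinates of v
    root = np.sqrt(vstar)     # vector with jth coordinate sqrt(v_j)
    h = (1.0 - np.sqrt(v[-1] / s)) / vstar.sum()
    top = np.diag(root) - h * np.outer(vstar, root)
    bottom = -np.sqrt(v[-1] / s) * root
    return np.vstack([top, bottom])

v = rng.uniform(0.5, 2.0, size=5)          # hypothetical variances
C = np.diag(v) - np.outer(v, v) / v.sum()
D = factor_singular_cov(v)

# Simulate from N_5(Cz, C): D(V + D^T z) = Cz + DV, with V ~ N_4(0, I).
z = rng.normal(size=5)
sample = D @ (rng.normal(size=4) + D.T @ z)
```

Since D(V + Dᵀz) = DV + DDᵀz = DV + Cz, the draw has the required mean and covariance.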
One should also note that, just as in the one-way ANOVA, we could
have had unequal variances in the cells. That is, we could have had the
conditional variance of Y_{i,j,k} be Σ_{i,j}² instead of Σ_e² for all i and j. This would
have introduced ab − 1 additional variance parameters, but the conditional
distributions would have been only slightly more complicated. The serious
reader should work this case out in detail. In addition, a hierarchical model
for the Σ_{i,j}² could be introduced. Intermediately, one could model the Y_{i,j,k}
as having variance Σ_i² or Σ_j², so that some cells have the same variance and
others do not. All such models can be handled in nearly the same fashion
as above.

8.5.3 Nonnormal Models


There is a large class of problems to which the SSS methodology can apply.
We will not attempt to catalogue this class. We give only a few more exam-
ples to show how, with a little imagination, the methodology can apply even

where one would not normally think. In Example 7.104 on page 444, the
observables X_i were modeled as having Cau(θ, 1) distribution given Θ = θ.
Suppose now that we introduce an extra parameter Y_i for each observation
and say that X_i given Y_i = y and Θ = θ has N(θ, y) distribution and the Y_i
are independent of Θ and of each other with Γ⁻¹(1/2, 1/2) distribution.¹⁶
It follows that X_i ~ Cau(θ, 1) given Θ = θ; hence this model is equiva-
lent to the original model. However, this new model is easily handled via
SSS. In Example 7.104, we supposed that Θ had N(0, 1000) as a prior. The
conditional posterior of Θ given the Y_i is

    Θ ~ N( (Σ_{i=1}^n x_i/y_i) / (0.001 + Σ_{i=1}^n 1/y_i) ,  1 / (0.001 + Σ_{i=1}^n 1/y_i) ).

The conditional posterior of Y_i given Θ = θ is Γ⁻¹(1, [1 + (x_i − θ)²]/2).


Example 8.51 (Continuation of Example 7.104; see page 444). After 40 itera-
tions, we constructed 10,000 vectors of the 11 parameters Y_1, ..., Y_10, Θ. The
estimated mean and variance of Θ were 4.585 and 2.233, respectively. The poste-
rior density of Θ is plotted in Figure 7.105 on page 444 together with the normal
approximation of Theorem 7.101 and an approximation by numerical integration.
The same principle as described here can be used if one wishes to use a
Cauchy or t distribution for the prior distribution of the location param-
eter for normally distributed data. In fact, a t distribution for the prior
combined with data having t distribution can be handled using SSS and
simple normal/inverse-gamma posteriors.
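A minimal sketch of this scale-mixture scheme follows, using the two conditional posteriors described above (Θ given the Y_i is normal; each Y_i given Θ is inverse gamma). The data below are hypothetical, standing in for the ten observations of Example 7.104, with one gross outlier included to show the Cauchy model's downweighting.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data; the value 30.0 is a deliberate outlier.
x = np.array([2.1, 3.4, 4.9, 5.2, 5.6, 4.4, 3.9, 30.0, 5.0, 4.7])
n = len(x)

theta, kept = float(np.median(x)), []
for it in range(6000):
    # Y_i | theta ~ Gamma^{-1}(1, [1 + (x_i - theta)^2]/2):
    # draw Gamma(1, 1) and invert, scaling by the rate parameter.
    y = ((1 + (x - theta) ** 2) / 2) / rng.gamma(1.0, 1.0, size=n)
    # Theta | y ~ N(sum(x_i/y_i)/(0.001 + sum(1/y_i)), 1/(0.001 + sum(1/y_i))).
    prec = 0.001 + (1.0 / y).sum()
    theta = rng.normal((x / y).sum() / prec, np.sqrt(1.0 / prec))
    if it >= 1000:  # discard burn-in
        kept.append(theta)

post_mean = float(np.mean(kept))
```

Because the outlying observation receives a large latent variance y_i, it carries little weight in the normal update for Θ, and the posterior mean stays near the bulk of the data rather than near the sample mean.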
The next example is the model described in Section 8.3.2 in which there
are k groups of subjects with n_i subjects in group i. The data are X_i = x_i,
the number of subjects with a positive response to some query. The X_i are
modeled as conditionally independent Bin(n_i, p_i) given P_1 = p_1, ..., P_k =
p_k. The P_i are modeled as conditionally independent Beta(θr, [1 − θ]r)
given Θ = θ, R = r. Finally, we will suppose that Θ and R are independent
with discrete prior distributions having densities f_Θ and f_R with respect to
counting measures on the sets {θ_1, ..., θ_a} and {r_1, ..., r_b}, respectively. The
conditional posteriors of the P_i given the other parameters were already
seen to be Beta(θr + x_i, [1 − θ]r + n_i − x_i). The posterior of Θ given the
others has probability of Θ = θ_j proportional to

    f_Θ(θ_j) Γ(θ_j r)⁻ᵏ Γ([1 − θ_j]r)⁻ᵏ ∏_{i=1}^k p_i^{θ_j r − 1} (1 − p_i)^{[1−θ_j]r − 1}.

¹⁶This distribution is also known as the inverse-χ₁² distribution.



The conditional posterior probability of R = r_j given the other parameters
is proportional to

    f_R(r_j) Γ(r_j)ᵏ Γ(θr_j)⁻ᵏ Γ([1 − θ]r_j)⁻ᵏ ∏_{i=1}^k p_i^{θr_j − 1} (1 − p_i)^{[1−θ]r_j − 1}.
Simulating random variables with a discrete distribution can be done by
the following tedious but straightforward method. Let X have density f_X
with respect to counting measure on the set {x_1, x_2, ...}. Generate a U(0, 1)
random variable U. Set X = x_j, where j is the first n such that Σ_{i=1}^n f_X(x_i) ≥ U.
This is the discrete version of the probability integral transform.
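A minimal sketch of this discrete inverse-CDF method:

```python
import random

def sample_discrete(values, probs, rng):
    # Discrete probability integral transform: return the first value
    # whose cumulative probability reaches the uniform draw U.
    u = rng.random()
    cum = 0.0
    for v, p in zip(values, probs):
        cum += p
        if cum >= u:
            return v
    return values[-1]  # guard against floating-point shortfall

rng = random.Random(3)
draws = [sample_discrete([1, 2, 3], [0.2, 0.5, 0.3], rng) for _ in range(10000)]
freq2 = draws.count(2) / 10000
```

Over many draws, the empirical frequency of each value approaches its assigned probability, which is all the SSS steps for Θ and R above require.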
As an alternative to sampling from discrete distributions, we could in-
troduce some latent variables that make the problem look like a normal
hierarchical model.¹⁷ Let X_i = Σ_{j=1}^{n_i} X_{i,j}, where the X_{i,j} are modeled as
IID Ber(p_i) given P_i = p_i for each i. Let Z_{i,j} be IID with N(μ_i, 1) distribu-
tion given M_i = μ_i, where P_i = Φ(M_i), and assume that X_{i,j} = 1_{[0,∞)}(Z_{i,j}).
We can treat the Z_{i,j} as parameters or missing data. Let the prior for the M_i
be that they are conditionally IID with N(μ, τ²) distribution given M = μ
and T = τ. Let T² have an inverse gamma distribution. Either let M be
independent of T with N(μ₀, σ₀²) distribution or let M given T = τ have
N(μ₀, τ²/λ₀) distribution. The conditional distribution of the Z_{i,j} given
the M_i, M, T, and the X_{i,j} is that of independent, truncated N(M_i, 1)
random variables. (Those Z_{i,j} corresponding to X_{i,j} = 1 are truncated to
the interval [0, ∞) and the others are truncated to the interval (−∞, 0).)
The conditional distribution of the M_i given the Z_{i,j}, the X_{i,j}, M, and T,
as well as the conditional distributions of M and T given the others, are all
obtained as in the appropriate normal hierarchical model.

8.6 Mixtures of Models


8.6.1 General Mixture Models
A different type of hierarchical model is one in which one contemplates
several different models for the same data but does not wish to condition
on just one of them. For example, consider a case in which one observes pairs
(X_i, Y_i) and one wishes to predict the Y coordinate from the X coordinate
(often called regression). One typical model is that there are parameters
Θ = (B₀, B₁, Σ) such that, conditional on Θ = (β₀, β₁, σ) and X = x, Y ~
N(β₀ + β₁x, σ²) and X is independent of Θ. Another model says that there
are parameters Θ = (B₀, B₁, Σ) such that, conditional on Θ = (β₀, β₁, σ)
and X = x, log(Y) ~ N(β₀ + β₁x, σ²) and X is independent of Θ. The two

¹⁷This model and some generalizations of it are discussed by Albert and Chib
(1993).

Θs are not the same random quantities. In fact, it is common to believe
that at most one of them actually exists. Let Ψ be a random quantity
such that, conditional on Ψ = 0, there is a Θ₀ such that, conditional on
Θ₀ = (β₀, β₁, σ) and X = x, log(Y) ~ N(β₀ + β₁x, σ²); and conditional on
Ψ = 1, there is a Θ₁ such that, conditional on Θ₁ = (β₀, β₁, σ) and X = x,
Y ~ N(β₀ + β₁x, σ²). In a sense, the parameters are now (Ψ, Θ₀, Θ₁), but
the joint distribution of Θ₀ and Θ₁ is of no interest, since there are no
data that depend on both of them. In fact, we don't even need to believe
that they coexist. One would need to specify a prior distribution for Ψ, a
conditional prior for Θ₀ given Ψ = 0, and a conditional prior for Θ₁ given
Ψ = 1. After observing data, one could calculate conditional posteriors for
Θ₀ and Θ₁ given Ψ = 0 and Ψ = 1, respectively. One could also construct
prior predictive distributions for the data given Ψ alone and use these to get
the posterior for Ψ. In symbols, we need f_Ψ, f_{Θ₀|Ψ}(θ₀|0), and f_{Θ₁|Ψ}(θ₁|1).
The original models give f_{Y,X|Θ₀} and f_{Y,X|Θ₁}, where we assume that

    f_{Y,X|Θ₀}(y, x|θ₀) = f_{Y,X|Θ₀,Ψ}(y, x|θ₀, 0) = f_{Y,X|Θ₀,Θ₁,Ψ}(y, x|θ₀, θ₁, 0),

    f_{Y,X|Θ₁}(y, x|θ₁) = f_{Y,X|Θ₁,Ψ}(y, x|θ₁, 1) = f_{Y,X|Θ₀,Θ₁,Ψ}(y, x|θ₀, θ₁, 1),

so that (Y, X) is conditionally independent of Θ_{1−i} given Θ_i and Ψ = i for
i = 0, 1. The predictive density of (Y, X) given Ψ is

    f_{Y,X|Ψ}(y, x|ψ) = ∫_{Ω_ψ} f_{Y,X|Θ_ψ}(y, x|θ_ψ) f_{Θ_ψ|Ψ}(θ_ψ|ψ) dθ_ψ,

where Ω_ψ is the parameter space given Ψ = ψ, for ψ = 0, 1. The conditional
posteriors are

    f_{Θ_ψ|Ψ,Y,X}(θ_ψ|ψ, y, x) = f_{Y,X|Θ_ψ}(y, x|θ_ψ) f_{Θ_ψ|Ψ}(θ_ψ|ψ) / f_{Y,X|Ψ}(y, x|ψ),

for ψ = 0, 1. The posterior of Ψ is

    f_{Ψ|Y,X}(ψ|y, x) = f_Ψ(ψ) f_{Y,X|Ψ}(y, x|ψ) / [ f_Ψ(0) f_{Y,X|Ψ}(y, x|0) + f_Ψ(1) f_{Y,X|Ψ}(y, x|1) ].

If there are future data (Y′, X′) that are conditionally independent of
(Y, X) given the parameters, then predictive inference is available:

    f_{Y′,X′|Y,X}(y′, x′|y, x) = Σ_{ψ=0}^1 f_{Y′,X′|Ψ}(y′, x′|ψ) f_{Ψ|Y,X}(ψ|y, x).

Notice that the predictive density f_{Y′,X′|Y,X} is a weighted average of the
two predictive densities one would have used if one had believed each of the
two models. The weights are the posterior probabilities of the two models.
If, for example, model 0 looks orders of magnitude better than model 1
based on the (Y, X) data (that is, f_{Ψ|X,Y}(0|x, y) is much, much larger than
f_{Ψ|X,Y}(1|x, y)), then the predictive distribution of the future data will be
almost the same as if only model 0 had been used from the start.¹⁸ The
real advantage to this approach arises when neither model appears much
better than the other based on the data. In this case, we can hedge our
predictions to allow for the possibility that one or the other model will later
turn out to appear better.
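The arithmetic of this model averaging can be sketched briefly. The two prior predictive densities below, N(0, 2) for ψ = 0 and N(0, 5) for ψ = 1, are hypothetical stand-ins for the integrals f_{Y,X|Ψ}, not the regression models of the text.

```python
import math

def norm_pdf(y, var):
    return math.exp(-y * y / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical prior predictive densities f_{Y|Psi}(y|psi) for one observation.
variances = {0: 2.0, 1: 5.0}
prior = {0: 0.5, 1: 0.5}
y_obs = 3.0

# Posterior model probabilities: prior weight times predictive density,
# renormalized over the two models.
pred = {psi: norm_pdf(y_obs, var) for psi, var in variances.items()}
norm = sum(prior[psi] * pred[psi] for psi in (0, 1))
post = {psi: prior[psi] * pred[psi] / norm for psi in (0, 1)}

# Predictive density of a future observation: posterior-weighted mixture.
def predictive(y):
    return sum(post[psi] * norm_pdf(y, variances[psi]) for psi in (0, 1))
```

An observation in the tail of the narrower model (y = 3 here) shifts posterior weight toward the wider model, and the mixture predictive hedges between the two accordingly.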
Of course, the above description can be extended to apply to arbitrary
data X and an arbitrary number of models. For example, the two models
in the regression example considered above can be embedded in a family
of models in which, conditional on Ψ = ψ, Θ_ψ = (β₀, β₁, σ), and X = x,

    (Y^ψ − 1)/ψ ~ N(β₀ + β₁x, σ²),

where ψ = 0 is defined by continuity (taking a limit). This is the familiar
Box-Cox family of transformations introduced by Box and Cox (1964). If
uncountably many values of ψ are being considered, the sums over ψ must
be replaced by integrals that, presumably, must be evaluated numerically.

8.6.2 Outliers
One popular use of mixtures of models is to allow for the possibility of
outliers in a data set. An outlier is an observation whose distribution (before
seeing the data) is not like that of the other observations. This "definition"
of outlier is intentionally vague. Consider an example to help to clarify the
concept.
Example 8.52. Suppose that X_1, ..., X_n are potential observations, but we be-
lieve that some of them may not have the same distributions as the others. Box
and Tiao (1968) describe a model similar to the following. Let Θ = (M, Σ), and
suppose that the conditional distribution of each X_i given Θ = (μ, σ) is N(μ, σ²)
with probability 1 − α and is N(μ, cσ²) with probability α, where c > 1 and α are
constants chosen a priori. Suppose that the conditional distribution of M given
Σ = σ is N(μ₀, σ²/λ₀) and Σ² has Γ⁻¹(a₀/2, b₀/2) distribution. There is missing
data here, namely the indicators of whether each observation has variance cσ² or
not. Let Ψ stand for the subset of {1, ..., n} such that the observations with sub-
scripts in Ψ have the larger variance. For each possible value ψ of Ψ, let n_ψ indicate

¹⁸The predictive density has been used by many authors as a means for select-
ing and comparing models. See Geisser and Eddy (1979) and Dawid (1984) for
different perspectives.

the number of elements of ψ. Let

    z_i = 1      if i ∉ ψ,
    z_i = 1/c    if i ∈ ψ.

Then f_{X|Ψ}(x|ψ) equals a constant times

    c^{−n_ψ/2} λ_ψ^{−1/2} [ b₀ + Σ_{i=1}^n z_i(x_i − x̄_ψ)² + λ₀ (Σ_{i=1}^n z_i)(x̄_ψ − μ₀)²/λ_ψ ]^{−(a₀+n)/2},   (8.53)

where

    λ_ψ = λ₀ + Σ_{i=1}^n z_i,    x̄_ψ = Σ_{i=1}^n z_i x_i / Σ_{i=1}^n z_i.

If we model the Z_i as independent, then the prior probability of Ψ = ψ is
α^{n_ψ}(1 − α)^{n−n_ψ}, so we can calculate the posterior probability of each possible
subset of outliers.
outliers.
For example, suppose that n = 15 and our prior has a₀ = 1, b₀ = 100, μ₀ = 0,
λ₀ = 1, α = 0.02, and c = 25. (For computational simplicity, we truncate the
distribution of Ψ to a maximum of six elements.) The data are the infamous
Darwin data [see Fisher (1966), p. 37] shown below:

    −67, −48, 6, 8, 14, 16, 23, 24, 28, 29, 41, 49, 56, 60, 75

The possible values of ψ with the highest posterior probability are given in Ta-
ble 8.54 under the column "Model 1." The set of size three with the highest
posterior probability of being outliers is {1, 2, 15}, with probability 0.0030. We
can also add the probabilities of all subsets that contain a specific observation to
get the marginal probabilities that each observation is an outlier. See Table 8.55.
The observations not listed in Table 8.55 each have probability less than 0.005
of being an outlier.
Of course, one need not choose a single value of c or a single value of o. One
could treat these as further mixing parameters like III and compute a posterior

TABLE 8.54. Posterior Probabilities of Outlier Sets

    ψ         Model 1   Model 2
    ∅         0.7192    0.7767
    {1}       0.1295    0.0885
    {1,2}     0.0483    0.0448
    {2}       0.0241    0.0166
    {15}      0.0114    0.0080
    {14}      0.0060    0.0042
    {13}      0.0052    0.0037
    {12}      0.0043    0.0031
    {11}      0.0036    0.0026
    {3}       0.0033    0.0023
    {1,15}    0.0032    0.0031
    {4}       0.0032    0.0023

TABLE 8.55. Posterior Probabilities of Outlier Observations

    i         1       2       11      12      13      14      15
    Model 1   0.1970  0.0816  0.0050  0.0060  0.0075  0.0088  0.0195
    Model 2   0.1627  0.0803  0.0047  0.0057  0.0074  0.0089  0.0214

distribution for them. For example, (8.53) is now f_{X|Ψ,C,A}(x|ψ, c, α), which does
not depend on α. This is because X is conditionally independent of A given Ψ
and C. Suppose that we let A have a Beta distribution. It is difficult to have a
Beta distribution with mean 0.02 that is neither extremely concentrated near
its mean nor extremely concentrated near 0. Suppose that we choose Beta(1, 49),
which has Pr(A ≤ 0.02) = 0.6358. Also, let C have probability 0.05 of being one
of the numbers 5, 10, ..., 100. The posterior distribution of C is almost the same
as the prior, meaning that the different values of C do not lead to much difference
in the predictive density of the data, although C = 10 has the highest posterior
probability. The posterior distribution of A has mean 0.0204. The posterior prob-
abilities of the various Ψ sets are given in Table 8.54 under the column "Model 2."
The probability of the set {1, 2, 15} is now 0.0058. The probabilities that each
of the observations is an outlier are in Table 8.55. Although the probability that
Ψ = {1} is smaller in Model 2, the probability that observation 1 is an outlier is
still quite large, because there are many other sets ψ containing 1 that now have
higher probability of equaling Ψ. For example, the probability of three outliers
is twice as high in Model 2 as in Model 1, and the probability of four outliers is
six times as high.
Before we leave this example, we offer another variation. Suppose that we
give C a continuous prior distribution, say Γ⁻¹(c₀/2, d₀/2) truncated below at
c = 1 and independent of (Σ, M, A). We could find the posterior distributions of
whatever we wanted by using successive substitution sampling (see Section 8.5).
The following conditional posterior distributions are easy to find and are easy to
simulate:

    C ~ Γ⁻¹( (c₀ + n_ψ)/2 ,  (d₀ + Σ_{i∈ψ}(x_i − μ)²/σ²)/2 ),

    Σ² ~ Γ⁻¹( (a₀ + n + 1)/2 ,  (b₀ + (1/c)Σ_{i∈ψ}(x_i − μ)² + Σ_{i∉ψ}(x_i − μ)² + λ₀(μ − μ₀)²)/2 ),

    M ~ N( (λ₀μ₀ + (1/c)Σ_{i∈ψ} x_i + Σ_{i∉ψ} x_i) / (λ₀ + n − n_ψ + n_ψ/c) ,
           σ² / (λ₀ + n − n_ψ + n_ψ/c) ),

    A ~ Beta(α₀ + n_ψ, β₀ + n − n_ψ),

    Z_i ~ Ber( α / [ α + √c (1 − α) exp( −((c − 1)/(2cσ²))(x_i − μ)² ) ] ),

where, as before, Ψ = {i : Z_i = 1}, and the distribution of C is still truncated
below at c = 1. Since the inverse of the Γ CDF is available in many subroutine
libraries, the truncated distribution can be simulated using the probability inte-
gral transform. Notice that the Z_i are independent of each other given the other
parameters and the data. An analysis like the one just described is developed by
Verdinelli and Wasserman (1991).
Notice that in the last variation, the model no longer resembles a mixture of
models. In fact, it is just a more highly parameterized model with parameter
(Σ, M, Ψ, C, A). Alternatively, the parameter could be taken as (Σ, M, C, A) with
Z being considered as missing data.

This example is not meant to be a prescription for how to handle outliers,
but merely an example of how mixtures of models can be used for
such a problem. Freeman (1980) describes several other methods for detect-
ing outliers. West (1984) describes hierarchical models for accommodating
outliers in linear regression.
In fact, any situation in which there is uncertainty is amenable to analysis
using a mixture of models. Even the simplest univariate one-sample prob-
lem can admit several prior distributions and/or parametric families. The
different combinations of parametric family and prior distribution can be
mixed using the general theory outlined above. See Problem 14 on page 75
for a simple example.

8.6.3 Bayesian Robustness


In Section 5.1.5, we introduced M-estimators as robust estimators that
might be less sensitive to anomalies in the data. Because Bayesian solu-
tions depend on prior distributions for parameters in addition to the con-
ditional distributions of data given parameters, one might be interested in
prior distributions that provide a measure of robustness. Also, one might
be interested in the degree of robustness that a particular choice of prior
exhibits when compared with several others.
A straightforward way of comparing a particular prior distribution μ_Θ to
several others is to compute whatever one would normally compute using
μ_Θ as the prior and then recompute the same quantities using all of the
other priors. This activity often goes by the name of sensitivity analysis. If
the number of alternative priors is too large, one might be able to compute
bounds on the various quantities of interest as the prior ranges over the
alternatives. One popular way of specifying a set of alternative priors is by
means of ε-contamination. For a given μ_Θ, ε > 0, and set C of probability
measures on (Ω, τ), one forms the collection

    C_ε = { (1 − ε)μ_Θ + εη : η ∈ C }   (8.56)

of alternative prior distributions. The set C_ε is called an ε-contamination
class. If μ_Θ ∈ C, then μ_Θ ∈ C_ε also. Note that each element of C_ε is a
mixture of two possible prior distributions. The largest set C one could
use is the set of all probability distributions on (Ω, τ). Suppose that one is
interested in posterior probabilities of sets C ∈ τ. It is possible to calculate
bounds on the posterior probabilities of such sets as the prior ranges over
C_ε.
Theorem 8.57.¹⁹ Suppose that X has conditional density f_{X|Θ}(x|θ) with
respect to ν given Θ = θ. Let C be the set of all distributions on (Ω, τ),
and let C_ε be as in (8.56). For each π ∈ C_ε, let π(·|x) denote the posterior
distribution of Θ given X = x calculated as if π were the prior distribution.
Similarly, let μ_{Θ|X}(·|x) denote the posterior calculated as if μ_Θ were the
prior. For each C ∈ τ,

    inf_{π∈C_ε} π(C|x) = (1 − ε) f_X(x) μ_{Θ|X}(C|x) / [ (1 − ε) f_X(x) + ε sup_{θ∈Cᶜ} f_{X|Θ}(x|θ) ],

    sup_{π∈C_ε} π(C|x) = 1 − (1 − ε) f_X(x) μ_{Θ|X}(Cᶜ|x) / [ (1 − ε) f_X(x) + ε sup_{θ∈C} f_{X|Θ}(x|θ) ],

where f_X denotes the marginal density of X under the assumption that μ_Θ
is the prior.
PROOF. For π ∈ C_ε with π = (1 − ε)μ_Θ + εη, it is easy to see that

    π(C|x) = [ (1 − ε) f_X(x) μ_{Θ|X}(C|x) + ε ∫_C f_{X|Θ}(x|θ) dη(θ) ] / [ (1 − ε) f_X(x) + ε g(x) ],   (8.58)

where g(x) = ∫ f_{X|Θ}(x|θ) dη(θ) is the marginal density of X under the
assumption that η is the prior. The expression in (8.58) will get smaller if η
is replaced by any η* such that η*(C) = 0 and η*(D) ≥ η(D) for D ⊆ Cᶜ.
(For example, rescale η(· ∩ Cᶜ) to be a probability.) It follows that the
smallest values of (8.58) occur when η(C) = 0. When η(C) = 0, (8.58)
becomes

    π(C|x) = (1 − ε) f_X(x) μ_{Θ|X}(C|x) / [ (1 − ε) f_X(x) + ε g(x) ].   (8.59)

This can be minimized by making g(x) as large as possible. But since
g(x) is an average of values of f_{X|Θ}(x|θ), its supremum is clearly equal
to sup_{θ∈Cᶜ} f_{X|Θ}(x|θ). So the infimum of π(C|x) equals (8.59) with g(x)
replaced by sup_{θ∈Cᶜ} f_{X|Θ}(x|θ). The supremum is obtained by applying
the same argument to Cᶜ.   □
Example 8.60. Let X ~ Exp(θ) given Θ = θ, and let μ_Θ be the Γ(a, b) distri-
bution. Let C be the interval (0, c(x)], where c(x) is the γ quantile of the posterior
distribution of Θ. In this case, the posterior is Γ(a + 1, b + x), and the marginal den-
sity of the data is f_X(x) = ab^a/(b + x)^{a+1}. The likelihood function is θ exp(−xθ),
which increases for θ < 1/x and decreases thereafter. So, we have

    sup_{θ∈C} f_{X|Θ}(x|θ) = e⁻¹/x               if 1/x ≤ c(x),
                           = c(x) exp(−c(x)x)    if 1/x > c(x),
¹⁹This theorem appears in Berger (1985).

exp(-l) if ~ ~ c(x),
sup fXle(xl(l) = { '"
9EC O c(x)exp(-c(x)x) if~ <c(x).

The value of $\Pi_{\Theta|X}(C|x) = \gamma$ by design. The bounds given by Theorem 8.57$^{19}$ for the $\epsilon$-contamination class using $C$ equal to all distributions are, for $1/x \le c(x)$,
$$\frac{(1-\epsilon)\gamma\frac{ab^a}{(b+x)^{a+1}}}{(1-\epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon c(x)\exp(-c(x)x)} \le \pi(C|x) \le 1 - \frac{(1-\epsilon)(1-\gamma)\frac{ab^a}{(b+x)^{a+1}}}{(1-\epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon\frac{\exp(-1)}{x}}.$$
For example, with $\gamma = 0.5$, $a = b = 1$, and $\epsilon = 0.1$, we have $c(x) = 1.678/(1+x)$, which is greater than or equal to $1/x$ for $x \ge 1.474$. Figure 8.61 shows a plot of the lower and upper bounds on the posterior probabilities of the interval $[0, 1.678/(1+x)]$ as a function of $x$. Notice how the degree of robustness depends on the observed data. When $x$ is very small, the likelihood function is quite large for large values of $\theta$ outside of the interval $(0, c(x)]$, since $c(x)$ never gets bigger than 1.678. A prior that assigned probability 1 to such a large $\theta$ value would be consistent with a very small observed $x$ and would give low probability to every subinterval of $[0, 1.678]$. If such priors seem unreasonable, then perhaps the class $C$ is too large.
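These bounds are easy to evaluate numerically. The plain-Python sketch below (the function names and bisection tolerance are ours, not the book's) computes $c(x)$ as the $\gamma$ quantile of the $\Gamma(a+1, b+x)$ posterior and evaluates the two bounds for $a = b = 1$, $\gamma = 0.5$, $\epsilon = 0.1$.

```python
import math

A, B, GAMMA, EPS = 1.0, 1.0, 0.5, 0.1

def gamma2_cdf(t):
    # CDF of a Gamma(shape=2, rate=1) distribution: 1 - e^{-t}(1 + t).
    return 1.0 - math.exp(-t) * (1.0 + t)

def c_of_x(x, gamma=GAMMA):
    # gamma quantile of the Gamma(2, b + x) posterior (a = b = 1), found by
    # bisection on the rate-1 CDF and then rescaled by the rate b + x.
    lo, hi = 0.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if gamma2_cdf(mid) < gamma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / (B + x)

def marginal(x):
    # f_X(x) = a b^a / (b + x)^{a+1}
    return A * B**A / (B + x)**(A + 1)

def bounds(x):
    c = c_of_x(x)
    peak = math.exp(-1.0) / x            # likelihood maximum, at theta = 1/x
    edge = c * math.exp(-c * x)          # likelihood at theta = c(x)
    sup_c = peak if 1.0 / x <= c else edge
    sup_cc = peak if 1.0 / x >= c else edge
    m = marginal(x)
    lower = (1 - EPS) * GAMMA * m / ((1 - EPS) * m + EPS * sup_cc)
    upper = 1 - (1 - EPS) * (1 - GAMMA) * m / ((1 - EPS) * m + EPS * sup_c)
    return lower, upper

print(c_of_x(1.0))
print(bounds(1.0))
```

At $x = 1$ the interval $[0, c(1)] = [0, 0.839]$ has base posterior probability 0.5, but over the contamination class its probability can range from roughly 0.43 to 0.57.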

Additionally, we may wish to find bounds for the posterior mean of a measurable function $g$ of $\Theta$ as the prior distribution varies over a class such as an $\epsilon$-contamination class. The following theorem, which is helpful in this regard, is due to Lavine, Wasserman, and Wolpert (1991, 1993).

FIGURE 8.61. Lower and Upper Bounds on Posterior Probabilities


8.6. Mixtures of Models 527

Theorem 8.62. Let $\Gamma$ be a class of prior distributions on $(\Omega, \tau)$, and let $g : \Omega \to \mathbb{R}$ be a measurable function. Suppose that $\inf_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\,d\pi(\theta) > 0$. For each $\pi \in \Gamma$, define $s_\pi(\lambda) = \int f_{X|\Theta}(x|\theta)[g(\theta) - \lambda]\,d\pi(\theta)$, and let
$$s(\lambda) = \sup_{\pi \in \Gamma} s_\pi(\lambda).$$
Then for finite $\lambda$, the least upper bound on the posterior means of $g(\Theta)$ is $\lambda$ if and only if $s(\lambda) = 0$.
PROOF. Let
$$\lambda_0 = \sup_{\pi \in \Gamma} \frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)},$$
and assume that $\lambda_0$ is finite. For the "if" direction, suppose that $s(\lambda) = 0$. We need to prove that $\lambda = \lambda_0$. Since $s(\lambda) = 0$, we know that $s_\pi(\lambda) \le 0$ for all $\pi \in \Gamma$ and that there exists a sequence $\{\pi_n\}_{n=1}^\infty$ of elements of $\Gamma$ such that, for each $n$, $s_{\pi_n}(\lambda) > -1/n$. This last claim can be written as $\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta) > \lambda \int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta) - 1/n$, which implies
$$\frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)} > \lambda - \frac{1}{n\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)},$$
for all $n$. We know that
$$\lambda_0 \ge \sup_n \frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)} \ge \lambda - \frac{1}{\sup_n n\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)}. \tag{8.63}$$
Because $\inf_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\,d\pi(\theta) > 0$, the far right-hand side of (8.63) equals $\lambda$, so $\lambda_0 \ge \lambda$. We can rewrite $s_\pi(\lambda) \le 0$ as $\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi(\theta) \le \lambda\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)$, which implies
$$\frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)} \le \lambda.$$
Since this is true for all $\pi \in \Gamma$, it follows that $\lambda_0 \le \lambda$, and we conclude $\lambda_0 = \lambda$.
For the "only if" part, we must show that $s(\lambda_0) = 0$. From the fact that
$$\frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi(\theta)} \le \lambda_0$$
for all $\pi \in \Gamma$, it easily follows that $s(\lambda_0) \le 0$. Suppose that $s(\lambda_0) = -\epsilon_0$ for some $\epsilon_0 > 0$. We will derive a contradiction. We know that there exists a sequence $\{\pi_n\}_{n=1}^\infty$ of elements of $\Gamma$ such that, for each $n$,
$$\frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)} > \lambda_0 - \frac{1}{n}.$$
Since $s(\lambda_0) = -\epsilon_0$, it follows that, for every $n$,
$$\frac{\int f_{X|\Theta}(x|\theta)g(\theta)\,d\pi_n(\theta)}{\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)} \le \lambda_0 - \frac{\epsilon_0}{\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta)}.$$
These two inequalities imply that $\int f_{X|\Theta}(x|\theta)\,d\pi_n(\theta) > n\epsilon_0$ for every $n$. This contradicts $\sup_{\pi \in \Gamma} \int f_{X|\Theta}(x|\theta)\,d\pi(\theta) < \infty$. $\Box$
To use Theorem 8.62 to find bounds on posterior means, we first note that lower bounds can be obtained by replacing $g$ by $-g$ and finding another upper bound. For fixed $\lambda$, $s_\pi(\lambda)$ is a linear function of $\pi$. In the case of $\epsilon$-contamination classes ($\Gamma = C$), it follows that $s(\lambda)$ is the supremum over the set of contaminations of the form $\pi(B) = (1-\epsilon)\mu_\Theta(B) + \epsilon I_B(\theta_0)$ for $\theta_0 \in \Omega$. For such a $\pi$, the posterior mean of $g(\Theta)$ is
$$\frac{(1-\epsilon)\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) + \epsilon g(\theta_0) f_{X|\Theta}(x|\theta_0)}{(1-\epsilon)\int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) + \epsilon f_{X|\Theta}(x|\theta_0)}.$$
One can usually find the supremum and infimum of this expression as a function of $\theta_0$ using standard numerical methods. The two integrals, $\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)$ and $\int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta)$, are constants in these numerical problems.
Example 8.64 (Continuation of Example 8.60; see page 525). We have $X \sim \mathrm{Exp}(\theta)$ given $\Theta = \theta$, and $\mu_\Theta$ is the $\Gamma(a,b)$ distribution. Let $g(\theta) = \theta$. So
$$\int g(\theta) f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) = \frac{a(a+1)b^a}{(b+x)^{a+2}}, \qquad \int f_{X|\Theta}(x|\theta)\,d\mu_\Theta(\theta) = \frac{ab^a}{(b+x)^{a+1}}.$$
The function for which we need to find extremes is then
$$h(\theta) = \frac{(1-\epsilon)\frac{a(a+1)b^a}{(b+x)^{a+2}} + \epsilon\theta^2\exp(-x\theta)}{(1-\epsilon)\frac{ab^a}{(b+x)^{a+1}} + \epsilon\theta\exp(-x\theta)}.$$
If we let $a = b = 1$ and $\epsilon = 0.1$ as before, we can find the extremes of $h(\theta)$ for every possible $x$. Figure 8.65 shows the lower and upper bounds on $\mathrm{E}(\Theta|X=x)$ for $x$ between 0.1 and 10. As $x \to 0$, the upper bound goes to $\infty$ because $x$ close to 0 is most consistent with very large values of $\Theta$. The bounds get very close together and small as $x \to \infty$ because large $x$ values are most consistent with very small values of $\Theta$.
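A crude grid search suffices to find the extremes of $h$. The sketch below is ours (the grid size is arbitrary); it also uses the fact that $h(\theta_0)$ tends to the base posterior mean $(a+1)/(b+x)$ as $\theta_0 \to 0$ or $\theta_0 \to \infty$, so the base mean always lies between the two bounds.

```python
import math

EPS = 0.1
A = B = 1.0

def h(theta, x):
    # Posterior mean of g(Theta) = Theta under the contaminated prior
    # (1 - eps) * Gamma(a, b) + eps * (point mass at theta).
    num = (1 - EPS) * A * (A + 1) * B**A / (B + x)**(A + 2) \
          + EPS * theta**2 * math.exp(-x * theta)
    den = (1 - EPS) * A * B**A / (B + x)**(A + 1) \
          + EPS * theta * math.exp(-x * theta)
    return num / den

def mean_bounds(x, grid_max=100.0, steps=20000):
    base = (A + 1) / (B + x)   # posterior mean under the base Gamma prior
    lo = hi = base             # limits as theta -> 0 or theta -> infinity
    for k in range(1, steps + 1):
        v = h(grid_max * k / steps, x)
        lo, hi = min(lo, v), max(hi, v)
    return lo, hi

print(mean_bounds(0.1))
print(mean_bounds(5.0))
```

As in Figure 8.65, the upper bound is large for small $x$ and the bounds tighten as $x$ grows.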
In addition to sensitivity analysis, one can try to find prior distributions such that the resulting inferences exhibit some degree of robustness to changes in the prior. Consider the case in which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$. The natural conjugate prior is one of the form $M \sim N(\mu_0, \sigma^2/\lambda_0)$ given $\Sigma = \sigma$, and $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$. Such priors have the property that the posterior mean of $M$ is $\mu_1 = (n\bar{x} + \lambda_0\mu_0)/(n + \lambda_0)$, which has a component $\lambda_0\mu_0/(n + \lambda_0)$ that remains the same no matter what the data values

FIGURE 8.65. Lower and Upper Bounds on Posterior Mean of $\Theta$

are. It is sometimes desirable to have the influence of the prior become less pronounced as the data move away from what would be predicted by the prior. Alternatives to natural conjugate priors, which are less influential, are ones in which $M$ is independent of $\Sigma^2$ with a $t_{c_0}(\mu_0, \tau_0^2)$ distribution. We will assume that $\Sigma^2 \sim \Gamma^{-1}(a_0/2, b_0/2)$ in this prior also. At first, it may seem difficult to work with such a prior because the posterior cannot be written in closed form. However, we can use the following trick to make the problem more tractable: Invent a random variable $Y$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$ distribution independent of $\Sigma$, and pretend that $M \sim N(\mu_0, Y)$ given $Y$. The marginal distribution of $M$ is then $t_{c_0}(\mu_0, \tau_0^2)$ as earlier prescribed. But now, if we treat $(M, \Sigma, Y)$ as the parameter, we can use successive substitution sampling (SSS) because the following conditional distributions are obtained:

$$M \mid \Sigma = \sigma, Y = y \sim N(\mu_1(y,\sigma), \tau_1(y,\sigma)),$$
$$\Sigma^2 \mid M = \mu, Y = y \sim \Gamma^{-1}\!\left(\frac{a_0+n}{2}, \frac{b_1(\mu,y)}{2}\right),$$
$$Y \mid M = \mu, \Sigma = \sigma \sim \Gamma^{-1}\!\left(\frac{c_0+1}{2}, \frac{d_1(\mu)}{2}\right),$$
where
$$\mu_1(y,\sigma) = \frac{\frac{\mu_0}{y} + \frac{n\bar{x}}{\sigma^2}}{\frac{1}{y} + \frac{n}{\sigma^2}}, \qquad \tau_1(y,\sigma) = \left(\frac{1}{y} + \frac{n}{\sigma^2}\right)^{-1},$$
$$b_1(\mu, y) = b_0 + \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2, \qquad d_1(\mu) = c_0\tau_0^2 + (\mu - \mu_0)^2.$$

In fact, since the prior density for M will tend to be very flat relative to
the likelihood function with even a small amount of data, this prior is a lot
like using an improper prior such as Lebesgue measure.
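The SSS scheme just described is a Gibbs sampler with three conjugate updates. Below is a minimal sketch in plain Python (the settings and synthetic data are ours, purely for illustration); it uses the fact that if $W \sim \Gamma(\alpha, \text{rate }\beta)$ then $1/W \sim \Gamma^{-1}(\alpha, \beta)$.

```python
import random
import statistics

random.seed(0)

# Hypothetical hyperparameters and data, for illustration only.
mu0, a0, b0, c0, tau0 = 0.0, 2.0, 2.0, 1.0, 2.0
data = [2.3, 1.9, 2.8, 2.1, 1.5, 2.6, 2.2, 1.8, 2.4, 2.0]
n, xbar = len(data), statistics.fmean(data)
ss = sum((xi - xbar) ** 2 for xi in data)

def inv_gamma(shape, rate):
    # If W ~ Gamma(shape, rate) then 1/W ~ InverseGamma(shape, rate).
    # random.gammavariate takes (shape, scale), so scale = 1/rate.
    return 1.0 / random.gammavariate(shape, 1.0 / rate)

mu, sigma2, y = xbar, 1.0, tau0 ** 2
draws = []
for it in range(5000):
    # M | Sigma = sigma, Y = y  ~  N(mu1, tau1)
    prec = 1.0 / y + n / sigma2
    mu = random.gauss((mu0 / y + n * xbar / sigma2) / prec, (1.0 / prec) ** 0.5)
    # Sigma^2 | M = mu, Y = y  ~  InverseGamma((a0 + n)/2, b1/2)
    sigma2 = inv_gamma((a0 + n) / 2.0, (b0 + ss + n * (xbar - mu) ** 2) / 2.0)
    # Y | M = mu, Sigma = sigma  ~  InverseGamma((c0 + 1)/2, d1/2)
    y = inv_gamma((c0 + 1) / 2.0, (c0 * tau0 ** 2 + (mu - mu0) ** 2) / 2.0)
    if it >= 1000:
        draws.append(mu)

print(statistics.fmean(draws))  # posterior mean of M, close to the sample mean
```

Because the $t_{c_0}$ prior is heavy tailed, the posterior mean of $M$ stays close to $\bar{x}$ rather than being pulled a fixed fraction of the way toward $\mu_0$.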
Of course, Bayesians can be interested in the same aspects of robustness in which classical statisticians are interested, namely robustness with respect to unexpected observations. In the classical framework, we introduced M-estimators (see Section 5.1.5) to be less sensitive to extreme observations. In the Bayesian framework, this would correspond to using alternative conditional distributions for the data given $\Theta$ to reflect the opinion that occasional extreme observations might arise. Consider the case in which $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ given $\Theta = (\mu, \sigma)$, most of the time but in which an observation with higher variance is occasionally observed. Alternatively, suppose that each observation $X_i$ comes with its own variance $\Sigma_i^2$ and that the $\Sigma_i^2$ are exchangeable. If the conditional distribution of $\Sigma_i^2$ were $\Gamma^{-1}(a_0/2, a_0\tau^2/2)$ given $T = \tau$, this would be equivalent to saying that the $X_i$ had $t_{a_0}(\mu, \tau^2)$ distribution given $M = \mu$ and $T = \tau$. That is, we would have changed the likelihood from normal to $t_{a_0}$. Once again, it may seem difficult to work with such a likelihood because the posterior cannot be written in closed form. However, we can again use SSS to make the problem more tractable. Suppose that we model $M$ as $N(\mu_0, Y)$ given $Y, \Sigma_1, \ldots, \Sigma_n, T$, with $Y$ independent of the $\Sigma_i$ and $T$ with $\Gamma^{-1}(c_0/2, c_0\tau_0^2/2)$ distribution, and $T^2 \sim \Gamma(d_0/2, f_0/2)$. The conditional posterior distributions needed are
$$M \mid \Sigma_1 = \sigma_1, \ldots, \Sigma_n = \sigma_n, T = \tau, Y = y \sim N\big(\mu_1(\sigma_1,\ldots,\sigma_n,y),\ \tau_1(\sigma_1,\ldots,\sigma_n,y)\big),$$
together with the conjugate updates for the remaining coordinates,
$$\Sigma_i^2 \mid \cdots \sim \Gamma^{-1}\!\left(\frac{a_0+1}{2}, \frac{a_0\tau^2 + (x_i-\mu)^2}{2}\right), \qquad Y \mid \cdots \sim \Gamma^{-1}\!\left(\frac{c_0+1}{2}, \frac{c_0\tau_0^2 + (\mu-\mu_0)^2}{2}\right),$$
$$T^2 \mid \cdots \sim \Gamma\!\left(\frac{d_0+na_0}{2}, \frac{f_0 + a_0\sum_{i=1}^n \sigma_i^{-2}}{2}\right),$$
where
$$\tau_1(\sigma_1,\ldots,\sigma_n,y) = \left(\frac{1}{y} + \sum_{i=1}^n \frac{1}{\sigma_i^2}\right)^{-1}, \qquad \mu_1(\sigma_1,\ldots,\sigma_n,y) = \left(\frac{\mu_0}{y} + \sum_{i=1}^n \frac{x_i}{\sigma_i^2}\right)\tau_1(\sigma_1,\ldots,\sigma_n,y).$$

Alternatively, one could integrate numerically.


Example 8.66. Suppose that we model the data as above with $\mu_0 = 0$, $a_0 = 5$, $c_0 = 1$, $\tau_0 = 2$, $d_0 = 1$, and $f_0 = 1/2$. This prior has the property that the

density of the data given the parameters has thinner tails than the prior density
of M. This is because the degrees of freedom is 5 for the t distribution of the data
given the parameters, but the degrees of freedom is only 1 for the t distribution
of M. This will allow the posterior to resemble the likelihood to a large extent.
Consider the following 10 observations:
1.66, 1.07, 0.640, 0.310, 0.295, -0.070, -0.107, -1.67, -1.90, -1.97.
The posterior mean of M is -0.077, and the posterior standard deviation of M is
0.4202. (The sample average is $-0.173$, and the sample standard deviation over $\sqrt{10}$ is 0.4017.) We could now consider what changes if one of the observations moves off to $\infty$. For example, suppose that we take the smallest observation,
-1.97, and let it move to 0 and then to +00. Figure 8.67 shows a plot of the
posterior mean of M as a function of the moving observation. Notice that the
mean of M increases almost linearly with the moving observation for some time
and then begins to decrease again. The decrease is due to the moving observation's
having reached a level beyond which it is more likely to be coming from the tail
of the distribution than from a large value of M.
The other curves in Figure 8.67 correspond to models with different degrees of
freedom. All four of the cases with $a_0, c_0 \in \{1, 5\}$ are illustrated. The value of $c_0$ does not have nearly as much influence as the value of $a_0$. When $a_0$ changes to 1
(with $c_0 = 1$ and the original data), the posterior mean and standard deviation of $M$ become 0.1258 and 0.1388, respectively. (The MLE of $M$ would be 0.2064,
but the likelihood is not very peaked.) In this case, as one observation changes,
the posterior mean of M is affected most by the average of those observations in
the middle of the data set. When the moving observation enters the middle of
the data set, the posterior mean of M varies linearly with the observation. But
when it moves out of the middle, the posterior mean moves back down again.

Of course, one straightforward way to develop robust models is to form


mixtures of all sensible models for the data. Those models with high prior

FIGURE 8.67. Posterior Mean of M as a Function of One Observation

TABLE 8.68. Relative Values of Prior Predictive Density

            value of x_10
a_0       -1.970   0.531   3.030   8.030
1          0.252   0.918   0.367   1.501
2          0.553   1.048   0.663   1.412
5          1.000   1.000   1.000   1.000
10         1.187   1.001   1.091   0.945
20         1.283   1.003   1.110   0.623
60         1.348   1.003   1.105   0.221
∞          1.381   1.003   1.097   0.096
average    1.001   0.997   0.911   0.828

predictive density will surface as the ones that contribute most to the posterior predictive distribution of future data.
Example 8.69 (Continuation of Example 8.66; see page 530). Suppose that we are not sure which degrees of freedom to use for the conditional distribution of $X_i$ given $(M, T)$. We might try a mixture of models with model $i$ having $a_0 = i$ for $i$ in some set, where $a_0 = \infty$ means that the conditional distribution is $N(M, T^2)$ rather than $t_{a_0}(M, T^2)$. Table 8.68 lists the relative values of the prior predictive densities of the data for a few values of $a_0$, with $a_0 = 5$ taken as 1.0. Four different data sets are used; they differ only in the value of the last observation, which is listed in the column heading. The $t_5$ distribution is relatively robust as one observation increases, and the equal mixture of the seven models (the row labeled "average") has predictive density remarkably close to that of the $t_5$ model. One might argue that the equal mixture of the seven other models is not itself sensible. Putting 3/7 of the probability on $\{20, 60, \infty\}$ degrees of freedom is saying that one is somewhat confident that the data will be approximately normal. On the other hand, putting 3/7 of the probability on $\{1, 2, 5\}$ degrees of freedom is saying that one is equally confident that the data will likely have an occasional "outlier."
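Because the table entries are prior predictive densities up to a common factor, the posterior probabilities of the models under equal prior weights are just the normalized column entries. A small sketch using the first column of Table 8.68:

```python
# Relative prior predictive densities from Table 8.68 (a0 = 5 scaled to 1.0),
# for the data set whose last observation is x10 = -1.970.
models = [1, 2, 5, 10, 20, 60, float("inf")]   # degrees of freedom a0
rel = [0.252, 0.553, 1.000, 1.187, 1.283, 1.348, 1.381]

total = sum(rel)
post = [r / total for r in rel]   # posterior model probabilities, equal priors

for a0, p in zip(models, post):
    print(a0, round(p, 3))
```

With $x_{10} = -1.97$ (no outlier), the heaviest weight lands on the normal model ($a_0 = \infty$); redoing the computation with the $x_{10} = 8.03$ column instead puts the most weight on $a_0 = 1$. The mean of `rel` also reproduces the "average" row entry for this column.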

8.7 Problems
Section 8.1:

1. Let $\mathcal{X}_n$ be the space of sequences of 0s and 1s of length $n+1$ that start with 0. Let $T_n(0, X_1, \ldots, X_n)$ be the four counts of transitions (from 0 to 1, from 0 to 0, from 1 to 1, from 1 to 0) in $(0, X_1, \ldots, X_n)$. Call these counts $(T_{n,0,1}, T_{n,0,0}, T_{n,1,1}, T_{n,1,0})$. Let $r_n(A, t)$ be uniform over all sequences with the appropriate numbers of transitions (that is, $t_{0,1}$ transitions from 0 to 1, etc.).
(a) Show that the conditions of Theorem 2.111 hold.
(b) Find the extreme points of the set M.
8.7. Problems 533

(c) Write the representation of Theorem 2.111 as an integral over a finite-dimensional space.
2. Suppose that $\{Y_{ij} : j = 1, \ldots, n_i;\ i = 1, \ldots, k\}$ are conditionally independent with $Y_{ij} \sim N(\theta_i, 1)$ given $\Theta = (\theta_1, \ldots, \theta_k)$ and $M = \mu$. Suppose that $\Theta_1, \ldots, \Theta_k$ are IID with $N(\mu, 1)$ distribution given $M = \mu$ and that $M \sim N(\mu_0, 1)$.
(a) Find the marginal distribution of each $Y_{ij}$.
(b) Show that the $Y_{ij}$ are not exchangeable.
(c) Find the posterior distribution of $\Theta$ and $M$.
3. Suppose that $\{Y_{ij} : j = 1, \ldots, n_i;\ i = 1, \ldots, k\}$ are conditionally independent with $Y_{ij} \sim N(\theta_i, 1)$ given $\Theta = (\theta_1, \ldots, \theta_k)$, $M = \mu$, and $T = \tau$. Suppose that $\Theta_1, \ldots, \Theta_k$ are IID with $N(\mu, \tau^2)$ distribution given $M = \mu$ and $T = \tau$. Show that the improper prior with density $\tau^{-1}$ (with respect to Lebesgue measure) leads to an improper posterior for $T$.

Section 8.2:

4. Prove Proposition 8.13 on page 485.


5. *Consider the data in the following table:
i    n_i    X_{i,j}, j = 1, ..., n_i
1 3 9.549 10.274 7.142
2 2 11.430 11.890
3 3 6.898 4.329 6.905
4 4 12.620 13.050 12.530 11.890
Model the $X_{i,j}$ using the one-way ANOVA model of Section 8.2.1 with prior hyperparameters $\psi_0 = 15$, $\lambda_0 = 0.25$, $b_0 = 1$, $a_0 = 1$, and $\Lambda \sim \mathrm{Exp}(0.2)$.
(a) Find the product of the prior density of $\Lambda$ and the "marginal likelihood" function of $\Lambda$.
(b) Use numerical integration to find the normalizing constant for the
posterior density.
(c) Find the posterior predictive density of a future observation from
the i = 4 group using a numerical integration method, importance
sampling, or the method of Laplace. (Approximate the density at the
points from 9 to 15 in steps of 0.05.)
6. Prove that (8.20) on page 490 is true.
7. Prove the following matrix theorem: If $A$ and $B$ are nonsingular matrices, then
$$(A+B)^{-1} = A^{-1} - A^{-1}(A^{-1}+B^{-1})^{-1}A^{-1}, \qquad A(A+B)^{-1}B = (A^{-1}+B^{-1})^{-1}.$$

8. Prove the following matrix theorem: Let $\Sigma_* = \Sigma_1 + \Sigma_2$, where $\Sigma_1$ and $\Sigma_2$ are nonsingular symmetric matrices, and let $a$ and $b$ be vectors. Then
$$a^\top\Sigma_1 a + b^\top\Sigma_2 b - (\Sigma_1 a + \Sigma_2 b)^\top\Sigma_*^{-1}(\Sigma_1 a + \Sigma_2 b) = (a-b)^\top(\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}(a-b). \tag{8.70}$$

(Hint: Use Problem 7 above.)
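Before attempting a proof, it can be reassuring to check identities like these numerically. A self-contained check of both identities in Problem 7, using $2\times 2$ matrices and our own plain-Python helpers:

```python
def mat_mul(P, Q):
    # 2x2 matrix product.
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(P, Q):
    return [[P[i][j] + Q[i][j] for j in range(2)] for i in range(2)]

def mat_sub(P, Q):
    return [[P[i][j] - Q[i][j] for j in range(2)] for i in range(2)]

def mat_inv(P):
    # Closed-form inverse of a 2x2 matrix.
    det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
    return [[P[1][1] / det, -P[0][1] / det],
            [-P[1][0] / det, P[0][0] / det]]

def close(P, Q, tol=1e-9):
    return all(abs(P[i][j] - Q[i][j]) < tol for i in range(2) for j in range(2))

A = [[4.0, 1.0], [1.0, 3.0]]
B = [[2.0, 0.5], [0.5, 5.0]]

Ainv, Binv = mat_inv(A), mat_inv(B)
# (A + B)^{-1} = A^{-1} - A^{-1}(A^{-1} + B^{-1})^{-1}A^{-1}
lhs1 = mat_inv(mat_add(A, B))
rhs1 = mat_sub(Ainv, mat_mul(mat_mul(Ainv, mat_inv(mat_add(Ainv, Binv))), Ainv))
# A(A + B)^{-1}B = (A^{-1} + B^{-1})^{-1}
lhs2 = mat_mul(mat_mul(A, mat_inv(mat_add(A, B))), B)
rhs2 = mat_inv(mat_add(Ainv, Binv))

print(close(lhs1, rhs1), close(lhs2, rhs2))
```

Any nonsingular symmetric choices of $A$ and $B$ work; the same scaffolding can be reused to spot-check (8.70).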


Section 8.4:

9. Each scientific paper published by a particular author receives a random number of citations from other authors in the years following its publication. For $i = 1, \ldots, k$ and $j = 1, \ldots, n$, let $X_{i,j}$ denote the number of citations paper $j$ received $i$ years after publication. Let $T_1, \ldots, T_k$ be known positive numbers. Model the $X_{i,j}$ as conditionally independent given $M_1, \ldots, M_n, \Theta$ with $X_{i,j}$ having $\mathrm{Poi}(M_j T_i)$ distribution. We model the $M_j$ as IID with $\mathrm{Exp}(\theta)$ distribution conditional on $\Theta = \theta$.
(a) Find minimal sufficient statistics for the parameters $M_1, \ldots, M_n, \Theta$.
(b) Supposing that $\Theta$ is the only parameter, find a one-dimensional sufficient statistic.
(c) Find the MLE $\hat{\theta}$ of $\Theta$.
(d) Use the naive empirical Bayes approach (assuming $\Theta = \hat{\theta}$) to find the posterior distribution of the parameters $M_1, \ldots, M_n$.
10. Suppose that $X_i \sim \mathrm{Bin}(n_i, \theta_i)$, $i = 1, \ldots, k$, are conditionally independent given $\Theta = (\theta_1, \ldots, \theta_k)$. Suppose that we model the $\Theta_i$ as conditionally IID with $\mathrm{Beta}(\alpha, \beta)$ distribution given $(A, B) = (\alpha, \beta)$.
(a) Although the formulas cannot be written out completely, describe how one would implement the naive empirical Bayes approach using MLEs for $A$ and $B$.
(b) Use the following data, and compute the naive empirical Bayes posterior distributions and posterior means for the parameters. We have $k = 2$, $n_1 = 5$, $n_2 = 10$, $X_1 = 3$, and $X_2 = 3$.

Section 8.5:

11. Suppose that $X \sim \Gamma^{-1}(a,b)$ and $Y \sim \Gamma^{-1}(c,d)$ are independent. Let $Z = X/Y$. Prove that the conditional distribution of $X$ given $Z = z$ is $\Gamma^{-1}(a+c, b+dz)$.
12. Using the notation of Problem 9 above, suppose that $\Theta$ has a prior distribution, which is $\Gamma(a,b)$, with $a$ and $b$ known constants.
(a) Find the posterior density of $\Theta$ except for the normalizing constant.
(b) Set up a successive substitution sampling scheme to generate a sample of $\Theta$ and $M_1, \ldots, M_n$ from the joint posterior distribution.
(c) Write a formula for an approximation to the posterior density of $\Theta$ and of $M_1, \ldots, M_n$ based on the successive substitution sample.

(d) Describe the similarities and differences between the above approxi-
mations to the joint density for the M j and the approximation found
via the empirical Bayes approach.
13. Let $v$ and $z$ be $k$-dimensional vectors with $v_i > 0$ for $i = 1, \ldots, k$. Suppose that $X \sim N_k(\mathrm{diag}(v)z, \mathrm{diag}(v))$, where $\mathrm{diag}(v)$ is a diagonal matrix with $(i,i)$ element equal to $v_i$. Prove that the conditional distribution of $X$ given $\mathbf{1}^\top X = c$ is
$$N_k\!\left(\left[\mathrm{diag}(v) - \frac{1}{\mathbf{1}^\top v}vv^\top\right]z + \frac{c}{\mathbf{1}^\top v}v,\ \mathrm{diag}(v) - \frac{1}{\mathbf{1}^\top v}vv^\top\right).$$

14. *Prove that condition (8.41) holds if $Y \sim N_k(\mu, \Sigma)$ and SSS is applied to the coordinates in their natural order. (Hint: First, prove that the conditional distribution of the next iterate $X$ given the current iterate $X'$ is multivariate normal with constant covariance matrix and mean that is a linear function of $X'$. You can now integrate over $x'$ analytically using facts from the theory of the multivariate normal distribution. The integral over $x$ becomes the integral of the ratio of two normal densities. You can use Problems 7 and 8 in this chapter to show that this integral is a constant times the integral of a normal density.)

Section 8.6:

15. The SSS algorithm allows one to approximate posterior distributions without calculating the marginal density of the data $f_X(x)$. When fitting mixture models, it is important to compute $f_{X|\Psi}(x|\psi)$, where $\Psi$ is the parameter indexing models. Consider a single model with parameter $\Theta = (\Theta_1, \ldots, \Theta_p)$. Each iteration of SSS starts with a simulated vector $\theta^{(i)} = (\theta_1^{(i)}, \ldots, \theta_p^{(i)})$ and then simulates $\theta^{(i+1)}$ one coordinate at a time using the conditional posterior distribution of each coordinate given the others:
$$f_{\Theta_j|\Theta_{-j},X}\left(\theta_j^{(i+1)} \,\middle|\, \theta_1^{(i+1)}, \ldots, \theta_{j-1}^{(i+1)}, \theta_{j+1}^{(i)}, \ldots, \theta_p^{(i)}, x\right). \tag{8.71}$$
Call the expression in (8.71) $V_j^{(i+1)}$. (Note that $V_1^{(i+1)}$ and $V_p^{(i+1)}$ have slightly different formulas due to the effect of being at the ends of the vector.) Prove that
$$\mathrm{E}\left(\frac{f_{X|\Theta}(x|\theta^{(i+1)})\,f_\Theta(\theta^{(i+1)})}{\prod_{j=1}^p V_j^{(i+1)}}\right) = f_X(x),$$
where $\mathrm{E}(\cdot)$ refers to the joint distribution of $(\theta^{(i)}, \theta^{(i+1)})$ in the simulation.
16. Suppose that one wishes to fit a mixture of $k$ models, but that one is required to use SSS to fit each model. Let $\Psi \in \{1, \ldots, k\}$ index the different models, and let $\Theta_i$ be the parameters of model $i$, for $i = 1, \ldots, k$. Explain how one could use Problem 15 above to help estimate $f_{\Psi|X}(\psi|x)$ for all values of $\psi$.
CHAPTER 9
Sequential Analysis

Most of the results of earlier chapters concern situations in which a particular data set is to be observed and the decisions, if any, to be made concern the values of future observations. It sometimes happens that as we observe, we get to decide what, if any, data to collect next. In this chapter, we will describe some theory and methods for dealing with such situations.$^1$

9.1 Sequential Decision Problems


As a simple example of a situation in which we need to decide whether or
not to collect more data, consider the following.
Example 9.1. We are considering purchasing a shipment of parts. Prior to observing any data, we believe that the proportion $P$ of defective parts has a $U(0,1)$ distribution. We believe that the individual parts ($X_i = 1$ if part $i$ is defective) are conditionally independent $\mathrm{Ber}(p)$ random variables given $P = p$. We decide that we can sample at most 10 parts, and we will reject the shipment if the posterior mean of $P$ is greater than 0.6. This will occur if there are 7 or more defectives out of a sample of 10. Suppose that the first seven parts are defective. Clearly, there is no need to sample any more parts. Similarly, if the first six parts were defective, it might seem highly unlikely that the shipment would be acceptable. Whether or not to continue sampling would depend on the relative costs of sampling and of rejecting a good shipment.
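The threshold of 7 defectives follows from the posterior mean under the uniform prior: after $k$ defectives in 10 parts, the posterior is $\mathrm{Beta}(k+1, 11-k)$ with mean $(k+1)/12$, which exceeds 0.6 exactly when $k \ge 7$. A one-line check:

```python
# Posterior mean of P after k defectives in 10 parts, under a U(0,1) prior,
# is (k + 1) / 12; the shipment is rejected when this exceeds 0.6.
rejecting_counts = [k for k in range(11) if (k + 1) / 12 > 0.6]
print(rejecting_counts)  # [7, 8, 9, 10]
```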

The general sequential decision problem can be defined as follows.

$^1$The discussion in Section 9.1 is largely adapted from Chapter 12 of DeGroot (1970).
9.1. Sequential Decision Problems 537

Definition 9.2. Let $(S, \mathcal{A}, \mu)$ be a probability space, let $(V, \mathcal{V})$ be a measurable space, and let $V : S \to V$ be a random quantity. For $i = 1, 2, \ldots$, let $(\mathcal{X}_i, \mathcal{B}_i)$ be measurable spaces, and let $(\mathcal{X}_0, \mathcal{B}_0) = (\{0\}, \{\emptyset, \mathcal{X}_0\})$ be a trivial space. For $i = 0, 1, \ldots$, let $X_i : S \to \mathcal{X}_i$ be random quantities. Let $\mathcal{X} = \prod_{i=0}^\infty \mathcal{X}_i$ with product $\sigma$-field $\mathcal{B}_\infty$. Let $\mathcal{B}_n$ be the sub-$\sigma$-field generated by the first $n+1$ coordinates (including coordinate 0). That is, $B \in \mathcal{B}_n$ if and only if $B = C \times \prod_{i=n+1}^\infty \mathcal{X}_i$, where $C \in \mathcal{B}_0 \otimes \cdots \otimes \mathcal{B}_n$. Let $X = (X_0, X_1, X_2, \ldots)$. A stopping time is a nonnegative, extended integer-valued function $N$ (i.e., $N \in \bar{\mathbb{N}} = \{0, 1, \ldots\} \cup \{\infty\}$) defined on $\mathcal{X}$ such that, for every finite $n$, $\{x : N(x) = n\}$ is measurable with respect to $\mathcal{B}_n$. Let the action space be $\aleph = \aleph' \times \bar{\mathbb{N}}$, and let the loss be $L : V \times \aleph \to \mathbb{R}$ such that $L(v, (a,n)) = \sum_{i=0}^n c_i + L'(v, a)$, where $c_i \ge 0$ for all $i$ ($c_0 = 0$). Let $\mathfrak{A}$ be a $\sigma$-field of subsets of $\aleph'$, and let $\mathcal{P}_A$ be the collection of probability measures on $(\aleph', \mathfrak{A})$. A randomized sequential decision rule is a pair $\delta = (\delta^*, N)$, where $\delta^* : \mathcal{X} \to \mathcal{P}_A$ and $N$ is a stopping time. The function $\delta^*$ is called the terminal decision rule. A nonrandomized sequential decision rule is a randomized sequential decision rule such that, for each $x \in \mathcal{X}$, there is $\delta'(x) \in \aleph'$ such that for each $A \in \mathfrak{A}$, $\delta^*(x)(A) = 1$ if $\delta'(x) \in A$ and $\delta^*(x)(A) = 0$ if $\delta'(x) \notin A$. If $\delta$ is nonrandomized, then $\delta'$ is called the terminal decision rule.
A convenient notation will be to let $X^n = (X_0, \ldots, X_n)$ for finite $n$ and $X^\infty = X$ if necessary. Also, $x^n = (x_0, x_1, \ldots, x_n)$ and $x^\infty = x$ for $x \in \mathcal{X}$. Note that we have assumed that a stopping time might be infinite. If there exists $x$ such that $N(x) = \infty$, then $\delta^*(x)$ must still be defined, even though it is hardly a "terminal" decision. For convenience, we will often write summations like $\sum_{n=1}^{\infty+}$ to indicate that one extra term for $n = \infty$ is to be included in the usual sum $\sum_{n=1}^\infty$. One way to prevent decision rules from taking infinite samples with positive probability is to require that $\sum_{i=1}^\infty c_i = \infty$ so that any rule that takes infinite samples with positive probability must have infinite risk.
If we can restrict attention to decision rules $(\delta', N)$ such that $N \le n$, then there is an intuitively simple method of finding the optimal sequential decision rule. The idea is to decide what would be the optimal decision and its risk after observing $X^n$, then compare this to what the risk would be if we only observed $X^{n-1}$. Whether $N(x) = n$ or $N(x) = n-1$ is decided based on which risk is smaller. We now know what the optimal procedure is after observing $n-1$ observations and we know its risk. Compare this to what would be optimal if we stopped after $X^{n-2}$, and so on. This procedure is called backward induction. Consider an illustration.
Example 9.3. Suppose that $\{X_n\}_{n=1}^\infty$ are conditionally IID $\mathrm{Ber}(\theta)$ given $\Theta = \theta$ and $\Theta$ has $U(0,1)$ distribution. Suppose that we can take at most four observations. The action space has $\aleph' = \{0, 1\}$, and the loss function is $L(\theta, (a,n)) = 0.01n + L'(\theta, a)$ with
$$L'(\theta, a) = \begin{cases} 1 & \text{if } \theta > 0.4 \text{ and } a = 0, \\ 1 & \text{if } \theta \le 0.4 \text{ and } a = 1, \\ 0 & \text{otherwise.} \end{cases}$$
It follows that the optimal action, after $N$ is determined, is to choose $a = 1$ if the posterior probability of $\Theta \le 0.4$ is less than 0.5. If we observe $X^4 = x^4$, there are five possible posteriors depending on the value of $Y_4 = \sum_{i=1}^4 X_i$. The risks are the probabilities of wrong decision plus 0.04 for the four observations.

Y_4   Posterior   Pr(Θ ≤ 0.4|X = x)   a   Risk
0     Beta(1,5)   0.9222              0   0.1178
1     Beta(2,4)   0.6630              0   0.3770
2     Beta(3,3)   0.3174              1   0.3574
3     Beta(4,2)   0.0870              1   0.1270
4     Beta(5,1)   0.0102              1   0.0502
Next, suppose that we only observe $X^3 = x^3$. Let $Y_3 = X_1 + X_2 + X_3$. The posterior will be $\mathrm{Beta}(Y_3+1, 4-Y_3)$, and the predictive distribution for $X_4$ is $\mathrm{Ber}([Y_3+1]/5)$. The risk for stopping is just 0.03 (for the three observations) plus the probability of wrong decision based on three observations. The risk for continuing is the weighted average of the two possible risks that could occur depending on the value of $X_4$. For example, if $Y_3 = 2$, the predictive distribution for $X_4$ is $\mathrm{Ber}(0.6)$, and the risk for continuing is $0.6 \times 0.1270 + 0.4 \times 0.3574 = 0.2192$. For the other values we calculate

Y_3                  0          1          2          3
Posterior            Beta(1,4)  Beta(2,3)  Beta(3,2)  Beta(4,1)
Pr(Θ ≤ 0.4|X = x)    0.8704     0.5248     0.1792     0.0256
a                    0          0          1          1
Risk(stop)           0.1596     0.5052     0.2092     0.0556
Pr(X_4 = 1)          0.2        0.4        0.6        0.8
Risk(continue)       0.1696     0.3692     0.2192     0.0656
Stop                 yes        no         yes        yes
Risk                 0.1596     0.3692     0.2092     0.0556
So, only if $Y_3 = 1$ would we continue to observe $X_4$. Next, suppose that we only observe $X^2 = x^2$, and let $Y_2 = X_1 + X_2$.

Y_2                  0          1          2
Posterior            Beta(1,3)  Beta(2,2)  Beta(3,1)
Pr(Θ ≤ 0.4|X = x)    0.7840     0.3520     0.0640
a                    0          1          1
Risk(stop)           0.2360     0.3720     0.0840
Pr(X_3 = 1)          0.25       0.50       0.75
Risk(continue)       0.2120     0.2892     0.0940
Stop                 no         no         yes
Risk                 0.2120     0.2892     0.0840
We would continue if $Y_2 \in \{0, 1\}$. Next, suppose that we only observe $X^1 = x^1$.

X_1                  0          1
Posterior            Beta(1,2)  Beta(2,1)
Pr(Θ ≤ 0.4|X = x)    0.6400     0.1600
a                    0          1
Risk(stop)           0.3700     0.1700
Pr(X_2 = 1)          1/3        2/3
Risk(continue)       0.2377     0.1524
Stop                 no         no
Risk                 0.2377     0.1524
If we take one observation, we will take two. Finally, before we take any observations, $\Pr(\Theta \le 0.4) = 0.4$, so the terminal decision would be $a = 1$ and the risk would be 0.4. On the other hand, $\Pr(X_1 = 1) = 0.5$, so the risk of continuing is $0.5 \times 0.1524 + 0.5 \times 0.2377 = 0.1951$. Hence, we should take the first observation. To summarize, the optimal procedure is

Data  (1,1,·,·)  (0,0,0,·)  (1,0,1,·)  (0,1,1,·)  (1,0,0,0)
N     2          3          3          3          4
a     1          0          1          1          0
Data  (0,1,0,0)  (0,0,1,0)  (1,0,0,1)  (0,1,0,1)  (0,0,1,1)
N     4          4          4          4          4
a     0          0          1          1          1

where the dots stand for observations that do not need to be taken.

To compare with other procedures, there is the fixed sample size procedure with $n = 4$, which has risk 0.2059. This risk is the average of the five possible risks after four observations because each of the five possibilities has probability 1/5. This procedure chooses $a = 1$ if $Y_4 \in \{2, 3, 4\}$. The optimal procedure that takes at most three observations has risk 0.2232 (see Problem 1 on page 567).
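The whole backward induction is only a few lines of code. The sketch below is ours (the structure and function names are not the book's); it uses the standard identity that the $\mathrm{Beta}(y+1, n-y+1)$ CDF at $p$ equals $P(\mathrm{Bin}(n+1, p) \ge y+1)$, so everything is exact integer arithmetic on binomial sums.

```python
from math import comb

COST, P0, NMAX = 0.01, 0.4, 4

def post_prob_le(y, n, p=P0):
    # P(Theta <= p | y successes in n trials, uniform prior)
    # = P(Bin(n + 1, p) >= y + 1), since the posterior is Beta(y+1, n-y+1).
    m = n + 1
    return sum(comb(m, k) * p**k * (1 - p)**(m - k) for k in range(y + 1, m + 1))

def stop_risk(y, n):
    q = post_prob_le(y, n)   # error probability is q if a = 1, 1 - q if a = 0
    return COST * n + min(q, 1 - q)

def value(y, n):
    # Minimal Bayes risk given y successes among the first n observations.
    if n == NMAX:
        return stop_risk(y, n)
    p1 = (y + 1) / (n + 2)   # predictive P(X_{n+1} = 1)
    cont = p1 * value(y + 1, n + 1) + (1 - p1) * value(y, n + 1)
    return min(stop_risk(y, n), cont)

print([round(stop_risk(y, 4), 4) for y in range(5)])  # terminal risk table
print(round(value(0, 0), 4))                          # risk of the optimal rule
```

Running it reproduces the terminal risk table for $n = 4$ and the Bayes risk 0.1951 of the optimal rule, which is less than the no-data risk 0.4, so the first observation should be taken.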

After reviewing Example 9.3, it is clear that if $\delta$ is a sequential decision rule such that $\Pr(N = 0) > 0$, then $\Pr(N = 0) = 1$, since we have not allowed any randomization in the decision of whether or not to take observations. The decision as to whether to take any observations is based on the prior distribution and the various costs. No randomness is involved; hence $\{N = 0\}$ is either $\emptyset$ or all of $S$.
For a general problem, let $Q$ be a probability on $V$ (usually the parameter space $\Omega$).$^2$ Define
$$\rho_0(Q) = \min_{a \in \aleph'} \int_V L'(v, a)\,dQ(v),$$
the minimum risk possible without taking any observations if the prior is $Q$. If $Q$ denotes a prior distribution, then for each $n$, let $Q_n(\cdot|x)$ denote the conditional distribution obtained from $Q$ by conditioning on $X_0 = x_0, \ldots, X_n = x_n$. (For $n = \infty$, $Q_\infty(\cdot|x)$ denotes conditional probability given $X = x$.) In particular, $Q_0(\cdot|x) = Q$. If $N$ is a stopping time, then
$^2$It may be that $Q$ is already the conditional distribution obtained from some other probability $P$ after conditioning on some observations.

$Q_N(\cdot|x)$ will denote $\sum_{n=0}^{\infty+} Q_n(\cdot|x)\,I_{\{n\}}(N(x))$. (See Problem 3 on page 567 for an alternative understanding of $Q_n$ and $Q_N$.) Suppose that we observe $x^n$ and make the best possible decision. Then $\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i$ is the risk including the cost of observations.
Definition 9.4. Let $Q$ be a prior distribution on $V$. Suppose that $\delta = (\delta^*, N)$ is a sequential decision rule such that for every $n$ (finite or infinite) and every $x \in \{x : N(x) = n\}$,
$$\int_V L'(v, \delta^*(x))\,dQ_n(v|x) = \rho_0(Q_n(\cdot|x)).$$
Then $\delta$ is said to decide optimally after stopping.


Another way to describe what it means for $\delta = (\delta^*, N)$ to decide optimally after stopping is to say that if $N(x) = n$, then $\delta^*(x)$ is the same as the formal Bayes rule for a sample of size $n$. A decision rule that decides optimally after stopping may not have an optimal stopping time, but once the decision to stop is made, the optimal terminal decision is made. Clearly, the formal Bayes rule in a sequential decision problem will decide optimally after stopping (see Problem 2 on page 567). For a decision rule that decides optimally after stopping, the Bayes risk is
$$\rho(Q, \delta) = \mathrm{E}\left[\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i\right].$$
Definition 9.5. Suppose that $\delta$ decides optimally after stopping and $Q$ is a prior on $V$. We say that $\delta$ is regular if $\rho(Q, \delta) \le \rho_0(Q)$ and if, for every finite $n > 0$ and every $x \in \{x : N(x) > n\}$,
$$\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i > \mathrm{E}\left\{\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i \,\middle|\, X^n = x^n\right\}. \tag{9.6}$$
In words, a decision rule is regular if, whenever the stopping time has not yet occurred, the risk of stopping is larger than the risk of continuing. The rule in Example 9.3 on page 537 is regular, as is every backward induction rule. (See Problem 4 on page 568.)
Theorem 9.7. If $\delta$ decides optimally after stopping, then there is a regular $\delta_1$ such that $\rho(Q, \delta_1) \le \rho(Q, \delta)$.
PROOF. Define $\delta_1$ as follows. The terminal decision rule for $\delta_1$ is to decide optimally after stopping (just like $\delta$). The stopping time $N_1$ for $\delta_1$ is the smaller of $N$ (the stopping time for $\delta$) and the first time at which (9.6) fails. Clearly, this is finite and is a stopping time since both sides of (9.6) are $\mathcal{B}_n$ measurable. If $\delta$ is regular, then (9.6) never fails and $\delta_1 = \delta$. Next, note that both sides of (9.6) are equal for each $x$ such that $N(x) = n$. So, we can compute $\rho(Q, \delta_1)$ as
$$\sum_{n=0}^{\infty+} \int_{\{x : N_1(x) = n\}} \left[\rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i\right] dF_{X^n}(x^n)$$
$$\le \sum_{n=0}^{\infty+} \int_{\{x : N_1(x) = n\}} \mathrm{E}\left\{\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i \,\middle|\, X^n = x^n\right\} dF_{X^n}(x^n)$$
$$= \sum_{n=0}^{\infty+} \mathrm{E}\left[\rho_0(Q_N(\cdot|X)) + \sum_{i=0}^N c_i \,\middle|\, N_1 = n\right]\Pr(N_1 = n) = \rho(Q, \delta). \qquad \Box$$

Regular decision rules do not sample too many observations, but they
may not sample enough. That is, whenever a regular decision rule contin-
ues sampling, the risk for continuing is smaller than the risk for stopping.
However, when a regular decision rule stops, the risk for continuing may
still be smaller than the risk for stopping. For example, the optimal rule
from a class of rules whose stopping times are all bounded by the same n
is regular.
Proposition 9.8. The optimal rule from the class of sequential decision
rules that sample no more than n observations is regular.

Definition 9.9. If $\delta_i = (\delta_i^*, N_i)$ is a regular decision rule for $i = 1, \ldots, k$, the maximum of $\delta_1, \ldots, \delta_k$, denoted $\max\{\delta_1, \ldots, \delta_k\}$, is the decision rule with stopping time $N = \max\{N_1, \ldots, N_k\}$ and terminal decision rule to decide optimally after stopping.

Theorem 9.10. Let $Q$ be a prior on $V$. If $\delta_1, \ldots, \delta_k$ are regular with finite risk, then $\delta_0 = \max\{\delta_1, \ldots, \delta_k\}$ is regular and $\rho(Q, \delta_0) \le \rho(Q, \delta_i)$, for $i = 1, \ldots, k$.

PROOF. We need only prove this for $k = 2$ because the general case follows easily by induction. It is clear that
$$\mathcal{X} = \{x : N_1(x) = N_0(x)\} \cup \{x : N_1(x) < N_2(x)\}.$$
First, suppose that $N_1(x) < N_2(x)$. Then $N_0(x) = N_2(x)$. Let $n = N_1(x)$. Then
$$\mathrm{E}\left\{\rho_0(Q_{N_0}(\cdot|X)) + \sum_{i=0}^{N_0} c_i \,\middle|\, X^n = x^n\right\} = \mathrm{E}\left\{\rho_0(Q_{N_2}(\cdot|X)) + \sum_{i=0}^{N_2} c_i \,\middle|\, X^n = x^n\right\}$$
$$< \rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i = \mathrm{E}\left\{\rho_0(Q_{N_1}(\cdot|X)) + \sum_{i=0}^{N_1} c_i \,\middle|\, X^n = x^n\right\}. \tag{9.11}$$
The first equality is true because $\delta_0$ and $\delta_2$ agree for all $x$ such that $N_2(x) = N_0(x)$. The inequality follows since $\delta_2$ is regular and $N_2(x) > n$. The last equality follows since $N_1(x) = n$. Next, suppose that $N_1(x) = N_0(x)$ and $n = N_2(x) \le N_0(x)$:
$$\mathrm{E}\left\{\rho_0(Q_{N_0}(\cdot|X)) + \sum_{i=0}^{N_0} c_i \,\middle|\, X^n = x^n\right\} = \mathrm{E}\left\{\rho_0(Q_{N_1}(\cdot|X)) + \sum_{i=0}^{N_1} c_i \,\middle|\, X^n = x^n\right\}$$
$$\le \rho_0(Q_n(\cdot|x)) + \sum_{i=0}^n c_i. \tag{9.12}$$
The reasons for each line are the same as before, except that the inequality is only strict if $N_1(x) > n$. (Note that (9.12) holds even if $n = \infty$.) Together, (9.11) and (9.12) show that $\delta_0$ satisfies (9.6). In both of (9.11) and (9.12), $n = \min\{N_1(x), N_2(x)\}$. Write
$$\mathcal{X} = \bigcup_{n=0}^\infty C_n, \qquad C_n = \{x : \min\{N_1(x), N_2(x)\} = n\},$$
$$\rho(Q, \delta_j) = \sum_{n=0}^\infty \int_{C_n} \mathrm{E}\left\{\rho_0(Q_{N_j}(\cdot|X)) + \sum_{i=0}^{N_j} c_i \,\middle|\, X^n = x^n\right\} dF_{X^n}(x^n), \tag{9.13}$$
for $j = 0, 1, 2$. Together, (9.11) and (9.12) say that the integrand in the second line of (9.13) for $j = 0$ is no greater than for either $j = 1$ or $j = 2$. The inequalities in the conclusion to the theorem follow. $\Box$
If r = inf_δ ρ(Q, δ), then there is a sequence {δᵢ}_{i=1}^∞ such that r = lim_{i→∞} ρ(Q, δᵢ). Finding such a sequence is not as difficult as it may seem.

Definition 9.14. Let δ = (δ*, N) be a regular sequential decision rule. Let N′ be a stopping time. The truncation of δ at N′ is the decision rule with stopping time min{N, N′} and terminal decision optimal after stopping.
Lemma 9.15.³ Let δ₀ be the optimal rule in a sequential decision problem, and suppose that δ₀ has finite risk. For each n = 1, 2, ..., let δₙ be the

³This lemma is used to help prove Corollary 9.17.

9.1. Sequential Decision Problems 543

truncation of δ₀ to at most n observations. Define

pₙ = ∫_{{x : N₀(x) > n}} ρ₀(Qₙ(·|x)) dμ_{Xⁿ}(xⁿ).

If lim_{n→∞} pₙ = 0, then lim_{n→∞} ρ(Q, δₙ) = ρ(Q, δ₀).


PROOF. If N₀ = 0, the result is trivial, so suppose that N₀ ≥ 1. For a general decision rule δ = (δ*, N), define

Tₖ(δ) = ∫_{{x : N(x) = k}} [ρ₀(Qₖ(·|x)) + Σ_{i=1}^k cᵢ] dμ_{Xᵏ}(xᵏ),

so that ρ(Q, δ) = Σ_{k=1}^∞ Tₖ(δ). We know that for k = 1, ..., n − 1,

{x : N₀(x) = k} = {x : Nₙ(x) = k}

and

{x : Nₙ(x) = n} = {x : N₀(x) = n} ∪ {x : N₀(x) > n}.

So Tₖ(δₙ) = Tₖ(δ₀) for k = 1, ..., n − 1 and

Tₙ(δₙ) = Tₙ(δ₀) + pₙ + Pr(N₀ > n) Σ_{i=1}^n cᵢ.

So, we can write

ρ(Q, δ₀) = Σ_{n=1}^∞ Tₙ(δ₀) = lim_{n→∞} Σ_{i=1}^n Tᵢ(δ₀),
ρ(Q, δₙ) = Σ_{i=1}^{n−1} Tᵢ(δ₀) + Tₙ(δₙ) = Σ_{i=1}^n Tᵢ(δ₀) + pₙ + Pr(N₀ > n) Σ_{i=1}^n cᵢ.

Since lim_{n→∞} Pr(N₀ > n) = 0 and lim_{n→∞} pₙ = 0, the result follows. □

Lemma 9.16.⁴ Suppose that L′ ≥ 0 and lim_{n→∞} E ρ₀(Qₙ(·|X)) = 0. Then lim_{n→∞} pₙ = 0, where pₙ is defined in Lemma 9.15.

PROOF. Since pₙ is the integral of ρ₀(Qₙ(·|x)) over a subset of X₀ × ⋯ × Xₙ and the integrand is nonnegative, pₙ is no greater than E ρ₀(Qₙ(·|X)). □
These last two results combine into a corollary that provides a sequence of decision rules with risk converging to the optimal risk.

Corollary 9.17. Suppose that L′ ≥ 0 and lim_{n→∞} E ρ₀(Qₙ(·|X)) = 0. Let δₙ,₀ be the optimal rule among those that take at most n observations. Then lim_{n→∞} ρ(Q, δₙ,₀) = ρ(Q, δ₀).

4This lemma is used to help prove Corollary 9.17.



Example 9.18 (Continuation of Example 9.3; see page 537). If Xⁿ = xⁿ is observed, let yₙ = Σ_{i=1}^n xᵢ. Then ρ₀(Qₙ(·|x)) is the smaller of the two probabilities that a Beta(yₙ + 1, n − yₙ + 1) random variable is at most 0.4 or is at least 0.4. If yₙ/n converges to anything other than 0.4, one of the two probabilities will go to 0 and the other to 1. Since yₙ/n will converge to something other than 0.4 with probability 1 and ρ₀ is bounded, the dominated convergence theorem A.57 says that lim_{n→∞} E ρ₀(Q(·|X₀, ..., Xₙ)) = 0. Hence, we could find rules with approximately optimal risk by taking a sequence of optimal rules among the classes of those that take at most n observations for n = 1, 2, .... The method used for n = 4 is easily generalized to arbitrary n. Here are the computed risks for the optimal rules δₙ for several values of n:

n      5       10      20      50      100     200
Risk   0.1921  0.1720  0.1643  0.1631  0.1631  0.1631

After n = 75, the risk did not change in the first eight significant digits. After n = 125, sixteen significant digits remained constant.
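A numerical illustration of this convergence (a sketch, not the book's code): estimate E ρ₀(Qₙ(·|X)) under the uniform prior by Monte Carlo, using the fact that a Beta CDF with integer parameters is a binomial tail probability.

```python
import math
import random

def beta_cdf(a, b, x):
    # P(Beta(a, b) <= x) for integer a, b >= 1, via the identity
    # I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def rho0(y, n):
    # Posterior risk of stopping: the smaller of the two probabilities that
    # a Beta(y + 1, n - y + 1) random variable is <= 0.4 or >= 0.4.
    p = beta_cdf(y + 1, n - y + 1, 0.4)
    return min(p, 1 - p)

def mean_rho0(n, reps=2000, seed=0):
    # Monte Carlo estimate of E[rho0(Q_n(.|X))] under the U(0,1) prior:
    # draw Theta, then n conditionally IID Bernoulli(Theta) observations.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        theta = rng.random()
        y = sum(rng.random() < theta for _ in range(n))
        total += rho0(y, n)
    return total / reps

for n in (5, 20, 80):
    print(n, round(mean_rho0(n), 4))
```

The printed values decrease toward 0 as n grows, consistent with the dominated convergence argument above.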
Example 9.19. Suppose that {Xₙ}ₙ₌₁^∞ are conditionally IID N(μ, σ²) given (M, Σ) = (μ, σ), that M ~ N(μ₀, σ²/λ₀) given Σ = σ, and that Σ² ~ Γ⁻¹(a₀/2, b₀/2), with a₀ > 2. Let ℵ = ℝ, and let the loss be L((μ, σ), (a, n)) = cn + (μ − a)². The posterior distribution of M given Xⁿ = xⁿ is t_{aₙ}(μₙ, bₙ/[λₙaₙ]), where

λₙ = λ₀ + n,   μₙ = (λ₀μ₀ + n x̄ₙ)/λₙ,
aₙ = a₀ + n,   bₙ = b₀ + Σ_{i=1}^n (xᵢ − x̄ₙ)² + (nλ₀/λₙ)(x̄ₙ − μ₀)².

Hence, the optimal decision after stopping at N = n is a = μₙ and

ρ₀(Qₙ(·|x)) = bₙ/[λₙ(aₙ − 2)].

The prior mean of this is b₀/[(a₀ − 2)(λ₀ + n)], which goes to 0. It follows that the risk of the optimal procedure that takes at most n observations converges to the optimal risk. If a₀ ≤ 2, then ρ₀(Q) = ∞, and it pays to take one or two observations until aₙ > 2. At this point, pretend that the problem starts over and use the above reasoning.

If we modify the problem to have loss L((μ, σ), (a, n)) = cn + (μ − a)²/σ², then ρ₀(Qₙ(·|x)) = 1/λₙ, which depends on the data only through n. Hence, it is easy to see that the optimal rule has N = n with probability 1, where n provides a minimum to cn + 1/(λ₀ + n).
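For the modified loss, finding the optimal fixed sample size is a one-line search; the sketch below uses illustrative values c = 0.01 and λ₀ = 1 (assumptions, not values from the text).

```python
def optimal_n(c, lam0, n_max=10_000):
    # The rule N = n has risk c*n + 1/(lam0 + n); return the minimizing n.
    return min(range(n_max + 1), key=lambda n: c * n + 1 / (lam0 + n))

n_star = optimal_n(c=0.01, lam0=1)
print(n_star, round(0.01 * n_star + 1 / (1 + n_star), 4))  # prints 9 0.19
```

The integer minimizer agrees with the continuous-calculus solution n = 1/√c − λ₀ for these values.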
Proposition 9.20. If there exists finite n such that ρ₀(Qₙ(·|x)) < cₙ₊₁ for all x, then the optimal procedure takes no more than n observations. The optimal procedure is a fixed sample size procedure if ρ₀(Qₙ(·|x)) depends on the data only through n.
In general, it is quite difficult to specify the optimal sequential decision procedure. The first part of Example 9.19 is one such case. To find or approximate the optimal rule in general, we will suppose that the cost of each observation is the same and that the available observations are exchangeable. That is, assume that cₙ = c for all n and that {Xₙ}ₙ₌₁^∞ are exchangeable. If we let ρ*(Q) = inf_δ ρ(Q, δ) denote the risk of the optimal rule, then it is not difficult to see that

ρ*(Q) = min{ρ₀(Q), E(ρ*(Q₁(·|X))) + c},   (9.21)

since the second term is just the mean of the optimal risk of continuing after the first observation given the first observation. If this is smaller than the optimal risk for no data, then it is the optimal risk. Otherwise, the optimal decision is to take no data and ρ₀(Q) is the optimal risk. Clearly, the optimal sequential decision rule is to stop sampling at N(x) = n, where n is the first time that ρ₀(Qₙ(·|x)) = ρ*(Qₙ(·|x)). This prescription is only useful if we know ρ*. As we will demonstrate in the next theorem, we can approximate ρ* by using successive substitution (see Section 8.5).
Theorem 9.22. Let Q be a probability measure, and suppose that L′ ≥ 0 and lim_{n→∞} E ρ₀(Qₙ(·|X)) = 0. Define

ρₙ₊₁(Q) = min{ρ₀(Q), E(ρₙ(Q₁(·|X))) + c},

for n = 0, 1, .... Then lim_{n→∞} ρₙ(Q) = ρ*(Q), and ρₙ(Q) is the risk of the optimal rule among those that take at most n observations.
PROOF. Clearly, in light of Corollary 9.17, we need only prove that ρₙ is the optimal risk for rules that take at most n observations. We will use induction. We know that ρ₀ is the optimal risk among rules that take no observations. Suppose that ρₖ is the optimal risk among rules that take at most k observations for some k ≥ 0. Then

E(ρₖ(Q₁(·|X))) + c   (9.23)

is the risk for taking at least one observation and then using the optimal rule that takes at most k more observations. The optimal rule that takes at most k + 1 observations must either take at least one observation or take no observations. Hence, the risk of the optimal rule that takes at most k + 1 observations is the smaller of (9.23) and ρ₀(Q). That is,

ρₖ₊₁(Q) = min{ρ₀(Q), E(ρₖ(Q₁(·|X))) + c}. □
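For the Beta-Bernoulli problem of Example 9.18, this recursion can be carried out exactly, since the posterior after y successes in n trials is Beta(y + 1, n − y + 1) and the predictive probability of a success is a/(a + b). The sketch below is not the book's program, and the cost c = 0.02 is an assumed value.

```python
import math
from functools import lru_cache

def beta_cdf(a, b, x):
    # P(Beta(a, b) <= x) for integer a, b >= 1:
    # I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x)**(n - j) for j in range(a, n + 1))

def rho0(a, b):
    # Risk of stopping now: the smaller posterior probability of the two
    # hypotheses Theta <= 0.4 and Theta >= 0.4 (0-1 terminal loss).
    p = beta_cdf(a, b, 0.4)
    return min(p, 1 - p)

@lru_cache(maxsize=None)
def rho(a, b, n, c=0.02):
    # rho_n(Beta(a, b)): risk of the optimal rule taking at most n more
    # observations, via rho_{n+1} = min{rho_0, c + E rho_n(Q_1(.|X))}.
    if n == 0:
        return rho0(a, b)
    p1 = a / (a + b)  # predictive probability that the next observation is 1
    cont = c + p1 * rho(a + 1, b, n - 1, c) + (1 - p1) * rho(a, b + 1, n - 1, c)
    return min(rho0(a, b), cont)

# Risks of the optimal truncated rules under the uniform prior Beta(1, 1).
for n in (1, 2, 5, 10):
    print(n, round(rho(1, 1, n), 4))
```

With this assumed cost, ρ₁(Beta(1, 1)) = 0.28 < ρ₀(Beta(1, 1)) = 0.4, so the optimal rule samples at least once; the sequence ρₙ(Q) is nonincreasing in n, as the theorem requires.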
Theorem 9.22 can be applied to Qₖ(·|X) to produce the following corollary.

Corollary 9.24. Let Q be a probability measure, and suppose that L′ ≥ 0 and lim_{n→∞} E ρ₀(Qₙ(·|X)) = 0. For each n and k, the conditional mean of the risk of the optimal rule among those that take at most n + k observations, given the first k observations and given that the optimal rule takes at least k observations, is ρₙ(Qₖ(·|X)) + ck.
Corollary 9.24 can be used to define an alternative decision rule.

Definition 9.25. The decision rule that continues to sample until the first n such that ρ₀(Qₙ(·|x)) = ρₖ(Qₙ(·|x)) is called the k-step look-ahead rule.
Example 9.26 (Continuation of Example 9.19; see page 544). It is incredibly difficult to calculate ρₙ for n > 2. We illustrate here how to calculate ρₙ for n = 1, 2. The posterior distribution is determined by four hyperparameters (a, b, μ, λ), and

ρ₀(a, b, μ, λ) = b/[λ(a − 2)].

After observing X₁ = x, let the posterior hyperparameters be

(a, b, μ, λ)(x) = (a + 1, b + [λ/(λ + 1)](x − μ)², (λμ + x)/(λ + 1), λ + 1)
               = (a + 1, b(x), μ(x), λ + 1).

We can write b(X₁) = (1 + Y²)b, where

Y = (X₁ − μ) √(λ/[b(λ + 1)]) ~ t_a(0, 1/a).   (9.27)

In particular, E(Y²) = 1/(a − 2). It follows that

E(ρ₀((a, b, μ, λ)(X₁))) = b E[(1 + Y²)/((λ + 1)(a − 1))] = b/[(λ + 1)(a − 2)].

So,

ρ₁(a, b, μ, λ) = min{ b/[λ(a − 2)], c + b/[(λ + 1)(a − 2)] }
             = b/[(λ + 1)(a − 2)] + min{ b/[λ(λ + 1)(a − 2)], c },

ρ₁((a, b, μ, λ)(x)) = b(x)/[(λ + 2)(a − 1)] + { c, if c ≤ b(x)/[(λ + 1)(λ + 2)(a − 1)];  b(x)/[(λ + 1)(λ + 2)(a − 1)], if not }
                   = b(1 + y²)/[(λ + 2)(a − 1)] + { c, if |y| ≥ r;  b(1 + y²)/[(λ + 1)(λ + 2)(a − 1)], if not },

where y is as in (9.27) once again, and

r = { √(c(λ + 1)(λ + 2)(a − 1)/b − 1), if c(λ + 1)(λ + 2)(a − 1) ≥ b;  0, if not }.

It follows that E(ρ₁((a, b, μ, λ)(X₁))) equals

b/[(λ + 2)(a − 2)] + c(1 − p) + b/[(λ + 1)(λ + 2)(a − 1)] ∫_{−r}^{r} (1 + y²) f_Y(y) dy
 = b/[(λ + 2)(a − 2)] + c(1 − p) + b/[(λ + 1)(λ + 2)(a − 1)] ∫_{−r}^{r} [Γ((a + 1)/2)/(Γ(a/2)√π)] (1 + y²)^{−(a−1)/2} dy
 = b/[(λ + 2)(a − 2)] + c(1 − p) + b q/[(λ + 1)(λ + 2)(a − 2)],

where p = Pr(|Y| ≤ r), q = Pr(|Z| ≤ r), and Z ~ t_{a−2}(0, 1/[a − 2]).
We could now calculate ρ₂(a, b, μ, λ) after each observation. If ρ₀(a, b, μ, λ) is greater than ρ₂, we should continue to sample. If ρ₀(a, b, μ, λ) equals ρ₂, the two-step look-ahead rule would stop. We could, however, try to achieve a better approximation to ρ*. One way to do this might be to numerically integrate ρ₂((a, b, μ, λ)(x)) times the predictive density of the next observation in order to approximate ρ₃(a, b, μ, λ).

Consider the results in Table 9.30. We used a prior with a₀ = 3, b₀ = 8, μ₀ = 0, and λ₀ = 1. The cost per observation was c = 0.1. After the fourth observation, we do not know whether or not ρ₀ = ρ*. If we numerically integrate ρ₂, we get ρ₃ = ρ₂. This means that we would have to consider at least four more observations before there was any chance that the optimal rule would continue sampling. But four more observations would cost 0.4 more without taking into account the loss from squared error. Since the mean of ρ₀ with four more observations is just 5/9 times the current ρ₀, which equals 0.337076, it seems unlikely that four more observations would bring the risk down enough to justify continuing. In fact, the lowest possible posterior risk we could obtain from sampling four more observations would occur if all four of them were equal to the current posterior mean, and then the risk would be 0.587264, which is barely less than ρ₀.

Another way to approximate ρ* is from below. It is possible (see Problem 6 on page 568) to show that if 0 ≤ γ₀ ≤ ρ* (for example, γ₀ = 0) and

γₙ(Q) = min{ρ₀(Q), E(γₙ₋₁(Q₁(·|X))) + c}   (9.28)

for n = 1, 2, ..., then γₙ ≤ ρ* for all n and lim_{n→∞} γₙ(Q) = ρ*(Q).
Example 9.29 (Continuation of Example 9.3; see page 537). Suppose that we observe X₁ = X₂ = 1, and we are concerned with whether or not the optimal rule stops at this point. We already saw that the optimal rule that takes at most four observations stops at this point, but the optimal rule might continue. The terminal risk is 0.064 (not counting cost of observations). The posterior is Beta(3, 1). Treating Beta(3, 1) as the prior, we can compute ρₙ and γₙ for as many n as we desire. We get ρₙ = 0.064 for all n and γₙ = 0.064 for n ≥ 33.

TABLE 9.30. Two-Step Look-Ahead Rule for Example 9.26

n   xₙ          ρ₀         ρ₂
0               8.000000   2.866667
1   -0.129354   2.002092   1.201046
2   -2.158607   1.214599   0.928760
3    1.558454   0.935753   0.915571
4   -0.677818   0.606737   0.606737

This means that the optimal risk for continuing from this point is 0.064, and we should stop now.

Suppose that we observe X₁ = X₄ = 1 and X₂ = X₃ = 0. The optimal rule that takes at most four observations has to stop at this point, and the terminal risk is 0.3174. The posterior is Beta(3, 3). Treating this as the prior, we could calculate ρₙ and γₙ for many n. At n = 100, they are both 0.2274. This means that the optimal rule would continue sampling and that the optimal risk for continuing (not counting cost of current observations) is 0.2274.

9.2 The Sequential Probability Ratio Test


Just as hypothesis tests can be introduced as special cases of decision rules,
sequential hypothesis tests are special cases of sequential decision rules. Just
as sequential decision rules require a more general setup than fixed sample
size rules, sequential hypothesis tests require a slightly more general setup
than fixed sample size tests.
Definition 9.31. Suppose that Xᵢ ∈ Xᵢ are random quantities for i = 1, 2, .... Let X = (X₁, X₂, ...).⁵ Let P_θ be a parametric family of distributions for X with parameter space Ω. Let Ω_H ∩ Ω_A = ∅ and Ω_H ∪ Ω_A = Ω. A sequential test of a hypothesis H : Θ ∈ Ω_H versus A : Θ ∈ Ω_A is a pair of functions (φ, N), where N is a stopping time and φ : X → [0, 1] gives the conditional probability of rejecting H given X = x.

Example 9.32. Let {Xₙ}ₙ₌₁^∞ be conditionally IID with N(θ, 1) distribution given Θ = θ. Let Ω_H = (−∞, θ₀] and Ω_A = (θ₀, ∞). Let {vₙ}ₙ₌₁^∞ and {wₙ}ₙ₌₁^∞ be sequences of positive real numbers. The following is a sequential test of H versus A:

N = min{n : x̄ₙ − θ₀ ∉ (−wₙ, vₙ)},
φ(x) = { 1, if N < ∞ and x̄_N − θ₀ ≥ v_N;  0, if N < ∞ and x̄_N − θ₀ ≤ −w_N, or if N = ∞ },

where x̄ₙ is the average of the first n coordinates of x.
The Neyman-Pearson lemma 4.37 was the starting point from which the theory of hypothesis testing originated. In sequential testing problems, there is a similar starting point. We need to begin with a parameter space consisting of only two points Ω = {0, 1}. Suppose that Pᵢ has a density fᵢ with respect to some measure ν (such as P₀ + P₁). That is, {Xₙ}ₙ₌₁^∞ are conditionally IID with density fᵢ given Θ = i. When we have observed X₁ = x₁, ..., Xₙ = xₙ, we will calculate the likelihood ratio

Lₙ(x) = ∏_{i=1}^n f₁(xᵢ) / ∏_{i=1}^n f₀(xᵢ),

⁵Classical decision rules cannot stop at N = 0, because prior information is not used. Hence, we have dispensed with the X₀ term in this setting.

which tells us how much more likely the data are under P₁ than under P₀. The sequential probability ratio test [see Wald (1947)] SPRT(B, A) is, for each n, to reject H : Θ = 0 if Lₙ(x) ≥ A, accept H if Lₙ(x) ≤ B, and continue sampling if B < Lₙ(x) < A, where 0 < B < 1 < A. Another way to write this is to let

N(x) = inf{n : Lₙ(x) ∉ (B, A)},

and reject H if L_{N(x)}(x) ≥ A, accept H if L_{N(x)}(x) ≤ B.

It is clear that {x : N(x) = n} is measurable with respect to the correct σ-field. We would like to show that N is finite, a.s.
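A direct transcription of the SPRT into code (a sketch, not from the text; working on the log scale avoids numerical under- and overflow):

```python
import math
import random

def sprt(b_log, a_log, log_lr_stream):
    # Run SPRT(B, A) on a stream of per-observation log likelihood ratios
    # Z_i = log[f1(x_i)/f0(x_i)], with b_log = log B < 0 < a_log = log A.
    s, n = 0.0, 0
    for z in log_lr_stream:
        s += z
        n += 1
        if s >= a_log:
            return "reject H", n
        if s <= b_log:
            return "accept H", n

# Demonstration with the Bernoulli testing problem used later in this
# section: f0 = Ber(0.25), f1 = Ber(0.75), so Z_i = +/- log 3.
rng = random.Random(1)
def zs(theta):
    while True:
        yield math.log(3) if rng.random() < theta else -math.log(3)

decision, n = sprt(math.log(1 / 9), math.log(9), zs(0.75))
print(decision, n)
```

With A = 9 and B = 1/9, repeated runs under Θ = 0.75 reject H about 90% of the time, matching the error probabilities computed below.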
Theorem 9.33. Let {Zₙ}ₙ₌₁^∞ be IID with Var(Zᵢ) > 0. Let Sₙ = Σ_{i=1}^n Zᵢ and N = inf{n : Sₙ ∉ (b, a)}, where b < a. Then Pr(N < ∞) = 1.

PROOF. Let c = |a| + |b|. Choose r large enough so that r Var(Zᵢ) > c². For each multiple of r, that is n = rk, write

Ξᵢ = Σ_{j=(i−1)r+1}^{ir} Zⱼ,   Sₙ = Ξ₁ + ⋯ + Ξₖ.

If |Ξₘ| ≥ c for some m, then N ≤ rm, because Sᵢ would have to move across one of the boundaries between i = r(m − 1) and i = rm if it has not done so already, since c is at least the distance between the boundaries. It follows that

{N = ∞} ⊆ {|Ξⱼ| < c, for all j}.

We know that E Ξⱼ² ≥ r Var(Zⱼ) > c². From this it follows that p = Pr(|Ξⱼ| ≥ c) > 0. Since the Ξⱼ are IID,

Pr(N = ∞) ≤ Pr(|Ξⱼ| < c, j = 1, 2, ...) = ∏_{j=1}^∞ (1 − p) = 0. □

When we apply Theorem 9.33, we will let Zᵢ = log[f₁(Xᵢ)/f₀(Xᵢ)], a = log A, and b = log B.

Theorem 9.34. If α = P₀(L_N ≥ A) and β = P₁(L_N ≤ B), then α ≤ (1 − β)/A and β ≤ (1 − α)B.

PROOF. Since {N = n} is in the σ-field generated by X₁, ..., Xₙ, it follows that

α = Σ_{n=1}^∞ P₀(N = n, Lₙ ≥ A)
  = Σ_{n=1}^∞ ∫_{{N=n, Lₙ≥A}} (1/Lₙ) ∏_{i=1}^n f₁(xᵢ) dν(x₁) ⋯ dν(xₙ)
  ≤ (1/A) Σ_{n=1}^∞ ∫_{{N=n, Lₙ≥A}} ∏_{i=1}^n f₁(xᵢ) dν(x₁) ⋯ dν(xₙ)
  = (1/A) P₁(L_N ≥ A) = (1/A)(1 − β).

Similarly, β = P₁(L_N ≤ B) ≤ B P₀(L_N ≤ B) = B(1 − α). □


If we ignore the overshoot of the boundaries, we can replace the inequalities by equalities and solve the equations for

α ≈ (1 − B)/(A − B),   β ≈ B(A − 1)/(A − B),
A ≈ (1 − β)/α,   B ≈ β/(1 − α).

Theorem 9.35. Let α* and β* be strictly between 0 and 1. The SPRT with A = (1 − β*)/α* and B = β*/(1 − α*) has operating characteristics α = P₀(L_N ≥ A) and β = P₁(L_N ≤ B), which satisfy α + β ≤ α* + β*.

PROOF. If α ≤ α* and β ≤ β*, the result is clearly true. So, suppose that either β > β* or α > α*. (We will see shortly that both inequalities cannot occur simultaneously.) If β > β*, then 1 − β < 1 − β* and

α ≤ (1/A)(1 − β) = [α*/(1 − β*)](1 − β) < α*.

It now follows that

β* < β ≤ B(1 − α) = β* (1 − α)/(1 − α*).

Hence,

0 < β − β* ≤ β* [(1 − α)/(1 − α*) − 1] = β* (α* − α)/(1 − α*).

It follows that

α* + β* − α − β = (α* − α) + (β* − β)
  ≥ (α* − α) − β* (α* − α)/(1 − α*)
  = (α* − α)(1 − B) > 0.

Similarly, if α > α*, we can show that β* > β and

α* + β* − α − β ≥ (β* − β)(1 − 1/A) > 0. □

Example 9.36. Suppose that Xᵢ ~ Ber(θ) given Θ = θ. Suppose that θ ∈ {0.25, 0.75} and H : θ = 0.25. Then

Zᵢ = log[f₁(Xᵢ)/f₀(Xᵢ)] = { −log 3, if Xᵢ = 0;  log 3, if Xᵢ = 1 }.

There will be no overshoot of the boundaries if a = k₁ log 3 and b = −k₂ log 3 for k₁ and k₂ integers. Here are some examples:

k₁  k₂  α      β      A   B
1   1   0.25   0.25   3   0.3333
2   2   0.1    0.1    9   0.1111
2   1   0.077  0.308  9   0.3333
1   2   0.308  0.077  3   0.1111
3   3   0.036  0.036  27  0.0370

Suppose that we choose the level 0.1 test with k₁ = k₂ = 2. We could calculate the mean of N, the expected number of observations needed. It is clear that N is even and that N = 2k if and only if the first 2k − 2 observations come in pairs 0,1 or 1,0 and the last two are 1,1 or 0,0. So,

P₀(N = 2k) = 0.625 × 0.375^{k−1},   k = 1, 2, ....

It follows that E₀(N) = Σ_{k=1}^∞ 2k P₀(N = 2k) = 3.2. It is easy to see that E₁(N) = 3.2 also.
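A quick numerical check of this calculation (a sketch, not the book's code): sum the series for E₀(N) and, independently, simulate the SPRT with boundaries ±2 log 3.

```python
import math
import random

# Exact series: P0(N = 2k) = 0.625 * 0.375**(k - 1).
exact = sum(2 * k * 0.625 * 0.375 ** (k - 1) for k in range(1, 200))
print(round(exact, 6))  # 3.2

# Simulation under theta = 0.25, counting steps of size log 3 in units
# of log 3, so the boundaries +/- 2 log 3 become +/- 2.
rng = random.Random(0)
def stopping_time(theta):
    s, n = 0, 0
    while abs(s) < 2:
        s += 1 if rng.random() < theta else -1
        n += 1
    return n

mean_n = sum(stopping_time(0.25) for _ in range(20000)) / 20000
print(round(mean_n, 2))  # close to 3.2
```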
To compare this to a fixed sample size procedure, it takes n = 6 to have α = β = 0.1035 (not quite as good as the sequential procedure). The test has test function

φ₆(x) = { 1, if Σ_{i=1}^6 xᵢ ∈ {4, 5, 6};  0.5, if Σ_{i=1}^6 xᵢ = 3;  0, if Σ_{i=1}^6 xᵢ ∈ {0, 1, 2} }.

This test takes nearly twice as many observations and has higher error probabilities. One can calculate that P₀(N ≤ 6) = 0.947, so there is some chance that the sequential procedure will need more observations. But this will only occur for data sets in which φ₆ randomizes. In fact, the two tests make almost all the same decisions based on six observations. The only disagreements come when two 1s are followed by four 0s or when two 0s are followed by four 1s, although φ₆ does randomize sometimes when the sequential procedure makes a terminal decision. For example, if the first six observations are 0,1,1,1,0,0, then the sequential procedure would reject H after four observations, but φ₆ would randomize.

Suppose that a Bayesian believed that Pr(Θ = 0.25) = 0.5 before seeing any data. Then, after n observations with x successes,

Pr(Θ = 0.25 | Xⁿ = xⁿ) = [0.5 × 0.25^x × 0.75^{n−x}] / [0.5 × 0.25^x × 0.75^{n−x} + 0.5 × 0.75^x × 0.25^{n−x}].

For even n, x = 1 + n/2 leads to a posterior probability of Θ = 0.25 equal to 0.1. Similarly, x = n/2 − 1 leads to a posterior probability equal to 0.9. The SPRT with α = β = 0.1 turns out to be to reject H as soon as the posterior probability that H is true falls to 0.1 and to accept H as soon as it rises to 0.9, if we have equal prior probabilities to start.
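This correspondence between the likelihood-ratio boundaries and the posterior probability is easy to check numerically (a sketch, not the book's code):

```python
def posterior_H(n, x):
    # Pr(Theta = 0.25 | n observations, x successes) under equal prior odds.
    num = 0.5 * 0.25 ** x * 0.75 ** (n - x)
    den = num + 0.5 * 0.75 ** x * 0.25 ** (n - x)
    return num / den

# Hitting the upper boundary L_n = 9 (x = n/2 + 1 for even n) gives
# posterior probability 0.1 for H; the lower boundary L_n = 1/9 gives 0.9,
# independent of n.
for n in (4, 10, 50):
    print(n, round(posterior_H(n, n // 2 + 1), 6), round(posterior_H(n, n // 2 - 1), 6))
```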

An interesting calculation can be done in Example 9.36. We found that E₀(N) = 3.2. Notice also that

E₀(S_N) = 0.1 × 2 log 3 + 0.9 × (−2 log 3) = −1.6 log 3.

Since E₀(Zᵢ) = 0.25 log 3 − 0.75 log 3 = −0.5 log 3, we see that E₀(S_N) = E₀(Zᵢ)E₀(N). It is as if N were fixed in advance!

Theorem 9.37 (Wald's lemma). Let {Zₙ}ₙ₌₁^∞ be IID such that E(Zᵢ) exists. Let N be a stopping time such that E(N) < ∞. If S_N = Σ_{i=1}^N Zᵢ, then E(S_N) = E(Zᵢ)E(N).
PROOF. We can write S_N = Σ_{n=1}^∞ Zₙ I_{{n,n+1,...}}(N). Now write

E(S_N) = E(Σ_{n=1}^∞ Zₙ I_{{n,n+1,...}}(N))
      = E(Σ_{n=1}^∞ Zₙ⁺ I_{{n,n+1,...}}(N)) − E(Σ_{n=1}^∞ Zₙ⁻ I_{{n,n+1,...}}(N))
      = Σ_{n=1}^∞ E(Zₙ I_{{n,n+1,...}}(N)).

Since I_{{n,n+1,...}}(N) = 1 − I_{{0,1,...,n−1}}(N) is a function of Z₁, ..., Zₙ₋₁, it is independent of Zₙ. Hence

E(Zₙ I_{{n,n+1,...}}(N)) = E(Zₙ) Pr(N ≥ n) = E(Z₁) Pr(N ≥ n).

It follows that E(S_N) = Σ_{n=1}^∞ E(Z₁) Pr(N ≥ n) = E(Z₁)E(N). □
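Wald's lemma is easy to probe by simulation (a sketch with an arbitrarily chosen walk, not an example from the text): stop a positive-drift ±1 walk when it leaves (−5, 5) and compare the two sides of the identity.

```python
import random

rng = random.Random(3)
mu = 0.2  # E(Z_i) for steps Z_i = +1 w.p. 0.6, -1 w.p. 0.4
tot_s, tot_n, reps = 0.0, 0, 20000
for _ in range(reps):
    s, n = 0, 0
    while -5 < s < 5:
        s += 1 if rng.random() < 0.6 else -1
        n += 1
    tot_s += s
    tot_n += n

# The two sides of E(S_N) = E(Z_1)E(N) should be close.
print(round(tot_s / reps, 3), round(mu * tot_n / reps, 3))
```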


Wald's lemma can be used to help approximate the expected value of N under distributions other than the hypothesis and alternative. If we approximate by assuming that there is no overshoot, then

S_N = { a, if we reject H;  b, if we accept H }.

All we need to complete the approximation is Pr(reject H).


Lemma 9.38. If {Xₙ}ₙ₌₁^∞ are IID with distribution P, and there exists h ≠ 0 such that

∫ (f₁(x)/f₀(x))^h dP(x) = 1,

then, to the approximation of no overshoot, for the SPRT(B, A),

P(reject H) = (1 − B^h)/(A^h − B^h).

PROOF. If h > 0, consider the SPRT(B^h, A^h) as a test of the hypothesis that the density of each observation (with respect to P) is 1 versus the alternative that the density is (f₁/f₀)^h. The likelihood ratio is Lₙ* = Lₙ^h, and B^h < Lₙ* < A^h if and only if the original likelihood ratio satisfies B < Lₙ < A. So

P(reject H) = P(L_N ≥ A) = P(L*_N ≥ A^h) ≈ (1 − B^h)/(A^h − B^h).

If h < 0, consider the SPRT(B^{−h}, A^{−h}) as a test of the hypothesis that the density is (f₁/f₀)^h versus the alternative that the density is 1. Then the likelihood ratio is Lₙ* = Lₙ^{−h} and

P(reject H) = P(L_N ≥ A) = P(L*_N ≥ A^{−h})
 ≈ A^{−h}(B^{−h} − 1)/(B^{−h} − A^{−h}) = (1 − B^h)/(A^h − B^h). □

Using the no-overshoot approximation,

E(S_N) = a P(reject H) + b P(accept H)
      = b + (a − b) P(reject H)
      ≈ log(B) + log(A/B) (1 − B^h)/(A^h − B^h).

So, if E(Z₁) ≠ 0, we get

E(N) ≈ [log(B) + log(A/B)(1 − B^h)/(A^h − B^h)] / E(Z₁).

Example 9.39 (Continuation of Example 9.36; see page 551). Suppose that {Xₙ}ₙ₌₁^∞ are IID Ber(0.6), but we are testing H : Θ = 0.25 versus A : Θ = 0.75. We have

f₁(x)/f₀(x) = { 1/3, if x = 0;  3, if x = 1 }.

If h = −log₃(1.5) = −0.36907, then 0.4(1/3)^h + 0.6 × 3^h = 1. We can now calculate

E(S_N) ≈ log(1/9) + log(81) [1 − (1/9)^{−0.36907}]/[9^{−0.36907} − (1/9)^{−0.36907}] = 0.8451,
E(Z₁) = 0.6 × log(3) − 0.4 × log(3) = 0.2197,
E(N) ≈ 0.8451/0.2197 = 3.846.

Notice that the mean stopping time is longer when θ is between the hypothesis and the alternative.

If {Xₙ}ₙ₌₁^∞ are IID Ber(0.5), then h = 0 is the only value that works in the equation in Lemma 9.38. Hence, Lemma 9.38 has nothing to say about this case.
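These numbers can be reproduced mechanically (a sketch, not the book's code): solve the equation of Lemma 9.38 for h by bisection, then apply the no-overshoot formulas with A = 9 and B = 1/9.

```python
import math

def g(h, p1=0.6):
    # Equation from Lemma 9.38 for this Bernoulli example:
    # (1 - p1)*(1/3)**h + p1*3**h - 1, whose nonzero root is the needed h.
    return (1 - p1) * 3.0 ** (-h) + p1 * 3.0 ** h - 1.0

def solve_h(lo=-5.0, hi=-1e-9, iters=100):
    # Bisection on the negative half-line: g(lo) > 0 and g(hi) < 0 here.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

h = solve_h()                   # about -0.36907 = -log_3(1.5)
A, B = 9.0, 1.0 / 9.0
p_reject = (1 - B ** h) / (A ** h - B ** h)
ESN = math.log(B) + math.log(A / B) * p_reject
EN = ESN / (0.2 * math.log(3))  # E(Z_1) = 0.2 log 3
print(round(h, 5), round(ESN, 4), round(EN, 3))
```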

The following result has a proof similar to that of Wald's lemma, but applies to the case not yet handled in Example 9.36.

Proposition 9.40. Suppose that {Zₙ}ₙ₌₁^∞ are IID, with E(Zᵢ) = 0 and E(Zᵢ²) = σ². Suppose that N is a stopping time such that E(N) < ∞. Then E(S_N²) = σ²E(N).

Example 9.41 (Continuation of Example 9.36; see page 551). If {Xₙ}ₙ₌₁^∞ are IID Ber(0.5), then E(Zᵢ) = 0 and E(Zᵢ²) = (log 3)² = 1.2069. Also, E(S_N²) = 4(log 3)² = 4.8278. It follows that E(N) = 4. Of course, this example is simple enough that we could calculate E_θ(N) for all θ without any of these theorems. See Problem 9 on page 568.
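Proposition 9.40's conclusion for this example can be checked by simulation (a sketch, not from the text): under θ = 0.5 the walk of ±log 3 steps stops when it reaches ±2 log 3, and the mean stopping time should be near 4.

```python
import random

rng = random.Random(5)
def stopping_time():
    # Steps of size +/- 1 in units of log 3; stop when |S_n| reaches 2.
    s, n = 0, 0
    while abs(s) < 2:
        s += 1 if rng.random() < 0.5 else -1
        n += 1
    return n

reps = 20000
mean_n = sum(stopping_time() for _ in range(reps)) / reps
print(round(mean_n, 2))  # near 4, as Proposition 9.40 predicts
```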

The SPRT has an optimality property in terms of expected sample size, which follows from its being a Bayes rule in a sequential decision problem. This is very much like the Neyman-Pearson fundamental lemma 3.87, in which a minimal complete class of Bayes rules was found in the fixed sample size problem for a simple hypothesis and a simple alternative.

Lemma 9.42. Suppose that 0 < γ₁ < γ₂ < 1 and that f₀ and f₁ are two different densities with respect to a measure ν. There exist 0 < w < 1 and c > 0 such that for every γ ∈ [γ₁, γ₂], the SPRT(B, A) with

B = [γ/(1 − γ)] (1 − γ₂)/γ₂,   A = [γ/(1 − γ)] (1 − γ₁)/γ₁,

is a Bayes rule in the sequential decision problem with action space ℵ = {0, 1} × {1, 2, ...}, parameter space {0, 1}, prior distribution Pr(Θ = 0) = γ, and loss function

L(i, (j, n)) = cn + { wᵢ, if i ≠ j;  0, otherwise },

where w₀ = 1 − w and w₁ = w.

PROOF. First we will find the general solution of the sequential decision problem, and then we will show that there is one whose solution is SPRT(B, A). To put the problem in testing form, let Ω_H = {f₀}. Suppose that a sequential test is δ = (φ, N). Define

α₀(δ) = E₀(φ(X)),   α₁(δ) = 1 − E₁(φ(X)).

Then the Bayes risk of δ with respect to prior probability γ = Pr(Θ = 0) is

ρ(γ, δ) = γ[w₀α₀(δ) + c E₀(N)] + (1 − γ)[w₁α₁(δ) + c E₁(N)].

Define, for each 0 ≤ γ ≤ 1,

U(γ) = inf_δ ρ(γ, δ).

Since N(x) ≥ 1 for all x, it follows that U(γ) > 0 for all γ. Since ρ(γ, δ) is a positive linear function of γ for each δ, it follows that U is the infimum of a collection of positive linear functions; hence, it is concave and continuous on (0, 1) and positive at the two endpoints. Define

Λₙⁱ(x) = ∏_{j=1}^n fᵢ(xⱼ),

for i = 0, 1 and n = 1, 2, ..., where x = (x₁, x₂, ...). The posterior probability of {Θ = 0} given X₁ = x₁, ..., Xₙ = xₙ is

γₙ(x) = γΛₙ⁰(x)/[γΛₙ⁰(x) + (1 − γ)Λₙ¹(x)] = γ/[γ + (1 − γ)Lₙ(x)],

where Lₙ(x) is the likelihood ratio after n observations (used in every SPRT). After observing X₁ = x₁, ..., Xₙ = xₙ, the posterior mean of the loss to be incurred if N = n is h(γₙ(x)) + cn, where

h(γ) = min{w₀γ, w₁(1 − γ)}.

The posterior mean of the loss to be incurred if N > n is at least U(γₙ(x)) + cn. Hence, the Bayes rule is to continue sampling so long as h(γₙ(x)) > U(γₙ(x)) and to stop at N(x) equal to the first n such that h(γₙ(x)) ≤ U(γₙ(x)). Note that h(γ) is continuous, has a graph shaped like a triangle, and satisfies h(0) = h(1) = 0. Since U is concave, it follows that h(γ) > U(γ) for γ in some interval (g₁, g₂). Figure 9.43 shows the U and h functions for a typical example.⁶ (If h(γ) ≤ U(γ) for all γ, define g₁ = g₂ to be the value of γ at which h is maximized.) Hence, the Bayes rule continues sampling so long as

B* = [γ/(1 − γ)](1 − g₂)/g₂ < Lₙ(x) < [γ/(1 − γ)](1 − g₁)/g₁ = A*;

it rejects H if L_N(x) ≥ A*, and it accepts H if L_N(x) ≤ B*. Therefore, the Bayes rule is SPRT(B*, A*).

The g₁ and g₂ found above depend on the particular decision problem only through w and c, where we assume that w₁ = w and w₀ = 1 − w. This is true because the functions h and U depend on the decision problem only

⁶The example in Figure 9.43 has f₀ being the Ber(0.6) distribution and f₁ being the Ber(0.8) distribution. Also, w₀ = 0.4 and c = 0.02. In this example, g₁ = 0.31 and g₂ = 0.48.

FIGURE 9.43. Typical U and h Functions for SPRT. (The horizontal axis is the prior probability, from 0.0 to 1.0.)

through these values. So we call the two values g₁(w, c) and g₂(w, c). To finish the proof, we need only find c and w so that γᵢ = gᵢ(w, c) for both i = 1, 2. Define

β(w, c) = g₂(w, c)/[1 − g₂(w, c)],   λ(w, c) = g₁(w, c)/[β(w, c)(1 − g₁(w, c))].

It is easy to see that g₁ and g₂ are also functions of β and λ. Set

λ₀ = [γ₁/(1 − γ₁)] (1 − γ₂)/γ₂,   β₀ = γ₂/(1 − γ₂).

Then, we need to find w and c so that λ(w, c) = λ₀ and β(w, c) = β₀.

As c ↓ 0, U(0) and U(1) both approach 0; hence g₁(w, c) tends to 0 and g₂(w, c) tends to 1 as c ↓ 0 for fixed w. Hence, for fixed w, lim_{c↓0} λ(w, c) = 0. Since ρ(γ, δ) increases as c increases for every γ and δ, it follows that U(γ) increases as c increases for every γ. Hence g₁(w, c) is an increasing function of c and g₂(w, c) is decreasing in c for fixed w. It follows that λ is strictly increasing in c for fixed w. As c → ∞, eventually, h(γ) ≤ U(γ) for all γ. Let c₀(w) = inf{c : g₁(w, c) = g₂(w, c)}. Then lim_{c↑c₀(w)} λ(w, c) = 1. Since 0 < λ₀ < 1, there exists a unique c = c(w) such that λ(w, c) = λ₀.

As w approaches 0 for fixed c, U approaches the constant function c and h approaches the constant function 0 while the peak of the triangle in the graph of h moves toward γ = 0. Hence g₁(w, c) and g₂(w, c) approach 0. Also c₀(w) and c(w) approach 0 as w approaches 0. Hence

lim_{w↓0} g₂(w, c(w)) = 0,   lim_{w↓0} β(w, c(w)) = 0.

As w approaches 1 for fixed c, U approaches the constant function c and h approaches the constant function 0 while the peak of the triangle in the graph of h moves toward γ = 1. Hence g₁(w, c) and g₂(w, c) approach 1. Also c₀(w) and c(w) approach 0 as w approaches 1. Hence

lim_{w↑1} g₂(w, c(w)) = 1,   lim_{w↑1} β(w, c(w)) = ∞.

Since β(w, c(w)) is continuous in w, there exists w such that β(w, c(w)) = β₀. □
We are now in a position to prove a theorem of Wald and Wolfowitz (1948) that says that the SPRT has the smallest expected sample size of all tests with the same error probabilities. Since fixed sample size tests can be considered as sequential tests in which N is not a function of x, we state the theorem in terms of sequential tests only.

Theorem 9.44. Let Ω_H = {f₀} and Ω_A = {f₁}. Let B₀ < 1 < A₀. Let δ₀ = (φ₀, N₀) be SPRT(B₀, A₀), and suppose that E₀(φ₀(X)) = α₀ and 1 − E₁(φ₀(X)) = β₀. Among all sequential tests δ = (φ, N) for which N ≥ 1, E₀(φ(X)) ≤ α₀, 1 − E₁(φ(X)) ≤ β₀, and Eᵢ(N) < ∞ for i = 0, 1, δ₀ minimizes Eᵢ(N) for i = 0, 1.

PROOF. Let δ = (φ, N) be a sequential test of H versus A with E₀(φ(X)) = α₁ ≤ α₀, 1 − E₁(φ(X)) = β₁ ≤ β₀, and Eᵢ(N) < ∞ for i = 0, 1. Pick 0 < γ < 1 and define

γ₁ = γ/[A₀(1 − γ) + γ],   γ₂ = γ/[B₀(1 − γ) + γ].

Then 0 < γ₁ < γ < γ₂ < 1 and

B₀ = [γ/(1 − γ)] (1 − γ₂)/γ₂,   A₀ = [γ/(1 − γ)] (1 − γ₁)/γ₁.

Lemma 9.42 says that there exist 0 < w < 1 and c > 0 such that δ₀ is a Bayes rule in the sequential decision problem with action space ℵ = {0, 1} × {1, 2, ...}, parameter space {0, 1}, prior distribution Pr(Θ = 0) = γ, and loss function

L(i, (j, n)) = cn + { wᵢ, if i ≠ j;  0, otherwise },

where w₀ = 1 − w and w₁ = w. It follows that

γ(w₀α₀ + cE₀N₀) + (1 − γ)(w₁β₀ + cE₁N₀) ≤ γ(w₀α₁ + cE₀N) + (1 − γ)(w₁β₁ + cE₁N).

Since α₁ ≤ α₀, β₁ ≤ β₀, c > 0, w₀ > 0, and w₁ > 0, it follows that

γ(E₀N₀ − E₀N) + (1 − γ)(E₁N₀ − E₁N) ≤ 0.

Since this is true for every γ ∈ (0, 1), the limit as γ goes to 1 or to 0 of the left-hand side is also less than or equal to 0. These two limits are respectively E₀N₀ − E₀N and E₁N₀ − E₁N. □

The reason for the extra conditions that Eᵢ(N) < ∞ for i = 0, 1 is that there may be another test with E₀(N) = ∞ and a very small value of E₁(N).

9.3 Interval Estimation*

In Section 5.2.5, we gave examples of loss functions associated with interval estimation. For example, let the (terminal) action space be the set of all pairs of real numbers (a, b) with a ≤ b. We could set

L′(θ, (a, b)) = k(b − a) + 1 − I_{[a,b]}(g(θ)),

or

L′(θ, (a, b)) = k(b − a)² + 1 − I_{[a,b]}(g(θ)),

or

L′(θ, (a, b)) = k(b − a) + { a − g(θ), if g(θ) < a;  g(θ) − b, if g(θ) > b;  0, otherwise },

where k > 0. The first two loss functions above lead to intervals with equal posterior density at a and b. The third one leads to intervals with equal posterior probability below a and above b. Another alternative is to let ℵ = ℝ and set

L′(θ, a) = 1 − I_{[a−d,a+d]}(g(θ)),

where d > 0 is some fixed half-width for the interval. In this case, the interval has a fixed width, and the coverage probability is determined (along with the center of the interval) by the data.
Example 9.45. Suppose that {Xₙ}ₙ₌₁^∞ are IID N(μ, σ²) given Θ = (μ, σ). Let g(θ) = μ. Suppose that we want an interval of half-width d, and the cost of each observation is c. Suppose that the prior is conjugate. The posterior distribution of M after n observations will be t_{aₙ}(μₙ, bₙ/[aₙλₙ]), where the posterior hyperparameters aₙ, bₙ, μₙ, and λₙ are as in Example 9.19 on page 544. The optimal decision upon stopping is μₙ. The terminal risk (not counting cost of observations) is

ρ₀ = 2[1 − T_{aₙ}(d √(aₙλₙ/bₙ))],

where T_{aₙ} stands for the CDF of the t_{aₙ}(0, 1) distribution.

*This section may be skipped without interrupting the flow of ideas.

To implement the one-step look-ahead rule, we need to calculate

ρ₁ = min{ ρ₀, c + E( 2[1 − T_{aₙ+1}(d √((aₙ + 1)(λₙ + 1)/[bₙ(1 + Y²)]))] ) },

where

Y = (Xₙ₊₁ − μₙ) √(λₙ/[bₙ(λₙ + 1)]) ~ t_{aₙ}(0, 1/aₙ).

Even this formula requires numerical integration to compute. For example, suppose that we use a prior with a₀ = 2, b₀ = 7, μ₀ = 1, and λ₀ = 1. The cost of each observation is 0.005, and the half-width of the interval is d = 1.96. Table 9.47 contains some data along with ρ₀ and ρ₁. The terminal decision is μ₉ = −1.011531, and so the interval is [−2.9715, 0.9485]. The posterior probability in the interval is 0.9810. The probability is so high because the cost of each observation is so low. If the cost had been 0.01 instead, the one-step look-ahead rule would have stopped after seven observations with the interval [−3.0214, 0.8986] and posterior probability 0.9640.

The classical approach to fixed-width interval estimation has traditionally not been through a loss function. Rather, one requires that sampling continue until one has an interval of a fixed width with the desired confidence coefficient or greater. No cost of observation is taken into account, except possibly for comparing different procedures.

The most naive procedure would be to compute a fixed sample size coefficient γ confidence interval for each n and stop at the first n such that the interval has half-width at most d.
Definition 9.46. Let {X_n}_{n=1}^∞ be conditionally IID N(μ, σ²) given Θ =
(μ, σ). Let X̄_n = Σ_{i=1}^n X_i/n and S_n² = Σ_{i=1}^n (X_i − X̄_n)²/(n − 1) for each
n. Let N_1 be the smallest n ≥ 2 such that S_n T_{n−1}^{−1}(1 − α/2) ≤ √n d,

TABLE 9.47. One-Step Look-Ahead Rule for Example 9.45

n   X_n         ρ_0        ρ_1
0               0.4047363  0.2816774
1   -1.745003   0.2396304  0.1774288
2    0.793758   0.1178335  0.0881298
3   -4.385832   0.1475386  0.1190706
4   -4.708225   0.1268197  0.1053043
5   -0.363233   0.0797290  0.0675238
6    1.162189   0.0599002  0.0521162
7   -0.244931   0.0360019  0.0330767
8    1.455418   0.0265989  0.0258814
9   -3.079450   0.0189839  0.0189839

where T_{n−1} is the CDF of the t_{n−1}(0, 1) distribution. Then the interval
[X̄_{N_1} − d, X̄_{N_1} + d] is called the naive coefficient 1 − α sequential confidence
interval for M.
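The naive rule can be sketched directly from Definition 9.46 (an illustration with hypothetical values μ = 5, σ = 2, d = 1.5; the t quantile is found by bisection on a Simpson-integrated CDF):

```python
import math, random

def t_cdf(x, df, steps=1000):
    """CDF of t_df(0, 1) via Simpson integration of its density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    dens = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2)
    b = abs(x); h = b / steps
    s = dens(0.0) + dens(b) + sum((4 if i % 2 else 2) * dens(i * h) for i in range(1, steps))
    return 0.5 + s * h / 3.0 if x >= 0 else 0.5 - s * h / 3.0

def t_quantile(p, df):
    """Quantile for p > 1/2, by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return (lo + hi) / 2.0

def naive_interval(stream, d, alpha=0.05):
    """Stop at the first n >= 2 with S_n * T^{-1}_{n-1}(1 - alpha/2) <= sqrt(n) d."""
    xs = []
    while True:
        xs.append(next(stream))
        n = len(xs)
        if n < 2:
            continue
        xbar = sum(xs) / n
        s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
        if s * t_quantile(1 - alpha / 2, n - 1) <= math.sqrt(n) * d:
            return n, (xbar - d, xbar + d)

rng = random.Random(0)
stream = iter(lambda: rng.gauss(5.0, 2.0), None)   # endless stream of N(5, 2^2) data
n1, interval = naive_interval(stream, d=1.5)
print(n1, interval)
```

As the text goes on to show, the coverage probability of the resulting interval generally differs from 1 − α.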
It is not surprising that the naive confidence interval does not have cov-
erage probability 1 − α. That is, P_θ(X̄_{N_1} − d ≤ μ ≤ X̄_{N_1} + d) ≠ 1 − α. We
can write an expression for the coverage probability, however:

P_θ(X̄_{N_1} − d ≤ μ ≤ X̄_{N_1} + d) = Σ_{n=2}^∞ P_θ(X̄_n − d ≤ μ ≤ X̄_n + d | N_1 = n) P_θ(N_1 = n).   (9.48)
First, recall that N_1 is the first n ≥ 2 such that S_n² ≤ k_n for some sequence
{k_n}_{n=2}^∞. We will prove that the event {N_1 = n} is independent of X̄_n.
This will allow some simplification in (9.48). It is sufficient to show that
S_2², ..., S_n² are all independent of X̄_n. For each k = 1, 2, ..., consider the
k × k matrix Γ_k whose rows are all unit vectors and whose ith row (for
i < k) is proportional to the vector with 1 in the first i places, −i in the
(i + 1)st place, and 0 elsewhere. The kth row is proportional to the vector of all 1s. It is easy
to see that these rows are orthogonal to each other, so Γ_k is orthogonal
for each k. For i < n, define W_i to be the inner product of the ith row
of Γ_n with (X_1, ..., X_n).⁷ Note that the inner product of the last row of
Γ_k and the vector X_k = (X_1, ..., X_k) is √k X̄_k. So, X̄_n is independent
of W_1, ..., W_{n−1}. Also, since ||Γ_k X_k||² = Σ_{i=1}^k X_i², it follows that
S_k² = Σ_{i=1}^{k−1} W_i²/(k − 1) for k = 2, 3, .... Hence X̄_n is independent of S_2², ..., S_n². This
means that we can write

P_θ(X̄_n − d ≤ μ ≤ X̄_n + d | N_1 = n) = P_θ(X̄_n − d ≤ μ ≤ X̄_n + d)
                                      = 2Φ(√n d/σ) − 1.
Hence,

P_θ(X̄_{N_1} − d ≤ μ ≤ X̄_{N_1} + d) = Σ_{n=2}^∞ [2Φ(√n d/σ) − 1] P_θ(N_1 = n).

Some numerical method would be required to calculate P_θ(N_1 = n), but
the argument above can be used to show that it depends on θ only through
d/σ.
It should be noted that a Bayesian does not have the same problem with
naive posterior probability intervals. So long as the decision of whether or
not to take more data is a measurable function solely of the available data,

⁷Note that, for fixed i, W_i is the same no matter which n > i is chosen for its
definition.

the posterior distribution of a parameter given the data does not depend
on whether or not any more data will be taken. Hence, the Bayesian can
declare, after every observation, the posterior probability that the
parameter lies in any set he or she wishes. If, after n observations, there is
an interval of half-width d such that the posterior probability is 1 − α that
M is in that interval, the Bayesian can declare that and stop sampling.
There are some classical procedures that do actually have the desired cov-
erage probability. We will present one such procedure here. It is a two-stage
sampling procedure. One chooses an initial sample size n_0 and estimates
σ² by S²_{n_0}. Then one collects a second sample whose size depends on S²_{n_0}.
Define

c = [d / T_{n_0−1}^{−1}(1 − α/2)]²,   N_2 = max{ n_0, ⌊S²_{n_0}/c⌋ + 1 },

where ⌊z⌋ denotes the largest integer less than or equal to z. Use the interval
centered at X̄_{N_2} with half-width d.
Lemma 9.49. With the above notation, the conditional distribution of

√N_2 (X̄_{N_2} − μ)/S_{n_0}   (9.50)

given Θ = (μ, σ) is t_{n_0−1}(0, 1).

PROOF. We can write

X̄_{N_2} = (Σ_{i=1}^{n_0} X_i + Σ_{i=1}^{N_2−n_0} X_{n_0+i}) / N_2.

We know that Σ_{i=1}^{n_0} X_i and {X_{n_0+i}}_{i=1}^∞ are independent of S²_{n_0}. Since
N_2 is constant conditional on S²_{n_0}, we have that the conditional distribu-
tion of X̄_{N_2} given S²_{n_0} is N(μ, σ²/N_2). So, the conditional distribution of
√N_2(X̄_{N_2} − μ)/σ is N(0, 1), which is independent of S²_{n_0}. It follows that
(9.50) has the t_{n_0−1}(0, 1) distribution. □
Technically, we could use the interval with half-width

(S_{n_0}/√N_2) T_{n_0−1}^{−1}(1 − α/2),

but this should be pretty close to d.
The problems with this procedure derive from the first stage of sampling.
First, there is the question of how to choose n_0, which turns out to be
crucial to the performance of the method, as we will see in Example 9.52.
Second, there is the fact that the estimate of σ² is based only on the first
n_0 observations. In order to get a good estimate, n_0 must be large, but
then we might sample too much if σ² is small. If we choose n_0 too small,
then c is small and N_2 is large.
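The second-stage sample size is a one-line computation. The sketch below applies it to the first-stage variances reported in Table 9.51 with d = 1.96 and α = 0.05 (the values used in Example 9.52); the t quantile is computed by bisection on a Simpson-integrated CDF so that only the standard library is needed.

```python
import math

def t_cdf(x, df, steps=4000):
    """CDF of t_df(0, 1) via Simpson integration of its density."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    dens = lambda u: c * (1.0 + u * u / df) ** (-(df + 1) / 2)
    b = abs(x); h = b / steps
    s = dens(0.0) + dens(b) + sum((4 if i % 2 else 2) * dens(i * h) for i in range(1, steps))
    return 0.5 + s * h / 3.0 if x >= 0 else 0.5 - s * h / 3.0

def t_quantile(p, df):
    """Quantile for p > 1/2, by bisection."""
    lo, hi = 0.0, 200.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if t_cdf(mid, df) < p else (lo, mid)
    return (lo + hi) / 2.0

def second_stage_size(n0, s2, d=1.96, alpha=0.05):
    """c = [d / T^{-1}_{n0-1}(1 - alpha/2)]^2 and N2 = max{n0, floor(s2/c) + 1}."""
    c = (d / t_quantile(1 - alpha / 2, n0 - 1)) ** 2
    return c, max(n0, int(s2 // c) + 1)

for n0, s2 in [(2, 3.222653), (3, 6.707906), (4, 6.616990), (9, 5.561759)]:
    c, n2 = second_stage_size(n0, s2)
    print(n0, round(c, 4), n2)   # reproduces rows of Table 9.51: N2 = 136, 33, 18, 9
```

The sensitivity to n_0 discussed above is visible directly: a tiny first-stage sample makes c minuscule and N_2 enormous.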

TABLE 9.51. Classical Fixed-Width Confidence Interval Sample Sizes for Exam-
ple 9.52

n_0   S²_{n_0}   c        N_2
2     3.222653   0.0238   136
3     6.707906   0.2075    33
4     6.616990   0.3793    18
5     5.885602   0.4983    12
6     6.462292   0.5814    12
7     5.625235   0.6416     9
8     5.809566   0.6870     9
9     5.561759   0.7224     9

Example 9.52 (Continuation of Example 9.45; see page 558). Suppose that we
desire a classical sequential confidence interval with half-width 1.96 and coefficient
0.95. We will use the same sequence of data as we had earlier. To correspond more
closely with a classical analysis, we should change to an improper prior. In this
case, the interval based on the first nine observations is the first one to have
posterior probability greater than 0.95, and so this would be the naive sequential
interval. For various values of n_0, we can implement the classical procedure; the
results are summarized in Table 9.51. If n_0 = 7, 8, or 9, we will get essentially
the same result as the naive interval.

9.4 The Relevance of Stopping Rules


In Section 9.1, we introduced sequential decision problems and examined
rules that decide optimally after stopping. In Problem 2 on page 567, the
reader is asked to prove that the formal Bayes rule in a sequential decision
problem decides optimally after stopping. What exactly is the meaning of
deciding optimally after stopping? In words, it means that after the stop-
ping rule says to take no more observations, we then behave exactly as we
would if we had observed whatever data we now have under a fixed sample
size scheme. This is in stark contrast to classical sequential procedures like
the level α SPRT. When the SPRT stops at N = n, the terminal decision is
most definitely not the same as that of a level α test based on a fixed
sample of size n. On the other hand, when the SPRT is viewed as a formal
Bayes rule, it is true that the terminal decision is exactly the same as what
the formal Bayes rule would be after observing a fixed sample of size
n.
In light of Problem 2 on page 567, it makes perfect sense for the Bayesian
statistician to make the same decision after stopping as he or she would have
made had the data arrived in a nonsequential fashion. Why, then, does the
classical statistician behave differently in the two situations? The easiest
way to answer this question is to see what would happen if the classical
statistician tried to use a fixed sample size terminal decision after stopping.

For a simple example, suppose that {X_n}_{n=1}^∞ are IID N(θ, 1) given Θ = θ.
We will consider the problem of testing the hypothesis H : Θ = θ_0 versus
A : Θ ≠ θ_0. Let X̄_n be the average of the first n of the X_i. Given Θ = θ_0,
√n(X̄_n − θ_0) has the N(0, 1) distribution for every n. Hence, for every c,

P_{θ_0}(√n(X̄_n − θ_0) > c) = 1 − Φ(c) > 0.

It follows that

P_{θ_0}(limsup_{n→∞} √n(X̄_n − θ_0) > c) > 0.

However, the event {limsup_{n→∞} √n(X̄_n − θ_0) > c} is in the tail σ-field C,
so the Kolmogorov zero-one law B.68 says

P_{θ_0}(limsup_{n→∞} √n(X̄_n − θ_0) > c) = 1.

Similarly, for every c,

P_{θ_0}(liminf_{n→∞} √n(X̄_n − θ_0) < −c) = 1.

Hence, given Θ = θ_0, for every c, the probability is 1 that there will exist
n such that |√n(X̄_n − θ_0)| > c. Let

N = inf{n : |√n(X̄_n − θ_0)| > c},

where c is the 1 − α/2 quantile of the standard normal distribution. It
follows that N is a stopping time. Suppose that a classical statistician were
to use this stopping time. Suppose that after observing N = n, he or she
were to use a terminal decision that was the usual level α test of H : Θ = θ_0
versus A : Θ ≠ θ_0 based on a sample of size n. This test would be to reject
H if |√n(X̄_n − θ_0)| > c. This person would reject the hypothesis with
probability 1 given Θ = θ_0. This would not be a level α sequential test.
Clearly, from a classical viewpoint, the terminal decision rule has to depend
on which stopping time was used and not just the observed data.
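A small simulation (hypothetical parameters, standard normal data with θ_0 = 0) illustrates the problem: every run that reaches the stopping boundary rejects H at the nominal fixed-sample level even though H is true. A finite cap on n is imposed, so only some runs stop here; the stopped fraction grows to 1 as the cap grows, in line with the zero-one law argument above.

```python
import math, random

def foregone_conclusion(theta0=0.0, alpha=0.05, max_n=5000, reps=200, seed=1):
    """Fraction of runs stopping by max_n, and the rejection rate among them."""
    c = 1.959964   # the 1 - alpha/2 = 0.975 standard normal quantile
    rng = random.Random(seed)
    stopped = rejected = 0
    for _ in range(reps):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(theta0, 1.0)   # data generated under H
            if abs(math.sqrt(n) * (total / n - theta0)) > c:
                stopped += 1
                rejected += 1   # the fixed-sample level-alpha test rejects, by construction
                break
    return stopped / reps, rejected / max(stopped, 1)

frac_stopped, frac_rejected = foregone_conclusion()
print(frac_stopped, frac_rejected)   # frac_rejected is exactly 1.0 among stopped runs
```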
The phenomenon just illustrated is often called "sampling to a foregone
conclusion." That is, the stopping time is designed so that, if a fixed sample
size terminal decision is used upon stopping, the conclusion to be drawn
is determined in advance. A good deal of discussion of this phenomenon
exists in the literature. It is particularly pertinent to clinical trials, in which
researchers would like to stop before the original study plan is finished if
the results seem overwhelming. [See, for example, Cornfield (1966).] The
concern is raised that this might allow unscrupulous researchers to sample
to a foregone conclusion. The methods of sequential analysis are designed
to prevent that, at least in the classical setting, by making the terminal
decision rule depend on which stopping time is used. From a Bayesian point

of view, so long as the stopping time is a function of the observed data, and
one conditions on the observed data, the terminal decision rule should be
whatever would be optimal if that data were observed from a fixed sample
size rule. If this is so, can a Bayesian be tricked into sampling to a foregone
conclusion? The answer is no, if the Bayesian uses a proper prior.⁸
To see why a Bayesian cannot sample to a foregone conclusion,⁹ suppose
that N is a strictly positive,¹⁰ integer-valued random variable that might
equal ∞. Suppose also that N is a function of observable data {X_n}_{n=1}^∞
in such a way that, for every finite n, I_{{n}}(N) is a function of X_1, ..., X_n.
Let

B_n = {(x_1, ..., x_n) : N = n}.

Let Z be a random variable of interest (perhaps the indicator of some subset
of the parameter space or anything else) with finite mean c. Suppose that,
for every n and every (x_1, ..., x_n) ∈ B_n, E(Z | X_1 = x_1, ..., X_n = x_n) >
d ≥ c. (A similar argument works for d ≤ c.) It follows from the law of
total probability B.70 that

E(Z) = Σ_{n=1}^∞ E(E(Z | X_1, ..., X_n) | N = n) Pr(N = n)
       + E(E(Z | X_1, X_2, ...) | N = ∞) Pr(N = ∞).   (9.53)

If we suppose that Pr(N = ∞) = 0, then the right-hand side of (9.53) is
greater than d, and the left-hand side equals c ≤ d, which is a contradiction.
Hence, Pr(N = ∞) > 0. This means that, if a Bayesian does not believe
a priori the conclusion (in this case, that the mean of Z is greater than c),
then it cannot be guaranteed that he or she will believe it after sequential
sampling.
A more useful result is available if Z = I_B(Y) for some random variable
Y. In this case, we can calculate a bound on the conditional probability of
stopping given Y ∉ B. First, note that E(Z) = Pr(Y ∈ B) = c, and rewrite
(9.53) as

Pr(Y ∈ B) ≥ Σ_{n=1}^∞ E(Pr(Y ∈ B | X_1, ..., X_n) | N = n) Pr(N = n).

The right-hand side of this is greater than d Pr(N < ∞). It follows that

⁸The reason that a proper prior is needed is subtle. The argument given below
depends on the law of total probability B.70. Kadane, Schervish, and Seidenfeld
(1996) show that when improper priors are viewed as finitely additive proba-
bilities, sampling to a foregone conclusion is possible because finitely additive
probabilities do not satisfy the law of total probability.
⁹This argument is like one given by Kerridge (1963).
¹⁰We saw in Example 9.3 on page 537 that if Pr(N = 0) > 0, then Pr(N =
0) = 1.

Pr(N < ∞) < c/d. Now, write

Pr(N < ∞ | Y ∉ B) = Pr(Y ∉ B | N < ∞) Pr(N < ∞) / Pr(Y ∉ B).

By design, Pr(Y ∉ B | N < ∞) ≤ 1 − d and Pr(Y ∉ B) = 1 − c. Combining
these results gives

Pr(N < ∞ | Y ∉ B) < c(1 − d)/[d(1 − c)].   (9.54)

The claim of (9.54) was made without proof by Savage (1962) with reference
to a specific example. Cornfield (1966) proves the result in another specific
example. A similar claim, without an explicit bound, was made by Good
(1956).
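The bound (9.54) can be checked by Monte Carlo in a conjugate example (a hypothetical setup chosen for convenience): Θ ~ N(0, 1), X_i | Θ = θ IID N(θ, 1), Y = Θ, and B = {θ > 0}, so that c = 1/2. Sampling continues until Pr(Θ ∈ B | X_1, ..., X_n) > d = 0.9, which here reduces to S_n > Φ^{−1}(d)√(n + 1). The bound gives Pr(N < ∞ | Θ ∉ B) < c(1 − d)/[d(1 − c)] = 1/9; the finite cap on n used below can only make the estimate smaller.

```python
import math, random
from statistics import NormalDist

def stop_fraction_outside_B(d=0.9, reps=2000, max_n=1000, seed=2):
    """Estimate Pr(N < infinity | Theta <= 0), stopping when the posterior of B exceeds d."""
    z = NormalDist().inv_cdf(d)   # posterior P(Theta > 0 | x) > d iff S_n > z * sqrt(n + 1)
    rng = random.Random(seed)
    stopped = 0
    for _ in range(reps):
        theta = -abs(rng.gauss(0.0, 1.0))   # a draw from the prior conditioned on Theta <= 0
        s = 0.0
        for n in range(1, max_n + 1):
            s += rng.gauss(theta, 1.0)
            if s > z * math.sqrt(n + 1):
                stopped += 1
                break
    return stopped / reps

bound = 0.5 * (1 - 0.9) / (0.9 * 0.5)   # c(1 - d)/[d(1 - c)] = 1/9
estimate = stop_fraction_outside_B()
print(estimate, bound)
```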
It should be noted that the proof that a Bayesian cannot sample to a
foregone conclusion involves probabilities calculated under the joint dis-
tribution of all random quantities. For example, if Θ has a continuous
distribution and Z = I_B(Θ) for some set B, it might be the case that for
some θ ∈ B, P_θ(N < ∞) = 1 (see Problem 12 on page 569, for example),
but the set of all such θ must have small prior probability.
This discussion is not meant to say that Bayesians can ignore all stop-
ping rules. All it says is that so long as the stopping rule is a function of
the observed data¹¹ and the Bayesian conditions on the observed data, no
further account need be taken of the stopping rule. An example is given
by Berger and Berry (1988) of a situation in which one or the other of the
two criteria is not met. We give a modified version here. Other examples
are given by Roberts (1967).
Example 9.55. Suppose that {X_n}_{n=1}^∞ are IID Ber(θ) conditional on Θ = θ,
and Θ ∈ {0.49, 0.51}. The observations cannot be taken at will; rather, they arrive
according to a Poisson process with rate λ(θ) conditional on Θ = θ, but the X_i
are independent of the arrival times conditional on Θ. Suppose that λ(0.49) is
one observation per second and λ(0.51) is one observation per hour. The stopping
rule will be to sample for one minute and then stop when the next observation
arrives.
Suppose that we observe N = 60. If we accidentally consider X_1, ..., X_60 to be
the observed data and condition on it alone, we will be ignoring valuable infor-
mation. Specifically, the time it took to observe the 60 values contains valuable
information about Θ. Furthermore, the very fact that 60 observations were ob-
served contains information about Θ, even if we don't know how long it took to

¹¹We mean that, for each n, the event {N = n} must be measurable with
respect to the σ-field generated by X_1, ..., X_n. In a trivial sense, the stopping
time is always a function of the observed data if the observed data are defined
to include all and only those observations with subscripts up to and including
N. If you know that the observed data are X_1, ..., X_n, then you know that
N = n. What is required is that, for each n, if you are merely told the values of
X_1, ..., X_n (but not N), you would be able to figure out whether or not N > n.

get them. Also, the criterion that the stopping rule be a function of the observed
data would not be met in this case, since one cannot tell from looking at the first
60 X_i values that the experiment would stop at N = 60. One also needs to look
at the clock (and possibly the calendar).
This is not to say that one could not make inference based solely on X =
(X_1, ..., X_N) in this case. To obtain the density of X given Θ, we first introduce
{Y_n}_{n=1}^∞, the interarrival times of the Poisson process. Now, {N = n} is a function
of Y_1, ..., Y_n, and we can write (assuming that λ(θ) has units of "observations
per second") the conditional joint density of X and Y = (Y_1, ..., Y_N) given Θ = θ
as

θ^k (1 − θ)^{n−k} λ(θ)^n exp(−λ(θ) t),

where t = Σ_{i=1}^n y_i, n is the observed value of N, and k = Σ_{i=1}^n x_i. We can inte-
grate y out of this to obtain the conditional density of X given Θ.¹² To integrate
out y for fixed n, transform from y_1, ..., y_n to t, y_{n−1}, ..., y_1. The Jacobian is 1,
and the ranges of integration are (with t innermost and y_1 outermost)

t > 60,
0 < y_i < 60 − Σ_{j=1}^{i−1} y_j, for i = n − 1, ..., 2,
0 < y_1 < 60.

The result of integrating out these variables is

(60λ(θ))^{n−1} exp(−60λ(θ)) / (n − 1)!.

So the likelihood is

f_{X|Θ}(x|θ) = θ^k (1 − θ)^{n−k} (60λ(θ))^{n−1} exp(−60λ(θ)) / (n − 1)!.   (9.56)

For example, suppose that the prior for Θ puts probability q on Θ = 0.51.
Then the posterior probabilities of the two values of Θ are in the ratio

f_{Θ|X}(0.51|x)/f_{Θ|X}(0.49|x) = [q/(1 − q)] (2.778 × 10⁻⁴)⁻¹ exp(59.983) (1.0408)^{2k} (2.6689 × 10⁻⁴)^n.

If, for example, n = 60 and k = 30 are the only observed data and q is not
essentially 0 or 1, then the posterior probability of Θ = 0.49 will be essentially

¹²An alternative method for calculating the conditional density of X given Θ is
the following. The conditional joint density of the X_i given Θ = θ and N = n is
still that of n IID Ber(θ) random variables, since N is independent of the values
of the X_i given Θ. The conditional density of N given Θ = θ is

f_{N|Θ}(n|θ) = (60λ(θ))^{n−1} exp(−60λ(θ)) / (n − 1)!,  for n = 1, 2, ....

That is, N is just one plus a Poi(60λ(θ)) random variable. So the likelihood
function for observing X alone would be as given in (9.56).

1, because N = 60 is orders of magnitude more likely when the Poisson process
has rate 1 than when it has rate 2.778 × 10⁻⁴ = λ(0.51). On the other hand, if
t = 216,000 (2.5 days) is also observed, then the posterior probability of Θ = 0.51
is essentially 1.
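These posterior odds are conveniently computed in log space from (9.56); the joint analysis that also uses the elapsed time t replaces the factor (60λ(θ))^{n−1}e^{−60λ(θ)}/(n − 1)! by λ(θ)^n e^{−λ(θ)t}. With q = 1/2, the posterior log odds equal the log likelihood ratio. The sketch below verifies both conclusions quoted above.

```python
import math

LAM = {0.49: 1.0, 0.51: 1.0 / 3600.0}   # lambda(theta), in observations per second

def log_lik_x_only(theta, n, k):
    """log of (9.56): theta^k (1-theta)^(n-k) (60 lam)^(n-1) exp(-60 lam)/(n-1)!."""
    lam = LAM[theta]
    return (k * math.log(theta) + (n - k) * math.log(1 - theta)
            + (n - 1) * math.log(60 * lam) - 60 * lam - math.lgamma(n))

def log_lik_with_t(theta, n, k, t):
    """log joint density of X and the interarrival times: adds lam^n exp(-lam t)."""
    lam = LAM[theta]
    return (k * math.log(theta) + (n - k) * math.log(1 - theta)
            + n * math.log(lam) - lam * t)

n, k = 60, 30
lr_x = log_lik_x_only(0.51, n, k) - log_lik_x_only(0.49, n, k)
lr_xt = log_lik_with_t(0.51, n, k, 216000.0) - log_lik_with_t(0.49, n, k, 216000.0)
print(lr_x, lr_xt)   # lr_x hugely negative (favors 0.49); lr_xt hugely positive (favors 0.51)
```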
A classical statistician could also make inference based solely on X without
much trouble. First notice that

log [f_{X|Θ}(x|0.51)/f_{X|Θ}(x|0.49)] = 0.08k − 8.2287n

plus a constant. If a level α = 0.05 test of H : Θ = 0.49 versus A : Θ = 0.51 is
desired, we need to add up the probabilities (given Θ = 0.49) of all k and n values
with small n and large k until the sum exceeds 0.05. This happens at n = 48 and
k = 29. So the MP level 0.05 test is

φ(n, k) = 1       if n < 48, or if n = 48 and k > 29,
φ(n, k) = 0.3077  if n = 48 and k = 29,
φ(n, k) = 0       otherwise.

If n = 48 and k = 29 are observed and neither prior probability is close to 0, then
the posterior probability of Θ = 0.51 is essentially 0. In fact, the Bayes factor
(likelihood ratio) drops below 1 when N > 6. So, the evidence is more in favor of
the hypothesis than the alternative whenever N > 6 is observed, yet the MP level
0.05 test continues to reject H even when 47 observations have been
observed. The size of the test that rejects H if and only if N ≤ 6 is 6.3 × 10⁻¹⁹.
The type II error probability of this test is about the same.

What happened at the end of Example 9.55 is illustrative of the faulty
reasoning that leads to the choice of a test by its level. The reasoning is
that we wish to protect against the more costly error (type I error), so we
make the probability of type I error small and then choose the test with
the smallest type II error. What happened in Example 9.55 is that the data
are so well able to distinguish the two hypotheses that making the type I
error probability as large as 0.05 makes the type II error probability drop
to 0. This is just the opposite of the effect that was desired.

9.5 Problems


Section 9.1:

1. Take the situation described in Example 9.3 on page 537 and find the
optimal procedure that takes at most three observations. (It is not the
truncation of the rule in Example 9.3 to three observations.) Comment on
why it differs from the optimal rule in Example 9.3 even before the third
observation.
2. Prove that the formal Bayes rule in a sequential decision problem decides
optimally after stopping almost surely.
3. Refer to the setup in Definition 9.2. Let C be the collection of all A ∈ B^∞
such that, for every n, A ∩ {N = n} ∈ B^n. Let Q be the probability on
(V, V) induced by V from μ.

(a) Prove that C is a σ-field.
(b) For each n and each D ∈ V, prove that Q_n(D|x) = Pr(V^{−1}(D) |
X^{−1}(B^n)), a.s.
(c) Prove that, for each D ∈ V, Q_N(D|x) = Pr(V^{−1}(D) | X^{−1}(C)), a.s.
4. Prove that a rule computed via backward induction is regular.
5. Prove Proposition 9.8 on page 541.
6.* Suppose that L' ≥ 0 and lim_{n→∞} E(ρ_0(Q(·|X_1, ..., X_n))) = 0. Let γ_n be
defined by (9.28), where 0 ≤ γ_0 ≤ ρ_0.
(a) Prove that γ_n ≤ ρ_0 for all n.
(b) Prove that
|γ_{n+m}(Q) − γ_n(Q)| ≤ E(|γ_m(Q(·|X_1, ..., X_n)) − γ_0(Q(·|X_1, ..., X_n))|).
(c) Prove that lim_{n→∞} γ_n(Q) converges to some quantity γ*(Q) that satisfies
γ*(Q) = min{ρ_0(Q), E(γ*(Q(·|X_1))) + c}.   (9.57)
(d) Suppose that γ_1* and γ_2* both satisfy (9.57) for all probabilities Q.
Show that
|γ_1*(Q) − γ_2*(Q)| ≤ E(|γ_1*(Q(·|X_1)) − γ_2*(Q(·|X_1))|).
(e) Prove that γ_1* = γ_2*.


Section 9.2:

7. Stein (1946) proves that if Pr(f_1(X_i) ≠ f_0(X_i)) > 0, then there exist
c and ρ < 1 such that Pr(N > n) ≤ cρ^n for the SPRT. Define Z_i =
log[f_1(X_i)/f_0(X_i)].
(a) Prove that Var(Z_i) > 0 implies Pr(f_1(X_i) ≠ f_0(X_i)) > 0.
(b) Find an example in which Pr(f_1(X_i) ≠ f_0(X_i)) > 0 but Var(Z_i) = 0.
(c) Under the conditions of Theorem 9.33, prove that there is a subse-
quence {n_k}_{k=1}^∞ and a c and ρ < 1 such that Pr(N > n_k) ≤ cρ^{n_k}.
8. Prove Proposition 9.40 on page 554.
9. Suppose that {X_n}_{n=1}^∞ are IID Ber(θ) given Θ = θ. Let N be the first n
such that |Σ_{i=1}^n X_i − n/2| ≥ 1. Prove that E_θ(N) = 2/[θ² + (1 − θ)²].

Section 9.3:

10. Let {(X_n, B_n)}_{n=1}^∞ be a sequence of sample spaces, and let f_n, g_n be two
densities on (X_n, B_n) for every n. Let X_n : S → X_n be a random quantity
for every n. Define Z_n = g_n(X_n)/f_n(X_n). Let P be the probability that
says that X_n has density f_n for every n. Let k > 0 and N = inf{n : Z_n ≥
k}. Prove that P(N < ∞) ≤ 1/k.

11.* An alternative to fixed-width confidence intervals is to form a confidence
sequence. A coefficient γ confidence sequence for Θ is a sequence of sets
{R_n}_{n=1}^∞ such that P_θ(θ ∈ R_n, for all n) ≥ γ for all θ. Let {X_n}_{n=1}^∞ be
conditionally IID with N(θ, 1) distribution given Θ = θ. Use the result
from Problem 10 above to find a coefficient γ confidence sequence for Θ.
(Hint: Let X_n in Problem 10 be (X_1, ..., X_n) in this problem. Let P be P_θ,
and let g_n be the prior predictive density of (X_1, ..., X_n) under a suitable
prior for Θ.)

Section 9.4:

12. Suppose that {X_n}_{n=1}^∞ are conditionally IID N(θ, 1) given Θ = θ and that
Θ ~ N(0, 1). Show that, given Θ = 0, for every α the probability is 1 that
there will exist n such that Pr(Θ ≤ 0 | X_1, ..., X_n) > α.
13.* Suppose that {X_n}_{n=1}^∞ are conditionally IID N(θ, 1) given Θ = θ and that
Θ ~ N(θ_0, 1/λ). In this problem, we will prove that, given Θ = θ > θ_0,
the probability is less than 1 that there will exist n such that Pr(Θ ≤
θ_0 | X_1, ..., X_n) > α > 1/2, so long as Pr(Θ ≤ θ_0) < α. In what follows, let
θ > θ_0 and α > 1/2.
(a) Define Z_i = X_i − θ_0 and ψ(h) = E_θ exp(hZ_i). Show that there exists
h < 0 such that ψ(h) = 1.
(b) Let p = Pr(Θ ≤ θ_0) < α, and let S_n = Σ_{i=1}^n Z_i. Prove that Pr(Θ ≤
θ_0 | X_1, ..., X_n) > α is equivalent to S_n ≤ c_n, where c_n = −Φ^{−1}(α)√(λ + n).
(c) Prove that c_n < c_1 < 0 for all n.
(d) Let f_1 be the N(θ − θ_0, 1) density (the conditional density of Z_i
given Θ = θ), and let f_0 be f_1 times exp(hx), where h was found
in part (a). Let b > 0, and consider the SPRT(B, A) for testing the
hypothesis {f_0} against the alternative {f_1} with A = exp(−hb) and
B = exp(−hc_1). Let M_b be the stopping time of this test. Use Theo-
rem 9.35 to show that

P_θ(SPRT(B, A) accepts hypothesis) ≤ [1 − exp(hb)] / [exp(hc_1) − exp(hb)].

(e) Define

N = inf{n : Pr(Θ ≤ θ_0 | X_1, ..., X_n) > α},
M = inf{n : S_n < c_1}.

Show that P_θ(N < ∞) ≤ P_θ(M < ∞).
(f) Prove that

P_θ(M < ∞) = lim_{b→∞} P_θ(SPRT(B, A) accepts hypothesis) < 1.
Appendix A

Measure and Integration Theory

This appendix contains an introduction to the theory of measure and integration.


The first section is an overview. It could serve either as a refresher for those who
have previously studied the material or as an informal introduction for those who
have never studied it.

A.1 Overview
A.1.1 Definitions
In many introductory statistics and probability courses, one encounters discrete
and continuous random variables and vectors. These are all special cases of a
more general type of random quantity that we will study in this text. Before we
can introduce the more general type of random quantity, we need to generalize
the sums and integrals that figure so prominently in the distributions of discrete
and continuous random variables and vectors. The generalization is through the
concept of a measure (to be defined shortly), which is a way of assigning numerical
values to the "sizes" of sets.
Example A.1. Let S be a nonempty set, and let A ⊆ S. Define μ(A) to be
the number of elements of A. Then μ(S) > 0, μ(∅) = 0, and if A_1 ∩ A_2 = ∅,
μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2). Note that μ(A) = ∞ is possible if S has infinitely
many elements. The measure μ described here is called counting measure on S.
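Counting measure on a finite set, together with its finite additivity, can be illustrated in a few lines of code (purely illustrative; the sets and values are arbitrary):

```python
def counting_measure(A):
    """mu(A) = the number of elements of the finite set A."""
    return len(A)

A1, A2 = {1, 2, 3}, {4, 5}
assert A1 & A2 == set()                       # A1 and A2 are disjoint
assert counting_measure(set()) == 0           # mu(empty set) = 0
assert counting_measure(A1 | A2) == counting_measure(A1) + counting_measure(A2)
print(counting_measure(A1 | A2))              # -> 5
```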
Example A.2. Let A be an interval of real numbers. If A is bounded, let μ(A)
be the length of A. If A is unbounded, let μ(A) = ∞. It is easy to see that
μ(ℝ) = ∞,¹ μ(∅) = 0, and if A_1 ∩ A_2 = ∅ and A_1 ∪ A_2 is an interval, then

¹By ℝ, we mean the set of real numbers.



μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2). The measure μ described here is called Lebesgue
measure.
Example A.3. Let f : ℝ → ℝ⁺ be a continuous function.² Define, for each
interval A, μ(A) = ∫_A f(x)dx. Then μ(ℝ) > 0, μ(∅) = 0, and if A_1 ∩ A_2 = ∅ and
A_1 ∪ A_2 is an interval, then μ(A_1 ∪ A_2) = μ(A_1) + μ(A_2).
Since measure will be used to give sizes to sets, the domain of a measure will
be a collection of sets. In general, we cannot assign sizes to all sets, but we need
enough sets so that we can take unions and complements. A collection of sets
that is closed under taking complements and finite unions is called a field. A field
that is closed under taking countable unions is called a σ-field.
Example A.4. Let S be any set. Let A = {S, ∅}. This σ-field is called the trivial
σ-field. As a second example, let A ⊂ S. Let A = {S, A, A^c, ∅}. Let B be another
subset of S, and let A = {S, ∅, A, B, A^c, B^c, A ∩ B, A ∩ B^c, ...}. Such examples
grow rapidly. The largest σ-field is the collection of all subsets of S, called the
power set of S and denoted 2^S.
Example A.5. One field of subsets of ℝ is the collection of all unions of finitely
many disjoint intervals (unbounded intervals are allowed). This collection is not
a σ-field, however.

It is easy to prove that the intersection of an arbitrary collection of σ-fields
is itself a σ-field. Since 2^S is a σ-field, it is easy to see that, for every collection
of subsets C of S, there is a smallest σ-field A that contains C, namely the
intersection of all σ-fields that contain C. This smallest σ-field is called the σ-
field generated by C.
The most commonly used σ-field in this book will be the one generated by the
collection C of open subsets of a topological space.³ This σ-field is called the Borel
σ-field. It is easy to see that the Borel σ-field B¹ for ℝ is the σ-field generated
by the intervals of the form [b, ∞). It is also the σ-field generated by the intervals
of the form (−∞, a] and the σ-field generated by the intervals of the form (a, b).
Since multidimensional Euclidean spaces are topological spaces, they also have
Borel σ-fields.
An alternative way to generate the Borel σ-fields of ℝ^k spaces is by means of
product spaces. The σ-field generated by all product sets (one factor from each
σ-field) in a product space is called the product σ-field. In ℝ^k, the product σ-
field of one-dimensional Borel sets B¹ is the same as the Borel σ-field B^k in the
k-dimensional space (Proposition A.38).
Sometimes, we need to extend ℝ to include points at infinity. The extended real
numbers are the points in ℝ ∪ {∞, −∞}. The Borel σ-field B⁺ of the extended real
numbers consists of B¹ together with all sets of the form B ∪ {∞}, B ∪ {−∞}, and
B ∪ {∞, −∞} for B ∈ B¹. It is easy to check that B⁺ is a σ-field. (See Problem 4
on page 603.)

²By ℝ⁺, we mean the open interval (0, ∞).


³A space X is a topological space if it has a collection V of subsets, called a
topology, which satisfies the following conditions: ∅ ∈ V, X ∈ V, the intersection
of finitely many elements of V is in V, and the union of arbitrarily many elements
of V is in V. The sets in V are called open sets.

If A is a σ-field of subsets of a set S, then a measure μ on S is a function from
A to the nonnegative extended real numbers that satisfies

μ(∅) = 0,
{A_n}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).

If μ is a measure, the triple (S, A, μ) is called a measure space. If (S, A, μ) is a
measure space and μ(S) = 1, then μ is called a probability and (S, A, μ) is called
a probability space.
Some examples of measures were given earlier. The Carathéodory extension
theorem A.22 shows how to construct measures by first defining countably ad-
ditive set functions on fields and then extending them to the generated σ-field.
Lebesgue measure is defined in this manner by starting with length for unions of
disjoint intervals.
Sets with measure zero are ubiquitous in measure theory, so there is a special
term that allows us to refer to them more easily. If E is some statement concerning
the points in S, and μ is a measure on S, we say that E is true almost everywhere
with respect to μ, written a.e. [μ], if the set of s such that E is not true is contained
in a set A with μ(A) = 0. If μ is a probability, then almost everywhere is often
expressed as almost surely and denoted a.s. [μ].
Example A.6. It is well known that a nondecreasing function can have at most
a countable number of discontinuities. Since countable sets have Lebesgue mea-
sure (length) 0, it follows that nondecreasing functions are continuous almost
everywhere with respect to Lebesgue measure.

Infinite measures are difficult to deal with unless they behave like finite mea-
sures in certain important ways. If there exists a countable partition of the set S
such that each element of the partition has finite μ measure, then we say that μ
is σ-finite. When an abstract measure is mentioned in this text, it will generally
be safe to assume that it is σ-finite unless the contrary is clear from context.

A.1.2 Measurable Functions


There are certain types of functions with which we will be primarily concerned.
Suppose that S is a set with a σ-field A of subsets, and let T be another set
with a σ-field C of subsets. Suppose that f : S → T is a function. We say f
is measurable if, for every B ∈ C, f^{−1}(B) ∈ A. When there are several possible
σ-fields of subsets of either S or T, we will need to say explicitly with respect to
which σ-field f is measurable. If f is measurable, one-to-one, and onto, and f^{−1} is
measurable, we say that f is bimeasurable. If the two sets S and T are topological
spaces with Borel σ-fields, a measurable function is called Borel measurable.
As examples, all continuous functions are Borel measurable. But many dis-
continuous functions are also measurable. For example, step functions are mea-
surable. All monotone functions are measurable. In fact, it is very difficult to
describe a nonmeasurable function without using some heavy mathematics.
If S and T are sets, C is a σ-field of subsets of T, and f : S → T is a function,
then it is easy to show that f^{−1}(C) is a σ-field of subsets of S. In fact, it is the
smallest σ-field of subsets of S such that f is measurable, and it is called the
σ-field generated by f.

Some useful properties of measurable functions are in Theorem A.38. To summarize, multivariate functions with measurable coordinates are measurable; compositions of measurable functions are measurable; sums, products, and ratios of measurable functions are measurable; limits, suprema, and infima of sequences of measurable functions are measurable.
As an application of the preceding results, we have Theorem A.42, which says that one function g is a function of another f if and only if g is measurable with respect to the σ-field generated by f.
Many theorems about measurable functions are proven first for a special class of measurable functions called simple functions and then extended to all measurable functions using some limit theorems. A measurable function f is called simple if it assumes only finitely many distinct values. The most fundamental limit theorem is Theorem A.41, which says that every nonnegative measurable function can be approached from below (pointwise) by a sequence of nonnegative simple functions.
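The approximating sequence in Theorem A.41 is constructive: the i-th approximation rounds f down to the dyadic grid of width 2⁻ⁱ and caps it at i. A minimal Python sketch of that construction (the function f(s) = s² and the sample points are illustrative choices, not from the text):

```python
def simple_approx(f, i):
    """i-th simple approximation from the proof of Theorem A.41:
    round f down to the dyadic grid of width 2**-i and cap at i."""
    def f_i(s):
        v = f(s)
        return i if v >= i else int(v * 2 ** i) / 2 ** i
    return f_i

f = lambda s: s * s               # a nonnegative function; illustrative choice
points = [0.0, 0.3, 0.7, 1.0, 1.5]
f3, f4 = simple_approx(f, 3), simple_approx(f, 4)
# Each f_i is simple (finitely many values on a bounded range), and the
# sequence is pointwise increasing and below f: f3 <= f4 <= f.
assert all(f3(s) <= f4(s) <= f(s) for s in points)
print([f3(s) for s in points])    # [0.0, 0.0, 0.375, 1.0, 2.25]
```

Refining the grid (increasing i) only raises each value, which is the monotone approach from below that the theorem asserts.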

A.1.3 Integration
The integral of a function with respect to a measure is a way to generalize the Riemann integral. The interested readers should be able to convince themselves that the integral as defined here is an extension of the Riemann integral. That is, if the Riemann integral of a function over a closed and bounded interval exists, then so does the integral as defined here, and the two are equal. We define the integral in stages. We start with nonnegative simple functions. If f is a nonnegative simple function represented as f(s) = Σ_{i=1}^k a_i I_{A_i}(s), with the a_i distinct and the A_i mutually disjoint, then the integral of f with respect to μ is ∫ f(s)dμ(s) = Σ_{i=1}^k a_i μ(A_i). If 0 times ∞ occurs in such a sum, the result is 0 by convention. The integral of a nonnegative simple function is allowed to be ∞. For general nonnegative measurable functions, we define the integral of f with respect to μ as ∫ f(s)dμ(s) = sup_{g ≤ f, g simple} ∫ g(s)dμ(s). For general functions f, let f⁺(s) = max{f(s), 0} and f⁻(s) = −min{f(s), 0} (the positive and negative parts of f, respectively). Then f(s) = f⁺(s) − f⁻(s). The integral of f with respect to μ is
$$\int f(s)\,d\mu(s) = \int f^+(s)\,d\mu(s) - \int f^-(s)\,d\mu(s),$$
if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that f is integrable if the integral of f is defined and is finite. The integral is defined above in terms of the values of f at all points in S. Sometimes we wish to consider only a subset A ⊆ S. The integral of f over A with respect to μ is
$$\int_A f(s)\,d\mu(s) = \int I_A(s) f(s)\,d\mu(s).$$
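When μ is a discrete measure with finitely many point masses, every function is simple and the staged definition can be carried out directly. A sketch in Python (the masses and test functions are invented for illustration):

```python
# A measure on S = {0, 1, 2, 3} given by point masses mu({s}) = w[s].
w = {0: 0.5, 1: 1.0, 2: 2.0, 3: 0.25}   # illustrative choice of masses

def integral(f, w):
    """Integral of f with respect to the discrete measure w,
    via positive and negative parts: int f = int f+ - int f-."""
    pos = sum(max(f(s), 0.0) * m for s, m in w.items())
    neg = sum(-min(f(s), 0.0) * m for s, m in w.items())
    return pos - neg

def integral_over(A, f, w):
    """Integral of f over A: int I_A(s) f(s) dmu(s)."""
    return integral(lambda s: f(s) if s in A else 0.0, w)

print(integral(lambda s: s, w))               # 0*0.5 + 1*1 + 2*2 + 3*0.25 = 5.75
print(integral_over({1, 2}, lambda s: s, w))  # 1*1 + 2*2 = 5.0
```

The restriction to A is exactly multiplication by the indicator I_A, as in the last display above.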

Several important properties of integrals will be needed in this text. Proposition A.49 and Theorem A.53 state a few of the simpler ones, namely that functions that are almost everywhere equal have the same integral, that the integral of a linear combination of functions is the linear combination of the integrals, that

smaller functions have smaller integrals, and that two integrable functions that have the same integral over every set are equal almost everywhere. Another useful property, given in Theorem A.54, is that a nonnegative integrable function f leads to a new measure ν by means of the equation ν(A) = ∫_A f(s)dμ(s).
The most important theorems concern the interchange of limits with integration. Let {f_n}_{n=1}^∞ be a sequence of measurable functions such that f_n(x) → f(x) a.e. [μ]. The monotone convergence theorem A.52 says that if the f_n are nonnegative and f_n(x) ≤ f(x) a.e. [μ], then
$$\lim_{n \to \infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x). \tag{A.7}$$
The dominated convergence theorem A.57 says that if there exists an integrable function g such that |f_n(x)| ≤ g(x), a.e. [μ], then (A.7) holds.
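With counting measure on the natural numbers, integrals are sums and (A.7) can be watched numerically. A sketch with an invented sequence f_n that increases to f and is dominated by the integrable g(x) = 2⁻ˣ:

```python
# Counting measure on {0, 1, 2, ...}: integrals are sums. The invented sequence
# f_n(x) = (1 - 1/n) * 2**-x increases to f(x) = 2**-x and satisfies
# |f_n| <= g = 2**-x, whose integral (sum) is 2.
def integral(f, terms=60):
    """Truncated sum standing in for the integral d(counting measure)."""
    return sum(f(x) for x in range(terms))

f = lambda x: 2.0 ** (-x)
for n in (1, 10, 100, 1000):
    f_n = lambda x, n=n: (1 - 1 / n) * 2.0 ** (-x)
    print(n, integral(f_n))       # increases toward integral(f) = 2
```

Both hypotheses hold here (monotone and dominated), so either theorem predicts the observed convergence of the integrals.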
Part 1 of Theorem A.38 says that measurable functions into each of two measurable spaces combine into a jointly measurable function. Measures and integration can also be extended from several spaces into the product space. For example, suppose that μ_i is a measure on the space (S_i, A_i) for i = 1, 2. To define a measure on (S₁ × S₂, A₁ ⊗ A₂), we can proceed as follows. For each product set A = A₁ × A₂, define μ₁ × μ₂(A) = μ₁(A₁)μ₂(A₂). The Carathéodory extension theorem A.22 allows us to extend this definition to all of the product space. Lebesgue measure on ℝ², denoted dx dy, is such a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables.
Extending integration to product spaces proceeds through two famous theorems. Tonelli's theorem A.69 says that a nonnegative function f satisfies
$$\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int \left[ \int f(x,y)\,d\mu_1(x) \right] d\mu_2(y) = \int \left[ \int f(x,y)\,d\mu_2(y) \right] d\mu_1(x).$$
Fubini's theorem A.70 says that the same equations hold if f is integrable with respect to μ₁ × μ₂. These results also extend to finite product spaces S₁ × ⋯ × S_n.
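For counting measures on finite sets, Tonelli's equalities reduce to swapping the order of a double sum, which can be checked directly (the function values below are an arbitrary illustration):

```python
# mu1, mu2 = counting measures on the finite sets X and Y; integrals are sums.
X, Y = range(3), range(4)
f = lambda x, y: (x + 1) * (y + 2)   # nonnegative, illustrative choice

joint = sum(f(x, y) for x in X for y in Y)              # d(mu1 x mu2)
y_then_x = sum(sum(f(x, y) for x in X) for y in Y)      # inner dmu1, outer dmu2
x_then_y = sum(sum(f(x, y) for y in Y) for x in X)      # inner dmu2, outer dmu1
assert joint == y_then_x == x_then_y                    # Tonelli's equalities
print(joint)
```

Nonnegativity matters: for functions taking both signs without integrability, rearranging an infinite double sum can change its value, which is why Fubini's theorem adds the integrability hypothesis.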

A.1.4 Absolute Continuity


A special type of relationship between two measures on the same space is called absolute continuity. If μ₁ and μ₂ are two measures on the same space, we say that μ₂ is absolutely continuous with respect to μ₁, denoted μ₂ ≪ μ₁, if μ₁(A) = 0 implies μ₂(A) = 0. When μ₂ ≪ μ₁, we say that μ₁ is a dominating measure for μ₂. Here are some examples:
Example A.8.
• Let f be any nonnegative measurable function and let μ₁ be a measure. Define μ₂(A) = ∫_A f(s)dμ₁(s). (See Theorem A.54.) Then μ₂ ≪ μ₁.
• Let S be the natural numbers and let a₁, a₂, ... be any sequence of nonnegative numbers. Define μ₁ to be counting measure on S, and let μ₂(A) = Σ_{i∈A} a_i. Then μ₂ ≪ μ₁.

• Let μ₁, μ₂, ... be a collection of measures on the same space (S, A). Let a₁, a₂, ... be a collection of positive numbers. Then μ = Σ_{i=1}^∞ a_i μ_i is a measure and μ_i ≪ μ for all i.
The last example above is important because it tells us that for every countable collection of measures, there is a single measure such that all measures in the collection are absolutely continuous with respect to it.
The Radon–Nikodym theorem A.74 says that the first part of Example A.8 is the most general form of absolute continuity with respect to σ-finite measures. That is, if μ₁ is σ-finite and μ₂ ≪ μ₁, then there exists an extended real-valued measurable function f such that μ₂(A) = ∫_A f(x)dμ₁(x). In addition, if g is μ₂-integrable, then ∫ g(x)dμ₂(x) = ∫ g(x)f(x)dμ₁(x). The function f is called the Radon–Nikodym derivative of μ₂ with respect to μ₁ and is usually denoted (dμ₂/dμ₁)(s).
A similar theorem, A.81, relates integrals with respect to measures on two different spaces. It says that a function f : S₁ → S₂ induces a measure on the range S₂. If μ₁ is a measure on S₁, then define μ₂(A) = μ₁(f⁻¹(A)). Integrals with respect to μ₂ can be written as integrals with respect to μ₁ in the following way: ∫ g(y)dμ₂(y) = ∫ g(f(x))dμ₁(x). The measure μ₂ is called the measure induced on S₂ by f from μ₁.
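For discrete measures, the induced-measure identity ∫ g dμ₂ = ∫ g(f) dμ₁ is a finite computation. A sketch (the spaces, masses, and functions are invented for illustration):

```python
# mu1: point masses on S1 = {0, 1, 2, 3}; f maps S1 into S2 = {0, 1}.
mu1 = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
f = lambda s: s % 2

# Induced measure mu2(A) = mu1(f^{-1}(A)), stored as masses on points of S2.
mu2 = {}
for s, m in mu1.items():
    mu2[f(s)] = mu2.get(f(s), 0.0) + m

g = lambda y: 10.0 if y == 1 else 3.0
lhs = sum(g(y) * m for y, m in mu2.items())      # int g dmu2
rhs = sum(g(f(s)) * m for s, m in mu1.items())   # int g(f) dmu1
assert lhs == rhs
print(mu2, lhs)
```

In probability terms, μ₂ is the distribution of the random quantity f, and the identity is the usual change-of-variables rule for expectations.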

A.2 Measures
A measure is a way of assigning numerical values to the "sizes" of sets. The collection of sets whose sizes are given by a measure is a σ-field. (See Examples A.4 and A.5 on page 571.)
Definition A.9. A nonempty collection of subsets A of a set S is called a field if
• A ∈ A implies⁴ Aᶜ ∈ A,
• A₁, A₂ ∈ A implies A₁ ∪ A₂ ∈ A.
A field A is called a σ-field if {A_n}_{n=1}^∞ ∈ A implies ∪_{i=1}^∞ A_i ∈ A.
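On a finite set S every field is automatically a σ-field, and Definition A.9 can be verified by brute force. A sketch (the particular S and collections are invented for illustration):

```python
from itertools import combinations

S = frozenset({1, 2, 3, 4})

def is_field(A, S):
    """Check Definition A.9 for a finite collection A of frozensets:
    nonempty, closed under complement, closed under pairwise union."""
    if not A:
        return False
    closed_comp = all(S - a in A for a in A)
    closed_union = all(a | b in A for a, b in combinations(A, 2))
    return closed_comp and closed_union

# The sigma-field generated by the partition {{1,2}, {3}, {4}}:
A = {frozenset(), frozenset({1, 2}), frozenset({3}), frozenset({4}),
     frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({3, 4}), S}
print(is_field(A, S))                                  # True
print(is_field({frozenset(), S, frozenset({1})}, S))   # False: {2,3,4} missing
```

Closure under complements and finite unions gives closure under intersections as well, by De Morgan's laws.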
Proposition A.10. Let N be an arbitrary set of indices, and let Y = {A_α : α ∈ N} be an arbitrary collection of σ-fields of subsets of a set S. Then ∩_{α∈N} A_α is also a σ-field of subsets of S.
Because of Proposition A.10 and the fact that 2^S is a σ-field, it is easy to see that, for every collection of subsets C of S, there is a smallest σ-field A that contains C, namely the intersection of all σ-fields that contain C.
Definition A.11. Let C be the collection of intervals in ℝ. The smallest σ-field containing C is called the Borel σ-field. In general, if S is a topological space and B is the smallest σ-field that contains all of the open sets, then B is called the Borel σ-field.

⁴The symbol Aᶜ stands for the complement of the set A.



In addition to the Borel σ-field, the product σ-field is also generated by a simple collection of sets.
Definition A.12.
• Let N be an index set, and let {S_α}_{α∈N} be a collection of sets. Define S = ∏_{α∈N} S_α. We call S a product space.
• For each α ∈ N, let A_α be a σ-field of subsets of S_α. Define the product σ-field ⊗_{α∈N} A_α as the smallest σ-field that contains all sets of the form ∏_{α∈N} A_α, where A_α ∈ A_α for all α and all but finitely many A_α are equal to S_α.
In the special case in which N = {1, 2}, we use the notation S = S₁ × S₂, and the product σ-field is denoted A₁ ⊗ A₂.
Proposition A.13.⁵ The Borel σ-field B^k of ℝ^k is the same as the product σ-field of k copies of (ℝ, B¹).
There are other types of collections of sets that are related to σ-fields. Sometimes it is easier to prove results about these other collections and then use the theorems that follow to infer similar results about σ-fields.
Definition A.14. Let S be a set. A collection Π of subsets of S is called a π-system if A, B ∈ Π implies A ∩ B ∈ Π. A collection A is called a λ-system if S ∈ A, A ∈ A implies Aᶜ ∈ A, and {A_n}_{n=1}^∞ ∈ A with A_i ∩ A_j = ∅ for i ≠ j implies ∪_{i=1}^∞ A_i ∈ A.
As in Proposition A.10, the intersection of arbitrarily many π-systems is a π-system, and so too with λ-systems. The following propositions are also easy to prove.
Proposition A.15. If S is a set and C is a collection of subsets of S such that C is a π-system and a λ-system, then C is a σ-field.
Proposition A.16. If S is a set and A is a λ-system of subsets, then A, A ∩ B ∈ A implies A ∩ Bᶜ ∈ A.
The following lemma is the key to a useful uniqueness theorem.
Lemma A.17 (π-λ theorem).⁶ Suppose that Π is a π-system, that A is a λ-system, and that Π ⊆ A. Then the smallest σ-field containing Π is contained in A.
PROOF. Define λ(Π) to be the smallest λ-system containing Π, and define σ(Π) to be the smallest σ-field containing Π. For each A ⊆ S, define G_A to be the collection of all sets B ⊆ S such that A ∩ B ∈ λ(Π).
First, we show that G_A is a λ-system for each A ∈ λ(Π). To see this, note that A ∩ S ∈ λ(Π), so S ∈ G_A. If B ∈ G_A, then A ∩ B ∈ λ(Π), and Proposition A.16 says that A ∩ Bᶜ ∈ λ(Π), so Bᶜ ∈ G_A. Finally, {B_n}_{n=1}^∞ ∈ G_A with the B_n
⁵This proposition is used in the proof of Theorem A.38.
⁶This lemma is used in the proofs of Theorems A.26 and B.46 and Lemma A.61.

disjoint implies that A ∩ B_n ∈ λ(Π) with the A ∩ B_n disjoint, so their union is in λ(Π). But their union is A ∩ (∪_{n=1}^∞ B_n). So ∪_{n=1}^∞ B_n ∈ G_A.
Next, we show that λ(Π) ⊆ G_C for every C ∈ λ(Π). Let A, B ∈ Π, and notice that A ∩ B ∈ Π, so B ∈ G_A. Since G_A is a λ-system containing Π, it must contain λ(Π). It follows that A ∩ C ∈ λ(Π) for all C ∈ λ(Π). If C ∈ λ(Π), it then follows that A ∈ G_C. So Π ⊆ G_C for all C ∈ λ(Π). Since G_C is a λ-system containing Π, it must contain λ(Π).
Finally, if A, B ∈ λ(Π), we just proved that B ∈ G_A, so A ∩ B ∈ λ(Π), and hence λ(Π) is also a π-system. By Proposition A.15, λ(Π) is a σ-field containing Π and hence must contain σ(Π). Since λ(Π) ⊆ A, the proof is complete. □
We are now in a position to give a precise definition of measure.
Definition A.18.
• A pair (S, A), where S is a set and A is a σ-field, is called a measurable space.
• A function μ : A → [0, ∞] is called a measure if
  – μ(∅) = 0,
  – {A_n}_{n=1}^∞ mutually disjoint implies μ(∪_{i=1}^∞ A_i) = Σ_{i=1}^∞ μ(A_i).
• A function μ : A → [−∞, ∞] that satisfies the above two conditions and does not assume both of the values ∞ and −∞ is called a signed measure.⁷
• If μ is a measure, the triple (S, A, μ) is called a measure space.
• If (S, A, μ) is a measure space and μ(S) = 1, then μ is called a probability, and (S, A, μ) is called a probability space.
Some examples of measures were given in Section A.1.
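A measure built from point masses, as in the examples of Section A.1, satisfies the conditions of Definition A.18 by construction; for finitely many disjoint sets the additivity can be checked directly (the masses below are an invented illustration):

```python
w = {1: 0.2, 2: 0.3, 3: 0.5}   # point masses on S = {1, 2, 3}; invented example

def mu(A):
    """mu(A) = total mass of the points in A (a measure by construction)."""
    return sum(m for s, m in w.items() if s in A)

disjoint = [{1}, {2}, {3}]
union = set().union(*disjoint)
assert mu(set()) == 0                              # mu(empty set) = 0
assert mu(union) == sum(mu(A) for A in disjoint)   # additivity over disjoint sets
print(mu(union))   # prints 1.0, so mu is a probability on S
```

Since the total mass is 1, this particular μ makes (S, 2^S, μ) a probability space.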
Theorem A.19.⁸ If (S, A, μ) is a measure space and {A_n}_{n=1}^∞ is a monotone sequence,⁹ then μ(lim_{i→∞} A_i) = lim_{i→∞} μ(A_i) if either of the following holds:
• the sequence is increasing,
• the sequence is decreasing and μ(A₁) < ∞.
PROOF. If the sequence is increasing, then let B₁ = A₁ and B_k = A_k \ A_{k−1} for k > 1.¹⁰ Then {B_n}_{n=1}^∞ are disjoint and the following are true:
$$\bigcup_{i=1}^{k} B_i = A_k, \qquad \bigcup_{i=1}^{\infty} B_i = \lim_{k \to \infty} A_k, \qquad \mu(A_k) = \sum_{i=1}^{k} \mu(B_i),$$

⁷Signed measures will only be used in Section A.6.
⁸This theorem is used in the proofs of Theorems A.50 and B.90 and Lemma A.72.
⁹A sequence of sets {A_n}_{n=1}^∞ is monotone if either A₁ ⊆ A₂ ⊆ ⋯ or A₁ ⊇ A₂ ⊇ ⋯. In the first case, we say that the sequence is increasing and lim_{n→∞} A_n = ∪_{i=1}^∞ A_i. In the second case, we say that the sequence is decreasing and lim_{n→∞} A_n = ∩_{i=1}^∞ A_i.
¹⁰The symbol A \ B is another way of saying A ∩ Bᶜ.

so that
$$\lim_{k \to \infty} \mu(A_k) = \sum_{i=1}^{\infty} \mu(B_i) = \mu\left(\bigcup_{i=1}^{\infty} B_i\right) = \mu\left(\lim_{k \to \infty} A_k\right).$$
If the sequence is decreasing, then let B_i = A_i \ A_{i+1}, for i = 1, 2, .... It follows that
$$A_1 = \left(\lim_{k \to \infty} A_k\right) \cup \left(\bigcup_{i=1}^{\infty} B_i\right),$$
and all of the sets on the right-hand side are disjoint. It follows that
$$A_k = A_1 \setminus \bigcup_{i=1}^{k-1} B_i, \qquad \mu(A_1) = \mu\left(\lim_{k \to \infty} A_k\right) + \sum_{i=1}^{\infty} \mu(B_i), \qquad \mu(A_k) = \mu(A_1) - \sum_{i=1}^{k-1} \mu(B_i),$$
so that
$$\lim_{k \to \infty} \mu(A_k) = \mu(A_1) - \sum_{i=1}^{\infty} \mu(B_i) = \mu\left(\lim_{k \to \infty} A_k\right). \qquad \Box$$

Another useful theorem concerning sequences of sets is the following.
Theorem A.20 (First Borel–Cantelli lemma).¹¹ If Σ_{n=1}^∞ μ(A_n) < ∞, then μ(∩_{i=1}^∞ ∪_{n=i}^∞ A_n) = 0.
PROOF. Let B_i = ∪_{n=i}^∞ A_n and B = ∩_{i=1}^∞ B_i. Since B ⊆ B_i for each i, it follows that μ(B) ≤ μ(B_i) for all i. Since μ(B_i) ≤ Σ_{n=i}^∞ μ(A_n), it follows that lim_{i→∞} μ(B_i) = 0. Hence μ(B) = 0. □
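When μ is a probability, the conclusion says that with probability 1 only finitely many of the A_n occur. A Monte Carlo sketch (an invented illustration, not from the text) with A_n = {U_n < 1/n²} for independent uniforms U_n, so that Σ μ(A_n) = Σ 1/n² < ∞:

```python
import random

random.seed(0)  # fixed seed for a reproducible illustration

def occurrences(n_events=10_000):
    """One sample path: how many events A_n = {U_n < 1/n**2} occur,
    where the U_n are independent uniforms on [0, 1)."""
    return sum(random.random() < 1 / n ** 2 for n in range(1, n_events + 1))

counts = [occurrences() for _ in range(200)]
# Each path sees only finitely many A_n; the expected number of occurrences
# is about sum 1/n**2, roughly 1.64.
print(max(counts), sum(counts) / len(counts))
```

Contrast with A_n = {U_n < 1/n}: there the measures sum to ∞, and (for independent events, by the second Borel–Cantelli lemma mentioned in the footnote) infinitely many A_n occur almost surely.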
Theorem A.22 below is used in several places for extending measures defined on a field to the smallest σ-field containing the field. A definition is required first.
Definition A.21. Let S be a set, A a collection of subsets of S, and μ : A → ℝ ∪ {∞} a set function. Suppose that S = ∪_{i=1}^∞ A_i with μ(A_i) < ∞ for each i. Then we say μ is σ-finite. If μ is a σ-finite measure on (S, A), then (S, A, μ) is called a σ-finite measure space.
The proof of Theorem A.22 is adapted from Royden (1968).
Theorem A.22 (Carathéodory extension theorem).¹² Let μ be a set function defined on a field C of subsets of a set S that is σ-finite, nonnegative, extended
¹¹This theorem is used in the proofs of Lemma A.72 and Theorems B.90 and 1.61. There is a second Borel–Cantelli lemma, which involves probability measures, but we will not use it in this text. See Problem 20 on page 663. The set whose measure is the subject of this theorem is sometimes called A_n infinitely often because it is the set of points that are in infinitely many of the A_n.
¹²This theorem is used to prove the existence of many common measures (including product measure) and in the proofs of Lemma A.24 and of Theorems B.118, B.131, and B.133.

real-valued, and countably additive and satisfies μ(∅) = 0. Then there is a unique extension of μ to a measure on a measure space¹³ (S, A, μ*). (That is, C ⊆ A and μ(A) = μ*(A) for all A ∈ C.)
PROOF. The proof will proceed as follows. First, we will define μ* and A. Then we will show that μ* is monotone and subadditive, that C ⊆ A, that A is a σ-field, that μ* is countably additive on A, that μ* extends μ, and finally that μ* is the unique extension.
For each B ∈ 2^S, define
$$\mu^*(B) = \inf \sum_{i=1}^{\infty} \mu(A_i), \tag{A.23}$$
where the inf is taken over all {A_i}_{i=1}^∞ such that B ⊆ ∪_{i=1}^∞ A_i and A_i ∈ C for all i. Let
$$\mathcal{A} = \left\{ A : \mu^*(C) \geq \mu^*(C \cap A) + \mu^*(C \cap A^c) \text{ for all } C \in 2^S \right\}.$$
First, we show that μ* is monotone and subadditive. Clearly, μ*(A) ≤ μ(A) for all A ∈ C, and B₁ ⊆ B₂ implies μ*(B₁) ≤ μ*(B₂). It is also easy to see that μ*(B₁ ∪ B₂) ≤ μ*(B₁) + μ*(B₂) for all B₁, B₂ ∈ 2^S. In fact, if {B_n}_{n=1}^∞ ∈ 2^S, then μ*(∪_{i=1}^∞ B_i) ≤ Σ_{i=1}^∞ μ*(B_i). The proof is to notice that the collection of numbers whose inf is μ* of the union includes all of the sums of the numbers whose infima are the μ* values being added together.
Next, we show that C ⊆ A. Let A ∈ C and C ∈ 2^S. Since μ* is subadditive, we only need to show that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ). If μ*(C) = ∞, this is clearly true. So let μ*(C) < ∞. From the definition of μ*, for every ε > 0, there exists a collection {A_i}_{i=1}^∞ of elements of C such that Σ_{i=1}^∞ μ(A_i) < μ*(C) + ε. Since μ(A_i) = μ(A_i ∩ A) + μ(A_i ∩ Aᶜ) for every i, we have
$$\mu^*(C) + \epsilon > \sum_{i=1}^{\infty} \mu(A_i \cap A) + \sum_{i=1}^{\infty} \mu(A_i \cap A^c) \geq \mu^*(C \cap A) + \mu^*(C \cap A^c).$$
Since this is true for every ε > 0, it must be that μ*(C) ≥ μ*(C ∩ A) + μ*(C ∩ Aᶜ); hence A ∈ A.
Next, we show that A is a σ-field. It is clear that ∅ ∈ A, and A ∈ A implies Aᶜ ∈ A by the symmetry in the definition of A. Let A₁, A₂ ∈ A and C ∈ 2^S. We can write
$$\mu^*(C) = \mu^*(C \cap A_1) + \mu^*(C \cap A_1^c) = \mu^*(C \cap A_1) + \mu^*(C \cap A_1^c \cap A_2) + \mu^*(C \cap A_1^c \cap A_2^c) \geq \mu^*(C \cap [A_1 \cup A_2]) + \mu^*(C \cap [A_1 \cup A_2]^c),$$
where the first two equalities follow from A₁, A₂ ∈ A, and the last inequality follows from the subadditivity of μ*. So A₁ ∪ A₂ ∈ A. Let {A_n}_{n=1}^∞ ∈ A; then we can write
¹³The usual statement of this theorem includes the additional claim that the measure space (S, A, μ*) is complete. A measure space is complete if every subset of every set with measure 0 is in the σ-field.

A = ∪_{i=1}^∞ A_i = ∪_{i=1}^∞ B_i, where each B_i ∈ A and the B_i are disjoint. (This just makes use of complements and finite unions of elements of A being in A.) Let D_n = ∪_{i=1}^n B_i and C ∈ 2^S. Since Aᶜ ⊆ D_nᶜ and D_n ∈ A for each n, we have
$$\mu^*(C) = \mu^*(C \cap D_n) + \mu^*(C \cap D_n^c) \geq \mu^*(C \cap D_n) + \mu^*(C \cap A^c) = \sum_{i=1}^{n} \mu^*(C \cap B_i) + \mu^*(C \cap A^c).$$
Since this is true for every n,
$$\mu^*(C) \geq \sum_{i=1}^{\infty} \mu^*(C \cap B_i) + \mu^*(C \cap A^c) \geq \mu^*(C \cap A) + \mu^*(C \cap A^c),$$
where the last inequality follows from subadditivity. So A is a σ-field.
Next, we show that μ* is countably additive when restricted to A. If A₁, A₂ are disjoint elements of A, then A₁ = (A₁ ∪ A₂) ∩ A₁ and A₂ = (A₁ ∪ A₂) ∩ A₁ᶜ. It follows that
$$\mu^*(A_1 \cup A_2) = \mu^*(A_1) + \mu^*(A_2).$$
By induction, μ* is finitely additive on A. Let A = ∪_{i=1}^∞ A_i, where each A_i ∈ A and the A_i are disjoint. Since ∪_{i=1}^n A_i ⊆ A, we have, for every n, μ*(A) ≥ Σ_{i=1}^n μ*(A_i), which implies μ*(A) ≥ Σ_{i=1}^∞ μ*(A_i). By subadditivity, we get the reverse inequality; hence μ* is countably additive on A.
Next, we prove that μ* extends μ. Let B ∈ C and let {A_n}_{n=1}^∞ ∈ C be disjoint and such that B ⊆ ∪_{n=1}^∞ A_n. Then B = ∪_{n=1}^∞ (A_n ∩ B), and Σ_{n=1}^∞ μ(A_n) ≥ Σ_{n=1}^∞ μ(A_n ∩ B) = μ(B), since μ is countably additive on C. Taking the inf over all such coverings gives μ*(B) ≥ μ(B); since μ*(B) ≤ μ(B) was noted above, μ* extends μ.
To prove uniqueness, suppose that μ' also extends μ to A. Then μ'(B) ≤ Σ_{n=1}^∞ μ(A_n) whenever B ⊆ ∪_{n=1}^∞ A_n with each A_n ∈ C. Hence, μ'(B) ≤ μ*(B) for all B ∈ A. If there exists B such that μ'(B) < μ*(B), let {A_n}_{n=1}^∞ ∈ C be disjoint and such that μ(A_n) < ∞ and ∪_{n=1}^∞ A_n = S. Then there exists n such that μ'(B ∩ A_n) < μ*(B ∩ A_n). Since μ'(A_n) = μ*(A_n) < ∞, it must be that μ'(Bᶜ ∩ A_n) > μ*(Bᶜ ∩ A_n), but this is a contradiction. □
Here are some examples:
• Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = b − a for intervals, and extend μ to finite unions of disjoint intervals by addition. Theorem A.22 will extend μ to the σ-field B. This measure is called Lebesgue measure on the real line.
• Let F be any monotone increasing function on ℝ which is continuous from the right. Let S = ℝ and let B be the Borel σ-field. Define μ((a, b]) = F(b) − F(a). This can be extended to all of B. In particular, if F is a CDF, then μ is a probability.
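The second example is the Lebesgue–Stieltjes construction, and for half-open intervals it is a one-line computation. A sketch in Python; the particular CDF (standard exponential) is an illustrative choice, not from the text:

```python
import math

# A CDF, nondecreasing and continuous from the right: F(x) = 1 - e^{-x} for x > 0.
F = lambda x: 0.0 if x <= 0 else 1.0 - math.exp(-x)

def mu(a, b):
    """mu((a, b]) = F(b) - F(a), the Lebesgue-Stieltjes measure of (a, b]."""
    return F(b) - F(a)

# (0, 2] split as (0, 1] and (1, 2]: finite additivity over abutting intervals.
assert abs(mu(0, 2) - (mu(0, 1) + mu(1, 2))) < 1e-15
print(mu(0, float("inf")))   # prints 1.0: total mass 1, so mu is a probability
```

Right-continuity of F is exactly what makes this assignment countably additive on the field of finite disjoint unions of such intervals, as discussed below.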
In the examples above, the claim was made that μ could be extended to the Borel σ-field. To do this by way of the Carathéodory extension theorem A.22, we need μ to be defined on a field, countably additive, and σ-finite. For the cases described above, this can be arranged as follows. Suppose that μ is defined on

intervals of the form (a, b] with a = −∞ and/or b = ∞ possible.¹⁴ The collection C of all unions of finitely many disjoint intervals of this form is easily seen to be a field. If (a₁, b₁], ..., (a_n, b_n] are mutually disjoint, set
$$\mu\left(\bigcup_{i=1}^{n} (a_i, b_i]\right) = \sum_{i=1}^{n} \mu((a_i, b_i]).$$
It is not hard to see that this extension of μ to C is well defined. This means that if ∪_{i=1}^n (a_i, b_i] = ∪_{i=1}^m (c_i, d_i], where (c₁, d₁], ..., (c_m, d_m] are also mutually disjoint, then Σ_{i=1}^n μ((a_i, b_i]) = Σ_{i=1}^m μ((c_i, d_i]). If μ is finite for every interval, then it is σ-finite. To see that μ is countably additive on C, suppose that μ((a, b]) = F(b) − F(a), where F is nondecreasing and continuous from the right. If {(a_n, b_n]}_{n=1}^∞ is a sequence of disjoint intervals and (a, b] is an interval such that ∪_{n=1}^∞ (a_n, b_n] ⊆ (a, b], then it is not difficult to see that Σ_{n=1}^∞ μ((a_n, b_n]) ≤ μ((a, b]). If (a, b] ⊆ ∪_{n=1}^∞ (a_n, b_n], we can also prove that Σ_{n=1}^∞ μ((a_n, b_n]) ≥ μ((a, b]) (see Problem 7 on page 603). Together these facts will imply that μ is countably additive on C.
The proof of Theorem A.22 leads us to the following useful result. Its proof is adapted from Halmos (1950).
Lemma A.24.¹⁵ Let (S, A, μ) be a σ-finite measure space. Suppose that C is a field such that A is the smallest σ-field containing C. Then, for every A ∈ A and ε > 0, there is C ∈ C such that μ(CΔA) < ε.¹⁶

PROOF. Clearly, μ and C satisfy the conditions of Theorem A.22, so that μ is equal to the μ* in the proof of that theorem. Let A ∈ A and ε > 0 be given. It follows from (A.23) that there exists a sequence {A_i}_{i=1}^∞ in C such that A ⊆ ∪_{i=1}^∞ A_i and
$$\mu(A) > \sum_{i=1}^{\infty} \mu(A_i) - \frac{\epsilon}{2}.$$
Since μ is countably additive,
$$\mu\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} \mu(A_i) < \infty,$$
so that there exists n such that
$$\mu\left(\bigcup_{i=n+1}^{\infty} A_i\right) < \frac{\epsilon}{2}.$$
Let C = ∪_{i=1}^n A_i, which is clearly in C. Now
$$\mu(C \cap A^c) \leq \mu\left(\bigcup_{i=1}^{\infty} A_i\right) - \mu(A) < \frac{\epsilon}{2}.$$
¹⁴If b = ∞, we mean (a, ∞) by (a, b]. That is, we do not intend ∞ to be a point in the space S.
¹⁵This lemma is used in the proof of the Kolmogorov zero-one law B.68.
¹⁶The symbol Δ here refers to the symmetric difference operator on pairs of sets. We define CΔA to be (C ∩ Aᶜ) ∪ (Cᶜ ∩ A).

Similarly,
$$\mu(A \cap C^c) \leq \mu\left(\bigcup_{i=n+1}^{\infty} A_i\right) < \frac{\epsilon}{2}.$$
It now follows that μ(AΔC) < ε. □


Sets with measure zero are ubiquitous in measure theory, so there is a special definition that allows us to refer to them more easily.
Definition A.25. Let E be some statement concerning the points in S such that for each point s ∈ S, E is either true or false but not both. Suppose that there exists a set A ∈ A such that μ(A) = 0 and that for all s ∈ Aᶜ, E is true. Then we say that E is true almost everywhere with respect to μ, written a.e. [μ]. If μ is a probability, then almost everywhere is often expressed as almost surely and denoted a.s. [μ].
The following theorem implies uniqueness of measures with certain properties.
Theorem A.26.¹⁷ Suppose that μ₁ and μ₂ are measures on (S, A) and A is the smallest σ-field containing the π-system Π. If μ₁ and μ₂ are both σ-finite on Π and they agree on Π, then they agree on A.
PROOF. First, let C ∈ Π be such that μ₁(C) = μ₂(C) < ∞, and define G_C to be the collection of all B ∈ A such that μ₁(B ∩ C) = μ₂(B ∩ C). Using simple properties of measures, we see that G_C is a λ-system that contains Π, hence it equals A by Lemma A.17. (For example, if B ∈ G_C,
$$\mu_1(B^c \cap C) = \mu_1(C) - \mu_1(B \cap C) = \mu_2(C) - \mu_2(B \cap C) = \mu_2(B^c \cap C),$$
so Bᶜ ∈ G_C.)
Next, if μ₁ and μ₂ are not finite, there exists a sequence {C_n}_{n=1}^∞ ∈ Π such that μ₁(C_n) = μ₂(C_n) < ∞ and S = ∪_{n=1}^∞ C_n. (Since Π is only a π-system, we cannot assume that the C_n are disjoint.) For each A ∈ A,
$$\mu_j(A) = \lim_{n \to \infty} \mu_j\left(\bigcup_{i=1}^{n} [C_i \cap A]\right), \qquad j = 1, 2,$$
by Theorem A.19. Since μ_j(∪_{i=1}^n [C_i ∩ A]) can be written as a linear combination of values of μ_j at sets of the form A ∩ C, where C ∈ Π is the intersection of finitely many of C₁, ..., C_n, it follows from A ∈ G_C that μ₁(∪_{i=1}^n [C_i ∩ A]) = μ₂(∪_{i=1}^n [C_i ∩ A]) for all n; hence μ₁(A) = μ₂(A). □

A.3 Measurable Functions


There are certain types of functions with which we will be primarily concerned.

¹⁷This theorem is used in the proofs of Theorems B.32, B.46, B.118, B.131, and 1.115, Lemma A.64, and Corollary B.44.

Definition A.27. Suppose that S is a set with a σ-field A of subsets, and let T be another set with a σ-field C of subsets. Suppose that f : S → T is a function. We say f is measurable if for every B ∈ C, f⁻¹(B) ∈ A. If f is measurable, one-to-one, and onto and f⁻¹ is measurable, we say that f is bimeasurable. If T = ℝ, the real numbers, and C = B, the Borel σ-field, then if f is measurable, we say that f is Borel measurable.
Proposition A.28. Suppose that (S, A) and (T, C) are measurable spaces. Suppose that f : S → T is a function.
• If A = 2^S, then f is measurable.
• If C = {T, ∅}, then f is measurable.
• If A = {S, ∅}, {y} ∈ C for every y ∈ T, and f is measurable, then f is constant.
As examples, if S = T = ℝ and A = B is the Borel σ-field, then all continuous functions are measurable. But many discontinuous functions are also measurable. For example, step functions are measurable. All monotone functions are measurable. In fact, it is very difficult to describe a nonmeasurable function without using some heavy mathematics.
The following theorems make it easier to show that a function is measurable.
Theorem A.29.¹⁸ Let N, S, and T be arbitrary sets. Let {A_α : α ∈ N} be a collection of subsets of T, and let A be an arbitrary subset of T. Let f : S → T be a function. Then
$$f^{-1}\left(\bigcup_{\alpha \in N} A_\alpha\right) = \bigcup_{\alpha \in N} f^{-1}(A_\alpha), \qquad f^{-1}\left(\bigcap_{\alpha \in N} A_\alpha\right) = \bigcap_{\alpha \in N} f^{-1}(A_\alpha), \qquad f^{-1}(A^c) = f^{-1}(A)^c.$$
PROOF. For the union, if s ∈ f⁻¹(∪_{α∈N} A_α), then f(s) ∈ ∪_{α∈N} A_α, hence there exists α such that f(s) ∈ A_α, so s ∈ f⁻¹(A_α) and s ∈ ∪_{α∈N} f⁻¹(A_α). If s ∈ ∪_{α∈N} f⁻¹(A_α), then there exists α such that s ∈ f⁻¹(A_α), hence f(s) ∈ A_α, hence f(s) ∈ ∪_{α∈N} A_α, hence s ∈ f⁻¹(∪_{α∈N} A_α). This proves the first equality. The second is almost identical in that "there exists α" is merely replaced by "for all α" in the above proof. For the complement, if s ∈ f⁻¹(Aᶜ), then f(s) ∈ Aᶜ and f(s) ∉ A. Hence, s ∉ f⁻¹(A) and s ∈ f⁻¹(A)ᶜ. If s ∈ f⁻¹(A)ᶜ, then s ∉ f⁻¹(A) and f(s) ∉ A. So f(s) ∈ Aᶜ and s ∈ f⁻¹(Aᶜ). □
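The identities of Theorem A.29 can be spot-checked on finite sets. A small sketch with an invented function f:

```python
S = set(range(10))
T = set(range(5))
f = lambda s: s % 5   # an arbitrary function from S to T, for illustration

def preimage(A):
    """f^{-1}(A) = {s in S : f(s) in A}."""
    return {s for s in S if f(s) in A}

A1, A2 = {0, 1}, {1, 2, 3}
assert preimage(A1 | A2) == preimage(A1) | preimage(A2)   # unions
assert preimage(A1 & A2) == preimage(A1) & preimage(A2)   # intersections
assert preimage(T - A1) == S - preimage(A1)               # complements
print("identities hold")
```

Note that forward images do not behave this well: f(A ∩ B) can be a proper subset of f(A) ∩ f(B), which is why preimages, not images, are used to define measurability.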

Corollary A.30.¹⁹ If S and T are sets and C is a σ-field of subsets of T and f : S → T is a function, then f⁻¹(C) is a σ-field of subsets of S. In fact, it is the smallest σ-field of subsets of S such that f is measurable.

¹⁸This theorem is used in the proof of Theorem A.34.
¹⁹This corollary is used in the proof of Theorem A.42, and it is used to define the σ-field generated by a function.

Definition A.31. The σ-field f⁻¹(C) in Corollary A.30 is called the σ-field generated by f.
A measurable function also generates a σ-field of subsets of its image.
Proposition A.32. Let (T, C) be a measurable space. Let U ⊆ T be arbitrary (possibly not even in C). Define C* = {U ∩ B : B ∈ C}. Then C* is a σ-field of subsets of U.
Definition A.33. The σ-field C* in Proposition A.32 is called the restriction of the σ-field C to U. If f : S → T and U = f(S), then C* is called the image σ-field of f.
Theorem A.34.²⁰ Let (S, A) be a measurable space and let f : S → T be a function. Let C₀ be a nonempty collection of subsets of T, and let C be the smallest σ-field that contains C₀. If f⁻¹(C₀) ⊆ A, then f⁻¹(C) ⊆ A.
PROOF. Let C₂ be the collection of all subsets B of T such that f⁻¹(B) ∈ A. By assumption, C₀ ⊆ C₂. We will now prove that C₂ is a σ-field; hence it must contain C, which implies the conclusion of the theorem. Clearly, C₂ is nonempty, since C₀ is nonempty. Let A ∈ C₂. Theorem A.29 implies f⁻¹(Aᶜ) = f⁻¹(A)ᶜ ∈ A, since A is a σ-field. This means that Aᶜ ∈ C₂. Let A₁, A₂, ... ∈ C₂. Then Theorem A.29 implies
$$f^{-1}\left(\bigcup_{i=1}^{\infty} A_i\right) = \bigcup_{i=1}^{\infty} f^{-1}(A_i) \in \mathcal{A},$$
since A is a σ-field. So C₂ is a σ-field. □
To use this theorem to show that a function f : S → T is measurable when T has a σ-field of subsets C, we can find a smaller collection of subsets C₀ such that C is the smallest σ-field containing C₀ and prove that f⁻¹(C₀) ⊆ A. Theorem A.34 would then imply f⁻¹(C) ⊆ A and f is measurable. As an example, consider the next lemma.
Lemma A.35.²¹ Let (S, A) be a measurable space, and let f : S → ℝ be a function. Then f is measurable if and only if f⁻¹((b, ∞)) ∈ A for all b ∈ ℝ.
PROOF. The "only if" part is trivial. For the "if" part, let C₀ be the collection of all subsets of ℝ of the form (b, ∞). The smallest σ-field containing these is the Borel σ-field B, so f⁻¹(B) ⊆ A by Theorem A.34. □
There are versions of Lemma A.35 that apply to intervals of the form (−∞, a] and those of the form (a, b), and so on. Similarly, there is a version for general topological spaces.
Proposition A.36.²² Let (S, A) be a measurable space, and let (T, C) be a topological space with Borel σ-field. Then f : S → T is measurable if and only if f⁻¹(C) ∈ A for all open C (or for all closed C).

²⁰This theorem is used in the proofs of Lemma A.35, Proposition A.36, Corollary A.37, Theorems A.38, B.75, and B.133, and to prove that stochastic processes are measurable.
²¹This lemma is used in the proofs of Theorems A.38 and A.74.
²²This proposition is used in the proof of Theorem A.38.

Another example of the use of Theorem A.34 is the proof that all continuous functions are measurable. The result follows because the Borel σ-field is the smallest σ-field containing the open sets.
Corollary A.37. Let (S, A) and (T, B) be topological spaces with their Borel σ-fields. If f : S → T is continuous, then f is measurable.
Here are some properties of measurable functions that will prove useful.
Theorem A.38. Let (S, A) be a measurable space.
1. Let N be an index set, and let {(T_α, C_α)}_{α∈N} be a collection of measurable spaces. For each α ∈ N, let f_α : S → T_α be a function. Define f : S → ∏_{α∈N} T_α by f(s) = {f_α(s)}_{α∈N}. Then f is measurable (with respect to the product σ-field) if and only if each f_α is measurable.
2. If (V, C₁) and (U, C₂) are measurable spaces and f : S → V and g : V → U are measurable, then g(f) : S → U is measurable.
3. Let f and g be measurable functions from S to ℝⁿ, let a be a constant scalar, and let b ∈ ℝⁿ be constant. Then the following functions are also measurable: f + g and af + b. If n = 1, then fg and f/g are also measurable, where f/g can be set equal to an arbitrary constant when g = 0.
4. If, for each n, f_n is a measurable, extended real-valued function, then sup_n f_n, inf_n f_n, lim sup_n f_n, and lim inf_n f_n are all measurable.
5. Let (T, C) be a metric space with Borel σ-field. If f_k : S → T is a measurable function for each k = 1, 2, ... and lim_{k→∞} f_k(s) = f(s) for all s, then f is measurable.
6. Let (T, C) be a metric space with Borel σ-field, and let μ be a measure on (S, A). If f_k : S → T is a measurable function for each k = 1, 2, ... and lim_{k→∞} f_k(s) exists a.e. [μ], then there is a measurable f : S → T such that lim_{k→∞} f_k(s) = f(s), a.e. [μ].
PROOF. (1) Suppose that f is measurable. To show that f_α is measurable, let B_α ∈ C_α and let B_β = T_β for β ≠ α. Set C = ∏_{β∈N} B_β, which is in the product σ-field, because all but finitely many B_β equal the entire space T_β. Then f_α⁻¹(B_α) = f⁻¹(C). Since f is measurable, f⁻¹(C) ∈ A. Now, suppose that each f_α is measurable, and let B = ∏_{α∈N} B_α, with B_α ∈ C_α for all α and all but finitely many B_α (say B_{α₁}, ..., B_{α_n}) equal to T_α. Then f⁻¹(B) = ∩_{i=1}^n f_{α_i}⁻¹(B_{α_i}) ∈ A. Since the sets of the form B generate the product σ-field, f⁻¹(B) ∈ A for all B in the product σ-field according to Theorem A.34.
(2) Let A ∈ C₂. We need to prove that g(f)⁻¹(A) ∈ A. First, note that g(f)⁻¹ = f⁻¹(g⁻¹). Since g is measurable, g⁻¹(A) ∈ C₁. Since f is measurable, f⁻¹(g⁻¹(A)) ∈ A. So g(f)⁻¹(A) ∈ A.
(3) The arithmetic parts of the theorem are all similar. They all follow from parts 2 and 1. For example, h(x, y) = x + y is a measurable function from ℝ² to ℝ, so h(f, g) = f + g is measurable. For the quotient, a little more care is needed. Let h(x, y) = x/y when y ≠ 0 and let it be an arbitrary constant when y = 0. Then h is measurable since {(x, y) : y = 0} is in B². It follows that h(f, g) is measurable.
(4) Let f = sup_n f_n. Then, for each finite b, {s : f(s) ≤ b} = ∩_{n=1}^∞ {s : f_n(s) ≤ b} ∈ A. Also {s : f(s) = −∞} = ∩_{n=1}^∞ {s : f_n(s) = −∞} ∈ A, and

{s : f(s) = ∞} = ∩_{i=1}^∞ ∪_{n=1}^∞ {s : f_n(s) > i} ∈ A. Similar arguments work for inf. Since lim sup_n f_n = inf_k sup_{n≥k} f_n and lim inf_n f_n = sup_k inf_{n≥k} f_n, these are also measurable.
(5) Let $d$ be the metric on $T$. For each closed set $C \in \mathcal{C}$ and each $m$, let $C_m = \{t : d(t,C) < 1/m\}$. For each closed $C$, define
$$A(C) = \bigcap_{m=1}^\infty \bigcup_{n=1}^\infty \bigcap_{k=n}^\infty f_k^{-1}(C_m). \qquad (A.39)$$
It is easy to see that $A(C) \in \mathcal{A}$ and that $A(C)$ is the set of all $s$ such that $\lim_{n\to\infty} f_n(s) \in C$ (using that $C$ is closed). Obviously, $f^{-1}(C)$ consists of those $s$ such that $\lim_{n\to\infty} f_n(s) \in C$. Hence, $f^{-1}(C) = A(C) \in \mathcal{A}$, and Proposition A.36 says that $f$ is measurable.
(6) Let $G = \{s : \lim_{k\to\infty} f_k(s) \text{ does not exist}\}$, and let $G \subseteq C$ with $\mu(C) = 0$. Let $t \in T$, and define $f(s) = t$ for $s \in C$ and $f(s) = \lim_{k\to\infty} f_k(s)$ for $s \in C^c$. Apply part 5 to the restrictions of the functions $\{f_k\}_{k=1}^\infty$ to $C^c$ to conclude that $f$ restricted to $C^c$ (call the restriction $g$) is measurable. If $A \in \mathcal{C}$, then $f^{-1}(A) = g^{-1}(A) \in \mathcal{A}$ if $t \notin A$ and $f^{-1}(A) = g^{-1}(A) \cup C \in \mathcal{A}$ if $t \in A$. So $f$ is measurable.
$\square$
Part 6 is particularly useful in that it allows us to treat the limit of a sequence
of measurable functions as a measurable function even if the limit only exists
almost everywhere. This is only useful, however, if we can show that functions
that are equal almost everywhere have similar properties.
Many theorems about measurable functions are proven first for a special class of
measurable functions called simple functions and then extended to all measurable
functions using some limit theorems.
Definition A.40. A measurable function f is called simple if it assumes only
finitely many distinct values.
A simple function is often expressed in terms of its values. Let $f$ be a simple function taking values in $\mathbb{R}^n$ for some $n$. Suppose that $\{a_1, \dots, a_k\}$ are the distinct values assumed by $f$, and let $A_i = f^{-1}(\{a_i\})$. Then $f(s) = \sum_{i=1}^k a_i I_{A_i}(s)$.
The most fundamental limit theorem is the following.
Theorem A.41. If $f$ is a nonnegative measurable function, then there exists a sequence of simple functions $\{f_i\}_{i=1}^\infty$ such that for all $s \in S$, $f_i(s) \uparrow f(s)$.
PROOF. For $k = 1, \dots, i2^i$, let $A_{k,i} = \{s : (k-1)/2^i \le f(s) < k/2^i\}$. Define $A_{0,i} = \{s : f(s) \ge i\}$. Then $A_{0,i}, A_{1,i}, \dots, A_{i2^i,i}$ are disjoint and their union is $S$. Define
$$f_i(s) = \begin{cases} \dfrac{k-1}{2^i} & \text{if } s \in A_{k,i} \text{ for } k > 0, \\ i & \text{if } s \in A_{0,i}. \end{cases}$$
It is clear that $f_i(s) \le f(s)$ for all $i$ and $s$, and each $f_i$ is a simple function. Since, for $k > 0$, $A_{k,i} = A_{2k-1,i+1} \cup A_{2k,i+1}$, and $A_{0,i} = A_{0,i+1} \cup A_{i2^{i+1}+1,i+1} \cup \cdots \cup A_{(i+1)2^{i+1},i+1}$, it is easy to see that $f_i(s) \le f_{i+1}(s)$ for all $i$ and all $s$. It is also easy to see that for each $s$ with $f(s) < \infty$, there exists $n$ such that for $i \ge n$, $|f(s) - f_i(s)| \le 2^{-i}$ (while $f_i(s) = i \to \infty$ wherever $f(s) = \infty$). Hence $f_i(s) \uparrow f(s)$. $\square$
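The dyadic construction in this proof is easy to compute. Below is a brief numerical sketch (in Python; the helper name is mine, not the book's) of the approximating simple functions $f_i(s) = \min(\lfloor 2^i f(s)\rfloor/2^i,\, i)$, illustrating that $f_i \uparrow f$ pointwise:

```python
import math

def dyadic_approx(f, i):
    """The i-th simple function in the dyadic construction:
    f_i(s) = floor(2^i f(s)) / 2^i where f(s) < i, and f_i(s) = i where f(s) >= i."""
    def f_i(s):
        v = f(s)
        if v >= i:
            return i
        return math.floor((2 ** i) * v) / (2 ** i)
    return f_i

# Example: f(s) = s^2 at a few sample points; f_i(s) increases to f(s).
f = lambda s: s * s
for s in [0.3, 1.7, 10.0]:
    vals = [dyadic_approx(f, i)(s) for i in range(1, 12)]
    assert all(a <= b for a, b in zip(vals, vals[1:]))  # monotone in i
    assert abs(vals[-1] - min(f(s), 11)) <= 2 ** -11    # within 2^{-i} once i > f(s)
```

Each $f_i$ takes at most $i2^i + 1$ values, so it is simple, and doubling the grid while raising the cap reproduces the monotonicity argument in the proof.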
The following theorem will be very useful throughout the study of statistics. It says that one function $g$ is a function of another function $f$ if and only if $g$ is measurable with respect to the $\sigma$-field generated by $f$.

Theorem A.42. Let $(S_1,\mathcal{A}_1)$, $(S_2,\mathcal{A}_2)$, and $(S_3,\mathcal{A}_3)$ be measurable spaces such that $\mathcal{A}_3$ contains all singletons. Suppose that $f : S_1 \to S_2$ is measurable. Let $\mathcal{A}_{1f}$ be the $\sigma$-field generated by $f$. Let $T$ be the image of $f$ and let $\mathcal{A}^*$ be the image $\sigma$-field of $f$. Let $g : S_1 \to S_3$ be a measurable function. Then $g$ is $\mathcal{A}_{1f}$ measurable if and only if there is a measurable function $h : T \to S_3$ such that for each $s \in S_1$, $g(s) = h(f(s))$.
PROOF. For the "if" part, assume that there is a measurable $h : T \to S_3$ such that $g(s) = h(f(s))$ for all $s \in S_1$. Let $B \in \mathcal{A}_3$. We need to show that $g^{-1}(B) \in \mathcal{A}_{1f}$. Since $h$ is measurable, $h^{-1}(B) \in \mathcal{A}^*$, so $h^{-1}(B) = T \cap A$ for some $A \in \mathcal{A}_2$. Since $f^{-1}(A) = f^{-1}(T \cap A)$ and $g^{-1}(B) = f^{-1}(h^{-1}(B))$, it follows that $g^{-1}(B) = f^{-1}(A) \in \mathcal{A}_{1f}$.

For the "only if" part, assume that $g$ is $\mathcal{A}_{1f}$ measurable. For each $t \in S_3$, let $C_t = g^{-1}(\{t\})$. Since $g$ is measurable with respect to $\mathcal{A}_{1f}$, let $A_t \in \mathcal{A}_2$ be such that $C_t = f^{-1}(A_t)$. (Such $A_t$ exists because of Corollary A.30.) Define $h(s) = t$ for all $s \in A_t \cap T$. (Note that if $t_1 \neq t_2$, then $A_{t_1} \cap A_{t_2} \cap T = \emptyset$, so $h$ is well defined.) To see that $g(s) = h(f(s))$, let $g(s) = t$, so that $s \in C_t = f^{-1}(A_t)$. This means that $f(s) \in A_t \cap T$, which in turn implies $h(f(s)) = t = g(s)$.

To see that $h$ is measurable, let $A \in \mathcal{A}_3$. We must show that $h^{-1}(A) \in \mathcal{A}^*$. Since $g$ is $\mathcal{A}_{1f}$ measurable, $g^{-1}(A) \in \mathcal{A}_{1f}$, so there is some $B \in \mathcal{A}_2$ such that $g^{-1}(A) = f^{-1}(B)$. We will show that $h^{-1}(A) = B \cap T \in \mathcal{A}^*$ to complete the proof. If $s \in h^{-1}(A)$, then $t = h(s) \in A$ and $s = f(x)$ for some $x \in C_t \subseteq g^{-1}(A) = f^{-1}(B)$, so $f(x) \in B$. Hence, $s \in B \cap T$. This implies that $h^{-1}(A) \subseteq B \cap T$. Lastly, if $s \in B \cap T$, then $s = f(x)$ for some $x \in f^{-1}(B) = g^{-1}(A)$ and $h(s) = h(f(x)) = g(x) \in A$. So, $h(s) \in A$ and $s \in h^{-1}(A)$. This implies $B \cap T \subseteq h^{-1}(A)$. $\square$
The condition that $\mathcal{A}_3$ contain singletons is needed to avoid the situation in the following example.

Example A.43. Let $S_1 = S_2 = S_3 = \mathbb{R}$ and let $\mathcal{A}_1 = \mathcal{A}_2$ be the Borel $\sigma$-field, while $\mathcal{A}_3$ is the trivial $\sigma$-field. Then every function $g : S_1 \to S_3$ is $\mathcal{A}_{1f}$ measurable no matter what $f$ is, for example, $g(s) = s$. If $f(s) = s^2$, then $g$ is not a function of $f$.

A.4 Integration

The integral of a function with respect to a measure is a way to generalize the notion of weighted average. We define the integral in stages. We start with nonnegative simple functions.
Definition A.44. Let $f$ be a nonnegative simple function represented as $f(s) = \sum_{i=1}^n a_i I_{A_i}(s)$, with the $a_i$ distinct and the $A_i$ mutually disjoint. Then the integral of $f$ with respect to $\mu$ is $\int f(s)\,d\mu(s) = \sum_{i=1}^n a_i \mu(A_i)$. If $0$ times $\infty$ occurs in such a sum, the result is $0$ by convention.

The integral of a nonnegative simple function is allowed to be $\infty$. It turns out that the formula for the integral of a nonnegative simple function is more general than in Definition A.44.

Proposition A.45.$^{23}$ If $(S,\mathcal{A},\mu)$ is a measure space, $A_i \in \mathcal{A}$ and $a_i \ge 0$ for $i = 1, \dots, n$, and $f(s) = \sum_{i=1}^n a_i I_{A_i}(s)$, then $\int f(s)\,d\mu(s) = \sum_{i=1}^n a_i \mu(A_i)$.
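For a concrete instance of Definition A.44 and Proposition A.45, take $\mu$ to be a measure on finitely many points. A minimal sketch (Python; the measure and helper name are my own illustration, not from the text):

```python
# A discrete measure mu on S = {0, 1, 2, 3} given by point masses.
mu = {0: 0.5, 1: 1.0, 2: 2.0, 3: 0.25}

def integral_simple(f, mu):
    """Integral of a simple function f w.r.t. a discrete measure:
    sum over the distinct values a of a * mu({s : f(s) = a})."""
    values = set(f(s) for s in mu)
    return sum(a * sum(m for s, m in mu.items() if f(s) == a) for a in values)

# f = 2 * I_{0,1} + 5 * I_{2}: the integral is 2*(0.5 + 1.0) + 5*2.0 = 13.0.
f = lambda s: 2.0 if s in (0, 1) else (5.0 if s == 2 else 0.0)
assert integral_simple(f, mu) == 2.0 * 1.5 + 5.0 * 2.0
```

Grouping by distinct values is exactly the canonical representation of Definition A.44; Proposition A.45 says any other representation by disjoint (or even overlapping, after refinement) sets gives the same number.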
Next, we consider general nonnegative measurable functions. If f is a nonnega-
tive simple function, then for every nonnegative simple function 9 :::; I, it follows
I
easily from Definition A.44 that g(s)dl-(s) :::; f f(s)dl-(s). Hence, the following
definition contains no contradiction with Definition A.44.
Definition A.46. If I is a nonnegative measurable function, then the integral
of f with respect to 1- is I
f(s)dl-(s) = SUPg~/.g simple g(s)dp.(s). I
For general functions f, define the positive part as f+(s) = max{f(s),O} and
define the negative part as r(s) = -min{f(s),O}. Then I(s) = r(s)- r(s). If
I
f 2: 0, then r == and r (s )dl- (s) = OJ hence the following definition contains
no contradiction with the previous definitions.
Definition A.47. If $f$ is a measurable function, then the integral of $f$ with respect to $\mu$ is
$$\int f(s)\,d\mu(s) = \int f^+(s)\,d\mu(s) - \int f^-(s)\,d\mu(s),$$
if at least one of the two integrals on the right is finite. If both are infinite, the integral is undefined. We say that $f$ is integrable if the integral of $f$ is defined and is finite.
The integral is defined above in terms of the values of $f$ at all points in $S$. Sometimes we wish to consider only a subset of $S$.

Definition A.48. If $A \subseteq S$ and $f$ is measurable, the integral of $f$ over $A$ with respect to $\mu$ is
$$\int_A f(s)\,d\mu(s) = \int I_A(s) f(s)\,d\mu(s).$$

Here are a few simple facts about integrals.


Proposition A.49. Let $(S,\mathcal{A},\mu)$ be a probability space, and let $f, g : S \to \mathbb{R}$ be measurable.
1. If $f = g$ a.e. $[\mu]$, then $\int f(s)\,d\mu(s) = \int g(s)\,d\mu(s)$ if either integral is defined.
2. If $\int f(s)\,d\mu(s)$ is defined and $a$ is a constant, then $\int a f(s)\,d\mu(s) = a \int f(s)\,d\mu(s)$.
3. If $f$ and $g$ are integrable with respect to $\mu$, and $f \le g$, a.e. $[\mu]$, then $\int f(s)\,d\mu(s) \le \int g(s)\,d\mu(s)$.
4. If $f$ and $g$ are integrable and $\int_A f(s)\,d\mu(s) = \int_A g(s)\,d\mu(s)$ for all $A \in \mathcal{A}$, then $f = g$, a.e. $[\mu]$.

23This proposition is used in the proof of Theorem A.53.



The proofs of the next few theorems are essentially borrowed from Royden (1968).

Theorem A.50 (Fatou's lemma).$^{24}$ Let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative measurable functions. Then
$$\int \liminf_{n\to\infty} f_n(s)\,d\mu(s) \le \liminf_{n\to\infty} \int f_n(s)\,d\mu(s).$$

PROOF. Let $f(s) = \liminf_{n\to\infty} f_n(s)$. Since
$$\int f(s)\,d\mu(s) = \sup_{\text{simple } \varphi \le f} \int \varphi(s)\,d\mu(s),$$
we need only prove that, for every simple $\varphi \le f$,
$$\int \varphi(s)\,d\mu(s) \le \liminf_{n\to\infty} \int f_n(s)\,d\mu(s).$$
Since this is clearly true if $\varphi(s) = 0$, a.e. $[\mu]$, we will assume that $\mu(A) > 0$, where $A = \{s : \varphi(s) > 0\}$. Let $\varphi \le f$ be simple, let $\epsilon > 0$, and let $\delta$ and $M$ be the smallest and largest positive values that $\varphi$ assumes. For each $n$, define
$$A_n = \{s \in A : f_k(s) > (1-\epsilon)\varphi(s), \text{ for all } k \ge n\}.$$
Since $(1-\epsilon)\varphi(s) < f(s)$ for all $s \in A$, $\bigcup_{n=1}^\infty A_n = A$ and $A_n \subseteq A_{n+1}$ for all $n$. Let $B_n = A \cap A_n^c$. Then
$$\int f_n(s)\,d\mu(s) \ge \int_{A_n} f_n(s)\,d\mu(s) \ge (1-\epsilon) \int_{A_n} \varphi(s)\,d\mu(s). \qquad (A.51)$$
If $\mu(B_n) = \infty$ for some $n = n_0$, then $\mu(A) = \infty$ and $\int \varphi(s)\,d\mu(s) = \infty$, since $\varphi$ takes on only finitely many different values. The rightmost integral in (A.51) is at least $\delta\mu(A_n)$, which goes to $\infty$ as $n$ increases, hence $\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) = \infty$ and the result is true. So, assume $\mu(B_n) < \infty$ for all $n$. Since $\bigcap_{n=1}^\infty B_n = \emptyset$, it follows from Theorem A.19 that $\lim_{n\to\infty} \mu(B_n) = 0$. So, there exists $N$ such that $n > N$ implies $\mu(B_n) < \epsilon$. Since
$$\int \varphi(s)\,d\mu(s) = \int_A \varphi(s)\,d\mu(s) = \int_{A_n} \varphi(s)\,d\mu(s) + \int_{B_n} \varphi(s)\,d\mu(s) \le \int_{A_n} \varphi(s)\,d\mu(s) + \epsilon M,$$
(A.51) implies that, for $n \ge N$,
$$\int f_n(s)\,d\mu(s) \ge (1-\epsilon)\int \varphi(s)\,d\mu(s) - \epsilon(1-\epsilon)M.$$
24This theorem is used in the proofs of Theorems A.52, A.57, A.60, B.117, and 7.80.

If $\int \varphi(s)\,d\mu(s) = \infty$, the result is true again. If $\int \varphi(s)\,d\mu(s) = K < \infty$, then for every $n \ge N$,
$$\int f_n(s)\,d\mu(s) \ge \int \varphi(s)\,d\mu(s) - \epsilon[(1-\epsilon)M + K],$$
hence
$$\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) \ge \int \varphi(s)\,d\mu(s) - \epsilon[(1-\epsilon)M + K].$$
Since this is true for every $\epsilon > 0$,
$$\liminf_{n\to\infty} \int f_n(s)\,d\mu(s) \ge \int \varphi(s)\,d\mu(s). \qquad \square$$
Theorem A.52 (Monotone convergence theorem). Let $\{f_n\}_{n=1}^\infty$ be a sequence of measurable nonnegative functions, and let $f$ be a measurable function such that $f_n(x) \le f(x)$ a.e. $[\mu]$ and $f_n(x) \to f(x)$ a.e. $[\mu]$. Then,
$$\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).$$
PROOF. Since $f_n \le f$ a.e. $[\mu]$ for all $n$, $\int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x)$ for all $n$. Hence
$$\liminf_{n\to\infty} \int f_n(x)\,d\mu(x) \le \limsup_{n\to\infty} \int f_n(x)\,d\mu(x) \le \int f(x)\,d\mu(x).$$
By Fatou's lemma A.50, $\int f(x)\,d\mu(x) \le \liminf_{n\to\infty} \int f_n(x)\,d\mu(x)$. $\square$
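The monotone convergence theorem can be illustrated numerically. Below is a sketch of my own (not from the text), taking $\mu$ to be a discrete measure so that integrals are finite sums: with $f_n \uparrow f$ pointwise, the integrals $\int f_n\,d\mu$ increase to $\int f\,d\mu$.

```python
# Discrete measure: mu({k}) = 2^{-k} on k = 0, 1, ..., 30.
points = list(range(31))
mass = {k: 2.0 ** -k for k in points}

def integral(g):
    # Integral against the discrete measure is a weighted sum.
    return sum(g(k) * mass[k] for k in points)

# f_n(k) = min(k, n) increases pointwise to f(k) = k.
f = lambda k: float(k)
ints = [integral(lambda k, n=n: float(min(k, n))) for n in range(1, 40)]
assert all(a <= b for a, b in zip(ints, ints[1:]))  # integrals are monotone
assert abs(ints[-1] - integral(f)) < 1e-9           # and converge to the limit integral
```

On this finite space the conclusion is elementary; the content of Theorem A.52 is that the same exchange of limit and integral is valid on arbitrary measure spaces, without any domination hypothesis, as long as the convergence is monotone from below.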


Theorem A.53. II J I(s)dp.(s) and J g(s)dp.(s) are defined and they are not
both infinite and 01 opposite signs, then Il/(s) + g(s)]dJl(s) = J I(s)dp.(s) +
Jg(s)dp.(s).
PROOF. If I,g ~ 0, then by Theorem A.41, there exist sequences of nonnega-
tive simple functions {fn}~l and {gn}~=l such that In T I and gn T g. Then
J
Un + gn) TU + g) and Illn(s) + gn(s)]dp.(s) = In (s)dp.(s) + gn(s)dp.(s) by J
Proposition A.45. The result now follows from the monotone convergence theo-
r
rem A.52. For integrable I and g, note that U+g)+ + +g- = U+g)- + + g+. r
What we just proved for nonnegative functions implies that

j u + g)+(s)dp.(s) +j r(s)dp.(s) +/ g-(s)dJl(s)

= j[U + g)+(s) + res) + g-(s)Jdp.(s)

= f[u + g)-(s) + I+(s) + g+(s)Jdp.(s)

j (f + g)-(s)dp.(s) + / 1+ (s)dp.(s) +/ g+(s)dp.(s).

Rearranging the terms in the first and last expressions gives the desired result. If
both I and 9 have infinite integral of the same sign, then it follows easily using
A.4. Integration 591

Proposition A.49, that 1+ 9 has infinite integral of the same sign. Finally, if only
one of I and 9 has infinite integral, it also follows easily from Proposition A.49
that I + 9 has infinite integral of the same sign. 0
A nonnegative function can be used to create a new measure.

Theorem A.54. Let $(S,\mathcal{A},\mu)$ be a measure space, and let $f : S \to \mathbb{R}$ be nonnegative and measurable. Then $\nu(A) = \int_A f(s)\,d\mu(s)$ is a measure on $(S,\mathcal{A})$.

PROOF. Clearly, $\nu$ is nonnegative and $\nu(\emptyset) = 0$, since $f(s)I_\emptyset(s) = 0$, a.e. $[\mu]$. Let $\{A_n\}_{n=1}^\infty$ be disjoint. For each $n$, define $g_n(s) = f(s)I_{A_n}(s)$ and $f_n(s) = \sum_{i=1}^n g_i(s)$. Define $A = \bigcup_{n=1}^\infty A_n$. Then $0 \le f_n \le f I_A$, a.e. $[\mu]$, and $f_n$ converges to $f I_A$, a.e. $[\mu]$. So, the monotone convergence theorem A.52 says that
$$\lim_{n\to\infty} \int f_n(s)\,d\mu(s) = \nu(A). \qquad (A.55)$$
Also, $\nu(A_i) = \int g_i(s)\,d\mu(s)$ for each $i$. It follows from Theorem A.53 that
$$\int f_n(s)\,d\mu(s) = \sum_{i=1}^n \nu(A_i). \qquad (A.56)$$
Take the limit as $n \to \infty$ of the two sides of (A.56) and compare to (A.55) to see that $\nu$ is countably additive. $\square$

Theorem A.57 (Dominated convergence theorem). Let $\{f_n\}_{n=1}^\infty$ be a sequence of measurable functions, and let $f$ and $g$ be measurable functions such that $f_n(x) \to f(x)$ a.e. $[\mu]$, $|f_n(x)| \le g(x)$ a.e. $[\mu]$, and $\int g(x)\,d\mu(x) < \infty$. Then,
$$\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x).$$
PROOF. We have $-g(x) \le f_n(x) \le g(x)$ a.e. $[\mu]$, hence
$$g(x) + f_n(x) \ge 0, \quad \text{a.e. } [\mu],$$
$$g(x) - f_n(x) \ge 0, \quad \text{a.e. } [\mu],$$
$$\lim_{n\to\infty} [g(x) + f_n(x)] = g(x) + f(x), \quad \text{a.e. } [\mu],$$
$$\lim_{n\to\infty} [g(x) - f_n(x)] = g(x) - f(x), \quad \text{a.e. } [\mu].$$
It follows from Fatou's lemma A.50 and Theorem A.53 that
$$\int [g(x) + f(x)]\,d\mu(x) \le \liminf_{n\to\infty} \int [g(x) + f_n(x)]\,d\mu(x) = \int g(x)\,d\mu(x) + \liminf_{n\to\infty} \int f_n(x)\,d\mu(x),$$
hence $\int f(x)\,d\mu(x) \le \liminf_{n\to\infty} \int f_n(x)\,d\mu(x)$. Similarly, it follows that
$$\int [g(x) - f(x)]\,d\mu(x) \le \liminf_{n\to\infty} \int [g(x) - f_n(x)]\,d\mu(x) = \int g(x)\,d\mu(x) - \limsup_{n\to\infty} \int f_n(x)\,d\mu(x),$$
hence $\int f(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x)$. Together, these imply the conclusion of the theorem. $\square$
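The role of the dominating function $g$ can be seen in the standard counterexample $f_n = n\,I_{(0,1/n)}$ on $[0,1]$ with Lebesgue measure: $f_n \to 0$ pointwise but $\int f_n\,d\mu = 1$ for every $n$, and no integrable $g$ dominates all the $f_n$. A small sketch of my own, using the exact closed-form integrals rather than numerical quadrature:

```python
def integral_fn(n):
    # ∫ f_n dμ for f_n = n · I_(0,1/n) under Lebesgue measure on [0,1]:
    # the value n on a set of measure 1/n gives n * (1/n) = 1.
    return 1.0

def smallest_dominator_integral(N):
    # Any g with g >= f_n for all n <= N must satisfy g >= n on (1/(n+1), 1/n),
    # so ∫ g dμ >= sum_n n * (1/n - 1/(n+1)) = sum_n 1/(n+1), a harmonic tail.
    return sum(n * (1.0 / n - 1.0 / (n + 1)) for n in range(1, N + 1))

assert all(integral_fn(n) == 1.0 for n in range(1, 100))  # integrals do not go to 0
assert smallest_dominator_integral(10**4) > 8.0           # dominator's integral diverges
```

So the hypothesis $\int g\,d\mu < \infty$ is not removable: here the cheapest envelope of the sequence already has divergent (logarithmically growing) integral, and the conclusion of Theorem A.57 fails.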


An alternate version of the dominated convergence theorem is the following.
Proposition A.58. 25 Let {In}~l' {gn}~=l be sequences of measurable func-
tions such that Ifn(x)1 $ gn(X), a.e. [JL]. Let f and 9 be measurable functions
such that limn_ oo fn(x) = f(x) and limn_ex:> gn(X) = g(x), a.e. [JL]. Suppose
J J
that lim n _ oo gn(x)dJL(x) = g(x)dJL(x) < 00. Then, lim n _ oo fn(x)dJL(x) = J
J f(x)dJL(x).
The proof is the same as the proof of Theorem A.57, except that gn replaces 9 in
the first three lines and wherever 9 appears with fn and a limit is being taken.
For O'-finite measure spaces, the minimal condition that guarantees convergence
of integrals is uniform integrability.
Definition A.59. A sequence of integrable functions {In}~=l is uniformly inte-
grable (with respect to JL) if lim c _ oo sUPn J{:z::lfn(:z:ll>C} Ifn(x)ldJL(x) = O.
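Definition A.59 can be probed concretely. In the sketch below (my own, not from the text), the sequence $f_n = n\,I_{(0,1/n)}$ on $[0,1]$ fails uniform integrability: the tail quantity $\sup_n \int_{\{|f_n|>c\}} |f_n|\,d\mu$ stays at $1$ no matter how large $c$ is.

```python
def tail(n, c):
    # ∫_{f_n > c} f_n dμ for f_n = n · I_(0,1/n) under Lebesgue measure on [0,1]:
    # the set {f_n > c} is (0, 1/n) when n > c and empty otherwise,
    # and the integral over (0, 1/n) is n · (1/n) = 1.
    return 1.0 if n > c else 0.0

for c in [1, 10, 100]:
    sup_tail = max(tail(n, c) for n in range(1, 1000))
    assert sup_tail == 1.0  # sup_n of the tail integral stays 1 at every level c

# f_n -> 0 pointwise, ∫ f_n dμ = 1 for all n, and ∫ 0 dμ = 0:
# convergence of the integrals fails exactly because uniform integrability fails.
```

By contrast, any sequence bounded by a fixed constant $M$ has tail integral $0$ for every $c > M$, so bounded sequences on a finite measure space are automatically uniformly integrable.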

Theorem A.60.$^{26}$ Let $\mu$ be a finite measure. Let $\{f_n\}_{n=1}^\infty$ be a sequence of integrable functions such that $\lim_{n\to\infty} f_n = f$ a.e. $[\mu]$. Then $\lim_{n\to\infty} \int f_n(x)\,d\mu(x) = \int f(x)\,d\mu(x)$ if $\{f_n\}_{n=1}^\infty$ is uniformly integrable.$^{27}$

PROOF. Let $f_n^+$, $f_n^-$, $f^+$, and $f^-$ be the positive and negative parts of $f_n$ and $f$. We will prove that the result holds for nonnegative functions and take the difference to get the general result. Let $\epsilon > 0$ and let $c$ be large enough so that $\sup_n \int_{\{x : f_n(x) > c\}} f_n(x)\,d\mu(x) < \epsilon$. The functions
$$g_n(x) = \begin{cases} f_n(x) & \text{if } f_n(x) \le c, \\ c & \text{if } f_n(x) > c \end{cases}$$
converge a.e. $[\mu]$ to
$$g(x) = \begin{cases} f(x) & \text{if } f(x) \le c, \\ c & \text{if } f(x) > c. \end{cases}$$
We now have
$$\int f(x)\,d\mu(x) \ge \int g(x)\,d\mu(x) = \lim_{n\to\infty} \int g_n(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x) - \epsilon,$$

25This proposition is used in the proof of Scheffé's theorem B.79.

26This theorem is used in the proofs of Theorems 1.121 and B.118.
27 One could replace "if" by "if and only if," but we will never need the "only
if" part of the theorem in this book.

where the second relation (the equality) follows from the dominated convergence theorem A.57 and the third from our choice of $c$. Since this is true for every $\epsilon$, we have $\int f(x)\,d\mu(x) \ge \limsup_{n\to\infty} \int f_n(x)\,d\mu(x)$. Combining this with Fatou's lemma A.50 gives
$$\int f(x)\,d\mu(x) = \lim_{n\to\infty} \int f_n(x)\,d\mu(x). \qquad \square$$

A.5 Product Spaces


In Definition A.12, we introduced product spaces and product $\sigma$-fields. We would like to be able to define measures on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ in terms of measures on $(S_1,\mathcal{A}_1)$ and $(S_2,\mathcal{A}_2)$. The derivation of product measure given here resembles the derivation in Billingsley (1986, Section 18).
Lemma A.61.$^{28}$ Let $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ be $\sigma$-finite measure spaces, and let $\mathcal{A}_1 \otimes \mathcal{A}_2$ be the product $\sigma$-field.
For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $x \in S_1$, $B_x = \{y : (x,y) \in B\} \in \mathcal{A}_2$ and $\mu_2(B_x)$ is a measurable function from $(S_1,\mathcal{A}_1)$ to $\mathbb{R} \cup \{\infty\}$.
For every $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$ and every $y \in S_2$, $B^y = \{x : (x,y) \in B\} \in \mathcal{A}_1$ and $\mu_1(B^y)$ is a measurable function from $(S_2,\mathcal{A}_2)$ to $\mathbb{R} \cup \{\infty\}$.
PROOF. Clearly, we need only prove one of the two sets of assertions. First, let $B = A_1 \times A_2$ with $A_i \in \mathcal{A}_i$ for $i = 1,2$ and $x \in S_1$. Then
$$B_x = \begin{cases} A_2 & \text{if } x \in A_1, \\ \emptyset & \text{otherwise.} \end{cases}$$
So, $B_x \in \mathcal{A}_2$. Let $\mathcal{C}$ be the collection of all sets $B \subseteq S_1 \times S_2$ such that $B_x \in \mathcal{A}_2$. If $B \in \mathcal{C}$, then $(B^c)_x = \{y : (x,y) \notin B\} = (B_x)^c$, so $B^c \in \mathcal{C}$. Let $\{B_n\}_{n=1}^\infty \in \mathcal{C}$. Then it is easy to see that
$$\left(\bigcup_{n=1}^\infty B_n\right)_x = \left\{y : (x,y) \in \bigcup_{n=1}^\infty B_n\right\} = \bigcup_{n=1}^\infty \{y : (x,y) \in B_n\} = \bigcup_{n=1}^\infty (B_n)_x \in \mathcal{A}_2. \qquad (A.62)$$
Clearly, $S_1 \times S_2 \in \mathcal{C}$, so $\mathcal{C}$ is a $\sigma$-field containing all product sets; hence it contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. Next, let $f_B(x) = \mu_2(B_x)$ for $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$. Write $S_1 \times S_2 = \bigcup_{n=1}^\infty E_n$ with $E_n = A_{1n} \times A_{2n}$ and $\mu_i(A_{in}) < \infty$ for all $n$ and $i = 1,2$ and with the $E_n$ disjoint. Then let $f_{B,n}(x) = \mu_2((B \cap E_n)_x)$. It follows that $f_B = \sum_{n=1}^\infty f_{B,n}$. If we can show that $f_{B,n}$ is measurable for each $n$, then so is $f_B$, since they are nonnegative, and the sum is well defined. If $B = B_1 \times B_2$, then $f_{B,n}(x) = I_{A_{1n} \cap B_1}(x)\,\mu_2(A_{2n} \cap B_2)$, which is a measurable function. Let $\mathcal{D}$ be the collection of all sets $D \subseteq S_1 \times S_2$
28This lemma is used in the proofs of Lemmas A.64 and A.67 and Theorems A.69 and B.46.

such that $f_{D,n}$ is measurable. If $D \in \mathcal{D}$, then $f_{D^c,n} = \mu_2(A_{2n}) - f_{D,n}$, which is measurable, so $D^c \in \mathcal{D}$. If $\{D_m\}_{m=1}^\infty \in \mathcal{D}$ with the $D_m$ disjoint, then
$$\mu_2\left(\left(\bigcup_{m=1}^\infty (D_m \cap E_n)\right)_x\right) = \sum_{m=1}^\infty \mu_2\big((D_m \cap E_n)_x\big) = \sum_{m=1}^\infty f_{D_m,n}(x),$$
which is a measurable function, so $\bigcup_{m=1}^\infty D_m \in \mathcal{D}$. Clearly, $S_1 \times S_2 \in \mathcal{D}$, so $\mathcal{D}$ is a $\lambda$-system (see Definition A.14) that contains the $\pi$-system of product sets. By the $\pi$-$\lambda$ theorem A.17, $\mathcal{D}$ contains $\mathcal{A}_1 \otimes \mathcal{A}_2$. $\square$
The following corollary to Lemma A.61 is a sort of dual to part 1 of Theorem A.38.

Corollary A.63. Let $(S_1,\mathcal{A}_1)$, $(S_2,\mathcal{A}_2)$, and $(\mathcal{X},\mathcal{B})$ be measurable spaces. If $f : S_1 \times S_2 \to \mathcal{X}$ is measurable, then for every $s_1 \in S_1$, $f_{s_1}(s_2) = f(s_1,s_2)$ is a measurable function from $S_2$ to $\mathcal{X}$.

Lemma A.64.$^{29}$ Suppose that $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ are $\sigma$-finite measure spaces. For each $x \in S_1$, $y \in S_2$, and $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, define $B_x$ and $B^y$ as in Lemma A.61. Then $\nu_1(B) = \int_{S_1} \mu_2(B_x)\,d\mu_1(x)$ and $\nu_2(B) = \int_{S_2} \mu_1(B^y)\,d\mu_2(y)$ both define the same measure on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. If $A_i \in \mathcal{A}_i$ for $i = 1,2$, then $\nu_1(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.
PROOF. First, we prove that $\nu_1$ is a measure. The proof that $\nu_2$ is a measure is identical. Clearly, $\nu_1(B) \ge 0$ for all $B$ and $\nu_1(\emptyset) = 0$. If $\{B_n\}_{n=1}^\infty$ are disjoint, then
$$\nu_1\left(\bigcup_{n=1}^\infty B_n\right) = \int_{S_1} \sum_{n=1}^\infty \mu_2\big((B_n)_x\big)\,d\mu_1(x) = \sum_{n=1}^\infty \int_{S_1} \mu_2\big((B_n)_x\big)\,d\mu_1(x) = \sum_{n=1}^\infty \nu_1(B_n),$$
where the first equality follows from the definition of $\nu_1$, the fact that $\mu_2$ is countably additive, and (A.62); the second equality follows from the monotone convergence theorem A.52 and the fact that $\sum_{n=1}^m \mu_2((B_n)_x) \le \sum_{n=1}^\infty \mu_2((B_n)_x)$ for all $m$; and the last equality follows from the definition of $\nu_1$. This proves that $\nu_1$ (and so too $\nu_2$) is a measure. Note that if $B = A_1 \times A_2$, then
$$\nu_1(B) = \int_{S_1} I_{A_1}(x)\mu_2(A_2)\,d\mu_1(x) = \mu_1(A_1)\mu_2(A_2) = \int_{S_2} I_{A_2}(y)\mu_1(A_1)\,d\mu_2(y) = \nu_2(B).$$
So, $\nu_1 = \nu_2$ on the $\pi$-system consisting of product sets. Since each of $\mu_1$ and $\mu_2$ is $\sigma$-finite, there exists a countable collection of product sets whose union is $S_1 \times S_2$
29This lemma is used in the proof of Lemma A.67.



and such that each one has finite $\nu_1 = \nu_2$ measure. By Theorem A.26, $\nu_1$ agrees with $\nu_2$ on all of $\mathcal{A}_1 \otimes \mathcal{A}_2$. $\square$
Definition A.65. Let $(S_i,\mathcal{A}_i,\mu_i)$ for $i = 1,2$ be $\sigma$-finite measure spaces. Define the product measure $\mu_1 \times \mu_2$ on $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$ as the common value of the two measures $\nu_1$ and $\nu_2$ in Lemma A.64.

Lebesgue measure on $\mathbb{R}^2$, denoted $dx\,dy$, is a product measure. Not every measure on a product space is a product measure. Product probability measures will correspond to independent random variables. (See Theorem B.66.)

Proposition A.66. Let $\mu$ be a measure on a product space $(S_1 \times S_2, \mathcal{A}_1 \otimes \mathcal{A}_2)$. Then $\mu$ is a product measure if and only if there exist set functions $\mu_i : \mathcal{A}_i \to \mathbb{R}$ for $i = 1,2$ such that, for every $A_1 \in \mathcal{A}_1$ and $A_2 \in \mathcal{A}_2$, $\mu(A_1 \times A_2) = \mu_1(A_1)\mu_2(A_2)$.
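For finite discrete spaces, Definition A.65 and the defining property in Proposition A.66 reduce to multiplying point masses. A small sketch of my own (the measures and variable names are illustrative):

```python
from itertools import product

mu1 = {'a': 0.5, 'b': 1.5}  # measure on S1
mu2 = {0: 2.0, 1: 0.25}     # measure on S2

# Product measure on S1 x S2: (mu1 x mu2)({(x, y)}) = mu1({x}) * mu2({y}).
prod = {(x, y): mu1[x] * mu2[y] for x, y in product(mu1, mu2)}

# Check the defining property on a rectangle A1 x A2.
A1, A2 = {'a'}, {0, 1}
lhs = sum(m for (x, y), m in prod.items() if x in A1 and y in A2)
assert abs(lhs - sum(mu1[x] for x in A1) * sum(mu2[y] for y in A2)) < 1e-12
```

A measure on the product space that does not factor on rectangles in this way (for example, one concentrating all mass on the diagonal) is a measure on the product space without being a product measure, which is the distinction the proposition draws.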
Lemma A.67.$^{30}$ Let $f$ be a measurable function from $S_1 \times S_2$ to $\mathbb{R}$ such that either $\{x \in S_1 : \int |f(x,y)|\,d\mu_2(y) = \infty\} \subseteq A \in \mathcal{A}_1$, where $\mu_1(A) = 0$, or $f \ge 0$. Then, there is a measurable (possibly extended real-valued) function $g : S_1 \to \mathbb{R} \cup \{\infty\}$ such that $g(x) = \int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$. If $f$ is the indicator of a measurable set $B$, then
$$\int g(x)\,d\mu_1(x) = \mu_1 \times \mu_2(B). \qquad (A.68)$$
PROOF. For each $B \in \mathcal{A}_1 \otimes \mathcal{A}_2$, note that $\int I_B(x,y)\,d\mu_2(y) = \mu_2(B_x)$, where $B_x$ is defined in Lemma A.61. It was shown there that $\mu_2(B_x)$ is a measurable function of $x$. It follows from Lemma A.64 that (A.68) holds. It now follows from the linearity of integrals that if $f$ is a nonnegative simple function, then $g(x) = \int f(x,y)\,d\mu_2(y)$ is a measurable function of $x$. If $f$ is a nonnegative measurable function, let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative simple functions such that $f_n \le f$ for all $n$ and $\lim_{n\to\infty} f_n(x,y) = f(x,y)$ for all $(x,y)$. Then, the monotone convergence theorem A.52 says that $\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y) = \int f(x,y)\,d\mu_2(y) = g(x)$ for all $x$. By part 5 of Theorem A.38, $g$ is measurable. If $\mu_1(\{x \in S_1 : \int |f(x,y)|\,d\mu_2(y) = \infty\}) = 0$, then the argument just given applies to both $f^+$ and $f^-$, and the difference $\int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ is defined a.e. $[\mu_1]$ and equals $\int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$. If we let $g(x) = \int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ for all $x \notin A$, and let $g(x)$ be constant on $A$, then $g(x) = \int f(x,y)\,d\mu_2(y)$, a.e. $[\mu_1]$, and $g$ is measurable. $\square$
The following two theorems will be used many times in the study of product spaces.

Theorem A.69 (Tonelli's theorem). Let $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ be $\sigma$-finite measure spaces. Let $f : S_1 \times S_2 \to \mathbb{R}$ be a nonnegative measurable function. Then
$$\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int\left[\int f(x,y)\,d\mu_1(x)\right]d\mu_2(y) = \int\left[\int f(x,y)\,d\mu_2(y)\right]d\mu_1(x).$$

30This lemma is used in the proofs of Theorem A.70 and of Lemmas 6.48 and B.46.


PROOF. As in the proof of Lemma A.67, let $\{f_n\}_{n=1}^\infty$ be a sequence of nonnegative simple functions such that $f_n \le f$ for all $n$ and $\lim_{n\to\infty} f_n(x,y) = f(x,y)$ for all $(x,y)$. If $f_n(x,y) = \sum_{i=1}^{m_n} a_{i,n} I_{B_{i,n}}(x,y)$, then $\int f_n(x,y)\,d\mu_2(y) = \sum_{i=1}^{m_n} a_{i,n}\mu_2(B_{i,n,x})$ by Lemma A.61 and
$$\int\left[\int f_n(x,y)\,d\mu_2(y)\right]d\mu_1(x) = \int f_n(x,y)\,d\mu_1 \times \mu_2(x,y)$$
by (A.68). Since $0 \le \int f_n(x,y)\,d\mu_2(y) \le \int f(x,y)\,d\mu_2(y)$ for all $x$ and $n$, and $\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y) = \int f(x,y)\,d\mu_2(y)$ as in the proof of Lemma A.67, it follows from the monotone convergence theorem A.52 that
$$\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \lim_{n\to\infty} \int f_n(x,y)\,d\mu_1 \times \mu_2(x,y) = \lim_{n\to\infty} \int\left[\int f_n(x,y)\,d\mu_2(y)\right]d\mu_1(x)$$
$$= \int\left[\lim_{n\to\infty} \int f_n(x,y)\,d\mu_2(y)\right]d\mu_1(x) = \int\left[\int f(x,y)\,d\mu_2(y)\right]d\mu_1(x).$$
The proof that the iterated integrals can be calculated in the other order is similar. $\square$

Theorem A.70 (Fubini's theorem). Let $(S_1,\mathcal{A}_1,\mu_1)$ and $(S_2,\mathcal{A}_2,\mu_2)$ be $\sigma$-finite measure spaces. If $f : S_1 \times S_2 \to \mathbb{R}$ is integrable with respect to $\mu_1 \times \mu_2$, then
$$\int f(x,y)\,d\mu_1 \times \mu_2(x,y) = \int\left[\int f(x,y)\,d\mu_1(x)\right]d\mu_2(y) = \int\left[\int f(x,y)\,d\mu_2(y)\right]d\mu_1(x).$$
PROOF. By Lemma A.67, let $g$ be measurable with $g(x) = \int |f(x,y)|\,d\mu_2(y)$, a.e. $[\mu_1]$. Then
$$\int g(x)\,d\mu_1(x) = \int\left[\int |f(x,y)|\,d\mu_2(y)\right]d\mu_1(x) = \int |f(x,y)|\,d\mu_1 \times \mu_2(x,y) < \infty$$
follows from Tonelli's theorem A.69 applied to $|f|$. It follows that $\{x : \int |f(x,y)|\,d\mu_2(y) = \infty\}$ is contained in a set $A \in \mathcal{A}_1$ with $\mu_1(A) = 0$. Apply Tonelli's theorem A.69 to $f^+$ and $f^-$ and note that the set of all $x$ such that $\int f^+(x,y)\,d\mu_2(y) - \int f^-(x,y)\,d\mu_2(y)$ is undefined is a subset of $\{x : \int |f(x,y)|\,d\mu_2(y) = \infty\}$. It follows that this difference of integrals is defined a.e. $[\mu_1]$ and the integral (with respect to $\mu_1$) of the difference (which equals $\int[\int f(x,y)\,d\mu_2(y)]\,d\mu_1(x)$) is the difference of the integrals (which equals $\int f(x,y)\,d\mu_1 \times \mu_2(x,y)$). $\square$

All of the results of this section can be extended to finite product spaces $S_1 \times \cdots \times S_n$ by simple inductive arguments.
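For discrete measures, the conclusions of Tonelli's and Fubini's theorems become the familiar fact that a double sum over a product of finite spaces can be computed in either order. A quick sketch of my own:

```python
mu1 = {x: 1.0 / (x + 1) for x in range(5)}  # finite measure on S1 = {0,...,4}
mu2 = {y: 2.0 ** -y for y in range(6)}      # finite measure on S2 = {0,...,5}

f = lambda x, y: (x - 2) * y                # integrable (all sums are finite here)

# The double integral against mu1 x mu2, and the two iterated integrals.
double = sum(f(x, y) * mu1[x] * mu2[y] for x in mu1 for y in mu2)
iter_xy = sum(mu1[x] * sum(f(x, y) * mu2[y] for y in mu2) for x in mu1)
iter_yx = sum(mu2[y] * sum(f(x, y) * mu1[x] for x in mu1) for y in mu2)

assert abs(double - iter_xy) < 1e-12 and abs(double - iter_yx) < 1e-12
```

The integrability hypothesis matters even in the countable case: for a signed $f$ whose positive and negative parts both have infinite double sum, the two orders of summation can disagree, which is why Fubini requires $\int |f|\,d\mu_1 \times \mu_2 < \infty$ while Tonelli instead requires $f \ge 0$.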

A.6 Absolute Continuity


It is also common to consider two different measures on the same space.

Definition A.71. Let $\mu_1$ and $\mu_2$ be two measures on the same space $(S,\mathcal{A})$. Suppose that, for all $A \in \mathcal{A}$, $\mu_1(A) = 0$ implies $\mu_2(A) = 0$. Then, we say that $\mu_2$ is absolutely continuous with respect to $\mu_1$, denoted $\mu_2 \ll \mu_1$. When $\mu_2 \ll \mu_1$, we say that $\mu_1$ is a dominating measure for $\mu_2$.

Consider next a function $f$ and a measure $\mu$ such that $\int f(x)\,d\mu(x)$ is defined. Then $\nu(A) = \int_A f(x)\,d\mu(x)$ is defined for all measurable $A$. If $f$ takes on negative values with positive measure, then $\nu$ is not a measure because it assigns negative values to some sets, such as $A = \{x : f(x) < 0\}$. However, $\nu$ is still a signed measure.

If one of a pair of measures is finite, there is a necessary and sufficient condition for absolute continuity which resembles the definition of continuity of functions.
Lemma A.72.$^{31}$ Let $\mu_1$ and $\mu_2$ be measures on a space $(S,\mathcal{A})$. Consider the following condition:

For every $\epsilon > 0$, there is $\delta$ such that $\mu_1(A) < \delta$ implies $\mu_2(A) < \epsilon$. $\qquad$ (A.73)

If condition (A.73) holds, then $\mu_2 \ll \mu_1$. If $\mu_2 \ll \mu_1$ and $\mu_2$ is finite, then condition (A.73) holds.

PROOF. For the first part, let $\epsilon > 0$ and suppose that $\mu_1(A) = 0$. Then $\mu_1(A) < \delta$ and $\mu_2(A) < \epsilon$. Since this is true for all $\epsilon > 0$, $\mu_2(A) = 0$. For the second part, suppose that $\mu_2 \ll \mu_1$, that $\mu_2$ is finite, and that (A.73) fails. Then there exists $\epsilon > 0$ such that, for every integer $n$, there is $A_n$ with $\mu_1(A_n) < 1/n^2$ but $\mu_2(A_n) \ge \epsilon$. Let $A = \bigcap_{k=1}^\infty \bigcup_{n=k}^\infty A_n$. By the first Borel–Cantelli lemma A.20, $\mu_1(A) = 0$, so $\mu_2(A) = 0$. Since $\mu_2$ is finite, Theorem A.19 implies that
$$\mu_2(A) = \lim_{k\to\infty} \mu_2\left(\bigcup_{n=k}^\infty A_n\right) \ge \epsilon.$$
This is a contradiction. $\square$
The following theorem says that the first part of Example A.8 on page 574 is the most general form of absolute continuity with respect to $\sigma$-finite measures. The proof is mostly borrowed from Royden (1968).

Theorem A.74 (Radon–Nikodym theorem). Let $\mu_1$ and $\mu_2$ be measures on $(S,\mathcal{A})$ such that $\mu_2 \ll \mu_1$ and $\mu_1$ is $\sigma$-finite. Then there exists an extended real-valued measurable function $f : S \to [0,\infty]$ such that for every $A \in \mathcal{A}$,
$$\mu_2(A) = \int_A f(s)\,d\mu_1(s). \qquad (A.75)$$

31This lemma is used in the proof of Lemma B.119.



Also, if $g : S \to \mathbb{R}$ is $\mu_2$ integrable, then
$$\int g(x)\,d\mu_2(x) = \int g(x)f(x)\,d\mu_1(x). \qquad (A.76)$$
The function $f$ is called the Radon–Nikodym derivative of $\mu_2$ with respect to $\mu_1$, and it is unique a.e. $[\mu_1]$. The Radon–Nikodym derivative is sometimes denoted $(d\mu_2/d\mu_1)(s)$. If $\mu_2$ is $\sigma$-finite, then $f$ is finite a.e. $[\mu_1]$.
PROOF. First, we prove uniqueness a.e. $[\mu_1]$. Suppose that such an $f$ exists. Let $g$ be another function such that $f$ and $g$ are not a.e. $[\mu_1]$ equal. Let $A_n = \{x : f(x) > g(x) + 1/n\}$ and $B_n = \{x : f(x) < g(x) - 1/n\}$. Since $f$ and $g$ are not equal a.e. $[\mu_1]$, there exists $n$ such that either $\mu_1(A_n) > 0$ or $\mu_1(B_n) > 0$. Let $A$ be a subset of either $A_n$ or $B_n$ with finite positive measure. Then $\int_A f(x)\,d\mu_1(x) \neq \int_A g(x)\,d\mu_1(x)$. Hence $g \neq d\mu_2/d\mu_1$.

The proof of existence proceeds as follows. First, we show that we can reduce to the case in which $\mu_1$ is finite. Then, we create a collection of signed measures $\nu_\alpha$ indexed by a real number $\alpha$. For each $\alpha$ we find a set $A^\alpha$ such that every subset of $A^\alpha$ has positive $\nu_\alpha$ measure and every subset of the complement $B^\alpha$ has negative $\nu_\alpha$ measure. We then show that $B^\beta \subseteq B^\alpha$ for $\beta \ge \alpha$, which allows us to define $f(x) = \sup\{\alpha : x \in B^\alpha\}$. Finally, we show that $f$ satisfies (A.75) and (A.76).

Now, we prove that we need only consider finite $\mu_1$. Since $\mu_1$ is $\sigma$-finite, let $\{A_i\}_{i=1}^\infty$ be disjoint elements of $\mathcal{A}$ such that $\mu_1(A_i) < \infty$ and $S = \bigcup_{i=1}^\infty A_i$. Let $\mu_{j,i}$ be $\mu_j$ restricted to $A_i$ for $j = 1,2$ and each $i$. Then $\mu_{2,i} \ll \mu_{1,i}$ for each $i$ and each $\mu_{1,i}$ is finite. Suppose that for each $i$ we can find $f_i$ as in the theorem with $\mu_j$ replaced by $\mu_{j,i}$ for $j = 1,2$. Then $f(x) = \sum_{i=1}^\infty I_{A_i}(x)f_i(x)$ is the function required by the theorem as stated. Hence, we prove the theorem only for the case in which $\mu_1$ is finite.

Suppose that $\mu_1$ is finite, and define the signed measure $\nu_\alpha = \alpha\mu_1 - \mu_2$ for each nonnegative rational number $\alpha$. (Note that $\nu_\alpha(A)$ never equals $\infty$, although it may equal $-\infty$.) For each $\alpha$, define
$$\mathcal{P}_\alpha = \{A \in \mathcal{A} : \nu_\alpha(B) \ge 0, \text{ for every } B \subseteq A\},$$
$$\lambda_\alpha = \sup_{A \in \mathcal{P}_\alpha} \nu_\alpha(A).$$
That is, $\lambda_\alpha$ is the supremum of the signed measures of sets all of whose subsets have nonnegative signed measure.$^{32}$ Since $\emptyset \in \mathcal{P}_\alpha$, $\lambda_\alpha \ge 0$. Let $\{A_i\}_{i=1}^\infty$ be such that $\lambda_\alpha = \lim_{i\to\infty} \nu_\alpha(A_i)$, and let $A^\alpha = \bigcup_{i=1}^\infty A_i$. Since every subset of $A^\alpha$ can be written as a union of subsets of the $A_i$, it follows that $A^\alpha \in \mathcal{P}_\alpha$, hence $\lambda_\alpha \ge \nu_\alpha(A^\alpha)$. Since $A^\alpha \setminus A_i \subseteq A^\alpha$, it follows that $\nu_\alpha(A^\alpha \setminus A_i) \ge 0$ for all $i$ and $\nu_\alpha(A^\alpha) = \nu_\alpha(A^\alpha \setminus A_i) + \nu_\alpha(A_i) \ge \nu_\alpha(A_i)$ for all $i$. It follows that $\lambda_\alpha \le \nu_\alpha(A^\alpha)$. Hence $\lambda_\alpha = \nu_\alpha(A^\alpha) < \infty$. Define $B^\alpha = (A^\alpha)^c$.
Next, we prove that every subset of $B^\alpha$ has nonpositive signed measure.$^{33}$ If not, let $B \subseteq B^\alpha$ be such that $\nu_\alpha(B) > 0$. If $B$ has no subsets with negative signed measure,
32The sets in $\mathcal{P}_\alpha$ are often called the positive sets relative to the signed measure $\nu_\alpha$.
33Such sets are called negative sets relative to the signed measure $\nu_\alpha$.

then $B \cup A^\alpha \in \mathcal{P}_\alpha$ and $\nu_\alpha(A^\alpha \cup B) > \lambda_\alpha$, a contradiction. So, let $n_1$ be the smallest positive integer such that there is a subset $B_1 \subseteq B$ with $\nu_\alpha(B_1) < -1/n_1$. For each $k > 1$, let $n_k$ be the smallest positive integer such that there exists a subset $B_k \subseteq B \setminus \bigcup_{i=1}^{k-1} B_i$ with $\nu_\alpha(B_k) < -1/n_k$. Now, let $C = B \setminus \bigcup_{k=1}^\infty B_k$. Clearly $\nu_\alpha(C) > 0$. If we prove that $C$ has no subsets with negative signed measure, then $C \in \mathcal{P}_\alpha$ and we have another contradiction. So, suppose that $D \subseteq C$ has $\nu_\alpha(D) = -\epsilon < 0$. Since $\nu_\alpha(B) > 0$, it must be that $\sum_{k=1}^\infty \nu_\alpha(B_k) > -\infty$. Hence $\lim_{k\to\infty} n_k = \infty$. So, there is $k$ such that $1/(n_{k+1} - 1) < \epsilon$. Notice that $D \subseteq C \subseteq B \setminus \bigcup_{i=1}^k B_i$. Since $\nu_\alpha(D) < -1/(n_{k+1} - 1)$, this contradicts the definition of $n_{k+1}$.

If $\beta > \alpha$, we have
$$\nu_\alpha(A^\alpha \cap B^\beta) \ge 0, \quad \nu_\beta(A^\alpha \cap B^\beta) \le 0.$$
Subtract the first inequality from the second to get $(\beta - \alpha)\mu_1(A^\alpha \cap B^\beta) \le 0$, from which it follows that $\mu_1(A^\alpha \cap B^\beta) = 0$. Since $\nu_\beta(A) \ge \nu_\alpha(A)$ for $\beta \ge \alpha$, we can assume that $A^\alpha \subseteq A^\beta$ if $\beta \ge \alpha$. It follows that $B^\beta \subseteq B^\alpha$ for $\beta \ge \alpha$, and we can define $f(x) = \sup\{\alpha : x \in B^\alpha\}$. Since $B^0 = S$, $f(x) \ge 0$ for all $x$. It is easy to see that $f(x) \ge \alpha$ if $x \in B^\alpha$ and $f(x) \le \alpha$ if $x \in A^\alpha$. It is also easy to see that $\{x : f(x) > b\} = \bigcup_{\alpha > b} B^\alpha$. Since this is a countable union of measurable sets, it is measurable. By Lemma A.35, $f$ is measurable.
Next, we prove that (A.75) holds for every $A \in \mathcal{A}$. Let $A \in \mathcal{A}$ be arbitrary and let $\epsilon > 0$ be given. Let $N > \mu_1(A)/\epsilon$ be a positive integer. Define $E_k = A \cap B^{k/N} \cap A^{(k+1)/N}$ and $E_\infty = A \setminus \bigcup_{k=1}^\infty A^{k/N}$. Then $A = \bigcup_{k=0}^\infty E_k \cup E_\infty$ and the $E_k$ are all disjoint. So $\mu_2(A) = \mu_2(E_\infty) + \sum_{k=0}^\infty \mu_2(E_k)$. By construction, $f(x) \in [k/N, (k+1)/N]$ for all $x \in E_k$ and $f(x) = \infty$ for all $x \in E_\infty$. Since $\nu_{k/N}(E_k) \le 0$ and $\nu_{(k+1)/N}(E_k) \ge 0$, we have, for finite $k$,
$$\left|\mu_2(E_k) - \int_{E_k} f(x)\,d\mu_1(x)\right| \le \frac{1}{N}\mu_1(E_k). \qquad (A.77)$$
If $\mu_1(E_\infty) > 0$, then $\mu_2(E_\infty) = \infty$, since $\nu_\alpha(E_\infty) \le 0$ for all $\alpha$. If $\mu_1(E_\infty) = 0$, then $\mu_2(E_\infty) = 0$ by absolute continuity. Either way, $\mu_2(E_\infty) = \int_{E_\infty} f(x)\,d\mu_1(x)$. Adding this into the sum of (A.77) over all finite $k$ gives
$$\left|\mu_2(A) - \int_A f(x)\,d\mu_1(x)\right| \le \frac{1}{N}\mu_1(A) < \epsilon.$$
Since this is true for every $\epsilon > 0$, (A.75) is established.


To prove (A.76), we note that it is true if 9 is an indicator function, hence it
is true for all simple functions. By the monotone convergence theorem A.52, it is
true for all nonnegative functions and by subtraction it is true for all integrable
functions.
Finally, if f(x) = 00 for all x E A with J.!l(A) > 0, then J.!2{B) = 00 for every
B ~ A such that /L1{B) > O. It is now impossible for J.!2 to be a-finite. 0
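For discrete measures the Radon–Nikodym derivative is just a ratio of point masses, which makes (A.75) easy to verify by hand. A sketch of my own (the measures and names are illustrative, not from the text):

```python
mu1 = {0: 0.5, 1: 1.0, 2: 2.0}  # dominating measure: every point has positive mass
f   = {0: 4.0, 1: 0.0, 2: 1.5}  # a candidate density d(mu2)/d(mu1)

def mu2(A):
    # Define mu2 by (A.75): mu2(A) = ∫_A f d(mu1) = Σ_{s in A} f(s) mu1({s}).
    return sum(f[s] * mu1[s] for s in A)

# Every point has positive mu1 mass, so the only mu1-null set is the empty set,
# and mu2 << mu1 holds trivially; f plays the role of d(mu2)/d(mu1).
assert mu2({0, 1, 2}) == 4.0 * 0.5 + 0.0 * 1.0 + 1.5 * 2.0
assert mu2({1}) == 0.0  # f may vanish: mu1 need not be dominated by mu2
```

If instead some point $s_0$ had $\mu_1(\{s_0\}) = 0$ but $\mu_2(\{s_0\}) > 0$, no density could reproduce $\mu_2(\{s_0\})$ by an integral against $\mu_1$, which is exactly the failure of absolute continuity that the theorem's hypothesis rules out.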
In statistical applications, we will often have a class of measures, each of which is absolutely continuous with respect to a single $\sigma$-finite measure. It would be nice if the single dominating measure were in the original class or could be constructed from the class. The following theorem addresses this problem. The proof is borrowed from Lehmann (1986).

Theorem A.78. 34 Let 1. be a u-finite measure on (S,A). Suppose that N is a


collection of measures on (S, A) such that for every v E N, v 1.. Then there
exists a sequence of nonnegative numbers {Ci}~1 and a sequence of elements of
N, {Vi}~1 such that L:
1 Ci = 1 and v 2::1 CiVi for every v E N.

PROOF. If N is a countable collection, the result is trivially true. If μ is finite,
let λ = μ. If μ is not finite, then there exists a countable partition of S into
{S_i}_{i=1}^∞ such that 0 < μ(S_i) = d_i < ∞. For each B ∈ A, let λ(B) = Σ_{i=1}^∞ μ(B ∩
S_i)/(2^i d_i). In either case λ is finite and ν ≪ λ for every ν ∈ N. Define 𝒬 to be
the collection of all measures of the form Σ_{i=1}^∞ a_i ν_i, where Σ_{i=1}^∞ a_i = 1 and each
ν_i ∈ N. Clearly, β ∈ 𝒬 implies β ≪ λ.

Next, let 𝒟 be the collection of sets C in A such that there exists Q ∈ 𝒬
satisfying λ({x ∈ C : dQ/dλ(x) = 0}) = 0 and Q(C) > 0. To see that 𝒟 is
nonempty, let ν be a measure in N that is not identically 0 and let C = {x :
dν/dλ(x) > 0}. Then with Q = ν, we have {x ∈ C : dQ/dλ(x) = 0} = ∅ and
Q(C) = ν(C) = ν(S) > 0, so C ∈ 𝒟. Since λ is finite, sup_{C∈𝒟} λ(C) = c < ∞, so
there exist {C_n}_{n=1}^∞ such that lim_{n→∞} λ(C_n) = c and C_n ∈ 𝒟 for all n. Let C₀ =
∪_{n=1}^∞ C_n, and let Q_n ∈ 𝒬 be such that Q_n(C_n) > 0 and λ({x ∈ C_n : dQ_n/dλ(x) =
0}) = 0. Let Q₀ = Σ_{n=1}^∞ 2⁻ⁿ Q_n ∈ 𝒬, so that dQ₀/dλ = Σ_{n=1}^∞ 2⁻ⁿ dQ_n/dλ and

{x ∈ C₀ : dQ₀/dλ(x) = 0} ⊆ ∪_{n=1}^∞ {x ∈ C_n : dQ_n/dλ(x) = 0},

which implies that C₀ ∈ 𝒟 and λ(C₀) = c.

Since Q₀ ∈ 𝒬, we now need only prove that ν ≪ Q₀ for all ν ∈ N to finish
the proof. Suppose that Q₀(A) = 0 and ν ∈ N. We must prove ν(A) = 0. Since
Q₀(A ∩ C₀) = 0 and dQ₀/dλ(x) > 0 for all x ∈ C₀, it follows that λ(A ∩ C₀) = 0
and hence ν(A ∩ C₀) = 0. Let C = {x : dν/dλ(x) > 0}. Then ν(A ∩ C₀ᶜ ∩ Cᶜ) = 0,
since dν/dλ(x) = 0 for x ∈ Cᶜ. Let D = A ∩ C₀ᶜ ∩ C, which is disjoint from C₀. If
λ(D) > 0, then λ(C₀ ∪ D) > λ(C₀) and D ∈ 𝒟. It follows easily that C₀ ∪ D ∈ 𝒟,
and λ(C₀ ∪ D) > λ(C₀) contradicts λ(C₀) = c. Hence λ(D) = 0 and ν(D) = 0,
which implies ν(A) = ν(A ∩ C₀) + ν(A ∩ C₀ᶜ ∩ Cᶜ) + ν(D) = 0. □
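The first step of the proof, replacing the σ-finite μ by an equivalent finite measure λ, can be sketched numerically. The code below is an illustration, not from the book: it takes μ to be Lebesgue measure on [0, ∞) with the partition S_i = [i−1, i), so every d_i = 1, and evaluates λ(B) = Σ_i μ(B ∩ S_i)/(2^i d_i) for finite unions of intervals. The function name `lam` and the truncation at 60 terms are choices made for this sketch.

```python
# Sketch of the finite measure lambda built in the proof of Theorem A.78,
# with mu = Lebesgue measure on [0, infinity) and cells S_i = [i-1, i).
def lam(intervals, terms=60):
    """lambda(B) = sum_i mu(B n S_i) / (2^i d_i), for B given as a list of
    disjoint intervals [a, b) in [0, infinity); here every d_i = mu(S_i) = 1."""
    total = 0.0
    for i in range(1, terms + 1):
        lo, hi = i - 1.0, float(i)          # the cell S_i = [i-1, i)
        overlap = sum(max(0.0, min(b, hi) - max(a, lo)) for a, b in intervals)
        total += overlap / 2.0 ** i
    return total

print(lam([(0.0, 60.0)]))   # ~1.0: lambda is finite although mu is not
print(lam([(2.0, 2.0)]))    # 0.0: mu-null sets remain lambda-null
```

Any ν ≪ μ is then also ≪ λ, which is the property the rest of the proof exploits.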
There is a chain rule for Radon–Nikodym derivatives.

Theorem A.79 (Chain rule).³⁵ Let ν and η be σ-finite measures, and suppose
that μ ≪ ν ≪ η. Then

dμ/dη(s) = (dμ/dν)(s) · (dν/dη)(s), a.e. [η].   (A.80)

PROOF. It is easy to see that μ ≪ η, so that dμ/dη exists. For every set A, it
follows from (A.76) that

μ(A) = ∫_A (dμ/dν)(s) dν(s) = ∫_A (dμ/dν)(s) (dν/dη)(s) dη(s).

34This theorem is used in the proofs of Lemmas 2.15 and 2.24. It appears as
Theorem 2 in Appendix 3 of Lehmann (1986) and is attributed to Halmos and
Savage (1949).
35This theorem is used in the proof of Lemma 2.15.
A.6. Absolute Continuity 601

By the uniqueness of Radon–Nikodym derivatives, (A.80) holds. □
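A concrete check of (A.80), with all densities invented for the illustration: take η to be Lebesgue measure on (0, 1), dν/dη(s) = 2s, and dμ/dη(s) = 4s³, so the chain rule forces dμ/dν(s) = 4s³/(2s) = 2s².

```python
# Hedged numerical illustration of the chain rule (A.80); the densities are
# invented for this example: eta = Lebesgue measure on (0, 1),
# d nu/d eta (s) = 2 s, and d mu/d eta (s) = 4 s^3.
def dnu_deta(s): return 2.0 * s
def dmu_deta(s): return 4.0 * s ** 3

def dmu_dnu(s):
    # Radon-Nikodym derivative of mu with respect to nu, obtained by dividing
    # the two densities with respect to the common dominating measure eta.
    return dmu_deta(s) / dnu_deta(s)    # = 2 s^2

def integrate(f, a, b, n=100000):
    # midpoint rule for the integral of f over (a, b) w.r.t. Lebesgue measure
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) * h

# mu((0, 0.7)) computed directly and via the chain-rule factorization
direct = integrate(dmu_deta, 0.0, 0.7)
chained = integrate(lambda s: dmu_dnu(s) * dnu_deta(s), 0.0, 0.7)
print(direct, chained)   # both close to 0.7**4 = 0.2401
```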
The Radon–Nikodym theorem A.74 relates integrals with respect to two different
measures on the same space. There are also theorems that relate integrals with
respect to two different measures on two different spaces.

Theorem A.81. A measurable function f from a measure space (S₁, A₁, μ₁) to a
measurable space (S₂, A₂), f : S₁ → S₂, induces a measure on the range S₂. For
each A ∈ A₂, define μ₂(A) = μ₁(f⁻¹(A)). Integrals with respect to μ₂ can be
written as integrals with respect to μ₁ in the following way: if g : S₂ → ℝ is
integrable, then

∫ g(y) dμ₂(y) = ∫ g(f(x)) dμ₁(x).   (A.82)

PROOF. What needs to be proven is that μ₂ is indeed a measure and that (A.82)
holds. To see that μ₂ is a measure, note that if A, B ∈ A₂ are disjoint, then so too
are f⁻¹(A) and f⁻¹(B). The fact that μ₂ is nonnegative and countably additive
now follows directly from the same facts about μ₁.

If g : S₂ → ℝ is the indicator function of a set A, then

∫ g(y) dμ₂(y) = μ₂(A) = μ₁(f⁻¹(A)) = ∫ I_{f⁻¹(A)}(x) dμ₁(x) = ∫ g(f(x)) dμ₁(x).

That (A.82) is true for all nonnegative simple functions follows by adding the far
ends of this equation (multiplied by positive constants). The monotone convergence
theorem A.52 allows us to extend the equality to all nonnegative integrable
functions. By subtraction, we can extend to all integrable functions. □
Definition A.83. The measure μ₂ in Theorem A.81 is called the measure induced
on (S₂, A₂) by f from μ₁.

If the measure μ₁ in Theorem A.81 is not finite, and the function f is not
one-to-one, the measure μ₂ may not be very interesting.
Example A.84. Let S₁ = ℝ², S₂ = ℝ, μ₁ equal Lebesgue measure on ℝ², and
f(x, y) = x. Let the two σ-fields be Borel σ-fields. The measure μ₂ that f induces
on (S₂, A₂) from μ₁ is the following. If A ∈ A₂ and the Lebesgue measure of A is
0, then μ₂(A) = 0. Otherwise, μ₂(A) = ∞. Although μ₂ is absolutely continuous
with respect to Lebesgue measure, it is not σ-finite. The only functions g that
are integrable with respect to μ₂ are those that are almost everywhere 0.
If μ₁ is σ-finite, there is a way to avoid the problem in Example A.84 by making
use of the following result.

Theorem A.85.³⁶ A measure μ on a space (S, A) is σ-finite if and only if there
exists an integrable function f : S → ℝ such that f > 0, a.e. [μ].

36This theorem is used in the proof of Theorem B.46.


602 Appendix A. Measure and Integration Theory

PROOF. For the "if" part, let f be as in the statement of the theorem. Let 0 <
∫ f(s) dμ(s) = c < ∞. Let A_n = {s : 1/n ≤ f(s) < 1/(n−1)}, for n = 1, 2, ….
We see that A₁ = {s : f(s) ≥ 1} and S = ∪_{n=1}^∞ A_n. We can write

μ(A_n)/n ≤ ∫_{A_n} f(s) dμ(s) ≤ c.

It follows that μ(A_n) ≤ nc for all n. Hence μ is σ-finite.

For the "only if" part, assume that μ is σ-finite, and let {A_n}_{n=1}^∞ be mutually
disjoint sets such that S = ∪_{n=1}^∞ A_n and μ(A_n) < ∞ for all n. Define f(s) to
equal 2⁻ⁿ/μ(A_n) for all s ∈ A_n and for all n such that μ(A_n) > 0. For n such
that μ(A_n) = 0, set f(s) = 0 if s ∈ A_n. Then

∫ f(s) dμ(s) = Σ_{n : μ(A_n) > 0} 2⁻ⁿ ≤ 1,

so f is integrable, and f > 0, a.e. [μ]. □
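The "only if" construction can be made concrete. In this sketch (the names and the particular partition are our own), μ is Lebesgue measure on ℝ, partitioned into A_n = [n−1, n) ∪ [−n, −n+1) with μ(A_n) = 2, and f = 2⁻ⁿ/μ(A_n) on A_n.

```python
# Sketch of the positive integrable function built in Theorem A.85 for
# mu = Lebesgue measure on R, with cells A_n = [n-1, n) u [-n, -n+1).
import math

def cell_index(s):
    # index n >= 1 of the partition cell A_n containing the point s
    return int(math.floor(s)) + 1 if s >= 0 else -int(math.floor(s))

def f(s):
    return 2.0 ** (-cell_index(s)) / 2.0    # 2^{-n} / mu(A_n), with mu(A_n) = 2

# the integral of f over A_n is 2^{-n}, so the total integral is sum_n 2^{-n} = 1
total = sum(2.0 ** (-n) for n in range(1, 60))
print(total)              # ~1.0: f is integrable even though mu(R) is infinite
print(f(100.0) > 0.0)     # True: f is strictly positive everywhere
```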

Example A.86 (Continuation of Example A.84; see page 601). Let h(x, y) =
exp(−[x² + y²]/2). It is known that h is integrable with respect to μ₁ and h
is everywhere strictly positive. Let μ₁′(C) = ∫_C h(x, y) dμ₁(x, y). Then μ₁′ ≪ μ₁
and μ₁ ≪ μ₁′. The measure μ₂′ induced on (S₂, A₂) from μ₁′ by f(x, y) = x
is μ₂′(B) = √(2π) ∫_B exp(−x²/2) dx. A function g : S₂ → ℝ is integrable with
respect to μ₂′ if and only if exp(−x²/2)g(x) is integrable with respect to Lebesgue
measure.
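The change-of-variables identity (A.82) behind this example can be checked by simulation. Below, μ₁′ is normalized by its total mass 2π so that it becomes the standard bivariate normal distribution; f is the coordinate projection, and g(t) = t² is an arbitrary integrable test function (all of these choices are ours, for the check only).

```python
# Hedged Monte Carlo check of (A.82) for the measure of Example A.86 after
# normalization: under N(0,1) x N(0,1), E[g(f(x, y))] should match the
# integral of g against the induced marginal, which is E[g] under N(0, 1).
import random

random.seed(0)
N = 200000
f = lambda x, y: x          # the projection f(x, y) = x from Example A.84
g = lambda t: t * t         # an integrable test function; E[g] = 1 under N(0,1)

samples = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N)]
estimate = sum(g(f(x, y)) for x, y in samples) / N   # right side of (A.82)
print(abs(estimate - 1.0) < 0.05)                    # True up to Monte Carlo error
```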

As a sort of reverse version of Theorem A.81, functions from a measurable
space to a measure space induce measures on the domain space.

Proposition A.87. Let f be a measurable function from a measurable space
(S₁, A₁) to a measure space (S₂, A₂, μ₂), f : S₁ → S₂. Let A_{1f} ⊆ A₁ be the
σ-field generated by f, and let T be the image of f. Suppose that T ∈ A₂. Then f
induces a measure μ₁ on (S₁, A_{1f}) defined by μ₁(A) = μ₂(T ∩ B) if A = f⁻¹(B).
Furthermore, if g : (S₁, A_{1f}) → ℝ is integrable with respect to μ₁, then

∫ g(x) dμ₁(x) = ∫_T h(y) dμ₂(y),   (A.88)

where h satisfying h(f(x)) = g(x) is guaranteed to exist by Theorem A.42.

A.7 Problems

Section A.2:

1. Let S be a set and let A be the collection of all subsets of S that either
are countable or have countable complement. Prove that A is a σ-field.
2. Prove Proposition A.10 on page 575.
A.7. Problems 603

3. Prove Proposition A.13 on page 576. (Hint: First, show that every open
ball in ℝᵏ is the union of countably many open rectangles. Then prove
that the smallest σ-field containing open balls must be the same as the
smallest σ-field containing open rectangles.)
4. Prove that B⁺ defined on page 571 is a σ-field of subsets of the extended
real numbers.
5. Prove Proposition A.15 on page 576.
6. Prove Proposition A.16 on page 576.
7. *Let F : ℝ → ℝ be a nondecreasing function that is continuous from the
right. For each interval (a, b], define μ((a, b]) = F(b) − F(a).
(a) Suppose that {(a_n, b_n]}_{n=1}^∞ is a sequence of disjoint intervals such
that ∪_{n=1}^∞ (a_n, b_n] ⊆ (a, b]. Prove that Σ_{n=1}^∞ μ((a_n, b_n]) ≤ μ((a, b]).
(Hint: Prove it for finite collections and take a limit.)
(b) Suppose that {(a_n, b_n]}_{n=1}^∞ is a sequence of disjoint intervals such
that (a, b] ⊆ ∪_{n=1}^∞ (a_n, b_n]. Prove that Σ_{n=1}^∞ μ((a_n, b_n]) ≥ μ((a, b]).
(Hint: First, prove it for finite collections by induction. For infinite
collections, let μ((a, b]) > ε > 0. Cover a compact interval [a + δ, b]
with finitely many open intervals (a_n, b_n + δ_n) such that |μ((a, b]) −
μ((a + δ, b])| < ε/2 and |Σ_{n=1}^∞ μ((a_n, b_n]) − Σ_{n=1}^∞ μ((a_n, b_n + δ_n])| <
ε/2. This can be done by using continuity from the right.)
(c) Prove that μ is countably additive on the smallest field containing
intervals of the form (a, b]. (Hint: Deal separately with finite and
semi-infinite intervals.)
8. A measure space (S, A, μ) is complete if A ⊆ B ∈ A and μ(B) = 0 implies
A ∈ A. Let (S, C, μ) be a measure space, and let 𝒟 = {D : ∃ A, C ∈
C with D △ A ⊆ C and μ(C) = 0}. For each D ∈ 𝒟, define μ*(D) = μ(A),
where D △ A ⊆ C and μ(C) = 0. Show that μ* is well defined and that
(S, 𝒟, μ*) is a complete measure space.
Section A.3:

9. Prove Proposition A.28 on page 583.
10. Prove Proposition A.32 on page 584.
11. Prove Proposition A.36 on page 584.
12. Let (S, A, μ) be a measure space, and let {f_n}_{n=1}^∞ be a sequence of
measurable functions from S to ℝ. Suppose that, for every ε > 0, Σ_{n=1}^∞ μ({s :
f_n(s) > ε}) < ∞. Prove that lim_{n→∞} f_n(s) = 0, a.e. [μ]. (Hint: Use the
first Borel–Cantelli lemma A.20.)
13. Let (S_j, A_j) for j = 0, 1, 2, 3 be measurable spaces. Let f_j : S₀ → S_j be
measurable and onto for j = 1, 2, 3. Let A_{0,j} be the σ-field generated by
f_j for j = 1, 2. Prove that f₃ is measurable with respect to A_{0,1} ∩ A_{0,2}
if and only if there exist measurable g_j : S_j → S₃ for j = 1, 2 such that
f₃ = g₁(f₁) = g₂(f₂).
604 Appendix A. Measure and Integration Theory

Section A.4:

14. If f ≥ 0 is measurable and ∫ f(s) dμ(s) = 0, then show that f(s) = 0, a.e.
[μ].
15. If f(s) > 0 for all s ∈ A and μ(A) > 0, prove that ∫_A f(s) dμ(s) > 0.
16. Prove Proposition A.45 on page 588. (Hint: Use induction on n.)
17. Prove Proposition A.49 on page 588. (Hint: For part 4, use Problem 14 on
page 604.)
18. Let S = ℝ and let A be the σ-field of sets that are either countable or have
countable complement. (See Problem 1 on page 602.) Let μ be Lebesgue
measure. Suppose that f : S → ℝ is integrable. Prove that f = 0, a.e. [μ].
19. Let (S, A) be a measurable space, and let f be a bounded measurable
function. (That is, there exist a and b such that a ≤ f(x) ≤ b for all
x ∈ S.)
(a) Let μ be a measure on (S, A) such that μ(S) = 1. Prove that
a ≤ ∫ f(x) dμ(x) ≤ b.
(b) Let ε > 0. Prove that there exists a simple function g such that, for
all measures μ satisfying μ(S) = 1, |∫ f(x) dμ(x) − ∫ g(x) dμ(x)| < ε.
20. Prove the following alternative type of monotone convergence theorem:
Let {f_n}_{n=1}^∞ be a sequence of integrable functions such that f_n(x) converges
monotonically to f(x) a.e. [μ]. Then ∫ f(x) dμ(x) is defined and
∫ f(x) dμ(x) = lim_{n→∞} ∫ f_n(x) dμ(x). (Hint: Use the dominated convergence
theorem A.57 on the positive parts of f_n and the monotone convergence
theorem A.52 on the negative parts, or vice versa, depending on
whether the convergence is from above or below.)
21. Let (S, A, μ) be a measure space, let {g_n}_{n=1}^∞ be a sequence of integrable
functions that converges a.e. [μ], and let g be another integrable function.
Suppose that, for all C ∈ A,

lim_{n→∞} ∫_C g_n(s) dμ(s) = ∫_C g(s) dμ(s).

Prove that lim_{n→∞} g_n = g, a.e. [μ].

Section A.5:

22. Prove Proposition A.66 on page 595.
23. Let (S₁, A₁) and (S₂, A₂) be measurable spaces, and define the product
space (S₁ × S₂, A₁ ⊗ A₂). Prove that A × B ∈ A₁ ⊗ A₂ with A ⊆ S₁ and
B ⊆ S₂ implies A ∈ A₁ and B ∈ A₂. (Hint: For each C ∈ A₁ ⊗ A₂,
define C_y = {x : (x, y) ∈ C}. Then let C = {C : C_y ∈ A₁, for all y ∈ S₂}.
Prove that C is a σ-field containing all product sets.)
A.7. Problems 605

Section A.6:

24. Suppose that μ₁ ≪ μ₂ and μ₂ ≪ μ₁.
(a) Show that a.e. [μ₁] means the same thing as a.e. [μ₂].
(b) Show that

dμ₁/dμ₂(s) = (dμ₂/dμ₁(s))⁻¹, a.e. [μ₁] and a.e. [μ₂].

25. If μ₁ is a measure and f is a nonnegative measurable function, then define
the measure μ₂ by μ₂(A) = ∫_A f(s) dμ₁(s). Prove that μ₂ ≪ μ₁.
26. Let λ be Lebesgue measure on ℝ and define

μ(A) = λ(A) + c I_A(x₀)

for some fixed c > 0 and x₀ ∈ ℝ.
(a) Prove that μ is a measure.
(b) Show that λ ≪ μ, but that μ is not absolutely continuous with respect to λ.
(c) Show that ∫ f(x) dμ(x) = ∫ f(x) dλ(x) + c f(x₀).
27. *In the proof of Theorem A.74, we proved the Hahn decomposition theorem
for signed measures, namely that if ν is a signed measure on (S, A), then
there exists A ∈ A such that A is a positive set and Aᶜ is a negative set
relative to ν.
(a) Let ν be a signed measure on (S, A). Suppose that there are two different
Hahn decompositions. That is, A₁ and A₂ are both positive sets
and A₁ᶜ and A₂ᶜ are both negative sets. Prove that every measurable
subset B of A₁ ∩ A₂ᶜ has ν(B) = 0.
(b) If ν is a signed measure on (S, A), use the Hahn decomposition theorem
to create definitions for the following:
i. The integral with respect to ν of a measurable function.
ii. When a function is integrable with respect to ν.
(c) If there are two different Hahn decompositions for a signed measure
ν, prove that the definition of integral with respect to ν produces the
same value for both decompositions.
28. In the statement of Proposition A.87 on page 602, prove that the measure
μ₁ is well defined. (That is, suppose that A = f⁻¹(B₁) = f⁻¹(B₂), and
prove that μ₂(B₁ ∩ T) = μ₂(B₂ ∩ T).) Also prove that μ₁ is a measure.
29. In the statement of Proposition A.87 on page 602, assuming that μ₁ is a
well-defined measure, prove that (A.88) holds.
APPENDIX B

Probability Theory

This appendix builds on Appendix A but is otherwise self-contained. It contains
an introduction to the theory of probability. The first section is an overview.
It could serve either as a refresher for those who have previously studied the
material or as an informal introduction for those who have never studied it.

B.1 Overview

B.1.1 Mathematical Probability

The measure-theoretic definition of probability is that a measure space (S, A, μ)
is called a probability space and μ is called a probability if μ(S) = 1. Each element
of A is called an event. A measurable function X from S to some other space
(X, B) is called a random quantity. The most popular type of random quantity
is a random variable, which occurs when X is ℝ with the Borel σ-field. The
probability measure μ_X induced on (X, B) by X from μ is called the distribution
of X.

Example B.1. Let S = X = ℝ with Borel σ-field. Let f be a nonnegative
function such that ∫ f(x) dx = 1. Define μ(A) = ∫_A f(x) dx and X(s) = s. Then
X is a continuous random variable with density f, and μ_X = μ. If we let ν denote
Lebesgue measure, then μ_X ≪ ν with dμ_X/dν = f.

Example B.2. Let S = ℝ with Borel σ-field. Let X = {x₁, x₂, …}, a countable
set. Let f be a nonnegative function defined on X such that Σ_{i=1}^∞ f(x_i) = 1.
Define μ(A) = Σ_{i : x_i ∈ A} f(x_i) and X(s) = s. Then X is a discrete random
variable with probability mass function f, and μ_X = μ. If we let ν denote counting
measure on X, then μ_X ≪ ν with dμ_X/dν = f.
B.1. Overview 607

In both of these examples, we will say that f is the density of X with respect
to ν.

When there is one probability space (S, A, μ) from which all other probabilities
are induced by way of random quantities, then the probability in that one space
will be denoted Pr. So, for example, if μ_X is the distribution of a random quantity
X and if B ∈ B, then Pr(X ∈ B) = μ(X⁻¹(B)) = μ_X(B).

The expected value or mean or expectation of a random variable X is defined
(and denoted) as E(X) = ∫ x dμ_X(x), if the integral exists, where μ_X is the
distribution of X. If X is a vector of random variables (called a random vector),
then E(X) will stand for the vector with coordinates equal to the means of the
coordinates of X.

The (in)famous law of the unconscious statistician, B.12, is very useful for
calculating means of functions of random quantities. It says that E[f(X)] =
∫ f(x) dμ_X(x). For example, the variance of a random variable X with mean c is
Var(X) = E([X − c]²), which can be calculated as ∫ (x − c)² dμ_X(x). The covariance
between two random variables X and Y with means c_X and c_Y, respectively,
is Cov(X, Y) = E([X − c_X][Y − c_Y]).
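A quick simulation (ours, not the book's) of these formulas: with X ~ Uniform(0, 1) and f(x) = x², the law of the unconscious statistician gives E[f(X)] = ∫₀¹ x² dx = 1/3, and Var(X) = ∫₀¹ (x − 1/2)² dx = 1/12.

```python
# Monte Carlo illustration of E[f(X)] = integral of f d mu_X and of the
# variance formula, for X ~ Uniform(0, 1) and f(x) = x^2.
import random

random.seed(1)
xs = [random.random() for _ in range(200000)]
mean_f = sum(x * x for x in xs) / len(xs)           # estimate of E[X^2] = 1/3
var_x = sum((x - 0.5) ** 2 for x in xs) / len(xs)   # estimate of Var(X) = 1/12
print(abs(mean_f - 1.0 / 3.0) < 0.01, abs(var_x - 1.0 / 12.0) < 0.01)
```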

B.1.2 Conditioning

We begin with a heuristic derivation of the important concepts using the special
case of discrete random quantities. Afterwards, we define the important terms in
a more rigorous way.

Consider the case of two random quantities X and Y, each of which assumes at
most countably many distinct values, X ∈ X = {x₁, …} and Y ∈ Y = {y₁, …}.
Let p_{ij} = Pr(X = x_i, Y = y_j). Then

Pr(X = x_i) = Σ_{j=1}^∞ p_{ij} = p_{i·}, and

Pr(Y = y_j) = Σ_{i=1}^∞ p_{ij} = p_{·j}.

These equations give the marginal distributions of X and Y, respectively. We can
define the conditional probability that X = x_i given Y = y_j by

Pr(X = x_i | Y = y_j) = p_{ij}/p_{·j} = p_{i|j}.

Note that, for each j, Σ_{i=1}^∞ p_{i|j} = 1, so that the numbers {p_{i|j}}_{i=1}^∞ define a
probability distribution on X known as the conditional distribution of X given Y = y_j.
We can calculate the conditional mean (expectation) of a function f of X given
Y = y_j by

E(f(X) | Y = y_j) = Σ_{i=1}^∞ f(x_i) p_{i|j}.

From the conditional distribution, we could define a measure on (X, 2^X) by

μ_{X|Y}(A | y_j) = Σ_{x_i ∈ A} p_{i|j}.
608 Appendix B. Probability Theory

It follows that, for each j, E(f(X) | Y = y_j) = ∫ f(x) dμ_{X|Y}(x | y_j). We can think
of this conditional mean as a function of y:

g(y) = E(f(X) | Y = y).

The marginal distribution of Y is a measure on (Y, 2^Y) defined by

μ_Y(B) = Σ_{y_j ∈ B} p_{·j}, for all B ∈ 2^Y.

Similarly, the joint distribution of (X, Y) induces a measure on (X × Y, 2^X ⊗ 2^Y)
by μ_{X,Y}(C) = Σ_{(x_i, y_j) ∈ C} p_{ij}, for all C ∈ 2^X ⊗ 2^Y. The point of all of these
measures and distributions is the following. We can write the integral of g over
any set B ∈ 2^Y as

∫_B g(y) dμ_Y(y) = Σ_{y_j ∈ B} g(y_j) p_{·j} = Σ_{y_j ∈ B} Σ_{i=1}^∞ f(x_i) p_{i|j} p_{·j}

= ∫ f(x) I_B(y) dμ_{X,Y}(x, y) = E(f(X) I_B(Y)).

The overall equation

∫_B g(y) dμ_Y(y) = E(f(X) I_B(Y))

will be used as the property that defines conditional expectation in general.
Through the definition of conditional expectation, we will define conditional
probability and conditional distributions in general.
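The derivation above can be replayed numerically on a made-up joint pmf (the table p below is invented for the illustration): compute the marginals p_{·j}, the conditional means g(y_j) = E(f(X)|Y = y_j), and check the defining property ∫_B g dμ_Y = E(f(X) I_B(Y)) for an event B.

```python
# Discrete check of the defining property of conditional expectation,
# using an invented joint pmf p[i][j] = Pr(X = x_i, Y = y_j).
xs = [0.0, 1.0, 2.0]
ys = [10.0, 20.0]
p = [[0.1, 0.2],
     [0.3, 0.1],
     [0.1, 0.2]]                       # entries sum to 1

f = lambda x: x * x
p_dot_j = [sum(p[i][j] for i in range(3)) for j in range(2)]   # marginal of Y
g = [sum(f(xs[i]) * p[i][j] / p_dot_j[j] for i in range(3))    # E(f(X)|Y=y_j)
     for j in range(2)]

B = {20.0}                              # an event for Y
lhs = sum(g[j] * p_dot_j[j] for j in range(2) if ys[j] in B)
rhs = sum(f(xs[i]) * p[i][j] for i in range(3) for j in range(2) if ys[j] in B)
print(lhs, rhs)                         # the two sides agree
```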
mean and
Theorem B.21 says that, in general, if a random variable X has finite
A, then a function 9 : S -+ 1R exists which is measura ble
if C is a sub-a-field of
with respect to the a-field C and such that

E(XfB) = fa g(s)d/-L(s), for all B E C. (B.3)

random
This is the general version of what we worked out above for discrete
s in which C was the u-field generat ed by Y. We will use the symbol
variable
that E(XIC)
E(XIC) to stand for the function g. The two importa nt features
with respect to the u-field C and that it satisfies
possesses are that it is measura ble
that equals E(XIC) a.s. [/-L] will also satisfy (B.3), so there
(B.3). Any function
many function s that satisfy the definitio n of conditio nal expecta tion. All
may be
When we say
such functions are called versions of the conditio nal expecta tion.
E(XIC), we will mean that it is a version of E(XIC).
that a random variable equals
we can set B = S in (B.3) and the equatio n become s E(X) =
Notice that
generali zation
E[E(XIC)]. This result is called the law oftotal probability. A useful
is given in Theorem B.70.
symbol
lf C is the a-field generat ed by another random quantit y Y, then the
of E(XIC). For the case in which C is the a-field
E(XIY) is usually used instead
ed by Y, some special notation is introduc ed. We saw in Theorem A.42
generat
by Y if and
that a function is measura ble with respect to the a-field generat ed
B.1. Overview 609

only if it is a function of Y. Hence, there is a function h defined on the space
Y where Y takes its values such that E(X|Y) = h(Y). We use the notation
E(X|Y = t) to stand for h(t). (See Corollary B.22.) In this notation, we have, for
all B = Y⁻¹(C) ∈ C, E(X I_B) = ∫_C E(X|Y = t) dμ_Y(t), where μ_Y is the distribution of Y.
Example B.4. Let S = ℝ² and let A be the two-dimensional Borel sets. Let

μ(A) = ∫_A (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₁ ds₂.

Suppose that X(s) = s₁ and Y(s) = s₂ when s = (s₁, s₂). Now E(|X|) = √(2/π) < ∞.
We claim that g(s) ≡ s₂/2 and h(t) ≡ t/2 satisfy the conditions required to be
E(X|Y)(s) and E(X|Y = t), respectively. First, note that the σ-field generated
by Y is A_Y = {ℝ × C : C is Borel measurable}, and μ_Y is the measure with
density exp(−t²/2)/√(2π). It is clear that any measurable function of s₂ alone is
A_Y measurable. Let B = ℝ × C, so that E(X I_B) equals

∫_{−∞}^∞ ∫_C s₁ (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₂ ds₁

= ∫_C ∫_{−∞}^∞ s₁ (√2/√(3π)) exp{−(2/3)(s₁ − s₂/2)²} (1/√(2π)) exp{−s₂²/2} ds₁ ds₂

= ∫_C (s₂/2) (1/√(2π)) exp{−s₂²/2} ds₂

= ∫_C ∫_{−∞}^∞ (s₂/2) (√2/√(3π)) exp{−(2/3)(s₁ − s₂/2)²} (1/√(2π)) exp{−s₂²/2} ds₁ ds₂

= ∫_B (s₂/2) (1/(√3 π)) exp{−(2/3)(s₁² + s₂² − s₁s₂)} ds₁ ds₂ = ∫_B g(s) dμ(s).

Note also that the third line in the above string equals ∫_C h(s₂) dμ_Y(s₂).
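A simulation (ours) of the claim in Example B.4: sampling Y ~ N(0, 1) and then X | Y = y ~ N(y/2, 3/4) reproduces the bivariate normal above (unit variances, correlation 1/2), and the defining property E(X I_B) = E(g(Y) I_B) with g(s) = s/2 can be checked for, say, B = ℝ × (0, ∞).

```python
# Monte Carlo check of E(X I_B) = E(g(Y) I_B) with g(s) = s/2 for the
# bivariate normal of Example B.4, sampled via its conditional structure.
import random

random.seed(2)
N = 200000
lhs = rhs = 0.0
for _ in range(N):
    y = random.gauss(0, 1)                          # marginal of Y is N(0, 1)
    x = random.gauss(y / 2.0, (3.0 / 4.0) ** 0.5)   # X | Y=y ~ N(y/2, 3/4)
    if y > 0:                                       # the event B = R x (0, inf)
        lhs += x
        rhs += y / 2.0
print(abs(lhs - rhs) / N < 0.01)                    # True up to simulation error
```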
It is easy to see that if X is already measurable with respect to C, then
E(X|C) = X.
Conditional probability turns out to be the special case of conditional
expectation in which X = I_A. That is, we define Pr(A|C) = E(I_A|C). A conditional
probability is regular if Pr(·|C)(s) is a probability measure for all s. It turns out
that, under very general conditions (see Theorem B.32), we can choose the
functions Pr(A|C)(·) in such a way that they are regular conditional probabilities. In
particular, the space (X, B) needs to be sufficiently like the real numbers with the
Borel σ-field. Such spaces are called Borel spaces, as defined in Definition B.31. All
of the most common spaces are Borel spaces, in particular ℝᵏ for all finite k and
ℝ^∞. For those readers with more mathematical background, complete separable
metric spaces are also Borel spaces. Also, finite and countable products of Borel
spaces are Borel spaces.

In the future, we will assume that all versions of conditional probabilities are
regular when they are on Borel spaces. If C is the σ-field generated by Y, then
Pr(A|Y = y) will be used to stand for E(I_A|Y = y).

If X : S → X is a random quantity, its conditional distribution is the collection
of conditional probabilities on X induced from the restriction of conditional
probabilities on S to the σ-field generated by X. If the Pr(·|C) are regular conditional
610 Appendix B. Probability Theory

probabilities, then we say that the version of the conditional distribution of X
given C is a regular conditional distribution. When we refer to a conditional
distribution without the word "version," we will mean a version of the conditional
distribution. Occasionally, we will need to choose a version that satisfies some
other condition. In those cases, we will try to be explicit about versions.

Because conditional distributions are probability measures, many of the theorems
from Appendix A which apply to such measures apply to conditional
distributions. For example, the monotone convergence theorem A.52 and the
dominated convergence theorem A.57 apply to conditional means because limits of
measurable functions are still measurable. Also, most of the properties of
probability measures from this appendix apply as well.

We now turn our attention to the existence and calculation of densities for
conditional distributions. If the joint distribution of two random quantities has
a density with respect to a product measure, then the conditional distributions
have densities that can be calculated in the usual way, as the joint density divided
by the marginal density of the conditioning variable. Theorem B.46 allows us to
extend this result to joint distributions that are not absolutely continuous with
respect to product measures, such as when one of the quantities is a function of
the other. Here, we merely give an example of how such conditional densities are
calculated.
Example B.5. Let X = (X₁, X₂) have a bivariate normal distribution with density

f_X(x) = (1/(2πσ₁σ₂√(1−ρ²))) exp{−(1/(2(1−ρ²)))[(x₁−μ₁)²/σ₁² − 2ρ(x₁−μ₁)(x₂−μ₂)/(σ₁σ₂) + (x₂−μ₂)²/σ₂²]}

with respect to Lebesgue measure on ℝ². The marginal density of Y = X₁ + X₂
with respect to Lebesgue measure is

f_Y(y) = (1/(√(2π) σ)) exp(−(1/(2σ²))[y − (μ₁ + μ₂)]²),

where σ² = σ₁² + σ₂² + 2ρσ₁σ₂. The pair (X, Y) does not have a joint density with
respect to Lebesgue measure on ℝ³, but it does have a joint density with respect
to the measure ν on ℝ³ defined as follows. For each A ⊆ ℝ³, let A′ = {(x₁, x₂) :
(x₁, x₂, x₁ + x₂) ∈ A}. Let ν(A) = λ₂(A′), where λ_k is Lebesgue measure on ℝᵏ
for k = 1, 2. Then f_{X,Y}(x, y) = f_X(x) is the joint density of (X, Y) with respect
to ν, and

f_X(x)/f_Y(y) = (1/(√(2π) σ*)) exp(−(1/(2σ*²))(x₁ − μ₁ − [σ₁² + ρσ₁σ₂](y − μ₁ − μ₂)/σ²)²),

where σ*² = σ₁² − [σ₁² + ρσ₁σ₂]²/σ², if y = x₁ + x₂, is the conditional density of X
given Y = y with respect to the measure ν_{X|Y}(A|y) = λ₁(A′_y), where A′_y = {x₁ :
(x₁, y − x₁) ∈ A}.
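The marginal mean and variance of Y = X₁ + X₂ claimed in Example B.5 can be checked by simulation; the parameter values below are invented for the check, and ρ denotes the correlation of (X₁, X₂).

```python
# Simulation check that Y = X1 + X2 has mean mu1 + mu2 and variance
# sigma1^2 + sigma2^2 + 2*rho*sigma1*sigma2 under the bivariate normal.
import random

random.seed(3)
mu1, mu2, s1, s2, rho = 1.0, -2.0, 1.5, 0.5, 0.3
N = 200000
ys = []
for _ in range(N):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = mu1 + s1 * z1
    x2 = mu2 + s2 * (rho * z1 + (1 - rho ** 2) ** 0.5 * z2)  # corr(x1, x2) = rho
    ys.append(x1 + x2)
m = sum(ys) / N
v = sum((y - m) ** 2 for y in ys) / N
print(abs(m - (mu1 + mu2)) < 0.02,
      abs(v - (s1 ** 2 + s2 ** 2 + 2 * rho * s1 * s2)) < 0.05)
```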
The concept of conditional independence will turn out to be central to the
development of statistical models. A collection {X_n}_{n=1}^∞ of random quantities is
B.1. Overview 611

conditionally independent given another quantity Y if the conditional
distribution (given Y) of every finite subset is a product measure. If, in addition, Y is
constant almost surely, we say that {X_n}_{n=1}^∞ are independent. We will call random
quantities (conditionally) IID if they are (conditionally) independent and
they all have the same conditional distribution.

B.1.3 Limit Theorems

There are three types of convergence which we consider for sequences of random
quantities: almost sure convergence, convergence in probability, and convergence
in distribution. The weakest of these is the last. (See Theorem B.90.) A sequence
{X_n}_{n=1}^∞ converges in distribution to X if lim_{n→∞} E(f(X_n)) = E(f(X)) for
every bounded continuous function f. We denote this type of convergence X_n →ᴰ
X. If X = ℝ, a more common way to express X_n →ᴰ X is that lim_{n→∞} F_n(x) =
F(x) for all x at which F is continuous, where F_n is the CDF of X_n and F is the
CDF of X.¹

If X is a metric space with metric d, we say that a sequence {X_n}_{n=1}^∞ converges
in probability to X if, for every ε > 0, lim_{n→∞} Pr(d(X_n, X) > ε) = 0. We write
this as X_n →ᴾ X. Almost sure convergence is the same as almost everywhere
convergence of functions, and it is the strongest of the three. That is, X_n → X,
a.s., means that {s : X_n(s) does not converge to X(s)} ⊆ E with Pr(E) = 0.

A popular method for proving convergence in distribution involves the use of
characteristic functions. The characteristic function of a random vector X is the
complex-valued function

φ_X(t) = E(exp[i tᵀX]).

It is easy to see that the characteristic function exists for every random vector
and has complex absolute value at most 1 for all t. Other facts that follow directly
from the definition are the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(i tᵀb).
If X and Y are independent, φ_{X+Y} = φ_X φ_Y.
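Both facts can be checked against the empirical characteristic function. The sketch below (the constants a = 2, b = 3 and the point t = 0.7 are our own choices) uses X ~ N(0, 1), whose characteristic function exp(−t²/2) is classical.

```python
# Empirical check that phi_Y(t) = phi_X(a t) exp(i t b) for Y = a X + b,
# with X standard normal so that phi_X(u) = exp(-u^2 / 2).
import cmath, random

random.seed(4)
xs = [random.gauss(0, 1) for _ in range(100000)]
a, b, t = 2.0, 3.0, 0.7

def ecf(samples, t):
    # empirical characteristic function: the average of exp(i t x)
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

phi_Y = ecf([a * x + b for x in xs], t)
predicted = cmath.exp(1j * t * b) * cmath.exp(-(a * t) ** 2 / 2)
print(abs(phi_Y - predicted) < 0.02)    # True up to Monte Carlo error
```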
The importance of characteristic functions is that they characterize distributions
(see the uniqueness theorem B.106) and they are "continuous" as a function
of the distribution in the sense of convergence in distribution (see the continuity
theorem B.93).

Two of the more useful limit theorems are the weak law of large numbers B.95
and the central limit theorem B.97. If {X_n}_{n=1}^∞ are IID random variables with
finite mean μ, then the weak law of large numbers says that the sample average
X̄_n = Σ_{i=1}^n X_i/n converges in probability to μ. If, in addition, they have finite
variance σ², the central limit theorem B.97 says that √n(X̄_n − μ) →ᴰ N(0, σ²),
the normal distribution with mean 0 and variance σ².
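A small simulation (ours) of both statements for IID Uniform(0, 1) variables, which have μ = 1/2 and σ² = 1/12: the centered, scaled averages √n(X̄_n − μ) should have mean near 0 and variance near σ².

```python
# Simulation of the weak law of large numbers and the central limit theorem
# for IID Uniform(0, 1) variables (mean 1/2, variance 1/12).
import random

random.seed(5)
n, reps = 400, 2000
zs = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n   # sample average
    zs.append(n ** 0.5 * (xbar - 0.5))                  # sqrt(n)(Xbar - mu)
print(abs(sum(zs) / reps) < 0.03)                       # centers near 0
var_z = sum(z * z for z in zs) / reps
print(abs(var_z - 1.0 / 12.0) < 0.02)                   # variance near 1/12
```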

¹See Problem 25 on page 664. If X = ℝᵏ, the same idea can be used. That
is, X_n →ᴰ X if and only if the joint CDFs F_n of X_n converge to the joint CDF
F of X at all points at which F is continuous. Since we will not need to use this
characterization, we will not prove it.
612 Appendix B. Probability Theory

B.2 Mathematical Probability

In this section, we will present the basic framework of the measure-theoretic
probability calculus. Most of the concepts, like random quantities, distributions,
and so forth, will be special cases of measure-theoretic concepts introduced in
Appendix A.

B.2.1 Random Quantities and Distributions

We begin by introducing the basic building blocks of probability theory.

Definition B.6. A probability space is a measure space (S, A, μ) with μ(S) = 1.
Each element of A is called an event. If (S, A, μ) is a probability space, (X, B)
is a measurable space, and X : S → X is measurable, then X is called a random
quantity. If X = ℝ and B is the Borel or Lebesgue σ-field, then X is called a
random variable. Let μ_X be the probability measure induced on (X, B) by X
from μ (see Definition A.83). This probability measure is called the distribution
of X. The distribution of X is said to be discrete if there exists a countable set
A ⊆ X such that μ_X(A) = 1. The distribution of X is continuous if μ_X({x}) = 0
for all x ∈ X.

The distribution of X is easily seen to be equivalent to the restriction of μ to
the σ-field generated by X, A_X.

When there is one probability space from which all other probabilities are
induced by way of random quantities, then the probability in that one space will
be denoted Pr. So, for example, in the above definition of the distribution of a
random quantity X, if B ∈ B, then Pr(X ∈ B) = μ(X⁻¹(B)) = μ_X(B).
The distribution of a random variable can be described by its cumulative
distribution function.

Definition B.7. A function F is a (cumulative) distribution function (CDF) if
it has the following properties:
F is nondecreasing;
lim_{x→−∞} F(x) = 0;
lim_{x→∞} F(x) = 1;
F is continuous from the right.

Proposition B.8. If X is a random variable, then the function F_X(x) = Pr(X ≤
x) is a CDF. In this case, F_X is called the CDF of X.

A distribution function F can be used to create a measure on (ℝ, B) as follows.
Set μ((a, b]) = F(b) − F(a), and extend this to the whole σ-field using the
Carathéodory extension theorem A.22.²

We can also construct a distribution function from a probability measure on
the real numbers. If μ is a probability measure on (ℝ, B¹), the CDF associated
with it is F(x) = μ((−∞, x]). If f is a Borel measurable function from ℝ to ℝ,
we will write ∫ f(x) dF(x) and ∫ f(x) dμ(x) interchangeably.

²See the discussion on page 581 and Problem 7 on page 603.


B.2. Mathematical Probability 613

If μ is a probability measure on (ℝⁿ, Bⁿ), a joint CDF can be defined as

F(x₁, …, x_n) = μ((−∞, x₁] × ⋯ × (−∞, x_n]),

the measure of an orthant. For every joint CDF, there is a random vector X with
that CDF, and we call the CDF F_X.

Definition B.9. Let (S, A, μ) be a probability space, and let (X, B, ν) be a
measure space. Suppose that X : S → X is measurable. Let μ_X be the measure
induced on (X, B) by X from μ. Suppose that μ_X ≪ ν. Then we call the Radon–
Nikodym derivative f_X = dμ_X/dν the density of X with respect to ν.

Proposition B.10. If h : X → ℝ is measurable and f_X is the density of X
with respect to ν, then ∫ h(x) dF_X(x) = ∫ h(x) f_X(x) dν(x).

Definition B.11. If X is a random variable with CDF F_X(·), then the expected
value (or mean, or expectation) of X is E(X) = ∫ x dF_X(x). If X is a random
vector, then E(X) will stand for the vector with coordinates equal to the means
of the coordinates of X.

The following theorem is often called the law of the unconscious statistician,
because some people forget that it is not really the definition of expected value.

Theorem B.12.³ If X : S → X is a random quantity and f : X → ℝ is a
measurable function, then E[f(X)] = ∫ f(x) dμ_X(x), where μ_X is the distribution
of X.

PROOF. If we let Y = f(X), then Y induces a measure (with CDF F_Y) on
(ℝ, B¹) according to Theorem A.81. The definition of E(Y) is ∫ y dF_Y(y), and
Theorem A.81 says that ∫ y dF_Y(y) = ∫ f(x) dF_X(x). □

Definition B.13. If X is a random variable with finite mean c, then the variance
of X is the mean of (X − c)² and is denoted Var(X). If X is a random vector
with finite mean vector c, then the covariance matrix of X is the mean of (X −
c)(X − c)ᵀ and is also denoted Var(X). The covariance of two random variables
X and Y with finite means c_X and c_Y is E([X − c_X][Y − c_Y]) and is denoted
Cov(X, Y).

It is possible for a random variable to have finite mean and infinite variance.

Proposition B.14. If X has finite mean μ, then Var(X) = E(X²) − μ².

B.2.2 Some Useful Inequalities

Although there are theoretical formulas for calculating means of functions of
random variables, often they are not analytically tractable. We may, on the other
hand, only need to know that a mean is less than some value. For this reason, we
present some well-known inequalities concerning means of random variables.

³This theorem is used in making sense of the notation E_θ when introducing parametric models.
614 Appendix B. Probability Theory

Theorem B.15 (Markov inequality).⁴ Suppose that X is a nonnegative random variable with finite mean μ. Then, for all c > 0, Pr(X ≥ c) ≤ μ/c.

PROOF. Let F be the CDF of X. Then, we can write

μ = ∫ x dF(x) ≥ ∫_[c,∞) x dF(x) ≥ c ∫_[c,∞) dF(x) = c Pr(X ≥ c).

Divide the extreme parts by c to get the result. □
The following well-known inequality follows trivially from the Markov inequality B.15.
Corollary B.16 (Tchebychev's inequality).⁵ Suppose that X is a random variable with finite variance σ² and finite mean μ. Then, for all c > 0,

Pr(|X − μ| ≥ c) ≤ σ²/c².
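Both bounds can be checked empirically. Applied to the empirical distribution of a sample (with the sample mean and variance in place of μ and σ²), the Markov and Tchebychev inequalities hold exactly, so the assertions below are guaranteed; the exponential sample is a hypothetical choice.

```python
import random

# Empirical check of the Markov and Tchebychev bounds on a hypothetical
# exponential sample. Both inequalities hold exactly for the empirical
# distribution itself, with the empirical mean and variance.
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(100_000)]
n = len(sample)
mean = sum(sample) / n
var = sum((x - mean) ** 2 for x in sample) / n

for c in (1.0, 2.0, 4.0):
    markov_lhs = sum(x >= c for x in sample) / n            # Pr(X >= c)
    assert markov_lhs <= mean / c                           # Markov
    cheb_lhs = sum(abs(x - mean) >= c for x in sample) / n  # Pr(|X - mu| >= c)
    assert cheb_lhs <= var / c ** 2                         # Tchebychev
```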
Another well-known inequality involves convex functions.⁶ The proof of this theorem resembles the proofs in Ferguson (1967) and Berger (1985).
Theorem B.17 (Jensen's inequality).⁷ Let g be a convex function defined on a convex subset X of ℝᵏ and suppose that Pr(X ∈ X) = 1. If E(X) is finite, then E(X) ∈ X and g(E(X)) ≤ E(g(X)).
PROOF. First, we prove that E(X) ∈ X by induction on the dimension of X. Without loss of generality, we can assume that E(X) = 0, since we can subtract E(X) from X and from every element of X, and E(X) ∈ X if and only if 0 ∈ X − E(X). If k = 0, then X = {0} and E(X) = 0. Suppose that 0 ∈ X for all X with dimension strictly less than m ≤ k. Now suppose that X has dimension m and 0 ∉ X. Since X and {0} are disjoint convex sets, the separating hyperplane theorem C.5 says that there is a nonzero vector v and a constant c such that, for every x ∈ X, vᵀx ≤ c and 0 ≥ c.⁸ If we let Y = vᵀX, then we have Pr(Y ≤ c) = 1 and E(Y) = 0 ≥ c. It follows that Pr(Y = c) = 1 and c = 0. Hence, X lies in the (m − 1)-dimensional convex set Z = X ∩ {x : vᵀx = 0}. It follows that 0 ∈ Z ⊆ X, a contradiction.
Next, we prove the inequality by induction on k. For k = 0, E(g(X)) = g(E(X)), since X is degenerate. Suppose that the inequality holds for all dimensions up to m − 1 < k. Let X have dimension m. Define the subset of ℝ^{m+1},

X′ = {(x, z) : x ∈ X, z ∈ ℝ, and g(x) ≤ z}.

Let (x₁, z₁) and (x₂, z₂) be in X′ and define

(y, w) = (αx₁ + (1 − α)x₂, αz₁ + (1 − α)z₂).
⁴This theorem is used in the proofs of Corollary B.16 and Lemma 1.61.
⁵This corollary is used in the proof of Theorem 1.59.
⁶Let X be a linear space. A function f : X → ℝ is convex if f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y) for all x, y ∈ X and all λ ∈ [0, 1].
⁷This theorem is used in the proofs of Lemma B.114 and Theorems B.118 and 3.20.
⁸The symbol vᵀ stands for the transpose of the vector v.

Since αg(x₁) + (1 − α)g(x₂) ≥ g(y) and w ≥ αg(x₁) + (1 − α)g(x₂), it follows that (y, w) ∈ X′, so X′ is convex. It is also clear that (E(X), g(E(X))) is a boundary point of X′. The supporting hyperplane theorem C.4 says that there is a vector v = (v_x, v_z) such that, for all (x, z) ∈ X′, v_xᵀx + v_z z ≥ v_xᵀE(X) + v_z g(E(X)). Since (x, z₁) ∈ X′ implies (x, z₂) ∈ X′ for all z₂ > z₁, it cannot be that v_z < 0, since then lim_{z→∞} v_xᵀx + v_z z = −∞, a contradiction. Since (x, g(x)) ∈ X′ for all x ∈ X, it follows that v_xᵀX + v_z g(X) ≥ v_xᵀE(X) + v_z g(E(X)), from which we conclude

v_z g(E(X)) ≤ v_xᵀ[X − E(X)] + v_z g(X).     (B.18)

Taking expectations of both sides of this gives v_z g(E(X)) ≤ v_z E(g(X)). If v_z > 0, the proof is complete. If v_z = 0, then (B.18) becomes 0 ≤ v_xᵀ[X − E(X)], which implies v_xᵀ[X − E(X)] = 0 with probability 1. Hence X lies in an (m − 1)-dimensional space, and the induction hypothesis finishes the proof. □
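Since a finite sample defines a probability measure with finite support, Jensen's inequality g(E(X)) ≤ E(g(X)) holds exactly for sample averages; a quick check with the convex function g = exp on hypothetical Gaussian data:

```python
import math
import random

# Jensen's inequality g(E(X)) <= E(g(X)) for the convex g = exp, checked
# on a hypothetical empirical distribution. The inequality is exact for
# any finite sample, since a sample is a finitely supported measure.
random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]
mean_x = sum(sample) / len(sample)
mean_gx = sum(math.exp(x) for x in sample) / len(sample)
assert math.exp(mean_x) <= mean_gx
```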
The famous Cauchy-Schwarz inequality for vectors⁹ has a probabilistic version.
Theorem B.19 (Cauchy-Schwarz inequality).¹⁰ Let X₁ and X₂ be two random vectors of the same dimension such that E(‖X_i‖²) is finite for i = 1, 2. Then

[E(|X₁ᵀX₂|)]² ≤ E(‖X₁‖²) E(‖X₂‖²).     (B.20)

PROOF. Let Z = 1 if X₁ᵀX₂ ≥ 0 and Z = −1 if X₁ᵀX₂ < 0. Let Y = ‖X₁ + cZX₂‖², where c = −√(E‖X₁‖²/E‖X₂‖²). Then Y ≥ 0 and Z² = 1. So

0 ≤ E(Y) = E‖X₁‖² + c²E‖X₂‖² + 2cE(|X₁ᵀX₂|)
  = 2E‖X₁‖² − 2E(|X₁ᵀX₂|)√(E‖X₁‖²)/√(E‖X₂‖²).

The desired result follows immediately from this inequality. □
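The probabilistic Cauchy-Schwarz inequality (B.20) likewise holds exactly for the empirical distribution of any finite sample of vector pairs; a sketch with hypothetical Gaussian data:

```python
import random

# (E|X1^T X2|)^2 <= E||X1||^2 E||X2||^2, checked on a hypothetical
# empirical distribution of pairs of random 3-vectors. As with Jensen,
# the inequality is exact for any finite sample.
random.seed(2)
n, dim = 5_000, 3
pairs = [([random.gauss(0, 1) for _ in range(dim)],
          [random.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
e_abs_dot = sum(abs(dot(u, v)) for u, v in pairs) / n
e_sq1 = sum(dot(u, u) for u, _ in pairs) / n
e_sq2 = sum(dot(v, v) for _, v in pairs) / n
assert e_abs_dot ** 2 <= e_sq1 * e_sq2
```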

B.3 Conditioning

B.3.1 Conditional Expectations

Section B.1.2 contains a heuristic derivation of the important concepts in conditioning using the special case of discrete random quantities. We now turn to a more general presentation.
Theorem B.21.¹¹ Let (S, A, μ) be a probability space, and suppose that X : S → ℝ is a measurable function with E(|X|) < ∞. Let C be a sub-σ-field of A. Then there exists a C measurable function g : S → ℝ which satisfies

E(X·I_B) = ∫_B g(s) dμ(s), for all B ∈ C.

⁹That is, if x₁ and x₂ are vectors, then |x₁ᵀx₂| ≤ ‖x₁‖ ‖x₂‖.
¹⁰This theorem is used in the proofs of Theorems 3.44, 5.13, and 5.18.
¹¹This theorem is used to help define the general concept of conditional expectation.

PROOF. Use Theorem A.54 to construct two measures μ₊ and μ₋ on (S, C):

μ₊(B) = ∫_B X⁺(s) dμ(s),  μ₋(B) = ∫_B X⁻(s) dμ(s).

It is clear that μ₊ ≪ μ and μ₋ ≪ μ. The Radon-Nikodym theorem A.74 tells us that there are C measurable functions g₊ and g₋ such that

μ₊(B) = ∫_B g₊(s) dμ(s),  μ₋(B) = ∫_B g₋(s) dμ(s).

Since E(X·I_B) = μ₊(B) − μ₋(B), the result follows with g = g₊ − g₋. □


We will use the symbol E(X|C) to stand for the function g. If C is the σ-field generated by another random quantity Y, then the symbol E(X|Y) is usually used instead of E(X|C). For the case in which C is the σ-field generated by Y, the next corollary follows from Theorem B.21 with the help of Theorem A.42.
Corollary B.22. Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space such that C contains all singletons. Suppose that X : S → ℝ and Y : S → Y are measurable functions and E(|X|) < ∞. Let μ_Y be the measure induced on (Y, C) by Y from μ (see Theorem A.81). Let A_Y be the sub-σ-field of A generated by Y. Then there exists a function h : Y → ℝ that satisfies the following: If B ∈ A_Y equals Y⁻¹(C) for C ∈ C, then E(X·I_B) = ∫_C h(t) dμ_Y(t).
We will use the symbol E(X|Y = t) to stand for the h(t) in Corollary B.22. At this point the reader might wish to review Example B.4 on page 609.
To summarize the above results, we state the following.
Definition B.23. Let (S, A, μ) be a probability space, and suppose that X : S → ℝ is measurable and E(|X|) < ∞. Let C be a sub-σ-field of A. We define the conditional mean (conditional expectation) of X given C, denoted E(X|C), to be any C measurable function g : S → ℝ that satisfies

E(X·I_B) = ∫_B g(s) dμ(s), for all B ∈ C.

Each such function is called a version of the conditional mean. If Y : S → Y and C is the sub-σ-field generated by Y, then E(X|C) is also called the conditional mean of X given Y, denoted E(X|Y). If, in addition, the σ-field of subsets of Y contains singletons, let h : Y → ℝ be the function such that g = h(Y). Then h(t) is denoted by E(X|Y = t).
When we say that a random variable equals E(X|Y), we will mean that it is a version of E(X|Y). The following propositions are immediate consequences of the above definitions.
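On a finite probability space the defining property in Definition B.23 can be checked directly: with C the σ-field generated by Y, a version of E(X|C) is obtained by averaging X over each atom {Y = t}. The die example below is hypothetical.

```python
from collections import defaultdict

# Finite-space illustration of Definition B.23: g = E(X|C) is constant
# on each atom {Y = t} of C = sigma(Y), and integrating X or g over any
# B in C gives the same value. The probability space is hypothetical.
omega = [1, 2, 3, 4, 5, 6]                 # S = outcomes of a fair die
mu = {s: 1 / 6 for s in omega}
X = {s: s ** 2 for s in omega}             # X(s) = s^2
Y = {s: s % 2 for s in omega}              # Y(s) = parity of s

# g(s) = E(X | Y = Y(s)): average X over the atom containing s
num, den = defaultdict(float), defaultdict(float)
for s in omega:
    num[Y[s]] += X[s] * mu[s]
    den[Y[s]] += mu[s]
g = {s: num[Y[s]] / den[Y[s]] for s in omega}

# Check E(X * I_B) = integral of g over B for every B in sigma(Y)
atoms = {t: {s for s in omega if Y[s] == t} for t in set(Y.values())}
for B in [set(), atoms[0], atoms[1], atoms[0] | atoms[1]]:
    lhs = sum(X[s] * mu[s] for s in B)
    rhs = sum(g[s] * mu[s] for s in B)
    assert abs(lhs - rhs) < 1e-12
```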
Proposition B.24. Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space such that C contains singletons. Let X : S → ℝ and Y : S → Y be measurable. Let μ_Y be the measure on Y induced from μ by Y. A function g : Y → ℝ is a version of E(X|Y = t) if and only if, for all B ∈ C, ∫_B g(t) dμ_Y(t) = E(X·I_B(Y)).

Proposition B.25.
If Z and W are both versions of E(X|C), then Z = W, a.s.
If X is C measurable, then E(X|C) = X, a.s.

Proposition B.26. If C = {S, ∅}, the trivial σ-field, then E(X|C) = E(X).

Proposition B.27.¹² Let (S, A, μ) be a probability space, and let (Y, C) be a measurable space. Let X : S → ℝ and Y : S → Y be measurable, and let g : Y → ℝ be such that g(Y)X is integrable. Let μ_Y be the measure on Y induced from μ by Y. Then E(g(Y)X) = ∫ g(t) E(X|Y = t) dμ_Y(t).

Proposition B.28.¹³ Let (S, A, μ) be a probability space and let X : S → ℝ, Y : S → (Y, B₁), and Z : S → (Z, B₂) be measurable functions. Let μ_Y and μ_Z be the measures induced on Y and Z by Y and Z, respectively, from μ. Suppose that E(|X|) < ∞ and that Z is a one-to-one function of Y, that is, there exists a bimeasurable h : Y → Z such that Z = h(Y). Then E(X|Y = y) = E(X|Z = h(y)), a.s. [μ_Y].
Conditional probability is the special case of conditional expectation in which X = I_A.
Definition B.29. Let (S, A, μ) be a probability space. For each A ∈ A, the conditional probability of A given C (or given Y if C is the σ-field generated by Y) is Pr(A|C) = E(I_A|C). If Pr(·|C)(s) is a probability on (S, A) for all s ∈ S, then the conditional probabilities given C are called regular conditional probabilities.
It turns out that under very general conditions (see Theorem B.32), we can choose the functions Pr(A|C) in such a way that they are regular conditional probabilities. In the future, we will assume that this is done in all such cases. If C is the σ-field generated by Y, then Pr(A|Y = y) will be used to stand for E(I_A|Y = y) as in the discussion following Corollary B.22.
If X : S → X is a random quantity, its conditional distribution is the collection of conditional probabilities on X induced from the restriction of conditional probabilities on S to the σ-field generated by X.
Definition B.30. Let (S, A, μ) be a probability space and let (X, B) be a measurable space. Suppose that X : S → X is a measurable function. Let P be the probability on (X, B) induced by X from μ. Let C be a sub-σ-field of A. For each B ∈ B, let P(B|C) = Pr(A|C), where A = X⁻¹(B). We say that any set of functions from S to [0, 1] of the form

{P(B|C), for all B ∈ B}

is a version of the conditional distribution of X given C. If C is the σ-field generated by another random quantity Y : S → Y, a version of the conditional
¹²This proposition is used in the proof of Theorem B.64.
¹³This proposition is used to facilitate the transition from spaces of probability measures to subsets of Euclidean space when parametric models are introduced. It is also used in the proof of Theorem 2.114.

distribution of X given Y is specified by any collection of probability functions


of the form
{Pr(IY = t), for all t E Y}.
If the P(IC) are regular conditional probabilities, then we say that the version
of the conditional distribution of X given C is a regular conditional distribution.
When we refer to a conditional distribution without the word ''version,'' we
will mean a version of the conditional distribution. Occasionally, we will need to
choose a version that satisfies some other condition. In those cases, we will try
to be explicit about versions.
If X is sufficiently like the real numbers, there will be versions of conditional distributions that are regular. We make that precise with the following definition.
Definition B.31. Let (X, B) be a measurable space. If there exists a bimeasurable function φ : X → R, where R is a Borel subset of ℝ, then (X, B) is called a Borel space.
In particular, we can show that all Euclidean spaces with the Borel σ-fields are Borel spaces. (See Lemma B.36.) First, we prove that regular conditional distributions exist on Borel spaces. The proof is borrowed from Breiman (1968, Section 4.3).
Theorem B.32. Let (S, A, μ) be a probability space and let C be a sub-σ-field of A. Let (X, B) be a Borel space. Let X : S → X be a random quantity. Then there exists a regular conditional distribution of X given C.
PROOF. Let φ : X → R be the function guaranteed by Definition B.31. Define the random variable Z = φ(X) : S → R ⊆ ℝ. First we prove that the σ-field generated by X, A_X, is contained in the σ-field generated by Z, A_Z. Let B ∈ A_X; then there is C ∈ B such that B = X⁻¹(C). Since φ is one-to-one, φ⁻¹(φ(C)) = C. Since φ⁻¹ is measurable, φ(C) is a Borel subset of R. Now, Z⁻¹(φ(C)) = X⁻¹(C) = B, hence B ∈ A_Z. It is also easy to see that A_Z is contained in A_X, so they are equal. If Z has a regular conditional distribution, then so does X. The remainder of the proof is to show that Z has a regular conditional distribution.
For each rational number q, choose a version of Pr(Z ≤ q|C) and let

M_{q,r} = {s : Pr(Z ≤ q|C)(s) < Pr(Z ≤ r|C)(s)},  M = ∪_{q>r} M_{q,r}.
According to Problem 3 on page 662 and countable additivity, μ(M) = 0. Next, define

N_q = {s : lim_{r↓q, r rational} Pr(Z ≤ r|C)(s) ≠ Pr(Z ≤ q|C)(s)},  N = ∪_q N_q.

We can use Problem 3 on page 662 once again to prove that μ(N_q) = 0 for all q, hence μ(N) = 0. Similarly, we can show that μ(L) = 0, where L is the set

{s : lim_{r→−∞, r rational} Pr(Z ≤ r|C)(s) ≠ 0} ∪ {s : lim_{r→∞, r rational} Pr(Z ≤ r|C)(s) ≠ 1}.

If G is an arbitrary CDF, we can define

F(z|C)(s) = G(z) if s ∈ M ∪ N ∪ L,
F(z|C)(s) = lim_{r↓z, r rational} Pr(Z ≤ r|C)(s) otherwise.

F(·|C)(s) is a CDF for every s (see Problem 2 on page 661), and it is easy to check that F(z|C) is a version of Pr(Z ≤ z|C) for every z. If we extend F(·|C)(s) to a probability measure η(·; s) on the Borel σ-field for every s, we only need to check that, for every Borel set B, η(B; ·) is a version of Pr(Z ∈ B|C). That is, for every C ∈ C, we need

∫_C η(B; s) dμ(s) = Pr({Z ∈ B} ∩ C).     (B.33)

By construction, (B.33) is true if B is an interval of the form (−∞, z]. Such intervals form a π-system Π such that B is the smallest σ-field containing Π. If we define

Q₁(B) = ∫_C η(B; s) dμ(s) / Pr(C),  Q₂(B) = Pr({Z ∈ B} ∩ C) / Pr(C),

we see that Q₁ and Q₂ agree on Π. Tonelli's theorem A.69 can be used to see that Q₁ is countably additive, while Q₂ is clearly a probability. It follows from Theorem A.26 that Q₁ and Q₂ agree on B. □
Note that the only condition required for regular conditional distributions to exist is a condition on the space of the random quantity for which we desire a regular conditional distribution. The σ-field C, or the random quantity on which we condition, can be quite general. In the future, if we assume that (X, B) is a Borel space, we can construct regular conditional distributions given anything we wish. Also, since the function in the definition of Borel space is one-to-one and the Borel σ-field of ℝ contains singletons, it follows that the σ-field of a Borel space contains singletons (cf. Theorem A.42).

B.3.2 Borel Spaces*

In this section we prove that there are lots of Borel spaces. First, we prove that every space satisfying some general conditions is a Borel space, and then we will show that Euclidean spaces satisfy those conditions. Then, we show that finite and countable products of Borel spaces are Borel spaces. The most general type of Borel space in which we shall be interested is a complete separable metric space (sometimes called a Polish space).
Definition B.34. Let X be a topological space. A subset D of X is dense if, for every x ∈ X and every open set U containing x, there is an element of D in U. If there exists a countable dense subset of X, then X is separable. Suppose that X is a metric space with metric d. A sequence {x_n}_{n=1}^∞ is Cauchy if, for every ε > 0, there exists N such that m, n ≥ N implies d(x_n, x_m) < ε. A metric space X is complete if every Cauchy sequence converges. A complete and separable metric space is called a Polish space.

*This section may be skipped without interrupting the flow of ideas.



We would like to prove that all Polish spaces are Borel spaces. First, we prove that ℝ^∞ is a Borel space (Lemma B.36). Then we prove that there exist bimeasurable maps between Polish spaces and measurable subsets of ℝ^∞ (Lemma B.40). The following simple proposition pieces these results together.
Proposition B.35. If X is a Borel space and there exists a bimeasurable function f : Y → X, then Y is a Borel space.
Lemma B.36. The infinite product space ℝ^∞ is a Borel space.

PROOF. The idea of the proof¹⁴ is the following. We start by transforming each coordinate to the interval (0, 1) using a continuous function with continuous inverse. For each number in (0, 1) we find a base 2 expansion, which is a sequence of 0s and 1s. We then take these sequences (one for each coordinate) and merge them into a single sequence, which we then interpret as the base 2 expansion of a number in (0, 1). If this sequence of transformations is bimeasurable, we have our function φ.
Let ψ : ℝ^∞ → (0, 1)^∞ be defined by

ψ(x₁, x₂, …) = (1/2 + tan⁻¹(x₁)/π, 1/2 + tan⁻¹(x₂)/π, …),

which is bimeasurable. For each x ∈ [0, 1), set Y₀(x) = x and, for j = 1, 2, …, define

Z_j(x) = 1 if 2Y_{j−1}(x) ≥ 1, and Z_j(x) = 0 if not,
Y_j(x) = 2Y_{j−1}(x) − Z_j(x).

For each j, Z_j is a measurable function. It is easy to see that Z_j(x) is the jth digit in a base 2 expansion of x with infinitely many 0s. Note also that Y_j(x) ∈ [0, 1) for all j and x.
Create the following triangular array of integers:

1
2 3
4 5 6
7 8 9 10
11 12 13 14 15

Let the jth integer from the top of the ith column be ℓ(i, j). Then

ℓ(i, j) = i(i+1)/2 + i(j−1) + (j−1)(j−2)/2.

¹⁴This proof is adapted from Breiman (1968, Theorem A.47).



Clearly, each integer t appears once and only once as ℓ(i, j) for some i and j.¹⁵ Define

h(x₁, x₂, …) = Σ_{i=1}^∞ Σ_{j=1}^∞ Z_j(x_i) / 2^{ℓ(i,j)}.     (B.37)

Then h is clearly a measurable function from (0, 1)^∞ to a subset R of (0, 1). There is a countable subset of (0, 1) which is not in the image of h. These are the numbers with only finitely many 0s in one or more of the subsequences {ℓ(i, j)}_{j=1}^∞ of their base 2 expansion, for i = 1, 2, …. For example, the number c = Σ_{i=0}^∞ 2^{−i(i+1)/2 − 1} is not in R.¹⁶ Since the complement of a countable set is measurable, the set R is measurable.
We define φ = h(ψ). If we can show that h has a measurable inverse, the proof is complete. For each x ∈ R, define

φ_i(x) = Σ_{j=1}^∞ Z_{ℓ(i,j)}(x) / 2^j.     (B.38)

Clearly, each φ_i is measurable. Note that, for each i and j,

Z_j(φ_i(x)) = Z_{ℓ(i,j)}(x).     (B.39)

Combining (B.37), (B.38), (B.39), and the fact that every integer appears once and only once as ℓ(i, j) for some i and j, we see that h(φ₁(x), φ₂(x), …) = x, so that (φ₁, φ₂, …) is the inverse of h and it is measurable. □
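The bookkeeping in this proof can be imitated at finite precision. The sketch below uses the index function ℓ(i, j) from the text, an inverse map (r(t), s(t)) computed directly from the triangular array layout, and only finitely many base 2 digits, so it illustrates the digit merging rather than the full bijection.

```python
# Finite-precision sketch of the digit-interleaving step in the proof of
# Lemma B.36: digit l(i, j) of the merged number is digit j of the ith
# coordinate, and the index map l is invertible.

def ell(i, j):
    return i * (i + 1) // 2 + i * (j - 1) + (j - 1) * (j - 2) // 2

def r_s(t):
    """Column r(t) and depth s(t) of integer t in the triangular array."""
    k = next(n for n in range(1, t + 1) if t <= n * (n + 1) // 2)  # row of t
    r = t - k * (k - 1) // 2
    return r, k + 1 - r

for t in range(1, 200):           # l and (r, s) are inverse to each other
    i, j = r_s(t)
    assert ell(i, j) == t

def digits(x, n):
    """First n base 2 digits of x in [0, 1)."""
    out = []
    for _ in range(n):
        x *= 2
        d = int(x)
        out.append(d)
        x -= d
    return out

# Merge the digits of two numbers into one stream, then pull them apart
x1, x2 = 0.6875, 0.3125           # hypothetical points of (0, 1)
n = 20
d1, d2 = digits(x1, n), digits(x2, n)
merged = {ell(1, j + 1): d1[j] for j in range(n)}
merged.update({ell(2, j + 1): d2[j] for j in range(n)})
back1 = [merged[ell(1, j + 1)] for j in range(n)]
back2 = [merged[ell(2, j + 1)] for j in range(n)]
assert back1 == d1 and back2 == d2
```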

Lemma B.40. If (X, B) is a Polish space with the Borel σ-field and metric d, then it is a Borel space.¹⁷

PROOF. All we need to prove is that there exists a bimeasurable f : X → G, where G is a measurable subset of ℝ^∞. We then use Lemma B.36 and Proposition B.35. Let {x_n}_{n=1}^∞ be a countable dense subset of X, and let d be the metric on X. Define the function f : X → ℝ^∞ by

f(x) = (d(x, x₁), d(x, x₂), …).

We will first show that f is continuous, which will make it measurable. Suppose that {y_n}_{n=1}^∞ is a sequence in X that converges to y ∈ X. The kth coordinate of f(y_n) is d(y_n, x_k), which converges to d(y, x_k) because the metric is continuous. Hence, each coordinate of f is continuous, and f is continuous. Next, we prove that f is one-to-one. Suppose that f(x) = f(y). Then d(x, x_n) = d(y, x_n) for
¹⁵It is easy to check the following. For each integer t, let k = inf{n : t ≤ n(n + 1)/2}. Then r(t) = t − k(k − 1)/2 and s(t) = k + 1 − r(t) have the property that ℓ(r(t), s(t)) = t, r(ℓ(i, j)) = i, and s(ℓ(i, j)) = j.
¹⁶This number corresponds to having 1s in the first column of the triangular array but nowhere else. Clearly, 0 < c < 1, but it is impossible to have 1s in the entire first column, since this would require x₁ = 1. Even if x₁ = 1 had been allowed, its base 2 expansion would have ended in infinitely many 0s rather than infinitely many 1s.
¹⁷This proof is adapted from p. 219 of Billingsley (1968) and Theorem 15.8 of Royden (1968).

all n. Since {x_n}_{n=1}^∞ is dense, there exists a subsequence {x_{n_j}}_{j=1}^∞ such that lim_{j→∞} x_{n_j} = x. It follows that 0 = lim_{j→∞} d(x, x_{n_j}) = lim_{j→∞} d(y, x_{n_j}); hence lim_{j→∞} x_{n_j} = y, and y = x.
Next, we prove that f⁻¹ : f(X) → X is continuous. Suppose that a sequence of points {f(y_n)}_{n=1}^∞ converges to f(y). Let lim_{j→∞} x_{n_j} = y. Then lim_{j→∞} d(y, x_{n_j}) = 0. But d(y, x_{n_j}) is the n_jth coordinate of f(y), which in turn is the limit (as n → ∞) of the n_jth coordinate of f(y_n). For each j, d(y_n, y) ≤ d(y_n, x_{n_j}) + d(y, x_{n_j}). Let ε > 0 and let j be large enough so that d(y, x_{n_j}) < ε/4. Now, let N be large enough so that n ≥ N implies d(y_n, x_{n_j}) < d(y, x_{n_j}) + ε/2. It follows that, if n ≥ N, d(y_n, y) < ε. Hence lim_{n→∞} y_n = y and f⁻¹ is continuous, hence measurable.
Finally, we will prove that the image G of f is a measurable subset of ℝ^∞. We will do this by proving that G is the intersection of countably many open subsets of Ḡ.¹⁸ Let G_n be the following set:

{x ∈ ℝ^∞ : ∃ O_x, a neighborhood of x, with d(a, b) ≤ 1/n for all a, b ∈ f⁻¹(O_x)}.

Since O_x ⊆ G_n for each x ∈ G_n, G_n is open. Also, since f and f⁻¹ are continuous, it is easy to see that G ⊆ G_n for all n. Let G′ = Ḡ ∩ ∩_{n=1}^∞ G_n. For each x ∈ G′, let O_{x,n} ⊆ G_n be such that O_{x,1} ⊇ O_{x,2} ⊇ ⋯ and that d(a, b) ≤ 1/n for all a, b ∈ f⁻¹(O_{x,n}). Note that f⁻¹(O_{x,n}) ⊇ f⁻¹(O_{x,n+1}) for all n. If y_n ∈ f⁻¹(O_{x,n}) for every n, then {y_n}_{n=1}^∞ is a Cauchy sequence, since n, m ≥ N implies d(y_n, y_m) ≤ 1/N. Hence, there is a limit y to the sequence. It is easy to see that if there were two such sequences with limits y and y′, then d(y, y′) < ε for all ε > 0, hence y = y′. So we can define a function h : G′ → X by h(x) = y. If x ∈ G, then clearly h(x) = f⁻¹(x). If x′ ∈ O_{x,n}, then d(h(x), h(x′)) ≤ 1/n, so h is continuous. We now prove that G′ ⊆ G, which implies that G = G′ and the proof will be complete. Let x ∈ G′, and let x_n ∈ G be such that x_n → x. (This is possible since G′ ⊆ Ḡ.) Since h is continuous, f⁻¹(x_n) → h(x). If y_n = f⁻¹(x_n) and y = h(x), then y_n → y and f(y_n) → f(y) ∈ G, since f is continuous. But f(y_n) = x_n, so f(y) = x, and the proof is complete. □
Next, we show that products of Borel spaces are Borel spaces.
Lemma B.41. Let (X_n, B_n) be a Borel space for each n. The product spaces ∏_{i=1}^n X_i for all finite n and ∏_{n=1}^∞ X_n with product σ-fields are Borel spaces.
PROOF. We will prove the result for the infinite product. The proofs for finite products are similar. If X_n = ℝ for all n, the result is true by Lemma B.36. For general X_n, let φ_n : X_n → R_n and φ_* : ℝ^∞ → R_* be bimeasurable, where R_n and R_* are measurable subsets of ℝ. Then, it is easy to see that

φ : ∏_{n=1}^∞ X_n → φ_*(∏_{n=1}^∞ R_n)

is bimeasurable, where φ(x₁, x₂, …) = φ_*(φ₁(x₁), φ₂(x₂), …). □


Next, we show that the set of bounded continuous functions from [0, 1] to the real numbers is also a Polish space.

¹⁸We use the symbol Ḡ to stand for the closure of the set G. The closure of a subset G of a topological space is the smallest closed set containing G. A set is closed if and only if its complement is open.

Lemma B.42.¹⁹ Let C[0, 1] be the set of all bounded continuous functions from [0, 1] to ℝ. Let ρ(f, g) = sup_{x∈[0,1]} |f(x) − g(x)|. Then ρ is a metric on C[0, 1], and C[0, 1] is a Polish space.
PROOF. That ρ is a metric is easy to see. To see that C[0, 1] is separable, let D_k be the set of functions that take on rational values at the points 0, 1/k, …, (k − 1)/k, 1 and are linear between these values. Let D = ∪_{k=1}^∞ D_k. The set D is countable. Every continuous function on a compact set is uniformly continuous, so let f ∈ C[0, 1] and ε > 0. Let δ be small enough so that |x − y| < δ implies |f(x) − f(y)| < ε/4. Let k be larger than 1/δ. There exists g ∈ D_k such that |g(i/k) − f(i/k)| < ε/4 for each i = 0, …, k. For i/k < x < (i + 1)/k, |f(x) − f(i/k)| < ε/4, and |g(x) − g(i/k)| < ε/2, so |f(x) − g(x)| < ε. To see that C[0, 1] is complete, let {f_n}_{n=1}^∞ be a Cauchy sequence. Then, for all x, {f_n(x)}_{n=1}^∞ is a Cauchy sequence of real numbers that converges to some number f(x). We need to show that the convergence of f_n to f is uniform. To the contrary, assume that there exists ε such that, for each n, there is x_n such that |f_n(x_n) − f(x_n)| > ε. We know that there exists n such that m > n implies |f_n(x) − f_m(x)| < ε/2 for all x. In particular, |f_n(x_n) − f_m(x_n)| < ε/2 for all m > n. Since lim_{m→∞} f_m(x_n) = f(x_n), it follows that there exists m such that |f_m(x_n) − f(x_n)| < ε/2, a contradiction. □
Because Borel spaces have σ-fields that look just like the Borel σ-field of the real numbers, their σ-fields are generated by countably many sets. The countable field that generates the Borel σ-field of ℝ is the collection of all sets that are unions of finitely many disjoint intervals (including degenerate ones and infinite ones) with rational endpoints.
Proposition B.43.²⁰ Let (X, B) be a Borel space. Then there exists a countable field C such that B is the smallest σ-field containing C.
Because a field is a π-system, Theorem A.26 and Proposition B.43 imply the following.
Corollary B.44. Let (X, B) be a Borel space, and let C be a countable field that generates B. If μ₁ and μ₂ are σ-finite measures on B that agree on C, then they agree on B.

B.3.3 Conditional Densities

Because conditional distributions are probability measures, many of the theorems from Appendix A which apply to such measures apply to conditional distributions. For example, the monotone convergence theorem A.52 and the dominated convergence theorem A.57 apply to conditional means because limits of measurable functions are still measurable. Also, most of the properties of probability measures from this appendix apply as well. In this section, we focus on the existence and calculation of densities for conditional distributions.
If the joint distribution of two random quantities has a density with respect to a product measure, then the conditional distributions have densities that can

¹⁹This lemma is used in the proof of Lemma 2.121.
²⁰This proposition is used in the proofs of Lemmas 2.124 and 2.126 and Theorem 3.110.

be calculated in the usual way.


Proposition B.45. Let (S, A, μ) be a probability space and let (X, B₁, ν_X) and (Y, B₂, ν_Y) be σ-finite measure spaces. Let X : S → X and Y : S → Y be measurable functions. Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν_X × ν_Y. Let the density be f_{X,Y}(x, y). Let the probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_Y is absolutely continuous with respect to ν_Y with density

f_Y(y) = ∫_X f_{X,Y}(x, y) dν_X(x),

and the conditional distribution of X given Y has densities

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)

with respect to ν_X.

This proposition can be proven directly using Tonelli's theorem A.69 or as a special case of Theorem B.46 (see Problem 15 on page 663).
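In the simplest dominated case, ν_X and ν_Y are counting measures on finite sets and the densities in Proposition B.45 are probability tables. The joint table below is a hypothetical example.

```python
# Proposition B.45 with nu_X and nu_Y counting measures on finite sets:
# densities are probability tables, integrals are sums. Hypothetical joint.
xs, ys = [0, 1, 2], [0, 1]
f_joint = {(0, 0): 0.10, (1, 0): 0.25, (2, 0): 0.15,
           (0, 1): 0.20, (1, 1): 0.05, (2, 1): 0.25}

# marginal density of Y: integrate the joint over x
f_Y = {y: sum(f_joint[(x, y)] for x in xs) for y in ys}

# conditional density of X given Y = y
f_XgY = {(x, y): f_joint[(x, y)] / f_Y[y] for x in xs for y in ys}

for y in ys:
    # each conditional density integrates to 1 ...
    assert abs(sum(f_XgY[(x, y)] for x in xs) - 1) < 1e-12
    for x in xs:
        # ... and conditional times marginal recovers the joint
        assert abs(f_XgY[(x, y)] * f_Y[y] - f_joint[(x, y)]) < 1e-12
```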
Theorem B.46. Let (X, B₁) be a Borel space, let (Y, B₂) be a measurable space, and let (X × Y, B₁ ⊗ B₂, ν) be a σ-finite measure space. Then there exists a measure ν_Y on (Y, B₂) and, for each y ∈ Y, there exists a measure ν_{X|Y}(·|y) on (X, B₁) such that, for each integrable or nonnegative h : X × Y → ℝ, ∫ h(x, y) dν_{X|Y}(x|y) is B₂ measurable and

∫ h(x, y) dν(x, y) = ∫ [∫ h(x, y) dν_{X|Y}(x|y)] dν_Y(y).     (B.47)
PROOF. Let f be the strictly positive integrable function guaranteed by Theorem A.85. Without loss of generality, assume that ∫ f(x, y) dν(x, y) = 1. The measure μ(A) = ∫_A f(x, y) dν(x, y) is a probability, ν ≪ μ, and (dν/dμ)(x, y) = 1/f(x, y). Let μ_{X|Y} be a regular conditional distribution on (X, B₁) constructed from μ, and let ν_Y be the marginal distribution on (Y, B₂). Define

ν_{X|Y}(A|y) = ∫_A [1/f(x, y)] dμ_{X|Y}(x|y).

Note that

∫ I_{A×B}(x, y) dμ_{X|Y}(x|y) = I_B(y) μ_{X|Y}(A|y),     (B.48)

which is a measurable function of y because μ_{X|Y} is a regular conditional distribution. Just as in the proof of Lemma A.61, we can use the π-λ theorem A.17 to show that ∫ g dμ_{X|Y} is measurable if g is the indicator of an element of the product σ-field. It follows that ∫ g dμ_{X|Y} is measurable for every nonnegative simple function g. By the monotone convergence theorem A.52, letting {g_n}_{n=1}^∞ be nonnegative simple functions increasing to g everywhere, it follows that ∫ g(x, y) dμ_{X|Y}(x|y) is measurable for all nonnegative measurable functions, and hence ∫ h dν_{X|Y} = ∫ (h/f) dμ_{X|Y} is measurable if h is nonnegative.

Next, define a probability η on (X × Y, B₁ ⊗ B₂) by

η(C) = ∫ [∫ I_C(x, y) dμ_{X|Y}(x|y)] dν_Y(y).

It follows from (B.48) that η and μ agree on the collection of all product sets (a π-system that generates B₁ ⊗ B₂). Theorem A.26 implies that they agree on B₁ ⊗ B₂. By linearity of integrals and the monotone convergence theorem A.52, if g is nonnegative, then

∫ g(x, y) dη(x, y) = ∫ [∫ g(x, y) dμ_{X|Y}(x|y)] dν_Y(y)
  = ∫ [∫ g(x, y) f(x, y) dν_{X|Y}(x|y)] dν_Y(y).     (B.49)

For every nonnegative h,

∫ h(x, y) dν(x, y) = ∫ [h(x, y)/f(x, y)] f(x, y) dν(x, y) = ∫ [h(x, y)/f(x, y)] dμ(x, y)
  = ∫ [h(x, y)/f(x, y)] dη(x, y) = ∫ [∫ h(x, y) dν_{X|Y}(x|y)] dν_Y(y),     (B.50)

where the second equality follows from the fact that dμ/dν = f, the third follows from the fact that μ and η are the same measure, and the fourth follows from (B.49). If h is integrable with respect to ν, then (B.50) applies to h⁺, h⁻, and |h|, and all three results are finite. Also, ∫ |h(x, y)| dν_{X|Y}(x|y) is measurable and ν_Y({y : ∫ |h(x, y)| dν_{X|Y}(x|y) = ∞}) = 0. So ∫ h⁺(x, y) dν_{X|Y}(x|y) and ∫ h⁻(x, y) dν_{X|Y}(x|y) are both finite almost surely, and their difference is ∫ h(x, y) dν_{X|Y}(x|y), a measurable function. It now follows that (B.47) holds. □
The measures ν_Y and ν_{X|Y} in Theorem B.46 are not unique. In the proof, we could easily have defined ν_Y several ways, such as ν_Y(A) = ∫_A g(y) dμ_Y(y) for any strictly positive function g with finite μ_Y integral. A corresponding adjustment would have to be made to the definition of ν_{X|Y}.
In the special case in which ν is a product measure ν₁ × ν₂, it is easy to show that ν₁ can play the role of ν_{X|Y}(·|y) for all y and that ν₂ can play the role of ν_Y in Theorem B.46. (See Problem 15 on page 663.)
There is a familiar application of Theorem B.46 to cases in which X and Y are Euclidean spaces but ν is concentrated on a lower-dimensional manifold defined by a function y = g(x).
Proposition B.51. Suppose that X = ℝⁿ and Y = ℝᵏ, with k < n. Let g : X → Y be such that there exists h : X → ℝ^{n−k} such that v(x) = (g(x), h(x)) is one-to-one, is differentiable, and has a differentiable inverse. For y ∈ ℝᵏ and w ∈ ℝ^{n−k}, define J(y, w) to be the Jacobian, that is, the determinant of the matrix of partial derivatives of the coordinates of v⁻¹(y, w) with respect to the coordinates of y and of w. Let λ_i be Lebesgue measure on ℝⁱ, for each i. Define

a measure ν on X × Y by ν(C) = λ_n({x : (x, g(x)) ∈ C}). Then ν_Y equal to Lebesgue measure on ℝᵏ and ν_{X|Y}(A|y) = ∫_{A_y} J(y, w) dλ_{n−k}(w) satisfy (B.47), where A_y = {w : v⁻¹(y, w) ∈ A}.
We are now in position to derive a formula for conditional densities in general.²¹
Theorem B.52. Let (S, A, μ) be a probability space, let (X, B₁) be a Borel space, let (Y, B₂) be a measurable space, and let (X × Y, B₁ ⊗ B₂, ν) be a σ-finite measure space. Let ν_Y and ν_{X|Y} be as guaranteed by Theorem B.46. Let X : S → X and Y : S → Y be measurable functions. Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Suppose that μ_{X,Y} ≪ ν. Let the density be f_{X,Y}(x, y). Let the probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_Y ≪ ν_Y; for each y ∈ Y,

(dμ_Y/dν_Y)(y) = f_Y(y) = ∫_X f_{X,Y}(x, y) dν_{X|Y}(x|y);     (B.53)

and the conditional distribution of X given Y = y has density

f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y)     (B.54)

with respect to ν_{X|Y}(·|y).


PROOF. It follows from Theorem B.46 that, for all B ∈ B₂,

μ_Y(B) = ∫ I_B(y) f_{X,Y}(x, y) dν(x, y)
  = ∫ I_B(y) [∫ f_{X,Y}(x, y) dν_{X|Y}(x|y)] dν_Y(y).

The fact that μ_Y ≪ ν_Y and (B.53) both follow from this equation. Let μ_{X|Y}(·|y) denote a regular conditional distribution of X given Y = y. For each A ∈ B₁ and B ∈ B₂, apply Theorem B.46 with h(x, y) = I_A(x) I_B(y) f_{X|Y}(x|y) f_Y(y) to conclude

∫_B μ_{X|Y}(A|y) dμ_Y(y) = μ_{X,Y}(A × B) = ∫ h(x, y) dν(x, y)
  = ∫_B [∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y)] dμ_Y(y).

Since this is true for all B ∈ B₂, we conclude that

μ_{X|Y}(A|y) = ∫_A f_{X|Y}(x|y) dν_{X|Y}(x|y).

Hence (B.54) gives the density of μ_{X|Y}(·|y) with respect to ν_{X|Y}(·|y). □


The point of Theorem B.52 is that we can calculate conditional densities for
random quantities even if the measure that dominates the joint distribution is
not a product measure. When the joint distribution is dominated by a product
measure, the conditional distributions are all dominated by the same measure.
(See Problem 15 on page 663.) In general, however, the conditional distribution
of X given Y = y is dominated by a measure that depends on y. For example, if
Y = g(X), the joint distribution of (X, Y) is not dominated by a product measure
even if the distribution of X is dominated. (See also Problem 7 on page 662.)
Nevertheless, we have the following result.

21The condition that the joint distribution have a density with respect to a
measure ν in Theorem B.52 is always met, since ν can be taken equal to the joint
distribution. The theorem applies even if ν is not the joint distribution, however.
Corollary B.55.22 Let (S, A, μ) be a probability space, let (Y, B₂) be a measur-
able space such that B₂ contains all singletons, and let (X, B₁) be a Borel space with
ν_X a σ-finite measure on (X, B₁). Let X : S → X and g : X → Y be measurable
functions. Let Y = g(X). Suppose that the distribution of X has density f_X with
respect to ν_X. Define ν on (X × Y, B₁ ⊗ B₂) by ν(C) = ν_X({x : (x, g(x)) ∈ C}).
Let μ_{X,Y} be the probability induced on (X × Y, B₁ ⊗ B₂) by (X, Y) from μ. Let the
probability induced on (Y, B₂) by Y from μ be denoted μ_Y. Then μ_{X,Y} ≪ ν with
Radon–Nikodym derivative f_{X,Y}(x, y) = f_X(x) I_{{g(x)}}(y). Also, the conditions of
Theorem B.46 hold, and we can write

    dμ_Y/dν_Y (y) = f_Y(y) = ∫_X I_{{g(x)}}(y) f_X(x) dν_{X|Y}(x|y),

    f_{X|Y}(x|y) = { f_X(x)/f_Y(y)   if y = g(x),
                     0               otherwise.

Also, the conditional distribution of Y given X is given by μ_{Y|X}(C|x) = I_C(g(x)).


PROOF. Since ν_X is σ-finite, ν is also. Since Y is a function of X, Theorem A.81
implies that for all integrable h, ∫ h(x, y) dν(x, y) = ∫ h(x, g(x)) dν_X(x). The facts
that f_{X,Y} has the specified form and that μ_{Y|X} is the conditional distribution of
Y given X follow easily from this equation. □
The point of Corollary B.55 is that if Y = g(X), then we can assume that the
conditional distribution of X given Y = y is concentrated on g⁻¹({y}).
Example B.56.23 Let f be a spherically symmetric density with respect to λₙ,
Lebesgue measure on ℝⁿ. That is, f(x) = h(xᵀx) for some function h : ℝ → [0, ∞)
and ∫ h(xᵀx) dλₙ(x) = 1. Let X have density f and let V = XᵀX. Let R = V^(1/2),
and transform to spherical coordinates:

    x₁ = r cos(θ₁),
    x₂ = r sin(θ₁) cos(θ₂),
    ⋮
    x_{n−1} = r sin(θ₁) ⋯ cos(θ_{n−1}),
    xₙ = r sin(θ₁) ⋯ sin(θ_{n−1}).

The Jacobian is r^(n−1) j(θ), where j is some function of θ alone. The Jacobian for
the transformation to v and θ is v^((n/2)−1) j(θ)/2. The integral of j(θ) over all θ
values is 2π^(n/2)/Γ(n/2). So, the marginal density of V is

    f_V(v) = π^(n/2) v^((n/2)−1) h(v) / Γ(n/2).

The conditional density of X given V = v is then

    f_{X|V}(x|v) = [Γ(n/2) v^(1−n/2) / π^(n/2)] I_{{v}}(xᵀx)

with respect to the measure ν_{X|V}(C|v) = ∫_{C'} v^((n/2)−1) j(θ) dλ_{n−1}(θ)/2, where

    C' = {θ : v^(1/2) (cos(θ₁), ..., sin(θ₁) ⋯ sin(θ_{n−1})) ∈ C}.

It follows that the conditional distribution of X given V = v is given by

    μ_{X|V}(C|v) = [Γ(n/2) / (2π^(n/2))] ∫_{C'} j(θ) dλ_{n−1}(θ).

It is easy to see that μ_{X|V}(·|v) is the uniform distribution over the sphere of
radius v^(1/2) in n dimensions.

22This corollary is used in the proof of Theorem 2.86 and in Example 3.106.
23The calculation in this example is used again in Example 4.121.

Another example was given in Example B.5 on page 610.
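Example B.56 lends itself to a quick Monte Carlo check (an illustration of ours, not the text's). Taking f to be the standard n-variate normal density, which is spherically symmetric, the direction X/V^(1/2) should be uniform on the unit sphere and independent of V:

```python
import random
import math

def sample_direction_and_radius(n, rng):
    """Draw X ~ N(0, I_n) (a spherically symmetric density) and
    return (X / sqrt(V), V) where V = X^T X."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    v = sum(xi * xi for xi in x)
    r = math.sqrt(v)
    return [xi / r for xi in x], v

rng = random.Random(0)
n, reps = 3, 20000
dirs, radii = [], []
for _ in range(reps):
    u, v = sample_direction_and_radius(n, rng)
    dirs.append(u)
    radii.append(v)

# Uniformity on the sphere: each coordinate of the unit vector has mean ~ 0.
coord_means = [sum(d[i] for d in dirs) / reps for i in range(n)]
# Independence check: covariance between first coordinate and V ~ 0.
mean_u0 = coord_means[0]
mean_v = sum(radii) / reps
cov = sum((d[0] - mean_u0) * (v - mean_v) for d, v in zip(dirs, radii)) / reps
print(coord_means, cov)
```

With the seed fixed, the coordinate means of the unit vectors and the covariance between direction and squared radius all come out close to zero, as the uniformity and independence claims predict.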

B.3.4 Conditional Independence


The concept of conditional independence will turn out to be central to the devel-
opment of statistical models.

Definition B.57. Let N be an index set, let Y and {Xᵢ}_{i∈N} be random quantities,
and let Aᵢ be the σ-field generated by Xᵢ. We say that {Xᵢ}_{i∈N} are conditionally
independent given Y if, for every n and every set of distinct indices i₁, ..., iₙ and
every collection of sets A₁ ∈ A_{i₁}, ..., Aₙ ∈ A_{iₙ}, we have

    Pr(⋂_{j=1}^n Aⱼ | Y) = ∏_{j=1}^n Pr(Aⱼ | Y), a.s.    (B.58)

If, in addition, Y is constant almost surely, we say {Xᵢ}_{i∈N} are independent.
Under the same conditions as above, if all of the conditional distributions of
the Xᵢ given Y are the same, then we say {Xᵢ}_{i∈N} are conditionally IID given
Y. If, in addition, Y is constant almost surely, we say {Xᵢ}_{i∈N} are IID.

Example B.59. Let F be a joint CDF of n random variables X₁, ..., Xₙ, and
let μ be the corresponding measure on ℝⁿ. Then μ is a product measure if and
only if X₁, ..., Xₙ are independent (see Proposition B.66).
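Definition B.57 can be illustrated by simulation (our example, not the text's). Below, X₁ and X₂ are conditionally IID N(Y, 1) given a shared Y ~ N(0, 1); marginally they are correlated, but with Y held fixed the correlation vanishes:

```python
import random

rng = random.Random(1)
reps = 50000
x1s, x2s, x1s_fixed, x2s_fixed = [], [], [], []
for _ in range(reps):
    y = rng.gauss(0.0, 1.0)              # shared random quantity Y
    x1s.append(y + rng.gauss(0.0, 1.0))  # X1 | Y=y ~ N(y, 1)
    x2s.append(y + rng.gauss(0.0, 1.0))  # X2 | Y=y ~ N(y, 1)
    # Conditioning: hold Y fixed at a single value (here 2.0).
    x1s_fixed.append(2.0 + rng.gauss(0.0, 1.0))
    x2s_fixed.append(2.0 + rng.gauss(0.0, 1.0))

def corr(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    va = sum((u - ma) ** 2 for u in a) / n
    vb = sum((v - mb) ** 2 for v in b) / n
    return cov / (va * vb) ** 0.5

marginal = corr(x1s, x2s)                 # theoretical value 1/2
conditional = corr(x1s_fixed, x2s_fixed)  # theoretical value 0
print(marginal, conditional)
```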
Example B.60 (Continuation of Example B.56; see page 627).24 Transform to
(Y, V), where Y = X/V^(1/2). Then the conditional distribution of Y given V is
given by

    μ_{Y|V}(D|v) = [Γ(n/2) / (2π^(n/2))] ∫_{D'} j(θ) dλ_{n−1}(θ),

where D' = {θ : (cos(θ₁), ..., sin(θ₁) ⋯ sin(θ_{n−1})) ∈ D}. We note that this
formula does not depend on v; hence Y is independent of V. In addition, it is
easy to see that μ_{Y|V}(·|v) is just the uniform distribution over the sphere of
radius 1 in n dimensions.

24This calculation is used again in Example 4.121.
The use of conditional independence in predictive inference is based on the
following theorem.

Theorem B.61.25 Let N be an index set, let Y and {Xᵢ}_{i∈N} be a collection
of random quantities, and let Aᵢ be the σ-field generated by Xᵢ. Then {Xᵢ}_{i∈N}
are conditionally independent given Y if and only if for every n and m and
every set of distinct indices i₁, ..., iₙ, j₁, ..., jₘ and every collection of sets A₁ ∈
A_{i₁}, ..., Aₙ ∈ A_{iₙ}, we have

    Pr(⋂_{k=1}^n Aₖ | Y, X_{j₁}, ..., X_{jₘ}) = Pr(⋂_{k=1}^n Aₖ | Y), a.s.    (B.62)

PROOF. For the "if" part, we will assume (B.62) and prove (B.58) by induction
on n. For n = 1, there is nothing to prove. Assuming (B.58) is true for all n ≤ k,
we now prove it for n = k + 1. Let Aⱼ ∈ A_{iⱼ} for j = 1, ..., k + 1. According to
(B.62) and (B.58) for n = k, we have

    Pr(⋂_{i=1}^k Aᵢ | Y, X_{i_{k+1}}) = Pr(⋂_{i=1}^k Aᵢ | Y) = ∏_{i=1}^k Pr(Aᵢ | Y), a.s.

It follows that for all B ∈ A_Y, the σ-field generated by Y,

    Pr(B ∩ ⋂_{i=1}^{k+1} Aᵢ) = Pr(B ∩ A_{k+1} ∩ ⋂_{i=1}^k Aᵢ)
        = ∫_{B∩A_{k+1}} Pr(⋂_{i=1}^k Aᵢ | Y, X_{i_{k+1}})(s) dμ(s)
        = ∫_{B∩A_{k+1}} ∏_{i=1}^k Pr(Aᵢ | Y)(s) dμ(s) = ∫_B I_{A_{k+1}}(s) ∏_{i=1}^k Pr(Aᵢ | Y)(s) dμ(s)
        = ∫_B Pr(A_{k+1} | Y)(s) ∏_{i=1}^k Pr(Aᵢ | Y)(s) dμ(s) = ∫_B ∏_{i=1}^{k+1} Pr(Aᵢ | Y)(s) dμ(s).

The equality of the first and last terms above for all B ∈ A_Y means that
∏_{i=1}^{k+1} Pr(Aᵢ | Y) = Pr(⋂_{i=1}^{k+1} Aᵢ | Y), a.s., which is what we need to complete the
induction.
For the "only if" part, we will assume (B.58) and prove (B.62). For a function
g to be the left-hand side of (B.62), it must be measurable with respect to the
σ-field A_{Y,m} generated by Y, X_{j₁}, ..., X_{jₘ}, and satisfy

    Pr(C ∩ ⋂_{i=1}^n Aᵢ) = ∫_C g(s) dμ(s)    (B.63)

25This theorem is used in the proofs of Theorems 2.14 and 2.20.



for all C ∈ A_{Y,m}. Clearly, the right-hand side of (B.62) is measurable with respect
to A_{Y,m}. If C = C_Y ∩ C_X, where C_Y ∈ A_Y and C_X is in the σ-field generated by
X_{j₁}, ..., X_{jₘ}, then

    Pr(C ∩ ⋂_{i=1}^n Aᵢ) = ∫_{C_Y} Pr(C_X ∩ ⋂_{i=1}^n Aᵢ | Y)(s) dμ(s)
        = ∫_{C_Y} I_{C_X}(s) Pr(⋂_{i=1}^n Aᵢ | Y)(s) dμ(s)
        = ∫_C Pr(⋂_{i=1}^n Aᵢ | Y)(s) dμ(s).

This means that (B.63) holds with g = Pr(⋂_{i=1}^n Aᵢ | Y) so long as C is of the
specified form. To show that it holds for all C ∈ A_{Y,m}, we first note that A_{Y,m} is
the smallest σ-field containing all sets of the specified form. Clearly, (B.63) holds
for all sets that are unions of finitely many disjoint sets of the specified form, by
linearity of integrals. These sets form a field 𝒞. According to Lemma A.24, for
each ε > 0, there is C_ε ∈ 𝒞 such that Pr(C_ε Δ C) < ε/2. The following facts follow
trivially:

    ∫_{C_ε} g(s) dμ(s) = Pr(C_ε ∩ ⋂_{i=1}^n Aᵢ),

    |Pr(C ∩ ⋂_{i=1}^n Aᵢ) − Pr(C_ε ∩ ⋂_{i=1}^n Aᵢ)| < ε/2,

    |∫_C g(s) dμ(s) − ∫_{C_ε} g(s) dμ(s)| < ε/2.

Combining these gives |∫_C g(s) dμ(s) − Pr(C ∩ ⋂_{i=1}^n Aᵢ)| < ε. Since ε is arbi-
trary, (B.63) holds for all C ∈ A_{Y,m}. □
A particular case of interest involves three random quantities. Theorem B.64
says that when there are only two Xs in Theorem B.61, we can check conditional
independence by checking only one of the equations of the form (B.62).

Theorem B.64.26 Let X, Y, and Z be three random quantities, and let A_X, A_Y,
and A_Z be the σ-fields generated by each of them. Suppose that for all A ∈ A_X,
Pr(A | Y, Z) = Pr(A | Y). Then X and Z are conditionally independent given Y.
PROOF. We need to check that for every A ∈ A_X and B ∈ A_Z, Pr(A ∩ B | Y) =
Pr(A | Y) Pr(B | Y). Equivalently, for all such A and B, and all C ∈ A_Y, we must
show

    Pr(A ∩ B ∩ C) = ∫ I_C(s) Pr(A | Y)(s) Pr(B | Y)(s) dμ(s).    (B.65)

26This theorem is used in the proofs of Theorems 2.14 and 2.20.


Since we have assumed that Pr(A | Y, Z) = Pr(A | Y), we have that, for all B ∈ A_Z
and C ∈ A_Y,

    Pr(A ∩ B ∩ C) = ∫ I_C(s) I_B(s) Pr(A | Y)(s) dμ(s).

We can use Proposition B.27 with g(Y) = I_C Pr(A | Y) and X = I_B to see that

    ∫ I_C(s) I_B(s) Pr(A | Y)(s) dμ(s) = ∫ I_C(s) Pr(A | Y)(s) Pr(B | Y)(s) dμ(s).

Together, these last two equations prove (B.65). □
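Theorem B.64 is the measure-theoretic form of the familiar Markov property. A simulation sketch (ours, not the text's): in a two-state chain X → Y → Z, Pr(A | Y, X) = Pr(A | Y) holds by construction for events A generated by Z, so the theorem predicts that X and Z are conditionally independent given Y:

```python
import random

rng = random.Random(2)
reps = 200000

def step(prev, p_stay=0.8):
    """One transition of a symmetric two-state chain: stay w.p. 0.8."""
    return prev if rng.random() < p_stay else 1 - prev

n_y1 = n_x1_y1 = n_z1_y1 = n_x1z1_y1 = 0
for _ in range(reps):
    x = 1 if rng.random() < 0.5 else 0
    y = step(x)   # Y depends on X only
    z = step(y)   # Z depends on Y only
    if y == 1:    # condition on the event {Y = 1}
        n_y1 += 1
        n_x1_y1 += x
        n_z1_y1 += z
        n_x1z1_y1 += x * z

p_x1 = n_x1_y1 / n_y1        # Pr(X=1 | Y=1), theoretically 0.8
p_z1 = n_z1_y1 / n_y1        # Pr(Z=1 | Y=1), theoretically 0.8
p_joint = n_x1z1_y1 / n_y1   # Pr(X=1, Z=1 | Y=1), theoretically 0.64
print(p_joint, p_x1 * p_z1)
```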


The following result relates product measure on a product space to independent
random variables.

Proposition B.66. Let (S, A, μ) be a probability space and let (Tᵢ, Bᵢ) (i =
1, ..., n) be measurable spaces. Let Xᵢ : S → Tᵢ be measurable for i = 1, ..., n. Let
μᵢ be the measure that Xᵢ induces on Tᵢ for each i, and let Tⁿ = T₁ × ⋯ × Tₙ,
Bⁿ = B₁ ⊗ ⋯ ⊗ Bₙ. Let μ* be the measure that (X₁, ..., Xₙ) induces on (Tⁿ, Bⁿ)
from μ. Then μ* is the product measure μ₁ × ⋯ × μₙ if and only if the Xᵢ
are independent.
The same result holds for conditional independence.

Corollary B.67. Random quantities X₁, ..., Xₙ are conditionally independent
given Y if and only if the product measure of the conditional distributions of
X₁, ..., Xₙ given Y is a version of the conditional distribution of (X₁, ..., Xₙ)
given Y.
There is an interesting theorem that applies to sequences of independent ran-
dom variables, even if they are not identically distributed.

Theorem B.68 (Kolmogorov zero-one law).27 Suppose that (S, A, μ) is a
probability space. Let {Xₙ}_{n=1}^∞ be a sequence of independent random quantities.
For each n, let Cₙ be the σ-field generated by (Xₙ, X_{n+1}, ...) and let C = ⋂_{n=1}^∞ Cₙ.
Then every set in C has probability 0 or probability 1.

PROOF. Let Aₙ be the σ-field generated by (X₁, ..., Xₙ). Then C* = ⋃_{n=1}^∞ Aₙ is
a field. It is easy to see that C is contained in the smallest σ-field containing C*.
Let A ∈ C. By Lemma A.24, for every k > 0, there exist n and Cₖ ∈ Aₙ such
that μ(A Δ Cₖ) < 1/k. It follows that

    lim_{k→∞} μ(Cₖ) = μ(A).    (B.69)

Since A ∈ C, it follows that A ∈ C_{n+1}; hence A and Cₖ are independent for
every k. It follows that μ(Cₖ ∩ A) = μ(A) μ(Cₖ). It follows from (B.69) that
μ(A) = μ(A)², and hence either μ(A) = 0 or μ(A) = 1. □

27This theorem is used in the proofs of Corollary 1.63 and Lemma 7.83, and in
the discussion of "sampling to a foregone conclusion" in Section 9.4.

The σ-field C in Theorem B.68 is often called the tail σ-field of the sequence
{Xₙ}_{n=1}^∞. An interesting feature of the tail σ-field is that limits are measurable
with respect to it.28 (See Problem 21 on page 663.)
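The event that the running average of an IID Bernoulli(1/2) sequence converges to 1/2 is a tail event (changing finitely many terms does not affect it), so Theorem B.68 forces its probability to be 0 or 1; the strong law of large numbers says it is 1. A small simulation (ours, not the text's) makes the dichotomy visible:

```python
import random

rng = random.Random(3)
paths, n = 500, 4000
near_half = 0
for _ in range(paths):
    # One realization of the first n terms of the Bernoulli(1/2) sequence.
    total = sum(1 if rng.random() < 0.5 else 0 for _ in range(n))
    if abs(total / n - 0.5) < 0.02:
        near_half += 1
frac = near_half / paths  # close to 1, not some middling value
print(frac)
```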

B.3.5 The Law of Total Probability


Next, we introduce some theorems that are very simple to state for discrete
random variables but appear to be rather unwieldy in the general case. We will,
however, need them often.
Theorem B.70 (Law of total probability). Let (S, A, μ) be a probability
space, and let Z be a random variable with E(|Z|) < ∞. Let C ⊆ B be sub-σ-
fields of A. Then E(Z | C) = E(E(Z | B) | C), a.e. [μ].
PROOF. Define T = E(Z | B) : S → ℝ, which is any B-measurable function
satisfying E(Z I_B) = ∫_B T(s) dμ(s) for all B ∈ B. We need to show that E(Z | C) =
E(T | C) a.s. [μ]. The function E(T | C) is any C-measurable function satisfying
∫_C E(T | C)(s) dμ(s) = E(T I_C) for all C ∈ C. But, since C ⊆ B, C ∈ C implies
C ∈ B. So, for C ∈ C,

    ∫_C E(T | C)(s) dμ(s) = E(T I_C) = ∫ I_C(s) T(s) dμ(s) = ∫_C T(s) dμ(s) = E(Z I_C),

where the last equality follows since T = E(Z | B) and C ∈ B. Since E(T | C) is
C-measurable, equating the first and last entries of the above string of equations
shows that E(T | C) satisfies the condition required for it to equal E(Z | C). □
When B and C are the σ-fields generated by two random quantities X and
Y, respectively, C ⊆ B means Y is a function of X. So, Theorem B.70 can be
rewritten in this case.

Corollary B.71. Let X : S → U₁, Y : S → U₂, and Z : S → ℝ be measurable
functions such that E(|Z|) < ∞. Suppose that Y is a function of X. Then
E(Z | Y) = E(E(Z | X) | Y), a.s. [μ].

The most popular special case of this corollary occurs when Y is constant.
Corollary B.72.29 Let (S, A, μ) be a probability space. Let X : S → U₁ and
Z : S → ℝ be measurable functions such that E(|Z|) < ∞. Then E(Z) =
E(E(Z | X)).

This is the special case of Theorem B.70 in which C is the trivial σ-field.
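Corollary B.72 can be verified exactly in a small discrete example (ours, not the text's), using exact rational arithmetic:

```python
from fractions import Fraction as F

# X takes values 0, 1, 2 with probabilities 1/2, 1/3, 1/6; given X = x,
# Z is x or x + 1 with equal probability, so E(Z | X = x) = x + 1/2.
p_x = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}
cond_mean = {x: x + F(1, 2) for x in p_x}

# Left side: E(Z) computed from the joint distribution of (X, Z).
e_z = sum(px * F(1, 2) * (x + x + 1) for x, px in p_x.items())
# Right side: E(E(Z | X)).
e_cond = sum(px * cond_mean[x] for x, px in p_x.items())
print(e_z, e_cond)  # both equal 7/6
```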
The following theorem implies that if a conditional mean given X depends on
X only through h(X), then it is also the conditional mean given h(X).

Theorem B.73.30 Let (S, A, μ) be a probability space and let B and C be sub-
σ-fields of A with C ⊆ B. Let Z : S → ℝ be measurable such that E(|Z|) <
∞. Then there exists a version of E(Z | B) that is C-measurable if and only if
E(Z | B) = E(Z | C), a.s. [μ].

PROOF. For the "if" direction, if E(Z | B) = E(Z | C), a.s. [μ], then E(Z | C) is
measurable with respect to both C and B, and hence it is a C-measurable version
of E(Z | B). For the "only if" direction, if W is a C-measurable version of E(Z | B),
then W = E(W | C), a.s. [μ], by the second part of Proposition B.25. By the law
of total probability (Theorem B.70), E(W | C) = E(Z | C), a.s. [μ]. □

28The tail σ-field will play a role in the proofs of Corollary 1.63 and Theo-
rem 1.49.
29This corollary is used in the proof of Theorem B.75.
30This theorem is used in the proofs of Theorems 1.49 and 2.6.
A useful corollary is the following.

Corollary B.74.31 Let (S, A, μ) be a probability space. Let (S₁, A₁) and (S₂, A₂)
be measurable spaces, and let X : S → S₁ and h : S₁ → S₂ be measurable
functions. Let Z : S → ℝ be measurable such that E(|Z|) < ∞. Define Y = h(X).
Then E(Z | X = x, Y = y) = E(Z | X = x) a.s. with respect to the measure on
(S₁ × S₂, A₁ ⊗ A₂) induced by (X, Y) : S → S₁ × S₂ from μ.
The following theorem deals with conditioning on two random quantities at
the same time. In words, it says that the conditional mean of a random variable
Z given two random quantities X₁ and X₂ can be calculated two ways. One is
to condition on both X₁ and X₂ at once, and the other is to condition on one
of them, say X₂, and then find the conditional mean of Z given X₁, but starting
from the conditional distribution of (Z, X₁) given X₂.

Theorem B.75.32 Let (S, A, μ) be a probability space and let (Xᵢ, Bᵢ) for i = 1, 2
be measurable spaces. Let Xᵢ : S → Xᵢ for i = 1, 2 and Z : S → ℝ be random
quantities such that E(|Z|) < ∞. Let μ_{1,2,Z} denote the measure on (X₁ × X₂ ×
ℝ, B₁ ⊗ B₂ ⊗ B) induced by (X₁, X₂, Z) from μ. (Here, B denotes the Borel σ-
field.) For each (x, y) ∈ X₁ × X₂, let g(x, y) denote E(Z | (X₁, X₂) = (x, y)). For
each A ∈ A and y ∈ X₂, let μ⁽²⁾(A|y) denote Pr(A | X₂ = y). For each y ∈ X₂,
let h(x, y) denote the conditional mean of Z given X₁ = x calculated in the
probability space (S, A, μ⁽²⁾(·|y)). Then h = g a.s. [μ_{1,2,Z}].
PROOF. Saying that h = g a.s. [μ_{1,2,Z}] is equivalent to saying that

    E(Z | X₁, X₂) = h(X₁, X₂), a.s. [μ].

To prove this, we first note that f(s) = h(X₁(s), X₂(s)) is measurable with respect
to the σ-field generated by (X₁, X₂), A_{X₁,X₂}. All that remains is to show that
it satisfies the integral condition required to be E(Z | X₁, X₂). That is, for all
C ∈ A_{X₁,X₂},

    E(Z I_C) = ∫_C f(s) dμ(s).    (B.76)

Let μ₂ be the measure on (X₂, B₂) induced by X₂ from μ. First, suppose that
C = A ∩ B, where A ∈ A_{X₁} and B ∈ A_{X₂}. The last hypothesis of the theorem
says that for all A ∈ A_{X₁}, E(Z I_A | X₂ = y) = ∫_A h(X₁(s), y) dμ⁽²⁾(s|y). If μ_{1|2}(·|y)
is the probability on (X₁, B₁) induced by X₁ from μ⁽²⁾(·|y), then μ_{1|2}(·|y) is also
the conditional distribution of X₁ given X₂ = y as in Theorem B.46. Suppose

31This corollary is used in the proof of Theorem 2.14.
32This theorem is used in the proof of Lemma 2.120, and it is used in making
sense of the notation E_θ when introducing parametric models.

that A = X₁⁻¹(D) and B = X₂⁻¹(F). Then A ∩ B = (X₁, X₂)⁻¹(D × F) and
E(Z I_A | X₂ = y) = ∫_D h(x, y) dμ_{1|2}(x|y). By Corollary B.72 and Theorem B.46, we
can write

    E(Z I_A I_B) = ∫_F ∫_D h(x, y) dμ_{1|2}(x|y) dμ₂(y)
        = ∫_{D×F×ℝ} h(x, y) dμ_{1,2,Z}(x, y, z) = ∫_{A∩B} f(s) dμ(s).

This proves (B.76) for C = A ∩ B. Let C be the collection of all sets C in A such
that (B.76) holds. Clearly S ∈ C. If C ∈ C, then Cᶜ ∈ C since ∫ f(s) dμ(s) =
E(Z). By additivity of integrals, if {Cᵢ}_{i=1}^∞ are disjoint sets in C, then ⋃_{i=1}^∞ Cᵢ ∈ C; hence C
contains the smallest σ-field containing all sets of the form A ∩ B for A ∈ A_{X₁}
and B ∈ A_{X₂}. Theorem A.34 can be used to show that this σ-field is A_{X₁,X₂}. □
If a random variable has finite second moment, then there is a concept of
conditional variance.

Definition B.77. Let X : S → ℝᵏ have finite second moment, and let C be a
sub-σ-field of A. Then the conditional covariance matrix of X given C is defined
as Var(X | C) = E[(X − E(X | C))(X − E(X | C))ᵀ | C].

The following result is easy to prove.

Proposition B.78.33 Let X : S → ℝᵏ have finite second moment, and let C be
a sub-σ-field of A. Then Var(X) = E[Var(X | C)] + Var[E(X | C)].
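Proposition B.78 (the conditional variance decomposition) can also be checked exactly in a small discrete model (our example, with C the σ-field generated by a two-valued variable):

```python
from fractions import Fraction as F

# Joint distribution of (C, X): C in {0,1} with P(C=0) = 2/3, and
# X | C=0 uniform on {0,1}, X | C=1 uniform on {0,2}.
joint = {(0, 0): F(1, 3), (0, 1): F(1, 3), (1, 0): F(1, 6), (1, 2): F(1, 6)}

def mean(dist):
    return sum(p * x for x, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum(p * (x - m) ** 2 for x, p in dist.items())

# Marginal distribution of X.
marg = {}
for (c, x), p in joint.items():
    marg[x] = marg.get(x, F(0)) + p

# Conditional distributions of X given C.
p_c = {0: F(2, 3), 1: F(1, 3)}
cond = {c: {x: p / p_c[c] for (cc, x), p in joint.items() if cc == c}
        for c in p_c}

e_var = sum(p_c[c] * var(cond[c]) for c in p_c)        # E[Var(X|C)]
var_e = var({mean(cond[c]): p_c[c] for c in p_c})      # Var[E(X|C)]
print(var(marg), e_var + var_e)  # both equal 5/9
```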

B.4 Limit Theorems
There are several types of convergence that will be of interest to us. They involve
sequences of random quantities or sequences of distributions.

B.4.1 Convergence in Distribution and in Probability


The simplest type of convergence occurs when the distributions have densities
with respect to a common measure. The following theorem is due to Scheffé
(1947).

Theorem B.79 (Scheffé's theorem).34 Let {pₙ}_{n=1}^∞ and p be nonnegative
functions from a measure space (X, B, ν) to ℝ such that the integral of each
function is 1 and lim_{n→∞} pₙ(x) = p(x), a.e. [ν]. Then

    lim_{n→∞} ∫_B pₙ(x) dν(x) = ∫_B p(x) dν(x), for all B ∈ B.
PROOF. Let δₙ(x) = pₙ(x) − p(x), and let δₙ⁺ and δₙ⁻ be its positive and neg-
ative parts. Clearly, both lim_{n→∞} δₙ⁺ = 0 and lim_{n→∞} δₙ⁻ = 0, a.e. [ν]. Since
0 ≤ δₙ⁻ ≤ p, it follows from the dominated convergence theorem A.57
that lim_{n→∞} ∫_B δₙ⁻(x) dν(x) = 0 for all B. Since both pₙ and p are densities,
∫_X δₙ(x) dν(x) = 0 for all n. It follows that lim_{n→∞} ∫_X δₙ⁺(x) dν(x) = 0. Since
I_B(x) δₙ⁺(x) ≤ δₙ⁺(x) for all x, it follows from Proposition A.58 that

    lim_{n→∞} ∫_B δₙ⁺(x) dν(x) = 0.

So, lim_{n→∞} ∫_B [pₙ(x) − p(x)] dν(x) = 0 for all B. □

33This proposition is used in the proofs of Theorems 2.36 and 2.86.
34This theorem is used in the proofs of Lemma 1.113 and Theorem 1.121.
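A numerical sketch of Scheffé's theorem (ours, not the text's): with pₙ the N(1/n, 1) density and p the N(0, 1) density, pₙ → p pointwise, and the grid approximation of ∫ |pₙ − p| dλ below shrinks toward 0, which bounds |∫_B pₙ − ∫_B p| uniformly over all sets B:

```python
import math

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def l1_distance(mu, lo=-10.0, hi=10.0, steps=20001):
    """Trapezoid-rule approximation of the L1 distance between
    the N(mu, 1) and N(0, 1) densities."""
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        x = lo + i * h
        w = 0.5 if i in (0, steps - 1) else 1.0
        total += w * abs(normal_pdf(x, mu) - normal_pdf(x, 0.0)) * h
    return total

dists = [l1_distance(1.0 / n) for n in (1, 10, 100)]
print(dists)  # decreasing toward 0
```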
Since defining convergence requires a topology, the following definitions require
that the random quantities lie in various types of topological spaces.

Definition B.80. Let {Xₙ}_{n=1}^∞ be a sequence of random quantities and let X
be another random quantity, all taking values in the same topological space X.
If lim_{n→∞} E(f(Xₙ)) = E(f(X)) for every bounded continuous function
f : X → ℝ, then we say that Xₙ converges in distribution to X, which is
written Xₙ →ᴰ X.

Convergence in distribution is sometimes defined in terms of probability mea-
sures. The reason is that if Xₙ →ᴰ X, the actual values of Xₙ and of X do not
play any role in the convergence. All that matters is the distributions of Xₙ and
of X.

Definition B.81. Let {Pₙ}_{n=1}^∞ be a sequence of probability measures on a topo-
logical space (X, B), where B contains all open sets. Let P be another probability
on (X, B). We say that Pₙ converges weakly35 to P (denoted Pₙ →ʷ P) if, for each
bounded continuous function g : X → ℝ, lim_{n→∞} ∫ g(x) dPₙ(x) = ∫ g(x) dP(x).

35This is not exactly the same as the concept of weak convergence in normed
linear spaces [see, for example, Dunford and Schwartz (1957), p. 419]. The col-
lection of all probability measures on a space (X, B) can be considered a subset
of a normed linear space C consisting of all finite signed measures ν (see Defini-
tion A.15) with the norm being sup_{B∈B} |ν(B)|. Weak convergence of a sequence
{νₙ}_{n=1}^∞ in this space would require the convergence of L(νₙ) for every bounded
linear functional L on C. Every bounded measurable function g on (X, B) deter-
mines a bounded linear functional L_g on C by L_g(ν) = ∫ g(x) dν(x), where the
integral with respect to a signed measure can be defined as in Problem 27 on
page 605. Hence, weak convergence of a sequence of probability measures would
require convergence of the means of all bounded measurable functions. In partic-
ular, lim_{n→∞} Pₙ(B) = P(B) for all measurable sets B, not just those for which
P assigns 0 probability to the boundary (see the portmanteau theorem B.83 on
page 636). Alternatively, we can consider the set of bounded continuous functions
f : X → ℝ as a normed linear space N with ‖f‖ = sup_x |f(x)|. Then the set of
finite signed measures C is a set of bounded linear functionals on N using the
definition ν(f) = ∫ f(x) dν(x). Weak* convergence of a sequence {νₙ}_{n=1}^∞ in C to
ν is defined as the convergence of νₙ(f) to ν(f) for all f ∈ N. This is precisely
convergence in distribution. Hence, it would make more sense to call convergence
in distribution weak* convergence rather than weak convergence. Since the tra-
dition in probability theory is to call it weak convergence, we will continue to do
so.

It is easy to see that these two types of convergence are the same.

Proposition B.82. Let Pₙ be the distribution of Xₙ, and let P be the distribution
of X. Then Xₙ →ᴰ X if and only if Pₙ →ʷ P.

Since we will usually be dealing with spaces X that are metric spaces, there are
some equivalent ways to define convergence in distribution or weak convergence.
The proofs of Theorems B.83 and B.88 are adapted from Billingsley (1968).
Theorem B.83 (Portmanteau theorem).36 The following are all equivalent
in a metric space:

1. Pₙ →ʷ P;
2. limsup_{n→∞} Pₙ(B) ≤ P(B) for each closed B;
3. liminf_{n→∞} Pₙ(A) ≥ P(A) for each open A;
4. lim_{n→∞} Pₙ(C) = P(C) for each C with P(∂C) = 0.37
PROOF. Let d be the metric in the metric space. First, assume (1) and let B be
a closed set. Let δ > 0 be given. For each ε > 0, define C_ε = {x : d(x, B) ≤ ε},
where d(x, B) = inf_{y∈B} d(x, y). Since |d(x, B) − d(y, B)| ≤ d(x, y), we see that
d(x, B) is continuous in x. Each C_ε is closed and ⋂_{ε>0} C_ε = B. Let ε be small
enough so that P(C_ε) ≤ P(B) + δ. Let f : ℝ → ℝ be

    f(t) = { 1       if t ≤ 0,
             1 − t   if 0 < t < 1,
             0       if t ≥ 1,

and define g_ε(x) = f(d(x, B)/ε). Then g_ε is bounded and continuous. So,

    lim_{n→∞} ∫ g_ε(x) dPₙ(x) = ∫ g_ε(x) dP(x).

It is easy to see that 0 ≤ g_ε(x) ≤ 1, g_ε(x) = 1 for all x ∈ B, and g_ε(x) = 0 for all
x ∉ C_ε. Hence, for every δ > 0,

    Pₙ(B) = ∫ I_B(x) dPₙ(x) ≤ ∫ g_ε(x) dPₙ(x) → ∫ g_ε(x) dP(x)
          ≤ ∫ I_{C_ε}(x) dP(x) = P(C_ε) ≤ P(B) + δ.

It follows that limsup_{n→∞} Pₙ(B) ≤ P(B), which is (2).

That (2) and (3) are equivalent follows easily from the facts that if A is open,
then B = Aᶜ is closed and Pₙ(A) = 1 − Pₙ(B). It is also easy to see that (2) and
(3) together imply (4). Next assume (4), let B be a closed set, and define C_ε as
above. The boundary of C_ε is a subset of {x : d(x, B) = ε}. There can be at most
countably many ε such that these sets have positive probability. Hence, there
exists a sequence {εₖ}_{k=1}^∞ converging to 0 such that P(d(X, B) = εₖ) = 0 for all
k. It follows that lim_{n→∞} Pₙ(C_{εₖ}) = P(C_{εₖ}) for all k. Since Pₙ(B) ≤ Pₙ(C_{εₖ})
for every n and k, we have, for every k,

    limsup_{n→∞} Pₙ(B) ≤ lim_{n→∞} Pₙ(C_{εₖ}) = P(C_{εₖ}).

Since P(B) = lim_{k→∞} P(C_{εₖ}), we have (2). So, (2), (3), and (4) are equivalent
and (1) implies (2).

36This theorem is used in the proofs of Theorem B.88 and Lemma 7.19.
37We use the symbol ∂ in front of the name of a subset of a topological space
to refer to the boundary of the set. The boundary of a set C in a topological space
is the intersection of the closure of the set with the closure of the complement.
All that remains is to prove that (2) implies (1). Assume (2), and let f be
a bounded continuous function. Let m < f(x) < M for all x. For each k, let
F_{i,k} = {x : f(x) ≤ m + (M − m)i/k} for i = 1, ..., k. Let F_{0,k} = ∅. Each F_{i,k} is
closed, since f is continuous. Let G_{i,k} = F_{i,k} \ F_{i−1,k} for i = 1, ..., k. It is easy
to see that for every probability Q,

    m + (M − m) Σ_{i=1}^k [(i−1)/k] Q(G_{i,k}) < ∫ f(x) dQ(x) ≤ m + (M − m) Σ_{i=1}^k (i/k) Q(G_{i,k}).

Since Q(G_{i,k}) = Q(F_{i,k}) − Q(F_{i−1,k}) for every i and k, we get

    M − [(M − m)/k] Σ_{i=1}^k Q(F_{i,k}) < ∫ f(x) dQ(x)
        ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k Q(F_{i,k}).    (B.84)

For each i,

    limsup_{n→∞} Pₙ(F_{i,k}) ≤ P(F_{i,k}).    (B.85)

It follows that, for every k,

    ∫ f(x) dP(x) ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k P(F_{i,k})
        ≤ M + (M − m)/k − [(M − m)/k] Σ_{i=1}^k limsup_{n→∞} Pₙ(F_{i,k})
        ≤ (M − m)/k + liminf_{n→∞} ∫ f(x) dPₙ(x),

where the first inequality follows from the second inequality in (B.84) with Q = P,
the second inequality follows from (B.85), and the third inequality follows from
the first inequality in (B.84) with Q = Pₙ. Letting k be arbitrarily large, we get

    ∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ f(x) dPₙ(x).    (B.86)

Now, apply the same reasoning to −f to get

    −∫ f(x) dP(x) ≤ liminf_{n→∞} ∫ −f(x) dPₙ(x) = −limsup_{n→∞} ∫ f(x) dPₙ(x),

so that

    ∫ f(x) dP(x) ≥ limsup_{n→∞} ∫ f(x) dPₙ(x).    (B.87)

Together, (B.86) and (B.87) imply (1). □
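The restriction P(∂C) = 0 in part (4) cannot be dropped. A tiny sketch (our example, not the text's): let Pₙ be point mass at 1/n and P point mass at 0, so that Pₙ →ʷ P. For C = (0, ∞), whose boundary point 0 carries all of P's mass, Pₙ(C) fails to converge to P(C); for C = (−1, 1), whose boundary is P-null, it does converge:

```python
# Point mass at 1/n versus point mass at 0.
def P_n(n, interval):
    lo, hi = interval
    return 1.0 if lo < 1.0 / n < hi else 0.0

def P(interval):
    lo, hi = interval
    return 1.0 if lo < 0.0 < hi else 0.0

open_halfline = (0.0, float("inf"))  # boundary point 0 has P-mass 1
nice_set = (-1.0, 1.0)               # boundary {-1, 1} has P-mass 0

bad = [P_n(n, open_halfline) for n in (2, 10, 1000)]
good = [P_n(n, nice_set) for n in (2, 10, 1000)]
print(bad, P(open_halfline))  # P_n gives 1 forever, but P gives 0
print(good, P(nice_set))      # here P_n(C) -> P(C) = 1
```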



Theorem B.88 (Continuous mapping theorem).38 Let {Xₙ}_{n=1}^∞ be a se-
quence of random quantities, and let X be another random quantity, all taking
values in the same metric space X. Suppose that Xₙ →ᴰ X. Let Y be a metric
space and let g : X → Y. Define

    C_g = {x : g is continuous at x}.

Suppose that Pr(X ∈ C_g) = 1. Then g(Xₙ) →ᴰ g(X).

PROOF. Let Pₙ be the distribution of g(Xₙ) and let P be the distribution of
g(X). Let B be a closed subset of Y, and let cl(g⁻¹(B)) denote the closure of
g⁻¹(B). If x ∈ cl(g⁻¹(B)) but x ∉ g⁻¹(B), then g is not continuous at x. It
follows that cl(g⁻¹(B)) ⊆ g⁻¹(B) ∪ C_gᶜ. Now write

    limsup_{n→∞} Pₙ(B) = limsup_{n→∞} Pr(Xₙ ∈ g⁻¹(B)) ≤ limsup_{n→∞} Pr(Xₙ ∈ cl(g⁻¹(B)))
        ≤ Pr(X ∈ cl(g⁻¹(B))) ≤ Pr(X ∈ g⁻¹(B)) + Pr(X ∈ C_gᶜ)
        = Pr(X ∈ g⁻¹(B)) = P(B),

and the result now follows from the portmanteau theorem B.83. □
Another type of convergence is convergence in probability.

Definition B.89. If {Xₙ}_{n=1}^∞ and X are random quantities in a metric space
with metric d, and if, for every ε > 0, lim_{n→∞} Pr(d(Xₙ, X) > ε) = 0, then we
say that Xₙ converges in probability to X, which is written Xₙ →ᴾ X.

The following theorem is useful in that it relates convergence in distribution,
convergence in probability, and the simpler concept of convergence almost surely.
Theorem B.90.39 Let {Xₙ}_{n=1}^∞ be a sequence of random vectors and let X be a
random vector.

1. If lim_{n→∞} Xₙ = X a.s., then Xₙ →ᴾ X.
2. If Xₙ →ᴾ X, then Xₙ →ᴰ X.
3. If X is degenerate and Xₙ →ᴰ X, then Xₙ →ᴾ X.
4. If Xₙ →ᴾ X, then there is a subsequence {nₖ}_{k=1}^∞ such that lim_{k→∞} X_{nₖ} =
X, a.s.

PROOF. First, assume that Xₙ converges a.s. to X. For each n and ε, let A_{n,ε} =
{s : d(Xₙ(s), X(s)) ≤ ε}. Then Xₙ(s) converges to X(s) if and only if

    s ∈ ⋂_{ε>0} ⋃_{N=1}^∞ (⋂_{n=N}^∞ A_{n,ε}).

Since this set must have probability 1, then so too must ⋃_{N=1}^∞ (⋂_{n=N}^∞ A_{n,ε}) for
all ε. By Theorem A.19, it follows that for every ε, lim_{N→∞} Pr(⋂_{n=N}^∞ A_{n,ε}) = 1.

38This theorem is used to provide a short proof of DeFinetti's representation
theorem for Bernoulli random variables in Example 1.82 on page 46.
39This theorem is used in the proofs of Theorems B.95, 1.49, 7.26, and 7.78.

Hence, for each ε > 0, lim_{n→∞} Pr(A_{n,ε}ᶜ) = 0, which is precisely what it means to
say that Xₙ →ᴾ X.

Next assume that Xₙ →ᴾ X. Let g : X → ℝ be bounded and continuous with
|g(x)| ≤ K for all x. Let ε > 0, and let A be a compact set with Pr(X ∈ A) > 1 −
ε/[6K]. A continuous function (like g) on a compact set is uniformly continuous.
So let δ > 0 be such that x ∈ A and d(x, y) < δ implies |g(x) − g(y)| < ε/3. Since
Xₙ →ᴾ X, there exists N such that n ≥ N implies Pr(d(Xₙ, X) < δ) > 1 − ε/[6K].
Let B = {X ∈ A, d(Xₙ, X) < δ}. It follows that |g(X) I_B − g(Xₙ) I_B| < ε/3 and,
for all n ≥ N, Pr(B) > 1 − ε/[3K]. Also, note that n ≥ N implies

    |E g(X) − E[g(X) I_B]| < ε/3,    |E g(Xₙ) − E[g(Xₙ) I_B]| < ε/3.

So, n ≥ N implies

    |E g(X) − E g(Xₙ)| ≤ |E g(X) − E[g(X) I_B]| + |E[g(X) I_B] − E[g(Xₙ) I_B]|
        + |E[g(Xₙ) I_B] − E g(Xₙ)|
        ≤ ε/3 + ε/3 + ε/3 = ε.

Thus, lim_{n→∞} E g(Xₙ) = E g(X), and we have proven Xₙ →ᴰ X.

Next, suppose that X is degenerate at x₀ and Xₙ →ᴰ X. Let ε > 0, and define

    g(x) = { 1                  if d(x, x₀) ≤ ε/2,
             0                  if d(x, x₀) ≥ ε,
             2 − 2 d(x, x₀)/ε   otherwise.

Since g is bounded and continuous, E g(Xₙ) converges to E g(X). But E g(X) = 1
since Pr(g(X) = 1) = 1, and E g(Xₙ) ≤ Pr(d(Xₙ, x₀) < ε), since 0 ≤ g(x) ≤ 1 for
all x and g(x) = 0 when d(x, x₀) ≥ ε. So lim_{n→∞} Pr(d(Xₙ, x₀) < ε) = 1, and
Xₙ →ᴾ X.

Finally, assume that Xₙ →ᴾ X. Let nₖ be such that n ≥ nₖ implies

    Pr(d(Xₙ, X) ≥ 1/k) < 2⁻ᵏ.

Define Aₖ = {d(X_{nₖ}, X) ≥ 1/k}. By the first Borel–Cantelli lemma A.20, we
have Pr(B) = 0, where B = ⋂_{i=1}^∞ ⋃_{k=i}^∞ Aₖ. It is easy to check that B is the
event that d(X_{nₖ}, X) is at least 1/k for infinitely many different k. Hence Bᶜ ⊆
{lim_{k→∞} X_{nₖ} = X}, and lim_{k→∞} X_{nₖ} = X, a.s. □

B.4.2 Characteristic Functions


There is a very important method for proving convergence in distribution which
involves the use of characteristic functions.

Definition B.91. Let X be a random vector. The complex-valued function

    φ_X(t) = E(exp(i tᵀX))

is called the characteristic function of X. If F is a k-dimensional distribution
function, the function φ_F(t) = ∫ exp(i tᵀx) dF(x) is called the characteristic func-
tion of F.

Example B.92. Let X have a standard normal distribution. Then

    φ_X(t) = ∫ exp(itx) (1/√(2π)) exp(−x²/2) dx
           = (1/√(2π)) ∫ exp(−([x − it]² + t²)/2) dx = exp(−t²/2).

Similarly, for other normal distributions, N(μ, σ²), the characteristic functions
are φ_X(t) = exp(−σ²t²/2 + itμ).
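Example B.92 can be checked against an empirical characteristic function (our sketch, not the text's): average exp(itX) over simulated standard normal draws and compare with exp(−t²/2):

```python
import cmath
import math
import random

rng = random.Random(4)
samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]

def empirical_cf(t):
    """Monte Carlo estimate of E(exp(itX)) for standard normal X."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

# Compare with the exact characteristic function exp(-t^2 / 2).
errs = [abs(empirical_cf(t) - math.exp(-t * t / 2.0)) for t in (0.5, 1.0, 2.0)]
print(errs)
```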

By Theorem B.12, if X has CDF F, then φ_X = φ_F. It is easy to see that the
characteristic function exists for every random vector and has complex absolute
value at most 1 for all t. Other facts that follow directly from the definition are
the following. If Y = aX + b, then φ_Y(t) = φ_X(at) exp(itb). If X and Y are
independent, φ_{X+Y} = φ_X φ_Y.

The reason that characteristic functions are so useful for proving convergence
in distribution is twofold. First, for each characteristic function φ, there is
only one CDF F such that φ_F = φ. (See the uniqueness theorem B.106.) Sec-
ond, characteristic functions are "continuous" as a function of the distribution
in the sense of convergence in distribution. That is, Xₙ →ᴰ X if and only if
lim_{n→∞} φ_{Xₙ}(t) = φ_X(t) for all t.40 (See the continuity theorem B.93.)

Theorem B.93 (Continuity theorem).41 For finite-dimensional random vec-


tors, convergence in distribution is equivalent to convergence of characteristic
functions. That is, Xn E. X if and only if limn~oo c/>Xn (t) == c/>x (t) for all t.
PROOF. The "only if" part follows from Definition B.80 and the fact that one
can write exp(it^T x) as two bounded, continuous, real-valued functions of x for
every t.
For the "if" part, suppose that X is k-dimensional and that lim_{n→∞} φ_{X_n}(t) =
φ_X(t) for all t. To prove that for each bounded continuous g, lim_{n→∞} E g(X_n) =
E g(X), we will truncate g to a bounded rectangle and then approximate the
truncated function by a function g′ whose mean is a linear combination of values
of the characteristic function. The mean of g′(X_n) will then converge to the mean
of g′(X). We then need to show that the means of g′(X) and g′(X_n) approximate
the means of g(X) and g(X_n), respectively.
First, we need to find a bounded rectangle on which to do the truncation. For
each coordinate X^l of X, we will show that if a and b are continuity points of
the CDF F_{X^l} of X^l, and F_{X^l}(b) − F_{X^l}(a) > q, then there are b′ > b and a′ < a
such that lim_{n→∞} F_{X^l_n}(b′) − F_{X^l_n}(a′) ≥ q. For each a, b, δ, define

    f_{a,b,δ}(x) = 1                if a < x < b,
                   1 − (a − x)/δ    if a − δ < x ≤ a,       (B.94)
                   1 − (x − b)/δ    if b ≤ x < b + δ,
                   0                otherwise.

40 This presentation is a hybrid of the presentations given by Breiman (1968,
Chapter 8) and Hoel, Port, and Stone (1971, Chapter 8).
41 This theorem is used in the proofs of Theorems B.95, B.97, and 7.20.
B.4. Limit Theorems 641

Note that this function has equal values at a − δ and b + δ. Consider the
interval [a − δ, b + δ] as a circle, identifying the two endpoints. Now, use the
Stone-Weierstrass theorem C.3 to approximate f_{a,b,δ} uniformly to within ε on the
circle by f′_{a,b,δ,ε}(x) = Σ_{j=−m}^{m} b_j exp(2πijx/c), where c = b − a + 2δ. If Y is a random
variable, then E f′_{a,b,δ,ε}(Y) is a linear combination of values of the characteristic
function of Y. So, we have lim_{n→∞} E f′_{a,b,δ,ε}(X^l_n) = E f′_{a,b,δ,ε}(X^l). Let q > 0, and
let a and b be continuity points of F_{X^l} such that F_{X^l}(b) − F_{X^l}(a) = v > q. Let
w = v − q. Let δ > 0 be arbitrary, and define a′ = a − δ and b′ = b + δ. Let N be
large enough so that n ≥ N implies |E f′_{a,b,δ,w/3}(X^l_n) − E f′_{a,b,δ,w/3}(X^l)| < w/3. If
n ≥ N, then

    F_{X^l_n}(b′) − F_{X^l_n}(a′) ≥ E f_{a,b,δ}(X^l_n) > E f′_{a,b,δ,w/3}(X^l_n) − w/3
        ≥ E f′_{a,b,δ,w/3}(X^l) − 2w/3 ≥ E f_{a,b,δ}(X^l) − w
        ≥ F_{X^l}(b) − F_{X^l}(a) − w = q.

Now, let g be a bounded continuous function, and suppose that |g(x)| < K for
all x. Let ε > 0. For each coordinate x_l of X, let a_l and b_l be continuity points
of F_{X^l} such that F_{X^l}(b_l) − F_{X^l}(a_l) > 1 − ε/(7[K + ε/7]k). Let δ > 0 be arbitrary,
and define a′_l = a_l − δ, b′_l = b_l + δ, and g*(x) = g(x) ∏_{l=1}^{k} f_{a_l,b_l,δ}(x_l). Use the
Stone-Weierstrass theorem C.3 to approximate g* uniformly to within ε/7 on the
rectangle {x : a_l − δ ≤ x_l ≤ b_l + δ for all l} by

    g′(x) = Σ_{j_1=−m_1}^{m_1} ⋯ Σ_{j_k=−m_k}^{m_k} a_{j_1,…,j_k} exp(2πi j^T x),

where j is the vector with lth coordinate j_l/[b_l − a_l + 2δ]. Then,

    lim_{n→∞} E g′(X_n) = E g′(X).

Let N_1 be large enough so that n ≥ N_1 implies F_{X^l_n}(b′_l) − F_{X^l_n}(a′_l) ≥
1 − ε/(7[K + ε/7]k) for all l. Let N_2 be large enough so that n ≥ N_2 implies
|E g′(X_n) − E g′(X)| < ε/7. Let R be the rectangle R = {x : a′_l < x_l ≤ b′_l for all l}.
Since g′ is periodic in every coordinate, it is bounded by K + ε/7 on all of ℝ^k. If
n ≥ max{N_1, N_2}, then |E g(X_n) − E g(X)| is no greater than

    E|g(X_n) I_{R^c}(X_n)| + E|g(X) I_{R^c}(X)| + E|g′(X_n) I_{R^c}(X_n)|
      + |E g(X_n) I_R(X_n) − E g′(X_n) I_R(X_n)| + |E g′(X_n) − E g′(X)|
      + E|g′(X) I_{R^c}(X)| + |E g′(X) I_R(X) − E g(X) I_R(X)| ≤ ε.    ∎
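As a numerical aside of ours (not part of the proof), the continuity theorem can be watched in action on the classical Poisson limit: the characteristic functions of Binomial(n, λ/n) converge pointwise to that of Poisson(λ). The two closed-form characteristic functions used below are standard.

```python
import cmath

def cf_binomial(n, p, t):
    # characteristic function of Binomial(n, p): (1 - p + p*e^{it})^n
    return (1 - p + p * cmath.exp(1j * t)) ** n

def cf_poisson(lam, t):
    # characteristic function of Poisson(lam): exp(lam*(e^{it} - 1))
    return cmath.exp(lam * (cmath.exp(1j * t) - 1))

lam, t = 2.0, 1.3
gaps = [abs(cf_binomial(n, lam / n, t) - cf_poisson(lam, t)) for n in (10, 100, 1000)]
# the gap shrinks roughly like 1/n as n grows
```

By the continuity theorem, this pointwise convergence of characteristic functions is exactly what it means for Binomial(n, λ/n) to converge in distribution to Poisson(λ).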

We will prove two more limit theorems that make use of the continuity theorem
B.93. Suppose that X has finite mean. Since |exp(itx) − 1| ≤ min{|tx|, 2}
for all t and x,^42 and

    lim_{t→0} (exp(itx) − 1)/t = ix

42 See Problem 26 on page 664.


642 Appendix B. Probability Theory

for all x, it follows from the dominated convergence theorem that

    (d/dt) φ_X(t) |_{t=0} = i E(X).

Similarly, if X has finite variance, it can be shown that

    (d²/dt²) φ_X(t) |_{t=0} = −E(X²).


Using these two facts, we can prove the weak law of large numbers and the central
limit theorem.
Theorem B.95 (Weak law of large numbers). Suppose that {X_n}_{n=1}^∞ are
IID random variables with finite mean μ. Then X̄_n = Σ_{i=1}^{n} X_i/n converges in
probability to μ.
PROOF. First, we will prove that the characteristic function of X̄_n − μ converges to
1 for all t. Let Y_i = X_i − μ. Since φ_{Y_i}(0) = 1, log φ_{Y_i}(t) exists and is differentiable
near t = 0, and we know that

    (d/dt) log φ_{Y_i}(t) |_{t=0} = 0 = lim_{t→0} log φ_{Y_i}(t)/t.    (B.96)

The characteristic function of X̄_n − μ is φ_n(t) = φ_{Y_i}(t/n)^n. For fixed t, let n be
large enough so that t/n is close enough to 0 for log φ_{Y_i}(t/n) to be well defined.
We know that

    log φ_n(t) = n log φ_{Y_i}(t/n) = t · [log φ_{Y_i}(t/n)] / (t/n).

The limit of this quantity, as n → ∞, is 0 by (B.96). It follows that for all t,
lim_{n→∞} φ_n(t) = 1. By the continuity theorem B.93, X̄_n − μ →ᴰ 0. By Theorem B.90, X̄_n − μ →ᴾ 0.    ∎
In Chapter 1, we prove a strong law of large numbers 1.62, which has a stronger
conclusion and a weaker hypothesis. There is also a weak law of large numbers
for the case of infinite means. (See Problem 27 on page 664.)
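A small simulation of ours illustrates the conclusion (the exponential distribution and all constants are arbitrary choices): the chance that X̄_n misses μ by more than a fixed ε shrinks as n grows.

```python
import random

rng = random.Random(1)
mu, eps, reps = 2.0, 0.25, 400

def sample_mean(n):
    # mean of n draws from an exponential distribution with mean 2.0
    return sum(rng.expovariate(0.5) for _ in range(n)) / n

def miss_rate(n):
    # fraction of replications in which the sample mean misses mu by more than eps
    return sum(abs(sample_mean(n) - mu) > eps for _ in range(reps)) / reps

rates = {n: miss_rate(n) for n in (10, 100, 1000)}
# rates should decrease toward 0 as n grows
```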
The following theorem is very useful for approximating distributions.
Theorem B.97 (Central limit theorem). Suppose that {X_i}_{i=1}^∞ is a sequence
that is IID with finite mean μ and finite variance σ². Let X̄_n be the average of
the first n X_i's. Then √n(X̄_n − μ) →ᴰ N(0, σ²), the normal distribution with mean
0 and variance σ².
PROOF. Set Y_n = √n(X̄_n − μ). We might as well assume that μ = 0, since we have
just subtracted it from each X_i. Since the second derivative of the characteristic
function at t = 0 of each X_i is −σ², we can apply l'Hôpital's rule twice to conclude

    lim_{t→0} log φ_{X_i}(t)/t² = −σ²/2.    (B.98)

The characteristic function of Y_n is φ_{Y_n}(t) = φ_{X_i}(t/√n)^n. We will prove that
this converges to exp(−t²σ²/2) for each t. Since log φ_{Y_n}(t) = n log φ_{X_i}(t/√n),
we use (B.98) to note that

    lim_{n→∞} log φ_{Y_n}(t) = lim_{n→∞} t² · [log φ_{X_i}(t/√n)] / (t/√n)² = −t²σ²/2.

It follows that lim_{n→∞} φ_{Y_n}(t) = exp(−t²σ²/2), and the continuity theorem B.93
finishes the proof.    ∎
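The theorem can be illustrated by simulation (our sketch; the Uniform(0,1) summands and all constants are arbitrary): the standardized statistic √n(X̄_n − μ)/σ should place roughly the standard normal probability inside any interval, here (−1, 1).

```python
import math
import random

rng = random.Random(2)
n, reps = 400, 2000
mu, sigma = 0.5, math.sqrt(1.0 / 12.0)  # mean and sd of a Uniform(0,1) variable

def z_stat():
    xbar = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (xbar - mu) / sigma

zs = [z_stat() for _ in range(reps)]
frac_within_one = sum(abs(z) <= 1.0 for z in zs) / reps
# if sqrt(n)(Xbar - mu)/sigma is close to N(0,1), this fraction should be
# near P(|Z| <= 1), which is about 0.683 for a standard normal Z
```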
There is also a multivariate version of the central limit theorem.
Theorem B.99 (Multivariate central limit theorem).^43 Let {X_n}_{n=1}^∞ be a
sequence of IID random vectors in ℝ^p with mean μ and covariance matrix Σ.
Then √n(X̄_n − μ) →ᴰ N_p(0, Σ), a multivariate normal distribution.
PROOF. Let Y_n = √n(X̄_n − μ) and let Y ~ N_p(0, Σ). Then Y_n →ᴰ Y if and
only if the characteristic function of Y_n converges to that of Y, that is, if and
only if, for each λ ∈ ℝ^p, E exp{iλ^T Y_n} → E exp{iλ^T Y}. This occurs if and
only if, for each λ, λ^T Y_n →ᴰ λ^T Y. The distribution of λ^T Y is N(0, λ^T Σ λ), and
λ^T Y_n is √n times the average of λ^T(X_1 − μ), …, λ^T(X_n − μ). By the univariate
central limit theorem B.97, λ^T Y_n →ᴰ λ^T Y.    ∎
There are inversion formulas for characteristic functions which allow us to
obtain or approximate the original distributions from the characteristic functions.
Example B.100 (Continuation of Example B.92; see page 640). Let X have
distribution N(0, σ²). Then ∫ |φ_X(t)| dt < ∞. In fact,

    (1/2π) ∫ exp(−ixt) φ_X(t) dt = (1/2π) ∫ exp(−(σ²/2)[t + ix/σ²]² − x²/(2σ²)) dt
        = (1/(√(2π) σ)) exp(−x²/(2σ²)) = f_X(x).


Example B.100 says that the following inversion formula applies to normal
distributions with 0 mean. It is equally easy to see that it applies to N_k(0, I_k)
distributions.^44
Lemma B.101 (Continuous inversion formula).^45 Let X ∈ ℝ^k have an integrable
characteristic function. Then the distribution of X has a bounded density
f_X with respect to Lebesgue measure given by

    f_X(x) = (1/(2π)^k) ∫ exp(−it^T x) φ_X(t) dt.    (B.102)

PROOF. Clearly, the function in (B.102) is bounded since φ_X is integrable. Let Y_σ
have N_k(0, σ²I_k) distribution. The characteristic function of X + Y_σ is φ_X φ_{Y_σ},
and

    (1/(2π)^k) ∫ exp(−it^T x) φ_X(t) φ_{Y_σ}(t) dt
        = (1/(2π)^k) ∫∫ exp(−it^T x) exp(it^T z) φ_{Y_σ}(t) dF_X(z) dt    (B.103)
        = ∫ f_{Y_σ}(x − z) dF_X(z) = f_{X+Y_σ}(x),

where the second equality follows from the fact that (B.102) applies to normal
distributions. Now suppose that we let σ go to zero. Since φ_X is integrable and
φ_{Y_σ}(t) goes to 1 for all t, it follows that the left-hand side of (B.103) converges to
the right-hand side of (B.102). It also follows that f_{X+Y_σ} is bounded uniformly
in σ and x. Let B be a hypercube such that the probability is 0 that X is in the
boundary of B. Then

    ∫_B lim_{σ→0} f_{X+Y_σ}(x) dx = lim_{σ→0} ∫_B f_{X+Y_σ}(x) dx = ∫_B f_X(x) dx,    (B.104)

where the first equality follows from the boundedness of f_{X+Y_σ}, and the second
is proven as follows. The difference between ∫_B f_{X+Y_σ}(x) dx and ∫_B f_X(x) dx is
the sum over the 2^k corners of the hypercube B of terms like

    Σ_{i=1}^{k} [Pr(b_i − Y_{σ,i} < X_i ≤ b_i, Y_{σ,i} > 0) + Pr(b_i < X_i ≤ b_i − Y_{σ,i}, Y_{σ,i} < 0)],

where b_i is the ith coordinate of the corner. We can write

    Pr(b_i − Y_{σ,i} < X_i ≤ b_i, Y_{σ,i} > 0) = ∫_0^∞ Pr(b_i − y < X_i ≤ b_i) dF_{Y_{σ,i}}(y).

This last expression goes to 0 as σ → 0, since b_i is a continuity point of F_{X_i}.
A similar argument applies to the other probability. The equality of the first
and last expressions in (B.104) is what it means to say that lim_{σ→0} f_{X+Y_σ}(x)
is the density of X with respect to Lebesgue measure. This, in turn, equals the
right-hand side of (B.102).    ∎

43 This theorem is used in the proofs of Theorems 7.35 and 7.57.
44 We use the symbol I_k to stand for the k × k identity matrix.
45 This lemma is used in the proofs of Lemma B.105 and Corollary B.106.
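For k = 1, the inversion formula (B.102) can be checked numerically. The sketch below is ours (the truncation point and step count are arbitrary): a midpoint-rule approximation of the integral with φ(t) = exp(−t²/2) recovers the N(0,1) density.

```python
import cmath
import math

def invert_cf(phi, x, t_max=12.0, steps=4000):
    """Midpoint-rule approximation to (1/(2*pi)) * integral of exp(-i*t*x)*phi(t) dt,
    truncated to [-t_max, t_max]; adequate when phi decays quickly."""
    dt = 2.0 * t_max / steps
    total = 0.0 + 0.0j
    for i in range(steps):
        t = -t_max + (i + 0.5) * dt
        total += cmath.exp(-1j * t * x) * phi(t) * dt
    return total.real / (2.0 * math.pi)

phi_std_normal = lambda t: math.exp(-t * t / 2.0)  # cf of N(0,1)
density_at_0 = invert_cf(phi_std_normal, 0.0)
density_at_1 = invert_cf(phi_std_normal, 1.0)
# these should match the N(0,1) density (1/sqrt(2*pi)) * exp(-x^2/2)
```

The truncation is harmless here because the Gaussian characteristic function is negligible beyond |t| = 12; for a heavier-tailed φ one would need a wider window.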

Lemma B.105.^46 Let Y be a random variable such that φ_Y is integrable. Let X
be an arbitrary random variable independent of Y. For all finite a < b and c,

    Pr(a < X + cY ≤ b) = (1/2π) ∫ [(exp(−ibt) − exp(−iat))/(−it)] φ_X(t) φ_Y(ct) dt.

PROOF. Since φ_Y is integrable and φ_{X+cY}(t) = φ_X(t) φ_Y(ct), it follows that
X + cY has an integrable characteristic function. Lemma B.101 says that (B.102)
applies to X + cY, hence

    Pr(a < X + cY ≤ b) = ∫_a^b f_{X+cY}(x) dx
        = (1/2π) ∫_a^b ∫ φ_X(t) φ_Y(ct) exp(−itx) dt dx
        = (1/2π) ∫ φ_Y(ct) φ_X(t) ∫_a^b exp(−itx) dx dt
        = (1/2π) ∫ φ_Y(ct) φ_X(t) [(exp(−itb) − exp(−ita))/(−it)] dt.    ∎

46 This lemma is used in the proof of Corollary B.106.

Corollary B.106 (Uniqueness theorem).^47 Let F and G be two univariate
CDFs such that φ_F = φ_G. Then F = G.
PROOF. In the proof of Lemma B.101, we proved that if Y ~ N(0, 1), if a and
b are continuity points of F, and if X has CDF F, then lim_{c→0} Pr(a < X + cY ≤
b) = Pr(a < X ≤ b). The same is true of G. Hence, F = G by Lemma B.105.    ∎
An obvious consequence of the uniqueness theorem is the following.
Corollary B.107.^48 Suppose that F and G are k-dimensional CDFs such that
for every bounded continuous f, ∫ f(x) dF(x) = ∫ f(x) dG(x). Then F = G.

B.5 Stochastic Processes


B.5.1 Introduction
Sometimes we wish to specify a joint distribution for an infinite sequence of
random variables. Let (S, A, μ) be a probability space. If X_n : S → ℝ for every
n and each X_n is measurable with respect to the Borel σ-field B, we can define
a σ-field of subsets of ℝ^∞ such that the infinite sequence X = (X_1, X_2, …) is
measurable. Let B^∞ be the smallest σ-field that contains all finite-dimensional
orthants, that is, every set B of the form

    {x : x_{i_1} ≤ c_1, …, x_{i_n} ≤ c_n},  for some n, some integers i_1, …, i_n,
        and some numbers c_1, …, c_n.

It is clear that X^{−1}(B) ∈ A for every such B, since it is the intersection of finitely
many sets in A. By Theorem A.34, it follows that X^{−1}(B^∞) ⊆ A, so X is
measurable with respect to this σ-field.

B.5.2 Martingales†
A particular type of stochastic process that is sometimes of interest is a martingale.
[For more discussion of martingales, see Doob (1953), Chapter VII.]

47 This corollary is used in the proof of Theorem 2.74.
48 This corollary is used in the proof of DeFinetti's representation theorem 1.49.
† This section contains results that rely on the theory of martingales. It may
be skipped without interrupting the flow of ideas.

Definition B.108. Let (S, A, μ) be a probability space. Let N be a set of consecutive
integers. For each n ∈ N, let F_n be a sub-σ-field of A such that F_n ⊆ F_{n+1}
for all n such that n and n + 1 are in N. Let {X_n}_{n∈N} be a sequence of random
variables such that X_n is measurable with respect to F_n for all n. The
sequence of pairs {(X_n, F_n)}_{n∈N} is called a martingale if, for all n such that n
and n + 1 are in N, E(X_{n+1}|F_n) = X_n. It is called a submartingale if, for every
such n, E(X_{n+1}|F_n) ≥ X_n.
Note that a martingale is also a submartingale.
Example B.109. A simple example of a martingale is the following. Let N =
{1, 2, …}, and let {Y_n}_{n=1}^∞ be independent random variables with mean 0. Let
X_n = Σ_{i=1}^{n} Y_i. Let F_n be the σ-field generated by Y_1, …, Y_n. Then,

    E(X_{n+1}|F_n) = E(Y_1 + ⋯ + Y_{n+1}|F_n) = Y_1 + ⋯ + Y_n = X_n,

since E(Y_{n+1}|F_n) = 0 by independence. If instead each Y_i has nonnegative finite
mean, then E(X_{n+1}|F_n) ≥ X_n, and we have a submartingale.
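Example B.109 can be probed by simulation (our sketch; all names and constants are ours). The martingale property implies E(X_{n+1} 1_A) = E(X_n 1_A) for every A ∈ F_n, and the code checks this for the event A = {X_n > 0} along a ±1 random walk.

```python
import random

rng = random.Random(3)
n, reps = 20, 50_000

def walk(steps):
    # partial sums of independent +/-1 steps with mean 0, as in Example B.109
    x, path = 0, []
    for _ in range(steps):
        x += rng.choice((-1, 1))
        path.append(x)
    return path

lhs = rhs = 0.0
for _ in range(reps):
    path = walk(n + 1)
    if path[n - 1] > 0:      # the event A = {X_n > 0}, which lies in F_n
        lhs += path[n]       # contributes X_{n+1} on A
        rhs += path[n - 1]   # contributes X_n on A
lhs /= reps
rhs /= reps
# the estimates of E(X_{n+1} 1_A) and E(X_n 1_A) agree up to Monte Carlo error
```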
Example B.110. Another example of a martingale is the following. Let N be a
collection of consecutive integers, and let {F_n}_{n∈N} be an increasing sequence of
σ-fields. Let X be a random variable with E(|X|) < ∞. Set X_n = E(X|F_n). By
the law of total probability B.70,

    E(X_{n+1}|F_n) = E[E(X|F_{n+1})|F_n] = E(X|F_n) = X_n,

so {(X_n, F_n)}_{n∈N} is a martingale.
Example B.111. If {(X_n, F_n)}_{n∈N} is a martingale, then

    |X_n| = |E(X_{n+1}|F_n)| ≤ E(|X_{n+1}| | F_n),    (B.112)

hence {(|X_n|, F_n)}_{n∈N} is a submartingale.

The following result is proven using the same argument as in Example B.111.
Proposition B.113.^49 If {(X_n, F_n)}_{n∈N} is a martingale, then E|X_n| is
nondecreasing in n.
The reader should note that if {(X_n, F_n)}_{n∈N} is a submartingale and if M ⊆
N is a string of consecutive integers, then {(X_n, F_n)}_{n∈M} is also a submartingale.
Similarly, if k is an integer (positive or negative) and M = {n : n + k ∈ N}, then
{(X′_n, F′_n)}_{n∈M} is a submartingale, where X′_n = X_{n+k} and F′_n = F_{n+k}. This
latter is just a shifting of the index set.
There are important convergence theorems that apply to many martingales
and submartingales. They say that if the set N is infinite, then limit random
variables exist. A lemma is needed to prove these theorems.^50 It puts a bound on
how often a submartingale can cross an interval between two numbers. It is used
to show that such crossings cannot occur infinitely often with high probability.
(Infinitely many crossings of a nondegenerate interval would imply divergence of
the submartingale.)

49This proposition is used in the proof of Theorem B.122.


50This lemma is proven by Doob (1953, Theorem VII, 3.3).

Lemma B.114 (Upcrossing lemma).^51 Let N = {1, …, N}, and suppose
that {(X_n, F_n)}_{n=1}^N is a submartingale. Let r < q, and define V to be the number
of times that the sequence X_1, …, X_N crosses from below r to above q. Then

    E(V) ≤ (1/(q − r)) (E|X_N| + |r|).    (B.115)

PROOF. Let Y_n = max{0, X_n − r} for every n. Since g(x) = max{0, x} is a
nondecreasing convex function of x, it is easy to see (using Jensen's inequality B.17)
that {(Y_n, F_n)}_{n=1}^N is a submartingale. Note that a consecutive set of X_i(s) cross
from below r to above q if and only if the corresponding consecutive set of Y_i(s)
cross from 0 to above q − r. Let T_0(s) = 0 and define T_m for m = 1, 2, … as

    T_m(s) = inf{k ≤ N : k > T_{m−1}(s), Y_k(s) = 0},      if m is odd,
    T_m(s) = inf{k ≤ N : k > T_{m−1}(s), Y_k(s) ≥ q − r},  if m is even,
    T_m(s) = N + 1,  if the corresponding set above is empty.

Now V(s) is one-half of the largest even m such that T_m(s) ≤ N. Define, for
i = 1, …, N,

    R_i(s) = 1 if T_m(s) < i ≤ T_{m+1}(s) for some odd m, and R_i(s) = 0 otherwise.

Then (q − r)V(s) ≤ Σ_{i=1}^{N} R_i(s)(Y_i(s) − Y_{i−1}(s)) =: W(s), where Y_0 ≡ 0 for
convenience. First, note that for all m and i, {s : T_m(s) ≤ i} ∈ F_i. Next, note that
for every i,

    {s : R_i(s) = 1} = ∪_{m odd} ({T_m ≤ i − 1} ∩ {T_{m+1} ≤ i − 1}^c) ∈ F_{i−1}.    (B.116)

It follows that

    E(W) = Σ_{i=1}^{N} ∫_{{s : R_i(s)=1}} (Y_i(s) − Y_{i−1}(s)) dμ(s)
         = Σ_{i=1}^{N} ∫_{{s : R_i(s)=1}} (E(Y_i|F_{i−1})(s) − Y_{i−1}(s)) dμ(s)
         ≤ Σ_{i=1}^{N} ∫ (E(Y_i|F_{i−1})(s) − Y_{i−1}(s)) dμ(s)
         = Σ_{i=1}^{N} (E(Y_i) − E(Y_{i−1})) = E(Y_N),

where the second equality follows from (B.116) and the inequality follows from
the fact that {(Y_n, F_n)}_{n=1}^N is a submartingale. It follows that (q − r)E(V) ≤ E(Y_N).
Since E(Y_N) ≤ |r| + E(|X_N|), it follows that (B.115) holds.    ∎
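The quantities in the lemma are concrete enough to compute. In the sketch below (ours; the ±1 walk and the interval are arbitrary choices), a mean-zero random walk, being a martingale and hence a submartingale, should have average upcrossing count within the bound (B.115).

```python
import random

def upcrossings(path, r, q):
    """Count the moves of the sequence from below r to above q."""
    count, below = 0, False
    for x in path:
        if x < r:
            below = True
        elif x > q and below:
            count += 1
            below = False
    return count

# deterministic check: the sequence dips below 0 and then exceeds 2, twice
assert upcrossings([-1, 3, 1, -2, 4, 3], 0, 2) == 2

rng = random.Random(4)
N, reps, r, q = 50, 2000, -1.0, 3.0
total_v = total_abs = 0.0
for _ in range(reps):
    x, path = 0, []
    for _ in range(N):
        x += rng.choice((-1, 1))
        path.append(x)
    total_v += upcrossings(path, r, q)
    total_abs += abs(path[-1])
mean_v = total_v / reps
bound = (total_abs / reps + abs(r)) / (q - r)  # Monte Carlo estimate of (B.115)
```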
The proof of the following convergence theorem is adapted from Chow, Robbins,
and Siegmund (1971).

51 This lemma is used in the proofs of Theorems B.117 and B.122.



Theorem B.117 (Martingale convergence theorem: part I).^52 Suppose
that {(X_n, F_n)}_{n=1}^∞ is a submartingale such that sup_n E|X_n| < ∞. Then X =
lim_{n→∞} X_n exists a.s. and E|X| < ∞.
PROOF. Let X* = limsup_{n→∞} X_n and X_* = liminf_{n→∞} X_n. Let B = {s :
X_*(s) < X*(s)}. We will prove that μ(B) = 0. We can write

    B = ∪_{r < q, r, q rational} {s : X*(s) ≥ q > r ≥ X_*(s)}.

Now, X*(s) ≥ q > r ≥ X_*(s) if and only if the values of X_n(s) cross from being
below r to being above q infinitely often. For fixed r and q, we now prove that
this has probability 0; hence μ(B) = 0. Let V_n equal the number of times that
X_1, …, X_n cross from below r to above q. According to Lemma B.114,

    sup_n E(V_n) ≤ (1/(q − r)) (sup_n E(|X_n|) + |r|) < ∞.

The number of times the values of {X_n(s)}_{n=1}^∞ cross from below r to above q
equals lim_{n→∞} V_n(s). By the monotone convergence theorem A.52,

    ∞ > sup_n E(V_n) = E(lim_{n→∞} V_n).

It follows that μ({s : lim_{n→∞} V_n(s) = ∞}) = 0.
Since μ(B) = 0, we have that X = lim_{n→∞} X_n exists a.s. Fatou's lemma A.50
says E(|X|) ≤ liminf_{n→∞} E(|X_n|) ≤ sup_n E(|X_n|) < ∞.    ∎
For the particular martingale in which X_n = E(X|F_n) for a single X, we have
an expression for the limit.
Theorem B.118 (Levy's theorem: part I).^53 Let {F_n}_{n=1}^∞ be an increasing
sequence of σ-fields. Let F_∞ be the smallest σ-field containing all of the F_n. Let
E(|X|) < ∞. Define X_n = E(X|F_n) and X_∞ = E(X|F_∞). Then lim_{n→∞} X_n =
X_∞, a.s.
The proof of this theorem requires a lemma that will also be needed later.
Lemma B.119.^54 Let {F_n}_{n=1}^∞ be a sequence of σ-fields. Let E(|X|) < ∞. Define
X_n = E(X|F_n). Then {X_n}_{n=1}^∞ is a uniformly integrable sequence.
PROOF. Since E(X|F_n) = E(X⁺|F_n) − E(X⁻|F_n), and the sum of uniformly
integrable sequences is uniformly integrable, we will prove the result for nonnegative
X. Let A_{c,n} = {X_n ≥ c} ∈ F_n. So ∫_{A_{c,n}} X_n(s) dμ(s) = ∫_{A_{c,n}} X(s) dμ(s). If
we can find, for every ε > 0, a C such that ∫_{A_{c,n}} X(s) dμ(s) < ε for all n and all
c ≥ C, we are done. Define η(A) = ∫_A X(s) dμ(s). We have η ≪ μ and η is finite.

52 This theorem is used in the proofs of Theorems B.118 and 1.121.
53 This theorem is used in the proofs of Theorem 7.78 and Lemma 7.124.
54 This lemma is used in the proofs of Theorems B.118, B.122, and B.124. It is
borrowed from Billingsley (1986, Lemma 35.2).

By Lemma A.72, we have that for every E > 0 there exists li such that JL(A) < 6
implies T/(A) < E. By the Markov inequality B.15,
1 1
JL(Ac,n) :$ -E(Xn) = -E(X),
c c
for all n. Let C = 2E(X)/li. Then c ~ C implies JL(Ac,n) < 6 for all n, so
T/(Ac,n) < for all n. 0
PROOF OF THEOREM B.118. By Lemma B.119, {X_n}_{n=1}^∞ is a uniformly integrable
sequence. Let Y be the limit of the martingale guaranteed by Theorem B.117.
Since Y is a limit of functions of the X_n, it is measurable with respect to F_∞. It
follows from Theorem A.60 that for every event A, lim_{n→∞} E(X_n I_A) = E(Y I_A).
Next, note that, for every A ∈ F_n,

    ∫_A Y(s) dμ(s) = lim_{m→∞} ∫_A E(X|F_m)(s) dμ(s) = ∫_A X(s) dμ(s),

where the last equality follows from the definition of conditional expectation.
Since this is true for every n and every A ∈ F_n, it is true for all A in the field
F = ∪_{n=1}^∞ F_n. Since |X| is integrable, we can apply Theorem A.26 to conclude
that the equality holds for all A ∈ F_∞, the smallest σ-field containing F. The
equality E(X I_A) = E(Y I_A) for all A ∈ F_∞, together with the fact that Y is F_∞
measurable, is precisely what it means to say that Y = E(X|F_∞) = X_∞.    ∎
For negatively indexed martingales, there is also a convergence theorem. Some
authors refer to negatively indexed martingales in a different fashion, which is
often more convenient.
Definition B.120. Let (S, A, μ) be a probability space. For each n = 1, 2, …,
let F_n be a sub-σ-field of A such that F_{n+1} ⊆ F_n for all n. Let {X_n}_{n=1}^∞ be a
sequence of random variables such that X_n is measurable with respect to F_n for
all n. The sequence of pairs {(X_n, F_n)}_{n=1}^∞ is called a reversed martingale if, for
all n, E(X_n|F_{n+1}) = X_{n+1}.
Example B.121. As in Example B.110, we can let {F_n}_{n=1}^∞ be a decreasing
sequence of σ-fields, and let E(|X|) < ∞. Define X_n = E(X|F_n). It follows from
the law of total probability B.70 that {(X_n, F_n)}_{n=1}^∞ is a reversed martingale.
The following theorem is proven by Doob (1953, Theorem VII 4.2).
Theorem B.122 (Martingale convergence theorem: part II).^55 Suppose
that {(X_n, F_n)}_{n<0} is a martingale. Then X = lim_{n→−∞} X_n exists a.s. and has
finite mean.
PROOF. Just as in the proof of Theorem B.117, we let V_n be the number of times
that the finite sequence X_n, X_{n+1}, …, X_{−1} crosses from below a rational r to
above another rational q (for n < 0). The upcrossing lemma B.114 says that

    E(V_n) ≤ (1/(q − r)) (E(|X_{−1}|) + |r|) < ∞.

As in the proof of Theorem B.117, it follows that X = lim_{n→−∞} X_n exists with
probability 1. From (B.112) and Lemma B.119, it follows that

    lim_{n→−∞} E(|X_n|) = E(|X|).

By Proposition B.113, it follows that E(|X|) < ∞, and so X has finite mean.    ∎

55 This theorem is used in the proof of Theorem B.124.
It is usually more convenient to express Theorem B.122 in terms of reversed
martingales.
Corollary B.123.^56 If {(X_n, F_n)}_{n=1}^∞ is a reversed martingale, then lim_{n→∞} X_n
exists a.s. and has finite mean.
There is also a version of Levy's theorem B.118 for reversed martingales.
Theorem B.124 (Levy's theorem: part II).^57 Let {F_n}_{n=1}^∞ be a decreasing
sequence of σ-fields. Let F_∞ be the intersection ∩_{n=1}^∞ F_n. Let E(|X|) < ∞. Define
X_n = E(X|F_n) and X_∞ = E(X|F_∞). Then lim_{n→∞} X_n = X_∞ a.s.
PROOF. It is easy to see that {(X_n, F_n)}_{n=1}^∞ is a reversed martingale and that
E(|X_1|) < ∞. By Theorem B.122, it follows that lim_{n→∞} X_n = Y exists and is
finite a.s. To prove that Y = X_∞ a.s., note that X_∞ = E(X_1|F_∞) since F_∞ ⊆ F_1.
So, we must show that Y = E(X_1|F_∞). Let A ∈ F_∞. Then

    ∫_A X_n(s) dμ(s) = ∫_A X_1(s) dμ(s),

since A ∈ F_n and X_n = E(X_1|F_n). Once again, using (B.112) and Lemma B.119,
it follows that ∫_A Y(s) dμ(s) = ∫_A X_1(s) dμ(s); hence Y = E(X_1|F_∞).    ∎

B.5.3 Markov Chains‡
Another type of stochastic process we will occasionally meet is a Markov chain.^58
Definition B.125. Let {X_n}_{n=1}^∞ be a sequence of random variables taking values
in a space X with σ-field B. The sequence is called a Markov chain (with
stationary transition distributions)^59 if there exists a function p : B × X → [0, 1]
such that

    for all x ∈ X, p(·, x) is a probability measure on B;
    for all B ∈ B, p(B, ·) is B measurable;
    for each n and each B ∈ B,

        p(B, x) = Pr(X_{n+1} ∈ B | X_1 = x_1, X_2 = x_2, …, X_{n−1} = x_{n−1}, X_n = x),

    almost surely with respect to the joint distribution of (X_1, …, X_n).

The last condition in the definition of a Markov chain says that the conditional
distribution of X_{n+1} given the past depends only on the most recent past X_n. In
other words, X_{n+1} is conditionally independent of X_1, …, X_{n−1} given X_n.
Example B.126. A sequence {X_n}_{n=1}^∞ of IID random variables is a Markov
chain with p(B, x) = Pr(X_i ∈ B) for all x.
Example B.127. Let {X_n}_{n=1}^∞ be Bernoulli random variables such that

    Pr(X_{n+1} = 1 | X_1 = x_1, …, X_n = x_n) = p_{x_n,1},

for x_n ∈ {0, 1}. The entire joint distribution of the sequence is determined by the
numbers p_{0,1}, p_{1,1}, and Pr(X_1 = 1).

56 This corollary is used in the proof of Theorem B.124.
57 This theorem is used in the proofs of Theorem 1.62, Corollary 1.63,
Lemma 2.121, and Lemma 7.83.
‡ This section may be skipped without interrupting the flow of ideas.
58 In this text, we only use Markov chains as occasional examples of sequences
of random variables that are not exchangeable.
59 There are more general definitions of Markov chains and Markov processes
in which the transition distribution from X_n to X_{n+1} is allowed to depend on n.
We will not need these more general processes in this book.
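A quick simulation of Example B.127 (our sketch; the parameter values are arbitrary). For this two-state chain, the long-run fraction of 1s converges to p_{0,1}/(1 − p_{1,1} + p_{0,1}), the standard stationary probability of state 1 for an ergodic two-state chain.

```python
import random

def simulate_chain(p01, p11, x1, steps, rng):
    """Binary Markov chain of Example B.127: Pr(X_{n+1}=1 | X_n=x) is p01 or p11."""
    x, ones = x1, 0
    for _ in range(steps):
        p = p11 if x == 1 else p01
        x = 1 if rng.random() < p else 0
        ones += x
    return ones / steps

rng = random.Random(5)
p01, p11 = 0.3, 0.6
long_run_freq = simulate_chain(p01, p11, 0, 200_000, rng)
stationary = p01 / (1 - p11 + p01)  # solves pi1 = p01*(1 - pi1) + p11*pi1
```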

B.5.4 General Stochastic Processes
Occasionally, we will have to deal with more complicated stochastic processes.
What makes them more complicated is that they consist of more than countably
many random quantities.
Example B.128. Let F be a set of real-valued functions of a real vector. That
is, there exists k such that F ∈ F means F : ℝ^k → ℝ. Suppose that X : S → F
is a random quantity whose values are functions themselves. We would like to
be able to discuss the distribution of X. We will need a σ-field of subsets of F
in order to discuss measurability. A natural σ-field is the smallest σ-field that
contains all sets of the form A_{t,x} = {F ∈ F : F(t) ≤ x}, for all t ∈ ℝ^k and
all x ∈ ℝ. It can be shown (see below) that X is measurable with respect to
this σ-field if, for every t ∈ ℝ^k, the real-valued function G_t : S → ℝ is Borel
measurable, where G_t(s) = F(t) when X(s) = F.
A general stochastic process can be defined, and it resembles the above example
in all important aspects.
Definition B.129. Let (S, A, μ) be a probability space, and let R be some set.
For each r ∈ R, let (X_r, B_r) be a Borel space, and let X_r : S → X_r be measurable.
The collection of random variables X = {X_r : r ∈ R} is called a stochastic
process.
Example B.130. If every (X_r, B_r) is the same space (X, B), then X can be
thought of as a "random function" from R to X as follows. For each s ∈ S, define
the function F_s : R → X by F_s(r) = X_r(s). In order to make this a true random
function, we need a σ-field on the set of functions from R to X. Since this set of
functions is the product set X^R, a natural σ-field is the product σ-field B^R. The
product σ-field is easily seen to be the smallest σ-field containing all sets of the
form A_{r,B} = {F : F(r) ∈ B}, for r ∈ R and B ∈ B. Now, let F : S → X^R be
defined by F(s) = F_s. Then F is measurable because

    F^{−1}(A_{r,B}) = {s : F_s(r) ∈ B} = {s : X_r(s) ∈ B} ∈ A,

because X_r is measurable.

The important theorem about stochastic processes is that their distribution is
determined by the joint distributions of all finite collections of the X_r.
Theorem B.131.^60 Let R be a set and, for each r ∈ R, let (X_r, B_r) be a Borel
space. Let X = {X_r : r ∈ R} and X′ = {X′_r : r ∈ R} be two stochastic processes.
Suppose that for every k and every k-tuple (r_1, …, r_k) ∈ R^k, the joint distribution
of (X_{r_1}, …, X_{r_k}) is the same as that of (X′_{r_1}, …, X′_{r_k}). Then the distribution of
X is the same as that of X′.
PROOF. Define X = ∏_{r∈R} X_r and let B be the product σ-field. Say that a set
C ∈ B is a finite-dimensional cylinder set if there exist k and r_1, …, r_k ∈ R and
a measurable D ⊆ ∏_{i=1}^{k} X_{r_i} such that

    C = {x ∈ X : (x_{r_1}, …, x_{r_k}) ∈ D}.

It is easy to see that if {r_1, …, r_k} ⊆ {s_1, …, s_m} for m ≥ k, then there exists a
measurable subset D′ of ∏_{j=1}^{m} X_{s_j} such that

    C = {x ∈ X : (x_{s_1}, …, x_{s_m}) ∈ D′},

by taking the Cartesian product of D times the product of those X_r for r ∈
{s_1, …, s_m} \ {r_1, …, r_k} and then possibly rearranging the coordinates of all
points in this set to match the order of r_1, …, r_k among s_1, …, s_m. So, if C and
G are both finite-dimensional cylinder sets with

    G = {x ∈ X : (x_{h_1}, …, x_{h_t}) ∈ E},

then we can let {t_1, …, t_m} = {r_1, …, r_k} ∪ {h_1, …, h_t} and write

    C = {x ∈ X : (x_{t_1}, …, x_{t_m}) ∈ D′},
    G = {x ∈ X : (x_{t_1}, …, x_{t_m}) ∈ E′}.

It follows that

    C ∩ G = {x ∈ X : (x_{t_1}, …, x_{t_m}) ∈ D′ ∩ E′}.

So the finite-dimensional cylinder sets form a π-system. By assumption, the
distributions of X and X′ agree on this π-system. Since X = {x ∈ X : x_r ∈ X_r} for
arbitrary r ∈ R and since the distributions of X and X′ are finite measures, we
can apply Theorem A.26 to conclude that the distributions are the same.    ∎
Another important fact about general stochastic processes is that it is possible
to specify a joint distribution for the entire process by merely specifying all
of the finite-dimensional joint distributions, so long as they obey a consistency
condition.
Definition B.132. Let X = ∏_{r∈R} X_r with the product σ-field, where (X_r, B_r) is
a Borel space for every r. For each finite k and each k-tuple (i_1, …, i_k) of distinct
elements of R, let P_{i_1,…,i_k} be a probability measure on ∏_{j=1}^{k} X_{i_j}. We say that
these probabilities are consistent if the following conditions hold for each k and
distinct i_1, …, i_k ∈ R and each A in the product σ-field of ∏_{j=1}^{k} X_{i_j}:

60 This theorem is used in the proofs of Theorem B.133 and DeFinetti's
representation theorem 1.49.

For each permutation π of k items, P_{i_1,…,i_k}(A) = P_{i_{π(1)},…,i_{π(k)}}(B), where

    B = {(x_{π(1)}, …, x_{π(k)}) : (x_1, …, x_k) ∈ A}.

For each ℓ ∈ R \ {i_1, …, i_k}, P_{i_1,…,i_k}(A) = P_{i_1,…,i_k,ℓ}(B), where

    B = {(x_1, …, x_k, x_{k+1}) : (x_1, …, x_k) ∈ A, x_{k+1} ∈ X_ℓ}.

Since the set R may not be ordered, the first condition ensures that it does not
matter in what order one writes a finite set of indices. The second condition is
the substantive one, and it ensures that the marginal distributions of subsets of
coordinates are the probability measures associated with those subsets.
To avoid excessive notation, it will be convenient to refer to P_J as the probability
measure associated with a finite subset J ⊆ R without specifying the order
of the elements of J. When the consistency conditions in Definition B.132 hold,
this should not cause any confusion.
The proof of the following theorem is adapted from Loeve (1977, pp. 94–95).
The theorem says that consistent finite-dimensional distributions determine a
unique joint distribution on the product space.
Theorem B.133.^61 Let X = ∏_{r∈R} X_r with the product σ-field, where X_r is a
Borel space for every r. For each finite subset J ⊆ R, let P_J be a probability
measure on ∏_{r∈J} X_r. Suppose that the P_J are consistent as defined in Definition
B.132. Then there exists a unique distribution on X with finite-dimensional
marginals given by the P_J.
PROOF. The uniqueness follows from Theorem B.131, if we can prove existence.
First, suppose that X_r = ℝ for all r. Let C be the class of all unions of finitely
many finite-dimensional cylinder sets of the form C = ∏_{r∈R} C_r, where all but
finitely many of the C_r equal ℝ and the others are unions of finitely many intervals.
The class C is a field. For C of the above form, define P(C) = P_J(∏_{r∈J} C_r),
where J is the finite set of coordinates r with C_r ≠ ℝ. The consistency assumption
implies that P can be uniquely extended to a finitely additive probability on C.
To show that P is countably additive, we will show that if {A_n}_{n=1}^∞ is a
decreasing sequence of elements of C such that P(A_n) > ε for all n, then A = ∩_{n=1}^∞ A_n
is nonempty. Suppose that P(A_n) > ε for all n. Let J_n be the set of all subscripts
involved in A_1, …, A_n and J be the union of these sets. Let A_n = B_n × ∏_{r∉J_n} X_r.
Then P(A_n) = P_{J_n}(B_n), and B_n is the union of finitely many products of
intervals. For each product of intervals H that constitutes B_n, we can find a
product of bounded closed intervals contained in H such that the P_{J_n} probability
of the union of these is as close as we wish to P_{J_n}(B_n). Let C_n be a finite union
of products of closed bounded intervals contained in B_n such that P_{J_n}(B_n \ C_n) <
ε/2^{n+1}. If D_n is the cylinder set corresponding to C_n, then

    P_{J_n}(A_n \ D_n) = P_{J_n}(B_n \ C_n) < ε/2^{n+1}.

Now, let E_n = A_n ∩ ∩_{i=1}^{n} D_i, so that P(A_n \ E_n) < ε/2. It follows that
P(E_n) > ε/2, so each E_n is nonempty. Let xⁿ = (x_1ⁿ, x_2ⁿ, …) ∈ E_n. Since
E_1 ⊇ E_2 ⊇ ⋯, it follows that for every k ≥ 0, x^{n+k} ∈ E_n ⊆ D_n. Hence
(x_i^{n+k} ; i ∈ J_n) ∈ C_n. Since each C_n is bounded, there is a subsequence of
{(x_iⁿ ; i ∈ J_1)}_{n=1}^∞ that converges to a point (x_i ; i ∈ J_1) ∈ C_1. Then there is
a further subsequence along which (x_iⁿ ; i ∈ J_2) converges to a point
(x_i ; i ∈ J_2) ∈ C_2. Continue extracting subsequences to get a limit point
x_J = (x_i ; i ∈ J) whose J_n-coordinates lie in C_n for every n. Hence, every point
that extends x_J to an element of X is in D_n ⊆ A_n for all n, and A is nonempty.
Now apply the Carathéodory extension theorem A.22 to extend P to the entire
product σ-field.
For general Borel spaces, let φ_r : X_r → F_r be a bimeasurable mapping to a
Borel subset of ℝ for each r. It follows easily by using Theorem A.34 that the
function φ : X → ∏_{r∈R} F_r is bimeasurable, where φ(x) = (φ_r(x_r) ; r ∈ R). For
each finite subset J, φ induces a probability on ∏_{i∈J} ℝ from P_J, and these are
clearly consistent. By what we have already proven, there is a probability P on
∏_{r∈R} ℝ with the desired marginals. Then φ^{−1} induces a probability on X from
P with the desired marginals.    ∎

61 This theorem is used in the proof of Lemma 2.123.

B.6 Subjective Probability


It is not obvious for what purpose a mathematical probability, as described in
this chapter and defined in Definition A.18, would ever be useful. In this section,
we try to show how the mathematical definition of probability is just what one
would want to use to describe one's uncertainty about unknown quantities if one
were forced to gamble on the outcomes of those unknown quantities. 62
DeFinetti (1974) suggests that probability be defined in terms of those gam-
bles an agent is willing to accept. Others, like DeGroot (1970), would only require
that probabilities be subjective degrees of belief. Either way, we might ask, "Why
should degrees of belief or gambling behavior satisfy the measure theoretic defini-
tion of probability?" In this section, we will try to motivate the measure theoretic
definition of probability by considering gambling behavior. We begin by adopting
the viewpoint of DeFinetti (1974).63
For the purposes of this discussion, let a random variable be any number about
which we are uncertain. For each bounded random variable X, assume that there
is some fair price p such that an agent is indifferent between all gambles that pay
c(X - p), where c is in some sufficiently small symmetric interval around 0 such
that the maximum loss is still within the means of the agent to pay. For example,
suppose that X = x is observed. If c(x - p) > 0, then the agent would receive this amount. If c(x - p) < 0, then the agent would lose -c(x - p). It must be that -c(x - p) is small enough for the agent to be able to pay. Surely, for X in a

62In Section 3.3, we give a much more elaborate motivation for the entire apparatus of Bayesian decision theory, which includes mathematical probability as one of its components. An alternative derivation of mathematical probability from operational considerations is given in Chapter 6 of DeGroot (1970).
63There are a few major differences between the approach in this section and DeFinetti's approach, which DeFinetti, were he alive, would be quick to point out. Out of respect for his memory and his followers, we will also try to point out these differences as we encounter them.

bounded set, c can be made small enough for this to hold, so long as the agent
has some funds available.
Definition B.134. The fair price p of a random quantity is called its prevision
and is denoted P(X). It is assumed, for a bounded random quantity X, that the
agent is indifferent between all gambles whose net gain (loss if negative) to the
agent is c(X - P(X)) for all c in some symmetric interval around 0.
The symmetric interval around 0 mentioned in the definition of prevision may
be different for different random variables. For example, it might stand to reason
that the interval corresponding to the random variable 2X would be half as wide
as the interval corresponding to X.
Another assumption we make is that if an agent is willing to accept each of a
countable collection of gambles, then the agent is willing to accept all of them
at once, so long as the maximum possible loss is small enough for the agent to
pay.64 An example of countably many gambles, each of which is acceptable but
cannot be accepted together, is the famous St. Petersburg paradox.
Example B.135. Suppose that a fair coin is tossed until the first head appears.
Let N be the number of tosses until the first head appears. For n = 1,2, ... ,
define

X_n = { 2^n  if N = n,
        0    otherwise.

Suppose that our agent says that P(X_n) = 1 for all n. For each n, there is c_n < 0 such that the agent is willing to accept c_n(X_n - 1). If -Σ_{n=1}^∞ c_n 2^n is too big, however, the agent cannot accept all of the gambles at once. Similarly, there are c_n > 0 such that the agent is willing to accept c_n(X_n - 1). If Σ_{n=1}^∞ c_n is too big, the agent cannot accept all of these gambles. The St. Petersburg paradox corresponds to the case in which c_n = 1 for all n. In this case, the agent pays ∞ and only receives 2^N in return. We have ruled out this possibility by requiring that the agent be able to afford the worst possible loss.
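The fairness of each individual gamble in Example B.135 is easy to check by simulation. The sketch below (hypothetical names, standard library only) estimates E(X_n) for a few n and finds each estimate near the stated prevision of 1:

```python
import random

def sample_N(rng):
    """Number of tosses of a fair coin until the first head appears."""
    n = 1
    while rng.random() < 0.5:  # tail with probability 1/2
        n += 1
    return n

rng = random.Random(0)
draws = [sample_N(rng) for _ in range(200_000)]

# X_n pays 2^n when N = n and 0 otherwise, so E(X_n) = 2^n * 2^{-n} = 1.
for n in (1, 2, 3):
    est = sum(2 ** n for N in draws if N == n) / len(draws)
    print(n, round(est, 3))  # each estimate should be near 1
```

Each gamble is fair in isolation; it is only the combined gamble Σ_n (X_n - 1), with its unbounded loss, that the agent is barred from accepting.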

The following example illustrates how it is possible to accept infinitely many gambles at once.

Example B.136. Suppose that a random quantity X could possibly be any one of the positive integers. For each positive integer x, let

I_x = { 1  if X = x,
        0  if not.

Suppose that our agent is indifferent between all gambles of the form c(I_x - 2^{-x}) for all -1 ≤ c ≤ 1 and all integers x. Then, we assume that the agent is also indifferent between all gambles of the form Σ_{x=1}^∞ c_x(I_x - 2^{-x}), so long as -1 ≤ c_x ≤ 1 for all x. (Note that the largest possible loss is no more than 1.) Let Y = Σ_{x=1}^∞ c_x I_x with -1 ≤ c_x ≤ 1 for all x. Note that Y is a bounded random

64DeFinetti would not require an agent to accept countably many gambles at
once, but rather only finitely many. We introduce this stronger requirement to
avoid mathematical problems that arise when the weaker assumption holds but
the stronger one does not. Schervish, Seidenfeld, and Kadane (1984) describe one
such problem in detail.

quantity, and that the agent has implicitly agreed to accept all gambles of the form c(Y - μ) for -1 ≤ c ≤ 1, where μ = Σ_{x=1}^∞ c_x 2^{-x}. If the agent were foolish enough to be indifferent between all gambles of the form d(Y - p) for -a ≤ d ≤ a where p ≠ μ, then a clever opponent could make money with no risk. For example, if p > μ, let f = min{1, a}. The opponent would ask the agent to accept the gamble f(Y - p) as well as the gambles -f c_x(I_x - 2^{-x}) for x = 1, 2, .... The net effect to the agent of these gambles is -f(p - μ) < 0, no matter what value X takes! A similar situation arises if p < μ. Only p = μ protects the agent from this sort of problem, which is known as Dutch book.
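The Dutch book arithmetic can be checked directly. In this sketch the coefficients c_x, the mistaken price p, and the stake f are all illustrative choices (nonzero c_x for finitely many x keeps the sums exact):

```python
# Agent's coefficients c_x; zero for all other x, so the sums below are finite.
c = {1: 0.5, 2: -1.0, 3: 0.25}
mu = sum(cx * 2 ** -x for x, cx in c.items())  # fair price of Y = sum c_x I_x
p = mu + 0.1      # the agent foolishly quotes p > mu for Y
f = 1.0           # opponent's stake, f = min{1, a}

def net_gain(X):
    """Agent's net gain from accepting f(Y - p) and -f c_x (I_x - 2^{-x})."""
    Y = c.get(X, 0.0)  # Y = c_X, since exactly one indicator equals 1
    gain = f * (Y - p)
    gain += sum(-f * cx * ((1 if X == x else 0) - 2 ** -x) for x, cx in c.items())
    return gain

# The net gain is -f(p - mu) < 0 no matter what value X takes.
for X in range(1, 8):
    assert abs(net_gain(X) - (-f * (p - mu))) < 1e-12
print(-f * (p - mu))
```

The indicators cancel algebraically, which is why the agent's loss is the same constant -f(p - μ) for every outcome.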

To avoid Dutch book, we introduce the following definition.


Definition B.137. Let {X_α : α ∈ A} be a collection of bounded random variables. Suppose that, for each α, an agent gives a prevision P(X_α) and is indifferent between all gambles of the form c(X_α - P(X_α)) for -d_α ≤ c ≤ d_α with d_α = min{M/(max X_α - P(X_α)), M/(P(X_α) - min X_α)} for some M > 0. These previsions are coherent if there exist no countable subset B ⊆ A and {c_b : -d_b ≤ c_b ≤ d_b, for all b ∈ B} such that -M ≤ Σ_{b∈B} c_b(X_b - P(X_b)) < 0 under all circumstances.65 If a collection of previsions is not coherent, we say that it is incoherent.
The value M is the maximum amount the agent is willing to lose. Coherence of
a sufficiently rich collection of previsions is equivalent to a probability assignment.
Theorem B.138.66 Let (S, A) be a measurable space. Suppose that, for each C ∈ A, the agent assigns a prevision P(I_C), where I_C is the indicator of C. Define μ : A → ℝ by μ(C) = P(I_C). Then the previsions are coherent if and only if μ is a probability on (S, A).
PROOF. Without loss of generality, suppose that the agent is indifferent between all gambles of the form c(I_C - P(I_C)) for all -1 ≤ c ≤ 1. For the "if" part, assume that μ is a probability. Let {C_n}_{n=1}^∞ ∈ A and c_n ∈ [-1, 1] be such that, with

X = Σ_{n=1}^∞ c_n(I_{C_n} - μ(C_n)),

the maximum losses from X and from -X are small enough for the agent to afford. Since this makes X bounded, it follows from Fubini's theorem A.70 that E(X) = 0; hence it is impossible that X < 0 under all circumstances, and the previsions are coherent.
For the "only if" part, assume that the previsions are coherent. Clearly, μ(∅) = 0, since I_∅ = 0 and -cμ(∅) ≥ 0 must hold for both positive and negative c. It is also easy to see that μ(A) ≥ 0 for all A. If μ(A) < 0, then for all negative c, c(I_A - μ(A)) < 0 and we have incoherence. Countable additivity follows in a similar fashion. Let {A_n}_{n=1}^∞ be mutually disjoint, and let A = ∪_{n=1}^∞ A_n. If μ(A) < Σ_{n=1}^∞ μ(A_n),

65When only finitely many gambles are required to be combined at once, as by DeFinetti (1974), incoherence requires that the sum be strictly less than some negative number under all circumstances. That is, DeFinetti would allow a strictly negative gamble to be called coherent, so long as the least upper bound was 0.
66This theorem is used in the proof of Theorem B.139.

then the following gamble is always negative:

Σ_{n=1}^∞ (I_{A_n} - μ(A_n)) - (I_A - μ(A)).

If μ(A) > Σ_{n=1}^∞ μ(A_n), then the negative of the above gamble is always negative. Either way there is incoherence. □
Theorem B.138 says that if an agent insists on dealing with a σ-field of subsets of some set S, then expressing coherent previsions for gambles on events is equivalent to choosing probabilities.67 Similar claims can be made about bounded random variables.
Theorem B.139. Let C be the collection of all bounded measurable functions from a measurable space (S, A) to ℝ. Suppose that, for each X ∈ C, an agent assigns a prevision P(X). The previsions are coherent if and only if there exists a probability μ on (S, A) such that P(X) = E(X) for all X ∈ C.
PROOF. Suppose that the agent is indifferent between all gambles of the form c(X - P(X)) for -d_X ≤ c ≤ d_X. For the "if" direction, the proof is virtually identical to the corresponding part of the proof of Theorem B.138. For the "only if" part, note that I_A ∈ C for every A ∈ A. It follows from Theorem B.138 that a probability μ exists such that μ(A) = P(I_A) for all A ∈ A. Hence P(X) = E(X) for all simple functions X. Let X ≥ 0 and let X_1 ≤ X_2 ≤ ⋯ be simple functions less than or equal to X such that lim_{n→∞} X_n = X. Then X = X_1 + Σ_{n=1}^∞ (X_{n+1} - X_n), so

P(X) = P(X_1) + Σ_{n=1}^∞ P(X_{n+1} - X_n) = lim_{n→∞} E(X_{n+1}) = E(X),

from coherence and the monotone convergence theorem A.52. For general X, let X⁺ and X⁻ be, respectively, the positive and negative parts of X. Since P(X) = P(X⁺) - P(X⁻) follows easily from coherence, the proof is complete. □
We conclude this "motivation" of probability theory from gambling considerations by trying to motivate conditional probability. Suppose that, in addition to assigning previsions to gambles involving arbitrary bounded random variables, the agent is also required to assign conditional previsions in the following way. Let C be a sub-σ-field of A, and suppose that gambles of the form cI_A(X - p), for all nonempty A ∈ C, are being considered.68 The fair price would be that value of p, denoted P(X|A), such that the agent was indifferent between all gambles of the form cI_A(X - P(X|A)) for all c in some symmetric interval around 0. Rather than choose a different P(X|A) for each A, the agent has the option of choosing a single function Q : S → ℝ such that Q is measurable with respect to the σ-field C. The conditional gambles would then be cI_A(X - Q).
Example B.140. For the simple case in which C = {∅, A, A^c, S}, Q is measurable if and only if it takes on only two values, one on A and the other on A^c. In

67In the theory of DeFinetti (1974), one obtains finitely additive probabilities without assuming that probabilities have been assigned to all elements of a σ-field.
68DeFinetti (1974) would only require that such conditional gambles be considered one at a time rather than a σ-field at a time.

this case, there are only two sets of conditional gambles (other than the "unconditional" gambles c[X - P(X)]), namely cI_A(X - P(X|A)) and cI_{A^c}(X - P(X|A^c)). Here, Q = P(X|A)I_A + P(X|A^c)I_{A^c}. Note that the previsions P(X|A) and P(I_A) = μ(A) are already expressed. It is easy to see that

cI_A(X - P(X|A)) = c(XI_A - E(XI_A)) - cP(X|A)(I_A - μ(A)) + c[E(XI_A) - P(X|A)μ(A)].

Clearly, the only coherent choices of P(X|A) satisfy P(X|A)μ(A) = E(XI_A). If μ(A) > 0, then P(X|A) = E(XI_A)/μ(A), the usual conditional mean of X given A. Similarly, P(X|A^c)μ(A^c) = E(XI_{A^c}) must hold.
The general situation is not much different from Example B.140.
Theorem B.141. Suppose that an agent must choose a function Q that is measurable with respect to a sub-σ-field C so that for each nonempty A ∈ C, he or she is indifferent between all gambles of the form cI_A(X - Q). The choice of Q is coherent if and only if E(QI_A) = E(XI_A), for all A ∈ C.
PROOF. As in Example B.140, note that

cI_A(X - Q) = c(XI_A - E(XI_A)) - c(QI_A - E(QI_A)) + c[E(XI_A) - E(QI_A)].

The choice of Q can be coherent if and only if E(QI_A) = E(XI_A). □
The reader should note the similarity between the conditions in Theorem B.141
and Definition B.23. The function Q must be a version of the conditional mean
of X given C.
Example B.142. Let (X, Y) be random variables with a traditional joint density with respect to Lebesgue measure, f_{X,Y}. That is, for all Borel sets C ⊆ ℝ²,

Pr((X, Y) ∈ C) = ∫_C f_{X,Y}(x, y) dx dy,

and for all bounded measurable functions g : ℝ² → ℝ,

E(g(X, Y)) = ∫ g(x, y) f_{X,Y}(x, y) dx dy.   (B.143)

Let C be the σ-field generated by Y. That is, C = {Y^{-1}(A) : A ∈ B}, where B is the Borel σ-field of subsets of ℝ. It is straightforward to check that for all A ∈ C, E(XI_A) = E(QI_A), where Q(s) = h(Y(s)), and

h(y) = ∫ x f_{X,Y}(x, y)/f_Y(y) dx,

and f_Y(y) = ∫ f_{X,Y}(x, y) dx is the usual marginal density of Y. (Just apply (B.143) with g(x, y) = h(y)I_C(y) and with g(x, y) = xI_C(y), where A = Y^{-1}(C).)
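Example B.142 can be checked numerically for a concrete density. The sketch below uses the purely illustrative density f_{X,Y}(x, y) = x + y on the unit square and verifies E(XI_A) = E(h(Y)I_A) for A = {Y ≤ 1/2} by midpoint Riemann sums:

```python
# Assumed illustrative density: f_{X,Y}(x, y) = x + y on [0,1]^2 integrates to 1.
f = lambda x, y: x + y

def h(y):
    # h(y) = [integral of x f(x,y) dx] / f_Y(y), with f_Y(y) = 1/2 + y here
    return (1 / 3 + y / 2) / (1 / 2 + y)

n = 400
grid = [(i + 0.5) / n for i in range(n)]  # midpoints of an n-point grid
w = 1 / n ** 2                             # area element

# E(X I_{Y<=1/2}) and E(h(Y) I_{Y<=1/2}) as double Riemann sums
lhs = sum(x * f(x, y) * w for x in grid for y in grid if y <= 0.5)
rhs = sum(h(y) * f(x, y) * w for x in grid for y in grid if y <= 0.5)
print(lhs, rhs)  # the two sums should agree closely
```

Here h plays the role of the conditional mean of X given Y, and the agreement of the two sums is the defining property of conditional expectation from Theorem B.141.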
What we have done in this section is give a motivation for the use of the mathematical probability calculus to express uncertainty for the purposes of gambling. We assume that an agent chooses which gambles to accept in such a way that he or she is not subject to Dutch book, which is a combination of acceptable gambles that produces a loss no matter what happens. We were also able to use this approach to motivate the mathematical definition of conditional expectation by introducing conditional gambles and requiring that the same coherence condition apply to conditional and unconditional gambles alike.

B.7 Simulation*
Several times in this text, we will want to generate observations that have a desired distribution. Such observations will be called pseudorandom numbers because samples appear to have the properties of random variables, but they are actually generated by a complicated deterministic process. We will not go into detail on how pseudorandom numbers with uniform U(0,1) distribution are generated. In this section, we wish to prove a couple of useful theorems about how to generate pseudorandom numbers with other distributions under the assumption that pseudorandom numbers with U(0,1) distribution can be generated.
Theorem B.144. Let F be a CDF and define the inverse of F by

F^{-1}(q) = { inf{x : F(x) ≥ q}  if q > 0,
             sup{x : F(x) = 0}  if q = 0.

If U has U(0,1) distribution, then X = F^{-1}(U) has CDF F.
PROOF. We will calculate Pr(X ≤ t) for all t. First, let t be a continuity point of F. Then

Pr(X ≤ t) = Pr(F^{-1}(U) ≤ t) = Pr(U ≤ F(t)) = F(t),

where the second equality follows from the fact that, at a continuity point t, X ≤ t if and only if U ≤ F(t), and the third equality follows from the fact that U has U(0,1) distribution. Finally, let t be a jump point of F and let F(t) - lim_{x↑t} F(x) = c. Then X = t if and only if F(t) - c < U ≤ F(t), so

Pr(X = t) = Pr(F(t) - c < U ≤ F(t)) = c.

So, X has CDF F at continuity points of F and its distribution has the same sized jumps as F at the same points. So the CDF of X is F. □
This theorem allows us to generate pseudorandom variables with arbitrary CDF F, if we can find F^{-1}. The method described in this theorem is called the probability integral transform. Note that the probability integral transform has a surprising theoretical implication.
Proposition B.145. Let U have U(0,1) distribution, and let X be a random quantity taking values in a Borel space X. Then there exists a measurable function f : [0,1] → X such that f(U) has the same distribution as X.
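A minimal sketch of the probability integral transform, using the exponential CDF F(x) = 1 - e^{-λx} as an illustrative choice because its inverse has a closed form (names are hypothetical; standard library only):

```python
import math
import random

def exp_inverse_cdf(q, rate=1.0):
    """F^{-1}(q) for the exponential CDF F(x) = 1 - exp(-rate * x)."""
    return -math.log(1 - q) / rate

rng = random.Random(0)
# X = F^{-1}(U) with U ~ U(0,1) should have CDF F, by Theorem B.144.
xs = [exp_inverse_cdf(rng.random(), rate=2.0) for _ in range(100_000)]

# The exponential(rate=2) distribution has mean 1/2 and variance 1/4.
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 3), round(var, 3))
```

The same one-line transform works for any CDF whose inverse can be computed, including discrete CDFs via the infimum in the theorem.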
The next theorem allows us to find pseudorandom variables with arbitrary density f if we can generate pseudorandom variables with another density g such that f(x) ≤ kg(x) for some number k and all x.
Theorem B.146 (Acceptance-rejection). Let f be a nonnegative integrable function, and let g be a density function. Let k > 0 and suppose that f(x) ≤ kg(x) for all x. Suppose that {Y_i}_{i=1}^∞ and {U_i}_{i=1}^∞ are all independent and that the Y_i have density g and the U_i are U(0,1). Define Z = Y_N, where

N = min{ i : U_i ≤ f(Y_i)/(kg(Y_i)) }.

*This section may be skipped without interrupting the flow of ideas.

Then Z has density proportional to f.

PROOF. We can write the CDF of Z as

Pr(Z ≤ t) = Pr( Y_i ≤ t | U_i ≤ f(Y_i)/(kg(Y_i)) )
          = Pr( Y_i ≤ t, U_i ≤ f(Y_i)/(kg(Y_i)) ) / Pr( U_i ≤ f(Y_i)/(kg(Y_i)) )
          = E[ Pr( Y_i ≤ t, U_i ≤ f(Y_i)/(kg(Y_i)) | Y_i ) ] / E[ Pr( U_i ≤ f(Y_i)/(kg(Y_i)) | Y_i ) ],

where we have used the law of total probability B.70 in the last equation. The conditional probability in the numerator is

Pr( Y_i ≤ t, U_i ≤ f(Y_i)/(kg(Y_i)) | Y_i ) = { 0                 if Y_i > t,
                                                f(Y_i)/(kg(Y_i))  if Y_i ≤ t.

The mean of this is

∫_{-∞}^t [f(y)/(kg(y))] g(y) dy = (1/k) ∫_{-∞}^t f(y) dy,

since Y_i has PDF g(·). Similarly, the denominator conditional probability can be written as

Pr( U_i ≤ f(Y_i)/(kg(Y_i)) | Y_i ) = f(Y_i)/(kg(Y_i)).

The mean of this is likewise seen to be ∫ f(y) dy / k. The ratio of these is

Pr(Z ≤ t) = ∫_{-∞}^t f(y) dy / ∫ f(y) dy,

hence Z has density proportional to f. □
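A sketch of Theorem B.146 with illustrative choices: the unnormalized target f(x) = x(1-x) on [0,1] (proportional to a Beta(2,2) density), proposal density g = U(0,1), and envelope constant k = 1/4 ≥ max f:

```python
import random

def accept_reject(f, sample_g, g, k, rng):
    """Return Z = Y_N, the first proposal Y with U <= f(Y)/(k g(Y))."""
    while True:
        y = sample_g(rng)
        if rng.random() <= f(y) / (k * g(y)):
            return y

f = lambda x: x * (1 - x)   # unnormalized target, proportional to Beta(2,2)
g = lambda x: 1.0           # U(0,1) proposal density
k = 0.25                    # f(x) <= k g(x), since max f = 1/4

rng = random.Random(0)
zs = [accept_reject(f, lambda r: r.random(), g, k, rng) for _ in range(50_000)]
print(round(sum(zs) / len(zs), 3))  # Beta(2,2) has mean 1/2
```

Note that f never needs to be normalized; the normalizing constant cancels in the acceptance ratio, exactly as in the proof above.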


Next, we prove a theorem that allows us to simulate from distributions with bounded densities and sufficiently thin tails even when we only know the density up to a normalizing constant. The theorem is due to Kinderman and Monahan (1977).
Theorem B.147 (Ratio of uniforms method). Let f : ℝ → [0, ∞) be an integrable function. Define

A = {(u, v) : 0 < u ≤ √(f(v/u))}.

If (U, V) has uniform distribution over the set A, then V/U has density proportional to f.
PROOF. Let (U, V) be uniformly distributed on the set A. Then f_{U,V}(u, v) = I_A(u, v)/c, where c is the area of A. Define X = U and Y = V/U. The Jacobian for the transformation is x and the joint density of (X, Y) is

f_{X,Y}(x, y) = (x/c) I_A(x, xy) = (x/c) I_{(0,√(f(y))]}(x).

It follows that f_Y(y) = ∫_0^{√(f(y))} (x/c) dx = f(y)/(2c). □
If both √(f(x)) ≤ b and a ≤ x√(f(x)) ≤ c for all x, then A is contained in the rectangle with opposite corners (0, a) and (b, c). We can then generate U ~ U(0, b) and V ~ U(a, c). We set X = V/U, and if U² ≤ f(X), take X as our desired random variable. If U² > f(X), try again.
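For the standard normal kernel f(x) = exp(-x²/2), sup √(f(x)) = 1 and x√(f(x)) ranges over [-√(2/e), √(2/e)], which gives the bounding rectangle used in this sketch (standard library only):

```python
import math
import random

def ratio_of_uniforms_normal(rng):
    """Sample X with density proportional to exp(-x^2/2)."""
    b = 1.0                      # sup sqrt(f(x))
    c = math.sqrt(2 / math.e)    # sup x*sqrt(f(x)); the inf is -c
    while True:
        u = rng.uniform(0.0, b)
        v = rng.uniform(-c, c)
        # accept (u, v) if it lands in A, i.e. u^2 <= f(v/u)
        if u > 0 and u * u <= math.exp(-((v / u) ** 2) / 2):
            return v / u

rng = random.Random(0)
xs = [ratio_of_uniforms_normal(rng) for _ in range(50_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
print(round(mean, 3), round(var, 3))  # should be near 0 and 1
```

As in Theorem B.147, only the unnormalized kernel of the density is ever evaluated.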
An important application of simulation is to the numerical integration technique called importance sampling. Suppose that we wish to know the value of the ratio of two integrals

∫ v(θ)h(θ) dθ / ∫ h(θ) dθ,   (B.148)

where θ can be a vector. Suppose that f is a density function such that h/f is nearly constant and it is easy to generate pseudorandom numbers with density f. Let {X_i}_{i=1}^∞ be an IID sequence of pseudorandom numbers with density f. Then

∫ h(θ) dθ = E( h(X_i)/f(X_i) ),
∫ v(θ)h(θ) dθ = E( v(X_i) h(X_i)/f(X_i) ),

where the expectations are with respect to the pseudo-distribution of X_i. If we let W_i = h(X_i)/f(X_i) and Z_i = v(X_i)W_i, then the weak law of large numbers B.95 says that Z̄_n/W̄_n converges in probability to (B.148).69 The reason that we want h/f to be nearly constant is so that the variance of W_i is small. In Section 7.1.3, we will show how to approximate the variance of Z̄_n/W̄_n as an estimate of (B.148).
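A sketch of the estimator Z̄_n/W̄_n with illustrative choices: h(θ) = exp(-θ²/2) (unnormalized), v(θ) = θ², and importance density f = N(0, 1.5²). The true value of (B.148) is then E(θ²) = 1 for a standard normal:

```python
import math
import random

h = lambda t: math.exp(-t * t / 2)   # unnormalized integrand
v = lambda t: t * t                  # function whose weighted mean we want
sigma = 1.5                          # importance density is N(0, sigma^2)
f = lambda t: math.exp(-t * t / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

rng = random.Random(0)
n = 100_000
ws, zs = [], []
for _ in range(n):
    x = rng.gauss(0.0, sigma)
    w = h(x) / f(x)          # W_i = h(X_i)/f(X_i)
    ws.append(w)
    zs.append(v(x) * w)      # Z_i = v(X_i) W_i

est = (sum(zs) / n) / (sum(ws) / n)  # Z-bar / W-bar estimates (B.148)
print(round(est, 3))  # should be near 1
```

Choosing sigma > 1 keeps the weights h/f bounded, the condition the text identifies as keeping the variance of W_i small.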

B.8 Problems
Section B.2:

1. Suppose that an urn contains m ≥ 3 white balls and n ≥ 3 black balls. Suppose that the urn is well mixed so that at any time, any one of the remaining balls in the urn is as likely to be drawn as any other. We will draw three balls without replacement and set X_i = 1 if the ith ball drawn is black, X_i = 0 if the ith ball is white. Show that

Pr(X_1 = 0, X_2 = 1, X_3 = 1) = Pr(X_1 = 1, X_2 = 1, X_3 = 0).

2. Suppose that H is a nondecreasing function and

F(x) = inf_{t > x, t rational} H(t).

69The strong law of large numbers 1.63 says that Z̄_n/W̄_n converges a.s. to (B.148).

(a) Prove that F is continuous from the right.
(b) Prove that inf_x H(x) = inf_x F(x).
(c) Prove that sup_x H(x) = sup_x F(x).
Section B.3:

3. Using the definition of conditional probability, show that A ∩ B = ∅ implies

Pr(A|C) + Pr(B|C) = Pr(A ∪ B|C), a.s.

Use this to help prove that {A_n}_{n=1}^∞ disjoint implies

Pr(∪_{n=1}^∞ A_n | C) = Σ_{n=1}^∞ Pr(A_n | C), a.s.
4. *Let X_1 and X_2 be IID random variables with U(0,1) distribution. Let

Using the definition of conditional distribution, show that the conditional distribution of X_1 given T = t is a mixture of a point mass at t and a U(0, t) distribution. Also, find the mixture.
5. Let (S, A, μ) be a probability space. Let C be a sub-σ-field such that μ(C) ∈ {0, 1} for all C ∈ C. Let E|X| < ∞. Prove that E(X|C) = E(X), a.s. [μ].
6. Let (S, A, μ) be a probability space. Let {A_n}_{n=1}^∞ be a partition of S, and let C be the smallest σ-field containing {A_n}_{n=1}^∞. Let X be a random variable. Show that E(X|C) = Σ_n I_{A_n} W_n, where

W_n = { E(XI_{A_n})/μ(A_n)  if μ(A_n) > 0,
        0                    otherwise.

7. Let <l> denote the standard normal CDF, and let the joint CDF of random
variables (X, Y) be

< y + 1,
Fx,Y(x, y) = { ~~y)
~ if Y - 1 ~ x
if x ~ Y + 1,
otherwise.

(a) Find the conditional distribution of X given Y.


(b) Find the conditional distribution of Y given X.
8. Prove Proposition B.25 on page 617. (Hint: Use part 4 of Proposition A.49.)
9. Prove Proposition B.26 on page 617.
10. Prove Proposition B.27 on page 617. (Hint: Prove it for g an indicator function, then for simple functions, then for nonnegative measurable functions, then for all integrable functions.)
11. Prove Proposition B.28 on page 617.

12. Suppose that X_1, ..., X_n are independent, each with distribution N(θ, 1). Find the conditional distribution of X_1, ..., X_n given X̄_n = x, where X̄_n = Σ_{i=1}^n X_i/n.
13. Let B_1 ⊆ B_2 ⊆ ⋯ be a sequence of σ-fields, and let X ≥ 0. Suppose that E(X|B_n) = Y for all n. Let B be the smallest σ-field containing all of the B_n. Show that E(X|B) = Y, a.s. (Hint: Show that the union of the B_n is a π-system, and use Theorem A.26.)
14. Prove Proposition B.43 on page 623.
15. Assume the conditions of Theorem B.46. Also, suppose that (X, B_1, ν_1) and (Y, B_2, ν_2) are σ-finite measure spaces and ν = ν_1 × ν_2. Prove that ν_1 can play the role of ν_{X|Y}(·|y) for all y and that ν_2 can play the role of ν_Y in the statement of Theorem B.46.
16. Prove Proposition B.51 on page 625. (Hint: Notice that I_A(V^{-1}(y, w)) = I_{A_y}(w).)
17. Prove Proposition B.66 on page 631. (Hint: Prove the result for product
sets first, and then use Theorem A.26.)
18. Prove Corollary B.67 on page 631.
19. Prove Corollary B.74 on page 633.
20. Prove the second Borel-Cantelli lemma: If {A_n}_{n=1}^∞ are mutually independent and Σ_{n=1}^∞ Pr(A_n) = ∞, then Pr(∩_{i=1}^∞ ∪_{n=i}^∞ A_n) = 1. (This set is sometimes called A_n infinitely often.) (Hint: Find the probability of the complement by using the fact that 1 - x ≤ exp(-x) for 0 ≤ x ≤ 1.)
21. *Suppose that (S, A, μ) is a measure space. Let {f_n}_{n=1}^∞ be a sequence of measurable functions f_n : S → T, where (T, B) is a metric space with Borel σ-field. Let C be the tail σ-field of {f_n}_{n=1}^∞. If lim_{n→∞} f_n(s) = f(s), for all s, then prove that f is measurable with respect to C. (Hint: Refer to the proof of part 5 of Theorem A.38. Show that the set A_* ∈ C by showing that the union in (A.39) does not need to start at 1.)
22. Let (S, A, μ) be a probability space, and let C be the tail σ-field of a sequence of random quantities {X_n}_{n=1}^∞, where X_n : S → X for all n. Let V be the σ-field generated by {X_n}_{n=1}^∞. Let X = (X_1, X_2, ...) ∈ X^∞. If π is a permutation of a finite set of integers {1, ..., n}, let πX = (X_{π(1)}, ..., X_{π(n)}, X_{n+1}, ...). We say that A ∈ V is symmetric if A = X^{-1}(B) and for every permutation π of finitely many coordinates, A = (πX)^{-1}(B) as well.
(a) Prove that every C ∈ C is symmetric.
(b) Show that there can be symmetric events that are not in C.
23. Prove Proposition B.78 on page 634.

Section B.4:

24. Find a sequence of random variables that converges in probability to 0 but does not converge a.s. to 0. (Hint: Consider the countable collection of all subsets of [0,1] of the form [k/2^n, (k+1)/2^n] with k and n integers. Arrange them in an appropriate sequence.)
25. Let {X_n}_{n=1}^∞ be a sequence of random variables, and let X be another random variable. Let F_n be the CDF of X_n and let F be the CDF of X. Prove that X_n converges in distribution to X if and only if lim_{n→∞} F_n(x) = F(x) for every x such that F is continuous at x.
26. Prove that |exp(iy) - 1| ≤ min{|y|, 2} for all y. (Hint: Show that exp(iy) = 1 + i∫_0^y exp(is) ds for y ≥ 0 and a similar formula for y < 0.)
27. Prove the weak law of large numbers for infinite means: Suppose that {X_i}_{i=1}^∞ are IID with mean ∞. Then, for all real x, lim_{n→∞} Pr(X̄_n > x) = 1, where X̄_n = Σ_{i=1}^n X_i/n. (Hint: Define Y_{i,t} = min{X_i, t}. Prove that E(Y_{i,t}) < ∞ for all t, but lim_{t→∞} E(Y_{i,t}) = ∞.)
28. *Suppose that X is a random vector having bounded density with respect to Lebesgue measure. Prove that the characteristic function of X is integrable. (Hint: Run the proof of Lemma B.101 in reverse.)
Section B.5:

29. Let {i_n}_{n=1}^∞ be a sequence of numbers in {0, 1}. Suppose that {X_n}_{n=1}^∞ is a sequence of Bernoulli random variables such that
Pr(X1 =il, ... ,Xn =i n )= x~2(n~4)'


x+2

where x = Σ_{j=1}^n i_j. Show that this specifies a consistent set of joint distributions for n = 1, 2, ....
30. Let μ be a finite measure on (ℝ, B), where B is the Borel σ-field. Suppose that {X(t) : -∞ < t < ∞} is a stochastic process such that X(t) has Beta(μ(-∞, t], μ(t, ∞)) distribution for each t, X(t) > X(s) if t > s, and X(·) is continuous from the right.
(a) Prove that Pr(lim_{t→∞} X(t) = 1) = 1.
(b) Let U = inf{t : X(t) ≥ 1/2}. Prove that the median of U is inf{t : μ(-∞, t] ≥ μ(t, ∞)}. (Hint: Write {U ≤ s} in terms of X(·).)
31. Let R be a set, and let (X_r, B_r) be a Borel space for every r ∈ R. Let X = ∏_{r∈R} X_r and let B be the product σ-field. For each r ∈ R, let X_r : X → X_r be the projection function X_r(x) = x_r. Prove that B is the union of all of the σ-fields generated by all of the countable collections of X_r functions. That is, let Q be the set of all countable subsets of R, and for each q ∈ Q let X^q = {X_r}_{r∈q} and let B_q be the σ-field generated by X^q. Then show that B = ∪_{q∈Q} B_q.

Section B.7:

32. Prove Proposition B.145 on page 659.


APPENDIX C
Mathematical Theorems Not
Proven Here

There are several theorems of a purely mathematical nature which we use on occasion in this text, but which we do not wish to prove here because their proofs involve a great deal of mathematical background of which we will not make use anywhere else.

C.1 Real Analysis


Theorem C.1 (Taylor's theorem).1 Suppose that f : ℝ^m → ℝ has continuous partial derivatives of all orders up to and including k+1 with respect to all coordinates in a convex neighborhood D of a point x_0. For x ∈ D and i = 1, ..., k+1, define

D^{(i)}(f; x, y) = Σ_{j_1=1}^m ⋯ Σ_{j_i=1}^m ( ∂^i f(z)/(∂z_{j_1} ⋯ ∂z_{j_i}) |_{z=x} ) ∏_{s=1}^i y_{j_s},

where we allow notation like ∂³/(∂z_1∂z_1∂z_4) to stand for ∂³/(∂z_1²∂z_4). Then, for x ∈ D,

f(x) = f(x_0) + Σ_{i=1}^k (1/i!) D^{(i)}(f; x_0, x - x_0) + (1/(k+1)!) D^{(k+1)}(f; x_*, x - x_0),

where x_* is on the line segment joining x and x_0.

1This theorem is used in the proofs of Theorems 7.63, 7.89, 7.108, and 7.125. For a proof (with m = 2), see Buck (1965), Theorem 16 on page 260.

Theorem C.2 (Inverse function theorem).2 Let f be a continuously differentiable function from an open set in ℝ^n into ℝ^n such that the matrix of partial derivatives (∂f_i/∂x_j) is nonsingular at a point x. If y = f(x), then there exist open sets U and V such that x ∈ U, y ∈ V, f is one-to-one on U, and f(U) = V. Also, if g : V → U is the inverse of f on U, then g is continuously differentiable on V.
Theorem C.3 (Stone-Weierstrass theorem).3 Let A be a collection of continuous complex functions defined on a compact set C and satisfying these conditions:
If f ∈ A, then the complex conjugate of f is in A.
If x_1 ≠ x_2 ∈ C, then there exists f ∈ A such that f(x_1) ≠ f(x_2).
If f, g ∈ A, then f + g ∈ A and fg ∈ A.
If f ∈ A and c is a constant, then cf ∈ A.
For each x ∈ C, there exists f ∈ A such that f(x) ≠ 0.
Then, for every continuous complex function f on C, there exists a sequence {f_n}_{n=1}^∞ in A such that f_n converges uniformly to f on C.
Theorem C.4 (Supporting hyperplane theorem).4 If S is a convex subset of a finite-dimensional Euclidean space, and x_0 is a boundary point of S, then there is a nonzero vector v such that for every x ∈ S, vᵀx ≤ vᵀx_0.
Theorem C.5 (Separating hyperplane theorem).5 If S_1 and S_2 are disjoint convex subsets of a finite-dimensional Euclidean space, then there is a nonzero vector v and a constant c such that for every x ∈ S_1, vᵀx ≤ c and for every y ∈ S_2, vᵀy ≥ c.
Theorem C.6 (Bolzano-Weierstrass theorem).6 Suppose that B is a closed and bounded subset of a finite-dimensional Euclidean space. Then every infinite subset of B has a cluster point in B.

C.2 Complex Analysis


Theorem C.7.7 Let f be an analytic function in a neighborhood of a point z. Then the derivatives of f of every order exist and are analytic in a neighborhood
2This theorem is used in the proof of Theorem 7.57. For a proof, see Rudin (1964), Theorem 9.17.
3This theorem is used in the proofs of DeFinetti's representation theorem 1.49 and 1.47 and Theorem B.93. For a proof, see Rudin (1964), Theorem 7.31.
4This theorem is used in the proof of Theorem B.17. For a proof, see Berger (1985), Theorem 12 on page 341, or Ferguson (1967), Theorem 1 on page 73.
5This theorem is used in the proof of Theorems B.17, 3.77 and 3.95. For a proof, see Berger (1985), Theorem 13 on page 342, or Ferguson (1967), Theorem 2 on page 73.
6This theorem is used in the proof of Theorem 3.77. For a proof, see Dugundji (1966), Theorems 3.2 and 4.3 of Chapter XI.
7This theorem is used to show that certain estimators are UMVUE, and in the proof of Theorem 2.74. For a proof, see Churchill (1960, Sections 52 and 56).

of z. If f^{(k)} denotes the kth derivative of f, then

f(x) = Σ_{k=0}^∞ (x - z)^k f^{(k)}(z)/k!

for all x in some circle around z.
Theorem C.8 (Maximum modulus theorem).8 Let f be an analytic function in an open set D which is continuous on the closure of D. Let the maximum value of |f(z)| for z in the closure of D be c. Then |f(z)| < c for all z ∈ D unless f is constant on D.
Theorem C.9 (Cauchy's equation).9 Let G be a Borel subset of ℝ^k with positive Lebesgue measure. Let f : G → ℝ be measurable. Let H_1 = G and H_n = H_{n-1} + G for each n. For each n, let g_n : H_n → ℝ be measurable such that g_n(Σ_{i=1}^n x_i) = Σ_{i=1}^n f(x_i), for almost all (x_1, ..., x_n) ∈ G^n. Then there is a real number a and a vector b ∈ ℝ^k such that f(x) = a + bᵀx a.e. in G.

C.3 Functional Analysis


Theorem C.10.10 If T is an operator with finite norm on the Hilbert space L₂(μ) given by T(f)(x) = ∫ K(x', x)f(x') dμ(x'), then T is of Hilbert-Schmidt type if and only if

∫∫ |K(x', x)|² dμ(x') dμ(x) < ∞.

Theorem C.11.11 Every operator of Hilbert-Schmidt type is completely continuous.
Theorem C.12.12 If T is a completely continuous self-adjoint operator, then T has an eigenvalue λ with |λ| = ‖T‖.
Theorem C.13.13 If T is a linear operator with finite norm and T* is its adjoint operator, then ‖T*T‖ = ‖T‖².

8This theorem is used in the proof of Theorem 2.64. For a proof, see Churchill (1960), Section 54, or Ahlfors (1966), Theorem 12' on page 134.
9This theorem is used in the proof of Theorem 2.114. For a proof, see Diaconis and Freedman (1990), Theorem 2.1.
10This theorem is used in the proof of Theorem 8.40. For a proof, see Section XI.6 of Dunford and Schwartz (1963). By L₂(μ) we mean {f : ∫ f²(x) dμ(x) < ∞}.
11This theorem is used in the proof of Theorem 8.40. For a proof, see Theorem 6 of Section XI.6 of Dunford and Schwartz (1963). The reader should note that Dunford and Schwartz (1963) use the term compact instead of completely continuous.
12This theorem is used in the proof of Theorem 8.40. For a proof, see Lemma 1 in Section VIII.3 of Berberian (1961).
13This theorem is used in the proof of Theorem 8.40. For a proof, see part (5) of Theorem 2 on p. 132 of Berberian (1961).
APPENDIX D
Summary of Distributions

The distributions used in this book are listed here. We give the name and sym-
bol used to describe each distribution. Each distribution is absolutely continuous
with respect to some measure or other. In most cases the mean and variance are
given. In some cases, the symbol for the CDF is given.

D.1 Univariate Continuous Distributions


Alternate noncentral beta
Symbol: ANCB(q, a, γ)
Density: f_X(x) = Σ_{k=0}^∞ [Γ(a/2 + k) / (k! Γ(a/2))] γ^k (1−γ)^{a/2} [Γ(q/2 + a/2 + 2k) / (Γ(q/2 + k) Γ(a/2 + k))] x^{q/2+k−1} (1−x)^{a/2+k−1}
Dominating measure: Lebesgue measure on [0, 1]

Alternate noncentral chi-squared
Symbol: ANCX2(q, a, γ)¹
Density: f_X(x) = Σ_{k=0}^∞ [Γ(k + a/2) / (k! Γ(a/2))] γ^k (1−γ)^{a/2} x^{q/2+k−1} exp(−x/2) / [2^{q/2+k} Γ(q/2 + k)]
Dominating measure: Lebesgue measure on [0, ∞)
Mean: q + aγ/(1−γ)
Variance: 2[q + aγ(2−γ)/(1−γ)²]

¹This distribution was derived without a name by Geisser (1967). It was named
L2 by Lecoutre and Rouanet (1981).

Alternate noncentral F
Symbol: ANCF(q, a, γ)²
Dominating measure: Lebesgue measure on [0, ∞)
Mean: (1−γ) a/(a−2) + γ a/q, if a > 2
Variance: 2a²(a−2+q)(1−γ)² / [(a−2)²(a−4)q] + 4a²γ(1−γ) / [(a−2)q²], if a > 4

Beta
Symbol: Beta(α, β)
Density: f_X(x) = [Γ(α+β) / (Γ(α)Γ(β))] x^{α−1} (1−x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]
Mean: α/(α+β)
Variance: αβ / [(α+β)²(α+β+1)]
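As a numerical sanity check on the Beta entry, the density can be integrated against x with a simple midpoint rule using only the standard library (the parameter values below are arbitrary):

```python
import math

def beta_pdf(x, a, b):
    # f(x) = Gamma(a+b) / (Gamma(a) Gamma(b)) * x^(a-1) * (1-x)^(b-1)
    c = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return c * x ** (a - 1) * (1 - x) ** (b - 1)

a, b, h = 2.5, 4.0, 1e-4
grid = [h * (i + 0.5) for i in range(10000)]        # midpoints of (0, 1)
total = sum(beta_pdf(x, a, b) for x in grid) * h    # should be close to 1
mean = sum(x * beta_pdf(x, a, b) for x in grid) * h # should approximate a/(a+b)
print(total, mean)
```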

Cauchy
Symbol: Cau(μ, σ²)
Density: f_X(x) = [πσ (1 + (x−μ)²/σ²)]^{−1}
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: Does not exist
Variance: Does not exist

Chi-squared
Symbol: χ²_a
Density: f_X(x) = x^{a/2−1} exp(−x/2) / [2^{a/2} Γ(a/2)]
Dominating measure: Lebesgue measure on [0, ∞)
Mean: a
Variance: 2a
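The chi-squared mean and variance can be checked directly from the density by numerical integration (stdlib only; the degrees of freedom a = 5 below is arbitrary):

```python
import math

def chi2_pdf(x, a):
    # f(x) = x^(a/2 - 1) * exp(-x/2) / (2^(a/2) * Gamma(a/2))
    return x ** (a / 2 - 1) * math.exp(-x / 2) / (2 ** (a / 2) * math.gamma(a / 2))

a, h = 5, 0.01
grid = [h * (i + 0.5) for i in range(20000)]  # midpoints covering (0, 200)
mean = sum(x * chi2_pdf(x, a) for x in grid) * h
var = sum((x - mean) ** 2 * chi2_pdf(x, a) for x in grid) * h
print(mean, var)  # should approximate a and 2a
```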

²The alternate noncentral F distribution, with a different scaling factor, was
called the ψ² distribution by Rouanet and Lecoutre (1983). See also Lecoutre
(1985). The distribution was derived without a name by Ferrandiz (1985).
Schervish (1992) gives additional details concerning the ANCX², ANCB, and
ANCF distributions.

Exponential
Symbol: Exp(θ)
Density: f_X(x) = θ exp(−xθ)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: 1/θ
Variance: 1/θ²
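Note that θ here is a rate, not a scale. Python's random.expovariate uses the same rate convention, so a quick simulation reproduces the stated mean 1/θ and variance 1/θ² (sample size and seed below are arbitrary):

```python
import random
import statistics

random.seed(0)
theta = 2.0  # rate: f(x) = theta * exp(-x * theta)
sample = [random.expovariate(theta) for _ in range(100_000)]
print(statistics.mean(sample))      # should be near 1/theta = 0.5
print(statistics.variance(sample))  # should be near 1/theta**2 = 0.25
```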
F
Symbol: F_{q,a}
Density: f_X(x) = [Γ((q+a)/2) q^{q/2} a^{a/2} / (Γ(q/2)Γ(a/2))] x^{q/2−1} (a + qx)^{−(q+a)/2}
Dominating measure: Lebesgue measure on [0, ∞)
Mean: a/(a−2), if a > 2
Variance: 2a²(q+a−2) / [q(a−4)(a−2)²], if a > 4
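Since F_{q,a} is the distribution of (X/q)/(Y/a) for independent X ~ χ²_q and Y ~ χ²_a, the mean a/(a−2) can be checked by simulation; a χ²_k variable is Gamma with shape k/2 and scale 2 (degrees of freedom, sample size, and seed below are arbitrary):

```python
import random

random.seed(1)
q, a = 4, 10

def chi2(df):
    # chi-squared_df is Gamma(shape=df/2, scale=2)
    return random.gammavariate(df / 2, 2.0)

sample = [(chi2(q) / q) / (chi2(a) / a) for _ in range(200_000)]
mean = sum(sample) / len(sample)
print(mean)  # should be near a/(a-2) = 1.25
```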

Gamma
Symbol: Γ(α, β)
Density: f_X(x) = [β^α / Γ(α)] x^{α−1} exp(−βx)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: α/β
Variance: α/β²
Inverse gamma
Symbol: Γ⁻¹(α, β)
Density: f_X(x) = [β^α / Γ(α)] x^{−α−1} exp(−β/x)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: β/(α−1), if α > 1
Variance: β² / [(α−1)²(α−2)], if α > 2

Laplace
Symbol: Lap(μ, σ)
Density: f_X(x) = (2σ)^{−1} exp(−|x−μ|/σ)
Dominating measure: Lebesgue measure on ℝ
Mean: μ
Variance: 2σ²

Noncentral beta
Symbol: NCB(α, β, ψ)
Density: f_X(x) = Σ_{k=0}^∞ [(ψ/2)^k exp(−ψ/2) / k!] [Γ(α+β+k) / (Γ(α+k)Γ(β))] x^{α+k−1} (1−x)^{β−1}
Dominating measure: Lebesgue measure on [0, 1]

Noncentral chi-squared
Symbol: NCχ²_q(ψ)
Density: f_X(x) = Σ_{k=0}^∞ [(ψ/2)^k exp(−ψ/2) / k!] x^{q/2+k−1} exp(−x/2) / [2^{q/2+k} Γ(q/2 + k)]
Dominating measure: Lebesgue measure on [0, ∞)
Mean: q + ψ
Variance: 2q + 4ψ
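The density above is a Poisson(ψ/2) mixture of central χ²_{q+2k} components, which gives a quick analytic check of the moments (the values of q and ψ below are arbitrary; the series is truncated far beyond the relevant range):

```python
import math

q, psi = 3, 4.0
# Poisson(psi/2) mixing weights over central chi-squared_(q+2k) components
w = [math.exp(-psi / 2) * (psi / 2) ** k / math.factorial(k) for k in range(80)]
mean = sum(wk * (q + 2 * k) for k, wk in enumerate(w))
# E[X^2] of chi-squared_d is 2d + d^2
second = sum(wk * (2 * (q + 2 * k) + (q + 2 * k) ** 2) for k, wk in enumerate(w))
var = second - mean ** 2
print(mean, var)  # q + psi = 7.0 and 2q + 4*psi = 22.0
```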

Noncentral F
Symbol: NCF(q, a, ψ)
Dominating measure: Lebesgue measure on [0, ∞)
Mean: (1 + ψ/q) a/(a−2), if a > 2
Variance: 2(a/q)² [(q+ψ)² + (q+2ψ)(a−2)] / [(a−2)²(a−4)], if a > 4

Noncentral t
Symbol: NCt_a(δ)
Density: f_X(x) = Σ_{k=0}^∞ [(δx)^k / k!] exp(−δ²/2) [Γ((a+k+1)/2) / (√(aπ) Γ(a/2))] (2/a)^{k/2} (1 + x²/a)^{−(a+k+1)/2}
Dominating measure: Lebesgue measure on ℝ
Mean: δ √(a/2) Γ((a−1)/2)/Γ(a/2), if a > 1
Variance: a(δ²+1)/(a−2) − (aδ²/2) [Γ((a−1)/2)/Γ(a/2)]², if a > 2
CDF: NCT_a(·; δ)

Normal
Symbol: N(μ, σ²)
Density: f_X(x) = (√(2π) σ)^{−1} exp(−(x−μ)²/(2σ²))
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: μ
Variance: σ²
CDF: Φ(·) (for the N(0, 1) distribution)

Pareto
Symbol: Par(α, c)
Density: f_X(x) = α c^α / x^{α+1}
Dominating measure: Lebesgue measure on [c, ∞)
Mean: αc/(α−1), if α > 1
Variance: αc² / [(α−2)(α−1)²], if α > 2

t
Symbol: t_a(μ, σ²)
Density: f_X(x) = [Γ((a+1)/2) / (Γ(a/2) √(aπ) σ)] (1 + (x−μ)²/(aσ²))^{−(a+1)/2}
Dominating measure: Lebesgue measure on (−∞, ∞)
Mean: μ, if a > 1
Variance: σ² a/(a−2), if a > 2
CDF: T_a(·) (for the t_a(0, 1) distribution)

Uniform
Symbol: U(a, b)
Density: f_X(x) = (b − a)^{−1}
Dominating measure: Lebesgue measure on [a, b]
Mean: (a + b)/2
Variance: (b − a)²/12

D.2 Univariate Discrete Distributions


Bernoulli
Symbol: Ber(p)
Density: f_X(x) = p^x (1 − p)^{1−x}
Dominating measure: Counting measure on {0, 1}
Mean: p
Variance: p(1 − p)

Binomial
Symbol: Bin(n, p)
Density: f_X(x) = (n choose x) p^x (1 − p)^{n−x}
Dominating measure: Counting measure on {0, ..., n}
Mean: np
Variance: np(1 − p)

Geometric
Symbol: Geo(p)
Density: f_X(x) = p(1 − p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: (1 − p)/p
Variance: (1 − p)/p²
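In this parameterization X counts the failures before the first success, so a direct simulation matches the mean (1−p)/p (the value of p, the sample size, and the seed below are arbitrary):

```python
import random

random.seed(2)
p = 0.3

def geometric(p):
    # number of failures before the first success
    n = 0
    while random.random() >= p:
        n += 1
    return n

sample = [geometric(p) for _ in range(100_000)]
mean = sum(sample) / len(sample)
print(mean)  # should be near (1-p)/p = 7/3
```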

Hypergeometric
Symbol: Hyp(N, n, k)
Density: f_X(x) = (k choose x)(N−k choose n−x) / (N choose n)
Dominating measure: Counting measure on
{max(0, n − N + k), ..., min(n, k)}
Mean: nk/N
Variance: n (k/N) ((N−k)/N) ((N−n)/(N−1))

Negative binomial
Symbol: Negbin(α, p)
Density: f_X(x) = [Γ(α+x) / (Γ(α) x!)] p^α (1 − p)^x
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: α(1 − p)/p
Variance: α(1 − p)/p²
Poisson
Symbol: Poi(λ)
Density: f_X(x) = exp(−λ) λ^x / x!
Dominating measure: Counting measure on {0, 1, 2, ...}
Mean: λ
Variance: λ
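A truncated sum of the pmf confirms normalization and the mean λ (the value of λ is arbitrary; 60 terms are far beyond the relevant range):

```python
import math

lam = 3.5
pmf = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(60)]
total = sum(pmf)
mean = sum(x * p for x, p in enumerate(pmf))
print(total, mean)  # ~1.0 and ~lam
```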

D.3 Multivariate Distributions


Dirichlet
Symbol: Dir_k(α₁, ..., α_k)
Density: f_{X₁,...,X_{k−1}}(x₁, ..., x_{k−1}) = [Γ(α₀) / (Γ(α₁)⋯Γ(α_k))] x₁^{α₁−1} ⋯ x_{k−1}^{α_{k−1}−1} (1 − x₁ − ⋯ − x_{k−1})^{α_k−1}, where α₀ = Σ_{i=1}^k αᵢ
Dominating measure: Lebesgue measure on
{(x₁, ..., x_{k−1}) : all xᵢ ≥ 0 and x₁ + ⋯ + x_{k−1} ≤ 1}
Mean: E(Xᵢ) = αᵢ/α₀
Variance: Var(Xᵢ) = αᵢ(α₀−αᵢ) / [α₀²(α₀+1)]
Covariance: Cov(Xᵢ, Xⱼ) = −αᵢαⱼ / [α₀²(α₀+1)]
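A Dirichlet draw can be generated by normalizing independent Gamma(αᵢ, 1) variables, which lets the moment formulas be spot-checked by simulation (the α vector, sample size, and seed below are arbitrary):

```python
import random

random.seed(3)
alpha = [2.0, 3.0, 5.0]
a0 = sum(alpha)

def dirichlet(alpha):
    # normalize independent Gamma(alpha_i, 1) draws
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [gi / s for gi in g]

draws = [dirichlet(alpha) for _ in range(100_000)]
m1 = sum(d[0] for d in draws) / len(draws)
print(m1)  # should be near alpha_1/alpha_0 = 0.2
```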

Multinomial
Symbol: Mult_k(n, p₁, ..., p_k)
Density: f_{X₁,...,X_k}(x₁, ..., x_k) = [n! / (x₁! ⋯ x_k!)] p₁^{x₁} ⋯ p_k^{x_k}
Dominating measure: Counting measure on
{(x₁, ..., x_k) : all xᵢ ∈ {0, ..., n} and x₁ + ⋯ + x_k = n}
Mean: E(Xᵢ) = npᵢ
Variance: Var(Xᵢ) = npᵢ(1 − pᵢ)
Covariance: Cov(Xᵢ, Xⱼ) = −npᵢpⱼ
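The negative covariance −npᵢpⱼ reflects the constraint x₁ + ⋯ + x_k = n; a direct simulation of n categorical trials per draw illustrates it (parameters and seed below are arbitrary):

```python
import random

random.seed(4)
n, p = 10, [0.2, 0.3, 0.5]

def multinomial(n, p):
    # n independent categorical trials with cell probabilities p
    counts = [0] * len(p)
    for _ in range(n):
        u, c = random.random(), 0.0
        for i, pi in enumerate(p):
            c += pi
            if u < c:
                counts[i] += 1
                break
    return counts

draws = [multinomial(n, p) for _ in range(50_000)]
m = len(draws)
m1 = sum(d[0] for d in draws) / m
m2 = sum(d[1] for d in draws) / m
cov12 = sum(d[0] * d[1] for d in draws) / m - m1 * m2
print(m1, cov12)  # near n*p_1 = 2.0 and -n*p_1*p_2 = -0.6
```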

Multivariate normal
Symbol: N_p(μ, Σ)
Density: f_X(x) = (2π)^{−p/2} |Σ|^{−1/2} exp(−(x−μ)ᵀ Σ^{−1} (x−μ)/2)
Dominating measure: Lebesgue measure on ℝᵖ
Mean: E(Xᵢ) = μᵢ
Variance: Var(Xᵢ) = Σᵢᵢ
Covariance: Cov(Xᵢ, Xⱼ) = Σᵢⱼ
References

AHLFORS, L. (1966). Complex Analysis (2nd ed.). New York: McGraw-Hill.


AITCHISON, J. and DUNSMORE, I. R. (1975). Statistical Prediction Analysis.
Cambridge: Cambridge University Press.
ALBERT, J. H. and CHlB, S. (1993). Bayesian analysis of binary and polychoto-
mous response data. Journal of the American Statistical Association, 88,
669-679.
ALDOUS, D. J. (1981). Representations for partially exchangeable random vari-
ables. Journal of Multivariate Analysis, 11, 581-598.
ALDOUS, D. J. (1985). Exchangeability and related topics. In P. L. HENNEQUIN
(Ed.), École d'Été de Probabilités de Saint-Flour XIII-1983 (pp. 1-198).
Berlin: Springer-Verlag.
ANDERSON, T. W. (1984). An Introduction to Multivariate Statistical Analysis
(2nd ed.). New York: Wiley.
ANSCOMBE, F. J. and AUMANN, R. J. (1963). A definition of subjective proba-
bility. Annals of Mathematical Statistics, 34, 199-205.
ANTONIAK, C. E. (1974). Mixtures of Dirichlet processes with applications to
Bayesian nonparametric problems. Annals of Statistics, 2, 1152-1174.
BAHADUR, R. R. (1957). On unbiased estimates of uniformly minimum variance.
Sankhya, 18, 211-224.
BARNARD, G. A. (1970). Discussion on paper by Dr. Kalbfleisch and Dr. Sprott.
Journal of the Royal Statistical Society (Series B), 32, 194-195.
BARNARD, G. A. (1976). Conditional inference is not inefficient. Scandinavian
Journal of Statistics, 3, 132-134.
BARNDORFF-NIELSEN, O. E. (1988). Parametric Statistical Models and Likeli-
hood. Berlin: Springer-Verlag.
BARNETT, V. (1982). Comparative Statistical Inference (2nd ed.). New York:
Wiley.
BARRON, A. R. (1986). Discussion of "On the consistency of Bayes estimates"
by Diaconis and Freedman. Annals of Statistics, 14, 26-30.
BARRON, A. R. (1988). The exponential convergence of posterior probabilities
with implications for Bayes estimators of density functions. Technical Re-
port 7, Department of Statistics, University of Illinois, Champaign, IL.
BASU, D. (1955). On statistics independent of a complete sufficient statistic.
Sankhya, 15, 377-380.
BASU, D. (1958). On statistics independent of sufficient statistics. Sankhya, 20,
223-226.

BAYES, T. (1764). An essay toward solving a problem in the doctrine of chances.


Philosophical Transactions of the Royal Society of London, 53, 370-418.
BECKER, R. A., CHAMBERS, J. M., and WILKS, A. R. (1988). The New S
Language: A Programming Environment for Data Analysis and Graphics.
Pacific Grove, CA: Wadsworth and Brooks/Cole.
BERBERIAN, S. K. (1961). Introduction to Hilbert Space. New York: Oxford
University Press.
BERGER, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd
ed.). New York: Springer-Verlag.
BERGER, J. O. (1994). An overview of robust Bayesian analysis (with discussion).
Test, 3, 5-124.
BERGER, J. O. and BERRY, D. A. (1988). The relevance of stopping rules in
statistical inference (with discussion). In S. S. GUPTA and J. O. BERGER
(Eds.), Statistical Decision Theory and Related Topics IV (pp. 29-72). New
York: Springer-Verlag.
BERGER, J. O. and SELLKE, T. (1987). Testing a point null hypothesis: The
irreconcilability of P values and evidence (with discussion). Journal of the
American Statistical Association, 82, 112-122.
BERK, R. H. (1966). Limiting behavior of posterior distributions when the model
is incorrect. Annals of Mathematical Statistics, 37, 51-58.
BERKSON, J. (1942). Tests of significance considered as evidence. Journal of the
American Statistical Association, 37,325-335.
BERTI, P., REGAZZINI, E., and RIGo, P. (1991). Coherent statistical inference
and Bayes theorem. Annals of Statistics, 19, 366-381.
BICKEL, P. J. and FREEDMAN, D. A. (1981). Some asymptotic theory for the
bootstrap. Annals of Statistics, 9, 1196-1217.
BILLINGSLEY, P. (1968). Convergence of Probability Measures. New York: Wiley.
BILLINGSLEY, P. (1986). Probability and Measure (2nd ed.). New York: Wiley.
BISHOP, Y. M. M., FIENBERG, S. E., and HOLLAND, P. W. (1975). Discrete
Multivariate Analysis: Theory and Practice. Cambridge, MA: MIT Press.
BLACKWELL, D. (1947). Conditional expectation and unbiased sequential esti-
mation. Annals of Mathematical Statistics, 18, 105-110.
BLACKWELL, D. (1973). Discreteness of Ferguson selections. Annals of Statistics,
1,356-358.
BLACKWELL, D. and DUBINS, L. (1962). Merging of opinions with increasing
information. Annals of Mathematical Statistics, 33, 882-886.
BLACKWELL, D. and RAMAMOORTHI, R. V. (1982). A Bayes but not classically
sufficient statistic. Annals of Statistics, 10, 1025-1026.
BLYTH, C. R. (1951). On minimax statistical decision procedures and their
admissibility. Annals of Mathematical Statistics, 22, 22-42.
BONDAR, J. V. (1988). Discussion of "Conditionally acceptable frequentist so-
lutions" by George Casella. In S. S. GUPTA and J. O. BERGER (Eds.) ,
Statistical Decision Theory and Related Topics IV (pp. 91-93). New York:
Springer-Verlag.

BORTKIEWICZ, L. V. (1898). Das Gesetz der Kleinen Zahlen. Leipzig: Teubner.
BOX, G. E. P. and COX, D. R. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society (Series B), 26, 211-246.
BOX, G. E. P. and TIAO, G. C. (1968). A Bayesian approach to some outlier problems. Biometrika, 55, 119-129.
BREIMAN, L. (1968). Probability. Reading, MA: Addison-Wesley.
BRENNER, D., FRASER, D. A. S., and MCDUNNOUGH, P. (1982). On asymptotic normality of likelihood and conditional analysis. Canadian Journal of Statistics, 10, 163-172.
BROWN, L. D. (1967). The conditional level of Student's t test. Annals of Mathematical Statistics, 38, 1068-1071.
BROWN, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Annals of Mathematical Statistics, 42, 855-903. (See also correction, Annals of Statistics, 1, 594-596.)
BROWN, L. D. and HWANG, J. T. (1982). A unified admissibility proof. In S. S. GUPTA and J. O. BERGER (Eds.), Statistical Decision Theory and Related Topics III (pp. 205-230). New York: Academic Press.
BUCK, C. (1965). Real Analysis (2nd ed.). New York: McGraw-Hill.
BUEHLER, R. J. (1959). Some validity criteria for statistical inferences. Annals of Mathematical Statistics, 30, 845-863.
BUEHLER, R. J. and FEDDERSON, A. P. (1963). Note on a conditional property of Student's t. Annals of Mathematical Statistics, 34, 1098-1100.
CASELLA, G. and BERGER, R. L. (1987). Reconciling Bayesian and frequentist evidence in the one-sided testing problem (with discussion). Journal of the American Statistical Association, 82, 106-111.
CHALONER, K., CHURCH, T., LOUIS, T. A., and MATTS, J. P. (1993). Graphical elicitation of a prior distribution for a clinical trial. The Statistician, 42, 341-353.
CHANG, T. and VILLEGAS, C. (1986). On a theorem of Stein relating Bayesian and classical inferences in group models. Canadian Journal of Statistics, 14, 289-296.
CHAPMAN, D. and ROBBINS, H. (1951). Minimum variance estimation without regularity assumptions. Annals of Mathematical Statistics, 22, 581-586.
CHEN, C.-F. (1985). On asymptotic normality of limiting density functions with Bayesian implications. Journal of the Royal Statistical Society (Series B), 47, 540-546.
CHOW, Y. S., ROBBINS, H., and SIEGMUND, D. (1971). Great Expectations: The Theory of Optimal Stopping. New York: Houghton Mifflin.
CHURCHILL, R. V. (1960). Complex Variables and Applications (2nd ed.). New York: McGraw-Hill.
CLARKE, B. S. and BARRON, A. R. (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. Journal of Statistical Planning and Inference, 41, 37-60.

CORNFIELD, J. (1966). A Bayesian test of some classical hypotheses-with ap-


plications to sequential clinical trials. Journal of the American Statistical
Association, 61, 577-594.
COX, D. R. (1958). Some problems connected with statistical inference. Annals
of Mathematical Statistics, 29, 357-372.
COX, D. R. (1977). The role of significance tests. Scandinavian Journal of
Statistics, 4, 49-70.
COX, D. R. and HINKLEY, D. V. (1974). Theoretical Statistics. London: Chap-
man and Hall.
CRAMER, H. (1945). Mathematical Methods of Statistics. Princeton: Princeton
University Press.
CRAMER, H. (1946). Contributions to the theory of statistical estimation. Skandinavisk Aktuarietidskrift, 29, 85-94.
DAVID, H. A. (1970). Order Statistics. New York: Wiley.
DAWID, A. P. (1970). On the limiting normality of posterior distributions. Pro-
ceedings of the Cambridge Philosophical Society, 67, 625-633.
DAWID, A. P. (1982). Intersubjective statistical models. In G. KOCH and
F. SPIZZICHINO (Eds.), Exchangeability in Probability and Statistics (pp.
217-232). Amsterdam: North-Holland.
DAWID, A. P. (1984). Statistical theory: The prequential approach. Journal of
the Royal Statistical Society (Series A), 147, 278-292.
DAWID, A. P., STONE, M., and ZIDEK, J. V. (1973). Marginalization paradoxes
in Bayesian and structural inference. Journal of the Royal Statistical Society
(Series B), 35, 189-233.
DEFINETTI, B. (1937). Foresight: Its logical laws, its subjective sources. In H. E.
KYBURG and H. E. SMOKLER (Eds.), Studies in Subjective Probability (pp.
53-118). New York: Wiley.
DEFINETTI, B. (1974). Theory of Probability, Vols. I and II. New York: Wiley.
DEGROOT, M. H. (1970). Optimal Statistical Decisions. New York: Wiley.
DEMoIVRE, A. (1756). The Doctrine of Chance (3rd ed.). London: A. Millar.
DIACONIS, P. and FREEDMAN, D. A. (1980a). Finite exchangeable sequences.
Annals of Probability, 8, 745-764.
DIACONIS, P. and FREEDMAN, D. A. (1980b). DeFinetti's generalizations of
exchangeability. In R. C. JEFFREY (Ed.), Studies in Inductive Logic and
Probability, II (pp. 233-249). Berkeley: University of California.
DIACONIS, P. and FREEDMAN, D. A. (1980c). DeFinetti's theorem for Markov
chains. Annals of Probability, 8, 115-130.
DIACONIS, P. and FREEDMAN, D. A. (1984). Partial exchangeability and suf-
ficiency. In J. K. GHOSH and J. Roy (Eds.), Statistics: Applications and
New Directions (pp. 205-236). Calcutta: Indian Statistical Institute.
DIACONIS, P. and FREEDMAN, D. A. (1986a). On the consistency of Bayes
estimates (with discussion). Annals of Statistics, 14, 1-26.

DIACONIS, P. and FREEDMAN, D. A. (1986b). On inconsistent Bayes estimates of location. Annals of Statistics, 14, 68-87.
DIACONIS, P. and FREEDMAN, D. A. (1990). Cauchy's equation and DeFinetti's theorem. Scandinavian Journal of Statistics, 17, 235-250.
DIACONIS, P. and YLVISAKER, D. (1979). Conjugate priors for exponential families. Annals of Statistics, 7, 269-281.
DICKEY, J. M. (1980). Beliefs about beliefs, a theory of stochastic assessments of subjective probabilities. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics (pp. 471-487). Valencia, Spain: University Press.
DOOB, J. L. (1949). Application of the theory of martingales. In Le Calcul des Probabilites et ses Applications (pp. 23-27). Paris: Colloques Internationaux du Centre National de la Recherche Scientifique.
DOOB, J. L. (1953). Stochastic Processes. New York: Wiley.
DUBINS, L. E. and FREEDMAN, D. A. (1963). Random distribution functions. Bulletin of the American Mathematical Society, 69, 548-551.
DUGUNDJI, J. (1966). Topology. Boston: Allyn and Bacon.
DUNFORD, N. and SCHWARTZ, J. T. (1957). Linear Operators, Part I: General Theory. New York: Interscience.
DUNFORD, N. and SCHWARTZ, J. T. (1963). Linear Operators, Part II: Spectral Theory. New York: Interscience.
EBERHARDT, K. R., MEE, R. W., and REEVE, C. P. (1989). Computing factors for exact two-sided tolerance limits for a normal distribution. Communications in Statistics-Simulation and Computation, 18, 397-413.
EDWARDS, W., LINDMAN, H., and SAVAGE, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193-242.
EFRON, B. (1979). Bootstrap methods: Another look at the jackknife. Annals of Statistics, 7, 1-26.
EFRON, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.
EFRON, B. and HINKLEY, D. V. (1978). Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika, 65, 457-487.
EFRON, B. and MORRIS, C. N. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70, 311-319.
EFRON, B. and TIBSHIRANI, R. J. (1993). An Introduction to the Bootstrap. London: Chapman and Hall.
ESCOBAR, M. D. (1988). Estimating the Means of Several Normal Populations by Nonparametric Estimation of the Distribution of the Means. Ph.D. thesis, Yale University.
FABIUS, J. (1964). Asymptotic behavior of Bayes' estimates. Annals of Mathematical Statistics, 35, 846-856.

FERGUSON, T. S. (1967). Mathematical Statistics: A Decision Theoretic Ap-


proach. New York: Academic Press.
FERGUSON, T. S. (1973). A Bayesian analysis of some nonparametric problems.
Annals of Statistics, 1, 209-230.
FERGUSON, T. S. (1974). Prior distributions on spaces of probability measures.
Annals of Statistics, 2, 615-629.
FERRANDIZ, J. R. (1985). Bayesian inference on Mahalanobis distance: An al-
ternative to Bayesian model testing. In J. M. BERNARDO, M. H. DEG-
ROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics
2: Proceedings of the Second Valencia International Meeting (pp. 645-653).
Amsterdam: North Holland.
FIELLER, E. C. (1954). Some problems in interval estimation. Journal of the
Royal Statistical Society (Series B), 16, 175-185.
FISHBURN, P. C. (1970). Utility Theory for Decision Making. New York: Wiley.
FISHER, R. A. (1922). On the mathematical foundations of theoretical statistics.
Philosophical Transactions of the Royal Society of London, Series A, 222A,
309-368.
FISHER, R. A. (1924). The conditions under which χ² measures the discrepancy
between observation and hypothesis. Journal of the Royal Statistical Society,
87, 442-450.
FISHER, R. A. (1925). Theory of statistical estimation. Proceedings of the Cam-
bridge Philosophical Society, 22, 700-725.
FISHER, R. A. (1934). Two new properties of mathematical likelihood. Proceed-
ings of the Royal Society of London, A, 144, 285-307.
FISHER, R. A. (1935). The fiducial argument in statistical inference. Annals of
Eugenics, 6, 391-398.
FISHER, R. A. (1936). Has Mendel's work been rediscovered? Annals of Science,
1, 115-137.
FISHER, R. A. (1943). Note on Dr. Berkson's criticism of tests of significance.
Journal of the American Statistical Association, 38, 103-104.
FISHER, R. A. (1966). The Design of Experiments (8th ed.). New York: Hafner.
FRASER, D. A. S. and McDuNNOUGH, P. (1984). Further remarks on asymp-
totic normality of likelihood and conditional analyses. Canadian Journal of
Statistics, 12, 183-190.
FREEDMAN, D. A. (1963). On the asymptotic behavior of Bayes' estimates in
the discrete case. Annals of Mathematical Statistics, 34, 1386-1403.
FREEDMAN, D. A. (1977). A remark on the difference between sampling with
and without replacement. Journal of the American Statistical Association,
72,681.
FREEDMAN, D. A. and DIACONIS, P. (1982). On inconsistent M-estimators.
Annals of Statistics, 10, 454-461.
FREEDMAN, L. S. and SPIEGELHALTER, D. J. (1983). The assessment of subjective opinion and its use in relation to stopping rules of clinical trials. The Statistician, 32, 153-160.

FREEMAN, P. R. (1980). On the number of outliers in data from a linear model. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics (pp. 349-365). Valencia, Spain: University Press.
GABRIEL, K. R. (1969). Simultaneous test procedures-some theory of multiple comparisons. Annals of Mathematical Statistics, 40, 224-250.
GARTHWAITE, P. and DICKEY, J. (1988). Quantifying expert opinion in linear regression problems. Journal of the Royal Statistical Society (Series B), 50, 462-474.
GARTHWAITE, P. H. and DICKEY, J. M. (1992). Elicitation of prior distributions for variable-selection problems in regression. Annals of Statistics, 20, 1697-1719.
GAVASAKAR, U. K. (1984). A Study of Elicitation Procedures by Modelling the Errors in Responses. Ph.D. thesis, Carnegie Mellon University.
GEISSER, S. (1967). Estimation associated with linear discriminants. Annals of Mathematical Statistics, 38, 807-817.
GEISSER, S. and EDDY, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74, 153-160.
GELFAND, A. E. and SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
GEMAN, S. and GEMAN, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
GNANADESIKAN, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. New York: Wiley.
GOOD, I. J. (1956). Discussion of "Chance and control: Some implications of randomization" by G. Spencer Brown. In C. CHERRY (Ed.), Information Theory: Third London Symposium (pp. 13-14). London: Butterworths.
HALL, P. (1992). The Bootstrap and Edgeworth Expansion. New York: Springer-Verlag.
HALL, W. J., WIJSMAN, R. A., and GHOSH, J. K. (1965). The relationship between sufficiency and invariance with applications in sequential analysis. Annals of Mathematical Statistics, 36, 575-614.
HALMOS, P. R. (1950). Measure Theory. New York: Van Nostrand.
HALMOS, P. R. and SAVAGE, L. J. (1949). Application of the Radon-Nikodym theorem to the theory of sufficient statistics. Annals of Mathematical Statistics, 20, 225-241.
HAMPEL, F. R., RONCHETTI, E. M., ROUSSEEUW, P. J., and STAHEL, W. A. (1986). Robust Statistics: The Approach Based on Influence Functions. New York: Wiley.
HARTIGAN, J. (1983). Bayes Theory. New York: Springer-Verlag.
HEATH, D. and SUDDERTH, W. D. (1976). DeFinetti's theorem on exchangeable variables. American Statistician, 30, 188-189.

HEATH, D. and SUDDERTH, W. D. (1989). Coherent inference from improper


priors and from finitely additive priors. Annals of Statistics, 17, 907-919.
HEWITT, E. and SAVAGE, L. J. (1955). Symmetric measures on cartesian products. Transactions of the American Mathematical Society, 80, 470-501.
HEYDE, C. C. and JOHNSTONE, I. M. (1979). On asymptotic posterior normality
for stochastic processes. Journal of the Royal Statistical Society (Series B),
41, 184-189.
HILL, B. M. (1965). Inference about variance components in the one-way model. Journal of the American Statistical Association, 60, 806-825.
HILL, B. M., LANE, D., and SUDDERTH, W. D. (1987). Exchangeable urn pro-
cesses. Annals of Probability, 15, 1586-1592.
HOEL, P. G., PORT, S. C., and STONE, C. J. (1971). Introduction to Probability
Theory. Boston: Houghton Mifflin.
HOGARTH, R. M. (1975). Cognitive processes and the assessment of subjective
probability distributions (with discussion). Journal of the American Statis-
tical Association, 70, 271-294.
HUBER, P. J. (1964). Robust estimation of a location parameter. Annals of
Mathematical Statistics, 35, 73-101.
HUBER, P. J. (1967). The behaviour of maximum likelihood estimates under
nonstandard conditions. In L. M. LECAM and J. NEYMAN (Eds.), Pro-
ceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, volume 1 (pp. 221-233). Berkeley: University of California.
HUBER, P. J. (1977). Robust Statistical Procedures. Philadelphia: Society for
Industrial and Applied Mathematics.
HUBER, P. J. (1981). Robust Statistics. New York: Wiley.
JAMES, W. and STEIN, C. M. (1960). Estimation with quadratic loss. In J. NEY-
MAN (Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 361-379). Berkeley: University of
California.
JAYNES, E. T. (1976). Confidence intervals vs. Bayesian intervals (with dis-
cussion). In W. L. HARPER and C. A. HOOKER (Eds.), Foundations of
Probability Theory, Statistical Inference, and Statistical Theories of Science
(pp. 175-257). Dordrecht: D. Reidel.
JEFFREYS, H. (1961). Theory of Probability (3rd ed.). Oxford: Oxford University
Press.
JOHNSTONE, I. M. (1978). Problems in limit theory for martingales and posterior
distributions from stochastic processes. Master's thesis, Australian National
University.
KADANE, J. B., DICKEY, J. M., WINKLER, R. L., SMITH, W., and PETERS,
S. C. (1980). Interactive elicitation of opinion for a normal linear model.
Journal of the American Statistical Association, 75, 845-854.
KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1985). Statistical
implications of finitely additive probability. In P. GOEL and A. ZELLNER

(Eds.), Bayesian Inference and Decision Techniques with Applications: Essays in Honor of Bruno DeFinetti (pp. 59-76). Amsterdam: Elsevier Science Publishers.
KADANE, J. B., SCHERVISH, M. J., and SEIDENFELD, T. (1996). Reasoning to a foregone conclusion. Journal of the American Statistical Association, 91, to appear.
KAGAN, A. M., LINNIK, Y. V., and RAO, C. R. (1965). On a characterization of the normal law based on a property of the sample average. Sankhya, Series A, 32, 37-40.
KAHNEMAN, D., SLOVIC, P., and TVERSKY, A. (Eds.) (1982). Judgment Under Uncertainty: Heuristics and Biases. Cambridge: Cambridge University Press.
KASS, R. E. and RAFTERY, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
KASS, R. E. and STEFFEY, D. (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association, 84, 717-726.
KASS, R. E., TIERNEY, L., and KADANE, J. B. (1988). Asymptotics in Bayesian computation. In J. M. BERNARDO, M. H. DEGROOT, D. V. LINDLEY, and A. F. M. SMITH (Eds.), Bayesian Statistics 3 (pp. 261-278). Oxford: Clarendon Press.
KASS, R. E., TIERNEY, L., and KADANE, J. B. (1990). The validity of posterior expansions based on Laplace's method. In S. GEISSER, J. S. HODGES, S. J. PRESS, and A. ZELLNER (Eds.), Bayesian and Likelihood Methods in Statistics and Econometrics (pp. 473-488). Amsterdam: Elsevier (North Holland).
KIEFER, J. and WOLFOWITZ, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906.
KERRIDGE, D. (1963). Bounds for the frequency of misleading Bayes inferences. Annals of Mathematical Statistics, 34, 1109-1110.
KINDERMAN, A. J. and MONAHAN, J. F. (1977). Computer generation of random variables using the ratio of uniform deviates. ACM Transactions on Mathematical Software, 3, 257-260.
KINGMAN, J. F. C. (1978). Uses of exchangeability. Annals of Probability, 6, 183-197.
KNUTH, D. E. (1984). The TeXbook. Reading, MA: Addison-Wesley.
KRAFT, C. H. (1964). A class of distribution function processes which have derivatives. Journal of Applied Probability, 1, 385-388.
KRASKER, W. and PRATT, J. W. (1986). Discussion of "On the consistency of Bayes estimates" by Diaconis and Freedman. Annals of Statistics, 14, 55-58.
KREM, A. (1963). On the independence in the limit of extreme and central order statistics. Publications of the Mathematical Institute of the Hungarian Academy of Science, 8, 469-474.

KSHIRSAGAR, A. M. (1972). Multivariate Analysis. New York: Marcel Dekker.


KULLBACK, S. (1959). Information Theory and Statistics. New York: Wiley.
LAMPORT, L. (1986). LaTeX: A Document Preparation System. Reading, MA: Addison-Wesley.
LAURITZEN, S. L. (1984). Extreme point models in statistics (with discussion).
Scandinavian Journal of Statistics, 11, 65-91.
LAURITZEN, S. L. (1988). Extremal Families and Systems of Sufficient Statistics.
Berlin: Springer-Verlag.
LAVINE, M. (1992). Some aspects of Polya tree distributions for statistical mod-
elling. Annals of Statistics, 20, 1222-1235.
LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1991). Bayesian inference
with specified prior marginals. Journal of the American Statistical Associa-
tion, 86, 964-971.
LAVINE, M., WASSERMAN, L., and WOLPERT, R. L. (1993). Linearization of
Bayesian robustness problems. Journal of Statistical Planning and Inference,
37,307-316.
LECAM, L. M. (1953). On some asymptotic properties of maximum likelihood
estimates and related Bayes estimates. University of California Publications
in Statistics, 1, 277-330.
LECAM, L. M. (1970). On the assumptions used to prove asymptotic normality
of maximum likelihood estimates. Annals of Mathematical Statistics, 41,
802-828.
LECOUTRE, B. (1985). Reconsideration of the F-test of the analysis of variance:
The semi-Bayesian significance test. Communications in Statistics-Theory
and Methods, 14, 2437-2446.
LECOUTRE, B. and ROUANET, H. (1981). Deux structures statistiques fondamentales en analyse de la variance univariée et multivariée. Mathématiques et Sciences Humaines, 75, 71-82.
LEHMANN, E. L. (1958). Significance level and power. Annals of Mathematical Statistics, 29, 1167-1176.
LEHMANN, E. L. (1983). Theory of Point Estimation. New York: Wiley.
LEHMANN, E. L. (1986). Testing Statistical Hypotheses (2nd ed.). New York:
Wiley.
LEHMANN, E. L. and SCHEFFE, H. (1955). Completeness, similar regions and
unbiased estimates. Sankhya, 10, 305-340. (Also 15, 219-236, and correction
17, 250.)
LINDLEY, D. V. (1957). A statistical paradox. Biometrika, 44, 187-192.
LINDLEY, D. V. and NOVICK, M. R. (1981). The role of exchangeability in
inference. Annals of Statistics, 9, 45-58.
LINDLEY, D. V. and PHILLIPS, L. D. (1976). Inference for a Bernoulli process
(a Bayesian view). American Statistician, 30, 112-119.
LINDLEY, D. V. and SMITH, A. F. M. (1972). Bayes estimates for the linear
model. Journal of the Royal Statistical Society (Series B), 34, 1-41.
LOÈVE, M. (1977). Probability Theory I (4th ed.). New York: Springer-Verlag.


MAULDIN, R. D., SUDDERTH, W. D., and WILLIAMS, S. C. (1992). Polya trees
and random distributions. Annals of Statistics, 20, 1203-1221.
MAULDIN, R. D. and WILLIAMS, S. C. (1990). Reinforced random walks and
random distributions. Proceedings of the American Mathematical Society,
110, 251-258.
MENDEL, G. (1866). Versuche über Pflanzenhybriden. Verhandlungen Natur-
forschender Vereines in Brünn, 10, 1.
METIVIER, M. (1971). Sur la construction de mesures aléatoires presque sûrement
absolument continues par rapport à une mesure donnée. Zeitschrift für
Wahrscheinlichkeitstheorie, 20, 332-344.
MORRIS, C. N. (1983). Parametric empirical Bayes inference: Theory and appli-
cations (with discussion). Journal of the American Statistical Association,
78, 47-65.
NACHBIN, L. (1965). The Haar Integral. Princeton: Van Nostrand.
NEYMAN, J. (1935). Su un teorema concernente le cosiddette statistiche suffici-
enti. Giornale dell'Istituto Italiano degli Attuari, 6, 320-334.
NEYMAN, J. and PEARSON, E. S. (1933). On the problem of the most efficient test
of statistical hypotheses. Philosophical Transactions of the Royal Society of
London, Series A, 231, 289-337.
NEYMAN, J. and SCOTT, E. L. (1948). Consistent estimates based on partially
consistent observations. Econometrica, 16, 1-32.
PEARSON, K. (1900). On the criterion that a given system of deviations from the
probable in the case of a correlated system of variables is such that it can
be reasonably supposed to have arisen from random sampling. Philosoph-
ical Magazine (5th Series), 50, 339-357. (See also correction, Philosophical
Magazine (6th Series), 1, 670-671.)
PERLMAN, M. (1972). On the strong consistency of approximate maximum like-
lihood estimators. In L. M. LECAM, J. NEYMAN, and E. L. SCOTT (Eds.),
Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and
Probability, volume 1 (pp. 263-281). Berkeley: University of California.
PIERCE, D. A. (1973). On some difficulties in a frequency theory of inference.
Annals of Statistics, 1, 241-250.
PITMAN, E. (1939). The estimation of location and scale parameters of a contin-
uous population of any given form. Biometrika, 30, 391-421.
PRATT, J. W. (1961). Review of "Testing Statistical Hypotheses" by E. L.
Lehmann. Journal of the American Statistical Association, 56, 163-167.
PRATT, J. W. (1962). Discussion of "On the foundations of statistical inference"
by Allan Birnbaum. Journal of the American Statistical Association, 57,
314-316.
RAO, C. R. (1945). Information and the accuracy attainable in the estimation
of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37,
81-91.
RAO, C. R. (1973). Linear Statistical Inference and Its Applications (2nd ed.).
New York: Wiley.
ROBBINS, H. (1951). Asymptotically subminimax solutions of compound sta-
tistical decision problems. In J. NEYMAN (Ed.), Proceedings of the Second
Berkeley Symposium on Mathematical Statistics and Probability (pp. 131-
148). Berkeley: University of California.
ROBBINS, H. (1955). An empirical Bayes approach to statistics. In J. NEY-
MAN (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical
Statistics and Probability, volume 1 (pp. 157-164). Berkeley: University of
California.
ROBBINS, H. (1964). The empirical Bayes approach to statistical decision prob-
lems. Annals of Mathematical Statistics, 35, 1-20.
ROBERT, C. P. (1993). A note on Jeffreys-Lindley paradox. Statistica Sinica, 3,
601-608.
ROBERTS, H. V. (1967). Informative stopping rules and inferences about popu-
lation size. Journal of the American Statistical Association, 62, 763-775.
ROUANET, H. and LECOUTRE, B. (1983). Specific inference in ANOVA: From
significance tests to Bayesian procedures. British Journal of Mathematical
and Statistical Psychology, 36, 252-268.
ROYDEN, H. L. (1968). Real Analysis. London: Macmillan.
RUBIN, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130-134.
RUDIN, W. (1964). Principles of Mathematical Analysis (2nd ed.). New York:
McGraw-Hill.
SAVAGE, L. J. (1954). The Foundations of Statistics. New York: Wiley.
SAVAGE, L. J. (1962). The Foundations of Statistical Inference. London:
Methuen.
SCHEFFE, H. (1947). A useful convergence theorem for probability distributions.
Annals of Mathematical Statistics, 18, 434-438.
SCHERVISH, M. J. (1983). User-oriented inference. Journal of the American
Statistical Association, 78, 611-615.
SCHERVISH, M. J. (1992). Bayesian analysis of linear models (with discussion).
In J. M. BERNARDO, J. O. BERGER, A. P. DAWID, and A. F. M. SMITH
(Eds.), Bayesian Statistics 4: Proceedings of the Second Valencia Interna-
tional Meeting (pp. 419--434). Oxford: Clarendon Press.
SCHERVISH, M. J. (1994). Discussion of "Bootstrap: More than a stab in the
dark?" by G. A. Young. Statistical Science, 9, 408-410.
SCHERVISH, M. J. (1996). P-values: What they are and what they are not.
American Statistician, 50, to appear.
SCHERVISH, M. J. and CARLIN, B. P. (1992). On the convergence of successive
substitution sampling. Journal of Computational and Graphical Statistics,
1, 111-127.
SCHERVISH, M. J. and SEIDENFELD, T. (1990). An approach to consensus and
certainty with increasing evidence. Journal of Statistical Planning and In-
ference, 25, 401-414.
SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1984). The extent
of non-conglomerability of finitely additive probabilities. Zeitschrift für
Wahrscheinlichkeitstheorie, 66, 205-226.
SCHERVISH, M. J., SEIDENFELD, T., and KADANE, J. B. (1990). State dependent
utilities. Journal of the American Statistical Association, 85, 840-847.
SCHWARTZ, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeit-
stheorie, 4, 10-26.
SEIDENFELD, T. and SCHERVISH, M. J. (1983). A conflict between finite addi-
tivity and avoiding Dutch Book. Philosophy of Science, 50, 398-412.
SEIDENFELD, T., SCHERVISH, M. J., and KADANE, J. B. (1995). A representation
of partially ordered preferences. Annals of Statistics, 23, 2168-2217.
SERFLING, R. J. (1980). Approximation Theorems of Mathematical Statistics.
New York: Wiley.
SETHURAMAN, J. (1994). A constructive definition of Dirichlet priors. Statistica
Sinica, 4, 639-650.
SINGH, K. (1981). On the asymptotic accuracy of Efron's bootstrap. Annals of
Statistics, 9, 1187-1195.
SMITH, A. F. M. (1973). A general Bayesian linear model. Journal of the Royal
Statistical Society, Ser. B, 35, 67-75.
SPJØTVOLL, E. (1983). Preference functions. In P. J. BICKEL, K. DOKSUM,
and J. L. HODGES, JR. (Eds.), A Festschrift for Erich L. Lehmann (pp.
409-432). Belmont, CA: Wadsworth.
STATSCI (1992). S-PLUS, Version 3.1 (software package). Seattle: StatSci Divi-
sion, MathSoft, Inc.
STEIN, C. M. (1946). A note on cumulative sums. Annals of Mathematical
Statistics, 17, 498-499.
STEIN, C. M. (1956). Inadmissibility of the usual estimator for the mean of
a multivariate normal distribution. In J. NEYMAN (Ed.), Proceedings of
the Third Berkeley Symposium on Mathematical Statistics and Probability,
volume 1 (pp. 197-206). Berkeley: University of California.
STEIN, C. M. (1965). Approximation of improper prior measures by prior proba-
bility measures. In J. NEYMAN and L. M. LECAM (Eds.), Bernoulli, Bayes,
Laplace: Anniversary Volume (pp. 217-240). New York: Springer-Verlag.
STEIN, C. M. (1981). Estimation of the mean of a multivariate normal distribu-
tion. Annals of Statistics, 9, 1135-1151.
STIGLER, S. M. (1986). The History of Statistics: The Measurement of Uncer-
tainty before 1900. Cambridge, MA: Belknap.
STONE, M. (1976). Strong inconsistency from uniform priors. Journal of the
American Statistical Association, 71, 114-125.
STONE, M. and DAWID, A. P. (1972). Un-Bayesian implications of improper
Bayes inference in routine statistical problems. Biometrika, 59, 369-375.
STRASSER, H. (1981). Consistency of maximum likelihood and Bayes estimates.
Annals of Statistics, 9, 1107-1113.
STRAWDERMAN, W. E. (1971). Proper Bayes minimax estimators of the multi-
variate normal mean. Annals of Mathematical Statistics, 42, 385-388.
TAYLOR, R. L., DAFFER, P. Z., and PATTERSON, R. F. (1985). Limit Theorems
for Sums of Exchangeable Random Variables. Totowa, NJ: Rowman and
Allanheld.
TIERNEY, L. (1994). Markov chains for exploring posterior distributions (with
discussion). Annals of Statistics, 22, 1701-1762.
TIERNEY, L. and KADANE, J. B. (1986). Accurate approximations for poste-
rior moments and marginal densities. Journal of the American Statistical
Association, 81, 82-86.
VENN, J. (1876). The Logic of Chance (2nd ed.). London: Macmillan.
VERDINELLI, I. and WASSERMAN, L. (1991). Bayesian analysis of outlier problems
using the Gibbs sampler. Statistics and Computing, 1, 105-117.
VON MISES, R. (1957). Probability, Statistics and Truth. London: Allen and
Unwin.
VON NEUMANN, J. and MORGENSTERN, O. (1947). Theory of Games and Eco-
nomic Behavior (2nd ed.). Princeton: Princeton University Press.
WALD, A. (1947). Sequential Analysis. New York: Wiley.
WALD, A. (1949). Note on the consistency of the maximum likelihood estimate.
Annals of Mathematical Statistics, 20, 595-601.
WALD, A. and WOLFOWITZ, J. (1948). Optimum character of the sequential
probability ratio test. Annals of Mathematical Statistics, 19, 326-339.
WALKER, A. M. (1969). On the asymptotic behaviour of posterior distributions.
Journal of the Royal Statistical Society (Series B), 31, 80-88.
WALLACE, D. L. (1959). Conditional confidence level properties. Annals of
Mathematical Statistics, 30, 864-876.
WELCH, B. L. (1939). On confidence limits and sufficiency, with particular ref-
erence to parameters of location. Annals of Mathematical Statistics, 10,
58-69.
WEST, M. (1984). Outlier models and prior distributions in Bayesian linear
regression. Journal of the Royal Statistical Society (Series B), 46, 431-439.
WILKS, S. S. (1941). Determination of sample sizes for setting tolerance limits.
Annals of Mathematical Statistics, 12, 91-96.
YOUNG, G. A. (1994). Bootstrap: More than a stab in the dark? (with discus-
sion). Statistical Science, 9, 382-415.
ZELLNER, A. (1971). An Introduction to Bayesian Inference in Econometrics.
New York: Wiley.
Notation and Abbreviation Index

0 (vector of 0s), 385
1 (vector of 1s), 345
△ (symmetric difference), 581
2^S (power set), 571
≪ (absolutely continuous), 574, 597
a.e. (almost everywhere), 582
ANCB(·,·,·) (distribution), 668
ANCχ²(·,·) (distribution), 668
ANCF(·,·,·) (distribution), 669
ANOVA (analysis of variance), 384
ARE (asymptotic relative efficiency), 413
a.s. (almost surely), 582
A_X (σ-field generated by X), 51, 82
ℵ (action space), 144
𝒜 (action space σ-field), 144
\ (remove one set from another), 577
B̄ (closure of set), 622
Ber(·) (distribution), 672
Beta(·,·) (distribution), 669
Bin(·,·) (distribution), 673
B^k (Borel σ-field), 576
B (Borel σ-field), 575
Cau(·,·) (distribution), 669
CDF (cumulative distribution function), 612
c′_g (constant related to RHM), 367
c_g (constant related to LHM), 367
χ²_a (distribution), 669
→^D (converges in distribution), 635
→^P (converges in probability), 638
→^w (converges weakly), 635
Cov_θ(·,·) (conditional covariance given Θ = θ), 19
Cov (covariance), 613
C_P (σ-field on set of probability measures), 27
A^C (complement of set), 575
Dir(·) (Dirichlet process), 54
Dir_k(·,…,·) (distribution), 674
dμ₂/dμ₁ (Radon-Nikodym derivative), 575, 598
E_θ(·) (conditional mean given Θ = θ), 19
Exp(·) (distribution), 670
E(·) (expected value), 607, 613
E(·|·) (conditional mean), 616
f⁺ (positive part), 588
f⁻ (negative part), 588
f_{X|Θ} (conditional density of X given Θ), 13
f_{X|Y} (conditional density), 13
F_{a,b} (distribution), 670
Γ⁻¹(·,·) (distribution), 670
Γ(·,·) (distribution), 670
Geo(·) (distribution), 673
HPD (highest posterior density), 327
Hyp(·,·,·) (distribution), 673
IID (independent and identically distributed), 611
I_k (identity matrix), 643
I_{X|T}(·;·) (conditional Kullback-Leibler information), 115
I_{X|T}(·|·) (conditional Fisher information), 111
I_X(·) (Fisher information), 111
I_X(·;·) (Kullback-Leibler information), 115
I_A(·) (indicator function), 9
Lap(·,·) (distribution), 670
λ_g (measure constructed from LHM), 367
LHM (left Haar measure), 363
LMP (locally most powerful), 245
LMVUE (locally minimum variance unbiased estimator), 300
LR (likelihood ratio), 274
MC (most cautious), 230
MLE (maximum likelihood estimator), 307
MLR (monotone likelihood ratio), 239
MP (most powerful), 230
MRE (minimum risk equivariant), 347
Mult_k(·,…,·) (distribution), 674
μ_{Θ|X}(·|·) (posterior distribution), 16
NCB(·,·,·) (distribution), 671
NCχ²(·,·) (distribution), 671
NCF(·,·,·) (distribution), 671
NCt_a(·) (distribution), 671
NCT_a(·|·) (CDF of NCt distribution), 671
Negbin(·,·) (distribution), 673
ν (dominating measure), 13
N(·,·) (distribution), 671
N_p(·,·) (distribution), 674
N̄ (integers plus ∞), 537
Ω (parameter space), 13, 82
o_P (stochastic small order), 396
O_P (stochastic large order), 396
o (small order), 394
O (large order), 394
𝒫₀ (parametric family), 50, 82
Par(·,·) (distribution), 672
∂B (boundary of set), 636
Φ(·) (CDF of normal distribution), 672
P_n (empirical probability measure), 12
Poi(·) (distribution), 673
Pr(·) (probability), 612
Pr(·|·) (conditional probability), 617
P_{θ,T}(·) (conditional distribution of T given Θ = θ), 84
P′_θ(·) (conditional probability given Θ = θ), 51, 83
P_θ(·) (conditional distribution given Θ = θ), 51, 83
P (random probability measure), 25
𝒫 (set of all probability measures), 27
Q_n(·|x) (conditional distribution given n observations), 539
ℝ (real numbers), 570
ℝ⁺ (positive reals), 571
ℝ⁺⁰ (nonnegative reals), 627
ρ_g (measure constructed from RHM), 367
RHM (right Haar measure), 363
r(η, δ) (Bayes risk), 149
R(θ, δ) (risk function), 149
(S, 𝒜, μ) (measure space), 577
SPRT (sequential probability ratio test), 549
SSS (successive substitution sampling), 507
τ (parameter space σ-field), 13
Θ′ (parametric index), 50
Θ (parameter), 51
T (statistic), 84
T_a(·) (CDF of t distribution), 672
t_a(·,·) (distribution), 672
v^T (transpose of vector), 614
UMA (uniformly most accurate), 317
UMAU (uniformly most accurate unbiased), 321
UMC (uniformly most cautious), 230
UMCU (uniformly most cautious unbiased), 254
UMP (uniformly most powerful), 230
UMPU (uniformly most powerful unbiased), 254
UMPUAI (uniformly most powerful unbiased almost invariant), 384
UMVUE (uniformly minimum variance unbiased estimator), 297
USC (upper semicontinuous), 417
U(·,·) (distribution), 672
Var_θ(·) (conditional variance given Θ = θ), 19
Var (variance), 613
x (element of sample space), 82
𝒳 (sample space), 13, 82
Name Index

Ahlfors, L., 667, 675 Chaloner, K., 24, 677


Aitchison, J., 325, 675 Chambers, J., x, 676
Albert, J., 519, 675 Chang, T., 379, 677
Aldous, D., 46, 79, 482, 675 Chapman, D., 303, 677
Anderson, T., 386, 675 Chen, C., 435, 677
Andrews, C., ix Chib, S., 519, 675
Anscombe, F., 181, 675 Chow, Y., 647, 677
Antoniak, C., 59, 675 Church, T., 24, 677
Aumann, R., 181, 675 Churchill, R., 666-667, 677
Clarke, B., 446, 677
Bahadur, R., 94, 675 Cornfield, J., 563, 565, 678
Barnard, G., 320, 420, 675 Cox, D., 21, 218, 424, 521, 677-678
Barndorff-Nielsen, 0., 307, 675 Cramer, H., 301, 678
Barnett, V., vii, 675
Barron, A., 434-435, 446, 675, 677 Daffer, P., 33, 688
Basu, D., 99-100, 675 David, H., 404, 678
Bayes, T., 16, 29, 676 Dawid, A., 21, 125, 435, 521, 678,
Becker, R., x, 676 DeGroot, M., ix, 91, 98, 181, 362,
Berberian, S., 507, 667, 676 DeFinetti, B., ix, 6, 21, 25, 28, 654,
Berger, J., 22, 173, 284, 525, 565, 656-657, 678
614, 666, 676 DeGroot, M., ix, 91, 98, 181, 362,
Berger, R., 283, 677 536, 654, 678
Berk, R., 417, 430, 432, 676 DeMoivre, A., 8, 678
Berkson, J., 218, 281, 676 Diaconis, P., ix, 15, 28, 41, 46, 108,
Berry, D., 565, 676 123, 126, 426, 434, 479-480,
Berti, P., 21, 676 667, 678-680
Bhattacharyya, A., 305 Dickey, J., 24, 679, 681-682
Bickel, P., 330-331, 676 Doob, J., 36, 429, 507, 645-646, 679
Billingsley, P., 46, 621, 636, 648, 676 Doytchinov, B., ix
Bishop, Y., 462, 676 Dubins, L., 70, 455, 676, 679
Blackwell, D., 56, 86, 152, 455, 676 Dugundji, J., 666, 679
Blyth, C., 158, 676 Dunford, N., 507, 635, 667, 679
Bohrer, R., x Dunsmore, I., 325, 675
Bondar, J., 236, 676
Bortkiewicz, L., 462, 677 Eberhardt, K., 326, 679
Box, G., 21, 521, 677 Eddy, W., 521,681
Breiman, L., 618, 640, 677 Edwards, W., 222, 284, 679
Brenner, D., 435, 677 Efron, B., 166,330-331,335-336,423,
Brown, L., 99, 160, 167, 677 679
Buck, C., 665, 677 Escobar, M., 60, 679
Buehler, R, 99, 677
Fabius, J., 61, 679
Carlin, B., 507, 686 Fedderson, A., 99, 677
Casella, G., 283, 677
Ferguson, T., 52, 56, 61, 173, 179, Johnstone, I., 435, 682
181, 248, 258, 614, 666, 680
Ferrandiz, J., 669, 680 Kadane, J., 21, 24, 183-184, 446, 564,
Fieller, E., 321, 680 655, 682-683, 687-688
Fienberg, S., 462, 676 Kagan, A., 349, 683
Fishburn, P., 181, 680 Kahneman, D., 23, 683
Fisher, R., 89, 96, 217-218, 307, 370, Kass, R., ix, 226, 446, 505, 683
373, 522, 680 Kerridge, D., 564, 683
Fraser, D., 435, 677, 680 Kiefer, J., 417, 420, 683
Freedman, D., 15, 28, 40-41, 46, 61, Kinderman, A., 660, 683
70, 123, 126, 330-331, 426, Kingman, J., 36, 683
434, 479-480, 667, 676, 678- Knuth, D., x, 683
680 Kraft, C., 66, 683
Freedman, L., 24, 680 Krasker, W., 56, 683
Freeman, P., 524, 681 Krem, A., 408, 683
Kshirsagar, A., 386, 684
Gabriel, K., 252, 681 Kullback, S., 116, 684
Garthwaite, P., 24, 681
Gavasakar, U., 24, 681 Lamport, L., x, 684
Geisser, S., 521, 668, 681 Lane, D., 9, 682
Gelfand, A., 507, 681 Lauritzen, S., 28, 123, 481, 684
Geman, D., 507, 681 Lavine, M., 69, 526, 684
Geman, S., 507, 681 LeCam, L., 414, 437, 684
Ghosh, J., 381-382, 681 Lecoutre, B., 668-669, 684, 686
Gnanadesikan, R., 22, 681 Lehmann, E., 231,280, 285,298,350,
Good, I., 565, 681 684
Lévy, P., 648, 650
Hadjicostas, P., ix Lindley, D., 6, 229, 284, 479, 684
Hall, P., 337-338, 681 Lindman, H., 222, 284, 679
Hall, W., 381-382, 681 Linnik, Y., 349, 683
Halmos, P., 364, 600, 681 Loève, M., 34, 653, 685
Hampel, F., 315, 681 Louis, T., 24, 677
Hartigan, J., 20-21, 33, 681
Matts, J., 24, 677
Heath, D., 21, 46, 681-682
Mauldin, R., 66, 69, 685
Hewitt, E., 46, 682
McDunnough, P., 435, 677, 680
Heyde, C., 435, 682
Mee, R., 326, 679
Hill, B., 9, 484, 682
Mendel, G., 217, 685
Hinkley, D., 218, 423, 678-679
Metivier, M., 66, 685
Hodges, J., 414
Monahan, J., 660, 683
Hoel, P., 640, 682
Morgenstern, O., 181-182, 688
Hogarth, R., 24, 682 Morris, C., 166, 500, 679, 685
Holland, P., 462, 676
Huber, P., 310, 315, 428, 682 Nachbin, L., 364, 685
Hwang, J., 160, 677 Neyman, J., 89, 175, 231, 247, 420,
685
James, W., 163, 682 Nobile, A., ix
Jaynes, E., 379, 682 Novick, M., ix, 6, 684
Jeffreys, H., 122, 229, 284, 682
Jiang, T., ix Oue, S., ix
Patterson, R., 33, 688 Smith, A., 479, 507, 681, 684, 687
Pearson, E., 175, 231, 247, 685 Smith, W., 24, 682
Pearson, K., 216, 685 Spiegelhalter, D., 24, 680
Perlman, M., 430, 685 Spjøtvoll, E., 283, 687
Peters, S., 24, 682 Stahel, W., 315, 681
Phillips, L., 6, 684 Steffey, D., 505, 683
Pierce, D., 99, 685 Stein, C., 163, 379, 382, 568, 682, 687
Pitman, E., 347, 685 Stigler, S., 8, 687
Port, S., 640, 682 Stone, C., 640, 682
Portnoy, S., x Stone, M., 21, 678, 687
Pratt, J., 56, 98, 683, 685 Strasser, H., 430, 688
Strawderman, W., ix, 166,688
Raftery, A., 226, 683 Sudderth, W., 9, 21, 46, 66, 69,
Ramamoorthi, R., 86, 676 681-682, 685
Rao, C., 152, 301, 349, 683, 685-686
Reeve, C., 326, 679 Taylor, R., 33, 688
Regazzini, E., 21, 676 Tiao, G., 521, 677
Rigo, P., 21, 676 Tibshirani, R., 336, 679
Robbins, H., 303, 647, 677, 686 Tierney, L., 225, 446, 507, 683, 688
Robert, C., 225, 686 Tversky, A., 23, 683
Roberts, H., 565,686
Ronchetti, E., 315, 681 Venn, J., 8, 688
Rouanet, H., 668-669, 684, 686 Verdinelli, I., 524, 688
Rousseeuw, P., 315, 681 Villegas, C., 379, 677
Royden, H., 578, 589, 597, 621, 686 Von Mises, R., 10, 688
Rubin, D., 332, 686 Von Neumann, J., 181-182, 688
Rudin, W., 666, 686
Wald, A., 415, 549, 552, 557,688
Savage, L., 46, 181, 222, 284, 565, Walker, A., 435, 442, 688
600, 679, 681-682, 686 Wallace, D., 99, 688
Scheffe, H., 298, 634, 684, 686 Wasserman, L., ix, 524, 526, 684, 688
Schervish, M., v Welch, B., 320, 688
Schwartz, J., 507, 635, 667, 679 West, M., 524, 688
Schwartz, L., 429, 687 Wijsman, R., x, 381-382, 681
Scott, E., 420, 685 Wilks, A., x, 676
Seidenfeld, T., ix, 21, 183-184, 187, Wilks, S., 325, 688
429, 564, 655, 682-683, Williams, S., 66, 69, 685
686-687 Winkler, R., 24, 682
Sellke, T., 284, 676 Wolfowitz, J., 417, 420, 557, 683, 688
Serfling, R., 413, 687 Wolpert, R., 526, 684
Sethuraman, J., 56, 687
Short, T., ix Ylvisaker, D., 108, 679
Shurlow, N., v Young, G., 329, 688
Siegmund, D., 647, 677
Singh, K., 331, 687 Zellner, A., 16, 688
Slovic, P., 23, 683 Zidek, J., 21, 678
Subject Index*

Abelian group, 353 Basu's theorem, 99


Absolutely continuous, 574, 597, 668 Bayes factor, 221, 238, 262-263, 274
Absolutely continuous function, 211 Bayes risk, 149
Accept hypothesis, 214 Bayes rule, 150, 154-155, 167-168,
Acceptance-rejection, 659 178
Action space, 144 extended, 169
Admissible, 154-157, 162, 167, 174 formal, 146, 150, 157, 348, 351,
λ, 154-156, 162 369
Almost everywhere, 572, 582 generalized, 156-157
Almost invariant function, 383 partial, 147, 150
Almost surely, 572, 582 Bayes' theorem, 4, 16
Alternate noncentral beta Bayesian bootstrap, 332
distribution, 668 Bernoulli distribution, 672
Alternate noncentral χ² distribution, Beta distribution, 54, 669
668 Bhattacharyya lower bounds, 305
Alternate noncentral F distribution, Bias, 296
669 Bimeasurable function, 572, 583, 618
Alternative, 2, 214 Binomial distribution, 673
composite, 215 Bolzano-Weierstrass theorem, 666
simple, 215, 233 Bootstrap, 329
Analysis of variance, 384, 491 Bayesian, 332
Analytic function, 105 nonparametric, 329
Ancillary statistic, 95, 99, 119 parametric, 330
maximal, 97 Borel σ-field, 571, 575
ANOVA, 384, 491 Borel space, 609, 618
Archimedean condition, 192 Borel-Cantelli lemma:
ARE, 413 first, 578
Asymptotic distribution, 399 second, 663
Asymptotic efficiency, 413 Boundary, 636
Asymptotic relative efficiency, 413 Boundedly complete statistic, 94, 99
Asymptotic variance, 402 Box-Cox transformations, 521
Autoregression, 141
Called-off preference, 184
Autoregressive process, 441
Caratheodory extension theorem, 578
Axioms of decision theory, 183-184,
Cauchy distribution, 669
296
Cauchy sequence, 619
Backward induction, 537 Cauchy's equation, 667
Bahadur's theorem, 94 Cauchy-Schwarz inequality, 615
Base measure, 54 CDF,612
empirical, 404-405, 408
Base of test, 215--216
Central limit theorem, 642
multivariate, 643
Chain rule, 600
Chapman-Robbins lower bound, 304
*Italicized page numbers indicate where a term is defined.
Characteristic function, 611, 639 Conservative prediction set, 324
Chi-squared distribution, 669 Conservative tolerance set, 325
Chi-squared test of independence, 467 Consistent, 397, 412
Closed set, 622 Consistent conditional preference, 186
Closure, 622 Consistent distributions, 652
Coherent tests, 252 Contingency table, 467
Complete class, 174 Continuity axiom, 184
essentially, 174, 244, 251, 256 Continuity theorem, 640
minimal, 174 Continuous distribution, 612
minimal, 174-175 Continuous mapping theorem, 638
Complete class theorem, 179 Convergence:
Complete measure space, 579, 603 pointwise, 184
Complete metric space, 619 weak, 399, 635
Complete statistic, 94, 298 Convergence in distribution, 399, 611,
boundedly, 94, 99 635
Composite alternative, 215 Convergence in probability, 396, 611,
Composite hypothesis, 215 638
Conditional distribution, 13, 16, 607, Convex function, 614
609,617 Counting measure, 570
regular, 610, 618 Covariance, 607, 613
version, 617 Cramer-Rao lower bound, 301
Conditional expectation, 19,607, 616 multiparameter, 306
version, 608, 616 Credible set, 327
Conditional Fisher information, 111, Cumulative distribution function (see
119 CDF),612
Conditional independence, 9, 610, 628 Cylinder set, 652
Conditional Kullback-Leibler
information, 115, 119 Data, 82
Conditional mean, 607, 616 Decide optimally after stopping, 540
version, 616 Decision rule, 145
Conditional preference, 185 maximum, 541
consistent, 186 nonrandomized, 145, 151, 153
Conditional probability, 607, 609, 617 nonrandomized sequential, 537
regular, 609, 617 randomized, 145, 151
Conditional score function, 111 randomized sequential, 537
Conditionally sufficient statistic, 95 regular, 54{}-541
Confidence coefficient, 315, 325 sequential, 537
Confidence interval, 3 nonrandomized, 537
fixed-width, 559 randomized, 537
sequential, 559 terminal, 537
Confidence sequence, 569 truncated, 542
Confidence set, 279, 315, 379 Decision theory, 144, 181
conservative, 315 axioms, 183-184
exact, 315 Decreasing sequence of sets, 577
randomized, 316 DeFinetti's representation theorem,
UMA,317 28
UMAU, 321 Degenerate exponential family, 104
Conjugate prior, 92 Degenerate weak order, 183
Conservative confidence set, 315 Delta method, 401, 464, 466
Dense, 619 improper, 20
Density, 607, 613 t, 672
Dirichlet distribution, 52, 54, 674 uniform, 659, 672
Dirichlet process, 52, 54, 332, 434 Distribution function (see CDF), 612
Discrete distribution, 612 Dominance axiom, 185
Distribution: Dominated convergence theorem, 591
alternate noncentral beta, 668 Dominates, 154
alternate noncentral χ², 668 Dominating measure, 574, 597
alternate noncentral F, 669 Dutch book, 656
asymptotic, 399
Bernoulli, 672 Efficiency:
beta, 54, 669 asymptotic, 413
binomial, 673 asymptotic relative, 413
Cauchy, 669 second-order, 414
chi-squared, 669 Elicitation of probabilities, 22-23
conditional, 13, 16 Empirical Bayes, 166, 420, 500
consistent, 652 Empirical CDF, 404-405, 408
continuous, 612 Empirical distribution, 12, 38
Dirichlet, 52, 54, 674 Empirical probability measure, 12
discrete, 612 ε-contamination class, 524, 526, 528
empirical, 12, 38 Equal-tailed test, 263
exponential, 670 Equivalence class, 140
F, 670 Equivalence relation, 140
fiducial, 370, 373 Equivariant rule, 357
gamma, 670 location, 346-347, 351
geometric, 673 minimum risk (see MRE), 347
half-normal, 389 scale, 350
hypergeometric, 673 Essentially complete class, 174, 244,
inverse gamma, 670 251, 256
Laplace, 670 minimal, 174
least favorable, 168 Estimator, 3, 296
marginal, 14 maximum likelihood, 3, 307
multinomial, 674 MRE, 347, 351, 363
multivariate normal, 643, 674 Pitman, 347, 363
negative binomial, 673 point, 3, 296
noncentral beta, 289, 671 unbiased, 3, 296
noncentral X2 , 671 Event, 606, 612
noncentral F, 289, 671 Exact confidence set, 315
noncentral t, 289, 325, 671 Exact prediction set, 324
normal, 21, 349, 611, 640, 642, Exchangeable, 7, 27-28
671 partially, 125, 479
multivariate, 643, 674 row and column, 482
Pareto, 672 Expectation, 607, 613
Poisson, 673 conditional, 19, 616
posterior, 16 Expected Fisher information, 423
predictive, 14 Expected loss principle, 146, 181
posterior, 18 Expected value (see Expectation),
prior, 14 613
prior, 13 Exponential distribution, 670
Exponential family, 102-103, 105, location, 354
109, 155, 239, 249 location-scale, 354, 357, 368
degenerate, 104 permutation, 355
nondegenerate, 104 scale, 354
Extended Bayes rule, 169
Extended real numbers, 571 Haar measure:
Extremal family, 123, 125 left, 363
related, 366
F distribution, 670 right, 363
Fatou's lemma, 589 related, 366
FI regularity conditions, 111 Hahn decomposition theorem, 605
Fiducial distribution, 370, 373 Half-normal distribution, 389
Field, 571, 575 Hierarchical model, 166, 476
Finite population sampling, 74 Highest posterior density region (see
Finitely additive probability, 21, 281, HPD),327
564, 657 Hilbert space, 507
Fisher information, 111, 113, 301, Hilbert-Schmidt-type operator, 507,
412, 463 667
conditional, 111, 119 Horse lottery, 182
expected, 423 Hotelling's T2, 388
observed, 226, 424, 435 HPD region, 327, 329, 343
Fisher-Neyman factorization Hypergeometric distribution, 673
theorem, 89 Hyperparameters, 477
Fixed point, 505 Hypothesis, 2, 214
Fixed-point problem, 505 composite, 215
Fixed-width confidence interval, 559 one-sided, 241
Floor of test, 215-216 simple, 215, 233
Formal Bayes rule, 146, 150, 157,348, Hypothesis test, 2
351, 369 predictive, 219, 325
Fubini's theorem, 596 randomized, 3
Function: Hypothesis-testing loss, 214
absolutely continuous, 211
bimeasurable, 583 Identity element of group, 353
measurable, 572, 583 Ignorable statistic, 142
simple, 586 lID, 2, 8, 611, 628
conditionally, 9-10, 83,611, 628
Gamma distribution, 670 Image sigma field, 584
General linear group, 354 Importance sampling, 403, 661
Generalized Bayes rule, 156-157, 159 Improper prior, 20, 122, 223, 263
Generalized Neyman-Pearson lemma, Inadmissible, 154
247 Increasing sequence of sets, 577
Generated σ-field, 571-572, 584 Independence, 610, 628
Geometric distribution, 673 conditional, 9, 610, 628
Gibbs sampling, 507 Indifferent, 183
Goodness of fit test, 218, 461 Induced measure, 575, 601
Gross error sensitivity, 312 Infinitely often, 578, 663
Group, 353, 355-356 Influence function, 311
abelian, 353 Information:
general linear, 354 Fisher, 111, 113, 463
Kullback-Leibler, 115-116 LHM, 363
Integrable, 588 Likelihood function, 2, 13, 307
uniformly, 592 Likelihood ratio test (see LR test),
Integral, 573, 587-588 274
over a set, 588 Linear regression, 276, 321
Invariance of distributions, 355 LMP test, 245, 265, 289
Invariant function, 357 LMPU test, 265, 292
almost, 383 LMVUE, 300
location, 346 Locally minimum variance unbiased
maximal, 358 estimator, 300
scale, 350 Locally most powerful test (see
Invariant loss, 356 LMP), 245
location, 346 Location equivariant rule, 346
scale, 350-351 Location estimation, 346
Invariant measure, 363 Location group, 354
Inverse function theorem, 666 Location invariant function, 346
Inverse gamma distribution, 670 Location invariant loss, 346
Inverse of group element, 353 Location parameter, 344
Location-scale group, 354
Jacobian, 625 Location-scale parameter, 345
James-Stein estimator, 163, 486 Look-ahead decision rule, 546
Jeffreys' prior, 122, 446 Loss function, 144, 162, 189, 296
Jensen's inequality, 614 convex, 349
hypothesis-testing, 214
Kolmogorov zero-one law, 631 squared-error, 146, 297
Kullback-Leibler divergence, 116 0-1, 215
Kullback-Leibler information, 0-1-c, 215, 218
115-116 Lower boundary, 170, 179, 233-235,
conditional, 115, 119 287
LR test, 223, 273-274, 458-459
Lévy's theorem, 648, 650
λ-admissible, 154-156, 162 Marginal distribution, 14, 607
Laplace approximation, 226, 446 Marginalization paradox, 21
Laplace distribution, 670 Markov chain, 15, 507, 650
Large order, 394 Markov chain Monte Carlo, 507
stochastic, 396 Markov inequality, 614
Law of large numbers: Martingale, 645-646
strong, 34-36 reversed, 33, 649
weak, 642 Martingale convergence theorem,
Law of the unconscious statistician, 648-649
607, 613 Maximal ancillary, 97
Law of total probability, 632 Maximal invariant, 358
Least favorable distribution, 168 Maximin strategy, 168
Lebesgue measure, 571, 580 Maximin value, 168
Left Haar measure, 363 Maximum likelihood estimator, 3,
related, 366 307, 415, 418-421
Lehmann-Scheffe theorem, 298 Maximum modulus theorem, 667
L-estimator, 410 Maximum of decision rules, 541
Level of test, 215-216 MC test, 230
Mean, 607, 613 Neyman-Pearson fundamental
conditional, 616 lemma, 175, 231
trimmed, 314 NM-lottery, 182
Measurable function, 572, 583 Noncentral beta distribution, 289,
Measure, 570, 572, 575, 577 671
induced, 601 Noncentral χ² distribution, 671
Lebesgue, 571, 580 Noncentral F distribution, 289, 671
product, 595 Noncentral t distribution, 289, 325,
σ-finite, 572, 578, 601 Noncentral t distribution, 289, 325,
signed, 577, 597 Nondegenerate exponential family,
Measure space, 572, 577 104
M-estimator, 313-315, 424-428, 434 Nondegenerate weak order, 183
Method of Laplace, 226, 446 Nonnull states, 184
Method of moments, 340 Nonparametric, 52
Mill's ratio, 470 Nonparametric bootstrap, 329
Minimal complete class, 174-175 Nonrandomized decision rule, 145,
Minimal essentially complete class, 151, 153
174 Nonrandomized sequential decision
Minimal sufficient statistic, 92 rule, 537
Minimax principle, 167, 189 Normal distribution, 21, 349, 611,
Minimax rule, 167-169 640, 642, 671
Minimax theorem, 172 multivariate, 643, 674
Minimax value, 168 Null states, 184
Minimum risk equivariant (see MRE),
347 Observed Fisher information, 226,
MLE, 3, 307, 415, 418-421 424,435
MLR, 239-244 One-sided hypothesis, 241
Monotone convergence theorem, 590 One-sided test, 239, 243
Monotone likelihood ratio, 239-244 Open set, 571
Monotone sequence of sets, 577 Operating characteristic, 215
Most cautious test, 230 Orbit, 358
Most powerful test, 230 Order statistics, 86
MP test, 230 Outliers, 521
MRE, 347, 349, 351, 363
Multinomial distribution, 674 Parameter, 1, 6, 50-51, 82
Multiparameter Cramér-Rao lower
bound, 306 location-scale, 345
Multivariate central limit theorem, natural, 103, 105
643 scale, 345
Multivariate normal distribution, 643, Parameter space, 1, 50, 82
674 natural, 103, 105
Parametric bootstrap, 330
Natural parameter, 103, 105 Parametric family, 1, 50, 102
Natural parameter space, 103, 105 Parametric index, 33, 50
Natural sufficient statistic, 103 Parametric models, 12
Negative binomial distribution, 673 Parametric Models, 49
Negative part, 573, 588 Pareto distribution, 672
Negative set, 598 Partial Bayes rule, 147, 150
Neyman structure, 266 Partially exchangeable, 125,479
Percentile-t bootstrap confidence Pure significance test, 217
interval, 336 P-value, 279, 375, 380
Permutations, 355
π-λ theorem, 576 Quantile:
Pitman's estimator, 347, 363 sample, 404-405, 408
Pivotal, 316, 370, 373
Point estimation, 296 Radon-Nikodym derivative, 575, 598
Point estimator, 296 Radon-Nikodym theorem, 597
Pointwise convergence, 184 Random probability measure, 27
Poisson distribution, 673 Random quantity, 82, 606, 612
Polish space, 619 Random variables, 606, 612
Pólya tree distribution, 69 exchangeable, 27
Pólya urn scheme, 9 IID, 8
Portmanteau theorem, 636 Randomized confidence set, 316
Positive part, 573, 588 Randomized decision rule, 145, 151
Positive set, 598 Randomized sequential decision rule,
Posterior distribution, 4, 16 537
asymptotic normality, 435, 437, Randomized test, 3
442-443 Rao-Blackwell theorem, 152
consistency, 429-430 Ratio of uniforms, 660
Posterior predictive distribution, 18 Regression, 276, 321, 519
Posterior risk, 146, 150 Regular conditional distribution, 610,
Power function, 2, 215, 240 618
Power set, 571 Regular conditional probabilities,
Prediction set, 324-325 609,617
conservative, 324 Regular decision rule, 540
exact, 324 Reject hypothesis, 214
Predictive distribution, 14, 455 Rejection region, 2
posterior, 18 Related LHM, 366
prior, 14 Related RHM, 366
Predictive hypothesis test, 219, 325 Relative rate of convergence, 413, 470
Preference, 182 Restriction of σ-field, 584
conditional, 185 Reversed martingale, 649
consistent, 186 RHM, 363
Prevision, 655 Right Haar measure, 363
Prior distribution, 4, 13 related, 366
improper, 20, 223, 263 Risk function, 149-150, 153, 155, 167,
natural conjugate family, 92 216, 233, 297-298
Prize, 181 Risk set, 170-172, 179,233,235,287
Probability, 572, 577 Robustness, 310
empirical, 12 Bayesian, 524
Row and column exchangeable, 482
random, 27
Probability integral transform, 519,
Sample quantile, 404-405, 408
659
Sample space, 2, 82
Probability space, 572, 577, 606, 612
Scale equivariant rule, 350
Product measure, 595
Scale estimation, 350
Product σ-field, 576
Scale group, 354
Product space, 576 Scale invariant function, 350
Pseudorandom numbers, 659
Scale invariant loss, 350-351 Stopping time, 537, 548, 552, 554
Scale parameter, 345 Strict preference, 183
Scheffé's theorem, 634 Strong law of large numbers, 34-36
Score function, 111, 122, 302, 305 Strongly unimodal, 329
conditional, 111 Submartingale, 646
Second-order efficiency, 414 Successive substitution, 505-506, 545
Sensitivity analysis, 524 Successive substitution sampling, 507
Separable space, 619 Sufficient statistic, 84-86, 99, 103,
Separating hyperplane theorem, 666 109, 150-151, 298
Sequential decision rule, 537 conditionally, 95
Sequential probability ratio test, 549 minimal,92
Sequential test, 548 natural, 103
Set estimation, 296 Superefficiency, 414
Shrinkage estimator, 163 Supporting hyperplane theorem, 666
σ-field, 575 Sure-thing principle, 184
Borel, 571, 575
generated, 571-572, 584 t distribution, 672
image, 584 Tail σ-field, 632
restriction, 584 Tailfree process, 60
tail, 632 Taylor's theorem, 665
σ-finite measure, 572, 578, 601 Tchebychev's inequality, 614
Signed measure, 577, 597, 605, 635 Terminal decision rule, 537
Significance probability, 217, 228, 280 Test:
Significance test, 217 goodness of fit, 218, 461
Simple alternative, 215 one-sided, 239, 243
Simple function, 586 two-sided, 256, 273
Simple hypothesis, 215 Test function, 175, 215
Size of test, 2, 215-216 Theorem:
Small order, 394 Bahadur, 94
stochastic, 396 Basu, 99
SPRT, 549 Bayes, 4, 16
Squared-error loss, 146, 297 Bhattacharyya lower bounds,
√n-consistent, 401 305
SSS, 507 Bolzano-Weierstrass, 666
St. Petersburg paradox, 655 Carathéodory extension, 578
State independence, 184, 205 Cauchy's equation, 667
State-dependent utility, 205-206 central limit, 642
States of Nature, 181, 189, 205 multivariate, 643
Statistic, 83 chain rule, 600
ancillary, 95, 99, 119 Chapman-Robbins bound, 304
boundedly complete, 94 complete class, 179
complete, 94, 298 continuity, 640
sufficient, 84-86, 99, 103, Cramér-Rao lower bound, 301
150-151, 298 DeFinetti, 27-28
Stein estimator (see James-Stein DeFinetti, 27-28
estimator), 163 dominated convergence, 591
Stochastic large order, 396 Fatou's lemma, 589
Stochastic small order, 396 Fisher-Neyman, 89
Stone-Weierstrass theorem, 666 Fubini,596
Hahn decomposition, 605 UMP test, 230, 240, 243-244, 255,
inverse function, 666 257
Kolmogorov zero-one law, 631 UMPU test, 254-256
Lévy, 648, 650 UMPUAI test, 384
law of total probability, 632 UMVUE, 297-299
Lehmann-Scheffé, 298 Unbiased estimator, 3, 296-302
martingale convergence, 648-649 Unbiased test, 254
maximum modulus, 667 Uniform distribution, 659, 672
minimax, 172 Uniformly integrable, 592
monotone convergence, 590 Uniformly minimum variance
multivariate central limit, 643 unbiased estimator (see
Neyman-Pearson, 175, 231 Uniformly most accurate confidence
generalized, 247 Uniformly most accurate confidence
π-λ, 576 set (see UMA), 317
portmanteau, 636 Uniformly most accurate unbiased
Radon-Nikodym, 597 confidence set (See UMAU),
Rao-Blackwell, 152 321
Scheffé, 634 Uniformly most cautious test (see
separating hyperplane, 666 UMC), 230
Stone-Weierstrass, 666 Uniformly most cautious unbiased
strong law of large numbers, 36 test (see UMCU), 254
supporting hyperplane, 666 Uniformly most powerful test (see
Taylor, 665 UMP), 230
Tonelli, 595 Uniformly most powerful unbiased
uniqueness, 645 test (see UMPU), 254
upcrossing, 647 Uniqueness theorem, 645
weak law of large numbers, 642 Upcrossing lemma, 647
Tolerance coefficient, 325 Upper semicontinuous, 417
Tolerance set, 219, 325 USC, 417
conservative, 325 Utility function, 181, 188
Tonelli's theorem, 595 state-dependent, 205-206
Topological space, 571, 575
Topology, 571 Variance, 607, 613
Transformation, 354 Variance components, 484
Transition kernel, 124 Variance stabilizing transformation,
Trimmed mean, 314 402
Trivial σ-field, 571 Version of conditional distribution,
Truncated decision rule, 542 617
Two-sided alternative, 246 Version of conditional expectation,
Two-sided hypothesis, 246 608, 616
Two-sided test, 256, 273 Version of conditional mean, 616
Type I error, 214
Type II error, 214 Wald's lemma, 552
Weak convergence, 399, 635
UMA confidence set, 317 Weak convergence, 635
UMAU confidence set, 321 Weak law of large numbers, 642, 664
UMC test, 230-231, 239, 244, 255, Weak order, 183, 216-217, 280
257 degenerate, 183
UMCU test, 254-256 nondegenerate, 183
Weak preference, 182
Springer Series in Statistics
(continued from p. ii)

Pollard: Convergence of Stochastic Processes.
Pratt/Gibbons: Concepts of Nonparametric Theory.
Read/Cressie: Goodness-of-Fit Statistics for Discrete Multivariate Data.
Reinsel: Elements of Multivariate Time Series Analysis.
Reiss: A Course on Point Processes.
Reiss: Approximate Distributions of Order Statistics: With Applications
to Non-parametric Statistics.
Rieder: Robust Asymptotic Statistics.
Rosenbaum: Observational Studies.
Ross: Nonlinear Estimation.
Sachs: Applied Statistics: A Handbook of Techniques, 2nd edition.
Särndal/Swensson/Wretman: Model Assisted Survey Sampling.
Schervish: Theory of Statistics.
Seneta: Non-Negative Matrices and Markov Chains, 2nd edition.
Shao/Tu: The Jackknife and Bootstrap.
Siegmund: Sequential Analysis: Tests and Confidence Intervals.
Simonoff: Smoothing Methods in Statistics.
Small: The Statistical Theory of Shape.
Tanner: Tools for Statistical Inference: Methods for the Exploration of Posterior
Distributions and Likelihood Functions, 3rd edition.
Tong: The Multivariate Normal Distribution.
van der Vaart/Wellner: Weak Convergence and Empirical Processes: With
Applications to Statistics.
Vapnik: Estimation of Dependences Based on Empirical Data.
Weerahandi: Exact Statistical Methods for Data Analysis.
West/Harrison: Bayesian Forecasting and Dynamic Models.
Wolter: Introduction to Variance Estimation.
Yaglom: Correlation Theory of Stationary and Related Random Functions I:
Basic Results.
Yaglom: Correlation Theory of Stationary and Related Random Functions II:
Supplementary Notes and References.