Bayes
Objectives of Course
Key References:
Gelman, A. et al. Bayesian Data Analysis. 2nd Ed. (2004). Chapman & Hall.
Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. 2nd Ed. (2004; 1st Ed. 1999). Springer.
Gilks, W. R. et al., eds. Markov Chain Monte Carlo in Practice. (1996). Chapman & Hall.
Acknowledgements:
Nicky Best for WinBUGS help and examples.
Intro
Uncertainty propagation
Quantify all aspects of uncertainty via probability.
Probability is the central tool.
Axiomatic
Bayesian statistics has an axiomatic foundation, from which all
procedures then follow.
It is prescriptive in telling you how to act coherently.
If you don't adopt a Bayesian approach, you must ask yourself which of the
axioms you are prepared to discard. See disadvantages.
Unified framework.
Random effects, Hierarchical models, Missing variables, Nested
and Non-nested models. All handled in the same framework.
Intuitive
For a model parameterised by θ, what is the interpretation of a probability statement about θ?
Inference is subjective
You and I, on observing the same data, will be led to different
conclusions
This seems to fly in the face of scientific reasoning
Overview
1.1
1.2 Bayes Theorem
p(y₁, . . . , yₙ)
is invariant to permutations of the indices. This is a key assumption (exchangeability).
In BDA almost all inference is made using probability statements
regarding θ|y, or ỹ|y for future unobserved ỹ.
p(y, θ) = p(y|θ)p(θ)
where p(y|θ) is the sampling distribution and p(θ) is your prior.
The prior p(θ) elicits a model space and a distn on that model
space via a distn on parameters.
p(θ|y) = p(y, θ)/p(y)
       = p(y|θ)p(θ)/p(y)
BAYES THEOREM
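As a sanity check, Bayes theorem can be applied numerically. A minimal sketch with a hypothetical two-point parameter space; all numbers are invented for illustration:

```python
# Hypothetical two-point example of Bayes' theorem; all numbers are
# invented for illustration.
prior = {"theta0": 0.5, "theta1": 0.5}        # p(theta)
likelihood = {"theta0": 0.2, "theta1": 0.8}   # p(y | theta)

# p(y) = sum over theta of p(y | theta) p(theta)
p_y = sum(likelihood[t] * prior[t] for t in prior)

# p(theta | y) = p(y | theta) p(theta) / p(y)
posterior = {t: likelihood[t] * prior[t] / p_y for t in prior}
print(posterior)  # {'theta0': 0.2, 'theta1': 0.8}
```

Note that p(y) acts purely as a normalising constant here.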
1.3
p(y) = ∫ p(y, θ) dθ = ∫ p(y|θ)p(θ) dθ
p(θᵢ|y) = ∫ p(θᵢ, θ₋ᵢ|y) dθ₋ᵢ
        = ∫ p(θᵢ|θ₋ᵢ, y) p(θ₋ᵢ|y) dθ₋ᵢ
where θ₋ᵢ denotes all elements of θ other than θᵢ.
p(ỹ|y) = ∫ p(ỹ|θ, y) p(θ|y) dθ
       = ∫ p(ỹ|θ) p(θ|y) dθ
Note that ỹ and y are conditionally independent given θ; though
clearly ỹ and y are (marginally) dependent.
Again, note that all 3 quantities are defined via probability statements
on the unknown variable of interest.
y = xβ + ε
where ε ∼ N(0, σ²I). Alternately,
y ∼ N(y|xβ, σ²I)
For now assume that σ is known.
Classically, we would wish to estimate the regression coefficients, β,
given a data set, {yᵢ, xᵢ}, i = 1, . . . , n, say using the MLE
β̂ = (x′x)⁻¹x′y
Bayesian modelling proceeds by constructing a joint model for the
data and parameters, conditional on {x, σ²}.
Suppose we take,
p(β) = N(β|0, v)
Then,
p(β|y) ∝ p(y|β)p(β)
       ∝ (σ²)^(−n/2) exp[−(1/(2σ²))(y − xβ)′(y − xβ)] |v|^(−1/2) exp[−(1/2)β′v⁻¹β]
giving
p(β|y) = N(β|β̂, v̂)
β̂ = (σ⁻²x′x + v⁻¹)⁻¹σ⁻²x′y
v̂ = (σ⁻²x′x + v⁻¹)⁻¹
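These posterior quantities are direct matrix computations. A minimal numpy sketch on simulated data; the data, the prior covariance v, and the known σ² are all assumed values for illustration:

```python
import numpy as np

# Simulated regression data; beta_true, v and sigma2 are assumed values.
rng = np.random.default_rng(0)
n, p = 100, 3
x = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma2 = 1.0                                  # assumed known
y = x @ beta_true + np.sqrt(sigma2) * rng.normal(size=n)

v = 10.0 * np.eye(p)                          # prior covariance of beta

# v-hat = (sigma^-2 x'x + v^-1)^-1,  beta-hat = v-hat sigma^-2 x'y
v_hat = np.linalg.inv(x.T @ x / sigma2 + np.linalg.inv(v))
beta_hat = v_hat @ (x.T @ y / sigma2)
print(beta_hat)
```

With a vague prior (large v) the posterior mean approaches the MLE (x′x)⁻¹x′y.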
Note
p(y₀|x₀, y) = ∫ N(y₀|x₀β, σ²) N(β|β̂, v̂) dβ
            = N(y₀|x₀β̂, σ² + x₀v̂x₀′)
1.4 Conjugate priors
Often we will use forms for p(θ) that depend on p(y|θ) and which
make these calculations easy.
When the prior and the posterior are from the same family of
distributions (such as normal-normal) then the prior is termed
conjugate.
Note, from a purist perspective this is putting the cart before the
horse.
1.5
or, draw samples
θ⁽ⁱ⁾ ∼ p(θ|y), for i = 1, . . . , M
p(β|y) = N(β|β̂, v̂)
β̂ = v̂x′y
v̂ = (x′x + v⁻¹)⁻¹
MCMC would approximate this distribution with M samples drawn
from the posterior,
{β⁽¹⁾, . . . , β⁽ᴹ⁾} ∼ N(β̂, v̂)
1.6
Examples: for a set A,
Pr(θ ∈ A|y) ≈ (1/M) Σ_{i=1}^{M} I(θ⁽ⁱ⁾ ∈ A)
where I(·) is the logical indicator function.
Note that all these quantities can be computed from the same
bag of samples. That is, we can first collect θ⁽¹⁾, . . . , θ⁽ᴹ⁾ as a
proxy for p(θ|y) and then use the same set of samples over and
again for whatever we are subsequently interested in.
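As a sketch of this reuse, with a toy "posterior" standing in for MCMC output; the N(1, 0.5²) target is purely an assumption for illustration:

```python
import numpy as np

# One bag of samples standing in for theta^(1..M) ~ p(theta | y);
# the N(1, 0.5^2) "posterior" is a toy assumption.
rng = np.random.default_rng(1)
M = 100_000
theta = rng.normal(loc=1.0, scale=0.5, size=M)

# The same samples answer several questions:
post_mean = theta.mean()                        # E[theta | y]
pr_in_A = np.mean((theta > 0) & (theta < 2))    # Pr(theta in (0, 2) | y)
q05, q95 = np.quantile(theta, [0.05, 0.95])     # 90% credible interval
print(post_mean, pr_in_A, (q05, q95))
```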
Models
Modelling Data
2.1 MCMC simulation
μᵢ = P(yᵢ = 1|xᵢ) = 1/(1 + exp(−ηᵢ))
where we note, logit(μᵢ) = log[P(yᵢ = 1|xᵢ)/P(yᵢ = 0|xᵢ)] = ηᵢ
and μᵢ → 0 as ηᵢ → −∞, μᵢ → 1 as ηᵢ → +∞.
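The inverse-logit map and its limits can be checked directly; `inv_logit` is a hypothetical helper name:

```python
import math

# Inverse-logit: maps a linear predictor eta to a probability in (0, 1).
def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

# Limits as on the slide: -> 0 as eta -> -infinity, -> 1 as eta -> +infinity
print(inv_logit(-10.0), inv_logit(0.0), inv_logit(10.0))
```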
∏_{i=1}^{n} p(yᵢ|xᵢ, β) p(β)
MCMC Algorithms
First, recall that MCMC is an iterative procedure, such that given the
current state of the chain, θ⁽ⁱ⁾, the algorithm makes a probabilistic
update to θ⁽ⁱ⁺¹⁾.
The general algorithm is
MCMC Algorithm
θ⁽⁰⁾ ← x
For i = 1 to M
    θ⁽ⁱ⁾ = f(θ⁽ⁱ⁻¹⁾)
End
where f(·) outputs a draw from a conditional probability density.
3.1 Metropolis-Hastings algorithm
M-H Algorithm
θ⁽⁰⁾ ← x
For i = 0 to M
    Draw θ̃ ∼ q(θ̃|θ⁽ⁱ⁾)
    Set θ⁽ⁱ⁺¹⁾ = θ̃ with probability α(θ⁽ⁱ⁾, θ̃), else θ⁽ⁱ⁺¹⁾ = θ⁽ⁱ⁾, where
    α(a, b) = min{1, [p(b|y)q(a|b)] / [p(a|y)q(b|a)]}
End
Note:
The acceptance term corrects for the fact that the proposal density
is not the target distribution
To implement the accept/reject step, evaluate
α(θ⁽ⁱ⁾, θ̃) = min{1, [p(θ̃|y)q(θ⁽ⁱ⁾|θ̃)] / [p(θ⁽ⁱ⁾|y)q(θ̃|θ⁽ⁱ⁾)]}
draw U ∼ Uniform(0, 1);
IF U < α(θ⁽ⁱ⁾, θ̃) THEN accept θ̃; ELSE retain θ⁽ⁱ⁾.
For a symmetric proposal, q(a|b) = q(b|a), the acceptance ratio simplifies to
α(a, b) = min{1, p(b|y)/p(a|y)}
Pretty much any q(a|b) will do, so long as it gets you around the
state space Θ. However, different q(a|b) lead to different levels of
performance in terms of convergence rates to the target
distribution and exploration of the model space.
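A minimal random-walk Metropolis sketch: the standard-normal log target stands in for log p(θ|y), and the proposal scale is an assumed tuning choice. Because the Gaussian random-walk proposal is symmetric, the acceptance ratio reduces to p(θ̃|y)/p(θ⁽ⁱ⁾|y):

```python
import numpy as np

# Random-walk Metropolis; the standard-normal log target stands in for
# log p(theta | y), and scale is an assumed tuning parameter.
rng = np.random.default_rng(2)

def log_target(theta):
    return -0.5 * theta ** 2   # unnormalised log density

M, scale = 50_000, 2.5
theta = 0.0                    # theta^(0)
samples = np.empty(M)
for i in range(M):
    prop = theta + scale * rng.normal()   # draw from symmetric q(.|theta)
    # accept with probability min(1, p(prop|y) / p(theta|y))
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        theta = prop
    samples[i] = theta         # on rejection, the current state is repeated
print(samples.mean(), samples.std())
```

Working with log densities avoids numerical underflow; the comparison on the log scale is equivalent to U < α.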
3.2 Gibbs sampling
Suppose each full conditional
p(θⱼ|θ₋ⱼ, y)
is easy to sample from; where θ₋ⱼ = θ\θⱼ.
Gibbs Sampler
θ⁽⁰⁾ ← x
For i = 0 to M
    Set θ ← θ⁽ⁱ⁾
    For j = 1 to p
        Draw X ∼ p(θⱼ|θ₋ⱼ, y)
        Set θⱼ ← X
    End
    Set θ⁽ⁱ⁺¹⁾ ← θ
End
y = xβ + ε
where ε ∼ N(0, σ²I).
p(β) = N(β|0, v)
p(σ²) = IG(σ²|a, b)
where IG(·|a, b) denotes the Inverse-Gamma density.
p(β|y, σ²) = N(β|β̂, v̂)
β̂ = (σ⁻²x′x + v⁻¹)⁻¹σ⁻²x′y
v̂ = (σ⁻²x′x + v⁻¹)⁻¹
and
p(σ²|β, y) = IG(σ²|a + n, b + SS)
where SS = (y − xβ)′(y − xβ).
Hence the Gibbs sampler can be adopted:
(β, σ²)⁽⁰⁾ ← x
For i = 0 to M
    Set (β, σ²) ← (β, σ²)⁽ⁱ⁾
    Draw β|σ² ∼ N(β|β̂, v̂)
    Draw σ²|β ∼ IG(σ²|a + n, b + SS)
    Set (β, σ²)⁽ⁱ⁺¹⁾ ← (β, σ²)
End
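A runnable sketch of this Gibbs sampler on simulated data. One assumption to flag: the Inverse-Gamma updates below use the standard shape/scale parameterisation, under which the conditional is IG(a + n/2, b + SS/2); the slides' a + n, b + SS may reflect a different convention for IG(·|a, b).

```python
import numpy as np

# Gibbs sampler for y = x beta + eps, eps ~ N(0, sigma2 I), with
# beta ~ N(0, v) and sigma2 ~ IG(a, b). Data and hyperparameters are
# assumed values for illustration; IG updates use the standard
# parameterisation IG(a + n/2, b + SS/2).
rng = np.random.default_rng(3)
n, p = 200, 2
x = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])
y = x @ beta_true + rng.normal(size=n)

a, b = 2.0, 2.0
v_inv = np.linalg.inv(100.0 * np.eye(p))   # prior precision of beta

M = 2_000
sigma2 = 1.0
betas, sigma2s = np.empty((M, p)), np.empty(M)
for i in range(M):
    # Draw beta | sigma2, y ~ N(beta_hat, v_hat)
    v_hat = np.linalg.inv(x.T @ x / sigma2 + v_inv)
    beta_hat = v_hat @ (x.T @ y / sigma2)
    beta = rng.multivariate_normal(beta_hat, v_hat)
    # Draw sigma2 | beta, y from its Inverse-Gamma full conditional
    ss = np.sum((y - x @ beta) ** 2)
    sigma2 = 1.0 / rng.gamma(shape=a + n / 2, scale=1.0 / (b + ss / 2))
    betas[i], sigma2s[i] = beta, sigma2
print(betas[M // 2:].mean(axis=0), sigma2s[M // 2:].mean())
```

Discarding the first half of the chain as burn-in, the posterior means should sit near the generating values.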
y = xβ + ε
where ε ∼ N(0, σ²I).
y ∼ N(y|xβ, σ²I)
β ∼ N(β|0, vI)
σ² ∼ IG(σ²|a, b)
v ∼ IG(v|c, d)
Note the hierarchy of dependencies; IG(·|a, b) again denotes the Inverse-Gamma density.
p(β|y, σ², v) = N(β|β̂, v̂)
β̂ = (σ⁻²x′x + v⁻¹I)⁻¹σ⁻²x′y
v̂ = (σ⁻²x′x + v⁻¹I)⁻¹
and
p(σ²|β, y) = IG(σ²|a + n, b + SS)
where SS = (y − xβ)′(y − xβ),
and
p(v|β) = IG(v|c + p, d + SB)
{β, σ², v}⁽⁰⁾ ← x
For i = 0 to M
    Set (β, σ², v) ← {β, σ², v}⁽ⁱ⁾
    Draw β|σ², v ∼ N(β|β̂, v̂)
    Draw σ²|β ∼ IG(σ²|a + n, b + SS)
    Draw v|β ∼ IG(v|c + p, d + SB)
    Set {β, σ², v}⁽ⁱ⁺¹⁾ ← (β, σ², v)
End
Slice sampling
Rejection sampling
Ratio of uniforms
Some Issues
Output
Given the current state of the chain, θ⁽ⁱ⁾, the algorithm makes a
probabilistic update to θ⁽ⁱ⁺¹⁾.
The update, f(·), is made in such a way that the distribution
p(θ⁽ⁱ⁾) → p(θ|y), the target distribution, as i → ∞, for any
starting value θ⁽⁰⁾.
Hence, the early samples are strongly influenced by the distribution of
the starting value θ⁽⁰⁾.
That is, the first B samples, {θ⁽⁰⁾, θ⁽¹⁾, . . . , θ⁽ᴮ⁾}, are discarded.
This user-defined initial portion of the chain to discard is known as
the burn-in phase for the chain.
4.1 Convergence diagnostics
http://www.mrc-bsu.cam.ac.uk/bugs/
documentation/Download/cdaman03.pdf
Graphical Analysis
The first step in any output analysis is to eyeball sample traces from
various variables, {θⱼ⁽¹⁾, . . . , θⱼ⁽ᴹ⁾}.
Initially then, the chain will often be seen to migrate away from θ⁽⁰⁾
towards a region of high posterior probability centred around a mode
of p(θ|y).
The time taken to settle down to a region around a mode is certainly a
lower limit for B.
4.2 Geweke
4.3 Gelman-Rubin
The method compares the within- and between-chain variances for
each variable. When the chains have mixed (converged), the two
variance estimates should be in close agreement.
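A minimal sketch of this comparison (the potential scale reduction factor). The chains here are simulated as already converged, i.e. drawn iid from the same distribution, so the statistic should be close to 1:

```python
import numpy as np

# Within/between-chain comparison (Gelman-Rubin style R-hat) for one
# scalar parameter; 'chains' is an (m, M) array of m chains of length M,
# simulated here as already-converged chains.
rng = np.random.default_rng(4)
m, M = 4, 5_000
chains = rng.normal(size=(m, M))

W = chains.var(axis=1, ddof=1).mean()        # mean within-chain variance
B = M * chains.mean(axis=1).var(ddof=1)      # between-chain variance
var_hat = (M - 1) / M * W + B / M            # pooled variance estimate
r_hat = np.sqrt(var_hat / W)
print(r_hat)
```

Values of the statistic well above 1 indicate that the chains have not yet mixed.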
The Theory
A central limit result for Markov chains holds that
√M (f̂ − E[f(θ)]) → N(0, σ_f²)
where f̂ = (1/M) Σ_{i=1}^{M} f(θ⁽ⁱ⁾) and E[f(θ)] denotes the true unknown
expectation. Note that almost all quantities of interest can be written as expectations.
σ_f² = Var[f(θ)] + 2 Σ_{s=1}^{∞} Cov[f(θ⁽⁰⁾), f(θ⁽ˢ⁾)]
Hence, the greater the covariance between samples, the greater the
variance in the MCMC estimator (for given sample size M).
In Practice
The variance parameter σ_f² can be approximated using the sample
autocovariances.
Plots of autocorrelations within chains are extremely useful
High autocorrelations indicate slow mixing (movement around the
parameter space), with increased variance in the MCMC estimators
ESS = M / (1 + 2 Σ_{j=1}^{k} ρ(j))
where ρ(j), j = 1, . . . , k, are the first k sample autocorrelations.
The ESS estimates the reduction in the true number of samples,
compared to iid samples, due to the autocorrelation in the chain.
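A sketch of this estimate on an AR(1) chain standing in for MCMC output; the truncation point k and the AR coefficient phi are assumed values:

```python
import numpy as np

# ESS = M / (1 + 2 * sum_{j=1}^{k} rho(j)) on an AR(1) chain with
# coefficient phi, a stand-in for MCMC output. The theoretical ESS for
# an AR(1) chain is M * (1 - phi) / (1 + phi).
rng = np.random.default_rng(5)
M, phi, k = 100_000, 0.9, 100
x = np.empty(M)
x[0] = rng.normal()
for t in range(1, M):
    x[t] = phi * x[t - 1] + rng.normal()

def rho(x, j):
    """Lag-j sample autocorrelation."""
    xc = x - x.mean()
    return np.dot(xc[:-j], xc[j:]) / np.dot(xc, xc)

ess = M / (1 + 2 * sum(rho(x, j) for j in range(1, k + 1)))
print(ess)   # theoretical value here is about 5263
```

In practice the truncation point k matters: summing noisy autocorrelations at long lags can destabilise the estimate, which is why monotone or windowed truncation rules are used.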
Conclusions