
Introduction to Bayesian Analysis

Timothy Cogley
NYU Fall 2012
September 12, 2012
1 Outline
Classical v. Bayesian Perspectives
Bayes Theorem
Examples
2 Classical v. Bayesian Perspectives
Common elements
$y$ = data
$f(y)$ = true distribution
$p(y|\theta)$ = likelihood function for a parametric model
$\theta$ = parameters
2.1 Classical perspective
There exists a true, fixed value of $\theta$ about which we want to learn.
We form an estimate $\hat{\theta}$, e.g. by MLE.
We characterize the properties of $\hat{\theta}$ by imagining how it would behave across repeated samples.
Because $\hat{\theta}$ varies across samples, we treat it as a random variable.
The distribution of $\hat{\theta}$ across samples summarizes what we know about $\theta$.
Want estimators or test procedures that do well on average.
But the concept of repeated samples is often just a convenient fiction; usually there is only one sample.
Reliance on this concept gives rise to the term frequentist.
2.2 Bayesian perspective
Want to characterize subjective beliefs about $\theta$.
Posit that $\theta$ is a random variable.
Distributions over $\theta$ summarize subjective beliefs.
If $\theta$ is a r.v., what is $\hat{\theta}$?
a deterministic function of the sample
e.g., for OLS, $\hat{\theta} = (X'X)^{-1}X'Y$ is a deterministic function of $X, Y$.
Only one $\hat{\theta}$ can emerge in a given sample.
Conditional on $X, Y$, there is no uncertainty about $\hat{\theta}$.
Bayes theorem tells us how optimally to update beliefs after seeing data. I.e., it is
about the passage from prior to posterior beliefs.
Before looking at data: $p(\theta)$ = prior
After looking at data: $p(\theta|y)$ = posterior
These judgments are mediated by a model of how $y$ and $\theta$ are related. A model
is a probability distribution over outcomes, $p(y|\theta)$.
Another way to distinguish Bayesian and classical perspectives is in terms of conditioning.
Bayes: inference is conditional on the sample at hand.
Classical: inference depends on what could happen in other, hypothetical samples that could be drawn from the same mechanism.
My view: although there are interesting philosophical differences, I regard them as
alternative tools. Sometimes one is more convenient, sometimes another.
One reason why I like Bayesian methods: they connect well with Bayesian decision
theory. For policy-oriented macroeconomists, this is a big advantage.
3 Bayes Theorem
Can factor a joint density into the product of a conditional and a marginal
$$p(y, \theta) = p(y|\theta)p(\theta) = p(\theta|y)p(y)$$
Can also marginalize a joint density by integrating wrt the other argument
$$p(y) = \int p(y, \theta)\, d\theta$$
Bayes theorem follows from these two facts.
Start with the joint density $p(y, \theta)$.
Can factor it in two ways
$$p(y, \theta) = p(\theta|y)p(y) = p(y|\theta)p(\theta).$$
Divide through by $p(y)$,
$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}.$$
The marginal $p(y)$ can also be expressed as
$$p(y) = \int p(y, \theta)\, d\theta = \int p(y|\theta)p(\theta)\, d\theta$$
Hence the posterior is
$$p(\theta|y) = \frac{p(y|\theta)p(\theta)}{\int p(y|\theta)p(\theta)\, d\theta} \propto p(y|\theta)p(\theta).$$
Terminology:
$k(y, \theta) = p(y|\theta)p(\theta)$ is known as the posterior kernel
$p(y) = \int p(y|\theta)p(\theta)\, d\theta$ is known as the normalizing constant or marginal likelihood
$p(\theta|y)$ is known as the posterior density.
Remarks:
Notice that $p(y)$ does not depend on $\theta$. All the information about $\theta$ is contained in the kernel.
The posterior kernel does not necessarily integrate to 1, so it is not a proper probability density.
When we divide the kernel by the normalizing constant, we ensure that the result does integrate to 1.
Important to verify that $p(y)$ exists
it does when priors are proper, i.e. when $p(\theta)$ integrates to 1
problems sometimes arise when $p(\theta)$ is improper
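To make the kernel/normalizing-constant distinction concrete, here is a minimal numerical sketch for a toy model in which $y_t \sim N(\theta, 1)$ with a $N(0,1)$ prior on $\theta$; the grid, the prior, and the simulated data are all illustrative assumptions.

```python
import numpy as np

# Toy model: y_t ~ N(theta, 1) with prior theta ~ N(0, 1)  (illustrative choices)
rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=1.0, size=50)             # simulated data

theta = np.linspace(-3.0, 3.0, 2001)                     # grid over theta
prior = np.exp(-0.5 * theta**2)                          # prior kernel, constants dropped
loglike = np.array([-0.5 * np.sum((y - t)**2) for t in theta])
kernel = prior * np.exp(loglike - loglike.max())         # posterior kernel, rescaled for stability

# Approximate the normalizing constant by a Riemann sum, then divide it out
const = kernel.sum() * (theta[1] - theta[0])
posterior = kernel / const                               # now integrates to (approximately) 1

post_mean = np.sum(theta * posterior) * (theta[1] - theta[0])
print(post_mean)   # close to T * ybar / (T + 1), the conjugate-normal answer
```

Dividing the kernel by `const` is exactly the step of dividing by $p(y)$; everything reported afterward depends only on the normalized posterior.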
4 Examples
4.1 Mean of iid normal random vector with known variance
Suppose $y_t \sim NID(\mu, \Sigma)$ with $\Sigma$ known but $\mu$ unknown.
Prior: $p(\mu) \sim N(\mu_0, \Lambda_0)$ (Interpret the prior precision for the univariate case)
$$p(\mu|\Lambda_0) \propto \exp[-(1/2)(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0)].$$
Likelihood:
$$p(y|\mu, \Sigma) \propto \exp[-(1/2)\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)].$$
Posterior Kernel: multiply the prior and likelihood
$$p(\mu|y, \Sigma) \propto p(\mu|\Lambda_0)\, p(y|\mu, \Sigma) = \exp\{-(1/2)[(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)]\}. \quad (1)$$
Notice that this is a quadratic within an exponential, making it a Gaussian kernel.
It follows that we can express the posterior kernel as
$$p(\mu|y, \Sigma) \propto \exp[-(1/2)(\mu - \mu_1)'\Lambda_1^{-1}(\mu - \mu_1)]. \quad (2)$$
To find expressions for $\mu_1, \Lambda_1$, expand (1) and equate powers with (2).
$$(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) = \mu'\Lambda_0^{-1}\mu + \mu_0'\Lambda_0^{-1}\mu_0 - \mu'\Lambda_0^{-1}\mu_0 - \mu_0'\Lambda_0^{-1}\mu$$
$$\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu) = \sum_t [y_t'\Sigma^{-1}y_t + \mu'\Sigma^{-1}\mu - y_t'\Sigma^{-1}\mu - \mu'\Sigma^{-1}y_t]$$
Notice that
$$\sum_t \mu'\Sigma^{-1}\mu = T\mu'\Sigma^{-1}\mu, \qquad \sum_t y_t'\Sigma^{-1}\mu = T\bar{y}'\Sigma^{-1}\mu, \qquad \sum_t \mu'\Sigma^{-1}y_t = T\mu'\Sigma^{-1}\bar{y}.$$
Collecting terms in $\mu$,
$$(\mu - \mu_0)'\Lambda_0^{-1}(\mu - \mu_0) + \sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu) = \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu - \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] - [\mu_0'\Lambda_0^{-1} + T\bar{y}'\Sigma^{-1}]\mu + \text{terms not involving } \mu$$
Next, equate powers with terms in (2)
$$\mu'\Lambda_1^{-1}\mu = \mu'[\Lambda_0^{-1} + T\Sigma^{-1}]\mu \quad\Longrightarrow\quad \Lambda_1 = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}$$
$$\mu'\Lambda_1^{-1}\mu_1 = \mu'[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] \quad\Longrightarrow\quad \Lambda_1^{-1}\mu_1 = \Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y},$$
$$\mu_1 = \Lambda_1[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}] = [\Lambda_0^{-1} + T\Sigma^{-1}]^{-1}[\Lambda_0^{-1}\mu_0 + T\Sigma^{-1}\bar{y}].$$
Interpretation:
The posterior mean is a variance-weighted average of the prior mean and the sample average.
The relative weights depend on the prior variance $\Lambda_0$, the sample size $T$, and the variance of $y$ (i.e. $\Sigma$).
If the prior is diffuse ($\Lambda_0^{-1} = 0$), the posterior mean simplifies to
$$\mu_1 = [T\Sigma^{-1}]^{-1}[T\Sigma^{-1}\bar{y}] = \bar{y}.$$
Otherwise the prior pulls $\mu_1$ away from $\bar{y}$ toward $\mu_0$.
Hence the posterior mean is biased. (Is that a bad thing?)
Similarly, the posterior precision $\Lambda_1^{-1} = \Lambda_0^{-1} + T\Sigma^{-1}$ is the sum of the prior and sample precisions.
If the prior is diffuse, the posterior variance simplifies to
$$\Lambda_1 = [T\Sigma^{-1}]^{-1} = T^{-1}\Sigma,$$
the variance of the sample mean.
If the prior adds information ($\Lambda_0^{-1} \neq 0$), it reduces the posterior variance.
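To make the updating formulas for $\mu_1$ and $\Lambda_1$ concrete, here is a short NumPy sketch; the function name, the prior values, and the simulated data are illustrative choices, not part of the notes.

```python
import numpy as np

def normal_mean_update(ybar, T, Sigma, mu0, Lam0):
    """Posterior N(mu1, Lam1) for the mean of y_t ~ N(mu, Sigma) with Sigma known."""
    Sig_inv = np.linalg.inv(Sigma)
    Lam0_inv = np.linalg.inv(Lam0)
    Lam1 = np.linalg.inv(Lam0_inv + T * Sig_inv)            # posterior variance
    mu1 = Lam1 @ (Lam0_inv @ mu0 + T * Sig_inv @ ybar)      # posterior mean
    return mu1, Lam1

# Illustrative prior and data (assumed values)
rng = np.random.default_rng(1)
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
y = rng.multivariate_normal(mean=[0.5, -1.0], cov=Sigma, size=200)

mu1, Lam1 = normal_mean_update(y.mean(axis=0), len(y), Sigma,
                               mu0=np.zeros(2), Lam0=10.0 * np.eye(2))
print(mu1)     # pulled slightly from the sample mean toward mu0
print(Lam1)    # close to Sigma / T because the prior is loose
```

Rerunning with a tight prior (small entries in `Lam0`) shows the posterior mean shrinking toward `mu0`, which is the sense in which the estimate is "biased."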
4.2 Variance of iid normal random vector with known mean
Now suppose $y_t \sim NID(\mu, \Sigma)$ with known $\mu = \mu_0$ but unknown $\Sigma$.
Prior: for a Gaussian likelihood, the conjugate prior is inverse Wishart
$$p(\Sigma) = IW(S_0^{-1}, T_0) \propto |\Sigma|^{-(T_0 + n + 1)/2}\exp\{-(1/2)\,\mathrm{tr}(\Sigma^{-1}S_0)\},$$
where $T_0$ is the prior degrees of freedom, $S_0$ is the prior scale or sum-of-squares matrix, and $n = \dim(y_t)$.
Conjugate: the posterior has the same functional form as the prior
hence, the functional form of the posterior is known
consequently, you just need to update its sufficient statistics
Wishart is a multivariate generalization of a chi-square.
Sufficient statistics for the IW:
$$T_1 = T_0 + T,$$
$$S_1 = S_0 + \sum_t (y_t - \mu_0)(y_t - \mu_0)'.$$
Deriving the updating rule
Start with the log prior
$$\log p(\Sigma) = -(1/2)(T_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(\Sigma^{-1}S_0)$$
Add to it the Gaussian log likelihood
$$\log p(y|\mu_0, \Sigma) = -(T/2)\log|\Sigma| - (1/2)\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)$$
The result is the log posterior kernel
$$\log k(\Sigma|y, \mu_0) = -(1/2)(T + T_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(\Sigma^{-1}S_0) - (1/2)\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0).$$
The last term on the right-hand side is
$$\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0) = \mathrm{tr}\Big[\sum_t (y_t - \mu_0)'\Sigma^{-1}(y_t - \mu_0)\Big] = \mathrm{tr}\Big[\Sigma^{-1}\sum_t (y_t - \mu_0)(y_t - \mu_0)'\Big] = \mathrm{tr}\big[\Sigma^{-1}S_T\big],$$
where
$$S_T \equiv \sum_t (y_t - \mu_0)(y_t - \mu_0)'.$$
Hence the log posterior kernel is
$$\log k(\Sigma|y, \mu_0) = -(1/2)(T + T_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}(\Sigma^{-1}S_0) - (1/2)\,\mathrm{tr}(\Sigma^{-1}S_T) = -(1/2)(T + T_0 + n + 1)\log|\Sigma| - (1/2)\,\mathrm{tr}[\Sigma^{-1}(S_0 + S_T)].$$
This is the log kernel for an inverted Wishart density with degrees of freedom $T + T_0$ and scale matrix $S_0 + S_T$.
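A minimal sketch of this sufficient-statistic update, using `scipy.stats.invwishart`, whose `(df, scale)` parameterization matches the kernel above; the prior settings and simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import invwishart

def iw_update(y, mu0, S0, T0):
    """Posterior IW parameters (S1, T1) for Sigma when y_t ~ N(mu0, Sigma), mu0 known."""
    resid = y - mu0                        # T x n deviations from the known mean
    S_T = resid.T @ resid                  # sample sum-of-squares matrix
    return S0 + S_T, T0 + y.shape[0]       # just update the sufficient statistics

# Illustrative prior and data (assumed values)
rng = np.random.default_rng(2)
n, T = 2, 300
mu0 = np.zeros(n)
y = rng.multivariate_normal(mu0, np.array([[1.0, 0.5], [0.5, 1.5]]), size=T)

S1, T1 = iw_update(y, mu0, S0=np.eye(n), T0=n + 2)
post = invwishart(df=T1, scale=S1)         # same kernel as the log posterior above
print(post.mean())                         # E[Sigma | y] = S1 / (T1 - n - 1)
print(post.rvs(random_state=3))            # one posterior draw of Sigma
```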
4.3 iid normal random vector with unknown mean and variance
This can be derived analytically, leading to a matrix-variate t-distribution (e.g. see Bauwens et al.).
But I want to describe an alternative approach that will be used later when discussing Gibbs sampling.
The Gibbs sampler uses a model's conditional densities to simulate its joint density.
For the joint distribution $p(\mu, \Sigma)$, the conditionals are $p(\mu|\Sigma)$ and $p(\Sigma|\mu)$.
The two previous examples describe each of the conditionals.
The Gibbs sampler simulates the joint posterior by drawing sequentially from the conditionals (more later).
Thus we can decompose a complicated probability model into a collection of simple ones. We will use this idea a lot.
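As a preview, here is a minimal univariate sketch of that idea: $y_t \sim N(\mu, \sigma^2)$ with both parameters unknown, assuming a normal prior on $\mu$ and an inverse-gamma prior on $\sigma^2$ in shape/scale form; all prior settings and data are illustrative.

```python
import numpy as np

# Gibbs sampler for y_t ~ N(mu, sig2), both parameters unknown (univariate sketch).
# Assumed priors: mu ~ N(m0, v0),  sig2 ~ IG(a0, b0) in shape/scale form.
rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=100)
T, ybar = len(y), y.mean()
m0, v0, a0, b0 = 0.0, 10.0, 2.0, 2.0

mu, sig2 = ybar, y.var()                       # starting values
draws = []
for _ in range(5000):
    # mu | sig2, y  -- the known-variance update of example 4.1
    v1 = 1.0 / (1.0 / v0 + T / sig2)
    m1 = v1 * (m0 / v0 + T * ybar / sig2)
    mu = rng.normal(m1, np.sqrt(v1))
    # sig2 | mu, y  -- the known-mean update of example 4.2, in inverse-gamma form
    a1 = a0 + T / 2.0
    b1 = b0 + 0.5 * np.sum((y - mu) ** 2)
    sig2 = 1.0 / rng.gamma(shape=a1, scale=1.0 / b1)   # IG draw via reciprocal of a gamma
    draws.append((mu, sig2))

draws = np.array(draws)[500:]                  # drop burn-in
print(draws.mean(axis=0))                      # approximate joint posterior means of (mu, sig2)
```

Each pass cycles through the two conditionals; the retained draws approximate the joint posterior of $(\mu, \sigma^2)$.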
4.4 A Regression Model with a Known Innovation Variance
$$y_t = x_t'\beta + \varepsilon_t,$$
$x_t$ is strictly exogenous and $\varepsilon_t$ is $NID(0, \sigma^2)$.
$\beta$ is unknown, but $var(\varepsilon_t)$ is known.
Prior: $\beta \sim N(m, \sigma^2 M)$
$$p(\beta) \propto |\sigma^2 M|^{-1/2}\exp[-(1/2)(\beta - m)'(\sigma^2 M)^{-1}(\beta - m)]$$
Likelihood: $y_t|x_t$ is conditionally Gaussian
$$p(y|x, \beta, \sigma^2) \propto \exp[-(1/2)\sum_t (y_t - x_t'\beta)^2/\sigma^2]$$
Posterior: By following steps analogous to those for example (4.1), one can show
$$p(\beta|y_t, x_t, \sigma^2) = N(\beta_1, V_1),$$
where
$$V_1 = (\sigma^{-2}M^{-1} + \sigma^{-2}X'X)^{-1},$$
$$\beta_1 = V_1(\sigma^{-2}M^{-1}m + \sigma^{-2}X'Y).$$
Once again, the posterior mean is a precision-weighted average of prior and sample information.
As the prior precision $M^{-1} \to 0$, this converges to the usual OLS result.
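A short sketch of these posterior formulas, in the notation used above; the prior mean $m$, prior scale matrix $M$, and simulated data are illustrative assumptions.

```python
import numpy as np

def beta_posterior(X, Y, sigma2, m, M):
    """Posterior N(beta1, V1) for beta in Y = X beta + e, e ~ N(0, sigma2 I),
    under the prior beta ~ N(m, sigma2 * M)."""
    M_inv = np.linalg.inv(M)
    V1 = np.linalg.inv(M_inv / sigma2 + X.T @ X / sigma2)
    beta1 = V1 @ (M_inv @ m / sigma2 + X.T @ Y / sigma2)
    return beta1, V1

# Illustrative data (assumed values)
rng = np.random.default_rng(4)
T, k, sigma2 = 200, 3, 1.5
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
Y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=np.sqrt(sigma2), size=T)

beta1, V1 = beta_posterior(X, Y, sigma2, m=np.zeros(k), M=100.0 * np.eye(k))
print(beta1)                                   # close to OLS because the prior is loose
print(np.linalg.solve(X.T @ X, X.T @ Y))       # OLS for comparison
```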
4.5 A Regression Model with Known $\beta$
Same model, but now the innovation variance is unknown and the conditional mean
parameters are known.
This collapses to estimating the variance for a univariate iid process whose mean is
known to be zero.
Likelihood is
$$p(y_t|x_t, \beta, \sigma^2) \propto \sigma^{-T}\exp\Big\{-\sum_t \frac{(y_t - x_t'\beta)^2}{2\sigma^2}\Big\}.$$
Prior: the inverse-gamma density is conjugate to a normal likelihood
$$p(\sigma^2) \propto \Big(\frac{\sigma_0^2}{\sigma^2}\Big)^{(2 + df)/2}\exp\Big(-\frac{df\,\sigma_0^2}{2\sigma^2}\Big),$$
where $df$ represents prior degrees of freedom, and $\sigma_0^2$ is a prior scale parameter.
To find the posterior kernel, multiply the prior and likelihood. After some algebra,
$$p(\sigma^2|y_t) \propto \Big(\frac{\sigma_0^2}{\sigma^2}\Big)^{(2 + df + T)/2}\exp\Big(-\frac{df\,\sigma_0^2 + ss}{2\sigma^2}\Big),$$
where
$$ss = \sum_t (y_t - x_t'\beta)^2.$$
The posterior degrees of freedom are the sum of the prior $df$ and the number of observations,
$$df_{post} = df + T.$$
The posterior sum of squares is the prior sum of squares plus the sample sum of squares,
$$ss_{post} = df\,\sigma_0^2 + ss.$$
To find the posterior, we just need to update the scale and degrees-of-freedom parameters.
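A sketch of this update in the same $(df, \sigma_0^2)$ parameterization; the prior values and residuals are illustrative, and the posterior draw uses the equivalent scaled-inverse-chi-square representation, $ss_{post}/\sigma^2 \sim \chi^2(df_{post})$.

```python
import numpy as np

def sig2_posterior(resid, df0, s2_0):
    """Update the (degrees of freedom, sum of squares) of the posterior for sigma^2,
    given residuals y_t - x_t' beta with beta known."""
    ss = np.sum(resid ** 2)                    # sample sum of squares
    df_post = df0 + len(resid)                 # df_post = df + T
    ss_post = df0 * s2_0 + ss                  # ss_post = df * sigma_0^2 + ss
    return df_post, ss_post

# Illustrative residuals and prior (assumed values)
rng = np.random.default_rng(5)
resid = rng.normal(scale=1.3, size=150)
df_post, ss_post = sig2_posterior(resid, df0=4, s2_0=1.0)

sig2_draws = ss_post / rng.chisquare(df_post, size=10_000)   # draws of sigma^2
print(sig2_draws.mean())                                     # posterior mean of sigma^2
```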
4.6 A Regression Model with Unknown $\beta$ and $\sigma^2$
Use the Gibbs idea along with the results of the previous two subsections.
4.7 Multivariate normal model with Jeffreys prior and known variance
To reflect the absence of prior information about conditional mean parameters, Jeffreys proposed
$$p(\mu|\Sigma) \propto |\Sigma|^{-(n+1)/2},$$
where $n = \dim(y)$.
This can be interpreted as the limit of the conjugate normal as $\Lambda_0^{-1} \to 0$.
In this case, the posterior is
$$p(\mu|y, \Sigma) \propto |\Sigma|^{-(n+1)/2}\exp[-(1/2)\sum_t (y_t - \mu)'\Sigma^{-1}(y_t - \mu)] = N(\bar{y}, T^{-1}\Sigma).$$
This can be combined with a marginal prior for $\Sigma$, using the Gibbs idea.
4.8 VAR with Jeffreys prior (Zellner 1970)
$$A(L)y_t = u_t, \qquad u_t \sim NID(0, \Sigma)$$
$\theta$ = constants plus VAR parameters
Prior:
$$p(\theta, \Sigma) = p(\theta|\Sigma)p(\Sigma)$$
$p(\Sigma) = IW$
$p(\theta|\Sigma) =$ Jeffreys
Likelihood function: Gaussian
Full conditionals:
$$p(\Sigma|\theta, y) \propto p(\Sigma)p(y|\theta, \Sigma)$$
This is the same as for the covariance matrix for a multivariate normal random vector.
$$p(\theta|\Sigma, y) \propto p(\theta|\Sigma)p(y|\theta, \Sigma) = |\Sigma|^{-(T + n + 1)/2}\exp\{-(1/2)\sum_t [A(L)y_t]'\Sigma^{-1}[A(L)y_t]\}.$$
This is the SUR criterion.
It follows that $\theta$ is conditionally normal with mean and variance
$$E(\theta|\Sigma, Y) = [X'(\Sigma^{-1} \otimes I)X]^{-1}X'(\Sigma^{-1} \otimes I)y,$$
$$var(\theta|\Sigma, Y) = [X'(\Sigma^{-1} \otimes I)X]^{-1},$$
where $y$ and $X$ represent the left- and right-hand variables, respectively.
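A sketch of these conditional moments with explicit Kronecker products; the simulated data and the equation-by-equation stacking convention are illustrative assumptions. With identical regressors in every equation, as in an unrestricted VAR, the result coincides with equation-by-equation OLS, which the check below exploits.

```python
import numpy as np

def sur_conditional_moments(Y, X, Sigma):
    """Conditional posterior mean and variance of the stacked coefficients theta,
    given Sigma, under a flat prior: the GLS/SUR formulas above.
    Y is T x n (left-hand variables), X is T x k (right-hand variables)."""
    T, n = Y.shape
    y = Y.flatten(order="F")                       # stack observations equation by equation
    Xbig = np.kron(np.eye(n), X)                   # block-diagonal regressor matrix
    W = np.kron(np.linalg.inv(Sigma), np.eye(T))   # (Sigma^{-1} kron I) weighting
    V = np.linalg.inv(Xbig.T @ W @ Xbig)           # var(theta | Sigma, Y)
    mean = V @ (Xbig.T @ W @ y)                    # E(theta | Sigma, Y)
    return mean, V

# Illustrative check with assumed data
rng = np.random.default_rng(6)
T, n = 120, 2
X = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X @ rng.normal(size=(2, n)) + rng.multivariate_normal(np.zeros(n), np.eye(n), size=T)

mean, V = sur_conditional_moments(Y, X, Sigma=np.cov(Y, rowvar=False))
print(mean.reshape(n, -1))                         # rows: coefficients for each equation
print(np.linalg.solve(X.T @ X, X.T @ Y).T)         # OLS, equation by equation
```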
4.9 Probability of a binomial random variable
Suppose $y_t$ takes on 2 values, with probability $p$ and $1 - p$, respectively.
Likelihood:
$$f(y_t|p) \propto p^{n_{1t}}(1 - p)^{n_{2t}},$$
where $n_{1t}$ = number of draws of state 1, $n_{2t}$ = number of draws of state 2, and $n_{1t} + n_{2t}$ = total number of draws.
Prior: The conjugate prior for a binomial likelihood is a beta density,
$$f(p) \propto p^{n_{10} - 1}(1 - p)^{n_{20} - 1}.$$
To find the posterior kernel, multiply the likelihood and the prior,
$$f(p|y_t) \propto p^{n_{1t} + n_{10} - 1}(1 - p)^{n_{2t} + n_{20} - 1}.$$
In this case, to find the posterior we just need to update the prior counters.
Posterior moments can then be calculated by integration. E.g., the posterior mean of $p$ is
$$\int p f(p|y_t)\,dp = \frac{n_{1t} + n_{10}}{(n_{1t} + n_{10}) + (n_{2t} + n_{20})}.$$
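A sketch of this beta-binomial update; the prior pseudo-counts and the simulated data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import beta

# Beta-binomial updating for the state probability p (illustrative counts)
n1_0, n2_0 = 2, 2                          # assumed prior pseudo-counts for states 1 and 2
rng = np.random.default_rng(7)
y = rng.binomial(1, 0.7, size=60)          # 1 = state 1, 0 = state 2
n1_t, n2_t = int(y.sum()), int(len(y) - y.sum())

post = beta(n1_t + n1_0, n2_t + n2_0)      # posterior is Beta with updated counters
print(post.mean())                         # (n1_t + n1_0) / ((n1_t + n1_0) + (n2_t + n2_0))
print(post.interval(0.95))                 # an equal-tailed 95% credible interval
```

The printed mean reproduces the integral above; the interval is one way to report a credible set (see section 5).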
4.10 Probabilities for a multinomial random variable
What if $y_t$ is discrete but takes on more than 2 values, with probabilities $p_1, p_2, \ldots, p_n$, respectively?
Much like the previous case, but ...
likelihood is multinomial
conjugate prior is a Dirichlet density.
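The bookkeeping is the same as in the beta case: the Dirichlet posterior parameters are the prior pseudo-counts plus the observed state counts. A sketch with illustrative (assumed) counts and data:

```python
import numpy as np

# Dirichlet-multinomial updating: add observed counts to the prior pseudo-counts
prior_counts = np.array([1.0, 1.0, 1.0])          # assumed Dirichlet prior for 3 states
rng = np.random.default_rng(8)
y = rng.choice(3, size=90, p=[0.5, 0.3, 0.2])     # simulated discrete data
obs_counts = np.bincount(y, minlength=3)

post_counts = prior_counts + obs_counts           # Dirichlet posterior parameters
print(post_counts / post_counts.sum())            # posterior means of (p1, p2, p3)
p_draws = rng.dirichlet(post_counts, size=5000)   # posterior draws, if simulation is needed
print(p_draws.mean(axis=0))
```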
4.11 Transition Probabilities for a Markov-Switching Model
Suppose $y_t$ takes on 2 values, with transition probability matrix
$$P = \begin{bmatrix} p & 1 - p \\ 1 - q & q \end{bmatrix}.$$
Likelihood:
conditional on being in state $i$ at time $t$, $y_{t+1}$ is a binomial random variable.
By factoring the joint density for $y_t$ into the product of conditionals (and ignoring the initial condition), the likelihood can be expressed as
$$f(y_t|p, q) \propto p^{n_{11,t}}(1 - p)^{n_{12,t}}q^{n_{22,t}}(1 - q)^{n_{21,t}}.$$
Prior: $p, q$ are independent beta random variables
$$f(p, q) = f(p)f(q),$$
$$f(p) \propto p^{n_{11,0} - 1}(1 - p)^{n_{12,0} - 1},$$
$$f(q) \propto q^{n_{22,0} - 1}(1 - q)^{n_{21,0} - 1}.$$
To find the posterior kernel, multiply prior and likelihood,
$$f(p, q|y_t) \propto p^{n_{11,t} + n_{11,0} - 1}(1 - p)^{n_{12,t} + n_{12,0} - 1}q^{n_{22,t} + n_{22,0} - 1}(1 - q)^{n_{21,t} + n_{21,0} - 1} \propto f(p|y_t)f(q|y_t).$$
Hence the posterior is also a product of beta densities.
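A sketch of the count bookkeeping for the transition probabilities; the prior pseudo-counts and the simulated regime path are illustrative assumptions.

```python
import numpy as np

def count_transitions(states):
    """Tally n_ij = number of transitions from state i to state j in a 0/1 chain."""
    counts = np.zeros((2, 2), dtype=int)
    for prev, curr in zip(states[:-1], states[1:]):
        counts[prev, curr] += 1
    return counts

# Assumed prior pseudo-counts (rows correspond to the p and q rows of P) and a simulated chain
prior = np.array([[8, 2], [2, 8]])
rng = np.random.default_rng(9)
P_true = np.array([[0.9, 0.1], [0.2, 0.8]])
s = [0]
for _ in range(300):
    s.append(rng.choice(2, p=P_true[s[-1]]))

post = prior + count_transitions(np.array(s))      # beta parameters, row by row
print(post[0, 0] / post[0].sum())                  # posterior mean of p = Pr(stay in state 1)
print(post[1, 1] / post[1].sum())                  # posterior mean of q = Pr(stay in state 2)
p_draw = rng.beta(post[0, 0], post[0, 1])          # draw p from its beta posterior
q_draw = rng.beta(post[1, 1], post[1, 0])          # draw q from its beta posterior
```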
4.12 Markov-switching models with more than 2 states
Follow the same steps as above, but use multinomial-Dirichlet probability models.
5 Examples of objects that Bayesians want to compute
posterior mode: $\arg\max_\theta\, p(\theta|y)$ (analogous to MLE)
posterior mean
$$E(\theta|y) = \int \theta\, p(\theta|y)\,d\theta$$
Posterior variance
$$V(\theta|y) = \int [\theta - E(\theta|y)][\theta - E(\theta|y)]'\,p(\theta|y)\,d\theta$$
other posterior moments, expected loss
$$E[g(\theta)] = \int g(\theta)\,p(\theta|y)\,d\theta.$$
Credible sets (confidence intervals)
$$\Pr(\theta \in \bar{\Theta}\,|\,y) = \int_{\bar{\Theta}} p(\theta|y)\,d\theta.$$
(The set $\bar{\Theta}$ can be adjusted to deliver a pre-specified probability.)
normalizing constants (marginal likelihood)
$$p(y) = \int p(y|\theta)p(\theta)\,d\theta.$$
Bayes factors (analogous to likelihood ratio)
$$B = \frac{p(y|M_1)}{p(y|M_2)} = \frac{\int p(y|\theta_1, M_1)p(\theta_1)\,d\theta_1}{\int p(y|\theta_2, M_2)p(\theta_2)\,d\theta_2}.$$
(Remark on plug-in v. marginalization. The latter accounts better for parameter uncertainty.)
Posterior odds ratios (model comparison)
$$POR = \frac{p(M_1)}{p(M_2)} \cdot \frac{p(y|M_1)}{p(y|M_2)} = \text{prior odds} \times B.$$
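Once posterior draws of $\theta$ are available, whether from conjugate formulas or from a simulator like the Gibbs sketches above, most of these objects reduce to sample averages over the draws; a minimal sketch with stand-in (assumed) draws follows.

```python
import numpy as np

# Stand-in posterior draws of a scalar theta (assumed for illustration)
rng = np.random.default_rng(10)
theta = rng.normal(loc=1.2, scale=0.4, size=20_000)

post_mean = theta.mean()                            # E(theta | y)
post_var = theta.var(ddof=1)                        # V(theta | y)
expected_loss = np.mean((theta - 1.0) ** 2)         # E[g(theta)] for squared loss around 1.0
cred_set = np.percentile(theta, [2.5, 97.5])        # 95% equal-tailed credible set
print(post_mean, post_var, expected_loss, cred_set)
```

Marginal likelihoods, Bayes factors, and posterior odds are the exceptions: they involve the normalizing constant and generally require dedicated methods rather than raw posterior draws.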
6 Bayesian computations
6.1 Old School (Zellner 1970)
Form conjugate prior and likelihood.
Derive analytical expression for posterior distribution.
Evaluate integrals analytically.
Some interesting models can be formulated this way, but this is a restrictive class.
6.2 New Wave
Can break out of conjugate families by resorting to asymptotic approximations and
numerical methods.
How to do that will be our main subject over the next few lectures.
