
STATS 225: Bayesian Analysis

Lecture 1: Introduction
Babak Shahbaba
Department of Statistics, UCI
Why Bayesian?
Statistical methods are mainly inspired by applied scientific problems.
The overall goal of statistical analysis is to provide a robust framework for designing scientific studies, collecting empirical evidence, and analyzing the data, in order to understand unknown phenomena, answer scientific questions, and make decisions.
To this end, we rely on the observed data as well as our
domain knowledge.
Why Bayesian?
Our domain knowledge, which we refer to as our prior
information, is itself based on previous empirical evidence.
For example, if we are interested in the average normal body temperature, we would of course measure the body temperature of samples from the population, but we also know, based on previous empirical evidence, that this average is a number close to 98.6°F.
In this case, our prior knowledge asserts that values around 98 are more plausible compared to values around 90, for example.
Why Bayesian?
We could of course attempt to minimize our reliance on prior
information.
Most frequentist methods follow this principle and use the domain knowledge, for example, to decide which characteristics of the population are relevant to our scientific problem (e.g., we do not include height as a risk factor for cancer), but avoid using priors when making inference.
Note that this should not give us the illusion that our method
is objective.
Why Bayesian?
Bayesian methods on the other hand provide a mathematical
framework to incorporate prior knowledge in the process of
making inference.
This is based on the philosophy that if the prior is in fact informative, it should lead to more accurate inference and better decisions.
The counterargument is that this makes our analysis more prone to mistakes.
This is of course true! While the underlying concept of Bayesian statistics is quite simple, implementing Bayesian methods tends to be more complex compared to their frequentist counterparts.
Therefore, if you want to be a bad statistician (or bad
scientist), it is better to be a frequentist than a Bayesian!
Bayesian inference
Bayesian inference is making statements about unknown
quantities in terms of probabilities given the observed data
and our prior knowledge.
Our prior knowledge represents the extent of our belief and
uncertainty regarding the value of unobservables. We express
our prior using probability models.
We also use probability models to define the underlying mechanism that has generated the data.
Bayesian inference
Bayesian inference therefore starts by dening the joint
probability for our prior opinion and the mechanism based on
which the data are generated.
To make inference, we update our prior opinion about unobservables given the observed data. We refer to this updated opinion as our posterior opinion, which itself is expressed in terms of probabilities.
As we can see, probability has a central role in Bayesian
statistics.
For deriving Bayesian methods and making statistical
inference, probability provides a coherent and axiomatic
framework.
Probability: It's personal!
In the Bayesian paradigm, probability is a measure of
uncertainty.
"Coins don't have probabilities, people have probabilities" (Persi Diaconis).
"The only relevant thing is uncertainty: the extent of our own knowledge and ignorance" (Bruno de Finetti).
In this view, all that matters is uncertainty, and all
uncertainties are expressed in terms of probability.
Therefore, we use probability models both for random variables that change and for those that might not change (e.g., the population mean) but whose value we are uncertain about.
Probability: It's personal!
Consider the well-known coin tossing example. What is the
probability of head in one toss?
There are only two possibilities for the outcome: head and tail. Assuming symmetry (i.e., a fair coin), head and tail have equal probability 1/2.
In the frequentist view, probability is assigned to an event by
regarding it as a class of individual events (i.e., trials) all
equally probable and stochastically independent.
For the coin tossing example, we assume a sequence of iid
tosses, and the probability of head is 1/2 since the number of
times we observe head divided by the number of trials reaches
1/2 as the number of trials grows.
Probability: It's personal!
Note that while Bayesians and frequentists provide the same answer, there is a fundamental and philosophical difference in how they view probability.
Bayesians feel comfortable assigning probabilities to events that are not repeatable.
For example, I can show you a picture of a car and ask "what is the probability that the price of this car is less than $5,000?"
Likelihood function for estimation
As mentioned above, we also define the underlying mechanism that generates data, y, using a probability model, P(y|θ), which depends on the unknown parameter of interest, θ.
Frequentist methods use only this probability for inference.
To estimate model parameters, we can find the values such that the probability of the observed data is maximum.
For this, we first need to construct the corresponding likelihood function.
The likelihood function is defined by plugging the observed data into the probability distribution and expressing it as a function of model parameters, i.e., f(θ, y).
We then maximize the likelihood function with respect to the model parameters θ.
Under weak regularity conditions, the MLE demonstrates attractive properties as the sample size n increases (i.e., n → ∞). These include its asymptotic normality, consistency, and efficiency.
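As a concrete sketch (not from the lecture; the data values and the use of scipy are my own illustration), maximizing a binomial likelihood numerically recovers the closed-form MLE y/n:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data: 7 heads in 10 tosses of a coin with unknown P(head) = theta.
y, n = 7, 10

def neg_log_likelihood(theta):
    # -log L(theta; y) for a Binomial(n, theta) observation
    return -(y * np.log(theta) + (n - y) * np.log(1 - theta))

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
theta_mle = result.x

# The numerical maximizer agrees with the closed-form MLE y/n = 0.7
print(theta_mle)
```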
US kidney cancer death rate by county during 1980s
(Gelman)
Example 2.8 in Gelman et al.: counties in the United States with the highest kidney cancer death rates (estimated using ML) during the 1980s.
It seems that most of the counties with the highest cancer death rate are in the Great Plains.
US kidney cancer death rate by county during 1980s
Surprisingly, a map of the counties with the lowest rates also
shows a similar pattern, i.e., they are mostly in the Great
Plains.
US kidney cancer death rate by county during 1980s
This is due to the fact that estimates for counties with small populations (i.e., those in the Great Plains) are very volatile. A county with a population of 10,000 and zero deaths would be among the lowest ones, and just one death makes the same county one of the highest ones (the national average death rate is 4.6 × 10⁻⁵).
The Bayesian approach for estimating cancer death rates provides more reasonable results.
For now, think of these estimates as the weighted average
between the observed rate in the county and the national
average rate. Weights are proportional to population size.
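The weighted-average idea can be sketched as follows (all numbers and the weighting constant kappa below are made up for illustration; this is not the actual model from Gelman et al.):

```python
import numpy as np

# Shrink each county's raw death rate toward the national average,
# with weight proportional to the county's population size.
national_rate = 4.6e-5
populations = np.array([1_000, 10_000, 1_000_000])   # three hypothetical counties
deaths      = np.array([1,     0,      46])
raw_rates   = deaths / populations

# kappa plays the role of a prior "pseudo-population"; its value is a choice.
kappa = 50_000
weights = populations / (populations + kappa)
shrunk_rates = weights * raw_rates + (1 - weights) * national_rate

print(raw_rates)     # small counties have extreme raw rates
print(shrunk_rates)  # shrunk estimates are pulled toward 4.6e-5
```

Small counties get pulled strongly toward the national rate, while large counties mostly keep their own observed rate.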
Likelihood function for hypothesis testing
Within the frequentist framework, we also use the likelihood function to devise standard tests (the Wald test, score test, and likelihood ratio test) to perform hypothesis testing.
To evaluate a hypothesis, we then calculate the corresponding p-value, and reject the null hypothesis if the p-value is below a certain cutoff.
The likelihood principle states that after observing the data, y, all relevant information for inference about θ is contained in the likelihood function for the observed y. Moreover, two likelihood functions contain the same information about θ if they are proportional to each other.
To see how this could go wrong, consider the coin experiment
example (proposed by David MacKay).
The coin experiment
A scientist has just received a grant to examine whether a
specic coin is fair (i.e., P(H) = P(T) = 0.5) or not.
He sets up a lab and starts tossing the coin. Of course, because of his limited budget, he can only toss the coin a finite number of times.
He tosses the coin 12 times, of which only 3 are heads.
He hires a frequentist statistician and asks him to estimate the p-value, hoping that the result could be published in one of the journals that only publish if the p-value is less than 0.05!
The statistician says: "You tossed the coin 12 times and you got 3 heads. The one-sided p-value is 0.07."
The coin experiment
The scientist says: "Well, it wasn't exactly like that... I actually repeated the coin tossing experiment until I got 3 heads and then I stopped."
The statistician says: "In that case, your p-value is 0.03."
Assignment 1, Q1: Show how the statistician came up with these results.
Later, we will give the same problem to a Bayesian statistician and discuss her approach.
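If you want to check the statistician's two numbers numerically (the derivation itself is the point of Assignment 1, Q1), a quick sketch using scipy's binomial CDF:

```python
from scipy.stats import binom

# Design 1: n = 12 tosses fixed in advance, 3 heads observed.
# One-sided p-value = P(at most 3 heads out of 12 | theta = 0.5)
p1 = binom.cdf(3, 12, 0.5)

# Design 2: toss until the 3rd head, which happened on toss 12.
# p-value = P(12 or more tosses needed) = P(at most 2 heads in first 11 tosses)
p2 = binom.cdf(2, 11, 0.5)

print(round(p1, 3), round(p2, 3))  # approximately 0.073 and 0.033
```

The same data yield two different p-values depending on the stopping rule, which is exactly the conflict with the likelihood principle.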
Prior probability
As mentioned above, within the Bayesian framework, we use probability not only for the data, y, where our uncertainty is due to data variation, but also for the model parameters, θ, where our uncertainty is due to the fact that θ is a population parameter, and it is almost always unknown.
Therefore, before doing statistical inference, we need to specify the extent of our belief and our uncertainty about the possible values of the parameter using a prior probability.
We denote this probability as P(θ).
We usually use our (or others') domain knowledge, which is accumulated based on previous scientific studies.
We almost always have such information, although it could be
vague.
Prior probability
For example, consider the study conducted by Mackowiak et al. to find whether the average normal body temperature is the widely accepted value of 98.6°F.
Their hypothesis was that the average normal body temperature is, in fact, less than 98.6°F.
For the average normal body temperature, we know that it should be close to 98.6°F.
Let's denote the average normal body temperature for the population as θ. We know that θ should be close to 98.6°F; that is, values close to 98.6°F are more plausible than values close to 90°F, for example.
We assume that as we move away from 98.6°F the values become less likely in a symmetric way (i.e., it does not matter if we go higher or lower).
Prior probability
Based on the above assumptions (and ignoring the fact that body temperature cannot be negative), we can set θ ∼ N(98.6, σ²).
In the above prior, σ² determines how certain we are about the average normal body temperature being around 98.6°F.
If we believe that it is almost impossible that the average normal body temperature is above 113.6 or below 83.6, we can set σ = 5 so the approximate 99.7% interval includes all the plausible values from 83.6 to 113.6.
A general piece of advice is that we should keep an open mind, consider all possibilities, and avoid using very restrictive priors.
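A quick numerical sketch (assuming scipy is available) of why σ = 5 matches this belief: the central 99.7% interval of N(98.6, 5²) is close to the 3σ interval [83.6, 113.6] quoted above.

```python
from scipy.stats import norm

# Under the prior theta ~ N(98.6, 5^2), the central 99.7% interval is
# roughly mean +/- 3 standard deviations.
mu, sigma = 98.6, 5.0
lo, hi = norm.interval(0.997, loc=mu, scale=sigma)
print(lo, hi)  # approximately 83.8 and 113.4, close to [83.6, 113.6]
```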
Probability distributions to express our opinion
What is the 50% probability interval for the price of Think
City (provide 25% and 75% quantiles)?
This is the interval you would be willing to make equal bets
for the actual value being inside or outside of the interval.
Moreover, you believe there is 25% chance that the interval is
too high, and 25% chance that it is too low.
Let's again assume that as you move away from the center of this interval, the values become less likely in a symmetric way.
In this setting, we can use a normal distribution to express our belief regarding the price of the car.
A normal distribution with mean μ and standard deviation σ has approximately a 50% interval [μ − (2/3)σ, μ + (2/3)σ].
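The 2/3 factor is an approximation to the exact 0.75 quantile of the standard normal, which we can check numerically (assuming scipy):

```python
from scipy.stats import norm

# The exact half-width of a central 50% normal interval is about 0.674 sigma,
# which the slide rounds to (2/3) sigma.
q75 = norm.ppf(0.75)          # approximately 0.6745
lo, hi = norm.interval(0.5)   # standard normal 50% interval
print(q75, lo, hi)
```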
More on priors
Sometimes, our prior opinion is based on what we know about
the underlying structure of the data.
For example, in many classification problems, we have prior knowledge about how classes can be arranged in a hierarchy.
Hierarchical classification problems of this sort are abundant in statistics and machine learning.
One such example is the prediction of genes' biological functions.
More on priors
As shown in the following figure, gene functions usually are presented in a hierarchical form, starting with very general classes (e.g., cell processes) and becoming more specific at lower levels of the hierarchy (e.g., cell division).
(The figure shows a tree with root node Gene Functions, branching into the top-level classes Structural Elements and Cell Processes, with lower-level classes including Cell Envelope, Ribosome Constituents, Chromosome Replication, and Cell Division.)
Figure: A part of a gene annotation hierarchy proposed by Riley (1993) for the E. coli genome.
More on priors
Another example is analyzing gene expression values to identify genes that are differentially expressed between two groups (e.g., cancer vs. healthy).
Traditionally, statistical methods have focused on individual genes, ignoring possible relationships among them.
In many real situations, however, knowing that a gene, G_g, is differentially expressed could increase the probability of differential expression for other genes that are related to G_g (i.e., either they activate G_g or are activated by it).
In practice, prior biological information is available regarding
the interconnectivity among genes.
Such information allows us to divide genes into subsets and
shift the focus of hypothesis testing towards gene sets as
opposed to individual genes.
Many studies have shown that focusing on gene sets could
result in higher statistical power and provide more robust
results.
The price is right!
Now let's go back to the electric car example. What would you say if I ask you to guess the price of the Think City?
Would you change your answer if I tell you that you win the car if your answer is right?
Let's start with a simpler question (proposed by Gelman and Nolan, 2002).
I am going to keep tossing a coin and you are going to guess the outcome; you win $1 every time you guess head and the outcome is head. What would be your strategy?
Would you change your strategy if I tell you the coin is not a fair coin and the probability of head is only 0.1?
Decision theory
It is clear that to make decisions, we need more than just
probability: we need a measure of loss or gain for each
possible outcome.
For this, we usually use a loss function that assigns to each
possible outcome a number that represents the cost and the
amount of regret (e.g., loss of prot) we endure when that
outcome occurs.
Alternatively, we could use a utility function that assigns a
number to each outcome representing the gain and the
amount of satisfaction due to that outcome.
Decision theory
Decision theory has a key role in Bayesian statistical inference.
Decision theory provides a mathematical framework for making decisions under uncertainty; that is, when the outcome of an event is not known. We do, however, know what our loss (or gain) would be if any of the possible outcomes occurs.
To make decisions, we often use the expected loss principle and choose the option whose expected loss is minimum.
This of course seems easy in theory, but it may be difficult in practice.
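A toy sketch of the expected loss principle (the outcomes, probabilities, and losses below are invented for illustration):

```python
# Choose between two actions whose losses depend on an uncertain outcome.
p = {"rain": 0.3, "sun": 0.7}   # outcome probabilities

# Loss of each action under each outcome:
loss = {
    "umbrella":    {"rain": 1,  "sun": 2},  # small inconvenience either way
    "no_umbrella": {"rain": 10, "sun": 0},  # big regret if it rains
}

# Expected loss of each action; pick the minimizer.
expected_loss = {a: sum(p[o] * l for o, l in losses.items())
                 for a, losses in loss.items()}
best = min(expected_loss, key=expected_loss.get)
print(expected_loss, best)  # umbrella: 1.7, no_umbrella: 3.0
```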
Whats next?
So far, we have tried to establish why we use Bayesian analysis.
Throughout this course, we will discuss different Bayesian models and their applications for analyzing scientific problems.
We first start with simple models with one unknown parameter.
We then move to discuss models with more than one
parameter.
These models tend to be complex, and we would need to use
computational methods to perform statistical inference. We
will discuss some of these methods in this course.
Before we discuss Bayesian models, we briefly review rigorous probability based on the book by Jeffrey Rosenthal.
Probability
We are familiar with statements such as "X ∼ Poisson(5)". We interpret it as X being a non-negative random variable such that P(X = k) = 5^k exp(−5)/k!.
We call X a discrete random variable.
We are also familiar with statements like "Y ∼ Normal(0, 1)".
It means, for example, P(a ≤ Y ≤ b) = ∫_a^b (1/√(2π)) exp(−y²/2) dy.
We know that P(Y = y) = 0 for any real number y. Y is an example of an absolutely continuous random variable.
Introductory courses on probability group random variables as
either discrete or continuous.
This is not completely correct. There are other types of random variables. For example, consider a random variable Z defined as follows: we toss a coin; if it is heads, we set Z = X, otherwise we set Z = Y.
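A quick simulation sketch of such a mixed random variable Z (assuming numpy): it puts positive mass on individual integers, yet also spreads probability continuously.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toss a fair coin; on heads set Z = X ~ Poisson(5), otherwise Z = Y ~ Normal(0, 1).
n = 100_000
heads = rng.random(n) < 0.5
z = np.where(heads, rng.poisson(5, n), rng.normal(0, 1, n))

# Z is neither discrete nor absolutely continuous:
# individual integers carry positive probability mass...
print(np.mean(z == 4))  # roughly 0.5 * P(Poisson(5) = 4), which is about 0.088
# ...yet Z also takes a continuum of values from the normal component.
print(z.min(), z.max())
```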
Probability measure
A mathematically rigorous probability theory is studied in the context of measure theory.
A probability measure space (or a probability triple), (Ω, F, P), is defined as follows:
Ω is a non-empty set referred to as the sample space (for example, the sample space for the Poisson distribution consists of all the non-negative integers).
F is a σ-algebra, which is a collection of measurable (i.e., the probability is defined) subsets of Ω (including Ω itself and the empty set ∅), all their complements, and their countable unions. That is, F is closed under complement (i.e., if A ∈ F, then A^c ∈ F) and under countable unions and intersections.
P is a probability measure mapping F to real numbers between 0 and 1 such that P(∅) = 0, P(Ω) = 1, and P is countably additive, i.e., if A_1, A_2, ... are disjoint subsets included in F, we have
P(A_1 ∪ A_2 ∪ ...) = P(A_1) + P(A_2) + ...
Some additional results
P(A^c) = 1 − P(A)
Monotonicity: if A ⊆ B, where A, B ∈ F, then P(A) ≤ P(B).
Countable sub-additivity: if A_1, A_2, ... ∈ F, which may not be disjoint in general, then P(∪_n A_n) ≤ Σ_n P(A_n).
P(A ∪ B) = P(A) + P(B) − P(A ∩ B), where A, B ∈ F may not be disjoint.
Some examples
Tossing a fair coin:
Ω = {H, T}
F = {∅, Ω, {H}, {T}}
P(∅) = 0, P(Ω) = 1, P(H) = P(T) = 1/2.
Tossing n fair coins:
Ω = {(x_1, x_2, ..., x_n)}, where x_i = 0, 1.
F = 2^Ω = {all subsets}
P(A) = |A| / 2^n
Poisson(5) distribution:
Ω = {0, 1, 2, ...}
F = 2^Ω = {all subsets}
P(A) = Σ_{k∈A} 5^k exp(−5)/k!, ∀ A ∈ F
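The finite examples above can be checked by brute force; the following sketch (my own illustration) enumerates Ω for n = 3 fair coins and computes P(A) = |A|/2^n for an example event:

```python
from itertools import product

# Omega is the set of all binary n-tuples; F is all subsets; P(A) = |A| / 2^n.
n = 3
omega = list(product([0, 1], repeat=n))   # 2^n equally likely outcomes

def prob(event):
    """P(A) = |A| / 2^n for any subset A of omega."""
    return len(event) / len(omega)

# Example event: at least two heads (coding 1 = head) in three tosses.
A = [w for w in omega if sum(w) >= 2]
print(prob(A))  # 4/8 = 0.5
```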
What about continuous distributions (where the sample space
is not countable) such as Uniform(0, 1)?
Some examples
The probability triple corresponding to the Uniform(0, 1) distribution is called the Lebesgue measure on [0, 1].
The sample space is obviously Ω = [0, 1].
To construct the corresponding σ-algebra, consider J as the set of all intervals (e.g., open, closed, half-open, singleton, etc.) contained in [0, 1].
Now add all the countable unions of intervals, their complements, their countable intersections, etc. (for the original intervals and those created later) to create a σ-algebra.
The smallest such σ-algebra, B = σ(J), is called the Borel σ-algebra, and each of its elements is called a Borel set.
For the Uniform(0, 1) distribution, the probability of each interval is equal to the length of that interval. That is, P([a, b]) = P([a, b)) = P((a, b]) = P((a, b)) = b − a for 0 ≤ a ≤ b ≤ 1 (to define P more precisely, we need to discuss outer measure and the extension theorem).
A similar procedure is used for other continuous distributions.
Random variables
A random variable assigns a numerical value to each possible outcome within a sample space, Ω.
Therefore, given a probability triple, (Ω, F, P), a random variable X is a measurable function from Ω to the real numbers R.
For example, we can define X(Tail) = 0 and X(Head) = 1.
Since X is measurable, we can talk about P(X = 0), by which we mean P(Tail). In general, we can talk about P(X ∈ B) for any Borel set B.
Another example: if (Ω, F, P) is the Lebesgue measure on [0, 1], we can define a random variable X(ω) = 3ω + 4 for all ω ∈ Ω.
Random variables
P(X > x) = P(ω ∈ Ω : X(ω) > x)
= P{ω ∈ Ω : 3ω + 4 > x}
= P{ω > (x − 4)/3}
Therefore,
P(X > x) = 1 for x ≤ 4; (7 − x)/3 for 4 ≤ x ≤ 7; 0 for x ≥ 7.
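A Monte Carlo check of this piecewise formula (my own sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(1)

# X(omega) = 3*omega + 4, with omega uniform on [0, 1] (the Lebesgue measure).
omega = rng.random(1_000_000)
X = 3 * omega + 4

for x in [3.0, 5.0, 6.5, 8.0]:
    empirical = np.mean(X > x)
    exact = 1.0 if x <= 4 else (7 - x) / 3 if x <= 7 else 0.0
    print(x, round(empirical, 3), round(exact, 3))  # the two columns agree
```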
Independence
Independence: Two events (or random variables) are independent if they do not affect each other's probability.
That is, knowing whether an event A has occurred does not change the probability of event B; we say:
P(A ∩ B)/P(A) = P(B)
or alternatively:
P(A ∩ B) = P(A)P(B)
We can extend this to any number of events presented as a collection {A_α}_{α∈I}:
P(A_{α_1} ∩ A_{α_2} ∩ ... ∩ A_{α_j}) = P(A_{α_1})P(A_{α_2})...P(A_{α_j})
for any choice of α_1, α_2, ..., α_j ∈ I.
Independence
Similarly, we can talk about the independence of random variables. Random variables X and Y are independent if
P(X ∈ S_1, Y ∈ S_2) = P(X ∈ S_1)P(Y ∈ S_2)
or alternatively
P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y), ∀ x, y ∈ R
For a collection of independent random variables, {X_α}_{α∈I}, we have
P(X_{α_i} ∈ S_i, 1 ≤ i ≤ n) = Π_{i=1}^{n} P(X_{α_i} ∈ S_i)
Note that if X and Y are independent, so are f(X) and g(Y).
Expected values
For simple random variables whose range is finite, we can represent the distinct values as x_1, x_2, ..., x_n and write X = Σ_{i=1}^{n} x_i 1_{A_i}, where 1_A is an indicator function such that
1_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.
The expected value (mean or expectation) for such variables is defined as:
μ_X = E(X) = E(Σ_{i=1}^{n} x_i 1_{A_i}) = Σ_{i=1}^{n} x_i P(A_i)
where A_i = {ω ∈ Ω : X(ω) = x_i}, and {A_i} is a finite partition (or in general any collection) of Ω.
Expected values
For example, if we toss a coin and define X(ω) = 10 if ω ∈ Head, and 20 if ω ∈ Tail, then E(X) = 10 × 1/2 + 20 × 1/2 = 15.
Another example: consider the Lebesgue measure, and let's define X as follows:
X(ω) = 4 if ω < 0.25, 6 if ω = 0.25, and 8 if ω > 0.25,
then E(X) = 4 × 1/4 + 6 × 0 + 8 × 3/4 = 7.
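The second example can be verified by simulation (a sketch assuming numpy; the ω = 0.25 case has probability 0 and can be ignored):

```python
import numpy as np

rng = np.random.default_rng(2)

# omega ~ Uniform(0, 1); X = 4 if omega < 0.25, 8 if omega > 0.25.
omega = rng.random(1_000_000)
X = np.where(omega < 0.25, 4.0, 8.0)

print(X.mean())  # close to the exact value E(X) = 7
```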
Some properties of expectation
E(1_A) = P(A)
E(c) = c
E(aX + bY) = aE(X) + bE(Y)
Expectation is order preserving, i.e., if X(ω) ≤ Y(ω) for all ω ∈ Ω, then E(X) ≤ E(Y)
|E(X)| ≤ E(|X|)
If X and Y are independent, then E(XY) = E(X)E(Y); note that the other direction does not always hold.
Some properties of expectation
If f(X) is a function of X, f: R → R, then f(X) itself is a simple random variable and can be written as f(X) = Σ_{i=1}^{n} f(x_i) 1_{A_i}, and E(f(X)) = Σ_{i=1}^{n} f(x_i) P(A_i).
In particular, if f(X) = (X − μ_X)², the expectation of f is the variance of X: Var(X) = E((X − μ_X)²), which leads to the well-known conclusion that Var(X) = E(X²) − E(X)² and also Var(X) ≤ E(X²).
Some other properties of variance
Var(aX + b) = a² Var(X)
Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y), where Cov(X, Y) = E((X − μ_X)(Y − μ_Y)) = E(XY) − E(X)E(Y)
If X and Y are independent, then Cov(X, Y) = 0 and Var(X + Y) = Var(X) + Var(Y).
The variance of X is in fact its second central moment.
In general, the k-th moment of a random variable is defined as E(X^k).
With some mathematical precautions, the above properties can be extended to non-simple random variables.
The integration connection
Similar to the integral, expectation has some nice properties such as linearity, order preservation, and so forth.
In fact, it can be shown that given a probability triple (Ω, F, P) and a measurable function X,
E(X) = ∫_Ω X dP = ∫_Ω X(ω) P(dω),
which is the integral of X with respect to the probability measure.
If (Ω, F, P) is the Lebesgue measure on [0, 1] and X is Riemann integrable, then the above integral is the common calculus-style integral: E(X) = ∫_0^1 X(t) dt.
Even if X is not Riemann integrable (but is nevertheless a measurable function with respect to the Lebesgue measure), we can still get the expectation, which in this case is called the Lebesgue integral. That is, the Lebesgue integral is a generalization of the Riemann integral.
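A numerical sketch (assuming numpy and scipy) for the earlier example X(ω) = 3ω + 4 on the Lebesgue measure: the calculus-style integral and a simulation give the same expectation.

```python
import numpy as np
from scipy.integrate import quad

# E(X) as the ordinary calculus integral of X(t) = 3t + 4 over [0, 1].
exact, _ = quad(lambda t: 3 * t + 4, 0, 1)   # = 3/2 + 4 = 5.5

# The same number from simulation under omega ~ Uniform(0, 1).
rng = np.random.default_rng(3)
approx = np.mean(3 * rng.random(1_000_000) + 4)

print(exact, approx)  # both close to 5.5
```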
Distributions
Given a random variable X on a probability triple (Ω, F, P), its distribution is a probability measure μ_X on the sample space R (with the Borel σ-algebra) defined as
μ_X(B) = P(X ∈ B), ∀ B Borel.
Moreover, the cumulative distribution function of a random variable X is defined as F_X(x) = P(X ≤ x) for x ∈ R.
Note that lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.
Recall that we defined the expected value of a measurable function f(X) as E(f(X)) = ∫_Ω f(X(ω)) P(dω) with respect to the probability measure P.
Alternatively, we can define E(f(X)) = ∫_R f(t) μ_X(dt) = ∫_R f(t) dμ_X. This is known as the change of variable theorem.
Some simple distributions
One simple distribution is the point mass δ_c, which is the distribution of a random variable X with P(X = c) = 1.
Another simple distribution is the Poisson(λ) distribution, with μ = Σ_{j=0}^{∞} (λ^j exp(−λ)/j!) δ_j.
The Normal(0, 1) distribution is defined as μ_N(B) = ∫_R f(t) 1_B(t) λ(dt), where f(x) = (1/√(2π)) exp(−x²/2) is called the density function, and λ is the Lebesgue measure on R.
If a distribution, μ, has a density, f, then instead of taking the integral of a function g(t) with respect to μ, we can take the integral of g(t)f(t) with respect to λ:
∫_R g(t) μ(dt) = ∫_R g(t) f(t) λ(dt)
That is, for such cases we mainly take a calculus-style integral using the density function.
Convergence
Convergence with probability 1 (almost surely): P(lim_{n→∞} X_n = X) = 1.
Convergence in probability: lim_{n→∞} P(|X_n − X| ≥ ε) = 0.
Convergence in distribution: lim_{n→∞} P(X_n ≤ x) = P(X ≤ x) at continuity points of F_X.
Convergence with probability 1 implies convergence in probability, which implies convergence in distribution.
Weak law of large numbers: Let X_1, X_2, ... be a sequence of independent random variables with the same mean m and finite variance; then their partial average, (1/n)(X_1 + X_2 + ... + X_n), converges in probability to m.
Strong law of large numbers: If, besides the above conditions, the fourth moment E((X_i − m)^4) is also finite, the partial average converges to m with probability 1 (i.e., almost surely).
Central limit theorem: Let X_1, X_2, ... be iid with finite mean m and finite variance v. Set S_n = X_1 + X_2 + ... + X_n. Then as n → ∞, (S_n − nm)/√(nv) converges in distribution to Z ∼ N(0, 1).
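These limit theorems are easy to see by simulation; the following sketch (my own, using Exponential(1) variables, for which m = v = 1) illustrates the LLN and the CLT:

```python
import numpy as np

rng = np.random.default_rng(4)

# iid Exponential(1) variables: mean m = 1, variance v = 1.
m, v, n, reps = 1.0, 1.0, 1_000, 10_000
samples = rng.exponential(1.0, size=(reps, n))

# LLN: partial averages concentrate around m.
partial_avg = samples.mean(axis=1)
print(partial_avg.mean())   # close to 1

# CLT: (S_n - n*m) / sqrt(n*v) looks standard normal.
z = (samples.sum(axis=1) - n * m) / np.sqrt(n * v)
print(z.mean(), z.std())    # close to 0 and 1
```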
Conditional probability and expectation
Conditional probability is simply defined as P(A|B) = P(A ∩ B)/P(B), where P(B) > 0. This is the proportion of the event B that also includes the event A. In other words, you investigate the event A in a smaller subset of the sample space where the event B occurs.
Similarly, for random variables, we can define the conditional distribution as P(Y ∈ S|B) = P(Y ∈ S, B)/P(B).
Using this newly constructed (conditional) distribution, ν, we can define conditional expectations in the usual way: E(Y|B) = ∫ y ν(dy), E(f(Y)|B) = ∫ f(y) ν(dy).
This seems quite straightforward as long as P(B) > 0. But what happens when P(B) = 0? For example, can we discuss P(A|X = 0.5), where X is a random variable with a Uniform(0, 1) distribution?
Conditional probability and expectation
To resolve this issue, we regard the conditional probability P(A|X) and expectation E(Y|X) as themselves being random variables that are functions of X. These new random variables should have the correct expected values:
E(P(A|X)) = P(A), E(E(Y|X)) = E(Y)
However, having the above correct expected values is not enough to specify the distributions of P(A|X) and E(Y|X).
More specifically, we need these to hold for any Borel set S:
E(P(A|X) 1_{X∈S}) = P(A ∩ {X ∈ S})
E(E(Y|X) 1_{X∈S}) = E(Y 1_{X∈S})
Since the above expectations would not be affected by changes on a set of measure 0, the above definitions are only unique up to a set of measure 0. We can in fact change P(A|X = x) without restriction whenever P(X = x) = 0.
Conditional probability and expectation
Also note that when S = R, we again obtain
E(P(A|X)) = P(A), E(E(Y|X)) = E(Y)
Some useful properties:
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
If A is independent of B, then P(A|B) = P(A) and P(A ∩ B) = P(A)P(B)
The total probability rule: if B_1, B_2, ..., B_n partition the sample space (i.e., they are mutually exclusive and ∪_{i=1}^{n} B_i = Ω), then
P(A) = P(A|B_1)P(B_1) + P(A|B_2)P(B_2) + ... + P(A|B_n)P(B_n)
The multiplication rule: if A_1, A_2, ..., A_n is a sequence of events, then
P(A_1 ∩ A_2 ∩ ... ∩ A_n) = P(A_1)P(A_2|A_1)P(A_3|A_1 ∩ A_2)...P(A_n|A_1 ∩ ... ∩ A_{n−1})
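A small numerical sketch of the total probability and multiplication rules (the two-urn numbers below are invented):

```python
# B1: pick urn 1 (prob 0.4); B2: pick urn 2 (prob 0.6); A: draw a red ball.
p_b = {"B1": 0.4, "B2": 0.6}
p_a_given_b = {"B1": 0.5, "B2": 0.25}

# Total probability rule: P(A) = sum_i P(A|Bi) P(Bi)
p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)
print(p_a)  # 0.4*0.5 + 0.6*0.25 = 0.35

# Multiplication rule (two events): P(A and B1) = P(B1) P(A|B1)
p_a_and_b1 = p_b["B1"] * p_a_given_b["B1"]
# ...which also gives the conditioning the other way (Bayes-style):
p_b1_given_a = p_a_and_b1 / p_a
print(p_b1_given_a)  # 0.2 / 0.35, about 0.571
```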
Conditional probabilities play a very important role in Bayesian statistics, and we will discuss them more in the future.
