
Probabilistic Graphical Models
Lecture 17: EM
CS/CNS/EE 155
Andreas Krause

Announcements
Project poster session on Thursday Dec 3, 4-6pm
in Annenberg 2nd floor atrium!
Easels, poster boards and cookies will be provided!

Final writeup (8 pages NIPS format) due Dec 9

Approximate inference
Three major classes of general-purpose approaches
Message passing
E.g.: Loopy Belief Propagation (today!)

Inference as optimization
Approximate posterior distribution by simple distribution
Mean field / structured mean field
Assumed density filtering / expectation propagation

Sampling based inference


Importance sampling, particle filtering
Gibbs sampling, MCMC

Many other alternatives (often for special cases)



Sample approximations of expectations


x1, …, xN samples from RV X
Law of large numbers:
  (1/N) Σi f(xi) → E[f(X)]  as N → ∞
Here, the convergence is with probability 1
(almost sure convergence)
Finite samples: for any finite N, the estimate (1/N) Σi f(xi) carries sampling error, which shrinks as O(1/√N)
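As an illustration (not part of the original slides), a minimal Monte Carlo sketch of this sample-average approximation in Python; the distribution and the function f below are arbitrary choices:

import numpy as np

# Minimal sketch: estimate E[f(X)] for X ~ N(0, 1) with f(x) = x^2 (true value 1).
rng = np.random.default_rng(0)
N = 100_000
samples = rng.normal(loc=0.0, scale=1.0, size=N)   # x_1, ..., x_N drawn i.i.d. from P(X)
estimate = (samples ** 2).mean()                    # (1/N) * sum_i f(x_i)
print(estimate)                                     # close to E[X^2] = 1 for large N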

Monte Carlo sampling from a BN


Sort variables in topological ordering X1, …, Xn
For i = 1 to n do
  Sample xi ~ P(Xi | X1 = x1, …, Xi-1 = xi-1)

Works even with high-treewidth models!

[Figure: example Bayesian network over variables C, D, I, G, L, J, H]
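A minimal sketch of forward (ancestral) sampling, assuming a hypothetical three-variable chain D → G → L with made-up CPTs (not the network from the slide):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary chain D -> G -> L; all CPT numbers below are made up.
P_D = np.array([0.6, 0.4])            # P(D)
P_G_given_D = np.array([[0.7, 0.3],   # P(G | D=0)
                        [0.2, 0.8]])  # P(G | D=1)
P_L_given_G = np.array([[0.9, 0.1],   # P(L | G=0)
                        [0.4, 0.6]])  # P(L | G=1)

def sample_once():
    # Visit variables in topological order; sample each given its already-sampled parents.
    d = rng.choice(2, p=P_D)
    g = rng.choice(2, p=P_G_given_D[d])
    l = rng.choice(2, p=P_L_given_G[g])
    return d, g, l

samples = [sample_once() for _ in range(10_000)]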

Computing probabilities through sampling


Want to estimate probabilities
Draw N samples from the BN

[Figure: example Bayesian network over variables C, D, I, G, L, J, H]

Marginals: P(X = x) ≈ Count(x) / N

Conditionals: P(X = x | E = e) ≈ Count(x, e) / Count(e), i.e., keep only the samples consistent with the evidence e (rejection sampling)

Rejection sampling problematic for rare events
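A minimal sketch of these estimators, assuming a hypothetical two-variable network D → L with made-up CPTs; the marginal uses all samples, the conditional rejects samples inconsistent with the evidence:

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-node BN D -> L with made-up numbers; draw N forward samples.
N = 200_000
d = rng.random(N) < 0.3                     # D=1 with probability 0.3
l = rng.random(N) < np.where(d, 0.8, 0.1)   # P(L=1 | D=1)=0.8, P(L=1 | D=0)=0.1

# Marginal: P(L=1) ~= Count(L=1) / N
print(l.mean())                             # about 0.3*0.8 + 0.7*0.1 = 0.31
# Conditional by rejection: keep only samples consistent with the evidence D=1,
# then P(L=1 | D=1) ~= Count(L=1, D=1) / Count(D=1)
print(l[d].mean())                          # about 0.8
# If the evidence were rare, almost all samples would be thrown away;
# this is why rejection sampling is problematic for rare events.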

Sampling from intractable distributions


Given unnormalized distribution
P(X) ∝ Q(X)
Q(X) efficient to evaluate, but normalizer intractable
For example, Q(X) = ∏j ψj(Cj)
Want to sample from P(X)
Ingenious idea:
Can create Markov chain that is efficient to simulate
and that has stationary distribution P(X)

Markov Chain Monte Carlo


Given an unnormalized distribution Q(x)
Want to design a Markov chain with stationary distribution
π(x) = (1/Z) Q(x)
Need to specify transition probabilities P(x' | x)!

Designing Markov Chains


1) Proposal distribution R(X' | X)
Given Xt = x, sample proposal x' ~ R(X' | X = x)
Performance of algorithm will strongly depend on R

2) Acceptance distribution:
Suppose Xt = x
With probability α = min{1, [Q(x') R(x | x')] / [Q(x) R(x' | x)]},
set Xt+1 = x'
With probability 1 − α, set Xt+1 = x

Theorem [Metropolis, Hastings]: The stationary
distribution is Z⁻¹ Q(x)
Proof: Markov chain satisfies detailed balance condition!
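A minimal Metropolis–Hastings sketch, assuming a one-dimensional unnormalized density Q and a symmetric Gaussian proposal (both arbitrary choices, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target Q(x): here an unnormalized standard Gaussian.
def Q(x):
    return np.exp(-0.5 * x**2)   # normalizer deliberately ignored

x = 0.0
samples = []
for t in range(50_000):
    x_prop = x + rng.normal(scale=1.0)       # x' ~ R(X' | X = x), symmetric proposal
    alpha = min(1.0, Q(x_prop) / Q(x))       # the R terms cancel for a symmetric proposal
    if rng.random() < alpha:
        x = x_prop                           # accept: X_{t+1} = x'
    samples.append(x)                        # otherwise keep X_{t+1} = x

print(np.mean(samples), np.var(samples))     # roughly 0 and 1 after burn-in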

Gibbs sampling
Start with initial assignment x(0) to all variables
For t = 1 to ∞ do
  Set x(t) = x(t-1)
  For each variable Xi
    Set vi = values of all variables in x(t) except xi
    Sample xi(t) from P(Xi | vi)

Gibbs sampling satisfies the detailed balance
equation for P
Can efficiently compute conditional
distributions P(Xi | vi) for graphical models
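A minimal Gibbs sampling sketch, assuming a toy pairwise model over two binary variables with unnormalized joint Q(x1, x2) = exp(w · x1 · x2); the model and weight are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)

w = 1.5   # made-up pairwise weight

def conditional(x_other):
    # P(X_i = 1 | X_other = x_other) = exp(w * x_other) / (1 + exp(w * x_other))
    return 1.0 / (1.0 + np.exp(-w * x_other))

x = np.array([0, 0])            # initial assignment x^(0)
samples = []
for t in range(100_000):
    for i in (0, 1):            # resample each variable given the rest
        x[i] = rng.random() < conditional(x[1 - i])
    samples.append(x.copy())

samples = np.array(samples)
print(samples.mean(axis=0))     # empirical marginals P(X_i = 1)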

Summary of Sampling
Randomized approximate inference for computing
expectations, (conditional) probabilities, etc.
Exact in the limit
But may need ridiculously many samples

Can even directly sample from intractable distributions


Disguise distribution as stationary distribution of Markov Chain
Famous example: Gibbs sampling


Summary of approximate inference


Deterministic and randomized approaches
Deterministic
Loopy BP
Mean field inference
Assumed density filtering

Randomized
Forward sampling
Markov Chain Monte Carlo
Gibbs Sampling


Recall: The light side


Assumed
everything fully observable
low treewidth
no hidden variables

Then everything is nice


Efficient exact inference in large models
Optimal parameter estimation without local minima
Can even solve some structure learning tasks exactly


The dark side


[Figure: states of the world s1, …, s12 and sensor measurements, represented by a graphical model]

In the real world, these assumptions are often violated…
Still want to use graphical models to solve interesting problems…

Remaining Challenges
Inference
Approximate inference for high-treewidth models

Learning
Dealing with missing data

Representation
Dealing with hidden variables


Learning general BNs


                   Known structure    Unknown structure
Fully observable   Easy!              Hard
Missing data       ?                  ?


Dealing with missing data


So far, have assumed all variables
are observed in each training example

[Figure: example Bayesian network over variables C, D, I, G, L, J, H]

In practice, often have missing data
Some variables may never be observed
Missing variables may be different
for each example


Gaussian Mixture Modeling


Learning with missing data


Suppose X are the observed variables, Z the hidden variables
Training data: x(1), x(2), …, x(N)
Marginal likelihood:
  log P(D | θ) = Σj log P(x(j) | θ) = Σj log Σz P(x(j), z | θ)

Marginal likelihood doesn't decompose



Intuition: EM Algorithm
Iterative algorithm for parameter learning in case of
missing data
EM Algorithm
Expectation Step: Hallucinate hidden values
Maximization Step: Train model as if data were fully
observed
Repeat

Will converge to local maximum


E-Step:
x: observed data; z: hidden data
Hallucinate missing values by computing
distribution over hidden variables using current
parameter estimate:
For each example x(j), compute:
Q(t+1)(z | x(j)) = P(z | x(j), θ(t))


Towards M-step: Jensen inequality


Marginal likelihood doesn't decompose

Theorem [Jensen's inequality]:
For any distribution P(z) and positive function f(z),
  log Σz P(z) f(z) ≥ Σz P(z) log f(z)

Lower-bounding marginal likelihood


Jensen's inequality:
From E-step:
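The equations on this slide did not survive extraction; presumably they combine Jensen's inequality with the E-step distribution Q(t+1) in the standard way, roughly:

\log P(x^{(j)} \mid \theta)
  = \log \sum_z Q^{(t+1)}(z \mid x^{(j)}) \,\frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})}
  \;\ge\; \sum_z Q^{(t+1)}(z \mid x^{(j)}) \,\log \frac{P(x^{(j)}, z \mid \theta)}{Q^{(t+1)}(z \mid x^{(j)})}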


Lower bound on marginal likelihood


Bound on the marginal likelihood with hidden variables

Recall: Likelihood in fully observable case:

Lower bound can be interpreted as a weighted data set


M-step: Maximize lower bound


Lower bound:

Choose (t+1) to maximize lower bound

Use expected sufficient statistics (counts). Will see:
Whenever we used Count(x, z) in the fully observable case,
replace it by EQ(t+1)[Count(x, z)]

Coordinate Ascent Interpretation


Define energy function

For any distribution Q and parameters :

EM algorithm performs coordinate ascent on F:

Monotonically converges to local maximum
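The energy (free-energy) function and updates on this slide were images; a sketch of the standard formulation they presumably refer to:

F(Q, \theta) = \mathbb{E}_{Q}\!\left[\log P(x, z \mid \theta)\right] + H(Q) \;\le\; \log P(x \mid \theta),
\quad \text{with equality iff } Q(z) = P(z \mid x, \theta)

\text{E-step: } Q^{(t+1)} = \arg\max_{Q} F(Q, \theta^{(t)}) = P(z \mid x, \theta^{(t)})
\qquad
\text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} F(Q^{(t+1)}, \theta)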


EM for Gaussian Mixtures
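The worked example on this slide was graphical; below is a minimal EM sketch for a one-dimensional mixture of two Gaussians on synthetic data (all numbers made up for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 30% from N(-2, 1), 70% from N(3, 1).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])
N, K = len(x), 2

pi = np.full(K, 1.0 / K)      # mixing weights
mu = rng.normal(size=K)       # component means
var = np.ones(K)              # component variances

for it in range(100):
    # E-step: responsibilities Q(z = k | x_j) under the current parameters
    log_lik = -0.5 * ((x[:, None] - mu) ** 2 / var + np.log(2 * np.pi * var))
    resp = pi * np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from expected sufficient statistics
    Nk = resp.sum(axis=0)
    pi = Nk / N
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)   # should approach (0.3, 0.7), (-2, 3), (1, 1) up to label order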


EM Iterations [by Andrew Moore]


EM in Bayes Nets
Complete data likelihood

[Figure: two-variable Bayesian network with nodes A and B]


EM in Bayes Nets
Incomplete data likelihood

[Figure: two-variable Bayesian network with nodes A and B]


E-Step for BNs


Need to compute Q(t+1)(z | x(j)) = P(z | x(j), θ(t))
For fixed z, x: Can compute using inference
Naively specifying full distribution would be intractable

[Figure: two-variable Bayesian network with nodes A and B]


M-step for BNs


Can optimize each CPT independently!
MLE in fully observed case:

MLE with hidden data:
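The two formulas here were lost in extraction; a sketch of what they presumably are, writing u for an assignment to the parents PaXi:

\text{Fully observed: }\;
\hat{\theta}_{x \mid u} = \frac{\mathrm{Count}(X_i = x, \mathrm{Pa}_{X_i} = u)}{\sum_{x'} \mathrm{Count}(X_i = x', \mathrm{Pa}_{X_i} = u)}

\text{With hidden data: }\;
\hat{\theta}_{x \mid u} = \frac{\mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(X_i = x, \mathrm{Pa}_{X_i} = u)]}{\sum_{x'} \mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(X_i = x', \mathrm{Pa}_{X_i} = u)]}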


Computing expected counts

Suppose we observe O=o


Variables A are hidden
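A sketch of the expected-count computation this slide presumably illustrates, with o(j) the observed part of example j (the hidden variables are summed out by inference):

\mathbb{E}_{Q^{(t+1)}}[\mathrm{Count}(X_i = x, \mathrm{Pa}_{X_i} = u)]
  = \sum_{j=1}^{N} P(X_i = x, \mathrm{Pa}_{X_i} = u \mid o^{(j)}, \theta^{(t)})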


Learning general BNs


                   Known structure    Unknown structure
Fully observable   Easy!              Hard (2.)
Missing data       EM                 Now


Structure learning with hidden data


Fully observable case:
Score(D;G) = likelihood of data under most likely parameters
Decomposes over families
Score(D;G) = Σi FamScorei(Xi | PaXi)
Can recompute score efficiently after adding/removing edges

Incomplete data case:


Score(D;G) = lower bound from EM
Does not decompose over families
Search is very expensive

Structure-EM: Iterate
Computation of expected counts
Multiple iterations of structure search for fixed counts

Guaranteed to monotonically improve likelihood score


Hidden variable discovery


Sometimes, the invention of a hidden variable can
drastically simplify the model


Learning general BNs


                   Known structure    Unknown structure
Fully observable   Easy!              Hard (2.)
Missing data       EM                 Structure-EM

