Hidden Markov Model (HMM)
Samudravijaya K
Tata Institute of Fundamental Research, Mumbai
chief@tifr.res.in
09-JAN-2009
Majority of the slides are taken from S.Umesh’s tutorial on ASR (WiSSAP 2006).
Pattern Recognition

[Block diagram: input signal → signal processing → pattern matching → output. A model, generated during the training phase, is used for matching during the testing phase.]
Bayes’ rule

p(A|B) = p(B|A) p(A) / p(B)

p(A|B) = p(B|A) p(A) / Σ_i p(B|A_i) p(A_i)
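As a numerical sanity check of the second form of Bayes’ rule, here is a minimal Python sketch; the two populations and all probability values are made-up illustrations, not data from the slides:

```python
# Hypothetical two-class example: the populations A1 = "pygmy", A2 = "bushman".
# Priors p(Ai) and likelihoods p(B|Ai) of observing a height B = 4'1''.
priors = {"pygmy": 0.5, "bushman": 0.5}          # assumed equal priors
likelihoods = {"pygmy": 0.30, "bushman": 0.02}   # assumed likelihood values

# p(A|B) = p(B|A) p(A) / sum_i p(B|Ai) p(Ai)
evidence = sum(likelihoods[a] * priors[a] for a in priors)
posterior = {a: likelihoods[a] * priors[a] / evidence for a in priors}
```

The posterior concentrates on the class whose likelihood of the observation is higher, exactly as in the height example below.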
Note: here the means and variances are fixed; only the observation, x, changes.
Also see: Pr(x = 4′1′′) ≫ Pr(x = 5′) and Pr(x = 4′1′′) ≫ Pr(x = 3′)
If we observe the heights of many persons – say 3′6′′, 4′1′′, 3′8′′, 4′5′′, 4′7′′, 4′, 6′5′′ – and all are from the same population (i.e. either pygmy or bushman),
⇒ then the more certain we are that the population is pygmy.
L(X; θ) = p(x_0 ... x_{N−1}; θ) = ∏_{i=0}^{N−1} p(x_i; θ)

        = 1/(2πσ²)^{N/2} · exp( −(1/2σ²) Σ_{i=0}^{N−1} (x_i − θ)² )        (3)
Given: x_0 ... x_{N−1} ⇒ what can we say about the value of θ, i.e. what is the best estimate of θ?
Maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6′′) and L(x[0]; θ = 6′) ⇒ choose θ̂ = 4′6′′.
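The comparison of likelihoods at a few candidate values of θ can be sketched as follows; the observations are the heights above converted to feet, while σ = 0.25 and the candidate list are assumed values chosen for illustration:

```python
import math

def log_likelihood(xs, theta, sigma=0.25):
    # Gaussian log-likelihood of eq. (3); sigma is an assumed, fixed value
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - theta) ** 2 for x in xs) / (2 * sigma ** 2))

# heights from the slide, in feet (4'1'' = 4 + 1/12, and so on)
xs = [3.5, 4 + 1/12, 3 + 8/12, 4 + 5/12, 4 + 7/12, 4.0, 6 + 5/12]
candidates = [3.0, 4.5, 6.0]            # θ = 3', 4'6'', 6'
best = max(candidates, key=lambda t: log_likelihood(xs, t))
```

Maximising over the three candidates picks θ̂ = 4.5 (i.e. 4′6′′), matching the slide.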
Given x[0], x[1], ..., x[N−1] and a pdf parameterised by θ = [θ_1, θ_2, ..., θ_{m−1}]ᵀ,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(x_i; θ)
p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) · p(θ)

i.e. posterior ∝ likelihood × prior.
Then: (μ̂)_Bayesian = (σ² γ + n ν² x̄) / (σ² + n ν²)

⇒ a weighted average of the sample mean and the prior mean
p(x) = Σ_{m=1}^{M} w_m N(x; μ_m, σ_m),    with Σ_i w_i = 1
Characteristics of GMM:
• Just like ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
• Each component generates data from a Gaussian with mean µm and covariance
matrix Σm.
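A GMM density of this form can be evaluated directly; below is a small sketch with made-up weights, means and (scalar) variances, numerically checking that the mixture still integrates to one:

```python
import math

def gauss_pdf(x, mu, sigma):
    # univariate Gaussian density N(x; mu, sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gmm_pdf(x, weights, mus, sigmas):
    # p(x) = sum_m w_m N(x; mu_m, sigma_m), with sum_m w_m = 1
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# made-up 3-component mixture
weights, mus, sigmas = [0.3, 0.5, 0.2], [-2.0, 0.0, 3.0], [1.0, 0.5, 1.5]

# crude Riemann sum over [-10, 10]: the mixture density should integrate to ~1
grid = [k * 0.01 for k in range(-1000, 1001)]
integral = sum(gmm_pdf(x, weights, mus, sigmas) for x in grid) * 0.01
```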
Experiment: an urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

pdf parameters: c_1, c_2, c_3, μ_1, μ_2, μ_3, σ_1, σ_2, σ_3

Therefore: p(x[i]; θ) = c_1 N(x; μ_1, σ_1) + c_2 N(x; μ_2, σ_2) + c_3 N(x; μ_3, σ_3)
arg max_θ p(X; θ) = arg max_θ ∏_{i=1}^{N} p(x_i; θ)        (5)
Obs:   x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp.:    1    2    2    1    3    1    3    3    2    2     3     3     3
Miss:  y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]
z = (x; y) is the complete data ⇒ the observations and which density each came from:

p(z; θ) = p(x, y; θ)

⇒ θ̂_MLE = arg max_θ p(z; θ)

Question: but how do we find which observation belongs to which density?
p(y[0] = 1 | x[0]; θ^g)
  = p(y[0] = 1, x[0]; θ^g) / p(x[0]; θ^g)
  = p(x[0] | y[0] = 1; μ_1^g, σ_1^g) · p(y[0] = 1) / Σ_{j=1}^{3} p(x[0] | y[0] = j; θ^g) p(y[0] = j)
  = p(x[0] | y[0] = 1; μ_1^g, σ_1^g) · c_1^g / [ p(x[0] | y[0] = 1; μ_1^g, σ_1^g) c_1^g + p(x[0] | y[0] = 2; μ_2^g, σ_2^g) c_2^g + ... ]

All parameters are known ⇒ we can calculate p(y[0] = 1 | x[0]; θ^g).
(Similarly calculate p(y[0] = 2 | x[0]; θ^g) and p(y[0] = 3 | x[0]; θ^g).)
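This E-step computation is just a per-observation application of Bayes’ rule; a minimal sketch follows, where the guessed parameters θ^g are made-up values, not numbers from the slides:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def responsibilities(x, c, mu, sigma):
    # p(y = j | x; theta_g) for each component j, given guessed parameters
    joint = [c[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(len(c))]
    total = sum(joint)                     # = p(x; theta_g)
    return [p / total for p in joint]

# guessed parameters theta_g (illustrative values)
c, mu, sigma = [1/3, 1/3, 1/3], [0.0, 4.0, 8.0], [1.0, 1.0, 1.0]
post = responsibilities(3.5, c, mu, sigma)   # posterior over the 3 components
```

The responsibilities sum to one, and the component whose mean is nearest the observation dominates.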
Updated parameters: ĉ_1 = 2/7, ĉ_2 = 4/7, ĉ_3 = 1/7 (different from the initial guess!)

Similarly (for a Gaussian) find μ̂_i, σ̂_i² for the i-th pdf, using p(y[j] = i | x[j]; θ^g):

(σ_1²)^new = Σ_{i=1}^{N} (x_i − μ̂_1)² · p(y[i] = 1 | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)
Compute p[y[i] = j | x[i]; θ^g] for i = 1, 2, ..., N ⇒ one term per observation.

4. μ_j^new = Σ_{i=1}^{N} x_i · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

5. (σ_j²)^new = Σ_{i=1}^{N} (x_i − μ̂_j)² · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)
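These update steps can be iterated into a full EM loop for a 1-D GMM. The sketch below uses a made-up 7-point dataset and a rough initial guess; note how the re-estimated weights come out as effective counts divided by N:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def em_step(xs, c, mu, sigma):
    # E-step: responsibilities p(y[i] = j | x[i]; theta_g)
    R = []
    for x in xs:
        joint = [c[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(len(c))]
        s = sum(joint)
        R.append([p / s for p in joint])
    # M-step: re-estimate weights, means and variances (steps 4 and 5)
    n = [sum(r[j] for r in R) for j in range(len(c))]      # effective counts
    c_new = [nj / len(xs) for nj in n]
    mu_new = [sum(R[i][j] * xs[i] for i in range(len(xs))) / n[j]
              for j in range(len(c))]
    var_new = [sum(R[i][j] * (xs[i] - mu_new[j]) ** 2 for i in range(len(xs))) / n[j]
               for j in range(len(c))]
    return c_new, mu_new, [max(v, 1e-6) ** 0.5 for v in var_new]

# made-up data clustered near 0 and 5, with a rough initial guess
xs = [0.1, -0.2, 0.3, 5.1, 4.8, 5.2, 4.9]
c, mu, sigma = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
for _ in range(20):
    c, mu, sigma = em_step(xs, c, mu, sigma)
```

After a few iterations the means settle near the two cluster centres and the weights near 3/7 and 4/7.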
Practical Issues

BIC(G | X) = log p(X | Ĝ) − (d/2) log N

where
  Ĝ represents the GMM with the ML parameter configuration,
  d represents the number of parameters in G,
  N is the size of the dataset.
BIC selects the best GMM corresponding to the largest BIC value by trading off
these two terms.
The BIC criterion can discover the true GMM size effectively as shown in the figure.
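The trade-off can be made concrete with a toy model-selection sweep. The log-likelihood values below are hypothetical numbers, and the parameter count d = 3M − 1 assumes a 1-D GMM with M weights (one constraint), M means and M variances:

```python
import math

def bic(log_likelihood, num_params, n):
    # BIC(G | X) = log p(X | G_hat) - (d/2) log N
    return log_likelihood - 0.5 * num_params * math.log(n)

# hypothetical fits: a larger GMM fits at least as well (higher log-likelihood),
# but the (d/2) log N penalty can still favour the smaller model
n = 1000
scores = {
    2: bic(-2100.0, 2 * 3 - 1, n),
    4: bic(-2080.0, 4 * 3 - 1, n),
    8: bic(-2075.0, 8 * 3 - 1, n),
}
best_M = max(scores, key=scores.get)
```

With these (made-up) numbers the penalty outweighs the likelihood gain, so the smallest mixture wins.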
• However, one may have access to a large number of similar examples which can be utilized.
• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.
ML Estimation
MAP Estimation
Train a prior model p with a large amount of available data (say, from multiple
speakers). Adapt the parameters to a new speaker using some adaptation data (X).
Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.
weights

ŵ_j = [ α w_j^p + (1 − α) Σ_i p(j | x_i) ] γ

Here γ is a normalization factor such that Σ_j ŵ_j = 1.

means

μ̂_j = α μ_j^p + (1 − α) Σ_i p(j | x_i) x_i / Σ_i p(j | x_i)
Weighted average of sample mean and a prior mean
variances

σ̂_j = α ( σ_j^p + μ_j^p μ_j^{p′} ) + (1 − α) Σ_i p(j | x_i) x_i x_i′ / Σ_i p(j | x_i) − μ̂_j μ̂_j′
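A sketch of the mean-update rule alone (the weight and variance updates follow the same interpolation pattern); the prior means, adaptation data and responsibilities below are all made up, and α = 0.7 is an arbitrary choice:

```python
def map_adapt_means(xs, resp, prior_mu, alpha=0.7):
    # mu_hat_j = alpha * mu_j^p + (1 - alpha) * sum_i p(j|x_i) x_i / sum_i p(j|x_i)
    adapted = []
    for j, mu_p in enumerate(prior_mu):
        num = sum(resp[i][j] * xs[i] for i in range(len(xs)))
        den = sum(resp[i][j] for i in range(len(xs)))
        # fall back to the prior mean if a component saw no adaptation data
        adapted.append(alpha * mu_p + (1 - alpha) * (num / den if den > 0 else mu_p))
    return adapted

# prior (e.g. speaker-independent) means, plus a little adaptation data with
# fixed responsibilities; all numbers are illustrative
prior_mu = [0.0, 5.0]
xs = [0.8, 1.0, 5.6]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_mu = map_adapt_means(xs, resp, prior_mu, alpha=0.7)
```

Each adapted mean sits between the prior mean and the data mean, with α controlling how far it moves.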
θ̂_aa = arg max_{θ_aa} p(O; θ_aa)
Also need to model the time durations of these time-intervals – transition probs.
[Figure: a 3-state left-to-right HMM with transition probabilities a_11, a_12, ...; each state i has its own output pdf p_i(·) (e.g. p(·; aa) for the model of /aa/); the model emits the observation sequence o_1, o_2, o_3, ..., o_10.]
Note

α_j(t) ⇒ sum of the probabilities of all paths ending at node j at time t with partial observation sequence o_1, o_2, ..., o_t

p(O|Λ) = Σ_{j=1}^{N} P{o_1, o_2, ..., o_T, s_T = j | Λ} = Σ_{j=1}^{N} α_j(T)
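The forward recursion can be sketched for a discrete-observation HMM; the 2-state model below uses made-up numbers:

```python
def forward(A, B, pi, obs):
    # alpha_j(t): prob. of partial observations o_1..o_t, ending in state j at t
    N = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]        # initialisation
    for t in range(1, len(obs)):
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j(o_t)
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)          # p(O | Lambda) = sum_j alpha_j(T)

# small illustrative 2-state HMM with binary observations (made-up numbers)
A = [[0.7, 0.3], [0.4, 0.6]]        # a_ij
B = [[0.9, 0.1], [0.2, 0.8]]        # b_j(o)
pi = [0.5, 0.5]
p_obs = forward(A, B, pi, [0, 1, 0])
```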
β_1(t = 2) = p{o_3 | s_{t=2} = 1, Λ}
           = p{o_3, s_{t=3} = 1 | s_{t=2} = 1; Λ} + p{o_3, s_{t=3} = 2 | s_{t=2} = 1, Λ}
           = p{o_3 | s_{t=3} = 1, s_{t=2} = 1, Λ} · p{s_{t=3} = 1 | s_{t=2} = 1, Λ}
             + p{o_3 | s_{t=3} = 2, s_{t=2} = 1, Λ} · p{s_{t=3} = 2 | s_{t=2} = 1, Λ}
           = b_1(o_3) · a_11 + b_2(o_3) · a_12
β_i(t) = Σ_{j=1}^{N} [ a_ij b_j(o_{t+1}) ] · β_j(t+1)
         (going to each node j from node i)   (prob. of observations o_{t+2} ... o_T given we are in node j at t+1)

       = p{o_{t+1}, ..., o_T | s_t = i, Λ}
p(O|Λ) = α_N(T) = Σ_{j=1}^{N} α_j(t) β_j(t)   (for any t)
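The identity p(O|Λ) = Σ_j α_j(t) β_j(t) holds at every t, which gives a handy consistency test between the two recursions; a sketch using the same style of made-up 2-state model:

```python
def forward_all(A, B, pi, obs):
    # returns alpha_j(t) for all t (list of per-state lists)
    N = len(pi)
    alphas = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                       for j in range(N)])
    return alphas

def backward_all(A, B, obs):
    # returns beta_i(t) for all t; beta_i(T) = 1 by convention
    N = len(A)
    betas = [[1.0] * N]
    for t in range(len(obs) - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                         for i in range(N)])
    return betas

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 0]
alphas, betas = forward_all(A, B, pi, obs), backward_all(A, B, obs)
# sum_j alpha_j(t) beta_j(t) should be the same value at every t
checks = [sum(a * b for a, b in zip(alphas[t], betas[t])) for t in range(len(obs))]
```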
and
p[component = 1 | x_i; Λ^g] = p[component = 1, x_i | Λ^g] / p[x_i | Λ^g]

τ̂_t(i, j) = p{q_t = i, q_{t+1} = j | O, Λ^g} = p{q_t = i, q_{t+1} = j, O | Λ^g} / p{O | Λ^g}
where p{O | Λ^g} = α_N(T) = Σ_i α_i(T)

So τ̂_t(i, j) = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t+1) / α_N(T)
Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total no. of times we made a transition from i)

If we average τ̂_t(i, j) over all time-instants, we get the number of times the system was in the i-th state and made a transition to the j-th state. So, a revised estimate of the transition probability is

â_ij^new = Σ_{t=1}^{T−1} τ_t(i, j) / Σ_{t=1}^{T−1} Σ_{j=1}^{N} τ_t(i, j)
           (denominator: all transitions out of i at time t)
New estimates for the state pdf parameters are (assuming a single Gaussian per state):

μ̂_i = Σ_{t=1}^{T} γ_i(t) o_t / Σ_{t=1}^{T} γ_i(t)

Σ̂_i = Σ_{t=1}^{T} γ_i(t) (o_t − μ̂_i)(o_t − μ̂_i)ᵀ / Σ_{t=1}^{T} γ_i(t)
One possible criterion: choose the state sequence corresponding to the path with maximum probability.
Recall the Forward Algorithm: we found the probability of each path and summed over all possible paths:

Σ_{i=1}^{N^T} p{O, P_i | Λ}
Viterbi is just a special case of the Forward algorithm:
At each node, instead of the sum of the probabilities of all paths, choose the path with max probability.
In practice: p(O|Λ) is approximated by Viterbi (instead of the Forward algorithm).
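Viterbi differs from the forward pass only in replacing the sum over predecessors with a max, plus backpointers to recover the best path; a sketch on the same style of made-up 2-state model:

```python
def viterbi(A, B, pi, obs):
    # like the forward pass, but with max over predecessors instead of sum
    N = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    back = []                                  # backpointers per time step
    for t in range(1, len(obs)):
        back.append([max(range(N), key=lambda i: delta[i] * A[i][j])
                     for j in range(N)])
        delta = [delta[back[-1][j]] * A[back[-1][j]][j] * B[j][obs[t]]
                 for j in range(N)]
    # backtrack the single best state sequence
    path = [max(range(N), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return max(delta), path

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
p_best, path = viterbi(A, B, pi, [0, 1, 0])
```

The best-path probability is necessarily no larger than the forward probability p(O|Λ), since it keeps only one of the summed paths.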
Ŵ = arg max_W p(W | O)