Hidden Markov Model (HMM)
Samudravijaya K
Tata Institute of Fundamental Research, Mumbai
chief@tifr.res.in
09-JAN-2009
Majority of the slides are taken from S.Umesh’s tutorial on ASR (WiSSAP 2006).
Pattern Recognition

[Block diagram: input signal → signal processing → pattern matching → output. A model, generated during the training phase, is used for matching during the testing phase.]
Bayes’ rule

p(A|B) = p(B|A) p(A) / p(B)

p(A|B) = p(B|A) p(A) / Σ_i p(B|A_i) p(A_i)
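As a numerical sanity check of the second form of Bayes’ rule, here is a minimal Python sketch; the two populations and all probability values are made-up illustrations, not data from the slides:

```python
# Hypothetical two-class example: the populations A1 = "pygmy", A2 = "bushman".
# Priors p(Ai) and likelihoods p(B|Ai) of observing a height B = 4'1''.
priors = {"pygmy": 0.5, "bushman": 0.5}          # assumed equal priors
likelihoods = {"pygmy": 0.30, "bushman": 0.02}   # assumed likelihood values

# p(A|B) = p(B|A) p(A) / sum_i p(B|Ai) p(Ai)
evidence = sum(likelihoods[a] * priors[a] for a in priors)
posterior = {a: likelihoods[a] * priors[a] / evidence for a in priors}
```

The posterior concentrates on the class whose likelihood of the observation is higher, exactly as in the height example below.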
Note: here the means and variances are fixed; only the observation, x, changes.
Also see: Pr(x = 4′1′′) ≫ Pr(x = 5′) and Pr(x = 4′1′′) ≫ Pr(x = 3′)
If we observe the heights of many persons – say 3′6′′, 4′1′′, 3′8′′, 4′5′′, 4′7′′, 4′, 6′5′′ – and all are from the same population (i.e. either pygmy or bushman),
⇒ then the more certain we are that the population is pygmy.
L(X; θ) = p(x_0 ... x_{N−1}; θ) = ∏_{i=0}^{N−1} p(x_i; θ)

        = 1/(2πσ²)^{N/2} · exp( −(1/2σ²) Σ_{i=0}^{N−1} (x_i − θ)² )        (3)
Given: x_0 ... x_{N−1} ⇒ what can we say about the value of θ, i.e. what is the best estimate of θ?
Maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6′′) and L(x[0]; θ = 6′) ⇒ choose θ̂ = 4′6′′.
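The comparison of likelihoods at a few candidate values of θ can be sketched as follows; the observations are the heights above converted to feet, while σ = 0.25 and the candidate list are assumed values chosen for illustration:

```python
import math

def log_likelihood(xs, theta, sigma=0.25):
    # Gaussian log-likelihood of eq. (3); sigma is an assumed, fixed value
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - theta) ** 2 for x in xs) / (2 * sigma ** 2))

# heights from the slide, in feet (4'1'' = 4 + 1/12, and so on)
xs = [3.5, 4 + 1/12, 3 + 8/12, 4 + 5/12, 4 + 7/12, 4.0, 6 + 5/12]
candidates = [3.0, 4.5, 6.0]            # θ = 3', 4'6'', 6'
best = max(candidates, key=lambda t: log_likelihood(xs, t))
```

Maximising over the three candidates picks θ̂ = 4.5 (i.e. 4′6′′), matching the slide.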
Given x[0], x[1], ..., x[N−1] and a pdf parameterised by θ = [θ_1, θ_2, ..., θ_{m−1}]ᵀ,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(x_i; θ)
p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) · p(θ)

i.e. posterior ∝ likelihood × prior.
Then: (μ̂)_Bayesian = (σ² γ + n ν² x̄) / (σ² + n ν²)

⇒ a weighted average of the sample mean and the prior mean
p(x) = Σ_{m=1}^{M} w_m N(x; μ_m, σ_m),    with Σ_i w_i = 1
Characteristics of GMM:
• Just like ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
• Each component generates data from a Gaussian with mean µm and covariance
matrix Σm.
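A GMM density of this form can be evaluated directly; below is a small sketch with made-up weights, means and (scalar) variances, numerically checking that the mixture still integrates to one:

```python
import math

def gauss_pdf(x, mu, sigma):
    # univariate Gaussian density N(x; mu, sigma)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def gmm_pdf(x, weights, mus, sigmas):
    # p(x) = sum_m w_m N(x; mu_m, sigma_m), with sum_m w_m = 1
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

# made-up 3-component mixture
weights, mus, sigmas = [0.3, 0.5, 0.2], [-2.0, 0.0, 3.0], [1.0, 0.5, 1.5]

# crude Riemann sum over [-10, 10]: the mixture density should integrate to ~1
grid = [k * 0.01 for k in range(-1000, 1001)]
integral = sum(gmm_pdf(x, weights, mus, sigmas) for x in grid) * 0.01
```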
Experiment: an urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

pdf parameters: c_1, c_2, c_3, μ_1, μ_2, μ_3, σ_1, σ_2, σ_3

Therefore: p(x[i]; θ) = c_1 N(x; μ_1, σ_1) + c_2 N(x; μ_2, σ_2) + c_3 N(x; μ_3, σ_3)
arg max_θ p(X; θ) = arg max_θ ∏_{i=1}^{N} p(x_i; θ)        (5)
Obs:   x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp.:    1    2    2    1    3    1    3    3    2    2     3     3     3
Miss:  y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]
z = (x; y) is the complete data ⇒ the observations and which density each came from:

p(z; θ) = p(x, y; θ)

⇒ θ̂_MLE = arg max_θ p(z; θ)

Question: but how do we find which observation belongs to which density?
p(y[0] = 1 | x[0]; θ^g)
  = p(y[0] = 1, x[0]; θ^g) / p(x[0]; θ^g)
  = p(x[0] | y[0] = 1; μ_1^g, σ_1^g) · p(y[0] = 1) / Σ_{j=1}^{3} p(x[0] | y[0] = j; θ^g) p(y[0] = j)
  = p(x[0] | y[0] = 1; μ_1^g, σ_1^g) · c_1^g / [ p(x[0] | y[0] = 1; μ_1^g, σ_1^g) c_1^g + p(x[0] | y[0] = 2; μ_2^g, σ_2^g) c_2^g + ... ]

All parameters are known ⇒ we can calculate p(y[0] = 1 | x[0]; θ^g).
(Similarly calculate p(y[0] = 2 | x[0]; θ^g) and p(y[0] = 3 | x[0]; θ^g).)
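This E-step computation is just a per-observation application of Bayes’ rule; a minimal sketch follows, where the guessed parameters θ^g are made-up values, not numbers from the slides:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def responsibilities(x, c, mu, sigma):
    # p(y = j | x; theta_g) for each component j, given guessed parameters
    joint = [c[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(len(c))]
    total = sum(joint)                     # = p(x; theta_g)
    return [p / total for p in joint]

# guessed parameters theta_g (illustrative values)
c, mu, sigma = [1/3, 1/3, 1/3], [0.0, 4.0, 8.0], [1.0, 1.0, 1.0]
post = responsibilities(3.5, c, mu, sigma)   # posterior over the 3 components
```

The responsibilities sum to one, and the component whose mean is nearest the observation dominates.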
Updated parameters: ĉ_1 = 2/7, ĉ_2 = 4/7, ĉ_3 = 1/7 (different from the initial guess!)

Similarly (for a Gaussian) find μ̂_i, σ̂_i² for the i-th pdf, using p(y[j] = i | x[j]; θ^g):

(σ_1²)^new = Σ_{i=1}^{N} (x_i − μ̂_1)² · p(y[i] = 1 | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)
Compute p[y[i] = j | x[i]; θ^g] for i = 1, 2, ..., N ⇒ one term per observation.

4. μ_j^new = Σ_{i=1}^{N} x_i · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

5. (σ_j²)^new = Σ_{i=1}^{N} (x_i − μ̂_j)² · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)
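These update steps can be iterated into a full EM loop for a 1-D GMM. The sketch below uses a made-up 7-point dataset and a rough initial guess; note how the re-estimated weights come out as effective counts divided by N:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def em_step(xs, c, mu, sigma):
    # E-step: responsibilities p(y[i] = j | x[i]; theta_g)
    R = []
    for x in xs:
        joint = [c[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(len(c))]
        s = sum(joint)
        R.append([p / s for p in joint])
    # M-step: re-estimate weights, means and variances (steps 4 and 5)
    n = [sum(r[j] for r in R) for j in range(len(c))]      # effective counts
    c_new = [nj / len(xs) for nj in n]
    mu_new = [sum(R[i][j] * xs[i] for i in range(len(xs))) / n[j]
              for j in range(len(c))]
    var_new = [sum(R[i][j] * (xs[i] - mu_new[j]) ** 2 for i in range(len(xs))) / n[j]
               for j in range(len(c))]
    return c_new, mu_new, [max(v, 1e-6) ** 0.5 for v in var_new]

# made-up data clustered near 0 and 5, with a rough initial guess
xs = [0.1, -0.2, 0.3, 5.1, 4.8, 5.2, 4.9]
c, mu, sigma = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]
for _ in range(20):
    c, mu, sigma = em_step(xs, c, mu, sigma)
```

After a few iterations the means settle near the two cluster centres and the weights near 3/7 and 4/7.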
Practical Issues

BIC(G | X) = log p(X | Ĝ) − (d/2) log N

where
  Ĝ represents the GMM with the ML parameter configuration,
  d represents the number of parameters in G,
  N is the size of the dataset.
BIC selects the best GMM corresponding to the largest BIC value by trading off
these two terms.
The BIC criterion can discover the true GMM size effectively as shown in the figure.
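The trade-off can be made concrete with a toy model-selection sweep. The log-likelihood values below are hypothetical numbers, and the parameter count d = 3M − 1 assumes a 1-D GMM with M weights (one constraint), M means and M variances:

```python
import math

def bic(log_likelihood, num_params, n):
    # BIC(G | X) = log p(X | G_hat) - (d/2) log N
    return log_likelihood - 0.5 * num_params * math.log(n)

# hypothetical fits: a larger GMM fits at least as well (higher log-likelihood),
# but the (d/2) log N penalty can still favour the smaller model
n = 1000
scores = {
    2: bic(-2100.0, 2 * 3 - 1, n),
    4: bic(-2080.0, 4 * 3 - 1, n),
    8: bic(-2075.0, 8 * 3 - 1, n),
}
best_M = max(scores, key=scores.get)
```

With these (made-up) numbers the penalty outweighs the likelihood gain, so the smallest mixture wins.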
• However, one may have access to a large number of similar examples which can be utilized.
• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.
ML Estimation
MAP Estimation
Train a prior model p with a large amount of available data (say, from multiple
speakers). Adapt the parameters to a new speaker using some adaptation data (X).
Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.
weights

ŵ_j = [ α w_j^p + (1 − α) Σ_i p(j | x_i) ] γ

Here γ is a normalization factor such that Σ_j ŵ_j = 1.

means

μ̂_j = α μ_j^p + (1 − α) Σ_i p(j | x_i) x_i / Σ_i p(j | x_i)
Weighted average of sample mean and a prior mean
variances

σ̂_j = α ( σ_j^p + μ_j^p μ_j^{p′} ) + (1 − α) Σ_i p(j | x_i) x_i x_i′ / Σ_i p(j | x_i) − μ̂_j μ̂_j′
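A sketch of the mean-update rule alone (the weight and variance updates follow the same interpolation pattern); the prior means, adaptation data and responsibilities below are all made up, and α = 0.7 is an arbitrary choice:

```python
def map_adapt_means(xs, resp, prior_mu, alpha=0.7):
    # mu_hat_j = alpha * mu_j^p + (1 - alpha) * sum_i p(j|x_i) x_i / sum_i p(j|x_i)
    adapted = []
    for j, mu_p in enumerate(prior_mu):
        num = sum(resp[i][j] * xs[i] for i in range(len(xs)))
        den = sum(resp[i][j] for i in range(len(xs)))
        # fall back to the prior mean if a component saw no adaptation data
        adapted.append(alpha * mu_p + (1 - alpha) * (num / den if den > 0 else mu_p))
    return adapted

# prior (e.g. speaker-independent) means, plus a little adaptation data with
# fixed responsibilities; all numbers are illustrative
prior_mu = [0.0, 5.0]
xs = [0.8, 1.0, 5.6]
resp = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_mu = map_adapt_means(xs, resp, prior_mu, alpha=0.7)
```

Each adapted mean sits between the prior mean and the data mean, with α controlling how far it moves.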
θ̂_aa = arg max_{θ_aa} p(O; θ_aa)
Also need to model the time durations of these time-intervals – transition probs.
[Figure: a 3-state left-to-right HMM with transition probabilities a_11, a_12, ...; each state i has its own output pdf p_i(·) (e.g. p(·; aa) for the model of /aa/); the model emits the observation sequence o_1, o_2, o_3, ..., o_10.]
Note

α_j(t) ⇒ sum of the probabilities of all paths ending at node j at time t with partial observation sequence o_1, o_2, ..., o_t

p(O|Λ) = Σ_{j=1}^{N} P{o_1, o_2, ..., o_T, s_T = j | Λ} = Σ_{j=1}^{N} α_j(T)
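The forward recursion can be sketched for a discrete-observation HMM; the 2-state model below uses made-up numbers:

```python
def forward(A, B, pi, obs):
    # alpha_j(t): prob. of partial observations o_1..o_t, ending in state j at t
    N = len(pi)
    alpha = [pi[j] * B[j][obs[0]] for j in range(N)]        # initialisation
    for t in range(1, len(obs)):
        # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j(o_t)
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                 for j in range(N)]
    return sum(alpha)          # p(O | Lambda) = sum_j alpha_j(T)

# small illustrative 2-state HMM with binary observations (made-up numbers)
A = [[0.7, 0.3], [0.4, 0.6]]        # a_ij
B = [[0.9, 0.1], [0.2, 0.8]]        # b_j(o)
pi = [0.5, 0.5]
p_obs = forward(A, B, pi, [0, 1, 0])
```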
β_1(t = 2) = p{o_3 | s_{t=2} = 1, Λ}
           = p{o_3, s_{t=3} = 1 | s_{t=2} = 1; Λ} + p{o_3, s_{t=3} = 2 | s_{t=2} = 1, Λ}
           = p{o_3 | s_{t=3} = 1, s_{t=2} = 1, Λ} · p{s_{t=3} = 1 | s_{t=2} = 1, Λ}
             + p{o_3 | s_{t=3} = 2, s_{t=2} = 1, Λ} · p{s_{t=3} = 2 | s_{t=2} = 1, Λ}
           = b_1(o_3) · a_11 + b_2(o_3) · a_12
β_i(t) = Σ_{j=1}^{N} [ a_ij b_j(o_{t+1}) ] · β_j(t+1)
         (going to each node j from node i)   (prob. of observations o_{t+2} ... o_T given we are in node j at t+1)

       = p{o_{t+1}, ..., o_T | s_t = i, Λ}
p(O|Λ) = α_N(T) = Σ_{j=1}^{N} α_j(t) β_j(t)   (for any t)
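The identity p(O|Λ) = Σ_j α_j(t) β_j(t) holds at every t, which gives a handy consistency test between the two recursions; a sketch using the same style of made-up 2-state model:

```python
def forward_all(A, B, pi, obs):
    # returns alpha_j(t) for all t (list of per-state lists)
    N = len(pi)
    alphas = [[pi[j] * B[j][obs[0]] for j in range(N)]]
    for t in range(1, len(obs)):
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                       for j in range(N)])
    return alphas

def backward_all(A, B, obs):
    # returns beta_i(t) for all t; beta_i(T) = 1 by convention
    N = len(A)
    betas = [[1.0] * N]
    for t in range(len(obs) - 2, -1, -1):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[j][obs[t + 1]] * nxt[j] for j in range(N))
                         for i in range(N)])
    return betas

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
obs = [0, 1, 0]
alphas, betas = forward_all(A, B, pi, obs), backward_all(A, B, obs)
# sum_j alpha_j(t) beta_j(t) should be the same value at every t
checks = [sum(a * b for a, b in zip(alphas[t], betas[t])) for t in range(len(obs))]
```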
and
p[component = 1 | x_i; Λ^g] = p[component = 1, x_i | Λ^g] / p[x_i | Λ^g]

τ̂_t(i, j) = p{q_t = i, q_{t+1} = j | O, Λ^g} = p{q_t = i, q_{t+1} = j, O | Λ^g} / p{O | Λ^g}
where p{O | Λ^g} = α_N(T) = Σ_i α_i(T)

So τ̂_t(i, j) = α_i(t) · a_ij b_j(o_{t+1}) · β_j(t+1) / α_N(T)
Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total no. of times we made a transition from i)

If we average τ̂_t(i, j) over all time-instants, we get the number of times the system was in the i-th state and made a transition to the j-th state. So, a revised estimate of the transition probability is

â_ij^new = Σ_{t=1}^{T−1} τ_t(i, j) / Σ_{t=1}^{T−1} Σ_{j=1}^{N} τ_t(i, j)
           (denominator: all transitions out of i at time t)
New estimates for the state pdf parameters are (assuming a single Gaussian per state):

μ̂_i = Σ_{t=1}^{T} γ_i(t) o_t / Σ_{t=1}^{T} γ_i(t)

Σ̂_i = Σ_{t=1}^{T} γ_i(t) (o_t − μ̂_i)(o_t − μ̂_i)ᵀ / Σ_{t=1}^{T} γ_i(t)
One possible criterion: choose the state sequence corresponding to the path with maximum probability.
Recall the Forward Algorithm: we found the probability of each path and summed over all possible paths:

Σ_{i=1}^{N^T} p{O, P_i | Λ}
Viterbi is just a special case of the Forward algorithm:
At each node, instead of the sum of the probabilities of all paths, choose the path with max probability.
In practice: p(O|Λ) is approximated by Viterbi (instead of the Forward algorithm).
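Viterbi differs from the forward pass only in replacing the sum over predecessors with a max, plus backpointers to recover the best path; a sketch on the same style of made-up 2-state model:

```python
def viterbi(A, B, pi, obs):
    # like the forward pass, but with max over predecessors instead of sum
    N = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(N)]
    back = []                                  # backpointers per time step
    for t in range(1, len(obs)):
        back.append([max(range(N), key=lambda i: delta[i] * A[i][j])
                     for j in range(N)])
        delta = [delta[back[-1][j]] * A[back[-1][j]][j] * B[j][obs[t]]
                 for j in range(N)]
    # backtrack the single best state sequence
    path = [max(range(N), key=lambda j: delta[j])]
    for ptr in reversed(back):
        path.insert(0, ptr[path[0]])
    return max(delta), path

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.5, 0.5]
p_best, path = viterbi(A, B, pi, [0, 1, 0])
```

The best-path probability is necessarily no larger than the forward probability p(O|Λ), since it keeps only one of the summed paths.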
Ŵ = arg max_W p(W | O)