
Gaussian Mixture Model (GMM)

and
Hidden Markov Model (HMM)

Samudravijaya K
Tata Institute of Fundamental Research, Mumbai
chief@tifr.res.in

09-JAN-2009

The majority of the slides are taken from S. Umesh's tutorial on ASR (WiSSAP 2006).

Pattern Recognition

[Block diagram: the processed input signal is used for model generation during training, and for pattern matching against the trained models during testing, which produces the output.]

GMM: static patterns


HMM: sequential patterns



Basic Probability

Joint and Conditional probability

p(A, B) = p(A|B) p(B) = p(B|A) p(A)

Bayes’ rule

p(A|B) = p(B|A) p(A) / p(B)

If the events Ai are mutually exclusive and exhaustive,


p(B) = Σi p(B|Ai) p(Ai)

p(A|B) = p(B|A) p(A) / Σi p(B|Ai) p(Ai)
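As a quick numeric illustration of these two formulas (the priors and likelihoods below are made-up values, not from the slides), a minimal Python sketch:

    # Hypothetical priors and likelihoods for two mutually exclusive, exhaustive events A1, A2
    p_A = [0.6, 0.4]             # p(A1), p(A2)
    p_B_given_A = [0.2, 0.7]     # p(B|A1), p(B|A2)

    p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))   # total probability: 0.40
    p_A1_given_B = p_B_given_A[0] * p_A[0] / p_B             # Bayes' rule: 0.30
    print(p_B, p_A1_given_B)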



Normal Distribution

Many phenomena are described by a Gaussian pdf:


 
p(x|θ) = (1/√(2πσ²)) exp( −(x − µ)²/(2σ²) )    (1)

The pdf is parameterised by θ = [µ, σ²], where the mean is µ and the variance is σ².

A convenient pdf: second-order statistics are sufficient to specify it.

Example: heights of Pygmies ⇒ Gaussian pdf with µ = 4 ft and std-dev σ = 1 ft

OR: heights of Bushmen ⇒ Gaussian pdf with µ = 6 ft and std-dev σ = 1 ft

Question: if we arbitrarily pick a person from a population, what is the probability of the height being a particular value?



If I arbitrarily pick a Pygmy, say x, then

Pr(height of x = 4′1′′) = (1/√(2π·1)) exp( −(4′1′′ − 4)²/(2·1) )    (2)

Note: here the mean and variance are fixed; only the observation, x, changes.

Also note: Pr(x = 4′1′′) ≫ Pr(x = 5′) and Pr(x = 4′1′′) ≫ Pr(x = 3′)
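A small sketch of this evaluation (heights in feet, parameters from the Pygmy example above; scipy is assumed to be available):

    from scipy.stats import norm

    pygmy = norm(loc=4.0, scale=1.0)       # mu = 4 ft, sigma = 1 ft
    for h in [3.0, 4 + 1/12, 5.0]:         # 3', 4'1", 5'
        print(h, pygmy.pdf(h))             # the density peaks near the mean, 4 ft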



Conversely: given that a person's height is 4′1′′, the person is more likely to be a Pygmy than a Bushman.

If we observe the heights of many persons, say 3′6′′, 4′1′′, 3′8′′, 4′5′′, 4′7′′, 4′, 6′5′′, and all are from the same population (i.e. either Pygmy or Bushman), then we become more certain that the population is Pygmy.

The more observations we have, the better our decision will be.



Likelihood Function

x[0], x[1], . . . , x[N − 1]


⇒ set of independent observations from pdf parameterised by θ.

Previous example: x[0], x[1], . . . , x[N − 1] are the heights observed, and θ is the mean of the density, which is unknown (σ² is assumed known).

L(X; θ) = p(x0 . . . xN−1; θ) = ∏_{i=0}^{N−1} p(xi; θ)
        = (1/(2πσ²)^(N/2)) exp( −(1/(2σ²)) Σ_{i=0}^{N−1} (xi − θ)² )    (3)

L(X; θ) is a function of θ and is called the likelihood function.

Given x0 . . . xN−1, what can we say about the value of θ, i.e. what is the best estimate of θ?



Maximum Likelihood Estimation

Example: we know the height of a person, x[0] = 4′4′′.

Which pdf is it most likely to have come from: θ = 3′, 4′6′′ or 6′?

Maximum of L(x[0]; θ = 3′), L(x[0]; θ = 4′6′′) and L(x[0]; θ = 6′) ⇒ choose θ̂ = 4′6′′.

If θ is just a parameter, we choose arg max_θ L(x[0]; θ).



Maximum Likelihood Estimator

 
Given x[0], x[1], . . . , x[N − 1] and a pdf parameterised by the vector θ = [θ1, θ2, . . . , θm−1]^T,

we form the likelihood function L(X; θ) = ∏_{i=0}^{N−1} p(xi; θ)

θ̂_MLE = arg max_θ L(X; θ)

For the height problem, one can show that θ̂_MLE = (1/N) Σi xi,
⇒ the estimate of the mean of the Gaussian is the sample mean of the measured heights.
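A quick numerical check of this result (numpy assumed; the heights below are made-up values in feet): maximising the log-likelihood over a grid of candidate means lands on the sample mean.

    import numpy as np

    heights = np.array([3.5, 4 + 1/12, 3 + 8/12, 4 + 5/12, 4 + 7/12, 4.0])  # hypothetical data
    sigma = 1.0

    thetas = np.linspace(2.0, 6.0, 4001)                                    # candidate means
    loglik = [-np.sum((heights - t) ** 2) / (2 * sigma**2) for t in thetas]
    print(thetas[np.argmax(loglik)], heights.mean())                        # both are about 4.04 ft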



Bayesian Estimation

• MLE ⇒ θ is assumed unknown but deterministic

• Bayesian Approach: θ is assumed random with pdf p(θ) ⇒ Prior Knowledge.

p(θ|x) = p(x|θ) p(θ) / p(x) ∝ p(x|θ) p(θ)

where p(θ|x) is the a posteriori density and p(θ) is the prior.

• Height problem: the unknown mean µ is random with a Gaussian pdf N(γ, ν²):

p(µ) = (1/√(2πν²)) exp( −(µ − γ)²/(2ν²) )

Then: µ̂_Bayesian = (σ²γ + nν²x̄) / (σ² + nν²)

⇒ a weighted average of the sample mean and the prior mean
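A sketch of this shrinkage formula (numpy assumed; the data are the same made-up heights as before, with an assumed prior N(γ = 4.5, ν² = 0.25) and known σ² = 1):

    import numpy as np

    heights = np.array([3.5, 4 + 1/12, 3 + 8/12, 4 + 5/12, 4 + 7/12, 4.0])  # hypothetical data
    sigma2, gamma, nu2 = 1.0, 4.5, 0.25          # known variance, prior mean, prior variance
    n, xbar = len(heights), heights.mean()

    mu_bayes = (sigma2 * gamma + n * nu2 * xbar) / (sigma2 + n * nu2)
    print(xbar, mu_bayes)                        # posterior mean lies between sample mean and prior mean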



Gaussian Mixture Model

p(x) = α p(x | N(µ1, σ1)) + (1 − α) p(x | N(µ2, σ2))

p(x) = Σ_{m=1}^{M} wm p(x | N(µm, σm)),    Σi wi = 1

Characteristics of GMM:
Just as ANNs are universal approximators of functions, GMMs are universal approximators of densities (provided a sufficient number of mixture components is used); this is true for diagonal-covariance GMMs as well.
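A minimal sketch of evaluating such a 1-D mixture density (numpy assumed; the weights and component parameters below are purely illustrative):

    import numpy as np

    def gauss_pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

    w  = np.array([0.5, 0.3, 0.2])            # mixture weights, sum to 1
    mu = np.array([-2.0, 0.0, 3.0])           # component means
    sd = np.array([0.5, 1.0, 0.8])            # component std-devs

    x = 0.7
    print(np.sum(w * gauss_pdf(x, mu, sd)))   # p(x) = sum_m w_m N(x; mu_m, sigma_m)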



General Assumption in GMM

• Assume that there are M components.

• Each component generates data from a Gaussian with mean µm and covariance
matrix Σm.



GMM

Consider the following probability density function shown in solid blue

It is useful to parameterise or “model” this seemingly arbitrary “blue” pdf



Gaussian Mixture Model (Contd.)

Actually, the pdf is a mixture of 3 Gaussians, i.e.

p(x) = c1 N(x; µ1, σ1) + c2 N(x; µ2, σ2) + c3 N(x; µ3, σ3),   with Σi ci = 1    (4)

pdf parameters: c1, c2, c3, µ1, µ2, µ3, σ1, σ2, σ3



Observation from GMM

Experiment: an urn contains balls of 3 different colours: red, blue or green. Behind a curtain, a person picks a ball from the urn.

If red ball ⇒ generate x[i] from N (x; µ1, σ1)


If blue ball ⇒ generate x[i] from N (x; µ2, σ2)
If green ball ⇒ generate x[i] from N (x; µ3, σ3)
We have access only to observations x[0], x[1], . . . , x[N − 1]

Therefore : p(x[i]; θ) = c1N (x; µ1, σ1) + c2N (x; µ2, σ2) + c3N (x; µ3, σ3)

but we do not know which component (ball colour) x[i] came from!

Can we estimate the parameters θ = [c1 c2 c3 µ1 µ2 µ3 σ1 σ2 σ3]^T from the observations?

arg max_θ p(X; θ) = arg max_θ ∏_{i=1}^{N} p(xi; θ)    (5)



Estimation of Parameters of GMM

Easier Problem: We know the component for each observation


Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3

X1 = {x[0], x[3], x[5]} belongs to p1(x; µ1, σ1)

X2 = {x[1], x[2], x[8], x[9]} belongs to p2(x; µ2, σ2)
X3 = {x[4], x[6], x[7], x[10], x[11], x[12]} belongs to p3(x; µ3, σ3)

From X1 = {x[0], x[3], x[5]}:

ĉ1 = 3/13,   µ̂1 = (1/3) { x[0] + x[3] + x[5] }

σ̂1² = (1/3) [ (x[0] − µ̂1)² + (x[3] − µ̂1)² + (x[5] − µ̂1)² ]

In practice we do not know which observation comes from which pdf.

⇒ How do we solve for arg max_θ p(X; θ)?



Incomplete & Complete Data

x[0], x[1], . . . , x[N − 1] ⇒ incomplete data,

Introduce another set of variables y[0], y[1], . . . , y[N − 1]


such that y[i] = 1 if x[i] ∈ p1, y[i] = 2 if x[i] ∈ p2 and y[i] = 3 if x[i] ∈ p3

Obs: x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] x[10] x[11] x[12]
Comp. 1 2 2 1 3 1 3 3 2 2 3 3 3
miss: y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] y[8] y[9] y[10] y[11] y[12]

y[i] = missing data—unobserved data ⇒ information about component

z = (x; y) is complete data ⇒ observations and which density they come from
p(z; θ) = p(x, y; θ)
⇒ θ̂_MLE = arg max_θ p(z; θ)
Question: But how do we find which observation belongs to which density ?



Given observation x[0] and an initial guess θ^g, what is the probability of x[0] coming from the first distribution?

p(y[0] = 1 | x[0]; θ^g)
  = p(y[0] = 1, x[0]; θ^g) / p(x[0]; θ^g)
  = p(x[0] | y[0] = 1; µ1^g, σ1^g) · p(y[0] = 1) / Σ_{j=1}^{3} p(x[0] | y[0] = j; θ^g) p(y[0] = j)
  = p(x[0] | y[0] = 1; µ1^g, σ1^g) · c1^g / [ p(x[0] | y[0] = 1; µ1^g, σ1^g) c1^g + p(x[0] | y[0] = 2; µ2^g, σ2^g) c2^g + . . . ]

All parameters are known ⇒ we can calculate p(y[0] = 1 | x[0]; θ^g).
(Similarly calculate p(y[0] = 2 | x[0]; θ^g) and p(y[0] = 3 | x[0]; θ^g).)

Which density? ⇒ y[0] = arg max_i p(y[0] = i | x[0]; θ^g) – hard allocation



Parameter Estimation for Hard Allocation

x[0] x[1] x[2] x[3] x[4] x[5] x[6]


p(y[j] = 1|x[j]; θ g ) 0.5 0.6 0.2 0.1 0.2 0.4 0.2
p(y[j] = 2|x[j]; θ g ) 0.25 0.3 0.75 0.3 0.7 0.5 0.6
p(y[j] = 3|x[j]; θ g ) 0.25 0.1 0.05 0.6 0.1 0.1 0.2
Hard Assign. y[0]=1 y[1]=1 y[2]=2 y[3]=3 y[4]=2 y[5]=2 y[6]=2

Updated parameters: ĉ1 = 2/7, ĉ2 = 4/7, ĉ3 = 1/7 (different from the initial guess!)
Similarly (for a Gaussian) find µ̂i, σ̂i² for the i-th pdf.



Parameter Estimation for Soft Assignment

x[0] x[1] x[2] x[3] x[4] x[5] x[6]


p(y[j] = 1|x[j]; θ g ) 0.5 0.6 0.2 0.1 0.2 0.4 0.2
p(y[j] = 2|x[j]; θ g ) 0.25 0.3 0.75 0.3 0.7 0.5 0.6
p(y[j] = 3|x[j]; θ g ) 0.25 0.1 0.05 0.6 0.1 0.1 0.2

Example: Prob. of each sample belonging to component 1

p(y[0] = 1|x[0]; θ g ), p(y[1] = 1|x[1]; θ g ) p(y[2] = 1|x[2]; θ g ), · · · · · ·

Average probability that a sample belongs to component 1:

ĉ1^new = (1/N) Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)
       = (0.5 + 0.6 + 0.2 + 0.1 + 0.2 + 0.4 + 0.2) / 7 = 2.2 / 7



Soft Assignment – Estimation of Means & Variances

Recall: Prob. of sample j belonging to component i

p(y[j] = i|x[j]; θ g )

Soft Assignment: Parameters estimated by taking weighted average !


µ1^new = Σ_{i=1}^{N} xi · p(y[i] = 1 | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)

(σ1²)^new = Σ_{i=1}^{N} (xi − µ̂1)² · p(y[i] = 1 | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = 1 | x[i]; θ^g)

These are the updated parameters, starting from the initial guess θ^g.



Maximum Likelihood Estimation of Parameters of GMM
1. Make an initial guess of the parameters: θ^g = {c1^g, c2^g, c3^g, µ1^g, µ2^g, µ3^g, σ1^g, σ2^g, σ3^g}

2. Knowing the parameters θ^g, find the prob. of sample xi belonging to the j-th component:

p[y[i] = j | x[i]; θ^g]    for i = 1, 2, . . . , N (observations) and j = 1, 2, . . . , M (components)

3. cj^new = (1/N) Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

4. µj^new = Σ_{i=1}^{N} xi · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

5. (σj²)^new = Σ_{i=1}^{N} (xi − µ̂j)² · p(y[i] = j | x[i]; θ^g) / Σ_{i=1}^{N} p(y[i] = j | x[i]; θ^g)

6. Go back to (2) and repeat until convergence.
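A compact sketch of these six steps for a 1-D GMM (numpy assumed; the data and the choice of M = 3 are illustrative, there are no safeguards against vanishing variances, and a practical implementation would work with log-probabilities):

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def em_gmm(x, M=3, iters=50):
        N = len(x)
        c = np.full(M, 1.0 / M)                        # step 1: initial guess
        mu = np.random.choice(x, M)
        var = np.full(M, x.var())
        for _ in range(iters):
            r = c * gauss(x[:, None], mu, var)         # step 2: responsibilities, shape (N, M)
            r /= r.sum(axis=1, keepdims=True)          #         p(y[i] = j | x[i]; theta_g)
            Nj = r.sum(axis=0)
            c = Nj / N                                 # step 3: new weights
            mu = (r * x[:, None]).sum(axis=0) / Nj     # step 4: new means
            var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nj   # step 5: new variances
        return c, mu, var                              # step 6: iterate until convergence

    x = np.concatenate([np.random.normal(-2, 0.5, 300),
                        np.random.normal(1, 1.0, 500),
                        np.random.normal(4, 0.8, 200)])
    print(em_gmm(x))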



Live demonstration: http://www.neurosci.aist.go.jp/~akaho/MixtureEM.html

Practical Issues

• EM can get stuck in local optima of the likelihood.

• EM is very sensitive to initial conditions; a good initial guess helps. The k-means algorithm is often used to initialise the parameters before applying the EM algorithm.



Size of a GMM

Bayesian Information Criterion (BIC) value of a GMM can be defined as follows:

BIC(G | X) = log p(X | Ĝ) − (d/2) log N

where
Ĝ represents the GMM with the ML parameter configuration,
d represents the number of parameters in G,
N is the size of the dataset.

The first term is the log-likelihood term; the second term is the model-complexity penalty term.

BIC selects the best GMM corresponding to the largest BIC value by trading off
these two terms.
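A sketch of BIC-based selection of the number of mixtures (numpy and scikit-learn's GaussianMixture assumed available; note that sklearn's own .bic() uses the −2·logL + d·logN convention, so the slide's formula is computed directly here and maximised):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # synthetic 1-D data drawn from 3 Gaussians (illustrative)
    x = np.random.normal([-2.0, 1.0, 4.0], [0.5, 1.0, 0.8], size=(400, 3)).ravel()[:, None]

    best_M, best_bic = None, -np.inf
    for M in range(1, 8):
        gmm = GaussianMixture(n_components=M).fit(x)
        loglik = gmm.score(x) * len(x)              # total log-likelihood log p(X | G_hat)
        d = 3 * M - 1                               # 1-D GMM: M means + M variances + (M-1) weights
        bic = loglik - 0.5 * d * np.log(len(x))     # slide's definition: larger is better
        if bic > best_bic:
            best_M, best_bic = M, bic
    print(best_M)                                   # typically recovers M = 3 for this data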



source: Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005

The BIC criterion can discover the true GMM size effectively as shown in the figure.



Maximum A Posteriori (MAP)

• Sometimes it is difficult to get a sufficient number of examples for robust estimation of parameters.

• However, one may have access to a large number of similar examples which can be utilised.

• Adapt the target distribution from such a distribution. For example, adapt a speaker-independent model to a new speaker using a small amount of adaptation data.



MAP Adaptation

ML Estimation

θ̂_MLE = arg max_θ p(X|θ)

MAP Estimation

θ̂_MAP = arg max_θ p(θ|X)
      = arg max_θ p(X|θ) p(θ) / p(X)
      = arg max_θ p(X|θ) p(θ)

p(θ) is the a priori distribution of parameters θ.


A conjugate prior is chosen such that the corresponding posterior belongs to the
same functional family as the prior.



Simple Implementation of MAP-GMMs

Source: Statistical Machine Learning from Data: GMM; Samy Bengio



Simple Implementation

Train a prior model p with a large amount of available data (say, from multiple
speakers). Adapt the parameters to a new speaker using some adaptation data (X).

Let α ∈ [0, 1] be a parameter that describes the faith in the prior model.

Adapted weight of the j-th mixture:

ŵj = [ α wj^p + (1 − α) Σi p(j|xi) ] γ

Here γ is a normalisation factor such that Σj ŵj = 1.



Simple Implementation (contd.)

Means:

µ̂j = α µj^p + (1 − α) Σi p(j|xi) xi / Σi p(j|xi)

⇒ a weighted average of the sample mean and the prior mean

Variances:

σ̂j = α ( σj^p + µj^p µj^p′ ) + (1 − α) Σi p(j|xi) xi xi′ / Σi p(j|xi) − µ̂j µ̂j′

(the primes denote transposes; σ̂j here stands for the covariance of mixture j)
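A sketch of these three updates for a 1-D diagonal GMM (numpy assumed; the prior model, adaptation data and α are illustrative, the transposes reduce to plain products in 1-D, and a practical version would also floor the adapted variances):

    import numpy as np

    def gauss(x, mu, var):
        return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def map_adapt(w_p, mu_p, var_p, x, alpha=0.7):
        r = w_p * gauss(x[:, None], mu_p, var_p)      # responsibilities p(j|x_i) under the prior model
        r /= r.sum(axis=1, keepdims=True)
        nj = r.sum(axis=0)                            # soft counts, sum_i p(j|x_i)

        w_hat = alpha * w_p + (1 - alpha) * nj
        w_hat /= w_hat.sum()                          # gamma: renormalise so the weights sum to 1

        mu_hat = alpha * mu_p + (1 - alpha) * (r * x[:, None]).sum(axis=0) / nj
        ex2 = (r * x[:, None] ** 2).sum(axis=0) / nj  # weighted second moment
        var_hat = alpha * (var_p + mu_p**2) + (1 - alpha) * ex2 - mu_hat**2
        return w_hat, mu_hat, var_hat

    # prior (speaker-independent) 2-mixture model and a little adaptation data
    w_p, mu_p, var_p = np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([1.0, 1.0])
    x_adapt = np.random.normal(2.5, 0.8, 50)
    print(map_adapt(w_p, mu_p, var_p, x_adapt))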



HMM

• The primary role of the speech signal is to carry a message; a sequence of sounds (phonemes) encodes a sequence of words.

• The acoustic manifestation of a phoneme is mostly determined by:

– Configuration of articulators (jaw, tongue, lip)


– physiology and emotional state of speaker
– Phonetic context

• HMM models sequential patterns; speech is a sequential pattern

• Most text dependent speaker recognition systems use HMMs

• Text verification involves verification/recognition of phonemes



Phoneme recognition

Consider two phonemes classes /aa/ and /iy/.


Problem: Determine to which class a given sound belongs.

Processing of the speech signal results in a sequence of feature (observation) vectors o1, . . . , oT (say, MFCC vectors).

We say the speech is /aa/ if: p(aa|O) > p(iy|O)

Using Bayes' rule, compare

p(O|aa) p(aa) / p(O)    vs.    p(O|iy) p(iy) / p(O)

where p(O|·) is the acoustic model and p(·) is the prior probability.

Given p(O|aa), p(aa), p(O|iy) and p(iy) ⇒ which is more probable?



Parameter Estimation of Acoustic Model

How do we find the density functions p_aa(·) and p_iy(·)?

We assume a parametric model:
⇒ p_aa(·) parameterised by θ_aa
⇒ p_iy(·) parameterised by θ_iy

Training phase: collect many examples of /aa/ being said ⇒ compute the corresponding observations o1, . . . , oT (for /aa/).

Use the Maximum Likelihood Principle

θ̂_aa = arg max_{θ_aa} p(O; θ_aa)

Recall: if the pdf is modelled as a Gaussian Mixture Model, then we use the EM algorithm.



Modelling of Phoneme

To enunciate /aa/ in a word, our articulators move from the configuration of the previous phoneme to that of /aa/, and then proceed to move to the configuration of the next phoneme.

Can think of 3 distinct time periods:

⇒ Transition from previous phoneme


⇒ Steady state
⇒ Transition to next phoneme

The features for the 3 time intervals are quite different

⇒ use a different density function to model each of the three time intervals
⇒ model them as p_aa1(·; θ_aa1), p_aa2(·; θ_aa2), p_aa3(·; θ_aa3)

Also need to model the time durations of these time-intervals – transition probs.



Stochastic Model (HMM)

[Figure: a 3-state left-to-right HMM with self-transitions (e.g. a11) and forward transitions (e.g. a12); each state has an output spectral density p(f) over frequency f (Hz).]



HMM Model of Phoneme

• Use the term "state" for each of the three time periods.

• The prob. of ot from the j-th state, i.e. p_aaj(ot; θ_aaj), is denoted bj(ot).

[Figure: the 3 states, with densities p(·; aa1), p(·; aa2), p(·; aa3), generating the observation sequence o1, o2, o3, . . . , o10.]

• Observation, ot, is generated by which state density?

– Only observations are seen, the state-sequence is “hidden”


– Recall: in a GMM, the mixture component is "hidden"



Probability of Observation

Recall: to classify, we evaluate Pr(O|Λ), where Λ are the parameters of the models.

In the /aa/ vs. /iy/ calculation: compare Pr(O|Λ_aa) vs. Pr(O|Λ_iy).

Example: 2-state HMM model and 3 observations o1 o2 o3

Model parameters are assumed known:


Transition probs. ⇒ a11, a12, a21 and a22 – these model the time durations
State densities ⇒ b1(ot) and b2(ot)
bj(ot) is usually modelled as a single Gaussian with parameters µj, σj², or by a GMM.



Probability of Observation through one Path

T = 3 observations and N = 2 nodes ⇒ N^T = 2³ = 8 paths through the 2 nodes for the 3 observations

Example: path P1 through states 1, 1, 1.

Pr{O|P1, Λ} = b1(o1) · b1(o2) · b1(o3)
Prob. of path P1 = Pr{P1|Λ} = a01 · a11 · a11

Pr{O, P1|Λ} = Pr{O|P1, Λ} · Pr{P1|Λ} = a01 b1(o1) · a11 b1(o2) · a11 b1(o3)



Probability of Observation
Path   o1 o2 o3   p(O, Pi|Λ)
P1     1  1  1    a01 b1(o1) · a11 b1(o2) · a11 b1(o3)
P2     1  1  2    a01 b1(o1) · a11 b1(o2) · a12 b2(o3)
P3     1  2  1    a01 b1(o1) · a12 b2(o2) · a21 b1(o3)
P4     1  2  2    a01 b1(o1) · a12 b2(o2) · a22 b2(o3)
P5     2  1  1    a02 b2(o1) · a21 b1(o2) · a11 b1(o3)
P6     2  1  2    a02 b2(o1) · a21 b1(o2) · a12 b2(o3)
P7     2  2  1    a02 b2(o1) · a22 b2(o2) · a21 b1(o3)
P8     2  2  2    a02 b2(o1) · a22 b2(o2) · a22 b2(o3)
p(O|Λ) = Σ_{Pi} P{O, Pi|Λ} = Σ_{Pi} P{O|Pi, Λ} · P{Pi|Λ}

Forward algorithm ⇒ avoid repeated calculations:

a01 b1(o1) a11 b1(o2) · a11 b1(o3) + a02 b2(o1) a21 b1(o2) · a11 b1(o3)      (two multiplications by a11 b1(o3))
= [ a01 b1(o1) a11 b1(o2) + a02 b2(o1) a21 b1(o2) ] · a11 b1(o3)             (one multiplication)



Forward Algorithm – Recursion

Let α1(t = 1) = a01 b1(o1)
Let α2(t = 1) = a02 b2(o1)

Recursion: α1(t = 2) = [ a01 b1(o1) · a11 + a02 b2(o1) · a21 ] · b1(o2)
                     = [ α1(t = 1) · a11 + α2(t = 1) · a21 ] · b1(o2)



General Recursion in Forward Algorithm
αj(t) = [ Σi αi(t − 1) aij ] · bj(ot)
      = P{o1, o2, . . . , ot, st = j | Λ}

Note: αj(t) is the sum of the probabilities of all paths ending at node j at time t with partial observation sequence o1, o2, . . . , ot.

The probability of the entire observation sequence (o1, o2, . . . , oT), therefore, is

p(O|Λ) = Σ_{j=1}^{N} P{o1, o2, . . . , oT, sT = j | Λ} = Σ_{j=1}^{N} αj(T)

where N = number of nodes.
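A sketch of this recursion (numpy assumed; the entry transitions a0j are held in a separate vector pi, B[j, t] stands for bj(ot), and the numbers are illustrative):

    import numpy as np

    def forward(pi, A, B):
        """alpha[j, t] = P(o_1..o_t, s_t = j); returns alpha and p(O | Lambda)."""
        N, T = B.shape
        alpha = np.zeros((N, T))
        alpha[:, 0] = pi * B[:, 0]                         # alpha_j(1) = a_0j * b_j(o_1)
        for t in range(1, T):
            alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, t]  # [sum_i alpha_i(t-1) a_ij] * b_j(o_t)
        return alpha, alpha[:, -1].sum()                   # p(O|Lambda) = sum_j alpha_j(T)

    # toy 2-state example with 3 observations (illustrative numbers)
    pi = np.array([0.7, 0.3])
    A  = np.array([[0.8, 0.2],
                   [0.4, 0.6]])
    B  = np.array([[0.5, 0.1, 0.3],    # b_1(o_1), b_1(o_2), b_1(o_3)
                   [0.2, 0.6, 0.4]])   # b_2(o_1), b_2(o_2), b_2(o_3)
    alpha, pO = forward(pi, A, B)
    print(pO)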



Backward Algorithm
• Analogous to the forward algorithm, but starting from the last time instant T.

Example: a01 b1(o1) · a11 b1(o2) · a11 b1(o3) + a01 b1(o1) · a11 b1(o2) · a12 b2(o3) + . . .
       = [ a01 b1(o1) · a11 b1(o2) ] · ( a11 b1(o3) + a12 b2(o3) )

β1(t = 2) = p{o3 | s_{t=2} = 1, Λ}
          = p{o3, s_{t=3} = 1 | s_{t=2} = 1, Λ} + p{o3, s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = p{o3 | s_{t=3} = 1, s_{t=2} = 1, Λ} · p{s_{t=3} = 1 | s_{t=2} = 1, Λ} + p{o3 | s_{t=3} = 2, s_{t=2} = 1, Λ} · p{s_{t=3} = 2 | s_{t=2} = 1, Λ}
          = b1(o3) · a11 + b2(o3) · a12



General Recursion in Backward Algorithm

Given that we are at node i at time t, βi(t) is the sum of the probabilities of all paths such that the partial sequence o_{t+1}, . . . , oT is observed:

βi(t) = Σ_{j=1}^{N} [ aij bj(o_{t+1}) ] βj(t + 1)
      = p{o_{t+1}, . . . , oT | st = i, Λ}

Here aij bj(o_{t+1}) accounts for going from node i to each node j, and βj(t + 1) is the probability of observing o_{t+2}, . . . , oT given that we are in node j at time t + 1.
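A companion sketch to the forward pass above (same hypothetical pi, A, B conventions):

    import numpy as np

    def backward(A, B):
        """beta[i, t] = P(o_{t+1}..o_T | s_t = i)."""
        N, T = B.shape
        beta = np.zeros((N, T))
        beta[:, -1] = 1.0                                    # by convention, beta_i(T) = 1
        for t in range(T - 2, -1, -1):
            beta[:, t] = A @ (B[:, t + 1] * beta[:, t + 1])  # sum_j a_ij b_j(o_{t+1}) beta_j(t+1)
        return beta

    # with pi, A, B from the forward example, np.dot(pi, B[:, 0] * backward(A, B)[:, 0])
    # equals the same p(O|Lambda) as the forward pass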



Estimation of Parameters of HMM Model

• Given known model parameters Λ:

– Evaluate p(O|Λ) ⇒ useful for classification
– Efficient implementation: use the forward or backward algorithm

• Given a set of observation vectors ot, how do we estimate the parameters of the HMM?

– We do not know which states the ot come from
  ∗ Analogous to GMM – we do not know which component
– Use a special case of EM – the Baum-Welch algorithm
– Use the following relation from the forward/backward algorithms:

p(O|Λ) = αN(T) = β1(T) = Σ_{j=1}^{N} αj(t) βj(t)



Parameter Estimation for Known State Sequence

Assume each state is modelled as a single Gaussian:

µ̂j = sample mean of the observations assigned to state j
σ̂j² = variance of the observations assigned to state j

and

Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total number of transitions made from i)

In practice, since we do not know which state generated each observation, we use a probabilistic (soft) assignment.



Review of GMM Parameter Estimation

We do not know which component of the GMM generated each output observation.

Given initial model parameters Λ^g and the observation sequence x1, . . . , xT,

find the probability that xi comes from component j ⇒ soft assignment

p[component = 1 | xi; Λ^g] = p[component = 1, xi | Λ^g] / p[xi | Λ^g]

So, the re-estimation equations are:

Ĉj = (1/T) Σ_{i=1}^{T} p(comp = j | xi; Λ^g)

µ̂j^new = Σ_{i=1}^{T} xi p(comp = j | xi; Λ^g) / Σ_{i=1}^{T} p(comp = j | xi; Λ^g)

σ̂j² = Σ_{i=1}^{T} (xi − µ̂j)² p(comp = j | xi; Λ^g) / Σ_{i=1}^{T} p(comp = j | xi; Λ^g)

An analogous procedure holds for hidden Markov models.



Baum-Welch Algorithm

Here: we do not know which observation ot comes from which state si.

Again, as with the GMM, we assume an initial guess of the parameters, Λ^g.

Then the prob. of being in state i at time t and state j at time t + 1 is

τ̂t(i, j) = p{qt = i, q_{t+1} = j | O, Λ^g} = p{qt = i, q_{t+1} = j, O | Λ^g} / p{O|Λ^g}

where p{O|Λ^g} = αN(T) = Σi αi(T)



Baum-Welch Algorithm

Then the prob. of being in state i at time t and state j at time t + 1 is

τ̂t(i, j) = p{qt = i, q_{t+1} = j | O, Λ^g} = p{qt = i, q_{t+1} = j, O | Λ^g} / p{O|Λ^g}

where p{O|Λ^g} = αN(T) = Σi αi(T)

From the ideas of the forward-backward algorithm, the numerator is

p{qt = i, q_{t+1} = j, O | Λ^g} = αi(t) · aij bj(o_{t+1}) · βj(t + 1)

So τ̂t(i, j) = αi(t) · aij bj(o_{t+1}) · βj(t + 1) / p{O|Λ^g}



Estimating Transition Probability

Trans. prob. from state i to j = (no. of times a transition was made from i to j) / (total number of transitions made from i)

τ̂t(i, j) ⇒ prob. of being in state i at time t and state j at time t + 1

If we sum τ̂t(i, j) over all time instants, we get the expected number of times the system was in state i and made a transition to state j. So, a revised estimate of the transition probability is

âij^new = Σ_{t=1}^{T−1} τ̂t(i, j) / Σ_{t=1}^{T−1} Σ_{k=1}^{N} τ̂t(i, k)

(the denominator counts all transitions out of state i.)



Estimating State-Density Parameters

Analogous to the GMM case, where we did not know which observation belonged to which component.

Let γi(t) = p{st = i | O, Λ^g} = αi(t) βi(t) / p(O|Λ^g) be the probability of being in state i at time t. The new estimates of the state pdf parameters are (assuming a single Gaussian per state)

µ̂i = Σ_{t=1}^{T} γi(t) ot / Σ_{t=1}^{T} γi(t)

Σ̂i = Σ_{t=1}^{T} γi(t) (ot − µ̂i)(ot − µ̂i)^T / Σ_{t=1}^{T} γi(t)

These are weighted averages ⇒ weighted by the prob. of being in state i at time t.

– Given the observations ⇒ the HMM model parameters are estimated iteratively
– p(O|Λ) ⇒ evaluated efficiently by the forward/backward algorithm
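A sketch of one such re-estimation pass for a 1-D HMM with a single Gaussian per state (numpy assumed; it reuses the forward() and backward() helpers sketched earlier, the parameter names are made up, and the scaling needed for long sequences is omitted):

    import numpy as np

    def gauss(o, mu, var):
        return np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    def baum_welch_step(pi, A, mu, var, obs):
        """One re-estimation pass; relies on forward() and backward() defined above."""
        B = gauss(obs[None, :], mu[:, None], var[:, None])   # B[j, t] = b_j(o_t)
        alpha, pO = forward(pi, A, B)
        beta = backward(A, B)

        gamma = alpha * beta / pO                             # gamma[i, t] = P(s_t = i | O)
        tau = (alpha[:, None, :-1] * A[:, :, None] *
               B[None, :, 1:] * beta[None, :, 1:]) / pO       # tau[i, j, t] as on the previous slides

        A_new = tau.sum(axis=2) / gamma[:, :-1].sum(axis=1, keepdims=True)
        mu_new = (gamma * obs).sum(axis=1) / gamma.sum(axis=1)
        var_new = (gamma * (obs - mu_new[:, None]) ** 2).sum(axis=1) / gamma.sum(axis=1)
        return gamma[:, 0], A_new, mu_new, var_new            # new initial probs, A, means, variances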



Viterbi Algorithm

Given the observation sequence,

• the goal is to find the corresponding state sequence that generated it;
• there are many possible combinations (N^T) of state sequences ⇒ many paths.

One possible criterion: choose the state sequence corresponding to the path with maximum probability,

max_i P{O, Pi|Λ}

Word: represented as a sequence of phones
Phone: represented as a sequence of states

Optimal state sequence ⇒ optimal phone sequence ⇒ word sequence



Viterbi Algorithm and Forward Algorithm

Recall the forward algorithm: we found the probability of each path and summed over all possible paths,

Σ_{i=1}^{N^T} p{O, Pi|Λ}

Viterbi is just a special case of the forward algorithm: at each node, instead of summing the probabilities of all incoming paths, choose the path with maximum probability.

In practice, p(O|Λ) is often approximated by the Viterbi score (instead of the full forward algorithm).
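A sketch of the Viterbi recursion in the same conventions as the forward sketch above (numpy assumed; it works with log-probabilities to avoid underflow, so zero entries in pi, A or B would need special handling):

    import numpy as np

    def viterbi(pi, A, B):
        """Most likely state sequence for B[j, t] = b_j(o_t), in the log domain."""
        N, T = B.shape
        logd = np.log(pi) + np.log(B[:, 0])           # delta_j(1)
        back = np.zeros((N, T), dtype=int)
        for t in range(1, T):
            scores = logd[:, None] + np.log(A)        # scores[i, j] = delta_i(t-1) + log a_ij
            back[:, t] = scores.argmax(axis=0)        # best predecessor of each state j
            logd = scores.max(axis=0) + np.log(B[:, t])
        path = [int(logd.argmax())]                   # backtrack from the best final state
        for t in range(T - 1, 0, -1):
            path.append(int(back[path[-1], t]))
        return path[::-1], logd.max()                 # state sequence and its log probability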





Decoding

• Recall: the desired transcription Ŵ is obtained by maximising

Ŵ = arg max_W p(W|O)

• Search over all possible W – astronomically large!

• Viterbi search – find the most likely path through an HMM

– the sequence of phones (states) which is most probable
– mostly, the most probable sequence of phones corresponds to the most probable sequence of words



Training of a Speech Recognition System

– HMM parameters are estimated using large databases (of the order of 100 hours of speech)

  ∗ parameters are estimated using the maximum likelihood criterion



Recognition



Speaker Recognition

Spectra (formants) of a given sound are different for different speakers.

Spectra of 2 speakers for one “frame” of /iy/

Derive a speaker-dependent model of a new speaker by MAP adaptation of the speaker-independent (SI) model using a small amount of adaptation data; use it for speaker recognition.



References

• Pattern Classification, R.O. Duda, P.E. Hart and D.G. Stork, John Wiley, 2001.

• Introduction to Statistical Pattern Recognition, K. Fukunaga, Academic Press, 1990.

• The EM Algorithm and Extensions, Geoffrey J. McLachlan and Thriyambakam Krishnan, Wiley-Interscience, 2nd edition, 2008. ISBN-10: 0471201707

• Fundamentals of Speech Recognition, Lawrence Rabiner and Biing-Hwang Juang, PTR Prentice Hall (Signal Processing Series), Englewood Cliffs, NJ, 1993. ISBN 0-13-015157-2



• Hidden Markov models for speech recognition, X.D. Huang, Y. Ariki, M.A. Jack.
Edinburgh: Edinburgh University Press, c1990.

• Statistical methods for speech recognition, F.Jelinek, The MIT Press, Cambridge,
MA., 1998.

• Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird and D.B. Rubin, J. Royal Statistical Soc. B, 39(1), pp. 1-38, 1977.

• Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, J.-L. Gauvain and C.-H. Lee, IEEE Trans. SAP, 2(2), pp. 291-298, 1994.

• A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, J.A. Bilmes, ICSI, TR-97-021.

• Boosting GMM and Its Two Applications, F.Wang, C.Zhang and N.Lu in
N.C.Oza et al. (Eds.) LNCS 3541, pp. 12-21, 2005.

