Today

Adaptation of Gaussian Mixture Models
Maximum A Posteriori (MAP)
Maximum Likelihood Linear Regression (MLLR)
The Problem

I have a little bit of labeled data, and a lot of unlabeled data.
I can model the training data fairly well.
But we always fit training data better than testing data.
Can we use the wealth of unlabeled data to do better?
MAP Adaptation

Constrain the contribution of unlabeled data.

$$\hat{\mu}_i = \alpha_i \frac{\sum_{u} p(i \mid x_u)\, x_u}{\sum_{u} p(i \mid x_u)} + (1 - \alpha_i)\,\mu_i$$

Let the alpha terms dictate how much weight to give to the new, unlabeled data compared to the existing estimates.
MAP adaptation

The movement of the parameters is constrained.
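As a concrete sketch, the MAP mean update can be written in a few lines of numpy. This is an illustrative implementation, not the slides' own code: the diagonal-covariance assumption and the relevance-factor form of the weights, alpha_i = n_i / (n_i + r), are my choices.

```python
import numpy as np

def map_adapt_means(means, weights, covs, X, relevance=16.0):
    """MAP-adapt the means of a diagonal-covariance GMM toward data X.

    means:     (K, D) current component means
    weights:   (K,)   mixture weights
    covs:      (K, D) diagonal covariances
    X:         (N, D) new (unlabeled) observations
    relevance: assumed relevance factor; larger values resist movement
    """
    K, _ = means.shape
    # Responsibilities p(i | x) under the current model (log-space for stability).
    log_p = np.empty((len(X), K))
    for i in range(K):
        diff = X - means[i]
        log_p[:, i] = (np.log(weights[i])
                       - 0.5 * np.sum(np.log(2 * np.pi * covs[i]))
                       - 0.5 * np.sum(diff ** 2 / covs[i], axis=1))
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    resp /= resp.sum(axis=1, keepdims=True)

    n_i = resp.sum(axis=0)                               # soft counts per component
    ex_i = resp.T @ X / np.maximum(n_i[:, None], 1e-12)  # data means per component
    alpha = (n_i / (n_i + relevance))[:, None]           # adaptation weights
    # Interpolate between the data means and the existing estimates.
    return alpha * ex_i + (1 - alpha) * means
```

Components that see little data keep alpha near zero, so their means barely move: exactly the constrained movement described above.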
MLLR adaptation

Another idea: Maximum Likelihood Linear Regression.
Apply an affine transformation to the means.
Don't change the covariance matrices.

$$\hat{\mu} = W\mu$$
MLLR adaptation

The new means are the MLE of the means with the new data.

$$\hat{\mu}_i = W_i \mu_i = \frac{\sum_{x} p(i \mid x, \omega_i, \mu_i, \Sigma_i)\, x}{\sum_{x} p(i \mid x, \omega_i, \mu_i, \Sigma_i)}$$
MLLR adaptation

The new means are the MLE of the means with the new data.

$$\hat{\mu}_i = W_i \mu_i, \qquad W_i = \left(\sum_{x} p(i \mid x, \omega_i, \mu_i, \Sigma_i)\, x\, \mu_i^T\right) \left(\sum_{x} p(i \mid x, \omega_i, \mu_i, \Sigma_i)\, \mu_i \mu_i^T\right)^{-1}$$
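Estimating the transform can be sketched as follows. This is a deliberately simplified, hypothetical version: one linear transform W shared (tied) across all components, with identity covariances assumed, so a single closed-form solution of the normal equations applies.

```python
import numpy as np

def estimate_mllr_transform(means, resp, X):
    """Estimate one tied MLLR matrix W such that mu_hat_i = W mu_i.

    Simplified sketch: a single linear transform shared by all components,
    identity covariances assumed (a real MLLR solver handles per-row
    covariance weighting and an affine bias term).

    means: (K, D) current component means
    resp:  (N, K) responsibilities p(i | x) under the current model
    X:     (N, D) adaptation data
    """
    n_i = resp.sum(axis=0)                  # soft counts per component
    # Z = sum_{x,i} p(i|x) x mu_i^T ;  G = sum_{x,i} p(i|x) mu_i mu_i^T
    Z = X.T @ resp @ means                  # (D, D)
    G = means.T @ (n_i[:, None] * means)    # (D, D)
    return Z @ np.linalg.pinv(G)
```

Applying `W @ mu_i` then moves every mean with the same transform, which is what makes the tying in the next slides possible.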
Why MLLR?

We can tie the transformation matrices of mixture components.
For example: You know that the red and green classes are similar.
Assumption: Their transformations should be similar.
Speech Representation

Extract a feature representation of speech.
Sampled every 10 ms: MFCC, 16 dimensions.
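A toy sketch of this feature extraction step: framing every 10 ms and keeping 16 coefficients. This is an illustrative, not production-grade, MFCC pipeline; the frame length, FFT size, and filterbank size are my assumed defaults, not values from the slides.

```python
import numpy as np
from scipy.fft import dct

def mfcc(signal, sr, n_mfcc=16, frame_ms=25, hop_ms=10, n_fft=512, n_mels=26):
    """Toy MFCC extraction: one n_mfcc-dim vector every hop_ms milliseconds."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # Slice the signal into overlapping Hamming-windowed frames.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies, then DCT; keep the first n_mfcc coefficients.
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_mfcc]
```

Each row of the returned array is one 16-dimensional MFCC frame, i.e. one point in the scatter plots that follow.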
Similarity of sounds

[Figure: scatter of /s/ frames in MFCC space (MFCC1 vs. MFCC2)]
MFCC Scatter

[Figure: scatter of /s/ frames in MFCC space (MFCC1 vs. MFCC2)]
UBM

[Figure: UBM Gaussian mixture fit over the /s/ MFCC scatter (MFCC1 vs. MFCC2)]
MAP adaptation

When we have a segment of speech to evaluate:
Generate MFCC features.
Use MAP adaptation on the UBM Gaussian Mixture Model.
MAP Adaptation

[Figure: UBM components shifted by MAP adaptation in MFCC space (MFCC1 vs. MFCC2)]
UBM-MAP

Claim: The differences between speakers can be represented by the movement of the mixture components of the UBM.
UBM-MAP training

Training Data → UBM Training → MAP → Supervector

Supervector: a vector of the adapted means of the Gaussian mixture components,
x_i = [μ_0, μ_1, ..., μ_k], with label t_i = Speaker ID.

The (x_i, t_i) pairs are used for Multiclass SVM Training.
UBM-MAP Evaluation

Test Data → MAP (using the UBM) → Supervector → Multiclass SVM → Prediction
Alternate View

Do we need all this? What if we just train an SVM on labeled MFCC data?

Labeled Training Data → Multiclass SVM Training
Test Data → Multiclass SVM → Prediction
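A minimal sketch of this baseline, assuming a per-frame SVM with a majority vote over the utterance. The voting step is my assumption; the slide only proposes training an SVM directly on labeled MFCC frames.

```python
import numpy as np
from sklearn.svm import SVC

def train_frame_svm(frames, labels):
    """Baseline: fit an SVM directly on labeled MFCC frames (no UBM, no MAP)."""
    return SVC(kernel='rbf').fit(frames, labels)

def predict_utterance(svm, frames):
    """Classify each frame independently, then majority-vote the utterance."""
    votes = svm.predict(frames)
    vals, counts = np.unique(votes, return_counts=True)
    return vals[np.argmax(counts)]
```

Unlike the supervector approach, this baseline ignores how the frames move the UBM components, which is exactly the information the UBM-MAP claim says distinguishes speakers.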
Results

UBM-MAP (with some variants) is the state-of-the-art in Speaker Recognition.
Current state-of-the-art performance is about 97% accuracy (~2.5% EER) with a few minutes of speech.
Model Adaptation

Adaptation allows GMMs to be seeded with labeled data.
Incorporation of unlabeled data gives a more robust model.
The adaptation process can be used to differentiate members of the population (UBM-MAP).
Next Time

Spectral Clustering