
INDIVIDUAL VOICE ACTIVITY DETECTION USING PERIODIC TO APERIODIC COMPONENT RATIO BASED ACTIVITY DETECTION (PARADE) AND GAUSSIAN MIXTURE SPEAKER MODELS

Naotoshi Seo

University of Maryland
ENEE632 Speech Processing
Final Project

ABSTRACT

This paper describes a study of individual voice activity detection (IVAD), which detects the speech regions of a person of interest in audio. An IVAD method can be constructed by cascading a voice activity detection (VAD) method and a speaker identification method; therefore, various VAD and speaker identification methods were studied. In particular, a VAD method called PARADE [10] and a speaker identification method based on GMMs [21] are investigated in depth. The PARADE method is a noise-robust VAD method utilizing the periodic component to aperiodic component ratio (PAR). The GMM based speaker identification method is one of the most widely used speaker identification methods; it represents the feature space as a mixture of Gaussian distributions. Finally, a semi-automatic IVAD system that cascades PARADE and GMM based speaker identification is proposed, and its performance is examined by experiments.

Index Terms: Voice Activity Detection, Speaker Identification, Text Independent Speaker Identification, PARADE, GMM

1. INTRODUCTION

Individual voice activity detection (IVAD) is an algorithm that detects the speech regions of a person of interest in audio. In contrast, voice activity detection (VAD) is an algorithm wherein the presence or absence of human speech is detected in regions of audio. VAD is important to speech processing techniques such as speech enhancement [19], speech coding [25], and automatic speech recognition [13]. IVAD is also important because it enables these speech processing techniques to be specialized to a person of interest.

An IVAD method can be constructed by 1) extracting voice activity segments in audio and 2) classifying each segment by speaker and picking the segments of the person of interest. Therefore, both VAD and speaker identification methods are studied. In section 2, several VAD methods are introduced, and a noise-robust VAD method, PARADE [10], is investigated in depth. In section 3, several speaker identification methods are introduced, and one of the most widely used, GMM based speaker identification [21], is further examined. In section 4, an IVAD system constructed by cascading PARADE and GMM based speaker identification is proposed and examined. In section 5, the conclusions of this study and an outline of future work are provided.

2. VOICE ACTIVITY DETECTION

Voice activity detection (VAD) is an algorithm used in speech processing wherein the presence or absence of human speech is detected in regions of audio. VAD is very important for speech communication applications such as speech recognition, hands-free telephony and speech coding. When noise-free speech is acquired, a proper threshold set on the signal level allows relatively easy detection of the speech period. However, real speech is distorted by background noise such as computer fans, air conditioners and many other environmental sounds, especially in distant-talking situations. Inaccurate detection of the speech period causes serious problems such as degradation of recognition performance and deterioration of speech quality. It is therefore highly desirable to develop a robust and reliable VAD method.

In general, VAD consists of two parts: an acoustic feature extraction part and a decision mechanism part. The former extracts acoustic features that can appropriately indicate the probability of target speech signals existing in observed signals, which also include environmental sound signals. Based on these acoustic features, the latter part decides whether the target speech signals are present in the observed signals using, for example, a well-adjusted threshold [16], the likelihood ratio [24], or hidden Markov models [2]. The performance of each part seriously influences overall VAD performance.

The short-term signal energy and zero-crossing rate [20] have long been used as simple acoustic features for VAD. However, they are easily degraded by environmental noise, and environmental sounds can possess an energy and zero-crossing rate similar to those of speech signals. To cope with this problem, various kinds of robust acoustic features have been proposed for VAD, such as autocorrelation function based features [1], spectrum based features that utilize harmonicity [23][29], pitch based features [26], the power in the band-limited region [16], mel-frequency cepstral coefficients [14], features based on higher order statistics [15], and the periodic to aperiodic component ratio (PAR) [9]. In this project, I examine a voice activity detection method using PAR features called PARADE [10].

2.1. Periodic to Aperiodic Component Ratio based Activity Detection (PARADE)

Sound signals can be decomposed into their periodic and aperiodic components. For example, speech signals consist not only of periodic signals such as the steady parts of vowels and voiced consonants, but also of non-periodic signals such as the fluctuations included in vowels and voiced consonants, stops, fricatives, and affricates. With regard to psychoacoustics, findings derived from concurrent vowel identification experiments suggest that the human auditory system can suppress harmonic interferers and perceive the residual target signal [3]. This suggests that the human auditory system may process both the harmonic (periodic) component and the residue after canceling the harmonicity (the aperiodic component), which deviates from the dominant periodicity. In terms of automatic speech recognition, the word accuracy in noisy environments can be improved by using the periodic and aperiodic components of observed signals [11][9]. The PARADE method [10] utilizes an acoustic feature that represents the power ratio of the periodic and aperiodic components in observed signals. This feature is referred to as the periodic component to aperiodic component ratio (PAR).

Now, let us define the problem that the PARADE method solves. Observed signals are recorded monaurally, and there is only one dominant sound, i.e. the target speech, in the presence of background noise whose frequency spectrum is distributed widely over all frequencies. There is no assumption as regards the stationarity of the noise power; thus the power of the noise may change dynamically. In addition, there is no prior knowledge about the kind of background noise.

To cope with this problem, the PARADE method decomposes observed signals into their periodic and aperiodic components. Here, the term aperiodic component includes both environmental noise and the aperiodic components of the speech signal, and the term periodic component denotes the dominant harmonic component in the observed signal. Because the PAR of noise is normally independent of the power of the noise, voice activity detection based on the PAR is expected to be insensitive to any dynamic change in the SNR. Section 2.2 describes the decomposition method, performed using subband signals in the frequency domain.

The VAD method then detects the existence of speech based on the statistics of the PARs, which represent the difference between periods that contain only noise and periods that contain both noise and speech. The statistical VAD method proposed by Sohn et al. [24] utilizes likelihoods derived from a posteriori SNRs, whereas the PARADE method calculates likelihoods derived from probability distributions of the PARs without knowledge of the SNR. Section 2.3 describes the likelihood calculation in detail.

2.2. Decomposition of periodicity and aperiodicity

In this section, how to decompose sound signals into their periodic and aperiodic components is explained. Let us define the short time power of the observed signal at frame n by

    ρ(n) = Σ_{t=0}^{L−1} |s(t) g(t)|²,    (1)

that is, the 0th-order autocorrelation coefficient, where s(t) is the observed signal of length L at frame n and g(t) is a symmetric analysis window of length L such as a Hanning window. Since the autocorrelation coefficients equal the inverse Fourier transform of the power spectrum, the short time power can also be obtained as

    ρ(n) = (1/M) Σ_{m=0}^{M−1} |S(n, m)|²    (2)

where S(n, m) is the STFT representation of s(t) at frequency bin m ∈ [0, M−1] and temporal frame n.

Let us assume that an observed signal s(t) can be described as the sum of its periodic and aperiodic components, s_p(t) and s_a(t), respectively:

    s(t) = s_p(t) + s_a(t)    (3)

Hereafter, let us denote the STFT representations of the above signals by S(n, m), S_p(n, m), and S_a(n, m), respectively, and their short time powers in a time frame by ρ(n), ρ_p(n), and ρ_a(n) respectively. Now, we assume additivity of the powers of the components:

    |S(n, m)|² = |S_p(n, m)|² + |S_a(n, m)|²    (4)

Then the following equation can be derived:

    ρ(n) = ρ_p(n) + ρ_a(n)    (5)

Let us denote the fundamental frequency (F0) of the periodic component and the number of harmonics at frame n by f₀(n) and K(n), respectively, and an operator that transforms the k-th harmonic frequency k f₀(n) to the index of the corresponding frequency bin by I[k f₀(n)]. Now, we have K(n) = ⌊M / I[f₀(n)]⌋.
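As a sanity check on the two definitions above, Eqs. (1) and (2) give the same short time power by Parseval's theorem. A minimal numerical sketch (a 512-sample frame with a Hanning window; the function names are mine, not from the paper):

```python
import numpy as np

def short_time_power_time(frame, window):
    # Eq. (1): 0th-order autocorrelation of the windowed frame.
    return np.sum(np.abs(frame * window) ** 2)

def short_time_power_freq(frame, window, M=None):
    # Eq. (2): mean power over the M bins of the frame's STFT slice.
    M = M or len(frame)
    S = np.fft.fft(frame * window, n=M)
    return np.sum(np.abs(S) ** 2) / M

rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
window = np.hanning(512)
p_time = short_time_power_time(frame, window)
p_freq = short_time_power_freq(frame, window)
assert np.isclose(p_time, p_freq)  # Parseval's theorem guarantees equality
```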
Under the assumption that the harmonic components of the observed signal can be calculated as the sum of the powers of pure tones, we can obtain the following equation:

    ρ_p(n) = η Σ_{k=1}^{K(n)} |S_p(n, I[k f₀(n)])|²    (6)

where η = 2 Σ_{t=0}^{L−1} g(t)² / (Σ_{t=0}^{L−1} g(t))². In addition, we can introduce the assumption that the average power of the aperiodic components at the frequencies of the dominant harmonic components is equal to that over the whole frequency range, that is,

    (1/M) Σ_{m=0}^{M−1} |S_a(n, m)|² = (1/K(n)) Σ_{k=1}^{K(n)} |S_a(n, I[k f₀(n)])|².    (7)

Under these assumptions, consequently, we obtain the following equations:

    ρ(n) = (1/M) Σ_{m=0}^{M−1} |S(n, m)|²    (8)

    ρ̂_p(n) = η ( Σ_{k=1}^{K(n)} |S(n, I[k f₀(n)])|² − K(n) ρ(n) ) / (1 − η K(n))    (9)

    ρ̂_a(n) = ρ(n) − ρ̂_p(n)    (10)

where ρ̂_p(n) and ρ̂_a(n) denote the estimated values of ρ_p(n) and ρ_a(n).

This method requires the F0 value, so we estimate it as the value that maximizes the numerator of equation (9); that is, we determine the estimate f̂₀(n) by searching the frequency range that includes the F0 of human speech (e.g. from 50 to 500 Hz) as follows:

    f̂₀(n) = arg max_{f₀(n)} [ Σ_{k=1}^{K(n)} |S(n, I[k f₀(n)])|² − K(n) ρ(n) ]    (11)

This equation coincides with the one adopted by a robust F0 estimator known as REPS [18]; therefore, it provides fairly reliable F0 estimates.

2.3. Likelihood Ratio Test

If the decomposition could ideally estimate the powers of the periodic components, we could detect speech signals based solely on these estimates. However, the decomposition cannot completely avoid power estimation errors; therefore, a statistical detection method such as [24] is required.

Let us presume the state of the existence of a speech signal (1 or 0) at frame n to be a random variable, and denote it by H_n.

When H_n = 0, i.e. there is no speech signal in the observed signal, then ρ(n) = ρ_a(n), and we can express the estimation error of the aperiodic component ε_a(n) as

    ε_a(n) = ρ̂_a(n) − ρ_a(n) = −ρ̂_p(n).    (12)

Assume that this error follows a Gaussian distribution whose mean and standard deviation are 0 and κ ρ̂_a(n) with a positive constant κ. Then the likelihood of the observed signal for non-speech periods can be modeled by

    p(ρ(n) | H_n = 0) = 1 / (√(2π) κ ρ̂_a(n)) · exp( −(1/2) ( ρ̂_p(n) / (κ ρ̂_a(n)) )² ).    (13)

In contrast, for the case H_n = 1, i.e. a speech signal exists in the observed signal, we can express the estimation error of the periodic component ε_p(n) as

    ε_p(n) = ρ(n) − ρ_a(n) − ρ̂_p(n).    (14)

Because we cannot know the true value of ρ_a(n), we consider the ideal case of ρ_a(n) = 0. Here, ε_p(n) becomes

    ε_p(n) = ρ(n) − ρ̂_p(n) = ρ̂_a(n).    (15)

Under this assumption, we again assume that the error follows a Gaussian distribution whose mean and standard deviation are 0 and κ ρ̂_p(n) with a positive constant κ. Then the likelihood of the observed signal for speech periods can be modeled by

    p(ρ(n) | H_n = 1) = 1 / (√(2π) κ ρ̂_p(n)) · exp( −(1/2) ( ρ̂_a(n) / (κ ρ̂_p(n)) )² ).    (16)

Finally, we calculate the likelihood ratio L(n) at frame n as

    L(n) = p(ρ(n) | H_n = 1) / p(ρ(n) | H_n = 0)    (17)

If the likelihood ratio is higher than a threshold decided by the user of the detector, we decide that a speech signal exists in the frame. The threshold can be determined as

    θ = p(H_n = 0) / p(H_n = 1)    (18)

where p(H_n = 1) and p(H_n = 0) denote the probabilities of the existence and non-existence of speech in the sound signal, respectively. A user should be able to guess the probability of the existence of speech by looking at a plot of the signal or listening to it. The guess does not need to be accurate; rather, the user may adjust the probability parameter, e.g. feeding the detector a larger probability of existence than the initial guess in order to decrease the probability of missed detection.
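The decision rule above can be sketched as follows, assuming the estimates ρ̂_p(n) and ρ̂_a(n) are already available from the decomposition of section 2.2. The helper name, the κ default, and the numerical guard are my assumptions, not part of PARADE:

```python
import numpy as np

def parade_decision(rho_p_hat, rho_a_hat, kappa=1.0, p_speech=0.5):
    # Log-likelihoods under the two Gaussian error models, Eqs. (13) and (16).
    eps = 1e-12  # guard against a zero power estimate
    sig0 = kappa * max(rho_a_hat, eps)   # error std when H_n = 0
    sig1 = kappa * max(rho_p_hat, eps)   # error std when H_n = 1
    ll0 = -np.log(np.sqrt(2 * np.pi) * sig0) - 0.5 * (rho_p_hat / sig0) ** 2
    ll1 = -np.log(np.sqrt(2 * np.pi) * sig1) - 0.5 * (rho_a_hat / sig1) ** 2
    log_L = ll1 - ll0                              # Eq. (17) in the log domain
    log_theta = np.log((1 - p_speech) / p_speech)  # Eq. (18)
    return log_L > log_theta

# A frame dominated by periodic power should be flagged as speech,
# and a frame dominated by aperiodic power as non-speech.
assert parade_decision(rho_p_hat=5.0, rho_a_hat=0.1)
assert not parade_decision(rho_p_hat=0.1, rho_a_hat=5.0)
```

Working in the log domain avoids underflow when one of the two power estimates is very small.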
Fig. 1 (plots omitted). The first test signal: (a) speech waveform in silence; (b) noisy speech waveform created by adding white noise at an SNR of 0 dB.

Fig. 2 (plots omitted). VAD results for the speech signal in the presence of white noise shown in Fig. 1(b): (a) power of the signal; (b) power of the periodic components; (c) power of the aperiodic components; (d) log likelihood ratios derived from the periodic and aperiodic components; (e) VAD result.

Fig. 3 (plots omitted). The second test signal in the presence of real-world street noise: (a) speech waveform in silence; (b) noisy speech waveform created by adding street noise at an SNR of 0 dB.

Fig. 4 (plots omitted). VAD results for the speech signal in the presence of real-world street noise shown in Fig. 3(b): (a) power of the signal; (b) power of the periodic components; (c) power of the aperiodic components; (d) log likelihood ratios derived from the periodic and aperiodic components; (e) VAD result.
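The 0 dB noisy signals shown in Figs. 1-4 are obtained by scaling the noise so that the speech-to-noise power ratio hits a target value. A minimal sketch of such mixing (the helper name and the sinusoidal stand-in for speech are my assumptions):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # stand-in for speech
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(speech, noise, snr_db=0.0)
added = noisy - speech  # recover the scaled noise to verify the achieved SNR
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(added ** 2))
assert abs(achieved - 0.0) < 1e-6
```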
2.4. Experimental Results

The experimental results of the PARADE method are presented here.

The first test signal was created by adding white noise to the speech data shown in Fig. 1(a) at an SNR of 0 dB (Fig. 1(b)). The result of the decomposition into periodic and aperiodic components is shown in Fig. 2(a)-(c); the PARADE method decomposes the periodic and aperiodic components well even at such a low SNR. In addition, the log likelihood ratios are shown in Fig. 2(d), and they correctly indicate the period in which speech signals exist. The VAD result based on the likelihood ratio test is shown in Fig. 2(e).

The second test signal was created by adding real-world noise: street noise was added to the speech data at an SNR of 0 dB. The street noise includes non-stationary sounds such as passing cars. The generated signal is shown in Fig. 3(b). The result of the decomposition into periodic and aperiodic components is shown in Fig. 4(a)-(c). The log likelihood ratios and the VAD result based on the likelihood ratio test are shown in Fig. 4(d)-(e). This result shows that the PARADE method worked well in the presence of non-stationary noise.

3. SPEAKER RECOGNITION

The next step of individual voice activity detection is to pick the speech frames of the person of interest from the voice activity frames obtained by a VAD method. To perform this task, we can use speaker recognition methods to identify the speaker of each frame.

Speaker recognition is the process of automatically recognizing a speaker by using speaker-specific information included in speech waves [12]. This technique can be used to verify the identity claimed by people accessing certain protected systems; that is, it enables access control of various services by voice [7]. Speaker recognition can be classified into two specific tasks: identification and verification.
Speaker identification is the process of determining which one of the voices known to the system best matches the input voice sample. Speaker verification is the process of accepting or rejecting the identity claim of a speaker. Speaker recognition methods can also be divided into text-dependent and text-independent methods. When the same text is used for both training and testing, the system is said to be text-dependent. For text-independent operation, the text used to train and test the system is completely unconstrained.

One of the early successful text-independent speaker identification methods is based on vector quantization (VQ). In this method, VQ codebooks consisting of a small number of representative feature vectors are used as an efficient means of characterizing speaker-specific features. A speaker-specific codebook is generated by clustering the training feature vectors of each speaker. In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker, and the VQ distortion accumulated over the entire input utterance is used to make the recognition decision [5].

Another early successful method is the ergodic HMM based method. Its basic structure is the same as that of the VQ method. Over a long timescale, the temporal variation in the speech signal parameters is represented by stochastic Markovian transitions between states. This method uses a multiple-state ergodic HMM, that is, one in which all possible transitions between states are allowed, to classify speech segments into one of the broad phonetic categories corresponding to the HMM states. After the classification, appropriate features are selected. In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after the phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores from each category [17].

GMM based speaker identification is among the most widely used methods, and it also works like the VQ method. By representing the feature space as a mixture of Gaussian distributions, the GMM algorithm generates, from the training feature vector set, a model parameter set including mean vectors, covariance matrices and mixture weights. The parameters are trained by unsupervised classification using the expectation maximisation (EM) algorithm, which provides an iterative maximum likelihood estimation technique. Experiments have shown that GMMs are effective models capable of achieving high identification accuracy [21].

In this project, the GMM based speaker identification method is studied.

3.1. Gaussian Mixture Models

Let X = {x_1, x_2, ..., x_T} be a set of T vectors, each of which is a d-dimensional feature vector extracted by digital speech signal processing. The likelihood of the GMM is

    p(X|λ) = Π_{t=1}^{T} p(x_t|λ)    (19)

Since the distribution of these vectors is unknown, it is approximately modeled by a mixture of Gaussian densities:

    p(x_t|λ) = Σ_{i=1}^{c} w_i N(x_t; μ_i, Σ_i)    (20)

where λ = {w_i, μ_i, Σ_i} denotes the set of model parameters, and w_i and N(x_t; μ_i, Σ_i), i = 1, ..., c, are the mixture weights and the d-variate Gaussian component densities with mean vectors μ_i and covariance matrices Σ_i, respectively.

In training the GMM, these parameters are estimated such that, in some sense, they best match the distribution of the training vectors. The most widely used training method is maximum likelihood (ML) estimation, where a new parameter model λ̄ is found such that p(X|λ̄) ≥ p(X|λ). An auxiliary function Q is used for this task:

    Q(λ, λ̄) = Σ_{i=1}^{c} Σ_{t=1}^{T} p(i|x_t, λ) log[ w̄_i N(x_t; μ̄_i, Σ̄_i) ]    (21)

where p(i|x_t, λ) is the a posteriori probability of acoustic class i, i = 1, ..., c, and satisfies

    Σ_{i=1}^{c} p(i|x_t, λ) = 1    (22)

Setting the derivatives of the Q function with respect to λ̄ to zero, the following reestimation formulas are found:

    p(i|x_t, λ) = w_i N(x_t; μ_i, Σ_i) / Σ_{k=1}^{c} w_k N(x_t; μ_k, Σ_k)    (23)

    μ̄_i = Σ_{t=1}^{T} p(i|x_t, λ) x_t / Σ_{t=1}^{T} p(i|x_t, λ)    (24)

    Σ̄_i = Σ_{t=1}^{T} p(i|x_t, λ)(x_t − μ̄_i)(x_t − μ̄_i)^T / Σ_{t=1}^{T} p(i|x_t, λ)    (25)

    w̄_i = (1/T) Σ_{t=1}^{T} p(i|x_t, λ)    (26)

For speaker identification, let λ_k, k = 1, ..., N, denote the speaker models of N speakers. A classifier is designed to classify X into one of the N speaker models by computing the log-likelihood of the unknown X given each speaker model λ_k and selecting the speaker k̂ as

    k̂ = arg max_{1 ≤ k ≤ N} Σ_{t=1}^{T} log p(x_t|λ_k).    (27)
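The reestimation formulas (23)-(26) and the likelihood of Eq. (19) can be sketched as one EM step plus a scoring function. For brevity this sketch uses diagonal covariances rather than the full covariance matrices of Eq. (25); all names and the toy data are my assumptions:

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    # Diagonal-covariance d-variate Gaussian density, evaluated row-wise.
    d = X.shape[1]
    diff2 = (X - mu) ** 2 / var
    norm = np.sqrt((2 * np.pi) ** d * np.prod(var))
    return np.exp(-0.5 * diff2.sum(axis=1)) / norm

def em_step(X, w, mu, var):
    T, c = X.shape[0], len(w)
    # E-step: a posteriori class probabilities, Eq. (23).
    post = np.stack([w[i] * gaussian_pdf(X, mu[i], var[i]) for i in range(c)], axis=1)
    post /= post.sum(axis=1, keepdims=True)
    # M-step: reestimation formulas, Eqs. (24)-(26).
    Nk = post.sum(axis=0)
    mu_new = (post.T @ X) / Nk[:, None]
    var_new = np.stack([(post[:, i, None] * (X - mu_new[i]) ** 2).sum(axis=0) / Nk[i]
                        for i in range(c)])
    w_new = Nk / T
    return w_new, mu_new, var_new

def log_likelihood(X, w, mu, var):
    # Eq. (19) in the log domain: total log-likelihood of X under the mixture.
    dens = sum(w[i] * gaussian_pdf(X, mu[i], var[i]) for i in range(len(w)))
    return np.sum(np.log(dens))

# Toy 2-D data drawn from two well-separated clusters.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2))])
w = np.array([0.5, 0.5])
mu = np.array([[0.5, 0.5], [4.0, 4.0]])
var = np.ones((2, 2))
ll_before = log_likelihood(X, w, mu, var)
for _ in range(10):
    w, mu, var = em_step(X, w, mu, var)
assert log_likelihood(X, w, mu, var) >= ll_before  # EM never decreases the likelihood
```

The monotonicity assertion at the end is the defining property of the ML/EM iteration described above: each pass through Eqs. (23)-(26) satisfies p(X|λ̄) ≥ p(X|λ).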
3.2. Experimental Results

The results of the GMM based speaker identification experiments are presented here.

The KING speech corpus [8] was used. The KING corpus was created for research in the area of speaker identification. It was collected partly in New Jersey and partly in San Diego. There are twenty-six San Diego speakers (numbered 01 to 26) and twenty-five New Jersey speakers (numbered 27 to 60, with some gaps in the sequence). All speakers are male. There are ten sessions for each speaker (numbered 01 to 10), and each session was recorded over both a wide-band (wb) and a narrow-band (nb) channel. Sessions were recorded a week to a month apart.

The data were processed in 30 ms frames. Frames were Hamming windowed and preemphasised with p = 0.9. For each frame, 46 mel-spectral bands of a width of 110 mel and 20 mel-frequency cepstral coefficients (MFCC) were determined and used as features [27]. In this project, I used the first 26 speakers in the KING corpus, who were recorded in the same place. The first 5 sessions of each speaker were used for training, and the remaining 5 were used for testing.

The experimental results are presented in Table 1. The GMMs for each speaker were trained with 16 and 32 mixtures. I stopped increasing the number of mixtures at 32 because the classification rate was no longer improving. I also examined the effect of mean subtraction of the feature vectors. Mean subtraction normalizes the feature vectors so that they become independent of the recording transfer function produced by the microphone architecture, the distance to the microphone, and the room acoustics [6]. The experiments show that mean subtraction improves the classification rate.

    (Number of mixtures, mean subtraction) | Error rate
    (16, N)                                | 0.1410
    (16, Y)                                | 0.0705
    (32, Y)                                | 0.1026

Table 1. Speaker identification error rates for 26 speakers using the KING corpus.

4. INDIVIDUAL VOICE ACTIVITY DETECTION

Individual voice activity detection (IVAD) is an algorithm that detects the speech regions of a person of interest in audio. IVAD helps to enhance speech processing techniques such as speech enhancement, speech coding, and automatic speech recognition more than VAD alone, by enabling these algorithms to be specialized to a person of interest. In this section, I propose an IVAD method constructed by cascading a VAD method and a speaker identification method.

I propose a semi-automatic hybrid IVAD system combining VAD by the PARADE method with speaker identification based on GMMs. The system works as follows:
1. Use the PARADE method to obtain voice activity frames.
2. Label the first few voice activity frames with their speakers by hand to provide GMM training data.
3. Train the GMMs and classify the voice activity frames of the entire signal.
Note that the PARADE method requires no user preparation to run, whereas several VAD methods such as [22][4][28] require users to prepare training sets. After VAD is performed, users prepare training sets for speaker identification using the frames obtained by VAD. Users only have to label the first few frames of each individual, which is not very burdensome because segmentation has already been performed by VAD. Finally, the IVAD system trains GMMs using the prepared training sets, classifies the voice activity frames in the entire signal by speaker, and picks the voice activity frames of the person of interest.

4.1. Experimental Results

The experimental results of the proposed method are presented here.

The first test signal, shown in Fig. 5(a), contains utterances of two speakers recorded in silence. VAD by the PARADE method was applied to the signal and the result is shown in Fig. 5(b). The ground truth is that the first 3 and the last 2 voice activity frames are utterances of person 1 and the rest are utterances of person 2. The training of the GMMs was performed using the first 5 voice activity frames in the signal, of which the first 3 are utterances of person 1 and the last 2 are utterances of person 2. After that, speaker identification was performed on each voice activity frame in the entire signal, and the resulting voice activity regions of person 1 are shown in Fig. 5(c). This result shows that the system worked well for a speech signal under low noise.

The second test signal was created by adding white noise at an SNR of 20 dB to the original signal of Fig. 5(a). The created signal is shown in Fig. 6(a). The VAD worked well for this signal, as shown in Fig. 6(b). The same procedure was repeated and the result of the speaker identification can be seen in Fig. 6(c). This shows that the GMM based speaker identification did not work well in the presence of noise; that is, the GMM method is not as robust to noise as the PARADE method. Use of a noise-robust speaker identification method instead of the GMM based method is desired.

5. CONCLUSION

Various VAD methods, especially the PARADE method, were studied. The PARADE method [10] was implemented and its noise robustness was verified by experiments. Moreover, various speaker identification methods, especially GMM based speaker identification [21], were studied.
Fig. 5 (plots omitted). The first test signal, with two speakers in silence: (a) speech waveform; (b) VAD result, where the first 3 frames and the last 2 frames are utterances of speaker 1 and the rest are utterances of speaker 2; (c) IVAD result for the speaker of interest, speaker 1 (error rate 0).

Fig. 6 (plots omitted). The second test signal, created by adding white noise at an SNR of 20 dB: (a) speech waveform; (b) VAD result, where again the first 3 frames and the last 2 frames are utterances of speaker 1 and the rest are utterances of speaker 2; (c) IVAD result for the speaker of interest (error rate 0.18).

I constructed a semi-automatic individual voice activity detection (IVAD) system by cascading VAD by the PARADE method with speaker identification based on GMMs. However, the experiments showed that the GMM based speaker identification method is not as robust to noise as the PARADE method. In the future, I will investigate noise-robust speaker identification methods and training-free speaker identification methods in order to build a completely automatic IVAD system.

6. REFERENCES

[1] B. S. Atal and L. R. Rabiner, A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition, IEEE Trans. Acoust., Speech, and Signal Process. 24 (1976), 201–212.

[2] S. Basu, A linked-HMM model for robust voicing and speech detection, Proc. ICASSP 1 (2003), 816–819.

[3] A. de Cheveigné, Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing, J. Acoust. Soc. Am. 93 (1993), 3271–3290.

[4] D. Enqing, L. Guizhong, Z. Yatong and Z. Xiaodi, Applying support vector machines to voice activity detection, Proceedings of the Sixth International Conference on Signal Processing (2002), 1124–1127.

[5] F. Soong et al., A vector quantization approach to speaker recognition, Proc. IEEE ICASSP (1985), 387–390.

[6] X. Huang et al., Spoken Language Processing, Prentice Hall PTR, 2001.

[7] S. Furui, Recent advances in speaker recognition, Pattern Recognition Letters 18 (1997), 859–872.

[8] A. Higgins and D. Vermilyea, King speech corpus, Linguistic Data Consortium (1995).

[9] K. Ishizuka, T. Nakatani, Y. Minami and N. Miyazaki, Speech feature extraction method using subband-based periodicity and nonperiodicity decomposition, J. Acoust. Soc. Am. 120 (2006), 443–452.

[10] K. Ishizuka, T. Nakatani, M. Fujimoto and N. Miyazaki, Noise robust front-end processing with voice activity detection based on periodic to aperiodic component ratio, Proc. Interspeech (2007), 230–233.

[11] P. J. B. Jackson, D. M. Moreno, M. J. Russell and J. Hernando, Covariation and weighting of harmonically decomposed streams for ASR, Proc. Interspeech (2003), 2321–2324.

[12] B. H. Juang, The past, present, and future of speech processing, IEEE Signal Processing Magazine 15 (1998), no. 3, 24–48.

[13] J.-C. Junqua, B. Mak and B. Reaves, A robust algorithm for word boundary detection in the presence of noise, IEEE Trans. Speech Audio Process. 2 (1994), 406–412.

[14] T. Kristjansson, S. Deligne and P. Olsen, Voicing features for robust speech detection, Proc. Interspeech (2005), 369–372.

[15] K. Li, M. N. S. Swamy and M. O. Ahmad, An improved voice activity detection using higher order statistics, IEEE Trans. Speech Audio Process. 13 (2005), 965–974.

[16] M. Marzinzik and B. Kollmeier, Speech pause detection for noise spectrum estimation by tracking power envelope dynamics, IEEE Trans. Speech Audio Process. 10 (2002), 109–118.

[17] T. Matsui and S. Furui, Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs, Proc. ICASSP (1992), 157–160.

[18] T. Nakatani and T. Irino, Robust and accurate fundamental frequency estimation based on dominant harmonic components, J. Acoust. Soc. Am. 116 (2004), 3690–3700.

[19] R. Le Bouquin-Jeannès and G. Faucon, Study of a voice activity detector and its influence on a noise reduction system, Speech Communication (1995), 245–254.

[20] L. R. Rabiner and M. R. Sambur, An algorithm for determining the endpoints of isolated utterances, The Bell Syst. Tech. Journal 54 (1975), 297–315.

[21] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Communication 17 (1995), 91–108.

[22] C. Shao and M. Bouchard, Efficient classification of noisy speech using neural networks, Proceedings of the Seventh International Symposium on Signal Processing and Its Applications (2002), 357–360.

[23] J.-L. Shen, J.-W. Hung and L.-S. Lee, Robust entropy-based endpoint detection for speech recognition in noisy environments, Proc. ICSLP (1998).

[24] J. Sohn, N.-S. Kim and W. Sung, A statistical model-based voice activity detection, IEEE Signal Process. Lett. 6 (1999), 1–3.

[25] K. Srinivasan and A. Gersho, Voice activity detection for cellular networks, Proc. of IEEE Workshop on Speech Coding for Telecommunications (1993), 85–86.

[26] R. Tucker, Voice activity detection using a periodicity measure, IEE Proceedings-I 139 (1992), 377–380.

[27] M. Wagner, Combined speech-recognition/speaker-verification system with modest training requirements, Proceedings of the Sixth Australian International Conference on Speech Science and Technology (1996), 139–143.

[28] S. N. Wrigley, G. J. Brown, V. Wan and S. Renals, Speech and crosstalk detection in multichannel audio, IEEE Transactions on Speech and Audio Processing 13 (2005), no. 1, 84–91.

[29] R. E. Yantorno, K. L. Krishnamachari and J. M. Lovekin, The spectral autocorrelation peak valley ratio (SAPVR): a usable speech measure employed as a co-channel detection system, Proc. IEEE Int. Workshop Intell. Signal Process. (2001).
