
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 6.887
Volume 5, Issue VIII, August 2017 - Available at www.ijraset.com

Glottal Analysis Using Speech Signals


Korutla Sudhir Sai1, Polasi Phani Kumar2
1M.Tech. student, Telematics specialization, 2Associate Professor, Department of Electronics and Communication Engineering,
Velagapudi Ramakrishna Siddhartha Engineering College, Jawaharlal Nehru Technological University Kakinada, Kanuru,
Vijayawada-7, Andhra Pradesh

Abstract: Speech processing applications are nowadays omnipresent in our daily life. By offering solutions to companies seeking efficiency enhancement with simultaneous cost savings, the market for speech technology is forecast to be particularly promising in the coming years. The present work deals with advances in glottal analysis in order to incorporate new techniques within speech processing applications. While current systems are usually based on information related to the vocal tract configuration, the airflow passing through the vocal folds, called the glottal flow, is expected to exhibit a relevant complementarity. Unfortunately, glottal analysis from speech recordings requires specific complex processing operations, which explains why it has generally been avoided. The main goal of this paper is to provide new advances in glottal analysis so as to popularize it in speech processing. First, new techniques for glottal excitation estimation and modeling are proposed and shown to outperform other state-of-the-art approaches on large corpora of real speech. Moreover, the proposed methods are integrated within various speech processing applications: speech synthesis, voice pathology detection, speaker recognition and expressive speech analysis. They are shown to lead to a substantial improvement when compared to other existing techniques. More specifically, the present work covers three separate but interconnected parts. In the first part, new algorithms for robust pitch tracking and for automatic determination of glottal closure instants are developed. This step is necessary, as accurate glottal analysis requires processing pitch-synchronous speech frames. In the second part, a new non-parametric method based on the Complex Cepstrum is proposed for glottal flow estimation. In addition, a way to achieve this decomposition asynchronously is investigated, and a comprehensive comparative study of glottal flow estimation approaches is given. The pseudo-periodicity of voiced speech can be exploited in several speech processing applications; this requires, however, that the precise locations of the Glottal Closure Instants (GCIs) are available. A focus of this work is therefore the evaluation of automatic methods for the detection of GCIs directly from the speech waveform, and a new procedure to determine GCIs, called the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) algorithm, is proposed. Relying on this expertise, the usefulness of glottal information for voice pathology detection and expressive speech analysis is explored. In the third part, a new excitation modeling called the Deterministic plus Stochastic Model of the residual signal is proposed. This model is applied to speech synthesis, where it is shown to enhance the naturalness and quality of the delivered voice. Finally, glottal signatures derived from this model are observed to lead to an increase in identification rates for speaker recognition purposes.
Keywords: Information Technology, Voice Technology, Speech Processing, Speech Analysis, Speech Synthesis, Speaker
Recognition, Voice Pathology, Expressive Speech, Glottal flow, Source-tract Separation, Pitch Estimation, Glottal Closure
Instant, Excitation Modeling.

I. INTRODUCTION
In speech processing, Glottal Closure Instants (GCIs) refer to the instants of significant excitation of the vocal tract. These
particular time events correspond to the moments of high energy in the glottal signal during voiced speech. Knowing the GCI
locations is of particular importance in speech processing. For speech analysis, closed-phase LP autoregressive analysis techniques
have been developed for better estimating the prediction coefficients, which results in a better estimation of the vocal tract
resonances. These techniques explicitly require the determination of GCIs. A wide range of applications also implicitly assume that
these instants are located; in concatenative speech synthesis, for instance, it is well known that some knowledge of a reference
instant is necessary to eliminate concatenation discontinuities.
GCIs have also been used for voice transformation, as well as for:

A. Voice quality enhancement
B. Speaker identification
C. Glottal source estimation, speech coding and transmission


Many methods have been proposed to locate GCIs directly from speech waveforms. The earliest attempts relied on the determinant of the autocovariance matrix, and the use of the Linear Prediction (LP) residual has also been investigated: as GCIs correspond to instants of significant excitation, a large value in the LP residual is assumed to be informative about the GCI location. Other approaches determine GCIs as the maxima of the Frobenius norm, rely on a weighted nonlinear prediction, or are based on a wavelet decomposition. Some techniques also exploit the phase properties due to the impulse-like nature of the excitation at the GCI by computing a group delay function. The DYPSA algorithm estimates GCI candidates using the projected phase slope and employs dynamic programming to retain the most likely ones. In other work, GCIs are located by a center-of-gravity based signal and then refined using minimum-phase group delay functions derived from the amplitude spectra. More recently, it has been proposed to detect discontinuities in frequency by confining the analysis around a single frequency. Pitch tracking refers to the task of estimating the contour of the fundamental frequency for voiced segments. Such a system is of particular interest in several applications of speech processing, such as speech coding, analysis, synthesis or recognition. The pitch tracking method considered here exploits the harmonics contained in the spectrum of the residual signal. The idea of using a summation of harmonics for detecting the fundamental frequency is not new: subharmonic summation has been used to account for the phenomenon of virtual pitch, an approach inspired by spectral and cepstral comb filters, and the Subharmonic-to-Harmonic Ratio has been used for estimating the pitch frequency and for voice quality analysis.
Finding the locations of Glottal Closure Instants (GCIs) is important and can be performed directly from the speech waveform. Here we study GCI detection using the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) algorithm. Speech coding uses parameter estimation based on audio signal processing techniques to model the speech signal, combined with generic data compression algorithms; in this work, subband coding is combined with GCI detection using the SEDREAMS algorithm. The proposed structure significantly reduces the error, and the precise locations of the GCIs are found using the SEDREAMS algorithm.
GCIs correspond to the positive zero-crossings of a filtered signal obtained by successive integrations of the speech waveform followed by a mean removal operation. Comparative studies of the most popular approaches have been carried out, showing that the DYPSA algorithm and a related technique clearly outperform other state-of-the-art methods. In speech analysis, the complex cepstrum is usually employed to deconvolve the speech signal into a periodic pulse train and the vocal system impulse response. The Complex Cepstrum (CC) based technique is introduced here, and its similarities with the ZZT-based method are discussed. Both approaches are viewed as two different ways to reach the same goal of separating the maximum- and minimum-phase contributions from a signal's Z-transform. The impact of windowing, which plays a crucial role in the decomposition quality, is also examined.

II. PITCH TRACKING WITH THE SRH METHOD


The proposed method relies on the analysis of the residual signal. For this, an auto-regressive model of the spectral envelope is
estimated from the speech signal s(t), and the residual signal r(t) is obtained by inverse filtering. This whitening process has the
advantage of removing the main contributions of both the noise and the vocal tract resonances. For each Hanning-windowed frame
covering several cycles of the resulting residual signal r(t), the amplitude spectrum R(f) is computed. R(f) has a relatively flat
envelope and, for voiced segments of speech, presents peaks at the harmonics of the fundamental frequency F0. From this spectrum,
and for each frequency in the range [F0,min, F0,max], the Summation of Residual Harmonics (SRH) is computed as:

SRH(f) = R(f) + Σ_{k=2}^{Nharm} [ R(k · f) − R((k − 1/2) · f) ]

Considering only the terms R(k · f) in the summation, this expression takes the contribution of the first Nharm harmonics into account,
and it could then be expected to reach a maximum for f = F0. However, this is also true for the harmonics present in
the range [F0,min, F0,max]. For this reason, the subtraction of R((k − 1/2) · f) significantly reduces the relative
importance of the maxima of SRH at the even harmonics. The estimated pitch value F0 for a given residual frame is thus the
frequency maximizing SRH(f) at that time. This may be problematic for low-pitched voices for which the third harmonic may be
present in the initial range [F0,min, F0,max]. Although we made several attempts to incorporate a correction in the equation by
subtracting a term in R((k ± 1/3) · f), no improvement was observed (this was especially true in noisy conditions). For this reason,
the proposed algorithm works in two steps. In the first step, the described process is performed using the full range [F0,min,
F0,max], from which the mean pitch frequency F0,mean of the considered speaker is estimated. In the second step, the final pitch
tracking is obtained by applying the same process but in the range [0.5 · F0,mean, 2 · F0,mean].


It can indeed be assumed that a normal speaker will not exceed these limits. Note that this idea of restricting the F0 range for a given speaker is similar to what has been proposed elsewhere for the choice of the window length.

Fig. 2 Illustration of the proposed method on clean and noisy speech (using jet noise at an SNR of 0 dB). Top plot: the pitch
ground truth and the estimates F0. Bottom plot: the ideal voicing decisions and the values of SRH.

The Summation of Residual Harmonics (SRH) is thus proposed both for pitch estimation and for the determination of voicing
boundaries. SRH is shown to lead to a significant improvement in noisy conditions, while its performance is comparable to other
techniques in clean conditions.
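
To make the SRH procedure above concrete, the following Python sketch outlines one possible implementation of the two-step tracking. It is a minimal sketch, not the reference implementation: the LP order, frame and hop durations, Nharm, the FFT size and the crude voicing heuristic are illustrative assumptions rather than values taken from this paper.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.signal.windows import hann

# Illustrative parameters (assumptions, not taken from the paper)
LP_ORDER = 12
N_HARM = 5
N_FFT = 8192


def lp_residual(frame, order=LP_ORDER):
    """Whiten one speech frame by LP inverse filtering (autocorrelation method)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])   # normal equations R a = r
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)    # residual r(t)


def srh_curve(residual, fs, f0_min, f0_max, n_harm=N_HARM):
    """SRH(f) = R(f) + sum_k [ R(k f) - R((k - 1/2) f) ] on a 1 Hz candidate grid.
    Assumes n_harm * f0_max < fs / 2 so every harmonic lookup stays in band."""
    spec = np.abs(np.fft.rfft(residual * hann(len(residual)), N_FFT))
    spec /= np.linalg.norm(spec) + 1e-12
    grid = np.arange(f0_min, f0_max + 1.0)
    lookup = lambda f: spec[np.round(f * N_FFT / fs).astype(int)]
    values = lookup(grid).copy()
    for k in range(2, n_harm + 1):
        values += lookup(k * grid) - lookup((k - 0.5) * grid)
    return grid, values


def srh_pitch_track(speech, fs, f0_min=50.0, f0_max=400.0, frame_dur=0.10, hop_dur=0.01):
    """Two-step SRH tracking: estimate F0,mean on the full range, then re-run
    with the range restricted to [0.5, 2] * F0,mean."""
    frame, hop = int(frame_dur * fs), int(hop_dur * fs)
    frames = [speech[i:i + frame] for i in range(0, len(speech) - frame, hop)]

    def run(lo, hi):
        f0s, peaks = [], []
        for x in frames:
            grid, vals = srh_curve(lp_residual(x), fs, lo, hi)
            f0s.append(grid[np.argmax(vals)])
            peaks.append(vals.max())            # usable as a rough voicing score
        return np.array(f0s), np.array(peaks)

    f0_pass1, score1 = run(f0_min, f0_max)
    f0_mean = np.mean(f0_pass1[score1 > np.median(score1)])   # crude voiced-frame selection
    return run(0.5 * f0_mean, 2.0 * f0_mean)
```

In this sketch the SRH peak height doubles as a rough voicing score, in the spirit of the voicing decisions shown in Fig. 2; a threshold on it would give the voicing boundaries.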

III. GCI DETECTION WITH THE SEDREAMS ALGORITHM


The Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) algorithm is a reliable and
accurate method for locating both GCIs and GOIs (although the latter less accurately) from the speech waveform. Since this work
focuses on GCIs, the determination of GOI locations by the SEDREAMS algorithm is omitted. The two steps involved in this method are:

A. The determination of short intervals where GCIs are expected to occur


B. The refinement of the GCI locations within these intervals.

The analysis is based on a mean-based signal. Denoting the speech waveform as s(n), the mean-based signal y(n) is defined as:

y(n) = (1 / (2N + 1)) · Σ_{m=−N}^{N} w(m) · s(n + m)
where w(m) is a windowing function of length 2N + 1. While the choice of the window shape is not critical (a typical Blackman
window is used in this study), its length influences the time response of this filtering operation and may therefore affect the
reliability of the method. The impact of the window length on the misidentification rate is illustrated in Figure 3 for the female
speaker. Optimality is seen as a trade-off between two opposite effects: a window that is too short causes the appearance of spurious
extrema in the mean-based signal, giving rise to false alarms, while a window that is too large smooths the signal, increasing the
miss rate. However, for the three speakers we clearly observed a valley between 1.5 and 2 times the average pitch period T0,mean.
Throughout the rest of this work, SEDREAMS is used with a window whose length is 1.75 · T0,mean.

Fig. 3 Effect of the window length used by SEDREAMS on the misidentification rate for the female speaker
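
As a concrete illustration, the short Python sketch below computes the mean-based signal y(n) with a Blackman window whose length is set to 1.75 · T0,mean, as recommended above. It is a minimal sketch assuming F0,mean (and hence T0,mean) has already been estimated, for example with the SRH tracker of the previous section.

```python
import numpy as np


def mean_based_signal(speech, fs, f0_mean, window_factor=1.75):
    """y(n) = (1 / (2N + 1)) * sum_{m=-N}^{N} w(m) * s(n + m),
    with w a Blackman window whose total length 2N + 1 is about
    window_factor * T0,mean (the average pitch period in samples)."""
    t0_mean = fs / f0_mean                         # average pitch period in samples
    half = int(round(window_factor * t0_mean / 2)) # half-length N
    w = np.blackman(2 * half + 1)
    # centred weighted moving average; 'same' keeps the original signal length
    return np.convolve(speech, w, mode="same") / (2 * half + 1)
```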


Interestingly, it is observed that the mean-based signal oscillates at the local pitch period. However, the mean-based signal is not
sufficient in itself for accurately locating GCIs, since GCIs may occur at a non-constant relative position within the cycle.
Nevertheless, once the minima and maxima of the mean-based signal are located, it is straightforward to derive short intervals within
which GCIs are expected to occur, and the GCI locations can then be refined more precisely within these intervals, as sketched below.
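
The sketch below shows one way to turn the mean-based signal into GCI estimates: its local minima and maxima delimit one short interval per cycle, and the GCI is then refined inside each interval. The refinement rule used here (taking the strongest LP-residual peak in the interval, with a single global inverse filter) is an assumption made for illustration, in the spirit of the algorithm's name; the exact rule of the original SEDREAMS method may differ. It reuses mean_based_signal() from the previous sketch and lp_residual() from the SRH sketch.

```python
import numpy as np
from scipy.signal import find_peaks


def sedreams_like_gci(speech, fs, f0_mean):
    """Hypothetical SEDREAMS-style GCI detection:
    1) compute the mean-based signal and locate its minima and maxima,
    2) form one candidate interval per cycle (here: minimum -> next maximum),
    3) refine the GCI as the strongest LP-residual peak inside the interval."""
    y = mean_based_signal(speech, fs, f0_mean)       # from the previous sketch
    residual = np.abs(lp_residual(speech))           # single global inverse filter, for simplicity

    minima, _ = find_peaks(-y)
    maxima, _ = find_peaks(y)

    gcis = []
    for m in minima:
        following = maxima[maxima > m]
        if len(following) == 0:
            break
        lo, hi = m, following[0]                     # short interval for this cycle
        gcis.append(lo + int(np.argmax(residual[lo:hi + 1])))
    return np.array(gcis)
```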

IV. GLOTTAL FLOW ESTIMATION WITH THE COMPLEX CEPSTRUM-BASED DECOMPOSITION


This method proposes a glottal flow estimation methodology based on the cepstrum and reports effective results. In this paper, we show
that the complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a
windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles
presented for the ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase
characteristics, which conforms to the speech production model in which the anticausal component is mainly due to the glottal flow
open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.
Separation can then be achieved by linear homomorphic filtering in the complex cepstrum domain, which has the property of
mapping convolution into addition. In speech analysis, the complex cepstrum is usually employed to deconvolve the speech signal into a
periodic pulse train and the vocal system impulse response. Its typical applications concern pitch detection, vocoding, formant
tracking and pole-zero modeling, but also extend to seismic processing and echo detection. The complex cepstrum (CC) x̂(n) of a
discrete signal x(n) is defined by the following equations:

X(ω) = Σ_{n=−∞}^{+∞} x(n) · e^{−jωn}

X̂(ω) = log|X(ω)| + j · ∠X(ω)

x̂(n) = (1/2π) ∫_{−π}^{π} X̂(ω) · e^{jωn} dω

where the equations are respectively the Discrete-Time Fourier Transform (DTFT), the complex logarithm and the inverse DTFT
(IDTFT). Our decomposition arises from the fact that the complex cepstrum x̂(n) of an anticausal (causal) signal is zero for all n
positive (negative). Retaining only the negative part of the CC should then estimate the glottal contribution.
One difficulty when computing the CC lies in the estimation of ∠X(ω), which requires an efficient phase unwrapping algorithm. In this
work, we computed the FFT on a sufficiently large number of points (typically 4096) such that:

A. The grid on the unit circle is sufficiently fine, which facilitates the phase evaluation
B. Distortion from aliasing in x̂(n) is minimized.
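
The Python sketch below illustrates the complex cepstrum computation and the causal/anticausal separation just described. It is a minimal sketch under simplifying assumptions: the input is a suitably windowed, GCI-centred mixed-phase frame; the 4096-point FFT follows the text; np.unwrap with a simple linear-phase correction stands in for the "efficient phase unwrapping algorithm"; and the quefrency-zero term is split evenly between the two components, which is only one possible convention.

```python
import numpy as np


def complex_cepstrum(frame, n_fft=4096):
    """x_hat(n) = IDFT( log|X(w)| + j * unwrapped angle of X(w) ), on a dense FFT grid."""
    X = np.fft.fft(frame, n_fft)
    phase = np.unwrap(np.angle(X))
    # remove the linear-phase term so the cepstrum is well behaved
    ndelay = round(phase[n_fft // 2] / np.pi)
    phase -= np.pi * ndelay * np.arange(n_fft) / (n_fft // 2)
    return np.real(np.fft.ifft(np.log(np.abs(X) + 1e-12) + 1j * phase))


def anticausal_component(frame, n_fft=4096):
    """Keep only the negative-quefrency (anticausal, maximum-phase) part of the
    complex cepstrum and invert the homomorphic transform; under the mixed-phase
    model this part is interpreted as the glottal open-phase contribution, while
    the positive-quefrency part gives the minimum-phase (vocal tract) component."""
    c = complex_cepstrum(frame, n_fft)
    c_anti = np.zeros_like(c)
    c_anti[0] = c[0] / 2.0                         # split the quefrency-zero (gain) term
    c_anti[n_fft // 2 + 1:] = c[n_fft // 2 + 1:]   # negative quefrencies sit in the upper half
    # invert the homomorphic transform: DFT -> exp -> inverse DFT
    return np.real(np.fft.ifft(np.exp(np.fft.fft(c_anti))))
```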

Fig. 4 Corresponding complex cepstrum, where the separation of the maximum- and minimum-phase components can be achieved linearly.

V. EVALUATION AND RESULTS


First, a wave signal is taken and decimated into four frequency bands by two stages of convolution. The signal is then filtered into
lower and upper speech bands by taking the FFT of the time-domain response of the filter. The frequency-domain representations of
the four decimated bands are then reconstructed by again convolving the four bands and taking the FFT of the resulting signal. The
synthesized signal is written to a wave file and given as input for finding the GCI locations.


The first step in GCI detection is estimating the fundamental frequency, which is known as pitch tracking. Here we analyze the
residual signal obtained by inverse filtering. A Hamming window is used for computing the spectrum, and for each frequency in the
range [F0,min, F0,max] the Summation of Residual Harmonics (SRH) is computed; the results are shown in Fig. 5.
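
The following Python sketch gives one possible realisation of the analysis/synthesis chain described above: two stages of half-band filtering and decimation by 2 produce four subbands, which are then upsampled, filtered and summed back into a full-band signal before being written to a wave file for the GCI analysis. The Butterworth half-band filters, their order, and the file names are illustrative assumptions; this is not the paper's exact filter design, and the simple filter bank used here is not perfectly reconstructing (a QMF design would be needed for that).

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.io import wavfile


def halfband_pair(fs):
    """Low-pass / high-pass pair with cutoff at a quarter of the sampling rate."""
    sos_lo = butter(8, fs / 4, btype="low", fs=fs, output="sos")
    sos_hi = butter(8, fs / 4, btype="high", fs=fs, output="sos")
    return sos_lo, sos_hi


def analyze(x, fs):
    """Two stages of half-band filtering + decimation by 2 -> four subbands."""
    sos_lo, sos_hi = halfband_pair(fs)
    low, high = sosfiltfilt(sos_lo, x)[::2], sosfiltfilt(sos_hi, x)[::2]
    sos_lo2, sos_hi2 = halfband_pair(fs // 2)
    return [sosfiltfilt(sos_lo2, low)[::2], sosfiltfilt(sos_hi2, low)[::2],
            sosfiltfilt(sos_lo2, high)[::2], sosfiltfilt(sos_hi2, high)[::2]]


def upsample2(band, sos):
    """Zero-insertion upsampling by 2 followed by the matching filter (gain 2)."""
    up = np.zeros(2 * len(band))
    up[::2] = band
    return 2.0 * sosfiltfilt(sos, up)


def synthesize(bands, fs):
    """Recombine the four subbands into a full-band signal (approximate reconstruction)."""
    sos_lo2, sos_hi2 = halfband_pair(fs // 2)
    low = upsample2(bands[0], sos_lo2) + upsample2(bands[1], sos_hi2)
    high = upsample2(bands[2], sos_lo2) + upsample2(bands[3], sos_hi2)
    sos_lo, sos_hi = halfband_pair(fs)
    return upsample2(low, sos_lo) + upsample2(high, sos_hi)


# Hypothetical usage with made-up file names:
# fs, x = wavfile.read("input.wav")
# wavfile.write("synthesized.wav", fs, synthesize(analyze(x, fs), fs).astype(np.int16))
# The synthesized file is then passed to the SRH / SEDREAMS analysis described above.
```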

Fig. 5(a) Effect of the window length used by SEDREAMS

Fig. 5(b) Glottal Source Estimation with the Complex Cepstrum based Decomposition

Fig. 5(c) GCI Detection with the SEDREAMS Algorithm.


WAVEFORM               POSITIVE POLARITY    POSITIVE POLARITY    AVERAGE
                       PROBABILITY          PROBABILITY          PROBABILITY
MALE VOCAL TRACT       76.07%               75.63%               75.85%
FEMALE VOCAL TRACT     74.32%               73.25%               73.78%

Table 5(c) Performance of the speaker-independent vocal tract system

VI. CONCLUSION
The performance of this system degrades completely with noisy speech and hence cannot be used in real-time applications such as
Automatic Speech Recognition (ASR).

A. This system improves the efficiency and reduces the error rate when compared to delta modulation encoding systems. The
proposed structure is used with the SEDREAMS algorithm to analyze the Glottal Closure Instant (GCI) locations.
B. We have not included the mapping functions corresponding to source characteristics (shape of the glottal pulse), duration
patterns and other speaker-specific features. The complex cepstrum has been applied for mixed-phase decomposition, allowing the
estimation of the glottal source, and the importance of a suited windowing has been highlighted.
C. Interestingly, significant differences between the voice qualities were observed in the excitation.

REFERENCES
[1] John G. Proakis and Dimitris G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, Prentice Hall, New Jersey, 2008.
[2] R. A. Roberts and C. T. Mullis, Digital Signal Processing, Addison-Wesley, Reading, Mass., 2006.
[3] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, Prentice Hall, Englewood Cliffs, New Jersey, 2007.
[4] R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing, Prentice Hall, Englewood Cliffs, New Jersey, 1983.
[5] R. W. Schafer and L. R. Rabiner, "A Digital Signal Processing Approach to Interpolation," Proc. IEEE, Vol. 61, pp. 692-702, June 1973.
[6] C. D. McGillem and G. R. Cooper, Continuous and Discrete Signal and System Analysis, 2nd ed., Holt, Rinehart and Winston, New York, 1984.
[7] R. E. Crochiere and L. R. Rabiner, "Optimum FIR Digital Filter Implementations for Decimation, Interpolation, and Narrowband Filtering," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-23, pp. 444-456, Oct. 1975.
[8] Ashraf M. Aziz, "Subband Coding of Speech Signals Using Decimation and Interpolation," Aerospace Sciences & Aviation Technology, ASAT-13, May 26-28, 2009.
[9] R. E. Crochiere and L. R. Rabiner, "Further Considerations in the Design of Decimators and Interpolators," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. ASSP-24, pp. 296-311, August 1976.
[10] R. E. Crochiere and L. R. Rabiner, "Interpolation and Decimation of Digital Signals - A Tutorial Review," Proc. IEEE, Vol. 69, pp. 300-331, March 1981.
[11] Andreas I. Koutrouvelis, George P. Kafentzis, Nikolay D. Gaubitch, and Richard Heusdens, "A Fast Method for High-Resolution Voiced/Unvoiced Detection and Glottal Closure/Opening Instant Estimation of Speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 24, No. 2, February 2016.
