
Feature Based Classification of Dysfluent and Normal Speech

P. Mahesha
Department of Computer Science and Engineering
S.J. College of Engineering, Mysore-570006
maheshsjce@yahoo.com

D.S. Vinod
Department of Information Science and Engineering
S.J. College of Engineering, Mysore-570006
dsvinod@daad-alumni.de

ABSTRACT
This paper develops a new approach for the identification and classification of dysfluent and normal speech using Mel-Frequency Cepstral Coefficients (MFCC). Stuttering is a fluency disorder in which the fluency of speech is interrupted by occurrences of dysfluencies such as repetitions, prolongations, interjections, silent pauses, broken words, incomplete phrases and revisions. In this work we consider three types of dysfluencies, namely repetition, prolongation and interjection, to characterize stuttered speech. After obtaining dysfluent and normal speech, the speech signals are analyzed in order to extract MFCC features. The k-Nearest Neighbor (k-NN) classifier is used to classify the speech as dysfluent or normal. 80% of the data is used for training and 20% for testing. Average accuracies of 86.67% and 93.34% are obtained for dysfluent and normal speech respectively.

Table 1. Types of dysfluencies

    Repetition      Syllable repetition (The baby ate the s-s-soup).
                    Whole word repetition (The baby-baby ate the soup).
                    Phrase or sentence repetition (The baby-the baby ate the soup).
    Prolongation    Syllable prolongation (The baaaby ate the soup).
    Interjection    Common interjections are "um" and "uh"
                    (The baby um ate the um soup).
    Silent pauses   The [pause] baby ate the [pause] soup. Silent durations
                    within speech are considered normal, and considered a
                    dysfluency if they last more than 2 seconds.

Keywords
Stuttering, Normal Speech, MFCC, k-NN

1. INTRODUCTION
Stuttering, also known as dysphemia or stammering, is a speech fluency disorder that affects the flow of speech. It is one of the serious problems in speech pathology and remains a poorly understood disorder. Approximately 1% of the population suffers from this disorder, and it has been found to affect four times as many males as females [11, 5, 16, 3]. Stuttering is of interest to researchers from various domains such as speech physiology, pathology, psychology, acoustics and signal analysis; it is therefore a multidisciplinary research field.

Speech fluency can be defined in terms of continuity, rate, co-articulation and effort. Continuity relates to the degree to which syllables and words are logically sequenced, as well as the presence or absence of pauses. If semantic units follow one another in a continual and logical flow of information, the speech is interpreted as fluent [4]. If there is a break in the smooth, meaningful flow of speech, it is dysfluent speech. The types of dysfluency that characterize stuttering disorder are shown in Table 1 [6].

There are not many clear and quantifiable characteristics that distinguish the dysfluencies of stutterers from those of normal speakers. The literature suggests that sound or syllable repetitions, word repetitions and prolongations are sufficient to differentiate them [6, 12].

There are a number of diagnosis methods to evaluate stuttering. The stuttering assessment process is carried out by transcribing the recorded speech, locating the dysfluencies that occurred, and counting the number of occurrences. Such assessments rely on the knowledge and experience of the speech pathologist; their main drawbacks are that they are time consuming, subjective, inconsistent and prone to error.

In this work, we propose an approach to classify dysfluent and normal speech using MFCC feature extraction. In order to classify stuttered speech we consider three types of dysfluencies: repetition, prolongation and interjection.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CCSEIT-12, October 26-28, 2012, Coimbatore [Tamil Nadu, India]
Copyright 2012 ACM 978-1-4503-1310-0/12/10 $10.00.

2. SPEECH DATA
The speech samples are obtained from the University College London Archive of Stuttered Speech (UCLASS) [15, 14]. The database consists of recordings of monologues, readings and conversation. There are 40 different speakers contributing 107 reading recordings in the database. In this work, speech samples are taken from standard readings of 25 different speakers aged between 10 and 20 years. The samples were chosen to cover a wide range of age and stuttering rate. The repetition, prolongation and filled-pause dysfluencies were segmented manually by listening to the speech signal, and the segmented samples are subjected to feature extraction. The same standard English passage used in the UCLASS database was used in preparing the fluent database: twenty normal speakers with a mean age of 25 were asked to read the passage, which was recorded using Cool Edit version 2.1.

3. METHODOLOGY
The overall process of dysfluent and normal speech classification is divided into four steps, as shown in Figure 1.

Figure 1. Schematic diagram of classification method

3.1 Pre-emphasis
This step is performed to enhance the accuracy and efficiency of the feature extraction processes. It compensates for the high-frequency part of the signal that is suppressed during the human sound production mechanism. The speech signal s(n) is passed through the high-pass filter

    s2(n) = s(n) − a·s(n−1)    (1)

where s2(n) is the output signal and the recommended value of a is usually between 0.9 and 1.0 [10]. The z-transform of the filter is

    H(z) = 1 − a·z^(−1)    (2)

The aim of this stage is to boost the amount of energy in the high frequencies.

3.2 Segmentation
In this paper we consider three types of dysfluencies in stuttered speech: repetitions, prolongations and interjections. These were identified by listening to the recorded speech samples and were segmented manually. The segmented samples are subjected to feature extraction.

3.3 Feature Extraction (MFCC)
Feature extraction converts an observed speech signal into a parametric representation for further investigation and processing. Several feature extraction algorithms are used for this task, such as Linear Predictive Coefficients (LPC), Linear Predictive Cepstral Coefficients (LPCC), Mel Frequency Cepstral Coefficients (MFCC) and Perceptual Linear Prediction (PLP) cepstra.

MFCC is one of the best known and most commonly used feature sets for speech recognition. It produces a multi-dimensional feature vector for every frame of speech; in this study we consider 12 MFCCs. The method is based on human hearing perception, which does not resolve frequencies above 1 kHz linearly; in other words, MFCC is based on the known variation of the human ear's critical bandwidth with frequency [7]. The block diagram for computing MFCC is given in Figure 2, and the step-by-step computation is discussed briefly in the following sections.

Figure 2. MFCC computation

3.3.1 Step 1: Framing
In framing, we split the pre-emphasized signal into several frames, so that each frame is analyzed over a short time instead of analyzing the entire signal at once [9]. A Hamming window is applied to each frame, which causes loss of information at the beginning and end of each frame; to overcome this, overlapping is applied to reincorporate that information into the extracted feature frames. The frame length is set to 25 ms, with a 10 ms overlap between two adjacent frames to ensure stationarity between frames.

3.3.2 Step 2: Windowing
The effect of spectral artifacts from the framing process is reduced by windowing [9]. Windowing is a point-wise multiplication between the framed signal and the window function; in the frequency domain, this becomes a convolution between the short-term spectrum and the transfer function of the window. A good window function has a narrow main lobe and low side-lobe levels in its transfer function [9]. The purpose of applying the Hamming window is to minimize spectral distortion and signal discontinuities. The Hamming window function is:

    w(n) = 0.54 − 0.46·cos(2πn/(N−1)),  0 ≤ n ≤ N−1    (3)

If the window is defined as w(n), 0 ≤ n ≤ N−1, then the result of windowing the signal is

    Y(n) = X(n)·W(n)    (4)

where N is the number of samples in each frame, Y(n) is the output signal, X(n) is the input signal and W(n) is the Hamming window.

3.3.3 Step 3: Fast Fourier Transform (FFT)
The purpose of the FFT is to convert the signal from the time domain to the frequency domain in preparation for the next stage (Mel frequency warping). Performing the Fourier transform converts the convolution of the glottal pulse and the vocal tract impulse response in the time domain into a multiplication in the frequency domain [2]:

    Y(w) = FFT[h(t) ∗ X(t)] = H(w)·X(w)    (5)

where X(w), H(w) and Y(w) are the Fourier transforms of X(t), H(t) and Y(t) respectively.
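The pre-emphasis, framing, windowing and FFT steps above (equations 1-5) can be sketched in Python with NumPy. This is a minimal illustration, not the authors' implementation: the filter coefficient a = 0.97 is one value in the 0.9-1.0 range the text recommends, the 25 ms frame length and 10 ms shift follow Section 3.3.1, and the 16 kHz sampling rate and 512-point FFT are assumptions for the example.

```python
import numpy as np

def preemphasize(signal, a=0.97):
    """Equation 1: s2(n) = s(n) - a*s(n-1)."""
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_and_window(signal, fs, frame_ms=25, shift_ms=10):
    """Split into 25 ms frames shifted by 10 ms, apply a Hamming window."""
    flen = int(fs * frame_ms / 1000)    # samples per frame
    fshift = int(fs * shift_ms / 1000)  # samples between frame starts
    n_frames = 1 + (len(signal) - flen) // fshift
    # Equation 3: w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(flen)
    # Equation 4: point-wise multiplication of each frame with the window
    frames = np.stack([signal[i * fshift : i * fshift + flen] * window
                       for i in range(n_frames)])
    return frames

def power_spectrum(frames, nfft=512):
    """Equation 5: per-frame FFT; squared magnitude of the positive bins."""
    return np.abs(np.fft.rfft(frames, n=nfft)) ** 2

fs = 16000                            # assumed sampling rate
signal = np.random.randn(fs)          # 1 second of dummy "speech"
frames = frame_and_window(preemphasize(signal), fs)
spec = power_spectrum(frames)
print(frames.shape, spec.shape)       # (98, 400) (98, 257)
```

The power spectrum computed here is what the Mel filter bank stage of the next section operates on.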

3.3.4 Step 4: Mel Filter Bank Processing
A set of triangular filter banks is used to approximate the frequency resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and logarithmic thereafter [1]. A set of overlapping Mel filters is constructed so that their center frequencies are equidistant on the Mel scale. Filter banks can be implemented in either the time domain or the frequency domain; for MFCC processing, they are implemented in the frequency domain. The filter bank according to the Mel scale is shown in Figure 3.

Figure 3. Mel scale filter bank

Figure 3 shows the set of triangular filters used to compute a weighted sum of spectral components, so that the output of the process approximates a Mel scale. The magnitude frequency response of each filter is triangular in shape, equal to unity at its center frequency and decreasing linearly to zero at the center frequencies of the two adjacent filters. The output of each filter is the sum of its filtered spectral components. The Mel value corresponding to a particular frequency f can then be expressed as:

    mel(f) = 2595·log10(1 + f/700)    (6)

3.3.5 Step 5: Discrete Cosine Transform (DCT)
In this step the log Mel spectrum is converted back to the time domain using the DCT; the outcome of this conversion is the MFCCs. Since the speech signal is represented as a convolution between the slowly varying vocal tract impulse response (filter) and the quickly varying glottal pulse (source), the speech spectrum consists of the spectral envelope (low frequency) and the spectral details (high frequency), which have to be separated.

The logarithm has the effect of changing multiplication into addition. Therefore we can simply convert the multiplication of the magnitudes of the Fourier transform into addition by taking the DCT of the logarithm of the magnitude spectrum.

The Mel frequency cepstrum is calculated from the result of the last step using equation 7 [13]:

    c_n = Σ_{k=1..K} (log S_k)·cos(π·n·(2k−1)/(2K)),  n = 1, 2, ..., K    (7)

where c_n is the n-th MFCC, S_k is the Mel spectrum and K is the number of cepstrum coefficients.

3.4 Classification
The k-Nearest Neighbor (k-NN) algorithm is used as the classification technique. k-NN classifies a new query instance based on the closest training examples in the feature space. It is a type of instance-based, or lazy, learning in which the function is only approximated locally and all computation is deferred until classification. Each query object (test speech signal) is compared with each training object (training speech signal). The query is then classified by a majority vote of its neighbors, being assigned to the class most common among its k nearest neighbors, where k is a positive integer, typically small. If k = 1, the object is simply assigned to the class of its nearest neighbor [8].

In this study the distance from the test speech signal to each training speech signal in the training set is calculated, and the test sample is assigned to the same class as the most similar, or nearest, sample in the training set. The Euclidean distance is used to measure the closeness between each training sample and the test sample:

    d_e(a, b) = sqrt( Σ_{i=1..n} (b_i − a_i)² )    (8)

Our aim is to perform two-class classification (dysfluent vs. normal) using the MFCC features. We consider two training data sets: one for dysfluent speech samples, covering the three types of dysfluencies (repetitions, prolongations and interjections), and a second for normal (fluent) speech. For each test sample, the k nearest members of the training data are found, and a class label, dysfluent speech or normal speech, is assigned based on a majority vote among them.

4. RESULTS AND DISCUSSIONS
The samples were chosen as explained in Section 2. The database is divided into two subsets, a training set and a testing set, in the ratio 80:20. To analyze the speech samples we first extract the MFCC features; two training databases are then constructed, one for dysfluent and one for normal speech samples. Once the system is trained, the test set is used to estimate the performance of the classifier. Table 2 indicates the significant difference in MFCC coefficients between dysfluent and fluent speech.

The experiment was repeated 3 times; each time, different training and testing sets were built randomly. The results of training and testing for dysfluent and normal speech are shown in Tables 3 and 4 respectively. We obtained average accuracies of 86.67% and 93.34% for dysfluent and normal speech, in that order.

Table 2. Mean MFCC coefficients for dysfluent and normal speech

    c_i    Dysfluent speech    Normal speech
    1        -3.4922             4.5201
    2         0.9650            -0.7038
    3        -1.8837            -2.2776
    4        -1.0960             3.3386
    5         0.4963             0.7528
    6        -0.0790            -0.5428
    7        -1.1014            -3.1698
    8        -0.1747            -1.9449
    9         1.1086            -1.6467
    10        0.3596             1.5044
    11       -0.6475            -0.9052
    12       -0.1359            -0.172
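The Mel filter bank and DCT stages described in Sections 3.3.4 and 3.3.5 (equations 6 and 7) can be sketched as follows. The 12 cepstrum coefficients match the study; the 26-filter bank, 512-point FFT and 16 kHz sampling rate are illustrative assumptions rather than values stated in the paper.

```python
import numpy as np

def mel(f):
    """Equation 6: mel(f) = 2595*log10(1 + f/700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_inv(m):
    """Inverse of equation 6: Mel value back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, nfft=512, fs=16000):
    """Triangular filters whose centers are equidistant on the Mel scale."""
    pts = mel_inv(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * pts / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising edge
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling edge
    return fbank

def mfcc(power_spec, n_ceps=12):
    """Equation 7: DCT of the log Mel spectrum gives the cepstrum."""
    fb = mel_filterbank()
    log_mel = np.log(power_spec @ fb.T + 1e-10)  # weighted sums per filter
    K = log_mel.shape[1]
    n = np.arange(1, n_ceps + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    dct = np.cos(np.pi * n * (2 * k - 1) / (2 * K))
    return log_mel @ dct.T       # one 12-dimensional vector per frame

# Dummy per-frame power spectra standing in for the FFT stage output.
spec = np.abs(np.fft.rfft(np.random.randn(98, 400), n=512)) ** 2
feats = mfcc(spec)
print(feats.shape)               # (98, 12)
```

Each row of the result is a 12-dimensional MFCC vector of the kind the k-NN classifier of Section 3.4 compares.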

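A minimal k-NN classifier of the kind described in Section 3.4, using the Euclidean distance of equation 8 and majority voting, could look like the sketch below. The toy 2-D points stand in for real per-sample MFCC feature vectors and the class labels are the two classes of this study; none of this is the authors' code.

```python
import numpy as np
from collections import Counter

def euclidean(a, b):
    """Equation 8: d_e(a, b) = sqrt(sum_i (b_i - a_i)^2)."""
    return np.sqrt(np.sum((b - a) ** 2))

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    """Assign the label most common among the k nearest training samples."""
    dists = [euclidean(test_vec, tv) for tv in train_vecs]
    nearest = np.argsort(dists)[:k]          # indices of the k closest
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]        # majority vote

# Toy 2-D stand-ins for MFCC feature vectors of training samples.
train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # "normal" cluster
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])   # "dysfluent" cluster
labels = ["normal"] * 3 + ["dysfluent"] * 3

print(knn_classify(np.array([0.15, 0.1]), train, labels))   # normal
print(knn_classify(np.array([3.05, 3.0]), train, labels))   # dysfluent
```

With k = 1 the same function reduces to the nearest-neighbor rule mentioned in the text.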
Table 3. Dysfluent speech classification results with 3 different sets

    Dysfluent speech samples      Set 1    Set 2    Set 3
    Training number               40       40       40
    Test number                   10       10       10
    Correct classification        8        8        8
    Rate of classification (%)    90       80       80
    Average classification (%)    86.67

Table 4. Normal speech classification results with 3 different sets

    Normal speech samples         Set 1    Set 2    Set 3
    Training number               40       40       40
    Test number                   10       10       10
    Correct classification        9        9        9
    Rate of classification (%)    90       90       90
    Average classification (%)    93.34

5. CONCLUSION
The speech signal can be used as a reliable indicator of speech abnormalities. We have proposed an approach to discriminate dysfluent from normal speech based on MFCC feature analysis. Using a k-NN classifier, average accuracies of 86.67% and 93.34% were obtained for dysfluent and normal speech respectively. In this work we considered a combination of three types of dysfluencies that are important in the classification of dysfluent speech. In future work, the amount of training data can be increased to improve test accuracy, and different feature extraction algorithms can be explored to improve performance.

6. REFERENCES
[1] Speech technology: A practical introduction. Topic: Spectrogram, cepstrum and mel-frequency analysis. Technical report, Carnegie Mellon University and International Institute of Information Technology Hyderabad.
[2] C. Becchetti and L. P. Ricotti. Speech Recognition. John Wiley and Sons, England.
[3] O. Bloodstein. A Handbook on Stuttering. Singular Publishing Group Inc., San Diego and London.
[4] C. Buchel and M. Sommer. What causes stuttering? PLoS Biology, 2(2):e46, doi:10.1371/journal.pbio.0020046, 2004.
[5] D. Sherman. Clinical and experimental use of the Iowa scale of severity of stuttering. Journal of Speech and Hearing Disorders, pages 316-320, 1952.
[6] W. Johnson et al. The Onset of Stuttering: Research Findings and Implications. University of Minnesota Press, Minneapolis.
[7] Lindasalwa et al. Voice recognition algorithms using Mel frequency cepstral coefficients (MFCC) and dynamic time warping (DTW) techniques. Journal of Computing, 2(3), March 2010.
[8] Faxin Yu, Zheming Lu, Hao Luo and Pinghui Wang. Three-Dimensional Model Analysis and Processing. Advanced Topics in Science and Technology in China, Springer, 2010.
[9] J. G. Proakis and D. G. Manolakis. Digital Signal Processing: Principles, Algorithms and Applications. Macmillan, New York.
[10] J. Harrington and S. Cassidy. Techniques in Speech Acoustics. Kluwer Academic Publishers, Dordrecht.
[11] M. A. Young. Predicting ratings of severity of stuttering. Journal of Speech and Hearing Disorders, pages 31-54, 1961.
[12] M. E. Wingate. Criteria for stuttering. Journal of Speech and Hearing Research, 13:596-607, 1977.
[13] Ibrahim Patel and Y. Srinivasa Rao. A frequency spectral feature modeling for hidden Markov model based automated speech recognition. Second International Conference on Networks and Communications, 2010.
[14] P. Howell and M. Huckvale. Facilities to assist people to research into stammered speech. Stammering Research: an on-line journal published by the British Stammering Association, 1:130-242, 2004.
[15] P. Howell, S. Davis and J. Batrip. The UCLASS archive of stuttered speech. Journal of Speech, Language and Hearing Research, 52:556-559, April 2009.
[16] W. L. Cullinan, E. M. Prather and D. Williams. Comparison of procedures for scaling severity of stuttering. Journal of Speech and Hearing Research, pages 187-194, 1963.

