IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING
Received 10 April 2013; revised 17 July 2013; accepted 17 July 2013. Date of publication 25 July 2013;
date of current version 21 January 2014.
Digital Object Identifier 10.1109/TETC.2013.2274797
ABSTRACT
This paper proposes a system that recognizes a person's emotional state starting from audio signal recordings. The provided solution is aimed at improving the interaction between humans and computers, thus allowing effective human-computer intelligent interaction. The system is able to recognize six emotions (anger, boredom, disgust, fear, happiness, and sadness) and the neutral state. This set of emotional states is widely used for emotion recognition purposes. The system can also distinguish a single emotion from all the other possible ones, as shown in the reported numerical results. The system is composed of two subsystems: 1) gender recognition (GR) and 2) emotion recognition (ER). The experimental analysis shows the performance in terms of accuracy of the proposed ER system. The results highlight that a priori knowledge of the speaker's gender allows a performance increase. The obtained results also show that the adoption of feature selection assures a satisfactory recognition rate while reducing the number of employed features. Future developments of the proposed solution may include the implementation of this system on mobile devices such as smartphones.
INDEX TERMS
I. INTRODUCTION
complexity [13], jitter and shimmer (pitch and amplitude micro-variations, respectively), harmonics-to-noise ratio, and the distance between the signal spectrum and the formants [16].
D. CLASSIFICATION METHODS
The paper presents a gender-driven emotion recognition system whose aim, starting from speech recordings, is to identify the gender of the speaker and then, on the basis of this information, to classify the emotion characterizing the speech signal.
Concerning the first step, the paper proposes a gender recognition method based on the pitch. This method employs a typical speech signal feature together with a novel extraction method. It guarantees excellent performance: 100% accuracy. In practice, it always recognizes the gender of the speaker.
Concerning the emotion recognition approach, the paper proposes a solution based on traditional feature sets and classifiers but, differently from the state of the art, it employs two classifiers (i.e., two Support Vector Machines): one trained on signals recorded by male speakers and the other trained on female speech signals. The choice between the two classifiers is driven by the gender information obtained through the gender recognition method.
To the best of the authors' knowledge, the proposed gender-driven emotion recognition system represents a novel approach with respect to the literature in the field.
1) The signal s(n) is divided into frames.
2) The pitch frequency of each frame is estimated.
3) The frames of s(n) are grouped into an odd number of blocks.
4) The pitch PDF is estimated for each block.
5) The mean of each pitch PDF (PDFmean) is computed.
6) The decision between Male and Female is taken, for each block, by comparing its PDFmean with a fixed threshold thr computed by using the training set.
7) The final decision on the gender of the whole signal is taken by the majority rule: the signal s(n) is classified as Male if the majority of its blocks is classified as Male; otherwise, it is classified as Female.
The average pitch frequencies for male and female speakers, and the identified threshold thr, referred to a recording of 20 blocks, are shown in Fig. 2.
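For concreteness, the seven steps can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function and parameter names (classify_gender, frames_per_block) are hypothetical, the per-frame pitch estimator is assumed to be the autocorrelation method of Section III-A.2 (sketched later), and the plain average of the block's pitch estimates stands in for the energy-weighted PDF mean defined in (17).

```python
import numpy as np

def classify_gender(s, estimate_pitch_frame, fs=16000, frame_len=640,
                    frames_per_block=21, thr=160.0):
    """Majority-rule gender decision over blocks of frames (steps 1-7)."""
    # 1) split the signal into 40 ms frames (640 samples at 16 kHz)
    n_frames = len(s) // frame_len
    frames = s[:n_frames * frame_len].reshape(n_frames, frame_len)
    # 2) estimate the pitch frequency of every frame
    pitches = np.array([estimate_pitch_frame(f, fs) for f in frames])
    # 3) group the frames into blocks (the paper uses an odd number of
    #    blocks; enforcing oddness is omitted here for brevity)
    n_blocks = n_frames // frames_per_block
    votes = 0
    for b in range(n_blocks):
        block = pitches[b * frames_per_block:(b + 1) * frames_per_block]
        # 4)-5) summarize the block's pitch PDF by its mean; the plain
        # average is a surrogate for the energy-weighted PDF_mean of (17)
        pdf_mean = block.mean()
        # 6) per-block decision against the trained threshold thr
        votes += int(pdf_mean <= thr)  # lower mean pitch -> Male
    # 7) majority rule over all blocks
    return "Male" if votes > n_blocks / 2 else "Female"
```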
The autocorrelation function of a frame of s(n) is computed as

$$R(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} s(n)\, s(n+\tau), \quad \tau \in [0, 1, \ldots, N-1]. \quad (1)$$
A speech signal exhibits a relative periodicity, and its fundamental frequency, called pitch (frequency), is usually the lowest frequency component [35]. In the case of this paper, it is important to estimate the Probability Density Function (PDF) of the pitch of a given speech signal. The applied procedure is described in the following subsection. In general, for voiced speech, the pitch is usually defined as the rate of vibration of the vocal folds [36] and, for this reason, can be considered a distinctive feature of an individual. Estimating the pitch of an audio sample could therefore help classify it as belonging to either a male or a female speaker, since its value for male speakers is usually lower than for female speakers. Many pitch estimation algorithms have been proposed in the past years, involving both time- and frequency-domain analysis [35]. Many developed methods are context-specific, and pitch estimators designed for a particular application depend on the considered application context.
Given the physiological pitch range [P1, P2], the corresponding delay range is

$$\tau_1 = \frac{F_s}{P_2}, \qquad \tau_2 = \frac{F_s}{P_1}, \quad (2)$$

so that the autocorrelation is evaluated only over the restricted set of delays:

$$R(\tau) = \frac{1}{N} \sum_{n=0}^{N-1} s(n)\, s(n+\tau), \quad \tau \in [\tau_1, \tau_1+1, \tau_1+2, \ldots, \tau_2]. \quad (3)$$
From the practical viewpoint, the physiological limitation of the pitch range implies a first significant reduction of the number of samples involved in the computation and, as a consequence, of the overall complexity. The discrete-time signal s(n) is divided into frames, in this paper each composed of N = 640 samples, acquired by using Fs = 16 [kHz]. The value of N is chosen in order to obtain frames of Lf = 40 [ms], which allow considering the human voice as a stationary statistical process.
If the entire autocorrelation function R(τ) computed as in (1) is evaluated, the number of samples is equal to 640. Considering the physiological pitch limitations reported above (i.e., P1 = 50 [Hz] and P2 = 500 [Hz]), τ1 = 32 and τ2 = 320, as indicated in (2); R(τ) in (3) is thus calculated by using τ2 − τ1 + 1 = 289 samples.
The autocorrelation function shows how much the signal correlates with itself at different delays τ. Given a sufficiently periodical speech recording, its autocorrelation presents the highest values at delays corresponding to multiples of the pitch period [35], [37].
Defining the pitch period (in samples) as

$$\tau_{pitch} = \arg\max_{\tau} R(\tau), \quad (4)$$

the pitch frequency is obtained as

$$f_{pitch} = \frac{F_s}{\tau_{pitch}}. \quad (5)$$

To further reduce the computational complexity, the autocorrelation in (3) is downsampled by a factor r, keeping one sample out of 1/r and obtaining the downsampled autocorrelation $R_r(\tilde{\tau})$ (6). A Coarse Search then estimates the pitch delay and the pitch frequency on the downsampled autocorrelation:

$$\tilde{\tau}_{pitch} = \arg\max_{\tilde{\tau}} R_r(\tilde{\tau}), \quad (7) \qquad \tilde{f}_{pitch} = \frac{F_s}{\tilde{\tau}_{pitch}}. \quad (8)$$

A Fine Search subsequently evaluates R(τ) in (3) only over the delays around the coarse maximum,

$$\tau \in \left[\tilde{\tau}_{pitch} - \tfrac{1}{r} + 1, \ldots, \tilde{\tau}_{pitch} - 1, \tilde{\tau}_{pitch}, \tilde{\tau}_{pitch} + 1, \ldots, \tilde{\tau}_{pitch} + \tfrac{1}{r} - 1\right], \quad (9)$$

and the final estimates are

$$\tau^0_{pitch} = \arg\max_{\tau} R(\tau), \qquad f^0_{pitch} = \frac{F_s}{\tau^0_{pitch}}, \quad (10)$$

where the maximization is restricted to the window in (9). The resulting $f^0_{pitch}$ is the reference pitch value in the remainder of this paper.
Autocorrelation downsampling may cause the Fine Search to be applied around a local maximum of the autocorrelation in (3) instead of the global one. This occurs only if the delay corresponding to the global maximum of R(τ) is farther than |1/r − 1| samples from the delay corresponding to the global maximum of $R_r(\tilde{\tau})$. This event occurs rarely and, as a consequence, the new approach guarantees a good estimation and represents a reasonable compromise between performance, computational complexity, and energy saving, making the implementation of the GR algorithm on mobile platforms feasible.
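A compact sketch of the described estimator, assuming a downsampling factor r = 1/step with integer step; the exact handling of the search window at the borders is an assumption:

```python
import numpy as np

def estimate_pitch_frame(frame, fs=16000, p1=50.0, p2=500.0, step=4):
    """Autocorrelation pitch estimate with a Coarse Search on the
    downsampled lag axis followed by a Fine Search (Sec. III-A.2)."""
    n = len(frame)
    tau1, tau2 = int(fs / p2), int(fs / p1)      # physiological lag range (2)
    lags = np.arange(tau1, tau2 + 1)             # 289 lags for N = 640
    # restricted autocorrelation R(tau) of (3)
    r = np.array([np.dot(frame[:n - t], frame[t:]) for t in lags]) / n
    # Coarse Search: keep one lag out of `step` (downsampling factor 1/step)
    coarse_idx = step * int(np.argmax(r[::step]))
    # Fine Search: inspect the +/- (step - 1) neighbourhood of the coarse max
    lo = max(0, coarse_idx - (step - 1))
    hi = min(len(r), coarse_idx + step)
    tau_pitch = lags[lo + int(np.argmax(r[lo:hi]))]
    return fs / tau_pitch                        # reference pitch f0 (10)
```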
3) PDF ESTIMATION
The pitch range is divided into H bins of width Δp (12), and the PDF is estimated as a weighted sum of rectangular functions centred on the bins:

$$PDF(p) = \sum_{h=0}^{H-1} w_h \, \mathrm{rect}\!\left(\frac{p - \frac{\Delta p}{2} - h\,\Delta p}{\Delta p}\right), \quad (13)$$
where $w_h$ is the coefficient associated with the h-th bin and implements the mentioned weighted sum, as explained in the following. If $w_h$ is the number of pitch estimates $f_{pitch}^{b_t,v}$, v = 1, ..., V, whose values fall within the h-th bin, then the PDF is simply estimated through the number of occurrences and is called a histogram count. In order to have a more precise PDF estimation and, consequently, more accurate feature vectors, this paper links the coefficient $w_h$ to the energy distribution of the signal s(n) in the frequency range of the PDF.
Given the Discrete Fourier Transform (DFT) of s(n), $DFT(s(n)) = S(k) = \sum_{n=0}^{N-1} s(n)\, e^{-j 2\pi n \frac{k}{N}}$, $k \in [0, \ldots, N-1]$, the definition of the signal energy $E_s$, and the Parseval relation written for the defined DFT, we have

$$E_s = \sum_{n=0}^{N-1} |s(n)|^2 = \frac{1}{N} \sum_{k=0}^{N-1} |S(k)|^2.$$

A single energy component is $|S(k)|^2$, where k represents the index of the frequency $f_k = k \frac{F_s}{N}$, k = 0, ..., N − 1. To evaluate the energy component of each frequency bin h, we would need to know the energy contribution carried by each pitch occurring within bin h. In practice, the quantity $|S(f_{pitch}^{b_t,v})|^2$, v = 1, ..., V, would be necessary, but the DFT is a function of an integer index k while $f_{pitch}^{b_t,v} \in \mathbb{R}$. So, to allow the computation for real numbers, $f_{pitch}^{b_t,v}$ is approximated by the closest integer number $f_{pitch,int}^{b_t,v}$, defined as follows:
$$f_{pitch,int}^{b_t,v} = \begin{cases} \left\lfloor f_{pitch}^{b_t,v} \right\rfloor & \text{if } f_{pitch}^{b_t,v} - \left\lfloor f_{pitch}^{b_t,v} \right\rfloor < \frac{1}{2} \\[4pt] \left\lceil f_{pitch}^{b_t,v} \right\rceil & \text{if } f_{pitch}^{b_t,v} - \left\lfloor f_{pitch}^{b_t,v} \right\rfloor \geq \frac{1}{2} \end{cases} \quad (14)$$

The weight of the h-th bin is then the energy carried by the pitch estimates falling within that bin, normalized by the overall energy of the pitch estimates:

$$w_h = \frac{\displaystyle\sum_{v:\, f_{pitch}^{b_t,v} \in \text{bin } h} \left|S\!\left(f_{pitch,int}^{b_t,v}\right)\right|^2}{\displaystyle\sum_{h=0}^{H-1} \sum_{v=1}^{V} \left|S\!\left(f_{pitch,int}^{b_t,v}\right)\right|^2}. \quad (15)$$
In order to determine the best feature vector $\chi_{GR}$ that maximizes the efficiency of the proposed GR method, different feature vectors were evaluated by combining different individual features. The principal feature is the mean of the pitch PDF,

$$\chi^1_{GR} = PDF_{mean} = \sum_{h=0}^{H-1} p_h \, w_h, \quad (17)$$

where $p_h$ is the pitch value at the centre of the h-th bin. The gender decision g is taken by comparing $PDF_{mean}$ with the threshold thr:

$$g = \mathrm{sgn}\left(thr - \sum_{h=0}^{H-1} p_h w_h\right). \quad (18)$$

In practice, g = 1 (Male) if $\sum_{h=0}^{H-1} p_h w_h \leq thr$, and g = −1 (Female) otherwise. Starting from experimental tests, the employed threshold thr has been estimated to be 160 [Hz]. This numerical value has been employed in the proposed gender-driven emotion recognition system.
Numerical results, not reported here for the sake of brevity, have shown that the proposed method is able to recognize the gender with 100% accuracy.
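Putting the pieces together, a possible per-block implementation of the energy-weighted PDF (13)-(15) and of the decision rule (18) is sketched below. The bin layout, the mapping from pitch frequency to DFT index, and all names are assumptions based on the reconstruction above, not the authors' code:

```python
import numpy as np

def block_gender_decision(block_pitches, frame_spectra, fs=16000, n_fft=640,
                          p1=50.0, p2=500.0, n_bins=16, thr=160.0):
    """Energy-weighted pitch PDF (13)-(15) and threshold decision (18)
    for one block; all parameter values are illustrative."""
    dp = (p2 - p1) / n_bins                        # bin width (assumed form of (12))
    centers = p1 + (np.arange(n_bins) + 0.5) * dp  # bin centres p_h
    w = np.zeros(n_bins)
    for pitch, spec in zip(block_pitches, frame_spectra):
        k = int(round(pitch * n_fft / fs))         # nearest integer DFT index (14)
        h = min(n_bins - 1, max(0, int((pitch - p1) / dp)))
        w[h] += np.abs(spec[k]) ** 2               # energy-weighted count (15)
    w /= w.sum()                                   # normalize (assumes voiced frames)
    pdf_mean = float(np.dot(centers, w))           # PDF_mean of (17)
    return 1 if pdf_mean <= thr else -1            # g = 1 Male, g = -1 Female (18)
```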
B. EMOTION RECOGNITION (ER) SUBSYSTEM
For the sake of completeness, in the remainder of this subsection, the principal features considered are listed and defined. Energy and amplitude are simply the energy and the amplitude of the audio signal, and no explicit formal definition is necessary. Concerning pitch-related features, the extraction approach is based on the pitch estimation described in Section III-A.2.
The other considered features can be defined as follows:
a) Formants: In speech processing, formants are the resonance frequencies of the vocal tract. The estimation of their frequency and of their 3 [dB] bandwidth is fundamental for the feature extraction process.
b) Mel-Frequency Cepstral Coefficients (MFCC): given the energies $P_j$ of the 13 considered frequency bands, the MFCCs are obtained through the discrete cosine transform

$$MFCC_i = \sum_{j=1}^{13} P_j \cos\left(i \frac{\pi}{13} (j - 0.5)\right), \quad i \in [1, M]. \quad (21)$$

c) Centre of gravity: the centre of gravity of the spectrum is its average frequency, weighted by the squared amplitude:

$$f_{COG} = \frac{\sum_{k=1}^{N} f_k |S(k)|^2}{\sum_{k=1}^{N} |S(k)|^2}. \quad (22)$$

d) Central moments: the i-th central moment of the spectrum is

$$\mu_i = \frac{\sum_{k=1}^{N} (f_k - f_{COG})^i \,|S(k)|}{\sum_{k=1}^{N} |S(k)|}. \quad (23)$$
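The DCT step of (21) can be sketched as follows, under the assumption that $P_j$ are the band energies produced by a 13-band filterbank, which is not shown here:

```python
import numpy as np

def mfcc_from_band_energies(P, M=12):
    """DCT of (21): MFCC_i = sum_j P_j * cos(i * pi/13 * (j - 0.5)).

    P: energies P_j of the 13 frequency bands (j = 1..13); computing them
    from |S(k)|^2 via a filterbank is assumed to happen elsewhere.
    """
    P = np.asarray(P, dtype=float)
    nb = len(P)                                   # 13 bands in the paper
    i = np.arange(1, M + 1)[:, None]              # cepstral indices i = 1..M
    j = np.arange(1, nb + 1)[None, :]             # band indices j = 1..13
    return (P[None, :] * np.cos(i * np.pi / nb * (j - 0.5))).sum(axis=1)
```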
e) Standard Deviation (SD): The standard deviation of a spectrum measures how much the frequencies in the spectrum deviate from the centre of gravity. The SD corresponds to the square root of the second central moment $\mu_2$:

$$\sigma = \sqrt{\mu_2}. \quad (24)$$

f) Skewness: The skewness of a spectrum is a measure of its symmetry and is defined as the third central moment of the considered sequence s(n), divided by the 1.5 power of the second central moment:

$$\gamma_1 = \frac{\mu_3}{\sqrt{\mu_2^3}}. \quad (25)$$

g) Kurtosis: The excess of kurtosis is defined as the ratio between the fourth central moment and the square of the second central moment of the considered sequence s(n), minus 3:

$$\gamma_2 = \frac{\mu_4}{\mu_2^2} - 3. \quad (26)$$
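A sketch of the spectral-moment features (22)-(26); the $|S(k)|^2$ weighting for the centre of gravity and the $|S(k)|$ weighting for the central moments follow the reconstruction of (22) and (23) above:

```python
import numpy as np

def spectral_moments(spec, fs=16000):
    """Centre of gravity (22), SD (24), skewness (25), kurtosis excess (26)."""
    mag = np.abs(np.asarray(spec))
    n = len(mag)
    f = np.arange(n) * fs / n                      # bin frequencies f_k
    f_cog = np.sum(f * mag**2) / np.sum(mag**2)    # centre of gravity (22)

    def mu(i):
        # i-th central moment of the spectrum, |S(k)|-weighted as in (23)
        return np.sum((f - f_cog)**i * mag) / np.sum(mag)

    sd = np.sqrt(mu(2))                            # standard deviation (24)
    skew = mu(3) / mu(2)**1.5                      # skewness (25)
    kurt = mu(4) / mu(2)**2 - 3                    # kurtosis excess (26)
    return f_cog, sd, skew, kurt
```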
h) Jitter-related features: Let $T_q$, q = 1, ..., Q, denote the durations of the Q pitch periods in the considered signal. The mean period is

$$\bar{T} = \frac{1}{Q} \sum_{q=1}^{Q} T_q. \quad (27)$$

The mean absolute difference between consecutive periods is

$$\frac{1}{Q-1} \sum_{q=2}^{Q} \left| T_q - T_{q-1} \right|. \quad (28)$$

The mean absolute difference between consecutive differences of periods, normalized by the mean period, is

$$\frac{1}{(Q-2)\,\bar{T}} \sum_{q=2}^{Q-1} \left| (T_{q+1} - T_q) - (T_q - T_{q-1}) \right|. \quad (30)$$

The five-point Period Perturbation Quotient compares each period with the mean of the five periods around it:

$$J_{5PPQ} = \frac{1}{(Q-4)\,\bar{T}} \sum_{q=3}^{Q-2} \left| T_q - \frac{T_{q-2} + T_{q-1} + T_q + T_{q+1} + T_{q+2}}{5} \right|. \quad (31)$$
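These period-perturbation measures can be sketched as follows; the handling of short sequences (Q < 5) is ignored for brevity, and the normalization choices follow the reconstruction of (27)-(31) above:

```python
import numpy as np

def jitter_features(T):
    """Period perturbation measures over the pitch periods T_q (27)-(31)."""
    T = np.asarray(T, dtype=float)
    Q = len(T)
    T_mean = T.mean()                                   # mean period (27)
    j_abs = np.abs(np.diff(T)).mean()                   # consecutive diffs (28)
    d = np.diff(T)
    j_ddp = np.abs(np.diff(d)).mean() / T_mean          # diffs of diffs (30)
    # five-point period perturbation quotient (31)
    win = np.convolve(T, np.ones(5) / 5, mode="valid")  # 5-period local means
    j_ppq5 = np.abs(T[2:Q - 2] - win).mean() / T_mean   # J5PPQ
    return T_mean, j_abs, j_ddp, j_ppq5
```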
3) EMOTION CLASSIFIERS
For each gender g, the corresponding SVM is trained by solving the dual optimization problem

$$\min_{\alpha_g} \Phi_g(\alpha_g) = \frac{1}{2} \sum_{u=1}^{\ell_g} \sum_{v=1}^{\ell_g} y_g^u y_g^v \, K\!\left(x_g^u, x_g^v\right) \alpha_g^u \alpha_g^v - \sum_{u=1}^{\ell_g} \alpha_g^u \quad (32)$$

subject to

$$\sum_{u=1}^{\ell_g} \alpha_g^u y_g^u = 0, \qquad 0 \leq \alpha_g^u \leq C, \quad u \in [1, \ell_g],$$

where $\ell_g$ is the number of training pairs $(x_g^u, y_g^u)$ for gender g and $K(\cdot,\cdot)$ is the kernel function.
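A minimal sketch of the two gender-specific classifiers with scikit-learn; the library choice, class name, and parameters are assumptions, not stated in the paper (SVC internally solves a soft-margin dual of the form (32), handling multi-class labels one-vs-one):

```python
import numpy as np
from sklearn.svm import SVC

class GenderDrivenER:
    """Two emotion classifiers, one per gender, selected at test time
    by the output of the gender recognition subsystem."""

    def __init__(self, C=1.0, kernel="rbf"):
        # one soft-margin SVM per gender
        self.svm = {g: SVC(C=C, kernel=kernel) for g in ("male", "female")}

    def fit(self, X, emotions, genders):
        X, emotions, genders = map(np.asarray, (X, emotions, genders))
        for g in ("male", "female"):
            # train each SVM only on the speech signals of one gender
            self.svm[g].fit(X[genders == g], emotions[genders == g])
        return self

    def predict(self, x, gender):
        # the GR output drives the choice between the two classifiers
        return self.svm[gender].predict(np.asarray(x).reshape(1, -1))[0]
```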
The reported results highlight that the gender information increases, on average, the accuracy of the emotion recognition system.
A. WITHOUT GENDER RECOGNITION
The results reported in this subsection have been obtained by exploiting, in the testing phase, the Gender Recognition subsystem introduced in Section III-A, which provides 100% gender recognition. In this case, as depicted in Fig. 3 and as extensively explained before, two SVMs, one for each gender, have been trained: the first SVM on male speech signals, the second on female ones.
Also in this case, the SVM training and testing phases have been carried out by two k-fold (k = 10) cross-validations and, again, the overall set of BES signals has been employed, by dividing male speech signals from female ones.
The reported results show that the employment of information related to the speaker's gender allows improving the performance. The overall set of features (182) has been employed for these tests. In more detail, Table 3 and Table 4 show the confusion matrices concerning male and female speech signals, respectively.
TABLE 3. Confusion matrix of male speech signals obtained by applying Gender Recognition.
V. CONCLUSION
REFERENCES
[1] F. Burkhardt, M. van Ballegooy, R. Englert, and R. Huber, "An emotion-aware voice portal," in Proc. ESSP, 2005, pp. 123–131.
[2] J. Luo, Affective Computing and Intelligent Interaction, vol. 137. New York, NY, USA: Springer-Verlag, 2012.
[3] R. W. Picard, Affective Computing. Cambridge, MA, USA: MIT Press, 2000.
[4] G. Chittaranjan, J. Blom, and D. Gatica-Perez, "Who's who with big-five: Analyzing and classifying personality traits with smartphones," in Proc. 15th Annu. ISWC, Jun. 2011, pp. 29–36.
[5] A. Vinciarelli, M. Pantic, and H. Bourlard, "Social signal processing: Survey of an emerging domain," Image Vision Comput., vol. 27, no. 12, pp. 1743–1759, Nov. 2009. [Online]. Available: http://dx.doi.org/10.1016/j.imavis.2008.11.007
[6] Z. Zeng, M. Pantic, G. Roisman, and T. Huang, "A survey of affect recognition methods: Audio, visual, and spontaneous expressions," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 1, pp. 39–58, Jan. 2009.
[7] A. Gluhak, M. Presser, L. Zhu, S. Esfandiyari, and S. Kupschick, "Towards mood based mobile services and applications," in Proc. 2nd Eur. Conf. Smart Sens. Context, 2007, pp. 159–174.
[8] K. K. Rachuri, M. Musolesi, C. Mascolo, P. J. Rentfrow, C. Longworth, and A. Aucinas, "EmotionSense: A mobile phones based adaptive platform for experimental social psychology research," in Proc. UbiComp, 2010, pp. 281–290.
[9] M. El Ayadi, M. S. Kamel, and F. Karray, "Survey on speech emotion recognition: Features, classification schemes, and databases," Pattern Recognit., vol. 44, no. 3, pp. 572–587, 2011.
[10] R. Fagundes, A. Martins, F. Comparsi de Castro, and M. Felippetto de Castro, "Automatic gender identification by speech signal using eigenfiltering based on Hebbian learning," in Proc. 7th Brazilian SBRN, 2002, pp. 212–216.
[11] Y.-M. Zeng, Z.-Y. Wu, T. Falk, and W.-Y. Chan, "Robust GMM based gender classification using pitch and RASTA-PLP parameters of speech," in Proc. Int. Conf. Mach. Learn. Cybern., 2006, pp. 3376–3379.
[12] Y.-L. Shue and M. Iseli, "The role of voice source measures on automatic gender classification," in Proc. IEEE ICASSP, Mar./Apr. 2008, pp. 4493–4496.
[13] F. Yingle, Y. Li, and T. Qinye, "Speaker gender identification based on combining linear and nonlinear features," in Proc. 7th WCICA, 2008, pp. 6745–6749.
[14] H. Ting, Y. Yingchun, and W. Zhaohui, "Combining MFCC and pitch to enhance the performance of the gender recognition," in Proc. 8th Int. Conf. Signal Process., vol. 1, 2006.
[15] D. Deepawale and R. Bachu, "Energy estimation between adjacent formant frequencies to identify speaker's gender," in Proc. 5th Int. Conf. ITNG, 2008, pp. 772–776.
[16] F. Metze, J. Ajmera, R. Englert, U. Bub, F. Burkhardt, J. Stegmann, C. Muller, R. Huber, B. Andrassy, J. Bauer, and B. Littel, "Comparison of four approaches to age and gender recognition for telephone applications," in Proc. IEEE ICASSP, vol. 4, Apr. 2007, pp. IV-1089–IV-1092.
[17] E. Zwicker, "Subdivision of the audible frequency range into critical bands (Frequenzgruppen)," J. Acoust. Soc. Amer., vol. 33, no. 2, p. 248, 1961. [Online]. Available: http://link.aip.org/link/?JAS/33/248/1
[18] R. Stibbard, "Automated extraction of ToBI annotation data from the Reading/Leeds emotional speech corpus," in Proc. ISCA ITRW Speech Emotion, Sep. 2000, pp. 60–65.
[19] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. G. Taylor, "Emotion recognition in human-computer interaction," IEEE Signal Process. Mag., vol. 18, no. 1, pp. 32–80, Jan. 2001.
[20] N. Campbell, "Building a corpus of natural speech, and tools for the processing of expressive speech: The JST CREST ESP project," in Proc. 7th Eur. Conf. Speech Commun. Technol., 2001, pp. 1525–1528.
[21] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Proc. Int. Joint Conf. Artif. Intell., vol. 14, 1995, pp. 1137–1145.
[22] W. H. Rogers and T. J. Wagner, "A finite sample distribution-free performance bound for local discrimination rules," Ann. Stat., vol. 6, no. 3, pp. 506–514, 1978.
[23] R. Hanson, J. Stutz, and P. Cheeseman, Bayesian Classification Theory. Washington, DC, USA: NASA Ames Research Center, 1991.
[24] V. Vapnik, The Nature of Statistical Learning Theory. New York, NY, USA: Springer-Verlag, 1999.
[25] B. Yegnanarayana, Artificial Neural Networks. Englewood Cliffs, NJ, USA: Prentice-Hall, 2004.