
Omar Zelmati, Boban Bondžulić, Milenko Andrić, Dimitrije Bujaković

Spectral Analysis of Male and Female Speech


Omar Zelmati is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: omarzelmati1991@gmail.com).
Boban Bondžulić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia.
Milenko Andrić is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: andricsmilenko@gmail.com).
Dimitrije Bujaković is with the Military Academy, University of Defence in Belgrade, 33 Generala Pavla Jurišića Šturma, 11000 Belgrade, Serbia (e-mail: dimitrije.bujakovic@va.mod.gov.rs).

Abstract—In this paper, spectral analysis is applied to a database of male and female speech recordings. The effect of the silent parts of the audio signal on the spectrum shape is studied, and it is shown that silent parts strongly disturb the spectral analysis and should be removed. Different spectra are then compared using correlation. Spectra that originate from the same speaker are shown to be strongly correlated, and a significant correlation between the spectra of speakers of the same gender is highlighted. Finally, the effect of audio signal duration on the correlation between spectra is discussed. The obtained results are promising and can be used in several fields of speech signal analysis, such as speaker recognition and speaker gender identification.

Index Terms—Spectral analysis, Speech signal, Correlation.

I. INTRODUCTION

ANALYSIS of speech signals has various applications, such as speaker identification, automatic speech recognition, speaker gender recognition and speech enhancement. In recent years, many studies have been carried out to exploit the advantages of spectral analysis over time-domain analysis of speech.

As stated in [1], spectral analysis (also known as "Fourier representation") is often used to highlight properties of the speech signal that may be hidden or less obvious when the signal is represented in the time domain. These properties are extracted from a spectrum using different techniques and can then be exploited in various methods, according to the application and the purpose of the research.

Various approaches to extracting spectral features have been studied in the literature [2], such as the Discrete Fourier Transform (DFT), its fast implementation, the Fast Fourier Transform (FFT), and the Short-Time Fourier Transform (STFT).

Regardless of the signal type, the Fourier representation aims to decompose the signal into its frequency components [3]. For the special case of speaker recognition, numerous signal decomposition techniques based on the DFT have been proposed. Moreover, some alternatives, such as non-harmonic bases, aperiodic functions and data-driven bases derived from independent component analysis, have been discussed, and their effectiveness has been demonstrated [4-6]. However, because of its simplicity and efficiency, the DFT is used in practice, and usually only the magnitude spectrum is considered, based on the belief that phase has little perceptual importance [7]. The overall shape of the DFT magnitude spectrum, called the spectral envelope, contains information on the resonance properties of the vocal tract and has been found to be the most informative part of the spectrum for the purpose of speaker recognition [1].

Dynamic spectral features (spectral transition) as well as instantaneous spectral features play an important role in human speech perception [8], and many studies on feature extraction for isolated word recognition have investigated the effectiveness of spectral features. In [9] it is shown that Mel Frequency Cepstral Coefficients (MFCCs) are robust features for isolated word recognition. In that work, the coefficients are calculated in six steps: pre-emphasis, Hamming windowing, frame blocking, FFT, log Mel spectrum calculation and the Discrete Cosine Transform. Furthermore, MFCCs and other measures such as pitch, log energy and mel-band energy are adopted as base features for emotion recognition from speech signals in [10].

The Linear Prediction Coding (LPC) method is applied to speaker gender recognition in [11]. The LPC method acts as a filter applied to the FFT in order to spectrally flatten the input signal. Beyond its use in gender recognition, this method has been used in speech enhancement, particularly for noise reduction [12].

In this research, a spectral analysis of a recently formed database of audio signals is performed. The analysis is carried out with regard to the speaker gender (male or female), and the correlation between the spectra of the recorded audio signals is used as the measure of the spectral properties of different speakers. These correlations are analyzed for different speakers and for different texts. In addition, the effect of audio signal duration is analyzed through the correlation between male and female speakers.

The rest of the paper is organized as follows. In Section II the audio signal database is described. In Section III, FFT-based spectral analysis is applied to the recordings of the database through three analyses: speaker-based correlation, text-based correlation, and the effect of speech signal duration on the correlation between various speakers. In the last part of the paper, the results are discussed and some conclusions are highlighted, together with directions for further research.

II. DESCRIPTION OF SPEECH SIGNALS DATABASE

For the purposes of the male and female speech signal analysis, an audio recording database is created from the
recording of five prepared texts in the Serbian language, read by five male and five female speakers. As each speaker read five texts, the complete database consists of 50 audio records. The duration of each record is about 30 seconds.

All speakers are between 20 and 25 years old (students of the Military Academy in Belgrade). To guarantee the same recording conditions, all recordings are made in the same place, a sound-isolated room. All voice recordings were made using the SpectraLAB software package on a DELL laptop, with a sampling rate fs = 8 kHz and 16-bit resolution. The input device is the microphone of a Genius HS04S VoIP headset. The sensitivity of this microphone is –60 dB, and its frequency response ranges from 50 Hz to 20000 Hz.

The audio recordings contain words used in military terminology (gun, pistol, airplane, attack, defense, etc.), so they form a good basis for further research on isolated word recognition or speaker identification.

In this research, the FFT is applied to the recorded audio set in order to calculate the correlation between the obtained spectra. Although the recording is done in controlled conditions, a segmentation step is necessary to eliminate the effect of background noise on the spectrum shape. All analyzed speech signals are therefore preprocessed to remove their silent parts. The silence removal algorithm is proposed in [13] and is briefly described here. The method is based on two audio features, the signal energy and the spectral centroid. The applied algorithm consists of four stages:

1) Extraction of the signal energy and spectral centroid from the already decomposed sequences of the audio signal;
2) Threshold estimation for each sequence, where two thresholds are calculated on the basis of the extracted features;
3) Application of the threshold criterion to the audio signal sequences, and
4) Speech segment detection based on the threshold criterion, followed by post-processing.

First, the audio signal is divided into non-overlapping frames of the same duration (50 ms in this research), where $s_i(n)$, $n = 0, \ldots, N-1$, are the audio samples of the i-th frame of length N. The energy of one frame of the audio signal can be calculated using:

$$E(i) = \frac{1}{N} \sum_{n=0}^{N-1} \left| s_i(n) \right|^2 . \tag{1}$$

This energy is used to detect the silent frames present in the audio signal, based on the assumption that, if the level of background noise is not very high, the energy of the voiced segments is significantly greater than the energy of the silent segments [13]. Silent segments may contain environmental sounds, and for that reason the spectral centroid is also measured. The spectral centroid of low-energy segments is much smaller, as these noisy sounds tend to have lower frequencies and, as a result, lower spectral centroid values [13]. If $X_i(k)$, $k = 0, \ldots, N-1$, are the DFT coefficients of the i-th frame of length N, the spectral centroid is calculated as:

$$C_i = \frac{\sum_{k=0}^{N-1} (k+1)\, X_i(k)}{\sum_{k=0}^{N-1} X_i(k)} . \tag{2}$$

After all voiced segments of the audio signal are determined, a new voiced signal is created by concatenating all voiced segments of the original audio signal. The FFT is then applied to each modified signal. The FFT magnitudes of one audio signal from the database, before and after silence removal, are shown in Fig. 1.

[Figure 1: two panels, (a) and (b); horizontal axis: Frequency (Hz), from 0 to 4000.]
Fig. 1. Normalized spectrum of the audio signal: a) before silence removal, b) after silence removal.

Comparing the spectra before silence removal (Fig. 1.a) and after silence removal (Fig. 1.b), it can be noticed that silent parts strongly affect the spectrum shape. After silence removal, the magnitudes of frequencies above 1800 Hz are suppressed. The correlation between the spectra of the audio signal before and after silence removal is about 20%.
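The frame energy of (1), the spectral centroid of (2) and the removal of silent frames can be illustrated with a minimal Python sketch. It is not the algorithm of [13]: the threshold estimation and post-processing stages are replaced here by a hypothetical rule (a fixed fraction of the mean of each feature), and the function names and the `fraction` parameter are assumptions made for illustration only. The centroid is computed on one-sided DFT magnitudes, which is also an assumption, made because the two-sided magnitude spectrum of a real frame is symmetric.

```python
import numpy as np


def frame_signal(x, fs, frame_ms=50):
    """Split x into non-overlapping frames of frame_ms milliseconds."""
    n = int(fs * frame_ms / 1000)      # samples per frame (N)
    n_frames = len(x) // n             # a trailing partial frame is dropped
    return np.asarray(x, dtype=float)[: n_frames * n].reshape(n_frames, n)


def frame_energy(frames):
    """Eq. (1): mean squared amplitude of each frame."""
    return np.mean(np.abs(frames) ** 2, axis=1)


def spectral_centroid(frames):
    """Eq. (2), evaluated on one-sided DFT magnitudes (assumption)."""
    n = frames.shape[1]
    # Zero-pad to 2N and keep the first N bins, so k = 0..N-1 spans 0..fs/2.
    mag = np.abs(np.fft.fft(frames, n=2 * n, axis=1))[:, :n]
    k_plus_1 = np.arange(1, n + 1)
    total = mag.sum(axis=1)
    safe_total = np.where(total > 0, total, 1.0)  # all-zero frames get centroid 0
    return (mag * k_plus_1).sum(axis=1) / safe_total


def remove_silence(x, fs, frame_ms=50, fraction=0.5):
    """Keep frames whose energy AND centroid exceed a fraction of the mean.

    Hypothetical thresholds, used in place of the estimation described in [13].
    """
    frames = frame_signal(x, fs, frame_ms)
    energy = frame_energy(frames)
    centroid = spectral_centroid(frames)
    keep = (energy > fraction * energy.mean()) & (
        centroid > fraction * centroid.mean()
    )
    return frames[keep].reshape(-1)    # concatenation of the kept frames
```

With fs = 8000 and 50 ms frames, each frame holds 400 samples, so a 30-second record yields 600 frames before silence removal.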
A. Speaker-based Correlation
In the first analysis, the spectra of the audio signals of one speaker reading different texts are compared after silence removal. The normalized spectra of the audio signals of one male speaker (speaker 3) and one female speaker (speaker 7) from the database are shown in Fig. 2.

[Figure 2: two panels, (a) and (b), each with five stacked plots labelled Text 1 to Text 5; horizontal axis: Frequency (Hz), from 0 to 4000.]
Fig. 2. Spectra of the different audio speech signals for the same: a) male speaker (speaker 3) and b) female speaker (speaker 7).
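The normalized spectra compared in this section, and the correlation between them, could be computed along the following lines. The paper does not state which correlation measure is used, so this sketch assumes the Pearson correlation coefficient between one-sided FFT magnitude spectra; zero-padding all signals to a common FFT length is a further assumption, needed so that spectra of signals with different durations have the same number of bins. The function names are illustrative.

```python
import numpy as np


def normalized_spectrum(x, n_fft):
    """One-sided FFT magnitude of x, scaled so that its maximum is 1."""
    mag = np.abs(np.fft.rfft(x, n=n_fft))
    peak = mag.max()
    return mag / peak if peak > 0 else mag


def spectral_correlation(x, y, n_fft=None):
    """Pearson correlation between the magnitude spectra of x and y."""
    if n_fft is None:
        n_fft = max(len(x), len(y))    # common length (assumption)
    sx = normalized_spectrum(x, n_fft)
    sy = normalized_spectrum(y, n_fft)
    return float(np.corrcoef(sx, sy)[0, 1])


def correlation_matrix(signals, n_fft=None):
    """Matrix of pairwise spectral correlations for a list of signals."""
    if n_fft is None:
        n_fft = max(len(s) for s in signals)
    spectra = np.array([normalized_spectrum(s, n_fft) for s in signals])
    return np.corrcoef(spectra)        # each row is treated as one spectrum
```

Calling `correlation_matrix` on the five recordings of one speaker would give a symmetric 5 x 5 matrix with ones on the diagonal. Because the Pearson coefficient is invariant to scaling, the normalization step affects only how the spectra are plotted, not the correlation values.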
From Fig. 2, it can be noticed that signals that originate from the same speaker reading different texts have a similar spectral shape. Analyzing the spectral properties of the male speaker (Fig. 2.a), it can be noticed that the spectral components are spread more widely, while for the female speaker (Fig. 2.b) the spectral components are concentrated around 200 Hz and 1000 Hz.

In order to quantify spectral similarity, the correlation between the spectra of the audio signals produced by the male and by the female speaker is calculated with regard to the spoken texts. The results of this analysis are shown in Table I (male speaker) and Table II (female speaker).

TABLE I
SPECTRUM BASED CORRELATION FOR THE MALE SPEAKER

Without silence removing
        Text1  Text2  Text3  Text4  Text5
Text1   1      0.48   0.46   0.41   0.45
Text2   0.48   1      0.47   0.40   0.43
Text3   0.46   0.47   1      0.39   0.43
Text4   0.41   0.40   0.39   1      0.40
Text5   0.45   0.43   0.43   0.40   1

With silence removing
        Text1  Text2  Text3  Text4  Text5
Text1   1      0.69   0.68   0.66   0.64
Text2   0.69   1      0.71   0.68   0.67
Text3   0.68   0.71   1      0.70   0.68
Text4   0.66   0.68   0.70   1      0.70
Text5   0.64   0.67   0.68   0.70   1

TABLE II
SPECTRUM BASED CORRELATION FOR THE FEMALE SPEAKER

Without silence removing
        Text1  Text2  Text3  Text4  Text5
Text1   1      0.37   0.36   0.36   0.32
Text2   0.37   1      0.37   0.38   0.32
Text3   0.36   0.37   1      0.37   0.31
Text4   0.36   0.38   0.37   1      0.32
Text5   0.32   0.32   0.31   0.32   1

With silence removing
        Text1  Text2  Text3  Text4  Text5
Text1   1      0.70   0.68   0.68   0.65
Text2   0.70   1      0.70   0.69   0.70
Text3   0.68   0.70   1      0.71   0.71
Text4   0.68   0.69   0.71   1      0.72
Text5   0.65   0.70   0.71   0.72   1

Comparing the spectral correlations presented in Tables I and II, a high correlation can be noticed between the spectra of the same speaker reading different texts, for both the male and the female speaker, after silence removal, while the correlation is much lower without silence removal. With silence removal, the highest correlation value is obtained for the female speaker (72%), while the lowest is obtained for the male speaker (64%). However, the average spectral correlation for the male and for the female speaker is nearly the same: 68% for the male and 69% for the female speaker. From this, it can be concluded that the spectral shape after silence removal may be used for feature extraction in order to perform speaker recognition.

B. Text-based Correlation
In order to determine the correlation between the spectra of the same text uttered by different speakers, the correlation between the spectra of different speakers is calculated for text 1. The results are shown in Table III. The part of this table colored in blue (male speakers 1 to 5 against each other) represents the correlation between the spectra of the five male speakers for the read text, while the part colored in orange (female speakers 6 to 10 against each other) represents the correlation between the spectra of the five female speakers for the same text. The yellow part of Table III (male against female speakers) is the spectral correlation between male and female speakers.

TABLE III
SPECTRAL CORRELATION OF TEXT 1
(speakers 1 to 5 are male, speakers 6 to 10 are female)

Without silence removing
Speaker  1     2     3     4     5     6     7     8     9     10
1        1     0.21  0.36  0.33  0.37  0.23  0.22  0.17  0.21  0.10
2        0.21  1     0.25  0.23  0.24  0.14  0.10  0.14  0.16  0.08
3        0.36  0.25  1     0.38  0.33  0.28  0.24  0.29  0.33  0.14
4        0.33  0.23  0.38  1     0.35  0.22  0.20  0.24  0.26  0.15
5        0.37  0.24  0.33  0.35  1     0.20  0.18  0.23  0.22  0.19
6        0.23  0.14  0.28  0.22  0.20  1     0.31  0.28  0.36  0.14
7        0.22  0.10  0.24  0.20  0.18  0.31  1     0.27  0.30  0.15
8        0.17  0.14  0.29  0.24  0.23  0.28  0.27  1     0.37  0.29
9        0.21  0.16  0.33  0.26  0.22  0.36  0.30  0.37  1     0.21
10       0.10  0.08  0.14  0.15  0.19  0.14  0.15  0.29  0.21  1

With silence removing
Speaker  1     2     3     4     5     6     7     8     9     10
1        1     0.53  0.54  0.59  0.55  0.44  0.43  0.32  0.38  0.31
2        0.53  1     0.56  0.61  0.55  0.43  0.35  0.40  0.45  0.36
3        0.54  0.56  1     0.60  0.50  0.49  0.44  0.43  0.51  0.37
4        0.59  0.61  0.60  1     0.57  0.50  0.46  0.46  0.51  0.43
5        0.55  0.55  0.50  0.57  1     0.39  0.34  0.37  0.39  0.42
6        0.44  0.43  0.49  0.50  0.39  1     0.63  0.52  0.66  0.44
7        0.43  0.35  0.44  0.46  0.34  0.63  1     0.50  0.58  0.42
8        0.32  0.40  0.43  0.46  0.37  0.52  0.50  1     0.58  0.59
9        0.38  0.45  0.51  0.51  0.39  0.66  0.58  0.58  1     0.51
10       0.31  0.36  0.37  0.43  0.42  0.44  0.42  0.59  0.51  1

From Table III, it can be noticed that the average spectral correlation between different male speakers after silence removal is about 56%, while for different female speakers it is about 54%. The average correlation of the male-female part is about 42%. From this, it can be concluded that spectra of speakers of the same gender are notably correlated, whilst spectra of speakers of different gender are less correlated. This can be explained by the fact that the energy of the audio that originates from a female speaker is concentrated around 200 Hz and 1000 Hz, while the energy for a male speaker
is distributed over a larger frequency range. The average spectral correlation of all male speakers without silence removal is about 30%, and for all female speakers it is about 27%. These results confirm that the correlation decreases remarkably because of the low-energy (silent) parts of the signal. These conclusions may be used for feature extraction in order to perform gender recognition from the speech signal.

C. Effect of Audio Signal Duration
In order to investigate the effect of audio signal duration on the spectrum correlation, ten signals were prepared by concatenating all the uttered texts of each speaker. In this manner, each speaker is represented by a longer signal. The spectrum of this signal after silence removal is determined, and the correlation between it and the spectra of the original signals is calculated. The results are shown in Table IV.

TABLE IV

Speaker  Text1  Text2  Text3  Text4  Text5
1        0.77   0.75   0.81   0.72   0.70
2        0.76   0.80   0.78   0.81   0.77
3        0.81   0.73   0.75   0.76   0.76
4        0.85   0.84   0.81   0.77   0.72
5        0.73   0.79   0.67   0.81   0.80
6        0.82   0.69   0.68   0.66   0.64
7        0.79   0.78   0.71   0.80   0.68
8        0.78   0.71   0.76   0.70   0.78
9        0.81   0.68   0.75   0.79   0.70
10       0.74   0.82   0.77   0.70   0.83

From Table IV, it can be noticed that the correlation is enhanced by using a longer speech signal. From Table I, the average correlation between the different texts of speaker 3 (with silence removal) is around 68%. On the other hand, the average correlation for the same speaker calculated from Table IV is about 76%. From this, it can be concluded that the longer the audio signal, the higher the spectral correlation after silence removal.

In order to support this conclusion, correlation matrices are calculated for each of the five analyzed texts and for all concatenated texts. From each obtained matrix, three averages are calculated: the average correlation between the spectra of male speakers (male vs. male), the average correlation between the spectra of female speakers (female vs. female) and the average correlation between speakers of different gender (male vs. female). Referring to Table III, the male vs. male average is computed from the blue part of the table, the female vs. female average from the orange part, and the male vs. female average from the yellow part. The results of this analysis are presented in Fig. 3.

From Fig. 3, it can be noticed that for longer analyzed signals the spectral correlation between speakers is higher, both with and without silence removal. In addition, it can be concluded from this figure that the spectral correlation after silence removal is robust to the speaker gender, while the spectral correlation is lower when audio signals that originate from speakers of different gender are compared. These results can be used to determine the optimal speech recording duration when preparing training data for speech analysis.

[Figure 3: two panels, (a) and (b), each showing three series (male vs male, female vs female, male vs female) over Text 1, Text 2, Text 3, Text 4, Text 5 and All texts.]
Fig. 3. Average correlation for the different texts and speaker gender: a) without silence removal and b) with silence removal.

IV. CONCLUSION

This paper presented a spectral analysis applied to a recently created database of male and female speech samples. Using the spectral correlation measure, the effect of the silent parts of the audio signal on the spectrum shape is analyzed, and it is shown that silent parts strongly affect the results of a spectral analysis.
In addition, the spectra obtained from signals of the same speaker for different uttered texts are compared, and it is shown that they are strongly correlated. Moreover, there is a significant correlation between spectra obtained from speakers of the same gender reading the same text. The effect of the audio signal duration on the spectrum correlation is also analyzed. The obtained results show that the spectral correlation is higher for longer signals. These results may serve as a basis for future research on speech signal analysis. In future research, the problem of spectral-based feature extraction for speaker identification will be considered. Furthermore, the cases of background noise, music and distant speakers will be taken into account.

REFERENCES
[1] L. R. Rabiner and R. W. Schafer, Theory and applications of digital speech processing, NJ, USA: Pearson, 2011.
[2] A. V. Oppenheim, Discrete-time signal processing, ND, India: Pearson Education India, 1999.
[3] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, New Jersey, USA: Prentice-Hall, 1978.
[4] K. Gopalan, T. R. Anderson, and E. J. Cupples, "A comparison of speaker identification results using features based on cepstrum and Fourier-Bessel expansion," IEEE Trans. on Speech and Audio Proc, vol. 7, no. 3, pp. 289-294, Apr. 1999.
[5] B. Imperl, Z. Kai, and B. Horvat, "A study of harmonic features for the speaker recognition," Speech Communication, vol. 22, no. 4, pp. 385-402, Feb. 1997.
[6] G.-J. Jang, T.-W. Lee, and Y.-H. Oh, "Learning statistically efficient features for speaker recognition," Neurocomputing, vol. 49, no. 4, pp. 329-348, Jun. 2002.
[7] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, no. 1, pp. 12-40, May 2010.
[8] G. Ruske and T. Schotola, "The efficiency of demisyllable segmentation in the recognition of spoken words," in ICASSP'81. IEEE International Conference on Acoustics, Speech, and Signal Processing, NY, USA, vol. 1, pp. 971-974, 04-01-1981.
[9] M. P. Kesarkar, "Feature extraction for speech recognition," M.Tech. Credit Seminar Report, Bombay, India, 2003.
[10] O.-W. Kwon, K. Chan, J. Hao, and T.-W. Lee, "Emotion recognition by speech signals," European Conference on Speech Communication and Technology, Geneva, Switzerland, interspeech, 09-01-2003.
[11] K. Rakesh, S. Dutta, and K. Shama, "Gender Recognition using speech processing techniques in LABVIEW," Internat. Journal of Advs. in Engin. & Tech., vol. 1, no. 2, p. 51, Jul. 2011.
[12] M. Hydari, M. R. Karami, and E. Nadernejad, "Speech Signals Enhancement Using LPC Analysis based on Inverse Fourier Methods," Contemporary Engineering Sciences, vol. 2, no. 1, pp. 1-15, Jan. 2009.
[13] T. Giannakopoulos, "Study and application of acoustic information for the detection of harmful content, and fusion with visual information," PhD. Dissertation, IT Dept., NKA Univ., Athens, Greece, 2009.