Masaryk University
Faculty of Informatics
Lóránt Oroszlány
Declaration
I hereby declare that this paper is my original authorial work, which
I have worked out on my own. All sources, references and literature
used or excerpted during the elaboration of this work are properly cited
and listed in complete reference to the due source.
Lóránt Oroszlány
Acknowledgement
I would like to thank my supervisor Mgr. Luděk Bártek, Ph.D. for his
support and guidance, which helped me write this thesis.
Abstract
The aim of this bachelor's thesis is to introduce the reader to the basics of digital sound processing and to describe a few selected methods used in the field of speech processing. Furthermore, it evaluates the usability of four popular mathematical software packages in this area and compares them. As the described speech processing methods are not directly implemented in these programs, the practical part of this thesis consists of their implementation as a library of functions for the open-source Matlab alternative, GNU Octave.
Keywords
DSP, digital signal processing, audio processing, speech processing,
numerical computation, computer algebra system, CAS, octave, matlab, maple.
Contents
1 Introduction
1.1 Organization of this thesis
1.2 Digital audio processing
1.2.1 Digital representation of sound
1.2.2 Working with audio signals
1.2.3 Practical application
2 Analysis of speech
2.1 The source/system speech production model
2.2 Perception of sound
2.2.1 Pitch perception
2.2.2 Loudness
2.3 Short-time analysis
2.3.1 Window functions
2.4 Time-domain analysis
2.5 Frequency-domain analysis
2.5.1 Fourier transform
2.5.2 Discrete Fourier transform
2.5.3 Short-time Fourier transform
2.6 Linear predictive analysis
2.6.1 Perceptual linear prediction
2.7 Cepstral analysis
2.7.1 Mel-frequency cepstral coefficients
3 Sound analysis with mathematical software
3.1 Matlab
3.2 GNU Octave
3.3 Maple
3.4 Wolfram Mathematica
3.5 Comparison
3.5.1 Plotting
4 The speech package for GNU Octave
4.1 Installation
4.2 Input handling
4.3 Function reference
4.3.1 The sti(), ste() and zcr() functions
4.3.2 The stf() function
Chapter 1
Introduction
1.1 Organization of this thesis
Chapter 1 introduces the concepts and notions of the subject matter of this thesis.
Chapter 2 covers the theory of the methods utilized in the later parts of the thesis.
Chapter 3 explores the existing possibilities of analyzing audio signals using mathematical software.
Chapter 4 introduces the speech package for GNU Octave, created as the practical part of the thesis, and provides a discussion of issues experienced during its development.
Chapter 5 presents the conclusions reached and final thoughts on the topic.
number of bits per second), which determine the quality of the representation. According to the Nyquist-Shannon sampling theorem, the sampling rate (number of samples per second) must be greater than twice the frequency of the highest frequency component in the sampled signal; otherwise aliasing occurs and the signal can no longer be reconstructed[4]. A higher bit rate allows more precision for storing the magnitude of the signal sampled at a specific time, so the recreated signal is less distorted and more similar to the sampled signal. Because each sample value is quantized to a finite number of levels, exact recreation of the sampled analog signal from the discrete samples is not possible. Figure 1.1 shows the waveform of a sound sequence encoded with LPCM at a sample rate of 44100 Hz and a 16-bit sample size. LPCM is used by most of the popular uncompressed audio formats (CD audio, WAV, AIFF), and is used exclusively throughout this thesis.
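The two effects described above, quantization error and aliasing, can be illustrated with a minimal sketch. It is written in Python rather than Octave and is not part of the speech package; the helper names quantize_lpcm and alias_frequency are hypothetical, and a full-scale range of [-1.0, 1.0] is assumed for the analog samples.

```python
import math

def quantize_lpcm(x, bits=16):
    """Map a sample in [-1.0, 1.0] to a signed integer LPCM code."""
    full_scale = 2 ** (bits - 1) - 1          # 32767 for 16-bit samples
    return int(round(x * full_scale))

def alias_frequency(f, fs):
    """Apparent frequency (Hz) of a pure tone at f Hz after sampling at fs Hz."""
    f = f % fs
    return min(f, fs - f)

fs = 44100                                     # sampling rate in Hz
tone = [math.sin(2 * math.pi * 440 * n / fs) for n in range(441)]
codes = [quantize_lpcm(x) for x in tone]       # 16-bit LPCM codes
max_error = max(abs(x - c / 32767) for x, c in zip(tone, codes))

print(max_error <= 0.5 / 32767 + 1e-12)        # True: error is at most half a step
print(alias_frequency(30000, fs))              # 14100: a 30 kHz tone aliases to 14.1 kHz
```

The quantization error never vanishes completely, which is why the reconstruction is only approximate, and any tone above half the sampling rate is folded back into the representable band.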
1.2.2 Working with audio signals
Due to the properties of LPCM encoding, discrete-time signal processing methods are easily applicable to data encoded in this manner. Most of these methods consist of applying various mathematical formulas to the array of values represented by the LPCM-coded data. While the majority of programming languages are fully capable of performing these types of calculations, there are mathematical software packages developed specifically for solving technical computational problems. Two basic types of such software differ in their approach to these problems: computer algebra systems manipulate mathematical expressions in symbolic form, while numerical computing systems focus on performing numerical algorithms. Chapter 3 is devoted to evaluating the usefulness and comparing the power of these products in terms of signal processing, more specifically sound analysis.
1.2.3 Practical application
Practical applications of sound processing cover a wide range, including storage, level compression, data compression, transmission, enhancement and speech processing[5]. This thesis focuses on the methods used for speech processing, specifically those for short-time analysis of the speech signal in the time and frequency domains. The extraction of information from the results of these analyses is the subject of speech recognition, speaker recognition and voice analysis, and is not included in the scope of this thesis. Speech processing itself is a very wide topic, comprising among others speech coding, synthesis and enhancement, and is covered in depth in [6], [7].
Chapter 2
Analysis of speech
This chapter is a short overview of the theoretical background of
the sound analysis methods utilized or referenced in later parts of
this thesis. All the information in this chapter is drawn from [4],
[6] and [7], except where noted otherwise.
a much slower rate than the time variations of the speech waveform, allowing us to find them by analyzing short segments of the signal. Since the vocal tract changes shape slowly, it can be viewed as time-invariant over intervals on the order of 10 ms, depending on the speaker. Although there are more sophisticated models of speech production, this model is sufficient for most applications in speech processing[7].
tone. The width of the critical band depends on the central frequency $f$. This effect can be represented as the application of a set of bandpass filters to the sound signal. The frequency unit bark was introduced to capture this phenomenon; the width of a critical band is roughly 1 bark at any frequency. The relation of the frequency $f$ in Hz to the frequency $\Omega$ in bark is described by the following equation:
\[ \Omega(f) = 6 \ln\left( \frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1} \right) \tag{2.2} \]
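As a quick numerical illustration, the conversion can be sketched in a few lines of Python (not Octave, and not part of the speech package). The sketch reads equation (2.2) as 6 ln(f/600 + sqrt((f/600)^2 + 1)); the function name hz_to_bark is hypothetical.

```python
import math

def hz_to_bark(f):
    """Frequency in Hz -> critical-band rate in bark, following equation (2.2)."""
    x = f / 600.0
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))

# Print a few sample conversions across the speech band.
for f in (100, 1000, 4000, 8000):
    print(f, round(hz_to_bark(f), 2))
```

The expression inside the logarithm is the inverse hyperbolic sine of f/600, so the mapping is close to linear at low frequencies and close to logarithmic at high frequencies.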
2.2.2 Loudness
Loudness is a subjective measure, not to be confused with objective measures like sound pressure or sound intensity. The perceived loudness is related to sound pressure, duration and frequency. Equal-loudness curves capture the difference between the perceived loudness and the sound pressure of a pure tone over the audible frequency spectrum. These curves are defined by the international standard ISO 226:2003, based on several modern experimental determinations[8]. The unit of loudness level for pure tones is the phon; 1 phon is defined as the loudness level of a pure tone at 1 kHz with a sound pressure level of 1 dB. A different unit, the sone, is also used for capturing the loudness phenomenon, but it is not used in this thesis.
\[ Q_n = \sum_{k=-\infty}^{\infty} \tau(s(k))\, w(n-k) , \tag{2.3} \]
suppress the effect of the marginal samples of the analyzed segment. Many types of differently shaped windows exist. In speech processing the two most commonly used are the rectangular and the Hamming window. The rectangular window does not do any weighting; it only selects the samples included in the microsegment. Its value is 1 inside the chosen interval and 0 outside of it. The Hamming window is defined by the following relation:
\[ w(n) = \begin{cases} 0.54 - 0.46\cos\left(\dfrac{2\pi n}{L-1}\right) & \text{for } 0 \le n \le L-1 \\ 0 & \text{otherwise,} \end{cases} \tag{2.4} \]
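The two windows can be generated directly from their definitions. The following is a minimal Python sketch (not the Octave code of the speech package); the function names are hypothetical and L is assumed to be at least 2.

```python
import math

def rectangular(L):
    """Rectangular window: 1 for every sample of the microsegment."""
    return [1.0] * L

def hamming(L):
    """Hamming window of length L (L >= 2), following equation (2.4)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (L - 1)) for n in range(L)]

def apply_window(segment, window):
    """Weight each sample of the segment by the corresponding window value."""
    return [s * w for s, w in zip(segment, window)]

w = hamming(5)
print([round(v, 2) for v in w])   # [0.08, 0.54, 1.0, 0.54, 0.08]
```

The Hamming window tapers to 0.08 at both ends, which is how it suppresses the marginal samples, while the rectangular window leaves every sample unchanged.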
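source/system production model. The short-time energy function is defined as
\[ E_n = \sum_{k=-\infty}^{\infty} [s(k)\,w(n-k)]^2 , \tag{2.5} \]
the short-time intensity as
\[ M_n = \sum_{k=-\infty}^{\infty} |s(k)|\,w(n-k) , \tag{2.6} \]
and the short-time zero-crossing rate as
\[ Z_n = \sum_{k=-\infty}^{\infty} \left|\operatorname{sgn}[s(k)] - \operatorname{sgn}[s(k-1)]\right| w(n-k) , \tag{2.7} \]
where
\[ \operatorname{sgn}[s(k)] = \begin{cases} 1 & \text{for } s(k) \ge 0 \\ 0 & \text{for } s(k) < 0 \end{cases} \tag{2.8} \]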
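The three time-domain measures can be evaluated for a single frame with a short Python sketch (an illustration only, not the Octave implementation of the speech package). It assumes that the zero-crossing measure sums |sgn[s(k)] - sgn[s(k-1)]| under the window, with sgn as in (2.8), and that window[j] holds w(j) for 0 <= j < len(window); the function names are hypothetical.

```python
def sgn(v):
    """Signum as in equation (2.8): 1 for non-negative values, 0 otherwise."""
    return 1 if v >= 0 else 0

def short_time_measures(s, window, n):
    """Energy, intensity and zero-crossing count for the frame ending at sample n.

    w(n - k) is non-zero only for n - len(window) < k <= n, so the infinite
    sums reduce to a loop over the samples covered by the window.
    """
    energy = intensity = crossings = 0.0
    for k in range(max(0, n - len(window) + 1), n + 1):
        w = window[n - k]
        energy += (s[k] * w) ** 2
        intensity += abs(s[k]) * w
        if k > 0:
            crossings += abs(sgn(s[k]) - sgn(s[k - 1])) * w
    return energy, intensity, crossings

signal = [1.0, -1.0, 1.0, -1.0]
print(short_time_measures(signal, [1.0] * 4, 3))   # (4.0, 4.0, 3.0)
```

With a rectangular window the zero-crossing measure is simply the number of sign changes inside the frame.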
n is an array of m values. It is used to detect periodicity in signals and
is the basis of many spectral analysis methods. To be able to detect
the pitch period, the windowed segment must contain at least two
fundamental periods of the analyzed signal, so windows with lengths
of 20-40 ms are recommended, depending on the speaker.
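The use of autocorrelation for detecting the pitch period can be sketched as follows. This Python example is only an illustration under stated assumptions: the autocorrelation is taken as R(m) = sum over k of frame[k] * frame[k + m] on a rectangular-windowed frame, the test signal is a synthetic 100 Hz sine sampled at 8 kHz, and the function names are hypothetical.

```python
import math

def autocorrelation(frame, max_lag):
    """R(m) = sum_k frame[k] * frame[k + m] for m = 0 .. max_lag."""
    n = len(frame)
    return [sum(frame[k] * frame[k + m] for k in range(n - m))
            for m in range(max_lag + 1)]

def pitch_period(frame, min_lag, max_lag):
    """Lag (in samples) of the largest autocorrelation value in [min_lag, max_lag]."""
    r = autocorrelation(frame, max_lag)
    return max(range(min_lag, max_lag + 1), key=lambda m: r[m])

fs = 8000                                   # sampling rate in Hz
f0 = 100                                    # fundamental frequency of the test tone
frame = [math.sin(2 * math.pi * f0 * n / fs) for n in range(240)]  # 30 ms frame
lag = pitch_period(frame, 20, 160)
print(lag, fs / lag)                        # 80 samples, i.e. 100.0 Hz
```

The 30 ms frame holds three periods of the tone, which satisfies the requirement of at least two fundamental periods inside the window.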
For application to discrete signals, the discrete-time Fourier transform (DTFT) is defined as
\[ X(\omega) = \sum_{k=-\infty}^{\infty} x_k e^{-i\omega k} , \tag{2.10} \]
where $\omega$ denotes the angular frequency and $x_k$ denotes the kth sample of the signal. It can be seen from the definition that the values of the DTFT can be complex even when working with real signals. The DTFT of an aperiodic signal (speech segments are periodic only over short time intervals, so they are treated as aperiodic in this formula) is a continuous periodic complex function with period $2\pi$. For a real signal, the magnitude of the DTFT $|X(\omega)|$ is an even function, while the phase $\angle X(\omega)$ is odd.
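These properties can be checked numerically for a finite-length real signal. The sketch below is in Python and assumes the DTFT sum is restricted to the samples of a short signal (all other samples being zero); the function name dtft and the test values are hypothetical.

```python
import cmath
import math

def dtft(x, omega):
    """X(omega) = sum_k x[k] * exp(-i * omega * k) for a finite-length signal x."""
    return sum(xk * cmath.exp(-1j * omega * k) for k, xk in enumerate(x))

x = [1.0, 0.5, -0.25, 0.125]        # a short real-valued signal
w = 0.7                             # angular frequency in radians per sample

X_pos = dtft(x, w)
X_neg = dtft(x, -w)
X_shifted = dtft(x, w + 2 * math.pi)

print(abs(X_pos - X_shifted) < 1e-9)                        # periodic with period 2*pi
print(abs(abs(X_pos) - abs(X_neg)) < 1e-12)                 # magnitude is even
print(abs(cmath.phase(X_pos) + cmath.phase(X_neg)) < 1e-12) # phase is odd
```

All three checks print True for this signal; the symmetry of magnitude and phase holds because the samples are real.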
2.5.2 Discrete Fourier transform
To be able to process the spectra of discrete signals using computers, we also need the spectra to be discrete. To achieve this, the discrete Fourier transform (DFT) is used. The DFT of a discrete signal of length N ($x_0, ..., x_{N-1}$) is also of length N, and can be computed using the following formula:
\[ X_n = \sum_{k=0}^{N-1} x_k e^{-i 2\pi \frac{n}{N} k} . \tag{2.11} \]
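A direct evaluation of the DFT sum can be sketched in Python as follows (an illustration of the definition only; practical implementations use a fast Fourier transform). The function name dft is hypothetical.

```python
import cmath

def dft(x):
    """X[n] = sum_{k=0}^{N-1} x[k] * exp(-i * 2*pi * n * k / N), computed directly."""
    N = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * n * k / N) for k in range(N))
            for n in range(N)]

impulse = dft([1.0, 0.0, 0.0, 0.0])     # flat spectrum: every X[n] equals 1
constant = dft([1.0, 1.0, 1.0, 1.0])    # all energy in X[0], which equals 4
print([round(abs(v), 6) for v in impulse])
print([round(abs(v), 6) for v in constant])
```

The direct sum needs on the order of N squared operations, which is the cost that fast Fourier transform algorithms reduce.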
\[ S_n(m) = \sum_{k=-\infty}^{\infty} s(k)\, w(n-k)\, e^{-i 2\pi \frac{m}{N} k} . \tag{2.12} \]
\[ \frac{S(z)}{E(z)} = \frac{G}{1 - \sum_{i=1}^{p} a_i z^{-i}} , \tag{2.13} \]
\[ s(k) = \sum_{i=1}^{p} a_i\, s(k-i) + G u(k) . \tag{2.14} \]
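The difference equation of the all-pole model can be run directly to synthesize a signal from an excitation. The Python sketch below reads (2.14) as s(k) = sum of a_i s(k-i) + G u(k) and assumes s(k) = 0 for k < 0; the function name synthesize and the coefficient values are hypothetical.

```python
def synthesize(a, G, u):
    """All-pole synthesis: s(k) = sum_{i=1..p} a[i-1] * s(k-i) + G * u(k)."""
    p = len(a)
    s = []
    for k in range(len(u)):
        # Weighted sum of the p previous output samples (missing ones count as zero).
        past = sum(a[i - 1] * s[k - i] for i in range(1, p + 1) if k - i >= 0)
        s.append(past + G * u[k])
    return s

impulse = [1.0, 0.0, 0.0, 0.0]
print(synthesize([0.5], 1.0, impulse))   # [1.0, 0.5, 0.25, 0.125]
```

Each output sample is predicted from the previous p samples, and the excitation u(k) scaled by the gain G supplies the part that cannot be predicted.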
know how to compute the coefficients. The following formula describes a symmetric positive-semidefinite matrix which represents a system of linear equations:
\[ \varphi_n[i,k] = \sum_{m=-M_1-k}^{M_2-k} s_n[m]\, s_n[m+k-i] \tag{2.15} \]
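The role of this matrix can be illustrated with a small Python sketch of the covariance method. Several assumptions are made and none of this is the code of the speech package: the matrix is computed in the equivalent form phi[i][k] = sum over m of s[m-i] * s[m-k] (obtained by shifting the summation index by k), the coefficients are taken to solve the usual normal equations sum over k of a_k * phi[i][k] = phi[i][0] for i = 1..p, and all function names are hypothetical.

```python
def covariance_matrix(s, lo, hi, p):
    """phi[i][k] = sum_{m=lo..hi} s[m - i] * s[m - k] for 0 <= i, k <= p (lo >= p)."""
    return [[sum(s[m - i] * s[m - k] for m in range(lo, hi + 1))
             for k in range(p + 1)] for i in range(p + 1)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lpc_covariance(s, lo, hi, p):
    """Predictor coefficients a_1..a_p from the normal equations (assumed form)."""
    phi = covariance_matrix(s, lo, hi, p)
    A = [[phi[i][k] for k in range(1, p + 1)] for i in range(1, p + 1)]
    b = [phi[i][0] for i in range(1, p + 1)]
    return solve(A, b)

# Test signal: impulse response of a known second-order all-pole model.
true_a = [1.2, -0.5]
s = [1.0, 1.2]
for k in range(2, 40):
    s.append(true_a[0] * s[k - 1] + true_a[1] * s[k - 2])

estimated = lpc_covariance(s, 2, 39, 2)
print([round(v, 6) for v in estimated])     # [1.2, -0.5]
```

Because the test signal obeys the recursion exactly on the analysis interval, the solution of the linear system recovers the coefficients that generated it.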
it from time-domain analysis, it is called quefrency (the names cepstrum
and quefrency were created by changing the order of letters in the
words spectrum and frequency, respectively). Cepstral analysis has
uses in pitch detection and pattern recognition, and also in techniques
for estimating the vocal tract system (cepstral LPC coefficients).
2.7.1 Mel-frequency cepstral coefficients
Mel-frequency cepstral coefficients (MFCC) are parameters of a signal used for capturing its short-time spectral characteristics in a compact form. The method consists of computing the power spectrum of a windowed segment of the analyzed signal, then applying a bank of triangular bandpass filters. The central frequencies and bandwidths of the filters are determined so that they cover the entire analyzed frequency band, are uniformly spaced on the mel scale, and the central frequency of a given filter is also the first frequency value of the next filter. The number of filters is usually based on the bandwidth of the original signal with respect to the critical band theory of auditory perception, e.g. for a bandwidth of 8 kHz, 20 filters are used. The filters are applied to the signal segment separately and the sum of the output values of each filter is calculated. The MFCCs are the values of the discrete cosine transform applied to the list of logarithms of the filter outputs. Various sources differ in some details of the exact method of computing the MFCCs; the practical part of this thesis implements the one described in [6].
MFCCs are commonly used in the fields of speaker recognition
and speech recognition, and are increasingly finding uses in music
information retrieval [12].
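The steps described in this section can be sketched in Python as follows. This is not the method of [6] nor the code of the speech package: the mel formula 2595 log10(1 + f/700) is one common definition and is an assumption here, as are the DCT-II form, the small offset inside the logarithm and all function names.

```python
import math

def hz_to_mel(f):
    """One common mel-scale formula (an assumption, not taken from the thesis)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filter_edges(num_filters, f_low, f_high):
    """num_filters + 2 frequencies, uniformly spaced on the mel scale.

    Filter j spans edges[j] .. edges[j + 2] and peaks at edges[j + 1], so the
    centre of one filter is the starting frequency of the next one.
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + j * step) for j in range(num_filters + 2)]

def triangular_weight(f, left, centre, right):
    """Response of a triangular filter at frequency f."""
    if left < f <= centre:
        return (f - left) / (centre - left)
    if centre < f < right:
        return (right - f) / (right - centre)
    return 0.0

def mfcc_from_power_spectrum(power, fs, num_filters, num_coeffs):
    """power[n] is the power at frequency n * fs / (2 * (len(power) - 1))."""
    edges = filter_edges(num_filters, 0.0, fs / 2.0)
    bin_hz = fs / (2.0 * (len(power) - 1))
    outputs = []
    for j in range(num_filters):
        total = sum(p * triangular_weight(n * bin_hz, edges[j], edges[j + 1], edges[j + 2])
                    for n, p in enumerate(power))
        outputs.append(math.log(total + 1e-12))     # small offset avoids log(0)
    # DCT-II of the logarithms of the filter outputs
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / num_filters)
                for j, v in enumerate(outputs))
            for c in range(num_coeffs)]

# Usage: a flat power spectrum with 129 bins, sampled at 8 kHz, 20 filters.
coeffs = mfcc_from_power_spectrum([1.0] * 129, 8000.0, 20, 12)
print(len(coeffs))     # 12
```

Keeping only the first few DCT coefficients is what makes the representation compact: they describe the smooth envelope of the log filter-bank outputs.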
Chapter 3
Sound analysis with mathematical software
3.1 Matlab
Matlab is a programming environment for algorithm development, data analysis, visualization, and numerical computation, developed by MathWorks[13]. It is primarily intended for numerical computations, although it has an optional toolbox granting the ability to perform symbolic computations. Its user base spans industry and academia, counting roughly one million users as of 2004[14]. Along with Simulink, an additional package also developed by MathWorks for multidomain simulation and Model-Based Design, it is popular among engineers and economists. Matlab is distributed as a stripped-down core program along with optional toolboxes providing additional functionality grouped by field of use. While most of the functionality is accessible from the command prompt, Matlab comes with a GUI for added convenience. MathWorks provides separate licensing options for educational and commercial use.
3.3 Maple
Maple is a commercial computer algebra system (CAS). Its development began in 1980 at the University of Waterloo in Ontario, Canada; it has been on the commercial market since 1988, distributed and developed by Maplesoft. Maple supports numeric computation as well as symbolic computation and visualization[17]. The power of computer algebra systems lies in their capability to perform symbolic mathematics. Computations with exact quantities such as fractions, radicals and symbols eliminate the propagation of rounding errors, thus achieving more precise results than purely numerical systems. The approximation of the results can be computed at arbitrary precision, not limited by the underlying hardware. One of Maple's main advantages over other mathematical software is its GUI, which is above par in terms of user-friendliness and intuitiveness. Its highlights include the ability to process input in standard mathematical notation, the typing of which is assisted by well-written automatic formatting; the context-sensitive menus appearing when right-clicking an expression; and its overall natural feel. As with Matlab, separate student and commercial licenses exist for Maple.
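The difference between exact and floating-point computation can be seen in a very small example. The sketch below uses Python's standard fractions module as a stand-in for the exact arithmetic of a computer algebra system; it is not Maple code.

```python
from fractions import Fraction

float_sum = sum([0.1] * 10)                 # binary floating point
exact_sum = sum([Fraction(1, 10)] * 10)     # exact rational arithmetic

print(float_sum == 1.0)    # False: rounding errors have accumulated
print(exact_sum == 1)      # True: the exact result is preserved
```

The value 0.1 has no finite binary representation, so every floating-point addition carries a small error, whereas the rational computation stays exact until an approximation is explicitly requested.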
Given its strong mathematical and visualization abilities, Maple can be used extensively in signal processing. It has built-in functions for the essential methods used in the field; however, it is not as well supplied with implementations of additional methods. The Maplesoft website and other community websites provide free extensions and tutorials for applications in signal processing. For uses specifically in sound processing, the low number of supported audio formats can be a major drawback.
3.5 Comparison
This section compares the software described above with respect to its use in signal processing (and more specifically audio processing), using tables and charts as well as textual description to demonstrate the results of the study.
              Student    Commercial (single user)
Matlab        $99        $4,471.22
Maple         $124       $2,845
Mathematica   $126.62    $3,219.93
Octave        Free
Matlab      Maple   Mathematica   Octave
SPT         No      No            Yes
Yes         Yes     Yes           Yes
SPT         No      No            signal
Yes         Yes     Yes           Yes
Symbolic    Yes     Yes           No
No          No      No            Yes
SPT         No      No            signal
SPT         No      No            signal
Format           Matlab   Maple        Mathematica*   Octave
.WAV             LPCM     LPCM/ADPCM   Yes            LPCM
.AU (.SND)       µ-law    No           Yes            µ-law
.RAW             No       No           No             LPCM
.AIFF            No       No           Yes            No
.w64 (Wave64)    No       No           Yes            No
.FLAC            No       No           Yes            No
3.5.1 Plotting
Each of the mentioned programs is able to visualize data to some extent. This is an area where Octave is significantly weaker than its commercial competitors, since it is limited by the functionality of gnuplot, and the interface between the two programs does not even expose every gnuplot function to Octave. Plots can be manipulated from the command line interface of Octave. It allows the creation of static 2D and 3D plots, with multiple functions on the same figure. For use in sound processing its capabilities are sufficient; however, the lack of a GUI makes working with plots tiresome.
Chapter 4
The speech package for GNU Octave
4.1 Installation
The speech package takes advantage of the Octave package interface, making it effortless to install: open Octave, change to the directory where the package archive is located and type the following command at the Octave command prompt:
pkg install speech.tar.gz
The package depends on the signal package, which itself depends on a few more packages, all of which can be downloaded for free from the Octave-Forge website.
Figure 4.1: Sample output of the zcr() function, called with 150 points for
interpolation of the plot
computes the convolution by calling the pre-compiled filter() function. This optimization
results in a drastic improvement in run time.
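Why a convolution can replace the sample-by-sample loop is easiest to see for the short-time energy. Reading equation (2.5) as E_n = sum over k of [s(k) w(n-k)]^2, the square can be split so that E_n is the convolution of the squared signal with the squared window. The Python sketch below checks this identity on a small example; it is an illustration only, does not reproduce the package's use of filter(), and its function names are hypothetical.

```python
def short_time_energy_direct(s, w):
    """E_n = sum_k (s[k] * w[n - k])**2, evaluated sample by sample."""
    out = []
    for n in range(len(s)):
        out.append(sum((s[k] * w[n - k]) ** 2
                       for k in range(len(s)) if 0 <= n - k < len(w)))
    return out

def short_time_energy_conv(s, w):
    """The same quantity as the convolution of s**2 with w**2."""
    s2 = [v * v for v in s]
    w2 = [v * v for v in w]
    y = [0.0] * (len(s2) + len(w2) - 1)
    for k, a in enumerate(s2):
        for j, b in enumerate(w2):
            y[k + j] += a * b
    return y[:len(s)]                # keep one value per input sample

signal = [1.0, 2.0, 3.0, 4.0]
window = [1.0, 0.5]
print(short_time_energy_direct(signal, window))   # [1.0, 4.25, 10.0, 18.25]
print(short_time_energy_conv(signal, window))     # [1.0, 4.25, 10.0, 18.25]
```

Once the measure is expressed as a convolution, the whole result can be handed to a single compiled routine instead of being computed in an interpreted loop.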
The idea behind the computation of the result of the zcr() (zero-crossing rate) function is similar: it takes the convolution of the window with the array obtained by subtracting the signum of the original signal from the signum of the signal shifted in time by one sample. The zcr() function does not take the window type as an argument; the rectangular window is always used for computing the zero-crossing rate.
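The convolution-based computation of the zero-crossing rate can be sketched in Python as follows. This is not the source of zcr(): the absolute value of the signum difference, the signum convention (1 for non-negative samples, 0 otherwise) and the function names are assumptions made for the illustration.

```python
def convolve(x, h):
    """Full linear convolution y[n] = sum_k x[k] * h[n - k]."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k, xk in enumerate(x):
        for j, hj in enumerate(h):
            y[k + j] += xk * hj
    return y

def sgn(v):
    """Signum with values 1 (non-negative) and 0 (negative)."""
    return 1 if v >= 0 else 0

def zcr_by_convolution(s, window_length):
    """Zero crossings inside a sliding rectangular window ending at each sample."""
    # 1 where the sign differs from the previous sample, 0 elsewhere
    changes = [0] + [abs(sgn(s[k]) - sgn(s[k - 1])) for k in range(1, len(s))]
    return convolve(changes, [1.0] * window_length)[:len(s)]

print(zcr_by_convolution([1.0, -1.0, 1.0, -1.0, 1.0], 2))   # [0.0, 1.0, 2.0, 2.0, 2.0]
```

Since the rectangular window consists of ones only, the convolution reduces to counting the sign changes that fall inside the window.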
Because the result of applying STACF to a signal is a two-dimensional set of data, a 3D plot was chosen for its visualization. Calling stacf() with an additional argument greater than zero plots the result as a 3D mesh plot. The default viewpoint of the plot is normal to the x and y axes; the values on the z axis are represented by the colors of the graph (similarly to a spectrogram), making the peaks of the graph prominent. The plot can be rotated arbitrarily; to return to the original viewpoint, type the command view(2) at the Octave prompt.
N = 600, p = 20       N = 1000, p = 40      N = 1000, p = 200
8.3799 ± 0.0619 s     54.1800 ± 0.1390 s    > 60 s
0.0555 ± 0.0015 s     0.1143 ± 0.0154 s     0.8040 ± 0.0300 s
0.0628 ± 0.0022 s     0.1325 ± 0.0025 s     0.2853 ± 0.0221 s
Table 4.1: Comparison of the run times of three different implementations of the covariance LPC method
Chapter 5
Final thoughts
This thesis focused on the analysis of sound signals with methods utilized primarily in the ever-expanding field of speech processing. While computers communicating with people in natural spoken language are still found only in science fiction, the accomplishments of research in the field have already started to make an impact on our everyday lives. GPS devices with synthesized voices giving directions have become common, and voice control of various electronic gadgets is no news either. Automatic captioning of a few selected live broadcasts is available on the Czech National Television, and the expansion of this feature is in the works. It is clear that despite the remarkable progress achieved in the recent past, the field of speech processing is yet to reach its zenith and there is a bright prospect for further research.
While cutting-edge research requires cutting-edge software, usually developed with robust financial backing, in some cases open-source software offers a reasonable alternative for low-budget enterprises. Thanks to the enthusiasm of the development community, GNU Octave is becoming increasingly usable for practical applications.
The speech package created as the practical part of this thesis is a user-contributed extension of GNU Octave which enhances its speech processing capabilities. The methods implemented are used primarily in speech recognition and speaker recognition, fields that have been gaining ground rapidly in the recent past (especially the former; automatic captioning is a hot topic nowadays). Mastering these fields is instrumental in achieving the goal of fluent spoken communication with computers, but it is not sufficient. The techniques for understanding the linguistically imperfect output of speech recognition methods also have to take their great leap forward.
Bibliography
[1] Wikipedia. Sound - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[2] Wikibooks. Signals and Systems/Definition of Signals and Systems. http://en.wikibooks.org/w/index.php?title=Signals_and_Systems/Definition_of_Signals_and_Systems&oldid=2254812. [Online; accessed 20-May-2012].
[3] Bores Signal Processing. Introduction to DSP. http://www.bores.com/courses/intro/index.htm. [Online; accessed 20-May-2012].
[4] Steven W. Smith. The scientist and engineer's guide to digital signal processing. California Technical Publishing, San Diego, CA, USA, 1997.
[5] Wikipedia. Audio signal processing - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[6] J. Psutka, L. Müller, J. Matoušek, V. Radová. Mluvíme s počítačem česky. Academia, Praha, 2006.
[7] L.R. Rabiner and R.W. Schafer. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing. Now Publishers, 2007.
[8] ISO 226:2003. Acoustics - Normal equal-loudness-level contours. ISO, Geneva, Switzerland.
[9] Wikipedia. File:lindos1.svg, 2012. [Online; accessed 20-May-2012].
[10] Wikipedia. Fast Fourier transform - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[11] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, 1990.
[12] Wikipedia. Mel-frequency cepstrum - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[13] MathWorks. MATLAB - The Language of Technical Computing, 2012. [Online; accessed 20-May-2012].
[14] Wikipedia. Matlab - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[15] MathWorks. MATLAB - Documentation, 2012. [Online; accessed 20-May-2012].
[16] GNU Octave. http://www.gnu.org/software/octave/, 2012. [Online; accessed 20-May-2012].
[17] Wikipedia. Maple (software) - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[18] Wikipedia. Mathematica - Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[19] Wolfram Research. Mathematica 8 Documentation Center, 2012. [Online; accessed 20-May-2012].
[20] Wolfram Research. Wolfram Demonstrations Project. http://demonstrations.wolfram.com/, 2012. [Online; accessed 20-May-2012].