
MASARYK UNIVERSITY
FACULTY OF INFORMATICS

Octave System Sound Processing Library

BACHELOR'S THESIS

Lóránt Oroszlány

Brno, spring 2012

Declaration
Hereby I declare that this thesis is my original authorial work, which
I have worked out on my own. All sources, references and literature
used or excerpted during the elaboration of this work are properly cited
and listed in complete reference to the due source.

Lóránt Oroszlány

Advisor: Mgr. Luděk Bártek, Ph.D.



Acknowledgement
I would like to thank my supervisor Mgr. Luděk Bártek, Ph.D. for his
support and guidance, which helped me write this thesis.


Abstract
The aim of this bachelor's thesis is to introduce the reader to the basics of digital sound processing and to describe a few selected methods used in the field of speech processing. Furthermore, it evaluates and compares the usability of four popular mathematical software packages in this area. As the elaborated speech processing methods are not directly implemented in these programs, the practical part of this thesis consists of their implementation as a library of functions for the open-source Matlab alternative, GNU Octave.


Keywords
DSP, digital signal processing, audio processing, speech processing,
numerical computation, computer algebra system, CAS, Octave, Matlab, Maple

Contents

1 Introduction ................................................ 3
  1.1 Organization of this thesis ............................. 3
  1.2 Digital audio processing ............................... 3
      1.2.1 Digital representation of sound .................. 4
      1.2.2 Working with audio signals ....................... 5
      1.2.3 Practical application ............................ 5
2 Analysis of speech .......................................... 7
  2.1 The source/system speech production model .............. 7
  2.2 Perception of sound .................................... 8
      2.2.1 Pitch perception ................................. 8
      2.2.2 Loudness ......................................... 9
  2.3 Short-time analysis .................................... 10
      2.3.1 Window functions ................................. 10
  2.4 Time-domain analysis ................................... 11
  2.5 Frequency-domain analysis .............................. 13
      2.5.1 Fourier transform ................................ 13
      2.5.2 Discrete Fourier transform ....................... 14
      2.5.3 Short-time Fourier transform ..................... 14
  2.6 Linear predictive analysis ............................. 15
      2.6.1 Perceptual linear prediction ..................... 16
  2.7 Cepstral analysis ...................................... 16
      2.7.1 Mel-frequency cepstral coefficients .............. 17
3 Sound analysis with mathematical software ................... 18
  3.1 Matlab ................................................. 18
  3.2 GNU Octave ............................................. 19
  3.3 Maple .................................................. 20
  3.4 Wolfram Mathematica .................................... 20
  3.5 Comparison ............................................. 21
      3.5.1 Plotting ......................................... 22
4 The speech package for GNU Octave .......................... 24
  4.1 Installation ........................................... 24
  4.2 Input handling ......................................... 24
  4.3 Function reference ..................................... 25
      4.3.1 The sti(), ste() and zcr() functions ............. 25
      4.3.2 The stf() function ............................... 27
      4.3.3 The stacf() function ............................. 27
      4.3.4 The lpc_cov() function ........................... 28
      4.3.5 The strceps() and stcceps() functions ............ 29
      4.3.6 The plp() and mfcc() functions ................... 29
      4.3.7 Complementary functions .......................... 30
  4.4 Known issues ........................................... 30
5 Final thoughts .............................................. 31

Chapter 1

Introduction
1.1 Organization of this thesis
Chapter 1 introduces the concepts and notions of the subject matter
of this thesis.
Chapter 2 covers the theory of the methods utilized in the later parts
of the thesis.
Chapter 3 explores the existing possibilities of analyzing audio signals using mathematical software.
Chapter 4 introduces the speech package for GNU Octave created
as the practical part of the thesis and provides a discussion of
issues experienced during its development.
Chapter 5 presents the conclusions reached and final thoughts on
the topic.

1.2 Digital audio processing


Audio is the Latin word for "I hear", referring to the perception
of sound. Sound is an acoustic wave (an oscillation of pressure propagating through a medium) which is composed of frequencies within the
hearing range (about 20 Hz to 20 kHz for an average human)[1]. It can
be represented in the form of signals (a signal is a function of independent variables that carries some information[2]). Signal processing
is a very wide field of study devoted to the analysis of and operations
on signals, be they continuous, discrete, periodic, aperiodic, acoustic or

1. INTRODUCTION

Figure 1.1: Waveform of the spoken sequence "I am prepared", encoded with LPCM coding, visualized using Octave

electromagnetic. This thesis focuses on the analysis of discrete-time audio signals.
1.2.1 Digital representation of sound
Natural sound perceived by human ears is analog: continuous in
time and defined with infinite precision[3]. Signals with these properties, however, cannot be represented equivalently in a digital, implicitly discrete environment. To be able to work with them in such an
environment, they need to be converted to a suitable format. Analog-to-digital (A/D) conversion is done using the methods of sampling
and quantization. There are different ways of storing the resulting
data; the most fitting for our purpose is linear pulse-code modulation (LPCM). A digital audio signal encoded with LPCM is an array of values which correspond to the magnitude of
the original signal sampled at equal intervals of time and quantized
to the nearest value within a digital step. The two main characteristics of a sound wave encoded in LPCM are its sample rate and its
bits per sample (not to be confused with bit rate, which means

number of bits per second), which determine the quality of the representation. According to the Nyquist-Shannon sampling theorem, the
sampling rate (number of samples per second) must be at least twice
the frequency of the highest frequency component in the sampled
signal, otherwise aliasing occurs and the recreation of the signal is no longer possible[4]. A higher bit depth allows more precision for
storing the value of the magnitude of the signal sampled at a specific time, making the recreated signal less distorted and more
similar to the sampled signal. Exact recreation of the sampled analog signal from discrete samples is not possible. Figure 1.1 shows the
waveform of a sound sequence encoded with LPCM with a sample
rate of 44100 Hz and a 16-bit sample size. LPCM is used by most of the
popular non-compressed audio formats (CD audio, WAV, AIFF), and
is used exclusively throughout this thesis.
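A waveform such as the one in Figure 1.1 can be loaded and plotted with a few lines of Octave. The sketch below assumes a placeholder file name 'speech.wav'; it uses wavread(), the WAV-loading function available in Octave of this era (newer versions use audioread()).

```octave
% Minimal sketch: load an LPCM-coded WAV file and inspect its
% parameters ('speech.wav' is a placeholder file name).
[s, fs, bps] = wavread('speech.wav');   % samples, sample rate, bits per sample
printf('%d samples, %d Hz, %d bits per sample\n', length(s), fs, bps);
% Plot the waveform against time, as in Figure 1.1:
t = (0:length(s)-1) / fs;
plot(t, s);
xlabel('time [s]'); ylabel('amplitude');
```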
1.2.2 Working with audio signals
Due to the properties of LPCM encoding, discrete-time signal processing methods are easily applicable to data encoded in this manner. Most of these methods consist of the application of various mathematical formulas over the array of values represented by the LPCM-coded data. While the majority of programming languages are fully
capable of realizing these types of calculations, there is mathematical software developed specifically for the purpose of solving technical computational problems. Two basic types of such software differ in their approach to solving these problems: computer algebra systems manipulate mathematical expressions in symbolic form, while
numerical computing systems focus on performing numerical algorithms. Chapter 3 is devoted to elaborating the usefulness and comparing the power of these products in terms of signal processing,
more specifically sound analysis.
1.2.3 Practical application
Practical applications of sound processing vary on a wide scale, including storage, level compression, data compression, transmission,
enhancement and speech processing[5]. This thesis focuses on the
methods used for speech processing, specifically the ones for short-time analysis of speech signals in the time and frequency domains. The
extraction of information from the results of these analyses is the
subject of speech recognition, speaker recognition and voice analysis, and is not included in the scope of this thesis. Speech processing
itself is a very wide topic, comprising among others speech coding,
synthesis and enhancement, and is covered in-depth in [6], [7].

Chapter 2

Analysis of speech
This chapter is a short overview of the theoretical background of
the sound analysis methods utilized or referenced in later parts of
this thesis. All the information in this chapter is drawn from [4],
[6] and [7], except where noted otherwise.

2.1 The source/system speech production model


Many of the algorithms elaborated later in this chapter assume
the source/system speech production model, which consists of an excitation generator and a time-varying linear system (see Figure 2.1). The
excitation generator simulates the modes of sound generation in the
vocal tract. Voiced sounds are excited by periodic pulses of air pressure, the frequency of which determines the perceived pitch of the
sound. Unvoiced sounds are excited by random white noise. The linear system simulates the frequency shaping of the vocal tract tube at
a specific time. The parameters of the linear system change in time at

Figure 2.1: The source/system model for a speech signal [7]

2. ANALYSIS OF SPEECH
a much slower rate than the time variations of the speech waveform,
allowing us to find them by analyzing short segments of the signal.
Since the vocal tract changes shape slowly, it can be viewed as time-invariant over intervals on the order of 10 ms, depending on
the speaker. Although there are more sophisticated models of speech
production, this model is sufficient for most applications in speech
processing[7].

2.2 Perception of sound


This subsection of this chapter briefly introduces a few psychoacoustic phenomena utilized in later parts of this thesis without going
into physiologic details of the human auditory system. Detailed description of the matter can be found in [6], [7].
2.2.1 Pitch perception
Sounds that have a periodic structure on short time intervals are
perceived as having a subjective quality called pitch. The relation
of pitch to the fundamental frequency of the sound waveform was
empirically determined and can be approximated by the following
equation:

    m = 2595 log10(1 + f / 700) ,                            (2.1)

where m is the perceived pitch in [mel] and f is the frequency in [Hz].
The mel is the unit of subjective pitch. By definition, a signal with a frequency of 1000 Hz and a loudness level of 40 phons has a pitch of 1000 mel.
A sound perceived as having twice the subjective pitch of a reference
sound also has twice its pitch value in mels. Figure 2.2 shows
the non-linear relation of the pitch in [mel] to the logarithm of the
fundamental frequency of the sound in [Hz].
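Equation (2.1) translates directly into Octave; by the definition of the mel scale, a 1000 Hz tone should map to (approximately) 1000 mel:

```octave
% Equation (2.1) evaluated in Octave: frequency in Hz to pitch in mel.
f = 1000;                        % Hz
m = 2595 * log10(1 + f / 700);   % approximately 1000 mel
```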
Another psychoacoustic phenomenon utilized in speech analysis
is the concept of critical bands. Due to the structure of the basilar
membrane in the inner ear, when listening to a pure tone with a
certain frequency f, noise outside of a frequency band around the
central frequency f does not have an impact on the sensation of the


Figure 2.2: Relation of subjective pitch of sound to its frequency

tone. The width of the critical band depends on the central frequency
f. This effect can be represented as the application of a set of band-pass filters to the sound signal. The frequency unit bark was introduced
to capture this phenomenon; the width of a critical band is
roughly 1 bark at any frequency. The relation of the frequency in [Hz]
to the frequency in [bark] is described by the following equation:

    Ω = 6 ln( f/600 + sqrt( (f/600)^2 + 1 ) )                (2.2)
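Equation (2.2) is likewise a one-liner in Octave:

```octave
% Equation (2.2) evaluated in Octave: frequency in Hz to the bark scale.
f = 1000;                                       % Hz
b = 6 * log(f / 600 + sqrt((f / 600)^2 + 1));   % critical-band rate in bark
```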

2.2.2 Loudness
Loudness is a subjective measure, not to be confused with objective measures like sound pressure or sound intensity. The perceived
loudness is related to sound pressure, duration and frequency. Equal-loudness curves evaluate the difference between the perceived loudness
and the sound pressure of a pure tone over the audible frequency
spectrum. These curves are defined by the international standard ISO
226:2003, based on several modern experimental determinations[8].
The unit of loudness level for pure tones is the phon; 1 phon is defined
as the loudness of a pure tone with a 1 dB sound pressure level at a
frequency of 1 kHz. Another unit, the sone, is also used for capturing
the loudness phenomenon, but it is not used in this thesis.


Figure 2.3: Equal-loudness curves defined by the revised standard ISO 226:2003 (the blue line shows the original ISO standard for 40 phons)[9]

2.3 Short-time analysis


2.3.1 Window functions
For reasons discussed in section 2.1, the analysis of speech mostly
relies on the analysis of short segments of the speech waveform, called
microsegments. Most of the short-time analysis methods can be described by the relation

    Q_n = Σ_{k=−∞}^{∞} T(s(k)) w(n − k) ,                    (2.3)

where s(k) is the value of the analyzed LPCM-coded signal at time
k, T(.) is the transformation used and w(n) is a weighting function or
window function. A window function is a mathematical function that
is zero-valued outside of some chosen interval. Its purpose is to select
a segment in the neighborhood of a central sample and optionally to

suppress the effect of the marginal samples of the analyzed segment.
Many types of differently shaped windows exist. In speech processing the two most commonly used are the rectangular and the Hamming window. The rectangular window does not do any weighting,
it only selects the samples included in the microsegment. Its value is 1
inside the chosen interval and 0 outside of it. The Hamming window
is defined by the following relation:

    w(n) = { 0.54 − 0.46 cos(2πn/(L − 1))   for 0 ≤ n ≤ L − 1
           { 0                              otherwise,       (2.4)

where L is the length of the window in samples. The window
length in seconds is L divided by the sampling frequency of the analyzed signal. Figure 2.4 illustrates the selection of microsegments
using the Hamming window.
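A windowed microsegment as described above can be produced directly from equation (2.4); the variable names below are chosen for illustration:

```octave
% Sketch of equation (2.4): generate a Hamming window of length L
% and apply it to one microsegment of a signal s starting at n0.
L = 256;
n = 0:L-1;
w = 0.54 - 0.46 * cos(2*pi*n / (L - 1));   % Hamming window
s = randn(1, 1000);                        % stand-in for a speech signal
n0 = 100;
segment = s(n0:n0+L-1) .* w;               % weighted microsegment
```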

Figure 2.4: Example of the application of the Hamming window to a speech waveform

2.4 Time-domain analysis


Short-time energy (STE), short-time intensity (STI) and short-time
zero-crossing rate (STZCR) are three basic functions used in speech
processing to estimate the parameters of the excitation signal in the

source/system production model. The function of short-time energy
is defined as

    E_n = Σ_{k=−∞}^{∞} [s(k) w(n − k)]² ,                    (2.5)

and gives information about the average energy of the signal in
the microsegment. The short-time intensity, defined as

    M_n = Σ_{k=−∞}^{∞} |s(k)| w(n − k) ,                     (2.6)

represents the same quality of a signal, but this function is less
sensitive to abrupt changes in the amplitude of the analyzed signal
due to using the absolute value instead of raising to the second power. The
short-time zero-crossing rate is a simple measure of how many times
the amplitude changes its sign. It is defined as

    Z_n = Σ_{k=−∞}^{∞} |sgn[s(k)] − sgn[s(k − 1)]| w(n − k) ,  (2.7)

where

    sgn[s(k)] = { 1   for s(k) ≥ 0
                { 0   for s(k) < 0                           (2.8)

and w(n) is a rectangular window. The most common use of the
above-mentioned functions is to detect the beginning and end of
speech in a signal and to distinguish voiced and unvoiced sounds.
The suggested window size is in the interval of 10 to 25 ms.
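For a single microsegment with a rectangular window, equations (2.5), (2.7) and (2.8) reduce to a few vector operations; the signal below is a stand-in, not output of the speech package:

```octave
% Minimal sketch of equations (2.5) and (2.7): short-time energy and
% zero-crossing rate of one microsegment (rectangular window).
fs = 8000;                       % assumed sampling frequency
s = randn(1, fs);                % stand-in for a speech signal
L = round(0.02 * fs);            % 20 ms window
seg = s(1:L);
ste = sum(seg .^ 2);             % short-time energy, eq. (2.5)
sg = double(seg >= 0);           % sgn[] as defined in eq. (2.8)
zcr = sum(abs(diff(sg)));        % zero crossings in the segment, eq. (2.7)
```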
Another time-domain function worth mentioning is the short-time
autocorrelation function (STACF), defined as

    R_n(m) = Σ_{k=−∞}^{∞} s(k) w(n − k) s(k + m) w(n − k − m) ,  (2.9)

where w(n) is a rectangular or Hamming window. Notice that
R_n(m) is a two-dimensional function, where the value for each time index

n is an array of m values. It is used to detect periodicity in signals and
is the basis of many spectral analysis methods. To be able to detect
the pitch period, the windowed segment must contain at least two
fundamental periods of the analyzed signal, so it is recommended to
use windows with lengths of 20 to 40 ms, depending on the speaker.
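The inner sum of equation (2.9) for one window position can be sketched as follows (this is an illustration, not the package's stacf() implementation):

```octave
% Sketch of equation (2.9): short-time autocorrelation of one
% windowed segment for lags m = 0..maxlag (rectangular window).
fs = 8000;
s = sin(2*pi*120*(0:fs-1)/fs);       % stand-in periodic signal, 120 Hz
L = round(0.03 * fs);                % 30 ms window, two pitch periods
seg = s(1:L);
maxlag = 100;
R = zeros(1, maxlag + 1);
for m = 0:maxlag
  R(m + 1) = sum(seg(1:L-m) .* seg(1+m:L));
end
% R peaks again near lag fs/120, revealing the pitch period.
```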

2.5 Frequency-domain analysis


2.5.1 Fourier transform
The Fourier transform is a mathematical operation which expresses a
function of time as a function of frequency, called the frequency spectrum. Under suitable conditions, the time-domain signal can be reconstructed from the frequency-domain representation (using the inverse Fourier transform). There is a strong connection between the
two representations (e.g. the convolution of two signals in the time domain
corresponds to their multiplication in the frequency domain), which
makes the Fourier transform the most important method in signal
processing and a basis for numerous other methods.

For application to discrete signals, the discrete-time Fourier transform (DTFT) is defined as

    X(ω) = Σ_{k=−∞}^{∞} x_k e^{−iωk} ,                       (2.10)

where ω denotes the angular frequency and x_k denotes the k-th sample of the signal. It can be seen from the definition that the
values of the DTFT can be complex even when working with real
signals. The DTFT of an aperiodic signal (speech segments are periodic only over short time intervals, so they are treated as aperiodic in
this formula) is a continuous periodic complex function with period
2π. The magnitude of the DTFT |X(ω)| is an even function, while the
phase ∠X(ω) is odd.

2.5.2 Discrete Fourier transform
To be able to process the spectra of discrete signals using computers, we also need them to be discrete. To achieve this, the discrete
Fourier transform (DFT) is used. The DFT of a discrete signal
of length N (x_0, ..., x_{N−1}) is also of length N, and can be computed
using the following formula:

    X_n = Σ_{k=0}^{N−1} x_k e^{−i2π(n/N)k} .                 (2.11)

The frequency represented by the value X_n is (n/N)·s_f for values n ≤
⌈N/2⌉, where s_f is the sampling frequency. Values where n > ⌈N/2⌉
contain redundant information about the signal.

When computing the DFT of a signal using the definition, it takes
O(N²) operations to get the results. There is a group of less time-consuming algorithms called fast Fourier transforms (FFT), which can
compute the DFT and its inverse in only O(N log N) operations. The
most well-known of these is the Cooley-Tukey algorithm [4], which
can compute the DFT of a signal whose length is a power of two in
only (N/2) log₂ N complex multiplications and N log₂ N complex additions
[10]. The description of these algorithms is beyond the scope of this
thesis.
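In practice the DFT of equation (2.11) is computed with Octave's built-in fft(), which uses an FFT algorithm; the test tone below is an illustration:

```octave
% Sketch: compute the DFT of eq. (2.11) with fft(). Bin n (0-based)
% corresponds to the frequency (n/N)*sf.
sf = 1000;                          % assumed sampling frequency
N = 1024;
t = (0:N-1) / sf;
x = sin(2*pi*50*t);                 % 50 Hz test tone
X = fft(x);
[mx, n] = max(abs(X(1:N/2)));       % strongest bin in the first half
f_peak = (n - 1) / N * sf;          % close to 50 Hz
```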

2.5.3 Short-time Fourier transform


As with other short-time analysis methods, we can get the formula
for the short-time Fourier transform by combining (2.3) with (2.10):

    X(n, ω) = Σ_{k=−∞}^{∞} s(k) w(n − k) e^{−iωk} .          (2.12)

The short-time Fourier transform of a signal is often represented
using spectrograms, see Figure 2.5.


Figure 2.5: Spectrogram of the speech sequence shown in Figure 1.1
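A spectrogram like the one in Figure 2.5 can be produced with specgram() from the Octave-forge signal package; this sketch assumes the package is installed and uses the placeholder file name 'speech.wav':

```octave
% Sketch: plot a spectrogram of a WAV file using specgram() from
% the signal package ('pkg load' is needed on newer Octave versions).
pkg load signal
[s, fs] = wavread('speech.wav');   % 'speech.wav' is a placeholder
specgram(s, 512, fs);              % 512-point FFT per segment
```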

2.6 Linear predictive analysis


Linear predictive analysis is one of the most powerful methods of
speech analysis. It is used to estimate the parameters of the
speech production model described in section 2.1. The linear system
in this model can be described by an all-pole model system function:

    H(z) = S(z)/E(z) = G / (1 − Σ_{i=1}^{p} a_i z^{−i}) ,    (2.13)

where p is the order of the model and G is the gain parameter.
The goal of linear prediction is to estimate the linear prediction coefficients (LPC) a_i with a minimized average squared prediction error,
based on the assumption that the k-th sample of the signal s(k) can be
described as a linear combination of the Q previous samples of the signal and the excitation u(k):

    s(k) = Σ_{i=1}^{Q} a_i s(k − i) + G u(k) .               (2.14)

There are several methods to estimate these parameters; the one
implemented in the practical part of the thesis is the covariance method.
The method is elaborated in [7]; for our purpose it is sufficient to

know how to compute the coefficients. The following formula describes
a symmetric positive semi-definite matrix which represents a system
of linear equations:

    φ_n[i, k] = Σ_{m=M1−k}^{M2−k} s_n[m] s_n[m + k − i]      (2.15)

The solution of this system gives the linear prediction coefficients
estimated with the covariance method.
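The covariance method amounts to building the matrix of equation (2.15) and solving the resulting linear system. The rough sketch below follows the textbook formulation and is not the package's lpc_cov() implementation:

```octave
% Rough sketch of LPC estimation by the covariance method for one
% microsegment (textbook formulation, illustrative only).
p = 8;                              % model order
s = randn(1, 240);                  % stand-in microsegment
N = length(s);
phi = zeros(p + 1, p + 1);
for i = 0:p
  for k = 0:p
    % phi(i,k) = sum over the segment of s(m-i)*s(m-k):
    phi(i + 1, k + 1) = sum(s(p + 1 - i : N - i) .* s(p + 1 - k : N - k));
  end
end
% Solve the normal equations phi(i,k)*a_k = phi(i,0), i = 1..p:
a = phi(2:end, 2:end) \ phi(2:end, 1);   % LPC coefficients a_1..a_p
```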
2.6.1 Perceptual linear prediction
Perceptual linear prediction (PLP) is a revision of the linear prediction method, taking into account the ideas of sound perception described in section 2.2. A few transformations are applied to the analyzed signal before describing it by an all-pole model:

• calculation of the power spectrum of the speech signal,

• non-linear transformation of the frequency spectrum to the
bark scale described in 2.2.1 and application of a set of band-pass
filters representing the critical bands of hearing,

• weighting of the samples with respect to an equal-loudness curve,

• application of the relation between the intensity of the sound
and the perceived loudness.

This method was introduced by Hynek Hermansky in 1989 [11], and
is nowadays one of the most utilized methods in speech recognition
[6].

2.7 Cepstral analysis


The cepstrum of a signal is defined as the inverse Fourier transform
of the logarithm of the spectrum of the signal. The idea behind the
cepstrum is that the log spectrum of a sound can also be treated as a
waveform and subjected to further Fourier analysis. The independent
variable of the cepstrum is time, but in order to differentiate

it from time-domain analysis it is called quefrency (the names cepstrum
and quefrency were created by changing the order of letters in the
words spectrum and frequency, respectively). Cepstral analysis has
uses in pitch detection and pattern recognition, and also in techniques
for the estimation of the vocal tract system (cepstral LPC coefficients).
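Following the definition above, the real cepstrum of a segment is one line of Octave (the segment here is a stand-in):

```octave
% Sketch: real cepstrum of a segment, i.e. the inverse Fourier
% transform of the log magnitude spectrum.
fs = 8000;
s = sin(2*pi*100*(0:fs/10-1)/fs);        % stand-in segment
c = real(ifft(log(abs(fft(s)) + eps)));  % eps avoids log(0)
```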
2.7.1 Mel-frequency cepstral coefficients
Mel-frequency cepstral coefficients (MFCC) are parameters of a signal
used for capturing its short-time spectral characteristics in a compact form. The method consists of computing the power spectrum of
a windowed segment of the analyzed signal, then applying a bank
of triangular band-pass filters. The central frequencies and
bandwidths of the filters are determined so that they cover the entire analyzed frequency band, are uniformly spaced on the mel scale,
and the central frequency of a given filter is also the first frequency
value of the next filter. The number of filters is usually based on the
bandwidth of the original signal with respect to the critical band theory of auditory perception, e.g. for an 8 kHz bandwidth 20 filters are
used. The filters are applied to the signal segment separately and the
sums of the output values of each filter are calculated. The MFCCs
are the values of the discrete cosine transform applied to the list of
logarithms of the filter outputs. Various sources differ in some details
on the exact method of computation of the MFCCs; the practical
part of this thesis implements the one described in [6].
MFCCs are commonly used in the fields of speaker recognition
and speech recognition, and are increasingly finding uses in music
information retrieval [12].
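The pipeline described above can be compressed into a few steps; the filter bank below is a random placeholder (a real implementation spaces triangular filters uniformly on the mel scale), and this is not the package's mfcc():

```octave
% Compressed sketch of the MFCC pipeline, with a placeholder filter
% bank; dct() is provided by the signal package.
pkg load signal                      % 'pkg load' on newer Octave versions
fs = 8000; L = 256;
s = randn(1, L);                     % stand-in windowed segment
P = abs(fft(s .* hamming(L)')) .^ 2; % power spectrum of the segment
nfilt = 20;                          % e.g. 20 filters for 8 kHz bandwidth
H = rand(nfilt, L/2);                % placeholder triangular filter bank
E = H * P(1:L/2)';                   % per-filter output sums
coeffs = dct(log(E));                % MFCCs: DCT of log filter outputs
```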


Chapter 3

Sound analysis with mathematical software
As can be seen from the previous chapter, sound analysis (and
signal processing in general) is a very math-heavy task, so it is reasonable to look into software developed specifically for performing
mathematical computations when aiming to get engaged in this field.
This chapter is devoted to a review of some of the more widely
used mathematical software on the market and also to a comparison of their sound processing capabilities.

3.1 Matlab
Matlab is a programming environment for algorithm development,
data analysis, visualization, and numerical computation, developed
by MathWorks[13]. It is primarily intended for numerical computations, although it has an optional toolbox granting the ability to
perform symbolic computations. Its user base varies on a wide scale
across industry and academia, counting roughly one million users
(2004)[14]. Along with Simulink, an additional package also developed by MathWorks for multidomain simulation and Model-Based
Design, it is popular among engineers and economists. Matlab is
distributed as a stripped-down core program along with optional
toolboxes providing additional functionality grouped by field of usage. While most of the functionality is accessible from the command
prompt, Matlab comes with a GUI for added convenience. MathWorks provides licensing options separately for educational and commercial use.

3. SOUND ANALYSIS WITH MATHEMATICAL SOFTWARE


Matlab's Signal Processing Toolbox implements numerous industry-standard digital and analog signal processing methods. It provides tools for visualizing signals in the time and frequency domains,
computing FFTs, FIR and IIR filter design, and other signal processing techniques. As Matlab is also an interpreted procedural programming language, the tools in the Signal Processing Toolbox can be used
to develop custom algorithms[15]. The Matlab Coder and Simulink
Coder tools can generate processor-specific C/C++ code from algorithms prototyped in Matlab, which makes it a powerful tool in embedded system design (where signal processing is heavily used),
thus making it the software of choice for a large fraction of
professionals working in the field.

3.2 GNU Octave


GNU Octave (hereafter only Octave) is a high-level interpreted language,
primarily intended for numerical computations[16]. It can be considered a Matlab clone, as it is very similar to Matlab, making programs easily portable between the two. Unlike Matlab, Octave does
not have a GUI (one is planned to be included in the next major release);
it is used through its interactive command-line interface. For visualization the external gnuplot program is used, which is included in the
Octave installation package. Octave is open source and is released
under the GNU General Public License. As part of the GNU project,
its development is entirely community-based; people who are interested are encouraged to take part. Octave's package system makes
the creation and installation of additional packages expanding the
core program's functionality effortless. A collection of collaboratively
developed packages for Octave can be found on the Octave-forge
website.
While Octave has limited signal processing capabilities on its own,
Octave-forge maintains a signal package developed to broaden the
spectrum of available methods. Although the number of functions in
this package is only a fraction of what is in Matlab, the ones actually
implemented are syntactically almost identical to their counterparts
in the Signal Processing Toolbox, which makes Octave a reasonable



choice if Matlab's functionality is desired but out of budget.

3.3 Maple
Maple is a commercial computer algebra system (CAS). Its development began in 1980 at the University of Waterloo in Ontario, Canada;
it has been on the commercial market since 1988, distributed and developed by Maplesoft. Maple supports numeric computation as well
as symbolic computation and visualization[17]. The power of CAS
software lies in its capability to perform symbolic mathematics.
Computations with exact quantities such as fractions, radicals and
symbols eliminate the propagation of rounding errors, thus achieving more precise results than purely numerical systems. The approximation of the results can be computed at arbitrary precision, not
limited by the underlying hardware. One of Maple's main advantages over other mathematical software is its above-par GUI in terms
of user-friendliness and intuitiveness. Its highlights include the ability to process input in standard mathematical notation (the typing
of which is assisted by well-designed automatic formatting), the
context-sensitive menus appearing when right-clicking an expression, and its overall natural feel. As with Matlab, separate student
and commercial licenses exist for Maple.
Maple is software which can be used in signal processing to a
high extent, given its strong mathematical and visualization abilities.
It has built-in functions for the essential methods used in the field;
however, it is not that well supplied with implementations of additional methods. The Maplesoft website and other community websites provide free expansions and tutorials for applications in signal
processing. For uses specifically in sound processing, the low number
of supported audio formats can be a major drawback.

3.4 Wolfram Mathematica


Wolfram Mathematica, developed by Wolfram Research, is an all-around commercial mathematical software and computer algebra system that can be deployed in any area that requires technical computation. It consists of a kernel, a command-line interface which interprets inputs as expressions in Mathematica code and returns outputs
in the same format; and a front-end providing sophisticated input
and output handling, including typeset mathematics, graphics, GUI
components, tables and sound[18]. The number of features included
in Mathematica is second to no other mathematical software on the
market, ranging from high-performance symbolic and arbitrary-precision numerical computations to the support of multi-threading and
user-level parallel programming. Mathematica's interface is less user-friendly and has the steepest learning curve of the software reviewed
in this chapter. Student and commercial licenses are available for
Mathematica.

Wolfram Mathematica is a very powerful mathematical software package,
which makes it suitable for signal processing purposes. The range of
supported file formats (audio and overall) is superior to its competitors', including also compressed audio formats like FLAC.

3.5 Comparison
This section aims to compare the software described above in
terms of their use in signal processing (more specifically audio processing), using tables and charts as well as textual description to
demonstrate the results of the study.

                          Matlab¹     Maple    Mathematica   Octave
Student                   $99         $124     $126.62       Free
Commercial (single user)  $4,471.22   $2,845   $3,219.93     Free

Table 3.1: Comparison of prices

The prices are stated for purchase from the Czech Republic, converted to USD
with exchange rates valid on 5/18/2012.
¹ including Signal Processing Toolbox




                        Matlab     Maple   Mathematica   Octave
Window functions        SPT        No      No            Yes
Convolution             Yes        Yes     Yes           Yes
Cross-correlation       SPT        No      No            signal
Discrete FT             Yes        Yes     Yes           Yes
Symbolic FT             Symbolic   Yes     Yes           No
Short-time FT           No         No      No            Yes
Real/complex cepstrum   SPT        No      No            signal
LPC                     SPT        No      No            signal

Table 3.2: Comparison of built-in signal processing functionality

SPT - Signal Processing Toolbox, Symbolic - Symbolic Toolbox,
signal - signal package

                Matlab   Maple         Mathematica*   Octave
.WAV            LPCM     LPCM/ADPCM    Yes            LPCM
.AU (.SND)      µ-law    No            Yes            µ-law
.RAW            No       No            No             LPCM
.W64 (Wave64)   No       No            Yes            No
.AIFF           No       No            Yes            No
.FLAC           No       No            Yes            No

Table 3.3: Comparison of supported audio formats and encodings

* According to its documentation, Mathematica supports all standard raster formats and codecs [19]

3.5.1 Plotting
Each of the mentioned software packages has the ability to visualize data
to some extent. This is an area where Octave is significantly weaker
than its commercial competitors, since it is limited by the functionality of gnuplot, and the interfacing of the two programs does not
even allow every gnuplot function to be used from Octave. Plots
can be manipulated from Octave's command-line interface. It
can create static 2D and 3D plots, with multiple functions on
the same figure. For use in sound processing its capabilities are sufficient; however, the lack of a GUI makes working with plots tiresome.

The specgram() function of the signal package, which plots spectrograms directly, proved a valuable feature, seeing that only Matlab has an equivalent.
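As an illustration, plotting the spectrogram of a signal loaded from a .wav file could look like the following sketch (the file name and the 256-sample window length are arbitrary choices here):

```octave
pkg load signal;                   % the specgram() function lives here
[x, fs] = wavread("speech.wav");   % "speech.wav" is a placeholder file name
specgram(x(:,1), 256, fs);         % spectrogram of the first channel,
                                   % 256-point windows, axes scaled by fs
```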
The other programs have far greater visualization power in general, allowing, among other things, the creation of interactive plots and animations. A noteworthy feature of Mathematica is the ability to create .CDF files, Wolfram's own format, "designed to allow easy authoring of dynamically generated interactive content" [18], which can be viewed with the CDF player, downloadable from the Wolfram website free of charge. The CDF player integrates with the most popular browsers, which allows .CDF documents to be embedded in HTML pages. Demonstrations of its capabilities in the field of audio processing can be found in the Wolfram Demonstrations Project [20].


Chapter 4

The speech package for GNU Octave


The speech package is a library of functions for GNU Octave implementing the methods of speech analysis described in chapter 2. This chapter describes the package in detail and discusses the issues faced during its development.

4.1 Installation
The speech package takes advantage of the Octave package interface, making it effortless to install: open Octave, change into the directory where the package archive is located, and type the following command at the Octave command prompt:
pkg install speech.tar.gz
The package is dependent on the signal package, which is itself dependent on a few more packages, all of which can be downloaded
for free from the Octave-forge website.
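Assuming the packages are fetched from Octave-Forge, a typical installation session might look like this (a sketch; the exact dependency list and the need for explicit loading vary between Octave versions):

```octave
pkg install -forge signal     % the signal package (install any packages
                              % pkg reports as missing dependencies first)
pkg install speech.tar.gz     % then install the speech package itself
pkg load signal               % load the packages before use
pkg load speech
```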

4.2 Input handling


The input signal can be passed to the functions of the package in two ways: as an array containing the values of sampled data or as a string containing the path to a .wav file. If passed as an array, the window size is expected as a number of samples; if passed as a .wav file, the window size is expected in milliseconds. While the inclusion of this feature may seem an unnecessary complication, it can speed up the workflow considerably when working with a larger number of sound samples in separate files.
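The two calling conventions can be illustrated as follows (the argument lists are simplified; "sample.wav" and the window sizes are placeholder values for this sketch):

```octave
[x, fs] = wavread("sample.wav");   % fs is the sampling rate in Hz
e1 = ste(x, 441);                  % array input: window size in samples
e2 = ste("sample.wav", 10);        % file input: window size in milliseconds
% at fs = 44100 Hz, a 10 ms window covers the same 441 samples as above
```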

The functions can handle a maximum of two channels of audio data at once (stereo signals). If a matrix is passed as an input signal, its smaller dimension is treated as representing separate channels, and the function is applied recursively to the first two channels. Multi-channel audio files are treated as matrices whose columns represent separate channels. To get results for both channels, an appropriate number of output arguments must be specified.

When calling a function, all arguments except the input signal can be omitted; if so, default values are used and no visualization of the result is done. The functions lpc_cov(), plp() and mfcc() handle only vectors as input signals.

4.3 Function reference


Function references are included in the source code of the functions in a way that Octave recognizes, so they are accessible from Octave by typing help function_name at the command prompt. This section provides an overview of each function included in the package rather than describing its syntax.
4.3.1 The sti(), ste() and zcr() functions
The sti() and ste() functions implement the short-time intensity and short-time energy functions respectively, which were described in section 2.4. They take as arguments the input signal, the window size, the window type and an integer as an option for visualization.
To compute the results, the functions do not implement the definitions directly; summing the sampled values inside each windowed segment would be excessively time-consuming. Instead, the appropriate transformation is applied to each sample of the signal in advance, and the result is then convolved with a window of the given length. The result of the function call is a vector of the same length as the analyzed signal, each value of which corresponds to the windowed signal segment centered at the same index. As a side effect of using convolution, windows near the ends of the input signal treat it as if it were padded with zeros on both sides. The built-in conv() function computes the convolution by calling the pre-compiled filter() function, which uses FFT algorithms to get the result. This optimization results in a drastic improvement in run time.

Figure 4.1: Sample output of the zcr() function, called with 150 points for interpolation of the plot

The idea behind the computation of the result of the zcr() (zero-crossing rate) function is similar: it takes the convolution of the window with the array obtained by subtracting the signum of the original signal from the signum of the signal shifted in time by one sample. The zcr() function does not take the window type as an argument; the rectangular window is used for computing the zero-crossing rate.
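The convolution approach described above can be sketched as follows; this is a simplified version of what the package computes, assuming x is a column vector and using an arbitrary window length (hamming() comes from the signal package):

```octave
N = 256;                                      % window length (arbitrary here)
E = conv(x .^ 2, hamming(N));                 % short-time energy via convolution
d = abs(sign(x(2:end)) - sign(x(1:end-1)));   % 2 wherever the signal crosses zero
Z = conv(d, ones(N, 1)) / (2 * N);            % short-time zero-crossing rate
% conv() returns the full convolution (length(x) + N - 1 values); trimming
% roughly (N-1)/2 values from each end recenters the result on the input
```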

For plotting the results the complementary function plot_stf() is used. Before plotting, it interpolates the result according to the optional last argument of the mentioned functions, making the plot more readable and easier on the eyes. If the value passed is 0, no interpolation is done. The results are plotted over the waveform of the analyzed signal to make their meaning perfectly clear.

4.3.2 The stf() function
This is a complementary function, which applies a function passed
to it as a function handle to windowed segments of the input signal. It is a very useful tool for performing short-time analysis of a
signal. For methods which combine samples inside a window in a
non-trivial way to get the result, the convolution of the window with
the signal cannot be used. In most cases it is superfluous to compute
the results of a function for as many windowed segments as there
are samples in the signal; the gap between the centers of adjacent
windows can be specified explicitly and passed to the function as an
argument.
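The core idea of stf() can be sketched as a plain frame loop (a hypothetical, simplified version; the real function also handles windowing, multiple channels and edge cases):

```octave
function r = stf_sketch (f, x, N, step)
  % Apply the function handle F to consecutive length-N segments of the
  % column vector X, moving the window by STEP samples between frames.
  frames = floor((length(x) - N) / step) + 1;
  r = [];
  for k = 1:frames
    seg = x((k - 1) * step + 1 : (k - 1) * step + N);
    r = [r, f(seg)(:)];   % collect each frame's result as one column
  endfor
endfunction
```

For example, stf_sketch(@(s) sum(s .^ 2), x, 256, 128) would return the energy of each 256-sample frame, with frame centers 128 samples apart.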

4.3.3 The stacf() function


This function implements the short-time autocorrelation function using the xcorr() function from the signal package and the supporting stf() function from the speech package. The xcorr() function performs cross-correlation using FFT algorithms. Called with one argument, it returns the autocorrelation of the input signal (autocorrelation is the cross-correlation of a signal with itself). In this case, using the xcorr() function makes a noticeable difference in run time compared to a direct implementation of the definition.

Because the result of applying the STACF to a signal is a two-dimensional set of data, a 3d plot was chosen for its visualization. Calling stacf() with an additional argument greater than zero plots the result as a 3d mesh plot. The default viewpoint of the plot is normal to the x and y axes; the values on the z axis are represented by the colors of the graph (similarly to a spectrogram), making the peaks of the graph prominent. The plot can be rotated arbitrarily; to return to the original viewpoint, type the command view(2) at the Octave prompt.
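The one-argument use of xcorr() mentioned above can be demonstrated on its own (a minimal sketch; stacf() gathers such a vector for each windowed segment via stf() before plotting):

```octave
pkg load signal;
r = xcorr(x);   % single argument: FFT-based autocorrelation of x
% r has length 2*length(x) - 1; the zero-lag term sits at index length(x)
```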

4.3.4 The lpc_cov() function
The signal package has a function for estimating linear prediction coefficients using the autocorrelation method1. The lpc_cov() function of the speech package implements the estimation with the covariance method described in section 2.6, which does not have an implementation in Octave.
Finding a suitable optimization for the algorithm computing the coefficients with this method was crucial, since computing the matrix of the linear equation system that describes the coefficients by implementing formula 2.15 directly would take p(p+1)(N−p) iterations of one multiplication and one addition, where p is the desired model order and N is the length of the analyzed signal segment. From the formula it can be seen that each row of the matrix can be interpreted as the cross-correlation of the whole signal segment with a moving windowed subsegment of length N−p, thus allowing the matrix to be computed by p+1 calls to the xcorr() function. The speech package implements this optimization.
In Matlab, the arcov() function implements a different algorithm. It computes a different matrix, of size (N−p) × (p+1), whose solution is the same set of values. The matrix is computed simply by cutting the signal segment up into p+1 overlapping segments of length N−p, which serve as the columns of a temporary matrix. The matrix is then flipped vertically and every value is divided by √(N−p). Table 4.1 compares the run times of the Octave implementations of the three algorithms described above (the algorithm of Matlab's arcov() function was ported to Octave to provide a reasonable comparison).
1. The documentation of the function says that it uses the Burg method for estimation; however, the function produces the same results as the aryule() function, which uses the Yule-Walker method (also called the autocorrelation method). Matlab's lpc() function also produces the same results and is documented as using the autocorrelation method. Matlab's and Octave's arburg() functions, which use the Burg method, produce different output, leading one to believe there is an error in the documentation of Octave's lpc() function.


                 N = 600, p = 20     N = 1000, p = 40     N = 1000, p = 200
  Trivial        8.3799 ± 0.0619 s   54.1800 ± 0.1390 s   > 60 s
  Speech pkg     0.0555 ± 0.0015 s   0.1143 ± 0.0154 s    0.8040 ± 0.0300 s
  Matlab-style   0.0628 ± 0.0022 s   0.1325 ± 0.0025 s    0.2853 ± 0.0221 s

Table 4.1: Comparison of the run times of three different implementations of the covariance LPC method

The lpc_cov() function calculates the linear prediction coefficients for the whole input signal. To calculate them for windowed segments of the input signal, use it in combination with the stf() function. Since linear prediction coefficients are used mainly for pattern matching purposes in speech recognition and speaker recognition (which require comparing the coefficients of different segments and matching them against a library of model coefficients), this function does not provide any kind of visualization.
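A hedged usage sketch of the combination suggested above (the argument order of stf(), the model order and the frame sizes are assumptions for illustration):

```octave
p = 12;                                        % model order, a common choice
A = stf(@(seg) lpc_cov(seg, p), x, 480, 240);  % coefficients per 480-sample
                                               % frame, frames 240 samples apart
```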
4.3.5 The strceps() and stcceps() functions
Functions for the computation of the real and complex cepstrum are included in the signal package (rceps() and cceps()). The strceps() and stcceps() functions of the speech package apply the rceps() and cceps() functions, respectively, to windowed segments of the analyzed signal using the stf() function.
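For reference, the real cepstrum computed by rceps() follows the textbook definition, which for a single segment reduces to one line:

```octave
c = real(ifft(log(abs(fft(seg)))));  % real cepstrum: log magnitude spectrum,
                                     % transformed back to the time domain
```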
4.3.6 The plp() and mfcc() functions
The plp() function implements the perceptual linear prediction method as described in [6]. The output of this function is a vector of length Q (the model order, passed to the function as an argument) containing the PLP coefficients of the input signal. To get representative results, it is recommended to use this function in combination with the stf() function.
The mfcc() function works similarly to the plp() function, except that it computes the mel-frequency cepstral coefficients of the input signal, as described in [6]. Again, combined use with the stf() function is highly recommended.
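A hedged example of the recommended combination (frame sizes chosen for 16 kHz speech; the argument order of stf() and the model order are assumptions):

```octave
Q = 13;                                           % model order, a common choice
C_plp  = stf(@(seg) plp(seg, Q),  x, 400, 160);   % PLP coefficients per frame
C_mfcc = stf(@(seg) mfcc(seg, Q), x, 400, 160);   % MFCCs of the same frames
```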
4.3.7 Complementary functions
sgn() A modified version of the built-in sign() (signum) function, used to determine the zero-crossing rate of a signal.

dct_n() A modified version of the dct() (discrete cosine transform) function included in the signal package, used in the mfcc() function.

elc() This function creates an equal loudness curve for loudness level 40 phon, of arbitrary length, using an approximation equation described in [6]. The values are normalized to the interval [0,1]. Used by the plp() function.

cbandfilt() This function creates an array of arbitrary length which corresponds to the impulse response of the critical band filters used in the PLP method.

mel2hz(), hz2mel(), bark2hz(), hz2bark() These functions implement the conversion of frequency values between the units [mel] and [Hz], and [bark] and [Hz], respectively.
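The hz/mel pair, for instance, typically follows this well-known formula (a sketch; the package's implementation may use a slightly different variant from [6]):

```octave
function m = hz2mel_sketch (f)
  m = 2595 * log10(1 + f / 700);      % roughly 1000 mel at 1000 Hz
endfunction

function f = mel2hz_sketch (m)
  f = 700 * (10 .^ (m / 2595) - 1);   % exact inverse of hz2mel_sketch
endfunction
```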

4.4 Known issues


When analyzing short signal segments using methods which compute parameters of a model estimating the vocal tract system (i.e. the lpc_cov(), plp() and mfcc() functions), the warning "matrix singular to machine precision" is sometimes thrown. This happens when the samples in the segment carry less information than is needed to describe a model of the given order, usually at the ends of the signal, where it contains silence, or when the chosen model order is unreasonably high. Results obtained when the function has thrown this warning are not valid.


Chapter 5

Final thoughts
This thesis focused on the analysis of sound signals with methods utilized primarily in the ever-expanding field of speech processing. While computers communicating with people in natural spoken language are still found only in science fiction, the accomplishments of research in the field have already started to make an impact on our everyday lives. GPS devices featuring the option of synthesized voices giving directions have become common, and voice control of various electronic gadgets is no news either. Automatic captioning of a few selected live broadcasts is available on Czech National Television, and the expansion of this feature is in the works. It is clear that despite the remarkable progress achieved in the recent past, the field of speech processing is yet to reach its zenith, and there is a bright prospect for further research.
While cutting-edge research requires cutting-edge software, usually developed with robust financial capital, in some cases open-source software offers a reasonable alternative for low-budget enterprises. Thanks to the enthusiasm of the development community, GNU Octave is becoming increasingly usable for practical applications.
The speech package created as the practical part of this thesis is a user-contributed extension of GNU Octave which enhances its speech processing capabilities. The methods implemented are used primarily in speech recognition and speaker recognition, fields that have been gaining ground rapidly in the recent past (especially the former: automatic captioning is a hot topic nowadays). Mastering these fields is instrumental to achieving the goal of fluent spoken communication with computers, but it is not sufficient. Techniques for understanding the linguistically imperfect output of speech recognition methods also have to take their great leap forward.


Bibliography
[1] Wikipedia. Sound. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[2] Wikibooks. Signals and Systems/Definition of Signals and Systems. http://en.wikibooks.org/w/index.php?title=Signals_and_Systems/Definition_of_Signals_and_Systems&oldid=2254812. [Online; accessed 20-May-2012].
[3] Bores Signal Processing. Introduction to DSP. http://www.bores.com/courses/intro/index.htm. [Online; accessed 20-May-2012].
[4] Steven W. Smith. The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Publishing, San Diego, CA, USA, 1997.
[5] Wikipedia. Audio signal processing. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[6] J. Psutka, L. Müller, J. Matoušek, V. Radová. Mluvíme s počítačem česky. Academia, Praha, 2006.
[7] L.R. Rabiner and R.W. Schafer. Introduction to Digital Speech Processing. Foundations and Trends in Signal Processing. Now Publishers, 2007.
[8] ISO 226:2003. Acoustics - Normal equal-loudness-level contours. ISO, Geneva, Switzerland.
[9] Wikipedia. File:Lindos1.svg, 2012. [Online; accessed 20-May-2012].
[10] Wikipedia. Fast Fourier transform. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[11] H. Hermansky. Perceptual linear predictive (PLP) analysis of speech. Journal of the Acoustical Society of America, 87(4):1738-1752, 1990.
[12] Wikipedia. Mel-frequency cepstrum. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[13] MathWorks. MATLAB - The Language of Technical Computing, 2012. [Online; accessed 20-May-2012].
[14] Wikipedia. MATLAB. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[15] MathWorks. MATLAB - Documentation, 2012. [Online; accessed 20-May-2012].
[16] GNU Octave. http://www.gnu.org/software/octave/, 2012. [Online; accessed 20-May-2012].
[17] Wikipedia. Maple (software). Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[18] Wikipedia. Mathematica. Wikipedia, The Free Encyclopedia, 2012. [Online; accessed 20-May-2012].
[19] Wolfram Research. Mathematica 8 Documentation Center, 2012. [Online; accessed 20-May-2012].
[20] Wolfram Research. Wolfram Demonstrations Project. http://demonstrations.wolfram.com/, 2012. [Online; accessed 20-May-2012].
