
Nadine E. Miner
Sandia National Laboratories
Albuquerque, NM
Thomas P. Caudell
Department of Electrical and
Computer Engineering
University of New Mexico

A Wavelet Synthesis Technique for Creating Realistic Virtual Environment Sounds

Abstract
This paper describes a new technique for synthesizing realistic sounds for virtual
environments. The four-phase technique described uses wavelet analysis to create a
sound model. Parameters are extracted from the model to provide dynamic sound
synthesis control from a virtual environment simulation. Sounds can be synthesized
in real time using the fast inverse wavelet transform. Perceptual experiment validation is an integral part of the model development process. This paper describes the
four-phase process for creating the parameterized sound models. Several developed
models and perceptual experiments for validating the sound synthesis veracity are
described. The developed models and results demonstrate proof of the concept
and illustrate the potential of this approach.

Presence, Vol. 11, No. 5, October 2002, 493–507

© 2002 by the Massachusetts Institute of Technology

1 Introduction

Stochastic, nonpitched sounds fill our real-world environment. Here, stochastic sounds are defined as nondeterministic, randomly varying sounds.
Many nonpitched, stochastic sounds have a characteristically identifiable structure. For example, the sound of rain has a characteristic structure that makes it
easily identifiable as rain and easily distinguishable from random noise. Humans almost continuously hear stochastic sounds such as wind, rain, motor
sounds, and different types of impact sounds. Because of their prevalence in
real-world environments, it is important to include these types of sounds in
realistic virtual environment simulations.
Most current virtual reality (VR) systems use digitized sounds rather than
synthesized sounds. Digitized sounds are static and do not dynamically change
in response to user actions or to changes within a virtual environment. Creating an acoustically rich virtual environment can require thousands of sounds
and their variations. Using a digitized sound approach requires switching
between static sounds. Furthermore, obtaining a digitized, application-specific
sound sequence is difficult and often impractical (Miner, 1994). The alternative to using digitized sound is to use sound synthesis. Although sound synthesis may be preferred, essentially no virtual sound systems are available today
that provide flexible, real-time sound synthesis tools for the virtual world
builder. The approach described in this paper is a step towards filling this void.
The synthesis technique provides a method for creating flexible, dynamic
sound models that yield a variety of sounds and increase the richness and realism of a virtual experience.

The overall goal of this research is to develop methods for synthesizing perceptually compelling sounds for
virtual environments. The perceptual believability of a
synthesized sound is the ultimate test of success. One
advantage of this approach is that perceptually convincing sounds need not be mathematically precise. Creating physically accurate simulations of complex sounds is
computationally intensive. It is anticipated that synthesis
of perceptually convincing sounds will be less so because
evaluation of complex physics equations is not required.
This research develops some new parameterized models to synthesize sounds. A parameterized model is one
in which changing parametric values prior to simulation
results in a new synthesized sound. There are two reasons for choosing a parameterized model approach.
First, parameterization provides the possibility of obtaining a variety of sounds from a single model. (For
example, one parameterized rain model might generate
the sound of light rain, medium rain, heavy rain, and
the sound of a waterfall.) The second reason is to create
dynamic sound models: manipulating the sound model
parameters in real time can yield a dynamically changing
sound. With the rain model example, changing the parameters as the virtual simulation evolves allows the rain
sound to progressively and dynamically increase in intensity as the graphics simulation shows increasing and
darkening clouds. Overall, model parameterization provides flexibility and dynamic control such that a variety
of sounds result from a small model set.
The synthesis method described uses wavelets for
modeling non-pitched stochastic-based sounds. It is
likely that this method will be equally successful in synthesizing pitched sounds. Wavelet analysis provides an
efficient method of extracting model parameters. Parameter modification and sound synthesis can be accomplished in real time in parallel with a virtual environment simulation. Overall, wavelets are highly
appropriate for modeling real-world sounds and providing real-time sound synthesis.
This paper describes a four-phase model development
and sound synthesis process. Three perceptual experiments were conducted to validate the sound synthesis
veracity (that is, perceptual quality). These experiments
and results are briefly described here, although a more extensive description of the experiments is contained in Miner, Goldsmith, and Caudell (2002). Experimental
results indicated that the synthesized sounds are perceptually convincing to human listeners. Finally, this paper
describes several different parameterized stochastic
sound models developed to demonstrate the functionality and potential of this sound synthesis approach.

1.1 Related Work


Some related work in synthesizing real-world
sounds using dynamic, parameterized models exists in
the literature. Gaver (1994), in developing a sound interface for human computer interaction, proposed
some physical-like models for real-world sounds. Gaver
implemented parameterized models for impact, scraping, breaking, and bouncing sounds. The synthesis
algorithms succeeded in creating parameterized sounds
in real time; however, the results were somewhat
cartoon-like and required training to interpret. Van den
Doel and Pai (1997) proposed a general framework for
producing impact sounds for virtual environments. They
used physical modeling of the vibration dynamics of
physical bodies. The models were parameterized based
on the material and object shape, and the collision force
and location. Prototype sound simulations produced
realistic impact sounds. Smith (1992) used a digital
waveguide method for developing physical models of
string, wind, and brass instruments. This method yields
excellent-quality music synthesis, and some high-end
synthesizer keyboards are based on this technology. In
recent work by Cook (1997), the synthesis of percussive
sounds was explored. Cook introduced the PhiSAM
(Physically Informed Spectral Additive Modeling) approach, which is based on the spectral modeling technique of modal synthesis but with controls and parameters that have physical meaning. The PhiSAM approach
yields perceptually convincing, real-time synthesis of a
variety of sounds.
A major difference between these methods and the
one proposed here is the emphasis on the importance of
modeling the stochastic sound components. Serra
(1989) showed that incorporating stochastic components in a sound model results in sound simulations with more realism. The wavelet-based modeling approach presented here focuses on capturing and modeling the stochastic components of sounds. The result is
realistic sound synthesis of real-world sounds.

2 Sound Synthesis Using Wavelets

The sound synthesis method described here uses wavelets for modeling stochastic-based sounds (Miner,
1998a). The Fourier theorem is the basis of many signal
analysis techniques including the Fourier transform
(FT), the short-time Fourier transform (STFT) and,
more recently, the wavelet transform. The Fourier theorem states that all signals are composed of a combination of sine waves of varying frequency, amplitude, and
phase that may or may not change with time. The FT
technique breaks a signal up into its constituent sinusoidal components. The FT method is most useful when
considering stationary signals (that is, signals that do
not change over time). However, most real-world signals are not stationary in time. An FT variation that captures some time-varying information by analyzing signal
windows is the short-time Fourier transform. The
STFT captures the frequency information for different
sections of time, but the resolution is limited and fixed
by the choice of window size. The wavelet transform
(WT) was selected for this work because the FT and
STFT methods do not adequately model the time-varying nature of real-world signals. Wavelet analysis
provides a time-based windowing technique with variable-sized windows. Wavelets examine the high-frequency content of a signal with a narrow time window and the low-frequency content with a wide time
window. Fast wavelet algorithms provide the potential
for synthesizing wavelet-modeled sounds in real time.
The fast wavelet algorithms are comparable in terms of
compute time to the fast Fourier transform algorithms
according to Ogden (1996).

2.1 Background on Wavelets


Wavelet analysis is a logical approach for analysis
of time-varying, real-world signals. As with the FT and

STFT methods, wavelet analysis consists of signal decomposition (wavelet transform) and reconstruction
(inverse wavelet transform) phases.
Alfred Haar (1910) is credited with the first use of a
wavelet, the Haar wavelet, although the term wavelet
was not coined until Morlet used it in his signal-processing work of 1983. Esteban and Galand (1977) in
their subband coding research proposed a nonaliasing
filtering scheme. With this scheme, signals are filtered
into low- and high-frequency components with a pair of
filters. The filters are mirror images with respect to the
middle, or quadrature, frequency, π/2 (Strang &
Nguyen, 1996). Filters chosen according to this
scheme are called quadrature mirror filters (QMFs) or
conjugate quadrature filters (CQFs). Wavelet functions
developed with QMFs provide exact signal reconstruction.
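As a small illustration of this mirror relationship, the following minimal sketch (an assumption on our part: it uses the Matlab Wavelet Toolbox functions that also appear later in this paper) extracts the decomposition filter pair for a Daubechies wavelet and compares their magnitude responses, which reflect each other about the quadrature frequency π/2.

    % Minimal sketch (assumes the Matlab Wavelet Toolbox): the decomposition
    % low-pass and high-pass filters of an orthonormal wavelet form a QMF pair.
    [Lo_D, Hi_D] = wfilters('db4', 'd');     % decomposition filter pair for db4
    H_lo = abs(fft(Lo_D, 512));              % low-pass magnitude response
    H_hi = abs(fft(Hi_D, 512));              % high-pass magnitude response
    w = linspace(0, pi, 256);                % frequency axis from 0 to pi
    plot(w, H_lo(1:256), w, H_hi(1:256));    % the two responses mirror about pi/2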
Stromberg (1982) is often credited with the development of the first orthonormal wavelets. However, the
system introduced by Yves Meyer in 1985 received
more recognition and became known as the Meyer basis
(Meyer, 1993). Orthonormal wavelet functions provide
a means for the efficient decomposition of a signal.
These functions ultimately define a specific set of filters
for signal decomposition and reconstruction and provide for real-time wavelet synthesis.
Ingrid Daubechies (1988) constructed wavelet bases
with compact support, meaning that the wavelets are
nonzero on an interval of finite length (as opposed to
the infinite interval length of the FT's sine and cosine
basis functions). Compactly supported wavelet families
accomplish signal decomposition and reconstruction
using only finite impulse response (FIR) filters. This
development made the discrete-time wavelet transform
a reality. Stephane Mallat (1989) proposed the fast
wavelet transform (FWT) algorithm for the computation of wavelets in 1987. This technique is unified with
other noise reduction techniques through the concepts
of multiresolution analysis (MRA), which is based on
the concept that objects can be examined using varying
levels of resolution. Cohen & Ryan (1995) provided a
more complete mathematical description of MRA and
the associated properties.


Figure 1. Illustration of wavelet transform steps to calculate wavelet coefficients, Dij (example uses Daubechies 4 (db4) wavelet type).

2.2 The Wavelet Transform


The wavelet transform decomposes a signal into
wavelet coefficients through a series of filtering operations. The wavelet transform is similar to the STFT in
that both techniques analyze an input signal in sections
by translation of an analysis function. With the STFT,
the analysis function is a window; the window is translated in time but is not otherwise modified. The wavelet
approach replaces the STFT window with a wavelet
function, ψ. The wavelet function is scaled (or expanded
or dilated) in addition to being translated in time. The
ψ is often called a mother wavelet because it gives
birth to a family of wavelets through the dilations and
translations.
A wavelet is not necessarily symmetric, but, for perfect reconstruction to be possible, it does satisfy
∫ψ(x)dx = 0. Other properties of the wavelets used in
the sound synthesis approach presented in this paper are
orthonormality and compact support. Wavelet families
that satisfy these conditions are the Daubechies wavelets
(often denoted by dbN, where N is the wavelet order),
Symlet wavelets (symN ), and Coiflet wavelets (coif N ).
The sound synthesis method proposed here uses the
Daubechies wavelets, although other wavelet families
may prove equally viable. The choice of wavelet type is
highly application specific.
Figure 1 graphically illustrates the wavelet transform steps on an arbitrary signal using the Daubechies 4 (db4) wavelet type. First, the wavelet is compared
against an input signal section. A measure of the goodness of fit between the wavelet and the input signal is
captured in a wavelet coefficient (indicated by Dij in
figure 1). Large coefficients indicate a good fit. Next,
the wavelet is shifted (or translated ) in time and the
comparison operation is repeated, resulting in another
wavelet coefficient. This translation and comparison
process is repeated for the duration of the input signal.
All of these wavelet coefficients are considered to be on
the same level. Stretching (or scaling) the wavelet and
repeating the series of comparison and translation operations create subsequent levels of wavelet coefficients.
The result is a set of wavelet coefficients (referred to as
detail and approximation coefficients) that completely
describe the input signal.
When moving between levels, the wavelet is most
commonly scaled or stretched by a factor of 2. Thus,
scaling is also known as dilation. The scale parameter, a,
indicates the analysis level. Small values of a provide a
local, fine-grain, or high-frequency analysis, whereas
large values correspond to large-scale, coarse-grain, or
low-frequency analysis. Translation is often referred to
by the b parameter, which moves the time localization
center of each wavelet; thus, each ψa,b(x) is localized
around x = b.
Two functions are used in a wavelet analysis: the
wavelet function and the scaling function. The wavelet
and scaling functions are orthogonal to each other. The
wavelet function creates a high-pass filter (gk) that provides the detail coefficients; the scaling function has
low-frequency oscillations and is used to create a low-pass filter (hk) to provide the approximation coefficients.
The wavelet and scaling filters are quadrature mirror
filters (QMF), and this makes perfect signal reconstruction possible.
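As a concrete check of this pairing, the minimal sketch below (assuming the Matlab dwt/idwt functions discussed in section 3; the random signal is only a stand-in for a digitized sound) performs one level of decomposition and reconstruction and confirms that the QMF filter pair recovers the input.

    x = randn(1, 1024);                    % stand-in for one frame of a digitized sound
    [cA1, cD1] = dwt(x, 'db4');            % approximation (low-pass) and detail (high-pass) coefficients
    xr = idwt(cA1, cD1, 'db4');            % single-level inverse transform
    err = max(abs(xr(1:numel(x)) - x));    % near machine precision: perfect reconstruction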
Many additional sources are available to provide more
details of wavelet analysis. Misiti et al. (1996) provided
a high-level treatment of wavelets. More formal mathematical treatments of wavelets can be found in Daubechies (1992), Cohen & Ryan (1995), Ogden (1996),
and Meyer (1993). An introductory tutorial on wavelet
analysis is available on the Web (Miner, 1998b).


Figure 2. Four-phase process for sound synthesis model development.

3 Development of Sound Synthesis Models

Development of the wavelet sound model is accomplished through a four-phase process (as shown in
figure 2): analysis, parameterization, synthesis, and validation.

3.1 Analysis Phase


The analysis phase begins with a digitized sound
sample. The digitized sound representation can be obtained in a number of different ways including digitally
recording a real-world sound using a DAT recorder or
computer; digitizing an analog recording of a sound;
obtaining a digitized sound from a sound effects library,
CD, or from the Internet; and accepting the digitized
representation from a computer simulation of a physical
event.
Next, a wavelet type (ψ) and scaling function (φ) are
selected for the decomposition process. The best choice
of wavelet is one that decomposes the salient signal features most effectively. However, identifying the salient
signal features is difficult because these features vary
from signal to signal and application to application. A

generalized algorithm for determining the best wavelet


for decomposition does not exist. We propose examining the original digitized sound signals at different scales
(that is, time domain expansion and contraction) to determine the best wavelet type (in terms of shape similarity between the wavelet type and various sound characteristics at different levels). Wavelet type selection is
largely an iterative process based on how well the original signal can be resynthesized. The models presented
here used Daubechies wavelet types db4, db5 and db6
and corresponding scaling functions from the standard
Daubechies family. Choosing a wavelet type from this
family is a safe first choice for any signal decomposition.
Once the wavelet and corresponding scale function
are selected, the original digitized sound is decomposed
using the discrete wavelet transform (DWT). The two
wavelet coefficient sets resulting from the first level of
decomposition are referred to as approximation coefficients in vector A1 and detail coefficients in vector D1
(that is, Aj and Dj, where j = level). In a multilevel decomposition, the approximation coefficients are decomposed into coarser-grained coefficient vectors by recursing on the decomposition algorithm. Each coefficient
vector serves as the input to successive wavelet decomposition stages. The second-level coefficients are denoted by AA2 and DA2. Given that the original signal
has length N, the DWT consists of log2(N) stages at
most. The result of the DWT is a set of approximation
and detail coefficients that contain all of the timevarying frequency information of the original signal.
These coefficients become the parameters that control
the sound synthesis.
Multiple levels of decomposition provide access to
different sound frequency components. The choice of
decomposition level for developing sound synthesis
models is largely iterative as parameterization and validation experiments serve to refine the selection. We decomposed to level 5 for the models presented here.
All wavelet operations for this research, including decomposition and reconstruction, were performed using
Matlab. Software systems that support wavelet operations typically contain a single-level or multilevel decomposition function. In Matlab, these functions are dwt and wavedec, respectively, and inputs to these functions include an input signal vector, the desired decomposition level, and the wavelet type. The function output is a set of coefficient vectors and corresponding
vector lengths. For example, in Matlab, the signal f
is decomposed to level three using the Daubechies
wavelet type db2 with the command: [C,L] = wavedec(f,3,'db2'). The resulting four wavelet coefficient
groups are contained in the vector C, namely the approximation coefficients, AAA3, followed by the detail
coefficients DAA3, DA2, and D1. The length of each
wavelet coefficient group is maintained in the L vector.
The wavelet coefficients become the inputs for the parameterization phase as described in subsection 3.2.
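A minimal sketch of this analysis phase is given below. It assumes the Matlab Wavelet Toolbox routines named above plus a modern audio-file reader; the file name is purely illustrative and not from the original study.

    [f, fs] = audioread('rain_base.wav');   % hypothetical 22,050 Hz, 16-bit base sound
    f = f(:, 1)';                           % use one channel, as a row vector
    [C, L] = wavedec(f, 5, 'db4');          % level-5 decomposition with the db4 wavelet
    A5 = appcoef(C, L, 'db4', 5);           % level-5 approximation coefficients (low frequency)
    D1 = detcoef(C, L, 1);                  % level-1 detail coefficients (high frequency)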

Figure 3. Illustration of a perceptual sound space. Circles indicate extent of synthetic sounds possible from parameterized models.

3.2 Parameterization Phase


The second phase of the model development process is parameterization, which entails determining
groups of wavelet coefficients and specific modifications
to their values to provide perceptually convincing sound
synthesis. The wavelet decomposition coefficients are
the source of the parameters for the sound synthesis
model. Depending on the level of decomposition, essentially unlimited control in amplitude, time, and frequency is available; however, the parameters are not
directly related to the physical characteristics of the
sound source, as is the case with other parametric approaches (such as Van den Doel and Pai (1997)). Determining the sound model parameterization is largely
an iterative process. For example, increasing the low-frequency content of a model can result in the perception of a larger sound source having generated the
sound. Manipulating the low-frequency and high-frequency coefficients (or parameters) of an engine
model turns the sound of a standard-sized car engine
into the sound of a large truck or a small toy car, respectively. For reconstruction, the sound is synthesized using the modified wavelet coefficients, and the result is
perceptually analyzed. The process iterates by determining what additional parameter manipulations are required to obtain the desired sound. If more high-frequency information is required, the detail coefficients
are modified further. The cycle of parameter modifications, synthesis, and evaluation continues until a clear definition of parameterization and coefficient manipulation is established for changing the original sound into a
variety of new sounds.
Parameter manipulations perceptually modify the synthesized sound. A perceptual sound space diagram, as
depicted in figure 3, represents the effect of parameter
manipulations on three base sounds. The axes represent
perceptual sound dimensions. These dimensions are the
perceived result of changes in the model parameters.
Each circle represents a variety of perceived sounds
achievable from individual wavelet models. The center
of each circle represents the original digitized sound
(base sound) from which the model was developed. Parameter manipulation extends the sound perception into
many dimensions. It is feasible to move from one type
of sound to another by changing the parameter settings
as indicated in the figure by the overlapping circles. For
example, manipulating the rain model parameters creates sounds that include light rain, medium rain, heavy rain, a small waterfall, and some motor
sounds.
We have examined three different types of parameter
manipulation methods: magnitude scaling of coefficient
groups to emphasize or de-emphasize certain frequency
regions, scaling filter manipulations to frequency shift the original signal, and envelope manipulations to alter the amplitude, onset, offset, and duration of the sound.
These parameterization methods produce compelling
variations of the original sound. Other parameterization
techniques and manipulations may increase the synthesis
potential of a model by producing a greater variety of
sounds.
Magnitude scaling provides a straightforward way of
changing the frequency content of a sound. For example, a large sound source, such as an airplane engine,
will have large approximation coefficients (Aj ), indicating a significant low-frequency contribution. The airplane engine sound can be converted into a car sound
by de-emphasizing the approximation coefficients and
enhancing the high-frequency detail coefficients (Dj ).
Various scaling techniques can be applied to wavelet
coefficient groups to achieve different effects. One manipulation is to multiply or divide a coefficient group by
a scalar. This simple manipulation is powerful and effective. In fact, all of the magnitude manipulation operations for the models described in this paper are simply
multiplying different groups of wavelet coefficients by
scalar values, as described in subsection 4.1. Different
combinations of coefficient manipulations result in a
variety of perceptually related sounds. Manipulations
that are more complex involve filtering coefficient
groups by static or dynamic functions. The desired perceptual result determines the filter structures. Overall,
the magnitude scaling method provides a means of creating an assortment of sounds by manipulating wavelet
coefficient groups.
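A minimal sketch of this magnitude-scaling parameterization is shown below, reusing the [C, L] vectors from a level-5 db4 decomposition as in the analysis-phase sketch; the particular factors (8 and 4) are illustrative values drawn from the ranges listed later in table 1.

    C_mod = C;                                         % copy of the wavedec coefficient vector
    nD1 = L(end-1);                                    % number of level-1 detail coefficients
    C_mod(end-nD1+1:end) = 8 * C_mod(end-nD1+1:end);   % emphasize high frequencies (D1 * 8)
    nA5 = L(1);                                        % number of level-5 approximation coefficients
    % C_mod(1:nA5) = 4 * C_mod(1:nA5);                 % alternatively, emphasize low frequencies (A5 * 4)
    y = waverec(C_mod, L, 'db4');                      % resynthesize the parameterized sound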
The second type of parameter manipulation is modifying scaling filter parameters. Scaling filter manipulations can shift the sound in frequency without changing
the frequency contributions. Combining this method
with magnitude scaling provides frequency shifting and
frequency emphasis or de-emphasis. The scaling filter
is used to compute the decomposition and reconstruction filters.
By stretching or compressing the scaling filter prior to
calculation of the reconstruction filters, the original signal frequency content is shifted down or up, respectively. Scaling filter manipulations can change the sound
of a brook to the sound of a large, slow-moving river (stretching the scaling filter) or to the sound of a rapidly moving stream (compressing the scaling filter). The scaling filter manipulation method involves five steps:
1. Decompose the original signal using a wavelet
with scale filter support.
2. Obtain the scaling filter, S, associated with the
wavelet.
3. Extract the standard reconstruction scaling filters
from the wavelet so that they can be modified.
4. Perform compression or expansion operations on
the reconstruction filters.
5. Reconstruct the signal using the modified reconstruction filter.
The compression/expansion operations (step 4) can be
accomplished with a number of different methods such
as linear interpolation or cubic-spline interpolation, followed by resampling. Through laboratory experimentation, cubic-spline interpolation was found superior to
linear interpolation in terms of maintaining the perceptual quality of the original sound. Cubic-spline interpolation fits a
third-degree polynomial between every two points and
yields a smoother sound than does linear interpolation.
Matlab contains all the functions necessary to complete
these steps. The models described in section 4 demonstrate the variety of sounds created by scaling filter manipulations.
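A minimal sketch of the five steps is shown below, assuming a db6 decomposition and the Matlab routines already mentioned; the stretch from the 12-point db6 filter to 24 points matches one of the settings exercised in section 4. Choosing a target length shorter than the original (for example, the six-point filters in table 1) compresses the filter instead and shifts the sound up in frequency.

    [C, L] = wavedec(f, 5, 'db6');                     % step 1: decompose with the db6 wavelet
    [Lo_D, Hi_D, Lo_R, Hi_R] = wfilters('db6');        % steps 2-3: 12-point reconstruction filters
    n = numel(Lo_R);                                   % original filter length
    m = 24;                                            % stretched length (shifts the sound down in frequency)
    Lo_mod = interp1(1:n, Lo_R, linspace(1, n, m), 'spline');   % step 4: cubic-spline resampling
    Hi_mod = interp1(1:n, Hi_R, linspace(1, n, m), 'spline');
    y = waverec(C, L, Lo_mod, Hi_mod);                 % step 5: reconstruct with the modified filters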
Two classes of envelope manipulations can be used
with the wavelet synthesis method. The first type of manipulation involves envelope filtering the wavelet coefficients prior to synthesis. This includes manipulations
discussed in the magnitude scaling approach, where the
envelope is a scalar filter. The envelope filter shape is
determined by the perceptual effect desired. For example, a Gaussian-shaped envelope can be applied to a
group, or groups, of wavelet coefficients, or across all
wavelet coefficients. Then, the filtered wavelet coefficients undergo the normal synthesis process. The result
is a synthesized sound that is derived from the original
sound, wherein the frequency region around which the
Gaussian envelope was centered is emphasized and the
surrounding frequency regions are de-emphasized. Any
envelope shape can be applied to the wavelet coefficients
including linear, nonlinear, quadratic, exponential, and
random filters, and filters derived from mathematical functions or from characteristic shapes of sounds. The wavelet operations of compression and denoising can be
grouped with this parameterization method. Envelopes
resulting in the compression of the number of wavelet
coefficients can be useful for saving on data storage
space and data transmission times. Compression and
denoising functions on the wavelet coefficients can yield
a variety of perceptually related sounds.
The second class of envelope manipulations imposes
time domain filtering operations on all, or part, of the
synthesized sound. These operations are applied to the
sound after synthesis. This type of sound processing is
commonly applied to digitized sound samples to achieve
a customized application sound. Time domain filtering
can alter the overall amplitude, onset, and/or offset
characteristics and duration. Time domain amplitude
filtering with a random characteristic can be applied to
the synthesized sound of rain to obtain a continuously
varying and natural sounding rainstorm. Combining the
time domain enveloping with wavelet parameter enveloping can enhance the naturalness of the synthesized
sound.
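The sketch below illustrates both classes of envelope on a decomposed sound (a minimal illustration; the Gaussian width, gain range, and frame rate are arbitrary choices, and [C, L] again come from a db4 decomposition).

    % (a) Wavelet-domain envelope: a Gaussian across the coefficient vector
    % emphasizes the frequency region (coefficient groups) under its peak.
    n   = numel(C);
    env = exp(-0.5 * ((1:n) - n/2).^2 / (n/6)^2);   % Gaussian centered mid-vector
    y   = waverec(C .* env, L, 'db4');              % synthesize from the enveloped coefficients
    % (b) Time-domain envelope: a slowly varying random gain applied after
    % synthesis gives a continuously varying, more natural-sounding result.
    g = 0.75 + 0.5 * rand(1, ceil(numel(y)/2205));  % one gain value per 0.1 s at 22,050 Hz
    g = interp1(linspace(0, 1, numel(g)), g, linspace(0, 1, numel(y)));  % smooth to signal length
    y2 = y .* g;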

3.3 Synthesis Phase


The synthesis phase uses the inverse discrete wavelet transform (IDWT). The parameters, modified wavelet coefficients, are the inputs to the IDWT. The IDWT
starts with the modified coefficient vectors and constructs a signal by inverting the decomposition steps.
The first step convolves up-sampled versions of the
lowest-level coefficient vectors with high-pass and low-pass filters that are mirror reflections of the decomposition filters. Successively higher-level vectors are reconstructed by recursively iterating over the same process.
This continues for reconstruction of all coefficient vectors. The result is a new waveform containing the synthesized sound.
In Matlab, idwt performs single-level reconstruction
and waverec performs multilevel reconstruction. The
input to these functions is the coefficient vectors, the
vector lengths, and wavelet type. Users can supply the
reconstruction filters in lieu of the wavelet type. For
example, the signal is reconstructed from the coefficient

vector, C, and lengths vector, L, and Daubechies wavelet db2 with the command f = waverec(C,L,'db2').
The output from this function is the reconstructed signal, f . The reconstructed signal can be converted to a
standard audio file format (and sent to an audio output
device for playback), saved for later use, or transmitted
over a computer network for remote application.
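Continuing the db2 example, a minimal sketch of this output path might look as follows (it assumes current Matlab audio I/O; the original work on Matlab 4.2 used different I/O routines, and the file name is illustrative).

    f_syn = waverec(C, L, 'db2');                  % inverse DWT of the (modified) coefficient vector
    f_syn = f_syn / (max(abs(f_syn)) + eps);       % normalize to avoid clipping
    audiowrite('synthesized.wav', f_syn, 22050);   % save in a standard audio file format
    sound(f_syn, 22050);                           % or send directly to the audio output device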

3.4 Validation Phase


Validation is the final phase of the sound synthesis
process. Because the goal is to create perceptually convincing sounds, a rigorous mathematical proof of synthesis success is not feasible. Instead, success is examined by human judgments
of the perceptual sound imagery. During development,
the designer listens to the synthesized sound and decides if the desired aural imagery has been achieved. If
the goal has not been achieved, different parameter manipulations are implemented by returning to the parameterization phase. It is also reasonable to reanalyze the
original sound with a different wavelet decomposition if
the aural imagery is far off the mark. Successive iterations continue through the four-phase development
process until the desired audio result is obtained.
Formal validation of the sound models requires psychoacoustic experimentation. A variety of experiments is possible. Bonebright, Miner, Goldsmith, and
Caudell (1998) presented a test battery for validating
sound veracity, although these experiments have not
been accepted as a standard. For this research, a set of
three psychoacoustic experiments was conducted: similarity rating, freeform identification, and context-based
rating. These studies provide validation of the aural imagery produced from the synthesized sound models.
The similarity rating experiment examines the relationships between sound models and provides information
about how models might be successfully expanded to
synthesize a broader range of sounds. The freeform
identification experiment evaluates the scope of the synthesized sounds by using the subjects' perceptual identification capability. The context-based rating experiment
provides metrics that indicate the sound synthesis success by comparing the synthesized sounds against human expectations. These experiments are generally useful for evaluating any type of sound synthesis. In
addition, the experiments can provide valuable cognitive
and perceptual information for psychoacoustic researchers. The experiments are briefly described here, and an
in-depth description of the experiments is provided by
Miner et al. (2002).
The similarity rating experiment examined the perceptual parameter space of sound synthesis models and
examined the effect of various parameter settings for
those models. Subjects rated the similarity between two
synthesized sounds on a five-point rating scale. Twenty-two subjects (seven men and fifteen women) participated. Twenty unique sound stimuli were used to create
190 sound pair combinations rated by the subjects. Two
techniques were used to analyze the data: multidimensional scaling (MDS) and Pathfinder analysis. The MDS
analysis provided evidence to show that manipulation of
sound model parameters changed the sound perception
in a predictable way. This is important for being able to
reliably control the sound synthesis from a virtual environment simulation system. The Pathfinder analysis revealed relationships within and across different sound
groups. This information is important for extending a
sound model synthesis capability to a broader class of
sounds. These results also proved useful for fine-tuning
sound synthesis models.
The similarity rating experiment provided a tool for
examining the sound stimuli relatedness without imposing experimenter bias. However, this experiment did
not reveal the perceptual extent of aural images that
could be synthesized with the wavelet models, nor did it
provide a metric for the sound synthesis quality. The
next two studies were designed to provide this information.
The second experiment was a freeform identification
experiment used to examine the perceptual identification of the synthesized sounds without providing a context. This experiment answered the question "what aural image comes to mind when you listen to this
sound?" This is a freeform identification experiment
similar to that run by Ballas (1993) and Mynatt (1994).
The purpose of the experiment was two-fold. First, the
experiment tested whether the synthesized sound resembled the base sound (that is, the sound being synthesized) strongly enough to elicit a freeform identification without any verbal or visual context. Secondly, the
experiment identified perceptually related sound labels
that were not the base sound. These perceptually related
labels served to extend the synthesis domain for individual models. In this experiment, subjects listened to synthesized sounds and entered an identification description. Identification phrases included a noun and
descriptive adjectives. Thirty-five sound stimuli were
presented in random order to 22 subjects (seven men
and fifteen women). Results indicated that the synthesized sounds most frequently elicited the correct freeform response (correct in the sense that the response
matched the target sound being synthesized). Results
showed that a wide variety of perceptually convincing
sounds could be obtained by manipulating the model
parameters. Mechanically oriented labels emerged as the
high-frequency information in the synthesized sound
was increased. Sound labels indicating larger objects
emerged when the low-frequency content of the synthesized sound was increased, and this result showed that
manipulating the model parameters resulted in predictable changes in aural imagery.
The third experiment was a context-based rating experiment designed to provide a sound synthesis veracity
metric by asking subjects to rate the sound quality
within a verbal context. Phrases obtained from the freeform experiment were paired with synthesized sounds.
The phrases provide a perceptual context for the
sounds. Twenty-seven subjects (five men and 22
women) were asked to rate how well the phrases
matched the sounds they heard. Subjects rated 207 randomly presented sound and phrase label pairs on a five-point scale, with 1 = no match and 5 = perfect match.
Both digitized and synthesized sounds were included.
Results quantified free-form label responses thereby
providing an indication of label quality. Furthermore,
this experiment provided numeric information about
how the aural imagery changes as the model parameter
settings changed. Thus, this experiment numerically
validated the perceptual success of the parameter manipulations.
Examination of perceptual experiment results indicates whether design iteration is necessary. Iteration of the process refines the synthesis model to obtain the
desired perceptual characteristics. Reanalysis of the
model involves iterating through the process starting
either with phase 1 (a new wavelet analysis) or phase 2
(parameterization).
For more information on the experimental results,
refer to Miner et al. (2002). Section 4 describes several
example sound synthesis models. We also present metric
values for the rain sound model as an example of the
experimental results obtained.

4 Example Sound Synthesis Models

This section describes four example sound models that have been developed using this four-phase process.
A high-level summary of some of the experimental results is also included to illustrate the effectiveness of
the synthesis method.

4.1 General Model Development Details
The equipment used to develop the models and
run the experiments was a Sun Sparc Server 20 host
computer interfaced with a Network Computing Devices (NCD) smart terminal (model MCX). The NCD
workstation contained an embedded soundboard to
allow playback of the synthesized sounds. Synthesized
sounds were listened to through both workstation
speakers and AKG K240 stereo headphones. MathWorks Matlab version 4.2 provided the wavelet decomposition, scaling filter extraction, and reconstruction
routines. Custom Matlab routines were developed for
sound signal input and output, parameterizations, and
other functions required for model development. All
Matlab functions were performed in non-real time, prior
to validation experiment execution. In practice, decomposition would be performed in non-real time to set up
the sound models. Reconstruction, or synthesis, would
be performed in real time with parameter values being
dictated by a VR simulation.
Model development for each of these examples began with a digitized base sound sample. The base sounds were digitized at a 22,050 Hz sample rate with 16-bit
resolution. The sounds were captured using a portable
digital audio tape (DAT) recorder and a studio-quality
microphone.
The wavelet type used for decomposition has a direct
effect on the sound synthesis results. Decomposition
with a relatively complex wavelet type (for example,
Daubechies 4, 5, or 6) can provide the basis for a powerful sound synthesis model. These example models created sounds resulting from decomposition with the
Daubechies 4 (db4) and Daubechies 6 (db6) wavelet
types with a decomposition level of 5. Level 5 decomposition was selected for these models because fewer
levels of decomposition produced overly dramatic
changes in the resulting synthesized sounds. Manipulations of finer levels of detail (that is, decomposition levels greater than 5) did not create perceptually significant
changes. These results were determined during the
model design process by iterative cycles through the
analysis, parameterization, synthesis, and perceptual validation steps. The choice of wavelet type and decomposition level is application specific.
To demonstrate the effect of varying model parameters, several parameterizations were applied to the base
sounds as described in table 1. The perceptual experiments described by Miner et al. (2002) used a subset of
these sounds to validate the synthesis method. Each row
in the table represents one sound model. The table arranges the sounds into five columns according to parameter setting type. The first column contains the
model name and represents the synthesized sound with
no parameter manipulations. The last four columns represent the coefficient groupings and parameter manipulations. The first two parameterizations (column 1 and 2
in table 1) were magnitude-scale operations on the level
1 detail (D1) and level 5 approximation (A5) coefficients
obtained from a wavelet decomposition using the db4
wavelet type. Scaling the D1 coefficients resulted in enhancing the high-frequency sound components. Scaling
the A5 coefficients resulted in enhancing the lowfrequency sound components. Initially, coefficient
groups were scaled by factors as large as 20 and 100.
For these sound model examples, scalings of this magnitude did not yield perceptually compelling sounds (that is, the sounds became unrecognizable). Scaling coefficient groups by less than a factor of 1.2 did not yield a perceptual change to the sound.

Table 1. Parameter Settings for Example Models. Original/Base Sound Plus Four Categories of Parameter Settings. Parameters 1 and 2 are the Scalar Values Applied to the Coefficient Groups (D1 and A5). Parameters 3 and 4 are the Length of the Modified Scaling Filter.

Original sound    Scale details (D1)             Scale approx. (A5)        Increase filter points    Decrease filter points
Rain              1.2, 2, 4, 5, 8, 10, 20, 100   2, 4, 5, 8, 10, 20, 100   14, 17, 20, 24            6, 7, 8, 9
Car motor         2, 4, 5, 8, 10, 20, 100        2, 4, 5, 8, 10, 20, 100   14, 17, 20, 24            6, 7, 8, 9
Footstep          2, 4, 8                        2, 4, 8                   14, 17, 20, 24            6, 7, 8, 9
Breaking glass    2, 4, 8                        2, 4, 8                   14, 17, 20, 24            6, 7, 8, 9
The next two parameterizations (columns 3 and 4 in
table 1) involved scaling filter manipulations of the Daubechies wavelet type 6 (db6) reconstruction function. The
db6 wavelet has a twelve-point reconstruction-scaling filter. One parameterization increased the number of points
in the reconstruction filter to stretch the filter (shift sound
down in frequency). The other parameterization decreased
the scaling filter length (compressed the filter) shifting the
sound frequency up. Each base sound had the same scaling
filter manipulations applied. Scale filter manipulations
ranged from halving the filter length (creating a
six-point filter) to doubling the filter length (creating a
24-point filter). The results were perceptually convincing
in some cases but not in others, depending on the base
sound characteristics.
To evaluate individual effects, parameterizations were
applied one at a time rather than in combination. In practice, combinations of magnitude scaling, scaling filter manipulations, and enveloping would be applied to wavelet
coefficient groups at various levels to create a powerful
sound model capable of producing thousands of sounds.

4.2 Four Example Parameterized Sound Models
This section contains sound model descriptions for
four example models. The perceptual results of the parameter manipulations (as indicated in table 1) are summarized. Two models (rain and car engine) are continuous stochastic sound models consisting primarily of nonpitched sound and having effectively infinite duration. Two models
(footsteps and glass breaking) are finite-duration sounds
defined as time-limited sounds whose onset and offset
characteristics significantly influence the sound perception. Raw average context-based rating data and standard deviations are provided for the rain sound model.
This data is provided to demonstrate the effects of specific parameter manipulations.
4.2.1 Rain. This model simulated the sound of
rain. The original digitized sound was that of rain hitting
concrete in an open-air environment. Parameter manipulations yielded the synthesis of light rain, medium rain, and
progressively heavier rain. The perception of increasing
wind accompanied the sound of increasing rain and conveyed the sense of a large rainstorm. Other perceptually
grouped sounds that emerged during the perceptual freeform identification experiment were bacon frying, machine
room sounds, a waterfall, a large fire, and applause.
Table 1 shows the parameter manipulations for this
model. Increasing the magnitude scaling of the detail
coefficient vectors (D1) resulted in the perception of
increasingly softer rain. Bacon frying, fire, and other
sounds were also perceived. Increasing the magnitude
scale of the approximation coefficients (A5) increased
the contribution of the lowest-frequency sound components, resulting in a deeper, more reverberant sound.
Thus, manipulating groups of coefficients (parameters)
increases the scope of the sounds generated by the
model.


Table 2. Perceptual Experiment Results for Rain Model. Includes Subset of Freeform Identification Labels, Average Context-Based Ratings, and Standard Deviations for Base (Original) Sound and Two Parameter Manipulations. (1 = no match, 5 = perfect match)

Sound stimuli group: Rain

Context labels              Base sound         Base w/D1*8        Base w/A5*4
                            Avg     Std dev    Avg     Std dev    Avg     Std dev
Hard rain                   4.30    1.20       2.11    1.42       4.63    0.79
Light drizzle of rain       2.63    1.42       4.26    1.06       1.63    0.88
Water running in a shower   3.85    1.32       3.67    1.21       2.67    1.41
Bacon frying                2.19    1.11       3.44    1.55       1.89    0.97
Small waterfall             3.07    1.21       2.63    1.33       2.96    1.26
Large waterfall             2.41    1.19       1.37    0.74       4.15    1.03

Table 2 supports these claims and contains a subset of


the results from two of the perceptual experiments. Column 2 contains some labels obtained for the rain model
during the freeform identification experiment. The last
columns contain average rating results and standard deviations from the context-based rating experiment for
the original (or base) sound and two parameterizations
(detail coefficients on level 1 (D1) scaled by a factor of 8
and approximation coefficients on level 5 (A5) scaled by
a factor of 4).
These results show how the perception of the sound
changes as the parameters are modified. For example,
increasing the A5 intensity increases the perception of
hard rain. This synthesized sound matched the label
Hard Rain with an average rating of 4.63 ± 0.79 out of 5. Increasing the D1 coefficient increases the perception of a
light drizzle of rain (average rating of 4.26 ± 1.06 out
of 5). Furthermore, these results show how a variety of
sounds are perceived from one model, and that changing the parameters affects the relative convincingness of
those sounds. For example, the rain base sound with
A5*4 simulates a convincing sound of a large waterfall
(4.15 ± 1.03) more so than does the base sound with
D1*8 (1.37 ± 0.74) and more so than does the base
sound without any modifications (2.41 ± 1.19).
Other parameter manipulations, as explained in subsection 4.1, included increasing or decreasing scaling filter length. Changing the scaling filter length upon
reconstruction had the effect of shifting the sound up or
down in frequency. For the rain model, the largest result of these manipulations was changing the perceived
size of the raindrops and the surface hardness. Shifting
the sound higher in frequency produced the sound of
smaller raindrops hitting a harder surface. The opposite
was true for shifting the sound down in frequency.
These results were observed during repeated iterations
through the four-phase development process and specifically during the validation phase as observed by human
listeners.
Applying denoising techniques to the rain model resulted in a synthesized sound similar to that of a car engine. The denoising technique is based on coefficient
thresholding; coefficients with values below the threshold are set to zero. Thus, the technique is akin to compression of the coefficient vectors. Denoising and compression have considerable promise in terms of adding
to the variety of sounds synthesized from individual
models and in saving on parameter storage and communication time.
4.2.2 Car Engine. This model simulated the
sounds of a car engine idling with parameter adjustments for different-sized cars, different types of machines, and different types of engines. The base sound for this model was that of a digitized mid-sized, four-cylinder engine idling in an open-air environment. Adjusting the parameters resulted in the perception of a
large diesel truck, a standard truck, a small car, and a
large car. Perceptual labels identified during the freeform experiment were different engine types, machinery, construction site machines, tractor, jackhammer,
drill, helicopter, and various-sized airplane engines.
Magnitude scaling and scale filter manipulations were
performed on this model. Increasing the magnitude of
the D1 coefficients increases the high-frequency sound
content, resulting in a smaller engine sound, such as a
lawn mower or toy car. Increasing the magnitude of the
A5 coefficients results in a smoother-sounding engine
because the high-frequency metallic sounds are
drowned out by the enhanced low-frequency components. The result is a smoother, larger-sounding engine
such as a helicopter or airplane. Decreasing the magnitude of the coefficient vectors had the inverse effect.
The scale manipulations were intended to create
consistent-sized car engine sounds but with different
RPM characteristics. This effect was not achieved, however. Instead, they shifted the car engine sound in frequency,
thus changing the perception of the sound type. High-frequency shifts resulted in a buzzing sound, reminiscent of a swarm of bees. Low-frequency shifts resulted
in large-engine sounds, such as that of an airplane. Perhaps a more uniform original base sound is required to
create the desired effect of RPM variation through scale
manipulation.
Thresholding the car motor model resulted in significant reduction of the coefficients and did not perceptually change the synthesized sound. Using Matlab's automatic global threshold function (threshold level of
1004) resulted in 50.83% of coefficients being set to
zero, but retained 99.1% of signal energy. Using this
significantly reduced set of coefficients did not have a
significant perceptual effect on the synthesized result.
This demonstrates how entire groups of coefficients may
be eliminated without changing the synthesized sound
perception, thereby dramatically reducing the number
of coefficients required. This is an example of the significant compression rates that may be possible for wavelet
sound models.
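A minimal sketch of this kind of global thresholding is given below; it assumes the standard Matlab Wavelet Toolbox compression utilities (the paper does not name the exact routine used), with perf0 and perfl2 corresponding to the percentage of zeroed coefficients and of retained signal energy quoted above.

    [C, L] = wavedec(f, 5, 'db4');                           % decompose the car-engine base sound
    [thr, sorh, keepapp] = ddencmp('cmp', 'wv', f);          % default global compression threshold
    [f_cmp, C_cmp, L_cmp, perf0, perfl2] = ...
        wdencmp('gbl', C, L, 'db4', 5, thr, sorh, keepapp);  % zero small coefficients and resynthesize
    fprintf('%.2f%% of coefficients zeroed, %.2f%% of energy retained\n', perf0, perfl2);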

4.2.3 Footsteps. This model simulated the


sound of footsteps on gravel. Parameter manipulations
resulted in the perception of footsteps on different material types such as dirt, snow, a hard concrete floor, or a
wood floor and of different weights of the person walking. Additional perceptually grouped sounds were chewing, crumbling paper, crushing or dropping various objects (from soft to hard objects), stomping of horse
hooves, stepping on leaves, lighting a gas grill, a lion's
roar, and gunfire.
Increasing the magnitude of the detail coefficients
resulted in the perception of decreasing the size of the
person stepping and creating a harder, less resonant surface. Further increasing the high-frequency components
changed the perception from a footstep to crumbling
paper and a fire. Increasing the low-frequency coefficients resulted in the perception of a footstep on a softer
surface, such as mud or fresh snow. The perceived
weight of the person also increased as approximation
scaling increased. Increasing the reconstruction filter
length shifted the sound down in frequency. The result
was similar to the sound of an explosion because the
high-frequency crackling was removed. The addition of
envelope manipulations to increase the starting signal
intensity and exponentially decay over time would increase the perception of an explosion.
4.2.4 Glass Breaking. This model simulated the
sound of breaking glass with parameter adjustments for the
glass thickness or density, the surface hardness on which
the glass is breaking, and the force of the impact. Exercising this model during perceptual experiments resulted in
responses of dropping a heavy glass on a wood floor,
throwing crystal against a concrete floor, breaking a window, breaking a plate, and keys falling to the floor.
Increasing the high-frequency detail coefficients resulted in the perception of decreasing glass thickness.
The higher the scale factor, the harder the impact surface seemed and the less resonant the sound. The
throwing velocity was perceived to increase as the detail
scale increased. Scaling low-frequency coefficients
achieved the inverse effect. The perceived glass thickness
increased as the scale factor increased, going from a
plate or cup to a heavy vase or window. The surface hardness decreased as the scale factor increased because
the surface resonance increased. Large approximation
scale factors gave the perception of a wooden surface.
Increasing the reconstruction filter length shifted the
sound down in frequency. The result was less like glass
breaking because of the lack of high-frequency components. Conversely, decreasing the filter length shifted
the sound to the high-frequency region. Decreasing
filter lengths resulted in decreasing the glass thickness
and increasing the surface hardness.

5 Future Extensions

One direction for future work is to merge several different models into a more generalized sound synthesis model. For example, merging the electric and car
motor models may yield a general motor model. This is
desirable because users would have in a single model a
variety of engine sounds, engine loads, RPMs, and so
on. Another example would be a general running water
model that could provide synthesis of rain, brooks, rivers, waterfalls, water from faucets, and more.
Real-time sound synthesis for this technique is possible. Completing the analysis and parameterization
phases in non-real time produces the parameterized
model. The parameter manipulation and synthesis
phases can be computed in real time in parallel with
graphical and environmental VR simulations, and real-time implementations of wavelet transforms are available
on many desktop platforms. Because this technique is all
software based, it is feasible that the sound synthesis can
be efficiently combined with three-dimensional sound
localization and offloaded to a parallel sound server.
This would create a software-only virtual sound system.
We are exploring compression of wavelet coefficients
further to enhance real-time performance.

6 Conclusions

We have described a four-phase development process for a new sound synthesis approach using wavelets.
The iterative nature of the process allows continuous model refinement according to perceptual sound quality results. The analysis and synthesis phases use the discrete wavelet transform and the inverse discrete wavelet
transform, respectively. The parameterization phase creates dynamic, flexible sound models that, when exercised, are capable of producing sounds with a variety of
perceptual qualities. We described three perceptual validation experiments using human subjects designed to
elucidate the perceived synthesized sounds and rate the
sound synthesis veracity. Several continuous and noncontinuous stochastic-based sound models have been
developed using this method, including models for rain,
car engine, brook, glass breaking, and footstep sounds.
These models provide evidence of the validity of this
approach. Several steps are required before these sound
synthesis models are available to users, including further
model and parameterization development, real-time
implementation, development of an intuitive user interface, and integration with virtual reality simulation systems.

Acknowledgments
Sandia National Laboratories supported this work under its
Doctoral Study Program. We thank the reviewers who provided helpful comments, and we also thank the experiment
volunteers.

References
Ballas, J. (1993). Common factors in the identification of an assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance, 19(2), 250–267.
Bonebright, T., Miner, N., Goldsmith, T., & Caudell, T. (1998). Data collection and analysis techniques for evaluating the perceptual qualities of auditory stimuli. Proceedings of the International Conference on Auditory Displays. Available online at www.icad.org/websiteV2.0/Conferences/ICAD98/icad98programme.html
Cohen, A., & Ryan, R. D. (1995). Wavelets and multiscale
signal processing. London: Chapman & Hall.
Cook, P. (1997). Physically informed sonic modeling (PhISM): Synthesis of percussive sounds. Computer Music Journal, 21(3), 38–49.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications in Pure and Applied Mathematics, 41, 909–996.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
Esteban, D., & Galand, C. (1977). Applications of quadrature mirror filters to split-band voice coding schemes. Proceedings of the IEEE International Conference on Acoustic Signals and Speech Processing, 191–195.
Gaver, W. (1994). Using and creating auditory icons. In G. Kramer (Ed.), Auditory display: Sonification, audification, and auditory interfaces, proc. vol. XVIII (pp. 417–446). Reading, MA: Addison-Wesley.
Haar, A. (1910). Zur Theorie der orthogonalen Funktionensysteme. Math. Ann., 69, 331–371.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11, 674–693.
Meyer, Y. (1993). Wavelets: Algorithms and applications. Philadelphia: SIAM.
Miner, N. E. (1994). Using voice input and audio feedback to enhance the reality of a virtual experience. Proceedings of the 1994 IMAGE Conference, 337–343.
Miner, N. E. (1998a). Creating wavelet-based models for real-time synthesis of perceptually convincing environmental sounds. Doctoral dissertation, University of New Mexico.
Miner, N. E. (1998b). An introduction to wavelet theory and analysis. Sandia National Laboratories technical report no. SAND98-2265.

Miner, N. E., Goldsmith, T. E., & Caudell, T. P. (2002). Perceptual validation experiments for evaluating the quality of wavelet synthesized sounds. Presence: Teleoperators and Virtual Environments, 11(5), 508–524.
Misiti, M., Misiti, Y., Oppenheim, G., & Poggi, J. (1996).
Wavelet toolbox for use with Matlab. MathWorks, Inc.
Mynatt, E. D. (1994). Designing with auditory icons. Proceedings of the Second International Conference on Auditory Display (ICAD), 109–119.
Ogden, R. (1996). Essential wavelets for statistical applications
& data analysis. Boston: Birkhauser.
Serra, X. (1989). A system for sound analysis/transformation/
synthesis based on a deterministic plus stochastic decomposition.
Doctoral dissertation, Stanford University, and CCRMA
report no. STAN-M-58.
Smith, J. O. (1992). Physical modeling using digital
waveguides. Computer Music Journal, 16(4), 74–87.
Strang, G., & Nguyen, T. (1996). Wavelets and filter banks.
Wellesley, MA: Wellesley-Cambridge Press.
Stromberg, J. O. (1982). A modified Franklin system and higher order spline systems on Rⁿ as unconditional bases for Hardy spaces. In A. Beckner (Ed.), Conference in honor of A. Zygmund, vol. II (pp. 475–493). Monterey, CA:
Wadsworth Mathematics Series.
Van den Doel, K., & Pai, D. K. (1996). Synthesis of shape
dependent sounds with physical modeling. Proceedings of the
International Conference on Auditory Displays. Available online at www.icad.org/websiteV2.0/Conferences/ICAD96/Proc96/dendoel.htm
