Miner
Sandia National Laboratories
Albuquerque, NM
Thomas P. Caudell
Department of Electrical and
Computer Engineering
University of New Mexico
Abstract
This paper describes a new technique for synthesizing realistic sounds for virtual
environments. The four-phase technique described uses wavelet analysis to create a
sound model. Parameters are extracted from the model to provide dynamic sound
synthesis control from a virtual environment simulation. Sounds can be synthesized
in real time using the fast inverse wavelet transform. Perceptual experiment validation is an integral part of the model development process. This paper describes the
four-phase process for creating the parameterized sound models. Several developed
models and perceptual experiments for validating the sound synthesis veracity are
described. The developed models and results demonstrate proof of the concept
and illustrate the potential of this approach.
Introduction
The overall goal of this research is to develop methods for synthesizing perceptually compelling sounds for
virtual environments. The perceptual believability of a
synthesized sound is the ultimate test of success. One
advantage of this approach is that perceptually convincing sounds need not be mathematically precise. Creating physically accurate simulations of complex sounds is
computationally intensive. It is anticipated that synthesis
of perceptually convincing sounds will be less so because
evaluation of complex physics equations is not required.
This research develops some new parameterized models to synthesize sounds. A parameterized model is one
in which changing parametric values prior to simulation
results in a new synthesized sound. There are two reasons for choosing a parameterized model approach.
First, parameterization provides the possibility of obtaining a variety of sounds from a single model. (For
example, one parameterized rain model might generate
the sound of light rain, medium rain, heavy rain, and
the sound of a waterfall.) The second reason is to create
dynamic sound models: manipulating the sound model
parameters in real time can yield a dynamically changing
sound. With the rain model example, changing the parameters as the virtual simulation evolves allows the rain
sound to progressively and dynamically increase in intensity as the graphics simulation shows increasing and
darkening clouds. Overall, model parameterization provides flexibility and dynamic control such that a variety
of sounds result from a small model set.
The synthesis method described uses wavelets for
modeling non-pitched stochastic-based sounds. It is
likely that this method will be equally successful in synthesizing pitched sounds. Wavelet analysis provides an
efficient method of extracting model parameters. Parameter modification and sound synthesis can be accomplished in real time in parallel with a virtual environment simulation. Overall, wavelets are highly
appropriate for modeling real-world sounds and providing real-time sound synthesis.
This paper describes a four-phase model development
and sound synthesis process. Three perceptual experiments were conducted to validate the sound synthesis
veracity (that is, perceptual quality). These experiments
and results are briefly described here; a more in-depth treatment is provided by Miner et al. (2002). The wavelet-based modeling approach presented here focuses on capturing and modeling the stochastic components of sounds. The result is realistic synthesis of real-world sounds.
As with short-time Fourier transform (STFT) methods, wavelet analysis consists of signal decomposition (wavelet transform) and reconstruction (inverse wavelet transform) phases.
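These two phases can be illustrated with a minimal sketch in Python. It uses the simple Haar wavelet rather than the db2 wavelet employed later in the paper, and an arbitrary example signal; it shows only that one analysis step splits a signal into approximation and detail coefficients, and that the synthesis step recovers the signal exactly:

```python
from math import sqrt

def haar_dwt(signal):
    """One decomposition level: split a signal into low-frequency
    approximation and high-frequency detail coefficients."""
    s = 1 / sqrt(2)
    pairs = list(zip(signal[::2], signal[1::2]))
    approx = [(a + b) * s for a, b in pairs]
    detail = [(a - b) * s for a, b in pairs]
    return approx, detail

def haar_idwt(approx, detail):
    """One reconstruction level: recombine the two coefficient
    groups into the original samples."""
    s = 1 / sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) * s, (a - d) * s])
    return signal

f = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]  # arbitrary example signal
A1, D1 = haar_dwt(f)
rec = haar_idwt(A1, D1)
print(all(abs(x - y) < 1e-12 for x, y in zip(f, rec)))  # True: exact reconstruction
```

The exact reconstruction demonstrated here is the property, guaranteed by quadrature mirror filter pairs, that makes wavelet synthesis faithful to the analyzed sound.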
Alfred Haar (1910) is credited with the first use of a
wavelet, the Haar wavelet, although the term wavelet
was not coined until Morlet used it in his signal-processing work of 1983. Esteban and Galand (1977) in
their subband coding research proposed a nonaliasing
filtering scheme. With this scheme, signals are filtered
into low- and high-frequency components with a pair of
filters. The filters are mirror images with respect to the
middle, or quadrature, frequency, π/2 (Strang &
Nguyen, 1996). Filters chosen according to this
scheme are called quadrature mirror filters (QMFs) or
conjugate quadrature filters (CQFs). Wavelet functions
developed with QMFs provide exact signal reconstruction.
Stromberg (1982) is often credited with the development of the first orthonormal wavelets. However, the
system introduced by Yves Meyer in 1985 received
more recognition and became known as the Meyer basis
(Meyer, 1993). Orthonormal wavelet functions provide
a means for the efficient decomposition of a signal.
These functions ultimately define a specific set of filters
for signal decomposition and reconstruction and provide for real-time wavelet synthesis.
Ingrid Daubechies (1988) constructed wavelet bases
with compact support, meaning that the wavelets are
nonzero on an interval of finite length (as opposed to
the infinite interval length of the Fourier transform's sine and cosine basis functions). Compactly supported wavelet families
accomplish signal decomposition and reconstruction
using only finite impulse response (FIR) filters. This
development made the discrete-time wavelet transform
a reality. Stephane Mallat proposed the fast wavelet transform (FWT) algorithm for the computation of wavelets in 1987 (Mallat, 1989). This technique is unified with
other noise reduction techniques through the concepts
of multiresolution analysis (MRA), which is based on
the concept that objects can be examined using varying
levels of resolution. Cohen & Ryan (1995) provided a
more complete mathematical description of MRA and
the associated properties.
Development of the wavelet sound model is accomplished through a four-phase process (as shown in
figure 2): analysis, parameterization, synthesis, and validation.
The Matlab Wavelet Toolbox functions for single-level and multilevel decomposition are dwt and wavedec, respectively. Inputs to these functions include an input signal vector, the desired decomposition level, and the wavelet type. The function output is a set of coefficient vectors and corresponding vector lengths. For example, in Matlab, the signal f is decomposed to level three using the Daubechies wavelet type db2 with the command [C,L] = wavedec(f,3,'db2'). The resulting four wavelet coefficient groups are contained in the vector C, namely the approximation coefficients, A3, followed by the detail coefficients D3, D2, and D1. The length of each wavelet coefficient group is maintained in the L vector. The wavelet coefficients become the inputs for the parameterization phase as described in subsection 3.2.
The signal is reconstructed from the coefficient vector, C, the lengths vector, L, and the Daubechies wavelet db2 with the command f = waverec(C,L,'db2'). The output from this function is the reconstructed signal, f. The reconstructed signal can be converted to a
standard audio file format (and sent to an audio output
device for playback), saved for later use, or transmitted
over a computer network for remote application.
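The wavedec/waverec round trip described above can be sketched end to end in Python. This is a minimal analog using the Haar wavelet rather than db2, but it follows the same (C, L) convention: a flat coefficient vector C holding the groups [A_n, D_n, ..., D_1] and a lengths vector L for unpacking them:

```python
from math import sqrt

S = 1 / sqrt(2)  # normalization for the orthonormal Haar filter pair

def dwt(x):
    """One Haar analysis step: low-pass (approximation) and high-pass (detail)."""
    return ([(a + b) * S for a, b in zip(x[::2], x[1::2])],
            [(a - b) * S for a, b in zip(x[::2], x[1::2])])

def idwt(a, d):
    """One Haar synthesis step: merge approximation and detail into samples."""
    out = []
    for ai, di in zip(a, d):
        out += [(ai + di) * S, (ai - di) * S]
    return out

def wavedec(x, level):
    """Level-n decomposition returning (C, L) in the toolbox convention:
    C = [A_n, D_n, ..., D_1], L = lengths of each coefficient group."""
    details = []
    for _ in range(level):
        x, d = dwt(x)
        details.append(d)
    groups = [x] + details[::-1]
    return [c for g in groups for c in g], [len(g) for g in groups]

def waverec(C, L):
    """Reconstruct the signal from the (C, L) pair."""
    a, pos = C[:L[0]], L[0]
    for n in L[1:]:
        a = idwt(a, C[pos:pos + n])
        pos += n
    return a

f = [float(i % 5) for i in range(32)]  # arbitrary test signal
C, L = wavedec(f, 3)                   # coefficient groups: A3, D3, D2, D1
rec = waverec(C, L)
print(max(abs(x - y) for x, y in zip(f, rec)) < 1e-12)  # True
```

Because the reconstruction is a fixed filter-bank operation over the (C, L) pair, the same machinery runs unchanged whether the coefficients are the original analysis output or parameter-modified versions of it, which is what enables real-time resynthesis.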
The perceptual experiments validate the synthesized sounds against human expectations. These experiments are generally useful for evaluating any type of sound synthesis. In
addition, the experiments can provide valuable cognitive
and perceptual information for psychoacoustic researchers. The experiments are briefly described here, and an
in-depth description of the experiments is provided by
Miner et al. (2002).
The similarity rating experiment examined the perceptual parameter space of sound synthesis models and
examined the effect of various parameter settings for
those models. Subjects rated the similarity between two
synthesized sounds on a five-point rating scale. Twenty-two subjects (seven men and fifteen women) participated. Twenty unique sound stimuli were used to create
190 sound pair combinations rated by the subjects. Two
techniques were used to analyze the data: multidimensional scaling (MDS) and Pathfinder analysis. The MDS
analysis provided evidence to show that manipulation of
sound model parameters changed the sound perception
in a predictable way. This is important for being able to
reliably control the sound synthesis from a virtual environment simulation system. The Pathfinder analysis revealed relationships within and across different sound
groups. This information is important for extending a
sound model synthesis capability to a broader class of
sounds. These results also proved useful for fine-tuning
sound synthesis models.
The similarity rating experiment provided a tool for
examining the sound stimuli relatedness without imposing experimenter bias. However, this experiment did
not reveal the perceptual extent of aural images that
could be synthesized with the wavelet models, nor did it
provide a metric for the sound synthesis quality. The
next two studies were designed to provide this information.
The second experiment was a freeform identification
experiment used to examine the perceptual identification of the synthesized sounds without providing a context. This experiment answered the question, "What aural image comes to mind when you listen to this sound?" This is a freeform identification experiment similar to those run by Ballas (1993) and Mynatt (1994).
The purpose of the experiment was two-fold. First, the
experiment tested whether the synthesized sound re-
sembled the base sound (that is, the sound being synthesized) strongly enough to elicit a freeform identification without any verbal or visual context. Secondly, the
experiment identified perceptually related sound labels
that were not the base sound. These perceptually related
labels served to extend the synthesis domain for individual models. In this experiment, subjects listened to synthesized sounds and entered an identification description. Identification phrases included a noun and
descriptive adjectives. Thirty-five sound stimuli were
presented in random order to 22 subjects (seven men
and fifteen women). Results indicated that the synthesized sounds most frequently elicited the correct freeform response (correct in the sense that the response
matched the target sound being synthesized). Results
showed that a wide variety of perceptually convincing
sounds could be obtained by manipulating the model
parameters. Mechanically oriented labels emerged as the
high-frequency information in the synthesized sound
was increased. Sound labels indicating larger objects
emerged when the low-frequency content of the synthesized sound was increased, and this result showed that
manipulating the model parameters resulted in predictable changes in aural imagery.
The third experiment was a context-based rating experiment designed to provide a sound synthesis veracity
metric by asking subjects to rate the sound quality
within a verbal context. Phrases obtained from the freeform experiment were paired with synthesized sounds.
The phrases provide a perceptual context for the
sounds. Twenty-seven subjects (five men and twenty-two women) were asked to rate how well the phrases matched the sounds they heard. Subjects rated 207 randomly presented sound and phrase label pairs on a five-point scale, with 1 = no match and 5 = perfect match.
Both digitized and synthesized sounds were included.
Results quantified the freeform label responses, thereby
providing an indication of label quality. Furthermore,
this experiment provided numeric information about
how the aural imagery changes as the model parameter
settings changed. Thus, this experiment numerically
validated the perceptual success of the parameter manipulations.
Examination of perceptual experiment results indi-
Table 1. Parameter Settings for Example Models. Original/Base Sound Plus Four Categories of Parameter Settings. Parameters 1 and 2 are the Scalar Values Applied to the Coefficient Groups (D1 and A5). Parameters 3 and 4 are the Lengths of the Modified Scaling Filter
Original sound    Parameter settings
Rain              6, 7, 8, 9
Car motor         6, 7, 8, 9
Footstep          6, 7, 8, 9
Breaking glass    6, 7, 8, 9
Four example sound models are summarized. Two models (rain and car engine) are continuous stochastic sound models consisting primarily of nonpitched sound of indefinite duration. Two models (footsteps and glass breaking) are finite-duration sounds, defined as time-limited sounds whose onset and offset characteristics significantly influence the sound perception. Raw average context-based rating data and standard deviations are provided for the rain sound model.
This data is provided to demonstrate the effects of specific parameter manipulations.
4.2.1 Rain. This model simulated the sound of
rain. The original digitized sound was that of rain hitting
concrete in an open-air environment. Parameter manipulations yielded the synthesis of light rain, medium rain, and
progressively heavier rain. The perception of increasing
wind accompanied the sound of increasing rain and conveyed the sense of a large rainstorm. Other perceptually
grouped sounds that emerged during the perceptual freeform identification experiment were bacon frying, machine
room sounds, a waterfall, a large fire, and applause.
Table 1 shows the parameter manipulations for this
model. Increasing the magnitude scaling of the detail
coefficient vectors (D1) resulted in the perception of
increasingly softer rain. Bacon frying, fire, and other
sounds were also perceived. Increasing the magnitude
scale of the approximation coefficients (A5) increased
the contribution of the lowest-frequency sound components, resulting in a deeper, more reverberant sound.
Thus, manipulating groups of coefficients (parameters)
increases the scope of the sounds generated by the
model.
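This kind of parameter manipulation amounts to multiplying one coefficient group inside the flat (C, L) representation before resynthesis. The following sketch uses a hypothetical (C, L) pair with placeholder values (the real models use coefficients from an analyzed recording); only the group-indexing logic is the point:

```python
def scale_group(C, L, group_index, factor):
    """Scale one coefficient group within the flat coefficient vector C.
    Groups follow the [A_n, D_n, ..., D_1] ordering, so index 0 is the
    approximation group and the last index is D1."""
    start = sum(L[:group_index])
    end = start + L[group_index]
    return C[:start] + [c * factor for c in C[start:end]] + C[end:]

# Hypothetical (C, L) pair standing in for a level-5 decomposition.
L = [2, 2, 4, 8, 16, 32]              # lengths of A5, D5, D4, D3, D2, D1
C = [float(i) for i in range(sum(L))]
boosted = scale_group(C, L, len(L) - 1, 8.0)  # D1 * 8, as in the rain settings
print(boosted[-1] / C[-1])  # 8.0: only the D1 group was scaled
```

Passing the modified vector through the inverse wavelet transform then yields the perceptually shifted sound; because only a multiply over one group is involved, the manipulation is cheap enough for real-time control.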
Table 2. Perceptual Experiment Results for the Rain Model. Includes a Subset of Freeform Identification Labels, Average Context-Based Ratings, and Standard Deviations for the Base (Original) Sound and Two Parameter Manipulations. (1 = no match, 5 = perfect match)
Rating results
Base sound
Base w/D1*8
Base w/A5*4
Avg rating Std dev Avg rating Std dev Avg rating Std dev
Rain
4.30
2.63
3.85
2.19
3.07
2.41
Hard rain
Light drizzle of rain
Water running in a shower
Bacon frying
Small waterfall
Large waterfall
1.20
1.42
1.32
1.11
1.21
1.19
2.11
4.26
3.67
3.44
2.63
1.37
1.42
1.06
1.21
1.55
1.33
0.74
4.63
1.63
2.67
1.89
2.96
4.15
0.79
0.88
1.41
0.97
1.26
1.03
4.2.2 Car Motor. The original sound for this model was that of a digitized mid-sized, four-cylinder engine idling in an open-air environment. Adjusting the parameters resulted in the perception of a
large diesel truck, a standard truck, a small car, and a
large car. Perceptual labels identified during the freeform experiment were different engine types, machinery, construction site machines, tractor, jackhammer,
drill, helicopter, and various-sized airplane engines.
Magnitude scaling and scale filter manipulations were
performed on this model. Increasing the magnitude of
the D1 coefficients increases the high-frequency sound
content, resulting in a smaller engine sound, such as a
lawn mower or toy car. Increasing the magnitude of the
A5 coefficients results in a smoother-sounding engine
because the high-frequency metallic sounds are
drowned out by the enhanced low-frequency components. The result is a smoother, larger-sounding engine
such as a helicopter or airplane. Decreasing the magnitude of the coefficient vectors had the inverse effect.
The scale manipulations were intended to create
consistent-sized car engine sounds but with different
RPM characteristics. This effect was not achieved, however. Instead, they shifted the car engine in frequency, thus changing the perception of the sound type. High-frequency shifts resulted in a buzzing sound, reminiscent of a swarm of bees. Low-frequency shifts resulted
in large-engine sounds, such as that of an airplane. Perhaps a more uniform original base sound is required to
create the desired effect of RPM variation through scale
manipulation.
Thresholding the car motor model significantly reduced the number of coefficients without perceptually changing the synthesized sound. Using Matlab's automatic global threshold function (threshold level of 1004) resulted in 50.83% of the coefficients being set to zero while retaining 99.1% of the signal energy. Using this
significantly reduced set of coefficients did not have a
significant perceptual effect on the synthesized result.
This demonstrates how entire groups of coefficients may
be eliminated without changing the synthesized sound
perception, thereby dramatically reducing the number
of coefficients required. This is an example of the significant compression rates that may be possible for wavelet
sound models.
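The thresholding step itself is straightforward to sketch. The coefficient values and threshold below are hypothetical, chosen only to illustrate how the zeroed-coefficient fraction and retained-energy figures quoted above are computed (the paper's own numbers come from the car motor model's actual coefficients):

```python
def energy(v):
    """Sum of squared coefficients (signal energy in the wavelet domain)."""
    return sum(c * c for c in v)

def global_threshold(coeffs, t):
    """Zero out every coefficient whose magnitude falls below threshold t."""
    return [c if abs(c) >= t else 0.0 for c in coeffs]

# Hypothetical coefficient vector: a few large coefficients carry
# nearly all the energy, many small ones carry almost none.
C = [50.0, -32.0, 8.0, 0.4, -0.2, 6.0, 0.1, -0.05, 12.0, 0.3]
Ct = global_threshold(C, 1.0)

zeroed = sum(1 for c in Ct if c == 0.0) / len(C)
retained = energy(Ct) / energy(C)
print(f"{zeroed:.0%} of coefficients zeroed, {retained:.4%} energy retained")
```

Because wavelet coefficient magnitudes for natural sounds are typically concentrated in a small subset, zeroing the rest discards little energy, which is the basis of the compression potential noted above.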
Future Extensions
Conclusions
We have described a four-phase development process for a new sound synthesis approach using wavelets.
The iterative nature of the process allows continuous refinement of the sound models.
Acknowledgments
Sandia National Laboratories supported this work under its
Doctoral Study Program. We thank the reviewers who provided helpful comments, and we also thank the experiment
volunteers.
References
Ballas, J. (1993). Common factors in the identification of an
assortment of brief everyday sounds. Journal of Experimental Psychology: Human Perception and Performance, 19(2),
250–267.
Bonebright, T., Miner, N., Goldsmith, T., & Caudell, T.
(1998). Data collection and analysis techniques for evaluating the perceptual qualities of auditory stimuli. Proceedings
of the International Conference on Auditory Displays. Available online at www.icad.org/websiteV2.0/Conferences/
ICAD98/icad98programme.html
Cohen, A., & Ryan, R. D. (1995). Wavelets and multiscale
signal processing. London: Chapman & Hall.
Cook, P. (1997). Physically informed sonic modeling (PhISM): Synthesis of percussive sounds. Computer Music Journal, 21(3), 38–49.
Daubechies, I. (1988). Orthonormal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 41(7), 909–996.
Esteban, D., & Galand, C. (1977). Application of quadrature mirror filters to split band voice coding schemes. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 191–195.
Haar, A. (1910). Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen, 69, 331–371.
Mallat, S. (1989). A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), 674–693.
Meyer, Y. (1993). Wavelets: Algorithms and applications. Philadelphia: SIAM.
Miner, N. E., Goldsmith, T. E., & Caudell, T. P. (2002). Perceptual validation experiments for evaluating the quality of
wavelet synthesized sounds. Presence: Teleoperators and Virtual Environments, 11(5), 508–524.
Misiti, M., Misiti, Y., Oppenheim, G., & Poggi, J. (1996).
Wavelet toolbox for use with Matlab. MathWorks, Inc.
Mynatt, E. D. (1994). Designing with auditory icons. Proceedings of the Second International Conference on Auditory Display (ICAD), 109–119.
Ogden, R. (1996). Essential wavelets for statistical applications
& data analysis. Boston: Birkhäuser.
Serra, X. (1989). A system for sound analysis/transformation/
synthesis based on a deterministic plus stochastic decomposition.
Doctoral dissertation, Stanford University, and CCRMA
report no. STAN-M-58.
Smith, J. O. (1992). Physical modeling using digital
waveguides. Computer Music Journal, 16(4), 74–87.
Strang, G., & Nguyen, T. (1996). Wavelets and filter banks.
Wellesley, MA: Wellesley-Cambridge Press.
Stromberg, J. O. (1982). A modified Franklin system and higher order spline systems on Rⁿ as unconditional bases for Hardy spaces. In W. Beckner (Ed.), Conference in honor of A. Zygmund, Vol. II (pp. 475–493). Monterey, CA: Wadsworth Mathematics Series.
Van den Doel, K., & Pai, D. K. (1996). Synthesis of shape
dependent sounds with physical modeling. Proceedings of the
International Conference on Auditory Displays. Available
online at www.icad.org/websiteV2.0/Conferences/
ICAD96/Proc96/dendoel.htm