Gunshots, footsteps, door noises, vehicles, barking dogs: each sound could be a critical piece of evidence at a crime scene. Along with acoustic measurements (such as reverberation time, wall insulation, outdoor noise), binaural miking in acoustic simulations is a powerful instrument for testing a witness's ability to hear sound evidence and to check its direction of arrival at the crime scene. This article is a short introduction to binaural audio and reverberation issues.
Binaural recording is a 3D audio technique aimed at playback through headphones or a pair of speakers. It should not be mistaken for stereophonic sound, which rather refers to a two-dimensional soundscape. Binaural miking aims at rendering the exact position of the sources relative to the listener in a 3D environment; that is, sound surrounds the listener wearing headphones or sitting in front of the speakers. Binaural audio through speakers is much more difficult to implement due to cross-talk between the channels.
The first tests in binaural audio go back to 1881, when a pair of carbon microphones was placed in front of the stage of the Paris Opera, spaced like human ears. The acoustic signals were transduced and sent to their destination by telephone.
Several years later, in 1931-32, researchers at Bell Telephone Laboratories in New Jersey (USA) laid some of the most important foundations of electroacoustics. Harvey Fletcher, father of stereophonic sound and also known for the Fletcher-Munson loudness curves (a graph showing
human auditory sensitivity, expressed in dB SPL, as frequency varies, along with isophonic curves), started investigating the nature of sound, speech and listening, and patented the first medical acoustic hearing aid device.

Francesca Ortolani - Binaural approach in acoustic scene simulations in audio forensics
An accurate description of each part of an acoustic system, in particular the speaking source, the microphone, the electric transmission line and the receiver (headphones, loudspeaker or listener), soon became essential.
The first binaural dummy head, called Oscar, was built in 1932 at Bell Labs. It was made of wax and had a pair of dynamic microphones with a diameter of 1.4 inches (about 3.56 cm) placed near the ears.
The experiments revealed that the inaccuracy in localisation, above all in sensing the distance of the source, was due to the discrepancy between the mean aperture of the real (human) ear and the dimensions of the microphone diaphragm (1.4 inches is comparable to the wavelength at 9 kHz).
Binaural heads then became more sophisticated, apart from some rudimentary experiments made with spheres and opposed microphones. Commercial dummy heads are produced by Neumann, AKG, Sennheiser, Brüel & Kjær, Knowles Electronics and others. It is also possible to purchase in-ear microphones (these mics look like in-ear monitors).
Sound localisation is the capability of pinpointing the position of one or more sound sources in
terms of distance, azimuth and elevation. The information about position is not contained in the
receptor cells of the auditory system, as it is on the retina in vision. On the contrary, it has to be computed by exploiting other information. The main cues available are the ITD (Interaural Time Difference) and the ILD (Interaural Level Difference), also named IID (Interaural Intensity Difference). Sound takes a different time lapse to reach each of the two ears, while the ILD is produced by the shadowing effect of the head at the ear opposite to the direction of arrival, which blocks part of the energy carried by the sound.
Stern et al. [1] point out how the ITD and ILD work in complementary frequency ranges (at least as far as free space and simple point sources are concerned).
The ILDs are prominent for frequencies above 1.5 kHz, because in the high-end of the audible
spectrum the head has dimensions comparable with the wavelengths of the impinging sound
waves, thus reflecting a significant portion of the sound.
ITDs are present at all frequencies, but only at lower frequencies are periodic sounds decoded unambiguously. In other words, the maximum physically feasible ITD must be less than half the period of the signal. The reason is that the two ears sample the sound in space: in order to avoid spatial aliasing, the Nyquist theorem must be respected in space. Since the maximum ITD for a human head is about 660 μs, the ITDs useful for localisation are those of frequency components below about 1.5 kHz.
Given that different azimuth and elevation angles can generate more or less the same ITD, and that the ITD is approximately constant with frequency and across human subjects, the ITD [2] does not locate the source position unequivocally.
The ILD instead exhibits greater variability from listener to listener and depends quite critically on frequency. That is why the ILD turns out to be more useful for source localisation.
Referring to Figure 1, the ITD can be calculated with Woodworth's formula (extended for frequencies below 1.5 kHz) [3]:

ITD(θ, φ) = (a / c) · (θ + sin θ) · cos φ    (1.1)

where a is the radius of the head (supposing it is spherical), c is the speed of sound (about 344 m/s in air), θ is the azimuth and φ is the elevation.
Woodworth's original formulation did not consider that the ITD diminishes as the source moves away from the listener's horizontal plane. The factor cos φ accounts for that. At frequencies above 1.5 kHz Woodworth's formula becomes less accurate.
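As a quick sketch, equation (1.1) can be evaluated numerically. The function name, the default head radius of 8.75 cm and the use of degrees for the input angles are illustrative assumptions, not taken from the article:

```python
import math

def woodworth_itd(azimuth_deg, elevation_deg, head_radius=0.0875, c=344.0):
    """Extended Woodworth formula (1.1): ITD = (a/c) * (theta + sin theta) * cos phi.

    The 8.75 cm head radius is an assumed average value; the formula is
    meant for frequencies below about 1.5 kHz.
    """
    theta = math.radians(azimuth_deg)
    phi = math.radians(elevation_deg)
    return head_radius / c * (theta + math.sin(theta)) * math.cos(phi)

# A source at 90 degrees azimuth on the horizontal plane yields an ITD of
# roughly 650 microseconds, close to the maximum value quoted above.
itd_side = woodworth_itd(90, 0)
```

Note how the cos φ factor shrinks the ITD as the source leaves the horizontal plane, as described above.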
Consider another sensory cue: the IPD (Interaural Phase Difference). When the IPD is greater than or equal to 180°, the source position is ambiguous. This ambiguity is induced by the periodic nature of phase.
Since the wavelength of higher frequencies is shorter, the distance between the ears may be greater than a wavelength for sufficiently high frequencies (>1.5 kHz). This gives rise to spatial aliasing, as mentioned above, and the perceived ITD proves ambiguous, as the ear is responsive only to phase and not to time differences.
So, for such frequencies the supporting ILD information is necessary. In other words, a high-frequency signal (>1.5 kHz) can produce an ITD greater than the signal period, whereas for a low-frequency signal the ITD falls within one period of the signal itself, so the phase difference perceived at the ears allows the listener to evaluate the ITD unambiguously.
The human listener takes advantage of the ITD and ILD cues in order to decode the direction of arrival of the sound. In addition, the presence of the pinna furthermore produces a coloration of the sound depending on the direction of arrival of the sound wave.
Figure B.6 (in the annexe) shows the human ear anatomy.
In general, localisation performance is excellent in the front stage of the horizontal plane, good on
the rear and much worse on the vertical plane.
A transfer function is defined as

H(f) = Y(f) / X(f)    (1.2)

where Y(f) is the output signal frequency spectrum and X(f) is the input signal spectrum. A transfer function (or, equivalently in the time domain, the impulse response) characterises and describes the channel between the source and the receiver.
The HRTF is the Head-Related Transfer Function, a function of 4 variables (f, r, θ, φ), where f is the frequency, r is the distance of the source from the listener, and θ and φ are respectively the azimuth and the elevation. So we have a left HRTF and a right HRTF, which describe the channel between the source and, respectively, the left and the right ear. An HRTF matches only one configuration of source and receiver, and the left and right HRTFs embed the ITD and ILD information.
In order to extract the ITD and ILD from the HRTFs [4] we compute the modulus of the interaural transfer function, which is defined as:

H_INT(ω) = H^R_{r,θ,φ}(ω) / H^L_{r,θ,φ}(ω)    (1.3)

where ω = 2πf is the angular frequency and the superscripts R, L denote respectively the right and the left channel. The modulus of H_INT(ω), expressed in dB, gives the ILD:

ILD(ω) = 20 log10 |H_INT(ω)|    (1.4)

while the ITD is related to the derivative of the interaural phase difference:

dφ_INT(ω)/dω = d(φ^R(ω) − φ^L(ω))/dω    (1.5)
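As an illustration, both extractions can be sketched on a synthetic pair of head-related impulse responses. The toy data, names and sample rate below are assumptions, not measured HRTFs, and the broadband ITD is estimated here via cross-correlation, a simpler alternative to the phase derivative of (1.5):

```python
import numpy as np

fs = 48000  # assumed sample rate of the HRIR pair

# Toy head-related impulse responses: the right ear receives the sound
# earlier (sample 20 vs 40) and louder (1.0 vs 0.5) than the left ear.
h_left = np.zeros(256)
h_left[40] = 0.5
h_right = np.zeros(256)
h_right[20] = 1.0

# ILD: modulus in dB of the interaural transfer function H_R / H_L.
H_L = np.fft.rfft(h_left)
H_R = np.fft.rfft(h_right)
ild_db = 20 * np.log10(np.abs(H_R) / np.abs(H_L))  # ~ +6 dB at every frequency

# Broadband ITD estimate: lag of the cross-correlation maximum.
xcorr = np.correlate(h_right, h_left, mode="full")
lag = np.argmax(xcorr) - (len(h_left) - 1)  # negative: right ear leads
itd = lag / fs
```

With real HRTFs the ILD would of course vary with frequency, as discussed above for the duplex model.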
The HRTF method does not replace the duplex model but supports it, in order to obtain a better estimate of the source position in the listening environment. Research led to more complicated human models, and the BRTF (Body-Related Transfer Function) was introduced. These new transfer functions describe human body vibrations and their transmission to the auditory system, thus contributing to overall sound perception.
As a first effect, reverberation reduces the efficacy of the information used by the listener. In the presence of reflections, in fact, the auditory system may decode the source position erroneously due to high-intensity echoes interfering with the direct sound and superimposing themselves on it.
Tests in anechoic and reverberant chambers have brought some significant results through the years. Some of them are listed below:
- a wide-band stationary noise is localised less accurately in a reverberant field [5];
- a wide-band noise is localised more easily in a reverberant field than single sparse tones; that is, localisation improves with the spectral density of the source;
- sounds with sharp attack transients are localised better than others, independently of the reverberation decay time.
Thanks to the precedence effect, the listener is able to weight wavefronts in different ways: the first arriving front (direct sound) is given more weight. Steep attack transients trigger the precedence effect and favour localisation, while stationary sources are decoded with greater difficulty.
Studies about separation of concurrent sources [6] revealed that the capability of grouping sounds
according to their harmonicity depends on the extent of their fundamental frequency (f0) range.
In free space the listener easily identifies sounds having a sufficient pitch difference.
The situation changes in a reverberant environment: direct sound and echoes with the same
fundamental frequencies relate to each other on the basis of harmonic ratios. A flutter in f0 causes
inharmonicity between direct sound and reflections. For this reason, in a reverberant field, the
mechanism of grouping sounds according to harmonicity loses its efficacy.
In a confined space, distance perception depends on two factors [7]: the ratio between direct-sound energy and reflected energy, and the time delay between direct and reflected sound. Room reflections also contribute to the general spatial impression characterising the listening environment and the sound source.
Some parameters are introduced here: the ASW (Apparent Source Width), which is the perceived width of a source, and the LEV (Listener Envelopment), the perception of a surrounding sound.
The ASW is determined principally by the intensity level of the early lateral reflections within the first 80 ms from the arrival of the direct sound at the listener's position. The LEV is the subjective immersive sensation produced by reverb and depends on the nature of the late reflections, in particular on their distribution over time, their level and their direction of arrival.
In [8] Bradley explains how ASW and LEV are related to the precedence effect, during which the direct sound and the early reflections are perceptually fused together. The ASW finds its justification in the circumstance that the identification of the direction of arrival is distorted by the enhancement of the fused event produced by the early lateral reflections. The reverberation tail, instead, is not fused with the direct sound and appears as a sort of diffused halo around it, thus giving the listener a sense of immersion quantifiable with the LEV.
Figure 2 (b) highlights how reverberation fills up the darker, lower-energy zones, which are much sharper in Figure 2 (a). The attack phonemes are less susceptible to the effect of reverberation than the final (release) ones.
¹ Spectrogram: time vs frequency representation of a signal.
Figure 2 - Spectrogram of speech: (a) short reverb, (b) long reverb
Studies about the masking effect of reverb on words were discussed in [9], [10].
We define self-masking as the phenomenon in which the initial part of a phoneme masks the final part of the same phoneme, mixing up the transients. In overlap-masking, the echoes of earlier phonemes mask the following phonemes.
Early and late reflections have different effects on speech intelligibility [7]. The early reflections are highly correlated with the original speech and can possibly assist its understanding by increasing its loudness, that attribute of auditory sensation in terms of which sounds can be ordered on a scale extending from quiet to loud [11].
The late reflections are less correlated with the original signal and for this reason they behave like
additive noise.
Lochner and Burger [12] defined a parameter for speech intelligibility according to the ratio
between the energy introduced by the early and the late reflections.
This parameter is included in standard ISO 3382 (Measurement of the reverberation time of
rooms with reference to other acoustical parameters) with the name Early-to-Late Index, or
Clarity, where the crossover interval separating the early reflections and the tail of the reverb is 50
ms for speech and 80 ms for music:
C50 = 10 log10 [ ∫_0^{50 ms} h²(t) dt / ∫_{50 ms}^∞ h²(t) dt ]

C80 = 10 log10 [ ∫_0^{80 ms} h²(t) dt / ∫_{80 ms}^∞ h²(t) dt ]    (1.6)
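Equation (1.6) translates directly into a discrete-time computation on a sampled impulse response. This sketch uses a synthetic exponential decay in place of a measured response; the function name and sample rate are assumptions:

```python
import numpy as np

def clarity(h, fs, t_ms):
    """Early-to-Late Index of eq. (1.6): 10*log10(early energy / late energy).

    h    : room impulse response starting at the direct sound (t = 0)
    fs   : sample rate in Hz
    t_ms : crossover time, 50 ms for speech (C50) or 80 ms for music (C80)
    """
    k = int(round(t_ms * 1e-3 * fs))
    early = np.sum(h[:k] ** 2)
    late = np.sum(h[k:] ** 2)
    return 10 * np.log10(early / late)

# Synthetic exponential decay standing in for a measured impulse response
# (energy halves every ~50 ms, i.e. an RT60 of about 1 s).
fs = 48000
t = np.arange(fs) / fs
h = np.exp(-6.9 * t)
c50 = clarity(h, fs, 50)
c80 = clarity(h, fs, 80)  # larger than c50: more energy counted as early
```

For the same response, C80 is always at least as large as C50, since the 80 ms window counts more of the energy as early.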
It's known that reverberation colours the signal spectrum. Human listeners are apparently able to compensate for this spectral distortion; how to do the same in machine implementations is still a subject of study.
One could simulate an artificial room, placing virtual sources relative to the listener, by convolving sounds with impulse responses. It is also possible to add an artificial reverb after having modelled the room properly.
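A minimal sketch of this convolution approach follows. The dry signal and the binaural impulse responses here are synthetic stand-ins (in a real simulation they would be measured or modelled for a specific source position):

```python
import numpy as np

fs = 48000
rng = np.random.default_rng(0)

# Dry test signal: a short click train (a stand-in for any recorded sound).
dry = np.zeros(fs // 2)
dry[::12000] = 1.0

# Hypothetical binaural impulse responses for one source position:
# a direct-sound spike plus a decaying noise tail shared by both ears.
t = np.arange(int(0.3 * fs)) / fs
tail = 0.1 * rng.standard_normal(t.size) * np.exp(-23 * t)
h_left = tail.copy()
h_left[0] += 1.0
h_right = tail.copy()
h_right[3] += 0.8  # right ear: direct sound arrives later and weaker

# Convolving the dry sound with each ear's response places the virtual
# source in the simulated room: y = h * x for each channel.
wet_left = np.convolve(dry, h_left)
wet_right = np.convolve(dry, h_right)
binaural = np.stack([wet_left, wet_right], axis=1)  # 2-channel signal
```

Played back over headphones, the interaural differences embedded in the two responses would convey the simulated source position.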
Annexe A: Reverberation
Reverberation is the persistence of sound after the source has ceased to produce it, occurring in an enclosed space where reflective surfaces are present.
The sound reaches the listener through the direct path from the source and through multiple
reflections from those surfaces impacted by the sound field.
The first step in understanding the characteristics of a reverberant environment is the pointwise
computation of its impulse response.
The impulse response describes completely the properties of a room according to the specific
configuration of source and receiver.
Typically, one excites the room with a continuous wide-band reference signal, such as pink noise, a sine sweep or an MLS, in order to record the room impulse response. The impulse response is obtained from the deconvolution of the recorded response by the reference signal. As expected, the impulse response looks like Figure A.3, where the first sound front (direct sound) arrives at time 0 ms, followed by the early reflections, which get more and more numerous and closer together while decreasing in amplitude, thus defining the reverb tail, or late reflections.
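Under the assumption of a noiseless, linear measurement, the deconvolution can be sketched as a spectral division Y(f)/X(f). The "room" here is a hypothetical known impulse response, and real measurements would need regularisation that this toy example omits:

```python
import numpy as np

fs = 48000
rng = np.random.default_rng(1)

# Reference excitation: a white-noise burst (pink noise, a sine sweep or
# an MLS would be used in practice).
x = rng.standard_normal(fs)

# Simulated room: a known impulse response (direct sound + two echoes)
# that the reference passes through; y is the "recorded" response.
h_true = np.zeros(2048)
h_true[0], h_true[600], h_true[1500] = 1.0, 0.4, 0.2
y = np.convolve(x, h_true)

# Deconvolution by spectral division Y(f) / X(f) over a common FFT length.
n = len(y)
H = np.fft.rfft(y, n) / np.fft.rfft(x, n)
h_est = np.fft.irfft(H, n)[:len(h_true)]  # recovers h_true (noiseless case)
```

Since the FFT length covers the full linear convolution, the division recovers the simulated response exactly up to floating-point error.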
The human ear perceives two distinct sounds if they are separated by more than a 30-80 ms interval (it depends on the person). This means that the late reflections appear as a halo with exponential decay.
The impulse response changes with the listening position inside the room. The pattern of the early
reflections, that is, the sequence of repeating early reflections, and the arrival of the first
reflection are influenced by the position of source and receiver relative to the reflective surfaces
and by the dimensions of the room (see Figure A.4). The decay time depends on the room dimensions and on the surface absorption.
The delay between the arrival of the direct sound and of the reflections has the side effect of colouring the received sound. This phenomenon is called comb filtering: a comb filter introduces regularly spaced nulls in the frequency response of the room (for a single equal-level reflection delayed by τ, at the odd multiples of 1/(2τ)).
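The null positions can be verified numerically. For a single equal-level reflection delayed by τ = 2 ms (a hypothetical value), the first null is expected at 1/(2τ) = 250 Hz:

```python
import numpy as np

fs = 48000
delay_ms = 2.0                 # reflection arriving 2 ms after the direct sound
d = int(fs * delay_ms / 1000)  # delay in samples

# Impulse response of "direct sound + one equal-level reflection".
h = np.zeros(d + 1)
h[0] = 1.0
h[d] = 1.0

# Magnitude response at 1 Hz resolution: nulls where the two paths cancel,
# i.e. at odd multiples of 1/(2*tau): 250, 750, 1250, ... Hz.
H = np.abs(np.fft.rfft(h, fs))
first_null = int(np.argmin(H[:500]))  # expected at 250 Hz
```

Between the nulls the two paths add constructively, producing the comb-shaped ripple that colours the perceived sound.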
Sound coloration is also due to the materials in the room, which absorb the sound differently with
frequency.
When choosing materials for the acoustic correction of a room, it's good practice to consult the data sheets or charts listing the absorption coefficients for each frequency range. All materials (including air) absorb higher frequencies more efficiently, since these typically carry less energy than lower frequencies.
High frequencies are also more easily diffused, given that their (short) wavelengths are comparable with the dimensions of the objects (obstacles) in a room. From the perceptual point of view, high frequencies are more directional. For this reason, for instance, the subwoofer in a surround system can be placed at almost any position in the room (within reason, depending on room dimensions, crossover frequency and wave direction), and for the same reason it is very common to find the external far end (port) of a bass-reflex tube at different positions on a loudspeaker case.
Another effect, changing the listening position, is that reverb becomes more apparent as one walks away from the source. There exists a distance, called the critical distance, beyond which the reverberant field has greater intensity than the direct field:
d_c = √( Q · Σ_n S_n α_n / (16 π) )    (1.7)

where Q is the directivity factor of the source, and S_n and α_n are the area and absorption coefficient of each surface.
If the absorption coefficients are unknown, an alternative formula for dc is the following:
d_c = √( Q · V / (100 π · RT60) )    (1.8)
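Equations (1.7) and (1.8) can be sketched as follows; the room dimensions, the uniform absorption coefficient, the directivity Q = 1 and the RT60 value are hypothetical:

```python
import math

def critical_distance(Q, surfaces):
    """Eq. (1.7): dc = sqrt(Q * sum(S_n * alpha_n) / (16 * pi)).

    surfaces: list of (area_m2, absorption_coefficient) pairs.
    """
    A = sum(S * a for S, a in surfaces)
    return math.sqrt(Q * A / (16 * math.pi))

def critical_distance_rt(Q, V, rt60):
    """Eq. (1.8), usable when the absorption coefficients are unknown."""
    return math.sqrt(Q * V / (100 * math.pi * rt60))

# Hypothetical 5 x 4 x 3 m room with a uniform absorption coefficient of
# 0.05 and an omnidirectional source (Q = 1).
total_surface = 2 * (5 * 4 + 5 * 3 + 4 * 3)  # 94 m^2
dc_abs = critical_distance(1.0, [(total_surface, 0.05)])
dc_rt = critical_distance_rt(1.0, 5 * 4 * 3, 1.0)  # assuming RT60 = 1 s
```

Both formulas give a critical distance well under a metre for such a live room: beyond a few tens of centimetres, the reverberant field dominates.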
Figure 5 - The reflection order is the number of impacts the sound undergoes before reaching the listener/receiver [13]
² Reverberation time (RT60, RT30, RT20, RT10): typically defined as the time interval sound energy takes to decay by 60, 30, 20 or 10 dB after the excitation of the room has ceased.
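A common way to estimate the reverberation time from a measured impulse response is Schroeder backward integration followed by a line fit on the energy decay curve. This is a sketch of that approach; the -5 to -25 dB fit range is one conventional choice, not taken from this article or from the ISO procedure:

```python
import numpy as np

def rt60_schroeder(h, fs, db_lo=-5.0, db_hi=-25.0):
    """RT60 estimate via Schroeder backward integration.

    Fits the decay slope between db_lo and db_hi on the energy decay
    curve and extrapolates it to a 60 dB decay.
    """
    edc = np.cumsum(h[::-1] ** 2)[::-1]     # energy decay curve
    edc_db = 10 * np.log10(edc / edc[0])
    i_lo = int(np.argmax(edc_db <= db_lo))  # first sample below -5 dB
    i_hi = int(np.argmax(edc_db <= db_hi))  # first sample below -25 dB
    slope = (db_hi - db_lo) / ((i_hi - i_lo) / fs)  # dB per second
    return 60.0 / abs(slope)

# Synthetic exponential decay with a true RT60 of about 1 s.
fs = 48000
t = np.arange(fs) / fs
h = np.exp(-6.9 * t)
rt = rt60_schroeder(h, fs)
```

Backward integration smooths the decay curve, which is why it is preferred over fitting the raw squared response directly.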
Figure B.6 - Ear anatomy [David Darling, Encyclopedia of Science]: (1) pinna, (2) lobe, (3) auditory canal, (4) eardrum, (5) ossicles (stirrup, anvil, malleus), (6) Eustachian tube, (7) oval window, (8) saccule, (9) semicircular canals, (10) cochlea
References:
1. R. M. Stern, G. J. Brown and D. Wang, "Binaural Sound Localization", in Computational Auditory Scene Analysis: Principles, Algorithms, and Applications, Wiley-IEEE Press, 2006.
3. Brian Carty and Victor Lazzarini, "Binaural HRTF based spatialisation: new approaches and implementation", 12th Int. Conference on Digital Audio Effects (DAFx-09), Como, Italy, September 1-4, 2009.