Sound field capture with microphone arrays, proximity microphones, and optimal filters

Philippe-Aubert Gauthier¹,², Thomas Padois¹, Telina Ramanana¹,², Anthony Bolduc¹,², Yann Pasco¹,², and Alain Berry¹,²

¹ Groupe d’Acoustique de l’Université de Sherbrooke, Université de Sherbrooke, Sherbrooke, J1K 2R1, Canada
² Centre for Interdisciplinary Research in Music, Media, and Technology, McGill University, Montréal, H3A 1E3, Canada

Correspondence should be addressed to Philippe-Aubert Gauthier (Philippe-Aubert.Gauthier@USherbrooke.ca)
ABSTRACT
The aim of sound field reproduction is to physically reproduce a sound field in an extended area using
loudspeaker arrays. Typically, the virtual scene is described by a composition of simple sources (spherical
waves, plane waves) fed by monophonic signals from proximity microphones or a microphone array recording.
It would be useful to combine these approaches in order
to separate monophonic signals driving simple foreground
virtual sources from the remaining immersive sound environment. In this paper, the combination
of proximity microphones and microphone arrays is investigated in order to separate the sound source signals
with proximity microphones from the remaining sound environment. Optimal (Wiener) filters are used for
this purpose. Results of numerical simulations and acoustic imaging illustrate the viability of the approach.
1. INTRODUCTION
The aim of sound field reproduction (e.g. Ambisonics
and wave field synthesis (WFS)) [1, 2] is to physically re-
produce a sound field in a given area using loudspeaker
arrays. The virtual scene is described by: 1) a spatial
composition of simple sources (e.g. spherical waves,
plane waves) fed by monophonic signals from proxim-
ity microphones or 2) a sound field recording made with
microphone arrays (that are conformal or not with the
loudspeaker array). The former case is a direct exten-
sion of audio engineering methods from pop or rock mu-
sic recording and mixing with direct inputs or proxim-
ity microphones near instruments or amplifiers. The mix
is created using gain control and panning. This translates
directly into the WFS workflow, where individual virtual
sources are fed by signals from proximity microphones.
The latter case, microphone array recording, can be
considered blind sound field capture. It is also an extension
of two-channel stereo or 5.1 surround techniques with
channel encoding, as used in classical music recording
where a 2- or 5-microphone array records the band as a
whole without individual source or instrument processing
except for final mastering. For sound field reproduction
such as WFS, this type of approach can be achieved with
a microphone array that fits the loudspeaker array or
with a non-conformal microphone array
combined with sound field extrapolation [3].
A research topic recently investigated by the authors is
sound field reproduction applied to industrial noise and
sound environments such as aircraft cabin [4]. This re-
search axis is being further developed for industrial or
worker sound environment simulation using wave field
synthesis applied to auditory and sound perception studies.
One of the requirements for such studies is the ability to
accurately reconstruct a captured sound field, together with
the possibility of performing parametric modifications of
the original sound field for parametric studies. Examples
of modifications are: reducing the general surrounding
sound, changing the level and position of foreground sound
sources, etc. In this context, it would be interesting to
combine the aforementioned capture approaches
in order to separate monophonic signals to drive simple
foreground virtual sources from remaining sound envi-
ronment at the microphone array. Multichannel optimal
filters are used for this purpose. Numerical simulations
and acoustic imaging with one proximity microphone
and a 48-microphone array demonstrate the viability of
the approach.
AES 55TH INTERNATIONAL CONFERENCE, Helsinki, Finland, 2014 August 27–29
2. BACKGROUND AND OBJECTIVES
In marked contrast with music reproduction, the
applicability of sound field reproduction and spatial audio
within an industrial context cannot rely on illusion,
effects or even the creativity of the sound designers
and engineers. Indeed, for industrial partners and re-
searchers interested in spatial sound reproduction, there
is a strict requirement of physical accuracy. As an illus-
trative example, it is not acceptable to recreate an acous-
tic environment using proximity microphones near fore-
ground machinery to drive properly placed virtual
spherical sources in WFS. Indeed, proximity microphones
also capture the sources’ near-field sound, which does not
necessarily radiate to the far field (even a few meters from
the sources) where the listener stands. To circumvent
this context-specific difficulty, the objective of the pro-
posed method is to separate, in the microphone array sig-
nals, sound from the foreground sources (referenced by
proximity microphones) from the surrounding environ-
ment. Therefore, the aim is also to predict the foreground
source signal at the microphone array in order to drive a
simple WFS spherical source by the far field sound of the
given foreground source (i.e. neglecting the near field
sound as captured by proximity microphones, or sound
components coming from other sources). This would ap-
ply to a reproduction strategy as shown in Fig. 1.
2.1. Optimal filters
Generally, the purpose of optimal filters is to filter a given
reference signal in order to cancel or imitate a desired
or disturbance signal. Accordingly, they are optimal in
the sense that they minimize the difference between the
desired signal and the filtered reference. Using power
spectral density of the reference and the cross spectral
density of the reference and desired signal, one can find
the optimal (or Wiener) filters [5]. Here, the proximity
microphone signal is considered as the input of an un-
known system while the microphone array signals are
the outputs of this identical system and the optimal fil-
ters perform blind system identification. In our case, the
Wiener filters predict at the array the part of the signal
that is correlated with the proximity microphone. When
subtracting this predicted part of the signal from the orig-
inal array signal, one could separate this source from the
other sources. Therefore, in this paper, if the optimal fil-
ter succeeds, two goals are reached at once: 1) the fore-
ground sources (for which a proximity microphone was
used) and correlated signals are removed from the array
signal, resulting in a separated surround background ar-
ray signal and 2) the filtered signal that predicts the fore-
ground sound at the array is the foreground sound signal,
as measured by the proximity microphone but without
near field coloration. This feature solves the aforemen-
tioned practical issue.
3. SIGNAL PROCESSING AND APPLICATIONS
The aim of the method is to drive a sound field reproduction stage as shown in Fig. 1. Assuming that foreground signals have been extracted from the original microphone array recording and stored in y_i(n) for foreground source i, and assuming that the relative position of the foreground source with respect to the microphone array origin has been measured on site, it is easy to reproduce these signals as simple virtual spherical waves with conventional WFS algorithms. As an example, the remaining array signals s_m(n) (with m = 1 ... M for an M-channel array) could then be simply processed by uniformly distributed delay-and-sum beamformers [6] and reproduced as plane waves by a standard WFS algorithm. Since this reproduction stage could rely on standard WFS algorithms, it is not reported in this paper.
In order to achieve this separation, the investigated signal processing is shown in Fig. 2. A set of I reference signals x_i(n) are taken from proximity microphones corresponding to I foreground sound sources identified on site. The optimal single-input-multiple-output (SIMO) filter W_i attempts to predict the contribution y_mi(n) of the foreground source i at the microphone array. The difference between the total array signals d_m(n) and y_mi(n) thus represents the contribution of all other sources at the microphone array. Once this problem is solved for the i-th foreground source, it can be repeated for the (i+1)-th source, where the d_m are replaced by the s_mi, and so on.
Not discussed in this paper is the way to transform the array signal associated with foreground source i, i.e. y_mi(n), into a single monophonic signal y_i(n) for the corresponding virtual point source. The simplest strategy would be to select a single signal from the array. A more complex approach could rely on focused beamforming with a focus point equal to the on-site identified foreground source. This latter method could also help in reducing the reverberated sound in y_i(n) if requested. The comparison of these approaches and their impact on sound field reproduction using WFS is a topic of future research.
Fig. 1: Signal processing at reproduction stage with M-microphone array, I reference signals and L plane waves for the surrounding sound environment. [Block diagram: foreground signals y_i(n) (I channels) drive foreground virtual sources, while surrounding signals s_m(n) are processed by microphone array beamforming into L plane waves for the surrounding sound environment; both feed the WFS loudspeaker array.]
4. OPTIMAL FILTERS
The I proximity microphones provide reference signals x_i(n), stored in an Lth-order buffer x_i(n) = [x_i(n) x_i(n−1) ··· x_i(n−L+1)]^T. Once filtered by the set of M independent Lth-order SISO (single-input-single-output) FIR (finite-impulse-response) optimal filters W_i = [w_1i ··· w_mi ··· w_Mi], with w_mi = [w_mi1 ··· w_mil ··· w_miL]^T ∈ R^{L×1}, x_i(n) provides M foreground signals y_mi(n) given by y_mi(n) = w_mi^T x_i(n), or

    y_i(n) = W_i^T x_i(n),    (1)

with y_i(n) = [y_1i(n) ··· y_mi(n) ··· y_Mi(n)]^T ∈ R^{M×1}. The residual microphone array signals correspond to the surrounding sound environment, which is not correlated with the reference signals from the proximity microphones: they are given by s_m(n) = d_m(n) − Σ_{i=1}^{I} y_mi(n). Since all SISO optimal filters in the SIMO set are independent, it is possible to write the optimal filter for each mi pair. The cost function to be minimized is [5]

    J_mi = E[s_mi(n)²],    (2)

with s_mi(n) = d_m(n) − y_mi(n), where E[·] denotes mathematical expectation. The quadratic cost function is written in matrix form as

    J_mi = w_mi^T A_i w_mi − 2 w_mi^T b_mi + c_m,    (3)

with the reference autocorrelation matrix A_i = E[x_i(n) x_i^T(n)] ∈ R^{L×L}, the cross-correlation vector b_mi = E[x_i(n) d_m(n)] ∈ R^{L×1}, and the mean-square value of the m-th microphone signal c_m = E[d_m²(n)]. To find the optimal filter coefficients, one sets to zero the derivatives of the cost function with respect to the filter coefficients:

    ∂J_mi/∂w_mi = 2[A_i w_mi − b_mi] = 0,    (4)

    w_mi = A_i^{−1} b_mi.    (5)

Equation (4) gives the discrete form of the Wiener-Hopf equations [5]

    Σ_{l=0}^{L} w_mil R_{x_i x_i}(n−l) − R_{x_i d_m}(n) = 0.    (6)

For the causally constrained case, one sets 0 ≤ n ≤ L−1. For the unconstrained case, one sets −∞ ≤ n ≤ ∞ and the summation operates from −∞ to ∞. In this equation, R_{x_i x_i} and R_{x_i d_m} are the autocorrelation of the reference signal and the cross-correlation between the reference signal and array microphone m, respectively. For the applications discussed in this paper, the proposed method relies on the unconstrained normal equations. In this case, the optimal filter from reference signal i to microphone m is found using the discrete Fourier transform of the previous equation. Thus, for frequency bin k [5],

    w̃_mi(k) = S̃_{x_i d_m}(k) / S̃_{x_i x_i}(k),    0 ≤ k ≤ L−1,    (7)

where ˜ indicates a frequency-domain quantity. The finite-impulse-response filter coefficients w_mi are found by inverse Fourier transform. In Eq. (7), S̃_{x_i d_m} is the cross spectral density and S̃_{x_i x_i} the power spectral density.
Fig. 2: Signal processing for foreground and surrounding signal separation at the microphone array, illustrated for the i-th reference signal and M-microphone array. [Block diagram: the reference signal x_i(n) from the proximity microphone near foreground source i is filtered by the SIMO optimal filter W_i, yielding the foreground signals y_mi(n) (M channels); these are subtracted from the array signals d_m(n) to give the surrounding signals s_mi(n).]
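A minimal sketch of the subtraction stage of Fig. 2 for a single reference signal (the names and the ideal test filters below are assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

def separate(d, x, W):
    """Split array signals into foreground and surrounding parts.

    d : (M, N) array signals d_m(n)
    x : (N,) proximity-microphone reference x_i(n)
    W : (M, L) SIMO filter, one FIR w_mi per array microphone
    Returns y (foreground at the array) and s = d - y (surrounding)."""
    y = np.stack([fftconvolve(x, w)[: d.shape[1]] for w in W])
    return y, d - y

# Toy usage: two microphones receiving a delayed copy of the reference
# plus an uncorrelated background tone
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
bg = np.sin(2 * np.pi * 0.01 * np.arange(4096))
d = np.stack([np.roll(x, 5) + bg, 0.7 * np.roll(x, 9) + bg])
W = np.zeros((2, 16)); W[0, 5] = 1.0; W[1, 9] = 0.7   # ideal filters
y, s = separate(d, x, W)
```

With the ideal filters, s reduces to the background tone alone (apart from the first few samples, where the convolution has not yet built up).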
5. NUMERICAL SIMULATIONS
The results are based on an audio ray-tracing simulation
of the environment shown in Fig. 3. The space simulates
two large rooms and two corridors. The room height is
4 m. The largest dimensions of the model are 76 m by
21 m. Wall reflectivity coefficients are set to 0.8, 0.75
and 0.7 for the low, mid and high frequency ranges, re-
spectively. The 3D model was done with Blender [7]
and the acoustical simulation was achieved using the
E.A.R. [8] plug-in for Blender. The sound speed is
343 m/s. Based on the Schroeder curve of an impulse
response obtained from the model, the reverberation time
of the space is approximately 1.59 seconds. Preliminary
validation tests of E.A.R. under free-field conditions,
verifying the sound speed and the direction of arrival at
the circular microphone array, proved conclusive. E.A.R.
was also shown to be in good agreement with the Sabine
prediction of the reverberation time [8].
The background and foreground sound sources are simu-
lated as omnidirectional sources fed by real monophonic
recordings of various machinery noise (engines, indus-
trial sewing machine). These stationary or nearly station-
ary signals are shown as Welch power spectral densities
in Fig. 4. These power spectral densities overlap over
certain frequency bands. The foreground sound tends to
cover more uniformly the entire spectrum and the back-
ground sources have more pronounced low-frequency
content. The reference signal used for the foreground
source was simulated with an omnidirectional proximity
microphone 50 cm from the foreground sound source in
the 3D model.

Fig. 4: Power spectral densities of sources. [Figure: Welch power spectral densities, power/frequency in dB/Hz versus frequency in kHz, for background source #1, background source #2 and the foreground source.]

In Fig. 3, note that the foreground source
and the background source #1 can both produce direct
sound at the microphone array while the background
source #2 cannot provide direct sound at the microphone
array, since this background source is not aligned with
an opening like background source #1. This may impact
the acoustical imaging results and should be kept in
mind. Simulations were run at a 44.1 kHz sampling
rate and then downsampled to 12 kHz to test the
presented method. Downsampling was performed to reduce
the computational burden of the algorithm while still
illustrating the validity of the method. This does not limit
the extent of the reported results, since they could be
achieved at higher sampling rates.
To capture the sound field, a 48-channel microphone
array was used, as shown in Fig. 3. The microphones are
omnidirectional. The array is a uniform, horizontal circular
array located 1.22 m above the floor in the 3D model,
with a radius of 1 m.
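The stated geometry can be written down directly; a sketch (coordinates in metres, variable names hypothetical):

```python
import numpy as np

# 48 omnidirectional microphones on a uniform horizontal circle,
# radius 1 m, at a height of 1.22 m (as in the 3D model)
M, R, H = 48, 1.0, 1.22
phi = 2 * np.pi * np.arange(M) / M            # uniform angular positions
mic_xyz = np.stack([R * np.cos(phi),
                    R * np.sin(phi),
                    np.full(M, H)], axis=1)   # (48, 3) positions
```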
Fig. 3: Top view of the modeled environment with background sources and a foreground source. [Figure: background sound sources #1 and #2, the foreground sound source and the 48-microphone array in a 76 m by 21 m floor plan; the foreground source lies at 135° and background source #1 at 176.8° with respect to the array.]
6. NUMERICAL RESULTS
First, standard beamforming was performed on the original microphone array signals d_m(n) to evaluate the acoustical map and validate the incoming sound directions of the foreground and background sound sources along with sound reflections. The acoustical maps are obtained using horizontal beamforming with a scan grid defined by a 360° horizontal circle with 1° increments. In this case, beamforming was achieved in the frequency domain on the basis of the cross-spectral density matrix of the microphone signals. Averaging across the 1 kHz octave band was then performed.
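A sketch of such CSM-based beamforming with band averaging (a conventional beamformer under assumed geometry and sign conventions, not the authors' code):

```python
import numpy as np

def csm_beamform(D, fs, mic_xy, angles_deg, band=(707.0, 1414.0), c=343.0):
    """Conventional frequency-domain beamformer built on the
    cross-spectral matrix (CSM) of the array signals, with the output
    power averaged over a frequency band (e.g. the 1 kHz octave)."""
    M, N = D.shape
    X = np.fft.rfft(D, axis=1)
    f = np.fft.rfftfreq(N, 1.0 / fs)
    sel = np.flatnonzero((f >= band[0]) & (f <= band[1]))
    th = np.deg2rad(np.asarray(angles_deg, dtype=float))
    u = np.stack([np.cos(th), np.sin(th)])          # (2, A) scan directions
    delays = mic_xy @ u / c                         # (M, A) plane-wave delays
    out = np.zeros(th.size)
    for k in sel:
        C = np.outer(X[:, k], X[:, k].conj())       # CSM for this bin
        g = np.exp(-2j * np.pi * f[k] * delays) / M  # steering vectors
        out += np.real(np.einsum('ma,mn,na->a', g.conj(), C, g))
    return out / sel.size

# Toy scene: a broadband plane wave from 135 degrees on a 48-microphone
# circular array of radius 1 m (the geometry and signals are assumptions)
rng = np.random.default_rng(2)
M, N, fs = 48, 4096, 12000
phi = 2 * np.pi * np.arange(M) / M
mic_xy = np.stack([np.cos(phi), np.sin(phi)], axis=1)
f = np.fft.rfftfreq(N, 1.0 / fs)
u0 = np.array([np.cos(np.deg2rad(135.0)), np.sin(np.deg2rad(135.0))])
X0 = np.fft.rfft(rng.standard_normal(N)) * np.exp(
    -2j * np.pi * np.outer(mic_xy @ u0, f) / 343.0)
D = np.fft.irfft(X0, n=N, axis=1)
p = csm_beamform(D, fs, mic_xy, np.arange(360.0))
```

In practice the CSM would be averaged over snapshots rather than taken from a single FFT, but the steering and band averaging are the same.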
The resulting acoustical map of the original scene with
foreground and background sources in the architectural
space is shown in Fig. 5(a). One notes a dominant lobe at 135°, which corresponds to the direct sound of the foreground source. The second dominant lobe is at 177°, which also fits the relative angular position of background source #1 that is visible from the array (see Fig. 3). The remaining part of the acoustical map is a combination of reflected sound and microphone array artifacts.
Next, source separation is performed and the acoustic maps of extracted foreground signals and remaining surrounding sounds are analyzed. An example of an optimal filter in the time domain is given in Fig. 6 for the first microphone of the array, with a FIR order of 16384 (at a 12 kHz sampling frequency). As mentioned earlier, the non-causal part was conserved. This is reflected in the main impulse being delayed by 8192 samples (0.6827 s). This filter build-up time also introduces 8192 incomplete samples at the start of the separated signal; these should simply be discarded. For the next results, the first 8192 samples of y_mi(n) and s_m(n) were rejected.
The result of foreground extraction is reported in Fig. 7.
In this figure, the original foreground, background and
mixed signals were computed in a preliminary stage us-
ing the impulse responses from the architecture simula-
tion while uniquely driving the corresponding sources.
This allows for direct comparison of the extracted sound
signal with the actual background and foreground signals
alone. Clearly, the method performs well at most
frequencies. The extracted foreground signal is erroneous
below 60 Hz: some of the background sound signal leaks
into the extracted foreground signal. This is caused by
the fact that the background sound at the omnidirectional
proximity microphone is no less than 25 dB louder than
the foreground sound signal at 26 Hz (for example), which
makes signal separation difficult. This is discussed from a
practical perspective in Sec. 7.
Using these separated microphone array signals, acoustical maps using beamforming were once again computed for the 1 kHz octave band. In Fig. 5(b), the comparison of the original foreground map and the extracted foreground map is provided. The original foreground map was obtained using the same ray-tracing model but from a simulation where the background sources were muted. The agreement between the two maps is very good (−58.92 dB FS for the original map and −59.03 dB FS for the extracted map at 135°). This suggests that, besides being able to properly separate the frequency content of the foreground and background sound as suggested by Fig. 7, the method is able to preserve the phase and relative time-alignment information between each of the microphones, for both separated array signals. Accordingly, any subsequent spatial microphone array processing (as in Fig. 1) on either extracted foreground or background signals should perform adequately. A closer look at Fig. 5(b) shows that the original and extracted acoustical maps correspond less perfectly in two directions. The first direction corresponds to the background source angular position at 176.8°. The second direction (between 270° and 315°) corresponds to the lowest level of the map, where it is assumed that this discrepancy is less significant and possibly introduced through beamforming side lobes and background source #2. In Fig. 5(c), the agreement between the original and extracted background maps is also good (−65.13 dB FS at 177° for the original map and −65.78 dB FS at 177° for the extracted map). As for Fig. 5(a), the correspondence in Fig. 5(c) is good except at the angular position of the foreground source.

Fig. 5: (a): Acoustical map obtained with beamforming of the original scene with foreground and background sources. (b): Comparison of the acoustical maps of the original foreground sound with the extracted foreground sound. (c): Comparison of the acoustical maps of the original background signals with the extracted background signals. Exact angular positions of the sources with respect to the array are shown as vertical dashed lines. [Figure: level in dB (ref. 1) versus steering direction, 0° to 360°; marked values include −58.63 dB near 135° and −64.31 dB near 176.8°.]

Fig. 6: Example of optimal FIR filter coefficients from the proximity microphone to array microphone #1. [Figure: coefficients w_11 versus time, 0 to 1.2 s.]

Fig. 7: Power spectral densities of foreground and background signal separation for one microphone of the array. Top: original and extracted (y_11(n)) foreground signals. Center: original and extracted background (s_11(n)) sound signals. Bottom: original (d_1(n)) and reconstructed mixed (s_11(n) + y_11(n)) signals. [Figure: power/frequency in dB/Hz versus frequency in kHz.]
6.1. Investigation of near-field effects
In the previous case, most of the signal sent to the foreground source fully propagates to the microphone array, since there is no near-field effect simulation in the ray-tracing model. In order to evaluate the impact of near-field sound such as evanescent waves, a synthetically generated tone at 93 Hz was artificially added to the reference signal x_i(n) at different levels, namely −26, −20 and −14 dB FS. The peak signal in the original reference occurs at 58 Hz at −40 dB FS; the addition of this near-field signal is therefore drastic with respect to the original reference signal. The effect on the extracted foreground signal at one of the microphones of the array is shown in Fig. 8. Ideally, the extracted power spectral density should not be influenced by this simulated near-field effect. Note that the four extracted curves are superimposed in Fig. 8, except at 93 Hz. Clearly, the extraction of the foreground signal is less efficient at 93 Hz, where part of the artificial near-field signal (which does not radiate to the microphone array) has been associated with other environmental sound at the microphone array.
In order to attenuate this undesirable effect, an additional filter is introduced for each mi path in w_mi. This filter is derived from the coherence measurement given by

    C̃_mi(k) = |S̃_{x_i d_m}(k)|² / [S̃_{x_i x_i}(k) S̃_{d_m d_m}(k)],    (8)

to which a sigmoid function is applied

    F̃_mi(k) = 1 / (1 + e^{−c(C̃_mi(k) − a)}),    (9)

with 0 ≤ F̃_mi(k) ≤ 1, where a represents the threshold coherence below which the filter cuts the signal and c is the steepness of the sigmoid function, i.e. how rapidly the gate opens around the threshold. Therefore, F̃_mi represents a frequency-domain gate that only lets through signals coherent between the i-th proximity microphone and the m-th microphone of the array. The final filter is the multiplication of the optimal filter by this coherence filter:

    w̃_mi(k) = F̃_mi(k) S̃_{x_i d_m}(k) / S̃_{x_i x_i}(k),    0 ≤ k ≤ L−1.    (10)
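A sketch of the coherence-gated filter of Eqs. (8)-(10), with the values a = 0.33 and c = 10 used in the text (the signal names and spectral-estimation choices are assumptions):

```python
import numpy as np
from scipy.signal import csd, welch

def gated_wiener(x, d, L=4096, fs=12000, a=0.33, c=10.0):
    """Wiener filter of Eq. (7) multiplied by the sigmoid coherence gate
    of Eqs. (8)-(9), giving Eq. (10)."""
    _, S_xd = csd(x, d, fs=fs, nperseg=L, return_onesided=False)
    _, S_xx = welch(x, fs=fs, nperseg=L, return_onesided=False)
    _, S_dd = welch(d, fs=fs, nperseg=L, return_onesided=False)
    C = np.abs(S_xd) ** 2 / (S_xx * S_dd + 1e-20)   # Eq. (8): coherence
    F = 1.0 / (1.0 + np.exp(-c * (C - a)))          # Eq. (9): sigmoid gate
    W = F * S_xd / (S_xx + 1e-20)                   # Eq. (10): gated filter
    return np.roll(np.real(np.fft.ifft(W)), L // 2), C, F

# Toy check: a 93 Hz tone present only in the reference (a stand-in for
# near-field sound) should be gated out, since it is incoherent with d
rng = np.random.default_rng(3)
n = np.arange(120_000)
s = rng.standard_normal(n.size)
tone = 5.0 * np.sin(2 * np.pi * 93.0 / 12000.0 * n)
w, C, F = gated_wiener(s + tone, s, L=1024)
k93 = np.argmin(np.abs(np.fft.fftfreq(1024, 1 / 12000.0) - 93.0))
```

The gate F should sit near 1 wherever the reference and array signal are coherent, and fall towards 0 around the incoherent 93 Hz component.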
For this test, a was set to 0.33 and c was set to 10. The result of this modified filter for foreground source separation is illustrated in Fig. 9. Clearly, the artificial peak at 93 Hz is reduced, except for the most extreme case of 93 Hz at −14 dB FS. In all cases, the peak is lower than without the additional coherence filter. However, part of the extracted signal just above the 93 Hz peak is also altered, although it remains closer to the original foreground signal. This avenue seems promising and should be further investigated.

Fig. 8: Power spectral densities of foreground signal separation, illustrated at the first microphone of the array for the original scene with three cases of artificial signal mixed with the reference signal (93 Hz tone at −26, −20 and −14 dB FS). [Figure: power/frequency in dB/Hz versus frequency, 0.08 to 0.12 kHz.]

Fig. 9: Power spectral densities of foreground signal separation with optimal filtering combined with coherence filtering, illustrated at the first microphone of the array for the original scene with three cases of artificial signal mixed with the reference signal. [Figure: power/frequency in dB/Hz versus frequency, 0.08 to 0.12 kHz.]
7. PRACTICAL CONSIDERATIONS
Although the method performs well over a large band-
width, it was observed that some leakage of the back-
ground sound environment in the proximity microphone
signal can degrade the performance of the source sep-
aration. For the reported case, this was enhanced by
the recordings’ spectral difference: at 26 Hz (the frequency at which the method performed worst), the background sound spectrum at the proximity microphone was no less than 25 dB louder than the foreground sound signal. While this highlights a limitation of the proposed method, it also gives some hints for practical considerations. First, the proximity microphone should be placed in very close proximity to the source in order to increase the foreground-to-background ratio. Second, directive (cardioid or hyper-cardioid) microphones could be used as proximity microphones to further reduce background signals. Third, if the proximity microphone is close to the source, the channel sensitivity can be reduced, which should also attenuate the background sound leakage in the foreground sound signal. Finally, this may not matter much in practical situations, since in many cases placing a microphone very close to a sound source will enhance the low-frequency content of that source because of low-frequency evanescent waves in the near field, which were not taken into account in the reported simulations (this is also known as the proximity effect in audio engineering). Consequently, it is not expected that the method would suffer greatly in real practical situations. Another approach would be to use a vibration sensor on the foreground source in place of a proximity microphone.
Once the foreground and background signals have been
separated at the microphone array, one has to spatially
process this information in order to drive conventional
WFS algorithms and virtual sources. The reproduction of
background and surrounding sound environments could
be achieved using beamforming and plane wave
reproduction, although a crossover might be necessary to
introduce more beams at high frequencies than at low
frequencies, where main lobes are larger [6].
8. CONCLUSION
In this paper, a simple and efficient method based on
optimal filters, a microphone array and proximity
microphones was proposed. The aim was to separate the
microphone array signals into different parts. The first
part is associated with foreground sound sources,
identified on site and recorded with proximity
microphones. The second part is associated with the
remaining background sound environment, related to other
background sound sources and reflected sound. The
optimal filters can extract the foreground sound source
signal at the microphone array, i.e. excluding the near
field in the close proximity of the foreground source. The
results show that this separation is effective over a wide
frequency band. Acoustical maps of the original
foreground and background signals were compared with
acoustical maps of the extracted signals. Their agreement
demonstrates that time-alignment and relative phase
between microphone signals (both of which are essential
to the performance of any subsequent array or spatial
processing) are preserved through the separation process.
In order to increase the separation performance, a filtering
stage based on the coherence function was introduced.
Investigations of how this performs for sound field
reproduction are a topic of current research. In due course,
several industrial environments will be recorded using a
196-channel microphone array system, where some of the
channels will be used as proximity microphones for up to
8 foreground sources. The separated signals will then be
reproduced using a 96-channel WFS system.
9. REFERENCES
[1] Nicol R. and Emerit M., “3D-sound Reproduc-
tion Over an Extensive Listening Area: A Hybrid
Method Derived from Holophony and Ambisonic,”
presented at the AES 16th International Confer-
ence, Rovaniemi, Finland, 1999.
[2] Ahrens J., Analytical Methods of Sound Field Syn-
thesis, Springer, Berlin, 2012.
[3] Hulsebos E., de Vries D., and Bourdillat E., “Im-
proved Microphone Array Configurations for Au-
ralization of Sound Fields by Wave-Field Synthe-
sis,” J. Audio Eng. Soc., vol. 50, no. 10, pp. 779–
790 (2002 October).
[4] Gauthier P.-A., Camier C., Lebel F.-A., Pasco Y.,
and Berry A., “Experiments of Sound Field Repro-
duction Inside Aircraft Cabin Mock-Up,” presented
at the 133rd AES Convention, San Francisco, 2012.
[5] Elliott S., Signal Processing for Active Control,
Academic Press, San Diego, 2001.
[6] Hur Y., Abel J.S., Park Y.-C., and Youn D.H.,
“Techniques for Synthetic Reconfiguration of Mi-
crophone Arrays,” J. Audio Eng. Soc., vol. 59, no.
6, pp. 404–418 (2011 June).
[7] Home of the Blender project, http://www.blender.org, accessed on 2014 January 22.
[8] E.A.R., http://www.explauralisation.org, accessed on 2014 January 22.
