(Received 27 May 2015; revised 4 March 2016; accepted 13 June 2016; published online 21 July
2016)
Speech intelligibility is strongly affected by the presence of maskers. Depending on the spectrotemporal structure of the masker and its similarity to the target speech, different masking aspects
can occur which are typically referred to as energetic, amplitude modulation, and informational
masking. In this study speech intelligibility and speech detection were measured in maskers that
vary systematically in the time-frequency domain from steady-state noise to a single interfering
talker. Male and female target speech was used in combination with maskers based on speech for
the same or different gender. Observed data were compared to predictions of the speech intelligibility index, extended speech intelligibility index, multi-resolution speech-based envelope-power-spectrum model, and the short-time objective intelligibility measure. The different models served
as analysis tools to help distinguish between the different masking aspects. Comparison shows that
overall masking can to a large extent be explained by short-term energetic masking. However, the
other masking aspects (amplitude modulation and informational masking) influence speech intelligibility as well. Additionally, it was obvious that all models showed considerable deviations from the
data. Therefore, the current study provides a benchmark for further evaluation of speech prediction
models. © 2016 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4955079]
[CGC]
Pages: 524–540
I. INTRODUCTION
0001-4966/2016/140(1)/524/17/$30.00
Redistribution subject to ASA license or copyright; see http://acousticalsociety.org/content/terms. Download to IP: 203.110.242.24 On: Sat, 12 Nov 2016 05:24:19
modulation filters. AMM can generally be caused by fluctuating maskers or even by intrinsic modulations of a stationary masker which interfere with those of the target within a
certain auditory filter (e.g., Stone et al., 2012). In contrast to
AMM, coherent across-frequency amplitude modulations
(co-modulation) in the masker can reveal entire parts of the
target speech. In this case dip listening (Bronkhorst, 2000;
Lorenzi et al., 2006) comes into play which is most prominent for low modulation rates (usually below 8 Hz). The
observed masking release in fluctuating maskers can be conceptually compared to the psychophysical phenomenon of
co-modulation masking release (CMR) (e.g., Hall et al.,
1984), where a release from masking for a pure tone in
noise is caused by the coherence of modulations in many
frequency bands. In addition, Howard-Jones and Rosen
(1993) showed that a masking release occurs also for modulations that are not present over the entire frequency spectrum. Taken together, in contrast to AMM, dip listening
describes the use of reduced short-term EM in temporal
troughs of the masker or likewise in peaks of the target.
Informational masking (IM) usually refers to masking
that does not occur in the auditory periphery, but in more
central regions of the auditory system (Durlach et al.,
2003b). Pollack (1975) describes informational masking as
the uncertainty in the trial-to-trial variation in the noise
waveform in psychoacoustic measurements, whereas for
Brungart et al. (2001) the term holds for interfering talkers
and speech-on-speech masking when the masker is a
similar-sounding distractor (e.g., same gender). IM can
also be prompted by factors such as speaker spectrum, sentence structure, and semantic content of the target signal,
although these aspects also influence EM and AMM. Shinn-Cunningham (2008) claims that IM is governed by the
aspects of object formation and object selection, whereas the
latter is influenced by the amount of attention that is used to
form the particular auditory object. But generally, IM must
be clearly separated from general inattention toward the task
(Durlach et al., 2003a). Lutfi (1990) even proposed a calculation for informational masking based on the statistical
structure of waveforms in a tone detection experiment and
found the amount to be about 22% within maskers that are
thought to be energetic maskers only. Lutfi et al. (2013) and
Durlach et al. (2003a,b) claim that two aspects rule informational masking: uncertainty of the masker and similarity
between target and masker. These aspects were elaborated
on in Lutfi et al. (2013), but also overlap with the definition
of EM and AMM by Stone et al. (2012). Micheyl et al.
(2000) describe IM to be present when masker and target are
similar in terms of temporal coherence, harmonic structure,
and frequency range. An alternative assumption can be to
take IM as those masking effects that cannot be described by
speech intelligibility models which consider EM and AMM.
Overall IM is less clear cut and is often brought up if speech
reception thresholds cannot be explained by EM and AMM.
The concepts of EM and AMM have been successfully
used in speech intelligibility models to predict speech reception thresholds (SRTs) in various masking conditions. The
AI and SII use band importance functions for the different
analysis channels and thus provide a weighted measure of
J. Acoust. Soc. Am. 140 (1), July 2016
II. METHODS
A. Subjects
FIG. 1. Spectrograms of the four SSN-based maskers that were used in the
speech intelligibility and detection measurements. The basis was a SSN,
which has the same average long-term spectrum as the ISTS (Holube et al.,
2010). For the SAM-SSN condition, the SSN was fully modulated with an
8-Hz sinusoid, which resulted in a regular and coherent modulation. For the
BB-SSN condition the modulations were derived from intact broad-band
speech (see text for further details) and were thus irregular, but coherent.
For the AFS-SSN the SSN was split into 32 frequency channels to generate
an across-frequency shifted SSN. Four adjacent channels were multiplied
with the same sequence from a broad-band speech envelope. The sequence
that was used for the next four adjacent channels was from another section
in time of the broad-band speech envelope; thus, modulation patterns are
shifted across frequencies. This resulted in irregular and incoherent modulations across the spectrum.
FIG. 2. Spectrograms of the speech-like maskers ISTS, ST, and their noise-vocoded versions. Noise-vocoding was performed with a 32-channel auditory vocoder within the frequency range of 50 Hz to 12 kHz. The spectral
weighting of each filter was maintained. It should be noted that the masker
sequences in the four panels are not identical.
Schubotz et al.
Incoherent amplitude modulations across the auditory channels were introduced in the across-frequency shifted SSN
masker (AFS-SSN, lower right panel of Fig. 1). This masker
was created by filtering the SSN into 32 auditory channels
within a frequency range of 50 Hz–12 kHz using a 4th-order
Gammatone filter bank with 1-ERB (equivalent rectangular
bandwidth) spacing of the auditory filters. Four adjacent
channels were then modulated with the same envelope. The
envelopes were random sections in time from the same
low-pass filtered Hilbert envelope used for the BB-SSN. As
a consequence, coherent modulations were introduced only
in those parts of the masker spectrum that belong to the four
adjacent auditory filters. Overall there were eight different
randomly time-shifted modulations applied to the 32 bands.1
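The envelope assignment described above can be sketched as follows. This is a minimal sketch and not the authors' implementation: it assumes that the 32 band signals and the low-pass filtered broad-band envelope are already available as NumPy arrays, it uses circular shifts to stand in for the random sections in time, and the function name `afs_modulate` and its arguments are hypothetical.

```python
import numpy as np

def afs_modulate(bands, envelope, group_size=4, seed=0):
    """Modulate groups of adjacent bands with time-shifted copies of one envelope.

    bands:    array (n_channels, n_samples), band-filtered noise (assumed given)
    envelope: array (n_samples,), broad-band speech envelope (assumed given)
    Returns the summed masker and the shift (in samples) used for each group.
    """
    n_channels, n_samples = bands.shape
    if n_channels % group_size != 0:
        raise ValueError("n_channels must be a multiple of group_size")
    rng = np.random.default_rng(seed)
    n_groups = n_channels // group_size
    # One random shift per group; a circular shift stands in for
    # "another section in time" of the envelope (an assumption).
    shifts = rng.integers(0, n_samples, size=n_groups)
    out = np.empty_like(bands, dtype=float)
    for g in range(n_groups):
        env_g = np.roll(envelope, int(shifts[g]))
        rows = slice(g * group_size, (g + 1) * group_size)
        out[rows] = bands[rows] * env_g  # same envelope within the group
    return out.sum(axis=0), shifts
```

With 32 input bands and groups of four, the function draws eight shifts, so modulations are coherent within each group of four adjacent bands and incoherent across groups.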
Since the basis for the first four maskers was the ISTS, all
four maskers had the same long-term spectrum as the (female)
ISTS. For those maskers, where a male speech spectrum was
used, the basis was a transformed ISTS. The STRAIGHT algorithm (Kawahara et al., 2008) was used to lower the fundamental frequency (F0) and to lengthen the vocal tract of the original
ISTS signal in such a way that the mean fundamental frequency
of the transformed ISTS resembled that of the original male
OLSA target speaker (F0 110 Hz). The vocal tract lengthening was empirically performed to find a realistic factor for a natural sounding male ISTS. Otherwise, there was no attempt to
match the long-term spectrum to that of the male target talker.
The transformed ISTS was then used to derive a SSN with a
male speech spectrum, and from that the other three modulated
maskers were derived as described earlier. Thus, for the case of
a male target talker, the masker spectra of the male ISTS and
the four SSN-based maskers were similar to the target material.
This was not the case for the combinations male target and
female maskers as well as female target and female maskers.
2. Speech-like maskers
TABLE II. The parameters a and b as chosen for the STOI model to fit the
model to the thresholds from the listening experiments. The parameters
were chosen such that the psychometric function [see Taal et al. (2011), Eq.
(8)] for the SSN predictions matched the SSN condition of the empirical
data. The parameters were changed for the different gender combinations of
target and masker as indicated.
Combination of target and masker spectra    a         b
male/female                                 26.994    12.550
female/female                               28.938    13.140
male/male                                   35.916    17.496
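As a rough illustration of how such a fitted mapping is used, the sketch below evaluates a logistic psychometric function of the form f(d) = 100 / (1 + exp(a d + b)) and inverts it to find the STOI value d at a given intelligibility. The functional form is assumed here to correspond to Eq. (8) of Taal et al. (2011). The parameter magnitudes are the male/female entries of Table II; the negative sign of a is an assumption made so that predicted intelligibility increases with d, and the function names are hypothetical.

```python
import math

def stoi_to_intelligibility(d, a, b):
    """Map a STOI value d to percent correct with a logistic function."""
    return 100.0 / (1.0 + math.exp(a * d + b))

def intelligibility_to_stoi(percent, a, b):
    """Invert the logistic mapping: STOI value that yields `percent` correct."""
    return (math.log(100.0 / percent - 1.0) - b) / a

# Magnitudes from Table II (male/female); the sign of `a` is an assumption.
a, b = -26.994, 12.550
d50 = intelligibility_to_stoi(50.0, a, b)  # STOI value at 50% correct
d80 = intelligibility_to_stoi(80.0, a, b)  # STOI value at 80% correct
print(round(d50, 3), round(d80, 3))
```

Converting such STOI values into an SRT additionally requires the STOI output as a function of SNR for each masker, which is not part of this sketch.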
D. STOI measure
IV. RESULTS
TABLE I. Parameter values that were used in the ideal observer stage of the
mr-sEPSM for the different gender combinations of target and masker. They
were chosen such that the SSN model prediction matched the SSN condition
of the empirical data. The values q and m were fixed for all predictions (see
Jørgensen et al., 2013), whereas the other two parameters were adjusted for
each of the three gender combinations of target and masker spectrum.
Gender combination of target and masker spectrum    k        q      m     σs
Male/female                                         0.351    0.5    50    0.6
Female/female                                       0.4      0.5    50    0.6
Male/male                                           0.655    0.5    50    0.8
[Table: results of the post hoc pairwise comparisons between the eight maskers (SSN, SAM-SSN, BB-SSN, AFS-SSN, NV-ISTS, ISTS, NV-ST, ST) for SRT80, SRT50, and detection thresholds, shown separately for the combinations male target/female masker, female target/female masker, and male target/male masker. Significance levels are indicated by *, **, and ***.]
for the significance levels) showed that SRT50s for the SSN-based maskers differed significantly for SSN and SAM-SSN
(as always the case for the whole data set), SSN and AFS-SSN, AFS- and BB-SSN, and AFS- and SAM-SSN. The
SRT50s for the four speech-like maskers did not differ significantly, except for NV-ISTS versus ST.
Comparing the SRT80 with the SRT50 across the
upper panel in Fig. 3, a constant offset of about 4 dB (6 dB
for speech-like maskers) between the two measures was
observed. Otherwise the performance for SRT80 was very
similar to the SRT50 for the eight maskers: the highest
SRT80 was observed for the SSN and AFS-SSN (5.2 dB,
5.1 dB) maskers and there was a masking release in
SRT80s for the modulated maskers. The largest release
occurred again for the SSN and SAM-SSN (7.6 dB). Here the
ANOVA also showed a highly significant main effect of
masker [F(2.73, 19.11) = 34.12, p < 0.001]. Post hoc pairwise
comparisons showed significantly different SRT80s between
AFS- and BB-SSN and AFS- and SAM-SSN. Moreover,
there were significant differences between SAM-SSN and
BB-SSN, due to the regularity of the modulations. Regarding
the speech-like maskers there were significant differences
between NV-ISTS and ISTS, but not for the other maskers.
All SDTs were well below the SRTs. For the SSN and
SAM-SSN maskers, SDTs were about 10 dB lower. For the
BB- and AFS-SSN maskers SDTs were about 15 dB lower.
Finally, for the speech-like maskers, SDTs were as much as
20 dB lower. Nevertheless, the overall pattern of SDTs was
comparable; as for the SRTs, the highest thresholds were
observed for the SSN and AFS-SSN. There was also a
release from masking for the modulated SSN-based
maskers in the SDT experiment, but unlike for the SRTs,
the masking release did not increase with regularity, i.e.,
the SAM-SSN masker showed no lower SDT than the BB-SSN masker. The largest masking release occurred for
SSN and BB-SSN (instead of SAM-SSN as for the SRTs)
and amounted to 13.3 dB. This was slightly larger than for
the intelligibility measurements. When considering the
SDTs for the speech-like maskers, hardly any difference
could be observed. As for speech intelligibility, the type
of interfering talker (ISTS, ST) and the presence or
absence of fundamental frequency information due to
noise-vocoding did not influence the SDTs. A one-way
repeated-measures ANOVA showed a highly significant
main effect of masker [F(7, 49) = 82.02, p < 0.001]. Post
hoc pairwise comparisons showed differences for the
SSN-based maskers for SSN and AFS-SSN and AFS- and
BB-SSN. There were no significant differences for the
SDTs of the speech-like maskers.
2. Female target and female masker spectrum
for the speech-like maskers were at about 20 dB. A one-way repeated-measures ANOVA showed a highly significant
main effect of masker [F(7, 49) = 65.88, p < 0.001]. Post
hoc pairwise comparison with Bonferroni correction showed
that SRT50s for the SSN and AFS-SSN were not significantly different. In contrast, SRT50s differed for AFS- and
BB-SSN, AFS- and SAM-SSN, and SAM-SSN and BB-SSN. Thus, the coherence and the regularity of the modulations had a significant effect on SRT50s. The pairwise
comparison between the four speech-like maskers showed
no significant differences.
The overall pattern of SRT80s and SRT50s was the
same; the offset between the two measures was again 4 dB
for the SSN-based maskers and about 6 dB for the speech-like maskers. The highest SRT80s were 5.3 and 5.5 dB
(SSN and AFS-SSN) and are almost identical to the upper
panel. The largest release from masking occurred for SSN
and SAM-SSN (9.1 dB). All the SRT80s for the speech-like
maskers were in a similar range of about 15 dB. Again, the
ANOVA showed a highly significant main effect of masker
[F(7, 49) = 53.54, p < 0.001] and post hoc pairwise comparisons showed significantly different SRT80s for AFS- and
BB-SSN, AFS- and SAM-SSN, and SAM-SSN and BB-SSN. There were no significant differences in SRT80s for
the speech-like maskers.
The course of the SDTs was again similar to the SRTs.
As for speech intelligibility, the highest SDTs were obtained
with the SSN and AFS-SSN (16.9 dB and 20.1 dB), but
the lowest SDT was found for BB-SSN. The masking release
for the SSN and BB-SSN condition was 13.2 dB and thus
almost identical to the value in the upper panel. SDTs for the
speech-like maskers were again about 20 dB lower than the
SRTs. Also here the ANOVA showed a highly significant
main effect of masker [F(7, 49) = 66.93, p < 0.001]. Post hoc
pairwise comparison showed significant differences in SDTs
between AFS- and BB-SSN and AFS- and SAM-SSN. The
pairwise comparison between the four speech-like maskers
showed no significant differences in SDTs.
3. Male target and male masker spectrum
The lower panel of Fig. 3 shows the SRTs and SDTs for
the combination of male target and male masker spectrum.
The course of the SRTs was similar to the other two panels.
The highest SRT50s were obtained with the SSN and
AFS-SSN (8.2 and 9.3 dB). A masking release occurred
for the introduction of coherent modulations (BB- and SAM-SSN) across the frequency spectrum. It was largest between
SSN and SAM-SSN (9.5 dB), which was almost exactly the
same as for the other panels in Fig. 3. Considering the
speech-like maskers, SRTs for NV-ISTS, ISTS, and NV-ST
were similar. An exception, compared to the other panels,
was the ST masker. For the combination of male target and
male masker spectrum, this masker yielded SRTs that were
about 5 dB higher and had a larger standard deviation than
for all other speech-like maskers in this and the other panels.
In this case some subjects had severe problems with speech
intelligibility in the single talker interferer as will be discussed later. A one-way repeated-measures ANOVA showed
The upper panel shows the data and predictions for the
combination male target and female masker spectrum. SI
predictions by the SII were more or less independent of
masking condition, ranging from 7.5 dB (SAM-SSN) to
8.4 dB (ST), as was expected given that all SSN-based
maskers, ISTS, and NV-ISTS share the same long-term
target and masker material for the lower panel. For both, the
overall picture was the same as in the upper panel. SII predictions hardly differed for the individual maskers and ESII
and ESIIsen predictions for SAM-SSN were almost identical
to the SRT50s for SAM-SSN gained in the listening experiments. The release from masking for the other modulated
SSN-based maskers was again slightly overestimated and
largely overestimated for the speech-like maskers. For the
lower panel, the ESII RMSEs were smaller than for the other
two panels, while the RMSEs for the ESIIsen were similar
across all three panels. Again, the ESII predictions generally
resemble the pattern of SDTs with an offset of about 10 dB,
except for the SSN masker in the male target and male masker combination (lower panel of Fig. 4).
Predictions made by the mr-sEPSM showed the overall
smallest RMSE for the combination of male target and male
masker. The release from masking was generally underestimated by about 10 dB (5 dB for the male target and male
masker) for the SSN-based maskers, but the AFS-SSN
SRT50 was very well met. The SRT50 for NV-ISTS deviated by only 1.7 dB and all other SRT50 predictions for the
speech-like maskers were also similar to the data from the
listening experiments. The largest RMSE for the middle and
lower panel was obtained with the STOI model. The release
from masking was underestimated by about 10 dB and masking effects of the speech-like maskers were generally overestimated. This was true for both combinations with the same
gender for target and masker.
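The RMSE used to compare model predictions with the measured thresholds can be computed as sketched below. The threshold values in the example are hypothetical placeholders and not data from this study, and the function name `rmse` is an illustrative choice.

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error between two equally long sequences (in dB)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical SRT50s in dB SNR for four maskers (not data from the study).
observed_srt = [-8.0, -17.0, -14.0, -9.0]
predicted_srt = [-8.0, -12.0, -11.0, -9.0]
print(rmse(predicted_srt, observed_srt))
```

Because the error is squared before averaging, a single masker with a large deviation dominates the RMSE, which is worth keeping in mind when comparing models across panels.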
All data in Fig. 4 are shown without standard deviations,
as these were very different for the models. For the SII and
ESII model, the errors were about ±4 dB for the SSN-based
and ±10 dB for the speech-like maskers. This was the case
for all three gender combinations of target and masker.
When real sentences were used as input (ESIIsen) the errors
for the SSN-based maskers rose slightly to ±6 dB, but
remained the same as for the ESII for speech-like maskers.
Errors for the mr-sEPSM model were in the range of
0.2%–23%, where the lowest errors occurred for very low
SNRs and largest errors in the range between SNRs of 5
and 15 dB. For the combinations with the same gender the
maximal errors were slightly larger (up to 28%), whereas
errors in general were smaller for the speech-like maskers
(up to 20%). The assumed errors for the STOI predictions
were ±0.03 and ±0.06 in STOI units, which corresponded
to ±12 dB for the SSN-based and about ±23 dB for the
speech-like maskers.
VI. DISCUSSION
The SSN and AFS-SSN maskers showed the highest intelligibility and detection thresholds. SRT50s, SRT80s, and
SDTs for these two maskers did in most cases not differ significantly from each other. The AFS-SSN had the most incoherent modulations across frequency of all modulated
maskers (see lower right panel of Fig. 1). SRTs for SAM-SSN and BB-SSN were lower than for AFS-SSN and showed
a statistically significant release from masking for these
coherently modulated maskers. In addition to across-
side of lower panel in Fig. 3), this SRT was not significantly
different from the SSN-based masker SRTs. In this case,
some listeners showed intelligibility thresholds above 0 dB,
causing the large standard deviations in Fig. 3. Interestingly,
this was not observed for the female target and female single
talker masker, although the gender of target and masker
were the same, too.
Generally, SRT80s and SRT50s showed a parallel pattern for all gender combinations of target and masker. The
offset between SRT80s and SRT50s was always about 4 dB
for SSN-based maskers and larger (about 6 dB) for speech-like maskers, reflecting the shallower psychometric function
typically found for speech maskers (MacPherson and
Akeroyd, 2014).
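The link between this offset and the slope of the psychometric function can be made explicit under the assumption of a logistic function, p(L) = 1 / (1 + exp(-4 s (L - SRT50))), where s is the slope at the 50% point. This functional form is an assumption made for illustration, and the function name is hypothetical. Setting p = 0.8 gives SRT80 - SRT50 = ln(4) / (4 s), so a larger offset implies a shallower slope.

```python
import math

def slope_from_offset(offset_db, p_high=0.8):
    """Slope s (proportion per dB) at the 50% point of a logistic function
    p(L) = 1 / (1 + exp(-4 * s * (L - SRT50))), given the level difference
    between the p_high point and the 50% point."""
    logit = math.log(p_high / (1.0 - p_high))  # ln(4) for p_high = 0.8
    return logit / (4.0 * offset_db)

for label, offset in [("SSN-based", 4.0), ("speech-like", 6.0)]:
    s = slope_from_offset(offset)
    print(f"{label}: offset {offset} dB -> slope {100 * s:.1f} %/dB")
```

Under this assumption, offsets of about 4 dB and 6 dB correspond to slopes of roughly 9 %/dB and 6 %/dB, respectively.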
C. Relation of SRTs and SDTs
still allows an estimation of the combined effects of modulation frequency selective AMM caused by modulated
maskers, spectral coherence of AMM, and the effect of IM.
D. Model predictions
FIG. 5. Time-averaged SNRenv outputs (in dB) of the mr-sEPSM across auditory and modulation filters for the CLUE and OLSA speech material in
the ISTS and SAM-SSN maskers. The different shades of grey indicate
SNRenv in the auditory and modulation filters. It is visible that the CLUE
material shows higher SNRenv values in high auditory and high modulation
filters, whereas the OLSA material has lower values in these filters.
Consequently, the ideal observer stage in the model obtains a lower SNRenv for
the OLSA speech material, which might explain why predictions fail for this
particular speech material.
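Female masker spectrum (upper panel of Fig. 6)
                   k        q      m     σs
parameter set 1    0.351    0.5    50    0.6
parameter set 2    0.6      0.5    50    0.6
Male masker spectrum (lower panel of Fig. 6)
                   k        q      m     σs
parameter set 1    0.655    0.5    50    0.8
parameter set 2    0.715    0.5    50    0.8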
FIG. 6. Predictions of the mr-sEPSM model when using a different reference than the SRT50 in the SSN masker. Predictions were performed for the
combination of male target speech and female or male masker spectrum,
respectively. The experimental data are depicted with open, the model predictions with closed symbols. Parameter set 1 was chosen to match the SSN,
thus data from Fig. 4 are replotted. Parameter set 2 was chosen as to match
the AFS-SSN for each combination. For the upper panel (female masker)
the RMSE decreased greatly when a reference other than the SRT of the
SSN was used. For the lower panel (male masker) this is not the case. Here
the three RMSEs do not differ much from one another. This was presumably the
case, because parameter set 1 provided small RMSE to begin with.
addressed with different speech-like maskers, somewhat different from studies such as Brungart et al. (2001), where IM
was studied with identical target and interfering talkers, or
with identical masking speech material. Moreover, in these
studies, target and masker sentences started at the same time,
causing a temporal overlay of the two signals. Thus, potentially beneficial masker gaps were shortened, although words
from both sentences were not perfectly aligned. Nevertheless,
the current study can draw conclusions on IM, if differences
between observed data and model predictions are interpreted
as IM effects which are not captured in the models.
It can be hypothesized that the removal of the fundamental frequency had no influence on IM, given that observed
SRTs for noise-vocoded and intact speech-like maskers were
not significantly different (Fig. 3). This is in contrast to
Rosen et al. (2013) where substantial differences occurred
for noise-vocoded maskers and natural speech. That study
found that natural speech is the most effective masker. The
findings of the current study, however, do not suggest a
strong influence of fundamental frequency information on
IM. Regarding Shinn-Cunningham (2008), this lack of difference supports the assumption that fundamental frequency is
not the dominant factor in object formation as long as there
are other signal aspects that differ between target and masker.
Interestingly, the model predictions by the ESII and
mr-sEPSM yielded SRT50s that were slightly higher for the
noise-vocoded maskers.
The lack of significant differences between SRTs and
SDTs for intact and noise-vocoded speech-like maskers in
Fig. 3 could also be caused by the current target sentence
material, which leaves only little room for uncertainties and
thus IM in general. The OLSA sentence material is very
structured and hence predictable, so listeners might know
the target sentence quite well and can therefore concentrate
on the target material and ignore the masker. This would
actually lead to a de-masking effect and could explain the
similar thresholds for all speech-like maskers. A greater variance between the individual maskers could appear in more
realistic settings, e.g., when the beginning of the sentence is
unclear in timing, when the target material itself is more irregular
(no matrix sentence tests), or when the maskers themselves are
more realistic (real environment recordings).
In contrast to the described masking aspects and models
there is another approach to describe masking, based on salient time-frequency segments of the auditory signal representation. The concept of time-frequency segments has
recently come up in the field of computational auditory scene
analysis (CASA), where so-called glimpses (Cooke, 2006)
are used for a representation of the dominating (in terms of
SNR) source in the mixture of signal and background noise.
A glimpse can thus be defined as a spectro-temporal region
where target speech is least affected by the masker. Due to
the redundant information of speech across the spectro-temporal plane, a sparse distribution of glimpses is often
enough for speech perception (Cooke, 2006). Brown and
Wang (2005) proposed that SRTs can be derived from
glimpses and that the usage of glimpses can often sufficiently explain the perception of a signal. A glimpsing
approach could be seen as a generalized analysis combining
elements of the classic energetic and amplitude modulation masking by considering short-time SNRs in the time-frequency plane. Conceptually, even the aspect of informational masking could be incorporated in terms of processing
efficiency of the provided (time-frequency) information.
This would yield different intelligibility scores for comparable time-frequency distributions of target and masker
depending on the context of the masking situation.
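A glimpse count of this kind can be sketched as below. The sketch assumes that spectro-temporal energy representations of target and masker are available separately as arrays of equal shape. The 3-dB local SNR criterion and the function name `glimpse_proportion` are illustrative assumptions rather than details taken from the text above.

```python
import numpy as np

def glimpse_proportion(target_energy, masker_energy, threshold_db=3.0):
    """Fraction of time-frequency cells in which the local SNR exceeds
    threshold_db. Inputs: non-negative energy arrays (channels x frames)."""
    target_energy = np.asarray(target_energy, dtype=float)
    masker_energy = np.asarray(masker_energy, dtype=float)
    if target_energy.shape != masker_energy.shape:
        raise ValueError("target and masker must have the same shape")
    eps = 1e-12  # avoids division by zero and log of zero
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (masker_energy + eps))
    glimpses = local_snr_db > threshold_db
    return float(glimpses.mean()), glimpses
```

The returned boolean mask marks the cells in which the target dominates. A processing-efficiency term, as suggested above for informational masking, would then act on this mask rather than on the signals themselves.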
The current data set provides a systematic approach to
quantify masking effects in monaural speech processing and
might provide a helpful benchmark for (joint) psychoacoustic, SI, and CASA model development. The maskers are publicly available (Medizinische Physik, Universität
Oldenburg, 2016) in combination with some target sentences
(of the Oldenburger Satztest). The original ISTS is available
at EHIMA (2011).
VII. SUMMARY AND CONCLUSIONS
predicted the influence of amplitude modulation masking, despite problems with the model calibration.
(5) Comparison of speech intelligibility with speech detection and model data allows qualitative and quantitative
statements, regarding the three masking effects:
Qualitatively, energetic masking appears to have the
largest influence on speech intelligibility and speech
detection, followed by amplitude modulation masking
and informational masking in the current study.
Comparison of the ESII (ESIIsen) model predictions and
observed speech detection data with the observed speech
reception thresholds suggests an amount of amplitude
modulation masking of at least 3–4 dB and an additional
6–7 dB of informational masking for the speech-like
maskers in the current study.
ACKNOWLEDGMENTS
ANSI (1969). S3.5, Methods for the Calculation of the Articulation Index
(American National Standards Institute, New York).
ANSI (1997). S3.5-1997, Methods for the Calculation of the Speech
Intelligibility Index (American National Standards Institute, New York).
Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2005). The effect of spatial separation on informational masking of speech in normal-hearing and
hearing-impaired listeners, J. Acoust. Soc. Am. 117(4), 2169–2180.
Barker, J., and Cooke, M. (2007). Modelling speaker intelligibility in
noise, Speech Commun. 49(5), 402–417.
Brand, T., and Kollmeier, B. (2002). Efficient adaptive procedures for
threshold and concurrent slope estimations for psychophysics and speech
intelligibility tests, J. Acoust. Soc. Am. 111(6), 2801–2810.
Bregman, A. S., Liao, C., and Levitan, R. (1990). Auditory grouping based
on fundamental frequency and formant peak frequency, Can. J. Psychol.
44(3), 400–413.
Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of
research on speech intelligibility in multiple-talker conditions, Acta
Acust. united Acust. 86(1), 117–128.
Brown, G., and Wang, D. (2005). Separation of speech by computational
auditory scene analysis, in Speech Enhancement, edited by J. Benesty, S.
Makino, and J. Chen (Springer, New York), pp. 371–402.
Brungart, D. S. (2001). Informational and energetic masking effects in the
perception of two simultaneous talkers, J. Acoust. Soc. Am. 109,
1101–1109.
Brungart, D. S., and Simpson, B. D. (2007). Cocktail party listening in a
dynamic multitalker environment, Percept. Psychophys. 69(1), 79–91.
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001).
Informational and energetic masking effects in the perception of multiple
simultaneous talkers, J. Acoust. Soc. Am. 110, 2527–2538.
Cooke, M. (2006). A glimpsing model of speech perception in noise,
J. Acoust. Soc. Am. 119, 1562–1573.
Lutfi, R. A., Gilbertson, L., Heo, I., Chan, A., and Stamas, J. (2013). The
information-divergence hypothesis of informational masking, J. Acoust.
Soc. Am. 134(3), 2160–2170.
MacPherson, A., and Akeroyd, M. A. (2014). Variations in the slope of the
psychometric functions for speech intelligibility: A systematic survey,
Trends Hear. 18, 1–26.
Medizinische Physik, Universität Oldenburg (2016). Database of maskers
with varying amounts of spectro-temporal speech features,
http://www.uni-oldenburg.de/mediphysik-akustik/mediphysik/downloads/ (Last
viewed June 30, 2016).
Meyer, R. M., and Brand, T. (2013). Comparison of different short-term
speech intelligibility index procedures in fluctuating noise for listeners
with normal and impaired hearing, Acta Acust. Acust. 99(3), 442–456.
Micheyl, C., Arthaud, P., Reinhart, C., and Collet, L. (2000). Informational
masking in normal-hearing and hearing-impaired listeners, Acta Otolaryngol. 120(2), 242–246.
Piechowiak, T., Ewert, S. D., and Dau, T. (2007). Modeling comodulation
masking release using an equalization-cancellation mechanism, J. Acoust.
Soc. Am. 121, 2111–2126.
Pollack, I. (1975). Auditory informational masking, J. Acoust. Soc. Am.
57, S5.
Rhebergen, K. S., and Versfeld, N. J. (2005). A Speech Intelligibility
Index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners, J. Acoust. Soc. Am.
117(4), 2181–2192.