
Monaural speech intelligibility and detection in maskers with

varying amounts of spectro-temporal speech features


Wiebke Schubotz, Thomas Brand, Birger Kollmeier, and Stephan D. Ewerta)
Medizinische Physik and Cluster of Excellence Hearing4all, Universität Oldenburg, D-26111 Oldenburg, Germany

(Received 27 May 2015; revised 4 March 2016; accepted 13 June 2016; published online 21 July
2016)
Speech intelligibility is strongly affected by the presence of maskers. Depending on the spectro-temporal structure of the masker and its similarity to the target speech, different masking aspects can occur which are typically referred to as energetic, amplitude modulation, and informational masking. In this study speech intelligibility and speech detection were measured in maskers that vary systematically in the time-frequency domain from steady-state noise to a single interfering talker. Male and female target speech was used in combination with maskers based on speech for the same or different gender. Observed data were compared to predictions of the speech intelligibility index, extended speech intelligibility index, multi-resolution speech-based envelope-power spectrum model, and the short-time objective intelligibility measure. The different models served as analysis tools to help distinguish between the different masking aspects. Comparison shows that overall masking can to a large extent be explained by short-term energetic masking. However, the other masking aspects (amplitude modulation and informational masking) influence speech intelligibility as well. Additionally, it was obvious that all models showed considerable deviations from the data. Therefore, the current study provides a benchmark for further evaluation of speech prediction models. © 2016 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4955079]
[CGC]

Pages: 524–540

I. INTRODUCTION

Speech is one of the most important ways for human


communication. However, in everyday life speech signals
are often perceived within a background noise (masker) and
speech intelligibility (SI) can be severely hampered. In such
listening conditions, a number of binaural and monaural
signal properties can affect speech intelligibility. While binaural cues depend on the spatial distribution and the stimulus
properties of the interfering sounds, monaural cues only
depend on the latter. These monaural masker properties can
include frequency content, amplitude modulations, and duration of temporal gaps within the masker (Brungart, 2001).
Thus, monaural speech intelligibility in a background noise
depends largely on the spectro-temporal structure of the
masker, so that the degree to which speech perception is
influenced can hardly be attributed to one single masking
effect (Brungart, 2001; Durlach et al., 2003a; Stone et al.,
2012). Considering masking of speech in a background noise
(monaural or diotic), several masking effects have been
described in the literature, such as energetic masking, amplitude modulation masking, and informational masking. They
have been used to describe masked thresholds or to motivate
models of speech reception, although their relative role and
possible partial redundancies are not entirely clarified yet.
Further insight might be provided by systematically assessing the effect of different spectro-temporal masker features
on speech intelligibility and speech detection.

a) Electronic mail: stephan.ewert@uni-oldenburg.de


Energetic masking (EM) refers to spectro-temporal


regions where the noise energy is larger than the target
energy (Barker and Cooke, 2007). In this case the response
within the auditory periphery is mainly caused by the masker
signal (Stone et al., 2012). Thus, EM can be described by the
speech-to-noise (or speech-to-masker) ratio at the output of
auditory filters (Durlach et al., 2003a). Classically, stationary
noise is thought of as an energetic masker (Arbogast et al.,
2005; Barker and Cooke, 2007), but there are conceptional
problems in this assumption. Stone et al. (2012) argued that
a stationary noise mainly acts as a modulation masker due to
its intrinsic modulations. Following Stone et al. (2012), only
sinusoids, spectrally separated to avoid beating, act as pure
energetic maskers. Nevertheless, the classical concept of
energetic masking can successfully describe speech reception in stationary maskers based on per band long-term
signal-to-noise ratios (SNR) as in the articulation index (AI)
(ANSI, 1969) and speech intelligibility index (SII) (ANSI,
1997).
Amplitude modulation masking (AMM) occurs when
there is an interaction between the temporal modulations in
the target signal and the masker. For example, Dubbelboer
and Houtgast (2008) showed that the detection of target
modulations is more difficult when masker modulations are
present. In psychophysical experiments Houtgast (1989) and
Ewert and Dau (2000) showed that amplitude modulations
are masked in an envelope-frequency selective manner
which can be accounted for by the concept of modulation filters. For speech perception, Dubbelboer and Houtgast (2008)
proposed a description of AMM by the signal-to-noise
ratio in the modulation domain at the output of auditory


modulation filters. AMM can generally be caused by fluctuating maskers or even by intrinsic modulations of a stationary masker which interfere with those of the target within a
certain auditory filter (e.g., Stone et al., 2012). In contrast to
AMM, coherent across-frequency amplitude modulations
(co-modulation) in the masker can reveal entire parts of the
target speech. In this case dip listening (Bronkhorst, 2000;
Lorenzi et al., 2006) comes into play which is most prominent for low modulation rates (usually below 8 Hz). The
observed masking release in fluctuating maskers can be conceptually compared to the psychophysical phenomenon of
co-modulation masking release (CMR) (e.g., Hall et al.,
1984), where a release from masking for a pure tone in
noise is caused by the coherence of modulations in many
frequency bands. In addition, Howard-Jones and Rosen
(1993) showed that a masking release occurs also for modulations that are not present over the entire frequency spectrum. Taken together, in contrast to AMM, dip listening
describes the use of reduced short-term EM in temporal
troughs of the masker or likewise in peaks of the target.
Informational masking (IM) usually refers to masking
that does not occur in the auditory periphery, but in more
central regions of the auditory system (Durlach et al.,
2003b). Pollack (1975) describes informational masking as
the uncertainty in the trial-to-trial variation in the noise
waveform in psychoacoustic measurements, whereas for
Brungart et al. (2001) the term holds for interfering talkers
and speech-on-speech masking when the masker is a
similar-sounding distractor (e.g., same gender). IM can
also be prompted by factors such as speaker spectrum, sentence structure, and semantic content of the target signal,
although these aspects also influence EM and AMM. Shinn-Cunningham (2008) claims that IM is governed by the
aspects of object formation and object selection, whereas the
latter is influenced by the amount of attention that is used to
form the particular auditory object. But generally, IM must
be clearly separated from general inattention toward the task
(Durlach et al., 2003a). Lutfi (1990) even proposed a calculation for informational masking based on the statistical
structure of waveforms in a tone detection experiment and
found the amount to be about 22% within maskers that are
thought to be energetic maskers only. Lutfi et al. (2013) and
Durlach et al. (2003a,b) claim that two aspects rule informational masking: uncertainty of the masker and similarity
between target and masker. These aspects were elaborated
on in Lutfi et al. (2013), but also overlap with the definition
of EM and AMM by Stone et al. (2012). Micheyl et al.
(2000) describe IM to be present when masker and target are
similar in terms of temporal coherence, harmonic structure,
and frequency range. An alternative assumption can be to
take IM as those masking effects that cannot be described by
speech intelligibility models which consider EM and AMM.
Overall IM is less clear cut and is often brought up if speech
reception thresholds cannot be explained by EM and AMM.
The concepts of EM and AMM have been successfully
used in speech intelligibility models to predict speech reception thresholds (SRTs) in various masking conditions. The
AI and SII use band importance functions for the different
analysis channels and thus provide a weighted measure of
energetic masking. The extended speech intelligibility index


(ESII) by Rhebergen et al. (2006) uses the same concept in
short time frames and can therefore cope with fluctuating
maskers where listeners are able to listen in the dips. The
influence of amplitude modulations was first considered in
the speech transmission index (Steeneken and Houtgast,
1980), where the reduction of the modulations of clean
speech due to noise and reverberation was measured. It was
assumed that a reduction of the modulations leads to a
decrease in speech intelligibility. More recently, a sub-band
SNR analysis was also used in the envelope domain
(Dubbelboer and Houtgast, 2008; Jørgensen and Dau, 2011;
Jørgensen et al., 2013) implementing the concept of AMM
for speech intelligibility predictions.
So far it is still not clear what the relative role of the
above masking aspects is in arbitrary maskers with different
spectro-temporal features. The comparison of data from listening experiments and model predictions could therefore
help to shed light on the different masking aspects, given that
models analyze certain known signal characteristics and are
thus sensitive to certain masking aspects only. Furthermore, a
comparison of data from listening experiments and model
predictions can test the validity of the assumed processing
stages implemented in a speech prediction model. Thus, models can serve as analysis tools to help separate the above masking aspects from one another.
Another open question is the relation of SI and the
masked threshold of the speech signal itself. While SI models conceptually relate SI to masked thresholds, it is unclear
how SI and masked threshold are related in empirical data as
a function of spectro-temporal masker features. By comparing speech detection thresholds (SDTs) and SRTs, it is,
according to Arbogast et al. (2005), in principle possible to
quantitatively account for masking aspects other than EM.
This would, for example, allow a quantitative estimation of
the much discussed aspect of IM.
The aim of this study was therefore to assess speech
intelligibility and speech detection as a function of the
spectro-temporal structure of the masker. A systematic variation of masker properties in the time-frequency domain
for same and different gender talkers was used to help
understand the relative contribution of the described masking aspects (EM, AMM, and IM). Monaural SRTs and
SDTs were measured with the same subjects for eight
maskers ranging from stationary noise to single interfering
talkers to systematically assess the role of the different
masking aspects. Specific features in the time-frequency
domain were changed separately in the maskers, while the
long-term power spectrum was kept identical. SRTs were
measured at 50% and 80% SI, allowing investigation of the
masking aspects at three performance levels from speech
detection to 80% correct reception. Four SI models were
used as analysis tools to help separate the masking aspects.
Measured SI and speech detection data were compared to
predictions of the SII, the ESII, the multi-resolution speech
envelope power spectrum model (mr-sEPSM) by Jørgensen
et al. (2013), and the short-time objective intelligibility
measure (STOI) by Taal et al. (2011).

II. METHODS
A. Subjects

Eight listeners, aged 23–34 years, participated in the


measurements on speech intelligibility and speech detection.
They all had audiometric thresholds of less than 20 dB hearing level (HL) at octave frequencies between 125 Hz and
8 kHz. The listeners were naive to the target material and
received an hourly compensation for their participation.
B. Stimuli

The stimuli presented in the measurements consisted of


target sentences from the Oldenburger Satztest (OLSA)
(Wagener et al., 1999) embedded in a background masker.
The OLSA sentence material consists of grammatically correct German sentences with no semantic context that follow a controlled sentence structure (noun, verb, numeral, adjective, object), e.g., "Peter kauft achtzehn nasse Schuhe" ("Peter buys eighteen wet shoes"). Target sentences were
derived from a base list of 10 words per category and were
recorded with co-articulation (see Wagener et al., 1999 for
details). The OLSA contains altogether 120 target sentences
arranged in 45 lists of 20 sentences each. All lists provide
the same speech intelligibility. Three gender combinations
of target and masker spectrum were used throughout the
experiments: male/female, female/female, and male/male.
For the female/female combination, the female version of
the OLSA was used (Wagener et al., 2014). Both male and
female target talkers had accents that were standard German.
The frequency range of the target material was limited to
12 kHz, the sampling frequency was 44.1 kHz. Overall eight
background maskers were generated such that their spectrotemporal characteristics changed while maintaining their
long-term spectrum. They covered a systematic change from
speech-shaped stationary noise to a single talker. The eight
maskers can be divided into two groups: four maskers were
based on a stationary speech-shaped noise (see Fig. 1), the
other four were speech-like maskers (see Fig. 2).

FIG. 1. Spectrograms of the four SSN-based maskers that were used in the
speech intelligibility and detection measurements. The basis was a SSN,
which has the same average long-term spectrum as the ISTS (Holube et al.,
2010). For the SAM-SSN condition, the SSN was fully modulated with an
8-Hz sinusoid, which resulted in a regular and coherent modulation. For the
BB-SSN condition the modulations were derived from intact broad-band
speech (see text for further details) and were thus irregular, but coherent.
For the AFS-SSN the SSN was split into 32 frequency channels to generate
an across-frequency shifted SSN. Four adjacent channels were multiplied
with the same sequence from a broad-band speech envelope. The sequence
that was used for the next four adjacent channels was from another section
in time of the broad-band speech envelope, thus modulations patterns are
shifted across frequencies. This resulted in irregular and incoherent modulations across the spectrum.

1. Speech-shaped noise based maskers

The basis was a stationary speech-shaped noise with flat modulation spectrum (SSN) (upper left panel of Fig. 1) that was derived from the International Speech Test Signal (ISTS) (Holube et al., 2010, see below for details), which consists of speech uttered by six different female talkers. The SSN was produced by a fast Fourier transformation of the ISTS, followed by randomization of the phase of the coefficients and inverse Fourier transformation. Thus, ISTS and SSN had the same long-term spectrum. For a second masker, the SSN was fully sinusoidally amplitude modulated (SAM) with an 8-Hz sinusoid to introduce regular temporal gaps in the masker. This masker was termed SAM-SSN and is depicted in the upper right panel of Fig. 1. The SSN was also multiplied with the Hilbert envelope of a broad-band speech signal, introducing irregular temporal gaps and modulations as in intact speech. The underlying broad-band speech signal was a sequence of ten randomly selected OLSA sentences from the male target material. Temporal gaps between and within sentences were shortened to approximately 150 ms. The Hilbert envelope was low-pass filtered to 64 Hz with a 4th-order Butterworth filter. This masker was termed BB-SSN (lower left panel of Fig. 1). The resulting amplitude modulations in the SAM- and BB-SSN were coherent across all auditory channels (also referred to as across-channel co-modulation in the following).
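A minimal sketch of the three SSN-based manipulations described above (phase randomization of the ISTS spectrum, full 8-Hz sinusoidal modulation, and multiplication with a low-pass filtered broad-band speech envelope), assuming numpy/scipy and using white noise as a stand-in for the ISTS and OLSA material; function names are illustrative, not the authors' implementation:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    fs = 44100  # sampling rate used for all stimuli (Sec. II B)

    def make_ssn(ists):
        # Speech-shaped noise: FFT of the ISTS, randomized phases, inverse FFT.
        spec = np.fft.rfft(ists)
        phases = np.exp(1j * 2 * np.pi * np.random.rand(len(spec)))
        return np.fft.irfft(np.abs(spec) * phases, n=len(ists))

    def make_sam_ssn(ssn, fmod=8.0):
        # SAM-SSN: fully (m = 1) sinusoidally amplitude-modulated SSN.
        t = np.arange(len(ssn)) / fs
        return ssn * (1.0 + np.sin(2 * np.pi * fmod * t))

    def make_bb_ssn(ssn, speech):
        # BB-SSN: SSN multiplied with the low-pass filtered (64 Hz, 4th-order
        # Butterworth) Hilbert envelope of a broad-band speech sequence.
        env = np.abs(hilbert(speech))
        sos = butter(4, 64.0 / (fs / 2), btype='low', output='sos')
        env = np.maximum(sosfiltfilt(sos, env), 0.0)
        n = min(len(ssn), len(env))
        return ssn[:n] * env[:n]

    # illustrative use with white noise standing in for the ISTS / OLSA material
    ists = np.random.randn(10 * fs)
    speech = np.random.randn(10 * fs)
    ssn = make_ssn(ists)
    sam_ssn = make_sam_ssn(ssn)
    bb_ssn = make_bb_ssn(ssn, speech)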

FIG. 2. Spectrograms of the speech-like maskers ISTS, ST, and their noise-vocoded versions. Noise-vocoding was performed with a 32-auditory-channel vocoder within the frequency range of 50 Hz to 12 kHz. The spectral
weighting of each filter was maintained. It should be noted that the masker
sequences in the four panels are not identical.

Incoherent amplitude modulations across the auditory channels were introduced in the across-frequency shifted SSN
masker (AFS-SSN, lower right panel of Fig. 1). This masker
was created by filtering the SSN into 32 auditory channels
within a frequency range of 50 Hz–12 kHz using a 4th-order
Gammatone filter bank with 1-ERB (equivalent rectangular
bandwidth) spacing of the auditory filters. Four adjacent
channels were then modulated with the same envelope. The
envelopes were random sections in time from the same
low-pass filtered Hilbert envelope used for the BB-SSN. As
a consequence, coherent modulations were introduced only
in those parts of the masker spectrum that belong to the four
adjacent auditory filters. Overall there were eight different
randomly time-shifted modulations applied to the 32 bands.1
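A corresponding sketch of the across-frequency shift; ERB-spaced Butterworth band-pass filters stand in for the 4th-order gammatone bank used in the original processing, and the envelope-shifting details are assumptions:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    fs = 44100

    def erbspace(f_lo, f_hi, n):
        # n center frequencies equally spaced on the ERB-number scale
        # (Glasberg and Moore: E = 21.4 log10(1 + 0.00437 f))
        e = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
        f_of = lambda e_: (10.0 ** (e_ / 21.4) - 1.0) / 0.00437
        return f_of(np.linspace(e(f_lo), e(f_hi), n))

    def make_afs_ssn(ssn, speech_env, n_bands=32, group=4):
        # Across-frequency shifted SSN: every group of four adjacent bands gets a
        # different, randomly time-shifted section of the same low-pass filtered
        # broad-band speech envelope.
        fc = erbspace(50.0, 12000.0, n_bands)
        out = np.zeros_like(ssn)
        env = np.ones_like(ssn)
        for i, f0 in enumerate(fc):
            erb = 24.7 + 0.108 * f0                     # ERB bandwidth at f0
            sos = butter(2, [(f0 - erb / 2) / (fs / 2), (f0 + erb / 2) / (fs / 2)],
                         btype='bandpass', output='sos')
            band = sosfiltfilt(sos, ssn)
            if i % group == 0:                          # new envelope section every 4 bands
                shift = np.random.randint(len(speech_env))
                env = np.roll(speech_env, shift)[:len(ssn)]
            out += band * env
        return out

    # illustrative use: white noise as SSN, a random positive signal as speech envelope
    ssn = np.random.randn(5 * fs)
    speech_env = np.abs(np.random.randn(5 * fs))
    afs_ssn = make_afs_ssn(ssn, speech_env)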
Since the basis for the first four maskers was the ISTS, all
four maskers had the same long-term spectrum as the (female)
ISTS. For those maskers, where a male speech spectrum was
used, the basis was a transformed ISTS. The STRAIGHT algorithm (Kawahara et al., 2008) was used to lower the fundamental frequency (F0) and to lengthen the vocal tract of the original
ISTS signal in such a way that the mean fundamental frequency
of the transformed ISTS resembled that of the original male
OLSA target speaker (F0 of 110 Hz). The vocal tract lengthening was performed empirically to find a realistic factor for a natural-sounding male ISTS. Otherwise, there was no attempt to
match the long-term spectrum to that of the male target talker.
The transformed ISTS was then used to derive a SSN with a
male speech spectrum, and from that the other three modulated
maskers were derived as described earlier. Thus, for the case of
a male target talker, the masker spectra of the male ISTS and
the four SSN-based maskers were similar to the target material.
This was not the case for the combinations male target and
female maskers as well as female target and female maskers.
2. Speech-like maskers

Four speech-like maskers were generated, two using


intact speech and two using noise-vocoded speech (right and
left panels of Fig. 2). The intact speech-like maskers were
the ISTS (Holube et al., 2010) and a single talker taken from
a study by Hochmuth et al. (2014). The ISTS represents
non-sense speech originating from a mixture of six female
talkers with different languages. The single talker (ST) was a
sequence of ten OLSA sentences spoken by a female talker
that was not identical with the target talker from the female
OLSA version. In the noise-vocoded maskers (NV-ISTS,
NV-ST) the fine structure of the intact speech was removed,
while maintaining the energy in the individual frequency
bands. The vocoding was performed using a 32-channel vocoder (based on a 4th-order Gammatone filter bank) with 1-ERB filtering in the range of 50 Hz–12 kHz. The Hilbert envelope was extracted from each analysis channel and used to
modulate random white noise. Before recombining the individual channels, they were filtered with the same analysis filters in order to restrict the filter output to the appropriate
frequency range. This was done for the combination of male
target and female masker spectrum. For the same gender
combination of target and masker (female/female, male/
male), the parameters of the noise-vocoding were slightly
altered. There were only 16 channels with 2-ERB spacing


and the Hilbert envelope was low-pass filtered to 32 Hz with
a 2nd-order Butterworth filter, to further remove possible
temporal pitch information. All other parameters were
unchanged. The overall power spectrum was the same for
the original and the noise-vocoded masker sequence.
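A minimal noise-vocoder sketch along these lines; Butterworth filters again stand in for the gammatone analysis bank, and the geometric channel spacing only approximates the 1-ERB (or 2-ERB) spacing described above:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert

    fs = 44100

    def noise_vocode(speech, centers, env_lp_hz=None):
        # Per analysis band: extract the Hilbert envelope (optionally low-pass
        # filtered), impose it on white noise, re-filter with the same analysis
        # filter to restrict it to the band, and sum over bands.
        out = np.zeros_like(speech)
        for f0 in centers:
            erb = 24.7 + 0.108 * f0                     # roughly 1-ERB-wide bands
            sos = butter(2, [(f0 - erb / 2) / (fs / 2), (f0 + erb / 2) / (fs / 2)],
                         btype='bandpass', output='sos')
            band = sosfiltfilt(sos, speech)
            env = np.abs(hilbert(band))
            if env_lp_hz is not None:                   # e.g., 32 Hz for the same-gender maskers
                lp = butter(2, env_lp_hz / (fs / 2), btype='low', output='sos')
                env = np.maximum(sosfiltfilt(lp, env), 0.0)
            out += sosfiltfilt(sos, np.random.randn(len(speech)) * env)
        return out

    # illustrative use with geometrically spaced centers standing in for ERB spacing
    centers = np.geomspace(50.0, 12000.0, 32)
    vocoded = noise_vocode(np.random.randn(5 * fs), centers)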
For the case of male masker spectra, the female ISTS
and the female single talker had to be substituted: the
ISTS was transformed into a male ISTS version with the
STRAIGHT algorithm (as described earlier). The male single talker masker was taken from a study by Hochmuth et al.
(2014) and was different from the target talker of the male
OLSA version. While the male target talker's F0 was
110 Hz, the male single talker masker had a slightly higher
F0 of 120 Hz and a slightly higher speaking rate with slightly
shorter inter-word gaps. The single talker masker sentences
were about 20% shorter than the target sentences. The noise-vocoding for the male speech-like talkers was performed as
described above. The average long-term spectrum was
nearly identical for the ISTS, NV-ISTS, and all SSN-based
maskers, as the SSN was derived from the ISTS (female
masker spectrum). This was also the case for the male version of the ISTS, NV-ISTS, and the resulting SSN-based
maskers (male masker spectrum). The spectra of the ST and
NV-ST were the same for each gender and were similar to
those of the other maskers of the respective gender.
C. Apparatus and procedures

Speech intelligibility was measured using the OLSA


target sentences with an adaptive procedure to vary the SNR
(Brand and Kollmeier, 2002). Twenty target sentences were used to obtain the SNR at a given percent correct rate with
standard deviations smaller than 1 dB. The SNR at which
50% and 80% of the presented words in a sentence were
understood correctly was determined and are referred to as
SRT50 and SRT80, respectively. The noise level was fixed
at 65 dB sound pressure level throughout the entire experiment. The measurement was performed as an open test
where listeners repeated the perceived words to the experiment supervisor and were allowed to guess. No feedback
was provided. A different sentence list containing 20 sentences from the OLSA corpus was used for each masker and for
SRT50 and SRT80, resulting in a total of 320 target sentences (20 × 8 × 2). A target sentence could appear more than
once, but never twice in the same list. All listeners received
a different order of test lists for each measurement. Two test
lists were presented to obtain the SRT80 in a cafeteria noise
prior to the actual measurements to familiarize the listeners
with the task and the speech material.
SDTs were assessed using a 1-up-2-down two-interval,
two-alternative forced choice procedure to determine the
SNR with 70.7% correct responses on the psychometric
function (Levitt, 1971). There were two intervals presented
to the listener, one containing a random OLSA sentence embedded in the masker, the other containing the masker only.
The noise token varied for each trial, but was the same for
the two intervals (half-frozen noise). The listeners had to
choose the interval that contained the sentence.
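The tracking rule can be sketched as follows; the step sizes, reversal handling, and the simulated listener are illustrative and not the actual settings of the AFC package used in the experiments:

    import numpy as np

    def one_up_two_down(respond, start_snr=0.0, step=4.0, final_step=1.0, n_reversals=8):
        # Track the 70.7%-correct point (Levitt, 1971): lower the SNR after two
        # consecutive correct responses, raise it after each incorrect response,
        # and average the SNR at the last reversals.
        snr, correct_run, direction = start_snr, 0, 0
        reversals = []
        while len(reversals) < n_reversals:
            if respond(snr):                 # True if the listener picks the target interval
                correct_run += 1
                if correct_run == 2:
                    correct_run = 0
                    if direction == +1:
                        reversals.append(snr)
                    direction = -1
                    snr -= step if len(reversals) < 2 else final_step
            else:
                correct_run = 0
                if direction == -1:
                    reversals.append(snr)
                direction = +1
                snr += step if len(reversals) < 2 else final_step
        return np.mean(reversals[2:])        # discard the first reversals

    # illustrative simulated 2AFC listener with a detection threshold near -25 dB SNR
    rng = np.random.default_rng(1)
    psychometric = lambda snr: 0.5 + 0.5 / (1.0 + np.exp(-(snr + 25.0)))
    sdt = one_up_two_down(lambda snr: rng.random() < psychometric(snr))
    print(f"estimated SDT: {sdt:.1f} dB SNR")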

The duration of the masker was chosen such that the


noise started one second before the target sentence started.
SRTs and SDTs were measured once for the combination
where target material and masker had similar spectra (male/
male, female/female), and twice (test, re-test) for the combination male target and female masker spectrum.
The stimuli were presented monaurally to the right ear
via Sennheiser HD 580 headphones, which were calibrated
with pure tones on a Brüel & Kjær artificial ear and equalized. The measurements took place in a double-walled,
sound-attenuating booth. The sampling frequency of the
stimuli was 44.1 kHz. All measurements were performed
using the AFC-package for MATLAB (Ewert, 2013).
Each listener completed the SRT50, SRT80, and SDT
experiment in random order, first for male/female, then for
female/female and finally for male/male. For each experiment, the order of the eight masker types was Latin-Square
balanced to control for possible learning effects. In summary, each listener measured thresholds in 24 conditions (3
experiments × 8 masker types) for each gender combination
of target and masker spectra.
III. SPEECH INTELLIGIBILITY MODELS

Four speech intelligibility models were used to predict


the SRTs for the same maskers as in the listening experiments. Predictions for SRT80s were omitted, since they
showed the same overall pattern as the SRT50s. Two well
established models were the speech intelligibility index (SII,
ANSI, 1997) and the extended SII (ESII) that was proposed
by Rhebergen and Versfeld (2005) and Rhebergen et al.
(2006). Two more recent models were the mr-sEPSM by
Jørgensen et al. (2013) and the STOI developed by Taal
et al. (2011). For SII (ESII) the implementations from the
Speech Intelligibility Prediction toolbox (SIP-Toolbox)
(Fraunhofer IDMT, Project Group Hearing, Speech and
Audio Technology, 2013; Kollmeier et al., 2011) developed
at the Fraunhofer IDMT were used. For the other models,
code that was publicly available (STOI) and source code provided by the authors (mr-sEPSM) were used.
A. Speech intelligibility index

The SII standard (ANSI, 1997) is a further development


of the AI standard (ANSI, 1969). The signal was split into
21 critical bands, the SNR was calculated in each band and
then multiplied with a band importance function. The
speech-in-noise (SPIN) frequency weighting function from
ANSI (1997, Table B.1, rightmost column) was used
in the model. The sum over all frequency bands yields the
overall SII for the given speech material in the masker. The
input variables for this model were the long-term target and
masker spectra; temporal gaps in the stimulus are not
taken into account. Therefore, a stationary noise, spectrally
matched to the OLSA sentence material was used instead of
the individual target sentences. For the maskers, the long-term spectra of the eight individual masker types were used.
Although the SII provides a measure for speech intelligibility in a given communication system (ANSI, 1997) it does
not provide a speech intelligibility score. In order to predict
SRT50s, the SII at the mean observed SNR (at SRT50) in


the SSN was taken as reference and the model outcome was
matched to this reference SII. SII predictions are expected to
be sensitive mainly to long-term energetic masking, given
that the analysis is based on the long-term SNR in each analysis band. Predicted SRTs are expected to be very similar
across all maskers, as the long-term spectra of all maskers
were similar.
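The core of the SII computation used here can be sketched as follows; the SPIN band-importance weights of Table B.1 are not reproduced, so a uniform placeholder weighting is used, and the level-distortion and threshold terms of the standard are omitted:

    import numpy as np

    def sii(speech_spec_db, noise_spec_db, band_importance):
        # Simplified SII: per-band long-term SNR, limited to +/-15 dB, linearly
        # mapped to a band audibility between 0 and 1, then weighted and summed.
        snr = np.asarray(speech_spec_db) - np.asarray(noise_spec_db)
        audibility = np.clip((snr + 15.0) / 30.0, 0.0, 1.0)
        return float(np.sum(np.asarray(band_importance) * audibility))

    # illustrative call with 21 critical bands and a uniform (placeholder)
    # weighting instead of the SPIN band-importance function
    n_bands = 21
    importance = np.full(n_bands, 1.0 / n_bands)
    print(sii(np.random.randn(n_bands) * 3 - 8, np.zeros(n_bands), importance))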
B. Extended speech intelligibility index

The ESII was proposed by Rhebergen and Versfeld


(2005) and Rhebergen et al. (2006) as an extension to the SII
model in order to account for speech intelligibility in fluctuating maskers. The extension used in the current study
(Rhebergen et al., 2006) incorporates a temporal analysis in
short time frames before further processing and the effect of
forward masking is also included. Thus, the model is capable
of predicting the influence of short-time energetic masking
(also called dip listening; Bronkhorst, 2000), as its
analysis is based on short-time SNRs. For this model, it is
recommended to use a stationary speech shaped noise as the
target (Rhebergen et al., 2006) instead of real sentences
(here, the same speech shaped noise as for the SII was used).
Thus, only the fluctuations of the masker are considered in
the analysis. Again, signals were analyzed within 21 critical
frequency bands (see Rhebergen and Versfeld, 2005) and
partitioned into time frames, ranging from 35 ms for the
lowest band to 9.4 ms for the highest band. Then the conventional SII was calculated within each time frame, and the
bands weighted with the SPIN band-importance function.
These short-time SII values were averaged to yield ESII
model outputs. To derive actual speech intelligibility predictions from the model outputs, the same procedure and reference were used as for the SII predictions. In the current
study, the ESII was run five times with different temporal
masker sequences (the target was always stationary speech
shaped noise) and the predicted SRTs were averaged.
To compare the ESII predictions to a model configuration that takes the actual fluctuations of the target signal into
account, an ESII extension by Meyer and Brand (2013) was
used (referred to as ESIIsen). The ESII calculations were
identical to the calculation proposed by Rhebergen et al.
(2006) using the exact same parameters. However, instead of
the stationary speech shaped noise as target, intact sentences
(sen) from the OLSA corpus were used as target. Thus, the
instantaneous SNR, as calculated from the root-mean-square
values, can be very low due to silent periods in the target
sentences. The resulting model outcome is again an average
over all short-time SII values. The calculations were performed for twenty target sentences, each presented in a different temporal masker sequence. The final model outcomes
were averaged.
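A sketch of the short-time extension, reusing the simplified per-frame SII of the previous block; a single frame length replaces the band-dependent 35–9.4 ms windows, and forward masking is omitted:

    import numpy as np

    def esii(speech_bands, noise_bands, band_importance, fs=44100, frame_ms=12.0):
        # Short-time SII averaged over frames.  `speech_bands` and `noise_bands`
        # are (n_bands, n_samples) arrays of band-filtered signals.
        hop = int(fs * frame_ms / 1000.0)
        n_frames = speech_bands.shape[1] // hop
        vals = []
        for k in range(n_frames):
            sl = slice(k * hop, (k + 1) * hop)
            s_db = 10 * np.log10(np.mean(speech_bands[:, sl] ** 2, axis=1) + 1e-12)
            n_db = 10 * np.log10(np.mean(noise_bands[:, sl] ** 2, axis=1) + 1e-12)
            snr = np.clip(s_db - n_db, -15.0, 15.0)
            vals.append(np.sum(band_importance * (snr + 15.0) / 30.0))
        return float(np.mean(vals))

    # illustrative call: 21 band-filtered one-second signals, uniform weights
    print(esii(np.random.randn(21, 44100), np.random.randn(21, 44100), np.full(21, 1 / 21)))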
C. Multi-resolution speech-based envelope power
spectrum model

To assess the role of amplitude modulation masking, the


mr-sEPSM by Jørgensen et al. (2013) was used. The model
is based on the envelope-power spectrum model (Ewert and

Dau, 2000), developed to account for amplitude modulation


masking. The mr-sEPSM applies a short-time analysis and is
designed to predict speech intelligibility in fluctuating
maskers. Its core element is a modulation filter bank with
nine modulation filters (1 Hz–256 Hz) that analyzes the output of 22 auditory filters ranging from 63 Hz to 8 kHz. The
signal-to-noise ratio of the Hilbert envelope is calculated,
averaged over time and combined in a root-mean-square
manner over all auditory and modulation filters. The resulting SNRenv value is then converted to percent correct values
by the ideal observer stage of the model, using four parameters: the parameter q is independent of the speech material,
q = 0.5 was used here, following Jørgensen et al. (2013).
The other three parameters m, σs, and k represent the size of
the vocabulary used in the observer stage, a value that is
determined by the redundancy of the speech material, and an
experimentally determined value that shifts the psychometric
function. In this study these values were chosen such that the
SRT in the SSN masker was met for each gender combination of target and masker spectrum. Table I shows the corresponding values.
The input for this model was the masker alone and a
target sentence presented in the same masker sequence. The
model calculations were performed for a variety of SNRs to
obtain a psychometric function. At each SNR twenty target
sentences mixed with a different masker were used and the
results were averaged. Unlike in the listening experiments,
the signal level was fixed at 65 dB for the model calculations
as in Jørgensen et al. (2013).
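The envelope-domain SNR at the heart of the mr-sEPSM can be sketched as follows; octave-wide band-pass filters applied to a downsampled Hilbert envelope stand in for the model's modulation filter bank, and the ideal-observer conversion to percent correct is omitted:

    import numpy as np
    from scipy.signal import butter, sosfiltfilt, hilbert, resample_poly

    fs = 44100
    fs_env = 1050      # envelope sampling rate after downsampling (fs / 42)
    mod_bands = [(0.5, 2), (2, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 128), (128, 256)]

    def mod_band_power(x, lo, hi):
        # Normalized AC envelope power of x in one modulation band (octave-wide
        # band-pass filters stand in for the model's modulation filters).
        env = resample_poly(np.abs(hilbert(x)), 1, fs // fs_env)
        sos = butter(2, [lo / (fs_env / 2), hi / (fs_env / 2)], btype='bandpass', output='sos')
        ac = sosfiltfilt(sos, env - env.mean())
        return np.mean(ac ** 2) / (env.mean() ** 2 + 1e-12)

    def snr_env(noisy_speech_band, noise_band):
        # SNRenv of one auditory band: (P_env(S+N) - P_env(N)) / P_env(N) per
        # modulation band, floored at zero and combined root-mean-square-wise.
        vals = []
        for lo, hi in mod_bands:
            p_sn = mod_band_power(noisy_speech_band, lo, hi)
            p_n = mod_band_power(noise_band, lo, hi)
            vals.append(max(p_sn - p_n, 0.0) / (p_n + 1e-12))
        return float(np.sqrt(np.sum(np.array(vals) ** 2)))

    # illustrative use: noise alone versus noise plus a weak "target"
    noise = np.random.randn(5 * fs)
    print(snr_env(noise + 0.3 * np.random.randn(5 * fs), noise))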

TABLE I. Parameter values that were used in the ideal observer stage of the mr-sEPSM for the different gender combinations of target and masker. They were chosen such that the SSN model prediction matched the SSN condition of the empirical data. The values q and m were fixed for all predictions (see Jørgensen et al., 2013), whereas the other two parameters were adjusted for each of the three gender combinations of target and masker spectrum.

Gender combination of target and masker spectrum    σs      q     m    k
Male/female                                         0.351   0.5   50   0.6
Female/female                                       0.4     0.5   50   0.6
Male/male                                           0.655   0.5   50   0.8

D. STOI measure

The STOI measure (Taal et al., 2011) is based on a short-time analysis and subsequent cross correlation of the temporal envelopes of clean and degraded speech signals. The main focus of STOI lies on simplicity. For example, the model does not contain a band importance function as do the SII and ESII. Unlike the other models, STOI makes fewer assumptions on energetic and amplitude modulation masking. The model performs a decomposition of the signal into discrete time-frequency bins. The duration of the temporal envelope frames is 386 ms and provided the best match to the SI data of Kjems et al. (2009). The frequency analysis is performed with 15 one-third octave bands, covering a range from 50 Hz to 4.3 kHz. To compare the model output (in STOI units) to experimental data, a logistic function with the free parameters a and b [see Eq. (8) in Taal et al., 2011] has to be fitted to map STOI units to SI. The parameter sets used for each combination of target and masker spectrum in the current study are given in Table II. Predictions were averaged over twenty target sentences, each presented in a different masker token.

In their study, Taal et al. (2011) claim that STOI works well in case of additive noise. They tested an unprocessed (UN) condition, resembling intact speech degraded by different maskers, indicating that the model can generally account for such a listening situation. For this UN condition, no preprocessing (time-frequency weighting or single channel noise reduction) was applied as for other types of speech degradation in that study. The stimuli from the current study extend the degraded speech conditions, as they provide several types of additive noise. The model predictions were obtained without any preprocessing of the speech and the noisy speech.

TABLE II. The parameters a and b as chosen for the STOI model to fit the model to the thresholds from the listening experiments. The parameters were chosen such that the psychometric function [see Taal et al. (2011), Eq. (8)] for the SSN predictions matched the SSN condition of the empirical data. The parameters were changed for the different gender combinations of target and masker as indicated.

Combination of target and masker spectra    a          b
male/female                                 -26.994    12.550
female/female                               -28.938    13.140
male/male                                   -35.916    17.496
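The mapping from STOI units to intelligibility can be sketched with the logistic function of Eq. (8) in Taal et al. (2011), f(d) = 100/(1 + exp(a·d + b)); the negative sign of a below is an assumption (required for intelligibility to grow with the STOI value d), and the values follow Table II:

    import numpy as np

    # parameters a, b per gender combination (Table II; sign of a assumed)
    PARAMS = {
        "male/female":   (-26.994, 12.550),
        "female/female": (-28.938, 13.140),
        "male/male":     (-35.916, 17.496),
    }

    def stoi_to_intelligibility(d, combination="male/female"):
        # Map a STOI value d (roughly 0..1) to percent correct.
        a, b = PARAMS[combination]
        return 100.0 / (1.0 + np.exp(a * np.asarray(d) + b))

    # illustrative mapping of a few STOI values
    for d in (0.3, 0.5, 0.7):
        print(d, round(float(stoi_to_intelligibility(d)), 1))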

IV. RESULTS

A. Experimental speech reception and detection

Figure 3 shows the measured average SRTs and SDTs


with the corresponding standard deviations. The masker
types are denoted on the abscissa, the measured SRTs
and SDTs on the ordinate. The individual thresholds are
connected with lines to guide the eye. The different gender
combinations of target and masker material are displayed in
the different panels of Fig. 3: the combination male target
and female masker spectrum is shown in the top panel, the
combination of female target and masker spectrum in the
middle panel, and the combination of male target and masker
spectrum in the bottom panel. Open squares represent the
SRT50 while open circles indicate the SRT80. SDTs are
depicted with open triangles. Significant differences in SRTs
within the masker groups (SSN-based on the left-hand side;
speech-like on the right-hand side) and the significance level
are indicated with asterisks in Fig. 3. Here, p < 0.05 is
marked with *, p < 0.01 with **, and p < 0.001 with ***.
Significant differences in SRTs and SDTs across the SSN-based and speech-like maskers are shown in Table III.
TABLE III. Statistically significant differences in speech intelligibility and detection thresholds across the SSN-based and speech-like maskers, indicated by asterisks as in Fig. 3. The upper part of the table displays the significance levels for the gender combination of male target and female masker. The middle and lower parts display the significant differences for the measurements with the same gender spectra. [Table layout: within each gender combination and for each measure (SRT80, SRT50, detection), rows list the SSN-based maskers (SSN, SAM-SSN, BB-SSN, AFS-SSN) and columns list the speech-like maskers (NV-ISTS, ISTS, NV-ST, ST); cells contain the markers *, **, or ***.]

FIG. 3. Mean speech intelligibility (SRT) and detection thresholds (SDT) as measured for the eight masker types along with the standard deviations for SRT50s (open squares), SRT80s (open circles), and SDTs (open triangles). The three different gender combinations of target and masker are shown in the different panels. The mean values were obtained from individual data of the same eight listeners. Significant differences within the SSN-based and speech-like maskers are depicted with asterisks; the number of asterisks indicates the confidence level (* is p < 0.05, ** is p < 0.01, and *** is p < 0.001). Significant differences across the SSN-based and speech-like maskers are given in Table III. The masker types SAM-, BB-, and AFS-SSN are abbreviated in the figure label.

1. Male target and female masker spectrum

The upper panel of Fig. 3 shows a characteristic pattern of SRTs and SDTs across the eight masker conditions: the left-hand side of the panel shows data obtained with the SSN-based maskers, the right-hand side with speech-based maskers. The SSN yielded the highest SRT50 of -7.5 dB, followed by the SRT50 of the AFS-SSN (-9.2 dB). The SRTs for the modulated maskers with spectral coherence of the applied modulations were lower than for AFS-SSN and amounted to -14.1 dB and -17.1 dB for BB- and SAM-SSN, respectively. The largest masking release of 9.6 dB occurred


for SSN and SAM-SSN. On the right-hand side of the


upper panel in Fig. 3, SRT50s obtained with the speech-like
maskers are shown. All SRT50s lay in a similar range,
whereas the NV-ISTS showed the highest SRT50 (-17.5 dB) and the ST masker the lowest (-22.6 dB). Apparently,
whether the masker was a single talker speaking meaningful
or nonsense (as in the case of the ISTS) sentences or if the
original or noise-vocoded version of the speech-like maskers
was presented, hardly influenced speech intelligibility.
Statistical differences in SRT50s for the different maskers
were assessed with a one-way repeated-measures analysis of
variance (ANOVA) showing a highly significant main effect
of masker [F(7,49) = 127.16, p < 0.001]. Post hoc pairwise comparisons with Bonferroni correction (see asterisks in Fig. 3 for the significance levels) showed that SRT50s for the SSN-based maskers differed significantly for SSN and SAM-SSN (as always the case for the whole data set), SSN and AFS-SSN, AFS- and BB-SSN, and AFS- and SAM-SSN. The
SRT50s for the four speech-like maskers did not differ significantly, except for NV-ISTS versus ST.
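The statistical analysis reported here (one-way repeated-measures ANOVA over the eight maskers, followed by Bonferroni-corrected pairwise comparisons) can be sketched as follows; statsmodels' AnovaRM and paired t-tests are stand-ins for the original analysis software, and the data below are random placeholders:

    import itertools
    import numpy as np
    import pandas as pd
    from scipy.stats import ttest_rel
    from statsmodels.stats.anova import AnovaRM

    maskers = ["SSN", "SAM-SSN", "BB-SSN", "AFS-SSN", "NV-ISTS", "ISTS", "NV-ST", "ST"]
    n_listeners = 8

    # illustrative SRT50 data: one value per listener and masker (dB SNR)
    rng = np.random.default_rng(0)
    rows = [{"listener": l, "masker": m, "srt": rng.normal(-15, 2)}
            for l in range(n_listeners) for m in maskers]
    df = pd.DataFrame(rows)

    # one-way repeated-measures ANOVA with masker as within-subject factor
    res = AnovaRM(df, depvar="srt", subject="listener", within=["masker"]).fit()
    print(res)

    # post hoc pairwise comparisons with Bonferroni correction
    pairs = list(itertools.combinations(maskers, 2))
    for m1, m2 in pairs:
        x = df[df.masker == m1].sort_values("listener")["srt"].to_numpy()
        y = df[df.masker == m2].sort_values("listener")["srt"].to_numpy()
        t, p = ttest_rel(x, y)
        p_corr = min(p * len(pairs), 1.0)
        if p_corr < 0.05:
            print(f"{m1} vs {m2}: corrected p = {p_corr:.3f}")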
Comparing the SRT80 with the SRT50 across the
upper panel in Fig. 3, a constant offset of about 4 dB (6 dB
for speech-like maskers) between the two measures was
observed. Otherwise the performance for SRT80 was very
similar to the SRT50 for the eight maskers: the highest
SRT80 was observed for the SSN and AFS-SSN (-5.2 dB, -5.1 dB) maskers and there was a masking release in
SRT80s for the modulated maskers. The largest release
occurred again for the SSN and SAM-SSN (7.6 dB). Here the
ANOVA also showed a highly significant main effect of
masker [F(2.73,19.11) = 34.12, p < 0.001]. Post hoc pairwise
comparisons showed significantly different SRT80s between
AFS- and BB-SSN and AFS- and SAM-SSN. Moreover,
there were significant differences between SAM-SSN and
BB-SSN, due to the regularity of the modulations. Regarding
the speech-like maskers there were significant differences
between NV-ISTS and ISTS, but not for the other maskers.
All SDTs were well below the SRTs. For the SSN and
SAM-SSN maskers, SDTs were about 10 dB lower. For the
BB- and AFS-SSN maskers SDTs were about 15 dB lower.
Finally, for the speech-like masker, SDTs were as much as
20 dB lower. Nevertheless, the overall pattern of SDTs was
comparable; as for the SRTs, the highest thresholds were
observed for the SSN and AFS-SSN. There was also a
release from masking for the modulated SSN-based
maskers in the SDT experiment, but unlike for the SRTs,
the masking release did not increase with regularity, i.e.,
the SAM-SSN masker showed no lower SDT than the BB-SSN masker. The largest masking release occurred for
SSN and BB-SSN (instead of SAM-SSN as for the SRTs)
and amounted to 13.3 dB. This was slightly larger than for
the intelligibility measurements. When considering the
SDTs for the speech-like maskers, hardly any difference
could be observed. As for speech intelligibility, the type
of interfering talker (ISTS, ST) and the presence or
absence of fundamental frequency information due to
noise-vocoding did not influence the SDTs. A one-way
repeated-measures ANOVA showed a highly significant
main effect of masker [F(7,49) = 82.02, p < 0.001]. Post
hoc pairwise comparisons showed differences for the
SSN-based maskers for SSN and AFS-SSN and AFS- and
BB-SSN. There were no significant differences for the
SDTs of the speech-like maskers.
2. Female target and female masker spectrum

The middle panel of Fig. 3 shows the intelligibility


thresholds for the combination of female target and masker
in the same style as in the upper panel. Again SRT50s for SSN and AFS-SSN were similar; the values were -7.7 and -8.5 dB. SRTs in general declined as modulations were introduced to the SSN masker, with the largest masking release being 10.4 dB for SSN and SAM-SSN. All SRT50s for the speech-like maskers were at about -20 dB. A one-way repeated-measures ANOVA showed a highly significant main effect of masker [F(7,49) = 65.88, p < 0.001]. Post
hoc pairwise comparison with Bonferroni correction showed
that SRT50s for the SSN and AFS-SSN were not significantly different. In contrast, SRT50s differed for AFS- and
BB-SSN, AFS- and SAM-SSN, and SAM-SSN and BB-SSN. Thus, the coherence and the regularity of the modulations had a significant effect on SRT50s. The pairwise
comparison between the four speech-like maskers showed
no significant differences.
The overall pattern of SRT80s and SRT50s was the
same; the offset between the two measures was again 4 dB
for the SSN-based maskers and about 6 dB for the speech-like maskers. The highest SRT80s were -5.3 and -5.5 dB (SSN and AFS-SSN) and were almost identical to the upper
panel. The largest release from masking occurred for SSN
and SAM-SSN (9.1 dB). All the SRT80s for the speech-like
maskers were in a similar range of about -15 dB. Again, the
ANOVA showed a highly significant main effect of masker
[F(7,49) = 53.54, p < 0.001] and post hoc pairwise comparisons showed significantly different SRT80s for AFS- and
BB-SSN, AFS- and SAM-SSN, and SAM-SSN and BB-SSN. There were no significant differences in SRT80s for
the speech-like maskers.
The course of the SDTs was again similar to the SRTs.
As for speech intelligibility, the highest SDTs were obtained
with the SSN and AFS-SSN (-16.9 dB and -20.1 dB), but
the lowest SDT was found for BB-SSN. The masking release
for the SSN and BB-SSN condition was 13.2 dB and thus
almost identical to the value in the upper panel. SDTs for the
speech-like maskers were again about 20 dB lower than the
SRTs. Also here the ANOVA showed a highly significant
main effect of masker [F(7,49) = 66.93, p < 0.001]. Post hoc
pairwise comparison showed significant differences in SDTs
between AFS- and BB-SSN and AFS- and SAM-SSN. The
pairwise comparison between the four speech-like maskers
showed no significant differences in SDTs.
3. Male target and male masker spectrum

The lower panel of Fig. 3 shows the SRTs and SDTs for
the combination of male target and male masker spectrum.
The course of the SRTs was similar to the other two panels.
The highest SRT50s were obtained with the SSN and
AFS-SSN (-8.2 and -9.3 dB). A masking release occurred
for the introduction of coherent modulations (BB- and SAM-SSN) across the frequency spectrum. It was largest between
SSN and SAM-SSN (9.5 dB), which was almost exactly the
same as for the other panels in Fig. 3. Considering the
speech-like maskers, SRTs for NV-ISTS, ISTS, and NV-ST
were similar. An exception, compared to the other panels,
was the ST masker. For the combination of male target and
male masker spectrum, this masker yielded SRTs that were
about 5 dB higher and had a larger standard deviation than
for all other speech-like maskers in this and the other panels.
In this case some subjects had severe problems with speech
intelligibility in the single talker interferer as will be discussed later. A one-way repeated-measures ANOVA showed

a significant main effect of masker [F(1.43, 10.01) = 18.14, p < 0.001] for the SRT50s. Post hoc pairwise comparisons
showed differences for AFS- and BB-SSN and AFS- and
SAM-SSN. Pairwise comparisons for the speech-like masker
thresholds indicated no significant differences. Because of
the large standard deviations, the SRT50 for the ST masker
was not significantly different from all other SSN-based
maskers (see Table III). This is in contrast to the combination of female target and masker spectrum.
As before, there was an offset of 4 dB between the
SRT50 and SRT80 for the SSN-based maskers and 6 dB
for the speech-like maskers (10 dB for the ST condition).
The highest SRT80s were again found for SSN and AFS-SSN (-6.4 dB and -5.8 dB), although compared with the
other two panels, the SRT80 for SSN was about 1 dB lower
in this case. The masking release between SSN and SAM-SSN was 6.8 dB; this was slightly smaller than in the other
two panels of Fig. 3. The ANOVA showed a significant
main effect of masker [F(1.31, 9.17) = 9.73, p = 0.009] and
post hoc pairwise comparisons showed significant differences in SRT80s between AFS- and BB-SSN, AFS- and SAM-SSN, and between BB- and SAM-SSN. Again, the SRT80
for the ST was not significantly different from all SSN-based
masker SRT80s. There were no differences for the SRT80 of
the speech-like maskers.
The SDTs were similar to the data presented in the other
panels as well. The largest SDTs were observed for SSN (-12.3 dB) and AFS-SSN (-20.4 dB); the largest masking release amounted to 15.2 dB for SSN versus BB-SSN. SDTs for all speech-like maskers were around -34 dB, with the
ST condition showing the lowest SDT. This was in contrast
to the SRTs for this panel, where the ST masker showed the
highest SRT among the speech-like maskers. The SDTs for the
speech-like maskers were in general about 5 dB higher than
the SDTs for the other gender combinations of target and
maskers. Again, the ANOVA showed a highly significant
main effect of masker [F(7,49) = 84.62, p < 0.001] for the
SDTs. Pairwise comparisons with Bonferroni correction
showed significant differences between SSN and AFS-SSN,
AFS- and BB-SSN, and AFS- and SAM-SSN. Pairwise comparison for the speech-like masker thresholds indicated no
significant differences for the SDTs.
V. MODEL PREDICTIONS

Given that no systematic differences were observed


between the pattern of results for SRT50s and SRT80s, the
model predictions were assessed for SRT50s only. Figure 4
shows the predicted SRT50s of the five speech intelligibility
models for all three gender combinations of target and
masker along with the observed data in the same style as in
Fig. 3. The model predictions are shown as filled symbols
(see legend), observed SRT50s and SDTs are shown as open
symbols. The root-mean square errors (RMSE) between the
observed SRT50s and the model predictions are given in the
legend. All model predictions in Fig. 4 were adjusted to
match the SRT50 in the SSN masker for each spectral
combination.
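The RMSE values reported in the legend of Fig. 4 follow directly from the observed and predicted SRT50s; a small sketch with placeholder values loosely based on the upper panel:

    import numpy as np

    def rmse(observed, predicted):
        # Root-mean-square error between observed and predicted SRT50s (dB).
        observed, predicted = np.asarray(observed), np.asarray(predicted)
        return float(np.sqrt(np.mean((observed - predicted) ** 2)))

    # placeholder SRT50s for the eight maskers (dB SNR), one value per masker
    observed = [-7.5, -17.1, -14.1, -9.2, -17.5, -20.0, -20.0, -22.6]
    predicted = [-7.5, -17.7, -18.0, -13.0, -28.0, -30.0, -29.0, -31.0]
    print(f"RMSE = {rmse(observed, predicted):.1f} dB")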

FIG. 4. Predicted SRT50s of the five speech intelligibility models. The


observed data (SRT50s, SDTs) are shown with open symbols, model predictions with closed symbols. The legend presents the root mean square errors
for each model. All model outcomes were matched to the SRT50 in the SSN
masker of the gender combination of target and masker in each panel.
Model predictions are shown without errors, as these are discussed in the
text.

The upper panel shows the data and predictions for the
combination male target and female masker spectrum. SI
predictions by the SII were more or less independent of
masking condition, ranging from -7.5 dB (SAM-SSN) to -8.4 dB (ST), as was expected given that all SSN-based
maskers, ISTS, and NV-ISTS share the same long-term

power spectrum. This also supports that the applied signal


manipulations did indeed not change the long-term spectrum
of the maskers. As ST and NV-ST had a similar long-term
spectrum as the other six maskers, the predicted SRT50s for
these maskers were also similar to those of the SSN-based
maskers.
In contrast, predictions of the ESII and ESIIsen varied
strongly across the eight maskers showing a good agreement
with the observed data for the SSN-based maskers. The prediction for the SAM-SSN masker (-17.73 dB) was almost identical to the observed SRT50. The release from masking was overestimated by 3–4 dB for BB- and AFS-SSN. For the
speech-like maskers there was a considerable offset between
observed data and predictions, the predictions underestimated the SRT50s by about 10 dB. Overall, ESII and
ESIIsen predictions fit the listeners' data better than the SII
predictions as is seen by the respective RMSEs. The ESIIsen
predicted SRT50s slightly better than ESII in the case of different genders for target and masker and for female target
and masker. ESIIsen showed the lowest RMSE for male
target and female masker due to a better match with the
observed data for the speech-like maskers. SRT50s were still
underestimated, but were generally about 3 dB higher than
those of the ESII. It is obvious that the ESII predictions
closely follow the pattern of the SDT with a global offset of
about 10 dB. Predictions of the mr-sEPSM yielded large
RMSEs, which was mainly caused by the lack of predictive
power in the modulated SSN-based maskers. The mr-sEPSM
showed a masking release that grew with coherence of
the modulations (AFS-SSN versus BB-SSN) across the
frequency spectrum and regularity (BB-SSN versus SAM-SSN), as was seen in the observed data. However, the general decrease in SRTs that was observed in the modulated
SSN-based maskers, was not accounted for by this model.
SRTs were overestimated by more than 10 dB. This trend
continued for the speech-like maskers where the masking
was overestimated by more than 5 dB. However, it should be
noted that if the model predictions were shifted down by
about 8 dB, the observed data would be explained much
better then, except for the SSN condition (see Sec. VI).
The largest RMSE in the upper panel of Fig. 4 was
found for the speech intelligibility predictions of the STOI
model. STOI neither showed a masking release for the
modulated maskers, nor for the speech-like maskers. The
coherence of the applied modulations did not influence the
outcomes of the model at all, neither did regularity of
the modulations within the masker. Predictions generally
overestimated the masking effects by about 10 dB.
The other two panels of Fig. 4 showed the model predictions together with the observed data for the combinations
where target and maskers were both female (middle panel),
or male (lower panel). While the spectrum of the female target talker was different from that of the female-based
maskers, the male-based masker spectra were better matched
to the spectrum of the male target talker. This was because the male version of the ISTS (and subsequently all SSN-based maskers) was designed to match the F0 of the male
target talker. Thus, prediction differences for the middle and
lower panel could be caused by a better spectral match of

target and masker material for the lower panel. For both, the
overall picture was the same as in the upper panel. SII predictions hardly differed for the individual maskers and ESII
and ESIIsen predictions for SAM-SSN were almost identical
to the SRT50s for SAM-SSN gained in the listening experiments. The release from masking for the other modulated
SSN-based maskers was again slightly overestimated and
largely overestimated for the speech-like maskers. For the
lower panel, the ESII RMSEs were smaller than for the other
two panels, while the RMSEs for the ESIIsen were similar
across all three panels. Again, the ESII predictions generally
resemble the pattern of SDTs with an offset of about 10 dB,
except for the SSN masker in the male target and masker combination (lower panel of Fig. 4).
Predictions made by the mr-sEPSM showed the overall
smallest RMSE for the combination of male target and male
masker. The release from masking was generally underestimated by about 10 dB (5 dB for the male target and male
masker) for the SSN-based maskers, but the AFS-SSN
SRT50 was very well met. The SRT50 for NV-ISTS deviated by only 1.7 dB and all other SRT50 predictions for the
speech-like maskers were also similar to the data from the
listening experiments. The largest RMSE for the middle and
lower panel was obtained with the STOI model. The release
from masking was underestimated by about 10 dB and masking effects of the speech-like maskers were generally overestimated. This was true for both combinations with the same
gender for target and masker.
All data in Fig. 4 are shown without standard deviations, as these were very different for the models. For the SII and ESII model, the errors were about ±4 dB for the SSN-based and ±10 dB for the speech-like maskers. This was the case for all three gender combinations of target and masker. When real sentences were used as input (ESIIsen) the errors for the SSN-based maskers rose slightly to ±6 dB, but remained the same as for the ESII for speech-like maskers. Errors for the mr-sEPSM model were in the range of 0.2%–23%, where the lowest errors occurred for very low SNRs and the largest errors in the range between SNRs of -5 and -15 dB. For the combinations with the same gender the maximal errors were slightly larger (up to 28%), whereas errors in general were smaller for the speech-like maskers (up to 20%). The assumed errors for the STOI predictions were ±0.03 and ±0.06 in STOI units, which corresponded to ±12 dB for the SSN-based and about ±23 dB for the speech-like maskers.
VI. DISCUSSION

The maskers used in the current study differed in their


spectro-temporal features regarding regularity and across-frequency coherence. In the following, these aspects, the
relation of SRTs and SDTs, as well as the relation of SI predictions by the applied models and SRT50s are discussed.
A. Role of long-term spectrum and absolute threshold

SRT50s in the SSN masker were around -7.5 dB, in line with the threshold of -7.1 dB as reported in Wagener
et al. (1999) for the OLSA material in a spectrally matched
masking noise. A strong release from masking was observed between the SSN and SAM-SSN maskers and slightly less
between SSN and BB-SSN. The speech-like maskers showed
considerably lower thresholds without systematic variations
among them. This pattern of results was comparable across
all three gender combinations of target and masker. SRTs
were particularly similar for the two gender combinations
using female maskers. In contrast, e.g., Brungart et al.
(2001) found best SI performance when target and masker
were of different gender. A possible explanation for the current finding is the difference between male and female voices in the frequency range below 100 Hz. Male voices have
more energy in this range, but such low frequencies do not
contribute much to speech intelligibility, e.g., male and
female voices are similar in terms of low-frequency speech
processing in the auditory system and can therefore produce
similar SRTs. This constitutes a difference in the EM content
for the different-gender maskers, but does not seem to be important for speech processing. Brungart (2001) states that the
aspect of EM is dominated by IM in the case of speech-like
maskers. Thus, it should be easier to distinguish the target
from a masker of different gender as there is a reduced similarity between the two voices. Accordingly, the SRTs for the
case of a male target in a female masker should be lower
than for the female target. The absence of this decrease can
be explained by the general differences between the target
and speech-like maskers. As the target and masker talkers
were never identical in the current experiments, there were
always sufficient separation cues for the two gender combinations, leading to the similar SRTs. A possible cue could be
the speaker's F0 (Bregman et al., 1990; Brungart and
Simpson, 2007) or the difference in the overall long-term
spectrum (which could in principle have an effect of 1.5 dB
in the observed thresholds as determined from the SII
predictions).
To exclude a potential effect of absolute hearing threshold in the conditions with the speech-like maskers (lowest
thresholds), SDTs were measured in quiet with three of the
listeners. The same setup as for the masked SDTs was used,
but one of the two intervals contained the target sentence
and the other silence. These measurements were performed
for the male and female target material and showed SDTs of
−58.8 dB (±1.55 dB) for the male and −59.3 dB (±1.28 dB)
for the female target sentences. SDTs in silence were therefore well below the masked SRTs (and SDTs) in Fig. 3 and
rule out any possible floor effect caused by audibility of the
target sentence.
B. Role of spectro-temporal masker structure for SRTs

The SSN and AFS-SSN maskers showed the highest intelligibility and detection thresholds. SRT50s, SRT80s, and
SDTs for these two maskers did in most cases not differ significantly from each other. The AFS-SSN had the most incoherent modulations across frequency of all modulated
maskers (see lower right panel of Fig. 1). SRTs for SAM-SSN and BB-SSN were lower than for AFS-SSN and showed
a statistically significant release from masking for these
coherently modulated maskers. In addition to across-frequency coherence, SAM-SSN had a regular modulation
pattern and showed the lowest thresholds. This is in line with
Stone et al. (2012), stating that regular modulations result in
a greater release from masking than irregular modulations.
AFS-SSN and BB-SSN share the same characteristics of
their modulation patterns when individual frequency bands
are considered; however, BB-SSN has a coherent modulation pattern across frequency whereas AFS-SSN does not. This
demonstrates that coherent AMM is required for an effective
masking release. Both AFS-SSN and BB-SSN are conceptually similar to the classical random and co-modulated
maskers in psychoacoustic CMR experiments (e.g., Hall
et al., 1984). Thus, the findings are similar to psychoacoustic
CMR (Hall et al., 1984; Dau et al., 2013). Howard-Jones
and Rosen (1993) support this result, although they found a
release from masking also for non-coherent modulations. In
the case of CMR, across-channel mechanisms were estimated to cause 2–4 dB masking release (Piechowiak et al.,
2007; Dau et al., 2013) which is in the same range as the differences in SRTs in the current study. A different explanation for the decrease in SRTs for the coherently modulated
masker is the representation of the spectral shape of the target signal. A gap in the masker allows a good representation
across the entire frequency spectrum at one time and thus
across-frequency grouping of information can be used as a
cue to separate between target and masker.
The lower SRTs for BB- and SAM-SSN can generally
be explained by the concept of dip listening (e.g.,
Bronkhorst, 2000). According to this assumption, the SRTs
for the BB-SSN maskers should be lower than those
observed in the SAM-SSN, given that temporal gaps in the
masker are longer for BB-SSN. However, thresholds were
lowest for SAM-SSN showing that regularity is an important
factor. This is in contrast to the simulations of the ESII
model (representing dip listening by its short-time EM analysis) that showed no difference. Another aspect is the modulation rate of the masker: the SAM-SSN rate was 8 Hz and
thus higher than the typical prominent speech-modulation
rate of 4–5 Hz (Dubbelboer and Houtgast, 2008). As AMM
is assumed to be modulation frequency selective (e.g.,
Houtgast, 1989; Ewert and Dau, 2000), the BB-SSN, showing the typical speech-like modulation rates, should cause
more masking than the 8-Hz SAM-SSN. This was observed
in the data. Based on AMM, a 4-Hz SAM-SSN should cause
higher thresholds than the 8-Hz SAM-SSN, whereas, considering dip listening, thresholds should decrease as larger temporal masker gaps are provided. Thus, assessment of the masker
SAM rate in a future study might help to better distinguish
between short-time EM and AMM.
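A minimal sketch of how such a follow-up masker could be generated is given below. White Gaussian noise stands in for the speech-shaped noise and a full modulation depth is assumed, so this is an illustration of the condition rather than a description of the stimuli actually used in the study.

```python
# Sketch of the proposed follow-up: sinusoidally amplitude-modulated (SAM)
# noise maskers at 4 Hz and 8 Hz. White Gaussian noise is a stand-in for the
# speech-shaped noise (SSN) of the study; 100% modulation depth is assumed.
import numpy as np

def sam_noise(fm_hz, duration_s=2.0, fs=44100, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(int(duration_s * fs)) / fs
    carrier = rng.standard_normal(t.size)               # stand-in for SSN
    modulator = 1.0 + np.sin(2.0 * np.pi * fm_hz * t)   # full-depth SAM
    masker = carrier * modulator
    return masker / np.sqrt(np.mean(masker ** 2))       # normalize to unit RMS

sam_4hz = sam_noise(4.0)  # near the dominant speech rate: more AMM expected,
                          # but also longer dips for dip listening
sam_8hz = sam_noise(8.0)  # modulation rate used for SAM-SSN in the current study
```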
SRTs for speech-like maskers were slightly lower and
significantly different (see Table III) from those of most
SSN-based maskers (except for SAM-SSN and BB-SSN in
some cases) for all three gender combinations of talker and
masker. Dip listening (Bronkhorst, 2000) is a possible explanation for the lower SRTs, as there are temporal gaps in
which the target sentence can be perceived in quiet and these
are similar for all speech-like maskers. An exception is the
case of the male target and male masker. As there were large
standard deviations for the male single talker masker (right
side of lower panel in Fig. 3), this SRT was not significantly
different from the SSN-based masker SRTs. In this case,
some listeners showed intelligibility thresholds above 0 dB,
causing the large standard deviations in Fig. 3. Interestingly,
this was not observed for the female target and female single
talker masker, although the gender of target and masker
were the same, too.
Generally, SRT80s and SRT50s showed a parallel pattern for all gender combinations of target and masker. The
offset between SRT80s and SRT50s was always about 4 dB
for SSN-based maskers and larger (about 6 dB) for speech-like maskers, reflecting the shallower psychometric function
typically found for speech maskers (MacPherson and
Akeroyd, 2014).
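The relation between this offset and the slope of the psychometric function can be made explicit with a logistic function of the form often used for matrix tests. The sketch below is a worked illustration under that assumption only; it does not reproduce the actual fitting procedure of the study.

```python
# Worked illustration (assumption: logistic psychometric function):
# p(SNR) = 1 / (1 + exp(-4 * s50 * (SNR - SRT50))), where s50 is the slope at
# the 50% point. The SRT80-SRT50 offset then depends only on s50.
import numpy as np

def offset_80_minus_50(s50_per_db):
    # Solve p = 0.8 for SNR - SRT50.
    return np.log(0.8 / 0.2) / (4.0 * s50_per_db)

def slope_from_offset(offset_db):
    return np.log(0.8 / 0.2) / (4.0 * offset_db)

# A 4-dB offset (SSN-based maskers) implies a slope of about 8.7%/dB, while a
# 6-dB offset (speech-like maskers) implies about 5.8%/dB, i.e., a shallower
# psychometric function.
for off in (4.0, 6.0):
    print(f"offset {off:.0f} dB -> slope at 50%: {100 * slope_from_offset(off):.1f} %/dB")
```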
C. Relation of SRTs and SDTs

SDTs showed a similar pattern to SRTs but were always well below the SRTs. Like the SRTs, SDTs were highest for the SSN
and AFS-SSN, lower for the SAM-SSN and BB-SSN, and
lowest for the speech-like maskers. It appears plausible that
SDTs were in general lower given that the task was pure signal detection and intelligibility was of no concern.
Of particular interest is the reception-detection (RD) gap between SRT50s and SDTs, which quantifies the target level above
detection threshold required for 50% speech reception as a
function of the masker type. According to Arbogast et al.
(2005), speech detection is ruled mainly by the aspect of
energetic masking. Following this hypothesis, the RD gap
can be used to estimate the additional effect of AMM and
IM: it amounts to 10–11 dB for SSN and SAM-SSN, 17.6 dB for BB-SSN, 15.1 dB for AFS-SSN, and 20–22 dB for the
speech-like maskers. These numbers are about 1 dB smaller
for female target and masker, but vary more (2–6 dB) for
male target and masker. Assuming that the RD gap for SSN
and SAM-SSN represents the difference between EM for
speech reception and speech detection, the larger RD gap for
BB- and AFS-SSN can be interpreted to reflect the additional
aspect of AMM (as IM is assumed to be absent in stationary
noise) and amounts to 5–7 dB. According to Arbogast et al. (2005), the BB-SSN provides the least EM of all SSN-based
maskers. This assumption is supported by model predictions
(ESII, see Sec. VI D) and the concept of dip listening, given
that BB-SSN provided larger temporal gaps than SAM-SSN.
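The RD-gap bookkeeping described above can be summarized in a short sketch. The gap values are the approximate numbers quoted in the text and serve only as illustrative round numbers, not as a re-analysis of the raw data.

```python
# Sketch of the reception-detection (RD) gap bookkeeping: RD gap = SRT50 - SDT.
# With SSN/SAM-SSN taken as the short-time-EM reference, the excess gap is read
# as AMM (and, for the speech-like maskers, AMM plus IM). Values in dB are the
# approximate gaps quoted in the text, used here as illustrative round numbers.

rd_gap = {
    "SSN": 10.5, "SAM-SSN": 10.5,   # ~10-11 dB, taken as the EM reference
    "BB-SSN": 17.6, "AFS-SSN": 15.1,
    "speech-like": 21.0,            # ~20-22 dB
}
em_reference = rd_gap["SAM-SSN"]

for masker, gap in rd_gap.items():
    excess = gap - em_reference     # masking beyond short-time EM
    print(f"{masker:11s}: RD gap {gap:5.1f} dB, excess over EM {excess:4.1f} dB")
# An excess of about 5-7 dB for BB-/AFS-SSN is attributed to AMM; the further
# ~6 dB for the speech-like maskers to coherent across-frequency AM and/or IM.
```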
For the speech-like maskers, a considerably larger RD
gap was found than for the SSN-based maskers. Here the RD
gap can be interpreted to again contain the 5–7 dB of AMM
and a further offset of about 6 dB, which might be attributable to either effects of coherent across-frequency amplitude
modulations or, most likely, informational masking
(Micheyl et al., 2000; Brungart et al., 2001). Again, this is
supported by the ESII predictions that accounted for dip listening (short-time EM), but showed largely different SRTs
for the SSN-based and speech-like maskers.
It can, however, be argued whether speech detection is
mainly ruled by energetic masking (as proposed by Arbogast
et al., 2005) or whether there is also AMM caused by
the interaction of target and masker (Stone et al., 2012;
Dubbelboer and Houtgast, 2008). In that case, the RD gap
still allows an estimation of the combined effects of modulation frequency selective AMM caused by modulated
maskers, spectral coherence of AMM, and the effect of IM.
D. Model predictions

Speech reception predictions were compared to the SRT50 data to gain deeper insight into the role of the different types
of masking accounted for by the individual SI models. Given
that IM is not addressed in any of the models, differences
between observed data and predictions might hint to effects
of IM.
1. SII and ESII (ESIIsen)

The SII predictions rely only on the long-term energy spectrum of the individual maskers; thus, only long-term EM
is explained. For the current maskers, SII predicted a more
or less constant SRT, supporting the initial design goal of
identical masker long-term spectra. As expected, the SII simulations show clearly that long-term EM is a poor predictor
for SI if the masker is not stationary and the spectro-temporal features of the masker are varied.
The ESII extends the concept of EM by a short-time
analysis of the stimulus, and was therefore expected to yield
more accurate predictions for maskers with temporal gaps.
Thus, the ESII accounts for short-time EM and consequently
listening in the dips (Bronkhorst, 2000). It is interesting to
directly compare the ESII predictions to SDTs instead of the
SRTs. Obviously, SDTs and ESII simulations generally
shared the same pattern (see Fig. 4), except for an offset of
about 10 dB. This can be attributed to the adjustment of the
ESII predictions to the SRT50 in the SSN. Both ESII predictions and the observed SDTs shared a difference of about
23 dB between SSN and the speech-like maskers. Even specific details in the threshold pattern (the SRT for BB-SSN
was lower than the SRT for SAM-SSN) were similar, suggesting the ESII to be a very good model for speech detection. Thus, SDTs are well explained by short-time energetic
masking as covered by the ESII. Following Arbogast et al.
(2005), this means that all deviations of the observed SRTs
from the SRTs predicted by the ESII reflect masking aspects
other than EM.
Given that the ESII perfectly accounted for the SRTs of
the SAM-SSN (when matched to the SRT in the SSN), it can
be hypothesized that the masking effect of SAM-SSN is
indeed explained by short-time EM. Therefore, in all other
maskers additional AMM and IM must be responsible for the
higher SRTs. These additional masking effects amounted to
3–4 dB (AMM) for the BB- and AFS-SSN and about 10 dB
(AMM and IM) in the case of the speech-like maskers. The
additional masking for the BB- and AFS-SSN is likely related
to the irregularity of the temporal fluctuations and reduced
across-frequency coherence in the maskers. The underestimation for speech-like maskers is in line with Rhebergen et al.
(2006), who mentioned that the ESII could underestimate
masking effects when speech-like maskers are used.
Rhebergen et al. (2006) suggested that in this case IM acts in
addition to (short-time) EM, but is not captured by the model.
In comparison to the estimates of 5–7 dB for AMM and an
additional 6 dB for IM in the speech-like maskers, gained from the RD gap, the ESII predictions suggest slightly different estimates of 3–4 dB and 6–7 dB, respectively.
If real sentences were used as target in the predictions
(ESIIsen), the predicted SRTs differed slightly from the ESII
predictions, except for the SAM- and BB-SSN maskers. In
contrast to the ESII, the ESIIsen predicted slightly higher
thresholds for BB-SSN than for SAM-SSN, suggesting that
the ESIIsen concept, by exploiting the temporal statistics of the target sentences, better accounts for the observed SI data than the ESII. Further differences occurred for the speech-like
maskers, where the predicted SRTs were about 2–5 dB higher
for ESIIsen. The gap between SRTs and the short-time EM
based ESII predictions was reduced when the real statistics of
short-time SNRs were better estimated by using real target
sentences in the ESIIsen.
In conclusion, short-time EM that takes the full statistics of short-time SNRs into account is a better model for SI
than long-term EM.
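The conceptual difference between long-term and short-time EM can be illustrated with a simplified sketch that contrasts a single long-term SNR with an average of frame-wise SNRs for a modulated masker. The sketch omits the band decomposition, band-importance weighting, and audibility mapping of the actual SII/ESII and should not be read as an implementation of either standard; the frame length and the ±15-dB clipping range are simplifications borrowed from the SII framework.

```python
# Conceptual sketch only (not the standardized SII/ESII): a long-term SNR
# versus an average of short-time frame SNRs for a modulated masker.
import numpy as np

fs = 16000
rng = np.random.default_rng(1)
t = np.arange(4 * fs) / fs
speech_proxy = rng.standard_normal(t.size)                                  # stand-in target
masker = rng.standard_normal(t.size) * (1.0 + np.sin(2 * np.pi * 8 * t))    # 8-Hz SAM noise

def long_term_snr_db(x, n):
    return 10 * np.log10(np.mean(x ** 2) / np.mean(n ** 2))

def mean_short_time_snr_db(x, n, frame_len):
    n_frames = x.size // frame_len
    xf = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    nf = n[:n_frames * frame_len].reshape(n_frames, frame_len)
    snr = 10 * np.log10(np.mean(xf ** 2, axis=1) / (np.mean(nf ** 2, axis=1) + 1e-12))
    return np.mean(np.clip(snr, -15.0, 15.0))   # SII-like clipping range

# Frames in the masker dips contribute high SNRs, so the short-time average
# exceeds the long-term SNR, mimicking the dip-listening advantage.
print("long-term SNR:       %5.1f dB" % long_term_snr_db(speech_proxy, masker))
print("mean short-time SNR: %5.1f dB" % mean_short_time_snr_db(speech_proxy, masker, int(0.012 * fs)))
```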
2. mr-sEPSM

The mr-sEPSM was expected to take EM and AMM aspects into account. However, it did not predict the SRT50s
very well when initially matched to the SRT50 in the SSN.
Masking was generally overestimated by 5–10 dB in the
model. Despite the overestimation of the SRT50s, a masking
release due to the coherence of the applied modulations was
visible and the size of the release between AFS- and BB-SSN and AFS- and SAM-SSN was captured correctly,
although there was no specific analysis of across-channel coherence (as in Jørgensen and Dau, 2014). The overall poor
predictive power of the mr-sEPSM was surprising as this
model was especially designed to predict speech intelligibility in fluctuating maskers and has performed well in
Jørgensen et al. (2013). Since predictions from the original
study could be reproduced well with the available model version, the reason for the prediction offset in the current study
must be related to the stimuli employed, including the target
material (OLSA versus CLUE speech material). Although
the same masker type (SSN) was used for reference in the
current study as in Jørgensen et al. (2013), the resulting parameters in the ideal observer stage were slightly different.
The values q and m were fixed for all predictions, but the
other two parameters were adjusted for each of the three gender combinations of target and masker (see Table I).
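For orientation, a hedged sketch of such an ideal-observer conversion is given below. It assumes the commonly cited form d' = k·SNRenv^q followed by a Gaussian m-alternative decision stage; the statistics of the maximum over the m−1 incorrect alternatives are simulated directly here instead of using the closed-form approximation of Jørgensen et al. (2013), and the parameter values follow Table IV (parameter set 1, male target and male masker), with the assignment of k and σs to the adjusted columns assumed.

```python
# Hedged sketch of an mr-sEPSM-style ideal observer: the overall SNRenv is
# mapped to d' = k * SNRenv**q and then to the probability of selecting the
# correct word out of m alternatives. The max-of-(m-1)-noise statistics are
# simulated rather than taken from the closed-form approximation in
# Jørgensen et al. (2013); k, q, m, sigma_s follow Table IV (set 1, male/male).
import numpy as np
from scipy.stats import norm

def percent_correct(snr_env, k=0.655, q=0.5, m=50, sigma_s=0.8, n_sim=200000):
    d_prime = k * snr_env ** q
    rng = np.random.default_rng(0)
    max_noise = rng.standard_normal((n_sim, m - 1)).max(axis=1)
    mu_n, sigma_n = max_noise.mean(), max_noise.std()
    return 100.0 * norm.cdf((d_prime - mu_n) / np.sqrt(sigma_s ** 2 + sigma_n ** 2))

for snr_env in (1.0, 10.0, 100.0):   # linear SNRenv values, illustrative only
    print(f"SNRenv = {snr_env:6.1f} -> {percent_correct(snr_env):5.1f} % correct")
```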
Close investigation of the time-averaged SNRenv
showed that the OLSA speech material in the current study
is very different from the CLUE speech material used in
Jørgensen et al. (2013) concerning its energy content across
frequency. Figure 5 (compare to Fig. 5 in Jørgensen et al., 2013) shows the output of the SNRenv analysis for both speech materials in ISTS and in SAM-SSN (at −16 dB and −7 dB speech-to-masker ratio, respectively). It is clearly
visible that the CLUE material has more energy in high auditory and modulation filters. If the model predictions, as
stated in Jørgensen et al. (2013), are largely based on the SNRenv in those high auditory and modulation filters, the influence of SNRenv in those regions might be overrepresented for the CLUE material. When OLSA speech material is used instead, the SNRenv in those filters is smaller, causing a general overestimation of the masking.

FIG. 5. Time-averaged SNRenv outputs (in dB) of the mr-sEPSM across auditory and modulation filters for the CLUE and OLSA speech material in the ISTS and SAM-SSN maskers. The different shades of grey indicate the SNRenv in the auditory and modulation filters. The CLUE material shows higher SNRenv values in high auditory and high modulation filters, whereas the OLSA material has lower values in these filters. Consequently, the ideal observer stage in the model receives lower SNRenv values for the OLSA speech material, which might explain why predictions fail for this particular speech material.

Moreover, predictions of the mr-sEPSM are largely
based on the conversion of the SNRenv to percent correct values within the ideal observer stage. Predictions in Fig. 4
were gained by matching the model outputs to the SRT50 of
the SSN masker, which was not spectrally matched to the
target material. This is different from Jørgensen et al. (2013), where the reference SSN was spectrally matched to the CLUE speech material. A consequence is the different parameter sets for the current study and the study by Jørgensen et al. (2013). To assess the effect of these different
parameter sets, the choice of parameters was tested more
closely (Table IV and Fig. 6). Two gender combinations
(male target in female and male maskers) were used, since
the mr-sEPSM showed better predictions in the case of male
target in male maskers.
Parameter set 1 was used to match the model outcomes
to the SRT50 in the SSN (this is re-plotted from Fig. 4) and
parameter set 2 to match the SRT50 in the AFS-SSN (see Table IV). The corresponding predictions and RMSEs for the two parameter sets are shown in Fig. 6.
TABLE IV. This table shows the different parameter sets for the mr-sEPSM when the SSN (parameter set 1) or the AFS-SSN (parameter set 2) was used for matching the model predictions. The values q and m were fixed throughout and only the other two parameters were adjusted.

                    Male target / female masker     Male target / male masker
                    k       q     m     σs          k       q     m     σs
Parameter set 1     0.351   0.5   50    0.6         0.655   0.5   50    0.8
Parameter set 2     0.6     0.5   50    0.6         0.715   0.5   50    0.8

The upper panel of Fig. 6 shows the predictions for the case of the female masker. The predictions with the smallest RMSE were obtained with parameter set 2, which matched the SRT50 for the AFS-SSN. In this case the SRT50 for the NV-ISTS was predicted exactly, and the predicted SRTs for the other speech-like maskers were closer to the observed data. SRT50s for the SSN-based maskers were still overestimated, but only by 3–4 dB. For parameter set 1 (SSN reference) there was an overestimation of about 5–10 dB for all maskers. For the male target and masker spectrum (lower panel of Fig. 6) the predicted SRTs for the two parameter sets were more similar. The reason is that the mr-sEPSM predictions matched the observed data better in the first place: for parameter set 1 the RMSE was considerably lower than for the different gender of talker and masker in the upper panel of Fig. 6. Even the SRT50 in the SSN was met very well and the remaining discrepancies were mostly caused by predictions for the modulated and speech-like maskers.

FIG. 6. Predictions of the mr-sEPSM model when using a different reference than the SRT50 in the SSN masker. Predictions were performed for the combination of male target speech and female or male masker spectrum, respectively. The experimental data are depicted with open symbols, the model predictions with closed symbols. Parameter set 1 was chosen to match the SSN; thus, data from Fig. 4 are replotted. Parameter set 2 was chosen to match the AFS-SSN for each combination. For the upper panel (female masker) the RMSE decreased greatly when a reference other than the SRT of the SSN was used. For the lower panel (male masker) this is not the case; here, the three RMSEs do not differ much from one another, presumably because parameter set 1 provided a small RMSE to begin with.

Taken together, the choice of parameters in the ideal observer stage is less crucial for predictions when target and masker have a similar spectrum, but has a great effect if they do not.

3. STOI

The STOI model performed well for most data from Kjems et al. (2009), but underestimated SI for the unprocessed (UN) condition in additive noise (car and bottle
noise) as stated in Taal et al. (2011). In the current study, it
failed to correctly predict SRT50s in all maskers. A possible
reason could be that there was no time-frequency processing
prior to the model analysis in the current study, in contrast to
most conditions in Taal et al. (2011). However, this appears
unlikely since an underestimation in Taal et al. (2011) also
appeared for the UN condition, where preprocessing was
omitted. A possible explanation mentioned in Taal et al.
(2011) is the average noise spectrum, which was different
from that of the clean speech tokens, as was the case in the
current study. This could explain the underestimation of SI
for the different gender of target and masker spectra, but
does not explain the deficits for the combinations where
target and masker are of the same gender and therefore have
similar spectra. Taal et al. (2011) stated that, in general,
model deficits could be overcome by introducing band-importance functions in the analysis and this could lead to
more accurate predictions, especially when target and
masker are of the same gender.
Although STOI showed much less predictive power compared to the SII and ESII, it has a potential advantage, since
its additional parameters (a and b, see Table II) can be
adjusted to match the observed data. However, a further
analysis of different parameter sets to fit the psychometric
function was omitted in the current study, because STOI predictions did not represent the overall pattern of observed
SRT50s (e.g., a masking release) in the first place. To what extent the overall pattern of the STOI predictions would be altered when parameters are changed to match a reference condition other than the SSN remains subject to investigation in a future study.
In general, STOI does not seem to account for effects
such as masking release or listening in the dips.
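The re-fitting idea mentioned above can be sketched as follows, assuming the logistic mapping f(d) = 100/(1 + exp(a·d + b)) between the STOI output d and intelligibility in percent. The (d, intelligibility) pairs in the example are placeholders, not data from this study.

```python
# Sketch of re-fitting the logistic mapping between the STOI output d and
# intelligibility; a and b are the free parameters. The data pairs below are
# hypothetical placeholders used only to demonstrate the fitting step.
import numpy as np
from scipy.optimize import curve_fit

def stoi_to_intelligibility(d, a, b):
    return 100.0 / (1.0 + np.exp(a * d + b))

d_measured = np.array([0.45, 0.55, 0.65, 0.75, 0.85])   # hypothetical STOI values
pc_measured = np.array([10.0, 30.0, 55.0, 80.0, 95.0])  # hypothetical % correct

(a_fit, b_fit), _ = curve_fit(stoi_to_intelligibility, d_measured, pc_measured,
                              p0=(-15.0, 8.0))
print(f"fitted a = {a_fit:.2f}, b = {b_fit:.2f}")
```

Such a re-fit only rescales the mapping from d to percent correct; it cannot, by itself, introduce a masking release that is absent in the pattern of d values.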
E. Implications for the role of energetic, amplitude
modulation, and informational masking

Altogether, results from the current study show that SRTs were influenced by all three masking effects, EM, AMM, and
IM, as supported by the model predictions of the ESII, ESIIsen,
and mr-sEPSM. The current data suggest that most masking can
be explained by (short-time) EM (ESII, ESIIsen). Nevertheless,
there are claims (Stone et al., 2012) that masking exhibited by a
stationary masker is mostly AMM instead of EM.
It remains difficult to exactly quantify IM, but observed
SRTs cannot be explained by the presence of EM and AMM
alone. Model predictions by the ESII (ESIIsen) showed an
offset of 10–15 dB compared to the SRTs in speech-like
maskers and this can be well attributed to IM, as discussed
in Rhebergen et al. (2006). In the current study, IM was
addressed with different speech-like maskers, somewhat different from studies such as Brungart et al. (2001), where IM
was studied with identical target and interfering talkers, or
with identical masking speech material. Moreover, in these
studies, target and masker sentences started at the same time,
causing a temporal overlap of the two signals. Thus, potentially beneficial masker gaps were shortened, although words
from both sentences were not perfectly aligned. Nevertheless,
the current study can draw conclusions on IM, if differences
between observed data and model predictions are interpreted
as IM effects which are not captured in the models.
It can be hypothesized that the removal of the fundamental frequency had no influence on IM, given that observed
SRTs for noise-vocoded and intact speech-like maskers were
not significantly different (Fig. 3). This is in contrast to
Rosen et al. (2013) where substantial differences occurred
for noise-vocoded maskers and natural speech. That study
found that natural speech is the most effective masker. The
findings of the current study, however, do not suggest a
strong influence of fundamental frequency information on
IM. Following Shinn-Cunningham (2008), this lack of difference supports the assumption that fundamental frequency is
not the dominant factor in object formation as long as there
are other signal aspects that differ between target and masker.
Interestingly, the model predictions by the ESII and
mr-sEPSM yielded SRT50s that were slightly higher for the
noise-vocoded maskers.
The lack of significant differences between SRTs and
SDTs for intact and noise-vocoded speech-like maskers in
Fig. 3 could also be caused by the current target sentence
material, which leaves little room for uncertainties and
thus IM in general. The OLSA sentence material is very
structured and hence predictable, so listeners might know
the target sentence quite well and can therefore concentrate
on the target material and ignore the masker. This would
actually lead to a de-masking effect and could explain the
similar thresholds for all speech-like maskers. A greater variance between the individual maskers could appear in more
realistic settings, e.g., when the beginning of the sentence is
unclear in timing, the target material itself more irregular
(no matrix sentence tests), or when the maskers themselves are
more realistic (real environment recordings).
In contrast to the described masking aspects and models
there is another approach to describe masking, based on salient time-frequency segments of the auditory signal representation. The concept of time-frequency segments has
recently come up in the field of computational auditory scene
analysis (CASA), where so-called glimpses (Cooke, 2006)
are used for a representation of the dominating (in terms of
SNR) source in the mixture of signal and background noise.
A glimpse can thus be defined as a spectro-temporal region
where target speech is least affected by the masker. Due to
the redundant information of speech across the spectro-temporal plane, a sparse distribution of glimpses is often
enough for speech perception (Cooke, 2006). Brown and
Wang (2005) proposed that SRTs can be derived from
glimpses and that the usage of glimpses can often sufficiently explain the perception of a signal. A glimpsing
approach could be seen as a generalized analysis combining
elements of the classic energetic and amplitude modulation masking by considering short-time SNRs in the time-frequency plane. Conceptually, even the aspect of informational masking could be incorporated in terms of processing
efficiency of the provided (time-frequency) information.
This would yield different intelligibility scores for comparable time-frequency distributions of target and masker
depending on the context of the masking situation.
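A minimal glimpse-proportion computation in the spirit of Cooke (2006) could look as follows. The 3-dB local SNR criterion, the STFT front end, and the noise stand-ins for target and masker are assumptions made here for illustration; Cooke's glimpsing model uses an auditory (gammatone/envelope) front end rather than an STFT.

```python
# Minimal glimpse-proportion sketch: the fraction of spectro-temporal cells in
# which the target exceeds the masker by a local SNR criterion (assumed 3 dB).
import numpy as np
from scipy.signal import stft

def glimpse_proportion(target, masker, fs=16000, criterion_db=3.0):
    _, _, T = stft(target, fs, nperseg=512)
    _, _, M = stft(masker, fs, nperseg=512)
    local_snr = 10 * np.log10((np.abs(T) ** 2 + 1e-12) / (np.abs(M) ** 2 + 1e-12))
    return np.mean(local_snr > criterion_db)

rng = np.random.default_rng(2)
n = 2 * 16000
t = np.arange(n) / 16000
target = rng.standard_normal(n)                                   # stand-in target
masker = 3 * rng.standard_normal(n) * (1 + np.sin(2 * np.pi * 4 * t))  # modulated masker
print(f"glimpse proportion: {glimpse_proportion(target, masker):.2f}")
```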
The current data set provides a systematic approach to
quantify masking effects in monaural speech processing and
might provide a helpful benchmark for (joint) psychoacoustic, SI, and CASA model development. The maskers are publicly available (Medizinische Physik, Universität
Oldenburg, 2016) in combination with some target sentences
(of the Oldenburger Satztest). The original ISTS is available
at EHIMA (2011).
VII. SUMMARY AND CONCLUSIONS

Speech intelligibility and speech detection were measured in various monaural maskers. Speech reception thresholds
were also compared to predictions by speech intelligibility
models. The obtained results lead to the following
conclusions:
(1) A constant offset of 4 dB between SRT50 and SRT80
was observed for the SSN-based maskers. For the
speech-like maskers, this offset increased to 6 dB,
reflecting the shallower psychometric function typically
found for speech maskers (MacPherson and Akeroyd,
2014). This was robust for all gender combinations of
target and masker.
(2) A statistically significant co-modulation masking release
appeared in SRTs for all gender combinations of target
and masker. SRTs significantly decreased as modulations changed from being incoherent to coherent across
frequency (BB-SSN versus AFS-SSN). The temporal
regularity of the applied coherent modulations (SAM-SSN versus BB-SSN) also improved speech intelligibility and resulted in significantly lower SRTs compared to
a masker with irregular modulations.
(3) Informational masking effects did not prominently arise
for speech intelligibility measurements performed with a
matrix sentence test, such as the Oldenburger Satztest.
There was no significant difference for thresholds
obtained with nonsense speech-like and single-talker
interferers and there was no effect of presence or absence
of fundamental frequency information for speech-like
maskers.
(4) Speech intelligibility predictions with the SRT in the stationary masker (SSN) used for model calibration showed
the best results for the ESII and ESIIsen. Predictions
for the male target material and male masker were best
for the mr-sEPSM. Prediction accuracy for the mr-sEPSM increased considerably when the predictions
were matched to other reference masker conditions, such
as AFS-SSN. Altogether, the ESII (and ESIIsen) support
the assumption that the influence of EM can be observed
in the speech detection data. The mr-sEPSM correctly
predicted the influence of amplitude modulation masking, despite problems with the model calibration.
(5) Comparison of speech intelligibility with speech detection and model data allows qualitative and quantitative
statements, regarding the three masking effects:
Qualitatively, energetic masking appears to have the
largest influence on speech intelligibility and speech
detection, followed by amplitude modulation masking
and informational masking in the current study.
Comparison of the ESII (ESIIsen) model predictions and
observed speech detection data with the observed speech
reception thresholds suggests an amount of amplitude
modulation masking of at least 3–4 dB and an additional 6–7 dB of informational masking for the speech-like
maskers in the current study.
ACKNOWLEDGMENTS

This work was supported by the Deutsche Forschungsgemeinschaft, Sonderforschungsbereich "Das aktive Gehör" (DFG SFB/TRR31). We thank the Medical Physics group for fruitful discussions, Søren Jørgensen for providing the implementation for the mr-sEPSM model and discussion about it, and Thomas Biberger for helpful discussions on parameters of the modulation filterbank of the model.
1 In pilot measurements, different numbers of co-modulated filters (8, 4, 2, and 1, leading to 4, 8, 16, and 32 different filter envelopes, respectively)
were assessed. SRTs were similar for 1 and 2 co-modulated filters and for
4 and 8 co-modulated filters. For the case of 1 and 2 co-modulated filters,
the masker signals were still similar to stationary noise (caused by a strong
effect of filter overlap), while the de-correlated envelopes across the sets
of 4 co-modulated filters were well preserved in the re-synthesized
maskers.

ANSI (1969). S3.5, Methods for the Calculation of the Articulation Index
(American National Standards Institute, New York).
ANSI (1997). S3.5-1997, Methods for the Calculation of the Speech
Intelligibility Index (American National Standards Institute, New York).
Arbogast, T. L., Mason, C. R., and Kidd, G., Jr. (2005). The effect of spatial separation on informational masking of speech in normal-hearing and
hearing-impaired listeners, J. Acoust. Soc. Am. 117(4), 2169–2180.
Barker, J., and Cooke, M. (2007). Modelling speaker intelligibility in
noise, Speech Commun. 49(5), 402–417.
Brand, T., and Kollmeier, B. (2002). Efficient adaptive procedures for
threshold and concurrent slope estimations for psychophysics and speech
intelligibility tests, J. Acoust. Soc. Am. 111(6), 2801–2810.
Bregman, A. S., Liao, C., and Levitan, R. (1990). Auditory grouping based
on fundamental frequency and formant peak frequency, Can. J. Psychol.
44(3), 400–413.
Bronkhorst, A. W. (2000). The cocktail party phenomenon: A review of
research on speech intelligibility in multiple-talker conditions, Acta
Acust. united Acust. 86(1), 117–128.
Brown, G., and Wang, D. (2005). Separation of speech by computational
auditory scene analysis, in Speech Enhancement, edited by J. Benesty, S.
Makino, and J. Chen (Springer, New York), pp. 371–402.
Brungart, D. S. (2001). Informational and energetic masking effects in the
perception of two simultaneous talkers, J. Acoust. Soc. Am. 109,
1101–1109.
Brungart, D. S., and Simpson, B. D. (2007). Cocktail party listening in a
dynamic multitalker environment, Percept. Psychophys. 69(1), 79–91.
Brungart, D. S., Simpson, B. D., Ericson, M. A., and Scott, K. R. (2001).
Informational and energetic masking effects in the perception of multiple
simultaneous talkers, J. Acoust. Soc. Am. 110, 2527–2538.
Cooke, M. (2006). A glimpsing model of speech perception in noise,
J. Acoust. Soc. Am. 119, 1562–1573.
Dau, T., Piechowiak, T., and Ewert, S. D. (2013). Modeling within- and across-channel processes in comodulation masking release, J. Acoust. Soc. Am. 133(1), 350–364.
Dubbelboer, F., and Houtgast, T. (2008). The concept of signal-to-noise ratio in the modulation domain and speech intelligibility, J. Acoust. Soc.
Am. 124, 3937–3946.
Durlach, N. I., Mason, C. R., Kidd, G., Jr., Arbogast, T. L., Colburn, H. S., and Shinn-Cunningham, B. G. (2003a). Note on informational masking, J. Acoust. Soc. Am. 113, 2984–2987.
Durlach, N. I., Mason, C. R., Shinn-Cunningham, B. G., Arbogast, T. L.,
Colburn, H. S., and Kidd, G., Jr. (2003b). Informational masking: Counteracting the effects of stimulus uncertainty by decreasing target-masker similarity, J. Acoust. Soc. Am. 114, 368–379.
EHIMA (2011). International Speech Test Signal 16 and 24 bit, European
Hearing Instrument Manufacturers Association, www.ehima.com (Last
viewed June 30, 2016).
Ewert, S. D. (2013). AFC - A modular framework for running psychoacoustic experiments and computational perception models, in Proceedings of the International Conference on Acoustics AIA-DAGA, Merano, pp. 1326–1329.
Ewert, S. D., and Dau, T. (2000). Characterizing frequency selectivity for
envelope fluctuations, J. Acoust. Soc. Am. 108, 1181–1196.
Fraunhofer IDMT, Project Group Hearing, Speech and Audio Technology.
(2013). SIP-Toolbox: Sound Quality and Speech Intelligibility Prediction
Toolbox, Fraunhofer IDMT, Oldenburg, Germany, http://www.idmt.
fraunhofer.de/de/institute/projects_products/q_t/sip-toolbox.html (Last viewed
June 30, 2016).
Hall, J. W., Haggard, M. P., and Fernandes, M. A. (1984). Detection in
noise by spectro-temporal pattern analysis, J. Acoust. Soc. Am. 76,
50–56.
Hochmuth, S., Jürgens, T., Brand, T., and Kollmeier, B. (2014). Multilingualer Cocktailparty-Einfluss von sprecher- und sprachspezifischen Faktoren auf die Sprachverständlichkeit im Störschall (Multilingual effect of speaker- and speech-specific factors on speech intelligibility in noise in cocktail party situations), in Proceedings of the 17th Jahrestagung der Deutschen Gesellschaft für Audiologie 2014, Oldenburg, Germany.
Holube, I., Fredelake, S., Vlaming, M., and Kollmeier, B. (2010).
Development and analysis of an International Speech Test Signal, Int. J.
Audiol. 49, 891–903.
Houtgast, T. (1989). Frequency selectivity in amplitude-modulation
detection, J. Acoust. Soc. Am. 85, 1676–1680.
Howard-Jones, P. A., and Rosen, S. (1993). Uncomodulated glimpsing in
checkerboard noise, J. Acoust. Soc. Am. 93(5), 2915–2922.
Jørgensen, S., and Dau, T. (2011). Predicting speech intelligibility based on the signal-to-noise envelope power ratio after modulation-frequency selective processing, J. Acoust. Soc. Am. 130, 1475–1487.
Jørgensen, S., and Dau, T. (2014). Modeling speech intelligibility based on
the signal-to-noise envelope power ratio, Doctoral dissertation, Technical
University of Denmark, Department of Electrical Engineering, Hearing
Systems.
Jørgensen, S., Ewert, S. D., and Dau, T. (2013). A multi-resolution envelope-power based model for speech intelligibility, J. Acoust. Soc. Am. 134, 436–446.
Kawahara, H., Morise, M., Takahashi, T., Nisimura, R., Irino, T., and
Banno, H. (2008). Tandem-Straight: A temporally stable power spectral
representation for periodic signals and applications to interference-free
spectrum, F0, and aperiodicity estimation, in Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing,
Las Vegas.
Kjems, U., Boldt, J. B., Pedersen, M. S., Lunner, T., and Wang, D. (2009).
Role of mask pattern in intelligibility of ideal binary-masked noisy
speech, J. Acoust. Soc. Am. 126, 1415–1426.
Kollmeier, B., Rennies, J., and Brand, T. (2011). Tools to predict binaural
speech intelligibility in complex listening environments for normal and
hearing-impaired listeners, J. Acoust. Soc. Am. 129(4), 2669.
Levitt, H. (1971). Transformed up-down methods in psychoacoustics, J. Acoust. Soc. Am. 49(2B), 467–477.
Lorenzi, C., Gilbert, G., Carn, H., Garnier, S., and Moore, B. C. (2006).
Speech perception problems of the hearing impaired reflect inability to
use temporal fine structure, Proc. Natl. Acad. Sci. 103(49), 18866–18869.
Lutfi, R. A. (1990). How much masking is informational masking?,
J. Acoust. Soc. Am. 88(6), 2607–2610.
Lutfi, R. A., Gilbertson, L., Heo, I., Chan, A., and Stamas, J. (2013). The
information-divergence hypothesis of informational masking, J. Acoust.
Soc. Am. 134(3), 2160–2170.
MacPherson, A., and Akeroyd, M. A. (2014). Variations in the slope of the
psychometric functions for speech intelligibility: A systematic survey,
Trends Hear. 18, 1–26.
Medizinische Physik, Universität Oldenburg (2016). Database of maskers
with varying amounts of spectro-temporal speech features, http://
www.uni-oldenburg.de/mediphysik-akustik/mediphysik/downloads/ (Last
viewed June 30, 2016).
Meyer, R. M., and Brand, T. (2013). Comparison of different short-term
speech intelligibility index procedures in fluctuating noise for listeners
with normal and impaired hearing, Acta Acust. united Acust. 99(3), 442–456.
Micheyl, C., Arthaud, P., Reinhart, C., and Collet, L. (2000). Informational
masking in normal-hearing and hearing-impaired listeners, Acta Oto-laryngol. 120(2), 242–246.
Piechowiak, T., Ewert, S. D., and Dau, T. (2007). Modeling comodulation
masking release using an equalization-cancellation mechanism, J. Acoust.
Soc. Am. 121, 2111–2126.
Pollack, I. (1975). Auditory informational masking, J. Acoust. Soc. Am.
57, S5.
Rhebergen, K. S., and Versfeld, N. J. (2005). A Speech Intelligibility
Index-based approach to predict the speech reception threshold for sentences in fluctuating noise for normal-hearing listeners, J. Acoust. Soc. Am.
117(4), 2181–2192.

Rhebergen, K. S., Versfeld, N. J., and Dreschler, W. A. (2006). Extended speech intelligibility index for the prediction of the speech reception threshold in fluctuating noise, J. Acoust. Soc. Am. 120, 3988–3997.
Rosen, S., Souza, P., Ekelund, C., and Majeed, A. A. (2013). Listening to
speech in a background of other talkers: Effects of talker number and noise
vocoding, J. Acoust. Soc. Am. 133(4), 2431–2443.
Shinn-Cunningham, B. G. (2008). Object-based auditory and visual
attention, Trends Cogn. Sci. 12(5), 182–186.
Steeneken, H. J. M., and Houtgast, T. (1980). A physical method for measuring speech-transmission quality, J. Acoust. Soc. Am. 67(1), 318–326.
Stone, M. A., Füllgrabe, C., and Moore, B. C. J. (2012). Notionally steady background noise acts primarily as a modulation masker of speech, J. Acoust. Soc. Am. 132(1), 317–326.
Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2011). An algorithm for intelligibility prediction of time-frequency weighted noisy
speech, IEEE Trans. Audio, Speech, Lang. Process. 19(7), 2125–2136.
Wagener, K., Brand, T., and Kollmeier, B. (1999). Entwicklung und Evaluation eines Satztests für die deutsche Sprache III: Design, Optimierung und Evaluation des Oldenburger Satztests (Development and evaluation of a sentence test for German language III: design, optimization and evaluation of the Oldenburg sentence test), Z. Audiol. 38(3), 86–95.
Wagener, K., Hochmuth, S., Ahrlich, M., Zokoll, M., and Kollmeier, B.
(2014). Der weibliche Oldenburger Satztest (The female version of the
Oldenburg sentence test), in Proceedings of the 17th Jahrestagung der
Deutschen Gesellschaft für Audiologie 2014, Oldenburg, Germany.
