
Weighting of the fine structure for the perception of monosyllabic speech stimuli in the presence of noise

Introduction:
Information in speech is redundant. For normal-hearing subjects, this means that the signal is robust to corruption, and that speech remains intelligible under adverse listening conditions, such as in high levels of background noise. In the normal auditory system, a complex sound like speech is filtered into frequency channels on the basilar membrane. The signal at a given place can be considered as a time-varying envelope superimposed on the more rapid fluctuations of a carrier (temporal fine structure, TFS) whose rate depends partly on the center frequency and bandwidth of the channel; this carrier is important for the perception of speech in noise. The relative envelope magnitude across channels conveys information about the spectral shape of the signal, and changes in the relative envelope magnitude indicate how the short-term spectrum changes over time, which plays the key role in the perception of speech in quiet. The TFS carries information both about the fundamental frequency (F0) of the sound (when it is periodic) and about its short-term spectrum.

The bandpass signal at a specific place on the basilar membrane (or the signal produced by bandpass filtering to simulate the waveform at one place on the basilar membrane) can be analyzed using the Hilbert transform to create what is called the analytic signal (Bracewell, 1986). In the mammalian auditory system, phase locking tends to break down for frequencies above 4–5 kHz (Palmer and Russell, 1986), so it is generally assumed that TFS information is not used for frequencies above that limit. The role of TFS in speech perception for frequencies below 5 kHz remains somewhat unclear.

The upper limit of phase locking in humans is not known. Although TFS is present in the stimulus on the basilar membrane up to the highest audible frequencies, this paper is especially concerned with TFS information as represented in the patterns of phase locking in the auditory nerve. This information probably weakens at high frequencies, so one way of exploring the use of TFS information is to examine changes in performance on various tasks as a function of frequency. Many studies have assessed the relative importance of TFS and envelope information for speech intelligibility in normal-hearing subjects. The challenge inherent in evaluating the individual contributions of frequency-specific (place) and temporally coded (temporal) cues to auditory perception typically arises from the difficulty of decomposing an auditory signal (such as speech) into a modulator (or envelope) and a carrier, so that either can be independently altered, reduced, or replaced. One such method involves decomposition of the signal by means of the Hilbert transform; this method will be referred to as the Hilbert approach. Although it has several variants, it can generally be described as follows. A priori, it is assumed that a broadband signal, S(t), can be described as the sum of N modulated bands, S_n(t), such that

S(t) = Σ_{n=1}^{N} S_n(t) = Σ_{n=1}^{N} m_n(t) c_n(t),   (1)

where m_n(t) and c_n(t) are, respectively, the modulator and the carrier in the nth band. In order to reduce possible confusion, the original modulator and carrier will always be referred to as m(t) and c(t), respectively. The computed envelope and phase (or temporal fine structure, TFS), defined later on, will always be referred to as a(t) and cos φ(t). From Eq. (1), it is clear that the modulator and the carrier could easily be manipulated separately. However, for an observed signal such as speech, m_n(t) and c_n(t) are unknown, and therefore must be determined. By introducing Z_n(t), the analytic signal defined by

Z_n(t) = S_n(t) + jH[S_n(t)],   (2)

where j = √(−1) and H[·] is the Hilbert transform, one can determine the Hilbert instantaneous amplitude, a_n(t), and the Hilbert instantaneous phase, φ_n(t), respectively given by

a_n(t) = |Z_n(t)|   and   φ_n(t) = arg[Z_n(t)],

so that the original signal can be rewritten as

S_n(t) = a_n(t) cos[φ_n(t)].

It is commonly assumed that m_n(t) ≈ a_n(t) and c_n(t) ≈ cos φ_n(t), and thus one can manipulate the envelope and/or the fine structure independently and synthesize a modified version of the original signal. Several recent studies, however, suggest that the Hilbert approach may be inappropriate for decomposing complex signals such as speech. It should be noted that this restriction is limited to those situations where the envelope and/or the fine structure are manipulated (e.g., filtered) prior to being added back together to synthesize a new signal. Ghitza (2001) first suggested that part of the original envelope information can be recovered from the Hilbert fine structure at the output of the auditory filters. The intelligibility of TFS-speech may therefore be influenced by reconstructed envelope (E) cues: such cues contribute to the intelligibility of TFS-speech, even though envelope cues alone are not sufficient to give good intelligibility. The fact that learning is required to achieve high intelligibility with TFS-speech may indicate that the auditory system normally uses TFS cues in conjunction with envelope cues; when envelope cues are minimal, TFS information may be difficult to interpret. Alternatively, the learning may reflect the fact that TFS cues are distorted in TFS-speech (relative to unprocessed speech), and some training may be required to overcome the effects of the distortion.

Several behavioral (Zeng et al., 2004; Gilbert and Lorenzi, 2006) and neurophysiological (Heinz and Swaminathan, 2009) studies have since confirmed that envelopes derived from the TFS can produce good speech intelligibility. In the behavioral studies, normal-hearing (NH) listeners were presented with the TFS of speech stimuli or with a series of noise or tone carriers amplitude-modulated by the recovered envelopes. In the latter case, a technique similar to vocoder processing (Shannon et al., 1995) was used, and the recovered envelopes corresponded to the outputs of a bank of gammachirp auditory filters (Irino and Patterson, 1997) in response to the original speech fine structure.

Vocoder processing has been used to remove TFS information from speech, allowing speech intelligibility based on envelope and spectral cues to be measured (Dudley, 1939; Van Tasell et al., 1987; Shannon et al., 1995). A speech signal is filtered into a number of channels (N), and the envelope of each channel signal is used to modulate a carrier, typically a noise (for a noise vocoder) or a sine wave with a frequency equal to the channel center frequency (for a tone vocoder). The modulated signal for each channel is filtered to restrict it to the original channel bandwidth, and the modulated signals from all channels are then combined. For a single talker, provided that N is sufficiently large, the resulting signal is highly intelligible to both normal-hearing and hearing-impaired subjects (Shannon et al., 1995; Turner et al., 1995; Baskent, 2006; Lorenzi et al., 2006b). However, if the original signal includes both a target talker and a background sound, intelligibility is greatly reduced, even for normal-hearing subjects (Dorman et al., 1998; Fu et al., 1998; Qin and Oxenham, 2003; Stone and Moore, 2003), leading to the suggestion that TFS information may be important for segregating a talker and background into separate auditory streams (Friesen et al., 2001).
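As an illustration, the sketch below carries out the decomposition of Eqs. (1) and (2) for a single band. It is a minimal example in Python/NumPy/SciPy (the present study uses MATLAB for the actual processing); the band edges and the test signal are arbitrary placeholders, not the study's analysis bands.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 44100                          # sampling rate (Hz), as in the recordings
t = np.arange(0, 0.5, 1 / fs)
s = np.random.randn(t.size)         # placeholder for one speech token

# Band-pass filter to obtain one band signal S_n(t) (edges are illustrative).
sos = butter(4, [500, 1000], btype="bandpass", fs=fs, output="sos")
s_n = sosfiltfilt(sos, s)

# Analytic signal Z_n(t) = S_n(t) + jH[S_n(t)], Eq. (2).
z_n = hilbert(s_n)
a_n = np.abs(z_n)                   # Hilbert envelope a_n(t)
phi_n = np.angle(z_n)               # instantaneous phase φ_n(t)
tfs_n = np.cos(phi_n)               # fine structure cos φ_n(t)

# The band signal is recovered (to numerical precision) as a_n(t)·cos φ_n(t).
assert np.allclose(s_n, a_n * tfs_n, atol=1e-8)

The assertion at the end simply verifies the identity S_n(t) = a_n(t) cos[φ_n(t)]; manipulating a_n(t) or cos φ_n(t) before recombining them is what creates the E and TFS conditions described later.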

Zeng et al. (2004) found up to 40% correct performance for sentences, and Gilbert and Lorenzi (2006) found up to 60% correct performance for consonants. Gilbert and Lorenzi (2006) also showed that performance decreases with an increasing number of analysis bands. The authors attributed the effect of the number of bands to the ratio between the bandwidth of the analysis filters and that of the auditory filters, and concluded that consonant identification is essentially abolished when the bandwidth of the analysis filters is less than or equal to four times the bandwidth of normal auditory filters.

India is multicultural and multilingual, and most of its people are bilingual; Indian English differs from the English of other countries owing to this multilingualism. There is therefore a dearth of information on the extent to which the low frequencies contribute to speech intelligibility in Indian English, and it is not yet explored how much low-frequency hearing preservation is needed for the perception of speech in noise. If this information were known, it could serve as an outcome measure prior to implantation. Hence, the present study addresses the weighting of the fine structure for the perception of monosyllabic speech stimuli in noise, i.e., whether high-frequency or low-frequency fine-structure information is required for speech perception in noise.

Methodology:
Subjects: Age range: 18 to 25 yrs. (young adults)

Control group: Normal-hearing individuals (normal hearing as per ANSI criteria). All should have normal hearing, defined as audiometric thresholds of 20 dB HL (hearing level) or better at octave frequencies between 250 and 8000 Hz, normal immittance measures, and histories consistent with normal hearing.

Experimental group: Individuals with moderately severe sensorineural hearing loss (post-lingual deafness). The hearing-impaired subjects were selected to have flat moderate hearing losses, and they were divided into two groups: young (n = 7; mean age = 24; range: 18–25) and elderly (n = 7; mean age = 68; range: 63–72), because there is some evidence that the ability to use TFS decreases with increasing age. Air-conduction, bone-conduction, and impedance audiometry for the hearing-impaired subjects were consistent with sensorineural impairment. The origin of hearing loss was unknown for all elderly subjects and was either congenital or hereditary for the young ones. All impaired subjects had been fitted with a hearing aid on the tested ear for at least 9 years.

Number of subjects: As many as possible within the time constraints of the data collection. All subjects were fully informed about the goal of the present study and provided written consent before their participation.

Stimuli to be used: To overcome bias due to differences in the semantic knowledge of the subjects, and to check the efficiency of the technology, the speech material in the study consisted of 50 ISHA PB words. All the PB words will be produced by a female and a male speaker and recorded, using a sound level meter (SLM) and Adobe Audition at a 44,100-Hz sampling rate, in a sound-proof booth onto a laptop.

Instruments to be used: MATLAB 2010a for signal processing; GSI 61 (dual-channel audiometer) for presenting stimuli.

Stimulus synthesis, Phase I: Speech signals will be digitized (16-bit resolution) at a 44.1-kHz sampling frequency; they will then be band-pass filtered using Butterworth filters (72 dB/oct rolloff) based on Greenwood frequency–position function critical bands spanning the range 80–8,020 Hz. The bands will be less than two times as wide as the normal auditory filters, and probably comparable to the widths of the auditory filters of the impaired subjects, thus ensuring that recovered E cues would be minimal for both groups of subjects. The use of these analysis bands also ensures that the amount of spectral information provided by the E stimuli is similar for the normal-hearing and hearing-impaired subjects (ref: Gilbert G, Lorenzi C (2006) J Acoust Soc Am 119:2438–2444).

These band-pass filtered signals were then processed in three ways. In the first (referred to as intact), the signals were summed over all frequency bands; these signals contained both TFS and E information. In the second (referred to as E), the envelope was extracted in each frequency band using the Hilbert transform followed by low-pass filtering with a Butterworth filter (cutoff frequency = 64 Hz, 72 dB/oct rolloff). The filtered envelope was used to amplitude-modulate a sine wave with a frequency equal to the centre frequency of the band and with random starting phase; the 16 amplitude-modulated sine waves were summed over all frequency bands. These stimuli contained only E information. In the third (referred to as TFS), the Hilbert transform was used to decompose the signal in each frequency band into its E and TFS components.
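The sketch below illustrates the E-condition synthesis just described, again in Python rather than the MATLAB used for the actual processing. The log-spaced band edges between 80 and 8,020 Hz are only a stand-in for the exact Greenwood-based critical bands, and the filter orders only approximate the stated 72 dB/oct slopes.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def synthesize_E(speech, fs=44100, n_bands=16, f_lo=80.0, f_hi=8020.0):
    """Envelope-only (E) synthesis: 16 bands, 64-Hz-smoothed Hilbert
    envelopes modulating random-phase sines at the band centres."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)   # assumed log-spaced edges
    lp = butter(12, 64, btype="lowpass", fs=fs, output="sos")  # envelope smoother
    t = np.arange(speech.size) / fs
    out = np.zeros_like(speech, dtype=float)
    for lo, hi in zip(edges[:-1], edges[1:]):
        bp = butter(6, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(bp, speech)
        env = sosfiltfilt(lp, np.abs(hilbert(band)))  # E: low-passed Hilbert envelope
        # For the TFS condition one would instead keep np.cos(np.angle(hilbert(band))).
        fc = np.sqrt(lo * hi)                         # geometric centre frequency
        out += env * np.sin(2 * np.pi * fc * t + 2 * np.pi * np.random.rand())
    return out

Summing the unprocessed band signals instead of the modulated sines yields the intact condition, so all three stimulus types come from the same filterbank.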

Procedure: All stimuli will be delivered monaurally to the right ear via TDH-39 headphones. The stimuli will be presented to the normal-hearing subjects at the level used for SRS testing, and to the hearing-impaired subjects at a level ensuring that the stimuli are audible and comfortably loud. Condition I: Each individual will be presented with PB words both with and without noise. Condition II: Each individual will be presented with the processed signal (without the TFS information) of at least one band of frequencies at a time, with bands eliminated in ascending (low to high) order.

Response: Oral repetition of the word will be expected. Scoring: A score of 0 will be allotted for every wrong repetition and 1 for every correct repetition.

ROL (Review of Literature):

Bandwidth: Greenwood (1961b) hypothesized that the critical bandwidth, in Hz, might follow an exponential function of distance, x (in any physical units or normalized distance), along the cochlear partition, so that each critical band corresponds to a constant distance on the basilar membrane; this is consistent with the critical-band estimates of Fletcher (1940, 1953) and Zwicker et al. (1957). The frequency–position function obtained from this hypothesis is:

F = A(10^(ax) − k),

where F is in Hz and x is in mm, and where suitable constants (for man) are A = 165 and a = 0.06. The latter is an empirical constant arising in the critical-band function, but it is also found to agree closely with the logarithmic slope of Békésy's volume-compliance gradient for the human cochlear partition. The integration constant k is left at its original value of 1, but may sometimes be better replaced by a number from about 0.8 to 0.9 to set a lower frequency limit dictated by convention or by the best fit to data. Although the value k = 0.88 would yield the conventional lower frequency limit of 20 Hz for man, Greenwood continued to use 1.0 for man and most other species, excepting the cat, since Liberman (1982) found that k = 0.8 best adjusts this function to his low-frequency data points in the cat.

No. of bands: Traditionally, the spectral magnitudes have been regarded as of primary importance for perception, although under some conditions the phases of the components play an important role (Moore, 2002). The bandpass signal at a specific place on the basilar membrane (or the signal produced by bandpass filtering to simulate the waveform at one place on the basilar membrane) can be analyzed using the Hilbert transform to create what is called the analytic signal (Bracewell, 1986). The Hilbert transform can be used to decompose the time signal into its envelope (E; the relatively slow variations in amplitude over time) and temporal fine structure (TFS; the rapid oscillations with a rate close to the center frequency of the band). Each filter was chosen to have a bandwidth of 1 ERB_N, where ERB_N stands for the equivalent rectangular bandwidth of the auditory filter as determined using young, normally hearing listeners at moderate sound levels (Glasberg and Moore, 1990; Moore, 2003); the suffix N denotes normal hearing. Traditionally, the envelope has been regarded as the most important carrier of information, at least for speech signals. Both E and TFS information are represented in the timing of neural discharges, although TFS information depends on phase locking to individual cycles of the stimulus waveform (Young and Sachs, 1979). In most mammals, phase locking weakens for frequencies above 4–5 kHz, although some useful phase-locking information may persist for frequencies up to at least 10 kHz (Heinz et al., 2001).
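For reference, the two formulas quoted above can be computed as in the small sketch below, with the Greenwood constants as given (A = 165, a = 0.06, k = 1, x in mm) and the ERB_N equation of Glasberg and Moore (1990), ERB_N = 24.7(4.37F/1000 + 1) with F in Hz.

def greenwood_frequency(x_mm, A=165.0, a=0.06, k=1.0):
    """Greenwood frequency-position function: characteristic frequency (Hz)
    at distance x_mm (mm) from the apex of the human cochlea."""
    return A * (10 ** (a * x_mm) - k)

def erb_n(f_hz):
    """Equivalent rectangular bandwidth (Hz) of the normal auditory filter
    at centre frequency f_hz (Glasberg and Moore, 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# Example: frequency 20 mm from the apex, and the ERB_N at 1 kHz.
print(greenwood_frequency(20.0))   # ~2450 Hz
print(erb_n(1000.0))               # ~132.6 Hz

Spacing the analysis-band edges at equal increments of x in greenwood_frequency is one way to obtain the Greenwood-based critical bands used in the stimulus synthesis above.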
